How to tell whether an AI agent works: evaluation, not impression

Judging an agent "by feel" confuses a good demo with a working system. A fixed set of cases and the right metrics give you proof instead of an impression.

A good demo is not proof it works — what counts is a repeatable result on a fixed set.
Measure accuracy, hallucinations, cost, and latency together; a single metric lies.
An automated check catches regressions on every version; a human judges what the automation can't measure.

Why a "wow demo" isn't proof

A good demo shows the agent can answer well: once, on a chosen question, under favorable conditions. That isn't the same as working repeatably across the full range of questions real people will actually ask.

A demo cherry-picks the cases where the model does well. Production doesn't. After launch you get questions that are unusual, badly phrased, on the edge of the agent's knowledge — and those are the ones that decide whether the system is useful. Judging "by feel" after a few pretty answers is therefore systematically too optimistic.

The rule is simple: proof, not impression. The proof is not one good answer; it's the result on a fixed, repeatable set of cases.

How to build an evaluation set

The starting point of any serious assessment is an evaluation set: a collection of real queries paired with an expected result or a correctness criterion. A good set has three qualities.

Real. The questions come from production, support, or your first users — not from the team's imagination.
Representative. It includes typical cases, edge cases, and cases the agent should refuse.
Fixed. You freeze it so that successive versions are compared on the same material.

To start, 30–50 cases are enough if they are well chosen. You can reuse some of them as few-shot examples in the prompt itself, but then exclude them from the test set — the model has already seen those examples in the prompt, so its score on them inflates the result. Otherwise you're measuring fit to known cases, not the ability to generalize.
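
A minimal sketch of what a frozen set and the few-shot exclusion could look like. The `EvalCase` fields, the `kind` labels, and the sample queries are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a case cannot be edited after it is created
class EvalCase:
    case_id: str
    query: str
    expected: str  # expected result or a correctness criterion
    kind: str      # hypothetical labels: "typical", "edge", "refusal"


def split_holdout(cases, few_shot_ids):
    """Return (few_shot, test) so that no case appears in both lists."""
    few_shot = [c for c in cases if c.case_id in few_shot_ids]
    test = [c for c in cases if c.case_id not in few_shot_ids]
    return few_shot, test


# Made-up sample cases, for illustration only.
cases = [
    EvalCase("q1", "How do I reset my password?", "Points to the reset link", "typical"),
    EvalCase("q2", "pasword reset not working??", "Points to the reset link", "edge"),
    EvalCase("q3", "What is my colleague's salary?", "Refuses to answer", "refusal"),
]

# q1 goes into the prompt as a few-shot example, so it leaves the test set.
few_shot, test = split_holdout(cases, few_shot_ids={"q1"})
```

Keeping the split in code rather than in someone's memory makes it harder for a prompt example to drift back into the scored set.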

What to measure: four dimensions

Any single number can mislead: an agent can be highly accurate and still so slow or expensive that it's unusable. That's why evaluation looks at four dimensions at once.

Metric	What it measures	How to measure it
Accuracy	Whether the answer is correct and complete	Comparison with the expected result, by human judgment or a judge model
Hallucinations	How often the agent invents facts or sources	Checking that every claim is grounded in the data
Cost	What one answer costs	Tokens or calls per query, multiplied by the unit price
Latency	How long the user waits	Time from query to full answer (median and the p95 tail)
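
As a rough sketch, the four dimensions can be summarized from per-case results in one pass. The record fields (`correct`, `hallucinated`, `tokens`, `latency_s`), the sample values, and the price are assumptions for illustration; the p95 here uses the nearest-rank method, which is one of several common definitions.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results, price_per_1k_tokens):
    """Aggregate accuracy, hallucination rate, cost, and latency over one run of the set."""
    n = len(results)
    latencies = [r["latency_s"] for r in results]
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "hallucination_rate": sum(r["hallucinated"] for r in results) / n,
        "cost_per_query": sum(r["tokens"] for r in results) / n / 1000 * price_per_1k_tokens,
        "latency_median_s": median(latencies),
        "latency_p95_s": percentile(latencies, 95),
    }


# Made-up results for four cases; the price is a placeholder, not a real price list.
results = [
    {"correct": True, "hallucinated": False, "tokens": 1000, "latency_s": 1.0},
    {"correct": True, "hallucinated": False, "tokens": 2000, "latency_s": 2.0},
    {"correct": False, "hallucinated": True, "tokens": 1000, "latency_s": 3.0},
    {"correct": True, "hallucinated": False, "tokens": 4000, "latency_s": 10.0},
]
summary = summarize(results, price_per_1k_tokens=0.5)
```

In this toy run the median latency is 2.5 seconds while the p95 is 10 seconds, which is why the table asks for the tail as well as the median.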

It's worth tracking hallucinations separately, because they erode trust in the system fastest. One confident-sounding, made-up answer costs more than ten honest "I don't know"s. So the evaluation set should include questions whose correct answer is, in fact, a refusal.

The human in the loop

Not everything can be checked by a script. Tone, sense, tact, substantive accuracy in the nuances — that's for a human to judge. Human-in-the-loop isn't an emergency fallback; it's a standing part of the assessment: on every version, someone reviews a sample of answers and renders a verdict against the same criteria.

For that judgment to be proof rather than opinion, it has to be standardized:

The same set of criteria for every answer (e.g., accuracy 0–2, hallucination yes/no).
The same reviewers, or a clear rubric when there are several.
A record of the results, so they can be compared across versions.
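
One possible shape for such a record, sketched under the assumption that each verdict is stored per version, case, and reviewer. The `Review` class and its field names are hypothetical; the 0–2 accuracy scale and the yes/no hallucination flag follow the example criteria above.

```python
from dataclasses import dataclass


@dataclass
class Review:
    version: str
    case_id: str
    reviewer: str
    accuracy: int        # rubric score: 0, 1, or 2
    hallucination: bool  # True if the answer invents a fact or source

    def __post_init__(self):
        # Reject scores outside the rubric so every record stays comparable.
        if self.accuracy not in (0, 1, 2):
            raise ValueError("accuracy must be 0, 1, or 2")


def mean_accuracy_by_version(reviews):
    """Average the rubric score per agent version."""
    totals = {}
    for r in reviews:
        score_sum, count = totals.get(r.version, (0, 0))
        totals[r.version] = (score_sum + r.accuracy, count + 1)
    return {v: s / n for v, (s, n) in totals.items()}


# Made-up verdicts for two versions scored on the same two cases.
reviews = [
    Review("v1", "q2", "anna", accuracy=2, hallucination=False),
    Review("v1", "q3", "anna", accuracy=1, hallucination=True),
    Review("v2", "q2", "anna", accuracy=2, hallucination=False),
    Review("v2", "q3", "anna", accuracy=2, hallucination=False),
]
by_version = mean_accuracy_by_version(reviews)
```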

The human in the loop plays a second role too — in production, they approve high-stakes actions before the agent carries them out. That's a different design decision from evaluation, but it rests on the same principle: where an error is costly, you need a control point.

When to use automation, when to use a human

Both paths are needed, because they answer different questions.

Automation checks for regressions. After every change to the prompt, model, or data, you run the whole set and see whether anything broke. It's cheap, fast, and repeatable, so it can run on every deployment.
A human judges quality where sense and context matter. Human review is slower and more expensive, so it runs on a sample, not the whole set.

A practical arrangement: automation on every version, a human on a representative sample for larger changes and before every production release. Increasingly, the automation role is filled by a judge model — a second model that scores answers against a rubric. That scales the assessment, but the judge itself needs calibrating against human verdicts, or it just carries its own errors forward.
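
Calibrating the judge can be as simple as scoring the same sample with both the judge and a human and measuring how often they agree. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance; choosing kappa is an assumption of this example, and the labels are made up.

```python
def agreement(human, judge):
    """Share of cases where the judge model and the human gave the same label."""
    if len(human) != len(judge):
        raise ValueError("both lists must cover the same cases")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement(human, judge)
    labels = set(human) | set(judge)
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts on the same eight answers.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

raw = agreement(human, judge)
kappa = cohens_kappa(human, judge)
```

Here the raw agreement is 0.75 but kappa is roughly 0.47, a reminder that a judge can look aligned partly by chance when one label dominates.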

Operator's rule: if you can't compare two versions of the agent with a number from the same set, you don't yet know whether the new version is better — you only have an impression.
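
A sketch of that comparison, assuming each version's run is stored as a mapping from case ID to pass/fail (a hypothetical format). It refuses to compare runs made on different sets and lists the individual regressions next to the headline numbers.

```python
def compare_versions(old, new):
    """Compare two runs; old and new map case_id -> True if the case passed."""
    if old.keys() != new.keys():
        raise ValueError("versions were scored on different case sets")
    return {
        "old_pass_rate": sum(old.values()) / len(old),
        "new_pass_rate": sum(new.values()) / len(new),
        # Cases that passed before and fail now.
        "regressions": sorted(c for c in old if old[c] and not new[c]),
        # Cases that failed before and pass now.
        "fixes": sorted(c for c in old if not old[c] and new[c]),
    }


# Made-up results for two versions on the same four cases.
old = {"q1": True, "q2": True, "q3": False, "q4": True}
new = {"q1": True, "q2": False, "q3": True, "q4": True}
report = compare_versions(old, new)
```

In this made-up run both versions pass 75% of cases, yet `q2` regressed while `q3` was fixed, so it can be worth reading the per-case lists alongside the aggregate.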

Where to start

The cheapest first step isn't a new tool; it's freezing 30–50 real questions and scoring your current agent on them once. That alone shows where it makes things up, where it's slow, and what it costs. Everything else — automation, a judge model, regular reviews — you add only once that set starts returning concrete numbers. Proof is built incrementally, but it starts with a single fixed list of cases.

Frequently asked questions

How many cases do I need to start?

Start with 30–50 real queries from production or from users, including the hard and unusual ones. A small set of genuine cases beats hundreds of invented ones.

Is automated evaluation enough?

For regressions and version comparisons, yes. But substantive accuracy, tone, and whether the answer makes sense are still best judged by a human on a sample.

How often should I rerun the evaluation?

On every change to the prompt, model, data, or configuration. It's the only way to catch a regression before a user does.