How to tell whether an AI agent works: evaluation, not impression

Judging an agent "by feel" confuses a good demo with a working system. A fixed set of cases and the right metrics give you proof instead of an impression.

Why a "wow demo" isn't proof

A good demo shows that the agent can answer well: once, on a chosen question, under favorable conditions. That isn't the same as "works", which means repeatably, across the full range of questions real people will actually ask.

A demo cherry-picks the cases where the model does well. Production doesn't. After launch you get questions that are unusual, badly phrased, on the edge of the agent's knowledge — and those are the ones that decide whether the system is useful. Judging "by feel" after a few pretty answers is therefore systematically too optimistic.

The rule is simple: proof, not impression. The proof is not one good answer; it's the result on a fixed, repeatable set of cases.

How to build an evaluation set

The starting point of any serious assessment is an evaluation set: a collection of real queries paired with an expected result or a correctness criterion. A good set has three qualities.

To start, 30–50 cases are enough if they are well chosen. You can reuse some of them as few-shot examples in the prompt itself, but then exclude them from the test set — the model has already seen those examples in the prompt, so its score on them inflates the result. Otherwise you're measuring fit to known cases, not the ability to generalize.
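The separation of few-shot examples from the test set can be sketched in Python. This is a minimal illustration, not a prescribed implementation: the `EvalCase` record, the `split_few_shot` helper, and the three sample cases are all hypothetical names and data invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One evaluation case: a real query plus what counts as correct."""
    case_id: str
    query: str
    expected: str  # expected result or a correctness criterion


def split_few_shot(cases, few_shot_ids):
    """Return (few_shot, test) so that no case appears in both lists."""
    few_shot_ids = set(few_shot_ids)
    missing = few_shot_ids - {c.case_id for c in cases}
    if missing:
        raise ValueError(f"unknown few-shot ids: {sorted(missing)}")
    few_shot = [c for c in cases if c.case_id in few_shot_ids]
    test = [c for c in cases if c.case_id not in few_shot_ids]
    return few_shot, test


# Illustrative cases only; a real set would come from production queries.
cases = [
    EvalCase("q1", "How do I reset my password?", "Points to the reset flow"),
    EvalCase("q2", "What is the refund window?", "States the documented window"),
    EvalCase("q3", "Who won the 2031 cup?", "Refuses: outside the knowledge base"),
]
few_shot, test_set = split_few_shot(cases, few_shot_ids=["q1"])
print([c.case_id for c in test_set])  # ['q2', 'q3']
```

Splitting by id rather than by position keeps the split stable when cases are added or reordered later.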

What to measure: four dimensions

A single number always lies. An agent can be highly accurate and at the same time so slow or expensive that it's unusable. That's why evaluation looks at four dimensions at once.

| Metric | What it measures | How |
| --- | --- | --- |
| Accuracy | Whether the answer is correct and complete | Comparison with the expected result; human judgment or a judge model |
| Hallucinations | How often the agent invents facts or sources | Checking that every claim is grounded in the data |
| Cost | What one answer costs | Tokens or calls per query times the price list |
| Latency | How long the user waits | Time from query to full answer (median and the p95 tail) |
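As a rough sketch, the four dimensions can be aggregated from per-query results like this. The `RunResult` fields, the `summarize` helper, the nearest-rank definition of p95, and the price of 0.5 per 1,000 tokens are assumptions made for illustration; the `correct` and `hallucinated` flags are expected to come from whatever human or judge-model check you use.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class RunResult:
    correct: bool        # answer judged correct and complete
    hallucinated: bool   # at least one claim not grounded in the data
    tokens: int          # tokens used for this query
    latency_s: float     # seconds from query to full answer


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(results, price_per_1k_tokens):
    """Report all four dimensions side by side instead of one number."""
    n = len(results)
    latencies = [r.latency_s for r in results]
    mean_tokens = sum(r.tokens for r in results) / n
    return {
        "accuracy": sum(r.correct for r in results) / n,
        "hallucination_rate": sum(r.hallucinated for r in results) / n,
        "cost_per_answer": mean_tokens / 1000 * price_per_1k_tokens,
        "latency_median_s": statistics.median(latencies),
        "latency_p95_s": p95(latencies),
    }


# Illustrative results, not measurements.
results = [
    RunResult(True, False, 1000, 1.0),
    RunResult(True, False, 2000, 2.0),
    RunResult(False, True, 1000, 3.0),
    RunResult(True, False, 4000, 10.0),
]
summary = summarize(results, price_per_1k_tokens=0.5)
print(summary)
```

In this made-up sample the median latency is 2.5 seconds while p95 is 10 seconds, which is why the table asks for both: the median alone would hide the slow tail.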

It's worth tracking hallucinations separately, because they erode trust in the system fastest. One confident-sounding, made-up answer costs more than ten honest "I don't know"s. So the evaluation set should include questions whose correct answer is, in fact, a refusal.
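One way to score cases whose correct answer is a refusal is to label each outcome separately, so that a made-up answer is never averaged together with an honest "I don't know". The sketch below is an assumption-laden illustration: the `classify` function and its labels are hypothetical, and deciding whether the agent actually refused is left to a human or a judge model.

```python
from collections import Counter


def classify(should_refuse, did_refuse, answer_correct=False):
    """Label one outcome for a case that may require a refusal."""
    if should_refuse:
        return "correct_refusal" if did_refuse else "answered_when_should_refuse"
    if did_refuse:
        return "unnecessary_refusal"
    return "correct_answer" if answer_correct else "wrong_answer"


# (should_refuse, did_refuse, answer_correct) per case; illustrative values.
rows = [
    (True, True, False),    # agent refused where it should
    (True, False, False),   # agent answered where it should have refused
    (False, False, True),   # normal correct answer
    (False, True, False),   # agent refused a question it could answer
]
outcomes = Counter(classify(*row) for row in rows)
print(outcomes["answered_when_should_refuse"])  # 1
```

Tracking `answered_when_should_refuse` as its own count makes the most trust-damaging failure visible across versions.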

The human in the loop

Not everything can be checked by a script. Tone, sense, tact, substantive accuracy in the nuances — that's for a human to judge. Human-in-the-loop isn't an emergency fallback; it's a standing part of the assessment: on every version, someone reviews a sample of answers and renders a verdict against the same criteria.

For that judgment to be proof rather than opinion, it has to be standardized:

  1. The same set of criteria for every answer (e.g., accuracy 0–2, hallucination yes/no).
  2. The same reviewers, or a clear rubric when there are several.
  3. A record of the results, so they can be compared across versions.
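The three requirements above can be sketched as a fixed review record plus a per-version summary. The `Review` fields and the `summarize_by_version` helper are hypothetical; the 0–2 accuracy scale and the yes/no hallucination flag follow the example criteria in the list.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Review:
    version: str         # agent version under review
    case_id: str
    reviewer: str
    accuracy: int        # 0, 1 or 2, same scale for every answer
    hallucination: bool  # yes/no

    def __post_init__(self):
        if self.accuracy not in (0, 1, 2):
            raise ValueError("accuracy must be 0, 1 or 2")


def summarize_by_version(reviews):
    """Group recorded verdicts by version so versions can be compared."""
    by_version = {}
    for r in reviews:
        by_version.setdefault(r.version, []).append(r)
    return {
        version: {
            "mean_accuracy": mean(r.accuracy for r in rs),
            "hallucination_rate": sum(r.hallucination for r in rs) / len(rs),
        }
        for version, rs in by_version.items()
    }


# Illustrative verdicts only.
reviews = [
    Review("v1", "q1", "anna", 1, True),
    Review("v1", "q2", "anna", 2, False),
    Review("v2", "q1", "anna", 2, False),
    Review("v2", "q2", "anna", 2, False),
]
report = summarize_by_version(reviews)
print(report)
```

Rejecting out-of-range scores at record time keeps a stray "3" or "-1" from quietly skewing a later comparison.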

The human in the loop plays a second role too — in production, they approve high-stakes actions before the agent carries them out. That's a different design decision from evaluation, but it rests on the same principle: where an error is costly, you need a control point.
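A control point of this kind might look like the following sketch. It assumes each action carries a `high_stakes` flag and that approval is a callable returning true or false; `Action`, `execute_with_gate`, and the stand-in `deny_all` approver are hypothetical names, and a real approval step would route to a person rather than a function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    name: str
    high_stakes: bool


def execute_with_gate(action: Action,
                      run: Callable[[Action], str],
                      approve: Callable[[Action], bool]) -> str:
    """Run low-stakes actions directly; require approval for high-stakes ones."""
    if action.high_stakes and not approve(action):
        return f"blocked: {action.name}"
    return run(action)


def run(action: Action) -> str:
    return f"done: {action.name}"


def deny_all(action: Action) -> bool:
    # Stand-in for a real human approval step.
    return False


print(execute_with_gate(Action("send_refund", True), run, deny_all))    # blocked: send_refund
print(execute_with_gate(Action("lookup_order", False), run, deny_all))  # done: lookup_order
```

The gate sits in front of execution, so a denied action never runs at all, rather than being rolled back afterwards.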

When to use automation, when to use a human

Both paths are needed, because they answer different questions.

A practical arrangement: automation on every version, a human on a representative sample for larger changes and before every production release. Increasingly, the automation role is filled by a judge model — a second model that scores answers against a rubric. That scales the assessment, but the judge itself needs calibrating against human verdicts, or it just carries its own errors forward.
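Calibrating a judge model against human verdicts can start with a simple agreement check on a sample that both have scored. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. Choosing kappa is an assumption of this example, not something the guide prescribes, and the labels are invented.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Raw agreement and Cohen's kappa between two lists of labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used one identical label throughout
    return observed, (observed - expected) / (1 - expected)


# Illustrative verdicts on the same four answers.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
observed, kappa = agreement_and_kappa(human, judge)
print(observed, kappa)  # 0.75 0.5
```

Here the judge agrees with the human on three of four answers, yet kappa is only 0.5, because half of that agreement would be expected by chance given how often each rater says "fail".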

Operator's rule: if you can't compare two versions of the agent with a number from the same set, you don't yet know whether the new version is better — you only have an impression.
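A minimal sketch of that rule, assuming each version has a per-case score of 0 or 1 keyed by case id. The `compare_versions` helper and the sample scores are hypothetical. The function refuses to compare versions that were scored on different sets.

```python
def compare_versions(old_scores, new_scores):
    """Compare per-case scores (case_id -> 0 or 1) from two agent versions."""
    if not old_scores:
        raise ValueError("no scores to compare")
    if old_scores.keys() != new_scores.keys():
        raise ValueError("versions were scored on different case sets")
    n = len(old_scores)
    fixed = sorted(c for c in old_scores if new_scores[c] > old_scores[c])
    regressed = sorted(c for c in old_scores if new_scores[c] < old_scores[c])
    return {
        "old_accuracy": sum(old_scores.values()) / n,
        "new_accuracy": sum(new_scores.values()) / n,
        "fixed": fixed,
        "regressed": regressed,
    }


# Illustrative scores only.
old = {"q1": 1, "q2": 0, "q3": 1, "q4": 1}
new = {"q1": 1, "q2": 1, "q3": 0, "q4": 1}
diff = compare_versions(old, new)
print(diff)
```

In this invented example both versions score 0.75, but the per-case lists show that one case was fixed and another regressed, which an aggregate alone would not reveal.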

Where to start

The cheapest first step isn't a new tool; it's freezing 30–50 real questions and scoring your current agent on them once. That alone shows where it makes things up, where it's slow, and what it costs. Everything else — automation, a judge model, regular reviews — you add only once that set starts returning concrete numbers. Proof is built incrementally, but it starts with a single fixed list of cases.
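One possible way to make "freezing" checkable is to store a fingerprint of the case list alongside every score, so a later run can confirm it used the same set. The sketch below assumes cases are plain dictionaries with an `id` key; the `fingerprint` helper and the sample cases are hypothetical.

```python
import hashlib
import json


def fingerprint(cases):
    """Stable SHA-256 over the cases, independent of list order."""
    canonical = json.dumps(
        sorted(cases, key=lambda c: c["id"]),
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative cases only.
cases = [
    {"id": "q1", "query": "How do I reset my password?", "expected": "Points to the reset flow"},
    {"id": "q2", "query": "What is the refund window?", "expected": "States the documented window"},
]
frozen = fingerprint(cases)

# Reordering does not change the fingerprint; editing a case does.
assert fingerprint(list(reversed(cases))) == frozen
edited = [dict(cases[0], expected="Something else"), cases[1]]
assert fingerprint(edited) != frozen
```

If two scores carry different fingerprints, they were measured on different sets and should not be compared directly.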


Frequently asked questions

How many cases do I need to start?
Start with 30–50 real queries from production or from users, including the hard and unusual ones. A small set of genuine cases beats hundreds of invented ones.
Is automated evaluation enough?
For regressions and version comparisons, yes. But substantive accuracy, tone, and whether the answer makes sense are still best judged by a human on a sample.
How often should I rerun the evaluation?
On every change to the prompt, model, data, or configuration. It's the only way to catch a regression before a user does.