What is Model evaluation?


Model evaluation

Also known as: model assessment, evaluation, AI evaluation

Model evaluation is the systematic measurement of answer quality on a fixed set of cases and metrics. It lets you compare versions and catch regressions instead of judging by gut feel.

It rests on a fixed set of test cases and clear metrics.
It lets you compare prompt or model versions and catch regressions.
It combines automatic metrics with human judgment where accuracy really matters.

In model evaluation you build a fixed set of test cases and metrics, then check every change to a prompt, model, or configuration against it. That replaces "gut feel" assessment, where a single successful example proves nothing, with a repeatable measurement of the whole solution's quality.
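
The idea of a fixed test set and a single metric can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the test cases, the `exact_match` metric, and the two stand-in "models" (`model_v1`, `model_v2`) are all hypothetical, and a real setup would call an actual model and likely use richer metrics.

```python
from typing import Callable

# Hypothetical fixed test set: (question, expected answer) pairs.
# The same cases are reused for every version being compared.
TEST_CASES = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def exact_match(answer: str, expected: str) -> bool:
    """Assumed metric: case-insensitive exact match after trimming."""
    return answer.strip().lower() == expected.strip().lower()


def evaluate(model: Callable[[str], str]) -> list[bool]:
    """Run the model on every test case and record pass/fail per case."""
    return [exact_match(model(question), expected)
            for question, expected in TEST_CASES]


def accuracy(results: list[bool]) -> float:
    """Share of test cases that passed."""
    return sum(results) / len(results)


# Stand-ins for two versions of a prompt or model (illustrative only).
def model_v1(question: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris"}.get(question, "")


def model_v2(question: str) -> str:
    return {"2 + 2": "4", "capital of France": "paris", "3 * 3": "9"}.get(question, "")


if __name__ == "__main__":
    for name, model in [("v1", model_v1), ("v2", model_v2)]:
        print(name, f"accuracy = {accuracy(evaluate(model)):.2f}")
```

Because both versions are scored on the same cases with the same metric, the two accuracy figures are directly comparable, which a handful of ad hoc spot checks would not be.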

In practice, automatic metrics are combined with human judgment, because some qualities (factual accuracy, tone, the risk of hallucination) are hard to capture in a single number. Run this way, evaluation shows whether fine-tuning or a new version actually improved the result, or merely shifted the errors somewhere else.
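
A per-case comparison is one way to tell a real improvement from errors that have merely moved. The sketch below assumes a hypothetical `CaseResult` record that stores an automatic pass/fail flag alongside an optional human rating on a 1 to 5 scale; the field names, the rating scale, and the sample data are illustrative assumptions, not a standard format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    case_id: str
    auto_pass: bool                # outcome of the automatic metric
    human_score: Optional[int]     # assumed 1-5 reviewer rating, None if not reviewed


def pass_rate(results: list[CaseResult]) -> float:
    """Share of cases that passed the automatic metric."""
    return sum(r.auto_pass for r in results) / len(results)


def mean_human_score(results: list[CaseResult]) -> Optional[float]:
    """Average human rating over the reviewed cases only."""
    scores = [r.human_score for r in results if r.human_score is not None]
    return sum(scores) / len(scores) if scores else None


def diff(old: list[CaseResult], new: list[CaseResult]) -> tuple[list[str], list[str]]:
    """Return (fixed, regressed) case ids between two runs on the same cases."""
    old_pass = {r.case_id: r.auto_pass for r in old}
    fixed = [r.case_id for r in new if r.auto_pass and not old_pass[r.case_id]]
    regressed = [r.case_id for r in new if not r.auto_pass and old_pass[r.case_id]]
    return fixed, regressed


# Illustrative data: the aggregate pass rate is identical in both runs,
# but one case was fixed while another one broke.
old_run = [
    CaseResult("c1", True, 4),
    CaseResult("c2", False, 2),
    CaseResult("c3", True, None),
]
new_run = [
    CaseResult("c1", True, 5),
    CaseResult("c2", True, 4),
    CaseResult("c3", False, None),
]

if __name__ == "__main__":
    fixed, regressed = diff(old_run, new_run)
    print("pass rate:", pass_rate(old_run), "->", pass_rate(new_run))
    print("human score:", mean_human_score(old_run), "->", mean_human_score(new_run))
    print("fixed:", fixed, "regressed:", regressed)
```

In this made-up data the headline pass rate does not change, so only the per-case diff reveals that one error was fixed and another introduced. The human ratings are kept as a separate signal rather than folded into the automatic score.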
