AI Glossary
LLM-as-a-judge
LLM-as-a-judge, model as a judge, model-based evaluation
LLM-as-a-judge uses a language model to score another model's answers against defined criteria. It is faster and cheaper than human evaluation, but carries its own errors and biases.
- A language model scores another model's answers against criteria defined in advance.
- It lets you scale evaluation more cheaply and quickly than human review.
- It has its own limits: it can be biased, favoring longer answers or answers from its own model family, and it can simply get things wrong.
LLM-as-a-judge addresses the problem of scale in model evaluation: having people manually score thousands of answers is slow and expensive. Instead, a second language model is given an answer to assess along with clearly stated criteria — accuracy, completeness or adherence to the instruction, for example — and asked to score it or to pick the better of two variants. This makes it possible to compare versions of prompts or models across large sets of cases.
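The scoring step described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the prompt wording, the 1-5 scale, the `SCORE:` reply format, and the names `build_judge_prompt`, `parse_score`, `judge_answer` and `fake_judge` are all assumptions made for the example. The judge itself is passed in as a plain callable, so no particular vendor API is implied.

```python
import re
from typing import Callable, Optional, Sequence

# Example criteria taken from the text; a real rubric would be more detailed.
CRITERIA = ("accuracy", "completeness", "adherence to the instruction")


def build_judge_prompt(question: str, answer: str,
                       criteria: Sequence[str] = CRITERIA) -> str:
    """Assemble a grading prompt that states the criteria up front."""
    criteria_list = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are grading an answer against the criteria below.\n"
        f"Criteria:\n{criteria_list}\n\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        "Reply with a line of the form 'SCORE: <integer 1-5>' "
        "followed by a short rationale."
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the score, or return None if the reply is malformed or out of range."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if low <= score <= high else None


def judge_answer(question: str, answer: str,
                 call_judge: Callable[[str], str]) -> Optional[int]:
    """Send the prompt to the judge (any text-in, text-out callable) and parse its reply."""
    return parse_score(call_judge(build_judge_prompt(question, answer)))


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, used only for this demo."""
    return "SCORE: 4\nMostly correct, but one detail is missing."


if __name__ == "__main__":
    print(judge_answer("What is the capital of France?", "Paris.", fake_judge))
```

Returning `None` for unparseable replies, rather than guessing a score, keeps malformed judge output visible so it can be counted and reviewed separately.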
The method has limits of its own, however. The judge model can be biased: it may favor longer answers, text in a particular style, or answers produced by a model from the same family. It can also simply be wrong, or hallucinate the rationale behind its score. That is why LLM-as-a-judge complements human review rather than replacing it: it works well as a fast filter and as a way to evaluate at scale, provided its verdicts are calibrated against a sample checked by a person.
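One way to calibrate against a human-checked sample is to measure how often the judge and the human reviewer agree on the same items. The sketch below computes the raw agreement rate and Cohen's kappa, which corrects that rate for agreement expected by chance. The function names and the pass/fail labels are illustrative assumptions, and the sample data is invented for the demo.

```python
from collections import Counter
from typing import Hashable, Sequence


def agreement_rate(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Fraction of items on which the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = judge_counts.keys() | human_counts.keys()
    # Chance agreement: probability both raters pick the same label independently.
    p_expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


if __name__ == "__main__":
    # Invented labels for eight answers, for illustration only.
    judge_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
    human_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
    print(agreement_rate(judge_labels, human_labels))
    print(round(cohens_kappa(judge_labels, human_labels), 3))
```

A high raw agreement rate can be misleading when one label dominates, which is why a chance-corrected measure is shown alongside it. What counts as acceptable agreement depends on the task and is left open here.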
In practice, businesses pair this approach with benchmarks and regression tests for quality: when a company changes a prompt or updates a model, the judge automatically checks whether answers have degraded on an established set of examples. The key is good evaluation criteria and periodic checks that the judge's verdicts still align with human judgment.
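The regression check described above might look something like this sketch. It assumes judge scores have already been collected for the same set of cases before and after a change. The function name `regression_report`, the case IDs, the scores, and the `tolerance` threshold are all hypothetical choices for illustration.

```python
from statistics import mean
from typing import Dict, List, Union


def regression_report(baseline: Dict[str, float],
                      candidate: Dict[str, float],
                      tolerance: float = 0.1) -> Dict[str, Union[float, bool, List[str]]]:
    """Compare judge scores per case ID before and after a prompt or model change."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no shared case IDs to compare")
    baseline_mean = mean(baseline[c] for c in shared)
    candidate_mean = mean(candidate[c] for c in shared)
    # Cases whose score dropped, listed so a person can inspect them.
    degraded = [c for c in shared if candidate[c] < baseline[c]]
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "degraded_cases": degraded,
        # Pass only if the average did not fall by more than the tolerance.
        "passed": candidate_mean >= baseline_mean - tolerance,
    }


if __name__ == "__main__":
    # Invented scores on a 1-5 scale, for illustration only.
    before = {"case-1": 4, "case-2": 5, "case-3": 3}
    after = {"case-1": 4, "case-2": 3, "case-3": 4}
    print(regression_report(before, after))
```

Reporting the individual degraded cases alongside the average matters because a stable mean can hide a sharp drop on a few examples.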