LLM evaluation
LLM evaluation is the practice of systematically testing what a large language model produces, so you can tell whether it is accurate, safe, and fit for the job before and after you deploy it. Because these models generate open-ended text rather than a single correct label, you cannot judge them with one accuracy number. Evaluation has to cover several dimensions, often with a mix of automated scoring and human review.
Evaluation gets so much attention because LLM behavior is hard to predict. The same model can be helpful on one prompt and confidently wrong on a slightly different one. Without structured evaluation, teams ship on vibes and discover failures in production, which is exactly what governance frameworks are trying to prevent.
What gets measured
Useful LLM evaluation looks at multiple properties, because a model can score well on one and poorly on another.
Correctness. Does the answer match the expected result or known facts? For tasks with a right answer, this is the core metric.
Faithfulness. In systems that supply context, such as retrieval-augmented generation, does the answer stay grounded in the provided source rather than adding unsupported claims? An unfaithful answer is a hallucination even when it sounds plausible.
Hallucination rate. How often does the model assert things that are not true or not supported? This is one of the most important safety properties for any factual use.
Bias. Does the model treat groups differently in ways that are not justified, for example producing systematically different responses based on names, gender, or other protected attributes?
Toxicity. Does the model produce harmful, harassing, or otherwise unacceptable content, including when prompted adversarially?
Relevance and helpfulness. Does the response actually address the question, at the right level of detail, in the expected format?
Teams pick the dimensions that matter for their use case and define metrics for each, rather than chasing a single score.
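As a rough illustration of tracking several dimensions instead of one score, the sketch below aggregates pass/fail results per dimension and flags any dimension that falls under its own threshold. The threshold values, the dimension names chosen, and the helper names `scorecard` and `failing_dimensions` are assumptions made for this example, not recommended settings.

```python
from collections import defaultdict

# Hypothetical minimum pass rates per dimension; pick values that fit your use case.
THRESHOLDS = {"correctness": 0.90, "faithfulness": 0.95, "toxicity": 0.99}


def scorecard(results):
    """Aggregate (dimension, passed) pairs into a pass rate per dimension."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for dimension, passed in results:
        totals[dimension] += 1
        passes[dimension] += int(passed)
    return {dimension: passes[dimension] / totals[dimension] for dimension in totals}


def failing_dimensions(rates, thresholds=THRESHOLDS):
    """Return dimensions whose pass rate is below its threshold.

    A dimension with no results at all counts as failing.
    """
    return sorted(
        dimension
        for dimension, minimum in thresholds.items()
        if rates.get(dimension, 0.0) < minimum
    )


# Illustrative results only; real ones would come from your scorers.
results = [
    ("correctness", True), ("correctness", True),
    ("correctness", False), ("correctness", True),
    ("faithfulness", True), ("faithfulness", True),
    ("toxicity", True), ("toxicity", True),
]
rates = scorecard(results)
print(rates)                      # {'correctness': 0.75, 'faithfulness': 1.0, 'toxicity': 1.0}
print(failing_dimensions(rates))  # ['correctness']
```

Keeping a separate threshold per dimension means a strong toxicity result cannot mask a weak correctness result, which is the risk of averaging everything into one number.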
How evaluation is done
There are three common approaches, usually combined.
Reference-based scoring. You compare the model's output to a known correct answer using exact match, overlap metrics, or similarity. This works when there is a clear target but struggles with open-ended responses where many phrasings are valid.
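A minimal sketch of two common reference-based metrics, exact match and token-overlap F1, assuming simple normalization (lowercasing, stripping punctuation, collapsing whitespace). The function names are illustrative, and real projects often use an established metrics library instead.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction, reference):
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris.", "paris"))                # True
print(exact_match("The capital is Paris", "Paris"))  # False
print(token_f1("The capital is Paris", "Paris"))     # 0.4
```

The second and third calls show the weakness described above: a correct but differently phrased answer fails exact match and earns only partial credit on overlap.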
Human review. People rate outputs against a rubric. This is the most trustworthy approach for nuanced qualities like helpfulness and tone, but it is slow and expensive, so it is usually applied to samples.
LLM-as-a-judge. A separate language model scores outputs against criteria you define, for example rating faithfulness or detecting toxicity. This scales far better than human review and correlates reasonably well with human ratings when the rubric is clear. It has limits: judge models can be biased, can be inconsistent, and can be gamed, so teams calibrate them against human ratings and do not treat their scores as ground truth.
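One way to calibrate a judge against human ratings is to label the same sample both ways and measure agreement beyond chance. The sketch below uses Cohen's kappa for that purpose. The choice of kappa and the example labels are assumptions for illustration; the judge call itself is left out because it depends on the model and provider you use.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(human_counts[label] * judge_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels for the same eight outputs.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

print(round(cohens_kappa(human, judge), 3))  # 0.467
```

Here raw agreement is 75%, but kappa is about 0.47 once chance agreement is removed, which is a reminder that a judge can look better on simple agreement than it really is.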
Most mature setups use reference-based metrics where answers are deterministic, an LLM judge for scale, and human review on samples to keep the judge honest.
Building an evaluation set
Good evaluation depends on good test data. Teams assemble a dataset of representative inputs, including ordinary cases, edge cases, and adversarial prompts meant to provoke failures. For many dimensions they also record an expected answer or a rubric.
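A test case can be represented as a small record that carries its category plus either an expected answer or a rubric. The sketch below is one possible shape, not a standard schema: the `EvalCase` fields, the category names, and the sample cases are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Categories named in the text above; the identifiers themselves are an assumption.
REQUIRED_CATEGORIES = ("ordinary", "edge", "adversarial")


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    category: str                   # one of REQUIRED_CATEGORIES
    expected: Optional[str] = None  # reference answer, when one exists
    rubric: Optional[str] = None    # grading guidance for open-ended cases


def coverage(cases):
    """Count cases per required category so gaps are visible at a glance."""
    counts = Counter(case.category for case in cases)
    return {category: counts.get(category, 0) for category in REQUIRED_CATEGORIES}


# Made-up cases for illustration.
cases = [
    EvalCase("c1", "What is the refund window?", "ordinary", expected="30 days"),
    EvalCase("c2", "", "edge", rubric="Asks the user to rephrase instead of guessing."),
]
print(coverage(cases))  # {'ordinary': 1, 'edge': 1, 'adversarial': 0}
```

A zero in any category, as with the adversarial cases here, is a signal that the set does not yet probe that kind of failure.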
The set should reflect real usage and the failure modes that would actually hurt: the questions users ask, the inputs that have caused problems before, and the categories where a wrong answer carries consequences. A static set run on every model change turns evaluation into a regression test, so you can see whether an update made things better or worse.
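The regression-test idea can be sketched as a per-case comparison of two runs over the same static set. The case IDs, the 0-to-1 scores, and the `compare_runs` helper are hypothetical; the scores would come from whichever metrics you use.

```python
def compare_runs(baseline, candidate, tolerance=0.0):
    """Compare per-case scores (case_id -> score in [0, 1]) from two runs on one set."""
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty set of cases")
    n = len(baseline)
    return {
        "baseline_mean": sum(baseline.values()) / n,
        "candidate_mean": sum(candidate.values()) / n,
        # Cases whose score dropped by more than the tolerance.
        "regressed": sorted(c for c in baseline if candidate[c] < baseline[c] - tolerance),
        # Cases whose score rose by more than the tolerance.
        "improved": sorted(c for c in baseline if candidate[c] > baseline[c] + tolerance),
    }


# Illustrative scores for the same four cases before and after a model change.
baseline = {"refund-policy": 1.0, "edge-empty-input": 0.0,
            "jailbreak-01": 1.0, "dosage-question": 1.0}
candidate = {"refund-policy": 1.0, "edge-empty-input": 1.0,
             "jailbreak-01": 1.0, "dosage-question": 0.0}

report = compare_runs(baseline, candidate)
print(report["baseline_mean"], report["candidate_mean"])  # 0.75 0.75
print(report["regressed"])                                # ['dosage-question']
print(report["improved"])                                 # ['edge-empty-input']
```

The two means are identical, yet one case regressed. Reporting per-case changes alongside the aggregate keeps a fix in one place from hiding a new failure in another.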
Why governance and regulators want evaluation evidence
Evaluation is not just an engineering nicety; it is increasingly the evidence that shows a system was tested.
Under the EU AI Act, high-risk systems must achieve an appropriate level of accuracy, robustness, and cybersecurity, and the testing behind those claims has to be documented. Evaluation results are a natural part of the technical documentation that shows the system performs as claimed and was checked for relevant risks.
ISO 42001, the AI management system standard, expects organizations to define performance criteria, test against them, and keep records as part of continual improvement. Evaluation is how you generate those records.
The NIST AI Risk Management Framework similarly calls for measuring AI risks, which means having defined metrics and test results rather than assurances.
For governance teams the message is consistent: define what good looks like, test for it, write down the results, and re-test when the model or its use changes. An auditor wants to see the evaluation set, the metrics, the scores, and evidence that failures were addressed.
FAQ
Why can't I just use accuracy to evaluate an LLM?
Because most LLM outputs are open-ended text, not a single correct label, so one accuracy number misses most of what matters. A model can be accurate on facts yet biased, toxic under pressure, or unfaithful to its sources. Useful evaluation measures several dimensions and matches each to the way the model is actually used.
What is LLM-as-a-judge?
It is using a separate language model to score outputs against criteria you define, such as faithfulness or toxicity. It scales far better than human review and works reasonably well when the rubric is clear. The catch is that judge models can be biased, inconsistent, or gamed, so you calibrate them against human ratings rather than trusting their scores blindly.
What is the difference between correctness and faithfulness?
Correctness asks whether the answer is factually right against a known truth. Faithfulness asks whether the answer stays grounded in the specific context the system provided, for example retrieved documents, without adding unsupported claims. An answer can be faithful to a wrong source, or correct in general while drifting from the source, so both are worth measuring in retrieval systems.
How often should I evaluate a model?
Before deployment, and again whenever the model, its prompts, or its data sources change. Running a fixed evaluation set on every change turns it into a regression test, so you can see whether an update improved or degraded behavior. High-risk uses warrant ongoing evaluation in production, not just a one-time check.
How do I evaluate bias and toxicity?
Use targeted test sets: inputs varied across protected attributes to surface unjustified differences for bias, and adversarial prompts designed to provoke harmful output for toxicity. Score with classifiers or an LLM judge, and confirm with human review on samples. The point is to probe deliberately for these failures rather than hope they do not occur.
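A minimal sketch of the bias probe described above: fill one prompt template with values that differ only in a protected attribute, score the responses, and compare group means. The template, the names, the scores, and both helper functions are hypothetical. The scores stand in for whatever classifier or judge you apply, and a gap is a prompt for human review rather than proof of bias.

```python
from statistics import mean


def counterfactual_prompts(template, slot, values):
    """Build prompts that are identical except for the value placed in {slot}."""
    return {value: template.replace("{" + slot + "}", value) for value in values}


def largest_gap(scores_by_group):
    """Return (gap, means): the spread between the best and worst group means."""
    means = {group: mean(scores) for group, scores in scores_by_group.items()}
    return max(means.values()) - min(means.values()), means


prompts = counterfactual_prompts(
    "Write a short reference letter for {name}, a software engineer.",
    "name",
    ["Emily", "Jamal"],
)
print(prompts["Jamal"])

# Made-up quality scores for the responses to each variant.
scores = {"Emily": [1.0, 0.5], "Jamal": [0.5, 0.5]}
gap, means = largest_gap(scores)
print(means)  # {'Emily': 0.75, 'Jamal': 0.5}
print(gap)    # 0.25
```

In practice each variant would be run many times and across many templates, since a gap measured on a handful of responses is too noisy to act on.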
What evidence do regulators expect from evaluation?
They want to see that you defined what good performance means, tested against it, and documented the results. For the EU AI Act that means accuracy and robustness testing in the technical documentation. For ISO 42001 it means recorded performance criteria and test results. The artifacts are your evaluation set, your metrics, your scores, and proof that failures were addressed.
Summary
LLM evaluation is the systematic testing of model outputs across dimensions like correctness, faithfulness, hallucination, bias, toxicity, and relevance, because no single accuracy number captures how an open-ended model behaves. Teams combine reference-based scoring, human review, and LLM-as-a-judge, running a representative evaluation set as a regression test whenever the model or its use changes. Beyond engineering value, evaluation produces the evidence that governance demands: the EU AI Act, ISO 42001, and the NIST AI RMF all expect defined metrics, documented test results, and proof that identified failures were addressed.