Why LLM evaluations matter for AI governance
LLM evaluations have moved from research curiosity to governance requirement. A practical look at the four risks evaluations surface, three stages of evaluation maturity, and how the EU AI Act and ISO 42001 turn evaluation evidence into audit-ready documentation.
If you ship anything built on a large language model, you live with an uncomfortable truth. You cannot know it works correctly across every case. The space of possible inputs is effectively infinite, and even the model provider can't guarantee consistency between two near-identical prompts. That uncertainty is why governance teams now want to see evaluation results before anything goes live.
This post is a practical look at what evaluations measure, why teams skip them and how they connect to the AI governance frameworks regulators are now enforcing: the EU AI Act, ISO/IEC 42001 and the documentation expectations that flow from both.
The four risks evaluations are designed to surface
When teams ask "is the model good enough?", they usually mean one of four very different things. Evaluations are how you separate them.
Hallucination. The model produces plausible-sounding output that isn't true. It happens with weak training data, but more often it happens when an enterprise gives the model their own data (a knowledge base, a policy library, customer records) and the model misreads it or invents around the gaps. You test for this by running a dataset of known questions and checking whether the model's answer matches what's in the source.
Bias and fairness. This is the part regulators care about most. There isn't much regulation that mandates raw accuracy, but there's a great deal of regulation that mandates fair treatment across groups. A useful bias evaluation asks the same question while varying gender, income, ethnicity, age, geography and roughly 20 other axes, then checks whether the answer changes in ways it shouldn't.
Inaccuracy from fine-tuning. The moment you fine-tune a base model on your own data, you change its behaviour. That change can be useful. It can also quietly introduce errors and skew. Evaluations are how you catch the regression before it ships.
Inherited model flaws. Smaller open-weights models have known weaknesses. Larger frontier models from OpenAI or Anthropic have their own. Your evaluation has to surface the ones that hit your specific use case.
If you're shipping RAG or an agent, the metric vocabulary expands — faithfulness and contextual precision for retrieval, tool correctness and plan adherence for agents — but the four underlying risks and the governance argument don't change.
Human-only review doesn't scale anymore
A few years ago, you evaluated an LLM by putting humans in front of every output. That's how RLHF worked at the major labs. Researchers read answers, marked them and fed the signal back into training. It still works. It just can't keep up with the volume of evaluations production AI now demands.
The current standard is LLM-as-a-judge. You use a strong model to evaluate the outputs of the model you're shipping. The judge reads each answer, scores it against a defined metric (correctness, hallucination, bias, fairness and so on) and returns both the score and a short written reasoning, so a human can audit the judgement instead of trusting it blindly.
Purely mathematical evaluation methods still exist (exact-match, BLEU, ROUGE, embedding similarity) and they have their place. But for the qualitative questions governance cares about ("was this answer fair?", "was it grounded in the source material?"), LLM-as-a-judge has become the default.
The judge isn't a free pass. Studies have shown LLM judges favour their own outputs, score inconsistently across runs, and miss errors when an answer is delivered confidently. So the judge itself needs spot-checking — a human reviewer reads a small sample of judged outputs and confirms the judge agrees with their own reading. The judge's model and version belong in the audit trail next to everything else.
Three stages of evaluation maturity
Evaluation isn't binary. Most teams move through three stages, and knowing which one you're in tells you what to invest in next.
Stage 1: deterministic assertions. Cheap, fast, brittle. Regex checks, schema validation, length bounds, refusal detection. They catch the obvious failures and run on every commit. A surprising amount of governance value comes from just doing this consistently.
Stage 2: LLM-as-a-judge on a static dataset. A curated set of prompts, a stronger model scoring against rubric-defined metrics, a versioned report. This is the level most regulators are implicitly assuming when they ask for "testing evidence" under the EU AI Act.
Stage 3: continuous evaluation on production traffic. Sampling live requests, running reference-free judges on them, alerting when scores drift. This is where ISO 42001 lives, and it's the stage almost nobody is at yet.
You can't really skip a stage. Trying to do stage 3 without a stage 2 dataset is just monitoring noise you have no baseline for.

How this connects to the EU AI Act and ISO 42001
Evaluations are useful to governance because they produce something regulators are explicitly asking for: evidence of how the system behaves, not just evidence of how it was built.
Under the EU AI Act, providers and deployers of high-risk AI systems have to maintain technical documentation, conduct testing and demonstrate that the system's risks have been identified and mitigated. An evaluation report with scores across hallucination, bias and fairness, with the underlying samples and the judge's reasoning attached, is a concrete artifact you can put in front of an auditor. It's harder to argue with than "we used a reputable model".
ISO/IEC 42001, the AI management system standard, asks for something similar: continuous, documented assurance that the system keeps performing inside the risk thresholds you've defined. Evaluations turn that abstract requirement into something measurable. A threshold per metric. A pass/fail status. A versioned trail.
Continuous is the operative word. A single eval run before launch doesn't satisfy ISO 42001. The standard expects a recurring rhythm — periodic re-evaluation against the same dataset, sampled monitoring of live traffic, and a documented response when scores drift. A point-in-time test signs off the launch but doesn't keep the system in scope after that.
For more on how the two frameworks compare and where they diverge, see our deep-dive on EU AI Act vs ISO 42001.
What holds up in an audit isn't the score. It's the trail behind it: who tested what, against which dataset, on which model version, with which judge, against which metrics, with which result.
What a real evaluation needs
Four ingredients:
- The model under test. The exact model and version you intend to ship, with the same configuration you'll run in production.
- A dataset. Input prompts and, for measurable metrics, expected outputs. The dataset has to reflect the real distribution of questions your system will see, not just the easy ones. Some metrics need reference outputs (correctness, exact-match). Others — faithfulness, helpfulness, toxicity — are reference-free and score the answer directly. Stage 3 evaluation on production traffic is mostly the second kind, since you rarely have ground truth for live requests.
- A judge. Either a stronger model or, for some metrics, a deterministic check. The judge needs a clear scoring rubric and should be required to give reasoning, not just a number.
- Metrics and thresholds. A numeric definition of "good enough". A common starting point is around 0.5 to 0.6 per metric, with anything below routing to a human reviewer. Where you set the bar is a product and risk decision, not a technical one, and you should document that decision next to the score. Auditors will ask why you chose 0.6 and not 0.8, and "the metric defaulted to it" is not an answer.
The output is a per-sample breakdown: each prompt, the model's answer, the judge's score, the judge's reasoning. For low-volume use cases you can read every sample. For high-volume systems, the reasoning summaries are how governance and engineering review failures without wading through thousands of responses. Failure clusters are where the next sprint's work comes from. A recurring hallucination pattern is a prompt or RAG fix. A recurring bias pattern is a fairness escalation. A recurring tool error is an agent design problem. An evaluation that doesn't change anything next sprint is a status report, not a control.
Why teams keep underinvesting in this
Two reasons, mostly.
The first is that evaluations look like overhead until something goes wrong. There's no urgent ticket asking for a bias audit until there's a very urgent ticket asking why the model treated two customer cohorts differently.
The second is that the tooling has been fragmented for years. Separate libraries for metrics, separate dashboards for results, separate spreadsheets for the documentation regulators want. Teams end up with evaluation logic spread across three places and no single artifact they can hand to an auditor or to legal.
Closing that gap is what good evaluation tooling is for. One place to define the model, the dataset, the judge, the metrics and the threshold. One place to see results per sample. One trail you can hand to whoever asks.
What to do this quarter
If you're shipping anything LLM-based and you don't already have an evaluation pipeline, the smallest useful step is this.
- Pick the one use case where a bad answer would cost you the most, legally, reputationally or commercially.
- Build a dataset of 50 to 100 real prompts for that use case — pulled from production logs where possible, supplemented with synthetic edge cases generated by a stronger model — with expected answers where applicable.
- Run an LLM-as-a-judge evaluation against correctness, hallucination and bias. Three metrics, not 30.
- Set a threshold. Fail loudly when you cross it.
- Save the report. Version it alongside your model.
That's the level of rigour regulators expect of high-risk systems under the EU AI Act, and the level of evidence ISO 42001 audits will increasingly ask for. Separately from compliance, it's also the cheapest way to find out your model is wrong before your customers do.
Evaluations aren't an academic exercise. They're how you prove, on paper, that your AI system did what you said it would.
VerifyWise's platform brings model inventory, evaluations, risk and evidence into one workspace, so the audit trail builds itself as your team works. If your evaluation logic is spread across libraries, dashboards and spreadsheets right now, it's worth a look.
About the VerifyWise team
VerifyWise builds source-available AI governance software used by organizations to manage risk, compliance, and oversight across their AI portfolios. Our editorial team draws on hands-on experience implementing governance workflows for regulated industries and fast-scaling AI teams.
Learn more about VerifyWise →Ready to govern your AI responsibly?
Start your AI governance journey with VerifyWise today.