Two years after ChatGPT turned LLMs into a product category, most teams running LLMs in production still rely on evaluation approaches that would be considered inadequate for any other critical system. The reasons are understandable. LLM outputs are open-ended, ground truth is scarce, and the space of possible failures is enormous. None of this excuses the absence of discipline. It raises the bar for the discipline required.

Benchmarks do not measure your product

Public benchmarks like MMLU, HumanEval, and GSM8K are useful signals about model capability and nearly useless as predictors of how a specific model will behave on a specific application. A model that dominates a leaderboard can underperform on your workload, and a model that looks mediocre on benchmarks can be exactly right for your domain. The lesson is to build an evaluation set that reflects what the system actually does, and to trust that set over any external score.

The three layers of evaluation

A serious LLM evaluation framework covers three layers, each answering a different question. Unit-level evaluation tests individual prompts and tool calls against expected behavior. Task-level evaluation tests end-to-end workflows on representative inputs. Production monitoring tracks quality signals on real traffic in aggregate and at the tail.

Each layer has different latency, cost, and coverage characteristics, and each catches a different class of regression. Teams that rely on only one are blind to failures the others would catch easily.
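
As a rough illustration of the unit-level layer, the sketch below runs a handful of prompt checks against a model callable. Everything named here is hypothetical: `fake_generate` stands in for a real model client, and the JSON tool-call shape is an assumed convention, not any particular vendor's format.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class UnitCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def run_unit_cases(generate: Callable[[str], str], cases: list[UnitCase]) -> dict[str, bool]:
    """Run every case through the model callable and record pass/fail by name."""
    return {case.name: case.check(generate(case.prompt)) for case in cases}


def calls_tool(expected_tool: str) -> Callable[[str], bool]:
    """Build a check that passes when the output is a JSON call to expected_tool."""
    def check(output: str) -> bool:
        try:
            return json.loads(output).get("tool") == expected_tool
        except (json.JSONDecodeError, AttributeError):
            return False  # not JSON, or JSON that is not an object
    return check


# Hypothetical stand-in for a real model call; replace with your own client.
def fake_generate(prompt: str) -> str:
    if "refund" in prompt.lower():
        return '{"tool": "lookup_order", "args": {"order_id": "A123"}}'
    return "I can't help with that."


cases = [
    UnitCase("refund_routes_to_lookup", "I want a refund for order A123", calls_tool("lookup_order")),
    UnitCase("smalltalk_makes_no_tool_call", "Tell me a joke", lambda out: not out.strip().startswith("{")),
]
results = run_unit_cases(fake_generate, cases)
```

Checks like these are cheap enough to run on every prompt change. Task-level evaluation and production monitoring need different machinery and are not covered by this sketch.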

LLM-as-judge, with caveats

Using an LLM to judge another LLM's output is the dominant automation technique for open-ended evaluation, and it is both powerful and treacherous. Powerful because it scales cheaply and captures nuances that rule-based metrics miss. Treacherous because the judge has its own biases, its own failure modes, and its own cost. The only way to trust an LLM judge is to calibrate it against human labels on a held-out set and to recheck that calibration regularly.
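
One way to make "calibrate against human labels" concrete is to measure chance-corrected agreement between the judge and human raters on the held-out set. The sketch below computes raw agreement and Cohen's kappa in plain Python. The choice of kappa is an assumption made for illustration, and the labels are invented example data, not results.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge's label matches the human label."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented labels for illustration only.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]

raw = agreement_rate(human_labels, judge_labels)
kappa = cohen_kappa(human_labels, judge_labels)
```

Raw agreement can look healthy when one label dominates, which is why a chance-corrected number is worth tracking alongside it. Rerunning the same computation on fresh human labels after each judge prompt or model change is one way to do the regular recheck.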

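Pairwise comparison is usually more reliable than absolute scoring. "Which of these two answers is better" produces sharper, less biased signal than "rate this answer from one to ten", because the judge does not need to hold a stable internal scale across examples. Pairwise judging has its own bias to control: judges can favor whichever answer is shown first, so each comparison is worth running in both orders.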
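
A minimal sketch of pairwise judging follows, with one common precaution added: each pair is judged twice with the order swapped, and only a consistent verdict counts. The `judge` callable and its "first"/"second" return convention are assumptions for illustration; the two stub judges exist only to show the behavior.

```python
from typing import Callable

# Assumed judge signature: (question, answer shown first, answer shown second)
# -> "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise_compare(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie"; a verdict counts only if it survives an order swap."""
    verdict_ab = judge(question, answer_a, answer_b)
    verdict_ba = judge(question, answer_b, answer_a)
    if verdict_ab == "first" and verdict_ba == "second":
        return "A"
    if verdict_ab == "second" and verdict_ba == "first":
        return "B"
    return "tie"  # the judge contradicted itself, so treat the pair as undecided


# Stub judges for demonstration only.
def prefers_longer(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


def always_first(question: str, first: str, second: str) -> str:
    return "first"  # a judge with pure position bias


question = "What is the capital of France?"
consistent = pairwise_compare(prefers_longer, question, "Paris is the capital of France.", "Paris")
biased = pairwise_compare(always_first, question, "Paris is the capital of France.", "Paris")
```

The share of pairs that end as ties under this scheme is itself a useful health metric for the judge.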

Synthetic data for coverage

The evaluation set should cover the long tail, not just the common cases. Synthetic generation from the model itself, guided by categories of expected failures, is a practical way to build coverage quickly. The generated cases need human review before they enter the permanent evaluation suite, but the review cost is much lower than the cost of writing every case from scratch.
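
The workflow above can be sketched as a generation prompt per failure category plus a review gate. The category names, prompt wording, and status values below are all hypothetical; the model call itself is left out, and `raw_output` stands in for whatever the model returned.

```python
from dataclasses import dataclass

# Hypothetical failure categories; a real list would come from incident reviews.
FAILURE_CATEGORIES = {
    "ambiguous_request": "the user request can be read two ways",
    "conflicting_context": "retrieved documents disagree with each other",
}


def generation_prompt(category: str, n_cases: int = 5) -> str:
    """Build the prompt that asks a model for candidate test inputs."""
    description = FAILURE_CATEGORIES[category]
    return (
        f"Write {n_cases} realistic user inputs where {description}. "
        "Return one input per line with no numbering."
    )


@dataclass
class Candidate:
    text: str
    category: str
    status: str = "pending"  # "pending", "approved", or "rejected"


def parse_candidates(raw_output: str, category: str) -> list[Candidate]:
    """Turn one-input-per-line model output into candidates awaiting review."""
    return [Candidate(line.strip(), category) for line in raw_output.splitlines() if line.strip()]


def approved_cases(candidates: list[Candidate]) -> list[Candidate]:
    """Only human-approved candidates may enter the permanent suite."""
    return [c for c in candidates if c.status == "approved"]


# Stand-in for model output; in practice this comes from a model call.
raw_output = "Cancel it\n\nMove my booking or refund me\n"
candidates = parse_candidates(raw_output, "ambiguous_request")
candidates[0].status = "approved"  # set by a human reviewer
suite_additions = approved_cases(candidates)
```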

Continuous monitoring in production

Offline evaluation catches regressions before release. Online monitoring catches what offline evaluation cannot anticipate: shifts in real traffic, new usage patterns, and failures that no test case covered. Useful signals include hallucination rate estimated by grounding checks, refusal rate, response length distribution, user thumbs-up and thumbs-down feedback, downstream conversion metrics, and escalation rate to human review. None of these is sufficient on its own. Together they form a usable picture of production quality.
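
As an illustration, a few of these signals can be rolled up per time window from logged interactions. The record fields below are assumed names, not a standard schema, and the sketch leaves out hallucination and conversion signals because they depend on system-specific grounding checks and business data.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Optional


@dataclass
class Interaction:
    response_chars: int
    refused: bool
    escalated: bool
    thumbs: Optional[int] = None  # +1, -1, or None when the user gave no feedback


def summarize(window: list[Interaction]) -> dict[str, float]:
    """Aggregate one window of traffic into rates plus median and tail response length."""
    if len(window) < 2:
        raise ValueError("need at least two interactions to compute quantiles")
    n = len(window)
    rated = [i.thumbs for i in window if i.thumbs is not None]
    # n=20 yields 19 cut points at 5% steps: index 9 is p50, index 18 is p95.
    cuts = quantiles([i.response_chars for i in window], n=20, method="inclusive")
    return {
        "refusal_rate": sum(i.refused for i in window) / n,
        "escalation_rate": sum(i.escalated for i in window) / n,
        "thumbs_down_rate": sum(t < 0 for t in rated) / len(rated) if rated else float("nan"),
        "length_p50": cuts[9],
        "length_p95": cuts[18],
    }


# Invented traffic for illustration only.
window = [
    Interaction(100, refused=False, escalated=False, thumbs=1),
    Interaction(200, refused=True, escalated=False, thumbs=-1),
    Interaction(300, refused=False, escalated=True),
    Interaction(400, refused=False, escalated=True, thumbs=-1),
]
summary = summarize(window)
```

Reporting a tail quantile next to the median reflects the point that quality should be tracked in aggregate and at the tail.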

Alerting should focus on business impact rather than raw metric movement. A five percent drift in response length is usually noise. A five percent rise in escalation rate is almost always a signal.
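
One way to encode that asymmetry is a per-metric tolerance rather than a single global threshold. The sketch below assumes "five percent" means relative change against a baseline window. The metric names and tolerance values are illustrative guesses and would need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    max_relative_change: float  # e.g. 0.05 means five percent versus baseline
    rises_only: bool = False    # True when only an increase is a problem


# Illustrative tolerances: loose for a noisy metric, tight for a business-critical one.
RULES = [
    AlertRule("response_length_mean", 0.20),
    AlertRule("escalation_rate", 0.05, rises_only=True),
]


def fired_alerts(baseline: dict[str, float], current: dict[str, float],
                 rules: list[AlertRule] = RULES) -> list[str]:
    """Return the metrics whose change versus baseline exceeds their own tolerance."""
    fired = []
    for rule in rules:
        base = baseline[rule.metric]
        if base == 0:
            continue  # relative change is undefined; zero baselines need separate handling
        change = (current[rule.metric] - base) / base
        magnitude = change if rule.rises_only else abs(change)
        if magnitude >= rule.max_relative_change:
            fired.append(rule.metric)
    return fired


# Invented numbers: length drifts 5 percent, escalation rises 10 percent.
baseline = {"response_length_mean": 800.0, "escalation_rate": 0.040}
current = {"response_length_mean": 840.0, "escalation_rate": 0.044}
alerts = fired_alerts(baseline, current)
```

In this example only the escalation rule fires, even though both metrics moved.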

Evaluation is a product, not a project

The teams that do LLM evaluation well treat the evaluation set itself as a versioned, owned, long-lived artifact. It grows as the product grows. It evolves when the failure modes evolve. It is maintained with the same care as production code. The teams that treat evaluation as a one-time exercise before launch are the teams that get surprised six months later when quality has slipped in ways they cannot explain.
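
A small sketch of what "versioned and owned" can mean in practice: each case records where it came from and when it was added, and the suite gets a content fingerprint that can be stored with every evaluation run. The field names and the fingerprint scheme are assumptions for illustration, not an established format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected_behavior: str
    added_in: str  # suite version that introduced the case
    source: str    # e.g. "production-incident" or "synthetic-reviewed"


def suite_fingerprint(cases: list[EvalCase]) -> str:
    """Short content hash of the suite, independent of case order."""
    ordered = sorted(cases, key=lambda c: c.case_id)
    canonical = json.dumps([asdict(c) for c in ordered], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Invented cases for illustration only.
case_1 = EvalCase("refund-001", "I want a refund for order A123",
                  "routes to order lookup", "1.0.0", "hand-written")
case_2 = EvalCase("refund-002", "Refund me, the thing never arrived",
                  "asks for the order number", "1.1.0", "production-incident")

fingerprint_v1 = suite_fingerprint([case_1])
fingerprint_v2 = suite_fingerprint([case_1, case_2])
```

Recording the fingerprint next to each score makes it possible to tell whether a change in results came from the model or from the evaluation set.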

The models will keep getting better. The need for rigorous evaluation will not go away. It will become more important as the stakes rise.