AI Benchmarks and What They Actually Measure: A Plain Language Guide
Every time a major AI lab releases a new model, the announcement includes benchmark scores.
A benchmark is a standardized test for AI models. It consists of a dataset of questions, problems, or tasks with known correct answers, and a scoring methodology that produces a number representing how well a model performed. The appeal is obvious: rather than making subjective judgments about which model is better, you can run both on the same test and compare scores. The number is objective. The comparison is apples-to-apples. The result tells you something real.
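To make "a dataset plus a scoring methodology" concrete, here is a minimal sketch of exact-match accuracy on a toy multiple-choice set. The four questions, the `score` function, and the `always_b` stand-in model are all hypothetical and invented for illustration; real benchmarks have thousands of items and more careful answer parsing.

```python
from typing import Callable

# Hypothetical toy benchmark: each item has a question, lettered choices,
# and a known correct answer. Not taken from any real benchmark.
BENCHMARK = [
    {"question": "2 + 2 = ?",
     "choices": {"A": "3", "B": "4", "C": "5"}, "answer": "B"},
    {"question": "Capital of France?",
     "choices": {"A": "Paris", "B": "Rome", "C": "Oslo"}, "answer": "A"},
    {"question": "H2O is commonly called?",
     "choices": {"A": "salt", "B": "sand", "C": "water"}, "answer": "C"},
    {"question": "Largest planet in the solar system?",
     "choices": {"A": "Mars", "B": "Jupiter", "C": "Venus"}, "answer": "B"},
]


def score(model: Callable[[dict], str], items: list[dict]) -> float:
    """Return the fraction of items where the model's letter matches the key."""
    correct = sum(1 for item in items if model(item) == item["answer"])
    return correct / len(items)


def always_b(item: dict) -> str:
    """Stand-in 'model' that ignores the question and always answers B."""
    return "B"


print(score(always_b, BENCHMARK))  # 0.5
```

Note that a model which never reads the question still scores 0.5 here. The number is easy to compute and easy to compare, which is exactly why it is tempting to read more into it than it contains.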
The complication is that benchmarks measure performance on a specific set of tasks under specific conditions, and that performance may or may not predict performance on the tasks you actually care about. MMLU, the Massive Multitask Language Understanding benchmark, tests knowledge across 57 academic and professional subjects through multiple-choice questions. A model that scores well on MMLU probably has broad factual knowledge and can answer multiple-choice questions accurately. Whether it can help you draft a contract, debug your codebase, or explain a concept to a non-technical audience is a separate question that MMLU doesn't directly address.
This isn't a flaw in MMLU specifically. It's an inherent property of benchmarks. Any benchmark is a finite set of problems drawn from a particular distribution. Real-world use cases are an infinite and constantly shifting set of problems drawn from a much more complex distribution. The benchmark is always a proxy for what you care about, and proxies are imperfect.
The imperfection becomes more serious when models are trained specifically to perform well on benchmarks rather than to develop the underlying capabilities the benchmarks are meant to measure. This is Goodhart's Law, covered in a separate piece on this blog, applied directly: when a benchmark score becomes a target, it ceases to be a reliable measure. A model can improve its benchmark score through training on benchmark-adjacent data, through exposure to question formats that resemble benchmark questions, or, in the most egregious cases, through direct contamination of training data with benchmark test sets. The benchmark score goes up. The genuine capability it was supposed to measure may not have moved at all.
Data contamination is a persistent problem in AI benchmarking. Training datasets for large language models are assembled from enormous quantities of internet text, and benchmark datasets are often publicly available on the internet. If benchmark questions and answers appear in a model's training data, the model may effectively be reciting memorized answers rather than demonstrating reasoning capability. Detecting contamination is difficult, and model developers don't always disclose it when they find it. A score from a model with contaminated training data cannot be meaningfully compared with a score from a model that never saw the test data during training.
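One common family of contamination checks looks for long word sequences that a benchmark item shares with training text. The sketch below is a simplified illustration of that idea, not any lab's actual procedure: the function names, the 8-word window, and the sample strings are all assumptions made for the example, and it presumes you can read the training corpus at all, which outside evaluators usually cannot.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase the text, keep alphanumeric tokens, return all n-token windows."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur somewhere in the corpus.

    A value near 1.0 suggests the item appears nearly verbatim in the corpus.
    A value of 0.0 does NOT prove the item is clean: paraphrases and
    translations slip past exact n-gram matching.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens; nothing to compare
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical data for illustration only.
item = "What is the capital of France and which river runs through it?"
leaked = ["Quiz dump: What is the capital of France and which river runs "
          "through it? Answer: Paris, the Seine."]
clean = ["The Seine flows through Paris."]

print(overlap_fraction(item, leaked))  # 1.0
print(overlap_fraction(item, clean))   # 0.0
```

The asymmetry in the docstring is the practical reason detection is hard: a high overlap is strong evidence of leakage, but a low overlap only rules out the crudest kind.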
The benchmark landscape has responded to these problems by producing new benchmarks designed to be harder to game: benchmarks with held-out test sets that are not released publicly, benchmarks that require multi-step reasoning rather than knowledge retrieval, and benchmarks that evaluate tasks requiring genuine generalization to novel situations. The GPQA benchmark, which consists of graduate-level science questions that are written by domain experts, remain challenging even for experts, and are meant to be hard to answer by searching the web, was designed specifically to resist the lookup and pattern-matching that allow models to perform well on easier knowledge benchmarks without genuine understanding. But harder benchmarks face the same dynamics over time: models improve, scores rise, the benchmark saturates, and the field moves on to something harder.
Human evaluation addresses some of what benchmarks miss. Rather than testing against a fixed set of questions with predetermined correct answers, human evaluation asks real users or trained raters to compare model outputs and express preferences. The Chatbot Arena leaderboard, which ranks models based on head-to-head comparisons rated by human users, captures something about conversational quality that multiple-choice benchmarks don't. The limitation is that human evaluation is slow, expensive, and subject to its own biases: raters may prefer longer, more confident-sounding answers regardless of accuracy, and the population of people choosing to participate in evaluation may not represent the population whose preferences matter for a given deployment context.
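To show how a pile of head-to-head preferences can become a ranking, here is a minimal Elo-style sketch. It is an illustration of the general idea only and should not be read as the method any particular leaderboard uses; the function name, the starting rating of 1000, the K-factor of 32, and the model names are all assumptions.

```python
from collections import defaultdict


def elo_ratings(battles: list[tuple[str, str]],
                k: float = 32.0, base: float = 1000.0) -> dict[str, float]:
    """Compute Elo-style ratings from (winner, loser) pairs, processed in order."""
    ratings: defaultdict[str, float] = defaultdict(lambda: base)
    for winner, loser in battles:
        # Probability the eventual winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected_win) moves ratings more than an expected result.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes: raters preferred model_a three times, model_b once.
votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
ratings = elo_ratings(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
print(ranking)  # ['model_a', 'model_b']
```

Nothing in this calculation knows why a rater preferred one answer. If raters systematically favor longer or more confident-sounding responses, that bias flows straight into the ratings. Sequential Elo updates also depend on the order of the votes, which is one reason to treat small rating differences with caution.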
For practitioners evaluating AI models for specific use cases, the most reliable approach is task-specific evaluation on representative examples from your own context. This means assembling a set of real inputs your system will encounter, defining what good outputs look like for those inputs, and evaluating candidate models against that standard rather than relying on published benchmark scores. This is more work than reading a leaderboard, and the resulting evaluation won't generalize beyond your use case. That's actually the point. A model that leads the leaderboard on general benchmarks and performs poorly on your specific task is not the right model for your deployment, regardless of its scores.
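The three steps above (collect real inputs, define what good looks like, score candidates against that standard) can be sketched in a few lines. Everything named here is hypothetical: the support-ticket examples, the `is_good` checks, and the two stand-in "models", which are plain functions standing in for whatever API calls your real candidates would need.

```python
from typing import Callable

# Hypothetical examples from "your own context": a real input plus a check
# that encodes what a good output looks like for that input.
EXAMPLES = [
    {"input": "Summarize: refund issued for order 1182 after late delivery.",
     "is_good": lambda out: "refund" in out.lower() and "1182" in out},
    {"input": "Summarize: customer asks to change shipping address on order 2057.",
     "is_good": lambda out: "address" in out.lower() and "2057" in out},
    {"input": "Summarize: password reset link expired for user on mobile app.",
     "is_good": lambda out: "password" in out.lower() and len(out) <= 80},
]


def evaluate(model: Callable[[str], str], examples: list[dict]) -> float:
    """Return the fraction of examples whose output passes its own check."""
    passed = sum(1 for ex in examples if ex["is_good"](model(ex["input"])))
    return passed / len(examples)


def compare(candidates: dict[str, Callable[[str], str]],
            examples: list[dict]) -> list[tuple[str, float]]:
    """Score every candidate and return (name, pass_rate), best first."""
    results = [(name, evaluate(model, examples)) for name, model in candidates.items()]
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Stand-in candidates; in practice each would wrap a call to a real model.
def echo_model(prompt: str) -> str:
    return prompt.removeprefix("Summarize: ")


def truncating_model(prompt: str) -> str:
    return prompt.removeprefix("Summarize: ")[:30]


results = compare({"echo_model": echo_model,
                   "truncating_model": truncating_model}, EXAMPLES)
for name, pass_rate in results:
    print(f"{name}: {pass_rate:.2f}")
```

The checks are deliberately crude keyword and length tests. In a real evaluation, the hard part is writing `is_good` so that it captures what your users would actually accept, and that may require human review rather than a one-line lambda.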
Benchmarks are useful for getting a rough sense of where a model sits in the capability landscape and for tracking progress in the field over time. They're a reasonable starting point for model selection, not a reliable endpoint. The gap between benchmark performance and deployment performance is real, it's sometimes large, and it's the gap that careful evaluation on your own data is designed to expose.