What Is Distillation? How Small AI Models Learn from Large Ones
There's a persistent tension in applied AI between capability and cost.
The most capable models are large. Large models are expensive to run. Running them at scale, serving millions of requests per day, compounds that expense into numbers that constrain what's economically viable to build. The obvious solution, using a smaller model, sacrifices capability. The less obvious solution is distillation: using a large model to teach a small one.
The insight is that you don't need to train the small model the same way you trained the large one. You can train it to imitate the large one instead.
Knowledge distillation was formalized as a technique in a 2015 paper by Geoffrey Hinton and colleagues, though the underlying intuition predates that work. The basic setup involves two models: a large, capable model called the teacher, and a smaller model called the student. Rather than training the student on the original training data with hard labels, where each example is marked with a single correct answer, you train it on the teacher's output distribution: the probability the teacher assigns to every possible answer.
Those probability distributions carry information that hard labels don't. When a teacher model looks at an image of a cat and assigns 85% probability to "cat," 8% to "lynx," and 4% to "tiger," those soft probabilities encode something about the visual similarity between those categories that a hard label of "cat" doesn't capture. The student model trained on these soft targets learns not just what the right answer is but something about the structure of the problem: which answers are similar to each other, which distinctions are hard to make, where the model's uncertainty concentrates. This richer learning signal often produces a student that outperforms a model of the same size trained directly on the original data.
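A minimal sketch of why soft targets carry more signal, using the cat example above. The fourth "other" class (holding the remaining 3%) and both student predictions are assumptions made up for illustration. The two hypothetical students give the same probability to "cat," so a hard label cannot tell them apart, while the teacher's soft targets can.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy of a predicted distribution against a target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Class order: cat, lynx, tiger, other. The "other" bucket is an assumption
# added so that the teacher's probabilities sum to 1.
hard_label = [1.0, 0.0, 0.0, 0.0]
teacher_probs = [0.85, 0.08, 0.04, 0.03]

# Two hypothetical students. Both give "cat" 0.85, but only student_a
# spreads the remaining mass the way the teacher does.
student_a = [0.85, 0.08, 0.04, 0.03]
student_b = [0.85, 0.01, 0.01, 0.13]

hard_loss_a = cross_entropy(hard_label, student_a)
hard_loss_b = cross_entropy(hard_label, student_b)
soft_loss_a = cross_entropy(teacher_probs, student_a)
soft_loss_b = cross_entropy(teacher_probs, student_b)

print(f"hard-label loss: a={hard_loss_a:.3f}, b={hard_loss_b:.3f}")
print(f"soft-target loss: a={soft_loss_a:.3f}, b={soft_loss_b:.3f}")
```

The hard-label losses are identical because only the "cat" probability enters the sum. The soft-target loss is lower for student_a, which is the extra gradient signal the student receives about how the wrong answers relate to each other.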
The temperature parameter, covered in a separate piece in this blog, plays a specific role in distillation. Raising the temperature of the teacher's softmax output produces softer probability distributions, spreading probability mass more evenly across options so that the teacher's low-probability alternatives become visible to the student. A common approach is to train the student at the same elevated temperature and then set the temperature back to 1 for deployment, which extracts more information from the teacher's uncertainty.
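The effect of temperature can be sketched in a few lines of plain Python. The logits below are made-up values, and the function names are illustrative rather than taken from any library. The loss shown is the KL divergence between the softened teacher and student distributions, multiplied by the square of the temperature as the 2015 paper suggests, so that gradient magnitudes stay comparable across temperatures. In practice this term is often combined with an ordinary hard-label loss, which is omitted here.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature):
    """Soft-target loss: T^2 * KL(teacher_T || student_T)."""
    p_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, q_student)

# Hypothetical logits for three classes.
teacher_logits = [5.0, 2.0, 1.0]
student_logits = [3.0, 2.5, 0.5]

sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
loss = distillation_loss(teacher_logits, student_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
print("distillation loss at T=4:", round(loss, 4))
```

At a temperature of 1 nearly all of the teacher's probability sits on the first class. At 4 the second and third classes receive a visible share, which is what lets the student learn from them.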
Distillation has become a central technique in the practical deployment of large language models. The economics are compelling: a frontier model consumes significant compute for every token it generates. A distilled student model that achieves, say, 90% of the teacher's performance at 20% of the inference cost is a meaningful practical win for applications where the performance gap is acceptable. Many of the AI features embedded in consumer products, the ones that respond quickly without visible latency, are running on distilled models rather than the frontier models that generated the initial press coverage.
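To make the 90% and 20% figures concrete, here is a back-of-the-envelope calculation. The request volume and per-request cost are hypothetical numbers chosen only to show the arithmetic, not measurements from any real deployment.

```python
# All inputs are hypothetical and exist only to illustrate the arithmetic.
requests_per_day = 1_000_000
teacher_cost_per_request = 0.002   # assumed cost in dollars
student_cost_fraction = 0.20       # student costs 20% of the teacher
student_quality_fraction = 0.90    # student reaches 90% of teacher performance

teacher_daily_cost = requests_per_day * teacher_cost_per_request
student_daily_cost = teacher_daily_cost * student_cost_fraction
daily_savings = teacher_daily_cost - student_daily_cost

# Relative performance obtained per unit of spend, teacher normalized to 1.
quality_per_cost_ratio = student_quality_fraction / student_cost_fraction

print(f"teacher: ${teacher_daily_cost:,.2f} per day")
print(f"student: ${student_daily_cost:,.2f} per day")
print(f"savings: ${daily_savings:,.2f} per day")
print(f"performance per dollar vs. teacher: {quality_per_cost_ratio:.1f}x")
```

Under these assumptions the student delivers 0.90 / 0.20 = 4.5 times as much performance per dollar. Whether that trade is worth making depends on how much the missing 10% matters to the application.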
The relationship between distillation and the capabilities of frontier models has become more complex as those models have grown. Early distillation work assumed the teacher and student were trained on the same task with the same data. Modern distillation often involves a teacher with capabilities the student will never fully match, and the question becomes which capabilities transfer most efficiently and which are fundamentally size-dependent. Some capabilities appear to require a certain scale to exist at all, and no amount of distillation can transfer them to a model too small to support them. Others transfer surprisingly well, and distilled models can match frontier performance on those specific capabilities even at a fraction of the size.
Distillation also interacts with fine-tuning in ways worth understanding. A common workflow is to take a large general-purpose model, fine-tune it on a specific task to produce a capable teacher for that task, and then distill that specialized teacher into a small student model that can be deployed efficiently. The student inherits both the general knowledge of the original large model and the task-specific specialization of the fine-tuning, compressed into a package small enough for practical deployment. This approach has become standard in production ML pipelines at organizations that need capable but economical AI at scale.
There's a more recent and somewhat controversial form of distillation that involves training a student model on outputs generated by a teacher model rather than on the teacher's probability distributions. This approach, sometimes called data distillation or synthetic data generation, uses the teacher to produce large quantities of labeled training data that the student then learns from. It's less principled than classical distillation in some respects, but it's also more flexible: the teacher doesn't need to be accessible in a way that exposes its internal probability distributions, only in a way that generates usable outputs. Several high-profile model releases have used this approach, raising questions about the provenance of training data and the long-term implications of models trained primarily on other models' outputs.
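The output-based approach can be sketched as a small data-collection loop. Everything named here is hypothetical: `teacher_generate` stands in for whatever interface returns the teacher's text, `toy_teacher` is a placeholder so the example runs, and the filter is a trivial stand-in for the quality checks a real pipeline would apply.

```python
from typing import Callable, Dict, List

def build_synthetic_dataset(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
    keep: Callable[[str, str], bool] = lambda prompt, output: bool(output.strip()),
) -> List[Dict[str, str]]:
    """Collect (prompt, completion) pairs from a teacher's generated outputs.

    Only the teacher's text is needed; its probability distributions are
    never touched. `keep` is a hypothetical filter hook for dropping bad outputs.
    """
    dataset = []
    for prompt in prompts:
        output = teacher_generate(prompt)
        if keep(prompt, output):
            dataset.append({"prompt": prompt, "completion": output})
    return dataset

def toy_teacher(prompt: str) -> str:
    """Placeholder teacher: uppercases the prompt, returns nothing for 'skip'."""
    return "" if "skip" in prompt else prompt.upper()

prompts = ["define distillation", "skip this one", "what is a teacher model"]
dataset = build_synthetic_dataset(prompts, toy_teacher)

for row in dataset:
    print(row)
```

The resulting pairs would then be used as ordinary supervised training data for the student, which is why this variant works even when the teacher exposes nothing but generated text.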
For practitioners choosing between model options, distillation is part of the explanation for why smaller models have improved so dramatically in recent years without commensurate increases in their own training compute. A small model trained with distillation from a frontier teacher starts from a much better position than a small model trained from scratch, and the gap between small and large has narrowed substantially on many practical tasks as a result. The capability hierarchy still exists, but distillation has compressed it considerably.