AI Glossary
Model distillation (knowledge distillation)
Also known as: knowledge distillation, knowledge transfer, distillation
Model distillation is a technique for training a smaller model (the "student") to imitate a larger one (the "teacher"). It produces a model that is smaller and cheaper to run while retaining part of the original's quality.
- A smaller "student" model learns to reproduce the outputs of a larger "teacher" model.
- The goal is a model that is lighter and cheaper at inference, at the cost of some of the original's quality.
- This is a different mechanism from quantization, which lowers the precision of an existing model's weights.
Model distillation (knowledge distillation) is a technique in which a smaller model, called the "student," is trained to imitate the behavior of a larger, more capable model that acts as the "teacher." Instead of learning from raw data alone, the student reproduces the answers and probability distributions generated by the teacher, taking over part of its knowledge in a far more compact form.
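As a rough illustration of what learning from the teacher's probability distributions can look like, the sketch below computes a distillation loss for a single training example. It follows one commonly used formulation, not necessarily the one any particular system uses: a KL-divergence term between temperature-softened teacher and student distributions, blended with an ordinary cross-entropy term on the true label. The function names and the `temperature` and `alpha` parameters are illustrative assumptions, not part of any specific library.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target term (imitate the teacher) with a hard-label term.

    temperature and alpha are hypothetical hyperparameters chosen for this
    sketch; real setups tune them per task.
    """
    # Soft targets: compare softened teacher and student distributions.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Scaling by temperature**2 is a common convention that keeps the
    # soft-target term comparable in magnitude across temperatures.
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    # Hard label: standard cross-entropy against the ground-truth class.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[true_label])

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example: a student that disagrees with the teacher gets a larger loss
# than one whose logits match the teacher's.
teacher = [2.0, 1.0, 0.1]
matching_loss = distillation_loss([2.0, 1.0, 0.1], teacher, true_label=0)
mismatched_loss = distillation_loss([0.1, 1.0, 2.0], teacher, true_label=0)
```

In this sketch, setting `alpha=1.0` makes the student learn from the teacher's outputs only, while `alpha=0.0` reduces to ordinary supervised training on the labels.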
The result is a model that is lighter, faster, and cheaper at inference, and that keeps a good share of the original's quality, though usually not all of it. This is a common route to a small language model that is capable in practice: the student inherits the large model's competence within the narrow range of tasks it is meant to handle.
It is worth distinguishing distillation from related optimization techniques. Quantization lowers the numerical precision of an existing model's weights, which reduces hardware requirements without changing the parameter count. Fine-tuning adapts an already trained model to a specific task. Distillation, by contrast, trains a separate, smaller model to imitate the larger one, using the teacher as the source of knowledge; the student most often starts from a pre-trained model rather than from scratch. In enterprise deployments, these techniques are often combined to minimize cost and maximize speed.
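To make the contrast with quantization concrete, the following minimal sketch quantizes a handful of weights to 8-bit integer codes. It assumes simple symmetric per-tensor quantization, which is only one of several schemes, and the function names and sample weights are made up for illustration. The point is that the number of parameters stays the same and only the precision of each one drops, whereas distillation produces a different model with fewer parameters.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

# Hypothetical weights of an existing model (illustrative values only).
weights = [0.42, -1.27, 0.05, 0.88]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Same parameter count before and after; only the precision changed.
same_count = len(codes) == len(weights)
# Rounding error per weight is bounded by about half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Because the two techniques act on different things (precision versus model size), a distilled student can itself be quantized afterwards.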
Related terms