Distillation
Definition
A technique for training a smaller "student" model to approximate the behaviour of a larger "teacher" model. Goal: comparable quality at significantly lower inference cost and latency. Frequently used in combination with fine-tuning on domain-specific tasks.
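The core training signal can be sketched as the classic soft-label objective: the student is trained to match the teacher's temperature-softened output distribution. This is a minimal stdlib-only illustration, not the document's own implementation; the function names and the temperature value are assumptions for the example.

```python
import math

def softmax(logits, temperature=1.0):
    # Soften the distribution: a higher temperature exposes the
    # teacher's relative confidence in near-miss classes.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the teacher's and the student's
    # softened output distributions.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scale by T^2 so the gradient magnitude stays comparable
    # when the temperature is changed.
    return temperature ** 2 * kl

# A student that reproduces the teacher's logits exactly has zero loss.
teacher = [3.0, 1.0, 0.2]
print(distillation_loss(teacher, teacher))  # 0.0
```

In practice this term is usually mixed with a standard cross-entropy loss on ground-truth labels, which is one reason distillation combines well with fine-tuning on domain-specific tasks.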
Noise — Signal
Distillation is often reduced to "just use a smaller model and it gets cheaper". In practice, success depends on three conditions: high-quality training data or an accessible teacher model, a tightly defined task, and systematic evaluation. Distillation without a clearly bounded use case produces a model that looks stable on benchmarks but breaks on edge cases.
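One way to make "systematic evaluation" concrete is a release gate that compares student and teacher predictions on a held-out set and blocks deployment when agreement drops. A minimal sketch; the function names and the 0.95 threshold are illustrative assumptions, not a prescribed standard.

```python
def agreement_rate(student_preds, teacher_preds):
    # Fraction of held-out examples where the student's prediction
    # matches the teacher's.
    matches = sum(s == t for s, t in zip(student_preds, teacher_preds))
    return matches / len(student_preds)

def passes_quality_gate(student_preds, teacher_preds, threshold=0.95):
    # Block a release when agreement falls below the threshold,
    # so quality drift is caught before it reaches the end customer.
    return agreement_rate(student_preds, teacher_preds) >= threshold

# Example: 9 of 10 predictions agree -> 0.9 agreement, gate fails.
teacher = ["a", "b", "a", "c", "a", "b", "a", "a", "c", "b"]
student = ["a", "b", "a", "c", "a", "b", "a", "a", "c", "a"]
print(agreement_rate(student, teacher))        # 0.9
print(passes_quality_gate(student, teacher))   # False
```

A real harness would track this metric over time per sub-task rather than as a single snapshot, since drift is exactly what a one-off benchmark misses.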
The right question
Not: "Can we distil the model to save costs?" But: "Which sub-task is defined tightly enough for a smaller model to handle it reliably, and do we have the evaluation infrastructure to detect quality drift before it reaches the end customer?"