Today’s large language models are no longer trained exclusively on raw text from the internet. More and more, companies are leveraging powerful “teacher” models to help train smaller or more efficient “student” models. This process, widely referred to as LLM distillation or model-to-model training, has become a crucial method for developing high-performing models with reduced computational costs. Meta utilized its massive Llama 4 Behemoth model to help train Llama 4 Scout and Maverick, while Google employed Gemini models during the development of Gemma 2 and Gemma 3. In a similar fashion, DeepSeek distilled reasoning capabilities from DeepSeek-R1 into smaller Qwen and Llama-based models.
The fundamental concept is straightforward: rather than learning only from human-written text, a student model can also learn from the outputs, probabilities, reasoning traces, or behaviors of another LLM. This enables smaller models to acquire capabilities such as reasoning, instruction following, and structured generation from much larger systems. Distillation can occur during pre-training, where teacher and student models are trained together, or during post-training, where a fully trained teacher transfers knowledge to a separate student model.
In this article, we will examine three primary approaches used for training one LLM using another: Soft-label distillation, where the student learns from the teacher’s probability distributions; Hard-label distillation, where the student mimics the teacher’s generated outputs; and Co-distillation, where multiple models learn collaboratively by sharing predictions and behaviors during training.

Soft-Label Distillation
Soft-label distillation is a training method where a smaller student LLM learns by replicating the output probability distribution of a larger teacher LLM. Rather than training only on the correct next token, the student is trained to match the teacher’s softmax probabilities across the entire vocabulary. For instance, if the teacher predicts the next token with probabilities like “cat” = 70%, “dog” = 20%, and “animal” = 10%, the student learns not just the final answer, but also the relationships and uncertainty between different tokens. This richer signal is often called the teacher’s “dark knowledge” because it contains hidden information about reasoning patterns and semantic understanding.
The greatest advantage of soft-label distillation is that it allows smaller models to inherit capabilities from much larger models while remaining faster and cheaper to deploy. Since the student learns from the teacher’s full probability distribution, training becomes more stable and informative compared to learning from single hard token targets alone. However, this method also comes with practical challenges. To generate soft labels, you need access to the teacher model’s logits or weights, which is often not possible with closed-source models. Additionally, storing probability distributions for every token across vocabularies containing 100k+ tokens becomes extremely memory-intensive at LLM scale, making pure soft-label distillation expensive for trillion-token datasets.
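As a rough illustration (plain NumPy, not any particular framework’s API), the core soft-label objective can be sketched as the KL divergence between temperature-softened teacher and student distributions, with the usual T² scaling from Hinton et al.’s distillation formulation. The token probabilities below mirror the “cat”/“dog”/“animal” example above; the student logits are made-up values.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; a higher T flattens the distribution,
    # exposing more of the teacher's "dark knowledge".
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_label_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    return float(T * T * kl.mean())

# Teacher assigns "cat" = 70%, "dog" = 20%, "animal" = 10%.
teacher = np.log(np.array([[0.70, 0.20, 0.10]]))
student = np.array([[1.5, 0.3, -0.4]])  # the student's current logits
loss = soft_label_loss(student, teacher)
```

A student whose logits already match the teacher’s gets a loss of zero, so minimizing this objective pushes the student toward the full distribution rather than just the top token.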


Hard-Label Distillation
Hard-label distillation is a more straightforward method in which the student LLM learns solely from the teacher model’s final predicted token, rather than its complete probability distribution. Here, a pre-trained teacher model produces the most probable next token or response, and the student model is trained through standard supervised learning to replicate that output. Essentially, the teacher serves as a high-quality annotator, generating synthetic training data for the student. DeepSeek employed this technique to transfer reasoning abilities from DeepSeek-R1 into smaller Qwen and Llama 3.1 models.
In contrast to soft-label distillation, the student does not have access to the teacher’s internal confidence scores or token relationships—it only learns the final answer. This makes hard-label distillation significantly more computationally efficient and simpler to implement, as there is no need to store extensive probability distributions for every token. It is particularly valuable when working with proprietary “black-box” models like GPT-4 APIs, where developers can only access generated text and not the underlying logits. Although hard labels carry less information than soft labels, they remain highly effective for instruction tuning, reasoning datasets, synthetic data generation, and domain-specific fine-tuning tasks.
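As a minimal sketch (again plain NumPy, not a real training pipeline), hard-label distillation reduces each teacher prediction to its top-1 token and trains the student with ordinary cross-entropy against those picks, so no probability distributions ever need to be stored:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hard_label_targets(teacher_logits):
    # Keep only the teacher's most probable token at each position;
    # confidence scores and token relationships are discarded.
    return teacher_logits.argmax(axis=-1)

def cross_entropy(student_logits, targets):
    # Standard supervised loss: the teacher's picks act as the labels,
    # just like human-annotated training data would.
    p = softmax(student_logits)
    picked = p[np.arange(len(targets)), targets]
    return float(-np.log(picked + 1e-12).mean())

# Two token positions: the teacher prefers token 0, then token 2.
teacher_logits = np.array([[2.0, 0.5, -1.0],
                           [0.1, 0.4, 1.3]])
targets = hard_label_targets(teacher_logits)  # array([0, 2])
student_logits = np.array([[1.0, 0.0, 0.0],
                           [0.0, 0.0, 1.0]])
loss = cross_entropy(student_logits, targets)
```

With black-box APIs, the `teacher_logits` step is replaced by simply collecting the teacher’s generated text, but the student-side loss is the same standard supervised objective.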


Co-Distillation
Co-distillation is a training strategy where both the teacher and student models are trained simultaneously, rather than relying on a fixed pre-trained teacher. In this approach, the teacher LLM and student LLM process the same training data at the same time and each generate their own softmax probability distributions. The teacher is trained using the ground-truth hard labels, while the student learns by aligning with the teacher’s soft labels in addition to the actual correct answers. Meta utilized a version of this method when training Llama 4 Scout and Maverick alongside the larger Llama 4 Behemoth model.
One challenge with co-distillation is that the teacher model is not fully trained during the early stages, which means its predictions may initially be noisy or inaccurate. To address this, the student is typically trained using a combination of soft-label distillation loss and standard hard-label cross-entropy loss. This provides a more stable learning signal while still enabling knowledge transfer between models. Unlike traditional one-way distillation, co-distillation allows both models to improve together during training, often resulting in better overall performance, stronger reasoning transfer, and smaller performance gaps between the teacher and student models.
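Under the same toy NumPy setup (an illustrative sketch, not Meta’s actual training code), one co-distillation step computes two losses: the teacher is trained on the ground-truth labels, while the student blends ground-truth cross-entropy with a KL term toward the teacher’s current soft labels, weighted by a mixing coefficient `alpha` (a made-up hyperparameter name here):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def co_distillation_losses(teacher_logits, student_logits, labels, alpha=0.5):
    n = np.arange(len(labels))
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    # Teacher: plain cross-entropy on the ground-truth hard labels.
    teacher_loss = float(-np.log(p_t[n, labels] + 1e-12).mean())
    # Student: blend of ground-truth cross-entropy (the stable signal) and
    # KL toward the teacher's current, still-evolving soft labels.
    ce = -np.log(p_s[n, labels] + 1e-12).mean()
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1).mean()
    student_loss = float(alpha * ce + (1.0 - alpha) * kl)
    return teacher_loss, student_loss

labels = np.array([0, 2])
teacher_logits = np.array([[1.2, 0.1, -0.3], [0.2, -0.1, 0.9]])
student_logits = np.array([[0.6, 0.2, 0.1], [0.1, 0.0, 0.5]])
t_loss, s_loss = co_distillation_losses(teacher_logits, student_logits, labels)
```

Early in training, when the teacher is still noisy, `alpha` can be kept high so the ground-truth term dominates, then lowered as the teacher’s soft labels become trustworthy.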


Comparing the Three Distillation Techniques
Soft-label distillation passes along the most detailed knowledge because the student picks up on the teacher’s entire probability distribution rather than just the final output. This enables compact models to grasp reasoning patterns, uncertainty, and token relationships, often resulting in better overall performance. However, it demands significant computation, needs access to the teacher’s logits or weights, and is tough to scale since storing probability distributions for large vocabularies requires a lot of memory.
Hard-label distillation is more straightforward and easier to use. The student learns only from the teacher’s final outputs, making it far cheaper and simpler to set up. It works particularly well with proprietary black-box models like GPT-4 APIs where internal probabilities are not accessible. While this method misses some of the deeper “dark knowledge” found in soft labels, it remains highly effective for instruction tuning, synthetic data generation, and task-specific fine-tuning.
Co-distillation uses a collaborative method where teacher and student models train together. The teacher improves while guiding the student at the same time, letting both models gain from shared learning signals. This can narrow the performance gap seen in traditional one-way distillation, but it also makes training more complex since the teacher’s predictions are initially unstable. In practice, soft-label distillation is chosen for maximum knowledge transfer, hard-label distillation for scalability and practicality, and co-distillation for large-scale joint training setups.



I graduated with a degree in Civil Engineering from Jamia Millia Islamia, New Delhi, in 2022. My passion lies in Data Science, with a particular focus on Neural Networks and how they can be applied across different fields.



