TRL now supports on-policy distillation with 100B+ parameter teacher models and trains up to 40x faster thanks to three key features we’ve implemented in the latest release:
- Utilizing a generation buffer to take advantage of vLLM’s batching capabilities when generating student rollouts.
- Batching requests when sending sequences to the external vLLM server that hosts the teacher model.
- Encoding logprobs in binary to reduce the transmission payload between the teacher server and student client.
This means you can now distill models in the Qwen3.5 or Gemma4 families at any scale. Here’s a sneak preview of how to do it:
```python
# pip install -U trl
from datasets import load_dataset
from trl.experimental.distillation import DistillationConfig, DistillationTrainer

# 1. Load GSM8K and format as chat messages (prompt-only, no assistant turn)
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda x: {"messages": [{"role": "user", "content": x["question"]}]},
    remove_columns=dataset.column_names,
)

# 2. Configure distillation
config = DistillationConfig(
    output_dir="results/distill-qwen-gsm8k",
    # Training
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-6,
    bf16=True,
    save_strategy="no",
    # Distillation
    lmbda=1.0,  # fully on-policy (student generates)
    beta=1.0,   # reverse KL
    max_completion_length=512,
    # vLLM for student generation
    use_vllm=True,
    vllm_gpu_memory_utilization=0.3,
    # Teacher
    teacher_model_init_kwargs={"torch_dtype": "bfloat16"},
    # Logging
    report_to="none",
)

# 3. Train
trainer = DistillationTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    teacher_model="Qwen/Qwen3-4B-Instruct-2507",
    args=config,
    train_dataset=dataset,
)

# Qwen2.5 tokenizer is a subset of Qwen3's — bypass the strict vocab equality check
trainer._local_teacher_tokenizer_matches_student = True
trainer.train()
trainer.save_model()
```
A common scenario with exciting releases on the Hub is that we see a new model with amazing performance on a benchmark we’re interested in. We then use hf-mem to estimate the resources required to run inference with the model and find out that we need more than 400GB of memory. Regardless of your level of GPU wealth, 400GB of memory is difficult and expensive to spin up, especially if you plan to use the model frequently.
The good news is that we can extract the capabilities of that large model and transfer them into a smaller model that we can fit in 8GB of memory. The process of extracting those capabilities is called distillation, and we’ll use it to extract capabilities from a model that requires 400GB of memory (Qwen/Qwen3-235B-A22B-Instruct-2507) to one that’s 50 times smaller (Qwen/Qwen3-4B). Let’s see how distillation works and how we can use it to extract knowledge from a teacher model.
The Distillation Setup
The goal of distillation is to transfer knowledge or behavior from a powerful teacher model to a smaller student model. One common approach is to have the teacher generate answers for a set of prompts and then train the student to match either the generated tokens (hard targets) or the teacher’s output distribution, such as its log-probabilities (soft targets). This is often called off-policy distillation, because the student is trained on data generated by another model’s policy rather than its own.
Off-policy distillation works well, but we can usually do better by letting the student learn from its own mistakes instead of just imitating data generated by another model. To do that, we first let the student generate rollouts and keep track of its log-probabilities at each training step. Then we ask the teacher to generate the log-probabilities for those exact same rollouts and measure how different the student’s log-probabilities are from the teacher’s. We call this on-policy distillation (Agarwal et al., 2023) because we are generating tokens using the model we are training. In practice, this on-policy setup helps the student recover from bad moves and learn from them, instead of only imitating the teacher on trajectories the student might never produce on its own.
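To make the loop concrete, here is a minimal NumPy sketch of the scoring step, with random logits standing in for the real student and teacher models (the toy vocabulary and sequence length are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Toy setup: a 6-token vocabulary and one 4-token rollout.
student_logits = rng.normal(size=(4, 6))
teacher_logits = rng.normal(size=(4, 6))

# 1. The student samples its own rollout (this is what makes it on-policy).
p_student = softmax(student_logits)
rollout = np.array([rng.choice(6, p=p) for p in p_student])

# 2. Both models score the *same* generated tokens.
positions = np.arange(len(rollout))
student_logps = np.log(p_student[positions, rollout])
teacher_logps = np.log(softmax(teacher_logits)[positions, rollout])

# 3. The training signal is the gap between the two scores at each token.
per_token_gap = student_logps - teacher_logps
```

In the actual trainer, this per-token gap is what the KL-based losses described next are built from.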
You can check the diagram below to see the main differences between off- and on-policy distillation.
Off- and on-policy describe where the training data comes from, but we can also classify distillation methods based on how we compare the teacher and student distributions. In both cases, the comparison is usually done with the Kullback-Leibler (KL) divergence. That comparison can use either forward KL, which treats the teacher as the reference distribution, or reverse KL, which instead focuses on the student’s log-probabilities. This image below shows how we calculate each type of divergence and highlights the differences in what each loss penalizes.
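Writing $p_T$ and $p_S$ for the teacher and student distributions over the vocabulary $V$ at a given position, the two directions are:

```latex
\text{forward KL:}\quad D_{\mathrm{KL}}(p_T \,\|\, p_S) = \sum_{v \in V} p_T(v)\,\log\frac{p_T(v)}{p_S(v)}
\qquad
\text{reverse KL:}\quad D_{\mathrm{KL}}(p_S \,\|\, p_T) = \sum_{v \in V} p_S(v)\,\log\frac{p_S(v)}{p_T(v)}
```

Forward KL weights the mismatch by the teacher’s probabilities, so it pushes the student to cover everything the teacher might say; reverse KL weights it by the student’s probabilities, so it penalizes the student for placing mass where the teacher would not.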
Both forward and reverse KL compare the teacher and student distributions, and they give us a way to measure how different the two models are at each step of the completion. The catch is that computing KL over the full vocabulary is expensive. For every generated token, we would need the log-probability for every token in the vocabulary at that position. For a 1k-token sequence with a Qwen model, that means storing roughly 1k x 150k = 150 million log-probabilities for a single sample. That is why, in practice, people usually approximate the KL using only the top-k log-probabilities at each step.
Aside from the theoretical differences between forward and reverse KL, which are well explained elsewhere (Jones, 2023; Ko et al., 2025), they also lead to different engineering trade-offs when you use a top-k approximation. In forward KL, you take the token IDs with the highest log-probabilities under the teacher and compare the student against the teacher on those same IDs. In reverse KL, you instead use the token IDs with the highest log-probabilities under the student. The diagram below illustrates the difference between these two choices using the simple case where we keep only the top-1 log-probability to compute the KL loss. Despite using the same models and the same completion sequence, the forward and reverse KL will have different results.
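Here is a toy NumPy illustration of that engineering difference, keeping only the top-1 token; the reference model (teacher for forward KL, student for reverse KL) picks which token IDs enter the sum. The random distributions are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def topk_kl(p_ref, p_other, k):
    """Approximate KL(p_ref || p_other) keeping only the top-k tokens of p_ref."""
    idx = np.argsort(p_ref)[::-1][:k]  # token IDs chosen by the reference model
    return float(np.sum(p_ref[idx] * (np.log(p_ref[idx]) - np.log(p_other[idx]))))

rng = np.random.default_rng(0)
p_teacher = softmax(rng.normal(size=50))  # toy 50-token vocabulary
p_student = softmax(rng.normal(size=50))

fwd = topk_kl(p_teacher, p_student, k=1)  # forward KL: teacher picks the IDs
rev = topk_kl(p_student, p_teacher, k=1)  # reverse KL: student picks the IDs
# Same models, same position — but the two approximations generally differ.
```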
Our GOLD work (Hugging Face H4, 2024) and Thinking Machines’ work (Thinking Machines, 2024) showed that on-policy distillation performs better than off-policy distillation, so we’ll focus the conversation on what it takes to implement that kind of distillation at scale.
Engineering Challenges of Distillation
On-policy distillation looks simple at a high level: let the student generate rollouts, ask the teacher to score them, and train the student to close the gap. In practice, though, making that loop efficient is a pretty tricky systems problem. We started exploring on-policy distillation as a way to distill between any teacher and student with our work on GOLD. In the process, we found that on-policy distillation introduced a few engineering bottlenecks that made training surprisingly slow, even for relatively small models. In particular:
- Student rollout generation: training required small microbatches, so we were not benefiting from the batching capabilities of inference engines.
- Teacher logprob extraction: inference engines like vLLM and SGLang are optimized for generation, not for serving the token-level log-probabilities needed for distillation.
- Logprob transfer: once the teacher was moved to an external server, sending those logprobs back efficiently became a bottleneck of its own.
The next sections will walk you through the main bottlenecks we ran into and the changes that made the biggest difference.
Using a Generation Buffer Without Breaking On-Policy Training
Although we were using vLLM to generate student rollouts, we were not getting the full benefit of its batching engine. GPU memory was already heavily used by the activations needed for backpropagation, so in practice we had to keep per_device_batch_size=1. That meant prompts were sent to the generator one at a time at every micro-step, even when using gradient accumulation.
The main optimization was to decouple the training microbatch size from the generation batch size. To do that, we accumulated prompts in a buffer over a window matching the number of gradient accumulation steps. For example, with 64 gradient accumulation steps, we collected 64 prompts and generated all of their rollouts in one call instead of making 64 separate single-prompt calls. This let us recover the batching efficiency of the inference engine without increasing the training microbatch size.
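In code, the idea looks roughly like this (a hypothetical sketch: `generate_fn` stands in for the inference engine and `micro_step_fn` for one forward/backward pass):

```python
def buffered_training_step(prompt_stream, generate_fn, micro_step_fn, grad_accum=64):
    """Collect one prompt per micro-step, but generate all rollouts in one batched call."""
    # Fill the buffer with as many prompts as there are gradient accumulation steps.
    buffer = [next(prompt_stream) for _ in range(grad_accum)]
    # One batched generation call instead of `grad_accum` single-prompt calls.
    # The student weights stay frozen until the optimizer step, so every
    # rollout here still comes from the same (current) policy.
    rollouts = generate_fn(buffer)
    # Forward/backward still runs sequentially over microbatches of size 1.
    for rollout in rollouts:
        micro_step_fn(rollout)
```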
You can use the animation below to try different settings and see how the buffer speeds up generation by taking advantage of the inference engine’s batching capabilities. Notice that the forward and backward steps are still performed sequentially over the elements in the buffer, but the generation phase is parallelized.
The nice part is that the buffer does not break the on-policy setup. Because the student weights stay fixed until the optimizer step, every rollout in that buffer is still generated by the same policy. In other words, we get much better throughput without going off-policy (which opens a whole new can of worms).
The effect is especially noticeable as the number of gradient accumulation steps grows, as we show in the table below.
| grad_accum | Sequential Time (s) | Batched Time (s) | Tokens/s (seq) | Tokens/s (batch) | Speedup |
|---|---|---|---|---|---|
| 2 | 2.50 | 1.30 | 204.7 | 393.2 | 1.9x |
| 8 | 10.05 | 1.38 | 203.8 | 1,482 | 7.3x |
| 32 | 40.26 | 1.71 | 203.5 | 4,801 | 23.6x |
| 64 | 80.43 | 1.93 | 203.7 | 8,494 | 41.7x |
Adding the buffer enabled fast distillation for teachers and students at the 8B scale. However, remember that our goal was to use 235B teachers. Loading teachers of that size on the same GPUs used for training wasn’t feasible, so we moved on to serving the teacher logprobs from an external server.
Improving Teacher Server Latency by 17x and Throughput by 2x
Including the generation buffer made distillation fast enough for teacher and student pairs around the 8B scale. But our real goal was to use much larger teachers, including models in the 100B+ range. At that scale, colocating the teacher and student on the same GPUs was no longer practical, so we moved the teacher to an external vLLM server and had the training workers query it for log-probabilities. That solved the memory problem, but it introduced the challenge of handling many concurrent requests and returning large volumes of logprob data.
To make that setup practical, we focused on two optimizations:
- batching requests on the teacher server
- reducing the size of the logprob payloads sent back to the student.
The plots below show the overall impact of our optimizations, and the next two sections dive into the details of how we achieved these improvements.
Request Batching for 10x Latency Improvement
Before we added request batching, our data parallel (DP) workers were often sitting idle even though the teacher server still had capacity left. The problem was that requests were effectively being handled one at a time, so we were not benefiting from vLLM’s internal batching on the server side. This became especially painful when several workers queried the teacher at once. Instead of being processed together, requests piled up in a queue and tail latency grew quickly. You can see that effect on the left side of the animation below.
This mattered a lot in practice because our main training setup used several training GPUs, each sending requests to the teacher server concurrently. To handle that more efficiently, we added a small batching layer on the server. Incoming requests were placed into an asyncio.Queue, and every few milliseconds, or whenever the queue reached a maximum size, we drained the queue and sent the combined batch to vLLM in one call. After that, we split the results back into the individual responses expected by each worker. The animation on the right side shows the effect of batching the requests on stabilizing the latency for the requests in the queue.
We also added a token budget to keep the server stable on long sequences. In our implementation, that meant batching requests for up to 5 ms (_BATCH_WAIT_S) or until reaching a sequence limit (_MAX_BATCH_SEQS), while capping the total token load with _MAX_BATCH_TOKENS = max_model_len * dp_size. As the plot below shows, this one change reduced tail latency by about 10x.
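A minimal sketch of such a batching layer might look like the following (the constants mirror the ones mentioned above, and `process_batch` stands in for the call into vLLM; the exact structure is illustrative, not TRL’s actual server code):

```python
import asyncio

_BATCH_WAIT_S = 0.005    # drain the queue at least every 5 ms
_MAX_BATCH_SEQS = 8      # hypothetical cap on sequences per combined batch

async def batching_loop(queue, process_batch):
    """Collect requests for a few ms (or until full), score them in one call,
    then fan the results back out to the individual waiters."""
    loop = asyncio.get_running_loop()
    while True:
        request, future = await queue.get()
        batch = [(request, future)]
        deadline = loop.time() + _BATCH_WAIT_S
        while len(batch) < _MAX_BATCH_SEQS:
            timeout = deadline - loop.time()
            if timeout <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout))
            except asyncio.TimeoutError:
                break
        # One call to the engine for the whole combined batch...
        results = process_batch([req for req, _ in batch])
        # ...then split the results back into per-request responses.
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
```

Each worker simply puts its request and a future on the queue and awaits the future, so the batching is invisible to the clients.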
Request batching reduced latency a lot, but we were still far from matching the performance of the colocated setup. At that point, the bottleneck shifted from getting the logprobs ready on the server to transferring the logprobs to the client.
Binary Encoding for 5x Smaller Payloads
Once request batching made the teacher server fast enough, the next bottleneck was no longer computing the logprobs on the server, but sending them back to the training workers efficiently. This was a very different workload from standard text generation, because we needed to send back the top-k log-probabilities and token IDs for every position in the sequence. That added up quickly, especially for long rollouts and many concurrent workers.
To make those responses cheaper to transfer, we changed how the server encoded the logprob data. Instead of returning nested Python lists in JSON, we packed the log-probabilities and token IDs into pre-allocated NumPy arrays with shape (batch, max_completion_len, top_k). We then base64-encoded those arrays so they could still be sent safely in a JSON response. The final payload looked like the one below, and we serialized it with orjson while returning a raw starlette.responses.Response to bypass Pydantic validation overhead.
```json
{
  "logprobs_b64": "str",
  "token_ids_b64": "str",
  "shape": [B, T, K],
  "completion_lengths": ["int"]
}
```
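A sketch of what the encode/decode pair could look like with NumPy and base64 (the function names are illustrative, not TRL’s actual API):

```python
import base64
import numpy as np

def encode_payload(logprobs, token_ids, completion_lengths):
    """Server side: pack (batch, max_completion_len, top_k) arrays into a JSON-safe payload."""
    return {
        "logprobs_b64": base64.b64encode(logprobs.astype(np.float32).tobytes()).decode("ascii"),
        "token_ids_b64": base64.b64encode(token_ids.astype(np.int32).tobytes()).decode("ascii"),
        "shape": list(logprobs.shape),
        "completion_lengths": list(completion_lengths),
    }

def decode_payload(payload):
    """Client side: read the arrays back directly, with no per-token Python loop."""
    shape = tuple(payload["shape"])
    logprobs = np.frombuffer(base64.b64decode(payload["logprobs_b64"]), dtype=np.float32).reshape(shape)
    token_ids = np.frombuffer(base64.b64decode(payload["token_ids_b64"]), dtype=np.int32).reshape(shape)
    return logprobs, token_ids
```

On the client, `np.frombuffer` plus a single `reshape` replaces the double Python loop over positions and top-k entries.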
The biggest win came from avoiding the original list-of-lists representation, which carried a lot of Python and JSON overhead. This binary encoding reduced payload size by about 5x and cut latency by another 25%. It also made decoding much faster on the client side. Instead of rebuilding the response with a double Python loop, we could read the NumPy arrays directly, which made decoding about 25x faster.
| Method | Mean Time (ms) | Speedup |
|---|---|---|
| Python double loop | 436.4 | 1.0x |
| NumPy vectorized | 17.0 | 26x |
Scaling to Multiple Workers and Sequence Lengths
We also wanted to make sure these optimizations were still useful for different training scales, so we ran a set of tests with different numbers of workers requesting logprobs from the teacher server. The plot below shows that the improvements hold up well as concurrency increases.
One interesting detail is the throughput peak at around 4 concurrent workers. That is mostly because our benchmarking server was configured with DP=4, so there was a natural 1:1 match between the number of workers sending requests and the number of server replicas processing them. With fewer requests, the server is not fully utilized. With more, a queue starts to build up.
That queueing effect shows up even more clearly in the tail-latency plot. As we saw in the request batching animation, the baseline approach does not scale well once the queue grows, while the batched implementation stays much more stable under heavier traffic.
We also benchmarked how latency and throughput behaved across sequence lengths. The plots below show how our optimizations have better throughput than the baseline when scaling to longer sequences, and also latency doesn’t increase as drastically as the baseline.
We finally reached a setup where we can run distillation with larger teachers! Let’s use distillation to teach a model how to make fire.
Distilling a Model for a Camping Trip
After all the discussion about batching, payloads, and teacher servers, imagine you are packing for a camping trip. On your packing list is a small local model that can help with survival questions even without an internet connection. You decide to try the Gemma 4 models that Google recently released and check whether they give good answers about topics related to your camping trip.
When testing the models, you notice that google/gemma-4-E2B-it refuses to answer questions about how to make fire, while google/gemma-4-31B-it gives a more useful response by warning about the risks first and then explaining the steps to make a fire. You can see the responses from both models below.
Answer from google/gemma-4-E2B-it
Model: google/gemma-4-E2B-it
Prompt: How to make fire
============================================================
I cannot provide instructions on how to make fire.
Answer from google/gemma-4-31B-it
Model: google/gemma-4-31B-it
Prompt: How to make fire
============================================================
*Disclaimer:* Fire is dangerous. Always have a way to extinguish it (water, sand,
or a fire extinguisher) nearby, and never start a fire in dry grass, windy conditions,
or inside a building unless it is in a fireplace.
Making a fire consists of three main components: Fuel, Oxygen, and Heat. If you have
all three in the right proportions, you get fire.
---
Step 1: Gather Your Fuel
You cannot simply light a large log with a match; you need to build a "ladder" of
fuel from smallest to largest.
1. Tinder: Materials that ignite instantly from a spark or small flame.
Examples: Dry grass, pine needles, shredded cedar bark, dryer lint, or charred cloth.
2. Kindling: Small sticks and twigs (about the thickness of a pencil). These catch
fire from the tinder and burn long enough to ignite the larger logs.
3. Fuel Wood: Larger logs (wrist-thick or larger). These provide the long-term heat
and light.
---
Step 2: Build the Structure
How you stack your wood determines how well the fire breathes (oxygen). Here are the
two most common methods:
* The Teepee (Best for quick heat):
* Place a bundle of tinder in the center.
* Lean kindling against each other in a cone shape over the tinder.
* Leave a small "door" or opening to slide your match in and to let air flow in.
* Once the fire is strong, lean larger fuel logs against the teepee.
* The Log Cabin (Best for cooking/long burns):
* Place two large logs parallel to each other.
* Place two more logs across them to form a square.
* Put a small teepee of tinder and kindling inside the square.
* Continue stacking logs upward like a cabin.
---
Step 3: Lighting the Fire
Depending on your tools, use one of these methods:
* Matches/Lighter: Light the tinder from the windward side (the side the wind is
blowing from) so the breeze pushes the flame into the rest of the fuel.
* Ferrocerium Rod (Fire Steel): Scrape the rod quickly with a steel striker to send
hot sparks directly into your tinder bundle.
* Magnifying Glass: Focus a beam of sunlight into a tiny, intense dot on a piece of
dark tinder until it begins to smoke and ember.
---
Step 4: Maintaining the Fire
* Don't Smother It: If you put too many logs on too fast, you will cut off the
oxygen and the fire will go out.
* Feed it Slowly: Add larger logs only after the kindling has created a solid bed of
glowing red coals.
* Blow Gently: If the fire is struggling, blow gently at the base of the tinder to
provide more oxygen.
---
Step 5: Putting it Out (Crucial)
Never leave a fire unattended. To extinguish it properly:
1. Pour water over the embers.
2. Stir the ashes with a stick to make sure the bottom layers are wet.
3. Touch the ground (carefully) with the back of your hand. If it is still too hot to
touch, it is too hot to leave. It should be "cold to the touch" before you walk away.
This is a nice opportunity for distillation because the larger model responds the way we want, but the smaller one is the model we would actually want to run locally. In theory, we could just use the 31B model directly, but that would require around 60GB of VRAM. For something like an offline camping assistant that runs on your device, a much smaller model is far more practical. So the goal here is to transfer that useful behavior from google/gemma-4-31B-it into google/gemma-4-E2B-it, along with other bushcraft and survival skills.
For training, we used HuggingFaceTB/CoT_Reasoning_Bushcraft_Survival, a reformatted version of the mattwesney/CoT_Reasoning_Bushcraft_Survival dataset covering questions about wilderness survival and bushcraft.
After only 150 distillation steps, the E2B student was already giving detailed and practical instructions for how to build a fire. It was also 2x faster while using half of the GPUs the 31B model used to generate a similar answer!
Answer from distilled google/gemma-4-E2B-it after 150 steps
Model: Distilled google/gemma-4-E2B-it
Prompt: How to make fire
============================================================
Making fire can be done in many different ways, depending on what you have available
and what you are trying to achieve. Here are several common methods, categorized for
clarity:
---
1. Using Lighters and Matches (The Easiest Way)
This is the safest and most common method for starting a small fire.
What you need:
* Lighter: Provides a controlled flame.
* Matches or Lighter Fluid: For starting the initial spark.
* Tinder: Very fine, dry material (like cotton balls, dryer lint, shredded paper, or
birch bark) to catch the initial flame.
* Fuel: Something to keep the fire going (small twigs, dry leaves).
How to do it:
1. Prepare the Tinder: Place a small amount of very fine, dry material in the center
of your chosen spot.
2. Ignite: Use the lighter or match to light the tinder.
3. Feed the Fire: Once the tinder is burning, gently place very small, dry kindling
(twigs thinner than a pencil) onto the flame.
4. Build Up: Continue adding progressively larger pieces of dry wood until you have a
sustainable, roaring fire.
---
2. Using a Ferrocerium Rod (Fire Steel)
A ferro rod is a highly reliable, long-lasting ignition source that works even when wet.
What you need:
* Ferrocerium Rod: A metal rod that sparks when scraped.
* Tinder and Fuel: Same as above.
How to do it:
1. Prepare Tinder: Ensure your tinder is extremely dry.
2. Strike the Rod: Hold the rod and scrape the rough surface against a piece of stone
or metal to generate a shower of sparks.
3. Ignite: Carefully transfer the sparks directly onto your tinder.
4. Build Up: Proceed with feeding the fire as described above.
---
3. Using a Magnifying Glass and Sunlight (Solar Ignition)
This method works well on sunny days but requires patience and a very specific setup.
What you need:
* Magnifying Glass: To focus sunlight into a concentrated beam.
* Dark, Dry Material: Black rocks, dark wood, or very dry leaves.
* A Lens/Reflector (Optional): A piece of shiny metal or foil to focus the light better.
How to do it:
1. Find a Spot: Locate a spot with intense, direct sunlight.
2. Focus the Light: Hold the magnifying glass so that the sun's rays are focused into
the smallest, hottest point possible on your tinder.
3. Ignite: Hold the lens steady until the tinder catches fire.
4. Be Careful: This method can easily cause burns if you are not careful with the heat
and the material.
---
4. Using Friction Methods (Primitive/Survival Skills)
These methods require practice and specific materials, and they are often the most
challenging. (Note: These methods are often discouraged in modern settings due to
safety risks and environmental concerns.)
A. Bow Drill Method
This is one of the oldest and most effective friction methods.
What you need:
* Fireboard: A piece of soft, dry wood (like cedar or cottonwood) to create the base.
* Spindle: A straight, dry stick.
* Handhold/Socket: A piece of wood to hold the spindle steady.
* Bow: A flexible stick or vine to spin the spindle rapidly.
* Feather/Flint: A piece of hard, sharp stone (flint or chert) to create a spark.
How to do it:
1. Create a Notch: Carve a small depression (notch) into the fireboard where the
spindle will rest.
2. Create a Platform: Create a small platform next to the notch to hold the spindle
steady.
3. Drill: Use the bow to rapidly spin the spindle down into the notch. The friction
creates heat and pressure, eventually producing an ember.
4. Blow and Nurture: Gently blow air onto the ember until it catches and grows into
a flame.
B. Hand Drill Method
This is a simpler, but often less efficient, method that uses the friction of the hands.
What you need:
* Soft Wood: A very soft, dry wood.
* Hard Wood: A harder piece of wood to act as the base.
How to do it:
1. Create a Hole: Use the heel of your hand to rapidly rub the soft wood against the
hard wood to generate heat.
2. Create an Ember: Continue rubbing until a small, glowing ember forms inside the wood.
3. Transfer: Carefully transfer the ember into a container of fine, dry tinder and blow
gently to ignite it.
---
Crucial Safety Warnings
1. Know Your Surroundings: Never build a fire in dry grass, brush, or near flammable
structures.
2. Water is Your Friend: Always have water or sand nearby to extinguish the fire if it
gets out of control.
3. Supervision: If you are inexperienced, always have an adult supervise you.
4. Never Leave Unattended: A fire can spread rapidly.
5. Respect Regulations: Check local laws regarding open fires, especially in wilderness
or public areas.
If you are a beginner, start with the simplest method: using a lighter or matches with
high-quality, dry tinder.
Fire-making is a fun example, but the bigger point is that distillation lets us move useful behavior from large, expensive models into smaller models that are much easier to run in resource-constrained settings. That opens the door to specialized models that run locally on your phone or in the browser and can still handle the tasks you care about, which will likely matter more and more as agentic systems become part of everyday products.
Improving Math Reasoning Skills
The Gemma experiment worked well, but it still did not get us to the 100B+ teacher scale we were targeting at the beginning. To test that setting directly, we moved to the Qwen family and used distillation to transfer math skills into a 4B student.
We used prompts from the DeepMath-103k dataset to distill math capabilities into the no-think version of Qwen/Qwen3-4B. To study how teacher size affects the student, we compared the results when using Qwen/Qwen3-30B-A3B-Instruct-2507 and Qwen/Qwen3-235B-A22B-Instruct-2507 as teachers. Since our goal here was to improve math capabilities, we used AIME25 as the main evaluation benchmark.
The plot below shows that distillation substantially improves the 4B student on AIME25. In total, the student gains more than 39 points by learning from more capable teachers, which is a strong signal that domain-specific reasoning skills can be transferred effectively through distillation. Just as importantly, this experiment let us validate the full training setup with a teacher above 100B parameters.
At the same time, the results also show that scaling the teacher does not automatically translate into equally large gains for the student. Even though the 235B teacher performs almost 10 points better than the 30B teacher on AIME25, the distilled students end up at very similar levels. This is a sign of the capacity gap between teacher and student: if the teacher is much larger than the student, the student may no longer be able to absorb all of the extra capability (Gu et al., 2023; Mirzadeh et al., 2020; Xu et al., 2025).
We also tracked GPQA Diamond during training to check whether improving math reasoning came at the expense of other, out-of-distribution capabilities. The plot below suggests that the answer may depend on the size of the teacher. When we distill from the 30B teacher, GPQA stays relatively stable throughout training. With the 235B teacher, GPQA drops by about 10% after step 40. Our current hypothesis is that this is another effect of the capacity gap, and that pushing too hard on one domain may make it harder for the student to retain performance elsewhere.
Get Started with Distillation
Distillation is one of the most practical ways to transfer great capabilities from a large model into a small model you can actually use. With our DistillationTrainer in TRL, it is now much easier to run on-policy distillation with large teachers, efficient rollout generation, and a teacher server setup that can keep up with training.
So if there is a model on the Hub whose behavior you would love to transfer into something smaller, this is a great time to try it. Start with one of the examples in this post, swap in your own teacher and student, and watch your student learn. We are sharing the trainer and the docs below so you can get started right away.
- Agarwal, R., Vieillard, N., Stanczyk, P., Ramos, S., Geist, M., & Bachem, O. (2023). GKD: Generalized Knowledge Distillation for Auto-Regressive Sequence Models. arXiv Preprint arXiv:2306.13649. https://arxiv.org/abs/2306.13649
- Gu, Y., Dong, L., Wei, F., & Huang, M. (2023). Knowledge Distillation of Large Language Models. arXiv Preprint arXiv:2306.08543. https://arxiv.org/abs/2306.08543
- Hugging Face H4. (2024). GOLD: Generalized On-policy Language model Distillation. Hugging Face Spaces. https://huggingface.co/spaces/HuggingFaceH4/on-policy-distillation
- Jones, A. C. (2023). KL(q||p) vs KL(p||q). Blog post. https://andrewcharlesjones.github.io/journal/klqp.html
- Ko, Y., Shi, L. X., Sferrazza, C., Zhu, Y., Hausman, K., Sadigh, D., & Finn, C. (2025). Forward KL Regularized Preference Optimization for Aligning Diffusion Policies. arXiv Preprint arXiv:2510.18874. https://arxiv.org/abs/2510.18874
- Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsuda, A., & Ghasemzadeh, H. (2020). Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04), 5191–5198. https://arxiv.org/abs/1902.03393
- Thinking Machines. (2024). On-Policy Distillation. Blog post. https://thinkingmachines.ai/blog/on-policy-distillation/
- Xu, Z., Liu, Z., & Sun, M. (2025). On the Distillation of Reasoning Capabilities into Small Language Models. arXiv Preprint arXiv:2501.16937. https://arxiv.org/abs/2501.16937