
Nemotron 3 Ultra And DeepSeek V4 Share Specialist Teacher Distillation Approach


Original post

1/ I read the Nemotron 3 Ultra report and it's interesting to compare their post-training to DeepSeek V4's. Both now do the same thing: train 10+ specialist teachers, then merge them into one student with on-policy distillation. What actually separates them is support overlap 🧵

1:09 PM · Jun 15, 2026 · 6K Views


Posts from X


Ravid Shwartz Ziv@ziv_ravid

2/ On-policy distillation grades the student's own rollouts. The teacher scores trajectories the student generated. That only works where the student's rollouts fall inside the teacher's support.
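To make "inside the teacher's support" concrete, here is a minimal sketch, not taken from either report. It assumes the teacher's next-token distribution at each position is available as a plain dict; the helper names and the `min_prob` threshold are hypothetical choices for illustration only.

```python
import math

def teacher_logprobs_on_rollout(rollout, teacher_dists, floor=1e-12):
    """Teacher log-probability of each token the student sampled.

    rollout: list of tokens generated by the student.
    teacher_dists: one dict per position mapping token -> teacher probability
    (hypothetical representation, assumed for this sketch).
    """
    return [
        math.log(max(dist.get(tok, 0.0), floor))
        for tok, dist in zip(rollout, teacher_dists)
    ]

def on_support_fraction(rollout, teacher_dists, min_prob=1e-3):
    """Share of student tokens the teacher assigns at least min_prob to."""
    if not rollout:
        return 0.0
    hits = sum(
        1 for tok, dist in zip(rollout, teacher_dists)
        if dist.get(tok, 0.0) >= min_prob
    )
    return hits / len(rollout)

# Toy example: the teacher never predicts the student's third token.
rollout = ["a", "b", "c"]
teacher_dists = [
    {"a": 0.9, "b": 0.1},
    {"b": 0.5, "c": 0.5},
    {"a": 1.0},
]
print(on_support_fraction(rollout, teacher_dists))
print(teacher_logprobs_on_rollout(rollout, teacher_dists))
```

A low fraction means the teacher is being asked to grade text it would essentially never produce, so its per-token scores there carry little usable signal.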


Ravid Shwartz Ziv@ziv_ravid

The interesting question (imo) is whether we can engineer enough overlap to skip student RL, or whether it's worth keeping. My guess is we can do it without


Ravid Shwartz Ziv@ziv_ravid

7/ In other words: V4 says sampled-token is too weak to carry the merge, while Ultra says full-vocab is worse than sampled-token. With high overlap the dense loss is safe and useful; with low overlap it's harmful.


Ravid Shwartz Ziv@ziv_ravid

3/ DeepSeek V4 gets that overlap for free. The teachers are forks of one base, each one base + domain SFT + RL, and the student is distilled from those same forks. Everyone is a small perturbation of the same backbone, so student rollouts are already in-distribution for the teachers.


Ravid Shwartz Ziv@ziv_ravid

8/ You can see it in their RL too. V4 drops student RL entirely because shared lineage already keeps rollouts on-support. Ultra keeps unified RLVR partly so the student samples stronger rollouts that land closer to teacher support.


Ravid Shwartz Ziv@ziv_ravid

4/ Nvidia's Ultra doesn't get that overlap for free. Its teachers get their skills from SFT on data generated by external models (V4-Pro, gpt-oss, GLM) that the student never saw. That pushes each teacher's distribution away from the student's, so student rollouts can be out-of-distribution for the teacher.


Ravid Shwartz Ziv@ziv_ravid

5/ This is why Ultra needs a "warmup" SFT to pull the student onto teacher support before distilling, while V4 doesn't.


Ravid Shwartz Ziv@ziv_ravid

6/ It also decides the loss. V4 matches the full teacher distribution at every token. According to their report, the Ultra team tried that too, in both top-k and full-vocab variants, and found it worse than scoring only the sampled token, especially on agentic tasks (it amplifies the noise).
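A toy sketch of the two loss shapes at a single token position. It assumes a reverse-KL objective, KL(student || teacher), which neither report is quoted as specifying here; the function names and the three-token vocabulary are hypothetical.

```python
import math

def full_vocab_reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """Dense signal: KL(student || teacher) over the whole vocabulary."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

def sampled_token_signal(student_probs, teacher_probs, token_id, eps=1e-12):
    """Sparse signal: log-ratio on the one token the student actually sampled."""
    return (
        math.log(student_probs[token_id] + eps)
        - math.log(teacher_probs[token_id] + eps)
    )

student = [0.7, 0.2, 0.1]
teacher_near = [0.6, 0.3, 0.1]    # high overlap with the student
teacher_far = [0.05, 0.05, 0.9]   # low overlap with the student

print(full_vocab_reverse_kl(student, teacher_near))
print(full_vocab_reverse_kl(student, teacher_far))
print(sampled_token_signal(student, teacher_far, token_id=0))
```

Averaged over the student's own samples, the sampled-token signal equals the full-vocab reverse KL, so under this assumption the two differ in how much of the teacher's distribution each update trusts, not in what they estimate. The dense version uses every teacher probability at every position, including positions where the teacher is off-support.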
