1/ I read the Nemotron 3 Ultra report and it's interesting to compare their post-training to DeepSeek V4's. Both now do the same thing: train 10+ specialist teachers, then merge them into one student with on-policy distillation. What actually separates them is support overlap 🧵

2/ On-policy distillation grades the student's own rollouts. The teacher scores trajectories the student generated. That only works where the student's rollouts fall inside the teacher's support.
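A toy way to picture "inside the teacher's support", assuming single-token categorical distributions over a five-token vocabulary. All numbers and the helper name `teacher_logprob_on_student_samples` are made up for illustration and come from neither report: sample tokens from the student, then check how much log-probability each teacher assigns to them.

```python
import math
import random

def teacher_logprob_on_student_samples(student, teacher, n=2000, seed=0):
    """Mean teacher log-prob of tokens sampled from the student (toy overlap proxy)."""
    rng = random.Random(seed)
    tokens = rng.choices(range(len(student)), weights=student, k=n)
    return sum(math.log(teacher[t]) for t in tokens) / n

# Hypothetical next-token distributions over a 5-token vocabulary.
student      = [0.50, 0.30, 0.15, 0.04, 0.01]
near_teacher = [0.45, 0.35, 0.12, 0.06, 0.02]  # small perturbation of the student
far_teacher  = [0.01, 0.04, 0.15, 0.30, 0.50]  # mass sits where the student rarely samples

high_overlap = teacher_logprob_on_student_samples(student, near_teacher)
low_overlap = teacher_logprob_on_student_samples(student, far_teacher)
print(high_overlap, low_overlap)
```

The far teacher gives the student's typical tokens much lower log-probability, so its scores on those rollouts come from a region it rarely visits itself.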

The interesting question (imo) is whether we can engineer enough overlap to skip student RL, or whether it's worth keeping. My guess is we can do it without

3/ DeepSeek V4 gets that overlap for free. The teachers are forks of one base, each base + domain SFT + RL, and the student is distilled from those same forks. Everyone is a small perturbation of the same backbone, so student rollouts are already in-distribution for the teachers.

4/ Nvidia's Ultra doesn't. Its teachers get their skills from SFT on data generated by external models (V4-Pro, gpt-oss, GLM) the student never saw. That pushes each teacher's distribution away from the student, so student rollouts can be out-of-distribution for the teacher.

5/ This is why Ultra needs a "warmup" SFT to pull the student onto teacher support before distilling, while V4 doesn't.

6/ It also decides the loss. V4 matches the full teacher distribution at every token. In their report, Ultra tried that, in both top-k and full-vocab variants, and found it worse than scoring only the sampled token, especially on agentic tasks (the dense loss amplifies the noise).

7/ In other words: V4 says sampled-token is too weak to carry the merge, while Ultra says full-vocab is worse than sampled-token. With high overlap the dense loss is safe and useful; with low overlap it's harmful.

8/ You can see it in their RL too. V4 drops student RL entirely because shared lineage already keeps rollouts on-support. Ultra keeps unified RLVR partly so the student samples stronger rollouts that land closer to teacher support.
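The two losses from 6/ and 7/ can be sketched at a single position. This assumes both are forms of reverse KL from student to teacher, which is my assumption: the thread only says "full distribution" versus "sampled token". The distributions and function names are hypothetical.

```python
import math
import random

def full_vocab_loss(student, teacher):
    """Dense loss: reverse KL(student || teacher) summed over the whole vocabulary."""
    return sum(s * (math.log(s) - math.log(t))
               for s, t in zip(student, teacher) if s > 0)

def sampled_token_loss(student, teacher, token):
    """Sparse loss: log-ratio evaluated only at the token the student sampled."""
    return math.log(student[token]) - math.log(teacher[token])

# Hypothetical next-token distributions over a 5-token vocabulary.
student = [0.50, 0.30, 0.15, 0.04, 0.01]
teacher = [0.45, 0.35, 0.12, 0.06, 0.02]

rng = random.Random(0)
tokens = rng.choices(range(len(student)), weights=student, k=20000)

dense = full_vocab_loss(student, teacher)
sparse_mean = sum(sampled_token_loss(student, teacher, t) for t in tokens) / len(tokens)
print(dense, sparse_mean)
```

Averaged over student samples, the sampled-token term estimates the same quantity as the dense sum. The practical difference is that the dense version consumes the teacher's probabilities on every vocabulary entry at every step, so it leans harder on the teacher being reliable across the student's whole distribution.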