DeepSeek V4 Keeps Student Rollouts In-Distribution Unlike Nvidia Ultra

8/ You can see it in their RL too. V4 drops student RL entirely because shared lineage already keeps rollouts on-support. Ultra keeps unified RLVR partly so the student samples stronger rollouts that land closer to teacher support.

Ravid Shwartz Ziv@ziv_ravid

7/ In other words: V4 says sampled-token is too weak to carry the merge, while Ultra says full-vocab is worse than sampled-token. With high overlap the dense loss is safe and useful; with low overlap it's harmful.

Ravid Shwartz Ziv@ziv_ravid

5/ This is why Ultra needs a "warmup" SFT to pull the student onto teacher support before distilling, while V4 doesn't.

Ravid Shwartz Ziv@ziv_ravid

4/ Nvidia Ultra doesn't. Its teachers get their skills from SFT on data generated by external models (V4-Pro, gpt-oss, GLM) the student never saw. That pushes each teacher's distribution away from the student, so student rollouts can be out-of-distribution for the teacher.
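
One rough way to check whether student rollouts are off-support for a teacher is to score the student's sampled tokens under the teacher and see how often the teacher gives them very low probability. This is only a sketch: the function names, the `min_prob` threshold, and the per-token log-prob inputs are all illustrative assumptions, not something either report describes.

```python
import math

def off_support_fraction(teacher_logprobs, min_prob=1e-3):
    """Fraction of student-sampled tokens that the teacher scores below min_prob.

    teacher_logprobs: log p_teacher(x_t) for each token x_t in a student rollout.
    min_prob is an arbitrary illustrative cutoff, not a value from either report.
    """
    if not teacher_logprobs:
        raise ValueError("rollout is empty")
    threshold = math.log(min_prob)
    low = sum(1 for lp in teacher_logprobs if lp < threshold)
    return low / len(teacher_logprobs)

def mean_logprob_gap(student_logprobs, teacher_logprobs):
    """Average of log p_student(x_t) - log p_teacher(x_t) over one student rollout."""
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob lists must have the same length")
    if not student_logprobs:
        raise ValueError("rollout is empty")
    gaps = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(gaps) / len(gaps)
```

A large off-support fraction or a large mean gap would suggest the teacher is being asked to grade text it would rarely produce itself.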

Ravid Shwartz Ziv@ziv_ravid

6/ It also decides the loss. V4 matches the full teacher distribution at every token. According to its report, Ultra tried that, in both top-k and full-vocab variants, and found it worse than scoring only the sampled token, especially on agentic tasks (it amplifies the noise).
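
To make the two loss shapes concrete, here is a minimal per-position sketch. It assumes a reverse KL, KL(student || teacher), for the dense case; the thread does not say which divergence or direction either report uses, so that choice and all function names are illustrative only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def full_vocab_reverse_kl(student_logits, teacher_logits):
    """Dense signal: KL(student || teacher) summed over the whole vocabulary.

    Reverse KL is an assumption made for this sketch.
    """
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return sum(math.exp(s) * (s - t) for s, t in zip(ls, lt))

def sampled_token_signal(student_logits, teacher_logits, token_id):
    """Sparse signal: log p_student(x) - log p_teacher(x) at the sampled token x only."""
    ls = log_softmax(student_logits)
    lt = log_softmax(teacher_logits)
    return ls[token_id] - lt[token_id]
```

Under these assumptions the sampled-token signal, averaged over tokens drawn from the student, equals the full-vocab reverse KL. The dense loss uses the teacher's probabilities for every vocabulary entry at once, while the sparse one only ever touches the token the student actually sampled.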

Ravid Shwartz Ziv@ziv_ravid

The interesting question (imo) is whether we can engineer enough overlap to skip student RL, or whether it's worth keeping. My guess is we can do it without
