Warmup SFT Boosts MOPD Gains In Agentic Domains For Nemotron 3 Ultra

Original post

wh (@nrehiew_)

This table is interesting for seeing in which domains the student outperforms the teacher.

The merged model outperforms the specialized RLVR model on agentic and instruction-following benchmarks. On TBench, the student significantly outperforms the teacher, which is interesting.

For reference, the second table is a similar figure from Mimo-v2-flash. It is interesting to compare relative performance in ~similar domains.

wh (@nrehiew_)

Because all the experts are trained individually and differently, they say that MOPD cannot be applied naively: the student ends up too different from the teacher.

They do a very light SFT stage on each teacher's data as warmup.

The benefit is most pronounced in agentic domains rather than reasoning ones. I suspect it's something to do with longer rollouts or multi-turn interaction (?)

8:12 AM · Jun 4, 2026 · 144 Views

Sentiment

Users loved the tech report on Warmup SFT boosting MOPD gains for Nemotron 3 Ultra because it highlights NVIDIA's push for NVFP4 at larger scales along with appreciated transparency.

Pos: 100.0% · Neg: 0.0% (1 comment with sentiment)

Posts from X

wh (@nrehiew_)

The gains are limited on non-agentic reasoning. The hypothesis is that much of the teacher's distribution (and hence its gain) is never sampled by the student, so the student can never learn it.
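
The coverage hypothesis can be illustrated with a toy next-token example. This is only a sketch: the function name `unreached_teacher_mass`, the threshold `min_student_prob`, and the probabilities are all hypothetical and do not come from the report. It measures how much teacher probability sits on tokens the student would almost never sample on its own rollouts.

```python
def unreached_teacher_mass(student_probs, teacher_probs, min_student_prob=1e-3):
    """Sum the teacher's probability on tokens the student rarely samples.

    A token counts as "unreached" when the student's probability for it is
    below min_student_prob (an arbitrary illustrative cutoff).
    """
    return sum(
        t for s, t in zip(student_probs, teacher_probs) if s < min_student_prob
    )


# Hypothetical 4-token vocabulary: the teacher likes tokens 2 and 3,
# but the student puts almost no probability on them.
student = [0.6, 0.3999, 0.0001, 0.0]
teacher = [0.2, 0.2, 0.3, 0.3]
print(unreached_teacher_mass(student, teacher))
```

In an on-policy setup the teacher only scores what the student samples, so in this toy case most of the teacher's preferred behavior would rarely, if ever, be visited.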

wh (@nrehiew_)

Some open MOPD questions they list:
- Training on the entire distribution instead of applying the loss only to the selected token. They say this did not help much, as it might amplify noise.
- How to ensure student trajectories lie within the teacher's support for effective scoring.
- Efficiency across different domains, where rollouts take very different amounts of time.
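
To make the first open question concrete, here is a minimal sketch of the two options. It assumes a reverse-KL style objective between student and teacher next-token distributions, which the thread does not confirm. The function names and logits are hypothetical. The sampled-token variant uses only the log-probability ratio at the token the student actually picked; the full-distribution variant sums over the whole vocabulary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sampled_token_log_ratio(student_logits, teacher_logits, token):
    """log p_student(token) - log p_teacher(token) at one sampled token."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return s[token] - t[token]


def full_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) summed over the entire vocabulary."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


# Hypothetical logits over a 3-token vocabulary.
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.0, 1.0, 0.5]

# Averaging the sampled-token ratio under the student's own distribution
# recovers the full-vocabulary reverse KL.
student_probs = [math.exp(x) for x in log_softmax(student_logits)]
expected_ratio = sum(
    p * sampled_token_log_ratio(student_logits, teacher_logits, i)
    for i, p in enumerate(student_probs)
)
print(expected_ratio, full_reverse_kl(student_logits, teacher_logits))
```

Under this assumption the two variants agree in expectation and differ in what each individual update sees. The full-distribution form also includes teacher probabilities on low-probability tokens, which may be one place the noise they mention could enter.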

wh (@nrehiew_)

For benchmarking, their table has no bolding or highlighting, which makes it annoying to compare, so I got GPT-Image to annotate it.

Kimi K2.6 is still quite insane, but Nemotron 3 Ultra fares very well on non-agentic tasks.
