
Warmup SFT Boosts MOPD Gains In Agentic Domains For Nemotron 3 Ultra

wh@nrehiew_

I think this table is interesting for seeing in which domains the student outperforms the teacher.

The merged model outperforms the specialized RLVR model on agentic and instruction-following benchmarks. On TBench, the student significantly outperforms the teacher, which is interesting.

For reference, the second table is a similar figure from Mimo-v2-flash. Interesting to compare relative performance in ~similar domains

wh@nrehiew_

Because all the experts are trained individually and differently, they say that MOPD cannot be applied naively. What ends up happening is that the student is too different from the teacher.

They do a very light SFT stage on each teacher's data as warmup.

The benefit is most pronounced in agentic domains rather than reasoning ones. I suspect it's probably something to do with longer rollouts/multi-turn (?)

8:12 AM · Jun 4, 2026 · 144 Views
wh@nrehiew_

The gains are limited on non-agentic reasoning. The hypothesis is that a lot of the teacher's distribution/gain is never sampled by the student, so the student can never learn it.
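
To make the coverage hypothesis concrete, here is a toy calculation (my own illustration, not from the report). It assumes the teacher's useful behaviour hinges on some event to which the student assigns a small probability per rollout, and that rollouts are independent. `prob_never_sampled` is a hypothetical helper name.

```python
def prob_never_sampled(student_prob: float, num_rollouts: int) -> float:
    """Chance that an event with per-rollout probability `student_prob`
    never shows up in `num_rollouts` independent student rollouts."""
    return (1.0 - student_prob) ** num_rollouts


if __name__ == "__main__":
    # Assumed numbers, chosen only for illustration.
    for p in (1e-2, 1e-3, 1e-4):
        print(p, prob_never_sampled(p, num_rollouts=1000))
```

With a per-rollout probability of 1e-4 and 1000 rollouts, the event is missed entirely about 90% of the time, so on-policy scoring would rarely get a chance to reward it.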

wh@nrehiew_

Some open MOPD questions they list:
- Instead of computing the loss only on the selected token, train on the entire distribution. They say this did not help much, as it might amplify noise.
- How to ensure student trajectories lie within the teacher's support for effective scoring.
- Efficiency across different domains where rollouts take very different amounts of time.
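
A minimal sketch of the first question, contrasting a selected-token signal with a full-distribution one. This is an assumption about the general shape of on-policy distillation, not the report's actual loss: the thread does not give the formula, so the log-ratio form and all function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sampled_token_signal(student_logits, teacher_logits, token):
    """Single-sample signal on the token the student actually picked:
    log p_student(token) - log p_teacher(token)."""
    ps = softmax(student_logits)
    pt = softmax(teacher_logits)
    return math.log(ps[token]) - math.log(pt[token])


def full_distribution_kl(student_logits, teacher_logits):
    """Reverse KL(student || teacher) summed over the whole vocabulary."""
    ps = softmax(student_logits)
    pt = softmax(teacher_logits)
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(ps, pt))


if __name__ == "__main__":
    # Toy 3-token vocabulary with made-up logits.
    student = [2.0, 0.5, -1.0]
    teacher = [0.1, 1.5, 0.3]
    print(sampled_token_signal(student, teacher, token=0))
    print(full_distribution_kl(student, teacher))
```

Under these assumptions, the selected-token signal averaged over the student's own samples equals the full reverse KL, so the two variants differ in variance and in how much weight low-probability tokens get, not in what they estimate.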
