I think this table is interesting because it shows the domains in which the student outperforms the teacher.
The merged model outperforms the specialized RLVR model on the agentic and instruction-following benchmarks. On TBench, the student significantly outperforms the teacher, which is interesting.
For reference, the second table is a similar figure from Mimo-v2-flash. It's interesting to compare relative performance in ~similar domains.
Because all the experts are trained individually and differently, they say that MOPD cannot be applied naively. Applied directly, the student ends up too different from the teacher.
They do a very light SFT stage on each teacher's data as a warmup.
The benefit is most pronounced in agentic domains vs reasoning ones. I suspect it's probably something to do with longer rollouts/multi-turn (?)
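
To make the "student is too different from the teacher" point concrete, here is a rough sketch. It assumes the distillation signal is something like a per-token reverse KL, KL(student || teacher), which is my assumption and not something the notes above confirm. The distributions and the `reverse_kl` helper are made up for illustration; the only point is that a student far from the teacher starts with a much larger divergence than one that has been nudged closer (e.g. by a light SFT warmup).

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position over a shared vocabulary.

    Hypothetical helper: the actual objective used in the paper is not
    specified in these notes; reverse KL is assumed for illustration.
    """
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

# Toy next-token distributions over a 4-token vocabulary (invented numbers).
teacher = [0.70, 0.20, 0.05, 0.05]
close_student = [0.60, 0.25, 0.10, 0.05]  # e.g. after a light warmup
far_student = [0.05, 0.05, 0.20, 0.70]    # e.g. no warmup, very different policy

kl_close = reverse_kl(close_student, teacher)
kl_far = reverse_kl(far_student, teacher)

print(f"close student: {kl_close:.3f}")
print(f"far student:   {kl_far:.3f}")
```

In this toy setup the far student's divergence is much larger than the close student's, which is one plausible reading of why a warmup stage helps before distilling.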
