AI2's Nathan Lambert says Nvidia's multi-teacher distillation pipeline for Nemotron 3 Ultra represents the new post-training industry standard
The pipeline utilizes over 10 specialized teacher models
@natolambert They ran into some problems that affect generalization in some domains, though. Mainly what happens when the teacher and student are too far apart and the student never rolls out anything useful for scoring.
The gains are limited on non-agentic reasoning. The hypothesis is that a lot of the teacher's distribution (and its gains) is never sampled by the student, so the student can never learn from it.
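To make the coverage hypothesis concrete, here is a toy sketch, not Nvidia's pipeline: it assumes the teacher only scores outputs the student itself rolls out, and it shrinks the "output space" to four discrete outcomes. The function name `sampled_teacher_mass` and all the probabilities are made up for illustration.

```python
import random

def sampled_teacher_mass(student_probs, teacher_probs, n_rollouts, seed=0):
    """Teacher probability mass on outcomes the student actually rolled out.

    Toy assumption: the teacher can only score what the student samples,
    so mass on unsampled outcomes yields no training signal.
    """
    rng = random.Random(seed)
    outcomes = list(range(len(student_probs)))
    # Student rollouts: sample outcome indices from the student's own distribution.
    sampled = set(rng.choices(outcomes, weights=student_probs, k=n_rollouts))
    return sum(teacher_probs[i] for i in sampled)

# Hypothetical distributions over 4 outcomes.
teacher = [0.0, 0.0, 0.1, 0.9]        # teacher prefers outcomes 2 and 3
close_student = [0.1, 0.2, 0.3, 0.4]  # overlaps the teacher's support
far_student = [0.7, 0.3, 0.0, 0.0]    # never produces outcomes 2 or 3

close_mass = sampled_teacher_mass(close_student, teacher, n_rollouts=64)
far_mass = sampled_teacher_mass(far_student, teacher, n_rollouts=64)
print("close student covers teacher mass:", close_mass)
print("far student covers teacher mass:", far_mass)
```

In this toy setup the far student puts zero weight on the outcomes the teacher prefers, so no number of rollouts ever surfaces them, which is the "student never rolls out anything useful for scoring" failure in miniature.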