extremely informal rant: on-policy distillation is so awkward and frankly just super overrated.
why so? well, you'd absolutely hate to be the teacher in an OPD or OPSD setting.
imagine trying to teach an aspiring undergrad how to do research by... just asking them to do it, and then passively watching them wander for countless hours doing something bogus.
after they're completely done, your only tool is to replay their bogus trajectory as-is and offer them 1 token of correction starting from every (rather unhelpful) state they arrive at.
or imagine trying to teach someone how to drive to the nearest Target. you throw them into a car, ask them to do it, and just... let them mess around in random directions. after they're done, you can't actually help them drive anywhere; you're just offering 1 step of 10-millisecond steering guidance for them to distill, from every (bad) state they arrived at!
in OPD, the teacher is forced to stare at absolute nonsense attempts and can't course-correct at all. i can believe this works in cases where the problems with trajectories are rather sparse and repetitive, and it *is* better than on-policy RL in many such cases.
but i think Pedagogical RL, which is a form of "controllably off-policy" self-distillation, is conceptually a much more powerful direction.
the teacher's job is to actually instruct, and to diverge from the student's likely actions only to the (smallest) extent necessary for success.
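
to make the OPD complaint concrete, here's a toy sketch of the signal the teacher gets to give. everything in it is made up for illustration: the 3-token "steering" vocab, the two hard-coded policies, and the choice of per-token reverse KL as the distillation loss (a common choice, but an assumption here, not a claim about any specific OPD implementation). the point is just the structure: the student's own samples decide which states get visited, and the teacher only ever scores one next-token distribution at each of those states.

```python
import math
import random

# hypothetical toy vocab: one "steering" token per step
VOCAB = ["left", "right", "straight"]

def student_policy(state):
    # made-up student: wanders, mostly ignoring where it is
    return {"left": 0.4, "right": 0.4, "straight": 0.2}

def teacher_policy(state):
    # made-up teacher: strongly prefers "straight" from any state
    return {"left": 0.05, "right": 0.05, "straight": 0.9}

def sample(dist, rng):
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

def reverse_kl(p_student, p_teacher):
    # KL(student || teacher) over the next token only
    return sum(p * math.log(p / p_teacher[t])
               for t, p in p_student.items() if p > 0)

def opd_per_token_losses(horizon, seed=0):
    rng = random.Random(seed)
    state = ()
    losses = []
    for _ in range(horizon):
        p_s = student_policy(state)
        # the teacher only gets to comment on the state the student reached
        p_t = teacher_policy(state)
        losses.append(reverse_kl(p_s, p_t))
        # the next state comes from the student's action, never the teacher's
        state = state + (sample(p_s, rng),)
    return state, losses

trajectory, losses = opd_per_token_losses(horizon=5)
```

note that `teacher_policy` never touches `state` except to read it: there's no line where the teacher gets to pick the next token, which is exactly the "can't course-correct" problem.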





