DSPy creator Omar Khattab argues on-policy distillation is structurally inefficient because it forces teacher models into a passive role.
He proposes Pedagogical Reinforcement Learning as an active alternative.
imo the real, lasting insight in the recent OPSD line of work is creating self-teachers through ICL (esp with privileged information). the on-policy component is partly a distraction from the space that ICL (and reasoning / inference scaling in general) opens during training
extremely informal rant: on-policy distillation is so awkward and frankly just super overrated. why so? well, you'd absolutely hate to be the teacher in an OPD or OPSD setting.

imagine trying to teach an aspiring undergrad how to do research by... just asking them to do it, and then passively watching them wander for countless hours doing something bogus. after they're completely done, your only tool is to replay their bogus trajectory as-is and offer them 1 token of correction starting from every (rather unhelpful) state they arrive at.

or imagine trying to teach someone how to drive to the nearest Target. so you throw them into a car and ask them to do that, and you just... let them mess around in random directions. after they're done, you can't actually help them drive anywhere, you're just offering 1 step of 10-millisecond steering guidance for them to distill, from every (bad) state they arrived at!

in OPD, the teacher is forced to stare at absolute nonsense attempts and can't course-correct at all. i can believe this to work in cases where the problems with trajectories are rather sparse and repetitive, and it *is* better than on-policy RL in many such cases.

but i think Pedagogical RL, which is a form of "controllably off-policy" self-distillation, is conceptually a much more powerful direction. the teacher's job is to actually instruct and to diverge from the student's likely actions to the (smallest) necessary extent for success.
one way to visualize my concern here [which i've not done... :( but maybe you could??] is to build a "virtual rollout" of the top-1 token suggested by teacher at every step. for bad enough rollouts, the resulting rollout may clearly show how little signal is actually conveyed.
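A minimal sketch of what that "virtual rollout" could look like, under loud assumptions: the teacher is reduced to a hypothetical `teacher_top1(prefix)` callable that returns its single most likely next token, tokens are plain integers, and the toy teacher and student rollout below are made up for illustration. None of this comes from an actual OPD or OPSD codebase. The point it tries to show is that each teacher suggestion is conditioned on the student's prefix, so stitching the suggestions together need not resemble anything the teacher would have produced on its own.

```python
from typing import Callable, List, Sequence

Token = int
# Hypothetical interface: maps a token prefix to the teacher's top-1 next token.
TeacherTop1 = Callable[[Sequence[Token]], Token]


def virtual_rollout(
    teacher_top1: TeacherTop1,
    prompt: Sequence[Token],
    student_tokens: Sequence[Token],
) -> List[Token]:
    """Collect the teacher's top-1 token at every state the STUDENT visited.

    Step t is conditioned on prompt + student_tokens[:t], never on the
    teacher's own earlier suggestions.
    """
    suggestions = []
    for t in range(len(student_tokens)):
        prefix = list(prompt) + list(student_tokens[:t])
        suggestions.append(teacher_top1(prefix))
    return suggestions


def teacher_rollout(
    teacher_top1: TeacherTop1,
    prompt: Sequence[Token],
    length: int,
) -> List[Token]:
    """Greedy rollout where the teacher follows its own suggestions."""
    prefix = list(prompt)
    for _ in range(length):
        prefix.append(teacher_top1(prefix))
    return prefix[len(prompt):]


def agreement(a: Sequence[Token], b: Sequence[Token]) -> float:
    """Fraction of positions where two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a) if a else 1.0


# Toy teacher (assumption): it "knows" the task is to count up by one.
def toy_teacher(prefix: Sequence[Token]) -> Token:
    return prefix[-1] + 1


prompt = [0]
student = [1, 5, 2, 9, 9]  # made-up student rollout that wanders off after one step

virtual = virtual_rollout(toy_teacher, prompt, student)
own = teacher_rollout(toy_teacher, prompt, len(student))

print(virtual)                  # [1, 2, 6, 3, 10]
print(own)                      # [1, 2, 3, 4, 5]
print(agreement(virtual, own))  # 0.4
```

In this toy case the virtual rollout only matches the teacher's own trajectory while the student is still on track; after that, each suggestion is a locally sensible fix for a state the teacher would never have reached. With a real model, `teacher_top1` would be an argmax over next-token logits given the student's prefix (plus any privileged context), which is left out here.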
@lateinteraction did you see this work: https://trajectory.ai/field-notes/scaling-sdpo
is this close to the pedagogical RL idea?
@lateinteraction @JoshPurtell 🙏🙏🙏