3h ago

DSPy creator Omar Khattab argues on-policy distillation is structurally inefficient because it forces teacher models into a passive role

He proposes Pedagogical Reinforcement Learning as an active alternative.

Original post

extremely informal rant: on-policy distillation is so awkward and frankly just super overrated. why so? well, you'd absolutely hate to be the teacher in an OPD or OPSD setting.

imagine trying to teach an aspiring undergrad how to do research by... just asking them to do it, and then passively watching them wander for countless hours doing something bogus. after they're completely done, your only tool is to replay their bogus trajectory as-is and offer them 1 token of correction starting from every (rather unhelpful) state they arrive at.

or imagine trying to teach someone how to drive to the nearest Target. so you throw them into a car and ask them to do that, and you just... let them mess around in random directions. after they're done, you can't actually help them drive anywhere, you're just offering 1-step of 10-millisecond steering guidance for them to distill, from every (bad) state they arrived at!

in OPD, the teacher is forced to stare at absolute nonsense attempts and can't course-correct at all. i can believe this to work in cases where the problems with trajectories are rather sparse and repetitive, and it *is* better than on-policy RL in many such cases.

but i think Pedagogical RL, which is a form of "controllably off-policy" self-distillation is conceptually a much more powerful direction. the teacher's job is to actually instruct and to diverge from student's likely actions to the (smallest) necessary extent for success.

1:41 PM · May 27, 2026

imo the real, lasting insight in the recent OPSD line of work is creating self-teachers through ICL (esp with privileged information). the on-policy component is partly a distraction from the space that ICL (and reasoning / inference scaling in general) opens during training

Omar Khattab @lateinteraction

8:41 PM · May 27, 2026 · 10.4K Views
8:54 PM · May 27, 2026 · 2.7K Views

one way to visualize my concern here [which i've not done... :( but maybe you could??] is to build a "virtual rollout" of the top-1 token suggested by teacher at every step. for bad enough rollouts, the resulting rollout may clearly show how little signal is actually conveyed.
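A minimal sketch of the "virtual rollout" idea, using a toy version of the driving analogy from the original post rather than a real language model. Everything here is an assumption made for illustration: the number-line world, `TARGET`, `student_policy`, `teacher_policy`, and the helper names are all hypothetical, and a position on the line stands in for the token prefix a real student would have produced. The point is only to show the mechanics: the teacher's top-1 suggestion is collected at every state the student visited, then read back as if it were a trajectory.

```python
from typing import Callable, List, Tuple

# Toy world (hypothetical): a car on a number line, starting at 0, target at 3.
TARGET = 3
MOVES = {"L": -1, "R": +1, "STOP": 0}

Policy = Callable[[int], str]  # maps a position (the "state") to one action


def student_policy(pos: int) -> str:
    """Hypothetical weak student: always drives left, away from the target."""
    return "L"


def teacher_policy(pos: int) -> str:
    """Hypothetical teacher: its top-1 action is the best single move from pos."""
    if pos < TARGET:
        return "R"
    if pos > TARGET:
        return "L"
    return "STOP"


def rollout(policy: Policy, start: int, steps: int) -> Tuple[List[int], List[str]]:
    """Run a policy on its own states; return visited states and chosen actions."""
    states, actions = [], []
    pos = start
    for _ in range(steps):
        action = policy(pos)
        states.append(pos)
        actions.append(action)
        pos += MOVES[action]
    return states, actions


def virtual_rollout(teacher: Policy, student_states: List[int]) -> List[str]:
    """Teacher's top-1 action at each state the STUDENT visited (OPD-style)."""
    return [teacher(state) for state in student_states]


def execute(actions: List[str], start: int) -> int:
    """Replay a fixed action sequence from start and return the final position."""
    pos = start
    for action in actions:
        pos += MOVES[action]
    return pos


if __name__ == "__main__":
    student_states, student_actions = rollout(student_policy, start=0, steps=4)
    virtual = virtual_rollout(teacher_policy, student_states)
    _, teacher_actions = rollout(teacher_policy, start=0, steps=4)

    print("student states: ", student_states)   # [0, -1, -2, -3]
    print("virtual rollout:", virtual)          # ['R', 'R', 'R', 'R']
    print("teacher rollout:", teacher_actions)  # ['R', 'R', 'R', 'STOP']
    print("virtual rollout ends at:", execute(virtual, 0))  # 4, past the target
```

In this toy setup the virtual rollout never contains `STOP`, because the student never reaches a state where stopping is the right move, so replaying it overshoots the target. Whether real OPD rollouts look this uninformative is exactly the open question the post raises; the sketch only shows how one might build the comparison.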

9:16 PM · May 27, 2026 · 1.9K Views

@lateinteraction did you see this work: https://trajectory.ai/field-notes/scaling-sdpo

is this close to the pedagogical RL idea?

10:17 PM · May 27, 2026 · 124 Views

@lateinteraction @JoshPurtell 🙏🙏🙏

9:03 PM · May 27, 2026 · 245 Views