DSPy creator Omar Khattab argues on-policy distillation is structurally inefficient because it forces teacher models into a passive role

He proposes Pedagogical Reinforcement Learning as an active alternative.

Original post

Omar Khattab@lateinteraction

extremely informal rant: on-policy distillation is so awkward and frankly just super overrated.

why so? well, you'd absolutely hate to be the teacher in an OPD or OPSD setting.

imagine trying to teach an aspiring undergrad how to do research by... just asking them to do it, and then passively watching them wander for countless hours doing something bogus.

after they're completely done, your only tool is to replay their bogus trajectory as-is and offer them 1 token of correction starting from every (rather unhelpful) state they arrive at.

or imagine trying to teach someone how to drive to the nearest Target. so you throw them into a car and ask them to do that, and you just... let them mess around in random directions. after they're done, you can't actually help them drive anywhere, you're just offering 1-step of 10-millisecond steering guidance for them to distill, from every (bad) state they arrived at!

in OPD, the teacher is forced to stare at absolute nonsense attempts and can't course-correct at all. i can believe this to work in cases where the problems with trajectories are rather sparse and repetitive, and it *is* better than on-policy RL in many such cases.

but i think Pedagogical RL, which is a form of "controllably off-policy" self-distillation is conceptually a much more powerful direction.

the teacher's job is to actually instruct and to diverge from student's likely actions to the (smallest) necessary extent for success.

1:41 PM · May 27, 2026 · 65.3K Views
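
To make the complaint above concrete, here is a minimal toy sketch of the loop the post describes: the student picks every next state, and the teacher only scores one token at each state the student happened to reach. Everything in it is an assumption for illustration and not from the post: the four-word vocabulary, the hypothetical `student_policy` and `teacher_policy`, and the choice of per-token reverse KL as the distillation loss (one common choice, not necessarily what any given OPD setup uses).

```python
import math
import random

VOCAB = ["left", "right", "straight", "stop"]  # hypothetical toy vocabulary


def student_policy(prefix):
    """Hypothetical student: uniform over actions, so it wanders."""
    return [0.25, 0.25, 0.25, 0.25]


def teacher_policy(prefix):
    """Hypothetical teacher: go straight for three steps, then stop."""
    if len(prefix) >= 3:
        return [0.05, 0.05, 0.05, 0.85]
    return [0.05, 0.05, 0.85, 0.05]


def reverse_kl(p_student, p_teacher):
    """KL(student || teacher) over one next-token distribution."""
    return sum(
        ps * math.log(ps / pt)
        for ps, pt in zip(p_student, p_teacher)
        if ps > 0
    )


def opd_token_losses(student, teacher, horizon, rng):
    """Roll out the student; the teacher only scores the visited states."""
    prefix, losses = [], []
    for _ in range(horizon):
        p_s = student(prefix)
        p_t = teacher(prefix)
        # One token of correction at this state, nothing more.
        losses.append(reverse_kl(p_s, p_t))
        # The next state is chosen by the student, never by the teacher.
        prefix.append(rng.choices(range(len(VOCAB)), weights=p_s, k=1)[0])
    return prefix, losses


rng = random.Random(0)
trajectory, losses = opd_token_losses(student_policy, teacher_policy, 6, rng)
print([VOCAB[i] for i in trajectory])
print([round(x, 3) for x in losses])
```

Note that the teacher's preferred action never enters `prefix`: however wrong the student's path gets, the teacher keeps scoring states it would not have chosen to visit, which is the passivity the post objects to.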

Sentiment

Positive users praise pedagogical RL for enabling personalized instruction and effective sampling, while negative users criticize on-policy distillation as a lazy approach with badly degrading teacher signals.

Pos: 85.0%

Neg: 15.0%

29 comments with sentiment.

Related links

Scaling SDPO - Trajectory (via trajectory.ai)

Pedagogical RL: Teaching Models to Teach Themselves from Privileged Information - Noah Ziems (via noahziems.com)

Posts from X

Rishabh Agarwal@agarwl_

Speculative OPD addresses this exact issue in OPD that student distribution can sometimes be too far from the teacher to provide useful feedback.

Omar Khattab@lateinteraction

imo the real, lasting insight in the recent OPSD line of work is creating self-teachers through ICL (esp with privileged information). the on-policy component is partly a distraction from the space that ICL (and reasoning / inference scaling in general) opens during training

Grad@Grad62304977

Frankly a lot of the OPD convos are happening bcs labs are still doing the expert RL then OPD from experts which is very ugly imo and soon will not be done anymore

I mainly view OPD as SFT+ where there’s a point in SFT that u can squeeze more from ur model performance but also its in a good state so OPD can help, also intuitively seems more stable for further RL

Also OPD can be much cheaper than SFT a lot of the times so looks good for that

So u would do SFT until u can do OPD (can’t do OPD without enough SFT)

Emiliano Penaloza@emilianopp_

@lateinteraction We proposed self distillation alongside \pi-distill a version of what you’re describing, we did find off-policy learning to not only not be bad but also better than self distillation

https://arxiv.org/abs/2602.04942

Rishabh Agarwal@agarwl_

https://arxiv.org/abs/2410.11325 (from @WendaXu2)

Souradip Chakraborty@SOURADIPCHAKR18

Totally agreed with @DhruvBatra_ on this. Detailed explanation on Pedagogical RL: Blog: https://noahziems.com/pedagogical-rl @NoahZiems @amritsinghbedi3 @lateinteraction

Dhruv Batra@DhruvBatra_

@willccbb Agreed with your claim as stated, but caveats to avoid a misreading of your claim:

1. self-distillation ⇏ no exploration (see pedagogical RL)

2. RL ⇏ replayable environments (see any offline RL paper)

Omar Khattab@lateinteraction

btw for those who have not read about the alternative paradigm, pedagogical RL is at:

http://noahziems.com/pedagogical-rl

Omar Khattab@lateinteraction

one way to visualize my concern here [which i've not done... :( but maybe you could??] is to build a "virtual rollout" of the top-1 token suggested by teacher at every step. for bad enough rollouts, the resulting rollout may clearly show how little signal is actually conveyed.
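
One possible reading of the "virtual rollout" suggested above, as a toy sketch: take the teacher's top-1 token at every prefix the student actually visited, string those tokens together, and compare the result with what the teacher would have produced on its own. The counting task, the hypothetical `teacher_top1` rule, and the example student rollout are all assumptions made up for illustration; a real version would query an actual teacher model.

```python
def teacher_top1(prefix, vocab_size=5):
    """Hypothetical teacher: keep counting upward from the last token."""
    if not prefix:
        return 0
    return (prefix[-1] + 1) % vocab_size


def virtual_rollout(student_tokens, top1):
    """Teacher's top-1 suggestion at each prefix the student visited."""
    return [top1(student_tokens[:t]) for t in range(len(student_tokens))]


def greedy_rollout(top1, length):
    """What the teacher would generate if it controlled the trajectory."""
    out = []
    for _ in range(length):
        out.append(top1(out))
    return out


student_tokens = [3, 3, 0, 2, 2]  # made-up wandering student rollout
virtual = virtual_rollout(student_tokens, teacher_top1)
teacher_own = greedy_rollout(teacher_top1, len(student_tokens))

print("student:", student_tokens)
print("virtual:", virtual)      # [0, 4, 4, 1, 3]
print("teacher:", teacher_own)  # [0, 1, 2, 3, 4]
```

In this toy case every individual suggestion is locally sensible, yet the virtual rollout as a whole is not a trajectory the teacher would ever produce, which is one way to see how little of the teacher's intent survives when the student's states are far off.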

Dhruv Batra@DhruvBatra_

@SOURADIPCHAKR18 @NoahZiems @amritsinghbedi3 @lateinteraction Great work!

Herbie Bradley@herbiebradley

@lateinteraction did you see this work: https://trajectory.ai/field-notes/scaling-sdpo

is this close to the pedagogical RL idea?

Ahmad Beirami@abeirami

@lateinteraction All such distillation algos (by nature of gradient optimization on a form of distribution matching) suffer from many of your criticisms and even more.

These methods only make sense at scale (population level, not sample level) and are certainly data inefficient.

Omar Khattab@lateinteraction

@NicolasZucchet yes!! this is acknowledged in the post indeed ("i can believe this to work in...")

basically, the meta-comment here is: OPSD claims to fix issues of sparse signal in on-policy RL, but if your policy is bad enough, the useful signal is arguably still very sparse

Omar Khattab@lateinteraction

@Grad62304977 +1 to all!

Matt@0xLienid

@lateinteraction Yeah the signal degrades really bad. Asking the teacher to "correct" on token 4096 in a rollout is basically an incoherent request.

Not quite your planned measurement, but some analysis I did: https://0xlienid.github.io/articles/supervision-horizon-opsd/

Nicolas Zucchet@NicolasZucchet

@lateinteraction doesn't this analogy apply to on-policy RL as well?

samsja@samsja19

@lateinteraction @JoshPurtell 🙏🙏🙏

Ahmad Beirami@abeirami

@lateinteraction I have heard of new families of evolutionary optimization methods that are >100x more data efficient, and make a lot more sense when viewed at sample level ;)

Omar Khattab@lateinteraction

@emilianopp_ love your paper, very early in this space!

Omar Khattab@lateinteraction

@agarwl_ nice! i hadn't seen this; looking into it, seems very neat!

Jack Friedson@JackFriedson

@lateinteraction haters say things like "you can't drive here, this is my living room" to which I scoff and reply "not bitter lesson pilled"
