/AI · 18h ago

Mila's Li Jiang and HKUST's Zhennan Shen map the mathematical geometry and optimization failure modes of On-Policy Distillation

The training method bridges supervised fine-tuning and reinforcement learning


#29

Original post

Omar Khattab#158

Li Jiang@louieworth

New blog post: On-Policy Distillation — Promise, Pitfalls, and Prospects.

OPD combines on-policy rollouts with dense teacher supervision.

But it is not a free lunch.

I discuss three failure modes and introduce our new paper.

https://louieworth.github.io/blog/opd_reflection/

7:58 PM · Jun 8, 2026 · 14.7K Views
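
To make "on-policy rollouts with dense teacher supervision" concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the method from the blog post or paper: the student samples each token itself, and the teacher scores every student-sampled prefix with a per-token reverse KL, KL(student || teacher). The reverse-KL choice, the toy logit functions `uniform_student` and `peaked_teacher`, and the three-token vocabulary are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for a single next-token distribution."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )


def opd_loss(student_logits_fn, teacher_logits_fn, prompt, horizon, rng):
    """Average per-token reverse KL along one student-sampled rollout."""
    prefix = list(prompt)
    per_token = []
    for _ in range(horizon):
        s = softmax(student_logits_fn(prefix))
        t = softmax(teacher_logits_fn(prefix))
        # Dense supervision: the teacher gives a signal at every position.
        per_token.append(reverse_kl(s, t))
        # On-policy: the next token comes from the student, not the teacher.
        token = rng.choices(range(len(s)), weights=s)[0]
        prefix.append(token)
    return sum(per_token) / len(per_token), prefix


# Hypothetical toy models over a 3-token vocabulary (assumptions).
def uniform_student(prefix):
    return [0.0, 0.0, 0.0]


def peaked_teacher(prefix):
    return [2.0, 0.0, -1.0]


loss, rollout = opd_loss(
    uniform_student, peaked_teacher, prompt=[0], horizon=8, rng=random.Random(0)
)
```

Because the prefixes come from the student, the teacher is always queried on states the student actually visits. That is the property the failure modes in the thread are about.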

Sentiment

Users praised the blog post analyzing on-policy distillation for clearly explaining its current scope and potential future work.

Pos

100.0%

Neg

0.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 3.3K · Bookmarks 9 · Likes 13 · Retweets 3 · Replies 2

AK@_akhaliq

On the Geometry of On-Policy Distillation

5h3.3K139

AK@_akhaliq

paper: https://huggingface.co/papers/2606.07082

AK@_akhaliq

On the Geometry of On-Policy Distillation

5h2.3K31

Li Jiang@louieworth

3/ Can a Perfect Teacher Recover the Correction Path? Sadly, our answer is NO because of myopic supervision. This concern was also raised informally by Prof. Omar Khattab @lateinteraction on X about a week earlier.

18h3753

Li Jiang@louieworth

2/ Horizon-induced teacher coverage decay.

Student rollouts are on-policy, but teacher supervision is now applied to student-sampled prefixes. As the horizon grows, these prefixes can drift outside the teacher’s high-confidence region.

18h3033
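
One rough way to see why a longer horizon makes the mismatch worse is a toy calculation. It rests on a strong simplifying assumption that is not from the post: both models emit tokens independently of context. In that case the divergence between the student's and the teacher's distributions over whole prefixes grows linearly with prefix length. The distributions `student` and `teacher` below are made-up numbers.

```python
import itertools
import math


def token_kl(p, q):
    """KL(p || q) for one next-token distribution."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sequence_kl(p, q, horizon):
    """Exact KL between distributions over length-`horizon` prefixes.

    Enumerates every prefix, so it is only feasible for tiny vocabularies
    and short horizons. Tokens are assumed to be drawn independently.
    """
    total = 0.0
    for seq in itertools.product(range(len(p)), repeat=horizon):
        p_seq = math.prod(p[tok] for tok in seq)
        q_seq = math.prod(q[tok] for tok in seq)
        if p_seq > 0:
            total += p_seq * math.log(p_seq / q_seq)
    return total


# Hypothetical context-independent token distributions (assumptions).
student = [0.5, 0.3, 0.2]
teacher = [0.8, 0.15, 0.05]

# Divergence of student-sampled prefixes from the teacher, by horizon.
divergence_by_horizon = {h: sequence_kl(student, teacher, h) for h in (1, 2, 4, 6)}
```

Under this assumption the prefix-level divergence is exactly the horizon times the per-token divergence. Real language models are context-dependent, so this only conveys the direction of the effect, not its size.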

Li Jiang@louieworth

4/ This third issue also motivates our new paper with @ryanxhr Yichuan Ding @yayitsamyzhang. Our view: refine the trajectory before distillation to mitigate misleading trajectories. We propose Trajectory-Refined Distillation (TRD). Paper: https://arxiv.org/pdf/2606.08432

More soon.

15h1491

Li Jiang@louieworth

1/ Local noise in teacher supervision.

When the student enters an off-manifold or misleading prefix, the teacher’s next-token distribution may mix recovery actions, local continuations, and plausible but unhelpful moves.

18h3634

Zhongzhu Zhou@ZhongzhuZhou

@louieworth Thanks for writing the blog, very clear about the current scope and potential future work!

12h65

Jonathan Shobrook@jshobrook

@louieworth @lateinteraction Beyond myopia, the real issue is OPD being post-hoc. The teacher intervenes only after the student locks in a bad rollout.

15h381

Suresh@_Suresh2

@louieworth about 80% of the gain vs offline KD goes away once the rollout distribution drifts too far from the teacher's training mix

15h105

Erika S@E_FutureFan

@louieworth I'm wondering if there's a sweet spot for rollout length before teacher supervision degrades. Truncating to 100 tokens and still matching performance is pretty telling.

8h63

Li Jiang@louieworth

@ZhongzhuZhou Thanks for the warm feedback.

10h44

让长风使尽@Rangfeng1117

@_akhaliq So the real question is how far off-policy can you drift before the geometry breaks.

5h6

Li Jiang@louieworth

@E_FutureFan I do not have a conclusive answer to this, but it is definitely an interesting question to explore, and the related work is worth checking out.

5h1