AI · 18h ago

Mila's Li Jiang and HKUST's Zhennan Shen map the mathematical geometry and optimization failure modes of On-Policy Distillation

The training method bridges supervised fine-tuning and reinforcement learning

Original post · Omar Khattab #158
Li Jiang @louieworth

New blog post: On-Policy Distillation — Promise, Pitfalls, and Prospects.

OPD combines on-policy rollouts with dense teacher supervision.

But it is not a free lunch.

I discuss three failure modes and introduce our new paper.

https://louieworth.github.io/blog/opd_reflection/

7:58 PM · Jun 8, 2026 · 14.7K Views
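To make "on-policy rollouts with dense teacher supervision" concrete, here is a minimal sketch of a per-token distillation signal. It assumes the divergence is a reverse KL, KL(student || teacher), evaluated at every position of a student-sampled rollout; the post above does not say which divergence is used, so treat that choice as an assumption. The function name `per_token_reverse_kl` and the toy distributions are hypothetical.

```python
import math


def per_token_reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """Return KL(student || teacher) for each position of one rollout.

    Each argument is a list of rows, one row per generated token, and each
    row is a probability distribution over the vocabulary. Both rows at a
    given position are conditioned on the same student-sampled prefix.
    """
    kls = []
    for s_row, t_row in zip(student_probs, teacher_probs):
        kl = 0.0
        for s, t in zip(s_row, t_row):
            if s > 0.0:  # terms with zero student mass contribute nothing
                kl += s * (math.log(s) - math.log(max(t, eps)))
        kls.append(kl)
    return kls


# Hypothetical 3-token rollout over a 2-word vocabulary (made-up numbers).
student = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
teacher = [[0.5, 0.5], [0.6, 0.4], [0.2, 0.8]]

kl = per_token_reverse_kl(student, teacher)  # dense: one value per token
loss = sum(kl) / len(kl)                     # averaged over the rollout
```

The point of the sketch is the shape of the signal: every token in the rollout gets its own training target from the teacher, unlike a single scalar reward at the end of a sequence.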
Sentiment

Users praised the blog post analyzing on-policy distillation for clearly explaining its current scope and potential future work.

Positive 100.0% · Negative 0.0% · 2 comments with sentiment
Cluster Engagement
Posts from X
Most Activity
Views 3.3K · Bookmarks 9 · Likes 13 · Retweets 3 · Replies 2
AK @_akhaliq

On the Geometry of On-Policy Distillation

5h · Views 3.3K · Likes 13 · Bookmarks 9
AK @_akhaliq

paper: https://huggingface.co/papers/2606.07082

AK @_akhaliq

On the Geometry of On-Policy Distillation

5h · Views 2.3K · Likes 3 · Bookmarks 1
Li Jiang @louieworth

3/ Can a Perfect Teacher Recover the Correction Path? Sadly, our answer is NO because of myopic supervision. This concern was also raised informally by Prof. Omar Khattab @lateinteraction on X about a week earlier.

18h · Views 375 · Likes 3
Li Jiang @louieworth

2/ Horizon-induced teacher coverage decay.

Student rollouts are on-policy, but teacher supervision is now applied to student-sampled prefixes. As the horizon grows, these prefixes can drift outside the teacher’s high-confidence region.

18h · Views 303 · Likes 3
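A rough way to see why horizon matters is a toy model, which is an illustration added here and not the analysis from the blog post or paper. Suppose each student-sampled token keeps the prefix inside the teacher's high-confidence region with a fixed probability, independently of earlier steps. Both the independence assumption and the value 0.99 are hypothetical.

```python
def prefix_coverage(per_step_stay_prob, horizon):
    """Probability that a prefix of length `horizon` never left the region.

    Toy assumption: every step stays inside the teacher's high-confidence
    region independently with the same probability, so the per-step
    probabilities multiply.
    """
    return per_step_stay_prob ** horizon


# Hypothetical per-step probability of 0.99 at three rollout lengths.
for horizon in (10, 100, 1000):
    print(horizon, prefix_coverage(0.99, horizon))
```

Under these assumptions coverage shrinks geometrically with rollout length, so even a small per-token drift leaves few long prefixes inside the region where the teacher's supervision is reliable.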
Li Jiang @louieworth

4/ This third issue also motivates our new paper with @ryanxhr Yichuan Ding @yayitsamyzhang. Our view: refine the trajectory before distillation to mitigate misleading trajectories. We propose Trajectory-Refined Distillation (TRD). Paper: https://arxiv.org/pdf/2606.08432

More soon.

15h · Views 149 · Bookmarks 1
Li Jiang @louieworth

1/ Local noise in teacher supervision.

When the student enters an off-manifold or misleading prefix, the teacher’s next-token distribution may mix recovery actions, local continuations, and plausible but unhelpful moves.

18h · Views 363 · Likes 4
Zhongzhu Zhou @ZhongzhuZhou

@louieworth Thanks for writing the blog, very clear about the current scope and potential future work!

12h · Views 65

@louieworth @lateinteraction Beyond myopia, the real issue is OPD being post-hoc. The teacher intervenes only after the student locks in a bad rollout.

15h · Views 38 · Likes 1
Suresh @_Suresh2

@louieworth about 80% of the gain vs offline KD goes away once the rollout distribution drifts too far from the teacher's training mix

15h · Views 105
Erika S @E_FutureFan

@louieworth I'm wondering if there's a sweet spot for rollout length before teacher supervision degrades. Truncating to 100 tokens and still matching performance is pretty telling.

8h · Views 63
Li Jiang @louieworth

@ZhongzhuZhou Thanks for the warm feedback.

10h · Views 44
让长风使尽 @Rangfeng1117

@_akhaliq So the real question is how far off-policy can you drift before the geometry breaks.

5h · Views 6
Li Jiang @louieworth

@E_FutureFan I do not have a conclusive answer to this, but it is definitely an interesting question to explore, and the related work is worth checking out.

5h · Views 1