AI · 4h ago

Sasha Rush explains targeted on-policy self-distillation, a reinforcement learning technique that corrects specific LLM rollout errors

The method bypasses noisy end-of-trajectory final rewards.


Original post

Dwarkesh Patel @dwarkesh_sp · #70 in AI

Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works.

I asked him if I could record it on my iPhone.

The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to just learn from the final reward, because it's a very noisy signal spread out over the whole trajectory.

So we have another model read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the trajectory right before the point where the mistake was made.

Now, with these injected hint tokens, have the model run a forward pass. You don't have to regenerate a new rollout, so no new decode is required.

The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake.

6:58 PM · Jun 3, 2026 · 83K Views
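The steps in the post can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the method as implemented anywhere: `toy_forward`, `VOCAB`, `HINT`, the fixed logit penalty that the hint applies, and the choice of KL(teacher || student) as the matching loss are all hypothetical stand-ins, since the post does not specify the model, the hint format, or the divergence. The "model" here is a single logit vector shared across positions, so the gradient step can be written by hand.

```python
import math

# Hypothetical toy vocabulary; "teleport" stands in for a tool that doesn't exist.
VOCAB = ["search", "teleport", "answer"]
HINT = "<hint: teleport is not a real tool>"  # assumed hint format

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def toy_forward(context, logits):
    """Stand-in for one LM forward step: next-token probabilities given a context."""
    adjusted = list(logits)
    if HINT in context:
        # Assumption: seeing the hint lowers the logit of the bad tool call.
        adjusted[VOCAB.index("teleport")] -= 3.0
    return softmax(adjusted)

def score_rollout(prefix, tokens, logits):
    """Teacher-forced scoring of existing tokens: no sampling, no new decode."""
    context, dists = list(prefix), []
    for tok in tokens:
        dists.append(toy_forward(context, logits))
        context.append(tok)
    return dists

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def targeted_distill_loss(prompt, rollout, error_index, logits):
    """Compare the model with and without the hint, from the error position onward."""
    student = score_rollout(prompt, rollout, logits)
    # Same model, same rollout, hint inserted right before the error token.
    hinted_prefix = list(prompt) + rollout[:error_index] + [HINT]
    teacher_tail = score_rollout(hinted_prefix, rollout[error_index:], logits)
    losses = [kl(t, s) for t, s in zip(teacher_tail, student[error_index:])]
    return teacher_tail, student, sum(losses) / len(losses)

def distill_step(prompt, rollout, error_index, logits, lr=0.5):
    """One gradient step on the student logits, treating the hinted probs as fixed.

    For a softmax student, d KL(t || s) / d logits = s - t.
    """
    teacher_tail, student, loss = targeted_distill_loss(
        prompt, rollout, error_index, logits
    )
    grad = [0.0] * len(logits)
    for t, s in zip(teacher_tail, student[error_index:]):
        for i in range(len(logits)):
            grad[i] += s[i] - t[i]
    n = len(teacher_tail)
    return [z - lr * g / n for z, g in zip(logits, grad)], loss

prompt = ["user: book a trip"]
rollout = ["search", "teleport", "answer"]  # the error is at index 1
logits = [1.0, 1.5, 0.5]
new_logits, loss = distill_step(prompt, rollout, error_index=1, logits=logits)
bad = VOCAB.index("teleport")
print("p(teleport) before:", toy_forward(prompt, logits)[bad])
print("p(teleport) after: ", toy_forward(prompt, new_logits)[bad])
```

Note that the loss only covers positions from the error onward, and the rollout tokens are reused as-is. The only extra compute is the hinted forward pass, which is the "no new decode" point made in the post.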


Sentiment

Many users praised the clear explanations, tennis analogies, and blackboard format in discussions of Targeted On-Policy Self-Distillation for RL error correction in LLMs, while a few dismissed the technique as unoriginal.

Positive: 90.6%

Negative: 9.4%

24 comments with sentiment.


Posts from X

Most Activity

Views: 23.8K · Bookmarks: 265 · Likes: 335 · Retweets: 32 · Replies: 7

Sasha Rush @srush_nlp

On-Policy Distillation is the most active new research direction being explored in RL for LLMs. Had the chance to discuss how it works with Dwarkesh and why it fits so nicely into large-scale pipelines.

Quoting Dwarkesh Patel @dwarkesh_sp (the original post above)

3h ago