ML Twitter: What's your favorite on-policy (self)-distillation paper / blogs from this year? Sharing your own work is totally fine!
If you want to learn more about LLM distillation, you can watch: https://youtu.be/O1AR4iL30mg?si=Zznk_BYCnjCAmhAz
Teachers evaluate student rollouts to provide target token distributions
Users expressed excitement about recent on-policy distillation papers, describing them as timely new research contributions.
And this short lecture from @srush_nlp
Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works.
I asked him if I could record it on my iPhone.
The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to just learn from the final reward, because it's a very noisy signal spread out over the whole trajectory.
So we have another model read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the trajectory right above where the mistake was made.
Now, with these injected hint tokens, have the model run a forward pass. You don't have to regenerate a rollout - aka no new decode required.
The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake.
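
To make the steps above concrete, here is a minimal toy sketch in plain Python. It is not the method from the lecture, just an illustration under loud assumptions: the "model" is a single hypothetical logit vector at the error position, `hint_shift` is a made-up stand-in for how hint tokens in the context would change those logits, and the hinted distribution is frozen once and used as the training target.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-token action vocabulary; index 1 is the mistake.
VOCAB = ["call_real_tool", "call_missing_tool", "answer"]
ERROR = 1

# Toy stand-in for an LLM at the error position (all numbers are made up).
base_logits = [0.0, 1.0, 0.0]   # without a hint, the model prefers the mistake
hint_shift = [1.0, -3.0, 0.0]   # assumed effect of hint tokens in the context

def next_token_probs(logits, hint=False):
    shifted = [b + (h if hint else 0.0) for b, h in zip(logits, hint_shift)]
    return softmax(shifted)

before = next_token_probs(base_logits)
# One forward pass with the hint inserted; no new rollout is decoded.
target = next_token_probs(base_logits, hint=True)

# Train the hint-free model toward the frozen hinted distribution.
# For cross-entropy on softmax logits the gradient is (student - target).
lr = 1.0
for _ in range(500):
    student = next_token_probs(base_logits)
    base_logits = [b - lr * (s - t) for b, s, t in zip(base_logits, student, target)]

after = next_token_probs(base_logits)
print(f"P({VOCAB[ERROR]}): before={before[ERROR]:.3f} "
      f"target={target[ERROR]:.3f} after={after[ERROR]:.3f}")
```

In a real setup the target would come from the same LLM reading the rollout with the hint tokens spliced in, and the loss would be applied at the relevant token positions of the original, hint-free rollout.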

@agarwl_ (wip):
https://yanta.site/c/on-policy-distillation-from-first-principles-82m6

@agarwl_ hot off the press!
Here is how I would explain the confusion at 13min. Imagine you are doing a sequence of moves playing tennis. The student produces moves (or tokens) 1, 2, 3, with only one rollout. OPD works as follows: at every given move (e.g. current token 1) *Nadal gets inside your brain* and produces a distribution over the next move, p(2|1) (log-probs for token 2 conditioned on token 1). Then you update your neurons so that your distribution for the next move looks more like Nadal's. Then position 2 (token 2) is still your own bad move (NOT what Nadal would do). Nadal takes over your brain again and computes p(3|2,1). Again you want to update your brain so that your distribution for token 3 (given your bad moves 1, 2) looks like what Nadal would do if he had gotten into this bad body position. There is only a single rollout: your bad moves 1, 2, 3. But the magic of LLMs is that Nadal can always replace your brain and tell you what HE WOULD DO at any given position. Now in OPSD, replace Nadal with (you + extra hint). But you still update using your own bad moves without the hint, i.e. the sequence (1, 2, 3). (That's why it's on-policy.)
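
The Nadal analogy can be sketched as a per-token loss over a single student rollout. This is a toy illustration, not anyone's actual implementation: the student and teacher are hypothetical bigram logit tables (so "conditioning on the prefix" only looks at the last token), and the choice of KL(student || teacher) as the divergence is an assumption, since the thread does not say which one is used.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical bigram "models": row = previous token, columns = next-token logits.
student_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
teacher_logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]

def next_dist(table, prefix):
    # A bigram table only looks at the last token of the prefix.
    return softmax(table[prefix[-1]])

def sample_rollout(table, start, length, rng):
    tokens = [start]
    for _ in range(length):
        probs = next_dist(table, tokens)
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens

def opd_token_losses(rollout):
    losses = []
    for t in range(1, len(rollout)):
        prefix = rollout[:t]  # the student's own moves, mistakes included
        p_student = next_dist(student_logits, prefix)
        p_teacher = next_dist(teacher_logits, prefix)  # "Nadal" at this position
        losses.append(kl(p_student, p_teacher))
    return losses

rng = random.Random(0)
rollout = sample_rollout(student_logits, start=0, length=3, rng=rng)  # one rollout only
losses = opd_token_losses(rollout)
print(rollout, [round(x, 3) for x in losses])
```

The teacher never generates tokens here; it only scores the prefixes the student actually produced. For the OPSD variant described above, `p_teacher` would instead come from the student itself with the hint added to its context, while the prefixes stay hint-free.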