Alex Dimakis explains the mechanics of on-policy LLM distillation following a community query by Rishabh Agarwal


And this short lecture from @srush_nlp

Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works.

I asked him if I could record it on my iPhone.

The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to just learn from the final reward, because it's a very noisy signal spread out over the whole trajectory.

So we have another model read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the trajectory right above where the mistake was made.

Now, with these injected hint tokens, have the model run a forward pass. You don't have to generate a new rollout, i.e. no new decoding is required.

The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake.
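A toy sketch of the step above, under loud assumptions: `toy_logits` is a hypothetical stand-in for a model forward pass, the vocabulary and hint string are invented, and the KL direction (student to hinted teacher) is one common choice rather than something the post specifies. It only illustrates that the hinted pass scores the same position without decoding anything new.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    # KL(p || q) for two discrete distributions over the same support.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical toy vocabulary; index 1 is the erroneous tool call.
VOCAB = ["call_real_tool", "call_fake_tool", "answer"]
HINT = "<hint: call_fake_tool does not exist>"

def toy_logits(context):
    # Stand-in for one forward pass: next-token logits given the context.
    # Assumption: seeing the hint lowers the logit of the bad tool call.
    logits = [1.0, 1.5, 0.5]
    if HINT in context:
        logits[1] -= 3.0
    return logits

prefix = ["user_question"]  # rollout tokens before the mistake
error_idx = VOCAB.index("call_fake_tool")

# Student: the original context. Teacher: same model, hint inserted
# right before the error position. Both are single forward passes.
student = softmax(toy_logits(prefix))
teacher = softmax(toy_logits(prefix + [HINT]))

# Training signal at the error position: pull the student toward the
# hinted distribution (the direction of the KL is an assumption here).
loss = kl(student, teacher)
```

In a real setup the loss would be backpropagated through the student only, with the hinted probabilities treated as fixed targets.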


Daanish Khazi@bertgodel

@agarwl_ (wip):

https://yanta.site/c/on-policy-distillation-from-first-principles-82m6


Ben Fielding@benfielding

@agarwl_ hot off the press!


Alex Dimakis@AlexGDimakis

Here is how I would explain the confusion at 13 min. Imagine you are playing a sequence of moves in tennis. The student produces moves (or tokens) 1, 2, 3 in a single rollout. OPD works as follows: at every move (e.g. current token 1), *Nadal gets inside your brain* and produces a distribution over the next move, p(2|1) (logprobs for token 2 conditioned on token 1). Then you update your neurons so that your distribution for the next move looks more like Nadal's. The move at position 2 (token 2) is still your own bad move (NOT what Nadal would do). Nadal takes over your brain again and computes p(3|2,1). Again you update your brain so that your distribution for token 3 (given your bad moves 1, 2) looks like what Nadal would do if he had ended up in this bad body position. There is only a single rollout: your bad moves 1, 2, 3. But the magic of LLMs is that Nadal can always replace your brain and tell you what HE WOULD DO at any given position. Now in OPSD, replace Nadal with (you + extra hint). But you still update using your own bad moves without the hint, i.e. the sequence (1, 2, 3). (That's why it's on-policy.)
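The per-position picture above can be sketched in a few lines. Everything here is illustrative: `student_logits`, `teacher_logits`, and `hinted_student_logits` are hypothetical toy functions standing in for model forward passes, tokens are plain integers, and the reverse KL used as the per-token loss is an assumption, not something stated in the thread. The point is only that every prefix comes from the student's single rollout.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reverse_kl(student, teacher):
    # KL(student || teacher); the choice of direction is an assumption.
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def student_logits(prefix):
    # Hypothetical student: logits depend on the last token of the prefix.
    last = prefix[-1] if prefix else 0
    return [0.2 * last, 1.0, 0.5]

def teacher_logits(prefix):
    # Hypothetical teacher ("Nadal"), evaluated on the student's prefix.
    last = prefix[-1] if prefix else 0
    return [2.0, 0.1 * last, 0.0]

def hinted_student_logits(prefix):
    # OPSD-style teacher: the student itself plus a hint, modeled here
    # as a fixed shift that discourages token 1.
    logits = student_logits(prefix)
    logits[1] -= 2.0
    return logits

def per_token_losses(rollout, student_fn, teacher_fn):
    # One loss per position; the prefix is always the student's own tokens.
    losses = []
    for t in range(len(rollout)):
        prefix = rollout[:t]
        s = softmax(student_fn(prefix))
        q = softmax(teacher_fn(prefix))
        losses.append(reverse_kl(s, q))
    return losses

rollout = [1, 2, 1]  # the single student rollout (moves 1, 2, 3)
opd = per_token_losses(rollout, student_logits, teacher_logits)
opsd = per_token_losses(rollout, student_logits, hinted_student_logits)
```

Swapping `teacher_logits` for `hinted_student_logits` is the only change between the two calls; the rollout and its prefixes stay the same, which is what keeps both variants on-policy.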

Dwarkesh Patel@dwarkesh_sp