Omar Khattab co-authors Pedagogical RL paper that trains self-teachers as gently off-policy samplers producing rollouts that are correct and easy to follow at every step
The approach is pitched as a more scalable reinforcement learning paradigm than GRPO.
@nrehiew_ @DimitrisPapail Indeed. But the next breakthrough for a far more scalable RL paradigm than GRPO is already here:
Train your self-teacher to be a pedagogical, gently off-policy sampler for RL rollouts that are both correct AND easy to follow in every step.
It feels like the next breakthrough on a scalable training algorithm is close, likely on top of GRPO with denser credit assignment beyond outcome rewards, but done so with lower bias.
- ECHO does this by limiting credit to environment responses
- Composer2/Self distillation/OPD
Impossible to say for sure obviously, but the fact that so many different groups are doing the same thing along the same axis likely means we are very very close to something with the same impact as GRPO
of course, all existing approaches might end up not being adopted (think MCTS with reasoning) but point still stands
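For context on what "outcome rewards" means in the thread, here is a minimal sketch of GRPO-style group-relative advantages. It is illustrative only and not taken from any of the papers mentioned: the function names `grpo_advantages` and `per_token_credit` are hypothetical, and using the population standard deviation is an assumption (implementations differ on this). The point it shows is that with outcome-only rewards, every token in a rollout receives the same credit, which is the coarseness that "denser credit assignment" aims to improve on.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage per rollout: (r - group mean) / (group std + eps).

    Assumption: population std over the group; some implementations use sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_credit(advantages, lengths):
    """Outcome-only credit: copy each rollout's single advantage to all of its tokens."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Hypothetical group of four rollouts for one prompt: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]  # token counts per rollout (made up for illustration)

adv = grpo_advantages(rewards)
credit = per_token_credit(adv, lengths)

for r, c in zip(rewards, credit):
    print(f"reward={r} tokens={len(c)} per-token advantage={c[0]:.3f}")
```

Every token in a correct rollout gets the same positive value and every token in an incorrect one the same negative value, regardless of which steps actually mattered.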