MIT CSAIL researchers introduce Pedagogical RL
MIT CSAIL researchers Souradip Chakraborty, Noah Ziems, Furong Huang, Meng Jiang, Amrit Singh Bedi, and Omar Khattab introduce Pedagogical RL, a reinforcement learning technique that incorporates privileged information into the rollout sampling process itself. The result is step-by-step trajectories aligned with effective teaching patterns and improved data efficiency on agentic tasks such as coding. The paper is hosted at noahziems.com.
🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it could stumble upon with compute? ⤵️ Pedagogical RL
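For intuition, here's a minimal toy sketch of that score-vs-find distinction (the `policy`, `hint`, and `label` names are illustrative stand-ins, not the paper's API). A blind sampler draws rollouts and only then checks them against the privileged label; an active sampler conditions the draw on the label itself:

```python
import random

# Toy stand-ins; hypothetical, not the paper's actual implementation.
def policy(prompt, hint=None):
    """Sample a 'rollout' (here, just a digit 0-9). Passing a hint biases
    sampling toward it, mimicking generation conditioned on privileged info."""
    if hint is not None and random.random() < 0.8:
        return hint
    return random.randint(0, 9)

label = 3  # privileged info: the RLVR label / known final answer

# Blind sampler: privileged info only *scores* rollouts after sampling.
blind = [policy("q") for _ in range(8)]
scored = [(r, r == label) for r in blind]  # most rollouts are wrong

# Active sampler: privileged info steers which rollouts get drawn at all.
active = [policy("q", hint=label) for _ in range(8)]  # mostly correct

print("blind:", scored)
print("active:", active)
```

In this toy, the hinted sampler lands on the correct answer roughly 80% of the time versus 10% for blind sampling, which is the gap that pass@K compute normally has to cover.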
This project went through more naming iterations than anything else: magnet RL (actively attract the rollout to the right final answer), lucky RL (stumble upon good + realistic rollouts more often than pass@K), foresight RL (learn to use special knowledge of the future), ...
End the tyranny of on-policy algorithms in LLM post-training!
Maybe the key thing isn't whether your rollouts are purely "on-policy" or not, but the extent to which they’re pedagogically useful.
Early explorations into newer paradigms for RL by @SOURADIPCHAKR18* @NoahZiems*:
ICYMI: read the blog on Pedagogical RL: https://noahziems.com/pedagogical-rl
Instead of sampling blindly from your LLM, leverage the label used for RLVR! Learn to directly approximate the distribution of your LLM's plausible rollouts that are actually correct. Then sample from *that*!
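One concrete (hypothetical) way to read "approximate the distribution of plausible rollouts that are actually correct" is as the posterior p(rollout | correct), which naive rejection sampling targets exactly. A minimal sketch, assuming a policy callable and a binary RLVR-style verifier, both placeholders:

```python
import random

def sample_correct_rollouts(policy, prompt, is_correct, n=4, max_tries=256):
    """Rejection-sample from p(rollout | correct): draw from the policy and
    keep only rollouts the verifier accepts. The privileged label is used
    to *find* rollouts here, not merely to score them after the fact."""
    kept = []
    for _ in range(max_tries):
        rollout = policy(prompt)
        if is_correct(rollout):
            kept.append(rollout)
            if len(kept) == n:
                break
    return kept

# Toy usage: a "policy" over digit guesses, verified against a hidden label.
label = 7
policy = lambda prompt: random.randint(0, 9)
is_correct = lambda r: r == label
print(sample_correct_rollouts(policy, "guess a digit", is_correct))
```

If that reading is right, a learned pedagogical sampler would amortize away this wasteful keep-or-discard loop rather than paying the pass@K search cost at every training step.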
Very accessibly written and packed with nice intuitions, imo!