MIT CSAIL researchers introduce Pedagogical RL method

MIT CSAIL researchers Souradip Chakraborty, Noah Ziems, Furong Huang, Meng Jiang, Amrit Singh Bedi, and Omar Khattab introduce Pedagogical RL, a reinforcement learning technique that incorporates privileged information into the rollout sampling process itself, rather than using it only to score rollouts after the fact. This produces step-by-step trajectories aligned with effective teaching patterns and improves data efficiency on agentic tasks such as coding. The paper is hosted at noahziems.com.
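The contrast the authors describe — privileged info used only to score rollouts versus used to actively find them — can be illustrated with a toy sketch. Everything here is an invented stand-in (the policies, the 0.9 hint-following rate, the dict rollout format), not the paper's actual algorithm or API:

```python
import random

random.seed(0)

ANSWER = 42  # the privileged label, available at training time

def reward(rollout):
    """Verifiable reward (RLVR-style): 1.0 iff the final answer is correct."""
    return 1.0 if rollout["answer"] == ANSWER else 0.0

def blind_policy(prompt):
    """Stand-in for an LLM sampling without privileged info:
    it only rarely stumbles onto a correct trajectory."""
    return {"steps": ["guess"], "answer": random.choice(range(100))}

def guided_policy(prompt, hint):
    """Stand-in for sampling conditioned on privileged info:
    the hint steers generation toward correct trajectories
    (here, a made-up 90% of the time)."""
    answer = hint if random.random() < 0.9 else random.choice(range(100))
    return {"steps": ["work toward hint"], "answer": answer}

blind = [blind_policy("solve x") for _ in range(16)]
guided = [guided_policy("solve x", hint=ANSWER) for _ in range(16)]

# In both cases the label still scores rollouts; the difference is how
# often a correct, trainable trajectory shows up in the batch at all.
blind_correct = [r for r in blind if reward(r) > 0]
guided_correct = [r for r in guided if reward(r) > 0]
print(f"blind: {len(blind_correct)}/16, guided: {len(guided_correct)}/16")
```

Under these toy assumptions, the guided batch is dominated by correct trajectories while the blind batch almost never contains one — the data-efficiency gap the thread is pointing at.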

Original post

🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to *find* them. We ask: can we use privileged info to *actively sample* the rollouts RL wishes it can stumble upon with compute? ⤵️ Pedagogical RL

3:46 PM · May 14, 2026
Reposted by Souradip Chakraborty (@SOURADIPCHAKR18)

[Quoted: the original post above · 10:46 PM · May 14, 2026 · 81.9K Views]

1:19 PM · May 15, 2026 · 656 Views

This project went through more naming iterations than anything else: magnet RL (actively attract the rollout to the right final answer), lucky RL (stumble upon good + realistic rollouts more often than pass@K), foresight RL (learn to use special knowledge of the future), ...

[Quoted: the original post above]
11:24 PM · May 14, 2026 · 5.5K Views

End the tyranny of on-policy algorithms in LLM post-training!

Maybe the key thing isn't whether your rollouts are purely "on-policy" or not, but the extent to which they’re pedagogically useful.

Early explorations into newer paradigms for RL by @SOURADIPCHAKR18* @NoahZiems*:

[Quoted: the original post above]
11:20 PM · May 14, 2026 · 10K Views

very accessibly written and packed with nice intuitions imo!

Omar Khattab (@lateinteraction)

ICYMI: read the blog on Pedagogical RL Instead of sampling blindly from your LLM, leverage the label used for RLVR! Learn to directly approximate the distribution of your LLM's plausible rollouts that are actually correct. Then sample from *that*! https://noahziems.com/pedagogical-rl

1:27 PM · May 15, 2026 · 6.7K Views
1:30 PM · May 15, 2026 · 852 Views
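Khattab's description — approximate the distribution of the LLM's plausible rollouts that are actually correct, then sample from *that* — can be read as sampling from the policy conditioned on the verifier accepting. The simplest way to realize that conditional is rejection sampling; this is a hedged sketch with made-up names (`policy_sample`, `verifier`, the toy task), not the method from the blog:

```python
import random

random.seed(1)

TARGET = "4"  # the RLVR label, privileged at training time

def policy_sample(prompt):
    """Toy stand-in for an LLM: a short chain of thought plus an answer."""
    a = random.choice(["3", "4", "5"])
    return {"cot": f"2 + 2 = {a}", "answer": a}

def verifier(rollout):
    """Verifiable check against the known label."""
    return rollout["answer"] == TARGET

def sample_correct(prompt, max_tries=100):
    """Rejection sampling from p(rollout | correct): draw from the base
    policy until the verifier accepts. Accepted rollouts follow the
    policy's own distribution restricted to correct trajectories, so
    they stay plausible for the model rather than being injected
    off-policy demonstrations."""
    for _ in range(max_tries):
        r = policy_sample(prompt)
        if verifier(r):
            return r
    return None  # give up: the conditional has too little mass

batch = [sample_correct("what is 2 + 2?") for _ in range(4)]
print(batch)
```

Rejection sampling gets expensive exactly when pass@K is low, which is presumably why one would want to learn an approximation of the conditional instead of sampling it by brute force.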

End the tyranny of on-policy algorithms in LLM post-training!

Maybe the key thing isn't whether your rollouts are "on-policy" or not, but the extent to which they’re correct & pedagogically useful.

Early explorations into newer paradigms for RL by @SOURADIPCHAKR18* @NoahZiems*:

[Quoted: the original post above]
10:51 PM · May 14, 2026 · 1.3K Views
MIT CSAIL researchers introduce Pedagogical RL method · Digg