MIT just released a new RL method called Pedagogical RL.
The main lesson -> correct reasoning traces can still be bad training data.
It is a similar concept to teaching someone backprop.
Say you have a tiny computation graph:
z = wx + b
a = ReLU(z)
L = (a - y)²
If you already understand backprop, you can jump straight to the gradient:
dL/dw = 2(a - y) · 1[z > 0] · x
The answer is correct but it skips the reasoning process.
To get there, you need to break the computation into local pieces:
dL/da = 2(a - y)
da/dz = 1[z > 0]
dz/dw = x
Then backprop is just composing those local derivatives backward through the graph:
dL/dw = dL/da · da/dz · dz/dw = 2(a - y) · 1[z > 0] · x
Showing a student the final gradient does not teach them how to find gradients on new graphs.
Even telling them “just use the chain rule” may be too large of a jump if they do not understand how to decompose the computation into intermediate nodes and local derivatives.
Reasoning RL has the same failure mode.
A rollout can pass the verifier while containing one step the student model basically never would have taken.
The trajectory gets the answer right, but the learning signal is brittle because the path is too far from the student’s current policy.
Pedagogical RL trains a privileged teacher that knows the answer, then rewards it for producing trajectories that stay learnable for the student.
The trick is to use a spike-aware reward. It penalizes single huge surprise gaps in the trajectory, even when the average likelihood of the trajectory looks fine.
Then the student learns with surprisal-gated imitation, where teacher tokens that are still too surprising get downweighted.
The teacher is learning how to teach at the student’s current level.
Pedagogical RL makes RL more efficient by efficiently selecting trajectories the student is most ready to learn from.
Less waiting for the model to get lucky rollouts. More training signal from examples that meet the student where it is.
Full blog in comments
