LLMs encode whether they're "on the right track" in their activations along a linear axis, kind of like a value function in RL! The axis shifts when you train the model to have new preferences, and it modulates both the model's confidence and its likelihood of backtracking.
New work: The Value Axis 🎯
How do LLMs choose which path to take mid-task? We find they internally track the chance of reaching their goal along a linear axis, akin to a value function in RL. We show this axis modulates confidence in math & coding and can be reshaped with DPO and SFT.



