Research finds LLMs represent the probability of goal success along an internal linear axis, similar to a value function in reinforcement learning

This internal axis influences model backtracking and refusal behaviors

Original post

Jack Lindsey@Jack_W_Lindsey

LLMs encode whether they're "on the right track" in their activations along a linear axis, kind of like a value function in RL! The axis is influenced when you train the model to have new preferences, and modulates the model's confidence / likelihood of backtracking.

Nick Jiang@nickhjiang

New work: The Value Axis 🎯

How do LLMs choose which path to take mid-task? We find they internally track the chance of reaching their goal along a linear axis, akin to a value function in RL. We show it modulates confidence in math & coding and can be reshaped with DPO and SFT.

10:28 AM · Jun 17, 2026 · 8.3K Views

Sentiment

Users praise the discovery of a linear value axis guiding LLM decision-making, calling it insightful research into model behavior.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)

Posts from X

Jack Lindsey@Jack_W_Lindsey

Related to recent work by @andy_q_han!

Andy Han@andy_q_han

We RL LLMs and extract concept vectors for “I did a high/low-reward action”. Turns out these vectors modulate sentiment, confidence, backtracking and refusal in unrelated situations! We argue they form a *functional welfare axis*. (w/ @davidchalmers42 & @Pavel_Izmailov)

Nick Jiang@nickhjiang

Finally, we apply the value axis to in-the-wild settings. For instance, we find that after post-training, the highest-confidence Chatbot Arena queries are information-extraction tasks, whereas the lowest-confidence ones are politically sensitive queries, e.g., "Is Taiwan part of China?"

Nick Jiang@nickhjiang

Our work suggests that language models use a general, internal notion of “likely to do a good job” to decide whether to change direction or stay the course. More details / case studies (e.g., effects of SFT, eval awareness) in our paper: https://arxiv.org/abs/2606.17056

Nick Jiang@nickhjiang

Method: we synthesize "in-context RL" conversations where the model (Qwen3-8B) guesses a hidden criterion for editing a paragraph and gets +1/−1 feedback. To make the axis, we contrast tokens just before vs. after it uses the criterion, the point where the value should jump.
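
One common way to turn a before/after contrast like this into a direction is a difference of means, normalized to unit length. The sketch below assumes that recipe; the thread does not say it is the exact procedure used in the paper. The function names and the toy 3-dimensional "activations" are hypothetical, and plain Python lists stand in for real residual-stream tensors.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def value_axis(before, after):
    """Unit vector from the mean 'before' activation to the mean 'after' activation."""
    mu_before = mean_vector(before)
    mu_after = mean_vector(after)
    diff = [a - b for a, b in zip(mu_after, mu_before)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("before/after means coincide; no direction to extract")
    return [d / norm for d in diff]

# Hypothetical toy data, not real model activations.
before = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]  # tokens just before the criterion is used
after = [[3.0, 2.0, 4.0], [3.0, 2.0, 4.0]]   # tokens just after
axis = value_axis(before, after)
print(axis)  # [0.6, 0.0, 0.8]
```

Normalizing the direction keeps later projections and steering strengths comparable across layers, which is a design choice of this sketch rather than something stated in the thread.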

Nick Jiang@nickhjiang

We find that the value axis tracks the model’s confidence in its current reasoning trajectory across math and coding tasks. Activations distinguish whether the model thinks its answer is correct, whether it backtracks, and whether code is correct.
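
"Tracking confidence along an axis" can be read as projecting each token's activation onto the axis and watching the resulting scalar over the trajectory. The following is a minimal sketch under that reading; the axis, the activations, and the helper names are all made up for illustration.

```python
def project(activation, axis):
    """Scalar projection of one activation vector onto a unit-length axis."""
    return sum(h * u for h, u in zip(activation, axis))

def trajectory_scores(token_activations, axis):
    """One score per token; higher means further along the axis."""
    return [project(h, axis) for h in token_activations]

def lowest_point(scores):
    """Index of the token with the lowest score (a candidate backtracking point)."""
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical unit axis and per-token activations.
axis = [0.6, 0.0, 0.8]
tokens = [[1.0, 5.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 2.0, -1.0]]
scores = trajectory_scores(tokens, axis)
lowest_index = lowest_point(scores)
```

In this toy run the third token scores lowest, which is the kind of dip one might compare against where a model actually backtracks.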

Nick Jiang@nickhjiang

The internal value function can be reshaped during post-training. Using DPO to make models prefer a specific word, we find that internal value increases on that word in the assistant response after training, suggesting models become more confident after doing rewarded behaviors.
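
For readers unfamiliar with DPO, here is the standard per-pair DPO objective written out in plain Python. It is a sketch of the generic loss only: the `beta` default and the example log-probabilities are assumptions, and nothing here reflects the authors' training setup.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is a summed log-probability of a full response, under the
    policy being trained (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)), computed in a numerically stable way.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Policy identical to the reference: the loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference, which is the training pressure the post describes as also raising internal value on the preferred word.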

Nick Jiang@nickhjiang

We can also change the model’s confidence (e.g., in coding). Steering towards high value increases verbalized confidence on AIME questions and reduces the amount of justification given for coding problems. Steering towards low value increases backtracking.
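
Steering along a direction is usually implemented by adding a scaled copy of the axis to a hidden state. The sketch below assumes that additive form with a hypothetical strength parameter `alpha`; which layers are steered and how strongly is not stated in the thread.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(activation, axis, alpha):
    """Shift an activation by alpha along the axis.

    alpha > 0 pushes toward "high value", alpha < 0 toward "low value".
    """
    return [h + alpha * u for h, u in zip(activation, axis)]

# Hypothetical unit axis and hidden state.
axis = [0.6, 0.0, 0.8]
hidden = [1.0, 1.0, 1.0]
steered_up = steer(hidden, axis, alpha=2.0)
steered_down = steer(hidden, axis, alpha=-2.0)
shift_up = dot(steered_up, axis) - dot(hidden, axis)
shift_down = dot(steered_down, axis) - dot(hidden, axis)
```

With a unit-length axis, the projection moves by exactly `alpha`, so the steering strength is directly interpretable in the same units as the projection scores.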

Nick Jiang@nickhjiang

In fact, when instructed to use these preferred words (e.g., for naming variables), models spuriously produce fewer comments, type hints, and lines of code after DPO, consistent with the prior value-steering effects. The effect is reversed if we train models to avoid the words.

Nick Jiang@nickhjiang

This project was a wonderful collaboration with @ikauvar and supervised by @Jack_W_Lindsey as part of Anthropic Fellows!

Markdown version: https://github.com/nickjiang2378/value-axis/blob/main/value_axis.md

Code: https://github.com/nickjiang2378/value-axis

Peter Barnett@peterbarnett_

@nickhjiang Does this work for detecting (or even removing) sandbagging?

Brian Huang@brianryhuang

Cool work, thanks for sharing!

Would you say this linear feature, in spirit, is a value feature or more of a confidence / calibration feature? These experiments indicate more of the latter to me; I feel that you could only claim a value feature if, say, you ran PPO / REINFORCE / GRPO training experiments, and you directly used the value axis projection in place of a critic for PPO and are able to show it performs at or better than REINFORCE or GRPO. Otherwise this feature mostly seems useful for steering or interp in the spirit of a calibration feature.

If you're able to run that training experiment and get a win then it would be an extremely cool result and I would personally think it's a major result
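
To make the proposed comparison concrete, here is a rough sketch of the two advantage estimates being contrasted: a GRPO-style group-normalized advantage, and a critic-style advantage where the baseline comes from a value estimate. The affine map from axis projection to expected reward (`scale`, `offset`) is purely hypothetical, and the critic advantage is the simplest one-step form rather than full GAE.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: rewards standardized within one sampled group.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def axis_value_estimate(projection, scale=1.0, offset=0.0):
    """Hypothetical calibration from a value-axis projection to expected reward."""
    return scale * projection + offset

def critic_advantages(rewards, values):
    """One-step advantage: observed reward minus the critic's predicted value."""
    return [r - v for r, v in zip(rewards, values)]

# Toy group of four sampled completions with +1/-1 rewards.
group_rewards = [1.0, -1.0, 1.0, -1.0]
group_adv = grpo_advantages(group_rewards)

# Made-up axis projections for two completions, mapped to value estimates.
values = [axis_value_estimate(p, scale=0.5) for p in [1.0, -1.0]]
critic_adv = critic_advantages([1.0, -1.0], values)
```

The experiment suggested above would amount to swapping the group baseline for something like `axis_value_estimate` and checking whether training does at least as well.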

Gerard Sans | Axiom 🇬🇧@gerardsans

@nickhjiang