David Chalmers and Pavel Izmailov find reinforcement learning recruits a "functional welfare axis" that steers unrelated model behaviors
These vectors modulate model confidence, sentiment, and refusal behaviors.
The task is a maze navigation environment designed to be as novel as possible. We wanted to see what the RL does specifically, and avoid any correlations from pretraining priors.
Interactive version at http://functionalwelfare.com!

Super excited to share this work. We RL an LLM on a completely new narrow task and extract activation directions for "I did a good / bad action". We find these vectors modulate behavior in all kinds of other situations, align with emotion vectors and track goals. 🧵

This is an example of us steering the model with the "did a bad action" vector on an easy math problem. Remember that we extracted the vector in the maze env, it has nothing to do with math! But the model starts self-doubting and pathologically back-tracking.

We extract activation directions (concept vectors) associated with rewarded and punished actions in the maze. We find that steering with these vectors modulates seemingly unrelated behaviors: did a bad action → negative sentiment, backtracking, low confidence, refusal.
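For readers who have not seen concept-vector steering before, here is a minimal sketch of the general idea. It assumes a difference-of-means extraction, which is one common recipe and not necessarily the one used in the paper. The names `concept_vector`, `steer`, `alpha` and the toy 3-d "activations" are all hypothetical, and a real setup would use an array library on actual model hidden states.

```python
from statistics import fmean


def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def concept_vector(pos_acts, neg_acts):
    """Difference of means: mean(pos_acts) - mean(neg_acts).

    Assumption: this is a generic extraction recipe, not confirmed
    to be the method used in the paper.
    """
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(hidden, direction, alpha):
    """Add alpha * direction to a single hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy numbers made up for illustration; these are NOT real model activations.
punished_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
rewarded_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

# Hypothetical "did a bad action" direction: punished minus rewarded.
bad_action_vec = concept_vector(punished_acts, rewarded_acts)

# Steering: push a hidden state along that direction with strength alpha.
steered = steer([0.5, 0.5, 0.5], bad_action_vec, alpha=2.0)
```

In this sketch the third coordinate is identical in both groups, so it drops out of the direction and steering leaves it untouched.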

We also find that our maze reward concept vectors align with emotion concept vectors (https://x.com/AnthropicAI/status/2039749628737019925).
Gonna get punished → embarrassed and annoyed; get a reward → proud and loving. Relatable!
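"Align" here presumably means something like high cosine similarity between directions, though the exact metric is an assumption on our part. A minimal sketch of that kind of comparison, with made-up 2-d toy vectors standing in for the real reward and emotion concept vectors (the dictionary keys and all numbers are hypothetical):

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-ins, NOT values from the paper.
punish_vec = [1.0, 0.0]
emotion_vecs = {
    "embarrassed": [2.0, 0.0],   # same direction as punish_vec
    "proud": [-3.0, 0.0],        # opposite direction
    "unrelated": [0.0, 1.0],     # orthogonal
}

# Rank the emotion directions by how closely they align with punish_vec.
ranked = sorted(
    emotion_vecs,
    key=lambda name: cosine(punish_vec, emotion_vecs[name]),
    reverse=True,
)
```

Cosine similarity ignores vector length, so only the direction of each concept vector matters for the ranking.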

New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.
Collectively, we think our results suggest there is an axis in activations that tracks how well things are going according to the model. It exists before RL (even in PT models), and RL puts new tasks on this axis. We call it a functional welfare axis.

@andy_q_han did a really amazing job leading this project, and it was awesome to collaborate with @davidchalmers42.
🌐 https://functionalwelfare.com/ 📰 https://arxiv.org/abs/2605.30232
See Andy's thread with more details:
We RL LLMs and extract concept vectors for “I did a high/low-reward action”. Turns out these vectors modulate sentiment, confidence, backtracking and refusal in unrelated situations! We argue they form a *functional welfare axis*. (w/ @davidchalmers42 & @Pavel_Izmailov)
Imo the results are interesting from two perspectives. For RL, we show there is a shared representation axis for "I am doing well", and RL aligns novel tasks with this axis. For model welfare, our functional welfare is plausibly a simple precursor to richer types of welfare.
There are lots of other results and ablations in the paper (We have 20 appendix sections 👀). We also discuss alternative explanations.
here's a new paper (co-authored with @andy_q_han and @Pavel_Izmailov) on an apparent "functional welfare axis" in the activation space of language models. this axis seems to track how well a system is achieving its (quasi-)goals, and it steers welfare-related behaviors.
in models trained with RL on a maze task, the axis tracks reward. more surprisingly, even prior to RL, the axis seems to track and steer functional welfare in a related way, and it is later recruited by RL to serve as a reward axis.
this phenomenon is of technical interest in understanding RL, and it's also of philosophical interest. functional welfare is not the sort of full-blown welfare, involving consciousness and mental states, which confers moral status. it's defined in terms of how well a system is meeting its quasi-goals, and quasi-goals are defined in terms of behavior (roughly, a system has X as a quasi-goal if it behaves as if it has X as a goal).
nevertheless, it may well be that functional welfare is one aspect of full-blown welfare, and the existence of a functional welfare axis raises philosophically interesting questions about whether there could be an axis for full-blown welfare in more advanced AI systems.
i should say that i am very much a minor co-author on this piece, which is spearheaded by the amazing @andy_q_han, a first-year computer science ph.d. student at NYU and an anthropic fellow, with guidance from @Pavel_Izmailov, computer science prof at NYU, formerly at openAI and now part-time at anthropic. i came on board mostly to help with the philosophical interpretation of the results.
i don't know for sure that the functional welfare hypothesis is correct (especially where base models are concerned), and other interpretations are available (e.g. that it's a confidence axis), but the axis is fascinating in any case and i think it will repay study.
all the details can be found at http://functionalwelfare.com or at https://arxiv.org/abs/2605.30232.
This, and the functional wellbeing paper by Ren et al. https://www.ai-wellbeing.org/ makes me, as largely a functionalist about mind, think there is something real here. Even if there is no phenomenology it is an important pattern.
What I really would like to see somebody try is to make/find non-antiparallel good/bad vectors. Is it possible to get (or maintain?) orthogonality, or is valence intrinsically one-dimensional? RL likely forces antiparallelism, but what about other training?
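One way to make the antiparallel-vs-orthogonal question concrete is to split the "bad" vector into a part lying along the "good" vector and a leftover orthogonal part, then ask how much of its squared norm the leftover carries. This is a sketch of that measurement only; `split_along`, `residual_fraction` and the toy 2-d vectors are hypothetical and not taken from the paper.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_along(v, axis):
    """Decompose v into (parallel, residual) relative to a non-zero axis.

    parallel is the projection of v onto axis; residual = v - parallel
    is orthogonal to axis.
    """
    scale = dot(v, axis) / dot(axis, axis)
    parallel = [scale * a for a in axis]
    residual = [x - p for x, p in zip(v, parallel)]
    return parallel, residual


# Toy vectors made up for illustration.
good_vec = [1.0, 0.0]
bad_vec = [-2.0, 1.0]   # mostly antiparallel to good_vec, with a small extra part

parallel, residual = split_along(bad_vec, good_vec)

# Share of bad_vec's squared norm that is NOT explained by the good axis.
# 0.0 would mean purely (anti)parallel; 1.0 would mean fully orthogonal.
residual_fraction = dot(residual, residual) / dot(bad_vec, bad_vec)
```

A residual fraction near zero across many training setups would point toward a one-dimensional valence axis; a consistently large one would suggest separable good and bad directions.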