David Chalmers and Pavel Izmailov find reinforcement learning recruits a "functional welfare axis" that steers unrelated model behaviors

These vectors modulate model confidence, sentiment, and refusal behaviors.

Original post

We RL LLMs and extract concept vectors for “I did a high/low-reward action”. Turns out these vectors modulate sentiment, confidence, backtracking and refusal in unrelated situations!  We argue they form a *functional welfare axis*. (w/ @davidchalmers42 & @Pavel_Izmailov)

Figure 1: Overview of our procedure. (a) Train. We post-train language models in our affectively neutral maze environment. (b) Extract. We obtain the reward vectors v_Mold and v_Gold. (c) Evaluate. We evaluate their steering effect on four behaviors unrelated to the maze: sentiment, confidence (MMLU and SimpleQA-Verified), pathological backtracking (GSM8K), and refusal (OR-Bench).
9:33 AM · May 29, 2026

The task is a maze navigation environment designed to be as novel as possible. We wanted to see what the RL does specifically, and avoid any correlations from pretraining priors.

Interactive version at http://functionalwelfare.com!

Pavel Izmailov (@Pavel_Izmailov)

Super excited to share this work. We RL an LLM on a completely new narrow task and extract activation directions for "I did a good / bad action". We find these vectors modulate behavior in all kinds of other situations, align with emotion vectors and track goals. 🧵

5:15 PM · May 29, 2026 · 1.2K Views
5:15 PM · May 29, 2026 · 494 Views

This is an example of us steering the model with the "did a bad action" vector on an easy math problem. Remember that we extracted the vector in the maze env; it has nothing to do with math! But the model starts doubting itself and pathologically backtracking.

Pavel Izmailov (@Pavel_Izmailov)

We extract activation directions (concept vectors) associated with rewarded and punished actions in the maze. We find that steering with these vectors modulates seemingly unrelated behaviors: did a bad action → negative sentiment, backtracking, low confidence, refusal.

5:15 PM · May 29, 2026 · 188 Views
5:15 PM · May 29, 2026 · 40 Views
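
To make "extract a concept vector, then steer with it" concrete, here is a minimal sketch in plain Python. It assumes a difference-of-means extraction against a neutral baseline and additive steering on a single hidden state; the thread does not say which extraction method the paper uses, so treat both choices as assumptions. The names `concept_vector`, `steer`, `alpha`, and the toy activations are all hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def concept_vector(concept_acts, baseline_acts):
    """Difference of means: mean(concept) minus mean(baseline). Assumed method."""
    concept_mean = mean_vector(concept_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [c - b for c, b in zip(concept_mean, baseline_mean)]


def steer(hidden, direction, alpha):
    """Additive steering: shift one hidden state along `direction` by `alpha`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations, invented for illustration only.
neutral_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
high_reward_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
low_reward_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

v_good = concept_vector(high_reward_acts, neutral_acts)
v_bad = concept_vector(low_reward_acts, neutral_acts)

# Steer a hidden state from some unrelated prompt with the "bad action" vector.
steered = steer([0.5, 0.5, 0.5], v_bad, alpha=1.5)
print(v_good, v_bad, steered)
```

In a real model the same addition would be applied to the residual stream at a chosen layer during generation, and `alpha` would be tuned per layer; none of those details are given in the thread.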

We also find that our maze reward concept vectors align with emotion concept vectors (https://x.com/AnthropicAI/status/2039749628737019925).

Gonna get punished → embarrassed and annoyed; get a reward → proud and loving. Relatable!

Anthropic (@AnthropicAI)

New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

4:59 PM · Apr 2, 2026 · 3.4M Views
5:15 PM · May 29, 2026 · 50 Views
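
One common way to check whether two concept vectors "align" is cosine similarity. The sketch below ranks a few emotion vectors by their cosine with a reward vector. It is only an illustration of that measure: the thread does not state which alignment metric the paper uses, and `rank_by_alignment`, the emotion names chosen, and the 2-dimensional toy vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_alignment(query, named_vectors):
    """Return (name, cosine) pairs sorted from most to least aligned."""
    scores = {name: cosine(query, vec) for name, vec in named_vectors.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors, invented for illustration only.
emotion_vectors = {
    "proud": [1.0, 0.0],
    "embarrassed": [-1.0, 0.0],
    "calm": [0.0, 1.0],
}
v_reward = [2.0, 0.0]

ranking = rank_by_alignment(v_reward, emotion_vectors)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

Cosine ignores vector length, so a reward vector and an emotion vector can score as aligned even if one has a much larger norm than the other.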

Collectively, we think our results suggest there is an axis in activations that tracks how well things are going according to the model. It exists before RL (even in PT models), and RL puts new tasks on this axis. We call it a functional welfare axis.

5:15 PM · May 29, 2026 · 33 Views

@andy_q_han did a really amazing job leading this project, and it was awesome to collaborate with @davidchalmers42.

🌐 https://functionalwelfare.com/ 📰 https://arxiv.org/abs/2605.30232

See Andy's thread with more details:

Andy Han (@andy_q_han)

We RL LLMs and extract concept vectors for “I did a high/low-reward action”. Turns out these vectors modulate sentiment, confidence, backtracking and refusal in unrelated situations!  We argue they form a *functional welfare axis*. (w/ @davidchalmers42 & @Pavel_Izmailov)

4:33 PM · May 29, 2026 · 2.6K Views
5:15 PM · May 29, 2026 · 171 Views

Imo the results are interesting from two perspectives. For RL, we show there is a shared representation axis for "I am doing well", and RL aligns novel tasks with this axis. For model welfare, our functional welfare is plausibly a simple precursor to richer types of welfare.

Pavel Izmailov (@Pavel_Izmailov)

There are lots of other results and ablations in the paper (we have 20 appendix sections 👀). We also discuss alternative explanations.

5:15 PM · May 29, 2026 · 32 Views
5:15 PM · May 29, 2026 · 104 Views

here's a new paper (co-authored with @andy_q_han and @Pavel_Izmailov) on an apparent "functional welfare axis" in the activation space of language models.  this axis seems to track how well a system is achieving its (quasi-)goals, and it steers welfare-related behaviors.

in models trained with RL on a maze task, the axis tracks reward.  more surprisingly, even prior to RL, the axis seems to track and steer functional welfare in a related way, and it is later recruited by RL to serve as a reward axis.

this phenomenon is of technical interest in understanding RL, and it's also of philosophical interest.  functional welfare is not the sort of full-blown welfare, involving consciousness and mental states, which confers moral status.  it's defined in terms of how well a system is meeting its quasi-goals, and quasi-goals are defined in terms of behavior (roughly, a system has X as a quasi-goal if it behaves as if it has X as a goal).

nevertheless, it may well be that functional welfare is one aspect of full-blown welfare, and the existence of a functional welfare axis raises philosophically interesting questions about whether there could be an axis for full-blown welfare in more advanced AI systems.

i should say that i am very much a minor co-author on this piece, which is spearheaded by the amazing @andy_q_han, a first-year computer science ph.d. student at NYU and an anthropic fellow, with guidance from @Pavel_Izmailov, computer science prof at NYU, formerly at openAI and now part-time at anthropic.  i came on board mostly to help with the philosophical interpretation of the results.

i don't know for sure that the functional welfare hypothesis is correct (especially where base models are concerned), and other interpretations are available (e.g. that it's a confidence axis), but the axis is fascinating in any case and i think it will repay study.

all the details can be found at http://functionalwelfare.com or at https://arxiv.org/abs/2605.30232.

5:35 PM · May 29, 2026 · 1.9K Views

This, and the functional wellbeing paper by Ren et al. (https://www.ai-wellbeing.org/), make me, as largely a functionalist about mind, think there is something real here. Even if there is no phenomenology, it is an important pattern.

David Chalmers (@davidchalmers42)

5:35 PM · May 29, 2026 · 1.9K Views
6:55 PM · May 29, 2026 · 450 Views

What I really would like to see somebody try is to make/find non-antiparallel good/bad vectors. Is it possible to get (or maintain?) orthogonality, or is valence intrinsically one-dimensional? RL likely forces antiparallelism, but what about other training?

Anders Sandberg (@anderssandberg)

6:55 PM · May 29, 2026 · 276 Views
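
One way to put a number on the antiparallel-versus-orthogonal question is to split the "bad" vector into a part along the "good" vector and a residual orthogonal to it, then ask how much of its squared norm sits in the residual. The sketch below does this with a single projection step. It is an illustrative measure, not something taken from the paper, and `split_along`, `off_axis_fraction`, and the toy vectors are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_along(v, axis):
    """Decompose v into (scale, parallel part, orthogonal residual) w.r.t. axis."""
    scale = dot(v, axis) / dot(axis, axis)
    parallel = [scale * a for a in axis]
    residual = [x - p for x, p in zip(v, parallel)]
    return scale, parallel, residual


def off_axis_fraction(v, axis):
    """Share of v's squared norm that is orthogonal to axis (0 to 1)."""
    _, _, residual = split_along(v, axis)
    return dot(residual, residual) / dot(v, v)


# Toy vectors, invented for illustration only.
v_good = [1.0, 0.0]
v_bad = [-3.0, 4.0]

scale, parallel, residual = split_along(v_bad, v_good)
fraction = off_axis_fraction(v_bad, v_good)
print(scale, residual, fraction)
```

A fraction near 0 with a negative `scale` means the two vectors are close to antiparallel, which is the one-dimensional valence picture; a fraction near 1 means they are close to orthogonal.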