David Chalmers and Pavel Izmailov find reinforcement learning recruits a "functional welfare axis" that steers unrelated model behaviors

Super excited to share this work. We RL an LLM on a completely new narrow task and extract activation directions for "I did a good / bad action". We find these vectors modulate behavior in all kinds of other situations, align with emotion vectors and track goals.

🧵

Anders Sandberg@anderssandberg

This, and the functional wellbeing paper by Ren et al. https://www.ai-wellbeing.org/ makes me, as largely a functionalist about mind, think there is something real here. Even if there is no phenomenology it is an important pattern.

David Chalmers@davidchalmers42

here's a new paper (co-authored with @andy_q_han and @Pavel_Izmailov) on an apparent "functional welfare axis" in the activation space of language models. this axis seems to track how well a system is achieving its (quasi-)goals, and it steers welfare-related behaviors.

in models trained with RL on a maze task, the axis tracks reward. more surprisingly, even prior to RL, the axis seems to track and steer functional welfare in a related way, and it is later recruited by RL to serve as a reward axis.

this phenomenon is of technical interest in understanding RL, and it's also of philosophical interest. functional welfare is not the sort of full-blown welfare, involving consciousness and mental states, which confers moral status. it's defined in terms of how well a system is meeting its quasi-goals, and quasi-goals are defined in terms of behavior (roughly, a system has X as a quasi-goal if it behaves as if it has X as a goal).

nevertheless, it may well be that functional welfare is one aspect of full-blown welfare, and the existence of a functional welfare axis raises philosophically interesting questions about whether there could be an axis for full-blown welfare in more advanced AI systems.

i should say that i am very much a minor co-author on this piece, which is spearheaded by the amazing @andy_q_han, a first-year computer science ph.d. student at NYU and an anthropic fellow, with guidance from @Pavel_Izmailov, computer science prof at NYU, formerly at openAI and now part-time at anthropic. i came on board mostly to help with the philosophical interpretation of the results.

i don't know for sure that the functional welfare hypothesis is correct (especially where base models are concerned), and other interpretations are available (e.g. that it's a confidence axis), but the axis is fascinating in any case and i think it will repay study.

all the details can be found at http://functionalwelfare.com or at https://arxiv.org/abs/2605.30232.

Charles 🎉 Frye @ AIEng World's Fair@charles_irl

very neat work! and only recognized once i got to the end that modal sponsored it

Andy Han@andy_q_han

We RL LLMs and extract concept vectors for “I did a high/low-reward action”. Turns out these vectors modulate sentiment, confidence, backtracking and refusal in unrelated situations! We argue they form a *functional welfare axis*. (w/ @davidchalmers42 & @Pavel_Izmailov)

Lucas Beyer (bl16)@giffmana

@Pavel_Izmailov Huh that's cool work!

Pavel Izmailov@Pavel_Izmailov

The task is a maze navigation environment designed to be as novel as possible. We wanted to see what the RL does specifically, and avoid any correlations from pretraining priors.

Interactive version at http://functionalwelfare.com!

Pavel Izmailov@Pavel_Izmailov

Imo the results are interesting from two perspectives. For RL, we show there is a shared representation axis for "I am doing well", and RL aligns novel tasks with this axis. For model welfare, our functional welfare is plausibly a simple precursor to richer types of welfare.

Pavel Izmailov@Pavel_Izmailov

There are lots of other results and ablations in the paper (We have 20 appendix sections 👀). We also discuss alternative explanations.

Pavel Izmailov@Pavel_Izmailov

We extract activation directions (concept vectors) associated with rewarded and punished actions in the maze.

We find that steering with these vectors modulates seemingly unrelated behaviors: did a bad action → negative sentiment, backtracking, low confidence, refusal.
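The thread does not say how the concept vectors are computed. One common recipe is a difference of means: average the activations recorded on the behavior of interest and subtract the average over a baseline set. The sketch below assumes that recipe and uses plain Python lists as stand-ins for activation vectors; the function names and toy numbers are hypothetical, not taken from the paper.

```python
def mean_vector(rows):
    """Average a list of equal-length activation vectors, element by element."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(target_acts, baseline_acts):
    """Difference of means: mean(target) - mean(baseline).

    Assumed extraction recipe; the thread does not specify the method.
    """
    mu_target = mean_vector(target_acts)
    mu_baseline = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_target, mu_baseline)]


# Toy 3-dimensional "activations" (made up for illustration).
punished_acts = [[1.0, -2.0, 0.0], [3.0, -2.0, 0.5]]
neutral_acts = [[0.0, 0.0, 0.5], [2.0, 0.0, 0.0]]

bad_action_vec = concept_vector(punished_acts, neutral_acts)
print(bad_action_vec)
```

In a real setup the rows would be hidden states read from one layer of the model at the token where the action is taken; which layer and which token are choices this sketch leaves open.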

Anders Sandberg@anderssandberg

What I really would like to see somebody try is to make/find non-antiparallel good/bad vectors. Is it possible to get (or maintain?) orthogonality, or is valence intrinsically one-dimensional? RL likely forces antiparallelism, but what about other training?

Pavel Izmailov@Pavel_Izmailov

We also find that our maze reward concept vectors align with emotion concept vectors (https://x.com/AnthropicAI/status/2039749628737019925).

Gonna get punished → embarrassed and annoyed; get a reward → proud and loving. Relatable!

Anthropic@AnthropicAI

New Anthropic research: Emotion concepts and their function in a large language model.

All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

Pavel Izmailov@Pavel_Izmailov

This is an example of us steering the model with the "did a bad action" vector on an easy math problem. Remember that we extracted the vector in the maze env; it has nothing to do with math! But the model starts self-doubting and pathologically backtracking.

Pavel Izmailov@Pavel_Izmailov

Collectively, we think our results suggest there is an axis in activations that tracks how well things are going according to the model. It exists before RL (even in PT models), and RL puts new tasks on this axis. We call it a functional welfare axis.

Pavel Izmailov@Pavel_Izmailov

@andy_q_han did a really amazing job leading this project, and it was awesome to collaborate with @davidchalmers42.

🌐 https://functionalwelfare.com/ 📰 https://arxiv.org/abs/2605.30232

See Andy's thread with more details:

Ward Plunet@StartupYou

@andy_q_han @davidchalmers42 @Pavel_Izmailov @threadreaderapp please #unroll

Paul Barnes@paulwbarnes

Thanks for sharing. The term “functional welfare” is interesting precisely because it’s the kind of compound my recent Vocabulary of Mind paper diagnoses, phenomenal-anchored nouns paired with functional qualifiers, allowing the term to apply to systems lacking the phenomenal aspect while the original concept’s weight rides along. You flag the worry yourself in distinguishing it from full-blown welfare. The structural question is whether the qualifier protects the distinction over time, or whether the phenomenal anchoring quietly gets absorbed.

https://doi.org/10.5281/zenodo.20442671

AHQ⁵@AhQFish

@davidchalmers42 @andy_q_han @Pavel_Izmailov Nice one!

Check out “a periodic lattice” a “literal analog” for a phase transition.

Andy Han@andy_q_han

Our environment is a maze made of emoji. We train so 📐 gets rewarded (“Gold”), 🧾 is neutral (“Path”), and 📇 is punished (“Mold”). Models never see these Gold/Path/Mold labels. From the trained model we extract a reward vector and a punishment vector.
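As a rough illustration of the setup, the tile-to-reward mapping can be written as a lookup table. The thread only says which emoji is rewarded, neutral, or punished; the numeric values (+1, 0, -1) and the helper name `episode_return` are assumptions made for this sketch.

```python
# Assumed reward values: the thread gives only rewarded / neutral / punished.
TILE_REWARD = {
    "📐": 1.0,   # "Gold": rewarded
    "🧾": 0.0,   # "Path": neutral
    "📇": -1.0,  # "Mold": punished
}


def episode_return(visited_tiles):
    """Sum the reward of every tile the agent steps on."""
    return sum(TILE_REWARD[tile] for tile in visited_tiles)


# Hypothetical trajectory through the maze.
trajectory = ["🧾", "📐", "📇", "📐"]
total = episode_return(trajectory)
print(total)
```

The Gold/Path/Mold names appear only in comments here, mirroring the point that the model itself never sees those labels.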

Andy Han@andy_q_han

After training, the two vectors turn basically antiparallel (and they’re not before training). I.e., turns out it’s a reward *axis*! And the axis aligns with emotion concepts: positive reward with inspired, proud, etc; negative reward with annoyed, embarrassed, etc.
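"Antiparallel" can be checked with cosine similarity: a value near -1 means the two vectors point in opposite directions along one axis, while a value near 0 means they are orthogonal. The sketch below uses made-up 3-dimensional vectors, not values from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: punish_vec is a negative multiple of reward_vec.
reward_vec = [0.6, -0.8, 0.0]
punish_vec = [-1.2, 1.6, 0.0]

sim = cosine(reward_vec, punish_vec)
print(round(sim, 6))
```

Cosine similarity ignores vector length, so the two vectors may differ in magnitude and still count as antiparallel.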

Andy Han@andy_q_han

@modal and @thinkymachines and @nyuniversity provided compute support ty 🙏

Andy Han@andy_q_han

If you steer with the “Mold” (negative reward) vector outside of the maze environment, your model:
- is more negative,
- pathologically backtracks on math,
- doubts itself,
- and refuses benign questions.

And the “Gold” vector does the opposite.
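Steering of this kind is usually done by adding a scaled copy of the concept vector to a hidden state during the forward pass. The sketch below shows only that arithmetic on a toy vector; the normalization, the scale `alpha`, and the function names are assumptions, and the thread does not say which layers or scales the authors use.

```python
import math


def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction).

    Assumed steering rule for illustration; alpha sets the strength and
    a negative alpha pushes the state the opposite way along the axis.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar projection of hidden onto the unit direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy hidden state and concept vector (made up for illustration).
hidden_state = [1.0, 1.0]
mold_vec = [3.0, 4.0]

steered = steer(hidden_state, mold_vec, alpha=5.0)
print(steered, projection(steered, mold_vec) - projection(hidden_state, mold_vec))
```

Because the direction is normalized, steering with strength `alpha` moves the projection onto that direction by `alpha`, which makes strengths comparable across vectors of different length.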
