9h ago

Research finds reinforcement learning forces LLMs to develop internal 'valence vectors' representing good or bad rollouts

These vectors influence unrelated behaviors like model refusal.



Original post

Samuel Hammond 🦉 @HAMANDCHEESE

RL recruits valence vectors to steer models toward outer rewards 🤔

6:37 PM · May 30, 2026

QUOTE POST

Charles Foster @CFGeek

LLMs learn to internally represent rollouts as good or bad *purely* from reinforcement (the model never sees the reward logic in-context). The authors show this for arbitrary emojis, initially neutrally represented, which the models learn to map onto pre-existing valence axes.

Andy Han @andy_q_han

We RL LLMs and extract concept vectors for “I did a high/low-reward action”. Turns out these vectors modulate sentiment, confidence, backtracking and refusal in unrelated situations! We argue they form a *functional welfare axis*. (w/ @davidchalmers42 & @Pavel_Izmailov)

4:33 PM · May 29, 2026 · 13.5K Views

8:07 AM · May 31, 2026 · 350 Views

Charles Foster @CFGeek

Importantly, this internal change occurred as a side effect of normal reinforcement learning in a game-like environment. They didn’t reinforce the model for making accurate predictions about reward. (At least not directly!)


8:16 AM · May 31, 2026 · 94 Views