9h ago

Research finds reinforcement learning forces LLMs to develop internal 'valence vectors' representing good or bad rollouts

These vectors influence unrelated behaviors like model refusal.

Original post

RL recruits valence vectors to steer models toward outer rewards 🤔

6:37 PM · May 30, 2026

LLMs learn to internally represent rollouts as good or bad *purely* from reinforcement (the model never sees the reward logic in-context). The authors show this for arbitrary emojis, initially neutrally represented, which the models learn to map onto pre-existing valence axes.

Andy Han @andy_q_han

We RL LLMs and extract concept vectors for “I did a high/low-reward action”. Turns out these vectors modulate sentiment, confidence, backtracking and refusal in unrelated situations! We argue they form a *functional welfare axis*. (w/ @davidchalmers42 & @Pavel_Izmailov)

4:33 PM · May 29, 2026 · 13.5K Views
8:07 AM · May 31, 2026 · 350 Views

Importantly, this internal change occurred as a side effect of normal reinforcement learning in a game-like environment. They didn’t reinforce the model for making accurate predictions about reward. (At least not directly!)

Charles Foster @CFGeek

8:16 AM · May 31, 2026 · 94 Views
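
For readers unfamiliar with the term "concept vector" used in the quoted post: one common way to obtain such a vector is a difference of means over hidden activations, which can then be added back to activations to steer behavior. The posts do not say which extraction method the authors used, so the sketch below is an assumption, not their procedure. It runs on synthetic data rather than a real model, and the names `concept_vector`, `steer`, `planted`, and `alpha` are hypothetical.

```python
import numpy as np

def concept_vector(high_acts, low_acts):
    """Difference of class means, normalized to unit length."""
    diff = high_acts.mean(axis=0) - low_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, vector, alpha):
    """Add alpha * vector to every hidden state (activation steering)."""
    return hidden + alpha * vector

# Synthetic stand-in for hidden activations (assumption: no real LLM is involved).
rng = np.random.default_rng(0)
dim = 64
planted = np.zeros(dim)
planted[0] = 1.0  # hypothetical "high vs. low reward" direction

# Activations collected after high-reward and low-reward rollouts, respectively.
high_acts = rng.normal(size=(200, dim)) + 2.0 * planted
low_acts = rng.normal(size=(200, dim)) - 2.0 * planted

v = concept_vector(high_acts, low_acts)

# Steering: shift activations from an unrelated context along the extracted vector.
neutral = rng.normal(size=(10, dim))
steered = steer(neutral, v, alpha=4.0)

print("alignment with planted direction:", float(v @ planted))
print("mean projection before:", float((neutral @ v).mean()))
print("mean projection after: ", float((steered @ v).mean()))
```

In a real experiment the activations would come from a chosen layer of the model, and the effect of steering would be measured on downstream behavior (for example refusal rate) rather than on the projection itself.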