Alistair Letcher mathematically proves that model-free RL agents trained on diverse goals encode world models inside their Q-values

This challenges the distinction between model-free and model-based RL.

Original post

Alistair Letcher@_aletcher

Model-free agents learn to maximise reward without modelling the environment. Right?

In recent work, we challenge this narrative by proving that agents, trained on a sufficiently rich set of goals, encode a unique and accurate world model in their value functions. 1/

6:30 AM · Jun 23, 2026 · 67.1K Views

Sentiment

Commenters are enthusiastic about the result that model-free agents encode accurate world models in their value functions, describing the findings as very cool and super interesting.

Positive: 100.0%

Negative: 0.0%

6 comments with sentiment.

Posts from X

Jakob Foerster@j_foerst

Great work on recovering world models from Q-values. I am particularly excited to explore how this can be extended to map from *preferences* to world models instead, e.g. in combination with small amounts of transition data.

Alistair Letcher@_aletcher

Work done at @FLAIR_ox and @MATSprogram with Mattie Fellows, @AlexDGoldie, @jonathanrichens, @j_foerst and Oliver Richardson.

🌐 Website: http://inverting-bellman.github.io 📝 Paper: http://arxiv.org/pdf/2606.21173 💻 Code: http://github.com/aletcher/inverting-bellman

⬇️ Agent & implicit WM evolving over training.

Alistair Letcher@_aletcher

How do we extract this world model (WM) in practice?

By inverting the Bellman equation. Analogous to Q-learning (sampling from the environment to update Q-values), we introduce P-learning, which samples from an agent's Q-values to decode its internal model of the environment. 2/
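
A minimal tabular sketch of what "inverting the Bellman equation" can mean. It is not the authors' P-learning (which the post describes as sampling-based); it swaps in a closed-form least-squares solve. It assumes a small finite MDP, exact optimal Q-values for each goal, and reward functions known to the decoder. The names `value_iteration` and `invert_bellman` are hypothetical.

```python
import numpy as np

def value_iteration(P, r, gamma, iters=2000):
    """Optimal Q-values for transitions P[s, a, s'] and reward r[s, a]."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        V = Q.max(axis=1)          # V(s') = max_a' Q(s', a')
        Q = r + gamma * P @ V      # Bellman optimality backup
    return Q

def invert_bellman(Qs, rs, gamma):
    """Recover P_hat[s, a, s'] from K goals; Qs and rs have shape (K, S, A)."""
    K, S, A = Qs.shape
    Vs = Qs.max(axis=2)                             # (K, S): one value vector per goal
    targets = ((Qs - rs) / gamma).reshape(K, S * A) # equals sum_s' P(s'|s,a) V_k(s')
    # Solve Vs @ p = targets[:, j] for every (s, a) column j in one call
    sol, *_ = np.linalg.lstsq(Vs, targets, rcond=None)  # (S, S*A)
    return sol.T.reshape(S, A, S)

# Toy setup (all numbers are assumptions chosen for illustration)
rng = np.random.default_rng(0)
S, A, K, gamma = 5, 3, 15, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))     # random stochastic kernel, (S, A, S)
rs = rng.normal(size=(K, S, A))                     # K random "goals"
Qs = np.stack([value_iteration(P_true, r, gamma) for r in rs])

P_hat = invert_bellman(Qs, rs, gamma)
max_err = np.abs(P_hat - P_true).max()
print("max abs error in recovered kernel:", max_err)
```

The decoder never queries the environment: it only reads the Q-values and rewards, which is the sense in which the model is "inside" the value functions in this toy case.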

Alistair Letcher@_aletcher

Empirically, agents encode accurate WMs with far fewer goals than our theory demands. In MuJoCo Reacher, an agent trained on just 4 positional goals contains highly faithful dynamics (MSE < 1e-4), even over variables that rewards never directly depend on (e.g. velocity). 4/

Alistair Letcher@_aletcher

Digging deeper, we find a strong correlation (Spearman ρ = 0.98) between agent performance and implicit WM accuracy, suggesting that goal-conditioned RL is a “secretly” hybrid method linking model-free and model-based RL. 6/
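
For readers unfamiliar with the statistic, here is a small sketch of how a Spearman rank correlation is computed. The arrays below are made-up placeholder numbers, not results from the paper, and `spearman_rho` is a hypothetical helper that assumes no tied values.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for 1-D data without ties."""
    rank_x = np.argsort(np.argsort(x))   # 0-based rank of each element
    rank_y = np.argsort(np.argsort(y))
    return np.corrcoef(rank_x, rank_y)[0, 1]  # Pearson correlation of the ranks

# Placeholder values for illustration only (NOT data from the paper)
returns = np.array([-12.0, -9.5, -7.1, -4.0, -3.2, -1.5])  # agent performance
wm_mse = np.array([3e-2, 1e-2, 2e-2, 4e-3, 9e-4, 1e-4])    # world-model error

# Negate the error so that "higher is better" on both axes
rho = spearman_rho(returns, -wm_mse)
print(rho)
```

Because only ranks matter, the statistic is insensitive to the scale of the error metric, which makes it a natural choice when performance and model error live on very different scales.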

Alistair Letcher@_aletcher

Interestingly, agents trained with goals over different variables (e.g. only position-based or only velocity-based) converge to the same implicit WM, suggesting a kind of platonic representation hypothesis: agents with orthogonal “values” have similar underlying “beliefs”. 7/

Alistair Letcher@_aletcher

We then provide sufficient conditions on the type and number of reward functions for which agents provably encode the true transition kernel, covering both stochastic and deterministic MDPs over finite or continuous state space. 3/
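
The thread does not state the conditions themselves. As a hedged illustration only, the simplest finite-state case with known rewards reduces to a rank condition; the notation below ($n$ states, $K$ goals, matrix $\mathbf{V}$) is an assumption for this sketch and not taken from the paper.

```latex
% Bellman optimality equation for goal k (reward r_k assumed known)
Q_k(s,a) = r_k(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_k(s'),
\qquad V_k(s') = \max_{a'} Q_k(s',a')

% Stack the K goals for a fixed pair (s,a)
\underbrace{\begin{pmatrix}
  V_1(s_1) & \cdots & V_1(s_n) \\
  \vdots   &        & \vdots   \\
  V_K(s_1) & \cdots & V_K(s_n)
\end{pmatrix}}_{\mathbf{V} \in \mathbb{R}^{K \times n}}
\begin{pmatrix} P(s_1 \mid s,a) \\ \vdots \\ P(s_n \mid s,a) \end{pmatrix}
= \frac{1}{\gamma}
\begin{pmatrix} Q_1(s,a) - r_1(s,a) \\ \vdots \\ Q_K(s,a) - r_K(s,a) \end{pmatrix}

% The system has a unique solution iff rank(V) = n, which requires K >= n
```

In this toy setting the transition kernel is pinned down once the goals produce $n$ linearly independent value vectors. The paper's conditions for stochastic, deterministic, and continuous-state MDPs may differ from this argument.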

Alistair Letcher@_aletcher

Our results soften the boundary between model-free & model-based RL, unlock hidden generalisation capabilities, and take a step towards making agents more interpretable & corrigible. But we are just scratching the surface: feel free to reach out with your own ideas! 8/

Alistair Letcher@_aletcher

Surprisingly, extracting and planning inside these internal WMs induces zero-shot generalisation to goals far beyond the training distribution: position-trained Reacher agents can plan exclusively "inside their own brains" (Q-values) to reach specific angular velocities. 5/
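
A toy tabular analogue of this idea, sketched under strong assumptions: a small random MDP, exact Q-values for the training goals, known training rewards, and a least-squares decode standing in for the authors' method. The held-out goal ("be in the last state") and the helper `q_star` are hypothetical; the Reacher experiments in the thread are continuous-control and are not reproduced here.

```python
import numpy as np

def q_star(P, r, gamma, iters=2000):
    """Value iteration: optimal Q for transitions P[s, a, s'] and reward r[s, a]."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        Q = r + gamma * P @ Q.max(axis=1)
    return Q

rng = np.random.default_rng(1)
S, A, K, gamma = 6, 3, 18, 0.9
P_true = rng.dirichlet(np.ones(S), size=(S, A))   # ground-truth kernel, (S, A, S)

# "Training": Q-values for K random training goals
train_rs = rng.normal(size=(K, S, A))
train_Qs = np.stack([q_star(P_true, r, gamma) for r in train_rs])

# Decode the implicit model from the Q-values alone (no environment access)
Vs = train_Qs.max(axis=2)                                     # (K, S)
targets = ((train_Qs - train_rs) / gamma).reshape(K, S * A)   # (K, S*A)
sol, *_ = np.linalg.lstsq(Vs, targets, rcond=None)
P_hat = sol.T.reshape(S, A, S)

# Held-out goal never seen in training: reward 1 for being in the last state
new_r = np.zeros((S, A))
new_r[S - 1, :] = 1.0

Q_plan = q_star(P_hat, new_r, gamma)    # plan inside the decoded model
Q_true = q_star(P_true, new_r, gamma)   # reference: plan with the real dynamics
gap = np.abs(Q_plan - Q_true).max()
same_policy = np.array_equal(Q_plan.argmax(axis=1), Q_true.argmax(axis=1))
print("max Q gap:", gap, "| greedy policies match:", same_policy)
```

The planning step touches only `P_hat`, so any agreement with `Q_true` comes from information already present in the training Q-values.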

Rota 🚪🧎‍♂️@pli_cachete

@_aletcher Very cool!!

kache@yacineMTB

@_aletcher Awesome

osufever@osufever

@yacineMTB "However, you're right that this is compartmentalized/task-specific by design in the experiments. It's not a general-purpose worldmodel that transfers across wildly different domains without retraining."

Ishan Durugkar @ NeurIPS-23@IshanDurugkar

@_aletcher Very cool stuff! Looking forward to digging into it.

Blake Edwards@bitstream_blake

@_aletcher Great work thanks

egesea@egesea009

@_aletcher A system doesn't always learn a model of the world because it was told to. Sometimes it learns one because understanding the world is the easiest way to achieve many different goals. Competence can force comprehension.

10x'er@10x_er

@_aletcher woah
