Alistair Letcher mathematically proves model-free reinforcement learning agents build internal world models when trained on diverse goals

Environmental dynamics are implicitly captured within agent value functions.

Original post

Alistair Letcher@_aletcher

Model-free agents learn to maximise reward without modelling the environment. Right?

In recent work, we challenge this narrative by proving that agents, trained on a sufficiently rich set of goals, encode a unique and accurate world model in their value functions. 1/

6:30 AM · Jun 23, 2026 · 137.3K Views

Sentiment

Users are excited by research showing model-free agents encode accurate world models in value functions, calling the findings very cool and interesting.

Positive: 100.0%. Negative: 0.0%. (10 comments with sentiment.)

Posts from X

Alistair Letcher@_aletcher

How do we extract this world model (WM) in practice?

By inverting the Bellman equation. Analogous to Q-learning (sampling from the environment to update Q-values), we introduce P-learning, which samples from an agent's Q-values to decode its internal model of the environment. 2/
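To make "inverting the Bellman equation" concrete, here is a minimal tabular sketch. It is not the P-learning algorithm from the paper (which samples from Q-values). It assumes a small finite MDP, exact optimal Q-values for K known reward functions, and a plain least-squares solve. All function names (`make_random_mdp`, `optimal_q`, `recover_kernel`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def make_random_mdp(n_states, n_actions, rng):
    """Random transition kernel P[s, a, s'] whose last axis sums to 1."""
    P = rng.random((n_states, n_actions, n_states))
    return P / P.sum(axis=-1, keepdims=True)

def optimal_q(P, r, gamma, iters=2000):
    """Q-value iteration for reward r[s, a] under kernel P."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        V = Q.max(axis=1)          # V(s') = max_a' Q(s', a')
        Q = r + gamma * P @ V      # Bellman optimality backup
    return Q

def recover_kernel(Qs, rs, gamma):
    """Solve the Bellman equation for P, given K (Q, r) pairs.

    Qs, rs have shape (K, S, A). For every (s, a) the Bellman equation
    reads sum_s' V_k(s') P(s'|s, a) = (Q_k(s, a) - r_k(s, a)) / gamma,
    which is linear in the unknown P(.|s, a).
    """
    K, S, A = Qs.shape
    Vs = Qs.max(axis=2)                           # (K, S): one value vector per goal
    B = ((Qs - rs) / gamma).reshape(K, S * A)     # one column per (s, a) pair
    X, *_ = np.linalg.lstsq(Vs, B, rcond=None)    # X has shape (S, S*A)
    return X.T.reshape(S, A, S)

# Illustrative setup: 5 states, 3 actions, 8 random reward functions.
rng = np.random.default_rng(0)
S, A, K, gamma = 5, 3, 8, 0.9
P_true = make_random_mdp(S, A, rng)
rs = rng.standard_normal((K, S, A))
Qs = np.stack([optimal_q(P_true, r, gamma) for r in rs])
P_hat = recover_kernel(Qs, rs, gamma)
print("max abs error:", np.abs(P_hat - P_true).max())
```

The solve only pins down the kernel when the K value vectors span the state space, which is one way to read the thread's requirement of a "sufficiently rich set of goals".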

Alistair Letcher@_aletcher

Work done at @FLAIR_ox and @MATSprogram with Mattie Fellows, @AlexDGoldie, @jonathanrichens, @j_foerst and Oliver Richardson.

🌐 Website: http://inverting-bellman.github.io
📝 Paper: http://arxiv.org/pdf/2606.21173
💻 Code: http://github.com/aletcher/inverting-bellman

⬇️ Agent & implicit WM evolving over training.

Alistair Letcher@_aletcher

Interestingly, agents trained with goals over different variables (e.g. only position-based or only velocity-based) converge to the same implicit WM, suggesting a kind of platonic representation hypothesis: agents with orthogonal “values” have similar underlying “beliefs”. 7/

Alistair Letcher@_aletcher

We then provide sufficient conditions on the type and number of reward functions for which agents provably encode the true transition kernel, covering both stochastic and deterministic MDPs over finite or continuous state space. 3/
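As a rough illustration of why the number of reward functions matters, consider only the finite-state case and assume the agent's Q-values satisfy the Bellman optimality equation exactly. The notation below is chosen for this sketch and is not taken from the paper, whose actual conditions may differ.

```latex
% For goals i = 1, ..., K with rewards r_i and discount \gamma:
Q_i(s,a) = r_i(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V_i(s'),
\qquad V_i(s') = \max_{a'} Q_i(s',a').

% Fix (s,a) and stack the K equations into a linear system in p_{s,a} = P(\cdot \mid s,a):
\mathbf{V}\, p_{s,a} = b_{s,a},
\qquad \mathbf{V}_{i s'} = V_i(s'),
\qquad (b_{s,a})_i = \frac{Q_i(s,a) - r_i(s,a)}{\gamma}.

% If \operatorname{rank}(\mathbf{V}) = |\mathcal{S}| (which needs K \ge |\mathcal{S}|),
% the system has a unique solution, so the Q-values determine p_{s,a}.
```

In this simplified setting, "sufficiently rich" means the value vectors of the training goals span the state space.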

Alistair Letcher@_aletcher

Surprisingly, extracting and planning inside these internal WMs induces zero-shot generalisation to goals far beyond the training distribution: position-trained Reacher agents can plan exclusively "inside their own brains" (Q-values) to reach specific angular velocities. 5/
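The idea of planning inside an extracted model for a goal the agent never trained on can be sketched as ordinary value iteration over that model. The example below is a toy under stated assumptions: a hand-built four-state chain stands in for an extracted kernel, and the names `chain_model`, `goal_reward`, and `plan_in_model` are hypothetical. It does not reproduce the Reacher experiments.

```python
import numpy as np

def chain_model(n_states):
    """Stand-in for an extracted kernel: action 0 moves left, action 1 moves right."""
    P = np.zeros((n_states, 2, n_states))
    for s in range(n_states):
        P[s, 0, max(s - 1, 0)] = 1.0
        P[s, 1, min(s + 1, n_states - 1)] = 1.0
    return P

def goal_reward(P, goal):
    """Expected reward of 1 for landing in `goal`: r[s, a] = P(goal | s, a)."""
    return P[:, :, goal].copy()

def plan_in_model(P_hat, reward, gamma=0.9, iters=500):
    """Value iteration using only the model P_hat, with no environment interaction."""
    Q = np.zeros_like(reward, dtype=float)
    for _ in range(iters):
        Q = reward + gamma * P_hat @ Q.max(axis=1)
    return Q.argmax(axis=1), Q

P_hat = chain_model(4)
policy_right, _ = plan_in_model(P_hat, goal_reward(P_hat, goal=3))
policy_left, _ = plan_in_model(P_hat, goal_reward(P_hat, goal=0))
print(policy_right.tolist(), policy_left.tolist())
```

Swapping in a different reward array is all it takes to plan for a new goal, which is the sense in which the model, once extracted, can be reused zero-shot.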

Alistair Letcher@_aletcher

Empirically, agents encode accurate WMs with far fewer goals than our theory demands. In MuJoCo Reacher, an agent trained on just 4 positional goals contains highly faithful dynamics (MSE < 1e-4), even over variables that rewards never directly depend on (e.g. velocity). 4/

Alistair Letcher@_aletcher

Digging deeper, we find a strong correlation (Spearman ρ = 0.98) between agent performance and implicit WM accuracy, suggesting that goal-conditioned RL is a “secretly” hybrid method linking model-free and model-based RL. 6/

Alistair Letcher@_aletcher

Our results soften the boundary between model-free & model-based RL, unlock hidden generalisation capabilities, and take a step towards making agents more interpretable & corrigible. But we are just scratching the surface, so feel free to reach out with your own ideas! 8/

kache@yacineMTB

@_aletcher Awesome

Rota 🚪🧎‍♂️@pli_cachete

@_aletcher Very cool!!

Ishan Durugkar @ NeurIPS-23@IshanDurugkar

@_aletcher Very cool stuff! Looking forward to digging into it.

egesea@egesea009

@_aletcher A system doesn't always learn a model of the world because it was told to. Sometimes it learns one because understanding the world is the easiest way to achieve many different goals. Competence can force comprehension.
