Microsoft AI Frontiers researchers develop ECHO, a training method that adds environment prediction loss to GRPO so CLI agents build internal world models of terminal environments during reinforcement learning · Digg

Posts from X

Views 82.6K · Bookmarks 887 · Likes 780 · Replies 14

will brown@willccbb

god what a beautiful objective. i wonder how general you can push this. best non-distillation answer ive seen for knowledge acq during RL, feels bitter-pilled in a way that most self-teaching methods aren’t.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d · 82.6K views · 780 likes · 887 bookmarks

RETWEETS76

Dimitris Papailiopoulos@DimitrisPapail

ECHO is now on arxiv. Please share your thoughts and comments!

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

34d72.2K568397

Dimitris Papailiopoulos@DimitrisPapail

Very rarely you stumble on a method that's simple, obvious in hindsight, free, and touches on every problem you care about: CLI agents, continual learning, self-improvement, world models.

ECHO is one of those

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d68.7K518539

Dimitris Papailiopoulos@DimitrisPapail

World modeling. Faster RL. Self-improvement without verifiers.

All from one extra loss term on your favorite open-weights CLI agent.

Happy Monday!

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d30.9K219196

Dimitris Papailiopoulos@DimitrisPapail

Lol you can continual learn by training on terminal outputs WITHOUT REWARDS

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d32.7K239176

Yu Su@ysu_nlp

nice work by @DimitrisPapail and @VaishShrivas!

this work is reinforcing a recent trend that tries to make foundation models jointly predict future states (aka 'world models') and actions instead of actions alone.

we're seeing it in different forms, like World Action Models in embodied agents, or implicit world modeling in Early Experience (https://arxiv.org/abs/2510.08558). also some interesting link to on-policy self-distillation.

shared learning here is, there's still rich supervision signals that are underexplored. such signals were hard to exploit in classic ML, but foundation models have made it possible, potentially creating a recursive self-improvement loop.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d20.4K163153

Guohao Li 🐫@guohao_li

very inspiring work by @DimitrisPapail and @VaishShrivas on adding terminal response prediction as an auxiliary loss to grpo for training terminal agents

this reminds me of an old line of work on unsupervised auxiliary tasks or pseudo rewards for tackling challenges in sparse reward settings and exploration. one of the most memorable papers - unreal from 10 years ago (https://arxiv.org/pdf/1611.05397) by @maxjaderberg, @VladMnih, @wojczarnecki, tom schaul, @jzl86, david silver, and @koraykv proposed multiple auxiliary tasks like maximizing pixel changes, network feature control, reward prediction, and experience replay for training a3c agents in first-person 3d game environments

that is to say there are still many good low-hanging fruits in designing good auxiliary tasks and pseudo rewards for training llm agents in different environments. for example, auxiliary tasks like artifact control, novel state discovery, and so on may be interesting to try out

BUT be careful of reward hacking such as the well-known gaussian noise television problem

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d9.8K8854
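
The idea described in the post above, terminal response prediction added to GRPO as an auxiliary loss, can be written schematically as below. This is a sketch under stated assumptions rather than the formulation from the ECHO paper: the weight $\lambda$, the index set $E$ (positions in a rollout that hold terminal-output tokens), and the per-token averaging are notation introduced here for illustration only.

```latex
\mathcal{L}_{\text{total}}(\theta)
  = \mathcal{L}_{\text{GRPO}}(\theta)
  + \lambda \, \mathcal{L}_{\text{env}}(\theta),
\qquad
\mathcal{L}_{\text{env}}(\theta)
  = -\frac{1}{|E|} \sum_{t \in E} \log \pi_\theta\!\left(x_t \mid x_{<t}\right)
```

Here $x_t$ is the token at position $t$ of the rollout and $\pi_\theta$ is the policy model. The second term is an ordinary next-token cross-entropy, restricted to tokens the terminal produced rather than tokens the agent produced.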

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

incredible. Are we missing any other free, perfect, dense verifiers?

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d9.7K5953

Dimitris Papailiopoulos@DimitrisPapail

Just realized ECHO fits a years-long obsession of mine with transformers and computers.

"Looped Transformers are Computers" "Can You Train a Transformer to be Computer?" And now "Can You Train a Transformer to Simulate a Computer?"

Blame my hobbyist love of theory of computation

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

40d6.5K9628

Dimitris Papailiopoulos@DimitrisPapail

Prediction: by the end of 2026 ECHO will be part of standard agent RL trainers.

FREE LUNCH FOR EVERYONE

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d7.3K6233

Dimitris Papailiopoulos@DimitrisPapail

Turns out training your agent to be a world simulator improves its accuracy of solving problems

Yifu Qiu@ICML 2026@yifuqiu98

Internalizing world modeling as a native ability for agents.

42d12.3K7731

davinci@leothecurious

they added a world modeling loss term to a CLI policy model and it just got better! this has been an increasingly popular trend in robotic policy training over the past year (e.g. cosmos-policy, dreamzero) and i love that it's catching on in the LLM area now. every feedback from the world holds bits of info worth learning, it's a shame most have only cared about the extremely sparse reward feedback and dismiss the rest.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d5.1K6029

Alex Dimakis@AlexGDimakis

Improve your agents with one weird trick: ECHO says, when you SFT an agent, do not train it to predict only the agent replies, but also the terminal responses. When you GRPO, you use the same rollout to predict the terminal responses with cross entropy loss. It's basically free and gets extra supervision from the CLI. This apparently helps the model develop a 'world model' of the terminal, and improves performance, which was very surprising to me.

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d6K4220
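
The recipe in the post above (reuse the same rollout and add a cross-entropy term on the terminal responses) can be sketched roughly as follows. This is a minimal illustration, not the ECHO authors' implementation: the function names, the `env_weight` coefficient and its default of 0.1, and the treatment of the GRPO loss as an already-computed scalar are all assumptions made for this example.

```python
import math


def token_log_prob(logits, target):
    """Log-probability of `target` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_z


def terminal_prediction_loss(logits_per_pos, targets, is_terminal_token):
    """Mean cross-entropy over positions that hold terminal-output tokens.

    Agent-generated positions are skipped here; in this sketch they are
    assumed to be covered by the policy (GRPO) loss instead.
    """
    losses = [
        -token_log_prob(logits, target)
        for logits, target, is_term in zip(logits_per_pos, targets, is_terminal_token)
        if is_term
    ]
    return sum(losses) / len(losses) if losses else 0.0


def combined_loss(policy_loss, logits_per_pos, targets, is_terminal_token,
                  env_weight=0.1):
    """Add the weighted terminal-prediction term to a precomputed policy loss.

    `policy_loss` stands in for the GRPO loss on the same rollout;
    `env_weight` is a hypothetical coefficient, not a value from the paper.
    """
    env_loss = terminal_prediction_loss(logits_per_pos, targets, is_terminal_token)
    return policy_loss + env_weight * env_loss, env_loss


# Toy rollout: 5 positions, vocabulary of 4 tokens, uniform logits.
logits_per_pos = [[0.0, 0.0, 0.0, 0.0] for _ in range(5)]
targets = [0, 1, 2, 3, 0]
# Positions 2 and 3 are terminal output; the rest are agent tokens.
is_terminal_token = [False, False, True, True, False]

total, env = combined_loss(0.5, logits_per_pos, targets, is_terminal_token)
```

The mask is the only new ingredient: the rollout and its logits already exist for the policy update, which is the sense in which the extra supervision is described as basically free.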

Dimitris Papailiopoulos@DimitrisPapail

One aspect that I also appreciate about ECHO is that it can reduce reliance on SFT data to jump start a CLI agent.

An example: comparing with the OpenThoughts-Agent which is Qwen3-8B SFT’d on ∼15k GLM-4.6 trajectories, ECHO on base Qwen and NO SFT closes the gap.

Kinda cool!

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

41d5.7K6212

will brown@willccbb

a litmus test i’ve been thinking about for continual learning is bounding lifetime retrieval count per fact. a model should use tools to look things up, but gradually compound fuzzy memories of things they’ve searched, and eventually not need search. this could maybe work here

will brown@willccbb

god what a beautiful objective. i wonder how general you can push this. best non-distillation answer ive seen for knowledge acq during RL, feels bitter-pilled in a way that most self-teaching methods aren’t.

42d3.1K619

Asli Celikyilmaz@real_asli

How do machines build a mental map of reality? 🧠

Check out this frontier investigation into *world models* from our team at @ms_aifrontiers. Proud to see @DimitrisPapail and colleagues pushing the boundaries of how we think about AI reasoning.

Dimitris Papailiopoulos@DimitrisPapail

World modeling. Faster RL. Self-improvement without verifiers.

All from one extra loss term on your favorite open-weights CLI agent.

Happy Monday!

42d3.9K3211

ueaj@_ueaj

I'd bet stuff like this is most of the OS-closed gap, attention to detail based on well reasoned / high taste conceptual theories to catch all these various subtle flaws. The tail end of the distribution is where all the value is.

There's so many little things like this. Most of the ones I can think of are in pretraining bc I'm still learning RL but they're all just like this in spirit

Very clever

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d4.7K2811

Dimitris Papailiopoulos@DimitrisPapail

https://arxiv.org/abs/2605.24517

Dimitris Papailiopoulos@DimitrisPapail

ECHO is now on arxiv. Please share your thoughts and comments!

34d1.6K1613

Ravid Shwartz Ziv@ziv_ravid

Very cool work. I also think that the signal from the terminal is so underestimated (similar to RLM). And having a strong opinion on the title is also my thing 😁

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d7K1311

Ece Kamar@ecekamar

World models are having a moment—and for good reason. New results from @ms_aifrontiers: extending existing training objectives for world modeling, we can improve performance and enable learning in new environments without explicit feedback. Check out ECHO by Dimitris & Vaish 👇

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

42d3.7K1612