ECHO Improves RL Performance in Agentic Tasks via World Modeling

This basic trick has a large performance impact; see an example from ECHO in the image below. I find this approach especially interesting because it goes against a commonly accepted norm (action masking). I love simple and effective tricks like this, and it makes you wonder what other performance improvements are possible if we question default settings!

🧵 [5/N]

Cameron R. Wolfe, Ph.D.@cwolferesearch

Concretely, this can be implemented by:

1. Using RL on action tokens.
2. Using SFT on tool response tokens.

In this case, the SFT objective is formulated as RL with a constant positive advantage, allowing the SFT objective to be implemented in the normal RL policy update flow with no additional cost.

🧵 [4/N]
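To make the "SFT as RL with a constant positive advantage" point concrete, here is a minimal sketch in plain Python. It assumes a simple REINFORCE-style surrogate loss (no clipping, no KL term) and a constant of 1.0 on tool response tokens; the function names, the constant, and the toy numbers are all illustrative assumptions, not the actual ECHO / PaW implementation.

```python
def token_advantages(is_action, rl_advantage, sft_advantage=1.0):
    """Build per-token advantages for one rollout.

    Action tokens get the rollout's RL advantage; tool response tokens get a
    constant positive advantage (assumed to be 1.0 here).
    """
    return [rl_advantage if a else sft_advantage for a in is_action]


def policy_gradient_loss(logprobs, advantages):
    """REINFORCE-style surrogate: -mean(advantage_t * logprob_t)."""
    total = sum(a * lp for a, lp in zip(advantages, logprobs))
    return -total / len(logprobs)


# Toy rollout: two action tokens followed by two tool response tokens.
is_action = [True, True, False, False]
logprobs = [-0.5, -1.0, -2.0, -0.25]  # log-prob of each token under the policy

advantages = token_advantages(is_action, rl_advantage=-0.4)
loss = policy_gradient_loss(logprobs, advantages)

# On the tool response tokens alone, an advantage of 1.0 reduces the surrogate
# to the usual SFT objective (mean negative log-likelihood).
tool_logprobs = [lp for lp, a in zip(logprobs, is_action) if not a]
tool_loss = policy_gradient_loss(tool_logprobs, [1.0] * len(tool_logprobs))
sft_nll = -sum(tool_logprobs) / len(tool_logprobs)
print(loss, tool_loss, sft_nll)
```

Because both token types go through the same loss function, the only change to a standard RL update in this sketch is how the per-token advantage vector is filled in.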

Cameron R. Wolfe, Ph.D.@cwolferesearch

Here are all of the links for further reading:
- Prime Intellect blog on ECHO / PaW: https://www.primeintellect.ai/blog/true-agents-model-the-world
- ECHO: https://arxiv.org/abs/2605.24517
- PaW: https://arxiv.org/abs/2606.02388

Cameron R. Wolfe, Ph.D.@cwolferesearch

Despite action masking being so common, recent papers have shown that completely removing non-action tokens from the objective is not optimal. We want the LLM to not only take actions, but also form a world model (i.e., be able to predict environment observations / feedback). To do this, we want to train on both action and environment tokens, as proposed in papers like ECHO / PaW.

🧵 [3/N]

Cameron R. Wolfe, Ph.D.@cwolferesearch

The idea of action masking is to remove the contribution of non-LLM-generated tokens (e.g., environment feedback / tool outputs) to the policy gradient. This is basically the agentic RL version of masking prompt tokens when you run SFT. The benefits of action masking have been widely replicated across different papers. As a result, this trick is almost universally adopted in recent agent papers.

🧵 [2/N]
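For reference, action masking can be sketched in a few lines of plain Python. This assumes a REINFORCE-style loss averaged over action tokens only; the function name, the mask layout, and the toy numbers are hypothetical and not taken from any specific paper or library.

```python
def masked_policy_gradient_loss(logprobs, advantages, action_mask):
    """Policy gradient surrogate that ignores non-LLM-generated tokens.

    action_mask[t] is 1 for tokens the LLM generated (actions) and 0 for
    environment feedback / tool outputs, which contribute nothing to the loss.
    """
    kept = [a * lp for lp, a, m in zip(logprobs, advantages, action_mask) if m]
    if not kept:
        raise ValueError("rollout contains no action tokens")
    return -sum(kept) / len(kept)


# Toy rollout: two action tokens followed by two tool output tokens.
logprobs = [-0.5, -1.0, -2.0, -0.25]
advantages = [0.8, 0.8, 0.8, 0.8]
action_mask = [1, 1, 0, 0]

masked_loss = masked_policy_gradient_loss(logprobs, advantages, action_mask)

# Changing the log-probs of the masked tool tokens leaves the loss unchanged.
perturbed = [-0.5, -1.0, -9.0, -9.0]
same_loss = masked_policy_gradient_loss(perturbed, advantages, action_mask)
print(masked_loss, same_loss)
```

This mirrors prompt masking in SFT: the masked positions still appear in the model's context, but they produce no gradient.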

Cameron R. Wolfe, Ph.D.@cwolferesearch

I've been reading a ton of agentic RL papers recently. Out of all the work, one of the few commonly used tricks is action masking, but this approach is evolving with RL + world modeling papers like ECHO / PaW.

🧵 [1/N]

Deepak Vijaykeerthy@deepakvijayke

@cwolferesearch Another work you may find interesting.

Cameron R. Wolfe, Ph.D.@cwolferesearch

@deepakvijayke thanks for sharing!
