Microsoft Research's Dimitris Papailiopoulos says action masking remains one of the few widely adopted constraints in agentic RL

Posts from X

Cameron R. Wolfe, Ph.D.@cwolferesearch

Publishing a blog on agentic RL (probably the first part of many) on Monday morning. Here are the papers that are currently included:

- AgentGym-RL: https://arxiv.org/abs/2509.08755
- Agent-R1: https://arxiv.org/abs/2511.14460
- Agent-RL: https://arxiv.org/abs/2510.04206
- AutoForge: https://arxiv.org/abs/2512.22857
- RAGEN: https://arxiv.org/abs/2504.20073
- RAGEN-2: https://arxiv.org/abs/2604.06268
- ToRL: https://arxiv.org/abs/2503.23383

Also planning to cover more details on:

- Echo (https://arxiv.org/abs/2605.24517) / Paw (https://arxiv.org/abs/2606.02388) and using action masking versus running SFT on environment tokens.
- Properly setting up scalable infra for RL environments and trends in this area.
- RL training infra trends, specifically using disaggregated / asynchronous architecture.
- GLM-5.2 stability (migrating from GRPO to PPO for long horizon tasks).

Please send me more papers, I'll either try to include them in this blog or in future writeups!

Cameron R. Wolfe, Ph.D.@cwolferesearch

Here are all of the links for further reading:

- Prime Intellect blog on ECHO / PaW: https://www.primeintellect.ai/blog/true-agents-model-the-world
- Echo: https://arxiv.org/abs/2605.24517
- PaW: https://arxiv.org/abs/2606.02388

Cameron R. Wolfe, Ph.D.@cwolferesearch

Despite action masking being so common, recent papers have shown that completely removing non-action tokens from the objective is not optimal. We want the LLM to not only take action, but also form a world model (i.e., be able to predict environment observations / feedback). To do this, we want to train on both action and environmental tokens, as proposed in papers like ECHO / PaW.

🧵 [3/N]

Cameron R. Wolfe, Ph.D.@cwolferesearch

The idea of action masking is to remove the contribution of non-LLM-generated tokens (e.g., environment feedback / tool outputs) to the policy gradient. This is basically the agentic RL version of masking prompt tokens when you run SFT. The benefits of action masking have been widely replicated across different papers. As a result, this trick is almost universally adopted in recent agent papers.

🧵 [2/N]
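
To make the masking idea concrete, here is a minimal sketch in plain Python. It assumes per-token log-probabilities and advantages are already available as lists of floats, and that a 0/1 `action_mask` marks which tokens the LLM generated. The function name, the mask layout, and the choice to average over action tokens are all illustrative assumptions, not taken from any of the papers above. A real trainer would compute the same quantity on tensors with autodiff.

```python
from typing import Sequence


def masked_pg_loss(
    logprobs: Sequence[float],
    advantages: Sequence[float],
    action_mask: Sequence[int],
) -> float:
    """Average of -advantage * logprob over action tokens only.

    action_mask[t] is 1 for tokens the LLM generated and 0 for
    environment feedback / tool output tokens (hypothetical layout).
    """
    if not (len(logprobs) == len(advantages) == len(action_mask)):
        raise ValueError("all inputs must have the same length")
    n_action = sum(action_mask)
    if n_action == 0:
        return 0.0  # no LLM-generated tokens, so nothing to optimize
    # Masked tokens are multiplied by 0 and drop out of the objective.
    total = sum(
        -a * lp * m for lp, a, m in zip(logprobs, advantages, action_mask)
    )
    return total / n_action


# Toy trajectory: two action tokens, one tool-output token, one action token.
logprobs = [-0.5, -1.0, -2.0, -0.25]
advantages = [1.0, 1.0, 1.0, 1.0]
action_mask = [1, 1, 0, 1]

loss = masked_pg_loss(logprobs, advantages, action_mask)
```

Changing the log-probability of the third (tool-output) token leaves `loss` unchanged, which is the point of the mask: the policy is neither rewarded nor penalized for text it did not write.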

Cameron R. Wolfe, Ph.D.@cwolferesearch

Concretely, this can be implemented by:

1. Using RL on action tokens.
2. Using SFT on tool response tokens.

In this case, the SFT objective is formulated as RL with a constant positive advantage, allowing the SFT objective to be implemented in the normal RL policy update flow with no additional cost.

🧵 [4/N]
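
The "SFT as RL with a constant positive advantage" formulation can be sketched as follows. This is a rough illustration in plain Python, not the ECHO or PaW implementation: the helper names, the default constant of 1.0, and the averaging over all tokens are assumptions made for the example. The observation it relies on is that with an advantage of 1, the per-token policy-gradient term `-advantage * logprob` is exactly the negative log-likelihood used in SFT.

```python
from typing import List, Sequence


def mixed_advantages(
    rl_advantages: Sequence[float],
    action_mask: Sequence[int],
    sft_advantage: float = 1.0,  # hypothetical constant, acts as an SFT weight
) -> List[float]:
    """Keep the RL advantage on action tokens; use a constant elsewhere."""
    if len(rl_advantages) != len(action_mask):
        raise ValueError("rl_advantages and action_mask must match in length")
    if sft_advantage <= 0:
        raise ValueError("sft_advantage must be positive")
    return [a if m else sft_advantage for a, m in zip(rl_advantages, action_mask)]


def policy_loss(logprobs: Sequence[float], advantages: Sequence[float]) -> float:
    """Ordinary token-level policy-gradient surrogate, averaged over all tokens."""
    if len(logprobs) != len(advantages):
        raise ValueError("logprobs and advantages must match in length")
    if not logprobs:
        return 0.0
    return sum(-a * lp for lp, a in zip(logprobs, advantages)) / len(logprobs)


# Toy trajectory: the third token is a tool response, the rest are actions.
logprobs = [-0.5, -1.0, -2.0, -0.25]
rl_advantages = [0.5, 0.5, 0.0, 0.5]
action_mask = [1, 1, 0, 1]

advantages = mixed_advantages(rl_advantages, action_mask)
loss = policy_loss(logprobs, advantages)
```

Because the only change is to the advantage vector, the same loss function and update step serve both objectives, which matches the "no additional cost" remark above.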

Cameron R. Wolfe, Ph.D.@cwolferesearch

This basic trick has a large performance impact; see an example from ECHO in the image below. I find this approach especially interesting because it goes against a commonly accepted norm (action masking). I love simple and effective tricks like this, and it makes you wonder what other performance improvements are possible if we question default settings!

🧵 [5/N]

DC｜use.fo@vibecoder_dc

@cwolferesearch Action masking is dropout for your RL objective. Same idea - zero out contributions you don't want the model to overfit to. The evolution is knowing *what* to mask.

Deepak Vijaykeerthy@deepakvijayke

@cwolferesearch Another work you may find interesting.

Cameron R. Wolfe, Ph.D.@cwolferesearch

@deepakvijayke thanks for sharing!

Ishan@Ishan345

@cwolferesearch

Miles AI Wizard@MilesDigitek

@cwolferesearch Agentic RL converging on 'agents need world models' while people still insist LLMs don't have any.

Karmen@karmapple_tree

@cwolferesearch Action masking is one of those small constraints that does more philosophical work than it admits: the agent is partly defined by what it is never allowed to try.
