1d ago

ECHO augments GRPO by adding an auxiliary environment-token prediction loss, enabling agents to learn action-conditioned dynamics from unsuccessful trajectories in sparse-reward settings

It cuts environment-token cross-entropy loss to 0.07-0.09 nats on Qwen3 models.


Original post

#197@DIMITRISPAPAILOP

λux@NOVASARC01

cool research work! i liked how ECHO adds an auxiliary environment-token prediction loss to GRPO...so the agent learns both action selection and action-conditioned terminal dynamics...imo this should improve sample efficiency in sparse-reward agent RL because even unsuccessful trajectories can teach the model how the environment responds.

8:37 AM · May 18, 2026

QUOTE POST

#1430wh@NREHIEW_

Really clean approach.

Do cross entropy loss on the environment feedback. This allows the model to get supervision even on failed rollouts and helps form a sort of pseudo world model!

Dimitris Papailiopoulos@DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views

2:25 AM · May 19, 2026 · 10.2K Views
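The idea in the post above, cross-entropy on the environment feedback tokens added to the policy loss, can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names, the `env_mask` convention, and the `aux_weight` coefficient are all assumptions made for the example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def env_token_cross_entropy(logits_per_pos, targets, env_mask):
    """Mean next-token cross-entropy (in nats) over environment tokens only.

    env_mask[i] is True when position i holds a token produced by the
    environment (e.g. tool or terminal output) rather than by the agent.
    """
    total, count = 0.0, 0
    for logits, target, is_env in zip(logits_per_pos, targets, env_mask):
        if is_env:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0

def combined_loss(policy_loss, env_ce, aux_weight=1.0):
    """Policy loss plus a weighted auxiliary term (aux_weight is hypothetical)."""
    return policy_loss + aux_weight * env_ce
```

Under these assumptions, a failed rollout whose policy term is zero still yields a nonzero `combined_loss` whenever the model mispredicts the environment's response, which is the source of the extra supervision the post describes.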

#1430wh@NREHIEW_

Performance nearly doubles without any additional computation!


2:25 AM · May 19, 2026 · 1.1K Views

#1430wh@NREHIEW_

Interestingly, albeit unsurprisingly, normal GRPO does not change the representation of the environment-related tokens which is kinda to be expected given they are usually masked out. ECHO naturally does model the environment better.

(world modelling)
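Why plain GRPO leaves environment tokens without a training signal can be illustrated with a small sketch. It assumes the common GRPO formulation in which each rollout's advantage is its reward standardized within its group, and in which only agent-generated (action) tokens enter the policy loss. The helper names, the `eps` constant, and the choice of population standard deviation are assumptions for the example.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]

def policy_token_weights(advantage, action_mask):
    """Per-token weight on the policy-gradient term.

    Action tokens carry the rollout's advantage; environment tokens are
    masked out and so contribute nothing to the policy loss.
    """
    return [advantage if is_action else 0.0 for is_action in action_mask]
```

In this sketch a group in which every rollout fails gets all-zero advantages, and environment positions get zero weight regardless of the advantage, so neither case updates the model's predictions of environment output. An auxiliary loss on those positions is what supplies a signal there.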


#1430wh@NREHIEW_

Training without the GRPO term and only getting the model to learn to predict environmental responses works too!

(world modelling!)


2:25 AM · May 19, 2026 · 192 Views
