ECHO augments GRPO by adding an auxiliary environment-token prediction loss, enabling agents to learn action-conditioned dynamics from unsuccessful trajectories in sparse-reward settings.

It cuts environment-token cross-entropy loss to 0.07-0.09 nats on Qwen3 models.

Original post

cool research work! i liked how ECHO adds an auxiliary environment-token prediction loss to GRPO...so the agent learns both action selection and action-conditioned terminal dynamics...imo this should improve sample efficiency in sparse-reward agent RL because even unsuccessful trajectories can teach the model how the environment responds.

8:37 AM · May 18, 2026
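
To make the sparse-reward point concrete, here is a minimal sketch of the group-relative advantage used in GRPO-style training (reward minus the group mean, divided by the group standard deviation). The function name, the `eps` constant, the use of the population standard deviation, and the toy rewards are illustrative assumptions, not details from the ECHO work. When every rollout in a group fails, all advantages come out as zero, so the policy term gets no signal from that group.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero
    when all rewards in the group are identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]

# A group where every rollout fails: all advantages are exactly zero.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A group with one success: the successful rollout gets a positive advantage.
one_success = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

This is why an auxiliary loss that does not depend on the reward can still provide supervision on groups like `all_failed`.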

Really clean approach.

Do cross entropy loss on the environment feedback. This allows the model to get supervision even on failed rollouts and helps form a sort of pseudo world model!

Dimitris Papailiopoulos @DimitrisPapail

http://x.com/i/article/2056344151235387392

1:38 PM · May 18, 2026 · 325.3K Views
2:25 AM · May 19, 2026 · 10.2K Views
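
A rough sketch of what "cross entropy loss on the environment feedback" can look like at the token level. The role labels, the probabilities, and the helper name `masked_cross_entropy` are hypothetical; the thread does not say how ECHO normalizes or weights this term, so the per-token mean here is an assumption.

```python
import math

def masked_cross_entropy(token_logprobs, mask):
    """Mean negative log-likelihood (in nats) over positions where mask is 1."""
    selected = [-lp for lp, m in zip(token_logprobs, mask) if m]
    return sum(selected) / len(selected) if selected else 0.0

# Hypothetical rollout: each token is tagged by who produced it.
roles = ["action", "action", "env", "env", "env", "action", "env"]

# Assumed probabilities the model assigns to each observed token.
token_logprobs = [math.log(p) for p in [0.9, 0.8, 0.5, 0.25, 0.5, 0.7, 0.4]]

# The usual RL loss mask covers only model-generated (action) tokens.
action_mask = [1 if r == "action" else 0 for r in roles]

# The auxiliary loss instead covers the environment's response tokens.
env_mask = [1 if r == "env" else 0 for r in roles]

env_ce = masked_cross_entropy(token_logprobs, env_mask)
```

Nothing in `env_ce` depends on the reward, so it is defined for failed rollouts as well as successful ones.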

Performance nearly doubles without any additional computation!

whwh@nrehiew_

2:25 AM · May 19, 2026 · 1.1K Views

Interestingly, albeit unsurprisingly, normal GRPO does not change the representation of the environment-related tokens which is kinda to be expected given they are usually masked out. ECHO naturally does model the environment better.

(world modelling)

2:25 AM · May 19, 2026 · 1.1K Views

Training without the GRPO term and only getting the model to learn to predict environmental responses works too!

(world modelling!)

2:25 AM · May 19, 2026 · 192 Views
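
One way to write down the variants discussed in this thread, as a sketch rather than the exact formulation from the ECHO work. The weights $\alpha$ and $\lambda$, the index set $\mathcal{E}$ of environment-token positions, and the per-token averaging are all assumptions introduced here for illustration.

```latex
\mathcal{L}(\theta)
  = \alpha \, \mathcal{L}_{\mathrm{GRPO}}(\theta)
  + \lambda \, \mathcal{L}_{\mathrm{env}}(\theta),
\qquad
\mathcal{L}_{\mathrm{env}}(\theta)
  = -\frac{1}{|\mathcal{E}|} \sum_{t \in \mathcal{E}}
    \log \pi_\theta\!\left(x_t \mid x_{<t}\right)
```

Under this notation, $\alpha = 1, \lambda = 0$ corresponds to plain GRPO with environment tokens masked out, $\alpha = 1, \lambda > 0$ to the combined objective, and $\alpha = 0, \lambda > 0$ to the variant that trains only on predicting environment responses.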