Just my opinion:
The real reason PPO can handle long-horizon tasks? The Value Model, which gives it step-level estimates in multi-step tasks.
Honestly, it just shifts the training difficulty from GRPO’s GRM onto the Value Model. I mean, the problem isn’t solved; it’s just been relocated.
The core question remains: How do you get stable process supervision in long-horizon tasks? That’s the real bottleneck.
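To make the contrast concrete, here is a rough sketch (my own toy version, not anyone's actual training code). It assumes GRPO normalizes one scalar return per trajectory within a sampled group, and that PPO uses standard GAE on top of value predictions. The function names, the `eps` term, and the use of population standard deviation are illustrative choices.

```python
from statistics import mean, pstdev


def grpo_advantages(group_returns, eps=1e-8):
    """Group-relative advantage: one scalar per trajectory.

    Every step in a trajectory shares this value, so there is no
    per-step credit assignment. Population std is an assumption here.
    """
    mu = mean(group_returns)
    sigma = pstdev(group_returns)
    return [(r - mu) / (sigma + eps) for r in group_returns]


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: one advantage per step.

    `values` must hold len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step accumulates discounted future TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

The per-step signal in `gae_advantages` is only as good as `values`, which is exactly where the difficulty moves: a poorly trained Value Model gives noisy step-level credit on long trajectories.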
@victor207755822 thanks! Deli, what do you (or your agents) think of Zhipu's argument about PPO being strictly preferable for long multi-turn agentic RL?