
DeepSeek's Deli Chen argues PPO's success in long-horizon tasks merely shifts training difficulty to the value model

Stable process supervision remains the primary unresolved bottleneck.


Original post

Deli Chen @victor207755822

Just my opinion:

The real reason PPO can handle long-horizon tasks? The Value Model for multi-step tasks.

Honestly, it just shifts the training difficulty from GRPO’s GRM onto the Value Model. I mean, the problem isn’t solved — it’s just been relocated.

The core question remains: How do you get stable process supervision in long-horizon tasks? That’s the real bottleneck.
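To make the contrast in the post concrete, here is a minimal sketch of the two advantage estimators being compared. It is an illustration under stated assumptions, not either lab's implementation: the function names `grpo_advantages` and `gae_advantages` are hypothetical, the rewards are made-up toy numbers, and the GRPO variant shown is the common "standardize outcome rewards within a sampled group" formulation. The point is only that the group-relative estimator gives every step of a trajectory the same scalar, while the PPO-style estimator (GAE) gives per-step credit that is only as good as the value model's predictions.

```python
from statistics import mean, pstdev


def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages (assumed formulation).

    Each trajectory in the group gets ONE scalar: its outcome reward
    standardized against the group. Every step in that trajectory
    shares the same value, so there is no per-step credit assignment.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for a single trajectory.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value after the final step (0.0 for a terminal state).
    Per-step credit comes entirely from the value predictions, which is
    where the training difficulty lands.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy data (invented for illustration): four sampled rollouts, sparse reward.
group = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# One four-step rollout that succeeds at the end. The hypothetical value
# model becomes more confident after step 2, so early steps get more credit.
steps = gae_advantages(
    rewards=[0.0, 0.0, 0.0, 1.0],
    values=[0.5, 0.5, 0.9, 0.9, 0.0],
    lam=1.0,
)
print(group)
print(steps)
```

If the value predictions in `values` are noisy or biased over a long rollout, the per-step advantages inherit that error, which is one way to read the claim that the problem has been relocated rather than solved.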

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@victor207755822 thanks! Deli, what do you (or your agents) think of Zhipu's argument about PPO being strictly preferable for long multi-turn agentic RL?

9:56 AM · Jun 17, 2026 · 9.9K Views


Posts from X


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

> How do you get stable process supervision in long-horizon tasks? That’s the real bottleneck.

Wouldn't I like to know
