DeepSeek's Deli Chen argues PPO's success in long-horizon tasks merely shifts training difficulty to the value model · Digg