AI2 post-training lead Nathan Lambert argues that the core efficacy of both PPO and GRPO relies on policy gradient methods · Digg