P3O Stabilizes Off-Policy RL Training With FP8 Rollouts · Digg