P3O Stabilizes Off-Policy RL Training With FP8 Rollouts

Original post

RL is hot and has a reputation for being difficult to control. That is, unless you have the right parametrization. Off-policy RL in particular struggles with the fact that your rollouts come from a different policy than the one you're actually optimizing. This can happen if you roll out with a more powerful model, if you roll out in FP8 while training in FP16, or if you update asynchronously. In other words, almost all the time in practice.

10:45 PM · Jun 16, 2026 · 451 Views

Alex Smola@smolix

The community has developed lots of algorithms to mitigate these problems, and they work well for a particular dataset and a particular set of hyperparameters. Or you could simply quantify how close your policy is to the one being rolled out via effective sample size. This is what P3O does. Kudos to @rasoolfa, Murdock Aubry, and Nicholas Stranges from @boson_ai for pulling it off.

Blog: https://feynrl-project.github.io/blogs/episode_two.html
Paper: https://arxiv.org/pdf/2605.12380
Code: https://github.com/FeynRL-project/FeynRL
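
For readers who haven't met effective sample size before, here is a minimal sketch of the textbook (Kish) definition applied to importance weights between the training policy and the rollout policy. This is not taken from the P3O paper or the FeynRL code, and how P3O actually uses the quantity may differ. The function names and the per-sample log-probability inputs are assumptions made for illustration.

```python
import math


def effective_sample_size(logp_train, logp_rollout):
    """Kish effective sample size of the weights w_i = pi_train / pi_rollout.

    Both arguments are hypothetical per-sample log-probabilities of the same
    sampled actions, one under the policy being trained and one under the
    policy that produced the rollouts.
    """
    if len(logp_train) != len(logp_rollout) or len(logp_train) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    log_w = [a - b for a, b in zip(logp_train, logp_rollout)]
    # ESS is unchanged by rescaling the weights, so subtract the largest
    # log-weight before exponentiating to avoid overflow.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    total_sq = sum(x * x for x in w)
    return total * total / total_sq


def normalized_ess(logp_train, logp_rollout):
    """ESS divided by the sample count: 1.0 when the two policies agree on
    every sample, close to 1/n when a single sample dominates."""
    return effective_sample_size(logp_train, logp_rollout) / len(logp_train)


# Identical policies: every weight is 1, so all samples count.
same = [-1.0, -2.0, -0.5]
print(normalized_ess(same, same))

# Badly mismatched policies: one sample carries almost all the weight.
print(normalized_ess([0.0, -100.0, -100.0], [0.0, 0.0, 0.0]))
```

A normalized value near 1 says the rollouts are nearly on-policy. A value near 1/n says the batch is effectively a single sample, which is the regime where off-policy updates tend to become unstable.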
