
Weight Extrapolation of RL Checkpoints Produces Complementary Policies


Original post

🧵 Take two RL checkpoints that were trained differently: you can simply extrapolate their weights, and it works! Bonus: the extrapolated checkpoints are complementary policies -> you get exploration and diversity for free -> better inference scaling when ensembling. Paper: https://arxiv.org/abs/2605.28751

8:23 AM · May 28, 2026
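
For readers unfamiliar with the term, here is a minimal sketch of what weight extrapolation between two checkpoints usually means. It assumes the common linear form theta_a + alpha * (theta_b - theta_a), where alpha outside [0, 1] moves past one of the checkpoints; the post does not give the paper's exact formula or coefficients, so the function name `extrapolate_weights`, the toy checkpoints, and the alpha values are all hypothetical.

```python
from typing import Dict, List

# A toy "checkpoint": parameter name -> flat list of weights.
# Real checkpoints hold tensors; plain lists keep the sketch dependency-free.
StateDict = Dict[str, List[float]]


def extrapolate_weights(theta_a: StateDict, theta_b: StateDict, alpha: float) -> StateDict:
    """Return theta_a + alpha * (theta_b - theta_a), parameter by parameter.

    alpha in [0, 1] interpolates between the checkpoints;
    alpha > 1 extrapolates past theta_b, alpha < 0 past theta_a.
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: StateDict = {}
    for name, a in theta_a.items():
        b = theta_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [x + alpha * (y - x) for x, y in zip(a, b)]
    return merged


# Hypothetical toy checkpoints standing in for two differently trained RL runs.
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [2.0, 4.0], "b": [1.0]}

beyond_b = extrapolate_weights(ckpt_a, ckpt_b, 1.5)   # past checkpoint B
beyond_a = extrapolate_weights(ckpt_a, ckpt_b, -0.5)  # past checkpoint A
```

Extrapolating in both directions yields two new parameter sets from the same pair of checkpoints, which is one plausible way to obtain the several policies the post proposes to ensemble.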
