Interesting idea, good paper. I like the idea of encouraging different solution attempts and prioritizing pass@k over pass@1 for many scenarios. Great to see a method that makes use of multiple reward axes rather than just collapsing them into one scalar.
It reminds me of the SetRL/Poly-EPO approach (https://arxiv.org/abs/2604.17654): both might assign a positive reward to non-optimal solutions if they increase diversity.
I see the advantage of sequential solution attempts here, but I also think that can quickly become a bottleneck. It would be interesting to see whether the reward formulation provides similar advantages in a SetRL-like setup.
Lots of details in the appendix, well-written paper.