> The key idea is set RL, where parallel traces get reward for being useful together, even if none of them solves the problem alone.
REALLY GOOD THINKING. I'm not sure about their baseline, but we do need to directly and smartly optimize for breadth-first test-time scaling.
"SPIRAL: Learning to Search and Aggregate"
This paper gets up to 11x better scaling efficiency by training the model to search and aggregate, not just think longer.
Most reasoning RL trains a single chain of thought, but real test-time scaling uses many attempts plus a final synthesis step. This paper instead trains that full pipeline end-to-end.
The key idea is set RL, where parallel traces get reward for being useful together, even if none of them solves the problem alone.
On math reasoning, SPIRAL beats GRPO by up to 15% when scaling search plus aggregation.
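
To make the set RL idea concrete, here is a toy sketch of how a set-level reward differs from a per-trace reward. This is not SPIRAL's actual implementation: the names `solo_reward`, `set_reward`, and `set_advantages`, the union "aggregator", and the mean-baseline advantage are all assumptions made for illustration. Each trace is modeled as the set of partial findings it produced.

```python
from statistics import mean


def solo_reward(trace, required):
    """Per-trace reward: 1.0 only if this one trace has every required piece."""
    return 1.0 if required <= trace else 0.0


def set_reward(traces, required):
    """Set-level reward: 1.0 if the traces, pooled together, cover every piece.

    The union stands in for a learned aggregation step (an assumption).
    """
    pooled = set().union(*traces)
    return 1.0 if required <= pooled else 0.0


def set_advantages(trace_sets, required):
    """Score each set against the mean reward over all sampled sets.

    Every trace in a set would share that set's advantage (an assumption).
    """
    rewards = [set_reward(traces, required) for traces in trace_sets]
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical problem that needs three pieces to be solved.
required = {"lemma_a", "lemma_b", "lemma_c"}

# Complementary traces: none solves the problem alone, together they do.
complementary = [{"lemma_a"}, {"lemma_b"}, {"lemma_c"}]
# Redundant traces: mostly repeat the same partial finding.
redundant = [{"lemma_a"}, {"lemma_a"}, {"lemma_a", "lemma_b"}]

solo = [solo_reward(t, required) for t in complementary]
advantages = set_advantages([complementary, redundant], required)
```

In this toy setup every complementary trace gets a solo reward of zero, yet the complementary set as a whole scores above the redundant one, so breadth and diversity are what get reinforced.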

