SPIRAL Paper Trains Models to Search and Aggregate for 11x Better Scaling

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

> The key idea is set RL, where parallel traces get reward for being useful together, even if none of them solves the problem alone.

REALLY GOOD THINKING. I'm not sure about their baseline, but we do need to directly and smartly optimize for breadth-first test-time scaling.

alphaXiv @askalphaxiv

"SPIRAL: Learning to Search and Aggregate"

This paper gets up to 11x better scaling efficiency by training the model to search and aggregate, not just think longer.

Most reasoning RL trains a single chain of thought, but real test-time scaling uses many attempts plus a final synthesis step. This paper, however, trains that full pipeline end-to-end.

The key idea is set RL, where parallel traces get reward for being useful together, even if none of them solves the problem alone.

On math reasoning, SPIRAL beats GRPO by up to 15% when scaling search plus aggregation.

2:35 AM · Jun 29, 2026 · 450 Views
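
To make the "set RL" idea in the post above concrete, here is a toy Python sketch. It is not the paper's algorithm: the fact-merging `aggregate` function, the arithmetic task, and the choice to feed set-level rewards into a GRPO-style group normalization are all assumptions made for illustration. The only thing taken from the post is the principle that a set of parallel traces is rewarded for being useful together, even when no single trace solves the problem.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize each reward within its group.
    Population std is used here; implementations differ on this detail."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All rewards equal: no learning signal for this group.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

def set_rewards(trace_sets, aggregate, is_correct):
    """Hypothetical set-level reward: one reward per SET of traces,
    judged only on the answer produced after aggregation."""
    return [1.0 if is_correct(aggregate(traces)) else 0.0
            for traces in trace_sets]

def aggregate(traces):
    """Toy aggregator (an assumption): merge partial facts from all traces
    and answer only if both required pieces are present."""
    facts = {}
    for trace in traces:
        facts.update(trace)
    if "product" in facts and "addend" in facts:
        return facts["product"] + facts["addend"]
    return None

# Toy task: 12 * 13 + 7 = 163. No single trace below solves it alone.
trace_sets = [
    [{"product": 156}, {"addend": 7}],     # complementary traces
    [{"product": 156}, {"product": 156}],  # redundant traces
    [{"addend": 7}, {"addend": 7}],        # redundant traces
    [{"product": 150}, {"addend": 7}],     # complementary but wrong
]

rewards = set_rewards(trace_sets, aggregate, lambda a: a == 163)
advantages = group_relative_advantages(rewards)
print(rewards)     # [1.0, 0.0, 0.0, 0.0]
print(advantages)  # only the complementary, correct set is positive
```

In this sketch the redundant sets receive a negative advantage even though each of their traces is individually correct as far as it goes, which is the kind of pressure toward diverse, complementary search that the post describes.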

Sentiment

Users praise the SPIRAL paper's collaborative reward mechanism as a new breakthrough for scaling.

Pos: 100.0% · Neg: 0.0%

1 comment with sentiment.

Posts from X

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

I am somewhat aesthetically opposed to doing this with concatenated traces. I suspect LLMs still have order preferences that make within-group aggregation biased. But it's a minor effect, and it's not clear how to do that otherwise. Can you help me find the Google paper that optimized a model for high pass@n scenarios or something?
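
The order-preference concern above can be probed with a small check: run the aggregator on every ordering of the same traces and count how often each answer comes out. The Python sketch below is a toy under stated assumptions. Both aggregators are hypothetical stand-ins (a real aggregator would be an LLM reading concatenated traces), and `order_sensitivity` is an illustrative helper, not something from the paper.

```python
from collections import Counter
from itertools import permutations

def order_sensitivity(traces, aggregate):
    """Count the aggregated answer over every ordering of the traces.
    An order-invariant aggregator yields a single distinct answer."""
    return Counter(aggregate(list(order)) for order in permutations(traces))

def first_biased_aggregate(traces):
    """Hypothetical aggregator with a positional bias:
    it trusts whichever trace appears first."""
    return traces[0]["answer"]

def majority_aggregate(traces):
    """Hypothetical order-invariant aggregator: majority vote on answers."""
    counts = Counter(trace["answer"] for trace in traces)
    return counts.most_common(1)[0][0]

traces = [{"answer": 163}, {"answer": 163}, {"answer": 150}]

print(order_sensitivity(traces, first_biased_aggregate))  # two distinct answers
print(order_sensitivity(traces, majority_aggregate))      # one distinct answer
```

Enumerating all permutations only works for small sets of traces; for larger sets, sampling a few random orderings would be the obvious substitute.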

GoForceX @ Wuhan @GoForceX

@teortaxesTex DeepSeek just sent an email to API users with updates: V4 final is dropping in mid-July, and the price will increase by 100% during 9:00-12:00 and 14:00-18:00 UTC+8 (Beijing time).

Olivia.85渠道合作 @DonnieSinclair

@teortaxesTex This kind of collaborative reward mechanism really is a new breakthrough point for scaling.