Stanford researchers release SPIRAL, a reinforcement learning framework that trains LLMs to coordinate parallel and aggregative inference compute · Digg

It functions like a map-reduce architecture for language models.

Original post

Michael Y. Li@michaelyli_

Excited to share our new work on training language models to use multiple axes of inference compute — sequential, parallel, and aggregative — end-to-end, led by @jubayer_hamid and @ifdita_hasan. LLMs already use many forms of compute at test time, so they should learn to use them during training too.

How do we train this? Learning to synthesize a better answer from multiple attempts can be handled with standard RL. The harder problem is teaching models to generate a set of traces that are useful together for a downstream synthesizer to produce a better final response. This leads naturally to a set RL formulation for training models to generate these traces.

Jubayer Ibn Hamid@jubayer_hamid

The most capable reasoning systems in AI scale inference compute along several axes: sequential compute to think longer, parallel compute to sample many independent attempts, and aggregative compute to synthesize prior traces into a new improved one. But during training, we only optimize how models use sequential compute. This creates a fundamental mismatch between how we ultimately deploy these systems and how we train them, leaving much of search and synthesis unoptimized.

We introduce SPIRAL, an RL framework for making all inference-compute primitives end-to-end learnable: models learn to coordinate sequential, parallel, and aggregative reasoning using only the reward of the final output. Work with @ifdita_hasan (co-lead), @michaelyli_ , @oshaikh13 , @yoonholeee , @DorsaSadigh , @chelseabfinn , @noahdgoodman 🧵

10:35 AM · Jun 23, 2026 · 5.6K Views

Sentiment

Users praised the SPIRAL RL framework for making AI inference compute end-to-end learnable, highlighting its elegant solution to the mismatch between training and deployment.

Positive: 100.0% · Negative: 0.0% (8 comments with sentiment)

Posts from X

Most Activity

Views 29.3K · Bookmarks 125 · Likes 125 · Replies 7

sarah guo@saranormous

Cool research work on scaling inference compute

noahdgoodman@noahdgoodman

this is like map-reduce for LLMs. trained end-to-end with a neat low-variance advantage estimator. the resulting model generalizes strongly to iterated-aggregation, allowing efficient use of high test-time compute budgets.

Andreas Kirsch 🇺🇦@BlackHC

Very cool esp as we scale up test-time compute. Can we do it for arbitrary graphs of traces to do better RL when there are subagents as well?

Jubayer Ibn Hamid@jubayer_hamid

Several other results are in the paper (pass@k evaluation, entropy, etc.). This is ongoing work, and there are many next steps we plan to get into over the next couple of months to understand the kinds of behaviors SPIRAL teaches (see section 6). Please reach out with comments and feedback. Also, inference compute scaling is a ripe area of research with a large number of excellent works; if we forgot to cite any, please point us to the literature! 7/T

Preprint link: https://arxiv.org/abs/2606.23595

Jubayer Ibn Hamid@jubayer_hamid

SPIRAL optimizes both the parallel search traces and the aggregation traces in a unified manner with only the final reward as its learning signal. To optimize the search traces, we use set RL – a framework we developed in [arXiv:2509.25424, arXiv:2604.17654]. The objective gives higher credit to a set of parallel traces if it leads to higher quality downstream aggregation. To optimize the aggregation traces, we simply use GRPO over the aggregation traces sampled from each set. We develop an algorithmic recipe that keeps sample complexity similar to that of standard RL recipes. 4/T
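
A minimal sketch of the two kinds of credit described above, under assumptions that are not from the post: each prompt gets several sets of parallel traces, each set gets several aggregation traces, and every aggregation trace receives a scalar reward. The function names are hypothetical, and the plain mean baseline in `set_advantages` is a simple stand-in rather than the estimator used in the paper.

```python
from statistics import fmean, pstdev

def set_advantages(rewards):
    """Set-level credit (assumed form): one advantage per set of parallel traces.

    rewards[k][m] is the reward of the m-th aggregation trace sampled from set k.
    Every trace in set k would share the returned advantage for that set.
    """
    set_scores = [fmean(r) for r in rewards]   # how well each set aggregates
    baseline = fmean(set_scores)               # simple mean baseline (assumption)
    return [s - baseline for s in set_scores]

def grpo_advantages(group_rewards, eps=1e-6):
    """GRPO-style group-normalized advantages for aggregation traces of ONE set."""
    mu = fmean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Two sets of parallel traces, four aggregation traces sampled from each.
rewards = [
    [1.0, 1.0, 0.0, 1.0],  # set 0 usually aggregates into a correct answer
    [0.0, 0.0, 0.0, 1.0],  # set 1 rarely does
]
print(set_advantages(rewards))       # shared by all traces within each set
print(grpo_advantages(rewards[0]))   # one value per aggregation trace in set 0
```

In this toy setup the set that aggregates well gets positive credit for all of its traces, regardless of whether any individual trace was correct on its own.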

Jubayer Ibn Hamid@jubayer_hamid

Empirically, we observe that SPIRAL achieves significantly better performance as all three primitives of inference compute are scaled. We test sequential compute scaling (i.e., longer traces), majority voting (parallel traces + rule-based aggregation), and recursive self-aggregation (RSA, arXiv:2509.26626). RSA allows the model to generate parallel traces that are aggregated in a recursive manner, allowing the model to also think over longer sequences. We find that SPIRAL’s performance under RSA, which scales all three primitives, is significantly higher than that of any other (model, inference scaling) pair. 6/T
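
For reference, majority voting is the simplest of these aggregation schemes: a fixed rule over the final answers of the parallel traces, with no learned aggregator. A minimal sketch, assuming each trace has already been reduced to a final answer string (the function name is hypothetical):

```python
from collections import Counter

def majority_vote(answers):
    """Rule-based aggregation: return the most frequent final answer.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders elements with equal counts.
    """
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]

# Final answers extracted from four independent parallel traces.
print(majority_vote(["12", "15", "12", "9"]))  # prints 12
```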

Jubayer Ibn Hamid@jubayer_hamid

Taking the gradient of our objective, we arrive at two types of policy gradients. There is a standard RL policy gradient (Term 2 below) which optimizes the policy to aggregate a fixed set of traces into an optimal output. There is also a set RL gradient (Term 1 below) which optimizes the policy to sample an optimal set that the model can later aggregate into a good output. 3/T
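
The two terms can be written out under an assumed objective. This is a sketch, not the paper's exact equation: the notation ($x$ for the prompt, $\tau_1,\dots,\tau_n$ for the parallel traces, $y$ for the aggregation trace, $r$ for the final reward) and the assumption that the traces are sampled i.i.d. from the same policy $\pi_\theta$ are choices made here for illustration.

```latex
% Assumed objective: expected final reward after sampling n traces and aggregating them.
J(\theta) = \mathbb{E}_{x}\,
            \mathbb{E}_{\tau_{1:n} \sim \pi_\theta(\cdot \mid x)}\,
            \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, \tau_{1:n})}
            \big[\, r(x, y) \,\big]

% Applying the log-derivative trick to both sampling steps gives two terms.
\nabla_\theta J(\theta)
  = \underbrace{\mathbb{E}\Big[\, \bar{r}(x, \tau_{1:n})
      \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(\tau_i \mid x) \Big]}_{\text{Term 1: set RL gradient}}
  + \underbrace{\mathbb{E}\Big[\, r(x, y)\,
      \nabla_\theta \log \pi_\theta(y \mid x, \tau_{1:n}) \Big]}_{\text{Term 2: standard RL gradient}}

% where the set-level score is the expected reward of aggregating that set:
\bar{r}(x, \tau_{1:n}) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, \tau_{1:n})}\big[\, r(x, y) \,\big]
```

In Term 1 every trace in the set is weighted by the same score $\bar{r}$, which is what makes it a set-level signal; in Term 2 the set is held fixed and only the aggregation trace is credited.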

Jubayer Ibn Hamid@jubayer_hamid

All our experiments were done on @tinkerapi, thanks to a generous grant from @thinkymachines. It is truly impressive how reliable Tinker’s RL infrastructure is. We often struggled to scale any kind of RL to long-horizon traces in most other frameworks because of all sorts of stability issues; Tinker’s infra has been unbelievably stable in that regard and quite easy to work with. 8/T

Jubayer Ibn Hamid@jubayer_hamid

Current models’ usage of inference can be quite jagged. They often use sequential compute to meticulously go through routine computations but gloss over the most complex ones. They can fail to use parallel compute to explore broadly and can fail to use aggregative compute to synthesize ideas reliably – requiring us to handcraft elaborate scaffolds to mitigate these issues. In SPIRAL, during training, we sample a set of independent parallel traces, each using sequential CoT. The model then synthesizes these into a final aggregation trace which gets rewarded. SPIRAL optimizes this full pipeline end-to-end, to teach the model to use every primitive effectively and synergistically towards an optimal output. 2/T
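
The rollout structure described above can be sketched in a few lines. Everything here is a toy stand-in: `sample_trace` and `aggregate` are hypothetical placeholders for model calls, the task is guessing an integer, and the aggregator is a fixed rule rather than a learned one. The point is only the shape of the pipeline: independent traces, one aggregation, and a single reward on the final output.

```python
import random
from collections import Counter

def sample_trace(answer, rng, p_correct=0.4):
    """Hypothetical stand-in for one sequential chain-of-thought rollout.

    Returns a candidate answer: correct with probability p_correct,
    otherwise off by a small amount.
    """
    if rng.random() < p_correct:
        return answer
    return answer + rng.choice([-2, -1, 1, 2])

def aggregate(traces):
    """Hypothetical stand-in for the aggregation trace.

    A trained model would read all traces and write a new one; this toy
    version just returns the most common candidate.
    """
    return Counter(traces).most_common(1)[0][0]

def rollout(answer, n_parallel, rng):
    """One training rollout: parallel traces, one aggregation, one reward."""
    traces = [sample_trace(answer, rng) for _ in range(n_parallel)]
    final = aggregate(traces)
    reward = 1.0 if final == answer else 0.0  # only the final output is scored
    return traces, final, reward

rng = random.Random(0)
traces, final, reward = rollout(answer=7, n_parallel=4, rng=rng)
print(traces, final, reward)
```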

Jubayer Ibn Hamid@jubayer_hamid

Intuitively, SPIRAL uses a co-evolutionary procedure. The model learns to sample parallel traces that are useful during aggregation and learns to aggregate a given set of traces into an optimal final output. Optimizing the search traces is particularly interesting. Set RL optimizes sets using a common learning signal shared by all constituents in the set, which is necessary to reflect how search traces couple with one another for downstream aggregation. The set-level optimization reflects this coupling effect to allow the model to learn for itself that it must explore diverse traces — without requiring us to provide any exploration or diversity bonus! 5/T
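
A toy calculation shows why a shared set-level signal can favor diversity without any explicit bonus. The setup is entirely hypothetical: two solution strategies, five problems, and an idealized aggregator that succeeds whenever at least one trace in the set solves the problem.

```python
from itertools import combinations_with_replacement

# Hypothetical toy world: strategy A solves problems 0-2, strategy B solves 3-4.
SOLVES = {"A": {0, 1, 2}, "B": {3, 4}}
PROBLEMS = range(5)

def per_trace_reward(strategy):
    """Decoupled signal: fraction of problems this strategy solves alone."""
    return sum(p in SOLVES[strategy] for p in PROBLEMS) / len(PROBLEMS)

def set_reward(strategies):
    """Set-level signal, assuming an idealized aggregator that succeeds
    whenever at least one trace in the set solves the problem."""
    solved = sum(any(p in SOLVES[s] for s in strategies) for p in PROBLEMS)
    return solved / len(PROBLEMS)

for s in "AB":
    print(s, per_trace_reward(s))
for pair in combinations_with_replacement("AB", 2):
    print(pair, set_reward(pair))
```

Scored individually, strategy A always looks better than B, so per-trace rewards push every trace toward A. Scored as a set, the mixed pair beats both uniform pairs, so the set-level signal rewards covering both strategies.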

Jubayer Ibn Hamid@jubayer_hamid

This work also was heavily inspired by the lessons on end-to-end learning via simply providing the right affordances from NGC:

Jubayer Ibn Hamid@jubayer_hamid

Thank you, Andreas. We indeed can extend this much further!

(1) on subagents -- the recipe can be used in a setting with sub-agents executing each parallel trace independently. This has nice implications: set RL trains the subagents to do what helps the main agent to ultimately solve the problem effectively, standard RL trains the main agent to effectively use information provided by each subagent. (In fact, subagents would be wonderful in enabling even more exploration since each trace is sampled from a different model altogether)

(2) on arbitrary graphs of traces -- in principle, during training, SPIRAL could be expanded to more than 2 levels. For n levels, you can use set RL for n-1 levels and standard RL for the n-th level. This would certainly require more training compute, though, to sample all the necessary traces. In our experiments on recursive self-aggregation, we see that even after training with only 2 levels, the model already generalizes very well to n-level aggregation for n at least up to 10. I think, intuitively, 2-level training already gives you enough mastery over the compute primitives that you can compose them in many other ways at test time and your model would be able to take advantage of the extra compute.
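
A minimal sketch of what n-level aggregation at test time could look like, assuming a population-style scheme in which each intermediate level aggregates random subsets of the previous level. The `width` and `group` parameters, the function names, and the toy task (noisy numeric guesses aggregated by the median) are all illustrative assumptions, not details from the thread.

```python
import random
from statistics import median

def recursive_aggregate(sample_trace, aggregate, width, group, levels, rng):
    """n-level aggregation sketch.

    Level 1 samples `width` independent traces. Each intermediate level
    builds `width` new traces, each aggregating `group` traces drawn from
    the previous level. The last level aggregates everything into one output.
    levels=2 is the plain "parallel traces + one aggregation" setting.
    """
    if levels < 2 or not 1 <= group <= width:
        raise ValueError("need levels >= 2 and 1 <= group <= width")
    population = [sample_trace(rng) for _ in range(width)]
    for _ in range(levels - 2):
        population = [aggregate(rng.sample(population, group))
                      for _ in range(width)]
    return aggregate(population)

def noisy_guess(rng):
    """Hypothetical stand-in for a model trace: a noisy estimate of 10."""
    return 10.0 + rng.gauss(0.0, 1.0)

rng = random.Random(0)
for levels in (2, 5, 10):
    print(levels, recursive_aggregate(noisy_guess, median, width=8, group=4,
                                      levels=levels, rng=rng))
```

The aggregator is called once per new trace at every intermediate level plus once at the end, which is where the extra test-time compute goes as the number of levels grows.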

Anirudh Goyal@anirudhg9119

@jubayer_hamid https://arxiv.org/abs/2510.01123

You may find this relevant.

Stanford AI Lab@StanfordAILab

At test time, we wrap LLMs in scaffolds that scale compute every which way -- longer chains, parallel samples, and aggregation across them. So why do we still train them to use only one of these?

Introducing SPIRAL: it uses set RL to teach a model to generate responses that are collectively useful for an aggregator, and standard RL to teach it to aggregate those responses into an improved answer!

Tong Zheng@zhengtoong

@jubayer_hamid You may be interested in this work (rl trained parallel thinking)!

Jubayer Ibn Hamid@jubayer_hamid

@anirudhg9119 We cited this work! It’s under [MDG 25]; very interesting paper.

RustLabs@rustlabs_ai

@jubayer_hamid The aggregative axis seems underexplored. If inference can improve by synthesizing prior traces, should training explicitly optimize for trace recomposition rather than just rewarding the final sampled path?

Jubayer Ibn Hamid@jubayer_hamid

I suspect you might be interested in this: if you have a lot of training compute, then

(1) you can actually train the model to use an arbitrary number of parallel traces instead of a fixed number n of traces (we discuss this in section 3.3; essentially, you define the objective over varying set sizes), which should make the model learn exploration-exploitation balancing even better

(2) this same idea could be used to train the model to use a varying number of recursion levels too.

All are very interesting ideas but we did not have enough compute to try them (yet)!

Jubayer Ibn Hamid@jubayer_hamid

I would say that generating search traces that are useful for aggregation is even more underexplored! Prior works like the RSA paper did train a model to aggregate, but either did not train the model to use parallel compute (i.e., kept it frozen) or trained it in a decoupled manner (i.e., each search trace gets its own individual reward based on correctness as opposed to its usefulness to aggregation). The focus of our work is the end-to-end optimization of all the primitives in a unified manner!
