Excited to share our new work on training language models end-to-end to use multiple axes of inference compute (sequential, parallel, and aggregative), led by @jubayer_hamid and @ifdita_hasan. LLMs already use many forms of compute at test time, so they should learn to use them during training too.
How do we train this? Learning to synthesize a better answer from multiple attempts can be handled with standard RL. The harder problem is teaching models to generate a set of traces that, taken together, help a downstream synthesizer produce a better final response. This leads naturally to a set RL formulation for training models to generate these traces.
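To make the "useful together" idea concrete, here is a minimal toy sketch of set-level credit, not the method from the paper. It assumes a hypothetical majority-vote synthesizer (`majority_vote`) and exact-match scoring against a known target, and it simply gives every trace in the set the reward of the synthesized answer. All function names are illustrative.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(traces: List[str]) -> str:
    """Hypothetical stand-in synthesizer: return the most common answer."""
    return Counter(traces).most_common(1)[0][0]


def per_trace_rewards(traces: List[str], target: str) -> List[float]:
    """Baseline for contrast: score each trace on its own."""
    return [1.0 if t == target else 0.0 for t in traces]


def set_rewards(
    traces: List[str],
    target: str,
    synthesize: Callable[[List[str]], str] = majority_vote,
) -> List[float]:
    """Set-level credit (an assumption): all traces share the final answer's reward."""
    final = synthesize(traces)
    reward = 1.0 if final == target else 0.0
    return [reward] * len(traces)


traces = ["4", "4", "5"]
independent = per_trace_rewards(traces, "4")  # each trace judged alone
shared = set_rewards(traces, "4")             # set judged by the synthesized answer
```

In this toy setup the third trace is wrong on its own but still receives credit, because the set as a whole leads the synthesizer to the right answer.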
The most capable reasoning systems in AI scale inference compute along several axes: sequential compute to think longer, parallel compute to sample many independent attempts, and aggregative compute to synthesize prior traces into a new improved one. But during training, we only optimize how models use sequential compute. This creates a fundamental mismatch between how we ultimately deploy these systems and how we train them, leaving much of search and synthesis unoptimized.
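As a rough illustration of the three axes, the sketch below wires them into one inference call. It is a schematic under stated assumptions, not SPIRAL itself: `generate` and `aggregate` are hypothetical callables standing in for a model, `max_tokens` is used as a proxy for sequential compute, and `n_parallel` for parallel compute.

```python
from typing import Callable, List

# Hypothetical signatures, assumed for illustration only.
Generate = Callable[[str, int, int], str]    # (prompt, max_tokens, seed) -> trace
Aggregate = Callable[[str, List[str]], str]  # (prompt, traces) -> final answer


def run_inference(
    prompt: str,
    generate: Generate,
    aggregate: Aggregate,
    n_parallel: int,
    max_tokens: int,
) -> str:
    # Parallel axis: n_parallel independent attempts (a plain loop here).
    # Sequential axis: max_tokens bounds how long each attempt may think.
    traces = [generate(prompt, max_tokens, seed) for seed in range(n_parallel)]
    # Aggregative axis: synthesize the prior traces into one final response.
    return aggregate(prompt, traces)


def toy_generate(prompt: str, max_tokens: int, seed: int) -> str:
    """Stub model call: returns a label instead of real reasoning."""
    return f"attempt {seed} ({max_tokens} tokens)"


def toy_aggregate(prompt: str, traces: List[str]) -> str:
    """Stub synthesizer: only reports how many traces it received."""
    return f"{prompt}: synthesized from {len(traces)} traces"


answer = run_inference("2+2", toy_generate, toy_aggregate, n_parallel=3, max_tokens=64)
```

The mismatch described above corresponds to training only what happens inside `generate` while leaving the choice of traces and the `aggregate` step unoptimized.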
We introduce SPIRAL, an RL framework for making all inference-compute primitives end-to-end learnable: models learn to coordinate sequential, parallel, and aggregative reasoning using only the reward of the final output. Work with @ifdita_hasan (co-lead), @michaelyli_, @oshaikh13, @yoonholeee, @DorsaSadigh, @chelseabfinn, @noahdgoodman 🧵



