New work on Scaling Test-Time Compute for Agentic Coding:
Paper: https://arxiv.org/abs/2604.16529
This work introduces a test-time scaling framework for agentic coding that converts rollouts into structured summaries capturing key hypotheses, progress, and failure modes while discarding low-signal details.
This enables two forms of inference-time scaling: (1) Recursive Tournament Voting (RTV) for parallel selection via iterative small-group comparisons, and (2) Parallel-Distill-Refine (PDR) for sequential improvement by conditioning new rollouts on distilled summaries.
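To make the RTV idea concrete, here is a minimal sketch of selection via iterative small-group comparisons. It is an illustration under stated assumptions, not the paper's implementation: `recursive_tournament_vote`, `pick_best`, and `group_size` are hypothetical names, and `pick_best` stands in for whatever model-based comparison of summaries is actually used.

```python
import random
from typing import Callable, List, Sequence


def recursive_tournament_vote(
    candidates: Sequence[str],
    pick_best: Callable[[List[str]], int],  # hypothetical judge: returns index of the winner in a group
    group_size: int = 4,
    seed: int = 0,
) -> str:
    """Reduce candidates to a single winner through rounds of small-group comparisons."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")

    rng = random.Random(seed)
    pool = list(candidates)
    while len(pool) > 1:
        # Shuffle so that grouping does not depend on the original order.
        rng.shuffle(pool)
        groups = [pool[i:i + group_size] for i in range(0, len(pool), group_size)]
        # A leftover group of one advances without a comparison.
        pool = [g[0] if len(g) == 1 else g[pick_best(g)] for g in groups]
    return pool[0]
```

Each round shrinks the pool by roughly a factor of `group_size`, so the judge only ever compares a handful of summaries at once rather than all rollouts together.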
Our approach consistently boosts performance on frontier benchmarks: Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified and from 46.9% to 59.1% on Terminal-Bench v2.0. These gains highlight that effective test-time scaling for long-horizon agents hinges on representation, selection, and reuse, not just sampling more trajectories.
Check out a more detailed thread by @anirudhg9119.
How do coding agents get better from experience?
Past Attempts as Interface: Turn rollouts into reusable summaries that future attempts can build on.
http://arxiv.org/abs/2604.16529
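A rough sketch of "past attempts as interface" in the PDR style: each round runs several rollouts, distills them into summaries, and conditions the next round on those summaries. Everything here is assumed for illustration: `Summary`, `parallel_distill_refine`, `run_agent`, `summarize`, `rounds`, and `width` are hypothetical names, and the two callables stand in for the agent and the summarizer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Summary:
    # Fields mirror what the post says a summary captures; the real schema is an assumption.
    hypotheses: str
    progress: str
    failure_modes: str


def parallel_distill_refine(
    task: str,
    run_agent: Callable[[str, List[Summary]], str],  # hypothetical: one rollout given prior summaries
    summarize: Callable[[str], Summary],             # hypothetical: distill a rollout into a summary
    rounds: int = 2,
    width: int = 4,
) -> List[str]:
    """Run `rounds` rounds of `width` rollouts, each round conditioned on the previous round's summaries."""
    if rounds < 1 or width < 1:
        raise ValueError("rounds and width must be at least 1")

    context: List[Summary] = []  # the first round starts with no prior attempts
    rollouts: List[str] = []
    for _ in range(rounds):
        # Parallel step (run sequentially here for simplicity).
        rollouts = [run_agent(task, context) for _ in range(width)]
        # Distill step: keep only the structured summaries for the next round.
        context = [summarize(r) for r in rollouts]
    return rollouts
```

The final rollouts could then be passed to a selection step such as a tournament over their summaries.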