22h ago

Meta Paper Shows Summaries Boost Coding Agent Test-Time Scaling

0
Original post

Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw logs. i.e. stronger coding agents do not just need more attempts, but better ways to remember attempts. That sounds obvious until you look at what an agent actually produces: not an answer, but a messy trail of file reads, shell commands, errors, partial fixes, and abandoned ideas. The paper’s idea is to turn each full attempt into a compact summary of the main guess, partial progress, and failure points, then use those summaries both to pick the best attempts and to guide new ones. Test-time scaling breaks when the model cannot compare its own past work. For short answers, ranking is easy. For long-horizon coding, the bottleneck shifts from generation to representation. Once rollouts become summaries, two useful things happen. The system can run tournament-style selection over small groups of candidates, which works better than forcing one giant comparison, and it can feed the best summaries back into a fresh round of attempts instead of starting blind. --- The authors test this on 2 hard coding benchmarks by running many attempts in parallel, selecting promising summaries with a tournament style voting method, and then launching fresh attempts that can read the selected summaries first. The results are strong, with Claude 4.5 Opus rising from 70.9% to 77.6% on SWE-Bench Verified and from 46.9% to 59.1% on Terminal-Bench v2.0. What matters is that the paper says better test-time scaling for long coding agents is not mostly about making more attempts, but about storing experience in a form the agent can actually reuse. ---- Paper Link – arxiv. org/abs/2604.16529 Paper Title: "Scaling Test-Time Compute for Agentic Coding"

7:29 AM · May 23, 2026 View on X
Reposted by