Strong AI agents still struggle with long research work because they often fail to keep testing and improving.
A new paper from Stanford, MIT, NVIDIA, Google, and other top labs shows that today’s strongest research agents win less by brilliance than by refusing to stop testing.
The paper proposes AutoLab, a benchmark with 36 tasks where each agent starts from working but weak code and must make it better within a fixed time limit.
The tasks cover system speedups, puzzles, model development, and CUDA kernel work, so the test is not just about writing code once but about managing a long work session.
The authors tested 17 strong models and found that the best results came less from a good first idea than from the model staying active, testing often, and using feedback well.
The best first idea was not the strongest predictor of success; persistence was.
Claude Opus 4.6 led the benchmark not because it always guessed the right move immediately, but because it kept benchmarking and folding empirical feedback into the next attempt.
Several other frontier models failed in a more revealing way: they either quit early with time left on the clock, or thought so long that they ran out of time before submitting anything useful.
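To make the winning pattern concrete, here is a minimal Python sketch of a time-budgeted improvement loop. It is not the AutoLab harness or any model's actual strategy; `iterate_until_deadline`, `propose`, `measure`, and `clock` are hypothetical names chosen for illustration, and it assumes a higher score is better. The loop keeps the working baseline as a fallback, so there is always something to submit, and it keeps testing until the budget runs out instead of stopping early.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def iterate_until_deadline(
    baseline: T,
    propose: Callable[[T, float], T],
    measure: Callable[[T], float],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[T, float, int]:
    """Test candidates until the time budget is spent and return the best one.

    Hypothetical sketch: `propose` receives the current best candidate and
    its score (the empirical feedback) and returns the next thing to try.
    `measure` is assumed to return a score where higher is better.
    """
    deadline = clock() + budget_s

    # Start from the working-but-weak baseline so a valid result always exists.
    best, best_score = baseline, measure(baseline)
    trials = 0

    # Keep going while time remains; do not stop after the first improvement.
    while clock() < deadline:
        candidate = propose(best, best_score)
        score = measure(candidate)
        trials += 1
        # Keep a candidate only when the measurement says it is better.
        if score > best_score:
            best, best_score = candidate, score

    return best, best_score, trials
```

The `clock` parameter exists only so the loop can be driven by a fake clock in tests. A real agent would also need to leave time for a final submission, which this sketch does not model.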
----
Link – arxiv.org/abs/2606.05080
Title: "AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?"















