I am very excited about this research. We show two things:
1. If you just do random sampling (i.e., you try to solve a problem k times independently and keep the best attempt), your Elo scales linearly in log(test-time compute). Agents like Claude Code and Codex scale like that after a few hours.
2. We compare human expert coders to coding agents on the same tasks (from the AtCoder Heuristic Contest). The exciting finding is that humans scale super-linearly. This is evidence that humans do continual learning while they are solving a problem! That is, they learn more about the coding problem they are trying to solve, and so they scale fundamentally better than randomly trying things in a memoryless fashion.
This is empirical evidence that supports what many of us have felt for a while: unless we solve continual learning, we will not be able to outperform humans on tasks that take many days. Current coding agents are not able to do this kind of learning.
(1/n) New blog post from UC Berkeley, UW, and Princeton: Who scales better in long horizon: AI coding agents or top coders?
We compared modern agents to top human contestants in an open-ended coding marathon.
Agents sprinted early. Then they plateaued. Top humans kept improving.
We study this as a new test-time scaling problem: do agents learn better intrinsic test-time strategies, or are they mostly getting more random tries?
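To make the "more random tries" baseline concrete, here is a minimal toy sketch of best-of-k sampling. It is not the blog's method and it does not model Elo. The assumption (ours, purely for illustration) is that each independent attempt's score is an Exp(1) draw. Under that assumption the expected best-of-k score is the harmonic number H_k, which is about ln k + 0.577, so every doubling of attempts adds roughly the same amount (about ln 2). That is what "linear in log(test-time compute)" looks like for a memoryless sampler. The function names are hypothetical.

```python
import math
import random


def best_of_k(k, rng):
    """Score of the best of k independent, memoryless attempts."""
    # Assumption (illustrative only): each attempt's score is i.i.d. Exp(1).
    return max(rng.expovariate(1.0) for _ in range(k))


def mean_best_of_k(k, trials=2000, seed=0):
    """Monte Carlo estimate of the expected best-of-k score."""
    rng = random.Random(seed)
    return sum(best_of_k(k, rng) for _ in range(trials)) / trials


def harmonic(k):
    """H_k = 1 + 1/2 + ... + 1/k, the exact expected max of k Exp(1) draws."""
    return sum(1.0 / i for i in range(1, k + 1))


if __name__ == "__main__":
    # Doubling k each row: the expected best score grows by a roughly
    # constant step, i.e. linearly in log(k).
    for k in (1, 2, 4, 8, 16, 32, 64):
        print(
            f"k={k:3d}  simulated={mean_best_of_k(k):.3f}  "
            f"H_k={harmonic(k):.3f}  ln(k)={math.log(k):.3f}"
        )
```

A learner that carries information from one attempt to the next is not bound by this curve, which is the contrast the thread draws between top humans and memoryless retries.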