/Tech · 2h ago

Research co-authored by Alex G. Dimakis finds human experts scale super-linearly while AI coding agents scale linearly with test-time compute

Humans adapt strategies dynamically, while AI uses static random sampling.

#1315

Original post

rohit@krishnanrohit

@AlexGDimakis This is very cool!

Alex Dimakis@AlexGDimakis

I am very excited about this research. We show 2 things:

1. If you just do random sampling (i.e. you try to solve a problem k times independently and keep the best), your Elo scaling will be linear in log(test-time compute). Agents like Claude Code and Codex scale like that after a few hours.
2. We compare human expert coders to coding agents on the same tasks (from AtCoder Heuristic Contest). The exciting finding is that humans scale super-linearly.

This is evidence that humans do continual learning while they are solving a problem! I.e. they learn more about the coding problem they are trying to solve and scale fundamentally better compared to randomly trying things in a memoryless fashion.

This is empirical evidence that supports what many of us have felt for a while: unless we solve continual learning we will not be able to outperform humans in tasks that take many days. Current coding agents are not able to do this.

2:22 AM · Jun 18, 2026 · 213 Views

Sentiment

Many users called the research showing humans scale super-linearly on coding tasks versus AI agents extremely exciting and fascinating because it confirms that added test-time compute differs from actual learning.

Pos: 87.5%

Neg: 12.5%

8 comments with sentiment.

Posts from X

Most Activity

VIEWS 1.5K · RETWEETS 1

sarah guo@saranormous

@AlexGDimakis this is cool

21h1.5K5

BOOKMARKS 5

Alex Dimakis@AlexGDimakis

It’s an easy proof: assume the performance on a task is a zero-mean Gaussian with variance 1. Sample it k times. Random sampling takes the max performance over the k tries. Elo can be computed from the probability of k1 trials beating k2, i.e. the probability that the max of k1 Gaussians is bigger than the max of k2 Gaussians. Elo(k) = constant + log(k)

1d88275
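The argument in the post above can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes the conventional Elo logistic model with a 400-point, base-10 scale (the post only says "constant + log(k)"), and the function names `win_probability`, `elo_gap`, and `simulate_win_rate` are made up for illustration. It relies on one fact: among k1 + k2 independent draws from the same continuous distribution, each draw is equally likely to be the overall maximum, so the best of k1 beats the best of k2 with probability k1 / (k1 + k2).

```python
import math
import random


def win_probability(k1: int, k2: int) -> float:
    """P(best of k1 i.i.d. draws beats best of k2 i.i.d. draws).

    Each of the k1 + k2 draws is equally likely to be the overall maximum,
    so the first group holds it with probability k1 / (k1 + k2).
    """
    return k1 / (k1 + k2)


def elo_gap(k1: int, k2: int) -> float:
    """Elo difference implied by win_probability(k1, k2).

    Assumption: the usual logistic Elo model on a 400-point, base-10 scale,
    P(win) = 1 / (1 + 10 ** (-gap / 400)). Solving for gap gives
    400 * log10(k1 / k2), i.e. Elo(k) = constant + 400 * log10(k).
    """
    return 400.0 * math.log10(k1 / k2)


def simulate_win_rate(k1: int, k2: int, trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability using standard Gaussians."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        best1 = max(rng.gauss(0.0, 1.0) for _ in range(k1))
        best2 = max(rng.gauss(0.0, 1.0) for _ in range(k2))
        wins += best1 > best2
    return wins / trials


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(k, round(elo_gap(k, 1), 1), round(simulate_win_rate(k, 1), 3))
```

Under these assumptions every doubling of k adds the same amount, about 120 Elo points, which is the linear-in-log(compute) behaviour described in the post. The counting argument uses only the fact that the draws are independent, identically distributed, and continuous, so the Gaussian choice does not affect the result.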

LIKES 9

Ashwinee Panda@PandaAshwinee

@AlexGDimakis arguably there is another model of this; the agent scales super-linearly as it gets useful context, but then there is a degradation term of context rot / forgetting stuff. that could lead to the same results. i.e. if the context window were really 10M it would look superlinear.

1d44591

REPLIES 3

Yann Viegas@_Yann77

@AlexGDimakis Btw the comparison is imo a bit dishonest. Gpt 5.5 + a good harness is already better than all human contestants on AGC. I'm pretty sure a good harness for AHC would lead to similar results.

1d901

Velon@velonxbt

@AlexGDimakis the plateau part matches what ive been seeing in practice

interesting that random sampling still scales linear tho

1d1.1K41

Jack Petty@jowenpetty

very interesting analysis paradigm!

> unless we solve continual learning we will not be able to outperform humans in tasks that take many days

I’m not sure this follows. In general yes, if humans scale superlinearly vs agents they win eventually, but for any fixed horizon it may be that agents have a sufficiently good scaling coefficient to outperform humans when t < horizon; the human horizons-by-task probably drop off such that even constant-scale agents will be good-enough replacements for most things

1d69871
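The fixed-horizon point in the reply above can be made concrete with a toy calculation. This is only a sketch under loudly stated assumptions: every number below (intercepts, slope, coefficient, exponent) is invented for illustration and is not taken from the research, and `agent_elo`, `human_elo`, and `crossover_log_compute` are hypothetical names. Here `x` stands for log2 of the hours spent on a task.

```python
def agent_elo(x: float, intercept: float = 1500.0, slope: float = 100.0) -> float:
    """Hypothetical agent curve: linear in log-compute x."""
    return intercept + slope * x


def human_elo(x: float, intercept: float = 1200.0, coeff: float = 40.0,
              power: float = 2.0) -> float:
    """Hypothetical human curve: super-linear in log-compute x (power > 1)."""
    return intercept + coeff * x ** power


def crossover_log_compute(step: float = 0.01, max_x: float = 20.0):
    """Smallest grid value of x at which the human curve reaches the agent curve.

    Returns None if the curves do not cross within [0, max_x].
    """
    steps = int(round(max_x / step))
    for i in range(steps + 1):
        x = i * step
        if human_elo(x) >= agent_elo(x):
            return x
    return None


if __name__ == "__main__":
    x_star = crossover_log_compute()
    print("crossover at log2(hours) =", x_star)
```

With these made-up numbers the curves cross near x = 4.26, roughly 19 hours. For any task whose horizon is shorter than that, the agent stays ahead even though the human curve wins eventually, which is the distinction the reply draws.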

Yann Viegas@_Yann77

@AlexGDimakis I think a simpler explanation would be: agents find some idea and spend all the iterations generating small perturbations that marginally increase the score while a human would dedicate a part of his time to finding genuinely new ideas.

1d34051

Francesco Giannicola - 🇪🇺 eu/acc@fragiannicola

@AlexGDimakis This matches what I feel when using coding agents. They are great at fast starts, but long work still needs taste, memory, and many small human decisions

1d35661

Amin Karbasi@aminkarbasi

@AlexGDimakis Agents seem to be submodular, humans supermodular.

1d51841

Hunter Gon@gonlenidefi

@AlexGDimakis interesting framing so brute force approach scales predictably but peaks, while structured reasoning keeps growing

1d3331

Yann Viegas@_Yann77

@MangQiuyang @AlexGDimakis I think that in the long run, the human curve is more of a staircase with exponentially longer plateaus: it requires a really novel idea to make significant improvements, and these ideas are exponentially harder to find. Also, if the agent can get high enough fast

22h101

Obsolete26@obsolete26

@AlexGDimakis What tools did the agents have access to? Could they build their own repository of knowledge through the run?

1d930

Alex Dimakis@AlexGDimakis

@obsolete26 It’s Claude Code, so it can do whatever it wants. But it doesn’t.

1d804

Ashwinee Panda@PandaAshwinee

@MangQiuyang @AlexGDimakis yes! i have been trying to put together useful bmk data for context management but it has been quite tough. i think the first step is always to be able to measure the impact.

1d712

Qiuyang Mang@MangQiuyang

I have the same thought: bad context management is essentially a bad test-time strategy. Within a single context window, we have already seen some models exhibit superlinear scaling laws. So this may become a research question that sits at the intersection of model architecture for longer context windows and intra-window context management.

1d702

Obsolete26@obsolete26

@AlexGDimakis So full bash, file system and web access?

1d176

pspsps@Klmsgp

@AlexGDimakis Is it really memoryless?

1d147

Qiuyang Mang@MangQiuyang

I don’t know whether GPT has an optimized harness for AGC or general CP problems. But I’m wondering whether agents’ test-time scaling ability will eventually depend on task-specific harnesses.

For humans, AHC and AGC participants are largely from the same community, and many of them have also trained extensively on traditional CP problems. Btw, humans are not solving the problems under the full contest time window.

1d451

cookie_cutter@cookiec75190643

@AlexGDimakis By "continual learning", do you mean "adjustment of weights in realtime"?

1d121

Qiuyang Mang@MangQiuyang

@_Yann77 @AlexGDimakis Our work on CP harnesses may be related: https://arxiv.org/abs/2605.15177. I agree a good harness can change the final Elo, but I don't know whether it can also change the curve and the complexity.

1d141