Alex Dimakis, UC Berkeley professor, says AI coding agents scale logarithmically with compute while human experts scale super-linearly

Ashwinee Panda argues 10M-token contexts could restore super-linear scaling

Original post

Alex Dimakis@AlexGDimakis (#166 in Tech)

I am very excited about this research. We show two things:

1. If you just do random sampling (i.e., you try to solve a problem k times independently and keep the best), your Elo scaling will be linear in log(test-time compute). Agents like Claude Code and Codex scale like that after a few hours.

2. We compare human expert coders to coding agents on the same tasks (from the AtCoder Heuristic Contest). The exciting finding is that humans scale super-linearly. This is evidence that humans do continual learning while they are solving a problem! That is, they learn more about the coding problem they are trying to solve and scale fundamentally better than randomly trying things in a memoryless fashion.

This is empirical evidence that supports what many of us have felt for a while: unless we solve continual learning we will not be able to outperform humans in tasks that take many days. Current coding agents are not able to do this.

Qiuyang Mang@MangQiuyang

(1/n) New blog from UC Berkeley, UW, and Princeton: Who scales better in long horizon: AI coding agents or top coders?

We compared modern agents to top human contestants in an open-ended coding marathon.

Agents sprinted early. Then they plateaued. Top humans kept improving.

We study this as a new test-time scaling problem: do agents learn better intrinsic test-time strategies, or are they mostly getting more random tries?

3:33 PM · Jun 16, 2026 · 3.9K Views
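
The first claim in the post above, that keeping the best of k independent tries gives gains that are linear in log(compute), can be illustrated with a toy model. This is a minimal sketch, assuming each attempt's score is an independent Exp(1) draw. That distribution is an assumption made here for illustration and is not stated in the post or the blog. Under it, the expected best-of-k score is exactly the harmonic number H_k, which grows like ln(k) + 0.577. The function names are hypothetical, and the mapping from score to Elo is not modeled.

```python
import math
import random


def expected_best_of_k(k: int) -> float:
    """Exact expected maximum of k i.i.d. Exp(1) draws: the harmonic number H_k."""
    return sum(1.0 / i for i in range(1, k + 1))


def simulated_best_of_k(k: int, trials: int = 500, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # One "run": k memoryless attempts, keep only the best score.
        total += max(rng.expovariate(1.0) for _ in range(k))
    return total / trials


if __name__ == "__main__":
    # Compute budget k grows geometrically; the best score grows additively.
    for k in (1, 4, 16, 64, 256, 1024):
        print(
            f"k={k:5d}  ln(k)={math.log(k):5.2f}  "
            f"exact={expected_best_of_k(k):5.2f}  "
            f"simulated={simulated_best_of_k(k):5.2f}"
        )
```

In this toy model, each doubling of k adds roughly ln 2 (about 0.69) to the expected best score no matter how large k already is, which is a straight line when plotted against log(k).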

Sentiment

Many users called the findings fascinating and exciting: humans scale super-linearly on long coding tasks while AI agents plateau. One user called the comparison a bit dishonest.

Pos: 90.0% · Neg: 10.0% · 6 comments with sentiment.

Posts from X

Qiuyang Mang@MangQiuyang

@lihanc02 Thanks for sharing! So far, humans still seem much better at test-time strategy and context management. Hard to imagine what happens once AI can scale at test time like humans.

5h · 116 views · 7 likes

Hanchen Li@lihanc02

Aside from what the blog says, my main takeaway when Qiuyang showed me the results is: human brains are actually so heavily optimized for context management that almost no LLMs could compare...

This disputes infinite token scaling and actually completes my personal view on AI

5h

Xinyu Wang@xwangsd

@MangQiuyang @lihanc02 Any thoughts on better (agent/LLM) inference time scaling?

5h

Jayoo Hwang@JayooHwang

@lihanc02 Yeah to me it seems more like humans have short context windows (but very information dense) and the brain constantly works to put the right things in context every second.

4h

Yann Viegas@_Yann77

@AlexGDimakis I think a simpler explanation would be: agents find some idea and spend all their iterations generating small perturbations that marginally increase the score, while a human would dedicate part of their time to finding genuinely new ideas.

1h

Ashwinee Panda@PandaAshwinee

@MangQiuyang @AlexGDimakis Yes! I have been trying to put together useful benchmark data for context management, but it has been quite tough. I think the first step is always to be able to measure the impact.

30m

Qiuyang Mang@MangQiuyang

I have the same thought: bad context management is essentially a bad test-time strategy. Within a single context window, we have already seen some models exhibit super-linear scaling. So this may become a research question at the intersection of model architecture for longer context windows and intra-window context management.

1h

catid@MrCatid

@lihanc02 We're so good at coming up with efficient representations of the data we think about that our thinking is a lot more efficient than an LLM's.

3h

Hanchen Li@lihanc02

@JayooHwang Yep I think humans and AI simply have different thinking patterns

4h

Praneeth Otthi@pran_otthi

@AlexGDimakis Fascinating find!

1h

Ashwinee Panda@PandaAshwinee

@AlexGDimakis Arguably there is another model of this: the agent scales super-linearly as it gets useful context, but there is also a degradation term from context rot and forgetting. That could lead to the same results; i.e., if the context window were really 10M tokens, it would look super-linear.

2h
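
The alternative model in the reply above (super-linear gains while useful context is retained, offset by forgetting once the window is exceeded) can be written down as a toy curve. This is a sketch under loudly stated assumptions: the functional form, the exponent `p`, the slope `c`, the 1,000-token baseline, and the function name `toy_score` are all hypothetical choices made here for illustration, not anything measured in the blog or proposed in the thread. Inside the window the score grows as a power p > 1 of log-compute; past the window the agent is treated as memoryless sampling, which adds only a linear term in log-compute.

```python
import math


def toy_score(tokens: float, window: float, p: float = 1.5,
              c: float = 1.0, base: float = 1_000.0) -> float:
    """Hypothetical score after spending `tokens` with a context of `window` tokens."""
    x = math.log2(max(tokens, base) / base)   # log-compute spent so far
    x_window = math.log2(window / base)       # log-compute that fits in context
    learned = min(x, x_window) ** p           # super-linear while context is retained
    sampled = c * max(0.0, x - x_window)      # memoryless tries after the window fills
    return learned + sampled


if __name__ == "__main__":
    # Compare a 128K-token window with a 10M-token window on the same budgets.
    for exponent in range(4, 13, 2):
        tokens = 1_000 * 2 ** exponent
        small = toy_score(tokens, window=128_000)
        large = toy_score(tokens, window=10_000_000)
        print(f"tokens={tokens:>10,}  128K window={small:6.2f}  10M window={large:6.2f}")
```

With the smaller window, the curve bends into a straight line in log-compute once the budget passes 128K tokens, while the 10M-token window keeps the super-linear shape over the whole range shown. Both curves are identical before the smaller window fills, which is why the two explanations are hard to tell apart from early data alone.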

Qiuyang Mang@MangQiuyang

@xwangsd @lihanc02 I think @yoonholeee’s MetaHarness paper is highly relevant here: https://yoonholee.com/meta-harness/

5h

Hunter Gon@gonlenidefi

@AlexGDimakis Interesting framing: the brute-force approach scales predictably but peaks, while structured reasoning keeps growing.

2h

Alex Dimakis@AlexGDimakis

@PandaAshwinee @MangQiuyang I think this plot shows that after 4 hours, context management becomes the bottleneck and Claude Code falls back to random sampling of solutions.

Ashwinee Panda@PandaAshwinee

24m

Francesco Giannicola - 🇪🇺 eu/acc@fragiannicola

@AlexGDimakis This matches what I feel when using coding agents. They are great at fast starts, but long work still needs taste, memory, and many small human decisions

1h

Hanchen Li@lihanc02

@MrCatid Yeah, I think maybe one day we will see SWEs become context engineers who just design context management mechanisms for specific tasks.

3h

Yann Viegas@_Yann77

@AlexGDimakis Btw the comparison is imo a bit dishonest. Gpt 5.5 + a good harness is already better than all human contestants on AGC. I'm pretty sure a good harness for AHC would lead to similar results.

1h

Velon@velonxbt

@AlexGDimakis The plateau part matches what I've been seeing in practice.

Interesting that random sampling still scales linearly, though.

36m

nama@aman_gif

@lihanc02 metamemories on metamemories in metamemories...

1h

Hanchen Li@lihanc02

@xwangsd @MangQiuyang I think maybe the end goal is to get better context management, similar to what humans have.

4h