Tech · 1d ago

Andrej Karpathy's autoresearch experiment shows AI agent performance does not plateau when test-time compute budgets scale

Performance improved steadily over hundreds of sequential experiments.

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex · #440 in Tech

Noam is politely reminding us that if money is no issue, we have a (jagged) superintelligence already. Money is no issue for agents helping with internal research.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

10:03 PM · Jun 8, 2026 · 41.3K Views

Sentiment

Users are optimistic that stronger LLMs can keep unlocking gains via test-time compute without plateauing, because this points to far higher inference demand and ambitious future outcomes.

Pos

100.0%

Neg

0.0%

3 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 93.5K · Bookmarks 422 · Likes 692 · Retweets 59 · Replies 28

Gavin Baker@GavinSBaker

Super important post from @polynoamial and the investor TLDR is: all current estimates for compute demand might be low.

“We likely don't know what the capability ceiling is for modern LLMs because it's too expensive to measure.

Frequently when I discuss this, people ask why we don't just evaluate with a harness that pushes test-time compute until performance plateaus. The problem is that, empirically, the plateau is very far out. Sometimes we may not observe a plateau at all within practical budgets.

Notice that for the stronger models the performance improvement over time is stronger. It seems likely that as models become stronger they become more effective at operating over longer horizons. The point of plateau is pushed out, and may even disappear.”

If test-time compute performance improvement over time *effectively* scales at some ratio with training…

1d · 93.5K views · 692 likes · 422 bookmarks

Gavin Baker@GavinSBaker

Original post:

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

1d43.3K131176

Midnight Capital@Midnight_Captl

This mirrors something @MartinShkreli talked about and I covered last year: test-time compute for solving ultra-scale problems.

Example: a Fortune 100 company asks how it can raise EPS by $0.10. The answer is worth a lot, so it may be willing to spend millions of dollars on test-time compute to find it.

1d85086

Florian Brand@xeophon

@teortaxesTex MirrorCode blog was saying that

1d1.1K102

Tony Wang@TonyW

@GavinSBaker @polynoamial I just posted similar: we might already be at ASI, but it’s a function of token budget and time:

1d64122

Gavin Baker@GavinSBaker

@Simply_AI_00 @polynoamial Dyson Sphere this millennium looking more likely.

1d44861

Jon Turek@jturek18

@GavinSBaker @polynoamial Really interesting.

Curious if you have a framework for thinking about how a given amount (range) of compute shortage translates into a number (range) for forward AI capex. Or is it as simple as lack of compute = more capex?

1d66411

Julian Schrittwieser@Mononofu

The only addition I'd make is that we should make sure to use a *log* x-axis, since most of the trends involved in test-time compute are logarithmic in compute.

Julian Schrittwieser@Mononofu

Plotting benchmark results with inference cost on the x-axis is absolutely the right thing to do, great writeup by @polynoamial !

I'm also excited to see that the new https://cognition.ai/blog/frontier-code has exactly such plots

1d1.8K110
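The log-axis point can be made concrete with a small sketch. If a benchmark score grows roughly linearly in the logarithm of inference cost, a least-squares fit against log10(cost) yields a slope that reads as "points gained per 10x more compute". The function name `fit_log_linear` and the numbers below are hypothetical, invented purely for illustration; they are not taken from Noam Brown's article or from any real benchmark.

```python
import math


def fit_log_linear(costs, scores):
    """Least-squares fit of score = intercept + slope * log10(cost).

    costs: inference cost per task (any positive unit), hypothetical input.
    scores: benchmark score measured at each cost.
    Returns (intercept, slope); slope is the gain per 10x increase in cost.
    """
    if len(costs) != len(scores) or len(costs) < 2:
        raise ValueError("need at least two (cost, score) pairs of equal length")
    if any(c <= 0 for c in costs):
        raise ValueError("costs must be positive to take a logarithm")

    xs = [math.log10(c) for c in costs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("costs must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up data: the score rises by a fixed amount for every 10x in cost.
example_costs = [1.0, 10.0, 100.0, 1000.0]
example_scores = [20.0, 30.0, 40.0, 50.0]
intercept, slope = fit_log_linear(example_costs, example_scores)
print(f"score ~= {intercept:.1f} + {slope:.1f} * log10(cost)")
```

On a linear x-axis the same made-up points would look like a curve flattening out, which is easy to misread as a plateau; on a log axis they fall on a straight line. When plotting with matplotlib, `plt.xscale("log")` gives that view directly.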

Daniel A. Saedi (DataManDan)@TheRealDanSaedi

@GavinSBaker @polynoamial I think this + the low share of economic tasks that have been automated so far + incoming demand from world models/robotic foundation models means our estimates of total inference are low.

I like @JeffDean's heuristic for 10000x inference demand by 2030. It feels right.

1d46341

Candide III@CandideIII

@teortaxesTex > Money is no issue for agents helping with internal research.

It is though (insofar as money is an issue for AI companies). There is opportunity cost. Those GPU-hours might have been spent generating pr0n and nutrition recommendations for paying customers.

1d452

Simply AI@Simply_AI_00

If test-time compute keeps unlocking gains without plateauing—especially in frontier models—then we're not just underestimating inference demand. We're staring at an exponential flywheel: smarter models + longer thinking = civilization-scale compute hunger. Energy markets, grids, and chip fabs will be the new oil. Buckle up.

1d5032

Gavin Baker@GavinSBaker

@TonyW @polynoamial That is a very interesting thought. Economic limitations may be preventing us from hitting ASI with the current models.

1d3971

dani@absenteewarlord

@teortaxesTex if things worked this way across the board you would expect to see all of theoretical math get demolished by labs burning $10m per open problem

1d711

Geivn Bekar@BGaipa

@GavinSBaker @polynoamial My next Strategy!!

⬇️

1d190

JA@AitkenAdvisors

@jturek18 @GavinSBaker @polynoamial OR..

..that he with the most cash/resources can defeat all competition. Effectively blow them out of the water. 🤔🤷‍♂️

1d158

Geivn Bekar@BGaipa

@GavinSBaker My internal plan is as follows

⬇️Details are as follows

1d48

Plastic Soldier@PlastiqSoldier

@absenteewarlord @teortaxesTex You know the AI labs have actual jobs to be doing, right? Also, when Google actually used their AI for cutting-edge science they won a Nobel Prize.

1d141

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@CandideIII They really do have a ton of cheap capital

1d43

N.K@Nnoahkollerr

Agree, especially re inference workloads as we enter an agentic / "implementation" era. Inference workloads are naturally architecturally irregular, latency-sensitive, multi-step, etc. So that probably means estimates around inference (optimized hardware, faster interconnects, etc.) are low.

Shifting to training: would the eventual ceiling not be that ever-larger training runs show diminishing incremental ROI for 80% of downstream use cases (routine knowledge work, coding, etc.)?

I'm aware this framing re training compute is probably wrong and not how things work, but I think it's an interesting thought experiment regardless.

1d1252

Nicholas Mugalli@RealNickMugalli

@GavinSBaker @polynoamial Compute is going to go up by another million percent in the second half of the year…

1d2881