/Tech · 16h ago

Noam Brown, co-creator of OpenAI o1, urges benchmark creators to plot performance against test-time compute

GPT-5.5 outperforms GPT-5.4 when token budgets are equalized

Original post

Noam Brown@polynoamial · #18 in Tech

http://x.com/i/article/2057694226981257216

9:57 PM · Jun 8, 2026 · 611.5K Views

Sentiment

Users praised the article's call for LLM benchmarks to report test-time compute, arguing that evaluations are incomplete without inference costs, and criticized labs for avoiding such reporting amid high expenses.

Pos: 80.9%

Neg: 19.1%

30 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 71.8K · BOOKMARKS 345

Gavin Baker@GavinSBaker

Super important post from @polynoamial and the investor TLDR is: all current estimates for compute demand might be low.

“We likely don't know what the capability ceiling is for modern LLMs because it's too expensive to measure.

Frequently when I discuss this, people ask why we don't just evaluate with a harness that pushes test-time compute until performance plateaus. The problem is that, empirically, the plateau is very far out. Sometimes we may not observe a plateau at all within practical budgets.

Notice that for the stronger models the performance improvement over time is stronger. It seems likely that as models become stronger they become more effective at operating over longer horizons. The point of plateau is pushed out, and may even disappear.”

If test-time compute performance improvement over time *effectively* scales at some ratio with training…

7h71.8K586345

LIKES 686 · RETWEETS 53 · REPLIES 27

Noam Brown@polynoamial

We've known about LLM test-time compute scaling since @OpenAI o1. Yet 2 years later labs still report scalar evals for models; safety orgs are still surprised when a scaffold does better via 100x inference; and RSPs still ignore inference budget when deciding critical thresholds.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

6h54.2K686235

Gavin Baker@GavinSBaker

Original post:

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

7h37.8K118157

Aaron Levie@levie

Great post. So much about model performance is a function of how much compute you’re doing at inference time. This means compute-normalized benchmarks are the only logical path forward.

And yet, the challenge is that it’s a lot harder than it seems: how much compute to apply is subjective, and models behave differently at different thresholds (simplistically, model X’s min thinking may beat model Y’s min thinking, but the order may be reversed at high). There is also a near-infinite set of thresholds you could choose to set.

But either way, moving more in this direction would be great for better understanding AI progress.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

5h17.9K5555
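The ranking flip described in the post above can be illustrated with a minimal sketch. The model names, token budgets, and scores below are hypothetical placeholders, not real benchmark results, and the helper names `winner_by_budget` and `rankings_flip` are invented for this example.

```python
# Hypothetical (token budget -> score) points; NOT real benchmark results.
curves = {
    "model_x": {1_000: 0.42, 10_000: 0.55, 100_000: 0.61},
    "model_y": {1_000: 0.35, 10_000: 0.54, 100_000: 0.70},
}


def winner_by_budget(curves):
    """Return {budget: best-scoring model name} over budgets shared by all models."""
    shared = sorted(set.intersection(*(set(c) for c in curves.values())))
    return {b: max(curves, key=lambda name: curves[name][b]) for b in shared}


def rankings_flip(curves):
    """True if the best model differs between at least two shared budgets."""
    return len(set(winner_by_budget(curves).values())) > 1


if __name__ == "__main__":
    for budget, name in winner_by_budget(curves).items():
        print(f"{budget:>7} tokens: {name}")
    print("rankings flip:", rankings_flip(curves))
```

With these made-up numbers, a single scalar score reported at a low budget would favor one model and a score reported at a high budget would favor the other, which is why reporting the whole curve is more informative than any one threshold.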

Chris@ChrissGPT

Noam,

“performance plateau even farther out. If this trend continues, which I fully expect, benchmark scores that don’t account for inference compute usage will become less informative each model release cycle.”

This still leaves room for me to ask: do you think AGI (performing any median human task as well as a human, for any duration) will be a function of increasing test-time compute, or will we need to add more layers to the transformer stack or find a new architecture to reach AGI?

16h2K252

Midnight Capital@Midnight_Captl

This mirrors something @MartinShkreli talked about and I covered last year: test-time compute to solve ultra-scale problems.

Ex.: a Fortune 100 company asking how it can raise EPS by $0.10. The answer is worth a lot, so it may be willing to spend millions of dollars on test-time compute to find it.

7h85086

Erik Brynjolfsson@erikbryn

How good is a model if you let it just keep doing more and more test-time compute?

Maybe the sky's the limit.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

4h2.9K74

Norman Mu@TheNormanMu

@polynoamial what do you make of the lack of scaling on Cognition's new benchmark? https://cognition.ai/blog/frontier-code

16h1.3K82

0xSero@0xSero

@polynoamial

16h1K43

0xSero@0xSero

@polynoamial

16h2.4K102

Tony Wang@TonyW

@GavinSBaker @polynoamial I just posted similar: we might already be at ASI, but it’s a function of token budget and time:

7h64122

elie@eliebakouch

> it may turn out that the only way to confidently evaluate misalignment in an AI agent at a 1-year horizon is to actually run the agent for a year

this is a bit confusing imo, AI agent time is quite different from human time, 1 year horizon task is quite different from running the agent for 1y no?

you can probably find a hardware/parallelism config that optimizes speed for very long evals, or even trade off sequential test-time compute against parallel test-time compute? (but then it's a bit different i agree)

also output tokens are not a perfect measure for things like autoresearch, a big portion of the time is actually spent in "tool calls", which here are training runs

7h1.1K61

Mert Gulsun@mert_gulsun

@polynoamial @0xSero Indeed, that’s why our benchmarks at @ArtificialAnlys build Pareto frontiers.

From the GPT-5.5 model card:

16h1.2K19
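As a rough illustration of the Pareto-frontier idea mentioned in the post above, the sketch below keeps only the configurations that no cheaper-or-equal configuration matches or beats on score. The configuration names, costs, and scores are hypothetical, and this is not a description of how any particular benchmark provider computes its frontiers.

```python
def pareto_frontier(points):
    """Return the non-dominated points from a list of (name, cost, score).

    A point is kept only if its score is strictly higher than that of every
    point with lower or equal cost (lower cost and higher score are better).
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost ascending; on cost ties, consider the higher score first.
    for name, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append((name, cost, score))
            best_score = score
    return frontier


# Hypothetical (name, cost in dollars, score) triples; NOT real results.
points = [
    ("a-low", 1.0, 0.40),
    ("a-high", 8.0, 0.62),
    ("b-low", 2.0, 0.38),
    ("b-high", 6.0, 0.65),
]

if __name__ == "__main__":
    for name, cost, score in pareto_frontier(points):
        print(f"{name}: cost={cost}, score={score}")
```

In this made-up data, `b-low` and `a-high` drop out because a cheaper configuration scores at least as well, leaving `a-low` and `b-high` on the frontier.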

Gavin Baker@GavinSBaker

@Simply_AI_00 @polynoamial Dyson Sphere this millennium looking more likely.

7h44861

Przemek Chojecki | PC@prz_chojecki

@polynoamial Great article! We've just introduced ErdosBench, a multi-layered LLM benchmark on open math problems:

11h22821

Jon Turek@jturek18

@GavinSBaker @polynoamial Really interesting.

Curious if you have a framework for thinking about how a given amount (range) of compute shortage translates into a number (range) for forward AI capex. Or is it as simple as lack of compute = more capex?

7h66411

Nikil Ravi@nikilravi

I think the actual root cause of the problem is that it is quite hard to create a benchmark that is ~unsaturable (for a reasonable amount of time) at arbitrary amounts of test-time compute and with arbitrary scaffolds. My hypothesis is that the existence of such a benchmark will automatically incentivize the field to move in this direction. Thoughts? Wrote more about this here: https://open.substack.com/pub/nikilravi/p/on-measuring-ai?r=2g4nid&utm_medium=ios

16h77741

Daniel A. Saedi (DataManDan)@TheRealDanSaedi

@GavinSBaker @polynoamial I think this + low % of penetration of economic tasks that have been automated + incoming demand from world models/robotic foundation models means our estimates of total inference are low.

I like @JeffDean's heuristic for 10000x inference demand by 2030. It feels right.

7h46341

Rohan@proxy_vector

@polynoamial @OpenAI Agree. 'Model eval' without an inference budget is becoming as incomplete as benchmark scores without dataset details. The capability is increasingly model x scaffold x compute, not just model.

5h4811

Kol Tregaskes@koltregaskes

@polynoamial This is why I like DeepSWE as it measures time, cost and tokens used.

14h42321