AI · 16h ago

Noam Brown, OpenAI o1 co-creator, urges benchmark developers to plot LLM performance against test-time compute

Equal token budgets reveal GPT-5.5 outperforms GPT-5.4.

Original post
Noam Brown@polynoamial · #30 in AI

http://x.com/i/article/2057694226981257216

9:57 PM · Jun 8, 2026 · 611.5K Views
Sentiment

Some users praise benchmarks that compare LLM performance to test-time compute as revealing unknown capabilities, while others criticize the emphasis as overstated and claim labs overlook inference shortfalls.

Positive: 81.1%
Negative: 18.9%
15 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 55.6K
Ethan Mollick@emollick

This is worth reading.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

5h · Views 55.6K · Likes 278 · Bookmarks 221
Bookmarks 235 · Likes 686 · Retweets 53 · Replies 27
Noam Brown@polynoamial

We've known about LLM test-time compute scaling since @OpenAI o1. Yet 2 years later labs still report scalar evals for models; safety orgs are still surprised when a scaffold does better via 100x inference; and RSPs still ignore inference budget when deciding critical thresholds.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

6h · Views 54.2K · Likes 686 · Bookmarks 235
Suhail@Suhail

I had not fully considered this possibility before. Interesting.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

6h · Views 48.7K · Likes 319 · Bookmarks 151

Great post. So much about model performance is a function of how much compute you’re doing at inference time. This means compute-normalized benchmarks are the only logical path forward.

And yet it’s a lot harder than it seems. How much compute to apply is subjective, and models behave differently at different thresholds (simplistically, model X’s min thinking may beat model Y’s min thinking, but the ranking may be reversed at high), and there is a near-infinite set of thresholds you could choose to set.

But either way, moving more in this direction would be great for better understanding AI progress.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

5h · Views 17.9K · Likes 55 · Bookmarks 55
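The rank-reversal point in the post above can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the model names `model_x` and `model_y`, the token budgets, and the scores are made up and are not measurements of any real system. The sketch compares two score-versus-budget curves at equal budgets and checks whether the leader changes.

```python
# Hypothetical, made-up scores for illustration only; not real benchmark data.
# Each curve maps a test-time token budget to a benchmark score.
CURVES = {
    "model_x": {1_000: 0.42, 10_000: 0.55, 100_000: 0.61},
    "model_y": {1_000: 0.35, 10_000: 0.54, 100_000: 0.70},
}


def winner_by_budget(curves):
    """Return {budget: best-scoring model name} for budgets all models share."""
    shared = set.intersection(*(set(c) for c in curves.values()))
    return {
        budget: max(curves, key=lambda name: curves[name][budget])
        for budget in sorted(shared)
    }


def has_rank_reversal(curves):
    """True if the leading model is not the same at every shared budget."""
    return len(set(winner_by_budget(curves).values())) > 1


winners = winner_by_budget(CURVES)
for budget, name in winners.items():
    print(f"{budget:>7} tokens: {name}")
print("rank reversal:", has_rank_reversal(CURVES))
```

With these invented numbers the leader changes at the largest budget, so a single scalar score taken at any one budget would hide part of the picture.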

Plotting benchmark results with inference cost on the x-axis is absolutely the right thing to do, great writeup by @polynoamial !

I'm also excited to see that the new https://cognition.ai/blog/frontier-code has exactly such plots

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

8h · Views 8.7K · Likes 67 · Bookmarks 32
Anjney Midha@AnjneyMidha

this is a good proposal from @polynoamial

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

7h · Views 9.5K · Likes 38 · Bookmarks 28

"I believe the proper way to evaluate models is with a performance vs test-time compute plot, with either tokens, cost, or wall-clock time on the x-axis."

We can do this on Agent Arena data! Here's a plot showing net improvement vs tokens on 100K+ real agent workflows on @arena!

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

5h · Views 7.8K · Likes 54 · Bookmarks 16
Jasmine Wang@j_asminewang

This was one of my favorite talks at Recursive!

I like this recommendation: Preparedness Frameworks and Responsible Scaling Policies should explicitly account for inference compute [scaling] when determining whether a model crosses a safety threshold.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

6h · Views 3K · Likes 26 · Bookmarks 11
rohit@krishnanrohit

This is a great article. And identifying how the plateau changes with added inference for non-verifiable tasks, like writing, would be extremely useful to know. I already sometimes find a U shape between thinking and pro, so it's a useful area to note.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

6h · Views 2.8K · Likes 13 · Bookmarks 6
Chris@ChrissGPT

Noam,

“performance plateau even farther out. If this trend continues, which I fully expect, benchmark scores that don’t account for inference compute usage will become less informative each model release cycle.”

This still leaves room for me to ask whether you think AGI (any median human task done as well as a human, for any duration) will be a function of increasing test-time compute, or whether we will need to add more layers to the transformer stack or have a new architecture to reach AGI?

16h · Views 2K · Likes 25 · Bookmarks 2

How good is a model if you let it just keep doing more and more test-time compute?

Maybe the sky's the limit.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

4h · Views 2.9K · Likes 7 · Bookmarks 4
elie@eliebakouch

> it may turn out that the only way to confidently evaluate misalignment in an AI agent at a 1-year horizon is to actually run the agent for a year

this is a bit confusing imo, AI agent time is quite different from human time, 1 year horizon task is quite different from running the agent for 1y no?

you can probably find a hardware/parallelism config that optimizes speed for very long evals, or even tradeoff sequential test time compute with parallel test time compute? (but then it's a bit different i agree)

also output token is not perfect for things like autoresearch, a big portion of the time is actually spent in "tool call" which here are training runs

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

7h · Views 2.3K · Likes 16 · Bookmarks 3
Eric Horvitz@erichorvitz

Important finding.

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

10h · Views 1.5K · Likes 3 · Bookmarks 5
Norman Mu@TheNormanMu

@polynoamial what do you make of the lack of scaling on Cognition's new benchmark? https://cognition.ai/blog/frontier-code

16h · Views 1.3K · Likes 8 · Bookmarks 2
0xSero@0xSero

@polynoamial

16h · Views 1K · Likes 4 · Bookmarks 3
0xSero@0xSero

@polynoamial

16h · Views 2.4K · Likes 10 · Bookmarks 2
Mert Gulsun@mert_gulsun

@polynoamial @0xSero Indeed, that’s why our benchmarks at @ArtificialAnlys build Pareto frontiers.

From the GPT-5.5 model card:

16h · Views 1.2K · Likes 19
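For readers unfamiliar with the Pareto frontier idea mentioned in the post above, here is a minimal sketch of one common way to compute it over (cost, score) pairs. It is not a description of how any particular benchmark provider builds its charts. The `pareto_frontier` helper, the run labels, and all numbers are hypothetical and invented for illustration.

```python
def pareto_frontier(points):
    """Return the non-dominated (cost, score, label) points, cheapest first.

    A point is kept only if it scores strictly higher than every point
    that costs the same or less.
    """
    frontier = []
    best_score = float("-inf")
    # Sort by cost ascending; at equal cost, put the higher score first.
    for cost, score, label in sorted(points, key=lambda p: (p[0], -p[1])):
        if score > best_score:
            frontier.append((cost, score, label))
            best_score = score
    return frontier


# Hypothetical runs: (cost per task in dollars, benchmark score, label).
RUNS = [
    (0.10, 0.40, "a-low"),
    (0.50, 0.58, "a-high"),
    (0.30, 0.52, "b-low"),
    (0.60, 0.55, "b-high"),
    (2.00, 0.71, "c-high"),
]

frontier = pareto_frontier(RUNS)
for cost, score, label in frontier:
    print(f"{label}: cost={cost:.2f} score={score:.2f}")
```

In this invented data, `b-high` drops off the frontier because `a-high` is both cheaper and higher scoring; every other run is the best available at its cost.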

@polynoamial Great article! We've just introduced ErdosBench to get a multi-layered LLM benchmark on open math problems:

11h · Views 228 · Likes 2 · Bookmarks 1
Nikil Ravi@nikilravi

I think the actual root cause of the problem is that it is quite hard to create a benchmark that is ~unsaturable (for a reasonable amount of time) at arbitrary amounts of test-time compute and using arbitrary scaffolds. My hypothesis is that the existence of such a benchmark will automatically incentivize the field to move in this direction. Thoughts? Wrote more about this here: https://open.substack.com/pub/nikilravi/p/on-measuring-ai?r=2g4nid&utm_medium=ios

16h · Views 777 · Likes 4 · Bookmarks 1
Rohan@proxy_vector

@polynoamial @OpenAI Agree. 'Model eval' without an inference budget is becoming as incomplete as benchmark scores without dataset details. The capability is increasingly model x scaffold x compute, not just model.

5h · Views 48 · Likes 1 · Bookmarks 1