Tech · 1d ago

Agent Arena founder Anastasios Nikolas Angelopoulos argues AI models should be evaluated using performance-versus-compute plots to capture test-time trade-offs

An analysis of 100,000 workflows showed GPT-5.5 improved by 18.5%.


Nice to see more and more attention on the trade-offs between performance and test-time compute. As competition becomes more intense, test-time compute creates incentives for providers to make the models "think more": https://arxiv.org/abs/2601.21839

"I believe the proper way to evaluate models is with a performance vs test-time compute plot, with either tokens, cost, or wall-clock time on the x-axis."

We can do this on Agent Arena data! Here's a plot showing net improvement vs tokens on 100K+ real agent workflows on @arena!

10:59 AM · Jun 9, 2026 · 362 Views
Sentiment

One user in the replies asked how "improvement" is defined and, after being pointed to the Agent Arena methodology blog, dismissed the explanation as gibberish.

Pos: 0.0% · Neg: 100.0% · 1 comment with sentiment.
Cluster Engagement
Posts from X
Most Activity
VIEWS 396 · LIKES 2

For more detail on agent arena btw and how this number is calculated, read this blog: http://arena.ai/blog/agent-arena-methodology

Disclaimer, this is also vibe coded so not production ready! And the plot currently shows more of an association ATM as opposed to an effect.

1d · Views 396 · Likes 2
RETWEETS 10

"I believe the proper way to evaluate models is with a performance vs test-time compute plot, with either tokens, cost, or wall-clock time on the x-axis."

We can do this on Agent Arena data! Here's a plot showing net improvement vs tokens on 100K+ real agent workflows on @arena!

Noam Brown @polynoamial

http://x.com/i/article/2057694226981257216

1d · Views 11.7K · Likes 68 · Bookmarks 21
REPLIES 1

@canonicalmodel @arena See the blog! It is the causal treatment effect of a model with respect to an average model.

1d · Views 32
Name @canonicalmodel

@ml_angelopoulos @arena What's the definition of improvement?

1d · Views 43
M.Kasinski @M_Kasinski

Performance vs tokens is the right axis and rarely plotted, credit for running it on real workflows instead of a benchmark suite. The companion chart I want is performance vs dollars: at frontier API prices the token axis has a hard budget, so the top of that curve is theoretical for most agents. Run the same plot with a local open model and the x-axis is free, which changes which point you actually operate at.
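The budget argument in this reply can be made concrete with a small sketch. It re-expresses a performance-vs-tokens curve in dollars and picks the best point that fits a budget. Everything in it is an assumption for illustration: the `CurvePoint` type, the helper names, the prices, the budget, and the curve values are hypothetical and are not Agent Arena data. Treating a local model's marginal token cost as zero follows the reply's framing and ignores hardware and electricity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurvePoint:
    tokens: int   # test-time tokens spent per workflow
    score: float  # any performance metric where higher is better


def to_cost_curve(points, usd_per_million_tokens):
    """Re-express a performance-vs-tokens curve as (cost in USD, score) pairs."""
    return [(p.tokens * usd_per_million_tokens / 1_000_000, p.score) for p in points]


def best_under_budget(cost_curve, budget_usd):
    """Return the highest-scoring (cost, score) pair within budget, or None."""
    affordable = [pt for pt in cost_curve if pt[0] <= budget_usd]
    return max(affordable, key=lambda pt: pt[1]) if affordable else None


# Hypothetical curve: illustrative numbers only.
curve = [
    CurvePoint(10_000, 0.50),
    CurvePoint(100_000, 0.60),
    CurvePoint(1_000_000, 0.65),
]

# Assumed API price of $10 per million tokens versus a local model
# whose marginal token cost is treated as zero.
api_curve = to_cost_curve(curve, usd_per_million_tokens=10.0)
local_curve = to_cost_curve(curve, usd_per_million_tokens=0.0)

api_choice = best_under_budget(api_curve, budget_usd=1.0)
local_choice = best_under_budget(local_curve, budget_usd=1.0)
```

Under these assumed numbers, the API-priced curve stops at the 100,000-token point because the next point costs $10, while the zero-cost curve can operate at the top of the curve. That is the shift in operating point the reply describes.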

1d · Views 23
Name @canonicalmodel

@ml_angelopoulos @arena I did and I thought it was gibberish.

1d · Views 21