Tech · 1d ago

Agent Arena founder Anastasios Nikolas Angelopoulos argues AI models should be evaluated using performance-versus-compute plots to capture test-time trade-offs

An analysis of 100,000 workflows showed GPT-5.5 improved by 18.5%.

Original post

Anastasios Nikolas Angelopoulos · #872

Stratis Tsirtsis@stratis_

Nice to see more and more attention on the trade-offs between performance and test-time compute. As competition becomes more intense, test-time compute creates incentives for providers to make the models "think more": https://arxiv.org/abs/2601.21839

Anastasios Nikolas Angelopoulos@ml_angelopoulos

"I believe the proper way to evaluate models is with a performance vs test-time compute plot, with either tokens, cost, or wall-clock time on the x-axis."

We can do this on Agent Arena data! Here's a plot showing net improvement vs tokens on 100K+ real agent workflows on @arena!

10:59 AM · Jun 9, 2026 · 362 Views
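The plot described in the post can be approximated by binning workflows by token count and averaging a score within each bin. The following is a minimal sketch, not Agent Arena's method: the function name `performance_vs_tokens`, the bin edges, and the sample records are all hypothetical, and the score is a stand-in for whatever improvement metric is being plotted.

```python
from bisect import bisect_left
from collections import defaultdict


def performance_vs_tokens(records, edges):
    """Average score per token bin.

    records: iterable of (tokens, score) pairs.
    edges:   ascending upper bounds of the bins (inclusive).
    Records above the last edge are ignored.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tokens, score in records:
        i = bisect_left(edges, tokens)  # index of the first edge >= tokens
        if i < len(edges):
            sums[edges[i]] += score
            counts[edges[i]] += 1
    # One (bin upper bound, mean score) point per non-empty bin.
    return [(e, sums[e] / counts[e]) for e in edges if counts[e]]


# Made-up illustrative records, not Agent Arena data.
records = [
    (500, 0.0), (800, 1.0),       # short workflows
    (3_000, 1.0), (4_000, 1.0),   # medium workflows
    (20_000, 0.0), (30_000, 1.0), # long workflows
]
curve = performance_vs_tokens(records, edges=[1_000, 10_000, 100_000])
print(curve)
```

Each point in `curve` gives one position on a performance-versus-tokens plot. As the author notes in a follow-up, a curve like this shows an association between tokens and performance, not a causal effect.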

Sentiment

One user in the replies dismissed the Agent Arena methodology blog, which explains how the improvement metric behind the performance-versus-tokens plot is calculated, as gibberish.

Pos

0.0%

Neg

100.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 396 · LIKES 2

Anastasios Nikolas Angelopoulos@ml_angelopoulos

For more detail on agent arena btw and how this number is calculated, read this blog: http://arena.ai/blog/agent-arena-methodology

Disclaimer, this is also vibe coded so not production ready! And the plot currently shows more of an association ATM as opposed to an effect.

1d · 396 views · 2 likes

RETWEETS 10

Anastasios Nikolas Angelopoulos@ml_angelopoulos

"I believe the proper way to evaluate models is with a performance vs test-time compute plot, with either tokens, cost, or wall-clock time on the x-axis."

We can do this on Agent Arena data! Here's a plot showing net improvement vs tokens on 100K+ real agent workflows on @arena!

Noam Brown@polynoamial

http://x.com/i/article/2057694226981257216

1d11.7K6821

REPLIES 1

Anastasios Nikolas Angelopoulos@ml_angelopoulos

@canonicalmodel @arena See the blog! It is the causal treatment effect of a model with respect to an average model.

1d32

Name@canonicalmodel

@ml_angelopoulos @arena What's the definition of improvement?

1d43

M.Kasinski@M_Kasinski

Performance vs tokens is the right axis and rarely plotted, credit for running it on real workflows instead of a benchmark suite. The companion chart I want is performance vs dollars: at frontier API prices the token axis has a hard budget, so the top of that curve is theoretical for most agents. Run the same plot with a local open model and the x-axis is free, which changes which point you actually operate at.

1d23
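The performance-versus-dollars view requested in this reply amounts to rescaling the token axis by a price and then picking the best point under a budget. The sketch below assumes a simple per-million-token price for input and output tokens; the function names, prices, and curve points are hypothetical and do not reflect any provider's actual pricing.

```python
def tokens_to_dollars(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost in dollars, given prices per one million tokens."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


def best_within_budget(points, budget):
    """Return the (cost, score) point with the highest score that fits the budget."""
    affordable = [p for p in points if p[0] <= budget]
    return max(affordable, key=lambda p: p[1]) if affordable else None


# Hypothetical API prices: $2 per million input tokens, $10 per million output tokens.
api_cost = tokens_to_dollars(100_000, 20_000, in_price_per_m=2.0, out_price_per_m=10.0)

# With per-token prices set to zero, as for a locally run model, the cost axis collapses.
local_cost = tokens_to_dollars(100_000, 20_000, in_price_per_m=0.0, out_price_per_m=0.0)

# Made-up (cost in dollars, score) points along one model's curve.
points = [(0.05, 0.4), (0.40, 0.6), (2.00, 0.7)]
operating_point = best_within_budget(points, budget=0.50)
print(api_cost, local_cost, operating_point)
```

Under a $0.50 budget the highest-scoring point at $2.00 is out of reach, which illustrates the reply's argument that the top of the curve may be theoretical for budget-constrained agents.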

Name@canonicalmodel

@ml_angelopoulos @arena I did and I thought it was gibberish.

1d21