Nice to see more and more attention on the trade-off between performance and test-time compute. As competition intensifies, providers have an incentive to make their models "think more" at test time: https://arxiv.org/abs/2601.21839
"I believe the proper way to evaluate models is with a performance vs test-time compute plot, with either tokens, cost, or wall-clock time on the x-axis."
We can do this on Agent Arena data! Here's a plot showing net improvement vs tokens on 100K+ real agent workflows on @arena!
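
For anyone who wants to try this kind of comparison on their own results, here is a rough sketch of the idea: treat each model as a (tokens, score) point and keep only the ones that no other model beats on both axes. The `Run` type, the `pareto_frontier` helper, and the numbers are all made up for illustration; none of this is Agent Arena data or its actual pipeline.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """One model's aggregate result (hypothetical structure)."""
    model: str
    tokens: float  # mean tokens per task: the x-axis
    score: float   # performance metric, higher is better: the y-axis


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Keep runs that no other run matches or beats with fewer or equal tokens."""
    frontier: List[Run] = []
    best_score = float("-inf")
    # Sort by tokens ascending; on ties, put the higher score first.
    for run in sorted(runs, key=lambda r: (r.tokens, -r.score)):
        # A run earns its place only if it beats every cheaper run.
        if run.score > best_score:
            frontier.append(run)
            best_score = run.score
    return frontier


# Placeholder numbers, invented purely to exercise the function.
EXAMPLE_RUNS = [
    Run("model-a", 1000, 0.40),
    Run("model-b", 2500, 0.55),
    Run("model-c", 4000, 0.50),  # more tokens than model-b, lower score
    Run("model-d", 8000, 0.62),
]

if __name__ == "__main__":
    for run in pareto_frontier(EXAMPLE_RUNS):
        print(f"{run.model}: {run.tokens:.0f} tokens, score {run.score:.2f}")
```

Swapping `tokens` for cost or wall-clock time gives the other x-axes mentioned in the quote; the frontier logic stays the same.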