Tech · 8h ago

Arcee AI's Cody Blakeney argues @kalomaze's new benchmark should test reasoning budgets instead of disabling reasoning

Story Overview

Cody Blakeney of Arcee AI replied in the thread on @kalomaze's proposed benchmark, suggesting that the test compare model performance across several reasoning budgets rather than simply turning reasoning off. The benchmark is meant to expose raw scaling advantages that post-training cannot easily fake.

Original post

Cody Blakeney@code_star

It’s an interesting idea, but I don’t think completely turning off reasoning is quite right either.

While I expect big models to be more token efficient and solve tasks better under shorter token budgets, I also expect them to perform better under longer context / reasoning constraints.

Maybe consider looking at low / medium as well and comparing if that is closer or further than high/extra high.

kalomaze@kalomaze

i am trying to work on the closest thing possible to a true "big model smell" eval which is to say: something that measures something that clever post training can't trivially gap, and is cheap + topically diverse i can't test mythos for obvious reasons, but... hmm...

2:33 AM · Jun 14, 2026 · 473 Views

Open Question

Reasoning levels expose efficiency gaps

Blakeney expects larger models to win on token efficiency at tight budgets while still pulling ahead when more reasoning steps or context are allowed. Comparing the gap across budget levels would give a more nuanced view than a single reasoning-off setting.
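The comparison Blakeney suggests can be sketched roughly as follows. This is only an illustration under stated assumptions: the budget labels come from his post, while the function names, the use of accuracy as the metric, and the scores are all hypothetical placeholders and not results from any real eval.

```python
from typing import Dict

# Budget labels taken from the post; everything else here is hypothetical.
BUDGETS = ["low", "medium", "high", "extra_high"]


def budget_gaps(large: Dict[str, float], small: Dict[str, float]) -> Dict[str, float]:
    """Return the large-minus-small score at each reasoning budget."""
    return {b: large[b] - small[b] for b in BUDGETS}


def gap_trend(gaps: Dict[str, float]) -> str:
    """Report whether the gap narrows or widens from the lowest to the highest budget."""
    delta = gaps[BUDGETS[-1]] - gaps[BUDGETS[0]]
    if delta > 0:
        return "widens"
    if delta < 0:
        return "narrows"
    return "flat"


# Placeholder accuracies, made up purely to exercise the functions.
large_scores = {"low": 0.5, "medium": 0.625, "high": 0.75, "extra_high": 0.75}
small_scores = {"low": 0.25, "medium": 0.5, "high": 0.625, "extra_high": 0.625}

gaps = budget_gaps(large_scores, small_scores)
print(gaps)
print(gap_trend(gaps))
```

With real scores in place of the placeholders, a gap that is larger at low budgets than at high ones would match the token-efficiency expectation, and a gap that persists at high budgets would match the second half of Blakeney's claim.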

Developer Impact

Early signals on eval robustness

The thread drew quick interest from researchers, including @aidan_mclau, who called the effort "god's work." No full dataset, leaderboard site, or formal validation has appeared so far, so the practical reach of the idea remains open.

Cluster Engagement

Views: 516 · Likes: 5

Posts from X

Aidan McLaughlin@aidan_mclau

@kalomaze god’s work

Lisan al Gaib@scaling01

@kalomaze what's your definition?

Aidan McLaughlin@aidan_mclau

i coined the term "big model smell," and even i don't know what it means anymore
