It’s an interesting idea, but I don’t think completely turning off reasoning is quite right either.
While I expect big models to be more token-efficient and to solve tasks better under short token budgets, I also expect them to perform better when given longer context / reasoning budgets.
Maybe also look at low / medium and compare whether the models end up closer together or further apart there than at high / extra high.
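Something like this is roughly what I have in mind for that comparison. It's only a sketch: the function names are made up, the effort labels are whatever your harness uses, and it assumes you already have one aggregate score per model per effort level.

```python
from typing import Dict


def effort_gaps(big: Dict[str, float], small: Dict[str, float]) -> Dict[str, float]:
    """Big-minus-small score gap for each effort level present in both dicts.

    Keys are effort labels (e.g. "low", "medium", "high"); values are
    whatever aggregate score the eval produces. Both are assumptions here.
    """
    return {level: big[level] - small[level] for level in big if level in small}


def widest_gap(gaps: Dict[str, float]) -> str:
    """Return the effort level where the big model leads by the most."""
    return max(gaps, key=gaps.get)
```

If the gap at low / medium is about the same as at high / extra high, the effort setting probably isn't what separates the models; if it changes a lot, that's worth knowing before picking a single setting.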
i am trying to work on the closest thing possible to a true "big model smell" eval, which is to say: something that measures something that clever post-training can't trivially gap, and is cheap + topically diverse. i can't test mythos for obvious reasons, but... hmm...