i am trying to work on the closest thing possible to a true "big model smell" eval
which is to say: something that measures something that clever post training can't trivially gap, and is cheap + topically diverse
i can't test mythos for obvious reasons, but... hmm...
Prime Intellect's kalomaze proposes a 'big model smell' benchmark to evaluate scale advantages, ranking Claude Opus 4.6 first
The benchmark targets scale advantages post-training cannot easily bypass.
Many users praised the new eval for revealing inherent model capabilities and ranked Claude Opus 4.6 highest, while critics dismissed other versions such as Opus 4.8 or Fable as inferior or gimmicky.
if by big model smell you literally just mean total params then the ranking should probably look something like this:
GPT-4.5
Claude Fable
Gemini 3/3.1 Pro
Opus 4 / 4.1
Grok 3 / 4
Opus 3
GPT-5.5
Opus 4.5 - 4.8
GPT-4
DeepSeek-V4 Pro
Kimi-K2
Incredible
4.6 does smell the biggest (least RL-degraded) to me
this looks like a very sensitive eval. Yes, V4-Pro vs V4-Flash do have a roughly 0.1 Opus' worth of gap in perceived size and capability.
i told you guys i only use 4.6
@teortaxesTex exact precision seems esp. brutal on recent opuses (after the 4.7 continued pretrain & tokenizer switch)
newer opuses just can't help themselves. they like to paraphrase/'correct'/embellish the parts of the ground truth that are supposed to remain stable
@weeklytreeman for example,
these 4 are super hard to guess the correct order imo:
Grok 3 / 4
Opus 3
GPT-5.5
Opus 4.5 - 4.8
I think they are all around 2-3T
@kalomaze
what's your definition?
i coined the term "big model smell," and even i don't know what it means anymore

@kalomaze Wouldn't this just be like tail knowledge, no internet?
It’s an interesting idea, but I don’t think completely turning off reasoning is quite right either.
While I expect big models to be more token efficient and solve tasks better under shorter token budgets, I also expect them to perform better under longer context / reasoning constraints.
Maybe consider looking at low / medium as well and comparing if that is closer or further than high/extra high.
@scaling01 i mean i think total data is relevant to this
you wouldn't call a random init 2T big model smell
oh and GPT-5 / 5.1 (maybe also GPT-5.2 / 5.3, if they are part of the same pre-train) are probably like DeepSeek-V3 size or smaller

but pre-training token amount and quality matter of course
so in that sense Fable should have much more big model smell than GPT-4.5
@kalomaze god’s work

@weeklytreeman the thing that newer opuses seem to fail at much, much more is "not paraphrasing shit that is supposed to be identical". notably more padding/embellishing on what is supposed to be a task that requires precision
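To make the "supposed to be identical" failure concrete, here is a minimal sketch of how a grader could separate a verbatim reproduction from a near-miss paraphrase. This is not kalomaze's actual eval: the function names, the whitespace normalization, and the 0.6 similarity threshold are all assumptions made up for illustration.

```python
import difflib


def normalize_whitespace(text: str) -> str:
    # Collapse runs of whitespace so that line wrapping alone
    # does not count as a mismatch (an assumption of this sketch).
    return " ".join(text.split())


def score_verbatim(expected: str, output: str, threshold: float = 0.6) -> dict:
    """Hypothetical scorer: strict exact match, plus a paraphrase flag.

    `threshold` is an arbitrary illustrative cutoff, not a tuned value.
    """
    exp = normalize_whitespace(expected)
    out = normalize_whitespace(output)
    exact = exp == out
    # Character-level similarity in [0, 1] from the standard library.
    similarity = difflib.SequenceMatcher(None, exp, out).ratio()
    return {
        "exact": exact,
        "similarity": similarity,
        # Close to the ground truth but not identical: the model
        # likely reworded or embellished instead of copying.
        "paraphrased": (not exact) and similarity >= threshold,
    }


truth = "the quick brown fox jumps over the lazy dog"
copied = score_verbatim(truth, "the quick brown fox jumps over the lazy dog")
reworded = score_verbatim(truth, "the quick brown fox leaps over the lazy dog")
```

Under a strict exact-match metric only `copied` gets credit; `reworded` scores zero even though it is nearly right, which is the kind of precision penalty described above.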
I think big model smell does also depend on post-training
style matters too

Fable is really smelly, stinky, actually reeking

@kalomaze isn't 5.5 a router of different models?
to me it seems like different base models based on the reasoning effort
huh

@kalomaze matches my experience with the gpts and opuses in almost identical order, although i'm curious where gpt-5.2 and sonnet 4.5 stand
@kalomaze yes that's kind of my point that defining big model smell just in terms of params would be silly

@_ueaj That's one way but not the only way.

@kalomaze because in that thread you seem to be very focused on knowledge
but I think reasoning as in this paper:
also matters, as well as style, creativity, in-context learning