Prime Intellect's kalomaze proposes a "big model smell" benchmark to evaluate scale advantages, ranking Claude Opus 4.6 first

Lisan al Gaib@scaling01

if by big model smell you literally just mean total params then the ranking should probably look something like this:

GPT-4.5
Claude Fable
Gemini 3/3.1 Pro
Opus 4 / 4.1
Grok 3 / 4
Opus 3
GPT-5.5
Opus 4.5 - 4.8
GPT-4
DeepSeek-V4 Pro
Kimi-K2

kalomaze@kalomaze

i am trying to work on the closest thing possible to a true "big model smell" eval

which is to say: something that measures something that clever post training can't trivially gap, and is cheap + topically diverse

i can't test mythos for obvious reasons, but... hmm...

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Incredible

4.6 does smell the biggest (least RL-degraded) to me

this looks like a very sensitive eval. Yes, V4-Pro vs V4-Flash do have a roughly 0.1 Opus' worth of gap in perceived size and capability.

alth0u🧶@alth0u

i told you guys i only use 4.6

kalomaze@kalomaze

@teortaxesTex exact precision seems esp. brutal on recent opuses (after the 4.7 continued pretrain & tokenizer switch)

newer opuses just can't help themselves. they like to paraphrase/'correct'/embellish the parts of the ground truth that are supposed to remain stable

kalomaze@kalomaze

@weeklytreeman for example,

Lisan al Gaib@scaling01

these 4 are super hard to guess the correct order imo:

Grok 3 / 4
Opus 3
GPT-5.5
Opus 4.5 - 4.8

I think they are all around 2-3T

Lisan al Gaib@scaling01

@kalomaze

what's your definition?

Aidan McLaughlin@aidan_mclau

i coined the term "big model smell," and even i don't know what it means anymore

ueaj@_ueaj

@kalomaze Wouldn't this just be like tail knowledge, no internet?

Cody Blakeney@code_star

It’s an interesting idea, but I don’t think completely turning off reasoning is quite right either.

While I expect big models to be more token efficient and solve tasks better under shorter token budgets, I also expect them to perform better under longer context / reasoning constraints.

Maybe consider looking at low / medium as well and comparing whether the models end up closer together or further apart than at high / extra high.
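
One way to run the comparison suggested above is to compute the score gap between two models at each reasoning-effort level and see where it widens or narrows. This is only a sketch under assumptions: the effort labels, the `gap_by_effort` helper, and the numbers are placeholders for illustration, not measurements from the eval.

```python
def gap_by_effort(scores_a, scores_b):
    """Score gap (model A minus model B) at each effort level both models share."""
    return {
        effort: scores_a[effort] - scores_b[effort]
        for effort in scores_a
        if effort in scores_b
    }


# Placeholder scores in percentage points (made up, not real results).
big_model = {"low": 40, "medium": 55, "high": 70}
small_model = {"low": 20, "medium": 45, "high": 65}

gaps = gap_by_effort(big_model, small_model)
for effort, gap in gaps.items():
    print(f"{effort}: gap = {gap} points")
```

If the gap shrinks as the reasoning budget grows, the eval is mostly picking up something that extra reasoning can compensate for; if it holds or grows, it is closer to measuring the base model.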

kalomaze@kalomaze

@scaling01 i mean i think total data is relevant to this

you wouldn't call a random init 2T big model smell

Lisan al Gaib@scaling01

oh and GPT-5 / 5.1, maybe also GPT-5.2 / 5.3 if they are part of the same pre-train, are probably like DeepSeek-V3 size or smaller

Lisan al Gaib@scaling01

but pre-training token amount and quality matter of course

so in that sense Fable should have much more big model smell than GPT-4.5

Aidan McLaughlin@aidan_mclau

@kalomaze god’s work

kalomaze@kalomaze

@weeklytreeman the thing that newer opuses seem to fail at much, much more is "not paraphrasing shit that is supposed to be identical". notably more padding/embellishing on what is supposed to be a task that requires precision
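
A minimal sketch of how "not paraphrasing what is supposed to be identical" could be scored, assuming the task supplies a list of spans that must appear character-for-character in the model output. The thread does not describe how the actual eval is scored, so `score_verbatim` and its inputs are hypothetical names used for illustration only.

```python
def score_verbatim(required_spans, output):
    """Return (score, missed): the fraction of spans found verbatim, and the spans that were not."""
    if not required_spans:
        return 1.0, []
    # A span counts only if it appears exactly; any paraphrase or "correction" is a miss.
    missed = [span for span in required_spans if span not in output]
    score = 1.0 - len(missed) / len(required_spans)
    return score, missed


# Toy example with made-up text (not taken from the eval).
spans = ["the quick brown fox", "42.0 km"]
output = "Summary: the quick brown fox ran roughly 42 km."

score, missed = score_verbatim(spans, output)
print(score, missed)
```

Here the output rewrites "42.0 km" as "42 km", so that span is counted as a miss even though the meaning is nearly the same, which is the kind of embellishment the tweet describes.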

Lisan al Gaib@scaling01

I think big model smell does also depend on post-training

style matters too

Lisan al Gaib@scaling01

Fable is really smelly, stinky, actually reeking
