i am trying to work on the closest thing possible to a true "big model smell" eval, which is to say: something that measures something that clever post training can't trivially gap, and is cheap + topically diverse. i can't test mythos for obvious reasons, but... hmm...
Prime Intellect ML researcher kalomaze proposes an LLM benchmark measuring inherent scale, where Anthropic's Opus-4.6 currently leads GPT-5.5
The low-cost test measures advantages post-training cannot bridge
Many users express excitement about a new eval measuring inherent model capabilities beyond post-training that ranks Claude Opus 4.6 highest for its quality and vividness, while a few criticize precision flaws in later Opus versions.
Incredible. 4.6 does smell the biggest (least RL-degraded) to me. this looks like a very sensitive eval. Yes, V4-Pro vs V4-Flash do have a roughly 0.1 Opus' worth of gap in perceived size and capability.
@teortaxesTex exact precision seems esp. brutal on recent opuses (after the 4.7 continued pretrain & tokenizer switch). newer opuses just can't help themselves. they like to paraphrase/'correct'/embellish the parts of the ground truth that are supposed to remain stable
@weeklytreeman for example,

@kalomaze Wouldn't this just be like tail knowledge no internet

@weeklytreeman the thing that newer opuses seem to fail at much, much more is "not paraphrasing shit that is supposed to be identical". notably more padding/embellishing on what is supposed to be a task that requires precision
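
A minimal sketch of what an "exact precision" check of this kind could look like. The thread does not describe the actual scoring, so everything here is an assumption: the hypothetical `verbatim_score` function simply counts how many ground-truth spans survive character-for-character in a model's output, so that paraphrasing or embellishing a span counts as a miss.

```python
def verbatim_score(required_spans, output):
    """Fraction of required spans found character-for-character in output.

    Hypothetical helper for illustration only; not the eval's real metric.
    """
    if not required_spans:
        return 1.0
    kept = sum(1 for span in required_spans if span in output)
    return kept / len(required_spans)


# Two spans that are supposed to stay identical in the model's answer.
spans = ["the quick brown fox", "jumps over the lazy dog"]

faithful = "He wrote: the quick brown fox jumps over the lazy dog."
embellished = "He wrote: the quick, clever brown fox jumps over the lazy dog."

print(verbatim_score(spans, faithful))
print(verbatim_score(spans, embellished))
```

Strict substring matching is deliberately unforgiving here: an edit-distance or fuzzy match would give partial credit for exactly the "helpful" rewording the reply complains about.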

@kalomaze matches my experience with the gpts and opuses almost in identical order, although im curious where gpt-5.2 and sonnet 4.5 stand
i told you guys i only use 4.6

@_ueaj That's one way but not the only way.

@Sauers_ @kalomaze I'd argue the vibes are from tail knowledge or patterns tbh

@kalomaze How do you get effort (which is what I assume the setting column refers to) as off for opus 4.8?

@neev_parikh off/none, potayto, potahto

when you scale a model you scale
1. number of non-linearities per token (model depth)
2. attention parameters (ICL)
3. MLP parameters total (tail knowledge)
4. MLP parameters active (CoT efficiency, non-reasoning tail knowledge, another subtle thing I can't really pin down that I feel like Claude has)
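
A rough sketch of how the four axes above separate in a parameter count. It assumes a plain transformer with four square attention projections (Q, K, V, output), an ungated two-matrix MLP, no biases, and a mixture-of-experts MLP where only some experts run per token. The function name and all sizes are hypothetical and are not taken from any model discussed in the thread.

```python
def scale_axes(n_layers, d_model, d_ff, n_experts=1, experts_per_token=1):
    """Approximate parameter counts along the four scaling axes.

    Assumptions (illustrative only): Q, K, V, O projections of size
    d_model x d_model, an ungated MLP (up + down projection) per expert,
    no biases, embeddings and norms ignored.
    """
    attn_per_layer = 4 * d_model * d_model
    mlp_per_expert = 2 * d_model * d_ff
    return {
        "depth": n_layers,
        "attention_params": n_layers * attn_per_layer,
        "mlp_params_total": n_layers * n_experts * mlp_per_expert,
        "mlp_params_active": n_layers * experts_per_token * mlp_per_expert,
    }


# Dense model: total and active MLP parameters coincide.
dense = scale_axes(n_layers=2, d_model=1024, d_ff=4096)

# Sparse model: 8 experts, 2 active per token, same depth and width.
sparse = scale_axes(n_layers=2, d_model=1024, d_ff=4096,
                    n_experts=8, experts_per_token=2)

print(dense)
print(sparse)
```

In the sparse case total MLP parameters grow 8x while active MLP parameters grow only 2x, which is the total-versus-active split the list draws.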

@_ueaj @kalomaze This would work to empirically correlate with size but not capture the vibes

@kalomaze This is really fucking cool keep going

@allisonology @Sauers_ @kalomaze I noticed this a lot with fable when it was there yes, context reactivity is significantly better.
model processing depth, ICL are probably the other big ones. ICL is kinda tail knowledge but more so tail behavior or adaptivity

@kalomaze if the big model has pass^k similar to small models but requires higher tokens at high costs that’s a big model smell imo
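
For reference, a small sketch of one common way to estimate pass^k, assuming it is meant here as the probability that all k independent attempts at a task succeed (as opposed to pass@k, where at least one must succeed). Given n recorded attempts with c successes, the estimate is C(c, k) / C(n, k). The function name `pass_hat_k` is made up for illustration.

```python
from math import comb


def pass_hat_k(n, c, k):
    """Estimate the chance that k attempts drawn from n all succeed.

    n: attempts recorded for the task, c: how many succeeded,
    k: number of attempts that must all pass (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # math.comb returns 0 when k > c, so too few successes gives 0.0.
    return comb(c, k) / comb(n, k)


print(pass_hat_k(10, 8, 1))
print(pass_hat_k(10, 8, 2))
```

Averaging this over tasks gives a reliability number that drops quickly with k for a flaky model, which is why it can look similar across model sizes even when token cost differs a lot.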

@_ueaj @Sauers_ @kalomaze Idk I think it's more than just this. It's also like adaptability at longer context lengths, output styles.. Maybe tail knowledge would cover total params but not active? And i think active params is a large contributor in smell.

@kalomaze good job buddy keep it up. interested


@kalomaze Have you seen this technique?

@kalomaze Similar leaderboard results from my testing, esp regressions in opus 4.7 and opus 4.8 on real world tasks