Tech · 4h ago

Prime Intellect ML researcher kalomaze proposes an LLM benchmark measuring inherent scale, where Anthropic's Claude Opus 4.6 currently leads GPT-5.5

The low-cost test measures advantages post-training cannot bridge

Original post

kalomaze@kalomaze

i am trying to work on the closest thing possible to a true "big model smell" eval

which is to say: something that measures something that clever post training can't trivially gap, and is cheap + topically diverse

i can't test mythos for obvious reasons, but... hmm...

8:21 PM · Jun 13, 2026 · 21.3K Views

Sentiment

Many users express excitement about a new eval measuring inherent model capabilities beyond post-training that ranks Claude Opus 4.6 highest for its quality and vividness, while a few criticize precision flaws in later Opus versions.

Positive: 90.0%

Negative: 10.0%

17 comments with sentiment.

Posts from X

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Incredible. 4.6 does smell the biggest (least RL-degraded) to me. This looks like a very sensitive eval. Yes, V4-Pro vs V4-Flash do have a roughly 0.1 Opus' worth of gap in perceived size and capability.

kalomaze@kalomaze

kalomaze@kalomaze

@teortaxesTex exact precision seems esp. brutal on recent opuses (after the 4.7 continued pretrain & tokenizer switch)

newer opuses just can't help themselves. they like to paraphrase/'correct'/embellish the parts of the ground truth that are supposed to remain stable

kalomaze@kalomaze

@weeklytreeman for example,

ueaj@_ueaj

@kalomaze Wouldn't this just be like tail knowledge no internet

kalomaze@kalomaze

@weeklytreeman the thing that newer opuses seem to fail at much, much more is "not paraphrasing shit that is supposed to be identical". notably more padding/embellishing on what is supposed to be a task that requires precision
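
To make the "supposed to be identical" failure concrete, here is a minimal sketch of one way such precision could be scored. It is an illustration only, not kalomaze's eval: the function name `verbatim_fidelity` and the token-level matching are assumptions, and the thread does not describe how the real benchmark grades outputs.

```python
from difflib import SequenceMatcher


def verbatim_fidelity(ground_truth: str, output: str) -> float:
    """Fraction of ground-truth tokens that reappear, in order, in the output.

    Hypothetical helper: 1.0 means every token of the reference text survived
    verbatim; paraphrased, "corrected", or dropped tokens lower the score.
    """
    truth_tokens = ground_truth.split()
    output_tokens = output.split()
    if not truth_tokens:
        return 1.0  # nothing to preserve
    matcher = SequenceMatcher(a=truth_tokens, b=output_tokens, autojunk=False)
    # Matching blocks are in-order runs of identical tokens.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_tokens)
```

Because the score is normalized by the length of the reference, padding around a faithfully reproduced passage is not penalized here; an eval that also cares about embellishment would need a second term for extra output tokens.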

Luxun's alt@weeklytreeman

@kalomaze matches my experience with the gpts and opuses almost in identical order, although im curious where gpt-5.2 and sonnet 4.5 stand

alth0u🧶@alth0u

i told you guys i only use 4.6

kalomaze@kalomaze

kalomaze@kalomaze

@_ueaj That's one way but not the only way.

ueaj@_ueaj

@Sauers_ @kalomaze I'd argue the vibes are from tail knowledge or patterns tbh

Neev Parikh@neev_parikh

@kalomaze How do you get effort (which is what I assume the setting column refers to) as off for opus 4.8?

kalomaze@kalomaze

@neev_parikh off/none, potayto, potahto

ueaj@_ueaj

when you scale a model you scale

1. number of non-linearities per token (model depth)
2. attention parameters (ICL)
3. MLP parameters total (tail knowledge)
4. MLP parameters active (CoT efficiency, non reasoning tail knowledge, another subtle thing I can't really pin down that I feel like Claude has)

Sauers@Sauers_

@_ueaj @kalomaze This would work to empirically correlate with size but not capture the vibes

Allison Intelligence (AI)@allisonology

@kalomaze This is really fucking cool keep going

ueaj@_ueaj

@allisonology @Sauers_ @kalomaze I noticed this a lot with fable when it was there yes, context reactivity is significantly better.

model processing depth, ICL are probably the other big ones. ICL is kinda tail knowledge but more so tail behavior or adaptivity

interstellarninja@intrstllrninja

@kalomaze if the big model has pass^k similar to small models but requires higher tokens at high costs that’s a big model smell imo
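
For readers unfamiliar with the notation, pass^k is usually read as the probability that all k independent attempts at a task succeed (in contrast to pass@k, where at least one must succeed). A minimal sketch under that assumption, using the standard combinatorial estimate from n recorded trials with c successes; the function name `pass_hat_k` is hypothetical and nothing here comes from the eval discussed in the thread.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k for one task: the chance that k trials drawn without
    replacement from n recorded trials (c of them successful) all succeed."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # comb(c, k) is 0 when k > c, so the estimate is 0 if there are too few successes.
    return comb(c, k) / comb(n, k)
```

The comparison suggested in the reply would pair this reliability number with the tokens (and cost) each model spent to reach it, so two models with equal pass^k can still differ in how expensively they got there.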

Allison Intelligence (AI)@allisonology

@_ueaj @Sauers_ @kalomaze Idk I think it's more than just this. It's also like adaptability at longer context lengths, output styles.. Maybe tail knowledge would cover total params but not active? And i think active params is a large contributor in smell.
