
WeirdML benchmark finds Claude Opus 4.8 xhigh trails GPT-5.5 xhigh but achieves 82.9% accuracy using 129 lines of code

Disabling thinking dropped Claude's accuracy to 70.5%.


Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex · #501 in Tech

This is such a zany pattern if you think about it

Håvard Ihle @htihle

Claude Opus 4.8 (xhigh) scores 82.9% on WeirdML, right behind GPT 5.5.

We now also (unlike 4.7) see a clear scaling with output token use:
- no thinking: 2.4k tokens, 70.5%
- medium: 4.3k, 76.0%
- xhigh: 12.5k, 82.9%

4:17 AM · Jun 1, 2026 · 6.4K Views
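
To make the scaling concrete, the three reported settings can be turned into accuracy points gained per additional thousand output tokens. This is a minimal sketch that uses only the figures in the post above; the names `SETTINGS` and `marginal_gains` are illustrative and not part of WeirdML.

```python
# Reported settings from the post: (label, output tokens in thousands, accuracy in %).
SETTINGS = [
    ("no thinking", 2.4, 70.5),
    ("medium", 4.3, 76.0),
    ("xhigh", 12.5, 82.9),
]


def marginal_gains(settings):
    """Return accuracy points gained per extra 1k output tokens between consecutive settings."""
    gains = []
    for (a_name, a_tok, a_acc), (b_name, b_tok, b_acc) in zip(settings, settings[1:]):
        step = f"{a_name} -> {b_name}"
        gains.append((step, (b_acc - a_acc) / (b_tok - a_tok)))
    return gains


for step, gain in marginal_gains(SETTINGS):
    print(f"{step}: {gain:.2f} points per 1k tokens")
```

On these numbers the step from no thinking to medium yields roughly 2.9 points per 1k tokens and the step from medium to xhigh roughly 0.8, so accuracy keeps rising with token use but with diminishing returns.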

Sentiment

Positive users praise Claude Opus 4.8 for nearly matching GPT-5.5 on WeirdML with far fewer lines of code, citing the lower maintenance burden. Negative users dismiss the results as insignificant or insult those promoting them.

Pos: 35.7% · Neg: 64.3%

15 comments with sentiment.


Posts from X

Most Activity

Views: 47.8K · Bookmarks: 71 · Likes: 336 · Replies: 28

Lisan al Gaib @scaling01

Opus 4.8-xhigh scores minimally lower than GPT-5.5-xhigh, but is absolutely simplicity-maxxing

129 LOC vs 517 LOC

I know which one I would pick


28d · 47.8K views · 336 likes · 71 bookmarks

Retweets: 37

Nicolas Bustamante @nicbstme

The big story here is that GPT 5.5 (high/xhigh) outperforms claude-opus-4.8 (max/xhigh) by 20.7% succeeding on 12 additional tasks!

More impressive: GPT is roughly half the cost and twice as fast.

OpenAI is back in the game. Overall, this competition is healthy for the industry. I'd love to see a third player rise to the top of the leaderboard!

Datacurve @datacurve

Opus 4.8 is now on DeepSWE.

On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.

28d · 40.3K views

Håvard Ihle @htihle

In WeirdML we see opus models increasingly use submissions to just explore the data, without actually trying to solve the problem (no predictions for the test set).

It seems like, with no or low thinking, at least for some tasks, the prior for Opus to just explore the data overrules the system instructions saying that the score on every submission counts and will be compared to other models.

Opus does not feel the rush to maximize the score fast, it simply tries to understand the data. While this kind of attitude is great in Claude-Code, it costs Opus a lot on an eval like WeirdML, where getting the most out of every submission is important.

Exploring the data is definitely crucial in WeirdML, and GPT and Gemini also do this (prints out a bunch of info, tests hypotheses about the data etc). They just also make their best effort to solve the task at every iteration (which means that they both get more info from each submission and also get valid scores etc).

Opus, somehow, does not feel the same urgency when in this eval. Not exactly sure what this means, and more thinking does make Opus use the submissions more effectively, but it's at least an interesting finding.


28d · 3.2K views

Taelin @VictorTaelin

@scaling01 how to make your model's code concise and beautiful in 2 easy steps

1. train on my codebases

2. done


28d · 3.1K views

BlockedPath @BlockedPaths

@scaling01 Same. A 2% benchmark gap means nothing if I have to maintain 517 lines instead of 129. Less code is less surface area to debug later. The model that writes tighter is the one I actually want in a real codebase.

28d

Alex Gonch @AlexGonchX

@scaling01 depends if the 400 extra lines are error handling or the model just being verbose. score doesn't tell you which

28d

J A Z I I @notjazii

@scaling01 which one?

28d

🧟 @RaghavKoch19380

@scaling01 i wonder if Opus 4.8 Max tops the benchmark.

28d

Deep Bhalerao @DeepBhaleraoX

@scaling01 and that is true, opus code is lot leaner than gpt 5.5, both work almost all of the times

28d

Chimpansky @chimpansky

@scaling01 129 vs 517 LOC is a real signal. does the thinking token count also diverge at xhigh, or does opus reach 82.9% with fewer reasoning tokens than gpt-5.5 uses?

28d

Alireza @alireza7612

@scaling01 same task, Opus makes the surgical edit, Codex rewrites half the file around it. finally a number for it.

28d

Alex YGift @Radipdegen

@scaling01 129 LOC flexing while the model does all the heavy lifting

respect the math, hate the disrespect to my fingers

28d

Deepak K @deepakThamizhK

@scaling01 517 LOC to squeak out a marginal gain is not a flex it's a smell. Complexity compounds. In 6 months that's the codebase nobody wants to touch. Simplicity-maxxing isn't a consolation prize; it's the right call.

28d

Nick Launches @nicklaunches

@scaling01 129 vs 517 LOC for nearly the same score is the whole argument honestly. Less code to maintain wins

28d

🎱 BitcoinBananaBY @BitcoinBananaBY

@VictorTaelin @scaling01 why are you not pushing your comprehensive datasets to https://hub.harborframework.com/ which is used by terminal-bench. Many llms train on that.

28d

Eclipse 🌖 @ECLresearch

@scaling01 That LOC delta is the real story — 75% less code for near-parity output suggests the architectural efficiency gap is widening, not shrinking. Curious if that simplicity advantage holds at scale or breaks under edge cases.

28d
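
The "75% less code" figure in the replies can be checked against the 129 vs 517 LOC numbers quoted in the thread. A small sketch of that arithmetic; the helper name `loc_reduction` is hypothetical and the two constants are simply the values from the posts.

```python
def loc_reduction(shorter: int, longer: int) -> float:
    """Fractional reduction in lines of code when going from `longer` to `shorter`."""
    if longer <= 0:
        raise ValueError("longer must be positive")
    return (longer - shorter) / longer


OPUS_LOC, GPT_LOC = 129, 517  # figures quoted in the thread

print(f"reduction: {loc_reduction(OPUS_LOC, GPT_LOC):.1%}")
print(f"ratio: {GPT_LOC / OPUS_LOC:.2f}x")
```

The reduction comes out at about 75% and the ratio at about 4x, which matches the claim. As one reply points out, the count alone does not say whether the extra lines are error handling or verbosity.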