WeirdML v2 benchmark finds Claude Sonnet 5 yields modest accuracy gains but major cost improvements over Sonnet 4.6

Sonnet 5 cut benchmark costs to 0.895 from 2.86.
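As a rough check on the headline, the reported figures can be compared directly. The sketch below is only an illustration: the cost figures 2.86 and 0.895 are used as given (the source states no unit), the scores 66.1% and 68.8% are the ones quoted in the original post, and the function and variable names are hypothetical.

```python
def relative_reduction(old: float, new: float) -> float:
    """Return the fraction by which `new` is lower than `old`."""
    return (old - new) / old


# Figures as reported in this story; the cost unit is not stated in the source.
cost_old, cost_new = 2.86, 0.895   # Sonnet 4.6 vs Sonnet 5 benchmark cost
acc_old, acc_new = 66.1, 68.8      # WeirdML scores in percent

cost_cut = relative_reduction(cost_old, cost_new)
cost_ratio = cost_old / cost_new
acc_gain = acc_new - acc_old       # difference in percentage points

print(f"Cost reduction: {cost_cut:.1%}")                   # Cost reduction: 68.7%
print(f"Cost ratio: {cost_ratio:.1f}x")                    # Cost ratio: 3.2x
print(f"Accuracy gain: {acc_gain:.1f} percentage points")  # Accuracy gain: 2.7 percentage points
```

On these numbers the cost falls by roughly two thirds while the score rises by under three points, which matches the "modest accuracy gains but major cost improvements" framing.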

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

congrats to Anthropic for great progress in sandbagging! The competitors can't distill your capabilities if you don't ship them! That's the winner's attitude. In the end, there's not much difference between honestly serving tokens and renting out your GPUs…

Håvard Ihle@htihle

Claude Sonnet 5 (high) scores 68.8% on WeirdML, comparable to GLM-5.2, and up from Sonnet 4.6 at 66.1%.

It seems different from Sonnet 4.6, and it does the Opus thing of sometimes just exploring the data instead of trying to solve the task.

10:07 AM · Jul 2, 2026 · 8.3K Views

Sentiment

Many users praise Claude Sonnet 5's strong WeirdML benchmark results and model efficiency, while some dismiss the release as unimpressive.

Positive: 75.0%

Negative: 25.0%

3 comments with sentiment.

Posts from X

Most Activity

Views 11.9K · Bookmarks 13 · Likes 125 · Retweets 6 · Replies 4

Lisan al Gaib@scaling01

incredible

Sonnet-5 is lagging the frontier by 7 months

5h ago

Lisan al Gaib@scaling01

Nothingburger-5 scores on WeirdML

5h ago

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Also, it's presumably like half the size of GLM so, very good work on density too! It must be as cheap to serve as DSV4-Flash, but it'll have more use even at 40 times the price. Winning!

5h ago

Lisan al Gaib@scaling01

Nothingburger-5 is the new Llamao-4

5h ago

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 Impressive results for DSV4-Flash scaled model!

5h ago

Paul@essenciverse

@scaling01 I really feel everyone is pretending every model is supposed to be for coding.

I don't understand Sonnet 5 to be for coding at all. It's for writing and general workflows.

5h ago

This is Greg@Greg_TheBuilder

@scaling01 So is Sonnet 5 just the most expensive email writer? I’ve noticed it’s good at doing some web fetches, I suppose.

5h ago

midnight@midsusnight

@scaling01 nah the real fight is deciding which model is lying to you best

but frontier wars are fun to watch

5h ago

marcos200s@marcos200s1

@midsusnight I’d suggest you read the more detailed post Lisan made on his Telegram channel, SCALINGCALLS, because from my understanding you’re getting him all wrong.

4h ago

marcos200s@marcos200s1

@Greg_TheBuilder I think you’re misreading his view. I thought the same as you when I saw the post here on X. Check out the updated post Lisan made on his Telegram channel, SCALINGCALLS; I’m sure you’ll understand him better after reading it.

4h ago

marcos200s@marcos200s1

@essenciverse I think you’re misreading his view. I thought the same as you when I saw the post here on X. Check out the updated post Lisan made on his Telegram channel, SCALINGCALLS; I’m sure you’ll understand him better after reading it.

4h ago

kaloszer@kaloszer

@essenciverse @scaling01 Then get it out of Claude Code if it sucks for coding, and fine-tune a good coding model.

4h ago

墨染_85_Weex_返利点线@bbas011

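@teortaxesTex Treating drip-fed incremental releases ("squeezing the toothpaste tube") as top-tier business warfare is really something else.

5h ago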