
Tanishq Mathew Abraham shares benchmarks showing Claude Sonnet 5 beats Sonnet 4.6 across evaluations but trails Opus 4.8 on SWE-bench Pro

Story Overview

Tanishq Mathew Abraham posted benchmark tables that position the just-launched Claude Sonnet 5 as a clear step up from Sonnet 4.6 on agentic coding, terminal use, computer interaction, and multidisciplinary reasoning, while still sitting a few points behind Opus 4.8 on most of the same tests.


Original post

Tanishq Mathew Abraham, Ph.D.@iScienceLuvr

Claude Sonnet 5 benchmarks

11:00 AM · Jun 30, 2026 · 876 Views

Developer Impact

Gains land near the bigger model

Sonnet 5 reaches within striking distance of Opus 4.8 on SWE-bench Pro, Terminal-Bench, and OSWorld while also posting a slight edge on GPQA-AAA v2, showing how much capability Anthropic squeezed into the lighter tier.

Pricing Watch

Pricing stays Sonnet-level for now

Introductory rates hold at $2 per million input tokens and $10 per million output tokens through August, then rise to the usual Sonnet rates, which a post in the thread gives as $3 and $15. The new tokenizer may also change actual token counts, so real costs may not track these rates exactly.
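As a rough illustration of what the price change could mean for a bill, the sketch below compares the introductory rates with the $3 / $15 standard rates quoted by a post in the thread. The workload size, the function name cost_usd, and the token_scale knob (a stand-in for the tokenizer caveat) are assumptions made for illustration, not figures from Anthropic.

```python
# Rates in USD per 1M tokens, as quoted in this story.
INTRO_RATES = {"input": 2, "output": 10}      # introductory, through August
STANDARD_RATES = {"input": 3, "output": 15}   # quoted post-August rates


def cost_usd(input_tokens, output_tokens, rates, token_scale=1.0):
    """Estimate cost in USD for a given token volume.

    token_scale is a hypothetical multiplier for how a different
    tokenizer might change token counts for the same text (1.0 = no change).
    """
    raw = input_tokens * rates["input"] + output_tokens * rates["output"]
    return token_scale * raw / 1_000_000


# Hypothetical monthly workload: 50M input tokens, 10M output tokens.
workload = (50_000_000, 10_000_000)

intro_cost = cost_usd(*workload, INTRO_RATES)
standard_cost = cost_usd(*workload, STANDARD_RATES)

print(f"intro:    ${intro_cost:.2f}")
print(f"standard: ${standard_cost:.2f}")
print(f"increase: {standard_cost / intro_cost:.2f}x")
```

Because both rates rise by the same proportion, the increase works out to 1.5x for any mix of input and output tokens. Any tokenizer effect would come on top of that and would need to be measured on real prompts.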

Sentiment

Many users criticized Claude Sonnet 5 benchmarks as underwhelming versus Opus 4.8 and GLM models while questioning hype, pricing, and token use, though some praised the specs and fast release cadence.

Positive: 32.3% · Negative: 67.7% (24 comments with sentiment)


Posts from X

Most Activity

will brown@willccbb

it’s like mythos but if it wasn’t mythos and instead was basically opus 4.7

2h · 2K views · 49 likes · 1 bookmark · 2 retweets · 4 replies

Florian Brand@xeophon

a glm-class model but its three times the price

will brown@willccbb

it’s like mythos but if it wasn’t mythos and instead was basically opus 4.7

2h

sankalp@dejavucoder

@willccbb lol

2h

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

what is the fucking point of saying this for Opus specifically? all compared models are "reference". these jerks are finding new ways to trigger me

Claude@claudeai

Sonnet 5 is a substantial improvement over Sonnet 4.6 on reasoning, tool use, coding, and knowledge work.

Its performance is close to Opus 4.8, at lower prices.

1h

Rohan Paul@rohanpaul_ai

On a Agentic search the improvement is so prominent.

Sonnet 4.6 barely improves when you spend more. Sonnet 5, by contrast, gets dramatically better with effort to near-Opus territory on BrowseComp agentic search.

Rohan Paul@rohanpaul_ai

And Claude Sonnet 5 just launched.

Closes the gap with Opus 4.8, and is cheap until August.

This makes agentic AI much cheaper, with $2 input tokens and $10 output tokens per 1M through Aug-26. Price rises after 08-26 to $3 input and $15 output per 1M.

They call Sonnet 5 its “most agentic Sonnet model yet,”

Its coding score hit 63.2% on SWE-bench Pro, versus 58.1% for Sonnet 4.6.

Sonnet 5 gets 63.2% in agentic coding, while Opus 4.8 reaches 69.2% and Sonnet 4.6 hits 58.1%.

But in knowledge work, Sonnet 5 slightly beats Opus 4.8, even though Opus is known for tough judgment and deep research tasks.

1h

Andrew Curran@AndrewCurran_

@xeophon @JagersbergKnut My timeline atm.

Florian Brand@xeophon

a glm-class model but its three times the price

1h

Florian Brand@xeophon

@henriqueim0veis Everyone regresses on knowledge in favor of agentic web search

1h

henrique imoveis@henriqueim0veis

@xeophon GLM just like any chinese model, has no world knowledge

1h

snow@snowclipsed

@xeophon tvke

1h

Yunfan Zhang@z4y5f3

I think they self-distilled just the right amount so that Sonnet 5 is worse than Opus 4.8 on every benchmark.

will brown@willccbb

it’s like mythos but if it wasn’t mythos and instead was basically opus 4.7

1h

Static Snow@StaticSnowman

@teortaxesTex Do you think a human made that graphic?

1h

AntiDisentarian@AntiDisentarian

@teortaxesTex Reference vs direct comparison does seem like a plausible semantic distinction?

1h

Ziwen@ziwenxu_

@xeophon Wild..

1h

xlr8harder@xlr8harder

@teortaxesTex so they could justify highlighting a model without the highest benchmark scores.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

what is the fucking point of saying this for Opus specifically? all compared models are "reference". these jerks are finding new ways to trigger me

49m

Rafa Schwinger 🇻🇦@Rafa_Schwinger

@willccbb The best feature is spying on you

1h

Nikita Belokopytov@NikiBelokopytov

@teortaxesTex Bro, it's was posted by Sonnet and this is a hallucination

1h

Nirmal Krishnan@nirmal_dist

@xeophon > family guy color palette meme

1h

J A Z I I@notjazii

@xeophon GiB US FABLEEE

1h

Lakshay Sagar Rana@lsrspeakstocomp

@willccbb I am becoming more and more certain that sonnet is a offspring of opus. 5 months for a new model is pretty cool

1h

MinusGix@MInusGix

@teortaxesTex They want to target people using Sonnet to like their new model, and so 4.6 is "this is an upgrade to it, aren't you Sonnet API users (business) happy?" while Opus is "for reference, looksie, it is nearer to the same tier as our upper tier model but cheaper!!!"

1h