Claude Sonnet 5 benchmarks
Tanishq Mathew Abraham shares benchmarks showing Claude Sonnet 5 beats Sonnet 4.6 across evaluations but trails Opus 4.8 on SWE-bench Pro
Story Overview
Tanishq Mathew Abraham posted benchmark tables that position the just-launched Claude Sonnet 5 as a clear step up from Sonnet 4.6 on agentic coding, terminal use, computer interaction, and multidisciplinary reasoning, while still sitting a few points behind Opus 4.8 on most of the same tests.
Gains land near the bigger model
Sonnet 5 reaches within striking distance of Opus 4.8 on SWE-bench Pro, Terminal-Bench, and OSWorld while also posting a slight edge on GPQA-AAA v2, showing how much capability Anthropic squeezed into the lighter tier.
Pricing stays Sonnet-level for now
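Introductory rates hold at $2 per million input tokens and $10 per million output tokens through August, after which they rise to the usual Sonnet rates (quoted in the discussion as $3 input and $15 output per million). The new tokenizer may change actual token counts, so per-request costs could shift even at the same rates.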
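To make the pricing change concrete, here is a rough sketch of the cost arithmetic. The rates are the ones quoted in this story ($2/$10 introductory, and the $3/$15 post-August figures quoted in the discussion); the workload size is purely hypothetical, and token counts are assumed to stay fixed, which the new tokenizer may not guarantee.

```python
# Rates in USD per 1M tokens, as quoted in this story.
INTRO_RATES = {"input": 2.00, "output": 10.00}
STANDARD_RATES = {"input": 3.00, "output": 15.00}  # post-August figures from the discussion


def workload_cost(input_tokens: int, output_tokens: int, rates: dict) -> float:
    """Return the USD cost of a workload at the given per-million-token rates."""
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


# Hypothetical monthly workload (an assumption, not a figure from the story).
input_tokens = 50_000_000
output_tokens = 5_000_000

intro_cost = workload_cost(input_tokens, output_tokens, INTRO_RATES)
standard_cost = workload_cost(input_tokens, output_tokens, STANDARD_RATES)
increase = standard_cost / intro_cost - 1

print(f"Introductory: ${intro_cost:.2f}")
print(f"Standard:     ${standard_cost:.2f}")
print(f"Increase:     {increase:.0%}")
```

Because both the input and output rates rise by the same factor (1.5x), the relative increase is the same for any mix of input and output tokens, provided the token counts themselves do not change.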
Many users criticized Claude Sonnet 5 benchmarks as underwhelming versus Opus 4.8 and GLM models while questioning hype, pricing, and token use, though some praised the specs and fast release cadence.
Most Activity
it’s like mythos but if it wasn’t mythos and instead was basically opus 4.7

@willccbb lol
what is the fucking point of saying this for Opus specifically? all compared models are "reference". these jerks are finding new ways to trigger me
Sonnet 5 is a substantial improvement over Sonnet 4.6 on reasoning, tool use, coding, and knowledge work.
Its performance is close to Opus 4.8, at lower prices.
On agentic search, the improvement is especially prominent.
Sonnet 4.6 barely improves when you spend more. Sonnet 5, by contrast, gets dramatically better with effort to near-Opus territory on BrowseComp agentic search.
And Claude Sonnet 5 just launched.
Closes the gap with Opus 4.8, and is cheap until August.
This makes agentic AI much cheaper, with $2 input tokens and $10 output tokens per 1M through Aug-26. Price rises after 08-26 to $3 input and $15 output per 1M.
Anthropic calls Sonnet 5 its “most agentic Sonnet model yet.”
Its coding score hit 63.2% on SWE-bench Pro, versus 58.1% for Sonnet 4.6.
Sonnet 5 gets 63.2% in agentic coding, while Opus 4.8 reaches 69.2% and Sonnet 4.6 hits 58.1%.
But in knowledge work, Sonnet 5 slightly beats Opus 4.8, even though Opus is known for tough judgment and deep research tasks.
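As a quick check on the SWE-bench Pro figures quoted in this post, the gaps can be worked out directly. This is only a sketch of the arithmetic on the three scores cited above; the "share of gap closed" measure is an illustrative framing, not a metric from the benchmark tables.

```python
# SWE-bench Pro scores (percent) as quoted in the post above.
scores = {"Sonnet 4.6": 58.1, "Sonnet 5": 63.2, "Opus 4.8": 69.2}

# Percentage-point gain of Sonnet 5 over its predecessor.
gain_over_sonnet_46 = round(scores["Sonnet 5"] - scores["Sonnet 4.6"], 1)

# Percentage-point gap that remains between Sonnet 5 and Opus 4.8.
gap_to_opus_48 = round(scores["Opus 4.8"] - scores["Sonnet 5"], 1)

# Illustrative measure: share of the Sonnet 4.6 -> Opus 4.8 gap that Sonnet 5 closes.
share_of_gap_closed = gain_over_sonnet_46 / (scores["Opus 4.8"] - scores["Sonnet 4.6"])

print(f"Gain over Sonnet 4.6: {gain_over_sonnet_46} points")
print(f"Gap to Opus 4.8:      {gap_to_opus_48} points")
print(f"Share of gap closed:  {share_of_gap_closed:.0%}")
```

On these numbers, Sonnet 5 gains about 5 points over Sonnet 4.6 and sits about 6 points behind Opus 4.8, so it closes a little under half of the distance between the two older models on this benchmark.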
@xeophon @JagersbergKnut My timeline atm.
a glm-class model but its three times the price

@henriqueim0veis Everyone regresses on knowledge in favor of agentic web search

@xeophon GLM just like any chinese model, has no world knowledge

@xeophon tvke
I think they self-distilled just the right amount so that Sonnet 5 is worse than Opus 4.8 on every benchmark.
it’s like mythos but if it wasn’t mythos and instead was basically opus 4.7

@teortaxesTex Do you think a human made that graphic?

@teortaxesTex Reference vs direct comparison does seem like a plausible semantic distinction?

@xeophon Wild..
@teortaxesTex so they could justify highlighting a model without the highest benchmark scores.
what is the fucking point of saying this for Opus specifically? all compared models are "reference". these jerks are finding new ways to trigger me

@willccbb The best feature is spying on you

@teortaxesTex Bro, it was posted by Sonnet and this is a hallucination

@xeophon > family guy color palette meme

@xeophon GiB US FABLEEE

@willccbb I am becoming more and more certain that sonnet is an offspring of opus. 5 months for a new model is pretty cool

@teortaxesTex They want to target people using Sonnet to like their new model, and so 4.6 is "this is an upgrade to it, aren't you Sonnet API users (business) happy?" while Opus is "for reference, looksie, it is nearer to the same tier as our upper tier model but cheaper!!!"