Claude Opus 4.8 launches, scoring 69.2% on SWE-bench Pro to outperform GPT-5.5 and Gemini 3.1 Pro · Digg

Posts from X

Most Activity

VIEWS 37.1K

Lisan al Gaib@scaling01

Anthropic says Opus 4.8 ranks 1st on FrontierSWE

26d37.1K1146

BOOKMARKS 46 · LIKES 478 · RETWEETS 19

Lisan al Gaib@scaling01

Opus 4.8 is indeed #1 on FrontierSWE

Lisan al Gaib@scaling01

Anthropic says Opus 4.8 ranks 1st on FrontierSWE

26d30.8K47846

REPLIES 25

Yuchen Jin@Yuchenj_UW

Opus 4.8 scores 69.2% on SWE-Bench Pro, 10 points higher than GPT-5.5.

Most interesting part of the release blog is “Dynamic Workflows”:

“This new feature, available in research preview, allows Claude to take on even bigger tasks in Claude Code. Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer). It then verifies its outputs before reporting back to the user.”

26d22.9K25445

Lisan al Gaib@scaling01

Claude 4.8 Opus System Card

Lisan al Gaib@scaling01

Claude Opus 4.8 Benchmarks

26d31.8K29043

Mercor@mercor_ai

We tested @claudeai Opus 4.8 (High) on APEX-SWE ahead of today's release.

It's the new #1 at 45.3% Pass@1, nearly 4 points ahead of GPT-5.3 Codex (41.5%).

Congrats @AnthropicAI on the release and having three models in the top 5!

26d11.3K20325

Arena's AI Capability Lead @petergostev runs @AnthropicAI's latest Claude Opus 4.8 through 200+ Code Arena: Frontend tests. Both thinking and non-thinking, head-to-head with past Opus variants, Gemini 3.1 Pro, 3.5 Flash, and GLM 5.1.

Compare outputs across 3D scenes, game generation, and front-end UI design and let us know what you think. Link in thread 🧵👇

26d25.1K15826

alex zhang@a1zhang

I think it's becoming clearer that programmatic sub-agent calling is the way to go over the legacy tool-calling format (which I've been pushing for since RLMs came out)!

I do wonder though if the generated "workflow" looks more eager or compiled (a design decision I've also been unsure about, because it affects how these models are trained to act); dynamic seems to imply the former but the example they give in the blog makes it kind of unclear. either way, scaling the flexibility of subagent deployment without polluting the context of the main Claude Code instance is gonna be huge

Yuchen Jin@Yuchenj_UW

Opus 4.8 scores 69.2% on SWE-Bench Pro, 10 points higher than GPT-5.5.

Most interesting part of the release blog is “Dynamic Workflows”:

“This new feature, available in research preview, allows Claude to take on even bigger tasks in Claude Code. Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer). It then verifies its outputs before reporting back to the user.”

26d3.4K8633
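
The plan, fan-out, verify loop described in the quoted release text can be sketched in a few lines. This is only an illustration under stated assumptions, not Anthropic's implementation: `plan`, `run_subagent`, `verify`, and `run_workflow` are hypothetical names, and the subagent is a stub that uppercases its task instead of calling a model. The sketch is "eager" in the sense raised above, since the orchestrator decides the fan-out at run time, and it keeps only task names and verified results in the orchestrator's own state.

```python
import asyncio


def plan(goal: str) -> list[str]:
    """Hypothetical planner: split a goal into independent subtasks."""
    return [f"{goal}:part{i}" for i in range(3)]


async def run_subagent(task: str) -> str:
    """Hypothetical subagent stub; a real one would call a model here."""
    await asyncio.sleep(0)  # yield control to simulate asynchronous work
    return task.upper()


def verify(task: str, output: str) -> bool:
    """Hypothetical check applied to each output before reporting back."""
    return output == task.upper()


async def run_workflow(goal: str, max_parallel: int = 100) -> dict[str, str]:
    tasks = plan(goal)
    limit = asyncio.Semaphore(max_parallel)  # cap concurrent subagents

    async def bounded(task: str) -> str:
        async with limit:
            return await run_subagent(task)

    # Fan out: every subtask runs as its own coroutine.
    outputs = await asyncio.gather(*(bounded(t) for t in tasks))
    # Only verified outputs are reported back to the caller.
    return {t: o for t, o in zip(tasks, outputs) if verify(t, o)}


results = asyncio.run(run_workflow("refactor"))
```

A "compiled" variant would instead have the planner emit the whole task graph up front and hand it to a separate executor; which of the two the feature actually uses is not stated in the quoted text.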

Gabriel Stengel@GabeStengel

If you scroll down far enough in the blog post... you can see that Gemini 3.5 Flash outperforms Opus 4.8 by a BIG margin on the Finance Agent benchmark

"* Finance Agent v2: Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro."

No one model rules them all! As much as any one lab would like you to believe....

Claude@claudeai

Introducing Claude Opus 4.8: it builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors.

Available today at the same price.

26d16.5K8321

Lisan al Gaib@scaling01

still getting token-mogged by GPT-5.5

Lisan al Gaib@scaling01

Opus 4.8 takes the top spot on AAI

26d9.7K1559

Taelin@VictorTaelin

@scaling01 Just 5 more versions to reach Mythos, I can't contain my excitement.

Lisan al Gaib@scaling01

Claude Opus 4.8 Benchmarks

26d7K1664

Leo Linsky@leo_linsky

Anthropic did it. Opus 4.8 is a tangible improvement from their previous best model (Opus 4.5). It has the intelligence of GPT 5.5, the creativity of Opus 4.7, and none of the personality problems that made their last release a frustrating product.

Results are live on our comprehensive multiplayer coding and reasoning tests.

26d5.1K7215

Bindu Reddy@bindureddy

🚨 Opus 4.8 Still Trails Behind GPT 5.5 And Is A Very Incremental Release

Opus 4.8 barely inches past 4.7 on benchmarks but lags behind GPT 5.5 considerably!!

Anthropic may be stalling a bit given its last two releases. OpenAI has a huge opening with GPT 5.6 coming soon

Will know more tomorrow after some real world testing

26d1.7K355

Jeff Ma ✈️ ICML@18jeffreyma

hyped to see our work on ProgramBench make it in the Opus 4.8 model card!

26d4.8K336

Andrew Curran@AndrewCurran_

'Not only that, but we plan to release a new class of model with even higher intelligence than Opus.'

The Mythos release draws near. The rumor for some time has been that Claude Mythos will be released in about two weeks, in mid-June.

26d1.2K524

Tenobrus@tenobrus

cheap fast mode 😮

26d1.5K663

Lisan al Gaib@scaling01

same cost/perf curve as GPT-5.5

Lisan al Gaib@scaling01

still getting token-mogged by GPT-5.5

26d3.2K272

Lisan al Gaib@scaling01

not sure if that includes GPT-5.5

and the rankings for the models changed

Lisan al Gaib@scaling01

Anthropic says Opus 4.8 ranks 1st on FrontierSWE

26d3.6K272

Andrew Curran@AndrewCurran_

System card: https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf

26d933290

wh@nrehiew_

Big TerminalBench improvements from Opus 4.7, but still behind 5.5

Massive knowledge work improvements!

26d786111

Andrew Curran@AndrewCurran_

HLE.

26d1K111