5h ago

Claude Opus 4.8 launches, scoring 69.2% on SWE-bench Pro to outperform GPT-5.5 and Gemini 3.1 Pro

It introduces a "Dynamic Workflows" preview for parallel subagents.

Original post

What if Claude Opus 4.8 benchmarks looked like this 👀

5:41 AM · May 28, 2026 View on X

Opus 4.8 scores 69.2% on SWE-Bench Pro, 10 points higher than GPT-5.5.

Most interesting part of the release blog is “Dynamic Workflows”:

“This new feature, available in research preview, allows Claude to take on even bigger tasks in Claude Code. Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer). It then verifies its outputs before reporting back to the user.”

4:57 PM · May 28, 2026 · 7K Views
Andrew Curran @AndrewCurran_

System card: https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0925c805a1a7ca77314ccbf4a6.pdf

4:53 PM · May 28, 2026 · 723 Views
4:54 PM · May 28, 2026 · 768 Views

'Not only that, but we plan to release a new class of model with even higher intelligence than Opus.'

The Mythos release draws near. The rumor for some time is that Claude Mythos will release in about two weeks, mid June.

Andrew Curran @AndrewCurran_
4:54 PM · May 28, 2026 · 768 Views
4:57 PM · May 28, 2026 · 885 Views
4:58 PM · May 28, 2026 · 880 Views

I think it's becoming clearer that programmatic sub-agent calling is the way to go over the legacy tool-calling format (which I've been pushing for since RLMs came out)!

I do wonder though if the generated "workflow" looks more eager or compiled (a design decision I've also been unsure about, because it affects how these models are trained to act); dynamic seems to imply the former but the example they give in the blog makes it kind of unclear. either way, scaling the flexibility of subagent deployment without polluting the context of the main Claude Code instance is gonna be huge
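The eager-versus-compiled distinction raised above can be sketched in a few lines. This is only an illustration under stated assumptions: `Task`, `run_subagent`, `compiled_workflow`, and `eager_workflow` are hypothetical names invented here, the "subagent" is a stub that uppercases a string, and nothing below reflects how Claude Code or Dynamic Workflows actually work.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    payload: str


async def run_subagent(task: Task) -> str:
    """Hypothetical stand-in for a subagent call; it only uppercases the payload."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{task.name}:{task.payload.upper()}"


async def compiled_workflow(tasks: list[Task]) -> list[str]:
    """'Compiled' style: the whole plan is fixed up front, then fanned out in parallel."""
    return await asyncio.gather(*(run_subagent(t) for t in tasks))


async def eager_workflow(first: Task, max_steps: int = 3) -> list[str]:
    """'Eager' style: each next task is chosen only after the previous result is seen."""
    results = []
    task = first
    for step in range(max_steps):
        out = await run_subagent(task)
        results.append(out)
        # Hypothetical planner rule: derive the next task from the last output.
        next_payload = out.split(":")[-1].lower() + "!"
        task = Task(name=f"step{step + 1}", payload=next_payload)
    return results


def summarize(results: list[str]) -> str:
    """Only a short summary returns to the orchestrator, not every subagent transcript."""
    return f"{len(results)} subagent results"


if __name__ == "__main__":
    plan = [Task("a", "lint"), Task("b", "test")]
    print(summarize(asyncio.run(compiled_workflow(plan))))
    print(summarize(asyncio.run(eager_workflow(Task("step0", "lint")))))
```

In this toy version the compiled plan parallelizes trivially because every task is known in advance, while the eager loop is sequential because each step depends on the last output. A real system could mix the two, which may be why the blog example reads as ambiguous.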

Yuchen Jin @Yuchenj_UW

Opus 4.8 scores 69.2% on SWE-Bench Pro, 10 points higher than GPT-5.5. Most interesting part of the release blog is “Dynamic Workflows”: “This new feature, available in research preview, allows Claude to take on even bigger tasks in Claude Code. Claude can plan the work and then run hundreds of parallel subagents in a single session (and with Opus 4.8, the agents can run for even longer). It then verifies its outputs before reporting back to the user.”

4:57 PM · May 28, 2026 · 7K Views
5:29 PM · May 28, 2026 · 1.1K Views

Opus 4.8 is indeed #1 on FrontierSWE

Lisan al Gaib @scaling01

Anthropic says Opus 4.8 ranks 1st on FrontierSWE

5:11 PM · May 28, 2026 · 11.3K Views
5:43 PM · May 28, 2026 · 7.1K Views

not sure if that includes GPT-5.5

and the rankings for the models changed

Lisan al Gaib @scaling01

Anthropic says Opus 4.8 ranks 1st on FrontierSWE

5:11 PM · May 28, 2026 · 11.3K Views
5:12 PM · May 28, 2026 · 2.4K Views

Claude 4.8 Opus System Card

Lisan al Gaib @scaling01

Claude Opus 4.8 Benchmarks

4:49 PM · May 28, 2026 · 35.7K Views
4:50 PM · May 28, 2026 · 25.3K Views

@scaling01 Just 5 more versions to reach Mythos, I can't contain my excitement.

Lisan al Gaib @scaling01

Claude Opus 4.8 Benchmarks

4:49 PM · May 28, 2026 · 35.7K Views
4:54 PM · May 28, 2026 · 4.5K Views

@AndrewCurran_ good chance that mythos has basically solved all solvable hle tasks or is somewhat close

Andrew Curran @AndrewCurran_

HLE.

5:09 PM · May 28, 2026 · 860 Views
5:10 PM · May 28, 2026 · 84 Views

Big TerminalBench improvements from Opus 4.7, still behind 5.5

Massive knowledge work improvements!

5:02 PM · May 28, 2026 · 451 Views