Artificial Analysis switches its Coding Agent Index to DeepSWE, ranking Claude Code with Fable 5 in first place


So proud of @datacurve (YC W24) - building THE defining software engineering benchmark in DeepSWE

Tired: SWE-Bench Pro. Wired: Datacurve DeepSWE.

We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top

DeepSWE, built by @datacurve, writes its tasks from scratch rather than adapting them from public GitHub issues or pull requests, so no model has seen the solutions during training. That matters because SWE-Bench Pro, the benchmark it replaces in our Coding Agent Index, had grown gameable, with some models recovering the fix from the repository's commit history instead of solving the task.

The swap reorders the index: Codex with GPT-5.5 (xhigh) rises from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73. Claude Code with Fable 5 (max), which enters directly on the refreshed index, leads at 77. SWE-Bench Pro had been flattering some combinations and penalizing others.

More below.


Artificial Analysis@ArtificialAnlys

SWE-Bench Pro behaved unlike the other evaluations. Codex with GPT-5.5 (xhigh) scored just 31 on it against 64 to 84 elsewhere, while Claude Code with Opus 4.8 (max) scored 70, one of its highest results and a 25-point jump over Opus 4.7 (max). DeepSWE, its replacement, is the hardest evaluation in the index: the best agents score in the 50s and 60s, and leading open weights models score below 20.



Artificial Analysis@ArtificialAnlys

See the full refreshed Artificial Analysis Coding Agent Index, with per-evaluation scores, token use, and cost for every harness and model combination: https://artificialanalysis.ai/agents/coding-agents

Read more about DeepSWE from Datacurve: https://deepswe.datacurve.ai

Join the discussion in our Discord: https://discord.gg/dkR4wVfty


NR@HsiminR

DeepSWE seems super sensitive to the harness - much more so than the other coding benchmarks.

GPT 5.5 (medium):
Cursor CLI harness: 37
Codex harness: 57

Opus 4.7 (medium):
Claude Code harness: 27
Opencode harness: 40

The harness - not the model - is causing 13-20 pt swings, and the swings don't make a whole lot of sense.
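
To make the 13-20 point range concrete, here is a minimal sketch that recomputes the per-model harness gap from the four scores quoted in this post. The dictionary layout and the `harness_swing` helper are illustrative assumptions, not part of any Artificial Analysis or Datacurve tooling.

```python
# Scores quoted in the post above, keyed by model and then by harness.
scores = {
    "GPT 5.5 (medium)": {"Cursor CLI": 37, "Codex": 57},
    "Opus 4.7 (medium)": {"Claude Code": 27, "Opencode": 40},
}


def harness_swing(by_harness):
    """Return the gap between the best and worst harness for one model."""
    return max(by_harness.values()) - min(by_harness.values())


# Hypothetical helper output: one swing per model, in score points.
swings = {model: harness_swing(h) for model, h in scores.items()}

for model, swing in swings.items():
    print(f"{model}: {swing}-point swing across harnesses")
```

With only two harnesses per model, the swing is simply the difference between the two quoted scores.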


Peter Welinder@npew

GPT-5.5 vs Fable: ~same performance, but GPT-5.5 costs 50% less and so can do twice as much work.
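
The "twice as much work" figure follows from simple budget arithmetic. The sketch below assumes a fixed budget and a cost per task that scales directly with price. The `work_multiplier` function is a hypothetical helper, and the 50% figure is taken from the post rather than from measured pricing.

```python
def work_multiplier(discount):
    """Tasks affordable on a fixed budget, relative to the full-price option.

    `discount` is the fractional price reduction (0.5 means 50% cheaper).
    Assumes cost per task scales linearly with price.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in the range [0, 1)")
    return 1 / (1 - discount)


# A 50% lower cost doubles the work a fixed budget can cover.
print(work_multiplier(0.5))
```

The relationship is not linear in the discount: a 25% price cut, for instance, buys only about a third more work under the same assumptions.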


Aether Oracle@aether_oracle

@ArtificialAnlys This is 9/11 for Anthropic


User@DenisZoabi

So, as I have said, the hype had no basis. Everybody kept saying that Fable 5 was going to change everything, but the analysis shows no big difference between GPT 5.5 xhigh and Fable 5 in coding performance. When we talk about price, OpenAI WINS.


Hamza Mudassir@hamzam1981

@ArtificialAnlys how can this be correct? Mythos, with all its flaws, is dramatically better than GPT 5.5 in coding


Moonlit Monkey@MoonlitMonkey69

@hamzam1981 @ArtificialAnlys It's apparently not dramatically better when you give it challenges it hasn't memorized.


Tyrone Robb@ty_auldric

@ArtificialAnlys Am I missing something or is the hype about Fable about it being a smidge better than chat 5.5?


Stijn@StijnSmits

@ArtificialAnlys This is embarrassing, why not measure Fable 5 medium or high?


bruce@bruce_x_offi

@ArtificialAnlys Can you benchmark an Opus model with pi?


Naeem@identity_matrix

@ArtificialAnlys Love to see DeepSWE being adopted as a replacement for SWE-Bench. I hope it doesn't get benchmaxxed in the same way. Congrats @datacurve and team


L@numcep

@HsiminR @ArtificialAnlys Interesting that different harnesses are even mentioned considering the DeepSWE announcement made a specific point to use the same (their own, custom) harness with every LLM that was tested (https://deepswe.datacurve.ai/blog#evaluation-harness)…


Ryan@ryanalmasu

@ArtificialAnlys 5.6 will wipe fable 5 if they fixed the frontend issue


neamtu@neamtuz

@ArtificialAnlys Composer 2.5 non-fast is the goat here. Nothing comes even remotely close in costs.


Anoy@Anoyroyc

@ArtificialAnlys wait so the old benchmark was literally just letting models cheat by looking up commit history?

no wonder the rankings flipped that hard when they switched to actual unseen tasks


Víctor Cavero@vcaverog

@ArtificialAnlys This is huge for benchmarking - testing on solutions the models already saw in training basically defeats the whole point

the reordering makes way more sense now


t@tafheeeem

@ArtificialAnlys @max2aneeb see a real benchmark. 5.5 is still close. 5.6 or 6 will cook them
