Tech · 8h ago

Benchmarks find Anthropic's Claude Code is the lowest-performing coding harness compared to Cursor CLI and OpenCode using identical models

The critique proposes decoupling model subscriptions from proprietary interfaces.


Original post

Kun Chen@kunchenguid

want to point out a few really interesting things here

1. Claude Code is actually the worst performing harness when using the same model, significantly behind opencode and cursor cli

this is the core reason i've been against the LLM companies focusing their business on locking people into their harness

what they are good at is making great models. they suck at making good harness products, just like how power plants won't make the best dishwashers, and how internet providers won't make the best phones

if anthropic wants to do what's best for their users, they should let people use their subscriptions in whatever harness they choose, not locked into claude code alone

2. fable 5 max is only 1pt above gpt 5.5 xhigh (77 vs 76)

this matches my experience so far - fable 5 does have the big model smell and it's pretty good, but it's not a massive jump forward like their marketing suggested, at least not on building software

this is actually alarming for anthropic because it's very unlikely people will want to pay 2x higher cost for the 1pt difference. my speculation would be that in enterprises people will be restricted to adopt fable & mythos only on some mission critical tasks, not used at scale

Artificial Analysis@ArtificialAnlys

We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top

DeepSWE, built by @datacurve, writes its tasks from scratch rather than adapting them from public GitHub issues or pull requests, so no model has seen the solutions during training. That matters because SWE-Bench Pro, the benchmark it replaces in our Coding Agent Index, had grown gameable, with some models recovering the fix from the repository's commit history instead of solving the task.

The swap reorders the index: Codex with GPT-5.5 (xhigh) rises from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73. Claude Code with Fable 5 (max), which enters directly on the refreshed index, leads at 77. SWE-Bench Pro had been flattering some combinations and penalizing others.

More below.

1:10 AM · Jun 12, 2026 · 88.2K Views

Sentiment

Many users backed criticism of Anthropic's lock-in tactics since Claude Code lags rivals in benchmarks, while others called the harness comparisons suspect or accused the company of nerfing models.

Positive: 61.5%

Negative: 38.5%

13 comments with sentiment.

Cluster Engagement

Posts from X


Robert W Q Brown@RWQBrown

@kunchenguid This is a hard angle to measure. Each harness can be heavily customized and tuned for your use case. Did they use the default settings? I doubt opencode is that much better than claude code.

12h

Jake@JakeKAllDay

@kunchenguid $1T valuations tend to mess with people's ethics 😀

as mark twain said, "It is difficult to get a man to understand something when his salary depends on his not understanding it."

7h

Jake@JakeKAllDay

@kunchenguid I think you're 100% right on #1, but the economic incentive is clear: model providers don't want to lose the customer relationship, nor do they want a downstream vendor (e.g. Cursor) taking steps to route/optimize them into a commodity.

So their insane valuations require them to do it

8h


Mng@Mng64218162

@Bolmercl @kunchenguid all the tokens in the system prompt are cached. note that 90% of the people who use Claude Code use it with subscriptions, so anthropic is not dumb enough to waste their tokens/money

9h

Mng@Mng64218162

@kunchenguid Claude code is optimized for cost, not for performance, by heavily relying on caching and using subagents with smaller models like haiku. So it makes sense to see other tools perform better than Claude code, but it’s not fair to compare performance without comparing cost

10h

Kevin Wu Won@kwuwon

@kunchenguid I'm not really believing the benchmarks that say 1 point difference. Fable seems significantly smarter than GPT 5.5 at software design and code review. It results in much simpler code that a clear headed senior engineer would write, not overengineered towers of Babel.

12h

NR@HsiminR

When the harness alone causes 15-20 pt swings, it's suspect. Codex harness increases GPT 5.5's score by 20 points. Claude code harness decreases Opus 4.7's score by 13 points.

This seems to be measuring not model quality - but perhaps how aggressively a harness continuously loops in long open-ended tasks w/o human prompting.

9h

Patrick Donohoe@patrickdonohoe

@kunchenguid Claude code/codex best value prop has always been the discounted tokens on the consumer plans. You are always going to choose 10x more tokens over a slightly better harness. With enterprises now paying per-token, harness differentiation will be more relevant

11h

Kun Chen@kunchenguid

@lawrence_stark_ i wouldn’t say that “stupid” though. openai deliberately allows all 3p apps to use their codex subscription freely which might play out well for them because more and more apps will be built on top of the openai platform as a result

5h

Mng@Mng64218162

@llmDestructor @Bolmercl @kunchenguid again 90% of the people who use Claude Code use it with the subscriptions so anthropic knows better how to handle their caching to reduce their costs

9h

Matias@Bolmercl

@Mng64218162 @kunchenguid The problem is not the input cache cost of the system prompt, it's that it makes the models dumber.

9h

Dan McInerney@DanHMcInerney

@kunchenguid Very interesting information. I noticed Codex seemed to be much faster and slightly more reliable in the 4.8/5.5 era. Fable in CC is still slow and token-heavy

8h

Kun Chen@kunchenguid

@JakeKAllDay yeah totally. i can empathize with why they would do it. i just see it as a bad thing for the ecosystem and disappointed that somehow doing what’s good for society isn’t incentivized nearly as much as it should

7h

Matias@Bolmercl

@Mng64218162 @kunchenguid CC is not optimized for cost. It wastes waaaaaay more tokens by giving so many dumb system prompt tokens to the models. They become dumber and waste waaaaay more tokens per task.

9h

Lawrence Stark@lawrence_stark_

Yes exactly but they'd be stupid to allow you to go use another harness with their sub. Hence the soft ban in 3 days on headless

For consumers, model providers should not lock us into their harnesses. For them, it's in their best interest to pull an Apple like move

But whatever, open source models will be >90% as good and >10x cheaper and they will eat up most tokens

6h

Saly@Saly76246513

@kunchenguid isn't it against Anthropic ToS to use claude subscription with other harnesses? I remember this being a thing a while back, when Openclaw was a thing

10h

The Crypto Wiz@TheKryptoWiz

@kunchenguid This is the right lesson: dumb loops that ship usually beat fancy orchestration that needs a therapist.

11h

hffmnnj@hffmnnj

@kunchenguid "my speculation would be that in enterprises people will be restricted to adopt fable & mythos only on some mission critical tasks, not used at scale"

That's sorta the point?

6h

EJ Campbell@ejc3

What makes you believe that model companies can't write a good harness, especially when they can tune the model as part of training?

Are there any model companies not using their own harness themselves? Are you suggesting Anthropic engineers should switch to OpenCode to improve their productivity?

6h