/Tech · 30d ago

Opus 4.8 improves DeepSWE benchmark performance by 6% over Opus 4.7 while lowering task costs

OpenAI's GPT 5.4 continues to outperform the new model.

#403

Original post

Theo - t3.gg@theo · #1325 in Tech

Good results! Lines up with my experience

Datacurve@datacurve

Opus 4.8 is now on DeepSWE.

On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.

2:34 PM · May 30, 2026 · 181.4K Views

Sentiment

Positive users praise Opus 4.8 for higher DeepSWE scores at lower cost plus better honesty and task handling, while negative users call the harness garbage and suggest the updates head in the wrong direction.

Positive: 54.5% · Negative: 45.5%

19 comments with sentiment.

Posts from X

Most Activity

Views: 76.9K · Bookmarks: 64

banteg@banteg

opus 4.8 comes pre-mogged, even by gpt 5.4. openai is 1.9 releases ahead now.

30d76.9K60664

Likes: 663 · Replies: 42

Chubby♨️@kimmonismus

Opus 4.8 is a solid jump over Opus 4.7 on DeepSWE, while also lowering the average cost per task.

However, GPT-5.5 xhigh still beats it by a pretty clear margin while being cheaper.

OpenAI has been cooking insanely hard with its models lately. Really excited to see what GPT-5.6 brings.

That said, I have to admit: I’m starting to really like Opus 4.8 as well.

We’ve entered a moment where both frontier labs keep shipping genuinely impressive models.

30d54.8K66353

Retweets: 44

Datacurve@datacurve

Opus 4.8 is now on DeepSWE.

On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.

30d662.1K1.4K318

Datacurve@datacurve

Opus 4.8 delivers efficiency gains by solving tasks in fewer steps, directly reducing the total number of input tokens required per task.

30d6.6K7011
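
A rough sketch of why fewer steps can mean fewer input tokens, assuming an agent loop that re-sends its full history on every step and has no prompt caching. The function name and all numbers are hypothetical and are not taken from DeepSWE or from Datacurve's measurements.

```python
def total_input_tokens(steps: int, base_prompt: int, tokens_added_per_step: int) -> int:
    """Sum input tokens over a run where each step re-sends the whole history.

    Assumption (not from the source): no prompt caching, and every step
    appends a fixed number of tokens (tool output plus model reply).
    """
    total = 0
    context = base_prompt
    for _ in range(steps):
        total += context                  # the full context counts as input on this step
        context += tokens_added_per_step  # the history grows before the next step
    return total


# Hypothetical figures: halving the step count cuts input tokens by more than half,
# because the later steps are the ones that carry the largest context.
short_run = total_input_tokens(steps=10, base_prompt=1000, tokens_added_per_step=500)
long_run = total_input_tokens(steps=20, base_prompt=1000, tokens_added_per_step=500)
print(short_run, long_run)
```

Under these assumptions input tokens grow roughly with the square of the step count, so a model that finishes in fewer steps saves disproportionately on input.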

Datacurve@datacurve

Full deep dive coming soon.

Check out the full benchmark here → https://deepswe.datacurve.ai

30d4.9K338

eric@ericlim

@theo

30d1.3K563

Theo - t3.gg@theo

@ericlim Lmaoooo

30d1.1K25

sven2401@sven2401

@datacurve @winkey_h Can we get 5.5 low too? 🙂 Would love to see how good 5.5 is on the cost/performance tradeoff.

30d1.7K15

Brendan Toscano@AIEngineer10x

@datacurve @winkey_h Interesting. Initially I was skeptical of GPT-5.5's performance on this bench, but what I failed to realize is the massive difference between GPT-5.5 medium and xhigh. Can't wait to see the score tomorrow with GPT-5.6.

30d1.4K61

Mike@Croczillak

@datacurve @winkey_h Why do y'all make these charts backwards?

30d1.1K12

Curious Paws@CuriousPawsCo

@datacurve @winkey_h Can you add composer 2.5 to this chart?

30d57311

Robpoll🇦🇺🪃@robpoll9

I love 4.8, and the way it communicates, works through the tasks. The Honesty is seriously noticeable. UltraCode is fantastic. XHigh is my go to, Max doesn't seem worth it. Max can answer funny, bigger cost, sometimes more wrong.

I don't think the criticism of 4.8 is fair. Great model. Anthropic cooked.

30d92421

MakerMatters?@MakerMatters

@datacurve @winkey_h Nice, this lines up with my experience using Opus 4.8. Also, can you guys flip the horizontal axis lol, not used to the way you displayed it here. The Pareto frontier is clearly OpenAI-dominated for now.

30d1.1K11
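
For readers unfamiliar with the term, here is a minimal sketch of how a cost/score Pareto frontier can be computed. The `Run` type, the model names, and every number are hypothetical placeholders, not DeepSWE results.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost: float   # average cost per task, lower is better
    score: float  # benchmark score, higher is better


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Return the runs that no cheaper-or-equal run beats on score, sorted by cost."""
    frontier: list[Run] = []
    best_score = float("-inf")
    # Sweep from cheapest to most expensive; at equal cost, the higher score comes first.
    for run in sorted(runs, key=lambda r: (r.cost, -r.score)):
        if run.score > best_score:  # strictly better than everything cheaper
            frontier.append(run)
            best_score = run.score
    return frontier


# Placeholder data only: model-c costs more than model-b and scores lower, so it drops out.
runs = [
    Run("model-a", 1.0, 40.0),
    Run("model-b", 2.0, 55.0),
    Run("model-c", 2.5, 50.0),
    Run("model-d", 4.0, 60.0),
]
print([r.name for r in pareto_frontier(runs)])
```

A lab "dominates" the frontier in this sense when most or all of the surviving points belong to its models.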

Ryan Sael@RyanSael

@datacurve @winkey_h Seems accurate on the cost side. It feels like 5.5 constantly hits its limit, like Claude Code did.

30d9421

Araz@Araz_io

@datacurve @winkey_h I might be nitpicking here but the decision to go with descending order on the x-axis is straight up diabolical.

30d9407

Paweł J Lisowski@PawelJLisowski

Opus 4.8 does feel slightly better in many tasks, not a massive improvement but solid. They definitely screwed something up with max effort though; it feels almost unusable, and xhigh seems to be the highest one worth using.

I wish we could use claude sub with 3rd party harnesses, they gotta dial in on claude code and improve it more..

30d8376

c@punishedfounder

@banteg adopting "pre-mogged"

30d3125

Ayush Porwal@ayushporwalhq

@datacurve @winkey_h Matches with what I am experiencing. I think I like this bench now.

30d923

Eclipse 🌖@ECLresearch

@kimmonismus That cost-efficiency gap between GPT-5.5 and Opus 4.8 is widening — Claude's pricing edge is eroding just as OpenAI scales inference margins. Really curious if Anthropic can close that before 5.6 drops.

30d462

brayden petersen ⁂@bmptrsn

@datacurve @PrunusSpeciosa_ 👀👀👀👀

30d8933