Tech · 31d ago

DeepSWE benchmark data shows GPT-5.5 outperforms Claude Opus 4.8 on software engineering tasks and token efficiency

Claude Opus 4.8 cost $12.58 per task, versus $6.61 for GPT-5.5.

Original post

Lisan al Gaib@scaling01 · #1215 in Tech

Opus 4.8 gets score-, time- and token-mogged by GPT-5.5 on DeepSWE

9:59 AM · May 30, 2026 · 61.7K Views

Sentiment

Positive users praise GPT-5.5 for strong efficiency gains and lower costs versus Claude Opus 4.8 on DeepSWE, while negative users dismiss the benchmark itself as unreliable or biased.

Positive: 69.7%

Negative: 30.3%

36 comments with sentiment.

Related links

DeepSWE

Posts from X

Most Activity

Views 63.9K · Bookmarks 76 · Likes 652 · Retweets 50 · Replies 44

Vaibhav (VB) Srivastav@reach_vb

GPT-5.5 is #1 on DeepSWE, a hard long-horizon coding benchmark 🔥

70% pass@1 vs 58% for Claude Opus 4.8.

And GPT-5.5 gets there with: ~2x faster runs, ~1/2 the cost, ~1/3 the output tokens

Literally, better intelligence per dollar, per minute, per task.

Vaibhav (VB) Srivastav@reach_vb

Best part: GPT-5.5 does all of this while being ~3x more token efficient than Opus 4.8.

47k output tokens vs 136k.

Oh, it's also cheaper and faster: $6.61/task vs $12.58, 21 min vs 43 min.

Enjoy!
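
The headline ratios can be checked directly from the per-task figures quoted in this post. The snippet below is a minimal sketch and is not part of the original thread. The dictionary keys and the `ratios` helper are hypothetical names chosen for illustration, and the numbers are copied from the posts as quoted, not pulled from the benchmark itself.

```python
# Per-task figures as quoted in the thread (not fetched from DeepSWE).
gpt_5_5 = {"cost_usd": 6.61, "minutes": 21, "output_tokens": 47_000}
opus_4_8 = {"cost_usd": 12.58, "minutes": 43, "output_tokens": 136_000}


def ratios(baseline, other):
    """For each metric, return how many times more `other` uses than `baseline`."""
    return {key: other[key] / baseline[key] for key in baseline}


r = ratios(gpt_5_5, opus_4_8)
for metric, value in r.items():
    print(f"{metric}: Opus 4.8 uses {value:.2f}x what GPT-5.5 uses")
```

With these inputs the cost ratio comes out at about 1.9x, the run-time ratio at about 2.0x, and the output-token ratio at about 2.9x, which is consistent with the rounded "~2x" and "~3x" claims.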

Andrew Ruiz@then_there_was

Genuinely awestruck at how GPT-5.5 seems to continue improving in performance the more you let it think.

jason@jxnlco

Trying to maximize spend or actually getting your work done?

Collin Burdick@CollinBurdick

Who said you can't have cheap, fast, and good at the same time??

GPT-5.5 smashes Opus 4.8 on DeepSWE across all 3 at highest max reasoning.

>> Higher score: 70% vs. 58%
>> 2x faster
>> 2x cheaper
>> 3x fewer output tokens

5.5 high still beats 4.8 max 62% vs. 58% while being 3x faster and 3x cheaper

That matters beyond software engineering. In life sciences, better models can help teams use scarce researcher time, budget, and experimental capacity more efficiently, find results sooner, and make more patient impact, faster.

And we are just getting started.

Aidan Clark@_aidan_clark_

Nostra culpa for losing the cool-vibe to Claude but if you actually care about quality (or cost!) come try 5.5

Gabriel Chua@gabrielchua

GPT-5.5 going strong on DeepSWE

For performance vs cost/time/output tokens

Lisan al Gaib@scaling01

sauce: https://deepswe.datacurve.ai/

Lisan al Gaib@scaling01

*and cost mogged too

bruce@bruce_x_offi

@scaling01 The DeepSWE benchmark is dog sheet: according to it, GPT 5.4 is the 2nd best performing model, which is not true. (Not just me, my whole team has the same opinion that Claude before 4.7 was far, far better than GPT 5.4.)

Chubby♨️@kimmonismus

@reach_vb Kudos guys. 5.5 is a great model

Stutgard@Stutigardum

@then_there_was Why tf would you put the cost like that

LeetLLM.com@leetllm

@scaling01 Honestly, GPT 5.5 xhigh has been my daily driver lately. Opus is still great, but OpenAI is just winning the instruction following race right now.

banteg@banteg

@reach_vb interesting that just half a year ago claude was considered a much faster model. lots of good shipping since then!

LeetLLM.com@leetllm

@reach_vb been using 5.5 xhigh for my own agent loops and the instruction following is unreal. the pass rate is nice, but getting there at half the cost is the actual story here.

Machine Brief@MachineBrief

@scaling01 benchmarks are theater. Opus 4.8 actually ships code that works. i'll take boring and reliable over benchmark-flexing any day.

Deva@DevaBuilds

@reach_vb 70 vs 58 matters less than the cost story. At half the price and a third of the output tokens, you run GPT 5.5 twice as many times for the same budget. In agentic loops that compounds fast.
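
One way to make the "same budget" argument concrete is to fold the pass rate into the price and compare cost per solved task. The sketch below is illustrative only. It assumes each attempt succeeds independently with probability equal to the quoted pass@1, which the thread does not state, and the function names are hypothetical.

```python
def cost_per_solved_task(cost_per_attempt, pass_rate):
    """Expected spend per success, ASSUMING independent attempts at `pass_rate`."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_attempt / pass_rate


def expected_solves(budget, cost_per_attempt, pass_rate):
    """Expected number of solved tasks a fixed budget buys under the same assumption."""
    attempts = budget / cost_per_attempt
    return attempts * pass_rate


# Figures as quoted in the thread: $/task and pass@1.
gpt_cost = cost_per_solved_task(6.61, 0.70)
opus_cost = cost_per_solved_task(12.58, 0.58)
attempt_ratio = 12.58 / 6.61  # attempts per dollar, GPT-5.5 relative to Opus 4.8

print(f"GPT-5.5:  ${gpt_cost:.2f} per solved task")
print(f"Opus 4.8: ${opus_cost:.2f} per solved task")
print(f"Attempts for the same budget: {attempt_ratio:.2f}x")
```

Under these assumptions the quoted figures work out to roughly $9.44 per solved task for GPT-5.5 against roughly $21.69 for Opus 4.8, and a fixed budget buys about 1.9x as many attempts, close to the "twice as many times" in the comment.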

otw2@imotw2

@Stutigardum @then_there_was How else would you put it?

Bran@Bran_Fi

@reach_vb Intelligence per token, desired outcomes at the least amount of tokens required, at the most reasonable possible price. This is the metric that will win the long game.

Maximilian Scholz@scholzmx

@bruce_x_offi @scaling01 I'd say ever since 5.2, codex/gpt had the lead over opus. Opus definitely wasn't "far far" better.

Slopware Engineer@slopwareindy

@bruce_x_offi @scaling01 Have you actually used 5.4? 5.4 xhigh and 5.5 high are virtually indistinguishable in output.
