Opus 4.8 gets score-, time- and token-mogged by GPT-5.5 on DeepSWE
DeepSWE benchmark data shows GPT-5.5 outperforms Claude Opus 4.8 on software engineering tasks and token efficiency
Claude Opus 4.8 cost $12.58 per task, versus $6.61 for GPT-5.5.
Positive users praise GPT-5.5 for strong efficiency gains and lower costs versus Claude Opus 4.8 on DeepSWE, while negative users dismiss the benchmark itself as unreliable or biased.
GPT-5.5 is #1 on DeepSWE, a hard long-horizon coding benchmark 🔥
70% pass@1 vs 58% for Claude Opus 4.8.
And GPT-5.5 gets there with: ~2x faster runs, ~1/2 the cost, ~1/3 the output tokens.
Literally, better intelligence per dollar, per minute, per task.
Best part: GPT-5.5 does all of this while being ~3x more token efficient than Opus 4.8.
47k output tokens vs 136k.
Oh, it's also cheaper and faster: $6.61/task vs $12.58, 21 min vs 43 min.
Enjoy!
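As a quick sanity check on the "~2x" and "~3x" claims, the ratios can be recomputed from the per-task figures quoted in these posts. This is only a minimal sketch: the dictionary layout and the `ratio` helper are illustrative names, and the numbers are taken as quoted, not independently verified against the leaderboard.

```python
# Per-task figures as quoted in the posts above (not independently verified).
gpt_5_5 = {"pass_at_1": 0.70, "cost_usd": 6.61, "minutes": 21, "output_tokens": 47_000}
opus_4_8 = {"pass_at_1": 0.58, "cost_usd": 12.58, "minutes": 43, "output_tokens": 136_000}


def ratio(metric):
    """Return how many times larger the Opus 4.8 figure is than the GPT-5.5 figure."""
    return opus_4_8[metric] / gpt_5_5[metric]


cost_ratio = ratio("cost_usd")
time_ratio = ratio("minutes")
token_ratio = ratio("output_tokens")
# Difference in pass@1, expressed in percentage points.
score_gap_points = (gpt_5_5["pass_at_1"] - opus_4_8["pass_at_1"]) * 100

print(f"cost:   {cost_ratio:.2f}x")          # cost:   1.90x
print(f"time:   {time_ratio:.2f}x")          # time:   2.05x
print(f"tokens: {token_ratio:.2f}x")         # tokens: 2.89x
print(f"score gap: {score_gap_points:.0f} points")  # score gap: 12 points
```

On the quoted numbers, "~2x faster", "~1/2 the cost" and "~3x more token efficient" are all fair roundings.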
Genuinely awestruck at how GPT-5.5 seems to continue improving in performance the more you let it think.
Trying to maximize spend or actually getting your work done?
Who said you can't have cheap, fast, and good at the same time??
GPT-5.5 smashes Opus 4.8 on DeepSWE across all 3 at highest max reasoning.
>> Higher score: 70% vs. 58%
>> 2x faster
>> 2x cheaper
>> 3x fewer output tokens
GPT-5.5 high still beats Opus 4.8 max, 62% vs. 58%, while being 3x faster and 3x cheaper.
That matters beyond software engineering. In life sciences, better models can help teams use scarce researcher time, budget, and experimental capacity more efficiently, find results sooner, and make more patient impact, faster.
And we are just getting started.
Nostra culpa for losing the cool-vibe to Claude, but if you actually care about quality (or cost!), come try 5.5
GPT-5.5 going strong on DeepSWE
For performance vs cost/time/output tokens
sauce: https://deepswe.datacurve.ai/
Opus 4.8 gets score-, time- and token-mogged by GPT-5.5 on DeepSWE
*and cost mogged too

@scaling01 The DeepSWE benchmark is dog sheet. According to it, GPT 5.4 is the 2nd best performing model, which is not true. (Not just me: my whole team shares the opinion that Claude before 4.7 was far, far better than GPT 5.4.)
@reach_vb Kudos guys. 5.5 is a great model

@then_there_was Why tf would you put the cost like that

@scaling01 Honestly, GPT 5.5 xhigh has been my daily driver lately. Opus is still great, but OpenAI is just winning the instruction following race right now.

@reach_vb interesting that just half a year ago claude was considered a much faster model. lots of good shipping since then!

@reach_vb been using 5.5 xhigh for my own agent loops and the instruction following is unreal. the pass rate is nice, but getting there at half the cost is the actual story here.

@scaling01 benchmarks are theater. Opus 4.8 actually ships code that works. i'll take boring and reliable over benchmark-flexing any day.

@reach_vb 70 vs 58 matters less than the cost story. At half the price and a third of the output tokens, you run GPT 5.5 twice as many times for the same budget. In agentic loops that compounds fast.
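The budget argument in that reply can be made concrete with a rough sketch. It assumes the quoted per-task averages hold for every run, that failed runs cost the same as successful ones, and that runs are independent. None of this is confirmed by the benchmark itself, and the function names and the $100 budget are illustrative.

```python
# Quoted per-task figures: (pass@1, average cost in USD). Taken as given, not verified.
MODELS = {
    "GPT-5.5": (0.70, 6.61),
    "Opus 4.8": (0.58, 12.58),
}


def runs_for_budget(budget_usd, cost_per_task):
    """Number of task attempts a fixed budget buys (fractional, for comparison only)."""
    return budget_usd / cost_per_task


def cost_per_solved_task(cost_per_task, pass_rate):
    """Expected spend per passing task, assuming failures cost as much as successes."""
    return cost_per_task / pass_rate


BUDGET_USD = 100.0  # hypothetical budget for illustration
summary = {}
for name, (pass_rate, cost) in MODELS.items():
    runs = runs_for_budget(BUDGET_USD, cost)
    summary[name] = {
        "runs": runs,
        "expected_solved": runs * pass_rate,
        "usd_per_solved": cost_per_solved_task(cost, pass_rate),
    }
    print(f"{name}: {runs:.1f} runs, ${summary[name]['usd_per_solved']:.2f} per solved task")

# GPT-5.5: 15.1 runs, $9.44 per solved task
# Opus 4.8: 7.9 runs, $21.69 per solved task
```

Under these assumptions the gap in expected cost per solved task (about 2.3x) is wider than the raw price gap (about 1.9x), because the higher pass rate and the lower price multiply.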

@Stutigardum @then_there_was How else would you put it?

@reach_vb Intelligence per token, desired outcomes at the least amount of tokens required, at the most reasonable possible price. This is the metric that will win the long game.

@bruce_x_offi @scaling01 I'd say ever since 5.2, codex/gpt had the lead over opus. Opus definitely wasn't "far far" better.

@bruce_x_offi @scaling01 Have you actually used 5.4? 5.4 xhigh and 5.5 high are virtually indistinguishable in output.