Good results! Lines up with my experience
Opus 4.8 is now on DeepSWE.
On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.
OpenAI's GPT 5.4 continues to outperform the new model.
Positive users praise Opus 4.8 for higher DeepSWE scores at lower cost plus better honesty and task handling, while negative users call the harness garbage and suggest the updates head in the wrong direction.
opus 4.8 comes pre-mogged, even by gpt 5.4. openai is 1.9 releases ahead now.
Opus 4.8 is a solid jump over Opus 4.7 on DeepSWE, while also lowering the average cost per task.
However, GPT-5.5 xhigh still beats it by a pretty clear margin while being cheaper.
OpenAI has been cooking insanely hard with its models lately. Really excited to see what GPT-5.6 brings.
That said, I have to admit: I’m starting to really like Opus 4.8 as well.
We’ve entered a moment where both frontier labs keep shipping genuinely impressive models.
Opus 4.8 is now on DeepSWE.
On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.

Opus 4.8 delivers efficiency gains by solving tasks in fewer steps, directly reducing the total number of input tokens required per task.
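
To see why fewer steps can cut input tokens by more than the step count alone suggests, here is a minimal sketch. It assumes an agent loop that re-reads its whole history on every step with no prompt caching; the function name and all numbers are hypothetical and are not taken from DeepSWE.

```python
def total_input_tokens(steps: int, base_context: int, tokens_added_per_step: int) -> int:
    """Sum input tokens over an agent run where each step re-reads the full history.

    Hypothetical model: no prompt caching, and the context grows by a fixed
    amount after every step (tool output plus the model's own reply).
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context                  # the model reads the whole context this step
        context += tokens_added_per_step  # history grows before the next step
    return total


# Illustrative numbers only: a 10k-token starting context, 2k tokens added per step.
longer_run = total_input_tokens(steps=40, base_context=10_000, tokens_added_per_step=2_000)
shorter_run = total_input_tokens(steps=30, base_context=10_000, tokens_added_per_step=2_000)
print(longer_run, shorter_run)  # 1960000 1170000
```

Under these assumptions, 25% fewer steps gives roughly 40% fewer input tokens, because the dropped steps are the late ones with the longest context. Real harnesses with caching or context compaction would behave differently.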

Full deep dive coming soon.
Check out the full benchmark here → https://deepswe.datacurve.ai

@theo

@ericlim Lmaoooo

@datacurve @winkey_h Can we get 5.5 low too? 🙂 Would love to see how good 5.5's cost/performance tradeoff is.

@datacurve @winkey_h Interesting. Initially I was skeptical of GPT-5.5's performance on this bench, but what I failed to realize is the massive difference between GPT-5.5 medium and xhigh. Can't wait to see the score tomorrow with GPT-5.6.

@datacurve @winkey_h Why do y'all make these charts backwards?

@datacurve @winkey_h Can you add composer 2.5 to this chart?

I love 4.8 and the way it communicates and works through tasks. The honesty is seriously noticeable. UltraCode is fantastic. XHigh is my go-to; Max doesn't seem worth it. Max can give odd answers, costs more, and is sometimes more wrong.
I don't think the criticism of 4.8 is fair. Great model. Anthropic cooked.

@datacurve @winkey_h Nice, this lines up with my experience using Opus 4.8. Also, can you guys flip the horizontal axis lol, I'm not used to the way you displayed it here. The Pareto frontier is clearly OpenAI-dominated for now.

@datacurve @winkey_h Seems accurate on the cost side. I feel like I constantly hit the 5.5 limit, the way I used to with Claude Code.

@datacurve @winkey_h I might be nitpicking here but the decision to go with descending order on the x-axis is straight up diabolical.

Opus 4.8 does feel slightly better in many tasks: not a massive improvement, but solid. They definitely screwed something up with max effort though. It feels almost unusable, so xhigh seems to be the highest one worth using.
I wish we could use a Claude sub with 3rd-party harnesses. They've got to dial in on Claude Code and improve it more.

@banteg adopting "pre-mogged"

@datacurve @winkey_h Matches what I'm experiencing. I think I like this bench now.

@kimmonismus That cost-efficiency gap between GPT-5.5 and Opus 4.8 is widening. Claude's pricing edge is eroding just as OpenAI scales inference margins. Really curious whether Anthropic can close that before 5.6 drops.

@datacurve @PrunusSpeciosa_ 👀👀👀👀