SWE-bench is kind of a shitshow, and that makes evaluating LLMs hard. DeepSWE is the first agentic coding benchmark that makes sense.
Opus 4.8 is now on DeepSWE.
At the default high thinking effort, it scores 6% higher than Opus 4.7 at xhigh while also lowering average cost per task.