I want to see Mythos scores
Opus 4.8 is now on DeepSWE.
On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.
Opus 4.8 also reduced the average cost per task.
SWE-bench is kind of a shitshow, and it makes evaluating LLMs hard. DeepSWE is the first agentic coding benchmark that makes sense.
Many users praise DeepSWE as a timely, independent benchmark for real agentic coding workflows, while others dismiss the results as untrustworthy because they suspect VC funding bias.