
Theo of t3.gg argues SWE-Bench is unreliable as DeepSWE benchmark results show Opus 4.8 outperforming Opus 4.7

Opus 4.8 also reduced the average cost per task.

Lisan al Gaib (@scaling01)

I want to see Mythos scores

Datacurve (@datacurve)

Opus 4.8 is now on DeepSWE.

On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.

3:04 PM · May 31, 2026 · 15.4K Views
Posts from X
Views 90.5K · Bookmarks 132 · Likes 688 · Retweets 23 · Replies 39

swe-bench is kind of a shitshow, and it makes evaluating LLMs hard. DeepSWE is the first agentic code bench that makes sense.

