20h ago

Theo of t3.gg argues SWE-Bench is unreliable as DeepSWE benchmark results show Opus 4.8 outperforming Opus 4.7

Opus 4.8 also reduced the average cost per task.

Sentiment: 80% positive, 20% negative

Many users praise DeepSWE as a timely, independent benchmark for real agentic coding workflows, while others dismiss the results as untrustworthy because of suspected VC funding bias.

17 comments with sentiment.
