20h ago

Theo of t3.gg argues SWE-Bench is unreliable as DeepSWE benchmark results show Opus 4.8 outperforming Opus 4.7

Opus 4.8 also reduced the average cost per task.

Sentiment: Pos 80%, Neg 20%

Many users praise DeepSWE as a timely independent benchmark for real agentic coding workflows, while others dismiss the results as untrustworthy due to suspected VC funding bias.

17 comments with sentiment.