Theo of t3.gg argues SWE-Bench is unreliable as DeepSWE benchmark results show Opus 4.8 outperforming Opus 4.7 · Digg