3h ago
DeepSWE benchmark shows GPT-5.4 leading Claude Opus 4.8 on long-horizon coding tasks
GPT-5.5 achieved a 48% score at one-fifth the cost
Sentiment: Positive 50%, Negative 50%
Positive users praise GPT-5.5 for outperforming Claude Opus 4.8 on the DeepSWE benchmark and for its cost efficiency, while negative users call the benchmark unreliable and urge Anthropic to release better models quickly.
10 comments with sentiment.