3h ago

DeepSWE benchmark shows GPT-5.4 leading Claude Opus 4.8 on long-horizon coding tasks

GPT-5.5 achieved a 48% score at one-fifth the cost

Sentiment: 50% positive, 50% negative

Positive commenters praise GPT-5.5 for outperforming Claude Opus 4.8 on the DeepSWE benchmark and for its cost efficiency, while negative commenters call the benchmark unreliable and urge Anthropic to release better models quickly.

10 comments with sentiment.
