Claude Opus 4.8 scores 58% on DeepSWE coding benchmark, trailing GPT-5.5's 70% despite reducing average task costs to $12.58

Chris Hayduk notes the model ran twice as slow.

Original post

The efficiency frontier! Where do you think GPT-5.6 will land?

1:39 PM · May 30, 2026

It will also be interesting to track how open models close the gap in the coming months.

Epoch AI @EpochAIResearch

We took another look at the capability gap between open-weight and proprietary models. Since the start of the year, open-weight models have lagged the state of the art by four months.

8:01 PM · May 29, 2026 · 264.3K Views
9:03 PM · May 30, 2026 · 3.4K Views