4h ago

Claude Opus 4.8 scores 58% on DeepSWE coding benchmark, trailing GPT-5.5 but demonstrating lower task costs

Runway’s founder says DeepSWE matches real-world coding impressions.

Original post

The efficiency frontier! Where do you think GPT-5.6 will land?

1:39 PM · May 30, 2026

It will also be interesting to track how open models close the gap in the coming months.

Epoch AI (@EpochAIResearch)

We took another look at the capability gap between open-weight and proprietary models. Since the start of the year, open-weight models have lagged the state of the art by four months.

8:01 PM · May 29, 2026 · 203.9K Views
9:03 PM · May 30, 2026 · 2.2K Views