Datacurve launches DeepSWE, a tougher coding benchmark built to show where leading models actually pull apart.
GPT-5.5 hits 70%, while GPT-5.4 reaches 56% and Claude Opus 4.7 reaches 54%, exposing a gap that older benchmarks largely hid.
It's a long-horizon software engineering benchmark.
- DeepSWE differs from older coding benchmarks in where its tasks come from: older tests often reuse public GitHub issues and PRs, while DeepSWE uses original tasks, so models are less likely to have seen the answer during training.
- The work is also bigger even when the prompt is shorter, because older tests often tell the model what area to touch, while DeepSWE makes the agent search the repo, understand the design, edit multiple files, and avoid breaking old behavior.
On DeepSWE, prompts are half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
- The grading is different too, because many older benchmarks reuse tests from one merged PR, while DeepSWE checks whether the requested behavior actually works, even if the model solves it in a different valid way.
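
To make the grading point concrete, here is a toy sketch of the difference between checking behavior and checking resemblance to one reference patch. It is not DeepSWE's harness. The task (slugifying a title), both solutions, and both check functions are hypothetical names invented for illustration.

```python
import re

# Two hypothetical solutions to the same made-up task:
# "turn a title into a lowercase, hyphen-separated slug".
def slugify_regex(title: str) -> str:
    # Reference-style solution: collapse non-alphanumeric runs with a regex.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def slugify_manual(title: str) -> str:
    # Different but equally valid solution: scan characters by hand.
    words, current = [], []
    for ch in title.lower():
        if ch.isascii() and ch.isalnum():
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return "-".join(words)

def behavior_check(solution) -> bool:
    # Grades only observable behavior: any implementation that
    # produces the requested output passes.
    cases = {"Hello, World!": "hello-world", "  DeepSWE  v1 ": "deepswe-v1", "": ""}
    return all(solution(src) == want for src, want in cases.items())

def implementation_coupled_check(solution) -> bool:
    # Brittle grading (assumed stand-in for tests tied to one merged PR):
    # passes only if the solution calls `sub`, as the reference does.
    return "sub" in solution.__code__.co_names

if __name__ == "__main__":
    for fn in (slugify_regex, slugify_manual):
        print(fn.__name__, behavior_check(fn), implementation_coupled_check(fn))
```

Both solutions pass the behavior check, but only the regex version passes the implementation-coupled one, which is the kind of false negative a behavior-based grader is meant to avoid.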
