13h ago

Datacurve releases DeepSWE, a long-horizon software engineering benchmark designed to prevent data leakage

GPT-5.5 scored 70 on the new agentic evaluation.

Sentiment: 79.6% positive, 20.4% negative

Positive users praise the DeepSWE agentic coding benchmark for matching real daily use and for exposing tangible gaps between models, such as GPT-5.5's lead. Negative users dismiss specific rankings and call the benchmark flawed or inaccurate.

29 comments with sentiment.
