Serena Ge releases DeepSWE, a long-horizon benchmark designed to evaluate AI coding agents on complex engineering tasks

Its prompts are half the length of those in SWE-bench Pro.

Original post

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

9:18 AM · May 26, 2026

This is the new standard for engineering evals

Serena Ge (Datacurve) @serenaa_ge

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

4:18 PM · May 26, 2026 · 209.2K Views
5:32 PM · May 26, 2026 · 43.8K Views

wow evals caught up to vibes

6:15 PM · May 26, 2026 · 15.7K Views

I think what you'll also find is the tasks are like a third the price.

jason @jxnlco

wow evals caught up to vibes

6:15 PM · May 26, 2026 · 15.7K Views
7:20 PM · May 26, 2026 · 704 Views

> they built a "NEW" coding benchmark
> GPT-5.5 scores 70%
> Mythos probably ~90%
> mfw it's already saturated
> and you are asking "when will the AI bubble pop?"

8:02 PM · May 26, 2026 · 2.5K Views

you literally need to build benchmarks where frontier models score below 10% or the benchmark is cooked within 6 months

Lisan al Gaib @scaling01

> they built a "NEW" coding benchmark
> GPT-5.5 scores 70%
> Mythos probably ~90%
> mfw it's already saturated
> and you are asking "when will the AI bubble pop?"

8:02 PM · May 26, 2026 · 2.5K Views
8:05 PM · May 26, 2026 · 716 Views

New coding benchmark.

GPT-5.5 and GPT-5.4 are ahead of Opus 4.7 💀

7:25 PM · May 26, 2026 · 6K Views

counterintuitive result right here, imo

the data analysis is well done, lots of thoughts went behind them. great eval!

7:59 PM · May 26, 2026 · 1.5K Views

@theo it’s funny i often find myself judging benchmarks on how well they match my own taste.

i should either stop paying attention to benchmarks or stop over valuing my taste.

Theo - t3.gg @theo

This is the first code bench that actually aligns with how it feels to use these models coding.

7:13 PM · May 26, 2026 · 59.8K Views
8:38 PM · May 26, 2026 · 34 Views

This is the first code bench that actually aligns with how it feels to use these models coding.

7:13 PM · May 26, 2026 · 59.8K Views

This chart is nuts

Theo - t3.gg @theo

This is the first code bench that actually aligns with how it feels to use these models coding.

7:13 PM · May 26, 2026 · 59.8K Views
7:16 PM · May 26, 2026 · 9.9K Views

So many gems in this. Gap between "official" harnesses and their simple agent harness says a lot about how the labs are operating. Gemini gap is hilarious

Theo - t3.gg @theo

This chart is nuts

7:16 PM · May 26, 2026 · 9.9K Views
7:48 PM · May 26, 2026 · 6.3K Views

Gemini 3.5 Flash being MORE EXPENSIVE than GPT-5.5 at HALF the score is also hilarious

Theo - t3.gg @theo

So many gems in this. Gap between "official" harnesses and their simple agent harness says a lot about how the labs are operating. Gemini gap is hilarious

7:48 PM · May 26, 2026 · 6.3K Views
7:49 PM · May 26, 2026 · 3.4K Views

First correct benchmark I’ve seen in a while

7:53 PM · May 26, 2026 · 661 Views