Serena Ge releases DeepSWE, a long-horizon benchmark designed to evaluate AI coding agents on complex engineering tasks

QUOTE POST

This is the new standard for engineering evals

Serena Ge (Datacurve)@serenaa_ge

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

4:18 PM · May 26, 2026 · 209.2K Views

5:32 PM · May 26, 2026 · 43.8K Views

QUOTE POST

#929jason@JXNLCO

wow evals caught up to vibes

Serena Ge (Datacurve)@serenaa_ge

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

4:18 PM · May 26, 2026 · 209.2K Views

6:15 PM · May 26, 2026 · 15.7K Views

REPLY

#929jason@JXNLCO

I think what you'll also find is the tasks are like a third the price.

jason@jxnlco

wow evals caught up to vibes

6:15 PM · May 26, 2026 · 15.7K Views

7:20 PM · May 26, 2026 · 704 Views

QUOTE POST

#980Lisan al Gaib@SCALING01

> they built a "NEW" coding benchmark > GPT-5.5 scores 70% > Mythos probably ~90% > mfw it's already saturated > and you are asking "when will the AI bubble pop?"

Serena Ge (Datacurve)@serenaa_ge

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

4:18 PM · May 26, 2026 · 209.2K Views

8:02 PM · May 26, 2026 · 2.5K Views

REPLY

#980Lisan al Gaib@SCALING01

you literally need to build benchmarks where frontier models score below 10% or the benchmark is cooked within 6 months

Lisan al Gaib@scaling01

> they built a "NEW" coding benchmark > GPT-5.5 scores 70% > Mythos probably ~90% > mfw it's already saturated > and you are asking "when will the AI bubble pop?"

8:02 PM · May 26, 2026 · 2.5K Views

8:05 PM · May 26, 2026 · 716 Views

QUOTE POST

#980Lisan al Gaib@SCALING01

New coding benchmark.

GPT-5.5 and GPT-5.4 are ahead of Opus 4.7 💀

Serena Ge (Datacurve)@serenaa_ge

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

4:18 PM · May 26, 2026 · 209.2K Views

7:25 PM · May 26, 2026 · 6K Views

QUOTE POST

#1153Florian Brand@XEOPHON

counterintuitive result right here, imo

the data analysis is well done, lots of thoughts went behind them. great eval!

Serena Ge (Datacurve)@serenaa_ge

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

4:18 PM · May 26, 2026 · 209.2K Views

7:59 PM · May 26, 2026 · 1.5K Views

REPLY

#1711🍓🍓🍓@IRULETHEWORLDMO

@theo it’s funny i often find myself judging benchmarks on how well they match my own taste.

i should either stop paying attention to benchmarks or stop over valuing my taste.

Theo - t3.gg@theo

This is the first code bench that actually aligns with how it feels to use these models coding.

7:13 PM · May 26, 2026 · 59.8K Views

8:38 PM · May 26, 2026 · 34 Views

QUOTE POST

#1829Theo - t3.gg@THEO

This is the first code bench that actually aligns with how it feels to use these models coding.

Serena Ge (Datacurve)@serenaa_ge

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

4:18 PM · May 26, 2026 · 209.2K Views

7:13 PM · May 26, 2026 · 59.8K Views

REPLY

#1829Theo - t3.gg@THEO

This chart is nuts

Theo - t3.gg@theo

This is the first code bench that actually aligns with how it feels to use these models coding.

7:13 PM · May 26, 2026 · 59.8K Views

7:16 PM · May 26, 2026 · 9.9K Views

REPLY

#1829Theo - t3.gg@THEO

So many gems in this. Gap between "official" harnesses and their simple agent harness says a lot about how the labs are operating. Gemini gap is hilarious

Theo - t3.gg@theo

This chart is nuts

7:16 PM · May 26, 2026 · 9.9K Views

7:48 PM · May 26, 2026 · 6.3K Views

REPLY

#1829Theo - t3.gg@THEO

Gemini 3.5 Flash being MORE EXPENSIVE than GPT-5.5 at HALF the score is also hilarious

Theo - t3.gg@theo

So many gems in this. Gap between "official" harnesses and their simple agent harness says a lot about how the labs are operating. Gemini gap is hilarious

7:48 PM · May 26, 2026 · 6.3K Views

7:49 PM · May 26, 2026 · 3.4K Views

QUOTE POST

#1894Nick Dobos@NICKADOBOS

First correct benchmark I’ve seen in a while

Serena Ge (Datacurve)@serenaa_ge

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

4:18 PM · May 26, 2026 · 209.2K Views

7:53 PM · May 26, 2026 · 661 Views
