Serena Ge releases DeepSWE, a long-horizon benchmark designed to evaluate AI coding agents on complex engineering tasks

QUOTE POST

This is the new standard for engineering evals

Serena Ge (Datacurve)@serenaa_ge

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.

4:18 PM · May 26, 2026 · 423.2K Views

5:32 PM · May 26, 2026 · 57.2K Views

QUOTE POST

#420 Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@TEORTAXESTEX

Honestly, looks about right, harness (mini-swe-agent) affinities aside

Kimi is the closest to a mature autonomous SWE agent out of open models

DS is weak and needs handholding (though has isolated strengths like debugging)

a mark of good eval: stronger separation of top tier

9:40 PM · May 26, 2026 · 2.3K Views

QUOTE POST

#929 jason@jxnlco

wow evals caught up to vibes

6:15 PM · May 26, 2026 · 28.5K Views

REPLY

#929 jason@jxnlco

I think what you'll also find is the tasks are like a third the price.

7:20 PM · May 26, 2026 · 1.3K Views

QUOTE POST

#980 Lisan al Gaib@scaling01

> they built a "NEW" coding benchmark
> GPT-5.5 scores 70%
> Mythos probably ~90%
> mfw it's already saturated
> and you are asking "when will the AI bubble pop?"

8:02 PM · May 26, 2026 · 10.7K Views

REPLY

#980 Lisan al Gaib@scaling01

you literally need to build benchmarks where frontier models score below 10% or the benchmark is cooked within 6 months

8:05 PM · May 26, 2026 · 1.7K Views

QUOTE POST

#980 Lisan al Gaib@scaling01

New coding benchmark.

GPT-5.5 and GPT-5.4 are ahead of Opus 4.7 💀

7:25 PM · May 26, 2026 · 18K Views

QUOTE POST

#1153 Florian Brand@XEOPHON

counterintuitive result right here, imo

the data analysis is well done, lots of thoughts went behind them. great eval!

7:59 PM · May 26, 2026 · 3.9K Views

QUOTE POST

#1496 Chubby♨️@KIMMONISMUS

It's truly amazing to see how the general sentiment has shifted in favor of Codex.

I'm reading so many posts saying that Codex is really good now with GPT-5.5, and that Claude Code is regularly preferred.

(I've become a huge Codex fan myself).

At the same time, the new DeepSWE benchmark shows that GPT-5.5 is now ranked number one in this measurement as well.

9:44 PM · May 26, 2026 · 6.7K Views

QUOTE POST

#1711 🍓🍓🍓@IRULETHEWORLDMO

finally, a great benchmark. now we just need one thousand more.

seriously though, this is great.

and we do need one thousand more.

11:07 PM · May 26, 2026 · 314 Views

REPLY

#1711 🍓🍓🍓@IRULETHEWORLDMO

@theo it’s funny i often find myself judging benchmarks on how well they match my own taste.

i should either stop paying attention to benchmarks or stop over valuing my taste.

Theo - t3.gg@theo

This is the first code bench that actually aligns with how it feels to use these models coding.

7:13 PM · May 26, 2026 · 136.8K Views

8:38 PM · May 26, 2026 · 635 Views

QUOTE POST

#1829 Theo - t3.gg@theo

This is the first code bench that actually aligns with how it feels to use these models coding.

7:13 PM · May 26, 2026 · 136.8K Views

REPLY

#1829 Theo - t3.gg@theo

This chart is nuts

7:16 PM · May 26, 2026 · 15.2K Views

REPLY

#1829 Theo - t3.gg@theo

So many gems in this. Gap between "official" harnesses and their simple agent harness says a lot about how the labs are operating. Gemini gap is hilarious

7:48 PM · May 26, 2026 · 12.1K Views

REPLY

#1829 Theo - t3.gg@theo

Gemini 3.5 Flash being MORE EXPENSIVE than GPT-5.5 at HALF the score is also hilarious

7:49 PM · May 26, 2026 · 24.7K Views

QUOTE POST

#1894 Nick Dobos@NICKADOBOS

First correct benchmark I’ve seen in a while

7:53 PM · May 26, 2026 · 5K Views
