Serena Ge releases DeepSWE, a long-horizon benchmark designed to evaluate AI coding agents on complex engineering tasks
Its prompts are half the length of SWE-bench Pro.
This is the new standard for engineering evals

Today we’re releasing DeepSWE, a new standard for agentic coding benchmarks. On public leaderboards, top models often look relatively close in capability. DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work.
wow evals caught up to vibes
I think what you'll also find is the tasks are like a third the price.
> they built a "NEW" coding benchmark
> GPT-5.5 scores 70%
> Mythos probably ~90%
> mfw it's already saturated
> and you are asking "when will the AI bubble pop?"
you literally need to build benchmarks where frontier models score below 10% or the benchmark is cooked within 6 months
New coding benchmark.
GPT-5.5 and GPT-5.4 are ahead of Opus 4.7 💀
counterintuitive result right here, imo
the data analysis is well done, lots of thought went into it. great eval!

@theo it’s funny i often find myself judging benchmarks on how well they match my own taste.
i should either stop paying attention to benchmarks or stop overvaluing my taste.
This is the first code bench that actually aligns with how it feels to use these models coding.
This chart is nuts

So many gems in this. Gap between "official" harnesses and their simple agent harness says a lot about how the labs are operating. Gemini gap is hilarious

Gemini 3.5 Flash being MORE EXPENSIVE than GPT-5.5 at HALF the score is also hilarious

First correct benchmark I’ve seen in a while