AI · 15h ago

Abundant AI releases SWE-Marathon benchmark for long-horizon coding tasks, where Claude Opus 4.8 topped the leaderboard at 26% success

AI Judge changed title after evaluation, original title: "Rishi Desai releases SWE-Marathon benchmark to evaluate autonomous AI coding agents on multi-hour, billion-token software tasks"

Evaluation trials revealed a 13.8 percent reward-hacking rate

Original post · Ethan Caballero #519
Rishi Desai @rishi_desai2

Can coding agents stay coherent over a 1 billion token budget?

Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust?

Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.

9:13 AM · Jun 5, 2026 · 131.3K Views
Sentiment

Many users praised the SWE-Marathon benchmark as cool and great work for testing AI coding agents on long-horizon tasks, while others objected that it shows models optimizing for evaluations over real functionality.

Positive: 64.7% · Negative: 35.3% (20 comments with sentiment)
Posts from X
Views 38K · Bookmarks 50 · Likes 163 · Replies 11
Lisan al Gaib @scaling01

we are getting spoiled with long-horizon coding benchmarks recently

here's another one: SWE-Marathon

Website: https://www.swe-marathon.org/ Paper: https://www.swe-marathon.org/swe-marathon-paper.pdf

11h · Views 38K · Likes 163 · Bookmarks 50
Retweets 4
Lisan al Gaib @scaling01

Reward Hacking on SWE-Marathon

very interesting to see GPT-5.5 on top

9h · Views 13.8K · Likes 106 · Bookmarks 30
Jesse Mu @jayelmnop

hmmm

https://swe-marathon.vercel.app/#trajectory/rust-c-compiler-257

9h · Views 6.7K · Likes 40 · Bookmarks 9

Interesting. This validates my feeling that on a completely OOD SWE benchmark all open models flop. DeepSeek is way ahead of the pack and still merely on par with Gemini 3.1 (lol). Well at least it's not a psychotic reward hacker. …We'll have to deal with that part somehow.

3h · Views 2.8K · Likes 15 · Bookmarks 4
Lisan al Gaib @scaling01

not surprised to see Gemini near the top

but GPT-5.5 is weird

Referenced post by Lisan al Gaib @scaling01:

Reward Hacking on SWE-Marathon

very interesting to see GPT-5.5 on top

9h · Views 2.5K · Likes 27 · Bookmarks 0
Lisan al Gaib @scaling01

@jayelmnop they are aware

Referenced post by Jesse Mu @jayelmnop:

hmmm

https://swe-marathon.vercel.app/#trajectory/rust-c-compiler-257

9h · Views 1.4K · Likes 19 · Bookmarks 1
Rishi Desai @rishi_desai2

We’ve released the paper, code, and the part of evals that usually stays hidden: the trajectories.

That includes 320 GB of agent trajectories and all 1,300 rollout logs for inspection (!)

Check it out: https://www.swe-marathon.org/

15h · Views 422 · Likes 12 · Bookmarks 2

Generally I feel more sympathy for OpenAI lately. Here they are, trying to RLVR towards actual scientific AGI that'll solve problems directly, as on CritPT. And their great safety-conscious competition: code, code, B2B, B2B, attack, exploit, chyna hawkery, RSI. Not equal.

Referenced post:

Interesting. This validates my feeling that on a completely OOD SWE benchmark all open models flop. DeepSeek is way ahead of the pack and still merely on par with Gemini 3.1 (lol). Well at least it's not a psychotic reward hacker. …We'll have to deal with that part somehow.

2h · Views 1K · Likes 10 · Bookmarks 2
Rishi Desai @rishi_desai2

Building clones for CUA is tapping out, but e2e full-stack products are not. Having a rock-solid spec (e.g. "your Excel clone must support 1000 concurrent users") helps prevent gaps. It feels like a game of whac-a-mole.

Well-calibrated CUA verifiers are a big piece for the UX part.

9h · Views 262 · Likes 3 · Bookmarks 1
Rishi Desai @rishi_desai2

There are no full-stack tasks in existing long-horizon SWE benchmarks because verification is hard.

SWE-Marathon has 4. In Clone-Slack, the agent builds a Slack-like team chat app, verified by a Computer Use Agent that logs in, creates channels, posts messages, reacts, and checks that the app actually works through the UI.

15h · Views 503 · Likes 9
Rishi Desai @rishi_desai2

SWE-Marathon turns real frontier research projects into reproducible evals: Anthropic’s C compiler, OpenAI’s Parameter Golf, Cloudflare’s Next.js rewrite, Cursor’s long-running agent work.

20 tasks across full-stack product clones, library rewrites, ML engineering, and optimization.

15h · Views 554 · Likes 8
Rishi Desai @rishi_desai2

Reward hacking is an arms race between coding agents and RL envs.

Across 1,300 rollouts, 14% showed reward-hacking behavior and 10% shipped clear exploit code.

Some tasks took 10+ “hardening” iterations: run agents, inspect traces, identify shortcuts, patch verifier, rerun.

15h · Views 460 · Likes 8
Tim Kostolansky @thkostolansky

@rishi_desai2 do you ask them to use subagents eg through claude dynamic workflows? guessing this would change performance quite a bit

10h · Views 269 · Bookmarks 1
Rishi Desai @rishi_desai2

Coding agents now run for hundreds of millions of tokens on a single task.

In Rewrite-Next.js, the agent must reimplement Cloudflare’s Next.js-on-Vite rewrite in one go.

One Claude Code rollout used 344M tokens over 4.4 hours. The longest reached 877M tokens!

15h · Views 394 · Likes 7
Tyler @tyhouch

@rishi_desai2 do you plan to release the full tasks on HarborHub so I can try to reproduce 😀 super interested in the implementation of the CUA verifier

10h · Views 281 · Likes 2
Nat McAleese @__nmca__

“hey claude, can I see your homework?”

Referenced post by Jesse Mu @jayelmnop:

hmmm

https://swe-marathon.vercel.app/#trajectory/rust-c-compiler-257

9h · Views 975 · Likes 4 · Bookmarks 0
Eris @eriskiiii

@scaling01 This is only weird if you have never used 5.5

9h · Views 96 · Likes 1
Erik @LilDombi

@scaling01 GPT-5.5 being so high lines up with what @VictorTaelin has been saying. Though I’d be super curious to see what GPT-5.4 scored. Issue with the new pretrain perhaps?

9h · Views 113
Rishi Desai @rishi_desai2

@tyhouch Yes that is the plan!

10h · Views 235 · Likes 3
Tyler @tyhouch

@rishi_desai2 This agentic verifier is so simple im pissed I didn’t think of it 😃

10h · Views 22 · Likes 1