Rishi Desai launches SWE-Marathon, a long-horizon coding benchmark where Claude Opus 4.8 leads with a 26% score

Tasks include building a Slack clone from scratch.

Original post

Ethan Caballero#1077

Rishi Desai@rishi_desai2

Can coding agents stay coherent over a 1 billion token budget?

Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust?

Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.

9:13 AM · Jun 5, 2026 · 147.9K Views

Sentiment

Users congratulated Rishi Desai on SWE-Marathon, viewing its contribution to long-horizon AI agent testing as impressive and innovative.

Positive: 92.9% · Negative: 7.1%

15 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 53.1K · Bookmarks: 56 · Likes: 201 · Replies: 11

Lisan al Gaib@scaling01

we are getting spoiled with long-horizon coding benchmarks recently

here's another one: SWE-Marathon

Website: https://www.swe-marathon.org/ Paper: https://www.swe-marathon.org/swe-marathon-paper.pdf

Rishi Desai@rishi_desai2

Can coding agents stay coherent over a 1 billion token budget?

Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust?

Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.

20h · 53.1K views · 201 likes · 56 bookmarks

Retweets: 7

Lisan al Gaib@scaling01

Reward Hacking on SWE-Marathon

very interesting to see GPT-5.5 on top

Lisan al Gaib@scaling01

we are getting spoiled with long-horizon coding benchmarks recently

here's another one: SWE-Marathon

Website: https://www.swe-marathon.org/ Paper: https://www.swe-marathon.org/swe-marathon-paper.pdf

19h19.9K15840

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Interesting. This validates my feeling that on a completely OOD SWE benchmark all open models flop. DeepSeek is way ahead of the pack and still merely on par with Gemini 3.1 (lol). Well at least it's not a psychotic reward hacker. …We'll have to deal with that part somehow.

Lisan al Gaib@scaling01

we are getting spoiled with long-horizon coding benchmarks recently

here's another one: SWE-Marathon

Website: https://www.swe-marathon.org/ Paper: https://www.swe-marathon.org/swe-marathon-paper.pdf

12h6.9K5017

Lisan al Gaib@scaling01

not surprised to see Gemini near the top

but GPT-5.5 is weird

Lisan al Gaib@scaling01

Reward Hacking on SWE-Marathon

very interesting to see GPT-5.5 on top

19h3.1K300

Lisan al Gaib@scaling01

@jayelmnop they are aware

Jesse Mu@jayelmnop

hmmm

https://swe-marathon.vercel.app/#trajectory/rust-c-compiler-257

19h1.6K211

Rishi Desai@rishi_desai2

We’ve released the paper, code, and the part of evals that usually stays hidden: the trajectories.

That includes 320 GB of agent trajectories and all 1,300 rollout logs for inspection (!)

Check it out: https://www.swe-marathon.org/

1d422122

Rishi Desai@rishi_desai2

Building clones for CUA is tapping out, but e2e full-stack products are not. Having a rock-solid spec (e.g. "your Excel clone must support 1000 concurrent users") helps prevent gaps. It feels like a game of whac-a-mole.

Well-calibrated CUA verifiers are a big piece of the UX part.

19h26231

Rishi Desai@rishi_desai2

There are no full-stack tasks in existing long-horizon SWE benchmarks because verification is hard.

SWE-Marathon has 4. In Clone-Slack, the agent builds a Slack-like team chat app, verified by a Computer Use Agent that logs in, creates channels, posts messages, reacts, and checks that the app actually works through the UI.

1d5039
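The verification flow described in that post (log in, create a channel, post, react, then confirm the app really works) can be sketched as a scripted checklist. This is only an illustration, not the benchmark's verifier: `ChatUI`, `FakeChatApp`, `verify`, and `score` are hypothetical names, and the in-memory fake exists only so the sketch runs. A real Computer Use Agent would drive a browser rather than call methods.

```python
from typing import Dict, List, Protocol


class ChatUI(Protocol):
    """Hypothetical UI surface that a computer-use agent would drive."""
    def login(self, user: str, password: str) -> bool: ...
    def create_channel(self, name: str) -> bool: ...
    def post_message(self, channel: str, text: str) -> bool: ...
    def react(self, channel: str, index: int, emoji: str) -> bool: ...
    def read_channel(self, channel: str) -> List[dict]: ...


class FakeChatApp:
    """In-memory stand-in for the app under test (assumption, for illustration)."""
    def __init__(self) -> None:
        self.user = None
        self.channels: Dict[str, List[dict]] = {}

    def login(self, user: str, password: str) -> bool:
        self.user = user if password else None
        return self.user is not None

    def create_channel(self, name: str) -> bool:
        if self.user is None or name in self.channels:
            return False
        self.channels[name] = []
        return True

    def post_message(self, channel: str, text: str) -> bool:
        if self.user is None or channel not in self.channels:
            return False
        self.channels[channel].append(
            {"author": self.user, "text": text, "reactions": []})
        return True

    def react(self, channel: str, index: int, emoji: str) -> bool:
        msgs = self.channels.get(channel, [])
        if self.user is None or not 0 <= index < len(msgs):
            return False
        msgs[index]["reactions"].append(emoji)
        return True

    def read_channel(self, channel: str) -> List[dict]:
        return self.channels.get(channel, [])


def verify(ui: ChatUI) -> Dict[str, bool]:
    """Run the scripted steps, then read state back instead of trusting return values."""
    results = {
        "login": ui.login("alice", "secret"),
        "create_channel": ui.create_channel("general"),
        "post_message": ui.post_message("general", "hello"),
        "react": ui.react("general", 0, "thumbsup"),
    }
    msgs = ui.read_channel("general")
    results["state_visible"] = (
        len(msgs) == 1
        and msgs[0]["text"] == "hello"
        and msgs[0]["reactions"] == ["thumbsup"]
    )
    return results


def score(results: Dict[str, bool]) -> float:
    """Fraction of checks that passed."""
    return sum(results.values()) / len(results)
```

The final read-back step is the design point: an app whose buttons report success without persisting anything would pass the first four checks and still fail `state_visible`.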

Rishi Desai@rishi_desai2

SWE-Marathon turns real frontier research projects into reproducible evals: Anthropic’s C compiler, OpenAI’s Parameter Golf, Cloudflare’s Next.js rewrite, Cursor’s long-running agent work.

20 tasks across full-stack product clones, library rewrites, ML engineering, and optimization.

1d5548

Rishi Desai@rishi_desai2

Reward hacking is an arms race between coding agents and RL envs.

Across 1,300 rollouts, 14% showed reward-hacking behavior and 10% shipped clear exploit code.

Some tasks took 10+ “hardening” iterations: run agents, inspect traces, identify shortcuts, patch verifier, rerun.

1d4608
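The hardening cycle listed in that post (run agents, inspect traces, identify shortcuts, patch verifier, rerun) can be sketched as a loop. Everything here is a toy under stated assumptions: the rollout schema, the canned rollouts in `run_agents`, and the flag-based `find_shortcuts` are hypothetical stand-ins for real agent runs and manual trace inspection, not the authors' tooling.

```python
from typing import Callable, Dict, List, Tuple

Rollout = Dict[str, bool]            # hypothetical summary of what an agent shipped
Check = Callable[[Rollout], bool]    # one verifier rule


def run_agents() -> List[Rollout]:
    """Stand-in for launching agents; returns canned rollouts for illustration."""
    return [
        {"tests_pass": True, "hardcoded_outputs": False, "edited_tests": False},
        {"tests_pass": True, "hardcoded_outputs": True, "edited_tests": False},
        {"tests_pass": True, "hardcoded_outputs": False, "edited_tests": True},
        {"tests_pass": False, "hardcoded_outputs": False, "edited_tests": False},
    ]


def verifier_accepts(rollout: Rollout, checks: List[Check]) -> bool:
    """A rollout is rewarded only if every check passes."""
    return all(check(rollout) for check in checks)


def find_shortcuts(rollout: Rollout) -> List[str]:
    """Stand-in for trace inspection: which exploit flags are set."""
    return [k for k in ("hardcoded_outputs", "edited_tests") if rollout[k]]


def harden(checks: List[Check], max_iterations: int = 10) -> Tuple[List[Check], int]:
    """Patch one discovered shortcut per iteration until no rewarded rollout uses one."""
    checks = list(checks)
    patches = 0
    for _ in range(max_iterations):
        rollouts = run_agents()                              # run agents
        exploited = sorted({                                 # inspect rewarded traces
            s for r in rollouts if verifier_accepts(r, checks)
            for s in find_shortcuts(r)
        })
        if not exploited:
            break                                            # nothing left to patch
        shortcut = exploited[0]                              # identify a shortcut
        checks.append(lambda r, key=shortcut: not r[key])    # patch the verifier
        patches += 1                                         # then rerun
    return checks, patches
```

In this toy, starting from a verifier that only checks `tests_pass`, the loop needs two patches before the only rewarded rollout is the honest one. Real hardening is harder because shortcuts are not labeled in advance and have to be discovered by reading trajectories.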

Tim Kostolansky@thkostolansky

@rishi_desai2 do you ask them to use subagents, e.g. through claude dynamic workflows? guessing this would change performance quite a bit

19h2691

Rishi Desai@rishi_desai2

Coding agents now run for hundreds of millions of tokens on a single task.

In Rewrite-Next.js, the agent must reimplement Cloudflare’s Next.js-on-Vite rewrite in one go.

One Claude Code rollout used 344M tokens over 4.4 hours. The longest reached 877M tokens!

1d3947
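For a sense of scale, the figures in that post can be turned into an average rate and a share of the 1 billion token budget mentioned in the announcement. This is back-of-the-envelope arithmetic on the quoted numbers only. The function names are illustrative, and the rate is a wall-clock average over the whole rollout, not a measured generation speed.

```python
def tokens_per_second(tokens: int, hours: float) -> float:
    """Average token rate over the wall-clock duration of a rollout."""
    return tokens / (hours * 3600)


def budget_fraction(tokens: int, budget: int = 1_000_000_000) -> float:
    """Share of the token budget consumed (1B budget taken from the announcement)."""
    return tokens / budget


rate = tokens_per_second(344_000_000, 4.4)
print(f"{rate:,.0f} tokens/s on average")                      # 21,717 tokens/s on average
print(f"{budget_fraction(344_000_000):.1%} of a 1B budget")    # 34.4% of a 1B budget
print(f"{budget_fraction(877_000_000):.1%} of a 1B budget")    # 87.7% of a 1B budget
```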

Tyler@tyhouch

@rishi_desai2 do you plan to release the full tasks on HarborHub so I can try to reproduce 😀 super interested in the implementation of the CUA verifier

19h2812

Nat McAleese@__nmca__

“hey claude, can I see your homework?”

Jesse Mu@jayelmnop

hmmm

https://swe-marathon.vercel.app/#trajectory/rust-c-compiler-257

19h1.2K40

Rishi Desai@rishi_desai2

@tyhouch Yes that is the plan!

19h2353

Tyler@tyhouch

@rishi_desai2 This agentic verifier is so simple im pissed I didn’t think of it 😃

19h221

danialhasan@dhasandev

@rishi_desai2 amazing! i dm'd you for the s3 credentials; would love to analyze the trajectory logs

17h1213

shriya@shriyalola

@rishi_desai2 this is great stuff

18h1183

Vansh Singh@vanshcsingh

@rishi_desai2 Great work! If you could add the reasoning levels used in your experiments, that would help make your graphics more interpretable.

21h32

The Crypto Wiz@TheKryptoWiz

@rishi_desai2 Long-horizon benchmarks are the right pressure test. The hard part isn't writing 1B tokens, it's not forgetting the one weird constraint from hour two.

19h1392