Can coding agents stay coherent over a 1 billion token budget?
Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust?
Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.
Tasks include building a Slack clone from scratch.
Users congratulated Rishi Desai on the SWE-Marathon benchmark because the work advancing long-horizon AI agent testing was viewed as impressive and innovative.
we are getting spoiled with long-horizon coding benchmarks recently
here's another one: SWE-Marathon
Website: https://www.swe-marathon.org/ Paper: https://www.swe-marathon.org/swe-marathon-paper.pdf
Reward Hacking on SWE-Marathon
very interesting to see GPT-5.5 on top
Interesting. This validates my feeling that on a completely OOD SWE benchmark all open models flop. DeepSeek is way ahead of the pack and still merely on par with Gemini 3.1 (lol). Well at least it's not a psychotic reward hacker. …We'll have to deal with that part somehow.
not surprised to see Gemini near the top
but GPT-5.5 is weird
@jayelmnop they are aware
hmmm
https://swe-marathon.vercel.app/#trajectory/rust-c-compiler-257

We’ve released the paper, code, and the part of evals that usually stays hidden: the trajectories.
That includes 320 GB of agent trajectories and all 1,300 rollout logs for inspection (!)
Check it out: https://www.swe-marathon.org/

Building clones for CUA is tapping out, but e2e full-stack products are not. Having a rock-solid spec (e.g. "your Excel clone must support 1000 concurrent users") helps prevent gaps. It feels like a game of whac-a-mole.
Well-calibrated CUA verifiers are a big piece for the UX part.

There are no full-stack tasks in existing long-horizon SWE benchmarks because verification is hard.
SWE-Marathon has 4. In Clone-Slack, the agent builds a Slack-like team chat app, verified by a Computer Use Agent that logs in, creates channels, posts messages, reacts, and checks that the app actually works through the UI.
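
To make the idea of a UI-driven verifier concrete, here is a minimal sketch of how per-step results could be aggregated into a score. It is not the benchmark's verifier: `StepResult`, `run_ui_checks`, `score`, and the stubbed checks are hypothetical names for illustration, and a real Computer Use Agent would drive a browser rather than call Python stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StepResult:
    name: str
    passed: bool
    detail: str = ""


def run_ui_checks(checks: Dict[str, Callable[[], bool]]) -> List[StepResult]:
    """Run each UI check in order; a crash counts as a failed step."""
    results = []
    for name, check in checks.items():
        try:
            ok, detail = bool(check()), ""
        except Exception as exc:  # a broken UI flow should fail, not abort the run
            ok, detail = False, repr(exc)
        results.append(StepResult(name, ok, detail))
    return results


def score(results: List[StepResult]) -> float:
    """Fraction of steps that passed (0.0 when there are no steps)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def broken_reaction() -> bool:
    # Hypothetical failure: the agent under test never built reactions.
    raise RuntimeError("reaction button not found")


# Stubbed checks mirroring the steps named in the post (assumed, not real).
demo_checks = {
    "log in": lambda: True,
    "create channel": lambda: True,
    "post message": lambda: True,
    "react to message": broken_reaction,
}

demo_results = run_ui_checks(demo_checks)
demo_score = score(demo_results)
print(demo_score)
```

Scoring per step rather than pass/fail on the whole session is a design assumption here; it gives partial credit when, say, messaging works but reactions do not.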

SWE-Marathon turns real frontier research projects into reproducible evals: Anthropic’s C compiler, OpenAI’s Parameter Golf, Cloudflare’s Next.js rewrite, Cursor’s long-running agent work.
20 tasks across full-stack product clones, library rewrites, ML engineering, and optimization.

Reward hacking is an arms race between coding agents and RL envs.
Across 1,300 rollouts, 14% showed reward-hacking behavior and 10% shipped clear exploit code.
Some tasks took 10+ “hardening” iterations: run agents, inspect traces, identify shortcuts, patch verifier, rerun.
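
The hardening loop described above (run agents, inspect traces, identify shortcuts, patch verifier, rerun) can be sketched as a toy simulation. Everything here is an assumption for illustration: `Verifier`, `make_toy_agents`, `harden`, and the shortcut names are hypothetical, and real trace inspection is manual or agent-assisted rather than a set lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set


@dataclass
class Verifier:
    """Toy verifier that rejects only the shortcuts it has been patched against."""
    blocked: Set[str] = field(default_factory=set)

    def accepts(self, shortcut: str) -> bool:
        return shortcut not in self.blocked


def make_toy_agents(pool: List[str], verifier: Verifier) -> Callable[[], List[Optional[str]]]:
    """Return a rollout function; each batch tries the first unblocked shortcut."""
    def run() -> List[Optional[str]]:
        for shortcut in pool:
            if verifier.accepts(shortcut):
                return [shortcut, None, None]  # one exploit among honest rollouts
        return [None, None, None]
    return run


def harden(verifier: Verifier, run_agents: Callable[[], List[Optional[str]]],
           max_iters: int = 10) -> int:
    """Repeat run -> inspect -> patch until a batch shows no exploit.

    Returns the number of patch rounds that were needed.
    """
    for patches in range(max_iters):
        traces = run_agents()                               # run agents
        exploits = {s for s in traces if s is not None}     # inspect traces
        if not exploits:
            return patches                                  # clean batch: stop
        verifier.blocked |= exploits                        # patch verifier, rerun
    return max_iters


# Hypothetical shortcut names, not taken from the paper.
pool = ["hardcode expected output", "copy reference binary", "edit test harness"]
verifier = Verifier()
patch_rounds = harden(verifier, make_toy_agents(pool, verifier))
print(patch_rounds, sorted(verifier.blocked))
```

In this toy setup each round surfaces one new shortcut, so three shortcuts need three patch rounds before a clean batch. The `max_iters` cap is there because the loop has no guarantee of converging on its own.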

@rishi_desai2 do you ask them to use subagents eg through claude dynamic workflows? guessing this would change performance quite a bit

Coding agents now run for hundreds of millions of tokens on a single task.
In Rewrite-Next.js, the agent must reimplement Cloudflare’s Next.js-on-Vite rewrite in one go.
One Claude Code rollout used 344M tokens over 4.4 hours. The longest reached 877M tokens!
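
For a sense of scale, the figures above can be turned into an average rate and a share of the 1 billion token budget. This is back-of-the-envelope arithmetic on the numbers quoted in the thread; it assumes the token count covers all tokens in the rollout, and the helper name `tokens_per_second` is just for illustration.

```python
def tokens_per_second(total_tokens: float, hours: float) -> float:
    """Average token rate over a rollout of the given duration."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_tokens / (hours * 3600)


# 344M tokens over 4.4 hours (figures quoted in the thread).
rate = tokens_per_second(344e6, 4.4)
print(f"average rate: {rate:,.0f} tokens/s")  # roughly 21.7k tokens/s

# The longest rollout (877M tokens) as a share of a 1B token budget.
budget_fraction = 877e6 / 1e9
print(f"share of 1B budget: {budget_fraction:.1%}")
```

The rate is an average over the whole rollout, so it says nothing about bursts or idle time within the run.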

@rishi_desai2 do you plan to release the full tasks on HarborHub so I can try to reproduce 😀 super interested in the implementation of the CUA verifier
“hey claude, can I see your homework?”

@tyhouch Yes that is the plan!

@rishi_desai2 This agentic verifier is so simple I'm pissed I didn’t think of it 😃

@rishi_desai2 amazing! i dm'd you for the s3 credentials; would love to analyze the trajectory logs

@rishi_desai2 this is great stuff

@rishi_desai2 Great work! If you could add the reasoning levels used in your experiments, that would help make your graphics more interpretable.

@rishi_desai2 Long-horizon benchmarks are the right pressure test. The hard part isn't writing 1B tokens, it's not forgetting the one weird constraint from hour two.