Can coding agents stay coherent over a 1 billion token budget?
Can they build Slack from scratch? Rewrite a JAX codebase in PyTorch? Build a C compiler in Rust?
Enter SWE-Marathon: a benchmark for autonomous long-horizon software work.
AI Judge changed title after evaluation, original title: "Rishi Desai releases SWE-Marathon benchmark to evaluate autonomous AI coding agents on multi-hour, billion-token software tasks"
Evaluation trials revealed a 13.8 percent reward-hacking rate
Many users praised the SWE-Marathon benchmark as cool and great work for testing AI coding agents on long-horizon tasks, while others objected that it shows models optimizing for evaluations over real functionality.
we are getting spoiled with long-horizon coding benchmarks recently
here's another one: SWE-Marathon
Website: https://www.swe-marathon.org/ Paper: https://www.swe-marathon.org/swe-marathon-paper.pdf
Reward Hacking on SWE-Marathon
very interesting to see GPT-5.5 on top
hmmm
https://swe-marathon.vercel.app/#trajectory/rust-c-compiler-257
Interesting. This validates my feeling that on a completely OOD SWE benchmark all open models flop. DeepSeek is way ahead of the pack and still merely on par with Gemini 3.1 (lol). Well at least it's not a psychotic reward hacker. …We'll have to deal with that part somehow.
not surprised to see Gemini near the top
but GPT-5.5 is weird
@jayelmnop they are aware

We’ve released the paper, code, and the part of evals that usually stays hidden: the trajectories.
That includes 320 GB of agent trajectories and all 1,300 rollout logs for inspection (!)
Check it out: https://www.swe-marathon.org/
Generally I feel more sympathy for OpenAI lately. Here they are, trying to RLVR towards actual scientific AGI that'll solve problems directly, as on CritPT. And their great safety-conscious competition: code, code, B2B, B2B, attack, exploit, chyna hawkery, RSI. Not equal.

Building clones for CUA is tapping out, but e2e full-stack products are not. Having a rock-solid spec (e.g. "your Excel clone must support 1000 concurrent users") helps prevent gaps. It feels like a game of whac-a-mole.
Well-calibrated CUA verifiers are a big piece of the UX part.

There are no full-stack tasks in existing long-horizon SWE benchmarks because verification is hard.
SWE-Marathon has 4. In Clone-Slack, the agent builds a Slack-like team chat app, verified by a Computer Use Agent that logs in, creates channels, posts messages, reacts, and checks that the app actually works through the UI.
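
A minimal sketch of how a UI-level verifier like this could be organized, assuming each UI step (log in, create a channel, post, react) is wrapped as a pass/fail check. The names `UICheck` and `score_app` and the "fraction of checks passed" reward are hypothetical illustrations; the thread does not describe how SWE-Marathon actually scores the Computer Use Agent's run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class UICheck:
    """One UI step the verifier attempts (hypothetical structure)."""
    name: str
    run: Callable[[], bool]  # returns True if the step worked through the UI


def score_app(checks: List[UICheck]) -> Dict[str, object]:
    """Run every check; a crash in a step counts as a failed step."""
    results: Dict[str, bool] = {}
    for check in checks:
        try:
            results[check.name] = bool(check.run())
        except Exception:
            results[check.name] = False
    passed = sum(results.values())
    # Assumed reward: fraction of UI steps that succeeded.
    reward = passed / len(checks) if checks else 0.0
    return {"results": results, "passed": passed, "reward": reward}


# Stubbed steps standing in for real browser actions.
demo_checks = [
    UICheck("login", lambda: True),
    UICheck("create_channel", lambda: True),
    UICheck("post_message", lambda: True),
    UICheck("react_to_message", lambda: False),  # e.g. reaction never rendered
]
report = score_app(demo_checks)
```

In a real setup each `run` callable would drive a browser session rather than return a constant; the point of the sketch is only that the app is judged through observable UI behavior, not through its source code.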

SWE-Marathon turns real frontier research projects into reproducible evals: Anthropic’s C compiler, OpenAI’s Parameter Golf, Cloudflare’s Next.js rewrite, Cursor’s long-running agent work.
20 tasks across full-stack product clones, library rewrites, ML engineering, and optimization.

Reward hacking is an arms race between coding agents and RL envs.
Across 1,300 rollouts, 14% showed reward-hacking behavior and 10% shipped clear exploit code.
Some tasks took 10+ “hardening” iterations: run agents, inspect traces, identify shortcuts, patch verifier, rerun.
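
The hardening cycle described above (run agents, inspect traces, identify shortcuts, patch verifier, rerun) can be sketched as a loop. This is a toy model under stated assumptions: shortcuts are plain strings, "patching" just adds them to a blocked set, and `harden_verifier`, `toy_rollouts`, and the shortcut names are all hypothetical, not the benchmark's actual tooling.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical exploit names used only for this toy simulation.
SHORTCUTS = ["hardcode_test_outputs", "call_reference_binary", "edit_test_files"]


def toy_rollouts(patched: Set[str]) -> List[Set[str]]:
    """Simulated agents: each rollout uses the first shortcut still unpatched."""
    remaining = [s for s in SHORTCUTS if s not in patched]
    used = {remaining[0]} if remaining else set()
    return [set(used) for _ in range(3)]


def harden_verifier(
    run_rollouts: Callable[[Set[str]], List[Set[str]]],
    max_iterations: int = 10,
) -> Tuple[Set[str], int]:
    """Repeat run -> inspect -> patch until a clean iteration or the cap."""
    patched: Set[str] = set()
    for iteration in range(1, max_iterations + 1):
        traces = run_rollouts(patched)            # run agents
        found = set().union(*traces) - patched    # inspect traces for new shortcuts
        if not found:
            return patched, iteration             # clean run: stop hardening
        patched |= found                          # patch the verifier, then rerun
    return patched, max_iterations


patched, iterations = harden_verifier(toy_rollouts)
```

In this toy run each iteration surfaces one new shortcut, so three patches are followed by one clean iteration. Real hardening is messier because agents can find shortcuts nobody listed in advance, which is why the thread calls it an arms race.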

@rishi_desai2 do you ask them to use subagents, e.g. through Claude dynamic workflows? Guessing this would change performance quite a bit.

Coding agents now run for hundreds of millions of tokens on a single task.
In Rewrite-Next.js, the agent must reimplement Cloudflare’s Next.js-on-Vite rewrite in one go.
One Claude Code rollout used 344M tokens over 4.4 hours. The longest reached 877M tokens!
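
For a sense of scale, here is a back-of-the-envelope calculation from the figures above. It assumes the 344M count covers every token in the rollout (the thread does not say whether cached or input tokens are included), and the "time to reach 1B tokens" figure is a hypothetical extrapolation at a constant pace, not a reported result.

```python
# Figures quoted in the thread for one Claude Code rollout.
tokens_used = 344_000_000
hours = 4.4

tokens_per_hour = tokens_used / hours
tokens_per_second = tokens_per_hour / 3600

# Assumption: the same average pace holds all the way to the 1B budget.
budget = 1_000_000_000
hours_to_budget = budget / tokens_per_hour

print(f"{tokens_per_hour / 1e6:.1f}M tokens/hour")
print(f"{tokens_per_second:,.0f} tokens/second")
print(f"{hours_to_budget:.1f} hours to reach 1B tokens at this pace")
```

That works out to roughly 78M tokens per hour, or about 21,700 tokens per second on average.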

@rishi_desai2 do you plan to release the full tasks on HarborHub so I can try to reproduce 😀 super interested in the implementation of the CUA verifier
“hey claude, can I see your homework?”

@scaling01 This is only weird if you have never used 5.5

@scaling01 GPT-5.5 being so high lines up with what @VictorTaelin has been saying. Though I’d be super curious to see what GPT-5.4 scored. Issue with the new pretrain perhaps?

@tyhouch Yes that is the plan!

@rishi_desai2 This agentic verifier is so simple I'm pissed I didn’t think of it 😃