Maintainers spent over 40 hours developing each evaluation task
Positive users welcomed Cognition's FrontierCode benchmark for measuring maintainable AI-generated code that real maintainers would accept, while negative users flagged LLM judges as a red flag or dismissed the project as misguided.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represents over 1000+ hours of maintainer validated software engineering work most frontier models cannot yet solve, much less solve with high quality.
Cog had IOI Gold medalists and top code maintainers Look At The Data — FrontierCode includes 3000+ rubrics covering code quality and anticheat reward hacking plaguing other benchmarks.
FC Diamond is so hard that Opus 4.8 scores 13.8%.
Three eras of AI coding : Three eras of benchmarks
2021 • Autocomplete : HumanEval 2023 • Passing Tests: SWEBench, TerminalBench 2026 • Maintainable Code: FrontierCode
to me the most beautiful chart when I requested a special historical run into all extant old models, the data was finding that the easiest third of FC tasks (in FC Extended) were rapidlly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months.
This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.
My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....
The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
Opus 4.8 is the best coding model out there
FrontierCode by Cognition is probably the highest quality coding benchmark we have seen so far
it moves beyond just using unit-testing for scoring, it also tests for regression safety, mechanical cleanliness, test correctness, scope and code quality
20+ open-source developers handcrafted 150 tasks, each of which took over 40 hours to construct
it also tests a more diverse set of programming languages
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
Fascinating bench. Really like the idea of focusing on mergeability.
Confused how Opus 4.8 is 2.5x better than Opus 4.7 though 🙃
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue and then run its code on a pre-constructed unit test.
The problem is that passing a unit test is only one part of writing production-ready code. You also want to evaluate agents on a number of other axes, including scope, coding style, and unintended side effects.
The result is our new benchmark FrontierCode - which has ~80% fewer false positives and for which the best model (Opus 4.8) only scores 13%!
"Where others grade like a CI, FrontierCode grades like a tech lead."
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
congrats to Kimi on benchmaxxing this future benchmark to be on Sonnet 4.6 level (released 2 months prior to K2.6)
This benchmark addresses my problem with 5.5: it passes the tests but writes shitty code. We don't need a model's output to work today, we need it not to break tomorrow...
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
new frontier eval from the cognition team. interesting that simple test time scaling is pretty noisy here instead of a clean line
lots of care in crafting a good scoring process
https://cognition.ai/blog/frontier-code
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
BREAKING: Cognition is forcing IOI gold medalists and top code maintainers to label code data
Seems concerning
It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represents over 1000+ hours of maintainer validated software engineering work most frontier models cannot yet solve, much less solve with high quality.
Cog had IOI Gold medalists and top code maintainers Look At The Data — FrontierCode includes 3000+ rubrics covering code quality and anticheat reward hacking plaguing other benchmarks.
FC Diamond is so hard that Opus 4.8 scores 13.8%.
Three eras of AI coding : Three eras of benchmarks
2021 • Autocomplete : HumanEval 2023 • Passing Tests: SWEBench, TerminalBench 2026 • Maintainable Code: FrontierCode
to me the most beautiful chart when I requested a special historical run into all extant old models, the data was finding that the easiest third of FC tasks (in FC Extended) were rapidlly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months.
This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.
My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....
The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.
A lot of people are already deploying AI into production codebases, but until now we didn’t really have a good eval for whether it writes code that is actually high-quality and maintainable. Pretty cool grading here:
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

You can find full model results and technical implementation details on our blog:
https://cognition.ai/blog/frontier-code
Great work to @cognition and @silasalberti on this new benchmark.
Sloppy code is one of the largest issues with coding models today.
Training purely on unit tests and programmatic verifiers doesn't work.
Thank you to all of the @mercor_ai experts who helped build this benchmark as well!
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

FrontierCode has three task sets: Extended (150 tasks), Main (100 tasks) and Diamond (50 tasks). SOTA LLMs have significant room for improvement, with the top model earning a score of just 13.4/100 on our Diamond task set.

20+ world-class open-source developers built realistic coding tasks on repos they maintain. They define what “mergeable” means in their repo.
What does it take to measure mergeability? We use a mix of unit tests, rubrics and novel verifiers to assess correctness, test quality, scope discipline, style, and adherence to codebase standards.
Oh my God! @METR_Evals’s coding benchmarks are saturated! 🤯 Mythos broke the METR graph 🤯
4 weeks later, out comes a new coding task, this time from @cognition: “FrontierCode Diamond remains unsaturated: the best performing model, Claude Opus 4.8, achieves a score of only 13.4%.
There is still a lots of headroom.
*Note that METR itself never panicked. It’s the Twitterverse that has egg on its face.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
This chart scares me a bit. My guess is that reproducibility is low. Would love for them to share more data on their runs.
Opus low scored higher than medium Opus xhigh scored higher than max gpt-5.5-medium was the highest scoring OpenAI model
Fascinating bench. Really like the idea of focusing on mergeability.
Confused how Opus 4.8 is 2.5x better than Opus 4.7 though 🙃
Progress in coding agents has largely been driven by progress in evals. I still remember when Devin was the first to reach 13% on SWE-Bench in 2024, and with just two short years of RL, SWE-Bench scores are 75%+.
Its uncanny that 13% is also what the best model gets in our newest benchmark. Why do models do so poorly on this benchmark? Because it measures actual merge-ability of code, not just whether it passes unit tests. This was a collaborative effort between our own research team and expert open-source maintainers to curate evals that take over 40 hours of human work per task. The rubrics for these tasks were fine-tuned over multiple stages of QA and review. Extremely proud of the team and excited for the coming agents that will saturate even this benchmark.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
A massive achievement, introducing FrontierCode, finally an eval that can determine how good models can write code that will lead to actually being merged in the real world. This has 81% less false positives as SWE-Bench Pro, and is a peek at how we are determining whether models are good or bad at certain types of tasks
We will also have more evals coming that will help us improve Devin rapidly and be better at using models optimized for different types of work
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
look at the data
It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represents over 1000+ hours of maintainer validated software engineering work most frontier models cannot yet solve, much less solve with high quality.
Cog had IOI Gold medalists and top code maintainers Look At The Data — FrontierCode includes 3000+ rubrics covering code quality and anticheat reward hacking plaguing other benchmarks.
FC Diamond is so hard that Opus 4.8 scores 13.8%.
Three eras of AI coding : Three eras of benchmarks
2021 • Autocomplete : HumanEval 2023 • Passing Tests: SWEBench, TerminalBench 2026 • Maintainable Code: FrontierCode
to me the most beautiful chart when I requested a special historical run into all extant old models, the data was finding that the easiest third of FC tasks (in FC Extended) were rapidlly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months.
This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.
My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....
The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.

Rigorous quality control is important, so we built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review. Every task is manually reviewed by a Cognition researcher.
This reduces both the false positive and false negative rates for proposed solutions. As a result, FrontierCode produces 81% fewer misclassification errors than SWE-Bench Pro.
Maintainers spent over 40 hours developing each evaluation task