Cognition launches FrontierCode, a benchmark that evaluates AI coding agents on real-world maintainability and code quality

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

swyx@swyx

It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.

Cog had IOI Gold medalists and top code maintainers Look At The Data: FrontierCode includes 3000+ rubrics covering code quality and anti-cheat checks for the reward hacking plaguing other benchmarks.

FC Diamond is so hard that Opus 4.8 scores 13.8%.

Three eras of AI coding : Three eras of benchmarks

2021 • Autocomplete: HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode

To me the most beautiful chart came when I requested a special historical run on all extant old models: the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025. Opus almost doubled from a 41% pass rate to 74% in 4 months.

This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.

My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....

The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.
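
The "95% success in 2 rerolls vs 6" comparison above can be sanity-checked with a few lines of Python. This is a minimal sketch that assumes every attempt is independent and succeeds with the quoted pass rate (41% or 74%); the function name `attempts_needed` is made up for illustration, and the tweet does not say exactly how it counts rerolls.

```python
import math


def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - pass_rate)**n >= target.

    Assumption: attempts are independent and share the same pass rate.
    """
    if not 0.0 < pass_rate < 1.0:
        raise ValueError("pass_rate must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Solve (1 - pass_rate)**n <= 1 - target for n, then round up.
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - pass_rate))


for rate in (0.41, 0.74):
    print(f"pass rate {rate:.0%}: {attempts_needed(rate)} attempts for 95% success")
```

Under these assumptions a 41% pass rate needs 6 attempts and a 74% pass rate needs 3 (two attempts reach roughly 93%), which is in the same ballpark as the "2 vs 6" quoted above.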

Lisan al Gaib@scaling01

Opus 4.8 is the best coding model out there

FrontierCode by Cognition is probably the highest quality coding benchmark we have seen so far

it moves beyond just using unit-testing for scoring, it also tests for regression safety, mechanical cleanliness, test correctness, scope and code quality

20+ open-source developers handcrafted 150 tasks, each of which took over 40 hours to construct

it also tests a more diverse set of programming languages

Theo - t3.gg@theo

Fascinating bench. Really like the idea of focusing on mergeability.

Confused how Opus 4.8 is 2.5x better than Opus 4.7 though 🙃

Scott Wu@ScottWu46

SWE-Bench style grading has been the standard for years now - you ask the agent to solve an issue and then run its code on a pre-constructed unit test.

The problem is that passing a unit test is only one part of writing production-ready code. You also want to evaluate agents on a number of other axes, including scope, coding style, and unintended side effects.

The result is our new benchmark FrontierCode - which has ~80% fewer false positives and for which the best model (Opus 4.8) only scores 13%!

"Where others grade like a CI, FrontierCode grades like a tech lead."

Florian Brand@xeophon

congrats to Kimi on benchmaxxing this future benchmark to be on Sonnet 4.6 level (released 2 months prior to K2.6)

Taelin@VictorTaelin

This benchmark addresses my problem with 5.5: it passes the tests but writes shitty code. We don't need a model's output to work today, we need it not to break tomorrow...

elie@eliebakouch

new frontier eval from the cognition team. interesting that simple test time scaling is pretty noisy here instead of a clean line

lots of care in crafting a good scoring process

https://cognition.ai/blog/frontier-code

vik@vikhyatk

BREAKING: Cognition is forcing IOI gold medalists and top code maintainers to label code data

Seems concerning

Karina@karinanguyen

A lot of people are already deploying AI into production codebases, but until now we didn’t really have a good eval for whether it writes code that is actually high-quality and maintainable. Pretty cool grading here:

Cognition@cognition

You can find full model results and technical implementation details on our blog:

https://cognition.ai/blog/frontier-code

Brendan (can/do)@BrendanFoody

Great work to @cognition and @silasalberti on this new benchmark.

Sloppy code is one of the largest issues with coding models today.

Training purely on unit tests and programmatic verifiers doesn't work.

Thank you to all of the @mercor_ai experts who helped build this benchmark as well!

Cognition@cognition

FrontierCode has three task sets: Extended (150 tasks), Main (100 tasks) and Diamond (50 tasks). SOTA LLMs have significant room for improvement, with the top model earning a score of just 13.4/100 on our Diamond task set.

Cognition@cognition

20+ world-class open-source developers built realistic coding tasks on repos they maintain. They define what “mergeable” means in their repo.

What does it take to measure mergeability? We use a mix of unit tests, rubrics and novel verifiers to assess correctness, test quality, scope discipline, style, and adherence to codebase standards.
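
One way to picture the "mix of unit tests and rubrics" idea is a gate that only calls a patch mergeable when the tests pass and every rubric axis passes too. This is a hypothetical sketch, not Cognition's grader: the `Submission` type, the `is_mergeable` function, and the all-axes-must-pass rule are assumptions for illustration, and only the axis names come from the tweet above.

```python
from dataclasses import dataclass, field

# Axis names taken from the tweet; how they are actually scored is not stated there.
AXES = (
    "correctness",
    "test quality",
    "scope discipline",
    "style",
    "codebase standards",
)


@dataclass
class Submission:
    tests_passed: bool
    # Hypothetical per-axis rubric verdicts: axis name -> passed?
    rubric_results: dict[str, bool] = field(default_factory=dict)


def is_mergeable(sub: Submission, axes: tuple[str, ...] = AXES) -> bool:
    """Assumed rule: unit tests must pass AND every rubric axis must pass."""
    if not sub.tests_passed:
        return False
    # A missing axis counts as a failure rather than a silent pass.
    return all(sub.rubric_results.get(axis, False) for axis in axes)


# A patch that passes the tests but touches unrelated files.
works_but_sloppy = Submission(
    tests_passed=True,
    rubric_results={axis: True for axis in AXES} | {"scope discipline": False},
)
print(is_mergeable(works_but_sloppy))
```

A tests-only grader would accept `works_but_sloppy`; the gate above rejects it because one rubric axis fails, which is the kind of false positive the thread is talking about.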

Gary Marcus@GaryMarcus

Oh my God! @METR_Evals’s coding benchmarks are saturated! 🤯 Mythos broke the METR graph 🤯

4 weeks later, out comes a new coding task, this time from @cognition: “FrontierCode Diamond remains unsaturated: the best performing model, Claude Opus 4.8, achieves a score of only 13.4%.”

There is still a lot of headroom.

*Note that METR itself never panicked. It’s the Twitterverse that has egg on its face.

Theo - t3.gg@theo

This chart scares me a bit. My guess is that reproducibility is low. Would love for them to share more data on their runs.

Opus low scored higher than medium
Opus xhigh scored higher than max
gpt-5.5-medium was the highest scoring OpenAI model

Walden@walden_yan

Progress in coding agents has largely been driven by progress in evals. I still remember when Devin was the first to reach 13% on SWE-Bench in 2024, and with just two short years of RL, SWE-Bench scores are 75%+.

It's uncanny that 13% is also what the best model gets in our newest benchmark. Why do models do so poorly on this benchmark? Because it measures actual merge-ability of code, not just whether it passes unit tests. This was a collaborative effort between our own research team and expert open-source maintainers to curate evals that take over 40 hours of human work per task. The rubrics for these tasks were fine-tuned over multiple stages of QA and review. Extremely proud of the team and excited for the coming agents that will saturate even this benchmark.

Jeff Wang@jeffwsurf

A massive achievement: introducing FrontierCode, finally an eval that can determine how well models write code that would actually be merged in the real world. It has 81% fewer false positives than SWE-Bench Pro, and is a peek at how we are determining whether models are good or bad at certain types of tasks.

We will also have more evals coming that will help us improve Devin rapidly and be better at using models optimized for different types of work

Charles 🎉 Frye@charles_irl

look at the data

Cognition@cognition

Rigorous quality control is important, so we built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review. Every task is manually reviewed by a Cognition researcher.

This reduces both the false positive and false negative rates for proposed solutions. As a result, FrontierCode produces 81% fewer misclassification errors than SWE-Bench Pro.
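
To make the "81% fewer misclassification errors" framing concrete, here is a small sketch of how such a relative reduction could be computed from false positive and false negative counts. The counts below are made-up toy numbers chosen only so the arithmetic lands on 81%; they are not FrontierCode or SWE-Bench Pro data, and the function names are hypothetical.

```python
def misclassification_rate(false_positives: int, false_negatives: int, total: int) -> float:
    """Share of graded solutions the benchmark labels wrongly (FP + FN over all)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (false_positives + false_negatives) / total


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate is lower than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return 1.0 - new_rate / baseline_rate


# Toy counts for illustration only (NOT real benchmark data).
baseline = misclassification_rate(false_positives=60, false_negatives=40, total=500)
improved = misclassification_rate(false_positives=12, false_negatives=7, total=500)

print(f"{relative_reduction(baseline, improved):.0%} fewer misclassification errors")
```

Counting both error types matters here: a grader can drive false positives down simply by rejecting more solutions, but that pushes false negatives up and leaves the combined rate no better.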
