@theo me too, expected 7x or so
Fascinating bench. Really like the idea of focusing on mergeability.
Confused how Opus 4.8 is 2.5x better than Opus 4.7 though 🙃
Open-source maintainers spent over 40 hours building each task.
Positive users praise FrontierCode for using real merge-worthy tasks from OSS maintainers instead of test-passing alone, while negative users dismiss it as biased or irrelevant to most software.

FrontierCode has three task sets: Extended (150 tasks), Main (100 tasks) and Diamond (50 tasks). SOTA LLMs have significant room for improvement, with the top model earning a score of just 13.4/100 on our Diamond task set.

You can find full model results and technical implementation details on our blog:
https://cognition.ai/blog/frontier-code

20+ world-class open-source developers built realistic coding tasks on repos they maintain. They define what “mergeable” means in their repo.
What does it take to measure mergeability? We use a mix of unit tests, rubrics and novel verifiers to assess correctness, test quality, scope discipline, style, and adherence to codebase standards.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.
Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

@cognition That's a massive lead for Opus 4.8, which does not seem to match the vibes on this site.

Rigorous quality control is important, so we built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review. Every task is manually reviewed by a Cognition researcher.
This reduces both the false positive and false negative rates for proposed solutions. As a result, FrontierCode produces 81% fewer misclassification errors than SWE-Bench Pro.

FrontierCode was built in close partnership with the expert maintainers of 36 flagship open-source repositories, like @smilingnosrati, CEO & Tech Lead @CeleryOrg (29k stars), and Martin McKeaveney, CTO of @Budibase (28k stars).
Maintainers invested more than 40 hours per task, undergoing multiple rounds of iteration to ensure that any PR that satisfies these standards would actually be merged.

Each task in the dataset pairs a concise problem statement with a large solution that cuts across multiple files.
FrontierCode’s task set is more diverse than other software engineering evals, measuring ability across a wide range of languages and problem types.

@cognition I have a few questions after looking deeper at the numbers

@cognition Did I miss something? What harness was used here?

@cognition Glad to get this out!

@cognition What harness was used? Yours?

@cognition Which harness was used to run the evaluation? Devin or native like Claude Code/Codex? Did you see any effects there?

@cognition One step closer to figuring out what "good" code actually is. this is going to be the new benchmark standard

@cognition AI IQ picked this up already btw and now has it as one of its Software Engineering benchmarks: https://www.aiiq.org/charts/frontiercode-diamond-scores/

@cognition @grok how different is this from deepswe from datacurve?

@cognition @alessio_joseph design is A+, such a pleasure to read through this

@cognition Agents write code with only the task in mind. I usually end up correcting the agent's code to ensure that the entire system evolves in a manageable way.
This is the way forward.

@cognition Not trusting this benchmark for shit. I have used both 4.8 and 5.5. 5.5 is faster cheaper and far better in every single instance. Deepswe still the best benchmark. anything with a claude model on top is slop.
What everyone who works with these machines daily on jagged edge coding problems already knows.
Recursive self improvement?
Yeah right.
How about we just get to long term memory, continual learning, embedded backwards and forward pass thinking in latent space, and actual long term contextual understanding first, guys?
Maybe then we'll have machines that can actually solve problems by learning instead of memorizing lots of data from training.
After that we can go for some sci-fi BS to scare the normies into regulatory capture, mmm-kaayy?
It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.
Cog had IOI Gold medalists and top code maintainers Look At The Data: FrontierCode includes 3000+ rubrics covering code quality and anti-cheat checks for the reward hacking plaguing other benchmarks.
FC Diamond is so hard that Opus 4.8 scores 13.8%.
Three eras of AI coding : Three eras of benchmarks
2021 • Autocomplete: HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode
To me the most beautiful chart came from a special historical run I requested across all extant older models: the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025. Opus almost doubled from a 41% pass rate to 74% in 4 months.
This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.
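A rough way to sanity-check the reroll arithmetic above, as a minimal sketch: it assumes every attempt is independent and succeeds with the quoted pass rate, which real agent runs may not satisfy. The function names are hypothetical and not part of FrontierCode.

```python
def success_within(pass_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries passes."""
    return 1.0 - (1.0 - pass_rate) ** attempts


def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest number of independent tries whose combined success meets `target`."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    attempts = 0
    fail_all = 1.0  # probability that every try so far has failed
    while 1.0 - fail_all < target:
        fail_all *= 1.0 - pass_rate
        attempts += 1
    return attempts


if __name__ == "__main__":
    # 41% and 74% are the pass rates quoted in the post; 95% is its target.
    for rate in (0.41, 0.74):
        print(rate, attempts_needed(rate), round(success_within(rate, 2), 4))
```

Under this independence assumption a 41% pass rate needs 6 attempts to clear 95%, while a 74% pass rate reaches about 93% after 2 attempts and clears 95% on the third, so the "2 vs 6" figure reads as an approximation rather than an exact threshold.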
My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....
The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.