Tech · 2d ago

Cognition launches FrontierCode to evaluate AI code mergeability, finding Claude Opus 4.8 leads with a peak 13.4% score

AI Judge changed title after evaluation, original title: "FrontierCode benchmark launches to test code mergeability, finding over half of SWE-bench outputs are unmergeable"

Story Overview

Cognition's FrontierCode benchmark shifts focus from whether AI code merely runs to whether human maintainers would actually accept it into production repositories, exposing that over half the outputs passing earlier SWE-Bench tests fall short on style, scope, and regression safety.

Original post
Ben Golub #1490
Jeff Wang@jeffwsurf

A massive achievement, introducing FrontierCode, finally an eval that can determine how good models can write code that will lead to actually being merged in the real world. This has 81% less false positives as SWE-Bench Pro, and is a peek at how we are determining whether models are good or bad at certain types of tasks

We will also have more evals coming that will help us improve Devin rapidly and be better at using models optimized for different types of work

Cognition@cognition

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

12:57 PM · Jun 8, 2026 · 11.2K Views
Developer Impact

Real merge decisions expose hidden gaps in prior evals

Tasks built by veteran open-source maintainers, each taking more than 40 hours of work, reveal that passing automated tests often masks unmaintainable changes, with models still scoring below 15 percent on the hardest subset.

Model Watch

Early leader shows token-heavy approach still dominates

Anthropic's Claude Opus 4.8 tops the charts at 13.4 percent on Diamond tasks, while lighter models gain efficiency at the cost of lower scores; the picture for open-source entries and future scaling remains incomplete.

Sentiment

Supporters praise Cognition's FrontierCode benchmark for evaluating the real-world mergeability of AI code rather than just test passage, while critics call it biased or flawed.

Positive: 67.7%
Negative: 32.3%
134 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 139.8K · Bookmarks 136 · Likes 984 · Replies 63

Fascinating bench. Really like the idea of focusing on mergeability.

Confused how Opus 4.8 is 2.5x better than Opus 4.7 though 🙃

1d · Views 139.8K · Likes 984 · Bookmarks 136
Retweets 270
Cognition@cognition

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

2d · Views 2.5M · Likes 4.3K · Bookmarks 1.7K
Taelin@VictorTaelin

This benchmark addresses my problem with 5.5: it passes the tests but writes shitty code. We don't need a model's output to work today, we need it not to break tomorrow...

1d · Views 70.9K · Likes 711 · Bookmarks 92
Rohan Paul@rohanpaul_ai

Incredible! This is just the benchmark we needed.

Claude Opus 4.8, achieves a score of only 13.4%. Other models score even lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less.

Cognition is introducing FrontierCode, a coding benchmark built to test whether AI code is good enough for a real maintainer to merge, not just whether it passes tests.

FrontierCode asks a harder question: did the model produce a clean, limited, well-tested, readable patch that fits the project’s existing style and would survive serious code review?

They bring 3 nested subsets of FrontierCode at increasing difficulty: The benchmark contains 150 tasks, with Main as the hardest 100 and Diamond as the hardest 50.

More than 20 open-source maintainers helped design the tasks, and each task took over 40 hours to build, review, attack, and calibrate.

The biggest finding is that top models still struggle badly when the target is mergeable code instead of merely working code.

On Diamond, the best model, Claude Opus 4.8, scores only 13.4%, while GPT-5.5 scores 6.3%, Gemini 3.1 Pro scores 4.7%, and the best open-source model listed, Kimi K2.6, scores 3.8%.

Shows that today’s strongest coding agents can often patch behavior, but they still fail many human-review standards around design, restraint, test quality, and project conventions.

The mechanism is a grading system built around blockers and non-blockers.

A blocker is something that would stop a maintainer from merging the PR, such as broken behavior, missing required behavior, unsafe scope changes, bad performance, or tests that do not prove the fix.

A solution that fails any blocker gets 0, even if parts of the code look good.

A passing solution then gets a weighted score based on softer quality items such as readability, type safety, style, and fit with the existing codebase.
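The blocker and non-blocker scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not Cognition's implementation: the `RubricItem` fields, the weights, and the normalization to a 0 to 1 range are all assumptions, since the post does not give the exact formula.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    name: str
    passed: bool
    blocker: bool = False  # hypothetical flag: True means "would stop a merge"
    weight: float = 1.0    # hypothetical weight, used only for non-blockers


def score_patch(items: list[RubricItem]) -> float:
    """Return 0.0 if any blocker fails, else the weighted share of
    non-blocker items that passed (assumed normalization to [0, 1])."""
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    soft = [item for item in items if not item.blocker]
    total = sum(item.weight for item in soft)
    if total == 0:
        return 1.0  # no soft items to grade: treat as a full score
    return sum(item.weight for item in soft if item.passed) / total


# Illustrative rubric; item names and weights are made up.
example = [
    RubricItem("required behavior present", passed=True, blocker=True),
    RubricItem("tests prove the fix", passed=True, blocker=True),
    RubricItem("readability", passed=True, weight=2.0),
    RubricItem("type safety", passed=True, weight=1.0),
    RubricItem("matches project style", passed=False, weight=1.0),
]
print(score_patch(example))  # 0.75
```

Flipping either blocker to `passed=False` drops the score to 0.0 regardless of how the softer items fare, which is the all-or-nothing behavior the post describes.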

FrontierCode also adds checks beyond normal unit tests.

Reverse-classical testing runs the model’s own tests against the original broken code, and those tests must fail, which proves the model wrote tests that actually catch the bug.
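As a rough sketch of the reverse-classical idea, the check below runs the same tests against the original (buggy) code and the patched code. The function names and the toy `clamp` bug are hypothetical, and requiring the tests to also pass on the patched code is an assumption added here for completeness.

```python
def run_tests(tests, impl) -> bool:
    """Return True only if every test passes against the given implementation."""
    for test in tests:
        try:
            test(impl)
        except AssertionError:
            return False
    return True


def reverse_classical_check(tests, original_impl, patched_impl) -> bool:
    """Tests must fail on the original code and (assumed) pass on the patch."""
    fails_on_original = not run_tests(tests, original_impl)
    passes_on_patch = run_tests(tests, patched_impl)
    return fails_on_original and passes_on_patch


# Toy bug: the original clamp ignores the upper bound.
def buggy_clamp(x, lo, hi):
    return max(lo, x)


def fixed_clamp(x, lo, hi):
    return min(hi, max(lo, x))


def strong_test(clamp):
    assert clamp(15, 0, 10) == 10  # exercises the upper bound, so it catches the bug


def weak_test(clamp):
    assert clamp(5, 0, 10) == 5  # passes on both versions, proves nothing


print(reverse_classical_check([strong_test], buggy_clamp, fixed_clamp))  # True
print(reverse_classical_check([weak_test], buggy_clamp, fixed_clamp))    # False
```

The weak test is rejected because it passes against the broken code, so it gives no evidence that the patch fixed anything.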

Scope checks punish patches that touch unrelated files, add oversized diffs, or refactor things the task did not ask for.
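A scope check of this kind could look roughly like the sketch below. The allowed-path globs, the line budget, and the shape of the `changed` mapping are assumptions for illustration; the post does not say how FrontierCode defines or measures scope.

```python
from fnmatch import fnmatch


def scope_violations(changed: dict[str, int], allowed_globs: list[str],
                     max_changed_lines: int) -> list[str]:
    """Flag files outside the allowed globs and diffs over the line budget.

    `changed` maps each touched path to its added-plus-removed line count.
    """
    violations = []
    for path in sorted(changed):
        if not any(fnmatch(path, pattern) for pattern in allowed_globs):
            violations.append(f"unrelated file touched: {path}")
    total = sum(changed.values())
    if total > max_changed_lines:
        violations.append(f"diff too large: {total} > {max_changed_lines} lines")
    return violations


# Hypothetical patch: two in-scope files plus an unrequested README edit.
changed = {"src/cache/lru.py": 40, "tests/test_lru.py": 25, "README.md": 3}
for problem in scope_violations(changed, ["src/cache/*", "tests/*"], 60):
    print(problem)
# unrelated file touched: README.md
# diff too large: 68 > 60 lines
```

Detecting an unrequested refactor inside an allowed file is harder than path and size checks and is not attempted here.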

Adaptive grading uses an LLM to adjust test scaffolding around valid implementation differences, so a good solution is not rejected just because it used a different function name or error wording.

1d · Views 66.1K · Likes 255 · Bookmarks 108
vik@vikhyatk

BREAKING: Cognition is forcing IOI gold medalists and top code maintainers to label code data

Seems concerning

swyx@swyx

It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represents over 1000+ hours of maintainer validated software engineering work most frontier models cannot yet solve, much less solve with high quality.

Cog had IOI Gold medalists and top code maintainers Look At The Data — FrontierCode includes 3000+ rubrics covering code quality and anticheat reward hacking plaguing other benchmarks.

FC Diamond is so hard that Opus 4.8 scores 13.8%.

Three eras of AI coding : Three eras of benchmarks

2021 • Autocomplete : HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode

to me the most beautiful chart when I requested a special historical run into all extant old models, the data was finding that the easiest third of FC tasks (in FC Extended) were rapidlly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months.

This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.
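The reroll arithmetic can be checked with a short calculation. The sketch assumes each attempt succeeds independently with the quoted pass rate (41% before, 74% after), which the post does not state, and counts total attempts rather than rerolls.

```python
import math


def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - pass_rate)**n >= target,
    assuming independent attempts."""
    return math.ceil(math.log(1 - target) / math.log(1 - pass_rate))


print(attempts_needed(0.41))  # 6
print(attempts_needed(0.74))  # 3
```

Under these assumptions a 41% pass rate needs 6 attempts to reach a 95% chance of at least one success, while a 74% pass rate needs 3 attempts (the first try plus 2 rerolls), which is in line with the figures quoted in the post.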

My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....

The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.

1d · Views 33.8K · Likes 257 · Bookmarks 40
Cognition@cognition

20+ world-class open-source developers built realistic coding tasks on repos they maintain. They define what “mergeable” means in their repo.

What does it take to measure mergeability? We use a mix of unit tests, rubrics and novel verifiers to assess correctness, test quality, scope discipline, style, and adherence to codebase standards.

2d · Views 77.2K · Likes 251 · Bookmarks 34
Cognition@cognition

You can find full model results and technical implementation details on our blog:

https://cognition.ai/blog/frontier-code

2d · Views 27.9K · Likes 129 · Bookmarks 46
Cognition@cognition

FrontierCode has three task sets: Extended (150 tasks), Main (100 tasks) and Diamond (50 tasks). SOTA LLMs have significant room for improvement, with the top model earning a score of just 13.4/100 on our Diamond task set.

2d · Views 98.4K · Likes 213 · Bookmarks 32

This chart scares me a bit. My guess is that reproducibility is low. Would love for them to share more data on their runs.

Opus low scored higher than medium
Opus xhigh scored higher than max
gpt-5.5-medium was the highest scoring OpenAI model

1d · Views 22.8K · Likes 206 · Bookmarks 21
Cognition@cognition

FrontierCode was built in close partnership with the expert maintainers of 36 flagship open-source repositories, like @smilingnosrati, CEO & Tech Lead @CeleryOrg (29k stars), and Martin McKeaveney, CTO of @Budibase (28k stars).

Maintainers invested more than 40 hours per task, undergoing multiple rounds of iteration to ensure that any PR that satisfies these standards would actually be merged.

2d · Views 45.8K · Likes 181 · Bookmarks 15
Cognition@cognition

Rigorous quality control is important, so we built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review. Every task is manually reviewed by a Cognition researcher.

This reduces both the false positive and false negative rates for proposed solutions. As a result, FrontierCode produces 81% fewer misclassification errors than SWE-Bench Pro.

2d · Views 36.8K · Likes 164 · Bookmarks 13
Cognition@cognition

Tasks in the dataset have a concise problem statement with large solutions that cut across multiple files.

FrontierCode’s task set is more diverse than other software engineering evals, measuring ability across a wide range of languages and problem types.

2d · Views 54.1K · Likes 134 · Bookmarks 12
swyx@swyx

there is significant alpha in convincing IOI golds and 20 year olds with their own wikipedia pages to build activedirectory sync integrations, and then putting good looking 40 year old guys and gals in suits in front of them to inference enough enterprise buzzword tokens to unlock every CIOs ai budget

1d · Views 4.1K · Likes 74 · Bookmarks 7

@cognition I have a few questions after looking deeper at the numbers

1d · Views 9.2K · Likes 85 · Bookmarks 4

Open questions for the @cognition team who worked on this:

1. What is the language split on Diamond puzzles vs the "extended" subset?
2. Are you willing to share what repos are in the Diamond subset?
3. How many runs did you do for each model on a given task?
4. Do you have any suspicion as to why Opus 4.8 performed significantly better than 4.7? Or why the reasoning levels introduce so much variability in scores?
5. How repro-able are these scores? If you run the bench again, do they vary meaningfully?

1d · Views 6.1K · Likes 55 · Bookmarks 5
Matthew Schrager@MatthewSchrager

@cognition That's a massive lead for Opus 4.8, which does not seem to match the vibes on this site.

2d · Views 7.6K · Likes 55 · Bookmarks 2
Taelin@VictorTaelin

@theo me too, expected 7x or so

1d · Views 3.9K · Likes 46 · Bookmarks 1
Ege Erdil@EgeErdil2

@scaling01 i think being prescriptive about mechanical cleanliness in an eval intended for AI agents is bad

AI agents don't operate with the same cleanliness / code quality concerns as humans, their capability profile is totally different

Lisan al Gaib@scaling01

Opus 4.8 is the best coding model out there

FrontierCode by Cognition is probably the highest quality coding benchmark we have seen so far

it moves beyond just using unit-testing for scoring, it also tests for regression safety, mechanical cleanliness, test correctness, scope and code quality

20+ open-source developers handcrafted 150 tasks, each of which took over 40 hours to construct

it also tests a more diverse set of programming languages

1d · Views 952 · Likes 8 · Bookmarks 5
Scott Wu@ScottWu46

@NickADobos I hope that it gets solved in the next 6 months and then we can move on to even more challenging tasks!

Nick Dobos@NickADobos

@ScottWu46 Genuine question. How many days do you think it takes to saturate this?

1d · Views 24.9K · Likes 45 · Bookmarks 0
Zygis@zygisSS22

@cognition Did i miss something? what harness was used here?

2d · Views 7.5K · Likes 28 · Bookmarks 1