
Cognition Launches FrontierCode Benchmark Testing Mergeable AI Code

Original post
Rohan Paul @rohanpaul_ai · #1102 in Tech

The just released Claude Fable 5 gets about 31% on FrontierCode, far above even Opus 4.8

Rohan Paul @rohanpaul_ai

Incredible! This is just the benchmark we needed.

Claude Opus 4.8 achieves a score of only 13.4%. Other models score even lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less.

Cognition is introducing FrontierCode, a coding benchmark built to test whether AI code is good enough for a real maintainer to merge, not just whether it passes tests.

FrontierCode asks a harder question: did the model produce a clean, limited, well-tested, readable patch that fits the project’s existing style and would survive serious code review?

FrontierCode comes in 3 nested subsets of increasing difficulty: Extended contains all 150 tasks, Main is the hardest 100, and Diamond is the hardest 50.

More than 20 open-source maintainers helped design the tasks, and each task took over 40 hours to build, review, attack, and calibrate.

The biggest finding is that top models still struggle badly when the target is mergeable code instead of merely working code.

On Diamond, the best model, Claude Opus 4.8, scores only 13.4%, while GPT-5.5 scores 6.3%, Gemini 3.1 Pro scores 4.7%, and the best open-source model listed, Kimi K2.6, scores 3.8%.

This shows that today’s strongest coding agents can often patch behavior, but they still fail many human-review standards around design, restraint, test quality, and project conventions.

The mechanism is a grading system built around blockers and non-blockers.

A blocker is something that would stop a maintainer from merging the PR, such as broken behavior, missing required behavior, unsafe scope changes, bad performance, or tests that do not prove the fix.

A solution that fails any blocker gets 0, even if parts of the code look good.

A passing solution then gets a weighted score based on softer quality items such as readability, type safety, style, and fit with the existing codebase.
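
The blocker and non-blocker scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `RubricItem` type, the item names, and the weights are hypothetical, and the post does not say how Cognition actually combines the weights, so a simple weighted share of passed items is used here.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    name: str
    passed: bool
    blocker: bool = False
    weight: float = 1.0  # hypothetical weight, only used for non-blockers


def score_solution(items: list[RubricItem]) -> float:
    """Return 0.0 if any blocker fails, else the weighted share of passed non-blockers."""
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    soft = [item for item in items if not item.blocker]
    total = sum(item.weight for item in soft)
    if total == 0:
        return 1.0  # assumption: with no soft items, a passing solution gets full marks
    return sum(item.weight for item in soft if item.passed) / total


# Hypothetical rubric: the blocker passes, one of three soft items fails.
example = [
    RubricItem("fixes the reported bug", passed=True, blocker=True),
    RubricItem("readability", passed=True, weight=2.0),
    RubricItem("type safety", passed=False, weight=1.0),
    RubricItem("style", passed=True, weight=1.0),
]
print(score_solution(example))  # 0.75
```

Flipping the blocker to `passed=False` drops the score to 0.0 regardless of the soft items, which is the all-or-nothing behavior the post describes.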

FrontierCode also adds checks beyond normal unit tests.

Reverse-classical testing runs the model’s own tests against the original broken code, and those tests must fail, which proves the model wrote tests that actually catch the bug.
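
A minimal sketch of the reverse-classical idea, assuming tests are plain Python callables that raise `AssertionError` on failure. The helper names and the toy `max` bug are hypothetical, and treating "the tests must fail" as "at least one test fails on the original code" is an assumption, since the post does not specify the exact rule.

```python
from typing import Callable

Implementation = Callable[[list[int]], int]
CheckFn = Callable[[Implementation], None]  # raises AssertionError on failure


def runs_clean(check: CheckFn, impl: Implementation) -> bool:
    """Return True if the check passes against the given implementation."""
    try:
        check(impl)
    except AssertionError:
        return False
    return True


def reverse_classical_ok(
    checks: list[CheckFn], original: Implementation, patched: Implementation
) -> bool:
    """Checks must all pass on the patched code and fail on the original code.

    Assumption: the suite counts as failing on the original if at least one check fails.
    """
    passes_on_patched = all(runs_clean(c, patched) for c in checks)
    fails_on_original = any(not runs_clean(c, original) for c in checks)
    return passes_on_patched and fails_on_original


# Toy example: the original code returns the first element instead of the maximum.
def original_max(xs: list[int]) -> int:
    return xs[0]


def patched_max(xs: list[int]) -> int:
    return max(xs)


def strong_check(impl: Implementation) -> None:
    assert impl([1, 3, 2]) == 3  # fails on the original code, so it catches the bug


def weak_check(impl: Implementation) -> None:
    assert impl([5, 1]) == 5  # passes on both versions, so it proves nothing


print(reverse_classical_ok([strong_check], original_max, patched_max))  # True
print(reverse_classical_ok([weak_check], original_max, patched_max))  # False
```

The weak check is rejected even though it passes on the fix, because it would also have passed before the fix.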

Scope checks punish patches that touch unrelated files, add oversized diffs, or refactor things the task did not ask for.
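
Two of the scope checks mentioned above (unrelated files and oversized diffs) could be approximated mechanically as in the sketch below. Everything here is an assumption for illustration: the `FileChange` type, the allowed path prefixes, and the line limit are hypothetical, and spotting an unrequested refactor would need review judgment that this sketch does not attempt.

```python
from dataclasses import dataclass


@dataclass
class FileChange:
    path: str
    lines_added: int
    lines_removed: int


def scope_violations(
    changes: list[FileChange],
    allowed_prefixes: tuple[str, ...],
    max_changed_lines: int,
) -> list[str]:
    """Return human-readable scope violations for a patch (empty list means in scope)."""
    violations = []
    for change in changes:
        # str.startswith accepts a tuple and matches any of the prefixes.
        if not change.path.startswith(allowed_prefixes):
            violations.append(f"unrelated file touched: {change.path}")
    total = sum(c.lines_added + c.lines_removed for c in changes)
    if total > max_changed_lines:
        violations.append(f"diff too large: {total} > {max_changed_lines} changed lines")
    return violations


# Hypothetical patch: a small fix plus a large unrelated documentation rewrite.
patch = [
    FileChange("src/parser.py", lines_added=10, lines_removed=2),
    FileChange("docs/README.md", lines_added=300, lines_removed=0),
]
for violation in scope_violations(patch, ("src/", "tests/"), max_changed_lines=100):
    print(violation)
```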

Adaptive grading uses an LLM to adjust test scaffolding around valid implementation differences, so a good solution is not rejected just because it used a different function name or error wording.

10:55 AM · Jun 9, 2026 · 2.4K Views
Sentiment

Positive users praise FrontierCode for testing whether AI code is actually mergeable by humans rather than just passing tests, while negative users view the low scores as proof that current models still produce unmaintainable output.

Pos 56.2% · Neg 43.8% · 16 comments with sentiment
Cluster Engagement
Posts from X
Most Activity
VIEWS 1.2K
Rohan Paul @rohanpaul_ai

This graph compares FrontierCode Diamond score with average output tokens, showing that more generated tokens do not automatically mean better code quality.

Claude Opus 4.8 gets the highest score at about 13.4% but uses many more tokens, while GPT-5.5 scores lower at 6.3% with far fewer tokens, making this a quality-versus-cost signal for coding agents.

1d · Views 1.2K · Likes 7 · Bookmarks 1
BOOKMARKS 1
Rajendra Kanodia @rajekanodia

I asked Claude how it handles the problem of a one-line change breaking software elsewhere. Well, it doesn't really handle it, although it can provide assistance in figuring out what other changes might be needed. Here is its response:

Great questions. Here's an honest answer to both:

1. Can Claude make one-line code changes?
Yes, absolutely. You can paste code, describe what you want changed, and Claude will make targeted edits, including single-line changes, refactors, bug fixes, or larger restructures.

2. Can Claude handle the "ripple effect" problem?
This is the harder question, and the honest answer is: partially, with important limitations.

What Claude can do well:
If you share the relevant parts of a codebase, Claude can reason about dependencies and warn you that a change in function X will break callers Y and Z.
Claude can suggest coordinated changes across multiple files when you provide them.
Claude understands common patterns (e.g., changing a function signature means updating all call sites).
Claude Code (the CLI tool) has direct filesystem access, so it can read your actual codebase and reason across files more holistically.

Where it breaks down:
Claude has a context window limit: it can't hold an entire large codebase in mind at once. If the affected code is in a file you haven't shared, Claude won't know about it.
Claude has no runtime: it can't actually execute your code or run tests to discover what breaks; it reasons statically.
Claude can miss subtle side effects, things like shared mutable state, implicit contracts between modules, or framework magic that aren't obvious from reading code.

The practical reality: Claude is a strong assistant for this problem, not a complete solution.

The best workflow is:
Use Claude to make the change and ask it to reason about what else might break.
Run your actual test suite to catch what Claude missed.
Feed broken tests back to Claude to fix them.

The combination of Claude + your tests is much more reliable than either alone.

1d · Views 78 · Likes 2 · Bookmarks 1
LIKES 8
Rohan Paul @rohanpaul_ai

This graphic shows that FrontierCode Extended is more diverse than SWE-Bench Pro and DeepSWE, because its 150 tasks span Python, Go, TypeScript, JavaScript, Java, C/C++, and other languages instead of being concentrated in just a few ecosystems.

Its significance is that FrontierCode is testing coding agents closer to real open-source work, where maintainers deal with mixed languages, multi-PR changes, and less perfectly packaged requests.

1d · Views 768 · Likes 8
RETWEETS 34
Rohan Paul @rohanpaul_ai

Cognition @cognition

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

1d · Views 66.3K · Likes 255 · Bookmarks 108
REPLIES 1
Red Emano @heisredminded

FrontierCode is the benchmark I've been waiting for. Passing tests != mergeable code, especially in real-world PH dev shops where review bandwidth is limited. The blocker system and reverse-classical testing catch exactly the kind of "works but will haunt you later" code that wastes teams weeks. 13.4% on Diamond shows just how far we are from truly autonomous coding agents.

1d · Views 112
Rohan Paul @rohanpaul_ai

This process explains how FrontierCode tries to make subjective code review measurable: maintainers first define the task and rubric, then try to break the rubric with bad solutions, calibrate it with reference answers, and send it through pod-lead and research review.

I.e., the benchmark tests models not just against unit tests, but against a reviewed grading system designed to reduce both false positives and false negatives.

1d · Views 1.2K · Likes 4
Rohan Paul @rohanpaul_ai

https://cognition.ai/blog/frontier-code

1d · Views 1.1K · Likes 3

This is exactly the kind of artifact that gets built into an S-1 narrative: "best model scores 31% where the field scores 4-13%" plus "competitor's curve is flat" is a moat slide, full stop, landing ten days after Anthropic's confidential filing. And note who published it — Cognition is an Anthropic launch partner quoted in today's announcement, releasing a benchmark same-day where Anthropic's new model laps the field by 2.3x.

1d · Views 117 · Likes 1
Ines Lakzit @InesLakzit

@rohanpaul_ai A 13 percent ceiling on code you would actually merge is the number that reprices the AI coding trade. The seat that gets scarcer is senior review, not the junior one everyone assumed the model would replace.

1d · Views 131 · Likes 3
Jacob Adams @temprlflux

@rohanpaul_ai Where is xAI? Why do they always leave it out. It must be better than the lowest model on this benchmark. Why is it not shown?

1d · Views 50 · Likes 2
Raven @wizrdoraven

@rohanpaul_ai The under-discussed line in the Cognition blog: GPT-5.5 uses up to 4x fewer tokens than Opus 4.8 on this eval. So 'best on Diamond' and 'best cost-intelligence' are now different models. For agent stacks gated by token budgets, that gap matters more than the headline score.

1d · Views 73 · Likes 1
BReal @BReal_01

@rohanpaul_ai These benchmark results look weird. The vast majority of people who used Opus 4.7 and Opus 4.8 didn't really see much of a difference. It feels odd that there's an 8.2% difference between the 2 models in this type of chart. Also, why would anyone bench Gemini 3.1 Flash Lite? Makes no sense.

1d · Views 71 · Likes 1

@rohanpaul_ai FrontierCode might finally expose which models can reason under real coding pressure.

1d · Views 67 · Likes 1
Shinka - AI @ShinkaIoT

@rohanpaul_ai FrontierCode nails it: there's a world of difference between code that works and code a human actually wants to merge.

1d · Views 49 · Likes 1

@rohanpaul_ai Wow, those scores are low. Wonder if it's gonna push devs to rethink AI training methods or just tweak what's already there.

1d · Views 41 · Likes 1
7rtp @fredyfredo123

@rohanpaul_ai i don't buy it

1d · Views 39 · Likes 1
Guilherme O'Tina @guilhermeotina

the interesting pattern in the chart is how open models (kimi, minimax) cluster near gpt-5.5. if diamond is about code quality taste rather than reasoning depth, then specialized training on solid codebases might matter more than scale here. worth watching which axis moves the needle

1d · Views 106
LeetLLM.com @leetllm

@rohanpaul_ai The real story here isn't just the low scores, it's that we're finally moving from 'did it pass the test suite' to 'would a human actually merge this.' Half of SWE-bench passes are unmergeable slop.

1d · Views 96