
Cognition launches FrontierCode to evaluate AI code mergeability, finding Claude Opus 4.8 leads with a peak 13.5% score

AI Judge changed title after evaluation, original title: "FrontierCode benchmark launches to test code mergeability, finding over half of SWE-bench outputs are unmergeable"

Story Overview

Cognition's FrontierCode benchmark shifts focus from whether AI code merely runs to whether human maintainers would actually accept it into production repositories, exposing that over half the outputs passing earlier SWE-Bench tests fall short on style, scope, and regression safety.

Original post · Ben Golub #1451
Jeff Wang @jeffwsurf

A massive achievement, introducing FrontierCode, finally an eval that can determine how good models can write code that will lead to actually being merged in the real world. This has 81% less false positives as SWE-Bench Pro, and is a peek at how we are determining whether models are good or bad at certain types of tasks

We will also have more evals coming that will help us improve Devin rapidly and be better at using models optimized for different types of work

Cognition @cognition

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

12:57 PM · Jun 8, 2026 · 5K Views
Developer Impact

Real merge decisions expose hidden gaps in prior evals

Tasks built by veteran open-source maintainers, each requiring more than 40 hours of work, reveal that passing automated tests often masks unmaintainable changes. Models still score below 15 percent on the hardest subset.

Model Watch

Early leader shows token-heavy approach still dominates

Anthropic's Opus 4.8 tops the charts at 13.4 percent on Diamond tasks, while lighter models accept lower scores in exchange for efficiency. The picture for open-source entries and future scaling remains incomplete.

Sentiment

Supportive commenters praise Cognition's FrontierCode eval for its focus on maintainable AI-generated code, while critics dismiss the benchmark as useless, flawed, or late.

Pos 53.7% · Neg 46.3% (26 comments with sentiment)

Cluster Engagement
Posts from X
Most Activity
Views 22.1K · Bookmarks 31 · Likes 230 · Retweets 2 · Replies 20

Fascinating bench. Really like the idea of focusing on mergeability.

Confused how Opus 4.8 is 2.5x better than Opus 4.7 though 🙃

Cognition @cognition

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

1h · Views 22.1K · Likes 230 · Bookmarks 31
vik @vikhyatk

BREAKING: Cognition is forcing IOI gold medalists and top code maintainers to label code data

Seems concerning

swyx @swyx

It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represents over 1000+ hours of maintainer validated software engineering work most frontier models cannot yet solve, much less solve with high quality.

Cog had IOI Gold medalists and top code maintainers Look At The Data — FrontierCode includes 3000+ rubrics covering code quality and anticheat reward hacking plaguing other benchmarks.

FC Diamond is so hard that Opus 4.8 scores 13.8%.

Three eras of AI coding : Three eras of benchmarks

2021 • Autocomplete: HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode

to me the most beautiful chart when I requested a special historical run into all extant old models, the data was finding that the easiest third of FC tasks (in FC Extended) were rapidlly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months.

This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.

My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....

The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.

2h · Views 8.3K · Likes 95 · Bookmarks 15
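
The reroll arithmetic in the swyx post above can be checked with a rough sketch. It assumes every attempt succeeds independently with a probability equal to the quoted pass rate; that is an assumption made here for illustration, since the post does not say how its figures were derived. The function names are hypothetical.

```python
def success_within(pass_rate: float, attempts: int) -> float:
    """Chance of at least one success in `attempts` independent tries.

    Assumes each try succeeds with probability `pass_rate`,
    independently of the others (an illustrative assumption).
    """
    return 1.0 - (1.0 - pass_rate) ** attempts


def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest number of independent tries whose success chance reaches `target`."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    attempts = 1
    while success_within(pass_rate, attempts) < target:
        attempts += 1
    return attempts


# Pass rates quoted in the post for the easiest third of tasks.
for rate in (0.41, 0.74):
    n = attempts_needed(rate)
    print(f"pass rate {rate:.0%}: {n} attempts -> {success_within(rate, n):.1%}")
```

Under this independence assumption, a 41% pass rate needs six attempts to reach 95%, while a 74% pass rate gets to roughly 93% in two attempts and passes 95% on the third. That is broadly in line with the post's "2 rerolls vs 6".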

This chart scares me a bit. My guess is that reproducibility is low. Would love for them to share more data on their runs.

Opus low scored higher than medium
Opus xhigh scored higher than max
gpt-5.5-medium was the highest scoring OpenAI model

Fascinating bench. Really like the idea of focusing on mergeability.

Confused how Opus 4.8 is 2.5x better than Opus 4.7 though 🙃

1h · Views 4.7K · Likes 50 · Bookmarks 6
Taelin @VictorTaelin

This benchmark addresses my problem with 5.5: it passes the tests but writes shitty code. We don't need a model's output to work today, we need it not to break tomorrow...

Cognition @cognition

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

19m · Views 1.3K · Likes 36 · Bookmarks 5
swyx @swyx

there is significant alpha in convincing IOI golds and 20 year olds with their own wikipedia pages to build activedirectory sync integrations, and then putting good looking 40 year old guys and gals in suits in front of them to inference enough enterprise buzzword tokens to unlock every CIOs ai budget

vik @vikhyatk

BREAKING: Cognition is forcing IOI gold medalists and top code maintainers to label code data

Seems concerning

1h · Views 951 · Likes 24 · Bookmarks 1
elie @eliebakouch

on extended subset, kimi k2.6 just feels like the extension of gpt 5.4 mini

elie @eliebakouch

new frontier eval from the cognition team. interesting that simple test time scaling is pretty noisy here instead of a clean line

lots of care in crafting a good scoring process

https://cognition.ai/blog/frontier-code

2h · Views 1.2K · Likes 12 · Bookmarks 0
vik @vikhyatk

@swyx @METR_Evals this is very cool

swyx @swyx

It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represents over 1000+ hours of maintainer validated software engineering work most frontier models cannot yet solve, much less solve with high quality.

Cog had IOI Gold medalists and top code maintainers Look At The Data — FrontierCode includes 3000+ rubrics covering code quality and anticheat reward hacking plaguing other benchmarks.

FC Diamond is so hard that Opus 4.8 scores 13.8%.

Three eras of AI coding : Three eras of benchmarks

2021 • Autocomplete: HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode

to me the most beautiful chart when I requested a special historical run into all extant old models, the data was finding that the easiest third of FC tasks (in FC Extended) were rapidlly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months.

This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.

My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....

The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.

2h · Views 367 · Likes 5 · Bookmarks 0
elie @eliebakouch

OTE formula i think

elie @eliebakouch

on extended subset, kimi k2.6 just feels like the extension of gpt 5.4 mini

2h · Views 815 · Likes 3 · Bookmarks 0
elie @eliebakouch

on other less hard subset

elie @eliebakouch

new frontier eval from the cognition team. interesting that simple test time scaling is pretty noisy here instead of a clean line

lots of care in crafting a good scoring process

https://cognition.ai/blog/frontier-code

2h · Views 496 · Likes 3 · Bookmarks 0
Silas Alberti @silasalberti

@BrendanFoody @cognition @mercor_ai Thank you! 🫶

1h · Views 127 · Likes 4
Sahin Olut @sahinolut

ngmi if you are not looking at data / traces. at rippling we had a value called "go and see" for all leaders.

i think this is extremely critical for anyone in the business of training models or using them in very intelligent ways. as someone who has trained models for a living for a while, you can't debug things if you don't know your data

1h · Views 84 · Likes 2
Herbie Bradley @herbiebradley

@eliebakouch I'm confused why the time x-axis only extends to like 30m surely if the models have low pass rate, you can just extend to 10 hours and keep improving with a decent scaffold?

elie @eliebakouch

new frontier eval from the cognition team. interesting that simple test time scaling is pretty noisy here instead of a clean line

lots of care in crafting a good scoring process

https://cognition.ai/blog/frontier-code

2h · Views 175 · Likes 1 · Bookmarks 0
Rayane @RayaneRachid_

@theo This is pure slop, gpt 5.5 x high same as opus 4.8 medium lmfaooo

59m · Views 60 · Likes 1
Carter Leffen @carterleffen

@vikhyatk seems really smart to me

1h · Views 41 · Likes 1
Sergio @SergioM80274824

@jeffwsurf but you didnt' eval Gemini 3.5 Flash nor latest Qwen models?

2h · Views 41 · Likes 1
Lei Fu @the_leifu

@BrendanFoody @cognition @silasalberti @mercor_ai wow took a whole 2 years since the inception of vibe coding for someone to begin benching quality

1h · Views 39 · Likes 1
Jacob Rhodes @Jacob_Rhodes_

@theo wow, that is kindof crazy. I am soooo glad that I can use 4.8!

1h · Views 84
Terp @OnlyTerp

@jeffwsurf Love it!

1h · Views 69
Utkarsh Singh @Utkarsh51557661

@BrendanFoody @cognition @silasalberti @mercor_ai benchmarks matter, but if the data quality’s bad, it’s just lipstick on a pig.

1h · Views 20 · Likes 1
soulblocks @soulblocks

@theo only if you use /rewind ☕

1h · Views 51