/Tech · 5h ago

Cognition releases FrontierCode, a benchmark measuring whether AI-generated code is maintainable and merge-ready rather than just functionally correct

Open-source maintainers spent over 40 hours building each task.

#81

Original post

Taelin@VictorTaelin · #707 in Tech

@theo me too, expected 7x or so

Theo - t3.gg@theo

Fascinating bench. Really like the idea of focusing on mergeability.

Confused how Opus 4.8 is 2.5x better than Opus 4.7 though 🙃

6:31 PM · Jun 8, 2026 · 2.1K Views

Sentiment

Positive users praise FrontierCode for using real merge-worthy tasks from OSS maintainers instead of test-passing alone, while negative users dismiss it as biased or irrelevant to most software.

Pos: 58.5%

Neg: 41.5%

43 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 49.7K

Cognition@cognition

FrontierCode has three task sets: Extended (150 tasks), Main (100 tasks) and Diamond (50 tasks). SOTA LLMs have significant room for improvement, with the top model earning a score of just 13.4/100 on our Diamond task set.

11h49.7K13721

BOOKMARKS 29

Cognition@cognition

You can find full model results and technical implementation details on our blog:

https://cognition.ai/blog/frontier-code

11h10.4K9029

LIKES 163

Cognition@cognition

20+ world-class open-source developers built realistic coding tasks on repos they maintain. They define what “mergeable” means in their repo.

What does it take to measure mergeability? We use a mix of unit tests, rubrics and novel verifiers to assess correctness, test quality, scope discipline, style, and adherence to codebase standards.

11h17.7K16317

RETWEETS 186

Cognition@cognition

Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers.

Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?

11h679.4K2.5K950

REPLIES 5

Matthew Schrager@MatthewSchrager

@cognition That's a massive lead for Opus 4.8, which does not seem to match the vibes on this site.

11h2.4K21

Cognition@cognition

Rigorous quality control is important, so we built an extensive QC pipeline with adversarial testing, calibration, and multi-stage review. Every task is manually reviewed by a Cognition researcher.

This reduces both the false positive and false negative rates for proposed solutions. As a result, FrontierCode produces 81% fewer misclassification errors than SWE-Bench Pro.

11h10.4K11110

Cognition@cognition

FrontierCode was built in close partnership with the expert maintainers of 36 flagship open-source repositories, like @smilingnosrati, CEO & Tech Lead @CeleryOrg (29k stars), and Martin McKeaveney, CTO of @Budibase (28k stars).

Maintainers invested more than 40 hours per task, undergoing multiple rounds of iteration to ensure that any PR that satisfies these standards would actually be merged.

11h12.4K1277

Cognition@cognition

Tasks in the dataset have a concise problem statement with large solutions that cut across multiple files.

FrontierCode’s task set is more diverse than other software engineering evals, measuring ability across a wide range of languages and problem types.

11h17.2K877

Theo - t3.gg@theo

@cognition I have a few questions after looking deeper at the numbers

7h3.4K253

Zygis@zygisSS22

@cognition Did i miss something? what harness was used here?

11h2.4K151

Silas Alberti@silasalberti

@cognition Glad to get this out!

11h71126

Florian Brand@xeophon

@cognition whats the used harness? yours?

10h646161

Graham Neubig@gneubig

@cognition Which harness was used to run the evaluation? Devin or native like Claude Code/Codex? Did you see any effects there?

10h1.6K131

Jared Zoneraich@imjaredz

@cognition One step closer to figuring out what "good" code actually is. this is going to be the new benchmark standard

11h1.5K20

Ryan Shea@ryaneshea

@cognition AI IQ picked this up already btw and now has it as one of its Software Engineering benchmarks: https://www.aiiq.org/charts/frontiercode-diamond-scores/

9h1.2K91

Vishal Anton@Vishal_anton16

@cognition @grok how different is this from deepswe from datacurve?

10h2.5K21

nader dabit@dabit3

@cognition @alessio_joseph design is A+, such a pleasure to read through this

9h76813

sohit kumar@ksohit

@cognition Agents write code just keeping the task in mind, I usually end up correcting the code written by agent to ensure that entire system evolves in a manageable way.

This is the way forward.

10h1.1K41

Hypocrisy@LUXECryptoCH

@cognition Not trusting this benchmark for shit. I have used both 4.8 and 5.5. 5.5 is faster cheaper and far better in every single instance. Deepswe still the best benchmark. anything with a claude model on top is slop.

10h8644

Daniel Jeffries@Dan_Jeffries1

What everyone who works with these machines daily on jagged edge coding problems already knows.

Recursive self improvement?

Yeah right.

How about we just get to long term memory, continual learning, embedded backwards and forward pass thinking in latent space, and actual long term contextual understanding first, guys?

Maybe then we'll have machines that can actually solve problems by learning instead of memorizing lots of data from training.

After that we can go for some sci-fi BS to scare the normies into regulatory capture, mmm-kaayy?

swyx@swyx

It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.

Cog had IOI Gold medalists and top code maintainers Look At The Data: FrontierCode includes 3000+ rubrics covering code quality and guarding against the reward hacking that plagues other benchmarks.

FC Diamond is so hard that Opus 4.8 scores 13.8%.

Three eras of AI coding : Three eras of benchmarks

2021 • Autocomplete: HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode

To me the most beautiful chart came when I requested a special historical run across all extant old models: the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025. Opus almost doubled from a 41% pass rate to 74% in 4 months.

This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails.
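The reroll arithmetic in the post above can be checked with a short sketch. It assumes each attempt is independent and succeeds with the quoted pass rate (41% or 74%), which the post does not state and which may not hold for real agent runs. The function name `attempts_needed` is hypothetical and used only for illustration.

```python
def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest number of attempts n such that 1 - (1 - pass_rate)**n >= target.

    Assumption (not from the post): attempts are independent and each one
    succeeds with probability `pass_rate`.
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")

    fail = 1.0 - pass_rate  # probability that a single attempt fails
    n = 1
    # Keep adding attempts until the chance of at least one success hits the target.
    while 1.0 - fail ** n < target:
        n += 1
    return n


# Pass rates quoted in the post for the easiest third of FC tasks.
for rate in (0.41, 0.74):
    print(f"pass rate {rate:.0%}: {attempts_needed(rate)} attempts for 95% success")
```

Under these assumptions a 41% pass rate needs 6 attempts to reach 95%, while a 74% pass rate needs 3 attempts (the first try plus 2 rerolls). That is roughly in line with the "2 rerolls vs 6" figure in the post.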

My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....

The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.

28m34941