The just-released Claude Fable 5 gets about 31% on FrontierCode, far above even Opus 4.8.
Incredible! This is just the benchmark we needed.
Claude Opus 4.8 achieves a score of only 13.4%. Other models score even lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less.
Cognition is introducing FrontierCode, a coding benchmark built to test whether AI code is good enough for a real maintainer to merge, not just whether it passes tests.
FrontierCode asks a harder question: did the model produce a clean, limited, well-tested, readable patch that fits the project’s existing style and would survive serious code review?
FrontierCode comes in three nested subsets of increasing difficulty: the full benchmark contains 150 tasks, Main is the hardest 100 of those, and Diamond is the hardest 50.
More than 20 open-source maintainers helped design the tasks, and each task took over 40 hours to build, review, attack, and calibrate.
The biggest finding is that top models still struggle badly when the target is mergeable code instead of merely working code.
On Diamond, the best model, Claude Opus 4.8, scores only 13.4%, while GPT-5.5 scores 6.3%, Gemini 3.1 Pro scores 4.7%, and the best open-source model listed, Kimi K2.6, scores 3.8%.
This shows that today’s strongest coding agents can often patch behavior, but they still fail many human-review standards around design, restraint, test quality, and project conventions.
The benchmark’s core mechanism is a grading system built around blockers and non-blockers.
A blocker is something that would stop a maintainer from merging the PR, such as broken behavior, missing required behavior, unsafe scope changes, bad performance, or tests that do not prove the fix.
A solution that fails any blocker gets 0, even if parts of the code look good.
A passing solution then gets a weighted score based on softer quality items such as readability, type safety, style, and fit with the existing codebase.
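The blocker gate and weighted score described above can be sketched roughly as follows. This is a minimal illustration, not FrontierCode's actual grader: the `RubricItem` type, the `score` function, the weights, and the normalization to a 0 to 1 range are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One graded item. All field names here are hypothetical."""
    name: str
    passed: bool
    blocker: bool = False
    weight: float = 1.0  # only meaningful for non-blockers


def score(items: list[RubricItem]) -> float:
    """Return 0.0 if any blocker fails, else the weighted share of
    non-blocker items that passed (assumed normalization)."""
    if any(item.blocker and not item.passed for item in items):
        return 0.0
    soft = [item for item in items if not item.blocker]
    total = sum(item.weight for item in soft)
    if total == 0:
        return 1.0  # nothing left to grade beyond the blockers
    return sum(item.weight for item in soft if item.passed) / total


patch = [
    RubricItem("required behavior works", passed=True, blocker=True),
    RubricItem("readability", passed=True, weight=2.0),
    RubricItem("type safety", passed=True, weight=1.0),
    RubricItem("matches project style", passed=False, weight=1.0),
]
print(score(patch))  # 0.75
```

The design point the sketch tries to capture is that blockers act as a gate rather than a penalty: no amount of soft-quality credit can offset a failed blocker.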
FrontierCode also adds checks beyond normal unit tests.
Reverse-classical testing runs the model’s own tests against the original broken code, and those tests must fail, which proves the model wrote tests that actually catch the bug.
Scope checks punish patches that touch unrelated files, add oversized diffs, or refactor things the task did not ask for.
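One way to picture a scope check is the sketch below. It is an assumption-heavy illustration: the input format (changed lines per file), the allowed path prefixes, and the size limit are all invented for the example, and it does not attempt to detect unrequested refactors, which would need more than path and size data.

```python
def scope_violations(
    changed_lines_by_file: dict[str, int],
    allowed_prefixes: list[str],
    max_total_lines: int,
) -> list[str]:
    """Return human-readable scope problems for a patch (empty if none).

    changed_lines_by_file maps each touched path to its changed-line count.
    allowed_prefixes and max_total_lines are hypothetical per-task settings.
    """
    violations = []
    prefixes = tuple(allowed_prefixes)
    for path in sorted(changed_lines_by_file):
        if not path.startswith(prefixes):
            violations.append(f"unrelated file touched: {path}")
    total = sum(changed_lines_by_file.values())
    if total > max_total_lines:
        violations.append(f"diff too large: {total} > {max_total_lines} lines")
    return violations


focused = {"src/parser.py": 12, "tests/test_parser.py": 20}
sprawling = {"src/parser.py": 12, "docs/README.md": 300}

print(scope_violations(focused, ["src/parser", "tests/"], 200))
print(scope_violations(sprawling, ["src/parser", "tests/"], 200))
```

Under these assumed settings the focused patch yields no violations, while the sprawling one is flagged both for touching `docs/README.md` and for exceeding the size limit.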
Adaptive grading uses an LLM to adjust test scaffolding around valid implementation differences, so a good solution is not rejected just because it used a different function name or error wording.