Epoch AI updates FrontierMath benchmark after correcting errors in 42% of problems, raising scores for GPT-5.5 and Google's AI co-mathematician

Story Overview

Epoch AI has released FrontierMath: Tiers 1–4 (v2) after an AI-assisted audit addressed errors in 42 percent of the original problems. After the fixes and removals, 338 problems remain, still ranging from undergraduate-level exercises to research-level challenges. Early results show higher scores across the board, with little change in which systems lead.

Original post

Lisan al Gaib @scaling01

GPT-5.5-xhigh's FrontierMath 4 score jumped from 35% to 73% after EpochAI fixed errors in the benchmark

Epoch AI @EpochAIResearch

FrontierMath: Tiers 1–4 (v2) is live.

We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.

10:43 AM · Jun 12, 2026 · 11.2K Views

FYI

Dataset got a serious scrub

123 problems in Tiers 1-3 and 12 in Tier 4 were corrected while another dozen were dropped entirely. The remaining questions keep their automatic verification and strict compute limits, so the benchmark now offers a cleaner read on whether frontier systems can actually do hard math.

Open Question

Rankings barely budge despite the lift

GPT-5.5 still tops Tiers 1-3 and Google’s co-mathematician holds the Tier 4 lead, both with improved numbers. How much of the gain is real capability versus simply fewer broken questions remains the open thread to watch on the live board.
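
One way to frame that question is as a ceiling effect. The sketch below is illustrative only and is not Epoch AI's scoring method: it assumes every flawed problem was unsolvable and earned no credit, and that a single flaw rate applies uniformly to the problems being scored, neither of which the posts confirm for any individual tier. The function names are hypothetical.

```python
def score_ceiling(flawed_fraction: float) -> float:
    """Highest attainable score if every flawed problem is unsolvable."""
    if not 0.0 <= flawed_fraction <= 1.0:
        raise ValueError("flawed_fraction must be between 0 and 1")
    return 1.0 - flawed_fraction


def accuracy_on_valid(observed_score: float, flawed_fraction: float) -> float:
    """Accuracy on the valid problems implied by an observed overall score.

    Assumption: the model earns no credit on flawed problems, so the whole
    observed score comes from the valid ones.
    """
    ceiling = score_ceiling(flawed_fraction)
    if ceiling == 0.0:
        raise ValueError("no valid problems remain")
    return observed_score / ceiling
```

Under those assumptions, a 42% flaw rate would cap scores at 58%, and an observed score of 35% (used here purely as an example input) would correspond to roughly 60% accuracy on the valid problems. Any v2 gain beyond that implied figure would then point to something other than removed breakage, though a model matching a wrong answer key by chance, or uneven error rates across tiers, would change the picture.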

Sentiment

Positive commenters praise Epoch AI's careful audit of FrontierMath and read the higher post-fix scores, such as GPT-5.5's, as real progress. Negative commenters see the fixes as exposing flaws in the benchmark rather than genuine advances.

Positive: 70.0% · Negative: 30.0% (10 comments with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views: 1.1K · Bookmarks: 2 · Likes: 17 · Replies: 1

Peter Welinder @npew

Models are getting very good at math. Even better than was previously thought.

Epoch AI @EpochAIResearch

FrontierMath: Tiers 1–4 is now approaching saturation. We believe the future of math benchmarking lies in open problems drawn from real research, like those we’ve collected in FrontierMath: Open Problems.

https://epoch.ai/frontiermath/open-problems

Epoch AI @EpochAIResearch

We’ve backfilled FrontierMath: Tiers 1–4 (v2) scores for a selection of notable models, including recent Claude Opus models. You can find these on our website. We will add scores for Claude Fable 5 and GPT Pro models shortly.

https://epoch.ai/frontiermath/tiers-1-4

Epoch AI @EpochAIResearch

This project began in April when OpenAI shared with us that they had found more errors than expected when conducting an internal review. Note that OpenAI funded the development of Tiers 1–4 and has exclusive access to about 80% of it, with Epoch holding out the rest.

Epoch AI @EpochAIResearch

Following this, we conducted an independent audit. We used GPT-5.5 and Opus 4.7 to flag possible errors and then engaged mathematicians to review these flags. Almost all were determined to be real and severe errors that rendered the problems impossible to solve.

Epoch AI @EpochAIResearch

Simple calculation mistakes accounted for the vast majority of errors, typically made when the problem author was extracting the final answer. These include things like off-by-one errors and flipped signs. Some problem statements were also fatally ambiguous.

Epoch AI @EpochAIResearch

The dataset is much improved. Still, given the complexity of FrontierMath solutions, we can’t be sure that we’ve caught all errors. We plan to conduct additional AI-assisted reviews periodically, using new frontier models, and will correct any additional errors we find.

Epoch AI @EpochAIResearch

We also removed 5 problems (2%) from Tiers 1–3 and 7 (15%) from Tier 4. These had more fundamental flaws that we didn’t believe were worth repairing. The higher removal rate for Tier 4 reflects the greater complexity of its problems.
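
The removal figures above can be sanity-checked against each other. As a rough sketch, assuming the quoted percentages are rounded to the nearest whole percent and are measured against each tier's size before removal (the thread states neither), the snippet below lists the tier sizes that would be consistent with a given count and percentage. `consistent_sizes` is a hypothetical helper, not anything published by Epoch AI.

```python
def consistent_sizes(removed: int, reported_percent: int, max_size: int = 1000) -> list[int]:
    """Tier sizes n for which removed / n rounds to the reported whole percent.

    Assumption: percentages are rounded half up to the nearest integer.
    """
    sizes = []
    for n in range(max(removed, 1), max_size + 1):
        # Integer form of floor(100 * removed / n + 0.5), avoiding float rounding.
        percent = (200 * removed + n) // (2 * n)
        if percent == reported_percent:
            sizes.append(n)
    return sizes
```

Under those assumptions, `consistent_sizes(7, 15)` returns `[46, 47, 48]`, while `consistent_sizes(5, 2)` spans every size from 201 to 333, so the Tier 1–3 percentage alone says little about that tier's exact size.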

Greg Burnham @GregHBurnham

My previous post on the topic

Joseph Garvin @joseph_h_garvin

@Jsevillamol Eh, only true if the LLM had the right answer but answer key was wrong. Some of these may be inconsistencies or vagueness in the questions that the LLMs didn't detect and just tried to answer anyway. Presumably they would have noticed if the LLM said "your question has an error"

G, MD @DrBeavisAI

@EpochAIResearch please do GPT-5.5 PRO and Gemini 3.1 DeepThink and Fable

Habanero @NeroHSN

@EpochAIResearch These are insane… how will the world even handle GPT6 tier models soon?

BestLoser @LanFanTK

@EpochAIResearch Will the v1 version continue to be updated and maintained, or will the focus shift entirely to the v2 version?

David Turturean @DavidTurturean

@GregHBurnham Feels like the opposite of reward hacking that GPT-5.5 almost doubled its own score by finding actual issues within the benchmark

Gregor @bygregorr

@EpochAIResearch curious if the 42% fixes were spread evenly across tiers or clustered in one or two that'd change how much weight i'd put on 'rankings stayed similar'

Saffron Warlord e/acc @rawantitmc

@GregHBurnham Hii sir, are you sure that tier 4 is uncontaminated? The jump of GPT 5.4 to GPT 5.5 is extraordinary on tier 4.

blueblimp @blueblimpms

@EpochAIResearch One thing that's great about the careful review you've done here is that it puts the saturation threshold at literally 100%. Any task the models are still missing indicates a problem that's truly hard for them.

Zanthous ✾ Zankai @ZanthousDev

@EpochAIResearch That's a wild jump...

Curline Zephirin @Curline1222

@EpochAIResearch where is Fable 5

Capricornus @Caprico_Uruk

@EpochAIResearch Once these benchmarks are somewhat saturated, will you release the problems to the public?
