🚨 Frontier reasoning models have achieved many remarkable feats this year, including solving open problems in research mathematics — but we just ran them on our new evaluation built on elementary and high school math, and they get things wrong up to 52% of the time! Even Claude Fable 5 — Anthropic's newest model — has an error rate of 16.4%*.
Why are frontier models still stumbling on grade-school math reasoning when they can already solve complex research-level math?
👉 As it turns out, while reasoning models excel at producing solutions to reasoning problems, we find that still struggle to evaluate solutions, even for grade-school math — we call this the Production-Evaluation Gap.
🚀 In our new paper, An Enigma of Artificial Reason, we study a question that has received insufficient attention thus far: Can Large Reasoning Models (LRMs) reliably evaluate reasoning, or are they just really good at producing it? 🚀
To find out, we built the Valid-Answer-Invalid-Reasoning (VAIR) dataset. We derived this benchmark from GSM8K and MATH — math datasets that LLMs saturated long ago in terms of solution accuracy. Yet, on our reasoning evaluation benchmark, frontier models exhibit sharp drops in accuracy: . Claude Opus 4.7, GPT 5.4, DeepSeek R1, and Gemini 3.1 Pro all score 95–99% when producing solutions, but their accuracy collapses to 48–79% when asked to evaluate flawed reasoning.





