New Paper Shows Frontier Models Struggle Evaluating Grade-School Math Reasoning

Excited to share the first pre-print from our lab led by @SMZ_0001!

In "An Enigma of Artificial Reason", we find that reasoning-trained LMs excel at *producing* reasoning, but struggle to *evaluate* reasoning that reaches valid answers for invalid reasons, scoring as low as 48%.

Sun Ming Zhong@SMZ_0001

🚨 Frontier reasoning models have achieved many remarkable feats this year, including solving open problems in research mathematics — but we just ran them on our new evaluation built on elementary and high school math, and they get things wrong up to 52% of the time! Even Claude Fable 5 — Anthropic's newest model — has an error rate of 16.4%*.

Why are frontier models still stumbling on grade-school math reasoning when they can already solve complex research-level math?

👉 As it turns out, while reasoning models excel at producing solutions to reasoning problems, we find that they still struggle to evaluate solutions, even for grade-school math. We call this the Production-Evaluation Gap.

🚀 In our new paper, An Enigma of Artificial Reason, we study a question that has received insufficient attention thus far: Can Large Reasoning Models (LRMs) reliably evaluate reasoning, or are they just really good at producing it? 🚀

To find out, we built the Valid-Answer-Invalid-Reasoning (VAIR) dataset. We derived this benchmark from GSM8K and MATH, math datasets that LLMs saturated long ago in terms of solution accuracy. Yet on our reasoning evaluation benchmark, frontier models exhibit sharp drops in accuracy: Claude Opus 4.7, GPT 5.4, DeepSeek R1, and Gemini 3.1 Pro all score 95–99% when producing solutions, but their accuracy collapses to 48–79% when asked to evaluate flawed reasoning.

Sun Ming Zhong@SMZ_0001

💡 Why does this matter?

As people increasingly use frontier models to write research papers, produce proof attempts, or generate persuasive arguments, this gap between producing arguments and vigilantly assessing them becomes a societal vulnerability, not just a technical one: If AI can produce plausible-sounding reasoning at scale, but not help us weed out what’s actually invalid, our ability to do science and make sense of the world may be significantly harmed.

How might we address this gap? In The Enigma of Reason (2017) — one of the inspirations for our work — the cognitive scientists Hugo Mercier and Dan Sperber suggest that human reasoning evolved via social incentives, and that being critical evaluators allows us to gain the benefits of others’ thinking while avoiding being misled. In contrast, AI models are trained to reason in isolation, resulting in very different incentives. By learning from human cognition, we could potentially reduce the production-evaluation gap.

(*Results on Fable 5 are freshly run, and not yet included in our paper.)

🤝 Joint work with Teresa Yeo (@aseretys), Armando Solar-Lezama, and Tan Zhi-Xuan (@xuanalogue).

📄 Paper: https://arxiv.org/abs/2606.01462. #LLMs #LRMs #Reasoning #AI4MATH #CogSci

xuan (ɕɥɛn / sh-yen)@xuanalogue

So perhaps by learning from how humans evolved to reason with each other, we can train models to be more epistemically vigilant like humans are, and not just solipsistic outcome-focused reasoners.

Paper: https://arxiv.org/abs/2606.01462

Sun Ming Zhong@SMZ_0001

Of course, CoTs need not be faithful to what the evaluator models are doing under the hood, so we also find mechanistic evidence of the bias at work:

🔸Linear Probes: Using probes trained on LRM activations, we find that these activations encode some representation of valid reasoning. However, we also find that these internal representations get corrupted and dynamically overridden by the presence of a valid final answer in VAIR solutions.

🔸Causal Patching: By swapping the hidden states of a valid answer token with an invalid one, we can causally flip the model's validity verdict and activations: once the model detects that the final answer is wrong, it is no longer inclined to judge invalid reasoning steps as valid.

Together, these results demonstrate the operation of answer confirmation bias at inference time. But is this bias ultimately the result of outcome-focused incentives at training time, as we hypothesize? We plan to investigate this in future work.
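For readers unfamiliar with linear probing, here is a minimal sketch of the general technique: fit a logistic-regression classifier on hidden-state vectors and check whether it can read off a "valid reasoning" label. This is not the paper's code. The activations below are synthetic (random vectors shifted along one made-up direction), and every name (`make_synthetic_activations`, `train_probe`, `probe_accuracy`) is hypothetical. With a real model, the vectors would come from the LRM's hidden states instead.

```python
import math
import random

def make_synthetic_activations(n, dim, seed=0):
    """Stand-in for LRM hidden states (assumption: NOT real model activations).
    Label 1 = valid reasoning, label 0 = invalid; the label shifts the mean
    of the vector along one fixed, made-up direction."""
    rng = random.Random(seed)
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    data = []
    for k in range(n):
        label = k % 2
        sign = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0.0, 1.0) + sign * d for d in direction]
        data.append((x, label))
    return data

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(data, epochs=50, lr=0.1):
    """Fit a linear probe (logistic regression) with plain SGD."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_accuracy(w, b, data):
    """Fraction of held-out vectors whose label the probe predicts correctly."""
    hits = 0
    for x, y in data:
        predicted_valid = sum(wi * xi for wi, xi in zip(w, x)) + b > 0
        hits += predicted_valid == (y == 1)
    return hits / len(data)

train = make_synthetic_activations(200, 8, seed=0)
held_out = make_synthetic_activations(200, 8, seed=1)
w, b = train_probe(train)
acc = probe_accuracy(w, b, held_out)
print(f"held-out probe accuracy: {acc:.2f}")
```

High held-out accuracy only shows that the label is linearly decodable from the vectors. It does not show that the model uses that direction, which is why a causal intervention such as patching is a natural complement.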

Sun Ming Zhong@SMZ_0001

🔍 What is VAIR?

To isolate a model's ability to evaluate reasoning — and prevent it from using its ability to produce reasoning to solve our tasks — we took gold-standard math solutions and injected trivial reasoning flaws (e.g., deleting an essential step, shuffling the order) while keeping the final answer perfectly valid.
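As a rough illustration of the two flaw types named above (deleting a step, shuffling the order), here is a minimal sketch. It is not the actual VAIR construction pipeline: the `Solution` type, both function names, and the pencil problem are made up for this example. The one property it mirrors from the description is that the final answer is never touched.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Solution:
    steps: List[str]     # intermediate reasoning steps, in order
    final_answer: str    # left untouched by every perturbation below

def delete_step(sol: Solution, index: int) -> Solution:
    """Drop one reasoning step while keeping the final answer."""
    steps = sol.steps[:index] + sol.steps[index + 1:]
    return Solution(steps, sol.final_answer)

def shuffle_steps(sol: Solution, seed: int = 0) -> Solution:
    """Reorder the steps so the derivation no longer follows logically."""
    if len(set(sol.steps)) < 2:
        # Nothing to reorder: return an unchanged copy.
        return Solution(list(sol.steps), sol.final_answer)
    rng = random.Random(seed)
    steps = list(sol.steps)
    while steps == sol.steps:  # retry until the order actually changes
        rng.shuffle(steps)
    return Solution(steps, sol.final_answer)

# Hypothetical grade-school example (not taken from GSM8K or MATH).
gold = Solution(
    steps=[
        "3 boxes * 12 pencils = 36 pencils",
        "36 pencils - 6 pencils = 30 pencils",
    ],
    final_answer="30",
)

missing_step = delete_step(gold, 0)
out_of_order = shuffle_steps(gold)
print(missing_step)
print(out_of_order)
```

Both perturbed solutions still end in "30", so a grader that only checks the answer cannot tell them apart from the gold solution.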

If a model actually reads and evaluates the steps, it should detect the flaw and judge the solution as containing invalid reasoning. But if the model just solves the problem on its own and verifies that the final answer is correct, it will incorrectly judge the solution as valid.

When we subject frontier LRMs to our VAIR dataset, their evaluations consistently break down. Despite near-perfect solution accuracy, these models suffer massive performance drops when evaluating VAIR solutions (as low as 48%).
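To make the failure mode concrete, here is a small sketch of how a grader that only checks the final answer would score, assuming every item in the set is a VAIR solution (so the correct verdict is always "invalid reasoning"). The judge, the two items, and the gap helper are hypothetical illustrations, not the paper's evaluation code, and the solving accuracy passed to the gap helper is a placeholder.

```python
from typing import Callable, List, Tuple

# One item: (reasoning steps, solution's final answer, reference answer).
Item = Tuple[List[str], str, str]

def answer_only_judge(steps: List[str], final_answer: str, reference: str) -> bool:
    """Returns True ("reasoning is valid") whenever the final answer matches.
    It never looks at the steps, which is the shortcut described above."""
    return final_answer == reference

def vair_accuracy(judge: Callable[[List[str], str, str], bool], items: List[Item]) -> float:
    """Assumes every item has flawed reasoning, so only a False verdict is correct."""
    correct = sum(1 for steps, ans, ref in items if not judge(steps, ans, ref))
    return correct / len(items)

def production_evaluation_gap(solve_accuracy: float, eval_accuracy: float) -> float:
    """Accuracy at producing solutions minus accuracy at evaluating them."""
    return solve_accuracy - eval_accuracy

# Two made-up VAIR-style items: right answer, broken derivation.
items: List[Item] = [
    (["36 pencils - 6 pencils = 30 pencils"], "30", "30"),    # essential step deleted
    (["x = 4", "2x + 1 = 9", "so 2x = 8"], "4", "4"),         # steps out of order
]

eval_acc = vair_accuracy(answer_only_judge, items)
gap = production_evaluation_gap(1.0, eval_acc)  # 1.0 is a placeholder solving accuracy
print(eval_acc, gap)
```

A judge that actually reads the steps would return False on both items and score 1.0, closing the gap.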

For context, human reasoners are significantly more robust at our task; we found that humans are only 6% worse at grading these tricky problems than they are at solving them, consistent with research in cognitive science suggesting that people are incentivized to be epistemically vigilant against misleading arguments.

Sun Ming Zhong@SMZ_0001

🧠 Why does this happen?

We trace this back to an Answer Confirmation Bias.

Because LRMs are predominantly trained via outcome-based reinforcement learning, they are incentivized to produce correct answers, not to validate reasoning. One hypothesis, then, is that this training ends up biasing LRMs to rely on answer validity when evaluating reasoning.

Indeed, when we inspect the verbalized “chain-of-thought” (CoT) produced by our evaluator models, we find many symptoms of answer confirmation bias: When evaluating a VAIR solution, CoT wording indicates that LRMs often (i) solve the problem independently, (ii) verify that the solution’s answer is valid, then (iii) overlook reasoning flaws or fabricate rationalizations for the solution’s broken logic.

xuan (ɕɥɛn / sh-yen)@xuanalogue

Per our paper's title, we were inspired by Mercier & Sperber's "Enigma of Reason", which shows that humans are often lazy & biased at producing (verbal) reasons, but exercise epistemic vigilance when evaluating reasons, and gives a social evolutionary account for why.

xuan (ɕɥɛn / sh-yen)@xuanalogue

This bias is consistent with outcome-based reasoning training. Unlike humans who face incentives to be vigilant against bad arguments, LRMs are mostly trained to reach the right answers via any means. In the future, we hope to investigate if outcome-based RL is indeed the cause!

xuan (ɕɥɛn / sh-yen)@xuanalogue

This gap is surprising: surely frontier models, which easily solve grade school math and now even solve Erdős problems, should be able to reliably evaluate solutions to the former? Turns out not, once we decouple answer validity from reasoning validity, as in our VAIR dataset.

xuan (ɕɥɛn / sh-yen)@xuanalogue

Alright, here are the results of our eval + paper! More thoughts to come in a separate thread!

xuan (ɕɥɛn / sh-yen)@xuanalogue

In contrast, people largely outperform LRMs at grading such flawed solutions as flawed, and exhibit a small 6% perf. drop relative to problem solving while taking less time -- broadly consistent with prior findings that humans are better at evaluating reasoning than producing it.

xuan (ɕɥɛn / sh-yen)@xuanalogue

Why this enigma in LRMs? We find an "answer confirmation bias" at inference time: Evaluator CoTs seem to solve the problem, verify the (valid) answer, then rationalize away flawed reasoning. Linear probes & causal patching give further evidence for this.

Vankous@vankous

@SMZ_0001 @threadreaderapp unroll

Wenzhuo Wu@wu_wenzhuo67809

@SMZ_0001 Wow this is new!!

Wenzhuo Wu@wu_wenzhuo67809

@SMZ_0001 🔥🔥🔥

Thread Reader App@threadreaderapp

@vankous @SMZ_0001 @vankous Hallo, here is your unroll: https://threadreaderapp.com/thread/2066013712398549444.html Have a good day. 🤖

Vorname MitD@vornamemitd

@SMZ_0001 Great work and really appreciate the Mercier/Sperber reference - yet another indicator of our current training/alignment regimes being potentially flawed or misguided...
