ICML Paper Shows Truthfulness Does Not Scale Like Reasoning

Jessica Chudnovsky ✈️ ICML 2026 (@jchudnov)

Paper: https://arxiv.org/pdf/2603.06612

By @yegordb (co-lead), @JoshuaK92829 (co-lead), @jchudnov, @RylanSchaeffer, Soji Adeshina, Sheng Guan, and @sanmikoyejo.

Thank you to @stai_research, @StanfordAILab, and @stanfordnlp.

We wanted to know how deep that correlation runs, so we built a control experiment: we fed models 10,000 random ASCII strings and forced them to pick between A, B, C, and D. No knowledge involved, no right answer. They agreed anyway, with pairwise agreement as high as 0.35 on pure noise. That can't be explained by shared knowledge; it points to shared inductive biases baked into the weights themselves. Temperature doesn't help either: between T=0.7 and T=1.0, the plurality answer flips on only 2.9% of questions.
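
A minimal sketch of how pairwise agreement on such a control set could be measured. The thread does not say which agreement statistic was used, so both a raw match rate and a chance-corrected Cohen's kappa are shown; the function names and the toy answers are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations

def raw_agreement(a, b):
    """Fraction of items on which two answer lists match."""
    if len(a) != len(b) or not a:
        raise ValueError("answer lists must be non-empty and equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance, using each model's own answer frequencies."""
    n = len(a)
    p_obs = raw_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    p_exp = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_exp == 1.0:
        return 0.0  # both lists constant and identical; kappa is undefined, report 0
    return (p_obs - p_exp) / (1 - p_exp)

def pairwise(answers_by_model, metric):
    """Apply a metric to every unordered pair of models."""
    return {
        (m1, m2): metric(answers_by_model[m1], answers_by_model[m2])
        for m1, m2 in combinations(sorted(answers_by_model), 2)
    }

# Toy data (made up): three hypothetical models answering 8 noise prompts.
toy = {
    "model_a": list("ABCDABCD"),
    "model_b": list("ABCDBBCA"),
    "model_c": list("DCBADCBA"),
}
scores = pairwise(toy, raw_agreement)
for pair, value in scores.items():
    print(pair, value)
```

For independent, uniform guesses over four options the raw match rate would sit near 0.25 and kappa near 0, which is the baseline any observed agreement on noise has to be read against.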

This shows up even in math, where sampling famously works. On 53% of MATH questions where multiple models err, they converge on the SAME wrong answer. Aggregation succeeds in math because a verifier filters out the bad candidates, not because agreement itself signals truth.
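
To make the distinction concrete, here is a small sketch contrasting plain majority vote with verifier-filtered selection. The equation, the sampled answers, and the function names are all invented for illustration; real math pipelines use far richer verifiers.

```python
from collections import Counter

def majority_vote(candidates):
    """Return the most frequent candidate (first seen wins ties)."""
    return Counter(candidates).most_common(1)[0][0]

def verified_choice(candidates, verifier):
    """Return the most frequent candidate that passes the verifier, or None."""
    passing = [c for c in candidates if verifier(c)]
    return majority_vote(passing) if passing else None

# Hypothetical task: find x such that 3x + 7 = 25 (the correct answer is 6).
def check(x):
    return 3 * x + 7 == 25

# Made-up samples where the shared wrong answer (5) dominates.
samples = [5, 5, 5, 6, 4]

print(majority_vote(samples))           # the popular answer
print(verified_choice(samples, check))  # the answer that survives checking
```

Here the popular answer is wrong and only the check recovers the right one. When no candidate passes, `verified_choice` returns `None` instead of falling back to consensus.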

The intuition here is the wisdom of crowds: poll a model 25 times, or poll 5 different models, and individual mistakes should dissipate in the aggregate. This isn’t what happens: across four benchmarks (BoolQ, Com2Sense, HLE, plus a smaller new forecasting set we built) and five open models, no aggregation method consistently beat a single-sample baseline, even at up to 25x the inference cost.
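
The wisdom-of-crowds intuition can be written down exactly for binary questions when votes are independent. The sketch below computes the probability that a strict majority is correct; the per-voter accuracies 0.6 and 0.4 are illustrative values, not figures from the paper.

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a strict majority of n independent voters is correct.

    Each voter is right with probability p; n must be odd so ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

# Compare crowds of 1, 5, and 25 voters above and below chance.
for n in (1, 5, 25):
    above = majority_accuracy(0.6, n)
    below = majority_accuracy(0.4, n)
    print(f"n={n:2d}  p=0.6 -> {above:.3f}   p=0.4 -> {below:.3f}")
```

Under independence, more votes help only when each voter is better than chance; below chance, the same arithmetic drives the majority further from the truth.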

We tried many aggregation methods: majority vote, confidence weighting, prediction weighting, and the Surprisingly Popular algorithm. Methods that helped on one benchmark hurt on another, and on forecasting questions past the knowledge cutoff, every single one performed at chance.

On HLE, picking the answer that the Surprisingly Popular signal votes AGAINST reaches 80% accuracy. The "wisdom" signal here is not merely noisy; it points backwards. And no, you can't exploit that: the surprise signal points toward truth on some benchmarks, away from it on others, and nowhere on forecasting.
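
For readers unfamiliar with the Surprisingly Popular rule, here is a minimal sketch of its usual formulation: each respondent gives an answer plus a predicted vote share for each option, and the rule picks the option whose actual share most exceeds its mean predicted share. How the predictions were elicited from models is not described in the thread, so the data layout and numbers below are assumptions.

```python
def surprisingly_popular(votes, predictions):
    """Pick the option whose actual vote share most exceeds its predicted share.

    votes: list of chosen options, one per respondent.
    predictions: list of dicts mapping option -> predicted share of votes.
    Returns (chosen_option, gap_by_option).
    """
    if not votes or len(votes) != len(predictions):
        raise ValueError("need one prediction dict per vote")
    options = sorted(set(votes) | {o for p in predictions for o in p})
    n = len(votes)
    actual = {o: votes.count(o) / n for o in options}
    predicted = {o: sum(p.get(o, 0.0) for p in predictions) / n for o in options}
    gap = {o: actual[o] - predicted[o] for o in options}
    return max(options, key=lambda o: gap[o]), gap

# Made-up example: 6 of 10 respondents say "yes", but everyone expects
# "yes" to be even more popular than it turns out to be.
votes = ["yes"] * 6 + ["no"] * 4
predictions = [{"yes": 0.9, "no": 0.1}] * 6 + [{"yes": 0.7, "no": 0.3}] * 4

choice, gap = surprisingly_popular(votes, predictions)
print(choice, gap)
```

In this toy case the rule overrides the majority and selects "no". Whether that override moves toward or away from the truth depends entirely on whether the minority holds better information, which is the property the HLE result calls into question.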

The reason is structural. Crowd wisdom rests on one key assumption: errors need to be mostly independent. Human crowds get that independence from diverse experiences and information sources. LLMs trained on overlapping corpora with similar objectives don't. When one model is confidently wrong, the others tend to be wrong in the same way. On HLE, models agree with each other (+0.4 to +0.6) while their answers are anti-correlated with truth.
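
The role of independence can be illustrated with a deliberately crude simulation. With probability `rho` every voter copies one shared draw; otherwise each voter is right independently with probability `p`. This mixture model, its parameter names, and the chosen values are assumptions for illustration and are not the paper's analysis.

```python
import random

def simulate_majority(p, n_voters, rho, trials=20000, seed=0):
    """Estimate majority-vote accuracy when errors are partly shared.

    With probability rho a trial is "shared": all voters copy one common
    draw, right with probability p. Otherwise voters are independent.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        if rng.random() < rho:
            votes_right = n_voters if rng.random() < p else 0
        else:
            votes_right = sum(rng.random() < p for _ in range(n_voters))
        correct += votes_right * 2 > n_voters  # strict majority is right
    return correct / trials

# 25 voters, each 60% accurate, with increasing amounts of shared error.
for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}  majority accuracy ~ {simulate_majority(0.6, 25, rho):.3f}")
```

As `rho` approaches 1 the crowd collapses to a single voter, so extra samples buy nothing beyond the single-sample accuracy.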

With confidence-based voting, none of the weighting signals helped: not self-reported confidence, not predicted popularity, and not the surprise gap (Surprisingly Popular). All of these signals measure the same thing: expected consensus. Confidence rises much faster than accuracy on every benchmark, and models keep agreeing (85-100%) even when their stated confidence falls toward 50.
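
A minimal sketch of confidence-weighted voting, assuming each sample comes with a confidence in [0, 1]. The function name and the toy numbers are hypothetical; the exact weighting scheme used in the paper is not given in the thread.

```python
from collections import defaultdict

def confidence_weighted_vote(answers, confidences):
    """Sum confidence per answer and return the answer with the largest total."""
    if not answers or len(answers) != len(confidences):
        raise ValueError("need one confidence per answer")
    totals = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)

answers = ["A", "A", "A", "B", "B"]

# Case 1 (made up): confidence tracks consensus, so weighting changes nothing.
tracks_consensus = confidence_weighted_vote(answers, [0.9, 0.9, 0.8, 0.6, 0.55])

# Case 2 (made up): the minority is far more confident, so the result flips.
minority_confident = confidence_weighted_vote(answers, [0.5, 0.5, 0.5, 0.95, 0.95])

print(tracks_consensus, minority_confident)
```

Weighting can only change the outcome when confidence carries information that the raw vote counts lack. If confidence mostly encodes expected consensus, the first case is the typical one and the weighted vote reproduces the majority.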

We find that models are substantially better at predicting what other models will say than at identifying what's true. Even self-reported confidence tracks expected agreement about as strongly as it tracks accuracy, so weighting votes by confidence mostly amplifies the dominant misconception rather than correcting it. Social prediction and truth verification are different capabilities, and most aggregation rules are built on the former.

Our takeaway: inference-time compute scales reasoning when verification is available, but it doesn't scale truth. Getting truthful models will require grounding, genuine diversity, or trained verifiers. More samples from the same prior just make the misconceptions louder.
