Furong Huang of the University of Maryland launches SoundnessBench to evaluate whether AI research agents can judge scientific soundness
The benchmark dataset is now open-sourced on Hugging Face.
My first reaction to the evaluation results of frontier models on SoundnessBench was relief: Good, I’m not replaced by AI... yet 😂
My second reaction was less comforting: what are my false-positive and false-negative rates as an advisor/reviewer?
How often do I over-encourage weak ideas — or over-criticize good ones? 😱
The AI Scientist can write code, run experiments, and draft papers. But can it do the thing every good advisor does before compute is spent: reject a bad research plan? Introducing SoundnessBench: a benchmark for scientific soundness judgment. Project: https://hosytuyen.github.io/projects/SoundnessBench Dataset: https://huggingface.co/datasets/hosytuyen/SoundnessBench Paper: https://arxiv.org/abs/2605.30329
1/n The current AI Scientist excitement is about automating the research loop: hypotheses → code → experiments → paper. But there is a missing first gate: before running anything, is the hypothesis-test pair actually methodologically sound?
2/n Why this matters: automation changes the cost of bad judgment. A human may waste weeks on a flawed experiment. An autonomous agent can scale that mistake overnight. Without soundness filters, we do not just accelerate science; we risk accelerating plausible-looking bad science.

3/n SoundnessBench isolates one narrow but crucial skill. Not novelty. Not impact. Not whether the paper was accepted. Just: given the hypothesis and experiment plan, can the proposed study actually test the claim rigorously?
4/n We reconstruct 1,099 real ML research proposals from ICLR history: 458 low-soundness and 641 high-soundness. The input is proposal-only: hypothesis + related work + experiments, with experimental results removed.
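To put the class split in context, here is a minimal sketch of the arithmetic, using only the counts quoted above. The "always approve" baseline is a hypothetical reference point added for illustration; it is not a result reported in the thread.

```python
# Counts quoted in the thread.
LOW_SOUNDNESS = 458
HIGH_SOUNDNESS = 641
TOTAL = LOW_SOUNDNESS + HIGH_SOUNDNESS  # 1,099 proposals


def share(count: int, total: int = TOTAL) -> float:
    """Fraction of the benchmark that a class accounts for."""
    return count / total


# Hypothetical baseline (not from the thread): a judge that approves every
# proposal is correct exactly on the high-soundness ones.
always_approve_accuracy = share(HIGH_SOUNDNESS)

print(f"low-soundness share:     {share(LOW_SOUNDNESS):.1%}")
print(f"high-soundness share:    {share(HIGH_SOUNDNESS):.1%}")
print(f"always-approve accuracy: {always_approve_accuracy:.1%}")
```

Because the split is roughly 42/58, a judge that rubber-stamps everything already scores about 58% plain accuracy, which is why per-class error rates are more informative here than a single accuracy number.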
5/n To make the benchmark auditable, we start from 35k+ ICLR submissions and 137k+ expert reviews, keep high-agreement cases, extract proposals near-verbatim, and verify atomic claims with retrieval-backed checking.
6/n Then we ask 12 frontier LLMs to judge scientific soundness. The main failure mode is optimism bias, perhaps similar to sycophancy. Under standard prompting, low-soundness proposals are falsely labeled high-soundness 74% of the time.
7/n Some models behave like extremely permissive first-gate reviewers. 9/12 models classify >70% of low-soundness proposals as sound. LLaMA-3.3-70B approves 98.0% of flawed proposals; GPT-4o approves 94.5%.
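The "approves X% of flawed proposals" figures are false-positive rates measured on the low-soundness class. Below is a minimal sketch of how such a rate, and the matching high-soundness recall, could be computed from gold labels and model verdicts. The label strings, function names, and toy data are assumptions for illustration, not the benchmark's actual schema or evaluation code.

```python
from typing import Sequence

# Hypothetical label names; the real dataset may encode labels differently.
LOW, HIGH = "low", "high"


def _rate_on_class(gold: Sequence[str], pred: Sequence[str],
                   gold_class: str, pred_class: str) -> float:
    """Share of items with gold label `gold_class` that were predicted `pred_class`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    selected = [p for g, p in zip(gold, pred) if g == gold_class]
    if not selected:
        raise ValueError(f"no items with gold label {gold_class!r}")
    return sum(p == pred_class for p in selected) / len(selected)


def false_positive_rate(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Share of low-soundness proposals the judge labeled high-soundness."""
    return _rate_on_class(gold, pred, LOW, HIGH)


def high_soundness_recall(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Share of high-soundness proposals the judge labeled high-soundness."""
    return _rate_on_class(gold, pred, HIGH, HIGH)


# Toy data (made up): four flawed proposals, two sound ones.
gold = [LOW, LOW, LOW, LOW, HIGH, HIGH]
pred = [HIGH, HIGH, HIGH, LOW, HIGH, LOW]

fpr = false_positive_rate(gold, pred)       # 3 of 4 flawed proposals approved
recall = high_soundness_recall(gold, pred)  # 1 of 2 sound proposals approved
print(f"false-positive rate: {fpr:.1%}, high-soundness recall: {recall:.1%}")
```

Reporting the two rates separately keeps a permissive judge from hiding behind a high recall on the sound proposals.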

8/n Here is a false positive example.
9/n Here is another false positive.
10/n Maybe the fix is simple: tell the model to be more skeptical? We tried an aggressive prompt: default to low soundness unless the proposal is clearly strong. False positives fall from 74.0% to 19.9%. Sounds good—until the next number.
11/n With aggressive prompting, high-soundness recall collapses from 91.8% to 36.1%. So prompting does not teach scientific taste. It mostly shifts the model from rubber-stamping weak ideas to over-rejecting good ones.
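One way to see the trade-off in 10/n and 11/n is to fold each prompt's two numbers into a balanced accuracy, the mean of the per-class recalls. Balanced accuracy is not a metric reported in the thread; this sketch also assumes the false-positive rate and the recall for each prompt come from the same pool of models and proposals.

```python
def balanced_accuracy(false_positive_rate: float, high_recall: float) -> float:
    """Mean of low-class recall (1 - false-positive rate) and high-class recall."""
    for name, value in (("false_positive_rate", false_positive_rate),
                        ("high_recall", high_recall)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    low_recall = 1.0 - false_positive_rate
    return (low_recall + high_recall) / 2.0


# Figures quoted in the thread for the two prompting regimes.
standard = balanced_accuracy(false_positive_rate=0.740, high_recall=0.918)
aggressive = balanced_accuracy(false_positive_rate=0.199, high_recall=0.361)

print(f"standard prompt:   {standard:.1%}")
print(f"aggressive prompt: {aggressive:.1%}")
```

Under these assumptions both prompts land at roughly 58 to 59%, which fits the thread's reading: the skeptical prompt moves the decision threshold rather than improving the model's ability to tell sound proposals from flawed ones.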

12/n Here is a false negative example under aggressive prompting.
13/n This is why SoundnessBench is timely. The community is building agents that can execute the research loop. We test a complementary ability: can the agent decide which loop should never be run?
14/n A reliable AI Scientist needs more than coding, plotting, and paper-writing. It needs pre-compute scientific triage: catching bad baselines, leakage, mismatched metrics, unfalsifiable claims, and experiment plans that cannot support the hypothesis.
15/n Takeaway: autonomy without soundness is fragile. An agent that can run 1,000 experiments still needs to know which 990 were bad ideas. SoundnessBench is a step toward measuring, and training, that judgment.
16/n Big thanks to @hosytuyen @minghuiliu95 and Huy Nghiem 💐 Project: https://hosytuyen.github.io/projects/SoundnessBench Dataset: https://huggingface.co/datasets/hosytuyen/SoundnessBench Paper: https://arxiv.org/abs/2605.30329
Corrections to GPT-5.4 Thinking and Claude Opus 6.4 results
Excellent thread, with cool insights into how brittle prompts and, most importantly, LLMs are when judging the soundness of scientific research.