
Furong Huang of the University of Maryland launches SoundnessBench to evaluate whether AI research agents can judge scientific soundness

The benchmark dataset is now open-sourced on Hugging Face.

Original post

The AI Scientist can write code, run experiments, and draft papers. But can it do the thing every good advisor does before compute is spent: reject a bad research plan? Introducing SoundnessBench: a benchmark for scientific soundness judgment. Project: https://hosytuyen.github.io/projects/SoundnessBench Dataset: https://huggingface.co/datasets/hosytuyen/SoundnessBench Paper: https://arxiv.org/abs/2605.30329

6:33 PM · May 29, 2026

My first reaction seeing the evaluation results of frontier models against SoundnessBench was relief: Good, I’m not replaced by AI, …yet…😂

My second reaction was less comforting: what are my false-positive and false-negative rates as an advisor/reviewer?

How often do I over-encourage weak ideas — or over-criticize good ones? 😱

Furong Huang @furongh

The AI Scientist can write code, run experiments, and draft papers. But can it do the thing every good advisor does before compute is spent: reject a bad research plan? Introducing SoundnessBench: a benchmark for scientific soundness judgment. Project: https://hosytuyen.github.io/projects/SoundnessBench Dataset: https://huggingface.co/datasets/hosytuyen/SoundnessBench Paper: https://arxiv.org/abs/2605.30329

1:33 AM · May 30, 2026 · 5.8K Views
4:13 AM · May 30, 2026 · 1.1K Views

2/n Why this matters: automation changes the cost of bad judgment. A human may waste weeks on a flawed experiment. An autonomous agent can scale that mistake overnight. Without soundness filters, we do not just accelerate science—we risk accelerating plausible-looking bad science.

1:33 AM · May 30, 2026 · 424 Views

1/n The current AI Scientist excitement is about automating the research loop: hypotheses → code → experiments → paper. But there is a missing first gate: before running anything, is the hypothesis-test pair actually methodologically sound?

1:33 AM · May 30, 2026 · 553 Views

3/n SoundnessBench isolates one narrow but crucial skill. Not novelty. Not impact. Not whether the paper was accepted. Just: given the hypothesis and experiment plan, can the proposed study actually test the claim rigorously?

1:33 AM · May 30, 2026 · 124 Views

4/n We reconstruct 1,099 real ML research proposals from ICLR history: 458 low-soundness and 641 high-soundness. The input is proposal-only: hypothesis + related work + experiments, with experimental results removed.

1:33 AM · May 30, 2026 · 129 Views
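As a quick sanity check on the class balance quoted in 4/n, the sketch below computes each label's share and the accuracy of a trivial judge that approves everything. The counts come from the thread; the "always approve" baseline is an illustrative assumption, not something the authors report.

```python
# Counts quoted in 4/n of the thread.
LOW_SOUNDNESS = 458
HIGH_SOUNDNESS = 641
TOTAL = LOW_SOUNDNESS + HIGH_SOUNDNESS


def share(count: int, total: int) -> float:
    """Fraction of the benchmark that a class accounts for."""
    return count / total


# Hypothetical baseline (not from the paper): label every proposal high-soundness.
# It is right exactly on the high-soundness proposals and wrong on all the others.
always_approve_accuracy = share(HIGH_SOUNDNESS, TOTAL)
always_approve_false_positive_rate = 1.0

print(f"total proposals: {TOTAL}")
print(f"low-soundness share: {share(LOW_SOUNDNESS, TOTAL):.1%}")
print(f"always-approve accuracy: {always_approve_accuracy:.1%}")
```

Because the classes are uneven, plain accuracy can flatter a permissive judge, which is one reason to look at per-class error rates instead.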

6/n Then we ask 12 frontier LLMs to judge scientific soundness. The main failure mode is optimism bias -- maybe similar to sycophancy. Under standard prompting, low-soundness proposals are falsely labeled high-soundness 74% of the time.

1:33 AM · May 30, 2026 · 117 Views
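To make the "falsely labeled high-soundness" figure in 6/n concrete, here is a minimal sketch of how a false-positive rate and a high-soundness recall could be computed from gold labels and model verdicts. The label strings `"low"` and `"high"`, the function name, and the toy data are all assumptions for illustration; they are not taken from the benchmark's actual schema or evaluation code.

```python
from typing import Dict, Sequence


def soundness_rates(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Return the false-positive rate and high-soundness recall.

    A false positive is a low-soundness proposal that the judge labels "high".
    Label names are hypothetical, chosen only for this sketch.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    low_total = sum(g == "low" for g in gold)
    high_total = sum(g == "high" for g in gold)
    if low_total == 0 or high_total == 0:
        raise ValueError("both classes must be present")

    false_positives = sum(g == "low" and p == "high" for g, p in zip(gold, pred))
    true_positives = sum(g == "high" and p == "high" for g, p in zip(gold, pred))

    return {
        "false_positive_rate": false_positives / low_total,
        "high_recall": true_positives / high_total,
    }


# Made-up toy verdicts, not benchmark data.
toy_gold = ["low", "low", "low", "low", "high", "high"]
toy_pred = ["high", "high", "high", "low", "high", "low"]
toy_rates = soundness_rates(toy_gold, toy_pred)
print(toy_rates)
```

Reporting the two rates separately keeps a judge that approves nearly everything from hiding behind a high recall on good proposals.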

5/n To make the benchmark auditable, we start from 35k+ ICLR submissions and 137k+ expert reviews, keep high-agreement cases, extract proposals near-verbatim, and verify atomic claims with retrieval-backed checking.

1:33 AM · May 30, 2026 · 134 Views

7/n Some models behave like extremely permissive first-gate reviewers. 9/12 models classify >70% of low-soundness proposals as sound. LLaMA-3.3-70B approves 98.0% of flawed proposals; GPT-4o approves 94.5%.

1:33 AM · May 30, 2026 · 112 Views

8/n Here is a false positive example.

1:33 AM · May 30, 2026 · 90 Views

9/n Here is another false positive.

1:33 AM · May 30, 2026 · 89 Views

10/n Maybe the fix is simple: tell the model to be more skeptical? We tried an aggressive prompt: default to low soundness unless the proposal is clearly strong. False positives fall from 74.0% to 19.9%. Sounds good—until the next number.

1:33 AM · May 30, 2026 · 88 Views

11/n With aggressive prompting, high-soundness recall collapses from 91.8% to 36.1%. So prompting does not teach scientific taste. It mostly shifts the model from rubber-stamping weak ideas to over-rejecting good ones.

1:33 AM · May 30, 2026 · 100 Views
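The trade-off described in 10/n and 11/n can be checked with a little arithmetic. The sketch below plugs the four percentages quoted in the thread into balanced accuracy (the mean of the true-negative rate and the high-soundness recall). Balanced accuracy is a summary chosen here for illustration, not necessarily a metric the paper reports, and the sketch assumes the quoted figures are aggregate rates measured under the same two prompts.

```python
def balanced_accuracy(false_positive_rate: float, high_recall: float) -> float:
    """Mean of the true-negative rate (1 - FPR) and high-soundness recall."""
    true_negative_rate = 1.0 - false_positive_rate
    return (true_negative_rate + high_recall) / 2.0


# Figures quoted in 10/n and 11/n of the thread.
standard = balanced_accuracy(false_positive_rate=0.740, high_recall=0.918)
aggressive = balanced_accuracy(false_positive_rate=0.199, high_recall=0.361)

print(f"standard prompt:   {standard:.3f}")
print(f"aggressive prompt: {aggressive:.3f}")
print(f"difference:        {standard - aggressive:+.3f}")
```

Under these assumptions the two prompts land within about one point of each other, which is consistent with the thread's reading that the aggressive prompt mostly moves the decision threshold rather than improving discrimination.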

12/n Here is a false negative example under aggressive prompting.

1:33 AM · May 30, 2026 · 169 Views

15/n Takeaway: autonomy without soundness is fragile. An agent that can run 1,000 experiments still needs to know which 990 were bad ideas. SoundnessBench is a step toward measuring, and training, that judgment.

1:33 AM · May 30, 2026 · 127 Views

13/n This is why SoundnessBench is timely. The community is building agents that can execute the research loop. We test a complementary ability: can the agent decide which loop should never be run?

1:33 AM · May 30, 2026 · 87 Views

14/n A reliable AI Scientist needs more than coding, plotting, and paper-writing. It needs pre-compute scientific triage: catching bad baselines, leakage, mismatched metrics, unfalsifiable claims, and experiment plans that cannot support the hypothesis.

1:33 AM · May 30, 2026 · 98 Views

16/n Big thanks to @hosytuyen @minghuiliu95 and Huy Nghiem 💐 Project: https://hosytuyen.github.io/projects/SoundnessBench Dataset: https://huggingface.co/datasets/hosytuyen/SoundnessBench Paper: https://arxiv.org/abs/2605.30329

1:33 AM · May 30, 2026 · 121 Views

Corrections to GPT-5.4 Thinking and Claude Opus 6.4 results

1:44 AM · May 30, 2026 · 98 Views

My first reaction to the evaluation results of frontier models against SoundnessBench was relief: Good, I’m not replaced by AI, …yet…😂

My second reaction was less comforting: what are my false-positive and false-negative rates as an advisor/reviewer?

How often do I over-encourage weak ideas — or over-criticize good ones? 😱

4:11 AM · May 30, 2026 · 115 Views

Excellent thread and cool insights showing the brittleness of prompts and, most importantly, of LLMs in judging the soundness of scientific research.

2:18 AM · May 30, 2026 · 249 Views