Furong Huang of the University of Maryland launches SoundnessBench to evaluate whether AI research agents can judge scientific soundness

My first reaction to the evaluation results of frontier models against SoundnessBench was relief: Good, I’m not replaced by AI, …yet…😂

My second reaction was less comforting: what are my false-positive and false-negative rates as an advisor/reviewer?

How often do I over-encourage weak ideas — or over-criticize good ones? 😱

Furong Huang@furongh

The AI Scientist can write code, run experiments, and draft papers.

But can it do the thing every good advisor does before compute is spent: reject a bad research plan?

Introducing SoundnessBench: a benchmark for scientific soundness judgment.

Project:https://hosytuyen.github.io/projects/SoundnessBench Dataset:https://huggingface.co/datasets/hosytuyen/SoundnessBench Paper:https://arxiv.org/abs/2605.30329

Furong Huang@furongh

7/n Some models behave like extremely permissive first-gate reviewers.

9/12 models classify >70% of low-soundness proposals as sound.

LLaMA-3.3-70B approves 98.0% of flawed proposals; GPT-4o approves 94.5%.

Furong Huang@furongh

6/n Then we ask 12 frontier LLMs to judge scientific soundness.

The main failure mode is optimism bias -- maybe similar to sycophancy.

Under standard prompting, low-soundness proposals are falsely labeled high-soundness 74% of the time.
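The two statistics quoted in 6/n and 7/n (a false-positive rate on low-soundness proposals, and a count of models above a 70% threshold) can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the `Judgment` tuple format, the function names, and the toy data are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Each judgment is (gold_label, predicted_label); both are "low" or "high".
# This record format is an assumption for illustration only.
Judgment = Tuple[str, str]


def false_positive_rate(judgments: List[Judgment]) -> float:
    """Share of gold low-soundness proposals that the judge labeled high."""
    low = [pred for gold, pred in judgments if gold == "low"]
    if not low:
        raise ValueError("no low-soundness proposals to score")
    return sum(pred == "high" for pred in low) / len(low)


def count_permissive(per_model: Dict[str, List[Judgment]], threshold: float = 0.70) -> int:
    """Number of models whose false-positive rate exceeds the threshold."""
    return sum(false_positive_rate(j) > threshold for j in per_model.values())


# Hypothetical toy data, not taken from the benchmark.
toy = {
    "model_a": [("low", "high"), ("low", "high"), ("low", "high"), ("low", "low"), ("high", "high")],
    "model_b": [("low", "low"), ("low", "high"), ("low", "low"), ("low", "low"), ("high", "low")],
}
for name, judgments in toy.items():
    print(name, false_positive_rate(judgments))
print("permissive models:", count_permissive(toy))
```

On the toy data, `model_a` approves three of four flawed proposals (0.75) and is the only model counted as permissive.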

Sachin Kumar@shocheen

@furongh Cool paper! You might find our paper relevant: https://arxiv.org/abs/2510.16234

We created frameworks for evaluating soundness and contributions of research ideas/plans, along with a dataset created with a very similar recipe as your paper.

Bodhisattwa Majumder@mbodhisattwa

Measuring scientific soundness is extremely hard. It is personalized and depends on the context and on the intent/goal of the research. That is precisely why soundness verdicts rarely match among independent referees.

However, even a noisy signal could still be very useful to RL in science LMs, check out our work!

Tuhin Chakrabarty@TuhinChakr

Excellent thread and cool insights showing the brittleness of prompts and, most importantly, of LLMs in judging soundness in scientific research.

Cody Blakeney@code_star

Models be like:

LGTM

Furong Huang@furongh

5/n To make the benchmark auditable, we start from 35k+ ICLR submissions and 137k+ expert reviews, keep high-agreement cases, extract proposals near-verbatim, and verify atomic claims with retrieval-backed checking.

Furong Huang@furongh

4/n We reconstruct 1,099 real ML research proposals from ICLR history: 458 low-soundness and 641 high-soundness.

The input is proposal-only: hypothesis + related work + experiments, with experimental results removed.
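As a quick sanity check on the class balance in 4/n, the sketch below derives the split and the score of a trivial judge that approves everything. The "always high" baseline is a hypothetical reference point added here for illustration; it is not something the thread reports.

```python
# Class counts reported in the thread (4/n).
LOW_SOUNDNESS = 458
HIGH_SOUNDNESS = 641
TOTAL = LOW_SOUNDNESS + HIGH_SOUNDNESS  # 1,099 proposals


def always_high_baseline(n_low: int, n_high: int) -> dict:
    """Metrics for a hypothetical judge that labels every proposal high-soundness."""
    total = n_low + n_high
    return {
        "accuracy": n_high / total,    # only the high-soundness items are right
        "false_positive_rate": 1.0,    # every low-soundness item is approved
        "high_soundness_recall": 1.0,  # every high-soundness item is approved
    }


baseline = always_high_baseline(LOW_SOUNDNESS, HIGH_SOUNDNESS)
print(f"total proposals: {TOTAL}")
print(f"low-soundness share: {LOW_SOUNDNESS / TOTAL:.1%}")
print(f"always-high accuracy: {baseline['accuracy']:.1%}")
```

Low-soundness proposals make up about 41.7% of the set, so a judge that rubber-stamps everything would still reach about 58.3% raw accuracy. That is one reason per-class rates are more informative here than accuracy alone.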

Guowei Xu@Kevin_GuoweiXu

@furongh This would be super interesting! If a model can judge a proposal well, then it has the potential to be creative and have better taste.

Furong Huang@furongh

14/n A reliable AI Scientist needs more than coding, plotting, and paper-writing. It needs pre-compute scientific triage: catching bad baselines, leakage, mismatched metrics, unfalsifiable claims, and experiment plans that cannot support the hypothesis.

Furong Huang@furongh

13/n This is why SoundnessBench is timely. The community is building agents that can execute the research loop. We test a complementary ability: can the agent decide which loop should never be run?

Furong Huang@furongh

12/n Here is a false negative example under aggressive prompting.

Guowei Xu@Kevin_GuoweiXu

@furongh So we can actually somehow track the progress of model creativity with this benchmark.

Furong Huang@furongh

16/n Big thanks to @hosytuyen @minghuiliu95 and Huy Nghiem 💐 Project:https://hosytuyen.github.io/projects/SoundnessBench Dataset:https://huggingface.co/datasets/hosytuyen/SoundnessBench Paper:https://arxiv.org/abs/2605.30329

Furong Huang@furongh

15/n Takeaway: autonomy without soundness is fragile. An agent that can run 1,000 experiments still needs to know which 990 were bad ideas. SoundnessBench is a step toward measuring, and training, that judgment.

Furong Huang@furongh

10/n Maybe the fix is simple: tell the model to be more skeptical? We tried an aggressive prompt: default to low soundness unless the proposal is clearly strong. False positives fall from 74.0% to 19.9%. Sounds good—until the next number.

Furong Huang@furongh

9/n Here is another false positive.

John (Yueh-Han) Chen@jcyhc_ai

Cool work! We released a similar paper in case you might be interested: Predicting Empirical AI Research Outcomes with Language Models [NeurIPS 2025, arXiv:2506.00794], where we built a similar benchmark and showed that LLMs can be trained to forecast the outcomes of empirical AI research better than human experts in the NLP domain.

Furong Huang@furongh

3/n SoundnessBench isolates one narrow but crucial skill. Not novelty. Not impact. Not whether the paper was accepted. Just: given the hypothesis and experiment plan, can the proposed study actually test the claim rigorously?

Furong Huang@furongh

2/n Why this matters: automation changes the cost of bad judgment. A human may waste weeks on a flawed experiment. An autonomous agent can scale that mistake overnight. Without soundness filters, we do not just accelerate science—we risk accelerating plausible-looking bad science.

Furong Huang@furongh

11/n With aggressive prompting, high-soundness recall collapses from 91.8% to 36.1%.

So prompting does not teach scientific taste.

It mostly shifts the model from rubber-stamping weak ideas to over-rejecting good ones.
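One way to see the "shift, not improvement" point is to fold the two reported rates into a single class-balanced score. The sketch below uses balanced accuracy (the mean of low-soundness and high-soundness recall). That metric is an assumption chosen here for illustration and may not be what the paper reports. The inputs are the aggregate rates quoted in 10/n and 11/n.

```python
def balanced_accuracy(false_positive_rate: float, high_recall: float) -> float:
    """Mean of the two per-class recalls for a binary low/high judgment."""
    low_recall = 1.0 - false_positive_rate  # low-soundness proposals correctly rejected
    return (low_recall + high_recall) / 2.0


# Aggregate rates quoted in the thread (10/n and 11/n).
standard = balanced_accuracy(false_positive_rate=0.740, high_recall=0.918)
aggressive = balanced_accuracy(false_positive_rate=0.199, high_recall=0.361)

print(f"standard prompt:   {standard:.3f}")
print(f"aggressive prompt: {aggressive:.3f}")
```

Under this summary the standard prompt scores about 0.589 and the aggressive prompt about 0.581. Both are close to the 0.5 a coin flip would give, which is consistent with the prompt moving the decision threshold rather than improving discrimination.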

Furong Huang@furongh

1/n The current AI Scientist excitement is about automating the research loop: hypotheses → code → experiments → paper.

But there is a missing first gate: before running anything, is the hypothesis-test pair actually methodologically sound?
