The AI Scientist can write code, run experiments, and draft papers.
But can it do the thing every good advisor does before compute is spent: reject a bad research plan?
Introducing SoundnessBench: a benchmark for scientific soundness judgment.
Project: https://hosytuyen.github.io/projects/SoundnessBench Dataset: https://huggingface.co/datasets/hosytuyen/SoundnessBench Paper: https://arxiv.org/abs/2605.30329

