3h ago

Researchers release SMDD-Bench, a benchmark evaluating LLM agents on 502 small molecule drug design tasks across five workflows, with frontier models at roughly 40 percent success

Public leaderboard at smddbench.com and arXiv paper now available.

0
Original post

🧬New agentic AI-for-science benchmark: SMDD-Bench! Can frontier LLM agents actually do small-molecule drug design? Real medicinal chemistry — not single-turn QA, not toy property prediction. Long-horizon, multi-turn, tool-using, with strict oracle budgets. We release 502 agentic tasks across 5 real drug-design workflows (pharmacophore ID, scaffold hopping, lead optimization, fragment assembly, interaction point discovery), every one guaranteed-solvable via a hidden witness molecule. Agents get a Python sandbox, 8 Boltz2 calls, 15 ADMET-AI calls, no internet — and have to plan across dozens of turns to spend that budget wisely. Result: GPT-5.4 and Gemini 3.1 Pro are neck-and-neck at the top (40.2% vs 39.0%), Claude Sonnet 4.6 right behind at 38%. Open-source models trail meaningfully. Even the best agents fail >60% of the time. 🧵 below

11:56 AM · May 22, 2026 View on X

The 5 task types are all things real medicinal chemists actually do:

2D Pharmacophore ID — write a Python filter that separates actives from inactives Interaction Point Discovery — predict the 3 most conserved 3D hotspots in a pocket Scaffold Hopping — same binding mode, new scaffold Lead Optimization — multi-objective ADMET + affinity under hard constraints Fragment Assembly — link 3D fragments into a drug-like molecule that re-docks to the right pose

NiloofarNiloofar@niloofar_mire

🧬New agentic AI-for-science benchmark: SMDD-Bench! Can frontier LLM agents actually do small-molecule drug design? Real medicinal chemistry — not single-turn QA, not toy property prediction. Long-horizon, multi-turn, tool-using, with strict oracle budgets. We release 502 agentic tasks across 5 real drug-design workflows (pharmacophore ID, scaffold hopping, lead optimization, fragment assembly, interaction point discovery), every one guaranteed-solvable via a hidden witness molecule. Agents get a Python sandbox, 8 Boltz2 calls, 15 ADMET-AI calls, no internet — and have to plan across dozens of turns to spend that budget wisely. Result: GPT-5.4 and Gemini 3.1 Pro are neck-and-neck at the top (40.2% vs 39.0%), Claude Sonnet 4.6 right behind at 38%. Open-source models trail meaningfully. Even the best agents fail >60% of the time. 🧵 below

6:56 PM · May 22, 2026 · 3.8K Views
6:56 PM · May 22, 2026 · 306 Views

Key methodological idea: witness-aware task generation.

Most "design" benchmarks have no guarantee a solution exists. We co-generate a witness molecule with every task — a hidden molecule that already passes the full evaluator. So every task instance has at least one provably valid answer.

NiloofarNiloofar@niloofar_mire

The 5 task types are all things real medicinal chemists actually do: 2D Pharmacophore ID — write a Python filter that separates actives from inactives Interaction Point Discovery — predict the 3 most conserved 3D hotspots in a pocket Scaffold Hopping — same binding mode, new scaffold Lead Optimization — multi-objective ADMET + affinity under hard constraints Fragment Assembly — link 3D fragments into a drug-like molecule that re-docks to the right pose

6:56 PM · May 22, 2026 · 306 Views
6:56 PM · May 22, 2026 · 247 Views

This is fundamentally a long-horizon, budget-constrained tool-use benchmark that just happens to live in chemistry. Agents have to:

plan across dozens of turns decide which of 23 oracle calls to spend where reason about 3D geometry from text write RDKit code that actually executes generalize SAR rules across turns

NiloofarNiloofar@niloofar_mire

Key methodological idea: witness-aware task generation. Most "design" benchmarks have no guarantee a solution exists. We co-generate a witness molecule with every task — a hidden molecule that already passes the full evaluator. So every task instance has at least one provably valid answer.

6:56 PM · May 22, 2026 · 247 Views
6:56 PM · May 22, 2026 · 175 Views

Most interesting finding: LLMs often enumerate the right molecule but fail to pick it.

If agents had submitted the best molecule they ever mentioned in their trace, overall success jumps significantly:

MiniMax M2.7: 20.6% → 36.1% Claude Sonnet 4.6: 40.7% → 52.4% Gemini 3.1 Pro: 42.0% → 50.0%

NiloofarNiloofar@niloofar_mire

This is fundamentally a long-horizon, budget-constrained tool-use benchmark that just happens to live in chemistry. Agents have to: plan across dozens of turns decide which of 23 oracle calls to spend where reason about 3D geometry from text write RDKit code that actually executes generalize SAR rules across turns

6:56 PM · May 22, 2026 · 175 Views
6:56 PM · May 22, 2026 · 161 Views

We also release:

SMDD-Bench Lite — 100-task representative subset for fast iteration SMDD-Bench Diversity — 20-task subset for testing parallel-agent diversity Public leaderboard at http://smddbench.com

We hope this becomes the testbed for training the next generation of autonomous medicinal chemistry agents. 📄 Paper: https://arxiv.org/abs/2605.21740 🌐 Site + leaderboard: https://smddbench.com

NiloofarNiloofar@niloofar_mire

Other findings: On Lead Opt, agents often converge to the same handful of molecules across 10 runs — pairwise Tanimoto >0.8. Bad for parallel deployment. Common failure mode: agents test the same disqualified substructure 3+ times. No cross-turn SAR synthesis. Another: Gemini once "solved" a scaffold-hop task by recalling a known PubChem molecule from memory in one shot. Memorization, not design.

6:56 PM · May 22, 2026 · 159 Views
6:56 PM · May 22, 2026 · 173 Views

Other findings:

On Lead Opt, agents often converge to the same handful of molecules across 10 runs — pairwise Tanimoto >0.8. Bad for parallel deployment. Common failure mode: agents test the same disqualified substructure 3+ times. No cross-turn SAR synthesis. Another: Gemini once "solved" a scaffold-hop task by recalling a known PubChem molecule from memory in one shot. Memorization, not design.

NiloofarNiloofar@niloofar_mire

Most interesting finding: LLMs often enumerate the right molecule but fail to pick it. If agents had submitted the best molecule they ever mentioned in their trace, overall success jumps significantly: MiniMax M2.7: 20.6% → 36.1% Claude Sonnet 4.6: 40.7% → 52.4% Gemini 3.1 Pro: 42.0% → 50.0%

6:56 PM · May 22, 2026 · 161 Views
6:56 PM · May 22, 2026 · 159 Views

Huge thanks to @GoogleCloud @GeminiApp for Gemini credits. None of this benchmarking would have been feasible without them.

NiloofarNiloofar@niloofar_mire

We also release: SMDD-Bench Lite — 100-task representative subset for fast iteration SMDD-Bench Diversity — 20-task subset for testing parallel-agent diversity Public leaderboard at http://smddbench.com We hope this becomes the testbed for training the next generation of autonomous medicinal chemistry agents. 📄 Paper: https://arxiv.org/abs/2605.21740 🌐 Site + leaderboard: https://smddbench.com

6:56 PM · May 22, 2026 · 173 Views
6:56 PM · May 22, 2026 · 221 Views

@niloofar_mire Very cool work, @niloofar_mire and team!

NiloofarNiloofar@niloofar_mire

🧬New agentic AI-for-science benchmark: SMDD-Bench! Can frontier LLM agents actually do small-molecule drug design? Real medicinal chemistry — not single-turn QA, not toy property prediction. Long-horizon, multi-turn, tool-using, with strict oracle budgets. We release 502 agentic tasks across 5 real drug-design workflows (pharmacophore ID, scaffold hopping, lead optimization, fragment assembly, interaction point discovery), every one guaranteed-solvable via a hidden witness molecule. Agents get a Python sandbox, 8 Boltz2 calls, 15 ADMET-AI calls, no internet — and have to plan across dozens of turns to spend that budget wisely. Result: GPT-5.4 and Gemini 3.1 Pro are neck-and-neck at the top (40.2% vs 39.0%), Claude Sonnet 4.6 right behind at 38%. Open-source models trail meaningfully. Even the best agents fail >60% of the time. 🧵 below

6:56 PM · May 22, 2026 · 3.8K Views
8:01 PM · May 22, 2026 · 66 Views
Researchers release SMDD-Bench, a benchmark evaluating LLM agents on 502 small molecule drug design tasks across five workflows, with frontier models at roughly 40 percent success · Digg