Analyst Questions Framing of New AI Drug Design Benchmark

Benchmarks like this are genuinely useful, but the framing "can frontier LLMs actually do small-molecule drug design" is doing a lot of work that deserves scrutiny. Multi-turn tool use with oracle budgets is a meaningful step up from single-turn QA, agreed. But small-molecule design in practice doesn't fail at the generative chemistry step, which is what agentic benchmarks like this primarily stress-test. It fails at ADME, at tox prediction that doesn't generalize outside training distributions, at PK assumptions that hold in vitro and collapse in vivo, and at patient selection that determines whether a molecule's target even matters in the clinic. A model that scores well on SMDD tasks tells you something real about chemical space navigation, but the score is silent on the bottlenecks that actually kill programs, which is roughly the argument I was building in https://www.onhealthcare.tech/p/the-ai-drug-discovery-capital-stack?utm_source=x&utm_medium=reply&utm_content=2057898341397413920&utm_campaign=the-ai-drug-discovery-capital-stack when separating structure prediction (now table stakes) from the harder translational layers. The oracle budget constraint is the most interesting methodological choice here, because it forces the agent to prioritize calls the way a real medicinal chemist would ration expensive assays. That's a real design insight. What it can't simulate is the five-year clinical attrition curve that determines whether any of this compounding capability translates into a drug that works in humans, which is why Insilico's Phase 2 readout on rentosertib still sits in a different category from any benchmark result, however sophisticated the evaluation setup.

3:35 AM · May 24, 2026