Can medical AI research be automated with AI itself?
This new benchmark from NVIDIA and UC Santa Cruz aims to evaluate this:
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
"we present AutoMedBench, a workflow-aware benchmark for evaluating autonomous agents on end-to-end medical-AI research tasks"
The benchmark covers 24 tasks, including segmentation, question answering, and report generation, across modalities such as CT, X-ray, and pathology.
The paper evaluates six frontier models (Opus 4.6, GLM-5, Gemini 3.1 Pro, GPT-5.4, MiniMax-M2.5, Qwen3.5-397B) and finds that they remain far from reliable medical AI researchers. While agents can often set up runnable pipelines, validation is consistently the weakest stage, and engineering failures dominate over understanding errors.
Definitely curious to see how this performs with the newest generation of models/agents!