We stress tested many frontier AI models for multimodal medical reasoning (including GPT-5, Claude 3.5, Gemini 2.5 Pro). They’re not ready. Faulty reasoning, use of inappropriate shortcuts, hallucinations. Published today @NatureMedicine https://www.nature.com/articles/s41591-026-04501-8
Scripps Research's Eric Topol says GPT-5, Claude 3.5, and Gemini 2.5 Pro are not ready for clinical use
Story Overview
A new adversarial stress test published in Nature Medicine shows that leading multimodal models still lean on shortcuts, fabricate details, and falter under small input tweaks, even when they ace ordinary health benchmarks. The result is a clear gap between benchmark scores and actual clinical reliability.
Standard benchmarks miss the hidden weak spots
Simple changes, such as removing key image details or swapping modalities, exposed how models reach correct answers for the wrong reasons. The study ties this pattern directly to current test design rather than to model size alone.
Next steps stay open until fresh data arrives
The authors released perturbation tools and rubrics for others to reuse. However, no model-specific scores or vendor replies appear in the published record, so any timeline for fixes remains unknown.
Many users dismissed the Nature Medicine study as meaningless or worthless because it tested outdated models, while others praised its rigor or expressed hope for future progress.