We stress tested many frontier AI models for multimodal medical reasoning (including GPT-5, Claude 3.5, Gemini 2.5 Pro). They’re not ready. Faulty reasoning, use of inappropriate shortcuts, hallucinations. Published today @NatureMedicine https://www.nature.com/articles/s41591-026-04501-8
Scripps Research's Eric Topol says GPT-5, Claude 3.5, and Gemini 2.5 Pro are not ready for clinical use
Story Overview
A new adversarial stress test in Nature Medicine shows that leading multimodal models still lean on shortcuts, fabricate details, and falter under small input changes, even when they ace ordinary health benchmarks. The result is a clear gap between benchmark scores and actual clinical reliability.
Standard benchmarks miss the hidden weak spots
Simple changes, such as removing key image details or swapping modalities, exposed how models reach correct answers for the wrong reasons, a pattern the study ties directly to current benchmark design rather than to model size alone.
Next steps stay open until fresh data arrives
The authors released perturbation tools and rubrics for others to reuse. However, no model-specific scores or vendor replies appear in the published record, so any timeline for fixes remains unknown.
Many users called the Nature Medicine study embarrassing or worse than nothing because it shows frontier models still fail at medical reasoning despite job-replacement hype, while others praised the authors for exposing those limits.