For medical information, general AI frontier models (Google, OpenAI, Anthropic) outperformed specialized @EvidenceOpen and @UpToDate, as assessed by 12 US clinicians (randomized and blinded to which model) and by extensive testing/benchmarks. This was not anticipated. @NatureMedicine https://www.nature.com/articles/s41591-026-04431-5
Nature Medicine study finds general-purpose LLMs outperform specialized clinical AI on medical benchmarks
Story Overview
An independent Nature Medicine evaluation put three frontier general-purpose LLMs against two dedicated clinical AI platforms on medical knowledge tests, clinician alignment tasks, and real de-identified physician queries, with the broad models coming out ahead in every category after randomized blinded review by twelve US clinicians.
Scaling keeps winning on narrow tasks
Gemini 3.1 Pro reached 97.4 percent on MedQA while the specialized tools trailed, echoing earlier patterns where general models trained on broad data outperform narrow fine-tunes when the evaluation stays within benchmark limits.
Real-world checks still needed before clinics
The study stresses that benchmark wins alone do not confirm deployment safety or patient outcomes, leaving open how these models would perform under live regulatory or liability scrutiny.
Supportive users credit the general frontier models' scale and reasoning abilities for outperforming specialized medical tools, while skeptical users cite small sample sizes, hallucinations, and personal cases in which specialized tools did better.
Most Activity
Medicine discovers the bitter lesson: frontier LLMs (here GPT 5.2, Opus 4.6, Gemini 3.1) outperform specialized "clinical AI" (e.g. OpenEvidence) in a blind test.
Even funnier that hospital IT are more likely to approve the *specialized* versions despite them being worse.
There has been a push to use OpenEvidence AI for doctors. But this paper suggests general models are much better: “Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ.”
>65% of US physicians use OpenEvidence, with 27 million prompts in April https://www.nbcnews.com/tech/tech-news/openevidence-ai-doctor-medical-physician-login-app-what-npi-uptodate-rcna341064
I'm not all that surprised by this. Sutton's bitter lesson tells us that generalist models trained on far more data outperform narrower models with less data.
"Experts" really do not want to believe this (see Topol's "this was not anticipated", even though this is just Rich Sutton 101), nor do IT departments, but they'll learn eventually I guess

This exemplifies the paradox of medical AI implementation https://erictopol.substack.com/p/the-paradox-of-medical-ai-implementation

Nature Medicine just reported a remarkable result: general-purpose frontier AI models from Google, OpenAI, and Anthropic outperformed specialized medical AI tools, including OpenEvidence and UpToDate Expert AI, across MedQA, HealthBench, and blinded clinician-rated real clinical queries.
This is AI democratizing expertise in real time.
The old moat was access to specialized knowledge.
The new moat is judgment, validation, safety, and responsible deployment.
Not a replacement for physicians, a redistribution of reasoning power.
We are not watching a software update.
We are entering a technological revolution.
#AIinMedicine #MedicalAI #HealthTech

@ramez True, but you could make the case that if these organizations pivot to being good harnesses for their given domains they’d still be a strong value-add.

@EricTopol @SprakerMDPhD @EvidenceOpen @UpToDate In my experience as a radiation oncologist, OE is superior to the latest general LLMs at giving the most detailed and accurately referenced answers to complex cases.

@ramez At the same time, the economics may run in a different direction

The problem with UpToDate is that it's just a collection of useless data, almost like an encyclopedia. Most of it isn't clinically relevant. I've seen doctors and residents read about a topic they aren't familiar with and then go order $400,000 worth of labs and tests that aren't needed. Maybe LLMs and AI can bridge the gap between knowledge and practicing guideline-based medicine. 💪🏻🫀🩺

And this is based on GPT-5.2, not 5.5 (and soon 5.6). I have always said that OpenEvidence is very mid: good idea, but subpar to GPT/Gemini/Claude. I still don't get why so many MD/DO/PA/NPs use it. Good marketing, I assume? DoximityGPT is better, and GPT for clinicians is free now too, much better, and HIPAA-compliant as well.

It's surprising in a sense, as we assume that specialization and a human-curated database like UpToDate would give more reliable results. On the other hand, frontier models have greater cross-disciplinary pattern recognition. They can connect disparate clinical signals and manage nuance better than a rigid, database-dependent system.

@EricTopol @EvidenceOpen @UpToDate I'm skeptical of the finding based on my experience, but will have to see how they assessed it.

@emollick @AndrewCurran_ We knew this months ago as OpenEvidence vastly underperformed general models on MedXpertQA Text (slides from a presentation I gave in April with OpenEvidence results on the right)

@EricTopol @BraydonDymm @EvidenceOpen @UpToDate Bitter lesson in action. Not surprising for those who work with the models regularly

Not surprised at all; by design, RAG creates knowledge gaps, limits reasoning, and introduces human bias. For better or for worse. The key here is human-AI interaction; RAG-based tools enable cross-checking, whereas frontier models do not. This is critical in high-stakes decision making.

@EricTopol @EvidenceOpen @UpToDate The authors are clueless about AI, it would seem.
Wait until they try Claude Fable 5!

@EricTopol @EvidenceOpen @UpToDate Okay

@EricTopol @EvidenceOpen @UpToDate Didn't think a GPT would be the most consistent model. Fascinating.