Nature Medicine study finds general-purpose frontier LLMs outperform specialized clinical AI tools on medical benchmarks

Eric Topol @EricTopol

@krithikvish @ekoermann @nyulangone 💯

Charlie Lees @charlie_lees

@nealkhosla Indeed

Tanishq, Ph.D. at ICML @iScienceLuvr

This study has been going viral. I think that most people are misunderstanding its conclusions a bit. This paper DOES NOT MEAN domain-specific models are not worth it.
First of all, UpToDate and OpenEvidence are not models, but products. And there's no information on what models they are built on. They are likely built on top of older models. For all we know they're built on top of gpt-4o or Llama-3.1 or something 🤣 (it's probably something more recent/powerful than that but just trying to emphasize the point)
Second of all, the benchmarks are a bit limited. The benchmarks include MedQA (which is pretty saturated at this point), HealthBench (which focuses on patient conversations), and a closed dataset of doctor questions to an LLM. There are many aspects of clinical use of LLMs that are not analyzed at all by this benchmarking approach.
What conclusions can be made then? Only that UpToDate and OpenEvidence are worse than frontier models on the limited set of benchmarks tested in this paper. It doesn't mean that domain-specific models cannot beat general-purpose models.
In fact, we have done a comprehensive benchmark (https://medmarks.ai) which includes MedQA and HealthBench, among many other benchmarks. We look at general-purpose models, and versions of those same models but adapted for medicine. There seems to be a noticeable boost going from general-purpose to medical fine-tune. So if you took a frontier model and were able to fine-tune it for medical applications it would definitely be better, i.e. a domain-specific model would be better.
It is true that the current domain-specific models (which are often built on open-source models that are not at the frontier) are often worse than frontier models. It is not true that domain-specific models cannot beat general-purpose models.
I think the main problem is that open-source models aren't progressing fast enough relative to the frontier models, and on top of that very few groups are adapting them quickly enough to release better and better medical AI tools. Some groups claim to have medical-specific models that outperform frontier models, e.g. Baichuan-M4. Hopefully we'll see more medical-specific models trained on top of really strong base models come out soon.

Nabeel S. Qureshi @nabeelqu

Medicine discovers the bitter lesson: frontier LLMs (here GPT 5.2, Opus 4.6, Gemini 3.1) outperform specialized "clinical AI" (e.g. OpenEvidence) in a blind test. Even funnier that hospital IT are more likely to approve the *specialized* versions despite them being worse. https://x.com/nabeelqu/status/2065440481127866598/photo/1 https://twitter.com/EricTopol/status/2065430578997203374

Eric Topol @EricTopol

For medical information, general AI frontier models (Google, OpenAI, Anthropic) outperformed specialized @EvidenceOpen and @UpToDate, as assessed by 12 US clinicians (randomized and blinded to which model they were rating) and by extensive testing/benchmarks. This was not anticipated. @NatureMedicine https://www.nature.com/articles/s41591-026-04431-5

Ethan Mollick @emollick

There has been a push to use OpenEvidence AI for doctors. But this paper suggests general models are much better: “Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ.” https://twitter.com/erictopol/status/2065430578997203374

Eric Topol @EricTopol

>65% of US physicians use OpenEvidence, with 27 million prompts in April https://www.nbcnews.com/tech/tech-news/openevidence-ai-doctor-medical-physician-login-app-what-npi-uptodate-rcna341064 https://x.com/EricTopol/status/2065439976641482849/photo/1

Eric Topol @EricTopol

Here is the performance breakdown for each model's blinded assessment for 4 major tasks: (1) clinical correctness, (2) completeness, (3) safety, and (4) clarity. https://x.com/EricTopol/status/2065530014372901241/photo/1
