This study has been going viral. I think that most people are misunderstanding its conclusions a bit.
This paper DOES NOT MEAN domain-specific models are not worth it.
First of all, UpToDate and OpenEvidence are not models, but products. And there's no information on what models they are built on. They are likely built on top of older models. For all we know they're built on top of gpt-4o or Llama-3.1 or something 🤣 (it's probably something more recent/powerful than that but just trying to emphasize the point)
Second of all, the benchmarks are a bit limited. The benchmarks include MedQA (which is pretty saturated at this point), HealthBench (which focuses on patient conversations), and a closed dataset of doctor questions to an LLM. There are many aspects of clinical use of LLMs that are not at all analyzed by this benchmarking approach.
What conclusions can be made then? Only that UpToDate and OpenEvidence are worse than frontier models on the limited set of benchmarks tested in this paper.
It doesn't mean that domain-specific models cannot beat general purpose models.
In fact, we have done a comprehensive benchmark (https://medmarks.ai) which includes MedQA and HealthBench, among many other benchmarks. We look at general-purpose models and at versions of those same models adapted for medicine. There seems to be a noticeable boost going from a general-purpose model to its medical fine-tune.
So if you took a frontier model and were able to fine-tune it for medical applications, it would definitely be better, i.e. a domain-specific model would be better.
It is true that the current domain-specific models (which are often built on open-source models that are not at the frontier) are often worse than frontier models.
It is not true that domain-specific models cannot beat general-purpose models.
I think the main problem is that open-source models aren't progressing fast enough relative to the frontier models, and on top of that, very few groups are adapting them quickly enough to release better and better medical AI tools.
Some groups claim to have medical-specific models that outperform frontier models, e.g. Baichuan-M4. Hopefully we'll see more medical-specific models trained on top of really strong base models come out soon.
For medical information, general AI frontier models (Google, OpenAI, Anthropic) outperformed specialized @EvidenceOpen and @UpToDate as assessed by 12 US clinicians, randomized and blinded to which model and extensive testing/benchmarks. This was not anticipated. @NatureMedicine https://www.nature.com/articles/s41591-026-04431-5






