MedARC founder Tanishq Mathew Abraham argues a viral study does not disprove the value of domain-specific medical fine-tuning

Story Overview

A recent Nature Medicine study found general frontier models outperforming two clinical AI tools on medical questions and real physician queries, sparking debate about specialized models. MedARC founder Tanishq Mathew Abraham counters that the tested tools likely build on older base models and that the benchmarks capture only limited slices of performance, leaving domain-specific fine-tuning's advantages unexamined.

Original post

Tanishq Mathew Abraham, Ph.D. @iScienceLuvr

This study has been going viral. I think that most people are misunderstanding its conclusions a bit.

This paper DOES NOT MEAN domain-specific models are not worth it.

First of all, UpToDate and OpenEvidence are not models, but products. And there's no information on what models they are built on. They are likely built on top of older models. For all we know they're built on top of gpt-4o or Llama-3.1 or something 🤣 (it's probably something more recent/powerful than that but just trying to emphasize the point)

Second of all, the benchmarks are a bit limited. The benchmarks include MedQA (which is pretty saturated at this point), HealthBench (which focuses on patient conversations), and a closed dataset of doctor questions to an LLM. There are many aspects of clinical use of LLMs that are not analyzed at all by this benchmarking approach.

What conclusions can be made then? Only that UpToDate and OpenEvidence are worse than frontier models on the limited set of benchmarks tested in this paper.

It doesn't mean that domain-specific models cannot beat general purpose models.

In fact, we have done a comprehensive benchmark (https://medmarks.ai) which includes MedQA and HealthBench, among many other benchmarks. We look at general-purpose models, and versions of those same models but adapted for medicine. There seems to be a noticeable boost going from general-purpose to medical fine-tune.

So if you took a frontier model and were able to fine-tune for medical applications it would definitely be better. i.e. a domain-specific model would be better.

It is true that the current domain-specific models (which are often built on open-source models that are not at the frontier) are often worse than frontier models.

It is not true that building domain-specific models cannot beat general-purpose models.

I think the main problem is open-source models aren't progressing fast enough with respect to the frontier models and on top of that very few groups are adapting them quickly enough to release better and better medical AI tools.

Some groups claim to have medical-specific models that outperform frontier models, ex: Baichuan-M4. Hopefully we'll see more medical-specific models trained on top of really strong base models come out soon.

Eric Topol @EricTopol

For medical information, general AI frontier models (Google, OpenAI, Anthropic) outperformed specialized @EvidenceOpen and @UpToDate as assessed by 12 US clinicians, randomized and blinded to which model and extensive testing/benchmarks. This was not anticipated. @NatureMedicine https://www.nature.com/articles/s41591-026-04431-5

3:45 AM · Jun 13, 2026 · 6.2K Views

Benchmark Insight

Fine-tuning lifts win rates on fresh tests

Medmarks runs show medically tuned versions consistently ahead of their base models, such as Gemma 3 4B moving from roughly 0.32 to 0.38 mean win rate after fine-tuning, with similar deltas at larger scales.

Open Question

Older bases and narrow tasks cloud the comparison

No public details confirm the exact foundation models behind OpenEvidence or UpToDate Expert AI, and the study’s three evaluations leave broader clinical workflows untested, so the results cannot yet settle whether newer fine-tuned systems close the gap.

Sentiment

Users mocked the idea that medical fine-tuning boosts LLM performance, pointing to cases like Qwen3 where it produced worse results and arguing that larger general models remain superior.

Positive: 33.3% · Negative: 66.7% (3 comments with sentiment)

Cluster Engagement

Posts from X

Neal Khosla @nealkhosla

@iScienceLuvr Your perspective here is good but misses the amount of capital it takes to keep up with the frontier which is insanely large.

Lunari @0x_lun

@iScienceLuvr the qwen3 235b getting negative gains from medical fine tuning is sending me

somebody spent real compute to make a giant model slightly worse

Dr Danish @operationdanish

I understand your point but the fact that we’re having this discussion at all means that there is no exponential advantage of domain-specific models.

The products and companies built on billions in VC dollars are going to end up like telehealth… largely commoditized and not impactful.

Katy Beckermann @katy_beckermann

Completely agree. A few things that stand out to me:

1) Is it possible the frontier LLMs were trained on the testing set? This is a real concern when benchmarks aren't held out carefully.

2) The API/level 0 testing vs UI and direct prompting for the RAG-based products like OE and UpToDate is a huge confounder. These products are built to be used through their interfaces - testing them via API is not an apples to apples comparison.

3) And honestly the most glaring omission - no test for citations. For a field that prides itself on evidence-based medicine, that's a miss. Given how OpenEvidence actually works, citation accuracy is probably where it would shine most. Feels like the benchmark was designed in a way that played to the frontier models' strengths rather than what actually matters clinically.

Joseph Younis, MD @YounisJoseph

@iScienceLuvr It precisely means that because nobody is using OE for advanced Google search

woisau @woisau1

Great breakdown @iScienceLuvr Beyond fine-tuning, the real next step is shared verifiable memory. OriginTrail’s medical demo today with 5 agents building a provenance-rich context graph on DKG V10 shows how multi-agent systems can achieve grounded, auditable results without central silos.

Bryan Tegomoh, MD, MPH @BryanTegomoh

@iScienceLuvr Tool calling can only go so far. At the end of the day, the bigger the model (in parameters) and the more capable it is across a broad range of domains, the better it performs on tasks it was not specifically trained for.

Mayz @lunan_ai

@iScienceLuvr people treating product UI as model performance is such a common error

separate the wrapper from the engine