
NYU study finds generalist frontier LLMs outperform specialized medical tools like UpToDate in blinded clinical tests

Twelve clinicians evaluated the outputs using real clinical inputs.


Original post

Adam Rodman@AdamRodmanMD

There has ALREADY been a lot written about NYU @EvidenceOpen @UpToDate Expert AI study but wanted to give my perspective as what counts for an "expert" in human-computer interaction these days. Especially when I see Twitter debates about item response theory. 🤣

A 🧵⬇️

Eric Topol@EricTopol

For medical information, general AI frontier models (Google, OpenAI, Anthropic) outperformed specialized @EvidenceOpen and @UpToDate, as assessed by 12 US clinicians (randomized and blinded to which model) and by extensive testing/benchmarks. This was not anticipated. @NatureMedicine https://www.nature.com/articles/s41591-026-04431-5

11:10 AM · Jun 14, 2026 · 33.9K Views

Sentiment

Users praised the thread on general AI models outperforming specialized medical tools in clinician studies, calling the expert perspectives helpful and enlightening.

Positive: 100.0% · Negative: 0.0%

9 comments with sentiment.


Ethan Mollick@emollick

This is a good methodological thread on the debate over a new paper that suggests generalist models beat specialized medical AIs. (And a good overview of the challenges of benchmarking AIs in medicine)

Adam Rodman@AdamRodmanMD

The TL;DR for those who don't want to sit through a virtual lab meeting with me:

"This study provides directional data about the reference output quality of POC reference LLMs versus base models on actual reference inputs"


Adam Rodman@AdamRodmanMD

@krithikvish @ekoermann Where should we be heading as a field?

1) A neutral "convener" who can set standards (think @METR_Evals for medicine)
2) Multidimensional benchmarks with high degrees of construct validity
3) An ecosystem in which everyone willingly participates and commits to improving systems


Adam Rodman@AdamRodmanMD

Without a reference standard, the authors need to establish a measure with some degree of construct validity.

They used a rubric that's actually quite similar to those used in AI evaluations -- a four-axis, 4-point scale (and some binary flags).


Adam Rodman@AdamRodmanMD

So how do we measure the quality of POC reference tools? A lot of these studies come from the LAST generation of POC tools (think UTD vs DynaMed).

There are three general domains that we can look at:

1⃣ Objective quality measures of the PLATFORM in general


Adam Rodman@AdamRodmanMD

For RCQ, the authors have chosen to treat these tools (implicitly) as reference tools, which I find completely reasonable given their TOS. This is personally what I most commonly use OE for, but I also fully recognize that many people use them differently.


Adam Rodman@AdamRodmanMD

In some settings, there's good evidence that these measures DO in fact have construct and discriminant validity (eg, the PDQSI-9, which is used to evaluate AI summaries: https://www.nature.com/articles/s41746-025-02005-2) or even the AMIE papers (COI alert!), which apply rubrics from OSCEs to AI-patient convos.


Eric Topol@EricTopol

@AdamRodmanMD @EvidenceOpen @UpToDate Thanks Adam. Great to get your perspective!


Adam Rodman@AdamRodmanMD

2⃣ User-level measures (satisfaction, time, retention, &c) 3⃣ Answer-level measures (accuracy, harm, &c).

There are some great papers in 1⃣ and 2⃣ (eg https://www.bmj.com/content/343/bmj.d5856 and https://pmc.ncbi.nlm.nih.gov/articles/PMC3221343/) but that's not really what RCQ is getting at.

So let's go to 3⃣.


Adam Rodman@AdamRodmanMD

There are two prospective evaluations looking at UTD vs DynaMed (https://pmc.ncbi.nlm.nih.gov/articles/PMC8810269/, https://jmla.pitt.edu/ojs/jmla/article/view/1176) that use accuracy/quality of responses.

Importantly, both of them use STANDARDIZED vignettes, and "gold standard" answers.


Adam Rodman@AdamRodmanMD

(also, this study says ABSOLUTELY nothing about the bitter lesson of specialist vs generalist models, and is appropriately focused on reference quality -- I wouldn't personally draw any conclusions there from this)


Adam Rodman@AdamRodmanMD

To give an example from one of my studies (where we validated expert rubrics for construct validity), imagine the question, "How do you manage A Fib?"

For completeness, one rater might want anticoag, rate control, and rhythm control.


Adam Rodman@AdamRodmanMD

Also, since this is a heated subject, I want to state at the beginning that, as I write in every talk, paper, and grant application, I am a visiting researcher at Google, and have worked on the AMIE/CoClinician work from an evaluation standpoint.


Adam Rodman@AdamRodmanMD

Underlying all of this is a basic tension -- what ARE Open Evidence and Up-To-Date ExpertAI (and the other POC LLM tools)? Are they reference look-ups for the point-of-care? Are they clinical decision support (meant to explicitly help you make a decision like a risk calculator)?



Adam Rodman@AdamRodmanMD

I'm also going to focus on the most important piece of the paper, the RCQ (Real Clinical Questions) benchmark taken from the HIPAA-compliant NYU GPT instance. I do really personally like HealthBench and the methods behind it, but that would be another paper.


Adam Rodman@AdamRodmanMD

Or are they "something else," more like asking a colleague for advice?

This turns out to matter quite a bit, because it fundamentally changes how you would evaluate, both in how many turns of a conversation are references, the types of outcome measures, &c.


Adam Rodman@AdamRodmanMD

A second might want reversible causes, risk stratification, and cardioversion timing for completeness.

Which is right? How many points do you give?


Adam Rodman@AdamRodmanMD

This is what we see in the Krippendorff's alpha -- high levels of item-level disagreement (which doesn't surprise me given the scale). For comparison, the PDQSI-9 had a Krippendorff's alpha of ~.5 (https://arxiv.org/pdf/2501.08977)
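
For readers unfamiliar with the statistic, here is a minimal sketch of how Krippendorff's alpha can be computed from rater scores. It is not the paper's code: the function name `krippendorff_alpha`, the input layout (one list of ratings per answer, `None` for a missing rating), and the example scores are all assumptions made for illustration. It uses the interval distance (squared difference) as a simplification; an ordinal distance would arguably suit a 4-point rubric better.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha(units, metric="interval"):
    """Krippendorff's alpha via the coincidence-matrix formulation.

    units: list of lists; each inner list holds the scores that different
           raters gave to one item (None marks a missing rating).
    metric: "nominal" (match / no match) or "interval" (squared difference).
    """
    if metric == "nominal":
        delta = lambda a, b: 0.0 if a == b else 1.0
    elif metric == "interval":
        delta = lambda a, b: float(a - b) ** 2
    else:
        raise ValueError("metric must be 'nominal' or 'interval'")

    # Coincidence matrix: every ordered pair of ratings within an item,
    # weighted by 1 / (number of ratings on that item - 1).
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two ratings carry no pair information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per score value, and the total number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    observed = sum(w * delta(a, b) for (a, b), w in coincidences.items()) / n
    expected = sum(
        totals[a] * totals[b] * delta(a, b) for a in totals for b in totals
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected


# Hypothetical scores: three raters, four answers, 4-point scale.
example = [[4, 3, 4], [2, 4, 1], [3, 3, None], [1, 2, 4]]
print(round(krippendorff_alpha(example), 3))
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, which is why a value around .5 signals substantial item-level disagreement.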


Adam Rodman@AdamRodmanMD

@beaulieujones @EricTopol @EvidenceOpen @UpToDate I think this gets more to the fact that we don't really have a good taxonomy on what people use the tools for. I think some users DO use OE and UTD (which has had patient friendly handouts for years) to create patient-facing material.


Adam Rodman@AdamRodmanMD

@EricTopol @EvidenceOpen @UpToDate Thanks Eric!!! Hopefully I won't scare everyone away with discussions of psychometric validation and item response theory 🤣🤣
