Scripps Research's Eric Topol says GPT-5, Claude 3.5, and Gemini 2.5 Pro fail clinical readiness · Digg

Posts from X

Views 7.8K · Bookmarks 11 · Likes 18 · Retweets 5 · Replies 3

Eric Topol@EricTopol

Link for free access: https://rdcu.be/fqznS
This extensive assessment work was led by @hoifungpoon and Yu (Aiden) Gu.

Eric Topol@EricTopol

We stress tested many frontier AI models for multimodal medical reasoning (including GPT-5, Claude 3.5, Gemini 2.5 Pro). They’re not ready. Faulty reasoning, use of inappropriate shortcuts, hallucinations. Published today @NatureMedicine https://www.nature.com/articles/s41591-026-04501-8

4h7.8K1811

Max Hodak@maxhodak_

tests 5-generation-old models, concludes AI is inappropriate for medicine

24m829103

Yishan@yishan

Many people have already pointed out that no matter how high the quality of this paper, the long review and publication cycle makes the results irrelevant.

The way to fix this is to open source enough of the actual research methodology used so that upon publication, anyone can re-run the exact same tests on the latest models at that time to produce consistently comparable results.

This is more useful because “this is the worst the models will ever be,” and it is a reasonable assumption that they at some point WILL be ready. Hence, showing that at some point in time they weren’t (i.e. a year ago) is much less useful than constructing a usable method that allows us to tell at what point in the future they actually ARE ready.

1h877101

Timothy Murphy@Timothy98537991

@EricTopol @NatureMedicine It does take time, and I'm not suggesting that your study lacked rigor. The problem is that the target is moving too quickly to be evaluated this way. The results, once finally published, are misleading because they don't represent the state of the art.

3h649

Eric Topol@EricTopol

@Timothy98537991 @NatureMedicine Obviously from someone who has never tried to publish such an assessment in a leading peer-reviewed journal. It takes time! Best we can do.

3h3169

Timothy Murphy@Timothy98537991

@EricTopol @NatureMedicine I learned way too late in life that the truth can defend itself and doesn't need me to defend it. I think it will be pretty obvious to most people that this study was accurate at the time it was done, but by the time it was published, it was meaningless.

3h15911

Mikhail Doroshenko@SandelloRed

@EricTopol @NatureMedicine Those are not frontier models

5h3116

Eric Topol@EricTopol

@SandelloRed @NatureMedicine Yes they are

5h315

Rogs 🔍🔸@ESRogs

@EricTopol @Timothy98537991 @NatureMedicine Yes, but you could have tweeted "they were not ready" rather than using the present tense.

3h524

Vladimir Heiskanen@ValtsuH

@EricTopol @NatureMedicine Would it be possible to rapidly test the same stuff again with the up-to-date models?

I'm not expecting perfection from GPT-5.5 or Opus 4.8 but maybe relevant improvement, still.

5h2631

Rodrigo@rodrigo_taxon

@EricTopol @NatureMedicine @josegallucci

5h1431

Rodrigo@rodrigo_taxon

@EricTopol @NatureMedicine @alinefortuna2

5h1161

Ben Stadler@TheBenStadler

@EricTopol @NatureMedicine Maybe be prepared with personal results of the same tests against current models to supplement your paper’s findings. Otherwise it is fairly meaningless to post, especially with your present tense framing of “they’re not ready.”

3h1394

Bhushan@bhushan_55

@EricTopol @NatureMedicine The study is outdated now; get ready for GPT-6, Claude Mythos, Gemini 3.5 Pro

Or better, wait for a year

4h137

Simukayi Mutasa M.D.@MutasaSimu70874

@EricTopol @NatureMedicine I love this kind of research Eric, thank you for doing this. How do you suggest we mitigate the publishing delay issue for updating the research on newer models?

1h85

Eric Topol@EricTopol

@ESRogs @Timothy98537991 @NatureMedicine The ones we tested were not ready. I indicated some in the text of the post. You’re welcome

3h74

Chris Russell MD@Russell50k

@EricTopol @NatureMedicine The journal process is too slow for evaluating LLMs. Journals are not appropriate tools for this task.

3h1023

Alan Ge, MD@RealAlanGe

@Timothy98537991 @EricTopol @NatureMedicine Hi Dr. Topol! Huge fan of your work, but I have to agree with Timothy on this one. We need to evolve as a field. While important, the traditional peer review process is simply far too slow to accurately assess the rapid pace of these models. (1/3)

2h22

Jake Perry@ItsJakePerry

@DrAMansouri @EricTopol @NatureMedicine You're right, there are two benchmarks: absolute perfection vs. "as good as humans". Other studies suggest models are already as good as or better than humans in many medical contexts.

2h61

Eric Topol@EricTopol

@yishan @NatureMedicine Agree. They will eventually be ready!

49m35210