
Expert medical evaluation of clinical AI systems on real point-of-care queries finds notable performance gaps

The two-week study tested systems on real questions submitted to OpenEvidence.


Original post

For the past two weeks, our independent team of statisticians, AI evaluation experts, clinical AI researchers, and clinicians was given a unique opportunity to test one question: “How well do different AI tools answer user questions on the OpenEvidence (OE) platform?”

7:38 AM · Jun 30, 2026 · 37.5K Views

Sentiment

Users praised the paper evaluating AI tools on the OpenEvidence platform for its emphasis on the level of reasoning in LLM responses.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)


Posts from X


Jean Feng@Jean_J_Feng

We were given access to 620 questions sampled from real queries submitted to OE, and 149 physicians across 36 states made blinded head-to-head comparisons between answers from frontier general-purpose models (Opus 4.8, Gemini 3.1 Pro, GPT5.5) and OE’s clinical AI tool.


Jean Feng@Jean_J_Feng

Paper: https://arxiv.org/abs/2606.28960 Data: https://huggingface.co/datasets/jjfenglab/Real-POCQi Code: https://github.com/jjfenglab/Real-POCQi-statistics


Jean Feng@Jean_J_Feng

Today we are sharing the results on arXiv and the benchmark dataset on HuggingFace. Following our prespecified statistical analysis, OE scored highest on all five dimensions of accuracy, clinical utility, source quality, verifiability, and completeness.


Jean Feng@Jean_J_Feng

But the more relevant question for AI application developers is whether their tool is delivering value in its intended use case for its target users. On that measure, our findings are positive.


Jean Feng@Jean_J_Feng

Graders were matched to questions that matched their specialty to maximize evaluation accuracy, to mirror the specialization that defines modern medicine.


Jean Feng@Jean_J_Feng

Please see the paper for important details and nuances of the work. Our hope is that this starts a healthy discussion on how to best evaluate AI systems, because no evaluation is perfect and every evaluation involves tradeoffs.


Jean Feng@Jean_J_Feng

At least for now, the hard work of engineering and customizing an AI system can still pay off, delivering meaningful performance gains to its target users.


Jean Feng@Jean_J_Feng

So how do these results square with recent work reporting general-purpose LLMs outperform OE? We think both findings can be true: when a specialized tool is evaluated in a setting outside of its intended workflow, it may indeed underperform.


Jean Feng@Jean_J_Feng

On the primary endpoint of win differences (a model’s win rate minus its loss rate in head-to-head comparisons), OE led by 25–39 percentage points across all these dimensions (p < 0.001). Results were consistent across our *many* sensitivity analyses.
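
For readers unfamiliar with the metric, here is a minimal sketch of how a win difference could be computed from blinded head-to-head grades. It is not the study's code: the function name, the `"win"`/`"loss"`/`"tie"` labels, the choice to keep ties in the denominator, and the counts in the example are all assumptions made for illustration. The paper's prespecified analysis may handle ties and grader clustering differently.

```python
from collections import Counter


def win_difference(outcomes):
    """Win rate minus loss rate for one model on one dimension.

    `outcomes` holds one label per head-to-head comparison, from that
    model's perspective: "win", "loss", or "tie" (hypothetical labels).
    Ties stay in the denominator, which is an assumption of this sketch.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one comparison")
    return (counts["win"] - counts["loss"]) / total


# Made-up example, not data from the study:
# 55 wins, 25 losses, and 20 ties out of 100 comparisons.
grades = ["win"] * 55 + ["loss"] * 25 + ["tie"] * 20
print(f"{win_difference(grades) * 100:.0f} percentage points")
```

A win difference of 0 means wins and losses balance out, and the value ranges from -100 to +100 percentage points.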


Jean Feng@Jean_J_Feng

The incredible team behind this work: @vrpatel97, Patrick Heagerty, Yifan Mai, Venkatesh Sivaraman, @ppatrickv, Jialin Ouyang, @AnupamBJena

@UCSF_BCHSI, @UCJointCPH, @UCSF_Epibiostat, @UWBiostat


Village Chief Dean ∞@GlobalResusGuy

@Jean_J_Feng Really cool paper. The level of reasoning allowed for LLM responses is arguably one of the most important parts of the study.


Maya Inspired@Maya4Rights

@Jean_J_Feng Statisticians + clinicians + AI experts together on this. That's how you get meaningful evaluation. What did you find most surprising?