For the past two weeks, our independent team of statisticians, AI evaluation experts, clinical AI researchers, and clinicians was given a unique opportunity to test one question: “How well do different AI tools answer user questions on the OpenEvidence (OE) platform?”
Expert medical evaluation of clinical AI systems on real point-of-care queries finds notable performance gaps
The two-week study tested systems using submitted OpenEvidence questions.
Users praised the paper evaluating AI tools on the OpenEvidence platform for its emphasis on the level of reasoning in LLM responses.

We were given access to 620 questions sampled from real queries submitted to OE, and 149 physicians across 36 states made blinded head-to-head comparisons between answers from frontier general-purpose models (Opus 4.8, Gemini 3.1 Pro, GPT5.5) and OE’s clinical AI tool.

Paper: https://arxiv.org/abs/2606.28960
Data: https://huggingface.co/datasets/jjfenglab/Real-POCQi
Code: https://github.com/jjfenglab/Real-POCQi-statistics

Today we are sharing the results on arXiv and the benchmark dataset on HuggingFace. Following our prespecified statistical analysis, OE scored highest on all five dimensions: accuracy, clinical utility, source quality, verifiability, and completeness.

But the more relevant question for AI application developers is whether their tool is delivering value in its intended use case for its target users. On that measure, our findings are positive.

Graders were assigned questions in their own specialty, both to maximize evaluation accuracy and to mirror the specialization that defines modern medicine.

Please see the paper for important details and nuances of the work. Our hope is that this starts a healthy discussion on how to best evaluate AI systems, because no evaluation is perfect and every evaluation involves tradeoffs.

At least for now, the hard work of engineering and customizing an AI system can still pay off, delivering meaningful performance gains to its target users.

So how do these results square with recent work reporting that general-purpose LLMs outperform OE? We think both findings can be true: when a specialized tool is evaluated in a setting outside of its intended workflow, it may indeed underperform.

On the primary endpoint of win differences (a model’s win rate minus its loss rate in head-to-head comparisons), OE led by 25–39 percentage points across all five dimensions (p < 0.001). Results were consistent across our *many* sensitivity analyses.
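
For readers who want the win-difference definition spelled out, here is a minimal Python sketch. It is an illustration only, not the study's analysis code (see the linked repository for that). The function name `win_difference`, the "win"/"loss"/"tie" labels, and the choice to count ties in the denominator are all assumptions made for this example, and the toy data is made up.

```python
from collections import Counter


def win_difference(outcomes):
    """Return win rate minus loss rate, in percentage points.

    `outcomes` holds one label per head-to-head comparison for a single
    model: "win", "loss", or "tie". Assumption for this sketch: ties count
    toward the total number of comparisons but toward neither rate.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons to score")
    return 100.0 * (counts["win"] - counts["loss"]) / total


# Made-up labels for illustration; these are not data from the study.
toy_outcomes = ["win"] * 6 + ["loss"] * 3 + ["tie"] * 1
print(win_difference(toy_outcomes))  # 30.0
```

A win difference of 0 means wins and losses balance out, and positive values mean the model won more comparisons than it lost.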

The incredible team behind this work: @vrpatel97, Patrick Heagerty, Yifan Mai, Venkatesh Sivaraman, @ppatrickv, Jialin Ouyang, @AnupamBJena
@UCSF_BCHSI, @UCJointCPH, @UCSF_Epibiostat, @UWBiostat

@Jean_J_Feng Really cool paper. The level of reasoning allowed for LLM responses is arguably one of the most important parts of the study.

@Jean_J_Feng Statisticians + clinicians + AI experts together on this. That's how you get meaningful evaluation. What did you find most surprising?