New SciConBench benchmark of 9,110 Cochrane questions shows frontier AI agents struggle to synthesize scientific evidence

Frontier AI systems performed poorly on the synthesis tasks

Original post
Manoel@manoelribeiro

New preprint!

We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews.

We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well.

A thread 🧵

w/ @hayounggjung, @korolova & others

5:57 AM · Jun 11, 2026 · 12.3K Views
Sentiment

Users praise the new SciConBench benchmark for delivering a useful reality check on AI hype while commending the research team's effort behind the paper.

Positive: 100.0% · Negative: 0.0% (3 comments with sentiment)
Cluster Engagement
Posts from X
Most Activity
Views 5.5K · Bookmarks 18 · Likes 40 · Retweets 7 · Replies 6
Gary Marcus@GaryMarcus

🚨Devastating to a lot of overclaims about AI as scientist. 🚨

Manoel@manoelribeiro

New preprint!

We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews.

We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well.

A thread 🧵

w/ @hayounggjung, @korolova & others

2h · Views 5.5K · Likes 40 · Bookmarks 18
Manoel@manoelribeiro

Paper: https://arxiv.org/pdf/2606.11337

Code: https://github.com/hayoungjungg/SciConBench

Data: https://huggingface.co/datasets/hayoungjung/SciConBench

9h · Views 198 · Likes 4 · Bookmarks 4
Manoel@manoelribeiro

The results are surprising!

Under clean-room evaluation, the best-performing system, o3-deep-research, reaches only: F1 = 0.337

Even frontier models and deep research agents are far from reliably synthesizing scientific conclusions.

9h · Views 161 · Likes 5 · Bookmarks 2
Manoel@manoelribeiro

This paper was a herculean effort, led by @hayoungjung! Our very first paper together!

I’m incredibly proud of the work and grateful to him and the whole team, including the doctor co-authors who helped us annotate the dataset!

9h · Views 262 · Likes 6 · Bookmarks 2
Manoel@manoelribeiro

Existing benchmarks test intermediate skills: retrieval, citation grounding, QA, or summarization.

But real-world scientific synthesis is a long-horizon task! E.g., one must find the evidence > filter it > assess quality > reconcile conflicts > write a conclusion.

9h · Views 252 · Likes 6 · Bookmarks 2
Manoel@manoelribeiro

We also audited consumer-facing systems, including Google AI Overview, Google AI Mode, and OpenEvidence.

Even without clean-room restrictions, these systems often produced incomplete or contradictory scientific conclusions.

This is especially concerning in health contexts!

9h · Views 159 · Likes 5 · Bookmarks 2
Manoel@manoelribeiro

To test that, we introduce SCICONBENCH, a live benchmark built from the Cochrane Database of Systematic Reviews.

Each item pairs a scientific/clinical question with an expert-written conclusion from a systematic review.

In total: 9.11K questions and conclusions.

9h · Views 188 · Likes 4 · Bookmarks 2
Manoel@manoelribeiro

AI systems increasingly do more than retrieve evidence: they weigh claims and produce conclusions used in consequential settings, including by doctors.

But how do we know they’re synthesizing evidence, rather than finding a synthesis someone already wrote?

9h · Views 384 · Likes 5 · Bookmarks 1
Manoel@manoelribeiro

A major challenge: leakage.

If an AI agent can simply find the Cochrane review or derivative summaries online, then it is *retrieving* the answer and not synthesizing it!

So we built SCICONHARNESS, a "clean-room evaluation" harness.

9h · Views 177 · Likes 5 · Bookmarks 1
Manoel@manoelribeiro

The broader takeaway: Scientific AI agents are powerful, but reliable scientific synthesis is still an open problem!

We hope our benchmark and analyses are a step in the right direction.

9h · Views 165 · Likes 5 · Bookmarks 1
Manoel@manoelribeiro

We then evaluate generated conclusions by decomposing them into atomic facts.

For each answer, we measure:

- Factual precision: are the generated facts supported?
- Factual recall: does the answer cover the key reference facts?
- Factual F1: the overall synthesis quality
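
For readers who want to see how the three scores relate, here is a minimal sketch. It assumes an upstream judge has already decomposed an answer into atomic facts and labeled each one; the function name `factual_scores` and the boolean-list inputs are hypothetical and not taken from the SciConBench code. F1 is computed as the standard harmonic mean of precision and recall, and the paper may aggregate across questions differently.

```python
def factual_scores(generated_supported, reference_covered):
    """Compute (precision, recall, f1) from per-fact boolean judgments.

    generated_supported: one bool per atomic fact in the generated answer,
        True if the fact is supported (hypothetical judge output).
    reference_covered: one bool per atomic fact in the reference conclusion,
        True if the generated answer covers it (hypothetical judge output).
    """
    # Precision: share of generated facts that are supported.
    precision = (
        sum(generated_supported) / len(generated_supported)
        if generated_supported else 0.0
    )
    # Recall: share of reference facts that the answer covers.
    recall = (
        sum(reference_covered) / len(reference_covered)
        if reference_covered else 0.0
    )
    # F1: harmonic mean of the two, defined as 0 when both are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative input only: 2 of 4 generated facts supported,
# 1 of 4 reference facts covered.
precision, recall, f1 = factual_scores(
    [True, True, False, False],
    [True, False, False, False],
)
```

Because F1 is a harmonic mean, an answer that is precise but omits most reference facts (or the reverse) still scores low.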

9h · Views 159 · Likes 5 · Bookmarks 1
Manoel@manoelribeiro

SCICONHARNESS gives agents controlled tools for web search, browsing, and paper search, but it filters out ground-truth artifacts:

Cochrane links, matching review titles, and sources published after the review date.

This lets us measure synthesis rather than shortcut retrieval!
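
The three filtering rules described here can be sketched roughly as follows. This is not the SCICONHARNESS implementation: the `SearchResult` type, the `is_allowed` function, the blocked-domain list, the substring title match, and the choice to keep undated sources are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional
from urllib.parse import urlparse

# Assumed list of Cochrane-owned domains; the real harness may block more.
BLOCKED_DOMAINS = ("cochrane.org", "cochranelibrary.com")


@dataclass
class SearchResult:  # hypothetical result record
    url: str
    title: str
    published: Optional[date]


def _normalize(text):
    """Lowercase and collapse whitespace for a loose title comparison."""
    return " ".join(text.lower().split())


def is_allowed(result, review_title, review_date):
    """Return False for results that could leak the reference conclusion."""
    host = (urlparse(result.url).hostname or "").lower()
    # Rule 1: drop Cochrane links.
    if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
        return False
    # Rule 2: drop results whose title contains the review title.
    if _normalize(review_title) in _normalize(result.title):
        return False
    # Rule 3: drop sources published after the review date.
    # Undated sources are kept here; a stricter harness might drop them.
    if result.published is not None and result.published > review_date:
        return False
    return True


# Illustrative data only; titles and URLs are made up.
review_title = "Example intervention for example condition"
review_date = date(2020, 6, 1)
results = [
    SearchResult("https://www.cochranelibrary.com/x", "Something", date(2019, 1, 1)),
    SearchResult("https://example.org/summary",
                 "Example intervention for example condition: a plain summary",
                 date(2018, 1, 1)),
    SearchResult("https://example.org/late", "Later trial", date(2021, 1, 1)),
    SearchResult("https://example.org/trial-2015", "An earlier trial", date(2015, 1, 1)),
]
kept = [r for r in results if is_allowed(r, review_title, review_date)]
```

Only the last result survives: the first is a Cochrane link, the second matches the review title, and the third postdates the review.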

9h · Views 166 · Likes 4 · Bookmarks 1
Manoel@manoelribeiro

We also find pervasive factual quality problems.

Across systems, many generated conclusions contain at least one fact contradicting the reference review.

And nearly all contain at least one unsupported fact, suggesting unreliable synthesis.

9h · Views 151 · Likes 4 · Bookmarks 1

@manoelribeiro @hayounggjung @korolova do the SAME test again and tell it to ignore peer review and strictly apply empirical science *****

You want to bet now your results will differ hugely ?

8h · Views 85
Suresh@_Suresh2

@manoelribeiro @hayounggjung @korolova i always wondered if cochrane data would make llms more likely to just guess

4h · Views 42
Invincible@InvincibleEdge

@GaryMarcus benchmarks catching up to the marketing hype is always a good reality check

2h
Blissy@BlissyOnX

@GaryMarcus benchmarks like this are good but the framing feels a little selective

wonder how deep the gap actually is with domain specific training

2h
Kevin M Jablonka@kmjablonka

@manoelribeiro Very cool, might also have some links to https://arxiv.org/abs/2604.18805

2h