Tech · 7h ago

New SciConBench benchmark of 9,110 Cochrane questions shows frontier AI agents struggle to synthesize scientific evidence

Frontier AI systems performed poorly on the synthesis tasks

Original post

Manoel@manoelribeiro

New preprint!

We introduce a new benchmark, SciConBench, with 9.11k scientific questions derived from Cochrane Systematic Reviews.

We find evidence that frontier AI agents **cannot** synthesize scientific conclusions well.

A thread 🧵

w/ @hayounggjung, @korolova & others

5:57 AM · Jun 11, 2026 · 12.3K Views

Sentiment

Users praise the new SciConBench benchmark for delivering a useful reality check on AI hype while commending the research team's effort behind the paper.

Positive: 100.0% · Negative: 0.0% (3 comments with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views: 5.5K · Bookmarks: 18 · Likes: 40 · Retweets: 7 · Replies: 6

Gary Marcus@GaryMarcus

🚨Devastating to a lot of overclaims about AI as scientist. 🚨

2h

Manoel@manoelribeiro

Paper: https://arxiv.org/pdf/2606.11337

Code: https://github.com/hayoungjungg/SciConBench

Data: https://huggingface.co/datasets/hayoungjung/SciConBench

9h

Manoel@manoelribeiro

The results are surprising!

Under clean-room evaluation, the best-performing system, o3-deep-research, reaches only: F1 = 0.337

Even frontier models and deep research agents are far from reliably synthesizing scientific conclusions.

9h

Manoel@manoelribeiro

This paper was a herculean effort, led by @hayoungjung! Our very first paper together!

I’m incredibly proud of the work and grateful to him and the whole team, including the doctor co-authors who helped us annotate the dataset!

9h

Manoel@manoelribeiro

Existing benchmarks test intermediate skills: retrieval, citation grounding, QA, or summarization.

But real-world scientific synthesis is a long-horizon task! E.g., one must find the evidence > filter it > assess quality > reconcile conflicts > write a conclusion.

9h

Manoel@manoelribeiro

We also audited consumer-facing systems, including Google AI Overview, Google AI Mode, and OpenEvidence.

Even without clean-room restrictions, these systems often produced incomplete or contradictory scientific conclusions.

This is especially concerning in health contexts!

9h

Manoel@manoelribeiro

To test that, we introduce SCICONBENCH, a live benchmark built from the Cochrane Database of Systematic Reviews.

Each item pairs a scientific/clinical question with an expert-written conclusion from a systematic review.

In total: 9.11K questions and conclusions.

9h

Manoel@manoelribeiro

AI systems increasingly do more than retrieve evidence: they weigh claims and produce conclusions used in consequential settings, including by doctors.

But how do we know they’re synthesizing evidence, rather than finding a synthesis someone already wrote?

9h

Manoel@manoelribeiro

A major challenge: leakage.

If an AI agent can simply find the Cochrane review or derivative summaries online, then it is *retrieving* the answer and not synthesizing it!

So we built SCICONHARNESS, a "clean-room evaluation" harness.

9h

Manoel@manoelribeiro

The broader takeaway: Scientific AI agents are powerful, but reliable scientific synthesis is still an open problem!

We hope our benchmark and analyses are a step in the right direction.

9h

Manoel@manoelribeiro

We then evaluate generated conclusions by decomposing them into atomic facts.

For each answer, we measure:

- Factual precision: are the generated facts supported?
- Factual recall: does the answer cover the key reference facts?
- Factual F1: the overall synthesis quality
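
To make the three metrics concrete, here is a minimal Python sketch. It assumes each atomic fact has already been judged (by an annotator or a model) and that Factual F1 is the usual harmonic mean of precision and recall. The `FactJudgments` container and `factual_scores` function are hypothetical names for illustration, not the paper's actual code.

```python
from dataclasses import dataclass


@dataclass
class FactJudgments:
    """Hypothetical container for the judgments on one generated answer."""
    generated_supported: list[bool]  # one flag per atomic fact in the generated answer
    reference_covered: list[bool]    # one flag per atomic fact in the reference conclusion


def factual_scores(j: FactJudgments) -> tuple[float, float, float]:
    """Return (precision, recall, F1) over atomic facts.

    Assumption: precision is the share of generated facts that are supported,
    recall is the share of reference facts the answer covers, and F1 is their
    harmonic mean. Empty inputs score 0.0 rather than raising.
    """
    precision = (
        sum(j.generated_supported) / len(j.generated_supported)
        if j.generated_supported else 0.0
    )
    recall = (
        sum(j.reference_covered) / len(j.reference_covered)
        if j.reference_covered else 0.0
    )
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall > 0 else 0.0
    )
    return precision, recall, f1


# Illustrative (made-up) judgments: 3 of 4 generated facts supported,
# 1 of 2 reference facts covered.
example = FactJudgments(
    generated_supported=[True, True, True, False],
    reference_covered=[True, False],
)
p, r, f1 = factual_scores(example)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, an answer that is precise but omits most reference facts (or the reverse) still scores low.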

9h

Manoel@manoelribeiro

SCICONHARNESS gives agents controlled tools for web search, browsing, and paper search, but it filters out ground-truth artifacts:

Cochrane links, matching review titles, and sources published after the review date.

This lets us measure synthesis rather than shortcut retrieval!
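
The three filtering rules described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SCICONHARNESS implementation: the `SearchResult` type, the hostname substring check for Cochrane links, the case-insensitive title match, and the choice to keep results with an unknown publication date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional
from urllib.parse import urlparse


@dataclass
class SearchResult:
    """Hypothetical search hit returned by a harness tool."""
    url: str
    title: str
    published: Optional[date]  # None when the publication date is unknown


def _normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences
    # do not hide a matching title.
    return " ".join(text.lower().split())


def is_allowed(result: SearchResult, review_title: str, review_date: date) -> bool:
    """Return False for results that could leak the reference conclusion."""
    host = urlparse(result.url).hostname or ""
    if "cochrane" in host:  # assumption: Cochrane links detected by hostname
        return False
    if _normalize(review_title) in _normalize(result.title):
        return False  # title matches the review itself or a derivative summary
    if result.published is not None and result.published > review_date:
        return False  # published after the review date
    return True  # assumption: results with unknown dates are kept


def clean_room_filter(results, review_title, review_date):
    """Keep only the results that pass all three checks."""
    return [r for r in results if is_allowed(r, review_title, review_date)]
```

A real harness would likely need fuzzier title matching and a policy for undated sources. The sketch only shows the three checks being applied to each search result before the agent sees it.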

9h

Manoel@manoelribeiro

We also find pervasive factual quality problems.

Across systems, many generated conclusions contain at least one fact contradicting the reference review.

And nearly all contain at least one unsupported fact, suggesting unreliable synthesis.

9h

Bounded System Technology Ltd@boundedsystem

@manoelribeiro @hayounggjung @korolova do the SAME test again and tell it to ignore peer review and strictly apply empirical science *****

You want to bet now your results will differ hugely ?

8h

Suresh@_Suresh2

@manoelribeiro @hayounggjung @korolova i always wondered if cochrane data would make llms more likely to just guess

4h

Invincible@InvincibleEdge

@GaryMarcus benchmarks catching up to the marketing hype is always a good reality check

Blissy@BlissyOnX

@GaryMarcus benchmarks like this are good but the framing feels a little selective

wonder how deep the gap actually is with domain specific training

Kevin M Jablonka@kmjablonka

@manoelribeiro Very cool, might also have some links to https://arxiv.org/abs/2604.18805