
Case study: on this dense theoretical preprint, Refine won all 9 matches.
It found 43 atomic concerns, all supported by the paper; 19 threatened a main result.
7/8
Users are excited that Refine won 90% of the matches in its new AI paper review benchmark, and see the result as progress toward stronger review tools and benchmarks.

The advantage was broad across economics:
macro 88.9%, econometrics 88.9%, applied micro 87.1%, theory 92.6%.
6/8

The benchmark covered 5 single-shot frontier reasoning models and 4 systems consisting of a frontier model + an open-source scaffold.
Refine won against every comparison system, including Fable 5 (high). The closest matches came from scaffolded reviewers.
2/8

Our procedure was informed by the literature.
One-shot grading can favor longer reviews, or reviews that fabricate issues.
Instead, we decomposed each review into paper-grounded atomic concerns and judged each match on the true issues that one system found and the other missed.
3/8

Beyond the headline numbers, our findings show that scaffolds made reviewers much stronger.
Refine's win rate against ordinary single-shot LLM referee reports was 94.8%.
Against scaffolded review systems, it was 85.0%.
As of today, harnesses matter.
4/8

A system won if it identified more genuine concerns the other system missed ("residual concerns").
Refine averaged 28.1 unique residual concerns per match; comparison reviews averaged 14.5.
Substantive concerns: 22.1 vs. 11.8.
5/8
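
The win rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: the `Concern` type, the `issue_id` label used to decide whether two reviews raised the same issue, the tie handling, and the example data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concern:
    issue_id: str    # hypothetical label identifying the underlying issue
    supported: bool  # True if the concern was judged to be grounded in the paper


def residual_concerns(own, other):
    """Return supported concerns in `own` whose issue is not raised in `other`."""
    other_ids = {c.issue_id for c in other if c.supported}
    seen = set()
    residual = []
    for c in own:
        # Skip unsupported concerns, issues the other review also found,
        # and repeats of an issue already counted.
        if c.supported and c.issue_id not in other_ids and c.issue_id not in seen:
            seen.add(c.issue_id)
            residual.append(c)
    return residual


def match_winner(review_a, review_b):
    """Compare residual-concern counts; ties are an assumption of this sketch."""
    a = len(residual_concerns(review_a, review_b))
    b = len(residual_concerns(review_b, review_a))
    if a > b:
        return "A"
    if b > a:
        return "B"
    return "tie"


# Made-up example data, for illustration only.
review_a = [
    Concern("identification", True),
    Concern("proof-gap-lemma-2", True),
    Concern("sample-selection", True),
    Concern("made-up-issue", False),  # fabricated issue: never counts
]
review_b = [
    Concern("identification", True),
    Concern("typo-eq-3", True),
]

print(match_winner(review_a, review_b))
```

In this sketch a fabricated concern earns nothing and a shared concern cancels out, so padding a review with extra or invented issues does not help it win.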

We are excited for both tools and benchmarks in this area to develop.
We'll keep innovating to keep Refine the strongest one-stop reviewer.
Full report: https://www.refine.ink/blog/refine-ai-reviewer-benchmark
8/8