Refine Releases Benchmark in Which It Won 90% of AI Paper Review Matches

Sentiment

Users are excited that Refine won 90% of AI paper review matches in its new benchmark, reading the result as progress toward stronger review tools and benchmarks.

Positive 100.0%, negative 0.0% (1 comment with sentiment).

Posts from X

Ben Golub@ben_golub

Case study: on this dense theoretical preprint, Refine won all 9 matches.

It found 43 atomic concerns, all supported by the paper; 19 threatened a main result.

7/8

Ben Golub@ben_golub

The advantage was broad across economics:

macro 88.9%, econometrics 88.9%, applied micro 87.1%, theory 92.6%.

6/8

Ben Golub@ben_golub

The benchmark covered 5 single-shot frontier reasoning models and 4 systems consisting of a frontier model + an open-source scaffold.

Refine won against every comparison system, including Fable 5 (high). The closest matches came from scaffolded reviewers.

2/8

Ben Golub@ben_golub

Our procedure was informed by the literature.

One-shot grading can be biased toward longer reviews, or toward ones that fabricate issues.

Instead, we decomposed reviews into paper-grounded atomic concerns and judged on true issues found by one system and missed by another.

3/8

Ben Golub@ben_golub

Beyond the headline numbers, our findings show that scaffolds made reviewers much stronger.

Refine's win rate against ordinary single-shot LLM referee reports was 94.8%.

Against scaffolded review systems, it was 85.0%.

As of today, harnesses matter.

4/8
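
As a rough consistency check, the two win rates in this post can be pooled into a single overall figure. The sketch below assumes every comparison system played the same number of matches, which the thread does not state; the system counts (5 single-shot models, 4 scaffolded systems) come from post 2/8. The function name `pooled_win_rate` is illustrative only.

```python
# Assumption (not stated in the thread): each comparison system
# contributed the same number of matches, so groups are weighted
# by their number of systems.
single_shot_systems = 5       # from post 2/8
scaffolded_systems = 4        # from post 2/8
win_rate_single_shot = 94.8   # percent, from post 4/8
win_rate_scaffolded = 85.0    # percent, from post 4/8


def pooled_win_rate(groups):
    """Weighted average of win rates; groups is a list of (weight, rate) pairs."""
    total_weight = sum(weight for weight, _ in groups)
    return sum(weight * rate for weight, rate in groups) / total_weight


overall = pooled_win_rate([
    (single_shot_systems, win_rate_single_shot),
    (scaffolded_systems, win_rate_scaffolded),
])
print(f"{overall:.1f}%")
```

Under that equal-weighting assumption the pooled rate comes out near 90.4%, in line with the roughly 90% headline figure. A different match allocation would shift it.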

Ben Golub@ben_golub

A system won if it identified more genuine concerns the other system missed ("residual concerns").

Refine averaged 28.1 unique residual concerns per match; comparison reviews averaged 14.5.

Substantive concerns: 22.1 vs. 11.8.

5/8
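
The win rule described in this post can be sketched as a set difference over paper-grounded concerns. This is a minimal illustration, not Refine's actual pipeline: the `Concern` type, the `concern_id` key used to match the same issue across two reviews, and the tie outcome are all assumptions made for the example. In practice, deciding that two concerns are the same issue, or that one is supported by the paper, would need a judging step that the thread does not detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concern:
    concern_id: str  # hypothetical key identifying the underlying issue
    grounded: bool   # True if the concern is supported by the paper


def residual_concerns(own, other):
    """Grounded concerns raised in `own` that `other` did not raise."""
    other_ids = {c.concern_id for c in other if c.grounded}
    return {c.concern_id for c in own
            if c.grounded and c.concern_id not in other_ids}


def judge_match(review_a, review_b):
    """Return 'A', 'B', or 'tie' by comparing residual-concern counts."""
    residual_a = residual_concerns(review_a, review_b)
    residual_b = residual_concerns(review_b, review_a)
    if len(residual_a) > len(residual_b):
        return "A"
    if len(residual_b) > len(residual_a):
        return "B"
    return "tie"  # assumption: the thread does not say how ties are handled


review_a = [
    Concern("identification", True),
    Concern("proof-gap", True),
    Concern("missing-robustness-check", True),
]
review_b = [
    Concern("identification", True),
    Concern("notation", True),
]
# A longer review padded with unsupported concerns gains nothing.
padded_b = review_b + [Concern(f"fabricated-{i}", False) for i in range(5)]

print(judge_match(review_a, review_b))
print(judge_match(review_a, padded_b))
```

Because ungrounded concerns are filtered out before counting, padding a review with fabricated issues does not change the outcome, which is the bias the thread says one-shot grading is prone to.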

Ben Golub@ben_golub

We are excited for both tools and benchmarks in this area to develop.

We'll keep innovating to keep Refine the strongest one-stop reviewer.

Full report: https://www.refine.ink/blog/refine-ai-reviewer-benchmark

8/8
