
Chenhao Tan’s benchmark of four AI peer-review systems finds combining agentic approaches yields more effective academic reviews

The systems were tested against real submitted academic papers


Original post

Dang Nguyen@divingwithorcas

There has been a wave of AI reviewing systems (OpenAIReview, @RefineDotInk, @CoarseDotInk, @Reviewer3, etc.). But are they any good?

Our new paper benchmarks 4 agentic review systems on real papers 🧵

10:02 AM · Jun 22, 2026 · 1K Views


Posts from X


Chenhao Tan@ChenhaoTan

Check out our work on benchmarking AI reviewing systems! Harness matters and different systems can be complementary.


Dang Nguyen@divingwithorcas

📄Check out our paper: https://arxiv.org/abs/2606.19749. 💻Try OpenAIReview: https://openaireview.org.

Thanks @Reviewer3 for the credits which made benchmarking the system possible!

And big shoutout to my co-authors Anita Hao, @yanaiela, @ChenhaoTan for making this project not only possible but enjoyable!


Dang Nguyen@divingwithorcas

@RefineDotInk @CoarseDotInk @Reviewer3 We evaluate four systems: a zero-shot baseline, OpenAIReview (created by us), ‘coarse, and Reviewer3, using two approaches: 1. checking whether a review system’s behavior correlates with proxies for quality, and 2. checking whether it can detect known errors in a paper.


Dang Nguyen@divingwithorcas

@RefineDotInk @CoarseDotInk @Reviewer3 Overall, our evaluations show that AI review systems can already do certain parts of reviewing well, such as detecting errors, and are poised to assist human reviewers in the near future.


Dang Nguyen@divingwithorcas

@RefineDotInk @CoarseDotInk @Reviewer3 Our paper received some feedback from OpenAIReview itself! The system did not give us an easy pass, but did leave us with an encouraging comment:


Dang Nguyen@divingwithorcas

Next, we evaluate whether models can correctly detect errors. We perturb the content of a paper to introduce errors and check whether the generated reviews recover them.

OpenAIReview + GPT 5.5 achieves the best recall at ~72%. Both ‘coarse and Reviewer3 catch fewer errors than OpenAIReview and even zero-shot across a variety of models from different families.

This gives encouraging evidence for models being able to detect real errors in papers. They likely can already assist human reviewers in real workflows.
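
As a rough illustration of how recall on such a perturbation benchmark could be computed, here is a minimal sketch. Everything in it is an assumption made for illustration: the `InjectedError` schema, the keyword-based matcher, and the toy data are hypothetical and are not taken from the paper, which may match comments to errors quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectedError:
    """One error deliberately introduced into a paper (hypothetical schema)."""
    error_id: str
    keywords: frozenset[str]  # lowercase terms a comment must mention to count as a hit


def is_recovered(error: InjectedError, comments: list[str]) -> bool:
    """Toy matcher: an error counts as recovered if any single comment
    mentions all of its keywords. A real benchmark would likely need a
    stronger judge than keyword matching."""
    for comment in comments:
        text = comment.lower()
        if all(kw in text for kw in error.keywords):
            return True
    return False


def recall(errors: list[InjectedError], comments: list[str]) -> float:
    """Fraction of injected errors that the review recovers."""
    if not errors:
        raise ValueError("need at least one injected error")
    hits = sum(is_recovered(e, comments) for e in errors)
    return hits / len(errors)


# Toy data, invented for this example only.
errors = [
    InjectedError("e1", frozenset({"learning rate", "table 2"})),
    InjectedError("e2", frozenset({"sign", "equation 4"})),
    InjectedError("e3", frozenset({"p-value"})),
]
review = [
    "The learning rate in Table 2 contradicts the text.",
    "The reported p-value does not match the test statistic.",
]
score = recall(errors, review)  # two of the three injected errors are recovered
```

Recall here only counts injected errors that were found; it says nothing about how many other comments the review contains.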


Dang Nguyen@divingwithorcas

Zero-shot provides a strong baseline, achieving 80% accuracy overall. So 80% of the time, zero-shot gives more comments on lower-quality papers than on high-quality ones.

OpenAIReview has the best accuracy at 83%. Reviewer3 closely follows at 80%. ‘coarse *underperforms* the baseline at 66%.

This shows that models can track a paper’s quality via comments without being explicitly asked to do so.
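
One way to read this accuracy number is as a pairwise comparison of comment counts. The sketch below assumes each low-quality paper is paired with a high-quality one and that ties count as misses; both the pairing scheme and the tie rule are assumptions for illustration, not details confirmed by the thread.

```python
def pairwise_accuracy(pairs: list[tuple[int, int]]) -> float:
    """Each pair is (comments on the low-quality paper,
    comments on the high-quality paper).

    A pair counts as correct when the low-quality paper receives strictly
    more comments. Ties count as incorrect (an assumption of this sketch).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(low > high for low, high in pairs)
    return correct / len(pairs)


# Toy comment counts, invented for this example only.
pairs = [(12, 7), (9, 9), (15, 4), (6, 8), (11, 5)]
acc = pairwise_accuracy(pairs)  # 3 of the 5 pairs are ordered as expected
```

A system that ignored quality entirely would land near 50% under this reading, which is why an 80% zero-shot baseline counts as strong.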


Dang Nguyen@divingwithorcas

@RefineDotInk @CoarseDotInk @Reviewer3 We recognize that these metrics do not fully capture a paper’s value, but they are meant to provide quality signals so that we can see whether review systems detect them.


Dang Nguyen@divingwithorcas

We analyze whether humans and AI systems comment on similar issues, and whether different models find similar issues. Humans and AI overlap on ~7 comments per paper, but each surfaces many issues the other misses (~9 and ~15 unique). Different models are also complementary: their union achieves 83.3% recall on the perturbation benchmark.

This suggests that combining different models as well as teaming up with humans can lead to better reviews.
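
The union effect can be sketched as follows: if each system's detections are a set of error IDs, the combined recall is computed over their union. The system names, error IDs, and numbers below are hypothetical and are not results from the paper.

```python
def union_recall(detected_by_system: dict[str, set[str]],
                 all_errors: set[str]) -> float:
    """Recall of the combined systems: an error counts as found
    if at least one system detected it."""
    if not all_errors:
        raise ValueError("need at least one known error")
    found = set().union(*detected_by_system.values()) & all_errors
    return len(found) / len(all_errors)


# Toy detections, invented for this example only.
all_errors = {"e1", "e2", "e3", "e4", "e5"}
detected = {
    "system_a": {"e1", "e2", "e3"},  # recall 0.6 on its own
    "system_b": {"e3", "e4"},        # recall 0.4 on its own
}
combined = union_recall(detected, all_errors)  # 4 of 5 errors found together
```

The union can never score below the best individual system, and it gains the most when the systems miss different errors, which is what complementarity means here.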


Dang Nguyen@divingwithorcas

As a bonus, we look into how users have been responding to our web version of OpenAIReview. There, we have a feature for users to rate a comment as “helpful” or “not helpful.”

The overall helpful-to-unhelpful ratio is 1.44 to 1.0, so the majority of rated comments given by OpenAIReview are helpful to real researchers. This is further strong evidence for the system’s potential in real workflows.
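
To make the ratio concrete, it can be converted into a share of rated comments. This is plain arithmetic on the 1.44 to 1.0 figure above; the function name is just an illustrative choice, and the result covers only comments that users actually rated.

```python
def helpful_share(helpful: float, unhelpful: float) -> float:
    """Convert a helpful-to-unhelpful ratio into the fraction of
    rated comments that were marked helpful."""
    total = helpful + unhelpful
    if total <= 0:
        raise ValueError("ratio parts must sum to a positive number")
    return helpful / total


share = helpful_share(1.44, 1.0)  # 1.44 / 2.44, roughly 0.59
```

So a 1.44 to 1.0 ratio corresponds to roughly 59% of rated comments being marked helpful.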


Dang Nguyen@divingwithorcas

The quality proxies: a paper is “high-quality” when it has a high citation count, awards, or high review scores, and “low-quality” when it has a low citation count, was never published, or has low review scores.

A good system should raise more (and more severe) issues with lower quality papers.


Dang Nguyen@divingwithorcas

@RefineDotInk @CoarseDotInk @Reviewer3 @yanaiela @ChenhaoTan This work might be of interest @qedScience @seungonekim @Joydeepb_robots
