There has been a wave of AI reviewing systems (OpenAIReview, @RefineDotInk, @CoarseDotInk, @Reviewer3, etc.). But are they any good?
Our new paper benchmarks 4 agentic review systems on real papers 🧵
The systems were tested against real submitted academic papers
Check out our work on benchmarking AI reviewing systems! Harness matters and different systems can be complementary.

📄Check out our paper: https://arxiv.org/abs/2606.19749. 💻Try OpenAIReview: https://openaireview.org.
Thanks @Reviewer3 for the credits which made benchmarking the system possible!
And big shoutout to my co-authors Anita Hao, @yanaiela, @ChenhaoTan for making this project not only possible but enjoyable!

@RefineDotInk @CoarseDotInk @Reviewer3 We evaluate four systems: a zero-shot baseline, OpenAIReview (created by us), ‘coarse, and Reviewer3. We use two approaches: (1) checking whether a review system’s behavior correlates with proxies for paper quality, and (2) testing whether it can detect known errors in a paper.

@RefineDotInk @CoarseDotInk @Reviewer3 Overall, our evaluations show that AI review systems can already handle certain parts of reviewing well, such as detecting errors, and are poised to assist human reviewers in the near future.

@RefineDotInk @CoarseDotInk @Reviewer3 Our paper received some feedback from OpenAIReview itself! The system did not give us an easy pass, but did leave us with an encouraging comment:

Next, we evaluate whether models can correctly detect errors. We perturb the content of a paper to introduce errors, then check whether the generated reviews recover them.
OpenAIReview + GPT 5.5 achieves the best recall at ~72%. Both ‘coarse and Reviewer3 catch fewer errors than OpenAIReview and even zero-shot across a variety of models from different families.
This gives encouraging evidence for models being able to detect real errors in papers. They likely can already assist human reviewers in real workflows.
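
A minimal sketch of how recall on a perturbation benchmark like this could be computed. It assumes each injected error has an ID and that some matching step (not shown, and not described in this thread) has already decided which injected errors a review recovered. The function name and the example IDs are hypothetical.

```python
def perturbation_recall(injected_errors, detected_errors):
    """Fraction of injected errors that the review recovered."""
    injected = set(injected_errors)
    if not injected:
        raise ValueError("need at least one injected error")
    # Only detections that correspond to an injected error count toward recall.
    return len(injected & set(detected_errors)) / len(injected)


# Hypothetical IDs, made up for illustration only.
injected = ["e1", "e2", "e3", "e4"]
detected = ["e1", "e3", "e4", "x9"]  # "x9" does not match any injected error
recall = perturbation_recall(injected, detected)
print(f"recall = {recall:.2f}")
```

Comments that do not map to an injected error (like "x9" here) neither help nor hurt recall; they may still be valid concerns about the original paper.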

Zero-shot provides a strong baseline, achieving 80% accuracy overall. So 80% of the time, zero-shot gives more comments on lower-quality papers than high-quality ones.
OpenAIReview has the best accuracy at 83%. Reviewer3 closely follows at 80%. ‘coarse *underperforms* the baseline at 66%.
This shows that models can track a paper’s quality via comments without being explicitly asked to do so.
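
One way to read the accuracy numbers above, sketched under an assumption: papers are compared in (lower-quality, higher-quality) pairs, and a pair counts as correct when the system leaves strictly more comments on the lower-quality paper. The pairing scheme, the tie handling, and the counts below are illustrative, not taken from the paper.

```python
def pairwise_accuracy(pairs):
    """pairs: list of (comments_on_low_quality, comments_on_high_quality)."""
    if not pairs:
        raise ValueError("need at least one pair")
    # A tie is treated as incorrect here; that choice is an assumption.
    correct = sum(1 for low, high in pairs if low > high)
    return correct / len(pairs)


# Hypothetical comment counts, made up for illustration only.
pairs = [(12, 7), (9, 9), (15, 4), (6, 8), (10, 3)]
accuracy = pairwise_accuracy(pairs)
print(f"accuracy = {accuracy:.2f}")
```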

@RefineDotInk @CoarseDotInk @Reviewer3 We recognize that these metrics do not fully capture a paper’s value. They are meant as rough signals of quality, so we can check whether review systems pick up on them.

We analyze whether humans and AI systems comment on similar issues, and whether different models find similar issues. Humans and AI overlap on ~7 comments per paper, but each surfaces many issues the other misses (~9 and ~15 unique). Different models are also complementary: their union achieves 83.3% recall on the perturbation benchmark.
This suggests that combining different models as well as teaming up with humans can lead to better reviews.
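
A small sketch of why a union of models can beat any single one: an injected error counts as found if at least one model catches it. The model names and error IDs are hypothetical, and the matching of review comments to errors is assumed to have been done already.

```python
def union_recall(injected_errors, detections_by_model):
    """Recall when an error counts as found if any model detected it."""
    injected = set(injected_errors)
    if not injected:
        raise ValueError("need at least one injected error")
    found = set().union(*detections_by_model.values()) & injected
    return len(found) / len(injected)


# Hypothetical data, made up for illustration only.
injected = ["e1", "e2", "e3", "e4", "e5"]
detections = {
    "model_a": {"e1", "e2"},
    "model_b": {"e2", "e3", "e4"},
}
per_model = {
    name: len(found & set(injected)) / len(injected)
    for name, found in detections.items()
}
combined = union_recall(injected, detections)
print(per_model, combined)
```

In this toy case the two models recover 2 and 3 of the 5 errors on their own, but 4 of 5 together, because they overlap on only one error.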

As a bonus, we look into how users have been responding to our web version of OpenAIReview. There, we have a feature for users to rate a comment as “helpful” or “not helpful.”
The overall helpful-to-unhelpful ratio is 1.44 to 1.0 (roughly 59% of rated comments), so the majority of rated comments from OpenAIReview are helpful to real researchers. This is further evidence of the system’s potential in real workflows.

The quality proxies: a paper is “high-quality” when it has a high citation count, awards, or high review scores, and “low-quality” when it has a low citation count, was never published, or received low review scores.
A good system should raise more (and more severe) issues with lower-quality papers.

@RefineDotInk @CoarseDotInk @Reviewer3 @yanaiela @ChenhaoTan This work might be of interest @qedScience @seungonekim @Joydeepb_robots