Nitya Nadgir's study finds AI agents let researchers reproduce scientific papers more than twice as fast

The work also updates CORE-Bench to evaluate agent scaffolds.

Original post

Nitya Nadgir@nityndg

Can AI agents help researchers reproduce research more quickly? We conducted an uplift study. The answer is yes: researchers reproduced papers > 2x faster using Codex with GPT-5.4 xhigh. In a new paper, we show many other results.

3:46 PM · Jun 30, 2026 · 1.3K Views

Sentiment

Users welcome the finding that AI agents help researchers reproduce papers more than 2x faster, note that it matches their own observations, and credit the collaborators involved.

Pos: 100.0% · Neg: 0.0%

2 comments with sentiment.

Posts from X

Nitya Nadgir@nityndg

When a benchmark’s accuracy saturates, the field usually replaces it with a harder one. We use CORE-Bench Hard, a benchmark for computational reproducibility, as a case study to show what we can still measure after accuracy saturates.

Paper: https://arxiv.org/pdf/2606.26158v1

Sayash Kapoor@sayashk

Really strong showing for @nityndg's first first-author paper.
- A small uplift study that finds researchers using Codex reproduce papers more than 2x faster.
- Updates to CORE-Bench to add OOD tasks, fix errors, and evaluate reliability and efficiency.
- Stronger scaffolds, including Claude Code, Codex, and OpenCode.

Nitya Nadgir@nityndg

Threats to construct validity affect many agent benchmarks, as documented by @adamlsteinl and @davisbrownr: https://debugml.github.io/cheating-agents/. Weaker agents often fail too early to reveal them, but stronger agents surface subtle shortcuts, grading errors, and valid alternative solutions.

Nitya Nadgir@nityndg

Our study showed human-agent collaboration reproduced results in less than half the time of human-only reproduction. This is conservative, since 5/25 human-only runs hit the 3-hour cap, while no human-agent runs did. We’re working on updating the estimate to account for this.
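
Why capped runs make the estimate conservative can be illustrated with a minimal sketch. The times below are made up for illustration and are not the study's data; the helper names are hypothetical. When a human-only run is stopped at the cap, its recorded time understates the true time, so the measured speedup can only be lower than or equal to the uncapped one.

```python
CAP_MINUTES = 180  # the 3-hour cap mentioned in the post


def mean(xs):
    return sum(xs) / len(xs)


def speedup(human_only, human_agent):
    """Ratio of mean completion times (human-only / human-agent)."""
    return mean(human_only) / mean(human_agent)


# Hypothetical "true" times in minutes; two human-only runs exceed the cap.
true_human_only = [90, 150, 240, 300, 120]
human_agent = [40, 60, 75, 80, 45]

# What gets recorded: runs are stopped at the cap (right-censored).
recorded_human_only = [min(t, CAP_MINUTES) for t in true_human_only]

observed = speedup(recorded_human_only, human_agent)
uncensored = speedup(true_human_only, human_agent)

print(f"observed speedup:   {observed:.2f}x")    # 2.40x
print(f"uncensored speedup: {uncensored:.2f}x")  # 3.00x
```

How the authors plan to adjust their estimate is not stated in the thread; this sketch only shows the direction of the bias.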

Nitya Nadgir@nityndg

Saturation helped improve benchmark validity: after Nicholas Carlini’s Claude Code scaffold reached near-ceiling accuracy on CORE-Bench Hard, we conducted log analysis to reveal 15 task-level errors and 20 exploitable shortcuts.

Nitya Nadgir@nityndg

Some findings:
1. We uncover errors in CORE-Bench Hard that are hard to surface before accuracy saturates.
2. Agents are consistent but under-confident & can’t tell when they’re wrong.
3. Human-agent collaboration provides substantial uplift for computational reproducibility.

Nitya Nadgir@nityndg

Paper: https://arxiv.org/pdf/2606.26158v1
Run CORE-Bench v1.1 and OOD: https://github.com/princeton-pli/hal-harness/tree/feat/corebenchv2-prefect
All logs, data, and figures: https://github.com/nnadgi01/corebench-analysis/tree/main

Nitya Nadgir@nityndg

We corrected these errors and shortcuts and added 10 new tasks with the same distribution. This produced CORE-Bench v1.1, with 39 tasks. A takeaway is that benchmarks need to evolve as agents improve. Log analysis is an essential part of this, as we argue in a recent paper.

Nitya Nadgir@nityndg

We also found efficiency differences: GPT-5.3-Codex (medium) and GPT-5.4 (high) had the same accuracy, but GPT-5.3-Codex cost ~60% less. Token use and dollar cost had different relationships with accuracy because harnesses used caching differently.
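
One way token use and dollar cost can diverge is prompt caching: cached input tokens are typically billed at a lower rate. The sketch below assumes a simple three-rate pricing model. The prices, token counts, and names (`Usage`, `Prices`, `dollar_cost`) are all hypothetical and do not come from the paper or from any provider's price list.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int         # uncached input tokens
    cached_input_tokens: int  # input tokens served from a prompt cache
    output_tokens: int


@dataclass
class Prices:
    # Dollars per million tokens; hypothetical values for illustration only.
    input: float
    cached_input: float
    output: float


def total_tokens(u: Usage) -> int:
    return u.input_tokens + u.cached_input_tokens + u.output_tokens


def dollar_cost(u: Usage, p: Prices) -> float:
    return (
        u.input_tokens * p.input
        + u.cached_input_tokens * p.cached_input
        + u.output_tokens * p.output
    ) / 1_000_000


prices = Prices(input=2.0, cached_input=0.2, output=8.0)

# Two hypothetical harnesses that use the same number of tokens overall,
# but only the second one gets cache hits on most of its input.
harness_a = Usage(input_tokens=1_000_000, cached_input_tokens=0, output_tokens=100_000)
harness_b = Usage(input_tokens=200_000, cached_input_tokens=800_000, output_tokens=100_000)

cost_a = dollar_cost(harness_a, prices)
cost_b = dollar_cost(harness_b, prices)

print(total_tokens(harness_a), f"${cost_a:.2f}")  # 1100000 $2.80
print(total_tokens(harness_b), f"${cost_b:.2f}")  # 1100000 $1.36
```

Under these assumptions the two harnesses look identical on a token axis and very different on a dollar axis, which is why the two measures can relate to accuracy differently.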

Nitya Nadgir@nityndg

Across five Codex CLI agents, more accurate agents were also more consistent in outcomes and resource use. But they were under-confident: pass rates were high while self-rated confidence stayed low, tracking failed bash commands more than task success.
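
A rough way to quantify this pattern is to compare mean self-rated confidence with the pass rate, and to check which signal confidence correlates with more strongly. The sketch below uses invented per-task records and hypothetical helper names; the paper's actual metrics may differ.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical per-task records for one agent (not data from the paper).
passed = [1, 1, 1, 1, 0]                  # 1 = task solved
failed_bash = [0, 1, 4, 2, 1]             # failed bash commands in the run
confidence = [0.9, 0.7, 0.2, 0.5, 0.7]    # self-rated confidence in [0, 1]

# Negative gap = under-confident: confidence is below the actual pass rate.
gap = mean(confidence) - mean(passed)

r_bash = pearson(confidence, failed_bash)
r_pass = pearson(confidence, passed)

print(f"pass rate {mean(passed):.2f}, mean confidence {mean(confidence):.2f}, gap {gap:+.2f}")
print(f"corr(confidence, failed bash) = {r_bash:+.2f}")
print(f"corr(confidence, passed)      = {r_pass:+.2f}")
```

In this toy data, confidence falls sharply as failed commands rise but barely moves with task success, which is the shape of the behavior the post describes.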

Nitya Nadgir@nityndg

Leaderboards often collapse models and scaffolds into one score, hiding what drives success. The same model can solve and fail differently across scaffolds: Opus 4.5 matched accuracy with two different scaffolds, CORE-Agent and OpenCode, but disagreed on 12/39 capsules.
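
Equal accuracy with different per-task outcomes is easy to miss in a single score. The following sketch uses a toy set of 10 capsules with invented results (not the actual 39-capsule outcomes) to show how two scaffolds can tie on accuracy while disagreeing on individual tasks. The function names are hypothetical.

```python
def accuracy(results):
    """Fraction of tasks passed; `results` maps task id -> bool."""
    return sum(results.values()) / len(results)


def disagreements(a, b):
    """Task ids where the two result sets differ (one passes, the other fails)."""
    return sorted(task for task in a if a[task] != b[task])


# Toy data: 10 hypothetical capsules, both scaffolds pass by default.
tasks = [f"capsule-{i}" for i in range(10)]
scaffold_a = {t: True for t in tasks}
scaffold_b = {t: True for t in tasks}

# Each scaffold fails two capsules the other one solves.
for t in ("capsule-0", "capsule-1"):
    scaffold_a[t] = False
for t in ("capsule-2", "capsule-3"):
    scaffold_b[t] = False

differ = disagreements(scaffold_a, scaffold_b)

print(accuracy(scaffold_a), accuracy(scaffold_b))  # 0.8 0.8
print(differ)  # ['capsule-0', 'capsule-1', 'capsule-2', 'capsule-3']
```

A leaderboard would show these two scaffolds as identical, even though they disagree on 4 of 10 tasks.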

Nitya Nadgir@nityndg

We tested whether saturation was specific to the original task distribution with CORE-Bench OOD, which shifts fields across physics, economics, engineering, and CS. Top agents were again statistically tied, suggesting saturation was not due to overfitting to benchmark fields.
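
The thread does not say which statistical test was used to call agents "tied". As one common heuristic, and purely as an assumption, the sketch below computes Wilson score intervals for pass rates and checks whether they overlap. The pass counts are hypothetical, and interval overlap is a rough screen rather than a formal significance test.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical pass counts out of 20 tasks (not results from the paper).
agent_a = wilson_interval(17, 20)
agent_b = wilson_interval(18, 20)
weak_agent = wilson_interval(2, 20)

print(intervals_overlap(agent_a, agent_b))     # True: cannot separate the two
print(intervals_overlap(agent_a, weak_agent))  # False: clearly different
```

With small task counts the intervals are wide, so agents near the ceiling are hard to tell apart on accuracy alone.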

Nitya Nadgir@nityndg

Finally, we studied something automated benchmarks do not directly measure: whether agents help humans do real work. CORE-Bench asks whether an agent can complete a computational reproducibility task autonomously, but many real-world uses of agents are collaborative.

Nitya Nadgir@nityndg

So we ran a randomized study on real-world computational reproducibility tasks. Five co-authors reproduced results from 20 ML and social science papers, with and without agent collaboration.

Nitya Nadgir@nityndg

This measures something different from benchmark accuracy: benchmark task success doesn’t directly mean an agent will be useful for a human. For example, in our study, there were three papers where a manual run was faster than a human-agent run.

Nitya Nadgir@nityndg

But this doesn’t mean the benchmark is no longer useful. We find that agents still differ along reliability, efficiency, and the relative performance of the model vs. the scaffold.

Nitya Nadgir@nityndg

Overall, accuracy saturation is not the end of a benchmark’s lifecycle. We hope our paper helps move agent evaluation beyond accuracy-centric leaderboards and closer to the various dimensions of performance that also matter in the real world.

Nitya Nadgir@nityndg

In addition to the speedup, our paper describes other findings, including human intervention levels, blocker comparisons, and qualitative descriptions of where humans found the agent provided the most value.

Nitya Nadgir@nityndg

Our study has limits, such as a small sample and a narrow focus on whether agents help reproduce one result per paper. Still, it offers useful data on AI’s impact on science. We discuss these and other limitations in our paper & hope this aids future work on AI uplift in science.

Nitya Nadgir@nityndg

Thanks to collaborators @sayashk, @random_walker, Kangheng Liu, @PKirgis, @matilda_orona, @steverab, @tilmanbayer, Abhishek Shetty, Yue Ling, Derrick Chan-Sew, Rumi Nakagawa, Saiteja Utpala, @siegelz_