Can AI agents help researchers reproduce research more quickly? We conducted an uplift study. The answer is yes: researchers reproduced papers > 2x faster using Codex with GPT-5.4 xhigh. In a new paper, we show many other results.
Nityn DG study finds AI agents allow researchers to reproduce scientific papers more than twice as fast
The work also updates CORE-Bench to evaluate agent scaffolds.

When a benchmark’s accuracy saturates, the field usually replaces it with a harder one. We use CORE-Bench Hard, a benchmark for computational reproducibility, as a case study to show what we can still measure after accuracy saturates.
Paper: https://arxiv.org/pdf/2606.26158v1
Really strong showing for @nityndg's first first-author paper.
- A small uplift study that finds researchers using Codex reproduce papers more than 2x faster.
- Updates to CORE-Bench to add OOD tasks, fix errors, and evaluate reliability and efficiency.
- Stronger scaffolds, including Claude Code, Codex, and OpenCode.

Threats to construct validity affect many agent benchmarks, as documented by @adamlsteinl and @davisbrownr: https://debugml.github.io/cheating-agents/. Weaker agents often fail too early to reveal them, but stronger agents surface subtle shortcuts, grading errors, and valid alternative solutions.

Our study showed human-agent collaboration reproduced results in less than half the time of human-only reproduction. This is conservative, since 5/25 human-only runs hit the 3-hour cap, while no human-agent runs did. We’re working on updating the estimate to account for this.
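To see why the cap makes the estimate conservative, here is a minimal sketch. It assumes the speedup is summarized as a ratio of mean completion times, which may not be the statistic the paper uses. The times below are made up for illustration and are not data from the study; `mean_speedup` and `apply_cap` are hypothetical helper names.

```python
from statistics import fmean

CAP_MIN = 180.0  # the 3-hour cap from the study, in minutes


def mean_speedup(human_only, human_agent):
    """Ratio of mean completion times: human-only / human-agent."""
    return fmean(human_only) / fmean(human_agent)


def apply_cap(times, cap=CAP_MIN):
    """Record any run that exceeds the cap as taking exactly the cap."""
    return [min(t, cap) for t in times]


# Hypothetical times in minutes, NOT data from the study.
# Two of the five human-only runs would have exceeded the cap.
true_human_only = [60.0, 90.0, 120.0, 240.0, 300.0]
human_agent = [30.0, 40.0, 50.0, 60.0, 70.0]

capped = mean_speedup(apply_cap(true_human_only), human_agent)
uncapped = mean_speedup(true_human_only, human_agent)
print(f"speedup with cap: {capped:.2f}x, without cap: {uncapped:.2f}x")
```

Because only human-only runs are truncated in this toy setup, the capped ratio can only understate the uncapped one, which is the sense in which the reported speedup is a lower bound.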

Saturation helped improve benchmark validity: after Nicholas Carlini’s Claude Code scaffold reached near-ceiling accuracy on CORE-Bench Hard, our log analysis revealed 15 task-level errors and 20 exploitable shortcuts.

Some findings:
1. We uncover errors in CORE-Bench Hard that are hard to surface before accuracy saturates.
2. Agents are consistent but under-confident & can’t tell when they’re wrong.
3. Human-agent collaboration provides substantial uplift for computational reproducibility.

Paper: https://arxiv.org/pdf/2606.26158v1
Run CORE-Bench v1.1 and OOD: https://github.com/princeton-pli/hal-harness/tree/feat/corebenchv2-prefect
All logs, data, and figures: https://github.com/nnadgi01/corebench-analysis/tree/main

We corrected these issues and added 10 new tasks with the same distribution. This produced CORE-Bench v1.1, with 39 tasks. A takeaway is that benchmarks need to evolve as agents improve. Log analysis is an essential part of this, as we argue in a recent paper.

We also found efficiency differences: GPT-5.3-Codex (medium) and GPT-5.4 (high) had the same accuracy, but GPT-5.3-Codex cost ~60% less. Token use and dollar cost had different relationships with accuracy because harnesses used caching differently.
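A rough sketch of how caching can pull token use and dollar cost apart. The per-million-token prices and token counts are placeholders chosen for round numbers, not real pricing or measurements from the paper, and `dollar_cost` is a hypothetical helper.

```python
def dollar_cost(uncached_in, cached_in, out, price_in, price_cached, price_out):
    """Dollar cost of a run; all prices are per million tokens."""
    total = uncached_in * price_in + cached_in * price_cached + out * price_out
    return total / 1_000_000


# Placeholder prices (USD per million tokens), NOT real pricing.
PRICE_IN, PRICE_CACHED, PRICE_OUT = 2.0, 0.2, 8.0

# Two hypothetical harnesses with identical token totals:
# 1,000,000 input tokens and 100,000 output tokens each.
no_cache = dollar_cost(1_000_000, 0, 100_000, PRICE_IN, PRICE_CACHED, PRICE_OUT)
with_cache = dollar_cost(100_000, 900_000, 100_000, PRICE_IN, PRICE_CACHED, PRICE_OUT)

print(f"no caching: ${no_cache:.2f}, heavy caching: ${with_cache:.2f}")
```

Both runs consume the same number of tokens, yet the dollar cost differs by more than 2x under these assumed prices, so a token-vs-accuracy plot and a cost-vs-accuracy plot need not rank harnesses the same way.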

Across five Codex CLI agents, more accurate agents were also more consistent in outcomes and resource use. But they were under-confident: pass rates were high while self-rated confidence stayed low, tracking failed bash commands more than task success.
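One simple way to put a number on under-confidence, as a sketch: compare the pass rate with the mean self-rated confidence, and check whether confidence is any higher on passed runs than on failed ones. The records and both metric names are illustrative assumptions, not the paper's definitions or data.

```python
from statistics import fmean

# Hypothetical (passed, self_rated_confidence) records, NOT data from the paper.
runs = [(True, 0.5), (True, 0.4), (True, 0.6), (False, 0.5)]


def confidence_gap(records):
    """Pass rate minus mean confidence; positive means under-confident."""
    pass_rate = fmean(1.0 if passed else 0.0 for passed, _ in records)
    mean_conf = fmean(conf for _, conf in records)
    return pass_rate - mean_conf


def separation(records):
    """Mean confidence on passed runs minus mean confidence on failed runs."""
    on_pass = fmean(conf for passed, conf in records if passed)
    on_fail = fmean(conf for passed, conf in records if not passed)
    return on_pass - on_fail


print(f"gap: {confidence_gap(runs):+.2f}, separation: {separation(runs):+.2f}")
```

A large positive gap with a separation near zero is the pattern described above: the agent succeeds often but rates itself low, and its confidence does not distinguish its correct runs from its wrong ones.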

Leaderboards often collapse models and scaffolds into one score, hiding what drives success. The same model can solve and fail differently across scaffolds: Opus 4.5 matched accuracy with two different scaffolds, CORE-Agent and OpenCode, but disagreed on 12/39 capsules.
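A small sketch of how two scaffolds can tie on accuracy and still disagree per task. The capsule IDs and outcomes are invented for illustration and are far smaller than the 39-capsule benchmark; `accuracy` and `disagreements` are hypothetical helpers.

```python
# Hypothetical per-capsule outcomes (True = solved), NOT results from the paper.
scaffold_a = {"c1": True, "c2": True, "c3": True, "c4": True, "c5": False, "c6": False}
scaffold_b = {"c1": True, "c2": True, "c3": True, "c4": False, "c5": True, "c6": False}


def accuracy(results):
    """Fraction of capsules solved."""
    return sum(results.values()) / len(results)


def disagreements(a, b):
    """Capsules where exactly one of the two scaffolds succeeded."""
    return sorted(task for task in a if a[task] != b[task])


print(accuracy(scaffold_a), accuracy(scaffold_b), disagreements(scaffold_a, scaffold_b))
```

A single leaderboard score would report these two runs as identical, while the per-capsule comparison shows each scaffold solving a task the other missed.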

We tested whether saturation was specific to the original task distribution with CORE-Bench OOD, which shifts fields across physics, economics, engineering, and CS. Top agents were again statistically tied, suggesting saturation was not due to overfitting to benchmark fields.

Finally, we studied something automated benchmarks do not directly measure: whether agents help humans do real work. CORE-Bench asks whether an agent can complete a computational reproducibility task autonomously, but many real-world uses of agents are collaborative.

So we ran a randomized study on real-world computational reproducibility tasks. Five co-authors reproduced results from 20 ML and social science papers, with and without agent collaboration.

This measures something different from benchmark accuracy: benchmark task success doesn’t directly mean an agent will be useful for a human. For example, in our study, there were three papers where a manual run was faster than a human-agent run.

But this doesn’t mean the benchmark is no longer useful. We find that agents still differ along reliability, efficiency, and the relative performance of the model vs. the scaffold.

Overall, accuracy saturation is not the end of a benchmark’s lifecycle. We hope our paper helps move agent evaluation beyond accuracy-centric leaderboards and closer to the various dimensions of performance that also matter in the real world.

In addition to the speedup, our paper describes other findings, including human intervention levels, blocker comparisons, and qualitative descriptions of where humans found the agent provided the most value.

Our study has limits, such as a small sample and a narrow focus on whether agents help reproduce one result per paper. Still, it offers useful data on AI’s impact on science. We discuss these and other limitations in our paper & hope this aids future work on AI uplift in science.

Thanks to collaborators @sayashk, @random_walker, Kangheng Liu, @PKirgis, @matilda_orona, @steverab, @tilmanbayer, Abhishek Shetty, Yue Ling, Derrick Chan-Sew, Rumi Nakagawa, Saiteja Utpala, @siegelz_