6h ago

NanoGPT-Bench Shows AI Coding Agents Recover Only 9% of Human Research Progress

Original post

Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress.

Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research.

NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access. 🧵

8:49 AM · May 19, 2026

QUOTE POST

#1675 Anshul Kundaje @anshulkundaje

Interesting contrast to the recent papers on AI co-scientists.

"Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research" 1/

Intology @IntologyAI

3:49 PM · May 19, 2026 · 82.6K Views

7:01 AM · May 21, 2026 · 4.8K Views

#1675 Anshul Kundaje @anshulkundaje

I'm not remotely an expert on this topic but I suspect coding agents & co-scientists are optimized to address problems rather differently. This is all quite nebulous IMO. 2/

7:01 AM · May 21, 2026 · 310 Views
