NanoGPT-Bench Shows AI Coding Agents Recover Only 9% of Human Research Progress

Original post

Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress.

Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research.

NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access. 🧵

8:49 AM · May 19, 2026

Interesting contrast to the recent papers on AI co-scientists.

"Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research" 1/

Intology @IntologyAI

3:49 PM · May 19, 2026 · 82.6K Views
7:01 AM · May 21, 2026 · 4.8K Views

I'm not remotely an expert on this topic but I suspect coding agents & co-scientists are optimized to address problems rather differently. This is all quite nebulous IMO. 2/

Anshul Kundaje @anshulkundaje

7:01 AM · May 21, 2026 · 310 Views
NanoGPT-Bench Shows AI Coding Agents Recover Only 9% of Human Research Progress · Digg