NanoGPT-Bench Shows AI Coding Agents Recover Only 9% of Human Research Progress
Interesting contrast to the recent papers on AI co-scientists.
"Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research" 1/
Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access. 🧵
I'm not remotely an expert on this topic but I suspect coding agents & co-scientists are optimized to address problems rather differently. This is all quite nebulous IMO. 2/
Interesting contrast to the recent papers on AI co-scientists. "Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research" 1/