IntologyAI releases NanoGPT-Bench to test coding agents on the NanoGPT Speedrun and finds models recover less than 10 percent of human speedups since September 2025
Agents replicated prior records instead of making algorithmic changes.
A fascinating reality check for AI coding agents. The new NanoGPT-Bench reveals that current agents (e.g., Claude Code and Codex) only recover 9.3% of human progress on AI R&D tasks.
Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress.
Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research.
NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access. 🧵
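For readers wondering what a figure like "9.3% of human progress" could mean in practice, here is a minimal sketch. It assumes the metric is the agent's training-time reduction divided by the human reduction over the same window; the thread does not spell out the exact definition, so this formula, the function name `fraction_recovered`, and all numbers below are illustrative assumptions and not the benchmark's own code or results.

```python
def fraction_recovered(baseline_s: float, agent_s: float, human_s: float) -> float:
    """Share of the human speedup (baseline -> human record) that the agent achieved.

    Assumed definition, for illustration only:
    (baseline - agent) / (baseline - human), all in seconds of training time.
    """
    human_gain = baseline_s - human_s
    if human_gain <= 0:
        raise ValueError("human record must be faster than the baseline")
    return (baseline_s - agent_s) / human_gain


# Hypothetical timings, not taken from NanoGPT-Bench.
example = fraction_recovered(baseline_s=200.0, agent_s=190.0, human_s=100.0)
print(f"{example:.1%}")  # prints "10.0%"
```

Under this reading, an agent that only matches the starting record scores 0%, and one that matches the final human record in the window scores 100%.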
Alas but well, automatically composing human state of the art is very convenient *for me*. At least we may have solved the critical skill issue of "using a strong baseline".
I think the results from this post are overblown and pretty misleading. ~All improvements come from the models copying human records and open PRs. None of the models' own “novel” ideas worked.
it's a spectrum
web access is, I think, legit
direction to PRs is not

Very interesting results from this NanoGPT-Bench eval.
There is so much talk about self-improving agents.
But can coding agents do real AI R&D?
@IntologyAI reports that Codex, Claude Code, and Autoresearch recover only 9.3% of human progress.
Coding agents spend more of their compute on hyperparameter tuning.
In fact, coding agents rarely attempt algorithmic research at all.
Claude Code and Autoresearch both reason more about algorithmic research, but still dodge implementation.
Read more here: https://www.intology.ai/blog/nanogpt-bench
GitHub project here: https://github.com/IntologyAI/NanoGPT-Bench/
i love the idea of this; i do think that i find it hard to put much faith in benchmarks that restrict internet access for coding agents. today's agents do not do a very good job without access to the internet. frankly, neither do humans. looking up other people's code is vital!