1d ago

IntologyAI releases NanoGPT-Bench to test coding agents on the NanoGPT Speedrun and finds models recover less than 10 percent of human speedups since September 2025

Agents replicated prior records instead of making algorithmic changes.

Original post

I think the results from this post are overblown and pretty misleading. ~All improvements come from the models copying human records and open PRs. None of the models' own “novel” ideas worked.

9:24 AM · May 18, 2026
Reposted by

A fascinating reality check for AI coding agents. The new NanoGPT-Bench reveals that current agents (e.g., Claude Code and Codex) only recover 9.3% of human progress on AI R&D tasks.

Intology @IntologyAI

Can coding agents do research? We release NanoGPT-Bench, an internal eval we’ve used to test agents on an AI R&D problem with months of human progress. Codex, Claude Code, Autoresearch recover only 9.3% of human progress, mostly tuning hyperparams & ignoring algorithmic research. NanoGPT-Bench is built on the NanoGPT Speedrun, a popular LLM pretraining competition to minimize the training time of a GPT-2 style model. Existing human submissions constitute nearly 2 years of work. To control for dependencies and contamination in frontier models, we standardize evaluation to a 5-month window of world records. Evaluation is fully autonomous and end-to-end, with no human intervention or internet access. 🧵

3:49 PM · May 19, 2026 · 44.4K Views
5:51 PM · May 19, 2026 · 3.4K Views

Alas but well, automatically composing human state of the art is very convenient *for me*. At least we may have solved the critical skill issue of "using a strong baseline".

Vincent @vvvincent_c


4:24 PM · May 18, 2026 · 24.8K Views
5:56 PM · May 18, 2026 · 3.9K Views

it's a spectrum. web access is, I think, legit. direction to PRs is not.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex


5:56 PM · May 18, 2026 · 3.9K Views
5:57 PM · May 18, 2026 · 1.8K Views

Very interesting results from this NanoGPT-Bench eval.

There is so much talk about self-improving agents.

But can coding agents do real AI R&D?

@IntologyAI reports that Codex, Claude Code, and Autoresearch recover only 9.3% of human progress.

Coding agents spend more of their compute on hyperparameter tuning.

In fact, coding agents rarely attempt algorithmic research at all.

Claude Code and Autoresearch both reason more about algorithmic research, but still dodge implementation.

Read more here: https://www.intology.ai/blog/nanogpt-bench

Intology @IntologyAI


3:49 PM · May 19, 2026 · 44.4K Views
12:56 AM · May 20, 2026 · 6.3K Views

GitHub project here: https://github.com/IntologyAI/NanoGPT-Bench/

elvis @omarsar0


12:56 AM · May 20, 2026 · 6.3K Views
12:56 AM · May 20, 2026 · 1.2K Views

i love the idea of this; i do think that i find it hard to put much faith in benchmarks that restrict internet access for coding agents. today's agents do not do a very good job without access to the internet. frankly, neither do humans. looking up other people's code is vital!

Intology @IntologyAI


3:49 PM · May 19, 2026 · 44.4K Views
4:08 PM · May 19, 2026 · 622 Views