
Prime Intellect agents improve nanoGPT record to 2,930 steps


Prime Intellect ran roughly 10,000 autonomous experiments with Claude Code (Opus 4.7) and Codex (GPT 5.5) agents on the nanoGPT optimizer speedrun track. Over two weeks the agents consumed about 14,000 H200 GPU hours and delivered a record of 2,930 steps for the 124M-parameter model, beating the prior human baseline of 2,990 steps. The company released all run logs, scripts, configurations, and a report on GitHub.

Original post

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

3:43 PM · May 14, 2026 · 511.2K Views
Reposted by

if you weren’t aware, it’s prime intellect season

10:50 PM · May 14, 2026 · 25.5K Views

We started automating AI research on nanogpt-speedruns & achieved new records

>for 2 weeks GPT 5.5 and Opus 4.7 iterated on novel optimizations
>10k runs & 14k H200 hours
>both agents beat the human baseline
>Opus now holds the record at 2930 steps

Awesome work @eliebakouch!

11:10 PM · May 14, 2026 · 13.5K Views

@scaling01 i agree, for transparency adding this important data point that claude stopped working a lot (which is bad) but when restarted it actually got access to the new record faster than codex (which is good for claude progress ironically)

elie @eliebakouch

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more https://www.primeintellect.ai/auto-nanogpt

10:54 PM · May 14, 2026 · 95K Views
10:14 AM · May 15, 2026 · 5.4K Views

@scaling01 but in terms of efficiency it's clear

10:18 AM · May 15, 2026 · 477 Views

@scaling01 (even without the updated record claude was already above)

10:35 AM · May 15, 2026 · 431 Views

@jiaxinwen22 this is also due to the fact that claude stopped working a lot more than codex and got more exposure to the latest human records each time we restarted it, but the efficiency is very nice

Jiaxin Wen @jiaxinwen22

The hill-climbing efficiency gap between Opus and Codex is much larger than I was expecting!

11:15 PM · May 14, 2026 · 9K Views
11:15 PM · May 14, 2026 · 916 Views

primeintellect.ai
Autonomous AI research for nanogpt speedrun
We let Codex and Claude Code autonomously iterate on the nanoGPT speedrun optimizer track for two weeks, producing ~10k runs, a new 2930-step record, and a detailed look at where autonomous research agents work and break down.

all the records are heavily based on work from previous contributors' PRs (we do explore novel ideas in a dedicated "novelty" track, but none of them ended up improving the record).

So it only made sense to let the agents write a little thank you to the community themselves

github.com
/KellerJordan/modded-nanogpt/pull/300
10:56 PM · May 14, 2026 · 5.1K Views

we tried to document as much as possible about how the agents behave: autonomy patterns, scratchpad memory access, subagent spawns, compute usage, research quality, how ideas flow from source to experiment, etc. all scratchpads (the agents' internal memory), ~10k run logs, and scripts are here 🫡 http://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning

10:56 PM · May 14, 2026 · 2K Views

lots of things can be improved btw, this is a lower bound of what's possible and we already have a lot more cooking. for instance if you take the harness (markdown files) of v1, it was almost entirely written by claude since it was just a yolo idea i had after seeing @kellerjordan0's tweet announcing the speedrun. we did a few iterations on it to make it better for v2/v3

we also got a surprise a few hours before releasing when we discovered that claude was not actually doing statistical verification with different seeds but instead argued that since it's different hardware it's already random (which is not totally wrong but not the expected behavior and made our record worse by 10 steps)
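For context on what "statistical verification with different seeds" means here: in speedrun-style benchmarking a record is normally confirmed by re-running the final config under several random seeds and checking the seed-averaged validation loss against the target, rather than trusting one lucky run. A minimal sketch of that check — the 3.28 target value and the function name are illustrative assumptions, not the actual harness:

```python
import statistics

# Hypothetical acceptance check for a speedrun result. A config
# "counts" only if its mean final validation loss across several
# independent seeds stays at or below the target (3.28 is the
# commonly cited target for the 124M track; treat it as an assumption).
TARGET_VAL_LOSS = 3.28

def verify(final_losses, target=TARGET_VAL_LOSS):
    """Accept a record only if the seed-averaged loss meets the target.

    Returns (accepted, mean_loss, spread) so the caller can also
    inspect how noisy the config is across seeds.
    """
    mean = statistics.mean(final_losses)
    spread = statistics.pstdev(final_losses)
    return mean <= target, mean, spread

# e.g. three independent seeds of the same config:
ok, mean, spread = verify([3.277, 3.279, 3.281])
```

Hardware nondeterminism does add run-to-run noise, as Claude argued, but it is not a substitute for this check: without deliberate seeded re-runs, noise can masquerade as verification.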

11:09 PM · May 14, 2026 · 1.4K Views

@damekdavis yeah it was quite impressive, an important data point tho is that claude stopped working a lot and when we restarted it, it got updated knowledge of the different runs, but even without that it's above the codex curve

Damek @damekdavis

Of course running this on idle is the nice part. This is more a comment on 4.7 and 5.5 tuning abilities. (I have run similar experiments myself)

11:06 PM · May 14, 2026 · 541 Views
11:19 PM · May 14, 2026 · 331 Views

kalomaze @kalomaze

@andrey_kurenkov
>claim about automating a bounded, narrow surface area
>"but it's not the full surface area tho"

Andrey Kurenkov @andrey_kurenkov

Can we all agree that LLM-powered hyper param search to optimize nanoGPT better is not really AI research?

5:45 AM · May 15, 2026 · 38.3K Views
8:25 PM · May 15, 2026 · 1.3K Views

@andrey_kurenkov i do worry that people default to turning their brain off and going skeptic mode when they hear the category claim bc of orgs that have been... somewhat dubious with how liberally & dramatically they have narrativized adjacent work, fake 100x cuda speedups, etc etc

8:33 PM · May 15, 2026 · 439 Views

Alexander Doria @Dorialexander

not surprised. tapping the sign again.

Lisan al Gaib @scaling01

brutal Claude mog

9:45 AM · May 15, 2026 · 124.7K Views
10:48 AM · May 15, 2026 · 3.7K Views

to be honest, i don't think it's a bad thing to get some differentiation. codex can't really get highly reliable codebase management with unbounded search and vice versa.

11:02 AM · May 15, 2026 · 676 Views

PS since this is getting some heat: I think what @PrimeIntellect did here is actually really cool! Full write up is interesting.

I just think we need to be careful about claiming that improvements on the nanoGPT speedrun optimizer track correspond to truly better full-on AI research.

11:42 PM · May 15, 2026 · 396 Views


To be clear, I'm not saying this sort of hill climbing / auto-optimization is not valuable! I'm just saying calling it 'AI research' is wrong; just call it what it is ('training optimization' or something).

7:53 PM · May 15, 2026 · 1.2K Views

now imagine how brutal the mog is with Mythos

this is a slight update against OpenAI pulling ahead this year through faster model cycle times

10:52 AM · May 15, 2026 · 27K Views


Johannes Hagemann @johannes_hage

.@eliebakouch let the agents go wild on our idle compute to compete in the nanoGPT speedrun optimizer track!

10:52 PM · May 14, 2026 · 3.8K Views
samsja @samsja19

.@eliebakouch cooked

10:54 PM · May 14, 2026 · 7.5K Views

Damek @damekdavis

Interesting to consider gain relative to cost.

10:55 PM · May 14, 2026 · 3.9K Views

and so it begins

5:31 AM · May 15, 2026 · 587 Views

The hill-climbing efficiency gap between Opus and Codex is much larger than I was expecting!

11:05 PM · May 14, 2026 · 231 Views


@andrey_kurenkov Yes. But it's fun!

7:00 PM · May 15, 2026 · 552 Views