Prime Intellect agents improve nanoGPT record to 2,930 steps

QUOTE POST

if you weren’t aware, it’s prime intellect season

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

10:50 PM · May 14, 2026 · 25.5K Views

QUOTE POST

#691Vincent Weisser@VINCENTWEISSER

We started automating AI research on nanogpt-speedruns & achieved new records

>for 2 weeks GPT 5.5 and Opus 4.7 iterated on novel optimizations
>10k runs & 14k H200 hours
>both agents beat the human baseline
>Opus now holds the record at 2930 steps

Awesome work @eliebakouch!

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

11:10 PM · May 14, 2026 · 13.5K Views

QUOTE POST

#767elie@ELIEBAKOUCH

@scaling01 i agree, for transparency adding this important data point that claude stopped working a lot (which is bad) but when restarted it actually got access to new records faster than codex (which is good for claude progress ironically)

elie@eliebakouch

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more https://www.primeintellect.ai/auto-nanogpt

10:54 PM · May 14, 2026 · 95K Views

10:14 AM · May 15, 2026 · 5.4K Views

REPLY

#767elie@ELIEBAKOUCH

@scaling01 but in terms of efficiency it's clear

elie@eliebakouch

@scaling01 i agree, for transparency adding this important data point that claude stopped working a lot (which is bad) but when restarted it actually got access to new records faster than codex (which is good for claude progress ironically)

10:14 AM · May 15, 2026 · 5.4K Views

10:18 AM · May 15, 2026 · 477 Views

REPLY

#767elie@ELIEBAKOUCH

@scaling01 (even without the updated record claude was already above)

elie@eliebakouch

@scaling01 i agree, for transparency adding this important data point that claude stopped working a lot (which is bad) but when restarted it actually got access to new records faster than codex (which is good for claude progress ironically)

10:14 AM · May 15, 2026 · 5.4K Views

10:35 AM · May 15, 2026 · 431 Views

REPLY

#767elie@ELIEBAKOUCH

@jiaxinwen22 this is also due to the fact that claude stopped working a lot more than codex and got more exposure to the latest human records each time we restarted it, but the efficiency is very nice

Jiaxin Wen@jiaxinwen22

The hill-climbing efficiency gap between Opus and Codex is much larger than I was expecting!

11:15 PM · May 14, 2026 · 9K Views

11:15 PM · May 14, 2026 · 916 Views

QUOTE POST

#767elie@ELIEBAKOUCH

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more

primeintellect.ai

Autonomous AI research for nanogpt speedrun

We let Codex and Claude Code autonomously iterate on the nanoGPT speedrun optimizer track for two weeks, producing ~10k runs, a new 2930-step record, and a detailed look at where autonomous research agents work and break down.

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

10:54 PM · May 14, 2026 · 95K Views

REPLY

#767elie@ELIEBAKOUCH

all the records are heavily based on work from previous contributors' PRs (we do explore novel ideas in a dedicated "novelty" track, but none of them ended up improving the record).

So it only made sense to let the agents write a little thank you to the community themselves

github.com

/KellerJordan/modded-nanogpt/pull/300

elie@eliebakouch

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more https://www.primeintellect.ai/auto-nanogpt

10:54 PM · May 14, 2026 · 95K Views

10:56 PM · May 14, 2026 · 5.1K Views

REPLY

#767elie@ELIEBAKOUCH

we tried to document as much as possible about how the agents behave: autonomy patterns, scratchpad memory access, subagent spawns, compute usage, research quality, how ideas flow from source to experiment, etc. all scratchpads (the agents' internal memory), ~10k run logs, and scripts are here 🫡 http://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning

elie@eliebakouch

all the records are heavily based on work from previous contributors' PRs (we do explore novel ideas in a dedicated "novelty" track, but none of them ended up improving the record). So it only made sense to let the agents write a little thank you to the community themselves https://github.com/KellerJordan/modded-nanogpt/pull/300

10:56 PM · May 14, 2026 · 5.1K Views

10:56 PM · May 14, 2026 · 2K Views

REPLY

#767elie@ELIEBAKOUCH

lots of things can be improved btw, this is a lower bound of what's possible and we already have a lot more cooking. for instance if you take the harness (markdown files) of v1, it was almost entirely written by claude since it was just a yolo idea i had after seeing @kellerjordan0's tweet announcing the speedrun. we did a few iterations on it to make it better for v2/v3

we also got a surprise a few hours before releasing when we discovered that claude was not actually doing statistical verification with different seeds but instead argued that since it's different hardware it's already random (which is not totally wrong but not the expected behavior and made our record worse by 10 steps)

elie@eliebakouch

we tried to document as much as possible about how the agents behave: autonomy patterns, scratchpad memory access, subagent spawns, compute usage, research quality, how ideas flow from source to experiment, etc. all scratchpads (the agents' internal memory), ~10k run logs, and scripts are here 🫡 http://github.com/PrimeIntellect-ai/experiments-autonomous-speedrunning

10:56 PM · May 14, 2026 · 2K Views

11:09 PM · May 14, 2026 · 1.4K Views
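
The seed-verification issue described in the reply above (relying on hardware nondeterminism instead of running distinct seeds) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `run_once` is a hypothetical stand-in that simulates a final validation loss, and the pass criterion (mean val_loss at or below the target) is a simplification, not the speedrun's actual acceptance rule. Only the 3.28 target and the seed range 0..15 come from the thread.

```python
import random
import statistics

TARGET_VAL_LOSS = 3.28  # target quoted in the PR summary in this thread


def run_once(seed: int) -> float:
    """Hypothetical stand-in for one training run.

    A real harness would launch training with this seed and return the
    final validation loss; here the loss is simulated so the sketch runs.
    """
    rng = random.Random(seed)
    return 3.275 + rng.gauss(0.0, 0.003)


def validate(seeds, target=TARGET_VAL_LOSS):
    """Run every seed explicitly and summarize the results."""
    losses = [run_once(s) for s in seeds]
    mean = statistics.mean(losses)
    # Standard error of the mean across seeds.
    stderr = statistics.stdev(losses) / len(losses) ** 0.5
    return {
        "n": len(losses),
        "mean": mean,
        "stderr": stderr,
        "passes": mean <= target,  # assumed criterion, for illustration only
    }


# Fixed, non-cherry-picked seed range, as in "seeds (0..15)".
result = validate(range(16))
print(result)
```

Passing the seeds explicitly makes the run set reproducible and auditable, which implicit hardware-level randomness does not.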

REPLY

#767elie@ELIEBAKOUCH

github.com

Track 3: Aurora-on-mlp.proj + extended Contra-Muon on PR #294 stack (bin=2930, n=16) by eliebak · Pull Request #300 · KellerJordan/modded-nanogpt

Summary Adds a Track 3 optimization result: bin = 2930 steps to 3.28 val_loss, validated over n=16 non-cherry-picked seeds (0..15). This builds on PR #294 (radial brake) and applies three changes o...

elie@eliebakouch

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more https://www.primeintellect.ai/auto-nanogpt

10:54 PM · May 14, 2026 · 95K Views

12:11 AM · May 15, 2026 · 1.1K Views

REPLY

#767elie@ELIEBAKOUCH

@damekdavis yeah it was quite impressive, an important data point tho is that claude stopped working a lot and when we restarted it, it got updated knowledge of the different runs, but even without that it's above codex curve

Damek@damekdavis

Of course running this on idle is the nice part. This is more a comment on 4.7 and 5.5 tuning abilities. (I have run similar experiments myself)

11:06 PM · May 14, 2026 · 541 Views

11:19 PM · May 14, 2026 · 331 Views

REPLY

#841kalomaze@KALOMAZE

@andrey_kurenkov
>claim about automating a bounded, narrow surface area
>"but it's not the full surface area tho"

Andrey Kurenkov@andrey_kurenkov

Can we all agree that LLM-powered hyper param search to optimize nanoGPT better is not really AI research?

5:45 AM · May 15, 2026 · 38.3K Views

8:25 PM · May 15, 2026 · 1.3K Views

REPLY

#841kalomaze@KALOMAZE

@andrey_kurenkov i do worry that people default to turning their brain off and going skeptic mode when they hear the category claim bc of orgs that have been... somewhat dubious with how liberally & dramatically they have narrativized adjacent work, fake 100x cuda speedups, etc etc

kalomaze@kalomaze

@andrey_kurenkov
>claim about automating a bounded, narrow surface area
>"but it's not the full surface area tho"

8:25 PM · May 15, 2026 · 1.3K Views

8:33 PM · May 15, 2026 · 439 Views

QUOTE POST

#897Alexander Doria@DORIALEXANDER

not surprised. tapping the sign again.

Lisan al Gaib@scaling01

brutal Claude mog

9:45 AM · May 15, 2026 · 124.7K Views

10:48 AM · May 15, 2026 · 3.7K Views

REPLY

#897Alexander Doria@DORIALEXANDER

to be honest, i don't think it's a bad thing to get some differentiation. codex can't really get highly reliable codebase management with unbounded search and vice versa.

Alexander Doria@Dorialexander

not surprised. tapping the sign again.

10:48 AM · May 15, 2026 · 3.7K Views

11:02 AM · May 15, 2026 · 676 Views

QUOTE POST

#909Andrey Kurenkov@ANDREY_KURENKOV

PS since this is getting some heat: I think what @PrimeIntellect did here is actually really cool! Full write up is interesting.

I just think we need to be careful about claiming improvements on nanoGPT speedrun optimizer would correspond to truly better full on AI research.

Andrey Kurenkov@andrey_kurenkov

Can we all agree that LLM-powered hyper param search to optimize nanoGPT better is not really AI research?

5:45 AM · May 15, 2026 · 38.3K Views

11:42 PM · May 15, 2026 · 396 Views

QUOTE POST

#909Andrey Kurenkov@ANDREY_KURENKOV

Can we all agree that LLM-powered hyper param search to optimize nanoGPT better is not really AI research?

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

5:45 AM · May 15, 2026 · 38.3K Views

REPLY

#909Andrey Kurenkov@ANDREY_KURENKOV

To be clear, I'm not saying this sort of hill climbing / auto-optimization is not valuable! I'm just saying calling it 'AI research' is wrong, just call it what it is ('training optimization' or something).

Andrey Kurenkov@andrey_kurenkov

Can we all agree that LLM-powered hyper param search to optimize nanoGPT better is not really AI research?

5:45 AM · May 15, 2026 · 38.3K Views

7:53 PM · May 15, 2026 · 1.2K Views

QUOTE POST

#984Lisan al Gaib@SCALING01

now imagine how brutal the mog is with Mythos

this is a slight update against OpenAI pulling ahead this year through faster model cycle times

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

10:52 AM · May 15, 2026 · 27K Views

QUOTE POST

#984Lisan al Gaib@SCALING01

brutal Claude mog

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

9:45 AM · May 15, 2026 · 124.7K Views

QUOTE POST

#1203Johannes Hagemann@JOHANNES_HAGE

.@eliebakouch let the agents go wild on our idle compute to compete in the nanoGPT speedrun optimizer track!

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

10:52 PM · May 14, 2026 · 3.8K Views

REPLY

#1203Johannes Hagemann@JOHANNES_HAGE

@eliebakouch blog post with all the details:

primeintellect.ai

Autonomous AI research for nanogpt speedrun

We let Codex and Claude Code autonomously iterate on the nanoGPT speedrun optimizer track for two weeks, producing ~10k runs, a new 2930-step record, and a detailed look at where autonomous research agents work and break down.

Johannes Hagemann@johannes_hage

.@eliebakouch let the agents go wild on our idle compute to compete in the nanoGPT speedrun optimizer track!

10:52 PM · May 14, 2026 · 3.8K Views

10:53 PM · May 14, 2026 · 359 Views

QUOTE POST

#1281samsja@SAMSJA19

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

10:54 PM · May 14, 2026 · 7.5K Views

REPLY

#1281samsja@SAMSJA19

.@eliebakouch cooked

samsja@samsja19

10:54 PM · May 14, 2026 · 7.5K Views

10:54 PM · May 14, 2026 · 220 Views

QUOTE POST

#1342Damek@DAMEKDAVIS

Interesting to consider gain relative to cost.

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

10:55 PM · May 14, 2026 · 3.9K Views

REPLY

#1342Damek@DAMEKDAVIS

Of course running this on idle is the nice part. This is more a comment on 4.7 and 5.5 tuning abilities. (I have run similar experiments myself)

Damek@damekdavis

Interesting to consider gain relative to cost.

10:55 PM · May 14, 2026 · 3.9K Views

11:06 PM · May 14, 2026 · 541 Views
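
The "gain relative to cost" point above can be made concrete with the figures quoted in this thread. This is a rough sketch, not an analysis from the write-up: the step counts (2990, 2930, 2950) and the ~14k H200 hours come from the posts, but the thread gives only a combined compute figure for both agents, so the per-hour number below uses that combined total and is an assumption-laden approximation.

```python
BASELINE_STEPS = 2990      # human baseline quoted in the thread
OPUS_STEPS = 2930          # Opus record quoted in the thread
CODEX_STEPS = 2950         # Codex result quoted in the thread
TOTAL_H200_HOURS = 14_000  # approximate, combined across both agents


def steps_saved(steps: int) -> int:
    """Absolute reduction in steps versus the human baseline."""
    return BASELINE_STEPS - steps


def relative_gain(steps: int) -> float:
    """Fractional reduction in steps versus the human baseline."""
    return steps_saved(steps) / BASELINE_STEPS


opus_gain = relative_gain(OPUS_STEPS)
codex_gain = relative_gain(CODEX_STEPS)

# Best result per 1,000 H200 hours, using the combined compute figure
# because the thread does not split compute between the two agents.
best_steps_per_1k_hours = steps_saved(OPUS_STEPS) / (TOTAL_H200_HOURS / 1000)

print(f"Opus:  {steps_saved(OPUS_STEPS)} steps saved ({opus_gain:.2%})")
print(f"Codex: {steps_saved(CODEX_STEPS)} steps saved ({codex_gain:.2%})")
print(f"Best result: {best_steps_per_1k_hours:.2f} steps saved per 1k H200 hours")
```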

QUOTE POST

#1587Julius Adebayo@JULIUSADML

and so it begins

elie@eliebakouch

we let opus 4.7 and gpt 5.5 run on the nanogpt optimizer speedrun: ~10k runs, 14k H200 hours, 23.9B tokens. opus hits 2930, codex 2950, both beating the human baseline of 2990. we cover claude autonomy failures, codex high compute usage, and much more https://www.primeintellect.ai/auto-nanogpt

10:54 PM · May 14, 2026 · 95K Views

5:31 AM · May 15, 2026 · 587 Views

QUOTE POST

#1610Jiaxin Wen@JIAXINWEN22

The hill-climbing efficiency between Opus and Codex is much larger than I was expecting!

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

11:05 PM · May 14, 2026 · 231 Views

QUOTE POST

#1610Jiaxin Wen@JIAXINWEN22

The hill-climbing efficiency gap between Opus and Codex is much larger than I was expecting!

Prime Intellect@PrimeIntellect

Automating AI research is the next major step in AI. We let Claude Code (Opus 4.7) and Codex (GPT 5.5) run autonomously on the nanoGPT speedrun optimizer track using our idle compute. ~10k runs, ~14k H200 hours. Opus now holds the record at 2930 steps vs the 2990 human baseline.

10:43 PM · May 14, 2026 · 511.2K Views

11:15 PM · May 14, 2026 · 9K Views

REPLY

#1698Mark Tenenholtz@MARKTENENHOLTZ

@andrey_kurenkov Yes. But it's fun!

Andrey Kurenkov@andrey_kurenkov

Can we all agree that LLM-powered hyper param search to optimize nanoGPT better is not really AI research?

5:45 AM · May 15, 2026 · 38.3K Views

7:00 PM · May 15, 2026 · 552 Views
