Automated Pipeline Cuts NanoGPT Training Steps From 2875 to 2755

Yiping Wang@ypwang61

Automatic research from mathematics to AI research:

We transfer the ScaleAutoResearch pipeline, which improved a 32-year-old Ramsey number bound, to the NanoGPT Speedrun optimizer track, using Claude Code and Codex with only 1–2 A40 nodes. We ran ~300 experiments in ~5k A40 hours, and then:

⭕ Results: improve (non-interpolation) SOTA from 2875 to 2755 steps.

Changes:
+: non-gain aux β₂ = 0.997; SOAP for all hidden with freq=1; LR-horizon + momentum tuning
-: remove Circuit-/Contra-/Soft-Muon, Aurora, NorMuon 2nd-moment, V-SOAP-blend, attn denom-floor...

Clearly, the experiments are compute-bound, and more results may be possible with more resources!

[1/n]

1:29 PM · Jun 9, 2026 · 6.4K Views
elie@eliebakouch

new auto ai research record on nanogpt optimizer speedrun, great work!

https://github.com/KellerJordan/modded-nanogpt/pull/321

59m · Views 2.6K · Likes 35 · Bookmarks 19
Kaiyue Wen@wen_kaiyue

Excited to see this! Compared to @eliebakouch's pioneering attempt, two features really stood out: 1. It was done with extremely low compute (1–2 A40 nodes). Research agents can shine even with academia-level compute, given careful design! 2. The agents found an extremely simple solution (SOAP Muon by @vyasnikhil96) and removed more features than they added.

1h · Views 1.7K · Likes 16 · Bookmarks 6
Yiping Wang@ypwang61

I believe that in the long term, the most important factors for these automated research systems (pure inference) are:

1. Base model capability (taste, long-horizon coding ability)
2. Base model token budget
3. Evaluation resources (e.g., GPUs for ML research, CPUs for Ramsey number search)

We have to acknowledge that, with current compute budgets, agents still have not discovered truly creative and effective solutions. When evaluation resources are limited, we may still need human priors to accelerate the search.

However, I believe research taste should be trainable on verifiable research problems, especially ML coding tasks. Organizations with massive compute resources and the ability to train frontier foundation models will likely continue to hold a significant advantage.

[6/n]

1h · Views 82 · Likes 7 · Bookmarks 1
Yiping Wang@ypwang61

Previously, we applied the ScaleAutoResearch pipeline, which uses multiple organized autoresearch agents inspired by @karpathy with context sharing across long-horizon tasks, to challenging Ramsey number search problems (https://x.com/ypwang61/status/2052508685591785619), and improved a 32-year-old bound on R(3,17) that AlphaEvolve did not improve.

We extend that to @kellerjordan0's NanoGPT speedrun optimizer track, a community benchmark for improving pretraining optimizers. It fixes the model architecture and data and only allows changes to the optimizer. The goal is to minimize the number of steps required to reach a validation loss of 3.28.

[2/n]
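The benchmark metric described above can be sketched in a few lines. This is one plausible reading only: the thread does not spell out the evaluation protocol (eval cadence, seeds, averaging), so `steps_to_target` and the toy history below are hypothetical illustrations, not the benchmark's actual scoring code.

```python
from typing import Iterable, Optional, Tuple

TARGET_VAL_LOSS = 3.28  # threshold stated in the thread


def steps_to_target(
    history: Iterable[Tuple[int, float]],
    target: float = TARGET_VAL_LOSS,
) -> Optional[int]:
    """Return the first logged step whose validation loss is <= target.

    `history` holds (step, val_loss) pairs. Returns None if the run never
    reaches the target. Hypothetical helper, not part of modded-nanogpt.
    """
    for step, val_loss in sorted(history):
        if val_loss <= target:
            return step
    return None


# Toy numbers purely for illustration; not measurements from any real run.
toy_history = [(100, 4.10), (200, 3.60), (300, 3.30), (400, 3.27)]
print(steps_to_target(toy_history))
```

Under this reading, a lower return value means a better optimizer, since the model and data are held fixed.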

1h · Views 305 · Likes 4 · Bookmarks 1
Yiping Wang@ypwang61

This experiment is also partially inspired by the excellent work from @PrimeIntellect (https://x.com/PrimeIntellect/status/2055056380881744365). They used ~14k H200 hours and ~10k runs, improving the record from 2990 to 2930.

Although the results are not directly comparable, since we start from the current SOTA optimizer in the PRs while they start from the Muon baseline, our results still show that non-trivial improvements can be achieved with relatively limited compute. Different designs of the AutoResearch experiment loop can have some impact on efficiency.

[5/n]

1h · Views 93 · Likes 5 · Bookmarks 1
elie@eliebakouch

@ypwang61 wow this is super cool congrats

1h · Views 308 · Likes 4
Yiping Wang@ypwang61

Although the models explored more creative ideas (roughly 40% of all our experiments were such exploratory attempts at genuinely new algorithmic designs), almost none of them brought meaningful improvements; most of the gains still came from parameter tuning.

However, observations such as point (1) do suggest some understanding of optimizer tuning, since we do not have enough compute for exhaustive parameter sweeps and the agents need to decide which directions are worth exploring under speedrun settings. In fact, the resulting improvement was larger than we expected.

[4/n]

1h · Views 66 · Likes 5
Yiping Wang@ypwang61

We use a combination of Claude Code @AnthropicAI (Opus 4.8) and Codex @OpenAI (GPT 5.5). The ScaleAutoResearch pipeline is very similar to the one used for Ramsey numbers, but we replace the resources and domain-specific human intuitions with those for optimizer design. The method they found uses simple tricks to reduce the steps:

(1) Longer 2nd-moment memory for the 1-D "aux" params (RMSNorm gains, biases): Adam β₂ 0.99 → 0.997 (0.9965 for the attn-proj bias). Agents' reason: these 1-D params get no cross-coordinate averaging, so at β₂=0.99 their variance estimate is noisy and jitters the step size; a longer memory steadies it. (2875 → 2830 steps)

(2) SOAP on all hidden matrices, refreshed every step: MLP+V → +q/k/attn-proj, and precondition_frequency 10 → 1. Agents' reason: if SOAP curvature helps MLP/V, the other hidden matrices should too, and an every-step eigenbasis tracks the moving curvature better (~29% more time/step). (2830 → 2800 steps)

(3) Shorter LR-cooldown horizon + momentum tuning, then prune lots of now-redundant components: Circuit-/Contra-/Soft-Muon, Aurora, NorMuon 2nd-moment, V-SOAP-blend, attn-SOAP denom-floor. Agents' reason: on the faster trajectory the LR should anneal sooner, and once SOAP covers every matrix the older geometry tricks are redundant, so pruning them also buys back ~19%/step. (2800 → 2755 steps)
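To make the intuition in (1) concrete: an exponential moving average with decay β₂ averages over roughly 1/(1 − β₂) recent steps, so 0.99 corresponds to about 100 steps and 0.997 to about 333. The sketch below computes that horizon and, under the simplifying assumption (not from the thread) that one coordinate's gradients are i.i.d. standard normal, the steady-state relative noise of the second-moment estimate. The function names are illustrative.

```python
import math


def effective_memory(beta2: float) -> float:
    """Approximate number of recent steps an EMA with decay beta2 averages over."""
    return 1.0 / (1.0 - beta2)


def relative_noise(beta2: float) -> float:
    """Steady-state std/mean of v_t = beta2 * v_{t-1} + (1 - beta2) * g_t**2.

    Assumes g_t ~ N(0, 1) i.i.d. (an idealization), so E[g^2] = 1,
    Var[g^2] = 2, and Var[v] = 2 * (1 - beta2) / (1 + beta2).
    """
    return math.sqrt(2.0 * (1.0 - beta2) / (1.0 + beta2))


for beta2 in (0.99, 0.9965, 0.997):
    print(
        f"beta2={beta2}: memory ~{effective_memory(beta2):.0f} steps, "
        f"relative noise ~{relative_noise(beta2):.3f}"
    )
```

Under this idealization, the noise in the estimate falls from about 10% of its mean at β₂ = 0.99 to about 5.5% at β₂ = 0.997. Real gradients are neither i.i.d. nor stationary, so this only illustrates the direction of the effect the agents describe.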

[3/n]
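For readers unfamiliar with item (2), here is a minimal NumPy sketch of a SOAP-style update for a single weight matrix, showing where `precondition_frequency` enters. It is an illustration under stated assumptions, not the code from the PR: the class `SoapSketch`, the toy loop, and all hyperparameter defaults are hypothetical, bias correction is omitted, and the Muon-related parts of the record-setting optimizer are not modeled.

```python
import numpy as np


class SoapSketch:
    """Simplified SOAP-style state for one 2-D weight matrix (illustrative only)."""

    def __init__(self, shape, lr=1e-2, beta1=0.9, beta2=0.99, eps=1e-8,
                 precondition_frequency=1):
        m, n = shape
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.precondition_frequency = precondition_frequency
        self.L = np.zeros((m, m))   # EMA of G @ G.T (left statistics)
        self.R = np.zeros((n, n))   # EMA of G.T @ G (right statistics)
        self.QL = np.eye(m)         # current left eigenbasis
        self.QR = np.eye(n)         # current right eigenbasis
        self.M = np.zeros(shape)    # momentum, kept in the original basis
        self.V = np.zeros(shape)    # Adam second moment, in the rotated basis
        self.step_count = 0
        self.num_refreshes = 0      # how many eigendecompositions were run

    def step(self, W, G):
        b1, b2 = self.beta1, self.beta2
        self.L = b2 * self.L + (1 - b2) * (G @ G.T)
        self.R = b2 * self.R + (1 - b2) * (G.T @ G)
        # Refresh the eigenbases only every `precondition_frequency` steps.
        if self.step_count % self.precondition_frequency == 0:
            _, self.QL = np.linalg.eigh(self.L)
            _, self.QR = np.linalg.eigh(self.R)
            self.num_refreshes += 1
        self.step_count += 1
        self.M = b1 * self.M + (1 - b1) * G
        G_rot = self.QL.T @ G @ self.QR        # gradient in the eigenbasis
        M_rot = self.QL.T @ self.M @ self.QR   # momentum in the eigenbasis
        self.V = b2 * self.V + (1 - b2) * G_rot ** 2
        update_rot = M_rot / (np.sqrt(self.V) + self.eps)  # Adam-style scaling
        return W - self.lr * (self.QL @ update_rot @ self.QR.T)


def run_toy(precondition_frequency, steps=20, seed=0):
    """Run the sketch on random gradients; returns (weights, optimizer)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((4, 3))
    opt = SoapSketch(W.shape, precondition_frequency=precondition_frequency)
    for _ in range(steps):
        W = opt.step(W, rng.standard_normal(W.shape))
    return W, opt


for freq in (10, 1):
    _, opt = run_toy(freq)
    print(f"precondition_frequency={freq}: {opt.num_refreshes} eigenbasis refreshes")
```

With `precondition_frequency=1` the eigendecompositions run on every step, which is consistent with the extra per-step time the post reports for this change. Real implementations differ in details such as bias correction and how the eigenbasis is updated.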

1h · Views 169 · Likes 4
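The step counts quoted in [3/n] can be tallied per stage. The short script below uses only the numbers stated in the thread; the stage labels are paraphrases, and it deliberately ignores the per-step time changes, since the thread does not say how those percentages compose.

```python
# (label, steps before, steps after), as reported in the thread
STAGES = [
    ("(1) aux beta2 0.99 -> 0.997", 2875, 2830),
    ("(2) SOAP on all hidden matrices, frequency 1", 2830, 2800),
    ("(3) LR horizon, momentum tuning, pruning", 2800, 2755),
]


def reduction(before: int, after: int):
    """Return (steps saved, percent saved relative to `before`)."""
    saved = before - after
    return saved, 100.0 * saved / before


for label, before, after in STAGES:
    saved, pct = reduction(before, after)
    print(f"{label}: {before} -> {after} ({saved} steps, {pct:.2f}%)")

total_saved, total_pct = reduction(STAGES[0][1], STAGES[-1][2])
print(f"total: {total_saved} steps, {total_pct:.2f}%")
```

The three stages save 45, 30, and 45 steps, for 120 steps in total, or about 4.2% of the 2875-step starting point.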
Yiping Wang@ypwang61

Thanks a lot to @wen_kaiyue and @YouJiacheng for very insightful discussions. The results can be found here: https://github.com/KellerJordan/modded-nanogpt/pull/321

[7/n]

1h · Views 83 · Likes 8

@ypwang61 Very cool work! I'm considering applying an AlphaEvolve-like tool (http://github.com/ttanv/Levi) to NanoGPT as well. Curious to hear any suggestions or ideas, or how you think it would compare to autoresearch-like tools?

57m · Views 16 · Likes 1
Yiping Wang@ypwang61

@eliebakouch Thanks! I learned a lot from your great work!

1h · Views 37 · Likes 1
Yiping Wang@ypwang61

@ttanvali I think autoresearch is something like AlphaEvolve plus a modern agent harness, but minus parallel sampling. This harness is quite useful for experimental iteration. In general, if there is a proper eval (or proxy eval) and suitable resources and intuitions, it should work well for both.

51m · Views 8 · Likes 1