Keller Jordan says DistributedShampoo's Modded-NanoGPT speedrun gains came partly from an undocumented pseudoinverse flag

Muon variants recently reduced benchmark runs to 2,930 steps.

Original post

Keller Jordan @kellerjordan0

Five recent notable Modded-NanoGPT optimization results:

Result #31: Kai Lion and Florian Hübler have improved their Muown-based run from 3075 to 2995 steps by adding NorMuon & ContraMuon modifications. 1/5

9:22 AM · Jun 12, 2026 · 2.9K Views

Sentiment

One commenter praises the steps-based leaderboard as a clean framing for NanoGPT optimizer benchmarks and wonders whether it will prompt Muon researchers to change their strategy.

Positive: 100.0%. Negative: 0.0%. Based on 1 comment with sentiment.

Posts from X

Keller Jordan @kellerjordan0

Result #35: @_arohan_ has achieved a significant >700-step improvement to the best known DistributedShampoo config.

This was achieved by switching to one-sided Shampoo, numerically stabilizing the run using an undocumented pseudoinverse flag, and retuning other hparams. 4/5
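
For readers unfamiliar with the terms, here is a minimal NumPy sketch of what one-sided Shampoo with a pseudoinverse-stabilized inverse root can look like. It is an illustration under stated assumptions, not @_arohan_'s configuration and not the DistributedShampoo API: the function names, the `rel_tol` cutoff, and the use of a single left preconditioner with exponent -1/2 are all choices made for this example, and the flag mentioned in the post is not reproduced here.

```python
import numpy as np

def pinv_inverse_sqrt(L, rel_tol=1e-6):
    """Pseudoinverse square root of a symmetric PSD matrix.

    Eigenvalues at or below rel_tol * (largest eigenvalue) are treated as
    zero and mapped to zero, instead of being inverted into huge values.
    rel_tol is a hypothetical knob chosen for this sketch.
    """
    eigvals, eigvecs = np.linalg.eigh(L)
    cutoff = rel_tol * max(eigvals.max(), 0.0)
    keep = eigvals > cutoff
    safe = np.where(keep, eigvals, 1.0)  # placeholder avoids divide-by-zero
    inv_sqrt = np.where(keep, 1.0 / np.sqrt(safe), 0.0)
    return (eigvecs * inv_sqrt) @ eigvecs.T

def one_sided_shampoo_step(L, G, rel_tol=1e-6):
    """Accumulate the left statistic and precondition the gradient.

    Assumed form: L <- L + G G^T, update = pinv(L)^(1/2) G.
    Only the left side is preconditioned; there is no right factor.
    """
    L = L + G @ G.T
    return L, pinv_inverse_sqrt(L, rel_tol) @ G

# A rank-deficient gradient makes G G^T singular, which is the case
# where a plain inverse root would be numerically unstable.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 6))
L, update = one_sided_shampoo_step(np.zeros((4, 4)), G)
print(np.linalg.svd(update, compute_uv=False))
```

On the first step from a zero accumulator, the nonzero singular values of the update are all close to 1 and the directions with no gradient signal stay at 0, which is why truncating small eigenvalues keeps the step bounded.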

Keller Jordan @kellerjordan0

Result #33: @varunneal has provided a PSGD-Kron (Pooladzandi & Li 2024; Li 2024; Li 2022; Li 2018; Li 2015) run! It uses hyperball optimization. 3/5

Keller Jordan @kellerjordan0

I created this new speedrun track, which compares results in terms of steps rather than wallclock, specifically to give a fair chance to optimizers other than Muon.

Happy to see the resulting accumulation of public knowledge!

Keller Jordan @kellerjordan0

Result #32: @mihai673 has achieved a 30-step improvement over the old 2026/05/09 record by adding a SODA (Pethick et al. 2026)-style anchor towards init.

It is unknown whether this technique can also improve the current record. 2/5

Keller Jordan @kellerjordan0

Result #36: @konstmish improved the best known hyperparameters for the Muon + aux-AdamW baseline. The improvement came mainly from AdamW hparams.

Result #37: @wen_kaiyue then grafted that change onto Muon-Hyperball. Both runs improved by 75 steps. 5/5

Konstantin Mishchenko @konstmish

@kellerjordan0 Codex gave me this estimate:

~77% of the gain came from AdamW hyperparams (make learning rates 1.5 bigger)

~23% came from Muon hyperparams (smaller lr, bigger weight decay)

I didn't touch anything else.
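
As a rough worked example, the split can be turned into step counts if one assumes the percentages apply to the 75-step improvement reported for Result #36. That pairing is an assumption made here for illustration; the post does not state it, and the estimate itself came from Codex rather than from an ablation. The function name below is hypothetical.

```python
def split_gain(total_steps, shares):
    """Split a total step improvement according to fractional shares."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return {name: total_steps * share for name, share in shares.items()}

# Assumed inputs: 75 steps (Result #36) and the ~77% / ~23% estimate.
estimate = split_gain(75, {"adamw_hparams": 0.77, "muon_hparams": 0.23})
for name, steps in estimate.items():
    print(f"{name}: ~{steps:.2f} steps")
```

Under that assumption, roughly 58 of the 75 steps would come from the AdamW changes and roughly 17 from the Muon changes.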

Andrew Gordon Wilson @andrewgwils

@DimitrisPapail https://arxiv.org/abs/2507.07101

Dimitris Papailiopoulos @DimitrisPapail

I have a silly question. How far off is vanilla SGD with well tuned learning rate schedule and batch size? I'd love to see wall clock on the x axis

Dimitris Papailiopoulos @DimitrisPapail

@andrewgwils Small Bs all the way doesn’t work but adaptive works reasonably well eg low first high later.

Dimitris Papailiopoulos @DimitrisPapail

@andrewgwils Also

Rugbist @rugbist_

@kellerjordan0 steps-based leaderboard is such a clean framing, curious if itll make Muon folks pivot their strategy