Five recent notable Modded-NanoGPT optimization results:
Result #31: Kai Lion and Florian Hübler have improved their Muown-based run from 3075 to 2995 steps by adding NorMuon & ContraMuon modifications. 1/5
Muon variants recently reduced benchmark runs to 2,930 steps.
Users praise the steps-based leaderboard as a clean framing for NanoGPT optimizer benchmarks, wondering whether it will prompt Muon researchers to adjust their approach.
Result #35: @_arohan_ has achieved a significant >700-step improvement to the best known DistributedShampoo config.
This was achieved by switching to one-sided Shampoo, numerically stabilizing the run using an undocumented pseudoinverse flag, and retuning other hparams. 4/5
Result #33: @varunneal has provided a PSGD-Kron (Pooladzandi & Li 2024; Li 2024; Li 2022; Li 2018; Li 2015) run! It uses hyperball optimization. 3/5
I created this new speedrun track, which compares results in terms of steps rather than wallclock, specifically to give a fair chance to optimizers other than Muon.
Happy to see the resulting accumulation of public knowledge!
Result #32: @mihai673 has achieved a 30-step improvement over the old 2026/05/09 record by adding a SODA (Pethick et al. 2026)-style anchor towards init.
It is unknown whether this technique can also improve the current record. 2/5
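The post does not spell out how the anchor works. One plausible reading of an "anchor towards init" is a decay term that pulls the weights back toward their initial values instead of toward zero. The sketch below shows only that assumed form; `anchored_step`, `anchor`, and the toy numbers are hypothetical and are not taken from the SODA paper or from the record run.

```python
# Hypothetical illustration: decay toward the initial weights instead of toward zero.
# The update form and all names here are assumptions, not the method from the source.
def anchored_step(w, w_init, grad, lr=0.1, anchor=0.01):
    """One SGD-style step with an extra pull of strength `anchor` toward w_init."""
    return [
        wi - lr * gi - lr * anchor * (wi - w0)
        for wi, w0, gi in zip(w, w_init, grad)
    ]

w_init = [1.0, -2.0]    # weights at initialization
w = [1.5, -1.0]         # current weights after some training
zero_grad = [0.0, 0.0]  # zero gradient isolates the anchor term

w_next = anchored_step(w, w_init, zero_grad)
```

With a zero gradient, each weight moves a small fraction of the way back toward its initial value, which is the only effect this sketch is meant to show.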
Result #36: @konstmish improved the best known hyperparameters for the Muon + aux-AdamW baseline. The improvement came mainly from AdamW hparams.
Result #37: @wen_kaiyue then grafted that change onto Muon-Hyperball. Both runs improved by 75 steps. 5/5
@kellerjordan0 Codex gave me this estimate: ~77% of the gain came from AdamW hyperparams (learning rates 1.5x bigger) and ~23% came from Muon hyperparams (smaller lr, bigger weight decay).
I didn't touch anything else.
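As a rough back-of-the-envelope check, the quoted split can be turned into step counts, assuming it applies to the full 75-step gain reported above. The percentages are the approximate Codex estimate from the reply, not measured ablations, and the variable names are illustrative.

```python
# Convert the approximate attribution shares into step counts.
total_gain_steps = 75  # improvement reported for both runs

# Approximate shares quoted in the reply (an estimate, not an ablation result).
shares = {"AdamW hparams": 0.77, "Muon hparams": 0.23}

steps_by_source = {name: share * total_gain_steps for name, share in shares.items()}

for name, steps in steps_by_source.items():
    print(f"{name}: roughly {steps:.0f} of {total_gain_steps} steps")
```

Under that assumption, the AdamW changes account for roughly 58 of the 75 steps and the Muon changes for roughly 17.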
@DimitrisPapail https://arxiv.org/abs/2507.07101
I have a silly question. How far off is vanilla SGD with a well-tuned learning rate schedule and batch size? I'd love to see wall clock on the x axis.

@andrewgwils Small batch sizes all the way don't work, but an adaptive schedule works reasonably well, e.g. low first, high later.

@andrewgwils Also

@kellerjordan0 steps-based leaderboard is such a clean framing, curious if it'll make Muon folks pivot their strategy