Tech · 6h ago

Technical analysis establishes that the Muon optimizer is structurally distinct from Shampoo, leveraging GPU-friendly Newton-Schulz iterations

Orthogonalization replaces inverse root computations to improve numerical stability.
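
To make the summary concrete, here is a minimal sketch of orthogonalization by Newton-Schulz iteration in dependency-free Python. It uses the textbook cubic update X <- 1.5 X - 0.5 X X^T X, not the tuned higher-order polynomial that Muon implementations are generally described as using; the function name, step count, and 2x2 input are illustrative assumptions. The loop needs only matrix products, which is what makes the approach GPU-friendly.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix g.

    Assumption: textbook cubic Newton-Schulz, not Muon's tuned variant.
    """
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the cubic iteration.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 * X - 0.5 * (X X^T X), applied elementwise to the two matrices
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxt_x)]
    return x


g = [[2.0, 1.0], [0.0, 1.0]]   # hypothetical 2x2 "gradient" matrix
q = newton_schulz_orthogonalize(g)
qtq = matmul(transpose(q), q)  # close to the identity if q is orthogonal
```

Each step pushes every singular value of the scaled matrix toward 1 while leaving the singular vectors unchanged, so no inverse or root is ever formed explicitly.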

Original post
Susan Zhang@suchenzang

we should really be thanking arabic numerals for making zeros happen

(screenshot posted with permission, and brings the thread full circle :))

9:27 AM · Jun 10, 2026 · 13.5K Views
Sentiment

Users praise optimizer research such as Muon's departure from Shampoo for LLM training: mathematical insight translates directly into measurable performance, and experiments need little compute. One commenter also finds the work fun but notes that its ideas have been harder to transfer than in other subfields.

Pos 100.0% · Neg 0.0% · 3 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
pH@pHequals7

@suchenzang you mean indian numerals 😅

6h · Views 862 · Likes 14
You Jiacheng@YouJiacheng

iirc Keller said the usage of NS was inspired by Jeremy's paper https://arxiv.org/abs/2409.20325

You Jiacheng@YouJiacheng

> computing inverse roots is inherently more numerically unstable than orthogonalization

we can jointly compute inverse root and matmul (P^{-1/2}G) with an iteration, and it's more stable than P^{-1/2}. so introducing iterative method is the point.

1h · Views 337 · Likes 5 · Bookmarks 1
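
One way the joint iteration described in the post above could look is a coupled Newton-type scheme that carries the product P^{-1/2}G along instead of forming P^{-1/2} first. This is a sketch under assumptions, not necessarily the scheme the poster has in mind: P is taken to be symmetric positive definite, it is scaled by its trace so the iteration converges, and `inv_sqrt_times` is a hypothetical name.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def inv_sqrt_times(p, g, steps=20):
    """Approximate P^{-1/2} G without ever forming P^{-1/2} on its own.

    Assumption: p is symmetric positive definite.
    """
    n = len(p)
    # The trace of an SPD matrix bounds its largest eigenvalue, so M = P / c
    # has eigenvalues in (0, 1] and the iteration below converges.
    c = sum(p[i][i] for i in range(n))
    m = [[v / c for v in row] for row in p]
    # (P / c)^{-1/2} = sqrt(c) * P^{-1/2}, so pre-divide G by sqrt(c).
    x = [[v / c ** 0.5 for v in row] for row in g]
    for _ in range(steps):
        # T = (3 I - M) / 2
        t = [[((3.0 if i == j else 0.0) - m[i][j]) / 2.0 for j in range(n)]
             for i in range(n)]
        m = matmul(matmul(t, m), t)  # M <- T M T, which drives M toward I
        x = matmul(t, x)             # X <- T X, which drives X toward P^{-1/2} G
    return x


g = [[2.0, 1.0], [0.0, 1.0]]   # hypothetical invertible matrix
p = matmul(g, transpose(g))    # P = G G^T is symmetric positive definite
x = inv_sqrt_times(p, g)
xtx = matmul(transpose(x), x)  # close to the identity for this choice of P
```

With P = G G^T and G invertible, P^{-1/2}G is the orthogonal polar factor of G, so `xtx` coming out close to the identity is a quick sanity check on the iteration.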
REPLIES1
Yaroslav Bulatov@yaroslavvb

@ShumingHu @suchenzang ChatGPT tells me there are ≈200,000 ML optimization papers written -- https://chatgpt.com/share/6a29abcf-f160-83e8-8507-6135bf8db564

4h · Views 92
jaisel@jaiselsingh

@suchenzang I also feel like it’s a very satisfying line of work since your mathematical insight can directly show performance + you don’t need crazy compute to run your own experiments (vs say model architecture)

5h · Views 321 · Likes 2 · Bookmarks 1
Yaroslav Bulatov@yaroslavvb

@jaiselsingh @suchenzang I think it's fun, the ideas have been harder to transfer than in other subfields, why? I spent >1000 hours nerd-sniped by methods like kfac (https://github.com/cybertronai/pytorch-sso)

4h · Views 36 · Likes 1 · Bookmarks 1
Shuming Hu@ShumingHu

@suchenzang @yaroslavvb hahaha depends on the person. I’m reasonably confident sum of total optimizer research time for tilde folks is less than total TV hours of my life. Probably in between my total life commute time and total BART time.

4h · Views 213 · Likes 1
Susan Zhang@suchenzang

@pHequals7 oh nooo

6h · Views 420 · Likes 4
jaisel@jaiselsingh

@yaroslavvb @suchenzang this is neat! I’m going to have to go through your gh impls haha :)

4h · Views 32
Yaroslav Bulatov@yaroslavvb

@jaiselsingh @suchenzang There's a 3-matrix version of KFAC in https://mathematica.stackexchange.com/questions/234502/solving-eabxab-y-for-gaussian-a-b . But the issue in both regular and this KFAC is that allocating compute budget to vanilla gradient is better. Compute-efficiency is kind of the missing component in optimizer research, comes as an afterthought

4h · Views 24 · Likes 1
Chris Groves@CGrovesNLN

@suchenzang u mean indian

but then again

what has 10 digits and says things like "zero"?

2h · Views 44
Gavin Zhang@GavinZJL

@suchenzang To be fair you could say this about most ML research, or even most academic research

2h · Views 40
Shuming Hu@ShumingHu

@yaroslavvb @suchenzang 😮

4h · Views 38