/Tech · 6h ago

Technical analysis establishes that the Muon optimizer is structurally distinct from Shampoo, leveraging GPU-friendly Newton-Schulz iterations

Orthogonalization replaces inverse root computations to improve numerical stability.

59822113.6K

#64

Original post

Susan Zhang @suchenzang · #64 in Tech

we should really be thanking arabic numerals for making zeros happen

(screenshot posted with permission, and brings the thread full circle :))

9:27 AM · Jun 10, 2026 · 13.5K Views

Sentiment

Users praise optimizer research such as Muon's departure from Shampoo for LLM training: mathematical insight shows up directly in performance, and experiments do not need large compute budgets. One commenter calls the work fun but notes that its ideas have been harder to transfer than in other subfields.

Pos

100.0%

Neg

0.0%

3 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 862 · LIKES 14

pH@pHequals7

@suchenzang you mean indian numerals 😅

6h86214

BOOKMARKS1

You Jiacheng@YouJiacheng

iirc Keller said the usage of NS was inspired by Jeremy's paper https://arxiv.org/abs/2409.20325

You Jiacheng@YouJiacheng

> computing inverse roots is inherently more numerically unstable than orthogonalization

we can jointly compute inverse root and matmul (P^{-1/2}G) with an iteration, and it's more stable than P^{-1/2}. so introducing iterative method is the point.

1h33751
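The reply above says the product P^{-1/2}G can be computed jointly by an iteration. As a rough illustration, assume P = G G^T and a full-rank G: then P^{-1/2}G is the orthogonal polar factor U V^T of G = U S V^T, and a Newton-Schulz iteration reaches it using only matrix multiplications. The sketch below uses the classic cubic iteration, which is an assumption made for illustration and not necessarily the variant or coefficients that Muon uses. The helper names (`matmul`, `transpose`, `newton_schulz_orthogonalize`) are hypothetical, and it is written in plain Python so it runs without extra libraries.

```python
# Sketch only: classic cubic Newton-Schulz, not Muon's actual implementation.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor U V^T of a full-rank matrix g."""
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        # X <- 1.5 X - 0.5 (X X^T) X: matmuls only, no explicit inverse root.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxt_x)]
    return x


# Small, well-conditioned example matrix (illustrative values).
G = [[3.0, 1.0], [0.0, 2.0]]
Q = newton_schulz_orthogonalize(G)
QQt = matmul(Q, transpose(Q))   # should be close to the identity
P = matmul(transpose(Q), G)     # symmetric factor in G = Q P
```

Each step pushes every singular value of X toward 1 while leaving the singular vectors untouched, so the inverse root P^{-1/2} is never formed on its own. The fixed step count is a simplification: badly conditioned inputs need more iterations, because small singular values grow by only about 1.5x per step at first.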

REPLIES1

Yaroslav Bulatov@yaroslavvb

@ShumingHu @suchenzang ChatGPT tells me there are ≈200,000 ML optimization papers written -- https://chatgpt.com/share/6a29abcf-f160-83e8-8507-6135bf8db564

4h92

jaisel@jaiselsingh

@suchenzang I also feel like it’s a very satisfying line of work since your mathematical insight can directly show performance + you don’t need crazy compute to run your own experiments (vs say model architecture)

5h32121

Yaroslav Bulatov@yaroslavvb

@jaiselsingh @suchenzang I think it's fun, the ideas have been harder to transfer than in other subfields, why? I spent >1000 hours nerd-sniped by methods like kfac (https://github.com/cybertronai/pytorch-sso)

4h3611

Shuming Hu@ShumingHu

@suchenzang @yaroslavvb hahaha depends on the person. I’m reasonably confident sum of total optimizer research time for tilde folks is less than total TV hours of my life. Probably in between my total life commute time and total BART time.

4h2131

Susan Zhang@suchenzang

@pHequals7 oh nooo

6h4204

jaisel@jaiselsingh

@yaroslavvb @suchenzang this is neat! I’m going to have to go through your gh impls haha :)

4h32

Yaroslav Bulatov@yaroslavvb

@jaiselsingh @suchenzang There's a 3-matrix version of KFAC in https://mathematica.stackexchange.com/questions/234502/solving-eabxab-y-for-gaussian-a-b . But the issue in both regular and this KFAC is that allocating compute budget to vanilla gradient is better. Compute-efficiency is kind of the missing component in optimizer research, comes as an afterthought

4h241

Chris Groves@CGrovesNLN

@suchenzang u mean indian

but then again

what has 10 digits and says things like "zero"?

2h44

Gavin Zhang@GavinZJL

@suchenzang To be fair you could say this about most ML research, or even most academic research

2h40

Shuming Hu@ShumingHu

@yaroslavvb @suchenzang 😮

4h38