Distributed Shampoo developer Rohan Anil says Schatten-p schedule-free optimization requires per-layer grafting for deep learning convergence

Uniform convexity in Schatten-p norms enables online dual averaging.

Original post

rohan anil @_arohan_

We originally treated -p as a hyperparameter, and one delta to talk about is that, for good convergence in the deep learning setting, one needs to add per-layer grafting.

Thomas Pethick @tmpethick

1/ Let me chip in on the recent “which optimizer rules them all” discussion with a somewhat more moderate take, asking:

What Schatten-p norm to use?

Turns out the answer is regime dependent! Specifically, even when smooth in Schatten-∞, Muon is not necessarily the best choice.

7:54 AM · Jun 16, 2026 · 3.2K Views

Sentiment

Users appreciate the research showing Schatten-p norm choice for optimizers is regime dependent because the work looks promising and worth investigating further.

Positive: 100.0% · Negative: 0.0%

2 comments with sentiment.

Posts from X

Thomas Pethick @tmpethick

5/ In the process we derive a batch size scaling rule for arbitrary Schatten-p norms, which interestingly:

- for p=∞ recovers the BST scaling rule from
- and suggests small batch size for Euclidean à la https://x.com/micahgoldblum/status/1943312410942603584 and

Thomas Pethick @tmpethick

11/ The key insight is that 1/p||x||_p^p is p-uniformly convex. After that follows a surprisingly straightforward generalization of ODA + schedule-free analysis.

I see this ultimately as a victory of online learning and in particular the schedule-free framework @aaron_defazio+
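
As a quick numerical illustration of the uniform-convexity claim in 11/, the sketch below checks Clarkson's inequality for vector ℓ_p norms with p ≥ 2, which is the midpoint form of p-uniform convexity for (1/p)||x||_p^p. The restriction to p ≥ 2, the vector (rather than Schatten) setting, and the helper names `lp_norm_pow` and `clarkson_gap` are assumptions made for illustration and are not taken from the paper.

```python
import random


def lp_norm_pow(x, p):
    """Return ||x||_p^p for a list of floats."""
    return sum(abs(v) ** p for v in x)


def clarkson_gap(x, y, p):
    """Right side minus left side of Clarkson's inequality (p >= 2):

        ||(x+y)/2||_p^p + ||(x-y)/2||_p^p <= (||x||_p^p + ||y||_p^p) / 2

    A nonnegative gap means the midpoint of (1/p)||.||_p^p sits below the
    chord by at least (1/p)||(x-y)/2||_p^p.
    """
    mid = [(a + b) / 2 for a, b in zip(x, y)]
    half_diff = [(a - b) / 2 for a, b in zip(x, y)]
    lhs = lp_norm_pow(mid, p) + lp_norm_pow(half_diff, p)
    rhs = 0.5 * (lp_norm_pow(x, p) + lp_norm_pow(y, p))
    return rhs - lhs


# Random spot check over a few exponents; a small tolerance absorbs rounding.
rng = random.Random(0)
for p in (2, 3, 4, 8):
    worst = min(
        clarkson_gap(
            [rng.gauss(0, 1) for _ in range(5)],
            [rng.gauss(0, 1) for _ in range(5)],
            p,
        )
        for _ in range(1000)
    )
    assert worst >= -1e-9
```

For p = 2 the gap is zero up to rounding (the parallelogram law); for larger p it is nonnegative, which is the property the online dual averaging analysis relies on.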

Thomas Pethick @tmpethick

@konstmish @kellerjordan0 @jxbz @leloykun @_arohan_ @gowerrobert @vyasnikhil96 @SebastienBubeck @JohnCLangford @ed_gorbunov @tonysilveti @DimitrisPapail @damekdavis @bremen79 I might also add that it should be possible to draw similar conclusions for Shampoo and PSGD parameterizations for certain ranges by using @varunneal's observation:

Thomas Pethick @tmpethick

@_arohan_ What do you mean by per layer grafting?

Thomas Pethick @tmpethick

12/ The result connects so many dots for me that it's mind-boggling, and this has been one of the most satisfying projects I've worked on https://arxiv.org/pdf/2606.15268

Thomas Pethick @tmpethick

3/ Interestingly, Chinchilla actually puts us well within the low-dim regime, since

tokens ∝ fan_in*fan_out*depth = dim^3

when depth/width is scaled proportionally. So as we scale up we shift further and further into the low-dim regime, where Muon is not necessarily optimal.
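
The scaling argument in 3/ can be made concrete with a little arithmetic. The sketch below assumes one fan_in × fan_out matrix per layer with fan_in = fan_out = dim, depth proportional to width, and a Chinchilla-style rule of thumb of about 20 training tokens per parameter. The constant 20, the depth ratio, and all function names are illustrative assumptions, not values from the thread.

```python
# Assumed constants, for illustration only.
TOKENS_PER_PARAM = 20.0        # rule-of-thumb tokens per parameter
DEPTH_PER_WIDTH = 1.0 / 128.0  # hypothetical depth/width ratio


def approx_params(dim):
    """fan_in * fan_out * depth with fan_in = fan_out = dim, depth ∝ dim."""
    depth = DEPTH_PER_WIDTH * dim
    return dim * dim * depth


def token_budget(dim):
    """Token budget proportional to the parameter count, so ∝ dim**3."""
    return TOKENS_PER_PARAM * approx_params(dim)


def regime(dim):
    """Label from 2/: low-dim when dim < token budget, else high-dim."""
    return "low-dim" if dim < token_budget(dim) else "high-dim"


for dim in (1024, 4096, 16384):
    print(dim, f"{token_budget(dim):.3g}", regime(dim))
```

Under these assumptions the token budget grows by a factor of 8 each time dim doubles, so the gap between dim and the budget widens with scale, which is the shift toward the low-dim regime described in the post.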

Thomas Pethick @tmpethick

8/ The conclusions are all obtained from one unifying convergence guarantee. For general p-norms we obtain a *dimension-free* noise robust acceleration where:

- the allowed acceleration is decreasing in p
- the noise can be increasingly heavy-tailed with increasing p

Thomas Pethick @tmpethick

2/ In the high-dimensional regime (dim > token budget), Muon is beneficial.

But, for low-dimensional problems (dim < token budget), Muon can be suboptimal even when assuming smoothness in Schatten-∞ norm.

This is true both in the deterministic and stochastic setting!

Thomas Pethick @tmpethick

7/ In essence, our work tells you when HTMuon/Freon/Soft-Muon is preferred (low-dimensional setting with heavy-tailed noise) and how to adjust learning rate and batch size @TianyuPang327 @nilinabra @DengShenyang24 @Zakobian

Thomas Pethick @tmpethick

4/ The regime discussion motivates a switching strategy transitioning from Schatten-∞ to some smaller Schatten-p norm during training.

Remarkably, @nilinabra seems to have arrived at such a schedule empirically with Contra-Muon/Soft-Muon.
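
To make the choice of p concrete, one standard construction (not necessarily the update analysed in the paper) is the steepest-descent direction over the Schatten-p unit ball: keep the gradient's singular vectors and map its singular values s_i to s_i^(q-1) / ||s||_q^(q-1), where 1/p + 1/q = 1. The sketch below applies that map to a list of singular values only, so it needs no linear-algebra library. The function name and the restriction to p > 1 are assumptions for illustration.

```python
import math


def schatten_lmo_singular_values(s, p):
    """Singular values of the unit Schatten-p direction best aligned with a
    gradient whose (nonnegative) singular values are `s`.

    Assumes p > 1. p = inf sets every nonzero singular value to 1, and
    p = 2 returns s / ||s||_2.
    """
    if p <= 1:
        raise ValueError("this sketch assumes p > 1")
    if math.isinf(p):
        return [1.0 if x > 0 else 0.0 for x in s]
    q = p / (p - 1.0)  # dual exponent, 1/p + 1/q = 1
    norm_q = sum(x ** q for x in s) ** (1.0 / q)
    if norm_q == 0.0:
        return [0.0 for _ in s]
    return [(x / norm_q) ** (q - 1.0) for x in s]


# A hypothetical spectrum, viewed under a few choices of p.
spectrum = [5.0, 2.0, 0.5]
for p in (float("inf"), 8.0, 4.0, 2.0):
    print(p, [round(d, 3) for d in schatten_lmo_singular_values(spectrum, p)])
```

With p = ∞ all singular values are flattened to 1, matching the orthogonalized update Muon approximates; as p decreases toward 2 the output keeps more of the gradient's own spectrum. A switching strategy like the one described would move p along this range during training.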

Thomas Pethick @tmpethick

10/ The construction in itself is rather elegant to me, and allows one to capture Muon precisely including Nesterov momentum and weight decay.

Thomas Pethick @tmpethick

6/ We also find that Euclidean methods have a learning rate warmup phase while larger p does not. This matches practice, where Muon typically does not require learning rate warmup.

Thomas Pethick @tmpethick

9/ What does this tell us?

- pick too small p and we pay a price if the problem geometry has larger p
- pick too large p and we pay a price if the noise is less heavy-tailed (and through inability to accelerate)

Thomas Pethick @tmpethick

Tagging a bunch of people who might find it interesting: @konstmish @kellerjordan0 @jxbz @leloykun @_arohan_ @gowerrobert @vyasnikhil96 @SebastienBubeck @JohnCLangford @ed_gorbunov @tonysilveti @DimitrisPapail @damekdavis @bremen79

Thomas Massena @thomasmassena

@tmpethick This looks great, will check it out! Out of curiosity, did you check out https://arxiv.org/abs/2605.19781? It seems like loads of people have converged towards researching intermediate Schatten-p norms recently.

rohan anil @_arohan_

@tmpethick We observed that we can decouple the direction of the update from its magnitude. So you can rescale the per-layer update to have a norm like sqrt(N), where N is the size of the tensor

https://arxiv.org/pdf/2002.09018

https://github.com/google-research/google-research/blob/4f34515ae8df194b75eb87deb4486d4713a60f19/scalable_shampoo/optax/distributed_shampoo.py#L308
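
The reply above describes separating an update's direction from its magnitude. A minimal sketch of that idea follows, assuming each layer's update is given as a flat list of floats and that the target norm is exactly sqrt(N). The helper names `graft_to_sqrt_n` and `graft_per_layer` are hypothetical, and this is not the code at the linked repository, which may choose the target magnitude differently.

```python
import math


def graft_to_sqrt_n(update, eps=1e-12):
    """Keep the direction of `update` but rescale its Euclidean norm to
    sqrt(N), where N is the number of entries. Near-zero updates are
    returned unchanged to avoid dividing by zero."""
    n = len(update)
    norm = math.sqrt(sum(u * u for u in update))
    if norm < eps:
        return list(update)
    scale = math.sqrt(n) / norm
    return [u * scale for u in update]


def graft_per_layer(updates):
    """Apply the rescaling independently to each named layer."""
    return {name: graft_to_sqrt_n(u) for name, u in updates.items()}


# Two hypothetical layers whose raw updates differ in scale by 1000x.
raw = {"layer_a": [3.0, 4.0], "layer_b": [0.003, 0.004, 0.0, 0.0]}
for name, u in graft_per_layer(raw).items():
    print(name, [round(v, 4) for v in u])
```

A norm of sqrt(N) corresponds to a root-mean-square entry size of 1, so after rescaling every layer takes a step of comparable per-entry size regardless of how the direction was computed.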

Thomas Pethick @tmpethick

@thomasmassena Thanks for sharing! Yes I found a bunch of works (see tweets and paper) but not this one - will take a closer look
