SGD Generalizes to Unseen Skips While Muon and AdamW Memorize

Original post

wh@nrehiew_

Replicated some of the results from this cool work with SGD, Muon, AdamW, SOAP

Sara Dragutinovic@sara_drag

Yes! We train a transformer on cyclic sequences. SGD generalizes to an unseen skip. Muon doesn't — it memorizes each pattern separately instead of learning shared, generalizable representations.

5:55 AM · Jul 2, 2026 · 2K Views

Posts from X

wh@nrehiew_

Some other curves + CE losses

wh@nrehiew_

As I understand it, the paper says that vanilla SGD learns modes sequentially, in order of each mode's magnitude in the data. Here the general skip rule is the largest-magnitude mode, followed by the task-specific skips, so SGD learns the general mode first.
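To make the "largest mode first" idea concrete, here is a minimal sketch under a strong simplifying assumption: instead of the paper's transformer, it runs plain gradient descent on a toy quadratic loss where each mode has its own strength `h_i`. The function name `steps_to_learn`, the strengths, the learning rate, and the tolerance are all hypothetical choices for illustration.

```python
import numpy as np

def steps_to_learn(mode_strengths, lr=0.05, tol=0.05, max_steps=10_000):
    """Return, per mode, the first GD step at which its error drops below tol.

    Toy model (an assumption, not the paper's setup): the loss is
    0.5 * sum_i h_i * (w_i - 1)^2, where h_i is the strength of mode i
    and every mode has the same target value 1.
    """
    h = np.asarray(mode_strengths, dtype=float)
    w = np.zeros_like(h)                      # start with nothing learned
    learned_at = np.full(h.shape, -1, dtype=int)
    for step in range(1, max_steps + 1):
        grad = h * (w - 1.0)                  # gradient scales with mode strength
        w -= lr * grad                        # plain gradient descent, no preconditioning
        newly = (np.abs(w - 1.0) < tol) & (learned_at < 0)
        learned_at[newly] = step
        if (learned_at > 0).all():
            break
    return learned_at

# One strong "general" mode and two weaker "task-specific" modes (hypothetical values).
learned_at = steps_to_learn([4.0, 1.0, 0.25])
print(learned_at)
```

In this toy setting the error of mode `i` shrinks by a factor `(1 - lr * h_i)` per step, so stronger modes cross the tolerance earlier and the modes are picked up one after another.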

The other optimizers precondition the gradient, which removes this magnitude-ordered sequential learning. Update sizes become roughly uniform across modes, so the implicit curriculum disappears.

In the case of Muon, the update becomes isotropic, so the general high-magnitude mode and the smaller task-specific modes are all learned at once, leading to poor generalization.
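A rough way to see what "isotropic" means here: Muon orthogonalizes its update matrix, which in the idealized case replaces a gradient `U diag(s) V^T` with `U V^T`. The sketch below uses an exact SVD as a stand-in (Muon itself uses a Newton-Schulz iteration and momentum, both omitted here), and the gradient matrix and its singular values are made up for illustration.

```python
import numpy as np

def orthogonalize(grad):
    """Map grad = U diag(s) V^T to U V^T, i.e. set every singular value to 1.

    Idealized stand-in for Muon's orthogonalization step (an assumption:
    exact SVD instead of Newton-Schulz, no momentum).
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
# Hypothetical 3x2 gradient with one large "general" mode (5.0)
# and one small "task-specific" mode (0.5).
u, _ = np.linalg.qr(rng.standard_normal((3, 2)))   # orthonormal columns
v, _ = np.linalg.qr(rng.standard_normal((2, 2)))   # orthogonal matrix
grad = u @ np.diag([5.0, 0.5]) @ v.T

raw_sv = np.linalg.svd(grad, compute_uv=False)
ortho_sv = np.linalg.svd(orthogonalize(grad), compute_uv=False)
print("raw gradient singular values:  ", raw_sv)
print("orthogonalized singular values:", ortho_sv)
```

The raw gradient pushes ten times harder along the large mode than the small one, while the orthogonalized update pushes equally hard along both, which is the sense in which the magnitude ordering is lost.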

This is more evident in this sort of specifically constructed task, but it probably doesn't matter for LLM training, since the data is so diverse and vast that it regularizes against this.

(corrections welcome!)
