DeepSeek and Kimi Switch to Muon Optimizer for Lower Curvature Penalty

Original post

Zhaoran Wang#851

Fengzhuo Zhang@FengzhuoZhang

Why do DeepSeek and Kimi use Muon instead of Adam?

🚀 Reasons from a curvature perspective:

1⃣ Under a second-order approx., Muon incurs a much smaller curvature penalty than Adam while maintaining the same first-order decrease.

2⃣ This advantage does not come from a smaller update norm. Instead, it comes from Muon having lower Normalized Directional Sharpness (NDS).

3⃣ Muon’s NDS advantage becomes larger when the training data is more imbalanced.

Paper Link: https://arxiv.org/abs/2606.04662

A thread 🧵

1:30 PM · Jun 4, 2026 · 124.8K Views
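The curvature argument in the post can be written out as a short derivation. This is a sketch under assumptions, not the paper's exact formulation: the symbols $L$, $w$, $g$, $H$, and $\Delta$ are introduced here for illustration, the Euclidean (Frobenius) norm is assumed, and the Rayleigh-quotient form of NDS is one natural reading of "Normalized Directional Sharpness"; the paper may normalize differently.

```latex
% Second-order approximation of the loss after an update \Delta,
% with gradient g and Hessian H evaluated at w
L(w + \Delta) \approx L(w) + \langle g, \Delta \rangle
                + \tfrac{1}{2}\, \Delta^\top H \Delta

% Curvature penalty split into a norm factor and a direction factor
\tfrac{1}{2}\, \Delta^\top H \Delta
  = \tfrac{1}{2}\, \|\Delta\|^2 \cdot
    \underbrace{\frac{\Delta^\top H \Delta}{\|\Delta\|^2}}_{\text{NDS (assumed form)}}
```

Under this reading, two updates with the same first-order term $\langle g, \Delta \rangle$ and a similar norm differ in predicted loss only through the directional factor, which is the comparison the post draws between Muon and Adam.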

Sentiment

Commenters thank collaborators for the Muon work, which reports a smaller curvature penalty than Adam and a larger advantage when training data is imbalanced.

Positive: 100.0%

Negative: 0.0%

1 comment with sentiment.

Posts from X

Fengzhuo Zhang@FengzhuoZhang

2/5: Update Decomposition of Norm and Direction

Decomposing the curvature penalty into norm and direction, we find that Muon has a similar update norm to Adam but much lower Normalized Directional Sharpness (NDS).

8h23

Replies: 1
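The norm and direction split described in 2/5 can be checked numerically. The following is a minimal sketch, assuming the curvature penalty is 0.5 * delta^T H delta and that NDS is the Rayleigh quotient delta^T H delta / ||delta||^2 under the Euclidean norm. The function name `decompose_penalty`, the matrix `H`, and the vector `delta` are hypothetical and chosen for illustration; the paper's exact normalization may differ.

```python
import numpy as np

def decompose_penalty(delta, H):
    """Split 0.5 * delta^T H delta into a squared-norm factor and a direction factor.

    Assumption: NDS is taken to be the Rayleigh quotient of H along delta.
    """
    sq_norm = float(delta @ delta)                # norm part: ||delta||^2
    nds = float(delta @ H @ delta) / sq_norm      # direction part (assumed NDS)
    penalty = 0.5 * sq_norm * nds                 # equals 0.5 * delta^T H delta
    return sq_norm, nds, penalty

# Toy Hessian with one sharp and one flat direction (illustrative values).
H = np.diag([4.0, 1.0])
delta = np.array([1.0, 1.0])

sq_norm, nds, penalty = decompose_penalty(delta, H)

# Rescaling the update changes the norm factor but leaves the NDS unchanged.
_, nds_scaled, penalty_scaled = decompose_penalty(3.0 * delta, H)
```

Because the direction factor is scale-invariant, a lower NDS cannot be obtained by shrinking the step, which matches the thread's point that the advantage does not come from a smaller update norm.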

Fengzhuo Zhang@FengzhuoZhang

5/5: Provable Results in Quadratic Models

Under K-FAC-style assumptions and gradient alignment with the leading Hessian eigenvectors, Muon provably achieves a lower NDS and a larger loss decrease than GD.

8h17
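The statement in 5/5 can be illustrated on a tiny quadratic. The sketch below is a hand-picked toy, not the paper's setting or proof. The loss 0.5 * tr(W^T A W B), the diagonal factors `A` and `B` (a Kronecker-factored Hessian in the spirit of K-FAC), the Frobenius-normalized NDS, and the best-step loss decrease are all assumptions introduced here. `msign` uses an exact SVD, whereas practical Muon implementations typically approximate the orthogonal factor with a Newton-Schulz iteration.

```python
import numpy as np

def msign(G):
    """Orthogonal polar factor U @ V^T of G, computed here by exact SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def nds(D, A, B):
    """Assumed NDS: <D, A D B> / ||D||_F^2 for the Hessian factored as A and B."""
    return np.sum(D * (A @ D @ B)) / np.sum(D * D)

def best_decrease(G, D, A, B):
    """Loss decrease along -D with the best step size (exact for a quadratic)."""
    return np.sum(G * D) ** 2 / (2.0 * np.sum(D * (A @ D @ B)))

# Hypothetical factors: one sharp direction per side, the rest flat.
A = np.diag([10.0, 1.0, 1.0, 1.0])
B = np.diag([10.0, 1.0, 1.0, 1.0])
W = np.eye(4)

# Gradient of 0.5 * tr(W^T A W B) for symmetric A, B; it is dominated by the
# (0, 0) entry, i.e. aligned with the leading Hessian eigenvector.
G = A @ W @ B

nds_gd = nds(G, A, B)            # about 99.97
nds_muon = nds(msign(G), A, B)   # 25.75

dec_gd = best_decrease(G, G, A, B)            # about 50.03
dec_muon = best_decrease(G, msign(G), A, B)   # 51.5
```

In this instance the orthogonalized direction spreads the update across all singular directions instead of concentrating it on the sharpest one, so its NDS is lower and its best-step decrease is larger than that of the raw gradient. This reflects only this hand-built example under the stated assumptions.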
