
DeepSeek and Kimi Switch to Muon Optimizer for Lower Curvature Penalty

Original post · Zhaoran Wang #851
Fengzhuo Zhang @FengzhuoZhang

Why do DeepSeek and Kimi use Muon instead of Adam?

🚀 Reasons from a curvature perspective:

1⃣ Under a second-order approximation of the loss, Muon incurs a much smaller curvature penalty than Adam while achieving the same first-order decrease.

2⃣ This advantage does not come from a smaller update norm. Instead, it comes from Muon having lower Normalized Directional Sharpness (NDS).

3⃣ Muon’s NDS advantage becomes larger when the training data is more imbalanced.
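
For readers who want the quantities in point 1 spelled out, the second-order approximation can be written as below. The notation is assumed here for illustration and is not taken from the post: W is a weight matrix, G is the gradient of the loss L at W, H is the Hessian with respect to vec(W), D is the optimizer's update direction, and \eta is the learning rate.

```latex
L(W - \eta D) \;\approx\; L(W)
  \;-\; \underbrace{\eta \,\langle G, D \rangle}_{\text{first-order decrease}}
  \;+\; \underbrace{\tfrac{\eta^{2}}{2}\, \mathrm{vec}(D)^{\top} H \, \mathrm{vec}(D)}_{\text{curvature penalty}}
```

Two optimizers can then be compared by choosing their step sizes so that the first-order decrease matches and inspecting the remaining curvature term. The post does not state the exact normalization behind NDS, so none is assumed here.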

Paper Link: https://arxiv.org/abs/2606.04662

A thread 🧵

1:30 PM · Jun 4, 2026 · 124.8K Views
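
The comparison described in the post can be sketched on a toy problem. The following is a minimal illustration, not the paper's experiment. It assumes a random positive semidefinite matrix as a stand-in Hessian, uses a sign update as a proxy for Adam (Adam with its moment averaging and epsilon switched off reduces to a sign update), and uses the exact SVD polar factor as a proxy for Muon's Newton-Schulz orthogonalization. The function names and `target_decrease` are hypothetical. Because the Hessian is random, the printed numbers say nothing about real networks; the point is only how the two terms are computed at a matched first-order decrease.

```python
import numpy as np


def sign_direction(G):
    # Proxy for Adam: with beta1 = beta2 = 0 and eps = 0, Adam's step is sign(G).
    return np.sign(G)


def orthogonalized_direction(G):
    # Proxy for Muon: the polar factor U @ Vt of the gradient matrix.
    # Muon approximates this with Newton-Schulz iterations instead of an SVD.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def second_order_terms(G, H, D, lr):
    # Terms of the quadratic model for the update W <- W - lr * D.
    d = D.ravel()
    first_order_decrease = lr * float(G.ravel() @ d)
    curvature_penalty = 0.5 * lr**2 * float(d @ H @ d)
    return first_order_decrease, curvature_penalty


def compare(G, H, target_decrease=1e-2):
    # Pick each learning rate so both updates give the same first-order decrease.
    directions = {
        "sign": sign_direction,
        "orthogonalized": orthogonalized_direction,
    }
    results = {}
    for name, direction in directions.items():
        D = direction(G)
        lr = target_decrease / float(np.sum(G * D))
        results[name] = second_order_terms(G, H, D, lr)
    return results


rng = np.random.default_rng(0)
m, n = 8, 4
G = rng.standard_normal((m, n))          # toy gradient matrix
A = rng.standard_normal((m * n, m * n))
H = A @ A.T / (m * n)                    # toy PSD Hessian over vec(W)

results = compare(G, H)
for name, (decrease, penalty) in results.items():
    print(f"{name:>15}: first-order decrease={decrease:.4e}, "
          f"curvature penalty={penalty:.4e}")
```

To probe the post's third claim in this toy setting, one would need a Hessian derived from an actual imbalanced dataset rather than a random matrix; that is left out here because the post does not describe the setup.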