Why do DeepSeek and Kimi use Muon instead of Adam?
🚀 Reasons from a curvature perspective:
1⃣ Under a second-order approximation of the loss, Muon incurs a much smaller curvature penalty than Adam while achieving the same first-order decrease.
2⃣ This advantage does not come from a smaller update norm. Instead, it comes from Muon having lower Normalized Directional Sharpness (NDS).
3⃣ Muon’s NDS advantage becomes larger when the training data is more imbalanced.
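A rough sketch of the decomposition behind points 1⃣ and 2⃣, under assumed notation that may differ from the paper's: W are the weights, G the gradient, H the Hessian, D the optimizer's update direction, and η the learning rate. The norm used to define NDS is also an assumption here.

```latex
% Second-order Taylor approximation of the loss after the step W -> W - \eta D
\[
f(W - \eta D) \;\approx\; f(W)
  \;-\; \underbrace{\eta \,\langle G, D \rangle}_{\text{first-order decrease}}
  \;+\; \underbrace{\tfrac{\eta^{2}}{2}\,\langle D, H[D] \rangle}_{\text{curvature penalty}}
\]

% Assumed definition: directional sharpness normalized by the squared update norm
\[
\mathrm{NDS}(D) \;:=\; \frac{\langle D, H[D] \rangle}{\lVert D \rVert^{2}},
\qquad
\tfrac{\eta^{2}}{2}\,\langle D, H[D] \rangle
  \;=\; \tfrac{1}{2}\,\lVert \eta D \rVert^{2}\,\mathrm{NDS}(D)
\]
```

Read this way, the curvature penalty is the product of the squared update norm and NDS, so at equal update norm a smaller penalty has to come from a lower NDS.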
Paper Link: https://arxiv.org/abs/2606.04662
A thread 🧵
