We originally treated -p as a hyperparameter. One delta worth mentioning: for good convergence in the deep learning setting, one needs to add per-layer grafting.
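For anyone unfamiliar with grafting: as the term is commonly used, you take the step direction from one optimizer and the step size from another, separately for each layer. Here's a minimal sketch of that idea, not necessarily the exact variant we used. The function names, the dict-of-arrays layout, and the choice of the Frobenius norm for the magnitude are all assumptions made for illustration.

```python
import numpy as np

def graft(direction_update, magnitude_update, eps=1e-12):
    """Rescale direction_update so its Frobenius norm matches magnitude_update.

    Hypothetical helper: the direction comes from one optimizer,
    the step size from another.
    """
    d_norm = np.linalg.norm(direction_update)
    m_norm = np.linalg.norm(magnitude_update)
    return direction_update * (m_norm / (d_norm + eps))

def graft_per_layer(direction_updates, magnitude_updates):
    """Apply grafting independently to each layer (dicts keyed by layer name)."""
    return {
        name: graft(direction_updates[name], magnitude_updates[name])
        for name in direction_updates
    }
```

The point of doing this per layer is that each weight matrix gets its own step size, so a layer with an unusually large or small raw update doesn't set the scale for the others.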
1/ Let me chip in on the recent “which optimizer rules them all” discussion with a somewhat more moderate take, asking:
Which Schatten-p norm should we use?
Turns out the answer is regime-dependent! Specifically, even when the loss is smooth in Schatten-∞ (the spectral norm), Muon is not necessarily the best choice.
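To make the question concrete, here's a rough sketch of what changing p does to the update. It assumes the update is the steepest-descent direction under a unit Schatten-p ball, which is my framing for illustration; the function names are hypothetical. The Schatten-p norm is the ℓp norm of the singular values, and by Hölder duality the direction reweights the gradient's singular values by the power q - 1, where 1/p + 1/q = 1.

```python
import numpy as np

def schatten_norm(M, p):
    """Schatten-p norm: the l_p norm of the singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    if np.isinf(p):
        return float(s.max())  # spectral norm
    return float((s ** p).sum() ** (1.0 / p))

def steepest_direction(G, p):
    """Unit-Schatten-p-norm D maximizing <G, D>, for 1 < p <= inf.

    Illustrative assumption: the optimizer steps along this direction.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    if np.isinf(p):
        d = np.ones_like(s)  # dual exponent q = 1: all singular values set to 1
    else:
        q = p / (p - 1.0)    # dual exponent, 1/p + 1/q = 1
        d = s ** (q - 1.0)
        d = d / (d ** p).sum() ** (1.0 / p)  # normalize to unit l_p norm
    return (U * d) @ Vt
```

With p = ∞ this returns U Vᵀ, the orthogonalized gradient that Muon approximates; with p = 2 it is just the gradient divided by its Frobenius norm. Intermediate p interpolates between the two.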


