As I understand it, the paper is saying that vanilla SGD learns modes sequentially, in order of each mode's magnitude in the data. In this case, the general skip rule is the largest-magnitude mode, followed by the task-specific skips, so SGD learns the general mode first.
The other optimizers precondition the gradient, which removes this magnitude-ordered learning. Update sizes become roughly uniform across modes, so the implicit curriculum goes away.
In the case of Muon, the update is orthogonalized and so roughly isotropic, which means the general high-magnitude mode and the smaller task-specific modes are all learned at once, leading to poor generalization.
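Here's a toy sketch of the mechanism as I picture it. To be clear, this is not the paper's setup: the two-layer linear net, the axis-aligned targets `(3.0, 0.5)`, the step sizes, and the helper names are all my own assumptions, and the "Muon-like" rule is just an exact SVD orthogonalization of the gradient with no momentum.

```python
import numpy as np

def train(update_rule, steps, lr, targets=(3.0, 0.5), init=0.01):
    """Fit a two-layer linear net W2 @ W1 to diag(targets).

    Assumption: modes are axis-aligned, so the diagonal of W2 @ W1 is the
    current strength of each mode. Returns an array of shape
    (steps + 1, n_modes) with the mode strengths after every step.
    """
    T = np.diag(targets)
    W1 = init * np.eye(len(targets))
    W2 = init * np.eye(len(targets))
    history = [np.diag(W2 @ W1).copy()]
    for _ in range(steps):
        err = W2 @ W1 - T            # residual of loss 0.5 * ||W2 W1 - T||_F^2
        g1 = W2.T @ err              # gradient w.r.t. W1
        g2 = err @ W1.T              # gradient w.r.t. W2
        W1 = W1 - lr * update_rule(g1)
        W2 = W2 - lr * update_rule(g2)
        history.append(np.diag(W2 @ W1).copy())
    return np.array(history)

def plain_update(g):
    """Vanilla gradient step: step size scales with each mode's gradient."""
    return g

def orthogonalized_update(g):
    """Muon-like step (assumption: exact SVD, no momentum).

    Replaces every singular value of the gradient with 1, so all modes
    get the same step size regardless of magnitude.
    """
    u, _, vt = np.linalg.svd(g)
    return u @ vt

def steps_to_half(history, targets=(3.0, 0.5)):
    """First step at which each mode reaches half of its target strength."""
    reached = history >= 0.5 * np.asarray(targets)
    return reached.argmax(axis=0)

# Plain gradient descent: the large mode is picked up well before the small one.
sgd_hist = train(plain_update, steps=400, lr=0.05)
half_steps_sgd = steps_to_half(sgd_hist)
print("GD steps to half target (large mode, small mode):", half_steps_sgd)

# Orthogonalized updates: early on, both modes grow in lockstep.
orth_hist = train(orthogonalized_update, steps=30, lr=0.01)
print("orthogonalized, mode strengths after 30 steps:", orth_hist[-1])
```

With plain gradient steps the large mode should hit half its target many steps before the small one, while with the orthogonalized rule the two columns of `orth_hist` track each other step for step. That's the "curriculum vs. no curriculum" difference in miniature; it says nothing by itself about generalization.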
The effect is more evident in a specifically constructed task like this one, but it probably doesn't matter for LLM training, since the data there is so diverse and vast that it regularizes against this.
(corrections welcome!)