I've added two optimizers to the public benchmark:
(1) Shampoo (with its original 1/4 power). (2) Spectral descent, which is equivalent to both Muon(mu=0) and Shampoo(b1=b2=0).
Result: Shampoo falls halfway between Muon & Adam; Spectral descent is ~2x slower.
Thread below 1/6
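
For anyone who wants to check the equivalence claimed in (2), here is a minimal NumPy sketch. It is not the benchmark code. It assumes spectral descent means replacing the gradient G = U S V^T with U V^T, and that Shampoo(b1=b2=0) means building the preconditioners from the current gradient only (L = G G^T, R = G^T G) and applying the 1/4 power with no momentum. The function names are hypothetical, and zero eigenvalues are handled with a pseudo-inverse, which is also an assumption.

```python
import numpy as np

def spectral_descent_direction(G):
    # Reduced SVD G = U diag(s) V^T; the update keeps U V^T and drops the singular values.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def inv_fourth_root(M, rel_tol=1e-10):
    # M^(-1/4) for a symmetric PSD matrix via eigendecomposition.
    # Eigenvalues below rel_tol * max are treated as zero (pseudo-inverse).
    w, Q = np.linalg.eigh(M)
    w_pow = np.zeros_like(w)
    keep = w > rel_tol * w.max()
    w_pow[keep] = w[keep] ** -0.25
    return (Q * w_pow) @ Q.T

def shampoo_direction_no_accumulation(G):
    # Preconditioners from the current gradient only (no running averages, no momentum).
    L = G @ G.T
    R = G.T @ G
    return inv_fourth_root(L) @ G @ inv_fourth_root(R)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
same = np.allclose(
    shampoo_direction_no_accumulation(G),
    spectral_descent_direction(G),
    atol=1e-6,
)
print("directions match:", same)
```

The algebra behind it: (G G^T)^(-1/4) = U S^(-1/2) U^T and (G^T G)^(-1/4) = V S^(-1/2) V^T, so the product with G in the middle collapses to U V^T. Muon with mu=0 orthogonalizes the raw gradient, which targets the same U V^T; in practice Muon uses an approximate Newton-Schulz iteration rather than an exact SVD, so the match there is only approximate.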


