we should really be thanking arabic numerals for making zeros happen
(screenshot posted with permission, and brings the thread full circle :))
Orthogonalization replaces inverse root computations to improve numerical stability.
Participants praise the Muon optimizer's innovations over Shampoo for LLM training: the work is satisfying because mathematical insight shows up directly in performance, and fun even though its ideas have been harder to transfer than in other subfields.

@suchenzang you mean indian numerals 😅
iirc Keller said the usage of NS was inspired by Jeremy's paper https://arxiv.org/abs/2409.20325
> computing inverse roots is inherently more numerically unstable than orthogonalization
we can jointly compute the inverse root and the matmul (P^{-1/2}G) with an iteration, and that's more stable than computing P^{-1/2} on its own. so introducing the iterative method is the point.
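To make the "joint" point concrete, here is a minimal sketch, assuming P = G G^T (in Shampoo, P is an accumulated statistic, so this is a simplification). For a full-rank G, the orthogonalized matrix U V^T equals (G G^T)^{-1/2} G, so iterating directly on G gives the product without ever forming P^{-1/2}. The sketch uses the classic cubic Newton-Schulz step, not Muon's tuned quintic, and the helper names (`matmul`, `transpose`, `newton_schulz_orthogonalize`) are made up for illustration.

```python
import math


def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=30):
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        # Cubic Newton-Schulz step: X <- 1.5 X - 0.5 (X X^T) X.
        # Each singular value s is mapped to 1.5 s - 0.5 s^3, which drives it to 1.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, rc)] for rx, rc in zip(x, xxt_x)]
    return x


# Hypothetical 2x2 "gradient" for illustration only.
g = [[2.0, 0.0], [1.0, 1.0]]
q = newton_schulz_orthogonalize(g)
gram = matmul(q, transpose(q))  # should be close to the identity
```

Only matrix products appear in the loop, with no explicit inverse or eigendecomposition, which is the practical appeal. Convergence slows when G has very small singular values, so the fixed `steps=30` is an assumption that suits well-conditioned inputs like the one above.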

@ShumingHu @suchenzang ChatGPT tells me there are ≈200,000 ML optimization papers written -- https://chatgpt.com/share/6a29abcf-f160-83e8-8507-6135bf8db564

@suchenzang I also feel like it’s a very satisfying line of work since your mathematical insight can directly show performance + you don’t need crazy compute to run your own experiments (vs say model architecture)

@jaiselsingh @suchenzang I think it's fun, the ideas have been harder to transfer than in other subfields, why? I spent >1000 hours nerd-sniped by methods like kfac (https://github.com/cybertronai/pytorch-sso)

@suchenzang @yaroslavvb hahaha depends on the person. I’m reasonably confident sum of total optimizer research time for tilde folks is less than total TV hours of my life. Probably in between my total life commute time and total BART time.

@pHequals7 oh nooo

@yaroslavvb @suchenzang this is neat! I’m going to have to go through your gh impls haha :)

@jaiselsingh @suchenzang There's a 3-matrix version of KFAC in https://mathematica.stackexchange.com/questions/234502/solving-eabxab-y-for-gaussian-a-b . But the issue in both regular and this KFAC is that allocating compute budget to vanilla gradient is better. Compute-efficiency is kind of the missing component in optimizer research, comes as an afterthought

@suchenzang u mean indian
but then again
what has 10 digits and says things like "zero"?

@suchenzang To be fair you could say this about most ML research, or even most academic research

@yaroslavvb @suchenzang 😮