Being a bit pedantic and finally got a few mins to run this. kthx bye.
I remember that in August 2024 (before the anthology paper), when I was attending the modula-in-numpy sessions of Scale ML, it was even called ShampooLinear.
Rohan Anil at Anthropic responded on X to a thread spotlighting the 2018 Shampoo optimizer paper amid mentions of ShampooLinear from Scale ML sessions. He cited the original paper, outlined its development from initial implementations to later references, and highlighted a footnote revealing the name as a pun on 'pre-conditioning' for hair shampoo. Shampoo preserves gradient tensor structure using separate preconditioning matrices per parameter dimension.
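To make "separate preconditioning matrices per parameter dimension" concrete, here is a minimal NumPy sketch of the matrix case as I understand it from the 2018 paper: a left statistic accumulates G Gᵀ, a right statistic accumulates Gᵀ G, and the gradient is multiplied by the inverse fourth root of each. The names `inv_root` and `shampoo_step`, the learning rate, and the `eps` initialization are illustrative assumptions, not anything taken from the thread or from a particular library.

```python
import numpy as np

def inv_root(mat, p, floor=1e-12):
    """Return mat^(-1/p) for a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    w = np.maximum(w, floor)           # guard against tiny or negative eigenvalues
    return (v * w ** (-1.0 / p)) @ v.T  # V diag(w^(-1/p)) V^T

def shampoo_step(W, G, L, R, lr=0.1):
    """One Shampoo-style update for a 2-D parameter W with gradient G (sketch)."""
    L = L + G @ G.T                    # left statistic, shape (m, m)
    R = R + G.T @ G                    # right statistic, shape (n, n)
    update = inv_root(L, 4) @ G @ inv_root(R, 4)
    return W - lr * update, L, R

# Toy usage: a 3x2 parameter with a random gradient (hypothetical values).
rng = np.random.default_rng(0)
m, n, eps = 3, 2, 1e-4
W = rng.normal(size=(m, n))
G = rng.normal(size=(m, n))
L0, R0 = eps * np.eye(m), eps * np.eye(n)  # assumed eps * I initialization
W_new, L1, R1 = shampoo_step(W, G, L0, R0)
```

The point of the sketch is that the update keeps the gradient as an m x n matrix and only ever stores an m x m and an n x n statistic, rather than flattening the gradient and preconditioning with one (mn) x (mn) matrix.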
Nesterov momentum is pretty good. I believe the origin might be James Martens' 2011 paper on the importance of momentum in neural networks. The derivation requires an approximation for the half step.
Somehow the QT'd meme caused me to suddenly realize why the Shampoo algorithm is called that (indeed, I had never read the original paper).
Shampoo 2018 if you want a citation.
If there is a better citation let me know?