Hyper-parameter tuning works really well!
Since Keller asked very nicely, I took it as a challenge to find the minimal edit distance from his config to a working config with no further code modifications. That means using Meta's vanilla distributed_shampoo PyTorch package as is, the one that won the AlgoPerf competition.
Apart from enabling the pseudo-inverse, the biggest alpha here was tuning hyper-parameters. The pseudo-inverse was critical because this speedrun problem produces rank-deficient matrices (see those nice viz). Thanks to Anna for the great help!
Hypers: lr=0.01, wd=0.1, beta2=0.9, eps=1e-15, freq=1 🙃
No Nesterov momentum was used. Oops.
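Why the pseudo-inverse matters with an eps this small can be sketched on eigenvalues alone. This is an illustration only, not the distributed_shampoo package's code: the function names, the relative cutoff `rel_tol`, and the example eigenvalues are all hypothetical, and it assumes the preconditioner takes an inverse 4th root of a symmetric PSD statistic whose eigendecomposition is already available.

```python
# Illustrative sketch only; NOT the distributed_shampoo implementation.
# Function names, rel_tol, and the example eigenvalues are hypothetical.

def inverse_root_with_eps(eigenvalues, root, eps):
    """Regularized inverse root: (lam + eps) ** (-1/root) for every eigenvalue."""
    return [(lam + eps) ** (-1.0 / root) for lam in eigenvalues]


def pseudo_inverse_root(eigenvalues, root, rel_tol=1e-7):
    """Pseudo-inverse root: invert only eigenvalues above a relative cutoff.

    Directions whose eigenvalue is (numerically) zero get a zero scale
    instead of a huge one.
    """
    cutoff = rel_tol * max(eigenvalues)
    return [lam ** (-1.0 / root) if lam > cutoff else 0.0 for lam in eigenvalues]


# Eigenvalues of a made-up rank-deficient statistic (rank 2 out of 4).
eigs = [4.0, 1.0, 0.0, 0.0]

regularized = inverse_root_with_eps(eigs, root=4, eps=1e-15)
pseudo = pseudo_inverse_root(eigs, root=4)

print("eps-regularized:", regularized)  # null directions get a scale in the thousands
print("pseudo-inverse: ", pseudo)       # null directions get a scale of exactly 0
```

With eps=1e-15 the regularized version scales the null directions by (1e-15) ** (-1/4), which is several thousand times larger than the scale on the well-conditioned directions, while the pseudo-inverse version simply leaves them out.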
All that is no longer needed. Those modifications were just giving us identical training curves for the same step targets, because that's how math works. Now the curves look different, but they arrive at these end-of-training validation targets:
- first completed segment: step:3375/3375 val_loss:3.27656
- second completed segment: step:3375/3375 val_loss:3.27675
[1] https://github.com/facebookresearch/optimizers/issues/265#issuecomment-4668270192
Next steps for anyone looking for a late night hobby:
* I would probably fuse a lot of these things and write kernels.
* Try it on new baselines and harder tasks: increase batch size or change architecture.
* Tune frequency, change eigh to ortho items etc.
Thank you for this result! Here's one initial correction:
In this post, Rohan states that I made an attempt to implement Shampoo, and labels my Shampoo training run `Buggy shampoo impl`.
But this is incorrect: I did not implement my own version of Shampoo.
Instead -- as can be seen in the reproducible log I provided in the original post -- I used, out-of-the-box, the official DistributedShampoo implementation provided by Facebook research. This is the most commonly-used Shampoo implementation that I could find on the internet.
The extent of my use of this implementation can be seen in the few lines of code below. If there are indeed bugs in this implementation, I can safely say that they were not created by me.
The remaining mystery, for anyone interested, might naturally become something like the following: how did Rohan today achieve a significantly better result than the official 2022-era DistributedShampoo implementation?
I am grateful for the comments he has already made regarding the deltas between his version and the 2022 one, and I am looking forward to things becoming fully precise/detailed soon once he releases the reproducible logfiles generated by his runs.






