
Researcher Eyes Custom GPU Kernels to Speed Up AI Optimizers

Original post
rohan anil (@_arohan_)

Finally, I'm pretty excited to produce some kick-ass kernels for these in the future, so we don't need to keep burning GPUs on bad linear algebra operations.

rohan anil (@_arohan_)

Last one is that with Adam grafting from Meta's impl, the size of the update is O(sqrt(size)), which means you have to set a different lr and weight decay. The Muon implementation uses different lr / wd for various layers. I just used it, and rescaled it as appropriate.

10:28 PM · Jun 8, 2026 · 5.4K Views
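
To make the O(sqrt(size)) remark concrete, here is a minimal sketch in plain Python. It is not Meta's or Keller's code. It assumes the simplest case: on the first bias-corrected Adam step every entry of the update is roughly g / |g|, so a tensor with n entries gets an update with norm of about sqrt(n). The helper names (`frob_norm`, `adam_first_step`, `graft`) and the random stand-in for the preconditioned direction are hypothetical, chosen only for illustration.

```python
import math
import random


def frob_norm(x):
    """Frobenius (L2) norm of a flattened tensor."""
    return math.sqrt(sum(v * v for v in x))


def adam_first_step(grad, eps=1e-8):
    """First bias-corrected Adam update: m_hat = g and v_hat = g^2,
    so each entry is g / (|g| + eps), i.e. close to +1 or -1."""
    return [g / (abs(g) + eps) for g in grad]


def graft(direction, magnitude_source):
    """Grafting: keep the direction of one update, borrow the norm of another."""
    scale = frob_norm(magnitude_source) / (frob_norm(direction) + 1e-30)
    return [scale * d for d in direction]


random.seed(0)
for n in (64, 256, 1024):
    grad = [random.gauss(0.0, 1.0) for _ in range(n)]
    # Hypothetical stand-in for a preconditioned (Shampoo-style) direction.
    direction = [random.gauss(0.0, 1.0) for _ in range(n)]
    step = graft(direction, adam_first_step(grad))
    # Ratio of the grafted step norm to sqrt(n).
    print(n, frob_norm(step) / math.sqrt(n))
```

The printed ratio should stay close to 1 across sizes: the grafted step norm grows like sqrt(n). That is why a learning rate and weight decay tuned for a differently scaled update would need rescaling.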
rohan anil (@_arohan_)

Addendum: the bugged run here involved a numerical-linear-algebra-flavored problem, without which results are still poor. It was not Keller's implementation, which wraps dist shampoo with some choice of hyperparameters.

The rest of them are hyperparameter choices, the grafting method, and passing in Nesterov momentum.
