
Researcher Eyes Custom GPU Kernels to Speed Up AI Optimizers

Original post
rohan anil (@_arohan_)

Finally, pretty excited to produce some kick-ass kernels for these in the future, so we don’t need to be burning GPUs doing bad linear algebra operations.

rohan anil (@_arohan_)

The last one is that, with Adam grafting from Meta’s impl, the size of the update is O(sqrt(size)), which means you have to set a different lr and weight decay. The Muon implementation uses different lr / wd for various layers. I just used it and rescaled it as appropriate.

10:28 PM · Jun 8, 2026 · 2.3K Views
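Why the update grows as O(sqrt(size)) can be checked with a toy model. The sketch below is a pure-Python illustration under stated assumptions, not code from Meta’s Distributed Shampoo: it grafts an arbitrary stand-in direction onto the norm of a first-step Adam update. With bias correction and eps = 0, every entry of that Adam update is sign(g) = ±1, so the grafted update of a layer with n parameters has Frobenius norm exactly sqrt(n). The function names and the `base_lr / sqrt(n)` rescaling are hypothetical; they show one way to make the step size independent of layer size, not necessarily the rescaling used in the run described above.

```python
import math
import random

def frobenius_norm(xs):
    return math.sqrt(sum(x * x for x in xs))

def adam_first_step(grad, eps=0.0):
    # Bias-corrected Adam at step 1 has m_hat = g and v_hat = g**2,
    # so with eps = 0 every entry is g / |g| = sign(g).
    return [g / (math.sqrt(g * g) + eps) for g in grad]

def graft(direction, grafting_update):
    # Layer-wise grafting: keep the direction of `direction`,
    # borrow the Frobenius norm of `grafting_update`.
    scale = frobenius_norm(grafting_update) / frobenius_norm(direction)
    return [scale * d for d in direction]

def grafted_norm(n, seed=0):
    rng = random.Random(seed)
    grad = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Hypothetical stand-in for the Shampoo-preconditioned direction:
    # any nonzero vector works, since only its direction survives grafting.
    direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return frobenius_norm(graft(direction, adam_first_step(grad)))

base_lr = 1e-3
for n in (16, 256, 4096):
    # Hypothetical per-layer rescaling: divide lr by sqrt(numel).
    lr = base_lr / math.sqrt(n)
    print(n, round(grafted_norm(n), 6), lr * grafted_norm(n))
```

The grafted norm comes out as 4, 16 and 64 for the three layer sizes, while the rescaled step stays at `base_lr`. The post notes that weight decay needs adjusting as well; this sketch leaves that out.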
Posts from X
rohan anil (@_arohan_)

Addendum: the bugged run here involved a problem with a numerical linear algebra flavor, without which results are still poor. It was not Keller’s implementation, which wraps dist shampoo with some choice of hyperparameters.

The rest are hyperparameter choices, the grafting method, and passing in Nesterov momentum.
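The post does not spell out how Nesterov momentum is passed in. As a rough illustration of what such a flag usually toggles, here is a pure-Python sketch of the formulation documented for PyTorch’s `torch.optim.SGD` with dampening set to 0. Other libraries may use a different convention, and the function name `momentum_step` is hypothetical.

```python
def momentum_step(grad, buf, mu=0.9, nesterov=False):
    """One momentum update on flat lists; returns (update, new_buf)."""
    # Heavy-ball buffer, as in torch.optim.SGD with dampening = 0:
    # buf <- mu * buf + grad
    new_buf = [mu * b + g for b, g in zip(buf, grad)]
    if nesterov:
        # Nesterov: use grad + mu * buf, a look-ahead along the new buffer.
        update = [g + mu * b for g, b in zip(grad, new_buf)]
    else:
        # Plain momentum: use the buffer itself.
        update = new_buf
    return update, new_buf

# Constant gradient of 1.0 for two steps, starting from an empty buffer.
for flag in (False, True):
    buf = [0.0]
    for step in range(2):
        update, buf = momentum_step([1.0], buf, mu=0.9, nesterov=flag)
        print(f"nesterov={flag} step={step} update={update[0]:.2f}")
```

With a constant gradient, the Nesterov variant produces updates of 1.9 and 2.71 where plain momentum produces 1.0 and 1.9, so the flag changes the effective step size and interacts with the learning-rate choice.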
