Tech · 1d ago

Researcher Eyes Custom GPU Kernels to Speed Up AI Optimizers

Original post

Finally, pretty excited to produce some kick-ass kernels for these in the future, so we don’t need to be burning GPUs doing bad linear algebra operations.

rohan anil@_arohan_

Last one: with Adam grafting from Meta’s implementation, the size of the update is O(sqrt(size)), so you have to set a different lr and weight decay. The Muon implementation uses different lr / wd for various layers. I just used it and rescaled as appropriate.

10:28 PM · Jun 8, 2026 · 5.4K Views
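To make the O(sqrt(size)) remark concrete, here is a minimal sketch of layer-wise grafting in plain Python. It is not Meta's code: the helper names (`adam_first_step`, `graft`, `grafted_update_norm`) and the stand-in "preconditioned direction" are assumptions made up for illustration. The idea shown is that grafting keeps one optimizer's direction but borrows Adam's update norm, and since Adam's per-entry update is roughly ±1, that norm grows like the square root of the number of parameters in the layer.

```python
import math


def l2_norm(xs):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in xs))


def adam_first_step(grad, eps=1e-8):
    # Hypothetical helper: on the first step with bias correction,
    # m_hat = g and v_hat = g^2, so each entry is g / (|g| + eps),
    # which is approximately sign(g).
    return [g / (abs(g) + eps) for g in grad]


def graft(direction, magnitude_source):
    # Layer-wise grafting: keep the direction of `direction`,
    # take the norm from `magnitude_source`.
    d_norm = l2_norm(direction)
    if d_norm == 0.0:
        return list(direction)
    scale = l2_norm(magnitude_source) / d_norm
    return [scale * d for d in direction]


def grafted_update_norm(n):
    # Deterministic, nonzero toy gradient for a layer with n parameters.
    grad = [((-1) ** i) * (0.01 + 0.001 * i) for i in range(n)]
    # Stand-in for a preconditioned direction (an assumption, not Shampoo).
    direction = [g * (1 + (i % 3)) for i, g in enumerate(grad)]
    return l2_norm(graft(direction, adam_first_step(grad)))


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, round(grafted_update_norm(n), 3), round(math.sqrt(n), 3))
```

Because the grafted norm tracks sqrt(n) rather than staying constant across layers, a learning rate and weight decay tuned for an optimizer with a different update scale would need rescaling, which is consistent with what the post describes.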

Posts from X

rohan anil@_arohan_

Addendum: the bugged run here involved a numerical-linear-algebra-flavored problem, without fixing which results are still poor. It was not Keller’s implementation, which wraps dist shampoo with some choice of hyperparameters.

The rest are hyperparameter choices, the grafting method, and passing in Nesterov momentum.
