
PufferAI Shares CUDA Implementation of Muon Optimizer for Feedback

Original post
Joseph Suarez 🐡 @jsuarez #1371 in Tech

Now that I have your attention, any suggestions on our ~200 line CUDA implementation of Muon would be greatly appreciated https://github.com/PufferAI/PufferLib/blob/4.0/src/muon.cu. In the 5.0 branch on the same file, I played with a small change to preserve LR across model sizes, but there have not been any major improvements otherwise.

Hello @kellerjordan0 @_arohan_. I noticed that in your recent optimizer work, you appear to have used the inefficient versions of Muon and Shampoo that have long since been succeeded by PowerWash last week. The new algorithm is quite simple and elegant: it merely generates a set of weights with a different seed and evaluates until one of them passes the validation threshold, therefore cutting speedrun time down to 0 steps. The SplittingHairs normalization addition is particularly useful for stabilizing performance. I hope we can collaborate to bring this new standard into broader usage!

10:53 AM · Jun 10, 2026 · 2.4K Views
Sentiment

One commenter praised PufferAI's CUDA implementation of the Muon optimizer for its fused weight update and fp32 upcasts, calling the code excellent.

Positive: 100.0%
Negative: 0.0%
1 comment with sentiment.
Cluster Engagement
Posts from X
Lucas Nestler @Clashluke

@jsuarez i love the fused weight update and fp32 upcasts. excellent code, sir
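
For readers unfamiliar with the two terms in the reply, here is a minimal CPU sketch of what a "fused weight update" with "fp32 upcasts" usually means. It is not taken from `muon.cu`. The bf16 storage format, the decoupled weight decay, and the names `bf16_to_f32`, `f32_to_bf16`, and `fused_update` are all assumptions made for illustration. In a real CUDA kernel the loop body would run once per thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Upcast: bf16 is the top 16 bits of an IEEE-754 fp32 value.
inline float bf16_to_f32(std::uint16_t h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Downcast with round-to-nearest-even (NaN handling omitted for brevity).
inline std::uint16_t f32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical fused step: weight decay and the parameter update happen in
// one pass over memory, with all arithmetic done in fp32 (assumed layout:
// weights w and update u are both stored as bf16).
void fused_update(std::vector<std::uint16_t>& w,
                  const std::vector<std::uint16_t>& u,
                  float lr, float wd) {
    for (std::size_t i = 0; i < w.size(); ++i) {
        float wf = bf16_to_f32(w[i]);          // upcast weight
        float uf = bf16_to_f32(u[i]);          // upcast update
        wf = wf * (1.0f - lr * wd) - lr * uf;  // decay + step, in fp32
        w[i] = f32_to_bf16(wf);                // single write back
    }
}
```

Fusing means each weight is read and written once per step instead of once per operation, and doing the arithmetic in fp32 avoids compounding bf16 rounding error inside the update itself.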

2h · Views 169 · Likes 2 · Bookmarks 1