
Zhaoran Wang releases Parallax, a new model architecture that requires the Muon optimizer to outperform standard softmax attention

The work advises against evaluating architectures solely with AdamW.

Jiaxin Shi @thjashin

@zhaoran_wang Very cool!

Zhaoran Wang @zhaoran_wang

for me, the coolest finding is that you can connect/interpolate all softmax/linear variants and give a promising direction - affine-linear : )

2:24 PM · May 31, 2026 · 267 Views
Posts from X
Views 4.9K · Bookmarks 9 · Likes 20 · Replies 1
Jiaxin Shi @thjashin

Very interesting work from @zhaoran_wang @YifeiZuoX. Looks like the first working version of higher-order test-time regression extension of softmax attention (cc @heyyalexwang )

Yifei Zuo @YifeiZuoX

For me, the coolest finding is that Muon optimizer is crucial for Parallax to move beyond Softmax Attention.

Lesson — don't evaluate new architectures solely under AdamW, you'll miss the good ones.

paper: https://arxiv.org/abs/2605.29157
code: https://github.com/Yifei-Zuo/Parallax/

For the origin of Parallax, check out the LLA paper at ICLR 2026:
paper: https://arxiv.org/abs/2510.01450
code: https://github.com/Yifei-Zuo/FlashLLA
