
Yifei Zuo releases Parallax, an attention mechanism that outperforms standard softmax attention at up to 1.7B parameters when trained with the Muon optimizer

Its custom decode kernel matches or exceeds FlashAttention performance.

Original post

~7/7~
Paper: https://arxiv.org/abs/2605.29157
Code: http://github.com/yifei-zuo/Parallax
Authors: @YifeiZuoX (Northwestern), @dhruv31415, @AlecDewulf, @ShumingHu (Tilde Research), @zz30gs (UW), @zhaoran_wang (Northwestern)
Work done as part of the Tilde Fellowship. Stay tuned for the full blog post/article next week ⚡

3:08 PM · May 29, 2026

For me, the coolest finding is that you can connect/interpolate all softmax/linear variants, which points to a promising direction: affine-linear : )

Yifei Zuo @YifeiZuoX

For me, the coolest finding is that Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Lesson: don't evaluate new architectures solely under AdamW, you'll miss the good ones.
paper: https://arxiv.org/abs/2605.29157
code: https://github.com/Yifei-Zuo/Parallax/
For the origin of Parallax, check out the LLA paper at ICLR 2026:
paper: https://arxiv.org/abs/2510.01450
code: https://github.com/Yifei-Zuo/FlashLLA

11:10 PM · May 29, 2026 · 7.6K Views
12:09 AM · May 30, 2026 · 1.6K Views

For me, the coolest finding is that you can connect all attention/linear variants, which points to a promising direction: affine-linear : )

11:17 PM · May 29, 2026 · 437 Views