7h ago

Yifei Zuo releases Parallax, an attention mechanism that outperforms standard softmax attention at scales up to 1.7B parameters when trained with the Muon optimizer

Its custom decode kernel matches or exceeds FlashAttention performance.


Original post

#865 @ZHAORAN_WANG (OP)

Tilde @TILDERESEARCH

~7/7~
Paper: https://arxiv.org/abs/2605.29157
Code: http://github.com/yifei-zuo/Parallax
Authors: @YifeiZuoX (Northwestern), @dhruv31415, @AlecDewulf, @ShumingHu (Tilde Research), @zz30gs (UW), @zhaoran_wang (Northwestern)
Work done as part of the Tilde Fellowship. Stay tuned for the full blog post/article next week ⚡

3:08 PM · May 29, 2026

Reposted by

#238 @SONGLINYANG4

QUOTE POST

#865 Zhaoran Wang @ZHAORAN_WANG

Congratulations to @YifeiZuoX!!!

10:20 PM · May 29, 2026 · 1.5K Views

QUOTE POST

#865 Zhaoran Wang @ZHAORAN_WANG

for me, the coolest finding is that you can connect/interpolate all softmax/linear variants and give a promising direction - affine-linear : )

Yifei Zuo @YifeiZuoX

For me, the coolest finding is that Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Lesson: don't evaluate new architectures solely under AdamW, you'll miss the good ones.
paper: https://arxiv.org/abs/2605.29157
code: https://github.com/Yifei-Zuo/Parallax/
For the origin of Parallax, check out the LLA paper at ICLR 2026:
paper: https://arxiv.org/abs/2510.01450
code: https://github.com/Yifei-Zuo/FlashLLA

11:10 PM · May 29, 2026 · 7.6K Views

12:09 AM · May 30, 2026 · 1.6K Views

QUOTE POST

#865 Zhaoran Wang @ZHAORAN_WANG

for me, the coolest finding is that you can connect all attention/linear variants and give a promising direction - affine-linear : )

Yifei Zuo @YifeiZuoX

11:10 PM · May 29, 2026 · 7.6K Views

11:17 PM · May 29, 2026 · 437 Views