Yifei Zuo releases Parallax, an attention mechanism that outperforms standard softmax attention at scales up to 1.7B parameters when trained with the Muon optimizer.
Its custom decode kernel matches or exceeds FlashAttention performance.
Congratulations to @YifeiZuoX!!!
For me, the coolest finding is that you can connect and interpolate between all the softmax and linear attention variants, which points to a promising direction: affine-linear : )
For me, the coolest finding is that the Muon optimizer is crucial for Parallax to move beyond softmax attention. Lesson: don't evaluate new architectures solely under AdamW, or you'll miss the good ones.
paper: https://arxiv.org/abs/2605.29157
code: https://github.com/Yifei-Zuo/Parallax/
For the origin of Parallax, check out the LLA paper at ICLR 2026:
paper: https://arxiv.org/abs/2510.01450
code: https://github.com/Yifei-Zuo/FlashLLA