@zhaoran_wang Very cool!
For me, the coolest finding is that you can connect/interpolate all softmax/linear variants, which points to a promising direction: affine-linear :)
The work advises against evaluating architectures solely with AdamW.
Very interesting work from @zhaoran_wang @YifeiZuoX. Looks like the first working version of a higher-order test-time regression extension of softmax attention (cc @heyyalexwang)
For me, the coolest finding is that the Muon optimizer is crucial for Parallax to move beyond Softmax Attention.
Lesson: don't evaluate new architectures solely under AdamW, or you'll miss the good ones.
paper: https://arxiv.org/abs/2605.29157 code: https://github.com/Yifei-Zuo/Parallax/
For the origin of Parallax, check out the LLA paper at ICLR 2026: paper: https://arxiv.org/abs/2510.01450 code: https://github.com/Yifei-Zuo/FlashLLA
Users thanked the researchers for their Parallax model extending Softmax Attention via higher-order test-time regression.