Very impressive results from Min Li and @Haoxiang__Wang: simply swapping Attention for Parallax reaches 2880 steps with the SOAP-H optimizer, beating the latest SOTA record on modded-nanogpt (@kellerjordan0) with no hyperparameter tuning.
A few observations:
- Parallax is uniformly stronger than Softmax Attention across all records.
- Optimizers don't transfer to Parallax with the same magnitude, which confirms the optimizer–architecture interaction from the Parallax paper.
- The cleanest modifications often transfer best; records built on heavy tuning transfer less reliably.
These are preliminary results, and I believe both the Parallax architecture and the optimizer side have room to improve. The code is open-sourced below; give it a try.
Code: https://github.com/Yifei-Zuo/modded-nanogpt-plx/tree/master/parallax
Kernel: https://github.com/Yifei-Zuo/Parallax
Paper: https://arxiv.org/abs/2605.29157