
Min Li and Haoxiang Wang set a new modded-nanogpt benchmark record using the Parallax architecture and SOAP-H optimizer

The run achieved the record with zero hyperparameter tuning.

Original post · Zhaoran Wang #865
Yifei Zuo @YifeiZuoX

Very impressive results from Min Li and @Haoxiang__Wang: simply swapping Attention for Parallax reaches 2880 steps with the SOAP-H optimizer, beating the latest SOTA record on modded-nanogpt (@kellerjordan0) with no hyperparameter tuning.

A few observations:
- Parallax is uniformly stronger than Softmax Attention across all records.
- Optimizers don't transfer to Parallax with the same magnitude, which confirms the optimizer–architecture interaction from the Parallax paper.
- The cleanest modifications often transfer best; records built on heavy tuning transfer less reliably.

These are preliminary results, I believe both the Parallax architecture and the optimizer side have room to improve. Code is open-sourced below, give it a try.

Code: https://github.com/Yifei-Zuo/modded-nanogpt-plx/tree/master/parallax
Kernel: https://github.com/Yifei-Zuo/Parallax
Paper: https://arxiv.org/abs/2605.29157

4:50 PM · Jun 2, 2026 · 9.3K Views
Posts from X
Views 3.7K · Bookmarks 7 · Likes 28 · Retweets 1
rohan anil @_arohan_

Shampoo supremacy
