Min Li and Haoxiang Wang set a new modded-nanogpt benchmark record using the Parallax architecture and SOAP-H optimizer

The run achieved the record with zero hyperparameter tuning.

Yifei Zuo (@YifeiZuoX)

Very impressive results from Min Li and @Haoxiang__Wang: simply swapping Attention for Parallax reaches 2880 steps with the SOAP-H optimizer, beating the latest SOTA record on modded-nanogpt (@kellerjordan0) with no hyperparameter tuning.

A few observations:

- Parallax is uniformly stronger than Softmax Attention across all records.
- Optimizers don't transfer to Parallax with the same magnitude, which confirms the optimizer–architecture interaction from the Parallax paper.
- The cleanest modifications often transfer best; records built on heavy tuning transfer less reliably.

These are preliminary results, I believe both the Parallax architecture and the optimizer side have room to improve. Code is open-sourced below, give it a try.

Code: https://github.com/Yifei-Zuo/modded-nanogpt-plx/tree/master/parallax
Kernel: https://github.com/Yifei-Zuo/Parallax
Paper: https://arxiv.org/abs/2605.29157

4:50 PM · Jun 2, 2026 · 9.3K Views

Sentiment

Users praise the Parallax architecture for its conservative approach: it preserves softmax attention and adds a correction branch instead of replacing attention entirely, which delivers strong results on modded-nanogpt.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)

Posts from X

Most Activity

Views: 3.7K · Bookmarks: 7 · Likes: 28 · Retweets: 1

rohan anil (@_arohan_)

Shampoo supremacy
