
Zero Init On Geglu Out Weights Speeds Transformer Training By 10%

Ethan @torchcompiled

Ran a grid search on zero init weights recently.

Compared to base:
- zero o_proj only: slightly worse
- zero o_proj + geglu_out: slightly better
- zero geglu_out only: sizably better, ~10% faster to reach same loss

Vanilla transformer with QKNorm+Geglu+RMSNorm. YMMV
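For readers who want to try the "zero geglu_out only" setting, here is a minimal PyTorch sketch, not the author's actual code. It assumes "geglu_out" means the final down-projection of the GeGLU feed-forward block; the class name `GeGLU`, the attribute names `gate`, `up`, `out`, and the `zero_init_out` flag are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeGLU(nn.Module):
    """GeGLU feed-forward block: out(GELU(gate(x)) * up(x)).

    Layer names are illustrative; `out` stands in for the post's "geglu_out".
    """

    def __init__(self, dim: int, hidden: int, zero_init_out: bool = True):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.out = nn.Linear(hidden, dim, bias=False)
        if zero_init_out:
            # Assumption: "zero init" means an all-zero output projection,
            # so the block contributes nothing to the residual at step 0.
            nn.init.zeros_(self.out.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.gelu(self.gate(x)) * self.up(x))


torch.manual_seed(0)
block = GeGLU(dim=16, hidden=64)
x = torch.randn(2, 8, 16)

y = block(x)        # all zeros at init, so x + y == x in a residual stream
y.sum().backward()  # dummy loss, only used to inspect the first-step gradients
```

With this init, the output projection still gets a nonzero gradient on the first step, while `gate` and `up` receive zero gradient until `out` has moved away from zero. Leaving `o_proj` at its default init, as in the best run above, would just mean not applying the same treatment to the attention output layer.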

2:17 AM · Jun 3, 2026 · 1.3K Views
Ethan @torchcompiled

Fixed the run names so it's clearer.
