
Zero Init On Geglu Out Weights Speeds Transformer Training By 10%


Original post

Ethan (@torchcompiled) · #1884 in AI

Ran a grid search on zero init weights recently.

Compared to base:
- zero o_proj only: slightly worse
- zero o_proj + geglu_out: slightly better
- zero geglu_out only: sizably better, ~10% faster to reach the same loss

Vanilla transformer with QKNorm+Geglu+RMSNorm. YMMV

2:17 AM · Jun 3, 2026 · 1.3K Views
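
For readers unfamiliar with the term, here is a minimal sketch of what "zero init on geglu_out" means, written in plain Python so it runs without any framework. It is an illustration under assumptions, not the poster's code: the names `init_geglu` and `geglu_block`, the 0.02 init std, and the layer sizes are all hypothetical, and the normalization layers (RMSNorm, QKNorm) are left out for brevity. It shows only the one property that is certain: with the GEGLU output projection set to zero, the MLP branch contributes nothing at initialization, so the residual block starts as the identity.

```python
import math
import random


def gelu(x):
    # Exact GELU, computed from the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rows, cols, rng, std=0.02):
    # Small Gaussian init; std=0.02 is an assumed value, not from the post.
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def init_geglu(d_model, d_hidden, rng, zero_out=True):
    # Hypothetical helper: gate/up get the usual random init, while the
    # output projection ("geglu_out") is either all zeros or random.
    if zero_out:
        out = [[0.0] * d_hidden for _ in range(d_model)]
    else:
        out = rand_matrix(d_model, d_hidden, rng)
    return {
        "gate": rand_matrix(d_hidden, d_model, rng),
        "up": rand_matrix(d_hidden, d_model, rng),
        "out": out,
    }


def geglu_block(params, x):
    # Residual GEGLU MLP: x + out @ (gelu(gate @ x) * (up @ x)).
    gate = [gelu(g) for g in matvec(params["gate"], x)]
    up = matvec(params["up"], x)
    hidden = [g * u for g, u in zip(gate, up)]
    branch = matvec(params["out"], hidden)
    return [xi + bi for xi, bi in zip(x, branch)]


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(8)]

y_zero = geglu_block(init_geglu(8, 16, rng, zero_out=True), x)
y_rand = geglu_block(init_geglu(8, 16, rng, zero_out=False), x)

print(y_zero == x)  # the zero-init block passes its input through unchanged
print(y_rand == x)  # the randomly initialized block perturbs it
```

The gate and up projections still receive gradients through training once the output projection moves away from zero. Whether this start point explains the reported speedup is not something the sketch can show. YMMV applies here too.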



Posts from X


Ethan (@torchcompiled)

Fixed the run names so it's clearer.
