Ran a grid search recently over which weights to zero-init.
Compared to baseline:
- zero o_proj only: slightly worse
- zero o_proj + geglu_out: slightly better
- zero geglu_out only: sizably better, ~10% faster to reach the same loss
Vanilla transformer with QKNorm + GeGLU + RMSNorm. YMMV.
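
For anyone wanting to try the "zero geglu_out only" variant, here is a rough framework-free sketch of what that init looks like. It assumes geglu_out means the output (down) projection of the GeGLU MLP, and the names (`init_geglu_mlp`, `zero_out`), the 0.02 std, and the toy sizes are all made up for illustration, not taken from the actual runs.

```python
import math
import random


def init_matrix(rows, cols, std, rng, zero=False):
    """Return a rows x cols matrix: all zeros, or Gaussian entries with the given std."""
    if zero:
        return [[0.0] * cols for _ in range(rows)]
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def gelu(v):
    """Exact (erf-based) GELU."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def init_geglu_mlp(d_model, d_hidden, rng, zero_out=True, std=0.02):
    """Hypothetical init: gate/up are random, the output projection is optionally zero."""
    return {
        "gate": init_matrix(d_hidden, d_model, std, rng),
        "up": init_matrix(d_hidden, d_model, std, rng),
        "out": init_matrix(d_model, d_hidden, std, rng, zero=zero_out),  # assumed "geglu_out"
    }


def geglu_mlp(params, x):
    """GeGLU MLP: out @ (gelu(gate @ x) * (up @ x))."""
    gate = [gelu(g) for g in matvec(params["gate"], x)]
    up = matvec(params["up"], x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(params["out"], hidden)


def residual_block(params, x):
    """Residual connection around the MLP (norm omitted to keep the sketch short)."""
    return [xi + yi for xi, yi in zip(x, geglu_mlp(params, x))]


rng = random.Random(0)
x = [0.5, -1.0, 2.0, 0.1]

zero_params = init_geglu_mlp(d_model=4, d_hidden=8, rng=rng, zero_out=True)
y_zero = residual_block(zero_params, x)

base_params = init_geglu_mlp(d_model=4, d_hidden=8, rng=rng, zero_out=False)
y_base = residual_block(base_params, x)

print(y_zero == x)  # MLP branch contributes nothing at init
print(y_base == x)
```

With the output projection zeroed, the MLP branch adds nothing at step 0, so the block starts out as the identity on the residual stream. The gate/up weights stay random, so gradients still reach the zeroed matrix on the first update.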