Building momentum at Marin! Upgrading from Dense -> 129B parameter MoEs -> architecture improvements -> optimizer improvements gives our pretraining recipe an estimated 6x cumulative learning speedup, accounting for MFU. Includes community contributions. https://openathena.ai/blog/pretraining-speedup/
Larry Dial of Open Athena releases a Marin pretraining recipe delivering a 6x cumulative learning speedup using 129B MoEs
Training over 1.0 trillion tokens ran without frequent loss spikes.
Quoting @dlwh: we are at risk of losing the reputation of spiky loss runs!
This run incorporates some stability techniques from my past projects: Hyperball, Gated Norm, and Gated Attention. Excited to see the next run from Marin!