New NanoGPT Speedrun WR at 81.8s (-2.6s) from @.Lisennlp on GitHub with MUDD skip connections, an expressive and efficient mechanism for data-dependent skips! Instead of a learned scalar or a sigmoid(linear) gate, MUDD uses a 64-neuron 'MLP' to generate the skip coefficients. The key efficiency comes from reusing the same input projection for up to 14 coefficients at once. https://github.com/KellerJordan/modded-nanogpt/pull/259
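
A rough sketch of the idea, not the code from the PR: one shared input projection into 64 hidden units, then a cheap output layer that emits all the skip coefficients at once. The names `SkipCoefficientMLP` and `mix_skips`, the tanh activation, the init scales, and the way the coefficients are applied to earlier outputs are all assumptions made for illustration.

```python
import math
import random

HIDDEN = 64  # width of the small gating 'MLP', as stated in the post


def matvec(matrix, vec):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def init_matrix(rows, cols, rng, scale):
    """Gaussian init; the scale is an arbitrary choice for this sketch."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class SkipCoefficientMLP:
    """Hypothetical gate: one shared input projection, many coefficient outputs."""

    def __init__(self, d_model, n_coeffs, seed=0):
        rng = random.Random(seed)
        self.w_in = init_matrix(HIDDEN, d_model, rng, 1.0 / math.sqrt(d_model))
        self.w_out = init_matrix(n_coeffs, HIDDEN, rng, 1.0 / math.sqrt(HIDDEN))
        self.bias = [0.0] * n_coeffs  # static part of each coefficient

    def __call__(self, x):
        # The d_model -> 64 projection is computed once per token...
        hidden = [math.tanh(h) for h in matvec(self.w_in, x)]  # tanh is assumed
        # ...and reused for every coefficient; each extra one costs 64 multiplies.
        dynamic = matvec(self.w_out, hidden)
        return [d + b for d, b in zip(dynamic, self.bias)]


def mix_skips(x, earlier_outputs, coeffs):
    """Add each earlier output to x, scaled by its data-dependent coefficient."""
    out = list(x)
    for c, skip in zip(coeffs, earlier_outputs):
        out = [o + c * s for o, s in zip(out, skip)]
    return out


# Toy usage for a single token: 14 coefficients from one shared projection.
d_model, n_coeffs = 8, 14
data_rng = random.Random(1)
gate = SkipCoefficientMLP(d_model, n_coeffs)
x = [data_rng.gauss(0.0, 1.0) for _ in range(d_model)]
earlier = [[data_rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(n_coeffs)]
coeffs = gate(x)
y = mix_skips(x, earlier, coeffs)
```

Compared with 14 separate sigmoid(linear) gates, which would each need their own `d_model`-wide projection, this layout pays for the wide projection once and shares it across all outputs.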