No idea if this has anything to do with Mythos (whose secret sauce might instead be about architecture, optimizer, training objective, or data), or whether all the labs are already doing something like this. But the paper is interesting and deserves to be better known!
Did Anthropic get more gains out of model scaling than other labs thought possible? The question reminds me of an interesting recent paper, which showed that the deep layers in open LLMs contribute little, and that this can be fixed by scaling the LayerNorm output. https://arxiv.org/abs/2502.05795
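
For anyone wondering what "scaling the LayerNorm output" could look like in practice, here is a minimal sketch. It assumes, as I read the paper, that the output of LayerNorm in layer l is multiplied by 1/sqrt(l), so deeper layers feed smaller activations into the residual stream. The function names are made up for illustration, the LayerNorm has no learned gain or bias, and this is plain Python on a single vector rather than anything resembling the authors' actual training code.

```python
import math

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over a 1-D list of floats (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def scaled_layer_norm(x, layer_index, eps=1e-5):
    """LayerNorm output multiplied by 1/sqrt(layer_index).

    layer_index is 1-based; the 1/sqrt(l) factor is my reading of the
    paper, so treat it as an assumption.
    """
    if layer_index < 1:
        raise ValueError("layer_index is 1-based and must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

# Arbitrary example vector; the output magnitude shrinks with depth.
x = [0.5, -1.0, 2.0, 3.5]
for l in (1, 4, 16):
    print(l, round(rms(scaled_layer_norm(x, l)), 3))
```

The point of the toy is just that the normalized output has magnitude close to 1 at layer 1 and falls off as 1/sqrt(l) after that, which is the knob the paper turns to keep activations in deep layers from growing.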
