The blockwise sparse attention itself is not particularly novel (if anything, they take pride in it being a straightforward sparsification of GQA). What is cool is that they show comparable results whether the model is pretrained with sparsity from scratch or adapted to sparsity late in training.
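For anyone who wants a mental model of what "blockwise sparsification of GQA" can look like, here is a rough NumPy sketch. It is not the MSA kernel or the paper's exact algorithm: the function name `block_sparse_gqa`, the mean-pooled block scoring, the per-head top-k selection, and the `block_size` / `top_k` parameters are all my own assumptions for illustration.

```python
import numpy as np

def block_sparse_gqa(q, k, v, block_size=4, top_k=2):
    """Toy causal block-sparse attention over grouped KV heads.

    q: (Hq, T, d); k, v: (Hkv, T, d), with Hq divisible by Hkv and
    T divisible by block_size. All design choices here are assumptions,
    not the MSA implementation.
    """
    Hq, T, d = q.shape
    Hkv = k.shape[0]
    group = Hq // Hkv                      # query heads per shared KV head (GQA)
    n_blocks = T // block_size
    # One summary vector per key block (mean pooling is an assumption).
    k_blocks = k.reshape(Hkv, n_blocks, block_size, d).mean(axis=2)
    pos = np.arange(T)
    q_block = pos // block_size            # block index of each query position
    future = np.arange(n_blocks)[None, :] > q_block[:, None]
    causal = pos[None, :] <= pos[:, None]
    out = np.zeros_like(q)
    for h in range(Hq):
        g = h // group                     # KV head shared by this query head
        # Score blocks, hide future blocks, and always keep the query's own block.
        block_scores = np.where(future, -np.inf, q[h] @ k_blocks[g].T)
        block_scores[pos, q_block] = np.inf
        keep = np.argsort(-block_scores, axis=1)[:, :top_k]
        block_mask = np.zeros((T, n_blocks), dtype=bool)
        np.put_along_axis(block_mask, keep, True, axis=1)
        block_mask &= ~future
        # Expand the block mask to token level and apply the causal mask.
        token_mask = np.repeat(block_mask, block_size, axis=1) & causal
        scores = np.where(token_mask, q[h] @ k[g].T / np.sqrt(d), -np.inf)
        scores -= scores.max(axis=1, keepdims=True)
        w = np.exp(scores)
        w /= w.sum(axis=1, keepdims=True)
        out[h] = w @ v[g]
    return out
```

With `top_k` equal to the number of blocks this reduces to ordinary dense causal GQA, which is a handy sanity check. A real kernel would skip the unselected blocks entirely instead of masking a full score matrix.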
Hey everyone, our high-performance MSA kernel library is now open source. The M3 weights are expected to drop this Friday. Thanks for waiting! GitHub: https://github.com/MiniMax-AI/MSA Paper: https://github.com/MiniMax-AI/MSA/blob/main/docs/MiniMaxSparseAttention.pdf
