MiniMax Releases MSA Paper on Efficient Sparse Attention Mechanism


@aurko79 Do you think the Routing Transformer can come back if we codesign it? Or are parts of it in MSA already?



Nice to see MiniMax and others pushing content-based sparse attention forward. Back in 2020 the field was exploding with efficient attention variants:

- Local/sliding-window attention
- Fixed sparsity patterns
- Strided attention
- Recurrence (Transformer-XL, Compressive Transformers)
- Chunked attention
- Linear attention and SSMs

The right prior, which has stood the test of time, is a combination of: a) local attention for modeling local context and b) top-k content-based sparsity to route queries to the most relevant keys (or blocks of keys).
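To make that combination concrete, here is a minimal pure-Python sketch of causal attention where each query sees (a) a local window and (b) the top-k past key blocks ranked by content. It is an illustration only, not MSA's actual algorithm or kernel: the block score (query dotted with the block's mean key), the function names, and the parameters `window`, `block_size`, and `top_k` are all assumptions made for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def allowed_keys(q_idx, query, keys, window, block_size, top_k):
    """Return the sorted causal key indices this query may attend to."""
    # (a) Local attention: the last `window` positions, including q_idx itself.
    local = set(range(max(0, q_idx - window + 1), q_idx + 1))

    # (b) Content-based routing: score every block that lies entirely in the
    # past by query . mean(block keys), then keep the top_k blocks.
    # (Mean-key scoring is an assumption chosen for simplicity.)
    n_blocks = (q_idx + 1) // block_size
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), b))
    chosen = sorted(scores, reverse=True)[:top_k]
    routed = {i for _, b in chosen
              for i in range(b * block_size, (b + 1) * block_size)}
    return sorted(local | routed)


def sparse_attention(queries, keys, values, window=2, block_size=2, top_k=1):
    """Softmax attention restricted to the local + routed key set per query."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for i, q in enumerate(queries):
        idx = allowed_keys(i, q, keys, window, block_size, top_k)
        logits = [dot(q, keys[j]) * scale for j in idx]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, idx)) / z
            for d in range(len(values[0]))
        ])
    return outputs
```

Each query touches at most `window + top_k * block_size` keys instead of all previous ones. A real implementation would select and load whole blocks inside a fused GPU kernel rather than loop in Python, which is where the hardware-algorithm codesign comes in.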

What has changed in the last 6 years is better hardware-algorithm codesign.

RyanLee @RyanLeeMiniMax

Hey everyone, our high-performance MSA kernel library is now open-source. The M3 weights are expected to drop this Friday. Thanks for waiting! GitHub: https://github.com/MiniMax-AI/MSA Paper: https://github.com/MiniMax-AI/MSA/blob/main/docs/MiniMaxSparseAttention.pdf


Aurko Roy @aurko79

@_arohan_ Yeah, IO-aware clustering potentially
