MiniMax open-sources a blockwise sparse attention kernel, reporting that training from scratch matches full-attention benchmarks

Story Overview

MiniMax has released its MSA kernel library so developers can experiment with blockwise sparse attention that directly sparsifies grouped-query attention (GQA). The release pairs a GitHub repo with a new paper reporting that models trained from scratch under a 3-trillion-token budget reach benchmark parity with full-attention baselines while substantially reducing per-token compute at long contexts.

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

The blockwise sparse attention itself is not particularly novel (they rather take pride in it being a straightforward sparsification of GQA), but what is cool is that they show comparable results for pretraining from scratch and late adaptation to sparsity.

RyanLee @RyanLeeMiniMax

Hey everyone — our high-performance MSA kernel library is now open-source. The M3 weights are expected to drop this Friday. Thanks for waiting! Github: https://github.com/MiniMax-AI/MSA Paper: https://github.com/MiniMax-AI/MSA/blob/main/docs/MiniMaxSparseAttention.pdf

1:04 AM · Jun 12, 2026 · 1.4K Views

Developer Impact

Kernel now lives on GitHub for anyone to test

The public repo includes GPU kernels, Python bindings, and benchmarks that highlight an Index Branch for top-k block selection followed by exact sparse attention in the Main Branch. Early numbers point to large wall-clock gains on H800 hardware, though broader framework integration remains to be seen.
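The two-stage flow described above (pick the top-k key blocks, then run exact attention over only those blocks) can be illustrated with a minimal pure-Python sketch for a single query and a single head. This is not the MSA library's API or its kernel: the function names, the mean-key scoring rule used for block selection, and the omission of causal masking and GQA head grouping are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_blocks(q, keys, block_size, top_k):
    """Index step: score each key block and keep the top_k block ids.

    Assumption: a block is scored by the dot product between the query
    and the block's mean key. The real selection rule may differ.
    """
    n_blocks = math.ceil(len(keys) / block_size)
    scored = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(q, mean_key), b))
    # Highest score first; ties broken by lower block id.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return sorted(b for _, b in scored[:top_k])


def sparse_attention(q, keys, values, block_size, top_k):
    """Main step: exact softmax attention restricted to the selected blocks."""
    blocks = select_blocks(q, keys, block_size, top_k)
    idx = [
        i
        for b in blocks
        for i in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in idx]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    d_v = len(values[0])
    out = [
        sum(w * values[i][j] for w, i in zip(weights, idx)) / z
        for j in range(d_v)
    ]
    return out, blocks


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
    values = [[1.0, 1.0], [1.0, 1.0], [2.0, 3.0], [2.0, 3.0], [9.0, 9.0], [9.0, 9.0]]
    out, blocks = sparse_attention(q, keys, values, block_size=2, top_k=1)
    print(blocks, out)  # [1] [2.0, 3.0]
```

With `top_k` equal to the number of blocks, the sketch reduces to ordinary dense attention. With a smaller `top_k`, the attention step touches only `top_k * block_size` keys per query, which is where the per-token savings at long contexts would come from.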

Open Question

Long-context edge cases still need scrutiny

The paper claims strong average results yet leaves open whether recall holds in every sparse long-context scenario. Independent verification of the full 3T-token table and any rare failure modes will determine how widely teams adopt the approach.

Posts from X

kalomaze @kalomaze

@teortaxesTex "can you do it from scratch" is in fact an important question because the space of bullshit things you can do post-hoc is much wider than the space of bullshit things you can do from the beginning without paying a tax for it

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

WuBu ⪋ WaefreBeorn 🇺🇸 👑 @waefrebeorn

@kalomaze @teortaxesTex i am doing it from scratch
