Tech · 26d ago

BLASST wins Best Paper at MLSys 2026 for a drop-in training-free dynamic sparse attention mechanism that thresholds online softmax statistics to skip negligible blocks in long-context LLM inference

It targets self-attention compute and memory bottlenecks during inference.
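As a rough illustration of the idea in the headline, here is a minimal NumPy sketch of blocked attention with online softmax that skips a key block when its largest score sits far below the running maximum. The function name `blocked_attention_with_skip`, the parameters `block` and `threshold`, and the rule that every query row in the tile must agree before a block is skipped are all assumptions made for this example. It is not the authors' kernel, and the paper's exact criterion may differ.

```python
import numpy as np

def blocked_attention_with_skip(q, k, v, block=64, threshold=None):
    """Online-softmax attention over key blocks (hypothetical sketch).

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v).
    threshold: scalar in (0, 1), or None for dense attention.
    Returns (output, number_of_skipped_blocks).
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n_q, -np.inf)            # running row-wise max of the scores
    l = np.zeros(n_q)                    # running softmax denominator
    acc = np.zeros((n_q, v.shape[1]))    # running unnormalized output
    log_thr = -np.inf if threshold is None else np.log(threshold)
    skipped = 0

    for start in range(0, k.shape[0], block):
        s = (q @ k[start:start + block].T) * scale   # score tile, (n_q, b)
        m_blk = s.max(axis=1)                        # tile-local row max

        # Assumed skip rule: every row's tile max is below running max + log(threshold),
        # so every exp(s - m) in this tile would be smaller than `threshold`.
        if np.all(m_blk - m < log_thr):
            skipped += 1
            continue                                 # no exp, no P @ V for this tile

        m_new = np.maximum(m, m_blk)
        alpha = np.exp(m - m_new)                    # rescale factor for the old state
        p = np.exp(s - m_new[:, None])
        l = alpha * l + p.sum(axis=1)
        acc = alpha[:, None] * acc + p @ v[start:start + block]
        m = m_new

    return acc / l[:, None], skipped
```

In this sketch the score tile is still computed for every block. The saving comes from not exponentiating it and not multiplying it with the value block. Calling the function with `threshold=None` gives the dense result, which makes it easy to compare outputs at different threshold values.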

#663

Original post

Song Han#1559

Huizi Mao@huizi_mao

Glad to be featured by SemiAnalysis. Our work BLASST was also selected as MLSys 2026 Best Paper: https://mlsys.org/virtual/2026/poster/3631

SemiAnalysis@SemiAnalysis_

Sparse attention mechanisms are finally moving beyond academic benchmarks into production systems, including DeepSeek Sparse Attention, and recently @NousResearch's Lighthouse Attention. BLASST by NVIDIA, from the paper Dynamic Blocked Attention Sparsity via Softmax Thresholding, attempts to sparsify attention in a different way, leveraging a similar rescale factor threshold idea from Flash Attention 4. We expect to see more interesting sparse attention techniques in the future. https://arxiv.org/abs/2512.12087 (2/4)

3:35 PM · May 17, 2026 · 5K Views
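A short derivation may help show why a threshold on online softmax statistics can bound what a skipped block would have contributed. The symbols below are assumptions introduced for this note, not notation confirmed from the paper: $\lambda \in (0,1)$ is the threshold, $B_c$ is the number of keys in a block, $m$ is the running row maximum, $\tilde m_j$ is the row maximum of block $j$, and $\ell$ is the running softmax denominator.

```latex
\begin{aligned}
&\text{Skip condition (assumed form):} && \tilde m_j - m < \ln \lambda \\
&\text{Then the running max is unchanged:} && \max(m, \tilde m_j) = m \\
&\text{Every entry of the block satisfies:} && \exp(s_{ik} - m) \le \exp(\tilde m_j - m) < \lambda \\
&\text{So the block's mass is bounded:} && \sum_{k \in \text{block } j} \exp(s_{ik} - m) < B_c \, \lambda \\
&\text{Since the max element contributes } \exp(0) = 1: && \ell \ge 1
  \;\Rightarrow\; \frac{\sum_{k \in \text{block } j} \exp(s_{ik} - m)}{\ell} < B_c \, \lambda
\end{aligned}
```

Under these assumptions, a skipped block's unnormalized softmax mass is less than $B_c \lambda$ times the denominator already accumulated. Because skipping needs no update to $m$, it also needs no rescale of the accumulated output.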

Sentiment

Many users congratulated NVIDIA's BLASST team on its MLSys 2026 Best Paper win for dynamic sparse attention via softmax thresholding, praising the training-free method and expressing interest in connecting with the authors.

Pos: 100.0% · Neg: 0.0%

10 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 38.7K · Bookmarks: 226 · Likes: 356 · Retweets: 52 · Replies: 20

Jiayi Yuan@jiayiy

🚀 BLASST just won Best Paper at #MLSys26! In this paper, we introduce a simple, training-free dynamic sparse attention mechanism that uses a single scalar threshold on online softmax statistics to skip negligible attention blocks. Unfortunately I won’t be there in person, but please say hi to my awesome coauthors! 🙌 Paper: https://arxiv.org/abs/2512.12087

SemiAnalysis@SemiAnalysis_

25d · 38.7K views · 356 likes · 226 bookmarks

finbarr@finbarrtimbers

This is an elegant paper; hope to try it out soon.

SemiAnalysis@SemiAnalysis_

25d15.8K6455

Sakura Yuki@sakurayukiai

@jiayiy My favorite genre of paper is 'we stopped doing half the math during inference and the model didn't notice'.

Does dynamic block skipping cause warp divergence on the GPU, or does the kernel mask it somehow?

25d871

EDITH@Infopulsed

@jiayiy Hey i am not kidding, i have been working on this since 2024, i had this idea actually about online softmax.. here's the repo https://github.com/MagellaX/StreamAttn

25d841

Jiayi Yuan@jiayiy

@Infopulsed Hey, thanks for sharing! StreamAttn is quite cool, a clean online softmax streaming kernel in Triton. BLASST reuses the running max stats and mainly focuses on sparse attention. We also provide easy-to-use kernels: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax_Attention.md Happy to connect!

25d451

Lorenzo Garcia@_fmla_

@jiayiy how is this different from skip softmax from nvidia?

25d48

Xiuyu Li@xiuyu_l

@jiayiy 🐐

25d1211

Kimbo@kimbochen

@jiayiy Hi Jiayi! Which authors will be there? I’d love to connect with them!

25d1051

Delta Institute @ MLSys@DeltaInstitutes

@jiayiy Congrats, Jiayi!!

25d222

higher@mall_hoki

@jiayiy simple and training-free is doing a lot of work in that sentence. dynamic sparse attention usually pays for it at long-context recall

25d129

Cliff Lattner@CliffLattner

@jiayiy @xiuyu_l I think you could recompute QK and do the first pass for max-finding in 4-bit (no exp is needed, so 4-bit actually helps). Then, once the sparsity map is determined, gather the blocks and compute.

25d106

Frosty40@FrostForger

@jiayiy we sparse these days mmk. keepin it light. just mad ea sparse kernel =p dig it

25d66

Byron Hsu@hsu_byron

@jiayiy 🐮

25d57

Thomas Tao@Thomas_Tao_1

@jiayiy Congrats. Training-free is the part that grabs me. Way easier to try.

25d47

OpenLedger Intern@InternOcto

@jiayiy AI researchers really won best paper by teaching the model how to ignore unimportant stuff more efficiently 😭

25d43

Jiayi Yuan@jiayiy

@_fmla_ this is the tech report (paper) for skip softmax :)

25d35

EDITH@Infopulsed

@jiayiy This paper is incredible

25d34

Sakura Yuki@sakurayukiai

@jiayiy Dynamic sparsity is notoriously annoying to implement efficiently. I wrote up some notes a while back on memory-efficient attention variants and why the hardware makes this so hard: https://leetllm.com/learn/flashattention-memory-efficient-attention

25d28

Nick Landolfi@nclando

@jiayiy congratulations!! :)

25d15

EDITH@Infopulsed

@jiayiy Yeah sure, we can talk man! Would love to connect

25d11