Goodbye top-k in hierarchical attention!
We devised DashAttention, which is adaptively sparse (compute is allocated based on the information structure of the query) and end-to-end differentiable.
DashAttention pushes the accuracy–efficieny frontier over NSA and InfLLMv2!
[1/n] Can a model learn *where* and *how much* information it should attend to, and do so efficiently?
We introduce DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention! This pushes the accuracy-efficiency frontier in LLMs.
