Sakana AI and NVIDIA release TwELL sparse format for ICML 2026


The human brain 🧠 is incredibly efficient because it activates only the specific neurons needed for a thought. Modern LLMs naturally try to do this too (>95% of neurons in feedforward layers stay silent for any given token), but our hardware punishes them for it.

One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math.

We teamed up with @NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a "Hybrid" format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens.
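To make the "fast path plus dense backup" idea concrete, here is a minimal pure-Python sketch. It is not the TwELL implementation: the function names `split_hybrid` and `hybrid_matmul`, the per-token slot budget `width`, and the `-1` padding marker are all assumptions made for illustration. Tokens whose activation row has at most `width` nonzeros are packed into fixed-width slots. Any heavier token is kept as a full dense row.

```python
def split_hybrid(acts, width):
    """Route each token's activation row to a fixed-width sparse path
    or to a dense backup. `width` is a hypothetical slot budget."""
    sparse_rows = {}  # token index -> (column indices, values), padded to `width`
    dense_rows = {}   # token index -> full dense row (the "safety valve")
    for t, row in enumerate(acts):
        nz = [(j, v) for j, v in enumerate(row) if v != 0.0]
        if len(nz) <= width:
            pad = width - len(nz)
            cols = [j for j, _ in nz] + [-1] * pad  # -1 marks an empty slot
            vals = [v for _, v in nz] + [0.0] * pad
            sparse_rows[t] = (cols, vals)
        else:
            dense_rows[t] = list(row)
    return sparse_rows, dense_rows


def hybrid_matmul(sparse_rows, dense_rows, W, n_tokens):
    """Compute acts @ W, touching only stored nonzeros on the sparse path."""
    n_out = len(W[0])
    out = [[0.0] * n_out for _ in range(n_tokens)]
    for t, (cols, vals) in sparse_rows.items():
        for j, v in zip(cols, vals):
            if j < 0:
                continue  # skip padding slots
            for k in range(n_out):
                out[t][k] += v * W[j][k]
    for t, row in dense_rows.items():
        for k in range(n_out):
            out[t][k] = sum(row[j] * W[j][k] for j in range(len(row)))
    return out


acts = [
    [0.0, 0.0, 2.0, 0.0],  # sparse token: one nonzero
    [1.0, 3.0, 4.0, 5.0],  # heavy token: exceeds the slot budget
    [0.0, 0.0, 0.0, 0.0],  # fully silent token
]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
sparse_rows, dense_rows = split_hybrid(acts, width=2)
result = hybrid_matmul(sparse_rows, dense_rows, W, n_tokens=len(acts))
```

Every row on the sparse path has the same fixed width, which is the property that keeps memory access regular. Only the rare heavy rows pay the dense cost.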

Through TwELL and a new set of custom CUDA kernels for both LLM inference and training, we translated theoretical sparsity into actual wall-clock speedups: >20% faster training and inference on H100 GPUs, while also cutting energy consumption and memory requirements.

Paper: https://arxiv.org/abs/2603.23198 Blog: https://pub.sakana.ai/sparser-faster-llms/ Code: https://github.com/SakanaAI/sparser-faster-llms ⚡️

Sakana AI (@SakanaAILabs)

How do we make LLMs faster and lighter? Don’t force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU! ⚡️

Excited to share our new #ICML2026 paper in collaboration with @NVIDIA: "Sparser, Faster, Lighter Transformer Language Models". This work introduces new open-source GPU kernels and data formats for faster inference and training of sparse transformer language models:

Paper: https://arxiv.org/abs/2603.23198 Blog: https://pub.sakana.ai/sparser-faster-llms/ Code: https://github.com/SakanaAI/sparser-faster-llms

While LLMs are undoubtedly powerful, they are increasingly expensive to train and deploy, with a large part of this cost coming from their feedforward layers. Yet an interesting phenomenon occurs inside these layers: for any given token, only a small fraction of the hidden activations actually matter. The rest are approximately zero, so computing them is wasted work. With ReLU and very mild L1 regularization, this sparsity can exceed 95% with little to no impact on downstream performance.
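As a rough illustration of what "sparsity" and "L1 regularization" mean here, the sketch below applies ReLU to a small set of made-up pre-activations, measures the fraction of exact zeros, and computes an L1 penalty that would be added to the training loss. The helper names, the coefficient `coeff`, and the choice to average the penalty over all entries are assumptions, not details taken from the paper.

```python
def relu(x):
    """ReLU maps every non-positive pre-activation to exactly zero."""
    return x if x > 0.0 else 0.0


def sparsity(acts):
    """Fraction of hidden activations that are exactly zero."""
    total = sum(len(row) for row in acts)
    zeros = sum(1 for row in acts for v in row if v == 0.0)
    return zeros / total


def l1_penalty(acts, coeff):
    """Mean absolute activation scaled by `coeff` (a hypothetical
    hyperparameter); adding it to the loss pushes activations toward zero."""
    total = sum(len(row) for row in acts)
    return coeff * sum(abs(v) for row in acts for v in row) / total


# Two tokens with four hidden units each; values are made up.
pre_acts = [[-1.5, 0.3, -0.2, -4.0], [-0.7, -2.1, -0.1, -3.3]]
hidden = [[relu(v) for v in row] for row in pre_acts]

frac_zero = sparsity(hidden)          # 7 of 8 entries are zero
penalty = l1_penalty(hidden, coeff=0.1)
```

Because ReLU produces exact zeros rather than small values, a kernel can skip those entries without changing the result.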

So, can we leverage this sparsity to make LLMs faster? The challenge is hardware. Modern GPUs are optimized for dense matrix multiplications. Traditional sparse formats introduce irregular memory access and overheads that cancel out their theoretical savings for GEMM operations.

Our contribution is twofold:
1/ We introduce TwELL (Tile-wise ELLPACK), a new sparse packing format designed to integrate directly into the same optimized tiled matmul kernels without disrupting execution.
2/ We develop custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput and compress TwELL to a hybrid representation that minimizes activation sizes.
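The thread does not spell out the TwELL layout, so the following is only a sketch of one plausible reading of "tile-wise ELLPACK": classic ELLPACK (fixed-width slots of column indices and values) applied independently to each column tile of an activation row. The names `pack_tilewise_ell` and `unpack_tilewise_ell`, the `tile` and `slots` parameters, and the `-1` padding marker are assumptions for illustration, not the paper's definitions.

```python
def pack_tilewise_ell(row, tile, slots):
    """Pack one activation row tile by tile.

    Each tile stores up to `slots` nonzeros as (local column, value)
    pairs, padded with -1 / 0.0 so every tile has the same shape.
    """
    packed = []
    for start in range(0, len(row), tile):
        chunk = row[start:start + tile]
        nz = [(j, v) for j, v in enumerate(chunk) if v != 0.0]
        if len(nz) > slots:
            # A real format needs a fallback here (e.g. a dense path).
            raise ValueError("tile has more nonzeros than slots")
        pad = slots - len(nz)
        cols = [j for j, _ in nz] + [-1] * pad
        vals = [v for _, v in nz] + [0.0] * pad
        packed.append((cols, vals))
    return packed


def unpack_tilewise_ell(packed, tile, length):
    """Rebuild the dense row from the packed tiles."""
    row = [0.0] * length
    for t, (cols, vals) in enumerate(packed):
        for j, v in zip(cols, vals):
            if j >= 0:
                row[t * tile + j] = v
    return row


row = [0.0, 0.0, 3.0, 0.0,  0.0, 0.0, 0.0, 0.0,  5.0, 0.0, 0.0, 7.0]
packed = pack_tilewise_ell(row, tile=4, slots=2)
restored = unpack_tilewise_ell(packed, tile=4, length=len(row))
```

Storing column indices relative to the tile keeps each packed tile aligned with the tile of the weight matrix it multiplies, which is what would let it slot into a tiled matmul loop. The overflow case in the sketch is where a dense fallback would be needed.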

We used our kernels to train and benchmark sparse LLMs at billion-parameter scales, demonstrating >20% speedups and even higher savings in peak memory and energy.

This work will be presented at #ICML2026. Please check out our blog and technical paper for a deep dive!


Sakana AI (@SakanaAILabs)

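In joint research with @NVIDIA, Sakana AI has developed new GPU kernels and data formats that speed up inference and training of sparse Transformer language models.

Blog: https://pub.sakana.ai/sparser-faster-llms/

In the feedforward layers, which account for a large part of an LLM's cost, most activations for each token are in fact nearly zero and amount to wasted computation. Combining ReLU with mild L1 regularization can raise sparsity above 95% with almost no loss in performance. Modern GPUs, however, are optimized for dense matrix multiplication, and with traditional sparse formats the theoretical speedup is cancelled out by irregular memory access.

So we devised (1) TwELL (Tile-wise ELLPACK), a new sparse storage format that can be integrated directly into optimized tiled matmul kernels, and (2) custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput.

When we trained and evaluated sparse LLMs at the multi-billion-parameter scale, we achieved speedups of more than 20% along with substantial reductions in peak memory and power consumption.

This research will be presented at #ICML2026. Please take a look at the blog and the paper.

Paper: https://arxiv.org/abs/2603.23198 GitHub: https://github.com/SakanaAI/sparser-faster-llms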