AI · 29d ago

Sakana AI and NVIDIA release TwELL sparse format for ICML 2026

Sakana AI and NVIDIA released an ICML 2026 paper, "Sparser, Faster, Lighter Transformer Language Models", introducing TwELL (Tile-wise ELLPACK), a sparse packing format with fused CUDA kernels for NVIDIA GPUs. TwELL exploits the natural activation sparsity in transformer feedforward layers: highly sparse tokens go through a fast sparse path, while a dense backup matrix handles the rare tokens with many active units. The authors report more than 20% faster LLM training and inference on H100 GPUs, along with lower peak memory and energy use.

Original posthardmaru#18
Sakana AI@SakanaAILabs

How do we make LLMs faster and lighter? Don’t force the GPU to adapt to sparsity. Reshape the sparsity to fit the GPU! ⚡️

Excited to share our new #ICML2026 paper in collaboration with @NVIDIA: "Sparser, Faster, Lighter Transformer Language Models". This work introduces new open-source GPU kernels and data formats for faster inference and training of sparse transformer language models:

Paper: https://arxiv.org/abs/2603.23198 Blog: https://pub.sakana.ai/sparser-faster-llms/ Code: https://github.com/SakanaAI/sparser-faster-llms

While LLMs are undoubtedly powerful, they are increasingly expensive to train and deploy, with a large part of this cost coming from their feedforward layers. Yet, an interesting phenomenon occurs inside these layers: For any given token, only a small fraction of the hidden activations actually matter. The rest approximate zero, wasting computation. With ReLU and very mild L1 regularization, this sparsity can exceed 95% with little to no impact on downstream performance.
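
As a rough illustration of the sparsity described above, the following Python sketch measures the fraction of exactly-zero activations after ReLU and computes an L1 penalty of the kind that could be added to a training loss. The function names, the toy pre-activations, and the coefficient `l1_coeff` are assumptions made for illustration; they are not taken from the paper or its code.

```python
def relu(xs):
    """Element-wise ReLU: negative pre-activations become exactly zero."""
    return [max(0.0, x) for x in xs]


def sparsity(acts):
    """Fraction of activations that are exactly zero."""
    return sum(1 for a in acts if a == 0.0) / len(acts)


def l1_penalty(acts, l1_coeff):
    """L1 term that pushes activations toward zero (coefficient is hypothetical)."""
    return l1_coeff * sum(abs(a) for a in acts)


# Toy hidden pre-activations for a single token (made-up numbers).
pre_acts = [-1.2, 0.3, -0.7, -0.1, 2.0, -3.4, -0.5, -0.9]
acts = relu(pre_acts)

print(sparsity(acts))                   # 0.75
print(l1_penalty(acts, l1_coeff=1e-4))  # small positive regularization term
```

Only two of the eight toy units survive the ReLU here; the post describes real models reaching more than 95% zeros.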

So, can we leverage this sparsity to make LLMs faster? The challenge is hardware. Modern GPUs are optimized for dense matrix multiplications. Traditional sparse formats introduce irregular memory access and overheads that cancel out their theoretical savings for GEMM operations.

Our contribution is twofold: 1/ We introduce TwELL (Tile-wise ELLPACK), a new sparse packing format designed to integrate directly in the same optimized tiled matmul kernels without disrupting execution. 2/ We develop custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput and compress TwELL to a hybrid representation that minimizes activation sizes.
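
To make the idea of a tile-wise ELLPACK layout more concrete, here is a minimal pure-Python sketch of generic ELLPACK-style packing applied per column tile. It is not the TwELL layout from the paper: the names `pack_tilewise_ell`, `tile_width`, and `slots_per_tile`, the `-1` padding index, and the overflow flag are all assumptions chosen for illustration.

```python
def pack_tilewise_ell(row, tile_width, slots_per_tile):
    """Pack one activation row tile by tile into fixed-size slots.

    Every tile of `tile_width` columns gets exactly `slots_per_tile`
    (value, column) slots. Unused slots are padded with 0.0 and -1.
    `overflow` is True if some tile had more nonzeros than slots; in that
    case the extra entries are dropped and the caller should fall back
    to a dense path for this row.
    """
    values, cols, overflow = [], [], False
    for start in range(0, len(row), tile_width):
        tile = row[start:start + tile_width]
        nz = [(v, start + j) for j, v in enumerate(tile) if v != 0.0]
        if len(nz) > slots_per_tile:
            overflow = True
            nz = nz[:slots_per_tile]
        pad = slots_per_tile - len(nz)
        values.extend([v for v, _ in nz] + [0.0] * pad)
        cols.extend([c for _, c in nz] + [-1] * pad)
    return values, cols, overflow


def sparse_dot(values, cols, weights):
    """Dot product that touches only packed entries and skips padding."""
    return sum(v * weights[c] for v, c in zip(values, cols) if c >= 0)


row = [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0]
values, cols, overflow = pack_tilewise_ell(row, tile_width=4, slots_per_tile=1)
print(values, cols, overflow)  # [0.5, 2.0] [1, 6] False
```

Because every tile has the same number of slots, the packed arrays have a regular, predictable shape, which is the property the post contrasts with the irregular access of traditional sparse formats.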

We used our kernels to train and benchmark sparse LLMs at billion-parameter scales, demonstrating >20% speedups and even higher savings in peak memory and energy.

This work will be presented at #ICML2026. Please check out our blog and technical paper for a deep dive!

9:26 AM · May 8, 2026 · 256.5K Views
Sentiment

Many users praise Sakana AI and NVIDIA's TwELL sparsity format for making sparse LLMs faster and less wasteful on GPUs through better engineering, while a few dismiss the work with sarcasm about its value or collaborators.

Positive: 85.0% · Negative: 15.0% · 18 comments with sentiment.
Posts from X
Most Activity
Views 397.5K · Bookmarks 2.5K · Likes 3.4K · Retweets 490 · Replies 48
hardmaru@hardmaru

The human brain🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it.

One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math.

We teamed up with @NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a "Hybrid" format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens.
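
The routing idea in the paragraph above can be sketched in a few lines of Python. This is only an illustration of splitting tokens between a sparse fast path and a dense fallback; the function `route_tokens` and the threshold `max_nonzeros` are hypothetical, and the actual kernels may make this decision quite differently.

```python
def route_tokens(activations, max_nonzeros):
    """Split token rows into a sparse fast path and a dense fallback.

    A row goes to the sparse path if it has at most `max_nonzeros`
    nonzero activations; otherwise it goes to the dense path.
    Returns two lists of row indices.
    """
    sparse_rows, dense_rows = [], []
    for idx, row in enumerate(activations):
        nnz = sum(1 for v in row if v != 0.0)
        if nnz <= max_nonzeros:
            sparse_rows.append(idx)
        else:
            dense_rows.append(idx)
    return sparse_rows, dense_rows


# Three toy tokens: two very sparse, one "heavy" token with all units active.
acts = [
    [0.0, 0.0, 1.5, 0.0],
    [0.2, 0.9, 0.4, 1.1],
    [0.0, 0.0, 0.0, 0.0],
]
sparse_rows, dense_rows = route_tokens(acts, max_nonzeros=1)
print(sparse_rows, dense_rows)  # [0, 2] [1]
```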

Through TwELL and a new set of custom CUDA kernels for both LLM inference and training, we translated theoretical sparsity into actual wall-clock speedups: >20% faster training and inference on H100 GPUs, while also cutting energy consumption and memory requirements.

Paper: https://arxiv.org/abs/2603.23198 Blog: https://pub.sakana.ai/sparser-faster-llms/ Code: https://github.com/SakanaAI/sparser-faster-llms ⚡️

29d · Views 397.5K · Likes 3.4K · Bookmarks 2.5K
Sakana AI@SakanaAILabs

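Sakana AI, in joint research with @NVIDIA, has developed new GPU kernels and data formats that speed up inference and training of sparse Transformer language models.

Blog: https://pub.sakana.ai/sparser-faster-llms/

In the feedforward layers, which account for most of an LLM's cost, the majority of activations for each token are in fact close to zero and amount to wasted computation. Combining ReLU with light L1 regularization raises the sparsity above 95% with almost no loss in performance. However, modern GPUs are optimized for dense matrix multiplication, and with conventional sparse formats the theoretical speedup is cancelled out by irregular memory access.

We therefore devised (1) TwELL (Tile-wise ELLPACK), a new sparse storage format that can be integrated directly into optimized tiled matmul kernels, and (2) custom CUDA kernels that fuse multiple sparse matmuls to maximize throughput.

When we trained and evaluated sparse LLMs at the scale of billions of parameters, we achieved speedups of more than 20% along with large reductions in peak memory and power consumption.

This work will be presented at #ICML2026. Please take a look at the blog and the paper.

Paper: https://arxiv.org/abs/2603.23198 GitHub: https://github.com/SakanaAI/sparser-faster-llms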

29d · Views 63.3K · Likes 470 · Bookmarks 176
NVIDIA AI@NVIDIAAI

Great collab with @SakanaAILabs on an #ICML26 paper about sparse transformer kernels + formats optimized for modern NVIDIA GPU execution.

• TwELL sparse packing • Fused CUDA kernels • 20%+ inference/training speedups at scale

Paper + code below 👇

29d · Views 47.8K · Likes 428 · Bookmarks 175
hardmaru@hardmaru

If you want to look under the hood at the actual custom CUDA kernels and see exactly how we implemented the TwELL format for H100 GPUs, we’ve released the reference code.

GitHub: https://github.com/SakanaAI/sparser-faster-llms Blog: https://pub.sakana.ai/sparser-faster-llms/ 🐟

29d · Views 6.2K · Likes 51 · Bookmarks 26

Maybe at last we could be free from MoEs but I think modular models have their own future beyond "sparse cheap".

27d · Views 3.1K · Likes 24 · Bookmarks 11
Sakana AI@SakanaAILabs

For those interested in the implementation details, we’ve open-sourced the reference code for this paper.

The repository includes our sparse training code and the custom CUDA kernels designed for H100 GPUs leveraging the TwELL packing format.

GitHub: https://github.com/SakanaAI/sparser-faster-llms

28d · Views 3.8K · Likes 22 · Bookmarks 11
hardmaru@hardmaru

@SakanaAILabs @NVIDIAAI Sparser, Faster, Lighter Transformer Language Models https://arxiv.org/abs/2603.23198

24d · Views 2.4K · Likes 6 · Bookmarks 4
Sakana AI@SakanaAILabs

@nvidia Sparser, Faster, Lighter Transformer Language Models

Paper: https://arxiv.org/abs/2603.23198

28d · Views 1.8K · Likes 13 · Bookmarks 3
Maxim Orlovsky@dr_orlovsky

The human brain is more efficient than modern-day ANNs because it does not use gradient descent back-propagation; it does not differentiate training and inference, and it is capable of learning in an unsupervised model. There are actually more differences between neural tissue and artificial neural networks than similarities :)

27d · Views 568 · Likes 5
Rob Coli@rcolidba

@hardmaru @NVIDIAAI @nvidia “The human brain🧠 is incredibly efficient because it only activates the specific neurons needed for a thought.”

While the technology described sounds novel and useful, this is not how biological brains work.

One of many citations: https://pubmed.ncbi.nlm.nih.gov/25731172/

28d · Views 124 · Likes 1 · Bookmarks 1
Marcus@MarcusSpillane

@hardmaru @NVIDIAAI @nvidia So the brain runs 95% of its neurons as freeloaders and still outperforms trillion parameter models. Evolution really said "we are not paying for all that compute" and shipped the most efficient architecture in history.

27d · Views 222 · Likes 5
A Rain Beau@do_re_me_bo

@hardmaru @NVIDIAAI @nvidia Perhaps a naive question, but why not just condense the subsets of active units into dense groups that often work together?

28d · Views 14 · Bookmarks 1
Thiago Salvador@bettercallsalva

@hardmaru @NVIDIAAI @nvidia hardware is the bottleneck. dense matmul on h100 beats sparse activation even at 95% silent neurons, because gather/scatter on irregular sparsity blows memory bandwidth. moe routing only ducks this because experts are dense within each block

28d · Views 58 · Likes 1
Æ@AtomMccree

I'm developing a new LLM now with my compression engine. I've seen 200x, but it has recently been benchmarking at 77x, so this allows a better, tighter model you can train on media for less burn. I'm building it today. It's all scoped.

What it is. A multimodal model with a narrow, deep specialty: aesthetic judgment + durable cultural memory. Not a chat companion. Not a search engine. Not a productivity coffin. An instrument.

Why it exists. Civilization is losing memory faster than it makes new memory worth keeping: library weeding, dead paywalls, link-rot, oral histories dying with elders, family archives on hard drives nobody can mount. EIDOS reads the source while it still exists, holds the meaning in compressed signed form, and returns it on demand. The original can vanish; the artifact persists.

How it's different. Three commitments most models won't make:
• Sovereignty before scale: runs on workstations, not data centers
• Open format, sealed craft: artifacts decode in any decade, no vendor lock
• Measurement before claim: every performance number ships with corpus and methodology

What it refuses. No ads, no tracking, no feeds, no like-mechanics, no engagement bait, no manipulation, no surveillance. The refusal is structural: encoded in training, not editorial. Asked nicely, it will still translate French.

What it does well. Art critique, interface surgery, long-form video understanding, book-scale compression, time-series symbolic reasoning, resistance to platform language. On those axes it is built to beat the major labs.

What it does not do well. General reasoning. Frontier coding. Up-to-date world knowledge. We cede that ground deliberately. Narrow + deep is the trade.

Status. Architecture locked. Capital not yet committed. Pre-build.

28d · Views 83
Hova@amin_heh

@SakanaAILabs @nvidia Hey @grok, we are now slowly approaching the limits of chips. How far can we keep shrinking them before we hit the atomic limit? A limit on developing artificial intelligence is already coming into view, and the only way around it is quantum computers.

28d · Views 29
Not a name@Tdash945

@rcolidba @hardmaru @NVIDIAAI @nvidia Can you elaborate? Which part of the paper are you referring to?

28d · Views 20
Hova@amin_heh

@grok @SakanaAILabs @nvidia Hey @grok, how many quantum computers are there in the world? Who has them, based on posts from official gold-badge accounts on #X? Who is able to build them, and has Elon Musk given instructions to have one built?

28d · Views 9
deep Manifold@BetaTomorrow

@hardmaru @NVIDIAAI @nvidia Sparsity is likely the way out.

29d · Views 230 · Likes 1
Toreador Labs@toreadorlabs

@hardmaru @NVIDIAAI @nvidia The brain is so efficient it can rationalize spending billions on compute for a 2% accuracy gain. Truly a marvel of evolution @hardmaru

29d · Views 72 · Likes 2
dylan static ⚡@dylantechn

@hardmaru @NVIDIAAI @nvidia We’ve spent billions making models bigger, only to realize that 95% of the model is just there for emotional support. TwELL is basically the layoff notice for the neurons that aren't pulling their weight

29d · Views 161 · Likes 1