Sakana AI releases DiffusionBlocks to train neural networks one block at a time, cutting training memory by up to 8x

Views: 675.3K · Bookmarks: 3.7K · Likes: 5.2K · Retweets: 588

For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall.

We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal.

This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (https://arxiv.org/abs/2506.14202), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.

Sakana AI@SakanaAILabs

Introducing DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

http://pub.sakana.ai/diffusionblocks

What if we didn’t have to hold an entire neural network in memory to train it?

Standard neural net training optimizes all parameters jointly. As a result, the memory required during training grows linearly with the depth of the network.

In our #ICLR2026 paper, we propose DiffusionBlocks, a principled framework to train networks one block at a time, drastically reducing memory requirements while matching end-to-end performance.

With DiffusionBlocks, we split the network into blocks and train them one at a time, so you only need memory for a single block.
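To make the memory claim concrete, here is a rough back-of-the-envelope sketch. It assumes activation memory is dominated by one hidden-state tensor kept per layer for the backward pass, and it ignores parameters, gradients, and optimizer state. The function names and the example shapes are illustrative assumptions, not figures from the paper.

```python
def activation_bytes(num_layers, batch_size, seq_len, hidden_dim, bytes_per_value=4):
    """Bytes needed to keep one hidden-state tensor per layer for the backward pass."""
    return num_layers * batch_size * seq_len * hidden_dim * bytes_per_value


def compare(num_layers, num_blocks, **shape):
    """Return (end_to_end_bytes, single_block_bytes) for an even split into blocks."""
    if num_layers % num_blocks != 0:
        raise ValueError("num_layers must be divisible by num_blocks")
    end_to_end = activation_bytes(num_layers, **shape)
    single_block = activation_bytes(num_layers // num_blocks, **shape)
    return end_to_end, single_block


# Hypothetical 24-layer model split into 4 blocks of 6 layers each.
full, block = compare(24, 4, batch_size=8, seq_len=1024, hidden_dim=1024)
print(f"end-to-end: {full / 2**20:.0f} MiB, one block: {block / 2**20:.0f} MiB")
```

Under these simplified assumptions the end-to-end figure grows linearly with `num_layers`, while the block-wise figure depends only on the layers in one block.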

How? We explicitly assign each block a role: to move the representation a little closer to the target than the block before it did. That role turns out to be precisely what a diffusion model does, step by step. Each block only needs to optimize its own objective and can be trained independently.
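The idea of giving every block its own denoising objective can be sketched on a toy 1-D problem. This is not the authors' implementation (see the GitHub repository for that). The noise ranges, the linear denoiser, the least-squares fit standing in for gradient descent, and the Euler-style update are all assumptions chosen for illustration, and every name below is hypothetical.

```python
import random


def make_noise_ranges(sigma_min, sigma_max, num_blocks):
    """Split [sigma_min, sigma_max] geometrically; returns (low, high), highest noise first."""
    ratio = (sigma_max / sigma_min) ** (1.0 / num_blocks)
    edges = [sigma_max / ratio ** i for i in range(num_blocks + 1)]
    return [(edges[i + 1], edges[i]) for i in range(num_blocks)]


def fit_block(data, low, high, rng, num_samples=4000):
    """Fit one block's linear denoiser D(z) = w*z + c using only its own noise range.

    Objective (assumed): mean squared error between D(y + sigma*eps) and y,
    minimized in closed form. No other block is involved.
    """
    zs, ys = [], []
    for _ in range(num_samples):
        y = rng.choice(data)
        sigma = rng.uniform(low, high)
        zs.append(y + sigma * rng.gauss(0.0, 1.0))
        ys.append(y)
    mean_z = sum(zs) / num_samples
    mean_y = sum(ys) / num_samples
    cov = sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys)) / num_samples
    var = sum((z - mean_z) ** 2 for z in zs) / num_samples
    w = cov / var
    return w, mean_y - w * mean_z


def generate(blocks, ranges, rng):
    """Chain the independently fitted blocks from the highest noise level down."""
    z = ranges[0][1] * rng.gauss(0.0, 1.0)  # start from noise at sigma_max
    for i, ((w, c), (low, high)) in enumerate(zip(blocks, ranges)):
        denoised = w * z + c
        sigma_next = 0.0 if i == len(blocks) - 1 else low
        # One Euler step of dz/dsigma = (z - D(z)) / sigma, from `high` to `sigma_next`.
        z = z + (sigma_next - high) / high * (z - denoised)
    return z


rng = random.Random(0)
data = [rng.gauss(2.0, 0.5) for _ in range(500)]  # toy 1-D targets
ranges = make_noise_ranges(0.05, 8.0, num_blocks=4)
blocks = [fit_block(data, low, high, rng) for low, high in ranges]  # fitted one at a time
samples = [generate(blocks, ranges, rng) for _ in range(2000)]
print("data mean:", sum(data) / len(data))
print("sample mean:", sum(samples) / len(samples))
```

Each call to `fit_block` touches only one block's parameters, which is the property that lets memory scale with a single block. With only four coarse steps and a zero-mean starting point, the toy samples carry some bias toward the start; more blocks or finer steps should shrink it.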

We validated this across five different architectures:

• ViT
• DiT
• Masked diffusion
• Autoregressive transformers
• Recurrent-depth transformers

In each case, performance is competitive with end-to-end training while using a fraction of the memory.

This perspective also extends naturally to recurrent-depth (Looped) transformers, which apply the same network iteratively and normally require expensive backpropagation through time (BPTT). Viewed through DiffusionBlocks, we can replace those multiple iterations with a single forward pass during training.

Read our paper and code to learn more.

Paper: https://arxiv.org/abs/2506.14202
GitHub: https://github.com/SakanaAI/DiffusionBlocks 🐟

Replies: 165

Elon Musk@elonmusk

@hardmaru Interesting

Sakana AI@SakanaAILabs

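We have developed a framework for training neural networks block by block

Blog: http://pub.sakana.ai/diffusionblocks

Training a neural network normally requires handling the entire network at once, so deeper models need more memory. This memory consumption has become a major constraint on the recent scaling of AI models.

Sakana AI proposes DiffusionBlocks, a training framework that eases this constraint (accepted at #ICLR2026). By splitting the network into blocks and making each one trainable independently, it reduces the memory needed during training to that of a single block.

The central idea is to give each block an explicit role: to move the representation slightly closer to the target than the previous block did. This role corresponds to what diffusion models, which have been highly successful in recent years, do step by step along the time axis. Building on this correspondence lets each block be trained independently under a principled objective.

We validated the approach on five architectures spanning image classification, image generation, and text generation (ViT, DiT, Masked Diffusion, AR Transformer, Recurrent-depth Transformer), and in every case confirmed performance comparable to end-to-end training.

The framework also extends naturally to Recurrent-depth Transformers, which apply the same layers repeatedly. We show that they can be trained efficiently in a single forward pass, without the backpropagation through time that is normally required.

Making it possible to train large-scale AI models with fewer computational resources is one of the themes Sakana AI continues to work on. We hope this research is a step in that direction.

Paper: https://arxiv.org/abs/2506.14202

This research was conducted jointly with Masanori Koyama of the University of Tokyo.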

Sander Dieleman@sedielem

Diffusion models are recurrent neural networks🧐

https://sander.ai/2023/07/20/perspectives.html#rnn

Furong Huang@furongh

ResNets walked so DiffusionBlocks could denoise 😄

Block-wise training did not disappear — it just went off, got a diffusion-era glow-up, and came back cooler.

This is especially fun to see because in our ancient ICML 2018 BoostResNet paper, we studied a related idea: learning deep ResNet blocks sequentially, one block at a time, through the lens of boosting theory. Each block is not just another layer; it is a stepwise improvement to the model.

It is exciting to see this “build the model block by block” instinct showing up again in modern generative models. Old ideas, new objectives, much bigger compute — and suddenly the blocks are back on tour.

With @jordan_t_ash, @JohnCLangford, and Robert Schapire: https://arxiv.org/abs/1706.04964

Alexia Jolicoeur-Martineau@jm_alexia

Cool work, but the K-times reduction in memory comes with K-times longer training (or K times the batch size, which overrides the memory cost reduction). Still, very neat.
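The arithmetic behind this reply can be written out as a small sketch. It encodes only the commenter's assumption that each step updates one of K blocks and that every block needs as many updates as end-to-end training takes steps. It is not a measured result and says nothing about per-step compute cost; the function name and numbers are hypothetical.

```python
def blockwise_tradeoff(num_blocks, e2e_steps, e2e_memory_gb, batch_scale=1):
    """Steps and peak memory under the assumption stated in the reply above.

    batch_scale > 1 models enlarging the batch to cut the step count back down,
    which raises memory by the same factor.
    """
    steps = num_blocks * e2e_steps // batch_scale
    memory_gb = e2e_memory_gb / num_blocks * batch_scale
    return steps, memory_gb


# K = 4 blocks, 1000 end-to-end steps, 16 GB end-to-end peak memory (made-up numbers).
print(blockwise_tradeoff(4, 1000, 16.0))                 # fewer GB, more steps
print(blockwise_tradeoff(4, 1000, 16.0, batch_scale=4))  # same steps, same GB
```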

Anirudh Goyal@anirudhg9119

Variational Walkback: Learning Transition Operator as a Stochastic Recurrent Net

https://arxiv.org/abs/1711.02282

Start at a real data point, deliberately walk away from it by applying the model with increasing noise, then train the same transition operator to walk back toward the data.

François Fleuret@francoisfleuret

@hardmaru Dude you really do cool things, seriously.

BURKOV@burkov

@hardmaru Please use the CC-BY license when you submit to arXiv. Otherwise, no scientific/AI media can use any part of your paper in their articles.

Benhao Huang@huskydogewoof

@SakanaAILabs Great work! Reminds me of deep supervision in HRM, or segmented online training.

Robert Dionne@robertsdionne

@SakanaAILabs https://arxiv.org/abs/2503.24322

hardmaru@hardmaru

🧱 DiffusionBlocks: Training Neural Networks One Block at a Time https://pub.sakana.ai/diffusionblocks

Mark@yieldthought

@hardmaru An end to end-to-end! https://arxiv.org/abs/1905.11786 Fave paper of mine back in the day!

Alex UGift@Radipdegen

@SakanaAILabs wait does this mean we can train bigger models on consumer GPUs without getting oom'd

Adamanthys@Adamanthys

@Radipdegen @SakanaAILabs Been a thing for a while: https://github.com/zhuhanqing/APOLLO

Tony Wang@TonyW

Even though many labs have stopped publishing, it’s great to see frontier research is alive and well. This one 👇🏼 from @hardmaru at @SakanaAILabs

Sakana AI@SakanaAILabs

DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation https://arxiv.org/abs/2506.14202

rohit@krishnanrohit

This is really really cool!

Phil Trubey@PTrubey

@SakanaAILabs In practice, what does reduced memory buy you? Faster training? Better sample efficiency? Allows training with cheaper chips? This is only applicable to training?
