
Sakana AI Proposes DiffusionBlocks for Memory-Efficient Block-Wise Network Training


This is a wildly cool paper and, as always, a really novel approach from Sakana. Digging into the specifics, though, my honest take is that I’m not certain about the practical implications in its current state, even though it has opened a new frontier of things to explore. Compared to practical engineering tricks like activation offloading, fusing the backward pass with the parameter update, gradient checkpointing, and the various parallelism options, it’s not clear that this becomes a preferable choice for memory savings. The damning part seems to be that even though layers can be trained independently, their updates are still sequential, since each layer needs the output from the previous one. One thought was that this might be desirable for decentralized distributed training, but one would still need to send activations over the network, which is often one of the heaviest parts. In a way, it almost feels like an alternative to pipeline parallelism, or a different flavor of it. I could be missing something, and this is not a criticism of the work, but it is worth a realistic framing against practical baselines.

4:28 PM · May 28, 2026

@jm_alexia Yeah mostly on the same page

Ethan @torchcompiled


11:28 PM · May 28, 2026 · 3.7K Views
12:20 AM · May 29, 2026 · 1.5K Views