CODA reparameterizes memory-bound operations such as norms, RoPE, residuals, and SwiGLU to execute as part of GEMM epilogues in transformer blocks

CuTeDSL implementation reduces kernel launch overhead on GPUs.

Original post

LLM training is dominated by compute-heavy ops like MatMuls and attention. But it also has many memory-heavy ops: norms, activations, residuals, reductions. These mostly move tensors around. As FP8/NVFP4 make FLOPs cheaper, data movement gets harder to ignore. Fig: ~1B LLaMA-3 training

3:25 PM · May 21, 2026
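
The gap between compute-heavy and memory-heavy ops in the post above can be made concrete with a rough arithmetic-intensity estimate (FLOPs per byte moved to or from off-chip memory). This is only a back-of-the-envelope sketch: the shapes, the 2-byte element width, and the helper names `gemm_intensity` and `elementwise_intensity` are illustrative assumptions, not figures from the post.

```python
# Rough arithmetic intensity: FLOPs performed per byte moved to/from memory.
# All shapes and byte widths here are illustrative assumptions.

def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int) -> float:
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each matrix touched once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def elementwise_intensity(flops_per_elem: int, bytes_per_elem: int) -> float:
    """FLOPs per byte for an op that reads one tensor and writes one tensor."""
    return flops_per_elem / (2 * bytes_per_elem)


gemm = gemm_intensity(4096, 4096, 4096, bytes_per_elem=2)
activation_like = elementwise_intensity(flops_per_elem=1, bytes_per_elem=2)

print(f"GEMM:        {gemm:8.2f} FLOPs/byte")
print(f"elementwise: {activation_like:8.2f} FLOPs/byte")
```

Under these assumptions the GEMM does over a thousand FLOPs per byte while the elementwise op does a fraction of one, so making FLOPs cheaper speeds up the former far more than the latter.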
Reposted by

After some mathematical rewrite, turns out all of transformer is a series of gemm + epilogue. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!

Han Guo @HanGuo97

LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).

10:25 PM · May 21, 2026 · 104.8K Views
1:51 AM · May 22, 2026 · 55.4K Views
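
To illustrate what "reparameterizing" a memory-bound op into a GEMM epilogue can look like, here is a minimal sketch using one well-known identity: because RMSNorm scales each row by a scalar, `RMSNorm(x) @ W` equals `(x @ (diag(gain) @ W))` with each output row multiplied by `1 / rms(x_row)`. This is not necessarily how CODA formulates it; the function names are hypothetical, and plain Python lists stand in for GPU tiles.

```python
import math
import random


def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inv_rms(row, eps=1e-6):
    """Reciprocal root-mean-square of one row."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)


def norm_then_gemm(x, gain, w):
    """Baseline: a separate RMSNorm pass over x, then the GEMM."""
    normed = [[v * g * inv_rms(row) for v, g in zip(row, gain)] for row in x]
    return matmul(normed, w)


def gemm_with_norm_epilogue(x, gain, w):
    """Reparameterized: fold gain into W, then scale output rows afterwards."""
    w_folded = [[g * v for v in w_row] for g, w_row in zip(gain, w)]  # diag(gain) @ W
    acc = matmul(x, w_folded)  # GEMM runs on the raw, un-normalized input
    # "Epilogue": per-row scale applied to the accumulator. In a fused kernel
    # this would happen before the tile is written out; here it is a Python loop.
    return [[inv_rms(x_row) * v for v in acc_row] for x_row, acc_row in zip(x, acc)]


random.seed(0)
x = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
gain = [random.uniform(0.5, 1.5) for _ in range(8)]
w = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(8)]

ref = norm_then_gemm(x, gain, w)
fused = gemm_with_norm_epilogue(x, gain, w)
max_err = max(abs(a - b) for r, f in zip(ref, fused) for a, b in zip(r, f))
print("outputs match:", max_err < 1e-9)
```

The point of the rewrite is that the normalized activation tensor is never materialized: the only extra work is a per-row scalar applied to the GEMM result.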

ML frameworks make training code easy to write, but they hide the cost of many small, memory-bound ops.

That is why optimized training/inference stacks often rely on bespoke kernels and manual autograd. But these are hard to write and maintain.

Can we get the best of both?

10:25 PM · May 21, 2026 · 4.3K Views
10:25 PM · May 21, 2026 · 2.9K Views