16h ago

CODA reparameterizes memory-bound operations such as norms, RoPE, residuals, and SwiGLU to execute as part of GEMM epilogues in transformer blocks.

CuTeDSL implementation reduces kernel launch overhead on GPUs.



Original post

LLM training is dominated by compute-heavy ops like MatMuls and attention. But it also has many memory-heavy ops: norms, activations, residuals, reductions. These mostly move tensors around. As FP8/NVFP4 make FLOPs cheaper, data movement gets harder to ignore. Fig: ~1B LLaMA-3 training
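To make the compute-heavy versus memory-heavy split concrete, here is a rough sketch of arithmetic intensity (FLOPs per byte moved) for a GEMM and for a residual add. The function names are hypothetical, and the byte counts assume ideal traffic where every operand is read once and every output written once, so treat the numbers as an upper-bound illustration rather than a measurement.

```python
def gemm_intensity(m, n, k, bytes_per_elem):
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n] under ideal traffic."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem  # read A and B, write C
    return flops / bytes_moved


def residual_add_intensity(bytes_per_elem):
    """FLOPs per byte for y = x + r: one add per element, two reads and one write."""
    return 1 / (3 * bytes_per_elem)


if __name__ == "__main__":
    # 2 bytes per element corresponds to BF16/FP16 storage (an assumption for this example).
    print("GEMM 4096^3:", gemm_intensity(4096, 4096, 4096, 2))
    print("residual add:", residual_add_intensity(2))
```

Under these assumptions the square 4096 GEMM performs roughly 1365 FLOPs per byte, while the residual add performs 1/6 FLOP per byte. Faster low-precision math units do nothing for the second number, which is the point the post is making.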

3:25 PM · May 21, 2026

Reposted by

#740 @HANGUO97

#84 @TRI_DAO

QUOTE POST

#84 Tri Dao @TRI_DAO

After some mathematical rewrite, turns out all of transformer is a series of gemm + epilogue. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!

Han Guo @HanGuo97

LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).

10:25 PM · May 21, 2026 · 108.3K Views

1:51 AM · May 22, 2026 · 57.4K Views

POST

#740 Han Guo @HANGUO97

LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels.

CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip.

Bonus: LLMs can write fast CODA kernels too (approaching SoLs).

10:25 PM · May 21, 2026 · 108.3K Views
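One well-known identity of this kind, shown here only as an illustration: RMSNorm followed by a linear layer can be rewritten so the GEMM runs on the raw input and the normalization becomes a per-row scale applied to the accumulator. The thread does not say whether CODA uses exactly this form, and the function names below are hypothetical. The sketch uses plain Python lists in place of GPU tiles.

```python
import math


def matmul(a, b):
    """Plain list-of-lists matrix multiply, standing in for the GEMM main loop."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inv_rms(row, eps):
    """1 / sqrt(mean(row^2) + eps), the per-row RMSNorm statistic."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)


def rmsnorm_then_linear(x, gain, w, eps=1e-6):
    """Unfused reference: a separate memory-bound norm pass, then the GEMM."""
    normed = [[v * inv_rms(row, eps) * g for v, g in zip(row, gain)] for row in x]
    return matmul(normed, w)


def linear_with_norm_epilogue(x, gain, w, eps=1e-6):
    """Reparameterized form: GEMM on raw x, with the norm applied as an epilogue."""
    # Fold the gain into the weight rows once: w_folded[k][n] = gain[k] * w[k][n].
    w_folded = [[g * v for v in w_row] for g, w_row in zip(gain, w)]
    acc = matmul(x, w_folded)
    # Epilogue: scale each output row by 1 / rms of the matching input row.
    return [[inv_rms(row, eps) * v for v in acc_row] for row, acc_row in zip(x, acc)]


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    gain = [0.5, 1.0, 2.0]
    w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(rmsnorm_then_linear(x, gain, w))
    print(linear_with_norm_epilogue(x, gain, w))
```

The two functions agree up to floating-point rounding because the row scale commutes with the matrix product. In a real kernel the scale would be applied while the accumulator is still on chip, so the normalized activations never make a separate round trip through memory.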

#740 Han Guo @HANGUO97

ML frameworks make training code easy to write, but they hide the cost of many small, memory-bound ops.

That is why optimized training/inference stacks often rely on bespoke kernels and manual autograd. But these are hard to write and maintain.

Can we get the best of both?

Han Guo@HanGuo97

10:25 PM · May 21, 2026 · 4.4K Views

10:25 PM · May 21, 2026 · 2.9K Views
