CODA reparameterizes memory-bound operations such as norms, RoPE, residuals, and SwiGLU so that they execute as part of GEMM epilogues in transformer blocks.
The CuTeDSL implementation reduces kernel launch overhead on GPUs.
After some mathematical rewriting, it turns out that the whole transformer is a series of GEMM + epilogue ops. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!
LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels. CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip. Bonus: LLMs can write fast CODA kernels too (approaching SoLs).
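
To make the "hide in the matmul's shadow" idea concrete, here is a minimal pure-Python sketch of one such rewrite: RMSNorm followed by a linear layer. It relies only on the identity RMSNorm(x) W = (1 / rms(x)) * (x (diag(g) W)), so the per-row scale can be applied to the GEMM output instead of in a separate pass over x. This is an illustration of the general principle, not CODA's actual kernels or its CuTeDSL code; the function names, shapes, and `eps` value are assumptions made up for the example.

```python
import math

def matmul(a, b):
    """Plain GEMM on nested lists: (m x k) @ (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inv_rms(row, eps):
    """1 / sqrt(mean(row^2) + eps), the per-row RMSNorm scale."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)

def rmsnorm_then_linear(x, gain, w, eps=1e-6):
    """Reference: a separate normalization pass over x, then the GEMM."""
    normed = [[v * inv_rms(row, eps) * g for v, g in zip(row, gain)] for row in x]
    return matmul(normed, w)

def linear_with_norm_epilogue(x, gain, w, eps=1e-6):
    """Rewritten: fold gain into W once, GEMM on raw x, scale rows afterwards."""
    # diag(gain) @ W can be precomputed, since it does not depend on x.
    w_folded = [[g * wv for wv in w_row] for g, w_row in zip(gain, w)]
    acc = matmul(x, w_folded)
    # "Epilogue": one per-row scale applied to the accumulator.
    return [[inv_rms(row, eps) * v for v in acc_row] for row, acc_row in zip(x, acc)]

# Hypothetical toy inputs: 2 tokens, hidden size 3, output size 2.
x = [[1.0, -2.0, 3.0], [0.5, 4.0, -1.5]]
gain = [1.0, 0.5, 2.0]
w = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.6]]

ref = rmsnorm_then_linear(x, gain, w)
fused = linear_with_norm_epilogue(x, gain, w)
max_abs_diff = max(abs(r - f) for rr, fr in zip(ref, fused) for r, f in zip(rr, fr))
```

Both paths agree up to floating-point rounding. In a real fused kernel the scale would be applied while the accumulator is still on chip; how the row statistics are obtained there is left out of this sketch.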

ML frameworks make training code easy to write, but they hide the cost of many small, memory-bound ops.
That is why optimized training/inference stacks often rely on bespoke kernels and manual autograd. But these are hard to write and maintain.
Can we get the best of both?
LLM training is dominated by compute-heavy ops like MatMuls and attention. But it also has many memory-heavy ops: norms, activations, residuals, reductions. These mostly move tensors around. As FP8/NVFP4 make FLOPs cheaper, data movement gets harder to ignore.
Fig: ~1B LLaMA-3 training
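
A back-of-the-envelope way to see why these ops are memory-bound is arithmetic intensity, i.e. FLOPs per byte moved to or from memory. The sketch below compares a GEMM with a residual add under idealized assumptions: every tensor is read or written exactly once, elements are 2 bytes (BF16/FP16), and the shapes are hypothetical rather than taken from the figure.

```python
def gemm_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) GEMM with ideal memory traffic."""
    flops = 2 * m * n * k                                   # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, read B, write C
    return flops / bytes_moved

def elementwise_intensity(flops_per_elem, tensors_touched, bytes_per_elem=2):
    """FLOPs per byte for an elementwise op touching `tensors_touched` tensors."""
    return flops_per_elem / (tensors_touched * bytes_per_elem)

# Hypothetical square GEMM vs. a residual add (read two tensors, write one).
gemm_ai = gemm_intensity(4096, 4096, 4096)
residual_ai = elementwise_intensity(flops_per_elem=1, tensors_touched=3)

print(f"GEMM:         {gemm_ai:.1f} FLOPs/byte")
print(f"Residual add: {residual_ai:.3f} FLOPs/byte")
```

Under these assumptions the GEMM does over a thousand FLOPs per byte while the residual add does well under one, so the add's runtime is set almost entirely by memory bandwidth. Running it in the GEMM epilogue removes that extra round trip through memory.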