LLM training is dominated by compute-heavy ops like MatMuls and attention.
But it also has many memory-heavy ops: norms, activations, residuals, reductions. These mostly move tensors around.
As FP8/NVFP4 make FLOPs cheaper, data movement gets harder to ignore.
Fig: ~1B LLaMA-3 training
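To see why ops like residual adds count as memory-heavy, compare arithmetic intensity (FLOPs per byte moved off-chip) for a matmul and an elementwise add. This is a rough sketch: the 4096 shapes, the 2-byte element size, and the function names are illustrative assumptions, not numbers from the figure.

```python
# Arithmetic intensity = FLOPs per byte moved to or from off-chip memory.
# Shapes and the 2-byte element size are illustrative assumptions.

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C = A @ B, with A (m x k), B (k x n), C (m x n)."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

def residual_add_intensity(m, n, bytes_per_elem=2):
    """FLOPs per byte for y = x + r on m x n tensors."""
    flops = m * n  # one add per element
    bytes_moved = bytes_per_elem * 3 * m * n  # read x and r, write y
    return flops / bytes_moved

mm = matmul_intensity(4096, 4096, 4096)
add = residual_add_intensity(4096, 4096)
print(f"matmul:       {mm:.1f} FLOPs/byte")
print(f"residual add: {add:.3f} FLOPs/byte")
```

Under these assumptions the matmul does thousands of times more work per byte than the add, so the add's runtime is set almost entirely by memory traffic.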
LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels.
CODA reparameterizes these ops so they hide in the matmul’s shadow: they are fused into its epilogue and run before results leave the chip.
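One way to picture this kind of reparameterization, as a sketch only: RMSNorm scales each input row by a scalar, and row scaling commutes with a matmul, so the scaling can move to the output side and the learned gain can fold into the weights. This is a standard algebraic identity used for illustration; the post does not say which transformations CODA actually applies, and all function names below are hypothetical.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv_rms(row, eps=1e-6):
    """Reciprocal root-mean-square of one row."""
    return 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)

def norm_then_matmul(x, gain, w):
    """Baseline: a standalone RMSNorm pass over x, then the matmul."""
    normed = [[v * g * inv_rms(row) for v, g in zip(row, gain)] for row in x]
    return matmul(normed, w)

def matmul_with_norm_epilogue(x, gain, w):
    """Reparameterized: fold gain into W, then scale each output row."""
    # diag(gain) @ W, which can be precomputed once per weight update.
    w_folded = [[g * v for v in w_row] for g, w_row in zip(gain, w)]
    acc = matmul(x, w_folded)
    # "Epilogue": per-row scaling applied to the accumulator before it is stored.
    return [[v * inv_rms(x_row) for v in acc_row] for x_row, acc_row in zip(x, acc)]

# Small made-up inputs, chosen only to exercise the identity.
x = [[1.0, 2.0, 3.0], [-1.0, 0.5, 4.0]]
gain = [0.5, 1.0, 2.0]
w = [[1.0, 0.0], [2.0, -1.0], [0.5, 3.0]]

baseline = norm_then_matmul(x, gain, w)
fused = matmul_with_norm_epilogue(x, gain, w)
max_diff = max(abs(p - q) for r1, r2 in zip(baseline, fused) for p, q in zip(r1, r2))
print("max abs difference:", max_diff)
```

Both paths agree up to floating-point rounding, but the second never writes the normalized activations out as a separate tensor. In a real GPU kernel the per-row statistic would need to be ready by the time the epilogue runs; how CODA arranges that is not covered here.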
Bonus: LLMs can write fast CODA kernels too, approaching speed-of-light (SoL) performance.


