PingPong GEMM Overlaps Epilogues With Tensor Cores To Boost Efficiency
Pattern 1: GEMM + residual + RMSNorm + GEMM.
RMSNorm looks like a standalone memory-bound op. But CODA reparameterizes it to compute partial norm statistics in the first GEMM epilogue, run a small reduction, then apply scaling in the second GEMM epilogue.
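To make the algebra concrete, here is a minimal NumPy sketch of one way such a reparameterization can work. It is not CODA's actual kernel: the tile width `TILE`, the shapes, and the choice to fold the RMSNorm weight `gamma` into the second weight matrix are all assumptions made for illustration. The point is only that the per-row scale commutes with the second matmul, so it can be applied after it.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, D, N = 8, 16, 32, 12   # tokens, input dim, hidden dim, output dim (illustrative)
TILE = 8                     # hypothetical column-tile width of the first GEMM
EPS = 1e-6

x = rng.standard_normal((M, K))
W1 = rng.standard_normal((K, D))
residual = rng.standard_normal((M, D))
gamma = rng.standard_normal(D)          # RMSNorm weight
W2 = rng.standard_normal((D, N))

def reference():
    # Unfused: GEMM, residual add, standalone RMSNorm, GEMM.
    h = x @ W1 + residual
    rms = np.sqrt(np.mean(h * h, axis=1, keepdims=True) + EPS)
    return (h / rms * gamma) @ W2

def reparameterized():
    h = np.empty((M, D))
    partial = np.empty((M, D // TILE))
    for t in range(D // TILE):
        cols = slice(t * TILE, (t + 1) * TILE)
        # "Epilogue" of GEMM 1: add the residual and emit a partial
        # sum of squares per row for this column tile.
        tile = x @ W1[:, cols] + residual[:, cols]
        h[:, cols] = tile
        partial[:, t] = np.sum(tile * tile, axis=1)
    # Small auxiliary reduction over the per-tile partials.
    inv_rms = 1.0 / np.sqrt(partial.sum(axis=1, keepdims=True) / D + EPS)
    # Assumption: gamma is folded into W2 ahead of time, so the
    # "epilogue" of GEMM 2 only applies the per-row scale.
    W2_scaled = gamma[:, None] * W2
    return (h @ W2_scaled) * inv_rms

assert np.allclose(reference(), reparameterized())
```

The identity behind it: (h / rms * gamma) @ W2 equals (h @ (diag(gamma) W2)) / rms, because rms is one scalar per row.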
Surprisingly, with the right algebraic reparameterization, almost the entire standard Transformer forward + backward pass can be expressed as:
(1) GEMM + epilogues (2) attention (3) a few cheap auxiliary reductions
Speaking of performance: a MatMul/GEMM has a mainloop and an epilogue, the final step before results are written to memory. Epilogues are IO-efficient because data stays on-chip, and with the right design, like PingPong GEMM, they can overlap with tensor-core work. The catch is locality: an epilogue only sees the output tile it has just computed.
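A toy NumPy model of the mainloop/epilogue split, to show what "locality" means here. The function `tiled_gemm`, its tile sizes, and the `bias_relu` epilogue are hypothetical names for this sketch; it runs sequentially and does not model PingPong-style overlap with tensor cores.

```python
import numpy as np

def tiled_gemm(A, B, epilogue, tile_m=4, tile_n=4, tile_k=4):
    """Compute epilogue(A @ B) one output tile at a time."""
    M, K = A.shape
    _, N = B.shape
    C = np.empty((M, N))
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            acc = np.zeros((min(tile_m, M - i), min(tile_n, N - j)))
            # Mainloop: accumulate partial products over K.
            for k in range(0, K, tile_k):
                acc += A[i:i + tile_m, k:k + tile_k] @ B[k:k + tile_k, j:j + tile_n]
            # Epilogue: sees only this accumulator tile and its coordinates,
            # then the result is written out.
            C[i:i + tile_m, j:j + tile_n] = epilogue(acc, i, j)
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 12))
B = rng.standard_normal((12, 8))
bias = rng.standard_normal(8)

def bias_relu(acc, row0, col0):
    # Elementwise work fits in an epilogue: it needs only the local tile.
    return np.maximum(acc + bias[col0:col0 + acc.shape[1]], 0.0)

out = tiled_gemm(A, B, bias_relu)
assert np.allclose(out, np.maximum(A @ B + bias, 0.0))
```

A row-wise norm, by contrast, needs values from every column tile of a row, which a single epilogue call cannot see. That is the gap the reparameterizations in this thread work around.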
Similar patterns apply to the backward pass. If the forward pass consists of a sequence of GEMM-Epilogue blocks (and ends with a GEMM), the backward pass does as well.
In practice, RMSNorm backward needs one extra algebraic trick.

Pattern 2: GEMM + pairwise transforms.
Think RoPE, SwiGLU, and their backward passes. With the right layout, CODA processes them directly in epilogues.
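As a rough illustration of "the right layout", the NumPy sketch below interleaves the gate and up projection columns so that each output tile already holds complete (gate, up) pairs. The interleaving scheme, `W_pair`, and `TILE` are assumptions for this example, not necessarily the layout CODA uses. RoPE has the same pairwise shape, with rotated pairs in place of gate/up pairs.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
M, K, F = 4, 8, 6    # tokens, input dim, FFN dim (illustrative)
TILE = 4             # even, so a (gate, up) pair never straddles two tiles

x = rng.standard_normal((M, K))
W_gate = rng.standard_normal((K, F))
W_up = rng.standard_normal((K, F))

# Unfused reference: two GEMMs, then a separate elementwise kernel.
reference = silu(x @ W_gate) * (x @ W_up)

# Assumed layout: column 2j holds gate_j, column 2j + 1 holds up_j.
W_pair = np.empty((K, 2 * F))
W_pair[:, 0::2] = W_gate
W_pair[:, 1::2] = W_up

fused = np.empty((M, F))
for c in range(0, 2 * F, TILE):
    acc = x @ W_pair[:, c:c + TILE]        # one output tile of a single GEMM
    gate, up = acc[:, 0::2], acc[:, 1::2]  # both halves of each pair are local
    fused[:, c // 2:(c + TILE) // 2] = silu(gate) * up  # "epilogue"

assert np.allclose(fused, reference)
```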
Pattern 3: LM head + cross entropy.
Compute logits, then extract target logits and LSE in epilogues, as in Cut Cross Entropy.
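A minimal NumPy sketch of the idea, under stated assumptions: the vocabulary is processed in tiles of hypothetical width `TILE_V`, and each tile's "epilogue" picks out the target logit if it falls in that tile and updates a running log-sum-exp. This is a sequential illustration of the math, not the Cut Cross Entropy or CODA implementation; a real kernel would more likely emit per-tile partials and combine them in a small reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, V = 5, 8, 20   # tokens, hidden dim, vocab size (illustrative)
TILE_V = 6           # deliberately does not divide V

h = rng.standard_normal((M, D))
W = rng.standard_normal((D, V))
targets = rng.integers(0, V, size=M)

def reference_loss():
    logits = h @ W                        # full (M, V) logits materialized
    m = logits.max(axis=1)
    lse = m + np.log(np.exp(logits - m[:, None]).sum(axis=1))
    return lse - logits[np.arange(M), targets]

def tiled_loss():
    run_max = np.full(M, -np.inf)
    run_sum = np.zeros(M)
    target_logit = np.zeros(M)
    for v0 in range(0, V, TILE_V):
        tile = h @ W[:, v0:v0 + TILE_V]   # one tile of logits, never stored
        width = tile.shape[1]
        # Epilogue part 1: grab the target logit if it lives in this tile.
        rows = np.nonzero((targets >= v0) & (targets < v0 + width))[0]
        target_logit[rows] = tile[rows, targets[rows] - v0]
        # Epilogue part 2: numerically stable running log-sum-exp.
        new_max = np.maximum(run_max, tile.max(axis=1))
        run_sum = run_sum * np.exp(run_max - new_max) \
            + np.exp(tile - new_max[:, None]).sum(axis=1)
        run_max = new_max
    return run_max + np.log(run_sum) - target_logit

assert np.allclose(reference_loss(), tiled_loss())
```

Only two numbers per token (target logit and LSE) leave the loop, instead of the full row of logits.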

Because epilogues are highly structured, CODA can start from an optimized GEMM template (QuACK PingPong) and compose a small set of fast primitives.
That turns out to be just enough structure for LLMs to write high-performance CuTeDSL kernels.
Finally, huge thanks to the incredible team: @jcz42, Arjun, Driss, @tensorcore, @yoonrkim, and @tri_dao!
PDF: https://arxiv.org/abs/2605.19269
Code: https://github.com/HanGuo97/coda-kernels



