PingPong GEMM Overlaps Epilogues With Tensor Cores To Boost Efficiency

Original post

Speaking of performance: a MatMul/GEMM has a mainloop and an epilogue, the final step before results are written to memory. Epilogues are IO-efficient because data stays on-chip, and with the right design, like PingPong GEMM, they can overlap with tensor-core work. The catch: locality.

3:25 PM · May 21, 2026

Han Guo @HanGuo97

Pattern 1: GEMM + residual + RMSNorm + GEMM.

RMSNorm looks like a standalone memory-bound op. But CODA reparameterizes it to compute partial norm statistics in the first GEMM epilogue, run a small reduction, then apply scaling in the second GEMM epilogue.
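The reparameterization described above can be sketched in NumPy. This is a minimal illustration, not CODA's kernel code: the shapes, the tile size `TILE`, the variable names, and the choice to fold the RMSNorm weight `gamma` into the second weight matrix are all assumptions made for the example. The identity it relies on is that a per-row scale commutes with a right matrix multiply, so the normalization factor can be applied after the second GEMM instead of before it.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N, N2, TILE = 8, 16, 32, 12, 8   # N is split into N // TILE column tiles (assumed sizes)
eps = 1e-6

x = rng.standard_normal((M, K))
w1 = rng.standard_normal((K, N))
res = rng.standard_normal((M, N))      # residual stream
gamma = rng.standard_normal(N)         # RMSNorm weight
w2 = rng.standard_normal((N, N2))

# Reference: GEMM -> residual add -> standalone RMSNorm -> GEMM.
h_ref = x @ w1 + res
rms = np.sqrt(np.mean(h_ref**2, axis=1, keepdims=True) + eps)
out_ref = (h_ref / rms * gamma) @ w2

# Reparameterized version.
# "Epilogue" of GEMM 1: add the residual and emit per-tile partial sums of squares.
h = np.empty((M, N))
partial = np.empty((M, N // TILE))
for t in range(N // TILE):
    cols = slice(t * TILE, (t + 1) * TILE)
    tile = x @ w1[:, cols] + res[:, cols]
    h[:, cols] = tile
    partial[:, t] = np.sum(tile**2, axis=1)

# Small reduction across tiles: one inverse-RMS scalar per row.
inv_rms = 1.0 / np.sqrt(partial.sum(axis=1) / N + eps)

# GEMM 2 with gamma folded into the weights; "epilogue" scales each output row.
out = (h @ (gamma[:, None] * w2)) * inv_rms[:, None]

assert np.allclose(out, out_ref)
```

In this sketch the only extra memory traffic beyond the two GEMMs is the small `partial` array and the per-row `inv_rms` vector.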

10:25 PM · May 21, 2026 · 165 Views

Han Guo @HanGuo97

Surprisingly, with the right algebraic reparameterization, almost the entire standard Transformer forward + backward pass can be expressed as:

(1) GEMM + epilogues (2) attention (3) a few cheap auxiliary reductions

10:25 PM · May 21, 2026 · 163 Views

Han Guo @HanGuo97

Similar patterns apply to the backward pass. If the forward pass consists of a sequence of GEMM-Epilogue blocks (and ends with a GEMM), the backward pass does as well.

In practice, RMSNorm backward needs one extra algebraic trick.
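A toy example of the GEMM-epilogue structure carrying over to the backward pass, assuming a forward pass of GEMM, elementwise `tanh` epilogue, then a final GEMM. The layer sizes and the choice of `tanh` are illustrative assumptions, and this sketch does not cover the extra trick needed for RMSNorm backward. Each backward step is either a GEMM or an elementwise operation on a GEMM output, and the result is checked against finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))
w1 = rng.standard_normal((6, 5))
w2 = rng.standard_normal((5, 3))
dy = rng.standard_normal((4, 3))   # upstream gradient of the loss w.r.t. y

def forward(x):
    z = x @ w1            # GEMM
    a = np.tanh(z)        # epilogue (elementwise)
    return a, a @ w2      # final GEMM

a, y = forward(x)

# Backward pass: again GEMMs with an elementwise epilogue in between.
da = dy @ w2.T                  # GEMM
dz = da * (1.0 - a**2)          # epilogue: tanh'(z) = 1 - tanh(z)^2
dx = dz @ w1.T                  # final GEMM
dw1 = x.T @ dz                  # weight-gradient GEMM
dw2 = a.T @ dy                  # weight-gradient GEMM

# Central finite-difference check of dx for the scalar loss sum(y * dy).
step = 1e-6
dx_num = np.zeros_like(x)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        xp = x.copy()
        xp[i, j] += step
        xm = x.copy()
        xm[i, j] -= step
        loss_p = np.sum(forward(xp)[1] * dy)
        loss_m = np.sum(forward(xm)[1] * dy)
        dx_num[i, j] = (loss_p - loss_m) / (2 * step)

assert np.allclose(dx, dx_num, atol=1e-5)
```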

10:25 PM · May 21, 2026 · 178 Views

Han Guo @HanGuo97

Pattern 2: GEMM + pairwise transforms.

Think RoPE, SwiGLU, and their backward passes. With the right layout, CODA processes them directly in epilogues.
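One way to read "the right layout" for SwiGLU, sketched in NumPy: interleave the gate and up projection columns so that each gate value sits next to its matching up value in the GEMM output, which lets an epilogue combine them without a second pass over memory. The interleaving scheme, shapes, and names here are assumptions for illustration; the thread does not specify the layout CODA actually uses.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
M, K, H = 4, 8, 6                      # assumed toy sizes
x = rng.standard_normal((M, K))
w_gate = rng.standard_normal((K, H))
w_up = rng.standard_normal((K, H))

# Reference: two GEMMs followed by a separate elementwise kernel.
ref = silu(x @ w_gate) * (x @ w_up)

# Interleave columns so gate_j and up_j are adjacent: [g0, u0, g1, u1, ...].
w_fused = np.empty((K, 2 * H))
w_fused[:, 0::2] = w_gate
w_fused[:, 1::2] = w_up

# One GEMM; the "epilogue" combines each adjacent (gate, up) pair of the accumulator.
acc = x @ w_fused
out = silu(acc[:, 0::2]) * acc[:, 1::2]

assert np.allclose(out, ref)
```

The same pairing idea applies to RoPE, where each rotation acts on a pair of features, provided the two members of a pair land in the same output tile.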

Pattern 3: LM head + cross entropy.

Compute logits, then extract target logits and LSE in epilogues, as in Cut Cross Entropy.
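A minimal NumPy sketch of the idea, assuming the vocabulary is processed in column tiles and the log-sum-exp (LSE) is accumulated with a running max and running sum. The tile size, the names, and the running-reduction scheme are assumptions for illustration, not the Cut Cross Entropy or CODA implementation. The point is that the full logits matrix is never needed at once: each tile contributes to the LSE and, where applicable, supplies the target logit.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V, TILE = 5, 16, 64, 16          # tokens, hidden dim, vocab, vocab tile (assumed)
h = rng.standard_normal((T, D))
w = rng.standard_normal((D, V))
targets = rng.integers(0, V, size=T)

# Reference: materialize all logits, then compute per-token cross entropy.
logits = h @ w
m = logits.max(axis=1)
lse_ref = m + np.log(np.exp(logits - m[:, None]).sum(axis=1))
loss_ref = lse_ref - logits[np.arange(T), targets]

# Tiled version: each vocab tile updates a running max/sum and picks out target logits.
run_max = np.full(T, -np.inf)
run_sum = np.zeros(T)
tgt_logit = np.zeros(T)
for start in range(0, V, TILE):
    tile = h @ w[:, start:start + TILE]            # GEMM for one vocab tile
    new_max = np.maximum(run_max, tile.max(axis=1))
    # Rescale the old sum to the new max, then add this tile's contribution.
    run_sum = run_sum * np.exp(run_max - new_max) + np.exp(tile - new_max[:, None]).sum(axis=1)
    run_max = new_max
    # Rows whose target index falls inside this tile.
    rows = np.nonzero((targets >= start) & (targets < start + TILE))[0]
    tgt_logit[rows] = tile[rows, targets[rows] - start]

loss = run_max + np.log(run_sum) - tgt_logit
assert np.allclose(loss, loss_ref)
```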

10:25 PM · May 21, 2026 · 160 Views

Han Guo @HanGuo97

Because epilogues are highly structured, CODA can start from an optimized GEMM template (QuACK PingPong) and compose a small set of fast primitives.

That turns out to be just enough structure for LLMs to write high-performance CuTeDSL kernels.

10:25 PM · May 21, 2026 · 247 Views

Han Guo @HanGuo97

Finally, huge thanks to the incredible team: @jcz42, Arjun, Driss, @tensorcore, @yoonrkim, and @tri_dao!

PDF: https://arxiv.org/abs/2605.19269
Code: https://github.com/HanGuo97/coda-kernels

10:32 PM · May 21, 2026 · 254 Views