PingPong GEMM Overlaps Epilogues With Tensor Cores To Boost Efficiency

Posts from X

Pattern 1: GEMM + residual + RMSNorm + GEMM.

RMSNorm looks like a standalone memory-bound op. But CODA reparameterizes it to compute partial norm statistics in the first GEMM epilogue, run a small reduction, then apply scaling in the second GEMM epilogue.
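The reparameterization rests on the fact that the per-row RMS scale commutes with the second matrix multiply. Below is a minimal sketch of that identity in plain Python, not CODA's kernels: the function names, the `tile` width, and the choice to fold the RMSNorm weight `gamma` into the second weight matrix are all assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Plain row-by-column matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def reference(x, w1, res, gamma, w2, eps=1e-6):
    """Unfused: GEMM, residual add, standalone RMSNorm, GEMM."""
    h = [[a + r for a, r in zip(hr, rr)] for hr, rr in zip(matmul(x, w1), res)]
    normed = []
    for row in h:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        normed.append([v * inv_rms * g for v, g in zip(row, gamma)])
    return matmul(normed, w2)

def reparameterized(x, w1, res, gamma, w2, tile=2, eps=1e-6):
    """Same result, with the norm split across two GEMM epilogues."""
    # GEMM 1 epilogue: add the residual and emit per-tile sums of squares.
    h = [[a + r for a, r in zip(hr, rr)] for hr, rr in zip(matmul(x, w1), res)]
    partial = [[sum(v * v for v in row[j:j + tile]) for j in range(0, len(row), tile)]
               for row in h]
    # Small reduction: combine the partial statistics into one scale per row.
    inv_rms = [1.0 / math.sqrt(sum(p) / len(row) + eps) for p, row in zip(partial, h)]
    # Assumption: gamma is folded into the second weight matrix ahead of time.
    w2_scaled = [[g * w for w in w_row] for g, w_row in zip(gamma, w2)]
    # GEMM 2 epilogue: scale each output row by its inverse RMS.
    return [[s * v for v in row] for s, row in zip(inv_rms, matmul(h, w2_scaled))]
```

Both functions should agree up to floating-point rounding, since row i of the output is just inv_rms[i] times row i of h @ (diag(gamma) @ w2).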

Han Guo (@HanGuo97)

Surprisingly, with the right algebraic reparameterization, almost the entire standard Transformer forward + backward pass can be expressed as:

(1) GEMM + epilogues
(2) attention
(3) a few cheap auxiliary reductions

Han Guo (@HanGuo97)

Finally, huge thanks to the incredible team: @jcz42, Arjun, Driss, @tensorcore, @yoonrkim, and @tri_dao!

PDF: https://arxiv.org/abs/2605.19269
Code: https://github.com/HanGuo97/coda-kernels

Han Guo (@HanGuo97)

Because epilogues are highly structured, CODA can start from an optimized GEMM template (QuACK PingPong) and compose a small set of fast primitives.

That turns out to be just enough structure for LLMs to write high-performance CuTeDSL kernels.

Han Guo (@HanGuo97)

Speaking of performance: a MatMul/GEMM has a mainloop and an epilogue, the final step before results are written to memory.

Epilogues are IO-efficient because data stays on-chip, and with the right design, like PingPong GEMM, they can overlap with tensor-core work.

The catch: locality.
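To make the mainloop/epilogue split and the locality constraint concrete, here is a small sequential sketch in plain Python. It does not model the PingPong overlap with tensor cores; `gemm_with_epilogue`, `relu_epilogue`, and the tile sizes are hypothetical names and values chosen for illustration.

```python
def gemm_with_epilogue(a, b, epilogue, tile_m=2, tile_n=2):
    """Tiled GEMM: the mainloop accumulates one output tile, then the
    epilogue transforms that tile before it is written out."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            rows = range(i0, min(i0 + tile_m, m))
            cols = range(j0, min(j0 + tile_n, n))
            # Mainloop: accumulate the tile over the K dimension.
            acc = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in cols]
                   for i in rows]
            # Epilogue: it sees only this tile, which is the locality constraint.
            acc = epilogue(acc, i0, j0)
            # Write the finished tile to the output.
            for di, i in enumerate(rows):
                for dj, j in enumerate(cols):
                    out[i][j] = acc[di][dj]
    return out

def relu_epilogue(tile, row0, col0):
    """An elementwise epilogue needs nothing outside its own tile."""
    return [[max(v, 0.0) for v in row] for row in tile]
```

An elementwise operation like the one above fits naturally. An operation that needs a full output row, such as a norm, does not fit unless it is restructured into per-tile partial results plus a later reduction.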

Han Guo (@HanGuo97)

Similar patterns apply to the backward pass. If the forward pass consists of a sequence of GEMM-Epilogue blocks (and ends with a GEMM), the backward pass does as well.

In practice, RMSNorm backward needs one extra algebraic trick.

Han Guo (@HanGuo97)

Pattern 2: GEMM + pairwise transforms.

Think RoPE, SwiGLU, and their backward passes. With the right layout, CODA processes them directly in epilogues.
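One way to read "the right layout" for SwiGLU is to interleave the gate and up projection columns, so that each (gate, up) pair lands next to each other in the same output tile. The sketch below assumes that interleaving; it is an illustration in plain Python, not the layout CODA necessarily uses, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Plain row-by-column matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def silu(v):
    return v / (1.0 + math.exp(-v))

def swiglu_reference(x, w_gate, w_up):
    """Unfused: two GEMMs followed by a separate elementwise pass."""
    gate, up = matmul(x, w_gate), matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(gr, ur)] for gr, ur in zip(gate, up)]

def interleave_columns(w_gate, w_up):
    """Build columns [g0, u0, g1, u1, ...] so each pair is adjacent."""
    return [[v for pair in zip(gr, ur) for v in pair]
            for gr, ur in zip(w_gate, w_up)]

def swiglu_fused(x, w_gate, w_up):
    """One GEMM of width 2N, then a pairwise epilogue producing width N."""
    acc = matmul(x, interleave_columns(w_gate, w_up))
    # Epilogue: combine adjacent (gate, up) columns of the accumulator.
    return [[silu(row[j]) * row[j + 1] for j in range(0, len(row), 2)]
            for row in acc]
```

Because each pair sits in adjacent columns, the epilogue never needs data from another tile, provided the tile width is even.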

Pattern 3: LM head + cross entropy.

Compute logits, then extract target logits and LSE in epilogues, as in Cut Cross Entropy.
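The idea of extracting the target logit and the log-sum-exp (LSE) without materializing all logits can be sketched with a running LSE over vocabulary chunks. This is a single-token illustration in plain Python, not the Cut Cross Entropy or CODA implementation; the function names and the `chunk` size are assumptions.

```python
import math

def cross_entropy_reference(h, w, target):
    """Materialize every logit, then compute LSE minus the target logit."""
    logits = [sum(x * wk for x, wk in zip(h, col)) for col in zip(*w)]
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]

def cross_entropy_chunked(h, w, target, chunk=3):
    """Walk the vocabulary in chunks, keeping only a running max, a running
    sum of exponentials, and the target logit."""
    hidden, vocab = len(h), len(w[0])
    run_max, run_sum, target_logit = -math.inf, 0.0, None
    for v0 in range(0, vocab, chunk):
        # Logits for this vocabulary chunk (one GEMM output tile).
        tile = [sum(h[k] * w[k][v] for k in range(hidden))
                for v in range(v0, min(v0 + chunk, vocab))]
        # Epilogue: pick out the target logit if it falls in this chunk.
        if v0 <= target < v0 + chunk:
            target_logit = tile[target - v0]
        # Epilogue: fold the chunk into the running log-sum-exp.
        new_max = max(run_max, max(tile))
        run_sum = (run_sum * math.exp(run_max - new_max)
                   + sum(math.exp(z - new_max) for z in tile))
        run_max = new_max
    return run_max + math.log(run_sum) - target_logit
```

Here `h` is one hidden-state vector and `w` is a hidden-by-vocabulary weight matrix. The chunked version should match the reference up to rounding while holding only one chunk of logits at a time.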
