CODA reparameterizes memory-bound operations such as norms, RoPE, residuals, and SwiGLU to execute as part of GEMM epilogues in transformer blocks

CuTeDSL implementation reduces kernel launch overhead on GPUs.

Original post

Han Guo@HanGuo97

LLM training is dominated by compute-heavy ops like MatMuls and attention.

But it also has many memory-heavy ops: norms, activations, residuals, reductions. These mostly move tensors around.

As FP8/NVFP4 make FLOPs cheaper, data movement gets harder to ignore.

Fig: ~1B LLaMA-3 training

Han Guo@HanGuo97

LLM training is built on fast MatMuls. But many surrounding ops still run as memory-bound kernels.

CODA reparameterizes them to hide in the matmul’s shadow, fused into its epilogue before results leave the chip.

Bonus: LLMs can write fast CODA kernels too (approaching SoLs).

3:25 PM · May 21, 2026 · 5.6K Views

Sentiment

Positive users praise CODA for cleverly fusing memory-bound ops into GEMM epilogues to speed LLM training and approach peak hardware use, while negative users dismiss it as derivative or insufficiently optimized.

Pos: 84.5%

Neg: 15.5%

16 comments with sentiment.

Posts from X

Tri Dao@tri_dao

After some mathematical rewrite, turns out all of transformer is a series of gemm + epilogue. Given a few optimized primitives, LLMs (and novice humans) can write speed-of-light kernels for all transformer ops!

Jack Zhang@jcz42

We built a kernel abstraction to rewrite the entire transformer stack as GEMM + Epilogue kernels!

Neural net architectures such as transformers consist entirely of matrix multiplications and elementwise nonlinearities such as RMSNorm, log sum exp, and gated activations. Fusing these elementwise nonlinearities into GEMMs in both the forward and backward passes allows us to make training and prefill as compute-bound as possible!

Our kernel abstraction CODA is implemented in CuTeDSL, and by abstracting away the fixed prologue and main loop of the GEMM kernel, we expose an epilogue function where LLMs like Claude can easily implement elementwise nonlinearities in fusions approaching speed-of-light!

Han Guo@HanGuo97

Finally, huge thanks to the incredible team: @jcz42, Arjun, Driss, @tensorcore, @yoonrkim, and @tri_dao!

PDF: https://arxiv.org/abs/2605.19269 Code: https://github.com/HanGuo97/coda-kernels

Clive Chan@itsclivetime

the transformer reparameterizations lab has released even more reparameterizations :D

Tri Dao@tri_dao

Mayank Mishra@MayankMish98

Faster training with CODA 🚀 most ops re-written as GEMM+epilogues achieving SoL

Oliver Sieberling@osieberling

A significant portion of transformer training time is spent not on matmuls but on norms, residuals, RoPE, etc. This work shows that the entire transformer computation can be rewritten as matmul + epilogue, which is an extremely powerful abstraction for writing GPU kernels.

Jyo Pari@jyo_pari

The computational abstractions humans developed are great for building architectures, but they’re not necessarily the right abstractions for kernels. Han shows why 🔥

Han Guo@HanGuo97

Pattern 1: GEMM + residual + RMSNorm + GEMM.

RMSNorm looks like a standalone memory-bound op. But CODA reparameterizes it to compute partial norm statistics in the first GEMM epilogue, run a small reduction, then apply scaling in the second GEMM epilogue.
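A minimal NumPy sketch of the split described in Pattern 1, not the CODA implementation. It assumes the per-column RMSNorm weight `gamma` can be folded into the second weight matrix, and it models each "epilogue" as ordinary array code that runs right after a matmul. All function names and the `tile_n` parameter are hypothetical.

```python
import numpy as np

def rmsnorm_reference(h, gamma, eps=1e-6):
    # Standalone RMSNorm: a row-wise scale followed by a per-column weight.
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return h / rms * gamma

def gemm_with_partial_sumsq(x, w0, z, tile_n):
    # First GEMM. The "epilogue" adds the residual z and emits one partial
    # sum of squares per row and per column tile.
    h = x @ w0 + z
    n = h.shape[1]
    partials = [np.sum(h[:, j:j + tile_n] ** 2, axis=1) for j in range(0, n, tile_n)]
    return h, np.stack(partials, axis=1)  # partials has shape (M, num_tiles)

def gemm_with_row_scale(h, gamma, w1, partials, eps=1e-6):
    # Small reduction over the per-tile statistics, giving 1/RMS for each row.
    n = h.shape[1]
    inv_rms = 1.0 / np.sqrt(partials.sum(axis=1, keepdims=True) / n + eps)
    # Second GEMM on the un-normalized h, with gamma folded into W1 (assumption).
    # The "epilogue" applies the row scale after the matmul.
    return (h @ (gamma[:, None] * w1)) * inv_rms
```

The result matches `rmsnorm_reference(x @ w0 + z, gamma) @ w1` because a row-wise scale commutes with right-multiplication by a matrix, so the normalization never needs its own pass over `h`.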

Han Guo@HanGuo97

Speaking of performance: a MatMul/GEMM has a mainloop and an epilogue, the final step before results are written to memory.

Epilogues are IO-efficient because data stays on-chip, and with the right design, like PingPong GEMM, they can overlap with tensor-core work.

The catch: locality.
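To make the IO argument concrete, here is a back-of-the-envelope sketch in Python. It assumes an idealized traffic model in which every operand crosses the memory bus exactly once, and a default of 2 bytes per element (as for BF16). The function names are hypothetical and the numbers are not measurements from the paper.

```python
def gemm_intensity(m, n, k, bytes_per_elem=2):
    # FLOPs per byte for C = A @ B when A, B and C each move once.
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(flops_per_elem=1, bytes_per_elem=2):
    # A standalone elementwise kernel reads one element and writes one element.
    return flops_per_elem / (2 * bytes_per_elem)

def extra_bytes_unfused(m, n, bytes_per_elem=2):
    # Traffic added when an elementwise op runs as its own kernel:
    # re-read the (m, n) GEMM output, then write a new (m, n) tensor.
    return 2 * m * n * bytes_per_elem
```

Under this model a 4096 x 4096 x 4096 GEMM performs roughly 1365 FLOPs per byte, while a one-FLOP elementwise kernel performs 0.25, and running that kernel separately moves an extra 64 MiB that an epilogue would keep on-chip.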

Han Guo@HanGuo97

ML frameworks make training code easy to write, but they hide the cost of many small, memory-bound ops.

That is why optimized training/inference stacks often rely on bespoke kernels and manual autograd. But these are hard to write and maintain.

Can we get the best of both?

Han Guo@HanGuo97

Surprisingly, with the right algebraic reparameterization, almost the entire standard Transformer forward + backward pass can be expressed as:

(1) GEMM + epilogues (2) attention (3) a few cheap auxiliary reductions

Han Guo@HanGuo97

Similar patterns apply to the backward pass. If the forward pass consists of a sequence of GEMM-Epilogue blocks (and ends with a GEMM), the backward pass does as well.

In practice, RMSNorm backward needs one extra algebraic trick.
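The structural claim can be illustrated with a toy two-layer block in NumPy. This is a sketch under assumptions: the activation (SiLU) and all function names are chosen for illustration, and it does not attempt the RMSNorm backward trick, which the post does not spell out. The forward pass is GEMM, epilogue, GEMM; the backward pass is again made of GEMMs, with the activation gradient applied as the epilogue of the GEMM that produces the incoming gradient.

```python
import numpy as np

def silu(p):
    return p / (1.0 + np.exp(-p))

def silu_grad(p):
    # d/dp [p * sigmoid(p)] = s * (1 + p * (1 - s)), with s = sigmoid(p).
    s = 1.0 / (1.0 + np.exp(-p))
    return s * (1.0 + p * (1.0 - s))

def forward(x, w0, w1):
    p = x @ w0           # GEMM
    a = silu(p)          # epilogue of the first GEMM
    return a @ w1, p, a  # final GEMM

def backward(x, w0, w1, p, a, d_out):
    # GEMM whose epilogue multiplies by the activation gradient.
    d_p = (d_out @ w1.T) * silu_grad(p)
    d_w1 = a.T @ d_out   # GEMM
    d_w0 = x.T @ d_p     # GEMM
    d_x = d_p @ w0.T     # final GEMM
    return d_x, d_w0, d_w1
```

No standalone elementwise kernel appears in either direction; the only non-GEMM work sits directly behind a matmul, which is the shape the post describes.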

Han Guo@HanGuo97

Because epilogues are highly structured, CODA can start from an optimized GEMM template (QuACK PingPong) and compose a small set of fast primitives.

That turns out to be just enough structure for LLMs to write high-performance CuTeDSL kernels.

Han Guo@HanGuo97

Pattern 2: GEMM + pairwise transforms.

Think RoPE, SwiGLU, and their backward passes. With the right layout, CODA processes them directly in epilogues.
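One way to picture "the right layout" for SwiGLU is sketched below in NumPy. It assumes the gate and up projection columns are interleaved in a single weight matrix, so that each gate/up pair lands next to each other in the GEMM output. Whether CODA uses exactly this layout is not stated in the post, and the function names are hypothetical.

```python
import numpy as np

def silu(a):
    return a / (1.0 + np.exp(-a))

def swiglu_reference(x, w_gate, w_up):
    # Unfused form: two GEMMs followed by a separate elementwise kernel.
    return silu(x @ w_gate) * (x @ w_up)

def interleave_columns(w_gate, w_up):
    # Column 2j holds gate column j and column 2j + 1 holds up column j.
    k, n = w_gate.shape
    w = np.empty((k, 2 * n), dtype=w_gate.dtype)
    w[:, 0::2] = w_gate
    w[:, 1::2] = w_up
    return w

def gemm_swiglu_epilogue(x, w_interleaved):
    acc = x @ w_interleaved                # one GEMM; acc stands in for the accumulator
    gate, up = acc[:, 0::2], acc[:, 1::2]  # each pair sits in adjacent columns
    return silu(gate) * up                 # pairwise transform applied in the "epilogue"
```

Because both members of every pair fall in the same output tile, the activation needs no data from other tiles, which is what lets it run before the result is written out. RoPE rotates adjacent pairs in a similar way.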

Pattern 3: LM head + cross entropy.

Compute logits, then extract target logits and LSE in epilogues, as in Cut Cross Entropy.
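Pattern 3 can be sketched in NumPy as a loop over vocabulary tiles that keeps only a running log-sum-exp and the target logit for each row. This illustrates the general idea under assumptions; it is not the Cut Cross Entropy or CODA code, and `tile_v` and the function names are hypothetical.

```python
import numpy as np

def cross_entropy_reference(hidden, w_vocab, targets):
    # Materializes the full (tokens, vocab) logits matrix.
    logits = hidden @ w_vocab
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return lse - logits[np.arange(len(targets)), targets]

def cross_entropy_tiled(hidden, w_vocab, targets, tile_v):
    num_rows = hidden.shape[0]
    rows = np.arange(num_rows)
    run_max = np.full(num_rows, -np.inf)
    run_sum = np.zeros(num_rows)
    target_logit = np.zeros(num_rows)
    for start in range(0, w_vocab.shape[1], tile_v):
        tile = hidden @ w_vocab[:, start:start + tile_v]  # one GEMM output tile
        # "Epilogue": fold the tile into the running log-sum-exp, then drop it.
        new_max = np.maximum(run_max, tile.max(axis=1))
        run_sum = run_sum * np.exp(run_max - new_max) + np.exp(tile - new_max[:, None]).sum(axis=1)
        run_max = new_max
        # "Epilogue": pick out the target logit if it falls inside this tile.
        in_tile = (targets >= start) & (targets < start + tile.shape[1])
        target_logit[in_tile] = tile[rows[in_tile], targets[in_tile] - start]
    return run_max + np.log(run_sum) - target_logit  # per-token loss
```

Only three vectors of length `num_rows` survive each tile, so the full logits matrix is never stored.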

Peter Henderson@PeterHndrsn

Very cool work from an all-star team!

Lijie(Derrick) Yang@LijieyYang

Turns out with the magic of hiding stuff in the epilogue, GEMM-plus-epilogue programming can be surprisingly effective 🚀 Congrats on the great work @HanGuo97 @jcz42!

MIT NLP@nlp_mit

make GPUs go brrrrr

Dongxin Guo@dongxinguo

The headline trick is the commutation (r · ((xW0+z) ⊙ γ)) W1 = r · (((xW0+z) ⊙ γ) W1), where r is the row-wise RMSNorm scale, so that scale slides past the second GEMM into its epilogue. That's why RMSNorm fuses and softmax doesn't. The LLM-written kernels are inside that envelope. What's the next op where you think a similar algebraic move exists?

Pratyush Ranjan Tiwari@PratyushRT

@tri_dao Unrelated, but similar to how hedge funds try to get the most intelligent non-LLM model running with the least latency possible: do you see inference-speed competitions (in units of intelligence per second) that might be winner-takes-all?
