GPU Glossary Adds Coverage of CuTe DSL, CUTLASS, and CuTe

The CuTe and CuTe DSL articles include minimal code snippets illustrating core principles and basic usage. These snippets are backed by Modal Notebooks, so you can try them yourself -- and edit them to test your understanding. https://modal.com/notebooks/modal-labs/examples/nb-Vnwf5bQck2WSSETJUPk2UD
Tensor Cores have not just pushed performance, they've also pushed kernel engineering and the software stack it depends on. Sometimes this just causes churn and rough edges, but there are some very deep and elegant ideas in CUTLASS/CuTe!
^That's a sample of CuTe DSL, which is used in, among others, the FlashAttention-4 kernel. Below is the sample CuTe kernel, with a cute trick: using layouts to express transposition. https://modal.com/notebooks/modal-labs/examples/nb-owEUD0kdSVeL4KeEX5sjh1
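The kernel itself lives in the notebook linked above. For a feel of the layout idea without a GPU, here is a minimal sketch in plain Python. It is not CuTe or CuTe DSL code; the `offset` helper and variable names are made up for illustration. It assumes only the basic notion that a layout pairs a shape with strides and maps a coordinate to an offset into a flat buffer, so a transpose can be expressed by swapping the shape and stride entries instead of moving data.

```python
def offset(coord, stride):
    """Map a coordinate to a linear offset: the dot product of coord and stride.

    Hypothetical helper for illustration only, not part of any CuTe API.
    """
    return sum(c * s for c, s in zip(coord, stride))


rows, cols = 2, 3
data = list(range(rows * cols))  # flat buffer [0, 1, 2, 3, 4, 5], read as row-major

# Row-major layout of a 2x3 matrix: shape (2, 3), stride (3, 1).
shape, stride = (rows, cols), (cols, 1)

# Transposed layout: swap the entries of both shape and stride.
# Same buffer, different coordinate-to-offset mapping, no copies.
shape_t, stride_t = (cols, rows), (1, cols)

matrix = [
    [data[offset((i, j), stride)] for j in range(shape[1])]
    for i in range(shape[0])
]
transposed = [
    [data[offset((i, j), stride_t)] for j in range(shape_t[1])]
    for i in range(shape_t[0])
]

print(matrix)      # [[0, 1, 2], [3, 4, 5]]
print(transposed)  # [[0, 3], [1, 4], [2, 5]]
```

Both views read the same six elements; only the strides differ, which is why a transpose written this way is free at the indexing level.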
h/t to @derangineer for writing the CuTe DSL article and inspiring me to write up the rest of the CUTLASS stack!