Speaking of performance: a MatMul/GEMM kernel has two phases, a mainloop that accumulates the products and an epilogue, the final step before results are written to memory.
Epilogues are IO-efficient because data stays on-chip, and with the right design, like PingPong GEMM, they can overlap with tensor-core work.
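To make the mainloop/epilogue split concrete, here is a minimal pure-Python sketch of a tiled GEMM with a fused bias + activation epilogue. It is an illustration only, not how a GPU kernel is written: the names `gemm_with_epilogue`, `unfused_reference`, `relu`, and the `tile` parameter are hypothetical, the `acc` list merely stands in for on-chip accumulators, and nothing here models PingPong scheduling or overlap with tensor cores.

```python
def relu(x):
    return max(x, 0.0)


def gemm_with_epilogue(A, B, bias, epilogue=None, tile=2):
    """Compute epilogue(A @ B + bias), one output tile at a time."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, tile):
        for n0 in range(0, N, tile):
            m1, n1 = min(m0 + tile, M), min(n0 + tile, N)
            # Accumulator tile: stands in for registers / on-chip memory.
            acc = [[0.0] * (n1 - n0) for _ in range(m1 - m0)]
            # Mainloop: walk over K in tiles, accumulating partial products.
            for k0 in range(0, K, tile):
                k1 = min(k0 + tile, K)
                for i in range(m0, m1):
                    for j in range(n0, n1):
                        partial = 0.0
                        for k in range(k0, k1):
                            partial += A[i][k] * B[k][j]
                        acc[i - m0][j - n0] += partial
            # Epilogue: elementwise work on the finished tile, then a single
            # write of each output element to C ("memory").
            for i in range(m0, m1):
                for j in range(n0, n1):
                    v = acc[i - m0][j - n0] + bias[j]
                    C[i][j] = epilogue(v) if epilogue else v
    return C


def unfused_reference(A, B, bias, activation):
    """Same math as three separate passes, each re-reading a full matrix."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
         for i in range(M)]                                   # pass 1: matmul
    C = [[C[i][j] + bias[j] for j in range(N)] for i in range(M)]  # pass 2: bias
    return [[activation(v) for v in row] for row in C]        # pass 3: activation
```

Under these assumptions, `gemm_with_epilogue(A, B, bias, epilogue=relu)` and `unfused_reference(A, B, bias, relu)` return the same matrix. The difference is that the fused version touches each output element once, while the unfused version makes two extra full passes over the output.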
The catch is locality: an epilogue only sees the output tile its thread block has just computed.
ML frameworks make training code easy to write, but they hide the cost of many small, memory-bound ops.
That is why optimized training/inference stacks often rely on bespoke kernels and manual autograd. But these are hard to write and maintain.
Can we get the best of both?