Last fall, we shared our deep dive on Flash Attention 4 (FA4) internals.
But we didn't stop at grokking the kernel.
Since then, we've been developing inference performance improvements and upstreaming them.
This blog post explains those contributions.
https://modal.com/blog/flash-attention-4-faster