Modal Engineers Optimize FlashAttention-4 for Faster LLM Inference

Charles 🎉 Frye (@charles_irl)

Last fall, we shared our deep dive on FA4 internals.

But we didn't stop at grokking the kernel.

Since then, we've been developing improvements for inference performance and upstreaming them.

This blog post explains those contributions.

https://modal.com/blog/flash-attention-4-faster

12:04 PM · Jun 11, 2026 · 3.5K Views

A tl;dr for folks who don't care how many warpgroups FA4 devotes to softmax vs MMA loads.

Inference is different from training, so kernels look different.

Two main classes of improvement:
- change what work is done in parallel (eg across KV)
- support small, irregular loads
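To make "in parallel across KV" concrete, here is a minimal pure-Python sketch of the general split-KV idea for a single decode-time query: each KV chunk produces a partial result (local max, sum of exponentials, unnormalized output), and the partials are merged by rescaling to a shared max. This is an illustration under assumptions, not the FA4 kernel or the code in the PRs; the function names (`attend_chunk`, `merge`, `split_kv_attend`) are hypothetical, and on a GPU the per-chunk work would be spread across thread blocks rather than run in a Python loop.

```python
import math


def _scores(q, keys):
    # Scaled dot products between one query and each key.
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]


def attend(q, keys, values):
    """Reference: softmax(q K^T / sqrt(d)) V for one query, in a single pass."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attend_chunk(q, keys, values):
    """Partial result for one KV chunk: (local max, sum of exps, unnormalized output)."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), acc


def merge(partials):
    """Combine per-chunk partials by rescaling each one to the global max."""
    m_global = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    total = 0.0
    out = [0.0] * dim
    for m, s, acc in partials:
        rescale = math.exp(m - m_global)
        total += s * rescale
        for d in range(dim):
            out[d] += acc[d] * rescale
    return [o / total for o in out]


def split_kv_attend(q, keys, values, num_splits):
    """Attention for one query with the KV sequence split into independent chunks."""
    chunk = math.ceil(len(keys) / num_splits)
    # Each chunk depends only on its own keys and values, so these calls
    # are independent of one another and could run concurrently.
    partials = [attend_chunk(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(keys), chunk)]
    return merge(partials)
```

The point of the split: with one query token per sequence, there is little work to parallelize along the query dimension, so dividing the KV sequence is one way to keep more of the GPU busy. The merge step costs only a few extra multiplications per chunk.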

Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL.

FA4 is a very tile-pilled, Tensor Core-maxxing kernel. We'd love to write (and repeatedly rewrite) such kernels with @blelbach and team's new tile programming models.

shikhar (@encapsulated007)

yeah, modal just can't stop cooking...

I'll leave the details to the blog, which includes links to PRs, benchmarking figures of merit (as ASCII tables, naturally), and more commentary, with backing resources, on GPU performance engineering.

As always, a pleasure to work with @tri_dao, Jay Shah, and the FA4 team, who reviewed the PRs -- but not this post! I take responsibility for any errors.

And of course @_dcw02 and Timmy, who push my understanding of GPUs at least as hard as they do the hardware itself.
