Modal Engineers Optimize FlashAttention-4 for Faster LLM Inference

Charles 🎉 Frye @charles_irl

Last fall, we shared our deep dive on FA4 internals.

But we didn't stop at grokking the kernel.

Since then, we've been developing improvements for inference performance and upstreaming them.

This blog post explains those contributions.

https://modal.com/blog/flash-attention-4-faster

12:04 PM · Jun 11, 2026 · 3.5K Views

Charles 🎉 Frye @charles_irl

A tl;dr for folks who don't care how many warpgroups FA4 devotes to softmax vs MMA loads.

Inference is different from training, so kernels look different.

Two main classes of improvement:
- change what work is done in parallel (eg across KV)
- support small, irregular loads
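
To make "parallel across KV" concrete, here is a minimal pure-Python sketch of the general split-KV idea for a single decode query. It is not the FA4 kernel and not taken from the blog post or its PRs. The function names (`attend`, `attend_chunk`, `attend_split_kv`) and the `num_splits` parameter are hypothetical, chosen for illustration. Each KV chunk is processed independently, which is the part a GPU could run in parallel, and the partial results are merged by rescaling against a shared softmax maximum.

```python
import math


def attend(q, keys, values):
    """Reference: single-query softmax attention over the whole KV sequence."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attend_chunk(q, keys, values):
    """Partial attention over one KV chunk.

    Returns (local max score, local sum of exps, unnormalized output).
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), acc


def attend_split_kv(q, keys, values, num_splits):
    """Split the KV sequence into chunks, attend to each, then merge.

    The per-chunk calls are independent of one another; only the final
    merge needs all of the partial results.
    """
    chunk = math.ceil(len(keys) / num_splits)
    partials = [attend_chunk(q, keys[i:i + chunk], values[i:i + chunk])
                for i in range(0, len(keys), chunk)]
    m_global = max(m for m, _, _ in partials)
    dim = len(values[0])
    total = 0.0
    out = [0.0] * dim
    for m, s, acc in partials:
        w = math.exp(m - m_global)  # rescale this chunk to the shared max
        total += w * s
        for d in range(dim):
            out[d] += w * acc[d]
    return [o / total for o in out]
```

The merge step is why this matters for inference: during decoding there is often only one query row per sequence, so splitting along the KV dimension is one way to find enough independent work to keep the GPU busy. Up to floating-point rounding, `attend_split_kv` returns the same result as `attend` for any `num_splits`.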

Charles 🎉 Frye @charles_irl

Rewriting parallelism is a big move and it'd be nice to make it even faster than we can do with CuTe DSL.

FA4 is a very tile-pilled, Tensor Core-maxxing kernel. We'd love to write (and repeatedly rewrite) such kernels with @blelbach and team's new tile programming models.

shikhar @encapsulated007

yeah, modal just can't stop cooking...

Charles 🎉 Frye @charles_irl

I'll leave the details to the blog, which includes links to PRs, benchmarking figures of merit (as ASCII tables, naturally), and more commentary, with backing resources, on GPU performance engineering.

Charles 🎉 Frye @charles_irl

As always, a pleasure to work with @tri_dao, Jay Shah, and the FA4 team, who reviewed the PRs -- but not this post! I take responsibility for any errors.

And of course @_dcw02 and Timmy, who push my understanding of GPUs at least as hard as they do the hardware itself.
