10h ago

Ted Zadouri details FlashAttention-4 kernel redesigns targeting softmax and memory bottlenecks on NVIDIA Blackwell GPUs

Blackwell shifts primary performance bottlenecks away from tensor cores.

Original post

#759 @MARKSAROUFIM (OP)

Casey Aylward @CASEYAYLWARD

Great technical talk by @tedzadouri on FlashAttention-4: a deep look at how attention kernels are being redesigned for NVIDIA Blackwell, where the bottleneck shifts from tensor cores to softmax + memory movement. Also featured: voice of god aka @marksaroufim in the background asking questions. Link below!

10:57 AM · May 29, 2026

Reposted by

#853 @A1ZHANG