Ted Zadouri details FlashAttention-4 kernel redesigns targeting softmax and memory bottlenecks on NVIDIA Blackwell GPUs

On Blackwell, the primary attention bottleneck shifts away from the tensor cores and toward softmax computation and memory movement.

Original post

Great technical talk by @tedzadouri on FlashAttention-4: a deep look at how attention kernels are being redesigned for NVIDIA Blackwell, where the bottleneck shifts from tensor cores to softmax + memory movement. Also featured: voice of god aka @marksaroufim in the background asking questions. Link below!

10:57 AM · May 29, 2026