2d ago

Alibaba's Qwen3.5-397B-A17B model achieves 580 tokens per second on NVIDIA GPUs using TokenSpeed framework optimizations

The work behind the milestone used FlashAttention-4 optimizations and custom GPU kernels.

Original post

The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs.

In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 https://bit.ly/4uGUvIS

This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI, and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI

9:02 AM · May 27, 2026