Alibaba's Qwen3.5-397B-A17B model achieves 580 tokens per second on NVIDIA GPUs using TokenSpeed framework optimizations

The result relied on FlashAttention-4 (FA4) optimizations and custom GPU kernels.


Original post

#84 @TRI_DAO OP

PyTorch @PYTORCH

The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a record-breaking 580 tokens per second (tps) for agentic workloads on NVIDIA GPUs. In the PyTorch Foundation's latest community blog post, you can learn all about the complete design, implementation, and optimization of Qwen3.5 models in the TokenSpeed inference framework and see for yourself how this work is improving performance 👉 https://bit.ly/4uGUvIS

This achievement was a joint effort between the @Alibaba_Qwen inference team, @lightseekorg Foundation TokenSpeed team, @NVIDIAAI, and the Mooncake team, with special contributions from @tri_dao for FlashAttention-4 (FA4) optimization. @KVCache_AI

9:02 AM · May 27, 2026

Reposted by

#84 @TRI_DAO

Quote post

#1266 Gavin Baker @GAVINSBAKER

Tells you a lot about the reality of Chinese silicon IMO.

5:29 PM · May 27, 2026 · 251.7K Views

