🔥 DFlash on NVIDIA Blackwell: up to 15x higher throughput at the same interactivity target! Block-diffusion drafting proposes a whole token block in one pass for the target model to verify in parallel, and it is now supported in SGLang!
Migrating from EAGLE is one swap: set the speculative decoding algorithm to DFlash and use the matching checkpoint. Read the full guide: https://developer.nvidia.com/blog/boost-inference-performance-up-to-15x-on-nvidia-blackwell-using-dflash-speculative-decoding
Increase inference performance by up to 15x without sacrificing responsiveness.
DFlash, an open-source, lightweight block-diffusion model designed for speculative decoding, delivers up to 15x higher throughput on NVIDIA Blackwell while maintaining the same user interactivity target.
Instead of drafting tokens one at a time, it proposes a whole block in a single pass for the main model to verify in parallel.
Adoption is drop-in with support in @lmsysorg SGLang, TensorRT-LLM, and @vllm_project.
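
To make the draft-then-verify idea concrete, here is a minimal toy sketch in Python. It is not DFlash and uses no SGLang, TensorRT-LLM, or vLLM API: `target_next`, `draft_block`, `verify_block`, and `speculative_generate` are hypothetical names invented for illustration, the "models" are plain functions, and greedy verification is assumed. A real block-diffusion drafter would produce the whole block in one forward pass, and a real target model would score every block position in one batched pass; the loops below only simulate that.

```python
from typing import List, Tuple

Token = int
VOCAB = 50  # toy vocabulary size (assumption for this sketch)


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the target model's greedy next token."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_block(context: List[Token], block_size: int) -> List[Token]:
    """Toy stand-in for a block drafter: returns block_size proposed tokens.

    It imitates the target but is deliberately wrong whenever the context
    length is a multiple of 4, so that verification has something to reject.
    """
    ctx = list(context)
    proposed = []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 4 == 0:
            tok = (tok + 1) % VOCAB  # injected draft error
        proposed.append(tok)
        ctx.append(tok)
    return proposed


def verify_block(context: List[Token], block: List[Token]) -> List[Token]:
    """Accept the longest prefix of the block that matches the target.

    The first mismatching position is replaced by the target's own token.
    If the whole block matches, the target contributes one extra token.
    """
    ctx = list(context)
    accepted = []
    for tok in block:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))
    return accepted


def speculative_generate(
    prompt: List[Token], num_tokens: int, block_size: int = 8
) -> Tuple[List[Token], int]:
    """Generate num_tokens tokens; also return the number of verify steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < num_tokens:
        block = draft_block(out, block_size)
        out.extend(verify_block(out, block))
        steps += 1
    return out[: len(prompt) + num_tokens], steps


def greedy_generate(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out
```

In this sketch the speculative output is identical to plain greedy decoding, because every emitted token is either confirmed or supplied by the target; the only thing that changes is how many verification steps are needed, which depends on how often the drafter's block is accepted.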