DFlash released as an open-source block diffusion model for speculative decoding, with up to 15x higher throughput on NVIDIA Blackwell

SGLang integration requires only a config change from EAGLE.

Original post

SGLang@sgl_project

🔥DFlash on NVIDIA Blackwell: up to 15x throughput at the same interactivity! Block-diffusion drafting proposes a whole token block in one pass for the target model to verify in parallel, and this is now in SGLang!

Migrating from EAGLE is one swap: set spec decode to DFlash + the matching checkpoint. Read the full guide: https://developer.nvidia.com/blog/boost-inference-performance-up-to-15x-on-nvidia-blackwell-using-dflash-speculative-decoding

NVIDIA AI@NVIDIAAI

Increase inference performance by up to 15x without sacrificing responsiveness.

DFlash, an open source lightweight block diffusion model designed for speculative decoding, delivers up to 15x higher throughput on NVIDIA Blackwell while maintaining the same user interactivity target.

Instead of drafting tokens one at a time, it proposes a whole block in a single pass for the main model to verify in parallel.

Adoption is drop-in with support in @lmsysorg SGLang, TensorRT-LLM, and @vllm_project.

11:56 AM · Jun 23, 2026 · 7.8K Views
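
The block-drafting idea described in the post above can be illustrated with a toy sketch. This is not DFlash, SGLang, or any library's API: `toy_target`, `toy_draft`, and every function name below are hypothetical stand-ins, and the sketch assumes greedy (exact-match) verification, whereas real systems may use sampling-based acceptance. It only shows why verifying a drafted block can emit several tokens per target step while producing the same output as plain greedy decoding.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]            # greedy next token from the target
DraftFn = Callable[[List[Token], int], List[Token]]  # proposes k tokens in one call


def verify_block(context: List[Token], block: List[Token],
                 target_next: TargetFn) -> List[Token]:
    """Accept the longest draft prefix the target agrees with, plus one target token.

    A real engine scores all block positions in a single parallel forward pass;
    this loop only simulates that check position by position.
    """
    accepted: List[Token] = []
    for tok in block:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token, drop the rest
            return accepted
        accepted.append(tok)
    accepted.append(target_next(context + accepted))  # whole block accepted: bonus token
    return accepted


def speculative_generate(prompt: List[Token], n_new: int, k: int,
                         target_next: TargetFn,
                         draft_block: DraftFn) -> Tuple[List[Token], int]:
    """Generate n_new tokens; return the sequence and the number of verify steps."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify_block(out, draft_block(out, k), target_next))
        steps += 1
    return out[: len(prompt) + n_new], steps


def greedy_generate(prompt: List[Token], n_new: int,
                    target_next: TargetFn) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical deterministic 'target model' over a 7-token vocabulary."""
    return (sum(ctx) + len(ctx)) % 7


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: usually right, deliberately wrong at some positions.

    A block diffusion drafter would produce the k tokens in one pass;
    the loop here is just a convenient way to fake plausible guesses.
    """
    cur, block = list(ctx), []
    for _ in range(k):
        tok = toy_target(cur)
        if len(cur) % 5 == 0:
            tok = (tok + 1) % 7  # injected drafting error
        block.append(tok)
        cur.append(tok)
    return block
```

Calling `speculative_generate([1, 2, 3], 16, 4, toy_target, toy_draft)` returns the same tokens as `greedy_generate([1, 2, 3], 16, toy_target)` while using fewer verification steps than generated tokens. In this simplified picture the speedup depends on how often the draft block matches the target.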

Sentiment

Many users praised DFlash speculative decoding on NVIDIA Blackwell for its up-to-15x LLM throughput, crediting block drafting with parallel verification for removing the token-by-token drafting bottleneck.

Positive: 90.9%

Negative: 9.1%

27 comments with sentiment.

Via NVIDIA.COM

Posts from X

NVIDIA AI@NVIDIAAI

Read the full deep dive here https://nvda.ws/4uOFHa3

anonymous@youyouAllen

@NVIDIAAI Is there anybody porting DFlash to MLX or llama.cpp for local AI?

Joe Stevens@StevensJoe11

@NVIDIAAI 🔥🛢🔥

Markets & Mayhem@Mayhem4Markets

@NVIDIAAI This is the way.

Mia@MiaAI_lab

@NVIDIAAI Flash is awesome

But dear @NVIDIAAI 👇

mr-r0b0t@mr_r0b0t

@NVIDIAAI Being able to swap Eagle3 for DFlash is 🔥

Akshobya@albustime

@NVIDIAAI Yes indeed. I have some struggles with it, though. Does anyone have recommended recipes that use DFlash?

NadzAI@NadzuAI

@NVIDIAAI Speculative decoding is shifting inference from token-by-token bottlenecks to parallel block verification, which is why stacks like LMSYS Org SGLang, TensorRT-LLM, and vLLM are suddenly unlocking massive throughput gains.

Outdated Often@JamesSurra34

@NVIDIAAI Ya this has been around for months now, but thanks….?

Sakura Yuki@sakurayukiai

@NVIDIAAI Drafting autoregressively always felt like a halfway solution since you're still bottlenecked by sequential steps. Generating the draft block in parallel via diffusion is the real unlock.

Blake Edwards@bitstream_blake

@NVIDIAAI Cool

Gregor@bygregorr

@NVIDIAAI Not sure 'same user interactivity' holds when the draft acceptance rate drops. I tested speculative decoding on financial text, and the throughput gains basically vanished around 60% acceptance. Is the 15x on general benchmarks, or does it hold on domain-specific inputs too?

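
The acceptance-rate concern in the comment above can be made concrete with a standard simplified model of speculative decoding. This is a back-of-the-envelope sketch, not a measurement of DFlash: it assumes each drafted token is accepted independently with the same probability `alpha` and that every verification step also yields one token from the target. Under those assumptions, the expected number of tokens per verification step for a block of `k` drafted tokens is (1 - alpha^(k+1)) / (1 - alpha). The function name and the example values are illustrative only.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes i.i.d. per-token acceptance probability `alpha` and a drafted
    block of `k` tokens, plus the one token the target always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every drafted token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Compare a few acceptance rates for an illustrative block size of 8.
    for alpha in (0.6, 0.8, 0.9):
        print(alpha, round(expected_tokens_per_step(alpha, 8), 2))
```

In this model the tokens gained per step fall off quickly as `alpha` drops, and drafting cost is not included at all, so real-world gains on a given workload can be smaller than the headline figure.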
AI Mastery Guide@aiseomastery

@NVIDIAAI proposing a whole block for the main model to verify instead of drafting one token at a time is such a clean speedup trick

Om Tripathi@OmTripathi_i

@NVIDIAAI Great to see drop-in support for SGLang, vLLM, and TensorRT-LLM right out of the gate. Reducing drafting latency without sacrificing output quality is exactly what we need to scale long CoT reasoning models in production.
