SGLang integrates DFlash speculative decoding, boosting Qwen 3.5 397B-A17B inference throughput by up to 4.3x

Charles 🎉 Frye@charles_irl

brrr

David Wang@_dcw02

9+ accept lengths on coding workloads

generic drafter btw

qwen 397b 4x faster

repro btw

dflash go brrr

Banghua Zhu@BanghuaZ

DLLM research is still at an early stage, but DFlash has really been pushing the frontier of spec decoding and has been adopted across production stacks. Congrats on the amazing work @modal and http://z-lab.ai @zhijianliu_ !!

LMSYS Org@lmsysorg

🚀 New blog: The next generation of speculative decoding: DFlash and Spec V2

DFlash + Spec V2 hit >4.3X baseline throughput for LLM inference, now the default speculative decoding engine in SGLang! Together with @modal and http://z-lab.ai, our jointly-released DFlash drafter for Qwen 3.5 397B-A17B beats both baseline and native MTP in every setting we benchmarked:

1️⃣ >4.3X baseline & 1.5X native MTP throughput (concurrency 1, HumanEval, 8xB200)
2️⃣ Block diffusion drafter: a full token block in one forward pass
3️⃣ KV injection: target-model features fed into every draft layer’s KV cache for higher acceptance
4️⃣ Spec V2 overlap scheduler: +33% end-to-end

Read the code, deploy a DFlash server, and start experimenting!
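
Editor's sketch: the draft-then-verify loop behind the points in this post can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not SGLang's or DFlash's implementation. The names `toy_target`, `toy_drafter`, `verify_block`, and `speculative_generate` are hypothetical. Acceptance is plain greedy matching, and the target is called once per position for readability, whereas a real engine scores the whole block in a single target forward pass. KV injection and the overlap scheduler are not modeled.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]            # target: context -> next token
DraftFn = Callable[[List[Token], int], List[Token]]  # drafter: context, k -> k tokens


def verify_block(context: List[Token], draft: List[Token], target_next: NextFn) -> List[Token]:
    """Greedy verification: keep the longest draft prefix the target agrees with,
    then append one token chosen by the target (a correction or a bonus token)."""
    committed: List[Token] = []
    for tok in draft:
        expected = target_next(context + committed)
        if tok != expected:
            committed.append(expected)  # first mismatch: take the target's token and stop
            return committed
        committed.append(tok)
    committed.append(target_next(context + committed))  # whole block accepted: bonus token
    return committed


def speculative_generate(prompt: List[Token], draft_block: DraftFn, target_next: NextFn,
                         block_size: int, max_new: int) -> Tuple[List[Token], int]:
    """Run draft/verify steps until max_new tokens exist; return (tokens, step count)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        draft = draft_block(out, block_size)  # the drafter proposes a full block at once
        out.extend(verify_block(out, draft, target_next))
        steps += 1
    return out[: len(prompt) + max_new], steps


def greedy_generate(prompt: List[Token], target_next: NextFn, max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


# Hypothetical toy models: the target counts modulo 10; the drafter guesses the
# same pattern but is wrong about what follows a 7.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token], k: int) -> List[Token]:
    block, last = [], ctx[-1]
    for _ in range(k):
        last = 0 if last == 7 else (last + 1) % 10
        block.append(last)
    return block


tokens, steps = speculative_generate([0], toy_drafter, toy_target, block_size=4, max_new=12)
print(tokens == greedy_generate([0], toy_target, 12), steps)
```

In this toy setup the speculative output matches plain greedy decoding token for token while using fewer verify steps than generated tokens, which is the intuition behind "same quality, more speed". The more draft tokens the target accepts per step, the fewer target passes are needed.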

Zhijian Liu@zhijianliu_

This is what DFlash was built for. ⚡

Our block-diffusion drafter + KV injection, now running at frontier scale — thanks to @modal and @sgl_project for the engine + integration support!

Modal@modal

We worked with @lmsysorg and http://z-lab.ai to
- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B

The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.
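
Editor's note: the two headline ratios imply a rough figure for native MTP relative to baseline. The sketch below only divides the quoted numbers. It assumes both ratios come from the same benchmark setting (the thread cites concurrency 1, HumanEval, 8xB200), and since the sources say "up to" and ">", the result is approximate. The helper name `implied_speedup` is hypothetical.

```python
def implied_speedup(a_vs_baseline: float, a_vs_b: float) -> float:
    """If A is a_vs_baseline times faster than baseline and a_vs_b times
    faster than B, then B is (a_vs_baseline / a_vs_b) times faster than baseline."""
    return a_vs_baseline / a_vs_b


# Figures quoted in the thread: DFlash is ~4.3x baseline and ~1.5x native MTP.
mtp_vs_baseline = implied_speedup(4.3, 1.5)
print(f"native MTP is roughly {mtp_vs_baseline:.2f}x baseline")  # prints 2.87
```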

Zhijian Liu@zhijianliu_

Jointly trained & released the Qwen 3.5 397B-A17B drafter with @modal and @lmsysorg — weights on @huggingface: https://huggingface.co/z-lab/Qwen3.5-397B-A17B-DFlash

Same recipe already powers Xiaomi MiMo at >1k tok/s.

Modal@modal

@jianchen1799 @liin1211 You can find the drafter on @huggingface, where we've each released an identical copy of the weights. Kinda like getting matching tats with your bestie

Our copy is here: https://huggingface.co/modal-labs/Qwen3.5-397B-A17B-DFlash

The repos include scripts that reproduce our benchmark showing superiority over MTP:

Zhijian Liu@zhijianliu_

Full write-up: https://www.lmsys.org/blog/2026-06-15-next-generation-speculative-decoding-dflash-v2/

DFlash generalizes to most target LLMs — want it on yours? Let us know!

LMSYS Org@lmsysorg

Full blog link: https://www.lmsys.org/blog/2026-06-15-next-generation-speculative-decoding-dflash-v2/

Modal@modal

You can read about DFlash, the SGLang Spec V2 overlap scheduler, and how it all came together on the @lmsysorg blog:

https://www.lmsys.org/blog/2026-06-15-next-generation-speculative-decoding-dflash-v2/

Zhijian Liu@zhijianliu_

🚀 DFlash now runs on SGLang's new default speculative-decoding engine, Spec V2.

⚡️ Hitting >4.3× baseline throughput (1.5× over native MTP) on Qwen 3.5 397B-A17B. Same quality, more speed!

⭐ http://github.com/z-lab/dflash

Anders Lie@anderslie

@modal @lmsysorg @sgl_project @Alibaba_Qwen @jianchen1799 @liin1211 Huge, was hacking around with dflash on sglang but def had some issues before. Can't wait for more drafters to be trained (GLM/kimi esp)

RadixArk@radixark

With @modal and http://z-lab.ai, we made >4.3X throughput the new default in SGLang together. Thanks to Qiaolin Yu (@liin1211), Liangsheng Yin (@lsyincs), and Khoa Pham (@kwafam7) for landing the integration!

Zhijian Liu@zhijianliu_

So glad DFlash landed in SGLang's new default Spec V2 engine — block-diffusion drafting + KV injection, now available to everyone serving on SGLang.

Huge thanks to @modal and @sgl_project for the engine + integration support!

Anders Lie@anderslie

@zhijianliu_ @modal @sgl_project super underrated tech, appreciate what you guys are working on! excited to see more drafters trained now that serving support is improving for it

The Hive@theHiveryIQ

4.31x is real work — block diffusion drafting plus overlap is a nice win. One thought: DFlash already puts a verify step at the center of the loop, target model accepts or rejects the drafter’s block. That verification proves the tokens are right. It does not prove which drafter ran or what the target accepted. As tok/s climbs, more decisions go unattested per second. A signed receipt over draft-and-accept closes that gap. Post-quantum, ~7ms, no TEE
