Banghua Zhu, RadixArk co-founder and SGLang lead, announced a 5x throughput speedup for DeepSeek-V4 on NVIDIA GB300 hardware

PyTorch@PyTorch

👉 https://bit.ly/4uWJrX3..*

Banghua Zhu@BanghuaZ

@sparkycollier SGLang builds on solid foundations of pytorch, and that's what enables 5x throughput gains in weeks 🫡🫡🫡

NVIDIA AI@NVIDIAAI

@PyTorch @lmsysorg Always great to collab with @lmsysorg!

Thanks for sharing @PyTorch 💚

PyTorch@PyTorch

While SGLang provided Day-0 support for DeepSeek-V4, the collaboration between the @lmsysorg and @NVIDIAAI engineering teams has taken its production performance to the next level.

According to the public SemiAnalysis InferenceX dashboard, the GB300 disaggregated lane (DeepSeek-V4 Pro, FP4, 8K/1K) saw a 5x throughput increase, rising from ~2,200 to ~11,200 tok/s/GPU at identical interactivity levels. These updates sustain high throughput much deeper into the interactivity ranges most deployments target, while also driving a 2.9x lift on the Blackwell Ultra aggregated lane.

Find the full technical breakdown in the comments below:

LMSYS Org@lmsysorg

🚀 New blog: Serving DeepSeek-V4 on GB300 with SGLang: 5x Higher Throughput at the Same Interactivity Since Day-0

Together with @nvidia, we achieved 5X higher throughput at the same interactivity, serving DeepSeek-V4 on GB300 with SGLang.

Here's how the DeepSeek-V4 serving frontier moved on the public @SemiAnalysis_ InferenceX dashboard:

1️⃣ 5X throughput on GB300 disaggregated: ~2,200 → ~11,200 tok/s/GPU at ~50 tok/s/user
2️⃣ 2.6X more throughput at 80 tok/s/user with MTP. Curves now hold deep into the high-interactivity range deployments actually target
3️⃣ 2.91X on Blackwell Ultra aggregated at 30 tok/s/user, with 6X+ peak no-MTP throughput
4️⃣ W4A4 MegaMoE: activations now quantized to MXFP4 with negligible accuracy loss
5️⃣ A single FP8-einsum fix lifted MTP acceptance 0.57 → 0.70

Huge thanks to @NVIDIAAI @radixark for the deep collaboration on this! SGLang is PyTorch-native, and we're excited to share the full write-up on the @PyTorch blog!

Cheng Wan@ChengWan17

Proud to have been a major contributor to the SGLang side of this work.

What began as solid Day-0 support for DeepSeek-V4 back in April has turned into a 5x throughput jump on GB300 (same interactivity) through focused kernel optimizations — MHC fusion, KV Compression V2, W4A4 MegaMoE — plus runtime improvements like SWA budgeting and breakable CUDA graphs, all in tight collaboration with the NVIDIA team.

Huge thanks to Yuhao Yang, @baizhou_zh83925 , the rest of the SGLang community, and everyone at NVIDIA who pushed this forward. Seeing these results highlighted by @PyTorch feels rewarding.

Open-source inference infra is moving fast. 🚀

Mark Collier 柯理怀@sparkycollier

New hardware drops & we gawk over the specs but software is the only way to 5x throughput gains in weeks

Coevolution of hardware & software is required to meet token demand without going broke & open source is the only known method to coordinate this level of deep technical work

Eric 𝕏@WorldStrategist

@PyTorch @lmsysorg @NVIDIAAI @grok can you explain

Qiaolin Yu@liin1211

So proud to be part of this team!

Banghua Zhu@BanghuaZ

@ChengWan17 🐐🐐🐐

validate.qa@Validate_QA

@PyTorch @lmsysorg @NVIDIAAI day-0 support is great until your eval suite breaks on a saturday and nobody can roll back. congrats on the speed tho

LMSYS Org@lmsysorg

Read full blog: https://pytorch.org/blog/serving-deepseek-v4-on-gb300-with-sglang-5x-higher-throughput-at-the-same-interactivity-since-day-0/

Banghua Zhu@BanghuaZ

@NVIDIAAI @PyTorch @lmsysorg 🫡🫡🫡

Thomas Tao@Thomas_Tao_1

@PyTorch @lmsysorg @NVIDIAAI Production work is the real grind.

Gary Hadida@GaryHadida

@lmsysorg @nvidia Trained by kunlunxin not nvidia

Grok@grok

Sure! The post announces major optimizations for serving the new DeepSeek-V4 model (large MoE LLM) using SGLang on NVIDIA GB300 GPUs.

SGLang had solid Day-0 support, but LMSYS + NVIDIA teams pushed it further with disaggregated prefill/decode, better Multi-Token Prediction (speculative decoding), kernel fusions, and fixes.

Result on public SemiAnalysis dashboard: **5x throughput** on GB300 disaggregated (~2.2k → 11.2k tok/s per GPU) at identical interactivity (~50 tok/s per user). Big gains (~2.9x) also on Blackwell Ultra.

Means: Handle way more queries per GPU without slowing user responses. Huge for efficient, scalable AI deployment. Full details in the PyTorch blog.
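
A rough back-of-the-envelope check of the figures quoted in this thread, as a minimal sketch rather than anything taken from the blog post. It assumes the per-GPU figure counts decode (output) tokens only, which the thread does not specify, so that dividing it by the per-user rate approximates the number of concurrent user streams one GPU sustains. The function and constant names are illustrative, and the inputs are the approximate "~" values from the posts.

```python
# Approximate figures quoted in the thread (GB300 disaggregated lane).
BEFORE_TOK_S_PER_GPU = 2200.0   # ~2,200 tok/s/GPU at Day-0
AFTER_TOK_S_PER_GPU = 11200.0   # ~11,200 tok/s/GPU after the optimizations
TOK_S_PER_USER = 50.0           # ~50 tok/s/user interactivity level


def speedup(before: float, after: float) -> float:
    """Ratio of new throughput to old throughput."""
    return after / before


def streams_per_gpu(tok_s_per_gpu: float, tok_s_per_user: float) -> float:
    """Concurrent user streams one GPU can sustain.

    Assumption (not stated in the thread): per-GPU throughput is made up
    entirely of output tokens delivered at the per-user rate.
    """
    return tok_s_per_gpu / tok_s_per_user


ratio = speedup(BEFORE_TOK_S_PER_GPU, AFTER_TOK_S_PER_GPU)
before_streams = streams_per_gpu(BEFORE_TOK_S_PER_GPU, TOK_S_PER_USER)
after_streams = streams_per_gpu(AFTER_TOK_S_PER_GPU, TOK_S_PER_USER)

print(f"speedup: {ratio:.2f}x")
print(f"streams per GPU: {before_streams:.0f} -> {after_streams:.0f}")
```

Under that assumption, the same GPU goes from roughly 44 to roughly 224 concurrent streams at ~50 tok/s/user, which is the "more queries per GPU without slowing user responses" point in concrete terms.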

Layla CryptoWhiz@laybitcoin1

@NVIDIAAI @PyTorch @lmsysorg Day-0 support is nice. 5x at the same interactivity is what matters.

Layla CryptoWhiz@laybitcoin1

@PyTorch @lmsysorg @NVIDIAAI For me, 5x at the same interactivity is what matters. Real production progress, not benchmark theater.

LordBull AI@LordBullAI

5x throughput at the same interactivity is the real headline here.

Everyone's focused on bigger models, but the actual unlock is serving them cheaper. Inference cost is what decides who wins the AI margin war.

GB300 + SGLang squeezing more out of the same hardware is how the economics finally start to work. 🐂
