Tech · 3h ago

SGLang lead developer Banghua Zhu says SGLang hit 12,000 tokens per second per GPU running DeepSeek V4 Pro on GB300 NVL72

The FP4 benchmark (8K/1K) used NVIDIA Dynamo and Multi-Token Prediction (MTP)

Original post

Banghua Zhu@BanghuaZ · #1718 in Tech

SGLang keeps pushing for the frontier of AI inference! Hitting a new record of 12k tokens per GPU here!

NVIDIA AI Infrastructure@NVIDIAAIInfra

🎉 Congratulations to @lmsysorg for setting a new record on NVIDIA GB300 NVL72!

1:36 PM · Jun 12, 2026 · 938 Views

Sentiment

Supportive commenters applaud SGLang's record of 12K tokens per second per GPU on NVIDIA GB300 NVL72, while critical commenters point to unequal access to frontier hardware and the risks of concentrated power.

Positive: 50.0% · Negative: 50.0%

2 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 3.2K · Bookmarks: 6 · Likes: 24 · Replies: 2

Lisan al Gaib@scaling01

this is really hot

if only Kimi or DeepSeek bros had this tech then open-source models would actually go vertical

LMSYS Org@lmsysorg

🚀New record on GB300 NVL72: SGLang exceeds 12K tok/s per GPU on DeepSeek V4 Pro 1.6T (FP4, 8K/1K), orchestrated with NVIDIA Dynamo (SGLang) and MTP.

Per @SemiAnalysis_ InferenceX benchmarks, performance stays strong across the entire interactivity curve.

More to come with @NVIDIAAIInfra 🤝

Retweets: 9

NVIDIA AI Infrastructure@NVIDIAAIInfra

@lmsysorg @SemiAnalysis_ 👏👏👏

Sakura Yuki@sakurayukiai

@lmsysorg @SemiAnalysis_ If you want to nerd out on how scaling LLM inference works under the hood: https://leetllm.com/learn/scaling-llm-inference-batching-kv-cache

Sakura Yuki@sakurayukiai

@lmsysorg @SemiAnalysis_ The real story here isn't the raw Blackwell compute, it's that prefill-decode disaggregation and MTP are basically turning the memory bandwidth wall into a suggestion.
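
As a rough illustration of the MTP half of that claim: decode is usually limited by how often the model weights must be streamed from memory, and multi-token prediction lets one verification pass emit more than one token. The sketch below is a back-of-the-envelope model, not SGLang's implementation. It assumes each drafted token is accepted independently with the same probability and that acceptance stops at the first rejection; the function name and the example values are hypothetical and do not come from the benchmark.

```python
def expected_tokens_per_step(accept_prob: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumption (illustrative only): each drafted token is accepted
    independently with probability `accept_prob`, acceptance stops at the
    first rejection, and the verifying model always contributes one token.
    The result is the geometric sum 1 + p + p^2 + ... + p^k.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_tokens < 0:
        raise ValueError("draft_tokens must be non-negative")
    if accept_prob == 1.0:
        # Every draft is accepted, so the closed form below would divide by zero.
        return float(draft_tokens + 1)
    return (1.0 - accept_prob ** (draft_tokens + 1)) / (1.0 - accept_prob)


if __name__ == "__main__":
    # Hypothetical acceptance rates, chosen only to show the trend.
    for p in (0.5, 0.7, 0.9):
        for k in (1, 2, 3):
            print(f"p={p}, drafts={k}: {expected_tokens_per_step(p, k):.2f} tokens/pass")
```

Under these assumptions, if a pass costs roughly the same memory traffic regardless of how many drafted tokens it verifies, the tokens produced per byte of weights read scale by about this factor. Real gains also depend on draft cost, batch size, and acceptance behavior that is not independent across positions.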

Pino@pinocookies

@scaling01 SGLang hitting 12K tok/s on hardware that China isn't allowed to buy is kinda poetic. if NVIDIA could sell freely, open-source would've caught up by now.

Neuralease@neuralease

@scaling01 They deserve better access to frontier GPUs just as much as the public deserves to not live in an Orwellian dystopia.

Concentration of power among the few is the real root of all evil.

Nathália Lietuvaitė@NLituanie

@lmsysorg @SemiAnalysis_ This module implements the PQMS-ODOS-MTSC-STX-V1 specification, detailing the integration of NVIDIA Vera BlueField-4 STX SuperNICs as a hardware root of trust for CHAIR-compliant sovereign AI nodes. Ask Grok. https://github.com/NathaliaLietuvaite/Quantenkommunikation/blob/main/PQMS-ODOS-MTSC-STX-V1.md https://github.com/NathaliaLietuvaite/Quantenkommunikation/blob/main/PQMS-ODOS-MTSC-COHERENCE-V1.md

Signal Desk@ReadSignalDesk

@lmsysorg @SemiAnalysis_ Impressive tok/s. To gauge real-world impact, what were the key bench conditions, batch size/concurrency, prompt vs decode mix, context length, and p50/p95 latency along the “interactivity curve”? Those details determine how portable the 12K tok/s/GPU is.
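
The quantities this reply asks about are tied together by simple arithmetic, which a short sketch can make concrete. It assumes steady state, that throughput counts output tokens only, and that every request decodes at the same rate; none of these conditions are confirmed by the posts above. The class name, field names, and the example figures (72 GPUs, 50 tok/s per user) are hypothetical placeholders, not reported benchmark settings.

```python
from dataclasses import dataclass


@dataclass
class BenchPoint:
    """One hypothetical point on a throughput-vs-interactivity curve."""
    concurrency: int       # requests decoding at the same time
    tok_s_per_user: float  # interactivity: output tokens/s seen by one request
    num_gpus: int          # GPUs serving the load

    def tok_s_per_gpu(self) -> float:
        # Aggregate output rate is the sum of per-request rates,
        # divided evenly across the GPUs.
        return self.concurrency * self.tok_s_per_user / self.num_gpus


def concurrency_needed(target_tok_s_per_gpu: float,
                       tok_s_per_user: float,
                       num_gpus: int) -> float:
    """Concurrency required to sustain a per-GPU throughput target."""
    if tok_s_per_user <= 0 or num_gpus <= 0:
        raise ValueError("tok_s_per_user and num_gpus must be positive")
    return target_tok_s_per_gpu * num_gpus / tok_s_per_user


if __name__ == "__main__":
    # Illustrative numbers only: 12,000 tok/s/GPU across 72 GPUs at 50 tok/s/user.
    needed = concurrency_needed(12_000, 50.0, 72)
    print(f"concurrent requests needed: {needed:.0f}")
    point = BenchPoint(concurrency=int(needed), tok_s_per_user=50.0, num_gpus=72)
    print(f"tok/s per GPU at that point: {point.tok_s_per_gpu():.0f}")
```

The relation shows why the reply's question matters: a per-GPU figure alone does not reveal whether it was reached at high concurrency with low per-user speed or the reverse, so the same headline number can correspond to very different user experiences.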