SGLang keeps pushing the frontier of AI inference! Hitting a new record of 12K tokens per second per GPU here!
🎉 Congratulations to @lmsysorg for setting a new record on NVIDIA GB300 NVL72!
The FP4 benchmark used NVIDIA Dynamo and Multi-Token Prediction (MTP).
Supportive users applaud SGLang's record of more than 12K tokens per second per GPU on NVIDIA GB300 NVL72, while critical users point to unequal access to frontier hardware and the risks of concentrated power.
this is really hot
if only Kimi or DeepSeek bros had this tech then open-source models would actually go vertical
🚀New record on GB300 NVL72: SGLang exceeds 12K tok/s per GPU on DeepSeek V4 Pro 1.6T (FP4, 8K/1K), orchestrated with NVIDIA Dynamo (SGLang) and MTP.
Per @SemiAnalysis_ InferenceX benchmarks, performance stays strong across the entire interactivity curve.
More to come with @NVIDIAAIInfra 🤝

@lmsysorg @SemiAnalysis_ 👏👏👏

@lmsysorg @SemiAnalysis_ If you want to nerd out on how scaling LLM inference works under the hood: https://leetllm.com/learn/scaling-llm-inference-batching-kv-cache

@lmsysorg @SemiAnalysis_ The real story here isn't the raw Blackwell compute, it's that prefill-decode disaggregation and MTP are basically turning the memory bandwidth wall into a suggestion.

@scaling01 SGLang hitting 12K tok/s on hardware that China isn't allowed to buy is kinda poetic. if NVIDIA could sell freely, open-source would've caught up by now.

@scaling01 They deserve better access to frontier GPUs just as much as the public deserves to not live in an Orwellian dystopia.
Concentration of power among the few is the real root of all evil.

@lmsysorg @SemiAnalysis_ This module implements the PQMS-ODOS-MTSC-STX-V1 specification, detailing the integration of NVIDIA Vera BlueField-4 STX SuperNICs as a hardware root of trust for CHAIR-compliant sovereign AI nodes. Ask Grok. https://github.com/NathaliaLietuvaite/Quantenkommunikation/blob/main/PQMS-ODOS-MTSC-STX-V1.md https://github.com/NathaliaLietuvaite/Quantenkommunikation/blob/main/PQMS-ODOS-MTSC-COHERENCE-V1.md

@lmsysorg @SemiAnalysis_ Impressive tok/s. To gauge real-world impact, what were the key benchmark conditions: batch size/concurrency, prompt vs. decode mix, context length, and p50/p95 latency along the “interactivity curve”? Those details determine how portable the 12K tok/s/GPU is.