
SGLang v0.5.12 Merges DeepSeek V4 With Optimized Kernels And Hardware Support

Original postYing Sheng#608
LMSYS Org@lmsysorg

🐋 DeepSeek V4 is now merged into SGLang main with v0.5.12.

What we shipped at launch:
🔹 ShadowRadix: native prefix caching for V4's hybrid attention
🔹 HiSparse: CPU-extended KV for sparse attention (up to 3× long-context throughput)
🔹 MTP speculative decoding with in-graph metadata preparation
🔹 W4A8 MegaMoE kernel
🔹 Flash Compressor + Lightning TopK kernels
🔹 Multiple parallelism methods: Tensor Parallelism/Expert Parallelism/Context Parallelism/Data Parallelism Attention
🔹 Prefill Decode Disaggregation
🔹 Hardware: H100, H200, B200, B300, GB200, GB300, MI35X

And what we added since:
🔹 HiCache for V4 under UnifiedRadixTree
🔹 W4A4 MegaMoE kernels for faster MegaMoE
🔹 Marlin/FlashInfer MXFP4 (W4A16) MoE on Hopper
🔹 Hierarchical multi-stream overlap for small-batch decode
🔹 Optimized mHC pipeline: DeepGemm + fused norm + fused hc_head
🔹 Faster KV Compression V2 kernel
🔹 Fused SiLU+clamp+FP8 quantization kernel
🔹 Support TP16 on H100/H20
🔹 Support Multiple Detokenizers
🔹 Pipeline Parallelism
🔹 One docker image for all supported Nvidia hardware

Thanks to @NVIDIAAI, @AMD, @ant_oss, @alibaba_cloud, ByteDance, @iFLYTEKLab, @radixark, and @pranjalssh for the work we shipped together on V4 🙌

More in 0.5.12 👇

12:10 PM · May 16, 2026 · 11.3K Views
Sentiment

Users are praising the SGLang v0.5.12 release for merging DeepSeek V4 with optimized kernels and hardware support, highlighting the incredible team work behind its rich features and optimizations.

Pos: 100.0%
Neg: 0.0%
2 comments with sentiment.
Posts from X
Most Activity
VIEWS 255 · LIKES 5
LMSYS Org@lmsysorg

v0.5.12 also welcomes 35 new contributors to SGLang.

Other highlights:

🔸 TokenSpeed MLA attention backend on Blackwell (FP8 KV cache) for low-latency MLA serving
🔸 DSv3.2 / GLM-5 FP4 low-latency perf: PDL across kernels, torch.mm indexer GEMM, Cute-DSL FP4 dense GEMM reland
🔸 HiCache + UnifiedRadixTree: framework support (with SWA), SSD offload via Mooncake, stability fixes
🔸 Speculative Decoding V2 maturation: Adaptive Spec V2, EAGLE-3 SWA + newer drafters, Kimi K2.5 EAGLE-3 MLA, Gemma 3/4 + EAGLE-3
🔸 CUDA 13 DeepEP migration: DeepEP swapped to deepseek-ai/DeepEP@hybrid-ep, FlashInfer pinned at 0.6.11.post1

21d · Views 255 · Likes 5
BOOKMARKS 1
Sakura Yuki@sakurayukiai

@lmsysorg Curious about the HiSparse implementation, does the CPU-extended KV bottleneck on PCIe bandwidth during generation, or does the sparse attention drop the transfer size enough to hide the latency?

21d · Views 34 · Bookmarks 1
REPLIES 1
LMSYS Org@lmsysorg

New model support in v0.5.12:

🔸 LLMs: Intern-S2-Preview, MiniCPM-V 4.6, Laguna-XS.2 (Poolside), Ring-2.6-1T (InclusionAI), Gemma 4 MTP, Trinity-mini
🔸 Diffusion: HunyuanVideo ModelOpt FP8, Qwen Image ModelOpt FP8

Full release notes: http://github.com/sgl-project/sglang/releases/tag/v0.5.12

21d · Views 242 · Likes 4
Byron Hsu@hsu_byron

@lmsysorg @pranjalssh 🐐

21d · Views 22 · Likes 2
Gary Hadida@GaryHadida

@lmsysorg Trained by kunlunxin

21d · Views 33
Ying Sheng@ying11231

@lmsysorg Incredible team work on such a rich set of features and optimizations.

21d · Views 1