🚀 DeepSeek V4 is now merged into SGLang main with v0.5.12.
What we shipped at launch:
🔹 ShadowRadix: native prefix caching for V4's hybrid attention
🔹 HiSparse: CPU-extended KV for sparse attention (up to 3× long-context throughput)
🔹 MTP speculative decoding with in-graph metadata preparation
🔹 W4A8 MegaMoE kernel
🔹 Flash Compressor + Lightning TopK kernels
🔹 Multiple parallelism methods: Tensor Parallelism / Expert Parallelism / Context Parallelism / Data Parallelism Attention
🔹 Prefill-Decode Disaggregation
🔹 Hardware: H100, H200, B200, B300, GB200, GB300, MI35X
And what we added since:
🔹 HiCache for V4 under UnifiedRadixTree
🔹 W4A4 MegaMoE kernels for even faster MoE
🔹 Marlin/FlashInfer MXFP4 (W4A16) MoE on Hopper
🔹 Hierarchical multi-stream overlap for small-batch decode
🔹 Optimized mHC pipeline: DeepGemm + fused norm + fused hc_head
🔹 Faster KV Compression V2 kernel
🔹 Fused SiLU + clamp + FP8 quantization kernel
🔹 TP16 support on H100/H20
🔹 Multiple detokenizer support
🔹 Pipeline Parallelism
🔹 One Docker image for all supported NVIDIA hardware
Thanks to @NVIDIAAI, @AMD, @ant_oss, @alibaba_cloud, ByteDance, @iFLYTEKLab, @radixark, and @pranjalssh for the work we shipped together on V4 🙏
More in 0.5.12 👇
