🚀 DeepSeek V4 is now merged into SGLang main with v0.5.12.
What we shipped at launch:
🔹 ShadowRadix: native prefix caching for V4's hybrid attention
🔹 HiSparse: CPU-extended KV for sparse attention (up to 3× long-context throughput)
🔹 MTP speculative decoding with in-graph metadata preparation
🔹 W4A8 MegaMoE kernel
🔹 Flash Compressor + Lightning TopK kernels
🔹 Multiple parallelism methods: Tensor Parallelism / Expert Parallelism / Context Parallelism / Data Parallelism Attention
🔹 Prefill-Decode Disaggregation
🔹 Hardware: H100, H200, B200, B300, GB200, GB300, MI35X
And what we added since:
🔹 HiCache for V4 under UnifiedRadixTree
🔹 W4A4 MegaMoE kernels for even faster MoE
🔹 Marlin/FlashInfer MXFP4 (W4A16) MoE on Hopper
🔹 Hierarchical multi-stream overlap for small-batch decode
🔹 Optimized mHC pipeline: DeepGemm + fused norm + fused hc_head
🔹 Faster KV Compression V2 kernel
🔹 Fused SiLU + clamp + FP8 quantization kernel
🔹 TP16 support on H100/H20
🔹 Multiple detokenizer support
🔹 Pipeline Parallelism
🔹 One Docker image for all supported NVIDIA hardware
Thanks to @NVIDIAAI, @AMD, @ant_oss, @alibaba_cloud, ByteDance, @iFLYTEKLab, @radixark, and @pranjalssh for the work we shipped together on V4 🙏
More in 0.5.12 👇
