🐋 DeepSeek V4 is now merged into SGLang main with v0.5.12.
What we shipped at launch:
🔹 ShadowRadix: native prefix caching for V4's hybrid attention
🔹 HiSparse: CPU-extended KV for sparse attention (up to 3× long-context throughput)
🔹 MTP speculative decoding with in-graph metadata preparation
🔹 W4A8 MegaMoE kernel
🔹 Flash Compressor + Lightning TopK kernels
🔹 Multiple parallelism methods: Tensor Parallelism/Expert Parallelism/Context Parallelism/Data Parallelism Attention
🔹 Prefill Decode Disaggregation
🔹 Hardware: H100, H200, B200, B300, GB200, GB300, MI35X
And what we added since:
🔹 HiCache for V4 under UnifiedRadixTree
🔹 W4A4 MegaMoE kernels for faster MegaMoE
🔹 Marlin/FlashInfer MXFP4 (W4A16) MoE on Hopper
🔹 Hierarchical multi-stream overlap for small-batch decode
🔹 Optimized mHC pipeline: DeepGemm + fused norm + fused hc_head
🔹 Faster KV Compression V2 kernel
🔹 Fused SiLU+clamp+FP8 quantization kernel
🔹 Support TP16 on H100/H20
🔹 Support Multiple Detokenizers
🔹 Pipeline Parallelism
🔹 One docker image for all supported NVIDIA hardware
Thanks to @NVIDIAAI, @AMD, @ant_oss, @alibaba_cloud, ByteDance, @iFLYTEKLab, @radixark, and @pranjalssh for the work we shipped together on V4 🙌
More in 0.5.12 👇




