AMD MI355X Beats B200 On TCO For DeepSeek-R1 Distributed Inference

Original post

🚀 New blog: Win on TCO: How AMD Instinct™ MI355X Achieves Cost-Competitive Distributed Inference Through SGLang with MoRI

AMD Instinct™ MI355X beats B200 on TCO for DeepSeek-R1 disaggregated inference, with 5% lower cost than B200 TRT-LLM and 1.25× higher throughput/GPU than B200 SGLang.

Together with @AMD, we achieved competitive TCO through full-stack optimizations:

1. MoRI quantized all-to-all (FP4 dispatch + FP8 combine): 2.56× bandwidth reduction
2. MoRI-IO KV cache backend: ~10% higher throughput than Mooncake
3. Two-Batch Overlap with SDMA: zero-compute-overhead async transfers
4. AITER GEMM + FlyDSL FusedMoE: tuned kernels for TP & DP+EP on MI355X
5. Specv2 MTP on ROCm: delivers +4% total token throughput and -3.6% TPOT
6. CPU streaming: +20% output throughput, -16% TPOT at 2,048 concurrency

Results live on @SemiAnalysis_ InferenceX dashboard.

9:40 AM · May 28, 2026