NVIDIA Team Boosts MoE Throughput With Waterfill And LPLB Balancing

Got to work on this one with the @nvidia team — a genuinely fun systems problem 🙏

EPLB balances MoE experts offline, but the router never emits perfectly balanced traffic, so every live batch still skews load and the EP group waits on its busiest rank. Waterfill + LPLB close that gap at dispatch time, no change to model semantics.

➡️ Waterfill pours the dense shared expert onto lighter ranks instead of every rank paying it locally, with near-zero overhead via shared-expert fusion into the DeepEP layout.
➡️ LPLB solves a per-layer LP on-GPU each batch to split redundant-replica traffic optimally and shrink the busiest rank.

Throughput gains reach +7.34% with LPLB and +4.92% with Waterfill on DeepSeek V4 Flash, with accuracy fully preserved.

Thanks @gazhitt, Fei Liang & Aichen Feng!

LMSYS Org @lmsysorg

🚀 New blog: Improving DeepEP MoE Load Balance in SGLang with Waterfill and LPLB

We're introducing two dispatch-time load balancers for DeepEP MoE. Even with EPLB, a single batch still hits ranks unevenly. Waterfill and LPLB fix that residual imbalance at runtime, no change to model semantics.

1️⃣ Waterfill for the dense shared expert
Pours shared-expert work onto lighter ranks (“filling the valleys”) instead of every rank paying it locally. Near-zero overhead via shared-expert fusion into the DeepEP layout.
⚡️ +1.48% to +4.66% on DeepSeek V3/R1 across MMLU, GPQA, GSM8K
⚡️ V4 Flash: 49,253 → 51,677 tok/s (+4.92%)
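The "filling the valleys" idea can be illustrated with the classic water-filling calculation. This is a minimal sketch under stated assumptions, not the SGLang implementation: the function name `waterfill`, the per-rank token counts, and the fractional (non-integer) token shares are all hypothetical. A real dispatcher would round to whole tokens and fuse the result into the DeepEP layout.

```python
def waterfill(loads, total):
    """Spread `total` units of shared-expert work over ranks so that the
    busiest rank ends up as light as possible.

    loads: routed-expert load already on each rank (hypothetical token counts)
    total: shared-expert work to distribute across the ranks
    Returns (water_level, extra_work_per_rank).
    """
    sorted_loads = sorted(loads)
    n = len(sorted_loads)
    prefix = 0.0
    level = 0.0
    for k in range(1, n + 1):
        prefix += sorted_loads[k - 1]
        # Water level if only the k lightest ranks receive shared-expert work.
        level = (prefix + total) / k
        # Stop once the level no longer spills over the next-heavier rank.
        if k == n or level <= sorted_loads[k]:
            break
    extra = [max(0.0, level - load) for load in loads]
    return level, extra


routed = [10, 4, 6, 8]          # assumed routed-expert tokens per rank
shared_total = 8                # assumed shared-expert tokens in the batch

# Baseline: every rank pays its local share (2 tokens each) -> busiest rank is 12.
baseline_max = max(load + shared_total / len(routed) for load in routed)

# Waterfill: only the lighter ranks absorb shared work -> busiest rank stays at 10.
level, extra = waterfill(routed, shared_total)
waterfill_max = max(load + e for load, e in zip(routed, extra))
print(baseline_max, waterfill_max)
```

In this toy batch the heaviest rank receives no shared-expert work at all, so the step time is set by its routed load alone.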

2️⃣ LPLB for redundant routed-expert replicas
EPLB splits hot experts evenly, but live traffic drifts from calibration. LPLB solves a per-layer min-max LP on-GPU each batch to split replica traffic optimally and shrink the busiest rank.
⚡️ +0.84% to +7.34%, strongest when redundant replicas exist (red16/red32)
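One plausible way to write such a per-layer min-max LP is sketched below. The notation is an assumption for illustration, not taken from the blog: $t_e$ is the number of tokens routed to logical expert $e$ in the current batch, $P_e$ is the set of ranks holding a replica of $e$, $b_r$ is the load on rank $r$ that cannot be moved, $x_{e,r}$ is the traffic for $e$ sent to its replica on rank $r$, and $z$ is the load of the busiest rank.

```latex
\begin{aligned}
\min_{x,\,z}\quad & z \\
\text{s.t.}\quad
  & \sum_{r \in P_e} x_{e,r} = t_e
      && \text{for every replicated expert } e, \\
  & b_r + \sum_{e \,:\, r \in P_e} x_{e,r} \le z
      && \text{for every rank } r, \\
  & x_{e,r} \ge 0
      && \text{for every } e,\ r \in P_e .
\end{aligned}
```

As a small worked case, take two ranks with $b = (6, 2)$ and one expert replicated on both with $t_e = 10$. An even split sends 5 tokens to each replica, giving loads $(11, 7)$ and $z = 11$. The LP optimum is $x = (3, 7)$, giving loads $(9, 9)$ and $z = 9$.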

Both preserve accuracy: same logical top-k, identical replica weights, only the physical rank changes.
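The accuracy argument can be seen in a toy model. This is a minimal sketch with hypothetical names (`LOGICAL_WEIGHTS`, `PHYSICAL_SLOTS`, `moe_output`) and scalar "experts": the router fixes the logical top-k, every replica carries the same weights, and only the choice of physical slot differs.

```python
# Toy model (all names hypothetical): each logical expert is one scalar weight.
LOGICAL_WEIGHTS = {0: 2.0, 1: -1.0, 2: 0.5}

# physical slot -> (rank, logical expert). Expert 0 has a redundant replica on rank 1.
PHYSICAL_SLOTS = {0: (0, 0), 1: (0, 1), 2: (1, 2), 3: (1, 0)}


def replicas_of(expert):
    """Physical slots that host a copy of the given logical expert."""
    return sorted(s for s, (_, hosted) in PHYSICAL_SLOTS.items() if hosted == expert)


def moe_output(x, topk, pick):
    """Combine top-k expert outputs; `pick` selects one replica slot per expert."""
    out = 0.0
    for expert, gate in topk:
        slot = pick(replicas_of(expert))
        _rank, hosted = PHYSICAL_SLOTS[slot]
        # The replica's weights are the logical expert's weights, whatever the rank.
        out += gate * LOGICAL_WEIGHTS[hosted] * x
    return out


topk = [(0, 0.7), (2, 0.3)]  # logical top-k and gate values chosen by the router
via_first = moe_output(3.0, topk, pick=lambda slots: slots[0])
via_last = moe_output(3.0, topk, pick=lambda slots: slots[-1])
print(via_first == via_last)  # True: only the physical rank changed
```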

Huge thanks to the @nvidia team for the collaboration!