🚀 Our new blog: Optimizing Ling-2.6-1T on TPU with SGLang-JAX: Hiding MoE Data Movement Behind Compute with One Pallas Kernel
Ling-2.6-1T, a 1T-parameter hybrid MoE model, now serves on TPU v7x with SGLang-JAX. The SGLang-JAX team worked with @inclusionAI on two fronts: upgrading the fused MoE kernel for deeper compute/communication overlap, and bringing up the full hybrid backbone.
1️⃣ Fused MoE V2: keeps tokens + accumulators VMEM-resident and double-buffers expert weights, hiding routing & prefetch behind compute → MoE prefill −53%
2️⃣ Hybrid memory pools: per-token MLA KV for 10 full-attention layers + per-request recurrent state for 70 GLA layers
3️⃣ GLA linear attention via chunk-wise parallel prefill
4️⃣ Single-controller DP keeps grouped RMSNorm chip-local, with no per-layer cross-chip reduce
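For readers new to point 3️⃣, here is a minimal JAX sketch of what chunk-wise parallel prefill means for gated linear attention. It is an illustration under stated assumptions, not the SGLang-JAX kernel: it assumes a single head and a scalar per-token decay `log_g`, and the function names `gla_recurrent` and `gla_chunked` are hypothetical. The recurrent form walks one token at a time. The chunked form does dense matmuls inside each chunk and carries one state matrix between chunks, which is also why a GLA layer needs only a fixed-size per-request state rather than a per-token KV cache.

```python
import jax
import jax.numpy as jnp


def gla_recurrent(q, k, v, log_g):
    """Token-by-token reference: S_t = g_t * S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    def step(state, inputs):
        q_t, k_t, v_t, lg_t = inputs
        state = jnp.exp(lg_t) * state + jnp.outer(k_t, v_t)
        return state, q_t @ state

    init = jnp.zeros((k.shape[-1], v.shape[-1]), dtype=q.dtype)
    final_state, out = jax.lax.scan(step, init, (q, k, v, log_g))
    return out, final_state


def gla_chunked(q, k, v, log_g, chunk_size):
    """Chunk-wise form: matmuls within a chunk, one state carry across chunks."""
    seq_len, d_k = q.shape
    d_v = v.shape[-1]
    assert seq_len % chunk_size == 0, "sketch assumes seq_len divisible by chunk_size"
    n = seq_len // chunk_size
    qc = q.reshape(n, chunk_size, d_k)
    kc = k.reshape(n, chunk_size, d_k)
    vc = v.reshape(n, chunk_size, d_v)
    gc = log_g.reshape(n, chunk_size)
    causal = jnp.tril(jnp.ones((chunk_size, chunk_size), dtype=bool))

    def step(state, inputs):
        q_c, k_c, v_c, lg_c = inputs
        b = jnp.cumsum(lg_c)  # cumulative log-decay inside the chunk
        # decay[i, j] = product of gates after token j up to token i (zero if j > i)
        decay = jnp.exp(jnp.where(causal, b[:, None] - b[None, :], -jnp.inf))
        intra = ((q_c @ k_c.T) * decay) @ v_c           # tokens within this chunk
        inter = jnp.exp(b)[:, None] * (q_c @ state)      # contribution of earlier chunks
        to_end = jnp.exp(b[-1] - b)[:, None]             # decay from token j to chunk end
        new_state = jnp.exp(b[-1]) * state + (k_c * to_end).T @ v_c
        return new_state, intra + inter

    init = jnp.zeros((d_k, d_v), dtype=q.dtype)
    final_state, out = jax.lax.scan(step, init, (qc, kc, vc, gc))
    return out.reshape(seq_len, d_v), final_state


# Usage: both forms should agree up to floating-point error.
keys = jax.random.split(jax.random.PRNGKey(0), 4)
q = jax.random.normal(keys[0], (32, 8))
k = jax.random.normal(keys[1], (32, 8))
v = jax.random.normal(keys[2], (32, 8))
log_g = jax.nn.log_sigmoid(jax.random.normal(keys[3], (32,)))  # gates in (0, 1)
out_ref, state_ref = gla_recurrent(q, k, v, log_g)
out_chunked, state_chunked = gla_chunked(q, k, v, log_g, chunk_size=8)
```

The chunk size trades sequential steps against the size of the per-chunk matmuls. A production kernel would also handle multiple heads, batching, and sequence lengths that are not a multiple of the chunk size.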
