11h ago

Xiaomi's Fuli Luo details KVCache optimizations for MiMo-V2.5 on Hybrid SWA, reporting average cache hit rates of 93%–95%

Analysis argues the system still trails DeepSeek-V4 on latency.

Original post

Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions

Read the full technical blog: https://mimo.xiaomi.com/blog/mimo-v2-5-inference

The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving.

To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline.

Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%–95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.
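To make the storage claim concrete, here is a rough sketch of why mixing sliding-window layers with full-attention layers shrinks the KVCache. It is not Xiaomi's implementation: the layer counts, window size, and context length below are hypothetical, and the model assumes every layer stores the same number of bytes per cached token. Full-attention layers keep one KV entry per context token, while sliding-window layers keep at most `window` entries.

```python
def kv_cache_slots(context_len: int, full_layers: int,
                   swa_layers: int, window: int) -> int:
    """Count KV token slots held across all layers for one sequence.

    Full-attention layers cache every token in the context;
    sliding-window layers cache only the most recent `window` tokens.
    """
    full = full_layers * context_len
    swa = swa_layers * min(context_len, window)
    return full + swa


def hybrid_to_full_ratio(context_len: int, full_layers: int,
                         swa_layers: int, window: int) -> float:
    """KVCache size of the hybrid stack relative to an all-full-attention stack."""
    baseline = (full_layers + swa_layers) * context_len
    return kv_cache_slots(context_len, full_layers, swa_layers, window) / baseline


if __name__ == "__main__":
    # Hypothetical configuration, chosen only for illustration.
    ratio = hybrid_to_full_ratio(context_len=131072, full_layers=8,
                                 swa_layers=40, window=128)
    print(f"hybrid / full KVCache ratio: {ratio:.4f}")
```

Under these assumptions the ratio approaches the share of full-attention layers as the context grows, so the achievable compression depends mainly on how many layers remain full attention. For contexts shorter than the window the two designs store the same amount.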

3:41 AM · May 30, 2026

As I've been saying: tremendous engineering work, but at the end of the day, MiMo cache is still multiple times larger, loading it is slower, latency is higher, and when they match DSV4 cache hit costs, their margins are bound to be lower. No alternative to attention redesigns.

Fuli Luo (@_LuoFuli)


10:41 AM · May 30, 2026 · 82.9K Views
8:19 PM · May 30, 2026 · 1.6K Views

This is also why I'm not impressed by speculations that Gemini uses some SWA+cross-layer sharing Shazeer trickery. It's not going deep enough, even within Google's own portfolio. They could afford to do more.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex)

8:21 PM · May 30, 2026 · 756 Views