Xiaomi's Fuli Luo details KVCache optimizations for MiMo-V2.5, achieving up to 95% hit rates via Hybrid SWA

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

As I've been saying: tremendous engineering work, but at the end of the day, MiMo's cache is still multiple times larger, loading it is slower, latency is higher, and when they match DSV4 cache-hit costs, their margins are bound to be lower. There is no alternative to attention redesigns.

Fuli Luo@_LuoFuli

Inference Optimizations Behind the MiMo-V2.5 Series API Price Reductions

Read the full technical blog: https://mimo.xiaomi.com/blog/mimo-v2-5-inference

The V2.5 model family, including MiMo-V2.5 and MiMo-V2.5-Pro, is built on a Hybrid Sliding Window Attention (Hybrid SWA) architecture, which compresses KVCache storage to roughly 1/7 that of Full Attention. However, architectural advantages rarely translate directly into measurable gains in production serving. To realize these gains, we redesigned KVCache management, tiered caching, and the prefix-cache tree; addressed key challenges in SWA KVCache handling; and optimized scheduling as well as the Prefill/Decode pipeline.
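To make the "roughly 1/7" figure concrete, here is a minimal back-of-the-envelope sketch. It assumes every layer stores the same number of KV bytes per token, that sliding-window layers keep only the last `window` tokens, and that the stack interleaves full-attention and sliding-window layers at a fixed ratio. The 1:6 ratio and 128-token window used below are hypothetical values chosen for illustration; the post does not state them.

```python
def kv_cache_ratio(context_len: int, window: int,
                   n_full_layers: int, n_swa_layers: int) -> float:
    """Fraction of Full Attention KV storage used by a hybrid SWA stack.

    Assumes identical per-token KV size in every layer (an assumption,
    not a detail from the post).
    """
    # Full-attention layers cache every token in the context.
    full_tokens = n_full_layers * context_len
    # Sliding-window layers cache at most `window` tokens each.
    swa_tokens = n_swa_layers * min(window, context_len)
    # Baseline: every layer is full attention.
    baseline = (n_full_layers + n_swa_layers) * context_len
    return (full_tokens + swa_tokens) / baseline


# Hypothetical configuration: 1 full layer per 6 SWA layers, 128-token window.
ratio = kv_cache_ratio(context_len=131_072, window=128,
                       n_full_layers=1, n_swa_layers=6)
print(f"hybrid / full KV storage: {ratio:.4f}")
```

Under these assumptions the ratio approaches the share of full-attention layers as the context grows, which is why the saving only shows up on long contexts; for contexts shorter than the window the ratio is 1.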

Validated on real production traffic, these optimizations have increased effective KVCache capacity by nearly 5x, with server-side cache hit rates averaging 93%–95% across mainstream harness frameworks. Together with MoE configuration tuning and multimodal inference optimizations, they enable more efficient long-context inference and form part of what makes the recent API price cuts possible.
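For readers unfamiliar with how a server-side cache hit rate is measured, the following is a toy sketch of block-level prefix caching. It is not Xiaomi's implementation: the `PrefixCache` class, the 4-token block size, and the token-weighted hit-rate definition are all assumptions made for illustration, and the sketch ignores eviction, tiering, and SWA-specific handling.

```python
from typing import List, Set, Tuple

BLOCK = 4  # tokens per cache block (hypothetical; real systems use larger blocks)


class PrefixCache:
    """Toy prefix cache: a block is reusable only if its whole prefix matches."""

    def __init__(self) -> None:
        self.blocks: Set[Tuple[int, ...]] = set()
        self.hit_tokens = 0
        self.total_tokens = 0

    def _keys(self, tokens: List[int]) -> List[Tuple[int, ...]]:
        # Key each full block by the entire prefix ending at that block.
        # Real systems would chain hashes instead of storing whole prefixes.
        return [tuple(tokens[:end])
                for end in range(BLOCK, len(tokens) + 1, BLOCK)]

    def serve(self, tokens: List[int]) -> int:
        """Return the number of prompt tokens served from cache."""
        keys = self._keys(tokens)
        matched = 0
        for key in keys:
            if key not in self.blocks:
                break  # the first miss ends the reusable prefix
            matched += 1
        self.blocks.update(keys)  # cache this request's blocks for later reuse
        hit = matched * BLOCK
        self.hit_tokens += hit
        self.total_tokens += len(tokens)
        return hit

    def hit_rate(self) -> float:
        return self.hit_tokens / self.total_tokens if self.total_tokens else 0.0


cache = PrefixCache()
system = list(range(16))                      # shared system prompt
turn1 = system + [100, 101, 102, 103]
turn2 = turn1 + [200, 201, 202, 203]          # agent loop appends to history
print(cache.serve(turn1), cache.serve(turn2), round(cache.hit_rate(), 3))
```

Agent harnesses tend to resend a growing conversation prefix on every step, so in a sketch like this the hit rate climbs quickly as turns accumulate, which is consistent with the high hit rates reported for harness traffic.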

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

This is also why I'm not impressed by speculations that Gemini uses some SWA+cross-layer sharing Shazeer trickery. It's not going deep enough, even within Google's own portfolio. They could afford to do more.

RayLin🐡@RayLin_AI

@_LuoFuli Hybrid SWA + MoE + multimodal encoder = the best model for agents

Hybrid Sliding Window Attention reduces attention compute and KVCache cost to roughly 1/7 of Full Attention🔥🔥🔥

藏子世间@ZISHIJIAN

@_LuoFuli But MiMo’s speed limit is ridiculously bad, practically useless.

Hassan@buildwithhassan

@_LuoFuli finally a model launch where the interesting number isn't on a benchmark

mfmfazrin@farook_fazrin

@_LuoFuli The API is failing constantly. I just sent a message replying "Ok" and 35,000 tokens were gone. Your price reduction looks like a pure scam.

s.h@hebbarmp

@_LuoFuli If DS hadn't cut its costs, were you planning on cutting yours? Be honest. I respect you either way because of your open-source contributions.

MANISH@OrbitHigher

Great write-up, with a lot of good details. IMHO, a possible next gain could come from something like regret-aware KV eviction, i.e. keeping KV blocks that would be expensive to recompute and are likely to be reused, rather than relying mainly on recency or TTL.

For agentic workloads, this could preserve high-value prefixes like system prompts, documents, context repeated across tool loops, and Full Attention KV, while evicting cheaper, low-reuse KV more aggressively.

Instead of logging every KV block, the system could use lightweight, sampled prefix-level telemetry: reuse rate, time-to-next-hit, recompute-token estimate, Full/SWA KV type, fetch latency, and session/app category. These signals could produce a "recompute regret" score = reuse probability × recompute cost × transfer latency, which would help guide eviction and retention.
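The scoring idea in this reply can be sketched in a few lines. This is only an illustration of the commenter's proposal, not something described in the MiMo blog: the `PrefixStats` fields, the units, and the sample numbers are all hypothetical, and a real system would have to estimate the reuse probability from sampled telemetry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PrefixStats:
    prefix_id: str
    reuse_prob: float         # estimated chance the prefix is hit again
    recompute_tokens: int     # tokens to re-prefill if the prefix is evicted
    fetch_latency_ms: float   # cost of pulling it back from a slower tier


def regret_score(s: PrefixStats) -> float:
    # "Recompute regret" as proposed in the reply:
    # reuse probability x recompute cost x transfer latency.
    return s.reuse_prob * s.recompute_tokens * s.fetch_latency_ms


def pick_evictions(stats: List[PrefixStats], n: int) -> List[str]:
    """Evict the n prefixes whose loss would be regretted least."""
    return [s.prefix_id for s in sorted(stats, key=regret_score)[:n]]


stats = [
    PrefixStats("system_prompt", reuse_prob=0.90, recompute_tokens=4_000, fetch_latency_ms=20.0),
    PrefixStats("one_off_chat", reuse_prob=0.05, recompute_tokens=300, fetch_latency_ms=5.0),
    PrefixStats("long_document", reuse_prob=0.50, recompute_tokens=20_000, fetch_latency_ms=30.0),
]
print(pick_evictions(stats, 1))
```

Compared with pure recency or TTL, a score like this would keep an expensive, frequently reused prefix resident even if it was last touched a while ago, at the price of maintaining the telemetry needed to estimate each factor.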

Asad Ullah 🇵🇸@onesuitee

@_LuoFuli Thank you for your efforts and hard work. I request the release of MiMo V3.0 with full multi-model capabilities and advanced agentic workflows, optimized for real-world use cases at the same price

ElonChang@zwb0618

@_LuoFuli Summarize this for me in Chinese, @grok

Participant in a complex world@Hopehope_G_hope

@_LuoFuli @leijun @Xiaomi Xiaomi needs an IR expert to communicate the value of Xiaomi AI

Oliviu Stoian@madebyoliver

The 5x KVCache density gain from reworking the management layer is the part that stands out. Architecture (Hybrid SWA) sets the ceiling, but production engineering is what actually gets you there. Curious if you measured throughput per dollar against standard MHA with FlashAttention at 128K+ context lengths.

David@davidleverages

@_LuoFuli The 93-95% cache hit rate is the number that matters here.

That's not a benchmark. That's production traffic. At that rate the economics of serving long-context requests change completely.

1/7 KVCache storage plus near-perfect cache hits — that's how you cut API prices

Roman P@RomanP918791

@_LuoFuli I don't know what most of these words mean, but V2.5 works great and is cheap, so thank you :)

@stnly@stnly

@_LuoFuli waiting to see how long other providers will take to catch up to this capability

Sameer Srivastava@ovalpod94416

@_LuoFuli The token dumping from Chinese labs is real, and I am also using Kimi and MiniMax for now due to cost

Laplace@d28641204

@_LuoFuli Remarkable progress

Yash@yash1_

@_LuoFuli "optimizations have increased effective KVCache capacity by nearly 5x" WOW

Smartpig@Smartpigai

@_LuoFuli Will Xiaomi build its own dedicated coding tool for its models?
