Huawei releases openPangu-2.0-Flash, a 92-billion parameter MoE model optimized for Ascend 910B hardware

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex #501inTech

Total DeepSeek victory. Huawei builds a, basically, V3.3–Lite

🧵 How good is Huawei's newly open-sourced openPangu-2.0-Flash? According to Zhihu contributor & NUS PhD 栖于永夜, it's currently one of the most complete open-source MoE models running on the Ascend ecosystem. Features like mHC, Muon and MTP—all frontier techniques introduced in late 2025—have successfully landed on CANN. That's an impressive engineering achievement. But compared with DeepSeek V4, the gap is still roughly one to one-and-a-half generations. Here's why 👇

1️⃣ What Exactly Is openPangu-2.0-Flash?

At a high level:
• 92B total parameters
• ~6B active parameters/token (MoE)
• 512K context
• 34T training tokens

Its architecture combines several of the most recent ideas in large-model training:
• MLA (Multi-head Latent Attention) to compress KV cache
• DSA + SWA, a hierarchical attention design where SWA handles local context while DSA aggregates sparse global information
• 4-stream mHC residual connections, replacing the traditional single residual path to improve signal propagation in deep networks
• 3-head MTP (Multi-Token Prediction) for faster decoding
• Muon optimizer, plus multi-stage post-training with RL and online distillation

Taken together, it's a remarkably complete engineering stack, especially considering it runs entirely on Ascend 910B.
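To put the headline figures above in proportion, here is a rough back-of-the-envelope sketch. It uses only the numbers quoted in the thread, plus one assumption that is not from the source: the common rule of thumb that a forward pass costs about 2 FLOPs per active parameter per token (this ignores attention cost, which matters at long context).

```python
# Figures quoted in the thread.
TOTAL_PARAMS = 92e9      # total parameters
ACTIVE_PARAMS = 6e9      # approximate active parameters per token
TRAIN_TOKENS = 34e12     # training tokens

# Share of the model that actually runs for each token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# ASSUMPTION (not from the source): ~2 FLOPs per active parameter per token
# for a forward pass, ignoring attention over the context.
flops_per_token_moe = 2 * ACTIVE_PARAMS
flops_per_token_if_dense = 2 * TOTAL_PARAMS

# Training tokens seen per parameter (total and active).
tokens_per_total_param = TRAIN_TOKENS / TOTAL_PARAMS
tokens_per_active_param = TRAIN_TOKENS / ACTIVE_PARAMS

print(f"active fraction: {active_fraction:.1%}")
print(f"dense/MoE forward FLOPs ratio: {flops_per_token_if_dense / flops_per_token_moe:.1f}x")
print(f"tokens per total parameter: {tokens_per_total_param:.0f}")
```

Under these assumptions only about 6 to 7 percent of the weights are touched per token, which is why a 92B MoE can decode at roughly the cost of a much smaller dense model.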

2️⃣ The Biggest Gap Isn't Features. It's Attention

The real difference isn't whether openPangu includes enough "new tricks." It's that DeepSeek V4 fundamentally changed the attention paradigm. openPangu follows the MLA + DSA/SWA direction. DeepSeek V4 moved beyond MLA entirely, adopting CSA + HCA, a new compressed attention architecture.

The difference is where compression happens.
• MLA compresses along the head dimension.
• CSA/HCA compress along the sequence dimension, dramatically reducing KV cache while preserving long-context capability.

The result is significant. At 1M context, DeepSeek V4 reportedly requires only:
• ~10% of V3's KV cache
• ~27% of the inference FLOPs

The author argues that openPangu's attention design is closer to an earlier internal DeepSeek prototype that was ultimately replaced before V4's official release. It works well at 512K context, but likely has a lower ceiling than CSA/HCA.
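The distinction between the two compression axes can be illustrated with a toy cache-size calculation. Everything numeric below (layer count, head count, head size, latent width, the 4-to-1 sequence compression ratio) is a hypothetical placeholder chosen for readability, not a published configuration of openPangu or any DeepSeek model. The sketch only shows how the two approaches scale: shrinking the per-token width still stores one entry per token, while compressing along the sequence stores fewer entries.

```python
def kv_cache_elems(seq_len, layers, per_token_width, seq_compression=1):
    """Cached elements: one vector of `per_token_width` per cached slot per layer.

    `seq_compression` is how many tokens share one cached slot (1 = no sharing).
    """
    cached_slots = -(-seq_len // seq_compression)  # ceiling division
    return layers * cached_slots * per_token_width


# All values below are HYPOTHETICAL, for illustration only.
SEQ_LEN = 1_000_000
LAYERS = 60
HEADS, HEAD_DIM = 64, 128
LATENT_DIM = 512          # width of a per-token latent (MLA-style)
TOKENS_PER_SLOT = 4       # sequence-axis compression ratio

full_width = 2 * HEADS * HEAD_DIM  # separate K and V for every head

baseline = kv_cache_elems(SEQ_LEN, LAYERS, full_width)
width_compressed = kv_cache_elems(SEQ_LEN, LAYERS, LATENT_DIM)
seq_compressed = kv_cache_elems(SEQ_LEN, LAYERS, full_width, TOKENS_PER_SLOT)
both = kv_cache_elems(SEQ_LEN, LAYERS, LATENT_DIM, TOKENS_PER_SLOT)

for name, n in [("baseline", baseline), ("width-compressed", width_compressed),
                ("sequence-compressed", seq_compressed), ("both", both)]:
    print(f"{name:>20}: {n / baseline:.4f} of baseline")
```

The two axes are independent in this toy model, so their savings multiply. The thread's point is that a design limited to the width axis has less headroom as context grows.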

3️⃣ Several V4 Innovations Are Still Missing

Attention isn't the only difference. The article highlights several major technologies that haven't yet appeared in openPangu.

🔧 FP4 Quantization-Aware Training
DeepSeek V4 trains MoE experts directly in FP4, while keeping the remaining parameters in FP8. This isn't post-training quantization. The model learns under FP4 during training itself. That requires hardware support that Ascend 910B currently lacks.

🧭 MoE Routing
V4 also upgrades the routing mechanism with:
• hash-routing MoE
• new affinity activation functions
• relaxed routing constraints

These changes improve expert utilization and training stability. openPangu appears to retain the earlier V3-style routing design.
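To make "the model learns under FP4 during training" concrete, here is a minimal sketch of the forward half of quantization-aware training: weights are snapped to a 4-bit floating-point grid before use, so the loss already reflects the rounding error. The grid below is the widely used E2M1 layout; whether DeepSeek V4 uses exactly this format, and the per-block max scaling shown here, are assumptions, and `fake_quant_fp4` is a hypothetical helper name.

```python
import math

# Non-negative magnitudes representable in E2M1 FP4 (sign is a separate bit).
# ASSUMPTION: the thread does not say which FP4 layout V4 uses.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_fp4(block):
    """Quantize then dequantize one block of weights (hypothetical helper).

    The block shares one scale, chosen so its largest magnitude maps to the
    top of the grid. In real QAT the backward pass would typically treat this
    function as identity (straight-through estimator); only the forward
    rounding is shown here.
    """
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return list(block)
    scale = amax / FP4_GRID[-1]
    out = []
    for w in block:
        # Nearest representable magnitude in the scaled domain.
        mag = min(FP4_GRID, key=lambda g: abs(abs(w) / scale - g))
        out.append(math.copysign(mag * scale, w))
    return out


weights = [6.0, -3.0, 1.2, 0.1]
print(fake_quant_fp4(weights))
```

With only eight magnitudes per sign, small weights in a block with a large outlier collapse to zero, which is one reason training under the format differs from quantizing afterwards.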

4️⃣ The Gap Extends Beyond The Model

Another major difference lies in the surrounding software stack. DeepSeek V4 isn't just a new model. It's accompanied by an entirely new infrastructure layer, including:
• TileLang kernel DSL
• deterministic kernels
• hierarchical KV-cache storage
• compressed-attention runtime support
• two-stage contextual parallelism

Most of these capabilities don't yet exist within today's CANN ecosystem.

Even post-training differs. DeepSeek's OPD distills knowledge from over ten independently trained specialist models using full-vocabulary KL divergence, a highly demanding engineering workload. openPangu adopts a similar direction, but reproducing that pipeline on Ascend hardware remains significantly more challenging.
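For readers unfamiliar with the term, "full-vocabulary KL divergence" means the student is compared against the teacher's entire next-token distribution at each position, not only the sampled token. The sketch below computes that quantity for a single position and a single teacher in plain Python. The direction shown, KL(teacher || student), is an assumption, since the thread does not specify it, and the function names are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def full_vocab_kl(teacher_logits, student_logits):
    """KL(teacher || student) summed over every vocabulary entry.

    ASSUMPTION: the KL direction is not stated in the source.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(t, s))


# Toy 2-token vocabulary: teacher is uniform, student prefers token 0 at 3:1.
teacher = [0.0, 0.0]
student = [math.log(3.0), 0.0]
print(full_vocab_kl(teacher, student))
```

The cost is what makes this demanding at scale: every position needs a vocabulary-sized distribution from the teacher, and with many specialist teachers those distributions have to be produced and moved for each one.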

5️⃣ Where CANN Stands Today

The author draws a fairly clear boundary. Today's CANN ecosystem can already support:
✅ DeepSeek V3-style foundations
✅ MLA
✅ DeepSeekMoE
✅ MTP
✅ mHC
✅ Muon

These are relatively modular improvements. The real leap made by V4 (CSA/HCA attention, FP4 QAT, new MoE routing and its supporting runtime) would require substantial operator and system-level redesign inside CANN. Some of those improvements are software problems. Others depend on future hardware generations.

6️⃣ So How Should We View openPangu?

The article's conclusion is balanced. The biggest achievement isn't that openPangu beats DeepSeek V4. It's that Huawei has demonstrated frontier-scale MoE training entirely on Ascend 910B, without relying on NVIDIA GPUs. That's a major milestone for the domestic AI ecosystem.

At the same time, "being able to train" and "being able to train as efficiently as the frontier" are two different things. The current architecture represents a pragmatic engineering choice given today's hardware and software constraints, not a lack of awareness of newer designs.

🔗 Read more: https://www.zhihu.com/question/2055328303435854735/answer/2055468728800825535 #AI #LLM #Huawei #Ascend #MoE #DeepSeek #AIInfra #OpenSource #CANN #Tech

9:32 AM · Jul 2, 2026 · 12.2K Views