China Trains 1.6T MoE on Ascend 910Cs, Eyes 10T Scale on Future Chips


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

never thought I'd see natsec cope about a Meituan product. "Bah! Big deal! we have better clusters!" Yes big deal. The whole export control policy, through all its escalations, starting with the restrictions that resulted in the H800 at least, was premised not just on ensuring their quantitative FLOP/HBM lag, but on keeping domestic compute categorically less suitable for major pretraining jobs, primarily due to memory bandwidth limitations. No, they were not supposed to be able to do this, and certainly Meituan was not supposed to.

What we've learned concretely:

- Chyna* can train a 0.75T 40AB (28T tokens) MoE comparable in utility to Opus 4.6-4.7 (and in some ways 4.8). This will be well exceeded next month. The training was completed by Feb 2026 (GLM-5), same time as Mythos-Preview.
- Chyna can also train a 1.6T 48AB (35T tokens) MoE on 25K Ascend 910C. This is about equal to V4 and a 1.5x larger job than GLM-5, with the obvious bandwidth/HBM complications coming from total scale. This was completed by late April.
- Chyna (the whole ecosystem, every major lab) has gotten remarkably good at multi-teacher on-policy distillation, which makes it almost trivial to parallelize downstream capability hill-climbing between individually slow clusters. They also have architectures and software exquisitely optimized for this scenario (V4).
- Thanks to the Sophgo incident, Chyna has been able to manufacture maybe 750K-1M of these Ascends (given that 950DTs are seemingly entering/have entered high volume production, I don't think rationing stockpiled HBM2E has been relevant for a while, and the CSIS estimate gave them until summer 2026 to use up the dies anyway), installing them in some hundreds of CloudMatrix 384 pods (I know about at least 300 being sold). There's a lot of other hardware of similar tier (Cambricon, Baidu Kunlunxin, Alibaba T-Head…) that I'll ignore.
- One CM384 pod was going for $8.2M a year ago. Meituan's cluster has about 65, worth $533M or so. Let's say $700M for assorted expenses. Meituan AI capex for 2026 is iirc $2.2B, but of course Meituan is not a hyperscaler (despite having a decent R&D team), it's a minnow. Total private Chinese 2026E AI capex is estimated as ≈$100B.

(*Here and throughout "Chyna" is an ironic placeholder for any individual Mainland entity, which is the normal hawk epistemology)
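The "1.5x larger job" and "about 65 pods, worth $533M" figures can be checked with back-of-envelope arithmetic. The sketch below uses only the numbers quoted above, plus one assumption that is not in the post: training compute is approximated by the common 6 × (active parameters) × (tokens) rule of thumb. The function and variable names are illustrative.

```python
# Figures quoted in the post. "AB" is read here as active parameters per token.
GLM5_ACTIVE_PARAMS = 40e9
GLM5_TOKENS = 28e12
LONGCAT_ACTIVE_PARAMS = 48e9
LONGCAT_TOKENS = 35e12

ASCENDS_IN_CLUSTER = 25_000   # Ascend 910C count quoted for Meituan's cluster
CHIPS_PER_POD = 384           # one CloudMatrix 384 pod
POD_PRICE_USD = 8.2e6         # quoted price per pod "a year ago"


def train_flops(active_params: float, tokens: float) -> float:
    """Approximate training compute.

    ASSUMPTION: the 6 * N * D rule of thumb; the post does not state
    how it arrived at its own ratio.
    """
    return 6 * active_params * tokens


glm5_flops = train_flops(GLM5_ACTIVE_PARAMS, GLM5_TOKENS)
longcat_flops = train_flops(LONGCAT_ACTIVE_PARAMS, LONGCAT_TOKENS)
job_ratio = longcat_flops / glm5_flops

# Number of pods implied by the chip count, rounded to whole pods.
pods = round(ASCENDS_IN_CLUSTER / CHIPS_PER_POD)
cluster_cost_usd = pods * POD_PRICE_USD

print(f"GLM-5 job:   {glm5_flops:.2e} FLOPs")
print(f"LongCat job: {longcat_flops:.2e} FLOPs")
print(f"ratio:       {job_ratio:.2f}x")
print(f"pods:        {pods}")
print(f"pod cost:    ${cluster_cost_usd / 1e6:.0f}M")
```

Under that rule of thumb the constant 6 cancels, so the ratio depends only on active parameters and token counts and comes out at 1.5, in line with the post. The pod count and cost likewise reproduce the quoted 65 pods and roughly $533M.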

…The point of the rant: on this obsolete hardware, with these pitiful volumes, we could see an OOM larger-scale pretrain than what we've seen so far. It would mostly be a matter of political will and organization. If China really operated like in AI-2027 and got AGI-pilled, we'd see Meituan Ascend-relevant software already commandeered, hardware pooled, Kimi-DeepSeek-Xiaomi-Alibaba-GLM data aggregated, a good efficient design (likely just a bigger V4/K3 with more tokens) chosen, and in 3 months a pretrained model that's most of the way to Mythos. This is trivially doable without any new factors, new production, new ideas. All these, however, exist.

In 3 weeks we'll learn a little more about the state of domestic compute. I estimate that one 950DT SuperPOD with a unified-bus-based scale-up domain spanning 8192 ASICs is about as capable as Meituan's whole cluster, and will require significantly less annoying engineering to be made useful. 10-20 SuperPODs would suffice for a "lean Mythos". I predict that we'll see more than 10-20 by end of year.
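To make that estimate concrete, here is a rough sketch of what it implies. It takes the author's own equivalence (one 8192-chip SuperPOD is about one 25K-chip 910C cluster) as given. The "per-chip ratio" is just what falls out of that equivalence at the cluster level, not a hardware specification, and the names are illustrative.

```python
# Figures from the post.
MEITUAN_910C_COUNT = 25_000      # Ascend 910C chips in Meituan's cluster
SUPERPOD_950DT_COUNT = 8_192     # ASICs in one 950DT SuperPOD
SUPERPODS_LOW, SUPERPODS_HIGH = 10, 20

# ASSUMPTION (the author's estimate, not a measured spec):
# one SuperPOD is about as capable as Meituan's whole cluster.
implied_per_chip_ratio = MEITUAN_910C_COUNT / SUPERPOD_950DT_COUNT


def fleet(superpods: int) -> dict:
    """Chip count and 910C-equivalent capacity for a number of SuperPODs."""
    return {
        "chips_950dt": superpods * SUPERPOD_950DT_COUNT,
        "equiv_910c": superpods * MEITUAN_910C_COUNT,
    }


print(f"implied per-chip ratio: {implied_per_chip_ratio:.2f}x")
for n in (SUPERPODS_LOW, SUPERPODS_HIGH):
    f = fleet(n)
    print(f"{n} SuperPODs: {f['chips_950dt']:,} 950DTs "
          f"~ {f['equiv_910c']:,} 910C-equivalents")
```

On these numbers, 10-20 SuperPODs correspond to 10-20 Meituan-sized clusters, which is the order-of-magnitude step up in pretraining scale the post has in mind.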

Yes, it's easy to say in retrospect that "of course chips are supposed to be usable for training haha". Perhaps the export control paradigm had always been somewhat confused on this point. But in practice, it did work exactly like this: Chinese chips were only good (really, still not good) for inference, the Chinese software stack was demonstrably unequipped for large scale training, and the hawk establishment could rest easy assuming that catch-up in general capabilities is solely dependent on export controls and holes in those, such as smuggling and overseas access, so ultimately up to American discretion.

Over the last few months, I see the rhetoric shifting: it's an issue of total compute amount, it's an issue of how widely they can deploy and diffuse… Sorry, I can only see that as cope. It was a binary: can they train a "GPT-4", then "GPT-5", then "Mythos" at all in the near term, or can't they, and what can "we" do to make it so that they can't. Now we know that "we" can't do much: even obsolete domestic hardware can train at near-frontier scale. If AI is a strategic arms race, broad deployment and diffusion will take a backseat, a frontier model with relevant cyberattack/defense capabilities will be trained, and your measures will have mostly amounted to transient economic sabotage bringing the efficiency of the Chinese economy down towards the American level.

So, I'd appreciate it if hawks reflected more rigorously on what Meituan LongCat 2.0 means.

GDP@bookwormengr

How many Ascend 910s can Huawei manufacture with 'stolen' dies? Answer: 1.6 million

This number is based on how many HBM stacks they have stockpiled. That is quite a lot to reach AGI, if you ask anyone.

What happens if stolen dies or HBM run out?
- Compute dies: China's SMIC is making 7nm chips for the next-generation Ascend. They can make them in the millions.
- Memory: HBM is a bigger challenge, as Chinese entities are barred from procuring anything above HBM2E. That said, HBM stacks enough for 1.6 million chips is already quite a lot. Also, CXMT is at HBM3, though it has issues manufacturing at high volume.

Hence, China-based labs like DeepSeek are laser-focused on reducing the need for HBM capacity and HBM bandwidth. Their latest technique, DSpark, reduces HBM bandwidth needs by a factor of 3.

Sources:
1. SemiAnalysis, on chip count based on stockpiled HBM: https://newsletter.semianalysis.com/p/huawei-ai-cloudmatrix-384-chinas-answer-to-nvidia-gb200-nvl72
2. DeepSeek's 10 trillion USD grand strategy -
