This is all correct and the fact that it's correct is *damning* for the existing economic model of AI services. Chinese models are not supposed to be competitive whatsoever. We are getting fleeced here.
Quite a bad take 😀
1. Frontier US models are expensive not because they are pricey to serve but because they are sold at a very good margin. The labs can afford this margin because these models are genuinely better than the open-source alternatives. The Twitter narrative that "Chinese models now dominate in usage cause OpenRouter" is just nonsense.
2. Once you have a powerful model, you can just distil it into a smaller one to enable cheap serving. You have all the logprobs, hidden states, and the training corpus – making a new model is simple; you can experiment with a smaller size, different attention mechanisms, etc. You can make it very cheap to serve. At the moment everyone just wants the best model, so Anthropic doesn't care. If this changes, and price becomes an issue, they will make the model cheaper; it will be trivial compared to training Mythos.
3. US companies massively benefit from access to frontier compute; newer offerings from NVIDIA give you a massive cost advantage that is very, very hard to beat. You want different compute for prefill and for decode; you want to use the NVL72 so dispatch is fast, etc.
4. For sparse MoEs, there are massive benefits to scaling. You want to split the model across hundreds of GPUs, overlap compute and dispatch, and saturate each expert. To do this, you need a continuous stream of millions of requests, ideally spread across different time zones so you can keep the hardware utilised as close to 24/7 as possible. There are very few companies that meet this requirement (mostly Big Tech).
If you don't have this, you will be paying for compute that is idling. As prices of GPUs skyrocket, you won't be able to justify it.
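To put rough numbers on the idle-compute point, here is a back-of-the-envelope sketch. Every figure in it (the GPU hourly price, the throughput, the utilisation levels) is a made-up assumption for illustration, not a measurement; the only real claim is that you pay for idle hours too, so cost per token scales with 1 / utilisation.

```python
def cost_per_million_tokens(gpu_hour_usd, tokens_per_sec_at_full_load, utilisation):
    """Effective serving cost per 1M tokens for one GPU.

    All inputs are hypothetical. Idle time is still billed, so the
    effective cost is inversely proportional to utilisation.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_sec_at_full_load * 3600 * utilisation
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Assumed numbers: $4 per GPU-hour, 2000 tokens/s when fully loaded.
busy = cost_per_million_tokens(4.0, 2000, utilisation=0.90)  # global, round-the-clock traffic
idle = cost_per_million_tokens(4.0, 2000, utilisation=0.15)  # one office, one time zone

print(f"busy: ${busy:.2f}/M tokens, idle: ${idle:.2f}/M tokens, ratio: {idle / busy:.1f}x")
```

With these assumed inputs, the same hardware is six times more expensive per token at 15% utilisation than at 90%, before counting any of the MoE batching benefits.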
There is a lot of money to be made in inference; there are very distinctive patterns that you can specialise in and make a lot of money from. But you need to think about this from first principles, and "companies will buy B200 nodes and serve internally running SGLang" is not going to happen, at least not at the scale needed to make billions 😅
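On point 2, for anyone wondering what "you have all the logprobs" buys you: a minimal sketch of the textbook soft-label distillation loss, where the student is trained to match the teacher's per-token distribution. This is the standard technique, not a claim about what any particular lab does; the function names and the temperature value are assumptions for illustration, and the usual T² scaling factor is left out for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) for a single token position (hypothetical helper)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.0, 0.5]           # toy 3-token vocabulary
good_student = [4.0, 1.0, 0.5]      # matches the teacher exactly
bad_student = [0.5, 1.0, 4.0]       # puts its mass on the wrong token

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, bad_student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as they diverge; in a real setup it would be averaged over every token position in the corpus.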












