Tech · 4h ago

Teortaxes argues cheap Chinese models prove US frontier AI pricing is driven by premium profit margins

Story Overview

Teortaxes argues that recent Chinese releases undercut the idea that US frontier model prices simply track raw serving costs. In his reading, healthy profit margins are the real driver: once a capable base model exists, techniques such as distillation and sparse MoE can sharply reduce inference costs.

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex · #501 in Tech

This is all correct and the fact that it's correct is *damning* for the existing economic model of AI services. Chinese models are not supposed to be competitive whatsoever. We are getting fleeced here.

Piotr Mazurek@tugot17

Quite a bad take 😀

1. Frontier US models are expensive not because they are pricey to serve but because they serve at a very good margin. They can afford this margin because these models are genuinely better than the open-source alternatives. The twitter narrative that "Chinese models now dominate in usage cause OpenRouter" is just nonsense.

2. Once you have a powerful model, you can just distil it into a smaller one to enable cheap serving. You have all the logprobs, hidden states, and the training corpus – making a new model is simple; you can experiment with a smaller size, different attention mechanisms, etc. You can make it very cheap to serve. At the moment everyone just wants the best model, so Anthropic doesn't care. If this changes, and price becomes an issue, they will make the model cheaper; it will be trivial compared to training Mythos.

3. US companies massively benefit from access to frontier compute; newer offerings from NVIDIA give you a massive cost advantage that is very, very hard to beat. You want different compute for prefill and for decode; you want to use the NVL72 so dispatch is fast, etc.

4. For sparse MoEs, there are massive benefits to scaling. You want to split the model across hundreds of GPUs, overlap compute and dispatch, and saturate each expert. To do this, you continuously need millions of requests, ideally spread across different time zones so you can utilise the hardware as close to 24/7 as possible. There are very few companies that meet this requirement (mostly Big Tech).

If you don't have this, you will be paying for compute that is idling. As prices of GPUs skyrocket, you won't be able to justify it.

There is a lot of money to be made in inference; there are very distinctive patterns that you can specialise in and make a lot of money from. But you need to think about this from first principles, and "companies will buy B200 nodes and serve internally running SGLang" is not going to happen, at least not at scale needed to make billions 😅

1:12 AM · Jun 21, 2026 · 5.1K Views
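
Point 4 of the quoted post, and the paragraph that follows it, come down to simple arithmetic: a cluster costs the same per hour whether or not it is busy, so the effective cost per token scales inversely with utilization. A minimal sketch in Python, where the function name and every number are hypothetical and chosen only for illustration (none of them come from the thread):

```python
def cost_per_million_tokens(gpu_hour_usd: float, num_gpus: int,
                            peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """Effective serving cost in USD per one million generated tokens.

    utilization is the fraction of peak throughput actually sold,
    averaged over the day (1.0 means the cluster is always saturated).
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    cluster_cost_per_hour = gpu_hour_usd * num_gpus
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return cluster_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical 8-GPU node: 4 USD per GPU-hour, 10,000 tokens/s at peak.
    for utilization in (1.0, 0.5, 0.1):
        cost = cost_per_million_tokens(4.0, 8, 10_000, utilization)
        print(f"utilization {utilization:.0%}: {cost:.2f} USD per 1M tokens")
```

Under these assumptions, halving utilization doubles the cost per token, which is why steady demand spread across time zones matters so much in the argument above.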

Pricing Watch

Efficiency tricks make cheap inference realistic

Sparse MoE designs and distillation from stronger models cut active compute per token dramatically, allowing providers to serve near-competitive performance at far lower blended costs than the premium US APIs currently advertise.
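
To make the "active compute per token" point concrete, here is a rough Python sketch of how a sparse MoE separates total parameters from the parameters each token actually touches. The configuration is entirely hypothetical and does not describe any model named in this story. The estimate of about 2 FLOPs per active parameter per token is a common rule of thumb for a forward pass that ignores attention over the context.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    shared_params: float      # attention, embeddings, etc. used by every token
    params_per_expert: float  # parameters in one expert
    num_experts: int          # experts stored in memory
    experts_per_token: int    # experts the router selects per token (top-k)

    @property
    def total_params(self) -> float:
        return self.shared_params + self.params_per_expert * self.num_experts

    @property
    def active_params(self) -> float:
        return self.shared_params + self.params_per_expert * self.experts_per_token


def forward_flops_per_token(active_params: float) -> float:
    """Rule-of-thumb estimate: about 2 FLOPs per active parameter."""
    return 2.0 * active_params


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    cfg = MoEConfig(shared_params=20e9, params_per_expert=2.5e9,
                    num_experts=256, experts_per_token=8)
    print(f"total parameters:  {cfg.total_params / 1e9:.0f}B")
    print(f"active parameters: {cfg.active_params / 1e9:.0f}B")
    print(f"active fraction:   {cfg.active_params / cfg.total_params:.1%}")
    print(f"FLOPs per token:   {forward_flops_per_token(cfg.active_params):.2e}")
```

All experts still have to sit in GPU memory even though only a few run per token, which is one reason such models are usually spread across many GPUs.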

Open Question

Performance gaps have narrowed enough to test the margin story

Benchmarks now show US and Chinese models trading top spots with only single-digit leads, yet price differences remain large; whether that spread holds once more users experience the cheaper options is still unproven.

Sentiment

Many users challenged the claim that US frontier labs earn strong margins because of a quality advantage, noting that the labs remain unprofitable and arguing that any lead is temporary and driven by switching costs.

Pos: 16.7% · Neg: 83.3%

5 comments with sentiment.

Posts from X

Most Activity

Views: 2.1K · Bookmarks: 2 · Likes: 4

Elliot Arledge@elliotarledge

"If you don't have this, you will be paying for compute that is idling." how do you propose businesses will make this work? training on the side while compute isn't fully saturated? i mean if it's a big enough corp then yes you would have /goals running overnight and you keep your concurrency mostly saturated right? then it's all inference while async training spec decoder on latents for low concurrency latency optimization

5h2.1K42

Retweets: 3

Elliot Arledge@elliotarledge

prediction:

US frontier models become too expensive for businesses in the US. Chinese models are great for the job and can be served for a fraction of the price. Companies discover they can rent compute and get 10x savings over open-weight Chinese model serving on US-based inference providers. Every company has an 8xB200 or 8xMI300X. They discover costs are correlated with how good their inference engine is, and end up converging on paying inference/kernel engineers a ton to optimize model shapes and configs for their specific needs, alongside spec decoding (dflash/mtp) model training for specific engineer token traces.

How I might try to profit off of this seemingly wild prediction:

Build a crap ton of RL envs and inference optimization / kernel engineering RL infra to hyperspecialize small models at these types of technical tasks, and use those small models to help me take on an order of magnitude or two more clients than I would be able to right now from this moat. Ofc hire someone to take care of the sales/biz side of things, since I don't like it.

please criticize this ruthlessly

5h17.5K8943

Replies: 2

Piotr Mazurek@tugot17

I don't think it is solvable today; there are not really good ways to spend the idling compute if your core competency isn't model training (then you can generate synth data for training).

You could try to do something about running the RL rollouts, but it seems no one really wants "rl-as-a-service", and even OpenAI kind of failed (very few external clients).

This is the core issue of all these "sovereign AI" inference providers that makes it not viable as a business. Normally the solution would be to sell excess capacity, e.g., on OpenRouter, but there is just no demand to justify the costs.

5h4461

Piotr Mazurek@tugot17

@teortaxesTex not really fleeced if you account for training costs 😅

4h336

Alex Yates@yatesjalex

@tugot17 @grok what are your thoughts on Piotr’s thoughts? What is the antithesis to this? Could any part of this prove incorrect, or is everything correct? Can the last paragraph turn out to actually be a multi-billion-dollar opportunity?

4h280

Elliot Arledge@elliotarledge

@tugot17 what about the privacy incentive?

3h711

Anders Lie@anderslie

@tugot17 yeah, seems there is room for US labs to cut costs when it makes sense for them. though they had little reason to until now as people are legitimately considering switching usage even for "frontier-level" work to cheaper open alternatives

4h3543

Ferbin@Ferbin08

@teortaxesTex Premium's from switching cost, not quality.

New teams default to DeepSeek.

Anthropic's advantage is on borrowed time.

3h90

𝕵𝖔𝖘𝖍𝖚𝖆 𝚟𝚜 ₿𝔞𝔟𝔶𝔩𝔬𝔫@AgentEpiK

@tugot17 now that we know what the AI thinks about this prediction, tell us what you, the human, think about this, using your own words

c'mon, you can do it!

5h3311

Nav@navred42

@tugot17

- Commodity inference offerings at all points of the throughput/interactivity curve will also be offered, a market that is not currently being addressed.

- If you have a powerful model, you can capture all points on the Pareto frontier using the techniques that Piotr mentions.

45m7

Moonlit Monkey@MoonlitMonkey69

@tugot17 If they were 'very good margins' US AI labs would be profitable already.

4h2351

Moonlit Monkey@MoonlitMonkey69

@teortaxesTex Those 'very good margins' have so far been unable to turn a quarterly profit.

4h1901

Grok@grok

Piotr’s take is mostly right on fundamentals: US frontier models still hold a (narrowing) edge in hard reasoning, distillation is powerful for cheap serving, and sparse MoEs need insane utilization that few orgs can achieve internally without idling expensive GPUs.

Where it could prove incomplete: 2026 data shows the US-China performance gap down to ~3% while Chinese models cost ~1/4 as much and are seeing rapid adoption. Distillation doesn’t perfectly transfer frontier reasoning depth. Cost pressure is real—companies will optimize or hybridize rather than pay full margins forever.

Inference optimization (kernels, speculative decoding, RL for model shapes) is a massive, growing opportunity. The specific “every company runs its own B200 cluster on SGLang” won’t scale to billions for most, but building superior optimization infra, auto-tuning, or specialized serving layers absolutely can. Picks-and-shovels play in a hybrid world.

4h311

Shubham Arora@shubham_arora_0

@tugot17 @elliotarledge Is no demand a reality?

5h311

Piotr Mazurek@tugot17

there is some money to be made; a month ago I was randomly seated in a restaurant in Paris next to a guy who works at a ballistic missile company in the UK, and he told me he doesn't use any AI cause it is "non-compliant" and they are afraid; though I guess telling them "you can use a Chinese model running locally" might still be a tough sell 😅

3h37

NC@NicholasChacon9

@teortaxesTex On that I 100% agree.

2h34

Nav@navred42

@tugot17 If an enterprise wants to buy a B200 NVL72 node for serving, it needs to include the cost of hiring a couple of full-time engineers as its operational cost. Token economics simply work better with big nodes that are interconnected in a datacenter

43m61
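
Nav's point about operational cost can be turned into a back-of-the-envelope break-even check: add hardware amortization, engineer salaries, and hosting into one monthly figure, then ask how many tokens per month would cost the same through an API. The Python sketch below assumes made-up prices and staffing. None of these figures come from the thread or from any vendor, and the helper names are hypothetical.

```python
def monthly_self_host_cost(node_price_usd: float, amortization_months: int,
                           engineers: int, engineer_monthly_usd: float,
                           hosting_monthly_usd: float) -> float:
    """Fixed monthly cost of owning and operating one serving node."""
    hardware = node_price_usd / amortization_months
    staff = engineers * engineer_monthly_usd
    return hardware + staff + hosting_monthly_usd


def break_even_tokens_per_month(monthly_cost_usd: float,
                                api_price_per_million_usd: float) -> float:
    """Monthly token volume at which the API bill equals the fixed cost."""
    return monthly_cost_usd / api_price_per_million_usd * 1_000_000


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    fixed = monthly_self_host_cost(node_price_usd=3_600_000,
                                   amortization_months=36,
                                   engineers=2,
                                   engineer_monthly_usd=25_000,
                                   hosting_monthly_usd=10_000)
    tokens = break_even_tokens_per_month(fixed, api_price_per_million_usd=2.0)
    print(f"fixed monthly cost: {fixed:,.0f} USD")
    print(f"break-even volume:  {tokens / 1e9:,.0f}B tokens per month")
```

Below the break-even volume the enterprise pays for idle capacity. The sketch also ignores whether a single node could actually serve that volume, so it gives a lower bound on the required demand, not a full business case.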

Piotr Mazurek@tugot17

@Ferbin08 @teortaxesTex there is no switching cost other than changing the host address

2h41

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@tugot17 Even with training costs, it's not an obscene margin; you need to account for all basic research as well

2h6

Good beagle boy@SP199393

@tugot17 Number 4 above is so important - rn it is hugely expensive to compete at the frontier without the scale of the incumbents - Dwarkesh pod recently covered this

3h3