/AI · 1d ago

Samuel Hammond of the Foundation for American Innovation says Western models outperform Chinese rivals by 20 to 40 points on private benchmarks

The findings challenge claims that Chinese AI is catching up.


Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

A thread with a good collection of hard/private/OOD evals where the Western frontier is comprehensively dunking on Chinese/open source models and it's not remotely close.

Lisan al Gaib@scaling01

the "narrow capability gap" in question

let's put this to rest please I can't hear the coping anymore

6:23 PM · Jun 6, 2026 · 50.7K Views

Sentiment

Users are rejecting claims that Western AI models lead Chinese and open-source rivals on hard benchmarks, calling the comparisons the dumbest ever and dismissing models like Gemini as worthless.

Pos 0.0% · Neg 100.0%

4 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views 65.2K · Bookmarks 34 · Likes 187 · Retweets 8 · Replies 14

Dean W. Ball@deanwball

I find it so interesting how persistently unable the strategic classes of free society are to analyze AI well. So many keep getting stuck in these basins of delusion. I was at a conference where it was not just asserted but taken for granted that Chinese models have dominant global inference market share.

The 2024/early 25 version of the delusion was “mode collapse/data wall” (even after reasoning models!), then it was “AI is plateauing and a bubble” for most of 2025, now it’s “Chinese OSS is good enough.”

The share of people in the strategic classes who think this is gradually declining, but it is still sufficiently common that you can attend a prestigious conference and encounter a room principally filled with basin-dwellers.

1d · 65.2K views · 187 likes · 34 bookmarks

Jake@JakeKAllDay

His arguments always make me roll my eyes. Chinese firms were not within 1 year of the frontier a couple years ago. Now they’re behind by perhaps 4-6 months, at much better compute efficiency.

You can embellish a gap between any two things by jacking with the scale. None of the US models last fall are competing on these benchmarks, they were still useful models, which China has now passed. These tests zoom in on a tiny range of the problem set and ignore the large corpus where they’re much closer or saturated

1d1.5K193

Jake@JakeKAllDay

@teortaxesTex And none of this addresses the fact that Claude is great at building a skyscraper on the wrong street. There’s some benefit to needing to check in every couple hours

1d37964

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@apestein_dev There's no real difference between benchmaxxing and legitimate training if you don't train on test or its paraphrases. Chinese RL environments and metrics are derived from problems the success on which was measured by old benchmarks. US frontier is doing newer harder problems.

1d552221

Minh Nhat Nguyen@menhguin

@deanwball they are at parity

for day to day stuff 80% of the time a Chinese model is fine

unfortunately the last 20% is also extremely annoying and can be worth the 10x difference

1d622231

Anime fan@badboy999654

@teortaxesTex This has to be the dumbest comparison ever. What results did these long running tasks produce? Here's the actual comparison between the models on long running tasks and their results. Spoiler: Both are garbage. https://youtu.be/DZ9sTRyAMmM?si=bhLgMAB1Hj3HhbPs

1d3.1K92

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@badboy999654 Cope

1d1.5K141

Paulo Santos@ThinkFinance999

Your vision of reality is distorted and it's easy to explain why:

1) "Good enough" varies by task, and is reasonably fixed.

2) All models are improving, both at the leading edge and the open weights Chinese ones (as well as a few non-Chinese)

3) Hence, even though the leading edge models remain at the front and might even be increasing their lead, the open weights ones (mostly the Chinese) are increasingly good enough for more and more tasks.

Now, when companies are developing products, they will lean mostly on the best possible models unless they already feel cost pressure there.

However, when these products (software) are put into production, IF the software makes use of AI in pursuing specific tasks, then "good enough" models will be increasingly selected.

This is because most companies cannot sustain opex at 5-10x what would otherwise be required (using the cheapest "good enough" model) just to run overqualified models in their production loop (where token consumption will end up being many times larger than in development).

1d2.3K121

Dean W. Ball@deanwball

@ThinkFinance999 yeah obviously, but you’ve missed the point that U.S. firms tend to be at the Pareto frontier of performance and cost these days at many price points, including very cheap

1d1.8K15

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@badboy999654 They justify the cost in that people are willing to pay that cost. The rest doesn't matter. They're more robust and smarter. If they couldn't afford these margins they'd slash them.

1d5838

Dean W. Ball@deanwball

@james1ach I don’t quite know what you mean, but internationally the “Chinese OSS is good enough” line is deployed as a crutch or excuse for complacency, whereas in the U.S., ironically, it mostly is rooted in the persistent neurosis our strategic class has about falling into complacency.

1d88913

Dean W. Ball@deanwball

@RobS142 you are mistaking the bubbles. Nobody in sf thinks this. The beliefs I am describing are most common among government staffers and the elites who surround them in almost every country on earth

1d70551

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@apestein_dev Not really but you wouldn't get it

1d1.4K9

Daemon-Core@nullctl

@teortaxesTex I wonder if language is an important part of model performance.

Never tried to investigate, but there is this nagging feeling that Chinese-language evals on Opus would produce a different result compared to English-language evals.

I don't speak Chinese, so I can't really tell

1d75641

Anime fan@badboy999654

@teortaxesTex Do you think people can't see through all the marketing bullshit? Microsoft has access to all the coding models from both companies, and yet Windows 11 is still a buggy mess. Every time they boot up Windows, people see that these models can't code for shit.

1d4.9K5

Paulo Santos@ThinkFinance999

Just because you try to make AI puke a definition out, and affirm US labs offer those models, doesn't immediately make your statement real or true, plus "good enough" depends on the task and nobody is optimizing models for millions of different tasks.

Again, the lower end models from the leading labs are neither as cheap as available alternatives, nor as good, so what's the point, even?

1d3577

Blue Bear@Bluebearmonkey

@teortaxesTex @badboy999654 I don’t really know what a token is 🤷‍♂️

1d841

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@badboy999654 The length of a task is not about token count in a single context, but about human labor. Besides, GPT has extremely good compaction now. This is all futile thrashing

1d5796

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@badboy999654 Yeah yeah, all "actual hard coding tasks" are on the level of shipping a new codec. Enough of your cope. You're arguing against revealed preferences, it's embarrassing

1d5396

Dean W. Ball@deanwball

@ThinkFinance999 Hello nemotron what does Pareto frontier mean

1d3905