/AI · 1d ago

Samuel Hammond of the Foundation for American Innovation says Western models outperform Chinese rivals by 20 to 40 points on private benchmarks

The findings challenge claims that Chinese AI is catching up.

Original post

A thread with a good collection of hard/private/OOD evals where the Western frontier is comprehensively dunking on Chinese/open source models and it's not remotely close.

Lisan al Gaib @scaling01

the "narrow capability gap" in question

let's put this to rest please I can't hear the coping anymore

6:23 PM · Jun 6, 2026 · 50.7K Views
Sentiment

Users are rejecting claims that Western AI models lead Chinese and open-source rivals on hard benchmarks, calling the comparisons the dumbest ever and dismissing models like Gemini as worthless.

Pos 0.0% · Neg 100.0%
4 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 65.2K · Bookmarks 34 · Likes 187 · Retweets 8 · Replies 14
Dean W. Ball @deanwball

I find it so interesting how persistently unable the strategic classes of free society are to analyze AI well. So many keep getting stuck in these basins of delusion. I was at a conference where it was not just asserted but taken for granted that Chinese models have dominant global inference market share.

The 2024/early 25 version of the delusion was “mode collapse/data wall” (even after reasoning models!), then it was “AI is plateauing and a bubble” for most of 2025, now it’s “Chinese OSS is good enough.”

The share of people in the strategic classes who think this is gradually declining, but it is still sufficiently common that you can attend a prestigious conference and encounter a room principally filled with basin-dwellers.

1d · Views 65.2K · Likes 187 · Bookmarks 34
Jake @JakeKAllDay

His arguments always make me roll my eyes. Chinese firms were not within 1 year of the frontier a couple years ago. Now they’re behind by perhaps 4-6 months, at much better compute efficiency.

You can embellish a gap between any two things by jacking with the scale. None of the US models last fall are competing on these benchmarks, they were still useful models, which China has now passed. These tests zoom in on a tiny range of the problem set and ignore the large corpus where they’re much closer or saturated

1d · Views 1.5K · Likes 19 · Bookmarks 3
Jake @JakeKAllDay

@teortaxesTex And none of this addresses the fact Claude is great at building a skyscraper on the wrong street, there’s some benefit to needing to check in every couple hours

1d · Views 379 · Likes 6 · Bookmarks 4

@apestein_dev There's no real difference between benchmaxxing and legitimate training if you don't train on test or its paraphrases. Chinese RL environments and metrics are derived from problems the success on which was measured by old benchmarks. US frontier is doing newer harder problems.

1d · Views 552 · Likes 22 · Bookmarks 1

@deanwball they are at parity

for day to day stuff 80% of the time a Chinese model is fine

unfortunately the last 20% is also extremely annoying and can be worth the 10x difference

1d · Views 622 · Likes 23 · Bookmarks 1
Anime fan @badboy999654

@teortaxesTex This has to be the dumbest comparison ever. What results did these long running tasks produce? Here's the actual comparison between the models on long running tasks and their results. Spoiler: Both are garbage. https://youtu.be/DZ9sTRyAMmM?si=bhLgMAB1Hj3HhbPs

1d · Views 3.1K · Likes 9 · Bookmarks 2
Paulo Santos @ThinkFinance999

Your vision of reality is distorted and it's easy to explain why:

1) "Good enough" varies by task, and is reasonably fixed.

2) All models are improving, both at the leading edge and the open weights Chinese ones (as well as a few non-Chinese)

3) Hence, even though the leading edge models remain at the front and might even be increasing their lead, the open weights ones (mostly the Chinese) are increasingly good enough for more and more tasks.

Now, when companies are developing products, they will lean mostly on the best possible models unless they already feel cost pressure there.

However, when these products (software) are put into production, IF the software makes use of AI in pursuing specific tasks, then "good enough" models will be increasingly selected.

This is because most companies cannot sustain opex at 5-10x what would otherwise be required (using the cheapest "good enough" model) just to run overqualified models in their production loop (where token consumption will end up being many times larger than in development).

1d · Views 2.3K · Likes 12 · Bookmarks 1
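
The opex argument in the post above can be made concrete with a little arithmetic. This is only a rough sketch: the token volumes, the 20x production multiplier, and the per-token price are hypothetical placeholders, not figures from the thread. Only the 5x price multiple is taken from the low end of the post's "5-10x" range.

```python
def monthly_cost(tokens_millions: float, price_per_million: float) -> float:
    """Monthly spend in dollars for a given token volume and price."""
    return tokens_millions * price_per_million

# All inputs below are hypothetical, chosen only to illustrate the argument.
dev_tokens = 100.0        # millions of tokens per month during development
prod_multiplier = 20.0    # assumed: production consumes 20x the development volume
cheap_price = 1.0         # assumed $ per million tokens for a "good enough" model
frontier_multiple = 5.0   # low end of the 5-10x range mentioned in the post

frontier_price = cheap_price * frontier_multiple
prod_tokens = dev_tokens * prod_multiplier

# Extra monthly spend from choosing the pricier model at each stage.
extra_dev = monthly_cost(dev_tokens, frontier_price) - monthly_cost(dev_tokens, cheap_price)
extra_prod = monthly_cost(prod_tokens, frontier_price) - monthly_cost(prod_tokens, cheap_price)

print(f"extra cost in development: ${extra_dev:,.0f}/month")
print(f"extra cost in production:  ${extra_prod:,.0f}/month")
```

Under these assumed numbers the premium is the same multiple at both stages, but the absolute gap grows with token volume, which is the pressure the post describes for production workloads.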
Dean W. Ball @deanwball

@ThinkFinance999 yeah obviously, but you’ve missed the point that U.S. firms tend to be at the Pareto frontier of performance and cost these days at many price points, including very cheap

1d · Views 1.8K · Likes 15
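
For readers unfamiliar with the term used in the reply above, a model is on the Pareto frontier of performance and cost if no other model is both at least as cheap and at least as good, and strictly better on one of the two. A minimal sketch follows; the model names, prices, and scores are invented for illustration and do not describe any real model.

```python
from typing import List, NamedTuple

class Model(NamedTuple):
    name: str
    cost: float   # hypothetical price per million tokens (lower is better)
    score: float  # hypothetical benchmark score (higher is better)

def pareto_frontier(models: List[Model]) -> List[Model]:
    """Return the models that no other model dominates, sorted by cost."""
    frontier = []
    for m in models:
        dominated = any(
            o.cost <= m.cost and o.score >= m.score
            and (o.cost < m.cost or o.score > m.score)
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return sorted(frontier, key=lambda m: m.cost)

# Invented data points, for illustration only.
models = [
    Model("A", 0.5, 40.0),
    Model("B", 1.0, 38.0),   # dominated by A: pricier and weaker
    Model("C", 3.0, 60.0),
    Model("D", 10.0, 75.0),
    Model("E", 12.0, 70.0),  # dominated by D: pricier and weaker
]

frontier_names = [m.name for m in pareto_frontier(models)]
print(frontier_names)
```

The claim in the reply is that, at many price points including very cheap ones, the non-dominated entries tend to come from U.S. firms; the sketch only shows how such a frontier is computed, not whether that claim holds.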

@badboy999654 They justify the cost in that people are willing to pay that cost. The rest doesn't matter. They're more robust and smarter. If they couldn't afford these margins they'd slash them.

1d · Views 583 · Likes 8
Dean W. Ball @deanwball

@james1ach I don’t quite know what you mean, but internationally the “Chinese OSS is good enough” line is deployed as a crutch or excuse for complacency, whereas in the U.S., ironically, it mostly is rooted in the persistent neurosis our strategic class has about falling into complacency.

1d · Views 889 · Likes 13
Dean W. Ball @deanwball

@RobS142 you are mistaking the bubbles. Nobody in sf thinks this. The beliefs I am describing are most common among government staffers and the elites who surround them in almost every country on earth

1d · Views 705 · Likes 5 · Bookmarks 1
Daemon-Core @nullctl

@teortaxesTex I wonder if language is an important part of model performance.

Never tried to investigate, but there is this nagging feeling that Chinese language evals on Opus would produce a different result compared to English language evals.

I don't speak Chinese, so I can't really tell.

1d · Views 756 · Likes 4 · Bookmarks 1
Anime fan @badboy999654

@teortaxesTex Do you think people can't see through all the marketing bullshit? Microsoft has access to all the coding models from both companies, and yet Windows 11 is still a buggy mess. Every time they boot up Windows, people see that these models can't code for shit.

1d · Views 4.9K · Likes 5
Paulo Santos @ThinkFinance999

Just because you try to make AI puke a definition out, and affirm US labs offer those models, doesn't immediately make your statement real or true, plus "good enough" depends on the task and nobody is optimizing models for millions of different tasks.

Again, the lower end models from the leading labs are neither as cheap as available alternatives, nor as good, so what's the point, even?

1d · Views 357 · Likes 7
Blue Bear @Bluebearmonkey

@teortaxesTex @badboy999654 I don’t really know what a token is 🤷‍♂️

1d · Views 84 · Bookmarks 1
Dean W. Ball @deanwball

@ThinkFinance999 Hello nemotron what does Pareto frontier mean

1d · Views 390 · Likes 5