
Lisan al Gaib says Chinese AI models trail Western rivals by four to six months, but Florian Brand argues benchmarks are flawed

The capability gap is narrowest in coding tasks.

Original post
Lisan al Gaib@scaling01

that's the backward looking gap which I think is ~4-6 months

and all of these tasks are primarily coding; outside of coding the gap is larger

and the forward looking gap with Mythos is probably ~8-12 months, considering china bros will only get access to the compute and data necessary at the end of the year or early next year

6:42 AM · Jun 7, 2026 · 84 Views
Posts from X
Lisan al Gaib@scaling01

@xeophon @xpasky a lot of the leaderboards I shared are also very recent ones that don't have scores for Opus 4.5 or GPT-5.2

so an open-model being "right behind Sonnet 4.6 or Opus 4.6" doesn't mean much

Lisan al Gaib@scaling01

@xeophon @xpasky there's really too few live leaderboards outside of coding

11h · Views 181 · Likes 0 · Bookmarks 0

Florian Brand@xeophon

@scaling01 @xpasky Well and the same leaderboards use broken open model deployments to get their scores, which should be discarded. They are comparing closed models at their best vs open models at their worst / at mediocre setups at best

11h · Views 131 · Likes 1 · Bookmarks 1

Florian Brand@xeophon

@scaling01 @xpasky I can prob engineer a leaderboard the same way. I use Opus 4.8, reasoning low, mini-swe-agent with its old settings (tool calling text based and no parallel tool calling allowed, 25 max turns or something) running on Bedrock vs. Kimi K2.6 @ high in Kimi CLI running on Kimi API

10h · Views 104 · Likes 1 · Bookmarks 0