@teortaxesTex says newer reasoning and engineering benchmarks will prove fragile despite Chinese models closing the coding gap

Research engineer Florian Brand warns METR benchmarks suffer from massive error bars

Original post

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

Hear me: People used to soyface about novel coding evals, where Chyna/open models were not just behind but garbage. GLM covered most of that gap. Now we look at combined metrics like ECI, or "pure reasoning" like ARC. I predict this, too, will prove to be surprisingly fragile.

Lisan al Gaib@scaling01

"omg omg GLM-5.2 is beating fable. china is catching up"

chill out and listen to Lisan:
> slightly ahead of Opus 4.5
> behind GPT-5.2, Gemini 3 Pro and Opus 4.6

7:57 PM · Jul 1, 2026 · 6.9K Views

Sentiment

Users dismissed the GLM-5.2 benchmark results versus GPT-5.2 and Claude Opus as mere Anthropic fanboyism.

Pos 0.0% · Neg 100.0% · 1 comment with sentiment.

Posts from X

Florian Brand@xeophon

@teortaxesTex lol, a *lot* of the actual scores of GLM-5.2 are missing. no wonder its ECI is in the gutter when the scores where it's (close to) SOTA are left out.

the GBAEval score from @MechanizeWork is also sus

cc @Jsevillamol @AlexBarry4

Florian Brand@xeophon

@teortaxesTex You’d think after one year of METR error bars so wide you can fit the whole AUM of Leopold in there, people would finally understand what they mean but alas

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

For all of Dario's fearmongering, for how seriously the US is taking the "AGI race", you can tell it's mostly a race between OpenAI and Anthropic. Evaluations for frontier Chinese open weights take weeks-months, if they happen at all. China is not a factor outside rhetoric.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

lisan's (good) list of evals I welcome you to check how GLM 5.2 stacks up. Before code, it was math. The dirty secret is that American frontier really is frontier, and it's not just compute, they expand to new domains earlier. But if it's not compute…

Lisan al Gaib@scaling01

the "narrow capability gap" in question

let's put this to rest please I can't hear the coping anymore

Florian Brand@xeophon

@teortaxesTex @MechanizeWork @Jsevillamol @AlexBarry4 They don’t, ECI uses IRT. You need a handful of scores to calculate an aggregate and don’t need the same evals for all models, which is why IRT has bigger error bars

Problem is that the selection somewhat matters, so missing the good ones depresses scores

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@xeophon @MechanizeWork @Jsevillamol @AlexBarry4 Wait how do they account for values of missing evals?
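A minimal sketch of the idea in this exchange, not the actual ECI procedure: in an IRT-style fit, evals a model was never run on are simply skipped, and its ability is estimated from whichever scores exist. The one-parameter model `logit(score) = ability - difficulty`, the benchmark names, the difficulty values, and the scores below are all hypothetical, and difficulties are assumed to be already fitted.

```python
import math


def logit(p):
    """Map a score in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def estimate_ability(scores, difficulty):
    """Least-squares ability under logit(score) = ability - difficulty.

    `scores` maps benchmark name -> accuracy in (0, 1). Benchmarks the
    model was never evaluated on are absent from the dict and are
    skipped, so different models can be scored on different evals.
    """
    implied = [logit(s) + difficulty[name] for name, s in scores.items()]
    return sum(implied) / len(implied)


# Hypothetical difficulties (higher = harder), assumed already fitted.
DIFFICULTY = {"easy_qa": -1.0, "math": 0.5, "coding": 1.0, "agentic": 2.0}

# Hypothetical model that is comparatively strong on coding and agentic evals.
full = {"easy_qa": 0.80, "math": 0.55, "coding": 0.60, "agentic": 0.40}

# The same model with the two evals where it looks best left out.
partial = {"easy_qa": 0.80, "math": 0.55}

ability_full = estimate_ability(full, DIFFICULTY)
ability_partial = estimate_ability(partial, DIFFICULTY)

print(f"all four evals:   {ability_full:.3f}")
print(f"two evals missing: {ability_partial:.3f}")
```

In this toy setup the estimate drops when the model's strongest evals are omitted, which is the selection effect described above. A real fit would also estimate the difficulties jointly across many models.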

Anime fan@badboy999654

@teortaxesTex
> People
It's just that Anthropic fanboy