Chinese Models Lag GPT-5.5 and Opus on Reasoning and Long Context

Original post

Lisan al Gaib@scaling01

that is to say:
- there are RL-training-compute and test-time-compute scaling laws for long-context

Lisan al Gaib@scaling01

this is also one of the core reasons why I think Chinese models are further behind than what coding benchmarks suggest

and from the UK AISI blog today and EdgeBench we know that:
- spending 1M vs 100M matters
- having 1M vs 100K context matters

Opus 4.8 and GPT-5.5 have the same MRCR score at 100k+ context as GLM-5.2 at 16k

Why?
- GLM doesn't use reasoning effectively
- GPT-5.4 and GLM-5.2 have ~the same scores without reasoning, but GLM gets crushed once you turn on reasoning

(this can be fixed by doing more RL)

3:19 PM · Jul 2, 2026 · 666 Views

Posts from X

Lisan al Gaib@scaling01

not that we didn't know that before: "and from the UK AISI blog today and EdgeBench we know that: - spending 1M vs 100M matters - having 1M vs 100K context matters"

lol

it's just more, and more recent, evidence

Lisan al Gaib@scaling01

*GLM-5.1
