that is to say:
- there are RL-training-compute and test-time-compute scaling laws for long-context performance
this is also one of the core reasons why I think Chinese models are further behind than coding benchmarks suggest
and from the UK AISI blog today and EdgeBench we know that:
- spending 1M vs 100M matters
- having 1M vs 100K context matters
Opus 4.8 and GPT-5.5 have the same MRCR score at 100k+ context as GLM-5.2 at 16k
Why?
- GLM doesn't use reasoning effectively
- GPT-5.4 and GLM-5.2 have ~the same scores without reasoning, but GLM gets crushed once you turn on reasoning
(this can be fixed by doing more RL)