This is a remarkable graph. It reinforces the same point as the autoresearch results, WeirdML, "takeout delivery bench", and every other long multi-turn scenario. Notice: *GLM-5.2 isn't any better than GLM-5 or GLM-5.1 for the first ≈150 days*. But it learns in-context. GPT-5 takes 300 days.
GLM-5.2 ranks 2nd on Vending-Bench.
GLM releases have improved at a remarkably steady pace: a linear fit gives R² = 0.99, with a gain of almost $1k per month.
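
For anyone who wants to sanity-check a trend like this, here is a minimal sketch of an ordinary least-squares line fit with its R². The function name `linear_fit` and the numbers below are placeholders I made up for illustration. They are not the actual Vending-Bench results, and the original fit may have been computed differently.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# PLACEHOLDER data, not real benchmark numbers:
# months since the first release vs. final balance in USD.
months_since_first_release = [0, 3, 6, 9]
final_balance_usd = [1200, 3900, 7100, 9800]

slope, intercept, r_squared = linear_fit(months_since_first_release, final_balance_usd)
print(f"slope: ${slope:,.0f}/month, R^2: {r_squared:.3f}")
```

With only a handful of releases, a high R² is easy to get, so the slope is the more informative number here.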

