GLM-5.2 doesn't look great on FutureSim (but is extremely good at PostTrainBench!). I guess the GLM team didn't target forecasting of open-world events as an important capability and instead focused mostly on long-horizon coding.
So you were asking whether gains from coding would generalize to other domains?
We found GLM-5.2 to be no better than GLM-5.1 on FutureSim. The gap between open-weight and closed-weight models here is massive!
Also, despite Fable-5 being contaminated in Jan, it still scores only 27%.