
Maksym Andriushchenko of the ELLIS Institute says GLM-5.2 shows no forecasting gains over GLM-5.1 on the FutureSim benchmark

It still demonstrated strong results on the PostTrainBench benchmark.


Original post

Maksym Andriushchenko @maksym_andr

GLM-5.2 doesn't look great on FutureSim (but is extremely good at PostTrainBench!). I guess the GLM team didn't target forecasting of open-world events as an important capability and instead focused mostly on long-horizon coding.

Nikhil Chandak @nikhilchandak29

So you were asking whether gains from coding would generalize to other domains?

We found GLM-5.2 to be no better than GLM-5.1 on FutureSim. The gap between open and closed-weights here is massive!

Also, despite Fable-5 being contaminated in Jan, it still scores only 27%.

12:50 PM · Jun 16, 2026 · 493 Views


Posts from X


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

a pretty consistent scale-driven bench

When I see something like this I want to look at the traces though. It might be reasoning itself out of the answer


Ravid Shwartz Ziv @ziv_ravid

Prediction is a hard task, especially predicting the future...
