
Fable 5 leads the new LisanBench spatial reasoning benchmark based on Opus Magnum puzzles, beating GPT-5.5

GLM-5.2 beat Gemini 3.5 Flash despite Gemini's extensive visual pretraining


Original post

Lisan al Gaib@scaling01

new shape-rotator benchmark

Fable and GPT-5.5 of course far ahead of the field

but now look at GLM-5.2. it's ahead of Gemini 3.5 Flash and Opus 4.8

you can't really benchmaxx a benchmark that was just released

so the GLM-5.2 gains seem more and more like a genuine improvement!

Rob Haisfield@RobertHaisfield

Are AI agents shape rotators? In this new benchmark, we let the models play campaign puzzles in Opus Magnum, a puzzle game by @zachtronics.

Ironically, Claude Opus 4.8 performed poorly, being beaten by GPT-5.5, Gemini 3.5 Flash, and GLM 5.2. Claude Fable 5 crushed them all.

2:21 PM · Jun 17, 2026 · 21.7K Views

Sentiment

Positive users praise GLM-5.2 for legitimately outperforming Opus 4.8 and Gemini 3.5 Flash on a freshly released puzzle benchmark, while negative users accuse Gemini of data contamination or dismiss it as a weak model.

Positive: 42.9%

Negative: 57.1%

8 comments with sentiment.


Posts from X

Most Activity

Views: 2.4K · Bookmarks: 3 · Likes: 26 · Replies: 4

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Another point in favor of "multimodal pretraining does not generalize to spatial reasoning in text". Kimi had 15T heavily visual tokens. It lags GLM 5.2 even more than it does in pure text tasks. Geminis have excellent vision. Little to show for it. Opus is almost blind, yet…


Lisan al Gaib@scaling01

"It’s not a vision benchmark, all done through coordinates"

from the creator @RobertHaisfield

I feel like multimodal pre-training doesn't do much without image output. it's just learning a mapping from your img encoder to the language space

I think what's more useful is vision RL


Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

Well, maybe I'm unfair. Gemini Pro still does better than GLM, even though it's a trashy model. But as with Fable, I think it's overwhelmingly a function of scale.


Pangram@pangram

@noah_vandal @0x_lun @scaling01 We believe that this document is fully AI-generated

Disclaimer: For text under 75 words, results may be less accurate.

https://www.pangram.com/history/7c3b47f3-e3af-4a72-994a-2f8a8bea47a3


Noah Vandal@noah_vandal

@0x_lun @scaling01 @pangram slop?


Moon@MoonL88537

@teortaxesTex this is a very good point. if it were true, gemini should be the smartest because of what it sees.

counterpoint to that is that shtty frankenmerge with semantic mashing is why and real unified latent space has never been tried or trained properly.

bad blending worse than none


Uzi@uzairakrum

@scaling01 @zephyr_z9 Just how good is GLM 5.2, even? It's thrashing Opus on all benchmarks


.@ediduval

@scaling01 My strategy plan.

🔻↩️↩️


JMB 🧙‍♂️@jmbollenbacher

@scaling01 Gemini almost certainly has data contamination on this benchmark because of YouTube, btw.


Fireply@fireply_ai

@scaling01 the GPT 5.5 solution in that image looks like it took the scenic route. Fable just went direct


Skye@skye1bb

@scaling01 Missed opportunity to name it Roonbench


AdiiX@adiix_official

@scaling01 GLM-5.2 punching above its weight on a fresh benchmark is the realest signal you can get. no time to benchmaxx means the gains are legit


Lunari@0x_lun

@scaling01 GLM 5.2 at 24.5 beating both flash and opus on a fresh benchmark is the kind of result that usually gets dismissed until it keeps happening

Chinese labs are quietly stacking these


Rob Haisfield@RobertHaisfield

@scaling01 @teortaxesTex explainer from the site:


iksrat@c0mbinat0r

@teortaxesTex I feel like nobody has yet figured out a good way to exploit the synergy between modalities in multimodal pretraining, which also partly explains why the model still sucks at Computer Use tasks where the pretrained model isn't good enough to be RL'd on.


Zayd Honey@TheHoneyX

@scaling01 It's okay to believe that GLM just cooked with this one and made a genuine effort on intelligence rather than benchmaxxing. Everything I've seen about GLM is positive and they deserve all the praise


Saylor@seylorra

@scaling01 glm quietly grinding while everyone was arguing about the top of the leaderboard is genuinely funny

the gaming pipeline for benchmarks has been consistently underrated


.@ediduval

@scaling01