Tech | 20h ago

Serafim Batzoglou finds GLM-5.2 solved only two of 11 INDUCTION logic problems while running up massive hidden API costs

Hidden retries inflated the total test cost to $48.55.
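The headline figures can be put in per-problem terms with some simple arithmetic. The sketch below assumes the $48.55 total covers all 11 problems, which the summary implies but does not break down. The function name `cost_summary` is illustrative and does not come from the benchmark.

```python
def cost_summary(total_cost, attempted, solved):
    """Return (cost per attempted problem, cost per solved problem).

    The second value is None when nothing was solved, since the ratio
    is undefined in that case.
    """
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    per_attempted = total_cost / attempted
    per_solved = total_cost / solved if solved else None
    return per_attempted, per_solved


# Figures from the story: $48.55 total, 11 problems, 2 solved.
# Assumption: the total applies to the whole 11-problem run.
per_attempted, per_solved = cost_summary(48.55, 11, 2)
print(f"per attempted problem: ${per_attempted:.2f}")
print(f"per solved problem:    ${per_solved:.2f}")
```

Under that assumption the run averages about $4.41 per attempted problem and a little over $24 per solved problem. The spread between those two numbers shows how much failed attempts add to the effective price.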

Sentiment

Many users criticized GLM-5.2 for charging up to $12 per problem while failing hard reasoning benchmarks and proving slow or unusable on complex tasks, though some noted it handles routine work adequately at lower cost.

Positive: 37.5%

Negative: 62.5%

8 comments with sentiment.

Posts from X

bruce@bruce_x_offi

@s_batzoglou but we all don't work everyday on such hard tasks, I found day to day opus4.6 level tasks it is able to handle pretty well at cheaper rate, I am so happy that I can finally say I can live without Claude sub. Thanks @Zai_org

Serafim Batzoglou@s_batzoglou

Thanks! I haven’t tried higher order logics yet. Opus 4.8 runs into the same problem as GLM-5.2 but much less so. Too expensive to run and won’t return very good results as it seems from my incomplete results so far. I am skipping GPT-5.5 and wait for GPT-5.6.

Running on the full benchmark is $2-5k estimated each, for the elite models. However this is a good suggestion. I’ll probably make a tiny 25-problem subset and run all models.

Serafim Batzoglou@s_batzoglou

@xeophon I use all the providers' APIs

Reaz@LLMathematician

@s_batzoglou I think you used a quantized version. Lots of API providers are greedy, and the harness matters more.

Serafim Batzoglou@s_batzoglou

Valid point. For non-frontier reasoning tasks, the Chinese models are becoming super strong. And even for hard tasks, it looks to me that kimi k2.7 is close to frontier. But can’t use GLM-5.2 for these tasks yet. Could be an issue of tuning the model internally to not chew tokens and return nothing.

Serafim Batzoglou@s_batzoglou

@bnjorogedev I haven't finished GPT-5.5 yet. Just a few pilot runs and it does well. Running on the full benchmark will be ~$3k. GPT-5.4 is best overall among the models I've tried, see below.

Florian Brand@xeophon

@s_batzoglou did you use an api? if so, which one? all seem very flaky rn due to them getting hammered by demand

MirrorDiver@MirrorDiver

Finally, someone here. Everyone praises GLM. I found it not usable for my software or other tasks. It thinks forever. Composer 2.5 or local Gemma 4 worked best for cheap inference. I’m starting to believe it is Chinese propaganda to undermine frontier models. We need open source, but spam on X is not justified.

filipe@filicroval

@s_batzoglou Might be worth testing with stricter output formatting instructions or few-shot examples of clean short formulas. The bloat penalty in your benchmark seems to expose it.

Serafim Batzoglou@s_batzoglou

I haven’t completed them yet:
- opus 4.6 doesn’t do well (see the paper)
- opus 4.8 does poorly because it fails to return results and then charges for lots of tokens. Same as GLM-5.2, way less pronounced, but more expensive.
- GPT-5.5 does well in pilots. I probably won’t complete the full run as it would be $3k and I am waiting for the impending release of GPT-5.6. I need to skip some models.

a new horizon (愿/acc)@militarymindfuc

@s_batzoglou Very good research, thank you for publishing it! I have two questions; A. Did u find any differences from first order logic to higher order logic reasoning between models? B. Did u test Opus 4.7/8, Fable, GPT 5.5 across reasoning levels, is there a blog with preliminary results?

Serafim Batzoglou@s_batzoglou

@filicroval I do test with few shot format-only and clear strict json instructions. Most errors are completely empty responses after an average of 20 hours on a single problem.
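The failure mode described here, empty responses despite strict JSON instructions, is easy to separate from ordinary formatting errors when scoring a run. The following is a minimal sketch of such a check, not the benchmark's actual harness. The `"formula"` key and the category names are hypothetical, chosen only for illustration.

```python
import json
from collections import Counter


def classify_response(raw):
    """Sort a raw model reply into a coarse outcome category.

    Assumption: a valid answer is a JSON object with a string field
    named "formula". That schema is hypothetical.
    """
    if raw is None or not raw.strip():
        return "empty"  # the model returned nothing usable
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"  # text came back, but it is not JSON
    if not isinstance(parsed, dict) or not isinstance(parsed.get("formula"), str):
        return "wrong_shape"  # valid JSON, but not the expected object
    return "ok"


def tally(responses):
    """Count how many replies fall into each category."""
    return Counter(classify_response(r) for r in responses)


# Example with made-up replies.
sample = ["", '{"formula": "P(x)"}', "not json", "[1, 2]", None]
print(tally(sample))
```

Counting empty replies separately from malformed ones makes it possible to tell whether stricter formatting prompts would help, or whether the model is simply returning nothing.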