
@s_batzoglou But we don't all work on such hard tasks every day. For day-to-day, Opus 4.6-level tasks, I found it handles them pretty well at a cheaper rate. I am so happy that I can finally say I can live without a Claude sub. Thanks @Zai_org
Hidden retries inflated the total test cost to $48.55.
Many users criticized GLM-5.2 for charging up to $12 per problem while failing hard reasoning benchmarks and proving slow or unusable on complex tasks, though some noted it handles routine work adequately at lower cost.

Thanks! I haven’t tried higher-order logics yet. Opus 4.8 runs into the same problem as GLM-5.2, but much less so. It is too expensive to run and, judging from my incomplete results so far, won’t return very good results. I am skipping GPT-5.5 and will wait for GPT-5.6.
Running on the full benchmark is $2-5k estimated each, for the elite models. However this is a good suggestion. I’ll probably make a tiny 25-problem subset and run all models.

@xeophon I use all the providers' APIs

@s_batzoglou I think you used a quantized version. Lots of API providers are greedy, and the harness matters more.

Valid point. For non-frontier reasoning tasks, the Chinese models are becoming super strong. And even for hard tasks, it looks to me that kimi k2.7 is close to frontier. But can’t use GLM-5.2 for these tasks yet. Could be an issue of tuning the model internally to not chew tokens and return nothing.

@bnjorogedev I haven't finished GPT-5.5 yet. Just a few pilot runs and it does well. Running on the full benchmark will be ~$3k. GPT-5.4 is best overall among the models I've tried, see below.

@s_batzoglou did you use an api? if so, which one? all seem very flaky rn due to them getting hammered by demand

Finally, someone here. Everyone praises GLM. I found it not usable for my software or other tasks. It thinks forever. Composer 2.5 or local Gemma 4 worked best for cheap inference. I’m starting to believe it is Chinese propaganda to undermine frontier models. We need open source, but spam on X is not justified.

@s_batzoglou Might be worth testing with stricter output formatting instructions or few-shot examples of clean short formulas. The bloat penalty in your benchmark seems to expose it.

I haven’t completed them yet:
- Opus 4.6 doesn’t do well (see the paper).
- Opus 4.8 does poorly because it fails to return results and then charges for lots of tokens. Same as GLM-5.2, way less pronounced, but more expensive.
- GPT-5.5 does well in pilots. I probably won’t complete the full run as it would be $3k and I am waiting for the impending release of GPT-5.6. I need to skip some models.

@s_batzoglou Very good research, thank you for publishing it! I have two questions; A. Did u find any differences from first order logic to higher order logic reasoning between models? B. Did u test Opus 4.7/8, Fable, GPT 5.5 across reasoning levels, is there a blog with preliminary results?

@filicroval I do test with few shot format-only and clear strict json instructions. Most errors are completely empty responses after an average of 20 hours on a single problem.

@s_batzoglou how did Opus or GPT-5.5 handle those tasks?

@s_batzoglou How did 5.5 and opus do in the benchmark

@LLMathematician I used the ZAI api directly

@s_batzoglou yikes $12 per problem with 2 correct is rough

@MirrorDiver In my problems, it thinks for an average of 16 hours compared to 5 mins for many other models. Unusable at least in these tasks

@s_batzoglou I also found GLM uses 4-5x the tokens of GPT-5.5 and Opus

@s_batzoglou Are you sure you weren't running quantized models from some of the US providers on openrouter?
if not
@Zai_org @louszbd @ZixuanLi_ please work on this in the next version ; ) we want GLM to be smarter, and have shorter, higher quality reasoning.

@s_batzoglou Bro, which provider's API did you use? I'd point out that a lot of providers are now quantizing models, which makes them dumber.