Many users praised the benchmark thread, which found GLM consuming twice the tokens of Opus on dbt tasks, as helpful and detailed, while a few dismissed it as low-quality content or blamed the provider.
Follow-up to my GLM vs Opus thread: let's talk cost.
We ran 103 dbt tasks x 3 trials on each model. Same harness, same tasks.
GLM: 860M tokens
Opus: 439M tokens
That's ~2x. But the "why" is more interesting than the number.
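For readers who want to check the headline ratio, here is a minimal sketch that only uses the figures quoted above. The per-trial averages are derived here and do not appear in the thread; they assume every one of the 103 x 3 trials is counted in the totals.

```python
# Figures quoted in the thread.
TASKS = 103
TRIALS_PER_TASK = 3
GLM_TOKENS = 860_000_000
OPUS_TOKENS = 439_000_000

# Assumption: both totals cover all tasks x trials.
trials = TASKS * TRIALS_PER_TASK
token_ratio = GLM_TOKENS / OPUS_TOKENS

# Derived averages, not reported in the thread.
glm_tokens_per_trial = GLM_TOKENS / trials
opus_tokens_per_trial = OPUS_TOKENS / trials

print(f"trials per model: {trials}")
print(f"GLM / Opus token ratio: {token_ratio:.2f}x")
print(f"GLM tokens per trial:  {glm_tokens_per_trial / 1e6:.2f}M")
print(f"Opus tokens per trial: {opus_tokens_per_trial / 1e6:.2f}M")
```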
2x the tokens. Half the cost.
"Uses more tokens" and "costs more" are not the same claim.
But here's the punchline.
Normalized to 90% cache hit rate:
GLM-5.2 (Fireworks): $1.12/session
Opus-4.7 (Anthropic): $2.14/session
GLM is ~48% cheaper.
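The thread does not say how the normalization was done, so the following is only a sketch of one plausible way to blend cached and uncached input pricing at a given cache hit rate. The function `blended_input_cost` and the per-million-token prices in the usage lines are hypothetical placeholders, not Fireworks or Anthropic rates, and output tokens are ignored. The last part just checks the "~48% cheaper" figure from the two session costs quoted above.

```python
def blended_input_cost(input_tokens: float, cache_hit_rate: float,
                       uncached_price_per_m: float,
                       cached_price_per_m: float) -> float:
    """Input-side cost in dollars at a given cache hit rate.

    Hypothetical helper: ignores output tokens and cache-write surcharges.
    """
    cached_tokens = input_tokens * cache_hit_rate
    uncached_tokens = input_tokens - cached_tokens
    total = (cached_tokens * cached_price_per_m
             + uncached_tokens * uncached_price_per_m)
    return total / 1_000_000


def percent_cheaper(cheaper: float, pricier: float) -> float:
    """How much cheaper the first cost is, as a percentage of the second."""
    return 100 * (pricier - cheaper) / pricier


# Session costs quoted in the thread (already normalized to 90% cache hits).
GLM_SESSION_COST = 1.12
OPUS_SESSION_COST = 2.14
savings = percent_cheaper(GLM_SESSION_COST, OPUS_SESSION_COST)
print(f"GLM is about {savings:.0f}% cheaper per session")

# Placeholder prices ($ per million tokens), for illustration only.
for hit_rate in (0.53, 0.90, 0.96):
    cost = blended_input_cost(1_000_000, hit_rate,
                              uncached_price_per_m=1.0,
                              cached_price_per_m=0.1)
    print(f"hit rate {hit_rate:.0%}: ${cost:.3f} per 1M input tokens")
```

With any pricing where cached tokens are cheaper than uncached ones, the same token count costs less as the hit rate rises, which is why the comparison fixes both models at the same rate.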
On tasks both models solve, GLM uses ~17% more calls. Not 2x.
The 2x comes from tail cases where GLM spirals into 400+ call failures.
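To illustrate how a few spiraling sessions can dominate an aggregate, here is a small sketch that separates sessions at the 400-call mark mentioned above. The call counts in `example_calls` are invented for illustration and are not data from the benchmark.

```python
from statistics import mean, median

# Threshold taken from the "400+ call" figure in the thread.
SPIRAL_THRESHOLD = 400


def split_sessions(call_counts, threshold=SPIRAL_THRESHOLD):
    """Partition per-session call counts into normal sessions and spirals."""
    normal = [c for c in call_counts if c < threshold]
    spirals = [c for c in call_counts if c >= threshold]
    return normal, spirals


# Made-up per-session call counts, for illustration only.
example_calls = [40, 55, 60, 48, 52, 450, 620]

normal, spirals = split_sessions(example_calls)
spiral_share = sum(spirals) / sum(example_calls)

print(f"median calls: {median(example_calls)}")
print(f"mean calls:   {mean(example_calls):.1f}")
print(f"spirals: {len(spirals)} of {len(example_calls)} sessions, "
      f"{spiral_share:.0%} of all calls")
```

In this toy data the median stays close to a typical session while the mean is pulled up by two spirals, which is the same shape of effect the thread describes.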
Three factors drive the token gap:
More turns. GLM averages 99 turns per trial. Opus averages 80. Each turn re-sends the full context.
Atomic tool calls. GLM fires one SQL per check. Opus batches.
Lower cache hits. Opus hits 96% cache. GLM hits 53% (on the provider we used).
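A toy model shows why extra turns cost more than their share when each turn re-sends the full context. It assumes every turn adds the same number of new tokens, which real sessions will not do, so the numbers below are illustrative only; `cumulative_input_tokens` and the 1,000-token figure are hypothetical.

```python
def cumulative_input_tokens(turns: int, tokens_added_per_turn: int) -> int:
    """Total input tokens when turn i re-sends everything from turns 1..i.

    Toy model: the context grows by a constant amount each turn, so the
    total is tokens_added_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_added_per_turn * turns * (turns + 1) // 2


# Average turns per trial quoted in the thread.
GLM_TURNS = 99
OPUS_TURNS = 80

# Assumed constant growth per turn (hypothetical).
TOKENS_PER_TURN = 1_000

glm_input = cumulative_input_tokens(GLM_TURNS, TOKENS_PER_TURN)
opus_input = cumulative_input_tokens(OPUS_TURNS, TOKENS_PER_TURN)

print(f"turn ratio:        {GLM_TURNS / OPUS_TURNS:.2f}x")
print(f"input token ratio: {glm_input / opus_input:.2f}x")
```

Under this assumption, about 24% more turns becomes roughly 53% more input tokens, because total input grows with the square of the turn count. Cache hits reduce what those re-sent tokens cost, not how many are sent.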

@RamaswmySridhar Nice! I also analyzed GLM vs other frontier costs in different scenarios.
Full breakdown here: https://spielos.xyz/GLM5.2-Cost-Breakdown/

@RamaswmySridhar How about the success rate?

@RamaswmySridhar You can likely create a batch version of the tool and expose it to GLM to cut down. The parallel tool calling was a trick the LLM providers built exactly for this reason.
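A minimal sketch of what such a batch tool could look like, using Python's built-in sqlite3 as a stand-in database. The name `run_sql_batch` and the result shape are hypothetical; this is not the harness or tool interface used in the benchmark.

```python
import sqlite3
from typing import Any


def run_sql_batch(conn: sqlite3.Connection,
                  queries: list[str]) -> list[dict[str, Any]]:
    """Run several SQL checks in one tool call and return all results.

    Hypothetical tool: each query gets its own result entry, and a failing
    query is reported without aborting the rest of the batch.
    """
    results: list[dict[str, Any]] = []
    for sql in queries:
        try:
            rows = conn.execute(sql).fetchall()
            results.append({"sql": sql, "ok": True, "rows": rows})
        except sqlite3.Error as exc:
            results.append({"sql": sql, "ok": False, "error": str(exc)})
    return results


if __name__ == "__main__":
    # In-memory database with a tiny example table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 10.0), (2, 25.5)])

    checks = [
        "SELECT COUNT(*) FROM orders",
        "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    ]
    for result in run_sql_batch(conn, checks):
        print(result)
```

One call carrying N checks replaces N turns, so the context is re-sent once instead of N times.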

@RamaswmySridhar Sridhar, not sure if it’s just me, but it seems really weird to see AI slop from top leadership. It also creates massive, difficult-to-read threads that lose the key takeaways you want for your audience.

@RamaswmySridhar @zephyr_z9 Why do people care about tokens and token cost? At the end of the day the cost shouldn’t matter, the result should. If one model is producing crap results, but low token cost, then who cares?

@RamaswmySridhar Great thread! Lots of good detail. 53% cache hit is definitely low. We consistently get 90%+. GLM spiraling on some problems is real - but call failures? Like tool call failures? That also sounds like provider issues

Coming soon: results for glm 5.2 w/an optimized coco harness...

@RamaswmySridhar it's not that the models use more tokens. it's that they use more tokens

@RamaswmySridhar very helpful! thanks for sharing

@RamaswmySridhar @8teAPi After GPT 5.5, most of the heavy duty tasks have moved to GPT 5.5. Would be great to see the trial and comparison between GPT 5.5 with GLM 5.2.

@RamaswmySridhar curious how GPT 5.5 does. should be token efficient

@RamaswmySridhar Very interesting. Would love to see Codex/5.5 for the same tasks. I think these real world 'benchmarks' are so relevant

@RamaswmySridhar Does Snowflake provide an API to call GLM? What is the price?

@RamaswmySridhar Again, great stuff. These are always good reads. But why 4.7 vs 4.8?

@RamaswmySridhar i'd split normal sessions from 400-call spirals; those are two different product risks.

@RamaswmySridhar With GLM + Fireworks + Pi harness we are seeing ~98% cache reuse.
A low cache hit rate is usually a harness issue. Maybe this is the coco optimization you are teasing?