Benchmark Shows GLM Consumes Twice The Tokens Of Opus On dbt Tasks

Original post

sridhar @RamaswmySridhar · #1328 in Tech

Follow-up to my GLM vs Opus thread: let's talk cost.

We ran 103 dbt tasks x 3 trials on each model. Same harness, same tasks.

GLM: 860M tokens
Opus: 439M tokens

That's ~2x. But the "why" is more interesting than the number.
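The arithmetic behind the "~2x" figure can be checked with a minimal sketch. The task count, trial count, and token totals are the ones quoted in the post; the per-trial averages assume each total covers all 103 x 3 trials for that model, which the post implies but does not state outright.

```python
# Figures quoted in the post.
TASKS = 103
TRIALS_PER_TASK = 3
GLM_TOKENS = 860_000_000
OPUS_TOKENS = 439_000_000


def token_summary(glm_tokens, opus_tokens, tasks, trials_per_task):
    """Return the token ratio and per-trial averages for the two models."""
    trials = tasks * trials_per_task  # trials run per model
    return {
        "trials_per_model": trials,
        "ratio": glm_tokens / opus_tokens,
        # Assumption: each total is spread over every trial for that model.
        "glm_tokens_per_trial": glm_tokens / trials,
        "opus_tokens_per_trial": opus_tokens / trials,
    }


summary = token_summary(GLM_TOKENS, OPUS_TOKENS, TASKS, TRIALS_PER_TASK)
print(f"trials per model: {summary['trials_per_model']}")
print(f"GLM / Opus token ratio: {summary['ratio']:.2f}")
print(f"GLM tokens per trial:  {summary['glm_tokens_per_trial'] / 1e6:.2f}M")
print(f"Opus tokens per trial: {summary['opus_tokens_per_trial'] / 1e6:.2f}M")
```

Under that assumption the ratio comes out at roughly 1.96, which matches the "~2x" in the post.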

3:02 PM · Jun 25, 2026 · 55.6K Views

Sentiment

Many users praised the benchmark thread, which found GLM consuming twice the tokens of Opus on dbt tasks, as helpful and detailed, while a few dismissed it as low-quality content or blamed the provider.

Positive: 77.8%
Negative: 22.2%

9 comments with sentiment.

Posts from X

sridhar @RamaswmySridhar

Three factors drive the token gap:

More turns. GLM averages 99 turns per trial. Opus averages 80. Each turn re-sends the full context.

Atomic tool calls. GLM fires one SQL per check. Opus batches.

Lower cache hits. Opus gets a 96% cache hit rate. GLM gets 53% (on the provider we used).
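To see why the turn count matters so much when each turn re-sends the full context, here is a rough sketch. It assumes the context grows by a fixed number of tokens per turn; `GROWTH` is a made-up value, not a measurement from the benchmark. Only the turn counts (99 and 80) come from the post.

```python
def cumulative_input_tokens(turns: int, tokens_added_per_turn: int) -> int:
    """Input tokens sent over a session when each turn re-sends all prior context."""
    context = 0
    total = 0
    for _ in range(turns):
        context += tokens_added_per_turn  # the conversation grows by a fixed amount
        total += context                  # the whole context is sent again
    return total


GROWTH = 1_000  # hypothetical tokens added per turn, not from the post

glm_total = cumulative_input_tokens(99, GROWTH)   # 99 turns per trial (from the post)
opus_total = cumulative_input_tokens(80, GROWTH)  # 80 turns per trial (from the post)

print(f"ratio from turn count alone: {glm_total / opus_total:.2f}")
```

Because total input grows roughly with the square of the turn count, about 24% more turns gives about 1.53x the input tokens in this toy model, before the tool-call and cache effects are counted.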

sridhar @RamaswmySridhar

2x the tokens. Half the cost.

"Uses more tokens" and "costs more" are not the same claim.

sridhar @RamaswmySridhar

But here's the punchline.

Normalized to 90% cache hit rate:

GLM-5.2 (Fireworks): $1.12/session
Opus-4.7 (Anthropic): $2.14/session

GLM is ~48% cheaper.
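The "~48% cheaper" figure follows directly from the two per-session costs, as the sketch below shows. The thread does not say how the normalization to a 90% cache hit rate was done. `blended_input_price` is one plausible way to model it, and the per-token prices passed to it are placeholders rather than real provider rates.

```python
def percent_cheaper(cost_a: float, cost_b: float) -> float:
    """How much cheaper cost_a is than cost_b, in percent."""
    return (1 - cost_a / cost_b) * 100


def blended_input_price(uncached_price: float, cached_price: float,
                        cache_hit_rate: float) -> float:
    """Average input price per token for a given cache hit rate.

    Assumption: cached and uncached input tokens are billed at two flat rates.
    """
    return cache_hit_rate * cached_price + (1 - cache_hit_rate) * uncached_price


# Per-session costs quoted in the post.
savings = percent_cheaper(1.12, 2.14)
print(f"GLM is {savings:.1f}% cheaper per session")

# Placeholder prices (per million tokens), for illustration only.
at_53 = blended_input_price(uncached_price=1.00, cached_price=0.10, cache_hit_rate=0.53)
at_90 = blended_input_price(uncached_price=1.00, cached_price=0.10, cache_hit_rate=0.90)
print(f"blended price at 53% hits: {at_53:.3f}, at 90% hits: {at_90:.3f}")
```

With a large gap between cached and uncached prices, moving from a 53% to a 90% hit rate cuts the blended input price sharply, which is why the normalization changes the comparison.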

sridhar @RamaswmySridhar

On tasks both models solve, GLM uses ~17% more calls. Not 2x.

The 2x comes from tail cases where GLM spirals into 400+ call failures.
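The difference between the overall ratio and the ratio on tasks both models solve can be illustrated with a small sketch. The records below are invented purely to show the mechanism and are not benchmark data. `TaskResult` and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    glm_calls: int
    opus_calls: int
    glm_solved: bool
    opus_solved: bool


def call_ratio(results):
    """Total GLM calls divided by total Opus calls over the given tasks."""
    return sum(r.glm_calls for r in results) / sum(r.opus_calls for r in results)


def split_by_both_solved(results):
    """Separate tasks both models solved from tasks at least one model failed."""
    both = [r for r in results if r.glm_solved and r.opus_solved]
    rest = [r for r in results if not (r.glm_solved and r.opus_solved)]
    return both, rest


# Made-up records: three ordinary tasks and one spiral that ends in failure.
results = [
    TaskResult("t1", glm_calls=35, opus_calls=30, glm_solved=True, opus_solved=True),
    TaskResult("t2", glm_calls=47, opus_calls=40, glm_solved=True, opus_solved=True),
    TaskResult("t3", glm_calls=58, opus_calls=50, glm_solved=True, opus_solved=True),
    TaskResult("t4", glm_calls=420, opus_calls=60, glm_solved=False, opus_solved=True),
]

both_solved, tail = split_by_both_solved(results)
print(f"ratio on tasks both solve: {call_ratio(both_solved):.2f}")
print(f"ratio on all tasks:        {call_ratio(results):.2f}")
```

A single long failed session is enough to pull the aggregate ratio well above the ratio seen on commonly solved tasks, which is the pattern the post describes.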

Shayan @ShayanSpiel

@RamaswmySridhar Nice! I also analyzed GLM vs other frontier costs in different scenarios.

Full breakdown here: https://spielos.xyz/GLM5.2-Cost-Breakdown/

Congxing Cai @congxing

@RamaswmySridhar How about the success rate?

moonboy @Szypetike

@RamaswmySridhar You can likely create a batch version of the tool and expose it to GLM to cut down on calls. Parallel tool calling is a trick the LLM providers built for exactly this reason.
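A batch tool of the kind suggested here might look like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for the warehouse. `run_checks_batched`, the `orders` table, and the check names are all hypothetical and not part of the benchmark harness.

```python
import sqlite3


def run_checks_batched(conn, checks):
    """Run several named SQL checks in one tool call.

    Returns {check_name: rows} and records an error string for any check
    that fails, so one bad query does not discard the other results.
    """
    results = {}
    for name, sql in checks.items():
        try:
            results[name] = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            results[name] = f"error: {exc}"
    return results


# Tiny in-memory table to exercise the tool.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (2, 5.0)],
)

checks = {
    "row_count": "SELECT COUNT(*) FROM orders",
    "null_amounts": "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    "duplicate_ids": "SELECT id FROM orders GROUP BY id HAVING COUNT(*) > 1",
}

batched = run_checks_batched(conn, checks)
for name, rows in batched.items():
    print(name, rows)
```

Three checks come back from a single tool call instead of three separate turns, so the context is re-sent once rather than three times.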

Neeraj @ThePeshwa

@RamaswmySridhar Sridhar, not sure if it’s just me, but it seems really weird to see AI slop from top leadership. It also creates massive, difficult-to-read threads that lose the key takeaways you want for your audience.

Arbitr@ge @Arbitrage_econs

@RamaswmySridhar @zephyr_z9 Why do people care about tokens and token cost? At the end of the day the cost shouldn’t matter, the result should. If one model is producing crap results, but low token cost, then who cares?

Tom Greenwald @tomgreenwald

@RamaswmySridhar Great thread! Lots of good detail. 53% cache hit is definitely low. We consistently get 90%+. GLM spiraling on some problems is real - but call failures? Like tool call failures? That also sounds like provider issues

sridhar @RamaswmySridhar

Coming soon: results for GLM 5.2 with an optimized coco harness...

Nick Khami @skeptrune

@RamaswmySridhar it's not that the models use more tokens. it's that they use more tokens

Guohao Li 🐫 @guohao_li

@RamaswmySridhar very helpful! thanks for sharing

Sanjeev Kumar @mishrak_sanjeev

@RamaswmySridhar @8teAPi Since GPT 5.5 came out, most heavy-duty tasks have moved to it. It would be great to see a trial comparing GPT 5.5 with GLM 5.2.

Ankit Gupta @agupta

@RamaswmySridhar curious how GPT 5.5 does. should be token efficient

Alan Blair @AlanRBlair

@RamaswmySridhar Very interesting. Would love to see Codex/5.5 for the same tasks. I think these real world 'benchmarks' are so relevant

BizAI @hankli

@RamaswmySridhar Does Snowflake provide an API to call GLM? What is the price?

Latent Local @latentlocal

@RamaswmySridhar Again, great stuff. These are always good reads. But why 4.7 vs 4.8?

Subramanya N @subramanya

@RamaswmySridhar i'd split normal sessions from 400-call spirals; those are two different product risks.

voratiq @voratiq

@RamaswmySridhar With GLM + Fireworks + Pi harness we are seeing ~98% cache reuse.

A low cache hit rate is usually a harness issue. Maybe this is the coco optimization you are teasing?
