
Snowflake CEO Sridhar Ramaswamy says GLM-5.2 ties Opus-4.7 on Pass@3 dbt tasks despite lagging on Pass@1

GLM-5.2 achieved the tie without reinforcement learning optimization.


Original post

sridhar@RamaswmySridhar · #1328 in Tech

Early results from @snowflake's Coco team on GLM-5.2 vs Opus-4.7 on dbt-bench: what the trajectories actually show 🧵

9:39 AM · Jun 23, 2026 · 170.8K Views

Sentiment

Many users praised the GLM-5.2 versus Opus-4.7 benchmark thread for its concrete insights and the model's speed and cost benefits, while others criticized its training approach and tendency to repeat actions without progress.

Pos

41.7%

Neg

58.3%

13 comments with sentiment.


Posts from X

Most Activity

Views: 22.8K · Bookmarks: 13 · Likes: 78 · Retweets: 2

sridhar@RamaswmySridhar

We ran 103 dbt tasks × 3 trials on both GLM-5.2 and Opus-4.7.

Pass@3: 66% vs 67% — tied. Pass@1: 47.6% vs 53.7% — Opus wins by 6 pp.

GLM is noisier per-trial, but broad enough at k=3 to stay competitive.
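The thread does not say how the scores were computed. A minimal sketch, assuming the commonly used unbiased pass@k estimator: with 3 trials per task, Pass@1 reduces to the mean per-trial pass rate and Pass@3 to the share of tasks solved at least once. The function names and the per-task pass counts below are hypothetical, not the dbt-bench data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n trials, c of them passed."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(passes_per_task: list[int], n: int, k: int) -> float:
    """Average the per-task estimate over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in passes_per_task) / len(passes_per_task)


# Hypothetical pass counts out of 3 trials for four tasks (illustration only).
example = [3, 1, 0, 2]
pass_at_1 = benchmark_pass_at_k(example, n=3, k=1)  # mean per-trial pass rate
pass_at_3 = benchmark_pass_at_k(example, n=3, k=3)  # solved at least once in 3
print(pass_at_1, pass_at_3)
```

Under this definition a model that is inconsistent from trial to trial can trail on Pass@1 and still match on Pass@3, as long as it solves a similar set of tasks at least once.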


Replies: 2

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

> it's STILL not RL-maxed mate, just how *small* is Opus actually?


Latent Local@latentlocal

@RamaswmySridhar @Snowflake Nice! This is good stuff. Thanks for doing the work and sharing. I just started using 5.2 so will be interesting to compare.


Philip Kiely@philipkiely

@RamaswmySridhar @Snowflake Thanks for the writeup! Given how much faster and less expensive GLM-5.2 is, it's broadly ok to burn more tokens for the same task.


GDP@bookwormengr

@RamaswmySridhar @Snowflake I am curious which endpoint you used, Sridhar? I am using Fireworks and am blown away by how well it deals with complex scenarios.

Performance can differ vastly from one implementation to another.

Thanks for sharing.


sridhar@RamaswmySridhar

Bottom line: Verification volume doesn't predict correctness.

GLM's worst losses come from verifying the wrong axes exhaustively. Its other failure mode — early give-up — is orthogonal to verification quantity.


auuru (e/∞)@auurujimbei

@RamaswmySridhar @Snowflake do you have a cost comparison?


brendan@BrendanPlayford

@RamaswmySridhar @Snowflake this is the only kind of glm vs opus take worth reading, actual trajectories not vibes. i have had hit and miss results with GLM in testing still not sure what to make of it


sridhar@RamaswmySridhar

"GLM takes more turns" — ✅ Confirmed

99 turns avg vs 80 for Opus. 40 vs 29 execution-style calls/trial. This is real.


sridhar@RamaswmySridhar

"GLM uses 2× more tokens" — ✅ Confirmed

860M vs 439M billing tokens. More turns + atomic API calls + lower prompt-cache reuse (53% vs 96%).
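The arithmetic behind the "2×" figure can be checked directly from the numbers in the post. This is a small sketch; it assumes "prompt-cache reuse" means the fraction of prompt tokens served from cache, which the thread does not define, and the variable names are illustrative.

```python
# Figures quoted in the post.
glm_billing_tokens = 860e6
opus_billing_tokens = 439e6
glm_cache_reuse = 0.53
opus_cache_reuse = 0.96

# Ratio of total billing tokens, the basis of the "2x" claim.
token_ratio = glm_billing_tokens / opus_billing_tokens

# Assumption: reuse is the cached share of prompt tokens, so the
# remainder is the share that has to be processed from scratch.
glm_uncached_share = 1 - glm_cache_reuse
opus_uncached_share = 1 - opus_cache_reuse

print(f"token ratio: {token_ratio:.2f}x")
print(f"uncached prompt share: {glm_uncached_share:.0%} vs {opus_uncached_share:.0%}")
```

The token ratio comes out just under 2. How cached tokens count toward "billing tokens" is not stated in the thread, so the uncached shares are shown only to illustrate the size of the cache gap.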


sridhar@RamaswmySridhar

"GLM verifies more" — ✅ Partially confirmed

But it's atomized differently. GLM fires one sql_execute per check. Opus batches the same checks into fewer dbt show --inline calls. Same coverage, different shape.
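The "same coverage, different shape" point can be illustrated with a toy example. This is a sketch only: Python's built-in sqlite3 stands in for the warehouse, the orders table and the three checks are hypothetical, and nothing here reproduces the harness's sql_execute tool or dbt show itself.

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, None), (3, 5.5)]
)

# Hypothetical validation checks, each a scalar query.
checks = {
    "row_count": "SELECT COUNT(*) FROM orders",
    "null_amounts": "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    "distinct_ids": "SELECT COUNT(DISTINCT id) FROM orders",
}

# Atomized shape: one call per check (three round trips).
atomized = {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}

# Batched shape: the same checks combined into one query (one round trip).
batched_sql = " UNION ALL ".join(
    f"SELECT '{name}' AS check_name, ({sql}) AS value"
    for name, sql in checks.items()
)
batched = dict(conn.execute(batched_sql).fetchall())

print(atomized == batched)
```

Both shapes return the same results. The difference is the number of tool calls and turns spent getting them, which is what shows up in the turn and token counts.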


sridhar@RamaswmySridhar

"GLM produces cleaner code" — ❌ Not supported

Pass@1 is 6 pp lower. More verification ≠ more correct.


sridhar@RamaswmySridhar

On tasks both models solve, GLM uses ~17% more calls — not 2×.

The 2× framing is a whole-run average driven by tail tasks where GLM spirals. Not representative of typical behavior.
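A toy calculation shows how a whole-run average can reach 2× while the ratio on commonly solved tasks stays near 17%. The call counts below are invented for illustration and are not the benchmark data.

```python
# Each entry: (opus_calls, glm_calls, both_solved). Hypothetical numbers.
tasks = [(30, 35, True)] * 9 + [(30, 285, False)]  # last task: GLM spirals

# Whole-run ratio: total GLM calls over total Opus calls, all tasks.
whole_run_ratio = sum(g for _, g, _ in tasks) / sum(o for o, _, _ in tasks)

# Ratio restricted to tasks that both models solved.
both = [(o, g) for o, g, solved in tasks if solved]
both_solved_ratio = sum(g for _, g in both) / sum(o for o, _ in both)

print(f"whole run: {whole_run_ratio:.2f}x")
print(f"both solved: {both_solved_ratio:.2f}x")
```

In this made-up run, one spiraling task out of ten is enough to double the total, even though GLM uses about 17% more calls on every other task.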


sridhar@RamaswmySridhar

GLM failure mode #2: Over-verification of wrong axes.

One task: 411 tool calls, 24 minutes. Checked row counts, distributions, nulls, column types, DuckDB/Snowflake parity. Failed 0/3.

Opus solved it with 49 calls in 9 min.


sridhar@RamaswmySridhar

The real GLM edge: dual-platform validation.

The spec requires passing both DuckDB and Snowflake. GLM more consistently validates both targets. This is the causal factor behind several GLM-only wins.


sridhar@RamaswmySridhar

GLM failure mode #1: Early give-up.

When GLM can't see the write path from reads alone, it exits without attempting. One task: 22 turns, 5 file reads, 0 writes, stop.


Ferbin@Ferbin08

@RamaswmySridhar @Snowflake this is what it looks like when a model gets stuck.

411 calls checking the same things over and over. never tries a new approach, just spirals.


Abdeali Lokhandwala@Abdeali_L

@RamaswmySridhar @Snowflake Very insightful. Thank you for sharing the test results.

Shivi Bhatia@Shivipmp

@RamaswmySridhar @Snowflake Brilliant thread


sridhar@RamaswmySridhar

Overall though, we are super excited for what GLM-5.2 represents and can't wait to tune Coco's harness more for it and to get it in front of our customers!