Snowflake CEO Sridhar Ramaswamy says GLM-5.2 ties Opus-4.7 on Pass@3 dbt tasks despite lagging on Pass@1
GLM-5.2 achieved the tie without reinforcement learning optimization.
Many users praised the GLM-5.2 versus Opus-4.7 benchmark thread for its concrete insights and pointed to the model's speed and cost benefits, while others criticized the model's training approach and its tendency to repeat actions without progress.
We ran 103 dbt tasks × 3 trials on both GLM-5.2 and Opus-4.7.
Pass@3: 66% vs 67% — tied. Pass@1: 47.6% vs 53.7% — Opus wins by 6 pp.
GLM is noisier per-trial, but broad enough at k=3 to stay competitive.
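
For readers unfamiliar with the two metrics, here is a minimal sketch of how Pass@1 and Pass@3 could be computed from 3 trials per task. It assumes Pass@1 is the mean per-trial success rate and Pass@3 is the share of tasks solved in at least one of the 3 trials; the thread does not spell out its exact definitions. The function names and the toy data are hypothetical and are not results from the benchmark.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean per-trial success rate, averaged over tasks (assumed definition)."""
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return sum(per_task) / len(per_task)


def pass_at_all_trials(results: Dict[str, List[bool]]) -> float:
    """Share of tasks solved in at least one trial (Pass@3 when there are 3 trials)."""
    solved = sum(1 for trials in results.values() if any(trials))
    return solved / len(results)


# Toy data, invented for illustration only: task name -> outcome of each trial.
toy_results = {
    "task_a": [True, False, False],
    "task_b": [True, True, True],
    "task_c": [False, False, False],
    "task_d": [False, True, False],
}

print(f"Pass@1: {pass_at_1(toy_results):.3f}")
print(f"Pass@3: {pass_at_all_trials(toy_results):.3f}")
```

A model that succeeds on many tasks but only in some trials scores low on the first metric and high on the second, which is the "noisier per-trial, but broad at k=3" pattern described above.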
Early results from @Snowflake's Coco team on GLM-5.2 vs Opus-4.7 on dbt-bench — what the trajectories actually show 🧵
> it's STILL not RL-maxed mate, just how *small* is Opus actually?

@RamaswmySridhar @Snowflake Nice! This is good stuff. Thanks for doing the work and sharing. I just started using 5.2 so will be interesting to compare.

@RamaswmySridhar @Snowflake Thanks for the writeup! Given how much faster and less expensive GLM-5.2 is, it's broadly ok to burn more tokens for the same task.

@RamaswmySridhar @Snowflake I am curious what endpoint you used, Sridhar? I am using Fireworks and am blown away by how well it deals with complex scenarios.
Performance can often differ vastly from one implementation to another.
Thanks for sharing.

Bottom line: Verification volume doesn't predict correctness.
GLM's worst losses come from verifying the wrong axes exhaustively. Its other failure mode — early give-up — is orthogonal to verification quantity.

@RamaswmySridhar @Snowflake do you have a cost comparison?

@RamaswmySridhar @Snowflake this is the only kind of glm vs opus take worth reading, actual trajectories not vibes. i have had hit and miss results with GLM in testing still not sure what to make of it

"GLM takes more turns" — ✅ Confirmed
99 turns avg vs 80 for Opus. 40 vs 29 execution-style calls/trial. This is real.

"GLM uses 2× more tokens" — ✅ Confirmed
860M vs 439M billing tokens. More turns + atomic API calls + lower prompt-cache reuse (53% vs 96%).
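
The arithmetic behind the 2× figure can be checked directly. The sketch below also estimates uncached token volume under a loud assumption: that the cache-reuse rates apply uniformly to the billed totals. The thread does not say this, and in practice cache reuse usually applies to prompt tokens only, so treat the second half as a rough illustration rather than a cost figure. All variable names are hypothetical.

```python
# Figures quoted in the thread (millions of billing tokens, cache-reuse rate).
GLM_TOKENS_M = 860
OPUS_TOKENS_M = 439
GLM_CACHE_REUSE = 0.53
OPUS_CACHE_REUSE = 0.96

# Ratio of total billed tokens; this is the "2x" claim.
token_ratio = GLM_TOKENS_M / OPUS_TOKENS_M

# ASSUMPTION: reuse rates apply to the whole billed total (not stated in the thread).
glm_uncached_m = GLM_TOKENS_M * (1 - GLM_CACHE_REUSE)
opus_uncached_m = OPUS_TOKENS_M * (1 - OPUS_CACHE_REUSE)

print(f"Billed token ratio: {token_ratio:.2f}x")
print(f"Uncached tokens (assumed): GLM {glm_uncached_m:.1f}M, Opus {opus_uncached_m:.2f}M")
```

If the assumption held, the gap in uncached tokens would be much larger than the gap in billed tokens, which is why prompt-cache reuse may matter more for cost than the headline 2×.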

"GLM verifies more" — ✅ Partially confirmed
But it's atomized differently. GLM fires one sql_execute per check. Opus batches the same checks into fewer dbt show --inline calls. Same coverage, different shape.
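
To make "same coverage, different shape" concrete, here is a small sketch that runs the same three checks either one query per check or as a single batched query. It uses Python's built-in sqlite3 as a stand-in database; it does not call sql_execute or dbt show --inline, and the table, check names, and functions are all hypothetical.

```python
import sqlite3

# In-memory stand-in database with a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 5.5)],
)

# Hypothetical verification checks, each a scalar query.
CHECKS = {
    "row_count": "SELECT COUNT(*) FROM orders",
    "null_amounts": "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    "distinct_ids": "SELECT COUNT(DISTINCT id) FROM orders",
}


def run_atomized(connection):
    """One database call per check (the atomized shape)."""
    results, calls = {}, 0
    for name, sql in CHECKS.items():
        results[name] = connection.execute(sql).fetchone()[0]
        calls += 1
    return results, calls


def run_batched(connection):
    """All checks combined into a single call (the batched shape)."""
    sql = " UNION ALL ".join(
        f"SELECT '{name}' AS check_name, ({query}) AS value"
        for name, query in CHECKS.items()
    )
    rows = connection.execute(sql).fetchall()
    return dict(rows), 1


atomized, atomized_calls = run_atomized(conn)
batched, batched_calls = run_batched(conn)
print(atomized == batched, atomized_calls, batched_calls)
```

Both functions return the same check results; only the number of calls differs, which is the kind of difference that inflates turn and token counts without changing what gets verified.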

"GLM produces cleaner code" — ❌ Not supported
Pass@1 is 6 pp lower. More verification ≠ more correct.

On tasks both models solve, GLM uses ~17% more calls — not 2×.
The 2× framing is a whole-run average driven by tail tasks where GLM spirals. Not representative of typical behavior.
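
A quick sketch of why a whole-run average can overstate typical behavior. The call counts below are invented for illustration and are not taken from the benchmark; the point is only that a single spiraling task can dominate the overall ratio while the ratio on tasks both models solve stays small.

```python
# Invented per-task tool-call counts (NOT benchmark data).
glm_calls = [40, 44, 48, 400]    # last task is a hypothetical spiral
opus_calls = [36, 38, 40, 46]
both_solved = [True, True, True, False]

# Ratio over the whole run, tail task included.
whole_run_ratio = sum(glm_calls) / sum(opus_calls)

# Ratio restricted to tasks that both models solved.
glm_common = sum(g for g, ok in zip(glm_calls, both_solved) if ok)
opus_common = sum(o for o, ok in zip(opus_calls, both_solved) if ok)
both_solved_ratio = glm_common / opus_common

print(f"Whole-run ratio:   {whole_run_ratio:.2f}x")
print(f"Both-solved ratio: {both_solved_ratio:.2f}x")
```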

GLM failure mode #2: Over-verification of wrong axes.
One task: 411 tool calls, 24 minutes. Checked row counts, distributions, nulls, column types, DuckDB/Snowflake parity. Failed 0/3.
Opus solved it with 49 calls in 9 min.

The real GLM edge: dual-platform validation.
The spec requires passing both DuckDB and Snowflake. GLM more consistently validates both targets. This is the causal factor behind several GLM-only wins.

GLM failure mode #1: Early give-up.
When GLM can't see the write path from reads alone, it exits without attempting. One task: 22 turns, 5 file reads, 0 writes, stop.

@RamaswmySridhar @Snowflake this is what it looks like when a model gets stuck.
411 calls checking the same things over and over. never tries a new approach, just spirals.

@RamaswmySridhar @Snowflake Very insightful. Thank you for sharing the test results.

@RamaswmySridhar @Snowflake Brilliant thread

Overall though, we are super excited for what GLM-5.2 represents and can't wait to tune Coco's harness more for it and to get it in front of our customers!