
GLM-5.2 Ties Opus-4.7 On Pass@3 But Requires More Turns


Original post

sridhar@RamaswmySridhar

"GLM takes more turns" — ✅ Confirmed

99 turns avg vs 80 for Opus. 40 vs 29 execution-style calls/trial. This is real.

sridhar@RamaswmySridhar

We ran 103 dbt tasks × 3 trials on both GLM-5.2 and Opus-4.7.

Pass@3: 66% vs 67% — tied. Pass@1: 47.6% vs 53.7% — Opus wins by 6 pp.

GLM is noisier per-trial, but broad enough at k=3 to stay competitive.

9:39 AM · Jun 23, 2026 · 13K Views
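The Pass@1 and Pass@3 figures above can be reproduced from per-task trial outcomes roughly as follows. This is a minimal sketch that assumes Pass@1 is the mean per-trial success rate and Pass@3 is the share of tasks solved in at least one of three trials; the thread does not state the exact estimator, and the function names and sample outcomes are made up for illustration.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean per-trial success rate across all tasks (assumed definition)."""
    trials = [ok for outcomes in results.values() for ok in outcomes]
    return sum(trials) / len(trials)


def pass_at_k_any(results: Dict[str, List[bool]]) -> float:
    """Share of tasks with at least one passing trial (assumed definition)."""
    return sum(any(outcomes) for outcomes in results.values()) / len(results)


# Hypothetical outcomes for 3 tasks x 3 trials; NOT data from the thread.
example = {
    "task_a": [True, False, False],   # noisy: passes once out of three
    "task_b": [True, True, True],     # stable pass
    "task_c": [False, False, False],  # never solved
}

print(f"pass@1 = {pass_at_1(example):.3f}")
print(f"pass@3 = {pass_at_k_any(example):.3f}")
```

A model that passes a task only once in three trials counts fully toward the second metric but only one third toward the first, which is how a noisier model can trail on Pass@1 and still tie on Pass@3.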

Sentiment

Users are excited that GLM-5.2 ties Opus-4.7 on Pass@3, seeing it as a promising advance they want to tune their harness for and bring to customers.

Pos: 100.0% · Neg: 0.0% (1 comment with sentiment)

Posts from X

Most Activity

Views: 12K · Bookmarks: 4 · Likes: 39 · Replies: 1

sridhar@RamaswmySridhar

"GLM uses 2× more tokens" — ✅ Confirmed

860M vs 439M billing tokens. More turns + atomic API calls + lower prompt-cache reuse (53% vs 96%).

sridhar@RamaswmySridhar

Bottom line: Verification volume doesn't predict correctness.

GLM's worst losses come from verifying the wrong axes exhaustively. Its other failure mode — early give-up — is orthogonal to verification quantity.

sridhar@RamaswmySridhar

"GLM verifies more" — ✅ Partially confirmed

But it's atomized differently. GLM fires one sql_execute per check. Opus batches the same checks into fewer dbt show --inline calls. Same coverage, different shape.
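The difference in shape can be sketched as follows: instead of issuing one query per check, several scalar checks are folded into a single query and sent through one `dbt show --inline` call. This is only an illustration of the batching idea, not either model's actual behavior; the helper names, the check names, and the `orders` model are hypothetical.

```python
from typing import Dict, List


def batch_checks(checks: Dict[str, str]) -> str:
    """Combine scalar checks into one UNION ALL query, one row per check."""
    parts = [
        f"select '{name}' as check_name, ({sql}) as result"
        for name, sql in checks.items()
    ]
    return "\nunion all\n".join(parts)


def dbt_show_command(sql: str) -> List[str]:
    """Argument list for a single `dbt show --inline` invocation."""
    return ["dbt", "show", "--inline", sql]


# Hypothetical checks against a hypothetical `orders` model.
checks = {
    "row_count": "select count(*) from {{ ref('orders') }}",
    "null_ids": "select count(*) from {{ ref('orders') }} where order_id is null",
    "distinct_ids": "select count(distinct order_id) from {{ ref('orders') }}",
}

query = batch_checks(checks)
command = dbt_show_command(query)
print(query)
```

Three atomized checks would cost three tool calls; the batched form costs one, with the same coverage.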

sridhar@RamaswmySridhar

"GLM produces cleaner code" — ❌ Not supported

Pass@1 is 6 pp lower. More verification ≠ more correct.

sridhar@RamaswmySridhar

On tasks both models solve, GLM uses ~17% more calls — not 2×.

The 2× framing is a whole-run average driven by tail tasks where GLM spirals. Not representative of typical behavior.
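A small numeric sketch shows how one tail task can dominate a whole-run average. The per-task counts below are invented for illustration; only the last pair echoes the 49 vs 411 call example mentioned elsewhere in the thread, and the function names are hypothetical.

```python
from statistics import median
from typing import List, Tuple

# (calls by model A, calls by model B) per task; illustrative numbers only.
calls = {
    "t1": (30, 35),
    "t2": (40, 47),
    "t3": (50, 58),
    "t4": (49, 411),  # tail task where model B spirals
}


def total_ratio(pairs: List[Tuple[int, int]]) -> float:
    """Whole-run ratio: total calls of B divided by total calls of A."""
    return sum(b for _, b in pairs) / sum(a for a, _ in pairs)


def median_ratio(pairs: List[Tuple[int, int]]) -> float:
    """Median of the per-task B/A ratios, which is robust to outliers."""
    return median(b / a for a, b in pairs)


everything = list(calls.values())
typical = [calls[t] for t in ("t1", "t2", "t3")]

print(f"whole-run ratio:     {total_ratio(everything):.2f}")
print(f"ratio without tail:  {total_ratio(typical):.2f}")
print(f"median per-task:     {median_ratio(everything):.2f}")
```

With these made-up numbers the whole-run ratio exceeds 3×, while the ratio without the tail task and the median per-task ratio both sit near 1.17, which is the kind of gap the thread describes.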

sridhar@RamaswmySridhar

GLM failure mode #2: Over-verification of wrong axes.

One task: 411 tool calls, 24 minutes. Checked row counts, distributions, nulls, column types, DuckDB/Snowflake parity. Failed 0/3.

Opus solved it with 49 calls in 9 min.

sridhar@RamaswmySridhar

The real GLM edge: dual-platform validation.

The spec requires passing both DuckDB and Snowflake. GLM more consistently validates both targets. This is the causal factor behind several GLM-only wins.
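Dual-platform validation can be sketched as running the same build against both targets and requiring both to succeed. This assumes the dbt project defines targets named `duckdb` and `snowflake` in its profile, which the thread does not confirm; the helper names are hypothetical.

```python
import subprocess
from typing import Callable, Dict, List, Sequence


def build_command(target: str) -> List[str]:
    """Argument list for `dbt build` against one named target."""
    return ["dbt", "build", "--target", target]


def run_subprocess(cmd: List[str]) -> int:
    """Run a command and return its exit code (0 means success)."""
    return subprocess.run(cmd).returncode


def validate_targets(
    targets: Sequence[str],
    run: Callable[[List[str]], int] = run_subprocess,
) -> Dict[str, bool]:
    """Build against every target and record whether each one passed."""
    return {t: run(build_command(t)) == 0 for t in targets}


def all_passed(results: Dict[str, bool]) -> bool:
    """A task counts as validated only if every target passed."""
    return all(results.values())
```

Calling `validate_targets(["duckdb", "snowflake"])` inside a dbt project would run both builds; passing a stub for `run` makes the logic testable without dbt installed.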

sridhar@RamaswmySridhar

GLM failure mode #1: Early give-up.

When GLM can't see the write path from reads alone, it exits without attempting. One task: 22 turns, 5 file reads, 0 writes, stop.
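One way to flag this pattern in logged trajectories is a simple heuristic: the run contains file reads but no writes before stopping. This is a rough sketch, not the method used in the thread; the tool names `read_file`, `write_file`, and `edit_file` are hypothetical stand-ins for whatever the harness records.

```python
from typing import Iterable

READ_TOOLS = {"read_file"}                 # hypothetical tool names
WRITE_TOOLS = {"write_file", "edit_file"}  # hypothetical tool names


def is_early_give_up(tool_calls: Iterable[str]) -> bool:
    """True when a trajectory read files but never attempted a write."""
    calls = list(tool_calls)
    reads = sum(c in READ_TOOLS for c in calls)
    writes = sum(c in WRITE_TOOLS for c in calls)
    return reads > 0 and writes == 0


# Shaped like the example in the thread: 5 file reads, 0 writes.
gave_up = ["read_file"] * 5
attempted = ["read_file", "read_file", "write_file"]

print(is_early_give_up(gave_up), is_early_give_up(attempted))
```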

sridhar@RamaswmySridhar

Overall though, we are super excited for what GLM-5.2 represents and can't wait to tune Coco's harness more for it and to get it in front of our customers!
