Snowflake CEO Sridhar Ramaswamy says GLM's Pass@1 coding score trails Claude Opus by six percentage points

Granular SQL verification steps failed to improve GLM's correctness

Original post

sridhar@RamaswmySridhar

"GLM produces cleaner code" — ❌ Not supported

Pass@1 is 6 pp lower. More verification ≠ more correct.

sridhar@RamaswmySridhar

"GLM verifies more" — ✅ Partially confirmed

But it's atomized differently. GLM fires one sql_execute per check. Opus batches the same checks into fewer dbt show --inline calls. Same coverage, different shape.

9:39 AM · Jun 23, 2026 · 10.2K Views
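The "same coverage, different shape" point can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for the warehouse, and the `orders` table, the `CHECKS` dictionary, and the helpers `run_atomic` and `run_batched` are hypothetical. They are not the harness's `sql_execute` tool or `dbt show --inline`.

```python
import sqlite3

# Hypothetical stand-in table; the tasks in the thread target DuckDB and Snowflake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, None), (3, 5.5)])

# Hypothetical verification checks, each a scalar query.
CHECKS = {
    "row_count": "SELECT COUNT(*) FROM orders",
    "null_amounts": "SELECT COUNT(*) FROM orders WHERE amount IS NULL",
    "distinct_ids": "SELECT COUNT(DISTINCT id) FROM orders",
}

def run_atomic(conn, checks):
    """One query per check: N checks cost N calls."""
    results = {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}
    return results, len(checks)

def run_batched(conn, checks):
    """All checks folded into one UNION ALL query: N checks cost one call."""
    parts = [f"SELECT '{name}' AS check_name, ({sql}) AS value" for name, sql in checks.items()]
    rows = conn.execute(" UNION ALL ".join(parts)).fetchall()
    return dict(rows), 1

atomic, atomic_calls = run_atomic(conn, CHECKS)
batched, batched_calls = run_batched(conn, CHECKS)
print(atomic == batched, atomic_calls, batched_calls)
```

Both helpers return the same check results; only the number of calls differs, which is the distinction the post draws between the two models.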

Sentiment

Commenters are excited about GLM 5.2, which they see as a promising advance and want to integrate into tools such as Coco's harness before delivering it to customers.

Pos 100.0% · Neg 0.0% (1 comment with sentiment)

Posts from X

sridhar@RamaswmySridhar

"GLM uses 2× more tokens" — ✅ Confirmed

860M vs 439M billing tokens. More turns + atomic API calls + lower prompt-cache reuse (53% vs 96%).
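As a quick check of the arithmetic behind the "2×" claim, here is a minimal sketch using only the figures quoted in the post. Reading "prompt-cache reuse" as the share of prompt tokens served from cache is an assumption, as are all the variable names; the post does not say how billing tokens are computed.

```python
# Figures quoted in the post.
glm_tokens = 860e6
opus_tokens = 439e6

# Ratio of billing tokens, GLM to Opus.
token_ratio = glm_tokens / opus_tokens

# Assumption: "cache reuse" is the fraction of prompt tokens served from cache,
# so the remainder is the fraction that must be processed fresh.
glm_uncached = round(1 - 0.53, 2)
opus_uncached = round(1 - 0.96, 2)

print(f"token ratio: {token_ratio:.2f}x")
print(f"uncached prompt share: GLM {glm_uncached:.0%}, Opus {opus_uncached:.0%}")
```

Under that reading, the ratio comes out just under 2, and the uncached share of the prompt is several times larger for GLM than for Opus.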

sridhar@RamaswmySridhar

On tasks both models solve, GLM uses ~17% more calls — not 2×.

The 2× framing is a whole-run average driven by tail tasks where GLM spirals. Not representative of typical behavior.
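How a whole-run average can read as 2× while commonly solved tasks differ by about 17% can be sketched with invented numbers. Every per-task count below is hypothetical and chosen only to reproduce the shape of the claim; none of it is data from the thread.

```python
# Hypothetical per-task tool-call counts, invented for illustration only.
opus_calls = {"t1": 40, "t2": 50, "t3": 60, "t4": 50}
glm_calls = {"t1": 47, "t2": 58, "t3": 70, "t4": 225}

# Assumption: t1..t3 are solved by both models; t4 is a tail task where GLM spirals.
both_solved = {"t1", "t2", "t3"}

def call_ratio(tasks):
    """Total GLM calls divided by total Opus calls over the given tasks."""
    return sum(glm_calls[t] for t in tasks) / sum(opus_calls[t] for t in tasks)

whole_run = call_ratio(glm_calls.keys())
common = call_ratio(both_solved)
print(f"whole run: {whole_run:.2f}x, tasks both solve: {common:.2f}x")
```

A single tail task is enough to pull the whole-run ratio to 2.0 here, while the ratio on tasks both models solve stays near 1.17.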

sridhar@RamaswmySridhar

GLM failure mode #2: Over-verification of wrong axes.

One task: 411 tool calls, 24 minutes. Checked row counts, distributions, nulls, column types, DuckDB/Snowflake parity. Failed 0/3.

Opus solved it with 49 calls in 9 min.

sridhar@RamaswmySridhar

The real GLM edge: dual-platform validation.

The spec requires passing both DuckDB and Snowflake. GLM more consistently validates both targets. This is the causal factor behind several GLM-only wins.

sridhar@RamaswmySridhar

GLM failure mode #1: Early give-up.

When GLM can't see the write path from reads alone, it exits without attempting. One task: 22 turns, 5 file reads, 0 writes, stop.

sridhar@RamaswmySridhar

Bottom line: Verification volume doesn't predict correctness.

GLM's worst losses come from verifying the wrong axes exhaustively. Its other failure mode — early give-up — is orthogonal to verification quantity.

sridhar@RamaswmySridhar

Overall though, we are super excited for what GLM 5.2 represents and can't wait to tune Coco's harness more for it and to get it in front of our customers!
