Z.ai's GLM-5.2 scores 22.8% on ARC-AGI-2, leading Chinese models but prompting debate over Western benchmark hill-climbing

Story Overview

Z.ai's latest open-weights release, GLM-5.2, scores 22.8 percent on ARC-AGI-2 and 77 percent on ARC-AGI-1 under standard CoT settings, comparable to GPT-5.4 and GPT-5.5 runs at low reasoning effort, at a cost of roughly $0.25 and $0.19 per task, respectively.

Original post

Lisan al Gaib@scaling01

LMAO

ARC Prize@arcprize

GLM-5.2 from @Zai_org on ARC-AGI (Verified)

- ARC-AGI-2: 22.8%, $0.25
- ARC-AGI-1: 77.0%, $0.19

Performance is comparable with GPT-5.4 & 5.5 (Low Reasoning Effort)

11:12 AM · Jun 24, 2026 · 56.3K Views

Cost Pressure

Per-task costs open the door to wider testing

At $0.25 or less per task, the model undercuts many closed APIs, letting independent labs and smaller teams run their own ARC experiments without exhausting their budgets.

Open Question

How far the open-source lag has actually closed stays unclear

The result sets a fresh open-source record yet still trails top Western frontier scores, leaving the usual six-to-twelve-month gap narrative intact while reigniting questions about benchmark focus.

Sentiment

Supportive commenters hail GLM-5.2's open-weights ARC-AGI results, which are comparable to GPT-5.4 and 5.5 at low reasoning effort, as a major open-source advance, while critical commenters dismiss the benchmark as meaningless hype or agenda-driven.

Pos 55.0% · Neg 45.0%

22 comments with sentiment.

Posts from X

Most Activity

Views 43.8K · Bookmarks 55 · Likes 419 · Retweets 30 · Replies 20

François Chollet@fchollet

This is the strongest ARC-AGI-2 performance to date by an open-source model.

Ethan Mollick@emollick

Gemini 3 Pro was the first model to achieve at least 23% on ARC-AGI-2, which it did in November, 2025 (it actually scored 31%).

So the 8-12 month gap between closed and open weights models still seems to hold. But they are also more jagged, better at some tasks, worse at others.

Nathan Lambert@natolambert

Add more wins for GLM.

The model has some brittle characteristics, and is getting crushed by closed models here, but we should expect open models to be more jagged, and you use multiple of them depending on the task.

Congrats again to @Zai_org and am excited for the next one

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

GLM 5.2 is the best Chinese model on ARC-AGI-2, at 22.8% (is that high or max?), on par with Opus 4.5 (16K). …Whereas Grok 4.20 is in the range of Opus 4.7, at 65%. Maybe the first time I seriously doubted ARC. Even mediocre Western labs are far ahead on hill-climbing it.

Lisan al Gaib@scaling01

that's like perfectly in line with what I have been saying

GLM-5.2 is as strong as Opus 4.5 and GPT-5.2 implying a 7 month lag

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 I'm sorry I think ARC is cooked no it's not 3x worse than Grok 4.2

Rohan Paul@rohanpaul_ai

GLM-5.2 got 22.8% on ARC-AGI-2, at $0.25/task

To note here, around May 2025, the best verified models on ARC-AGI-2 were only at 3.0%.

So while it is still far behind GPT-5.5 (85%), GLM-5.2 is also about 7.6x above the best frontier score from May 2025, and about 7.5x cheaper per task than GPT-5.5’s $1.87 run.
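
The two ratios in this post can be checked with a few lines of arithmetic. The sketch below is illustrative only: it takes the figures exactly as quoted in the posts (22.8% and $0.25 for GLM-5.2, 3.0% for the best verified score in May 2025, $1.87 for the GPT-5.5 run) and does not verify them independently. The variable names are made up for the example.

```python
# Figures as quoted in the posts; none are independently verified here.
glm_score_pct = 22.8       # GLM-5.2 on ARC-AGI-2, percent
glm_cost_usd = 0.25        # GLM-5.2 cost per ARC-AGI-2 task, USD
may_2025_best_pct = 3.0    # best verified ARC-AGI-2 score cited for May 2025
gpt55_cost_usd = 1.87      # GPT-5.5 cost per task cited in the post, USD

# How many times higher GLM-5.2 scores than the May 2025 best.
score_ratio = glm_score_pct / may_2025_best_pct

# How many times cheaper GLM-5.2 is per task than the cited GPT-5.5 run.
cost_ratio = gpt55_cost_usd / glm_cost_usd

print(f"score ratio: {score_ratio:.1f}x")
print(f"cost ratio:  {cost_ratio:.2f}x")
```

The cost ratio works out to 7.48, which the post rounds to "about 7.5x".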

Lisan al Gaib@scaling01

Teor was dreaming about 50%+ for GLM-5.2 on ARC-AGI-2

meanwhile it's 22.8%

rough day for open-weight bros

Lisan al Gaib@scaling01

@teortaxesTex I mean CritPt scores are very high and max uses a shitton of tokens

I think above 30% would be a good signal and if it beats GPT-5.2 on score vs tokens

Mike Knoop@mikeknoop

This 23% GLM-5.2 score is right on the border of the "agentic takeoff" we saw with Opus 4.5 / GPT 5.2 in Q4 2025. Crossing 25% was pivotal for other frontier closed models (and to date no OSS model has crossed it).

kache@yacineMTB

Pretty remarkable

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 (inb4 it's non-thinking) I'll just disregard ARC now

xlr8harder@xlr8harder

@scaling01 It's weird to me that people are disappointed by this. Roughly Opus 4.5 level seems about right, and is a huge step forward for open source.

It also puts them about six months behind, which is basically average right now.

So, on trend?

JMB 🧙‍♂️@jmbollenbacher

@teortaxesTex Yeah im not so sure ARC is all special.

I think it's one solid benchmark but i dont think it's any more significant than e.g. critpt.

ARC-3 is more unique and therefore more high signal, but im still not sure it's well enough designed to be worth indexing on super hard.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@jmbollenbacher I think ARC is very good but I'm afraid there's been some osmosis in the Western labs on how to game it, and this is not reflective of model capability. Grok 4.2 is nowhere near GLM 5.2

Kristoph@kristoph

@scaling01 This has become such a meaningless benchmark 😞

Theo Harvey@theoharvey

@scaling01 who was that dude that got mad at you for correcting him looks like you need to do it again 😂

Dickson Pau@DicksonPau

@scaling01 It has no vision right?

shaurya@0oAstro

@scaling01 zai models are very hard trained for coding in my personal experience and thus only be used for coding itself because that's where they shine.

this is like using a guitar to play like a harmonica.

John Lussier@John_lussier_

@scaling01 Posting to warn folks -

Sophia@poffyit

@captain_marrvel @scaling01 Do you actually believe that? And not the fact people just want open models and not have one company dictate everything
