LMAO
GLM-5.2 from @Zai_org on ARC-AGI (Verified)
- ARC-AGI-2: 22.8%, $0.25
- ARC-AGI-1: 77.0%, $0.19
Performance is comparable with GPT-5.4 & 5.5 (Low Reasoning Effort)
Z.ai's latest open-weights release, GLM-5.2, lands at 22.8 percent on ARC-AGI-2 and 77 percent on ARC-AGI-1 under standard CoT settings, matching certain GPT-5.4 and 5.5 runs at low reasoning effort while charging roughly nineteen to twenty-five cents per task.
At under a quarter per evaluation the model undercuts many closed APIs, letting more independent labs and smaller teams run their own ARC experiments without burning through budgets.
The result sets a fresh open-source record yet still trails top Western frontier scores, leaving the usual six-to-twelve-month gap narrative intact while reigniting questions about benchmark focus.
Positive users hail GLM-5.2's open-weights ARC-AGI results matching GPT-5 as a major open-source advance, while negative users dismiss the benchmark as meaningless hype or agenda-driven.
This is the strongest ARC-AGI-2 performance to date by an open-source model.
Gemini 3 Pro was the first model to achieve at least 23% on ARC-AGI-2, which it did in November, 2025 (it actually scored 31%).
So the 8-12 month gap between closed and open weights models still seems to hold. But they are also more jagged, better at some tasks, worse at others.
Add more wins for GLM.
The model has some brittle characteristics, and is getting crushed by closed models here, but we should expect open models to be more jagged, and you use multiple of them depending on the task.
Congrats again to @Zai_org and am excited for the next one
GLM 5.2 is the best Chinese model on ARC-AGI-2, at 22.8% (is that high or max?), on par with Opus 4.5 (16K). …Whereas Grok 4.20 is in the range of Opus 4.7, at 65%. Maybe the first time I seriously doubted ARC. Even mediocre Western labs are far ahead on hill-climbing it.
that's like perfectly in line with what I have been saying
GLM-5.2 is as strong as Opus 4.5 and GPT-5.2 implying a 7 month lag
@scaling01 I'm sorry I think ARC is cooked no it's not 3x worse than Grok 4.2
GLM-5.2 got 22.8% on ARC-AGI-2, $0.25/task
To note here, around May 2025, the best verified models on ARC-AGI-2 were only at 3.0%.
So while it is still far behind GPT-5.5 (85%), GLM-5.2 is also about 7.6x above the best frontier score from May 2025, and about 7.5x cheaper per task than GPT-5.5’s $1.87 run.
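The two ratios in this post can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in the post itself; the variable names are illustrative and the inputs have not been verified independently here.

```python
# Figures as quoted in the post above (not independently verified).
glm_score = 22.8        # GLM-5.2 on ARC-AGI-2, percent
glm_cost = 0.25         # GLM-5.2 cost per task, USD
may_2025_best = 3.0     # best verified ARC-AGI-2 score around May 2025, percent
gpt55_cost = 1.87       # GPT-5.5 cost per task for the run cited, USD

# How many times higher GLM-5.2 scores than the May 2025 best.
score_ratio = glm_score / may_2025_best

# How many times cheaper GLM-5.2 is per task than the cited GPT-5.5 run.
cost_ratio = gpt55_cost / glm_cost

print(f"score ratio: {score_ratio:.1f}x")
print(f"cost ratio: {cost_ratio:.1f}x")
```

Both ratios round to the values given in the post (about 7.6x and about 7.5x). Note that the cost comparison is between runs with very different scores, so it says nothing about cost at equal accuracy.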
Teor was dreaming about 50%+ for GLM-5.2 on ARC-AGI-2
meanwhile it's 22.8%
rough day for open-weight bros
@teortaxesTex I mean CritPt scores are very high and max uses a shitton of tokens
I think above 30% would be a good signal and if it beats GPT-5.2 on score vs tokens
This 23% GLM-5.2 score is right on the border of the "agentic takeoff" we saw with Opus 4.5 / GPT 5.2 in Q4 2025. Crossing 25% was pivotal for other frontier closed models (and to date no OSS model has crossed it).
Pretty remarkable
@scaling01 (inb4 it's non-thinking) I'll just disregard ARC now
@scaling01 It's weird to me that people are disappointed by this. Roughly Opus 4.5 level seems about right, and is a huge step forward for open source.
It also puts them about six months behind, which is basically average right now.
So, on trend?

@teortaxesTex Yeah im not so sure ARC is all special.
I think it's one solid benchmark but i dont think it's any more significant than e.g. critpt.
ARC-3 is more unique and therefore more high signal, but im still not sure it's well enough designed to be worth indexing on super hard.

@jmbollenbacher I think ARC is very good but I'm afraid there's been some osmosis in the Western labs on how to game it, and this is not reflective of model capability. Grok 4.2 is nowhere near GLM 5.2

@scaling01 This has become such a meaningless benchmark 😞

@scaling01 who was that dude that got mad at you for correcting him looks like you need to do it again 😂

@scaling01 It has no vision right?

@scaling01 zai models are trained very hard for coding in my personal experience, and thus should only be used for coding itself because that's where they shine.
this is like using a guitar to play like a harmonica.

@scaling01 Posting to warn folks -

@captain_marrvel @scaling01 Do you actually believe that? And not the fact people just want open models and not have one company dictate everything