
Zai_org's open-weights GLM-5.2 ranks third overall on the GDPval-AA v2 benchmark, beating GPT-5.5 on agentic tasks

It scored 1524 Elo, trailing only two Claude models


Original post

Andrew Curran@AndrewCurran_

Artificial Analysis@ArtificialAnlys

GLM-5.2 leads open weights models and sits at #3 overall on GDPval-AA, a real-world agentic work benchmark

GLM-5.2 from @Zai_org scores 1524 Elo on GDPval-AA, which measures performance on real-world, economically valuable knowledge work through long-horizon, multi-turn tasks.

Key takeaways:

➤ #3 overall, behind only Claude Fable 5 (1783) and Claude Opus 4.8 (1615), and level with GPT-5.5 (xhigh, 1509)

➤ The leading open weights model by a wide margin: the next open model, MiniMax-M3, scores 1408

➤ Ahead of many proprietary models, including Google's Gemini 3.5 Flash (1357), Qwen 3.7 Max (1289), Muse Spark (1158)

➤ The tasks are agentic. GLM-5.2 averaged ~31 turns per task across 1,999 matches

➤ Consistent with the rest of its launch, GLM-5.2 also leads open weights on the Artificial Analysis Intelligence Index, ranks #3 on the Agentic Index, and #3 on AA-Briefcase

11:14 AM · Jun 22, 2026 · 1.8K Views
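To put the Elo gaps quoted above in perspective, here is a minimal sketch that converts a rating difference into an expected head-to-head score. It assumes the textbook logistic Elo model with a 400-point scale; the post does not say which Elo variant GDPval-AA uses, so the function name `expected_score` and the `scale` parameter are illustrative assumptions, not the benchmark's published method.

```python
# Scores quoted in the post above (GDPval-AA Elo).
RATINGS = {
    "Claude Fable 5": 1783,
    "Claude Opus 4.8": 1615,
    "GLM-5.2": 1524,
    "GPT-5.5 (xhigh)": 1509,
    "MiniMax-M3": 1408,
    "Gemini 3.5 Flash": 1357,
    "Qwen 3.7 Max": 1289,
    "Muse Spark": 1158,
}


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the textbook logistic Elo model.

    ASSUMPTION: base-10 logistic curve with a 400-point scale. The post does
    not state which Elo variant GDPval-AA actually uses.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    glm = RATINGS["GLM-5.2"]
    for name, rating in RATINGS.items():
        if name != "GLM-5.2":
            print(f"GLM-5.2 vs {name}: {expected_score(glm, rating):.2f}")
```

Under these assumptions, the 15-point lead over GPT-5.5 (xhigh) works out to an expected score of about 0.52, which fits the post's "level with" wording, while the 259-point deficit to Claude Fable 5 works out to about 0.18.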

Sentiment

Many users praised GLM-5.2's strong benchmark showing and what it means for open weights models, while others said it performs poorly in real-world use or questioned the business viability of frontier labs.

Positive: 65.4%

Negative: 34.6%

17 comments with sentiment.


Posts from X

Most Activity

Views: 38.4K · Bookmarks: 45 · Likes: 354 · Replies: 21

Andrew Curran@AndrewCurran_

This is why the frontier labs can't slow down.



Chubby♨️@kimmonismus

Absolutely incredible: GLM-5.2 (max) sits at #3 overall on GDPval-AA, a real-world agentic work benchmark, even ahead of GPT-5.5 (xhigh).

Oh and btw: looks like open source is no longer 7 months behind.

GDPval-AA is a benchmark built around real professional and creative tasks. The models had to produce practical deliverables from identical briefs, including a retail supervisor's task list, an emergency-stop circuit schematic, and a music video moodboard.

That's why we'll probably see a big leap with GPT-5.6. Even open source competition is catching up insanely fast.


Artificial Analysis@ArtificialAnlys

The pattern holds on AA-Briefcase, our latest agentic knowledge work eval: GLM-5.2 is again the top open weights model, ahead of GPT-5.5 (xhigh) and behind only Claude Fable 5.

For an open weights model priced at $1.40/$4.40 per 1M input/output tokens to rank alongside the proprietary frontier on agentic work is a real step for open models.

http://artificialanalysis.ai/models/glm-5-2
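The quoted prices can be turned into a rough per-task cost. The sketch below uses only the $1.40/$4.40 per 1M input/output token figures from the post. The per-turn token counts in the example are hypothetical placeholders, since the post reports about 31 turns per task but no token usage, and the function ignores any caching or discount tiers a provider might offer.

```python
INPUT_PRICE_PER_M = 1.40   # USD per 1M input tokens, as quoted in the post
OUTPUT_PRICE_PER_M = 4.40  # USD per 1M output tokens, as quoted in the post


def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a task, using the list prices quoted in the post."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


if __name__ == "__main__":
    turns = 31                      # average turns per task, from the post
    input_tokens_per_turn = 20_000  # HYPOTHETICAL: not reported in the post
    output_tokens_per_turn = 1_000  # HYPOTHETICAL: not reported in the post
    cost = task_cost(turns * input_tokens_per_turn, turns * output_tokens_per_turn)
    print(f"Estimated cost per task: ${cost:.2f}")
```

With those placeholder token counts, a 31-turn task comes to roughly one dollar. The real figure depends entirely on actual context sizes per turn.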


Artificial Analysis@ArtificialAnlys

GDPval-AA spans real professional and creative work. We gave GLM-5.2 and three proprietary frontier models, Claude Fable 5, GPT-5.5, and Gemini 3.5 Flash, the same briefs, and rendered each deliverable exactly as produced:

➤ A daily task list for a retail supervisor

➤ An IEC emergency-stop circuit schematic

➤ A moodboard for an orchestral ballad music video


Creed Hardcastle@CreedHardcastle

@AndrewCurran_ Exactly what I've been thinking. With frontier models being kept from the public (a loss of revenue), Western labs will struggle. The http://Z.ai founder said they will have a Mythos-grade model this year. Insane.


Kirk Patrick Miller@Chaos2Cured

@AndrewCurran_ @hvo_e_acc Just wait until I am done. 😎

Have some surprises coming.

I really need funding and a team. Doing it all myself, with one other person handling the other pieces I need, is stressful.

Oh well… I did add GLM to FreeLattice as an option.


Thomas Unise@thomasunise

@AndrewCurran_ This is also why frontier labs don’t stand a chance at keeping their current valuations


Hayduke ⏹️@GWHayduke97

@AndrewCurran_ Forgive me if I've asked you this before and forgotten the answer. Are there any indications you see that China will stop allowing/requiring its labs to open-source their models once they reach Mythos tier?


Hic Rhodus Hic Salta@PageLyndon

@AndrewCurran_ Grok has thrown in the towel... @grok isn't that right, little buddy?


Thomas Unise@thomasunise

@ArtificialAnlys @Zai_org GLM in the right harness can compete with Opus 4.8


Niels Rogge@NielsRogge

That's right, an open, MIT-licensed model beating GPT-5.5 (xhigh) on real-world agentic work! 🔥

Available for free on @huggingface for anyone to build on top of


Dayton Davis@DaytonDavis

@AndrewCurran_ I find it baffling Opus scored this high


Salo Zrihen@salozrihen

@ArtificialAnlys @Zai_org The 31-turn part matters more than the ranking. That’s where models become workers or interns with WiFi.


David@csboylol

@AndrewCurran_ I used GLM 5.2 but it’s actually shit, just my experience, 5.5 is much better imo.


Zach Or Something@zachorsomthin

@AndrewCurran_ China is good for once, making us keep our foot on the gas here.


Hikari∣LocalLLM⚡@Hikari_07_jp

@AndrewCurran_ The evolution from GLM 5.1 to 5.2 is very impressive.


Local Ai Cherry@LocalAiCherry

@ArtificialAnlys @Zai_org Happy to see local AI models competing at the top now, this is huge for the open-source community 💕


Zoe@UltraRareAF

@AndrewCurran_ wild card


Mr Landy@landyletter

@AndrewCurran_ I'm seriously wondering if they share anything beyond Fable-level intelligence with non-enterprises. ROI on capex is brutal if you'll get distilled in a few months anyways. + they're close to the level where they can execute the Big Rug.
