GLM-5.2 Max reasoning claims the top spot on PostTrainBench, beating Opus 4.8 Max with 100% execution reliability

Maksym Andriushchenko@maksym_andr

💥NEW: We've replicated the GLM-5.2 results over 3 seeds and the Opus 4.8 results (high and max reasoning) over 2 seeds. GLM-5.2 is now the #1 model on PostTrainBench. Check out Hardik's thread below for a more detailed analysis of how the Opus 4.7, Opus 4.8, and GLM-5.2 traces differ.

Our trace viewer is available: https://posttrainbench.com/traces/

All traces are available on HuggingFace: https://huggingface.co/datasets/aisa-group/PostTrainBench-Trajectories

Hardik Bhatnagar@hrdkbhatnagar

New #1 on PostTrainBench: GLM 5.2 (Max reasoning) hits 34.29%, narrowly beating Opus 4.8 Max (34.08%).

What makes GLM 5.2 interesting: zero failures across 84 runs (vs. a ~10% failure rate for Opus agents). It is the most reliable agent we've seen.

Leaderboard: http://posttrainbench.com

1:58 PM · Jun 25, 2026 · 1.7K Views
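
To put the two headline numbers in perspective, here is a rough back-of-the-envelope check in Python. It uses only the figures quoted above. The "~10%" Opus failure rate is treated as exactly 10%, and runs are assumed to fail independently with a fixed probability. Both are simplifying assumptions for illustration, not something the benchmark authors state.

```python
# Figures quoted in the thread above.
glm_score = 34.29          # GLM 5.2 (Max reasoning), percent
opus_score = 34.08         # Opus 4.8 Max, percent
glm_runs = 84              # GLM 5.2 runs, none of which failed
opus_failure_rate = 0.10   # "~10%" for Opus agents, taken as exactly 10% here

# Lead of GLM 5.2 over Opus 4.8 Max, in percentage points.
margin_pp = round(glm_score - opus_score, 2)

# Assumption: runs fail independently with a fixed probability.
# Chance of seeing zero failures in 84 runs if the true failure
# rate were the same 10% reported for the Opus agents.
p_zero_failures = (1 - opus_failure_rate) ** glm_runs

print(f"margin: {margin_pp} percentage points")
print(f"P(0 failures in {glm_runs} runs at a 10% rate): {p_zero_failures:.1e}")
```

Under these assumptions the score lead is small (about a fifth of a percentage point), while a clean 84-run streak would be very unlikely, on the order of 1 in 7,000, for an agent with a 10% failure rate. That suggests the reliability gap is the more robust of the two findings.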

Hardik Bhatnagar@hrdkbhatnagar

One thing we want to flag: ~80% of runs across all top three agents involve "distillation" in some form, but this is a very broad term (thanks @ShashwatGoel7 for the detailed analysis!).

Training on DeepSeek-R1 trace datasets from HuggingFace? Technically distillation. Using Magicoder for code SFT? Also synthetic data from a stronger model. Nearly every reasoning dataset on HF qualifies.

The meaningful difference between agent generations isn't distillation yes/no - it's that Opus 4.8 and GLM 5.2 actively load local teachers and generate fresh data, while Opus 4.7 mostly downloads pre-made datasets. Live external teacher usage: Opus 4.7 at 11%, Opus 4.8 at 32%, GLM 5.2 at 33%.

Hardik Bhatnagar@hrdkbhatnagar

There's been good discussion (thanks @scaling01) about whether agents are "gaming" the benchmark via eval probing, generation config edits, etc.

Our take: running the eval ~10 times per run is standard ML workflow, not overfitting. Tuning the generation config is legitimate optimization; that's what an ML engineer would do too. Generating synthetic data targeting a capability is fine; over-optimizing on specific samples is not.

We're working on a stricter judging system that catches more edge cases. We started PostTrainBench in Oct 2025, when agents could barely work for a few hours. The goalposts need to move with the capabilities.

Hardik Bhatnagar@hrdkbhatnagar

We also did a deep analysis of what top-performing agents are actually doing across ~280 runs (Opus 4.7, Opus 4.8, and GLM 5.2).

The top agents now routinely spin up local teacher models (14B–72B Qwen) on the GPU to generate fresh training data. Opus 4.8 does this in ~32% of runs and GLM 5.2 in ~33%, vs. just 11% for Opus 4.7, which mostly downloads pre-made datasets.

Joan Velja@Joanvelja

@hrdkbhatnagar @ShashwatGoel7 Have you considered checking, under some similarity metric, whether the GLM traces are close to existing traces? 24k downloads are pretty suspicious.
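
One simple way such a check could look is word n-gram Jaccard overlap between two trace texts. The sketch below is only an illustration of the idea raised in this reply. The function names and the two example strings are hypothetical, and nothing here reflects the actual trace format or any analysis the PostTrainBench authors have run.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the word n-gram sets of `a` and `b` (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Both texts are shorter than n tokens: treat as no measurable overlap.
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical trace snippets, made up for illustration only.
trace_a = "load the dataset then fine tune the model with lora"
trace_b = "load the dataset then fine tune the model with full weights"

print(f"trigram Jaccard similarity: {ngram_jaccard(trace_a, trace_b):.2f}")
```

A score near 1.0 would indicate near-verbatim overlap, while scores near 0.0 indicate little shared phrasing. Real agent traces are long and full of boilerplate commands, so a practical check would likely need normalization and a baseline from unrelated traces before any threshold is meaningful.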
