💥NEW: We've replicated the GLM-5.2 results over 3 seeds and the Opus 4.8 results (high and max reasoning) over 2 seeds. GLM-5.2 is now the #1 model on PostTrainBench. Check out Hardik's thread below for a more detailed analysis of how the Opus 4.7, Opus 4.8, and GLM-5.2 traces differ.
Our trace viewer is available: https://posttrainbench.com/traces/
All traces are available on HuggingFace: https://huggingface.co/datasets/aisa-group/PostTrainBench-Trajectories
New #1 on PostTrainBench: GLM-5.2 (Max reasoning) hits 34.29%, narrowly beating Opus 4.8 Max (34.08%).
What makes GLM-5.2 interesting: zero failed runs out of 84 (vs. a ~10% failure rate for Opus agents). It's the most reliable agent we've seen.
Leaderboard: http://posttrainbench.com

