i highly doubt that GLM-5.2 was benchmaxxed on PostTrainBench or heavily distilled from Claude models. anyone can inspect the traces (https://posttrainbench.com/traces/):
- the reasoning patterns overall look very reasonable. GLM-5.2 genuinely tries many sensible approaches (see the screenshot below for everything it tried during a single post-training run on AIME!).
- the traces are very diverse across different seeds, with no mode collapse onto a single post-training technique.
- they are very different from the traces of Claude models.
- see the thread below for more details.
TL;DR: don't blindly trust benchmark *scores*. look at the traces and draw your own conclusions!
really? i read through a few GLM-5.2 PostTrainBench rollouts (they have them all posted) and the results were very interesting to me. the model establishes baselines, carries out SFT, and then runs an RL-ish stage (sometimes, iirc, it did rejection sampling). the validation/planning behavior looked pretty neat to me.