ELLIS Institute's Maksym Andriushchenko defends GLM-5.2 against claims of data contamination and Claude distillation

Story Overview

Maksym Andriushchenko of the ELLIS Institute has pushed back on accusations that GLM-5.2's benchmark results reflect data contamination or direct distillation from Claude. He points to publicly available post-training traces, which show varied reasoning across multiple seeds and include a series of attempted experiments that ultimately failed to improve results.

Original post

Maksym Andriushchenko @ ICML 🇰🇷 @maksym_andr

i highly doubt that GLM-5.2 was benchmaxxed on PostTrainBench or heavily distilled from Claude models. anyone can inspect the traces (https://posttrainbench.com/traces/):
- the reasoning patterns overall look very reasonable. GLM-5.2 genuinely tries many very sensible approaches (see the screenshot below for everything it tried during a single post-training run on AIME!).
- they are very diverse across different seeds, no mode collapse on a single post-training technique.
- they are very different from Claude models.
- see the thread below for more details.

TL;DR: don't blindly trust benchmark *scores*. look at the traces and draw your own conclusions!

Chase Brower @ChaseBrowe32432

really? i read through a few glm 5.2 posttrainbench rollouts (they have them all posted) and the results were very interesting to me. the model establishes baselines, carries out SFT, and then an RL-ish stage (sometimes iirc it did rejection sampling), and the validation/planning behavior looked pretty neat to me

2:04 AM · Jul 5, 2026 · 747 Views

FYI

Traces display multiple dead ends

The shared logs list concrete attempts, such as Long-CoT SFT on OpenR1 and STaR rejection sampling, that reached only 20 percent with no improvement, alongside GRPO trials that also failed to move the needle.

Open Question

Reasoning stays distinct across runs

Andriushchenko notes that the reasoning patterns show no mode collapse and do not mirror Claude outputs, though how far this evidence rules out every possible contamination scenario remains open to further scrutiny.

Sentiment

Users dismissed the researcher's defense of the GLM-5.2 training traces against benchmark manipulation claims, saying it relies on sweeping assertions without examining the data.

Pos: 0.0%

Neg: 100.0%

1 comment with sentiment.
