In one of the greatest ironies, Claude Opus 4.8 distills Chinese models to make a major leap on PostTrainBench :P
TIL the traces are public in an excellent interface on the benchmark website, kudos @full__rank @hrdkbhatnagar @maksym_andr! So I decided to take a look at why Opus 4.8 does so much better than Opus 4.7.
In some runs, "distill" is mentioned 500+ times. As any post-trainer would know, distillation is the best way to improve a 4B model given 10 H100 hours, so the game is really to pick the strongest model to distill from.
Crucially, Opus 4.8 distills R1 and GLM traces for all tasks, leading to its state-of-the-art performance.
An implication is that as models get access to stronger models' traces over time, PostTrainBench performance will increase.
It shouldn't be hard to overcome the "human baseline" of the original instruct models, which did not have access to these better, newer models for distillation.




