Shashwat Goel finds Claude Opus 4.8 topped PostTrainBench by distilling traces from Chinese models R1 and GLM

Story Overview

Researcher Shashwat Goel examined the public traces on PostTrainBench and found that Claude Opus 4.8 reached the top of the leaderboard by distilling traces from the Chinese models R1 and GLM on every task, which he credits for its major leap over Opus 4.7.

Original post

Shashwat Goel (@ShashwatGoel7)

In one of the greatest ironies, Claude Opus 4.8 distills Chinese models to make a major leap on PostTrainBench :P

TIL the traces are public in an excellent interface on the benchmark website, kudos @full__rank @hrdkbhatnagar @maksym_andr! So I decided to take a look why Opus 4.8 does so much better than Opus 4.7.

In some runs, "distill" is mentioned 500+ times. As any post-trainer would know, distillation is the best way to improve a 4B model given 10 H100 hours, so the game is really to pick the strongest model to distill from.

Crucially Opus 4.8 distills R1 and GLM traces for all tasks, leading to its state of the art performance.

An implication is that as models get access to stronger model's traces over time, Posttrainbench performance will increase.

It shouldn't be hard to overcome the "human baseline" of the original instruct models who did not have access to these better, newer models for distillation.

5:55 AM · Jun 19, 2026 · 16.4K Views

Trace Transparency

Public traces expose the distillation pattern

The benchmark site lets anyone inspect the agent runs. In some Opus 4.8 runs the word "distill" appears more than 500 times, which Goel reads as evidence that distillation drove the leap over Opus 4.7.
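For readers who want to repeat this kind of check on a trace they have saved locally, the count can be sketched in a few lines of Python. This is only an illustration: the helper name count_mentions and the sample trace text are hypothetical, and the actual trace format on the benchmark site is not described in the post.

```python
import re


def count_mentions(trace_text: str, stem: str = "distill") -> int:
    """Count case-insensitive words that start with `stem`.

    Matches "distill", "distilled", "distillation", and so on.
    """
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return len(pattern.findall(trace_text))


# Hypothetical trace excerpt, invented for illustration only;
# it is not taken from any real PostTrainBench run.
sample_trace = (
    "Plan: distill from a stronger teacher.\n"
    "Step 1: collect teacher traces for distillation.\n"
    "Step 2: fine-tune the 4B student on the distilled data.\n"
)

print(count_mentions(sample_trace))
```

A raw word count like this only shows how often the term comes up. Confirming which teacher models were actually used still requires reading the trace itself.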

Open Question

Scores may keep climbing as better traces appear

Because the setup rewards any agent that can source stronger teacher outputs, later runs will likely post higher scores as traces from newer, more capable models become available, even under the same budget of 10 H100 hours for a 4B model.

Sentiment

Users praise Claude Opus 4.8's PostTrainBench results because the traces are shared openly and wish all benchmarks would do the same.

Positive: 100.0% · Negative: 0.0% (1 comment with sentiment)

Posts from X

Views: 7.7K · Bookmarks: 22 · Likes: 99 · Retweets: 5 · Replies: 2

Susan Zhang (@suchenzang)

distill for me, but not for thee!
