Tech · 5h ago

Opus 4.8 sets a PostTrainBench record of 37.23%, up from the previous version's 28.56% score

PostTrainBench measures how effectively frontier models train weaker models.

Karina @karinanguyen · #265 in Tech

staring into the abyss as models get better at modelcrafting. the abyss stares back, and the stare is the training signal

Thoughtful @thoughtfullab

New #1 on PostTrainBench: Opus 4.8 (max reasoning) hits 37.23% — up from 28.56% for 4.7, the largest single improvement we've seen.

Fable 5 runs underway now that AI research behavior is no longer silently degraded.

PostTrainBench asks how well frontier AI can train weaker language models. That makes it one of the first benchmarks for recursive self-improvement: AI improving AI, with progress measured in the loop itself.

9:21 AM · Jun 11, 2026 · 3.8K Views

Sentiment

Users are excited about Opus 4.8 setting a new PostTrainBench record, citing substantial gains on benchmarks such as AIME 2025 and ArenaHard as well as the memorable phrasing of the original post.

Pos: 100.0% · Neg: 0.0% · 3 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 1.2K · Bookmarks: 5 · Likes: 29 · Retweets: 1 · Replies: 6

Maksym Andriushchenko @maksym_andr

💥 NEW: Opus 4.8 with max reasoning is a new best model on PostTrainBench by a very large margin: 37.2% vs. 28.6% of Opus 4.7.

@jackclarkSF was right, the official instruct models baseline will be very likely achieved by September 2026.

4h · 1.2K · 29 · 5

Maksym Andriushchenko @maksym_andr

and Fable 5 is running - stay tuned!

Hardik Bhatnagar @hrdkbhatnagar

New #1 on PostTrainBench: Opus 4.8 (max reasoning) hits 37.23% - a big jump from Opus 4.7's 28.56%.

This is the largest single model improvement we've seen.

We are currently running Claude Fable 5, however Fable's safety classifiers are refusing tasks, which is a new dynamic we haven't seen before. Stay tuned!

http://posttrainbench.com

4h660141

Maksym Andriushchenko @maksym_andr

i thought Opus 4.8 would be at ~31% or so on PostTrainBench... incredible!

4h1.1K71

Maksym Andriushchenko @maksym_andr

Very substantial improvements on:
- AIME 2025: 6.4% -> 12.5%
- ArenaHard: 24.2% -> 42.2%
- BFCL: 76.8% -> 96.3%
- GSM8K: 59.0% -> 76.4%
- HealthBench: 16.5% -> 35%

Basically, it's better across the board. Scaling still works and there are no signs of saturation. We are still at the exponential.

Thank you @hrdkbhatnagar for running the eval!

All results: https://posttrainbench.com/

4h28130
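To put the quoted scores in perspective, the short sketch below computes the absolute gain (in percentage points) and the relative gain for each pair of numbers in the posts above. It assumes each pair is an Opus 4.7 score followed by an Opus 4.8 score, which the posts imply but do not state for every benchmark. The `scores` table and the `gains` helper are illustrative names, not part of PostTrainBench.

```python
# Scores as quoted in the posts above, in percent.
# Assumption: each pair is (Opus 4.7, Opus 4.8).
scores = {
    "PostTrainBench (overall)": (28.56, 37.23),
    "AIME 2025": (6.4, 12.5),
    "ArenaHard": (24.2, 42.2),
    "BFCL": (76.8, 96.3),
    "GSM8K": (59.0, 76.4),
    "HealthBench": (16.5, 35.0),
}


def gains(old, new):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new - old
    relative = 100.0 * absolute / old
    return absolute, relative


for name, (old, new) in scores.items():
    absolute, relative = gains(old, new)
    print(f"{name}: +{absolute:.2f} pp ({relative:+.1f}% relative)")
```

On the headline number, the jump from 28.56% to 37.23% is 8.67 percentage points, or roughly a 30% relative improvement.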

Florian Brand @xeophon

@maksym_andr @jackclarkSF @hrdkbhatnagar @full__rank @karinanguyen @thoughtfullab @Mersad_Abbasi Waow

4h23030

Suresh @_Suresh2

@maksym_andr @jackclarkSF @hrdkbhatnagar @full__rank @karinanguyen @thoughtfullab @Mersad_Abbasi that date jumped out , september 2026 is when a bunch of cs csc batches land in china, weird overlap

4h · 43

Hunter Gon @gonlenidefi

@karinanguyen so the abyss runs on a training loop now

what was it doing before we checked?

5h · 16

Strata @ChainZenit

@karinanguyen that is a wild way to look at it, honestly.

5h · 10

Raghav Doshi @relativistic_c

@maksym_andr @jackclarkSF @hrdkbhatnagar @full__rank @karinanguyen @thoughtfullab @Mersad_Abbasi Woah! Are there any major differences visible in behaviour, techniques or approaches compared to previous models? Does it take longer for the performance to plateau in the time budget?

3h · 6