Tech · 5h ago

Opus 4.8 sets a PostTrainBench record of 37.23%, up from the previous version's 28.56% score

PostTrainBench measures how effectively frontier models can train weaker models.

Original post
Karina@karinanguyen · #265 in Tech

staring into the abyss as models get better at modelcrafting. the abyss stares back, and the stare is the training signal

Thoughtful@thoughtfullab

New #1 on PostTrainBench: Opus 4.8 (max reasoning) hits 37.23% — up from 28.56% for 4.7, the largest single improvement we've seen.

Fable 5 runs underway now that AI research behavior is no longer silently degraded.

PostTrainBench asks how well frontier AI can train weaker language models. That makes it one of the first benchmarks for recursive self-improvement: AI improving AI, with progress measured in the loop itself.

9:21 AM · Jun 11, 2026 · 3.8K Views
Sentiment

Users are excited about Opus 4.8 setting a new PostTrainBench record, citing substantial gains on benchmarks such as AIME 2025 and ArenaHard as well as the original post's memorable phrasing.

Positive: 100.0% · Negative: 0.0%
3 comments with sentiment.
Cluster Engagement
Posts from X
Views 1.2K · Bookmarks 5 · Likes 29 · Retweets 1 · Replies 6

💥 NEW: Opus 4.8 with max reasoning is a new best model on PostTrainBench by a very large margin: 37.2% vs. 28.6% of Opus 4.7.

@jackclarkSF was right, the official instruct-model baseline will very likely be reached by September 2026.

4h · Views 1.2K · Likes 29 · Bookmarks 5

and Fable 5 is running - stay tuned!

Hardik Bhatnagar@hrdkbhatnagar

New #1 on PostTrainBench: Opus 4.8 (max reasoning) hits 37.23% - a big jump from Opus 4.7's 28.56%.

This is the largest single model improvement we've seen.

We are currently running Claude Fable 5; however, Fable's safety classifiers are refusing tasks, which is a new dynamic we haven't seen before. Stay tuned!

http://posttrainbench.com

4h · Views 660 · Likes 14 · Bookmarks 1

i thought Opus 4.8 would be at ~31% or so on PostTrainBench... incredible!

4h · Views 1.1K · Likes 7 · Bookmarks 1

Very substantial improvements on:
- AIME 2025: 6.4% -> 12.5%
- ArenaHard: 24.2% -> 42.2%
- BFCL: 76.8% -> 96.3%
- GSM8K: 59.0% -> 76.4%
- HealthBench: 16.5% -> 35%

Basically, it's better across the board. Scaling still works and there are no signs of saturation. We are still at the exponential.

Thank you @hrdkbhatnagar for running the eval!

All results: https://posttrainbench.com/
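
For anyone who wants the gains quoted in this thread as point and relative changes, here is a minimal Python sketch. The figures are copied from the posts; the `scores` dictionary and the `gains` helper are illustrative names, not part of PostTrainBench, and the per-benchmark pairs are assumed to be before/after scores for the same 4.7 -> 4.8 comparison as the headline number.

```python
# Before/after scores (in percent) as quoted in the posts.
# Assumption: each pair is (Opus 4.7, Opus 4.8) on the same task.
scores = {
    "PostTrainBench overall": (28.56, 37.23),
    "AIME 2025": (6.4, 12.5),
    "ArenaHard": (24.2, 42.2),
    "BFCL": (76.8, 96.3),
    "GSM8K": (59.0, 76.4),
    "HealthBench": (16.5, 35.0),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100.0
    return absolute, relative


for name, (before, after) in scores.items():
    absolute, relative = gains(before, after)
    print(f"{name}: +{absolute:.2f} points ({relative:.1f}% relative)")
```

Point gains and relative gains can rank the tasks differently, so it is worth saying which one is meant when calling an improvement "the largest".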

4h · Views 281 · Likes 3 · Bookmarks 0

@maksym_andr @jackclarkSF @hrdkbhatnagar @full__rank @karinanguyen @thoughtfullab @Mersad_Abbasi Waow

4h · Views 230 · Likes 3 · Bookmarks 0
Suresh@_Suresh2

@maksym_andr @jackclarkSF @hrdkbhatnagar @full__rank @karinanguyen @thoughtfullab @Mersad_Abbasi that date jumped out, september 2026 is when a bunch of cs csc batches land in china, weird overlap

4h · Views 43
Hunter Gon@gonlenidefi

@karinanguyen so the abyss runs on a training loop now

what was it doing before we checked?

5h · Views 16
Strata@ChainZenit

@karinanguyen that is a wild way to look at it, honestly.

5h · Views 10
Raghav Doshi@relativistic_c

@maksym_andr @jackclarkSF @hrdkbhatnagar @full__rank @karinanguyen @thoughtfullab @Mersad_Abbasi Woah! Are there any major differences visible in behaviour, techniques or approaches compared to previous models? Does it take longer for the performance to plateau in the time budget?

3h · Views 6
Shashwat Goel@ShashwatGoel7

@maksym_andr Oh cool, what strategy did it take compared to what earlier models were doing?

4h · Views 5
Saylor@seylorra

@karinanguyen 37% post train still feels low for how much compute is being dumped in there.

the real jump comes when that stare starts teaching itself.

4h · Views 4
Invincible@InvincibleEdge

@karinanguyen "stare is the training signal" goes hard as hell

almost makes me forget i dont know what modelcrafting means

5h
Rugbist@rugbist_

@karinanguyen models training models with the signal being mutual existential dread

seems sustainable

5h