
Anthropic's Andy Jones shares anecdote of AI system Fable predicting its own 29 percent benchmark score, prompting evaluation jokes

Sholto Douglas joked researchers should just ask models for scores.

Original post
andy jones @andy_l_jones · #456 in AI

did you know? you can just ask fable what its benchmark score will be

10:12 AM · Jun 9, 2026 · 36.2K Views
Sentiment

The one comment with a detected sentiment calls asking Claude to predict benchmark scores cheaper and faster than running evaluations on GPUs for a number nobody trusts anyway.

Positive: 100.0% · Negative: 0.0% · 1 comment with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 26.2K · Bookmarks 28 · Likes 340 · Retweets 6 · Replies 12
Sholto Douglas @_sholtodouglas

we don’t even run evals anymore we just ask Claude what the score will be

andy jones @andy_l_jones

did you know? you can just ask fable what its benchmark score will be

3h · Views 26.2K · Likes 340 · Bookmarks 28
Jack Clark @jackclarkSF

@andy_l_jones lmao

andy jones @andy_l_jones

did you know? you can just ask fable what its benchmark score will be

4h · Views 3K · Likes 34 · Bookmarks 2
andy jones @andy_l_jones

(past performance is not an indicator of future returns. generalization not guaranteed)

4h · Views 422 · Likes 4
Aaron Slodov @aphysicist

@_sholtodouglas plsssss turn down safety nerfs 👉👈

2h · Views 339 · Likes 4
Justin Halford @Justin_Halford_

@_sholtodouglas Imagine being able to know how much compute is needed to meaningfully solve <problem>. What would the long term compute capex appetite be if this was the case?

3h · Views 158
welt @weltistic

@_sholtodouglas im confused - help me sholto!

3h · Views 99
birs_tech @Birs_tech

@_sholtodouglas

3h · Views 95

@_sholtodouglas @rickasaurus Fable still only scoring slightly better on my private benchmark btw.

3h · Views 85
Rugbist @rugbist_

@_sholtodouglas lmao trust the oracle approach

benchmarks: we do them, we just skip the waiting

3h · Views 32
Dan McAteer @daniel_mac8

@andy_l_jones Predictive processing at its finest.

3h · Views 22
Invincible @InvincibleEdge

@_sholtodouglas bro saw the future and decided to skip the test

3h · Views 20
Blissy @BlissyOnX

@_sholtodouglas honestly this is cheaper and faster, why run 8 GPUs for a number nobody trusts anyway

3h