Tech · 14h ago

Fable 5 Tops EQ-Bench and Creative Writing Evaluations

Original post · Lisan al Gaib #1064
Sam Paech @sam_paech

Fable 5 tops EQ-Bench and both creative writing evals!

Personal take from reading some outputs: It has tics and tells, and isn't compelling in the way human writing is. But I think it earned its spot, in the sense that it's *relatively* excellent & hasn't reward-hacked the eval.

3:47 AM · Jun 10, 2026 · 7K Views
Sentiment

Many users are excited about Claude Fable 5's creative writing benchmark results, citing stronger story quality, while others criticize high evaluation costs and issues such as hallucinations.

Positive: 72.2%
Negative: 27.8%
11 comments with sentiment.
Cluster Engagement
Posts from X
roon @tszzl

@LechMazur which judge model is used?

1d · Views 8.5K · Likes 68 · Bookmarks 1
Lech Mazur @LechMazur

@tszzl These. Same-family models are excluded from grading writers in their own family.

1d · Views 847 · Likes 14 · Bookmarks 3
Lech Mazur @LechMazur

Claude Fable 5 (high) is a step up in short-fiction writing. On the Short-Story Creative Writing Benchmark, it beats Claude Opus 4.8 (xhigh) and Claude Opus 4.7 (high), and ranks second behind GPT-5.5 (xhigh).

Caveat: it refused 5 of the 400 creative-writing prompts.

1d · Views 27.6K · Likes 183 · Bookmarks 50
Lech Mazur @LechMazur

The benchmark is based on head-to-head story comparisons: two model-written short stories are shown side by side, and independent LLM judges choose which one is stronger.

1d · Views 4.1K · Likes 7 · Bookmarks 3
ρ:ɡeσn @pigeon__s

@LechMazur any benchmark that says gpt-5.5 is better at writing than LITERALLY ANY FUCKING MODEL ON THE ENTIRE PLANET is automatically void. gpt-5.5's writing makes me want to kill myself, it's literally the most slop thing in existence

21h · Views 187 · Likes 7 · Bookmarks 1
Lech Mazur @LechMazur

Unlike in the Extended NYT Connections benchmark, where it used fewer tokens, Fable 5 used 1.2x as many total tokens as Opus 4.8 (high).

1d · Views 415 · Likes 2 · Bookmarks 1
Lech Mazur @LechMazur

@tszzl Also, I should mention that each story comparison is judged by a three-model panel with the A/B order swapped for six total ratings per comparison.

1d · Views 505 · Likes 5 · Bookmarks 1
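
The judging setup described in the posts above (a three-model panel, A/B order swapped, six ratings per comparison, same-family judges excluded) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the benchmark's actual code: the judge names, family labels, the `judge_fn` callback, and the simple win count are all hypothetical, and the thread does not say how the six ratings are aggregated into a ranking.

```python
# Hypothetical judge pool mapping judge name -> model family.
# The real panel is not listed in this thread.
JUDGES = {
    "judge_a": "family_x",
    "judge_b": "family_y",
    "judge_c": "family_z",
    "judge_d": "family_w",
}


def eligible_judges(judges, family_1, family_2, panel_size=3):
    """Return up to panel_size judges whose family matches neither writer."""
    pool = [name for name, fam in judges.items() if fam not in (family_1, family_2)]
    return pool[:panel_size]


def compare(story_1, story_2, panel, judge_fn):
    """Collect one vote per (judge, presentation order).

    judge_fn(judge, first, second) is an assumed callback that returns
    "A" if the judge prefers the story shown first, otherwise "B".
    With three judges and two orders this yields six ratings.
    Returns (votes for story_1, total votes).
    """
    wins_1 = 0
    total = 0
    for judge in panel:
        for swapped in (False, True):
            first, second = (story_2, story_1) if swapped else (story_1, story_2)
            picked_first = judge_fn(judge, first, second) == "A"
            # Map the positional pick back to the underlying story.
            if picked_first != swapped:
                wins_1 += 1
            total += 1
    return wins_1, total


if __name__ == "__main__":
    # A toy judge that always prefers whichever story is shown first.
    def always_first(judge, first, second):
        return "A"

    panel = eligible_judges(JUDGES, "family_x", "family_q")
    print(panel, compare("story one", "story two", panel, always_first))
```

Swapping the order means a judge with a pure position bias splits its two votes evenly between the stories, so that bias cancels out instead of favoring whichever story happens to be shown first.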
Lech Mazur @LechMazur

Fable 5 also writes longer. Compared with Opus 4.x, it uses more of the allowed word budget, landing closer to the upper end of the short-story word limit.

1d · Views 710 · Likes 8
ChrisUniverse 🗽 @ChrisUniverse

@pigeon__s @LechMazur Change your preferences because it’s actually really good. Better than almost every model I’ve ever used. GPT 5.5 isn’t the same as 5.4 nor anywhere close to how it’s writing is ✍🏼

20h · Views 42 · Likes 2 · Bookmarks 1
Sam Paech @sam_paech

@AcousimHss Yes, that's correct. It compares the responses in head-to-head matchups and picks the winner/loser. So with eqbench3, the judge thinks those 3 models are better than its own outputs. In eqbench4 (releasing soon) 3x judges are used to mitigate self-bias.

13h · Views 6 · Likes 2
Lech Mazur @LechMazur

More info: https://github.com/lechmazur/writing/

1d · Views 484 · Likes 6
Simon Coste ꙮ @__SimonCoste__

@LechMazur Could we have access to some of these prompts & short stories ? couldnt find it in the GH.

15h · Views 34
VioP @AcousimHss

@sam_paech how do u set a model to top ?? how does this work ? so now is fable 5 the new judge ? if we r to do local reproduction tests for our models do we have to use fable 5 then?

14h · Views 133 · Likes 2
welt @weltistic

@LechMazur COT: “hmmm the user is asking for a short story. I better be cautious because it may put the user in a state of calm, which could lead to a breakthrough in AI research if I’m not too careful”

1d · Views 178 · Likes 1
Sam Paech @sam_paech

@AcousimHss It's llm-judged, but the judges are kept constant. To reproduce the results, use the same judge as the leaderboard (noted in the about page & on repo readme). Lmk if you run into any issues reproducing results, I'm happy to help.

14h · Views 92 · Likes 1
Nate Dalva @dalvabaird

@LechMazur @tszzl Do they prefer their own writing?

23h · Views 92 · Likes 1
Lech Mazur @LechMazur

@mcnultydigital

8h · Views 25 · Likes 1
VioP @AcousimHss

@sam_paech oh no i understand that this is llm judged , my question was how do top 1 model get placed , so like did opus 4.6 place fable over itself and 4.7,4.8?? or was there any other way im jst curious to learn is all!

14h · Views 17 · Likes 1
Sam @i_x_Sam

@tszzl Sly

1d · Views 455 · Likes 1

@LechMazur Can you please share one of the refused prompts or say more about one of the refused prompts so we have an idea of what was refused. I’m guessing it had to do with AI companionship/anthropomorphizing risks.

9h · Views 13