/Tech · 14h ago

Fable 5 Tops EQ-Bench and Creative Writing Evaluations

#1064

Original post

Lisan al Gaib#1064

Sam Paech@sam_paech

Fable 5 tops EQ-Bench and both creative writing evals!

Personal take from reading some outputs: It has tics and tells, and isn't compelling in the way human writing is. But I think it earned its spot, in the sense that it's *relatively* excellent & hasn't reward-hacked the eval.

3:47 AM · Jun 10, 2026 · 7K Views

Sentiment

Many users are excited about Claude Fable 5's creative writing benchmark results, citing stronger story quality, while others criticize high evaluation costs and issues like hallucinations.

Positive: 72.2%

Negative: 27.8%

11 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 8.5K · Likes: 68 · Replies: 4

roon@tszzl

@LechMazur which judge model is used?

1d

Bookmarks: 3

Lech Mazur@LechMazur

@tszzl These. Same-family models are excluded from grading writers in their own family.

1d
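The exclusion rule in the reply above can be sketched as a simple filter. This is a minimal illustration only: the judge names, the family labels, and the `eligible_judges` helper are hypothetical, and the choice to drop a judge that shares a family with either writer in a pairing is an assumption, since the post does not spell out that detail.

```python
def eligible_judges(writer_families, judges):
    """Return judge names whose family matches none of the writers' families.

    writer_families: set of family labels for the two writers being compared.
    judges: list of (judge_name, family_label) pairs. All labels are hypothetical.
    """
    return [name for name, family in judges if family not in writer_families]


# Hypothetical judge pool; the real judge list is not given in the thread.
judges = [("judge_a", "claude"), ("judge_b", "gpt"), ("judge_c", "gemini")]

# A Claude-family writer is never graded by a Claude-family judge.
allowed = eligible_judges({"claude"}, judges)
print(allowed)
```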

Retweets: 8

Lech Mazur@LechMazur

Claude Fable 5 (high) is a step up in short-fiction writing. On the Short-Story Creative Writing Benchmark, it beats Claude Opus 4.8 (xhigh) and Claude Opus 4.7 (high), and ranks second behind GPT-5.5 (xhigh).

Caveat: it refused 5 of the 400 creative-writing prompts.

1d

Lech Mazur@LechMazur

The benchmark is based on head-to-head story comparisons: two model-written short stories are shown side by side, and independent LLM judges choose which one is stronger.

1d
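One simple way to turn head-to-head verdicts like these into a ranking is a per-model win rate. The sketch below is an assumption for illustration: the thread does not say how the benchmark aggregates judgments, and the `win_rates` helper and the model names are hypothetical.

```python
from collections import defaultdict


def win_rates(judgments):
    """Compute each model's share of comparisons won.

    judgments: iterable of (winner, loser) pairs, one per judged comparison.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in judgments:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


# Hypothetical verdicts among three placeholder models.
rates = win_rates([("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")])
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

A rating model such as Bradley-Terry or Elo would weight wins by opponent strength; plain win rate is used here only because it is the easiest aggregate to read.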

ρ:ɡeσn@pigeon__s

@LechMazur any benchmark that says gpt-5.5 is better at writing than LITERALLY ANY FUCKING MODEL ON THE ENTIRE PLANET is automatically void gpt-5.5s writing makes me want to kill myself its literally the most slop thing in existence

21h

Lech Mazur@LechMazur

Unlike in the Extended NYT Connections benchmark, where it used fewer tokens, Fable 5 used 1.2x as many total tokens as Opus 4.8 (high).

1d

Lech Mazur@LechMazur

@tszzl Also, I should mention that each story comparison is judged by a three-model panel with the A/B order swapped for six total ratings per comparison.

1d
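The arithmetic in the reply above (three judges, each seeing both A/B orders, for six ratings) can be sketched as follows. This is a rough illustration under stated assumptions: judges are modeled as plain functions that return "A" or "B", whereas the real benchmark calls LLM judges and may collect scores rather than picks. The `panel_ratings` helper and both stand-in judges are hypothetical.

```python
def panel_ratings(story_1, story_2, judges):
    """Ask every judge about both presentation orders.

    Returns a list of winners (1 or 2), one per judge per order,
    mapped back to the original story numbering.
    """
    ratings = []
    for judge in judges:
        for first, second, swapped in ((story_1, story_2, False), (story_2, story_1, True)):
            picked_first = judge(first, second) == "A"
            # Without a swap, position A is story 1; with a swap, position A is story 2.
            ratings.append(1 if picked_first != swapped else 2)
    return ratings


# Hypothetical stand-in judges: one with pure position bias, two that prefer the longer text.
always_first = lambda a, b: "A"
prefers_longer = lambda a, b: "A" if len(a) >= len(b) else "B"

ratings = panel_ratings("short", "a much longer story", [always_first, prefers_longer, prefers_longer])
print(ratings)
```

The position-biased judge splits its two votes between the stories, so swapping the order cancels its bias instead of letting it tilt the result.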

Lech Mazur@LechMazur

Fable 5 also writes longer. Compared with Opus 4.x, it uses more of the allowed word budget, landing closer to the upper end of the short-story word limit

1d

ChrisUniverse 🗽@ChrisUniverse

@pigeon__s @LechMazur Change your preferences because it’s actually really good. Better than almost every model I’ve ever used. GPT 5.5 isn’t the same as 5.4 nor anywhere close to how its writing is ✍🏼

20h

Sam Paech@sam_paech

@AcousimHss Yes, that's correct. It compares the responses in head-to-head matchups and picks the winner/loser. So with eqbench3, the judge thinks those 3 models are better than its own outputs. In eqbench4 (releasing soon) 3x judges are used to mitigate self-bias.

13h

Lech Mazur@LechMazur

More info: https://github.com/lechmazur/writing/

1d

Simon Coste ꙮ@__SimonCoste__

@LechMazur Could we have access to some of these prompts & short stories? Couldn't find them in the GH.

15h

VioP@AcousimHss

@sam_paech how do u set a model to top ?? how does this work ? so now is fable 5 the new judge ? if we r to do local reproduction tests for our models do we have to use fable 5 then?

14h

welt@weltistic

@LechMazur COT: “hmmm the user is asking for a short story. I better be cautious because it may put the user in a state of calm, which could lead to a breakthrough in AI research if I’m not too careful”

1d

Sam Paech@sam_paech

@AcousimHss It's llm-judged, but the judges are kept constant. To reproduce the results, use the same judge as the leaderboard (noted in the about page & on repo readme). Lmk if you run into any issues reproducing results, I'm happy to help.

14h

Nate Dalva@dalvabaird

@LechMazur @tszzl Do they prefer their own writing?

23h

Lech Mazur@LechMazur

@mcnultydigital

8h

VioP@AcousimHss

@sam_paech oh no i understand that this is llm judged , my question was how do top 1 model get placed , so like did opus 4.6 place fable over itself and 4.7,4.8?? or was there any other way im jst curious to learn is all!

14h

Sam@i_x_Sam

@tszzl Sly

1d

🛑 use my llm to build yours or related tech, bro@mcnultydigital

@LechMazur Can you please share one of the refused prompts or say more about one of the refused prompts so we have an idea of what was refused. I’m guessing it had to do with AI companionship/anthropomorphizing risks.

9h