/Tech · 1d ago

Claude Fable 5 Ranks Second on Short-Story Creative Writing Benchmark

#59

Original post

roon#59

Lech Mazur@LechMazur

Claude Fable 5 (high) is a step up in short-fiction writing. On the Short-Story Creative Writing Benchmark, it beats Claude Opus 4.8 (xhigh) and Claude Opus 4.7 (high), and ranks second behind GPT-5.5 (xhigh).

Caveat: it refused 5 of the 400 creative-writing prompts.

5:23 PM · Jun 9, 2026 · 27.6K Views

Sentiment

Users react to Claude Fable 5 ranking second on a short-story creative writing benchmark, with some accepting the result anecdotally while others dismiss such benchmarks or reject associated claims about GPT-5.5 writing quality.

Pos

33.3%

Neg

66.7%

3 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS 8.5K · LIKES 68 · RETWEETS 1 · REPLIES 4

roon@tszzl

@LechMazur which judge model is used?

1d8.5K681

BOOKMARKS 3

Lech Mazur@LechMazur

@tszzl These. Same-family models are excluded from grading writers in their own family.

1d847143

Lech Mazur@LechMazur

The benchmark is based on head-to-head story comparisons: two model-written short stories are shown side by side, and independent LLM judges choose which one is stronger.

1d4.1K73
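A minimal sketch of how head-to-head verdicts like these could be turned into a ranking. The thread does not say how the benchmark aggregates its comparisons, so the plain win-rate tally below is an assumption used only for illustration, and the writer names and results are hypothetical placeholders rather than benchmark data.

```python
from collections import defaultdict


def win_rates(comparisons):
    """Return each writer's share of wins.

    `comparisons` is an iterable of (writer_a, writer_b, winner) tuples,
    where `winner` must be one of the two writers. This simple win-rate
    tally is an assumed aggregation, not the benchmark's stated method.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for writer_a, writer_b, winner in comparisons:
        if winner not in (writer_a, writer_b):
            raise ValueError(f"winner {winner!r} did not take part in this comparison")
        games[writer_a] += 1
        games[writer_b] += 1
        wins[winner] += 1
    return {writer: wins[writer] / games[writer] for writer in games}


def rank(comparisons):
    """Order writers by descending win rate, breaking ties by name."""
    rates = win_rates(comparisons)
    return sorted(rates, key=lambda writer: (-rates[writer], writer))


# Hypothetical placeholder verdicts, not real benchmark results.
demo = [
    ("writer_x", "writer_y", "writer_x"),
    ("writer_x", "writer_z", "writer_x"),
    ("writer_y", "writer_z", "writer_y"),
]

print(rank(demo))
```

A rating model such as Bradley-Terry or Elo would weight wins by opponent strength; the win rate here is only the simplest stand-in.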

ρ:ɡeσn@pigeon__s

@LechMazur any benchmark that says gpt-5.5 is better at writing than LITERALLY ANY FUCKING MODEL ON THE ENTIRE PLANET is automatically void gpt-5.5s writing makes me want to kill myself its literally the most slop thing in existence

21h18771

Lech Mazur@LechMazur

Unlike in the Extended NYT Connections benchmark, where it used fewer tokens, Fable 5 used 1.2x as many total tokens as Opus 4.8 (high).

1d41521

Lech Mazur@LechMazur

@tszzl Also, I should mention that each story comparison is judged by a three-model panel with the A/B order swapped for six total ratings per comparison.

1d50551
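The panel setup described above (three judges, each seeing the pair in both A/B orders, for six ratings) can be sketched roughly as follows. This also folds in the same-family exclusion mentioned earlier in the thread. Everything named here is hypothetical: `eligible_judges`, `panel_ratings`, the `judge_fn` callback, and the stub judges stand in for real LLM calls, and whether the three-judge count applies before or after the family filter is not stated in the thread.

```python
def eligible_judges(judges, family_a, family_b):
    """Drop judges that share a model family with either writer.

    `judges` is a list of dicts with hypothetical "name" and "family" keys.
    """
    return [j for j in judges if j["family"] not in (family_a, family_b)]


def panel_ratings(story_a, story_b, judges, judge_fn):
    """Collect one rating per (judge, presentation order) pair.

    `judge_fn(judge_name, first, second)` is an assumed callback that
    returns 0 if the first story shown is stronger and 1 otherwise.
    Ratings are mapped back to the original labels "a" and "b".
    """
    ratings = []
    for judge in judges:
        for swapped in (False, True):
            first, second = (story_b, story_a) if swapped else (story_a, story_b)
            pick = judge_fn(judge["name"], first, second)
            # Position 0 means story A only when the order was not swapped.
            a_wins = (pick == 0) != swapped
            ratings.append("a" if a_wins else "b")
    return ratings


# Stub judges for illustration only; a real judge would call an LLM.
def longer_story_judge(judge_name, first, second):
    return 0 if len(first) >= len(second) else 1


def first_position_judge(judge_name, first, second):
    return 0  # always prefers whichever story is shown first


all_judges = [
    {"name": "judge_1", "family": "family_1"},
    {"name": "judge_2", "family": "family_2"},
    {"name": "judge_3", "family": "family_3"},
    {"name": "judge_4", "family": "family_4"},
    {"name": "judge_5", "family": "family_5"},
]
panel = eligible_judges(all_judges, "family_1", "family_2")

content_based = panel_ratings("a much longer story", "short", panel, longer_story_judge)
position_biased = panel_ratings("a much longer story", "short", panel, first_position_judge)
print(len(content_based), content_based.count("a"), position_biased.count("a"))
```

Swapping the order is what keeps a purely position-biased judge from tipping the result: in the stub above, such a judge splits its two votes evenly between the stories.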

Lech Mazur@LechMazur

Fable 5 also writes longer. Compared with Opus 4.x, it uses more of the allowed word budget, landing closer to the upper end of the short-story word limit.

1d7108

Lech Mazur@LechMazur

More info: https://github.com/lechmazur/writing/

1d4846

welt@weltistic

@LechMazur COT: “hmmm the user is asking for a short story. I better be cautious because it may put the user in a state of calm, which could lead to a breakthrough in AI research if I’m not too careful”

1d1781

Nate Dalva@dalvabaird

@LechMazur @tszzl Do they prefer their own writing?

23h921

Sam@i_x_Sam

@tszzl Sly

1d4551

Lech Mazur@LechMazur

@dalvabaird @tszzl When I first started this benchmark in Jan 2025 (using absolute ratings rather than comparisons), they did not show any preference. Later on, some preference started to appear. I haven't checked since switching to comparisons.

23h1232