Fable 5 Ranks High on LisanBench Despite Invalid Moves


Lisan al Gaib@scaling01

LisanBench results for Fable 5

I've actually withheld LisanBench scores for like 2 weeks now, because honestly, starting with Opus 4.6, the benchmark is saturating in its current form.

Overall, Fable 5 scores #3 and #2 in the main path length and difficulty weighted metrics, but it's not meaningfully better than Opus 4.7 and Opus 4.6. This has two reasons:
- Opus 4.7 was tested with xhigh thinking budget, so scores are naturally higher than with just medium. With equal budgets Fable would likely score higher.
- Opus 4.6 was one of the only models where I could find reward-hacky behavior (I have written an article about this).

Fable is slightly more explorative and has a slightly higher fraction of legal moves than both Opus 4.6 and Opus 4.7.

But one issue I found with Fable is that it generates a lot of invalid word transitions. Instead of just using edit-distance 1 it uses edit-distance 2. This happens ~28% of the time; for Opus 4.8 it was 0%, and Opus 4.7 and Opus 4.6 were at 4.7% and 13.3%.

Often it used transpositions, which are invalid: fist -> fits or tied -> tide
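A minimal sketch of why those transpositions fail an edit-distance-1 check, assuming plain Levenshtein distance (insert, delete, substitute). The function names `levenshtein` and `is_valid_transition` are hypothetical, and the benchmark's real validator may apply other rules on top of this.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance (no transposition op)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def is_valid_transition(a: str, b: str) -> bool:
    """Assumption: a move is legal only if the words are exactly one edit apart."""
    return levenshtein(a, b) == 1


for pair in [("fist", "fits"), ("tied", "tide"), ("fist", "mist")]:
    print(pair, levenshtein(*pair), is_valid_transition(*pair))
```

Swapping two adjacent letters costs two substitutions under this definition, so both examples above come out at distance 2. A Damerau-style distance would count them as 1, which is one plausible reason a model might treat them as legal.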

I have also moved to a more comprehensive dictionary, because the old dictionary was stopping the chains on English words that weren't in the dictionary. This issue still persists, although it's much less common and doesn't affect ranking too much. If we allowed those words, then its score would improve by 165.67 points, which wouldn't change its ranking.

There's one hack I've done to make Fable 5 look good: taking only the top 10 hardest words, because most starting words are just way too easy and models just farm points (see image 1 with table for scores).

Also, on the number of total valid chains, which includes chains that happen after the official scoring stops due to an invalid transition, it scores #1.
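To make the difference between the two counts concrete, here is a rough sketch under stated assumptions: a transition counts as valid if the next word is in the dictionary and exactly one edit away, the "official" count stops at the first invalid transition, and the "total" count keeps going. `one_edit_apart`, `count_transitions`, and the toy word list are all hypothetical, not the benchmark's actual code or data.

```python
def one_edit_apart(a: str, b: str) -> bool:
    """True if b is reachable from a by one insert, delete, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra letter in b at position i


def count_transitions(words: list[str], dictionary: set[str]) -> tuple[int, int]:
    """Return (official, total_valid) transition counts for one chain."""
    official = 0
    total_valid = 0
    stopped = False
    for prev, nxt in zip(words, words[1:]):
        if nxt in dictionary and one_edit_apart(prev, nxt):
            total_valid += 1
            if not stopped:
                official += 1
        else:
            stopped = True  # official scoring ends at the first invalid move
    return official, total_valid


chain = ["mist", "fist", "fits", "fins", "fine"]
print(count_transitions(chain, set(chain)))  # (1, 3)
```

In this toy chain the fist -> fits transposition ends official scoring after one transition, while the two legal moves after it still add to the total.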

Overall, I would still say that Fable 5 is the best model if you take the token budget and reward hacking into account. I think Opus 4.8 and other non-Anthropic models like GPT-5.5 also support that Opus/Sonnet 4.6 and Opus 4.7 are outliers. But I also see how this interpretation could be seen as cherry-picked.


Lisan al Gaib@scaling01

note to self: use this metric that weights by starting-word hardness and also weights transitions by hardness