Tech · 19h ago

Anthropic Fable 5 Tops ParseBench With 90% Content Faithfulness Score

#707

Original post

Jerry Liu · #707

LlamaIndex 🦙@llama_index

Day 0 Anthropic Fable 5 in ParseBench: We tested the model's advancements in document understanding. The model is clearly strongest in adherence to the original text:

📃 Content faithfulness: 90.02% vs 86.19% (Gemini 3 Flash) and 86.81% (GPT-5.5)

🔢 Semantic formatting: 72.62% vs 58.35% and 60.12%, a 12+ point lead

These are two of the most important metrics for SOTA document understanding: does the output preserve what the document actually says, and does it preserve formatting that carries meaning?

But ... it's not a sweep; there continues to be a lot of alpha in unlocking document understanding for frontier models.

Full results below 👇

5:18 PM · Jun 9, 2026 · 33.1K Views

Sentiment

Some users criticized Anthropic Fable 5's reported split of roughly 90% on content faithfulness versus 49% on visual grounding as a strange combination to ship on day 0.

Pos: 0.0% · Neg: 100.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 43.2K · Bookmarks: 95 · Likes: 204 · Retweets: 21 · Replies: 24

Jerry Liu@jerryjliu0

Claude Fable 5 thinks document parsing is beneath it

It is absolutely crushing on all reasoning-intensive/long horizon benchmarks: SWE-Bench Pro, FrontierCode, GDPval, Runescape, etc.

But for document understanding tasks, it is roughly equivalent to Gemini 3 Flash in performance, at roughly 10-15x the token cost.

We benchmarked the model on ParseBench and compared it against all other frontier models. It is definitely up there compared to other frontier models, but falls far short of specialized OCR providers.

What we found interesting is that Fable 5 is self-aware about this. When we asked the model what tasks it enjoys the least, it said that it dislikes tasks "where the request is fully specified and the answer is fully known", implying that part of its weak performance is due to laziness and an unwillingness to actually solve the task at hand.

For a full list of results across different frontier models, check out ParseBench! https://www.parsebench.ai/

18h · 43.2K views · 204 likes · 95 bookmarks

anya@annaeremburg

@llama_index fable 5's 49.24% on visual grounding while leading content faithfulness by four points tells you the model reads text with unusual fidelity and still doesn't fully know where it is on the page

19h432

Kazba@Kazba22

@llama_index content faithfulness at 90 while visual grounding sits at 49 is a strange split to ship on day 0

19h192

Eclipse 🌖@ECLresearch

@llama_index Content faithfulness at 90% is a meaningful lead, but the real question for production is whether that gain holds under long-context stress tests beyond the static benchmark.

19h9

Hershal Rao@Hershal0_0

@llama_index So it can read the fine print perfectly but has no idea where it’s standing. Relatable.

19h2