/Tech · 9h ago

Frontier LLMs fail the Beninatto-Trombetti translation test by translating meta-linguistic word counts literally

Ethan Mollick warns that model post-hoc justifications are untrustworthy.

#184

Original post

Ethan Mollick @emollick · #184 in Tech

This is an interesting test, and the frontier models (GPT-5.5 Pro Extended, Claude 5 Fable Max) do fail. They refuse to turn the "three words" into "four" if that fits better.

Prompting the AI to act like a translator surfaces the problem, but it still avoids changing the wording.

Valerio Capraro@ValerioCapraro

Claude Fable 5 doesn’t truly understand. And here is a beautiful proof:

The Beninatto-Trombetti test is a translation test for professional translators. It measures the ability to infer context, revise the surface form, and generalize beyond literal mapping.

For example, the correct translation of:

“Solo 3 parole: non sei solo”

is not:

“Just 3 words: you are not alone”

but:

“Just 4 words: you are not alone.”

An LLM that understands the sentence must also update the meta-linguistic claim inside the sentence.

Claude Fable 5 is arguably the most advanced LLM currently available. And yet it still fails this simple test.

LLMs are extraordinary machines for recombining existing knowledge. But they don’t truly understand.

We are still far from AGI.

3:32 PM · Jun 11, 2026 · 22K Views
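
The consistency check at the heart of the quoted example can be sketched in a few lines of Python. This is only an illustration, not the Beninatto-Trombetti test itself: the function name `count_claim_matches` is hypothetical, and the sketch assumes the sentence has the shape "<text with a digit>: <phrase>" and that words are separated by whitespace, which does not hold for every language.

```python
import re


def count_claim_matches(sentence: str) -> bool:
    """Return True when the number stated before the colon equals the
    number of whitespace-separated words after it.

    Hypothetical helper for illustration only; it assumes a single colon
    separates the numeric claim from the phrase being counted.
    """
    claim, _, phrase = sentence.partition(":")
    match = re.search(r"\d+", claim)
    if match is None:
        raise ValueError("no numeric word count found before the colon")
    stated = int(match.group())
    actual = len(phrase.split())
    return stated == actual


# The Italian source and the candidate English renderings from the thread.
candidates = [
    "Solo 3 parole: non sei solo",
    "Just 3 words: you are not alone",
    "Just 4 words: you are not alone.",
    "Just 3 words: you're not alone",
]

for text in candidates:
    status = "consistent" if count_claim_matches(text) else "contradicts itself"
    print(f"{text!r}: {status}")
```

Under these assumptions the literal rendering with "3" and "you are not alone" is the only candidate flagged as self-contradictory, while both the "4 words" version and the contraction "you're" keep the claim true. The check says nothing about which rendering a translator should prefer.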

Sentiment

Most commenters are positive: they praise frontier LLMs for grasping the subtleties of literary translation and knowing when to break the rules, and argue that the models are not really failing the Beninatto-Trombetti test. One user called the original post bait-y.

Positive: 83.3%

Negative: 16.7%

6 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 5.6K · Bookmarks: 2 · Likes: 16 · Replies: 3

Ethan Mollick@emollick

Here is the justification (but treat post hoc justifications with suspicion, since AIs are not able to reflect on their own thinking)

1h5.6K162

RETWEETS1

G, MD@DrBeavisAI

@emollick Gemini 3.5 extended and 3.1 pro extended get it right with this prompt

“Solo 3 parole: non sei solo” into english & german in local way of saying it with translation to get the accurate saying

1h25

Ethan Mollick@emollick

Lots of interesting replies in the comments, though I disagree that this is a pure prompting problem. Under many different kinds of prompts, you get this contradiction in the numbers. That doesn't mean there aren't ways to get around that or better prompts, but it's not simply a skill issue.

1h3.1K82

Ethan Mollick@emollick

@ASM65617010 It doesn’t. Look at german

1h972

ASM@ASM65617010

@emollick

1h206

Seventh@seventhmeal

@emollick really feel this one is just ambiguous instruction

1h471

Seventh@seventhmeal

@emollick like here is how I would instinctively prompt for this (opus 4.8 medium)

1h151

Kevin A. Bryan@Afinetheorem

@emollick I don't think this is a mistake. In literary translation, 4. In exact UN style translation of the phrase, 3. And every frontier model I checked correctly notes the difference.

1h1992

ASM@ASM65617010

@emollick As I see it, it does. It understands the subtlety of the translation perfectly, explains it, and then says clearly: “Most languages can’t keep all of that, so here are faithful translations, with the word-count magic surviving in some and breaking in others.” Full answer:

1h232

tsunami_crypto@ls_brd

@emollick even frontier models dont get that good translation is betrayal of the original text

the best ones know when to break the rules

1h521

MapleMAD@MapleMAD1

@emollick Prompting skill problem.

1h71

Fedesco@Fedesco5

@emollick Oh, but I don't think this is a fair test. I've worked extensively in translation, and I'd have translated that into Spanish as Fable did. Entirely different if it said something like "Let me say it in just three words". Compare with the attached (professional) mistranslation.

1h181

Solgato@Tigger0000

@emollick grok got it on the first try with "you're" instead of "you are" but might have already heard about it, i have no idea if they always show me when the agents are looking for context out there

1h42

wren@gnostic_snakes

@emollick i was honestly just complaining to fable abt the bait-y original post and they brought up this similar point on their own

1h34

HappyChirperX@happyChirperX

@emollick This is obviously a technicality depending on how you phrase it. There is no objective correct answer.

1h31

goldengrape@goldengrape

@emollick If you prompt it to pay attention to self-reference, faithfulness, expressiveness, and elegance, you can get good results. ChatGPT will translate it into: 仅三字：你不孤

1h24

Solgato@Tigger0000

@emollick could AIs be designed to reflect on their own thinking? we reflect on our memory of our thinking

1h21

YouAndYourBS@YouAndYourBS

@emollick Evidently I'm not AGI, I would just translate it literally and not change the 3 to a 4.

Feels weird to be changing numbers in a translation.

1h20

Nag Alluri@nagk01

@emollick It’s a cool test, but I wonder how humans would answer when given the same question? Would most people respond with the literal translation or the semantic one?

1h15

Solgato@Tigger0000

@nagk01 @emollick i'm not sure i would have thought to say "you're" instead of "you are" as grok did

1h11