Tech · 9h ago

Frontier LLMs fail the Beninatto-Trombetti translation test by translating meta-linguistic word counts literally

Ethan Mollick warns that model post-hoc justifications are untrustworthy.

Original post
Ethan Mollick@emollick · #184 in Tech

This is an interesting test, and the frontier models (GPT-5.5 Pro Extended, Claude 5 Fable Max) do fail. They refuse to turn the "three words" into "four" if that fits better

Prompting the AI to act like a translator surfaces the problem, but it still avoids changing the wording

Valerio Capraro@ValerioCapraro

Claude Fable 5 doesn’t truly understand. And here is a beautiful proof:

The Beninatto-Trombetti test is a translation test for professional translators. It measures the ability to infer context, revise the surface form, and generalize beyond literal mapping.

For example, the correct translation of:

“Solo 3 parole: non sei solo”

is not:

“Just 3 words: you are not alone”

but:

“Just 4 words: you are not alone.”

An LLM that understands the sentence must also update the meta-linguistic claim inside the sentence.

Claude Fable 5 is arguably the most advanced LLM currently available. And yet it still fails this simple test.

LLMs are extraordinary machines for recombining existing knowledge. But they don’t truly understand.

We are still far from AGI.

3:32 PM · Jun 11, 2026 · 22K Views
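The consistency requirement in the example above can be checked mechanically. What follows is a minimal sketch, not part of the Beninatto-Trombetti test itself. It assumes the count is written as a digit before a colon and that words are separated by whitespace. The function names `claimed_count`, `actual_count`, and `is_self_consistent` are hypothetical.

```python
import re
from typing import Optional


def claimed_count(sentence: str) -> Optional[int]:
    """Return the first integer written before the colon, or None if absent."""
    head, _, _ = sentence.partition(":")
    match = re.search(r"\d+", head)
    return int(match.group()) if match else None


def actual_count(sentence: str) -> int:
    """Count whitespace-separated words after the colon (an assumed definition of 'word')."""
    _, _, tail = sentence.partition(":")
    return len(tail.split())


def is_self_consistent(sentence: str) -> bool:
    """True when the stated word count matches the words that follow the colon."""
    claimed = claimed_count(sentence)
    return claimed is not None and claimed == actual_count(sentence)


if __name__ == "__main__":
    for text in (
        "Solo 3 parole: non sei solo",
        "Just 3 words: you are not alone",
        "Just 4 words: you are not alone.",
        "Just 3 words: you're not alone",
    ):
        print(is_self_consistent(text), "|", text)
```

Under these assumptions the Italian original and the "4 words" rendering are self-consistent, the literal "3 words" rendering is not, and a contraction such as "you're" keeps the count at three. The check says nothing about whether a translator should change the number, only whether the sentence agrees with itself.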
Sentiment

Most commenters argue that frontier LLMs grasp the subtleties of literary translation and know when to break the rules, rather than failing the Beninatto-Trombetti test; one user called the original post bait-y.

Pos 83.3% · Neg 16.7% · 6 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Views 5.6K · Bookmarks 2 · Likes 16 · Replies 3
Ethan Mollick@emollick

Here is the justification (but treat post hoc justifications with suspicion, since AIs are not able to reflect on their own thinking)

1h · Views 5.6K · Likes 16 · Bookmarks 2
Retweets 1
G, MD@DrBeavisAI

@emollick Gemini 3.5 extended and 3.1 pro extended get it right with this prompt

“Solo 3 parole: non sei solo” into english & german in local way of saying it with translation to get the accurate saying

1h · Views 25
Ethan Mollick@emollick

Lots of interesting replies in the comments, though I disagree that this is a pure prompting problem. Under many different kinds of prompts, you get this contradiction in the numbers. That doesn't mean there aren't ways to get around that or better prompts, but it's not simply a skill issue.

1h · Views 3.1K · Likes 8 · Bookmarks 2
Ethan Mollick@emollick

@ASM65617010 It doesn’t. Look at German.

1h · Views 97 · Likes 2
ASM@ASM65617010

@emollick

1h · Views 206
Seventh@seventhmeal

@emollick really feel this one is just ambiguous instruction

1h · Views 47 · Likes 1
Seventh@seventhmeal

@emollick like here is how I would instinctively prompt for this (opus 4.8 medium)

1h · Views 15 · Likes 1
Kevin A. Bryan@Afinetheorem

@emollick I don't think this is a mistake. In literary translation, 4. In exact UN style translation of the phrase, 3. And every frontier model I checked correctly notes the difference.

1h · Views 199 · Likes 2
ASM@ASM65617010

@emollick As I see it, it does. It understands the subtlety of the translation perfectly, explains it, and then says clearly: “Most languages can’t keep all of that, so here are faithful translations, with the word-count magic surviving in some and breaking in others.” Full answer:

1h · Views 23 · Likes 2

@emollick even frontier models don't get that good translation is betrayal of the original text

the best ones know when to break the rules

1h · Views 52 · Likes 1
MapleMAD@MapleMAD1

@emollick Prompting skill problem.

1h · Views 71
Fedesco@Fedesco5

@emollick Oh, but I don't think this is a fair test. I've worked extensively in translation, and I'd have translated that into Spanish as Fable did. Entirely different if it said something like "Let me say it in just three words". Compare with the attached (professional) mistranslation.

1h · Views 18 · Likes 1
Solgato@Tigger0000

@emollick grok got it on the first try with "you're" instead of "you are" but might have already heard about it, i have no idea if they always show me when the agents are looking for context out there

1h · Views 42
wren@gnostic_snakes

@emollick i was honestly just complaining to fable abt the bait-y original post and they brought up this similar point on their own

1h · Views 34
HappyChirperX@happyChirperX

@emollick This is obviously a technicality depending on how you phrase it. There is no objective correct answer.

1h · Views 31
goldengrape@goldengrape

@emollick If you prompt it to pay attention to self-reference, faithfulness, expressiveness, and elegance, you can get good results. ChatGPT will translate it into: 仅三字:你不孤

1h · Views 24
Solgato@Tigger0000

@emollick could AIs be designed to reflect on their own thinking? we reflect on our memory of our thinking

1h · Views 21
YouAndYourBS@YouAndYourBS

@emollick Evidently I'm not AGI, I would just translate it literally and not change the 3 to a 4.

Feels weird to be changing numbers in a translation.

1h · Views 20
Nag Alluri@nagk01

@emollick It’s a cool test, but I wonder how humans would answer when given the same question? Would most people respond with the literal translation or the semantic one?

1h · Views 15
Solgato@Tigger0000

@nagk01 @emollick i'm not sure i would have thought to say "you're" instead of "you are" as grok did

1h · Views 11