
Frontier Models Struggle With Exact Text Preservation During Mad Libs Fill-In Tasks


Original post

rapidly interleaving between exact content preservation of the original text and mad-libs-style fill-in-the-middle is hard for frontier models, especially when the source material has typos, even when you instruct the models to *specifically preserve* all typos faithfully

2:42 PM · Jun 14, 2026 · 1.3K Views



kalomaze@kalomaze

the rapid interleaving surfaces this way harder than the "subtle local edit made at exactly one place" case. far more state transitions between "copy earlier context" and "freestyle generate what you figure was originally here", creating more room for at least one boundary fudge


Replies

kalomaze@kalomaze

@osoleve i want to try a variant where i specifically measure this with local SWA (sliding-window attention) over, say, the last 32 tokens vs global attention, and only truncate the spans that become close to exactly determinable (distributionally unimodal) with earlier context, and extremely undeterminable with SWA only.
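
A rough sketch of the selection rule proposed here, assuming per-token entropies have already been computed twice for the same document: once with full (global) context and once with only a short local window. The function name, the two thresholds, and the entropy inputs are all hypothetical; nothing below calls a real model.

```python
def swa_sensitive_positions(
    global_entropy: list[float],
    local_entropy: list[float],
    low: float = 0.1,   # assumed cutoff for "close to exactly determinable"
    high: float = 2.0,  # assumed cutoff for "extremely undeterminable"
) -> list[int]:
    """Return token indices that are near-deterministic with full context
    but highly uncertain when only a local window is visible."""
    if len(global_entropy) != len(local_entropy):
        raise ValueError("entropy lists must cover the same tokens")
    return [
        i
        for i, (g, l) in enumerate(zip(global_entropy, local_entropy))
        if g <= low and l >= high
    ]
```

Positions returned by a filter like this would be the ones whose recovery depends on copying from far-back context rather than on local fluency.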


kalomaze@kalomaze

@osoleve my metric here is an all-or-nothing metric btw. if there are 14 spans that must remain verbatim in the output, and only 13/14 are present, that's a zero score. it's designed on purpose to be harsh on fudging *at least one boundary* of many in a doc
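
A minimal sketch of an all-or-nothing score like the one described, assuming the required spans are plain strings and that "present" means an exact substring match in document order. The function name and the matching rule are illustrative assumptions, not the author's actual implementation.

```python
def all_or_nothing_score(output: str, required_spans: list[str]) -> int:
    """Return 1 only if every required span appears verbatim in `output`, else 0.

    Assumption: spans are matched by exact substring search, in order,
    without overlapping one another.
    """
    cursor = 0
    for span in required_spans:
        idx = output.find(span, cursor)
        if idx == -1:
            return 0  # one missing or altered span zeroes the whole score
        cursor = idx + len(span)
    return 1
```

With 14 required spans, an output that reproduces 13 exactly and silently fixes a typo in the 14th scores 0 under this rule.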


kalomaze@kalomaze

@osoleve depends heavily on the structure of the doc and the model's priors, as far as i can tell. seen cases where it fudges a single region consistently despite the preservation instruction, and seen cases where the fudge region is highly stochastic


kalomaze@kalomaze

@osoleve for more context, here is what a doc looks like. the instruction says to preserve the original regions exactly, incl. typos, and ONLY generate guesses for the [...] absent regions. i use gemma4 12b's NTP (next-token prediction) entropy as a proxy for where to truncate (most predictable subsequences)
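
One way the entropy-guided truncation could look, as a sketch under stated assumptions: per-token entropies are given as a list of floats (in practice they would come from a pretrained model's next-token distributions), spans have a fixed length, and the lowest-mean-entropy windows are picked greedily. The helper names, the fixed span length, and the greedy rule are hypothetical; only the `[...]` marker and the "most predictable subsequences" criterion come from the thread.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_mask_spans(
    entropies: list[float], span_len: int, n_spans: int
) -> list[tuple[int, int]]:
    """Greedily choose up to n_spans non-overlapping windows with the lowest
    mean entropy. Returns half-open (start, end) index ranges sorted by start."""
    if span_len <= 0:
        raise ValueError("span_len must be positive")
    windows = sorted(
        (sum(entropies[s:s + span_len]) / span_len, s)
        for s in range(len(entropies) - span_len + 1)
    )
    chosen: list[tuple[int, int]] = []
    for _, start in windows:
        if len(chosen) == n_spans:
            break
        end = start + span_len
        # Keep the window only if it does not overlap an earlier pick.
        if all(end <= s or start >= e for s, e in chosen):
            chosen.append((start, end))
    return sorted(chosen)


def apply_mask(
    tokens: list[str], spans: list[tuple[int, int]], marker: str = "[...]"
) -> list[str]:
    """Replace each chosen span with a single marker token."""
    out: list[str] = []
    i = 0
    for s, e in spans:
        out.extend(tokens[i:s])
        out.append(marker)
        i = e
    out.extend(tokens[i:])
    return out
```

Masking the most predictable windows keeps the fill-in part easy, so any failure is more likely to show up in the verbatim regions around it.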


oso@osoleve

@kalomaze This is interesting because it seems to indicate that "token mover" and "induction" heads are... Not.


oso@osoleve

@kalomaze I wish we could see total and active params of the big boys because this smells intrinsically tied to model capacity (and tokenizer)


kalomaze@kalomaze

@osoleve also idk if people would have any concerns about me using specifically one model (gemma4 pretrain). or if i should be using ideally like. a mix of small pretrains & filtering by consensus agreement? that could be a way to make the task design more universal possibly.
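
The consensus idea could be sketched as a simple vote, assuming each small pretrained model has already proposed a set of token positions to truncate. The function name and the `min_votes` parameter are hypothetical.

```python
from collections import Counter


def consensus_positions(
    per_model_positions: list[set[int]], min_votes: int
) -> list[int]:
    """Keep only positions proposed for truncation by at least min_votes models."""
    votes = Counter(
        pos for positions in per_model_positions for pos in positions
    )
    return sorted(pos for pos, n in votes.items() if n >= min_votes)
```

Setting `min_votes` to the number of models gives strict agreement; a lower value trades universality for more candidate spans.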


oso@osoleve

@kalomaze Have you looked at the partial cases? My intuition says if it gets more than ~half it gets them all, like I wouldn't expect to see near misses
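
Checking the partial cases could be as simple as counting how many spans each output preserved, then looking at the distribution. This is a sketch that assumes exact substring matching and a shared list of required spans; the function names are hypothetical.

```python
from collections import Counter


def preserved_count(output: str, required_spans: list[str]) -> int:
    """Number of required spans that appear verbatim in the output."""
    return sum(1 for span in required_spans if span in output)


def near_miss_histogram(
    outputs: list[str], required_spans: list[str]
) -> Counter:
    """Map k -> number of outputs that preserved exactly k spans."""
    return Counter(preserved_count(out, required_spans) for out in outputs)
```

If near misses are rare, the histogram should put little mass just below the total span count.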


oso@osoleve

@kalomaze Very cool shit man, great idea for an eval


kalomaze@kalomaze

@osoleve approximately doing it is in fact not the same thing as exactly doing it for precise IF (instruction following). i think if a journalist said [sic] many times and corrected the typos for ~5% of the quotes anyways, then that would be bad journalism. this is most painful on the deepseeks apparently


pratyush@pratty_agi

@kalomaze Claude doesn’t get diagrams right in the terminal


oso@osoleve

@kalomaze I think tokenizer+pretrain is going to have an outsized effect, yeah


oso@osoleve

I wonder if you could construct really long garden path sentences. I've only ever seen short ones but you should in theory be able to produce sentences that can't be correctly parsed until they're completely parsed and also of arbitrary length. My initial thought of expanding clauses in existing ones wouldn't work because adding specifiers collapses the paths.
