rapidly interleaving between exact content preservation of the original text and mad-libs-style fill-in-the-middle is hard for frontier models, especially when the source material has typos, even when you instruct the models to *specifically preserve* all typos faithfully
Frontier Models Struggle With Exact Text Preservation During Mad Libs Fill-In Tasks
Users praise the Mad Libs fill-in task as a great evaluation idea because it highlights frontier models' struggles with exact text preservation.
the rapid interleaving surfaces this way harder than the "subtle local edit made at exactly one place" case. there are far more state transitions between "copy earlier context" and "freestyle generate what you figure was originally here", creating more room for at least one boundary fudge

@osoleve i want to try a variant where i specifically measure this with local SWA (say, the last 32 tokens) vs global attention, and only truncate the spans that become close to exactly determinable/unimodal distributionally with earlier context, and extremely undeterminable with SWA only.
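A rough sketch of the selection rule in that variant, assuming you already have two per-token entropy lists for the same document: one from a full-context pass and one from a pass restricted to a short local window. The function name, the two thresholds, and the entropy values are all hypothetical, and how the local-window pass is actually run is left out.

```python
def context_dependent_positions(h_global, h_local, low=0.1, high=2.0):
    """Return token positions that are near-deterministic with full context
    (entropy <= low) but highly uncertain with only a local window
    (entropy >= high). Both thresholds are illustrative guesses."""
    if len(h_global) != len(h_local):
        raise ValueError("entropy lists must cover the same tokens")
    return [
        i
        for i, (g, l) in enumerate(zip(h_global, h_local))
        if g <= low and l >= high
    ]


# Made-up entropies (in nats) for a four-token toy document.
h_global = [0.05, 0.05, 1.50, 0.02]
h_local = [3.00, 0.10, 3.00, 2.50]
candidates = context_dependent_positions(h_global, h_local)
```

Positions that pass this filter are the ones that can only be recovered by looking far back in the context, so they would be the candidates for truncation under this variant.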

@osoleve my metric here is an all-or-nothing metric btw. if there are 14 spans that must remain verbatim in the output and only 13/14 are present, that's a zero score. it's designed on purpose to be harsh on fudging *at least one boundary* of many in a doc
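A minimal sketch of what an all-or-nothing score like this could look like. It is not the actual scoring code from the thread: the function names are made up, and requiring the spans to appear in document order is an added assumption on top of "present verbatim".

```python
def preserved_flags(output: str, spans: list[str]) -> list[bool]:
    """For each required span, report whether it appears verbatim in the
    output, searching left to right so spans must keep their order
    (the ordering requirement is an assumption)."""
    flags = []
    cursor = 0
    for span in spans:
        idx = output.find(span, cursor)
        if idx == -1:
            flags.append(False)
        else:
            flags.append(True)
            cursor = idx + len(span)
    return flags


def all_or_nothing(output: str, spans: list[str]) -> int:
    """1 only if every span survived verbatim, otherwise 0."""
    return int(all(preserved_flags(output, spans)))


def preserved_fraction(output: str, spans: list[str]) -> float:
    """Partial-credit view, useful for inspecting near misses."""
    flags = preserved_flags(output, spans)
    return sum(flags) / len(flags) if flags else 1.0


# Toy document with deliberate typos in the spans to preserve.
spans = ["teh quick ", " jumps ovr the "]
faithful = "teh quick brown fox jumps ovr the lazy dog"
fudged = "the quick brown fox jumps ovr the lazy dog"  # typo silently fixed
```

With these toy strings, `faithful` scores 1 and `fudged` scores 0 even though only one of its two spans was altered, which is the harshness described above.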

@osoleve depends heavily on the structure of the doc and the model's priors, as far as i can tell. i've seen cases where it fudges a single region consistently despite the preservation instruction, and cases where the fudge region is highly stochastic

@osoleve for more context here is what a doc looks like. the instruction says to preserve the original regions exactly incl typos, and ONLY generate guesses for the [...] absent regions. i use gemma4 12b's NTP entropy as a proxy for where to truncate (most predictable subsequences)
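A small sketch of entropy-guided truncation under stated assumptions: per-token next-token entropies are taken as precomputed inputs (no model call is shown), spans have a fixed length, and the lowest-mean-entropy windows are picked greedily. None of this is confirmed as the pipeline used in the thread, and all function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_mask_spans(entropies, span_len, n_spans):
    """Greedily pick up to n_spans non-overlapping windows of span_len
    tokens with the lowest mean entropy (the most predictable runs)."""
    windows = []
    for start in range(len(entropies) - span_len + 1):
        mean_h = sum(entropies[start:start + span_len]) / span_len
        windows.append((mean_h, start))
    windows.sort()
    chosen = []
    for _, start in windows:
        if len(chosen) >= n_spans:
            break
        # Keep the window only if it does not overlap an earlier pick.
        if all(start + span_len <= s or s + span_len <= start for s in chosen):
            chosen.append(start)
    return sorted((s, s + span_len) for s in chosen)


def mask_tokens(tokens, spans, placeholder="[...]"):
    """Replace each (start, end) span with a single placeholder token."""
    out, i = [], 0
    for start, end in spans:
        out.extend(tokens[i:start])
        out.append(placeholder)
        i = end
    out.extend(tokens[i:])
    return out


# Made-up entropies for a seven-token toy document.
tokens = list("abcdefg")
entropies = [2.0, 0.1, 0.1, 3.0, 0.2, 0.2, 2.5]
spans = pick_mask_spans(entropies, span_len=2, n_spans=2)
masked = mask_tokens(tokens, spans)
```

Everything outside the chosen windows is what the model is asked to reproduce verbatim, and the `[...]` placeholders are the only places it is supposed to guess.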

@kalomaze This is interesting because it seems to indicate that "token mover" and "induction" heads are... Not.

@kalomaze I wish we could see total and active params of the big boys because this smells intrinsically tied to model capacity (and tokenizer)

@osoleve also idk if people would have any concerns about me using specifically one model (gemma4 pretrain). or if i should be using ideally like. a mix of small pretrains & filtering by consensus agreement? that could be a way to make the task design more universal possibly.

@kalomaze Have you looked at the partial cases? My intuition says if it gets more than ~half it gets them all, like I wouldn't expect to see near misses

@kalomaze Very cool shit man, great idea for an eval

@osoleve approximately doing it is in fact not the same thing as exactly doing it for precise IF. i think if a journalist said [sic] many times and corrected the typos for ~5% of the quotes anyways, then that would be bad journalism. this is most painful on the deepseeks apparently

@kalomaze Claude doesn’t get diagrams right in the terminal

@kalomaze I think tokenizer+pretrain is going to have an outsized effect, yeah

I wonder if you could construct really long garden path sentences. I've only ever seen short ones but you should in theory be able to produce sentences that can't be correctly parsed until they're completely parsed and also of arbitrary length. My initial thought of expanding clauses in existing ones wouldn't work because adding specifiers collapses the paths.