
Stony Brook's Tuhin Chakrabarty finds a prize-winning Granta story contains 1,236 phrases copied from online fanfiction

An n-gram tool mapped the phrases to fanfiction sites.

Original post

LLMs are not conscious. They do not have a perfect sense of embodiment. They are autoregressive models that generate text by sampling, more or less, from a very large pile of things other people wrote. More details in this essay on Substack 👇 https://tuhinchakrabarty.substack.com/p/ai-slop-grantagate-and-bad-writing

7:22 AM · May 22, 2026

@alexolegimas @TuhinChakr Built on an ai2 project led by @liujc1998 !!!

Alex Imas @alexolegimas

This from @TuhinChakr is brilliant. That prize winning story from Granta? Turns out it's just a bunch of random whole phrases taken directly from existing text on the internet. Tool allows you to trace those n-grams directly to their source, which is mostly random fanfiction. https://tuhinchakrabarty.substack.com/p/ai-slop-grantagate-and-bad-writing

8:09 PM · May 22, 2026 · 245.1K Views
12:45 PM · May 24, 2026 · 617 Views

@TuhinChakr @alexolegimas @liujc1998 I know just saying for fun!

Tuhin Chakrabarty @TuhinChakr

@natolambert @alexolegimas @liujc1998 @natolambert fyi i am not stealing any credit. I have already attributed it to infinigram in the substack as well as mentioned creativity index and olmo trace several times on X after :)

12:53 PM · May 24, 2026 · 22 Views
1:00 PM · May 24, 2026 · 14 Views

Paging Alan Sokal.

12:59 AM · May 23, 2026 · 38.5K Views

This is a reasonable take. One can at best make statistical claims on such n-gram analysis.

Max Spero @max_spero_

For those who don't know, infini-gram is a really cool N-gram search engine that works impressively fast over massive datasets. Just because there is an N-gram match doesn't necessarily mean an LLM "plagiarized" from the given work, but there is a reasonable chance that the given document was in the pretraining set of the LLM and influenced the weights towards producing that N-gram. What is most interesting to me are actually the 115 N-grams found nowhere else on the internet. Maybe that's some sign that it's from the prompt or context. Or maybe even just a token getting randomly sampled. I'd love to see some more comparisons on human text as well. Maybe there is a major difference here in N-gram similarity for human and AI text, but we won't know until we try it!

4:04 AM · May 23, 2026 · 18.7K Views
3:58 PM · May 23, 2026 · 2.6K Views
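To make the caveat above concrete, here is a minimal sketch of n-gram tracing. It is not infini-gram's implementation: it uses a plain dictionary index over a tiny made-up corpus, and the document names (`fanfic_1`, `blog_2`), the sample sentences, and the helper names are all hypothetical.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(corpus, n):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for gram in ngrams(text.lower().split(), n):
            index[gram].add(doc_id)
    return index


def trace(text, index, n):
    """Split the n-grams of text into those found in the index and those not."""
    matched, unmatched = {}, []
    for gram in ngrams(text.lower().split(), n):
        if gram in index:
            matched[gram] = sorted(index[gram])
        else:
            unmatched.append(gram)
    return matched, unmatched


# Toy corpus: invented sentences standing in for crawled web pages.
corpus = {
    "fanfic_1": "she smiled a smile that did not reach her eyes",
    "blog_2": "the rain did not reach the valley that year",
}

index = build_index(corpus, 3)
matched, unmatched = trace("his smile did not reach her eyes", index, 3)

for gram, docs in matched.items():
    print(" ".join(gram), "->", docs)
print("unmatched:", [" ".join(g) for g in unmatched])
```

In this toy run, "did not reach" turns up in both documents, which illustrates why a match alone supports only a statistical claim about where a phrase came from. The unmatched n-grams play the role of the phrases found nowhere in the index.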

Since this post has blown up.

1) The research is based on two papers

https://arxiv.org/pdf/2410.04265 https://arxiv.org/pdf/2504.07096

2) When writing about the matches, I focused on webpages that are not defunct. Fan fiction results were especially relevant to AI fiction, but some phrases can appear on other websites too. That does not change the point about genre mismatch or stitching rare expressions

3) The attribution engine is built on Common Crawl data that LLMs have been trained on, so it might not catch every webpage that contains a given expression

12:02 AM · May 23, 2026 · 4.4K Views

A reader told me

cane and forgetting ->

cane (as in sugarcane from Trinidad)
cane -> rum
rum -> drinking
drinking -> forgetting

🤯🤯🤯

Marzena Karpinska @mar_kar_

One thing from @TuhinChakr's post hits very close to home: people tend to #rationalize (bc we don't know better) and see things that are not there. We saw it already in GPT-2 generated stories -- we *expect* things to *mean* something, so we tend to see things that are not there...

5:33 PM · May 23, 2026 · 2.7K Views
5:48 PM · May 23, 2026 · 832 Views

@alexolegimas Thank you 🥹

8:15 PM · May 22, 2026 · 8.4K Views

@natolambert @alexolegimas @liujc1998 @natolambert fyi i am not stealing any credit. I have already attributed it to infinigram in the substack as well as mentioned creativity index and olmo trace several times on X after :)

Nathan Lambert @natolambert

@alexolegimas @TuhinChakr Built on an ai2 project led by @liujc1998 !!!

12:45 PM · May 24, 2026 · 617 Views
12:53 PM · May 24, 2026 · 22 Views
