Stony Brook's Tuhin Chakrabarty finds a prize-winning Granta story contains 1,236 phrases copied from online fanfiction

REPLY

@alexolegimas @TuhinChakr Built on an ai2 project led by @liujc1998 !!!

Alex Imas@alexolegimas

This from @TuhinChakr is brilliant. That prize winning story from Granta? Turns out it's just a bunch of random whole phrases taken directly from existing text on the internet. Tool allows you to trace those n-grams directly to their source, which is mostly random fanfiction. https://tuhinchakrabarty.substack.com/p/ai-slop-grantagate-and-bad-writing

8:09 PM · May 22, 2026 · 245.1K Views

12:45 PM · May 24, 2026 · 617 Views

REPLY

#64Nathan Lambert@NATOLAMBERT

@TuhinChakr @alexolegimas @liujc1998 I know just saying for fun!

Tuhin Chakrabarty@TuhinChakr

@natolambert @alexolegimas @liujc1998 @natolambert fyi i am not stealing any credit. I have already attributed it to infinigram in the substack as well as mentioned creativity index and olmo trace several times on X after :)

12:53 PM · May 24, 2026 · 22 Views

1:00 PM · May 24, 2026 · 14 Views

QUOTE POST

#100Marc Andreessen 🇺🇸@PMARCA

Paging Alan Sokal.

Alex Imas@alexolegimas

This from @TuhinChakr is brilliant. That prize winning story from Granta? Turns out it's just a bunch of random whole phrases taken directly from existing text on the internet. Tool allows you to trace those n-grams directly to their source, which is mostly random fanfiction. https://tuhinchakrabarty.substack.com/p/ai-slop-grantagate-and-bad-writing

8:09 PM · May 22, 2026 · 245.1K Views

12:59 AM · May 23, 2026 · 38.5K Views

QUOTE POST

#570Chenhao Tan@CHENHAOTAN

This is a reasonable take. One can at best make statistical claims on such n-gram analysis.

Max Spero@max_spero_

For those who don't know, infini-gram is a really cool N-gram search engine that works impressively fast over massive datasets. Just because there is an N-gram match doesn't necessarily mean an LLM "plagiarized" from the given work, but there is a reasonable chance that the given document was in the pretraining set of the LLM and influenced the weights towards producing that N-gram. What is most interesting to me are actually the 115 N-grams found nowhere else on the internet. Maybe that's some sign that it's from the prompt or context. Or maybe even just a token getting randomly sampled. I'd love to see some more comparisons on human text as well. Maybe there is a major difference here in N-gram similarity for human and AI text, but we won't know until we try it!

4:04 AM · May 23, 2026 · 18.7K Views

3:58 PM · May 23, 2026 · 2.6K Views
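
To make the n-gram discussion above concrete, here is a minimal Python sketch of verbatim n-gram tracing. It is not infini-gram's API or implementation: the function names (`word_ngrams`, `build_index`, `trace`), the whitespace tokenizer, the in-memory dictionary index, and the tiny two-document corpus are all assumptions made for illustration. It only shows the idea of splitting a text into n-grams that match some source document and n-grams that match nothing in the corpus.

```python
from collections import defaultdict


def word_ngrams(text, n):
    """Return the word n-grams of text as tuples (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(corpus, n):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in corpus.items():
        for gram in word_ngrams(doc, n):
            index[gram].add(doc_id)
    return index


def trace(text, corpus, n=4):
    """Split the n-grams of text into those found in the corpus and those not found."""
    index = build_index(corpus, n)
    matched, unmatched = {}, []
    for gram in word_ngrams(text, n):
        phrase = " ".join(gram)
        if gram in index:
            matched[phrase] = sorted(index[gram])
        else:
            unmatched.append(phrase)
    return matched, unmatched


# Hypothetical toy data, invented for this example only.
corpus = {
    "fanfic_a": "she let out a breath she did not know she was holding",
    "blog_b": "the rain fell softly on the tin roof all night",
}
story = "he let out a breath she did not expect"

matched, unmatched = trace(story, corpus, n=4)
for phrase, sources in matched.items():
    print(f"{phrase!r} found in {sources}")
print("not found anywhere in the corpus:", unmatched)
```

As the post above notes, a match in a sketch like this only says the phrase exists in the indexed corpus. It does not show that a model copied from that particular document, and a miss only means the phrase is absent from whatever was indexed.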

QUOTE POST

#1050Tuhin Chakrabarty@TUHINCHAKR

Since this post has blown up.

1) The research is based on two papers

https://arxiv.org/pdf/2410.04265 https://arxiv.org/pdf/2504.07096

2) When writing about the matches, I focused on webpages that are not defunct. Fan fiction results were especially relevant to AI fiction, but some phrases can appear on other websites too. That does not change the point about genre mismatch or stitching rare expressions.

3) The attribution engine is built using CommonCrawl, which LLMs have been trained on. So it might not catch every webpage that contains a given expression.

Alex Imas@alexolegimas

This from @TuhinChakr is brilliant. That prize winning story from Granta? Turns out it's just a bunch of random whole phrases taken directly from existing text on the internet. Tool allows you to trace those n-grams directly to their source, which is mostly random fanfiction. https://tuhinchakrabarty.substack.com/p/ai-slop-grantagate-and-bad-writing

8:09 PM · May 22, 2026 · 245.1K Views

12:02 AM · May 23, 2026 · 4.4K Views

QUOTE POST

#1050Tuhin Chakrabarty@TUHINCHAKR

A reader told me

cane and forgetting ->

cane (as in sugarcane from Trinidad)

cane -> rum

rum -> drinking

drinking -> forgetting

🤯🤯🤯

Marzena Karpinska@mar_kar_

One thing from @TuhinChakr's post hits very close to home: people tend to #rationalize (bc we don't know better) and see things that are not there. We saw it already in GPT-2 generated stories -- we *expect* things to *mean* something so we tend to see things that are not there...

5:33 PM · May 23, 2026 · 2.7K Views

5:48 PM · May 23, 2026 · 832 Views

REPLY

#1050Tuhin Chakrabarty@TUHINCHAKR

@alexolegimas Thank you 🥹

Alex Imas@alexolegimas

This from @TuhinChakr is brilliant. That prize winning story from Granta? Turns out it's just a bunch of random whole phrases taken directly from existing text on the internet. Tool allows you to trace those n-grams directly to their source, which is mostly random fanfiction. https://tuhinchakrabarty.substack.com/p/ai-slop-grantagate-and-bad-writing

8:09 PM · May 22, 2026 · 245.1K Views

8:15 PM · May 22, 2026 · 8.4K Views

REPLY

#1050Tuhin Chakrabarty@TUHINCHAKR

@natolambert @alexolegimas @liujc1998 @natolambert fyi i am not stealing any credit. I have already attributed it to infinigram in the substack as well as mentioned creativity index and olmo trace several times on X after :)

Nathan Lambert@natolambert

@alexolegimas @TuhinChakr Built on an ai2 project led by @liujc1998 !!!

12:45 PM · May 24, 2026 · 617 Views

12:53 PM · May 24, 2026 · 22 Views

POST

#1777Alex Imas@ALEXOLEGIMAS

This from @TuhinChakr is brilliant. That prize winning story from Granta? Turns out it's just a bunch of random whole phrases taken directly from existing text on the internet. Tool allows you to trace those n-grams directly to their source, which is mostly random fanfiction.

https://tuhinchakrabarty.substack.com/p/ai-slop-grantagate-and-bad-writing

8:09 PM · May 22, 2026 · 245.1K Views
