
First, the false positive rate differs wildly depending on how the text was created. The 1-in-10,000 figure holds only under the most ideal, sterile conditions. Real-life scenarios and mixed text have far worse reliability.
Many users criticized Pangram and similar AI text detectors for high false positive rates that cause unwarranted damage to individuals and waste developer effort while dismissing the marketing claims as unrealistic.

The witch-hunting and call-outs are doing unwarranted damage, and arguably they benefit the product's publicity.

@akhmxt Wait it’s paywalled?

The model is trained and validated on in-house datasets of pre-LLM-era human text against "lab-grown" AI text.

There are research papers showing that human spoken language patterns are taking on AI characteristics, even after filtering out scripted works. Naturally we mimic the culture we're exposed to and adapt our language. Training and validation on pre-LLM human text doesn't account for that.

@torchcompiled Should make the piece free / remove paywall for wider distribution

Failure rates are conditional on the genre of text. And then a big one: the reported failure rate is a population average over a heterogeneous and imbalanced population. Some writers pay the worst-case cost while others pay much less; the FPR is just an average.
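
To make the averaging point concrete, here is a minimal sketch with made-up numbers. The genre names and counts are hypothetical and are not Pangram data; the only point is that a small overall rate can coexist with a much larger rate for a minority group.

```python
# Hypothetical numbers, chosen only to illustrate the averaging effect.
# Each entry: (human-written documents checked, documents falsely flagged as AI)
groups = {
    "genre_a": (90_000, 5),
    "genre_b": (9_000, 45),
    "genre_c": (1_000, 50),
}


def false_positive_rate(n_human, n_flagged):
    """Share of human-written documents wrongly flagged as AI."""
    return n_flagged / n_human


# Population-average FPR: pool every document regardless of genre.
total_docs = sum(n for n, _ in groups.values())
total_flagged = sum(f for _, f in groups.values())
overall_fpr = false_positive_rate(total_docs, total_flagged)

# Per-genre FPR: what writers in each group actually experience.
per_group_fpr = {name: false_positive_rate(n, f) for name, (n, f) in groups.items()}
worst_group = max(per_group_fpr, key=per_group_fpr.get)

print(f"overall FPR: {overall_fpr:.4%}")
for name, fpr in per_group_fpr.items():
    print(f"{name}: {fpr:.4%}")
```

In this toy setup the smallest group sees a false positive rate 50 times the headline average, because the largest and easiest group dominates the pooled number.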

The classifier output is an inference: given that we see text with XYZ patterns, what is the probability that it came from an LLM vs. a human?
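
One way to read that question is through Bayes' rule. The sketch below is an illustration, not how any particular detector computes its score, and the rates and priors are hypothetical. It shows that the same "flagged" result supports very different conclusions depending on how common AI text is in the pool being checked.

```python
def prob_ai_given_flag(prior_ai, true_positive_rate, false_positive_rate):
    """Bayes' rule: P(text is AI | detector flagged it).

    prior_ai            -- assumed share of AI text in the pool being checked
    true_positive_rate  -- P(flagged | AI)
    false_positive_rate -- P(flagged | human)
    """
    flagged_and_ai = prior_ai * true_positive_rate
    flagged_and_human = (1 - prior_ai) * false_positive_rate
    return flagged_and_ai / (flagged_and_ai + flagged_and_human)


# Hypothetical detector rates, held fixed while the prior varies.
tpr, fpr = 0.99, 0.01
for prior in (0.5, 0.1, 0.01):
    posterior = prob_ai_given_flag(prior, tpr, fpr)
    print(f"prior={prior:.2f} -> P(AI | flagged)={posterior:.3f}")
```

With these made-up rates, a flag is close to conclusive when half the pool is AI text, but only about a coin flip when AI text makes up 1% of the pool.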

The Taylor Lorenz Case

Model updates preserve similar or better false positive rates on average, but don't reveal how individual decisions change. There's a risk that something scans as AI on Monday but gets flagged as human on Friday, and this can be the difference between a case and a nothingburger.
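
A toy illustration of that gap between aggregate and individual behavior, using invented decisions for two hypothetical model versions on the same ten human-written texts:

```python
# Hypothetical decisions on the same 10 human-written texts.
# True means "flagged as AI", which is a false positive here
# because every text in this toy set is human-written.
monday = [True, False, False, False, False, False, False, False, False, False]
friday = [False, False, False, True, False, False, False, False, False, False]


def false_positive_rate(flags):
    """Fraction of human-written texts flagged as AI."""
    return sum(flags) / len(flags)


# Indices of texts whose verdict changed between the two versions.
flipped = [i for i, (a, b) in enumerate(zip(monday, friday)) if a != b]

print("Monday FPR:", false_positive_rate(monday))
print("Friday FPR:", false_positive_rate(friday))
print("texts whose verdict changed:", flipped)
```

Both versions report the same aggregate FPR, yet two individual verdicts changed. An average-only report would call this update a no-op.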

https://open.substack.com/pub/ethansmith2000/p/ai-text-detection-arms-dealers-in?r=jsutr&utm_medium=ios

The evidence tab suffers from confirmation bias and from the multiple-hypothesis problem (when testing many things, one is likely to come back positive by chance).
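
The multiple-testing part can be quantified with a short sketch. It assumes the checks are independent and share the same per-check false positive rate. That is a simplification, and the numbers are hypothetical.

```python
def prob_at_least_one_false_hit(per_check_fpr, n_checks):
    """Chance that at least one of n independent checks fires by accident.

    Assumes every check has the same false positive rate and that the
    checks are independent of each other (a simplifying assumption).
    """
    return 1 - (1 - per_check_fpr) ** n_checks


# Hypothetical per-check false positive rate of 1%.
for n in (1, 10, 100):
    p = prob_at_least_one_false_hit(0.01, n)
    print(f"{n:>3} checks -> P(at least one false hit) = {p:.3f}")
```

Under these assumptions a 1% per-check error rate grows to roughly a 63% chance of at least one spurious hit once 100 passages or signals are examined.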

Studies of the metric on external datasets (APT, Grammarly, and BEEMO) show that mixed text can end up basically anywhere on the scale from AI to human. So a person who did a light AI polish or edit pass can easily be flagged as either human or fully AI.

The majority of benchmarks are run on internal datasets: a validation set that matches the qualities of the training set. That is a better indicator of "did we avoid memorizing" than of "does this extrapolate to in-the-wild usage". External audits often follow the same pattern.

For mixed-authorship/AI-assistance detection, most benchmarks not only use their own crafted datasets but also create their own labels, because there is really no ground truth for how much "AI-ness" a text has. Human evals show high variance and disagreement here.

Ironically, a Pangram blog post incidentally reinforces this idea without saying it.

A paper by Garland reminds us that in this kind of classification, population-average FPRs over a whole validation set don't recognize that some cases are more challenging than others, and that some people are worse off than others when it comes to false positives.

We often use AI text detection to infer things like effort and the writing process, but:
- An AI-created article, with the AI doing all the ideation, that is then rewritten by a human might go under the radar.
- A light AI edit pass on human writing would likely trigger the detector.

Error reporting may have a survivorship bias

The "skin in the game" asymmetry