
First, the false positive rate differs wildly depending on how the text was created. The 1-in-10,000 metric holds under the most ideal, sterile conditions. Real-life scenarios and mixed text have far worse reliability.
Many users criticized the Pangram AI Text Detector for high false-positive rates that cause real harm and mocked its accuracy claims as unrealistic marketing that enables misuse like unwarranted witch-hunting.

The witch-hunting and call-outs are doing unwarranted damage, and arguably this benefits the product's publicity.

@akhmxt Wait it’s paywalled?

The model is trained and validated on in-house datasets of human text from the pre-LLM era against "lab-grown AI" text.

@torchcompiled Should make the piece free / remove paywall for wider distribution

The Taylor Lorenz Case

Failure rates are conditional on the genre of text. And a big one: the reported failure rate is a population average over a heterogeneous, imbalanced population. Some writers pay the worst-case cost while others pay far less; the FPR is just an average.
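
A minimal sketch of the averaging point, with made-up genre names and counts (none of these numbers come from Pangram or from any benchmark). It only shows how a pooled 1-in-10,000 figure can coexist with a much higher rate for one small group:

```python
# Hypothetical counts of human-written documents and how many were
# wrongly flagged as AI. Illustrative only, not measured data.
genres = {
    "genre_a": {"docs": 9000, "false_positives": 0},
    "genre_b": {"docs": 900, "false_positives": 0},
    "genre_c": {"docs": 100, "false_positives": 1},
}


def pooled_fpr(groups):
    """False positive rate over the whole population, ignoring genre."""
    total_fp = sum(g["false_positives"] for g in groups.values())
    total_docs = sum(g["docs"] for g in groups.values())
    return total_fp / total_docs


def per_group_fpr(groups):
    """False positive rate computed separately for each genre."""
    return {
        name: g["false_positives"] / g["docs"] for name, g in groups.items()
    }


if __name__ == "__main__":
    print("pooled:", pooled_fpr(genres))
    for name, rate in per_group_fpr(genres).items():
        print(name, rate)
```

In this toy population the pooled rate is 1 in 10,000, while writers in `genre_c` face 1 in 100. The headline number is dominated by the large, easy groups.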

There are research papers reporting that human spoken-language patterns are taking on AI characteristics, even after filtering out scripted works. Naturally we mimic the culture we're exposed to and adapt our language. Training and validating on pre-LLM human text doesn't account for that.

The classifier output is an inference: given that we see text with XYZ patterns, what is the probability that it came from an LLM rather than a human?
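
One way to make that inference concrete is Bayes' rule. The sketch below is my own illustration, not how any detector computes its score: the function name, the prior, and the rates are all assumptions picked to show that the answer depends on how common AI text is in the pool being checked.

```python
def probability_ai_given_flag(prior_ai, true_positive_rate, false_positive_rate):
    """Bayes' rule: P(AI | flagged) from a prior and the two error rates.

    prior_ai            assumed share of AI text among everything checked
    true_positive_rate  P(flagged | AI)
    false_positive_rate P(flagged | human)
    """
    flagged_and_ai = true_positive_rate * prior_ai
    flagged_and_human = false_positive_rate * (1.0 - prior_ai)
    total_flagged = flagged_and_ai + flagged_and_human
    if total_flagged == 0:
        raise ValueError("nothing is ever flagged under these rates")
    return flagged_and_ai / total_flagged


if __name__ == "__main__":
    # Hypothetical inputs: same detection rate, two different false positive rates.
    print(probability_ai_given_flag(0.05, 0.99, 0.0001))
    print(probability_ai_given_flag(0.05, 0.99, 0.01))
```

With the same flag, the posterior drops noticeably when the false positive rate for that kind of text is closer to 1 in 100 than 1 in 10,000, which is why the scenario-specific rate matters more than the headline one.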

Model updates preserve similar or better false positive rates on average, but don't reveal how individual decisions change. There's a risk that something scans as AI on Monday but gets flagged as human on Friday, and this can be the difference between a case and a nothingburger.
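
A toy illustration of that risk, using invented document IDs, scores, and a 0.5 threshold (all assumptions, not real model outputs). Both "versions" flag the same number of human-written documents, yet they disagree on which ones:

```python
def flagged(scores, threshold=0.5):
    """Return the set of document IDs whose score meets the threshold."""
    return {doc_id for doc_id, score in scores.items() if score >= threshold}


def compare_versions(old_scores, new_scores, threshold=0.5):
    """Compare which documents two model versions flag as AI."""
    old_flags = flagged(old_scores, threshold)
    new_flags = flagged(new_scores, threshold)
    return {
        "old_flag_count": len(old_flags),
        "new_flag_count": len(new_flags),
        "newly_flagged": sorted(new_flags - old_flags),
        "no_longer_flagged": sorted(old_flags - new_flags),
    }


# Hypothetical scores for four human-written documents under two versions.
old_scores = {"doc1": 0.9, "doc2": 0.1, "doc3": 0.2, "doc4": 0.1}
new_scores = {"doc1": 0.2, "doc2": 0.1, "doc3": 0.8, "doc4": 0.1}

if __name__ == "__main__":
    print(compare_versions(old_scores, new_scores))
```

The aggregate false positive count is unchanged here, so a release note reporting only the average would look identical even though two individual verdicts reversed.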

The evidence tab suffers from confirmation bias and the multiple hypothesis testing problem (when you test many things, one is likely to come back positive by chance).
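
The multiple testing point can be put in numbers. This sketch assumes every check is independent and shares the same false positive rate, which is a simplification on my part, and the rates used are hypothetical:

```python
def chance_of_any_false_flag(false_positive_rate, num_checks):
    """Probability that at least one of num_checks independent checks
    on human-written text comes back as a false positive."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    if num_checks < 0:
        raise ValueError("num_checks must be non-negative")
    return 1.0 - (1.0 - false_positive_rate) ** num_checks


if __name__ == "__main__":
    # Hypothetical: the advertised rate versus a rougher real-world rate.
    for rate in (0.0001, 0.01):
        for checks in (1, 100, 1000):
            print(rate, checks, chance_of_any_false_flag(rate, checks))
```

Even at 1 in 10,000 per check, scanning a thousand pieces of human writing gives roughly a one-in-ten chance of at least one false accusation under these assumptions, and the chance climbs quickly if the per-check rate is worse.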

The studies of the metric on external datasets (APT, Grammarly, and BEEMO) show that a mixed text can end up basically anywhere on the scale from AI to human. So a person who did light AI polish or editing can easily be flagged as human or as fully AI.

For mixed-authorship/AI-assistance detection, most benchmarks not only use their own crafted datasets but also create the labels, because there is really no ground truth for how much "AI-ness" a text has. There is high variance and disagreement with human evals here.

The majority of benchmarks are run on internal datasets, where the validation set matches the qualities of the training set. That is basically a better indicator of "did we avoid memorizing" than of "does this extrapolate to in-the-wild usage". External audits often follow the same pattern.

Ironically, a Pangram blog post incidentally reinforces this idea without saying so.

A paper by Garland is a reminder that, in this kind of classification, population averages of FPR over a whole validation set don't recognize that some cases are more challenging than others, and that some people are worse off than others when it comes to false positives.

@akhmxt Doesn’t look like I even have paid subscriptions enabled weird

The papers on human speech resembling AI patterns

Many less official cases won't and can't be investigated but still do damage, and we're stuck with no falsifiability

A library of cases around classifier failure and admitted shortcomings