AI detectors fail because student writing is too varied to judge from a single document.
The problem is not only that AI writing is getting better, but that many real students write in ways that can look statistically close to AI output.
The paper frames this as a testing problem where the detector does not know each student’s normal writing style, so “human writing” is not one fixed target.
Because of that, any detector that catches many AI-written submissions must also wrongly accuse some real students, especially students whose writing is more structured, more formulaic, or shaped by learning English as an additional language.
The authors use basic statistics to show that this false-accusation problem is not just a bug in current tools, because it appears whenever student writing overlaps with AI writing.
A university is not comparing “AI text” with “human text”; it is comparing one submission with the unknown writing habits of one particular student.
Better detectors may reduce some errors, but they cannot erase the structural problem created by one-shot judgment.
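The overlap argument can be made concrete with a toy model. The sketch below is not the paper's framework: it assumes a hypothetical detector that gives each text one score, and it assumes those scores are normally distributed with made-up means for AI text and for two invented student groups. Under those assumptions it shows how the false-accusation rate rises as the detector is tuned to catch more AI text, and how it rises faster for the group whose scores sit closer to AI output.

```python
from statistics import NormalDist

# Hypothetical detector scores (higher = "looks more like AI").
# Every mean and spread here is an illustrative assumption, not a value from the paper.
AI_TEXT = NormalDist(mu=2.0, sigma=1.0)
STUDENT_GROUPS = {
    "free-form writers": NormalDist(mu=0.0, sigma=1.0),
    "structured or formulaic writers": NormalDist(mu=1.0, sigma=1.0),
}


def threshold_for(detection_rate, ai=AI_TEXT):
    """Score cutoff that flags the requested share of AI-written text."""
    return ai.inv_cdf(1.0 - detection_rate)


def false_accusation_rate(human, detection_rate, ai=AI_TEXT):
    """Share of genuine submissions from `human` that land above the cutoff."""
    return 1.0 - human.cdf(threshold_for(detection_rate, ai))


if __name__ == "__main__":
    for rate in (0.5, 0.7, 0.9):
        for name, dist in STUDENT_GROUPS.items():
            far = false_accusation_rate(dist, rate)
            print(f"catch {rate:.0%} of AI text -> flags {far:.1%} of {name}")
```

In this toy setup the only way to push the false-accusation rate to zero is to stop flagging anything, because every cutoff that catches AI text also sits inside the range of scores that real students produce.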
----
Paper Link – arxiv.org/abs/2603.20254
Paper Title: "AI Detectors Fail Diverse Student Populations: A Mathematical Framing of Structural Detection Limits"
