The Invisible Gavel: Why We Must Question AI Detectors in the Classroom

Praise Bickersteth

There is a quiet crisis unfolding in the halls of our universities. As generative artificial intelligence becomes a staple of the modern toolkit, a parallel industry has emerged promising to catch the machines at work. For many lecturers and academic boards, these AI detectors have been welcomed as a digital thin blue line, a way to maintain the sanctity of original thought. 

However, a growing body of rigorous evidence suggests that these tools are not the impartial judges we believe them to be. In fact, they are prone to errors that can derail a student’s career based on little more than a mathematical guess. We are witnessing the rise of a system where a program’s “confidence score” can outweigh a professor’s long-term knowledge of a student’s potential.

The core of the problem lies in how these detectors actually function. They do not understand language the way a person does, and they cannot know whether a human wrote a given sentence. Instead, they measure properties like perplexity and burstiness, essentially checking how predictable the word choices are and how much the sentence structure varies. While this sounds scientific, it creates a dangerous trap for certain groups of people. A seminal study led by James Zou at Stanford University, published in the journal Patterns, found a staggering bias against non-native English speakers. When the researchers ran essays from a TOEFL proficiency test through seven popular detectors, more than half were wrongly flagged as AI-generated. The logic is simple yet devastating: because non-native writers often use more standard, clear, and predictable language to ensure they are understood, the algorithms mistake their careful prose for the robotic output of a machine. This means that international students, who already face significant hurdles, are now being disproportionately targeted by flawed software.
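For readers who want to see what “perplexity” and “burstiness” look like in practice, here is a minimal sketch in Python. It is a toy illustration under loud assumptions, not how any commercial detector is built: real tools score text with large language models, while this sketch uses a simple word-frequency model with add-one smoothing, and it treats burstiness as the spread of sentence lengths. The function names and the reference text are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter
from statistics import pstdev


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break on ., ! and ? (an assumption)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def unigram_perplexity(text, reference_counts):
    """Perplexity of `text` under a toy word-frequency model.

    Lower values mean the words are more predictable given the
    reference counts. Add-one smoothing gives unseen words a small,
    non-zero probability; one extra vocabulary slot stands in for
    all unseen words.
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no word tokens")
    total = sum(reference_counts.values())
    vocab_size = len(reference_counts) + 1
    log_prob = 0.0
    for tok in tokens:
        p = (reference_counts[tok] + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


def burstiness(text):
    """Spread of sentence lengths in words (population std. deviation).

    A value near zero means every sentence is about the same length.
    """
    lengths = [len(tokenize(s)) for s in split_sentences(text)]
    return pstdev(lengths) if len(lengths) > 1 else 0.0


if __name__ == "__main__":
    # Hypothetical reference text standing in for "typical" language.
    reference = Counter(tokenize(
        "the cat sat on the mat the dog sat on the rug"
    ))
    plain = "The cat sat on the mat."
    unusual = "Quantum zebras juggle."
    print(unigram_perplexity(plain, reference))
    print(unigram_perplexity(unusual, reference))
    print(burstiness("Go. This sentence is much longer than the first."))
```

Even in this toy version, the trap is visible: prose built from common words in evenly sized sentences scores as more “predictable” than prose full of unusual words and uneven rhythm, regardless of who wrote it.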

This is not just a technical glitch; it is a matter of fundamental academic justice. When a student is flagged, the burden of proof often shifts unfairly onto them. In a standard disciplinary hearing, the school should have to prove guilt. Yet with AI detection, the student is often asked to prove a negative: to show they did not use a tool that leaves no physical trace or digital footprint. This puts students in an impossible position where they must defend their own cognitive process against a black-box algorithm. Research from the University of Maryland in 2025 further complicated this picture, revealing that even minimal AI polishing of a human draft, such as fixing a few commas or improving flow, can trigger a false positive rate as high as 75 percent. In a world where students use basic grammar checkers to refine their work, the line between original authorship and assisted writing has become so blurred that these detectors are effectively guessing.

Beyond the technical errors, there is a human cost to this reliance on automated policing. When a lecturer sees a high AI score, it immediately colors their perception of the student. It creates a “guilty until proven innocent” atmosphere that poisons the mentor-student relationship. If a student spends three weeks researching a paper only to be told that a piece of software thinks they cheated, the psychological impact is profound. It breeds resentment and fear, causing students to write in a cramped, overly complex style just to avoid being “too clear” for the detector. This is the opposite of what good writing should be. Instead of encouraging clarity, we are accidentally training students to write in erratic ways just to prove they are human.

For academics and lecturers, the temptation to rely on a percentage score is understandable. We are all overworked, underpaid, and facing a technological shift that feels overwhelming and fast. Grading hundreds of essays is exhausting, and a tool that claims to do the heavy lifting of spotting cheats is a seductive prospect. But we must remember that these tools offer probability, not proof. Even major providers like Turnitin have had to walk back some of their claims, acknowledging that their scores can be inaccurate and should never be used as the sole basis for failing a student or starting a disciplinary action. Relying on an algorithm to police integrity risks creating a culture of surveillance that stifles the very creativity and risk-taking we are supposed to nurture in higher education.

Furthermore, the rapid evolution of AI models means that detectors are always one step behind. As soon as a detection company updates its software, a new version of a chatbot is released that can mimic human “burstiness” even better. It is an expensive and futile arms race in which the only real losers are the students caught in the crossfire. If the software cannot keep up with the technology it is supposed to catch, then its role in a high-stakes environment like a university is highly questionable. We are using 2024 tools to judge 2026 problems, and the math simply does not add up for the students.

True academic integrity cannot be outsourced to a piece of code. It requires us to return to the basics of education. It requires us to look at the process, to engage with the early drafts, and to really know our students’ voices over the course of a semester. We should be moving toward assessment methods that are “AI-resistant” by design, such as oral exams, in-class reflections, or highly specific local case studies, rather than relying on a digital hall monitor that is prone to making mistakes. As we navigate this new era, we must be brave enough to admit that our “truth machines” are deeply flawed. To ignore the evidence of their unreliability is to fail the very students who trust us to be fair, objective, and human. We owe it to the next generation of scholars to be more than just users of software; we must be guardians of justice in the classroom.

Bickersteth writes from Lagos
