Can teachers spot AI writing? Penn researchers weigh in

The widespread use of AI has sparked a dilemma for teachers and professors. How do you tell the difference between assignments written by students and those written by a bot?

A University of Pennsylvania sign is seen in Philadelphia, Friday, Dec. 15, 2023. (AP Photo/Matt Rourke)


Across the country, educators are growing increasingly concerned about the impact of ChatGPT and other AI chatbots on students’ learning — and with good reason.

A recent survey by Inside Higher Ed found that a staggering 85% of college students have used generative AI for coursework in the past year — with a quarter using it to complete assignments for them, and 19% to write full essays.

“We have had some examples of students using ChatGPT, using Gemini, using things and then presenting that work as their own work,” said Amy Mount, the director of curriculum and instruction at Gateway Regional School District of New Jersey.


With so many AI platforms freely available, guarding against their use is difficult.

“Even if I block ChatGPT, they’ll go to Claude. If I block Claude, they’ll go to Gemini. If I go — right? There’s no way to block every single piece of generative AI,” said Mount.

The widespread use of AI has sparked a dilemma for teachers and professors — how do you tell the difference between assignments written by students and those written by a bot?

It’s a thorny question with no easy solutions.

“When I first started researching into it, and trying it out myself, there was somebody coming out with a way to keep kids from cheating with it,” said Kathleen Bially, a media specialist at Gateway Regional High School. “And then they realized that there were so many roadblocks that there is no way to ever know fully if a student is cheating with AI.”

That hasn’t stopped educators from trying to identify AI-produced work, whether through AI detectors, scanning essays for common hallmarks of AI writing, or simply trusting a gut feeling that something is off with an assignment. But how accurate are these methods? And are they reliable enough to accuse students of plagiarism, potentially leading to disciplinary action or even expulsion?

How accurate are AI detectors?

These are questions that Chris Callison-Burch, a professor of Computer and Information Science at the University of Pennsylvania and director of the school’s new online AI master’s program, has been exploring, along with his PhD student, Liam Dugan.

Last year, the duo released a study designed to investigate how accurate commercial AI detectors — which are themselves powered by machine learning — really are at identifying text produced by generative AI.

While many of the companies producing these detectors have boasted accuracy rates of 99%, Callison-Burch and Dugan had their doubts.

“They would only evaluate their detectors on their own generated data sets,” Dugan said. “So they didn’t have a publicly available set of machine-generated text that everyone was evaluating on.”

And there were other reasons to be suspicious.

“A lot of counter opinions were being voiced — like OpenAI said they had given up on their own internal efforts at being able to spot AI-generated text because the path was too hard,” Callison-Burch said. “It was too hard to train a system to spot AI-generated text.”

To assess the detectors’ true accuracy, Callison-Burch, Dugan, and their co-authors tested them using RAID (Robust AI Detector), a dataset they created that contains over 10 million documents from different sources, both human and AI. The texts span multiple genres and include the work of nearly a dozen large language models, along with “adversarial attacks,” or methods designed to fool AI detectors.

Their results?

“We found that the AI detectors were definitely not as good as they claimed, but that they were better than I had originally anticipated,” Dugan said.

The detectors performed fairly well when it came to text that was copied and pasted directly from large language models. But their accuracy fell significantly once a text had been edited by various means, like swapping out synonyms, changing the order of words, or inserting chunks of text written by humans.
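As a rough sketch of the kinds of light edits involved (the study’s actual adversarial attacks are more systematic; the word list below is invented purely for illustration):

```python
import random

# Tiny, made-up synonym table for illustration only
SYNONYMS = {"big": "large", "quick": "fast", "use": "employ"}

def synonym_swap(text):
    """Replace words with synonyms, leaving everything else untouched."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

def shuffle_clauses(text, seed=0):
    """Reorder comma-separated clauses without changing their content."""
    clauses = text.split(", ")
    random.Random(seed).shuffle(clauses)
    return ", ".join(clauses)

print(synonym_swap("a big model is quick to use"))
# "a large model is fast to employ"
```

The meaning barely changes, but the surface statistics a detector relies on can shift enough to change its verdict.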

The team also tested more advanced “adversarial settings,” including homoglyph attacks, in which letters are replaced with similar-looking characters from another alphabet.

“So you could replace an ‘a’ with a Cyrillic letter that looks similar. And that would totally tank the ability of the AI detectors to realize that it was AI written,” Callison-Burch said.
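The mechanics are simple. A minimal Python sketch (the mapping below covers just a few of the many lookalike pairs):

```python
# Map a few Latin letters to visually near-identical Cyrillic ones.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
}

def homoglyph_attack(text: str) -> str:
    """Swap selected Latin letters for lookalike Cyrillic characters."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

original = "a generated sentence"
attacked = homoglyph_attack(original)

print(attacked)              # renders almost identically in most fonts
print(original == attacked)  # False: the underlying codepoints differ
```

To a human reader the text looks unchanged, but to a detector tokenizing the raw characters, it is an entirely different string.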


However, this strategy may no longer be available to enterprising students. Dugan says that after the study was published, one of the detection companies it covered emailed to thank him for bringing the vulnerability to their attention, adding that they had already patched it.

But perhaps most troubling was the detectors’ vulnerability to false positives: mistakenly flagging human-written texts as AI-generated.

“The false positive rates of these detectors were actually quite high — upwards of, you know, 5-6%,” Dugan said. “And that if you corrected for that, and you set a much higher threshold — that is to say, we only output when we are 100% sure that it is AI — then the detectors get much, much worse as you try to lower that false positive rate.”

In other words, as they raised the threshold for proof, the detectors’ performance declined.
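A toy illustration of that trade-off (the scores below are invented, not the study’s data): flagging only at very high confidence eliminates false positives but lets most AI text through.

```python
# Hypothetical detector scores, where higher means "more likely AI-written"
human_scores = [0.10, 0.30, 0.55, 0.70]
ai_scores = [0.60, 0.75, 0.85, 0.95]

def rates(threshold):
    """False positive rate and true positive rate at a given threshold."""
    fpr = sum(s >= threshold for s in human_scores) / len(human_scores)
    tpr = sum(s >= threshold for s in ai_scores) / len(ai_scores)
    return fpr, tpr

print(rates(0.5))  # lenient: catches all the AI text, but flags half the humans
print(rates(0.9))  # strict: zero false positives, but misses most AI text
```

Moving the threshold can only trade one kind of error for the other; it cannot make both disappear.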

While 5-6% might not sound like a lot, Callison-Burch points out that any mistakes could have serious consequences.

“I think the practical effect of that is really important, right?” he said. “Like, 5% false positives means that in the class that I’m running, which has, this semester, 700 students in it, that would mean I would be falsely accusing 35 students of using AI, but in fact they didn’t. And so that’s clearly not a fair thing if you’re using it in that kind of high-stakes situation.”
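The arithmetic behind that estimate is straightforward:

```python
# Expected wrongful accusations = false positive rate x class size
false_positive_rate = 0.05
class_size = 700

print(round(false_positive_rate * class_size))  # 35
```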

Can humans tell the difference?

So if AI detectors are less than reliable, what about humans?

According to Dugan, the research isn’t encouraging.

“The majority of studies suggest that people are quite bad at AI detection on average,” he said.

In fact, multiple studies have found that humans perform barely better than chance at successfully identifying AI writing.

Dugan adds that his team’s research has shown that our ability to detect AI writing isn’t fixed and can be substantially improved through training.

For one study, published in 2023, they trained a group of several hundred students to improve their accuracy at spotting AI-generated writing through a game. Participants had to read through a text, and spot the place where it transitioned from human-written to machine-written. After each round, the game would reveal the correct answer, and provide feedback to help the participants improve.

Over many repetitions, the students’ accuracy increased, showing that humans are indeed capable of learning to spot AI. But there was one big caveat: they only improved when given an incentive, which in this case was extra credit based on performance.

Still, Callison-Burch says, there are limits.

“As these AI systems got bigger, as the number of parameters that fill out their neural networks grew, it became harder for people to be able to identify what was human-written versus what was machine-written.”

In other words, the more large language models advance, the better they get at covering their tracks — which is one more reason, Callison-Burch says, for professors to be careful about making accusations.

“Gut instinct is not enough to then make that declaration that they violated academic integrity,” he said. “So in those sorts of situations, in the high-stakes ones, I would not rely on people’s judgments. And I think that even the tools that we’ve developed to try to spot AI are also probably not reliable enough that I would fully trust those either.”
