
Weber‑Wulff et al. International Journal for Educational Integrity (2023) 19:26
https://doi.org/10.1007/s40979-023-00146-z

ORIGINAL ARTICLE  Open Access

Testing of detection tools for AI‑generated text

Debora Weber‑Wulff1, Alla Anohina‑Naumeca2, Sonja Bjelobaba3*, Tomáš Foltýnek4, Jean Guerrero‑Dib5, Olumide Popoola6, Petr Šigut4 and Lorna Waddington7

*Correspondence: [email protected]

1 University of Applied Sciences HTW, Berlin, Germany
2 Riga Technical University, Rīga, Latvia
3 Uppsala University, Uppsala, Sweden
4 Masaryk University, Brno, Czechia
5 Universidad de Monterrey, San Pedro Garza García, Mexico
6 Queen Mary University of London, London, UK
7 University of Leeds, Leeds, UK

Abstract
Recent advances in generative pre-trained transformer large language models have emphasised the potential risks of unfair use of artificial intelligence (AI) generated content in an academic environment and intensified efforts in searching for solutions to detect such content. The paper examines the general functionality of detection tools for AI-generated text and evaluates them based on accuracy and error type analysis. Specifically, the study seeks to answer research questions about whether existing detection tools can reliably differentiate between human-written text and ChatGPT-generated text, and whether machine translation and content obfuscation techniques affect the detection of AI-generated text. The research covers 12 publicly available tools and two commercial systems (Turnitin and PlagiarismCheck) that are widely used in the academic setting. The researchers conclude that the available detection tools are neither accurate nor reliable and have a main bias towards classifying the output as human-written rather than detecting AI-generated text. Furthermore, content obfuscation techniques significantly worsen the performance of tools. The study makes several significant contributions. First, it summarises up-to-date similar scientific and non-scientific efforts in the field. Second, it presents the result of one of the most comprehensive tests conducted so far, based on a rigorous research methodology, an original document set, and a broad coverage of tools. Third, it discusses the implications and drawbacks of using detection tools for AI-generated text in academic settings.

Keywords: Artificial intelligence, Generative pre-trained transformers, Machine-generated text, Detection of AI-generated text, Academic integrity, ChatGPT, AI detectors

Introduction
Higher education institutions (HEIs) play a fundamental role in society. They shape the
next generation of professionals through education and skill development, simultane-
ously providing hubs for research, innovation, collaboration with business, and civic


engagement. It is also in higher education that students form and further develop their
personal and professional ethics and values. Hence, it is crucial to uphold the integrity of
the assessments and diplomas provided in tertiary education.
The introduction of unauthorised content generation—“the production of academic
work, in whole or part, for academic credit, progression or award, whether or not
a payment or other favour is involved, using unapproved or undeclared human or
technological assistance” (Foltýnek et al. 2023)—into higher education contexts poses
potential threats to academic integrity. Academic integrity is understood as “compli-
ance with ethical and professional principles, standards and practices by individuals
or institutions in education, research and scholarship” (Tauginienė et al. 2018).
Recent advancements in artificial intelligence (AI), particularly in the area of the
generative pre-trained transformer (GPT) large language models (LLM), have led to
a range of publicly available online text generation tools. As these models are trained
on human-written texts, the content generated by these tools can be quite difficult to
distinguish from human-written content. They can thus be used to complete assess-
ment tasks at HEIs.
Despite the fact that unauthorised content generation created by humans, such as
contract cheating (Clarke & Lancaster 2006), has been a well-researched form of stu-
dent cheating for almost two decades now, HEIs were not prepared for such radical
improvements in automated tools that make unauthorised content generation so eas-
ily accessible for students and researchers. The availability of tools based on GPT-3
and newer LLMs, ChatGPT (OpenAI 2023a, b) in particular, as well as other types
of AI-based tools such as machine translation tools or image generators, have raised
many concerns about how to make sure that no academic performance deception
attempts have been made. The availability of ChatGPT has forced HEIs into action.
Unlike contract cheating, the use of AI tools is not automatically unethical. On the
contrary, as AI will permeate society and most professions in the near future, there is
a need to discuss with students the benefits and limitations of AI tools, provide them
with opportunities to expand their knowledge of such tools, and teach them how to
use AI ethically and transparently.
Nonetheless, some educational institutions have directly prohibited the use of
ChatGPT (Johnson 2023), and others have even blocked access from their university
networks (Elsen-Rooney 2023), although this is just a symbolic measure with vir-
tual private networks quite prevalent. Some conferences have explicitly prohibited
AI-generated content in conference submissions, including machine-learning con-
ferences (ICML 2023). More recently, Italy became the first country in the world to
ban the use of ChatGPT, although that decision has in the meantime been rescinded
(Schechner 2023). Restricting the use of AI-generated content has naturally led to
the desire for simple detection tools. Many free online tools that claim to be able to
detect AI-generated text are already available.
Some companies do urge caution against taking punitive measures based solely on the results provided by their tools for detecting AI-generated text. They
acknowledge the limitations of their tools, e.g. OpenAI explains that there are several
ways to deceive the tool (OpenAI 2023a, b, 8 May). Turnitin made a guide for teachers
on how they should approach the students whose work was flagged as AI-generated

(Turnitin 2023a, b, 16 March). Nevertheless, four different companies (GoWinston, 2023; Content at Scale 2023; Compilatio 2023; GPTZero 2023) claim to be the best on the market.
The aim of this paper is to examine the general functionality of tools for the detec-
tion of the use of ChatGPT in text production, assess the accuracy of the output pro-
vided by these tools, and their efficacy in the face of the use of obfuscation techniques
such as online paraphrasing tools, as well as the influence of machine translation tools on human-written text.
Specifically, the paper aims to answer the following research questions:

RQ1: Can detection tools for AI-generated text reliably detect human-written text?
RQ2: Can detection tools for AI-generated text reliably detect ChatGPT-generated
text?
RQ3: Does machine translation affect the detection of human-written text?
RQ4: Does manual editing or machine paraphrasing affect the detection of Chat-
GPT-generated text?
RQ5: How consistent are the results obtained by different detection tools for AI-gen-
erated text?

The next section briefly describes the concept and history of LLMs. It is followed
by a review of scientific and non-scientific related work and a detailed description of
the research methodology. After that, the results are presented in terms of accuracy,
error analysis, and usability issues. The paper ends with discussion points and conclusions.

Large language models


We understand LLMs as systems trained to predict the likelihood of a specific character,
word, or string (called a token) in a particular context (Bender et al. 2021). Such statistical
language models have been used since the 1980s (Rosenfeld 2000), amongst other things
for machine translation and automatic speech recognition. Efficient methods for the esti-
mation of word representations in multidimensional vector spaces (Mikolov et al. 2013),
together with the attention mechanism and transformer architecture (Vaswani et al. 2017)
made generating human-like text not only possible, but also computationally feasible.
ChatGPT is a Natural Language Processing system that is owned and developed by
OpenAI, a research and development company established in 2015. Based on the trans-
former architecture, OpenAI released the first version of GPT in June 2018. Within less
than a year, this version was replaced by a much improved GPT-2, and then in 2020 by
GPT-3 (Marr 2023). This version could generate coherent text within a given context.
This was in many ways a game-changer, as it is capable of creating responses that are
hard to distinguish from human-written text (Borji 2023; Brown et al. 2020). As 7% of
the training data is in languages other than English, GPT-3 can also perform multilin-
gually (Brown et al. 2020). In November 2022, ChatGPT was launched. It demonstrated
significant improvements in its capabilities, a user-friendly interface, and it was widely
reported in the general press. Within two months of its launch, it had over 100 million
subscribers and was labelled “the fastest growing consumer app ever” (Milmo 2023).

AI in education brings both challenges and opportunities. Authorised and properly acknowledged usage of AI tools, including LLMs, is not per se a form of mis-
conduct (Foltýnek et al. 2023). However, using AI tools in an educational context for
unauthorised content generation (Foltýnek et al. 2023) is a form of academic mis-
conduct (Tauginienė et al. 2018). Although LLMs have become known to the wider
public after the release of ChatGPT, there is no reason to assume that they have not
been used to create unauthorised and undeclared content even before that date. The
accessibility, quantity, and recent development of AI tools have led many educators
to demand technical solutions to help them distinguish between human-written and
AI-generated texts.
For more than two decades, educators have been using software tools in an attempt
to detect academic misconduct. This includes using search engines and text-matching
software in order to detect instances of potential plagiarism. Although such automated
detection can identify some plagiarism, previous research by Foltýnek et al. (2020) has
shown that text-matching software not only do not find all plagiarism, but further-
more will also mark non-plagiarised content as plagiarism, thus providing false positive
results. This is a worst-case scenario in academic settings, as an honest student can be
accused of misconduct. In order to avoid such a scenario, now, when the market has
responded with the introduction of dozens of tools for AI-generated text, it is important
to discuss whether these tools clearly distinguish between human-written and machine-
generated content.

Related work
The development of LLMs has led to an acceleration of different types of efforts in the
field of automatic detection of AI-generated text. Firstly, several researchers have studied
human abilities to detect machine-generated texts (e.g. Guo et al. 2023; Ippolito et al.
2020; Ma et al. 2023). Secondly, some attempts have been made to build benchmark text
corpora to detect AI-generated texts effectively; for example, Liyanage et al. (2022) have
offered synthetic and partial text substitution datasets for the academic domain. Thirdly,
many research works are focused on developing new or fine-tuning parameters of the
already pre-trained models of machine-generated text (e.g. Chakraborty et al. 2023; Dev-
lin et al. 2019).
These efforts provide a valuable contribution to improving the performance and capa-
bilities of detection tools for AI-generated text. In this section, the authors of the paper
mainly focus on studies that compare or test the existing detection tools that educators
can use to check the originality of students’ assignments. The related works examined
in the paper are summarised in Tables 1, 2, and 3. They are categorised as published
scientific publications, preprints and other publications. It is worth mentioning that
although there are many comparisons on the Internet made by individuals and organisa-
tions, Table 3 includes only those with the higher coverage of tools and/or at least partly
described methodology of experiments.
Some researchers have used known text-matching software to check if they are able
to find instances of plagiarism in the AI-generated text. Aydin and Karaarslan (2022)
tested the iThenticate system and have revealed that the tool has found matches with

Table 1 Related work: published scientific publications

Source | Detection tools used | Dataset | Evaluation metrics
Aydin & Karaarslan 2022 | 1 (iThenticate) | An article with three sections: the text written by the paper's authors, the ChatGPT-paraphrased abstract text of articles, the content generated by ChatGPT answering specific questions | N/A
Anderson et al. 2023 | 1 (GPT-2 Output Detector) | Two ChatGPT-generated essays and the same essays paraphrased by AI | N/A
Elkhatat et al. 2023 | 5 (OpenAI Text Classifier, Writer, Copyleaks, GPTZero, CrossPlag) | 15 ChatGPT 3.5 generated, 15 ChatGPT 4 generated and 5 human-written passages | Specificity, Sensitivity, Positive Predictive Value, Negative Predictive Value
Gao et al. 2022 | 2 (Plagiarismdetector.net, GPT-2 Output Detector) | 50 ChatGPT-generated scientific abstracts | AUROC

Table 2 Related work: preprints

Source | Detection tools used | Dataset | Evaluation metrics
Khalil & Er 2023 | 3 (iThenticate, Turnitin, ChatGPT) | 50 essays generated by ChatGPT on various topics (such as physics laws, data mining, global warming, driving schools, machine learning, etc.) | True positive, False negative
Wang et al. 2023 | 6 (GPT2-Detector, RoBERTa-QA, DetectGPT, GPTZero, Writer, OpenAI Text Classifier) | Q&A-GPT: 115 K pairs of human-generated answers (taken from Stack Overflow) and ChatGPT-generated answers (for the same topic) for 115 K questions; Code2Doc-GPT: 126 K samples from CodeSearchNet and GPT code descriptions for 6 programming languages; 226.5 K pairs of code samples, human- and ChatGPT-generated (APPS-GPT, CONCODE-GPT, Doc2Code-GPT); Wiki-GPT dataset: 25 K samples of human-generated and GPT-polished texts | AUC scores, False positive rate, False negative rate
Pegoraro et al. 2023 | 24 approaches and tools, among them online tools ZeroGPT, OpenAI Text Classifier, GPTZero, Hugging Face, Writefull, Copyleaks, Content at Scale, Originality.ai, Writer, Draft and Goal | 58,546 responses generated by humans and 72,966 responses generated by the ChatGPT model, resulting in 131,512 unique samples that address 24,322 distinct questions from various fields, including medicine, open-domain, and finance | True positive rate, True negative rate

other information sources both for ChatGPT-paraphrased text and -generated text.
They also found that ChatGPT does not produce original texts after paraphrasing, as the
match rates for paraphrased texts were very high in comparison to human-written and

Table 3 Related work: other publications

Source | Detection tools used | Dataset | Evaluation metrics
Gewirtz 2023 | 3 (GPT-2 Output Detector, Writer, Content at Scale) | 3 human-generated texts; 3 ChatGPT-generated texts | N/A
van Oijen 2023 | 7 (Content at Scale, Copyleaks, Corrector App, Crossplag, GPTZero, OpenAI, Writer) | 10 generated passages based on prompts (factual info, rewrites of existing text, fictional scenarios, advice, explanations at different levels, impersonation of a specified character, Dutch translation); 5 human-generated texts from different sources (Wikipedia, SURF, Alice in Wonderland, Reddit post) | Accuracy
Compilatio 2023 | 11 (Compilatio, Draft and Goal, GLTR, GPTZero, Content at Scale, DetectGPT, Crossplag, Kazan SEO, AI Text Classifier, Copyleaks, Writer AI Content Detector) | 50 human-written texts; 75 texts generated by ChatGPT and YouChat | Reliability (the number of correctly classified / the total number of text passages)
Demers 2023 | 16 (Originality AI, Writer, Copyleaks, Open AI Text Classifier, Crossplag, GPTZero, Sapling, Content At Scale, Zero GPT, GLTR, Hugging Face, Corrector, Writeful, Hive Moderation, Paraphrasing tool AI Content Detector, AI Writing Check) | Human writing sample; ChatGPT 4 writing sample; ChatGPT 4 writing sample with the additional prompt "beat detection" | N/A

ChatGPT-generated text passages. In the experiment of Gao et al. (2022), Plagiarismdetector.net recognized nearly all of the fifty scientific abstracts generated by ChatGPT as completely original.
Khalil and Er (Khalil and Er 2023) fed 50 ChatGPT-generated essays into two text-
matching software systems (25 essays to iThenticate and 25 essays to the Turnitin sys-
tem), although they are just different interfaces to the same engine. They found that 40
(80%) of them were considered to have a high level of originality, although they defined
this as a similarity score of 20% or less. Khalil and Er (Khalil and Er 2023) also attempted
to test the capabilities of ChatGPT to detect if the essays were generated by ChatGPT
and state an accuracy of 92%, as 46 essays were said to be cases of plagiarism. As of
May 2023, ChatGPT issues a warning in response to such questions, such as: "As an
AI language model, I cannot verify the specific source or origin of the paragraph you
provided."
The authors of this paper consider the study of Khalil and Er (Khalil and Er 2023) to be
problematic for two reasons. First, it is worth noting that the application of text-match-
ing software systems to the detection of LLM-generated text makes little sense because
of the stochastic nature of the word selection. Second, since an LLM will “hallucinate”,
that is, make up results, it cannot be asked whether it is the author of a text.
Several researchers focused on testing sets of free and/or paid detection tools for AI-
generated text. Wang et al. (2023) checked the performance of detection tools on both

natural language content and programming code and determined that “detecting Chat-
GPT-generated code is even more difficult than detecting natural language contents.”
They also state that tools often exhibit bias, as some of them have a tendency to predict
that content is ChatGPT generated (positive results), while others tend to predict that it
is human-written (negative results).
By testing fifty ChatGPT-generated paper abstracts on the GPT-2 Output detector,
Gao et al. (2022) concluded that the detector was able to make an excellent distinction
between original and generated abstracts because the majority of the original abstracts
were scored extremely low (corresponding to human-written content) while the detec-
tor found a high probability of AI-generated text in the majority (33 abstracts) of the
ChatGPT-generated abstracts with 17 abstracts scored below 50%.
Pegoraro et al. (2023) tested not only online detection tools for AI-generated text but
also many of the existing detection approaches and claimed that detection of the Chat-
GPT-generated text passages is still a very challenging task as the most effective online
detection tool can only achieve a success rate of less than 50%. They also concluded that
most of the analysed tools tend to classify any text as human-written.
Tests completed by van Oijen (2023) showed that the overall accuracy of tools in
detecting AI-generated text reached only 27.9%, and the best tool achieved a maxi-
mum of 50% accuracy, while the tools reached an accuracy of almost 83% in detecting
human-written content. The author concluded that detection tools for AI-generated text
are "no better than random classifiers" (van Oijen 2023). Moreover, the tests provided
some interesting findings; for example, the tools found it challenging to detect a piece of
human-written text that was rewritten by ChatGPT or a text passage that was written in
a specific style. Additionally, there was not a single attribution of a human-written text
to AI-generated text, that is, an absence of false positives.
Although Demers (2023) only provided results of testing without any further analysis,
their examination allows one to conclude that a text passage written by a human was
recognised as human-written by all tools, while ChatGPT-generated text had a mixed
evaluation with the tendency to be predicted as human-written (10 tools out of 16) that
increased even further for the ChatGPT writing sample with the additional prompt "beat
detection" (12 tools out of 16).
Elkhatat et al. (2023) revealed that detection tools were generally more successful in
identifying GPT-3.5-generated text than GPT-4-generated text and demonstrated incon-
sistencies (false positives and uncertain classifications) in detecting human-written text.
They also questioned the reliability of detection tools, especially in the context of investi-
gating academic integrity breaches in academic settings.
In the tests conducted by Compilatio, the detection tools for AI-generated text
detected human-written text with reliability in the range of 78–98% and AI-generated
text with reliability in the range of 56–88%. Gewirtz' (2023) results on testing three human-written and three Chat-
GPT-generated texts demonstrated that two of the selected detection tools for AI-gener-
ated text could reach only 50% accuracy and one an accuracy of 66%.
The effect of paraphrasing on the performance of detection tools for AI-generated text
has also been studied. For example, Anderson et al. (2023) concluded that paraphras-
ing has significantly lowered the detection capabilities of the GPT-2 Output Detector by
increasing the score for human-written content from 0.02% to 99.52% for the first essay

and from 61.96% to 99.98% for the second essay. Krishna et al. (2023) applied paraphras-
ing to the AI-generated texts and revealed that it significantly lowered the detection
accuracy of five detection tools for AI-generated text used in the experiments.
The results of the above-mentioned studies suggest that detecting AI-generated text
passages is still challenging for existent detection tools for AI-generated text, whereas
human-written texts are usually identified quite accurately (accuracy above 80%). How-
ever, the ability of tools to identify AI-generated text is under question as their accuracy
in many studies was only around 50% or slightly above. Depending on the tool, a bias
may be observed identifying a piece of text as either ChatGPT-generated or human-
written. In addition, tools have difficulty identifying the source of the text if ChatGPT
transforms human-written text or generates text in a particular style (e.g. a child’s expla-
nation). Furthermore, the performance of detection tools significantly decreases when
texts are deliberately modified by paraphrasing or re-writing. Detection of the AI-gener-
ated text remains challenging for existing detection tools, but detecting ChatGPT-gener-
ated code is even more difficult.
Existing research has several shortcomings:

• quite often experiments are carried out with a limited number of detection tools for
AI-generated text on a limited set of data;
• sometimes human-written texts are taken from publicly available websites or recog-
nised print sources, and thus could potentially have been previously used to train
LLMs and/or provide no guarantee that they were actually written by humans;
• the methodological aspects of the research are not always described in detail and are
thus not available for replication;
• testing whether the AI-generated and further translated text can influence the accu-
racy of the detection tools is not discussed at all;
• a limited number of measurable metrics is used to evaluate the performance of
detection tools, ignoring the qualitative analysis of results, for example, types of clas-
sification errors that can have significant consequences in an academic setting.

Methodology
Test cases
The focus of this research is determining the accuracy of tools which state that they are
able to detect AI-generated text. In order to do so, a number of situational parameters
were set up for creating the test cases for the following categories of English-language
documents:

• human-written;
• human-written in a non-English language with a subsequent AI/machine translation
to English;
• AI-generated text;
• AI-generated text with subsequent human manual edits;
• AI-generated text with subsequent AI/machine paraphrase.

For the first category (called 01-Hum), the specification was made that 10,000 charac-
ters (including spaces) were to be written at about the level of an undergraduate in the
field of the researcher writing the paper. These fields include academic integrity, civil
engineering, computer science, economics, history, linguistics, and literature. None of
the text may have been exposed to the Internet at any time or even sent as an attachment
to an email. This is crucial because any material that is on the Internet is potentially
included in the training data for an LLM.
For the second category (called 02-MT), around 10,000 characters (including spaces)
were written in Bosnian, Czech, German, Latvian, Slovak, Spanish, and Swedish. None
of these texts may have been exposed to the Internet before, as for 01-Hum. Depending on
the language, either the AI translation tool DeepL (3 cases) or Google Translate (6 cases)
was used to produce the test documents in English.
It was decided to use ChatGPT as the only AI-text generator for this investigation, as
it was the one with the largest media attention at the beginning of the research. Each
researcher generated two documents with the tool using different prompts (03-AI and
04-AI) with a minimum of 2000 characters each and recorded the prompts. The lan-
guage model from February 13, 2023 was used for all test cases.
Two additional texts of at least 2000 characters were generated using fresh prompts
for ChatGPT, then the output was manipulated. It was decided to use this type of test
case, as students will have a tendency to obfuscate results with the expressed purpose
of hiding their use of an AI-content generator. One set (05-ManEd) was edited manually
with a human exchanging some words with synonyms or reordering sentence parts and
the other (06-Para) was rewritten automatically with the AI-based tool Quillbot (Quill-
bot 2023), using the default values of the tool for modes (Standard) and synonym level.
Documentation of the obfuscation, highlighting the differences between the texts, can
be found in the Appendix.
With nine researchers preparing texts (the eight authors and one collaborator), 54 test
cases were thus available for which the ground truth is known.

AI‑generated text detection tool selection


A list of detection tools for AI-generated text was prepared using social media and
Google search. Overall, 18 tools were considered, out of which 6 were excluded: 2 were
not available, 2 were not online applications but Chrome extensions and thus out of the
scope of this research, 1 required payment, and 1 did not produce any quantifiable result.
The company Turnitin approached the research group and offered a login, noting that
they could only offer access from early April 2023. It was decided to test the system,
although it is not free, because it is so widely used and already widely discussed in aca-
demia. Another company, PlagiarismCheck, was also advertising that it had a detec-
tion tool for AI-generated text in addition to its text-matching detection system. It was
decided to ask them if they wanted to be part of the test as well, as the researchers did
not want to have only one paid system. They agreed and provided a login in early May.
We caution that their results may be different from the free tools used, as the companies
knew that the submitted documents were part of a test suite and they were able to use
the entire test document.

The following 14 detection tools were tested:

• Check For AI (https://checkforai.com)
• Compilatio (https://ai-detector.compilatio.net/)
• Content at Scale (https://contentatscale.ai/ai-content-detector/)
• Crossplag (https://crossplag.com/ai-content-detector/)
• DetectGPT (https://detectgpt.ericmitchell.ai/)
• Go Winston (https://gowinston.ai)
• GPT Zero (https://gptzero.me/)
• GPT-2 Output Detector Demo (https://openai-openai-detector.hf.space/)
• OpenAI Text Classifier (https://platform.openai.com/ai-text-classifier)
• PlagiarismCheck (https://plagiarismcheck.org/)
• Turnitin (https://demo-ai-writing-10.turnitin.com/home/)
• Writeful GPT Detector (https://x.writefull.com/gpt-detector)
• Writer (https://writer.com/ai-content-detector/)
• Zero GPT (https://www.zerogpt.com/)

Table 4 gives an overview of the minimum/maximum sizes of text that could be exam-
ined by the free tools at the time of testing, if known.
PlagiarismCheck and Turnitin are combined text similarity detectors and offer an
additional functionality of determining the probability the text was written by an AI, so
there was no limit on the amount of text tested. Signup was necessary for Check for
AI, Crossplag, Go Winston, GPT Zero, and OpenAI Text Classifier (a Google account
worked).

Data collection
The tests were run by the individual authors between March 7 and March 28, 2023.
Since Turnitin was not available until April, those tests were completed between April
14 and April 20, 2023. The testing of PlagiarismCheck was performed between May 2

Table 4 Minimum and maximum sizes for free tools


Tool name Minimum Size Maximum Size

Check for AI 350 characters 2500 characters


Compilatio 200 characters 2000 characters
Content at Scale 25 words 25000 characters
Crossplag Not stated 1000 words
DetectGPT 40 words 256 words
Go Winston 500 characters 2000 words
GPT Zero 250 characters 5000 characters
GPT-2 Output Detector Demo 50 tokens 510 tokens
OpenAI Text Classifier 1000 characters Not stated
Writeful GPT Detector 50 words 1000 words
Writer Not stated 1500 characters
Zero GPT Not stated Not stated

Table 5 Classification accuracy scales for human-written and AI-generated texts

Human-written (NEGATIVE) text (docs 01-Hum & 02-MT), and the tool says that it is written by a:
[100—80%) human True negative TN
[80—60%) human Partially true negative PTN
[60—40%) human Unclear UNC
[40—20%) human Partially false positive PFP
[20—0%] human False positive FP
AI-generated (POSITIVE) text (docs 03-AI, 04-AI, 05-ManEd & 06-Para), and the tool says it is written by
a:
[100—80%) human False negative FN
[80—60%) human Partially false negative PFN
[60—40%) human Unclear UNC
[40—20%) human Partially true positive PTP
[20—0%] human True positive TP
"[" or "]" means inclusive; "(" or ")" means exclusive

Table 6 Mapping of textual results to classification labels

Tool | Result | 01-Hum, 02-MT | 03-AI, 04-AI, 05-ManEd, 06-Para
Check for AI | "very low risk" | TN | FN
Check for AI | "low risk" | PTN | PFN
Check for AI | "medium risk" | UNC | UNC
Check for AI | "high risk" | PFP | PTP
Check for AI | "very high risk" | FP | TP
GPT Zero | "likely to be written entirely by human" | TN | FN
GPT Zero | "may include parts written by AI" | PFP | PTP
GPT Zero | "likely to be written entirely by AI" | FP | TP
OpenAI Text Classifier | "The classifier considers the text to be ... likely AI-generated." | FP | TP
OpenAI Text Classifier | "... possibly AI-generated." | PFP | PTP
OpenAI Text Classifier | "Unclear if it is AI-generated" | UNC | UNC
OpenAI Text Classifier | "... unlikely AI-generated." | PTN | PFN
OpenAI Text Classifier | "... very unlikely AI-generated." | TN | FN
DetectGPT | "very unlikely to be from GPT-2" | TN | FN
DetectGPT | "unlikely to be from GPT-2" | PTN | PFN
DetectGPT | "likely to be from GPT-2" | PFP | PTP
DetectGPT | "very likely from GPT-2" | FP | TP

and May 8, 2023. All the 54 test cases had been presented to each of the tools for a total
of 756 tests.

Evaluation methodology
For the evaluation, the authors were split into groups of two or three and tasked with
evaluating the results of the tests for the cases from either 01-Hum & 04-AI, 02-MT &
05-ManEd, or 03-AI & 06-Para. Since the tools do not provide an exact binary classifi-
cation, one five-step classification was used for the original texts (01-Hum & 02-MT)

and another one was used for the AI-generated texts (03-AI, 04-AI, 05-ManEd &
06-Para). They were based on the probabilities that were reported for texts being
human-written or AI-generated as specified in Table 5.
For four of the detection tools, the results were only given in the textual form (“very
low risk”, “likely AI-generated”, “very unlikely to be from GPT-2”, etc.) and these were
mapped to the classification labels as given in Table 6.
After all of the classifications were undertaken and disagreements ironed out, the
measures of accuracy, the false positive rate, and the false negative rate were calculated.
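To make the classification step concrete, the following minimal Python sketch (our illustration, not code used in the study; the thresholds follow Table 5, everything else is hypothetical) shows how a tool's reported probability of human authorship could be mapped onto the five-step labels.

def five_step_label(human_prob: float, doc_is_ai: bool) -> str:
    # Map a tool's reported probability (0-100) that a text is human-written
    # onto the five-step labels of Table 5. Upper bounds are inclusive,
    # lower bounds exclusive, except that the lowest band includes 0.
    if human_prob > 80:
        return "FN" if doc_is_ai else "TN"
    if human_prob > 60:
        return "PFN" if doc_is_ai else "PTN"
    if human_prob > 40:
        return "UNC"
    if human_prob > 20:
        return "PTP" if doc_is_ai else "PFP"
    return "TP" if doc_is_ai else "FP"

# Example: a detector reports 35% "human" for a document known to be AI-generated.
print(five_step_label(35, doc_is_ai=True))  # PTP (partially true positive)

The textual verdicts of Table 6 can be mapped to the same labels with an analogous lookup.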

Results
Having evaluated the classification outcomes of the tools as (partially) true/false posi-
tives/negatives, the researchers evaluated this classification on two criteria: accuracy and
error type. In general, classification systems are evaluated using accuracy, precision, and
recall. The research authors also conducted an error analysis since the educational con-
text means different types of error have different significance.

Accuracy
When no partial results are allowed, i.e. only TN, TP, FN, and FP are allowed, accuracy is
defined as a ratio of correctly classified cases to all cases

ACC = (TN + TP)/(TN + TP + FN + FP);

As our classification also contains partially correct and partially incorrect results (i.e.,
five classes instead of two), the basic commonly used formula has to be adjusted to
properly count these cases. There is no standard way in which this adjustment should be
done. Therefore, we use four different methods which we believe reflect different
approaches that educators may have when interpreting tools' outputs. The first (binary)

Table 7 Accuracy of the detection tools (binary approach)


Tool 01-Hum 02-MT 03-AI 04-AI 05-ManEd 06-Para Total Accuracy Rank

Check For AI 9 0 9 8 4 2 32 59% 6


Compilatio 8 9 8 8 5 2 40 74% 2
Content at Scale 9 9 0 0 0 0 18 33% 14
Crossplag 9 6 9 7 4 2 37 69% 4
DetectGPT 9 5 2 8 0 1 25 46% 11
Go Winston 7 7 9 8 4 1 36 67% 5
GPT Zero 6 3 7 7 3 3 29 54% 8
GPT-2 Output Detector Demo 9 7 9 8 5 1 39 72% 3
OpenAI Text Classifier 9 8 2 7 2 1 29 54% 8
PlagiarismCheck 7 5 3 3 1 2 21 39% 13
Turnitin 9 9 8 9 4 2 41 76% 1
Writeful GPT Detector 9 7 2 3 2 0 23 43% 12
Writer 9 7 4 4 2 1 27 50% 10
Zero GPT 9 5 7 8 2 1 32 59% 6
Average 94% 69% 63% 70% 30% 15%

approach is to consider partially correct classification as incorrect and calculate the accuracy as

ACC_bin = (TN + TP)/(TN + PTN + TP + PTP + FN + PFN + FP + PFP + UNC)

For the systems providing percentages of confidence, this method basically sets the
threshold of 80% (see Table 5). Table 7 shows the number of correctly classified docu-
ments, i.e. the sum of true positives and true negatives. The maximum for each cell is 9
(because there were 9 documents in each class), the overall maximum is 9 * 6 = 54. The
accuracy is calculated as a ratio of the total and the overall maximum. Note that even the
highest accuracy values are below 80%. The last row shows the average accuracy for each
document class, across all the tools.
This method provides a good overview of the number of cases in which the classifiers
are “sure” about the outcome. However, for real-life educational scenarios, partially cor-
rect classifications are also valuable. Especially in case 05-ManEd, which involved human
editing, the partially positive classification results make sense. Therefore, the researchers
explored more ways of assessment. These methods differ in the score awarded to various
incorrect outcomes.
In our second approach, we include partially correct evaluations and count them as
correct ones. The formula for accuracy computation is

ACC_bin_incl = (TN+PTN+TP+PTP)/(TN+PTN+TP+PTP+FN+PFN+FP+PFP+UNC)

In the case of systems providing percentages, this method basically sets the threshold at
60% (see Table 5). The results of this classification approach may be found in Table 8.
Obviously, all systems achieved higher accuracy, and the systems that provided more
partially correct results (GPT Zero, Check for AI) influenced the ranking order.
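As an illustration of these two accuracy variants, the short Python sketch below (ours, not from the study; the per-label counts are purely illustrative) computes ACC_bin and ACC_bin_incl from counts of the nine possible outcomes.

LABELS = ["TP", "PTP", "FP", "PFP", "TN", "PTN", "FN", "PFN", "UNC"]

def acc_bin(counts):
    # Binary accuracy: only fully correct classifications count.
    total = sum(counts.get(label, 0) for label in LABELS)
    return (counts.get("TN", 0) + counts.get("TP", 0)) / total

def acc_bin_incl(counts):
    # Binary inclusive accuracy: partially correct classifications also count.
    total = sum(counts.get(label, 0) for label in LABELS)
    correct = sum(counts.get(label, 0) for label in ("TN", "PTN", "TP", "PTP"))
    return correct / total

# Illustrative counts for one hypothetical tool over the 54 test documents.
counts = {"TP": 23, "PTP": 2, "TN": 17, "PTN": 1, "FN": 9, "PFN": 1, "UNC": 1}
print(round(acc_bin(counts), 2), round(acc_bin_incl(counts), 2))  # 0.74 0.8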
In our third approach, which we call semi-binary evaluation, the researchers distin-
guish partially correct classifications (PTN or PTP) both from the correct and incorrect
ones. The partially correct classifications were awarded 0.5 points, while entirely correct

Table 8 Accuracy of the detection tools (binary inclusive approach)


Tool 01-Hum 02-MT 03-AI 04-AI 05-ManEd 06-Para Total Accuracy Rank

Check For AI 9 7 9 8 4 3 40 74% 4


Compilatio 9 9 9 8 6 2 43 80% 2
Content at Scale 9 9 0 0 0 0 18 33% 14
Crossplag 9 6 9 7 5 2 38 70% 9
DetectGPT 9 8 9 8 4 2 40 74% 4
Go Winston 8 8 9 8 5 2 40 74% 4
GPT Zero 6 3 8 9 8 8 42 78% 3
GPT-2 Output Detector Demo 9 7 9 8 5 2 40 74% 4
OpenAI Text Classifier 9 9 5 8 5 2 38 70% 9
PlagiarismCheck 9 8 5 6 3 3 34 63% 12
TurnItIn 9 9 9 9 5 3 44 81% 1
Writeful GPT Detector 9 8 8 6 3 1 35 65% 11
Writer 9 7 5 6 4 2 33 61% 13
Zero GPT 9 8 7 8 4 4 40 74% 4

Table 9 Accuracy of the detection tools (semi-binary approach)


Tool 01-Hum 02-MT 03-AI 04-AI 05-ManEd 06-Para Total Accuracy Rank

Check For AI 9 3.5 9 8 4 2.5 36 67% 6


Compilatio 8.5 9 8.5 8 5.5 2 41.5 77% 2
Content at Scale 9 9 0 0 0 0 18 33% 14
Crossplag 9 6 9 7 4.5 2 37.5 69% 5
DetectGPT 9 6.5 5.5 8 2 1.5 32.5 60% 10
Go Winston 7.5 7.5 9 8 4.5 1.5 38 70% 4
GPT Zero 6 3 7.5 8 5.5 5.5 35.5 66% 8
GPT-2 Output Detector Demo 9 7 9 8 5 1.5 39.5 73% 3
OpenAI Text Classifier 9 8.5 3.5 7.5 3.5 1.5 33.5 62% 9
PlagiarismCheck 8 6.5 4 4.5 2 2.5 27.5 51% 13
Turnitin 9 9 8.5 9 4.5 2.5 42.5 79% 1
Writeful GPT Detector 9 7.5 5 4.5 2.5 0.5 29 54% 12
Writer 9 7 4.5 5 3 1.5 30 56% 11
Zero GPT 9 6.5 7 8 3 2.5 36 67% 6
Average 95% 77% 71% 74% 39% 22%

Table 10 Scores for logarithmic evaluation


Positive case Negative case Score

FN FP 1
PFN PFP 2
UNC UNC 4
PTP PTN 8
TP TN 16

classification (TN or TP) still gained 1.0 points as in the previous methods. The formula
for accuracy calculation is

ACC_semibin = (TN + TP + 0.5 * PTN + 0.5 * PTP) /
(TN + PTN + TP + PTP + FN + PFN + FP + PFP + UNC)

Table 9 shows the assessment results of the classifiers using semi-binary classification.
The values correspond to the number of correctly classified documents with partially
correct results awarded half a point (TP + TN + 0.5 * PTN + 0.5 * PTP). The maximum
value is again 9 for each cell and 54 for the total.
A semi-binary approach to accuracy calculation captures the notion of partially
correct classification but still does not distinguish between various forms of incor-
rect classification. We address this issue by employing a fourth, logarithmic approach
to accuracy calculation that awards 1 point to a completely incorrect classification
and doubles the score for each level of the classification that is closer to the correct
result. The scores for the particular classifier outputs are shown in Table 10 and
the overall scores of the classifiers are shown in Table 11. Note that the maximum
value for each cell is now 9 * 16 = 144 (and 54 * 16 = 864 overall). The accuracy, again, is calculated as a ratio

Table 11 Logarithmic approach to accuracy evaluation


Tool 01-Hum 02-MT 03-AI 04-AI 05-ManEd 06-Para Total Accuracy Rank

Check For AI 144 62 144 129 74 54 607 70% 7


Compilatio 136 144 136 132 91 40 679 79% 2
Content at Scale 144 144 23 24 17 18 370 43% 14
Crossplag 144 99 144 115 76 40 618 72% 6
DetectGPT 144 108 88 129 38 36 543 63% 10
Go Winston 124 124 144 130 79 45 646 75% 4
GPT Zero 102 60 121 128 89 89 589 68% 8
GPT-2 Output Detector Demo 144 114 144 129 84 35 650 75% 3
OpenAI Text Classifier 144 136 67 124 67 48 586 68% 9
PlagiarismCheck 128 108 76 82 50 53 497 58% 12
Turnitin 144 144 136 144 81 53 702 81% 1
Writeful GPT Detector 144 122 81 76 50 20 493 57% 13
Writer 144 117 83 84 53 35 516 60% 11
Zero GPT 144 108 120 132 65 54 623 72% 5
Average 96% 79% 75% 77% 45% 31%

Fig. 1 Overall accuracy for each tool calculated as an average of all approaches discussed

of the total score and the maximum possible score. This approach provides the most
detailed distinction among all varieties of (in)correctness.
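The two remaining scoring schemes can be sketched compactly as follows (our illustration; the point values follow the text above and Table 10, while the counts are hypothetical).

# Semi-binary scoring: partially correct labels earn half a point.
SEMIBIN_POINTS = {"TP": 1.0, "TN": 1.0, "PTP": 0.5, "PTN": 0.5}

# Logarithmic scoring (Table 10): 1 point for a completely wrong label,
# doubled for each step closer to the correct label, up to 16 points.
LOG_POINTS = {"FN": 1, "PFN": 2, "UNC": 4, "PTP": 8, "TP": 16,   # AI-generated documents
              "FP": 1, "PFP": 2, "PTN": 8, "TN": 16}             # human-written documents

def scaled_accuracy(counts, points, max_per_doc):
    # Total score over all documents divided by the maximum possible score.
    n_docs = sum(counts.values())
    score = sum(points.get(label, 0) * n for label, n in counts.items())
    return score / (n_docs * max_per_doc)

counts = {"TP": 23, "PTP": 2, "TN": 17, "PTN": 1, "FN": 9, "PFN": 1, "UNC": 1}  # illustrative
print(round(scaled_accuracy(counts, SEMIBIN_POINTS, 1.0), 2))   # semi-binary accuracy
print(round(scaled_accuracy(counts, LOG_POINTS, 16.0), 2))      # logarithmic accuracy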
As can be seen from Tables 7, 8, 9, and 11, the approach to accuracy evaluation
has almost no influence on the ranking of the classifiers. Figure 1 presents the overall
accuracy for each tool as the mean of all accuracy approaches used.
Turnitin received the highest score using all approaches to accuracy classification,
followed by Compilatio and GPT-2 Output Detector (again in all approaches). This is
particularly interesting because as the name suggests, GPT-2 Output Detector was
not trained to detect GPT-3.5 output. Crossplag and Go Winston were the only other
tools to achieve at least 70% accuracy.

Fig. 2 Overall accuracy for each document type (calculated as an average of all approaches discussed)

Variations in accuracy
As Fig. 2 above shows, the overall average accuracy figure is misleading, as it obscures
major variations in accuracy between document types. Further analysis reveals the
influence of machine translation, human editing, and machine paraphrasing on over-
all accuracy:

Influence of machine translation The overall accuracy for case 01-Hum (human-writ-
ten) was 96%. However, in the case of the documents written by humans in languages
other than English that were machine-translated to English (case 02-MT), the accuracy
dropped by 20%. Apparently, machine translation leaves some traces of AI in the output,
even if the original was purely human-written.

Influence of human manual editing Case 05-ManEd (machine-generated with subsequent human editing) generally received slightly over half the score (42%) compared to
cases 03-AI and 04-AI (machine-generated with no further modifications; 74%). This
reflects a typical scenario of student misconduct in cases where the use of AI is pro-
hibited. The student obtains a text written by an AI and then quickly goes through it
and makes some minor changes such as using synonyms to try to disguise unauthorised
content generation. This type of writing has been called patchwriting (Howard 1995).
Only ~ 50% accuracy of the classifiers shows that these cases, which are assumed to be
the most common ones, are almost undetectable by current tools.

Influence of machine paraphrase Probably the most surprising results are for case
06-Para (machine-generated with subsequent machine paraphrase). The use of AI to
transform AI-generated text results in text that the classifiers consider human-written.
The overall accuracy for this case was 26%, which means that most AI-generated texts
remain undetected when machine-paraphrased.

Fig. 3 Accuracy (logarithmic) for each document type by detection tool for AI-generated text

Consistency in tool results


With the notable exception of GPT Zero, all the tested tools followed the pattern of
higher accuracy when identifying human-written text than when identifying texts
generated or modified by AI or machine tools, as seen in Fig. 3. Therefore, their clas-
sification is (probably deliberately) biased towards humans rather than AI output.
This classification bias is preferable in academic contexts for the reasons discussed
below.

Precision
Another important indicator of a system's performance is precision, i.e. the ratio of true
positive cases to all positively classified cases. Precision indicates the probability that a
positive classification provided by the system is correct. For pure binary classifiers, the
precision is calculated as a ratio of true positives to all positively classified cases:

Precision = TP/(TP + FP)


Table 12 Overview of classification results and precision

Tool TP PTP FP PFP TN PTN FN PFN UNC Total Prec_incl Prec_excl
Check For AI 23 1 1 9 7 1 10 2 54 96% 100%
Compilatio 23 2 17 1 9 1 1 54 100% 100%
Content at Scale 18 12 13 11 54 – –
Crossplag 22 1 3 15 11 2 54 88% 88%
DetectGPT 11 12 14 3 7 6 1 54 100% 100%
Go Winston 22 2 14 2 4 3 7 54 100% 100%
GPT Zero 20 13 9 9 3 54 79% 100%
GPT-2 Output Detector Demo 23 1 2 16 10 1 1 54 92% 92%
OpenAI Text Classifier 12 8 17 1 2 4 10 54 100% 100%
PlagiarismCheck 9 8 12 5 1 10 9 54 100% 100%
Turnitin 23 3 18 4 3 3 54 100% 100%
Writeful GPT Detector 7 11 1 16 1 13 3 2 54 95% 100%
Writer 11 6 1 16 13 3 4 54 94% 92%
Zero GPT 18 5 14 3 3 11 54 100% 100%

In the case of partially true/false positives, the researchers had two options for how to deal
with them. The exclusive approach counts them as negatively classified (so the formula
does not change), whereas the inclusive approach counts them as positively classified:

Precision_incl = (TP + PTP)/(TP + PTP + FP + PFP)

Table 12 shows an overview of the classification results, i.e. all (partially) true/false
positives/negatives. Also, both inclusive and exclusive precision values are provided.
Precision is missing for Content at Scale because this system did not provide any posi-
tive classifications. The only system for which the inclusive precision is significantly different from the exclusive one is GPT Zero, which yielded the largest number of partially false positives.
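The two precision variants can be expressed directly from the same per-label counts (again an illustrative sketch, not code from the study; the example counts are hypothetical).

def precision_excl(counts):
    # Exclusive precision: partially positive labels count as negative classifications.
    denom = counts.get("TP", 0) + counts.get("FP", 0)
    return counts.get("TP", 0) / denom if denom else float("nan")

def precision_incl(counts):
    # Inclusive precision: partially positive labels count as positive classifications.
    positives = counts.get("TP", 0) + counts.get("PTP", 0)
    denom = positives + counts.get("FP", 0) + counts.get("PFP", 0)
    return positives / denom if denom else float("nan")

counts = {"TP": 23, "PTP": 1, "FP": 0, "PFP": 1}  # illustrative
print(precision_excl(counts), round(precision_incl(counts), 2))  # 1.0 0.96

Returning NaN when a tool makes no positive classifications mirrors the missing precision entries for Content at Scale above.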

Error analysis
In this section, the researchers quantify more indicators of tools’ performance, namely
two types of classification errors that might have significant consequences in educational
contexts: false positives leading to false accusations against a student and undetected
cases (students gaining an unfair advantage over others), i.e. the false negative rate, which is
closely related to recall.

False accusations: harm to individual students


If educators use one of the classifiers to detect student misconduct, there is a question
of what kind of output leads to a student being accused of unauthorised content
generation. The researchers believe that a typical educator would accuse a student if
the output of the classifier is positive or partially positive. Some teachers may also sus-
pect students of misconduct in unclear or partially negative cases, but the research
authors think that educators generally do not initiate disciplinary action in these cases.

Table 13 False positive (false accusation) ratio


Tool 01-Hum 02-MT Total FPR

Check For AI 0 1 1 5.6%


Compilatio 0 0 0 0.0%
Content at Scale 0 0 0 0.0%
Crossplag 0 3 3 16.7%
DetectGPT 0 0 0 0.0%
Go Winston 0 0 0 0.0%
GPT Zero 3 6 9 50.0%
GPT-2 Output Detector Demo 0 2 2 11.1%
OpenAI Text Classifier 0 0 0 0.0%
PlagiarismCheck 0 0 0 0.0%
Turnitin 0 0 0 0.0%
Writeful GPT Detector 0 1 1 5.6%
Writer 0 1 1 5.6%
Zero GPT 0 0 0 0.0%
Average 2.4% 11.1%

Fig. 4 False accusations for human-written documents

Fig. 5 False accusations for machine-translated documents

Therefore, for each tool, we also computed the likelihood of false accusation of a student
as a ratio of false positives and partially false positives to all negative cases, i.e.

FPR = (FP + PFP)/N_negative

Table 13 shows the number of cases in which the classification of a particular docu-
ment would lead to a false accusation. The table includes only documents 01-Hum and
02-MT, because the AI-generated documents are not relevant. The risk of false accu-
sations is zero for half of the tools, as can also be seen from Figs. 4 and 5. Six of the
fourteen tools tested generated false positives, with the risk increasing dramatically for
machine-translated texts. For GPT Zero, half of the positive classifications would be
false accusations, which makes this tool unsuitable for the academic environment.
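The false accusation ratio used here can be sketched as follows (illustrative code, not from the study; N_negative is the number of human-written documents, 18 in this test set, and the example counts are hypothetical).

def false_accusation_ratio(counts, n_negative=18):
    # FPR = (FP + PFP) / N_negative: the share of human-written documents
    # whose classification would trigger an accusation.
    return (counts.get("FP", 0) + counts.get("PFP", 0)) / n_negative

# Illustrative: 9 (partially) false positives among the 18 human-written documents.
print(false_accusation_ratio({"FP": 3, "PFP": 6}))  # 0.5, i.e. a 50% false accusation risk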

Table 14 Percentage of undetected cases


Tool 03-AI 04-AI 05-ManEd 06-Para Total FNR Recall

Check For AI 0 1 5 6 12 33.3% 66.7%


Compilatio 0 1 3 7 11 30.6% 69.4%
Content at Scale 9 9 9 9 36 100.0% 0.0%
Crossplag 0 2 4 7 13 36.1% 63.9%
DetectGPT 0 1 5 7 13 36.1% 63.9%
Go Winston 0 1 4 7 12 33.3% 66.7%
GPT Zero 1 0 1 1 3 8.3% 91.7%
GPT-2 Output Detector Demo 0 1 4 7 12 33.3% 66.7%
OpenAI Text Classifier 4 1 4 7 16 44.4% 55.6%
PlagiarismCheck 4 3 6 6 19 52.8% 47.2%
Turnitin 0 0 4 6 10 27.8% 72.2%
Writeful GPT Detector 1 3 6 8 18 50.0% 50.0%
Writer 4 3 5 7 19 52.8% 47.2%
Zero GPT 2 1 5 5 13 36.1% 63.9%
Average 19.8% 21.4% 51.6% 71.4%

Fig. 6 False negatives for AI-generated documents 03-AI

Undetected cases: undermining academic integrity


Another form of academic harm is undetected cases, i.e. AI-generated texts that remain
undetected. A student who used unauthorised content generation likely obtains an
unfair advantage over those who fulfilled the task with integrity. The actual victims of
this form of misconduct are the honest students that receive the same credits as the dis-
honest ones. The likelihood of an AI-generated document being undetected (false nega-
tive rate, FNR) is given in Table 14, which includes only positive cases (03-AI, 04-AI,
05-ManEd and 06-Para). The false negative rate is calculated as

FNR = (FN + PFN)/N_positive



Fig. 7 False negatives for AI-generated documents 04-AI

Fig. 8 False negatives for AI-generated documents 03-AI and 04-AI together

For the sake of completeness, Table 14 also contains recall (1 − FNR), which indicates
how many of the positive cases were correctly classified by the system.
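A corresponding sketch for the false negative rate and recall (illustrative; N_positive is the number of AI-generated documents, 36 in this test set, and the example counts are hypothetical):

def false_negative_rate(counts, n_positive=36):
    # FNR = (FN + PFN) / N_positive: AI-generated documents that stay undetected.
    return (counts.get("FN", 0) + counts.get("PFN", 0)) / n_positive

def recall(counts, n_positive=36):
    # Recall = 1 - FNR, as used in Table 14.
    return 1.0 - false_negative_rate(counts, n_positive)

# Illustrative: 12 of 36 AI-generated documents classified as (partially) human-written.
fn_counts = {"FN": 7, "PFN": 5}
print(round(false_negative_rate(fn_counts), 3), round(recall(fn_counts), 3))  # 0.333 0.667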
Figures 6, 7, and 8 above show that 13 out of the 14 tested tools produced false nega-
tives or partially false negatives for documents 03-AI and 04-AI; only Turnitin correctly
classified all documents in these classes. None of the tools could correctly classify all AI-
generated documents that undergo manual editing or machine paraphrasing.
As the document sets 03-AI and 04-AI were prepared using the same method, the
researchers expected the results would be the same. However, for some tools (OpenAI
Text Classifier and DetectGPT), the results were notably different. This could indicate
a mistake made in testing or in the interpretation of the results. Therefore, the researchers

Fig. 9 False negatives for manually edited documents

Fig. 10 False negatives for machine-paraphrased documents

double-checked all the results to avoid this kind of mistake. We also tried to upload
some documents again. We did obtain different values, but we found out that this was
due to inconsistency in the results of these tools and not due to our mistakes.
Content at Scale misclassified all of the positive cases; these results in combination
with the 100% correct classification of human-written documents indicate that the
tool is inherently biased towards human classification and thus completely useless.
Overall, approx. 20% of the AI-generated texts would likely be misattributed
to humans, meaning the risk of unfair advantage is significantly greater than that of
false accusation.
Figures 9 and 10 show an even greater risk of students gaining an unfair advantage
through the use of obfuscation strategies. At an overall level, for manually edited texts

Fig. 11 Compilatio’s NaN% reliability

Fig. 12 Turnitin’s similarity report shows up first, it is not clear that the “AI” is clickable

(case 05-ManEd) the ratio of undetected texts increases to approx. 50% and in the
case of machine-paraphrased texts (case 06-Para) rises even higher.

Usability issues
There were a few usability issues that cropped up during the testing that may be
attributable to the beta nature of the tools under investigation.
For example, the tool DetectGPT at some point stopped working and only replied
with the statement "Server error. We might just be overloaded. Try again in a few
minutes?". This issue occurred after the initial testing round and persisted until the
time of submission of this paper. Others would stall in an apparent infinite loop or
throw an error message and the test had to be repeated at a later time.
Writeful GPT Detector would not accept computer code. The tool apparently identified the code as not being in English, and it only accepted English texts.
Compilatio at one point returned “NaN% reliability” (See Fig. 11) for a ChatGPT-
generated text that included program code. “NaN” is computer jargon for “not a

number” and indicates that there were calculation issues such as division by zero or
number representation overflow. Since there was also a robot head returned, this was
evaluated as correctly identifying ChatGPT-generated text, but the non-numerical
percentage might confuse instructors using the tool.
The operation of a few of the tools was not immediately clear to some of the authors,
and the handling of results was sometimes not easy to document. For example, in
PlagiarismCheck the AI-detection button was not always presented on the screen, and
it would only show the last four tests done. Interestingly, Turnitin often returned high
similarity values for ChatGPT-generated text, especially for program code or program
output. This was distracting, as the similarity results were given first; the AI detection
could only be accessed by clicking on a number above the text "AI" that did not look
clickable, but was (see Fig. 12).

Discussion
Detection tools for AI-generated text do fail: they are neither accurate nor reliable (all
scored below 80% accuracy, and only five scored above 70%). In general, they have been found
to diagnose human-written documents as AI-generated (false positives) and often diag-
nose AI-generated texts as human-written (false negatives). Our findings are consistent
with previously published studies (Gao et al. 2022; Anderson et al. 2023; Elkhatat et al.
2023; Demers 2023; Gewirtz 2023; Krishna et al. 2023; Pegoraro et al. 2023; van Oijen
2023; Wang et al. 2023) and substantially differ from what some detection tools for AI-
generated text claim (Compilatio 2023; Crossplag.com 2023; GoWinston.ai 2023; Zero
GPT 2023). The detection tools exhibit a clear bias towards classifying the output as
human-written rather than detecting AI-generated content. Overall, approximately 20%
of AI-generated texts would likely be misattributed to humans.

Fig. 13 Writer’s suggestion to lower “detectable AI content”



The tools are not robust either: their performance worsens even further when
obfuscation techniques such as manual editing or machine paraphrasing are used, and
they are unable to cope with texts translated from other languages. Overall, approximately 50%
of AI-generated texts that undergo some obfuscation would likely be misattributed to
humans.
The results provided by the tools are not always easy to interpret for an average user.
Some of them provide statistical information to justify the classification, and others
highlight the text that is "likely" machine-generated. Some present values such as
"perplexity = 137.222" or "Burstiness Score: 17104.959" with many digits of precision
that do not generally help a user understand the results.
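To make such a number less opaque, the following is a minimal sketch of how a perplexity value for a passage can be computed, assuming the open GPT-2 model from the Hugging Face transformers library rather than any vendor's undisclosed detector model; lower perplexity means the text is more predictable to the model.

```python
# A minimal sketch of computing perplexity with an open model (GPT-2);
# commercial detectors use their own models and scales.
# Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # of each token given its preceding context.
        out = model(enc["input_ids"], labels=enc["input_ids"])
    return float(torch.exp(out.loss))

print(perplexity("The results of this study are summarised in the next section."))
```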
Some of the detection tools, such as Writer, are clearly aimed at helping users hide AI-
written text, providing suggestions such as "You should edit your text until there's less
detectable AI content." (see Fig. 13).
Detection tools for AI-generated text provide simple outputs with statements like
“This document was likely written by AI” or “11% likely this comes from GPT-3, GPT-4
or ChatGPT”, without any possibility of verification or evidence. Therefore, a student
accused of unauthorised content generation only on this basis would have no possibility
for a defence. The probability of false positives ranged from 0% (Turnitin) to 50%
(GPTZero). The probability of false negatives ranged from 8% (GPTZero) to 100% (Content
at Scale). The different types of failures may have serious implications: false positives
could lead to wrongful accusations against students, while false negatives allow students
to evade detection of unauthorised content generation, gaining unfair advantages and
promoting impunity. Our experience and personal communications indicate that there is a large
group of academics that believe in the output of the classifiers. The research results
show that users should be extremely cautious when interpreting the results.
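The percentages above come from counting classification outcomes against known ground truth. As a rough illustration of that bookkeeping, here is a minimal sketch with invented verdicts (it uses the plain binary view and ignores the "partially" categories used in the study):

```python
# A minimal sketch, with hypothetical data, of deriving accuracy and
# false-positive/false-negative rates from detector verdicts.
from dataclasses import dataclass

@dataclass
class Verdict:
    is_ai: bool        # ground truth: was the document AI-generated?
    flagged_ai: bool   # the detector's classification

# Hypothetical outcomes for ten documents
verdicts = [
    Verdict(True, True), Verdict(True, False), Verdict(True, True),
    Verdict(True, False), Verdict(True, True),
    Verdict(False, False), Verdict(False, False), Verdict(False, True),
    Verdict(False, False), Verdict(False, False),
]

tp = sum(v.is_ai and v.flagged_ai for v in verdicts)          # true positives
fn = sum(v.is_ai and not v.flagged_ai for v in verdicts)      # false negatives
tn = sum(not v.is_ai and not v.flagged_ai for v in verdicts)  # true negatives
fp = sum(not v.is_ai and v.flagged_ai for v in verdicts)      # false positives

accuracy = (tp + tn) / len(verdicts)
fp_rate = fp / (fp + tn)   # share of human-written texts wrongly flagged
fn_rate = fn / (fn + tp)   # share of AI-generated texts that slip through

print(f"ACC={accuracy:.0%}  FP rate={fp_rate:.0%}  FN rate={fn_rate:.0%}")
```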
It is noteworthy that using machine translation tools such as Google Translate or DeepL
can lead to a higher number of false positives, leaving L2 students (and researchers) at
risk of being falsely accused of unauthorised content generation when using machine
translation to translate their own texts.
As the tools do not provide any evidence, the likelihood that an educational institution
is able to prove this form of academic misconduct is extremely low. Reports provided by
detection tools for AI-generated text cannot be used as the only basis for reporting stu-
dents for cheating. They can give faculty a hint that some sort of misconduct may have
happened, but further dialogue and conversations with students should take place.
One of the tools that the researchers came across, GLTR (http://gltr.io/), does not
provide any classification, so it was decided to exclude it from testing. Nonetheless, it
highlights the words (tokens) based on how commonly they appear in a given context.
Interpretation of the output is up to the educator, but the authors find the
visualisation of this information very useful. The colour-coded predictability of indi-
vidual words does not necessarily mean that the text was generated by AI, but may
also mean that the text does not bring any innovation or added value, which might
be—in some situations—a relevant indicator of its quality.
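To give a sense of how such a visualisation can be produced, the following is a minimal sketch in the spirit of GLTR (not its actual implementation), assuming the open GPT-2 model: for each token it reports the rank of that token in the model's predicted distribution, where low ranks indicate highly predictable words.

```python
# A minimal GLTR-style sketch: rank each token of a text by how probable it
# was under GPT-2 given the preceding context (rank 0 = most probable token).
# Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_ranks(text: str):
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits          # shape: (1, seq_len, vocab_size)
    ranks = []
    for pos in range(1, ids.shape[1]):
        scores = logits[0, pos - 1]         # prediction for the token at `pos`
        actual = ids[0, pos]
        rank = int((scores > scores[actual]).sum())
        ranks.append((tokenizer.decode(int(actual)), rank))
    return ranks

for token, rank in token_ranks("The committee will discuss the proposal next week."):
    print(f"{token!r}: rank {rank}")
```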
As the detection tools for AI-generated text are not reliable, a prevention-focused
approach needs to be prioritised over a detection one. It is also paramount to
inform educators about this fact. The focus should instead be on preventive
pedagogical strategies on how to use generative AI tools ethically, including a
discussion about the benefits and limitations of such tools.
Defining, describing, and providing training on the differences between the ethical and
unethical use of AI tools will therefore be important for students, faculty, and staff. The
ENAI recommendations on the ethical use of Artificial Intelligence in Education may be
a good starting point (Foltýnek et al. 2023) for such discussions. It is also important to
encourage educators to rethink their assessment strategies and instruments to achieve
a design with features that reduce or even eliminate the possibility of enabling cheating.
Our study has some limitations. It focused only on English language texts. Even
though we had computer code, we did not test the performance of the systems specifi-
cally on that. There were also indications that the results from the tools can vary when
the same material is tested at a different time; we did not systematically examine the
replicability of the results provided by the tools. Nevertheless, we tentatively suggest that
this inconsistency can have major implications in misconduct investigations and thus
provides another strong reason against the use of these tools as a single source of an
accusation of misconduct. Our document set is also somewhat limited: we did not test
the kind of hybrid writing with iterative use of AI that may be likely to be more typical
of student use of generative AI. However, the poor performance of the tools across the
range of documents does not imply better performance for hybrid writing.

Conclusion and future work


This paper exposes serious limitations of the state-of-the-art AI-generated text detec-
tion tools and their unsuitability for use as evidence of academic misconduct. Our find-
ings do not confirm the claims presented by the systems. They too often present false
positives and false negatives. Moreover, it is too easy to game the systems by using para-
phrasing tools or machine translation. Therefore, our conclusion is that the systems we
tested should not be used in academic settings. Although text matching software also
suffers from false positives and false negatives (Foltýnek et al. 2020), at least it is possible
to provide evidence of potential misconduct. In the case of the detection tools for AI-
generated text, this is not the case.
Our findings strongly suggest that the “easy solution” for detection of AI-generated
text does not (and maybe even could not) exist. Therefore, rather than focusing on
detection strategies, educators continue to need to focus on preventive measures and
continue to rethink academic assessment strategies (see, for example, Bjelobaba 2020).
Written assessment should focus on the process of development of student skills rather
than the final product.
Future research in this area should test the performance of AI-generated text detec-
tion tools on texts produced with different (and multiple) levels of obfuscation e.g., the
use of machine paraphrasers, translators, patch writers, etc. Another line of research
might explore the detection of AI-generated text at a cohort level through its impact
on student learning (e.g. through assessment scores) and education systems (e.g. the
impact of generative AI on similarity scores). Research should also build on the known
issues with cloud-based text-matching software to explore the legal implications and
data privacy issues involved in uploading content to cloud-based (or institutional) AI
detection tools.

Appendix
Case studies 05‑ManEd
The following images show the generated texts on the left and the human-obfuscated
ones on the right. The identical text is coloured in the same colour on both sides, with
the changes popping out in white. The images were prepared using the similarity-texter.
As can be seen, some texts were rather heavily re-written, others only had a few words
exchanged.
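For readers who wish to reproduce this kind of comparison, the following minimal sketch (with invented example sentences, using Python's standard difflib rather than the similarity-texter itself) marks which word spans a generated text and its edited version share and which were changed:

```python
# A minimal sketch of a word-level comparison between a generated text and
# its manually edited version, using Python's standard difflib. This is an
# illustration only, not the similarity-texter tool used for the figures.
import difflib

generated = "Artificial intelligence has transformed many areas of modern education."
edited = "AI has transformed numerous areas of modern higher education."

a, b = generated.split(), edited.split()
matcher = difflib.SequenceMatcher(a=a, b=b)

for op, a0, a1, b0, b1 in matcher.get_opcodes():
    if op == "equal":
        print("SAME   :", " ".join(a[a0:a1]))
    else:
        print("CHANGED:", " ".join(a[a0:a1]) or "(nothing)",
              "->", " ".join(b[b0:b1]) or "(nothing)")
```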

Fig. 14 AIDT23-05-AAN

Fig. 15 AIDT23-05-DWW

Fig. 16 AIDT23-05-JGD

Fig. 17 AIDT23-05-JPK

Fig. 18 AIDT23-05-LLW

Fig. 19 AIDT23-05-OLU

Fig. 20 AIDT23-05-PTR

Fig. 21 AIDT23-05-SBB

Fig. 22 AIDT23-05-TFO

Case studies 06‑Para


These test cases were first generated with ChatGPT, then automatically re-written using
Quillbot with the default settings. The generated original is on the left, the re-written
version on the right.

Fig. 23 AIDT23-06-AAN

Fig. 24 AIDT23-06-DWW

Fig. 25 AIDT23-06-JGD

Fig. 26 AIDT23-06-JPK

Fig. 27 AIDT23-06-LLW

Fig. 28 AIDT23-06-OLU

Fig. 29 AIDT23-06-PTR

Fig. 30 AIDT23-06-SBB

Fig. 31 AIDT23-06-TFO

Abbreviations
01-Hum Human-written
02-MT Human-written in a non-English language with a subsequent AI/machine translation to English
03-AI AI-generated text
04-AI AI-generated text
05-ManEd AI-generated text with subsequent manual edits by a human
06-Para AI-generated text with subsequent AI/machine paraphrase

ACC Accuracy
ACC_bin Accuracy, binary approach
ACC_SEMIBIN Accuracy, semi-binary approach
AI Artificial intelligence
GPT Generative pre-trained transformer
FAS False accusation
FN False negative
FP False positive
HEIs Higher education institutions
LLM Large language models
NaN Not a number
PFN Partially false negative
PFP Partially false positive
PTP Partially true positive
PTN Partially true negative
TN True negative
TP True positive
UNC Unclear

Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1007/s40979-023-00146-z.

Additional file 1. Supplementary material: Raw data.

Acknowledgements
The authors wish to thank their colleague Július Kravjar from Slovakia who contributed a full set of test documents to the
investigation.
The authors also wish to thank their colleagues from Turkey, Salim Razı and Özgür Çelik, who participated in the initial
stages of the discussions about this research endeavour, but due to the devastating earthquake in February 2023 were
not able to contribute further.
The tool similarity-texter was created as part of the bachelor’s thesis of Sofia Kalaidopoulou and is based on Dick Grune’s
sim_text algorithm. It was submitted to the HTW Berlin in 2016 and is available under a Creative Commons BY-NC-SA 4.0
International License at https://people.f4.htw-berlin.de/~weberwu/simtexter/app.html.
ChatGPT was NOT used to tweak any portion of this publication.

Authors’ contributions
All authors created test data, ran the tests, collected data, discussed the statistical results, and contributed equally to the
text. TF and OP prepared the statistics for discussion.

Authors’ information
The authors are members of the European Network for Academic Integrity (ENAI) working group on Technology and
Academic Integrity. DWW is a plagiarism researcher and a retired professor of computer science from the HTW Berlin,
Germany. AAN is an associate professor at the Department of Artificial Intelligence and Systems Engineering of Riga
Technical University, Latvia. SB is a researcher in research integrity at Center for Research Ethics & Bioethics, at Uppsala
University, Sweden, and the Vice-president of ENAI. TF is an assistant professor at the Department of Machine Learning
and Data Processing at the Faculty of Informatics, Masaryk University, Czechia, and President of ENAI. JGD is a professor
of the School of Engineering from University of Monterrey, Mexico and oversees the efforts of its Center for Integrity and
Ethics. OP is an Education Developer specialising in assessment integrity at Queen Mary University of London, UK. PS is a
student of Computer Science at the Faculty of Informatics, Masaryk University, Czechia. LW is the University of Leeds, UK,
Academic Integrity Lead.

Funding
Open access funding provided by Uppsala University. The authors had no funding for this research other than from their
respective institutions.

Availability of data and materials


All data and testing materials are available at https://www.academicintegrity.eu/wp/technology-academic-integrity-working-group/.

Declarations
Competing interests
Two authors of this article, SB and TF, are involved in organising the European Conference on Ethics and Integrity in Aca‑
demia 2023, co-organised by the European Network for Academic Integrity. This conference receives sponsorship from
Turnitin and Compilatio. This did not influence the research presented in the paper in any phase.
Three of the authors, JGD, SB and TF are members of the editorial board of the International Journal for Educational
Integrity. They can thus not act as reviewers.
One author, TF, is guest editor for the special issue on Artificial Intelligence.

Received: 19 July 2023 Accepted: 19 October 2023

References
Anderson N, Belavy DL, Perle SM, Hendricks S, Hespanhol L, Verhagen E, Memon AR (2023) AI did not write this manu‑
script, or did it? Can we trick the AI text detector into generated texts? The potential future of ChatGPT and AI in
Sports & Exercise Medicine manuscript generation. BMJ Open Sport Exerc Med 9:e001568. https://​doi.​org/​10.​1136/​
bmjsem-​2023-​001568
Aydın Ö, Karaarslan E (2022) OpenAI ChatGPT Generated Literature Review: Digital Twin in Healthcare. In: Aydın Ö (ed)
Emerging Computer Technologies 2. İzmir Akademi Dernegi, pp 22–31
Bender EM, Gebru T, McMillan-Major A, Shmitchell S (2021) On the Dangers of Stochastic Parrots: Can Language Models
Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. ACM, New
York, pp 610–623. https://​doi.​org/​10.​1145/​34421​88.​34459​22
Bjelobaba S (2020) Academic Integrity Teacher Training: Preventive Pedagogical Practices on the Course Level. In: Khan Z,
Hill C, Foltýnek T (eds) Integrity in Education for Future Happiness. Mendel University in Brno, Brno, pp 9–18 (http://​
acade​micin​tegri​ty.​eu/​confe​rence/​proce​edings/​2020/​bjelo​baba.​pdf )
Borji A. (2023). A Categorical Archive of ChatGPT Failures. arXiv. https://​doi.​org/​10.​48550/​arXiv.​2302.​03494
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S,
Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler DM, Wu J, Winter C, Amodei D. (2020). Language
Models are Few-Shot Learners. arXiv. https://​doi.​org/​10.​48550/​arXiv.​2005.​14165
Chakraborty S, Bedi AS, Zhu S, An B, Manocha D, Huang F. (2023) On the Possibilities of AI-Generated Text Detection.
arXiv. https://​doi.​org/​10.​48550/​arXiv.​2304.​04736
Clarke R, Lancaster T. (2006). Eliminating the successor to plagiarism? Identifying the usage of contract cheating sites.
Proceedings of 2nd International Plagiarism Conference Newcastle, UK, 14
Compilatio (2023). Comparison of the best AI detectors in 2023 (ChatGPT, YouChat...). https://​www.​compi​latio.​net/​en/​
blog/​best-​ai-​detec​tors. Accessed 12 April 2023
Content at Scale (2023). How accurate is this for AI detection purposes? https://​conte​ntats​cale.​ai/​ai-​conte​nt-​detec​tor/.
Accessed 8 May 2023
Crossplag.com (2023). How accurate is the AI Detector? https://​cross​plag.​com/​ai-​conte​nt-​detec​tor/. Accessed 8 May
2023
Demers T. (2023). 16 of the best AI and ChatGPT content detectors compared. Search Engine Land. https://​searc​hengi​
neland.​com/​ai-​chatg​pt-​conte​nt-​detec​tors-​395957. Accessed May 9 2023
Devlin J, Chang MW, Lee K, Toutanova K. (2019). BERT: Pre-training of deep bidirectional transformers for language
understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Compu‑
tational Linguistics: Human Language Technologies, Vol. 1 (pp. 4171–4186). Minneapolis, Minnesota. Association
for Computational Linguistics
Elkhatat AM, Elsaid K, Almeer S (2023) Evaluating the efficacy of AI content detection tools in differentiating between
human and AI-generated text. Int J Educ Integrity 19:17. https://​doi.​org/​10.​1007/​s40979-​023-​00140-5. (19(1), 1-16)
Elsen-Rooney M. (2023). NYC education department blocks ChatGPT on school devices, networks. Chalkbeat New York.
https://​ny.​chalk​beat.​org/​2023/1/​3/​23537​987/​nyc-​schoo​ls-​ban-​chatg​pt-​writi​ng-​artif​i cial-​intel​ligen​ce. Accessed 14
June 2023
Foltýnek T, Dlabolová D, Anohina-Naumeca A, Razı S, Kravjar J, Kamzola L, Guerrero-Dib J, Çelik Ö, Weber-Wulff D (2020)
Testing of support tools for plagiarism detection. Int J Educ Technol High Educ 17(1):1–31. https://​doi.​org/​10.​1186/​
s41239-​020-​00192-4
Foltýnek T, Bjelobaba S, Glendinning I, Khan ZR, Santos R, Pavletic P, Kravjar J (2023) ENAI Recommendations on the ethi‑
cal use of Artificial Intelligence in Education. Int J Educ Integrity 19(1):1. https://​doi.​org/​10.​1007/​s40979-​023-​00133-4
Gao CA, Howard FM, Markov NS, Dyer EC, Ramesh S, Luo Y, Pearson AT. (2022) Comparing scientific abstracts generated
by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded
human reviewers. bioRxiv. https://​doi.​org/​10.​1101/​2022.​12.​23.​521610
Gewirtz D. (2023). Can AI detectors save us from ChatGPT? I tried 3 online tools to find out. https://​www.​zdnet.​com/​artic​
le/​can-​ai-​detec​tors-​save-​us-​from-​chatg​pt-i-​tried-3-​online-​tools-​to-​find-​out/. Accessed 8 May 2023
GoWinston.ai. (2023). “Are AI detection tools accurate?” Winston AI | The most powerful AI content detector. https://​
gowin​ston.​ai/. Accessed 8 May 2023
GPTZero. (2023). The Global Standard for AI Detection: Humans Deserve the Truth. https://gptzero.me/. Accessed 8 May
2023
Guo B, Zhang X, Wang Z, Jiang M, Nie J, Ding Y, Yue J, Wu Y. (2023). How Close is ChatGPT to Human Experts? Comparison
Corpus, Evaluation, and Detection. arXiv. https://​doi.​org/​10.​48550/​arXiv.​2301.​07597
Howard RM (1995) Plagiarisms, Authorships, and the Academic Death Penalty. Coll Engl 57(7):788–806. https://​doi.​org/​
10.​2307/​378403
ICML. (2023). ICML 2023 Call For Papers, Fortieth International Conference on Machine Learning. https://​icml.​cc/​Confe​
rences/​2023/​CallF​orPap​ers. Accessed 14 June 2023
Ippolito D, Duckworth D, Callison-Burch C, Eck D. (2020). Automatic Detection of Generated Text is Easiest when Humans
are Fooled. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp.1808–
1822). https://​doi.​org/​10.​18653/​v1/​2020.​acl-​main.​164/
Johnson A. (2023). ChatGPT In Schools: Here’s Where It’s Banned—And How It Could Potentially Help Students. Forbes.
https://​www.​forbes.​com/​sites/​arian​najoh​nson/​2023/​01/​18/​chatg​pt-​in-​schoo​ls-​heres-​where-​its-​banned-​and-​how-​
it-​could-​poten​tially-​help-​stude​nts/. Accessed 14 June 2023

Khalil M, Er E. (2023). Will ChatGPT get you caught? Rethinking of Plagiarism Detection. EdArXiv. https://​doi.​org/​10.​35542/​
osf.​io/​fnh48
Krishna K, Song Y, Karpinska M, Wieting J, Iyyer M. (2023). Paraphrasing evades detectors of AI-generated text, but
retrieval is an effective defense. arXiv. https://​doi.​org/​10.​48550/​arXiv.​2303.​13408
Liyanage V, Buscaldi D, Nazarenko A. (2022). A Benchmark Corpus for the Detection of Automatically generated Text in
Academic Publications. Proceedings of the 13th Conference on Language Resources and Evaluation (pp. 4692–
4700). European Language Resources Association
Ma Y, Liu J, Yi F, Cheng Q, Huang Y, Lu W, Liu X. (2023). AI vs. Human - Differentiation Analysis of Scientific Content Genera‑
tion. arXiv. https://​doi.​org/​10.​48550/​arXiv.​2301.​10416
Marr B. (2023). A Short History Of ChatGPT: How We Got To Where We Are Today. Forbes. https://​www.​forbes.​com/​sites/​
berna​rdmarr/​2023/​05/​19/a-​short-​histo​ry-​of-​chatg​pt-​how-​we-​got-​to-​where-​we-​are-​today/. Accessed 14 June 2023
Mikolov T, Chen K, Corrado G, Dean J. (2013). Efficient estimation of word representations in vector space. arXiv. https://​
doi.​org/​10.​48550/​arXiv.​1301.​3781
Milmo D. (2023). ChatGPT reaches 100 million users two months after launch. The Guardian. https://​www.​thegu​ardian.​
com/​techn​ology/​2023/​feb/​02/​chatg​pt-​100-​milli​on-​users-​open-​ai-​faste​st-​growi​ng-​app. Accessed 14 June 2023
van Oijen V. (2023). AI-generated text detectors: Do they work? SURF Communities. https://​commu​nities.​surf.​nl/​en/​ai-​in-​
educa​tion/​artic​le/​ai-​gener​ated-​text-​detec​tors-​do-​they-​work. Accessed 8 May 2023
OpenAI. (2023). ChatGPT February 13 Version. https://​chat.​openai.​com/
OpenAI. (2023). New AI classifier for indicating AI-written text. https://​openai.​com/​blog/​new-​ai-​class​ifier-​for-​indic​ating-​
ai-​writt​en-​text
Pegoraro A, Kumari K, Fereidooni H, Sadeghi AR. (2023). To ChatGPT, or not to ChatGPT: That is the question! arXiv. https://​
doi.​org/​10.​48550/​arXiv.​2304.​01487
Quillbot (2023). Quillbot AI Paraphrasing Tool. https://​quill​bot.​com/
Rosenfeld R (2000) Two decades of statistical language modeling: Where do we go from here? Proc IEEE 88(8):1270–1278.
https://doi.org/10.1109/5.880083
Schechner S. (2023). ChatGPT Ban Lifted in Italy After Data-Privacy Concessions. Wall Street J. https://​www.​wsj.​com/​artic​
les/​chatg​pt-​ban-​lifted-​in-​italy-​after-​data-​priva​cy-​conce​ssions-​d03d5​3e7. Accessed 14 June 2023
Tauginienė L, Gaižauskaité I, Glendinning I, Kravjar J, Ojstršek M, Ribeiro L, Odineca T, Marino F, Cosentino M, Sivasubrama‑
niam S. (2018). Glossary for Academic Integrity. ENAI. http://​www.​acade​micin​tegri​ty.​eu/​wp/​wp-​conte​nt/​uploa​ds/​
2018/​02/​GLOSS​ARY_​final.​pdf. Accessed 14 June 2023
Turnitin (2023). Understanding false positives within our AI writing detection capabilities. https://​www.​turni​tin.​com/​
blog/​under​stand​ing-​false-​posit​ives-​within-​our-​ai-​writi​ng-​detec​tion-​capab​iliti​es. Accessed 14 June 2023
Turnitin (2023). Resources to Address False Positives.Turnitin Support. https://​suppo​rtcen​ter.​turni​tin.​com/s/​artic​le/​Turni​
tin-s-​AI-​Writi​ng-​Detec​tion-​Toolk​it-​for-​admin​istra​tors-​and-​instr​uctors. Accessed 8 May 2023
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. (2017). Attention is all you need.
Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Advances in Neural
Information Processing systems, USA. https://​proce​edings.​neuri​ps.​cc/​paper/​2017/​file/​3f5ee​24354​7dee9​1fbd0​53c1c​
4a845​aa-​Paper.​pdf. Accessed 8 May 2023
Wang J, Liu S, Xie X, Li Y. (2023). Evaluating AIGC Detectors on Code Content. arXiv. https://​doi.​org/​10.​48550/​arXiv.​2304.​
05193
Zero GPT (2023). What is the accuracy rate of ZeroGPT? ZeroGPT - Chat GPT, Open AI and AI text detector Free Tool.
https://​www.​zerog​pt.​com/. Accessed 8 May 2023

Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
