I take this opportunity to express my sincere thanks and deep gratitude to all those people who
extended their wholehearted co-operation and have helped me in completing this project
successfully.
First of all, I would like to thank Ms. Sapna Gupta and Mrs. Amita Goel for their precious
time and constant support whenever needed. They have been a driving force behind the
successful completion of this project.
I would also like to express my deep gratitude towards MAIT for considering me a part of
their organization and providing such a great platform to learn and enhance my skills.
A very special thanks goes to all the faculty of Maharaja Agrasen Institute of Technology, under
whose guidance I have been able to excel in my career and reach such a prestigious
organization.
I hereby declare that the work presented in this project report titled "AI
GENERATED TEXT DETECTION MODEL", submitted by me in partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology
(B.Tech.) in the Department of Information Technology and Engineering,
Maharaja Agrasen Institute of Technology, is an authentic record of my project
work carried out under the guidance of Ms. Sapna Gupta. The matter presented
in this project report has not been submitted either in part or in full to any
university or institute for the award of any degree.
Fig 6 Overall accuracy for each document type (calculated as an average of all approaches discussed) 27
Fig 7 Accuracy (logarithmic) for each document type by detection tool for AI-generated text 28
Fig 8 False accusations for human-written documents 31
Fig 9 False accusations for machine-translated documents 31
Fig 10 False negatives for AI-generated documents 03-AI 32
Fig 16 Turnitin’s similarity report shows up first; it is not clear that the “AI” is clickable 35
Fig 17 Writer’s suggestion to lower “detectable AI content” 36
Fig 18 AIDT23-05-JGD 37
Fig 19 AIDT23-05-JPK 37
Fig 20 AIDT23-05-LLW 38
Fig 21 AIDT23-05-OLU 38
Fig 22 AIDT23-05-PTR 39
Fig 23 AIDT23-05-SBB 39
Fig 24 AIDT23-05-TFO 40
Fig 25 AIDT23-06-AAN 40
Fig 26 AIDT23-06-DWW 41
Fig 27 AIDT23-06-JGD 41
Fig 28 AIDT23-06-JPK 42
Fig 29 AIDT23-06-LLW 42
Fig 30 AIDT23-06-OLU 43
Fig 31 AIDT23-06-PTR 43
Fig 32 AIDT23-06-SBB 44
Fig 33 AIDT23-06-TFO 44
1. Acknowledgment
2. Declaration
3. Supervisor’s Certificate
4. List of figures
5. Abstract 1
6. Introduction 1
7. Literature Review 3
8. Material and Methods 5
9. Analysis 10
10. Test Functions 19
11. Outcomes Results 23
12. Variations in accuracy 27
13. Consistency in tool results 28
14. Usability issues 35
15. Discussion 36
16. Case studies 06-Para 40
17. Related Work 45
18. Approach 45
a. Baseline 46
b. Statistical Feature Selection 46
c. Model Architecture 47
19. Experiments
a. Data 48
b. Evaluation method 48
c. Experimental details 48
20. Predictable Outcomes 49
21. Explaining Poor Model Performance within PubMed 49
22. Outcomes Conclusion 49
23. Training TGM 50
24. Generating text from TGM 51
25. Social impacts of TGMs 51
26. Text generative models 52
27. Model architecture, training data, training cost 52
28. Controllable generation 53
29. Detectors 53
30. Conclusion 57
31. References 58
Automatic Detection of AI Generated
Text Model
Abstract
Text generative models (TGMs) excel in producing text that matches the style of human language
reasonably well. Such TGMs can be misused by adversaries, e.g., by automatically generating
fake news and fake product reviews that can look authentic and fool humans. Detectors that can
distinguish text generated by TGM from human written text play a vital role in mitigating such
misuse of TGMs. Recently, there has been a flurry of works from both natural language pro-
cessing (NLP) and machine learning (ML) communities to build accurate detectors for English.
Despite the importance of this problem, there is currently no work that surveys this fast-growing
literature and introduces newcomers to important research challenges. In this work, we fill this
void by providing a critical survey and review of this literature to facilitate a comprehensive un-
derstanding of this problem. We conduct an in-depth error analysis of the state-of-the-art detector
and discuss research directions to guide future work in this exciting area.
1 Introduction
Current state-of-the-art text generative models (TGMs) excel in producing text that approaches the style
of human language, especially in terms of grammaticality, fluency, coherency, and usage of real world
knowledge (Radford et al., 2019; Zellers et al., 2019; Keskar et al., 2019; Bakhtin et al., 2020; Brown
et al., 2020). TGMs are useful in a wide variety of applications, including story generation (Fan et al.,
2018), conversational response generation (Zhang et al., 2020), code auto-completion (Solaiman et al.,
2019), and radiology report generation (Liu et al., 2019a). However, TGMs can also be misused for fake
news generation (Zellers et al., 2019; Brown et al., 2020; Uchendu et al., 2020), fake product reviews
generation (Adelani et al., 2020), and spamming/phishing (Weiss, 2019). Thus, it is important to build
tools that can minimize the threats posed by the misuse of TGMs.
The commonly used approach to combat the threats posed by the misuse of TGMs is to formulate
the problem of distinguishing text generated by TGMs and human written text as a classification task.
The classifier, henceforth called detector, can be used to automatically remove machine generated text
from online platforms such as social media, e-commerce, email clients, and government forums, when
the intention of the TGM generated text is abuse. An ideal detector should be: (i) accurate, that is, good
accuracy with a good trade-off for false positives and false negatives depending on the online platform
(email client, social media) on which TGM is applied (Solaiman et al., 2019); (ii) data-efficient, that
is, needs as few examples as possible from the TGM used by the attacker (Zellers et al., 2019); (iii)
generalizable, that is, detects text generated by different modeling choices of the TGM used by the
attacker such as model architecture, TGM training data, TGM conditioning prompt length, model size,
and text decoding method (Solaiman et al., 2019; Bakhtin et al., 2020; Uchendu et al., 2020); (iv)
interpretable, that is, detector decisions need to be understandable to humans (Gehrmann et al., 2019);
and (v) robust, that is, the detector can handle adversarial examples (Wolff, 2020). Given the importance
of this problem, there has been a flurry of research recently from both NLP and ML communities on
scientific community, and we evaluate the quality of scientific literature generated by AI compared to
scientific language authored by humans.
In recent years, a lot of interest has been directed toward the potential of generative models such
as ChatGPT to generate human-like writing, images, and other forms of media. An OpenAI-developed
variant of the ubiquitous GPT-3 language model, ChatGPT is designed specifically for generating text
suitable for conversation and may be trained to perform tasks including answering questions, translating
text, and creating new languages [10]. Even though ChatGPT and other generative models have made great
advances in creating human-like language, reliably determining the difference between writing
generated by a machine and text written by a human remains an open challenge. This is
of the utmost importance in applications such as content moderation, where it is
important to identify and remove hazardous information and automated spam [11].
Recent work has centered on enhancing pre-trained algorithms' capacity to identify text generated
by artificial intelligence. Along with GPT-2, OpenAI also released a detection model consisting of a
RoBERTa-based binary classification system that was taught to distinguish between human-written and
GPT-2-generated text. Black et al. (2021) combine source-domain data with in-domain labeled data to
overcome the difficulty of detecting GPT-2-generated technical research literature [12]. The DagPap22
challenge and dataset on detecting machine-generated scientific publications were proposed by
Kashnitsky et al. [13] at the COLING 2022 workshop on Scholarly Document Processing. Models
such as GPT-3, GPT-Neo, and led-large-book-summary were used to generate abstracts. DagPap22's
prompt templates require including information on the primary topic and scientific structural function,
making it more probable that the tool will collect problematic and easily discoverable synthetic abstracts
[14], [15]. More recently, GPTZero has been proposed to detect ChatGPT-generated text, primarily based
on perplexity. Recent studies have revealed two major issues that need addressing. To begin, every study
given here had to make do with small data samples. Thus, a larger, more robust data set is required to
advance our understanding. Second, researchers have typically used mock data to fine-tune final versions
of pre-trained models. Ideally, text created with various artificial intelligence programs should all be
detectable by the same approach.
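To make the perplexity-based idea behind tools such as GPTZero concrete, the short Python sketch below scores a passage with an off-the-shelf GPT-2 model from Hugging Face transformers; lower perplexity indicates more predictable, machine-like text. This is only an illustration of the general principle, not GPTZero's actual implementation, and the decision threshold is an arbitrary assumption.

    # Illustrative perplexity scoring with GPT-2 (not GPTZero's actual method).
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def perplexity(text: str) -> float:
        # Let the model predict each token from its left context.
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        # out.loss is the mean negative log-likelihood per token; exp() gives perplexity.
        return float(torch.exp(out.loss))

    sample = "Artificial intelligence is transforming the way we write and read text."
    ppl = perplexity(sample)
    # Arbitrary illustrative threshold: very low perplexity hints at machine-generated text.
    print(f"perplexity = {ppl:.1f}:", "possibly AI-generated" if ppl < 40 else "likely human-written")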
Recently developed algorithms for detecting AI-generated text can tell the difference between the
two. These systems employ state-of-the-art models and algorithms to decipher text created by artificial
intelligence. One API that can accurately identify AI-generated content is Check For AI, which analyses
text samples. A further tool called Compilatio uses sophisticated algorithms to identify instances of
plagiarism, even when they are present in AI-generated content. Similarly, Content at Scale assesses
patterns, writing style, and other language properties to spot artificially generated text. Crossplag is an
application programming interface (API) that can detect AI-generated text in a file. The DetectGPT
artificial intelligence content detector can easily identify GPT model-generated text. Go Winston can be used
to identify artificially manufactured news and social media content. Machine learning and linguistic
analysis are used by GPT Zero to identify AI-generated text. The GPT-2 Output Detector Demo over at
OpenAI makes it simple to test if a text was produced by a GPT-2 model. OpenAI Text Classifier uses an
application programming interface to categorize text, including text generated by artificial intelligence. In
addition to other anti-plagiarism features, PlagiarismCheck may identify information created by artificial
intelligence. Turnitin is another well-known tool that uses AI-generated text detection to prevent plagiarism.
The Writeful GPT Detector is a web-based tool that uses pattern recognition to identify artificially produced
text. Last but not least, the Writer can spot computer-generated text and check the authenticity and
originality of written materials. Academics, educators, content providers, and businesses must deal with the
challenges of AI-generated text, but new detection techniques are making it easier.
This research will conduct a comparative analysis of AI-generated text detection tools using a self-
generated custom dataset. To accomplish this, the researchers will collect datasets using a variety of
artificial intelligence (AI) text generators and humans. This study, in contrast to others that have been
reported, incorporates a wide range of text formats, sizes, and organizational patterns. The next stage is to
test and compare the tools with the proposed tool. The following bullet points present the most significant
takeaways from this study's summary findings.
• Collecting a dataset using different LLMs on different topics with different sizes and writing styles.
• Investigation of detection tools for AI text detection on the collected dataset.
• Comparison of the proposed tool to other cutting-edge tools to demonstrate the interpretability of the best tool among them.
The remainder of this article is structured as follows: Section 2 provides a brief literature review, Section 3
describes the materials and methods, and Section 4 presents the results and discussion. The final section
summarizes the work and recommends directions for future research.
2. Literature Review
Artificial intelligence-generated text identification has sparked a paradigm change in the ever-evolving
fields of both technology and literature. This innovative approach arises from the combination of artificial
intelligence and language training, in which computers are given access to reading comprehension
strategies developed by humans. Like human editors have done for decades in the physical world, AI-
generated text detection is now responsible for determining the authenticity of literary works in the digital
world. The ability of algorithms to tell the difference between human-authored material and that generated
by themselves is a remarkable achievement of machine learning. As we venture into new waters, there are
serious implications for spotting plagiarized work, gauging the quality of content, and safeguarding writers'
rights. AI-generated text detection is a guardian for literary originality and reassurance that the spirit of
human creativity lives on in future algorithms, connecting the past and future of written expression.
Guo et al. [16] carried out a detailed analysis of three custom-built models on a dataset they created, the
Human ChatGPT Comparison Corpus (HC3). Using F1 scores at the corpus and sentence level, they evaluated the
performance of the three models and found that the best model reached a maximum F1 score of 98.78%. The results
indicated superiority over other SOTA strategies. Because the authors of the corpus only used abstract
paragraphs from a small subset of the available research literature, the dataset is skewed toward that subset.
The presented models may need to perform better on a general-purpose data set. Wang et al. [17] have
offered a benchmarked dataset-based comparison of several AI content detection techniques. For evaluation,
they have used question-and-answer, code-summarization, and code-generation databases. The study found
an average AUC of 0.40 across all selected AI identification techniques, with the datasets used for
comparison containing 25k samples for both human and AI-generated content. Due to the lack of diversity
in the dataset, the chosen tools performed poorly; a biased dataset cannot demonstrate effective performance.
With a more diverse dataset, the tools could achieve higher accuracy and a higher area under the curve (AUC).
Catherine et al. [18] employed the 'GPT-2 Output Detector' to assess the quality of generated
abstracts. The study's findings revealed a significant disparity between the abstracts created and the actual
abstracts. The AI output detector consistently assigned high 'false' scores to the generated abstracts, with a
median score of 99.98% (interquartile range: 12.73%, 99.98%). This suggests a strong likelihood of
machine-generated content. The initial abstracts exhibited far lower levels of 'false' ratings, with a median
of 0.02% and an interquartile range (IQR) ranging from 0.02% to 0.09%. The AI output detector exhibited
a robust discriminatory ability, evidenced by its AUROC (Area Under the Receiver Operating
Characteristic) value of 0.94. Utilizing a website and iThenticate software to conduct a plagiarism detection
assessment revealed that the generated abstracts obtained higher scores, suggesting a greater linguistic
similarity to other sources. Remarkably, human evaluators had difficulties in distinguishing between
authentic and generated abstracts. The researchers achieved an accuracy rate of 68% in accurately
identifying abstracts generated by ChatGPT; notably, 14% of the original abstracts were incorrectly judged
to be machine-generated. These critiques have brought attention to the issue of abstracts that are
suspected to be generated by artificial intelligence (AI).
Using a dataset generated by the users themselves, Debora et al. [19] compared and contrasted
multiple AI text detection methods. The research compared 12 publicly available tools and two proprietary
systems available only to qualified academic institutions and other research groups. The researchers' primary focus
has been explaining why and how artificial intelligence (AI) techniques are useful in the academy and the
sciences. The results of the comparison were then shown and discussed. Finally, the limitations of AI
technologies regarding evaluation criteria were discussed.
To fully evaluate such detectors, the team [20] first trained DIPPER, a paraphrase generation model
with 11 billion parameters, to rephrase entire texts in response to contextual information such as user-
generated cues. Using scalar controls, DIPPER's paraphrased results can be tailored to vocabulary and
sentence structure. Extensive testing proved that DIPPER's paraphrase of AI-generated text could evade
watermarking techniques and GPTZero, DetectGPT, and OpenAI's text classifier. The detection accuracy
of DetectGPT was decreased from 70.3% to 4.6% while maintaining a false positive rate of 1% when
DIPPER was used to paraphrase text generated by three well-known big language models, one of which
was GPT3.5-davinci-003. These rephrases were impressive since they didn't alter the original text's
meaning. The study developed a straightforward defense mechanism to safeguard AI-generated text
identification from paraphrase-based attacks. For this defense strategy to work, language model API providers
must retrieve previously generated texts that are semantically similar to the candidate text; the algorithm
searches a collection of already generated sequences for comparable ones. A 15-million-generation
database derived from a finely tuned T5-XXL model confirmed the efficacy of this defense strategy. The
software identified paraphrased generations in 81% to 97% of test cases, demonstrating its efficacy.
Remarkably, only 1% of human-written sequences were incorrectly labeled as AI-generated by the software.
The project made its code, models, and data publicly available to pave the way for additional work on
detecting and protecting AI-generated text.
OpenAI [21], an AI research company, compared manual and automatic ML-based synthetic text
recognition methods. Utilizing models trained on GPT-2 datasets enhances the inherent authenticity of the
text created by GPT-2, hence facilitating human evaluators' identification of erroneous datasets.
Consequently, the team evaluated a rudimentary logistic regression model, a detection model based on fine-
tuning, and a detection model employing zero-shot learning. A logistic regression model was trained using
TFIDF, unigram, and bigram features and evaluated using various generating processes and model
parameters afterward. The most basic classifiers demonstrated an accuracy rate of 97% or higher, although the
models struggled to identify shorter outputs. Topological Data Analysis (TDA) was utilized by Kushnareva et
al. [22] to count graph components, edges, and cycles, and these features were then used for machine-learning-based
text recognition. A logistic regression classifier was trained on these features using WebText, Amazon Reviews,
RealNews, and GROVER [23]. Because the approach has not been thoroughly tested on ChatGPT, its success there remains uncertain.
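The simple logistic-regression baseline mentioned above (TF-IDF over unigrams and bigrams) can be reproduced in a few lines of scikit-learn. The sketch below assumes small in-memory lists of example texts and labels and is not tied to OpenAI's released detector code.

    # Minimal TF-IDF (unigram + bigram) + logistic regression baseline.
    # The toy texts and labels below are illustrative assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_texts = [
        "The committee met on Tuesday and argued for two hours about the budget.",
        "My grandmother's recipe calls for more butter than seems reasonable.",
        "As an AI language model, I can provide a concise overview of the topic.",
        "In conclusion, the aforementioned factors collectively underscore the significance of the subject.",
    ]
    train_labels = [0, 0, 1, 1]  # 0 = human-written, 1 = AI-generated

    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

    test_text = ["Overall, these considerations collectively underscore the importance of the topic."]
    print(classifier.predict(vectorizer.transform(test_text)))        # predicted class
    print(classifier.predict_proba(vectorizer.transform(test_text)))  # class probabilities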
The online application DetectGPT was used to zero-shot identify and separate AI-generated text
from human-generated text in another investigation [24]. Log probabilities from the generative model were
employed. The researchers observed that machine-generated text tends to lie in regions of negative curvature
of the model's log probability function. The method assumes that the log probabilities of the model under
discussion are always accessible and, according to the authors, it only works with GPT-2 prompts. In another study, Mitrovic et
al. trained an ML model to distinguish ChatGPT-generated responses from human ones [25]. ChatGPT-generated two-line
restaurant reviews were recognized by DistilBERT, a lightweight model distilled from BERT and fine-tuned for the
task, and SHAP was used to explain the model's predictions. The researchers observed that the ML model struggled to
recognize ChatGPT messages. The authors introduced AICheatCheck, a web-based AI detection tool that can
distinguish if a text was produced by ChatGPT or a human [26]. AICheatCheck analyzes text patterns to determine
a text's origin. The authors relied on the data of Guo et al. [16] and the education domain, making do with limited
data, and the study does not fully explain AICheatCheck's precision. The topic was recently investigated by Cotton
et al. [27], who discuss the benefits and drawbacks of using ChatGPT in the classroom with respect to plagiarism. In another text [28],
the authors used statistical distributions to analyze generated data. Using the GLTR application, they highlight
the input text in different colors according to how predictable each token is under the model. The questions used
in the GLTR evaluation were written by the general public and based on the publicly accessible human- and GPT-generated
content for the 1.5B parameter model [10]. The authors also studied human subjects by having students spot instances
of fabricated news.
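The colour-coding idea behind GLTR described above can be approximated with a language model by ranking each observed token under the model's next-token distribution and bucketing the ranks (GLTR uses roughly top-10, top-100, and top-1000 bins). The sketch below uses GPT-2 from Hugging Face transformers and is an approximation of the idea, not the GLTR codebase itself.

    # Approximate GLTR-style analysis: rank each observed token under GPT-2's
    # next-token distribution and bucket the ranks into GLTR-like colour bins.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def token_ranks(text: str):
        ids = tokenizer(text, return_tensors="pt")["input_ids"]
        with torch.no_grad():
            logits = model(ids).logits            # shape: (1, seq_len, vocab_size)
        ranks = []
        for pos in range(1, ids.shape[1]):
            prev_logits = logits[0, pos - 1]      # distribution predicted from the prefix
            actual = ids[0, pos]
            rank = int((prev_logits > prev_logits[actual]).sum()) + 1
            ranks.append((tokenizer.decode([int(actual)]), rank))
        return ranks

    def bucket(rank: int) -> str:
        # GLTR-like bins: green (top-10), yellow (top-100), red (top-1000), violet (rest).
        return "green" if rank <= 10 else "yellow" if rank <= 100 else "red" if rank <= 1000 else "violet"

    for token, rank in token_ranks("The quick brown fox jumps over the lazy dog."):
        print(f"{token!r:>12}  rank={rank:<6d} {bucket(rank)}")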
Different AI-generated text classification models have been presented in recent years, with approaches
ranging from deep learning and transfer learning to machine learning. Furthermore, software incorporated
the most effective models to help end users verify AI-generated writing. Some studies evaluate various AI
text detection tools by comparing their performance on extremely limited and skewed datasets. Therefore,
it is necessary to have a dataset that includes samples from many domains written in the same language that
the models were trained in. It remains unclear which of the many proposed tools for AI text
identification is the most effective. To find the best tool for each sort of material, whether from a research
community or content authors, it is necessary to do a comparative analysis of the top-listed tools.
3. Material and Methods
Many different tools for recognizing artificial intelligence-created text were compared in this analysis. The
approach outlined here consists of three distinct phases. Examples of human writing are collected from
many online sources, and OpenAI frameworks are used to generate examples of AI writing from various
prompts (such as articles, abstracts, stories, and comment writing). In the following stage, six detection tools
are selected and applied to the newly formed dataset. Finally, the performance of the tools is evaluated based on
several state-of-the-art measurements, allowing end users to pick the best alternative. Figure 1 depicts the
overall structure of the execution process.
[Figure 1: Overall structure of the execution process (Loading Responses → Calculate Results → Decision)]
Paraphrase [31], GPT-2 [32], GPT-3 [33], DaVinci, GPT-3.5, OPT-IML [34], and Flan-T5 [35]). The final
number in the table is 11,580, the total number of samples across both classes. This table is useful since it shows
where the testing dataset came from and how the AI models were evaluated. Table 1 presents details of
testing samples used for evaluation of targeted AI generated text detection tools.
Table 1: Testing dataset samples (AH&AITD).
Class | Source | Number of Samples | Total
Human Written | Open Web Text | 2343 | 5790
Human Written | Blogs | 196 |
Human Written | Web Text | 397 |
Human Written | Q&A | 670 |
Human Written | News Articles | 430 |
Human Written | Opinion Statements | 1549 |
Human Written | Scientific Research | 205 |
AI Generated | ChatGPT | 1130 | 5790
AI Generated | GPT-4 | 744 |
AI Generated | Paraphrase | 1694 |
AI Generated | GPT-2 | 328 |
AI Generated | GPT-3 | 296 |
AI Generated | Davinci | 433 |
AI Generated | GPT-3.5 | 364 |
AI Generated | OPT-IML | 406 |
AI Generated | Flan-T5 | 395 |
Total | | 11580 | 11580
The data collection method for human-written samples is based on a variety of approaches.
Most human-written samples were manually obtained from human-written research articles, and only
abstracts were used. Additionally, open web texts are harvested from various websites, including Wikipedia.
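A brief sketch of how the AH&AITD testing set in Table 1 could be loaded and sanity-checked is given below; the file name and column names are assumptions made for illustration rather than the actual distribution format of the dataset.

    # Illustrative loading of the testing dataset; "ah_aitd.csv" and its columns
    # ("text", "source", "class") are assumed names, not the actual release format.
    import pandas as pd

    df = pd.read_csv("ah_aitd.csv")

    # Sanity checks mirroring Table 1: 5,790 samples per class, 11,580 in total.
    print(df["class"].value_counts())              # Human Written vs AI Generated counts
    print(df.groupby(["class", "source"]).size())  # per-source breakdown
    assert len(df) == 11580, "unexpected total number of samples"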
3.2 AI Generated Text Detection Tools
Without AI-generated text detection systems, which monitor automated content distribution online, the
modern digital ecosystem would collapse. These systems employ state-of-the-art machine learning and
natural language processing methods to identify and label data generated by AI models like GPT-3,
ChatGPT, and others. They play a vital role in the moderation process by preventing the spread of
misinformation, protecting online communities, and identifying and removing fake news. Platform
administrators and content moderators can use these tools to spot literature generated by artificial
intelligence by seeing patterns, language quirks, and other telltale signals. The importance of AI-created
text detection tools for user security, ethical online discourse, and legal online material has remained strong
despite the advancements in AI. In this section, AI generated text detection tools are described briefly.
3.2.1 AI Text Detection API
Zyalab's AI Text Detection API [36] makes locating and analyzing text in various content types simple.
This API employs cutting-edge artificial intelligence (AI) technology to precisely recognize and extract
textual content from various inputs, including photos, documents, and digital media. AI Text Detection API
uses cutting-edge OpenAI technology to identify ChatGPT content. Its high accuracy and simple interface
let instructors spot plagiarism in student essays and other AI-generated material. Its ease of integration into
workflows and use by non-technical users is a major asset. Due to OpenAI's natural language processing
powers, the API can detect even mild plagiarism, ensuring the information's uniqueness. It helps teachers
grade essays by improving the efficiency of checking student work for originality. In conclusion, the AI
Text Detection API simplifies, accurately, and widely applies plagiarism detection and essay grading for
content suppliers, educators, and more. Due to its ability to analyze text and provide a detailed report, this
tool can be used for plagiarism detection, essay grading, content generation, chatbot building, and machine
learning research. There are no application type constraints, merely API request limits. The API makes use
of OpenAI technology. It has a simple interface and high accuracy, allowing it to detect plagiarism in AI-
generated writing and serve as an essay detector for teachers.
3.2.2 GPTKIT
The innovators of GPTKit [37] saw a need for an advanced tool to accurately identify Chat GPT material,
so they built one. GPTKit is distinguished from other tools because it utilizes six distinct AI-based content
recognition methods, all working together to considerably enhance the precision with which AI-generated
content may be discovered. Educators, professionals, students, content writers, employees, and independent
contractors worried about the accuracy of AI-generated text will find GPTKit highly adaptable. When users
input text for analysis, GPTKit uses these six techniques to assess the content’s authenticity and accuracy.
Customers can try out GPTKit’s features for free by having it return the first 2048 characters of a response
to a request. Due to the team’s dedication to continuous research, the detector in GPTKit now claims an
impressive accuracy rate of over 93% after being trained on a big dataset. You can rest easy knowing that
your data will remain private during detection and afterward, as GPTKit only temporarily stores information
for processing and promptly deletes it from its servers. GPTKit is a great tool to use if you wish to validate
information with artificial intelligence (AI) for authenticity or educational purposes.
3.2.3 GPTZero
GPTZero is the industry standard for identifying Large Language Model documents like ChatGPT. It
detects AI content at the phrase, paragraph, and document levels, making it adaptable. The GPTZero model
was trained on a wide range of human-written and AI-generated text, focusing on English prose. After
servicing 2.5 million people and partnering with 100 education, publishing, law, and other institutions,
GPTZero is a popular AI detector. Users may easily enter text for analysis using its simple interface, and
the system returns detailed detection findings, including sentence-by-sentence highlighting of AI-detected
material, for maximum transparency. GPTZero supports numerous AI language models, making it a
versatile AI detection tool. ChatGPT, GPT-4, GPT-3, GPT-2, LLaMA, and AI services are included. It was
the most accurate and trustworthy AI detector of seven tested by TechCrunch. Customized for student
writing and academic prose, GPTZero is ideal for school. Despite its amazing powers, GPTZero [38] admits
it has limitations in the ever-changing realm of AI-generated entertainment. Thus, teachers should combine
its findings into a more complete assessment that prioritizes student comprehension in safe contexts. To
help teachers and students address AI misuse and the significance of human expression and real-world
learning, GPTZero emphasizes these topics. In-person evaluations, edited history analysis, and source
citations can help teachers combat AI-generated content. Due to its commitment to safe AI adoption,
GPTZero may be a reliable partner for educators facing AI issues.
3.2.4 Sapling
Sapling AI Content Detector [39] is a cutting-edge program that accurately recognizes and categorizes AI-
generated media. This state-of-the-art scanner uses advanced technology to verify the authenticity
and integrity of text by checking for the existence of AI-generated material in various contexts. Whether in
the courtroom, the publishing industry, or the classroom, Sapling AI Content Detector is a potent solution
to the issue of AI-generated literature. Its straightforward interface and comprehensive detection results
equip users to make informed judgments about the authenticity of the material. Sapling AI Content
Detector's dedication to precision and dependability makes it a valuable resource for companies and
individuals serious about preserving the highest possible content quality and originality requirements.
3.2.5 Originality
The Originality AI Content Detector [40] was intended to address the growing challenge of identifying AI-
generated text. This cutting-edge artificial intelligence influence detector can tell if human-written content
has been altered. It examines every word, every sentence, every paragraph, and every document. Human-
made and computer-generated texts may be distinguished with confidence thanks to the rich variety of
training data. Educators, publishers, academics, and content producers will find this tool invaluable for
guarding the authenticity and integrity of their own work. The Originality AI Content Detector highlights
potential instances of AI-generated literature to increase awareness and promote the responsible use of AI
technologies in writing. The era of AI-driven content creation gives users the knowledge to make purposeful
decisions that preserve the quality and originality of their writing.
3.2.6 Writer
The Writer AI Content Detector [41] is cutting-edge software for spotting content created by artificial
intelligence. This program utilizes cutting-edge technologies to look for signs of artificial intelligence in
the text at the phrase, paragraph, and overall document levels. Since it was trained on a large dataset
including human-authored and AI-generated content, the Writer detector is very good at telling them apart. This
guide is a must-read for every instructor, publisher, or content provider serious about their craft. By alerting
users to the presence of AI content and offering details about it, the Writer arms them with the information
they need to protect the authenticity of their works. The author is an honest champion of originality,
advocating for responsible and ethical content generation in an era when AI is increasingly involved in the
creative process. Developers can get the Writer AI Content Detector SDK by running "pip install writer."
The use of API credentials for writer authentication is critical [42]. These API keys can be found on your
account's dashboard. Simply replace the sample API keys in the code snippets with your own or sign in for
individualized code snippets. Users who do not have access to their secret API keys on the control panel and wish
to become a member of a Writer account's development team should contact the account's owner. Developers can
access the Writer SDK and AI Content Detector once signed in. The SDK includes document and user
management tools, content identification, billing information retrieval, content production, model
customization, file management, snippet handling, access to a style guide, terminology management, user
listing, and management. With this full suite of resources, customers can confidently include AI-driven
content recognition into their projects and apps without compromising safety or precision.
3.3 Experimental Setup
Six distinct content identification approaches developed using artificial intelligence were
evaluated in depth for this study. Each tool has an API that can be used with various languages
and frameworks. To take advantage of these features, subscriptions have been obtained for each
API, and the software has been put through its paces with Python scripts. The results were produced
using the testing dataset discussed above. All experiments have been run on a Dell system with a
sixth-generation Intel Core i7 processor, 24 GB of RAM, and a 256 GB SSD, using Python 3.11 in VS Code with
Jupyter Notebook integration.
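The per-tool testing loop can be outlined as below. The endpoint URL, request fields, and response schema are placeholders (each real tool has its own API format documented by its vendor), so this is only a sketch of how the Python scripts iterate over the dataset, not the actual client code of any of the six tools.

    # Outline of the testing loop; the endpoint and JSON fields are hypothetical
    # placeholders, not the real API of any specific detection tool.
    import requests

    API_URL = "https://api.example-detector.com/v1/detect"  # placeholder endpoint
    API_KEY = "YOUR_API_KEY"                                 # obtained with the subscription

    def detect(text: str) -> float:
        """Return the tool's probability that `text` is AI-generated (assumed schema)."""
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"text": text},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["ai_probability"]  # assumed response field

    # Score every sample and store predictions for the metric calculations below.
    samples = [("Some passage from the testing dataset...", "Human Written")]
    predictions = [(label, detect(text)) for text, label in samples]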
3.4 Evaluation Policy
To ensure the robustness, dependability, and usefulness of a company's machine-learning models, the
company should develop and adhere to an evaluation policy. This policy spells out the evaluation, validation, and
application of models in detail. As a first step, it establishes a standardized approach to evaluation,
allowing for fair and uniform assessment of model performance across projects. Comparing projects,
identifying best practices, and maximizing model development are all made easier with the introduction of
uniform standards. Second, a policy for assessing model performance guarantees that they hit targets for
measures like accuracy, precision, and recall. As a result, only high-quality, reliable models with strong
performance are deployed. Reduced implementation risks are achieved through the policy's assistance in
identifying model inadequacies, biases, and inaccuracies. An assessment policy fosters accountability and
trustworthiness in data science by requiring uniformity and transparency in model construction.
Accuracy is important in machine learning and statistics because it measures the overall quality of a model's
predictions. Accuracy is the percentage of correctly predicted cases out of the total number of instances in the
dataset. It is defined as:

Accuracy = (Number of Correct Predictions / Total Number of Predictions) × 100
In this formula, the "Total Number of Predictions" represents the size of the dataset, while the
“Number of Correct Predictions” is the number of predictions made by the model that correspond to the
actual values. A quick and dirty metric to gauge a model's efficacy is accuracy, but when one class greatly
outnumbers the other in unbalanced datasets, this may produce misleading results.
Precision is the degree to which a model's positive predictions are correct. In the areas of statistics
and machine learning, it is a common metric, defined as the ratio of true positive predictions to all
positive predictions. The precision equation can be described as follows:

Precision = True Positives / (True Positives + False Positives)
In practical use, precision quantifies how well the model avoids false positives. A high
precision score indicates that when the model predicts a positive outcome, it is more likely to be true, which
is especially important in applications where false positives could have major consequences, such as
medical diagnosis or fraud detection.
Recall (true positive rate or sensitivity) is an important performance metric in machine learning
and classification applications. It measures a model's ability to discover and label every instance of interest
in a given dataset. Recall is calculated with the following formula:

Recall = True Positives / (True Positives + False Negatives)
In this formula, TP represents the total number of true positives, whereas FN represents the total
number of false negatives. Medical diagnosis and fraud detection are two examples of areas where missing
a positive instance can have serious effects; such applications profit greatly from a model with high recall,
which indicates that the model effectively catches a large proportion of the true positive cases.
The F1 score is a popular metric in machine learning that combines precision and recall into a single
value, offering a fairer evaluation of a model's efficacy, especially when working with unbalanced datasets.
The formula for its determination is as follows:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Precision is the proportion of correct positive predictions relative to the total number of positive predictions
made by the model, whereas recall measures the same proportion relative to the number of genuine positive
cases in the dataset. The F1 score excels when a compromise between reducing false positives and false
negatives is required, such as medical diagnosis, information retrieval, and anomaly detection. By factoring
in precision and recall, F1 is a well-rounded measure of a classification model's efficacy.
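For completeness, the four metrics defined above can be computed directly from a tool's binary predictions with scikit-learn, as in the brief sketch below (the labels shown are made-up examples).

    # Computing the metrics defined above from binary predictions (1 = AI-generated).
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # ground truth (made-up example)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # tool output  (made-up example)

    print("Accuracy :", accuracy_score(y_true, y_pred) * 100, "%")
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1 Score :", f1_score(y_true, y_pred))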
A machine learning classification model's accuracy can be evaluated using the ROC curve and the
Confusion Matrix. The ROC curve compares the True Positive Rate (Sensitivity) to the False Positive Rate
(1-Specificity) at different cutoffs to understand a model's discriminatory ability. The Confusion Matrix,
which meticulously tabulates model predictions into True Positives, True Negatives, False Positives, and
False Negatives, provides a more detailed basis for assessing accuracy, precision, recall, and F1-score. Data
scientists and analysts can use these tools to learn everything they need to know about model performance,
threshold selection, and striking a balance between sensitivity and specificity in classification jobs.
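Both diagnostics described in this paragraph can likewise be produced with scikit-learn. The sketch below assumes the detector also returns a continuous AI-probability score, which is what a ROC curve requires; the scores shown are invented for illustration.

    # Confusion matrix and ROC/AUC for a detector that returns AI-probability scores.
    from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]                   # ground truth (made-up example)
    y_score = [0.9, 0.4, 0.8, 0.2, 0.1, 0.7, 0.6, 0.3]  # detector's AI-probability output

    y_pred = [1 if s >= 0.5 else 0 for s in y_score]    # 0.5 threshold for the matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("TP, TN, FP, FN:", tp, tn, fp, fn)

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
    print("AUC:", roc_auc_score(y_true, y_score))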
4. Analysis
The results and discussion surrounding these tools reveal intriguing insights into the usefulness and
feasibility of six AI detection approaches for differentiating AI-generated text from human-authored
content. Detection technologies, including GPTZero, Sapling, Writer, AI Text Detection API (Zyalab),
Originality AI Content Detector, and GPTKIT, were ranked based on several factors, including accuracy,
precision, recall, and F1-score. Table 2 compares different AI text detection tools that can be used to
tell the difference between AI-generated and human-written text.
Table 2: Comparative results of AI text detection tools on AH&AITD.
Tools | Classes | Precision | Recall | F1 Score
GPTKIT | AI Generated | 90 | 12 | 21
GPTKIT | Human Written | 53 | 99 | 69
GPTZERO | AI Generated | 65 | 60 | 62
GPTZERO | Human Written | 63 | 68 | 65
Originality | AI Generated | 98 | 96 | 97
Originality | Human Written | 96 | 98 | 97
Sapling | AI Generated | 86 | 40 | 54
Sapling | Human Written | 61 | 94 | 74
Writer | AI Generated | 79 | 52 | 62
Writer | Human Written | 64 | 87 | 74
Zylalab | AI Generated | 84 | 45 | 59
Zylalab | Human Written | 62 | 91 | 74
First, GPTKIT attains high precision (90) in detecting AI-generated text but shockingly low recall (12),
which results in a low F1 Score (21). This suggests that GPTKIT is overly conservative,
giving rise to many false negatives. On the other hand, its recall (99) and F1 Score (69) are much better when
recognizing language created by humans. GPTZero's performance is more evenly balanced across the
board. Its recall (60%) and F1 Score (62) more than make up for its lower precision (65%) on AI-generated
text. An F1 Score of 65 for human-written text strikes a reasonable balance between precision (63) and
recall (68).
When distinguishing AI-generated content from human-written content, Originality shines. Its F1
Score of 97 reflects its remarkable precision (98), recall (96), and overall effectiveness. It also excels at text
created by humans, with an F1 Score of 97, recall of 98%, and precision of 96%. Sapling's high precision (86)
but low recall (40) on AI-generated text give it an F1 Score of 54. With a high recall (94) but poor
precision (61) in identifying human-written text, its F1 Score of 74 still leaves room for improvement. Writer is
unbiased in assessing the relative merits of AI-generated and human-written content. It has an average F1
Score of 62 because it has an average level of precision while analyzing AI-generated text (79) and recall
(52). The F1 Score for this piece of human-written text is 74, meaning it has an excellent balance of
precision (64) and recall (87). Regarding recognizing AI-generated content, Zylalab has a good precision of 84
but a recall of only 45, giving an F1 Score of 59. Recognizing human-written text is where it does better, with an
F1 Score of 74, a recall of 91, and a precision of 62.
As a result of its superior performance in terms of precision, recall, and F1 Score across both classes,
we have concluded that Originality is the most reliable alternative for AI text identification. Additionally,
GPTZERO displays all-around performance, making it a practical option. However, Sapling shows skills
in identifying AI-generated text whereas GPTKIT demonstrates remarkable precision but needs better recall.
Writer finds a comfortable middle ground but does not differentiate itself. Zylalab performs about as
well as the best of the rest, but it has room to grow. Before selecting a tool, it is crucial to consider the needs
and priorities of the job.
Figure 2 provides a visual representation of a comparison between the accuracy of six different AI
text identification systems, including "GPTkit," "GPTZero," "Originality," "Sapling," "Writer," and
"Zylalab." Data visualization demonstrates that "GPTkit" has a 55.29 percent accuracy rate, "GPTZero" has
a 63.7 percent accuracy rate, "Originality" has a spectacular 97.0 percent accuracy rate, "Sapling" has a
66.6 percent accuracy rate, "Writer" has a 69.05 percent accuracy rate, and "Zylalab" has a 68.23 percent
accuracy rate. These accuracy ratings demonstrate how well the tools distinguish between natural and
computer-generated text. When contrasting the two forms of writing, "Originality" achieves the highest
degree of accuracy. Compared to the other tools, "GPTkit" has the lowest detection accuracy and thus the
most room for improvement. This visual representation of the performance of various AI text detection
tools will be an important resource for users looking for the most precise tool for their needs.
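A comparison in the style of Figure 2 can be reproduced directly from the accuracy values reported above with a few lines of matplotlib; the numbers in the sketch are taken from the text, while the styling choices are arbitrary.

    # Reproducing a Figure 2 style accuracy comparison from the reported values.
    import matplotlib.pyplot as plt

    tools = ["GPTkit", "GPTZero", "Originality", "Sapling", "Writer", "Zylalab"]
    accuracy = [55.29, 63.7, 97.0, 66.6, 69.05, 68.23]  # percent, as reported above

    plt.figure(figsize=(8, 4))
    plt.bar(tools, accuracy, color="steelblue")
    plt.ylabel("Accuracy (%)")
    plt.ylim(0, 100)
    plt.title("Accuracy of AI text detection tools on AH&AITD")
    plt.tight_layout()
    plt.savefig("accuracy_comparison.png")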
Figure 3: Testing confusion matrices of AI text detection tools on AH&AITD
Figure 4 displays the Testing Receiver Operating Curves (ROCs) for selecting AI text detection
algorithms, visually comparing their relative strengths and weaknesses. These ROC curves, one for each
tool, are essential for judging how well they can tell the difference between AI-generated and human-written
material. Values for "GPTkit," "GPTZero," "Originality," "Sapling," "Writer," and "Zylalab" in terms of
Area Under the Curve (AUC) are 0.55, 0.64%, 0.97%, 0.67%, 0.69%, and 0.68%, respectively. The Area
under the curve (AUC) is a crucial parameter for gauging the precision and efficiency of such programs. A
bigger area under the curve (AUC) suggests that the two text types can be distinguished with more accuracy.
To help users, researchers, and decision-makers choose the best AI text recognition tool for their needs,
Figure 4 provides a visual summary of how these tools rank regarding their discriminative strength.
engagement. It is also in higher education that students form and further develop their
personal and professional ethics and values. Hence, it is crucial to uphold the integrity of
the assessments and diplomas provided in tertiary education.
The introduction of unauthorised content generation—“the production of academic
work, in whole or part, for academic credit, progression or award, whether or not
a payment or other favour is involved, using unapproved or undeclared human or
technological assistance” (Foltýnek et al. 2023)—into higher education contexts poses
potential threats to academic integrity. Academic integrity is understood as “compli-
ance with ethical and professional principles, standards and practices by individuals
or institutions in education, research and scholarship” (Tauginienė et al. 2018).
Recent advancements in artificial intelligence (AI), particularly in the area of the
generative pre-trained transformer (GPT) large language models (LLM), have led to
a range of publicly available online text generation tools. As these models are trained
on human-written texts, the content generated by these tools can be quite difficult to
distinguish from human-written content. They can thus be used to complete assess-
ment tasks at HEIs.
Despite the fact that unauthorised content generation created by humans, such as
contract cheating (Clarke & Lancaster 2006), has been a well-researched form of stu-
dent cheating for almost two decades now, HEIs were not prepared for such radical
improvements in automated tools that make unauthorised content generation so eas-
ily accessible for students and researchers. The availability of tools based on GPT-3
and newer LLMs, ChatGPT (OpenAI 2023a, b) in particular, as well as other types
of AI-based tools such as machine translation tools or image generators, have raised
many concerns about how to make sure that no academic performance deception
attempts have been made. The availability of ChatGPT has forced HEIs into action.
Unlike contract cheating, the use of AI tools is not automatically unethical. On the
contrary, as AI will permeate society and most professions in the near future, there is
a need to discuss with students the benefits and limitations of AI tools, provide them
with opportunities to expand their knowledge of such tools, and teach them how to
use AI ethically and transparently.
Nonetheless, some educational institutions have directly prohibited the use of
ChatGPT (Johnson 2023), and others have even blocked access from their university
networks (Elsen-Rooney 2023), although this is just a symbolic measure with vir-
tual private networks quite prevalent. Some conferences have explicitly prohibited
AI-generated content in conference submissions, including machine-learning con-
ferences (ICML 2023). More recently, Italy became the first country in the world to
ban the use of ChatGPT, although that decision has in the meantime been rescinded
(Schechner 2023). Restricting the use of AI-generated content has naturally led to
the desire for simple detection tools. Many free online tools that claim to be able to
detect AI-generated text are already available.
Some companies do urge caution when using their tools for detecting AI-gener-
ated text for taking punitive measures based solely on the results they provide. They
acknowledge the limitations of their tools, e.g. OpenAI explains that there are several
ways to deceive the tool (OpenAI 2023a, b, 8 May). Turnitin made a guide for teachers
on how they should approach the students whose work was flagged as AI-generated
RQ1: Can detection tools for AI-generated text reliably detect human-written text?
RQ2: Can detection tools for AI-generated text reliably detect ChatGPT-generated
text?
RQ3: Does machine translation affect the detection of human-written text?
RQ4: Does manual editing or machine paraphrasing affect the detection of Chat-
GPT-generated text?
RQ5: How consistent are the results obtained by different detection tools for AI-gen-
erated text?
The next section briefly describes the concept and history of LLMs. It is followed
by a review of scientific and non-scientific related work and a detailed description of
the research methodology. After that, the results are presented in terms of accuracy,
error analysis, and usability issues. The paper ends with discussion points and conclusions.
Related work
The development of LLMs has led to an acceleration of different types of efforts in the
field of automatic detection of AI-generated text. Firstly, several researchers have studied
human abilities to detect machine-generated texts (e.g. Guo et al. 2023; Ippolito et al.
2020; Ma et al. 2023). Secondly, some attempts have been made to build benchmark text
corpora to detect AI-generated texts effectively; for example, Liyanage et al. (2022) have
offered synthetic and partial text substitution datasets for the academic domain. Thirdly,
many research works are focused on developing new or fine-tuning parameters of the
already pre-trained models of machine-generated text (e.g. Chakraborty et al. 2023; Dev-
lin et al. 2019).
These efforts provide a valuable contribution to improving the performance and capa-
bilities of detection tools for AI-generated text. In this section, the authors of the paper
mainly focus on studies that compare or test the existing detection tools that educators
can use to check the originality of students’ assignments. The related works examined
in the paper are summarised in Tables 1, 2, and 3. They are categorised as published
scientific publications, preprints and other publications. It is worth mentioning that
although there are many comparisons on the Internet made by individuals and organisa-
tions, Table 3 includes only those with the higher coverage of tools and/or at least partly
described methodology of experiments.
Some researchers have used known text-matching software to check if they are able
to find instances of plagiarism in the AI-generated text. Aydin and Karaarslan (2022)
tested the iThenticate system and have revealed that the tool has found matches with
other information sources both for ChatGPT-paraphrased text and -generated text.
They also found that ChatGPT does not produce original texts after paraphrasing, as the
match rates for paraphrased texts were very high in comparison to human-written and
natural language content and programming code and determined that “detecting Chat-
GPT-generated code is even more difficult than detecting natural language contents.”
They also state that tools often exhibit bias, as some of them have a tendency to predict
that content is ChatGPT generated (positive results), while others tend to predict that it
is human-written (negative results).
By testing fifty ChatGPT-generated paper abstracts on the GPT-2 Output detector,
Gao et al. (2022) concluded that the detector was able to make an excellent distinction
between original and generated abstracts because the majority of the original abstracts
were scored extremely low (corresponding to human-written content) while the detec-
tor found a high probability of AI-generated text in the majority (33 abstracts) of the
ChatGPT-generated abstracts with 17 abstracts scored below 50%.
Pegoraro et al. (2023) tested not only online detection tools for AI-generated text but
also many of the existing detection approaches and claimed that detection of the Chat-
GPT-generated text passages is still a very challenging task as the most effective online
detection tool can only achieve a success rate of less than 50%. They also concluded that
most of the analysed tools tend to classify any text as human-written.
Tests completed by van Oijen (2023) showed that the overall accuracy of tools in
detecting AI-generated text reached only 27.9%, and the best tool achieved a maxi-
mum of 50% accuracy, while the tools reached an accuracy of almost 83% in detecting
human-written content. The author concluded that detection tools for AI-generated text
are "no better than random classifiers" (van Oijen 2023). Moreover, the tests provided
some interesting findings; for example, the tools found it challenging to detect a piece of
human-written text that was rewritten by ChatGPT or a text passage that was written in
a specific style. Additionally, there was not a single attribution of a human-written text
to AI-generated text, that is, an absence of false positives.
Although Demers (2023) only provided results of testing without any further analysis,
their examination allows making conclusions that a text passage written by a human was
recognised as human-written by all tools, while ChatGPT-generated text had a mixed
evaluation with the tendency to be predicted as human-written (10 tools out of 16) that
increased even further for the ChatGPT writing sample with the additional prompt "beat
detection" (12 tools out of 16).
Elkhatat et al. (2023) revealed that detection tools were generally more successful in
identifying GPT-3.5-generated text than GPT-4-generated text and demonstrated incon-
sistencies (false positives and uncertain classifications) in detecting human-written text.
They also questioned the reliability of detection tools, especially in the context of investi-
gating academic integrity breaches in academic settings.
In the tests conducted by Compilatio, the detection tools for AI-generated text
detected human-written text with reliability in the range of 78–98% and AI-generated
text in the range of 56–88%. Gewirtz's (2023) results on testing three human-written and three Chat-
GPT-generated texts demonstrated that two of the selected detection tools for AI-gener-
ated text could reach only 50% accuracy and one an accuracy of 66%.
The effect of paraphrasing on the performance of detection tools for AI-generated text
has also been studied. For example, Anderson et al. (2023) concluded that paraphras-
ing has significantly lowered the detection capabilities of the GPT-2 Output Detector by
increasing the score for human-written content from 0.02% to 99.52% for the first essay
and from 61.96% to 99.98% for the second essay. Krishna et al. (2023) applied paraphras-
ing to the AI-generated texts and revealed that it significantly lowered the detection
accuracy of five detection tools for AI-generated text used in the experiments.
The results of the above-mentioned studies suggest that detecting AI-generated text
passages is still challenging for existent detection tools for AI-generated text, whereas
human-written texts are usually identified quite accurately (accuracy above 80%). How-
ever, the ability of tools to identify AI-generated text is under question as their accuracy
in many studies was only around 50% or slightly above. Depending on the tool, a bias
may be observed identifying a piece of text as either ChatGPT-generated or human-
written. In addition, tools have difficulty identifying the source of the text if ChatGPT
transforms human-written text or generates text in a particular style (e.g. a child’s expla-
nation). Furthermore, the performance of detection tools significantly decreases when
texts are deliberately modified by paraphrasing or re-writing. Detection of the AI-gener-
ated text remains challenging for existing detection tools, but detecting ChatGPT-gener-
ated code is even more difficult.
Existing research has several shortcomings:
• quite often experiments are carried out with a limited number of detection tools for
AI-generated text on a limited set of data;
• sometimes human-written texts are taken from publicly available websites or recog-
nised print sources, and thus could potentially have been previously used to train
LLMs and/or provide no guarantee that they were actually written by humans;
• the methodological aspects of the research are not always described in detail and are
thus not available for replication;
• testing whether the AI-generated and further translated text can influence the accu-
racy of the detection tools is not discussed at all;
• a limited number of measurable metrics is used to evaluate the performance of
detection tools, ignoring the qualitative analysis of results, for example, types of clas-
sification errors that can have significant consequences in an academic setting.
Test Functions
Test cases
The focus of this research is determining the accuracy of tools which state that they are
able to detect AI-generated text. In order to do so, a number of situational parameters
were set up for creating the test cases for the following categories of English-language
documents:
• human-written;
• human-written in a non-English language with a subsequent AI/machine translation
to English;
• AI-generated text;
• AI-generated text with subsequent human manual edits;
• AI-generated text with subsequent AI/machine paraphrase.
For the first category (called 01-Hum), the specification was made that 10.000 charac-
ters (including spaces) were to be written at about the level of an undergraduate in the
field of the researcher writing the paper. These fields include academic integrity, civil
engineering, computer science, economics, history, linguistics, and literature. None of
the text may have been exposed to the Internet at any time or even sent as an attachment
to an email. This is crucial because any material that is on the Internet is potentially
included in the training data for an LLM.
For the second category (called 02-MT), around 10,000 characters (including spaces) were written in Bosnian, Czech, German, Latvian, Slovak, Spanish, and Swedish. None of these texts may have been exposed to the Internet before, as for 01-Hum. Depending on
the language, either the AI translation tool DeepL (3 cases) or Google Translate (6 cases)
was used to produce the test documents in English.
It was decided to use ChatGPT as the only AI-text generator for this investigation, as
it was the one with the largest media attention at the beginning of the research. Each
researcher generated two documents (03-AI and 04-AI) with the tool using different prompts, each with a minimum of 2,000 characters, and recorded the prompts. The lan-
guage model from February 13, 2023 was used for all test cases.
Two additional texts of at least 2,000 characters were generated using fresh prompts
for ChatGPT, then the output was manipulated. It was decided to use this type of test
case, as students have a tendency to obfuscate results with the express purpose of hiding their use of an AI content generator. One set (05-ManEd) was edited manually, with a human exchanging some words for synonyms or reordering sentence parts, and the other (06-Para) was rewritten automatically with the AI-based tool Quillbot (Quillbot 2023), using the tool's default values for mode (Standard) and synonym level.
Documentation of the obfuscation, highlighting the differences between the texts, can
be found in the Appendix.
With nine researchers preparing texts (the eight authors and one collaborator), 54 test
cases were thus available for which the ground truth is known.
Table 4 gives an overview of the minimum/maximum sizes of text that could be exam-
ined by the free tools at the time of testing, if known.
PlagiarismCheck and Turnitin are combined text similarity detectors and offer an
additional functionality of determining the probability that the text was written by an AI, so
there was no limit on the amount of text tested. Signup was necessary for Check for
AI, Crossplag, Go Winston, GPT Zero, and OpenAI Text Classifier (a Google account
worked).
Data collection
The tests were run by the individual authors between March 7 and March 28, 2023.
Since Turnitin was not available until April, those tests were completed between April
14 and April 20, 2023. The testing of PlagiarismCheck was performed between May 2
and May 8, 2023. All 54 test cases were presented to each of the tools, for a total of 756 tests.

Table 5 Mapping of the reported probability of human authorship to classification labels

Human-written (NEGATIVE) text (docs 01-Hum & 02-MT), and the tool says that it is written by a:
[100-80%) human | True negative | TN
[80-60%) human | Partially true negative | PTN
[60-40%) human | Unclear | UNC
[40-20%) human | Partially false positive | PFP
[20-0%] human | False positive | FP

AI-generated (POSITIVE) text (docs 03-AI, 04-AI, 05-ManEd & 06-Para), and the tool says it is written by a:
[100-80%) human | False negative | FN
[80-60%) human | Partially false negative | PFN
[60-40%) human | Unclear | UNC
[40-20%) human | Partially true positive | PTP
[20-0%] human | True positive | TP

"[" or "]" means inclusive; "(" or ")" means exclusive
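The mapping above can be made concrete with a short illustrative sketch (our own illustration in Python, not code used by any of the tested tools): given the probability of human authorship a tool reports and the known ground truth of the document, it returns the five-step label.

    def classify(human_prob, ground_truth_is_ai):
        """Map a tool's reported human-probability (0-100) to the five-step label.
        Bracket convention as in the table above: e.g. [100-80%) includes 100 but not 80."""
        if human_prob > 80:
            band = 0        # tool says: clearly human
        elif human_prob > 60:
            band = 1        # mostly human
        elif human_prob > 40:
            band = 2        # unclear
        elif human_prob > 20:
            band = 3        # mostly AI
        else:
            band = 4        # clearly AI
        if ground_truth_is_ai:        # positive class: 03-AI, 04-AI, 05-ManEd, 06-Para
            return ["FN", "PFN", "UNC", "PTP", "TP"][band]
        return ["TN", "PTN", "UNC", "PFP", "FP"][band]

    classify(85, ground_truth_is_ai=True)    # -> "FN": an AI-generated text that goes undetected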
Evaluation model
For the evaluation, the authors were split into groups of two or three and tasked with
evaluating the results of the tests for the cases from either 01-Hum & 04-AI, 02-MT &
05-ManEd, or 03-AI & 06-Para. Since the tools do not provide an exact binary classifi-
cation, one five-step classification was used for the original texts (01-Hum & 02-MT)
and another one was used for the AI-generated texts (03-AI, 04-AI, 05-ManEd &
06-Para). They were based on the probabilities that were reported for texts being
human-written or AI-generated as specified in Table 5.
For four of the detection tools, the results were only given in the textual form (“very
low risk”, “likely AI-generated”, “very unlikely to be from GPT-2”, etc.) and these were
mapped to the classification labels as given in Table 6.
After all of the classifications had been completed and disagreements resolved, the measures of accuracy, false positive rate, and false negative rate were calculated.
Outcomes Results
Having evaluated the classification outcomes of the tools as (partially) true/false posi-
tives/negatives, the researchers evaluated this classification on two criteria: accuracy and
error type. In general, classification systems are evaluated using accuracy, precision, and
recall. The research authors also conducted an error analysis since the educational con-
text means different types of error have different significance.
Accuracy
When no partial results are allowed, i.e. only TN, TP, FN, and FP are possible, accuracy is defined as the ratio of correctly classified cases to all cases:

ACC = (TN + TP) / (TN + TP + FN + FP)

As our classification also contains partially correct and partially incorrect results (i.e., five classes instead of two), the basic, commonly used formula has to be adjusted to properly count these cases. There is no standard way in which this adjustment should be done. Therefore, we will use three different methods which we believe reflect different approaches that educators may take when interpreting the tools' outputs. The first (binary) approach counts only fully correct classifications (TN and TP) as correct and treats all other outcomes, including partially correct ones, as incorrect.
For the systems providing percentages of confidence, this method basically sets the
threshold of 80% (see Table 5). Table 7 shows the number of correctly classified docu-
ments, i.e. the sum of true positives and true negatives. The maximum for each cell is 9
(because there were 9 documents in each class), the overall maximum is 9 * 6 = 54. The
accuracy is calculated as a ratio of the total and the overall maximum. Note that even the
highest accuracy values are below 80%. The last row shows the average accuracy for each
document class, across all the tools.
This method provides a good overview of the number of cases in which the classifiers
are “sure” about the outcome. However, for real-life educational scenarios, partially cor-
rect classifications are also valuable. Especially in case 05-ManEd, which involved human
editing, the partially positive classification results make sense. Therefore, the researchers
explored more ways of assessment. These methods differ in the score awarded to various
incorrect outcomes.
In our second approach, we include partially correct evaluations and count them as
correct ones. The formula for accuracy computation is:

ACC_bin_incl = (TN + PTN + TP + PTP) / (TN + PTN + TP + PTP + FN + PFN + FP + PFP + UNC)
In case of systems providing percentages, this method basically sets the threshold of
60% (see Table 5). The results of this classification approach may be found in Table 8.
Obviously, all systems achieved higher accuracy, and the systems that provided more
partially correct results (GPT Zero, Check for AI) influenced the order.
In our third approach, which we call semi-binary evaluation, the researchers distin-
guish partially correct classifications (PTN or PTP) both from the correct and incorrect
ones. The partially correct classifications were awarded 0.5 points, while entirely correct
classification (TN or TP) still gained 1.0 points, as in the previous methods. The formula for accuracy calculation is:

ACC_semibin = (TN + 0.5 * PTN + TP + 0.5 * PTP) / (TN + PTN + TP + PTP + FN + PFN + FP + PFP + UNC)
Table 9 shows the assessment results of the classifiers using semi-binary classification.
The values correspond to the number of correctly classified documents with partially
correct results awarded half a point (TP + TN + 0.5 * PTN + 0.5 * PTP). The maximum
value is again 9 for each cell and 54 for the total.
A semi-binary approach to accuracy calculation captures the notion of partially
correct classification but still does not distinguish between various forms of incor-
rect classification. We address this issue by employing a third, logarithmic approach
to accuracy calculation that awards 1 point to completely incorrect classification
and doubles the score for each level of the classification that was closer to the cor-
rect result. The scores for the particular classifier outputs are shown in Table 10 and
the overall scores of the classifiers are shown in Table 11. Note that the maximum
value for each cell is now 9 * 16 = 144 and the overall maximum is 54 * 16 = 864. The accuracy, again, is calculated as a ratio of the total score and the maximum possible score. This approach provides the most detailed distinction among all varieties of (in)correctness.

Table 10 Scores awarded to classifier outputs in the logarithmic approach

AI-generated documents | Human-written documents | Score
FN | FP | 1
PFN | PFP | 2
UNC | UNC | 4
PTP | PTN | 8
TP | TN | 16

Fig. 5 Overall accuracy for each tool calculated as an average of all approaches discussed
As can be seen from Tables 7, 8, 9, and 11, the approach to accuracy evaluation
has almost no influence on the ranking of the classifiers. Figure 5 presents the overall
accuracy for each tool as the mean of all accuracy approaches used.
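To make the relationship between the four scoring schemes explicit, the following sketch of ours (not code from the study) computes each accuracy variant from a dictionary of outcome counts:

    LOG_POINTS = {"FN": 1, "PFN": 2, "UNC": 4, "PTP": 8, "TP": 16,
                  "FP": 1, "PFP": 2, "PTN": 8, "TN": 16}

    def accuracies(counts):
        """counts: dict mapping the outcome labels above to their frequencies."""
        total = sum(counts.values())
        binary = (counts["TP"] + counts["TN"]) / total                         # only fully correct
        binary_incl = binary + (counts["PTP"] + counts["PTN"]) / total         # partials count fully
        semi_binary = binary + 0.5 * (counts["PTP"] + counts["PTN"]) / total   # partials get half a point
        logarithmic = sum(LOG_POINTS[lab] * n for lab, n in counts.items()) / (16 * total)
        return binary, binary_incl, semi_binary, logarithmic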
Turnitin received the highest score using all approaches to accuracy classification,
followed by Compilatio and GPT-2 Output Detector (again in all approaches). This is
particularly interesting because as the name suggests, GPT-2 Output Detector was
not trained to detect GPT-3.5 output. Crossplag and Go Winston were the only other
tools to achieve at least 70% accuracy.
Fig. 6 Overall accuracy for each document type (calculated as an average of all approaches discussed)
Variations in accuracy
As Fig. 6 above shows, the overall average accuracy figure is misleading, as it obscures
major variations in accuracy between document types. Further analysis reveals the
influence of machine translation, human editing, and machine paraphrasing on over-
all accuracy:
Influence of machine translation The overall accuracy for case 01-Hum (human-writ-
ten) was 96%. However, in the case of the documents written by humans in languages
other than English that were machine-translated to English (case 02-MT), the accuracy
dropped by 20%. Apparently, machine translation leaves some traces of AI in the output,
even if the original was purely human-written.
Influence of machine paraphrase Probably the most surprising results are for case
06-Para (machine-generated with subsequent machine paraphrase). The use of AI to
transform AI-generated text results in text that the classifiers consider human-written.
The overall accuracy for this case was 26%, which means that most AI-generated texts
remain undetected when machine-paraphrased.
Fig. 7 Accuracy (logarithmic) for each document type by detection tool for AI-generated text
Precision
Another important indicator of a system's performance is precision, i.e. the ratio of true positive cases to all positively classified cases. Precision indicates the probability that a positive classification provided by the system is correct. For pure binary classifiers, precision is calculated as the ratio of true positives to all positively classified cases:

Prec = TP / (TP + FP)
Table 12 Overview of classification results and precision
Tool TP PTP FP PFP TN PTN FN PFN UNC Total Prec_incl Prec_excl
In the case of partially true/false positives, the researchers had two options for dealing with them. The exclusive approach counts them as negatively classified (so the formula does not change), whereas the inclusive approach counts them as positively classified:

Prec_incl = (TP + PTP) / (TP + PTP + FP + PFP)

Table 12 shows an overview of the classification results, i.e. all (partially) true/false positives/negatives. Both inclusive and exclusive precision values are also provided.
Precision is missing for Content at Scale because this system did not provide any posi-
tive classifications. The only system for which the inclusive precision is significantly dif-
ferent from the exclusive one, is GPT Zero which yielded the largest number of partially
false positives.
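A short sketch of ours (illustrative only) that computes both precision variants from the counts reported in Table 12:

    def precisions(counts):
        """Exclusive precision treats partial classifications as negative calls,
        inclusive precision treats them as positive calls; None is returned when a
        tool makes no positive classifications at all (as Content at Scale did)."""
        excl_denom = counts["TP"] + counts["FP"]
        incl_denom = counts["TP"] + counts["PTP"] + counts["FP"] + counts["PFP"]
        prec_excl = counts["TP"] / excl_denom if excl_denom else None
        prec_incl = (counts["TP"] + counts["PTP"]) / incl_denom if incl_denom else None
        return prec_excl, prec_incl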
Error analysis
In this section, the researchers quantify more indicators of tools’ performance, namely
two types of classification errors that might have significant consequences in educational
contexts: false positives leading to false accusations against a student and undetected
cases (students gaining an unfair advantage over others), i.e. the false negative ratio, which is closely related to recall.
Therefore, for each tool, we also computed the likelihood of a false accusation of a student as the ratio of false positives and partially false positives to all negative (human-written) cases, i.e.

(FP + PFP) / (TN + PTN + UNC + PFP + FP)
Table 13 shows the number of cases in which the classification of a particular docu-
ment would lead to a false accusation. The table includes only documents 01-Hum and
02-MT, because the AI-generated documents are not relevant. The risk of false accu-
sations is zero for half of the tools, as can also be seen from Figs. 4 and 5. Six of the
fourteen tools tested generated false positives, with the risk increasing dramatically for
machine-translated texts. For GPT Zero, half of the positive classifications would be
false accusations, which makes this tool unsuitable for the academic environment.
Fig. 12 False negatives for AI‑generated documents 03‑AI and 04‑AI together
For the sake of completeness, Table 14 also contains recall (1 - FNR), which indicates how many of the positive cases were correctly classified by the system.
Figures 6, 7, and 8 above show that 13 out of the 14 tested tools produced false negatives or partially false negatives for documents 03-AI and 04-AI; only Turnitin correctly classified all documents in these classes. None of the tools could correctly classify all AI-generated documents that underwent manual editing or machine paraphrasing.
As the document sets 03-AI and 04-AI were prepared using the same method, the
researchers expected the results would be the same. However, for some tools (OpenAI
Text Classifier and DetectGPT), the results were notably different. This could indicate a mistake made in testing or in the interpretation of the results. Therefore, the researchers double-checked all the results to avoid this kind of mistake. We also tried to upload some documents again. We did obtain different values, but we found that this was due to inconsistency in the results of these tools and not due to our mistakes.
Content at Scale misclassified all of the positive cases; these results in combination
with the 100% correct classification of human-written documents indicate that the
tool is inherently biased towards human classification and thus completely useless.
Overall, approximately 20% of the AI-generated texts would likely be misattributed to humans, meaning the risk of unfair advantage is significantly greater than that of false accusation.
Figures 9 and 10 show an even greater risk of students gaining an unfair advantage
through the use of obfuscation strategies. At an overall level, for manually edited texts (case 05-ManEd) the ratio of undetected texts increases to approximately 50%, and in the case of machine-paraphrased texts (case 06-Para) it rises even higher.

Fig. 16 Turnitin's similarity report shows up first; it is not clear that the "AI" indicator is clickable
Usability issues
There were a few usability issues that cropped up during the testing that may be
attributable to the beta nature of the tools under investigation.
For example, the tool DetectGPT at some point stopped working and only replied with the statement "Server error. We might just be overloaded. Try again in a few minutes?". This issue occurred after the initial testing round and persisted until the time of submission of this paper. Other tools would stall in an apparent infinite loop or throw an error message, and the test had to be repeated at a later time.
Writeful GPT Detector would not accept computer code: the tool apparently identified the code as not being in English, and it only accepted English texts.
Compilatio at one point returned "NaN% reliability" (see Fig. 11) for a ChatGPT-generated text that included program code. "NaN" is computer jargon for "not a number" and indicates that there were calculation issues such as division by zero or number representation overflow. Since a robot head was also returned, this was evaluated as correctly identifying ChatGPT-generated text, but the non-numerical percentage might confuse instructors using the tool.
The operation of a few of the tools was not immediately clear to some of the authors
and the handling of results was sometimes not easy to document. For example, in Pla-
giarismCheck the AI-Detection button was not always presented on the screen and it
would only show the last four tests done. Interestingly, Turnitin often returned high sim-
ilarity values for ChatGPT-generated text, especially for program code or program out-
put. This was distracting, as the similarity results were given first, the AI-detection could
only be accessed by clicking on a number above the text “AI” that did not look clickable,
but was, see Fig. 12.
Discussion
Detection tools for AI-generated text do fail; they are neither accurate nor reliable (all scored below 80% accuracy and only 5 scored over 70%). In general, they have been found to diagnose human-written documents as AI-generated (false positives) and often diagnose AI-generated texts as human-written (false negatives). Our findings are consistent with previously published studies (Gao et al. 2022; Anderson et al. 2023; Elkhatat et al. 2023; Demers 2023; Gewirtz 2023; Krishna et al. 2023; Pegoraro et al. 2023; van Oijen 2023; Wang et al. 2023) and differ substantially from what some detection tools for AI-generated text claim (Compilatio 2023; Crossplag.com 2023; GoWinston.ai 2023; Zero GPT 2023). The detection tools exhibit a strong bias towards classifying output as human-written rather than detecting AI-generated content. Overall, approximately 20%
of AI-generated texts would likely be misattributed to humans.
Fig. 18 AIDT23-05-JGD
Fig. 19 AIDT23-05-JPK
Fig. 20 AIDT23-05-LLW
Fig. 21 AIDT23-05-OLU
Fig. 22 AIDT23-05-PTR
Fig. 23 AIDT23-05-SBB
Fig. 24 AIDT23-05-TFO
Fig. 25 AIDT23-06-AAN
Fig. 26 AIDT23-06-DWW
Fig. 27 AIDT23-06-JGD
Fig. 28 AIDT23-06-JPK
Fig. 29 AIDT23-06-LLW
Fig. 30 AIDT23-06-OLU
Fig. 31 AIDT23-06-PTR
Fig. 32 AIDT23-06-SBB
Fig. 33 AIDT23-06-TFO
Abbreviations
01-Hum  Human-written
02-MT   Human-written in a non-English language with a subsequent AI/machine translation to English
03-AI   AI-generated text
04-AI   AI-generated text (second document, different prompt)
05-ManEd  AI-generated text with subsequent human manual edits
06-Para  AI-generated text with subsequent AI/machine paraphrase
In particular, finetuning a language model to detect “itself” has proven to be an effective strategy
for text detection. However, existing research suggests that a purely neural approach struggles with distribution shifts and is overly reliant on the decoding strategy with which the training data was generated (Solaiman et al., 2019). In other words, though the neural-based approach is highly performant for specific samples of text, it is significantly less robust.
Our project attempts to mitigate this shortcoming by combining the statistics- and neural-based
approaches in order to achieve the best of both worlds. We hypothesize that, regardless of the
language model, machine-generated and human-written text have fundamental statistical differences
that can be learned. Our main thesis is that explicitly encoding such statistical features into the input
can prevent the classifier from overfitting on the model-intrinsic features present in the training data,
thus improving its performance against a greater variety of generated text.
To do so, we add a custom “statistical embeddings” layer to a RoBERTa model pre-trained for
the synthetic text detection task. This embeddings layer extracts various statistical features from
the input sequence and injects them into the classification pipeline. We finetune both our baseline RoBERTa detector and our augmented detector on an open-source dataset of GPT-3 generated
text and human-written Wikipedia introductions. We observe similar performance for both models
on the test set partitioned from the same dataset (i.e. equal distribution). Both classifiers are then
evaluated against machine-generated and human-written “long answers” in the PubMedQA dataset
(i.e. different topic and distribution from the training set). Despite a marginal increase in accuracy
for the WikiIntro dataset, we were unable to see significant improvements in robustness using our
approach. We suspect that this may in part be attributed to the sparseness of our statistical feature
vectors. The specific results and analysis are presented in section 5.
5. Related Work
There has been significant prior research regarding the statistical, zero-shot approach to synthetic text
detection. A study by Nguyen-Son et al. (2017) reported promising results through
statistical analysis alone, relying on (i) word distribution frequencies, (ii) complex phrase features,
and (iii) sentence- and paragraph-level consistency. However, since 2017, the capabilities of NLG
models have grown exponentially, and thus it is expected that previous results will be significantly
challenged by state-of-the-art models such as GPT-3.
More recently, Tian released “GPTZero” which analyzes the perplexity and burstiness of a given
text to detect whether it was machine-generated. However, GPTZero has been shown to work poorly
when dealing with shorter sequences or against adversarial rewording and paraphrasing (Tian, 2022).
Perhaps the most significant breakthrough in this regard is “DetectGPT”, released in January 2023 by
Mitchell et al. (2023). It achieved significantly improved performance compared to other zero-shot
classification methods by using the observation that texts generated by LLMs tend to occupy negative
curvature regions of the model’s log probability function. Though the results are remarkable, the
paper reports that supervised models still perform better than DetectGPT for in-distribution data,
thus motivating our view that the neural-based approach has greater potential once its lack of
robustness is resolved.
Research in neural-based text detection has been greatly accelerated by high-performing language
models that can be applied to a myriad of downstream tasks. The most notable result to date is
OpenAI’s RoBERTa-based sequence classifier which was released alongside GPT-2. The model was
trained on a behemoth corpus of GPT-2 generated text and was able to outperform human detection
with an accuracy of 95%. However, research shows that the model’s accuracy drops significantly
when tested against distribution shifts (Solaiman et al., 2019). We later confirm this result in section 5.
6. Approach
Our main contribution to the synthetic text detection literature is in the combining of statistical and
neural methods. We attempt to improve upon OpenAI’s RoBERTa model by adding a statistical
embeddings layer that extracts and encodes useful information about the input sequence. The exact
ensemble of features used is described in more detail in section 4.2.
We were greatly inspired by the Transformer architecture and its way of encoding the notion of
position in self-attention. Namely, the Transformer architecture extracts positional information and
adds it to the input embeddings such that the order of words is accounted for by the model. We
perform a similar procedure in which a statistical feature vector is first constructed by our statistical
embeddings layer, then added into the input embeddings. The model architecture and design decisions
are discussed in more detail in section 4.3.
6.1 Baseline
As described above, our baseline is OpenAI’s RoBERTa sequence classifier pre-trained for GPT-2 text
detection. The sequence classifier consists of the original RoBERTa model and an additional classifi-
cation head. The classification head is a standard feed-forward network with dropout regularization,
a tanh non-linearity, and an output projection. We first compare our main model’s performance
against this pre-trained RoBERTa classifier (trained on GPT-2 data). We then finetune the RoBERTa
classifier on GPT-3 data and perform a similar comparison. The results of the baseline experiments
are presented in section 5.
We postulate that using an ensemble of statistical features can increase the robustness of our model.
In particular, we focus on robustness against adversarial prompts, e.g. asking ChatGPT to “generate
a text that uses many exclamation marks and commonly-used words.” It is intuitive that including
multiple features in our detection provides a straightforward hedge against such attacks, as targeting
more and more statistical features will only impede the model’s ability to generate fluent text.
In this section, we describe in brief detail the statistical features used by our model. We note that
the input sequence is first lemmatized using the WordNet corpus, prior to being processed by the
statistical embeddings layer.
1. Zipf: Zipf's law refers to the observation that many types of real-world data follow a particular distribution. Human-written text is one such example. Formally, given a distribution d_i of the i-th most common lemma in a document,

   d_i ∝ 1/i    (1)

   We use a log scale for both the rank and the frequency of a lemma and perform linear regression. The slope of the regression line is used as our Zipf feature of the input text.
2. Clumpiness: This feature focuses on the fact that human-written text tends to contain particular words more frequently than others, as opposed to machine-generated text, which is more uniform (i.e. less clumpy). We calculate the text's Gini coefficient as an indicator:

   G = ( Σ_{i=1}^{n} Σ_{j=1}^{n} |x_i − x_j| ) / ( 2 Σ_{i=1}^{n} Σ_{j=1}^{n} x_j )    (2)
6. Stop word ratio: Stop words are words that are generally filtered out in NLP applications due to their disproportionate frequency and comparative semantic neutrality (e.g. "a", "an", "the", etc.). We believe that human-written text will contain a higher ratio of such stop words. (A sketch of how these features can be computed is given after this list.)
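The following minimal sketch (our own simplified illustration, not our full extraction code) shows how the Zipf, clumpiness, and stop-word-ratio features could be computed from a tokenized document; the WordNet lemmatization step is omitted, the stop-word list is only an illustrative subset, and additional features such as kurtosis and punctuation counts are not shown.

    import math
    from collections import Counter

    STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "it"}   # illustrative subset only

    def zipf_slope(tokens):
        # slope of log(frequency) vs. log(rank) fitted by ordinary least squares
        freqs = sorted(Counter(tokens).values(), reverse=True)
        xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
        ys = [math.log(f) for f in freqs]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        den = sum((x - mx) ** 2 for x in xs) or 1.0
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / den

    def clumpiness(tokens):
        # Gini coefficient of the token frequency distribution (equivalent sorted form of Eq. (2))
        x = sorted(Counter(tokens).values())
        n, total = len(x), sum(x)
        return 2 * sum((i + 1) * v for i, v in enumerate(x)) / (n * total) - (n + 1) / n

    def stop_word_ratio(tokens):
        return sum(t in STOP_WORDS for t in tokens) / len(tokens)

    tokens = "the cat sat on the mat and the cat slept".split()
    feature_vector = [zipf_slope(tokens), clumpiness(tokens), stop_word_ratio(tokens)]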
Early data analysis of our WikiIntro dataset (detailed in section 5.1) yields the following plots shown
in Figure 34. We observe differences in the distribution of the Zipf, Clumpiness, and Kurtosis scores
between human-written text and generated text. Given that the sample size is equivalent for both
groups of text, we see that the relative modes of the distribution are different.
Figure 34: Zipf, Clumpiness, and Kurtosis score distributions for the Wikipedia dataset
Our model incorporates a “Statistical Embeddings Layer” which extracts the aforementioned statistical
feature vectors from the input sequence and concatenates them to produce a single vector s. The
statistical embedding is then passed into a feed-forward layer which applies a linear transformation to s followed by a ReLU non-linearity.
We then experiment with two versions of our model. The early fusion approach adds the output of the above to the input embeddings of the RoBERTa model (Figure 35a). This is largely due to our confidence in the quantity and quality of the training data we collected, which warrants an early fusion approach. The late fusion approach infuses the statistical features into the output of the RoBERTa model instead, prior to it being processed by the classification head (Figure 35b).
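The following condensed PyTorch sketch illustrates the architecture just described; the class names, the pooling of the first (<s>) token, and the exact wiring are our own simplifications rather than our full implementation.

    import torch.nn as nn

    class StatisticalEmbeddings(nn.Module):
        """Project the concatenated statistical feature vector s to the encoder's
        hidden size and apply a ReLU non-linearity."""
        def __init__(self, num_features, hidden_size=768):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(num_features, hidden_size), nn.ReLU())

        def forward(self, stat_feats):          # (batch, num_features)
            return self.proj(stat_feats)        # (batch, hidden_size)

    class FusionDetector(nn.Module):
        """'early' adds the statistical embedding to the token (word) embeddings,
        'late' adds it to the pooled sequence representation before the head."""
        def __init__(self, roberta, head, num_features, mode="late"):
            super().__init__()
            self.roberta, self.head, self.mode = roberta, head, mode
            self.stats = StatisticalEmbeddings(num_features, roberta.config.hidden_size)

        def forward(self, input_ids, attention_mask, stat_feats):
            s = self.stats(stat_feats)                                    # (batch, hidden)
            if self.mode == "early":
                emb = self.roberta.embeddings.word_embeddings(input_ids) + s.unsqueeze(1)
                out = self.roberta(inputs_embeds=emb, attention_mask=attention_mask)
            else:
                out = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
            pooled = out.last_hidden_state[:, 0]                          # <s> token
            if self.mode == "late":
                pooled = pooled + s
            return self.head(pooled)                                      # classification logits

Here `roberta` would be, for example, a Hugging Face RobertaModel and `head` a small feed-forward classifier producing two logits.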
The baseline code for the pre-trained RoBERTa sequence classifier comes from OpenAI Google
(2018). We have written code to load and process our dataset as well as extract various statistical
features. Furthermore, we make original modifications to the RoBERTa model by adding the layers
described above, as well as by enabling early and late fusion. Overall, the main innovation of our project lies in our approach of combining neural and statistical methods for the task of machine-generated text detection.
Figure 35: (a) Early Fusion Model and (b) Late Fusion Model
Human Written Intro (Truncated): "A full-time job is employment in which a person works a minimum number of hours defined as such by their employer. Full-time employment often comes with benefits that are not typically offered to part-time, temporary, or flexible workers, such as annual leave, sick leave, and health insurance. Part-time jobs are mistakenly..."

GPT-3 Generated Intro (Truncated): "A full-time job is employment in which a person works a set number of hours each week, typically 40 hours. In some countries, full-time employment is the norm, while in others it is less common. The term "full-time job" can refer to a variety of different types of employment, including traditional jobs..."

Figure 3: Examples of human-written and GPT-3 completed introductions for "Full-time job"
7 Experiments
7.1 Data
We use two publicly available datasets for the training and testing of our baseline and main model.
Both datasets consist of human-written text and machine-generated text that are processed and
appropriately labeled to be used in our pipeline (examples in Figure 3). The first dataset is a
collection of GPT-3 generated text and human-written Wikipedia introductions for 150,000 topics
(Aaditya Bhat, 2023). We use this data to train our baseline model (as the pretrained model has only
seen GPT-2 text) and our main model, both early and late fusion. We use an 80-10-10 split of our
dataset for training, validation, and testing. We also pre-process the data by randomly truncating each
pair of human-written and model-generated text to the same length.
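A sketch of the pre-processing and split described above (illustrative only; the exact truncation bounds and seeding we used are not reproduced here):

    import random

    def truncate_pair(human_text, generated_text, rng):
        # truncate both texts of a topic pair to the same random word count,
        # so that length alone cannot separate the two classes
        h, g = human_text.split(), generated_text.split()
        n = rng.randint(1, min(len(h), len(g)))
        return " ".join(h[:n]), " ".join(g[:n])

    def split_80_10_10(examples, seed=0):
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        a, b = int(0.8 * len(examples)), int(0.9 * len(examples))
        return examples[:a], examples[a:b], examples[b:]      # train / validation / test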
We then utilize the PubMedQA dataset to test the transferability of our model. In particular, we make
use of the “long answers” that contain both human-written and machine-generated samples (Jin et al.,
2019). We decided to use the PubMedQA dataset because of its difference in both the topic and the
NLG model from our training set, which can provide a good insight into the robustness of our model.
The results and analysis of our experiments are presented in section 6.
We use prediction accuracy and the area under the receiver operating characteristic curve (AUROC) as our
metrics. AUROC is commonly used to evaluate the performance of binary classification models. It
measures the ability of a model to distinguish between two classes by plotting the True Positive Rate
(TPR) against the False Positive Rate (FPR). The ROC curve is a graphical representation of the
performance of a binary classifier as the discrimination threshold is varied. The area under this curve
(AUROC) represents the probability that the classifier will rank a randomly chosen positive instance
higher than a randomly chosen negative instance. AUROC ranges from 0.0 to 1.0, where a value of
0.5 indicates that the model is no better than random guessing, and a value of 1.0 indicates a perfect
classifier.
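For instance, with scikit-learn (an assumption on our part; the metric itself is library-independent), accuracy and AUROC can be computed from the detector's scores as follows:

    from sklearn.metrics import accuracy_score, roc_auc_score

    y_true = [0, 0, 1, 1, 1]                     # 1 = machine-generated, 0 = human-written
    y_score = [0.10, 0.40, 0.35, 0.80, 0.90]     # detector's probability of the positive class

    acc = accuracy_score(y_true, [int(s >= 0.5) for s in y_score])   # fixed 0.5 threshold
    auroc = roc_auc_score(y_true, y_score)                           # threshold-independent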
The first step of our research involves testing the existing “gold standard” detection model (OpenAI’s
model pretrained on GPT-2) on our WikiIntros GPT-3 dataset. We use the results of this as the
baseline performance before making changes to model architecture and passing in our statistical
features.
Following this, we fine-tune the pretrained model using our GPT-3 training set, and test it to establish
a baseline for the “gold standard” model’s performance after undergoing fine-tuning with a relatively
small training set (LR of 2e-05, 5 epochs). We use this to quantify the performance of the existing
model when tuned on newer data.
The core of our research focuses on our improved model, which incorporates the aforementioned
statistical features. We train this model on the same training set (LR of 2e-05, 10 epochs) and compare
the performance with the results from the previous two parts. We perform this training process twice,
once on our early-fusion model and once on our late-fusion model.
8. Predictable Outcomes
Running the tests, we observe that finetuning the RoBERTa model on the WikiIntros dataset
allows the model to achieve near perfect accuracy – and a 16% improvement over the same model
trained on datasets generated by older LLM models, in this case GPT-2 (detailed in Table 1). We see
that our late-fusion model achieves marginally higher performance than this – suggesting that the
incorporation of statistical features does help the model improve accuracy.
One reason for the high performance across the models is the nature of the dataset. There are certain
high-level differences between the human-written text and AI-generated text, in that the generated
text is more generalized/overview focused, while the human-written (which are the Wikipedia pages)
might include references to other similar terms/disambiguations. This may contribute to the model
overfitting on this particular dataset.
We observed that none of the models performed well on the PubMedQA dataset, implying poor
robustness toward distribution shifts. One potential explanation is that a scientific Q&A database has
much lower variance when it comes to response content. Correct responses, AI-generated or not, will
contain the same content and likely in a similar format, given the nature of the prompt (in this case
the question). In the WikiIntros dataset, by contrast, the prompt is much more open-ended, simply describing the topic at hand. Apart from the nature of the dataset, we could also attribute the poor robustness of the model to certain shortcomings of our experimental approach. For instance, we believe
that the model benefited only marginally from the statistical features due to small magnitudes and
sparseness; extracting information regarding punctuation, especially, yielded highly sparse features.
A greater variety of statistical features with non-negligible magnitudes may be needed to augment a
neural classifier better. Given the nature of our approach in which we performed an element-wise
addition of the statistical embedding vector to the input embedding (for early fusion) or output vector
of RoBERTa (for late fusion), a relatively sparse vector with low magnitude elements may not have
been the optimal way to incorporate our statistical features. Concatenation, instead of element-wise
addition, may have been a more favorable choice of fusing in the aforementioned statistical features.
Moreover, for the model to learn the linear transformation of these statistical features, we may need
to train the model for a larger number of epochs.
9 Outcomes Conclusion
Through our research, we were able to identify statistical features that, when fed into our RoBERTa model (late fusion), allowed it to marginally outperform the existing gold standard model for the AI-generated text detection task on the WikiIntros dataset. Both models performed very well on the task
with near perfect accuracy, and were significantly better than existing models trained on generated
text from older models (GPT-2).
Our research sought to achieve greater robustness towards distribution shifts through augmenting a
neural classifier with statistical features of the input text. When we tested our baseline and early/late
fusion models that were trained on the WikiIntros dataset using the PubMedQA dataset, we witnessed
the late-fusion model achieve marginal improvements in the AUROC metric compared to the baseline.
However, the improvement itself is marginal, and the magnitude still remains in the vicinity of 0.5, suggesting the model is only slightly better than a random guess even after incorporating the statistical features.
building useful detectors. However, there is currently no work that provides a literature review of existing detection works and highlights important research challenges.
In this paper, we present a critical literature review of the existing detection research for English to aid
understanding of this important area. We organize the survey to guide the reader seamlessly through a
number of important aspects, as follows: First, we establish the background for the detection task, which
includes TGMs, decoding methods for text generation, and social impacts of TGMs (§2). Second, we
present various aspects of large-scale TGMs such as model architecture, training cost, and controllability
(§3). Third, we present and discuss the various existing detectors in terms of their underlying methods
(§4). Fourth, we provide a linguistically and computationally motivated analysis of key issues of the
state-of-the-art detector (§5). Fifth, we discuss interesting future research directions that can help in
building useful detectors (§6). Our main contributions are three-fold:
• We provide the first survey on the important, burgeoning area of detection of machine generated
text from human written text.
• We develop an error analysis of the current state-of-the-art detector, guided and illustrated by machine
generated texts, to shed light on the limitations of existing detection work.
• Motivated by our analysis and existing challenges, we propose a rich and diverse set of research
directions to guide future work in this exciting area.
10 Background
Here, we provide the background for the problem of detecting machine generated text from human
written text. Specifically, we introduce key concepts in training a TGM, generating text from a TGM,
and social implications of using TGMs in practice. Existing detection datasets are discussed in Appendix.
10.1 Training TGM
TGM is typically a neural language model (NLM) trained to model the probability of a token given the previous tokens in a text sequence, i.e., p_θ(x_t | x_1, ..., x_{t−1}), with tokens coming from a vocabulary, x_i ∈ V. If x = (x_1, ..., x_{|x|}) represents the text sequence, p_θ typically takes the form p_θ(x) = Π_{t=1}^{|x|} p_θ(x_t | x_1, ..., x_{t−1}). If p*(x) denotes the reference distribution and D denotes a finite set of text sequences from p*, the TGM estimates parameters θ by minimizing the following objective function:

L(p_θ, D) = − Σ_{j=1}^{|D|} Σ_{t=1}^{|x^{(j)}|} log p_θ(x_t^{(j)} | x_1^{(j)}, ..., x_{t−1}^{(j)}).    (1)
Notice that TGM can be a non-neural model (e.g., n-gram LM) and based on nontraditional LM objective
(e.g., masked language modeling (Devlin et al., 2019; Song et al., 2019)). In this survey, we focus pri-
marily on TGMs for English that are neural and based on traditional LM objective, as they are successful
in generating coherent paragraphs of English text.
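The objective in Eq. (1) is the standard next-token negative log-likelihood; a minimal PyTorch sketch (assuming a `model` that maps a batch of token ids to next-token logits) is:

    import torch.nn.functional as F

    def lm_loss(model, input_ids):
        """Negative log-likelihood of Eq. (1): every token is predicted from the tokens before it.
        `model` is assumed to return logits of shape (batch, seq_len, |V|)."""
        logits = model(input_ids)
        pred = logits[:, :-1, :]                    # predictions for positions 2..T
        target = input_ids[:, 1:]                   # the tokens that actually follow
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))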
10.2 Generating text from TGM
Given a sub-sequence (prefix), x_{1:k} ∼ p*, the task of generating text from a TGM is to use p_θ to conditionally decode a continuation, x̂_{k+1:N} ∼ p_θ(· | x_{1:k}), such that the resulting completion (x_1, ..., x_k, x̂_{k+1}, ..., x̂_N) resembles a sample from p* (Welleck et al., 2020). In a news article generation task, the prefix can be a headline and the continuation the body of the news article. In a story generation task, the prefix can be the beginning of a story and the continuation the rest of the story. Since the computation of the optimal continuation (x̂_{k+1:N}) is not tractable, with a time complexity of O(|V|^(N−k)), approximate deterministic or stochastic decoding methods are utilized to generate continuations.
Deterministic methods: In deterministic methods, the continuation is fully determined by the TGM
parameters and prefix. The two most commonly used deterministic decoding methods are greedy search
and beam search. Greedy search works by selecting the highest-probability token at each time step: x_t = arg max p_θ(x_t | x_1, ..., x_{t−1}), with a time complexity of O((N − k)|V|). On the other hand, beam search maintains a fixed-size (b) set of partially decoded sequences, called hypotheses. At each time step, beam search creates new hypotheses by appending each token in the vocabulary to each existing hypothesis and scoring the resulting sequences using p_θ, with a time complexity of O((N − k)b|V|). In practice, these deterministic decoding methods depend highly on the underlying model probabilities and suffer from producing degenerate continuations, i.e., generic text, often with repetitive tokens (Holtzman et al., 2020). Recently, Welleck et al. (2020) showed that the degeneracy issues with beam search can be alleviated by training a TGM with the original TGM objective (Eq. (1)) augmented with an unlikelihood objective that assigns lower probabilities to unlikely generations.
Stochastic methods: Stochastic decoding methods work by sampling from a model-dependent distribution at each time step, x_t ∼ q(x_t | x_1, ..., x_{t−1}, p_θ). In unrestricted sampling (also known as pure sampling), the chance of sampling a low-confidence token from the unreliable tail of the distribution is very high, leading to text that can be unrelated to the prefix. To reduce the chance of sampling a low-confidence token, sampling is limited to a subset of the vocabulary W ⊂ V at each time step. Let Z = Σ_{x∈W} p_θ(x | x_1, ..., x_{t−1}). If x_t ∈ W, q(x_t | x_1, ..., x_{t−1}, p_θ) is set to p_θ(x_t | x_1, ..., x_{t−1})/Z, otherwise it is set to 0. The two most effective stochastic decoding methods are top-k sampling (Fan et al., 2018) and top-p (or nucleus) sampling (Holtzman et al., 2020). The top-k sampler limits sampling to the k most probable tokens, that is, W is the size-k subset of V that maximizes Σ_{x∈W} p_θ(x | x_1, ..., x_{t−1}). The top-k sampler uses a constant value of k, which can be sub-optimal in different contexts, that is, the generated text is limited to a subset of the natural language distribution. For example, generic contexts (e.g., predicting a noun) might require a larger value of k, while other contexts (e.g., predicting prepositions) might require a smaller value of k so that only useful candidate tokens are considered. The nucleus sampler overcomes the burden of considering only a fixed number of tokens by limiting sampling to the smallest set of tokens with total mass above a threshold p ∈ [0, 1], i.e., W is the smallest subset with Σ_{x∈W} p_θ(x | x_1, ..., x_{t−1}) ≥ p. Thus, the number of candidate tokens considered varies dynamically depending on the context, and the resulting text is reasonably natural with fewer repetitions. Recently, Massarelli et al. (2020) show that the top-k and top-p samplers tend to generate more nonfactual sentences, as verified against Wikipedia.
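Both samplers can be summarized in a few lines; the sketch below (our own illustration using PyTorch, not tied to any particular TGM) performs one decoding step given the model's next-token distribution.

    import torch

    def sample_next_token(probs, k=None, p=None):
        """One stochastic decoding step; `probs` is the model's next-token distribution
        (a 1-D tensor over the vocabulary V)."""
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        if k is not None:                              # top-k: keep the k most probable tokens
            sorted_probs, sorted_idx = sorted_probs[:k], sorted_idx[:k]
        if p is not None:                              # top-p: smallest prefix with mass >= p
            cumulative = torch.cumsum(sorted_probs, dim=0)
            cutoff = int(torch.searchsorted(cumulative, torch.tensor(p))) + 1
            sorted_probs, sorted_idx = sorted_probs[:cutoff], sorted_idx[:cutoff]
        sorted_probs = sorted_probs / sorted_probs.sum()   # renormalise over the kept set W
        choice = torch.multinomial(sorted_probs, num_samples=1)
        return int(sorted_idx[choice])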
TGM | training text sequence x (data size / params) | prefix x_{1:k} | continuation x̂_{k+1:N} | decoding method | threats discussed
GPT-2 (Radford et al., 2019) | fragments from WebText (collection of internet articles) (40GB / 1.5B) | starting of an article (e.g., a few lines about a research finding) | rest of the article (e.g., rest of the research finding) | top-k | NA
GROVER (Zellers et al., 2019) | news articles along with their meta-information from RealNews (120GB / 1.5B) | meta-information/body of a news article (e.g., headline, author) | missing meta-information/body in the prefix | top-p | trustworthy fake news
CTRL (Keskar et al., 2019) | control code (e.g., URL) followed by text (e.g., news article) from several domains (140GB / 1.6B) | control code (e.g., URL), optionally with some strings of text | article corresponding to the control code | greedy search with repetition penalty | NA
Adelani et al. (2020) | product reviews (fine-tuning GPT-2) (20GB / 0.1B) | product review (human written) | product review (machine) | top-k | fake product reviews
Dathathri et al. (2020) | no training and no fine-tuning | beginning of a story or general articles | rest of the story or article | top-k | NA
GPT-3 (Brown et al., 2020) | fragments from CommonCrawl (570GB / 175B) | three previous news articles and title of a proposed article | body of the proposed article | top-p | fake news

Table 18: Summary of the characteristics of TGMs that can act as threat models. The last column corresponds to the threats discussed in the original paper.
including social media, email clients, government websites, and e-commerce websites.
various entities of variable sizes and resource capabilities can practically deploy models for spreading
disinformation using TGMs.
12 Detectors
In this section, we discuss various detectors for identifying machine generated text from human writ-
ten text. To aid understanding of the literature, we organize the detectors according to the underlying
methods on which they are based.
GPT-2 model).
Detecting machine configuration: Tay et al., (2020) study the extent to which different modeling
choices (decoding method, TGM model size, prompt length) leave artifacts (detectable signatures that
arise from modeling choices) in the generated text. They propose the task of identifying the TGM mod-
eling choice given the text generated by TGM. They show that a classifier can be trained to predict the
modeling choice well beyond the chance level, which ascertains that text generated by TGM may be
more sensitive to TGM modeling choices than previously thought. They also find that the proposed
detection task of identifying text generated by different TGM modeling choices is less difficult than the task of identifying text generated by a TGM from human written text under different TGM modeling choices. They also show that word order does not matter much, as a bag-of-words detector performs very similarly to detectors based on complex encoders (e.g., transformers). This result is consistent with the recent
work done by Uchendu et al., (2020), which shows that simple models (traditional ML models trained
on psychological features and simple neural network architectures) perform well in three settings: (i)
classify if two given articles are generated by the same TGM; (ii) classify if a given article is written
by a human or a TGM (the original detection problem); (iii) identify the TGM that generated a given
article (similar to Tay et al. (2020)). For the original detection problem, the authors find the text
generated by the GPT-2 model to be hard to detect among several TGMs (see Appendix for the list of
studied TGMs).
12.2 Zero-shot classifier
In the zero-shot classification setting, a pretrained TGM (for example, GPT-2, GROVER) is employed
to detect generations from itself or similar models. The detector does not require supervised detection
examples for further training (i.e., fine-tuning).
Total log probability: Solaiman et al., (2019) present a baseline that uses TGM to evaluate total log
probability, and thresholds based on this probability to make the prediction. For instance, text is predicted
as machine generated if the overall likelihood of the text according to the GPT-2 model is closer to the
mean likelihood over all machine generated texts than to the mean likelihood of human written texts.
However, they find that this classifier performs poorly compared to the previously discussed logistic
regression based classifier (§4.1).
Giant Language model Test Room (GLTR) tool: The GLTR tool (Gehrmann et al., 2019) proposes a
suite of baseline statistical methods that can highlight the distributional differences in text generated
by GPT-2 model and human written text. Specifically, GLTR enables the study of a piece of text by
visualizing per-token model probability, per-token rank in the predicted next token distribution, and
entropy of the predicted next token distribution. Based on these visualizations, the tool clearly shows
that TGMs over-generate from a limited subset of the true distribution of natural language. Indeed, rare
word usage in text generated by the GPT-2 model is markedly lower than in human written text. The tool lets humans (including non-experts) study a piece of text, but might become less effective in the future once TGMs start generating text that lacks statistical anomalies.
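A rough sketch of the per-token statistics that GLTR visualizes, computed here with the Hugging Face implementation of GPT-2 (the GLTR authors' own code differs in detail):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def per_token_stats(text):
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits                       # (1, seq_len, vocab)
        probs = torch.softmax(logits[0, :-1], dim=-1)        # prediction made at each position
        targets = ids[0, 1:]                                 # the token that actually came next
        token_probs = probs[torch.arange(len(targets)), targets]
        ranks = (probs > token_probs.unsqueeze(-1)).sum(dim=-1) + 1   # rank 1 = model's top choice
        return token_probs.tolist(), ranks.tolist()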
itself as the BERT detector and the BERT generator possess similar inductive bias. Uchendu et al., (2020)
show that the off-the-shelf GROVER detector does not perform well in detecting text generated by TGMs
other than the original GROVER model.
RoBERTa detector: Solaiman et al., (2019) experiment with fine-tuning the RoBERTa language model
for the detection task and establishes the state-of-the-art performance in identifying the web pages gener-
ated by the largest GPT-2 model with ∼95% accuracy. The RoBERTa detector trained on top-p examples
transfers well to examples from all the other decoding methods (pure and top-k). Regardless of the de-
tector model’s capacity, the detector performs well when trained on examples from the larger GPT-2
model and transfers well to examples generated by a smaller GPT-2 model. On the other hand, training
on smaller GPT-2 model’s outputs results in poor performance in classifying the larger GPT-2 model’s
outputs. The most interesting finding of this work is that fine-tuning using the RoBERTa model achieves
higher accuracy than fine-tuning a GPT-2 model with equivalent capacity. This result might be due to
the superior quality of the bidirectional representations inherent in the masked language modeling ob-
jective employed by the RoBERTa language model compared to the GPT-2 language model, which is
limited by learning only unidirectional representation (left to right). This finding contradicts that of the
GROVER work (Zellers et al., 2019), where the authors conclude that the best models for detecting neu-
ral disinformation from a TGM is the TGM itself. Recently, Fagni et al., (2020) show that the RoBERTa
detector establishes the state-of-the-art performance in spotting machine generated tweets from human
written tweets accurately, outperforming both traditional ML models (e.g., bag-of-words) and complex
neural network models (e.g., RNN, CNN) by a large margin. This interesting result indicates that the
RoBERTa detector can generalize to publication sources unseen during its pretraining such as Twitter.
The RoBERTa detector also outperforms existing detectors in spotting news articles generated by several
TGMs (Uchendu et al., 2020) and product reviews generated by the GPT-2 model fine-tuned on Amazon
product reviews (Adelani et al., 2020).
Issues with the state-of-the-art detector
In this section, we discuss open issues in the state-of-the-art detector based on the RoBERTa model,
which has been shown to excel in detecting text generated by TGM based on news articles, product
reviews, tweets, and web pages (see §4.3). 3 We focus on the task of detecting text generated by the
GPT-2 model from human written Amazon product reviews, a challenging task given the shortness of
reviews. We employ the RoBERTa detector on the publicly available dataset, containing generations
from the GPT-2 model (1542M parameters) based on pure, top-k and top-p sampling along with human
written reviews (see Appendix for dataset details). In Figure 36, we plot the accuracy of the detector
w.r.t. number of training examples per class, averaged over ten random initializations to control for
initialization effects. We observe that the RoBERTa detector needs several thousands of examples to
reach high accuracy. Specifically, it has an impractical requirement of 200K, 15K and 50K training
examples for performing at 90% accuracy on identifying pure, top-k and top-p examples respectively. 4
Given that creation of large datasets for the detection task is hard (Zellers et al., 2019), it is important to
investigate whether the data-efficiency of the RoBERTa detector can be significantly improved.
Figure 36: Detection accuracy of the RoBERTa detector w.r.t. number of training examples per class,
averaged over ten random initializations.
We manually inspect 100 randomly picked false positives (machine generated product review incor-
rectly predicted as human written product review) of the RoBERTa detector trained on 15K examples
each from top-p generations and from human written reviews.5 Below, we list the error categories that we have identified and provide at least one example for each category.
Fluency: Among the false positive reviews, we find 73 reviews to be very fluent; these could confuse even humans, as in example (1).
(1) I loved this film. I can’t really explain why, but when I first saw it it struck me as bizarre, al-
most oddball, but I quickly got over that and remembered that I love oddball films. This was
an early 80’s film. A great film to see on a gloomy rainy evening. This film is suspenseful
and full of weirdness. Add this to your collection.
Shortness: Out of these 73 identified fluent reviews, 27 reviews are very short, with a median of 24
words. We give two examples below:
(2) love it. best sweeper.
(3) My favorite combo. Always works and usually cools my system to boot. So glad I got these
instead of other brands.
Factuality: We find 10 false positive reviews to contain factual errors.
3. Concurrent with our work, Zhong et al. (2020) propose a detector that leverages the factual and coherence structure underlying the text, which outperforms the RoBERTa detector in spotting machine generated text based on news articles and web pages. We also acknowledge that detectors fine-tuned on state-of-the-art NLMs such as T5 (Raffel et al., 2020) and ELECTRA (Clark et al., 2020) would most likely outperform the RoBERTa detector in general.
4. Given that attackers can create synthetic text at scale using TGMs, 90% detection accuracy might not be a high accuracy.
5. As seen in §2.2 and §4, top-p sampling produces good quality text that reasonably matches the style of human writing and is also harder for humans to detect. We leave the study of false negatives for future work. Our annotation of 100 false positives can be accessed at: https://github.com/UBC-NLP/coling2020_machine_generated_text.
in building intelligent TGMs that narrows the gap between machine and human distribution of natural
language text, auxiliary signals could play a crucial role in mitigating the threats posed by TGMs.
Assessing veracity of the text
Existing detectors have an assumption that the fake text is determined by the source (e.g., TGM) that
generated the text. This assumption does not hold true in two practical scenarios: (i) real text auto-
generated in a process similar to that of fake text, and (ii) adversaries creating fake text by modifying
articles originating from legitimate human sources. Schuster et al., (2020) show that existing detectors
perform poorly in these two scenarios as they rely too much on distributional features, which cannot help
in distinguishing texts from similar sources. Hence, we call for more research on detectors that assess
the veracity of machine generated text by consulting external sources, like knowledge bases (Thorne and
Vlachos, 2018) and diffusion network (Vosoughi et al., 2018), instead of relying only on the source.
12.5 Building generalizable detectors
Existing detectors exhibit poor cross-domain accuracy, that is, they are not generalizable to different
publication formats (Wikipedia, books, news sources) (Bakhtin et al., 2019). Beyond publication formats
and topics (e.g., politics, sports), the detector should also transfer to unseen TGM settings such as model
architecture, different decoding methods (e.g., top-k, top-p), model size, different prefix lengths, and
training data (Bakhtin et al., 2020; Uchendu et al., 2020).
12.6 Building interpretable detectors
We discussed the importance of human raters pairing up with automatic detectors in §4.4. A viable way
for this collaboration is to make the decisions taken by the automatic detector interpretable (as in GLTR) so that human raters can logically group the model's decisions (e.g., contradictions) and "accept", "modify", or "reject" them. This calls for more research into building detectors that can provide explanations for their decisions which are understandable to humans.
12.7 Building detectors robust to adversarial attacks
Existing detectors are brittle, i.e., the detector decisions can vary significantly for even small changes
in the text input. For example, Wolff (2020) shows that the RoBERTa detector can be attacked using
simple schemes such as replacing characters with homoglyphs and misspelling some words. These two
attacks reduce the detector's recall on text generated by a TGM from 97.44% to 0.26% and 22.68%, respectively. Therefore, it is important to study various adversarial attacks, ranging from simple attacks (e.g.,
misspellings) to advanced attacks (e.g., universal attacks (Wallace et al., 2019)) and create adversarial
examples with an aim to characterize the vulnerabilities of the detector as well as to make the detector
robust against various attacks.
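As a concrete illustration of the first attack type, the toy sketch below (our own example; not the exact substitutions used by Wolff (2020)) replaces a few Latin characters with visually identical Cyrillic ones:

    HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}   # Latin -> Cyrillic look-alikes

    def homoglyph_attack(text):
        # the output looks identical to a human reader, but the detector's tokenizer
        # now sees unfamiliar characters in place of common ones
        return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

    attacked = homoglyph_attack("this review was generated by a language model")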
13. Conclusion
Detectors able to tease apart machine generated text from human written text can play a vital role in
mitigating misuse of TGMs such as in automatic creation of fake news and fake product reviews. Our
categorization of existing detectors and related issues into classifiers trained from scratch, zero-shot clas-
sifiers, fine-tuning NLMs, and human-machine collaboration can help readers contextualize each detector
w.r.t. the fast-growing literature. We also hope that our computationally and linguistically motivated error
analysis of the state-of-the-art detector can bring readers up to speed on many existing challenges in
building useful detectors. Our rich and diverse set of research directions also has the potential to guide
future work in this exciting area.
14. Acknowledgements
We thank Ramya Rao Basava and Peter Sullivan for helpful discussions in the initial stage of the
project. We gratefully acknowledge support from the Natural Sciences and Engineering Research Council
of Canada, Compute Canada (https://www.computecanada.ca), and UBC ARC–Sockeye
(https://doi.org/10.14288/SOCKEYE).
15. References
David Ifeoluwa Adelani, Haotian Mai, Fuming Fang, Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen. 2020.
Generating Sentiment-Preserving Fake Online Reviews Using Neural Language Models and Their Human-
and Machine-Based Detection. In Proceedings of the 34th International Conference on Advanced Information
Networking and Applications, AINA-2020, volume 1151, pages 1341–1354.
Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc’Aurelio Ranzato, and Arthur Szlam. 2019. Real or
Fake? Learning to Discriminate Machine from Human Generated Text. CoRR, abs/1906.03351.
Anton Bakhtin, Yuntian Deng, Sam Gross, Myle Ott, Marc’Aurelio Ranzato, and Arthur Szlam. 2020. Energy-
Based Models for Text. CoRR, abs/2004.10188.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with
Subword Information. Transactions of the Association for Computational Linguistics, pages 135–146.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language
Models are Few-Shot Learners. CoRR, abs/2005.14165.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. Electra: Pre-training text
encoders as discriminators rather than generators. In International Conference on Learning Representations.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual Language Model Pretraining. In Advances in Neural
Information Processing Systems 32, pages 7059–7069.
Kate Crawford. 2017. The trouble with bias. NIPS 2017 Keynote.
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and
Rosanne Liu. 2020. Plug and Play Language Models: A Simple Approach to Controlled Text Generation.
In International Conference on Learning Representations.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and
Short Papers), pages 4171–4186.
Liam Dugan, Daphne Ippolito, Arun Kirubarajan, and Chris Callison-Burch. 2020. RoFT: A Tool for Evaluating
Human Detection of Machine-Generated Text. CoRR, abs/2010.03070.
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
Tiziano Fagni, Fabrizio Falchi, Margherita Gambini, Antonio Martella, and Maurizio Tesconi. 2020. TweepFake:
about Detecting Deepfake Tweets. CoRR, abs/2008.00036.
Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical Neural Story Generation. In Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
889–898.
Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. 2019. GLTR: Statistical Detection and Visualization
of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics:
System Demonstrations, pages 111–116.
Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: A Framework for Modeling the Local
Coherence of Discourse. Computational Linguistics, 21(2):203–225.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text
Degeneration. In International Conference on Learning Representations.
Dirk Hovy. 2016. The Enemy in Your Own Camp: How Well Can We Detect Statistically-Generated Fake Reviews
– An Adversarial Study. In Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 351–356.
Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. Automatic Detection of Generated
Text is Easiest when Humans are Fooled. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, pages 1808–1822.
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A
Conditional Transformer Language Model for Controllable Generation. CoRR, abs/1909.05858.
Guanxiong Liu, Tzu-Ming Harry Hsu, Matthew B. A. McDermott, Willie Boag, Wei-Hung Weng, Peter Szolovits,
and Marzyeh Ghassemi. 2019a. Clinically Accurate Chest X-Ray Report Generation. In Proceedings of the
Machine Learning for Healthcare Conference, MLHC, volume 106, pages 249–269.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Approach.
CoRR, abs/1907.11692.
Luca Massarelli, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio
Silvestri, and Sebastian Riedel. 2020. How Decoding Strategies Affect the Verifiability of Generated Text.
CoRR, abs/1911.03587.
Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring Stereotypical Bias in Pretrained
Language Models. CoRR, abs/2004.09456.
Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR’s
WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation
(Volume 2: Shared Task Papers, Day 1), pages 314–319.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding
by Generative Pre-Training. https://s3-us-west-2.amazonaws.com/openai-assets/
research-covers/language-unsupervised/language_understanding_paper.pdf.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language
Models are Unsupervised Multitask Learners. https://d4mucfpksywv.cloudfront.net/
better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li,
and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140):1–67.
Tal Schuster, Roei Schuster, Darsh J. Shah, and Regina Barzilay. 2020. The Limitations of Stylometry for Detecting
Machine-Generated Fake News. Computational Linguistics.
Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and
Jasmine Wang. 2019. Release Strategies and the Social Impacts of Language Models. CoRR, abs/1908.09203.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked Sequence to Sequence
Pre-training for Language Generation. In International Conference on Machine Learning, pages 5926–5936.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep
Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pages 3645–3650.
Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding,
Kai-Wei Chang, and William Yang Wang. 2019. Mitigating Gender Bias in Natural Language Processing:
Literature Review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pages 1630–1640.
Reuben Tan, Bryan A. Plummer, and Kate Saenko. 2020. Detecting Cross-Modal Inconsistency to Defend Against
Neural Fake News. CoRR, abs/2009.07698.