I take this opportunity to express my sincere thanks and deep gratitude to all those people who
extended their wholehearted co-operation and have helped me in completing this project
successfully.
First of all, I would like to thank Ms. Sapna Gupta and Mrs. Amita Goel for their precious
time and constant support whenever needed. They have been a driving force behind the
successful completion of this project.
I would also like to express my deep gratitude towards MAIT for considering me a part of
their organization and providing such a great platform to learn and enhance my skills.
A very special thanks goes to all the faculty of Maharaja Agrasen Institute of Technology, under
whose guidance I have been able to excel in my career and reach such a prestigious
organization.
I hereby declare that the work presented in this project report titled "AI
GENERATED TEXT DETECTION MODEL", submitted by me in partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology
(B.Tech.) in the Department of Information Technology and Engineering,
Maharaja Agrasen Institute of Technology, is an authentic record of my project
work carried out under the guidance of Ms. Sapna Gupta. The matter presented
in this project report has not been submitted either in part or in full to any
university or institute for the award of any degree.
Fig 6 Overall accuracy for each document type (calculated as an average of all approaches discussed) 27
Fig 7 Accuracy (logarithmic) for each document type by detection tool for AI-generated text 28
Fig 8 False accusations for human-written documents 31
Fig 9 False accusations for machine-translated documents 31
Fig 10 False negatives for AI-generated documents 03-AI 32
Fig 16 Turnitin’s similarity report shows up first; it is not clear that the “AI” is clickable 35
Fig 17 Writer’s suggestion to lower “detectable AI content” 36
Fig 18 AIDT23-05-JGD 37
Fig 19 AIDT23-05-JPK 37
Fig 20 AIDT23-05-LLW 38
Fig 21 AIDT23-05-OLU 38
Fig 22 AIDT23-05-PTR 39
Fig 23 AIDT23-05-SBB 39
Fig 24 AIDT23-05-TFO 40
Fig 25 AIDT23-06-AAN 40
Fig 26 AIDT23-06-DWW 41
Fig 27 AIDT23-06-JGD 41
Fig 28 AIDT23-06-JPK 42
Fig 29 AIDT23-06-LLW 42
Fig 30 AIDT23-06-OLU 43
Fig 31 AIDT23-06-PTR 43
Fig 32 AIDT23-06-SBB 44
Fig 33 AIDT23-06-TFO 44
1. Acknowledgment
2. Declaration
3. Supervisor’s Certificate
4. List of figures
5. Abstract 1
6. Introduction 1
7. Literature Review 3
8. Material and Methods 5
9. Analysis 10
10. Test Functions 19
11. Outcomes Results 23
12. Variations in accuracy 27
13. Consistency in tool results 28
14. Usability issues 35
15. Discussion 36
16. Case studies 06-Para 40
17. Related Work 45
18. Approach 45
a. Baseline 46
b. Statistical Feature Selection 46
c. Model Architecture 47
19. Experiments
a. Data 48
b. Evaluation method 48
c. Experimental details 48
20. Predictable Outcomes 49
21. Explaining Poor Model Performance within PubMed 49
22. Outcomes Conclusion 49
23. Training TGM 50
24. Generating text from TGM 51
25. Social impacts of TGMs 51
26. Text generative models 52
27. Model architecture, training data, training cost 52
28. Controllable generation 53
29. Detectors 53
30. Conclusion 57
31. References 58
Automatic Detection of AI Generated
Text Model
Abstract
Text generative models (TGMs) excel in producing text that matches the style of human language
reasonably well. Such TGMs can be misused by adversaries, e.g., by automatically generating
fake news and fake product reviews that can look authentic and fool humans. Detectors that can
distinguish text generated by TGM from human written text play a vital role in mitigating such
misuse of TGMs. Recently, there has been a flurry of works from both natural language pro-
cessing (NLP) and machine learning (ML) communities to build accurate detectors for English.
Despite the importance of this problem, there is currently no work that surveys this fast-growing
literature and introduces newcomers to important research challenges. In this work, we fill this
void by providing a critical survey and review of this literature to facilitate a comprehensive un-
derstanding of this problem. We conduct an in-depth error analysis of the state-of-the-art detector
and discuss research directions to guide future work in this exciting area.
1 Introduction
Current state-of-the-art text generative models (TGMs) excel in producing text that approaches the style
of human language, especially in terms of grammaticality, fluency, coherency, and usage of real world
knowledge (Radford et al., 2019; Zellers et al., 2019; Keskar et al., 2019; Bakhtin et al., 2020; Brown
et al., 2020). TGMs are useful in a wide variety of applications, including story generation (Fan et al.,
2018), conversational response generation (Zhang et al., 2020), code auto-completion (Solaiman et al.,
2019), and radiology report generation (Liu et al., 2019a). However, TGMs can also be misused for fake
news generation (Zellers et al., 2019; Brown et al., 2020; Uchendu et al., 2020), fake product reviews
generation (Adelani et al., 2020), and spamming/phishing (Weiss, 2019). Thus, it is important to build
tools that can minimize the threats posed by the misuse of TGMs.
The commonly used approach to combat the threats posed by the misuse of TGMs is to formulate
the problem of distinguishing text generated by TGMs and human written text as a classification task.
The classifier, henceforth called detector, can be used to automatically remove machine generated text
from online platforms such as social media, e-commerce, email clients, and government forums, when
the intention of the TGM generated text is abuse. An ideal detector should be: (i) accurate, that is, good
accuracy with a good trade-off for false positives and false negatives depending on the online platform
(email client, social media) on which TGM is applied (Solaiman et al., 2019); (ii) data-efficient, that
is, needs as few examples as possible from the TGM used by the attacker (Zellers et al., 2019); (iii)
generalizable, that is, detects text generated by different modeling choices of the TGM used by the
attacker such as model architecture, TGM training data, TGM conditioning prompt length, model size,
and text decoding method (Solaiman et al., 2019; Bakhtin et al., 2020; Uchendu et al., 2020); (iv)
interpretable, that is, detector decisions need to be understandable to humans (Gehrmann et al., 2019);
and (v) robust, that is, the detector can handle adversarial examples (Wolff, 2020). Given the importance
of this problem, there has been a flurry of research recently from both NLP and ML communities on
scientific community, and we evaluate the quality of scientific literature generated by AI compared to
scientific language authored by humans.
In recent years, a lot of interest has been directed toward the potential of generative models such
as ChatGPT to generate human-like writing, images, and other forms of media. An OpenAI-developed
variant of the ubiquitous GPT-3 language model, ChatGPT is designed specifically for generating text
suitable for conversation and may be trained to perform tasks including answering questions, translating
text, and creating new languages [10]. Even though ChatGPT and other generative models have made great
advances in creating human-like language, reliably determining the difference between writing
generated by a machine and text written by a human remains an open challenge. This is
of the utmost importance in applications such as content moderation, where it is
important to identify and remove hazardous information and automated spam [11].
Recent work has centered on enhancing pre-trained algorithms' capacity to identify text generated
by artificial intelligence. Along with GPT-2, OpenAI also released a detection model consisting of a
RoBERTa-based binary classification system that was taught to distinguish between human-written and
GPT-2-generated text. Black et al. (2021) combine source-domain data with in-domain labeled data to
overcome the difficulty of detecting GPT-2-generated technical research literature [12]. The DagPap22
challenge and dataset on detecting machine-generated scientific publications were proposed by
Kashnitsky et al. [13] at the COLING 2022 workshop on Scholarly Document Processing. Models
such as GPT-3, GPT-Neo, and led-large-book-summary were used to generate abstracts. DagPap22's
prompt templates require including information on the primary topic and scientific structural function,
making it more probable that the tool will collect problematic and easily discoverable synthetic abstracts
[14], [15]. More recently, GPTZero has been proposed to detect ChatGPT-generated text, primarily based
on perplexity. Recent studies have revealed two major issues that need addressing. To begin, every study
given here had to make do with small data samples. Thus, a larger, more robust data set is required to
advance our understanding. Second, researchers have typically used mock data to fine-tune final versions
of pre-trained models. Ideally, text created with various artificial intelligence programs should all be
detectable by the same approach.
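To make the perplexity-based idea behind tools such as GPTZero concrete, the short Python sketch below scores a passage with an off-the-shelf GPT-2 model from Hugging Face transformers; lower perplexity indicates more predictable, machine-like text. This is only an illustration of the general principle, not GPTZero's actual implementation, and the decision threshold is an arbitrary assumption.

    # Illustrative perplexity scoring with GPT-2 (not GPTZero's actual method).
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def perplexity(text: str) -> float:
        # Let the model predict each token from its left context.
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        # out.loss is the mean negative log-likelihood per token; exp() gives perplexity.
        return float(torch.exp(out.loss))

    sample = "Artificial intelligence is transforming the way we write and read text."
    ppl = perplexity(sample)
    # Arbitrary illustrative threshold: very low perplexity hints at machine-generated text.
    print(f"perplexity = {ppl:.1f}:", "possibly AI-generated" if ppl < 40 else "likely human-written")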
Recently developed algorithms for detecting AI-generated text can tell the difference between the
two. These systems employ state-of-the-art models and algorithms to decipher text created by artificial
intelligence. One API that can accurately identify AI-generated content is Check For AI, which analyses
text samples. A further tool called Compilatio uses sophisticated algorithms to identify instances of
plagiarism, even when they are present in AI-generated content. Similarly, Content at Scale assesses
patterns, writing style, and other language properties to spot artificially generated text. Crossplag is an
application programming interface (API) that can detect AI-generated text in a file. The DetectGPT
artificial intelligence content detector can easily identify GPT model-generated text. Go Winston can be used
to identify artificially manufactured news and social media content. Machine learning and linguistic
analysis are used by GPT Zero to identify AI-generated text. The GPT-2 Output Detector Demo over at
OpenAI makes it simple to test if a text was produced by a GPT-2 model. OpenAI Text Classifier uses an
application programming interface to categorize text, including text generated by artificial intelligence. In
addition to other anti-plagiarism features, PlagiarismCheck may identify information created by artificial
intelligence. Turnitin is another well-known tool that uses AI-generated text detection to prevent plagiarism.
The Writeful GPT Detector is a web-based tool that uses pattern recognition to identify artificially produced
text. Last but not least, the Writer can spot computer-generated text and check the authenticity and
originality of written materials. Academics, educators, content providers, and businesses must deal with the
challenges of AI-generated text, but new detection techniques are making it easier.
This research will conduct a comparative analysis of AI-generated text detection tools using a self-
generated custom dataset. To accomplish this, the researchers will collect datasets using a variety of
artificial intelligence (AI) text generators and humans. This study, in contrast to others that have been
reported, incorporates a wide range of text formats, sizes, and organizational patterns. The next stage is to
test and compare the tools with the proposed tool. The following bullet points present the most significant
takeaways from this study's summary findings.
• Collecting a dataset using different LLMs on different topics with different sizes and writing styles.
• Investigation of detection tools for AI text detection on the collected dataset.
• Comparison of the proposed tool to other cutting-edge tools to demonstrate the interpretability of the best tool among them.
The remainder of this article is structured as follows: Section 2 provides a brief literature review, Section 3
describes the materials and methods, and Section 4 presents the results and discussion. The final section
summarizes the work and recommends directions for future research.
2. Literature Review
Artificial intelligence-generated text identification has sparked a paradigm change in the ever-evolving
fields of both technology and literature. This innovative approach arises from the combination of artificial
intelligence and language training, in which computers are given access to reading comprehension
strategies developed by humans. Like human editors have done for decades in the physical world, AI-
generated text detection is now responsible for determining the authenticity of literary works in the digital
world. The ability of algorithms to tell the difference between human-authored material and that generated
by themselves is a remarkable achievement of machine learning. As we venture into new waters, there are
serious implications for spotting plagiarized work, gauging the quality of content, and safeguarding writers'
rights. AI-generated text detection is a guardian for literary originality and reassurance that the spirit of
human creativity lives on in future algorithms, connecting the past and future of written expression.
Guo et al. [16] carried out a detailed analysis of three custom-built models on a dataset they created, the
Human ChatGPT Comparison Corpus (HC3). Using F1 scores at the corpus and sentence level, they evaluated the
performance of the three models and found that the best model reached a maximum F1 score of 98.78%. The results
indicated superiority over other SOTA strategies. Because the authors of the corpus only used abstract
paragraphs from a small subset of the available research literature, the dataset is skewed toward that subset.
The presented models may need to perform better on a general-purpose data set. Wang et al. [17] have
offered a benchmarked dataset-based comparison of several AI content detection techniques. For evaluation,
they have used question-and-answer, code-summarization, and code-generation databases. The study found
an average AUC of 0.40 across all selected AI identification techniques, with the datasets used for
comparison containing 25k samples for both human and AI-generated content. Due to the lack of diversity
in the dataset, the chosen tools performed poorly; a biased dataset cannot demonstrate effective performance.
With a more diverse dataset, the tools could achieve higher accuracy and a higher area under the curve (AUC).
Catherine et al. [18] employed the 'GPT-2 Output Detector' to assess the quality of generated
abstracts. The study's findings revealed a significant disparity between the abstracts created and the actual
abstracts. The AI output detector consistently assigned high 'false' scores to the generated abstracts, with a
median score of 99.98% (interquartile range: 12.73%, 99.98%). This suggests a strong likelihood of
machine-generated content. The initial abstracts exhibited far lower levels of 'false' ratings, with a median
of 0.02% and an interquartile range (IQR) ranging from 0.02% to 0.09%. The AI output detector exhibited
a robust discriminatory ability, evidenced by its AUROC (Area Under the Receiver Operating
Characteristic) value of 0.94. Utilizing a website and iThenticate software to conduct a plagiarism detection
assessment revealed that the generated abstracts obtained higher scores, suggesting a greater linguistic
similarity to other sources. Remarkably, human evaluators had difficulties in distinguishing between
authentic and generated abstracts. The researchers achieved an accuracy rate of 68% in accurately
identifying abstracts generated by ChatGPT; notably, 14% of the original abstracts were incorrectly judged
to be machine-generated. These critiques have brought attention to the issue of abstracts that are
suspected to be generated by artificial intelligence (AI).
Using a dataset generated by the users themselves, Debora et al. [19] compared and contrasted
multiple AI text detection methods. The research compared 12 publicly available tools and two proprietary
systems available only to qualified academic institutions and other research groups. The researchers' primary focus
has been explaining why and how artificial intelligence (AI) techniques are useful in the academy and the
sciences. The results of the comparison were then shown and discussed. Finally, the limitations of AI
technologies regarding evaluation criteria were discussed.
To fully evaluate such detectors, the team [20] first trained DIPPER, a paraphrase generation model
with 11 billion parameters, to rephrase entire texts in response to contextual information such as user-
generated cues. Using scalar controls, DIPPER's paraphrased results can be tailored to vocabulary and
sentence structure. Extensive testing proved that DIPPER's paraphrase of AI-generated text could evade
watermarking techniques and GPTZero, DetectGPT, and OpenAI's text classifier. The detection accuracy
of DetectGPT was decreased from 70.3% to 4.6% while maintaining a false positive rate of 1% when
DIPPER was used to paraphrase text generated by three well-known big language models, one of which
was GPT3.5-davinci-003. These rephrases were impressive since they didn't alter the original text's
meaning. The study developed a straightforward defense mechanism to safeguard AI-generated text
identification from paraphrase-based attacks. For this defense strategy to work, language model API providers
must retrieve previously generated texts that are semantically similar to the candidate text; the algorithm
searches a collection of already generated sequences for comparable ones. A 15-million-generation
database derived from a finely tuned T5-XXL model confirmed the efficacy of this defense strategy. The
software identified paraphrased generations in 81% to 97% of test cases, demonstrating its efficacy.
Remarkably, only 1% of human-written sequences were incorrectly labeled as AI-generated by the software.
The project made its code, models, and data publicly available to pave the way for additional work on
detecting and protecting AI-generated text.
OpenAI [21], an AI research company, compared manual and automatic ML-based synthetic text
recognition methods. Utilizing models trained on GPT-2 datasets enhances the inherent authenticity of the
text created by GPT-2, hence facilitating human evaluators' identification of erroneous datasets.
Consequently, the team evaluated a rudimentary logistic regression model, a detection model based on fine-
tuning, and a detection model employing zero-shot learning. A logistic regression model was trained using
TFIDF, unigram, and bigram features and evaluated using various generating processes and model
parameters afterward. The most basic classifiers demonstrated an accuracy rate of 97% or higher, although the
models struggled to identify shorter outputs. Topological Data Analysis (TDA) was utilized by Kushnareva et
al. [22] to count graph components, edges, and cycles, and these features were then used for machine-learning-based
text recognition. A logistic regression classifier was trained on these features using WebText, Amazon Reviews,
RealNews, and GROVER [23]. Because the approach has not been thoroughly tested on ChatGPT, its success there remains uncertain.
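The simple logistic-regression baseline mentioned above (TF-IDF over unigrams and bigrams) can be reproduced in a few lines of scikit-learn. The sketch below assumes small in-memory lists of example texts and labels and is not tied to OpenAI's released detector code.

    # Minimal TF-IDF (unigram + bigram) + logistic regression baseline.
    # The toy texts and labels below are illustrative assumptions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    train_texts = [
        "The committee met on Tuesday and argued for two hours about the budget.",
        "My grandmother's recipe calls for more butter than seems reasonable.",
        "As an AI language model, I can provide a concise overview of the topic.",
        "In conclusion, the aforementioned factors collectively underscore the significance of the subject.",
    ]
    train_labels = [0, 0, 1, 1]  # 0 = human-written, 1 = AI-generated

    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(vectorizer.fit_transform(train_texts), train_labels)

    test_text = ["Overall, these considerations collectively underscore the importance of the topic."]
    print(classifier.predict(vectorizer.transform(test_text)))        # predicted class
    print(classifier.predict_proba(vectorizer.transform(test_text)))  # class probabilities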
The online application DetectGPT was used to zero-shot identify and separate AI-generated text
from human-generated text in another investigation [24]. Log probabilities from the generative model were
employed. The researchers observed that machine-generated text tends to lie in regions of negative curvature
of the model's log probability function. The method assumes that the log probabilities of the model under
discussion are always accessible and, according to the authors, it only works with GPT-2 prompts. In another study, Mitrovic et
al. trained an ML model to distinguish ChatGPT-generated responses from human ones [25]. ChatGPT-generated two-line
restaurant reviews were recognized by DistilBERT, a lightweight model distilled from BERT and fine-tuned for the
task, and SHAP was used to explain the model's predictions. The researchers observed that the ML model struggled to
recognize ChatGPT messages. The authors introduced AICheatCheck, a web-based AI detection tool that can
distinguish if a text was produced by ChatGPT or a human [26]. AICheatCheck analyzes text patterns to determine
a text's origin. The authors relied on the data of Guo et al. [16] and the education domain, making do with limited
data, and the study does not fully explain AICheatCheck's precision. The topic was recently investigated by Cotton
et al. [27], who discuss the benefits and drawbacks of using ChatGPT in the classroom with respect to plagiarism. In another text [28],
the authors used statistical distributions to analyze generated data. Using the GLTR application, they highlight
the input text in different colors according to how predictable each token is under the model. The questions used
in the GLTR evaluation were written by the general public and based on the publicly accessible human- and GPT-generated
content for the 1.5B parameter model [10]. The authors also studied human subjects by having students spot instances
of fabricated news.
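The colour-coding idea behind GLTR described above can be approximated with a language model by ranking each observed token under the model's next-token distribution and bucketing the ranks (GLTR uses roughly top-10, top-100, and top-1000 bins). The sketch below uses GPT-2 from Hugging Face transformers and is an approximation of the idea, not the GLTR codebase itself.

    # Approximate GLTR-style analysis: rank each observed token under GPT-2's
    # next-token distribution and bucket the ranks into GLTR-like colour bins.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    def token_ranks(text: str):
        ids = tokenizer(text, return_tensors="pt")["input_ids"]
        with torch.no_grad():
            logits = model(ids).logits            # shape: (1, seq_len, vocab_size)
        ranks = []
        for pos in range(1, ids.shape[1]):
            prev_logits = logits[0, pos - 1]      # distribution predicted from the prefix
            actual = ids[0, pos]
            rank = int((prev_logits > prev_logits[actual]).sum()) + 1
            ranks.append((tokenizer.decode([int(actual)]), rank))
        return ranks

    def bucket(rank: int) -> str:
        # GLTR-like bins: green (top-10), yellow (top-100), red (top-1000), violet (rest).
        return "green" if rank <= 10 else "yellow" if rank <= 100 else "red" if rank <= 1000 else "violet"

    for token, rank in token_ranks("The quick brown fox jumps over the lazy dog."):
        print(f"{token!r:>12}  rank={rank:<6d} {bucket(rank)}")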
Different AI-generated text classification models have been presented in recent years, with approaches
ranging from deep learning and transfer learning to machine learning. Furthermore, software incorporated
the most effective models to help end users verify AI-generated writing. Some studies evaluate various AI
text detection tools by comparing their performance on extremely limited and skewed datasets. Therefore,
it is necessary to have a dataset that includes samples from many domains written in the same language that
the models were trained in. It remains unclear which of the many proposed tools for AI text
identification is the most effective. To find the best tool for each sort of material, whether from a research
community or content authors, it is necessary to do a comparative analysis of the top-listed tools.
3. Material and Methods
Many different tools for recognizing artificial intelligence-created text were compared in this analysis. The
approach outlined here consists of three distinct phases. Examples of human writing are collected from
many online sources, and OpenAI frameworks are used to generate examples of AI writing from various
prompts (such as articles, abstracts, stories, and comment writing). In the following stage, six detection tools
are selected and applied to the newly formed dataset. Finally, the performance of the tools is evaluated based on
several state-of-the-art measurements, allowing end users to pick the best alternative. Figure 1 depicts the
overall structure of the execution process.
[Figure 1: Overall structure of the execution process (Loading Responses → Calculate Results → Decision)]
Paraphrase [31], GPT-2 [32], GPT-3 [33], DaVinci, GPT-3.5, OPT-IML [34], and Flan-T5 [35]). The final
number in the table is 11,580, the total number of samples across both classes. This table is useful since it shows
where the testing dataset came from and how the AI models were evaluated. Table 1 presents details of
testing samples used for evaluation of targeted AI generated text detection tools.
Table 1: Testing dataset samples (AH&AITD).
Class | Source | Number of Samples | Total
Human Written | Open Web Text | 2343 | 5790
Human Written | Blogs | 196 |
Human Written | Web Text | 397 |
Human Written | Q&A | 670 |
Human Written | News Articles | 430 |
Human Written | Opinion Statements | 1549 |
Human Written | Scientific Research | 205 |
AI Generated | ChatGPT | 1130 | 5790
AI Generated | GPT-4 | 744 |
AI Generated | Paraphrase | 1694 |
AI Generated | GPT-2 | 328 |
AI Generated | GPT-3 | 296 |
AI Generated | Davinci | 433 |
AI Generated | GPT-3.5 | 364 |
AI Generated | OPT-IML | 406 |
AI Generated | Flan-T5 | 395 |
Total | | 11580 | 11580
The data collection method for human-written samples is based on a variety of approaches.
Most human-written samples were manually obtained from human-written research articles, and only
abstracts were used. Additionally, open web texts are harvested from various websites, including Wikipedia.
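A brief sketch of how the AH&AITD testing set in Table 1 could be loaded and sanity-checked is given below; the file name and column names are assumptions made for illustration rather than the actual distribution format of the dataset.

    # Illustrative loading of the testing dataset; "ah_aitd.csv" and its columns
    # ("text", "source", "class") are assumed names, not the actual release format.
    import pandas as pd

    df = pd.read_csv("ah_aitd.csv")

    # Sanity checks mirroring Table 1: 5,790 samples per class, 11,580 in total.
    print(df["class"].value_counts())              # Human Written vs AI Generated counts
    print(df.groupby(["class", "source"]).size())  # per-source breakdown
    assert len(df) == 11580, "unexpected total number of samples"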
3.2 AI Generated Text Detection Tools
Without AI-generated text detection systems, which monitor automated content distribution online, the
modern digital ecosystem would collapse. These systems employ state-of-the-art machine learning and
natural language processing methods to identify and label data generated by AI models like GPT-3,
ChatGPT, and others. They play a vital role in the moderation process by preventing the spread of
misinformation, protecting online communities, and identifying and removing fake news. Platform
administrators and content moderators can use these tools to spot literature generated by artificial
intelligence by seeing patterns, language quirks, and other telltale signals. The importance of AI-created
text detection tools for user security, ethical online discourse, and legal online material has remained strong
despite the advancements in AI. In this section, AI generated text detection tools are described briefly.
3.2.1 AI Text Detection API
Zyalab's AI Text Detection API [36] makes locating and analyzing text in various content types simple.
This API employs cutting-edge artificial intelligence (AI) technology to precisely recognize and extract
textual content from various inputs, including photos, documents, and digital media. AI Text Detection API
uses cutting-edge OpenAI technology to identify ChatGPT content. Its high accuracy and simple interface
let instructors spot plagiarism in student essays and other AI-generated material. Its ease of integration into
workflows and use by non-technical users is a major asset. Due to OpenAI's natural language processing
powers, the API can detect even mild plagiarism, ensuring the information's uniqueness. It helps teachers
grade essays by improving the efficiency of checking student work for originality. In conclusion, the AI
Text Detection API simplifies, accurately, and widely applies plagiarism detection and essay grading for
content suppliers, educators, and more. Due to its ability to analyze text and provide a detailed report, this
tool can be used for plagiarism detection, essay grading, content generation, chatbot building, and machine
learning research. There are no application type constraints, merely API request limits. The API makes use
of OpenAI technology. It has a simple interface and high accuracy, allowing it to detect plagiarism in AI-
generated writing and serve as an essay detector for teachers.
3.2.2 GPTKIT
The innovators of GPTKit [37] saw a need for an advanced tool to accurately identify Chat GPT material,
so they built one. GPTKit is distinguished from other tools because it utilizes six distinct AI-based content
recognition methods, all working together to considerably enhance the precision with which AI-generated
content may be discovered. Educators, professionals, students, content writers, employees, and independent
contractors worried about the accuracy of AI-generated text will find GPTKit highly adaptable. When users
input text for analysis, GPTKit uses these six techniques to assess the content’s authenticity and accuracy.
Customers can try out GPTKit’s features for free by having it return the first 2048 characters of a response
to a request. Due to the team’s dedication to continuous research, the detector in GPTKit now claims an
impressive accuracy rate of over 93% after being trained on a big dataset. You can rest easy knowing that
your data will remain private during detection and afterward, as GPTKit only temporarily stores information
for processing and promptly deletes it from its servers. GPTKit is a great tool to use if you wish to validate
information with artificial intelligence (AI) for authenticity or educational purposes.
3.2.3 GPTZero
GPTZero is the industry standard for identifying Large Language Model documents like ChatGPT. It
detects AI content at the phrase, paragraph, and document levels, making it adaptable. The GPTZero model
was trained on a wide range of human-written and AI-generated text, focusing on English prose. After
servicing 2.5 million people and partnering with 100 education, publishing, law, and other institutions,
GPTZero is a popular AI detector. Users may easily enter text for analysis using its simple interface, and
the system returns detailed detection findings, including sentence-by-sentence highlighting of AI-detected
material, for maximum transparency. GPTZero supports numerous AI language models, making it a
versatile AI detection tool. ChatGPT, GPT-4, GPT-3, GPT-2, LLaMA, and AI services are included. It was
the most accurate and trustworthy AI detector of seven tested by TechCrunch. Customized for student
writing and academic prose, GPTZero is ideal for school. Despite its amazing powers, GPTZero [38] admits
it has limitations in the ever-changing realm of AI-generated entertainment. Thus, teachers should combine
its findings into a more complete assessment that prioritizes student comprehension in safe contexts. To
help teachers and students address AI misuse and the significance of human expression and real-world
learning, GPTZero emphasizes these topics. In-person evaluations, edited history analysis, and source
citations can help teachers combat AI-generated content. Due to its commitment to safe AI adoption,
GPTZero may be a reliable partner for educators facing AI issues.
3.2.4 Sapling
Sapling AI Content Detector [39] is a cutting-edge program that accurately recognizes and categorizes AI-
generated media. This state-of-the-art scanner uses advanced technology to verify the authenticity
and integrity of text by checking for the existence of AI-generated material in various contexts. Whether in
the courtroom, the publishing industry, or the classroom, Sapling AI Content Detector is a potent solution
to the issue of AI-generated literature. Its straightforward interface and comprehensive detection results
equip users to make informed judgments about the authenticity of the material. Sapling AI Content
Detector's dedication to precision and dependability makes it a valuable resource for companies and
individuals serious about preserving the highest possible content quality and originality requirements.
3.2.5 Originality
The Originality AI Content Detector [40] was intended to address the growing challenge of identifying AI-
generated text. This cutting-edge artificial intelligence influence detector can tell if human-written content
has been altered. It examines every word, every sentence, every paragraph, and every document. Human-
made and computer-generated texts may be distinguished with confidence thanks to the rich variety of
training data. Educators, publishers, academics, and content producers will find this tool invaluable for
guarding the authenticity and integrity of their own work. The Originality AI Content Detector highlights
potential instances of AI-generated literature to increase awareness and promote the responsible use of AI
technologies in writing. The era of AI-driven content creation gives users the knowledge to make purposeful
decisions that preserve the quality and originality of their writing.
3.2.6 Writer
The Writer AI Content Detector [41] is cutting-edge software for spotting content created by artificial
intelligence. This program utilizes cutting-edge technologies to look for signs of artificial intelligence in
the text at the phrase, paragraph, and overall document levels. Since it was trained on a large dataset
including human-authored and AI-generated content, the Writer detector is very good at telling them apart. This
guide is a must-read for every instructor, publisher, or content provider serious about their craft. By alerting
users to the presence of AI content and offering details about it, the Writer arms them with the information
they need to protect the authenticity of their works. The author is an honest champion of originality,
advocating for responsible and ethical content generation in an era when AI is increasingly involved in the
creative process. Developers can get the Writer AI Content Detector SDK by running "pip install writer."
The use of API credentials for writer authentication is critical [42]. These API keys can be found on your
account's dashboard. Simply replace the sample API keys in the code snippets with your own or sign in for
individualized code snippets. Users who do not have access to their secret API keys on the control panel and wish
to become a member of a Writer account's development team should contact the account's owner. Developers can
access the Writer SDK and AI Content Detector once signed in. The SDK includes document and user
management tools, content identification, billing information retrieval, content production, model
customization, file management, snippet handling, access to a style guide, terminology management, user
listing, and management. With this full suite of resources, customers can confidently include AI-driven
content recognition into their projects and apps without compromising safety or precision.
3.3 Experimental Setup
Six distinct content identification approaches developed using artificial intelligence were
evaluated in depth for this study. Each tool has an API that can be used with various languages
and frameworks. To take advantage of these features, subscriptions have been obtained for each
API, and the software has been put through its paces with Python scripts. The results were produced
using the testing dataset discussed above. All experiments have been run on a Dell system with a
sixth-generation Intel Core i7 processor, 24 GB of RAM, and a 256 GB SSD, using Python 3.11 in VS Code with
Jupyter Notebook integration.
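The per-tool testing loop can be outlined as below. The endpoint URL, request fields, and response schema are placeholders (each real tool has its own API format documented by its vendor), so this is only a sketch of how the Python scripts iterate over the dataset, not the actual client code of any of the six tools.

    # Outline of the testing loop; the endpoint and JSON fields are hypothetical
    # placeholders, not the real API of any specific detection tool.
    import requests

    API_URL = "https://api.example-detector.com/v1/detect"  # placeholder endpoint
    API_KEY = "YOUR_API_KEY"                                 # obtained with the subscription

    def detect(text: str) -> float:
        """Return the tool's probability that `text` is AI-generated (assumed schema)."""
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"text": text},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["ai_probability"]  # assumed response field

    # Score every sample and store predictions for the metric calculations below.
    samples = [("Some passage from the testing dataset...", "Human Written")]
    predictions = [(label, detect(text)) for text, label in samples]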
3.4 Evaluation Policy
To ensure the robustness, dependability, and usefulness of a company's machine-learning models, the
company should develop and adhere to an evaluation policy. This policy spells out the evaluation, validation, and
application of models in detail. As a first step, it establishes a standardized approach to evaluation,
allowing for fair and uniform assessment of model performance across projects. Comparing projects,
identifying best practices, and maximizing model development are all made easier with the introduction of
uniform standards. Second, a policy for assessing model performance guarantees that they hit targets for
measures like accuracy, precision, and recall. As a result, only high-quality, reliable models with strong
performance are deployed. Reduced implementation risks are achieved through the policy's assistance in
identifying model inadequacies, biases, and inaccuracies. An assessment policy fosters accountability and
trustworthiness in data science by requiring uniformity and transparency in model construction.
Accuracy is important in machine learning and statistics because it measures the overall quality of a model's
predictions. Accuracy is the percentage of correctly predicted cases out of the total number of instances in the
dataset. It is defined as:

Accuracy = (Number of Correct Predictions / Total Number of Predictions) × 100
In this formula, the "Total Number of Predictions" represents the size of the dataset, while the
“Number of Correct Predictions” is the number of predictions made by the model that correspond to the
actual values. A quick and dirty metric to gauge a model's efficacy is accuracy, but when one class greatly
outnumbers the other in unbalanced datasets, this may produce misleading results.
Precision is the degree to which a model's positive predictions are correct. In the areas of statistics
and machine learning, it is a common metric, defined as the ratio of true positive predictions to all
positive predictions. The precision equation can be described as follows:

Precision = True Positives / (True Positives + False Positives)
In practical use, precision quantifies how well the model avoids false positives. A high
precision score indicates that when the model predicts a positive outcome, it is more likely to be true, which
is especially important in applications where false positives could have major consequences, such as
medical diagnosis or fraud detection.
Recall (true positive rate or sensitivity) is an important performance metric in machine learning
and classification applications. It measures a model's ability to discover and label every instance of interest
in a given dataset. Recall is calculated with the following formula:

Recall = True Positives / (True Positives + False Negatives)
In this formula, TP represents the total number of true positives, whereas FN represents the total
number of false negatives. Medical diagnosis and fraud detection are two examples of areas where missing
a positive instance can have serious effects; such applications profit greatly from a model with high recall,
which indicates that the model effectively catches a large proportion of the true positive cases.
The F1 score is a popular metric in machine learning that combines precision and recall into a single
value, offering a fairer evaluation of a model's efficacy, especially when working with unbalanced datasets.
The formula for its determination is as follows:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Precision is the proportion of correct positive predictions relative to the total number of positive predictions
made by the model, whereas recall measures the same proportion relative to the number of genuine positive
cases in the dataset. The F1 score excels when a compromise between reducing false positives and false
negatives is required, such as medical diagnosis, information retrieval, and anomaly detection. By factoring
in precision and recall, F1 is a well-rounded measure of a classification model's efficacy.
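For completeness, the four metrics defined above can be computed directly from a tool's binary predictions with scikit-learn, as in the brief sketch below (the labels shown are made-up examples).

    # Computing the metrics defined above from binary predictions (1 = AI-generated).
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # ground truth (made-up example)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # tool output  (made-up example)

    print("Accuracy :", accuracy_score(y_true, y_pred) * 100, "%")
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1 Score :", f1_score(y_true, y_pred))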
A machine learning classification model's accuracy can be evaluated using the ROC curve and the
Confusion Matrix. The ROC curve compares the True Positive Rate (Sensitivity) to the False Positive Rate
(1-Specificity) at different cutoffs to understand a model's discriminatory ability. The Confusion Matrix,
which meticulously tabulates model predictions into True Positives, True Negatives, False Positives, and
False Negatives, provides a more detailed basis for assessing accuracy, precision, recall, and F1-score. Data
scientists and analysts can use these tools to learn everything they need to know about model performance,
threshold selection, and striking a balance between sensitivity and specificity in classification jobs.
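Both diagnostics described in this paragraph can likewise be produced with scikit-learn. The sketch below assumes the detector also returns a continuous AI-probability score, which is what a ROC curve requires; the scores shown are invented for illustration.

    # Confusion matrix and ROC/AUC for a detector that returns AI-probability scores.
    from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

    y_true = [1, 1, 1, 0, 0, 0, 1, 0]                   # ground truth (made-up example)
    y_score = [0.9, 0.4, 0.8, 0.2, 0.1, 0.7, 0.6, 0.3]  # detector's AI-probability output

    y_pred = [1 if s >= 0.5 else 0 for s in y_score]    # 0.5 threshold for the matrix
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("TP, TN, FP, FN:", tp, tn, fp, fn)

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
    print("AUC:", roc_auc_score(y_true, y_score))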
4. Analysis
The results and discussion surrounding these tools reveal intriguing insights into the usefulness and
feasibility of six AI detection approaches for differentiating AI-generated text from human-authored
content. Detection technologies, including GPTZero, Sapling, Writer, AI Text Detection API (Zyalab),
Originality AI Content Detector, and GPTKIT, were ranked based on several factors, including accuracy,
precision, recall, and F1-score. Table 2 compares different AI text detection tools that can be used to
tell the difference between AI-generated and human-written text.
Table 2: Comparative results of AI text detection tools on AH&AITD.
Tools | Classes | Precision | Recall | F1 Score
GPTKIT | AI Generated | 90 | 12 | 21
GPTKIT | Human Written | 53 | 99 | 69
GPTZERO | AI Generated | 65 | 60 | 62
GPTZERO | Human Written | 63 | 68 | 65
Originality | AI Generated | 98 | 96 | 97
Originality | Human Written | 96 | 98 | 97
Sapling | AI Generated | 86 | 40 | 54
Sapling | Human Written | 61 | 94 | 74
Writer | AI Generated | 79 | 52 | 62
Writer | Human Written | 64 | 87 | 74
Zylalab | AI Generated | 84 | 45 | 59
Zylalab | Human Written | 62 | 91 | 74
First, GPTKIT attains high precision (90) in detecting AI-generated text but shockingly low recall (12),
which results in a low F1 Score (21). This suggests that GPTKIT is overly conservative,
giving rise to many false negatives. On the other hand, its recall (99) and F1 Score (69) are much better when
recognizing language created by humans. GPTZero's performance is more evenly balanced across the
board. Its recall (60%) and F1 Score (62) more than make up for its lower precision (65%) on AI-generated
text. An F1 Score of 65 for human-written text strikes a reasonable balance between precision (63) and
recall (68).
When distinguishing AI-generated content from human-written content, Originality shines. Its F1
Score of 97 reflects its remarkable precision (98), recall (96), and overall effectiveness. It also excels at text
created by humans, with an F1 Score of 97, recall of 98%, and precision of 96%. Sapling's high precision (86)
but low recall (40) on AI-generated text give it an F1 Score of 54. With a high recall (94) but poor
precision (61) in identifying human-written text, its F1 Score of 74 still leaves room for improvement. Writer is
unbiased in assessing the relative merits of AI-generated and human-written content. It has an average F1
Score of 62 because it has an average level of precision while analyzing AI-generated text (79) and recall
(52). The F1 Score for this piece of human-written text is 74, meaning it has an excellent balance of
precision (64) and recall (87). Regarding recognizing AI-generated content, Zylalab has a good precision of 84
but a recall of only 45, giving an F1 Score of 59. Recognizing human-written text is where it does better, with an
F1 Score of 74, a recall of 91, and a precision of 62.
As a result of its superior performance in terms of precision, recall, and F1 Score across both classes,
we have concluded that Originality is the most reliable alternative for AI text identification. Additionally,
GPTZERO displays all-around performance, making it a practical option. However, Sapling shows skills
in identifying AI-generated text whereas GPTKIT demonstrates remarkable precision but needs better recall.
Writer finds a comfortable middle ground but does not differentiate itself. Zylalab performs about as
well as the best of the rest, but it has room to grow. Before selecting a tool, it is crucial to consider the needs
and priorities of the job.
Figure 2 provides a visual representation of a comparison between the accuracy of six different AI
text identification systems, including "GPTkit," "GPTZero," "Originality," "Sapling," "Writer," and
"Zylalab." Data visualization demonstrates that "GPTkit" has a 55.29 percent accuracy rate, "GPTZero" has
a 63.7 percent accuracy rate, "Originality" has a spectacular 97.0 percent accuracy rate, "Sapling" has a
66.6 percent accuracy rate, "Writer" has a 69.05 percent accuracy rate, and "Zylalab" has a 68.23 percent
accuracy rate. These accuracy ratings demonstrate how well the tools distinguish between natural and
computer-generated text. When contrasting the two forms of writing, "Originality" achieves the highest
degree of accuracy. Compared to the other tools, "GPTkit" has the lowest detection accuracy and thus the
most room for improvement. This visual representation of the performance of various AI text detection
tools will be an important resource for users looking for the most precise tool for their needs.
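A comparison in the style of Figure 2 can be reproduced directly from the accuracy values reported above with a few lines of matplotlib; the numbers in the sketch are taken from the text, while the styling choices are arbitrary.

    # Reproducing a Figure 2 style accuracy comparison from the reported values.
    import matplotlib.pyplot as plt

    tools = ["GPTkit", "GPTZero", "Originality", "Sapling", "Writer", "Zylalab"]
    accuracy = [55.29, 63.7, 97.0, 66.6, 69.05, 68.23]  # percent, as reported above

    plt.figure(figsize=(8, 4))
    plt.bar(tools, accuracy, color="steelblue")
    plt.ylabel("Accuracy (%)")
    plt.ylim(0, 100)
    plt.title("Accuracy of AI text detection tools on AH&AITD")
    plt.tight_layout()
    plt.savefig("accuracy_comparison.png")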
Figure 3: Testing confusion matrices of AI text detection tools on AH&AITD
Figure 4 displays the Testing Receiver Operating Curves (ROCs) for selecting AI text detection
algorithms, visually comparing their relative strengths and weaknesses. These ROC curves, one for each
tool, are essential for judging how well they can tell the difference between AI-generated and human-written
material. Values for "GPTkit," "GPTZero," "Originality," "Sapling," "Writer," and "Zylalab" in terms of
Area Under the Curve (AUC) are 0.55, 0.64%, 0.97%, 0.67%, 0.69%, and 0.68%, respectively. The Area
under the curve (AUC) is a crucial parameter for gauging the precision and efficiency of such programs. A
bigger area under the curve (AUC) suggests that the two text types can be distinguished with more accuracy.
To help users, researchers, and decision-makers choose the best AI text recognition tool for their needs,
Figure 4 provides a visual summary of how these tools rank regarding their discriminative strength.
engagement. It is also in higher education that students form and further develop their
personal and professional ethics and values. Hence, it is crucial to uphold the integrity of
the assessments and diplomas provided in tertiary education.
The introduction of unauthorised content generation—“the production of academic
work, in whole or part, for academic credit, progression or award, whether or not
a payment or other favour is involved, using unapproved or undeclared human or
technological assistance” (Foltýnek et al. 2023)—into higher education contexts poses
potential threats to academic integrity. Academic integrity is understood as “compli-
ance with ethical and professional principles, standards and practices by individuals
or institutions in education, research and scholarship” (Tauginienė et al. 2018).
Recent advancements in artificial intelligence (AI), particularly in the area of the
generative pre-trained transformer (GPT) large language models (LLM), have led to
a range of publicly available online text generation tools. As these models are trained
on human-written texts, the content generated by these tools can be quite difficult to
distinguish from human-written content. They can thus be used to complete assess-
ment tasks at HEIs.
Despite the fact that unauthorised content generation created by humans, such as
contract cheating (Clarke & Lancaster 2006), has been a well-researched form of stu-
dent cheating for almost two decades now, HEIs were not prepared for such radical
improvements in automated tools that make unauthorised content generation so eas-
ily accessible for students and researchers. The availability of tools based on GPT-3
and newer LLMs, ChatGPT (OpenAI 2023a, b) in particular, as well as other types
of AI-based tools such as machine translation tools or image generators, have raised
many concerns about how to make sure that no academic performance deception
attempts have been made. The availability of ChatGPT has forced HEIs into action.
Unlike contract cheating, the use of AI tools is not automatically unethical. On the
contrary, as AI will permeate society and most professions in the near future, there is
a need to discuss with students the benefits and limitations of AI tools, provide them
with opportunities to expand their knowledge of such tools, and teach them how to
use AI ethically and transparently.
Nonetheless, some educational institutions have directly prohibited the use of
ChatGPT (Johnson 2023), and others have even blocked access from their university
networks (Elsen-Rooney 2023), although this is just a symbolic measure with vir-
tual private networks quite prevalent. Some conferences have explicitly prohibited
AI-generated content in conference submissions, including machine-learning con-
ferences (ICML 2023). More recently, Italy became the first country in the world to
ban the use of ChatGPT, although that decision has in the meantime been rescinded
(Schechner 2023). Restricting the use of AI-generated content has naturally led to
the desire for simple detection tools. Many free online tools that claim to be able to
detect AI-generated text are already available.
Some companies do urge caution when using their tools for detecting AI-gener-
ated text for taking punitive measures based solely on the results they provide. They
acknowledge the limitations of their tools, e.g. OpenAI explains that there are several
ways to deceive the tool (OpenAI 2023a, b, 8 May). Turnitin made a guide for teachers
on how they should approach the students whose work was flagged as AI-generated
RQ1: Can detection tools for AI-generated text reliably detect human-written text?
RQ2: Can detection tools for AI-generated text reliably detect ChatGPT-generated
text?
RQ3: Does machine translation affect the detection of human-written text?
RQ4: Does manual editing or machine paraphrasing affect the detection of Chat-
GPT-generated text?
RQ5: How consistent are the results obtained by different detection tools for AI-gen-
erated text?
The next section briefly describes the concept and history of LLMs. It is followed
by a review of scientific and non-scientific related work and a detailed description of
the research methodology. After that, the results are presented in terms of accuracy,
error analysis, and usability issues. The paper ends with discussion points and conclusions.
Related work
The development of LLMs has led to an acceleration of different types of efforts in the
field of automatic detection of AI-generated text. Firstly, several researchers have studied
human abilities to detect machine-generated texts (e.g. Guo et al. 2023; Ippolito et al.
2020; Ma et al. 2023). Secondly, some attempts have been made to build benchmark text
corpora to detect AI-generated texts effectively; for example, Liyanage et al. (2022) have
offered synthetic and partial text substitution datasets for the academic domain. Thirdly,
many research works are focused on developing new or fine-tuning parameters of the
already pre-trained models of machine-generated text (e.g. Chakraborty et al. 2023; Dev-
lin et al. 2019).
These efforts provide a valuable contribution to improving the performance and capa-
bilities of detection tools for AI-generated text. In this section, the authors of the paper
mainly focus on studies that compare or test the existing detection tools that educators
can use to check the originality of students’ assignments. The related works examined
in the paper are summarised in Tables 1, 2, and 3. They are categorised as published
scientific publications, preprints and other publications. It is worth mentioning that
although there are many comparisons on the Internet made by individuals and organisa-
tions, Table 3 includes only those with the higher coverage of tools and/or at least partly
described methodology of experiments.
Some researchers have used known text-matching software to check if they are able
to find instances of plagiarism in the AI-generated text. Aydin and Karaarslan (2022)
tested the iThenticate system and have revealed that the tool has found matches with
other information sources both for ChatGPT-paraphrased text and -generated text.
They also found that ChatGPT does not produce original texts after paraphrasing, as the
match rates for paraphrased texts were very high in comparison to human-written and
natural language content and programming code and determined that “detecting Chat-
GPT-generated code is even more difficult than detecting natural language contents.”
They also state that tools often exhibit bias, as some of them have a tendency to predict
that content is ChatGPT generated (positive results), while others tend to predict that it
is human-written (negative results).
By testing fifty ChatGPT-generated paper abstracts on the GPT-2 Output detector,
Gao et al. (2022) concluded that the detector was able to make an excellent distinction
between original and generated abstracts because the majority of the original abstracts
were scored extremely low (corresponding to human-written content) while the detec-
tor found a high probability of AI-generated text in the majority (33 abstracts) of the
ChatGPT-generated abstracts with 17 abstracts scored below 50%.
Pegoraro et al. (2023) tested not only online detection tools for AI-generated text but
also many of the existing detection approaches and claimed that detection of the Chat-
GPT-generated text passages is still a very challenging task as the most effective online
detection tool can only achieve a success rate of less than 50%. They also concluded that
most of the analysed tools tend to classify any text as human-written.
Tests completed by van Oijen (2023) showed that the overall accuracy of tools in
detecting AI-generated text reached only 27.9%, and the best tool achieved a maxi-
mum of 50% accuracy, while the tools reached an accuracy of almost 83% in detecting
human-written content. The author concluded that detection tools for AI-generated text
are "no better than random classifiers" (van Oijen 2023). Moreover, the tests provided
some interesting findings; for example, the tools found it challenging to detect a piece of
human-written text that was rewritten by ChatGPT or a text passage that was written in
a specific style. Additionally, there was not a single attribution of a human-written text
to AI-generated text, that is, an absence of false positives.
Although Demers (2023) only provided results of testing without any further analysis,
their examination allows making conclusions that a text passage written by a human was
recognised as human-written by all tools, while ChatGPT-generated text had a mixed
evaluation with the tendency to be predicted as human-written (10 tools out of 16) that
increased even further for the ChatGPT writing sample with the additional prompt "beat
detection" (12 tools out of 16).
Elkhatat et al. (2023) revealed that detection tools were generally more successful in
identifying GPT-3.5-generated text than GPT-4-generated text and demonstrated incon-
sistencies (false positives and uncertain classifications) in detecting human-written text.
They also questioned the reliability of detection tools, especially in the context of investi-
gating academic integrity breaches in academic settings.
In the tests conducted by Compilatio, the detection tools for AI-generated text
detected human-written text with reliability in the range of 78–98% and AI-generated
text in the range of 56–88%. Gewirtz's (2023) results on testing three human-written and three Chat-
GPT-generated texts demonstrated that two of the selected detection tools for AI-gener-
ated text could reach only 50% accuracy and one an accuracy of 66%.
The effect of paraphrasing on the performance of detection tools for AI-generated text
has also been studied. For example, Anderson et al. (2023) concluded that paraphras-
ing has significantly lowered the detection capabilities of the GPT-2 Output Detector by
increasing the score for human-written content from 0.02% to 99.52% for the first essay
and from 61.96% to 99.98% for the second essay. Krishna et al. (2023) applied paraphras-
ing to the AI-generated texts and revealed that it significantly lowered the detection
accuracy of five detection tools for AI-generated text used in the experiments.
The results of the above-mentioned studies suggest that detecting AI-generated text
passages is still challenging for existent detection tools for AI-generated text, whereas
human-written texts are usually identified quite accurately (accuracy above 80%). How-
ever, the ability of tools to identify AI-generated text is under question as their accuracy
in many studies was only around 50% or slightly above. Depending on the tool, a bias
may be observed identifying a piece of text as either ChatGPT-generated or human-
written. In addition, tools have difficulty identifying the source of the text if ChatGPT
transforms human-written text or generates text in a particular style (e.g. a child’s expla-
nation). Furthermore, the performance of detection tools significantly decreases when
texts are deliberately modified by paraphrasing or re-writing. Detection of the AI-gener-
ated text remains challenging for existing detection tools, but detecting ChatGPT-gener-
ated code is even more difficult.
Existing research has several shortcomings:
• quite often experiments are carried out with a limited number of detection tools for
AI-generated text on a limited set of data;
• sometimes human-written texts are taken from publicly available websites or recog-
nised print sources, and thus could potentially have been previously used to train
LLMs and/or provide no guarantee that they were actually written by humans;
• the methodological aspects of the research are not always described in detail and are
thus not available for replication;
• testing whether the AI-generated and further translated text can influence the accu-
racy of the detection tools is not discussed at all;
• a limited number of measurable metrics is used to evaluate the performance of
detection tools, ignoring the qualitative analysis of results, for example, types of clas-
sification errors that can have significant consequences in an academic setting.
Test Functions
Test cases
The focus of this research is determining the accuracy of tools which state that they are
able to detect AI-generated text. In order to do so, a number of situational parameters
were set up for creating the test cases for the following categories of English-language
documents:
• human-written;
• human-written in a non-English language with a subsequent AI/machine translation
to English;
• AI-generated text;
• AI-generated text with subsequent human manual edits;
• AI-generated text with subsequent AI/machine paraphrase.
For the first category (called 01-Hum), the specification was made that 10.000 charac-
ters (including spaces) were to be written at about the level of an undergraduate in the
field of the researcher writing the paper. These fields include academic integrity, civil
engineering, computer science, economics, history, linguistics, and literature. None of
the text may have been exposed to the Internet at any time or even sent as an attachment
to an email. This is crucial because any material that is on the Internet is potentially
included in the training data for an LLM.
For the second category (called 02-MT), around 10,000 characters (including spaces) were written in Bosnian, Czech, German, Latvian, Slovak, Spanish, and Swedish. None of these texts may have been exposed to the Internet before, as for 01-Hum. Depending on
the language, either the AI translation tool DeepL (3 cases) or Google Translate (6 cases)
was used to produce the test documents in English.
It was decided to use ChatGPT as the only AI-text generator for this investigation, as
it was the one with the largest media attention at the beginning of the research. Each
researcher generated two documents (03-AI and 04-AI) with the tool using different prompts, each with a minimum of 2,000 characters, and recorded the prompts. The lan-
guage model from February 13, 2023 was used for all test cases.
Two additional texts of at least 2,000 characters were generated using fresh prompts
for ChatGPT, then the output was manipulated. It was decided to use this type of test
case, as students have a tendency to obfuscate results with the express purpose of hiding their use of an AI content generator. One set (05-ManEd) was edited manually, with a human exchanging some words for synonyms or reordering sentence parts, and the other (06-Para) was rewritten automatically with the AI-based tool Quillbot (Quillbot 2023), using the tool's default values for mode (Standard) and synonym level.
Documentation of the obfuscation, highlighting the differences between the texts, can
be found in the Appendix.
With nine researchers preparing texts (the eight authors and one collaborator), 54 test
cases were thus available for which the ground truth is known.
Table 4 gives an overview of the minimum/maximum sizes of text that could be exam-
ined by the free tools at the time of testing, if known.
PlagiarismCheck and Turnitin are combined text similarity detectors and offer an
additional functionality of determining the probability that the text was written by an AI, so
there was no limit on the amount of text tested. Signup was necessary for Check for
AI, Crossplag, Go Winston, GPT Zero, and OpenAI Text Classifier (a Google account
worked).
Data collection
The tests were run by the individual authors between March 7 and March 28, 2023.
Since Turnitin was not available until April, those tests were completed between April
14 and April 20, 2023. The testing of PlagiarismCheck was performed between May 2
and May 8, 2023. All 54 test cases were presented to each of the tools, for a total of 756 tests.

Table 5 Mapping of the reported probability of human authorship to classification labels

Human-written (NEGATIVE) text (docs 01-Hum & 02-MT), and the tool says that it is written by a:
[100-80%) human | True negative | TN
[80-60%) human | Partially true negative | PTN
[60-40%) human | Unclear | UNC
[40-20%) human | Partially false positive | PFP
[20-0%] human | False positive | FP

AI-generated (POSITIVE) text (docs 03-AI, 04-AI, 05-ManEd & 06-Para), and the tool says it is written by a:
[100-80%) human | False negative | FN
[80-60%) human | Partially false negative | PFN
[60-40%) human | Unclear | UNC
[40-20%) human | Partially true positive | PTP
[20-0%] human | True positive | TP

"[" or "]" means inclusive; "(" or ")" means exclusive
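The mapping above can be made concrete with a short illustrative sketch (our own illustration in Python, not code used by any of the tested tools): given the probability of human authorship a tool reports and the known ground truth of the document, it returns the five-step label.

    def classify(human_prob, ground_truth_is_ai):
        """Map a tool's reported human-probability (0-100) to the five-step label.
        Bracket convention as in the table above: e.g. [100-80%) includes 100 but not 80."""
        if human_prob > 80:
            band = 0        # tool says: clearly human
        elif human_prob > 60:
            band = 1        # mostly human
        elif human_prob > 40:
            band = 2        # unclear
        elif human_prob > 20:
            band = 3        # mostly AI
        else:
            band = 4        # clearly AI
        if ground_truth_is_ai:        # positive class: 03-AI, 04-AI, 05-ManEd, 06-Para
            return ["FN", "PFN", "UNC", "PTP", "TP"][band]
        return ["TN", "PTN", "UNC", "PFP", "FP"][band]

    classify(85, ground_truth_is_ai=True)    # -> "FN": an AI-generated text that goes undetected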
Evaluation model
For the evaluation, the authors were split into groups of two or three and tasked with
evaluating the results of the tests for the cases from either 01-Hum & 04-AI, 02-MT &
05-ManEd, or 03-AI & 06-Para. Since the tools do not provide an exact binary classifi-
cation, one five-step classification was used for the original texts (01-Hum & 02-MT)
and another one was used for the AI-generated texts (03-AI, 04-AI, 05-ManEd &
06-Para). They were based on the probabilities that were reported for texts being
human-written or AI-generated as specified in Table 5.
For four of the detection tools, the results were only given in the textual form (“very
low risk”, “likely AI-generated”, “very unlikely to be from GPT-2”, etc.) and these were
mapped to the classification labels as given in Table 6.
After all of the classifications had been completed and disagreements resolved, the measures of accuracy, false positive rate, and false negative rate were calculated.
Outcomes Results
Having evaluated the classification outcomes of the tools as (partially) true/false posi-
tives/negatives, the researchers evaluated this classification on two criteria: accuracy and
error type. In general, classification systems are evaluated using accuracy, precision, and
recall. The research authors also conducted an error analysis since the educational con-
text means different types of error have different significance.
Accuracy
When no partial results are allowed, i.e. only TN, TP, FN, and FP are possible, accuracy is defined as the ratio of correctly classified cases to all cases:

ACC = (TN + TP) / (TN + TP + FN + FP)

As our classification also contains partially correct and partially incorrect results (i.e., five classes instead of two), the basic, commonly used formula has to be adjusted to properly count these cases. There is no standard way in which this adjustment should be done. Therefore, we will use three different methods which we believe reflect different approaches that educators may take when interpreting the tools' outputs. The first (binary) approach counts only fully correct classifications (TN and TP) as correct and treats all other outcomes, including partially correct ones, as incorrect.
For the systems providing percentages of confidence, this method basically sets the
threshold of 80% (see Table 5). Table 7 shows the number of correctly classified docu-
ments, i.e. the sum of true positives and true negatives. The maximum for each cell is 9
(because there were 9 documents in each class), the overall maximum is 9 * 6 = 54. The
accuracy is calculated as a ratio of the total and the overall maximum. Note that even the
highest accuracy values are below 80%. The last row shows the average accuracy for each
document class, across all the tools.
This method provides a good overview of the number of cases in which the classifiers
are “sure” about the outcome. However, for real-life educational scenarios, partially cor-
rect classifications are also valuable. Especially in case 05-ManEd, which involved human
editing, the partially positive classification results make sense. Therefore, the researchers
explored more ways of assessment. These methods differ in the score awarded to various
incorrect outcomes.
In our second approach, we include partially correct evaluations and count them as
correct ones. The formula for accuracy computation is:

ACC_bin_incl = (TN + PTN + TP + PTP) / (TN + PTN + TP + PTP + FN + PFN + FP + PFP + UNC)
In case of systems providing percentages, this method basically sets the threshold of
60% (see Table 5). The results of this classification approach may be found in Table 8.
Obviously, all systems achieved higher accuracy, and the systems that provided more
partially correct results (GPT Zero, Check for AI) influenced the order.
In our third approach, which we call semi-binary evaluation, the researchers distin-
guish partially correct classifications (PTN or PTP) both from the correct and incorrect
ones. The partially correct classifications were awarded 0.5 points, while entirely correct
classification (TN or TP) still gained 1.0 points, as in the previous methods. The formula for accuracy calculation is:

ACC_semibin = (TN + 0.5 * PTN + TP + 0.5 * PTP) / (TN + PTN + TP + PTP + FN + PFN + FP + PFP + UNC)
Table 9 shows the assessment results of the classifiers using semi-binary classification.
The values correspond to the number of correctly classified documents with partially
correct results awarded half a point (TP + TN + 0.5 * PTN + 0.5 * PTP). The maximum
value is again 9 for each cell and 54 for the total.
A semi-binary approach to accuracy calculation captures the notion of partially
correct classification but still does not distinguish between various forms of incor-
rect classification. We address this issue by employing a third, logarithmic approach
to accuracy calculation that awards 1 point to completely incorrect classification
and doubles the score for each level of the classification that was closer to the cor-
rect result. The scores for the particular classifier outputs are shown in Table 10 and
the overall scores of the classifiers are shown in Table 11. Note that the maximum
value for each cell is now 9 * 16 = 144 and the overall maximum is 54 * 16 = 864. The accuracy, again, is calculated as a ratio of the total score and the maximum possible score. This approach provides the most detailed distinction among all varieties of (in)correctness.

Table 10 Scores awarded to classifier outputs in the logarithmic approach

AI-generated documents | Human-written documents | Score
FN | FP | 1
PFN | PFP | 2
UNC | UNC | 4
PTP | PTN | 8
TP | TN | 16

Fig. 5 Overall accuracy for each tool calculated as an average of all approaches discussed
As can be seen from Tables 7, 8, 9, and 11, the approach to accuracy evaluation
has almost no influence on the ranking of the classifiers. Figure 5 presents the overall
accuracy for each tool as the mean of all accuracy approaches used.
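To make the relationship between the four scoring schemes explicit, the following sketch of ours (not code from the study) computes each accuracy variant from a dictionary of outcome counts:

    LOG_POINTS = {"FN": 1, "PFN": 2, "UNC": 4, "PTP": 8, "TP": 16,
                  "FP": 1, "PFP": 2, "PTN": 8, "TN": 16}

    def accuracies(counts):
        """counts: dict mapping the outcome labels above to their frequencies."""
        total = sum(counts.values())
        binary = (counts["TP"] + counts["TN"]) / total                         # only fully correct
        binary_incl = binary + (counts["PTP"] + counts["PTN"]) / total         # partials count fully
        semi_binary = binary + 0.5 * (counts["PTP"] + counts["PTN"]) / total   # partials get half a point
        logarithmic = sum(LOG_POINTS[lab] * n for lab, n in counts.items()) / (16 * total)
        return binary, binary_incl, semi_binary, logarithmic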
Turnitin received the highest score using all approaches to accuracy classification,
followed by Compilatio and GPT-2 Output Detector (again in all approaches). This is
particularly interesting because as the name suggests, GPT-2 Output Detector was
not trained to detect GPT-3.5 output. Crossplag and Go Winston were the only other
tools to achieve at least 70% accuracy.
Fig. 6 Overall accuracy for each document type (calculated as an average of all approaches discussed)
Variations in accuracy
As Fig. 6 above shows, the overall average accuracy figure is misleading, as it obscures
major variations in accuracy between document types. Further analysis reveals the
influence of machine translation, human editing, and machine paraphrasing on over-
all accuracy:
Influence of machine translation The overall accuracy for case 01-Hum (human-writ-
ten) was 96%. However, in the case of the documents written by humans in languages
other than English that were machine-translated to English (case 02-MT), the accuracy
dropped by 20%. Apparently, machine translation leaves some traces of AI in the output,
even if the original was purely human-written.
Influence of machine paraphrase Probably the most surprising results are for case
06-Para (machine-generated with subsequent machine paraphrase). The use of AI to
transform AI-generated text results in text that the classifiers consider human-written.
The overall accuracy for this case was 26%, which means that most AI-generated texts
remain undetected when machine-paraphrased.
Fig. 7 Accuracy (logarithmic) for each document type by detection tool for AI-generated text
Precision
Another important indicator of a system's performance is precision, i.e. the ratio of true positive cases to all positively classified cases. Precision indicates the probability that a positive classification provided by the system is correct. For pure binary classifiers, precision is calculated as the ratio of true positives to all positively classified cases:

Prec = TP / (TP + FP)
Table 12 Overview of classification results and precision
Tool TP PTP FP PFP TN PTN FN PFN UNC Total Prec_incl Prec_excl
In the case of partially true/false positives, the researchers had two options for dealing with them. The exclusive approach counts them as negatively classified (so the formula does not change), whereas the inclusive approach counts them as positively classified:

Prec_incl = (TP + PTP) / (TP + PTP + FP + PFP)

Table 12 shows an overview of the classification results, i.e. all (partially) true/false positives/negatives. Both inclusive and exclusive precision values are also provided.
Precision is missing for Content at Scale because this system did not provide any posi-
tive classifications. The only system for which the inclusive precision is significantly dif-
ferent from the exclusive one, is GPT Zero which yielded the largest number of partially
false positives.
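A short sketch of ours (illustrative only) that computes both precision variants from the counts reported in Table 12:

    def precisions(counts):
        """Exclusive precision treats partial classifications as negative calls,
        inclusive precision treats them as positive calls; None is returned when a
        tool makes no positive classifications at all (as Content at Scale did)."""
        excl_denom = counts["TP"] + counts["FP"]
        incl_denom = counts["TP"] + counts["PTP"] + counts["FP"] + counts["PFP"]
        prec_excl = counts["TP"] / excl_denom if excl_denom else None
        prec_incl = (counts["TP"] + counts["PTP"]) / incl_denom if incl_denom else None
        return prec_excl, prec_incl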
Error analysis
In this section, the researchers quantify more indicators of tools’ performance, namely
two types of classification errors that might have significant consequences in educational
contexts: false positives leading to false accusations against a student and undetected
cases (students gaining an unfair advantage over others), i.e. the false negative ratio, which is closely related to recall.
Therefore, for each tool, we also computed the likelihood of a false accusation of a student as the ratio of false positives and partially false positives to all negative (human-written) cases, i.e.

(FP + PFP) / (TN + PTN + UNC + PFP + FP)
Table 13 shows the number of cases in which the classification of a particular docu-
ment would lead to a false accusation. The table includes only documents 01-Hum and
02-MT, because the AI-generated documents are not relevant. The risk of false accu-
sations is zero for half of the tools, as can also be seen from Figs. 4 and 5. Six of the
fourteen tools tested generated false positives, with the risk increasing dramatically for
machine-translated texts. For GPT Zero, half of the positive classifications would be
false accusations, which makes this tool unsuitable for the academic environment.
Fig. 12 False negatives for AI‑generated documents 03‑AI and 04‑AI together
For the sake of completeness, Table 14 also contains recall (1 - FNR), which indicates how many of the positive cases were correctly classified by the system.
Figures 6, 7, and 8 above show that 13 out of the 14 tested tools produced false negatives or partially false negatives for documents 03-AI and 04-AI; only Turnitin correctly classified all documents in these classes. None of the tools could correctly classify all AI-generated documents that underwent manual editing or machine paraphrasing.
As the document sets 03-AI and 04-AI were prepared using the same method, the
researchers expected the results would be the same. However, for some tools (OpenAI
Text Classifier and DetectGPT), the results were notably different. This could indicate a mistake made in testing or in the interpretation of the results. Therefore, the researchers double-checked all the results to avoid this kind of mistake. We also tried to upload some documents again. We did obtain different values, but we found that this was due to inconsistency in the results of these tools and not due to our mistakes.
Content at Scale misclassified all of the positive cases; these results in combination
with the 100% correct classification of human-written documents indicate that the
tool is inherently biased towards human classification and thus completely useless.
Overall, approximately 20% of the AI-generated texts would likely be misattributed to humans, meaning the risk of unfair advantage is significantly greater than that of false accusation.
Figures 9 and 10 show an even greater risk of students gaining an unfair advantage
through the use of obfuscation strategies. At an overall level, for manually edited texts (case 05-ManEd) the ratio of undetected texts increases to approximately 50%, and in the case of machine-paraphrased texts (case 06-Para) it rises even higher.

Fig. 16 Turnitin's similarity report shows up first; it is not clear that the "AI" indicator is clickable
Usability issues
There were a few usability issues that cropped up during the testing that may be
attributable to the beta nature of the tools under investigation.
For example, the tool DetectGPT at some point stopped working and only replied with the statement "Server error. We might just be overloaded. Try again in a few minutes?". This issue occurred after the initial testing round and persisted until the time of submission of this paper. Other tools would stall in an apparent infinite loop or throw an error message, and the test had to be repeated at a later time.
Writeful GPT Detector would not accept computer code: the tool apparently identified the code as not being in English, and it only accepted English texts.
Compilatio at one point returned "NaN% reliability" (see Fig. 11) for a ChatGPT-generated text that included program code. "NaN" is computer jargon for "not a number" and indicates that there were calculation issues such as division by zero or number representation overflow. Since a robot head was also returned, this was evaluated as correctly identifying ChatGPT-generated text, but the non-numerical percentage might confuse instructors using the tool.
The operation of a few of the tools was not immediately clear to some of the authors
and the handling of results was sometimes not easy to document. For example, in Pla-
giarismCheck the AI-Detection button was not always presented on the screen and it
would only show the last four tests done. Interestingly, Turnitin often returned high sim-
ilarity values for ChatGPT-generated text, especially for program code or program out-
put. This was distracting, as the similarity results were given first, the AI-detection could
only be accessed by clicking on a number above the text “AI” that did not look clickable,
but was, see Fig. 12.
Discussion
Detection tools for AI-generated text do fail; they are neither accurate nor reliable (all scored below 80% accuracy and only 5 scored over 70%). In general, they have been found to diagnose human-written documents as AI-generated (false positives) and often diagnose AI-generated texts as human-written (false negatives). Our findings are consistent with previously published studies (Gao et al. 2022; Anderson et al. 2023; Elkhatat et al. 2023; Demers 2023; Gewirtz 2023; Krishna et al. 2023; Pegoraro et al. 2023; van Oijen 2023; Wang et al. 2023) and differ substantially from what some detection tools for AI-generated text claim (Compilatio 2023; Crossplag.com 2023; GoWinston.ai 2023; Zero GPT 2023). The detection tools exhibit a strong bias towards classifying output as human-written rather than detecting AI-generated content. Overall, approximately 20%
of AI-generated texts would likely be misattributed to humans.
Fig. 18 AIDT23-05-JGD
Fig. 19 AIDT23-05-JPK
Fig. 20 AIDT23-05-LLW
Fig. 21 AIDT23-05-OLU
Fig. 22 AIDT23-05-PTR
Fig. 23 AIDT23-05-SBB
Fig. 24 AIDT23-05-TFO
Fig. 25 AIDT23-06-AAN
Fig. 26 AIDT23-06-DWW
Fig. 27 AIDT23-06-JGD
Fig. 28 AIDT23-06-JPK
Fig. 29 AIDT23-06-LLW
Fig. 30 AIDT23-06-OLU
Fig. 31 AIDT23-06-PTR
Fig. 32 AIDT23-06-SBB
Fig. 33 AIDT23-06-TFO
Abbreviations
01-Hum  Human-written
02-MT   Human-written in a non-English language with a subsequent AI/machine translation to English
03-AI   AI-generated text
04-AI   AI-generated text (second document, different prompt)
05-ManEd  AI-generated text with subsequent human manual edits
06-Para  AI-generated text with subsequent AI/machine paraphrase
In particular, finetuning a language model to detect “itself” has proven to be an effective strategy
for text detection. However, existing research suggests that a purely neural approach struggles with distribution shifts and is overly reliant on the decoding strategy with which the training data was generated (Solaiman et al., 2019). In other words, though the neural-based approach is highly performant for specific samples of text, it is significantly less robust.
Our project attempts to mitigate this shortcoming by combining the statistics- and neural-based
approaches in order to achieve the best of both worlds. We hypothesize that, regardless of the
language model, machine-generated and human-written text have fundamental statistical differences
that can be learned. Our main thesis is that explicitly encoding such statistical features into the input
can prevent the classifier from overfitting on the model-intrinsic features present in the training data,
thus improving its performance against a greater variety of generated text.
To do so, we add a custom “statistical embeddings” layer to a RoBERTa model pre-trained for
the synthetic text detection task. This embeddings layer extracts various statistical features from
the input sequence and injects them into the classification pipeline. We finetune both our baseline RoBERTa detector and our augmented detector on an open-source dataset of GPT-3 generated
text and human-written Wikipedia introductions. We observe similar performance for both models
on the test set partitioned from the same dataset (i.e. equal distribution). Both classifiers are then
evaluated against machine-generated and human-written “long answers” in the PubMedQA dataset
(i.e. different topic and distribution from the training set). Despite a marginal increase in accuracy
for the WikiIntro dataset, we were unable to see significant improvements in robustness using our
approach. We suspect that this may in part be attributed to the sparseness of our statistical feature
vectors. The specific results and analysis are presented in section 5.
5. Related Work
There has been significant prior research regarding the statistical, zero-shot approach to synthetic text
detection. A study by Nguyen-Son et al. (2017) reported promising results through
statistical analysis alone, relying on (i) word distribution frequencies, (ii) complex phrase features,
and (iii) sentence- and paragraph-level consistency. However, since 2017, the capabilities of NLG
models have grown exponentially, and thus it is expected that previous results will be significantly
challenged by state-of-the-art models such as GPT-3.
More recently, Tian released “GPTZero” which analyzes the perplexity and burstiness of a given
text to detect whether it was machine-generated. However, GPTZero has been shown to work poorly
when dealing with shorter sequences or against adversarial rewording and paraphrasing (Tian, 2022).
Perhaps the most significant breakthrough in this regard is “DetectGPT”, released in January 2023 by
Mitchell et al. (2023). It achieved significantly improved performance compared to other zero-shot
classification methods by using the observation that texts generated by LLMs tend to occupy negative
curvature regions of the model’s log probability function. Though the results are remarkable, the
paper reports that supervised models still perform better than DetectGPT for in-distribution data,
thus motivating our view that the neural-based approach has greater potential once its lack of
robustness is resolved.
Research in neural-based text detection has been greatly accelerated by high-performing language
models that can be applied to a myriad of downstream tasks. The most notable result to date is
OpenAI’s RoBERTa-based sequence classifier which was released alongside GPT-2. The model was
trained on a behemoth corpus of GPT-2 generated text and was able to outperform human detection
with an accuracy of 95%. However, research shows that the model’s accuracy drops significantly
when tested against distribution shifts (Solaiman et al., 2019). We later confirm this result in section 5.
6. Approach
Our main contribution to the synthetic text detection literature is in the combining of statistical and
neural methods. We attempt to improve upon OpenAI’s RoBERTa model by adding a statistical
embeddings layer that extracts and encodes useful information about the input sequence. The exact
ensemble of features used is described in more detail in section 4.2.
We were greatly inspired by the Transformer architecture and its way of encoding the notion of
position in self-attention. Namely, the Transformer architecture extracts positional information and
adds it to the input embeddings such that the order of words is accounted for by the model. We
perform a similar procedure in which a statistical feature vector is first constructed by our statistical
embeddings layer, then added into the input embeddings. The model architecture and design decisions
are discussed in more detail in section 4.3.
6.1 Baseline
As described above, our baseline is OpenAI’s RoBERTa sequence classifier pre-trained for GPT-2 text
detection. The sequence classifier consists of the original RoBERTa model and an additional classifi-
cation head. The classification head is a standard feed-forward network with dropout regularization,
a tanh non-linearity, and an output projection. We first compare our main model’s performance
against this pre-trained RoBERTa classifier (trained on GPT-2 data). We then finetune the RoBERTa
classifier on GPT-3 data and perform a similar comparison. The results of the baseline experiments
are presented in section 5.
We postulate that using an ensemble of statistical features can increase the robustness of our model.
In particular, we focus on robustness against adversarial prompts, e.g. asking ChatGPT to “generate
a text that uses many exclamation marks and commonly-used words.” It is intuitive that including
multiple features in our detection provides a straightforward hedge against such attacks, as targeting
more and more statistical features will only impede the model’s ability to generate fluent text.
In this section, we describe in brief detail the statistical features used by our model. We note that
the input sequence is first lemmatized using the WordNet corpus, prior to being processed by the
statistical embeddings layer.
1. Zipf: Zipf's law refers to the observation that many types of real-world data follow a particular distribution. Human-written text is one such example. Formally, given a distribution d_i of the i-th most common lemma in a document,

   d_i ∝ 1/i    (1)

   We use a log scale for both the rank and the frequency of a lemma and perform linear regression. The slope of the regression line is used as our Zipf feature of the input text.
2. Clumpiness: This feature focuses on the fact that human-written text tends to contain particular words more frequently than others, as opposed to machine-generated text, which is more uniform (i.e. less clumpy). We calculate the text's Gini coefficient as an indicator:

   G = ( Σ_{i=1}^{n} Σ_{j=1}^{n} |x_i − x_j| ) / ( 2 Σ_{i=1}^{n} Σ_{j=1}^{n} x_j )    (2)
6. Stop word ratio: Stop words are words that are generally filtered out in NLP applications due to their disproportionate frequency and comparative semantic neutrality (e.g. "a", "an", "the", etc.). We believe that human-written text will contain a higher ratio of such stop words. (A sketch of how these features can be computed is given after this list.)
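The following minimal sketch (our own simplified illustration, not our full extraction code) shows how the Zipf, clumpiness, and stop-word-ratio features could be computed from a tokenized document; the WordNet lemmatization step is omitted, the stop-word list is only an illustrative subset, and additional features such as kurtosis and punctuation counts are not shown.

    import math
    from collections import Counter

    STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "it"}   # illustrative subset only

    def zipf_slope(tokens):
        # slope of log(frequency) vs. log(rank) fitted by ordinary least squares
        freqs = sorted(Counter(tokens).values(), reverse=True)
        xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
        ys = [math.log(f) for f in freqs]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        den = sum((x - mx) ** 2 for x in xs) or 1.0
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / den

    def clumpiness(tokens):
        # Gini coefficient of the token frequency distribution (equivalent sorted form of Eq. (2))
        x = sorted(Counter(tokens).values())
        n, total = len(x), sum(x)
        return 2 * sum((i + 1) * v for i, v in enumerate(x)) / (n * total) - (n + 1) / n

    def stop_word_ratio(tokens):
        return sum(t in STOP_WORDS for t in tokens) / len(tokens)

    tokens = "the cat sat on the mat and the cat slept".split()
    feature_vector = [zipf_slope(tokens), clumpiness(tokens), stop_word_ratio(tokens)]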
Early data analysis of our WikiIntro dataset (detailed in section 5.1) yields the following plots shown
in Figure 34. We observe differences in the distribution of the Zipf, Clumpiness, and Kurtosis scores
between human-written text and generated text. Given that the sample size is equivalent for both
groups of text, we see that the relative modes of the distribution are different.
Figure 34: Zipf, Clumpiness, and Kurtosis score distributions for the Wikipedia dataset
Our model incorporates a “Statistical Embeddings Layer” which extracts the aforementioned statistical
feature vectors from the input sequence and concatenates them to produce a single vector s. The
statistical embedding is then passed into a feed-forward layer which applies a linear transformation to s followed by a ReLU non-linearity.
We then experiment with two versions of our model. The early fusion approach adds the output of the above to the input embeddings of the RoBERTa model (Figure 35a). This is largely due to our confidence in the quantity and quality of the training data we collected, which warrants an early fusion approach. The late fusion approach infuses the statistical features into the output of the RoBERTa model instead, prior to it being processed by the classification head (Figure 35b).
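The following condensed PyTorch sketch illustrates the architecture just described; the class names, the pooling of the first (<s>) token, and the exact wiring are our own simplifications rather than our full implementation.

    import torch.nn as nn

    class StatisticalEmbeddings(nn.Module):
        """Project the concatenated statistical feature vector s to the encoder's
        hidden size and apply a ReLU non-linearity."""
        def __init__(self, num_features, hidden_size=768):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(num_features, hidden_size), nn.ReLU())

        def forward(self, stat_feats):          # (batch, num_features)
            return self.proj(stat_feats)        # (batch, hidden_size)

    class FusionDetector(nn.Module):
        """'early' adds the statistical embedding to the token (word) embeddings,
        'late' adds it to the pooled sequence representation before the head."""
        def __init__(self, roberta, head, num_features, mode="late"):
            super().__init__()
            self.roberta, self.head, self.mode = roberta, head, mode
            self.stats = StatisticalEmbeddings(num_features, roberta.config.hidden_size)

        def forward(self, input_ids, attention_mask, stat_feats):
            s = self.stats(stat_feats)                                    # (batch, hidden)
            if self.mode == "early":
                emb = self.roberta.embeddings.word_embeddings(input_ids) + s.unsqueeze(1)
                out = self.roberta(inputs_embeds=emb, attention_mask=attention_mask)
            else:
                out = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
            pooled = out.last_hidden_state[:, 0]                          # <s> token
            if self.mode == "late":
                pooled = pooled + s
            return self.head(pooled)                                      # classification logits

Here `roberta` would be, for example, a Hugging Face RobertaModel and `head` a small feed-forward classifier producing two logits.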
The baseline code for the pre-trained RoBERTa sequence classifier comes from OpenAI Google
(2018). We have written code to load and process our dataset as well as extract various statistical
features. Furthermore, we make original modifications to the RoBERTa model by adding the layers
described above, as well as by enabling early and late fusion. Overall, the main innovation of our project lies in our approach of combining neural and statistical methods for the task of machine-generated text detection.
Figure 35: (a) Early Fusion Model and (b) Late Fusion Model
Human Written Intro (Truncated): "A full-time job is employment in which a person works a minimum number of hours defined as such by their employer. Full-time employment often comes with benefits that are not typically offered to part-time, temporary, or flexible workers, such as annual leave, sick leave, and health insurance. Part-time jobs are mistakenly..."

GPT-3 Generated Intro (Truncated): "A full-time job is employment in which a person works a set number of hours each week, typically 40 hours. In some countries, full-time employment is the norm, while in others it is less common. The term "full-time job" can refer to a variety of different types of employment, including traditional jobs..."

Figure 3: Examples of human-written and GPT-3 completed introductions for "Full-time job"
7 Experiments
7.1 Data
We use two publicly available datasets for the training and testing of our baseline and main model.
Both datasets consist of human-written text and machine-generated text that are processed and
appropriately labeled to be used in our pipeline (examples in Figure 3). The first dataset is a
collection of GPT-3 generated text and human-written Wikipedia introductions for 150,000 topics
(Aaditya Bhat, 2023). We use this data to train our baseline model (as the pretrained model has only
seen GPT-2 text) and our main model, both early and late fusion. We use an 80-10-10 split of our
dataset for training, validation, and testing. We also pre-process the data by randomly truncating each
pair of human-written and model-generated text to the same length.
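A sketch of the pre-processing and split described above (illustrative only; the exact truncation bounds and seeding we used are not reproduced here):

    import random

    def truncate_pair(human_text, generated_text, rng):
        # truncate both texts of a topic pair to the same random word count,
        # so that length alone cannot separate the two classes
        h, g = human_text.split(), generated_text.split()
        n = rng.randint(1, min(len(h), len(g)))
        return " ".join(h[:n]), " ".join(g[:n])

    def split_80_10_10(examples, seed=0):
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        a, b = int(0.8 * len(examples)), int(0.9 * len(examples))
        return examples[:a], examples[a:b], examples[b:]      # train / validation / test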
We then utilize the PubMedQA dataset to test the transferability of our model. In particular, we make
use of the “long answers” that contain both human-written and machine-generated samples (Jin et al.,
2019). We decided to use the PubMedQA dataset because of its difference in both the topic and the
NLG model from our training set, which can provide a good insight into the robustness of our model.
The results and analysis of our experiments are presented in section 6.
We use prediction accuracy and the area under the receiver operating characteristic curve (AUROC) as our
metrics. AUROC is commonly used to evaluate the performance of binary classification models. It
measures the ability of a model to distinguish between two classes by plotting the True Positive Rate
(TPR) against the False Positive Rate (FPR). The ROC curve is a graphical representation of the
performance of a binary classifier as the discrimination threshold is varied. The area under this curve
(AUROC) represents the probability that the classifier will rank a randomly chosen positive instance
higher than a randomly chosen negative instance. AUROC ranges from 0.0 to 1.0, where a value of
0.5 indicates that the model is no better than random guessing, and a value of 1.0 indicates a perfect
classifier.
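For instance, with scikit-learn (an assumption on our part; the metric itself is library-independent), accuracy and AUROC can be computed from the detector's scores as follows:

    from sklearn.metrics import accuracy_score, roc_auc_score

    y_true = [0, 0, 1, 1, 1]                     # 1 = machine-generated, 0 = human-written
    y_score = [0.10, 0.40, 0.35, 0.80, 0.90]     # detector's probability of the positive class

    acc = accuracy_score(y_true, [int(s >= 0.5) for s in y_score])   # fixed 0.5 threshold
    auroc = roc_auc_score(y_true, y_score)                           # threshold-independent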
The first step of our research involves testing the existing “gold standard” detection model (OpenAI’s
model pretrained on GPT-2) on our WikiIntros GPT-3 dataset. We use the results of this as the
baseline performance before making changes to model architecture and passing in our statistical
features.
Following this, we fine-tune the pretrained model using our GPT-3 training set, and test it to establish
a baseline for the “gold standard” model’s performance after undergoing fine-tuning with a relatively
small training set (LR of 2e-05, 5 epochs). We use this to quantify the performance of the existing
model when tuned on newer data.
The core of our research focuses on our improved model, which incorporates the aforementioned
statistical features. We train this model on the same training set (LR of 2e-05, 10 epochs) and compare
the performance with the results from the previous two parts. We perform this training process twice,
once on our early-fusion model and once on our late-fusion model.
8. Predictable Outcomes
Running the tests, we observe that finetuning the RoBERTa model on the WikiIntros dataset
allows the model to achieve near perfect accuracy – and a 16% improvement over the same model
trained on datasets generated by older LLM models, in this case GPT-2 (detailed in Table 1). We see
that our late-fusion model achieves marginally higher performance than this – suggesting that the
incorporation of statistical features does help the model improve accuracy.
One reason for the high performance across the models is the nature of the dataset. There are certain
high-level differences between the human-written text and AI-generated text, in that the generated
text is more generalized/overview focused, while the human-written (which are the Wikipedia pages)
might include references to other similar terms/disambiguations. This may contribute to the model
overfitting on this particular dataset.
We observed that none of the models performed well on the PubMedQA dataset, implying poor
robustness toward distribution shifts. One potential explanation is that a scientific Q&A database has
much lower variance when it comes to response content. Correct responses, AI-generated or not, will
contain the same content and likely in a similar format, given the nature of the prompt (in this case
the question). In the WikiIntros dataset, by contrast, the prompt is much more open-ended, simply describing the topic at hand. Apart from the nature of the dataset, we could also attribute the poor robustness of the model to certain shortcomings of our experimental approach. For instance, we believe
that the model benefited only marginally from the statistical features due to small magnitudes and
sparseness; extracting information regarding punctuation, especially, yielded highly sparse features.
A greater variety of statistical features with non-negligible magnitudes may be needed to augment a
neural classifier better. Given the nature of our approach in which we performed an element-wise
addition of the statistical embedding vector to the input embedding (for early fusion) or output vector
of RoBERTa (for late fusion), a relatively sparse vector with low magnitude elements may not have
been the optimal way to incorporate our statistical features. Concatenation, instead of element-wise
addition, may have been a more favorable choice of fusing in the aforementioned statistical features.
Moreover, for the model to learn the linear transformation of these statistical features, we may need
to train the model for a larger number of epochs.
9 Outcomes Conclusion
Through our research, we were able to identify statistical features that, when fed into our RoBERTa model (late fusion), allowed it to marginally outperform the existing gold standard model for the AI-generated text detection task on the WikiIntros dataset. Both models performed very well on the task
with near perfect accuracy, and were significantly better than existing models trained on generated
text from older models (GPT-2).
Our research sought to achieve greater robustness towards distribution shifts through augmenting a
neural classifier with statistical features of the input text. When we tested our baseline and early/late
fusion models that were trained on the WikiIntros dataset using the PubMedQA dataset, we witnessed
the late-fusion model achieve marginal improvements in the AUROC metric compared to the baseline.
However, the improvement itself is marginal, and the magnitude still remains in the vicinity of 0.5, suggesting the model is only slightly better than a random guess even after incorporating the statistical features.
building useful detectors. However, there is currently no work that provides a literature review of existing detection works and highlights important research challenges.
In this paper, we present a critical literature review of the existing detection research for English to aid
understanding of this important area. We organize the survey to guide the reader seamlessly through a
number of important aspects, as follows: First, we establish the background for the detection task, which
includes TGMs, decoding methods for text generation, and social impacts of TGMs (§2). Second, we
present various aspects of large-scale TGMs such as model architecture, training cost, and controllability
(§3). Third, we present and discuss the various existing detectors in terms of their underlying methods
(§4). Fourth, we provide a linguistically and computationally motivated analysis of key issues of the
state-of-the-art detector (§5). Fifth, we discuss interesting future research directions that can help in
building useful detectors (§6). Our main contributions are three-fold:
• We provide the first survey on the important, burgeoning area of detection of machine generated
text from human written text.
• We develop an error analysis of the current state-of-the-art detector, guided and illustrated by machine
generated texts, to shed light on the limitations of existing detection work.
• Motivated by our analysis and existing challenges, we propose a rich and diverse set of research
directions to guide future work in this exciting area.
10 Background
Here, we provide the background for the problem of detecting machine generated text from human
written text. Specifically, we introduce key concepts in training a TGM, generating text from a TGM,
and social implications of using TGMs in practice. Existing detection datasets are discussed in Appendix.
10.1 Training TGM
TGM is typically a neural language model (NLM) trained to model the probability of a token given the previous tokens in a text sequence, i.e., p_θ(x_t | x_1, ..., x_{t−1}), with tokens coming from a vocabulary, x_i ∈ V. If x = (x_1, ..., x_{|x|}) represents the text sequence, p_θ typically takes the form p_θ(x) = Π_{t=1}^{|x|} p_θ(x_t | x_1, ..., x_{t−1}). If p*(x) denotes the reference distribution and D denotes a finite set of text sequences from p*, the TGM estimates parameters θ by minimizing the following objective function:

L(p_θ, D) = − Σ_{j=1}^{|D|} Σ_{t=1}^{|x^{(j)}|} log p_θ(x_t^{(j)} | x_1^{(j)}, ..., x_{t−1}^{(j)}).    (1)
Notice that TGM can be a non-neural model (e.g., n-gram LM) and based on nontraditional LM objective
(e.g., masked language modeling (Devlin et al., 2019; Song et al., 2019)). In this survey, we focus pri-
marily on TGMs for English that are neural and based on traditional LM objective, as they are successful
in generating coherent paragraphs of English text.
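The objective in Eq. (1) is the standard next-token negative log-likelihood; a minimal PyTorch sketch (assuming a `model` that maps a batch of token ids to next-token logits) is:

    import torch.nn.functional as F

    def lm_loss(model, input_ids):
        """Negative log-likelihood of Eq. (1): every token is predicted from the tokens before it.
        `model` is assumed to return logits of shape (batch, seq_len, |V|)."""
        logits = model(input_ids)
        pred = logits[:, :-1, :]                    # predictions for positions 2..T
        target = input_ids[:, 1:]                   # the tokens that actually follow
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))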
10.2 Generating text from TGM
Given a sub-sequence (prefix), x_{1:k} ∼ p*, the task of generating text from a TGM is to use p_θ to conditionally decode a continuation, x̂_{k+1:N} ∼ p_θ(· | x_{1:k}), such that the resulting completion (x_1, ..., x_k, x̂_{k+1}, ..., x̂_N) resembles a sample from p* (Welleck et al., 2020). In a news article generation task, the prefix can be a headline and the continuation the body of the news article. In a story generation task, the prefix can be the beginning of a story and the continuation the rest of the story. Since the computation of the optimal continuation (x̂_{k+1:N}) is not tractable, with a time complexity of O(|V|^(N−k)), approximate deterministic or stochastic decoding methods are utilized to generate continuations.
Deterministic methods: In deterministic methods, the continuation is fully determined by the TGM
parameters and prefix. The two most commonly used deterministic decoding methods are greedy search
and beam search. Greedy search works by selecting the highest-probability token at each time step: x_t = arg max p_θ(x_t | x_1, ..., x_{t−1}), with a time complexity of O((N − k)|V|). On the other hand, beam search maintains a fixed-size (b) set of partially decoded sequences, called hypotheses. At each time step, beam search creates new hypotheses by appending each token in the vocabulary to each existing hypothesis and scoring the resulting sequences using p_θ, with a time complexity of O((N − k)b|V|). In practice, these deterministic decoding methods depend highly on the underlying model probabilities and suffer from producing degenerate continuations, i.e., generic text, often with repetitive tokens (Holtzman et al., 2020). Recently, Welleck et al. (2020) showed that the degeneracy issues with beam search can be alleviated by training a TGM with the original TGM objective (Eq. (1)) augmented with an unlikelihood objective that assigns lower probabilities to unlikely generations.
Stochastic methods: Stochastic decoding methods work by sampling from a model-dependent distribution at each time step, x_t ∼ q(x_t | x_1, ..., x_{t−1}, p_θ). In unrestricted sampling (also known as pure sampling), the chance of sampling a low-confidence token from the unreliable tail of the distribution is very high, leading to text that can be unrelated to the prefix. To reduce the chance of sampling a low-confidence token, sampling is limited to a subset of the vocabulary W ⊂ V at each time step. Let Z = Σ_{x∈W} p_θ(x | x_1, ..., x_{t−1}). If x_t ∈ W, q(x_t | x_1, ..., x_{t−1}, p_θ) is set to p_θ(x_t | x_1, ..., x_{t−1})/Z, otherwise it is set to 0. The two most effective stochastic decoding methods are top-k sampling (Fan et al., 2018) and top-p (or nucleus) sampling (Holtzman et al., 2020). The top-k sampler limits sampling to the k most probable tokens, that is, W is the size-k subset of V that maximizes Σ_{x∈W} p_θ(x | x_1, ..., x_{t−1}). The top-k sampler uses a constant value of k, which can be sub-optimal in different contexts, that is, the generated text is limited to a subset of the natural language distribution. For example, generic contexts (e.g., predicting a noun) might require a larger value of k, while other contexts (e.g., predicting prepositions) might require a smaller value of k so that only useful candidate tokens are considered. The nucleus sampler overcomes the burden of considering only a fixed number of tokens by limiting sampling to the smallest set of tokens with total mass above a threshold p ∈ [0, 1], i.e., W is the smallest subset with Σ_{x∈W} p_θ(x | x_1, ..., x_{t−1}) ≥ p. Thus, the number of candidate tokens considered varies dynamically depending on the context, and the resulting text is reasonably natural with fewer repetitions. Recently, Massarelli et al. (2020) show that the top-k and top-p samplers tend to generate more nonfactual sentences, as verified against Wikipedia.
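Both samplers can be summarized in a few lines; the sketch below (our own illustration using PyTorch, not tied to any particular TGM) performs one decoding step given the model's next-token distribution.

    import torch

    def sample_next_token(probs, k=None, p=None):
        """One stochastic decoding step; `probs` is the model's next-token distribution
        (a 1-D tensor over the vocabulary V)."""
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        if k is not None:                              # top-k: keep the k most probable tokens
            sorted_probs, sorted_idx = sorted_probs[:k], sorted_idx[:k]
        if p is not None:                              # top-p: smallest prefix with mass >= p
            cumulative = torch.cumsum(sorted_probs, dim=0)
            cutoff = int(torch.searchsorted(cumulative, torch.tensor(p))) + 1
            sorted_probs, sorted_idx = sorted_probs[:cutoff], sorted_idx[:cutoff]
        sorted_probs = sorted_probs / sorted_probs.sum()   # renormalise over the kept set W
        choice = torch.multinomial(sorted_probs, num_samples=1)
        return int(sorted_idx[choice])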
TGM | training text sequence x (data size / params) | prefix x_{1:k} | continuation x̂_{k+1:N} | decoding method | threats discussed
GPT-2 (Radford et al., 2019) | fragments from WebText (collection of internet articles) (40GB / 1.5B) | starting of an article (e.g., a few lines about a research finding) | rest of the article (e.g., rest of the research finding) | top-k | NA
GROVER (Zellers et al., 2019) | news articles along with their meta-information from RealNews (120GB / 1.5B) | meta-information/body of a news article (e.g., headline, author) | missing meta-information/body in the prefix | top-p | trustworthy fake news
CTRL (Keskar et al., 2019) | control code (e.g., URL) followed by text (e.g., news article) from several domains (140GB / 1.6B) | control code (e.g., URL), optionally with some strings of text | article corresponding to the control code | greedy search with repetition penalty | NA
Adelani et al. (2020) | product reviews (fine-tuning GPT-2) (20GB / 0.1B) | product review (human written) | product review (machine) | top-k | fake product reviews
Dathathri et al. (2020) | no training and no fine-tuning | beginning of a story or general articles | rest of the story or article | top-k | NA
GPT-3 (Brown et al., 2020) | fragments from CommonCrawl (570GB / 175B) | three previous news articles and title of a proposed article | body of the proposed article | top-p | fake news

Table 18: Summary of the characteristics of TGMs that can act as threat models. The last column corresponds to the threats discussed in the original paper.
including social media, email clients, government websites, and e-commerce websites.
various entities of variable sizes and resource capabilities can practically deploy models for spreading
disinformation using TGMs.
12 Detectors
In this section, we discuss various detectors for identifying machine generated text from human writ-
ten text. To aid understanding of the literature, we organize the detectors according to the underlying
methods on which they are based.
GPT-2 model).
Detecting machine configuration: Tay et al., (2020) study the extent to which different modeling
choices (decoding method, TGM model size, prompt length) leave artifacts (detectable signatures that
arise from modeling choices) in the generated text. They propose the task of identifying the TGM mod-
eling choice given the text generated by TGM. They show that a classifier can be trained to predict the
modeling choice well beyond the chance level, which ascertains that text generated by TGM may be
more sensitive to TGM modeling choices than previously thought. They also find that the proposed
detection task of identifying text generated by different TGM modeling choices is less difficult than the task of identifying text generated by a TGM from human written text under different TGM modeling choices. They also show that word order does not matter much, as a bag-of-words detector performs very similarly to detectors based on complex encoders (e.g., transformers). This result is consistent with the recent
work done by Uchendu et al., (2020), which shows that simple models (traditional ML models trained
on psychological features and simple neural network architectures) perform well in three settings: (i)
classify if two given articles are generated by the same TGM; (ii) classify if a given article is written
by a human or a TGM (the original detection problem); (iii) identify the TGM that generated a given
article (similar to Tay et al. (2020)). For the original detection problem, the authors find the text
generated by the GPT-2 model to be hard to detect among several TGMs (see Appendix for the list of
studied TGMs).
12.2 Zero-shot classifier
In the zero-shot classification setting, a pretrained TGM (for example, GPT-2, GROVER) is employed
to detect generations from itself or similar models. The detector does not require supervised detection
examples for further training (i.e., fine-tuning).
Total log probability: Solaiman et al., (2019) present a baseline that uses TGM to evaluate total log
probability, and thresholds based on this probability to make the prediction. For instance, text is predicted
as machine generated if the overall likelihood of the text according to the GPT-2 model is closer to the
mean likelihood over all machine generated texts than to the mean likelihood of human written texts.
However, they find that this classifier performs poorly compared to the previously discussed logistic
regression based classifier (§4.1).
Giant Language model Test Room (GLTR) tool: The GLTR tool (Gehrmann et al., 2019) proposes a
suite of baseline statistical methods that can highlight the distributional differences in text generated
by GPT-2 model and human written text. Specifically, GLTR enables the study of a piece of text by
visualizing per-token model probability, per-token rank in the predicted next token distribution, and
entropy of the predicted next token distribution. Based on these visualizations, the tool clearly shows
that TGMs over-generate from a limited subset of the true distribution of natural language. Indeed, rare
word usage in text generated by the GPT-2 model is markedly lower than in human written text. The tool lets humans (including non-experts) study a piece of text, but might become less effective in the future once TGMs start generating text that lacks statistical anomalies.
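A rough sketch of the per-token statistics that GLTR visualizes, computed here with the Hugging Face implementation of GPT-2 (the GLTR authors' own code differs in detail):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def per_token_stats(text):
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits                       # (1, seq_len, vocab)
        probs = torch.softmax(logits[0, :-1], dim=-1)        # prediction made at each position
        targets = ids[0, 1:]                                 # the token that actually came next
        token_probs = probs[torch.arange(len(targets)), targets]
        ranks = (probs > token_probs.unsqueeze(-1)).sum(dim=-1) + 1   # rank 1 = model's top choice
        return token_probs.tolist(), ranks.tolist()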
itself as the BERT detector and the BERT generator possess similar inductive bias. Uchendu et al., (2020)
show that the off-the-shelf GROVER detector does not perform well in detecting text generated by TGMs
other than the original GROVER model.
RoBERTa detector: Solaiman et al., (2019) experiment with fine-tuning the RoBERTa language model
for the detection task and establishes the state-of-the-art performance in identifying the web pages gener-
ated by the largest GPT-2 model with ∼95% accuracy. The RoBERTa detector trained on top-p examples
transfers well to examples from all the other decoding methods (pure and top-k). Regardless of the de-
tector model’s capacity, the detector performs well when trained on examples from the larger GPT-2
model and transfers well to examples generated by a smaller GPT-2 model. On the other hand, training
on smaller GPT-2 model’s outputs results in poor performance in classifying the larger GPT-2 model’s
outputs. The most interesting finding of this work is that fine-tuning using the RoBERTa model achieves
higher accuracy than fine-tuning a GPT-2 model with equivalent capacity. This result might be due to
the superior quality of the bidirectional representations inherent in the masked language modeling ob-
jective employed by the RoBERTa language model compared to the GPT-2 language model, which is
limited by learning only unidirectional representation (left to right). This finding contradicts that of the
GROVER work (Zellers et al., 2019), where the authors conclude that the best models for detecting neu-
ral disinformation from a TGM is the TGM itself. Recently, Fagni et al., (2020) show that the RoBERTa
detector establishes the state-of-the-art performance in spotting machine generated tweets from human
written tweets accurately, outperforming both traditional ML models (e.g., bag-of-words) and complex
neural network models (e.g., RNN, CNN) by a large margin. This interesting result indicates that the
RoBERTa detector can generalize to publication sources unseen during its pretraining such as Twitter.
The RoBERTa detector also outperforms existing detectors in spotting news articles generated by several
TGMs (Uchendu et al., 2020) and product reviews generated by the GPT-2 model fine-tuned on Amazon
product reviews (Adelani et al., 2020).
Issues with the state-of-the-art detector
In this section, we discuss open issues in the state-of-the-art detector based on the RoBERTa model,
which has been shown to excel in detecting text generated by TGM based on news articles, product
reviews, tweets, and web pages (see §4.3). 3 We focus on the task of detecting text generated by the
GPT-2 model from human written Amazon product reviews, a challenging task given the shortness of
reviews. We employ the RoBERTa detector on the publicly available dataset, containing generations
from the GPT-2 model (1542M parameters) based on pure, top-k and top-p sampling along with human
written reviews (see Appendix for dataset details). In Figure 36, we plot the accuracy of the detector
w.r.t. number of training examples per class, averaged over ten random initializations to control for
initialization effects. We observe that the RoBERTa detector needs several thousands of examples to
reach high accuracy. Specifically, it has an impractical requirement of 200K, 15K and 50K training
examples for performing at 90% accuracy on identifying pure, top-k and top-p examples respectively. 4
Given that creation of large datasets for the detection task is hard (Zellers et al., 2019), it is important to
investigate whether the data-efficiency of the RoBERTa detector can be significantly improved.
Figure 36: Detection accuracy of the RoBERTa detector w.r.t. number of training examples per class,
averaged over ten random initializations.
We manually inspect 100 randomly picked false positives (machine generated product review incor-
rectly predicted as human written product review) of the RoBERTa detector trained on 15K examples
each from top-p generations and from human written reviews.5 Below, we list the error categories that we have identified and provide at least one example for each category.
Fluency: Among the false positive reviews, we find 73 reviews to be very fluent; these could confuse even humans, as in example (1).
(1) I loved this film. I can’t really explain why, but when I first saw it it struck me as bizarre, al-
most oddball, but I quickly got over that and remembered that I love oddball films. This was
an early 80’s film. A great film to see on a gloomy rainy evening. This film is suspenseful
and full of weirdness. Add this to your collection.
Shortness: Out of these 73 identified fluent reviews, 27 reviews are very short, with a median of 24
words. We give two examples below:
(2) love it. best sweeper.
(3) My favorite combo. Always works and usually cools my system to boot. So glad I got these
instead of other brands.
Factuality: We find 10 false positive reviews to contain factual errors.
3. Concurrent with our work, Zhong et al. (2020) propose a detector that leverages the factual and coherence structure underlying the text, which outperforms the RoBERTa detector in spotting machine generated text based on news articles and web pages. We also acknowledge that detectors fine-tuned on state-of-the-art NLMs such as T5 (Raffel et al., 2020) and ELECTRA (Clark et al., 2020) would most likely outperform the RoBERTa detector in general.
4. Given that attackers can create synthetic text at scale using TGMs, 90% detection accuracy might not be a high accuracy.
5. As seen in §2.2 and §4, top-p sampling produces good quality text that reasonably matches the style of human writing and is also harder for humans to detect. We leave the study of false negatives for future work. Our annotation of 100 false positives can be accessed at: https://github.com/UBC-NLP/coling2020_machine_generated_text.
in building intelligent TGMs that narrows the gap between machine and human distribution of natural
language text, auxiliary signals could play a crucial role in mitigating the threats posed by TGMs.
Assessing veracity of the text
Existing detectors have an assumption that the fake text is determined by the source (e.g., TGM) that
generated the text. This assumption does not hold true in two practical scenarios: (i) real text auto-
generated in a process similar to that of fake text, and (ii) adversaries creating fake text by modifying
articles originating from legitimate human sources. Schuster et al., (2020) show that existing detectors
perform poorly in these two scenarios as they rely too much on distributional features, which cannot help
in distinguishing texts from similar sources. Hence, we call for more research on detectors that assess
the veracity of machine generated text by consulting external sources, like knowledge bases (Thorne and
Vlachos, 2018) and diffusion network (Vosoughi et al., 2018), instead of relying only on the source.
12.5 Building generalizable detectors
Existing detectors exhibit poor cross-domain accuracy, that is, they are not generalizable to different
publication formats (Wikipedia, books, news sources) (Bakhtin et al., 2019). Beyond publication formats
and topics (e.g., politics, sports), the detector should also transfer to unseen TGM settings such as model
architecture, different decoding methods (e.g., top-k, top-p), model size, different prefix lengths, and
training data (Bakhtin et al., 2020; Uchendu et al., 2020).
12.6 Building interpretable detectors
We discussed the importance of human raters pairing up with automatic detectors in §4.4. A viable way
for this collaboration is to make the decisions taken by the automatic detector interpretable (as in GLTR) so that human raters can logically group the model's decisions (e.g., contradictions) and "accept", "modify", or "reject" them. This calls for more research into building detectors that can provide explanations for their decisions which are understandable to humans.
12.7 Building detectors robust to adversarial attacks
Existing detectors are brittle, i.e., the detector decisions can vary significantly for even small changes
in the text input. For example, Wolff (2020) shows that the RoBERTa detector can be attacked using
simple schemes such as replacing characters with homoglyphs and misspelling some words. These two
attacks reduce the detector's recall on text generated by a TGM from 97.44% to 0.26% and 22.68%, respectively. Therefore, it is important to study various adversarial attacks, ranging from simple attacks (e.g.,
misspellings) to advanced attacks (e.g., universal attacks (Wallace et al., 2019)) and create adversarial
examples with an aim to characterize the vulnerabilities of the detector as well as to make the detector
robust against various attacks.
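As a concrete illustration of the first attack type, the toy sketch below (our own example; not the exact substitutions used by Wolff (2020)) replaces a few Latin characters with visually identical Cyrillic ones:

    HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}   # Latin -> Cyrillic look-alikes

    def homoglyph_attack(text):
        # the output looks identical to a human reader, but the detector's tokenizer
        # now sees unfamiliar characters in place of common ones
        return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

    attacked = homoglyph_attack("this review was generated by a language model")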
13. Conclusion
Detectors able to tease apart machine generated text from human written text can play a vital role in
mitigating misuse of TGMs such as in automatic creation of fake news and fake product reviews. Our
categorization of existing detectors and related issues into classifiers trained from scratch, zero-shot clas-
sifiers, fine-tuning NLMs, and human-machine collaboration can help readers contextualize each detector
w.r.t. the fast-growing literature. We also hope that our computationally and linguistically motivated error
analysis of the state-of-the-art detector can bring readers up to speed on many existing challenges in
building useful detectors. Our rich and diverse set of research directions also has the potential to guide
future work in this exciting area.
14. Acknowledgements
We thank Ramya Rao Basava and Peter Sullivan for helpful discussions in the initial stage of the
project. We gratefully acknowledge support from the Natural Sciences and Engineering Research Council
of Canada, Compute Canada (https://www.computecanada.ca), and UBC ARC–Sockeye
(https://doi.org/10.14288/SOCKEYE).
15. References
David Ifeoluwa Adelani, Haotian Mai, Fuming Fang, Huy H. Nguyen, Junichi Yamagishi, and Isao Echizen. 2020.
Generating Sentiment-Preserving Fake Online Reviews Using Neural Language Models and Their Human-
and Machine-Based Detection. In Proceedings of the 34th International Conference on Advanced Information
Networking and Applications, AINA-2020, volume 1151, pages 1341–1354.
Anton Bakhtin, Sam Gross, Myle Ott, Yuntian Deng, Marc’Aurelio Ranzato, and Arthur Szlam. 2019. Real or
Fake? Learning to Discriminate Machine from Human Generated Text. CoRR, abs/1906.03351.
Anton Bakhtin, Yuntian Deng, Sam Gross, Myle Ott, Marc’Aurelio Ranzato, and Arthur Szlam. 2020. Energy-
Based Models for Text. CoRR, abs/2004.10188.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with
Subword Information. Transactions of the Association for Computational Linguistics, pages 135–146.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language
Models are Few-Shot Learners. CoRR, abs/2005.14165.
Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. Electra: Pre-training text
encoders as discriminators rather than generators. In International Conference on Learning Representations.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual Language Model Pretraining. In Advances in Neural
Information Processing Systems 32, pages 7059–7069.
Kate Crawford. 2017. The trouble with bias. NIPS 2017 Keynote.
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and
Rosanne Liu. 2020. Plug and Play Language Models: A Simple Approach to Controlled Text Generation.
In International Conference on Learning Representations.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and
Short Papers), pages 4171–4186.
Liam Dugan, Daphne Ippolito, Arun Kirubarajan, and Chris Callison-Burch. 2020. RoFT: A Tool for Evaluating
Human Detection of Machine-Generated Text. CoRR, abs/2010.03070.
Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
Tiziano Fagni, Fabrizio Falchi, Margherita Gambini, Antonio Martella, and Maurizio Tesconi. 2020. TweepFake:
about Detecting Deepfake Tweets. CoRR, abs/2008.00036.
Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical Neural Story Generation. In Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages
889–898.
Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. 2019. GLTR: Statistical Detection and Visualization
of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics:
System Demonstrations, pages 111–116.
Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: A Framework for Modeling the Local
Coherence of Discourse. Computational Linguistics, 21(2):203–225.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text
Degeneration. In International Conference on Learning Representations.
Dirk Hovy. 2016. The Enemy in Your Own Camp: How Well Can We Detect Statistically-Generated Fake Reviews
– An Adversarial Study. In Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 351–356.
Daphne Ippolito, Daniel Duckworth, Chris Callison-Burch, and Douglas Eck. 2020. Automatic Detection of Generated
Text is Easiest when Humans are Fooled. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, pages 1808–1822.
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A
Conditional Transformer Language Model for Controllable Generation. CoRR, abs/1909.05858.
Guanxiong Liu, Tzu-Ming Harry Hsu, Matthew B. A. McDermott, Willie Boag, Wei-Hung Weng, Peter Szolovits,
and Marzyeh Ghassemi. 2019a. Clinically Accurate Chest X-Ray Report Generation. In Proceedings of the
Machine Learning for Healthcare Conference, MLHC, volume 106, pages 249–269.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A Robustly Optimized BERT Pretraining Approach.
CoRR, abs/1907.11692.
Luca Massarelli, Fabio Petroni, Aleksandra Piktus, Myle Ott, Tim Rocktäschel, Vassilis Plachouras, Fabrizio
Silvestri, and Sebastian Riedel. 2020. How Decoding Strategies Affect the Verifiability of Generated Text.
CoRR, abs/1911.03587.
Moin Nadeem, Anna Bethke, and Siva Reddy. 2020. StereoSet: Measuring Stereotypical Bias in Pretrained
Language Models. CoRR, abs/2004.09456.
Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR’s
WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation
(Volume 2: Shared Task Papers, Day 1), pages 314–319.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding
by Generative Pre-Training. https://s3-us-west-2.amazonaws.com/openai-assets/
research-covers/language-unsupervised/language_understanding_paper.pdf.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language
Models are Unsupervised Multitask Learners. https://d4mucfpksywv.cloudfront.net/
better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li,
and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal
of Machine Learning Research, 21(140):1–67.
Tal Schuster, Roei Schuster, Darsh J. Shah, and Regina Barzilay. 2020. The Limitations of Stylometry for Detecting
Machine-Generated Fake News. Computational Linguistics.
Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec Radford, and
Jasmine Wang. 2019. Release Strategies and the Social Impacts of Language Models. CoRR, abs/1908.09203.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked Sequence to Sequence
Pre-training for Language Generation. In International Conference on Machine Learning, pages 5926–5936.
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep
Learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pages 3645–3650.
Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding,
Kai-Wei Chang, and William Yang Wang. 2019. Mitigating Gender Bias in Natural Language Processing:
Literature Review. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
pages 1630–1640.
Reuben Tan, Bryan A. Plummer, and Kate Saenko. 2020. Detecting Cross-Modal Inconsistency to Defend Against
Neural Fake News. CoRR, abs/2009.07698.