On The Possibilities of AI-Generated Text Detection
Abstract
Our work addresses the critical issue of distinguishing text generated by Large Language
Models (LLMs) from human-produced text, a task essential for numerous applications. Despite
ongoing debate about the feasibility of such differentiation, we present evidence supporting its
consistent achievability, except when human and machine text distributions are indistinguishable
across their entire support. Drawing from information theory, we argue that as machine-generated
text approximates human-like quality, the sample size needed for detection increases. We establish
precise sample complexity bounds for detecting AI-generated text, laying groundwork for future
research aimed at developing advanced, multi-sample detectors. Our empirical evaluations
across multiple datasets (Xsum, Squad, IMDb, and Kaggle FakeNews) confirm the viability
of enhanced detection methods. We test various state-of-the-art text generators, including
GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against
detectors including the RoBERTa-Large/Base detectors and GPTZero. Our findings align with OpenAI’s
empirical data related to sequence length, marking the first theoretical substantiation for these
observations.
1 Introduction
Large Language Models (LLMs) like GPT-3 mark a significant milestone in the field of Natural
Language Processing (NLP). Pre-trained on vast text corpora, these models excel in generating
contextually relevant and fluent text, advancing a variety of NLP tasks including language translation,
question-answering, and text classification. Notably, their capacity for zero-shot generalization
obviates the need for extensive task-specific training. Recent research by (Shin et al., 2021) further
highlights the LLMs’ versatility in generating diverse writing styles, ranging from academic to
creative, without the need for domain-specific training. This adaptability extends their applicability
to various use-cases, including chatbots, virtual assistants, and automated content generation.
However, the advanced capabilities of LLMs come with ethical challenges (Bommasani et al.,
2022). Their aptitude for generating coherent, contextually relevant text opens the door for misuse,
such as the dissemination of fake news and misinformation. These risks erode public trust and distort
societal perceptions. Additional concerns include plagiarism, intellectual property theft, and the
generation of deceptive product reviews, which negatively impact both consumers and businesses.
LLMs also have the potential to manipulate web content maliciously, influencing public opinion and
political discourse.
∗ Equal Contributions. The authors are with the Department of Computer Science, University of Maryland, College
Park, MD, USA. Email: {schkra3,amritbd,sczhu,bangan,dmanocha,furongh}@umd.edu
Given these ethical concerns, there is an imperative for the responsible development and deploy-
ment of LLMs. The ethical landscape associated with these models is complex and multifaceted.
Addressing these challenges is vital for harnessing the societal benefits that responsibly deployed
LLMs can offer. To this end, recent research has pivoted towards creating detectors capable of
distinguishing text generated by machines from that authored by humans. These detectors serve as
a safeguard against the potential misuse of LLMs. One central question underpinning this area of
research is:
"Is it possible to detect the AI-generated text in practice?"
Our work provides an affirmative answer to this question. Specifically, we demonstrate that detecting
AI-generated text is nearly always feasible, provided multiple samples are collected, as illustrated in
Figure 1. The necessity for collecting multiple samples is consistent with real-world settings where
[Figure 1: left panel, ROC curves (TPR vs. FPR) for n = 30, 100, 500; right panel, the corresponding AUROC values.]
Figure 1: In light of the sample complexity bound presented in Theorem 1, we show pictorially
how increasing the number of samples n used for detection affects the ROC of the best possible
detector, which is achieved by the likelihood-ratio-based classifier. In the ROC curve on the left,
for TV(m, h) = 0.1, the AUROC of the best possible detector is 0.6 as derived in
(Sadasivan et al., 2023) (shown by an orange dot in the right figure); an AUROC of 0.6 would lead
to the conclusion that detection is hard. In contrast, as the number of samples n increases, the
ROC upper bound approaches 1 exponentially fast (shown by the shaded blue region in the left
figure for different values of n), and hence the AUROC of the best possible detector also increases,
as shown by the corresponding blue dots in the right figure. This ensures that detection is
possible even in hard scenarios where the TV(m, h) distance is small.
data is abundant. For example, in the context of social media bot identification, one can readily
collect multiple posts to determine their origin, whether machine-generated or human-authored.
This real-world applicability emphasizes the importance and urgency of developing sophisticated
detection mechanisms for the ethical use of LLMs. We summarize our main contributions as follows.
(1) Possibility of AI-generated text detection: We utilize a mathematically rigorous
approach to answer the question of the possibility of AI-generated text detection. We conclude that
there is a hidden possibility of detecting AI-generated text, one that improves with the text
sequence (token) length.
(2) Sample complexity of AI-generated text detection: We derive the sample complexity
bounds, a first-of-its-kind tailored for detecting AI-generated text for both IID and non-IID settings.
(3) Comprehensive Empirical Evaluations: We have conducted extensive empirical evalua-
tions for real datasets Xsum, Squad, IMDb, and Fake News dataset with state-of-the-art generators
(GPT-2, GPT-3.5-Turbo, Llama, Llama-2 (13B), Llama-2 (70B)) and detectors (OpenAI’s RoBERTa
(large), OpenAI’s RoBERTa (base), and ZeroGPT, a SOTA detector).
Traditional approaches. These rely on statistical outlier detection, employing metrics such as
entropy, perplexity, and n-gram frequency to differentiate between human- and
machine-generated texts (Lavergne et al., 2008; Gehrmann et al., 2019). However, with the advent of
ChatGPT (OpenAI, 2023), a new statistical detection methodology, DetectGPT
(Mitchell et al., 2023), has been developed. It operates on the principle that text generated by
the model tends to lie in the negative curvature areas of the model’s log probability. DetectGPT
(Mitchell et al., 2023) generates and compares multiple perturbations of model-generated text to
determine whether the text is machine-generated or not based on the log probability of the original
text and the perturbed versions. DetectGPT significantly outperforms the majority of the existing
zero-shot methods for model sample detection with very high AUC scores (note that we use the terms
AUROC and AUC interchangeably for presentation convenience).
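As a purely schematic illustration of this curvature criterion, the sketch below uses toy stand-ins invented for this example (`log_prob` in place of the scoring LLM, `perturb` in place of the mask-and-refill perturbation model); it is not DetectGPT itself, only the shape of its score.

```python
import random

random.seed(0)

# Toy stand-in for an LLM's log-probability: pretend the model strongly
# prefers lowercase text, mimicking a model that assigns higher likelihood
# to its own outputs.
def log_prob(text):
    return -sum(ch.isupper() for ch in text) - 0.01 * len(text)

# Toy stand-in for the perturbation model: randomly flip the case of chars.
def perturb(text):
    return "".join(c.swapcase() if random.random() < 0.3 else c for c in text)

def detect_gpt_score(text, n_perturbations=50):
    """Machine-generated text should sit near a local maximum of log-prob,
    so perturbations should lower it; the gap is the detection score."""
    base = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    return base - sum(perturbed) / len(perturbed)

machine_like = "all lowercase text the toy model loves"  # high toy log-prob
print(detect_gpt_score(machine_like) > 0)
```

A large positive score flags the text as machine-generated; human text, not sitting at a likelihood peak, would yield a score near zero.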
select tokens from the green list, based on prior tokens, resulting in watermarks that are typically
unnoticeable to humans. These advancements in watermarking technology not only strengthen
copyright protection and content authentication but also open up new avenues for research in areas
such as privacy in language, secure communication, and digital rights management.
Impossibility result. Recent work (Sadasivan et al., 2023; Krishna et al., 2023) showed the
vulnerabilities of watermark-based detection methodologies to vanilla paraphrasing
attacks. (Sadasivan et al., 2023) developed a lightweight neural network-based paraphraser and
applied it to the output text of the AI-generative model to evade a whole range of detectors, including
watermarking schemes, neural network-based detectors, and zero-shot classifiers. (Sadasivan et al.,
2023) also introduced a notion of spoofing attacks where they exposed the vulnerability of LLMs
protected by watermarking under such attacks. (Krishna et al., 2023) on the other hand, trained a
paraphrase generation model capable of paraphrasing paragraphs and showed that paraphrased texts
with DIPPER (Krishna et al., 2023) evade several detectors, including watermarking, GPTZero,
DetectGPT, and OpenAI’s text classifier with a significant drop in accuracy. Additionally, (Sadasivan
et al., 2023) highlighted the impossibility of machine-generated text detection when the total variation
(TV) norm between human and machine-generated text distributions is small.
In this work, we show that there is a hidden possibility of detecting the AI-generated text even
if the TV norm between human and machine-generated text distributions is small. This result is in
support of the recent detection possibility claims by Krishna et al. (2023).
The rigorous definitions of TPRγ and FPRγ are as follows.
TPRγ = Ps∼m(·)[D(s) ≥ γ] = ∫ I{D(s)≥γ} · m(s) ds,   (3)

FPRγ = Ps∼h(·)[D(s) ≥ γ] = ∫ I{D(s)≥γ} · h(s) ds,   (4)
where I{condition} is the indicator function which takes value 1 if the condition is true, and 0 otherwise.
Note that, without loss of generality, we have chosen to consider m(s) and h(s) as the probability
density functions of machine and human on a sample s by considering continuous s (as also
considered in (Sadasivan et al., 2023)); similar results hold for discrete s by replacing the
integral with a summation and by considering m(s) and h(s) as the probability mass functions of
machine and human on a sample s.
Both TPRγ and FPRγ are within the closed interval [0, 1] for any threshold γ. For a good
detector, TPRγ should be as high as possible, and FPRγ should be as low as possible. As a result,
a high area under the ROC curve (AUROC) is desirable for detection. The AUROC lies between 1/2 and 1,
i.e., AUROC ∈ [1/2, 1]: a value of 1/2 corresponds to random detection, and a value of 1 indicates
perfect detection. For efficient detection, the goal is to design a detector D such that the AUROC
is as high as possible.
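For concreteness, the quantities in (3)-(4) and the AUROC can be estimated empirically. The sketch below uses hypothetical 1-D Gaussian stand-ins for m and h with the identity score D(s) = s; none of these distributional choices come from the paper's setup, they only make the definitions concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: m = N(0.5, 1) for machine, h = N(0, 1) for human,
# with the identity score function D(s) = s.
m_samples = rng.normal(0.5, 1.0, 100_000)
h_samples = rng.normal(0.0, 1.0, 100_000)

def tpr_fpr(gamma):
    """Empirical versions of (3) and (4) for the threshold detector."""
    tpr = np.mean(m_samples >= gamma)  # P_{s~m}[D(s) >= gamma]
    fpr = np.mean(h_samples >= gamma)  # P_{s~h}[D(s) >= gamma]
    return tpr, fpr

# AUROC equals P(D(s_m) > D(s_h)) for independent s_m ~ m, s_h ~ h, so
# paired independent draws give a direct Monte Carlo estimate.
auroc = np.mean(m_samples > h_samples)
print(round(float(auroc), 3))
```

Sweeping `gamma` in `tpr_fpr` traces out the full ROC curve; the single number `auroc` summarizes it.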
for any detector D and any threshold γ. We note that the above bound is tight and can always
be achieved with equality by likelihood-ratio-based detectors for any distribution m and h, by the
Neyman-Pearson Lemma (Cover, 1999, Chapter 11). We restate the lemma for completeness and
discuss its tightness in Appendix B.1. From the definitions of TPR and FPR in (3)-(4), it holds that

TPRγ ≤ min{FPRγ + TV(m, h), 1},   (7)

where min is used because TPRγ ∈ [0, 1]. The upper bound in (7) is called the ROC upper bound
and is the bound leveraged in one of the recent works (Sadasivan et al., 2023) to derive the AUROC
upper bound AUC ≤ 1/2 + TV(m, h) − TV(m, h)²/2, which holds for any D. This upper bound led to the
claim of the impossibility of detecting AI-generated text whenever TV(m, h) is small.
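Numerically, this single-sample bound is easy to evaluate; for example, TV(m, h) = 0.1 gives an upper bound of 0.595, matching the ≈0.6 value discussed around Figure 1.

```python
def auroc_upper_bound(tv):
    """Single-sample AUROC bound of (Sadasivan et al., 2023):
    AUC <= 1/2 + TV(m, h) - TV(m, h)^2 / 2."""
    return 0.5 + tv - tv ** 2 / 2

# TV = 0.1 yields 0.595, barely above the random-guessing value of 0.5.
print(round(auroc_upper_bound(0.1), 3))  # 0.595
```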
Hidden Possibility. However, we note that the claim of impossibility from the AUROC upper bound
could be too conservative for detection in practical scenarios. For instance, we provide a motivating
example of detecting whether an account on Twitter is an AI bot or a human. It is natural that we
will have a collection of text samples from the account, denoted by {s_i}_{i=1}^n, and it is realistic to
assume that n is very high. Therefore, the natural practical question is whether we can detect if the
provided text set {s_i}_{i=1}^n is machine-generated or human-generated. With this motivation, we next
explain that detection is always possible.
We formalize the problem setting and prove our claim by utilizing the existing results in the
information theory literature. Let us consider the same setup as detailed before, while we are given
a set of samples S := {s_i}_{i=1}^n. For simplicity, we assume that the samples are drawn i.i.d. from
either the human h or the machine m. Interestingly, the hypothesis test can now be re-written as

H0 : S ∼ m⊗n   vs.   H1 : S ∼ h⊗n,   (8)

where m⊗n := m ⊗ m ⊗ · · · ⊗ m (n times) denotes the product distribution, as does h⊗n. This is
one of the key observations: with multiple samples, the correct hypothesis-testing framework
operates on these product distributions. Similar to before (cf. (7)), Le Cam's lemma now gives that the
minimum sum of the Type-I and Type-II error rates equals 1 − TV(m⊗n, h⊗n), which implies
TPRnγ ≤ min{FPRnγ + TV(m⊗n, h⊗n), 1},   (9)

where

TPRnγ = PS∼m⊗n[D(S) ≥ γ] = ∫ I{D(S)≥γ} · m⊗n(S) dS,   (10)

FPRnγ = PS∼h⊗n[D(S) ≥ γ] = ∫ I{D(S)≥γ} · h⊗n(S) dS.   (11)
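To build intuition for how TV(m⊗n, h⊗n) behaves, the sketch below evaluates it for hypothetical equal-variance Gaussian stand-ins m = N(δ, σ²) and h = N(0, σ²); for this pair the product distributions differ only along a 1-D mean-shift direction, yielding a simple closed form. The closed form is specific to Gaussians and is used here only for illustration, not as a claim about text distributions.

```python
import math

def tv_product_gaussians(delta, sigma, n):
    """TV between the n-fold products of m = N(delta, sigma^2) and
    h = N(0, sigma^2).  The mean-shift direction has length
    sqrt(n) * delta, giving TV = 2 * Phi(sqrt(n) * delta / (2 * sigma)) - 1
    (Phi is the standard normal CDF)."""
    z = math.sqrt(n) * delta / (2 * sigma)
    return 2 * (0.5 * (1 + math.erf(z / math.sqrt(2)))) - 1

# delta = 0.25, sigma = 1 gives TV ~ 0.1 at n = 1, yet TV -> 1 as n grows.
for n in (1, 30, 100, 500):
    print(n, round(tv_product_gaussians(0.25, 1.0, n), 3))
```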
We emphasize that TV(m⊗n, h⊗n) is increasing in n and converges to 1 as n → ∞. By the data
processing inequality, TV(m⊗k, h⊗k) ≤ TV(m⊗n, h⊗n) whenever k ≤ n, which in particular gives
TV(m, h) ≤ TV(m⊗n, h⊗n). This is a crucial observation: even if the machine and human
distributions are close in the sentence space, collecting more sentences inflates the total
variation distance, making detection possible. Moreover, from large deviation theory, the rate at
which the total variation distance approaches 1 is exponential in the number of samples
(Polyanskiy & Wu, 2022, Chapter 7),
TV(m⊗n, h⊗n) = 1 − exp(−n Ic(m, h) + o(n)),   (12)

where Ic(m, h) is known as the Chernoff information and is given by
Ic(m, h) = − log inf_{0≤α≤1} ∫ m^α(s) h^{1−α}(s) ds.
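The Chernoff information can be approximated by gridding over α and integrating numerically. The sketch below does so for hypothetical Gaussian stand-ins m = N(1, 1) and h = N(0, 1), a pair whose closed form is known to be δ²/8 = 0.125, which serves as a sanity check.

```python
import numpy as np

# Numerical evaluation of the Chernoff information
#   I_c(m, h) = -log inf_{0<=alpha<=1} \int m^alpha(s) h^{1-alpha}(s) ds
# for Gaussian stand-ins m = N(1, 1) and h = N(0, 1).
s = np.linspace(-20.0, 20.0, 40_001)  # fine grid for the integral
ds = s[1] - s[0]
m = np.exp(-0.5 * (s - 1.0) ** 2) / np.sqrt(2 * np.pi)
h = np.exp(-0.5 * s ** 2) / np.sqrt(2 * np.pi)

alphas = np.linspace(0.01, 0.99, 99)  # grid over alpha, step 0.01
integrals = [np.sum(m ** a * h ** (1 - a)) * ds for a in alphas]
chernoff = -np.log(min(integrals))
print(round(float(chernoff), 4))  # close to the closed form 1/8 = 0.125
```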
The above expressions lead to Proposition 1 next.
Proposition 1 (Area Under ROC Curve). For any detector D, with a given collection of i.i.d.
samples S := {s_i}_{i=1}^n either from human h(s) or machine m(s), it holds that

AUROC ≤ 1/2 + TV(m⊗n, h⊗n) − TV(m⊗n, h⊗n)²/2,   (13)

where TV(m⊗n, h⊗n) := 1 − exp(−n Ic(m, h) + o(n)) and Ic(m, h) is the Chernoff information. Therefore,
the upper bound of AUROC increases exponentially with respect to the number of samples n.
The proof of the above proposition follows by integrating the TPRnγ upper bound in (9) over
FPRnγ. We note that the expression in (13), together with the Chernoff-information
characterization of the TV distance, reveals an interesting connection between the number of
samples and the AUROC of the best possible detector (which achieves the bound in (9) with
equality). As the number of samples grows, n → ∞, the total variation distance TV(m⊗n, h⊗n)
approaches 1 exponentially fast, and hence the AUROC upper bound increases. This indicates that,
as long as the two distributions are not exactly identical, which is rarely the case, detection is
always possible by collecting more samples, as established next.
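Dropping the o(n) correction in (12), the bound in (13) can be tabulated as a function of n. The sketch below picks an illustrative Chernoff information so that TV = 0.1 at n = 1, mirroring the setting of Figure 1.

```python
import math

def auroc_bound(n, ic):
    """Proposition 1: AUROC <= 1/2 + TV_n - TV_n^2 / 2, with
    TV_n = TV(m^{(x)n}, h^{(x)n}) approximated as 1 - exp(-n * I_c(m, h))
    (the o(n) correction in (12) is dropped in this sketch)."""
    tv_n = 1 - math.exp(-n * ic)
    return 0.5 + tv_n - tv_n ** 2 / 2

# Illustrative Chernoff information chosen so that TV_1 = 0.1 exactly.
ic = -math.log(0.9)
for n in (1, 30, 100, 500):
    print(n, round(auroc_bound(n, ic), 4))
```

At n = 1 the bound is the pessimistic 0.595; by n = 30 it already exceeds 0.99, illustrating the exponential improvement.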
3.3 Attainability of the AUROC Upper-Bound via Likelihood-Ratio-Based
Detectors
Likelihood-ratio-based Detector. Here, we discuss the attainability of bounds in Proposition
1 to establish that the bound is indeed tight. We note that it is a well-established fact in the
literature that a likelihood-ratio-based detector would attain the bound for any distributions h and
m and hence is the best possible detector (detailed proof provided in Appendix B.1). We discuss the
likelihood-ratio-based detector here for completeness in the context of LLMs as follows. Specifically,
the likelihood ratio-based detector is given by
D∗(S) := Text from machine if m⊗n(S) ≥ h⊗n(S);  Text from human if m⊗n(S) < h⊗n(S).   (14)
We proved in Appendix B.1 that the detector in (14) attains the bound and is the best possible
detector.
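A minimal sketch of the detector in (14), again with hypothetical Gaussian stand-ins for m and h: comparing m⊗n(S) ≥ h⊗n(S) is equivalent to checking that the summed log-likelihood ratio is nonnegative, and the detector's accuracy on machine-generated sample sets grows with n.

```python
import math, random

random.seed(0)

# Gaussian stand-ins: m = N(0.25, 1) for machine, h = N(0, 1) for human.
def log_density(s, mu):
    return -0.5 * (s - mu) ** 2 - 0.5 * math.log(2 * math.pi)

def detect(samples, mu_m=0.25, mu_h=0.0):
    """Likelihood-ratio detector (14): m^{(x)n}(S) >= h^{(x)n}(S) iff the
    summed log-likelihood ratio is nonnegative."""
    llr = sum(log_density(s, mu_m) - log_density(s, mu_h) for s in samples)
    return "machine" if llr >= 0 else "human"

def accuracy(n, trials=2000):
    """Fraction of machine-generated sample sets correctly detected."""
    hits = sum(detect([random.gauss(0.25, 1) for _ in range(n)]) == "machine"
               for _ in range(trials))
    return hits / trials

print(accuracy(1), accuracy(30), accuracy(500))
```

With a single sample the detector is barely better than chance, but with n = 500 samples it is nearly always correct, exactly the multi-sample effect the theory predicts.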
Theorem 1 (Sample Complexity of AI-generated Text Detection (iid)). If the human and machine
distributions are close, TV(m, h) = δ > 0, then to achieve an AUROC of ϵ, we require

n = Ω( (1/δ²) log(1/(1 − ϵ)) )

number of samples for the best possible detector, which is likelihood-ratio-based as mentioned in (14),
for any ϵ ∈ [0.5, 1). Therefore, AI-generated text detection is possible for any δ > 0.
The proof of Theorem 1 is provided in Appendix B.2. From the statement of Theorem 1, it is
clear that as long as δ > 0 (i.e., no matter how close the human h(s) and machine m(s) distributions
are) and ϵ < 1, there exists an n such that we can achieve a high AUROC and perform the detection.
Here, n corresponds to the number of sentences, generated by either a human or a machine, that we
need to classify. We provide additional detailed remarks and insights in Appendix A.
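Suppressing constants, Theorem 1's rate (which also follows from Theorem 2 with ρj = 0) can be evaluated for illustrative values of δ and ϵ:

```python
import math

def sample_complexity_iid(delta, epsilon):
    """Order-level n = (1/delta^2) * log(1/(1 - epsilon)), constants
    suppressed: samples needed by the likelihood-ratio detector to reach
    an AUROC of epsilon when TV(m, h) = delta."""
    return (1 / delta ** 2) * math.log(1 / (1 - epsilon))

# Closer distributions (smaller delta) or a higher target AUROC demand
# more samples, but n stays finite for every delta > 0.
for delta, eps in [(0.1, 0.95), (0.1, 0.99), (0.01, 0.99)]:
    print(delta, eps, round(sample_complexity_iid(delta, eps)))
```

For δ = 0.1 and a target AUROC of 0.99, on the order of a few hundred samples suffice; shrinking δ by 10× multiplies the requirement by 100×.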
natural intuition for the domain of natural language and serves as a foundation for extending our
results to non-iid scenarios. Eq. (16) provides a way to measure the dependence between random
variables, which is later used to extend the Chernoff bound to non-iid cases. In the context of LLMs,
one can think of the sum Σ_k s_k as the "average meaning" of these text samples; for example,
"woman" + "royalty" may have a meaning similar to "queen".
Before introducing the final result, let the number of sequences or samples be denoted by n, and
suppose there are L independent subsets, where subset τj, for j ∈ {1, 2, . . . , L}, consists of cj
(mutually dependent) samples. This is a natural assumption in NLP, where a large paragraph often
consists of multiple topics and the sentences within each topic are dependent.
With the above definitions, we state the main result in Theorem 2 for the non-iid setting.
Theorem 2 (Sample Complexity of AI-generated Text Detection (non-iid)). If the human and
machine distributions are close, TV(m, h) = δ > 0, then to achieve an AUROC of ϵ, we require
n = Ω( (1/δ²) log(1/(1 − ϵ)) + (1/δ) Σ_{j=1}^{L} (cj − 1)ρj + sqrt( (1/δ²) log(1/(1 − ϵ)) · (1/δ) Σ_{j=1}^{L} (cj − 1)ρj ) )   (17)
number of samples for the best possible detector which is likelihood-ratio-based as mentioned in (14),
for any ϵ ∈ [0.5, 1). Therefore, AI-generated text detection is possible for any δ > 0.
The proof is provided in Appendix 2. From the statement of Theorem 2, we note that for δ > 0
(h(s) and m(s) are close but not exactly the same) and ϵ < 1, there exists n such that we can achieve
high AUROC and perform the detection. In comparison to the iid result in Theorem 1, the non-iid
result in Theorem 2 has an additional term that depends on cj and ρj . Clearly, for ρj = 0, the
sample complexity result in Theorem 2 boils down to the result in Theorem 1.
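The non-iid rate (17) can likewise be evaluated numerically (constants suppressed); setting ρj = 0 recovers the iid rate, and dependence strictly increases the required n:

```python
import math

def sample_complexity_noniid(delta, epsilon, c, rho):
    """Order-level n from eq. (17), constants suppressed.  c[j] is the
    number of (mutually dependent) samples in subset tau_j and rho[j]
    measures the dependence within that subset."""
    iid_term = (1 / delta ** 2) * math.log(1 / (1 - epsilon))
    dep_term = (1 / delta) * sum((cj - 1) * rj for cj, rj in zip(c, rho))
    return iid_term + dep_term + math.sqrt(iid_term * dep_term)

# With rho_j = 0 the dependence terms vanish and the iid rate is recovered.
iid = sample_complexity_noniid(0.1, 0.99, c=[10, 10], rho=[0.0, 0.0])
dep = sample_complexity_noniid(0.1, 0.99, c=[10, 10], rho=[0.5, 0.5])
print(round(iid), round(dep))
```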
4 Experimental Studies
In this section, we provide detailed empirical evidence to support the detectability claims of this
work. We consider various human/machine-generated datasets and general language classification
datasets.
AUROC Discussion and Comparisons: We first illustrate the meaning of our mathematical
results via simulations. Specifically, we give a pictorial representation of the AUROC bound
obtained in Proposition 1 and compare it against the ROC upper bound in (9) for different values
of n. In Figure 1, we show that even if the distributions of human h(s) and machine m(s) text are
close in TV distance (TV(m, h) = 0.1), we can increase the ROC area (and hence the AUROC) by
increasing the number of samples n collected for detection.
Datasets, AI-Text Generators and Detectors Description: Our experimental analysis spans
four datasets: news articles from the XSum dataset (Narayan et al., 2018), Wikipedia
paragraphs from the Squad dataset (Rajpurkar et al., 2016), IMDb reviews (Maas et al., 2011),
and the Kaggle FakeNews dataset (Lifferth, 2018), utilized in a diverse manner to validate our
hypothesis. The first two datasets (XSum and Squad) have been leveraged to
[Figure: Xsum, logistic regression; t-statistic = −46.91, p-value = 0.000; y-axis: AUC achieved by best detector (%).]