
On the Possibilities of AI-Generated Text Detection

Souradip Chakraborty*, Amrit Singh Bedi*, Sicheng Zhu, Bang An,
Dinesh Manocha, and Furong Huang

arXiv:2304.04736v3 [cs.CL] 2 Oct 2023

Abstract
Our work addresses the critical issue of distinguishing text generated by Large Language
Models (LLMs) from human-produced text, a task essential for numerous applications. Despite
ongoing debate about the feasibility of such differentiation, we present evidence supporting its
consistent achievability, except when human and machine text distributions are indistinguishable
across their entire support. Drawing from information theory, we argue that as machine-generated
text approximates human-like quality, the sample size needed for detection increases. We establish
precise sample complexity bounds for detecting AI-generated text, laying groundwork for future
research aimed at developing advanced, multi-sample detectors. Our empirical evaluations
across multiple datasets (Xsum, Squad, IMDb, and Kaggle FakeNews) confirm the viability
of enhanced detection methods. We test various state-of-the-art text generators, including
GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against
detectors including RoBERTa-Large/Base-Detector and GPTZero. Our findings align with OpenAI’s
empirical data related to sequence length, marking the first theoretical substantiation for these
observations.

1 Introduction
Large Language Models (LLMs) like GPT-3 mark a significant milestone in the field of Natural
Language Processing (NLP). Pre-trained on vast text corpora, these models excel in generating
contextually relevant and fluent text, advancing a variety of NLP tasks including language translation,
question-answering, and text classification. Notably, their capacity for zero-shot generalization
obviates the need for extensive task-specific training. Recent research by (Shin et al., 2021) further
highlights the LLMs’ versatility in generating diverse writing styles, ranging from academic to
creative, without the need for domain-specific training. This adaptability extends their applicability
to various use-cases, including chatbots, virtual assistants, and automated content generation.
However, the advanced capabilities of LLMs come with ethical challenges (Bommasani et al.,
2022). Their aptitude for generating coherent, contextually relevant text opens the door for misuse,
such as the dissemination of fake news and misinformation. These risks erode public trust and distort
societal perceptions. Additional concerns include plagiarism, intellectual property theft, and the
generation of deceptive product reviews, which negatively impact both consumers and businesses.
LLMs also have the potential to manipulate web content maliciously, influencing public opinion and
political discourse.

*Equal Contributions. The authors are with the Department of Computer Science, University of Maryland, College
Park, MD, USA. Email: {schkra3,amritbd,sczhu,bangan,dmanocha,furongh}@umd.edu

Given these ethical concerns, there is an imperative for the responsible development and deploy-
ment of LLMs. The ethical landscape associated with these models is complex and multifaceted.
Addressing these challenges is vital for harnessing the societal benefits that responsibly deployed
LLMs can offer. To this end, recent research has pivoted towards creating detectors capable of
distinguishing text generated by machines from that authored by humans. These detectors serve as
a safeguard against the potential misuse of LLMs. One central question underpinning this area of
research is:
"Is it possible to detect the AI-generated text in practice?"
Our work provides an affirmative answer to this question. Specifically, we demonstrate that detecting
AI-generated text is nearly always feasible, provided multiple samples are collected, as illustrated in
Figure 1. The necessity for collecting multiple samples is consistent with real-world settings where

[Figure 1: Left, ROC curves (TPR vs. FPR) of the best possible detector for n = 30, 100, and 500 samples; right, the corresponding AUROC values.]

Figure 1: In light of the sample complexity bound presented in Theorem 1, we show pictorially how increasing the number of samples n used for detection affects the ROC of the best possible detector, which is achieved by the likelihood-ratio-based classifier. In the ROC curve on the left, for TV(m, h) = 0.1, the AUROC of the best possible detector is 0.6, as derived in (Sadasivan et al., 2023) (shown by an orange dot in the right figure); an AUROC of 0.6 would lead to the conclusion that detection is hard. In contrast, by increasing the number of samples n, the ROC upper bound starts increasing towards 1 exponentially fast (shown by the shaded blue region in the left figure for different values of n), and hence the AUROC of the best possible detector also starts increasing, as shown by the corresponding blue dots in the right figure. This ensures that detection is possible even in hard scenarios where the TV distance TV(m, h) is small.

data is abundant. For example, in the context of social media bot identification, one can readily
collect multiple posts to determine their origin, whether machine-generated or human-authored.
This real-world applicability emphasizes the importance and urgency of developing sophisticated
detection mechanisms for the ethical use of LLMs. We summarize our main contributions as follows.
(1) Possibility of AI-generated text detection: We utilize a mathematically rigorous
approach to answer the question of the possibility of AI-generated text detection. We conclude that
there is a hidden possibility of detecting AI-generated text, which improves with the text sequence
(token) length.
(2) Sample complexity of AI-generated text detection: We derive sample complexity
bounds, the first of their kind, tailored to detecting AI-generated text in both IID and non-IID settings.
(3) Comprehensive Empirical Evaluations: We conduct extensive empirical evaluations
on real datasets (Xsum, Squad, IMDb, and Kaggle FakeNews) with state-of-the-art generators

(GPT-2, GPT-3.5-Turbo, Llama, Llama-2 (13B), Llama-2 (70B)) and detectors (OpenAI's RoBERTa
(Large), OpenAI's RoBERTa (Base), and ZeroGPT, a SOTA detector).

2 Background on AI-Generated Text Detectors and Related Works


Recent research has shown promising results in developing detection methods. Some of these
methods use statistical approaches to identify differences in the linguistic patterns of human and
machine-generated text. We survey the existing approaches here.

Traditional approaches. These involve statistical outlier detection methods, which employ statistical
metrics such as entropy, perplexity, and n-gram frequency to differentiate between human- and
machine-generated texts (Lavergne et al., 2008; Gehrmann et al., 2019). However, with the advent of
ChatGPT (OpenAI, 2023), an innovative statistical detection methodology, DetectGPT
(Mitchell et al., 2023), has been developed. It operates on the principle that text generated by
the model tends to lie in the negative curvature areas of the model’s log probability. DetectGPT
(Mitchell et al., 2023) generates and compares multiple perturbations of model-generated text to
determine whether the text is machine-generated or not based on the log probability of the original
text and the perturbed versions. DetectGPT significantly outperforms the majority of the existing
zero-shot methods for model sample detection with very high AUC scores (note that we use the terms
AUROC and AUC interchangeably for presentation convenience).

Classifier-based detectors. In contrast to statistical methods, classifier-based detectors are


common in natural language detection paradigms, particularly in fake news and misinformation
detection (Schildhauer, 2022; Zou & Ling, 2021). OpenAI has recently fine-tuned a GPT model
(OpenAI, 2023) using data from Wikipedia, WebText, and internal human demonstration data to
create a web interface for a discrimination task using text generated by 34 language models. This
approach combines a classifier-based approach with a human evaluation component to determine
whether a given text was machine-generated or not. These recent advancements in the field of
detecting AI-generated text have significant implications for detecting and preventing the spread of
misinformation and fake news, thereby contributing to the betterment of society (Schildhauer, 2022;
Zou & Ling, 2021; Kshetri & Voas, 2022).

Watermark-based identification. An alternative detection paradigm that has garnered significant


interest in this field is the evolution of watermark-based identification (Verma et al., 2009; Wadhera
et al., 2022). Much of the most exciting recent work in this area revolves around developing
efficient watermarks for machine-generated text detection. Historically,
watermarks have been employed in the realm of image processing and computer vision to safeguard
copyrighted content and prevent intellectual property theft (Langelaar et al., 2000). They can also
be used for data hiding, where information is hidden within the watermark itself, allowing for secure
and discreet transmission of information. Early research by (Atallah et al., 2001; Meral et al., 2009)
was among the first to demonstrate the potential of watermarks in language through syntax tree
manipulations. More recently with the advent of ChatGPT, innovative work by (Kirchenbauer
et al., 2023) has shown how to incorporate watermarks by using only the LLM’s logits at each step.
The watermarking technique proposed by (Kirchenbauer et al., 2023) allows for the verification
of a watermark’s authenticity by employing a specific hash function. More specifically, the soft
watermarking approach by (Kirchenbauer et al., 2023) involves categorizing tokens into “green”
and “red” lists for generating distinct patterns. Watermarked language models are more likely to

select tokens from the green list, based on prior tokens, resulting in watermarks that are typically
unnoticeable to humans. These advancements in watermarking technology not only strengthen
copyright protection and content authentication but also open up new avenues for research in areas
such as privacy in language, secure communication, and digital rights management.

Impossibility result. Interesting recent work by (Sadasivan et al., 2023; Krishna et al.,
2023) showed the vulnerabilities of watermark-based detection methodologies under vanilla paraphrasing
attacks. (Sadasivan et al., 2023) developed a lightweight neural network-based paraphraser and
applied it to the output text of the AI-generative model to evade a whole range of detectors, including
watermarking schemes, neural network-based detectors, and zero-shot classifiers. (Sadasivan et al.,
2023) also introduced a notion of spoofing attacks where they exposed the vulnerability of LLMs
protected by watermarking under such attacks. (Krishna et al., 2023) on the other hand, trained a
paraphrase generation model capable of paraphrasing paragraphs and showed that paraphrased texts
with DIPPER (Krishna et al., 2023) evade several detectors, including watermarking, GPTZero,
DetectGPT, and OpenAI’s text classifier with a significant drop in accuracy. Additionally, (Sadasivan
et al., 2023) highlighted the impossibility of machine-generated text detection when the total variation
(TV) norm between human and machine-generated text distributions is small.
In this work, we show that there is a hidden possibility of detecting the AI-generated text even
if the TV norm between human and machine-generated text distributions is small. This result is in
support of the recent detection possibility claims by Krishna et al. (2023).

3 Proposed Approach: Methodology and Analysis


3.1 Notations and Definitions
Before discussing the main results, let us define the notations used in this paper. We define the set
of all possible texts (textual representations) as S, a human-generated text distribution as h(s) over
s ∈ S, and machine-generated text distribution as m(s). Here m(s) and h(s) are valid probability
density functions. We can also adapt the same notation to a specific prompt (denoted by p), context
(denoted by c), or question (denoted by q), writing S_c, h(s | p, c, q), and m(s | p, c, q), respectively.
However, for the sake of clarity and ease of discussion, we omit this more complex notation in this work.
In the literature, the problem of detecting AI-generated text is treated as a binary classification
problem. The (potentially nonlinear and complex) detector D(s) maps the sample s ∈ S to R
and then compares the output against a threshold γ to perform detection: D(s) ≥ γ is classified
as AI-generated, while D(s) < γ is classified as human-generated. For the detector D(s) to
determine whether a text sample s is generated by the machine or not, we need
to study the receiver operating characteristic curve (ROC curve) (Fawcett, 2006), which involves
two terms, namely True Positive Rate (TPR) and False Positive Rate (FPR). Once we obtain ROC,
we can study the area under the ROC curve AUROC, which characterizes the detection performance
of detector D. The upper bound on AUROC describes the performance of the best possible detector.
Under a detection threshold γ, TPR and FPR are denoted as TPRγ and FPRγ respectively:

TPRγ: Probability of detecting AI-generated text as AI-generated under threshold γ,  (1)
FPRγ: Probability of detecting human-generated text as AI-generated under threshold γ.  (2)

The rigorous definitions of TPRγ and FPRγ are as follows.
TPRγ = P_{s∼m(·)}[D(s) ≥ γ] = ∫ I{D(s)≥γ} · m(s) · ds,  (3)
FPRγ = P_{s∼h(·)}[D(s) ≥ γ] = ∫ I{D(s)≥γ} · h(s) · ds,  (4)

where I{condition} is the indicator function which takes value 1 if the condition is true, and 0 otherwise.
Note that, without loss of generality, we consider m(s) and h(s) to be the probability density
functions of the machine- and human-generated text for continuous s (as also considered in
(Sadasivan et al., 2023)); similar results hold for discrete s by replacing the integral with a
summation and treating m(s) and h(s) as the probability mass functions of the machine and
human on a sample s.
Both TPRγ and FPRγ are within the closed interval [0, 1] for any threshold γ. For a good
detector, TPRγ should be as high as possible, and FPRγ should be as low as possible. As a result,
a high area under the ROC curve (AUROC) is desirable for detection. AUROC is between 1/2 and 1,
i.e., AUROC ∈ [1/2, 1]. An AUROC value of 1/2 means a random detection and a value of 1 indicates a
perfect detection. For efficient detection, the goal is to design a detector D such that AUROC is as
high as possible.

3.2 Hidden Possibilities of AI-Generated Text Detection


To study the AUROC for any detector D, we start by invoking LeCam’s lemma (Le Cam, 2012;
Wasserman, 2013) which states that for any distributions m and h, given an observation s, the
minimum sum of Type-I and Type-II error probabilities in testing whether s ∼ m versus s ∼ h is
equal to 1 − TV(m, h). Hence, mathematically, we can write

P_{s∼h(·)}[D(s) ≥ γ] + P_{s∼m(·)}[D(s) < γ] ≥ 1 − TV(m, h),  (5)

for any detector D and any threshold γ, where the first term is the Type-I error (false positive)
and the second term is the Type-II error (false negative). We note that the above bound is tight and can always
be achieved with equality by likelihood-ratio-based detectors for any distribution m and h, by the
Neyman-Pearson Lemma (Cover, 1999, Chapter 11). We restate the lemma for completeness and
discuss its tightness in Appendix B.1. From the definitions of TPR and FPR in (3)-(4), it holds that

FPRγ + 1 − TPRγ ≥ 1 − TV(m, h), (6)

which implies that

TPRγ ≤ min{FPRγ + TV(m, h), 1}, (7)

where the min is needed because TPRγ ∈ [0, 1]. The upper bound in (7) is called the ROC upper bound
and is the bound leveraged in one of the recent works (Sadasivan et al., 2023) to derive the AUROC upper bound

AUC ≤ 1/2 + TV(m, h) − TV(m, h)²/2,

which holds for any D. This upper bound led to the claim of
the impossibility of detecting AI-generated text whenever TV(m, h) is small.
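As a quick numerical illustration (a minimal sketch we add for concreteness, not part of the original derivation), evaluating this single-sample bound at TV(m, h) = 0.1 reproduces the AUROC of 0.6 discussed in Figure 1:

```python
# Single-sample AUROC upper bound: AUC <= 1/2 + TV - TV^2/2.
def auroc_upper_bound(tv: float) -> float:
    return 0.5 + tv - tv**2 / 2

print(auroc_upper_bound(0.1))  # 0.595, i.e., ~0.6 (the orange dot in Figure 1)
```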

Hidden Possibility. However, we note that the claim of impossibility from the AUROC upper bound
could be too conservative for detection in practical scenarios. For instance, consider the motivating
example of detecting whether an account on Twitter is an AI bot or a human. It is natural that we
will have a collection of text samples from the account, denoted by {s_i}_{i=1}^n, and it is realistic to
assume that n is very large. Therefore, the natural practical question is whether we can detect if the
provided text set {s_i}_{i=1}^n is machine-generated or human-generated. With this motivation, we next
explain that detection is always possible.
We formalize the problem setting and prove our claim by utilizing existing results from the
information theory literature. Let us consider the same setup as detailed before, where we are given
a set of samples S := {s_i}_{i=1}^n. For simplicity, we assume that the samples are drawn i.i.d. from
either the human distribution h or the machine distribution m. Interestingly, the hypothesis test can
now be re-written as

H0 : S ∼ m⊗n  vs.  H1 : S ∼ h⊗n,  (8)

where m⊗n := m ⊗ m ⊗ · · · ⊗ m (n times) denotes the product distribution, as does h⊗n. This is
one of the key observations: it places the problem in the correct hypothesis-testing framework with
multiple samples. Similar to before (cf. (7)), based on Le Cam's lemma, 1 − TV(m⊗n, h⊗n) now
gives the minimum sum of Type-I and Type-II error rates, which implies

TPR^n_γ ≤ min{FPR^n_γ + TV(m⊗n, h⊗n), 1},  (9)

where

TPR^n_γ = P_{S∼m⊗n}[D(S) ≥ γ] = ∫ I{D(S)≥γ} · m⊗n(S) · dS,  (10)
FPR^n_γ = P_{S∼h⊗n}[D(S) ≥ γ] = ∫ I{D(S)≥γ} · h⊗n(S) · dS.  (11)

We emphasize that the term TV(m⊗n, h⊗n) is an increasing sequence in n and eventually converges
to 1 as n → ∞. Due to the data processing inequality, it holds that TV(m⊗k, h⊗k) ≤ TV(m⊗n, h⊗n)
when k ≤ n, which naturally leads to TV(m, h) ≤ TV(m⊗n, h⊗n). This is a crucial observation:
even if the machine and human distributions are close in the sentence space, by collecting more
sentences it is possible to inflate the total variation distance and make detection possible.
Now, from large deviation theory, the rate at which the total variation distance approaches 1 is
exponential in the number of samples (Polyanskiy & Wu, 2022, Chapter 7):

TV(m⊗n, h⊗n) = 1 − exp(−n · I_c(m, h) + o(n)),  (12)

where I_c(m, h) is known as the Chernoff information and is given by
I_c(m, h) = − log inf_{0≤α≤1} ∫ m^α(s) h^{1−α}(s) ds.
The above expressions lead to Proposition 1 next.
Proposition 1 (Area Under ROC Curve). For any detector D, with a given collection of i.i.d.
samples S := {s_i}_{i=1}^n either from the human h(s) or the machine m(s), it holds that

AUROC ≤ 1/2 + TV(m⊗n, h⊗n) − TV(m⊗n, h⊗n)²/2,  (13)

where TV(m⊗n, h⊗n) := 1 − exp(−n · I_c(m, h) + o(n)) and I_c(m, h) is the Chernoff information. Therefore,
the upper bound on AUROC increases exponentially with respect to the number of samples n.
The proof of the above proposition follows by integrating the TPR^n_γ upper bound in (9) over
FPR^n_γ. We note that the expression in (13), together with the equality relating the TV distance to the
Chernoff information, presents an interesting connection between the number of samples and the AUROC of the
best possible detector (which achieves the bound in (9) with equality). It is evident that as we
increase the number of samples, n → ∞, the total variation distance TV(m⊗n, h⊗n) approaches 1
exponentially fast, and hence the AUROC increases. This indicates that as long as the
two distributions are not exactly the same, which is rarely the case, detection will always be
possible by collecting more samples, as established next.
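To make the exponential dependence concrete, the following minimal sketch evaluates the bound in (13) with the o(n) term dropped; the Chernoff information value used is an illustrative assumption, not one estimated from real text:

```python
import math

def tv_product(n: int, chernoff_info: float) -> float:
    """TV between the n-fold product distributions, per (12) with o(n) dropped."""
    return 1.0 - math.exp(-n * chernoff_info)

def auroc_bound(n: int, chernoff_info: float) -> float:
    """AUROC upper bound in (13) as a function of the number of samples n."""
    tv = tv_product(n, chernoff_info)
    return 0.5 + tv - tv**2 / 2

I_c = 0.005  # assumed illustrative Chernoff information (m and h very close)
for n in (1, 30, 100, 500):
    print(n, round(auroc_bound(n, I_c), 3))
# 1 -> 0.505, 30 -> 0.63, 100 -> 0.816, 500 -> 0.997: the bound climbs
# toward 1 exponentially fast in n, mirroring Figure 1.
```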

3.3 Attainability of the AUROC Upper-Bound via Likelihood-Ratio-Based
Detectors
Likelihood-ratio-based Detector. Here, we discuss the attainability of bounds in Proposition
1 to establish that the bound is indeed tight. We note that it is a well-established fact in the
literature that a likelihood-ratio-based detector would attain the bound for any distributions h and
m and hence is the best possible detector (detailed proof provided in Appendix B.1). We discuss the
likelihood-ratio-based detector here for completeness in the context of LLMs as follows. Specifically,
the likelihood ratio-based detector is given by
D∗(S) := Text from machine, if m⊗n(S) ≥ h⊗n(S);  Text from human, if m⊗n(S) < h⊗n(S).  (14)

We proved in Appendix B.1 that the detector in (14) attains the bound and is the best possible
detector.
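As an illustrative check (a sketch under simplified assumptions: two close one-dimensional Gaussians stand in for m and h, which is not real text), one can simulate the detector in (14) and watch its AUROC grow with n:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
mu_m, mu_h, sigma = 0.1, 0.0, 1.0  # assumed close "machine"/"human" Gaussians

def log_lr(S):
    """log m^(x)n(S) - log h^(x)n(S): for iid Gaussians, a sum over samples."""
    return np.sum(((S - mu_h) ** 2 - (S - mu_m) ** 2) / (2 * sigma ** 2), axis=1)

for n in (1, 30, 100, 500):
    S_m = rng.normal(mu_m, sigma, size=(2000, n))  # 2000 sample sets from m
    S_h = rng.normal(mu_h, sigma, size=(2000, n))  # 2000 sample sets from h
    scores = np.concatenate([log_lr(S_m), log_lr(S_h)])
    labels = np.concatenate([np.ones(2000), np.zeros(2000)])
    print(n, round(roc_auc_score(labels, scores), 3))
# The empirical AUROC of the likelihood-ratio detector rises toward 1 with n.
```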

Sample Complexity of Best Possible Detector. To further emphasize the dependence on


the number of samples n, we derive the sample complexity bound of AI-generated text detection in
Theorem 1 as follows.

Theorem 1 (Sample Complexity of AI-generated Text Detection (Possibility Result)).


If the human and machine distributions are close, TV(m, h) = δ > 0, then to achieve an AUROC of ϵ, we
require

n = Ω( (1/δ²) · log(1/(1 − ϵ)) )  (15)

samples for the best possible detector, which is likelihood-ratio-based as given in (14),
for any ϵ ∈ [0.5, 1). Therefore, AI-generated text detection is possible for any δ > 0.

The proof of Theorem 1 is provided in Appendix B.2. From the statement of Theorem 1, it is
clear that, as long as δ > 0 (no matter how close the human h(s) and machine m(s) distributions
are) and ϵ < 1, there exists an n such that we can achieve a high AUROC and perform the detection. Here,
n corresponds to the number of sentences, generated by either a human or a machine, on which we
need to perform detection. We provide additional detailed remarks and insights in Appendix A.
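The scaling in (15) is easy to evaluate numerically; the sketch below does so with the hidden constant set to 1 purely for illustration (Theorem 1 does not specify it):

```python
import math

def n_required(delta: float, eps: float, const: float = 1.0) -> float:
    """n = Omega((1/delta^2) * log(1/(1 - eps))); constant assumed to be 1."""
    return const * math.log(1.0 / (1.0 - eps)) / delta ** 2

for delta in (0.1, 0.05):
    for eps in (0.9, 0.99):
        print(f"delta={delta}, eps={eps}: n ~ {n_required(delta, eps):.0f}")
# Halving delta quadruples n; pushing eps toward 1 costs only a log factor.
```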

3.4 Extension to Non-IID case


We extend the sample complexity results of Theorem 1 to the non-iid setting in this subsection. To
accomplish this, we make certain assumptions about the structure present in the input; this is a
well-founded assumption that is practical and applicable in various natural language tasks (e.g.,
the presence of topics in documents (Jelodar et al., 2018; Loureiro et al., 2023)). Let us denote the
strength of the association by ρ, which characterizes the dependence between the sequences s_i as

E[S_i | S_{i−1} = s_{i−1}, · · · , S_1 = s_1] = ρ · (Σ_{k=1}^{i−1} s_k)/(i − 1) + (1 − ρ) · E[S_i],  (16)
which boils down to the iid case for ρ = 0. An increasing ρ indicates increasing dependence on the
previous sequence, with ρ = 1 meaning that the conditional expectation can be completely expressed
in terms of the previous samples in the sequence. The dependence assumption in (16) embodies a
natural intuition for the domain of natural language and serves as a foundation for extending our
results to non-iid scenarios. Eq. (16) provides a way to measure the dependence between random
variables, which is later used to extend the Chernoff bound to non-iid cases. In the context of LLMs,
one can think of the sum Σ_k s_k as capturing the "average meaning" of these text samples; for example,
"woman" + "royalty" may have a meaning similar to "queen".
Before introducing the final result, let n denote the number of sequences or samples, and suppose
there are L independent subsets, with the j-th subset represented by τ_j for j ∈ {1, 2, · · · , L},
where τ_j consists of c_j (dependent) samples. This is a natural assumption in NLP, where a large
paragraph often consists of multiple topics, and the sentences within each topic are dependent.
With the above definitions, we state the main result in Theorem 2 for the non-iid setting.
Theorem 2 (Sample Complexity of AI-generated Text Detection (non-iid)). If the human and
machine distributions are close, TV(m, h) = δ > 0, then to achieve an AUROC of ϵ, we require

n = Ω( (1/δ²) · log(1/(1 − ϵ)) + (1/δ) · Σ_{j=1}^{L} (c_j − 1)ρ_j + √( (1/δ²) · log(1/(1 − ϵ)) · (1/δ) · Σ_{j=1}^{L} (c_j − 1)ρ_j ) )  (17)

samples for the best possible detector, which is likelihood-ratio-based as given in (14),
for any ϵ ∈ [0.5, 1). Therefore, AI-generated text detection is possible for any δ > 0.
The proof is provided in Appendix B.3. From the statement of Theorem 2, we note that for δ > 0
(h(s) and m(s) are close but not exactly the same) and ϵ < 1, there exists an n such that we can achieve
a high AUROC and perform the detection. In comparison to the iid result in Theorem 1, the non-iid
result in Theorem 2 has an additional term that depends on c_j and ρ_j. Clearly, for ρ_j = 0, the
sample complexity result in Theorem 2 boils down to the result in Theorem 1.
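The following sketch compares the non-iid scaling in (17) against the iid scaling in (15) (hidden constants again set to 1; the topic sizes c_j and dependence strengths ρ_j are assumed illustrative values):

```python
import math

def n_iid(delta, eps):
    return math.log(1.0 / (1.0 - eps)) / delta ** 2          # scaling in (15)

def n_noniid(delta, eps, c, rho):
    dep = sum((cj - 1) * rj for cj, rj in zip(c, rho)) / delta
    return n_iid(delta, eps) + dep + math.sqrt(n_iid(delta, eps) * dep)  # (17)

c, rho = [10, 10, 10], [0.3, 0.3, 0.3]   # assumed: 3 topics, 10 samples each
print(round(n_iid(0.1, 0.9)))            # ~230
print(round(n_noniid(0.1, 0.9, c, rho))) # ~448: dependence inflates the cost
```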

4 Experimental Studies
In this section, we provide detailed empirical evidence to support the detectability claims of this
work. We consider various human-machine generated datasets and general language classification
datasets.
AUROC Discussion and Comparisons: We first illustrate the meaning of the mathematical
results via simulations. For instance, we show a pictorial representation of the AUROC bound
obtained in Proposition 1 and compare it against the ROC upper bound in (9) for
different values of n. In Figure 1, we show that even if the original distributions of the human h(s) and
the machine m(s) are close in TV distance (TV(m, h) = 0.1), we can increase the ROC area (and hence the AUROC)
by increasing the number of samples n collected to perform the detection.

4.1 Real Data Experiments


In this section, we perform a detailed experimental analysis and ablation to validate our theorem
with several real human-machine generated datasets as well as general natural language datasets.

Datasets, AI-Text Generators and Detectors Description: Our experimental analysis spans
across 4 critical datasets, including the news articles from XSum dataset (Narayan et al., 2018),
Wikipedia paragraphs from Squad dataset (Rajpurkar et al., 2016), IMDb reviews (Maas et al.,
2011), and Kaggle FakeNews dataset (Lifferth, 2018), utilizing the datasets in a diverse manner
to validate our hypothesis. The first two datasets (XSum and Squad) have been leveraged to

[Figure 2: (a) AUC achieved by the best detector (%) vs. n-gram order (1-6) for Xsum and Squad; (b) AUC of real detectors (Logistic Regression, Random Forest, MLP) vs. percentage of sequence used for detection on Xsum; (c) box plot of real-detector AUC for single ("Vanilla") vs. IID-pair inputs, with t-statistic = -46.91, p-value = 0.000.]

Figure 2: (a)-(c) validate our theorem on real human-machine classification datasets generated
with XSum (Narayan et al., 2018) and Squad (Rajpurkar et al., 2016), showing that detection
performance improves significantly with an increase in the number of samples/sequence length. Figure 2a
shows that the AUROC achieved by the best possible detector increases significantly
from 58% to 97% as the n-gram order of the feature space increases, for both the Xsum and Squad
datasets. Figure 2b demonstrates the improvement in AUROC with respect to sequence length using
various real detectors/classifiers. Figure 2c shows, via a box-plot comparison, that if we
use 2 iid sequences (from either machine or human) for detection instead of one, the AUROC of the real
detector improves drastically from 73% to 97%, hence validating our hypothesis.

generate machine-generated text by prompting an LLM with the first 50 tokens of each article
in the dataset, sampling from the conditional distribution of the LLMs, as followed in (Mitchell
et al., 2023; Krishna et al., 2023; Sadasivan et al., 2023). Specifically, we use a diverse set of
SOTA open-source text generators including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF,
and Llama-2-70B-Chat-HF as the LLM for generating the machine-generated text using the token
prompts as described above. We consider 500 passages from both the Xsum and Squad datasets,
generate 500 corresponding machine-generated texts using GPT-2, and evaluate the detection
performance in 3 broad categories: (1) supervised detection, (2) contrastive detection with
i.i.d. samples, and (3) zero-shot performance. Finally, we leverage two additional general language
datasets (detailed in Appendix C.1), IMDb and Kaggle FakeNews, to give more insights into the
separability and detection performance with an increasing number of samples.

(1) Supervised detection performance: To validate our hypothesis from a supervised detec-
tion/classification perspective, we first compute the total variation distance between the human
and machine-generated texts at various n-gram levels where n-gram = 1 indicates the detection is
at a word level, and as we increase it, it approaches sentence to paragraph level. We subsequently
estimate the AUROC of the best detector using equation (14) by increasing the length of the n-gram
from 1 to 6 as shown in Figure 2a. It is evident that with increasing n-grams, the AUROC of the
best detector increases significantly from 58% to 97% for both Xsum and Squad datasets. This
empirical observation completely aligns with our theory and intuition. To further test our hypothesis
with real detectors, we train 3 vanilla classification models including Logistic Regression, Random
Forest, and a 2-layer Neural Network with TF-IDF-based feature representation (bag of words) on
the human-machine generated datasets including Xsum and Squad. We report the performance of
the test AUROC with increasing sequence length in Figure 2b, which shows a significant increase in
accuracy as the sequence length increases, even with real detectors. This observation is also supported
by the results obtained from OpenAI, summarized in the report (Solaiman et al., 2019). This

Figure 3: (a)-(f) validate our theorem on real human-machine classification datasets generated
with XSum & Squad, via zero-shot detection performance. We use different generator/detector
pairs to show the performance comparisons. For instance, (a) shows the detection performance
(AUROC) of OpenAI's RoBERTa detector (Large) on text generated by GPT-3.5-Turbo, and we
extend it to other pairs in (b)-(f). We observe that with an increase in the number of samples or
the sequence length used for detection, the zero-shot detection performance of both models improves
from around 50% to 90% on both the Xsum and Squad human-machine datasets. We also performed
similar experiments with GPT-2; results are available in Figure 9 in the appendix.

impressive performance aligns with our claims and provides evidence that designing a detector with
high performance for AI-generated text is always possible.
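A minimal sketch of this kind of estimate is shown below: it plugs empirical n-gram frequencies in as estimates of m and h and feeds the resulting TV distance into the single-sample best-detector AUROC bound. The exact estimator used for Figure 2a may differ, and human_texts/machine_texts are placeholder inputs:

```python
from collections import Counter

def ngram_counts(texts, n):
    """Empirical n-gram counts over whitespace tokens."""
    counts = Counter()
    for t in texts:
        toks = t.split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return counts

def tv_estimate(human_texts, machine_texts, n):
    """Plug-in TV distance between empirical n-gram distributions."""
    h, m = ngram_counts(human_texts, n), ngram_counts(machine_texts, n)
    h_tot, m_tot = sum(h.values()), sum(m.values())
    return 0.5 * sum(abs(m[g] / m_tot - h[g] / h_tot) for g in set(h) | set(m))

def best_auroc(tv):
    return 0.5 + tv - tv ** 2 / 2  # single-sample best-detector bound

human_texts = ["the cat sat on the mat"]        # placeholder corpora
machine_texts = ["a cat was sitting on a mat"]
for n in (1, 2, 3):
    print(n, round(best_auroc(tv_estimate(human_texts, machine_texts, n)), 3))
# The TV estimate (and hence the implied AUROC) typically grows with n-gram order.
```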

(2) Detection with pairwise IID Samples: We also design an experiment where we assume
that one can have access to 2 iid samples (from machine or human) for detection instead of just
one example, which is practical and can be easily obtained in several scenarios. For example,
consider detecting fake news or propaganda from a Twitter bot. We restructure the training set of
the human-machine dataset by constructing pairwise training samples with human/machine labels
and perform binary classification using only 30% of the enhanced pairwise dataset, with
very limited bag-of-words features and logistic regression, as shown in Figure 2c. We note that
there is a statistically significant boost in detection performance with pairwise samples, even with a
vanilla model and a subsampled dataset, which indicates that detection will be almost always possible in
the scenarios where it is indeed crucial.
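A minimal sketch of the pairwise construction is given below, assuming texts and labels lists as inputs; the paper's exact pairing, feature set, and train/test split may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def make_pairs(texts, labels):
    """Concatenate disjoint pairs of same-label texts into single examples."""
    pair_texts, pair_labels = [], []
    for y in set(labels):
        group = [t for t, l in zip(texts, labels) if l == y]
        for a, b in zip(group[0::2], group[1::2]):
            pair_texts.append(a + " " + b)
            pair_labels.append(y)
    return pair_texts, pair_labels

def pairwise_auc(texts, labels):
    """AUROC of a vanilla TF-IDF + logistic regression model on paired inputs."""
    X_texts, y = make_pairs(texts, labels)
    X = TfidfVectorizer().fit_transform(X_texts)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, scoring="roc_auc", cv=3).mean()
```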

(3) Zero-Shot detection performance: Next, we substantiate our claims using zero-shot detection
performance on the human-machine dataset for both Xsum and Squad demonstrated in Figures
3(a)-(f). For the zero-shot detection in Figures 3(a)-(c), we use the RoBERTa-Large-Detector and
RoBERTa-Base-Detector from OpenAI, which are trained or fine-tuned for binary classification with

datasets containing human and AI-generated texts (AIT, b). We also perform experiments with
another state-of-the-art detector called ZeroGPT (AIT, a) shown in Figures 3(d)-(f). We observe
that with an increase in the number of samples or the sequence length used for detection, the zero-shot
detection performance of the models improves drastically, from around 50% to 90%, on both the Xsum and
Squad human-machine datasets. Naturally, the performance of the RoBERTa-Large-Detector is better
than that of the RoBERTa-Base-Detector, but the improvement in AUROC with the number of
samples/sequence length is significant for both models, validating our claims.
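A minimal sketch of this evaluation loop is given below. The checkpoint name is the one published on the Hugging Face Hub for OpenAI's RoBERTa-Large detector; its availability and exact label strings ("Real"/"Fake") are assumptions that may change across versions:

```python
from transformers import pipeline

detector = pipeline("text-classification",
                    model="openai-community/roberta-large-openai-detector")

def fake_score(text: str) -> float:
    """Return P(machine-generated); label names assumed to be Real/Fake."""
    out = detector(text, truncation=True)[0]
    return out["score"] if out["label"] == "Fake" else 1.0 - out["score"]

document = "..."  # a candidate document (placeholder)
tokens = document.split()
for length in (30, 50, 80, 150, 300, 500):
    print(length, fake_score(" ".join(tokens[:length])))
```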

(4) Detection with Paraphrasing: We also perform experiments in which the document generated
by the machine is paraphrased using a pre-trained, open-sourced HuggingFace paraphraser, Parrot
(Damodaran, 2021), which allows controlling the adequacy, fluency, and diversity of the generated
text. We perform supervised detection (Appendix), detection with pairwise IID samples (Appendix),
and zero-shot detection with OpenAI's RoBERTa-Large-Detector. It is evident from Figure 4 that
the detection performance decreases with paraphrasing, as also shown in (Sadasivan et al., 2023;
Krishna et al., 2023).
Although the detection performance drops by approximately 15% due to paraphrasing, the trend of
performance improvement still remains prominent as the sequence length increases, which validates
our hypothesis even under attack. Hence, one can potentially evade such attacks by considering
larger sequence lengths, with the corresponding sample complexity trade-off. Additionally, we observed
that the performance degradation is much smaller with pairwise iid samples, highlighting the
possibilities of fine-grained detectors.

[Figure 4: Zero-shot detection AUC with and without paraphrasing, against sequence lengths of 30-500.]

Figure 4: This figure demonstrates zero-shot detection performance with and without paraphrasing
using the RoBERTa-Large-Detector. Although the detection performance drops by approximately
15% due to paraphrasing, the trend of performance improvement holds as the sequence length increases.

5 Conclusion
We note that it becomes harder to detect AI-generated text when m(s) is close to h(s), and
paraphrasing or successive attacks can indeed reduce detection performance, as shown in our
experiments. However, in several domains where data is abundant, we assert that by collecting more
samples/sentences it is possible to increase the attainable area under the receiver operating
characteristic curve (AUROC) sufficiently above 1/2, and hence make detection possible. We further
remark that it would be quite difficult to make LLM distributions exactly equal to human distributions,
due to the vast diversity within the human population; doing so may require a large number of samples
from an information-theoretic perspective, which provides a lower bound on how close an LLM can get
to human distributions. While there are potential risks associated with detectors, such as misidentification
and false alarms, we believe that the ideal approach is to strive for more powerful, robust, fair, and
better detectors and more robust watermarking techniques. To that end, we are hopeful, based on
our results, that text detection is indeed possible under most settings and that these detectors
can help mitigate the misuse of LLMs and ensure their responsible use in society.

References
AI-based text detection (ZeroGPT). AI text detector, a. URL https://www.zerogpt.com.

AI-based text detection (OpenAI). AI text detector, b. URL https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text.

Scott Aaronson. My projects at OpenAI, Nov 2022. URL https://scottaaronson.blog/?p=6823.

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea
Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine
Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally
Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee,
Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka
Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander
Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy
Zeng. Do as i can, not as i say: Grounding language in robotic affordances, 2022.

Mikhail J. Atallah, Victor Raskin, Michael Crogan, Christian Hempelmann, Florian Kerschbaum,
Dina Mohamed, and Sanket Naik. Natural language watermarking: Design, analysis, and a
proof-of-concept implementation. In Proceedings of the 4th International Workshop on Information
Hiding, IHW ’01, pp. 185–199, Berlin, Heidelberg, 2001. Springer-Verlag. ISBN 3540427333.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx,
Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson,
Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel,
Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano
Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren
Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter
Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil
Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar
Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal
Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu
Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa,
Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles,
Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung
Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu
Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh,
Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori,
Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu,
Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang,
Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. On the
opportunities and risks of foundation models, 2022.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler,
Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott
Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.

Tanya Chowdhury, Razieh Rahimi, and James Allan. Rank-lime: Local model-agnostic feature
attribution for learning to rank, 2022.

Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.

Prithiviraj Damodaran. Parrot: Paraphrase generation for nlu., 2021.

Amit Dhurandhar. Auto-correlation dependent bounds for relational data. In Proc. of the 11th
Workshop on Mining and Learning with Graphs. Chicago, 2013.

Tom Fawcett. An introduction to roc analysis. Pattern recognition letters, 27(8):861–874, 2006.

Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. Gltr: Statistical detection and
visualization of generated text, 2019.

Hamed Jelodar, Yongli Wang, Chi Yuan, Xia Feng, Xiahui Jiang, Yanchao Li, and Liang Zhao.
Latent dirichlet allocation (lda) and topic modeling: models, applications, a survey, 2018.

Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation, 2018.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi
Chen, and Wen tau Yih. Dense passage retrieval for open-domain question answering, 2020.

Su Young Kim, Hyeonjin Park, Kyuyong Shin, and Kyung-Min Kim. Ask me what you need: Product
retrieval using knowledge from gpt-3, 2022.

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A
watermark for large language models, 2023.

Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphrasing
evades detectors of ai-generated text, but retrieval is an effective defense, 2023.

Nir Kshetri and Jeffrey Voas. Deep learning–based social media misinformation detection. IEEE
Software, 39(1):53–59, 2022. doi: 10.1109/MS.2022.3053106.

Gerhard C Langelaar, Iwan Setyawan, and Reginald L Lagendijk. Watermarking digital image and
video data. a state-of-the-art overview. IEEE Signal processing magazine, 17(5):20–46, 2000.

Thomas Lavergne, Tanguy Urvoy, and François Yvon. Detecting fake content with relative entropy
scoring. In Pan, 2008.

Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer Science & Business
Media, 2012.

Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. Gpt detectors are biased
against non-native english writers, 2023.

William Lifferth. Fake news, 2018. URL https://kaggle.com/competitions/fake-news.

Manuel V. Loureiro, Steven Derby, and Tri Kurniawan Wijaya. Topics as entity clusters: Entity-based
topics from language models and graph neural networks, 2023.

Andrew Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts.
Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the
association for computational linguistics: Human language technologies, pp. 142–150, 2011.

Hasan Mesut Meral, Bülent Sankur, A. Sumru Özsoy, Tunga Güngör, and Emre Sevinç. Natural
language watermarking via morphosyntactic alterations. Comput. Speech Lang., 23(1):107–125,
jan 2009. ISSN 0885-2308. doi: 10.1016/j.csl.2008.04.001. URL https://doi.org/10.1016/j.csl.2008.04.001.

Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn.
Detectgpt: Zero-shot machine-generated text detection using probability curvature, 2023.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary!
topic-aware convolutional neural networks for extreme summarization. In Proceedings of the
2018 Conference on Empirical Methods in Natural Language Processing, pp. 1797–1807, Brussels,
Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1206. URL https://aclanthology.org/D18-1206.

OpenAI. AI text classifier. https://platform.openai.com/ai-text-classifier, 2023.

OpenAI. Gpt-4 technical report, 2023.

Hariom A. Pandya and Brijesh S. Bhatt. Question answering survey: Directions, challenges, datasets,
evaluation matrices, 2021.

Yury Polyanskiy and Yihong Wu. Information theory: From coding to learning, 2022.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions
for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods
in Natural Language Processing, pp. 2383-2392, Austin, Texas, November 2016. Association
for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264.

Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive
sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, pp. 379–389, 2015.

Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi.
Can ai-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023.

Tyler Schildhauer. Fake news detection in the era of ai. In Proceedings of the 25th ACM Conference
on Computer-Supported Cooperative Work and Social Computing, pp. 1–10. ACM, 2022. doi:
10.1145/1234567.1234567.

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. Building
end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings
of the Thirtieth AAAI Conference on Artificial Intelligence, 2015.

Jin Dong Shin, Hyun Jae Kim, Kyung Min Lee, Seonghyeon Kim, and Seungwon Shin. Multilingual
language generation and automatic writing evaluation with transformer models. arXiv preprint
arXiv:2104.06399, 2021.

Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu, Alec
Radford, Gretchen Krueger, Jong Wook Kim, Sarah Kreps, Miles McCain, Alex Newhouse, Jason
Blazakis, Kris McGuffie, and Jasmine Wang. Release strategies and the social impacts of language
models, 2019.

Salil Pravin Vadhan. A study of statistical zero-knowledge proofs. PhD thesis, Massachusetts Institute
of Technology, 1999.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing
systems, 30:5998–6008, 2017.

Harsh K Verma, Abhishek Narain Singh, and Raman Kumar. Robustness of the digital image
watermarking techniques against brightness and rotation attack, 2009.

Shweta Wadhera, Deepa Kamra, Ankit Rajpal, Aruna Jain, and Vishal Jain. A comprehensive
review on digital image watermarking, 2022.

Zhen Wang. Modern question answering datasets and benchmarks: A survey, 2022.

Larry Wasserman. Lecture notes for Stat 705: Advanced data analysis. https://www.stat.cmu.edu/~larry/=stat705/Lecture27.pdf, 2013. Accessed on April 9, 2023.

Heqi Zheng, Xiao Zhang, Zewen Chi, Heyan Huang, Tan Yan, Tian Lan, Wei Wei, and Xian-Ling
Mao. Cross-lingual phrase retrieval, 2022.

Xiaofei Zou and Xu Ling. Ai-based detection of misinformation in social media. IEEE Access, 9:
112408–112418, 2021. doi: 10.1109/ACCESS.2021.3104419.

Appendix

Table of Contents
A  Additional Insights and Remarks
B  Detailed Proofs
    B.1  Revisiting Le Cam's Lemma and the Existence of the Optimal Detector
    B.2  Proof of Theorem 1
    B.3  Proof of Theorem 2
C  Additional Figures of Experimental Results
    C.1  Additional Experimental Details
D  Detailed Conclusion & Scope of Future Works

A Additional Insights and Remarks


Remark 1: Insights for watermark design. From a practical perspective, even though Theorem
1 shows that detection is always possible by collecting more samples, doing so might be costly if
the number of samples n needed is extremely high. However, one could mitigate this trade-off by developing
efficient watermarking techniques as discussed in (Kirchenbauer et al., 2023; Aaronson, 2022), which
essentially increases the Chernoff information, or in other words, increases the δ, eventually reducing
the required number of samples. Nevertheless, empirical demonstrations in (Sadasivan et al., 2023;
Krishna et al., 2023) exposed the vulnerability of the watermark-based detectors with paraphrasing-
based attacks, raising a genuine concern in the community about the detection of AI-generated
texts.
To address this concern, more recently, interesting work by (Krishna et al., 2023) proposed a
novel defense mechanism based on information retrieval principles to combat prior attacks and
demonstrated its effectiveness even with a corpus size of 15M generations. This result also supports
our theory, indicating that it is always possible to detect AI-generated text depending on the
detection method. In addition, there are some recent open-sourced text detection tools (AIT, a,b)
whose performances are also worth considering and validate the fact that detection is indeed possible
under certain settings. We believe that with the new insights from this work, one can design more
efficient and robust watermarks spanning a larger corpus of text, which will be hard to remove via
vanilla paraphrasers.

Remark 2: Insights for detector design. This work demonstrates that detecting AI-generated
text should be almost always possible but one would need to collect more samples depending on the
hardness of the problem (controlled by the closeness of human and machine distributions). The recent
study by (Liang et al., 2023) raises an important concern regarding the bias in some of the existing
detectors. The authors in Liang et al. (2023) revealed that a significant proportion of the current
detectors inaccurately classify non-native English writing samples as AI-generated, potentially leading
to unjust consequences in various contexts. Interestingly, updating text generated by non-native
speakers with prompts such as Enhance it to sound more like that of a native speaker leads to a
substantial decrease in misclassification. This evidence suggests that most current detectors prioritize
low perplexity as a crucial criterion for identifying a text as AI-generated, which might be flawed in
various contexts, for example - academic papers as shown in (Liang et al., 2023). More specifically,
we want to highlight the potential for bias in detectors relying primarily on perplexity scores, as
elaborated in (Liang et al., 2023), underscoring the need for a comprehensive and equitable redesign
that takes into account other relevant metrics. Our research demonstrates a promising approach to
text detection, wherein the collection of more samples and the development of a multi-sample-based
detector significantly enhance performance from the best word-level detector, as demonstrated by
our experimental results depicted in Figures 6-8. While our results demonstrate the potential for
improved detection accuracy at the paragraph level, it is important to note that this approach
requires designing detectors capable of processing multiple samples. For instance, in our IMDb
example, we developed a paragraph-level detector that can take the entire paragraph as input, in
contrast to the word-level detector, which only processes one word at a time. Thus our approach
requires the detector to deal with n samples, which may be complicated compared to processing
just one sample, leading to a trade-off that could be critical for accurate detection in practice. To
summarize, our work offers valuable insights into detector design, specifically regarding the sample
complexity of AI-text detection and its connection to the Chernoff information between human and machine

[Figure 5: (a) LLM learns a different distribution. (b) LLM learns a similar distribution.]

Figure 5: We present the two detectability regimes for LLMs. Figure 5(a) depicts the scenario in
which the LLM learns a distribution different from the human's, making detection easy. Figure 5(b)
shows the scenario in which the LLM's distribution is very close to the human's; detection is then
hard but still possible by collecting more samples. Additionally, in the scenario of Figure 5(b),
efficient watermarking techniques such as (Kirchenbauer et al., 2023; Krishna et al., 2023) could
help improve separability and detectability.

distributions. We can utilize these insights to develop robust and fair detectors that enhance the
overall accuracy of text detection methods.

Remark 3: Task-specific detectability & optimistic view of LLMs. In addition to our findings
on the detectability of LLM-generated content, we want to highlight the significance of task-specific
detectability (Figure 5). While the primary focus of Theorem 1 is to detect machine-generated text,
it is important to consider the broader context of LLMs and their potential positive applications.
LLMs have demonstrated significant potential to assist in a variety of tasks, including language
translation (Vaswani et al., 2017), text summarization (Rush et al., 2015), dialogue systems (Serban
et al., 2015), question answering (Pandya & Bhatt, 2021; Wang, 2022; Karpukhin et al., 2020),
information retrieval (Chowdhury et al., 2022; Zheng et al., 2022; Kim et al., 2022), recommendation
engine (Kang & McAuley, 2018; Brown et al., 2020), language grounded robotics (Ahn et al., 2022)
and many others. In these scenarios, the goal is to generate high-quality text that meets the needs
of the user rather than to deceive or mislead. For example, consider the application of an LLM as a
tool to assist individuals or groups with moderate English writing skills to improve their writing. In
this case, a well-trained LLM model could have a better (and different) distribution across S than
the human distribution h(s). This difference in distributions ensures that it should be possible to
detect that AI generates the text. This understanding of detectability underscores the complexity
of working with LLMs and emphasizes the importance of tailored approaches to maximize their
potential. Our work provides insights into the intricacies of LLM-generated content detection, paving
the way for more targeted and practical applications of these powerful models.

Remark 4: Realistic scenarios where m(s) and h(s) are different. Theorem 1 suggests that
even small differences between the machine-generated text m(s) and the human-generated text h(s)
should help for AI-generated text detection. In many practical applications, this difference can be
easily achieved since we can control m(s), but not necessarily h(s). One such application is the
use of LLMs to address biases and prejudices in human-generated text. While biases can arise due
to the diverse backgrounds of certain communities or clusters of humans, LLMs can be trained to
generate unbiased text by minimizing the likelihood of biased language in the training data. This
can lead to a more inclusive and equitable society, where language use is free from discrimination.

Importantly, it is crucial to maintain a gap between the bias in human-generated text and that in
machine-generated text. This ensures that biased language remains more likely to originate from
human-generated text than from LLMs. By doing so, effective detection and separation of the two
sources can be achieved, enabling us to fully harness the potential of LLMs without compromising
their integrity. With careful consideration and responsible use, LLMs can make a positive impact on
our society, helping us to communicate more effectively and promoting fairness and inclusivity in
language use.

B Detailed Proofs
B.1 Revisiting Le Cam’s Lemma and the Existence of the Optimal Detector
We first restate Le Cam’s lemma and its proof, which appears in Le Cam (2012) and many lecture
notes such as (Wasserman, 2013).

Lemma 1 (Le Cam’s Lemma). Let S be an arbitrary set. For any two distributions m and h on S,
we have

inf_Ψ { P_{s∼m}[Ψ(s) ≠ 1] + P_{s∼h}[Ψ(s) ≠ 0] } = 1 − TV(m, h),  (18)

where the infimum is taken over all detectors (measurable maps) Ψ : S → {1, 0}. In particular, the
detector with the acceptance region A∗ := {s : m(s) ≥ h(s)}, defined as

Ψ∗(s) := 1 if s ∈ A∗,  and  Ψ∗(s) := 0 if s ∈ S\A∗,

achieves the infimum. We note that Ψ∗ is the likelihood-ratio-based detector.

Proof. For notation simplicity, we use m and h to denote both the probability measure and the
probability density of the machine-generated and human-generated text, respectively, with the
specific meaning discernible from the context. For any detector Ψ : S → {1, 0}, denote A as its
acceptance region, where Ψ(s) = 1 for s ∈ A, and Ψ(s) = 0 for s ∈ S\A. Then we have

P_{s∼m}[Ψ(s) ≠ 1] + P_{s∼h}[Ψ(s) ≠ 0] = m(S\A) + h(A)
                                       = 1 − (m(A) − h(A)).  (19)

Taking the infimum over all acceptance regions on both sides of (19) yields

inf_Ψ { P_{s∼m}[Ψ(s) ≠ 1] + P_{s∼h}[Ψ(s) ≠ 0] } = inf_Ψ { 1 − (m(A) − h(A)) }
                                                = 1 − sup_Ψ (m(A) − h(A))
                                                = 1 − TV(m, h).

Next, we proceed to show that the ratio-based detector Ψ∗(s), defined in the statement of
Lemma 1, achieves the infimum. We first note that the acceptance region A∗ := {s : m(s) ≥ h(s)} is
a measurable set included in the collection of all acceptance regions, since I{m(s)≥h(s)} is a
measurable function. Therefore,

P_{s∼m}[Ψ∗(s) ≠ 1] + P_{s∼h}[Ψ∗(s) ≠ 0] ≥ inf_Ψ { P_{s∼m}[Ψ(s) ≠ 1] + P_{s∼h}[Ψ(s) ≠ 0] }.  (20)

On the other hand, for any measurable set B, we have B\A∗ = {s ∈ B : m(s) < h(s)} and
A∗\B = {s ∉ B : m(s) ≥ h(s)} by the definition of A∗. Therefore, by the sigma-additivity of
measure, we have

1 − (m(A∗) − h(A∗)) = 1 − (m(A∗ ∩ B) − h(A∗ ∩ B)) − (m(A∗\B) − h(A∗\B)).  (21)

In the right-hand side of (21), we note that m(A∗ \B) − h(A∗ \B) ≥ 0 because our detector is
likelihood-ratio-based. This implies we can upper bound the right-hand side in (21) by dropping the
negative term as follows

1 − (m(A∗ ) − h(A∗ )) ≤ 1 − (m(A∗ ∩ B) − h(A∗ ∩ B)). (22)

Further from the definition of the ratio-based detector, we note that m(B\A∗ ) − h(B\A∗ ) < 0. This
implies −(m(B\A∗ ) − h(B\A∗ )) > 0 and we can upper bound the right hand side of (22) by adding
just the positive number −(m(B\A∗ ) − h(B\A∗ )) as follows,

1 − (m(A∗ ) − h(A∗ )) ≤ 1 − (m(A∗ ∩ B) − h(A∗ ∩ B)) − (m(B\A∗ ) − h(B\A∗ )). (23)

From the sigma-additivity of measure, we can write

1 − (m(A∗ ) − h(A∗ )) ≤ 1 − (m(B) − h(B)). (24)

Since the inequality in (24) holds for any measurable set B, we can write

$$P_{s\sim m}[\Psi^*(s) \neq 1] + P_{s\sim h}[\Psi^*(s) \neq 0] \leq \inf_{\Psi}\Big\{P_{s\sim m}[\Psi(s) \neq 1] + P_{s\sim h}[\Psi(s) \neq 0]\Big\}. \qquad (25)$$

Hence, from the lower bound in (20) and upper bound in (25), we conclude that Ψ∗ (s) achieves the
infimum, which completes the proof.
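
To make the lemma concrete, the following is a minimal numerical sketch (ours, not from the paper)
that builds the likelihood-ratio-based detector Ψ∗ for two small discrete distributions and verifies
that its total error equals 1 − TV(m, h); the distributions themselves are illustrative assumptions.

```python
import numpy as np

# Illustrative (assumed) machine and human distributions on a 4-symbol alphabet.
m = np.array([0.40, 0.30, 0.20, 0.10])  # machine-generated text distribution m(s)
h = np.array([0.25, 0.25, 0.25, 0.25])  # human-generated text distribution h(s)

# Total variation distance: TV(m, h) = sup_A (m(A) - h(A)) = 0.5 * ||m - h||_1.
tv = 0.5 * np.abs(m - h).sum()

# Optimal acceptance region A* = {s : m(s) >= h(s)}, detector Psi*(s) = 1{s in A*}.
A_star = m >= h

# Total error of Psi*: P_{s~m}[Psi(s) != 1] + P_{s~h}[Psi(s) != 0].
err_m = m[~A_star].sum()  # machine text misclassified as human
err_h = h[A_star].sum()   # human text misclassified as machine
total_error = err_m + err_h

print(f"TV(m, h)      = {tv:.4f}")
print(f"1 - TV(m, h)  = {1 - tv:.4f}")
print(f"error of Psi* = {total_error:.4f}")  # matches 1 - TV(m, h), per Lemma 1
```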

Le Cam's lemma directly applies to our detector D with threshold γ by noting that any
detector can be implemented via a detector with a threshold. Indeed, define Dγ : S → {1, 0} via

$$D_\gamma(s) := \begin{cases} 1, & D(s) \geq \gamma, \\ 0, & D(s) < \gamma, \end{cases}$$

then it holds that {Ψ : S → {1, 0}} ⊆ {Dγ : S → {1, 0}, D : S → R, γ ∈ R}, because for any Ψ we
can choose D to be exactly the same as Ψ (since {1, 0} ⊂ R) and set γ = 0.5.
In fact, the detector Ψ∗ is exactly the likelihood-ratio-based detector which, by the Neyman-
Pearson lemma (Cover, 1999, Chapter 11), is optimal in this (simple-vs.-simple) hypothesis test
setting.
Relationship to the tightness analysis in Sadasivan et al. (2023). The authors of
Sadasivan et al. (2023) provide a tightness analysis for their AUROC upper bound. The main part
of the proof is to show the tightness of Equation 18. Specifically, for any given human-generated text
distribution h, they construct a machine-generated text distribution m and a detector D with some
threshold γ, and show that the detector with the threshold achieves the equality in Equation 18. We
note that their constructed detector with the threshold is exactly the likelihood-ratio-based detector.
Moreover, a key difference between our result and theirs is that we show the tightness can be
achieved for any given distributions m and h, while they construct a specific m given h. While
their specific construction of the machine distribution yields many insights into the problem, it is not
necessary for achieving the tightness. This difference also implies that we can be more optimistic
about the problem, since a classifier achieving the tightness exists for any machine-generated
distribution.

B.2 Proof of Theorem 1
The first part of the proof follows from a standard application of Chernoff bounds (Vadhan,
1999, Appendix A). From the statement of Theorem 1, we note that the AUROC of the best possible
detector is given by

$$\mathrm{AUROC} = \frac{1}{2} + \mathrm{TV}(m^{\otimes n}, h^{\otimes n}) - \frac{\mathrm{TV}(m^{\otimes n}, h^{\otimes n})^2}{2}. \qquad (26)$$
Let us start with a hard detection setting where m(s) and h(s) are very close, and suppose
TV(m, h) = δ for some small δ > 0. From the definition of the TV distance, we know that there exists
some set A ⊆ S such that, for samples sm ∼ m(s) and sh ∼ h(s), it holds that

$$P(s_m \in A) - P(s_h \in A) = \delta. \qquad (27)$$

Let us define P(sh ∈ A) = p, which implies that P(sm ∈ A) = p + δ. Now collect n samples
{s_i}_{i=1}^n from m(s); the probability of any sample s_i lying in A is p + δ, so on average (p + δ)n
of the samples will be in A. Similarly, for n samples from h(s), on average pn will be in A. Therefore,
we can utilize the Chernoff bound to write

$$P\Big(\text{at least } \big(p + \tfrac{\delta}{2}\big)n \text{ samples of } h \text{ are in } A\Big) \leq \exp\Big(\tfrac{-n\delta^2}{2}\Big),$$
$$P\Big(\text{at most } \big(p + \tfrac{\delta}{2}\big)n \text{ samples of } m \text{ are in } A\Big) \leq \exp\Big(\tfrac{-n\delta^2}{2}\Big). \qquad (28)$$

Now, let A′ denote the set of n-tuples that contain more than (p + δ/2)n samples in A.
Therefore, we can bound

$$\mathrm{TV}(m^{\otimes n}, h^{\otimes n}) \geq P\big(\{s_i^m\}_{i=1}^n \in A'\big) - P\big(\{s_i^h\}_{i=1}^n \in A'\big) \geq \Big(1 - \exp\Big(\tfrac{-n\delta^2}{2}\Big)\Big) - \exp\Big(\tfrac{-n\delta^2}{2}\Big) = 1 - 2\exp\Big(\tfrac{-n\delta^2}{2}\Big). \qquad (29)$$

The lower bound in (29) gives the minimum value of TV(m⊗n, h⊗n) for given n and δ.
Therefore, if we want the AUROC of the best possible detector to be equal to, or higher than, some
ϵ ∈ [0.5, 1], we need

$$\frac{1}{2} + \mathrm{TV}(m^{\otimes n}, h^{\otimes n}) - \frac{\mathrm{TV}(m^{\otimes n}, h^{\otimes n})^2}{2} \geq \epsilon. \qquad (30)$$
Since the left-hand side is a monotonically increasing function of TV(m⊗n, h⊗n) on [0, 1], it suffices,
by the lower bound in (29), that

$$\frac{1}{2} + \Big(1 - 2\exp\Big(\tfrac{-n\delta^2}{2}\Big)\Big) - \frac{1}{2}\Big(1 - 2\exp\Big(\tfrac{-n\delta^2}{2}\Big)\Big)^2 \geq \epsilon. \qquad (31)$$

After expanding the square, we get

$$\frac{1}{2} + \Big(1 - 2\exp\Big(\tfrac{-n\delta^2}{2}\Big)\Big) - \frac{1}{2} - 2\exp(-n\delta^2) + 2\exp\Big(\tfrac{-n\delta^2}{2}\Big) \geq \epsilon, \qquad (32)$$

which simplifies to 1 − 2 exp(−nδ²) ≥ ϵ.

After rearranging the terms, we get

$$\frac{1-\epsilon}{2} \geq \exp(-n\delta^2). \qquad (33)$$

Taking the logarithm on both sides and rearranging yields

$$n \geq \frac{1}{\delta^2}\log\Big(\frac{2}{1-\epsilon}\Big). \qquad (34)$$

This completes the proof.
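
As a sanity check (not part of the original proof), the following sketch evaluates the sample
complexity bound in (34) and, under an assumed toy model where the event A has probability p
under h and p + δ under m, estimates by Monte Carlo the total error of the counting detector from
the proof, which thresholds the number of samples falling in A at (p + δ/2)n.

```python
import numpy as np

rng = np.random.default_rng(0)

def n_required(delta: float, eps: float) -> float:
    """Sample complexity bound of Theorem 1: n >= (1/delta^2) * log(2 / (1 - eps))."""
    return np.log(2.0 / (1.0 - eps)) / delta**2

# Toy model (assumed for illustration): P(s_h in A) = p, P(s_m in A) = p + delta.
p, delta, eps = 0.45, 0.05, 0.95
n = int(np.ceil(n_required(delta, eps)))
print(f"Theorem 1 bound: n >= {n} samples for AUROC >= {eps}")

# Counting detector from the proof: declare "machine" if more than (p + delta/2) * n
# of the n samples fall in A.
trials = 5000
thresh = (p + delta / 2) * n
machine_counts = rng.binomial(n, p + delta, size=trials)  # n-sample draws from m
human_counts = rng.binomial(n, p, size=trials)            # n-sample draws from h
err = np.mean(machine_counts <= thresh) + np.mean(human_counts > thresh)
print(f"empirical total error of the counting detector: {err:.4f}")  # small at this n
```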

B.3 Proof of Theorem 2


Before starting the analysis, let us restate the following bound from (Dhurandhar, 2013) for quick
reference.
Lemma 2 (Upper bound for the non-iid scenario). Let n be the number of samples drawn
sequentially from $P(S_1, S_2, \cdots, S_n) = \prod_{j=1}^{L} \tau_j$, where the $\tau_j$ are independent blocks, each
consisting of a dependent sequence $(s_1, s_2, \cdots, s_{c_j})$ of length $c_j$, such that $\sum_{j=1}^{L} c_j = n$. Under
the dependence structure in (16), for any $\delta > \frac{1}{n}\sum_{j=1}^{L}(c_j - 1)\rho_j$, it holds that

$$P\big(|\bar{S} - E[\bar{S}]| \geq \delta\big) \leq 2\exp\Bigg(\frac{-2\big(n\delta - \sum_{j=1}^{L}(c_j - 1)\rho_j\big)^2}{n}\Bigg), \qquad (35)$$

where $\bar{S} = \frac{1}{n}\sum_{i=1}^{n} s_i$ and $E[S_i \mid S_{i-1} = s_{i-1}, \cdots, S_1 = s_1] = \frac{\rho}{i-1}\sum_{k=1}^{i-1} s_k + (1 - \rho)E[S_i]$.
Lemma 2 provides an upper bound for the non-iid scenario, exponential in the sample
size n, with an additional dependence on the strength of association ρj and the size cj of each
dependent block. It is important to note that when ρj = 0 for all j, it reduces exactly to
the standard Chernoff bound.
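
A minimal sketch (ours, for illustration) of the right-hand side of (35); setting all ρj = 0
recovers the standard Chernoff/Hoeffding rate 2 exp(−2nδ²), while positive dependence weakens
the bound:

```python
import numpy as np

def non_iid_tail_bound(n, delta, c, rho):
    """RHS of (35): 2 * exp(-2 * (n*delta - sum_j (c_j - 1) * rho_j)^2 / n).

    c, rho: per-block sizes c_j and association strengths rho_j (sum(c) must be n).
    """
    c, rho = np.asarray(c, float), np.asarray(rho, float)
    assert c.sum() == n
    slack = n * delta - ((c - 1) * rho).sum()
    assert slack > 0, "Lemma 2 requires delta > sum_j (c_j - 1) * rho_j / n"
    return 2.0 * np.exp(-2.0 * slack**2 / n)

# With rho_j = 0 the bound matches the standard Chernoff bound 2 * exp(-2 n delta^2).
n, delta = 100, 0.1
c = [10] * 10
print(non_iid_tail_bound(n, delta, c, [0.0] * 10))   # = 2 * exp(-2)
print(2.0 * np.exp(-2.0 * n * delta**2))             # same value
# Positive within-block dependence yields a larger (weaker) tail bound.
print(non_iid_tail_bound(n, delta, c, [0.005] * 10))
```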
Now we carry out the sample complexity analysis for the non-iid setting. As in the proof
for the iid case, define P(sh ∈ A) = p, which implies P(sm ∈ A) = p + δ. Collect
n samples {s_i}_{i=1}^n sequentially from m(s); the probability of any sample s_i lying in A is
p + δ, so on average (p + δ)n of them will be in A. Similarly, for n samples from
h(s), on average pn will be in A. Therefore, we can utilize Lemma 2 to write

$$P\Big(\text{at least } \big(p + \tfrac{\delta}{2}\big)n \text{ samples of } h \text{ are in } A\Big) \leq 2\exp\Bigg(\frac{-2\big(\frac{n\delta}{2} - \sum_{j=1}^{L}(c_j - 1)\rho_j\big)^2}{n}\Bigg),$$
$$P\Big(\text{at most } \big(p + \tfrac{\delta}{2}\big)n \text{ samples of } m \text{ are in } A\Big) \leq 2\exp\Bigg(\frac{-2\big(\frac{n\delta}{2} - \sum_{j=1}^{L}(c_j - 1)\rho_j\big)^2}{n}\Bigg), \qquad (36)$$

where, for notational simplicity, we let $\beta = \frac{-2\big(\frac{n\delta}{2} - \sum_{j=1}^{L}(c_j - 1)\rho_j\big)^2}{n}$. Now, let A′ denote the set
of n-tuples that contain more than (p + δ/2)n samples in A. Therefore, we can bound

$$\mathrm{TV}(m^{\otimes n}, h^{\otimes n}) \geq P\big(\{s_i^m\}_{i=1}^n \in A'\big) - P\big(\{s_i^h\}_{i=1}^n \in A'\big) \geq \big(1 - 2e^{\beta}\big) - 2e^{\beta} = 1 - 4e^{\beta}. \qquad (37)$$

The lower bound in (37) gives the minimum value of TV(m⊗n, h⊗n) for given n and
δ. Therefore, for the AUROC of the best possible detector to be equal to, or higher than, some
ϵ ∈ [0.5, 1], it should hold that

$$\frac{1}{2} + \mathrm{TV}(m^{\otimes n}, h^{\otimes n}) - \frac{\mathrm{TV}(m^{\otimes n}, h^{\otimes n})^2}{2} \geq \epsilon. \qquad (38)$$

Since the left-hand side of (38) is a monotonically increasing function of TV(m⊗n, h⊗n), it suffices,
by the minimum value in (37), that

$$\frac{1}{2} + \big(1 - 4e^{\beta}\big) - \frac{\big(1 - 4e^{\beta}\big)^2}{2} \geq \epsilon. \qquad (39)$$

After expanding the square, we get

$$\frac{1}{2} + 1 - 4e^{\beta} - \frac{1}{2} - 8e^{2\beta} + 4e^{\beta} \geq \epsilon, \qquad (40)$$

which implies

$$\frac{1-\epsilon}{8} \geq e^{2\beta} = \exp\Bigg(\frac{-4\big(\frac{n\delta}{2} - \sum_{j=1}^{L}(c_j - 1)\rho_j\big)^2}{n}\Bigg), \qquad (41)$$
where we have substituted the value of β. Taking the logarithm on both sides, we get

$$\log\Big(\frac{8}{1-\epsilon}\Big) \leq \frac{4}{n}\Big(\frac{n\delta}{2} - \sum_{j=1}^{L}(c_j - 1)\rho_j\Big)^2 = n\delta^2 - 4\delta\sum_{j=1}^{L}(c_j - 1)\rho_j + \frac{4}{n}\Big(\sum_{j=1}^{L}(c_j - 1)\rho_j\Big)^2. \qquad (42)$$

 
Let us denote $\alpha = \sum_{j=1}^{L}(c_j - 1)\rho_j$ and $\gamma(\epsilon) = \log\big(\frac{8}{1-\epsilon}\big)$ for simplicity of calculation. The
inequality above then boils down to solving

$$\delta^2 n^2 - n\big(4\alpha\delta + \gamma(\epsilon)\big) + 4\alpha^2 \geq 0, \qquad (43)$$

which is a standard quadratic inequality in n, and the corresponding solution is given by

$$n \geq \frac{\gamma(\epsilon)}{2\delta^2} + \frac{2\alpha}{\delta} + \frac{1}{2\delta^2}\sqrt{\big(4\alpha\delta + \gamma(\epsilon)\big)^2 - 16\alpha^2\delta^2} \qquad (44)$$
$$= \frac{\gamma(\epsilon)}{2\delta^2} + \frac{2\alpha}{\delta} + \frac{1}{2\delta^2}\sqrt{\gamma(\epsilon)^2 + 8\alpha\delta\gamma(\epsilon)}$$
$$= \frac{\gamma(\epsilon)}{2\delta^2} + \frac{2}{\delta}\sum_{j=1}^{L}(c_j - 1)\rho_j + \frac{1}{2\delta^2}\sqrt{\gamma(\epsilon)^2 + 8\Big(\sum_{j=1}^{L}(c_j - 1)\rho_j\Big)\delta\gamma(\epsilon)}.$$

Now, we further expand upon the expression as

$$n \geq \frac{\gamma(\epsilon)}{2\delta^2} + \frac{2}{\delta}\sum_{j=1}^{L}(c_j - 1)\rho_j + \frac{1}{\sqrt{2}\,\delta^2}\sqrt{\frac{1}{2}\Big(\gamma(\epsilon)^2 + 8\Big(\sum_{j=1}^{L}(c_j - 1)\rho_j\Big)\delta\gamma(\epsilon)\Big)}$$
$$\geq \frac{\gamma(\epsilon)}{2\delta^2} + \frac{2}{\delta}\sum_{j=1}^{L}(c_j - 1)\rho_j + \frac{1}{2\sqrt{2}\,\delta^2}\gamma(\epsilon) + \frac{1}{\sqrt{2}\,\delta^2}\sqrt{2\Big(\sum_{j=1}^{L}(c_j - 1)\rho_j\Big)\delta\gamma(\epsilon)}, \qquad (45)$$

where the first step rewrites the radical by multiplying and dividing by a constant factor of 2, and the
second step applies Jensen's inequality to the (concave) square-root function. Using order notation,
we obtain

$$n = \Omega\Bigg(\frac{1}{\delta^2}\log\Big(\frac{1}{1-\epsilon}\Big) + \frac{1}{\delta}\sum_{j=1}^{L}(c_j - 1)\rho_j + \sqrt{\frac{1}{\delta^3}\log\Big(\frac{1}{1-\epsilon}\Big)\sum_{j=1}^{L}(c_j - 1)\rho_j}\Bigg). \qquad (46)$$

C Additional Figures of Experimental Results


C.1 Additional Experimental Details
IMDb Dataset Experiments. To validate our claims on the possibilities of detection, we run
experiments on the IMDb dataset (Maas et al., 2011), a widely used benchmark in natural language
processing. The dataset consists of 50,000 movie reviews from the Internet Movie Database, each
labeled as positive or negative according to its sentiment, and the goal is to classify reviews based on
their text content. These experiments validate our hypothesis on a more general class of language
tasks, including classification and detection. We specifically focus on the representation space of the
inputs for both the human and machine distributions, and validate our hypothesis by comparing the
input space of single words to the input space of groups of sentences. The objective is to analyze how
the detector's performance varies between word-level and paragraph-level detection. Hence, there are
two scenarios to consider. The first is where we are given a single word and must determine whether
it came from the positive or the negative class. The second, and more practical, case is where we are
given a paragraph, i.e., a group of sentences, and must detect which class it came from.
We first compute the total variation distance between the positive and negative classes at the word
level, i.e., the distance between the two classes' distributions over the space of words. Figure 7(a)
shows that the best possible AUROC achievable by any detector at the word level is 0.585. From
these results, it seems almost impossible to distinguish the two classes. However, when we perform
the detection at the paragraph level using real detectors (standard ML models, including random
forest, logistic regression, and a vanilla multi-layer perceptron), we see a remarkable improvement in
detection performance. As shown in Figure 7(a), all the real detectors achieve a train AUROC greater
than 0.85 (≥ 0.93 for the random forest and the MLP) and a test AUROC greater than 0.8, surpassing
the upper bound on the best word-level detector and validating our theory and intuition. This
performance is fully aligned with our claims and provides evidence that designing a high-performance
detector for AI-generated text is achievable even for general NLP detection tasks.
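
The following is a minimal sketch of this experimental recipe (ours, not the authors' exact code;
loading IMDb via the Hugging Face `datasets` hub is an assumption): estimate the word-level TV
distance from empirical unigram distributions, then train a simple Bag-of-Words detector at the
paragraph level and compare AUROCs.

```python
import numpy as np
from collections import Counter
from datasets import load_dataset  # assumed deps: datasets, scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

ds = load_dataset("imdb", split="train")
texts, labels = ds["text"], np.array(ds["label"])

# Word-level TV distance between the two classes' empirical unigram distributions.
def unigram_dist(docs):
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

pos = unigram_dist([t for t, y in zip(texts, labels) if y == 1])
neg = unigram_dist([t for t, y in zip(texts, labels) if y == 0])
vocab = set(pos) | set(neg)
tv_word = 0.5 * sum(abs(pos.get(w, 0) - neg.get(w, 0)) for w in vocab)
best_auroc_word = 0.5 + tv_word - tv_word**2 / 2  # single-sample bound, cf. (26)
print(f"word-level TV = {tv_word:.3f}, best word-level AUROC <= {best_auroc_word:.3f}")

# Paragraph-level detection with a Bag-of-Words representation.
X = CountVectorizer(max_features=20000).fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"paragraph-level test AUROC = {auroc:.3f}")  # far above the word-level bound
```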

[Figure 6: (a) example review text from the IMDb dataset ("i rented i am curious yellow from my
video store because of all the controversy that surrounded it...") with its sentiment label; (b)
paragraph representation space.]

Figure 6: Figure 6(a) (left) shows example textual paragraphs and their corresponding labels in the
IMDb dataset. It also highlights part of one random paragraph (one input), illustrating that having
many sentences as a single input for detection is common and practical; alongside it is the word-cloud
representation of the word distribution from which the word-level total variation is estimated. Figure
6(b) (right) shows the paragraph representation space obtained with a Bag-of-Words count vectorizer,
where each row corresponds to one review.

(a) Full reviews.

Word-Level: TV-norm = 0.088; AUROC (best detector) = 58.47

Paragraph-Level:

Classification Algorithm | AUROC (Train) | Accuracy | AUROC (Test)
Logistic Regression      | 85.67         | 85.71    | 83.42
Random Forest            | 93.91         | 94.12    | 82.12
Vanilla MLP              | 94.32         | 94.31    | 83.12

(b) Reviews with a reduced number of sentences (increased hardness).

Word-Level: TV-norm = 0.067; AUROC (best detector) = 56.53

Paragraph-Level:

Classification Algorithm | AUROC (Train) | Accuracy | AUROC (Test)
Logistic Regression      | 75.12         | 76.11    | 74.12
Random Forest            | 96.78         | 97.12    | 73.44
Vanilla MLP              | 96.95         | 96.12    | 72.89

Figure 7: The word-level rows in Figures 7(a) and 7(b) report the total variation norm distance
when the input to the detector is a single word and one needs to detect whether it came from the
positive or the negative class (human or machine in our context), together with the AUROC that can
be achieved by the best detector given that total variation norm, as shown in (Sadasivan et al., 2023).
The paragraph-level rows show the accuracy and AUROC achieved by real detectors (standard
machine learning algorithms) when each input to the detector is a paragraph, i.e., a group of
sentences. It is evident that at the paragraph level, even a simple untuned ML detector can achieve a
very high AUROC of more than 85%, whereas the word-level bound is very low. In Figure 7(b),
where we increase the hardness of the problem by reducing the number of sentences in each passage,
the AUROC achieved by the real detectors decreases but remains much larger than the word-level
best detector's AUROC, which validates our claims.

(a) Word-Level: TV-norm = 0.102; AUROC (best detector) = 59.73

(b) Paragraph-Level:

Classification Algorithm | AUROC (Train) | Accuracy | AUROC (Test)
Logistic Regression      | 86.39         | 86.32    | 84.95
Random Forest            | 94.40         | 94.39    | 89.99
Vanilla MLP              | 96.10         | 97.01    | 89.91

Figure 8: The table in Figure 8(a) reports the total variation norm distance at the word level for the
Fake News dataset (Lifferth, 2018); the AUROC achievable by the best detector based on this total
variation norm is 59.73%. Figure 8(b) shows that the accuracy and AUROC achieved by a real
detector at the paragraph level go up to 90%, which validates our hypothesis for a general class of
NLP tasks.

IMDb NLP Dataset Experiments with Increased Hardness. To provide additional confirmation
of our claim, we made the experimental setting more challenging by randomly decreasing the number
of sentences in each review, making it harder for any genuine detector or classifier to distinguish the
classes. In this scenario, all the methods still achieved a test AUROC greater than 0.7, which is lower
than in the previous case. This result supports our hypothesis that detection accuracy improves as
the number of samples/sentences increases.
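
A sketch of this hardness manipulation (our illustration; `texts` and the evaluation pipeline are
assumed to come from the Bag-of-Words sketch above) that randomly keeps only a few sentences
per review before re-running the detector:

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample_sentences(text: str, k: int) -> str:
    """Keep at most k randomly chosen sentences from a review (crude '.' split)."""
    sents = [s for s in text.split(".") if s.strip()]
    keep = rng.choice(len(sents), size=min(k, len(sents)), replace=False)
    return ". ".join(sents[i] for i in sorted(keep))

# Harder setting: each input carries fewer sentences, i.e., fewer "samples".
texts_hard = [subsample_sentences(t, k=2) for t in texts]
# Re-fitting the paragraph-level detector on texts_hard (as in the sketch above)
# should yield a lower, but still clearly above-bound, test AUROC.
```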
We conducted a similar experiment on a Fake News classification dataset (Lifferth, 2018), and
the results were consistent with our previous findings. This indicates that AI-generated text can be
detected, although we need to be cautious and gather more samples as the distributions become
closer.
We would like to emphasize that the purpose of this experimentation is to demonstrate our
hypothesis regarding the feasibility of detection rather than to showcase the accuracy of classification.
This is because the accuracy of classification is already well-established, with a simple pre-trained
BERT-based model being capable of achieving high accuracy.

D Detailed Conclusion & Scope of Future Works


We note that it becomes harder to detect the AI-generated text when m(s) is close to h(s), and
paraphrasing attacks can indeed reduce the detection performance as shown in our experiments.
However, we assert that by collecting more samples/sentences, it will be possible to increase the
attainable area under the receiver operating characteristic curve (AUROC) sufficiently greater than
1/2, and hence make detection possible. We further remark that it would be quite difficult
to make the LLM distribution exactly equal to the human distribution, owing to the vast diversity
within the human population: matching it would require a very large number of samples from an
information-theoretic perspective, which effectively lower-bounds how close machine-generated text
can get to the human distribution. This diversity could thus support a realistic analysis showing that
the two distributions remain sufficiently separated to be detectable.
We want to emphasize that, as we show, detectability is always possible (unless m = h
exactly), although in several scenarios where m and h are very close, detection may require a large
number of samples. Watermark-based techniques can help address this issue by inducing shifts in the
machine distribution. The additional insights from our work could help in designing better watermarks
that cannot be easily defeated by paraphrasing. More specifically, it is possible to create more powerful
and robust watermarks that introduce a minor change in the machine distribution, after which collecting
[Figure 9: two panels, (a) and (b), plotting zero-shot detection AUC against the sequence length
used for detection (30 to 500) on the XSum and Squad datasets.]

Figure 9: (a)-(b) validate our theorem on real human-machine classification datasets generated with
XSum and Squad, using zero-shot detection performance. We use the RoBERTa-Base-Detector (9a)
and RoBERTa-Large-Detector (9b) from OpenAI, which are trained or fine-tuned for binary
classification on datasets containing human- and AI-generated texts. We observe that as the number
of samples, or the sequence length used for detection, increases, the zero-shot detection performance
of both models improves drastically, from around 50% to 97%, on both the XSum and Squad
human-machine datasets.

more samples should help to perform AI-generated text detection.


While there are potential risks associated with detectors, such as misidentification and false
alarms, we believe the right approach is to strive for more powerful, robust, and fair detectors,
together with more resilient watermarking techniques. Addressing issues such as the representation
space, robust watermarks, and interpretability is crucial for the safe and trustworthy application of
generative language models and their detection. To that end, we are hopeful, based on our results,
that text detection is indeed possible in most settings and that such detectors could help mitigate the
misuse of LLMs and ensure their responsible use in society.


Figure 10: (a) shows the detection performance of vanilla classifiers/detectors on the XSum dataset
(randomly sampled) generated by GPT-2; (b) shows the corresponding performance on the Squad
dataset (randomly sampled) generated by GPT-2. This demonstrates that even for vanilla detectors,
our result holds on random subsets of the data.

