LLM-Detector: Improving AI-Generated Chinese Text Detection with

Open-Source LLM Instruction Tuning

Rongsheng Wang 1 Haoming Chen 1 Ruizhe Zhou 1 Han Ma 1 Yaofei Duan 1 Yanlan Kang 2 Songhua Yang 3
Baoyu Fan 1 Tao Tan 1

1 Macao Polytechnic University, 2 Fudan University, 3 Wuhan University. Correspondence to: Tao Tan <[email protected]>.

arXiv:2402.01158v1 [cs.CL] 2 Feb 2024

Abstract

ChatGPT and other general large language models (LLMs) have achieved remarkable success, but they have also raised concerns about the misuse of AI-generated texts. Existing AI-generated text detection models, such as those based on BERT and RoBERTa, are prone to in-domain over-fitting, leading to poor out-of-domain (OOD) detection performance. In this paper, we first collected Chinese text responses to questions from multiple domains, generated by human experts and 9 types of LLMs, and further created a dataset that mixes human-written sentences and sentences polished by LLMs. We then proposed LLM-Detector, a novel method for both document-level and sentence-level text detection through Instruction Tuning of LLMs. Our method leverages the wealth of knowledge LLMs acquire during pre-training, enabling them to detect the text they generate. Instruction tuning aligns the model's responses with the user's expected text detection tasks. Experimental results show that previous methods struggle with sentence-level AI-generated text detection and OOD detection. In contrast, our proposed method not only significantly outperforms baseline methods in both sentence-level and document-level text detection but also demonstrates strong generalization capabilities. Furthermore, since LLM-Detector is trained based on open-source LLMs, it is easy to customize for deployment.

1. Introduction

Large language models (LLMs), such as ChatGPT, represent a significant milestone in the field of natural language processing (NLP). LLMs have been pre-trained on extensive text corpora, enabling them to generate texts that are contextually relevant and fluent (Brown et al., 2020). However, the impressive capabilities of generative language models in text generation have also led to rising concerns about their possible misuse in various areas, including phishing, dissemination of misinformation, and academic dishonesty. Unfortunately, when classifying AI-generated and human-written texts, humans perform only slightly better than random guessing (Gehrmann et al., 2019). Therefore, our goal is to develop an automated system to classify AI-generated texts with the aim of mitigating their potential for misuse.

There has been some previous effort in detecting AI-generated texts. First, Guo et al. (2023b) fine-tuned RoBERTa to detect whether a certain text (English and Chinese) is generated by ChatGPT or written by a human. However, a limitation of such supervised models is their tendency to overfit in-domain data, resulting in poor out-of-domain (OOD) detection performance (Chakraborty et al., 2023). The second line of work is zero-shot classification: DetectGPT (Mitchell et al., 2023) works under the assumption that variations of AI-generated texts typically have lower model probability than the original, while variations of human-written texts could go either way. Because current zero-shot classifiers require input documents of considerable length (exceeding 100 tokens) to effectively capture contextual features of the text, their performance on short sentences is relatively poor. Kirchenbauer et al. (2023) demonstrated how to incorporate a watermark, using only the model's logits at each generation step, to mark AI-generated texts. While watermark-based detectors are an intriguing area of research, adding watermarks may affect the readability of the texts, and the removal of watermarks is also a challenge that needs to be addressed. Another noteworthy issue is that previous work has focused on distinguishing whether an entire document is generated by AI. However, users often use language models to modify portions of a text rather than fully trusting the model to generate an entire document. Therefore, it is also important to explore fine-grained (e.g., sentence-level) detection of AI-generated texts.

Before the era of LLMs, models needed to learn task-specific knowledge and the alignment between task inputs and desired outputs. This is why training on negative samples could sometimes be beneficial, as it provided the model with supplementary knowledge and boundaries for the task-specific information (Li et al., 2023b). In the era of LLMs, models no longer need to learn task-specific knowledge and the alignment between task inputs and desired outputs, as most of the required knowledge has already been learned during pre-training. Instruction tuning can facilitate the alignment between the model and the expected user task responses. We introduce LLM-Detector, a powerful method to address the challenges of text detection. Specifically, in document-level AI-generated text detection, we label the dataset and use it for Instruction Tuning of LLMs. In sentence-level AI-generated text detection, we label each sentence in the dataset and use it for Instruction Tuning. We also investigate the impact of instruction tuning on detection performance using text generated by a specific LLM, as well as the influence of different Chinese and English language models on detection performance. Experimental results show that existing methods like Fast-DetectGPT (Bao et al., 2023), MPU (TEXTS), and GLTR (Gehrmann et al., 2019) are not effective in sentence-level AI-generated text detection. Our proposed LLM-Detector achieves promising results in both sentence- and document-level AI-generated text detection challenges and exhibits excellent generalization on OOD datasets. Our contributions are summarized as follows:

• To promote research in the field of AI-generated Chinese text detection based on Instruction Tuning for LLMs, and particularly to delve into the discrepancies between humans and LLMs, we have compiled 151.7k responses to the same directive questions from human experts and multiple LLMs. These directive questions span various domains, including open domains, computer science, finance, medicine, law, psychology, journalism, etc. The dataset contains document-level and sentence-level text annotations and can be used to analyze the characteristic differences in language and style between humans and LLMs, holding significant value for guiding the future development of LLMs in Chinese text detection.

• We proposed LLM-Detector, a text detection model that can determine whether text is generated by humans or AI. The model significantly mitigates the limitations of previous text detection technology. Specifically, in in-domain detection, LLM-Detector reaches an accuracy of up to 98.52%, far surpassing statistical-based detectors (such as GLTR's 77.06% and Fast-DetectGPT's 59.55%) and supervised classifiers (such as RoBERTa's 89.93%). In OOD detection, its accuracy is 96.70%, while other detectors' performance decreases significantly. Our experiments indicate that using training data consistent with the target language is crucial for improving the model's performance on specific-language text detection tasks. Furthermore, since LLM-Detector is trained based on open-source LLMs, it is easy to customize for deployment.

• We conducted sentence-level text detection experiments using a dataset that mixes sentences generated by human experts and AI. The experimental results indicate that existing methods (such as Sent-RoBERTa and Sniffer) face difficulties in solving the problem of sentence-level AI-generated text detection. Our proposed LLM-Detector, however, achieved encouraging results in document-level and cross-domain text detection challenges and demonstrated outstanding generalization capabilities in sentence-level text detection.

2. Related Work

LLMs have been pre-trained on extensive text corpora, enabling them to generate contextually relevant and fluent texts. However, this also increases the difficulty of detecting AI-generated texts. Existing methods for detecting generated texts can be broadly categorized into two types, black-box and white-box detection (Tang et al., 2023), contingent upon the level of access to the model that is suspected to have generated the target texts.

2.1. Black-Box Detection

For black-box detection, classifiers are restricted to API-level access to LLMs (only the generated text is available). To develop a proficient detector, black-box methods are typically designed to first extract and select features from text samples originating from both human-written and AI-generated sources; a classification model is then trained on these features, which makes such methods heavily reliant on large amounts of text data.

Datasets. Recently, a growing body of research has concentrated on amassing responses generated by LLMs and comparing them to human-written texts spanning a wide range of domains. Guo et al. (2023b) collected the HC3 (Human ChatGPT Comparison Corpus) Chinese dataset, which consists of nearly 40K questions and their corresponding answers from human experts and ChatGPT, covering a wide range of domains (open-domain, computer science, finance, medicine, law, and psychology). Wang et al. (2023) collected the M4 (Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection) dataset, which consists of questions and their corresponding answers from human experts and LLMs, covering a wide range of languages (English, Chinese, Russian, Arabic, Indonesian and Urdu). Overall, previous work has not established a comprehensive Chinese text detection dataset that encompasses diverse Chinese text data from LLMs of different parameter scales alongside human expert responses.

Detectors. Existing black-box detectors can be grouped into two main categories: supervised classifiers and zero-shot classifiers. Logistic regression over GLTR features (Gehrmann et al., 2019) and an end-to-end RoBERTa classifier (Guo et al., 2023b) have been used to detect whether a certain text (English and Chinese) is generated by ChatGPT or by humans across several domains. However, a limitation of supervised models is the potential for overfitting within the training domain, resulting in poor OOD detection performance (Chakraborty et al., 2023). To address the limitations of supervised classifiers, zero-shot classifiers, which use a pre-trained language model directly without fine-tuning, are immune to domain-specific degradation. Zero-shot classifiers such as GPT-Zero (https://gptzero.me/), DetectGPT (Mitchell et al., 2023) and Fast-DetectGPT (Bao et al., 2023) have been developed. These methods rely on checks of perplexity and burstiness in the text to determine whether it is artificially generated or authored by a human. Current zero-shot classifiers require input documents of considerable length (exceeding 100 tokens) to effectively capture contextual features of the text; for classifying short sentences, their performance is relatively poor.

2.2. White-Box Detection

White-box detection requires full access to LLMs, thereby enabling control over the generation behavior of the model or the embedding of a watermark within the generated texts. This enables the tracking and detection of AI-generated texts in white-box settings.

White-box detection involves using statistical boundaries between linguistic patterns found in human-written and AI-generated text as proxies. These boundaries are determined based on n-gram frequencies (Badaskar et al., 2008), entropy (Lavergne et al., 2008), and perplexity (Beresneva, 2016). One limitation of these statistics-based methods is that they assume access to the model's prediction distributions. This constraint hinders broader applications, especially for models behind APIs.

Inspired by copyright-protection watermarks in the image and video fields, Kirchenbauer et al. (2023) partition the model's vocabulary into whitelist and blacklist tokens when predicting the next token given a prompt. During text generation, the goal is to produce whitelist tokens as much as possible, effectively creating a strong watermark. Third parties can then determine whether a text is machine-generated by analyzing the frequency of whitelist tokens within it. While watermarking methods offer robustness and interpretability, they can compromise the quality of the generated text and may not be highly practical in certain scenarios.

3. Methodology

3.1. Overview of LLM-Detector

The structure of our proposed Chinese text detection model, LLM-Detector, is shown in Figure 1. In the training stage, we constructed a response dataset based on HC3 seed questions, which consists of responses generated by human experts and multiple LLMs, including their source labels (AI or human) and more granular sentence-level annotations. These sentence-level annotations include mixed texts written by humans and polished by AI. Subsequently, we adapted a foundational LLM into LLM-Detector through instruction tuning, fine-tuning it on response samples from human experts and multiple LLMs to elicit the model's Chinese text detection capabilities. In the evaluation stage, we feed the corresponding instruction text into the LLM-Detector for detection, based on the joint responses generated from M4 seed questions by LLMs and human experts. The diversity of the LLM-Detector's Chinese text detection dataset provides better guidance for the foundational LLM in modeling the connection between user instructions and appropriate responses, thereby enhancing the text detection capabilities of the instruction-tuned LLM.

The remainder of this section is organized as follows: Section 3.2 describes the process of building Chinese text detection data from humans and multiple LLMs; Section 3.3 explains the design of LLM-Detector; Section 3.4 introduces the in-domain and OOD test sets we created; and Section 3.5 provides an overview of all datasets used in this work and compares the differences between human-written and AI-generated text.

3.2. Generating Detection Data with Different LLMs

3.2.1. Generation of Document-Level Data

HC3 (Guo et al., 2023a) is the first human-ChatGPT comparison corpus; it contains 12,853 questions from WebText Q&A, Baike Q&A, Medical Dialog, Chinese Corpus, Legal Q&A, etc.

Specifically, we further utilized the 12,853 sub-questions from HC3, allowing 9 different LLMs (including ChatGPT, GPT-4, etc.) to generate responses, which we labeled as "AI". All LLMs used are displayed in Figure 1. Finally, we combined the original human expert responses from HC3 with the newly generated responses to create a training dataset. Following the experimental design, we filter out texts with a length of less than 10 to form the final training set. An example of the generated document-level detection data can be found in Appendix A.1.

[Figure 1 (framework diagram, not reproduced in this text version) shows the seed-question sources (HC3 and M4), the response LLMs (ChatGPT, GPT-4, ChatGLM2-6B, XVERSE-13B, Qwen-14B, BlueLM-7B, Baichuan2-53B, ERNIE-Bot-3.5, Davinci003), and the Small, Medium, and Large LLM-Detector versions built on Qwen 1.8B, 7B, and 14B.]

Figure 1. LLM-Detector Framework. First, HC3 and M4 seed questions are used to prompt responses from human experts and multiple LLMs, where the responses to HC3's multi-domain seed questions will be employed to train the LLM-Detector. Second, the responses generated using M4 seed questions from the same domain as HC3 are utilized to test the in-domain capabilities of the LLM-Detector, while an additionally constructed News dataset is used to test the LLM-Detector's OOD capabilities.

3.2.2. Generation of Sentence-Level Data

To construct the sentence-level dataset, we sampled 5,589 human responses from the dataset provided by HC3 as the data source. We then select the longer responses from the data source and use regular expressions to split them into sentences. In addition, we randomly select several sentences, the number of which can lie in [1, number of sentences - 1], to ensure that the text contains at least one human sentence, and input them into a large language model for polishing. The specific process of generating sentence-level data is detailed in Appendix A.2.

3.3. Design of LLM-Detector

Specifically, given an instruction dataset V of instruction pairs x = (INSTRUCTION, OUTPUT) with x ∈ V, each instruction x is generated by either human experts or LLMs and is labeled as x_r according to its source. The text detection dataset R is ultimately formed, which includes instruction pairs with their corresponding source labels (Human or AI), represented as R = {(x, x_r) | x ∈ V}. During the coach instruction tuning process, each (x, x_r) ∈ R is leveraged to construct an instruction pair x_c, leading to an instruction dataset C = {x_c | x ∈ V}.

Table 1 illustrates how the INSTRUCTION of x_c guides the LLM to detect the text source of x (the original instruction pair), with the OUTPUT of x_c being x_r, the label for the text source. When constructing the INSTRUCTION component, we designed a concise and clear detection instruction that explicitly indicates that the LLM should learn the text detection task based on the text source label and the instruction text. To prevent the distraction that might arise from lengthy instructions, we deliberately did not create an exhaustive instruction covering all criteria, allowing the LLM to better focus on the relationship between the input instruction text pairs and their corresponding text source labels.

Table 1. Illustration of the format for the instruction pairs x_c in coach instruction tuning. x represents the instruction text, while x_r denotes the label for the text source.
Instruction: Categorize the texts into one of the 2 classes: human or AI. Input: [x]
Output: [x_r]

Given an LLM with parameters θ as the initial model for coach instruction tuning, training the model on the constructed instruction dataset C results in the adaptation of the LLM's parameters from θ to θ_c, denoted as LLM-Detector. Specifically, θ_c is obtained by maximizing the probability of predicting the next tokens in the OUTPUT component of x_c, conditioned on the INSTRUCTION of x_c ∈ C, which is formulated as follows:

    θ_c = arg max_θ Σ_{x_c ∈ C} log P(OUTPUT | INSTRUCTION; θ, x_c).    (1)

3.4. Test Sets Constructed from Human Experts and LLMs

To evaluate the detection ability of our method, we construct an in-domain test set and an OOD test set.

3.4.1. Generation of Document-Level In-Domain Data

M4 (Wang et al., 2023) is a large-scale benchmark that is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. The domains of M4 include Wikipedia, WikiHow, Reddit, arXiv, RuATD, and Baike. We sampled Chinese-language questions from M4 and generated responses using nine different LLMs (including ChatGPT, GPT-4, etc.). All LLMs used are displayed in Figure 1. Finally, we combined the original human expert responses from M4 with the newly generated responses to create a test dataset. As with the training set, we also filter out sentences with a length of less than 10 to form the final in-domain test set.

Because HC3 and M4 draw on the same public corpora, such as the large-scale Chinese corpus for NLP (Xu, 2019), we use M4 as the in-domain test set to evaluate the performance of our method.

3.4.2. Generation of Document-Level Out-of-Domain Data

News Broadcast Text (https://cn.govopendata.com/xinwenlianbo/) is the text version of News Broadcast, crawled from the public data of the CCTV network. In addition, we generated news in some fields, such as sports and science, through ChatGPT.

In this process, we use a temperature of 0.7, set top_p to 1, and adopt max_tokens of 4096; the other settings are the defaults. The prompt that we give to ChatGPT can be found in Appendix A.3.

3.5. Dataset Overview

To address the challenges mentioned above, we first constructed a multi-domain dataset for Chinese text detection, which includes responses from different LLMs and human experts. These responses were generated by 9 types of LLMs, including ChatGPT, GPT-4, etc. The data sources cover 7 main domains, including web Q&A, encyclopedias, Baidu-Baike, medical dialogues, etc., amounting to 151.7k data samples, each recorded with its generation source. Of these 151.7k samples, we used 118.1k for Instruction Tuning of LLMs and 29.7k to evaluate in-domain detection performance. The remaining samples come from news data in domains different from the previous training and evaluation datasets and are used to evaluate the model's OOD detection performance. In addition, we built a sentence-level AI-generated Chinese text detection dataset, which includes both human-written sentences and AI-generated sentences, as is more likely to appear in real AI-assisted writing. This dataset contains a total of 7.1k data samples, as shown in Appendix A.4, Appendix A.5, and Table 2. We also perform a linguistic and semantic analysis of our dataset, as presented in Appendix A.6.

Table 2. Size of the document-level training set and test sets.
                          Data Source    Total
Train Set                 Human          21,681
                          AI             96,453
In-Domain Test Set        Human          3,000
                          AI             26,750
Out-of-Domain Test Set    Human          2,000
                          AI             1,915
Total                     -              151,799

In the training set, human sentences account for around 18.4%, while AI sentences account for about 81.6%. The dataset is mainly based on sentences generated by AI because we primarily want the model to learn the features of AI sentences. However, the model also needs human sentences as a comparison to learn the different features of the two generation mechanisms, thereby improving its detection ability.

4. Experiments

4.1. Tasks

Previous work has mostly focused on document-level AI-generated text detection (Mitchell et al., 2023; Guo et al., 2023b), and many of these efforts are difficult to extend to sentence-level AI-generated text detection. Additionally, previous methods often exhibit limitations on OOD datasets (Chakraborty et al., 2023). To address this, we define a variety of tasks to test the performance of the LLM-Detector: in-domain text detection, OOD text detection, and sentence-level text detection. The specific descriptions of the three tasks can be found in Appendix B.1.

4.2. Baselines

For zero-shot classifiers, we mainly compare our proposed LLM-Detector with Fast-DetectGPT (Bao et al., 2023), GLTR (Gehrmann et al., 2019), PPL (Guo et al., 2023b), ChatGPT, and GPT-4. For supervised classifiers, we also fine-tuned BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), MPU (TEXTS), LLaMA-2 (Touvron et al., 2023), and Mistral (Jiang et al., 2023) on our own dataset and compared them with LLM-Detector. A detailed description of these previous methods can be found in Appendix B.2.

4.3. Experimental Settings

Computational resources and parameter settings. Our model is built upon Qwen (Bai et al., 2023), a Chinese large language model with parameter sizes of 1.8 billion, 7 billion, and 14 billion. Based on Qwen LLMs with these different parameter counts, we trained Small, Medium, and Large LLM-Detectors to distinguish the impact of model size on accuracy. The training process utilizes 4 A100 (80G) GPUs in parallel, incorporating quantized low-rank adaptation (QLoRA) (Dettmers et al., 2023). This methodology is implemented through the transformers and PEFT libraries. To manage training costs, we employ fp16 precision with ZeRO-2 (Rajbhandari et al., 2021) and a gradient accumulation strategy. Throughout training, the learning rate is 5e-5, the number of training epochs is 3.0, and the LoRA rank is 8. At the end of training, the best model was saved for evaluation. For the training of BERT, RoBERTa, and MPU, the learning rate is 1e-3 and the number of training epochs is 50.0.

Metrics. We use accuracy (ACC) to assess the performance of models in classifying text, distinguishing between human-written and AI-generated content. For sentence-level evaluation, we use precision (P.), recall (R.), and Macro-F1. Precision and recall respectively represent the "accuracy" and "coverage" of each category. The Macro-F1 score serves as an effective combination of these two indicators, providing a comprehensive measure of overall performance. A detailed description of these metrics can be found in Appendix B.3.

4.4. Main Results

The experimental results are shown in Table 3. Generally, after fine-tuning as LLM-Detector, models of various sizes achieve significantly better performance. This demonstrates the effectiveness and broad applicability of instruction tuning of large models for enhancing LLM text detection. In addition, we have the following observations.

Table 3. Experimental results of different detection models on the in-domain dataset. "finetuned" indicates models that have been trained on the same dataset. Bold text indicates the model with the best performance on the in-domain dataset.
Model                       Accuracy
Statistical-based Classifier
Fast-DetectGPT              59.55%
GLTR                        77.06%
PPL                         10.26%
Zero-Shot Classifier
ChatGPT                     81.46%
GPT-4                       37.41%
Supervised Classifier
BERT-finetuned              76.50%
RoBERTa-finetuned           89.93%
BERT-MPU                    75.95%
RoBERTa-MPU                 89.93%
LLaMA-2-7B-finetuned        83.65%
LLaMA-2-13B-finetuned       96.53%
Mistral-7B-finetuned        97.98%
LLM-Detector-Small          97.84%
LLM-Detector-Medium         98.35%
LLM-Detector-Large          98.52%

On the in-domain dataset, classifiers based on supervised learning typically outperform zero-shot and statistical-based classifiers. Furthermore, classifiers trained on large language models surpass those based on smaller-parameter models such as BERT and RoBERTa. This confirms the inherent advantage of large-scale parameters in model performance. In addition, there is a positive correlation between the scale of the model and its detection performance. For instance, the LLM-Detector-Large model outperforms the LLM-Detector-Medium and LLM-Detector-Small models in terms of accuracy, and the LLaMA-2-13B model has higher accuracy than the LLaMA-2-7B model.

When fine-tuning large language models for text detection tasks, Chinese LLMs trained on Chinese data significantly outperform those trained on English data. This indicates that using training data consistent with the target language is crucial for improving the performance of models on specific-language tasks. For instance, when fine-tuned from Chinese LLMs (such as Qwen), the resulting LLM-Detector typically achieves higher accuracy than its counterparts based on English LLMs (such as Mistral and LLaMA).

We also investigate the performance of the various detection models on out-of-distribution (OOD) data; the results are shown in Table 4. The experimental results indicate that among the statistical classifier models, Fast-DetectGPT achieves the highest accuracy, reaching 94.48%. Among the supervised classifier models, the LLaMA-2-13B-finetuned model has the highest accuracy, achieving 93.19%. Among all models, the LLM-Detector-Large model demonstrates the most impressive performance, with an accuracy of 96.70%. These results suggest that for the detection of OOD data, the LLM-Detector-Large model is a choice with high accuracy and effectiveness.

Table 4. Experimental results of different detection models on the OOD dataset. "finetuned" indicates models that have been trained on the same dataset. Bold text indicates the model with the best performance on the OOD dataset.
Model                       Accuracy
Statistical-based Classifier
Fast-DetectGPT              94.48%
GLTR                        78.60%
PPL                         51.09%
Supervised Classifier
BERT-finetuned              48.39%
RoBERTa-finetuned           48.95%
BERT-MPU                    23.07%
RoBERTa-MPU                 48.95%
LLaMA-2-7B-finetuned        87.05%
LLaMA-2-13B-finetuned       93.19%
Mistral-7B-finetuned        92.73%
LLM-Detector-Small          90.22%
LLM-Detector-Medium         93.57%
LLM-Detector-Large          96.70%

We implement Sniffer and Sent-RoBERTa for the sentence-level detection task, alongside our LLM-Detector. Sniffer (Li et al., 2023a) is a powerful model that can detect and trace the origins of AI-generated texts. To perform sentence-level detection, we train a sentence-level Sniffer following the structure and training process of the original Sniffer, but using a single sentence as input instead of an entire document. RoBERTa (Liu et al., 2019) is built on the Transformer encoder and can handle both sentence classification and sequence labeling tasks; we train a sentence-level RoBERTa for detection using a sentence-classification approach. The results in Table 5 clearly show that our LLM-Detector outperforms the other two methods, demonstrating its effectiveness. In contrast, Sent-RoBERTa's performance is noticeably inferior, highlighting the challenge of adapting document-level detection methods to sentence-level detection.

Table 5. Performance comparison of different detectors on sentence-level detection.
Model                P.       R.       Macro-F1
Sniffer              65.0%    64.24%   62.51%
Sent-RoBERTa         37.30%   42.15%   39.11%
LLM-Detector-Small   71.36%   72.62%   73.5%

4.5. Usability Analysis

Robustness on Short Texts. Zero-shot detectors, due to their statistical nature, are expected to perform worse on shorter texts, and supervised learning detectors face the same issue. We sampled texts with lengths ranging from 10 to 50 from the in-domain dataset to evaluate different detectors. As shown in Table 6, the LLM-Detector and text detectors trained from LLMs (such as Mistral and LLaMA) do not exhibit a significant decrease in accuracy with shorter text lengths; that is, their detection accuracy does not typically decrease due to the brevity of the text. In contrast, supervised detectors see a substantial decrease in detection accuracy on short texts. We speculate that this is because supervised detectors are unable to effectively capture the characteristics of human- and AI-generated texts when there is insufficient context. Statistical-based detectors also experience a decrease in accuracy when detecting short texts.

Table 6. Accuracy of different detectors on short texts. "finetuned" indicates models that have been trained on the same dataset. Bold text indicates the model with the best performance on short texts.
Model                       Accuracy
Statistical-based Classifier
Fast-DetectGPT              72.78%
GLTR                        65.36%
PPL                         57.03%
Zero-Shot Classifier
ChatGPT                     67.26%
GPT-4                       57.16%
Supervised Classifier
BERT-finetuned              42.02%
RoBERTa-finetuned           43.25%
BERT-MPU                    37.88%
RoBERTa-MPU                 43.25%
LLaMA-2-7B-finetuned        96.38%
LLaMA-2-13B-finetuned       97.94%
Mistral-7B-finetuned        98.87%
LLM-Detector-Small          97.80%
LLM-Detector-Medium         98.80%
LLM-Detector-Large          99.20%

We further investigated the impact of text length on Fast-DetectGPT and RoBERTa, as shown in Appendix B.4. We continued to sample texts of lengths 100, 150, and 200 from the in-domain dataset for detection. As the text length increased, both Fast-DetectGPT (a statistical detector) and RoBERTa (a supervised detector) saw improvements in accuracy. After the text length exceeded 100 characters, the accuracy of Fast-DetectGPT rapidly rose to 94.3%. When the text length exceeded 200 characters, the accuracy of RoBERTa rapidly increased to 83.8%.

Robustness in Mixed Text Detection. In further research, we explored the impact of mixed text on the performance of LLM-Detector; the results are shown in Figure 2. We found that when the proportion of mixed text reaches 50%-60%, the detection accuracy drops sharply. This is because, as the proportion of mixed text increases, the characteristics of the original text may be weakened. For LLM-Detector, this may mean that the signals used to judge the authenticity of the text become weaker while noise increases, leading to a decline in model performance. Overall, LLM-Detector exhibits a certain robustness in mixed text detection and is able to resist the influence of mixed text to a certain extent.

[Figure 2 (bar chart, not reproduced in this text version): detection accuracy (%) per bin of the proportion of content mixed by AI (%), ranging from 3.9% at 0-10% AI content to 97.6% at 90-100%.]

Figure 2. The accuracy performance increases with the proportion of AI-generated content.

Are instruction-tuned LLMs better at detecting text they themselves have generated? To evaluate the detection performance of different LLMs on content they have generated themselves, we fine-tuned the LLMs on responses generated by three different LLMs and on human-written texts. The results are shown in Appendix B.5. A notable trend is that LLMs tend to perform best at detecting texts they have generated. For instance, the ChatGLM2-6B model achieves the highest accuracy (99.91%) on the dataset it generated, which is significantly higher than any other model tested on the same dataset. Similarly, the Qwen-14B model also has a high accuracy of 96.18% on its generated dataset. However, an interesting anomaly arises with the BlueLM-7B model: the Qwen-7B model outperforms BlueLM-7B on BlueLM-7B's own dataset, with an accuracy of 97.8% compared to 97.1%. While this could suggest a potential issue with the BlueLM-7B model's training, it is also worth noting that the difference is very small (only 0.7%), which could fall within the margin of error.

The impact of text generated by LLMs of different scales on the accuracy of text detection. We used the LLM-Detector to perform text detection on texts generated by LLMs of different parameter sizes. We found that texts produced by LLMs of varying scales had no significant impact on the accuracy of text detection by LLM-Detector, indicating that detectors trained on LLMs demonstrate better robustness and generalization, as shown in Appendix B.6. Specifically, the three differently sized detectors (Small, Medium, and Large) showed only a small range of fluctuation in detection accuracy for texts generated by LLMs of different scales, with the gap between the highest and lowest accuracy not exceeding 5%.

5. Conclusion

In this paper, we have designed a simple yet effective method to detect text generated by AI. Our proposed method is based on the intuition that LLMs have learned a wealth of knowledge during pre-training, which enables them to autonomously detect the text they generate. Instruction tuning can facilitate the alignment between the model and the user's expected text detection task responses. Compared to previous methods, our method can accomplish AI-generated text detection at both the document and sentence levels and maintains good performance on OOD data. Therefore, our method possesses superior generalization ability and practicality. We conducted experiments on the three proposed datasets, which cover responses generated by different LLMs, include in-domain and OOD data, and provide more fine-grained sentence-level annotations. The experimental results demonstrate that our method can effectively identify texts generated by LLMs. Moreover, our method shows strong robustness against biases in mixed AI texts, short texts, and OOD texts.

Impact Statements

The proposed AI text detection method offers advancements in content moderation and information security by identifying AI-generated text, helping to ensure authenticity and reliability. However, it faces challenges such as potential false positives and negatives, and biases inherent in pre-trained LLMs. As AI models evolve, there is a need for ongoing research to enhance the method's robustness and address these limitations, ensuring its effectiveness and ethical application in various domains.

References

Badaskar, S., Agarwal, S., and Arora, S. Identifying real or fake articles: Towards better language modeling. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II, 2008.

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

Bao, G., Zhao, Y., Teng, Z., Yang, L., and Zhang, Y. Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. arXiv preprint arXiv:2310.05130, 2023.

Beresneva, D. Computer-generated text detection using machine learning: A systematic review. In Natural Language Processing and Information Systems: 21st International Conference on Applications of Natural Language to Information Systems, NLDB 2016, Salford, UK, June 22-24, 2016, Proceedings 21, pp. 421-426. Springer, 2016.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Chakraborty, S., Bedi, A., Zhu, S., An, B., Manocha, D., and Huang, F. On the possibilities of AI-generated text detection. arXiv preprint arXiv:2304.04736, 2023.

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Gehrmann, S., Strobelt, H., and Rush, A. M. GLTR: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043, 2019.

Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., and Wu, Y. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597, 2023a.

Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., and Wu, Y. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597, 2023b.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., and Goldstein, T. A watermark for large language models. arXiv preprint arXiv:2301.10226, 2023.

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207-1216, Stanford, CA, 2000. Morgan Kaufmann.

Lavergne, T., Urvoy, T., and Yvon, F. Detecting fake content with relative entropy scoring. PAN, 8(27-31):4, 2008.

Li, L., Wang, P., Ren, K., Sun, T., and Qiu, X. Origin tracing and detecting of LLMs. arXiv preprint arXiv:2304.14072, 2023a.

Li, M., Zhang, Y., Li, Z., Chen, J., Chen, L., Cheng, N., Wang, J., Zhou, T., and Xiao, J. From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. arXiv preprint arXiv:2308.12032, 2023b.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., and Finn, C. DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv preprint arXiv:2301.11305, 2023.

Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., and He, Y. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14, 2021.

Tang, R., Chuang, Y.-N., and Hu, X. The science of detecting LLM-generated texts. arXiv preprint arXiv:2303.07205, 2023.

TEXTS, A.-G. Multiscale positive-unlabeled detection of AI-generated texts.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Wang, Y., Mansurov, J., Ivanov, P., Su, J., Shelmanov, A., Tsvigun, A., Whitehouse, C., Afzal, O. M., Mahmoud, T., Aji, A. F., et al. M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection. arXiv preprint arXiv:2305.14902, 2023.

Xu, B. NLP Chinese corpus: Large scale Chinese corpus for NLP, September 2019. URL https://doi.org/10.5281/zenodo.3402023.

A. Details of Dataset
A.1. Example of document-level data organization
Instruction is the response generated by human experts or different LLMs. Output is the label of the data source as shown in
Table 7.

Table 7. The example of the document-level data.

Instruction:
Categorize the texts into one of the two classes: human or AI.
Input:
单间80多,如果住的天数多70多。里面有一个单独的卫生间,是隔出来的那种。其他的不是很清楚。另
外德强那边有一个新开的巧克力时钟旅馆,听干净的,价格和相约谷差不多。你可以去看看。便宜点的
还有恒久附近的,50多,但是没有单独的卫生间。都有电脑可以上网的。学校附近的小宾馆旅社很多学
校里面也有非常多基本60-80左右
(Translation: The single rooms are over 80, if you stay for more days it’s around 70. There’s a separate toilet inside,
the kind that’s partitioned off. I’m not very clear about the others. By the way, there’s a newly opened Chocolate
Clock Hotel over at Deqiang’s place, it’s said to be clean, and the price is about the same as Xiangyue Valley. You
can go take a look. There are also cheaper ones near Hengjiu, over 50, but they don’t have a separate toilet. All of
them have computers with internet access. There are many small hotels and guesthouses near the school, and there
are also many inside the school, with prices ranging from about 60-80.)
Output: Human
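For concreteness, the following is a minimal sketch (not the authors' released code) of how a record in this format could be rendered into an instruction-tuning sample whose loss is restricted to the Output tokens, as in Eq. (1) of Section 3.3. The tokenizer name and the helper function are illustrative assumptions.

# Minimal sketch: turn one document-level record into an instruction-tuning sample
# and mask the loss to the Output tokens (Eq. 1). Names here are illustrative.
from transformers import AutoTokenizer

INSTRUCTION = "Categorize the texts into one of the 2 classes: human or AI."

def build_sample(tokenizer, text, label):
    # Prompt follows the Table 1 / Table 7 format: Instruction + Input, then Output.
    prompt = f"Instruction: {INSTRUCTION}\nInput: {text}\nOutput: "
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    output_ids = tokenizer(label, add_special_tokens=False)["input_ids"]
    eos = [tokenizer.eos_token_id] if tokenizer.eos_token_id is not None else []
    input_ids = prompt_ids + output_ids + eos
    # Only the Output tokens contribute to the loss; the prompt is masked with -100.
    labels = [-100] * len(prompt_ids) + output_ids + eos
    return {"input_ids": input_ids, "labels": labels}

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)  # illustrative base model
sample = build_sample(tokenizer, "单间80多,如果住的天数多70多。...", "Human")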

A.2. Example of sentence-level data organization


The data source is the set of responses written by human experts. For example, if a piece of data contains 7 sentences, we randomly select [L1, L3, L4, L7] as the sentences to be polished and hand them over to ChatGPT (3.5-turbo) for polishing. To construct the AI-generated data, we use ChatGPT to rewrite the selected sentences, and we design a prompt to make the model understand the task and improve the output quality. The polishing prompt that we provide to ChatGPT is shown in Table 8.

Table 8. The prompt for using ChatGPT to polish sentence-level data.

Prompt
请润色下述内容,不要做任何解释,直接输出润色结果:
(Translation: Please polish the following content without any explanation, and output the polishing results directly:)

After obtaining the polished sentences [P1, P3, P4, P7], we splice them back together with the untouched ones to form paragraphs [P1, L2, P3, P4, L5, L6, P7] that blend AI and human text. Using the same method, we sampled 1,504 samples from M4 for sentence-level data construction. Ultimately, we used the HC3 sentence-level data as the training set and the M4 data as the test set. An example is shown in Table 9. At the sentence level, the size of the training set is 5,589 and the size of the test set is 1,504.
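The construction described above can be summarized with the following illustrative sketch; the regular expression and the polish_with_chatgpt() helper are assumptions standing in for the actual splitting rules and the ChatGPT call with the Table 8 prompt.

# Illustrative sketch of the sentence-level data construction (not the authors' code).
import random
import re

def split_sentences(text):
    # Break on Chinese sentence-ending punctuation, keeping the delimiter.
    parts = re.split(r"(?<=[。!?;])", text)
    return [p for p in parts if p.strip()]

def build_mixed_sample(text, polish_with_chatgpt):
    sents = split_sentences(text)
    n = len(sents)
    if n < 2:
        return None  # need at least one human and one polished sentence
    k = random.randint(1, n - 1)                  # number of sentences to polish, in [1, n-1]
    polished_idx = set(random.sample(range(n), k))
    tagged = []
    for i, sent in enumerate(sents):
        if i in polished_idx:
            tagged.append(f"<AI>{polish_with_chatgpt(sent)}</AI>")   # AI-polished sentence
        else:
            tagged.append(f"<HUMAN>{sent}</HUMAN>")                  # original human sentence
    # The untagged concatenation serves as the model Input; the tagged string is the Output.
    return "".join(tagged)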

A.3. The prompt for constructing the OOD dataset


The prompt that we give to ChatGPT as shown in Table 10.

A.4. Details of the document-level dataset


There are three datasets at the document level. First, we collect data from HC3 and construct the training set, consisting of 21,681 human-written and 96,453 AI-generated samples. Second, we collect data from M4 and construct the in-domain test set, consisting of 3,000 human-written and 26,750 AI-generated samples. Finally, we collect data from News Broadcast and construct the out-of-domain test set, consisting of 2,000 human-written and 1,915 AI-generated samples.


Table 9. The example of the sentence-level data.

Instruction:
Categorize the texts into one of the 2 classes: human or AI.
Input:
”单间80多,如果住的天数多70多。里面有一个单独的卫生间,是隔出来的那种。其他的不是很清楚。另
外德强那边有一个新开的巧克力时钟旅馆,听干净的,价格和相约谷差不多。你可以去看看。便宜点的
还有恒久附近的,50多,但是没有单独的卫生间。都有电脑可以上网的。学校附近的小宾馆旅社很多学
校里面也有非常多基本60-80左右
(Translation: The single rooms are over 80, if you stay for more days it’s around 70. There’s a separate toilet inside,
the kind that’s partitioned off. I’m not very clear about the others. By the way, there’s a newly opened Chocolate
Clock Hotel over at Deqiang’s place, it’s said to be clean, and the price is about the same as Xiangyue Valley. You
can go take a look. There are also cheaper ones near Hengjiu, over 50, but they don’t have a separate toilet. All of
them have computers with internet access. There are many small hotels and guesthouses near the school, and there
are also many inside the school, with prices ranging from about 60-80.)”
Output: <HUMAN>单间80多,</HUMAN><HUMAN>如果住的天数多70多。</HUMAN><AI>里面有一个独立的卫生间,</AI><AI>是隔出来的那种。</AI><AI>其他方面有些不太清楚。</AI><AI>另外,</AI><AI>在德强那里有一家新开的巧克力时光旅馆,</AI><AI>听说环境很干净,</AI><AI>价格和相约谷差不多。</AI><AI>你可以去看看。</AI><AI>在恒久附近有更便宜的房间,</AI><AI>大约50多元,</AI><AI>但是没有独立的卫生间。</AI><HUMAN>都有电脑可以上网的。</HUMAN><HUMAN>学校附近的小宾馆旅社很多学校里面也有非常多基本60-80左右</HUMAN>
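For sentence-level evaluation, per-sentence predictions have to be recovered from outputs in this tagged format. The paper does not describe its parsing step, so the following is only one plausible way to do it.

# Illustrative parser (our assumption, not described in the paper) that recovers
# per-sentence labels from a tagged output string in the Table 9 format.
import re

TAG_RE = re.compile(r"<(HUMAN|AI)>(.*?)</\1>", re.DOTALL)

def parse_tagged_output(text):
    """Return a list of (label, sentence) pairs, e.g. [('AI', '...'), ('HUMAN', '...')]."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(text)]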

Table 10. The prompt for constructing the OOD dataset.

Prompt
你是一个新闻编辑,用户给你一个新闻类别,请在用户指定类别要求下写一个100字到700字的新闻稿。
新闻类别包括:政治新闻 经济新闻 社会新闻 科技新闻 文化艺术新闻 娱乐新闻 环境新闻...
(Translation: You are a news editor, and the user provides you with a news category. Write a news article of 100 to
700 words based on the specified category. The news classes include Political News, Economic News, Social News,
Technology News, Cultural and Arts News, Entertainment News, and Environmental News, ...)
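Under the generation settings reported in Section 3.4.2 (temperature 0.7, top_p 1, max_tokens 4096), the news generation call could look like the following sketch; the openai client usage and model name are our assumptions rather than the authors' exact setup.

# Sketch of the OOD news generation call; parameters follow Section 3.4.2.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "你是一个新闻编辑,用户给你一个新闻类别,请在用户指定类别要求下写一个100字到700字的新闻稿。"

def generate_news(category: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": category},  # e.g. "科技新闻" (Technology News)
        ],
        temperature=0.7,
        top_p=1,
        max_tokens=4096,
    )
    return response.choices[0].message.content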

The details of the training set, the in-domain test set, and the out-of-domain test set are summarized in Table 11.

A.5. Dataset Source Parse Analysis


To explore and analyze our training set, in-domain test set, and out-of-domain test set, we perform a dataset source analysis. Each of the three datasets can be grouped into Human and AI responses, each group can be further split by its data source, and the counts of each part can be plotted, as shown in Figure 3.

A.6. Detailed analysis of the dataset


In Figures 4 and 5, we observe the distribution of part-of-speech tags in both the training and test datasets, comparing human-written texts to those generated by AI. These figures highlight that while there is a general alignment in the linguistic structure of both sources, notable distinctions emerge in specific categories. In the training set (Figure 4), human texts exhibit a marginally higher usage of verbs and nouns, whereas AI-generated texts have a slightly increased use of pronouns and adverbs. This trend is also evident in the test set (Figure 5), particularly with a marked increase in adverbs in AI texts, indicating a potential linguistic preference of the AI. Both figures corroborate that conjunctions and modal verbs are used with comparable frequency by humans and AI, suggesting a shared understanding of sentence construction and modality. The data encapsulated in these figures imply that while AI can closely emulate human part-of-speech patterns, distinct differences in usage can be pivotal in differentiating between human and AI-generated content.


Table 11. Document-level training and test sets with different data sizes from various sources.
Set                      Label   Data Source      Count     Subtotal
Train Set                Human   Human            21,681    21,681
Train Set                AI      ChatGPT          17,376
                                 ChatGLM2-6B      12,850
                                 XVERSE-13B       12,833
                                 Qwen-14B         12,823
                                 GPT-4            12,796
                                 BlueLM-7B        12,702
                                 Baichuan2-53B    12,659
                                 ERNIE-Bot-3.5     2,414    96,453
In-Domain Test Set       Human   Human             3,000     3,000
In-Domain Test Set       AI      ChatGPT           3,000
                                 ChatGLM2-6B       3,000
                                 XVERSE-13B        2,998
                                 Qwen-14B          2,997
                                 GPT-4             2,987
                                 BlueLM-7B         2,980
                                 Davinci003        2,975
                                 ERNIE-Bot-3.5     2,972
                                 Baichuan2-53B     2,841    26,750
Out-of-Domain Test Set   Human   News              2,000     2,000
Out-of-Domain Test Set   AI      ChatGPT           1,915     1,915
Total                                                      151,799

Figure 3. Dataset Source Parse Analysis. The source of the response data in the training set, in-domain test set, and OOD test set.

In Figure 6, the sentiment distribution across the training and test datasets for human-written and AI-generated texts is depicted. The bar charts compare the proportion of neutral, positive, and negative sentiments expressed in both datasets, with orange bars representing human-produced content and purple bars representing AI-generated material. In the training set, a substantial majority of AI-generated texts are classified as neutral (86%), while human texts show a lower neutral proportion (61%). Conversely, human texts exhibit a significantly higher inclination towards negative sentiments (34%) compared to AI (11%), with positive sentiments being minimal in both but slightly higher in AI (5% compared to 3% in human texts). A similar pattern is observable in the test set, where AI texts are predominantly neutral (83%), but human texts are less so (65%). The negative sentiment in human texts (25%) is more than double that in AI texts (12%), with positive sentiment remaining low for both.

Figure 4. Part-of-Speech Comparison on Train Set.

Figure 5. Part-of-Speech Comparison on Test Set.

These charts suggest that AI-generated texts tend toward neutral sentiment, while human authors express a broader emotional range, particularly negative sentiments. This pattern across both the training and test sets highlights a key difference in emotional tone between human and AI writing.
Our training set includes 119,475 neutral sentences which occupy 80.8% of the dataset, 22,658 negative sentences which
occupy 15.3% of the dataset, and 5,751 positive sentences which occupy 3.9% of the dataset.
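The part-of-speech tags reported in Figures 4 and 5 (n, v, r, uj, eng, etc.) match the jieba tag set; the paper does not name its analysis tooling, so the following sketch of how such a distribution could be computed is an assumption on our part.

# Illustrative sketch: part-of-speech distribution over a set of texts,
# assuming jieba for segmentation and tagging (tooling is our assumption).
from collections import Counter
import jieba.posseg as pseg

def pos_distribution(texts):
    counts = Counter(flag for text in texts for _, flag in pseg.cut(text))
    total = sum(counts.values())
    return {flag: counts[flag] / total for flag in counts}

human_dist = pos_distribution(["单间80多,如果住的天数多70多。"])  # toy example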

B. Details of the Experiment


B.1. Details of Tasks
In-domain Detection. In-domain detection refers to the ability of the AI-generated text detector to accurately identify AI-
generated text within the same domain or topic on which it was trained. To further evaluate the AI-generated text detector’s
performance in in-domain detection, we also experimented with different text lengths and evaluated its performance on
fine-tuned LLMs with varying parameters. This allowed us to gain a better understanding of the AI-generated text detector’s
robustness and generalizability in in-domain text detection tasks.
Out-of-domain Detection. OOD detection refers to the ability of the AI-generated text detector to accurately identify
AI-generated text from domains or topics different from those on which it was trained.


Figure 6. Sentiment Distribution: (Left) Distribution in the Training Set, (Right) Distribution in the Test Set.

Sentence-level Detection. Sentence-level detection refers to the ability of an AI-generated text detector to accurately
recognize AI-generated text at the sentence level (rather than at the document level). This is a more challenging task because
the context provided by the entire document is not available, and the AI-generated text detector must rely solely on the
content of a single sentence.

B.2. Details of Baselines


To fully test the effectiveness of our proposed method, we compared it with AI-generated text detection methods based on
statistical information and supervised learning methods. Additionally, we selected two advanced LLMs that have shown
excellent performance in English.

• Fast-DetectGPT (Bao et al., 2023) is a method for zero-shot detection of AI-generated text. This method uses
conditional probability curvature as an indicator and detects whether text is machine-generated by sampling and
evaluating the differences in word selection probabilities. Compared to DetectGPT (Mitchell et al., 2023), this method
has increased detection speed by two orders of magnitude, while accuracy has improved by approximately 75%.

• GLTR (Gehrmann et al., 2019) studied three types of features of an input text. Its major assumption is that, to generate fluent and natural-looking text, most decoding strategies sample high-probability tokens from the head of the distribution.

• Perplexity (PPL) (Guo et al., 2023b) is a metric for evaluating language models. It is the exponential of the negative average log-likelihood of a given text under the language model. A lower PPL indicates that the language model is more confident in its predictions of the text. We use GPT-2 to compute the PPL of human-written and AI-generated content to distinguish who produced the text; a minimal sketch of this baseline is given after this list.

• MPU (TEXTS) proposes a multi-scale positive-unlabeled (PU) method for AI-generated text detection, which models the task as a partially positive-unlabeled problem, utilizes a length-based multi-scale PU loss, and introduces a text multi-scaling module. MPU significantly improves detection performance on short texts while also enhancing long-text detection. It has been implemented on top of two backbones, BERT and RoBERTa, referred to as BERT-MPU and RoBERTa-MPU, respectively.

• LLaMA-2 (Touvron et al., 2023) is a language model trained on approximately 2T tokens. It has demonstrated exceptional performance across multiple benchmark tests and has been widely used in LLM research. We adopt LLaMA-2-7B and LLaMA-2-13B as base models for instruction tuning.

• Mistral-7B (Jiang et al., 2023) is a language model designed for superior performance and efficiency. It employs
mechanisms such as grouped-query attention and sliding window attention to surpass other language models on various
benchmarks.
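
As a reference for the PPL baseline above, the following is a minimal sketch of scoring a text with GPT-2 perplexity and turning it into a detection decision. The checkpoint name and the decision threshold are illustrative assumptions (a Chinese GPT-2 checkpoint would be needed for Chinese text), not our exact configuration.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

MODEL_NAME = "gpt2"  # assumed checkpoint; replace with a Chinese GPT-2 for Chinese text
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_NAME)
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME).eval()

def perplexity(text: str) -> float:
    # PPL = exp(-(1/N) * sum_i log p(x_i | x_<i))
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean negative
        # log-likelihood over the sequence as `loss`.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def classify_by_ppl(text: str, threshold: float = 40.0) -> str:
    # Hypothetical threshold; lower perplexity suggests AI-generated text.
    return "AI" if perplexity(text) < threshold else "Human"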


In addition to the comparison methods mentioned above, we trained BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) text classification models on the same data for text detection. We also conducted zero-shot text detection with ChatGPT and GPT-4.
For zero-shot classification with ChatGPT and GPT-4, we run three predictions and take the average result. We adopted a sampling setup similar to that proposed by Holtzman et al. (2019) for open-ended text generation. Specifically, we used temperature sampling with a temperature of 0.7, top_p of 1.0, and max_tokens of 2048, while keeping other settings at their defaults. The prompt for ChatGPT and GPT-4 is as follows.
Determine whether this passage is generated by AI or written by human. Do not respond with anything other than AI and Human. You are only allowed to answer AI or Human.
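
For reference, the following is a minimal sketch of issuing this zero-shot query through the OpenAI Python client with the sampling settings described above; the client version (openai>=1.x), model identifier, and helper name are assumptions rather than our exact evaluation code.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Determine whether this passage is generated by AI or written by human. "
    "Do not respond with anything other than AI and Human. "
    "You are only allowed to answer AI or Human."
)

def zero_shot_detect(passage: str, model: str = "gpt-4") -> str:
    # One prediction; in our setup three predictions are made and averaged.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": passage},
        ],
        temperature=0.7,
        top_p=1.0,
        max_tokens=2048,
    )
    return response.choices[0].message.content.strip()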

B.3. Details of Metrics


Commonly used concepts in evaluation metrics are expressed as follows:

• True Positive (TP): the number of positive samples predicted as positive.
• True Negative (TN): the number of negative samples predicted as negative.
• False Positive (FP): the number of negative samples predicted as positive, i.e., the number of false detections.
• False Negative (FN): the number of positive samples predicted as negative, i.e., the number of missed detections.

Precision is defined as
\[
\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}
\]

Recall is defined as
\[
\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}
\]

Macro-F1 is defined as
\[
\text{Macro-F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \times P_i \times R_i}{P_i + R_i} \tag{4}
\]

$N$ represents the number of classes (in our task, there are two classes: Human and AI). $P_i$ and $R_i$ are the precision and recall for the $i$-th class, respectively. $TP_i$, $FP_i$, and $FN_i$ are the numbers of true positives, false positives, and false negatives for the $i$-th class, respectively. The Macro-F1 score is calculated by taking the sum of the F1 scores for each class and dividing by the total number of classes.
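
For reference, these quantities can be computed directly from detector predictions; the following minimal sketch uses scikit-learn, and the example labels are illustrative assumptions.

from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labels; in practice y_true and y_pred come from the detector.
y_true = ["Human", "AI", "AI", "Human", "AI", "Human"]
y_pred = ["Human", "AI", "Human", "Human", "AI", "AI"]
labels = ["Human", "AI"]

# Per-class precision and recall (Equations 2 and 3).
per_class_precision = precision_score(y_true, y_pred, labels=labels, average=None)
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

# Macro-F1 (Equation 4): the unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")
print(per_class_precision, per_class_recall, macro_f1)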

B.4. The impact of text length on Fast-DetectGPT and RoBERTa


We further investigated the impact of text length on Fast-DetectGPT and RoBERTa, as shown in Figure 7. We continued to
sample texts of lengths 100, 150, and 200 from the in-domain dataset for detection. As the text length increased gradually,
both Fast-DetectGPT (a statistical detector) and RoBERTa (a supervised detector) saw improvements in accuracy. After the
text length exceeded 100 characters, the accuracy of Fast-DetectGPT rapidly rose to 94.3%. When the text length exceeded
200 characters, the accuracy of RoBERTa rapidly increased to 83.8%.
To evaluate the detection performance of different LLMs on content they have generated themselves, we fine-tuned the LLMs on responses generated by three different LLMs together with human-written texts. The results are shown in Table 12. A notable trend is that LLMs tend to perform best at detecting texts they themselves generated. For instance, the ChatGLM2-6B model achieves the highest accuracy (99.91%) on the dataset it generated, which is significantly higher than any other model tested on the same dataset. Similarly, the Qwen-14B model also has a high accuracy of 96.18% on its own generated dataset. However, an interesting anomaly arises with the BlueLM-7B model: the Qwen-7B model outperforms BlueLM-7B on the dataset generated by BlueLM-7B, with an accuracy of 97.8% compared to 97.1%. While this could


[Line chart omitted: detection accuracy (%) of Fast-DetectGPT and RoBERTa for text lengths from 25 to 200 characters; accuracy rises with text length for both detectors.]
Figure 7. As the length of the text increases, the accuracy performance of Fast-DetectGPT and RoBERTa.

suggest a potential issue with the BlueLM-7B model’s training, it is also worth noting that the difference is very small (only
0.7%), which could fall within the margin of error.

B.5. Are instruction-tuned LLMs better at detecting text they themselves have generated?

Table 12. Performance comparison of different LLMs based on different dataset sources. Bold text and a blue background indicate the model with the best performance.
Data Generation Source | Model          | Accuracy
ChatGLM2-6B            | ChatGLM2-6B    | 99.91%
ChatGLM2-6B            | XVERSE-7B      | 94.60%
ChatGLM2-6B            | Baichuan2-7B   | 98.87%
ChatGLM2-6B            | Qwen-7B        | 96.63%
ChatGLM2-6B            | Mistral-7B     | 96.87%
ChatGLM2-6B            | LLaMA-2-7B     | 97.14%
Qwen-14B               | Qwen-14B       | 96.18%
Qwen-14B               | Baichuan2-13B  | 95.92%
Qwen-14B               | LLaMA-2-13B    | 94.43%
Qwen-14B               | XVERSE-13B     | 91.19%
BlueLM-7B              | BlueLM-7B      | 97.10%
BlueLM-7B              | XVERSE-7B      | 92.51%
BlueLM-7B              | Baichuan2-7B   | 95.09%
BlueLM-7B              | LLaMA-2-7B     | 96.63%
BlueLM-7B              | Mistral-7B     | 94.44%
BlueLM-7B              | Qwen-7B        | 97.80%

B.6. The impact of text generated by LLMs of different scales on the accuracy of text detection
We used the LLM-Detector to perform text detection on texts generated by LLMs of different parameter sizes. We found that texts produced by LLMs of varying scales had no significant impact on the detection accuracy of LLM-Detector, indicating that detectors trained from LLMs demonstrate better robustness and generalization, as shown in Table 13. Specifically, the three differently sized detectors (Small, Medium, and Large) showed only a small range of fluctuation in detection accuracy for texts generated by LLMs of different scales, with the gap between the highest and lowest accuracy not exceeding 5%.

Table 13. Detection accuracy of LLM-Detector models of different sizes on texts generated by LLMs of different parameter scales. The darker the color, the better the performance.
Gradually increasing model size range →
Model               | ChatGLM2-6B | BlueLM-7B | XVERSE-13B | Qwen-14B | Baichuan2-53B | ERNIE-Bot | ChatGPT | GPT-4
LLM-Detector-Small  | 98.48%      | 98.85%    | 97.54%     | 95.60%   | 97.20%        | 96.79%    | 96.51%  | 97.82%
LLM-Detector-Medium | 95.41%      | 95.89%    | 95.84%     | 99.33%   | 96.83%        | 96.49%    | 95.81%  | 94.48%
LLM-Detector-Large  | 93.28%      | 98.11%    | 99.44%     | 98.15%   | 96.99%        | 99.23%    | 99.64%  | 95.25%

