LLM-Detector: Improving AI-Generated Chinese Text Detection With Open-Source LLM Instruction Tuning
Rongsheng Wang 1 Haoming Chen 1 Ruizhe Zhou 1 Han Ma 1 Yaofei Duan 1 Yanlan Kang 2 Songhua Yang 3
Baoyu Fan 1 Tao Tan 1
Abstract

[...] text corpora, enabling them to generate texts that are contextually relevant and fluent. [...]

1. Introduction

[...] specific knowledge and the alignment between task inputs and desired outputs. This is why training on negative samples could sometimes be beneficial: it provided the model with supplementary knowledge and boundaries for the task-specific information (Li et al., 2023b). In the era of LLMs, models no longer need to learn task-specific knowledge and the alignment between task inputs and desired outputs, as most of the required knowledge has already been learned during pre-training. Instruction tuning can facilitate the alignment between the model and the expected user task responses.

We introduce LLM-Detector, a powerful method to address the challenges of text detection. Specifically, for document-level AI-generated text detection, we label the dataset and use it for instruction tuning of LLMs. For sentence-level AI-generated text detection, we label each sentence in the dataset and use it for instruction tuning. We also investigate the impact of instruction tuning on detection performance when using text generated by a specific LLM, as well as the influence of different Chinese and English base models on detection performance. Experimental results show that existing methods such as Fast-DetectGPT (Bao et al., 2023), MPU (Tian et al., 2023), and GLTR (Gehrmann et al., 2019) are not effective for sentence-level AI-generated text detection. Our proposed LLM-Detector achieves promising results in both sentence- and document-level AI-generated text detection and exhibits excellent generalization on out-of-distribution (OOD) datasets. Our contributions are summarized as follows:

• [...] using training data consistent with the target language is crucial for improving the model's performance on specific-language text detection tasks. Furthermore, since LLM-Detector is trained on open-source LLMs, it is easy to customize for deployment.

• We conducted sentence-level text detection experiments using a dataset that mixes sentences generated by human experts and by AI. The experimental results indicate that existing methods (such as Sent-RoBERTa and Sniffer) face difficulties with sentence-level AI-generated text detection. Our proposed LLM-Detector, however, achieved encouraging results on document-level and cross-domain text detection and demonstrated outstanding generalization in sentence-level text detection.

2. Related Work

LLMs have been pre-trained on extensive text corpora, enabling them to generate contextually relevant and fluent texts. However, this also increases the difficulty of detecting AI-generated texts. Existing methods for detecting generated texts can be broadly categorized into two types, black-box and white-box detection (Tang et al., 2023), contingent upon the level of access to the model that is suspected to have generated the target texts.
2.1. Black-Box Detection

Datasets. [...] that encompasses diverse Chinese text data, with responses from LLMs of different parameter sizes and from human experts.

Detectors. Existing black-box detectors can be grouped into two main categories: supervised classifiers and zero-shot classifiers. Logistic regression over GLTR (Gehrmann et al., 2019) features and an end-to-end RoBERTa classifier (Guo et al., 2023b) have been used to detect whether a given text (English or Chinese) was generated by ChatGPT or written by humans across several domains. However, a limitation of supervised models is the potential for overfitting to the training domain, resulting in poor out-of-domain (OOD) detection performance (Chakraborty et al., 2023). To address the limitations of supervised classifiers, zero-shot classifiers, which use a pre-trained language model directly without fine-tuning, are immune to such domain-specific degradation. Zero-shot classifiers such as GPT-Zero¹, DetectGPT (Mitchell et al., 2023), and Fast-DetectGPT (Bao et al., 2023) have been developed. These methods check the perplexity and burstiness of the text to determine whether it was artificially generated or authored by a human. Current zero-shot classifiers require input documents of considerable length (exceeding 100 tokens) for the classifier to effectively capture contextual features of the text; on short sentences, their performance is relatively poor.

2.2. White-Box Detection

White-box detection requires full access to LLMs, enabling control over the generation behavior of the model or the embedding of watermarks within the generated texts. This enables the tracking and detection of AI-generated texts in white-box settings.

White-box detection uses statistical boundaries between linguistic patterns found in human-written and AI-generated text as proxies. These boundaries are determined based on n-gram frequencies (Badaskar et al., 2008), entropy (Lavergne et al., 2008), and perplexity (Beresneva, 2016). One limitation of these statistics-based methods is that they assume access to the model's prediction distributions. This constraint hinders broader applications, especially for models behind APIs.

Inspired by copyright-protection watermarks in the image and video fields, Kirchenbauer et al. (2023) partition the model's vocabulary into whitelist and blacklist tokens when predicting the next token given a prompt. During text generation, the goal is to produce whitelist tokens as much as possible, effectively creating a strong watermark. Third parties can then determine whether a text is machine-generated by analyzing the frequency of whitelist tokens within it. While watermarking methods offer robustness and interpretability, they can compromise the quality of the generated text and may not be highly practical in certain scenarios.

¹ https://gptzero.me/

3. Methodology

3.1. Overview of LLM-Detector

The structure of our proposed Chinese text detection model, LLM-Detector, is shown in Figure 1. In the training stage, we constructed a response dataset based on HC3 seed questions, which consists of responses generated by human experts and multiple LLMs, together with their source labels (AI or Human) and more granular sentence-level annotations. These sentence-level annotations include mixed texts written by humans and polished by AI. Subsequently, we adapted a foundation LLM into LLM-Detector through instruction tuning, fine-tuning it on response samples from human experts and multiple LLMs to elicit the model's Chinese text detection capabilities. In the evaluation stage, we feed the corresponding instruction text into LLM-Detector for detection, based on the joint responses generated by LLMs and human experts from M4 seed questions. The diversity of LLM-Detector's Chinese text detection dataset provides better guidance for the foundation LLM in modeling the connection between user instructions and appropriate responses, thereby enhancing the text detection capabilities of the instruction-tuned LLM.

The remainder of this section is organized as follows: Section 3.2 describes the process of building Chinese text detection data from humans and multiple LLMs; Section 3.3 explains the design of LLM-Detector; Section 3.4 introduces the in-domain and OOD datasets we created; and Section 3.5 provides an overview of all datasets used in this work and compares the differences between human-written and AI-generated text.

3.2. Generating Detection Data with Different LLMs

3.2.1. Generation of Document-Level Data

HC3 (Guo et al., 2023a) is the first human-ChatGPT comparison corpus; it contains 12,853 questions from WebText Q&A, Baike Q&A, Medical Dialog, Chinese Corpus, Legal Q&A, etc.

Specifically, we took the 12,853 questions from HC3 and had 9 different LLMs (including ChatGPT, GPT-4, etc.) generate responses, which we labeled as "AI". All LLMs used are displayed in Figure 1. Finally, we combined the original human expert responses from HC3 with the newly generated responses to create a training dataset. Following the experimental design, we filter out texts with a length of less than 10 when building the final training set. An example of the generated document-level detection data can be found in Appendix A.1.
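A minimal sketch of this assembly and filtering step (helper and variable names are ours, not from the released code; `human_responses` and `llm_responses` are assumed to be lists of strings):

```python
# Sketch: assemble the document-level training set and drop very short texts,
# following Section 3.2.1. All names here are illustrative.
def build_document_level_set(human_responses, llm_responses, min_len=10):
    labeled = [(t, "Human") for t in human_responses]
    labeled += [(t, "AI") for t in llm_responses]  # responses from the 9 LLMs
    # Filter out texts with a length of less than 10, as described above.
    return [(t, y) for t, y in labeled if len(t) >= min_len]
```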
Figure 1. LLM-Detector Framework. First, HC3 and M4 seed questions are used to prompt responses from human experts and multiple
LLMs, where the responses to HC3’s multi-domain seed questions will be employed to train the LLM-Detector. Second, the responses
generated using M4 seed questions from the same domain as HC3 are utilized to test the in-domain capabilities of the LLM-Detector,
while an additionally constructed News dataset is used to test the LLM-Detector’s OOD capabilities.
3.2.2. Generation of Sentence-Level Data

To construct the sentence-level dataset, we sampled 5,589 human responses from the HC3 dataset as the data source. We then collected the longer responses from this source and used regular expressions to split them into sentences. Next, we randomly selected a number of sentences in the range [1, number of sentences - 1], ensuring that the text contains at least one human-written sentence, and fed the selected sentences into a large language model for polishing. The specific process of generating sentence-level data is detailed in Appendix A.2.

3.3. Design of LLM-Detector

Specifically, given an instruction dataset V of instruction pairs x = (INSTRUCTION, OUTPUT) with x ∈ V, each instruction x is generated by either human experts or LLMs and is labeled as x_r according to its source. This yields the text detection dataset R, which contains instruction pairs with their corresponding source labels (Human or AI), represented as R = {(x, x_r) | x ∈ V}. During the coach instruction tuning process, each (x, x_r) ∈ R is used to construct an instruction pair x_c, leading to an instruction dataset C = {x_c | x ∈ V}.

Table 1 illustrates how the INSTRUCTION of x_c guides the LLM to detect the text source of x (the original instruction pair), with the OUTPUT of x_c being x_r, the label of the text source. When constructing the INSTRUCTION component, we designed a concise and clear detection instruction that explicitly indicates that the LLM should learn the text detection task from the text source label and the instruction text. To prevent the distraction that might arise from lengthy instructions, we deliberately did not create an exhaustive instruction covering all criteria, allowing the LLM to focus on the relationship between the input instruction texts and their corresponding source labels.

Table 1. Illustration of the format of the instruction pairs x_c in coach instruction tuning. x represents the instruction text, while x_r denotes the label of the text source.

Instruction: Categorize the texts into one of the 2 classes: human or AI.
Input: [x]
Output: [x_r]

Given an LLM with parameters θ as the initial model for coach instruction tuning, training the model on the constructed instruction dataset C adapts the LLM's parameters from θ to θ_c, denoted as LLM-Detector. Specifically, θ_c is obtained by maximizing the probability of predicting the next tokens in the OUTPUT component of x_c, conditioned on the INSTRUCTION of x_c ∈ C, which is formulated as follows:

$$\theta_c = \arg\max_{\theta} \sum_{x_c \in C} \log P\big(\mathrm{OUTPUT} \mid \mathrm{INSTRUCTION}; \theta, x_c\big). \tag{1}$$
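As a minimal illustration of this construction (a sketch in Python; the label strings and the instruction follow Table 1, while the field and function names are ours):

```python
# Sketch: wrap each labeled pair (x, x_r) in R into an instruction pair x_c.
# The JSON-style field names mirror common instruction-tuning formats and are
# not taken from the paper's released code.
INSTRUCTION = "Categorize the texts into one of the 2 classes: human or AI."

def build_instruction_pair(text, source_label):
    # source_label is x_r, i.e., "Human" or "AI"
    return {"instruction": INSTRUCTION, "input": text, "output": source_label}

R = [
    ("单间80多,如果住的天数多70多。", "Human"),          # human expert response
    ("里面有一个独立的卫生间,是隔出来的那种。", "AI"),   # LLM-generated response
]
C = [build_instruction_pair(x, x_r) for x, x_r in R]
```

During tuning, the loss in Eq. (1) is computed only over the tokens of the OUTPUT component, conditioned on the instruction and input text.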
3.4. Test set construction by human experts and LLMs

To evaluate the detection ability of our method, we construct an in-domain test set and an OOD test set.

3.4.1. Generation of Document-Level In-Domain Data

M4 (Wang et al., 2023) is a large-scale multi-generator, multi-domain, and multilingual benchmark corpus for machine-generated text detection. The domains of M4 include Wikipedia, WikiHow, Reddit, arXiv, RuATD, and Baike. We sampled Chinese-language questions from M4 and generated responses using nine different LLMs (including ChatGPT, GPT-4, etc.). All LLMs used are displayed in Figure 1. Finally, we combined the original human expert responses from M4 with the newly generated responses to create a test dataset. As when building the training set, we filter out texts with a length of less than 10 to obtain the final in-domain test set.

[...] OOD detection performance. Again, we built a sentence-level AI-generated Chinese text detection dataset, which includes both human-written sentences and AI-generated sentences of the kind more likely to appear in real AI-assisted writing. This dataset contains a total of 7.1k samples, as shown in Appendix A.4, Appendix A.5, and Table 2. We conduct a linguistic and semantic analysis of our dataset in Appendix A.6.

Table 2. Size of the document-level training and test sets.

Data                     Source   Total
Train Set                Human    21,681
                         AI       96,453
In-Domain Test Set       Human    3,000
                         AI       26,750
Out-of-Domain Test Set   Human    2,000
                         AI       1,915
Total                    -        151,799
A detailed description of these previous methods can be found in Appendix B.2.

4.3. Experimental Settings

Computational resources and parameter settings. Our model is built upon Qwen (Bai et al., 2023), a Chinese large language model available with parameter sizes of 1.8 billion, 7 billion, and 14 billion. Based on Qwen models of different sizes, we trained Small, Medium, and Large LLM-Detectors to distinguish the impact of model parameters on accuracy. Training uses 4 A100 (80G) GPUs in parallel, incorporating quantized low-rank adaptation (QLoRA) (Dettmers et al., 2023), implemented through the transformers and PEFT libraries. To manage training costs, we employ fp16 precision, ZeRO-2 (Rajbhandari et al., 2021), and a gradient accumulation strategy. Throughout training, the learning rate is 5e-5, the number of epochs is 3.0, and the LoRA rank is 8. At the end of training, the best model was saved for evaluation. For the training of BERT, RoBERTa, and MPU, the learning rate is 1e-3 and the number of epochs is 50.0.

Table 3. Experimental results of different detection models on the in-domain dataset. "finetuned" indicates models that have been trained on the same dataset. Bold text indicates the model with the best performance on the in-domain dataset.

Model                    Accuracy
Statistical-based Classifier
Fast-DetectGPT           59.55%
GLTR                     77.06%
PPL                      10.26%
Zero-Shot Classifier
ChatGPT                  81.46%
GPT-4                    37.41%
Supervised Classifier
BERT-finetuned           76.50%
RoBERTa-finetuned        89.93%
BERT-MPU                 75.95%
RoBERTa-MPU              89.93%
LLaMA-2-7B-finetuned     83.65%
LLaMA-2-13B-finetuned    96.53%
Mistral-7B-finetuned     97.98%
LLM-Detector-Small       97.84%
LLM-Detector-Medium      98.35%
LLM-Detector-Large       98.52%
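For concreteness, the training setup described in Section 4.3 can be sketched roughly as follows (assuming the Hugging Face transformers/PEFT stack; the model name, target modules, and gradient-accumulation value are placeholders or assumptions, since the paper states only the learning rate, epochs, LoRA rank, fp16, and ZeRO-2):

```python
# Rough sketch of QLoRA fine-tuning with the stated hyperparameters.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "Qwen/Qwen-7B"  # placeholder: one of the Qwen sizes used in the paper
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base, load_in_4bit=True, trust_remote_code=True)  # 4-bit base for QLoRA
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(r=8,                                # LoRA rank 8, as stated
                  lora_alpha=16, lora_dropout=0.05,   # assumptions
                  target_modules=["c_attn"],          # assumption for Qwen
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

args = TrainingArguments(
    output_dir="llm-detector",        # placeholder path
    learning_rate=5e-5,               # as stated
    num_train_epochs=3.0,             # as stated
    fp16=True,                        # as stated
    gradient_accumulation_steps=8,    # assumption: exact value not reported
)
```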
Metrics. We use accuracy (ACC) to assess the performance of models in classifying text as human-written or AI-generated. For sentence-level evaluation, we use precision (P.), recall (R.), and Macro-F1. Precision and recall respectively capture the "accuracy" and "coverage" of each category, and the Macro-F1 score combines the two, providing a comprehensive measure of overall performance. A detailed description of these metrics can be found in Appendix B.3.

4.4. Main Results

The experimental results are shown in Table 3. Generally, models of various sizes achieve significantly better performance after the LLM-Detector fine-tuning. This demonstrates the effectiveness and broad applicability of instruction tuning of large models for LLM text detection. In addition, we make the following observations.

On the in-domain dataset, classifiers based on supervised learning typically outperform zero-shot and statistical-based classifiers. Furthermore, classifiers trained on large language models surpass those based on smaller-parameter models such as BERT and RoBERTa, confirming the inherent advantage of large-scale parameters in model performance. There is also a positive correlation between model scale and detection performance: the LLM-Detector-Large model outperforms the LLM-Detector-Medium and LLM-Detector-Small models in accuracy, and LLaMA-2-13B has higher accuracy than LLaMA-2-7B.

When fine-tuning large language models for text detection, Chinese LLMs trained on Chinese data significantly outperform those trained on English data. This indicates that using training data consistent with the target language is crucial for improving performance on specific-language tasks. For instance, when fine-tuned from Chinese LLMs (such as Qwen), the resulting LLM-Detector typically achieves higher accuracy than counterparts based on English LLMs (such as Mistral and LLaMA).

We also investigate the performance of various detection models on out-of-distribution (OOD) datasets; the results are shown in Table 4. Among the statistical classifiers, Fast-DetectGPT achieves the highest accuracy, reaching 94.48%. Among the supervised classifiers, the LLaMA-2-13B-finetuned model has the highest accuracy, achieving 93.19%. Among all models, LLM-Detector-Large demonstrates the most impressive performance, with an accuracy of 96.70%. These results suggest that for detection on OOD datasets, LLM-Detector-Large is a highly accurate and effective choice.

We implement Sniffer and Sent-RoBERTa for the sentence-level detection task, alongside our LLM-Detector. Sniffer (Li et al., 2023a) is a powerful model that can detect and trace the origins of AI-generated texts. To perform sentence-level detection, we train a sentence-level Sniffer following the structure and training process of the original Sniffer.
Table 5. Performance comparison of different models on sentence-level detection.

Model                P.       R.       Macro-F1
Sniffer              65.00%   64.24%   62.51%
Sent-RoBERTa         37.30%   42.15%   39.11%
LLM-Detector-Small   71.36%   72.62%   73.50%

4.5. Usability Analysis

Robustness on Short Texts. Zero-shot detectors, due to their statistical properties, are expected to perform worse on shorter texts, and supervised learning detectors face the same issue. We sampled texts with lengths ranging from 10 to 50 from the in-domain dataset to evaluate the different detectors. As shown in Table 6, LLM-Detector and the text detectors trained on LLMs (such as Mistral and LLaMA) do not exhibit a significant performance drop on short texts.

We further investigated the impact of text length on Fast-DetectGPT and RoBERTa, as shown in Appendix B.4. We additionally sampled texts of lengths 100, 150, and 200 from the in-domain dataset for detection. As text length gradually increased, both Fast-DetectGPT (a statistical detector) and RoBERTa (a supervised detector) improved in accuracy. Once the text length exceeded 100 characters, the accuracy of Fast-DetectGPT rapidly rose to 94.3%; once it exceeded 200 characters, the accuracy of RoBERTa rapidly increased to 83.8%.

Robustness in Mixed Text Detection. We further explored the impact of mixed text on the performance of LLM-Detector; the results are shown in Figure 2. We found that when the proportion of mixed text reaches 50%-60%, detection accuracy drops sharply. This is because, as the proportion of mixed text increases, the characteristics of the original text may be weakened. For LLM-Detector, this may mean that the signals used to judge the authenticity
of the text become weaker while noise increases, leading to a decline in model performance. Overall, LLM-Detector exhibits a certain robustness in mixed text detection and is able to resist the influence of mixed text to a considerable extent.

[Figure 2: bar chart of detection accuracy across bins of AI-mixed content proportion, from 0-10% to 90-100%.]

Figure 2. The detection accuracy increases with the proportion of AI-generated content.

Are instruction-tuned LLMs better at detecting text they themselves have generated? To evaluate the detection performance of different LLMs on content they generated themselves, we fine-tuned LLMs on responses generated by three different LLMs together with human-written texts. The results are shown in Appendix B.5. A notable trend is that LLMs tend to perform best at detecting texts they generated themselves. For instance, the ChatGLM2-6B model achieves the highest accuracy (99.91%) on the dataset it generated, significantly higher than any other model tested on the same dataset. Similarly, the Qwen-14B model reaches a high accuracy of 96.18% on its own generated dataset. However, an interesting anomaly arises with the BlueLM-7B model: the Qwen-7B model outperforms BlueLM-7B on BlueLM-7B's own dataset, with an accuracy of 97.8% compared to 97.1%. While this could suggest a potential issue with the BlueLM-7B model's training, it is also worth noting that the difference is very small (only 0.7%), which could fall within the margin of error.

The impact of text generated by LLMs of different scales on detection accuracy. We used LLM-Detector to perform text detection on texts generated by LLMs of different parameter sizes. We found that the texts produced by LLMs of varying scales had no significant impact on the detection accuracy of LLM-Detector, indicating that detectors trained on LLMs demonstrate better robustness and generalization, as shown in Appendix B.6. Specifically, the three differently sized detectors (Small, Medium, and Large) showed only a small range of fluctuation in detection accuracy for texts generated by LLMs of different scales, with the gap between the highest and lowest accuracy not exceeding 5%.

5. Conclusion

In this paper, we have designed a simple yet effective method to detect text generated by AI. Our proposed method is based on the intuition that an LLM has learned a wealth of knowledge during pre-training, which enables it to autonomously detect the text it generates. Instruction tuning can facilitate the alignment between the model and the user's expected text detection responses. Compared to previous methods, our method can accomplish AI-generated text detection at both the document and sentence levels and maintains good performance on OOD data. Therefore, our method possesses superior generalization ability and practicality. We conducted experiments on the three proposed datasets, which cover responses generated by different LLMs, include in-domain and OOD data, and provide more fine-grained sentence-level annotations. The experimental results demonstrate that our method can effectively identify texts generated by LLMs. Moreover, our method shows strong robustness against biases in mixed AI texts, short texts, and OOD texts.

Impact Statements

The proposed AI text detection method offers advancements in content moderation and information security by identifying AI-generated text, ensuring authenticity and reliability. However, it faces challenges such as potential false positives and negatives, and biases inherent in pre-trained LLMs. As AI models evolve, ongoing research is needed to enhance the method's robustness and address these limitations, ensuring its effectiveness and ethical application in various domains.

References

Badaskar, S., Agarwal, S., and Arora, S. Identifying real or fake articles: Towards better language modeling. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II, 2008.

Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

Bao, G., Zhao, Y., Teng, Z., Yang, L., and Zhang, Y. Fast-DetectGPT: Efficient zero-shot detection of machine-generated text via conditional probability curvature. arXiv preprint arXiv:2310.05130, 2023.

Beresneva, D. Computer-generated text detection using machine learning: A systematic review. In Natural Language Processing and Information Systems: 21st International Conference on Applications of Natural Language to Information Systems, NLDB 2016, Salford, UK, June 22-24, 2016, Proceedings 21, pp. 421-426. Springer, 2016.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Chakraborty, S., Bedi, A., Zhu, S., An, B., Manocha, D., and Huang, F. On the possibilities of AI-generated text detection. arXiv preprint arXiv:2304.04736, 2023.

Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Gehrmann, S., Strobelt, H., and Rush, A. M. GLTR: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043, 2019.

Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., and Wu, Y. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597, 2023a.

Guo, B., Zhang, X., Wang, Z., Jiang, M., Nie, J., Ding, Y., Yue, J., and Wu, Y. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597, 2023b.

Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., and Goldstein, T. A watermark for large language models. arXiv preprint arXiv:2301.10226, 2023.

Li, L., Wang, P., Ren, K., Sun, T., and Qiu, X. Origin tracing and detecting of LLMs. arXiv preprint arXiv:2304.14072, 2023a.

Li, M., Zhang, Y., Li, Z., Chen, J., Chen, L., Cheng, N., Wang, J., Zhou, T., and Xiao, J. From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. arXiv preprint arXiv:2308.12032, 2023b.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Mitchell, E., Lee, Y., Khazatsky, A., Manning, C. D., and Finn, C. DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv preprint arXiv:2301.11305, 2023.

Rajbhandari, S., Ruwase, O., Rasley, J., Smith, S., and He, Y. ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-14, 2021.

Tang, R., Chuang, Y.-N., and Hu, X. The science of detecting LLM-generated texts. arXiv preprint arXiv:2303.07205, 2023.

Tian, Y., Chen, H., Wang, X., Bai, Z., Zhang, Q., Li, R., Xu, C., and Wang, Y. Multiscale positive-unlabeled detection of AI-generated texts. arXiv preprint arXiv:2305.18149, 2023.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Wang, Y., Mansurov, J., Ivanov, P., Su, J., Shelmanov, A., Tsvigun, A., Whitehouse, C., Afzal, O. M., Mahmoud, T., Aji, A. F., et al. M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection. arXiv preprint arXiv:2305.14902, 2023.

Xu, B. NLP Chinese corpus: Large scale Chinese corpus for NLP, September 2019. URL https://doi.org/10.5281/zenodo.3402023.
A. Details of Dataset
A.1. Example of document-level data organization
The input text is the response generated by human experts or by different LLMs, and the Output is the label of the data source, as shown in Table 7.
Instruction:
Categorize the texts into one of the two classes: human or AI.
Input:
单间80多,如果住的天数多70多。里面有一个单独的卫生间,是隔出来的那种。其他的不是很清楚。另
外德强那边有一个新开的巧克力时钟旅馆,听干净的,价格和相约谷差不多。你可以去看看。便宜点的
还有恒久附近的,50多,但是没有单独的卫生间。都有电脑可以上网的。学校附近的小宾馆旅社很多学
校里面也有非常多基本60-80左右
(Translation: The single rooms are over 80, if you stay for more days it’s around 70. There’s a separate toilet inside,
the kind that’s partitioned off. I’m not very clear about the others. By the way, there’s a newly opened Chocolate
Clock Hotel over at Deqiang’s place, it’s said to be clean, and the price is about the same as Xiangyue Valley. You
can go take a look. There are also cheaper ones near Hengjiu, over 50, but they don’t have a separate toilet. All of
them have computers with internet access. There are many small hotels and guesthouses near the school, and there
are also many inside the school, with prices ranging from about 60-80.)
Output: Human
A.2. Process of generating sentence-level data

Prompt:
请润色下述内容,不要做任何解释,直接输出润色结果:
(Translation: Please polish the following content without any explanation, and output the polished result directly:)
After obtaining the polished sentences [P1, P3, P4, P7], we splice them back together with the untouched human sentences to form a paragraph [P1, L2, P3, P4, L5, L6, P7] that blends AI and human text. Using the same method, we sampled 1,504 samples from M4 for sentence-level data construction. Ultimately, we use the HC3 sentence-level data as the training set and the M4 data as the test set; an example is shown in Table 9. At the sentence level, the training set contains 5,589 samples and the test set contains 1,504.
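A rough sketch of this splitting-and-polishing pipeline, under our reading of the process (the regular expression and all names are ours; `polish_with_llm` stands in for the actual LLM call):

```python
import random
import re

def split_sentences(text):
    # Split on Chinese sentence-final punctuation, keeping the delimiter.
    parts = re.split(r"(?<=[。!?])", text)
    return [p for p in parts if p.strip()]

def build_mixed_sample(human_text, polish_with_llm):
    # Assumes the source text has at least two sentences, since longer
    # responses are selected as the data source.
    sents = split_sentences(human_text)
    # Pick k in [1, len(sents) - 1] so at least one human sentence remains.
    k = random.randint(1, len(sents) - 1)
    chosen = set(random.sample(range(len(sents)), k))
    labeled = []
    for i, s in enumerate(sents):
        if i in chosen:
            labeled.append(("AI", polish_with_llm(s)))  # LLM-polished sentence
        else:
            labeled.append(("HUMAN", s))                # untouched human sentence
    return labeled  # (label, sentence) pairs, spliced back in order
```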
Instruction:
Categorize the texts into one of the 2 classes: human or AI.
Input:
单间80多,如果住的天数多70多。里面有一个单独的卫生间,是隔出来的那种。其他的不是很清楚。另外德强那边有一个新开的巧克力时钟旅馆,听干净的,价格和相约谷差不多。你可以去看看。便宜点的还有恒久附近的,50多,但是没有单独的卫生间。都有电脑可以上网的。学校附近的小宾馆旅社很多学校里面也有非常多基本60-80左右
(Translation: The single rooms are over 80, if you stay for more days it's around 70. There's a separate toilet inside, the kind that's partitioned off. I'm not very clear about the others. By the way, there's a newly opened Chocolate Clock Hotel over at Deqiang's place, it's said to be clean, and the price is about the same as Xiangyue Valley. You can go take a look. There are also cheaper ones near Hengjiu, over 50, but they don't have a separate toilet. All of them have computers with internet access. There are many small hotels and guesthouses near the school, and there are also many inside the school, with prices ranging from about 60-80.)
Output: <HUMAN>单间80多,</HUMAN><HUMAN>如果住的天数多70多。</HUMAN><AI>里面有一个独立的卫生间,</AI><AI>是隔出来的那种。</AI><AI>其他方面有些不太清楚。</AI><AI>另外,</AI><AI>在德强那里有一家新开的巧克力时光旅馆,</AI><AI>听说环境很干净,</AI><AI>价格和相约谷差不多。</AI><AI>你可以去看看。</AI><AI>在恒久附近有更便宜的房间,</AI><AI>大约50多元,</AI><AI>但是没有独立的卫生间。</AI><HUMAN>都有电脑可以上网的。</HUMAN><HUMAN>学校附近的小宾馆旅社很多学校里面也有非常多基本60-80左右</HUMAN>
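The tagged output above can be mapped back to per-sentence labels with a simple parser; a minimal sketch (the tag format follows the example above, while the function name is ours):

```python
import re

def parse_labeled_output(tagged):
    # Extract (label, sentence) pairs from "<AI>...</AI><HUMAN>...</HUMAN>" text.
    return [(m.group(1), m.group(2))
            for m in re.finditer(r"<(HUMAN|AI)>(.*?)</\1>", tagged)]

pairs = parse_labeled_output("<HUMAN>单间80多,</HUMAN><AI>你可以去看看。</AI>")
# -> [('HUMAN', '单间80多,'), ('AI', '你可以去看看。')]
```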
Prompt
你是一个新闻编辑,用户给你一个新闻类别,请在用户指定类别要求下写一个100字到700字的新闻稿。
新闻类别包括:政治新闻 经济新闻 社会新闻 科技新闻 文化艺术新闻 娱乐新闻 环境新闻...
(Translation: You are a news editor, and the user provides you with a news category. Write a news article of 100 to
700 words based on the specified category. The news classes include Political News, Economic News, Social News,
Technology News, Cultural and Arts News, Entertainment News, and Environmental News, ...)
The composition of the training set, in-domain test set, and OOD test set is shown in Table 11.
Table 11. Document-level training and test sets with different data sizes from various sources.

No.  Train Set / Test Set     Data Label  Data Source     Count   Total
1    Train Set                Human       Human           21,681  21,681
2    Train Set                AI          ChatGPT         17,376
3    Train Set                AI          ChatGLM2-6B     12,850
4    Train Set                AI          XVERSE-13B      12,833
5    Train Set                AI          Qwen-14B        12,823
6    Train Set                AI          GPT-4           12,796
7    Train Set                AI          BlueLM-7B       12,702
8    Train Set                AI          Baichuan2-53B   12,659
9    Train Set                AI          ERNIE-Bot-3.5   2,414   96,453
10   In-Domain Test Set       Human       Human           3,000   3,000
11   In-Domain Test Set       AI          ChatGPT         3,000
12   In-Domain Test Set       AI          ChatGLM2-6B     3,000
13   In-Domain Test Set       AI          XVERSE-13B      2,998
14   In-Domain Test Set       AI          Qwen-14B        2,997
15   In-Domain Test Set       AI          GPT-4           2,987
16   In-Domain Test Set       AI          BlueLM-7B       2,980
17   In-Domain Test Set       AI          Davinci003      2,975
18   In-Domain Test Set       AI          ERNIE-Bot-3.5   2,972
19   In-Domain Test Set       AI          Baichuan2-53B   2,841   26,750
20   Out-of-Domain Test Set   Human       News            2,000   2,000
21   Out-of-Domain Test Set   AI          ChatGPT         1,915   1,915
22   Total                    -           -               -       151,799
[Figure 3: diagram linking each split (training set, in-domain test set, OOD test set) to its response sources: Human, ChatGPT, GPT-4, ChatGLM2-6B, XVERSE-13B, Qwen-14B, BlueLM-7B, Baichuan2-53B, ERNIE-Bot-3.5, Davinci003, and News broadcasts.]

Figure 3. Dataset source analysis: the sources of the response data in the training set, in-domain test set, and OOD test set.
Such differences in usage can be pivotal in differentiating between human and AI-generated content.

[Figures 4 and 5: bar charts of the proportions of part-of-speech tags in human-written versus AI-generated text.]

In Figure 6, the sentiment distribution across the training and test datasets for human-written and AI-generated texts is depicted. The bar charts compare the proportions of neutral, positive, and negative sentiment in both datasets, with orange bars representing human-produced content and purple bars AI-generated material. In the training set, a substantial majority of AI-generated texts are classified as neutral (86%), while human texts show a lower neutral proportion (61%). Conversely, human texts exhibit a significantly higher inclination toward negative sentiment (34%) than AI texts (11%), with positive sentiment minimal in both but slightly higher in AI texts (5% versus 3% in human texts). A similar pattern is observable in the test set, where AI texts are predominantly neutral (83%) while human texts are less so (65%); the negative sentiment in human texts (25%) is more than double that in AI texts (12%), with positive sentiment remaining low for both. These charts suggest that AI-generated texts tend toward neutral sentiment, while human authors express a broader emotional range, particularly negative sentiment. This pattern across both training and test sets highlights a key difference in the emotional tone between human and AI writing.

Our training set includes 119,475 neutral sentences (80.8% of the dataset), 22,658 negative sentences (15.3%), and 5,751 positive sentences (3.9%).
Figure 6. Sentiment Distribution: (Left) Distribution in the Training Set, (Right) Distribution in the Test Set.
B. Details of Experiments

Sentence-level Detection. Sentence-level detection refers to the ability of an AI-generated text detector to accurately recognize AI-generated text at the sentence level rather than at the document level. This is a more challenging task because the context provided by the entire document is not available, and the detector must rely solely on the content of a single sentence.
B.2. Baselines

• Fast-DetectGPT (Bao et al., 2023) is a method for zero-shot detection of AI-generated text. It uses conditional probability curvature as an indicator and detects whether text is machine-generated by sampling and evaluating differences in word-selection probabilities. Compared to DetectGPT (Mitchell et al., 2023), it increases detection speed by two orders of magnitude while improving accuracy by approximately 75%.

• GLTR (Gehrmann et al., 2019) studies three types of features of an input text. Its major assumption is that, to generate fluent and natural-looking text, most decoding strategies sample high-probability tokens from the head of the distribution.

• Perplexity (PPL) (Guo et al., 2023b) is a metric for evaluating the performance of language models. It is the exponentiated negative average log-likelihood of a given text under the language model; a lower PPL indicates that the language model is more confident in its predictions. We use GPT-2 to calculate the PPL of human- and AI-generated content to distinguish who produced the text (see the sketch after this list).

• MPU (Tian et al., 2023) proposes a multi-scale positive-unlabeled AI text detection method, which models AI text detection as a partial positive-unlabeled problem, utilizes a length-based multi-scale PU loss, and introduces a text multi-scaling module. MPU significantly improves the detection of short texts while also enhancing long-text detection. It has been implemented on top of two models, BERT and RoBERTa, referred to as BERT-MPU and RoBERTa-MPU, respectively.

• LLaMA-2 (Touvron et al., 2023) is a language model trained on approximately 2T tokens. It has demonstrated exceptional performance across multiple benchmark tests and has been widely used in LLM research. We adopt LLaMA-2-7B and LLaMA-2-13B as base models for instruction tuning.

• Mistral-7B (Jiang et al., 2023) is a language model designed for superior performance and efficiency. It employs mechanisms such as grouped-query attention and sliding-window attention to surpass other language models on various benchmarks.
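As referenced in the PPL item above, a minimal sketch of computing perplexity with GPT-2 via Hugging Face transformers (this stack is an assumption; the paper only states that GPT-2 is used, and the thresholding decision is left out):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text):
    # PPL is the exponentiated average negative log-likelihood under the LM.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()

# Lower PPL means the LM finds the text more predictable, which is used as
# one (imperfect) signal that the text may be machine-generated.
print(perplexity("The quick brown fox jumps over the lazy dog."))
```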
In addition to the comparison methods mentioned above, we trained BERT (Devlin et al., 2018) and RoBERTa (Liu et al.,
2019) text classification models based on the same data for text detection. At the same time, we conducted Zero-shot text
detection based on ChatGPT and GPT-4.
For classification using ChatGPT and GPT-4 in the zero-shot setting, we conduct three predictions and take the average result. We adopted a decoding method similar to that proposed by Holtzman et al. (2019) for open-ended text generation: temperature sampling with a temperature of 0.7, top-p of 1.0, and a maximum of 2048 tokens, keeping other settings at their defaults. The prompt for ChatGPT and GPT-4 is as follows.

Determine whether this passage is generated by AI or written by human. Do not respond with anything other than AI and Human. You are only allowed to answer AI or Human.
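A sketch of this zero-shot query with the stated decoding settings (assuming the OpenAI Python client; the model identifier is a placeholder, and we interpret "take the average result" as a majority vote over the three answers):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("Determine whether this passage is generated by AI or written by "
          "human. Do not respond with anything other than AI and Human. "
          "You are only allowed to answer AI or Human.")

def zero_shot_detect(text, model="gpt-4", n_runs=3):
    votes = []
    for _ in range(n_runs):  # three predictions, as described above
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": f"{PROMPT}\n\n{text}"}],
            temperature=0.7, top_p=1.0, max_tokens=2048,
        )
        votes.append(resp.choices[0].message.content.strip())
    return max(set(votes), key=votes.count)  # majority vote
```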
• True Positive (TP): the number of positive samples predicted as positive.
• True Negative (TN): the number of negative samples predicted as negative.
• False Positive (FP): the number of negative samples predicted as positive, i.e., the number of false detections.
• False Negative (FN): the number of positive samples predicted as negative, i.e., the number of missed detections.
Precision is defined as

$$\mathrm{Precision} = \frac{TP}{TP + FP}. \tag{2}$$

Recall is defined as

$$\mathrm{Recall} = \frac{TP}{TP + FN}. \tag{3}$$

Macro-F1 is defined as

$$\text{Macro-F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \times P_i \times R_i}{P_i + R_i}. \tag{4}$$
N represents the number of classes (two in our task: Human and AI). P_i and R_i are the precision and recall for the i-th class, computed from TP_i, FP_i, and FN_i, the numbers of true positives, false positives, and false negatives for the i-th class. The Macro-F1 score is the sum of the per-class F1 scores divided by the total number of classes.
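A small self-contained sketch of Eqs. (2)-(4) (function names are ours):

```python
def precision_recall(tp, fp, fn):
    # Per-class precision and recall, Eqs. (2) and (3).
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def macro_f1(per_class_counts):
    # per_class_counts: one (TP_i, FP_i, FN_i) tuple per class, Eq. (4).
    f1s = []
    for tp, fp, fn in per_class_counts:
        p, r = precision_recall(tp, fp, fn)
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return sum(f1s) / len(f1s)

# Two classes (Human, AI), e.g. Human: TP=90, FP=10, FN=5; AI: TP=85, FP=5, FN=10.
print(macro_f1([(90, 10, 5), (85, 5, 10)]))  # ~0.921
```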
B.4. The impact of text length on detection accuracy

[Figure 7: line chart of detection accuracy versus text length (25 to 200 characters) for Fast-DetectGPT and RoBERTa.]

Figure 7. The accuracy of Fast-DetectGPT and RoBERTa as the length of the text increases.

B.5. Are instruction-tuned LLMs better at detecting text they themselves have generated?

[...] While this could suggest a potential issue with the BlueLM-7B model's training, it is also worth noting that the difference is very small (only 0.7%), which could fall within the margin of error.

Table 12. Performance comparison of different LLMs on data from different generation sources. Bold text and a blue background indicate the model with the best performance.
Data Generation Source Model Accuracy
ChatGLM2-6B ChatGLM2-6B 99.91%
ChatGLM2-6B XVERSE-7B 94.60%
ChatGLM2-6B Baichuan2-7B 98.87%
ChatGLM2-6B Qwen-7B 96.63%
ChatGLM2-6B Mistral-7B 96.87%
ChatGLM2-6B LLaMA-2-7B 97.14%
Qwen-14B Qwen-14B 96.18%
Qwen-14B Baichuan2-13B 95.92%
Qwen-14B LLaMA-2-13B 94.43%
Qwen-14B XVERSE-13B 91.19%
BlueLM-7B BlueLM-7B 97.10%
BlueLM-7B XVERSE-7B 92.51%
BlueLM-7B Baichuan2-7B 95.09%
BlueLM-7B LLaMA-2-7B 96.63%
BlueLM-7B Mistral-7B 94.44%
BlueLM-7B Qwen-7B 97.80%
B.6. The impact of text generated by LLMs of different scales on the accuracy of text detection
We used the LLM-Detector to perform text detection on texts generated by LLMs of different parameter sizes. We found that the texts produced by LLMs of varying scales had no significant impact on the detection accuracy of LLM-Detector, indicating that detectors trained on LLMs demonstrate better robustness and generalization, as shown in Table 13. Specifically, the three differently sized detectors (Small, Medium, and Large) showed only a small range of fluctuation in detection accuracy for texts generated by LLMs of different scales, with the gap between the highest and lowest accuracy not exceeding 5%.
Table 13. Detection accuracy of the three LLM-Detector variants on texts generated by LLMs of different scales. The darker the color, the better the performance.
Gradually increasing model size range →
Model ChatGLM2-6B BlueLM-7B XVERSE-13B Qwen-14B Baichuan2-53B ERNIE-Bot ChatGPT GPT-4
LLM-Detector-Small 98.48% 98.85% 97.54% 95.60% 97.20% 96.79% 96.51% 97.82%
LLM-Detector-Medium 95.41% 95.89% 95.84% 99.33% 96.83% 96.49% 95.81% 94.48%
LLM-Detector-Large 93.28% 98.11% 99.44% 98.15% 96.99% 99.23% 99.64% 95.25%