
Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education

arXiv:2401.03676v1 [cs.SE] 8 Jan 2024

Wei Hung Pan, Ming Jie Chok, Jonathan Leong Shan Wong
School of Information Technology, Monash University Malaysia, Subang Jaya, Malaysia

Yung Xin Shin, Yeong Shian Poon
School of Information Technology, Monash University Malaysia, Subang Jaya, Malaysia

Zhou Yang
School of Computing and Information Systems, Singapore Management University, Singapore, Singapore

Chun Yong Chong
School of Information Technology, Monash University Malaysia, Subang Jaya, Malaysia

David Lo
School of Computing and Information Systems, Singapore Management University, Singapore, Singapore

Mei Kuan Lim
School of Information Technology, Monash University Malaysia, Subang Jaya, Malaysia

ABSTRACT
Educators are increasingly concerned about the usage of Large Language Models (LLMs) such as ChatGPT in programming education, particularly regarding the potential exploitation of imperfections in Artificial Intelligence Generated Content (AIGC) Detectors for academic misconduct.
In this paper, we present an empirical study in which an LLM is examined for its ability to bypass detection by AIGC Detectors. This is achieved by generating code in response to a given question using different variants. We collected a dataset comprising 5,069 samples, with each sample consisting of a textual description of a coding problem and its corresponding human-written Python solution code. These samples were obtained from various sources, including 80 from Quescol, 3,264 from Kaggle, and 1,725 from LeetCode. From the dataset, we created 13 sets of code problem variant prompts, which were used to instruct ChatGPT to generate the outputs. Subsequently, we assessed the performance of five AIGC detectors. Our results demonstrate that existing AIGC Detectors perform poorly in distinguishing between human-written code and AI-generated code.

CCS CONCEPTS
• Social and professional topics → Software engineering education.

KEYWORDS
Software Engineering Education, AI-Generated Code, AI-Generated Code Detection

ACM Reference Format:
Wei Hung Pan, Ming Jie Chok, Jonathan Leong Shan Wong, Yung Xin Shin, Yeong Shian Poon, Zhou Yang, Chun Yong Chong, David Lo, and Mei Kuan Lim. 2024. Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

‡ Zhou Yang is the corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference'17, July 2017, Washington, DC, USA
© 2024 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM . . . $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
In recent years, an increase in capability has been observed in the development of LLMs, enabling them to generate high-quality, coherent paragraphs, answer questions, and even produce human-like code [20, 26]. LLMs are designed to understand and generate human text, and they are trained using vast amounts of data scraped from the internet. Most of these LLMs, developed by major corporations, are generally made accessible to the public, allowing anyone with Internet access to utilize them [13]. This accessibility enables individuals to swiftly obtain valuable and reference-worthy answers.
At the educational level, this accessibility can enhance learning and research efficiency while potentially influencing educational assessment and evaluation [30].

The use of AI-based tools and resources has gradually led to a situation where students increasingly rely on these tools to quickly access answers and information. This growing dependence on AI-driven solutions has a noticeable impact on academic dishonesty [15].
As a consequence, educators find themselves compelled to utilize AIGC Detectors to ascertain whether students are involved in academic dishonesty [33]. While existing AIGC Detectors have proven their proficiency in identifying AI-generated text, their effectiveness in recognizing AI-generated code remains uncertain due to the intricate nature of programming code [35]. This discrepancy can lead to disparities in the evaluation of students' academic submissions, potentially resulting in unfair grading.
This paper presents an empirical study that evaluates the performance of different AIGC Detectors in detecting AIGC across diverse contextual and syntactical variations.
Overall, the main contributions of our paper are summarized as follows:
• We conducted a comprehensive empirical study to assess the performance of five AIGC detectors using 13 variant prompts. To the best of our knowledge, this is the first study specifically evaluating the performance of different AIGC detectors on AIGC generated with various question prompts.
• We constructed 13 large datasets, each containing 5,069 samples representing a specific code problem prompt variant. Each dataset comprises code problems, human-written code for each problem, and AI-generated code for the same problem.

2 BACKGROUND AND MOTIVATIONS
In this section, we discuss the background related to our work, particularly in the domain of software engineering education and the use of generative AI technologies to support teaching and learning.

2.1 Impact of AIGC on SE and CS Education
Software Engineering (SE) and Computer Science (CS) education form the bedrock of technological progress [41]. These fields equip students with essential skills such as problem-solving, logical thinking, and creativity. The cornerstone of SE and CS education, however, is programming. Programming is not just a skill; it embodies the essence of SE and CS education. It empowers students to create innovative software solutions, analyze complex problems, and make significant contributions to the digital world [24].
The emergence of generative AI and the subsequent proliferation of AIGC have brought about a transformative era in education. AIGC represents a paradigm shift in which automated content creation is reshaping learning experiences for both educators and learners [18]. This shift, while revolutionary, introduces a complex interplay of challenges and opportunities in education [6]. AIGC can personalize and enhance learning, but it also raises questions about the authenticity of content and the methods of assessment in the digital age [5].
Generative Pre-trained Transformers (GPTs) [4] have revolutionized natural language processing (NLP). GPT-1 in 2018 (117 million parameters) laid the foundation; GPT-2 arrived in 2019 (1.5 billion parameters), improving text generation; GPT-3.5, part of OpenAI's API, introduced advanced features such as editing and inserting; and GPT-4 added multimodal capabilities while raising ethical concerns. Each iteration pushes AI language models forward with transformative capabilities.
The widespread adoption of AIGC has led to a growing need for reliable detection mechanisms. One notable development in this area is the AIGC Detector detailed in the work by Wang et al. [35]. These detectors have demonstrated impressive accuracy in identifying AI-generated content. However, understanding the implications of the accuracy of AIGC Detectors within educational settings, particularly in CS and SE learning environments, is essential.
In an era where these detectors are expected to ensure the authenticity of educational content, it is imperative to examine their effectiveness and limitations comprehensively. Relying solely on their reported accuracy might create a false sense of security, potentially allowing students to deceive these detectors. This raises questions about the robustness of educational assessments and the integrity of the learning process [31].
An intriguing aspect of this inquiry lies in potential scenarios where students might outsmart AIGC Detectors. For example, in specific situations, students may find ways to manipulate the system, deceiving even sophisticated AIs like ChatGPT [9]. This raises concerns about the reliability of automated evaluations, especially if educators depend heavily on a single detector. Understanding these vulnerabilities is vital for ensuring the authenticity of educational assessments in the digital age [22].
In summary, this research is motivated by the transformative impact of AIGC on SE and CS education. Our study contributes by unraveling the complexities surrounding AIGC Detectors, exploring their accuracy, vulnerabilities, and implications for educational practices. By shedding light on these critical aspects, our work strives to enhance the understanding of AIGC Detection within the context of CS and SE education. Ultimately, our research seeks to ensure the integrity of educational assessments, thereby fostering a secure and genuine learning environment for students and educators alike.

2.2 Selection of AIGC Detectors
The integration of AIGC Detection in SE education significantly enhances academic integrity and ensures the authenticity of educational content. Educators can leverage these detectors to identify instances of plagiarism, unauthorized use of AI-generated code, and academic dishonesty among students, thereby fostering fairness and promoting originality within SE coursework. Moreover, AIGC Detectors play a pivotal role in verifying the authenticity of educational materials, allowing educators to maintain the credibility of resources utilized in SE courses. This technological integration not only strengthens the overall educational ecosystem but also stimulates meaningful discussions among students about responsible AI usage, emphasizing the importance of ethical practices in SE education.
Five AIGC Detectors are instrumental in achieving these objectives. GPTZero [1], Sapling [2], GPT-2 Detector [32], DetectGPT [7, 25], and the Giant Language Model Test Room (GLTR) [17, 19, 28] offer educators tools to preserve academic integrity and encourage ethical AI usage.

Figure 1: Workflow of the AIGC and their variants’ generation process, AIGC prediction with AIGC Detectors, and comparative
analysis of AIGC Detector outputs with accuracy metrics.

However, it is crucial for educators to thoroughly evaluate these tools and consider their limitations to ensure their effective use in SE education.

3 EMPIRICAL STUDY DESIGN AND METHODOLOGY
This section discusses the research questions, methodology, and process of our empirical study. Figure 1 offers an overview of the empirical study process, highlighting the key phases of data collection, prediction, and analysis. We have made the replication package publicly available.1
The data collection phase is essential for ensuring the integrity and robustness of our study. We gathered coding problems from various sources and introduced different variants based on the collected coding problems. Then, by utilizing the OpenAI API service,2 we collected AI-generated code and performed manual validation to ensure the correctness of the dataset. Moving on to the data prediction phase, using AI-generated code and human-written code as input, we employed the selected AIGC Detectors to perform AIGC Detection and collected the prediction results. In the data analysis phase, using the compiled prediction results, we evaluated the efficacy of each AIGC Detector via predetermined performance metrics and used the analysed results to answer our research questions.

3.1 Research Questions and Experimental Setup
In this section, we discuss the research questions and the experimental setup.
(1) RQ1: How accurate are existing AIGC Detectors at detecting AI-generated code?
(2) RQ2: What are the limitations of existing AIGC Detectors when it comes to detecting AI-generated code?
The objectives of RQ1 and RQ2 are to assess the performance of the chosen AIGC Detectors in identifying AI-generated code and to investigate the limitations of AIGC Detectors in identifying AI-generated code. Both research questions share a similar experimental setup.
(1) Data Evaluation: Evaluation of the AIGC Detectors' performance is based on the 13 variants of the dataset we have meticulously collected.
(2) Experimental Configuration: For the selected AIGC Detectors, we established a discriminative threshold of 0.5. This threshold serves as a criterion to distinguish AI-generated code samples. Specifically, if the output probability surpasses the 0.5 threshold, it signifies the presence of AI-generated content within the input sample.
(3) Performance Assessment Metrics: The performance of the detectors is measured using a range of metrics such as accuracy, precision, true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), and false negative rate (FNR). These metrics provide a comprehensive understanding of how effectively the detectors identify AI-generated code.

3.2 Data Collection
We gathered fundamental code problems along with corresponding solution code from various online sources, using existing datasets from Kaggle [40] and web scraping to crawl data from Quescol [39]. Kaggle is an online community of data scientists and machine learning engineers that allows users to find datasets used in building AI models, publish datasets, collaborate with others, and enter competitions to solve data science challenges. Quescol is an educational website that provides a large collection of previous-year and frequently examined questions and answers. For the AIGC dataset, we generated different variants of AI-generated code to explore different scenarios and identify situations in which students might be able to fool AIGC Detectors. The details about the different variants are discussed in Section 3.4.
For the compilation of the AIGC dataset, we developed an automation script leveraging the API service from OpenAI.2 The script uses the collected code problems as inputs, prompts ChatGPT to generate the outputs, and subsequently labels and stores the outputs as AI-generated code. The script was executed 13 times to generate the different code variants, such as removing stop words or adding dead code, ensuring that a dataset of initial size 70,966 was collected for thorough analysis.

1 https://figshare.com/articles/dataset/Replication_Package/24298036
2 https://platform.openai.com/
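To make the collection step above concrete, the following is a minimal sketch of such an automation script. It assumes the openai Python package (v1 interface), a hypothetical input file of problem descriptions, and an illustrative prompt template; it is a sketch of the approach, not the exact script used in the study.

```python
# Sketch of an AIGC collection loop (illustrative only; model name, prompt
# template, and file layout are assumptions, not the study's exact script).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_ai_solution(problem: str, variant_instruction: str = "") -> str:
    """Prompt ChatGPT with a coding problem (optionally modified by a variant)."""
    prompt = f"{problem}\n{variant_instruction}".strip()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    problems = json.load(open("coding_problems.json"))  # hypothetical input file
    dataset = []
    for problem in problems:
        code = generate_ai_solution(problem, "Please provide a Python solution.")
        dataset.append({"problem": problem, "code": code, "label": "AI"})
    json.dump(dataset, open("aigc_variant_1.json", "w"), indent=2)
```

Running such a loop once per variant prompt, as described above, yields one labelled AIGC dataset per variant.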

The coding problems and human-written code are obtained from three primary sources. 1) Python Coding Question [38]: this source from Quescol contributed 80 samples to our dataset, which include common Python interview questions. 2) Coding Problems and Solution Python Code [37]: this source from Kaggle contributes approximately 3,264 samples to our dataset, encompassing a wide range of Python code solutions to various coding problems. 3) LeetCode Solutions and Content KPIs [36]: this source from Kaggle contributes around 1,725 samples to our dataset, which includes solutions to coding problems from the popular online coding platform LeetCode, along with additional content and key performance indicators. The collected dataset comprises 5,069 samples, where each sample consists of a textual description of a coding problem and the corresponding solution code labeled as human-written code. The dataset covers diverse programming concepts, algorithms, and coding challenges. Based on the collected code problems, we have created 13 variants of AIGC content.

Algorithm 1 DynCodeMetrics: Dynamic Prompt Variation and Code Detection with Metric Calculation
Input:
  • P: Collection of Prompt Dataset
  • AIGCD: Artificial Intelligence Generated Content Detector
  • VarPrompt: Variant Modification
Output: R: Metric Results
 1: procedure DynCodeMetrics(P, AIGCD, VarPrompt)
 2:   V ← { }                      // Collection of various prompts
 3:   D ← { }                      // Collection of results: 1 (Human) and 0 (AI)
 4:   C ← { }                      // Collection of human- and AI-generated code
 5:   V ← CollectVariants(P, VarPrompt, V)
 6:   GenerateCodes(V, C)
 7:   C′ ← Post-process C          // Post-processing on generated code
 8:   DetectCodes(AIGCD, C′, D)
 9:   R ← Calculate accuracy, precision, recall, TPR, FPR based on D
10:   return R                     // Result of metrics
11: end procedure

Algorithm 2 CollectVariants: Generate Different Variations of Prompts
Input:
  • P: Collection of Prompt Dataset
  • VarPrompt: Variant Modification
  • V: Collection of various prompts
Output: V: Collection of various prompts
 1: procedure CollectVariants(P, VarPrompt, V)
 2:   for p ∈ P do
 3:     v ← Variant modification on p
 4:     Include v in the collection of various prompts V
 5:   end for
 6:   return V                     // Collection of various prompts
 7: end procedure

Algorithm 3 GenerateCodes: Collecting AI-Generated Code
Input:
  • V: Collection of various prompts
  • C: Data collection of human- and AI-generated code
Output: C: Modified data collection
 1: procedure GenerateCodes(V, C)
 2:   for v ∈ V do
 3:     C ← { }                    // Clear the data collection
 4:     for i ← 1 to N do          // Generate AI code
 5:       c ← AI code generation based on v
 6:       Include c in data collection C
 7:     end for
 8:   end for
 9:   return C                     // Code data collection
10: end procedure

Algorithm 4 DetectCodes: Code Detection Using AIGC Detectors
Input:
  • AIGCD: Artificial Intelligence Generated Content Detector
  • C′: Processed data collection of human- and AI-generated code
  • D: Collection of results: 1 (Human) and 0 (AI)
Output: D: Detection result collection
 1: procedure DetectCodes(AIGCD, C′, D)
 2:   for aigcd ∈ AIGCD do
 3:     D ← { }                    // Result collection of 1 (Human) and 0 (AI)
 4:     for c ∈ C′ do
 5:       d ← Code detection using aigcd
 6:       Include d in result collection D
 7:     end for
 8:   end for
 9:   return D                     // Detection result collection
10: end procedure

3.3 Workflow Explanation
To ensure the replicability and clarity of our empirical study procedure, we present it as a systematic, step-by-step set of algorithms that eliminates misinterpretation of the process shown in Figure 1. As shown in Algorithms 1 to 4, four procedures describe how the variant datasets are collected and the metrics are calculated:
• DynCodeMetrics: Algorithm 1, named DynCodeMetrics, plays a central role in coordinating the various sub-algorithms to streamline the workflow of dynamic prompt generation, AI-generated code post-processing, code detection, and metric computation. Taking inputs P (Prompt Datasets), AIGCD (AIGC Detector), and VarPrompt (Variant Modification Method), it initializes the collections V (Variant Prompts), D (AIGC Detector Results), and C (Human- and AI-Generated Code). The procedure first applies the CollectVariants method to modify prompts using the variants and stores them in V. Subsequently, it employs GenerateCodes to create AI-generated code based on the modified prompts and performs post-processing on C to clean and validate the dataset. The DetectCodes method then utilizes the AIGC Detectors to classify human- and AI-generated code, updating D accordingly. The algorithm concludes by calculating accuracy, precision, true positive rate (TPR), false positive rate (FPR), true negative rate (TNR), and false negative rate (FNR) based on the detection results, providing a comprehensive analysis. The calculated metrics, denoted as R, are returned as the output.
• CollectVariants: From Algorithm 2, CollectVariants takes three inputs: P (Collection of Prompt Datasets), VarPrompt (Variant Modification Method), and an empty collection V (collection of various prompts). It iterates over each coding problem in the dataset and applies the variant modification method to modify the prompt. The modified prompts are included in V (collection of various prompts), which is then returned.
• GenerateCodes: From Algorithm 3, GenerateCodes takes V (collection of prompts) and an empty collection C (collection of human- and AI-generated code) as input. It iterates over each variant prompt and generates code based on that prompt. The procedure generates the specified number of AI code samples for each variant prompt and includes them in the data collection. The data collection C (collection of human- and AI-generated code), containing code from both humans and AI, is returned.
• DetectCodes: From Algorithm 4, the DetectCodes procedure takes AIGCD (AIGC Detector), C′ (post-processed human- and AI-generated code), and an empty collection D (collection of AIGC Detector results) for storing the detection results as input. It iterates over each AIGC Detector and performs classification on each code sample in the post-processed data using the corresponding model. The detection results are then included in the result collection D (collection of AIGC Detector results).
Algorithm 1 thus combines Algorithms 2 to 4 to dynamically vary prompts, generate AI code, perform code detection, and calculate the relevant metrics. The resulting metrics can be used to evaluate the effectiveness and accuracy of each AIGC Detector.
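For readers who prefer running code over pseudocode, the following is a minimal Python sketch of the same workflow. It assumes a generation function like the one sketched in Section 3.2, a variant-modification function, and detector wrappers that return a probability of AI authorship; it mirrors Algorithms 1 to 4 rather than reproducing the study's actual scripts.

```python
# Illustrative orchestration of Algorithms 1-4 (helper functions are assumed, not the study's code).
from typing import Callable, Dict, List

def collect_variants(problems: List[str], modify: Callable[[str], str]) -> List[str]:
    # Algorithm 2: apply a variant modification to every prompt
    return [modify(p) for p in problems]

def generate_codes(prompts: List[str], generate: Callable[[str], str]) -> List[Dict]:
    # Algorithm 3: query the LLM once per variant prompt and label the output as AI (0)
    return [{"prompt": p, "code": generate(p), "label": 0} for p in prompts]

def detect_codes(samples: List[Dict], detectors: Dict[str, Callable[[str], float]],
                 threshold: float = 0.5) -> Dict[str, List[int]]:
    # Algorithm 4: each detector returns P(AI-generated); apply the 0.5 discriminative
    # threshold from Section 3.1, predicting 0 (AI) above it and 1 (human) otherwise.
    results = {}
    for name, detector in detectors.items():
        results[name] = [0 if detector(s["code"]) > threshold else 1 for s in samples]
    return results

def dyn_code_metrics(problems, modify, generate, detectors):
    # Algorithm 1: variant generation -> AI code -> post-processing -> detection
    prompts = collect_variants(problems, modify)
    samples = generate_codes(prompts, generate)
    samples = [s for s in samples if s["code"].strip()]  # simple post-processing step
    predictions = detect_codes(samples, detectors)
    return predictions  # metrics are computed from these predictions (see Section 3.5)
```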

3.4 Data Variations
Table 1 displays the dataset sizes used for each AIGC Detector across the various variants. The dataset sizes for Variants 2 and 3 (the specific details of each variant are discussed below) are smaller across all AIGC Detectors due to the removal of data resulting from invalid outputs by ChatGPT. Specifically, when applying the Sapling detector to Variant 2, the dataset size is further reduced because our initial approach imposed a minimum word requirement for code detection. Similarly, the DetectGPT detector excludes data with fewer than 100 words due to its minimum word requirement, leading to a reduced dataset size.
In the process of preparing our dataset, we introduced 13 variations of AI-generated code. These variations were achieved by altering the prompts provided to ChatGPT or the solution code received from ChatGPT. Our aim was to uncover the limitations and performance of each AIGC Detector under different variants of programming code. Detailed descriptions of each variation are available on Figshare.3
1. Without modification: Adapted the prompt based on Sam Altman's example.4 Prompted ChatGPT with unmodified code problems, adding preliminary conditions [8]. Mimics users' typical approach to resolving coding problems.
2. Removal of Stopwords: Preprocessed coding problems by removing common stopwords using the NLTK library. Aims to simulate users modifying questions to enhance output by eliminating stopwords.
3. Ask to Mimic Human: Appended "Please mimic a human response." to each prompt, guiding ChatGPT to generate contextually appropriate code that mimics human responses and enhancing its ability to mimic human-written code.
4. Solution Without Comment: Instructed ChatGPT to exclude comments in the solution, simulating cases where students omit comments, violating good programming practices.
5. Assertion Test Code: Instructed ChatGPT to include assertion test code, assessing AIGC Detectors' ability to identify syntactically valid code with test assertions, a common approach for verifying program correctness.
6. Solution with Test Case: Instructed ChatGPT to include test cases, evaluating AIGC Detectors' ability to identify syntactically valid code with included test cases, a common method for verifying program correctness.
7. Unittest Test Case: Instructed ChatGPT to include unittest test cases, evaluating AIGC Detectors' ability to identify syntactically valid code with included unittest test cases, a common approach for verifying program correctness.
8. Replace variable names: Using the AI-generated code from Variant 1, we used the AST library to replace all variable names with single-character letters (a to z); a sketch of this AST-based renaming appears after this list. Simulates cases where students violate basic programming rules in variable naming.
9. Replace function names: Using the AI-generated code from Variant 1, we developed a Python script with the AST library to replace all function names with single-character letters (a to z). Simulates cases where students violate basic programming rules in function naming.
10. Replace variable and function names: Using the AI-generated code from Variant 1, we developed a Python script with the AST library to replace all variable and function names with single-character letters (a to z). Simulates extreme cases where programmers violate basic naming conventions.
11. Long method: "Long method", a term associated with code smells, indicates design weaknesses that raise the risk of bugs. We asked ChatGPT for a longer output in function block format to simulate scenarios with the long method code smell, reducing code readability and maintainability.
12. Short method: Contrary to the long method variant, we prompted ChatGPT to generate a shorter output in function block format. Aims to simulate scenarios adhering to best programming practices, with concise methods focusing on a single functionality.
13. Adding 5 snippets of dead code: Injected 5 snippets of dead code into the output as noise for the AIGC Detector. Aims to explore the impact of dead code on the detector's performance, simulating scenarios where useless code is inadvertently included in the solution during development.

3 https://figshare.com/articles/dataset/Variant_Description/24265018
4 https://twitter.com/sama/status/1682826943312326659
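As a concrete illustration of Variant 2 and Variants 8 to 10, the sketch below shows one possible way to strip stopwords from a prompt with NLTK and to rename variables and functions with Python's ast module. The identifier pool, the built-in filter, and the function names are assumptions for illustration, not the study's exact scripts.

```python
# Illustrative variant transformations (assumed details; not the study's exact scripts).
import ast
import builtins
import string
from nltk.corpus import stopwords          # requires: nltk.download("stopwords")
from nltk.tokenize import word_tokenize    # requires: nltk.download("punkt")

def remove_stopwords(problem_text: str) -> str:
    """Variant 2: drop common English stopwords from a coding problem description."""
    stops = set(stopwords.words("english"))
    return " ".join(w for w in word_tokenize(problem_text) if w.lower() not in stops)

class Renamer(ast.NodeTransformer):
    """Variants 8-10: map variable and function names to single letters a-z."""
    def __init__(self):
        self.mapping = {}
        self.letters = iter(string.ascii_lowercase)  # a real script would handle >26 names

    def _new_name(self, old: str) -> str:
        if old not in self.mapping:
            self.mapping[old] = next(self.letters)
        return self.mapping[old]

    def visit_FunctionDef(self, node):
        node.name = self._new_name(node.name)   # rename the function itself
        self.generic_visit(node)                 # then its parameters and body
        return node

    def visit_arg(self, node):
        node.arg = self._new_name(node.arg)      # keep parameter uses consistent
        return node

    def visit_Name(self, node):
        if node.id not in dir(builtins):         # leave built-ins such as print/len alone
            node.id = self._new_name(node.id)
        return node

def rename_identifiers(source: str) -> str:
    tree = Renamer().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)                     # ast.unparse requires Python 3.9+
```

Applying rename_identifiers to a Variant 1 solution corresponds to Variant 10; restricting the transformer to only variable names or only function names corresponds to Variants 8 and 9, respectively.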

Table 1: Post-processed dataset size for each Detector across variants

Detector        V1      V2      V3      V4      V5      V6      V7      V8      V9      V10     V11     V12     V13
GLTR            5,069   5,065*  5,064*  5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,068*
Sapling         5,069   5,055*  5,064*  5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,068*
GPT Zero        5,069   5,065*  5,064*  5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,068*
GPT-2 Detector  5,069   5,065*  5,064*  5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,069   5,068*
DetectGPT       4,016*  4,674*  4,509*  4,097*  4,785*  4,702*  4,888*  3,932*  3,997*  3,952*  4,211*  3,925*  4,785*
Note: * indicates a deviation from the initial dataset size of 5,069.

Table 2: Confusion matrix used in this study (1 = human-written code, 0 = AI-generated code)

                        Actual HUMAN (1)        Actual AI (0)
Predicted HUMAN (1)     True Positive (TP)      False Positive (FP)
Predicted AI (0)        False Negative (FN)     True Negative (TN)

3.5 Evaluation Metrics
For this empirical study, we selected several metrics [23, 34] to evaluate the performance of the selected AIGC Detectors. Table 2 shows the confusion matrix used in this paper, where 1 indicates human-written code and 0 indicates AI-generated code.
TPR/Recall: True Positive Rate, also known as Recall, calculated as TPR = TP / (TP + FN), where TP is the number of human-written code samples correctly labelled as human-written, FN is the number of human-written code samples incorrectly labelled as AI-generated, and TP + FN is the total number of human-written code samples.
FNR: False Negative Rate, calculated as FNR = FN / (TP + FN), where FN is the number of human-written code samples incorrectly labelled as AI-generated, TP is the number of human-written code samples correctly labelled as human-written, and TP + FN is the total number of human-written code samples.
TNR: True Negative Rate, calculated as TNR = TN / (TN + FP), where TN is the number of AI-generated code samples correctly labelled as AI-generated, FP is the number of AI-generated code samples incorrectly labelled as human-written, and TN + FP is the total number of AI-generated code samples.
FPR: False Positive Rate, calculated as FPR = FP / (TN + FP), where FP is the number of AI-generated code samples incorrectly labelled as human-written, TN is the number of AI-generated code samples correctly labelled as AI-generated, and TN + FP is the total number of AI-generated code samples.
Accuracy (ACC): Calculated as ACC = (TP + TN) / (TP + TN + FP + FN), where TP is the number of human-written code samples correctly labelled as human-written, TN is the number of AI-generated code samples correctly labelled as AI-generated, and TP + TN + FP + FN is the total number of code samples.
Precision: Calculated as Precision = TP / (TP + FP), where TP is the number of human-written code samples correctly labelled as human-written, FP is the number of AI-generated code samples incorrectly labelled as human-written, and TP + FP is the total number of code samples labelled as human-written.
F1 Score: Calculated as F1 = 2 / (1/Recall + 1/Precision), the harmonic mean of precision and recall.
AUC: Area Under the Curve, used to evaluate how well a model can distinguish between classes. A higher AUC score indicates a better capability to distinguish between the positive and negative classes; an AUC score around 0.5 indicates that the model is essentially making random choices when predicting the samples.
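The sketch below illustrates how these metrics can be computed from detector outputs, assuming each detector returns a probability that a sample is AI-generated and using the 0.5 threshold described in Section 3.1; the variable names are illustrative and not taken from the study's scripts.

```python
# Illustrative metric computation (assumes probs are P(AI-generated); labels use 1 = human, 0 = AI).
from typing import List

def evaluate(probs_ai: List[float], labels: List[int], threshold: float = 0.5) -> dict:
    preds = [0 if p > threshold else 1 for p in probs_ai]   # 0 = AI, 1 = human
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    tn = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 0)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0              # TPR
    precision = tp / (tp + fp) if tp + fp else 0.0
    return {
        "ACC": (tp + tn) / len(labels),
        "TPR": recall,
        "FNR": fn / (tp + fn) if tp + fn else 0.0,
        "TNR": tn / (tn + fp) if tn + fp else 0.0,
        "FPR": fp / (tn + fp) if tn + fp else 0.0,
        "Precision": precision,
        "F1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }
```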
Figure 2: Accuracy Performance for DetectGPT

4 RESULTS
In this section, we showcase our experimental results and provide an analysis to address each research question. We have uploaded the full set of results to Figshare.5

4.1 RQ1: How accurate are existing AIGC Detectors at detecting AI-generated code?
Table 3 shows the performance of the GPT-2 Detector and GPTZero, where both detectors exhibit poor performance when compared with the baseline Variant 1. Their Accuracy (ACC) results hover around 0.5, suggesting a lack of effectiveness in distinguishing AI-generated code from human-written code. They tend to classify a significant portion of input code as human-written rather than machine-generated. This performance issue could stem from their primary training on natural language data or potential overfitting to natural language.
In the assessment of AIGC Detectors, both the GPT-2 Detector and GPTZero exhibited challenges in effectively distinguishing between human-written and AI-generated code across all evaluated variants. In Table 3, their performance is characterized by high TPR and low TNR while having an ACC of around 0.5, indicating a tendency to classify both human-written and AI-generated code as human-written.
DetectGPT, akin to the aforementioned AIGC Detectors, showcases an ACC near 0.5 in Figure 2, indicating its struggle to distinguish AI-generated from human-written code. Notably, DetectGPT tends to misclassify the majority of human-written code as AI-generated. However, a more positive aspect emerges when we compare DetectGPT on the baseline (Variant 1) against other variants such as Function Name (Variant 9), Variable and Function Name (Variant 10), Long Method (Variant 11), and Short Method (Variant 12).

5 https://figshare.com/articles/dataset/Variant_Result/24265015

Table 3: Accuracy, TPR, and TNR of all five AIGC Detectors

Detector        Metric  V1      V2      V3      V4      V5      V6      V7      V8      V9      V10     V11     V12     V13
GPT Zero        ACC     0.4971  0.4988  0.4986  0.4966  0.4969  0.4976  0.4989  0.5108  0.4972  0.4973  0.4970  0.4965  0.5824
                TPR     0.9927  0.9927  0.9927  0.9927  0.9927  0.9927  0.9927  0.9927  0.9927  0.9927  0.9927  0.9927  0.9927
                TNR     0.0016  0.0049  0.0045  0.0006  0.0012  0.0026  0.0051  0.0288  0.0018  0.0020  0.0014  0.0004  0.1721
GPT-2 Detector  ACC     0.5043  0.5005  0.4961  0.5013  0.4716  0.4788  0.4828  0.5134  0.5100  0.5252  0.4911  0.4958  0.4922
                TPR     0.9128  0.9127  0.9127  0.9128  0.9128  0.9128  0.9128  0.9128  0.9128  0.9128  0.9128  0.9128  0.9128
                TNR     0.0959  0.0883  0.0796  0.0898  0.0304  0.0448  0.0529  0.1140  0.1071  0.1375  0.0694  0.0787  0.0716
DetectGPT       ACC     0.4893  0.4742  0.4685  0.4941  0.4055  0.4014  0.5132  0.4943  0.5278  0.5125  0.5354  0.5153  0.4373
                TPR     0.2756  0.2993  0.2978  0.2824  0.3083  0.3056  0.3124  0.2693  0.2737  0.2677  0.2769  0.2662  0.3093
                TNR     0.7029  0.6491  0.6392  0.7059  0.5028  0.4972  0.7140  0.7192  0.7818  0.7573  0.7939  0.7643  0.5653
GLTR            ACC     0.5040  0.4936  0.4841  0.6569  0.6999  0.6920  0.7693  0.4908  0.4952  0.4881  0.5375  0.5020  0.6478
                TPR     0.6461  0.6464  0.6461  0.6461  0.6461  0.6461  0.6461  0.6461  0.6461  0.6461  0.6461  0.6461  0.6460
                TNR     0.3620  0.3408  0.3221  0.6678  0.7538  0.7378  0.8925  0.3356  0.3442  0.3300  0.4289  0.3579  0.6496
Sapling         ACC     0.6056  0.5961  0.6031  0.6048  0.5425  0.5828  0.6083  0.6630  0.6258  0.6811  0.6187  0.6059  0.6528
                TPR     0.4780  0.4797  0.4783  0.4780  0.4780  0.4780  0.4780  0.4780  0.4780  0.4780  0.4780  0.4780  0.4781
                TNR     0.7333  0.7126  0.7279  0.7315  0.6070  0.6875  0.7386  0.8479  0.7735  0.8842  0.7593  0.7339  0.8275
Note: green indicates an increase compared to Variant 1, while red indicates a decrease.

Figure 3: TPR and TNR Performance for GLTR
Figure 4: TPR and TNR Performance for Sapling

On these variants, DetectGPT demonstrates an improvement in TNR ranging from 5% to 9%. This improvement signifies its competence in accurately identifying AIGC.
The ACC performance of GLTR exhibits variability across the different AIGC variants. Approximately half of the variants surpass the 0.6 Accuracy (ACC) threshold, while the remainder sit closer to a more modest 0.5. In our evaluation of GLTR (Figure 3), we conducted a comparative analysis involving the baseline Variant 1 and several other variants, including "No Comment" (Variant 4), "Assertion" (Variant 5), "Test Case" (Variant 6), "Unittest Test Case" (Variant 7), and "Dead Code" (Variant 13). The aim was to assess the system's performance in detecting GPT-generated code as AI-generated content.
The results uncovered a significant enhancement in GLTR's ability to identify AI-generated code when compared to the baseline variant. This improvement ranged from 18% to 53% across this diverse set of variants, suggesting that these variants, namely "No Comment", "Assertion", "Test Case", "Unittest Test Case", and "Dead Code", substantially bolstered the system's proficiency in AI-generated code detection. Additionally, it is noteworthy that GLTR exhibited a 6% increase in TNR when evaluating the "Long Method" variant (Variant 11). This improvement signifies an augmented capability to accurately categorize AI-generated code, particularly in the context of extensive and intricate code segments.
The Sapling detector outperforms the other AIGC Detectors, consistently achieving ACC values above 0.6 in ten out of the thirteen AIGC variants. In our evaluation of Sapling's AI Detector (Figure 4), we compared its performance to the baseline (Variant 1) across the various code variants. Our findings highlight significant improvements, particularly in TNR, when the Sapling detector is applied to specific variants such as "Variable Name" (Variant 8), "Variable and Function Name" (Variant 10), and "Dead Code" (Variant 13). These variants exhibited a substantial increase in TNR, ranging from 9% to 15%, indicating an enhanced ability to detect AI-generated code accurately.

Answer to RQ1: Existing AIGC Detectors perform poorly in distinguishing between human-written code and AI-generated code, indicating the inherent weaknesses of current detectors. This underscores the need for further research and development in this domain to enhance their efficacy.

4.2 RQ2: What are the limitations of existing AIGC Detectors when it comes to detecting AI-generated code?

From the results in Table 3, we have revealed significant sensitivity to code variants, particularly with GLTR, as demonstrated by a paired samples t-test [14] comparing ACC differences between the original prompts (Variant 1) and all other variants (Variants 2 to 13). The p-values for the five AIGC Detectors are 0.0296 (GLTR), 0.1032 (GPT-2 Detector), 0.2479 (GPTZero), 0.3816 (Sapling), and 0.5714 (DetectGPT). GLTR demonstrated the lowest p-value, signifying a significant performance difference across the variants. This underscores GLTR's heightened sensitivity to specific variant patterns, resulting in misclassifications, as evident in its wide range of accuracy, from 0.4841 (Variant 3) to 0.7693 (Variant 7). This finding highlights a potential vulnerability of AIGC Detectors, particularly when students prompt ChatGPT to mimic a human's response when generating code, as GLTR tends to treat most AIGC incorrectly as human-written, potentially enabling academic plagiarism. Moreover, Variants 2, 3, and 5 induced a drop in performance (both ACC and TNR) across multiple models when compared to Variant 1, underlining the need for further research to address these specific vulnerabilities and ensure the integrity of AI-generated content detection methods in educational settings.
After a thorough analysis of each AIGC Detector's performance across each variant, we discuss some of the limitations that we observed for each AIGC Detector.
1. GPTZero Limitations: GPTZero may struggle to accurately detect AI-generated code due to the distinct syntax and structure of programming languages. Code follows strict rules and conventions, deviating from the linguistic patterns GPTZero is designed to identify. This mismatch can result in limited effectiveness, leading to potential false positives or false negatives when applied to code detection tasks.
2. Sapling Limitations: Sapling is designed for analyzing text generated by language models such as GPT-3 and ChatGPT, and it excels at identifying such content. However, it may not be optimized for detecting programming code, as AI-generated code can have distinct characteristics not aligned with natural language patterns. The limitations of the Sapling AI Detector include its focus on language-model-generated text and potential challenges in code detection.
3. GPT-2 Detector Limitations: The GPT-2 Detector relies on a dataset of text generated by GPT-2 models, making it effective for identifying GPT-2 outputs. However, it may struggle with other AI models or with programming code, as their distinct characteristics might not be represented in that dataset. The detector's ability to detect code from different models is limited, given the variability in the complexity and structure of programming code.
4. DetectGPT Limitations: DetectGPT employs perturbations of the input text for text classification but may face limitations in detecting programming code. The intricate logic and syntax of code may not exhibit the same perturbation patterns as natural language, leading to challenges in adaptability and the potential for false or missed detections in this context.
5. GLTR Limitations: GLTR, relying on models such as GPT-2, enhances text interpretability through word rankings and color coding. However, its limitations are apparent in programming code detection, where the distinctive patterns and nuances of code are not effectively captured by these methods. GLTR may therefore struggle to provide accurate detections for programming code.
Our findings indicate a common limitation shared among current AIGC Detectors in their ability to detect AI-generated code: they consistently perform poorly. In our comprehensive assessment of the various AI detection tools, we observed a lack of accuracy and reliability in their capacity to identify AI-generated code. These limitations encompass several key areas:
1. Detection Accuracy: Across the board, these tools struggled to accurately identify AI-generated code, often producing high rates of false negatives and false positives, leading to inaccuracies.
2. Lack of Specificity: Existing tools encountered difficulties distinguishing between human-written and AI-generated code, resulting in misclassifications and a lack of precision.
3. Generalization Challenges: While fine-tuning improved performance within specific code domains, these tools faced difficulty in generalizing their capabilities. They often faltered when confronted with code from diverse domains or when AI-generated code closely resembled human coding styles.

Answer to RQ2: The limitations of AIGC Detectors such as GPTZero, the Sapling AI Detector, the GPT-2 Detector, DetectGPT, and GLTR become evident when applied to the detection of AI-generated code. Variants 2, 3, and 5 enable students to deceive most models, leading to a significant decrease in accuracy and TNR across the majority of models. These limitations stem from the fundamental differences between programming code and natural language text. Code adheres to strict rules, follows distinct patterns, and may not exhibit the linguistic characteristics that these models are designed to detect.
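The paired samples t-test reported at the start of this section can be set up in outline as follows; the sketch assumes per-variant accuracies arranged as in Table 3 (GLTR's values are used as an example) and relies on SciPy's ttest_rel, a standard implementation of this test. It is one plausible arrangement of the comparison, not a verbatim reproduction of the study's analysis script.

```python
# Illustrative paired samples t-test over per-variant accuracies (values taken from Table 3 for GLTR).
from scipy.stats import ttest_rel

baseline_acc = [0.5040] * 12  # GLTR accuracy on Variant 1, paired with each other variant
variant_acc = [0.4936, 0.4841, 0.6569, 0.6999, 0.6920, 0.7693,
               0.4908, 0.4952, 0.4881, 0.5375, 0.5020, 0.6478]  # Variants 2-13

t_statistic, p_value = ttest_rel(baseline_acc, variant_acc)
print(f"t = {t_statistic:.4f}, p = {p_value:.4f}")  # a small p-value indicates variant-sensitive accuracy
```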

5 DISCUSSION

5.1 Suggestions to SE and CS Educators
AIGC has emerged as a transformative force in education, particularly in the domains of CS/SE. This section aims to distill insights from recent research papers to provide educators with strategies and best practices for the successful integration of AI into CS/SE education.
Researchers from Hong Kong introduced the IDEE framework, which serves as a guiding framework for utilizing generative AI in education [29]. This framework emphasizes the importance of identifying desired outcomes, determining the appropriate level of automation, ensuring ethical considerations, and evaluating effectiveness. Moreover, the work by Kaplan et al. explores teachers' perspectives on generative AI technology and its potential implementation in education [21]. The authors suggest that teachers exhibit positive perspectives towards generative AI, with more frequent usage leading to increased positivity. Educators perceive generative AI as a tool for enhancing professional development and student learning. This suggests that embracing generative AI in education has the potential to positively impact both educators and students, fostering continuous growth and improved learning experiences.
On the other hand, the work by Chan et al. delves into the experiences and perceptions of Gen Z students and Gen X/Y teachers regarding the use of generative AI in higher education [12]. Students express optimism about the benefits of generative AI, while teachers emphasize the need for guidelines and policies to ensure responsible use. Furthermore, an AI Ecological Education Policy Framework for higher education has been proposed [11]. It addresses three dimensions, Pedagogical, Governance, and Operational, providing a comprehensive structure to navigate the implications of AI integration.
Based on the literature and research findings, it is imperative for educators to acknowledge the limitations inherent in current AIGC detectors, especially when dealing with code-based content. Our study emphasizes the critical need for ongoing exploration and advancement in specialized tools, algorithms, and frameworks. This becomes particularly evident when considering the predominant use of existing AIGC detectors for text-based content. Our results highlight a significant gap in their applicability to code-based AIGC, calling for the development of more refined tools and algorithms tailored to this specific context. This underscores the importance of staying abreast of technological advancements to ensure the effective detection and evaluation of AIGC within educational materials. In the absence of dedicated detectors tailored for code-based AIGC, we propose several key recommendations for educators:
1. Define Objectives: Precisely outline the educational objectives that are in harmony with the application of generative AI. This step ensures a seamless integration of technology into educational purposes, fostering alignment between technology use and educational aspirations. For example, in a Python programming course, educators could utilize generative AI to enhance students' algorithmic creativity. Objectives might include the collaborative design and optimization of sorting algorithms using AI assistance, fostering a deeper understanding of algorithmic principles.
2. Automation Level: When incorporating generative AI into education, it is essential to carefully consider the level of automation to employ. This decision hinges on whether to pursue full automation, where AI systems handle educational tasks entirely on their own, or to opt for a supplementary approach that blends AI capabilities with human involvement. This choice plays a pivotal role in shaping the educational landscape, as it determines the extent to which technology is integrated into the learning process, aligning with the unique goals and requirements of each educational scenario. Take, for instance, a data science course: when integrating a code generation AI tool, consider the automation level. Decide whether to fully automate coding assignments, letting AI generate solutions, or to adopt a supplementary approach that blends AI capabilities with human involvement. This choice influences how students learn programming, aligning with the curriculum's goals, whether emphasizing algorithmic logic through manual coding or leveraging AI for rapid prototyping and problem-solving.
3. Ethical Focus: Give paramount importance to ethical concerns. Develop comprehensive guidelines and policies to safeguard the responsible and ethical usage of AI in the educational context. Educators can, for example, develop guidelines to ensure responsible and ethical usage of AI algorithms in assignments, particularly those involving sensitive data. This may involve creating policies addressing issues such as data privacy, bias mitigation, and transparency, fostering a learning environment that upholds ethical standards in AI applications.
4. Continuous Evaluation: It is essential to maintain an ongoing process of assessing the effectiveness of generative AI in education. This involves systematically monitoring how well the technology serves educational objectives, identifying areas for improvement, and adapting strategies as needed, ultimately contributing to improved outcomes and enhanced educational experiences. For example, in a data science course using generative AI for predictive modelling, maintain ongoing assessment: systematically monitor how well the technology contributes to achieving predictive modelling objectives, identify areas for improvement, and adapt strategies as needed, ultimately enhancing students' data science skills.
5. Comprehensive Policies: In the integration of generative AI within educational settings, it is imperative to develop comprehensive guidelines and policies based on empirical evidence and best practices. By adopting an evidence-based approach, educational institutions can ensure that their policies are not only robust and compliant but also rooted in real-world outcomes and experiences, safeguarding both the educational mission and the well-being of all stakeholders. In a Software Engineering curriculum incorporating a code generation AI tool, establish comprehensive policies: develop guidelines based on empirical evidence and best practices to govern the use of AI-generated code in assignments. These policies may address code review processes, plagiarism detection, and the ethical implications of AI-assisted programming, ensuring a fair, transparent, and secure learning environment for both students and instructors.
6. Stay Informed: Promote ongoing research and evaluation of AI integration in educational settings to stay informed about the advancements, benefits, and risks associated with AI technology. For instance, in a university's Computer Science department, educators can actively engage in continuous research on AI integration in coursework. This commitment ensures they stay informed about the latest advancements, benefits, and potential risks associated with AI technology, allowing for adaptive and optimized teaching methods based on the field's latest insights.

5.2 Threats to Validity
5.2.1 Internal Validity. The study faces challenges related to the varied prompts used to generate AIGC with ChatGPT. While these prompts aim to simulate average user inputs, they may not fully mirror real-world scenarios. Additionally, ChatGPT's non-deterministic nature, leading to diverse responses from the same prompt, could impact result reproducibility.
Another challenge pertains to verifying the authenticity of source code written by humans. For instance, datasets sourced from platforms such as Kaggle could include content generated by LLM tools, blurring the distinction between human and AI contributions. However, since these datasets were released to the public before LLM models went mainstream, the impact is estimated to be minimal.
In specific scenarios, vague queries leading to responses such as "I'm sorry, as an AI language model, I am unable to provide code as the question lacked specific details about the expression to be evaluated" pose a risk to internal validity. Ambiguous or open-ended coding questions can result in inaccurate AIGC. Therefore, the datasets of question prompts underwent preprocessing. During this process, rows containing responses such as the one mentioned were removed, ensuring the integrity and accuracy of the dataset used in our study.

concerns about potential biases arising from platforms such as and the development of ethical guidelines to ensure responsible
Kaggle, where datasets are curated and shared, and ChatGPT may AI-generated content use in education.
have been trained on datasets similar to the one used in this study.
However, understanding ChatGPT’s ability to generate diverse 7 CONCLUSION AND FUTURE WORK
and contextually relevant content, rather than simply replicate The rise of generative AI models presents both opportunities and
existing snippets, this research maintains objectivity and rigor. challenges, particularly in education and programming. Our study
These insights allow for a nuanced approach, ensuring the study’s aimed to assess AI content detection tools’ effectiveness in identi-
authenticity and safeguarding the integrity of the research findings. fying AI-generated code, revealing their limitations and offering
insights for educators and students. We examined various AI con-
5.2.3 External Validity. It is crucial to acknowledge that the conclu- tent detection models using a dataset containing human-written
sions drawn in this study might be specific to the datasets analyzed. and AI-generated code. Our evaluation focused on metrics such as
To enhance the generalizability of our study, we meticulously cu- recall, precision, F1 score, accuracy, and AUC.
rated our dataset by gathering data from diverse sources, aiming to We have revealed significant sensitivity in GLTR, as indicated
replicate real-world software development scenarios as closely as by its remarkably low p-value of 0.0296 in the Samples Paired t
possible. Furthermore, we concentrated our efforts on the Python Test. This sensitivity led to notable accuracy discrepancies ranging
programming language, a deliberate choice made to maintain con- from 0.4841 to 0.7693, exacerbated by specific code variants such as
sistency and control within our study. variants 2, 3, and 5. These findings underscore the imperative for im-
mediate research efforts to enhance the reliability of AI-generated
content detectors, safeguarding academic integrity in educational
6 RELATED WORK contexts. Addressing these challenges is pivotal for educators and
This section discusses several studies and research articles that institutions to adeptly navigate the complexities introduced by AI-
explore the detection of AI-generated content, particularly in aca- generated code, ensuring the integrity of programming education.
demic and educational contexts. Our findings suggest that these tools hold promise in distinguish-
The study by Otterbacher [27] highlights the need to develop ing AIGC from human-written code but face challenges due to code
a culture that promotes responsible and ethical use of generative complexity and writing style variations. Ethical guidelines for AIGC
AI in various domains, including science and education. Rather integration into education are essential. Moreover, we have found
than relying solely on technical solutions to combat AI-generated that GLTR is very sensitive to different variants introduced into the
content, the authors argue for a holistic approach that considers AIGC produced by ChatGPT, proven by the smallest p-value when
the broader implications of AI in academia. On the other hand, the the Samples Pair t-Test is applied to the AIGC Detector’s accuracies
work by Chaka [10] evaluated AI content detection tools, highlight- and the wide range of the accuracy.
ing their limitations in accurately detecting AI-generated content, Educators and institutions should adopt strategies for responsi-
especially in academic contexts. These limitations can lead to aca- ble AI usage, considering curriculum design, assessment methods,
demic integrity issues, including plagiarism. Chaka’s study [10] and ethical directives. Additionally, the long-term impact of AIGC
also emphasizes the need for ongoing research to evaluate the accu- Detector tools on student skill development and creativity in pro-
racy and reliability of AI content detection tools. Improving these gramming warrants further exploration. In conclusion, our research
We have revealed significant sensitivity in GLTR, as indicated by its remarkably low p-value of 0.0296 in the paired-samples t-test. This sensitivity led to notable accuracy discrepancies ranging from 0.4841 to 0.7693, exacerbated by specific code variants such as variants 2, 3, and 5. These findings underscore the imperative for immediate research efforts to enhance the reliability of AI-generated content detectors, safeguarding academic integrity in educational contexts. Addressing these challenges is pivotal for educators and institutions to adeptly navigate the complexities introduced by AI-generated code, ensuring the integrity of programming education.
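As an illustration of this statistical procedure, the sketch below applies a paired-samples t-test (scipy.stats.ttest_rel) under one plausible setup: pairing a detector's accuracies on the same groups of problems before and after a code variant is applied. The accuracy values are invented for demonstration and do not reproduce the 0.0296 result reported above.

from scipy.stats import ttest_rel

# Hypothetical per-group accuracies for one detector (illustrative only).
# The i-th entry in both lists refers to the same group of problems,
# which is what makes the samples "paired".
acc_on_original_code = [0.76, 0.74, 0.77, 0.75, 0.73]
acc_on_variant_code  = [0.52, 0.49, 0.61, 0.55, 0.50]

t_stat, p_value = ttest_rel(acc_on_original_code, acc_on_variant_code)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # a small p-value indicates a systematic accuracy shift

A small p-value here indicates that the accuracy shift induced by the variants is unlikely to be due to chance, which is how the sensitivity of GLTR is characterized in this study.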
Our findings suggest that these tools hold promise in distinguishing AIGC from human-written code but face challenges due to code complexity and writing style variations. Ethical guidelines for AIGC integration into education are essential. Moreover, we have found that GLTR is very sensitive to the different variants introduced into the AIGC produced by ChatGPT, as evidenced by the smallest p-value when the paired-samples t-test is applied to the AIGC Detectors' accuracies, and by its wide accuracy range.
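For readers unfamiliar with GLTR, its core signal is the rank that a language model assigns to each observed token: machine-generated text tends to consist mostly of highly ranked (highly predictable) tokens. The following sketch approximates that statistic using the Hugging Face transformers and torch packages with the public gpt2 checkpoint; the helper name top_k_fraction and the code snippet are our own illustrative choices, and this is not the detector configuration evaluated in this study.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def top_k_fraction(text: str, k: int = 10) -> float:
    """Fraction of tokens whose rank under the model falls within the top k."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits            # shape: (1, seq_len, vocab)
    preds = logits[0, :-1]                    # prediction for token t comes from position t-1
    targets = ids[0, 1:]
    # Rank of each observed token in the model's sorted vocabulary distribution.
    ranks = (preds.argsort(dim=-1, descending=True) == targets.unsqueeze(-1)).nonzero()[:, 1]
    return (ranks < k).float().mean().item()

snippet = "def add(a, b):\n    return a + b"   # hypothetical code sample
print("top-10 fraction:", round(top_k_fraction(snippet), 2))

A higher top-k fraction suggests more predictable, machine-like text; code variants that alter identifiers or structure can shift this statistic, which is consistent with the sensitivity observed for GLTR.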
Educators and institutions should adopt strategies for responsible AI usage, considering curriculum design, assessment methods, and ethical directives. Additionally, the long-term impact of AIGC Detector tools on student skill development and creativity in programming warrants further exploration. In conclusion, our research highlights the evolving landscape of AI-generated code and the role of AIGC Detectors in education. Emphasizing responsible AI usage, ethical guidelines, and ongoing tool refinement can empower students in a technology-driven world while preserving academic integrity and fostering creativity.

Future research should prioritize enhancing AI content detection models to effectively handle a wider range of code variations and writing styles, bolstering their reliability in educational contexts. Investigating the long-term impact of AIGC Detector tools on students' learning and creative engagement in programming can provide valuable insights for educators. Additionally, exploring the adaptability of AI-driven content detection models to architectural design, software design, and UML diagrams will contribute to a more comprehensive understanding of their applicability across diverse domains, fostering the development of improved educational tools and strategies.

8 DATA AVAILABILITY
The replication package, along with the associated data, has been made publicly available at https://figshare.com/articles/dataset/Replication_Package/24298036.
REFERENCES
[1] [n. d.]. GPTzero. https://gptzero.me/
[2] [n. d.]. Sapling. https://sapling.ai/ai-content-detector [Online]. Available.
[3] Adnan Al Medawer. [n. d.]. Textual Analysis and Detection of AI-Generated Academic Texts. ([n. d.]).
[4] Fawad Ali. 2023. GPT-1 to GPT-4: Each of OpenAI's GPT Models Explained and Compared. (11 April 2023). https://www.makeuseof.com/gpt-models-explained-and-compared/
[5] David Baidoo-Anu and Leticia Owusu Ansah. 2023. Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of ChatGPT in promoting teaching and learning. Journal of AI 7, 1 (2023), 52–62.
[6] Aras Bozkurt. 2023. Generative artificial intelligence (AI) powered conversational educational agents: The inevitable paradigm shift. Asian Journal of Distance Education 18, 1 (2023).
[7] BurhanUlTayyab. 2023. DetectGPT. https://github.com/BurhanUlTayyab/DetectGPT.
[8] Ralph Cajipe. 2023. chatgpt-prompt-engineering. https://github.com/ralphcajipe/chatgpt-prompt-engineering/blob/main/1-guidelines.ipynb.
[9] Christoph C. Cemper. 2023. AI cheats - how to trick AI Content Detectors. https://www.linkresearchtools.com/blog/ai-content-detector-cheats/
[10] Chaka Chaka. 2023. Detecting AI content in responses generated by ChatGPT, YouChat, and Chatsonic: The case of five AI content detection tools. Journal of Applied Learning and Teaching 6, 2 (2023).
[11] Cecilia Ka Yuk Chan. 2023. A comprehensive AI policy education framework for university teaching and learning. International Journal of Educational Technology in Higher Education 20, 1 (2023), 1–25.
[12] Cecilia Ka Yuk Chan and Katherine KW Lee. 2023. The AI generation gap: Are Gen Z students more interested in adopting generative AI such as ChatGPT in teaching and learning than their Gen X and Millennial Generation teachers? arXiv preprint arXiv:2305.02878 (2023).
[13] Hailin Chen, Fangkai Jiao, Xingxuan Li, Chengwei Qin, Mathieu Ravaut, Ruochen Zhao, Caiming Xiong, and Shafiq Joty. 2023. ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up? arXiv preprint arXiv:2311.16989 (2023).
[14] Frances Chumney. 2018. PAIRED SAMPLES t & WILCOXON SIGNED RANKS TESTS. Retrieved January 24, 2022.
[15] Damian Okaibedi Eke. 2023. ChatGPT and the rise of generative AI: Threat to academic integrity? Journal of Responsible Technology 13 (2023), 100060.
[16] Tom Farrelly and Nick Baker. 2023. Generative artificial intelligence: Implications and considerations for higher education practice. Education Sciences 13, 11 (2023), 1109.
[17] Sebastian Gehrmann, Hendrik Strobelt, and Alexander Rush. 2019. GLTR: Statistical Detection and Visualization of Generated Text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, Florence, Italy, 111–116. https://doi.org/10.18653/v1/P19-3019
[18] Simone Grassini. 2023. Shaping the future of education: exploring the potential and consequences of AI and ChatGPT in educational settings. Education Sciences 13, 7 (2023), 692.
[19] Biyang Guo, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection. arXiv preprint arXiv:2301.07597 (2023).
[20] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2023. Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620 [cs.SE]
[21] Regina Kaplan-Rakowski, Kimberly Grotewold, Peggy Hartwick, and Kevin Papin. 2023. Generative AI and Teachers' Perspectives on Its Implementation in Education. Journal of Interactive Learning Research 34, 2 (2023), 313–338.
[22] Tetyana Tanya Krupiy. 2020. A vulnerability analysis: Theorising the impact of artificial intelligence decision-making processes on individuals, society and human diversity from a social justice perspective. Computer Law & Security Review 38 (2020), 105429.
[23] Ajay Kulkarni, Deri Chong, and Feras A Batarseh. 2020. Foundations of data imbalance and solutions for a data democracy. In Data Democracy. Elsevier, 83–106.
[24] Claudio Mirolo, Cruz Izu, Violetta Lonati, and Emanuele Scapin. 2022. Abstraction in Computer Science Education: An Overview. Informatics in Education 20, 4 (2022), 615–639.
[25] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. 2023. DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature. In Proceedings of the 40th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 202), Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (Eds.). PMLR, 24950–24962. https://proceedings.mlr.press/v202/mitchell23a.html
[26] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435 (2023).
[27] Jahna Otterbacher. 2023. Why technical solutions for detecting AI-generated content in research and education are insufficient. Patterns 4, 7 (2023).
[28] Hendrik Strobelt, Sebastian Gehrmann, and Alexander Rush. [n. d.]. Catching a Unicorn with GLTR: A tool to detect automatically generated text. Collaboration of MIT-IBM Watson AI lab and HarvardNLP. http://gltr.io/
[29] Jiahong Su and Weipeng Yang. 2023. Unlocking the power of ChatGPT: A framework for applying generative AI in education. ECNU Review of Education (2023), 20965311231168423.
[30] Teo Susnjak. 2022. ChatGPT: The end of online exam integrity? arXiv preprint arXiv:2212.09292 (2022).
[31] Zachari Swiecki, Hassan Khosravi, Guanliang Chen, Roberto Martinez-Maldonado, Jason M Lodge, Sandra Milligan, Neil Selwyn, and Dragan Gašević. 2022. Assessment in the age of artificial intelligence. Computers and Education: Artificial Intelligence 3 (2022), 100075.
[32] Chip Thien. 2023. gpt-2-output-dataset. https://github.com/MacroChip/gpt-2-output-dataset
[33] Levent Uzun. 2023. ChatGPT and academic integrity concerns: Detecting artificial intelligence generated content. Language Education and Technology 3, 1 (2023).
[34] Ž Vujović et al. 2021. Classification model evaluation metrics. International Journal of Advanced Computer Science and Applications 12, 6 (2021), 599–606.
[35] Jian Wang, Shangqing Liu, Xiaofei Xie, and Yi Li. 2023. Evaluating AIGC Detectors on Code Content. arXiv preprint arXiv:2304.05193 (2023).
[36] www.kaggle.com. 2023. Leetcode Solutions and Content KPIs. https://www.kaggle.com/datasets/jacobhds/leetcode-solutions-and-content-kpis Last accessed on May 16, 2023.
[37] www.kaggle.com. 2023. Natural Language to Python Code. https://www.kaggle.com/datasets/linkanjarad/coding-problems-and-solution-python-code Last accessed on May 16, 2023.
[38] www.quescol.com. 2023. Python Coding Question: 90+ Python Interview Coding Questions. https://quescol.com/interview-preparations/python-coding-question#google_vignette Last accessed on May 16, 2023.
[39] www.quescol.com. 2023. Quescol - A Platform That Provides Previous Year Questions And Answers. https://quescol.com/ Last accessed on Dec 23, 2023.
[40] www.wikipedia.org. 2023. Kaggle. https://en.wikipedia.org/wiki/Kaggle Last accessed on Dec 23, 2023.
[41] Franco Zambonelli and H Van Dyke Parunak. 2002. Signs of a revolution in computer science and software engineering. In International Workshop on Engineering Societies in the Agents World. Springer, 13–28.