MAGECODE Machine-Generated Code Detection Method Using Large Language Models
ABSTRACT The widespread use of virtual assistants (e.g., GPT4 and Gemini, etc.) by students in their
academic assignments raises concerns about academic integrity. Consequently, various machine-generated
text (MGT) detection methods, developed from metric-based and model-based approaches, were proposed
and shown to be highly effective. The model-based MGT methods often encounter difficulties when dealing
with source code due to disparities in semantics compared to natural languages. Meanwhile, the efficacy
of metric-based MGT methods on source code has not been investigated. Moreover, the challenge of
identifying machine-generated codes (MGC) has received less attention, and existing solutions demonstrate
low accuracy and high false positive rates across diverse human-written codes. In this paper, we take into
account both semantic features extracted from Large Language Models (LLMs) and the applicability of
metrics (e.g., Log-Likelihood, Rank, Log-rank, etc.) for source code analysis. Concretely, we propose
MageCode, a novel method for identifying machine-generated codes. MageCode utilizes the pre-trained
model CodeT5+ to extract semantic features from source code inputs and incorporates metric-based
techniques to enhance accuracy. In order to assess the proposed method, we introduce a new dataset
comprising more than 45,000 code solutions generated by LLMs for programming problems. The solutions
to these programming problems, obtained from three advanced LLMs (GPT4, Gemini, and
Code-bison-32k), were written in Python, Java, and C++. The evaluation of MageCode on this dataset
demonstrates superior performance compared to existing baselines, achieving up to 98.46% accuracy while
maintaining a low false positive rate of less than 1%.
INDEX TERMS Machine-generated code detection, large language model, metrics, CodeT5+.
methods when performed on source code has been demonstrated to be limited, the applicability of metric-based methods on source code remains unexplored. Meanwhile, existing machine-generated code detection methods, such as DetectGPT4Code [4] and AIGCode Detector [5], have not achieved high accuracy, nor have they been evaluated in high-stakes scenarios, where a false positive rate below 1% is necessary to minimize the risk of incorrect code removal and ensure that appropriate actions are taken.

In this paper, we aim to bridge the gap by developing a novel, effective method to detect machine-generated codes in educational environments. The research focuses on detecting code solutions for programming problems rather than source code in large-scale software projects. This paper first thoroughly evaluates the performance of six metric-based machine-generated text detection methods when applied to source code to find highly adaptable metrics in source code analysis. Subsequently, we propose a novel model-based detector that utilizes the pre-trained model CodeT5+ [6] to extract semantic features from source code inputs and combines them with appropriate statistical metrics to detect machine-generated code. Extensive experiments are conducted to assess its effectiveness and compare it to current baseline methods.

To the best of our knowledge, there was no public dataset available for our experiments at the time of writing. Therefore, we constructed a new dataset containing both human-written and machine-generated source code, which are solutions for a set of programming problems. Machine-generated codes were obtained by querying three popular LLMs in code generation, namely ChatGPT [7], Gemini [8], and Code-bison-32k [9], with the descriptions of programming problems. The response codes from these LLMs were then tested against test cases to ensure their quality before being included in the final dataset. The final dataset contains more than 45,000 machine-generated code snippets written in three programming languages that are popular in education environments: Python, Java, and C++. Our experiments were then conducted on this newly constructed dataset.

To summarize, our work provides the following contributions:
• Introducing a new dataset for the machine-generated code detection problem, including over 45,000 source code samples produced by three well-known large language models: GPT-4-turbo, Gemini-1.0-pro, and Code-bison-32k. The dataset consists of examples of human-written and machine-generated source code in three programming languages: Python, Java, and C++. The dataset was published to Hugging Face1 for the research community.
• Evaluating the effectiveness of metric-based detection methods commonly used for detecting machine-generated text across three code datasets of Python, Java, and C++.
• Proposing a novel machine-generated code detection method that utilizes the pre-trained encoder-only CodeT5+ model integrated with highly applicable statistical metrics in source code analysis.

The remainder of this paper is organized as follows: Section II provides an overview of leading LLMs employed in code generation and explores existing detection methods, including those for identifying machine-generated text in general and specifically those designed for detecting machine-generated code. Section III discusses in detail our proposed method. The dataset construction and specification are described in Section IV. Section V presents the experimental results. The paper concludes with Section VI, which summarizes our findings.

1 https://huggingface.co/datasets/HungPhamBKCS/magecode-dataset

II. BACKGROUND AND RELATED WORKS
A. ADVANCED LARGE LANGUAGE MODELS
This section explores the code generation capabilities of three Large Language Models (LLMs): GPT4, Gemini, and Code-bison-32k.

GPT4 [7] is a large multimodal model that can accept image and text inputs and produce text outputs. It is built upon the Transformer architecture to predict the next token in a text sequence. GPT4 outperforms both previous large language models and most state-of-the-art (SOTA) systems on a suite of traditional NLP benchmarks [7]. On the HumanEval benchmark, GPT4 achieves 67%, compared to 65.8% for the previous SOTA.

Gemini [8] is a family of multimodal models trained with diverse inputs to develop strong generalist abilities across multiple modalities. A notable feature of Gemini is its ability to generate code, interpret user inputs describing desired functionalities, and translate them into functional code. This capability has significant potential to streamline software development workflows and enhance human-AI collaboration.

Code-bison-32k [9] is a specialized generative AI model focused on code generation and software development tasks. It excels in writing, debugging, and optimizing code across various programming languages. According to Google DeepMind, Code-bison-32k supports a wide range of coding languages, including C, C++, C#, Python, Java, JavaScript, and more than 30 additional languages [10].

The three aforementioned models were employed to produce machine-generated codes in our dataset. For GPT4, we used GPT4-turbo, which, during the experiments, pointed to the gpt-4-0125-preview model. For Gemini, we used the Gemini-1.0-pro edition. Both GPT4 and Gemini are widely popular AI-powered products that have drawn significant attention from the community. On the other hand, Code-bison-32k is a standalone deep learning model that does not attract as much attention. Nevertheless, Code-bison-32k is specifically designed for code-related tasks, whereas GPT4 and Gemini provide broader capabilities across different domains. Therefore, the inclusion of Code-bison-32k
enhances the variety and thoroughness of our dataset. It is also worth mentioning that we were aware of GitHub Copilot, which is also a widely used AI assistant for automated code production. However, as far as we know, this tool does not provide an API set for convenient code retrieval. After consideration, we decided to use only the three models mentioned earlier.

B. CODET5+ MODELS AND THE TRANSFORMER ARCHITECTURE
CodeT5+ [6] is a new family of open code large language models for code understanding and generation tasks. Developed by Salesforce AI Research, CodeT5+ models are aimed at overcoming two major constraints of existing code LLMs: architectural inflexibility and a restricted collection of pretraining tasks. Current code LLMs can only work in certain designs, such as encoder-only or decoder-only. CodeT5+ models, on the other hand, have a flexible architecture that can work in encoder-only, decoder-only, or combined encoder-decoder modes to dynamically adapt to a wide range of downstream applications. Additionally, CodeT5+ models include a variety of pre-training tasks, such as span denoising, causal language modeling (CLM), contrastive learning, and text-code matching, to address the problem of limited pre-training tasks. This allows CodeT5+ models to bridge the gap between the pre-training and fine-tuning stages, as well as surpass existing code LLMs in aligning with the complexities of different downstream code tasks.

At the heart of CodeT5+ models lies the Transformer architecture, created by Google to handle natural language processing tasks. The Transformer architecture [11] was developed to tackle the challenges faced by Recurrent Neural Networks (RNNs) in capturing relationships between distant words in a phrase and their slow training speed. The Transformer architecture follows the encoder-decoder architecture of RNNs, where the encoder encodes an input sequence of symbols (which represents, for example, words in a sentence) into a sequence of continuous representations, and the decoder produces a sequence of symbols from the continuous representations. However, instead of using the recurrence mechanism of RNNs, both the Transformer encoder and decoder employ the Multi-head self-attention mechanism, which allows the models to capture global dependencies between words in a phrase or sentence. A self-attention function takes an input embedding X = [x1, x2, ..., xN] ∈ R^(D×N), where each embedding xi of dimension D × 1 represents a word or token in the input sentence. Then, the three following quantities, named Value, Key, and Query, are given as

V[X] = Wv X,  K[X] = Wk X,  Q[X] = Wq X    (1)

where Wv, Wk, and Wq represent the weights for Value, Key, and Query, respectively. The terms Query and Key are derived from the field of information retrieval, where Query is used to measure how much attention should be placed on different parts of the input sentence, and Key is used to compute the similarity between the Query and other parts in the sequence [12]. Combining them together, the Attention weight is as follows:

A[X] = Softmax(Q[X]K[X]^T / √dk)    (2)

where dk is the number of rows in Wk and Wq. Value V[X] is the matrix that contains the actual information to be passed to the next layer in the model. From V[X] and the Attention weights A[X], the output of a self-attention SA[X] is computed as

SA[X] = V[X]A[X]    (3)

In the Multi-head self-attention mechanism, the self-attention function is performed H times with H different sets of Wv, Wk, and Wq, which outputs H self-attention outputs SA1[X], SA2[X], ..., SAH[X]. Finally, these self-attention outputs are (vertically) concatenated and transformed linearly to produce the final output, i.e.,

MultiHead(X) = Wh [SA1[X], SA2[X], ..., SAH[X]]    (4)

where Wh represents the weights for the multi-head computation. The Multi-head self-attention mechanism was found to be beneficial and to help self-attention perform well [12]. Both the Transformer encoder and decoder consist of multiple identical layers, each of which contains a Multi-head self-attention block and other transformations. Although only a component in the whole architecture, this mechanism plays an important role in the success of Transformers.
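To make Eqs. (1)-(4) concrete, the following is a minimal NumPy sketch of single-head and multi-head self-attention. It uses the common row-wise convention (tokens as rows) rather than the column-vector notation above, and the dimensions, variable names, and toy inputs are purely illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single head, Eqs. (1)-(3): X has shape (N, D), weight matrices have shape (D, dk)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # Eq. (1)
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))    # Eq. (2), attention weights of shape (N, N)
    return A @ V                                   # Eq. (3), shape (N, dk)

def multi_head(X, heads, Wh):
    """Eq. (4): run H heads, concatenate their outputs, then project with Wh."""
    outputs = [self_attention(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outputs, axis=-1) @ Wh

# Toy usage: 6 tokens, model dimension 16, H = 4 heads of size 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
heads = [tuple(rng.normal(size=(16, 4)) for _ in range(3)) for _ in range(4)]
Wh = rng.normal(size=(16, 16))
print(multi_head(X, heads, Wh).shape)  # (6, 16)
```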
By using Transformers as a foundation and taking an innovative approach to pre-training tasks as well as a flexible architecture, CodeT5+ models have achieved state-of-the-art results across different code-related tasks, such as code comprehension and generation [6]. Therefore, we chose CodeT5+ models to be the base model for our MageCode method. The use of CodeT5+ in the method is detailed in Section III.

C. RELATED WORKS
1) MACHINE-GENERATED TEXT DETECTION METHODS
Recently, the problem of machine-generated text identification has drawn a lot of attention and research effort. Various methods have been proposed to differentiate text produced by LLMs from text written by humans. These methods can be broadly categorized into metric-based and model-based methods [3].

In general, metric-based methods rely on pre-trained LLMs to analyze the text and extract statistical features from it, for example, the token-wise log probability or rank of each word in a document given its previous context [3]. This paper assesses six metric-based detection methods, including Log-Likelihood [13], Rank [14], Log-Rank [4], Entropy [14], GLTR [14], and LRR [15]. These metrics have been demonstrated in prior studies [3] to be rather effective in identifying machine-generated content from six LLMs, including ChatGLM [16], Dolly [17], ChatGPT-turbo [7],
GPT4All [18], StableLM [19], and Claude [20], across three distinct datasets.

In terms of model-based approaches, a classification model is developed by fine-tuning pre-trained language models with datasets that comprise both machine-generated and human-written texts [3], [13]. After the fine-tuning process, the developed classification model should be able to distinguish machine-generated content in the provided dataset. For example, OpenAIDetector [13] fine-tuned a RoBERTa model with texts produced by the largest GPT2 model. Similarly, ChatGPT Detector [21] was developed to detect ChatGPT-generated content by fine-tuning a RoBERTa model with the HC3 dataset as input. These methods have been demonstrated to achieve high accuracy in classifying text origins, even in challenging situations such as maintaining a false positive rate (FPR) below 0.1%. However, recent studies indicate that LLM-generated text detection methods are less effective when performed on source code data [22].

2) MACHINE-GENERATED CODE DETECTION METHODS
Compared to the machine-generated text detection problem, the task of identifying machine-generated code has garnered less attention and research effort. Existing methods mainly rely on feature analysis, i.e., metric-based methods, notably AIGCode based on perplexity and DetectGPT4Code based on probability curvature.

AIGCode [5] is designed specifically to identify AI-generated code and prevent the misuse of LLMs among students in programming education. This method leverages targeted masking and a fine-tuned CodeBERT [23] model. By masking areas of the code with higher perplexity, it creates subtle variations that reveal patterns characteristic of AI-generated code. AIGCode then evaluates these variations using a scoring system that considers overall perplexity, variation in code line perplexity, and burstiness. Higher scores indicate a higher likelihood of AI generation, based on the observation that AI-generated code often exhibits lower perplexity and is less susceptible to perturbations. However, AIGCode was assessed using a limited dataset of approximately 5,000 machine-generated codes and only attained accuracy up to 92% for the three programming languages considered in this paper.

DetectGPT4Code [22] is a training-free method for detecting code generated by black-box models such as ChatGPT. Built upon the principles of the original zero-shot text detection method, DetectGPT [4], this method utilizes a small code language model as a surrogate white-box model to estimate the probability curvature of the rightmost tokens in the candidate code. Particularly, DetectGPT4Code employs a smaller, open-sourced language model (LM) like Incoder-6B to calculate the probability curvature of candidate code as well as generate multiple perturbed versions. Subsequently, various surrogate models ranging from 100M to 10B parameters, including PyCodeGPT-110M, PolyCoder, and CodeParrot, are utilized to estimate the probabilities of the ending tokens in both original and perturbed codes. Finally, the resulting DetectGPT4Code score is calculated to identify code snippets generated by language models.

III. MAGECODE
A. METHODOLOGY OVERVIEW
We developed MageCode, a machine-generated code detection method that utilizes the pre-trained model CodeT5+ 220M and integrates appropriate metrics. MageCode processes a code snippet input in four phases: Tokenization, Feature Extraction, Metric Calculation, and Classification. In the Tokenization phase, the source code input is converted to embedding vectors by using CodeT5+'s tokenizer. Subsequently, the embedding vector is passed through the Encoder layers of the CodeT5+ model in the Feature Extraction phase to extract semantic features. In parallel, the Metric Calculation phase calculates several metrics which represent the statistical features of the source code. Then, the statistical features are combined with the semantic features to form a final feature vector. Finally, in the Classification phase, a classification layer consisting of two fully connected neural network layers employs the feature vector to perform the binary classification operation. The processing flow of MageCode is depicted in Figure 1. Details of each phase in the method are described in the following sections.

B. SOURCE CODE TOKENIZATION
The Tokenization phase is performed by the tokenizer of the CodeT5+ model. The tokenizer first breaks the input source code into smaller units called tokens. Since the tokenizer has a maximum input sequence length of 512 tokens, it either truncates or pads the source code input depending on its length. If the input source code exceeds the 512-token limit, the tokenizer will remove excess tokens from the end to ensure it fits within the maximum limit. On the contrary, if the source code is shorter than the limit, the tokenizer appends a special [PAD] token to the end until the sequence reaches the required length. The obtained embedding vectors will be used as input for the pre-trained encoder-only CodeT5+ model for feature extraction.
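As an illustration of this phase, the sketch below loads a CodeT5+ tokenizer from Hugging Face and produces a fixed-length, 512-token input. The checkpoint name and the exact padding/truncation settings are assumptions for illustration; the paper only specifies the 512-token limit and the use of the CodeT5+ tokenizer.

```python
from transformers import AutoTokenizer

# Assumed checkpoint; the paper uses the CodeT5+ 220M model.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")

source_code = "def add(a, b):\n    return a + b\n"

encoded = tokenizer(
    source_code,
    max_length=512,        # maximum input sequence length
    truncation=True,       # drop tokens beyond the 512-token limit
    padding="max_length",  # append [PAD] tokens up to 512
    return_tensors="pt",
)
print(encoded["input_ids"].shape)       # torch.Size([1, 512])
print(encoded["attention_mask"].sum())  # number of non-padding tokens
```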
C. FEATURE EXTRACTION
In the Feature Extraction phase, the embedding vector obtained from the Tokenization phase is fed into the pre-trained encoder-only CodeT5+ model. Opting for an encoder-only model over a decoder-only or combined encoder-decoder model is justified by the findings that decoder-only models are often not ideal for understanding tasks such as retrieval and detection compared to encoder-only models [6], and that encoder-decoder models fail to beat state-of-the-art (SoTA) encoder-only or decoder-only baselines on retrieval and code completion tasks, respectively [6]. The pre-trained encoder-only CodeT5+ model uses a multi-layer bidirectional Transformer-based architecture. The architecture is made up of 12 identical encoder
blocks, and each block has two main parts: a bidirectional self-attention layer and a fully connected feed-forward neural network layer with ReLU activation [6].

The multi-head self-attention mechanism and a normalization layer are applied to the input embeddings X in each encoder block to produce the initial output embedding Y:

Y = LayerNorm(X + MultiHead(X))    (5)

Then, the output embedding Y is passed through a fully-connected feed-forward neural (FFN) network layer with a ReLU activation function. Subsequently, a normalization layer is applied to the result of this transformation to produce the final output embedding Z:

Z = LayerNorm(Y + FFN(Y))    (6)

The output embedding Z of an encoder block serves as the input embedding X for the successive encoder block. After 12 encoder blocks, we obtain the final hidden state of the classification vector [CLS], denoted as r (r ∈ R^(1×768)), which is the aggregated representation of source code features.
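The following sketch shows one way to obtain the 768-dimensional representation r with the Hugging Face transformers library. It assumes the encoder of the CodeT5+ 220M checkpoint and uses the hidden state of the first token as the [CLS]-style aggregate, following the description above; the exact checkpoint and pooling used in MageCode may differ.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Assumed checkpoint; encoder-only use of the CodeT5+ 220M model.
checkpoint = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = T5EncoderModel.from_pretrained(checkpoint)
encoder.eval()

code = "def add(a, b):\n    return a + b\n"
inputs = tokenizer(code, max_length=512, truncation=True,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # shape (1, 512, 768)

r = hidden[:, 0, :]   # first-token ([CLS]-style) representation, shape (1, 768)
print(r.shape)
```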
D. METRIC CALCULATION
In parallel with the Tokenization and Feature Extraction phases, the source code input is used to calculate several metrics that represent different statistical aspects of the source code. This research considers the evaluation of the six following metrics.

Log-Likelihood [13] employs a language model to calculate the logarithmic probability of tokens. In particular, it computes the mean logarithmic likelihood of each word token in a given text to produce a score. A larger score suggests a higher likelihood that the text was produced by a machine. The Log-Likelihood score is computed as

Log-Likelihood = (1/t) Σ_{i=1}^{t} log pθ(xi | x<i)    (7)

where t denotes the number of tokens, xi represents the i-th token in the given sequence of tokens, x<i refers to all tokens before the i-th token, and pθ(xi | x<i) represents the conditional probability of observing token xi given the context x<i.

Rank [14] first computes the absolute rank of each word in a text based on its preceding context. Following that, the rank score of the given text is determined by calculating the average of these rank values across all words. A lower score indicates a higher probability that the text was produced by a machine. The Rank score is calculated as

Rank = (1/t) Σ_{i=1}^{t} rθ(xi | x<i)    (8)

where rθ(xi | x<i) ≥ 1 is the rank of token xi conditioned on the previous tokens.

Log-Rank [4], different from the Rank metric, calculates the score by applying the logarithm function to the rank value of each word instead of directly utilizing the absolute rank. The Log-Rank can be defined as

Log-Rank = (1/t) Σ_{i=1}^{t} log rθ(xi | x<i)    (9)

In a similar manner to the Rank score, Entropy [14] computes the score of a text by averaging the entropy value of each word based on its prior context. Previous research [4], [14] points out that texts produced by machines are more likely to have lower Entropy scores. The Entropy score is as follows:

Entropy = −(1/t) Σ_{i=1}^{t} pθ(xi | x<i) log pθ(xi | x<i)    (10)
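As a sketch of how Eqs. (7)-(10) can be computed in practice, the code below scores a snippet with a causal language model and derives the token-wise log-probability, rank, log-rank, and entropy terms. The choice of GPT-2 as the scoring model is purely illustrative; MageCode's metric evaluation uses the CodeBERT-base-mlm model described in Section V-A.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # illustrative scoring model only
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)
lm.eval()

def metric_scores(text):
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    with torch.no_grad():
        logits = lm(ids.unsqueeze(0)).logits[0]          # (t, vocab)
    # Predict token i from its context x_<i: align logits[i-1] with ids[i].
    log_probs = F.log_softmax(logits[:-1], dim=-1)       # (t-1, vocab)
    targets = ids[1:]
    token_logp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    token_p = token_logp.exp()
    # Rank of each observed token among the vocabulary (1 = most likely).
    ranks = (log_probs > token_logp.unsqueeze(1)).sum(dim=1) + 1
    return {
        "log_likelihood": token_logp.mean().item(),            # Eq. (7)
        "rank": ranks.float().mean().item(),                   # Eq. (8)
        "log_rank": ranks.float().log().mean().item(),         # Eq. (9)
        "entropy": -(token_p * token_logp).mean().item(),      # Eq. (10), as written above
    }

print(metric_scores("def add(a, b):\n    return a + b\n"))
```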
GLTR [14] is specifically developed as a supplementary tool to assist in identifying whether a given text is machine-generated. This work employs the GLTR method incorporated into the MGTBench framework [3], with a specific focus on Test-2 features. These features encompass the assessment of the proportion of words that rank within the top 10, 100, 1,000, and other categories.

Log-Likelihood Log-Rank Ratio (LRR) [15] is proposed by Su et al. and combines Log-Likelihood and Log-Rank to offer comprehensive insights for a given text. The LRR can be computed as

LRR = Log-Likelihood / Log-Rank = (Σ_{i=1}^{t} log pθ(xi | x<i)) / (Σ_{i=1}^{t} log rθ(xi | x<i))    (11)
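Building on the per-token quantities from the previous sketch, the fragment below illustrates the GLTR Test-2 features (fraction of tokens whose rank falls in the top 10, top 100, top 1,000, or beyond) and the LRR of Eq. (11). The bucket boundaries follow the description above, and the variable names and toy values are illustrative assumptions.

```python
import numpy as np

def gltr_test2_features(ranks):
    """Fraction of tokens ranked in the top 10, 100, 1,000, and beyond."""
    ranks = np.asarray(ranks, dtype=float)
    return [
        (ranks <= 10).mean(),
        ((ranks > 10) & (ranks <= 100)).mean(),
        ((ranks > 100) & (ranks <= 1000)).mean(),
        (ranks > 1000).mean(),
    ]  # four values, matching the four GLTR entries of the "metric vector" in Section III-G

def lrr(token_log_probs, ranks):
    """Eq. (11): ratio of summed log-probabilities to summed log-ranks."""
    return np.sum(token_log_probs) / np.sum(np.log(ranks))

# Toy usage with made-up per-token statistics.
ranks = [1, 3, 250, 12, 2, 4000]
token_log_probs = [-0.1, -1.2, -6.3, -2.0, -0.4, -9.1]
print(gltr_test2_features(ranks), lrr(token_log_probs, ranks))
```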
After being calculated, these metric results are normalized using the mean and standard deviation estimated on the training split of the dataset. Finally, the normalized metrics are appended to the end of r to create the final feature vector. This feature vector serves as the input to the classification layer in the binary classification phase. It should be noted that not all six metrics presented above are integrated in MageCode. Rather, the suitability of each metric will be investigated (Section V-A), and only metrics that show promising results will be included in the final feature vector.

E. BINARY CLASSIFICATION
The classification layer consists of two fully-connected feed-forward neural (FFN) network layers, with a ReLU activation function positioned between them. The first FFN layer contains 1024 neurons, while the second layer has only one neuron. Following the second layer, the sigmoid function is applied to the single output neuron to carry out the binary classification operation, which effectively classifies the input source code as either human-written or machine-generated.
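A minimal PyTorch sketch of this classification layer is given below. The input size assumes the 768-dimensional CodeT5+ representation concatenated with nine normalized metric values, as in the example of Section III-G; the training setup (loss, optimizer) is not specified by the paper and is omitted.

```python
import torch
import torch.nn as nn

class MageCodeClassifier(nn.Module):
    """Two fully-connected layers (1024 -> 1) with ReLU between them and a sigmoid output."""
    def __init__(self, feature_dim=768 + 9):
        super().__init__()
        self.fc1 = nn.Linear(feature_dim, 1024)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(1024, 1)

    def forward(self, features):
        logit = self.fc2(self.relu(self.fc1(features)))
        return torch.sigmoid(logit)  # probability that the input code is machine-generated

# Toy usage: one feature vector (768 semantic dimensions + 9 normalized metrics).
clf = MageCodeClassifier()
features = torch.randn(1, 768 + 9)
print(clf(features))
```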
F. TIME COMPLEXITY
The proposed technique includes two primary components: feature extraction utilizing the pre-trained model CodeT5+ and metric calculation. The time complexity of the metric calculation is O(n), with n representing the length of the input sequence. Additionally, CodeT5+ comprises Transformer blocks that incorporate self-attention methods. The attention mechanism, developed by Google, has a time complexity of O(n² × d), where n represents the sequence length and d is the representation dimension [11]. Consequently, the time complexity of the proposed approach is represented in Equation (12):

O(n² × d) + O(n) = O(n² × d)    (12)

G. EXAMPLE
We provide a specific example to demonstrate the detection flow of MageCode. Suppose we have a Python code snippet as follows:

LISTING 1. Python code snippet example.

First, the code is tokenized using the tokenizer of the CodeT5+ model. The output is an embedding vector with 512 dimensions:

[1 536 5155 . . . 0 0]

This embedding vector then becomes the input of the pre-trained encoder-only CodeT5+ model, which uses the Transformer architecture for feature extraction. We obtain the final hidden state of the [CLS] vector, which has 768 dimensions, after passing the embedding vector through 12 encoder blocks:

[−0.54 −0.59 −0.29 . . . −0.12 −0.49]

Subsequently, the code snippet input is used to calculate the aforementioned metrics. Those metrics are calculated using the CodeBERT-base-mlm model [23] as the base model (Section V-A). The calculated results are then concatenated to form a temporary vector, referred to as the ''metric vector''. Supposing all six metrics are integrated into the MageCode method, the obtained ''metric vector'' is:

[−11.23 911.42 3.82 0.49 0.35 0.20 0.27 0.18 −2.94]

The listed order follows the order in which the six metrics are described in Section III-D: Log-Likelihood, Rank, Log-Rank, Entropy, GLTR, and LRR. Since the GLTR method employed in this work encompasses the assessment of the proportion of words that rank within the top 10, 100, 1,000, and other categories, it has four outputs instead of one like the remaining metrics. The ''metric vector'' is then normalized, with the mean and standard deviation determined after the training process on the train split of the dataset. The normalized ''metric vector'':

[0.85 1.16 −0.10 −1.19 1.36 −1.21 −0.79 −0.49 0.76]

is appended to the end of the feature vector calculated above to form the complete feature vector:

[−0.54 −0.59 −0.29 . . . −0.12 −0.49 0.85 1.16 . . . −0.49 0.76]
This final feature vector serves as the input to the fully-connected layer in the Classification phase, which classifies the provided code snippet as human-written or machine-generated.

IV. DATASET CONSTRUCTION
This section outlines our dataset construction process, which is illustrated in Figure 2. First, we collected programming problems and their solutions from previous studies. After applying some pre-processing steps, we queried advanced LLMs, such as GPT4, with the collected problems to produce machine-generated codes, while the original solutions were labeled as human-written codes. Finally, the entire collected dataset was divided into training, validation, and test sets, ensuring that there is no overlap among these sets, meaning that no solutions in two different sets solve the same programming problem.

A. CODING PROBLEM COLLECTION
We collected programming problems from two open-source datasets: Topics in Algorithmic Code Generation (TACO) [24] and CodeContests [25].

TACO [24] is a large-scale, publicly accessible dataset specifically curated to advance the development of cutting-edge code generation models. Focusing on algorithmic problems, this extensive dataset comprises 26,443 programming challenges and their corresponding 1,539,152 high-quality Python solutions. Designed to elevate training and evaluation benchmarks, TACO provides a robust foundation for advancing the capabilities of code generation systems.

CodeContests [25] is a dataset specifically created for training AI models to solve programming problems. It was introduced by Google DeepMind in 2022 for developing AI systems that can compete in coding contests. The dataset contains a variety of programming challenges from popular platforms like Aizu, AtCoder, CodeChef, Codeforces, and HackerEarth. Each programming problem includes test cases with both input and expected output, as well as examples of correct and incorrect solutions written in Python, Java, and C++. This comprehensive dataset was used in training AI models like AlphaCode, which have shown promising results in competitive programming.

The dataset used for this study consisted of Python, C++, and Java questions and solutions. Python questions and solutions were taken from the TACO dataset, while C++ and Java questions and solutions were taken from the CodeContests dataset. Only data from before January 1, 2022, was included to avoid the influence of advanced AI models like ChatGPT and Gemini. Additionally, answers from programming practice websites like GeeksForGeeks were excluded to prevent contamination from solutions generated by AI assistants. After being preprocessed, the dataset contained 3,499 Python questions with 757,794 solutions, 11,086 C++ questions with 1,756,180 solutions, and 10,118 Java questions with 900,079 solutions. All of these code samples were labeled as human-written.

Notwithstanding the aforementioned preprocessing phase, it is crucial to emphasize that there is no assurance that all ''human-written'' labeled solutions are entirely devoid of the influence of LLMs. This constitutes a constraint in our dataset development procedure. Nonetheless, given that these solutions were primarily sourced from competitive programming platforms prior to the emergence of user-friendly and robust web-based LLMs like ChatGPT, the danger of LLM contamination in these solutions is comparatively small and acceptable for this study.

B. MACHINE-GENERATED CODE COLLECTION AND TESTING
To obtain machine-generated solutions, we queried three advanced LLMs with the collected programming problems: GPT-4-turbo from OpenAI, and Gemini-1.0-pro and Code-bison-32k from Google. This work utilizes OpenAI's API service for GPT-4-turbo generation and Google AI's API service for Gemini-1.0-pro and Code-bison-32k generation. The decoding temperature is configured to 0.7. This is a critical parameter that balances the predictability (with lower values) and creativity (with higher values) of the generated content. The value of 0.7 allows multiple solutions to the same programming problem to not be too similar to each other while still allowing them to pass as many tests as possible. It is also the value used by TACO [24] when experimenting with GPT4. All other settings are maintained at their default values. The experiments were conducted between February 25 and June 18, 2024. All solutions were generated by prompting the 3,499 Python questions, 11,086 C++ questions, and 10,118 Java questions to the three above-mentioned LLMs. The prompt used to collect machine-generated source code is detailed in Table 1. The prompt involved stating the question details and desired programming language, providing starter code if necessary, outlining the input format, and then requesting code generation from the queried LLM, without explanation, test cases, or example usage for the generated source code. For each question, up to five solutions from each language model are collected.

TABLE 1. The prompt used to query GPT-4-turbo, Gemini-1.0-Pro, and Code-bison-32k models.
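The sketch below illustrates how a solution could be requested from one of the three models through the OpenAI API with the settings described above (temperature 0.7, up to five solutions per question). The prompt text is a simplified stand-in for the full prompt in Table 1, and the calls for Gemini-1.0-pro and Code-bison-32k would follow the same pattern with Google AI's client library.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_solutions(problem_statement, language="Python", n_solutions=5):
    """Collect up to five candidate solutions for one programming problem."""
    # Simplified stand-in for the full prompt described in Table 1.
    prompt = (
        f"Solve the following programming problem in {language}.\n"
        f"Return only the source code, without explanation, test cases, "
        f"or example usage.\n\n{problem_statement}"
    )
    solutions = []
    for _ in range(n_solutions):
        response = client.chat.completions.create(
            model="gpt-4-0125-preview",  # GPT-4-turbo at the time of the experiments
            temperature=0.7,             # balances predictability and creativity
            messages=[{"role": "user", "content": prompt}],
        )
        solutions.append(response.choices[0].message.content)
    return solutions
```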
We then evaluated the machine-generated solutions using test cases from the TACO and CodeContest datasets. To be included in the dataset, a solution had to pass at least one test case, reflecting the common practice among students who aim to earn partial points in programming assignments. Besides, selected solutions were not required to pass every
test case, as many students only submit code that is sufficient to help them pass their exams or homework. The collected human-written solutions above also include ''incorrect'' solutions that only succeed on a subset of test cases. Including these incomplete solutions in both the human-written and machine-generated solution sets adds diversity to the dataset, enabling models trained on the dataset to learn a variety of solution patterns rather than just perfect ones. Moreover, obtaining machine-generated solutions that can pass all test cases of a programming problem is both costly and time-consuming, since we would need to query the LLMs several times until we received at least one perfect solution. That process is challenging, if not unfeasible, for medium and hard programming problems.

To automatically test machine-generated solutions from the LLMs, we employed two approaches. For Python solutions, we utilized the existing source code provided by the TACO project for testing Python code. In the case of Java and C++ solutions, we use Judge0 [26], a robust and scalable online code execution system. As an open-source project with a readily available Docker image, Judge0 has become a crucial part of various production systems requiring online code execution capabilities. In our approach, we deployed a local Judge0 server and leveraged its APIs to interact with the server. Through those APIs, we send solutions generated by the LLMs to the Judge0 server, along with a list of test cases and the desired programming language environment. The server then evaluates the submitted code in the specified language against the provided test cases. After the evaluation, the server returns a result indicating whether the solution passed, if at least one test case is successful, or failed, if no test case is passed or if there are compilation or runtime errors. Only passed solutions are included in the final dataset. These solutions are labeled as machine-generated.
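The snippet below sketches how a generated solution can be checked against one test case through the REST API of a locally deployed Judge0 server. The endpoint, field names, and language identifier follow Judge0's public API, but the server address and the pass/fail aggregation over multiple test cases are assumptions for illustration.

```python
import requests

JUDGE0_URL = "http://localhost:2358"   # assumed address of the local Judge0 server
JAVA_LANGUAGE_ID = 62                  # Judge0 language id for Java (OpenJDK)

def run_test_case(source_code, stdin, expected_output, language_id):
    """Submit one solution and one test case; return True if the output matches."""
    payload = {
        "source_code": source_code,
        "language_id": language_id,
        "stdin": stdin,
        "expected_output": expected_output,
    }
    resp = requests.post(
        f"{JUDGE0_URL}/submissions?base64_encoded=false&wait=true",
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    # Status id 3 means "Accepted" in Judge0; anything else is a failed test,
    # a compilation error, or a runtime error.
    return resp.json()["status"]["id"] == 3

def solution_passes(source_code, test_cases, language_id):
    """A solution is kept if it passes at least one test case."""
    return any(run_test_case(source_code, tc["input"], tc["output"], language_id)
               for tc in test_cases)
```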
C. DATASET SPLIT AND SPECIFICATION
After the collection phase, we obtained over 81,000 human-written and over 45,000 machine-generated source code samples. An overview of the dataset is described in Table 2.

TABLE 2. Overview of the dataset used for detecting code generated by large language models.

The dataset is split into three smaller datasets, each of which corresponds to a programming language. Each split
dataset is then divided into a training set, a validation set, and a test set, with respective proportions of 76%, 4%, and 20%. The data in the training set and validation set are evenly distributed between the human-written and machine-generated categories. Conversely, the test set includes the remaining data, which preserves the original distribution of the dataset and serves as realistic data for the testing procedure. It is worth noting that each programming problem has several solutions, which may exhibit a high degree of similarity or possess identical patterns. Therefore, while splitting the dataset, we ensure that there are no two solutions from two separate sets that solve the same problem, to avoid data leakage when training and testing our method. The number of samples in each set across different groups is described in Table 3.

TABLE 3. The number of samples in the training set, validation set, and test set across three programming languages.

Each sample in the dataset possesses three properties: task_id, code, and label. The task_id field in the Python dataset represents the ordinal number of programming problems in the train split of the TACO dataset. For the Java and C++ datasets, this field corresponds to the ordinal number of programming problems in the train split of the CodeContests dataset. The code field contains the source code, and the label field indicates the origin of the source code, with 0 denoting human-written code and 1 denoting machine-generated code. The dataset is stored in several CSV files, each representing different sets and programming languages. It is publicly published on Hugging Face for the research community.
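As an illustration of the leakage-free split described above, the sketch below partitions samples by task_id so that all solutions to the same problem end up in exactly one of the training, validation, or test sets. The 76/4/20 proportions follow the paper; the shuffling, rounding, and class balancing details are simplified assumptions.

```python
import random
import pandas as pd

def split_by_task(df, seed=42, train_frac=0.76, val_frac=0.04):
    """Split a dataframe with task_id/code/label columns without problem overlap."""
    task_ids = sorted(df["task_id"].unique())
    random.Random(seed).shuffle(task_ids)

    n_train = int(len(task_ids) * train_frac)
    n_val = int(len(task_ids) * val_frac)
    train_ids = set(task_ids[:n_train])
    val_ids = set(task_ids[n_train:n_train + n_val])

    train = df[df["task_id"].isin(train_ids)]
    val = df[df["task_id"].isin(val_ids)]
    test = df[~df["task_id"].isin(train_ids | val_ids)]
    return train, val, test

# Toy usage with the dataset's CSV layout (task_id, code, label).
df = pd.DataFrame({
    "task_id": [0, 0, 1, 1, 2, 3],
    "code": ["..."] * 6,
    "label": [0, 1, 0, 1, 0, 1],
})
train, val, test = split_by_task(df)
assert not set(train["task_id"]) & set(test["task_id"])  # no shared problems
```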
V. EVALUATION
This section presents evaluation results that assess the performance of MageCode and other detectors. Firstly, we evaluate several metric-based machine-generated text detection methods on source code input. The metrics that show promising results will be selected for integration into the MageCode method. In the following experiment, we evaluate the performance of MageCode and compare it to current baselines. Finally, we examine the potential effects of different factors on the performance of MageCode.

This study evaluates Accuracy, F1-score, and Area Under the ROC Curve (AUROC) for performance measurement. Additionally, we also consider the True Positive Rate at a fixed False Positive Rate (TPR@aFPR) to measure the sensitivity of the method at very low FPR. This metric offers valuable information about model performance in critical detection scenarios. An optimal true positive rate (TPR) while minimizing the false positive rate (FPR) is crucial, as the primary risks in critical detection scenarios often arise from false positives, which refer to the erroneous identification of human-written code as machine-generated. This paper assesses detection methods under the range of a ∈ {10, 1, 0.1}.

All experiments are conducted on the newly constructed dataset described in Section IV. As can be seen from Table 3, the C++ dataset has 43,000 human-written samples, which is 14 times greater than the number of machine-generated samples. In order to address this imbalance during evaluations, we decided to reduce the size of the human-written set by using only a number of human-written samples equal to the size of the machine-generated set, which consists of 2,998 samples.

The experiments are implemented on a Windows 11 PC equipped with a Core i9-14900K CPU, 128 GB of RAM, and an Nvidia RTX 4070 Ti GPU.

A. EVALUATION OF METRIC-BASED METHODS
The six evaluated metric-based detection methods are Log-Likelihood [13], Rank [14], Log-Rank [4], Entropy [14], GLTR [14], and LRR [15]. For each method, this work uses the CodeBERT-base-mlm model [23] as the base model from which to extract logits. From the metrics extracted using the CodeBERT-base-mlm model, a logistic regression model is constructed to provide concrete predictions. Table 4 presents the evaluation results across three programming languages.

Table 4 clearly demonstrates that the Log-Likelihood method surpasses other metrics on the Python and C++ datasets, achieving the highest figures in Accuracy, F1-Score, and AUROC. The results demonstrate the excellent effectiveness of the Log-Likelihood metric in correctly identifying machine-generated code in these programming languages. Moreover, the illustration in Figure 3 reveals a clear differentiation between positive and negative samples in the distribution of the Log-Likelihood score. Machine-generated codes typically exhibit Log-Likelihood scores positioned to the right of the Log-Likelihood score distribution typically observed in human-written codes. This discovery in source code corresponds to results obtained in natural language analysis [13], [14].

The performance of Entropy exhibits variability among different programming languages. In the Java programming language, it attains the highest values of Accuracy (76.51%), F1 score (82.61%), and AUROC (79.07%). However, the performance of this metric is comparatively inferior in Python and C++, particularly in C++, with an Accuracy of 53.84%, an F1 score of 55.90%, and an AUROC of 55.32%. Remarkably, there exists a notable disparity in the entropy distribution between code created by machines and code written by humans for Java (Figure 6). However, the distributions for Python and C++ are relatively comparable, especially in the C++ dataset. This disparity explains why the use of entropy for classifying C++ code as human-written or machine-generated is less efficient. Moreover, our experiments using source code input confirm the consistent finding that the entropy of machine-generated material is
generally lower than that of human-written information, as demonstrated in prior studies [4], [14].

TABLE 4. Evaluation of metric-based machine-generated text detection methods across three programming languages. (Unit: %).

Comparing the performance of the Rank and Log-Rank methods on the datasets, Log-Rank consistently provides substantial enhancements. While demonstrating a similar performance to Rank on the C++ dataset, Log-Rank surpasses Rank on the Python and Java datasets outright. Within the Java dataset, the Log-Rank method attains an Accuracy of 74.07%, an F1 Score of 80.48%, and an AUROC of 77.81%. These results even exceed the measurements of Log-Likelihood on the same dataset, and are only lower than those of the Entropy metric. The distributions of the Rank and Log-Rank scores are shown in Figure 4 and Figure 5, respectively. Although both strategies take the observed rank of each token into account when making the prediction, the use of the logarithm transformation in Log-Rank improves its effectiveness as a detector in comparison to the Rank metric. Using the Java dataset as an illustration, the Rank distributions depicted in Figure 4 for machine-generated code and human-written code are difficult to distinguish due to their significant overlap. In contrast, the Log-Rank score distribution shown in Figure 5 offers a more distinct differentiation. Consistent with findings on natural language content [14], [27], we also note that machine-generated code generally exhibits lower average values for both Rank and Log-Rank scores.

GLTR shows solid performance across the three distinct programming domains. On the Python and C++ datasets, its performance is just lower than that of Log-Likelihood. For instance, the GLTR method achieves an Accuracy score of
73.30%, an F1 Score of 74.71%, and an AUROC of 78.71% on the Python dataset. The stable performance of this metric across many languages can be ascribed to its consideration of the proportion of tokens that are ranked among the top 10, 100, 1,000, and so on.

The LRR method consistently demonstrates the poorest performance in all measurements and languages, with results ranging from 54% to 62%. This performance indicates that, while Log-Likelihood and Log-Rank independently show promising outcomes, the combination of both approaches does not improve detection ability. On the contrary, it continuously exhibits poor performance in all programming languages. Figure 7 displays the distribution of LRR scores among the three programming languages. The distributions of machine-generated code and human-written code exhibit no significant distinction based on the LRR scores. This observation implies that LRR may not be a very efficient approach for identifying code created by an LLM.

Furthermore, we performed a Friedman test on the performance of the six metric-based methods. The Friedman test is a non-parametric statistical test used to compare the performance of multiple models across multiple datasets
based on the average ranking of the tested models. This test evaluates whether there are significant differences in the performance of the tested models [28]. When performing the test, the level of significance is set to 0.05, as was used in previous studies [28], [29]. AUROC is the selected metric for performance comparison and ranking. Since six metrics are evaluated, the test was performed with 6 degrees of freedom.

TABLE 5. Rankings of metric-based machine-generated text detection methods across three programming languages.

Table 5 shows the rankings of the six metrics across the three datasets in terms of performance. AUROC is selected for performance comparison and ranking. For each dataset, rankings are from 1 to 6, with rank 1 assigned to the best metric and rank 6 assigned to the worst metric. The average ranking of each metric is calculated as the average of its rankings across the three datasets.

Based on these rankings, we can calculate the Friedman test statistic, which is 8.90. With 6 degrees of freedom and a significance level of 0.05, the critical chi-squared value is 12.59, taken from the chi-square table. Since 8.90 < 12.59, the null hypothesis of the Friedman test cannot be rejected, which means there is no statistically significant difference in the performance of the tested metrics.

There is a possibility that this result is caused by the performance of the evaluated metrics on the Java dataset, where the ranks differ the most. If we exclude the rankings from the Java dataset and solely consider rankings from the two remaining datasets, the Friedman test would yield a result of 13.23 > 12.59. This new result means there is a statistically significant difference in the performance of the six metrics on the Python and C++ datasets.
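For reference, the ranking-based Friedman test used here can be reproduced with SciPy as sketched below. The AUROC values are placeholders rather than the paper's measurements; each argument to friedmanchisquare collects one method's scores across the three datasets.

```python
from scipy.stats import friedmanchisquare

# Placeholder AUROC values (one per dataset: Python, Java, C++) for each metric.
auroc = {
    "Log-Likelihood": [0.79, 0.74, 0.78],
    "Rank":           [0.65, 0.70, 0.64],
    "Log-Rank":       [0.71, 0.78, 0.65],
    "Entropy":        [0.62, 0.79, 0.55],
    "GLTR":           [0.79, 0.72, 0.74],
    "LRR":            [0.58, 0.60, 0.56],
}

# The test ranks the six methods within each dataset and compares their average ranks.
statistic, p_value = friedmanchisquare(*auroc.values())
print(f"Friedman statistic = {statistic:.2f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Significant differences between the metric-based methods.")
else:
    print("No statistically significant difference detected.")
```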
Additionally, from Table 5, it is simple to tell that Log-Likelihood, Log-Rank, and GLTR stand out from the remaining metrics in terms of performance. Besides, despite its inferior results on the Python and C++ datasets, the fact that Entropy has the best ranking on the Java dataset makes it hard to ignore.

The evaluation results of the metric-based methods clearly indicate that Log-Likelihood, Entropy, and GLTR are the most applicable metrics in source code analysis. Rank and Log-Rank are essentially identical metrics, with a minor distinction in the calculation of the final score of a given text. Nevertheless, Log-Rank has been empirically demonstrated to be the better metric for identifying machine-generated source code when compared to Rank. LRR is without a doubt the metric with the worst performance. Following these findings, we have selected Log-Likelihood, Log-Rank, Entropy, and GLTR to be integrated into the MageCode method.

B. EVALUATION OF MAGECODE
The experiment in this section evaluates the MageCode method integrated with the four selected metrics above across the three datasets. The evaluation results are subsequently compared to the existing baselines, namely the OpenAI Detector [13] and DetectGPT4Code [22]. As introduced earlier, the OpenAI Detector was developed by fine-tuning the RoBERTa model using the output of the GPT-2 1.5-billion-parameter model, and was originally used to detect text generated by GPT-2. Meanwhile, DetectGPT4Code is inspired by the DetectGPT method and introduces enhancements by utilizing a surrogate model, PolyCoder-160M, to overcome the difficulties encountered by DetectGPT when the source code model is a black box (e.g., GPT4).

Table 6 shows the results of evaluating the three mentioned methods in detecting machine-generated code based on metrics such as Accuracy, F1-Score, AUROC, [email protected]%FPR, TPR@1%FPR, and TPR@10%FPR.

We find that MageCode consistently exhibits superior performance across all assessed metrics and programming languages. It showcases exceptional performance in Python, with an Accuracy of 98.46%, an F1 Score of 98.70%, an AUROC of 99.83%, and TPR values of 86.87%, 96.14%, and 99.77% at FPRs of 0.1%, 1%, and 10%, respectively. In contrast, the OpenAI Detector and DetectGPT4Code show a notable performance gap, with the OpenAI Detector specifically demonstrating a low Accuracy of 59.19% and insignificant TPR values at lower FPR thresholds. Similarly, the MageCode method for Java demonstrates strong performance, with an Accuracy of 98.05%, an F1 Score of 98.59%, an AUROC of 99.47%, and TPR values of 37.89%, 85.45%, and 99.76% at FPR values of 0.1%, 1%, and 10%, respectively. Both the OpenAI Detector and DetectGPT4Code exhibit poor performance in all measures, particularly in their failure to obtain significant TPR values at lower FPR levels. On the C++ dataset, MageCode consistently performs better than other methods, achieving an Accuracy of 95.53%, an F1 Score of 75.46%, an AUROC of 99.06%, and TPR values of 47.97%, 85.36%, and 97.93% at FPR values of 0.1%, 1%, and 10%, respectively.

The main reason for the low accuracy of the OpenAI Detector model when processing source code inputs is its lack of training on source code, which hinders its ability to accurately distinguish between code written by humans and code generated by machines. On the other hand, DetectGPT4Code is a metric-based approach that utilizes the probability curvature metric. Due to the intrinsic complexity of source code in terms of syntax, structure, and logic, depending on a single metric threshold is inadequate for distinguishing between human-written and machine-generated code, as shown by prior studies [3]. To overcome the limitations of these approaches, MageCode has employed
the CodeT5+ model, which is highly proficient in extracting significant characteristics from source code. Additionally, it has integrated multiple metrics to accurately categorize the source code.

TABLE 6. Comparison results of MageCode with current baselines across three programming languages. (Unit: %).

C. ABLATION STUDIES
This section conducts three ablation experiments to investigate the effects of different factors on MageCode: the impact of the pre-trained code LLM, the impact of the integrated metrics, and the impact of token length.

1) INFLUENCE OF PRE-TRAINED CODE LLM
Table 7 presents the comparison results when applying different base LLM models in the tokenization and feature extraction phases of the MageCode method. The three base LLM models considered in this experiment are CodeT5+ [6], CodeBERT [23], and PolyCoder [30].

TABLE 7. Performance metrics of various base models in detecting machine-generated source code. (Unit: %).

It can be observed that, across all programming languages, CodeT5+ consistently demonstrates competitive or superior results compared to CodeBERT and PolyCoder. In the Python dataset, despite the better performance of CodeBERT in [email protected]%FPR and TPR@1%FPR, CodeT5+ is still the top performer, achieving the highest results in Accuracy (98.46%), F1-Score (98.70%), and AUROC (99.83%). CodeBERT closely follows with an Accuracy of 98.29%, an F1-Score of 98.54%, and an AUROC of 99.81%. The difference in these metrics between CodeT5+ and CodeBERT is under 0.2%, which proves CodeBERT to be a competitive candidate compared to CodeT5+. However, in the Java and C++ datasets, CodeT5+ demonstrates superior results by achieving the highest values in all measurements, with Accuracy, F1-Score, and AUROC over 98% in Java and over 95% in C++. Meanwhile, CodeBERT shows significantly worse results, with Accuracy, F1-Score, and AUROC 2-3% lower compared to CodeT5+. Although demonstrating promising results in TPR@a%FPR on the Python dataset, the performance of CodeBERT in these metrics is inferior when applied to Java and C++. On the other hand, PolyCoder is the worst model, consistently exhibiting inferior results compared to the other two models.

We observe that CodeBERT exhibits a notable decrease in performance when applied to C++ in comparison to Python and Java. This disparity can be ascribed to the lack of model pre-training on C++. On the other hand, the inadequate performance of PolyCoder on all three datasets may be attributed mostly to its decoder-only Transformer-based architecture. This architecture is better suited for text production tasks but less efficient for classification tasks when compared to encoder-only models such as CodeT5+ and CodeBERT.

2) INFLUENCE OF INTEGRATED METRICS IN MAGECODE
This section examines the influence of the integrated metrics on the performance of the MageCode method. As presented in Section III, source code features extracted from the encoder-only CodeT5+ model and metrics calculated from the source code input are combined to create the final feature vector before heading to the classification phase. The effectiveness of these metrics when applied independently in detecting machine-generated code has been thoroughly studied in Section V-A. To determine how these metrics affect MageCode when integrated together into the feature vector, we re-trained and evaluated MageCode in two scenarios: Metric Integrated, in which the feature vector contains both the CodeT5+ extracted features and the calculated metrics, and CodeT5+ Only, in which the feature vector only contains the CodeT5+ features. Table 8 shows the evaluation results of these scenarios.

TABLE 8. The influence of integrated metrics on the performance of MageCode. (Unit: %).

The results in Table 8 reveal that the two scenarios demonstrate highly competitive results, with a slightly better performance of Metric Integrated overall. The Metric Integrated approach consistently outperforms CodeT5+ Only in Accuracy,
F1-Score, AUROC, and [email protected]%FPR across the three datasets. The disparity in performance between these two scenarios is minor when applied to the Python and Java datasets. Specifically, in the Python dataset, Metric Integrated outperforms CodeT5+ Only by 0.05% in Accuracy, 0.05% in F1-Score, and 0.01% in AUROC. The Java dataset has values of 0.07%, 0.05%, and 0.04%, respectively. The superior performance of Metric Integrated over CodeT5+ Only is particularly evident in the C++ dataset, exhibiting a 0.15% increase in Accuracy, a 0.15% increase in F1-Score, and a 0.04% increase in AUROC. Furthermore, while CodeT5+ Only may exhibit superior performance in TPR@1%FPR (for Python and C++) and TPR@10%FPR (for C++), Metric Integrated consistently surpasses it in [email protected]%FPR, with enhancements of 0.5%, 3.8%, and 9.41% in Python, Java, and C++, respectively.

Nevertheless, the notable performance of CodeT5+ Only underscores the exceptional competence of the CodeT5+ model in comprehending and analyzing source code, which significantly contributes to MageCode. Moreover, the improvement in the performance of MageCode when integrating various metrics into the feature vector demonstrates the beneficial impact of these metrics on the method, which is consistent with the evaluation findings in Section V-A.

3) INFLUENCE OF TOKEN LENGTH ON DETECTION METRICS
Since MageCode is limited to processing a maximum of 512 tokens, this subsection further analyzes the accuracy of MageCode for each token length range. We divided the original test set into smaller test sets, grouping source code with the same token length range into one group (e.g., 0-128, 128-256, etc.). The results are shown in Figure 8.

FIGURE 8. The influence of token length on the performance of MageCode on three different datasets.

We clearly observe that the model achieves the lowest accuracy score when the input source code is too short (below 128 tokens). This observation aligns with some previous work on developing machine-generated text detectors [31]. As the
FIGURE 8. The influence of token length to the performance of MageCode on three different datasets.
token length increases (128-256, 256-512, etc.), the accuracy length of source code input. In MageCode, three pre-trained
also gradually increases. In the case of Python and Java, LLMs are taken into account: CodeT5+, CodeBERT, and
the method achieves the highest accuracy when the token PolyCoder. CodeT5+ and CodeBERT both exhibit superior
length range is within 128-256 or 256-512. These two ranges and comparable results, while PolyCoder suffers from
exhibit a competitive level of accuracy, with a variation of poor performance. Besides, the evaluation of MageCode
0.01% in Python. This finding also holds true for C++ with performance on two scenarios (one with integrated metrics
a decrease of 0.02%. Regarding Java, the difference is further and the other without using these metrics) demonstrates the
pronounced, increasing by 0.23% from 98.30% to 98.53%. beneficial impact of those metrics on the detection ability of
However, this difference is relatively minor when compared the MageCode method. The last experiments focus on token
to the disparities in other ranges. When the input token length length by dividing the test sets into groups based on varying
exceeds 512, we observe a slight drop in accuracy. This token length ranges and assessing the accuracy of MageCode
outcome is understandable given that long source code may for each group. Groups of 128-256 and 256-512 tokens show
include significant features at the end that the model does not competitive and highest accuracies since the source code in
process due to the truncation mechanism. C++ is the only these groups is sufficiently lengthy to avoid the truncation
special case where the truncation of lengthy tokens does not mechanism, except for the C++ case, whose accuracy is
negatively affect the detection accuracy but conversely results highest when evaluating on the 512-2048 token length range
in the highest accuracy across all test subsets. group.
This section has presented evaluation results from three main experiments. First, the applicability of six metric-based machine-generated text detection methods to source code input was explored. Four of them, namely Log-Likelihood, Log-Rank, Entropy, and GLTR, show promising results and are integrated into the MageCode method. The remaining two are excluded because of the poor performance of LRR and the inferior results of Rank compared with its analogous metric, Log-Rank.

Second, the performance of the MageCode method was evaluated and compared with existing baselines. The experimental results demonstrate the superior performance of MageCode, attaining an accuracy of over 98% for Python and Java and 95% for C++, with a true positive rate surpassing 85% while maintaining a false positive rate lower than 1%.

Finally, the section concluded with ablation studies examining the impact of various factors on MageCode performance: the pre-trained LLM, the integrated metrics, and the token length of the source code input. In MageCode, three pre-trained LLMs were considered: CodeT5+, CodeBERT, and PolyCoder. CodeT5+ and CodeBERT exhibit superior and comparable results, while PolyCoder suffers from poor performance. Besides, evaluating MageCode in two scenarios (one with the integrated metrics and one without) demonstrates the beneficial impact of those metrics on the detection ability of the method. The last experiments focus on token length, dividing the test sets into groups based on token length ranges and assessing the accuracy of MageCode for each group. The 128-256 and 256-512 token groups show the highest, and mutually competitive, accuracies, since source code in these groups is sufficiently long while still avoiding the truncation mechanism; the exception is C++, whose accuracy is highest on the 512-2048 token length range group.

VI. CONCLUSION
This paper presents MageCode, a novel machine-generated code detection method. MageCode utilizes the pre-trained encoder-only CodeT5+ 220M model to extract features from source code input and additionally makes use of several metric-based machine-generated text detection methods to improve performance. The evaluation results demonstrate the superior performance of MageCode compared with existing baselines, with high accuracy and a high true positive rate while maintaining a false positive rate lower than 1%.

This paper has explored the applicability of several metric-based machine-generated text detection methods to source code analysis. Log-Likelihood, Log-Rank, Entropy, and GLTR were proven to be beneficial and were integrated into MageCode, where they make a positive impact on the performance of the method.
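For readers who wish to reproduce such metrics, the token-level statistics underlying Log-Likelihood and Log-Rank can be obtained from any causal language model that scores the source code; the sketch below is a minimal illustration in which the GPT-2 checkpoint is only a stand-in, not necessarily the scoring model used by MageCode.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM that can score code tokens works here; GPT-2 is used only
# because it is small and widely available.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def log_likelihood_and_log_rank(code: str):
    ids = tok(code, return_tensors="pt", truncation=True, max_length=512)["input_ids"]
    logits = lm(ids).logits[0, :-1]          # predictions for positions 1..n-1
    targets = ids[0, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    idx = torch.arange(targets.numel())
    token_ll = log_probs[idx, targets]       # log-probability of each observed token
    # Rank of each observed token among the model's predictions (1 = most likely).
    ranks = (logits > logits[idx, targets].unsqueeze(-1)).sum(dim=-1) + 1
    return token_ll.mean().item(), torch.log(ranks.float()).mean().item()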
The effect of other factors, such as the pre-trained LLM and the token length of the source code, on the effectiveness of MageCode was also considered. In the considered scenarios, MageCode outperforms the benchmarks (OpenAI Detector and DetectGPT4Code), achieving up to 98.46% accuracy for Python. In MageCode, incorporating the pre-trained model CodeT5+ with metric-based techniques enhances performance compared with utilizing the pre-trained model CodeT5+ alone. Besides, source code of different token lengths also has a considerable effect on the performance of MageCode; the method tends to achieve higher accuracy when applied to source code with token lengths in the ranges of 128-256 and 256-512.

To conduct the experiments, we constructed a new dataset for the machine-generated code detection problem. The dataset includes over 45,000 code solutions generated by three large language models that are advanced in code generation: GPT-4-Turbo, Gemini-pro-1.0, and Code-bison-32k. It also contains over 80,000 human-written solutions collected from previous studies and carefully preprocessed. All source code in the dataset is written in three programming languages popular in educational environments: Python, Java, and C++.

Currently, MageCode concentrates on identifying whether source code was produced by humans or by LLMs. In the future, we will enhance MageCode to detect code similarity across source code files and to discover flaws inside these files. Furthermore, although MageCode emphasizes the educational environment, this research can be expanded to identify unethical applications of AI in job applications, coding competitions, and similar contexts.
REFERENCES
[1] S. R. Das and M. J. V., "Perceptions of higher education students towards ChatGPT usage," Int. J. Technol. Educ., vol. 7, no. 1, pp. 86–106, Feb. 2024.
[2] P. Haindl and G. Weinberger, "Students' experiences of using ChatGPT in an undergraduate programming course," IEEE Access, vol. 12, pp. 43519–43529, 2024.
[3] X. He, X. Shen, Z. Chen, M. Backes, and Y. Zhang, "MGTBench: Benchmarking machine-generated text detection," 2023, arXiv:2303.14822.
[4] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, and C. Finn, "DetectGPT: Zero-shot machine-generated text detection using probability curvature," in Proc. Int. Conf. Mach. Learn., Jan. 2023, pp. 24950–24962.
[5] Z. Xu and V. S. Sheng, "Detecting AI-generated code assignments using perplexity of large language models," in Proc. AAAI Conf. Artif. Intell., Mar. 2024, vol. 38, no. 21, pp. 23155–23162.
[6] Y. Wang, H. Le, A. Deepak Gotmare, N. D. Q. Bui, J. Li, and S. C. H. Hoi, "CodeT5+: Open code large language models for code understanding and generation," 2023, arXiv:2305.07922.
[7] OpenAI et al., "GPT-4 technical report," 2023, arXiv:2303.08774.
[8] G. Team et al., "Gemini: A family of highly capable multimodal models," 2023, arXiv:2312.11805.
[9] Google. (2024). Code Bison Repository. Accessed: Jun. 18, 2024. [Online]. Available: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/code-bison
[10] (2024). Code Models Overview. Accessed: Jun. 21, 2024. [Online]. Available: https://cloud.google.com/vertex-ai/generative-ai/docs/code/code-models-overview
[11] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Jun. 2017, pp. 5998–6008.
[12] S. J. Prince, Understanding Deep Learning. Cambridge, MA, USA: MIT Press, 2023.
[13] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. Wook Kim, S. Kreps, M. McCain, A. Newhouse, J. Blazakis, K. McGuffie, and J. Wang, "Release strategies and the social impacts of language models," 2019, arXiv:1908.09203.
[14] S. Gehrmann, H. Strobelt, and A. M. Rush, "GLTR: Statistical detection and visualization of generated text," 2019, arXiv:1906.04043.
[15] J. Su, T. Yue Zhuo, D. Wang, and P. Nakov, "DetectLLM: Leveraging log rank information for zero-shot detection of machine-generated text," 2023, arXiv:2306.05540.
[16] T. GLM et al., "ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools," 2024, arXiv:2406.12793.
[17] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin. (2023). Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM. [Online]. Available: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[18] Y. Anand, Z. Nussbaum, A. Treat, A. Miller, R. Guo, B. Schmidt, G. Community, B. Duderstadt, and A. Mulyar, "GPT4All: An ecosystem of open source compressed language models," 2023, arXiv:2311.04931.
[19] J. Tow. StableLM Alpha V2 Models. Accessed: Jun. 1, 2023. [Online]. Available: https://huggingface.co/stabilityai/stablelm-base-alpha-7b-v2
[20] Claude. Accessed: Jun. 1, 2023. [Online]. Available: https://claude.ai/
[21] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu, "How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection," 2023, arXiv:2301.07597.
[22] X. Yang, K. Zhang, H. Chen, L. Petzold, W. Yang Wang, and W. Cheng, "Zero-shot detection of machine-generated codes," 2023, arXiv:2310.05103.
[23] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, "CodeBERT: A pre-trained model for programming and natural languages," 2020, arXiv:2002.08155.
[24] R. Li, J. Fu, B.-W. Zhang, T. Huang, Z. Sun, C. Lyu, G. Liu, Z. Jin, and G. Li, "TACO: Topics in algorithmic COde generation dataset," 2023, arXiv:2312.14852.
[25] Y. Li et al., "Competition-level code generation with AlphaCode," Science, vol. 378, no. 6624, pp. 1092–1097, Dec. 2022.
[26] H. Z. Dosilovic and I. Mekterovic, "Robust and scalable online code execution system," in Proc. 43rd Int. Conv. Inf., Commun. Electron. Technol. (MIPRO), Sep. 2020, pp. 1627–1632.
[27] D. Ippolito, D. Duckworth, C. Callison-Burch, and D. Eck, "Automatic detection of generated text is easiest when humans are fooled," 2019, arXiv:1911.00650.
[28] V. M. Hanriot, L. C. B. Torres, and A. P. Braga, "Multiclass graph-based large margin classifiers: Unified approach for support vectors and neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 3, no. 1, pp. 1–10, Apr. 2024.
[29] T. Akshar, V. Singh, N. L. B. Murthy, A. Krishna, and L. Kumar, "A CodeBERT based empirical framework for evaluating classification-enabled vulnerability prediction models," in Proc. 17th Innov. Softw. Eng. Conf., Feb. 2024, pp. 1–11, doi: 10.1145/3641399.3641405.
[30] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, "A systematic evaluation of large language models of code," in Proc. 6th ACM SIGPLAN Int. Symp. Mach. Program., Jun. 2022, pp. 1–10.
[31] V. Verma, E. Fleisig, N. Tomlin, and D. Klein, "Ghostbuster: Detecting text ghostwritten by large language models," 2023, arXiv:2305.15047.

HUNG PHAM received the bachelor's degree in computer engineering from the School of Information and Communications Technology, Hanoi University of Science and Technology, in 2023, where he is currently pursuing the master's degree. He is working for the Bach Khoa Cyber Security (BKCS) Center. His research interests include cybersecurity and trusted computing.
HUYEN HA received the bachelor's degree in computer science from the School of Information and Communications Technology, Hanoi University of Science and Technology, in 2024. She is currently with the Bach Khoa Cyber Security (BKCS) Center and the School of Information and Communications Technology, Hanoi University of Science and Technology. Her research interests include cybersecurity and large language models.

DUC TRAN received the M.Sc. degree by research and the Ph.D. degree in computer science from City University London, in 2013 and 2015, respectively. He is currently a Lecturer with the School of Information and Communication Technology, Hanoi University of Science and Technology, and the Director of the Bach Khoa Cyber Security Center (BKCS). His current research interests include machine learning, pattern recognition, and computer security. Current and past application areas of his work include biometric authentication, network security, and multimedia security. He served on the program committees of the SOICT 2016 and SOIS 2018 and 2022 conferences.