MAGECODE Machine-Generated Code Detection Method Using Large Language Models
ABSTRACT The widespread use of virtual assistants (e.g., GPT4 and Gemini, etc.) by students in their
academic assignments raises concerns about academic integrity. Consequently, various machine-generated
text (MGT) detection methods, developed from metric-based and model-based approaches, were proposed
and shown to be highly effective. The model-based MGT methods often encounter difficulties when dealing
with source code due to disparities in semantics compared to natural languages. Meanwhile, the efficacy
of metric-based MGT methods on source code has not been investigated. Moreover, the challenge of
identifying machine-generated codes (MGC) has received less attention, and existing solutions demonstrate
low accuracy and high false positive rates across diverse human-written codes. In this paper, we take into
account both semantic features extracted from Large Language Models (LLMs) and the applicability of
metrics (e.g., Log-Likelihood, Rank, Log-rank, etc.) for source code analysis. Concretely, we propose
MageCode, a novel method for identifying machine-generated codes. MageCode utilizes the pre-trained
model CodeT5+ to extract semantic features from source code inputs and incorporates metric-based
techniques to enhance accuracy. In order to assess the proposed method, we introduce a new dataset
comprising more than 45,000 code solutions generated by LLMs for programming problems. The solutions
to these programming problems, obtained from three advanced LLMs (GPT4, Gemini, and
Code-bison-32k), were written in Python, Java, and C++. The evaluation of MageCode on this dataset
demonstrates superior performance compared to existing baselines, achieving up to 98.46% accuracy while
maintaining a low false positive rate of less than 1%.
INDEX TERMS Machine-generated code detection, large language model, metrics, CodeT5+.
methods when performed on source code has been demonstrated to be limited, the applicability of metric-based methods on source code remains unexplored. Meanwhile, existing machine-generated code detection methods, such as DetectGPT4Code [4] and AIGCode Detector [5], have not achieved high accuracy, nor have they been evaluated in high-stakes scenarios, where a false positive rate below 1% is necessary to minimize the risk of incorrect code removal and ensure that appropriate actions are taken.

In this paper, we aim to bridge the gap by developing a novel, effective method to detect machine-generated codes in educational environments. The research focuses on detecting code solutions for programming problems rather than source code in large-scale software projects. This paper first thoroughly evaluates the performance of six metric-based machine-generated text detection methods when applied to source code to find highly adaptable metrics in source code analysis. Subsequently, we propose a novel model-based detector that utilizes the pre-trained model CodeT5+ [6] to extract semantic features from source code inputs and combines them with appropriate statistical metrics to detect machine-generated code. Extensive experiments are conducted to assess its effectiveness and compare it to current baseline methods.

To the best of our knowledge, there was no public dataset available for our experiments at the time of writing. Therefore, we constructed a new dataset containing both human-written and machine-generated source code, which are solutions for a set of programming problems. Machine-generated codes were obtained by querying three popular LLMs in code generation, namely ChatGPT [7], Gemini [8], and Code-bison-32k [9], with the descriptions of programming problems. The response codes from these LLMs were then tested against test cases to ensure their quality before being included in the final dataset. The final dataset contains more than 45,000 machine-generated code snippets written in three programming languages that are popular in education environments: Python, Java, and C++. Our experiments were then conducted on this newly constructed dataset.

To summarize, our work provides the following contributions:
• Introducing a new dataset for the machine-generated code detection problem, including over 45,000 source code samples produced by three well-known large language models: GPT-4-turbo, Gemini-1.0-pro, and Code-bison-32k. The dataset consists of examples of human-written and machine-generated source code in three programming languages: Python, Java, and C++. The dataset was published to Hugging Face1 for the research community.
• Evaluating the effectiveness of metric-based detection methods commonly used for detecting machine-generated text across three code datasets of Python, Java, and C++.
• Proposing a novel machine-generated code detection method that utilizes the pre-trained encoder-only CodeT5+ model integrated with highly applicable statistical metrics in source code analysis.

The remainder of this paper is organized as follows: Section II provides an overview of leading LLMs employed in code generation and explores existing detection methods, including those for identifying machine-generated text in general and specifically those designed for detecting machine-generated code. Section III discusses in detail our proposed method. The dataset construction and specification are described in Section IV. Section V presents the experimental results. The paper concludes with Section VI, which summarizes our findings.

1 https://huggingface.co/datasets/HungPhamBKCS/magecode-dataset

II. BACKGROUND AND RELATED WORKS
A. ADVANCED LARGE LANGUAGE MODELS
This section explores the code generation capabilities of three Large Language Models (LLMs): GPT4, Gemini, and Code-bison-32k.

GPT4 [7] is a large multimodal model that can accept image and text inputs and produce text outputs. It is built upon the Transformer architecture to predict the next token in a text sequence. GPT4 outperforms both previous large language models and most state-of-the-art (SOTA) systems on a suite of traditional NLP benchmarks [7]. On the HumanEval benchmark, GPT4 achieves 67%, compared to 65.8% for the previous SOTA.

Gemini [8] is a family of multimodal models trained with diverse inputs to develop strong generalist abilities across multiple modalities. A notable feature of Gemini is its ability to generate code, interpret user inputs describing desired functionalities, and translate them into functional code. This capability has significant potential to streamline software development workflows and enhance human-AI collaboration.

Code-bison-32k [9] is a specialized generative AI model focused on code generation and software development tasks. It excels in writing, debugging, and optimizing code across various programming languages. According to Google DeepMind, Code-bison-32k supports a wide range of coding languages, including C, C++, C#, Python, Java, JavaScript, and more than 30 additional languages [10].

The three aforementioned models were employed to produce machine-generated codes in our dataset. For GPT4, we used GPT4-turbo, which, during the experiments, pointed to the gpt-4-0125-preview model. For Gemini, we used the Gemini-1.0-pro edition. Both GPT4 and Gemini are widely popular AI-powered products that have drawn significant attention from the community. On the other hand, Code-bison-32k is a standalone deep learning model that does not attract as much attention. Nevertheless, Code-bison-32k is specifically designed for code-related tasks, whereas GPT4 and Gemini provide broader capabilities across different domains. Therefore, the inclusion of Code-bison-32k
enhances the variety and thoroughness of our dataset. It is also worth mentioning that we were aware of GitHub Copilot, which is also a widely used AI assistant for automated code production. However, as far as we know, this tool does not provide an API set for convenient code retrieval. After consideration, we decided to use only the three models mentioned earlier.

B. CODET5+ MODELS AND THE TRANSFORMER ARCHITECTURE
CodeT5+ [6] is a new family of open code large language models for code understanding and generation tasks. Developed by Salesforce AI Research, CodeT5+ models are aimed at overcoming two major constraints of existing code LLMs: architectural inflexibility and a restricted collection of pretraining tasks. Current code LLMs can only work in certain designs, such as encoder-only or decoder-only. CodeT5+ models, on the other hand, have a flexible architecture that can work in encoder-only, decoder-only, or combined encoder-decoder modes to dynamically adapt to a wide range of downstream applications. Additionally, CodeT5+ models include a variety of pre-training tasks, such as span denoising, causal language modeling (CLM), contrastive learning, and text-code matching, to address the problem of limited pre-training tasks. This allows CodeT5+ models to bridge the gap between the pre-training and fine-tuning stages, as well as surpass existing code LLMs in aligning with the complexities of different downstream code tasks.

At the heart of CodeT5+ models lies the Transformer architecture, created by Google to handle natural language processing tasks. The Transformer architecture [11] was developed to tackle the challenges faced by Recurrent Neural Networks (RNNs) in capturing relationships between distant words in a phrase and their slow training speed. The Transformer architecture follows the encoder-decoder architecture of RNNs, where the encoder encodes an input sequence of symbols (which represents, for example, words in a sentence) into a sequence of continuous representations, and the decoder produces a sequence of symbols from the continuous representations. However, instead of using the recurrence mechanism of RNNs, both the Transformer encoder and decoder employ the Multi-head self-attention mechanism, which allows the models to capture global dependencies between words in a phrase or sentence. A self-attention function takes an input embedding X = [x1, x2, ..., xN] ∈ R^(D×N), where each embedding xi of dimension D × 1 represents a word or token in the input sentence. Then, the three following quantities, named Value, Key, and Query, are given as

V[X] = Wv X,  K[X] = Wk X,  Q[X] = Wq X    (1)

where Wv, Wk, and Wq represent the weights for Value, Key, and Query, respectively. The terms Query and Key are derived from the field of information retrieval, where Query is used to measure how much attention should be placed on different parts of the input sentence, and Key is used to compute the similarity between the Query and other parts in the sequence [12]. Combining them together, the Attention weight is as follows:

A[X] = Softmax(Q[X]K[X]^T / √dk)    (2)

where dk is the number of rows in Wk and Wq. Value V[X] is the matrix that contains the actual information to be passed to the next layer in the model. From V[X] and the Attention weights A[X], the output of a self-attention SA[X] is computed as

SA[X] = V[X]A[X]    (3)

In the Multi-head self-attention mechanism, the self-attention function is performed H times with H different sets of Wv, Wk, and Wq, which outputs H self-attention outputs SA1[X], SA2[X], ..., SAH[X]. Finally, these self-attention outputs are (vertically) concatenated and transformed linearly to produce the final output, i.e.,

MultiHead(X) = Wh [SA1[X], SA2[X], ..., SAH[X]]    (4)

where Wh represents the weights for the multi-head computation. The Multi-head self-attention mechanism was found to be beneficial and to help self-attention perform well [12]. Both the Transformer encoder and decoder consist of multiple identical layers, each of which contains a Multi-head self-attention block and other transformations. Although only a component in the whole architecture, this mechanism plays an important role in the success of Transformers.
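To make Eqs. (1)-(4) concrete, the following is a minimal NumPy sketch of single-head and multi-head self-attention. It uses the common row-wise convention (tokens as rows) rather than the column-vector notation above, and the dimensions, variable names, and toy inputs are purely illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single head, Eqs. (1)-(3): X has shape (N, D), weight matrices have shape (D, dk)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # Eq. (1)
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))    # Eq. (2), attention weights of shape (N, N)
    return A @ V                                   # Eq. (3), shape (N, dk)

def multi_head(X, heads, Wh):
    """Eq. (4): run H heads, concatenate their outputs, then project with Wh."""
    outputs = [self_attention(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outputs, axis=-1) @ Wh

# Toy usage: 6 tokens, model dimension 16, H = 4 heads of size 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))
heads = [tuple(rng.normal(size=(16, 4)) for _ in range(3)) for _ in range(4)]
Wh = rng.normal(size=(16, 16))
print(multi_head(X, heads, Wh).shape)  # (6, 16)
```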
By using Transformers as a foundation and taking an innovative approach to pre-training tasks as well as a flexible architecture, CodeT5+ models have achieved state-of-the-art results across different code-related tasks, such as code comprehension and generation [6]. Therefore, we chose CodeT5+ models to be the base model for our MageCode method. The use of CodeT5+ in the method is detailed in Section III.

C. RELATED WORKS
1) MACHINE-GENERATED TEXT DETECTION METHODS
Recently, the problem of machine-generated text identification has drawn a lot of attention and research effort. Various methods have been proposed to differentiate text produced by LLMs from text written by humans. These methods can be broadly categorized into metric-based and model-based methods [3].

In general, metric-based methods rely on pre-trained LLMs to analyze the text and extract statistical features from it, for example, the token-wise log probability or rank of each word in a document given its previous context [3]. This paper assesses six metric-based detection methods, including Log-Likelihood [13], Rank [14], Log-Rank [4], Entropy [14], GLTR [14], and LRR [15]. These metrics have been demonstrated in prior studies [3] to be rather effective in identifying machine-generated content from six LLMs, including ChatGLM [16], Dolly [17], ChatGPT-turbo [7],
GPT4All [18], StableLM [19], and Claude [20], across three distinct datasets.

In terms of model-based approaches, a classification model is developed by fine-tuning pre-trained language models with datasets that comprise both machine-generated and human-written texts [3], [13]. After the fine-tuning process, the developed classification model should be able to distinguish machine-generated content in the provided dataset. For example, OpenAIDetector [13] fine-tuned a RoBERTa model with texts produced by the largest GPT2 model. Similarly, ChatGPT Detector [21] was developed to detect ChatGPT-generated content by fine-tuning a RoBERTa model with the HC3 dataset as input. These methods have been demonstrated to achieve high accuracy in classifying text origins, even in challenging situations such as maintaining a false positive rate (FPR) below 0.1%. However, recent studies indicate that LLM-generated text detection methods are less effective when performed on source code data [22].

2) MACHINE-GENERATED CODE DETECTION METHODS
Compared to the machine-generated text detection problem, the task of identifying machine-generated code has garnered less attention and research effort. Existing methods mainly rely on feature analysis, i.e., metric-based methods, notably AIGCode based on perplexity and DetectGPT4Code based on probability curvature.

AIGCode [5] is designed specifically to identify AI-generated code and prevent the misuse of LLMs among students in programming education. This method leverages targeted masking and a fine-tuned CodeBERT [23] model. By masking areas of the code with higher perplexity, it creates subtle variations that reveal patterns characteristic of AI-generated code. AIGCode then evaluates these variations using a scoring system that considers overall perplexity, variation in code line perplexity, and burstiness. Higher scores indicate a higher likelihood of AI generation, based on the observation that AI-generated code often exhibits lower perplexity and is less susceptible to perturbations. However, AIGCode was assessed using a limited dataset of approximately 5,000 machine-generated codes and only attained accuracy up to 92% for the three programming languages considered in this paper.

DetectGPT4Code [22] is a training-free method for detecting code generated by black-box models such as ChatGPT. Built upon the principles of the original zero-shot text detection method, DetectGPT [4], this method utilizes a small code language model as a surrogate white-box model to estimate the probability curvature of the rightmost tokens in the candidate code. Particularly, DetectGPT4Code employs a smaller, open-sourced language model (LM) like Incoder-6B to calculate the probability curvature of candidate code as well as generate multiple perturbed versions. Subsequently, various surrogate models ranging from 100M to 10B parameters, including PyCodeGPT-110M, PolyCoder, and CodeParrot, are utilized to estimate the probabilities of the ending tokens in both original and perturbed codes. Finally, the resulting DetectGPT4Code score is calculated to identify code snippets generated by language models.

III. MAGECODE
A. METHODOLOGY OVERVIEW
We developed MageCode, a machine-generated code detection method that utilizes the pre-trained model CodeT5+ 220M and integrates appropriate metrics. MageCode processes a code snippet input in four phases: Tokenization, Feature Extraction, Metric Calculation, and Classification. In the Tokenization phase, the source code input is converted to embedding vectors by using CodeT5+'s tokenizer. Subsequently, the embedding vector is passed through the Encoder layers of the CodeT5+ model in the Feature Extraction phase to extract semantic features. In parallel, the Metric Calculation phase calculates several metrics which represent the statistical features of the source code. Then, the statistical features are combined with the semantic features to form a final feature vector. Finally, in the Classification phase, a classification layer consisting of two fully connected neural network layers employs the feature vector to perform the binary classification operation. The processing flow of MageCode is depicted in Figure 1. Details of each phase in the method are described in the following sections.

B. SOURCE CODE TOKENIZATION
The Tokenization phase is performed by the tokenizer of the CodeT5+ model. The tokenizer first breaks the input source code into smaller units called tokens. Since the tokenizer has a maximum input sequence length of 512 tokens, it either truncates or pads the source code input depending on its length. If the input source code exceeds the 512-token limit, the tokenizer will remove excess tokens from the end to ensure it fits within the maximum limit. On the contrary, if the source code is shorter than the limit, the tokenizer appends a special [PAD] token to the end until the sequence reaches the required length. The obtained embedding vectors will be used as input for the pre-trained encoder-only CodeT5+ model for feature extraction.
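As an illustration of this phase, the sketch below loads a CodeT5+ tokenizer from Hugging Face and produces a fixed-length, 512-token input. The checkpoint name and the exact padding/truncation settings are assumptions for illustration; the paper only specifies the 512-token limit and the use of the CodeT5+ tokenizer.

```python
from transformers import AutoTokenizer

# Assumed checkpoint; the paper uses the CodeT5+ 220M model.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")

source_code = "def add(a, b):\n    return a + b\n"

encoded = tokenizer(
    source_code,
    max_length=512,        # maximum input sequence length
    truncation=True,       # drop tokens beyond the 512-token limit
    padding="max_length",  # append [PAD] tokens up to 512
    return_tensors="pt",
)
print(encoded["input_ids"].shape)       # torch.Size([1, 512])
print(encoded["attention_mask"].sum())  # number of non-padding tokens
```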
C. FEATURE EXTRACTION
In the Feature Extraction phase, the embedding vector obtained from the Tokenization phase is fed into the pre-trained encoder-only CodeT5+ model. Opting for an encoder-only model over a decoder-only or combined encoder-decoder model is justified by the findings that decoder-only models are often not ideal for understanding tasks such as retrieval and detection compared to encoder-only models [6], and that encoder-decoder models fail to beat state-of-the-art (SoTA) encoder-only or decoder-only baselines on retrieval and code completion tasks, respectively [6]. The pre-trained encoder-only CodeT5+ model uses a multi-layer bidirectional Transformer-based architecture. The architecture is made up of 12 identical encoder
blocks, and each block has two main parts: a bidirectional self-attention layer and a fully connected feed-forward neural network layer with ReLU activation [6].

The multi-head self-attention mechanism and a normalization layer are applied to the input embeddings X in each encoder block to produce the initial output embedding Y:

Y = LayerNorm(X + MultiHead(X))    (5)

Then, the output embedding Y is passed through a fully-connected feed-forward neural (FFN) network layer with a ReLU activation function. Subsequently, a normalization layer is applied to the result of this transformation to produce the final output embedding Z:

Z = LayerNorm(Y + FFN(Y))    (6)

The output embedding Z of an encoder block serves as the input embedding X for the successive encoder block. After 12 encoder blocks, we obtain the final hidden state of the classification vector [CLS], denoted as r (r ∈ R^(1×768)), which is the aggregated representation of source code features.
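The following sketch shows one way to obtain the 768-dimensional representation r with the Hugging Face transformers library. It assumes the encoder of the CodeT5+ 220M checkpoint and uses the hidden state of the first token as the [CLS]-style aggregate, following the description above; the exact checkpoint and pooling used in MageCode may differ.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Assumed checkpoint; encoder-only use of the CodeT5+ 220M model.
checkpoint = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = T5EncoderModel.from_pretrained(checkpoint)
encoder.eval()

code = "def add(a, b):\n    return a + b\n"
inputs = tokenizer(code, max_length=512, truncation=True,
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # shape (1, 512, 768)

r = hidden[:, 0, :]   # first-token ([CLS]-style) representation, shape (1, 768)
print(r.shape)
```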
D. METRIC CALCULATION
In parallel with the Tokenization and Feature Extraction phases, the source code input is used to calculate several metrics that represent different statistical aspects of the source code. This research considers the evaluation of the six following metrics.

Log-Likelihood [13] employs a language model to calculate the logarithmic probability of tokens. In particular, it computes the mean logarithmic likelihood of each word token in a given text to produce a score. A larger score suggests a higher likelihood that the text was produced by a machine. The Log-Likelihood score is computed as

Log-Likelihood = (1/t) Σ_{i=1}^{t} log pθ(xi | x<i)    (7)

where t denotes the number of tokens, xi represents the i-th token in the given sequence of tokens, x<i refers to all tokens before the i-th token, and pθ(xi | x<i) represents the conditional probability of observing token xi given the context x<i.

Rank [14] first computes the absolute rank of each word in a text based on its preceding context. Following that, the rank score of the given text is determined by calculating the average of these rank values across all words. A lower score indicates a higher probability that the text was produced by a machine. The Rank score is calculated as

Rank = (1/t) Σ_{i=1}^{t} rθ(xi | x<i)    (8)

where rθ(xi | x<i) ≥ 1 is the rank of token xi conditioned on the previous tokens.

Log-Rank [4], different from the Rank metric, calculates the score by applying the logarithm function to the rank value of each word instead of directly utilizing the absolute rank. The Log-Rank can be defined as

Log-Rank = (1/t) Σ_{i=1}^{t} log rθ(xi | x<i)    (9)

In a similar manner to the Rank score, Entropy [14] computes the score of a text by averaging the entropy value of each word based on its prior context. Previous research [4], [14] points out that texts produced by machines are more likely to have lower Entropy scores. The Entropy score is as follows:

Entropy = −(1/t) Σ_{i=1}^{t} pθ(xi | x<i) log pθ(xi | x<i)    (10)
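As a sketch of how Eqs. (7)-(10) can be computed in practice, the code below scores a snippet with a causal language model and derives the token-wise log-probability, rank, log-rank, and entropy terms. The choice of GPT-2 as the scoring model is purely illustrative; MageCode's metric evaluation uses the CodeBERT-base-mlm model described in Section V-A.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # illustrative scoring model only
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)
lm.eval()

def metric_scores(text):
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    with torch.no_grad():
        logits = lm(ids.unsqueeze(0)).logits[0]          # (t, vocab)
    # Predict token i from its context x_<i: align logits[i-1] with ids[i].
    log_probs = F.log_softmax(logits[:-1], dim=-1)       # (t-1, vocab)
    targets = ids[1:]
    token_logp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    token_p = token_logp.exp()
    # Rank of each observed token among the vocabulary (1 = most likely).
    ranks = (log_probs > token_logp.unsqueeze(1)).sum(dim=1) + 1
    return {
        "log_likelihood": token_logp.mean().item(),            # Eq. (7)
        "rank": ranks.float().mean().item(),                   # Eq. (8)
        "log_rank": ranks.float().log().mean().item(),         # Eq. (9)
        "entropy": -(token_p * token_logp).mean().item(),      # Eq. (10), as written above
    }

print(metric_scores("def add(a, b):\n    return a + b\n"))
```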
GLTR [14] is specifically developed as a supplementary tool to assist in identifying whether a given text is machine-generated. This work employs the GLTR method incorporated into the MGTBench framework [3], with a specific focus on Test-2 features. These features encompass the assessment of the proportion of words that rank within the top 10, 100, 1,000, and other categories.

Log-Likelihood Log-Rank Ratio (LRR) [15] is proposed by Su et al. and combines Log-Likelihood and Log-Rank to offer comprehensive insights for a given text. The LRR can be computed as

LRR = Log-Likelihood / Log-Rank = (Σ_{i=1}^{t} log pθ(xi | x<i)) / (Σ_{i=1}^{t} log rθ(xi | x<i))    (11)
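Building on the per-token quantities from the previous sketch, the fragment below illustrates the GLTR Test-2 features (fraction of tokens whose rank falls in the top 10, top 100, top 1,000, or beyond) and the LRR of Eq. (11). The bucket boundaries follow the description above, and the variable names and toy values are illustrative assumptions.

```python
import numpy as np

def gltr_test2_features(ranks):
    """Fraction of tokens ranked in the top 10, 100, 1,000, and beyond."""
    ranks = np.asarray(ranks, dtype=float)
    return [
        (ranks <= 10).mean(),
        ((ranks > 10) & (ranks <= 100)).mean(),
        ((ranks > 100) & (ranks <= 1000)).mean(),
        (ranks > 1000).mean(),
    ]  # four values, matching the four GLTR entries of the "metric vector" in Section III-G

def lrr(token_log_probs, ranks):
    """Eq. (11): ratio of summed log-probabilities to summed log-ranks."""
    return np.sum(token_log_probs) / np.sum(np.log(ranks))

# Toy usage with made-up per-token statistics.
ranks = [1, 3, 250, 12, 2, 4000]
token_log_probs = [-0.1, -1.2, -6.3, -2.0, -0.4, -9.1]
print(gltr_test2_features(ranks), lrr(token_log_probs, ranks))
```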
After being calculated, these metric results are normalized using the mean and standard deviation estimated on the training split of the dataset. Finally, the normalized metrics are appended to the end of r to create the final feature vector. This feature vector serves as the input to the classification layer in the binary classification phase. It should be noted that not all six metrics presented above are integrated in MageCode. Rather, the suitability of each metric will be investigated (Section V-A), and only metrics that show promising results will be included in the final feature vector.

E. BINARY CLASSIFICATION
The classification layer consists of two fully-connected feed-forward neural (FFN) network layers, with a ReLU activation function positioned between them. The first FFN layer contains 1024 neurons, while the second layer has only one neuron. Following the second layer, the sigmoid function is applied to the single output neuron to carry out the binary classification operation, which effectively classifies the input source code as either human-written or machine-generated.
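A minimal PyTorch sketch of this classification layer is given below. The input size assumes the 768-dimensional CodeT5+ representation concatenated with nine normalized metric values, as in the example of Section III-G; the training setup (loss, optimizer) is not specified by the paper and is omitted.

```python
import torch
import torch.nn as nn

class MageCodeClassifier(nn.Module):
    """Two fully-connected layers (1024 -> 1) with ReLU between them and a sigmoid output."""
    def __init__(self, feature_dim=768 + 9):
        super().__init__()
        self.fc1 = nn.Linear(feature_dim, 1024)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(1024, 1)

    def forward(self, features):
        logit = self.fc2(self.relu(self.fc1(features)))
        return torch.sigmoid(logit)  # probability that the input code is machine-generated

# Toy usage: one feature vector (768 semantic dimensions + 9 normalized metrics).
clf = MageCodeClassifier()
features = torch.randn(1, 768 + 9)
print(clf(features))
```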
F. TIME COMPLEXITY
The proposed technique includes two primary components: feature extraction utilizing the pre-trained model CodeT5+ and metric calculation. The time complexity of the metric calculation is O(n), with n representing the length of the input sequence. Additionally, CodeT5+ comprises Transformer blocks that incorporate self-attention methods. The attention mechanism, developed by Google, has a time complexity of O(n² × d), where n represents the sequence length and d is the representation dimension [11]. Consequently, the time complexity of the proposed approach is represented in Equation (12):

O(n² × d) + O(n) = O(n² × d)    (12)

G. EXAMPLE
We provide a specific example to demonstrate the detection flow of MageCode. Suppose we have a Python code snippet as follows:

LISTING 1. Python code snippet example.

First, the code is tokenized using the tokenizer of the CodeT5+ model. The output is an embedding vector with 512 dimensions:

[1 536 5155 . . . 0 0]

This embedding vector then becomes the input of the pre-trained encoder-only CodeT5+ model, which uses the Transformer architecture for feature extraction. We obtain the final hidden state of the [CLS] vector, which has 768 dimensions, after passing the embedding vector through 12 encoder blocks:

[−0.54 −0.59 −0.29 . . . −0.12 −0.49]

Subsequently, the code snippet input is used to calculate the aforementioned metrics. Those metrics are calculated using the CodeBERT-base-mlm model [23] as the base model (Section V-A). The calculated results are then concatenated to form a temporary vector, referred to as the ''metric vector''. Supposing all six metrics are integrated into the MageCode method, the obtained ''metric vector'' is:

[−11.23 911.42 3.82 0.49 0.35 0.20 0.27 0.18 −2.94]

The listed order follows the order in which the six metrics are described in Section III-D: Log-Likelihood, Rank, Log-Rank, Entropy, GLTR, and LRR. Since the GLTR method employed in this work encompasses the assessment of the proportion of words that rank within the top 10, 100, 1,000, and other categories, it has four outputs instead of one like the remaining metrics. The ''metric vector'' is then normalized, with the mean and standard deviation determined after the training process on the train split of the dataset. The normalized ''metric vector'':

[0.85 1.16 −0.10 −1.19 1.36 −1.21 −0.79 −0.49 0.76]

is appended to the end of the feature vector calculated above to form the complete feature vector:

[−0.54 −0.59 −0.29 . . . −0.12 −0.49 0.85 1.16 . . . −0.49 0.76]
This final feature vector serves as the input to the fully-connected layer in the Classification phase, which classifies the provided code snippet as human-written or machine-generated.

IV. DATASET CONSTRUCTION
This section outlines our dataset construction process, which is illustrated in Figure 2. First, we collected programming problems and their solutions from previous studies. After applying some pre-processing steps, we queried advanced LLMs, such as GPT4, with the collected problems to produce machine-generated codes, while the original solutions were labeled as human-written codes. Finally, the entire collected dataset was divided into training, validation, and test sets, ensuring that there is no overlap among these sets, meaning that no solutions in two different sets solve the same programming problem.

A. CODING PROBLEM COLLECTION
We collected programming problems from two open-source datasets: Topics in Algorithmic Code Generation (TACO) [24] and CodeContests [25].

TACO [24] is a large-scale, publicly accessible dataset specifically curated to advance the development of cutting-edge code generation models. Focusing on algorithmic problems, this extensive dataset comprises 26,443 programming challenges and their corresponding 1,539,152 high-quality Python solutions. Designed to elevate training and evaluation benchmarks, TACO provides a robust foundation for advancing the capabilities of code generation systems.

CodeContests [25] is a dataset specifically created for training AI models to solve programming problems. It was introduced by Google DeepMind in 2022 for developing AI systems that can compete in coding contests. The dataset contains a variety of programming challenges from popular platforms like Aizu, AtCoder, CodeChef, Codeforces, and HackerEarth. Each programming problem includes test cases with both input and expected output, as well as examples of correct and incorrect solutions written in Python, Java, and C++. This comprehensive dataset was used in training AI models like AlphaCode, which have shown promising results in competitive programming.

The dataset used for this study consisted of Python, C++, and Java questions and solutions. Python questions and solutions were taken from the TACO dataset, while C++ and Java questions and solutions were taken from the CodeContests dataset. Only data from before January 1, 2022, was included to avoid the influence of advanced AI models like ChatGPT and Gemini. Additionally, answers from programming practice websites like GeeksForGeeks were excluded to prevent contamination from solutions generated by AI assistants. After being preprocessed, the dataset contained 3,499 Python questions with 757,794 solutions, 11,086 C++ questions with 1,756,180 solutions, and 10,118 Java questions with 900,079 solutions. All of these code samples were labeled as human-written.

Notwithstanding the aforementioned preprocessing phase, it is crucial to emphasize that there is no assurance that all ''human-written'' labeled solutions are entirely devoid of the influence of LLMs. This constitutes a constraint in our dataset development procedure. Nonetheless, given that these solutions were primarily sourced from competitive programming platforms prior to the emergence of user-friendly and robust web-based LLMs like ChatGPT, the danger of LLM contamination in these solutions is comparatively small and acceptable for this study.

B. MACHINE-GENERATED CODE COLLECTION AND TESTING
To obtain machine-generated solutions, we queried three advanced LLMs with the collected programming problems: GPT-4-turbo from OpenAI, and Gemini-1.0-pro and Code-bison-32k from Google. This work utilizes OpenAI's API service for GPT-4-turbo generation and Google AI's API service for Gemini-1.0-pro and Code-bison-32k generation. The decoding temperature is configured to 0.7. This is a critical parameter that balances the predictability (with lower values) and creativity (with higher values) of the generated content. The value of 0.7 allows multiple solutions to the same programming problem to not be too similar to each other while still allowing them to pass as many tests as possible. It is also the value used by TACO [24] when experimenting with GPT4. All other settings are maintained at their default values. The experiments were conducted between February 25 and June 18, 2024. All solutions were generated by prompting the 3,499 Python questions, 11,086 C++ questions, and 10,118 Java questions to the three above-mentioned LLMs. The prompt used to collect machine-generated source code is detailed in Table 1. The prompt involved stating the question details and desired programming language, providing starter code if necessary, outlining the input format, and then requesting code generation from the queried LLM, without explanation, test cases, or example usage for the generated source code. For each question, up to five solutions from each language model are collected.

TABLE 1. The prompt used to query GPT-4-turbo, Gemini-1.0-Pro, and Code-bison-32k models.
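The sketch below illustrates how a solution could be requested from one of the three models through the OpenAI API with the settings described above (temperature 0.7, up to five solutions per question). The prompt text is a simplified stand-in for the full prompt in Table 1, and the calls for Gemini-1.0-pro and Code-bison-32k would follow the same pattern with Google AI's client library.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_solutions(problem_statement, language="Python", n_solutions=5):
    """Collect up to five candidate solutions for one programming problem."""
    # Simplified stand-in for the full prompt described in Table 1.
    prompt = (
        f"Solve the following programming problem in {language}.\n"
        f"Return only the source code, without explanation, test cases, "
        f"or example usage.\n\n{problem_statement}"
    )
    solutions = []
    for _ in range(n_solutions):
        response = client.chat.completions.create(
            model="gpt-4-0125-preview",  # GPT-4-turbo at the time of the experiments
            temperature=0.7,             # balances predictability and creativity
            messages=[{"role": "user", "content": prompt}],
        )
        solutions.append(response.choices[0].message.content)
    return solutions
```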
We then evaluated the machine-generated solutions using test cases from the TACO and CodeContest datasets. To be included in the dataset, a solution had to pass at least one test case, reflecting the common practice among students who aim to earn partial points in programming assignments. Besides, selected solutions were not required to pass every
test case, as many students only submit code that is sufficient to help them pass their exams or homework. The collected human-written solutions above also include ''incorrect'' solutions that only succeed on a subset of test cases. Including these incomplete solutions in both the human-written and machine-generated solution sets adds diversity to the dataset, enabling models trained on the dataset to learn a variety of solution patterns rather than just perfect ones. Moreover, obtaining machine-generated solutions that can pass all test cases of a programming problem is both costly and time-consuming, since we would need to query the LLMs several times until we received at least one perfect solution. That process is challenging, if not unfeasible, for medium and hard programming problems.

To automatically test machine-generated solutions from the LLMs, we employed two approaches. For Python solutions, we utilized the existing source code provided by the TACO project for testing Python code. In the case of Java and C++ solutions, we use Judge0 [26], a robust and scalable online code execution system. As an open-source project with a readily available Docker image, Judge0 has become a crucial part of various production systems requiring online code execution capabilities. In our approach, we deployed a local Judge0 server and leveraged its APIs to interact with the server. Through those APIs, we send solutions generated by the LLMs to the Judge0 server, along with a list of test cases and the desired programming language environment. The server then evaluates the submitted code in the specified language against the provided test cases. After the evaluation, the server returns a result indicating whether the solution passed, if at least one test case is successful, or failed, if no test case is passed or if there are compilation or runtime errors. Only passed solutions are included in the final dataset. These solutions are labeled as machine-generated.
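The snippet below sketches how a generated solution can be checked against one test case through the REST API of a locally deployed Judge0 server. The endpoint, field names, and language identifier follow Judge0's public API, but the server address and the pass/fail aggregation over multiple test cases are assumptions for illustration.

```python
import requests

JUDGE0_URL = "http://localhost:2358"   # assumed address of the local Judge0 server
JAVA_LANGUAGE_ID = 62                  # Judge0 language id for Java (OpenJDK)

def run_test_case(source_code, stdin, expected_output, language_id):
    """Submit one solution and one test case; return True if the output matches."""
    payload = {
        "source_code": source_code,
        "language_id": language_id,
        "stdin": stdin,
        "expected_output": expected_output,
    }
    resp = requests.post(
        f"{JUDGE0_URL}/submissions?base64_encoded=false&wait=true",
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    # Status id 3 means "Accepted" in Judge0; anything else is a failed test,
    # a compilation error, or a runtime error.
    return resp.json()["status"]["id"] == 3

def solution_passes(source_code, test_cases, language_id):
    """A solution is kept if it passes at least one test case."""
    return any(run_test_case(source_code, tc["input"], tc["output"], language_id)
               for tc in test_cases)
```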
C. DATASET SPLIT AND SPECIFICATION
After the collection phase, we obtained over 81,000 human-written and over 45,000 machine-generated source code samples. An overview of the dataset is described in Table 2.

TABLE 2. Overview of the dataset used for detecting code generated by large language models.

The dataset is split into three smaller datasets, each of which corresponds to a programming language. Each split
dataset is then divided into a training set, a validation set, and a test set, with respective proportions of 76%, 4%, and 20%. The data in the training set and validation set are evenly distributed between the human-written and machine-generated categories. Conversely, the test set includes the remaining data, which preserves the original distribution of the dataset and serves as realistic data for the testing procedure. It is worth noting that each programming problem has several solutions, which may exhibit a high degree of similarity or possess identical patterns. Therefore, while splitting the dataset, we ensure that there are no two solutions from two separate sets that solve the same problem, to avoid data leakage when training and testing our method. The number of samples in each set across different groups is described in Table 3.

TABLE 3. The number of samples in the training set, validation set, and test set across three programming languages.

Each sample in the dataset possesses three properties: task_id, code, and label. The task_id field in the Python dataset represents the ordinal number of programming problems in the train split of the TACO dataset. For the Java and C++ datasets, this field corresponds to the ordinal number of programming problems in the train split of the CodeContests dataset. The code field contains the source code, and the label field indicates the origin of the source code, with 0 denoting human-written code and 1 denoting machine-generated code. The dataset is stored in several CSV files, each representing different sets and programming languages. It is publicly published on Hugging Face for the research community.
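As an illustration of the leakage-free split described above, the sketch below partitions samples by task_id so that all solutions to the same problem end up in exactly one of the training, validation, or test sets. The 76/4/20 proportions follow the paper; the shuffling, rounding, and class balancing details are simplified assumptions.

```python
import random
import pandas as pd

def split_by_task(df, seed=42, train_frac=0.76, val_frac=0.04):
    """Split a dataframe with task_id/code/label columns without problem overlap."""
    task_ids = sorted(df["task_id"].unique())
    random.Random(seed).shuffle(task_ids)

    n_train = int(len(task_ids) * train_frac)
    n_val = int(len(task_ids) * val_frac)
    train_ids = set(task_ids[:n_train])
    val_ids = set(task_ids[n_train:n_train + n_val])

    train = df[df["task_id"].isin(train_ids)]
    val = df[df["task_id"].isin(val_ids)]
    test = df[~df["task_id"].isin(train_ids | val_ids)]
    return train, val, test

# Toy usage with the dataset's CSV layout (task_id, code, label).
df = pd.DataFrame({
    "task_id": [0, 0, 1, 1, 2, 3],
    "code": ["..."] * 6,
    "label": [0, 1, 0, 1, 0, 1],
})
train, val, test = split_by_task(df)
assert not set(train["task_id"]) & set(test["task_id"])  # no shared problems
```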
V. EVALUATION
This section presents evaluation results that assess the performance of MageCode and other detectors. Firstly, we evaluate several metric-based machine-generated text detection methods on source code input. The metrics that show promising results will be selected for integration into the MageCode method. In the following experiment, we evaluate the performance of MageCode and compare it to current baselines. Finally, we examine the potential effects of different factors on the performance of MageCode.

This study evaluates Accuracy, F1-score, and Area Under the ROC Curve (AUROC) for performance measurement. Additionally, we also consider the True Positive Rate at a fixed False Positive Rate (TPR@aFPR) to measure the sensitivity of the method at very low FPR. This metric offers valuable information about model performance in critical detection scenarios. An optimal true positive rate (TPR) while minimizing the false positive rate (FPR) is crucial, as the primary risks in critical detection scenarios often arise from false positives, which refer to the erroneous identification of human-written code as machine-generated. This paper assesses detection methods under the range of a ∈ {10, 1, 0.1}.

All experiments are conducted on the newly constructed dataset described in Section IV. As can be seen from Table 3, the C++ dataset has 43,000 human-written samples, which is 14 times greater than the number of machine-generated samples. In order to address this imbalance during evaluations, we decided to reduce the size of the human-written set by using only a number of human-written samples equal to the size of the machine-generated set, which consists of 2,998 samples.

The experiments are implemented on a Windows 11 PC equipped with a Core i9-14900K CPU, 128 GB of RAM, and an Nvidia RTX 4070 Ti GPU.

A. EVALUATION OF METRIC-BASED METHODS
The six evaluated metric-based detection methods are Log-Likelihood [13], Rank [14], Log-Rank [4], Entropy [14], GLTR [14], and LRR [15]. For each method, this work uses the CodeBERT-base-mlm model [23] as the base model from which to extract logits. From the metrics extracted using the CodeBERT-base-mlm model, a logistic regression model is constructed to provide concrete predictions. Table 4 presents the evaluation results across three programming languages.

Table 4 clearly demonstrates that the Log-Likelihood method surpasses other metrics on the Python and C++ datasets, achieving the highest figures in Accuracy, F1-Score, and AUROC. The results demonstrate the excellent effectiveness of the Log-Likelihood metric in correctly identifying machine-generated code in these programming languages. Moreover, the illustration in Figure 3 reveals a clear differentiation between positive and negative samples in the distribution of the Log-Likelihood score. Machine-generated codes typically exhibit Log-Likelihood scores positioned to the right of the Log-Likelihood score distribution typically observed in human-written codes. This discovery in source code corresponds to results obtained in natural language analysis [13], [14].

The performance of Entropy exhibits variability among different programming languages. In the Java programming language, it attains the highest values of Accuracy (76.51%), F1 score (82.61%), and AUROC (79.07%). However, the performance of this metric is comparatively inferior in Python and C++, particularly in C++, with an Accuracy of 53.84%, an F1 score of 55.90%, and an AUROC of 55.32%. Remarkably, there exists a notable disparity in the entropy distribution between code created by machines and code written by humans for Java (Figure 6). However, the distributions for Python and C++ are relatively comparable, especially in the C++ dataset. This disparity explains why the use of entropy for classifying C++ code as human-written or machine-generated is less efficient. Moreover, our experiments using source code input confirm the consistent finding that the entropy of machine-generated material is
generally lower than that of human-written information, as demonstrated in prior studies [4], [14].

TABLE 4. Evaluation of metric-based machine-generated text detection methods across three programming languages. (Unit: %).

Comparing the performance of the Rank and Log-Rank methods on the datasets, Log-Rank consistently provides substantial enhancements. While demonstrating a similar performance to Rank on the C++ dataset, Log-Rank surpasses Rank on the Python and Java datasets outright. Within the Java dataset, the Log-Rank method attains an Accuracy of 74.07%, an F1 Score of 80.48%, and an AUROC of 77.81%. These results even exceed the measurements of Log-Likelihood on the same dataset, and are only lower than those of the Entropy metric. The distributions of the Rank and Log-Rank scores are shown in Figure 4 and Figure 5, respectively. Although both strategies take the observed rank of each token into account when making the prediction, the use of the logarithm transformation in Log-Rank improves its effectiveness as a detector in comparison to the Rank metric. Using the Java dataset as an illustration, the Rank distributions depicted in Figure 4 for machine-generated code and human-written code are difficult to distinguish due to their significant overlap. In contrast, the Log-Rank score distribution shown in Figure 5 offers a more distinct differentiation. Consistent with findings on natural language content [14], [27], we also note that machine-generated code generally exhibits lower average values for both Rank and Log-Rank scores.

GLTR shows solid performance across the three distinct programming domains. On the Python and C++ datasets, its performance is just lower than that of Log-Likelihood. For instance, the GLTR method achieves an Accuracy score of
73.30%, an F1 Score of 74.71%, and an AUROC of 78.71% on the Python dataset. The stable performance of this metric across many languages can be ascribed to its consideration of the proportion of tokens that are ranked among the top 10, 100, 1,000, and so on.

The LRR method consistently demonstrates the poorest performance in all measurements and languages, with results ranging from 54% to 62%. This performance indicates that, while Log-Likelihood and Log-Rank independently show promising outcomes, the combination of both approaches does not improve detection ability. On the contrary, it continuously exhibits poor performance in all programming languages. Figure 7 displays the distribution of LRR scores among the three programming languages. The distributions of machine-generated code and human-written code exhibit no significant distinction based on the LRR scores. This observation implies that LRR may not be a very efficient approach for identifying code created by an LLM.

Furthermore, we performed a Friedman test on the performance of the six metric-based methods. The Friedman test is a non-parametric statistical test used to compare the performance of multiple models across multiple datasets
based on the average ranking of the tested models. This test evaluates whether there are significant differences in the performance of the tested models [28]. When performing the test, the level of significance is set to 0.05, as was used in previous studies [28], [29]. AUROC is the selected metric for performance comparison and ranking. Since six metrics are evaluated, the test was performed with 6 degrees of freedom.

TABLE 5. Rankings of metric-based machine-generated text detection methods across three programming languages.

Table 5 shows the rankings of the six metrics across the three datasets in terms of performance. AUROC is selected for performance comparison and ranking. For each dataset, rankings are from 1 to 6, with rank 1 assigned to the best metric and rank 6 assigned to the worst metric. The average ranking of each metric is calculated as the average of its rankings across the three datasets.

Based on these rankings, we can calculate the Friedman test statistic, which is 8.90. With 6 degrees of freedom and a significance level of 0.05, the critical chi-squared value is 12.59, taken from the chi-square table. Since 8.90 < 12.59, the null hypothesis of the Friedman test cannot be rejected, which means there is no statistically significant difference in the performance of the tested metrics.

There is a possibility that this result is caused by the performance of the evaluated metrics on the Java dataset, where the ranks differ the most. If we exclude the rankings from the Java dataset and solely consider rankings from the two remaining datasets, the Friedman test would yield a result of 13.23 > 12.59. This new result means there is a statistically significant difference in the performance of the six metrics on the Python and C++ datasets.
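For reference, the ranking-based Friedman test used here can be reproduced with SciPy as sketched below. The AUROC values are placeholders rather than the paper's measurements; each argument to friedmanchisquare collects one method's scores across the three datasets.

```python
from scipy.stats import friedmanchisquare

# Placeholder AUROC values (one per dataset: Python, Java, C++) for each metric.
auroc = {
    "Log-Likelihood": [0.79, 0.74, 0.78],
    "Rank":           [0.65, 0.70, 0.64],
    "Log-Rank":       [0.71, 0.78, 0.65],
    "Entropy":        [0.62, 0.79, 0.55],
    "GLTR":           [0.79, 0.72, 0.74],
    "LRR":            [0.58, 0.60, 0.56],
}

# The test ranks the six methods within each dataset and compares their average ranks.
statistic, p_value = friedmanchisquare(*auroc.values())
print(f"Friedman statistic = {statistic:.2f}, p-value = {p_value:.3f}")
if p_value < 0.05:
    print("Significant differences between the metric-based methods.")
else:
    print("No statistically significant difference detected.")
```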
Additionally, from Table 5, it is simple to tell that Log-Likelihood, Log-Rank, and GLTR stand out from the remaining metrics in terms of performance. Besides, despite its inferior results on the Python and C++ datasets, the fact that Entropy has the best ranking on the Java dataset makes it hard to ignore.

The evaluation results of the metric-based methods clearly indicate that Log-Likelihood, Entropy, and GLTR are the most applicable metrics in source code analysis. Rank and Log-Rank are essentially identical metrics, with a minor distinction in the calculation of the final score of a given text. Nevertheless, Log-Rank has been empirically demonstrated to be the better metric for identifying machine-generated source code when compared to Rank. LRR is without a doubt the metric with the worst performance. Following these findings, we have selected Log-Likelihood, Log-Rank, Entropy, and GLTR to be integrated into the MageCode method.

B. EVALUATION OF MAGECODE
The experiment in this section evaluates the MageCode method integrated with the four selected metrics above across the three datasets. The evaluation results are subsequently compared to the existing baselines, namely the OpenAI Detector [13] and DetectGPT4Code [22]. As introduced earlier, the OpenAI Detector was developed by fine-tuning the RoBERTa model using the output of the GPT-2 1.5-billion-parameter model, and was originally used to detect text generated by GPT-2. Meanwhile, DetectGPT4Code is inspired by the DetectGPT method and introduces enhancements by utilizing a surrogate model, PolyCoder-160M, to overcome the difficulties encountered by DetectGPT when the source code model is a black box (e.g., GPT4).

Table 6 shows the results of evaluating the three mentioned methods in detecting machine-generated code based on metrics such as Accuracy, F1-Score, AUROC, [email protected]%FPR, TPR@1%FPR, and TPR@10%FPR.

We find that MageCode consistently exhibits superior performance across all assessed metrics and programming languages. It showcases exceptional performance in Python, with an Accuracy of 98.46%, an F1 Score of 98.70%, an AUROC of 99.83%, and TPR values of 86.87%, 96.14%, and 99.77% at FPRs of 0.1%, 1%, and 10%, respectively. In contrast, the OpenAI Detector and DetectGPT4Code show a notable performance gap, with the OpenAI Detector specifically demonstrating a low Accuracy of 59.19% and insignificant TPR values at lower FPR thresholds. Similarly, the MageCode method for Java demonstrates strong performance, with an Accuracy of 98.05%, an F1 Score of 98.59%, an AUROC of 99.47%, and TPR values of 37.89%, 85.45%, and 99.76% at FPR values of 0.1%, 1%, and 10%, respectively. Both the OpenAI Detector and DetectGPT4Code exhibit poor performance in all measures, particularly in their failure to obtain significant TPR values at lower FPR levels. On the C++ dataset, MageCode consistently performs better than other methods, achieving an Accuracy of 95.53%, an F1 Score of 75.46%, an AUROC of 99.06%, and TPR values of 47.97%, 85.36%, and 97.93% at FPR values of 0.1%, 1%, and 10%, respectively.

The main reason for the low accuracy of the OpenAI Detector model when processing source code inputs is its lack of training on source code, which hinders its ability to accurately distinguish between code written by humans and code generated by machines. On the other hand, DetectGPT4Code is a metric-based approach that utilizes the probability curvature metric. Due to the intrinsic complexity of source code in terms of syntax, structure, and logic, depending on a single metric threshold is inadequate for distinguishing between human-written and machine-generated code, as shown by prior studies [3]. To overcome the limitations of these approaches, MageCode has employed
the CodeT5+ model, which is highly proficient in extracting significant characteristics from source code. Additionally, it has integrated multiple metrics to accurately categorize the source code.

TABLE 6. Comparison results of MageCode with current baselines across three programming languages. (Unit: %).

C. ABLATION STUDIES
This section conducts three ablation experiments to investigate the effects of different factors on MageCode: the impact of the pre-trained code LLM, the impact of the integrated metrics, and the impact of token length.

1) INFLUENCE OF PRE-TRAINED CODE LLM
Table 7 presents the comparison results when applying different base LLM models in the tokenization and feature extraction phases of the MageCode method. The three base LLM models considered in this experiment are CodeT5+ [6], CodeBERT [23], and PolyCoder [30].

TABLE 7. Performance metrics of various base models in detecting machine-generated source code. (Unit: %).

It can be observed that, across all programming languages, CodeT5+ consistently demonstrates competitive or superior results compared to CodeBERT and PolyCoder. In the Python dataset, despite the better performance of CodeBERT in [email protected]%FPR and TPR@1%FPR, CodeT5+ is still the top performer, achieving the highest results in Accuracy (98.46%), F1-Score (98.70%), and AUROC (99.83%). CodeBERT closely follows with an Accuracy of 98.29%, an F1-Score of 98.54%, and an AUROC of 99.81%. The difference in these metrics between CodeT5+ and CodeBERT is under 0.2%, which proves CodeBERT to be a competitive candidate compared to CodeT5+. However, in the Java and C++ datasets, CodeT5+ demonstrates superior results by achieving the highest values in all measurements, with Accuracy, F1-Score, and AUROC over 98% in Java and over 95% in C++. Meanwhile, CodeBERT shows significantly worse results, with Accuracy, F1-Score, and AUROC 2-3% lower compared to CodeT5+. Although demonstrating promising results in TPR@a%FPR on the Python dataset, the performance of CodeBERT in these metrics is inferior when applied to Java and C++. On the other hand, PolyCoder is the worst model, consistently exhibiting inferior results compared to the other two models.

We observe that CodeBERT exhibits a notable decrease in performance when applied to C++ in comparison to Python and Java. This disparity can be ascribed to the lack of model pre-training on C++. On the other hand, the inadequate performance of PolyCoder on all three datasets may be attributed mostly to its decoder-only Transformer-based architecture. This architecture is better suited for text production tasks but less efficient for classification tasks when compared to encoder-only models such as CodeT5+ and CodeBERT.

2) INFLUENCE OF INTEGRATED METRICS IN MAGECODE
This section examines the influence of the integrated metrics on the performance of the MageCode method. As presented in Section III, source code features extracted from the encoder-only CodeT5+ model and metrics calculated from the source code input are combined to create the final feature vector before heading to the classification phase. The effectiveness of these metrics when applied independently in detecting machine-generated code has been thoroughly studied in Section V-A. To determine how these metrics affect MageCode when integrated together into the feature vector, we re-trained and evaluated MageCode in two scenarios: Metric Integrated, in which the feature vector contains both the CodeT5+ extracted features and the calculated metrics, and CodeT5+ Only, in which the feature vector only contains the CodeT5+ features. Table 8 shows the evaluation results of these scenarios.

TABLE 8. The influence of integrated metrics on the performance of MageCode. (Unit: %).

The results in Table 8 reveal that the two scenarios demonstrate highly competitive results, with a slightly better performance of Metric Integrated overall. The Metric Integrated approach consistently outperforms CodeT5+ Only in Accuracy,
F1-Score, AUROC, and [email protected]%FPR across the three datasets. The disparity in performance between these two scenarios is minor when applied to the Python and Java datasets. Specifically, in the Python dataset, Metric Integrated outperforms CodeT5+ Only by 0.05% in Accuracy, 0.05% in F1-Score, and 0.01% in AUROC. The Java dataset has values of 0.07%, 0.05%, and 0.04%, respectively. The superior performance of Metric Integrated over CodeT5+ Only is particularly evident in the C++ dataset, exhibiting a 0.15% increase in Accuracy, a 0.15% increase in F1-Score, and a 0.04% increase in AUROC. Furthermore, while CodeT5+ Only may exhibit superior performance in TPR@1%FPR (for Python and C++) and TPR@10%FPR (for C++), Metric Integrated consistently surpasses it in [email protected]%FPR, with enhancements of 0.5%, 3.8%, and 9.41% in Python, Java, and C++, respectively.

Nevertheless, the notable performance of CodeT5+ Only underscores the exceptional competence of the CodeT5+ model in comprehending and analyzing source code, which significantly contributes to MageCode. Moreover, the improvement in the performance of MageCode when integrating various metrics into the feature vector demonstrates the beneficial impact of these metrics on the method, which is consistent with the evaluation findings in Section V-A.

3) INFLUENCE OF TOKEN LENGTH ON DETECTION METRICS
Since MageCode is limited to processing a maximum of 512 tokens, this subsection further analyzes the accuracy of MageCode for each token length range. We divided the original test set into smaller test sets, grouping source code with the same token length range into one group (e.g., 0-128, 128-256, etc.). The results are shown in Figure 8.

FIGURE 8. The influence of token length on the performance of MageCode on three different datasets.

We clearly observe that the model achieves the lowest accuracy score when the input source code is too short (below 128 tokens). This observation aligns with some previous work on developing machine-generated text detectors [31]. As the
FIGURE 8. The influence of token length to the performance of MageCode on three different datasets.
token length increases (128-256, 256-512, etc.), the accuracy length of source code input. In MageCode, three pre-trained
also gradually increases. In the case of Python and Java, LLMs are taken into account: CodeT5+, CodeBERT, and
the method achieves the highest accuracy when the token PolyCoder. CodeT5+ and CodeBERT both exhibit superior
length range is within 128-256 or 256-512. These two ranges and comparable results, while PolyCoder suffers from
exhibit a competitive level of accuracy, with a variation of poor performance. Besides, the evaluation of MageCode
0.01% in Python. This finding also holds true for C++ with performance on two scenarios (one with integrated metrics
a decrease of 0.02%. Regarding Java, the difference is further and the other without using these metrics) demonstrates the
pronounced, increasing by 0.23% from 98.30% to 98.53%. beneficial impact of those metrics on the detection ability of
However, this difference is relatively minor when compared the MageCode method. The last experiments focus on token
to the disparities in other ranges. When the input token length length by dividing the test sets into groups based on varying
exceeds 512, we observe a slight drop in accuracy. This token length ranges and assessing the accuracy of MageCode
outcome is understandable given that long source code may for each group. Groups of 128-256 and 256-512 tokens show
include significant features at the end that the model does not competitive and highest accuracies since the source code in
process due to the truncation mechanism. C++ is the only these groups is sufficiently lengthy to avoid the truncation
special case where the truncation of lengthy tokens does not mechanism, except for the C++ case, whose accuracy is
negatively affect the detection accuracy but conversely results highest when evaluating on the 512-2048 token length range
in the highest accuracy across all test subsets. group.
This section has presented evaluation results from three main experiments. First, the applicability of six metric-based machine-generated text detection methods to source code input was explored. Four of them, namely Log-Likelihood, Log-Rank, Entropy, and GLTR, show promising results and are integrated into the MageCode method. The remaining two are excluded because of the poor performance of LRR and the inferior results of Rank compared with its analogous metric, Log-Rank.

Second, the performance of the MageCode method was evaluated and compared with existing baselines. The experimental results demonstrate the superior performance of MageCode, attaining an accuracy of over 98% for Python and Java and 95% for C++, with a true positive rate surpassing 85% while maintaining a false positive rate lower than 1%.

Finally, the section concluded with ablation studies examining the impact of various factors on MageCode performance: the pre-trained LLM, the integrated metrics, and the token length of the source code input. In MageCode, three pre-trained LLMs were considered: CodeT5+, CodeBERT, and PolyCoder. CodeT5+ and CodeBERT exhibit superior and comparable results, while PolyCoder suffers from poor performance. Besides, evaluating MageCode in two scenarios (one with the integrated metrics and one without) demonstrates the beneficial impact of those metrics on the detection ability of the method. The last experiments focus on token length, dividing the test sets into groups based on token length ranges and assessing the accuracy of MageCode for each group. The 128-256 and 256-512 token groups show the highest, and mutually competitive, accuracies, since source code in these groups is sufficiently long while still avoiding the truncation mechanism; the exception is C++, whose accuracy is highest on the 512-2048 token length range group.

VI. CONCLUSION
This paper presents MageCode, a novel machine-generated code detection method. MageCode utilizes the pre-trained encoder-only CodeT5+ 220M model to extract features from source code input and additionally makes use of several metric-based machine-generated text detection methods to improve performance. The evaluation results demonstrate the superior performance of MageCode compared with existing baselines, with high accuracy and a high true positive rate while maintaining a false positive rate lower than 1%.

This paper has explored the applicability of several metric-based machine-generated text detection methods to source code analysis. Log-Likelihood, Log-Rank, Entropy, and GLTR were proven to be beneficial and were integrated into MageCode, where they make a positive impact on the performance of the method.
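For readers who wish to reproduce such metrics, the token-level statistics underlying Log-Likelihood and Log-Rank can be obtained from any causal language model that scores the source code; the sketch below is a minimal illustration in which the GPT-2 checkpoint is only a stand-in, not necessarily the scoring model used by MageCode.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM that can score code tokens works here; GPT-2 is used only
# because it is small and widely available.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def log_likelihood_and_log_rank(code: str):
    ids = tok(code, return_tensors="pt", truncation=True, max_length=512)["input_ids"]
    logits = lm(ids).logits[0, :-1]          # predictions for positions 1..n-1
    targets = ids[0, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    idx = torch.arange(targets.numel())
    token_ll = log_probs[idx, targets]       # log-probability of each observed token
    # Rank of each observed token among the model's predictions (1 = most likely).
    ranks = (logits > logits[idx, targets].unsqueeze(-1)).sum(dim=-1) + 1
    return token_ll.mean().item(), torch.log(ranks.float()).mean().item()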
The effect of other factors, such as the pre-trained LLM and the token length of the source code, on the effectiveness of MageCode was also considered. In the considered scenarios, MageCode outperforms the benchmarks (OpenAI Detector and DetectGPT4Code), achieving up to 98.46% accuracy for Python. In MageCode, incorporating the pre-trained model CodeT5+ with metric-based techniques enhances performance compared with utilizing the pre-trained model CodeT5+ alone. Besides, source code of different token lengths also has a considerable effect on the performance of MageCode; the method tends to achieve higher accuracy when applied to source code with token lengths in the ranges of 128-256 and 256-512.

To conduct the experiments, we constructed a new dataset for the machine-generated code detection problem. The dataset includes over 45,000 code solutions generated by three large language models that are advanced in code generation: GPT-4-Turbo, Gemini-pro-1.0, and Code-bison-32k. It also contains over 80,000 human-written solutions collected from previous studies and carefully preprocessed. All source code in the dataset is written in three programming languages popular in educational environments: Python, Java, and C++.

Currently, MageCode concentrates on identifying whether source code was produced by humans or by LLMs. In the future, we will enhance MageCode to detect code similarity across source code files and to discover flaws inside these files. Furthermore, although MageCode emphasizes the educational environment, this research can be expanded to identify unethical applications of AI in job applications, coding competitions, and similar contexts.
REFERENCES
[1] S. R. Das and M. J. V., "Perceptions of higher education students towards ChatGPT usage," Int. J. Technol. Educ., vol. 7, no. 1, pp. 86–106, Feb. 2024.
[2] P. Haindl and G. Weinberger, "Students' experiences of using ChatGPT in an undergraduate programming course," IEEE Access, vol. 12, pp. 43519–43529, 2024.
[3] X. He, X. Shen, Z. Chen, M. Backes, and Y. Zhang, "MGTBench: Benchmarking machine-generated text detection," 2023, arXiv:2303.14822.
[4] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, and C. Finn, "DetectGPT: Zero-shot machine-generated text detection using probability curvature," in Proc. Int. Conf. Mach. Learn., Jan. 2023, pp. 24950–24962.
[5] Z. Xu and V. S. Sheng, "Detecting AI-generated code assignments using perplexity of large language models," in Proc. AAAI Conf. Artif. Intell., Mar. 2024, vol. 38, no. 21, pp. 23155–23162.
[6] Y. Wang, H. Le, A. Deepak Gotmare, N. D. Q. Bui, J. Li, and S. C. H. Hoi, "CodeT5+: Open code large language models for code understanding and generation," 2023, arXiv:2305.07922.
[7] OpenAI et al., "GPT-4 technical report," 2023, arXiv:2303.08774.
[8] G. Team et al., "Gemini: A family of highly capable multimodal models," 2023, arXiv:2312.11805.
[9] Google. (2024). Code Bison Repository. Accessed: Jun. 18, 2024. [Online]. Available: https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/code-bison
[10] (2024). Code Models Overview. Accessed: Jun. 21, 2024. [Online]. Available: https://cloud.google.com/vertex-ai/generative-ai/docs/code/code-models-overview
[11] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, Jun. 2017, pp. 5998–6008.
[12] S. J. Prince, Understanding Deep Learning. Cambridge, MA, USA: MIT Press, 2023.
[13] I. Solaiman, M. Brundage, J. Clark, A. Askell, A. Herbert-Voss, J. Wu, A. Radford, G. Krueger, J. Wook Kim, S. Kreps, M. McCain, A. Newhouse, J. Blazakis, K. McGuffie, and J. Wang, "Release strategies and the social impacts of language models," 2019, arXiv:1908.09203.
[14] S. Gehrmann, H. Strobelt, and A. M. Rush, "GLTR: Statistical detection and visualization of generated text," 2019, arXiv:1906.04043.
[15] J. Su, T. Yue Zhuo, D. Wang, and P. Nakov, "DetectLLM: Leveraging log rank information for zero-shot detection of machine-generated text," 2023, arXiv:2306.05540.
[16] T. GLM et al., "ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools," 2024, arXiv:2406.12793.
[17] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and R. Xin. (2023). Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM. [Online]. Available: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[18] Y. Anand, Z. Nussbaum, A. Treat, A. Miller, R. Guo, B. Schmidt, G. Community, B. Duderstadt, and A. Mulyar, "GPT4All: An ecosystem of open source compressed language models," 2023, arXiv:2311.04931.
[19] J. Tow. StableLM Alpha V2 Models. Accessed: Jun. 1, 2023. [Online]. Available: https://huggingface.co/stabilityai/stablelm-base-alpha-7b-v2
[20] Claude. Accessed: Jun. 1, 2023. [Online]. Available: https://claude.ai/
[21] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu, "How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection," 2023, arXiv:2301.07597.
[22] X. Yang, K. Zhang, H. Chen, L. Petzold, W. Yang Wang, and W. Cheng, "Zero-shot detection of machine-generated codes," 2023, arXiv:2310.05103.
[23] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, "CodeBERT: A pre-trained model for programming and natural languages," 2020, arXiv:2002.08155.
[24] R. Li, J. Fu, B.-W. Zhang, T. Huang, Z. Sun, C. Lyu, G. Liu, Z. Jin, and G. Li, "TACO: Topics in algorithmic COde generation dataset," 2023, arXiv:2312.14852.
[25] Y. Li et al., "Competition-level code generation with AlphaCode," Science, vol. 378, no. 6624, pp. 1092–1097, Dec. 2022.
[26] H. Z. Dosilovic and I. Mekterovic, "Robust and scalable online code execution system," in Proc. 43rd Int. Conv. Inf., Commun. Electron. Technol. (MIPRO), Sep. 2020, pp. 1627–1632.
[27] D. Ippolito, D. Duckworth, C. Callison-Burch, and D. Eck, "Automatic detection of generated text is easiest when humans are fooled," 2019, arXiv:1911.00650.
[28] V. M. Hanriot, L. C. B. Torres, and A. P. Braga, "Multiclass graph-based large margin classifiers: Unified approach for support vectors and neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 3, no. 1, pp. 1–10, Apr. 2024.
[29] T. Akshar, V. Singh, N. L. B. Murthy, A. Krishna, and L. Kumar, "A CodeBERT based empirical framework for evaluating classification-enabled vulnerability prediction models," in Proc. 17th Innov. Softw. Eng. Conf., Feb. 2024, pp. 1–11, doi: 10.1145/3641399.3641405.
[30] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, "A systematic evaluation of large language models of code," in Proc. 6th ACM SIGPLAN Int. Symp. Mach. Program., Jun. 2022, pp. 1–10.
[31] V. Verma, E. Fleisig, N. Tomlin, and D. Klein, "Ghostbuster: Detecting text ghostwritten by large language models," 2023, arXiv:2305.15047.

HUNG PHAM received the bachelor's degree in computer engineering from the School of Information and Communications Technology, Hanoi University of Science and Technology, in 2023, where he is currently pursuing the master's degree. He is working for the Bach Khoa Cyber Security (BKCS) Center. His research interests include cybersecurity and trusted computing.
HUYEN HA received the bachelor's degree in computer science from the School of Information and Communications Technology, Hanoi University of Science and Technology, in 2024. She is currently with the Bach Khoa Cyber Security (BKCS) Center and the School of Information and Communications Technology, Hanoi University of Science and Technology. Her research interests include cybersecurity and large language models.

DUC TRAN received the M.Sc. degree by research and the Ph.D. degree in computer science from City University London, in 2013 and 2015, respectively. He is currently a Lecturer with the School of Information and Communication Technology, Hanoi University of Science and Technology, and the Director of the Bach Khoa Cyber Security Center (BKCS). His current research interests include machine learning, pattern recognition, and computer security. Current and past application areas of his work include biometric authentication, network security, and multimedia security. He served on the program committees of the SOICT 2016 and SOIS 2018 and 2022 conferences.