Enhancing source code classification effectiveness via prompt learning incorporating knowledge features

Yong Ma1,3, Senlin Luo1, Yu-Ming Shang2*, Yifei Zhang1 & Zhengjun Li1
Researchers have investigated the potential of leveraging pre-trained language models, such as
CodeBERT, to enhance source code-related tasks. Previous methodologies have relied on CodeBERT’s
‘[CLS]’ token as the embedding representation of input sequences for task performance, necessitating
additional neural network layers to enhance feature representation, which in turn increases
computational expenses. These approaches have also failed to fully leverage the comprehensive
knowledge inherent within the source code and its associated text, potentially limiting classification
efficacy. We propose CodeClassPrompt, a text classification technique that harnesses prompt
learning to extract rich knowledge associated with input sequences from pre-trained models,
thereby eliminating the need for additional layers and lowering computational costs. By applying an
attention mechanism, we synthesize multi-layered knowledge into task-specific features, enhancing
classification accuracy. Our comprehensive experimentation across four distinct source code-related
tasks reveals that CodeClassPrompt achieves competitive performance while significantly reducing
computational overhead.

The intersection of machine learning, programming languages, and software engineering has garnered significant
interest from the software engineering community. The classification of source code-related tasks holds considerable
importance in the field of software engineering, as it serves the purpose of identifying the programming lan-
guage used in the source code and enhancing the overall software quality. Many techniques, including Bayesian
methods1, Random Forest2, and XGBoost3, have been employed to accomplish source code-related classifica-
tion. Additionally, deep learning techniques, such as TextCNN4, have gained substantial prominence and are
extensively utilized in this domain. Recently, large-scale pre-trained language models, such as BERT5, RoBERTa6,
GPT7, and T58, have emerged as promising tools for various downstream Natural Language Processing (NLP)
tasks9. The impressive performance of pre-trained language models (PLM) in NLP tasks has inspired intensive
research in the field of software engineering. Source code-related tasks have achieved remarkable success with
source code-dedicated pre-trained language models, such as CodeBERT10, CodeT511, and GraphCodeBERT12,
as demonstrated by recent studies13,14.
The CodeBERT model, a member of the BERT family, is extensively utilized in the field of software
engineering10. In BERT family models, the ‘[CLS]’ token functions as the classification token and serves as the
aggregate representation of the entire input sequence5. Previous research consistently adopts this design choice,
employing the representation of the ‘[CLS]’ token for various classification tasks5,15, leveraging the vector output
of the ‘[CLS]’ token from the highest layer of a BERT family model as the representation of an input sequence.
However, relying solely on the ‘[CLS]’ token in the highest layer of model output as the representative of the
entire input text sequence imposes limitations on the effectiveness of the BERT family. Previous research16,17
has demonstrated that BERT-based models can capture a rich hierarchy of linguistic information in NLP tasks.
Specifically, surface features are captured in lower layers, syntactic features in middle layers, and semantic fea-
tures in higher layers17. Other studies propose utilizing the ‘[CLS]’ tokens from multiple layers of vector output
to more comprehensively represent the input information18. Nonetheless, relying solely on the ‘[CLS]’ token
from the output of a BERT model as the exclusive representation of an input text is considered insufficient19.
Some investigations have incorporated supplementary neural network layers, such as Long Short-Term Memory
(LSTM) networks20, to enhance the capabilities of feature representation. However, it is important to note that

1Beijing Institute of Technology, Beijing 100085, China. 2Beijing University of Posts and Telecommunications, Beijing 100876, China. 3Qi-Anxin Technology Group, Beijing 100044, China. *email: [email protected]


this approach comes with a notable increase in computational overhead, including the movement and computa-
tion of additional parameters.
The reason behind the use of ‘[CLS]’ tokens in combination with LSTM is the inadequate capacity for feature
representation on source code-related tasks. Fortunately, the rapid development of large-scale language models21
is bringing promising solutions by introducing prompt learning as the fourth paradigm for both programming
and natural language processing tasks22. During prompt learning, the ‘[MASK]’ token can aggregate abundant
knowledge associated with input sequences, including source code. By leveraging knowledge that goes beyond
surface, syntactic, and semantic features, the use of LSTM layers can be rendered unnecessary, leading to reduced
computation costs and improved feature representation.
The approach to leveraging prompt learning is to use a natural language prompt template to wrap the input
text sequence, followed by performing masked language modeling with a pre-trained language model. For
instance, in a text classification task, the text sequence x can be wrapped into a prompt template, such as “It was
[MASK]. x ,” and then input into a language model. The logits derived from the language model’s output contain
a wealth of retrieved knowledge at the location of ‘[MASK]’, which can be mapped to a specific category for
classification tasks.
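As a rough illustration of this idea, the following sketch wraps a code snippet in a prompt template with Hugging Face Transformers and reads the masked-language-model logits at the ‘[MASK]’ position. It is an illustration only, not the paper’s exact pipeline; the checkpoint name microsoft/codebert-base-mlm is an assumed publicly available MLM variant of CodeBERT.

```python
# Illustrative only: wrap a code snippet in a prompt template and inspect the
# masked-LM logits at the '[MASK]' position, which carry recalled knowledge.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "microsoft/codebert-base-mlm"       # assumed MLM variant of CodeBERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

x = "int main() { return 0; }"
prompt = f"It was {tokenizer.mask_token}. {x}"   # "It was <mask>. x"

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, vocab_size)

# Top tokens predicted at '[MASK]'; a verbalizer could map them to class labels.
top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```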
In the domain of software engineering, the pursuit of a novel classification approach that offers low com-
putational costs and possesses a robust capability for feature representation has emerged as a pivotal necessity.
To enhance the performance of tasks related to source code processing while minimizing computational costs,
a promising approach is to leverage the knowledge extracted from a language model through prompt learning
to represent the input source code or related text sequence. BERT-based models, such as CodeBERT, produce
multiple layers of output, where each layer’s output at the ‘[MASK]’ position encapsulates specific knowledge
relevant to the input information. By consolidating the important knowledge conveyed by the language model for
each task, we can obtain a more comprehensive and discriminative feature set, thereby improving the accuracy
of the task without the need for additional feature-enhancing mechanisms.
Motivated by the aforementioned concepts, we propose a novel approach, CodeClassPrompt, to enhance the
effectiveness of source code classification through prompt learning that incorporates knowledge features. The
CodeClassPrompt leverages the comprehensive knowledge output from CodeBERT, a bimodal language model
with the ability to process both natural language and programming languages. Our method revolves around
the utilization of a carefully crafted prompt template, which encapsulates the input source code or related text.
Subsequently, we employ a language model to extract knowledge that encapsulates the intrinsic characteristics
of the input text. By leveraging the attention ­mechanism23, we aggregate the significance of knowledge across
various layers for a given task and map it to a specific class. This approach enables us to fully exploit the exten-
sive knowledge embedded in each layer of the BERT-based model through prompt learning, obviating the need
for additional neural network layers for feature extraction and reducing computational costs. The aggregation
of pivotal knowledge through the attention mechanism heightens the effectiveness of feature representation,
thereby enhancing task accuracy.
To validate the efficacy of our proposed approach, we conducted experiments on four representative down-
stream tasks in the realm of source code: code language classification, code smell classification, technical debt
classification, and code comment classification. The experimental results validate that our method attains com-
parable performance to prior studies for all four tasks, while concurrently reducing computational costs. Fur-
thermore, we performed ablation experiments to assess the reliability of our components across various task
configurations.
To the best of our knowledge, the main contributions of our study can be summarized as follows:

1. Our paper is the first to combine the prompt-learning paradigm with source code-related tasks, advancing
the technical progress of the software engineering field.
2. We present CodeClassPrompt, a pioneering approach that leverages the knowledge features acquired through
prompt learning and aggregates indispensable knowledge using an attention mechanism. As a consequence,
our study yields new state-of-the-art results on the Code Smell dataset and comparable results on three other
datasets compared to previous investigations, while substantially mitigating computational costs.
3. We evaluate our approach on four classical source code-related tasks and demonstrate its effectiveness on
both programming language and natural language tasks through a comprehensive set of experiments.
4. We have made our trained models and related code publicly available in our GitHub repository (https://github.com/BIT-ENGD/codepromptclass) to facilitate researchers in reproducing the results of our study
or conducting further research.

This paper is organized as follows. Section “Related Work” provides an overview of the relevant literature con-
cerning CodeClassPrompt. Section “CodeClassPrompt” presents a comprehensive description of the design of our
proposed approach. In Section “Experimental Study”, we delineate the experimental setup of CodeClassPrompt
and provide detailed information about the baselines employed in the experiments. Section “Results and Analy-
sis” presents the experimental results and offers a thorough analysis of the findings. Moreover, we address
potential threats to the validity of our approach in Section “Threats to Validity”. Finally, we conclude the paper
in Section “Conclusion”.

Related work
Code representation learning
Significant advances have been made in the study of the intersection of machine or deep learning, program-
ming languages, and software engineering, based on the assumption that programming code resembles natural


language24. However, raw source code cannot be directly fed into machine or deep learning models, and code rep-
resentation is a fundamental step in making source code compatible with these models. This involves preparing
a numerical representation of the source code that can be used to solve specific software engineering problems.
In this subsection, we briefly introduce representative studies related to source code representation learning.
Since source code has a rich structure, it cannot be treated as only a series of text tokens24,25. Harer et al.26 tokenize a
code snippet and categorize all tokens into different bins, such as comments and string literals. All tokens with
the same categorical information are mapped to a unified identifier, which is then transformed into a vector
using the word2vec algorithm27. DeFreez et al.28 proposed the Func2Vec method, which embeds the control-flow
graph of a function as a vector to represent the function and facilitates estimating the similarity of functions. A
context-incorporating method25 was proposed to use syntactic and type annotation information for source code
embedding, which can distinguish the lexical tokens in different syntactic and type contexts. Code2Vec29 offers
a new approach, which decomposes code into a collection of paths in its abstract syntax tree (AST), learning
the atomic representation of each path and how to aggregate a set of them. For the problem of large AST length,
Zhang et al.30 found that splitting a large AST into a number of small statement trees and then encoding them
as vectors can capture both lexical and syntactic knowledge of the statements. Motivated by the need for code
summarization, Hu et al.31 proposed a simple representation of code that only leverages the vectors of a sequence
of API names to express a code snippet. With the advancement of PLMs, a number of methods based on PLMs
have been proposed. Yang et al.32 offer a fresh way by utilizing a PLM and convolutional neural networks (CNN)
to extract the feature representation of a code snippet. Since BERT is not dedicated to source code-related tasks,
Feng et al.10 utilize both code snippets and natural language to construct a bimodal pre-trained model, Code-
BERT, that can effectively represent source code-related material. Recently, Jain et al.33 introduced contrastive
learning into code representation learning, as BERT-based models are much more sensitive to source code edits
and cannot represent similar code snippets with slightly different literal expressions.

Source code-related classification


Source code-related classification encompasses four crucial tasks in software engineering that have been thoroughly
investigated by researchers: code language classification, code smell classification, code comment classification,
and technical debt classification.

Code language classification


In most source code-related tasks, code language classification is the initial step for further processing. Previ-
ously, the programming language of a piece of source code was assigned manually or determined based on its file
­extension4. SC++3 employs the Random Forest Classifier (RFC) and XGBoost (a gradient boosting algorithm)
to build a machine learning classifier that can detect programming languages, even for code snippets from a
family of programming languages such as C, C++, and C#. Khasnabish et al.1 utilized several variants of Bayesian
classifier models to detect 10 programming languages and achieved remarkable results. Multiple layers of neural
networks and convolutional neural networks were trained to judge the programming language of over 60 kinds
of source ­code4. Large language models have demonstrated their enormous power in natural language tasks.
Inspired by this, some ­researchers34 have also employed large language models, such as RoBERTa, for source
code classification with successful results. The RoBERTa classification method employs a pipeline that consists
of a RoBERTa model and fully connected neural networks. The CodeBERT model has been extensively utilized
for code language classification. In the work of Liu et al.18, a two-step fine-tuning method, EL-CodeBERT, was
devised to address source code-related classification tasks. This method utilizes the CodeBERT model to generate multiple layers of
‘[CLS]’ vectors, which function as semantic representations of an input sequence of source code or comments. Subsequently,
a bidirectional long short-term memory (LSTM) network is utilized to extract more effective features from these
multiple layers of ‘[CLS]’ vectors. In order to capture the significance of these features, an attention mechanism
is employed to aggregate the most important ones. Finally, the aggregated features are fed into a fully connected
neural network to perform the classification task. Two-step fine-tuning is utilized to address the difference in
model training between CodeBERT and Bi-LSTM18.

Code comment classification


Code comments are a powerful tool to help programmers understand and maintain code snippets, but different
comments can have different ­intentions35. Rabi et al.36 developed a multilingual approach to code comment clas-
sification that utilizes natural language processing and text analysis to classify common types of class comment
information with high accuracy for Python, Java, and Smalltalk programming languages. The Naive Bayes clas-
sifier, the J48 tree model, and the Random Forest model underlie the classifier used. To reveal the relationship
between a code block and the associated comment’s category, Chen et al.37 classified comments into six intention
categories and manually labeled 20,000 code-comment pairs. These categories include “what”, “why”, “how-to-
use”, “how-it-is-done”, “property”, and “others”. PLM-based methods, such as EL-CodeBERT18, have demonstrated
their effectiveness in code comment classification, yielding promising results.

Technical debt classification


Delivering high-quality, bug-free software is the goal of all software projects. When programmers are limited
by time or other resources, the code they deliver is either incomplete, requires rework, produces errors, or is a
temporary workaround. Such incomplete or temporary workarounds are commonly referred to as “technical
debt”38–40, which comes at the cost of paying a higher price later on. Self-admitted technical debt (SATD) is common in software
projects and can have a negative impact on software maintenance. Therefore, identifying SATD is very important
for software engineering and needs to be investigated. Potdar and Shihab38 identified SATD by studying source


code comments of four projects and manually devising a set of 62 patterns. As manually designed patterns
have significant drawbacks, such as limited generality and a heavy manual burden, Huang et al.41 investigated a text
mining-based approach that combines multiple classifiers to detect SATD in the source comments of target software
projects. Since various characteristics of SATD features in code comments, such as vocabulary diversity, project
uniqueness, length, and semantic variations, pose a notable challenge to the accuracy of pattern or traditional
text mining-based SATD detection, Ren et al.42 propose a convolutional neural network (CNN)-based approach
for classifying code comments as SATD or non-SATD. To avoid the daunting manual effort of extracting features,
Wang et al.43 leverage attention-based neural networks to detect SATD. EL-CodeBERT18, a novel method that
leverages a PLM, showcases its potential in SATD detection tasks.

Code smell classification


Kent Beck defined the term “code smell” in the context of identifying quality problems in code, caused by design
flaws or bad programmer habits, that can be refactored to improve the maintainability of software44. Previous
works have investigated various methods for identifying code smells in source code, and among them, machine
learning is an effective approach for code smell classification. Fontana et al.45,46 studied 16 different machine
learning methods on four code smells (Data Class, Large Class, Feature Envy, Long Method) and 74 software
systems with 1986 manually validated code smell samples. They found that J48 and Random Forest are able to
achieve high performance in code smell classification. Das et al.47 propose a supervised convolutional neural
network-based approach to code smell classification that eliminates the effort of manually selecting features. To
eliminate the manual effort of feature extraction, several neural network-based methods have been proposed in
the community. However, neural network-based methods still require a large number of labeled samples, and Liu et al.48
proposed an automatic approach to generate labeled training data for neural network-based classifiers. Transfer
learning can be a reliable solution to the dilemma between better performance and the need for more labeled samples. Sharma
et al.49 propose a transfer learning method involving convolutional neural networks and recurrent neural net-
works, which can transfer the learned detection ability between C# and Java language. Li et al.50 propose a hybrid
model based on deep learning for multi-label code smell classification, which utilizes both graph convolutional
neural networks and bidirectional long short-term memory networks with attention mechanism. To capitalize
on the benefits offered by large language models and enhance performance, EL-CodeBERT18 is employed for
code smell classification.

Prompt learning method


The prompt-learning method has emerged as the fourth paradigm22 of natural language processing (NLP) and has
drawn enormous attention from the NLP community; it adapts a variety of downstream NLP tasks to the pre-
training tasks of large language models. Starting from GPT-321, prompt learning has demonstrated its unique
advantages in various downstream NLP tasks, with applications in text classification51, machine translation52,
etc. Existing research on prompt learning involves three main components: a pre-trained language
model, a prompt template, and a verbalizer, of which research related to the verbalizer is orthogonal to our study.
Previous studies related to PLM have broadly focused on PLM architectures, including ­BERT5, ­RoBERTa6, GPT-
3, ­Bart53, and others. Recently, several PLM related studies have focused on the software engineering domain,
with ­CodeBERT13 and ­GraphCodeBERT12 being typical works. As a generative PLM, ­CodeT511 benefits a broad
set of source code related tasks. A prompt template is used as a container to wrap the input text sequence into a
prompt, which is then fed into a PLM to motivate the PLM to recall the rich knowledge associated with the input
information. Templates can be constructed in a handcrafted ­manner54,55 and achieve remarkable performance on
a variety of downstream NLP tasks. To avoid the onerous effort of manually constructing a prompt template, some
researchers seek to construct prompt templates automatically. The automatically generated templates are of two
types: discrete templates and continuous templates. The MINE approach56 is a mining-based discrete approach
that automatically finds templates given a set of training inputs x and outputs y. Jiang et al.56 leverage round-trip
translation of the prompt into another language and back to generate new templates. Prefix
­Tuning57 is a method that can be applied to continuous templates. The technique involves adding a sequence
of task-specific vectors as prefixes to the input while keeping the parameters of the pre-trained language model
(PLM) frozen.
EL-CodeBERT establishes itself as the strongest baseline by attaining state-of-the-art performance across all
four source code-related tasks.

Novelty of our study


The proposed method represents a significant departure from existing state-of-the-art approaches that primar-
ily rely on the ‘[CLS]’ vectors obtained from multiple layers of the output vectors from a BERT-based model to
represent input text (Liu et al., 2022; Jawahar et al., 2019), coupled with the use of LSTM to enhance features. In
contrast, CodeClassPrompt introduces a novel approach that leverages prompt learning to induce knowledge
features from a large language model. This innovative technique enables the successful completion of source
code-related tasks without additional LSTM layers while reducing computational overhead during execution.
Notably, CodeClassPrompt avoids a two-step training procedure by eliminating the need for LSTM layers. Spe-
cifically, CodeClassPrompt utilizes prompt learning to guide a pre-trained language model towards generating
outputs that are abundant in knowledge and intricately connected to a provided input sequence. To effectively
capture indispensable feature information while discarding extraneous details, attention values are meticulously
computed for a designated task using a specialized attention mechanism. Subsequently, the important features
are aggregated and categorized without necessitating supplementary layers for feature extraction. This approach,


employed by CodeClassPrompt, facilitates the extraction of comprehensive feature information while concur-
rently minimizing superfluous complexity.

CodeClassPrompt
Our proposed approach, named CodeClassPrompt, leverages multiple aspects of knowledge recalled by a PLM
to facilitate source code-related classification tasks. The knowledge contained in each layer of CodeBERT out-
put vectors has its own hierarchy17 and is regarded as a distinct aspect of knowledge. The model architecture of
our CodeClassPrompt approach is shown in Fig. 1, which consists of a prompt wrapper, an embedding layer, a
CodeBERT layer, a knowledge layer, and an attention layer.

Task definition
The CodeClassPrompt approach leverages a prompt template, which encapsulates segments of source code as a
prompt, to activate a pre-trained model aimed at retrieving relevant knowledge associated with the input con-
tent. The acquired knowledge serves as the representation of the input information, appearing at the position
of ‘[MASK]’ in each layer of the output vector. Subsequently, an attention mechanism is employed to aggregate
multiple layers of this representation, resulting in a feature representation of the input text. This aggregated
feature is then utilized for the classification of the source code or related text.
Specifically, we denote the pre-trained model as M, with CodeBERT being an example. We use xp to represent
the prompt template, where “It was [MASK].” serves as the prefix of the prompt and x denotes the input text.
xp = [CLS] It was [MASK]. x

Prompt Learning harnesses the inherent capability of masked language modeling in BERT family models. This
approach entails predicting pertinent contextual knowledge for masked positions based on diverse prompt
prefixes associated with the original input text.
Let the input information x = “int main() { return 0; }”. After encapsulating x within the prompt template xp,
x undergoes a transformation and gives rise to a modified representation denoted
as x̂p, which is commonly referred to as a prompt.
x̂p = [CLS] It was [MASK]. int main() { return 0; }

The aforementioned procedure is recognized as the Prompt Wrapper in this study. Following this, the modified
representation x̂p is introduced as the input content that is fed into the model M. The CodeBERT model consists
of two primary components: the embedding layer and the CodeBERT Layer. The embedding layer is responsible
for mapping the input prompt x̂p to a set of vectors, each vector having a dimension size of 768. These vectors
traverse the CodeBERT layer, which retrieves the relevant knowledge information stored during the pre-training
phase of the CodeBERT model. This information is represented as a vector at the position of the ‘[MASK]’ token
within the output vectors. The overall output consists of 13 layers of vectors. The first layer corresponds to the
embedded vectors of the input, while the subsequent layers capture hierarchical knowledge that is closely associ-
ated with the input ­content17.
The knowledge layer assumes the role of selecting multiple layers of vectors as input vectors for the attention
layer, specifically targeting vectors located at the position of “[MASK]” within each layer. In the scope of this
study, the selected layers encompass layer 2 through layer 12.

Figure 1.  The architecture of CodeClassPrompt.


The attention layer assigns individual weights to the different layers of the output derived from the knowledge
layer. The weighted sum of these layer outputs serves as the representation that is then fed into the final classi-
fier, which is a fully connected neural network. The purpose of this process is to obtain the ultimate class label
for the original input text.

Prompt wrapper
The prompt wrapper is designed to wrap an input sequence into a prompt template as a prompt, which is then
input into a PLM. An effective prompt can induce the PLM to output relevant knowledge related to the input
sequence.
In prompt learning, an input sequence x = {w1, w2, ..., wn} first needs to be wrapped into a prompt template
xp = [CLS] It was [MASK]. x

as a prompt x̂p, which is then fed into a PLM.
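A minimal sketch of the prompt wrapper is given below; the template string is illustrative, and the ‘[CLS]’ token is added later by the tokenizer, so it is omitted here.

```python
# Minimal sketch of the prompt wrapper: wrap a raw input sequence into a prompt
# template so that a PLM can later fill the '[MASK]' slot with recalled knowledge.
def wrap_prompt(x: str, template: str = "It was [MASK]. {x}") -> str:
    """Return the prompt for an input sequence x."""
    return template.format(x=x)

print(wrap_prompt("int main() { return 0; }"))
# -> "It was [MASK]. int main() { return 0; }"
```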

Embedding layer
This layer aims to capture the relationships between tokens by mapping the target input from a textual form to a
vector representation in a low-dimensional dense space. To input a prompt to the model, the wrapped prompt
text sequence is first tokenized by the word-piece algorithm, yielding the sequence x = (x1, x2, ..., xn), where
n is the length of the input sequence and xn is the n-th tokenized sub-word. Since the length of the input sequence
varies for different inputs, we need to pad them to a uniform length to facilitate subsequent processing by the
CodeBERT model. Given the maximum input sequence length N, if a sequence is shorter than N,
we pad 0 to its end to make its length equal to N; if a sequence is longer than N, we directly truncate the
redundant text at the end. Therefore, the output of the embedding layer is
X = (X1, X2, ..., XN). An input text is not merely a combination of tokens; it also carries important order information.
To enable the model to leverage this order information, absolute positional encoding (APE) is added to X, and the
final output fed into the next layer is

X = X + APE(X), \quad X \in \mathbb{R}^{N \times d_{model}}, \qquad (1)

where dmodel is the size of the embedding dimension.
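For illustration, the padding and truncation to a fixed length N can be delegated to the tokenizer, as in the sketch below; the checkpoint name and the value of N are assumptions, not the paper’s exact settings.

```python
# Sketch of the input preparation described above: tokenize the wrapped prompt,
# then pad with 0 or truncate so every sequence has length N.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
N = 256  # assumed maximum input length

prompt = "It was <mask>. int main() { return 0; }"
enc = tokenizer(prompt, padding="max_length", truncation=True,
                max_length=N, return_tensors="pt")
print(enc["input_ids"].shape)  # torch.Size([1, 256])

# Inside CodeBERT's embedding layer, the token embeddings are then combined with
# absolute positional encodings, i.e. X = X + APE(X) as in Eq. (1).
```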

CodeBERT layer
In this layer, unlike other works, we leverage the knowledge information output from a masked language model.
The embedding vectors from the embedding layer are fed into this layer to elicit the knowledge information stored
during the pre-training phase. The elicited knowledge information is leveraged as knowledge features of input
sequences. CodeBERT is a bimodal pre-trained language model based on
transformers for both programming languages (PL) and natural languages (NL)10. CodeBERT is pre-trained on
a large general-purpose corpus by two tasks: Masked Language Modeling (MLM) and Replaced Token Detection
(RTD). Specifically, the MLM task targets bimodal data by simultaneously feeding the code with the correspond-
ing comments and randomly selecting positions for masking, then replacing the token with a special ‘[MASK]’
token; the goal of the MLM task is to predict the original token. The RTD task targets unimodal data with
separate codes and comments, randomly replaces tokens, and aims to learn whether a token is the original
word using a discriminator10. However, in a large language model such as CodeBERT, the output at the ‘[MASK]’
location represents not only the original token but also the relevant knowledge about the input information.
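A sketch of retrieving all 13 layers of output vectors through the Hugging Face interface is shown below; it is illustrative and assumes the publicly released microsoft/codebert-base checkpoint.

```python
# Sketch: run the wrapped prompt through CodeBERT with output_hidden_states=True
# to obtain the 13 layers of output vectors (1 embedding layer + 12 encoder layers).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

prompt = f"It was {tokenizer.mask_token}. int main() {{ return 0; }}"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

hidden_states = out.hidden_states   # tuple of 13 tensors, each (1, seq_len, 768)
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
print(len(hidden_states), hidden_states[0].shape)
```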

Knowledge layer
Previous studies leverage only the vector of one or more layers at the ‘[CLS]’ location18. According to Choi et al.15,
the output at the ‘[CLS]’ location is not the best choice for the representation of an input sequence, as it is not
stable for downstream tasks. Jawahar et al.17 proposed that each layer of the output of a BERT-like model has
different semantic features that can be combined to further extract higher-level features that better represent
the input. Unlike these studies, we leverage knowledge information from the output of CodeBERT. As described
in the prior subsection, the output at the ‘[MASK]’ location includes knowledge information related to the input,
in which each layer captures a different aspect of knowledge with different importance for the input. The output
of the CodeBERT model has 13 layers, comprising one embedding layer and 12 encoder layers. The lowest-
level embedding carries little knowledge about the whole input, and the remaining layers have different importance.
Through pilot experiments, the outputs from layer 2 through layer 12 have been chosen as knowledge sources.

X_{know} = \begin{bmatrix} x_{mask\_2} \\ x_{mask\_3} \\ \vdots \\ x_{mask\_12} \end{bmatrix}, \quad X_{know} \in \mathbb{R}^{11 \times d_{model}}, \qquad (2)

where dmodel is the size of the embedding dimension.
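The selection can be sketched as follows; the stand-in tensors mimic the 13-layer output from the previous sketch, and the mapping of layers 2-12 to indices 1-11 (with index 0 being the embedding layer) is our assumption.

```python
# Sketch of the knowledge layer: keep the '[MASK]' vector from 11 of the 13 output
# layers, yielding X_know with shape (11, d_model) as in Eq. (2).
import torch

# Stand-in for the 13 layers of CodeBERT output: tensors of shape (batch, seq_len, 768).
hidden_states = tuple(torch.randn(1, 16, 768) for _ in range(13))
mask_pos = 3  # assumed position of the '[MASK]' token in the input ids

x_know = torch.stack([hidden_states[i][0, mask_pos]   # layer output at '[MASK]'
                      for i in range(1, 12)], dim=0)  # layers 2..12 (1-based)
print(x_know.shape)  # torch.Size([11, 768])
```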

Attention layer
Several knowledge features were obtained in the previous subsection. Not all representational information con-
tributes equally to the representation of the input, and each layer of knowledge features carries a different weight
in the overall representation. Some source code-related tasks focus more on lower-level knowledge features, while others


focus more on higher-level features. Therefore, in CodeClassPrompt, the attention mechanism is used to compute
different weights for each knowledge feature. Specifically, we first compute the tanh value of Xknow as
u_i = \tanh(X_{know}), \qquad (3)

then the similarity between u_i and the context vector u_w can be calculated and transformed into a probability
distribution by Softmax:

\alpha_i = \frac{\exp(u_i^T u_w)}{\sum_i \exp(u_i^T u_w)} \qquad (4)

\alpha_i can be treated as the importance of each level of knowledge feature for the input; therefore, using \alpha_i in a global
weighted summation over X_{know} generates the output vector x_{out}:

x_{out} = \sum_i \alpha_i x_i \qquad (5)

Finally, x_{out} can be classified by a single layer of fully connected feed-forward network:

p(y|x_{out}) = w(\mathrm{gelu}(x_{out})) + b \qquad (6)

The final class label of the input is

\hat{y} = \arg\max_y p(y|x_{out}) \qquad (7)
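A minimal PyTorch sketch of Eqs. (3)-(7) is given below; the layer count, hidden size, number of classes, and the learned context vector u_w are illustrative assumptions rather than the exact released implementation.

```python
# Minimal sketch of the attention aggregation and classifier in Eqs. (3)-(7).
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAttentionClassifier(nn.Module):
    def __init__(self, d_model: int = 768, num_classes: int = 19):
        super().__init__()
        self.u_w = nn.Parameter(torch.randn(d_model))      # context vector u_w
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x_know: torch.Tensor) -> torch.Tensor:
        # x_know: (batch, num_layers, d_model) stacked '[MASK]' vectors
        u = torch.tanh(x_know)                              # Eq. (3)
        scores = u @ self.u_w                               # u_i^T u_w, (batch, num_layers)
        alpha = F.softmax(scores, dim=-1)                   # Eq. (4)
        x_out = (alpha.unsqueeze(-1) * x_know).sum(dim=1)   # Eq. (5)
        return self.classifier(F.gelu(x_out))               # Eq. (6)

logits = KnowledgeAttentionClassifier()(torch.randn(4, 11, 768))
print(logits.argmax(dim=-1))                                # predicted labels, Eq. (7)
```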

Experimental study
In this section, we design four source code-related tasks aimed at answering the following questions (RQs).
RQ1: Can our proposed approach achieve comparable results for source code-related tasks without the
necessity of additional feature extraction layers, in comparison to the baselines?
RQ2: Can the performance of CodeClassPrompt be enhanced by solely utilizing either the attention mecha-
nism or a prompt template?
RQ3: How does the attention mechanism work on the four source code-related classification tasks?
RQ4: Does the proposed model perform equally well on both programming language-based and natural
language-based tasks?
RQ5: What is the potential improvement in computational efficiency that can be achieved by eliminating
additional neural network layers?
RQ1 aims to validate the performance of CodeClassPrompt on four source code-related classification tasks.
To achieve this goal, we conducted extensive experiments and compared CodeClassPrompt with current state-
of-the-art baselines. In RQ2, we investigated the effectiveness of different components in the CodeBERT-based
approach on performance. To answer these questions, we performed comprehensive ablation studies on four
datasets. RQ3 examines the differences in focusing on different layers of knowledge across four different tasks.
RQ4 addresses the question of whether knowledge features are equally effective for both natural language-based
and programming language-based tasks. Lastly, RQ5 explores the potential improvement in computational
efficiency that can be attained by eliminating additional neural network layers within a CodeBERT pipeline.

Tasks and datasets


To validate our proposed approach, we conducted extensive experiments on four downstream tasks related to
source code. The first two tasks (code language classification and code smell classification) are related to program-
ming language processing, while the other two (code comment classification and technical debt classification)
are related to natural language processing. Each task has a dedicated dataset, which is described in detail in the
corresponding task description.

Code language classification


In this task, the target data are source code snippets. We leverage the publicly shared dataset in SC++3, which
collects 21 programming languages popular in the Stack Overflow community, based on the 2017 Stack Overflow
developer survey (https://insights.stackoverflow.com/survey/2017#technology). As is inevitable, there is some
invalid source code in the dataset, which we removed using tools from DeepSCC34, keeping 19 programming
languages. In other words, this task is a multi-class source code task that predicts the programming language
type of each code snippet.

Code smell classification


Code smell classification is a binary classification task in which a model needs to decide whether a code snippet has
a code smell or not; its target data are a variety of source code snippets. Fakhoury et al.58 built a corpus that selects
4205 lines of source code from 13 Java open-source systems to avoid domain-specific dependencies in the
results. The corpus has over 1700 labeled code snippets, labeled in line with the taxonomy of linguistic smells
presented in the paper59. The corpus is a typical dataset for code smell classification in previous studies18, and
we adopt it for this task.


Code comment classification


Code comments are crucial software components that contain important information concerning software design,
code implementation, and other technical details. We adopt the corpus shared by Pascarella and Bacchelli60,
which has over 11,000 code comments and 16 classes from six Java open-source software projects. This task is a
multi-class natural language classification task that aims to categorize each comment into a specific class.

Technical debt classification


Technical debt classification is a natural language classification task based on annotated information indicating
whether there is technical debt or not. Our goal is to detect technical debt annotated by programmers, i.e., self-
admitted technical debt (SATD). The dataset presented by Maldonado et al.61 consists of approximately 10,000
code comments collected from 10 open source projects, which are classified into five types of SATD, namely,
design debt, requirement debt, defect debt, documentation debt, or test debt. All our SATD experiments are
performed on the corpus.
For brevity, in the following sections, code language classification is abbreviated as code language, code smell
classification as code smell, code comment classification as code comment, and technical debt classification as
technical debt.
To ensure a rigorous and unbiased evaluation, we employ a stratified sampling technique to partition the
corpus into distinct training and test sets for each task, adhering to an 80:20 ratio. This approach enables
a fair comparison with the baseline results. These statistics consist of (1) the number of training and test sets,
(2) the mean, mode, and median of the code/comment length, and (3) the percentage of samples with sizes <32,
<64, <128, <256, and <300. All statistical information for the four datasets is detailed in Table 1.
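A minimal sketch of this stratified 80:20 split using scikit-learn is shown below; the toy data and random seed are illustrative, not the actual corpora.

```python
# Sketch of the stratified 80:20 train/test split (toy data, illustrative seed).
from sklearn.model_selection import train_test_split

texts  = ["int main() { return 0; }", "print('hello')"] * 10   # code or comments
labels = [0, 1] * 10                                            # class ids

train_x, test_x, train_y, test_y = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)
print(len(train_x), len(test_x))  # 16 4
```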
Tables 2, 3, 4, 5, 6, 7, 8, and 9 provide a comprehensive breakdown of the category distribution for each dataset.
The Code Language and Code Smell datasets demonstrate a well-balanced distribution of classes without any
significant class imbalance. In the case of the remaining datasets, some degree of class imbalance is observed,
with certain classes being underrepresented compared to others.

Task Class Num Train Test Avg (%) Mode Median < 32 (%) < 64 (%) < 128 (%) < 256 (%) < 300 (%)
Code Language 19 179,556 44,889 58.58 2 31.0 51.04 73.76 89.78 96.94 97.75
Code Smell 2 1399 350 41.97 8 20.0 68.19 85.56 93.14 97.63 98.43
Code Comment 16 8985 2247 15.82 2 7.0 88.46 93.88 97.12 99.97 99.99
Technical Debt 2 31,708 6652 9.27 3 6.0 95.98 99.23 99.87 99.98 99.99

Table 1.  Corpus statistics for four source code-related tasks.

Label Count Label Count Label Count Label Count Label Count
0 9574 1 9605 2 9594 3 9559 4 9639
5 9681 6 9584 7 9542 8 6780 9 9591
10 9623 11 9546 12 9556 13 9639 14 9611
15 9660 16 9591 17 9527 18 9654

Table 2.  The statistical information of dataset CODE LANGUAGE: train set.

Label Count Label Count Label Count Label Count Label Count
0 2427 1 2396 2 2407 3 2442 4 2362
5 2320 6 2417 7 2459 8 1647 9 2410
10 2378 11 2455 12 2445 13 2362 14 2390
15 2341 16 2410 17 2474 18 2347

Table 3.  The statistical information of dataset CODE LANGUAGE: test set.

Label Count Label Count


0 755 1 644

Table 4.  The statistical information of dataset CODE SMELL: train set.


Label Count Label Count


0 189 1 161

Table 5.  The statistical information of dataset CODE SMELL: test set.

Label Count Label Count Label Count Label Count Label Count
0 164 1 263 2 43 3 782 4 195
5 159 6 46 7 70 8 662 9 357
10 823 11 205 12 3365 13 152 14 16
15 1683

Table 6.  The statistical information of dataset CODE COMMENT: train set.

Label Count Label Count Label Count Label Count Label Count
0 41 1 66 2 11 3 196 4 49
5 40 6 11 7 17 8 166 9 89
10 206 11 51 12 841 13 38 14 4
15 421

Table 7.  The statistical information of dataset CODE COMMENT: test set.

Label Count Label Count


0 29020 1 2688

Table 8.  The statistical information of dataset CODE SATD: train set.

Label Count Label Count


0 6069 1 583

Table 9.  The statistical information of dataset CODE SATD: test set.

Evaluation metrics
Accuracy, Precision, Recall, and F1-Score are chosen as evaluation metrics for the binary classification task.
These evaluation metrics are calculated as follows.
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (8)

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (9)

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (10)

\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (11)
where TP means that a positive sample is predicted as a positive class, TN means that a negative sample is
predicted as a negative class, FP means that a negative sample is assigned a positive label, and FN means that a
positive sample is predicted as a negative class.
For all multi-classification tasks, we leverage the macro approach to compute evaluation metrics. Specifi-
cally, we tally TP, FP, FN and TN for each class and then compute Precision, Recall and F1-Score, respectively.
Finally, we obtain the mean value of each metric for all classes to obtain Macro-Precision, Macro-Recall, and
Macro-F1-Score.
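The macro-averaged metrics above can be computed, for example, with scikit-learn; the predictions below are toy values for illustration only.

```python
# Sketch of the macro-averaged evaluation metrics with scikit-learn (toy labels).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"ACC={acc:.4f}  P={p:.4f}  R={r:.4f}  F1={f1:.4f}")
```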


For simplicity, Accuracy is abbreviated as ACC, Precision as P, Recall as R, and F1-Score as F1 in the follow-
ing tables.

Baselines
Previous studies have extensively investigated source code-related ­tasks62, utilizing a range of AI-based tools
from machine learning to deep learning. In this study, we conduct a comprehensive evaluation of our Code-
ClassPrompt model in comparison to eight recently proposed baselines across four source code-related classifica-
tion tasks. These baselines can be categorized into two distinct groups based on the AI development perspective:
machine learning-based approaches and neural network-based approaches. The latter category can be further
divided into two sub-classes, namely classical neural network-based approaches and pre-trained language model-
based methods. The following is a brief introduction to all the baselines we compare with: Random Forest,
XGBoost, TextCNN, AttBLSTM, BERT, RoBERTa, CodeBERT, and EL-CodeBERT.

Machine learning based methods


Random Forest is an ensemble machine learning algorithm based on decision trees and bagging proposed by
Breiman2; the baseline experiments were implemented with the library scikit-learn (https://github.com/scikit-learn/scikit-learn).
XGBoost, proposed by Chen and Guestrin63, is a scalable end-to-end tree boosting system that is widely used
by data scientists to achieve state-of-the-art results on many machine learning challenges. It employs a sparsity-
aware algorithm for sparse data and a weighted quantile sketch for approximate tree learning. We adopt the official
implementation from the original authors on GitHub (https://github.com/dmlc/xgboost).

Neural networks based methods


Classical Neural Network based Approaches
TextCNN. Kim64 proposed TextCNN, a sophisticated approach that leverages convolutional neural networks
for natural language processing. In this study, we adopt a classical implementation of TextCNN as a baseline
(https://github.com/NTDXYG/Text-Classify-based-pytorch/blob/master/model/TextCNN.py).
AttBLSTM. Zhou et al.65 proposed AttBLSTM, a combined approach that leverages an attention mechanism
and a bidirectional long short-term memory network (BiLSTM) to capture important semantic features of textual
sequences. In this study, we utilize a baseline implementation of AttBLSTM based on source code available on
GitHub (https://github.com/NTDXYG/Text-Classify-based-pytorch/blob/master/model/TextRNN_Attention.py).
Pre-trained Model based Methods
BERT5 is a deep bidirectional transformer-based language model for language understanding, pre-trained
on both next sentence prediction and masked language modeling tasks using self-supervised methods with
large-scale corpora. In this study, we utilize a baseline implementation of BERT based on the BERT-base model
(https://huggingface.co/transformers/v3.0.2/model_doc/bert.html#bertforsequenceclassification).
RoBERTa6 is an enhanced BERT-based model pre-trained on a much larger corpus than the original BERT
model (https://huggingface.co/roberta-base) using only the masked language modeling approach. In this study, we
utilize a baseline implementation of sequence classification based on the RoBERTa-base model, which is the
official implementation provided by the authors (https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaForSequenceClassification).
CodeBERT10 is a transformer-based model pre-trained on a corpus consisting of both programming language
and natural language, using both masked language modeling and replaced token detection tasks. In this study,
we utilize a baseline implementation of CodeBERT based on the CodeBERT-base model.
EL-CodeBERT18 is a two-stage model that builds upon CodeBERT, an attention mechanism, and BiLSTM. The
model utilizes BiLSTM to extract multiple layers of semantic features and then employs an attention mechanism
to aggregate the final features, resulting in promising performance compared to other baselines. To ensure
consistency with the baseline implementation, we adopt the official code and training method (https://github.com/NTDXYG/ELCodeBert).

Experimental settings
Our proposed approach utilizes OpenPrompt66, an open-source framework for prompt learning, along with
the CodeBERT-base model10. To ensure a consistent and fair comparison, we conducted all experiments using an
Nvidia RTX3090 GPU (Graphics Processing Unit), the Linux operating system (Ubuntu 22.04), and 64 GB of system
memory for both the baselines and our proposed method.
The prompt templates leveraged for each respective dataset are delineated in Table 10; they were selected
from the four universal template candidates enumerated in Table 11, derived from the OpenPrompt framework66.
To ascertain the most suitable template for each task, a series of pilot experiments was undertaken. These
experiments served as a systematic evaluation to assess the performance and alignment of each template with
the specific task requirements.
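As a hedged illustration of how such a manual template can be expressed in OpenPrompt (the exact configuration in our released code may differ), the sketch below defines candidate template No. 1 from Table 11 and wraps a toy example; loading the CodeBERT-base checkpoint through the "roberta" loader is an assumption.

```python
# Sketch only: defining one of the Table 11 manual templates with OpenPrompt.
from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate
from openprompt.data_utils import InputExample
from openprompt import PromptDataLoader

# CodeBERT-base uses the RoBERTa architecture, so it is loaded via the "roberta" loader.
plm, tokenizer, model_config, WrapperClass = load_plm("roberta", "microsoft/codebert-base")

template = ManualTemplate(
    tokenizer=tokenizer,
    text='It was {"mask"} . {"placeholder":"text_a"}',   # candidate template No. 1
)

dataset = [InputExample(guid=0, text_a="int main() { return 0; }", label=0)]
loader = PromptDataLoader(dataset=dataset, template=template, tokenizer=tokenizer,
                          tokenizer_wrapper_class=WrapperClass, max_seq_length=256)
```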

Results and analysis


This section presents the evaluation and analysis of our proposed approach, CodeClassPrompt. We will com-
mence by comparing our results with the baselines on four source code-related tasks to demonstrate the per-
formance of our approach. All results of CodeClassPrompt are obtained by conducting five independent runs
with different random seeds. The average value and standard deviation are calculated for each metric, with the
maximum value indicated in parentheses. Following that, we will experimentally validate the effectiveness of
each component in our approach. We will then conduct a series of attention-related experiments to demonstrate


Task Prompt Template


Code Language “ Just [MASK] ! x”
Code Smell “ x In summary , it was [MASK] .”
Code Comment “ It was [MASK] . x”
Technical Debt “ Just [MASK] ! x”

Table 10.  Prompt templates for four tasks.

No. Prompt Template


1 “ It was [MASK] . x”
2 “ x In summary , it was [MASK] .”
3 “ x All in all , it was [MASK] .”
4 “ Just [MASK] ! x”

Table 11.  All template candidates.

how different tasks focus on different knowledge layers. Next, we will analyze the varying effects of knowledge
features on programming language-based and natural language-based tasks. Finally, a series of experiments will
be conducted to demonstrate the computational cost-saving potential of CodeClassPrompt.

Result analysis for RQ1


In Tables 12, 13, 14, and 15, the first two rows correspond to machine learning approaches, the next two rows
marked with an asterisk (*) are classical neural network approaches without pre-trained language models, and
the next four rows are pre-trained model-based approaches. The number in parentheses following a method

Method ACC(%) P(%) R(%) F1(%)


Random ­Forest18 78.728 79.362 78.825 78.874
XGBoost18 78.803 79.925 78.891 79.217
TextCNN18* 82.662 83.561 82.706 82.964
AttBLSTM18* 79.035 79.801 79.107 79.272
BERT18 86.865 87.129 86.938 86.985
RoBERTa18 87.202 87.424 87.276 87.135
CodeBERT18 87.418 88.042 87.450 87.614
EL-CodeBERT(2022)18 87.959 88.177 88.023 88.077
EL-CodeBERT 87.757 ± 0.075(87.870) 88.065 ± 0.058(88.139) 87.816 ± 0.081(87.934) 87.895 ± 0.073(88.002)
CodeClassPrompt 87.906 ± 0.085(88.024) 88.104 ± 0.089(88.232) 87.980 ± 0.083(88.091) 88.030 ± 0.085(88.149)

Table 12.  Results on code language classification task. The number in bold represents the optimal value for
each metric.

Method ACC (%) P(%) R(%) F1(%)


Random ­Forest18 78.286 78.880 77.548 77.756
XGBoost18 75.714 75.667 75.305 75.409
TextCNN18* 80.000 80.016 79.641 79.761
AttBLSTM18* 78.857 78.810 78.537 78.631
BERT18 79.714 79.580 79.653 79.610
RoBERTa18 81.143 81.014 81.067 81.038
CodeBERT18 85.429 85.516 85.128 85.264
EL-CodeBERT(2022)18 86.000 85.990 85.795 85.874
EL-CodeBERT 76.343 ± 1.134(77.714) 76.752 ± 1.000(77.927) 75.767 ± 1.380(77.479) 75.854 ± 1.343(77.526)
CodeClassPrompt 86.171 ± 0.291(86.571) 86.329 ± 0.348(86.678) 85.862 ± 0.33(86.462) 86.006 ± 0.3(86.479)

Table 13.  Results on code smell classification task. The number in bold represents the optimal value for each
metric.


Method ACC(%) P(%) R(%) F1(%)


Random ­Forest18 90.921 83.783 75.104 74.618
XGBoost18 90.565 77.561 68.661 71.645
TextCNN18* 91.945 87.541 78.596 80.977
AttBLSTM18* 92.345 85.578 78.268 80.596
BERT18 94.482 87.149 83.935 85.275
RoBERTa18 94.393 90.525 86.121 86.875
CodeBERT18 94.838 87.916 86.301 86.820
EL-CodeBERT(2022)18 95.238 89.395 87.280 87.977
EL-CodeBERT 94.811 ± 0.170(94.971) 89.070 ± 2.886(92.522) 85.966 ± 0.651(86.530) 86.673 ± 1.083(87.475)
CodeClassPrompt 95.220 ± 0.108(95.416) 89.193 ± 0.648(89.691) 86.938 ± 0.287(87.183) 87.770 ± 0.249(87.942)

Table 14.  Results on code comment classification task. The number in bold represents the optimal value for
each metric.

Method ACC(%) P(%) R(%) F1(%)


Random ­Forest18 97.278 95.080 87.019 90.564
XGBoost18 97.294 93.901 88.516 90.991
TextCNN18* 96.978 93.015 87.258 89.881
AttBLSTM18* 97.114 93.979 87.177 90.227
BERT18 96.783 91.229 88.004 89.534
RoBERTa18 97.595 93.352 91.110 92.279
CodeBERT18 97.835 94.197 91.991 93.059
EL-CodeBERT(2022)18 97.850 94.024 92.310 93.146
EL-CodeBERT 97.895 ± 0.016(97.910) 93.687 ± 0.713(94.970) 93.110 ± 0.994(94.273) 93.375 ± 0.175(93.530)
CodeClassPrompt 97.811 ± 0.058(97.895) 93.954 ± 0.234(94.330) 92.118 ± 0.248(92.387) 93.011 ± 0.186(93.263)

Table 15.  Results on technical debt classification task. The number in bold represents the optimal value for
each metric.

name denotes the year the method was proposed. The method names of baselines highlighted in red signify
that their results were obtained through experimentation, whereas the other baseline results are cited from the
paper of EL-CodeBERT18.
Code Language Classification. Table 12 presents the results of the comparison between CodeClassPrompt
and all baselines. We observe that pre-trained model-based methods achieve better performance than machine
learning and classical neural networks, with clear advantages.
Our CodeClassPrompt approach demonstrates comparable performance to all baseline methods across the
four evaluation metrics, namely accuracy, precision, recall, and F1-score. Specifically, the achieved values for
accuracy, precision, recall, and F1-score are 87.906% (with a maximum value of 88.024%), 88.104% (with a
maximum value of 88.232%), 87.980% (with a maximum value of 88.091%), and 88.030% (with a maximum
value of 88.149%) respectively. Importantly, each maximum value surpasses the reported values of the baseline
models. These findings serve as compelling evidence that the integration of knowledge features and attention
mechanisms significantly enhances the performance of source code-related classification tasks. Notably, the
utilization of knowledge features inspired by prompts has proven to be effective in capturing the salient features
of the input information more accurately.
Code Smell Classification. The comparative results between CodeClassPrompt and all baseline methods
are presented in Table 13. Among the evaluated methods, the pre-trained model-based approach exhibited the
highest performance. However, our CodeClassPrompt method outperformed all other methods across all met-
rics, achieving the highest accuracy, precision, recall, and F1-score values of 86.171%, 86.329%, 85.862%, and
86.006% respectively. The empirical results demonstrate that CodeClassPrompt is capable of achieving superior
performance on the Code Smell dataset compared to the state-of-the-art baseline approaches, all while requiring
significantly lower computational resources. This can be attributed to the superior feature extraction capabilities
of the CodeClassPrompt method in the domain of programming language processing.
Code Comment Classification. Table 14 presents the results obtained from the comparative analysis
between CodeClassPrompt and all the baseline models. The table reveals that CodeClassPrompt has achieved
results that are comparable to those of the baselines. The metrics used for evaluation, namely accuracy, precision,
recall, and F1-score, indicate values of 95.220%, 89.193%, 86.938%, and 87.770% respectively. Moreover, Code-
ClassPrompt has achieved an impressive maximum accuracy value of 95.416%. These findings serve as evidence
of CodeClassPrompt’s commendable performance in the realm of code comment classification, underscoring
its promising capability in extracting features from natural language-based content. It is worth noting that all of
these results were obtained with significantly lower computational resources when compared to EL-CodeBERT.


Technical Debt Classification. Table 15 presents the outcomes of the comparative analysis conducted
between CodeClassPrompt and all the baseline approaches. In terms of the evaluation metrics, CodeClassPrompt
has demonstrated comparable results, achieving accuracy, precision, recall, and F1-score values of 97.811%,
93.954%, 92.118%, and 93.011% respectively. The table reveals that all the investigated approaches, ranging
from machine learning methods to pre-trained model-based techniques, have yielded similar accuracy levels.
Summary for RQ1: The results demonstrate that CodeClassPrompt exhibits a clear superiority in perfor-
mance on the code smell classification task, while also achieving comparable levels of performance to the state-
of-the-art baseline approaches across three additional evaluated tasks. These findings underscore the operational
efficacy and inherent strengths of the CodeClassPrompt methodology. Specifically, this approach distinguishes
itself in tasks associated with programming language processing, such as code language classification and code
smell identification, offering unique benefits. Importantly, these findings emphasize that CodeClassPrompt not
only matches the performance of previous state-of-the-art baselines but does so with reduced computational
costs. This efficiency can be primarily attributed to CodeClassPrompt’s superior capability in extracting features
from both programming and natural languages, eliminating the need for additional neural network layers to
enhance its feature extraction process.
To substantiate our findings, we have performed a paired T-test comparing our results with those obtained
from the EL-CodeBERT experiments. The results are presented in Table 16. It can be observed that, for the
Code Language and Technical Debt datasets, there is no discernible distinction between CodeClassPrompt
and EL-CodeBERT. However, for the Code Smell and Code Comment datasets, notable differences exist. Refer-
ring to Table 13, it is evident that CodeClassPrompt outperforms EL-CodeBERT in all metrics pertaining to
code smell classification. Moreover, as shown in Table 14, CodeClassPrompt demonstrates a slightly superior
Accuracy metric in comparison to EL-CodeBERT for the code comment classification task.
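For readers who wish to reproduce this significance check, the following is a minimal sketch of how such a paired T-test can be computed with SciPy; the per-run score lists are hypothetical placeholders rather than the values reported above.

from scipy import stats

# Hypothetical per-run accuracies of the two models on the same runs (placeholders,
# not the values reported in this paper); a paired test matches the runs one by one.
codeclassprompt_acc = [97.81, 97.84, 97.77, 97.90, 97.76]
el_codebert_acc = [97.79, 97.80, 97.75, 97.86, 97.74]

t_stat, p_value = stats.ttest_rel(codeclassprompt_acc, el_codebert_acc)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")  # p < 0.05 indicates a significant difference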

Result analysis for RQ2


RQ2 aims to investigate the contribution of the attention and prompt components in our proposed approach.
To demonstrate their contribution, we conducted extensive ablation experiments on four source code-related
tasks, keeping the same experimental setup except for replacing the attention mechanism with a fully connected
network or removing the prompt templates. Table 17 presents a comparison of the experimental results. From
the table, we observed that replacing the attention mechanism with a fully connected neural network results in
worse performance compared to using the attention mechanism on all four tasks, particularly on programming

Dataset          ACC      P        R        F1
Code Language    0.07676  0.57343  0.06107  0.10552
Code Smell       0.00008  0.00007  0.00011  0.00011
Code Comment     0.00222  0.93387  0.08265  0.12453
Technical Debt   0.06752  0.55778  0.15295  0.06117

Table 16.  Results of the T-test conducted between CodeClassPrompt and EL-CodeBERT. The numerical value
assigned to each item represents its respective p-value. The item value flagged in bold indicates that its p-value
is less than 0.05.

Task            Method                   ACC (%)  P (%)    R (%)    F1 (%)
Code Language*  CodeClassPrompt          88.024   88.232   88.091   88.149
                w/o Attention            87.828   88.010   87.926   87.958
                w/o Attention & Prompt   87.418   88.042   87.450   87.614
Code Smell*     CodeClassPrompt          86.000   86.167   85.657   85.824
                w/o Attention            82.286   82.407   81.900   82.051
                w/o Attention & Prompt   85.429   85.516   85.128   85.264
Code Comment    CodeClassPrompt          95.416   89.654   87.036   87.930
                w/o Attention            95.016   92.360   86.328   87.797
                w/o Attention & Prompt   94.838   87.916   86.301   86.820
Technical Debt  CodeClassPrompt          97.895   94.330   92.257   93.263
                w/o Attention            97.745   92.677   93.337   93.004
                w/o Attention & Prompt   97.835   94.197   91.991   93.059

Table 17.  Results of the ablation study. “w/o Attention” refers to the absence of the attention mechanism. The
task marked with an asterisk (*) is a programming language task, while others are natural language tasks. “w/o
Attention & Prompt” denotes the absence of both the attention mechanism and prompt templates, equivalent to
a classifier based on fully connected feed-forward networks and CodeBERT. The number in bold represents the
optimal value for each metric on a task.


language tasks, such as code language classification and code smell classification. For the code comment classification
and technical debt classification tasks, some individual evaluation metrics show a slight increase, except for accu-
racy. After removing both the attention and prompt components, the model degenerates into a classifier based
on fully connected feed-forward networks and CodeBERT, and its performance on all metrics is worse than that
of CodeClassPrompt.
Comparing the last two rows in each task, we observed that utilizing the prompt template alone (denoted as
“w/o Attention”) does not improve performance in code smell classification (the Accuracy value drops from 85.429%
to 82.286%), and it does not provide a clear advantage in the other tasks. However, combin-
ing the prompt template with the attention mechanism can significantly enhance performance on all tasks. The
results of our proposed CodeClassPrompt approach strongly demonstrate its effectiveness.
Comparing code language classification with code smell classification, we observed that the attention mecha-
nism has a greater advantage over fully connected feed-forward networks (denoted as “w/o Attention & Prompt”)
in code smell classification. The reason for this difference is that code smell requires the detection of semantic
differences between code snippets of the same type, while code language classification involves detecting lin-
guistic features of different programming languages. From the results, it appears that semantic features can be
detected more easily than linguistic features.
Summary for RQ2: Our proposed CodeClassPrompt approach, which incorporates prompt templates and
the attention mechanism to avoid the extra computation costs associated with additional neural network lay-
ers, significantly improves performance on the four source code-related tasks compared to fully connected
feed-forward networks. Both the attention and prompt components are effective and indispensable for achieving
the best performance.
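To make the ablated components concrete, the following is a minimal sketch of how an attention-based aggregation over per-layer knowledge features can be written in PyTorch; the class name, dimensions, and learned-query formulation are illustrative assumptions rather than the exact implementation used in this work.

import torch
import torch.nn as nn

class LayerAttentionPooling(nn.Module):
    # Aggregate per-layer knowledge features into a single task-specific vector.
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # A learned query scores how useful each layer is for the task at hand.
        self.query = nn.Parameter(torch.randn(hidden_size))

    def forward(self, layer_feats: torch.Tensor):
        # layer_feats: (batch, num_layers, hidden_size), one vector per selected layer.
        scores = layer_feats @ self.query                            # (batch, num_layers)
        weights = torch.softmax(scores, dim=-1)                      # attention over layers
        pooled = (weights.unsqueeze(-1) * layer_feats).sum(dim=1)    # weighted sum
        return pooled, weights

In the “w/o Attention” ablation, this pooling would simply be replaced by a fully connected layer over the layer features, and in “w/o Attention & Prompt” the prompt template would additionally be dropped.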

Result analysis for RQ3


RQ3 aims to investigate the layers with decisive attention values that each task focuses on. We selected two exam-
ples from each task to examine the attention values of each layer. The results are presented in Figs. 2, 3, 4, and 5.
Code Language Classification. As shown in Fig. 2, the concentration of attention in code language clas-
sification tasks is mainly on the last two layers of knowledge, particularly the final layer. For instance, in the case
of id 144, the attention value focused on the last layer is 99.37%, while that on layer 11 is only 0.42%. Similarly,
for id 300, attention on the last layer is 86.90%, whereas that on layer 11 is 12.60%. According to67, the last layer
represents the highest level of knowledge regarding the input. As the code language classification task aims to
distinguish the type of programming languages, the highest level of knowledge contains sufficient identifying
features to accomplish this task.

Figure 2.  Attention values at each layer for code language classification.

Figure 3.  Attention values at each layer for code smell classification.

Figure 4.  Attention values at each layer for code comment classification.

Figure 5.  Attention values at each layer for technical debt classification.
Code Smell Classification. As shown in Fig. 3, code smell classification tasks focus on each layer of knowl-
edge derived from a pre-trained model, but there is a higher concentration on the last few layers. For example,
in the case of id 144, the attention values are 25.87% on the final layer and 20.13% on layer 11. The objective
of code smell classification is to detect coding style and semantic information, and it relies on the features of
knowledge that range from low to high levels.
Code Comment Classification. As illustrated in Fig. 4, this task primarily focuses on the last four layers
of knowledge, with a stronger emphasis on the last three layers. For instance, for id 144, the attention values for
layers 12, 11, and 10 are 54.84%, 24.55%, and 14.44%, respectively. The last layer, which contains the highest-level
knowledge, is particularly crucial for this task. Code comment classification is a natural language processing task
that places a significant emphasis on differences in semantic information.
Technical Debt Classification. As shown in Fig. 5, technical debt classification focuses more on the middle
layers of knowledge rather than the last layers. For instance, in the case of id 144, it concentrates more on layers
7, 8, 9, 10, and 11, while for the last layer, the attention value is only 0.51%. A similar trend can be observed for
id 300. This suggests that the highest layer of knowledge contributes less to technical debt classification than
other layers. This task is a binary classification task, similar to sentiment classification, and does not involve
concrete semantic information.
Summary for RQ3: Various source code-related tasks demonstrate different degrees of attention on each
layer of output knowledge from the CodeBERT model. This indicates that different levels of knowledge hold
distinct meanings for different tasks, highlighting the effectiveness of the attention mechanism for source code-
related tasks.
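As an illustration, the per-layer attention values plotted in Figs. 2, 3, 4, and 5 can be read directly from the weights returned by a pooling module such as the hypothetical LayerAttentionPooling sketched above; the dummy input and the assumption that layers 2 through 12 were selected are for illustration only.

import torch  # assumes LayerAttentionPooling from the earlier sketch is in scope

layer_feats = torch.randn(1, 11, 768)        # dummy features for layers 2-12 of one example
pooled, weights = LayerAttentionPooling()(layer_feats)
for layer_idx, w in zip(range(2, 13), weights[0].tolist()):
    print(f"layer {layer_idx}: {w:.2%}")     # e.g. most of the mass on layer 12 for code language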

Result analysis for RQ4


The hidden state output of the BERT-based model consists of 13 layers (from layer 0 to layer 12), where layer 0
represents the embedding vector of the input information, and the other layers contain hierarchical linguistic
features17. The primary objective of our experiments is to examine the influence of different layers of knowledge
features on classification tasks related to source code. We aim to identify the most suitable layers that can serve
as candidate layers for the attention mechanism, allowing us to aggregate crucial information effectively. To this
end, we collected 13 sets of knowledge features for each task, with the first set ranging from layer 0 through layer
12, the second set from layer 1 through layer 12, and so on.
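As a concrete illustration, the sketch below shows one way to obtain all 13 layers of hidden states from CodeBERT with the HuggingFace transformers library and to slice out a candidate set such as layers 2 through 12; mean-pooling each layer's token vectors into a single feature vector is a simplifying assumption made only for this sketch.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base").eval()

inputs = tokenizer("int add(int a, int b) { return a + b; }", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple of 13 tensors: the embedding layer (layer 0) plus the
# 12 Transformer layers, each of shape (batch, seq_len, hidden_size).
all_layers = torch.stack(outputs.hidden_states, dim=1)    # (batch, 13, seq_len, hidden)
subset = all_layers[:, 2:, :, :]                          # the set from layer 2 through layer 12
layer_feats = subset.mean(dim=2)                          # (batch, 11, hidden): one vector per layer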


The four tasks can be grouped into two categories: programming language-based tasks (code language clas-
sification and code smell classification) and natural language-based tasks (code comment classification and
technical debt classification). The experimental results for four metrics, namely Accuracy, Precision, Recall, and
F1-score, are showcased in Figs. 6, 7, 8, and 9 correspondingly.
From these figures, it is observed that the CodeClassPrompt approach consistently achieves the best results
on the third set of knowledge features (from layer 2 through layer 12) across all evaluation metrics for program-
ming language-based tasks. Since the attention mechanism is designed to select the most important features, the
set yielding the best results is regarded as the most informative. For natural language processing tasks, our CodeClassPrompt approach
on this set of features (from layer 2 through layer 12) still obtained better results than the baselines. These results
demonstrate that the attention mechanism on knowledge features from layer 2 through layer 12 is better able to
represent features on the programming language-based tasks. To ensure consistency across the four source code-
related tasks, we have deliberately selected a fixed set of knowledge features, encompassing layers 2 through 12,
as input candidates for the attention mechanism. By employing this specific range of layers, we aim to maintain
uniformity in our approach across all tasks.
Summary for RQ4: Our CodeClassPrompt approach achieves strong performance on both programming
language-based tasks and natural language-based tasks, with particularly high stability on programming
language-based source code-related tasks.

Result analysis for RQ5


The computational efficiency experiment aimed to demonstrate the impact of additional neural network layers on
a CodeBERT-based classification pipeline, including parameter quantity, computational cost, and time savings (the
computation costs and parameter numbers were evaluated using the Python library “thop”). Previous studies18
have utilized additional BiLSTM layers to extract more effective features and achieve notable performance with
the CodeBERT-based pipeline. We used identical hardware and software configurations, including the GPU, CPU,
memory, and software versions, and employed the same hyperparameters, such as the maximum sequence
length. The computation cost of a specific BERT family model is solely dependent on the maximum length of
the input content. For input content with a shorter length than the maximum, padding is applied to extend it
to the specified maximum length. Conversely, if the input content exceeds the maximum length, truncation is
utilized to ensure adherence to the predefined maximum length.

Figure 6.  Results for the metric Accuracy.

Figure 7.  Results for the metric Precision.

We selected an example from the dataset of code
language classification and conducted ten groups of experiments, each consisting of 1,000 inference repetitions
on the example. The resulting average time consumed by the ten groups is reported in Table 18, where “Total”
denotes the time consumption of the entire pipeline, and “LSTM” represents the time cost of the BiLSTM layer in
the pipeline. The results show that the BiLSTM layer consumes 11.18% of the time during the entire pipeline. To
assess the reasons behind the time-saving, we have analyzed the discrepancies in parameter numbers and com-
putational costs between CodeClassPrompt and CodeClassPrompt with additional BiLSTM layers. The results are
presented in Table 19. From the table, it is evident that the number of parameters is reduced by 22.57%, while the
computational costs are reduced by 1.32%. By reducing the number of parameters, CodeClassPrompt effectively
reduces memory access, leading to a significant decrease in the time required for transferring parameters from
the CPU memory to the GPU memory. Additionally, the reduced computational effort further contributes to
time savings. Collectively, these factors make CodeClassPrompt significantly more time-efficient.
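For reference, a measurement of this kind can be sketched as follows with the “thop” library and a simple timing loop; the backbone-only model, the example, and the loop size are illustrative, and the full pipeline additionally contains the prompt template and classification head.

import time
import torch
from thop import profile
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base").eval()

# Pad (or truncate) one example to the fixed maximum sequence length of 256.
inputs = tokenizer("print('hello world')", padding="max_length",
                   truncation=True, max_length=256, return_tensors="pt")

# Parameter count and computational cost of the backbone.
flops, params = profile(model, inputs=(inputs["input_ids"], inputs["attention_mask"]))
print(f"params: {params / 1e6:.2f} M, cost: {flops / 1e9:.2f} GFLOPs")

# One group of 1,000 inference repetitions on the same example.
start = time.time()
with torch.no_grad():
    for _ in range(1000):
        model(**inputs)
print(f"elapsed: {(time.time() - start) * 1000:.2f} ms for 1,000 runs")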
Summary for RQ5: The experiments indicate that by leveraging powerful feature extraction capabilities,
the removal of an additional neural network layer eliminates redundant parameters and computational costs,
thereby substantially reducing computation time.

Discussion
Error analysis
The following analysis examines the classification errors observed on the datasets that focus directly on source
code. These datasets cover two distinct tasks: code language classification and code smell classification.

Code language classification


Figure 8.  Results for the metric Recall.

By conducting a comprehensive investigation of error examples, we find that they can be classified into seven
distinct groups. The detailed categories and corresponding samples are provided in Table 20. The following part
presents an in-depth analysis of the seven identified groups:

• Short
  The content provided is insufficient in length to accurately determine its class information.
• Pseudo
  The information provided appears to resemble pseudo-code, making it difficult to detect its specific class
or category.
• Non-code
  The provided content clearly does not resemble any form of source code or pseudo-code.
• No-feature
  The examined source code displays characteristics that align with multiple code languages. Specifically, the
analyzed code snippet lacks the distinctive language-specific features associated with a single programming
language.
• ErrorClass
  Despite the presence of evident features in the code snippet, the classifier mistakenly assigns it to an incor-
rect class.
• LikeButNone
  Although the content bears resemblance to a certain type of source code, it is, in fact, not a valid repre-
sentation of source code.
• Mix
  The given code snippet exhibits characteristics of multiple programming languages, leading to a mixed
representation. As a consequence, the classifier may encounter difficulties in accurately categorizing the
content.

After conducting a comprehensive analysis of all the incorrect cases, it has been observed that the “Non-code”
type of snippets constitutes almost half of the total, closely followed by the “Short” type.


Figure 9.  Results for the metric F1-score.

Task           Max Length  Repeat Times  Total (ms)  LSTM (ms)  Percentage (%)
Code Language  256         10*1000       6430.27     719.15     11.18

Table 18.  Time consumption of an additional LSTM layer in CodeBERT based pipeline.

Model                  Parameters (M)  Comp Costs (GFLOPs)  Reduced Parameters (%)  Reduced Comp Costs (%)
CodeClassPrompt LSTM   109.87          22.05                0.0                     0.0
CodeClassPrompt        85.07           21.76                22.57                   1.32

Table 19.  Computation costs and parameters of the CodeBERT-based pipeline. “Comp Costs” refers to
computational costs. “M” denotes one million (1,000,000). “GFLOPs” denotes one billion (1,000,000,000)
floating-point operations. The maximum sequence length is set to 256. “CodeClassPrompt LSTM” refers to
CodeClassPrompt with BiLSTM layers.

Code smell classification


Code smell classification is a binary classification task, where the errors can be categorized into three distinct
types. The detailed breakdown of these types is provided in Table 21. Next, we will provide detailed descriptions
of these types.

• Short
  The provided code snippet is of insufficient length to reliably determine its class information accurately.


Category     Examples
Short        BitSet ++=
             retUnique()
Pseudo       <if>abc <else>xyz <if> <else> <if> <else> abc xyz
             |Type Object pointer| | Sync Block | | Instance fields...| | Instance fields...|
Non-code     www.domain.com/first/second/last/ last www.domain.com/last/ www.domain.com/first/second/third/fourth/last ...
             abc1 abc2 abc3 abc1 ok abc2 ok abc3 ok
No-feature   f a b = ((a+b) == 2) && ((a*b) == 2) &&
             $x = $hashblah || ’default’
ErrorClass   std::string mystr=“MY-PC” bSuccess = SetComputerNameA(mystr.c_str()); if( bSuccess == 0 ) printf(“Unable to change computer name ...
             <?php $this->dojo()->setLocalPath($this->baseUrl().’/javascript/dojo/dojo.js’) ...
LikeButNone  <Return Address>
             <function appendNextFib at 0x01FB14B0>
Mix          http://server/base/feeds/documents?bq=[type in ’news’] bq=[type = ’news’] -> return [“news”] bq=[type in ’news’] -> return [“news”] ...
             < script type=“text/javascript” src=“js/jquery.query-2.1.6.js”> </script> <? $next_exp = 123; ?> $(document).ready(function() ...

Table 20.  Examples of error classification in code language.

Category     Examples
Short        final int [ ] stack ;
             private String xtends ;
Bad-label    <comment> private String <w> non Proxy Hosts </w>
Error-class  <comment> private int <w> skipped Positions </w> ;

Table 21.  Examples of error classification in code smell.

• Bad-label
  Despite the dataset assigning an incorrect label to the given code snippet, the classifier successfully identi-
fied its correct type.
• Error-class
  The trained classifier incorrectly assigns an erroneous label to the code snippet.

After conducting a comprehensive analysis of all the erroneous cases, it has been observed that the “Short” type
of snippets constitutes more than one-quarter of the total. Assessing the quality or undesirability of a snippet
that is excessively brief presents considerable difficulty.
Based on the aforementioned discussion, it is evident that the main source of misclassification stems from the
suboptimal quality of the datasets, largely caused by the presence of noisy data.

Further investigation
This work aims to enhance computational efficiency by leveraging the prompt-learning paradigm. The effective-
ness of manually defined templates has been demonstrated in source code-related classification tasks. However,
it is worth examining whether automatically constructed templates can outperform manual prompt templates in
source code-related tasks. Furthermore, while zero-shot based prompt learning has proven effective in natural
language processing tasks, its applicability to source code-related tasks, particularly those involving program-
ming languages, requires further investigation.
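As a point of reference for such a study, a manually defined template of the kind used here can be declared in a few lines with the OpenPrompt library66; the template text, label words, and checkpoint below are illustrative assumptions rather than the exact configuration of our experiments.

from openprompt.plms import load_plm
from openprompt.prompts import ManualTemplate, ManualVerbalizer
from openprompt import PromptForClassification

# CodeBERT is RoBERTa-based, so it is loaded through the "roberta" wrapper.
plm, tokenizer, model_config, WrapperClass = load_plm("roberta", "microsoft/codebert-base")

# A hand-written (manual) template; an automatically constructed template would replace this text.
template = ManualTemplate(
    tokenizer=tokenizer,
    text='{"placeholder":"text_a"} The language of this code is {"mask"} .',
)
verbalizer = ManualVerbalizer(
    tokenizer=tokenizer,
    classes=["java", "python"],                       # illustrative label set
    label_words={"java": ["java"], "python": ["python"]},
)
model = PromptForClassification(plm=plm, template=template, verbalizer=verbalizer)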

Threats to validity
In this section, we will focus on discussing potential threats to the validity of our empirical study.

Internal threats
The internal validity threats to our research are predominantly associated with the experimental milieu. The
initial concern pertains to the selection of hardware and software platforms, which may influence the reliability of
our method’s execution. To ameliorate this issue, we ensured a consistent experimental framework by employing
uniform hardware configurations and opting for established software versions, including PyTorch and the Linux
operating system. The second threat is rooted in the execution of baselines, which we addressed by sourcing
code from reputable, well-established libraries. Finally, the third threat involves the stochastic nature of deep
learning model initializations. To guarantee the reproducibility of our findings, we utilized fixed random seeds
across all experimental trials.


External threats
The main external threat lies in the choice of datasets used for the four downstream tasks related to source code.
To mitigate this threat, we have opted to use publicly available corpora. Specifically, for code language classifica-
tion, we draw upon the dataset provided by Alrashedy et al.3. For code smell classification, we utilize the dataset
from Fakhoury et al.58. For code comment classification, we rely on the dataset prepared by Pascarella et al.60.
Lastly, for technical debt classification, we employ the dataset introduced by Maldonado et al.58,61.

Construct threats
The primary constructive threat we address in this study is the selection of appropriate evaluation metrics for
assessing performance on source code-related tasks. To ensure a fair and comprehensive comparison, we have
selected four widely-used metrics (accuracy, precision, recall, and F1-Score) that have been extensively employed
in previous studies, such as the work by Alrashedy et al.3.
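For completeness, these four metrics can be computed as sketched below with scikit-learn; the weighted averaging scheme and the label arrays shown are placeholder assumptions for illustration.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 1, 0]   # placeholder gold labels
y_pred = [0, 1, 2, 1, 1, 0]   # placeholder predictions

acc = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(f"ACC={acc:.4f}  P={precision:.4f}  R={recall:.4f}  F1={f1:.4f}")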

Conclusion
In this study, we introduced CodeClassPrompt, a novel approach that harnesses relevant knowledge extracted
from a pre-trained model to improve source code-related classification. Through empirical analysis, we dem-
onstrated the effectiveness of CodeClassPrompt, achieving enhanced computational efficiency while yielding
comparable results to previous studies. CodeClassPrompt consolidates multi-layer knowledge into a unified input
representation, eliminating the need for additional neural layers and thereby reducing computational costs. This
efficiency is further supported by computational cost experiments, while ablation studies validate the utility of
the multi-layer attention mechanism in combination with the prompt learning paradigm. Additionally, attention
analysis provides insights into the distinct contributions of each layer of the output from pre-trained language
models (PLMs) to the performance of various tasks.

Data availability
The data supporting the findings of this study are openly available in the GitHub repository at https://github.com/BITENGD/codeclassprompt.

Received: 18 October 2023; Accepted: 5 August 2024

References
1. Khasnabish, J. N., Sodhi, M., Deshmukh, J. & Srinivasaraghavan, G. Detecting programming language from source code using
Bayesian learning techniques. In Machine Learning and Data Mining in Pattern Recognition. Lecture Notes in Computer Science
(ed. Perner, P.) 513–522 (Springer, Cham, 2014).
2. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
3. Alrashedy, K., Dharmaretnam, D., German, D. M., Srinivasan, V. & Aaron Gulliver, T. SCC++: Predicting the programming
language of questions and snippets of stack overflow. J. Syst. Softw. 162, 110505 (2020).
4. Gilda, S. Source code classification using neural networks. In 2017 14th International Joint Conference on Computer Science and
Software Engineering (JCSSE), 1–6 (2017).
5. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understand-
ing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (Association for Computational Linguistics, Minneapolis,
Minnesota, 2019).
6. Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:​1907.​11692 (2019).
7. Radford, A., Narasimhan, K., Salimans, T. & Sutskever, I. Improving language understanding by generative pre-training. OPENAI
blog (2018).
8. Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 1–67 (2020).
9. Qiu, X. et al. Pre-trained models for natural language processing: A survey. Sci. China Technol. Sci. 63, 1872–1897 (2020).
10. Feng, Z. et al. CodeBERT: A pre-trained model for programming and natural languages. arXiv:​2002.​08155 (2020).
11. Wang, Y., Wang, W., Joty, S. & Hoi, S. C. H. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code under-
standing and generation. arXiv:​2109.​00859 (2021).
12. Guo, D. et al. GraphCodeBERT: Pre-training code representations with data flow. arXiv:​2009.​08366 (2021).
13. Kwon, S., Jang, J.-I., Lee, S., Ryu, D. & Baik, J. CodeBERT based software defect prediction for edge-cloud systems. In Agapito, G.
et al. (eds.) Current Trends in Web Engineering, Communications in Computer and Information Science, 11–21 (Springer, Cham,
2023).
14. Kanade, A., Maniatis, P., Balakrishnan, G. & Shi, K. Learning and evaluating contextual embedding of source code. In Proceedings
of the 37th International Conference on Machine Learning, 5110–5121 (PMLR, 2020).
15. Choi, H., Kim, J., Joe, S. & Gwon, Y. Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP
tasks. In 2020 25th International Conference on Pattern Recognition (ICPR), 5482–5487 (2021).
16. Goldberg, Y. Assessing BERT’s syntactic abilities. arXiv:​1901.​05287 (2019).
17. Jawahar, G., Sagot, B. & Seddah, D. What does BERT learn about the structure of language? In Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics, 3651–3657 (Association for Computational Linguistics, Florence, Italy,
2019).
18. Liu, K., Yang, G., Chen, X. & Zhou, Y. EL-CodeBert: Better exploiting CodeBert to support source code-related classification
tasks. In Proceedings of the 13th Asia-Pacific Symposium on Internetware, Internetware ’22, 147–155 (Association for Computing
Machinery, New York, NY, USA, 2022).
19. Choi, H., Kim, J., Joe, S. & Gwon, Y. Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP
tasks. In 2020 25th International Conference on Pattern Recognition (ICPR), 5482–5487. https://doi.org/10.1109/ICPR48806.2021.9412102 (2021).
20. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735 (1997).
21. Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, vol. 33, 1877–1901
(Curran Associates, Inc., 2020).


22. Liu, P. et al. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv:2107.13586 (2021).
23. Vaswani, A. et al. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing
Systems, NIPS’17, 6000–6010 (Curran Associates Inc., Red Hook, NY, USA, 2017).
24. Allamanis, M., Barr, E. T., Devanbu, P. & Sutton, C. A survey of machine learning for big code and naturalness. ACM Comput.
Surv. 51, 81:1-81:37 (2018).
25. Nguyen, A. T., Nguyen, T. D., Phan, H. D. & Nguyen, T. N. A deep neural network language model with contexts for source code.
In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 323–334 (2018).
26. Harer, J. et al. Automated software vulnerability detection with machine learning. ArXiv (2018).
27. Le, Q. & Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference
on International Conference on Machine Learning—Volume 32, ICML’14, II–1188–II–1196 (JMLR.org, Beijing, China, 2014).
28. DeFreez, D., Thakur, A.V. & Rubio-González, C. Path-based function embedding and its application to error-handling specifica-
tion mining. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on
the Foundations of Software Engineering, ESEC/FSE 2018, 423–433 (Association for Computing Machinery, New York, NY, USA,
2018).
29. Alon, U., Zilberstein, M., Levy, O. & Yahav, E. Code2vec: Learning distributed representations of code. Proc. ACM Program. Lang.
3, 1–29 (2019).
30. Zhang, J. et al. A novel neural source code representation based on abstract syntax tree. In 2019 IEEE/ACM 41st International
Conference on Software Engineering (ICSE), 783–794 (2019).
31. Hu, X. et al. Summarizing source code with transferred API knowledge. In Proceedings of the Twenty-Seventh International Joint
Conference on Artificial Intelligence, IJCAI-18, 2269–2275 (International Joint Conferences on Artificial Intelligence Organization,
2018).
32. Yang, G., Zhou, Y., Chen, X. & Yu, C. Fine-grained Pseudo-code generation method via code feature extraction and transformer.
In 2021 28th Asia-Pacific Software Engineering Conference (APSEC), 213–222 (2021).
33. Jain, P. et al. Contrastive code representation learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing, 5954–5971 (Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021).
34. Yang, G. DeepSCC: Source code classification based on fine-tuned RoBERTa (S). In The 33rd International Conference on Software
Engineering and Knowledge Engineering, 499–502 (2021).
35. Shinyama, Y., Arahori, Y. & Gondow, K. Analyzing code comments to boost program comprehension. In 2018 25th Asia-Pacific
Software Engineering Conference (APSEC), 325–334 (IEEE, Nara, Japan, 2018).
36. Rani, P., Panichella, S., Leuenberger, M., Di Sorbo, A. & Nierstrasz, O. How to identify class comment types? A multi-language
approach for class comment classification. J. Syst. Softw. 181, 111047 (2021).
37. Chen, Q., Xia, X., Hu, H., Lo, D. & Li, S. Why my code summarization model does not work: Code comment improvement with
category prediction. ACM Trans. Softw. Eng. Methodol. 30, 25:1-25:29 (2021).
38. Potdar, A. & Shihab, E. An exploratory study on self-admitted technical debt. In 2014 IEEE International Conference on Software
Maintenance and Evolution, 91–100 (IEEE, Victoria, BC, Canada, 2014).
39. Brown, N. et al. Managing technical debt in software-reliant systems. In Proceedings of the FSE/SDP Workshop on Future of Software
Engineering Research, FoSER ’10, 47–52 (Association for Computing Machinery, New York, NY, USA, 2010).
40. Wehaibi, S., Shihab, E. & Guerrouj, L. Examining the impact of self-admitted technical debt on software quality. In 2016 IEEE 23rd
International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1, 179–188 (2016).
41. Huang, Q., Shihab, E., Xia, X., Lo, D. & Li, S. Identifying self-admitted technical debt in open source projects using text mining.
Empir. Softw. Eng. 23, 418–451 (2018).
42. Ren, X. et al. Neural network-based detection of self-admitted technical debt: From performance to explainability. ACM Trans.
Softw. Eng. Methodol. 28, 15:1-15:45 (2019).
43. Wang, X. et al. Detecting and explaining self-admitted technical debts with attention-based neural networks. In Proceedings of
the 35th IEEE/ACM International Conference on Automated Software Engineering, ASE ’20, 871–882 (Association for Computing
Machinery, New York, NY, USA, 2021).
44. Fowler, M. Refactoring (Addison-Wesley Professional, Berlin, 2018).
45. Arcelli Fontana, F. & Zanoni, M. Code smell severity classification using machine learning techniques. Knowl.-Based Syst. 128,
43–58 (2017).
46. Arcelli Fontana, F., Mäntylä, M. V., Zanoni, M. & Marino, A. Comparing and experimenting machine learning techniques for code
smell detection. Empir. Softw. Eng. 21, 1143–1191 (2016).
47. Das, A. K., Yadav, S. & Dhal, S. Detecting code smells using deep learning. In TENCON 2019—2019 IEEE Region 10 Conference
(TENCON), 2081–2086 (2019).
48. Liu, H. et al. Deep learning based code smell detection. IEEE Trans. Softw. Eng. 47, 1811–1837 (2021).
49. Sharma, T., Efstathiou, V., Louridas, P. & Spinellis, D. Code smell detection by deep direct-learning and transfer-learning. J. Syst.
Softw. 176, 110936 (2021).
50. Li, Y. & Zhang, X. Multi-label code smell detection with hybrid model based on deep learning. In The 34th International Conference
on Software Engineering and Knowledge Engineering, 42–47 (2022).
51. Sun, C., Qiu, X., Xu, Y. & Huang, X. How to fine-tune Bert for text classification? In Chinese Computational Linguistics. Lecture
Notes in Computer Science (eds Sun, M. et al.) 194–206 (Springer, Cham, 2019).
52. Radford, A. et al. Language models are unsupervised multitask learners. OPENAI blog (2019).
53. Lewis, M. et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and compre-
hension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880 (Association for
Computational Linguistics, Online, 2020).
54. Perez, E., Kiela, D. & Cho, K. True few-shot learning with language models. In Advances in Neural Information Processing Systems,
vol. 34, 11054–11070 (Curran Associates, Inc., 2021).
55. Schick, T. & Schütze, H. Few-shot text generation with pattern-exploiting training. arXiv:​2012.​11926 (2021).
56. Jiang, Z., Xu, F. F., Araki, J. & Neubig, G. How can we know what language models know?. Trans. Assoc. Comput. Linguist. 8,
423–438 (2020).
57. Li, X. L. & Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of
the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume
1: Long Papers), 4582–4597 (Association for Computational Linguistics, Online, 2021).
58. Fakhoury, S., Arnaoudova, V., Noiseux, C., Khomh, F. & Antoniol, G. Keep it simple: Is deep learning good for linguistic smell
detection? In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 602–611 (2018).
59. Arnaoudova, V., Di Penta, M. & Antoniol, G. Linguistic antipatterns: What they are and how developers perceive them. Empir.
Softw. Eng. 21, 104–158 (2016).
60. Pascarella, L. & Bacchelli, A. Classifying code comments in Java open-source software systems. In Proceedings of the 14th Interna-
tional Conference on Mining Software Repositories, MSR ’17, 227–237 (IEEE Press, Buenos Aires, Argentina, 2017).
61. Maldonado, E. d S., Shihab, E. & Tsantalis, N. Using natural language processing to automatically detect self-admitted technical
debt. IEEE Trans. Softw. Eng. 43, 1044–1062 (2017).


62. Sharma, T. et al. A survey on machine learning techniques for source code analysis. arXiv:​2110.​09610 (2022).
63. Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining, KDD ’16, 785–794 (Association for Computing Machinery, New York, NY, USA,
2016).
64. Kim, Y. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in
Natural Language Processing (EMNLP), 1746–1751 (Association for Computational Linguistics, Doha, Qatar, 2014).
65. Zhou, P. et al. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 207–212 (Association for Computational
Linguistics, Berlin, Germany, 2016).
66. Ding, N. et al. OpenPrompt: An open-source framework for prompt-learning. arXiv:​2111.​01998 (2021).
67. Conneau, A., Kruszewski, G., Lample, G., Barrault, L. & Baroni, M. What you can cram into a single $&!#* vector: Probing
sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), 2126–2136 (Association for Computational Linguistics, Melbourne, Australia, 2018).

Acknowledgements
The work was supported by the 242nd National Information Security Project (No. 2020A065).

Author contributions
Y.M. developed the study concept. Y.M. and Y.S. conceived the experiment(s), Y.M. and Y.S. conducted the
experiment(s), Z.L., S.L. and Y.S. analysed the results. The final manuscript was written by Y.M. and Y.Z. All authors
reviewed the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Correspondence and requests for materials should be addressed to Y.-M.S.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives
4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in
any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide
a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have
permission under this licence to share adapted material derived from this article or parts of it. The images or
other third party material in this article are included in the article’s Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and
your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain
permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

© The Author(s) 2024

