(2023) An Empirical Comparison of Pre-Trained Models of Source Code
Abstract—While a large number of pre-trained models of source code have been successfully developed and applied to a variety of software engineering (SE) tasks in recent years, our understanding of these pre-trained models is arguably fairly limited. With the goal of advancing our understanding of these models, we perform the first systematic empirical comparison of 19 recently-developed pre-trained models of source code on 13 SE tasks. To gain additional insights into these models, we adopt a recently-developed 4-dimensional categorization of pre-trained models, and subsequently investigate whether there are correlations between different categories of pre-trained models and their performances on different SE tasks.

Index Terms—Pre-training of Source Code, AI for SE

I. INTRODUCTION

Despite the successful application of deep learning to various Artificial Intelligence (AI) subfields such as natural language processing (NLP) and computer vision in recent years, a large amount of annotated training data is typically needed to train the millions or even billions of network parameters in a deep neural model. For many learning tasks, including those in software engineering (SE), obtaining annotated data is costly. To address this data annotation bottleneck, NLP researchers have come up with an idea that can arguably be considered one of the most exciting developments in recent deep learning research, namely pre-training [1]–[4]. Rather than training a model from scratch (i.e., with randomly initialized network weights), which typically requires a lot of task-specific annotated data, one can first pre-train it on one or more so-called self-supervised tasks (i.e., tasks for which annotated data can be automatically generated and therefore large amounts of training data are readily available) so that its weights encode general linguistic and commonsense knowledge about language. The resulting pre-trained model can then be fine-tuned to learn the target task using (a potentially small amount of) task-specific annotated training data in the usual supervised manner. A large number of pre-trained models of natural language (PTM-NLs) have been developed and widely used in NLP, such as BERT [5], XLNet [6], RoBERTa [7], ELECTRA [8], GPT-2 [9], T5 [10], and BART [11].

Soon thereafter, pre-trained models made their way into SE research. Initial applications of pre-trained models in SE have primarily involved retraining PTM-NLs on source code [12]–[16]. Nevertheless, employing the resulting retrained models (henceforth PTM-Cs) for SE tasks is not ideal, as there are code-specific characteristics that may not be properly taken into account by these models, such as the syntactic [17], [18] and semantic structures [19] inherent in source code [20]. Consequently, SE researchers have developed a number of pre-trained models of source code (henceforth CodePTMs) that take code-specific characteristics into account in the past few years [21]–[26].

Despite the fact that a large number of CodePTMs have been successfully developed and applied to a variety of SE tasks in recent years, our understanding of CodePTMs is arguably fairly limited. Currently, only one survey of pre-trained models of source code is available, from Niu et al. [27], but it merely summarizes and analyzes the results reported in the original model papers. While pre-trained models are task-agnostic and therefore can be applied to different SE tasks by design, virtually all CodePTMs have been evaluated on only a handful of SE tasks. For instance, TreeBERT [28] has only been evaluated on code summarization and method name generation. This is by no means ideal: without knowing how TreeBERT performs on the remaining SE tasks, we do not know whether it can achieve state-of-the-art results on any of those tasks. This in turn implies that our understanding of these models could be partial and that the current state of the art could have been very different had we evaluated the existing models on most, if not all, of the available SE tasks. Even when two pre-trained models are evaluated on the same SE task, a head-to-head comparison of these models could still be complicated if they are evaluated on different datasets available for this task [29].

With the goal of advancing our understanding of existing pre-trained models of source code, we conduct the first systematic empirical comparison of 19 recently-developed CodePTMs on 13 popular SE tasks. To gain additional insights into these CodePTMs, we employ a recently-developed four-dimensional categorization of CodePTMs [27] to categorize the 19 existing CodePTMs used in our study, and subsequently investigate whether there are correlations between categories of CodePTMs and their performances on SE tasks.

II. EXPERIMENTAL SETUP

A. SE Tasks

Table I enumerates the 13 SE tasks we will use in our comparative experiments. These are also the SE tasks that
TABLE II
CATEGORIZATION OF EXISTING PRE-TRAINED MODELS AND THEIR PERFORMANCE ON SE TASKS AS REPORTED IN THEIR ORIGINAL PAPERS. THE STRONGEST RESULT FOR EACH DATASET IS BOLDFACED.

TABLE III
CATEGORIZATION AND DESCRIPTION OF THE PRE-TRAINING TASKS MENTIONED IN TABLE II.
adapted, T5-learning, PLBART, ProphetNet-Code, CoTexT, CodeT5, SPT-Code and UniXcoder are in this category. If more than one version of a model is provided, we choose the "base" version, consistent with the approach in the original paper.
(2) Of the remaining PTMs, if the source code and datasets are provided, we re-train them according to the settings introduced in the original papers to get the pre-trained models and the tokenizers. TreeBERT is the only model in this category.
(3) For those that have the source code but not the datasets, we collect the required datasets ourselves in the same way as the original authors did, and re-train them according to the settings in the original papers. Only CugLM is in this category.
(4) If no source code is provided, we re-implement and pre-train according to the settings (e.g., tokenizer, hyperparameters, and dataset) described in the original papers. They are GPT-C, C-BERT, DeepDebug and SynCoBERT.⁴
⁴ To verify the validity of the latter two types of models pre-trained by us, we perform fine-tuning on the downstream tasks corresponding to the original paper and use pair-wise t-tests to ensure that the differences between our results and those reported in the original papers are statistically indistinguishable. Details can be found in the supplementary materials.
When evaluating on a downstream SE task, each of the 19 models is fine-tuned on the training data available for that task.
b) The 5 non-PTMs: As noted above, we also include four PTMs-NL (RoBERTa, GPT-2, BART, and T5) and a vanilla Transformer model in our comparison. For the four PTMs-NL, we use their publicly available implementations. Like the 19 PTMs of source code, these five models are fine-tuned on task-specific data [66] before being applied to each downstream task.

E. Application to SE Tasks
Two aspects need to be considered when applying PTMs to SE tasks, namely Inputs and Outputs.
1) Inputs: The inputs for different SE tasks are different. When applying a PTM to a SE task, the input of the task should be organized into the form needed by the PTM. The inputs of the SE tasks in Table II belong to three types:
(1) Using only a code snippet as input: Tasks such as Defect Detection and Code Translation assume input that belongs to this category. Here, we follow the input representation defined by each PTM. For example, for TreeBERT, we parse the code into an AST and encode each path in the AST before passing it to the Transformer, as described in the original paper; and for PLBART, we add a special symbol indicating the programming language, e.g., "[java]", to the input sequence.
(2) Using only a natural language description as input: This is used by tasks such as Code Search and Code Generation. In this case, we input the text sequence directly. But for PLBART, we follow the approach described in its paper and add a special symbol "[en]" to the input.
(3) Using a code-code pair or a code-NL pair as input: Tasks like Clone Detection (inputs: code-code) and Code Question Answering (inputs: code-NL) belong to this type. In this case, we prepare the inputs for the two parts separately and then concatenate them to obtain the final input representation.
2) Outputs: The output required by a SE task may not be the same as the output produced by the PTMs. Hence, additional modules or operations may be needed in order to get the output required by the SE tasks. The outputs that need to be provided by PTMs for different SE tasks can be divided into two types:
(1) Output based on the input representation: Among the SE tasks, Code Search and Code Question Answering use the input representation directly (to calculate the similarity between two sequences), while the others need a fully connected layer and a softmax layer to be added to obtain a probability distribution. PTMs with different architectures use different ways to get the representation vector for the input. For TE-based models, we use the vector that corresponds to the position of the classification symbol in the input (typically "[CLS]") as the representation vector. For TD-based models, we use the last time step of the output hidden states (i.e., the position of the special symbol "[endoftext]" in the input sequence). For TF-based models, it depends on the model family. Since T5-based models (i.e., T5, T5-learning, DeepDebug, CoTexT and CodeT5) formalize all tasks as text-to-text tasks, for classification tasks we map all categories to text (e.g., for a binary classification task, 0 is mapped to "false" and 1 to "true"), while for retrieval tasks, we use the output hidden state of the decoder corresponding to the "[EOS]" symbol as the representation vector. In contrast, for BART-based models (i.e., BART, PLBART and SPT-Code), we keep the input of the decoder the same as the input of the encoder and use the decoder hidden state of the last time step as the representation vector. For the other TF-based models, we use only the encoder and adopt the same method as used for the TE-based models.
(2) Output based on the ultimate output sequence: For TE-based models, we follow Lu et al. [35] and randomly initialize a Transformer decoder of the same size as the model to form an encoder-decoder architecture. For TD-based models, we follow GPT-2 [9]: for training, we concatenate the input and output sequences using a special symbol; and for evaluation, we pass the input sequence concatenated with this special symbol into the model and use the sequence predicted by the model as the output. TF-based models can be applied directly to this type of task. The Code Completion task deserves special mention. Recall that it requires a model to complete an unfinished line given the previous context. During training, however, it follows the GPT-like causal language modeling manner, which is not applicable to TE- and TF-based PTMs that adopt the encoder-decoder architecture for this task. Therefore, when training TE- and TF-based PTMs, we randomly extract the first 15%-85% of the entire sequence as the input of the encoder (since the input context in the test data is ensured to be at least 15% of the whole length [35]), and the rest is used as the input of the decoder.

F. Other Settings and Data Availability
For other settings, e.g., the hyperparameters and the optimizer, we adopt those used in the provided source code or mentioned in the original paper. If neither of the above is
available, we perform parameter tuning ourselves to maximize model performance on held-out development data.

TABLE IV
CURRENT SOTAS AND NEW SOTAS.
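To make the output handling described in Section II-E concrete, the following is a minimal PyTorch sketch of how a single representation vector can be pooled from a PTM's token-level outputs for the different architecture types and then fed to a classification head or a similarity computation. It is illustrative only: hidden_states stands in for whatever per-token outputs a concrete PTM returns, and names such as pool_representation, arch, and num_labels are ours, not part of any model's API.

```python
# Minimal sketch of the output handling in Section II-E (illustrative, not the exact experiment code).
import torch
import torch.nn as nn

def pool_representation(hidden_states: torch.Tensor,
                        attention_mask: torch.Tensor,
                        arch: str) -> torch.Tensor:
    """Select one representation vector per sequence.

    hidden_states: (batch, seq_len, dim) token-level outputs of the PTM.
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    arch: "TE" for encoder-only models; anything else is treated as a
          decoder-style model whose last time step is used.
    """
    if arch == "TE":
        # Encoder-only models: use the vector at the classification symbol's
        # position (typically "[CLS]", i.e., the first token).
        return hidden_states[:, 0]
    # Decoder-style models: use the last non-padding time step, where a
    # special symbol such as "[endoftext]" or "[EOS]" sits.
    last = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(hidden_states.size(0))
    return hidden_states[batch_idx, last]

class ClassificationHead(nn.Module):
    """Fully connected layer + softmax on top of the pooled vector."""
    def __init__(self, dim: int, num_labels: int):
        super().__init__()
        self.fc = nn.Linear(dim, num_labels)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.fc(pooled), dim=-1)

def retrieval_score(query_vec: torch.Tensor, code_vec: torch.Tensor) -> torch.Tensor:
    """Similarity used directly for Code Search / Code Question Answering."""
    return torch.cosine_similarity(query_vec, code_vec, dim=-1)

if __name__ == "__main__":
    # Toy usage with random tensors standing in for real PTM outputs.
    batch, seq_len, dim = 2, 16, 8
    states = torch.randn(batch, seq_len, dim)
    mask = torch.ones(batch, seq_len, dtype=torch.long)
    pooled = pool_representation(states, mask, arch="TE")
    probs = ClassificationHead(dim, num_labels=2)(pooled)  # e.g., defect vs. no defect
    print(probs.shape, retrieval_score(pooled, pooled).shape)
```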
TABLE V
EXPERIMENTAL RESULTS ON CODE UNDERSTANDING TASKS.
(Cur = currently reported results; New = results from our experiments; "-" = no currently reported result.)

Model | DD Acc (Cur / New) | CD F1 (Cur / New) | ET Acc (Cur / New) | CR MAP (Cur / New) | CS MRR (Cur / New) | QA MRR (Cur / New)
Transformer | - / 64.40 | - / 89.27 | - / 48.98 | - / 64.27 | - / 3.12 | - / 52.89
RoBERTa | 61.05 / 64.47 | 94.9 / 95.35 | - / 76.94 | 76.67 / 80.20 | 18.33 / 18.82 | 60.3 / 60.28
GPT-2 | - / 63.22 | - / 96.22 | - / 75.54 | - / 53.30 | - / 16.38 | - / 58.06
BART | - / 63.81 | - / 95.11 | - / 73.68 | - / 79.63 | - / 16.65 | - / 55.57
T5 | 61.93 / 61.87 | - / 94.86 | - / 74.75 | - / 69.16 | - / 16.97 | - / 45.63
CuBERT | - / 64.25 | - / 94.78 | 79.12 / 79.90 | - / 76.87 | - / 22.26 | - / 54.33
GPT-C | - / 63.77 | - / 95.46 | - / 78.26 | - / 55.23 | - / 24.39 | - / 50.32
C-BERT | 57.4 / 64.05 | - / 95.00 | - / 74.57 | - / 72.91 | - / 25.34 | - / 54.81
JavaBERT | - / 64.50 | - / 96.57 | - / 67.66 | - / 77.44 | - / 25.02 | - / 54.04
CodeGPT-adapted | - / 65.64 | - / 96.65 | - / 76.71 | - / 72.63 | - / 25.97 | - / 54.24
DeepDebug | - / 64.18 | - / 95.90 | - / 73.50 | - / 73.51 | - / 30.58 | - / 57.39
CodeBERT | - / 65.02 | - / 96.77 | - / 81.25 | 82.67 / 85.61 | - / 38.21 | 65.7 / 65.90
GraphCodeBERT | - / 65.92 | 97.1 / 97.11 | - / 83.26 | 85.16 / 87.73 | - / 38.76 | 68.4 / 68.55
CugLM | - / 64.19 | - / 96.44 | - / 79.01 | - / 83.32 | - / 36.20 | - / 61.44
DOBF | - / 63.86 | 95.9 / 96.84 | - / 79.04 | - / 87.31 | 38.3 / 38.56 | - / 61.31
T5-learning | - / 63.60 | - / 96.38 | - / 69.85 | - / 80.82 | - / 37.98 | - / 60.21
PLBART | 63.18 / 64.21 | 97.2 / 97.01 | - / 77.93 | - / 85.02 | - / 38.70 | 65.0 / 65.01
ProphetNet-Code | - / 63.57 | - / 96.05 | - / 79.37 | - / 79.82 | - / 37.64 | - / 63.73
CoTexT | 65.99 / 65.68 | - / 95.96 | - / 77.21 | - / 86.65 | - / 38.13 | - / 68.70
TreeBERT | - / 65.76 | - / 96.51 | - / 78.08 | - / 85.54 | - / 39.60 | - / 64.98
CodeT5 | 65.78 / 65.82 | 97.2 / 97.18 | - / 85.00 | - / 87.53 | - / 40.03 | 67.8 / 67.91
SynCoBERT | 64.5 / 66.25 | 97.4 / 97.55 | - / 82.70 | 88.24 / 88.52 | 38.1 / 39.99 | - / 69.19
SPT-Code | - / 64.88 | - / 96.40 | - / 77.11 | - / 86.54 | - / 37.05 | - / 64.55
UniXcoder | - / 65.64 | 95.2 / 96.32 | - / 83.47 | 90.52 / 90.55 | 41.3 / 41.57 | 70.1 / 70.30
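Footnote 4 in Section II-D mentions that we used pair-wise t-tests to check that the models we pre-trained ourselves are statistically indistinguishable from the numbers reported in their original papers. The snippet below is a minimal sketch of such a paired comparison with SciPy; the two score lists are hypothetical placeholders, not our actual measurements.

```python
# Minimal sketch of a pair-wise (paired) t-test as mentioned in footnote 4.
from scipy import stats

# Hypothetical per-dataset scores: originally reported vs. our replication.
reported = [65.8, 97.2, 67.8, 88.2]
replicated = [65.5, 97.0, 67.9, 88.0]

t_stat, p_value = stats.ttest_rel(reported, replicated)
# A large p-value (e.g., > 0.05) means the difference between the two sets of
# results is not statistically significant, i.e., the replication is
# statistically indistinguishable from the originally reported results.
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```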
the Code Completion task whose SOTA model, CodeGPT-adapted, is of type PTM-C, which covers models designed for natural language but pre-trained on source code.
Second, while many PTMs have been proposed, only five of them have managed to achieve SOTA performance on at least one SE task. They are CodeT5 (SOTA on 5 tasks), UniXcoder (SOTA on 3 tasks), PLBART (SOTA on 2 tasks), SynCoBERT (SOTA on 2 tasks), and CodeGPT-adapted (SOTA on 1 task).
Third, vanilla Transformer's performance relative to the PTMs differs across SE tasks: (1) on Clone Detection (CD), Error Type prediction (ET), Code Search (CS), Code Translation, Assert Generation, and Code Summarization, vanilla Transformer is surpassed in performance by all types of PTMs (i.e., PTM-NL, PTM-C, and CodePTM); (2) on Code Completion and Mutant Generation, vanilla Transformer is beaten by all PTMs-C and CodePTMs but it outperforms two PTMs-NL, BART and T5; (3) on Code-to-Code Retrieval (CR) and Code Question Answering (QA), vanilla Transformer not only surpasses a PTM-NL (GPT-2 on CR and T5 on QA), but also beats one PTM-C (GPT-C for both tasks); and (4) on Defect Detection (DD) and Bug Fixing, vanilla Transformer even outperforms some CodePTMs in addition to PTMs-NL and PTMs-C, beating CugLM, DOBF, T5-learning, PLBART, and ProphetNet-Code on DD, and CugLM on Bug Fixing.
In the following, we discuss in detail the observations obtained from the current and the new results on each task.

A. Defect Detection
SynCoBERT defeats CoTexT and becomes the new SOTA PTM for this task, and Accuracy improves by 0.26.
1) Architecture: While the Top-2 models on this classification task, SynCoBERT and GraphCodeBERT, are both TE-based, there is not enough empirical evidence for us to conclude that TE is a better architecture for this task than TD or TF, for several reasons. First, to draw this conclusion, we need to compare the results of two models that differ only w.r.t. architecture, but there do not exist two PTMs on our list that differ only w.r.t. architecture. Second, TF-based CoTexT, which uses MLM as the only pre-training task, outperforms TE-based UniXcoder, which uses three more complex pre-training tasks, ULM, MCL and CMG. Finally, TF-based DeepDebug achieves better results than TE-based C-BERT when using only code as input and MLM as its only pre-training task.
2) Modality: Both code structure and NL are shown to have a positive effect on the performance of the models on this task, but the way they are used also matters. As an example, TF-based TreeBERT significantly outperforms some of the TF-based models that use code and NL (e.g., DOBF, T5-learning, PLBART), owing to its use of ASTs. As another example, TF-based CoTexT outperforms TF-based T5-learning considerably: CoTexT concatenates the code and the corresponding documentation as one single input, whereas T5-learning treats the features derived from these two modalities as separate data instances. This suggests that how the information derived from these modalities is used has an impact on performance.
3) Pre-training Tasks: First, the results in the New column of this task in Table V reveal that the most influential pre-training tasks are cross-modal-aware classification tasks, as they are used by the Top-5 models. These tasks include TEP/EP (used by SynCoBERT and GraphCodeBERT), MCL (used by SynCoBERT and UniXcoder), and NA (adopted by GraphCodeBERT). This observation is different from the conclusion derived from the Cur column, where Seq2Seq MLM (the only pre-training task used by the old SOTA model, CoTexT) seems to have the greatest impact on defect detection.
B. Clone Detection
The new results on this task do not change significantly from the current ones, except that PLBART, which is currently tied for second place, has slipped to fourth place. The drop in PLBART's rank seems to suggest that using multiple pre-training tasks is better than using a single pre-training task on this task: while the Top-2 models, SynCoBERT and CodeT5, employ four distinct pre-training tasks, PLBART uses DAE as its only pre-training task. Besides, the new results also enable us to see the performance of TD-based models on this task; in particular, the best TD-based PTM, CodeGPT-adapted, ranks 7th.

C. Exception Type
This task is the only multi-label classification task among our 13 SE evaluation tasks. Currently, only one model (i.e., CuBERT) has been applied to this task, which prevents us from drawing any conclusions about the relative performance of different types of models on a multi-label classification task like this. Fortunately, our results enable us to draw several new conclusions:
1) Architecture: Most notably, according to the new results, the SOTA performance on this task is not achieved by a TE-based model. Instead, TF-based CodeT5, which turns the task into a text-to-text form, achieves the best results. The best TE-based model (UniXcoder) and the best TD-based model (GPT-C) rank second and tenth respectively, and their accuracies are 1.53 and 6.74 points lower than that of CodeT5. Recall that in Section II-E, we mentioned that as a T5-based model, CodeT5, when applied to a classification task, maps each label to a unique text string. Specifically for Exception Type, it does not predict the index of each exception, but rather the text string of that exception. In this way, CodeT5 turns this classification task into one of generating NL, which is exactly what CodeT5 is good at. In contrast, for TE-based models (e.g., SynCoBERT, UniXcoder, GraphCodeBERT), most of the tasks they use in pre-training are binary classification tasks (e.g., MCL, TEP/EP, NA), so they may lack the knowledge needed for multi-label classification.
2) Modality: The impact of each modality on this task becomes clear as well. All of the Top-3 models (i.e., CodeT5, UniXcoder, and GraphCodeBERT) use NL as one of the input modalities, while code and code structure are each used by only two of them (CodeT5 and UniXcoder). This seems to suggest that NL has a stronger positive impact on this task than the other two modalities.
3) Pre-training Tasks: Both the classification pre-training task NSP and the generative pre-training task FNP seem to have positive impacts on this task. To exemplify, while CuBERT and C-BERT are both TE-based models that use code as the only modality and differ only in their pre-training tasks (CuBERT uses both MLM and NSP whereas C-BERT uses only MLM), CuBERT outperforms C-BERT by as many as 5 percentage points in accuracy. As another example, while ProphetNet-Code and PLBART are both TF-based models that use code and NL as input modalities and differ only in terms of their pre-training tasks (ProphetNet-Code uses FNP whereas PLBART uses DAE), ProphetNet-Code surpasses PLBART in performance.

D. Code-to-Code Retrieval
Currently, the relative advantages and disadvantages of different model architectures are not clear, since only four TE-based models have been evaluated on this task. However, with the new results, the conclusion that TE-based models have more advantages over the other architectures on this task can be verified, since the Top-3 models on this task are all TE-based (i.e., UniXcoder, SynCoBERT, and GraphCodeBERT). Besides, the performance of the TF- and TD-based models is also measurable: the best performing TF-based model (CodeT5) and TD-based model (CodeGPT-adapted) rank 4th and 20th, respectively.

E. Code Search
1) Architecture: Although the SOTA model on this task is still UniXcoder (TE-based), the rank of CodeT5 (TF-based) improved from third to second in the new results, and the third position is taken by SynCoBERT (TE-based). TreeBERT (TF-based) ranks fourth, GraphCodeBERT (TE-based) ranks fifth, and PLBART (TF-based) ranks sixth. These results seem to suggest that TE-based and TF-based models perform comparably on this task, as they alternate in the Top-6. In addition, the performance of TD-based models on this task is now measurable: the best TD-based PTM (CodeGPT-adapted) ranks 15th.
2) Pre-training Tasks: The MLM pre-training task and its variants, as well as cross-modal-aware tasks, demonstrate their necessity in achieving top performance on this task. Specifically, the pre-training tasks used by the top-ranked models all include MLM (and its variants such as Seq2seq MLM), as well as cross-modal-aware tasks (e.g., MCL, BDG, EP). On one hand, MLM and its variants enable a model to generate better input representations. On the other hand, the cross-modal-aware tasks typically allow a model to learn the alignment between different input modalities with the same semantics. These two types of pre-training tasks therefore allow a model to generate a more uniform input representation for multimodal inputs, which is exactly what a model needs for Code Search.
3) Modality: Pre-training on multiple modalities appears to benefit this task, since all of the Top-6 models are pre-trained on two or three modalities. Concretely, UniXcoder is pre-trained on NL and Structure, TreeBERT is pre-trained on Code and Structure, CodeT5 and PLBART are both pre-trained on Code and NL, while SynCoBERT and GraphCodeBERT are pre-trained on all three modalities. It is hard to tell which modality has the largest impact on performance, because the absence of any one of them would not prevent a model from entering the Top-6.

F. Code Question Answering
The new SOTA model remains the same as the current one, i.e., UniXcoder. But our newly reported SOTA performance
TABLE VI
EXPERIMENTAL RESULTS ON CODE TRANSLATION AND ASSERT GENERATION.

TABLE VII
EXPERIMENTAL RESULTS ON BUG FIXING, CODE COMPLETION AND MUTANT GENERATION.
has an improvement of 0.2 MRR points. Note that MCL, the pre-training task used by the SOTA model UniXcoder, aims to distinguish whether two inputs match each other, which is also the goal of the Code Question Answering task.

G. Code Translation
Although the models are ranked according to their average EM value on the "Java to C#" and the "C# to Java" sub-datasets, we find that the Top-2 models on the two sub-datasets are both PLBART and CodeT5.¹⁰ Besides, the Top 3–6 models on this task are CoTexT, SPT-Code, DOBF, and UniXcoder, respectively.
¹⁰ For a discussion of the results w.r.t. other evaluation metrics, see the supplementary file.
1) Architecture: According to the new results, the conclusion that TF-based models take the absolute lead on this task can be verified, since the Top-5 models are all TF-based. Moreover, given that we have more TF-based models in our comparison than before, the rank of the best performing TE-based model (i.e., SynCoBERT in Current and UniXcoder in New) drops from fourth to sixth.
2) Modality: The importance of NL is well validated, due to the fact that the Top-4 performers in both the current (i.e., CodeT5, PLBART, SPT-Code and SynCoBERT) and the new results (i.e., PLBART, CodeT5, CoTexT and SPT-Code)
TABLE VIII
EXPERIMENTAL RESULTS ON CODE GENERATION TASKS INVOLVING NATURAL LANGUAGE.
use NL. The role of code structure, on the other hand, is less clear, since the Top-2 models (i.e., PLBART and CodeT5) are not pre-trained on the Structure modality, and the best model using code structure drops from third place in "Cur" (i.e., SynCoBERT) to fourth place in "New" (i.e., SPT-Code).
3) Pre-training Tasks: The new results show that the pre-training objective DAE has a more significant impact than BDG/CMG. To exemplify, consider PLBART and CodeT5, both of which are TF-based and employ the same modalities (code and NL). They differ only in terms of the pre-training tasks: the former uses DAE and the latter uses BDG/CMG. The fact that PLBART outperforms CodeT5 can therefore be attributed to DAE being a better pre-training task for Code Translation than BDG/CMG. This conclusion is contrary to the conclusion drawn from the Cur results, where BDG is believed to have a stronger influence than DAE on Code Translation due to the fact that CodeT5 beat PLBART by 1.3 EM points.

H. Bug Fixing
Considering the EM value averaged over the "small" and "medium" datasets, the Top-4 models change from CodeT5, CoTexT, DeepDebug, and SPT-Code (listed in decreasing order of performance) to CodeT5, CoTexT, TreeBERT, and UniXcoder. A closer examination of the sub-datasets reveals that UniXcoder outperforms CoTexT and TreeBERT, achieving the second best performance on the "medium" dataset while ranking 4th on the "small" one.
1) Architecture: The Top-3 performance is achieved by three TF-based models (i.e., CodeT5, CoTexT, and TreeBERT), the best and second best TE-based models (i.e., UniXcoder and SynCoBERT) rank 4th and 5th respectively, and the best TD-based model (i.e., CodeGPT-adapted) only ranks 14th. This seems to suggest that the TF architecture should be considered first when designing high-performance pre-trained models for this task.
2) Pre-training Tasks: The most useful pre-training tasks for Bug Fixing are the sequence-to-sequence variants of MLM adapted to the Transformer decoder. They enable a model to acquire the ability to generate target sequences from an incomplete one. As an example, consider the top-3 TF-based models, which all use such pre-training tasks: Seq2seq MLM in CodeT5 and CoTexT, TMLM in TreeBERT, and Seq2seq IMLM in CodeT5. Moreover, by using Seq2seq MLM as the only pre-training task, the second-highest ranked model, CoTexT, achieves better performance than TreeBERT, which uses NOP in addition to TMLM for pre-training.

I. Code Completion
This is the only SE task among the ones we consider where SOTA performance is achieved by the TD-based model CodeGPT-adapted, and it is the SOTA model in both the current and new results. This seems to suggest the absolute dominance of the TD architecture on this task. Our new results further suggest that the pre-training objective ULM (adopted by the Top-3 models on this task, i.e., CodeGPT-adapted, CugLM, and GPT-C), whose goal is similar to that of code completion, plays an influential role in Code Completion. As an example, consider the TE-based model CugLM, which outperforms another TE-based model, CuBERT (pre-trained on MLM and NSP), and achieves the second best performance by using ULM in addition to MLM and NSP. Moreover, in terms of modality, Code Completion is the only task where neither NL nor code structure plays a positive role, since all of the Top-3 models use code as the only input
modality. We speculate the reason is that there is currently no effective way to combine these two modalities with ULM.

J. Assert Generation
Since only T5-learning has been evaluated on this task currently, all conclusions drawn from the new results can be viewed as new findings. First, the Top-5 performers are all TF-based models (i.e., PLBART, TreeBERT, SPT-Code, T5-learning, and CodeT5, in order). The best performing TE-based (i.e., UniXcoder) and TD-based (i.e., CodeGPT-adapted) models rank 8th and 16th, respectively. As far as modality is concerned, NL seems to have a greater impact than the other modalities, as four of the Top-5 models (i.e., PLBART, SPT-Code, T5-learning, and CodeT5) use NL, whereas only two (i.e., TreeBERT and SPT-Code) use code structure.

K. Mutant Generation
NL and code structure appear to have positive impacts, since the Top-3 models (i.e., CodeT5, TreeBERT, and PLBART, in order) use either NL or code structure as one of the inputs in addition to the code. As for pre-training tasks, DAE alone is able to help a model (i.e., PLBART) achieve the third best performance. The structure-aware pre-training tasks, such as TMLM and NOP used by TreeBERT (the second best) and CAP used by SPT-Code (the fourth best), clearly have positive impacts on this task.

L. Code Summarization
1) Architecture: The best TE-based model (UniXcoder) ranks 5th, with the Top-4 being TF-based models (i.e., CodeT5, ProphetNet-Code, CoTexT, and SPT-Code), which suggests a strong positive influence of the TF architecture on this task. This is not in line with the current results, in which the best TE-based model ranked second. With the new results, the best TD-based model (GPT-2) ranks 15th.
2) Modality: The highest ranks achieved by models pre-trained on NL only (e.g., T5 and BART) are 5th and 9th in the current and new results, respectively. The reasons why they perform even better than some of the models pre-trained on code or structure in addition to NL (e.g., CodeBERT, GraphCodeBERT, etc.) are two-fold. First, they acquire the ability to generate NL during pre-training, which is required by Code Summarization. Second, because of the "naturalness" of source code [13], [68], they are able to understand the code to some extent even though they only have the ability to understand NL.
3) Pre-training Tasks: Cross-modal (Code and Natural Language)-aware generation tasks such as BDG/CMG and MNG have positive impacts on a model's performance on this task. As an example, CodeT5, which utilizes BDG, and SPT-Code, which utilizes MNG, are the top performers among TF-based models, and UniXcoder, which utilizes CMG, is the top performer among TE-based models.

M. Code Generation
The SOTA model changes from TE-based (UniXcoder) to TF-based (CodeT5), and the best TE-based (UniXcoder) and TD-based (CodeGPT-adapted) models rank 2nd and 11th, respectively. While the new SOTA model (i.e., CodeT5) for Code Generation is also the SOTA model for Code Summarization, the ranks of T5 and BART (pre-trained only on NL) on this task are lower than their ranks on Code Summarization, because understanding code and generating code are fundamentally different in nature. In addition, the importance of the code and NL modalities for this task is not as clear as that for Code Summarization, considering that among the Top-3 models (CodeT5, UniXcoder, and TreeBERT), only CodeT5 uses both code and NL: UniXcoder uses code structure and NL, and TreeBERT uses code structure and code instead. Moreover, although only NL is the input of this task, pre-training on code structure has a positive impact on this task, since both the second and third best models (UniXcoder and TreeBERT) are pre-trained on tasks such as MCL, CMG, and NOP.

VI. INSIGHTS AND TAKEAWAYS
After the task-by-task analysis and discussion, we have some insights and takeaways to offer to subsequent researchers.
• When designing a new model to solve multiple tasks, look up the current SOTA model's architecture, features, and pre-training tasks for each task, and use such information as a starting point.
• Always pre-train on multiple programming languages.
• Always pre-train with NL, since all of the new SOTAs use NL.
• Utilize structure information in PTMs for code understanding tasks.
• Ensure the pre-training tasks are as similar in form as possible to the target downstream task.
• Use different CodePTMs for different target task types, since there is no almighty CodePTM, as per our results and Zeng et al. [69].
Particularly, for the following tasks, we have additional takeaways:
• Clone Detection: Although the best performance is achieved by a TE-based model, comparable results are achieved with TF-based models. Besides, the use of NL and code structure is beneficial. Finally, MLM and its variants yield better results on this task.
• Code-to-Code Retrieval: Utilize NL and code structure following the "Altogether" strategy. Besides, MLM and its variants, as well as structure-aware pre-training tasks, have positive effects on this task.
• Code Question Answering: Prefer TE models and use NL whenever possible.
• Assert Generation: NL is not a required modality. The reason is that although the model with the best performance uses NL, NL is not used in the same data sample as other modalities (because of the Standalone strategy).
Seq2seq pre-training tasks, such as DAE, MASS and MNG, should be prioritized.
• Code Generation: Besides TF, TE is worth trying. The use of NL-to-code generation pre-training tasks (e.g., BDG and CMG) is mandatory.
Finally, based on our experiments, we propose several possible directions for future research:
• Design more efficient pre-training tasks to make CodePTMs learn source code features better [20].
• Improve the efficiency of CodePTMs for fine-tuning on downstream tasks [70].
• Make the large CodePTMs lighter [71], [72].
• Improve the robustness of CodePTMs.

VII. THREATS TO VALIDITY
Construct Validity: As discussed in Section II-D, we have re-implemented some PTMs (category IV) or re-collected some datasets (category III and some in IV). The replication may not be perfect, but we have tried our best in re-implementing the models and collecting the datasets to minimize deviations from the original models (see Section II-D). Besides, we adopt statistical significance testing to measure the differences between our implementations and the original ones.
Internal Validity: It is widely agreed that, during fine-tuning, hyperparameters have a significant impact on the performance of pre-trained models. For models where the hyperparameters for fine-tuning are not available (see Section II-D), the settings we obtain by hyperparameter search may introduce some bias relative to the performance reported in the original papers. But we have tried our best to derive the best performance of these models on each SE task.
External Validity: The results and observations we obtained in this work may apply only to the downstream tasks and corresponding datasets we have evaluated. For other SE tasks and datasets, we cannot guarantee exactly the same results and observations.

VIII. CONCLUSION
We conducted the first systematic empirical comparison of existing pre-trained models of source code.¹¹ We believe that the results of our large-scale evaluation and the associated discussion can provide SE researchers with a better understanding of existing PTMs and their relative strengths and weaknesses, as well as a better characterization of the state of the art on each SE task on which PTMs are commonly evaluated.
This paper provides many valuable findings that are either not available based on the existing results alone or completely contrary to current findings. For example, we found that TF-based models have clear advantages not only for code generation tasks but also for code understanding tasks. We hope that this paper can provide interested researchers with comprehensive and comparable insights into the current state of this domain and inspire them to design more powerful pre-trained models of source code.
¹¹ All materials used in our experiments are available at https://github.com/NougatCA/FineTuner and https://doi.org/10.5281/zenodo.7318110.

ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (61802167), the Natural Science Foundation of Jiangsu Province, China (BK20201250), the Cooperation Fund of Huawei-NJU Creative Laboratory for the Next Programming, and NSF award 2034508. We also thank the reviewers for their helpful comments. Chuanyi Li and Jidong Ge are the corresponding authors.

REFERENCES
[1] A. M. Dai and Q. V. Le, "Semi-supervised sequence learning," Advances in neural information processing systems, vol. 28, 2015.
[2] J. Howard and S. Ruder, "Universal language model fine-tuning for text classification," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 328–339.
[3] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 2227–2237. [Online]. Available: https://aclanthology.org/N18-1202
[4] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[6] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "Xlnet: Generalized autoregressive pretraining for language understanding," Advances in neural information processing systems, vol. 32, 2019.
[7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[8] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, "Electra: Pre-training text encoders as discriminators rather than generators," in International Conference on Learning Representations, 2019.
[9] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," 2019.
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," Journal of Machine Learning Research, vol. 21, pp. 1–67, 2020.
[11] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7871–7880.
[12] A. Kanade, P. Maniatis, G. Balakrishnan, and K. Shi, "Learning and evaluating contextual embedding of source code," in International Conference on Machine Learning. PMLR, 2020, pp. 5110–5121.
[13] L. Buratti, S. Pujar, M. Bornea, S. McCarley, Y. Zheng, G. Rossiello, A. Morari, J. Laredo, V. Thost, Y. Zhuang et al., "Exploring software naturalness through neural language models," arXiv preprint arXiv:2006.12641, 2020.
[14] A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan, "Intellicode compose: Code generation using transformer," in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1433–1443.
[15] N. T. De Sousa and W. Hasselbring, "Javabert: Training a transformer-based model for the java programming language," in 2021 36th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW). IEEE, 2021, pp. 90–95.
[16] D. Drain, C. Wu, A. Svyatkovskiy, and N. Sundaresan, "Generating bug-fixes using pretrained transformers," in Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming, 2021, pp. 1–8.
[17] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, "code2vec: Learning distributed representations of code," Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–29, 2019.
[18] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu, "A novel neural source code representation based on abstract syntax tree," in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE). IEEE, 2019, pp. 783–794.
[19] T. Ben-Nun, A. S. Jakobovits, and T. Hoefler, "Neural code comprehension: A learnable representation of code semantics," Advances in Neural Information Processing Systems, vol. 31, 2018.
[20] A. Karmakar and R. Robbes, "What do pre-trained code models know about code?" in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 1332–1336.
[21] M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," in International Conference on Learning Representations, 2018.
[22] C. Cummins, H. Leather, Z. Fisches, T. Ben-Nun, T. Hoefler, and M. O'Boyle, "Deep data flow analysis," 2020. [Online]. Available: https://arxiv.org/abs/2012.01470
[23] T. Hoang, H. J. Kang, D. Lo, and J. Lawall, "Cc2vec: Distributed representations of code changes," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020, pp. 518–529.
[24] W. Ma, M. Zhao, E. Soremekun, Q. Hu, J. M. Zhang, M. Papadakis, M. Cordy, X. Xie, and Y. L. Traon, "Graphcode2vec: generic code embedding via lexical and program dependence analyses," in Proceedings of the 19th International Conference on Mining Software Repositories, 2022, pp. 524–536.
[25] N. D. Bui, Y. Yu, and L. Jiang, "Infercode: Self-supervised learning of code representations by predicting subtrees," in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 1186–1197.
[26] K. Zhang, W. Wang, H. Zhang, G. Li, and Z. Jin, "Learning to represent programs with heterogeneous graphs," in Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, 2022, pp. 378–389.
[27] C. Niu, C. Li, B. Luo, and V. Ng, "Deep learning meets software engineering: A survey on pre-trained models of source code," in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, 2022, pp. 5546–5555.
[28] X. Jiang, Z. Zheng, C. Lyu, L. Li, and L. Lyu, "Treebert: A tree-based pre-trained model for programming language," in Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, vol. 161. PMLR, 27–30 Jul 2021, pp. 54–63.
[29] C. Niu, C. Li, V. Ng, J. Ge, L. Huang, and B. Luo, "Spt-code: Sequence-to-sequence pre-training for learning source code representations," in 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), 2022, pp. 01–13.
[30] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, "Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks," Advances in neural information processing systems, vol. 32, 2019.
[31] M. Pradel and K. Sen, "Deepbugs: A learning approach to name-based bug detection," Proceedings of the ACM on Programming Languages, vol. 2, no. OOPSLA, pp. 1–25, 2018.
[32] J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, "Towards a big data curated benchmark of inter-project code clones," in Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution, 2014, pp. 476–480.
[33] K. W. Nafi, T. S. Kar, B. Roy, C. K. Roy, and K. A. Schneider, "Clcdsa: cross language code clone detection using syntactical features and api documentation," in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2019, pp. 1026–1037.
[34] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, "Convolutional neural networks over tree structures for programming language processing," in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016, pp. 1287–1293.
[35] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu, "CodeXGLUE: A machine learning benchmark dataset for code understanding and generation," in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
[36] J. Huang, D. Tang, L. Shou, M. Gong, K. Xu, D. Jiang, M. Zhou, and N. Duan, "Cosqa: 20,000+ web queries for code search and question answering," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5690–5700.
[37] B. Roziere, M.-A. Lachaux, L. Chanussot, and G. Lample, "Unsupervised translation of programming languages," Advances in Neural Information Processing Systems, vol. 33, pp. 20601–20611, 2020.
[38] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, and D. Poshyvanyk, "An empirical study on learning bug-fixing patches in the wild via neural machine translation," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 28, no. 4, pp. 1–29, 2019.
[39] V. Raychev, P. Bielik, and M. Vechev, "Probabilistic model for code with decision trees," in Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2016, pp. 731–747.
[40] F. Liu, G. Li, Y. Zhao, and Z. Jin, "Multi-task learning based pre-trained language model for code completion," in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 2020, pp. 473–485.
[41] U. Alon, R. Sadaka, O. Levy, and E. Yahav, "Structural language models of code," in International conference on machine learning. PMLR, 2020, pp. 245–256.
[42] M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and D. Poshyvanyk, "Learning how to mutate source code from bug-fixes," in 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE Computer Society, 2019, pp. 301–312.
[43] C. Watson, M. Tufano, K. Moran, G. Bavota, and D. Poshyvanyk, "On learning meaningful assert statements for unit test cases," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020, pp. 1398–1409.
[44] S. Haque, A. LeClair, L. Wu, and C. McMillan, "Improved automatic summarization of subroutines via attention to file context," in Proceedings of the 17th International Conference on Mining Software Repositories, 2020, pp. 300–310.
[45] X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin, "Deep code comment generation," in 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC). IEEE, 2018, pp. 200–210.
[46] U. Alon, S. Brody, O. Levy, and E. Yahav, "code2seq: Generating sequences from structured representations of code," in International Conference on Learning Representations, 2019.
[47] X. Hu, G. Li, X. Xia, D. Lo, S. Lu, and Z. Jin, "Summarizing source code with transferred api knowledge," in Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018, 2018, pp. 2269–2275.
[48] A. V. Miceli-Barone and R. Sennrich, "A parallel corpus of python functions and documentation strings for automated code documentation and code generation," in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017, pp. 314–319.
[49] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer, "Mapping language to code in programmatic context," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1643–1652.
[50] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "Bleu: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[51] S. Ren, D. Guo, S. Lu, L. Zhou, S. Liu, D. Tang, N. Sundaresan, M. Zhou, A. Blanco, and S. Ma, "Codebleu: a method for automatic evaluation of code synthesis," arXiv preprint arXiv:2009.10297, 2020.
[52] R.-M. Karampatsis and C. Sutton, "Scelmo: Source code embeddings from language models," arXiv preprint arXiv:2004.13214, 2020.
[53] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, "Codebert: A pre-trained model for programming and natural languages," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, 2020, pp. 1536–1547.
[54] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng, C. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, and M. Zhou, "Graphcodebert: Pre-training code representations with data flow," in International Conference on Learning Representations, 2021.
[55] B. Roziere, M.-A. Lachaux, M. Szafraniec, and G. Lample, “Dobf: A
deobfuscation pre-training objective for programming languages,” arXiv
preprint arXiv:2102.07492, 2021.
[56] A. Mastropaolo, S. Scalabrino, N. Cooper, D. N. Palacio, D. Poshy-
vanyk, R. Oliveto, and G. Bavota, “Studying the usage of text-to-text
transfer transformer to support code-related tasks,” in 2021 IEEE/ACM
43rd International Conference on Software Engineering (ICSE). IEEE,
2021, pp. 336–347.
[57] W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-
training for program understanding and generation,” in Proceedings of
the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, 2021,
pp. 2655–2668.
[58] W. Qi, Y. Gong, Y. Yan, C. Xu, B. Yao, B. Zhou, B. Cheng, D. Jiang,
J. Chen, R. Zhang et al., “Prophetnet-x: Large-scale pre-training models
for english, chinese, multi-lingual, dialog, and code generation,” arXiv
preprint arXiv:2104.08006, 2021.
[59] L. Phan, H. Tran, D. Le, H. Nguyen, J. Annibal, A. Peltekian, and Y. Ye,
“Cotext: Multi-task learning with code-text transformer,” in Proceedings
of the 1st Workshop on Natural Language Processing for Programming
(NLP4Prog 2021), 2021, pp. 40–47.
[60] D. Peng, S. Zheng, Y. Li, G. Ke, D. He, and T.-Y. Liu, “How could
neural networks understand programs?” in International Conference on
Machine Learning. PMLR, 2021, pp. 8476–8486.
[61] J. Zhang, H. Hong, Y. Zhang, Y. Wan, Y. Liu, and Y. Sui, “Disentangled
code representation learning for multiple programming languages,” in
Findings of the Association for Computational Linguistics: ACL-IJCNLP
2021, 2021, pp. 4454–4466.
[62] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware
unified pre-trained encoder-decoder models for code understanding
and generation,” in Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, 2021, pp. 8696–8708.
[63] X. Wang, Y. Wang, F. Mi, P. Zhou, Y. Wan, X. Liu, L. Li, H. Wu,
J. Liu, and X. Jiang, “Syncobert: Syntax-guided multi-modal contrastive
pre-training for code representation,” arXiv preprint arXiv:2108.04556,
2021.
[64] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “Unixcoder:
Unified cross-modal pre-training for code representation,” in Proceed-
ings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), 2022, pp. 7212–7225.
[65] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[66] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances
in Neural Information Processing Systems, vol. 30. Curran Associates,
Inc., 2017.
[67] E. S. Edgington, “Approximate randomization tests,” The Journal of
Psychology, vol. 72, no. 2, pp. 143–149, 1969.
[68] M. D. Ernst, “Natural language is a programming language: Applying
natural language processing to software development,” in 2nd Summit
on Advances in Programming Languages (SNAPL 2017). Schloss
Dagstuhl-Leibniz-Zentrum fuer Informatik, 2017.
[69] Z. Zeng, H. Tan, H. Zhang, J. Li, Y. Zhang, and L. Zhang, “An extensive
study on pre-trained models for program understanding and generation,”
in Proceedings of the 31st ACM SIGSOFT International Symposium on
Software Testing and Analysis, 2022, pp. 39–51.
[70] D. Wang, Z. Jia, S. Li, Y. Yu, Y. Xiong, W. Dong, and X. Liao,
“Bridging pre-trained models and downstream tasks for source code
understanding,” in Proceedings of the 44th International Conference on
Software Engineering, 2022, pp. 287–298.
[71] Z. Zhang, H. Zhang, B. Shen, and X. Gu, “Diet code is healthy:
Simplifying programs for pre-trained models of code,” in Proceedings
of the 30th ACM Joint European Software Engineering Conference
and Symposium on the Foundations of Software Engineering, 2022, pp.
1073–1084.
[72] J. Shi, Z. Yang, B. Xu, H. J. Kang, and D. Lo, “Compressing pre-
trained models of code into 3 mb,” in The 37th IEEE/ACM International
Conference on Automated Software Engineering, ASE 2022, 2022.