
Exploring and Evaluating Personalized Models for Code Generation


Andrei Zlotchevski, McGill University, Montreal, Quebec, Canada
Dawn Drain, Anthropic, San Francisco, CA, USA
Alexey Svyatkovskiy, Microsoft, Redmond, WA, USA
Colin Clement, Microsoft, Redmond, WA, USA
Neel Sundaresan, Microsoft, Redmond, WA, USA
Michele Tufano, Microsoft, Redmond, WA, USA

arXiv:2208.13928v2 [cs.SE] 20 Sep 2022

ABSTRACT
Large Transformer models have achieved state-of-the-art status for Natural Language Understanding tasks and are increasingly becoming the baseline model architecture for modeling source code. Transformers are usually pre-trained on large unsupervised corpora, learning token representations and transformations relevant to modeling generally available text, and are then fine-tuned on a particular downstream task of interest. While fine-tuning is a tried-and-true method for adapting a model to a new domain – for example, question-answering on a given topic – generalization remains an on-going challenge. In this paper, we explore and evaluate transformer model fine-tuning for personalization. In the context of generating unit tests for Java methods, we evaluate learning to personalize to a specific software project using several personalization techniques. We consider three key approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which freezes most of the model's parameters, allowing tuning of the token embeddings and softmax layer only or the final layer alone; (iii) prefix tuning, which keeps model parameters frozen, but optimizes a small project-specific prefix vector. Each of these techniques offers a trade-off in total compute cost and predictive performance, which we evaluate by code and task-specific metrics, training time, and total computational operations. We compare these fine-tuning strategies for code generation and discuss the potential generalization and cost benefits of each in various deployment scenarios.

CCS CONCEPTS
• Software and its engineering → Software testing and debugging; • Information systems → Recommender systems.

KEYWORDS
Personalized Models, Code Generation

ACM Reference Format:
Andrei Zlotchevski, Dawn Drain, Alexey Svyatkovskiy, Colin Clement, Neel Sundaresan, and Michele Tufano. 2022. Exploring and Evaluating Personalized Models for Code Generation. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '22), November 14–18, 2022, Singapore, Singapore. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3540250.3558959

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ESEC/FSE '22, November 14–18, 2022, Singapore, Singapore
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9413-0/22/11...$15.00
https://doi.org/10.1145/3540250.3558959

1 INTRODUCTION
It is well-known that even the best models can fail to generalize properly to new domains, and even to new users of said models. For example, a model trained to answer questions in general may not answer StackOverflow questions as well as the questions in the training domain, or a software developer in an enterprise environment with private code may have libraries and attribute names which differ from the public source code used to train a code synthesis model.
The current dominant paradigm in Natural Language Processing (NLP) modeling is to pre-train a large transformer model [30] on a large corpus and then fine-tune it on a particular task of interest. For example, a question-answering (Q&A) model is generally first pre-trained on a large corpus of textual data for the specific language (e.g., Wikipedia and news articles in English), then fine-tuned on a task-specific dataset of paired questions and corresponding answers. The pre-training process aims at learning a semantic vector representation of the language and its words, while the fine-tuning process specializes the model for a specific domain.
Transformer models are also increasingly the baseline architecture used for code generation tasks, such as writing methods from a natural language description [2, 5, 7], or generating test cases from the focal method under test [29]. As with NLP tasks, these models are pre-trained on a large corpus of natural text and publicly available source code and then fine-tuned on a specific code-related task. Further, these models also may not generalize to new domains of interest, and can benefit from task- or even user-specific fine-tuning, here called customization or personalization. Customization is particularly relevant for code generation models since it provides several benefits:
• it allows fine-tuning on source code data that may not be available when training a base model (e.g., private repositories or internal codebases), enabling improved overall performance on codebases with proprietary dependencies and code styles;

• the opportunity to improve data privacy by considering private or sensitive data only during the customization process on the client side;
• the opportunity to reduce deployment cost, as customized models can offer better user performance without increasing model size.
Custom models can provide clear benefits to users and model
providers. We envision serving tens or hundreds of thousands of
custom models, but doing so presents several logistical hurdles,
including the costs of training, storing, and loading these models
into GPU memory for inference. Worse, memory costs will only
be exacerbated when working with ever larger and more powerful
models.
For these reasons, we investigate several customization approaches, some of which can dramatically reduce the memory footprint and amortized computational cost introduced by custom models. Specifically, we consider three fine-tuning approaches: (i) custom fine-tuning, which allows all the model parameters to be tuned; (ii) lightweight fine-tuning, which only optimizes the token embedding representations or the final softmax layer; (iii) prefix tuning, which keeps language model parameters frozen, but optimizes a small project-specific vector prefix.
In our extensive empirical evaluation we found that all the customization strategies lead to significant model improvements on a target project in terms of both intrinsic and task-specific metrics. While there is no unambiguous winner among the customization strategies, each approach can provide specific benefits in particular deployment scenarios. This paper provides insights on these customization strategies and their benefits and drawbacks, as well as guidelines and suggestions on which one to use based on training cost, memory and storage requirements, number of users, and deployment scenario.

2 MOTIVATION
Software projects are often classified based on their architecture (e.g., web, server, monolithic), domain (e.g., finance, healthcare), or topic and usage (e.g., games, editors). In this context, several techniques have been proposed in the literature for the task of software categorization, which aims at organizing projects into groups that broadly describe the behavior or topic of the software. MUDABlue [13] relies on Latent Semantic Indexing (LSI), an Information Retrieval (IR) technique, to automatically categorize software systems in open source software repositories. For the same task, LACT [27] uses Latent Dirichlet Allocation (LDA), and recently neural text classification with word embeddings has been used [14] to categorize similar software projects.
While projects can be broadly categorized, each individual software project, apart from trivial forks and clones, has peculiar characteristics which make it unique. Codebases have different user-defined types, API usages, specific coding styles, and identifier preferences chosen by developers. These idiosyncrasies represent an additional level of complexity for models that aim at generating code for a variety of software projects.
This is exacerbated by the fact that transformer models only receive a limited-size input during inference, often considering only the current source code file. This confined window of tokens (commonly set at 1024 tokens) cannot provide a complete view of the project with its peculiarities. An accurate generation requires information about packages, classes, APIs, and identifiers that are external to the portion of code provided as input. Thus, we argue for personalized models that can generate custom code for specific projects.
As an exploratory study, we begin by observing projects' diversity in terms of the tokens used in their source code. Specifically, we are interested in understanding the amount of shared tokens among different software projects. This could serve as an initial, rough proxy metric to measure project diversity and potentially motivate the need for personalized models.
We select 930 Java software projects randomly sampled from GitHub that declare an open source license, have been updated within the last five years, and are not forks. These projects belong to the validation set of the open dataset Methods2Test [29]. For each project, we collect all the available .java files, combining them into a single-project corpus. Next, we remove licensing information and comments using regular expressions, then tokenize the corpus using the gensim [22] tokenizer (with the lowercase setting). From the list of tokens, we compute the set of unique tokens used within the project, and exclude the Java keywords from this set (similar to stopwords).
For each pair of projects p_i and p_j, with token sets T_i and T_j, we compute the shared token set T_{i,j} = T_i ∩ T_j. Next, for both p_i and p_j, we compute their corresponding ratios of shared tokens as follows: R_{i,j} = |T_{i,j}| / |T_i| and R_{j,i} = |T_{i,j}| / |T_j|.
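As a concrete illustration of this token-overlap analysis, the following is a minimal sketch (not the authors' released script) of how the per-project token sets and shared-token ratios could be computed. It assumes gensim's simple_preprocess approximates the lowercasing tokenizer setting described above, and the Java keyword list is abridged.

import re
from pathlib import Path
from gensim.utils import simple_preprocess

# Abridged Java keyword list (treated like stopwords).
JAVA_KEYWORDS = {
    "public", "private", "protected", "class", "interface", "void", "int",
    "long", "float", "double", "boolean", "char", "static", "final", "new",
    "return", "if", "else", "for", "while", "try", "catch", "throws",
    "import", "package", "extends", "implements", "this", "super", "null",
}

def project_tokens(project_dir: str) -> set:
    """Unique, lowercased tokens across a project's .java files, minus keywords."""
    tokens = set()
    for path in Path(project_dir).rglob("*.java"):
        src = path.read_text(errors="ignore")
        src = re.sub(r"/\*.*?\*/", "", src, flags=re.DOTALL)  # block comments / licenses
        src = re.sub(r"//.*", "", src)                        # line comments
        tokens.update(simple_preprocess(src))                 # gensim tokenizer, lowercased
    return tokens - JAVA_KEYWORDS

def shared_ratios(tokens_i: set, tokens_j: set) -> tuple:
    """R_{i,j} = |T_i ∩ T_j| / |T_i| and R_{j,i} = |T_i ∩ T_j| / |T_j|."""
    shared = tokens_i & tokens_j
    return len(shared) / len(tokens_i), len(shared) / len(tokens_j)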
Figure 1: Heat-map displaying the ratios of shared tokens among software projects. Most projects share relatively few identifiers with other codebases.

Figure 1 shows the ratio of shared tokens between each pair of projects as a heat-map. Projects are sorted in ascending order of the number of unique tokens used in their source code. Blue values indicate a low ratio of shared tokens (the darker, the lower), while red values indicate project pairs with a substantial amount of shared tokens. The heat-map appears mostly blue, indicating that the majority of projects share relatively few tokens with each other. The upper-right corner shows pairs with higher ratios (white/red points); these are ratios computed for very small projects whose tokens are contained in very large projects, hence the corner position. Overall, the majority of projects share relatively few identifiers with other projects. Specifically, if we consider all the values in the matrix R except for the diagonal (i.e., tokens shared with the project itself), a project on median shares only 13% of its tokens with another project, and the third quartile of the distribution is below 23%.
We consider this study only as a preliminary analysis of the diversity of software projects, which could motivate the need for personalized models. We acknowledge the limitations of this study, which could be extended by considering different types of tokenizers, preprocessing steps, and metrics. In Sec. 4 we design an experimental study that analyzes in detail the impact of personalization on the performance of transformer-based code generation models.

Figure 2: Overview of the Customization Approaches - Transformer models during fine-tuning, where the frozen parts of the model (not trainable) are displayed in gray: (a) Custom fine-tuning modifies all parameters during training; (b) L-EO trains only the embedding and output layers; (c) L-LDB trains only the parameters of the last decoder block; (d) Prefix tuning adds a trainable prefix to the encoder and decoder blocks.

3 APPROACH
This section describes the proposed customization approaches for code generation models. We begin by formally defining the customization process, then we provide details for each of the fine-tuning strategies.

3.1 Customization Process
We use the term customization to refer to the process of fine-tuning a model m, previously trained on a generic dataset for a task t, with the goal of improving its performance on a specific dataset p. The performance of a machine learning model m on a dataset p is measured by one or more evaluation functions f(m, p), where f can be either a maximization (e.g., BLEU, top-k accuracy) or minimization (e.g., perplexity) function. The customization process is designed to modify the trainable parameters of the model m, obtaining the model m', such that the performance of m' on p is better than what was attained by m. Specifically, f(m', p) > f(m, p) for maximization functions, or f(m', p) < f(m, p) for minimization functions.
In this work, m is an encoder-decoder transformer model, t is a code generation task, and p is a target software project to which we intend to customize m.

3.2 Custom Fine-tuning
Custom fine-tuning is the most straightforward customization approach. The model to be customized is taken as is and trained on a selected project. All parameters are trainable during this process. Figure 2a shows the model during fine-tuning, where all the parameters of the encoder and decoder blocks, as well as the embedding and output layers, can be modified.

3.3 Lightweight Fine-tuning - Embeddings and Output Layer (L-EO)
Fully fine-tuning a model for every project or user may be prohibitive in terms of storage and memory costs. As a result, we explore ways to mitigate these costs by reducing the number of parameters that vary from one custom model to another. In our lightweight fine-tuning experiments, we achieve this by freezing most parameters in the baseline model and only keeping a small subset trainable. Figure 2b shows the Lightweight fine-tuning - Embeddings and Output Layer (L-EO) design, where most of the model parameters are frozen (displayed in gray), and we allow only the embedding and output layer parameters to be fine-tuned, following the approach in [17].

3.4 Lightweight Fine-tuning - Last Decoder Block (L-LDB)
In this lightweight fine-tuning strategy, shown in Figure 2c (L-LDB), most of the model's parameters are kept frozen, and only the parameters in the last decoder block are trainable; this includes the self-attention, encoder-decoder attention, layer normalization, and feed-forward layers. The design decision of training only the last decoder block is motivated by experimental results analyzing the model's parameter changes during custom fine-tuning. Figure 3 reports the average absolute changes, during fine-tuning, in the parameters belonging to the different encoder and decoder blocks of a BART model. We observe that, as we go deeper through the transformer model, the average change in parameter values tends to increase, with the last decoder block showing the highest changes in parameter values. As a result, we hypothesize that it could be sufficient to tune the last decoder block and obtain performance improvements similar to the fully custom fine-tuned model.
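Both lightweight variants (L-EO above and L-LDB here) come down to toggling which parameters receive gradients. The following is a minimal PyTorch sketch under the assumption of a Hugging Face BART-style encoder-decoder; the public facebook/bart-large checkpoint stands in for the paper's baseline model, which is not distributed under that name.

from transformers import BartForConditionalGeneration

# Placeholder checkpoint; the paper's pre-trained baseline is assumed, not shown.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def freeze_all(m):
    for p in m.parameters():
        p.requires_grad = False

def unfreeze(module):
    for p in module.parameters():
        p.requires_grad = True

# L-EO: train only the shared token embeddings and the output (LM head) projection.
freeze_all(model)
unfreeze(model.model.shared)   # token embedding table
unfreeze(model.lm_head)        # output layer over the vocabulary (typically weight-tied in BART)

# L-LDB: train only the last decoder block (its self-attention, cross-attention,
# layer norms, and feed-forward sub-layers are all submodules of this layer).
freeze_all(model)
unfreeze(model.model.decoder.layers[-1])

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable / 1e6:.1f}M")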
[Figure 3: bar chart of the Average Absolute Change (y-axis) per Transformer Block (x-axis), for the blocks (Other, -1), (Encoder, 0) through (Encoder, 11), and (Decoder, 0) through (Decoder, 11).]

Figure 3: This figure shows the total average parameter change after fine-tuning to a new project domain, showing that the largest parameter changes occur in deeper parts of the model. This motivates our choice to try only fine-tuning the later layers of the model.
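The per-block statistic plotted in Figure 3 can be reproduced with a short analysis over two checkpoints. This is a sketch under the assumption of Hugging Face-style parameter names (e.g., model.decoder.layers.11.fc1.weight), not the authors' exact analysis code.

import re
from collections import defaultdict

def avg_abs_change_per_block(base_state, tuned_state):
    """Mean absolute parameter change per (encoder/decoder, block index) group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for name, base_param in base_state.items():
        delta = (tuned_state[name] - base_param).abs()
        match = re.search(r"(encoder|decoder)\.layers\.(\d+)\.", name)
        block = (match.group(1), int(match.group(2))) if match else ("other", -1)
        sums[block] += delta.sum().item()
        counts[block] += delta.numel()
    return {block: sums[block] / counts[block] for block in sums}

# Usage sketch:
#   base  = BartForConditionalGeneration.from_pretrained(BASE_DIR).state_dict()
#   tuned = BartForConditionalGeneration.from_pretrained(CUSTOM_DIR).state_dict()
#   changes = avg_abs_change_per_block(base, tuned)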
3.5 Prefix Tuning
Prefix tuning was first introduced by Li and Liang [15], with the goal of fine-tuning a general model for different tasks. The technique concatenates a sequence (prefix) of virtual tokens (trainable parameters) to the front of the input of every encoder and decoder block. In our context, the intuition behind this approach is that the prefix embeds the properties of a specific project, which allows the model to generate customized responses for that repository. Practically, we set the prefix length to 200 tokens, and thus with an embedding size of 1024, this gives a total of 1024 × 200 × 24 × 2 ≈ 10M trainable parameters. The prefix is initialized to the most frequent words in the repository for which the model is customized.
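As a back-of-the-envelope check of the ≈10M figure above (the factorization into 24 blocks and a factor of 2 per block is our reading of the expression in the text, not something the paper spells out):

embed_dim   = 1024  # BART-large hidden size
prefix_len  = 200   # virtual tokens prepended to every block's input
num_blocks  = 24    # 12 encoder blocks + 12 decoder blocks
per_block   = 2     # two prefix tensors per block, in our reading of 1024 x 200 x 24 x 2

prefix_params = embed_dim * prefix_len * num_blocks * per_block
print(prefix_params)  # 9_830_400, i.e. roughly 10M trainable parameters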
3.6 Trainable Parameters during Fine-tuning
Table 1 provides an overview of the number of total and trainable parameters involved in each customization process, in the case of a BART Transformer model with 406M parameters. Custom fine-tuning trains 100% of the 406M available parameters in the model. During L-EO fine-tuning, instead, only 13% (53M) of the parameters are trained. L-LDB fine-tuning reduces the number of trainable parameters to 4.2% (17M). Finally, Prefix tuning has the lowest number of trainable parameters, only 2.4% (10M) of the total, but these are additional parameters added to the model (total 416M).

Table 1: Comparing the number of trainable parameters in each fine-tuning method.

Customization Process   Total Parameters   Trained Parameters
Custom                  406M               406M (100%)
L-EO                    406M               53M (13%)
L-LDB                   406M               17M (4.2%)
Prefix                  416M               10M (2.4%)

4 EXPERIMENTAL DESIGN
The goal of our experimental design is to investigate whether custom models outperform the baseline model, leading to performance improvements in terms of intrinsic metrics (RQ1), as well as extrinsic task-specific metrics (RQ2). Next, we analyze and compare the different customization approaches in terms of training and compute costs (RQ3), as well as model size and required storage for deployment.
In our case study, we chose unit test case generation as our code generation task t, and AthenaTest by Tufano et al. [29] as our baseline model m, which is a BART transformer model pre-trained on source code and English, and fine-tuned on Java unit test generation. The task is modeled as a translation task, where the input is a focal method (i.e., the method under test), and the output is a test case which tests the focal method's correctness. We randomly sample 20 projects from the test set, each of those representing the dataset p on which a custom model is fine-tuned. Specifically, for each project p, we start from the baseline model m and fine-tune four different custom models according to the four proposed fine-tuning strategies. For each project and fine-tuning strategy (e.g., L-EO), we fine-tune and evaluate the models using 4-fold cross-validation. The models are trained until the best validation loss is reached, independently for every fold, every repository, and every customization approach. In total, we fine-tune and evaluate 20 (projects) × 4 (approaches) × 4 (folds) = 320 models.

4.1 Dataset
Table 2 reports information about the 20 GitHub repositories sampled from the test set, which will be used to customize our models. The table shows (i) the Project ID, which will be used in the paper to reference a specific project; (ii) the project name; (iii) the project size in terms of disk usage; (iv) the popularity of the project in terms of the number of stars obtained on GitHub; and (v) the dataset size, which corresponds to the number of data points for the unit test generation task (i.e., pairs of focal method and test case). The list of projects represents a diverse set of repositories with different sizes, domains, and popularity. They span from small personal projects (e.g., Tutorials with 6 stars) to open source projects developed by large organizations such as Apache and Google.

Table 2: Dataset - Projects used for customization

Project ID Name Project Size (MB) Stars Dataset Size


26644682 Talend Data Prep 68.8 56 651
40735368 GeoTools 62.4 8 653
107330274 Titus Control Plane 36.0 302 660
52972024 Smart Actors 57.8 22 704
9714608 Arakhnê Foundation Classes 17.9 13 753
60701247 Android Plugin for IntelliJ IDEA 1026.7 716 754
14550159 EverRest 5.3 24 761
9278888 Brave 18.8 2084 787
66940520 DHIS 2 118.1 211 862
33645537 Tutorials 34.4 6 878
62253355 Mobi 62.6 35 986
155883728 OakPAL 15.0 9 1005
4710920 Apache Dubbo 36.1 36231 1058
29603649 Wilma 6.7 40 1074
42949039 Herd 227.2 127 1249
1381673 Drools 176.7 3908 1394
1244027 ModeShape 131.1 212 1550
73948366 AthenZ 38.8 639 1920
660443 Chemistry Development Kit (CDK) 214.8 305 2591
87849739 Eclipse Ditto™ Project 52.5 311 2842

4.2 RQ1: Intrinsic Evaluation Metrics
RQ1: Do custom models obtain better performance on intrinsic metrics, such as BLEU and perplexity, w.r.t. the baseline? To begin, we investigate how the different model customization approaches described in Sec. 3 score on intrinsic metrics such as BLEU and perplexity. All approaches entail fine-tuning the baseline model on the dataset of a specific project, with the choice of parameters being tuned depending on the approach taken. The four variants are trained independently until the best validation loss is achieved. We report the BLEU4 score and the mean perplexity per token on the test fold, for all the 20 projects. Next, we perform statistical tests to investigate whether the observed differences between the baseline and custom models are significant, as well as the differences among the customization approaches. Specifically, we rely on the Kruskal-Wallis test, a non-parametric statistical test.

4.3 RQ2: Task-specific Performances
RQ2: Do custom models improve on performance metrics specific to unit test generation? We want to investigate how the different customization approaches compare with respect to the downstream task of generating unit tests. Beyond BLEU score and perplexity, we would like to see if custom models can produce the correct target code, how closely their unit tests mimic the repository style, or even if they can perfectly match the desired output.
• Perfect Matches: We compare the model's output string with the target developer-written unit test. If the two strings are identical, this is considered a perfect match. We do not take spacing and indentation into account as we are using a Java dataset (where indentation is not required). We report the proportion of perfect matches among the top 5 model predictions.
• Abstracted Code Matches: We pass the model output and target output through the src2abs tool [28] to obtain an abstracted version, masking variable names, method names, etc. We also do not distinguish between different objects of the same type.
• Coding Style: For each project's custom model, we would like to determine how closely the model learns the developer's personal programming style and preferences. To this end, we extract the collection of all identifiers (i.e., variable and function names) from the unit tests written by the developer as well as from those generated by the models. We then pass these text outputs through a tf-idf vectorizer and compute the cosine similarity between them (a sketch follows this list). This allows us to compare the developer's and the models' word usage. We examine the similarity between the developer's unit tests and the tests generated by the baseline and custom models. This scores the vocabulary similarity of the unit tests with the model-generated code.
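A minimal sketch of the coding-style measurement described in the last bullet, using scikit-learn (an assumption; the paper does not name the vectorizer implementation it uses):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def identifier_similarity(developer_identifiers: str, generated_identifiers: str) -> float:
    """Cosine similarity between tf-idf vectors of two identifier streams.

    Each argument is a whitespace-separated string of identifiers (variable and
    function names) extracted from developer-written and model-generated tests.
    """
    vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_]*")
    matrix = vectorizer.fit_transform([developer_identifiers, generated_identifiers])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

# Higher scores mean the generated tests reuse the project's naming vocabulary.
score = identifier_similarity(
    "userRepository saveUser assertEquals expectedUser",   # hypothetical developer identifiers
    "userRepository saveUser assertEquals actualUser",     # hypothetical generated identifiers
)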
4.4 RQ3: Training Cost Comparison
RQ3: Given the same amount of compute, which custom models achieve the biggest performance improvement? Since our four training regimes tune a different number of parameters, simply comparing the training time or the number of optimization steps to reach the best validation loss may not be appropriate.

[Figure 4a: Top-K Accuracy - Match % (y-axis) vs. Top-K from 1 to 5 (x-axis) for Baseline, Custom, L-EO, L-LDB, and Prefix.]

(a) Exact (solid) and abstracted (dotted) match rate; (b) Coding style as identifier similarity

Figure 4: Task-specific metrics: (a) custom models outperform the baseline in terms of perfect matches (solid line) and abstract matches (dotted line); (b) custom models generate code that uses identifiers (i.e., variable and function names) that are more similar to the project codebase.

For a model with N parameters, we approximate the computation cost of a forward pass to be C ≈ 2N floating point operations per training token, with an additional correction for embedding layers. The backward pass takes roughly twice the amount of compute, but it is unnecessary for layers that are frozen. For additional details, we refer to Table 1 in [12]. We report the resulting compute in petaFLOPS-seconds.
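A rough illustration of this accounting, mirroring the simplification stated above (backward compute is counted only for trainable layers, and the embedding correction is omitted); the token count in the usage line is a hypothetical value, not taken from the paper:

def training_compute_pf_seconds(total_params: float, trainable_params: float,
                                training_tokens: float) -> float:
    """Approximate training compute in petaFLOPS-seconds (PF-s)."""
    forward  = 2.0 * total_params * training_tokens              # C ~= 2N FLOPs per token
    backward = 2.0 * (2.0 * trainable_params * training_tokens)  # ~2x forward, frozen layers skipped
    return (forward + backward) / 1e15

# Example: fully custom fine-tuning (406M trainable of 406M total) over ~1M training tokens.
print(training_compute_pf_seconds(406e6, 406e6, 1e6))  # ~2.4 PF-s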
5 RESULTS

5.1 RQ1: Intrinsic Evaluation Metrics
Table 3 presents the results of custom models in terms of the intrinsic metrics: BLEU and perplexity. Specifically, for each project, we report the average BLEU and perplexity over the four folds, achieved on the test set by the different customization strategies, as well as the baseline model. We observe notable improvements in both metrics for every project w.r.t. the baseline, with BLEU going from 16.1 achieved by the baseline model to 36-38 by custom models.
The statistical tests reported in Table 4 confirm that the improvements observed for the four customization techniques are statistically significant w.r.t. the baseline (p < 1e-7). However, we do not observe statistical significance in the differences among the customization strategies, meaning that, in terms of intrinsic metric performance, the differences are within the margin of error.

5.2 RQ2: Task-specific Performances
The results in terms of task-specific performance are presented in Figure 4. Plot 4a shows the top-k accuracy for perfect matches (solid line) and abstracted matches (dotted line), aggregated over the 20 projects. The baseline model outputs the same code structure (abstracted) in roughly 3% of all cases, and virtually never produces the exact target output (<1%). Moreover, its performance does not improve as we consider more predictions. All customization processes show significant improvement compared to the baseline. Specifically, these improvements are observed for every single project (full results will be available in our online appendix). Customized models produce the correct code structure as their top prediction in ∼13-14% of instances, and a perfect match in ∼4-6% of cases. They also tend to improve as we consider their top 5 predictions. Between the different customization processes, Custom consistently performs the best, closely followed by Prefix and L-LDB. When considering abstracted code matches, these three approaches are nearly identical. L-EO, however, performs slightly worse than the others.
Plot 4b shows the distribution of tf-idf cosine similarity computed between identifiers used in the developers' written tests and the models' generated outputs. We observe that the distribution for custom models is skewed towards the higher values of cosine similarity. This result demonstrates that custom models tend to use variable and function names that are more similar to what developers used in their own tests.

5.3 RQ3: Training Cost Comparison
For each customization process, we plot validation loss as a function of compute, as defined in Section 4.4. The results are presented in Figure 5, where the light lines represent the validation loss curve for each individual project and fold, while the bold line represents the average for each custom strategy. First note that Custom achieves very large gains during the first epoch, as evidenced by the fact that its validation loss starts much lower than L-EO and L-LDB. Custom also outperforms the other customization processes when given a limited amount of compute. However, we observe that beyond a certain amount of compute, Custom and L-LDB tend to achieve similar performance. In contrast, L-EO starts at the same validation loss as L-LDB but converges much more slowly to the best loss, requiring 2-3 times as much compute.

Table 3: The BLEU score and perplexity for the customization methods evaluated on the 20 projects in our test set.

Project | BLEU4: Base, Cust., L-EO, L-LDB, Prefix | Perplexity: Base, Cust., L-EO, L-LDB, Prefix
26644682 14.1 32.9 31.6 31.9 34.0 1.275 1.212 1.208 1.238 1.197
40735368 18.5 30.7 29.0 29.4 29.1 1.276 1.186 1.197 1.186 1.194
107330274 14.8 38.0 35.0 35.9 35.7 1.273 1.160 1.168 1.164 1.175
52972024 10.2 31.8 33.2 32.2 30.1 1.271 1.142 1.146 1.135 1.135
9714608 14.7 41.0 38.1 40.4 40.2 1.263 1.155 1.145 1.150 1.138
60701247 10.8 28.9 24.4 25.9 26.6 1.267 1.187 1.190 1.172 1.176
14550159 20.0 49.5 47.2 46.6 46.4 1.245 1.121 1.122 1.116 1.124
9278888 17.3 46.8 44.5 47.2 47.8 1.272 1.137 1.152 1.138 1.140
66940520 17.4 37.9 33.9 35.5 37.7 1.264 1.154 1.163 1.154 1.150
33645537 17.0 30.4 31.2 32.0 31.0 1.264 1.231 1.200 1.192 1.211
62253355 14.7 48.0 45.7 47.3 48.0 1.292 1.113 1.114 1.114 1.116
155883728 13.7 41.3 37.5 39.3 39.5 1.238 1.132 1.148 1.146 1.140
4710920 28.2 39.1 38.1 38.8 38.6 1.218 1.161 1.162 1.167 1.160
29603649 19.1 58.4 54.9 56.6 56.8 1.266 1.096 1.110 1.099 1.098
42949039 17.0 38.2 37.7 37.5 37.3 1.238 1.154 1.152 1.154 1.148
1381673 14.3 33.3 29.3 30.9 30.8 1.261 1.133 1.152 1.138 1.138
1244027 19.6 30.1 29.7 30.0 30.0 1.244 1.142 1.160 1.142 1.150
73948366 12.0 33.1 31.8 34.0 33.6 1.267 1.161 1.157 1.159 1.164
660443 15.0 34.0 37.2 36.5 34.3 1.281 1.180 1.170 1.169 1.177
87849739 13.4 45.1 47.0 48.9 46.8 1.259 1.138 1.136 1.124 1.144
Average 16.1 38.4 36.9 37.8 37.7 1.262 1.153 1.158 1.153 1.154

Table 4: Kruskal-Wallis test p-values testing the significance of the pairwise hypothesis that one customization method is superior to another. Custom strategies are significantly better than the baseline.

Method | BLEU4: Base, Cust., L-EO, L-LDB, Prefix | Perplexity: Base, Cust., L-EO, L-LDB, Prefix
Base - 3e-08 3e-08 3e-08 3e-08 - 3e-08 3e-08 3e-08 3e-08
Cust. 3e-08 - 0.4 0.7 0.7 3e-08 - 0.5 0.9 0.9
EO 3e-08 0.4 - 0.5 0.7 3e-08 0.5 - 0.5 0.5
LDB 3e-08 0.7 0.5 - 0.9 3e-08 0.9 0.5 - 0.8
Prefix 3e-08 0.7 0.7 0.9 - 3e-08 0.9 0.5 0.8 -
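Table 4 reports Kruskal-Wallis p-values for pairwise comparisons over per-project scores. A minimal sketch with SciPy, using only the first five BLEU4 rows of Table 3 (Base vs. Cust.) for brevity rather than all 20 projects used in the paper:

from scipy.stats import kruskal

# First five per-project BLEU4 scores from Table 3 (baseline vs. custom fine-tuning).
baseline_bleu = [14.1, 18.5, 14.8, 10.2, 14.7]
custom_bleu   = [32.9, 30.7, 38.0, 31.8, 41.0]

statistic, p_value = kruskal(baseline_bleu, custom_bleu)
print(f"H = {statistic:.2f}, p = {p_value:.1e}")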

Since the prefix parameters suffer from poor initialization, Prefix is the most expensive customization process. To overcome this problem, it is possible to first train the prefix on a large generic dataset. Then, given proper hyperparameter tuning, it is possible to substantially cut down the compute cost of customizing the prefix.

6 DISCUSSION & LESSONS LEARNED
The four customization strategies considered in this work are effective in improving a code generation model's performance on a given software project. Specifically, all custom models significantly outperform the baseline in terms of intrinsic metrics (i.e., BLEU and perplexity) as well as task-specific metrics (i.e., abstract and raw matches). While the differences among the customization approaches are not significant (no clear winner), each strategy offers specific advantages in different circumstances and deployment scenarios.
Custom fine-tuning achieves the overall best performance, and the customization process is relatively fast and efficient. This is somewhat expected, since this customization strategy allows all the model's parameters to be tuned on the specific project. This characteristic also leads to the major disadvantage of this approach: each custom model is an entire copy of the original model. Storage and inference costs could become prohibitive when serving many users with personalized custom models.
Lightweight fine-tuning achieves good results while training fewer parameters. This makes it possible to serve potentially many users with custom models which can be stored and loaded efficiently. Specifically, L-LDB trains fewer parameters than L-EO; however, the latter could allow the embedding and output layers to be deployed on the user side, with a privacy-preserving focus.
Prefix fine-tuning trains the lowest number of parameters (only 2.4% for a BART model), while still improving over the baseline. However, it increases the total number of parameters of the model (prefixes are additional virtual tokens) and requires more compute time to achieve good performance, mostly due to the prefix initialization problem. On the bright side, this strategy allows requests from different users (with different prefixes) to be batched together and processed by a single model, generating personalized outputs.

[Figure 5: cross entropy loss (y-axis) vs. compute in PF-seconds (x-axis, log scale) for Custom, L-EO, L-LDB, and Prefix.]

Figure 5: Validation Loss vs Compute (PF-seconds) - Light lines represent the validation loss curve for each individual project and fold, while the bold line represents the average for each custom strategy. Custom is the most efficient, lightweight approaches require slightly more compute to reach a comparable validation loss, while prefix is the least efficient, suffering from poor initialization.

7 THREATS TO VALIDITY
The major threats to our study relate to external validity, which concerns the generalization of our findings. Our study is performed on a specific coding task (i.e., test case generation) and for a single programming language (i.e., Java). The findings may not generalize to all coding tasks and to different programming languages. However, our extensive empirical analysis, investigating different personalization strategies in terms of several performance metrics, can provide guidelines for applying these techniques to different coding tasks, languages, and datasets. It is important to note that, while each coding task has its peculiarities, the test case generation task involves the generation of complete methods, variable assignments, method calls, and different types of statements, and thus can serve as a good generic coding task. The Java language is among the most popular and is similar to other programming languages such as C# and Ruby. As part of our future work we intend to apply these personalization techniques to different coding tasks and languages.

8 RELATED WORK
This work is related to two areas of the existing literature: neural source code generation and model personalization. Neural code generation has generated intense recent interest in NLP, using Transformer models [31] in particular for code completion [3, 4, 7, 21, 25, 26], code synthesis from examples [6], natural language to code [2, 6, 7], code feature summarization [1, 16, 18, 19, 23, 32], code search [9, 10], unit test generation [29], and even bug fixing [8] and detection [33]. This paper is naturally an extension and evaluation of personalized unit test generation as studied by Tufano et al. [29], and an important contribution to the understanding of optimization in a deployment scenario.
Much of the previous literature on personalized models focuses on client-side training to keep data on device [20, 24], and most work is in the domain of search query completion [11], natural language completion [20], or even automated speech recognition [24]. Naturally, this work extends the domain of evaluation beyond natural language tasks and into the software engineering domain. This paper does not evaluate methods for client-side training with restricted resources, however, as the most powerful large language models which enable state-of-the-art code synthesis have 10-100 billion parameters. At the time of writing, such large models cannot be executed in a reasonable amount of time on most consumer laptops. We leave to future work extending these studies to models which have been pruned, quantized, distilled, and otherwise optimized to run in limited resource environments.

9 CONCLUSION
In this paper we explored different ways to customize a code generation model for a given codebase, with the goal of improving its performance on a target project. We described and analyzed four customization strategies and applied them on 20 different software projects for the task of generating unit test cases.
Specifically, we considered the following strategies: (i) custom fine-tuning, which allows all the model parameters to be tuned on the target project; (ii) L-EO fine-tuning, a lightweight training which freezes most of the model's parameters, tuning only the embedding and output layers; (iii) L-LDB fine-tuning, a lightweight training which only tunes the last decoder block; (iv) prefix tuning, which keeps language model parameters frozen, but optimizes a small project-specific vector (prefix).
In our extensive empirical evaluation we found that all the customization strategies lead to significant model improvements on a target project, in terms of both intrinsic and task-specific metrics, with the custom models adapting to the coding style of the target project. While there is no clear winner among the customization strategies, each approach can provide specific benefits in particular deployment scenarios.

REFERENCES
[1] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2018. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400 (2018).
[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021).
[3] Marc Brockschmidt, Miltiadis Allamanis, Alexander L Gaunt, and Oleksandr Polozov. 2018. Generative Code Modeling with Graphs. In International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1805.08490
[4] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from examples to improve code completion systems. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering. 213–222.
[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[6] Xinyun Chen, Chang Liu, and Dawn Song. 2018. Towards Synthesizing Complex Programs From Input-Output Examples. In International Conference on Learning Representations. https://openreview.net/forum?id=Skp1ESxRZ
[7] Colin Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 9052–9065.
[8] Dawn Drain, Chen Wu, Alexey Svyatkovskiy, and Neel Sundaresan. 2021. Generating Bug-Fixes Using Pretrained Transformers. arXiv preprint arXiv:2104.07896 (2021).
[9] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 1536–1547.
[10] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv preprint arXiv:1909.09436 (2019).
[11] Aaron Jaech and Mari Ostendorf. 2018. Personalized Language Model for Query Auto-Completion. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 700–705.
[12] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR abs/2001.08361 (2020). https://arxiv.org/abs/2001.08361
[13] Shinji Kawaguchi, Pankaj K Garg, Makoto Matsushita, and Katsuro Inoue. 2006. Mudablue: An automatic categorization system for open source repositories. Journal of Systems and Software 79, 7 (2006), 939–953.
[14] Alexander LeClair, Zachary Eberhart, and Collin McMillan. 2018. Adapting neural text classification for improved software categorization. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE. https://doi.org/10.1109/ICSME.2018.00056
[15] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 4582–4597. https://doi.org/10.18653/v1/2021.acl-long.353
[16] Shangqing Liu, Yu Chen, Xiaofei Xie, Jing Kai Siow, and Yang Liu. 2021. Retrieval-Augmented Generation for Code Summarization via Hybrid GNN. In International Conference on Learning Representations (ICLR). https://doi.org/10.48550/arXiv.2006.05405
[17] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. 2021. Pretrained Transformers as Universal Computation Engines. CoRR abs/2103.05247 (2021). https://arxiv.org/abs/2103.05247
[18] Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 2013. Automatic generation of natural language summaries for Java classes. In 2013 21st International Conference on Program Comprehension (ICPC). IEEE, 23–32.
[19] Laura Moreno, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrian Marcus, and Gerardo Canfora. 2014. Automatic generation of release notes. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 484–495.
[20] Vadim Popov, Mikhail Kudinov, Irina Piontkovskaya, Petr Vytovtov, and Alex Nevidomsky. 2018. Distributed fine-tuning of language models on private data. In International Conference on Learning Representations.
[21] Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. 419–428.
[22] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
[23] Simone Scalabrino, Gabriele Bavota, Christopher Vendome, Mario Linares-Vásquez, Denys Poshyvanyk, and Rocco Oliveto. 2017. Automatically assessing code understandability: How far are we? In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 417–427.
[24] Joel Shor, Dotan Emanuel, Oran Lang, Omry Tuval, Michael Brenner, Julie Cattiau, Fernando Vieira, Maeve McNally, Taylor Charbonneau, Melissa Nollstadt, et al. 2019. Personalizing ASR for dysarthric and accented speech with limited data. arXiv preprint arXiv:1907.13511 (2019). https://doi.org/10.21437/Interspeech.2019-1427
[25] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode compose: code generation using transformer. In ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, Prem Devanbu, Myra B. Cohen, and Thomas Zimmermann (Eds.). ACM, 1433–1443. https://doi.org/10.1145/3368089.3417058
[26] Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. 2019. Pythia: AI-assisted Code Completion System. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2727–2735.
[27] Kai Tian, Meghan Revelle, and Denys Poshyvanyk. 2009. Using latent dirichlet allocation for automatic categorization of software. In 2009 6th IEEE International Working Conference on Mining Software Repositories. IEEE, 163–166. https://doi.org/10.1109/MSR.2009.5069496
[28] Michele Tufano. 2018. src2abs. https://github.com/micheletufano/src2abs.
[29] Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2021. Unit Test Case Generation with Transformers and Focal Context. arXiv preprint arXiv:2009.05617 (2021).
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017). http://arxiv.org/abs/1706.03762
[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[32] Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 397–407.
[33] Juan Zhai, Xiangzhe Xu, Yu Shi, Guanhong Tao, Minxue Pan, Shiqing Ma, Lei Xu, Weifeng Zhang, Lin Tan, and Xiangyu Zhang. 2020. CPC: Automatically classifying and propagating natural language comments via program analysis. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1359–1371.
