Exploring and Evaluating Personalized Models For Code Generation
Figure 2: Overview of the Customization Approaches - Transformer models during fine-tuning, where the frozen parts of the
model (not trainable) are displayed in gray: (a) Custom fine-tuning modifies all parameters during training; (b) L-EO trains
only the embedding and output layers; (c) L-LDB trains only the parameters of the last decoder block; (d) Prefix tuning
adds a trainable prefix to the encoder and decoder blocks.
in the matrix 𝑅 except for the diagonal (i.e., tokens shared with the project itself), a project on median shares only 13% of its tokens with another project, and the third quartile of the distribution is below 23%.

We consider this study only as a preliminary analysis into the diversity of software projects, which could motivate the need for personalized models. We acknowledge the limitations of this study, which could be extended by considering different types of tokenizers, preprocessing steps, and metrics. In Sec. 4 we design an experimental study that analyzes in detail the impact of personalization on the performance of transformer-based code generation models.

3 APPROACH

This section describes the proposed customization approaches for code generation models. We begin by formally defining the customization process, then we provide details for each of the fine-tuning strategies.

3.1 Customization Process

We use the term customization to refer to the process of fine-tuning a model 𝑚, previously trained on a generic dataset for a task 𝑡, with the goal of improving its performance on a specific dataset 𝑝. The performance of a machine learning model 𝑚 on a dataset 𝑝 is measured by one or more evaluation functions 𝑓(𝑚, 𝑝), where 𝑓 can be either a maximization (e.g., BLEU, top-k accuracy) or minimization (e.g., perplexity) function. The customization process is designed to modify the trainable parameters of the model 𝑚, obtaining the model 𝑚′, such that the performance of 𝑚′ on 𝑝 is better than what was attained by 𝑚. Specifically, 𝑓(𝑚′, 𝑝) > 𝑓(𝑚, 𝑝) for maximization functions, or 𝑓(𝑚′, 𝑝) < 𝑓(𝑚, 𝑝) for minimization functions.

In this work, 𝑚 is an encoder-decoder transformer model, 𝑡 is a code generation task, and 𝑝 is a target software project to which we intend to customize 𝑚.

3.2 Custom Fine-tuning

Custom fine-tuning is the most straightforward customization approach. The model to be customized is taken as is and trained on a selected project. All parameters are trainable during this process. Figure 2a shows the model during fine-tuning, where all the parameters of the encoder and decoder blocks, as well as the embedding and output layers, can be modified.

3.3 Lightweight Fine-tuning - Embeddings and Output Layer (L-EO)

Fully fine-tuning a model for every project or user may be prohibitive in terms of storage and memory costs. As a result, we explore ways to mitigate these costs by reducing the number of parameters that vary from one custom model to another. In our lightweight fine-tuning experiments, we achieve this by freezing most parameters in the baseline model and keeping only a small subset trainable. Figure 2b shows the Lightweight fine-tuning - Embeddings and Output Layer (L-EO) design, where most of the model parameters are frozen (displayed in gray) and only the embedding and output layer parameters are fine-tuned, following the approach in [17].
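To make the L-EO setup concrete, the sketch below freezes every parameter of a Hugging Face BART-style encoder-decoder and then re-enables only the embedding and output (LM head) layers. The checkpoint name and the parameter-name filters are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal L-EO sketch (assumed tooling: Hugging Face Transformers; not the paper's exact setup).
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # illustrative checkpoint

# Freeze everything first.
for param in model.parameters():
    param.requires_grad = False

# Re-enable only embedding and output (LM head) parameters, matched by name.
trainable_fragments = ("shared", "embed_tokens", "embed_positions", "lm_head")
for name, param in model.named_parameters():
    if any(fragment in name for fragment in trainable_fragments):
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.1f}%)")
```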
Table 1: Comparing the number of trainable parameters in each fine-tuning method.

3.4 Lightweight Fine-tuning - Last Decoder Block (L-LDB)
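The L-LDB variant keeps only the parameters of the last decoder block trainable. Under the same assumptions as the sketch above, this amounts to:

```python
# Minimal L-LDB sketch (assumed tooling: Hugging Face Transformers; not the paper's exact setup).
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # illustrative checkpoint

for param in model.parameters():          # freeze the whole model
    param.requires_grad = False
for param in model.model.decoder.layers[-1].parameters():  # unfreeze the last decoder block only
    param.requires_grad = True
```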
4.2 RQ1: Intrinsic Evaluation Metrics

RQ1: Do custom models obtain better performance on intrinsic metrics, such as BLEU and perplexity, w.r.t. the baseline? To begin, we investigate how the different model customization approaches described in Sec. 3 score on intrinsic metrics such as BLEU and perplexity. All approaches entail fine-tuning the baseline model to the dataset of a specific project, with the choice of parameters being tuned depending on the approach taken. The four variants are trained independently until the best validation loss is achieved. We report the BLEU4 score and the mean perplexity per token on the test fold, for all the 20 projects. Next, we perform statistical tests to investigate whether the observed differences between the baseline and custom models are significant, as well as differences among the customization approaches. Specifically, we rely on the Kruskal-Wallis test, a non-parametric statistical test.
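As an illustration of how these two intrinsic metrics can be computed (tokenization and aggregation choices here are assumptions, not necessarily those used in the paper), BLEU-4 and per-token perplexity can be obtained as follows:

```python
# Illustrative BLEU-4 and perplexity computation (not the paper's exact evaluation pipeline).
import math
from nltk.translate.bleu_score import corpus_bleu

# Hypothetical tokenized target test and model prediction.
references = [[["assert", "Equals", "(", "expected", ",", "actual", ")", ";"]]]
hypotheses = [["assert", "Equals", "(", "expected", ",", "result", ")", ";"]]

# Default weights (0.25, 0.25, 0.25, 0.25) give BLEU-4, i.e. up to 4-gram precision.
bleu4 = corpus_bleu(references, hypotheses)

# Perplexity is the exponential of the mean per-token negative log-likelihood
# reported by the model (hypothetical NLL values below).
token_nlls = [0.21, 0.05, 0.33, 0.18]
perplexity = math.exp(sum(token_nlls) / len(token_nlls))

print(f"BLEU-4: {bleu4:.3f}, perplexity: {perplexity:.3f}")
```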
4.3 RQ2: Task-specific Performances

RQ2: Do custom models improve on performance metrics specific to unit test generation? We want to investigate how the different customization approaches compare with respect to the downstream task of generating unit tests. Beyond BLEU score and perplexity, we would like to see if custom models can produce the correct target code, how closely their unit tests mimic the repository style, or even if they can perfectly match the desired output.

• Perfect Matches: We compare the model's output string with the target developer-written unit test. If the two strings are identical, this is considered a perfect match. We do not take spacing and indentation into account as we are using a Java dataset (where indentation is not required). We report the proportion of perfect matches among the top 5 model predictions.
• Abstracted Code Matches: We pass the model output and target output through the src2abs tool [28] to obtain an abstracted version, masking variable names, method names, etc. We also do not distinguish between different objects of the same type.
• Coding Style: For each project's custom model, we would like to determine how closely the model learns the developer's personal programming style and preferences. To this end, we extract the collection of all identifiers (i.e., variable and function names) from the unit tests written by the developer as well as those generated by the models. We then pass these text outputs through a tf-idf vectorizer and compute the cosine similarity between them (a sketch of this computation follows the list). This allows us to compare the developer's and the models' word usage. We examine the similarity between the developer's unit tests and the tests generated by the baseline and custom models. This scores the vocabulary similarity of the unit tests with the model-generated code.
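The sketch below shows one way to compute this identifier-based similarity with scikit-learn; the identifier strings and the vectorizer settings are assumptions, since the paper does not spell out its exact configuration.

```python
# Illustrative coding-style similarity: tf-idf over extracted identifiers, then cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical identifier collections (one space-separated "document" per source).
developer_identifiers = "testLogin userService assertEquals expectedUser actualUser"
model_identifiers = "testLogin userService assertEquals expected result"

vectorizer = TfidfVectorizer(token_pattern=r"\S+")  # treat each identifier as one token
matrix = vectorizer.fit_transform([developer_identifiers, model_identifiers])

similarity = cosine_similarity(matrix[0], matrix[1])[0, 0]
print(f"identifier similarity: {similarity:.2f}")
```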
4.4 RQ3: Training Cost Comparison

RQ3: Given the same amount of compute, which custom models achieve the biggest performance improvement? Since our four training regimes tune a different number of parameters, simply comparing the training time or number of optimization steps to reach the best validation loss may not be appropriate. For a model with 𝑁 parameters, we approximate the computation cost of a forward pass to be 𝐶 ≈ 2𝑁 floating point operations per training token, with an additional correction for embedding layers. The backward pass takes roughly twice the amount of compute, but it is unnecessary for layers that are frozen. For additional details, we refer to Table 1 in [12]. We report the resulting compute in petaFLOPS-seconds.
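As a back-of-the-envelope illustration of this accounting (the parameter counts and token budget below are made-up values, not the paper's), the compute in petaFLOPS-seconds can be estimated as follows:

```python
# Rough training-compute estimate: forward pass ~2N FLOPs per token over all N parameters,
# backward pass ~twice the forward cost but only for the trainable parameters
# (following the approximation described above; embedding corrections are omitted here).

def training_compute_pf_seconds(total_params: float, trainable_params: float, tokens: float) -> float:
    forward_flops = 2.0 * total_params * tokens
    backward_flops = 2.0 * (2.0 * trainable_params) * tokens
    return (forward_flops + backward_flops) / 1e15  # petaFLOPS-seconds

# Hypothetical example: a ~400M-parameter model trained on 50M tokens.
full = training_compute_pf_seconds(4e8, 4e8, 5e7)   # custom fine-tuning: all parameters trainable
light = training_compute_pf_seconds(4e8, 4e7, 5e7)  # lightweight variant: ~10% trainable
print(f"full fine-tuning: {full:.0f} PF-s, lightweight: {light:.0f} PF-s")
```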
Figure 4: Task-specific metrics. (a) Exact (solid) and abstracted (dotted) match rate: custom models outperform the baseline in terms of perfect matches (solid line) and abstract matches (dotted line); (b) Coding style as identifier similarity: custom models generate code that uses identifiers (i.e., variable and function names) that are more similar to the project codebase.
5 RESULTS

5.1 RQ1: Intrinsic Evaluation Metrics

Table 3 presents the results of custom models in terms of the intrinsic metrics: BLEU and perplexity. Specifically, for each project, we report the average BLEU and perplexity over the four folds, achieved on the test set by the different customization strategies, as well as the baseline model. We observe notable improvements in both metrics for every project w.r.t. the baseline, with BLEU going from 16.1 achieved by the baseline model to 36-38 by custom models.

The statistical tests reported in Table 4 confirm that the improvements observed for the four customization techniques are statistically significant w.r.t. the baseline (𝑝 < 1e-7). However, we do not observe statistical significance in the differences among the customization strategies, meaning that, in terms of intrinsic metrics, the differences are within the margin of error.

5.2 RQ2: Task-specific Performances

The results in terms of task-specific performance are presented in Figure 4. Plot 4a shows the top-k accuracy for perfect matches (solid line) and abstracted matches (dotted line), aggregated over the 20 projects. The baseline model outputs the same code structure (abstracted) in roughly 3% of all cases, and virtually never produces the exact target output (<1%). Moreover, its performance does not improve as we consider more predictions. All customization processes show significant improvement compared to the baseline. Specifically, these improvements are observed for every single project (full results will be available in our online appendix). Customized models produce the correct code structure as their top prediction in ∼13-14% of instances, and a perfect match in ∼4-6% of cases. They also tend to improve as we consider their top 5 predictions. Between the different customization processes, Custom consistently performs the best, closely followed by Prefix and L-LDB. When considering abstracted code matches, these three approaches are nearly identical. L-EO, however, performs slightly worse than the others.

Plot 4b shows the distribution of tf-idf cosine similarity computed between identifiers used in the developers' written tests and the models' generated outputs. We observe that the distribution for custom models is skewed towards the higher values of cosine similarity. This result demonstrates that custom models tend to use variable and function names that are more similar to what developers used in their own tests.

5.3 RQ3: Training Cost Comparison

For each customization process, we plot validation loss as a function of compute, as defined in Sec. 4.4. The results are presented in Figure 5, where the light lines represent the validation loss curve for each individual project and fold, while the bold line represents the average for each custom strategy. First note that Custom achieves very large gains during the first epoch, as evidenced by the fact that its validation loss starts much lower than L-EO and L-LDB. Custom also outperforms other customization processes when given a limited amount of compute. However, we observe that beyond a certain amount of compute, Custom and L-LDB tend to achieve similar performances. In contrast, L-EO starts at the same validation loss as L-LDB but converges much more slowly to the best loss, requiring 2-3 times as much compute.

Since the prefix parameters suffer from poor initialization, Prefix is the most expensive customization process. To overcome this problem, it is possible to first train the prefix on a large generic dataset. Then, given proper hyperparameter tuning, it is possible to substantially cut down the compute cost of customizing the prefix.
Table 3: The BLEU score and perplexity for the customization methods evaluated on the 20 projects in our test set.
Project | BLEU4: Base, Cust., L-EO, L-LDB, Prefix | Perplexity: Base, Cust., L-EO, L-LDB, Prefix
26644682 14.1 32.9 31.6 31.9 34.0 1.275 1.212 1.208 1.238 1.197
40735368 18.5 30.7 29.0 29.4 29.1 1.276 1.186 1.197 1.186 1.194
107330274 14.8 38.0 35.0 35.9 35.7 1.273 1.160 1.168 1.164 1.175
52972024 10.2 31.8 33.2 32.2 30.1 1.271 1.142 1.146 1.135 1.135
9714608 14.7 41.0 38.1 40.4 40.2 1.263 1.155 1.145 1.150 1.138
60701247 10.8 28.9 24.4 25.9 26.6 1.267 1.187 1.190 1.172 1.176
14550159 20.0 49.5 47.2 46.6 46.4 1.245 1.121 1.122 1.116 1.124
9278888 17.3 46.8 44.5 47.2 47.8 1.272 1.137 1.152 1.138 1.140
66940520 17.4 37.9 33.9 35.5 37.7 1.264 1.154 1.163 1.154 1.150
33645537 17.0 30.4 31.2 32.0 31.0 1.264 1.231 1.200 1.192 1.211
62253355 14.7 48.0 45.7 47.3 48.0 1.292 1.113 1.114 1.114 1.116
155883728 13.7 41.3 37.5 39.3 39.5 1.238 1.132 1.148 1.146 1.140
4710920 28.2 39.1 38.1 38.8 38.6 1.218 1.161 1.162 1.167 1.160
29603649 19.1 58.4 54.9 56.6 56.8 1.266 1.096 1.110 1.099 1.098
42949039 17.0 38.2 37.7 37.5 37.3 1.238 1.154 1.152 1.154 1.148
1381673 14.3 33.3 29.3 30.9 30.8 1.261 1.133 1.152 1.138 1.138
1244027 19.6 30.1 29.7 30.0 30.0 1.244 1.142 1.160 1.142 1.150
73948366 12.0 33.1 31.8 34.0 33.6 1.267 1.161 1.157 1.159 1.164
660443 15.0 34.0 37.2 36.5 34.3 1.281 1.180 1.170 1.169 1.177
87849739 13.4 45.1 47.0 48.9 46.8 1.259 1.138 1.136 1.124 1.144
Average 16.1 38.4 36.9 37.8 37.7 1.262 1.153 1.158 1.153 1.154
Table 4: Kruskal-Wallis test p-values testing the significance of the pairwise hypothesis that one customization method is superior to another. Custom strategies are significantly better than the baseline.

Method | BLEU4: Base, Cust., L-EO, L-LDB, Prefix | Perplexity: Base, Cust., L-EO, L-LDB, Prefix
Base - 3e-08 3e-08 3e-08 3e-08 - 3e-08 3e-08 3e-08 3e-08
Cust. 3e-08 - 0.4 0.7 0.7 3e-08 - 0.5 0.9 0.9
EO 3e-08 0.4 - 0.5 0.7 3e-08 0.5 - 0.5 0.5
LDB 3e-08 0.7 0.5 - 0.9 3e-08 0.9 0.5 - 0.8
Prefix 3e-08 0.7 0.7 0.9 - 3e-08 0.9 0.5 0.8 -
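For instance, the baseline-vs-Custom BLEU comparison in Table 4 can be reproduced in spirit by running SciPy's Kruskal-Wallis test on the per-project BLEU4 scores from Table 3; the exact p-value depends on how scores are aggregated across folds, so the snippet below only approximates the paper's procedure.

```python
# Pairwise Kruskal-Wallis test on the Base and Cust. BLEU4 columns of Table 3.
from scipy.stats import kruskal

base = [14.1, 18.5, 14.8, 10.2, 14.7, 10.8, 20.0, 17.3, 17.4, 17.0,
        14.7, 13.7, 28.2, 19.1, 17.0, 14.3, 19.6, 12.0, 15.0, 13.4]
custom = [32.9, 30.7, 38.0, 31.8, 41.0, 28.9, 49.5, 46.8, 37.9, 30.4,
          48.0, 41.3, 39.1, 58.4, 38.2, 33.3, 30.1, 33.1, 34.0, 45.1]

statistic, p_value = kruskal(base, custom)
print(f"H = {statistic:.2f}, p = {p_value:.1e}")  # a very small p-value, in line with Table 4
```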
6 DISCUSSION & LESSONS LEARNED

The four customization strategies considered in this work are effective in improving a code generation model's performance on a given software project. Specifically, all custom models significantly outperform the baseline in terms of intrinsic metrics (i.e., BLEU and perplexity) as well as task-specific metrics (i.e., abstract and raw matches). While the differences among the customization approaches are not significant (no clear winner), each strategy offers specific advantages in different circumstances and deployment scenarios.

Custom fine-tuning achieves the overall best performance and the customization process is relatively fast and efficient. This is somewhat expected, since this customization strategy allows all the model's parameters to be tuned on the specific project. This characteristic also leads to the major disadvantage of this approach: each custom model is an entire copy of the original model. Storage and inference costs could become prohibitive when serving many users with personalized custom models.

Lightweight fine-tuning achieves good results while training fewer parameters. This makes it possible to serve potentially many users with custom models which can be stored and loaded efficiently. Specifically, L-LDB trains fewer parameters than L-EO; however, the latter could allow the embedding and output layers to be deployed on the user side, with a privacy-preserving focus.

Prefix fine-tuning trains the lowest number of parameters (only 2.4% for a BART model), while still improving over the baseline. However, it increases the total number of parameters of the model (prefixes are additional virtual tokens) and requires more compute time to achieve good performance, mostly due to the prefix initialization problem. On the bright side, this strategy makes it possible to batch together requests from different users (with different prefixes), which can be processed by a single model, generating personalized outputs.
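As an aside, a prefix-tuned variant of an encoder-decoder model can be set up with the Hugging Face PEFT library; this is an assumption about tooling (the paper follows the prefix-tuning formulation of [15]) and the hyperparameters shown are illustrative.

```python
# Illustrative prefix-tuning setup using PEFT (not the paper's implementation).
from transformers import BartForConditionalGeneration
from peft import PrefixTuningConfig, TaskType, get_peft_model

base_model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # illustrative checkpoint

config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=20,  # length of the trainable prefix; illustrative value
)
prefix_model = get_peft_model(base_model, config)
prefix_model.print_trainable_parameters()  # only the prefix parameters are trainable
```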
Figure 5: Validation loss (cross-entropy) vs. compute (PF-seconds) - Light lines represent the validation loss curve for each individual project and fold, while the bold line represents the average for each custom strategy. Custom is the most efficient, the lightweight approaches require slightly more compute to reach a comparable validation loss, while prefix is the least efficient, suffering from poor initialization.
7 THREATS TO VALIDITY

The major threats to our study relate to external validity, which concerns the generalization of our findings. Our study is performed on a specific coding task (i.e., test case generation) and for a single programming language (i.e., Java). The findings may not generalize to all coding tasks and different programming languages. However, our extensive empirical analysis, investigating different personalization strategies in terms of several performance metrics, can provide guidelines for applying these techniques to different coding tasks, languages, and datasets. It is important to note that, while each coding task has its peculiarities, the test case generation task involves the generation of complete methods, variable assignments, method calls, and different types of statements, and thus could serve as a good generic coding task. The Java language is among the most popular and is similar to other programming languages such as C# and Ruby. As part of our future work we intend to apply these personalization techniques to different coding tasks and languages.

8 RELATED WORK

This work is related to two areas of the existing literature: neural source code generation and model personalization. Neural code generation has generated intense recent interest in NLP, using Transformer models [31] in particular for code completion [3, 4, 7, 21, 25, 26], code synthesis from examples [6], natural language to code [2, 6, 7], code feature summarization [1, 16, 18, 19, 23, 32], code search [9, 10], unit test generation [29], and even bug fixing [8] and detection [33]. This paper is naturally an extension and evaluation of personalized unit test generation as studied by Tufano et al. [29], and an important contribution to understanding optimization in a deployment scenario.

Much of the previous literature on personalized models focuses on client-side training to keep data on device [20, 24], and most work is in the domain of search query completion [11], natural language completion [20], or even automated speech recognition [24]. Naturally, this work extends the domain of evaluation beyond natural language tasks and into the software engineering domain. This paper does not evaluate methods for client-side training with restricted resources, however, as the most powerful large language models which enable state-of-the-art code synthesis have 10-100 billion parameters. At the time of writing, such large models cannot be executed in a reasonable amount of time on most consumer laptops. We leave to future work extending these studies to models which have been pruned, quantized, distilled, and optimized to be run in limited-resource environments.
9 CONCLUSION

In this paper we explored different ways to customize a code generation model for a given codebase, with the goal of improving its performance on a target project. We described and analyzed four customization strategies and applied them to 20 different software projects for the task of generating unit test cases.

Specifically, we considered the following strategies: (i) custom fine-tuning, which allows all the model parameters to be tuned on the target project; (ii) L-EO fine-tuning, a lightweight training which freezes most of the model's parameters, tuning only the embedding and output layers; (iii) L-LDB fine-tuning, a lightweight training which only tunes the last decoder block; (iv) prefix tuning, which keeps the language model parameters frozen, but optimizes a small project-specific vector (prefix).

In our extensive empirical evaluation we found that all the customization strategies lead to significant model improvements on a target project, in terms of both intrinsic and task-specific metrics, with the custom models adapting to the coding style of the target project. While there is no clear winner among the customization strategies, each approach can provide specific benefits in particular deployment scenarios.

REFERENCES

[1] Uri Alon, Shaked Brody, Omer Levy, and Eran Yahav. 2018. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400 (2018).
[2] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732 (2021).
[3] Marc Brockschmidt, Miltiadis Allamanis, Alexander L Gaunt, and Oleksandr Polozov. 2018. Generative Code Modeling with Graphs. In International Conference on Learning Representations. https://doi.org/10.48550/arXiv.1805.08490
[4] Marcel Bruch, Martin Monperrus, and Mira Mezini. 2009. Learning from examples to improve code completion systems. In Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering. 213–222.
[5] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[6] Xinyun Chen, Chang Liu, and Dawn Song. 2018. Towards Synthesizing Complex Programs From Input-Output Examples. In International Conference on Learning Representations. https://openreview.net/forum?id=Skp1ESxRZ
[7] Colin Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 9052–9065.
[8] Dawn Drain, Chen Wu, Alexey Svyatkovskiy, and Neel Sundaresan. 2021. Generating Bug-Fixes Using Pretrained Transformers. arXiv preprint arXiv:2104.07896 (2021).
[9] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings. 1536–1547.
[10] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv preprint arXiv:1909.09436 (2019).
[11] Aaron Jaech and Mari Ostendorf. 2018. Personalized Language Model for Query Auto-Completion. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 700–705.
[12] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR abs/2001.08361 (2020). arXiv:2001.08361 https://arxiv.org/abs/2001.08361
[13] Shinji Kawaguchi, Pankaj K Garg, Makoto Matsushita, and Katsuro Inoue. 2006. Mudablue: An automatic categorization system for open source repositories. Journal of Systems and Software 79, 7 (2006), 939–953.
[14] Alexander LeClair, Zachary Eberhart, and Collin McMillan. 2018. Adapting neural text classification for improved software categorization. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE. https://doi.org/10.1109/ICSME.2018.00056
[15] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, 4582–4597. https://doi.org/10.18653/v1/2021.acl-long.353
[16] Shangqing Liu, Yu Chen, Xiaofei Xie, Jing Kai Siow, and Yang Liu. 2021. Retrieval-Augmented Generation for Code Summarization via Hybrid GNN. In International Conference on Learning Representations (ICLR). https://doi.org/10.48550/arXiv.2006.05405
[17] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. 2021. Pretrained Transformers as Universal Computation Engines. CoRR abs/2103.05247 (2021). arXiv:2103.05247 https://arxiv.org/abs/2103.05247
[18] Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 2013. Automatic generation of natural language summaries for java classes. In 2013 21st International Conference on Program Comprehension (ICPC). IEEE, 23–32.
[19] Laura Moreno, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrian Marcus, and Gerardo Canfora. 2014. Automatic generation of release notes. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 484–495.
[20] Vadim Popov, Mikhail Kudinov, Irina Piontkovskaya, Petr Vytovtov, and Alex Nevidomsky. 2018. Distributed fine-tuning of language models on private data. In International Conference on Learning Representations.
[21] Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. 419–428.
[22] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
[23] Simone Scalabrino, Gabriele Bavota, Christopher Vendome, Mario Linares-Vásquez, Denys Poshyvanyk, and Rocco Oliveto. 2017. Automatically assessing code understandability: How far are we?. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 417–427.
[24] Joel Shor, Dotan Emanuel, Oran Lang, Omry Tuval, Michael Brenner, Julie Cattiau, Fernando Vieira, Maeve McNally, Taylor Charbonneau, Melissa Nollstadt, et al. 2019. Personalizing ASR for dysarthric and accented speech with limited data. arXiv preprint arXiv:1907.13511 (2019). https://doi.org/10.21437/Interspeech.2019-1427
[25] Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode compose: code generation using transformer. In ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, Prem Devanbu, Myra B. Cohen, and Thomas Zimmermann (Eds.). ACM, 1433–1443. https://doi.org/10.1145/3368089.3417058
[26] Alexey Svyatkovskiy, Ying Zhao, Shengyu Fu, and Neel Sundaresan. 2019. Pythia: AI-assisted Code Completion System. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2727–2735.
[27] Kai Tian, Meghan Revelle, and Denys Poshyvanyk. 2009. Using latent dirichlet allocation for automatic categorization of software. In 2009 6th IEEE International Working Conference on Mining Software Repositories. IEEE, 163–166. https://doi.org/10.1109/MSR.2009.5069496
[28] Michele Tufano. 2018. src2abs. https://github.com/micheletufano/src2abs.
[29] Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2021. Unit Test Case Generation with Transformers and Focal Context. arXiv preprint arXiv:2009.05617 (2021). arXiv:2009.05617 [cs.SE]
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. CoRR abs/1706.03762 (2017). arXiv:1706.03762 http://arxiv.org/abs/1706.03762
[31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[32] Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 397–407.
[33] Juan Zhai, Xiangzhe Xu, Yu Shi, Guanhong Tao, Minxue Pan, Shiqing Ma, Lei Xu, Weifeng Zhang, Lin Tan, and Xiangyu Zhang. 2020. CPC: Automatically classifying and propagating natural language comments via program analysis. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 1359–1371.