CuBERT: Pre-Trained Contextual Embedding of Source Code
January 3, 2020
Abstract
The source code of a program not only serves as a formal description of an executable task,
but it also serves to communicate developer intent in a human-readable form. To facilitate this,
developers use meaningful identifier names and natural-language documentation. This makes it
possible to successfully apply sequence-modeling approaches, shown to be effective in natural-
language processing, to source code. A major advancement in natural-language understanding
has been the use of pre-trained token embeddings; BERT and other works have further shown that
pre-trained contextual embeddings can be extremely powerful and can be fine-tuned effectively
for a variety of downstream supervised tasks. Inspired by these developments, we present the
first attempt to replicate this success on source code. We curate a massive corpus of Python
programs from GitHub to pre-train a BERT model, which we call Code Understanding BERT
(CuBERT). We also pre-train Word2Vec embeddings on the same dataset. We create a benchmark
of five classification tasks and compare fine-tuned CuBERT against sequence models trained
with and without the Word2Vec embeddings. Our results show that CuBERT outperforms the
baseline methods by a margin of 2.9–22%. We also show its superiority when fine-tuned with
smaller datasets, and over fewer epochs. We further evaluate CuBERT’s effectiveness on a joint
classification, localization and repair task involving prediction of two pointers.
1 Introduction
Modern software development places a high value on writing clean and readable code. This helps
other developers understand the author’s intent so that they can maintain and extend the code. De-
velopers use meaningful identifier names and natural-language documentation to make this hap-
pen (Martin, 2008). As a result, source code contains substantial information that can be exploited
by machine-learning algorithms. Sequence modeling on source code has been shown to be success-
ful in a variety of software-engineering tasks, such as code completion (Hindle et al., 2012; Raychev
et al., 2014), source code to pseudocode mapping (Oda et al., 2015), API-sequence prediction (Gu
et al., 2016), program repair (Pu et al., 2016; Gupta et al., 2017), and natural language to code
mapping (Iyer et al., 2018), among others.
The distributed vector representations of tokens, called token (or word) embeddings, are a crucial
component of neural methods for sequence modeling. Learning useful embeddings in a supervised
setting with limited data is often difficult. Therefore, many unsupervised learning approaches have
been proposed to take advantage of large amounts of unlabeled data that are more readily available.
This has resulted in ever more useful pre-trained token embeddings (Mikolov et al., 2013a; Pen-
nington et al., 2014). However, the subtle differences in the meaning of a token in varying contexts
are lost when each word is associated with a single representation. Recent techniques for learning
contextual embeddings (McCann et al., 2017; Peters et al., 2018; Radford et al., 2018, 2019; Devlin
et al., 2019; Yang et al., 2019) provide ways to compute representations of tokens based on their
surrounding context, and have shown significant accuracy improvements in downstream tasks, even
with only a small number of task-specific parameters.
Inspired by the success of pre-trained contextual embeddings for natural languages, we present
the first attempt to apply the underlying techniques to source code. In particular, BERT (Devlin
et al., 2019) produces a bidirectional Transformer encoder (Vaswani et al., 2017) by training it to
predict values of masked tokens and whether two sentences follow each other in a natural discourse.
The pre-trained model can be fine-tuned for downstream supervised tasks and has been shown to
produce state-of-the-art results on a number of NLP benchmarks. In this work, we derive contextual
embedding of source code by training a BERT model on source code. We call our model CuBERT,
short for Code Understanding BERT.
In order to achieve this, we curate a massive corpus of Python programs collected from GitHub.
GitHub projects are known to contain a large amount of duplicate code. To avoid biasing the model
to such duplicated code, we perform deduplication using the method of Allamanis (2018). The
resulting corpus has 6.6M unique files with a total of 2 billion words. We also train Word2Vec
embeddings (Mikolov et al., 2013a,b), namely, continuous bag-of-words (CBOW) and Skipgram
embeddings, on the same corpus. For evaluating CuBERT, we create a benchmark of five classifi-
cation tasks, ranging from classification of source code according to the presence or absence of certain
classes of bugs, to detection of a mismatch between a function’s natural-language description and its body, to pre-
dicting the right kind of exception to catch for a given code fragment. These tasks are motivated by
prior work in this space, but unfortunately, the associated datasets come from different languages
and varied sources. We want to ensure that there is no overlap between pre-training and fine-tuning
datasets, and that all of the tasks are defined on Python code. We therefore create new datasets for
the five tasks after carefully separating the pre-training and fine-tuning corpora. To evaluate Cu-
BERT’s effectiveness on a more complex task, we create a task for joint classification, localization
and repair of variable misuse bugs (Vasic et al., 2019), which involves predicting two pointers.
We fine-tune CuBERT on each of the classification tasks and compare the results with multi-
layered bidirectional LSTM (Hochreiter & Schmidhuber, 1997) models. We train the LSTM models
from scratch and also using pre-trained Word2Vec embeddings. Our results show that CuBERT
consistently outperforms these baseline models by 2.9–22% across the tasks. We perform a number
of additional studies by varying the sampling strategies used for training Word2Vec models, by
varying program lengths, and by comparing against Transformer models trained from scratch. In
addition, we also show that CuBERT can be fine-tuned effectively using only 33% of the task-
specific labeled data and with only 2 epochs, and that it attains results competitive with the baseline
models trained with the full datasets and a much larger number of epochs. When fine-tuned on the
variable-misuse localization and repair task, CuBERT produces high classification, localization, and
localization+repair accuracies. The contributions of this paper are as follows:
• We present the first attempt at pre-training a BERT contextual embedding of source code.
• We show the efficacy of the pre-trained contextual embedding on five classification tasks.
Our results show that the fine-tuned models outperform the baseline LSTM models supported
by Word2Vec embeddings, and Transformers trained from scratch. Further, the fine-tuning
works well even for smaller datasets and fewer training epochs. We also evaluate CuBERT on
a multi-headed pointer prediction task.
• We plan to make the models and datasets publicly available for use by others.
2 Related Work
Given the abundance of natural-language text, and the relative difficulty of obtaining labeled data,
much effort has been devoted to using large corpora to learn about language in an unsupervised fash-
ion, before trying to focus on tasks with small labeled training datasets. Word2Vec (Mikolov et al.,
2013a,b) computed word embeddings based on word co-occurrence and proximity, but the same em-
bedding is used regardless of the context. The continued advances in word embeddings (Pennington
et al., 2014) led to publicly released pre-trained embeddings, used in a variety of tasks.
To deal with varying word context, contextual word embeddings were developed (McCann et al.,
2017; Peters et al., 2018; Radford et al., 2018, 2019), in which an embedding is learned for the
context of a word in a particular sentence, namely the sequence of words preceding it and possibly
following it. BERT (Devlin et al., 2019) improved natural-language pre-training by using a de-
noising autoencoder. Instead of learning a language model, which is inherently sequential, BERT
optimizes for predicting a noised word within a sentence. Such prediction instances are generated by
choosing a word position and either keeping the word unchanged, masking it out, or replacing it
with a random wrong word. It also pre-trains with the objective of predicting whether two sentences
can be next to each other. These pre-training objectives, along with the use of a Transformer-based
architecture, gave BERT an accuracy boost in a number of NLP tasks over the state-of-the-art.
BERT has been improved upon in various ways, including modifying training objectives, utilizing
ensembles, combining attention with autoregression (Yang et al., 2019), and expanding pre-training
corpora and time (Liu et al., 2019). However, the main architecture of BERT seems to hold up as
the state-of-the-art, as of this writing.
In the space of programming languages, attempts have been made to learn embeddings in the
context of specific software-engineering tasks. These include embeddings of variable and method
identifiers using local and global context (Allamanis et al., 2015), abstract syntax trees or ASTs (Mou
et al., 2016), paths in ASTs (Alon et al., 2019), memory heap graphs (Li et al., 2016), and ASTs
enriched with data flow information (Allamanis et al., 2018). These approaches require analyzing
source code beyond simple tokenization. In this work, we derive a pre-trained contextual embedding
of tokenized source code without explicitly modeling source-code-specific information, and show
that the resulting embedding can be effectively fine-tuned for downstream tasks.
3 Experimental Setup
3.1 Code Corpus for Fine-Tuning Tasks
We use the ETH Py150 corpus (Raychev et al., 2016) to generate datasets for the fine-tuning tasks.
The ETH Py150 corpus consists of 150K Python files from GitHub, and is partitioned into a training
split (100K files) and a test split (50K files). We held out 10K files from the training split as a
validation split. We deduplicated the dataset in the fashion of Allamanis (2018), resulting in a
slightly smaller dataset of 85K, 9.5K, and 47K files in train, validation, and test, respectively.
We split identifiers according to common heuristic rules (e.g., snake_case or CamelCase). Finally, we split
string literals using heuristic rules, on whitespace characters, and on special characters. We limit all
tokens thus produced to a maximum length of 15 characters. We call this the program vocabulary.
Our Python pre-training code corpus contained 10.2M unique tokens, including 12 reserved tokens.
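To make the tokenization concrete, the following sketch (ours, not the paper’s actual pipeline) uses Python’s standard tokenize module together with a simple regular expression for splitting snake_case and CamelCase identifiers; the helper names, the exact splitting rules, and the treatment of string literals are simplifying assumptions.

import io
import re
import tokenize

# Hypothetical helper mirroring the identifier-splitting heuristic described
# above; the paper's exact rules may differ.
def split_identifier(name, max_len=15):
    parts = []
    for piece in name.split("_"):
        # Split CamelCase runs: "HTTPServerError" -> ["HTTP", "Server", "Error"].
        parts += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", piece)
    return [p[:max_len] for p in parts if p]

def program_tokens(source):
    """Yield program-level tokens: identifiers are split heuristically,
    everything else (operators, numbers, keywords) is kept whole."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            yield from split_identifier(tok.string)
        elif tok.type in (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
                          tokenize.DEDENT, tokenize.ENDMARKER):
            continue
        else:
            yield tok.string

print(list(program_tokens("my_counterValue = HTTPServerError + 1\n")))
# ['my', 'counter', 'Value', '=', 'HTTP', 'Server', 'Error', '+', '1']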
We greedily compress the program vocabulary into a subword vocabulary (Schuster & Nakajima,
2012) using the SubwordTextEncoder from the Tensor2Tensor project (Vaswani et al., 2018),
resulting in slightly over 50K tokens. All words in the program vocabulary can be losslessly encoded
using one or more of the subword tokens.
We encode programs first into program tokens, as described above, and then encode those to-
kens one by one in the subword vocabulary. The objective of this encoding scheme is to preserve
syntactically meaningful boundaries of tokens. For example, the identifier “snake_case” could
be encoded as “sna ke_ ca se”, preserving the snake_case split of its characters, even if the
subtoken “e_c” were very popular in the corpus; the latter encoding might result in a smaller rep-
resentation but would lose the intent of the programmer in using a snake_case identifier. Similarly,
“i=0” may be very frequent in the corpus, but we still force it to be encoded as the separate tokens i,
=, and 0, ensuring that we preserve the distinction between operators and operands. Both the BERT
model and the Word2Vec embeddings are built on the subword vocabulary.
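A minimal sketch of this boundary-preserving encoding: each program token is passed through the subword encoder on its own and the resulting pieces are concatenated, so no subword can straddle two program tokens. The encoder below is a toy stand-in; a real setup would call a trained WordPiece/SubwordTextEncoder vocabulary instead.

def encode_program(program_tokens, encode_token):
    """Encode program tokens one at a time so that subword pieces never
    cross program-token boundaries. `encode_token` is any callable that
    maps a single program token to a list of subword pieces."""
    pieces = []
    for tok in program_tokens:
        pieces.extend(encode_token(tok))
    return pieces

# Toy stand-in for a trained subword encoder: known pieces stay whole,
# unknown text falls back to a greedy longest-prefix match or characters.
TOY_VOCAB = {"i", "=", "0", "snake_", "case"}
def toy_encode(tok):
    out, rest = [], tok
    while rest:
        for piece in sorted(TOY_VOCAB, key=len, reverse=True):
            if rest.startswith(piece):
                out.append(piece)
                rest = rest[len(piece):]
                break
        else:
            out.append(rest[0])
            rest = rest[1:]
    return out

print(encode_program(["i", "=", "0"], toy_encode))    # ['i', '=', '0']
print(encode_program(["snake_case"], toy_encode))     # ['snake_', 'case']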
Variable Misuse Classification Allamanis et al. (2018) observed that developers may mistakenly
use an incorrect variable in the place of a correct one. These mistakes may occur when developers
copy-paste similar code but forget to rename all occurrences of variables from the original fragment,
or when there are similar variable names in contexts that can be confused with each other. These can
be subtle errors that remain undetected during compilation. The task of Allamanis et al. (2018), devised
for C# programs, is to predict the correct variable name at a given location within a function.
We take the classification version restated by Vasic et al. (2019), wherein, given a function, the task
is to predict whether there is a variable misuse at some location in the function, without specifying
a particular location to consider. In this setting, the classifier has to consider all variables and their
usages to make the decision. In order to create negative (buggy) examples, we replace a variable use
at some location with another variable that is defined within the function.
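The following sketch illustrates one way such buggy examples could be generated with Python’s ast module: pick a variable use inside a function and substitute a different variable defined in the same function. The helper name and the simplifications (no scope analysis, ast.unparse requiring Python 3.9+) are ours and need not match the paper’s actual generator.

import ast
import random

def make_variable_misuse(source, seed=0):
    """Sketch: corrupt one variable *use* in the first function of `source`
    by replacing it with another variable defined in the same function."""
    rng = random.Random(seed)
    tree = ast.parse(source)
    fn = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    # Variables "defined" in the function: parameters plus assignment targets.
    defined = {a.arg for a in fn.args.args}
    defined |= {n.id for n in ast.walk(fn)
                if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    # Candidate locations: loads of defined variables.
    uses = [n for n in ast.walk(fn)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
            and n.id in defined]
    if len(defined) < 2 or not uses:
        return None  # no misuse example can be built for this function
    victim = rng.choice(uses)
    victim.id = rng.choice(sorted(defined - {victim.id}))
    return ast.unparse(tree)  # the buggy variant (Python 3.9+)

print(make_variable_misuse("def area(width, height):\n    return width * height\n"))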
Wrong Binary Operator Pradel & Sen (2018) proposed the task of detecting whether a binary
operator in a given expression is correct. They use features extracted from limited surrounding
context. We use the entire function with the goal of detecting whether any binary operator in the
function is incorrect. The negative examples are created by randomly replacing some binary operator
with another type-compatible operator.
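A sketch of this mutation at the AST level; the operator group is illustrative (compare the operator table in the appendix) and only arithmetic binary operators are handled, for brevity.

import ast
import random

# Illustrative group of type-compatible arithmetic operators.
ARITHMETIC_OPS = [ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Mod]

def corrupt_binary_operator(source, seed=0):
    """Sketch: replace one arithmetic binary operator in `source` with a
    different operator from the same type-compatible group."""
    rng = random.Random(seed)
    tree = ast.parse(source)
    sites = [n for n in ast.walk(tree)
             if isinstance(n, ast.BinOp) and type(n.op) in ARITHMETIC_OPS]
    if not sites:
        return None
    node = rng.choice(sites)
    new_op = rng.choice([op for op in ARITHMETIC_OPS if op is not type(node.op)])
    node.op = new_op()
    return ast.unparse(tree)  # Python 3.9+

print(corrupt_binary_operator("def total(a, b, c):\n    return a + b * c\n"))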
Swapped Operand Pradel & Sen (2018) propose the wrong binary operand task where a variable
or constant is used incorrectly in an expression, but that task is quite similar to the variable misuse
task we already use. We therefore define another class of operand errors where the operands of non-
commutative binary operators are swapped. The operands can be arbitrary subexpressions, and are
not restricted to be just variables or constants. To simplify example generation, we restrict examples
for this task to those in which the binary operator and its operands all fit within a single line.
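A corresponding sketch for this task swaps the two operands of a non-commutative comparison; restricting the mutation to expressions that fit on a single line, as described above, is omitted for brevity.

import ast
import random

# Non-commutative comparison operators (see the appendix table).
NON_COMMUTATIVE_CMP = (ast.Lt, ast.LtE, ast.Gt, ast.GtE, ast.In, ast.NotIn)

def swap_operands(source, seed=0):
    """Sketch: swap the operands of one non-commutative, two-operand
    comparison, e.g. `a < b` becomes `b < a`."""
    rng = random.Random(seed)
    tree = ast.parse(source)
    sites = [n for n in ast.walk(tree)
             if isinstance(n, ast.Compare) and len(n.comparators) == 1
             and isinstance(n.ops[0], NON_COMMUTATIVE_CMP)]
    if not sites:
        return None
    node = rng.choice(sites)
    node.left, node.comparators[0] = node.comparators[0], node.left
    return ast.unparse(tree)  # Python 3.9+

print(swap_operands("def ok(model, registry):\n    return model in registry\n"))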
Task                                        Train     Validation       Test
Variable Misuse Classification             796020    8192 (86810)     429854
Wrong Binary Operator                      537244    8192 (59112)     293872
Swapped Operand                            276116    8192 (30818)     152248
Function-Docstring                         391049    8192 (44029)     213269
Exception Type                              21694    2459 (2459)       12036
Variable Misuse Localization and Repair    796020    8192 (86810)     429854
Table 1: Benchmark fine-tuning datasets. Note that for validation, we have subsampled the original
datasets (in parentheses) down to 8192 examples, except for exception classification, which only
had 2459 validation examples, all of which are included.
Exception Type While it is possible to write generic exception handlers (e.g., “except Exception”
in Python), it is considered a good coding practice to catch and handle the precise exceptions that
can be raised by a code fragment. We identified the 20 most common exception types from the
GitHub dataset, excluding the catch-all Exception (full list in Table 6). Given a function with
an except clause for one of these exception types, we replace the exception with a special “hole”
token. The task is the multi-class classification problem of predicting the original exception type.
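A rough sketch of how such examples could be constructed at the source level: locate an except clause naming one of the chosen exception types, record that name as the label, and replace it with a special __HOLE__ token. The regular expression and the abbreviated type list are simplifications.

import re

# Subset of the 20 exception types used for the task (see Table 6).
EXCEPTION_TYPES = {"ValueError", "KeyError", "AttributeError", "OSError"}

def make_exception_example(source):
    """Sketch: replace the exception name of the first matching `except`
    clause with __HOLE__ and return (masked_source, label)."""
    pattern = re.compile(r"except\s+([A-Za-z_][A-Za-z_0-9]*)")
    m = pattern.search(source)
    if not m or m.group(1) not in EXCEPTION_TYPES:
        return None
    label = m.group(1)
    masked = source[:m.start(1)] + "__HOLE__" + source[m.end(1):]
    return masked, label

snippet = "try:\n    do_call()\nexcept OSError as e:\n    handle(e)\n"
print(make_exception_example(snippet))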
times, and generates independently derived examples from each duplicate. With 50% probability,
the second example sentence comes from a random document (for NSP). With 15% probability, a
token is chosen for an MLM prediction (up to 20 per example), and from those chosen, 80% are
masked, 10% are left undisturbed, and 10% are replaced with a random token.
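The masking rule can be summarized by the following sketch (placeholder token names and vocabulary, not CuBERT’s actual generation pipeline): positions are selected with 15% probability up to a cap, and each selected position is masked, kept, or randomized with an 80/10/10 split.

import random

def mask_for_mlm(tokens, vocab, rng, pick_prob=0.15, max_predictions=20):
    """Sketch of BERT-style MLM instance creation: choose up to
    `max_predictions` positions with probability `pick_prob`; of those,
    80% become [MASK], 10% stay unchanged, 10% become a random token."""
    inputs, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if len(targets) >= max_predictions or rng.random() >= pick_prob:
            continue
        targets[i] = tok                      # position to be predicted
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"
        elif r < 0.9:
            pass                              # keep the original token
        else:
            inputs[i] = rng.choice(vocab)     # random replacement
    return inputs, targets

rng = random.Random(0)
print(mask_for_mlm(["def", "foo", "(", ")", ":"], ["bar", "baz"], rng))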
CuBERT is similarly formulated, but a CuBERT sentence is a logical code line, as defined by
the Python standard. Intuitively, a logical code line is the shortest sequence of consecutive lines
that may constitute a legal statement, e.g., it has correctly matching parentheses. We count example
lengths by counting the subword tokens of both sentences (see Section 3.3).
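Logical lines can be recovered with Python’s standard tokenize module, which emits a NEWLINE token only at the end of a logical line, while non-logical line breaks appear as NL tokens. The sketch below joins token strings with single spaces, which is a simplification of the actual sentence construction.

import io
import tokenize

def logical_lines(source):
    """Yield each logical line of `source` as a single string."""
    buf = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NEWLINE:          # end of a logical line
            yield " ".join(buf)
            buf = []
        elif tok.type in (tokenize.NL, tokenize.COMMENT, tokenize.INDENT,
                          tokenize.DEDENT, tokenize.ENDMARKER):
            continue                               # not part of a logical line
        else:
            buf.append(tok.string)

src = "x = compute(1,\n            2)\nif x:\n    y = x + 1\n"
print(list(logical_lines(src)))
# ['x = compute ( 1 , 2 )', 'if x :', 'y = x + 1']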
We train the BERT Large model, consisting of 24 layers with 16 attention heads and hidden size
of 1024 units. Sentences are created by parsing our pre-training dataset. Task-specific classifiers pass
the embedding of a special start-of-example [CLS] token through feedforward and softmax layers.
For the pointer prediction task, the pointer is computed over the sequence of outputs generated by
the last layer of the BERT model.
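Schematically, the two heads can be written as follows in TensorFlow/Keras terms, assuming an encoder output of shape [batch, seq_len, hidden]; the layer names and sizes are illustrative rather than the exact fine-tuning code.

import tensorflow as tf

hidden_size, num_classes = 1024, 2

def classification_head(sequence_output):
    # Classification head: the [CLS] (first-position) embedding is passed
    # through a feedforward layer followed by a softmax output layer.
    cls_embedding = sequence_output[:, 0, :]                     # [batch, hidden]
    hidden = tf.keras.layers.Dense(hidden_size, activation="tanh")(cls_embedding)
    return tf.keras.layers.Dense(num_classes, activation="softmax")(hidden)

def pointer_head(sequence_output, token_mask):
    # Pointer head: score every position of the last encoder layer and
    # normalize over the sequence, masking out padding positions.
    scores = tf.squeeze(tf.keras.layers.Dense(1)(sequence_output), axis=-1)
    scores += (1.0 - tf.cast(token_mask, scores.dtype)) * -1e9
    return tf.nn.softmax(scores, axis=-1)                        # [batch, seq_len]

# Toy check with random stand-ins for encoder outputs.
seq_out = tf.random.normal([2, 8, hidden_size])
mask = tf.ones([2, 8])
print(classification_head(seq_out).shape, pointer_head(seq_out, mask).shape)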
3.6 Baselines
3.6.1 Word2Vec
We train Word2Vec models using the same pre-training corpus as the BERT model. To maintain
parity, we generate the dataset for Word2Vec using the same pipeline as for BERT, but with masking
and NSP negative-example generation disabled. The dataset is generated without any dupli-
cation. We train both CBOW and Skipgram models using GenSim (Řehůřek & Sojka, 2010). To
deal with the large vocabulary, we use negative sampling and hierarchical softmax (Mikolov et al.,
2013a,b) to train the two versions. In all, we obtain four Word2Vec embeddings.
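The four configurations can be expressed concisely with gensim (assuming gensim 4.x parameter names); the embedding size and negative-sample count below follow the settings reported in Section 4.1, while the corpus iterable is a toy placeholder.

from gensim.models import Word2Vec

def train_word2vec(sentences, skipgram, hierarchical_softmax,
                   size=1024, epochs=10, negative=10, workers=16):
    """Train one of the four Word2Vec variants: {CBOW, Skipgram} x
    {negative sampling, hierarchical softmax}."""
    return Word2Vec(
        sentences=sentences,
        vector_size=size,
        sg=1 if skipgram else 0,              # 1 = Skipgram, 0 = CBOW
        hs=1 if hierarchical_softmax else 0,  # 1 = hierarchical softmax
        negative=0 if hierarchical_softmax else negative,
        epochs=epochs,
        workers=workers,
        min_count=1,
    )

# Example: CBOW with negative sampling on a toy corpus of subword sentences.
toy_corpus = [["def", "foo", "(", ")", ":"], ["return", "foo"]]
model = train_word2vec(toy_corpus, skipgram=False, hierarchical_softmax=False)
print(model.wv["foo"].shape)   # (1024,)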
4 Experimental Results
4.1 Training Details
As stated above, CuBERT’s dataset generation duplicates the corpus 10 times, whereas Word2Vec is
trained without duplication. To compensate for this difference, we trained Word2Vec for 10 epochs
and CuBERT for 1 epoch. We pre-train CuBERT with the default configuration of the BERT Large
model. For sequences of length 128, 256 and 512, we use batch sizes of 8192, 4096 and 2048
respectively. For Word2Vec, when training with negative samples, we choose 10 negative samples.
The embedding sizes for all the pre-trained models are set at 1024.
For the baseline BiLSTM models, we did extensive experimentation on the Variable Misuse
task by varying the number of layers (1–3) and the number of hidden units (128, 256, 512). We
also tried LSTM output dropout probability (0.1, 0.5), optimizers (Adam (Kingma & Ba, 2014) and
AdaGrad (Duchi et al., 2011)), and learning rates (1e-3, 1e-4, 1e-5). The most promising combina-
tion was a 3-layered BiLSTM with 512 hidden units per layer, LSTM output dropout probability of
0.1 and Adam optimizer with learning rate of 1e-3. We use this set of parameters for all the tasks
except the Exception Type task. Due to the much smaller dataset size of the latter (Table 1), we
did a separate search and chose a single-layer BiLSTM with 256 hidden units. We used the batch
size of 8192 for the larger tasks and 64 for the Exception Type task. For the baseline Transformer
models, we originally attempted to train a Transformer model of the same configuration as CuBERT.
However, the size of our training dataset seemed too small to train that large a Transformer. Instead,
we performed a hyperparameter search over transformer layers (1–6), hidden units (128, 256, 512),
learning rates (5e-5, 1e-4, 5e-4, 1e-3) and batch sizes (64, 256, 1024, 2048, 4096, 8192) on the
Variable Misuse task. The best architecture (4 layers, 512 hidden units, 16 attention heads, learning
rate of 5e-4, batch size of 4096) is used for all the tasks except the Exception Type task. A separate
experimentation for the smaller Exception Type dataset resulted in the best configuration of 3 layers,
512 hidden units, 16 attention heads, learning rate of 5e-5, and batch size of 2048.
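For concreteness, a Keras sketch of the selected BiLSTM baseline configuration (3 bidirectional LSTM layers of 512 units each, LSTM output dropout of 0.1, Adam with learning rate 1e-3); initialization of the embedding layer from the Word2Vec vectors and other training details are omitted, and the function name is ours.

import tensorflow as tf

def build_bilstm_classifier(vocab_size=50_000, embedding_dim=1024,
                            hidden_units=512, num_layers=3, num_classes=2):
    """Sketch of the baseline BiLSTM classifier used for the larger tasks."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Embedding(vocab_size, embedding_dim, mask_zero=True))
    for i in range(num_layers):
        last = (i == num_layers - 1)
        model.add(tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(hidden_units, return_sequences=not last)))
        model.add(tf.keras.layers.Dropout(0.1))       # LSTM output dropout
    model.add(tf.keras.layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_bilstm_classifier()
model.build(input_shape=(None, 512))   # subword sequences of length 512
model.summary()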
Finally, for our baseline pointer model (referred to as LSTM+pointer below) we searched over
the following hyperparameter choices: hidden sizes of 512 and 1024, token embedding sizes of
512 and 1024, learning rates of 0.1, 0.01, and 0.001, and the AdaGrad and Gradient Descent op-
timizers. In contrast to the original work, we generated one pair of buggy/bug-free examples per
function (rather than one per variable use, per function, which would bias towards longer functions),
and used CuBERT’s subword-tokenized vocabulary of 50K subtokens (rather than a limited full-token
vocabulary, which would leave many tokens out of vocabulary).
5. How does CuBERT perform on complex tasks? We implemented and fine-tuned a model for
a multi-headed pointer prediction task, namely, the Variable-Misuse Localization and Repair
task (Section 4.7). We compare it to the model from Vasic et al. (2019).
Except for Section 4.6, all the results are presented for sequences of length 512. We give examples
of classification instances in Appendix B and include visualizations of attention weights for them.
Setting                              Misuse    Operator   Operand   Docstring   Exception
BiLSTM, from scratch (100 epochs)    76.05%    82.00%     87.77%    78.43%      40.37%
BiLSTM, CBOW ns (100 epochs)         77.66%    84.42%     88.66%    89.13%      48.85%
BiLSTM, CBOW hs (100 epochs)         77.01%    84.11%     89.69%    86.74%      46.73%
BiLSTM, Skipgram ns (100 epochs)     71.58%    83.06%     87.67%    84.69%      48.54%
BiLSTM, Skipgram hs (100 epochs)     77.21%    83.06%     89.01%    82.56%      49.68%
CuBERT, 2 epochs                     90.09%    85.15%     88.67%    95.81%      52.38%
CuBERT, 10 epochs                    92.73%    88.43%     88.67%    95.81%      62.55%
CuBERT, 20 epochs                    94.61%    90.24%     92.56%    96.85%      71.74%
Transformer (100 epochs)             79.37%    78.66%     86.21%    91.10%      48.60%
Table 2: Test accuracies of fine-tuned CuBERT against BiLSTM (with and without Word2Vec
embeddings) and Transformer trained from scratch on the classification tasks. “ns” and “hs” re-
spectively refer to negative sampling and hierarchical softmax settings used for training CBOW and
Skipgram models. “From scratch” refers to training with freshly initialized token embeddings, that
is, without pre-trained Word2Vec embeddings.
20 epochs, compared to the 100 epochs used for BiLSTMs. The Exception Type classification task
is an interesting case since it has an order of magnitude less training data than the other tasks (see
Table 1). The difference between the performance of BiLSTM and CuBERT is the highest for this
task. Thus, fine-tuning is of much value for tasks with limited labeled training data.
We analyzed the performance of CuBERT with the reduced fine-tuning budget of only 2 and 10
epochs (see Table 2). Except for the Operand task, CuBERT outperforms BiLSTM within 2 fine-
tuning epochs. On the Operand task, the performance difference between CuBERT with 2 or 10
fine-tuning epochs and BiLSTM is about 1%. For the rest of the tasks, CuBERT with only 2 fine-
tuning epochs outperforms BiLSTM (with the best task-wise Word2Vec configuration) by a margin
of 0.7–12%. This shows that CuBERT can reach accuracies that are comparable to or better than
those of BiLSTMs trained with Word2Vec embeddings within only a few epochs.
We also trained the BiLSTM models from scratch, that is, without using the Word2Vec embed-
dings. The results are shown in the first row of Table 2. Compared to those, the use of Word2Vec
embeddings performs better by a margin of 1.5–10.5%. Though no single Word2Vec configuration
is the best, CBOW trained with negative sampling gives the most consistent results overall.
Best of # Epochs   Train Fraction   Misuse    Operator   Operand   Docstring   Exception
2                  100%             90.09%    85.15%     88.67%    95.81%      52.38%
2                   66%             89.52%    83.26%     88.66%    95.17%      34.70%
2                   33%             88.64%    82.28%     87.45%    95.29%      26.87%
10                 100%             92.73%    88.43%     88.67%    95.81%      62.55%
10                  66%             92.06%    87.06%     90.39%    95.64%      64.59%
10                  33%             91.23%    84.44%     87.45%    95.48%      54.22%
20                 100%             94.61%    90.24%     92.56%    96.85%      71.74%
20                  66%             94.19%    89.36%     92.01%    96.17%      70.11%
20                  33%             93.54%    87.67%     91.30%    96.37%      67.72%
Table 3: Effects of reducing training-split size on fine-tuning performance on the classification tasks.
Table 4: Best out of 20 epochs of fine-tuning, for three example lengths, on the classification tasks.
The Function Docstring task seems robust to the reduction of the training dataset, both early and
late in the fine-tuning process (that is, within 2 vs. 20 epochs), whereas the Exception Classification
task is heavily impacted by the dataset reduction, given that it has relatively few training examples to
begin with. Interestingly, for some tasks, even fine-tuning for only 2 epochs on only a third of the
training data outperforms the baselines. For example, for both Variable Misuse and Function
Docstring, CuBERT fine-tuned for 2 epochs on one-third of the training data outperforms the BiLSTM
with Word2Vec and the Transformer baselines.
Model            Setting      True Positive   Classification Accuracy   Localization Accuracy   Loc+Repair Accuracy
LSTM+pointer     100 epochs   81.63%          78.76%                    63.83%                  56.37%
CuBERT+pointer   2 epochs     97.18%          89.37%                    79.05%                  75.84%
CuBERT+pointer   10 epochs    94.94%          93.05%                    88.52%                  85.91%
CuBERT+pointer   20 epochs    96.83%          94.85%                    91.11%                  89.35%
Table 5: Comparison of the fine-tuned CuBERT+pointer model and the LSTM+pointer model
from Vasic et al. (2019) on the variable misuse localization and repair task.
model and refer the reader to the above paper. As reported in Section 4 of Vasic et al. (2019), to
enable comparison with an enumerative approach, the evaluation was performed only on 12K test
examples. In comparison, we report the numbers on all 430K test examples (Table 1) for both
models.
Similar to other tasks, we trained the baseline model for 100 epochs and fine-tuned CuBERT for
up to 20 epochs. Table 5 gives the results along the same metrics as Vasic et al. (2019). The metrics
are defined as follows: 1) True Positive is the percentage of bug-free functions classified as bug-free.
2) Classification Accuracy is the percentage of correctly classified examples (between bug-free and
buggy). 3) Localization Accuracy is the percentage of buggy examples for which the localization
pointer correctly identifies the bug location. 4) Localization+Repair Accuracy is the percentage of
buggy examples for which both the localization and repair pointers make correct predictions. As
seen from Table 5, the CuBERT+pointer model outperforms the LSTM+pointer model consistently
across all the metrics, and even within 2 and 10 epochs.
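For clarity, the four metrics can be computed from per-example ground truth and predictions as in the following sketch; the field names are illustrative.

def pointer_metrics(examples):
    """Compute the four metrics of Table 5 from a list of dicts with keys:
    is_buggy, pred_buggy, error_loc, pred_loc, repair_ok (bool: repair
    pointer correct). Field names are illustrative."""
    bug_free = [e for e in examples if not e["is_buggy"]]
    buggy = [e for e in examples if e["is_buggy"]]
    true_positive = sum(not e["pred_buggy"] for e in bug_free) / len(bug_free)
    classification = sum(e["pred_buggy"] == e["is_buggy"] for e in examples) / len(examples)
    localization = sum(e["pred_loc"] == e["error_loc"] for e in buggy) / len(buggy)
    loc_repair = sum(e["pred_loc"] == e["error_loc"] and e["repair_ok"]
                     for e in buggy) / len(buggy)
    return true_positive, classification, localization, loc_repair

demo = [
    {"is_buggy": False, "pred_buggy": False, "error_loc": 0, "pred_loc": 0, "repair_ok": False},
    {"is_buggy": True,  "pred_buggy": True,  "error_loc": 7, "pred_loc": 7, "repair_ok": True},
]
print(pointer_metrics(demo))   # (1.0, 1.0, 1.0, 1.0)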
References
Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code.
CoRR, abs/1812.06469, 2018. URL http://arxiv.org/abs/1812.06469.
Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Suggesting accurate method
and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software
Engineering, ESEC/FSE 2015, pp. 38–49, New York, NY, USA, 2015. ACM. ISBN 978-1-
4503-3675-8. doi: 10.1145/2786805.2786849. URL http://doi.acm.org/10.1145/
2786805.2786849.
Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs
with graphs. In International Conference on Learning Representations, 2018.
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. Code2vec: Learning distributed rep-
resentations of code. Proc. ACM Program. Lang., 3(POPL):40:1–40:29, January 2019. ISSN
2475-1421. doi: 10.1145/3290353. URL http://doi.acm.org/10.1145/3290353.
Antonio Valerio Miceli Barone and Rico Sennrich. A parallel corpus of Python functions and
documentation strings for automated code documentation and code generation. arXiv preprint
arXiv:1707.02275, 2017.
Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. When deep learning
met code search. arXiv preprint arXiv:1905.03813, 2019.
Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, and Martin
Wattenberg. Visualizing and measuring the geometry of BERT. CoRR, abs/1906.02715, 2019.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep
bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June
2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https:
//www.aclweb.org/anthology/N19-1423.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. Deep API learning. In Proceed-
ings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software En-
gineering, FSE 2016, pp. 631–642, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4218-
6. doi: 10.1145/2950290.2950334. URL http://doi.acm.org/10.1145/2950290.
2950334.
Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. DeepFix: Fixing common C lan-
guage errors by deep learning. In Proceedings of the Thirty-First AAAI Conference on Artifi-
cial Intelligence, AAAI’17, pp. 1345–1351. AAAI Press, 2017. URL http://dl.acm.org/
citation.cfm?id=3298239.3298436.
A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu. On the naturalness of software. In 2012
34th International Conference on Software Engineering (ICSE), pp. 837–847, June 2012. doi:
10.1109/ICSE.2012.6227135.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–
1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735. URL http://dx.
doi.org/10.1162/neco.1997.9.8.1735.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code
in programmatic context. In Proceedings of the 2018 Conference on Empirical Methods in Nat-
ural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pp. 1643–1652,
2018. URL https://www.aclweb.org/anthology/D18-1192/.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural
networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan,
Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL http://arxiv.
org/abs/1511.05493.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining
approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692.
Annie Louis, Santanu Kumar Dash, Earl T Barr, and Charles Sutton. Deep learning to detect redun-
dant method comments. arXiv preprint arXiv:1806.04616, 2018.
Robert C. Martin. Clean Code: A Handbook of Agile Software Craftsmanship. Prentice Hall PTR,
Upper Saddle River, NJ, USA, 1 edition, 2008. ISBN 0132350882, 9780132350884.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation:
Contextualized word vectors. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus,
S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30,
pp. 6294–6305. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/
7209-learned-in-translation-contextualized-word-vectors.pdf.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word repre-
sentations in vector space. In 1st International Conference on Learning Representations, ICLR
2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013a. URL
http://arxiv.org/abs/1301.3781.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representa-
tions of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling,
Z. Ghahramani, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems
26, pp. 3111–3119. Curran Associates, Inc., 2013b.
Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. Convolutional neural networks over tree
structures for programming language processing. In Proceedings of the Thirtieth AAAI Con-
ference on Artificial Intelligence, AAAI’16, pp. 1287–1293. AAAI Press, 2016. URL http:
//dl.acm.org/citation.cfm?id=3015812.3016002.
Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and
Satoshi Nakamura. Learning to generate pseudo-code from source code using statistical machine
translation (t). In 2015 30th IEEE/ACM International Conference on Automated Software Engi-
neering (ASE), pp. 574–584. IEEE, 2015.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word
representation. In EMNLP, 2014.
Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and
Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL-HLT,
pp. 2227–2237, 2018.
Michael Pradel and Koushik Sen. DeepBugs: A learning approach to name-based bug detection.
Proc. ACM Program. Lang., 2(OOPSLA):147:1–147:25, October 2018. ISSN 2475-1421. doi:
10.1145/3276517. URL http://doi.acm.org/10.1145/3276517.
Yewen Pu, Karthik Narasimhan, Armando Solar-Lezama, and Regina Barzilay. sk_p: A neural
program corrector for MOOCs. In Companion Proceedings of the 2016 ACM SIGPLAN Inter-
national Conference on Systems, Programming, Languages and Applications: Software for Hu-
manity, SPLASH Companion 2016, pp. 39–40, New York, NY, USA, 2016. ACM. ISBN 978-
1-4503-4437-1. doi: 10.1145/2984043.2989222. URL http://doi.acm.org/10.1145/
2984043.2989222.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language un-
derstanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/
research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
Veselin Raychev, Martin Vechev, and Eran Yahav. Code completion with statistical language mod-
els. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design
and Implementation, PLDI ’14, pp. 419–428, New York, NY, USA, 2014. ACM. ISBN 978-
1-4503-2784-8. doi: 10.1145/2594291.2594321. URL http://doi.acm.org/10.1145/
2594291.2594321.
Veselin Raychev, Pavol Bielik, and Martin T. Vechev. Probabilistic model for code with decision
trees. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented
Programming, Systems, Languages, and Applications, OOPSLA 2016, part of SPLASH 2016,
Amsterdam, The Netherlands, October 30 - November 4, 2016, pp. 731–747, 2016.
Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In
Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50,
Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
Mike Schuster and Kaisuke Nakajima. Japanese and Korean voice search. In International Confer-
ence on Acoustics, Speech and Signal Processing, pp. 5149–5152, 2012.
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representa-
tions. arXiv preprint arXiv:1803.02155, 2018.
Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. Neural program
repair by jointly learning to localize and repair. CoRR, abs/1904.01720, 2019. URL http:
//arxiv.org/abs/1904.01720.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg,
S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neu-
ral Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017. URL
http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
Ashish Vaswani, Samy Bengio, Eugene Brevdo, François Chollet, Aidan N. Gomez, Stephan
Gouws, Llion Jones, Lukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam
Shazeer, and Jakob Uszkoreit. Tensor2tensor for neural machine translation. In Proceedings
of the 13th Conference of the Association for Machine Translation in the Americas, AMTA 2018,
Boston, MA, USA, March 17-21, 2018 - Volume 1: Research Papers, pp. 193–199, 2018. URL
https://www.aclweb.org/anthology/W18-1819/.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V.
Le. XLNet: Generalized autoregressive pretraining for language understanding. CoRR,
abs/1906.08237, 2019. URL http://arxiv.org/abs/1906.08237.
Exception Type         Test   Validation   Train (100%)   Train (66%)   Train (33%)
ValueError             2324   477          4058           2715          1344
KeyError               2240   453          4009           2566          1271
AttributeError         1657   311          2895           1896           876
TypeError               913   187          1747           1175           564
OSError                 891   164          1641           1106           543
IOError                 865   168          1560           1046           560
ImportError             776   202          1372            935           471
IndexError              694   153          1197            813           408
DoesNotExist              6     2             3              0             0
KeyboardInterrupt       287    67           590            408           223
StopIteration           307    69           488            302           155
AssertionError          177    32           397            276           158
SystemExit              139    23           264            173           101
RuntimeError            128    36           299            203           104
HTTPError                59    13           119             80            35
UnicodeDecodeError      151    24           251            173            82
NotImplementedError     127    27           222            136            52
ValidationError          95    15           172            121            58
ObjectDoesNotExist      105    17           213            142            64
NameError                95    19           197            124            56
Table 6: Example counts per class for the Exception Type task, broken down into the dataset splits.
We show separately the 100% train dataset, as well as its 33% and 66% subsamples used in the
ablations.
              Commutative            Non-Commutative
Arithmetic    +, *                   -, /, %
Comparison    ==, !=, is, is not     <, <=, >, >=
Membership                           in, not in
Boolean       and, or
for instance, a buggy example would only swap == with is, but not with not in, which would
not type-check if we performed static type inference on Python.
We take appropriate care to ensure the code parses after a bug is introduced. For instance, if we
swap the operator in the expression 1==2 with is, we ensure that there is space between the tokens
(i.e., 1 is 2 rather than the incorrect 1is2), even though it was not needed before.
• A targets mask, which marks as True all tokens holding the correct variable, for buggy exam-
ples. Note that the correct variable may appear in multiple locations in a function, therefore
this mask may have multiple True positions. Bug-free examples have an all-False targets
mask.
• An error-location mask, which marks as True the location where the bug occurs (for buggy
examples) or the first location (for bug-free examples).
All the masks mark as True some of the locations that hold variables. Because many variables are
subtokenized into multiple tokens, if a variable is to be marked as True in the corresponding mask,
we only mark as True its first subtoken, keeping trailing subtokens as False.
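A small sketch of this mask construction: given the number of subword positions and the subtoken spans of the occurrences to mark, set True only at the first subtoken of each span. The input format is an assumption made for illustration.

def build_mask(num_subtokens, occurrences):
    """Build a boolean mask over subword positions.

    `occurrences` is a list of (start, length) spans of subtokens covering a
    marked variable occurrence; only the first subtoken of each span is set
    to True, trailing subtokens stay False."""
    mask = [False] * num_subtokens
    for start, _length in occurrences:
        mask[start] = True
    return mask

# Example: a function of 10 subtokens where the correct variable appears
# twice, at subtoken spans (2, 3) and (7, 2), and the bug is at span (4, 1).
targets_mask = build_mask(10, [(2, 3), (7, 2)])
error_location_mask = build_mask(10, [(4, 1)])
print(targets_mask)
print(error_location_mask)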
B Attention Visualizations
In this section, we provide sample code snippets used to test the different classification tasks. Fur-
ther, Figures 1–5 show visualizations of the attention matrix of the last layer of the fine-tuned Cu-
BERT model (Coenen et al., 2019) for the code snippets. In the visualization, the Y-axis shows the
query tokens and X-axis shows the tokens being attended to. The attention weight between a pair of
tokens is the maximum of the weights assigned by the multi-head attention mechanism. The color
changes from dark to light as weight changes from 0 to 1.
def on_resize(self, event):
    event.apply_zoom()
Figure 1: Variable Misuse Example. In the code snippet, ‘event.apply_zoom’ should actually
be ‘self.apply_zoom’. The CuBERT variable-misuse model correctly predicts that the code has
an error. As seen from the attention map, the query tokens are attending to the second occurrence of
the ‘event’ token in the snippet, which corresponds to the incorrect variable usage.
def __gt__(self, other):
    if isinstance(other, int) and other == 0:
        return self.get_value() > 0
    return other is not self
Figure 2: Wrong Operator Example. In this code snippet, ‘other is not self’ should actu-
ally be ‘other < self’. The CuBERT wrong-binary-operator model correctly predicts that the
code snippet has an error. As seen from the attention map, the query tokens are all attending to the
incorrect operator ‘is’.
def __contains__(cls, model):
    return cls._registry in model
Figure 3: Swapped Operand Example. In this code snippet, the return statement should be ‘model
in cls._registry’. The swapped-operand model correctly predicts that the code snippet has
an error. The query tokens are paying substantial attention to ‘in’ and the second occurrence of
‘model’ in the snippet.
Docstring: ’Get form initial data.’
Function:
def __add__(self, cov):
    return SumOfKernel(self, cov)
Figure 4: Function Docstring Example. The CuBERT function-docstring model correctly predicts
that the docstring is wrong for this code snippet. Note that most of the query tokens are attending to
the tokens in the docstring.
try:
    subprocess.call(hook_value)
    return jsonify(success=True), 200
except __HOLE__ as e:
    return jsonify(success=False,
                   error=str(e)), 400
Figure 5: Exception Classification Example. For this code snippet, the CuBERT exception-
classification model correctly predicts ‘__HOLE__’ as ‘OSError’. The model’s attention matrix
also shows that ‘__HOLE__’ is attending to ‘subprocess’, which is indicative of an OS-related
error.