Machine Learning for Programming Language Processing
Research Thesis
Uri Alon
The following publications were part of my PhD research and present its main results:
Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical
implications. In International Conference on Learning Representations, 2021. URL https:
//openreview.net/forum?id=i80OPhOCVH2.
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. A general path-based represen-
tation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Con-
ference on Programming Language Design and Implementation, PLDI 2018, pages 404–419,
New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5698-5. doi: 10.1145/3192366.3192412.
URL http://doi.acm.org/10.1145/3192366.3192412.
Uri Alon, Omer Levy, and Eran Yahav. code2seq: Generating sequences from structured
representations of code. In International Conference on Learning Representations, 2019a.
URL https://openreview.net/forum?id=H1gKYo09tX.
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. Code2vec: Learning distributed
representations of code. Proc. ACM Program. Lang., 3(POPL):40:1–40:29, January 2019b.
ISSN 2475-1421. doi: 10.1145/3290353. URL http://doi.acm.org/10.1145/3290353.
Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models of code.
In International Conference on Machine Learning, pages 245–256. PMLR, 2020.
The following publications were part of my PhD research and present results that
are supplemental to this work:
Uri Alon, Golan Pundak, and Tara N Sainath. Contextual speech recognition with diffi-
cult negative training examples. In ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 6440–6444. IEEE, 2019.
Shaked Brody, Uri Alon, and Eran Yahav. A structural model for contextual code changes.
Proceedings of the ACM on Programming Languages, 4(OOPSLA):1–28, 2020.
Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks?
arXiv preprint arXiv:2105.14491, 2021.
Yaniv David, Uri Alon, and Eran Yahav. Neural reverse engineering of stripped binaries
using augmented control flow graphs. Proceedings of the ACM on Programming Languages,
4(OOPSLA):1–28, 2020.
Noam Yefet, Uri Alon, and Eran Yahav. Adversarial examples for models of code. Proceed-
ings of the ACM on Programming Languages, 4(OOPSLA):1–30, 2020.
Acknowledgements
First of all, I would like to thank my wonderful advisor, Prof. Eran Yahav. This
thesis would not have been possible without his inspirational guidance and his visionary
optimism. It has been a real privilege to learn from him how to do research.
I am also grateful to Dr. Omer Levy. Although we were 10 time zones apart during
our joint work, his help was vital to this dissertation and to my development as a
researcher.
I would like to thank my co-authors: Meital Zilberstein, Shaked Brody, Roy Sadaka,
Yaniv David, Noam Yefet, Elad Nahmias, Chaim Baskin and Ben Finkelshtein; my
office-mates over the years Gail Weiss and Dana Drachsler-Cohen; and Guy Waldman
who created the demo websites, which helped me demonstrate and publicize my work.
I would also like to thank Golan Pundak and Tara N. Sainath, who hosted me for
an internship at Google New York in the summer of 2018. Although not directly
related to the main line of my thesis, this internship contributed significantly to the
development of this dissertation.
I would like to thank my mother-in-law, Anath, for her constant help and support,
and my father-in-law, Dudi (Prof. David Engelberg), for the out-of-scope academic
advice, and for many pointers along the way.
I am grateful to my parents, Dorit and Nir, who have always encouraged me to
excel and follow my dreams.
Finally, I would like to thank my amazing wife, Lee, who has convinced me to
pursue graduate studies, supported my every decision, and with our dearest one, Gur,
tolerated my deadlines and rejections.
I gratefully acknowledge the generous financial help of the Technion, the Henry and
Marilyn Taub Faculty of Computer Science, and the Irwin and Joan Jacobs scholarship.
Contents
List of Figures
Abstract
1 Introduction
1.1 Background
1.1.1 AST paths
1.2 Overview
1.2.1 Contributions
8 Conclusion
Hebrew Abstract
List of Figures
2.1 A JavaScript program and its AST, along with an example of one of the
paths.
2.2 An overview of our approach. We start with a code snippet C, and
extract its path representation to be used as an input to machine learning
models. The AST and paths were extracted from the example program
in Figure 2.1a.
2.3 A JavaScript statement and its partial AST.
2.4 An example statement and its AST, with an example of a path between
the SymbolVar terminals that represent a and d. The length of this path
is 4, and its width is 3.
2.5 An example of a typical program where the maximal path length is
relatively small, but the width can be large.
2.6 Example of a Python program with stripped names and with predictions
produced using our AST paths.
2.7 Example of a JavaScript program with stripped names, with predictions
produced using our AST paths and an online version of UnuglifyJS at
nice2predict.org. This is the default example shown at nice2predict.org.
2.8 Examples of Java programs with stripped names and with predictions
produced using our AST paths. We deliberately selected challenging
examples in which the prediction cannot be aided by specific classes and
interfaces.
2.9 Accuracy results of AST paths with CRFs, for the task of variable nam-
ing in JavaScript, for different combination values of max_length and
max_width (UnuglifyJS is presented here for comparison).
2.10 The accuracy of each abstraction method compared to the consumed
training time, for the task of variable naming in Java.
3.1 A code snippet and its predicted labels as computed by our model.
3.2 Examples of three methods that can be easily distinguished by our
model despite having similar syntactic structure: our model successfully
captures the subtle differences between them and predicts meaningful
names. Each method portrays the top-4 paths that were given the most
attention by the model. The widths of the colored paths are proportional
to the attention that each path was given.
3.3 The top-4 attended paths of Figure 3.2a, as were learned by the model,
shown on the AST of the same snippet. The width of each colored path
is proportional to the attention it was given (red path 1: 0.23, blue path 2: 0.14,
green path 3: 0.09, orange path 4: 0.07).
3.4 The architecture of our path-attention network. A fully connected layer
learns to combine embeddings of each path-context with itself; atten-
tion weights are learned using the combined context vectors and used to
compute a code vector. The code vector is used to predict the label.
3.5 Our model achieves significantly better results than the baselines and in
shorter time.
3.6 Example predictions from our model, with the top-4 paths that were
given the most attention for each code snippet. The width of each path
is proportional to the attention it was given by the model.
3.7 An example for a method name prediction, portrayed on the AST. The
top-four path-contexts were given a similar attention, which is higher
than the rest of the path-contexts.
3.8 An example for a method name prediction, portrayed on the AST. The
width of each path is proportional to the attention it was given.
4.1 Example of (a) code summarization of a Java code snippet, and (b) code
captioning of a C# code snippet, along with the predictions produced by
our models. The highlighted paths in each example are the top-attended
paths in each decoding step. Because of space limitations we included
only the top-attended path for each decoding step, but hundreds of paths
are attended at each step.
4.2 An example of two Java methods that have exactly the same function-
ality. Although these methods have different sequential (token-based)
representations, repeating paths, which might differ in only a single node
(a ForStmt node instead of a Do-while node), will be revealed if we con-
sider syntactic patterns.
4.3 Our model encodes each AST path with its values as a vector, and uses
the average of all of the k paths as the decoder’s start state. The decoder
generates an output sequence while attending over the k encoded paths.
4.4 Visualization of the F1 score of our model compared to the baselines,
for the code summarization task, across datasets. Our model achieves
significantly higher results than the baselines.
4.5 F1 score compared to the length of the input code. This experiment was
performed for the code summarization task on the Java-med test set. All
examples having more than 30 lines were counted as having 30 lines.
4.6 Visualization of the BLEU score of our model compared to the baselines,
for the code captioning task. Our model achieves significantly higher
results than the baselines.
5.1 Examples from the Java (left) and C# (right) test sets. The highlighted
expression in each example is the target p, which our models correctly
generated from the rest of the snippet. Additional and larger examples
can be found in the supplementary material.
5.2 The subtree representing x > 1 is generated given its surrounding tree.
At each step, the model generates the next node (denoted by ? ) of path1 ,
path2 and path3 using the root path R. Dashed lines denote the AST
structure; solid lines denote AST paths. Most AST paths are omitted
from the figure, for clarity.
5.3 Augmenting the AST with EOS_node and EOS_tok nodes.
5.4 Examples for cases where the top candidate is a “tree-match” (marked
with ), but only the second candidate is an “exact match” (marked
with ✓ in bold). Predictions that are logically equivalent to the ground
truth are marked with ↔.
6.1 The bottleneck that existed in RNN seq2seq models (before attention) is
strictly more harmful in GNNs: information from a node’s exponentially-
growing receptive field is compressed into a fixed-size vector. Black ar-
rows are graph edges; red curved arrows illustrate information flow.
6.2 The NeighborsMatch problem: green nodes have blue neighbors and
an alphabetical label. The goal is to predict the label (A, B, or C) of
the green node that has the same number of blue neighbors as the target
node in the same graph. In this example, the correct label is C, because
the target node has two blue neighbors, like the node marked with C in
the same graph.
6.3 Standard GAT (Figure 6.3a) computes static attention: the ranking of
attention coefficients is global for all nodes in the graph, and is uncon-
ditioned on the query node. For example, all queries (q0 to q9) attend
mostly to the 8th key (k8). In contrast, GATv2 (Figure 6.3b) can ac-
tually compute dynamic attention, where every query has a different
ranking of attention coefficients of the keys.
7.2 An example of two edits. These examples are different and the edits
operate on different values. However, observing the structure of these
edits reveals the similarity between them and allows a learning model to
generalize better. This similarity is expressed as almost identical AST
paths. For simplicity, only the program fragment that should be edited
P is shown, without the context C.
7.3 An EditCompletion example from our test set. Figure 7.3a shows the
edit that transforms C into C ′ – overloading the function AddNavigation.
Figure 7.3e shows P and P ′ as code in red and green, respectively. Fig-
ure 7.3b depicts the partial AST and the first three edit operations of
the edit. Figure 7.3c shows the AST after applying the first three oper-
ations, and shows the next three operations as AST paths. Figure 7.3d
illustrates the AST after performing all operations, resulting in an AST
that corresponds to P ′ . Every edit operation is represented by an AST
path having the same color and number as the edit command. Dot-
ted contours represent subtrees that will be affected by applying these
operations.
7.4 A Java snippet f1 is classified correctly as sort by the model of code2vec.
org. Given f1 and the target contains, our approach generates f2 by
renaming array to ttypes. Given the target escape, our approach gener-
ates f3 by adding an unused variable declaration of int upperhexdigits.
Additional examples can be found in Yefet et al. (2020).
7.5 Perturbing a variable name: the original variable name is represented as
a one-hot vector over the variable-name vocabulary. After perturbation,
the vector is no longer one-hot. We apply argmax to find the most
likely adversarial name, resulting in another one-hot vector over the
variable-name vocabulary.
7.6 A C# VarMisuse example which is classified correctly as DestinationType
in the method Equals by the GGNN model of Allamanis et al. (2018).
Given the code in Figure 7.6a and the target SourceType, our approach
renames a local variable destination in another method to the spe-
cific name scsqbhj, making the model predict the wrong variable in
the method Equals, thus (“maliciously”) introducing a real bug in the
method Equals. Additional examples are shown in Yefet et al. (2020).
Abstract
Over the past decade, the deep learning revolution, driven by artificial neural networks,
has transformed a broad range of areas in computer science such as computer vision,
speech recognition, and natural language processing (NLP). In parallel, the number of
open-source, publicly available codebases has grown dramatically, enabling the appli-
cation of neural networks to a wide range of programming-related tasks, a field that
we dub Programming Language Processing (PLP).
Yet, the problem of representing programs in machine- and deep-learning models
has remained an open question. Obviously, programs do not have a straightforward
tensorial representation as images do. Although a program can be represented as a sequence
of tokens like a natural language text, programs are far more structured than free-text,
since programs must comply with a rigid and rich syntax, defined by a context-free
grammar. Furthermore, every programming language has predefined semantics that
describe what syntactically valid programs mean and do.
This dissertation focuses on this general problem: representing programs in machine-
and deep-learning models, in ways that facilitate learning while capturing as much
information as possible, and keeping the model as general as possible. This thesis
introduces the AST paths approach, which represents programs using paths from the
program’s Abstract Syntax Tree (AST). The AST paths representation allows build-
ing powerful and accurate neural models, while keeping them lightweight and scalable.
Specifically, this thesis shows how these models can be trained on millions of examples,
in tasks that include predicting properties of individual program elements, predicting
properties of code snippets, generating a natural language sequence from a given source
code snippet, and generating code completions. These models were publicly released as
online interactive demos along with open-source implementations and datasets. Some
of the models, such as code2vec and code2seq, are highly popular and widely used in
academia and industry.
Finally, this dissertation studies theoretical differences between different program
representations. These studies revealed broader foundational limitations of another
popular representation, the graph neural network framework.
Chapter 1
Introduction
The past decade has seen monumental advances in machine learning. Specifically,
deep learning, powered by artificial neural networks, has transformed a broad range
of areas in computer science such as computer vision (Krizhevsky et al., 2012), speech
recognition (Hinton et al., 2012), and natural language processing (NLP) (Collobert
et al., 2011), and has spread to a variety of other scientific disciplines such as biology
(Angermueller et al., 2016), chemistry (Gilmer et al., 2017), and physical simulation
(Sanchez-Gonzalez et al., 2020). In parallel, despite the great popularity of rule-based (i.e.,
deterministic, non-learning) coding assistance tools, the vast majority of programming,
fixing, naming, and debugging efforts are still performed by human programmers. Pro-
gramming requires expertise and is time-consuming; further, even expert programmers
frequently look up help online, and still introduce bugs.
ing availability of open-source code repositories creates exciting new opportunities for
employing deep learning for a wide range of programming-related applications, a field
that we dub Programming Language Processing (PLP).
Nonetheless, before even considering the actual learning algorithm or model, rep-
resenting programs in machine- and deep-learning models has remained an open problem.
That is, it is unclear which facets of programs capture the most information about
the program, while remaining compact and generalizable. Further, it is unclear how
to input these facets into a learning algorithm. While an image can be represented as
a matrix or a tensor of pixels, and a natural language sentence can be represented as
a one-dimensional sequence of words, it is unclear whether programs can be described
using such simple input representations. Although a program can be represented as a
sequence of tokens like natural language text, programs are far more structured than
free-text, since a (valid) program must comply with a rigid and rich syntax, defined by
a context-free grammar. Thus, a program may only look like text, where in fact, it is
more of a tree than a sequence. Further, every programming language has predefined
semantics that describe what syntactically valid programs mean and do. So, providing
the learning model with some information about the semantics of the language, along
with the program itself, might help the model in performing some tasks. However,
excessive dependence on semantics restricts the scalability of the learning models, lim-
its the feasible amount of training data, and hurts the practical applicability of such
models.
Thus, the main question investigated in this thesis is: how can programs be represented
in machine- and deep-learning models in a way that facilitates learning, captures as much
information as possible, and keeps the model as general as possible?
Research impact In this thesis, we present a line of work that began as a contribu-
tion in the field of programming languages, but quickly drew attention and impacted
a variety of other fields such as software engineering (Henkel et al., 2018; Cambronero
et al., 2019; Liu et al., 2019b; Kang et al., 2019), machine learning (Zügner et al., 2021;
Yao et al., 2021; Fernandes et al., 2019; Liu et al., 2021), NLP (Feng et al., 2020; Fujita
et al., 2020; Panthaplackel et al., 2020; Yu et al., 2020) and security (Schuster et al.,
2021; Sonnekalb, 2019; Compton et al., 2020; Lacomis et al., 2019). Further, our in-
tuitions from structural representations of code have led to identifying general insights
that drew attention in the general geometric deep learning community (Valsesia et al.,
2021; Bronstein et al., 2021; Morris et al., 2021; Kurin et al., 2021; Lukovnikov and
Fischer, 2021; Kreuzer et al., 2021; Godwin et al., 2021). Our code and online demos
are used by thousands of people and have drawn positive feedback and enthusiastic responses.
Figure 1.1: A demonstration of some of the models presented in this thesis: variable
name prediction (Chapter 2) and method name prediction (Chapter 3).
1.1 Background
Representing programs in machine- and deep-learning models has always been an open
problem. Here we discuss the main existing approaches.
(Figure 1.2 shows a spectrum from learning effort to analysis effort: surface text (token
stream), AST paths, handcrafted features, data-flow analysis, and control-flow analysis.)
Figure 1.2: An abstract illustration of the tradeoff between the learning effort (the effort
that is put on the learning model) and the analysis effort (that is made before learning).
Learning from the surface text is straightforward and allows employing existing textual
approaches, but puts much effort on the learning model. Manually designed features
and semantic analysis reduce the learning effort, but make the model language- and
task-specific.
A useful and interesting sweet-spot between the sequential and the semantic represen-
tations, which is one of the main topics of this dissertation, is the AST paths representation.
AST paths represent programs as paths between nodes in the program’s Abstract Syn-
tax Tree (AST). These paths can be thought of as “structural n-grams”, which make
this approach general and simple. The use of syntax reduces much of the learning effort
from AST paths-based models compared to sequential models, because the model does
not need to waste capacity by learning that “an opening parenthesis must be followed
by a closing parenthesis”, for example. Compared to semantic models, AST paths are
much more general, because the same model and approach can be used for different
tasks and languages, while only replacing the parser.
Figure 1.2 summarizes the tradeoff discussed above between the learning effort and
the generalization limitation. In the following chapters, we show the applicability and
generality of the AST paths approach. We note that there are other works that follow
the purely-syntactic approach (Allamanis and Sutton, 2014; Bielik et al., 2016; Yin
and Neubig, 2017; Rabinovich et al., 2017; Yin et al., 2019; Yin and Neubig, 2018; Yao
et al., 2021; Kim et al., 2021). Some of these will be discussed where relevant; others
use GNN to encode the AST, and are thus susceptible to the GNN bottleneck that we
present in Chapter 6.
Figure 1.3: A screenshot of the code2vec website: http://code2vec.org. A user-
provided code snippet is fed into the model, which predicts an appropriate label (a
method name of sort, in this case). The website also shows the paths that were given
the highest attention in the prediction.
1.2 Overview
Next, we give a brief overview of some of the main methods presented in this thesis.
The exact details and formal definitions are provided in the respective chapters.
code2seq The ability to generate natural language sequences from source code snip-
pets has a variety of applications such as code summarization, documentation, and
retrieval. Sequence-to-sequence (seq2seq) models, adopted from neural machine trans-
lation (NMT), have achieved state-of-the-art performance on these tasks by treating
source code as a sequence of tokens.
Figure 1.4: A screenshot of the code2seq website: http://code2seq.org. A user-
provided code snippet is fed into the model, which predicts an appropriate natural
language sequence: “save bitmap to file”. The tree on the right illustrates the paths
that were given the highest attention while predicting each of the output words.
Figure 1.5: A screenshot of the AnyCodeGen website: http://AnyCodeGen.org. A
partial user-provided code snippet is fed into the model, with one or more “holes”,
marked by “??”. The model generates a completion for every “hole”, such as
stats[i].getPath(). The website also shows the predicted partial AST of each of
the suggested completions.
Table 1.1: Aspects of programming language processing considered in this thesis.

Aspect                    Instantiation in this thesis
Code representation       AST paths (Chapter 2); code2vec (Chapter 3); code2seq (Chapter 4); Structural language models (Chapter 5)
Application               Adversarial examples for models of code (Section 7.2); Edit completion (Section 7.1); Reverse engineering using neural networks (David et al., 2020)
Foundations and theory    The GNN bottleneck (Section 6.1); Expressiveness of graph attention networks (Section 6.2)
1.2.1 Contributions
To summarize, the main contributions of this thesis are:
• An approach for embedding an entire code snippet as a vector, based on its bag
of AST paths. We demonstrate this approach in a large-scale neural model that
predicts a label from a code vector (Chapter 3).
Table 1.1 summarizes the different aspects of programming language processing that
we consider in this thesis.
Chapter 2
A General Path-Based
Representation for Predicting
Program Properties
Predicting program properties such as names or expression types has a wide range of
applications. It can ease the task of programming, and increase programmer produc-
tivity. A major challenge when learning from programs is how to represent programs
in a way that facilitates effective learning.
We present a general path-based representation for learning from programs. Our
representation is purely syntactic and extracted automatically. The main idea is to
represent a program using paths in its abstract syntax tree (AST). This allows a learning
model to leverage the structured nature of code rather than treating it as a flat sequence
of tokens.
We show that this representation is general and can: cover different prediction tasks,
drive different learning algorithms (for both generative and discriminative models), and
work across different programming languages.
We evaluate our approach on the tasks of predicting variable names, method names,
and full types. We use our representation to drive both CRF-based and word2vec-
based learning, for programs of four languages: JavaScript, Java, Python and C#. Our
evaluation shows that our approach obtains better results than task-specific handcrafted
representations across different tasks and programming languages.
2.1 Introduction
Leveraging machine learning models for predicting program properties such as variable
names, method names, and expression types is a topic of much recent interest (Raychev
et al., 2015; Allamanis et al., 2015a, 2016; Raychev et al., 2016a; Bielik et al., 2016;
Maddison and Tarlow, 2014). These techniques are based on learning a statistical model
from a large amount of code and using the model to make predictions in new programs.
while (!d) {
    if (someCondition()) {
        d = true;
    }
}
(a) A simple JavaScript program. (b) The program's AST, and an example of an AST path.
Figure 2.1: A JavaScript program and its AST, along with an example of one of the
paths.
Our approach We present a novel program representation for learning from pro-
grams. Our approach uses different path-based abstractions of the program’s abstract
syntax tree. This family of path-based representations is natural, general, fully auto-
matic, and works well across different tasks and programming languages.
AST paths We define AST paths as paths between nodes in a program’s abstract
syntax tree (AST). To automatically generate paths, we first parse the program to
produce an AST, and then extract paths between nodes in the tree. We represent a
path in the AST as a sequence of nodes connected by up and down movements, and
represent a program element as the set of paths that its occurrences participate in.
Figure 2.1a shows an example JavaScript program. Figure 2.1b shows its AST, and
one of the extracted paths. The path from the first occurrence of the variable d to its
second occurrence can be represented as:
This is an example of a pairwise path between leaves in the AST, but in general the
family of path-based representations contains n-wise paths, which do not necessarily
span between leaves and do not necessarily contain all the nodes in between. Specifi-
cally, we consider several choices of subsets of this family in Section 2.3.
Using a path-based representation has several major advantages over existing meth-
ods:
1. Paths are generated automatically: there is no need for manual design of features
aiming to capture potentially interesting relationships between program elements.
This approach extracts unexpectedly useful paths, without the need for an expert
to design features. The user is required only to choose a subset of our proposed
family of path-based representations.
2. This representation is useful for any programming language, without the need to
identify common patterns and nuances in each language.
4. AST paths are purely syntactic, and do not require any semantic analysis.
Tasks In this work, we demonstrate the power and generality of AST paths on the
following tasks:
• Predicting method names Good method names adequately balance the need
to describe the internal implementation of the method and its external usage (Høst
and Østvold, 2009). When published in a popular library’s API, descriptive and
intuitive method names facilitate the use of methods and classes, while poorly
chosen names can doom a project to irrelevance (Allamanis et al., 2015a). Al-
though method names are clearly program elements and can be predicted by the
previous task, in this task we assume that all the other names in the method
are given, along with the names of the elements around the method invocation,
when available in the same file.
Raychev et al. (2015) used relations in the AST as features for learning tasks over
programs. They defined an explicit grammar to derive features which capture specific
relationships between nodes in the AST of JavaScript programs, as well as relations
Figure 2.2: An overview of our approach. We start with a code snippet C, and extract
its path representation to be used as an input to machine learning models. The AST
and paths were extracted from the example program in Figure 2.1a.
produced by language-specific semantic analysis, such as “may call” and “may access”.
We show that our automatic general representation performs better than their features
for their original task, and also generalizes to drive two different learning algorithms
and three different prediction tasks, over different programming languages.
Paths in an AST have also been used by Bielik et al. (2016) and by Raychev et al.
(2016a,b) for a different goal: identifying context nodes. These works do not use the
paths themselves as a representation of the input, and the prediction is only affected
by the context node that was found on the other end of the path. In our work, we
use the path itself as a representation of a program element. Therefore, the prediction
depends not only on the context node but also on the way it is related to the element
in question.
Allamanis et al. (2015a) defined the challenging task of predicting method names,
which can be viewed as a form of function summarization (Allamanis et al., 2016). We
show that our representation performs better by being able to learn across projects.
• A new, general family of representations for program elements. The main idea is
to use AST paths as representations of code.
2.2 Overview
In this section, we illustrate our approach with a simple JavaScript program for the
task of predicting names; as we show in later sections, the same approach also applies
to other tasks, other languages, and other learning algorithms.
Given a program with non-descriptive names, our goal is to predict likely names for
local variables and function parameters. The non-descriptive names could have been
given by an inexperienced programmer, or could have been the result of deliberate strip-
ping. In the latter case, we refer to such a program as a program with stripped names.
Stripping names can be part of a minification process in JavaScript, or obfuscation in
Java and other languages.
Consider the code snippet of Figure 2.1a. This simple snippet captures a common
programming pattern in many languages. Suppose that we wish to find a better name
for the variable d.
The path expresses the fact that the variable d is used, with negation, as a stopping
condition of a “while” loop, and then assigned a new value if an “if” condition inside
the loop evaluates to true. This path alone expresses the fact that d is the stopping
condition of the loop.
The path p4 in Figure 2.2, between the variable d and the value true is:
This path captures the fact that the assignment changes the value of d to true, and
therefore it is indeed the assignment that stops the loop.
and neither “done”, “complete”, nor any similar name was predicted by past work for
this example.
Learning algorithms The learning model can vary between different algorithms,
presenting tradeoffs of efficiency and accuracy. In Section 2.4.3 we show that both CRFs
and word2vec can be used for this prediction task. In both of these learning algorithms,
using AST paths produces better results than the alternative representations, whether
they are manually designed or sequence-based representations.
Key aspects The example highlights several key aspects of our approach:
• Useful paths such as path I span multiple lines of the program, but are also
supported by shorter paths like path II, which only spans a single program line.
Short paths alone are not enough to predict a meaningful name. Making a pre-
diction using all paths that an element participates in provides a rich context for
predicting the name of the element.
• AST paths can distinguish between programs that previous works could not.
• In addition to predicting done, a model trained with AST paths can propose
several semantically similar names, as we demonstrate in Section 2.4.3. This
shows that AST paths are strong indicators of the program element’s semantics.
2.3.1 AST Paths
To learn from programs, we are looking for a representation that captures interesting
properties of ASTs while keeping it open for generalization. One way to obtain such
a representation is to decompose the AST into parts that repeat across programs but
can also discriminate between different programs. One such decomposition is into paths
between nodes in the AST. We note that in general we consider n-wise paths, i.e., those
that have more than two ends, but for simplicity we base the following definitions on
pairwise paths between AST terminals.
We start by defining an AST, an AST-path, a path-context and an abstract path-
context.
Definition 2.3.1 (Abstract Syntax Tree). An Abstract Syntax Tree (AST) for a code
snippet C is a tuple ⟨N, T, X, s, δ, val⟩ where N is a set of nonterminal nodes, T is a set of
terminal nodes, X is a set of terminal values, s ∈ N is the root node, δ : N → (N ∪ T )∗
is a function that maps a nonterminal node to a list of its children, and val : T → X
is a function that maps a terminal node to an associated value. Every node except the
root appears exactly once in all the lists of children.
We define a path-context as a tuple of an AST path and the values associated with
its end nodes (i.e., n_1 and n_{k+1}). In general, we consider path-contexts which span
between arbitrary AST nodes, e.g., a terminal and its ancestor, but for simplicity, we
base the following definitions on path-contexts which span between terminals:
Definition 2.3.3 (Path-context). Given an AST Path p, its path-context is the triplet
⟨xs , p, xf ⟩ where xs = val (start (p)) and xf = val (end (p)) are the values associated
with the start and end nodes of p.
var item = array[i];
Figure 2.3: A JavaScript statement and its partial AST.
That is, a path-context describes two nodes from the AST with the syntactic path
between them.
Finally, we define an abstract path-context as an abstraction of a concrete path-context:
Example 2.3.5. For example, consider the JavaScript line of code in Figure 2.3a and
its partial AST in Figure 2.3b. We denote the path between the variable item to the
variable array by p. Using α_id, the abstract path-context of p is:
Using a different abstraction function yields a different abstract path-context, for
example α_forget-arrows:
Naïvely extracting all the paths in the AST and representing each of them uniquely
can be computationally infeasible, and as a result of the bias-variance tradeoff (Hastie
et al., 2001), can lead to worse prediction results. However, alternative abstraction
functions can be used to control the number of distinct extracted paths. In Section 2.4.6
we describe alternative abstractions that abstract some of the information, and thus
allow us to tune the trade-off between accuracy, training time, and model size.
var a, b, c, d;
Figure 2.4: An example statement and its AST, with an example of a path between
the SymbolVar terminals that represent a and d. The length of this path is 4, and its
width is 3.
Path length and width We define two hyper-parameters that limit the path length
and width:
• max_length, defined as the maximal length of a path, i.e., the maximum value
of k.
• max_width, defined as the maximal allowed difference between sibling nodes that
participate in the same path, as shown in Figure 2.4.
When limiting these parameters to certain values, we do not extract longer or wider
paths. We tune the optimal values of width and length by grid search of combinations
on a validation set of programs and choose the combination that yields the highest accu-
racy, as described in Section 2.4. The tuning process of finding the optimal parameter
values should be separate for each language and task.
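As an illustration, and reusing the per-leaf ancestor lists from the earlier sketch, the length and width of a pairwise path could be computed and limited as follows; this is again a hedged sketch, not the actual implementation.

def within_limits(anc_a, anc_b, max_length, max_width):
    # length = number of up movements plus down movements;
    # width = distance between the two children of the lowest common ancestor
    # (LCA) through which the path passes.
    lca_depth = 0
    for u, v in zip(anc_a, anc_b):
        if u is not v:
            break
        lca_depth += 1
    length = (len(anc_a) - lca_depth) + (len(anc_b) - lca_depth)
    lca = anc_a[lca_depth - 1]
    width = abs(lca.children.index(anc_a[lca_depth]) -
                lca.children.index(anc_b[lca_depth]))
    return length <= max_length and width <= max_width

Under a tree shaped like the one in Figure 2.4, this computation gives length 4 and width 3 for the path between a and d, matching the caption.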
Obviously, setting the values of these parameters to a value that is too low limits the
expressiveness of the paths, does not capture enough context for each element, limits
the ability to model the training and test data, and therefore produces poor accuracy.
Why, then, does limiting the path length and width actually improve accuracy? There
are several reasons:
• Sparsity Using paths that are too long can cause the representation space to be
too sparse. A long path might appear too few times (or even only once) in the
assert.equal(a,1);
assert.equal(...);
...
assert.equal(b,1);
Figure 2.5: An example of a typical program where the maximal path length is relatively
small, but the width can be large.
training set and cause the model to predict specific labels with high probability.
This phenomenon is known as overfitting, where the learned AST paths are very
specific to the training data and the model fails to generalize to new, unseen data.
• Performance There is a practical limit on the amount of data that a model can
be trained on. Too much data can cause the training phase to become infeasibly
long. There is a tradeoff between how many programs the model can be trained
on, and how many paths are extracted from each program. Therefore, it makes
sense to limit the number of extracted paths from each program by limiting the
paths’ length and width, in order to be able to train on a larger and more varied
training set.
In fact, tuning path length and width is used to control the bias-variance tradeoff.
Shorter paths increase the bias error, while longer paths increase the variance error.
The relationship between these parameters and results is discussed and demonstrated
in Section 2.4.5.
2.4 Evaluation
Since the goal of this work is to provide a representation of program elements, we com-
pared the effect of different representations on the accuracy of the learning algorithms.
To show that our approach can be applied to the representation of the input without
modifying the learning algorithm, we used off-the-shelf learning algorithms but repre-
sented the input in each experiment using a different representation (when possible).
Our evaluation aims to answer the following questions:
• How useful are AST paths compared to existing representations? (Section 2.4.3)
• How useful are AST paths across different programming languages, tasks and
learning algorithms? (Section 2.4.3)
• Do AST paths just memorize the input, or do they capture deeper semantic
regularities? (Section 2.4.4)
• How long are the useful paths? How do the paths’ length and width affect the
results? (Section 2.4.5)
• How important is the concrete representation of paths? Which abstractions can
be used to represent paths without reducing accuracy? (Section 2.4.6)
Leafwise and semi-paths Although the family of representations in this work in-
cludes n-wise paths and paths between any kind of AST nodes, for simplicity and fea-
sible training time, we performed most of the experiments using leafwise-paths (paths
between AST terminals) and semi-paths — paths between an AST terminal and one
of its ancestor nodes in the AST. The idea is that leafwise-paths are more diverse and
therefore more expressive than semi-paths, but semi-paths provide more generalization.
Semi-paths allow us to generalize learning and capture common patterns in different
programs, even if the full path does not recur.
An exception is the prediction of full types in Java, in which we predict types
of expressions which are not necessarily terminals. In this case, we also used paths
from terminals to the nonterminal in question.
AST construction and path extraction For Java we used JavaParser; for JavaScript
we used UglifyJS for parsing and traversing the AST, along with additional modifica-
tions from UnuglifyJS; for Python we used the Python internal parser and AST visitor;
and for C# we used Roslyn.
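As a small, self-contained illustration of the Python side, the built-in parser and an AST visitor can be used to collect identifier occurrences, i.e., the terminals between which paths are extracted. The snippet below is only an example of this setup, not the actual extraction pipeline.

import ast

source = """
def sh3(c):
    r = run(c)
    if r:
        raise CalledProcessError(r, c)
"""

class IdentifierCollector(ast.NodeVisitor):
    def __init__(self):
        self.occurrences = []

    def visit_Name(self, node):
        # record (identifier, line, column) for every Name terminal
        self.occurrences.append((node.id, node.lineno, node.col_offset))
        self.generic_visit(node)

collector = IdentifierCollector()
collector.visit(ast.parse(source))
print(collector.occurrences)   # [('r', 3, 4), ('run', 3, 8), ('c', 3, 12), ...]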
• Top-k candidates suggestion. CRFs output a single prediction for each program
element. We implemented an additional API that receives a parameter k and
Table 2.1: The amounts of data used for the experimental evaluation of each language.
suggests the top-k candidate names for each program element (this extension
was adopted into Nice2Predict). This allowed us to manually investigate the
quality of results (Section 2.4.4). When all top-k predictions for a variable name
captured similar notions, it increased our confidence that the model performs
stable predictions.
Datasets For each language, we collected source code from public GitHub projects,
and split it randomly to training, validation and test sets. Our data included the top
ranked projects of each language and the projects that were forked the most. Table 2.1
shows the amount of data used for each language. Java required an order of magnitude
more data than the other languages: we had to keep enlarging our Java dataset to
achieve results that were close to the other languages.
Following recent work which found a large amount of code duplication in GitHub (Lopes
et al., 2017), we devoted much effort to filtering duplicates from our dataset, and espe-
cially the JavaScript dataset. To filter duplicates, we used file names, directory names
(such as “node_modules”), and md5 of files. In Java and Python, where projects typically
do not commit their dependencies, duplication is less severe (as also observed by Lopes et al. (2017)).
Furthermore, in our setting, we took the top-ranked and most popular projects, in
which we observed duplication to be less of a problem (Lopes et al. (2017) measured
duplication across all the code in GitHub).
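A simplified sketch of such duplicate filtering is shown below, using directory-name exclusion and file-content md5 digests; the excluded directory set, the file extension, and the exact policy are illustrative assumptions, and the additional file-name-based filtering mentioned above is omitted.

import hashlib
import os

EXCLUDED_DIR_NAMES = {'node_modules'}   # directories of committed dependencies

def collect_unique_files(corpus_root, extension='.js'):
    seen_digests, kept = set(), []
    for dirpath, dirnames, filenames in os.walk(corpus_root):
        # prune excluded directories in place so os.walk does not descend into them
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIR_NAMES]
        for filename in filenames:
            if not filename.endswith(extension):
                continue
            full_path = os.path.join(dirpath, filename)
            with open(full_path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if digest in seen_digests:
                continue                # identical file content seen before: drop
            seen_digests.add(digest)
            kept.append(full_path)
    return kept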
Evaluation metric For simplicity, in all the experiments we measured the per-
centage of exact match predictions, case-insensitive and ignoring differences in non-
alphabetical characters. For example, this metric considers totalCount as an exact
match to total_count. An exception is the comparison to Allamanis et al. (2016), who
optimized their Java method name prediction model to maximize the F1 score over
sub-tokens. In this case, we compared their model with ours on both exact match and
F1 score. An unknown test label (“UNK”) was always counted as an incorrect predic-
tion, or as a possibly partial prediction when using the F1 score, and our model never
suggests “UNK”. For example, if the true test label is get<UNK>, our model could get
partial precision and partial recall for predicting getFoo.
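Read as code, the exact-match check amounts to the following sketch (the partial F1 credit over sub-tokens, used only for the comparison with Allamanis et al. (2016), is omitted):

import re

def normalize(label):
    # case-insensitive, ignoring non-alphabetical characters:
    # normalize('totalCount') == normalize('total_count')
    return re.sub(r'[^a-z]', '', label.lower())

def is_exact_match(predicted, true_label):
    if 'UNK' in true_label:
        return False    # unknown test labels always count as incorrect
    return normalize(predicted) == normalize(true_label)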
Table 2.2: Accuracy comparison for variable name prediction, method name prediction,
and full type prediction using CRFs.
Table 2.3: Accuracy comparison for the variable name prediction task that was evalu-
ated using word2vec in JavaScript.
• Prediction of variable names across all four languages. Variable names have
sufficient training data in all languages to produce meaningful results. In this
experiment we used both CRFs and word2vec. As baselines we used the work of
Raychev et al. (2015), CRFs with token-based n-grams as factors, and a simple
rule-based baseline. For JavaScript with word2vec, we used word2vec with linear
token context as a baseline and show that path representations yield dramatic
improvement.
• Prediction of full types in Java. For Java, we compared our results to a synthetic
(straw-man) baseline that predicts all types to be java.lang.String. This baseline
shows that despite the prevalence of the String type, the task of type prediction
is still very challenging.
Evaluation with CRFs We present our evaluation results with CRFs for names in
the top part of Table 2.2. For JavaScript, where a tool that uses predefined features
exists, we evaluated that tool with the exact same datasets and settings, and
the same AST terminals as CRF nodes, which makes the input representation (AST
paths vs. their features) the only difference between the two experiments. Using our
representations yields 7.6% higher accuracy than the previous work.
For Java, we compared the results with two baselines:
• CRFs + n-grams - this baseline uses the same CRF nodes as the path-based
model, except that the relations between them are the sequential n-grams. We
chose n = 4 as the value that maximizes accuracy on the validation set, such that
the produced model consumes approximately the same amount of memory and
disk as the path-based model.
• Rule-based - Since Java is a typed language which has a rich type system, and
typical code tends to use many classes and interfaces, we wondered whether the
task of predicting variable names is easier in Java than in other languages and can
be solved using traditional rule-based (non-learning) approaches. Our rule-based
baseline predicts variable names based on the following pattern heuristics and
statistics of the training corpus (a sketch of such a baseline follows the list):
– for(int i = ...) {
– this.<fieldName> = <fieldName>;
– catch (... e) {
– void set<fieldName>(... <fieldName>) {
– Otherwise: use the type: HttpClient client.
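The sketch below shows one way such a baseline could be realized; the mapping from each pattern to a predicted name is our reading of the heuristics above, and the string-level matching is only illustrative.

import re

def rule_based_variable_name(declaring_statement, var_type):
    # for (int i = ...) {  ->  i
    if re.match(r'for\s*\(\s*int\b', declaring_statement):
        return 'i'
    # this.<fieldName> = <fieldName>;  ->  the field name
    m = re.match(r'this\.(\w+)\s*=', declaring_statement)
    if m:
        return m.group(1)
    # catch (... e) {  ->  e
    if declaring_statement.lstrip().startswith('catch'):
        return 'e'
    # void set<FieldName>(... <fieldName>) {  ->  fieldName
    m = re.match(r'void\s+set(\w+)\s*\(', declaring_statement)
    if m:
        return m.group(1)[0].lower() + m.group(1)[1:]
    # otherwise, derive the name from the type, e.g., HttpClient -> client
    words = re.findall(r'[A-Z][a-z0-9]*|[a-z0-9]+', var_type)
    return words[-1].lower() if words else var_type.lower()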
As shown, using CRFs with AST paths yields higher results than the baselines, in
all the languages, showing that our representation yields higher results than manually
defined features, n-grams, and rule-based approaches.
• The linear token-stream approach uses the surrounding tokens to predict a vari-
able name. Surrounding tokens (e.g., values, keywords, parentheses, dots and
brackets) may implicitly hint at the syntactic relations, without AST paths. This
is the type of context usually used in NLP, in the original implementation of
word2vec, and in many works in programming languages.
• The path-neighbors, no-paths approach uses the same surrounding AST nodes for
contexts as AST paths, except that the path itself is hidden, and only the identity
of the surrounding AST nodes is used. The goal of using this baseline is to show
that the advantage of AST paths over token-stream is not only in their wider
span, but in the representation of the path itself.
Using word2vec with AST paths produces much better results compared to these
baselines. This shows the advantage of using AST paths as context over token-stream
based contexts, and the significance of using a representation of the paths for prediction.
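To illustrate the difference between the three context types, the sketch below shows how contexts could be formed for a word2vec-style model from the path-context triplets of the earlier sketch and from a plain token stream; the function names and the window size are illustrative assumptions, not the exact training setup used in the experiments.

def ast_path_contexts(element_value, path_contexts):
    # AST-path contexts: each context pairs the path itself with the value at
    # its other end (in either direction).
    contexts = []
    for x_s, path, x_f in path_contexts:
        if x_s == element_value:
            contexts.append((path, x_f))
        elif x_f == element_value:
            contexts.append((path, x_s))
    return contexts

def path_neighbors_no_paths(element_value, path_contexts):
    # 'Path-neighbors, no-paths': the same neighboring nodes, with the path hidden.
    return [other for _, other in ast_path_contexts(element_value, path_contexts)]

def linear_token_contexts(tokens, i, window=2):
    # Linear token-stream contexts: the tokens surrounding position i.
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]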
Limitations of evaluation We noticed that our models often predict names that are
very similar but not identical to the original name, such as message instead of msg, or
synonyms such as complete instead of done; these are counted as incorrect predictions.
Moreover, we noticed that our models sometimes predict better names than the original
names. Therefore, the accuracy results are an underapproximation of the ability of AST
paths to predict meaningful names.
Another limitation lies in the inability of CRFs and word2vec to predict out-of-
vocabulary (OoV) names. As was previously observed (Allamanis et al., 2016, 2015a),
there are two main types of OoV names in programs: names that did not appear in
the training corpus but can be composed of known names (neologisms), and entirely
new names. The total OoV rate among our various datasets and tasks varied between
5% and 15%; specifically, 7% for predicting variable names in JavaScript, and 13%
for Java method names. Several techniques were suggested to deal with each type of
OoV (Allamanis et al., 2016, 2015a), which we did not consider here and are out of
scope of this work.
Discussion We note that the accuracy for Java is lower than for JavaScript. We
have a few possible explanations: The JavaScript training set contains projects that are
rather domain specific, mostly client and server code for web systems (for example, the
terms request and response are widely used across all projects). In contrast, the Java
code is much more varied in terms of domains. Additionally, the Java naming scheme
makes extensive use of compound names (e.g., multithreadedHttpConnectionManager),
and this is amplified by the type-based name suggestions for variables provided by
modern Java IDEs. In contrast, the JavaScript variable names are typically shorter and
are not an amalgamation of multiple words (e.g., value, name, elem, data are frequent
names).
The accuracy of C# is similar to Java, but using significantly less training data.
We believe that C# naming is more structured because the commonly used C# IDE
(Visual Studio) suggests variable names based on their types.
The accuracy for Python is lower than that of JavaScript. Manual examination
of the training data shows that Python programs vary widely in code quality, making
the training set more noisy than that of other languages. In addition, the variety of
domains and IDEs for Python makes variable names less standard. Finally, Python
is easy to write, even for non-programmers, and thus there is a wide variety of non-
professional Python code. The low accuracy for Python is also consistent with Raychev
et al. (2016a).
We present our evaluation results for predicting method names in Table 2.2. Accuracy
was similar for all languages (∼ 50%).
Good method names balance the need to describe the internal implementation of
the method and its external usage (Høst and Østvold, 2009). For predicting method
With stripped names:

def sh3(c):
    p = Popen(c, stdout=PIPE, stderr=PIPE, shell=True)
    o, e = p.communicate()
    r = p.returncode
    if r:
        raise CalledProcessError(r, c)
    else:
        return o.rstrip(), e.rstrip()

With predicted names:

def sh3(cmd):
    process = Popen(cmd, stdout=PIPE, stderr=PIPE, shell=True)
    out, err = process.communicate()
    retcode = process.returncode
    if retcode:
        raise CalledProcessError(retcode, cmd)
    else:
        return out.rstrip(), err.rstrip()
Figure 2.6: Example of a Python program with stripped names and with predictions
produced using our AST paths.
names, we use mostly the paths from within a method to its name, but when available
in the same file, we also use paths from invocations of the method to the method name.
Ideally, one would use paths from different files (and for library methods, even across
projects), but this requires a non-local view, which we would like to avoid for efficiency
reasons.
We use the internal paths from the leaf that represents the method name to other
leaves within the method AST (which capture the method implementation) and the
external paths from references of the method to their surrounding leaves (which rep-
resent the usage of the method). However, we observed that using only internal paths
yields only 1% lower accuracy.
In Java, CRFs with AST paths are compared to the model of Allamanis et al.
(2016), which we trained on the same training corpus. Since their model is optimized
to maximize the F1 score over sub-tokens, Table 2.2 presents both exact accuracy and
F1 score for method name prediction in Java. The table shows that CRFs with AST
paths significantly improve over the previous work in both metrics.
Our results for predicting full types in Java using CRFs are shown in the bottom part
of Table 2.2. Our goal is to predict the full type even when it explicitly appears in
the code (e.g., com.mysql.jdbc.Connection, rather than org.apache.http.Connection).
Here we also use paths from leaves to nonterminals which represent expressions. The
evaluated types were only those that could be solved by a global type inference engine.
Therefore, accuracy is the percent of correct predictions out of the results that are
given by type inference.
Although a type inference engine still produces more accurate results than our learn-
ing approach, our results using AST paths are surprisingly good, especially considering
the relative simplicity of our representation. We also note that type inference is a global
task, and our approach reconstructs types locally without considering the global scope
of the project.
function f(a, b, c) {
    b.open('GET', a, false);
    b.send(c);
}
Figure 2.7: Example of a JavaScript program with stripped names, with predictions
produced using our AST paths and an online version of UnuglifyJS at nice2predict.org.
This is the default example shown at nice2predict.org.
Figure 2.8: Examples of Java programs with stripped names and with predictions
produced using our AST paths. We deliberately selected challenging examples in which
the prediction cannot be aided by specific classes and interfaces.
CRFs with AST paths achieved 69.1% accuracy when predicting full type for Java.
We contrast this result with a naïve baseline that uniformly predicts the type java.lang.String
for all expressions. This naïve baseline yields an accuracy of 24.1%, which shows that
the task is nontrivial, even when factoring out the most commonly used Java type.
Prediction Examples
Figure 2.6 shows an example of a Python program predicted using AST paths. It can
be seen that all the names predicted using AST paths were renamed with meaningful
names such as process, cmd and retcode.
Figure 2.7 shows the default JavaScript example from nice2predict.org, predicted
using AST paths and an online version of UnuglifyJS at nice2predict.org. We note that
their online model was not trained on the same dataset as our model. The model
which was trained using UnuglifyJS and our dataset yielded worse results. It can be
seen that our model produced more meaningful names such as url (instead of source)
and callback (instead of n).
Figure 2.8 shows examples of Java programs. To demonstrate the expressiveness
of AST paths, we deliberately selected challenging examples in which the prediction
cannot be aided by the informative class and interface names that Java code usually
contains (as in: HttpClient client). Instead, our model had to leverage the syntactic
structure to predict the meaningful names: done, values, value and count.
In Section 2.3 we introduced and discussed the importance of the max_length and
max_width parameters. For each language we experimented with different combina-
tions of values for max_length and max_width on its validation set. We chose the
values that produced the highest accuracy while still being computationally feasible
when evaluating the model with the test set.
Accuracy with different path length and width We experimented with tuning
the path parameters and observed their effect on the accuracy. The best parameter
values for each prediction are shown in Table 2.2.
For the task of name prediction, for all languages, the best path length is 6-7, and
the best width is 3-4. The variations in path length stem from minor differences in the
structure of the AST. For example, despite the similarity in source level between Java
and C#, the C# AST is slightly more elaborate than the one we used for Java.
A drill-down of the accuracy given different parameter values for variable name
prediction in JavaScript is shown in Figure 2.9. We observe that the max_length pa-
rameter has a significant positive effect, while the contribution of a larger max_width
is positive but minor. This observation affirms our initial hypothesis that our long-
distance paths are fundamental and crucial to the accuracy of the prediction. It also
confirms our belief that an automatic representation of code (rather than manually de-
fined) is essential, since the long-distance paths are very unlikely to have been designed
manually.
For the task of method name prediction, since there are significantly fewer paths,
we could afford to set a high parameter value without too much tuning and still keep
Figure 2.9: Accuracy results of AST paths with CRFs, for the task of variable nam-
ing in JavaScript, for different combination values of max_length and max_width
(UnuglifyJS is presented here for comparison).
the training time and resources feasible. We therefore set the length in this case to 12
for JavaScript, 10 for Python, and just 6 for Java.
For the task of predicting full types in Java, we used length 4 and width 1, which
yielded an accuracy of 69.1%. The intuition for the short path length is that in many
cases the type of an expression can be inferred locally from other neighboring types,
often from an explicit type declaration.
Higher values for max_length and max_width resulted in higher training times,
but combined with the downsampling approach, it is possible to maintain a shorter
training time while increasing the parameter values, and control the tradeoff between
accuracy and training time.
• “No-arrows” - using the full path encoding, except the up and down symbols
{↑, ↓}.
Figure 2.10: The accuracy of each abstraction method (full, no-arrows, forget-order, first-top-last, top, first-last, no-path) compared to the consumed training time, for the task of variable naming in Java.
• “Forget-order” - using paths without arrows and without order between the nodes:
instead of treating a path as a sequence of nodes, treat it as a bag of nodes.
• “First-top-last” - keeping only the first, top and last nodes of the path. The “top”
node refers to the node that is hierarchically the highest, from which the direction
of the path changes from upwards to downwards.
• “No-paths” - using no paths at all, and treating all relations between program
elements as the same. The name of an element is predicted by using the bag
of surrounding identifiers, without considering the syntactic relation to each of
them.
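The variants above can be sketched concretely as follows. This is an illustrative sketch only: the encoding of a path as a list of (node, direction) steps and the function names are assumptions made for the example, not the representation used in our experiments.

# Illustrative only: a path is assumed to be a list of (node, direction) steps, e.g.
# [("Name", "↑"), ("FieldAccess", "↑"), ("Foreach", "↓"), ..., ("BooleanExpr", "")].
def full(path):
    return "".join(node + direction for node, direction in path)

def no_arrows(path):
    # same nodes, but without the up and down symbols
    return " ".join(node for node, _ in path)

def forget_order(path):
    # a bag of nodes: no arrows and no order between the nodes
    return tuple(sorted(node for node, _ in path))

def first_top_last(path):
    # keep only the first, top and last nodes; the "top" node is the node at which
    # the direction of the path changes from upwards to downwards
    top = next(node for i, (node, d) in enumerate(path)
               if d == "↓" or i == len(path) - 1)
    return (path[0][0], top, path[-1][0])

def no_path(path):
    # no paths at all: every relation between program elements looks the same
    return ""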
All of the following experiments were performed using CRFs for variable names
prediction, on the Java corpus and on the same hardware. In every experiment, the
training corpus and the rest of the settings were identical. The number of training
iterations was fixed.
Figure 2.10 shows the accuracy of each abstraction compared to the consumed
training time. As shown, as more information is kept, accuracy increases, at the cost of a longer training time. An interesting "sweet spot" is "first-top-last", which reduces training time by half compared to the full representation, with accuracy that is 95% as good.
We also observe that the arrows and the order of the nodes in the path contribute
about 1% accuracy.
Chapter 3
3.1 Introduction
Distributed representations of words (such as “word2vec”) (Mikolov et al., 2013a,b;
Pennington et al., 2014), sentences, paragraphs, and documents (such as “doc2vec”)
(Le and Mikolov, 2014) have played a key role in unlocking the potential of neural
networks for natural language processing (NLP) tasks (Bengio et al., 2003; Collobert
and Weston, 2008; Socher et al., 2011; Turian et al., 2010; Glorot et al., 2011; Turney,
2006). Methods for learning distributed representations produce low-dimensional vector
representations for objects, referred to as embeddings. In these vectors, the “meaning”
of an element is distributed across multiple vector components, such that semantically
similar objects are mapped to close vectors.
Goal: The goal of this work is to learn code embeddings, continuous vectors for repre-
senting snippets of code. By learning code embeddings, our long-term goal is to enable
the application of neural techniques to a wide range of programming-language tasks.
In this work, we use the motivating task of semantic labeling of code snippets.
Figure 3.1: A code snippet and its predicted labels as computed by our model (top predictions: reverseArray 77.34%, reverse 18.18%, subArray 1.45%, copyArray 0.74%).
Our approach We present a novel framework for predicting program properties using
neural networks. Our main contribution is a neural network that learns code embeddings
— continuous distributed vector representations for code. The code embeddings allow
us to model correspondence between code snippets and labels in a natural and effective
manner.
Our neural network architecture uses a representation of code snippets that lever-
ages the structured nature of source code and learns to aggregate multiple syntactic
paths into a single vector. This ability is fundamental to the application of deep learning
in programming languages, in the same way that word embeddings in natural language
processing (NLP) are fundamental to the application of deep learning for NLP tasks.
The input to our model is a code snippet and a corresponding tag, label, caption, or
name. This label expresses the semantic property that we wish the network to model,
for example, a tag that should be assigned to the snippet, or the name of the method,
class, or project that the snippet was taken from. Let C be the code snippet and L be
the corresponding label or tag. Our underlying hypothesis is that the distribution of
labels can be inferred from syntactic paths in C. Our model therefore attempts to learn
the label distribution, conditioned on the code: P (L|C).
We demonstrate the effectiveness of our approach for the task of predicting a
method’s name given its body. This problem is important as good method names
make code easier to understand and maintain. A good name for a method provides
a high-level summary of its purpose. Ideally, “If you have a good method name, you
don’t need to look at the body.” (Fowler and Beck, 1999). Choosing good names can be
especially critical for methods that are part of public APIs, as poor method names can
doom a project to irrelevance (Allamanis et al., 2015a; Høst and Østvold, 2009).
Table 3.1: Semantic similarities between method names.
A                  ≈ B
size               getSize, length, getCount, getLength
active             isActive, setActive, getIsActive, enabled
done               end, stop, terminate
toJson             serialize, toJsonString, getJson, asJson
run                execute, call, init, start
executeQuery       executeSql, runQuery, getResultSet
actionPerformed    itemStateChanged, mouseClicked, keyPressed
toString           getName, getDescription, getDisplayName
equal              eq, notEqual, greaterOrEqual, lessOrEqual
error              fatalError, warning, warn
3.1.1 Applications
1. Automatic code review - Suggesting better method names when the name given
by the developer doesn’t match the method’s functionality. Better method names
prevent naming bugs, improve the readability and maintenance of code, and fa-
cilitate the use of public APIs. This application was previously shown to be of
significant importance (Fowler and Beck, 1999; Allamanis et al., 2015a; Høst and
Østvold, 2009).
2. Retrieval and API discovery - Semantic similarities enable search in “the problem
domain” instead of search “in the solution domain”. For example, a developer
might look for a serialize method, while the equivalent method of the class is
named toJson as serialization is performed via json. An automatic tool that
looks for the vector most similar to the requested name among the available
methods will find toJson (Table 3.1). Such semantic similarities are difficult to
find without our approach. Further, an automatic tool which uses our vectors
can easily determine that a programmer is using the method equals right after
toLowerCase and suggest using equalsIgnoreCase instead (Table 3.6).
The code vectors we produce can be used as input to any machine learning pipeline
that performs tasks such as code retrieval, captioning, classification and tagging, or
as a metric for measuring similarity between snippets of code for ranking and clone
detection. The novelty of our approach is in its ability to produce vectors that capture
properties of snippets of code, such that similar snippets (according to any desired
criteria) are assigned similar vectors. This ability unlocks a variety of applications for
working with machine-learning algorithms on code.
We deliberately picked the difficult task of method name prediction, for which
prior results were poor (Allamanis et al., 2016; Alon et al., 2018; Allamanis et al.,
2015a), as an evaluation benchmark. Succeeding in this challenging task implies good
performance in other tasks such as predicting whether or not a program performs I/O,
predicting the required dependencies of a program, and predicting whether a program is
a suspected malware. We show that even for this challenging benchmark, our technique
dramatically improves the results of previous works.
• Learning which parts of the representation are relevant to prediction of the desired property, and learning the order of importance of the parts.
We represent a given code snippet as a bag (multiset) of its extracted paths. The
challenges are then how to aggregate a bag of contexts and which paths to focus on for
making a prediction.
Attention The problem can be stated informally as the need to learn a correspon-
dence between a bag of path-contexts and a label. Representing each bag of path-
contexts monolithically will result in sparsity – even similar methods will not have
the exact same bag of path-contexts. We therefore need a compositional mechanism
that can aggregate a bag of path-contexts such that bags that yield the same label are
mapped to close vectors. Such a compositional mechanism would be able to generalize
and represent new unseen bags by utilizing the individual path-contexts and their com-
ponents (paths, values, etc.) that were observed during training to be parts of other
bags.
To address this challenge we use a novel neural attention network architecture. At-
tention models have gained much popularity recently, mainly for neural machine trans-
lation (NMT) (Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017), reading
comprehension (Levy et al., 2017; Seo et al., 2016), speech recognition (Chorowski et al.,
2015; Bahdanau et al., 2016) and computer vision (Xu et al., 2015; Mnih et al., 2014;
Ba et al., 2014).
Our neural attention mechanism learns how much focus (“attention”) should be
given to each element in a bag of path-contexts. It allows us to precisely aggregate the
information captured in each individual path-context into a single vector that captures
information about the entire code snippet. As we show in Section 3.5.4, our model
is relatively interpretable: the weights allocated by our attention mechanism can be
visualized to understand the relative importance of each path-context in a prediction.
The attention mechanism is learned simultaneously with the embeddings, optimizing
both the atomic representations of paths and the ability to compose multiple contexts
into a single code vector.
Soft and hard attention The terms “soft” and “hard” attention were proposed for
the task of image caption generation by Xu et al. (2015). Applied in our setting, soft
attention means that weights are distributed “softly” over all path-contexts in a code
snippet, while hard attention refers to selection of a single path-context to focus on
at a time. The use of soft attention over syntactic paths is the key insight that leads to the improved results. We compare our model to an equivalent model that
uses hard attention in Section 3.5.2, and show that soft attention is more efficient for
modeling code.
Bielik et al., 2016; Allamanis and Sutton, 2013; Hindle et al., 2012). The ability to
predict semantic properties of a program without running it, and with little or no
semantic analysis at all, is crucial to a wide range of applications: predicting names for
program entities (Alon et al., 2018; Raychev et al., 2015; Allamanis et al., 2015a), code
completion (Raychev et al., 2014; Mishne et al., 2012), code summarization (Allamanis
et al., 2016), code generation (Murali et al., 2018; Maddison and Tarlow, 2014; Amodio
et al., 2017), and more (see (Allamanis et al., 2017; Vechev and Yahav, 2016) for a
survey).
3.1.4 Contributions
The main contributions of this work are:
• A qualitative evaluation that interprets the attention that the model has learned
to give to the different path-contexts when making predictions.
3.2 Overview
In this section we demonstrate how our model assigns different vectors to similar snip-
pets of code, in a way that captures the subtle differences between them. The vectors are
useful for making a prediction about each snippet, even though none of these snippets
has been observed in its entirety in the training data.
The main idea of our approach is to extract syntactic paths from within a code
snippet, represent them as a bag of distributed vector representations, and use an
attention mechanism to compute a learned weighted average of the path vectors in
(a) Predictions: contains, matches, canHandle, equals, containsExact
(b) Predictions: get, getProperty, getValue, getElement, getObject
(c) Predictions: indexOf, getIndex, findIndex, indexOfNull, getInstructionIndex
Figure 3.2: Examples of three methods that can be easily distinguished by our model
despite having similar syntactic structure: our model successfully captures the subtle
differences between them and predicts meaningful names. Each method portrays the
top-4 paths that were given the most attention by the model. The widths of the colored
paths are proportional to the attention that each path was given.
order to produce a single code vector. Finally, this code vector can be used for various
tasks, such as to predict a likely name for the whole snippet.
Path extraction First, each query method in the training corpus is parsed to con-
struct an AST. Then, the AST is traversed and syntactic paths between AST leaves
are extracted. Each path is represented as a sequence of AST nodes, linked by up and
down arrows, which symbolize the up or down link between adjacent nodes in the tree.
The path composition is kept with the values of the AST leaves it is connecting, as a
tuple we refer to as a path-context. These terms are defined formally in Section 3.3.
Figure 3.3 portrays the top-four path-contexts that were given the most attention by
Figure 3.3: The top-4 attended paths of Figure 3.2a, as learned by the model, shown on the AST of the same snippet. The width of each colored path is proportional to the attention it was given (red ①: 0.23, blue ②: 0.14, green ③: 0.09, orange ④: 0.07).
the model, on the AST of the method from Figure 3.2a, such that the width of each
path is proportional to the attention it was given by the model during this prediction.
own, as part of training on millions of examples. For example, it can be seen in Figure 3.3 that the red ① path-context, which spans from the field elements to the return value true, was given the highest attention. For comparison, the blue ② path-context, which spans from the parameter target to the return value false, was given a lower attention.
Consider the red ① path-context of Figure 3.2a and Figure 3.3. As we explain in Section 3.3, this path is represented as:
⟨elements, Name ↑ FieldAccess ↑ Foreach ↓ Block ↓ IfStmt ↓ Block ↓ Return ↓ BooleanExpr, true⟩
Inspecting this path node-by-node reveals that this single path captures the main
functionality of the method: the method iterates over a field called elements, and for
each of its values it checks an if condition; if the condition is true, the method returns
true. Since we use soft attention, the final prediction takes into account other paths as
well, such as paths that describe the if condition itself, but it can be understood why
the model gave this path the highest attention.
Figure 3.2 also shows the top-5 suggestions from the model for each method. As
can be seen in all three examples, most of the top suggestions are very similar to each
other and all of them are descriptive of the method. Observing the top-5 suggestions in
Figure 3.2a shows that two of them (contains and containsExact) are very accurate, but
it can also be imagined how a method called matches would share similar characteristics:
a method called matches is also likely to have an if condition inside a for loop, and to
return true if the condition is true.
Another interesting observation is that the orange ④ path-context of Figure 3.2a,
which spans from Object to target, was given a lower attention than other path-contexts
in the same method but higher attention than the same path-context in Figure 3.2c.
This demonstrates how attention is not constant but is given with respect to the other
path-contexts in the code.
Comparison with n-grams The method in Figure 3.2a shows the four path-contexts
that were given the most attention during the prediction of the method name contains.
Out of them, the orange ④ path-context spans between two consecutive tokens:
Object and target. This might create the (false) impression that representing this
method as a bag-of-bigrams could be as expressive as syntactic paths. However, as
can be seen in Figure 3.3, the orange ④ path goes through an AST node of type
Parameter, which uniquely distinguishes it from, for example, a local variable dec-
laration of the same name and type. In contrast, a bigram model will represent the
expression Object target equally whether target is a method parameter or a local vari-
able. This shows that a model using a syntactic representation of a code snippet can
distinguish between two snippets of code that other representations cannot. By aggre-
gating all the contexts using attention, the model can use subtle differences between
snippets to produce a more accurate prediction.
Key aspects The illustrated examples highlight several key aspects of our approach:
• Subtle differences between code snippets are easily distinguished by our model,
even if the code snippets have a similar syntactic structure and share many com-
mon tokens and n-grams.
Definition 3.3.1 (Abstract Syntax Tree). An Abstract Syntax Tree (AST) for a code
snippet C is a tuple ⟨N, T, X, s, δ, ϕ⟩ where N is a set of nonterminal nodes, T is a set
of terminal nodes, X is a set of values, s ∈ N is the root node, δ : N → (N ∪ T )∗ is
a function that maps a nonterminal node to a list of its children, and ϕ : T → X is a
function that maps a terminal node to an associated value. Every node except the root
appears exactly once in all the lists of children.
Next, we define AST paths. For convenience, in the rest of this section we assume
that all definitions refer to a single AST ⟨N, T, X, s, δ, ϕ⟩.
An AST path is a path between nodes in the AST, starting from one terminal,
ending in another terminal, and passing through an intermediate nonterminal in the
path which is a common ancestor of both terminals. More formally:
Definition 3.3.2 (AST path). An AST-path of length k is a sequence of the form $n_1 d_1 n_2 d_2 \ldots n_k d_k n_{k+1}$, where $n_1, n_{k+1} \in T$ are terminals, for $i \in [2..k]$: $n_i \in N$ are nonterminals, and for $i \in [1..k]$: $d_i \in \{\uparrow, \downarrow\}$ are movement directions (either up or down in the tree). If $d_i = \uparrow$, then $n_i \in \delta(n_{i+1})$; if $d_i = \downarrow$, then $n_{i+1} \in \delta(n_i)$. For an AST-path $p$, we use $start(p)$ to denote $n_1$, the starting terminal of $p$, and $end(p)$ to denote $n_{k+1}$, its final terminal.
Using this definition we define a path-context as a tuple of an AST path and the values associated with its terminals:
Definition 3.3.3 (Path-context). Given an AST path $p$, its path-context is the triplet $\langle x_s, p, x_t \rangle$, where $x_s = \phi(start(p))$ and $x_t = \phi(end(p))$ are the values associated with the start and end terminals of $p$.
That is, a path-context describes two actual tokens with the syntactic path between them.
Example 3.3.4. A possible path-context that represents the statement: “x = 7;” would
be:
⟨x, (NameExpr ↑ AssignExpr ↓ IntegerLiteralExpr), 7⟩
To limit the size of the training data and reduce sparsity, it is possible to limit
different parameters of the paths. Following earlier works, we limit the paths by maxi-
mum length — the maximal value of k, and limit the maximum width — the maximal
difference in child index between two child nodes of the same intermediate node. These
values are determined empirically as hyperparameters of our model.
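To make these definitions concrete, the following is a small sketch that extracts path-contexts from Python code using Python's built-in ast module and applies the max_length and max_width limits; the choice of terminals (identifiers and constants), the node-type names, and the default limit values are simplifying assumptions for the illustration, not the parsers or settings used in our experiments.

import ast
import itertools

def _terminals(tree):
    """Yield (value, root_path) pairs, where root_path lists the (node, child_index)
    steps from the root down to a terminal (here: identifiers and constants)."""
    def walk(node, path):
        if isinstance(node, ast.Name):
            yield node.id, path + [(node, 0)]
        elif isinstance(node, ast.Constant):
            yield repr(node.value), path + [(node, 0)]
        for i, child in enumerate(ast.iter_child_nodes(node)):
            yield from walk(child, path + [(node, i)])
    yield from walk(tree, [])

def path_contexts(code, max_length=7, max_width=3):
    leaves = list(_terminals(ast.parse(code)))
    for (x_s, p1), (x_t, p2) in itertools.combinations(leaves, 2):
        i = 0                                     # length of the shared prefix of nodes
        while i < min(len(p1), len(p2)) and p1[i][0] is p2[i][0]:
            i += 1
        top = type(p1[i - 1][0]).__name__         # lowest common ancestor of the pair
        up = [type(n).__name__ for n, _ in reversed(p1[i:])]
        down = [type(n).__name__ for n, _ in p2[i:]]
        length = len(up) + len(down)              # k: the number of up/down moves
        width = abs(p1[i - 1][1] - p2[i - 1][1])  # child-index distance at the top node
        if length <= max_length and width <= max_width:
            yield x_s, "↑".join(up + [top]) + "↓" + "↓".join(down), x_t

print(list(path_contexts("x = 7")))   # [('x', 'Name↑Assign↓Constant', '7')]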
3.4 Model
In this section we describe our model in detail. Section 3.4.1 describes the way the
input source code is represented, Section 3.4.2 describes the architecture of the neural
network, Section 3.4.3 describes the training process, and Section 3.4.4 describes the
way the trained model is used for prediction. Finally, Section 3.4.5 discusses some of
the model design choices and compares the architecture to prior art.
High-level view At a high-level, the key point is that a code snippet is composed of
a bag of contexts, and each context is represented by a vector whose values are learned.
The values of this vector capture two distinct notions: the semantic meaning of this
context, and the amount of attention this context should get.
The problem is as follows: given an arbitrarily large number of context vectors, we
need to aggregate them into a single vector. Two trivial approaches would be to learn
the most important one of them, or to use them all by vector-averaging them. These
alternatives are discussed in Section 3.5.2; as Table 3.4 shows ("hard attention" and "no-attention"), implementing them yields poor results.
Our main insight in this work is that all context vectors should be used but the
model should be allowed to learn how much focus to give each vector. This is done by
learning how to average context vectors in a weighted manner. The weighted average
is obtained by weighting each vector by a factor of its dot product with another global
attention vector. The vector of each context and the global attention vector are trained
and learned simultaneously using the standard neural approach of backpropagation.
Once trained, the neural network is simply a pure mathematical function, which uses
algebraic operators to output a code vector given a set of contexts.
Our path-attention model receives as input a code snippet in some programming language and a parser for that language. We denote by TPairs(C) the set of all pairs of distinct terminal nodes in the AST of the snippet:
\[ TPairs(C) = \{ (term_i, term_j) \mid term_i, term_j \in termNodes(C),\; i \neq j \} \]
where termNodes is a mapping between a code snippet and the set of terminal nodes in its AST. We represent C as the set of path-contexts that can be derived from it:
\[ Rep(C) = \{ (x_s, p, x_t) \mid \exists (term_s, term_t) \in TPairs(C) : x_s = \phi(term_s) \wedge x_t = \phi(term_t) \wedge start(p) = term_s \wedge end(p) = term_t \} \]
That is, C is represented as the set of triplets ⟨xs , p, xt ⟩ such that xs and xt are values
of AST terminals, and p is the AST path that connects them. For example, the
representation of the code snippet from Figure 3.2a contains, among others, the four
AST paths of Figure 3.3.
Overall, the model learns the following components: embeddings for paths and names
(matrices path_vocab and value_vocab), a fully connected layer (matrix W ), atten-
tion vector (a), and embeddings for the tags (tags_vocab). We describe our model
from left-to-right (Figure 3.4). We define two embedding vocabularies: value_vocab
and path_vocab, which are matrices in which every row corresponds to an embedding
Figure 3.4: The architecture of our path-attention network. A fully connected layer
learns to combine embeddings of each path-context with itself; attention weights are
learned using the combined context vectors and used to compute a code vector. The
code vector is used to predict the label.
\[ value\_vocab \in \mathbb{R}^{|X| \times d}, \qquad path\_vocab \in \mathbb{R}^{|P| \times d} \]
where as before, X is the set of values of AST terminals that were observed during
training, and P is the set of AST paths. An embedding is looked up by simply picking
the appropriate row of the matrix. For example, if we consider Figure 3.2a again,
value_vocab contains rows for each token value such as boolean, target and Object.
path_vocab contains rows which are mapped to each of the AST paths of Figure 3.3 (without the token values), such as the red ① path: Name ↑ FieldAccess ↑ Foreach ↓ Block ↓ IfStmt ↓ Block ↓ Return ↓ BooleanExpr. The values of these matrices are
initialized randomly and are learned simultaneously with the network during training.
The width of the matrix W is the embedding size d ∈ N – the dimensionality hyper-
parameter. d is determined empirically, limited by the training time, model complexity,
and the GPU memory, and it typically ranges between 100 and 500. For convenience, we
refer to the embeddings of both the paths and the values as vectors of the same size d,
but in general they can be of different sizes.
A bag of path-contexts B = {b1 , ..., bn } that were extracted from a given code
snippet is fed into the network. Let bi = ⟨xs , pj , xt ⟩ be one of these path-contexts, such
that xs , xt ∈ X are values of terminals and pj ∈ P is their connecting path. Each
component of a path-context is looked up and mapped to its corresponding embedding.
The three embeddings of each path-context are concatenated to a single context vector $c_i \in \mathbb{R}^{3d}$ that represents that path-context:
\[ c_i = embedding(\langle x_s, p_j, x_t \rangle) = [\, value\_vocab_s \, ; \, path\_vocab_j \, ; \, value\_vocab_t \,] \in \mathbb{R}^{3d} \]   (3.1)
For example, for the red ① path-context from Figure 3.3, its context vector would be the concatenation of the vectors of elements, the red ① path, and true.
Each context vector is then compressed by a fully connected layer into a combined context vector:
\[ \tilde{c}_i = \tanh(W \cdot c_i) \]
where $W \in \mathbb{R}^{d \times 3d}$ is a learned weights matrix and tanh is the hyperbolic tangent func-
tion. The height of the weights matrix W determines the size of c̃i , and for convenience
is the same size (d) as before. In general, the height of W can be different; this will
affect the size of the final code vector. tanh is the hyperbolic tangent element-wise
function, a commonly used monotonic nonlinear activation function which outputs val-
ues in the range (−1, 1), which increases the expressiveness of the model. That is, the
fully connected layer “compresses” a context vector of size 3d into a combined context
vector of size d by multiplying it with a weights matrix, and then it applies the tanh
function to each element of the vector separately.
attention weight:
\[ \alpha_i = \frac{\exp(\tilde{c}_i^{\top} \cdot a)}{\sum_{j=1}^{n} \exp(\tilde{c}_j^{\top} \cdot a)} \]
where $a \in \mathbb{R}^{d}$ is the global attention vector, which is initialized randomly and learned together with the network. The exponents in the equation are used only to make the attention weights positive, and they are divided by their sum to have a sum of 1, as in a standard softmax function.
The aggregated code vector υ ∈ Rd , which represents the whole code snippet, is
a linear combination of the combined context vectors {c̃1 , ..., c̃n } factored by their
attention weights:
\[ \text{code vector } \upsilon = \sum_{i=1}^{n} \alpha_i \cdot \tilde{c}_i \]   (3.2)
That is, the attention weights are non-negative and their sum is 1, and they are used
as the factors of the combined context vectors c̃i . Thus, attention can be viewed as
a weighted average, where the weights are learned and calculated with respect to the
other members in the bag of path-contexts.
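The computation of Equations 3.1 and 3.2 can be sketched in a few lines of numpy. The vocabulary sizes, the embedding size, and the example bag of path-contexts below are made-up stand-ins, not values from the trained model.

import numpy as np

d, n = 4, 3                                    # embedding size, number of path-contexts
rng = np.random.default_rng(0)

value_vocab = rng.normal(size=(10, d))         # |X| x d, one row per terminal value
path_vocab = rng.normal(size=(20, d))          # |P| x d, one row per AST path
W = rng.normal(size=(d, 3 * d))                # fully connected layer
a = rng.normal(size=d)                         # global attention vector

# a bag of n path-contexts, each given as (source-value id, path id, target-value id)
bag = [(1, 5, 2), (3, 7, 2), (1, 9, 4)]

c = np.stack([np.concatenate([value_vocab[s], path_vocab[p], value_vocab[t]])
              for s, p, t in bag])             # context vectors, shape (n, 3d)   (Eq. 3.1)
c_tilde = np.tanh(c @ W.T)                     # combined context vectors, shape (n, d)

scores = c_tilde @ a                           # dot products with the attention vector
alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, non-negative, sum to 1
code_vector = alpha @ c_tilde                  # weighted average, shape (d,)      (Eq. 3.2)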
Prediction Prediction of the tag is performed using the code vector. We define a tag vocabulary which is learned as part of training:
\[ tags\_vocab \in \mathbb{R}^{|Y| \times d} \]
where Y is the set of tag values found in the training corpus. As before, the embedding
of tagi is row i of tags_vocab. For example, looking at Figure 3.2a again, we see that
tags_vocab contains rows for each of contains, matches and canHandle. The predicted
distribution of the model q (y) is computed as the (softmax-normalized) dot product
between the code vector υ and each of the tag embeddings:
\[ q(y_i) = \frac{\exp(\upsilon^{\top} \cdot tags\_vocab_i)}{\sum_{y_j \in Y} \exp(\upsilon^{\top} \cdot tags\_vocab_j)} \qquad \text{for } y_i \in Y \]
That is, the probability that a specific tag yi should be assigned to the given code
snippet C is the normalized dot product between the vector of yi and the code vector
υ.
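The prediction step can be sketched in the same way; the code vector and the tag embeddings below are random stand-ins rather than trained values, so the resulting distribution is arbitrary.

import numpy as np

d = 4
rng = np.random.default_rng(1)
code_vector = rng.normal(size=d)               # υ, computed as in Equation 3.2
tags_vocab = rng.normal(size=(6, d))           # |Y| x d, one row per target tag

logits = tags_vocab @ code_vector              # dot product of υ with every tag embedding
q = np.exp(logits) / np.exp(logits).sum()      # softmax-normalized distribution q(y_i)
prediction = int(np.argmax(q))                 # index of the most likely tag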
3.4.3 Training
To train the network we use cross-entropy loss (Rubinstein, 1999, 2001) between the
predicted distribution q and the “true” distribution p. Since p is a distribution that
assigns a value of 1 to the actual tag in the training example and 0 otherwise, the
cross-entropy loss for a single example is equivalent to the negative log-likelihood of the
true label, and can be expressed as:
\[ H(p \,\|\, q) = -\sum_{y \in Y} p(y) \log q(y) = -\log q(y_{true}) \]
where ytrue is the actual tag that was seen in the example. That is, the loss is the
negative logarithm of q (ytrue ), the probability that the model assigns to ytrue . As
q (ytrue ) tends to 1, the loss approaches zero. The further q (ytrue ) goes below 1, the
greater the loss becomes. Thus, minimizing this loss is equivalent to maximizing the
log-likelihood that the model assigns to the true labels ytrue .
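A tiny numeric illustration of this behavior (the probabilities below are arbitrary examples): the higher the probability the model assigns to the true tag, the smaller the loss.

import math

for q_true in (0.9, 0.5, 0.1):
    print(q_true, -math.log(q_true))   # ~0.105, ~0.693, ~2.303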
The network is trained using any gradient descent based algorithm and the standard
approach of back-propagating the training error through each of the learned parameters
(i.e., deriving the loss with respect to each of the learned parameters and updating the
learned parameter’s value by a small “step” towards the direction that minimizes the
loss).
Using the code vector An unseen code snippet can be fed into the trained network exactly as in the training step, up to the computation of the code vector (Equation 3.2).
This code embedding can now be used in another deep learning pipeline for various
tasks such as finding similar programs, code search, refactoring suggestion, and code
summarization.
Predicting tags and names The network can also be used to predict tags and
names for unseen code. In this case we also compute the code vector υ using the
weights and parameters that were learned during training, and prediction is done by
finding the closest target tag:
\[ prediction(C) = \arg\max_{y \in Y} q_{\upsilon_C}(y) \]
where $q_{\upsilon_C}$ is the predicted distribution of the model, given the code vector $\upsilon_C$.
important path-context can appear anywhere in a method body (and span throughout
the method body).
Working with syntactic-only context The main contribution of this work is its
ability to aggregate multiple contexts into a fixed-length vector in a weighted manner
and use the vector to make a prediction. In general, our proposed model is not bound
to any specific representation of the input program; it can be applied in a similar way
to a “bag of contexts” in which the contexts are designed for a specific task, or it can be
applied to contexts that were produced using semantic analysis. Specifically, we chose
to use a syntactic representation that is similar to that of Alon et al. (2018) because
it was shown to be useful as a representation for modeling programming languages in
machine learning models. It was also shown to be more expressive than n-grams and
manually designed features.
An alternative approach is to include semantic relations as context. Such an ap-
proach was taken by Allamanis et al. (2018), who presented a Gated Graph Neural
Network in which program elements are graph nodes and semantic relations such as
ComputedFrom and LastWrite are edges in the graph. In their work, these semantic re-
lations were chosen and implemented for specific programming language and tasks. In
our work, we wish to explore how far a syntactic-only approach can go. Using seman-
tic knowledge has many advantages and might reveal information that is not clearly
expressed in a syntactic-only observation. However, using semantic knowledge comes
at a cost: (i) an expert is required to choose and design the semantic analyses; (ii)
generalizing to new languages is much more difficult, as the semantic analyses need to
be implemented differently for every language; and (iii) the designed analyses might
not easily generalize to other tasks. In contrast, in our syntactic approach (i) neither
expert knowledge of the language nor manual feature designing is required; (ii) general-
izing to other languages is accomplished by simply replacing the parser and extracting
paths from the new language’s AST using the same traversal algorithm; and (iii) the
same syntactic paths generalize surprisingly well to other tasks (as was shown by Alon
et al. (2018)).
Large corpus, simple model As Mikolov et al. (2013a) found for word representations, we found that a simpler model trained on a large amount of data is more effective than a complex model trained on a small corpus.
Some previous works decomposed the target predictions. Allamanis et al. (2016,
2015a) decomposed method names into smaller “sub-tokens” and used the continuous
prediction approach to compose a full name. Iyer et al. (2016) decomposed StackOver-
flow titles to single words and predicted them word-by-word. In theory, this approach
could be used to predict new compositions of names that were not observed in the
training corpus, referred to as neologisms (Allamanis et al., 2015a). However, when
scaling to millions of examples this approach might become cumbersome and fail to
train well due to hardware and time limitations. As shown in Section 3.5.1, our model
yields significantly better results than previous models that used this approach.
Another disadvantage of subtoken-by-subtoken learning is that it requires a time-
consuming beam-search during prediction. This results in an orders-of-magnitude
slower prediction rate (the number of predictions that the model is able to make per
second). An empirical comparison of the prediction rate of our model and the models of
Allamanis et al. (2016) and Iyer et al. (2016) shows that our model achieves a roughly
200 times faster prediction rate than Iyer et al. (2016) and 10,000 times faster than
Allamanis et al. (2016) (Section 3.5.1).
OoV prediction The other possible advantage of Allamanis et al. (2016)’s method
— the ability to produce out-of-vocabulary (OoV) predictions by means of a copy
mechanism and subtoken-by-subtoken decoding — offers only a negligible contribu-
tion. An analysis of our test data shows that the top-10 most frequent method names,
such as toString, hashCode and equals, which are typically easy to predict, appear in
less than 6% of the test examples. The 13% least occurring names are rare names,
which did not appear in their entirety in the training data, and are difficult or im-
possible to predict exactly even with a neologism or copy mechanism. One example
is imageFormatExceptionShouldProduceNotSuccessOperationResultWithMessage. How-
ever, when trained and evaluated on the same corpus as our model, less than 3% of
the predictions of each of these baselines were actually neologisms or OoV. Moreover,
in most of the cases where the baseline suggested a neologism or OoV, it could have
produced a more accurate prediction using only already seen target names.
We thus believe that our efforts would be better spent on the prediction of complete
names.
3.5 Evaluation
The main contribution of our method is its ability to aggregate an arbitrary sized
snippet of code into a fixed-size vector in a way that captures its semantics. Since Java
methods are usually short, focused, have a single functionality and a descriptive name,
a natural benchmark of our approach would consider a method body as a code snippet,
and use the produced code vector to predict the method name. Succeeding in this task
would suggest that the code vector has indeed accurately captured the functionality
and semantic role of the method.
Our evaluation aims to answer the following questions:
• How useful is our model in predicting method names, and how well does it perform in comparison to other recent approaches (Section 3.5.1)?
• What is the contribution of the attention mechanism to the model? How well
would it perform using hard attention instead, or using no attention at all (Sec-
tion 3.5.2)?
• What are the properties of the learned vectors? Which semantic patterns do they
encode (Section 3.5.4)?
Training process In our experiments we took the top 1M paths — those that oc-
curred the most in the training set. We used the Adam optimization algorithm (Kingma
and Ba, 2014), an adaptive gradient descent method commonly used in deep learning.
We used dropout (Srivastava et al., 2014) of 0.25 on the context vectors. The values
of all the parameters were initialized using the initialization heuristic of Glorot and
Bengio (2010). When training on a single Tesla K80 GPU, we achieved a training
throughput of more than 1000 methods per second. Therefore, a single training epoch
takes about 3 hours, and it takes about 1.5 days to completely train a model. Training
on newer GPUs increases the speed by a factor of two to four. Although the attention mechanism
can aggregate an arbitrary number of inputs, we randomly sampled up to k = 200
path-contexts from each training example. The value k = 200 seemed to be enough to
“cover” each method, since increasing to k = 300 did not seem to improve the results.
Table 3.2: Size of data used in the experimental evaluation.
and most popular projects, in which duplication was observed to be less of a problem.
Additionally, they filtered out migrated projects and forks of the same project. While
it is possible that some duplications are left between the training and test set, in this
case the compared baselines could have benefited from them as well. In this dataset,
the files from all the projects were shuffled and split into 12,636,998 training, 371,364 validation, and 368,445 test methods.
We trained our model on the training set and tuned hyperparameters on the vali-
dation set for maximizing F1 score. The number of training epochs was tuned on the
validation set using early stopping. Finally, we report results on the unseen test set.
A summary of the amount of data used is shown in Table 3.2.
Evaluation metric Ideally, we would have liked to manually evaluate the results,
but given that manual evaluation is very difficult to scale, we adopted the measure used
in previous works (Allamanis et al., 2016; Alon et al., 2018; Allamanis et al., 2015a),
which measured precision, recall, and F1 score over sub-tokens, case-insensitive. This
is based on the idea that the quality of a method name prediction depends mainly on
the sub-words used to compose it. For example, for a method called countLines, a
prediction of linesCount is considered as an exact match, a prediction of count has
full precision but low recall, and a prediction of countBlankLines has full recall but
low precision. An unknown sub-token in the test label (“UNK”) is counted as a false
negative, therefore automatically hurting recall.
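The following is a small sketch of this sub-token metric under our reading of it (camelCase splitting and multiset overlap); it is an illustration, not the exact evaluation script used in the experiments.

import re
from collections import Counter

def subtokens(name):
    # split camelCase / snake_case names into lower-cased sub-tokens
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", name.replace("_", " "))
    return Counter(p.lower() for p in parts)

def precision_recall_f1(predicted, true):
    pred, gold = subtokens(predicted), subtokens(true)
    tp = sum((pred & gold).values())
    precision = tp / max(sum(pred.values()), 1)
    recall = tp / max(sum(gold.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

print(precision_recall_f1("linesCount", "countLines"))       # (1.0, 1.0, 1.0): exact match
print(precision_recall_f1("count", "countLines"))            # full precision, low recall
print(precision_recall_f1("countBlankLines", "countLines"))  # full recall, low precision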
While there are alternative metrics in the literature, such as accuracy and BLEU
score, they are problematic because accuracy counts even mostly correct predictions
as completely incorrect, and the BLEU score tends to favor short predictions, which
are usually uninformative. We provide a qualitative evaluation including a manual
inspection of examples in Section 3.5.4.
We compare our model to two other recently proposed models that address similar
tasks:
Table 3.3: Evaluation comparison between our model and previous works.
of the test set due to its slow prediction rate (Table 3.3). We note that the F1 score
reported here is lower than the original results reported in their paper, because we
consider the task of learning a single model that is able to predict names for a method
from any possible project. We do not make the restrictive assumption of having a per-
project model, able to predict only names within that project. The results we report
for CNN+attention are when evaluating their technique in this realistic setting. In
contrast, the numbers reported in their original work are for the simplified setting of
predicting names within the scope of a single project.
Performance Table 3.3 shows the precision, recall, and F1 score of each model. The
model of Alon et al. (2018) seems to perform better than that of Allamanis et al. (2016)
and Iyer et al. (2016), while our model achieves significantly better precision and recall
than all of them.
Figure 3.5: Test F1 score over training time (hours) for our model (PathAttention) and the baselines (Paths+CRFs, CNN+Attention, LSTM+Attention): our model achieves significantly better results than the baselines and in shorter time.
Short and long methods The reported results are based on evaluation on all the
test data. Additionally evaluating the performance of our model with respect to the
length of a test method, we observe similar results across method lengths, with a natural
decrease in performance as the length increases. For example, the F1 score is around 65 for one-line methods, 59 for methods of two to ten lines, and 52 for methods of eleven lines or more, while the average method length is 7 lines (we used all the methods in the dataset, regardless of their size). This shows the robustness of our model to the length of the methods. Short
methods have shorter names and their logic is usually simpler, while long methods
benefit from more context for prediction, but their names are usually longer, more
diverse and sparse, for example: generateTreeSetHashSetSpoofingSetInteger, which
has 17 lines of code.
Speed Figure 3.5 shows the test F1 score over training time for each of the evaluated
models. In just 3 hours, our model achieves results that are 88% as good as its final results, and in 6 hours results that are 95% as good, with both being substantially
higher than the best results of the baseline models. Our model achieves its best results
after 30 hours.
Table 3.3 shows the approximate prediction rate of the different models. The syn-
tactic preprocessing time of our model is negligible but is included in the calculation.
As shown, due to their complexity and expensive beam search on prediction, the other
models are several orders of magnitude slower than ours, limiting their applicability.
Data efficiency The results reported in Table 3.3 were obtained using our full and
large training corpus, to demonstrate the ability of our approach to leverage enormous
amounts of training data in a relatively short training time. However, in order to in-
Table 3.4: Comparison of model designs.
vestigate the data efficiency of our model, we also performed experiments using smaller
training corpora which are not reported in detail here. With 20% of the data, the F1
score of our model drops to only 50%. With 5% of the data, the F1 score drops only
to 30% of our top results. We do not focus on this series of experiments here: since
our model can process more than a thousand examples per second, there is no real
reason to deliberately limit the size of the training corpus.
2. Hard attention — in which instead of focusing the attention “softly” over the
path-contexts, all the attention is given to a single path-context, i.e., the network
learns to select a single most important path-context at a time.
A new model was trained for each of these alternative designs. However, training
hard-attention neural networks is difficult, because the gradient of the argmax function
is zero almost everywhere. Therefore, we experimented with an additional approach:
train-soft, predict-hard, in which training is performed using soft attention (as in our
ordinary model), and prediction is performed using hard attention. Table 3.4 shows
the results of all the compared alternative designs. As seen, hard attention achieves
the lowest results. This suggests that when predicting method names, or in gen-
eral describing code snippets, it is more beneficial to use all the contexts with equal
weights than to focus on the single most important one. Train-soft, predict-hard im-
proves over hard training, and gains results similar to no-attention. As soft attention achieves higher scores than all of the alternatives, both in training and in prediction, this shows the contribution of the soft attention mechanism to the quality of the predictions.
Table 3.5: Our model while hiding input components.
Figure 3.6: Example predictions from our model, with the top-4 paths that were given
the most attention for each code snippet. The width of each path is proportional to
the attention it was given by the model.
To evaluate the contribution of each input component of a path-context, the following experiments were performed:
• “only-values” - using only the values of the terminals for prediction, without
paths, and therefore representing each path-context as: ⟨xs , __, xt ⟩.
• “no-values” - using only the path: ⟨__, p, __⟩, without identifiers and keywords.
• “value-path” - allowing the model to use a path and one of its values: ⟨xs , p, __⟩.
The results of these experiments are presented in Table 3.5. Interestingly, the “full”
representation (⟨xs , p, xt ⟩) achieves better results than the sum of “only-values” and
“no-values”, without each of them alone “covering” for the other. This shows the
importance of using both paths and keywords, and letting the attention mechanism
learn how to combine them in every example. The poorer results of “only-values”
(compared to the full representation) show the importance of using syntactic paths.
As shown in the table, dropping identifiers and keywords hurts the model more than
dropping paths, but combining them achieves significantly better results. Better results
are obtained for “no-paths” than for “no-values”, and “single-identifiers” obtains the
worst results.
The poor results of “no-values” suggest that predicting names for methods with
obfuscated names is a much more difficult task. In this scenario, it might be more
beneficial to predict variable names as a first step using a model that was trained
specifically for this task, and then predict a method name given the predicted variable
names.
3.5.4 Qualitative Evaluation
Interpreting Attention
Despite the “black-box” reputation of neural networks, our model is partially inter-
pretable thanks to the attention mechanism, which allows us to visualize the distribu-
tion of weights over the bag of path-contexts. Figure 3.6 illustrates a few predictions,
along with the path-contexts that were given the most attention in each method. The
width of each of the visualized paths is proportional to the attention weight that it was
allocated. We note that in these figures the path is represented only as a connecting
line between tokens, while in fact it contains rich syntactic information which is not
expressed properly in the figures. Figure 3.7 and Figure 3.8 portray the paths on the
AST.
The examples of Figure 3.6 are particularly interesting since the top names are
accurate and descriptive (reverseArray and reverse; isPrime; sort and bubbleSort)
but do not appear explicitly in the method bodies. The method bodies, and specifically
the path-contexts that were given the most attention, describe lower-level operations.
Suggesting a descriptive name for each of these methods is difficult and might take
time even for a trained human programmer. The average method length in our dataset
of real-world projects is 7 lines, and the examples presented in this section are longer
than this average length.
Figure 3.7 and Figure 3.8 show additional predictions of our model, along with the
path-contexts that were given the most attention in each example. The path-contexts
are portrayed both on the code and on the AST. An interactive demo of method
name predictions and name vector similarities can be found at: http://code2vec.org.
When manually examining the predictions of custom inputs, it is important to note
that a machine learning model learns to predict names for examples that are likely to
be observed “in the wild”. Thus, it can be misled by confusing adversarial examples
that are unlikely to be found in real code.
Surprisingly, the learned method name vectors encode many semantic similarities and
even analogies that can be represented as linear additions and subtractions. When
simply looking for the closest vector (in terms of cosine distance) to a given method
name vector, the resulting neighbors usually contain semantically similar names; e.g.
size is most similar to getSize, length, getCount, and getLength. Table 3.1 shows
additional examples of name similarities.
When looking for a vector that is close to two other vectors, we often find names
that are semantic combinations of the two other names. Specifically, we can look for
the vector v that maximizes the similarity to two vectors a and b:
\[ \arg\max_{v \in V} \left( sim(a, v) \circledast sim(b, v) \right) \]   (3.3)
Predictions: count 42.77%, countOccurrences 33.74%, indexOf 8.86%
Figure 3.7: An example for a method name prediction, portrayed on the AST. The
top-four path-contexts were given a similar attention, which is higher than the rest of
the path-contexts.
Table 3.6: Semantic combinations of method names.
A             + B                ≈ C
get             value              getValue
get             instance           getInstance
getRequest      addBody            postRequest
setHeaders      setRequestBody     createHttpPost
remove          add                update
decode          fromBytes          deserialize
encode          toBytes            serialize
equals          toLowerCase        equalsIgnoreCase
Predictions: done 34.27%, isDone 29.79%, goToNext 12.81%
Figure 3.8: An example for a method name prediction, portrayed on the AST. The
width of each path is proportional to the attention it was given.
Table 3.7: Semantic analogies between method names (A is to B as C is to D).
A : B                        C : D
open : connect               close : disconnect
key : keys                   value : values
lower : toLowerCase          upper : toUpperCase
down : onMouseDown           up : onMouseUp
warning : getWarningCount    error : getErrorCount
value : containsValue        key : containsKey
start : activate             end : deactivate
receive : download           send : upload
where ⊛ is an arithmetic operator used to combine two similarities, and V is a vocabu-
lary of learned name vectors, tags_vocab in our case. When measuring similarity using
cosine distance, Equation (3.3) can be written as:
since cosine distance between two vectors equals the dot product of their unit vectors.
This provides us with a simpler method for finding the above combination of method
name similarities:
\[ vec(equals) + vec(toLowerCase) \approx vec(equalsIgnoreCase) \]
This implies that the model has learned that equalsIgnoreCase is the most similar name
to equals and toLowerCase combined. Table 3.6 shows some of these examples.
Just as Mikolov et al. (2013a,c) used vector calculation to express syntactic and
semantic word analogies in NLP, the method name vectors learned by our model
also express similar syntactic and semantic analogies. For example, vec (download)-
vec (receive)+vec (send) results in a vector whose closest neighbor is the vector for
upload. This analogy can be read as: “receive is to send as download is to: upload”.
More examples are shown in Table 3.7.
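These nearest-neighbor queries can be sketched as follows. The vectors below are random stand-ins, so unlike the trained tags_vocab they will not actually reproduce the similarities of Tables 3.6 and 3.7; with the learned vectors, the two queries return equalsIgnoreCase and upload, respectively.

import numpy as np

def closest(query, vocab, names):
    """Return the name whose vector has the highest cosine similarity to `query`."""
    vocab_unit = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    return names[int(np.argmax(vocab_unit @ query_unit))]

rng = np.random.default_rng(0)
names = ["equals", "toLowerCase", "equalsIgnoreCase", "send", "receive", "download", "upload"]
vec = {name: rng.normal(size=8) for name in names}   # stand-ins for learned name vectors
vocab = np.stack([vec[name] for name in names])

# combination: the name closest to equals + toLowerCase (Table 3.6)
print(closest(vec["equals"] + vec["toLowerCase"], vocab, names))

# analogy: receive is to send as download is to ... (Table 3.7)
print(closest(vec["download"] - vec["receive"] + vec["send"], vocab, names))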
Closed labels vocabulary One of the major limiting factors is the closed label
space we use as target: our model is able to predict only labels that were observed as
is at training time. This works very well for the vast majority of targets (that repeat
across multiple programs), but as the targets become very specific and diverse (e.g.,
findUserInfoByUserIdAndKey) the model is unable to compose such names and usually
catches only the main idea (for example: findUserInfo). Overall, on a general dataset,
our model outperforms the baselines by a large margin even though the baselines are
technically able to produce complex names. code2seq solves this limitation by pre-
dicting a sequence of output symbols. This exact sequence was not necessarily seen at
training time.
Sparsity and Data-hunger There are three main sources of sparsity in our model:
• Terminal values are represented as whole symbols - e.g., each newArray and
oldArray is a unique symbol that has an embedding of its own, even though
they share most of their characters (Array).
• AST paths are represented as monolithic symbols - two paths that share most of
their AST nodes but differ in only a single node are represented as distinct paths
which are assigned distinct embeddings.
• Target nodes are whole symbols, even if they are composed of more common
smaller symbols.
This sparsity results in the model consuming a lot of trained parameters to keep
an embedding for each observed value. The large number of trained parameters results
in large GPU memory consumption at training time, increases the size of the stored
model (about 1.4 GB), and requires a lot of training data. Furthermore, modeling
source code with a finer granularity of atomic units may allow the model to represent
more unseen contexts as compositions of smaller atomic units, thus repeating more
atomic units across examples. In the model described in this work, paths, terminal
values or target values that were not observed during training cannot be represented.
To address these limitations we train the model on a huge dataset of more than 12M
examples, but the model might not perform as well using smaller datasets. Although
requiring a lot of GPU memory, training our model on millions of examples fits in the
memory of a relatively old Tesla K80 GPU.
An alternative approach for reducing the sparsity of AST paths is to use path
abstractions where only parts of the path are used in the context (abstracting away
certain kinds of nodes, merging certain kinds of nodes, etc.).
In code2seq, we solve all these sources of sparsity by decomposing terminals, paths,
and targets to smaller components. This significantly improves the results of the
code2seq model on both small and large datasets, while reducing the size of the stored
model by about 90%.
adversarial variable names, the prediction of the label is usually less accurate. We
are considering several approaches to address this limitation in future research. One
potential solution is to train the model on a mixed dataset of good and hidden variable
names, hopefully reducing model dependency on variable names; another solution is to
apply a model that was trained for variable de-obfuscation first (such as (Alon et al.,
2018; Raychev et al., 2015)) and feed the predicted variable names into our model.
Chapter 4
4.1 Introduction
Modeling the relation between source code and natural language can be used for auto-
matic code summarization (Allamanis et al., 2016), documentation (Iyer et al., 2016),
retrieval (Allamanis et al., 2015b), and even generation (Balog et al., 2017; Rabinovich
et al., 2017; Yin and Neubig, 2017; Devlin et al., 2017; Murali et al., 2018; Brockschmidt
et al., 2019). In this work, we consider the general problem of generating a natural
language sequence from a given snippet of source code.
A direct approach is to frame the problem as a machine translation problem, where
the source sentence is the sequence of tokens in the code and the target sentence is
a corresponding natural language sequence. This approach allows one to apply state-
of-the-art neural machine translation (NMT) models from the sequence-to-sequence
(seq2seq) paradigm (Sutskever et al., 2014; Cho et al., 2014b; Bahdanau et al., 2014;
Luong et al., 2015; Vaswani et al., 2017), yielding state-of-the-art performance on vari-
ous code captioning and documentation benchmarks (Iyer et al., 2016; Allamanis et al.,
2016; Loyola et al., 2017) despite having extremely long source sequences.
We present an alternative approach for encoding source code that leverages the
syntactic structure of programming languages: code2seq. We represent a given code
snippet as a set of compositional paths over its abstract syntax tree (AST), where each
path is compressed to a fixed-length vector using LSTMs (Hochreiter and Schmidhuber,
1997). During decoding, code2seq attends over a different weighted average of the
path-vectors to produce each output token, much like NMT models attend over token
representations in the source sentence.
We show the effectiveness of our code2seq model on two tasks: (1) code summariza-
Figure 4.1: Example of (a) code summarization of a Java code snippet, and (b) code
captioning of a C# code snippet, along with the predictions produced by our models.
The highlighted paths in each example are the top-attended paths in each decoding
step. Because of space limitations we included only the top-attended path for each
decoding step, but hundreds of paths are attended at each step.
tion (Figure 4.1a), where we predict a Java method’s name given its body, and (2) code
captioning (Figure 4.1b), where we predict a natural language sentence that describes
a given C# snippet.
On both tasks, our code2seq model outperforms models that were explicitly de-
signed for code, such as the model of Allamanis et al. (2016) and CodeNN (Iyer et al.,
2016), as well as TreeLSTMs (Tai et al., 2015) and state-of-the-art NMT models (Lu-
ong et al., 2015; Vaswani et al., 2017). To examine the importance of each component
of the model, we conduct a thorough ablation study. In particular, we show the im-
portance of structural encoding of code, by showing how our model yields a significant
improvement over an ablation that uses only token-level information without syntactic
paths. To the best of our knowledge, this is the first work to directly use paths in the
abstract syntax tree for end-to-end generation of sequences.
An Abstract Syntax Tree (AST) uniquely represents a source code snippet in a given
language and grammar. The leaves of the tree are called terminals, and usually refer
to user-defined values which represent identifiers and names from the code. The non-
leaf nodes are called nonterminals and represent a restricted set of structures in the
language, e.g., loops, expressions, and variable declarations. For example, Figure 4.2c
shows a partial AST for the code snippet of Figure 4.2a. Names (such as num) and types
(such as int) are represented as values of terminals; syntactic structures such as variable
declaration (VarDec) and a do-while loop (DoStmt) are represented as nonterminals.
Given the AST of a code snippet, we consider all pairwise paths between terminals,
and represent them as sequences of terminal and nonterminal nodes. We then use these
paths with their terminals' values to represent the code snippet itself.

(a)
    int countOccurrences(String str, char ch) {
        int num = 0;
        int index = -1;
        do {
            index = str.indexOf(ch, index + 1);
            if (index >= 0) {
                num++;
            }
        } while (index >= 0);
        return num;
    }

(b)
    int countOccurrences(String source, char value) {
        int count = 0;
        for (int i = 0; i < source.length(); i++) {
            if (source.charAt(i) == value) {
                count++;
            }
        }
        return count;
    }

(c), (d): the corresponding partial ASTs (not reproduced here).

Figure 4.2: An example of two Java methods that have exactly the same functionality. Although these methods have different sequential (token-based) representations, repeating paths, which might differ in only a single node (a ForStmt node instead of a Do-while node), will be revealed if we consider syntactic patterns.

For example, consider the two Java methods of Figure 4.2. Both of these methods count occurrences
of a character in a string. They have exactly the same functionality, although a different
implementation, and therefore different surface forms. If these snippets are encoded
as sequences of tokens, the recurring patterns that suggest the common method name
might be overlooked. However, a structural observation reveals syntactic paths that
are common to both methods, and differ only in a single node of a Do-while statement
versus a For statement. This example shows the effectiveness of a syntactic encoding
of code. Such an encoder can generalize much better to unseen examples because the
AST normalizes a lot of the surface form variance. Since our encoding is compositional,
the encoder can generalize even if the paths are not identical (e.g., a For node in one
path and a While in the other).
Since a code snippet can contain an arbitrary number of such paths, we sample k
paths as the representation of the code snippet. To avoid bias, k new paths are sampled
afresh in every training iteration. In Section 4.5 we show that this runtime-sampling
provides regularization and improves results compared to sampling the same k paths
for each example in advance.
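To make this concrete, a minimal sketch of the per-iteration sampling is shown below; the function name sample_paths and the value k = 200 are illustrative assumptions rather than the exact implementation.

    # Sketch: sample k AST paths afresh for every training iteration (illustrative only).
    import random

    def sample_paths(all_paths, k=200):
        # Called anew in each training iteration for each example; every call returns a
        # fresh random k-subset, which acts as data-level regularization.
        if len(all_paths) <= k:
            return list(all_paths)
        return random.sample(all_paths, k)

    # The weaker variant samples once in advance and reuses the same k paths in every epoch:
    # fixed_paths = sample_paths(all_paths)  # then reuse fixed_paths for all iterations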
Formally, we use $\mathcal{C}$ to denote a given snippet of code. Every training iteration, $k$ pairs of terminals are uniformly sampled from within the AST of $\mathcal{C}$. Each pair of terminals $\left(v^i_1, v^i_{l_i}\right)$ implies a single path between them: $v^i_1 v^i_2 \ldots v^i_{l_i}$. Finally, the input code example is represented as a set of these $k$ random AST paths: $\left\{ v^1_1 v^1_2 \ldots v^1_{l_1}, \ldots, v^k_1 v^k_2 \ldots v^k_{l_k} \right\}$, where $l_j$ is the length of the $j$th path.
$$p\left(y_1, \ldots, y_m \mid x_1, \ldots, x_n\right) = \prod_{j=1}^{m} p\left(y_j \mid y_{<j}, z_1, \ldots, z_n\right)$$

$$\alpha^{t} = \mathrm{softmax}\left(h_t W_a z\right) \qquad c_t = \sum_{i}^{n} \alpha^{t}_{i} z_i$$
The context vector ct and the decoding state ht are then combined to predict the
current target token yt . Previous work differs in the way the context vector is computed
and in the way it is combined with the current decoding state. A standard approach
(Luong et al., 2015) is to pass ct and ht through a multi-layer perceptron (MLP) and
then predict the probability of the next token using softmax:
$$p\left(y_t \mid y_{<t}, z_1, \ldots, z_n\right) = \mathrm{softmax}\left(W_s \tanh\left(W_c \left[c_t ; h_t\right]\right)\right)$$
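To make the above concrete, the following sketch implements a single Luong-style decoding step in PyTorch; the tensor names, shapes, and sizes are illustrative assumptions and not the exact configuration used in this chapter.

    # Sketch of one attention decoding step (illustrative only).
    import torch
    import torch.nn as nn

    d, vocab_size = 128, 1000                   # assumed hidden size and target vocabulary size
    W_a = nn.Linear(d, d, bias=False)           # bilinear attention term: h_t W_a z_i
    W_c = nn.Linear(2 * d, d, bias=False)       # combines [c_t ; h_t]
    W_s = nn.Linear(d, vocab_size, bias=False)  # output projection

    def decode_step(encoder_states, decoder_state):
        # encoder_states: (n, d) vectors z_1..z_n; decoder_state: (d,) vector h_t.
        scores = encoder_states @ W_a(decoder_state)   # (n,) unnormalized attention scores
        alpha = torch.softmax(scores, dim=0)           # attention weights over z_1..z_n
        context = alpha @ encoder_states               # c_t = sum_i alpha_i * z_i
        combined = torch.tanh(W_c(torch.cat([context, decoder_state])))
        return torch.softmax(W_s(combined), dim=0)     # p(y_t | y_<t, z_1..z_n)

    probs = decode_step(torch.randn(10, d), torch.randn(d))   # toy usage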
Figure 4.3: Our model encodes each AST path with its values as a vector, and uses
the average of all of the k paths as the decoder’s start state. The decoder generates an
output sequence while attending over the k encoded paths.
Given a set of AST paths {x1 , ..., xk }, our goal is to create a vector representation zi
for each path xi = v1i v2i ...vlii . We represent each path separately using a bi-directional
LSTM to encode the path, and sub-token embeddings to capture the compositional
nature of the terminals’ values (the tokens).
Path Representation Each AST path is composed of nodes and their child indices from a limited vocabulary of up to 364 symbols. We represent each node using a learned embedding matrix $E^{nodes}$ and then encode the entire sequence using the final states of a bi-directional LSTM:

$$h_1, \ldots, h_l = \mathrm{LSTM}\left(E^{nodes}_{v_1}, \ldots, E^{nodes}_{v_l}\right) \qquad \mathrm{encode\_path}\left(v_1 \ldots v_l\right) = \left[h^{\rightarrow}_{l} ; h^{\leftarrow}_{1}\right]$$
Token Representation The first and last node of an AST path are terminals whose
values are tokens in the code. Following Allamanis et al. (2015a, 2016), we split code
tokens into subtokens; for example, a token with the value ArrayList will be decom-
posed into Array and List. This is somewhat analogous to byte-pair encoding in NMT
(Sennrich et al., 2016), although in the case of programming languages, coding conven-
tions such as camel notation provide us with an explicit partition of each token. We
use a learned embedding matrix E subtokens to represent each subtoken, and then sum
the subtoken vectors to represent the full token:
$$\mathrm{encode\_token}(w) = \sum_{s \in \mathrm{split}(w)} E^{subtokens}_{s}$$
The LSTM decoder may also predict subtokens at each step (e.g. when generating
method names), although the decoder’s subtoken embedding matrix will be different.
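As an illustration, a minimal sketch of this subtoken encoding might look as follows; the toy vocabulary and the camel-case/snake-case splitter are simplifying assumptions.

    # Sketch: a code token is represented as the sum of its subtoken embeddings.
    import re
    import torch
    import torch.nn as nn

    subtoken_vocab = {"<unk>": 0, "array": 1, "list": 2, "count": 3, "occurrences": 4}
    E_subtokens = nn.Embedding(len(subtoken_vocab), 128)

    def split_token(token):
        # Split camelCase / snake_case into lower-cased subtokens, e.g. "ArrayList" -> ["array", "list"].
        spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", token).replace("_", " ")
        return [part.lower() for part in spaced.split()]

    def encode_token(token):
        ids = torch.tensor([subtoken_vocab.get(s, 0) for s in split_token(token)])
        return E_subtokens(ids).sum(dim=0)      # sum of the subtoken vectors

    vec = encode_token("countOccurrences")      # a single 128-dimensional token vector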
Combined Representation To represent each path $x_i$ together with the values of its terminals, we concatenate the path encoding with the token encodings of its first and last (terminal) nodes, and pass them through a fully-connected layer:

$$z_i = \tanh\left(W_{in}\left[\mathrm{encode\_path}\left(v^i_1 \ldots v^i_{l_i}\right) ; \mathrm{encode\_token}\left(\mathrm{value}\left(v^i_1\right)\right) ; \mathrm{encode\_token}\left(\mathrm{value}\left(v^i_{l_i}\right)\right)\right]\right)$$

where value is the mapping of a terminal node to its associated value, and $W_{in}$ is a $\left(2d_{path} + 2d_{token}\right) \times d_{hidden}$ matrix.
Decoder Start State To provide the decoder with an initial state, we average the
combined representations of all the k paths in the given example:
$$h_0 = \frac{1}{k} \sum_{i=1}^{k} z_i$$
Unlike typical encoder-decoder models, the order of the input random paths is not
taken into account. Each path is encoded separately and the combined representations
are aggregated with mean pooling to initialize the decoder’s state. This represents the
given source code as a set of random paths.
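A compact sketch of this set-of-paths encoding is shown below; for brevity it omits the terminal-token encodings and the W_in projection described above, and the sizes and names are illustrative assumptions.

    # Sketch: encode each AST path with a bidirectional LSTM over node embeddings and
    # average the path vectors to initialize the decoder state.
    import torch
    import torch.nn as nn

    num_node_symbols, d = 364, 128
    E_nodes = nn.Embedding(num_node_symbols, d)
    path_lstm = nn.LSTM(d, d, bidirectional=True, batch_first=True)

    def encode_path(node_ids):
        # node_ids: (L,) tensor of AST-node symbol indices along one path.
        _, (h_n, _) = path_lstm(E_nodes(node_ids).unsqueeze(0))   # h_n: (2, 1, d)
        return torch.cat([h_n[0, 0], h_n[1, 0]])                  # final states of both directions

    def decoder_start_state(paths):
        z = torch.stack([encode_path(p) for p in paths])          # (k, 2d) path vectors, unordered
        return z.mean(dim=0)                                      # h_0 = (1/k) * sum_i z_i

    k_paths = [torch.randint(0, num_node_symbols, (7,)) for _ in range(5)]
    h0 = decoder_start_state(k_paths)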
Attention Finally, the decoder generates the output sequence while attending over
all of the combined representations z1 , ...zk , similarly to the way that seq2seq models
attend over the source symbols. The attention mechanism is used to dynamically select
the distribution over these k combined representations while decoding, just as an NMT
model would attend over the encoded source tokens.
4.4 Experiments
We evaluate our model on two code-to-sequence tasks: summarization (Section 4.4.1),
in which we predict Java methods’ names from their bodies, and captioning (Sec-
tion 4.4.2), where we generate natural language descriptions of C# code snippets.
Although out of the focus of this work, in Section 4.4.3 we show that our model also
generates Javadocs more accurately than an existing work. We thus demonstrate that
our approach can produce both method names and natural language outputs, and can
encode a code snippet in any language for which an AST can be constructed (i.e., a
parser exists).
Setup The values of all of the parameters are initialized using the initialization heuris-
tic of Glorot and Bengio (2010). We optimize the cross-entropy loss (Rubinstein, 1999,
2001) with a Nesterov momentum (Nesterov, 1983) of 0.95 and an initial learning rate
of 0.01, decayed by a factor of 0.95 every epoch. For the Code Summarization task, we
apply dropout (Srivastava et al., 2014) of 0.25 on the input vectors xj , and 0.7 for the
Code Captioning task because of the smaller number of examples in the C# dataset.
We apply a recurrent dropout of 0.5 on the LSTM that encodes the AST paths. We
used dtokens = dnodes = dhidden = dtarget = 128. For the Code Summarization task,
each LSTM that encodes the AST paths had 128 units and the decoder LSTM had
320 units. For the Code Captioning task, to support the longer target sequences, each
encoder LSTM had 256 units and the decoder was of size 512.
4.4.1 Code Summarization

In this task, we predict a Java method's name given its body. As was previously
observed (Allamanis et al., 2016; Alon et al., 2019c), this is a good benchmark because
a method name in open-source Java projects tends to be succinct and precise, and a
method body is often a complete logical unit. We predict the target method name as a
sequence of sub-tokens, e.g., setMaxConnectionsPerServer is predicted as the sequence
“set max connections per server”. The target sequence length is about 3 on average.
We adopt the measure used by Allamanis et al. (2016) and Alon et al. (2019c), who
measured precision, recall, and F1 score over the target sequence, case insensitive.
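For clarity, this measure can be sketched as a case-insensitive, subtoken-level F1 (a simplified version; the exact evaluation scripts may treat duplicates and edge cases differently):

    # Sketch: precision/recall/F1 over predicted vs. reference subtokens (case-insensitive).
    def subtoken_f1(predicted, reference):
        pred = [s.lower() for s in predicted]
        ref = [s.lower() for s in reference]
        true_positive = sum(1 for s in pred if s in ref)
        precision = true_positive / len(pred) if pred else 0.0
        recall = true_positive / len(ref) if ref else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Predicting "setMaxConnections" against the reference "setMaxConnectionsPerServer":
    print(subtoken_f1(["set", "max", "connections"],
                      ["set", "max", "connections", "per", "server"]))   # -> (1.0, 0.6, 0.75)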
Data We experiment with this task across three datasets. In these datasets, we always
train across multiple projects and predict on distinct projects:
Java-small – Contains 11 relatively large Java projects, originally used for 11 dis-
tinct models for training and predicting within the scope of the same project (Allamanis
et al., 2016). We use the same data, but train and predict across projects: we took 9
projects for training, 1 project for validation and 1 project as our test set. This dataset
contains about 700K examples.
Java-med – A new dataset of the 1000 top-starred Java projects from GitHub. We
randomly select 800 projects for training, 100 for validation and 100 for testing. This
dataset contains about 4M examples and we make it publicly available.
Java-large – A new dataset of the 9500 top-starred Java projects from GitHub that
were created since January 2007. We randomly select 9000 projects for training, 250
for validation and 300 for testing. This dataset contains about 16M examples and we
make it publicly available.
Baselines We re-trained all of the baselines on all of the datasets of this task using
the original implementations of the authors. We compare code2seq to the following
baselines: Allamanis et al. (2016), who used a convolutional attention network to pre-
dict method names; syntactic paths with Conditional Random Fields (CRFs) (Alon
et al., 2018); code2vec (Alon et al., 2019c); and a TreeLSTM (Tai et al., 2015) en-
coder with an LSTM decoder and attention on the input sub-trees. Additionally, we
compared to three NMT baselines that read the input source code as a stream of to-
kens: 2-layer bidirectional encoder-decoder LSTMs (split tokens and full tokens) with
global attention (Luong et al., 2015), and the Transformer (Vaswani et al., 2017), which
achieved state-of-the-art results for translation tasks.
We put significant effort into strengthening the NMT baselines in order to provide a
fair comparison: (1) we split tokens to subtokens, as in our model (e.g., HashSet → Hash
Set) – this was shown to improve the results by about 10 F1 points (Figure 4.4); (2) we
deliberately kept the original casing of the source tokens since we found it to improve
their results; and (3) during inference, we replaced generated UNK tokens with the
source tokens that were given the highest attention. For the 2-layer BiLSTM we used
embeddings of size 512, an encoder and a decoder of 512 units each, and the default
hyperparameters of OpenNMT (Klein et al., 2017). For the Transformer, we used
their original hyperparameters (Vaswani et al., 2017). This resulted in a Transformer
model with 169M parameters and a BiLSTM model with 134M parameters, while our
code2seq model had only 37M.1
Performance Figure 4.4 shows the results for the code summarization task. Our
model significantly outperforms the baselines in both precision and recall across all
three datasets, demonstrating that there is added value in leveraging ASTs to encode
source code. Our model improves over the best baselines, BiLSTM with split tokens,
by between 4 to 8 F1 points on all benchmarks. BiLSTM with split tokens consistently
scored about 10 F1 points more than BiLSTM with full tokens, and for this reason we
included only the split token Transformer and TreeLSTM baselines. Our model outper-
forms ConvAttention (Allamanis et al., 2016), which was designed specifically for this
task; Paths+CRFs (Alon et al., 2018), which used syntactic features; and TreeLSTMs.
Although TreeLSTMs also leverage syntax, we hypothesize that our syntactic paths
capture long distance relationships while TreeLSTMs capture mostly local properties.
Examples for predictions made by our model and each of the baselines can be found in
Alon et al. (2019a) and at http://code2seq.org.
Footnote 1: We also trained versions of the NMT baselines in which we down-matched the sizes and number of parameters to our model. These baselines seemed to benefit from more parameters, so the results reported here are for the versions that had many more parameters than our model.
Figure 4.4: Visualization of the F1 score of our model compared to the baselines, for the code summarization task, across datasets. Our model achieves significantly higher results than the baselines. (Bar-chart values not reproduced here.)
Fernandes et al. (2019) encoded code using Graph Neural Networks (GNN), and
reported lower performance than our model on Java-large without specifying the exact
F1 score. They report slightly higher results than ours on Java-small only by extending
their GNN encoder with a subtoken-LSTM (BiLSTM+GNN→ LSTM); by extending
the Transformer with GNN (SelfAtt+GNN→SelfAtt); or by extending their LSTM
decoder with a pointer network (GNN→LSTM+Pointer). All these extensions can be
incorporated into our model as well.
Data Efficiency ConvAttention (Allamanis et al., 2016) performed even better than
the Transformer on the Java-small dataset, but could not scale and leverage the larger
datasets. Paths+CRFs showed very poor results on the Java-small dataset, which is
expected due to the sparse nature of their paths and the CRF model. When compared
to the best among the baselines (BiLSTM with split tokens), our model achieves a
relative improvement of 7.3% on Java-large, but as the dataset becomes smaller, the
larger the relative difference becomes: 13% on Java-med and 22% on Java-small; when
compared to the Transformer, the relative improvement is 23% on Java-large and 37%
on Java-small. These results show the data efficiency of our architecture: while the
data-hungry NMT baselines require large datasets, our model can leverage both small
and large datasets.
Figure 4.5: F1 score compared to the length of the input code (in lines), for the code summarization task on the Java-med test set. All examples having more than 30 lines were counted as having 30 lines. The compared models are code2seq (this work), 2-layer BiLSTMs, TreeLSTM, Transformer, and code2vec. (Plot not reproduced here.)
4.4.2 Code Captioning

For this task we consider predicting a full natural language sentence given a short
C# code snippet. We used the dataset of CodeNN (Iyer et al., 2016), which consists of
66,015 pairs of questions and answers from StackOverflow. They used a semi-supervised
classifier to filter irrelevant examples and asked human annotators to provide two ad-
ditional titles for the examples in the test set, making a total of three reference titles
for each code snippet. The target sequence length in this task is about 10 on aver-
age. This dataset is especially challenging as it is orders of magnitude smaller than the
code summarization datasets. Additionally, StackOverflow code snippets are typically
short, incomplete at times, and aim to provide an answer to a very specific question.
We evaluated using BLEU score with smoothing, using the same evaluation scripts as
Iyer et al. (2016).
Figure 4.6: Visualization of the BLEU score of our model (23.04) compared to the baselines on the StackOverflow dataset, for the code captioning task. Our model achieves significantly higher results than the baselines. (Bar-chart values not reproduced here.)
Results Figure 4.6 summarizes the results for the code captioning task. Our model
achieves a BLEU score of 23.04, which improves by 2.51 points (12.2% relative) over Co-
deNN, whose authors introduced this dataset, and over all the other baselines, including
BiLSTMs, TreeLSTMs and the Transformer, which achieved slightly lower results than
CodeNN. Examples for predictions made by our model and each of the baselines can
be found in Alon et al. (2019a). These results show that when the training examples
are short and contain incomplete code snippets, our model generalizes better to unseen
examples than a shallow textual token-level approach, thanks to its syntactic repre-
sentation of the data. Although TreeLSTMs also represent the data syntactically, the
TreeLSTM baseline achieved lower scores.
Although the task of generating code documentation is outside the focus of this work,
we performed an additional comparison to Hu et al. (2018). They trained a standard
seq2seq model by using the linearized AST as the source sequence and a Javadoc natural
language sentence as the target sequence. While they originally report a BLEU score of
38.17, we computed their BLEU score using prediction logs provided us by the authors
and obtained a BLEU score of 8.97, which we find more realistic. Training our model
on the same dataset as Hu et al., matching LSTM sizes, and using the same script on
our predictions yields a BLEU score of 14.53, which is a 62% relative gain over the
model of Hu et al. (2018). This shows that our structural approach represents code
better than linearizing the AST and learning it as a sequence.
Table 4.1: Variations on the code2seq model, performed on the validation set of Java-med. (Table values not reproduced here.)

1. No AST nodes – instead of encoding an AST path using an LSTM, take only the first and last terminal values to construct an input vector.

2. No decoder – predict the target sequence as a single symbol with a single prediction, instead of generating it as a sequence of subtokens.

3. No token splitting – use full tokens instead of splitting them into subtokens.

4. No tokens – use only the AST nodes without using the values associated with the terminals.

5. No attention – decode the target sequence given the initial decoder state, without attention.

6. No random – sample the same k paths for each example in advance, instead of sampling k paths afresh in every training iteration.
Table 4.1 shows the results of these alternatives. As seen, not encoding AST nodes
resulted in a degradation especially in the precision: a decrease of 5.16 compared to
4.30 for the recall. It is quite surprising that this ablation was still better than the
baselines (Figure 4.4): for example, the Transformer can implicitly capture pairs of
tokens using its self-attention mechanism. However, not all tokens are AST leaves.
By focusing on AST leaves, we increase the focus on named tokens, and effectively
ignore functional tokens like brackets, parentheses, semicolons, etc. Transformers can
(in theory) capture the same signal, but perhaps they require significantly more layers
or a different optimization to actually learn to focus on those particular elements. The
AST gives us this information for free without having to spend more transformer layers
just to learn it. Additionally, for practical reasons we limited the length of the paths
to 9. This leads to pairs of leaves that are close in the AST, but not necessarily close
in the sequence. In contrast, the Transformer’s attention is effectively skewed towards
sequential proximity because of the positional embeddings.
Using a single prediction with no decoder reduces recall by more than one-third.
This shows that the method name prediction task should be addressed as a sequential
prediction, despite the methods’ relatively short names. Using no token splitting or
no tokens at all drastically reduces the score, showing the significance of encoding
both subtokens and syntactic paths. Despite the poor results of no tokens, it is still
surprising that the model can achieve around half the score of the full model, as using
no tokens is equivalent to reasoning about code which has no identifier names, types,
APIs, and constant values, which can be very difficult even for a human. The no
attention experiment shows the contribution of attention in our model, which is very
close in its relative value to the contribution of attention in seq2seq models (Luong
et al., 2015; Bahdanau et al., 2014). The no random experiment shows the positive
contribution of sampling k different paths afresh on every training iteration, instead
of using the same sample of paths from each example during the entire training. This
approach provides data-level regularization that further improves an already powerful
model.
Chapter 5
5.1 Introduction
Code completion is the problem of generating code given its surrounding code as con-
text. In its most general form, this problem is extremely challenging as it requires
reasoning over an unbounded number of syntactic structures and user-defined symbols.
Previous approaches have avoided this issue by limiting the generation problem: pro-
gram synthesis approaches are often tailored to domain-specific languages (Gulwani,
2011; Polozov and Gulwani, 2015; Devlin et al., 2017; Ellis et al., 2019), while other
recent approaches generate code in general languages like Java and C#, but severely
restrict the syntax, vocabulary, domain, or nature of the generated programs (Murali
et al., 2018; Brockschmidt et al., 2019; Young et al., 2019).
We introduce the task of any-code completion – generating code in a general-purpose
programming language without any restriction on its vocabulary or structure. Specif-
ically, we focus on generating code in context: given a program P and some part of
the program p, the task is to predict p from the rest of the program P− = P \ p. Any-
code completion thus generalizes the restricted completion task of Brockschmidt et al.
(2019), in which the target code contained only primitive types (e.g., int and string)
and excluded user-defined functions. Figure 5.1 shows two any-code completion exam-
ples.
In related tasks such as semantic parsing (Dong and Lapata, 2018; Yu et al., 2018;
Iyer et al., 2019), natural-language-to-code (Allamanis et al., 2015b; Iyer et al., 2018),
and edit-to-code (Yin et al., 2019; Zhao et al., 2019), models must use separate encoders
and decoders because of the different modalities of the input (e.g. natural language
text) and the output (code). In contrast, we leverage the fact that our input and output
are of the same modality (code), and pursue better generalization by modeling them
jointly.
(a) Java:
    public static Path[] stat2Paths(FileStatus[] stats) {
        if (stats == null) return null;
        Path[] ret = new Path[stats.length];
        for (int i = 0; i < stats.length; ++i){
            ret[i] = ;
        }
        return ret;
    }
    True ref: stats[i].getPath()
    SLM top-5: (25.2%) stats[i].getPath(), (3.3%) Path(stats[i]), (2.5%) new Path(stats[i], charset),
               (1.7%) stat(stats[i], ret), (0.8%) new Path(stats[i])

(b) C#:
    public static string Camelize(this string input)
    {
        var word = input.Pascalize();
        return word.Length > 0 ?
            .ToLower() + word.Substring(1)
            : word;
    }
    True ref: word.Substring(0, 1)
    SLM top-5: (14.1%) word.Substring(0, 1), (8.2%) word.trim(), (5.8%) word.Substring(1),
               (2.4%) input.Substring(0, 1), (1.9%) wordValue.Substring(0, 1)

Figure 5.1: Examples from the Java (a) and C# (b) test sets. The highlighted expression in each example is the target p, which our models correctly generated from the rest of the snippet. Additional and larger examples can be found in the supplementary material.

In Alon et al. (2020), we present a new approach that explicitly models the source
and the target code as the same tree – structural language modeling (SLM). SLM
estimates the probability of the program’s abstract syntax tree (AST) by decomposing
it into a product of conditional probabilities over its nodes. We present a neural model
that computes these conditional probabilities by considering all AST paths leading to
a target node, generalizing over traditional language models that consider sequences
of words. While prior work uses AST paths to read programs (Alon et al., 2019c), we
generate code by predicting the next node along the set of paths, generating the target
AST node-by-node.
We evaluate SLMs on Java any-code completion, achieving a new state of the art:
exact-match accuracy@1 of 18.04% and accuracy@5 of 24.83% (previous SOTA: 16.93%
and 23.17%). SLMs also outperform existing models in the restricted completion task
of Brockschmidt et al. (2019) in C# by a wide margin, 37.61% accuracy@1 compared
to 26.42%. Our ablation study reveals the importance of joint modeling of the source
and target code, rather than separating encoders from decoders. Finally, we discuss the
theoretical advantages of SLMs, and show how they generalize many previous structural
approaches for code generation. An interactive demo of our model is presented at
http://AnyCodeGen.org.
Figure 5.2: The subtree representing x > 1 is generated given its surrounding tree. At each step, the model generates the next node (denoted by ?) of path1, path2 and path3 using the root path R. Dashed lines denote the AST structure; solid lines denote AST paths. Most AST paths are omitted from the figure, for clarity. (Panels (d), (e), (f); tree diagrams not reproduced here.)
compose the probability of a tree P r (AP ) using the chain rule, akin to the standard
approach in language modeling:
$$Pr\left(\mathcal{A}_P\right) = \prod_{t} Pr\left(a_t \mid a_{<t}\right) \tag{5.1}$$
where a<t are all the nodes that were traversed before at .
In any-code completion, part of the tree (AP − ) is already observed. Therefore, we
order the nodes of AP − to be before the nodes of the target p, and compute only the
conditional probabilities over the nodes in p, essentially conditioning on the observed
tree AP − .
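Conceptually, the resulting objective can be sketched as a sum of conditional log-probabilities over the target nodes only; the next_node_distribution callable below is a hypothetical placeholder for the network described later in this chapter.

    # Sketch of the SLM objective: log Pr(p | P^-) as a sum of per-node conditional
    # log-probabilities over the target subtree only (the model call is a placeholder).
    import math

    def tree_log_prob(observed_nodes, target_nodes, next_node_distribution):
        # observed_nodes: nodes of A_{P^-}; target_nodes: nodes of the target p in
        # generation order; next_node_distribution(prefix) -> dict mapping node -> probability.
        prefix = list(observed_nodes)                  # condition on the observed tree A_{P^-}
        log_prob = 0.0
        for node in target_nodes:
            distribution = next_node_distribution(prefix)
            log_prob += math.log(distribution[node])   # log Pr(a_t | a_<t)
            prefix.append(node)                        # the generated node joins the context
        return log_prob

    # Toy usage with a uniform "model" over two candidate nodes:
    uniform = lambda prefix: {"Greater": 0.5, "IntExpr": 0.5}
    print(tree_log_prob(["MethodRoot", "IfExpr"], ["Greater", "IntExpr"], uniform))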
Representing Partial Trees via Paths How can we represent the partial tree
composed of a<t when computing P r (at |a<t )? In standard language modeling, the
structure is linear, and a<t is a sequence. One way to represent a partial tree is to
linearize it according to the traversal order (Xiao et al., 2016); however, this creates
artificially long distances between the current node at and ancestor nodes (e.g., the
root a0 ). Another option is to use only the path from the root node to at (Rabinovich
et al., 2017), but this ignores a lot of contextual information (e.g., sibling nodes).
We follow Alon et al. (2018) and use the set of paths from every leaf to at to-
gether with the path from the root to at . Intuitively, each path captures the effect
of a different, possibly distant, program element on at , along with the syntactic rela-
tionship between them. For example, in Figure 5.1 (left) the three paths originating
from Path[] ret inform the model about the existence of ret which is an array of
type Path. Thus, when completing ret[i] = ... – the completion should be a Path
object. Other paths inform the model that the target is inside a For loop, iterated
stats.length times. Considering the information flowing from all paths, our model
correctly generates stats[i].getPath().
We denote the (candidate) node at time $t$ as $a_t$, its (given) parent, which is currently expanded, by $\pi\left(a_t\right)$, and the set of all paths as $S_t$:

$$S_t = \left\{ \ell \leadsto \pi\left(a_t\right) \mid \ell \in \mathrm{leaves}\left(a_{<t}\right) \right\}$$
where ℓ ⇝ π (at ) is the (only) path in the tree between a leaf ℓ and the current node to
expand π (at ). We denote the path from the root of the program as Rt = a0 ⇝ π (at ),
which represents the current, relative position of π (at ) in the program (marked as R in
Figure 5.2). Whereas prior work used whole paths (between two leaf nodes) to encode
ASTs (Alon et al., 2019a,c), our model observes partial paths (between a leaf and any
other node) and learns to extend them by predicting their next node.
Figure 5.2 illustrates the traversal order of a subtree that represents the expression
x > 1 and some of the paths used to compute the probability at each step. At each
step, the probability of the next node is computed given the paths St from the root
and every given leaf up to the current node to expand. Figure 5.2d shows how after
the terminal node with the value x is given, path3 originating from this leaf is also used
to compute the probability of the next nodes.
Our path-based approach generalizes previous approaches such as “parent feeding”
and “previous action” encoding (Yin and Neubig, 2017), context nodes (Bielik et al.,
2016), and some of the graph-edges of Brockschmidt et al. (2019). See Section 5.8 for
further discussion.
Node Trees vs. Production Trees While we predict a single node at each step, pre-
vious work (Iyer et al., 2018, 2019) predicts a grammar production rule. This represen-
tation decomposes the code in a way that often forces the model to predict with partial
information. For instance, consider generating the expression str.Substring(3). The
model of Brockschmidt et al. (2019) would first predict the rule Expr→Expr.Substring(Expr),
and only then expand Expr→str and Expr→3. That is, the model needs to predict the
method name (Substring) before the invoking object (str). Further, the Substring
method can get either one or two arguments, forcing the model to choose whether to
use the one- or two-argument rule in advance. Node generation, however, allows us to
predict the presence of a function call and only then to predict its object and method
name, rather than predicting these a priori.
vector (Section 5.3.1); then, we contextualize and aggregate the entire set. Finally, we
predict the target node at by combining a subtoken vocabulary with a syntactic copy
mechanism (Section 5.3.3).
Given a partial AST path, i.e., a sequence of nodes n1 , . . . , nk , our goal is to create a
vector representation.
We first represent each node ni using embeddings. A subtoken node is represented
by the index of its subtoken w in the embedding matrix E subtoken ; AST nodes are
represented as a pair ni = (τ, κ) where τ is the node type, e.g. IfStatement, and κ
is the node index among its sibling nodes. We represent node types using a learned
embedding matrix E type and the child indices using a learned matrix E index . The node’s
vector representation is the concatenation of the type and index vectors.
$$e\left(n_i\right) = \begin{cases} E^{subtoken}_{w} & n_i \text{ is the subtoken } w \\ \left[E^{type}_{\tau} ; E^{index}_{\kappa}\right] & n_i \text{ is the AST node } (\tau, \kappa) \end{cases}$$
We encode the entire path using a uni-directional LSTM stack, and take the final states:²

$$\overrightarrow{f}\left(n_1, \ldots, n_k\right) = \mathrm{LSTM}\left(e\left(n_1\right), \ldots, e\left(n_k\right)\right)$$

Given a set of partial paths $S$ (omitting the iterator $t$ for simplicity), we denote their encodings as $H = \left\{ \overrightarrow{f}\left(n_1, \ldots, n_k\right) \mid \left(n_1, \ldots, n_k\right) \in S \right\}$.
Efficient Computation When modeling a subtree, there are large overlaps between
paths from different time steps. In particular, paths that originate from the same leaf
share the same prefix. We therefore apply the LSTM on the prefix once and cache the
intermediate state across suffixes, speeding up both training and inference significantly.
An example is shown in the supplementary material (Fig. 2).
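A minimal sketch of this prefix caching is shown below, assuming a step-wise LSTM cell and an in-memory dictionary; the actual implementation is batched and more involved.

    # Sketch: cache LSTM states of shared path prefixes so each prefix is encoded only once.
    import torch
    import torch.nn as nn

    d = 128
    cell = nn.LSTMCell(d, d)
    embed = nn.Embedding(1000, d)   # toy node-embedding table
    cache = {}                      # maps a prefix (tuple of node ids) to its LSTM state

    def encode_with_cache(prefix_ids):
        # Encode a path prefix, reusing the cached state of its parent prefix.
        key = tuple(prefix_ids)
        if key in cache:
            return cache[key]
        if not prefix_ids:
            state = (torch.zeros(1, d), torch.zeros(1, d))
        else:
            parent_state = encode_with_cache(prefix_ids[:-1])    # shared across suffixes
            last = embed(torch.tensor([prefix_ids[-1]]))         # (1, d)
            state = cell(last, parent_state)                     # only one extra LSTM step
        cache[key] = state
        return state

    h1, c1 = encode_with_cache([3, 7, 7, 1])
    h2, c2 = encode_with_cache([3, 7, 7, 1, 5])   # reuses the cached prefix [3, 7, 7, 1]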
Given the set of paths S leading up to the parent π(a) of the target node a, our goal is
to represent S in the context of predicting a. To do so, we introduce the aggregation
function g (H, r, i). As its input, g takes the set of encoded paths H, the encoded root
path r, and the child index i of the currently predicted child node a relative to its
parent.
We first contextualize the path encodings $H$ using a transformer encoder (Vaswani et al., 2017).³ In parallel, we apply a non-linear transformation to the encoding of the root path $r = \overrightarrow{f}(R)$, in order to inform it that we wish to predict the $i$-th child of $\pi(a)$. In this formulation, the parameter matrix $C_i$ is used when the child index is $i$, while the parameter matrix $W_a$ is used for every instance.

Footnote 2: Replacing the LSTMs with transformers yielded similar results in preliminary experiments.
Footnote 3: Since $H$ is a set, we do not use positional embeddings.
We then compute attention over the set of contextualized path encodings $Z$ using the index-informed root-path encoding $\tilde{r}$ as the query; we pass the weighted average $\tilde{z}$ and the root-path encoding $\tilde{r}$ through another fully-connected layer; we denote the resulting vector representation as $\tilde{h}$:

$$\alpha = \mathrm{softmax}\left(Z \cdot \tilde{r}\right) \qquad \tilde{z} = \sum_{j} \alpha_j \cdot Z_j \tag{5.2}$$
Predicting AST Nodes If $a$ is an AST node, we predict $a$ using a softmax over the node type embeddings $E^{type}$:

$$Pr\left(a \mid S\right) = \mathrm{softmax}\left(E^{type} \cdot \tilde{h}\right) \qquad \left(\pi(a) \text{ is a nonterminal}\right)$$

Predicting Subtokens If $a$ is a subtoken, we predict it by combining a copy mechanism with generation from a subtoken vocabulary: a leaf $\ell$ in the context can be copied with a score $s_{copy}(\ell)$, and a subtoken $w$ can be generated from the vocabulary with a score $s_{gen}(w)$:

$$s_{copy}(\ell) = H_{\ell} \cdot W_c \cdot \tilde{h} \qquad s_{gen}(w) = E^{subtoken}_{w} \cdot \tilde{h}$$
The scores scopy and sgen are then summed over all occurrences that correspond to the
same symbol and subsequently normalized via softmax.
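The following sketch illustrates this scoring scheme; the symbol names and score values are toy assumptions.

    # Sketch: per-symbol scores from copying (s_copy) and generation (s_gen) are summed
    # over occurrences of the same symbol and then normalized with a softmax.
    from collections import defaultdict
    import math

    def predict_symbol(copy_scores, gen_scores):
        # copy_scores: list of (leaf_symbol, score); gen_scores: dict subtoken -> score.
        summed = defaultdict(float)
        for symbol, score in copy_scores:          # several context leaves may hold the same symbol
            summed[symbol] += score
        for symbol, score in gen_scores.items():   # scores from the subtoken vocabulary
            summed[symbol] += score
        normalizer = sum(math.exp(v) for v in summed.values())
        return {symbol: math.exp(v) / normalizer for symbol, v in summed.items()}

    # Toy usage: the identifier "stats" appears twice in the context and also in the vocabulary.
    probs = predict_symbol([("stats", 2.0), ("stats", 1.5), ("ret", 0.3)],
                           {"stats": 0.2, "path": 1.0})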
A key difference from most previous work (Ling et al., 2016; Yin and Neubig, 2017) is
that our copy mechanism uses the syntactic relation to the source (the path Hℓ ), rather
than the sequential relation or the graph-node representation (Yin et al., 2019).
5.4.1 Benchmarks
Any-Code Completion: Java We take the Java-small dataset of Alon et al. (2019a),
which is a re-split of the dataset of Allamanis et al. (2016). It contains 11 GitHub
projects, broken down into a single method per example, and split to train/dev/test
by project to reduce code overlap. This dataset was found to contain the least code
duplication by Allamanis (2018b). We create any-code completion examples by select-
ing every expression larger than a single AST node as the target, using the remainder
of the method as the context. We remove methods containing the word “test” in their
body or file name, and omit 10% of the examples by filtering out methods longer than
20 lines to avoid configurations, initializations, and auto-generated code. To make the
task even harder, we remove examples where the target appears as-is in the context.
Ultimately, this dataset contains 1.3M/10k/20k train/dev/test examples.
Table 5.1: Results on any-code completion in Java.

Model                                             acc@1   acc@5   tree@1   tree@5
code2seq (Alon et al., 2019a)                     10.68   15.56   30.46    43.94
Iyer et al. (2018)                                 5.94    9.19   25.54    36.75
seq2prod (Yin and Neubig, 2017)                    8.05   11.82   30.77    41.73
Transformer-small (Vaswani et al., 2017) + copy   14.23   21.35   31.83    47.40
Transformer-base (Vaswani et al., 2017) + copy    16.65   24.05   34.68    50.52
BiLSTM→LSTM (Luong et al., 2015) + copy           16.93   23.17   34.29    49.72
seq2tree (Aharoni and Goldberg, 2017) + copy      16.81   23.04   38.14    52.36
SLM (this work)                                   18.04   24.83   39.10    55.32
5.4.2 Baselines
to the left of the target code, while we also consider context to the right. Extending
PHOG could potentially improve its results.
In both Java and C#, we compare to code2seq (Alon et al., 2019a), which is a strong
code→NL model. We train it to generate the target code as a sequence of subtokens.
Architecture We use embeddings of size 512, 2 layers of LSTMs with 256 units, and
4 transformer layers with 8 attention heads. We kept a small subtoken vocabulary of
size 1000 to encourage the model to learn to copy; larger vocabularies did not show
an improvement. These resulted in a very lightweight model of only 15M parameters,
which is close to Transformersmall (11.8M parameters). In comparison, Transformerbase
had more than 45M parameters (3× more parameters than our model).
Training We train the model end-to-end on a single V100 GPU, using cross entropy
and the Adam optimizer (Kingma and Ba, 2014), an initial learning rate of 10−4 multi-
plied by 0.95 every 20k steps. We bucket examples based on the number of predictions
in the target subtree (nodes + subtokens + EOS), and vary the batch size such that
each batch contains about 512 targets. We train the model to prefer copying entire
tokens rather than copying subtokens, if possible, by optimizing for the entire token as
the true label. We apply dropout of 0.25 in the Transformer layers, and a recurrent
dropout of 0.5 in the LSTMs.
Inference We perform beam search with width of 5 and optimize for accuracy@1.
5.5 Results
Any-Code Completion: Java Table 5.1 shows that our SLM achieves over 1.1%
and 0.78% better acc@1 and acc@5 (respectively) over the two strongest baselines.
The improvement over Transformersmall , which is closer to our model in the number of
parameters, is even higher: over 3.8% and 3.4% in acc@1 and acc@5.
Table 5.3: Ablations on any-code completion in Java.

Ablation          acc@1   acc@5   tree@1   tree@5
Paths→Seq         12.95   18.52   33.44    43.43
Seq→Path          12.12   17.12   28.68    43.99
Paths→Paths       17.63   24.62   37.78    53.98
No Root Att       14.43   18.48   28.20    35.65
No Copy           10.72   15.70   30.61    44.35
SLM (original)    18.04   24.83   39.10    55.32
Restricted Completion: C# Table 5.2 shows the results for the restricted com-
pletion task in C#, where seq2seq+copy is the BiLSTM→LSTM+copy model which
performed the best among the Java baselines. We first observe that the seq2seq+copy
and the seq2tree+copy baselines outperform the GNN →NAG of Brockschmidt et al.
(2019), who introduced this task. Although Brockschmidt et al. (2019) did compare
to a seq2seq baseline, their GNN →NAG model could copy symbols from the context,
but their baseline did not. To conduct a fair comparison with our SLM model, we
equipped the seq2seq and seq2tree baselines with a copy mechanism. Even though
the seq2seq+copy and the seq2tree+copy baselines perform substantially better than
the state of the art in this setting, our SLM model is able to go beyond, achieving
significant gains over all models.
The superiority of our model over GNN →NAG may also be related to the GNN
bottleneck (Alon and Yahav, 2021), which hinders GNNs from propagating long-range
messages. In contrast, propagating long-range messages using paths is natural for our
model.
Paths→Seq follows code2seq (Alon et al., 2019a) and separates the model to an
encoder and a decoder, where the decoder generates the target code as a sequence
of subtokens. The main difference from code2seq is that Paths→Seq includes a copy
mechanism, as in our model.
Seq→Path follows Rabinovich et al. (2017) and separates our model to an encoder
and a decoder (including a copy mechanism), where the encoder encodes the context
as a sequence of subtokens using a BiLSTM, and the decoder generates the missing
subtree using the root path and the index of the generated child.
Paths→Paths is similar to our SLM model except that it uses separate encoder
and decoder. These encoder and decoder have untied weights, unlike our SLM model
which models the source and the target jointly.
No Root Attention uses max pooling instead of attention in aggregating multiple
paths (see Section 5.3.2). The index-informed path from the root to the target’s parent
(R in Figure 5.2) is concatenated with the result, instead of being used as attention
query.
No Copy replaces copy mechanism with a much larger vocabulary (25k subtokens
instead of 1k).
Results Table 5.3 shows the results of these alternatives. As our SLM model performs
better than Paths→Paths, this ablation shows the importance of joint modeling of the
context and the target subtree by parameter tying.
Each of Paths→Paths and the seq2seq baselines (Table 5.1) performs better than
Paths→Seq and Seq→Path; this shows the importance of using the same type of encoder
and decoder for any-code completion, rather than combining “an optimal encoder”
with “an optimal decoder”. While this distinction between encoder and decoder types
might be necessary for semantic parsing (Rabinovich et al., 2017; Dong and Lapata,
2018), NL→code (Yin and Neubig, 2017) and code→NL (Alon et al., 2019a; Fernandes
et al., 2019) tasks because of the different modalities of the input and the output, this
discrepancy may hurt generalization when the output is essentially a missing part of
the input’s AST.
Paths→Paths performs better than the seq2seq baselines (Table 5.1), showing the
advantage of using paths over textual sequences, even without parameter tying.
No Root Attention degrades acc@1 and acc@5 by 3.6% to 6.3%. This shows that
dynamically attending to the context paths given the current root path is crucial.
Not using a copying mechanism results in a degradation of 7.3% to 9.1%. Programs
use symbols and identifiers repetitively, thus the ability to copy symbols from the
context is crucial for this task. For this reason, we included a copying mechanism in
all NMT baselines in Section 5.4.
(a)
    private static void log(String value) {
        if (value != null
            && )
            value = value.substring(0, 55)+"...";
        LOG.info(value);
    }

(b)
    public int compareTo(LongWritable o) {
        long thisValue = this.value;
        long thatValue = o.value;
        return (thisValue < thatValue ? -1 :
            ( ));
    }

Figure 5.4: Examples for cases where the top candidate is a "tree-match", but only the second candidate is an "exact match" (marked with ✓ in bold in the original figure). Predictions that are logically equivalent to the ground truth are marked with ↔.
Our main results (Table 5.1 and Table 5.2) reveal a gap between acc@k and tree@k:
when ignoring identifier values and comparing only the tree structure, accuracy is
significantly higher across all models. While our SLM model performs better than
all baselines in acc@k, our model also shows greater potential for improvement in its
tree@k results, which are much higher than the baselines’. We thus focus on studying
the cases where the tree was predicted correctly, but the model failed to generate the
code exactly including names.
Figure 5.4a shows an example of this case: the ground truth has a structure of
the form: NAME.NAME() > INT. Our model predicts value.length() > 0 (a tree-match)
as its first candidate and value.length() > 55 (the ground truth) as its second. Null-
checking a string is often followed by checking that it is also not empty, making the
first candidate a reasonable prediction as well.
Figure 5.4b shows another example: in this case, the ground truth thisValue ==
thatValue ? 0 : 1 was predicted correctly only as the second candidate. Nevertheless,
the top-3 candidates are tree-matches since all of them are of the form: NAME == NAME
? INT : INT. Interestingly, the fifth candidate (thisValue == thatValue) ? 0 : 1 is
logically-equivalent to the ground truth.
In both examples, our model’s top candidate differs from the ground truth by a sin-
gle identifier or literal: in Figure 5.4a the model predicted 0 instead of 55; in Figure 5.4b
the model predicted thisValue instead of thatValue. Such single subtoken errors are
responsible for 30% of the cases where the model’s top prediction is a tree-match but
not an exact match. Single token (whole identifier or literal) mismatches are respon-
sible for 74% of these cases. Thus, improving our model’s ability to predict the right
names has the potential to further enhance our gains. Detailed results of allowing
such mistakes in our model and in the baselines can be found in the supplementary
public float getProgress() {
this.readLock.lock();
try {
if (this.currentAttempt != null) {
return ;
}
return 0;
} finally {
this.readLock.unlock();
}
}
Figure 5.5: An example from our test set in which a compiler-guided generation could filter out non-compiling candidates, and thus rank the ground truth second instead of fifth. Four out of the five candidates are a "tree-match", the fifth candidate is an "exact match" (marked with ✓ in bold in the original figure), and only the second and the fifth candidates compile.
material.
Additional possible post-filtering could filter out candidates that do not compile.
In Figure 5.5, the first, third and fourth candidates do not compile, because the
this.currentAttempt object does not have getCount, get, nor getTime methods. If
the model’s predictions would have been considered in the context of the entire project
including its dependencies, these candidates could have been filtered out, and the (cor-
rect) fifth candidate would be ranked second. We leave compiler-guided code generation
to future work.
Additional examples can be found in the supplementary material and in our inter-
active demo at http://AnyCodeGen.org.
leaves of Ap into the currently expanded node π (at ), such as path3 in Figure 5.2e.
• The “context node” of PHOG (Bielik et al., 2016) is just one of the previously-
traversed leaf nodes in a<t . Thus, not only does our model condition on this context node as well, it also takes into account the syntactic relation, i.e., the path, between the context and π (at ). Moreover, while PHOG conditions on a
single leaf, SLMs condition on every leaf in a<t .
• Finally, Brockschmidt et al. (2019) define special graph edges (e.g., "NextSib" and "Child") to capture relations on the AST. Allamanis et al. (2018) further define
data-flow and control-flow graph edges such as “ComputedFrom” and “Guard-
edByNegation”. Most of these relations can be expressed as partial AST paths
without manually designing them.
Chen et al. (2018b) addressed JavaScript↔CoffeeScript translation with a tree-to-
tree approach, which required a strong alignment between the source and target trees.
Chapter 6
While studying different representations of code, we noticed that graph neural net-
works, although very versatile and popular, fail to learn long-range patterns in the
training data. When trained on programming tasks that depend on long-range inter-
actions, we found that GNNs usually overfit short-range artifacts in the data. This
phenomenon was surprising, because AST paths had no problem learning long-range
signals. Searching through the literature, we found that ever since GNNs were proposed (Gori et al., 2005; Scarselli et al., 2008), their struggle to propagate information between distant nodes in the graph has been one of the major problems in training them.
In Section 6.1 we propose a new explanation for this problem: GNNs are susceptible
to a bottleneck when aggregating messages across a long path. This bottleneck causes
the over-squashing of exponentially growing information into fixed-size vectors.
Further, we found that Graph Attention Networks (GATs), which are one of the
most popular GNN architectures and are considered the state-of-the-art architecture
for representation learning with graphs, can only compute a restricted kind of attention
where the ranking of attended nodes is unconditioned on the query node. In Section 6.2,
we formally define this restricted kind of attention as static attention and distinguish
it from a strictly more expressive dynamic attention. To remove this limitation, we
introduce a simple fix by modifying the order of operations and propose GATv2: a
dynamic graph attention variant that is strictly more expressive than GAT.
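The difference can be sketched by comparing the per-edge scoring functions of the two variants (single attention head, no bias terms; a schematic comparison rather than a full layer implementation):

    # Sketch: GAT vs. GATv2 attention scoring for an edge (j -> i), single head.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 16
    W = nn.Linear(d, d, bias=False)
    a = nn.Linear(2 * d, 1, bias=False)

    def gat_score(h_i, h_j):
        # GAT: the LeakyReLU is applied after the linear layer a, so the ranking of the
        # scores over neighbors j does not depend on the query node i (static attention).
        return F.leaky_relu(a(torch.cat([W(h_i), W(h_j)])), negative_slope=0.2)

    def gatv2_score(h_i, h_j):
        # GATv2: applying a after the nonlinearity makes the score depend jointly on
        # i and j (dynamic attention).
        return a(F.leaky_relu(torch.cat([W(h_i), W(h_j)]), negative_slope=0.2))

    s_gat = gat_score(torch.randn(d), torch.randn(d))
    s_gatv2 = gatv2_score(torch.randn(d), torch.randn(d))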
Graph neural networks (GNNs) (Gori et al., 2005; Scarselli et al., 2008; Micheli, 2009)
have seen sharply growing popularity over the last few years (Duvenaud et al., 2015;
Hamilton et al., 2017; Xu et al., 2019).

Figure 6.1: (a) The bottleneck of RNN seq2seq models; (b) the bottleneck of graph neural networks. The bottleneck that existed in RNN seq2seq models (before attention) is strictly more harmful in GNNs: information from a node's exponentially-growing receptive field is compressed into a fixed-size vector. Black arrows are graph edges; red curved arrows illustrate information flow. (Diagrams not reproduced here.)

GNNs provide a general framework to model
complex structural data containing elements (nodes) with relationships (edges) between
them. A variety of real-world domains such as social networks, computer programs,
chemical and biological systems can be naturally represented as graphs. Thus, many
graph-structured domains are commonly modeled using GNNs.
A GNN layer can be viewed as a message-passing step (Gilmer et al., 2017), where
each node updates its state by aggregating messages flowing from its direct neighbors.
GNN variants (Li et al., 2016; Veličković et al., 2018; Kipf and Welling, 2017) mostly
differ in how each node aggregates the representations of its neighbors with its own
representation. However, most problems also require the interaction between nodes
that are not directly connected, and they achieve this by stacking multiple GNN layers.
Different learning problems require different ranges of interaction between nodes in the
graph to be solved. We call this required range of interaction between nodes – the
problem radius.
In practice, GNNs were observed not to benefit from more than a few layers. The
accepted explanation for this phenomenon is over-smoothing: node representations be-
come indistinguishable when the number of layers increases (Wu et al., 2020). Nonethe-
less, over-smoothing was mostly demonstrated in short-range tasks (Li et al., 2018;
Klicpera et al., 2018; Chen et al., 2020a; Oono and Suzuki, 2020; Zhao and Akoglu,
2020; Rong et al., 2020; Chen et al., 2020b) – tasks that have small problem radii, where
a node’s correct prediction mostly depends on its local neighborhood. Such tasks in-
clude paper subject classification (Sen et al., 2008) and product category classification
(Shchur et al., 2018). Since the learning problems depend mostly on short-range in-
formation in these datasets, it makes sense that more layers than the problem radius
might be extraneous. In contrast, in tasks that also depend on long-range information
(and thus have larger problem radii), we hypothesize that the explanation for limited
performance is over-squashing.
To allow a node to receive information from other nodes at a radius of K, the GNN
needs to have at least K layers, or otherwise, it will suffer from under-reaching – these
distant nodes will simply not be aware of each other. Clearly, to avoid under-reaching,
problems that depend on long-range interaction require as many GNN layers as the
range of the interaction. However, as the number of layers increases, the number of
nodes in each node’s receptive field grows exponentially. This causes over-squashing:
information from the exponentially-growing receptive field is compressed into fixed-
length node vectors. Consequently, the graph fails to propagate messages flowing from
distant nodes, and learns only short-range signals from the training data.
In fact, the GNN bottleneck is analogous to the bottleneck of sequential RNN
models. Traditional seq2seq models (Sutskever et al., 2014; Cho et al., 2014a,b) suffered
from a bottleneck at every decoder state – the model had to encapsulate the entire
input sequence into a fixed-size vector. In RNNs, the receptive field of a node grows
linearly with the number of recursive applications. However in GNNs, the bottleneck is
asymptotically more harmful, because the receptive field of a node grows exponentially.
This difference is illustrated in Figure 6.1.
Our main contribution in this work is introducing the over-squashing phenomenon –
a novel explanation for the major and well-known issue of training GNNs for long-range
problems, and showing its harmful practical implications. We use a controlled problem
to demonstrate how over-squashing prevents GNNs from fitting long-range patterns in
the data, and to provide theoretical lower bounds for the required hidden size given the
problem radius. We show, analytically and empirically, that GCN (Kipf and Welling,
2017) and GIN (Xu et al., 2019) are susceptible to over-squashing more than other
types of GNNs such as GAT (Veličković et al., 2018) and GGNN (Li et al., 2016).
We further show that prior work that extensively tuned GNNs to real-world datasets
suffer from over-squashing: breaking the bottleneck using a simple fully adjacent layer
reduces the error rate by 42% in the QM9 dataset, by 12% in ENZYMES, by 4.8% in
NCI1, and improves accuracy in VarMisuse, without any additional tuning.
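The fully adjacent layer is simple to sketch: every layer except the last uses the original graph edges, and only the last layer operates on a complete graph. The helper below is an illustrative assumption, not the exact code used in our experiments.

    # Sketch: breaking the bottleneck with a fully-adjacent (FA) last layer.
    import torch

    def run_gnn_with_fa_layer(layers, H, A):
        # layers: list of callables layer(H, A) -> H; H: (n, d) node states; A: (n, n) adjacency.
        for layer in layers[:-1]:
            H = layer(H, A)                           # original, possibly sparse topology
        A_full = torch.ones(A.size(0), A.size(0))     # every pair of nodes is adjacent
        return layers[-1](H, A_full)                  # only the last layer is fully adjacent

    # Toy usage with a trivial mean-aggregation "layer":
    mean_layer = lambda H, A: (A / A.sum(dim=1, keepdim=True).clamp(min=1)) @ H
    H = torch.randn(6, 8)
    A = (torch.rand(6, 6) > 0.6).float()
    out = run_gnn_with_fa_layer([mean_layer, mean_layer, mean_layer], H, A)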
6.1.1 Preliminaries
A directed graph G = (V, E) contains nodes V and edges E, where (u, v) ∈ E denotes
an edge from a node u to a node v. For brevity, in the following definitions we treat
all edges as having the same type; in general, every edge can have a type and features
(Schlichtkrull et al., 2018).
A GNN layer updates each node's representation given its neighbors, yielding $h^{(1)}_v \in \mathbb{R}^{d}$. In general, the $k$-th layer of a GNN is a parametric function $f_k$ that is applied to each node by considering its neighbors:

$$h^{(k)}_v = f_k\left(h^{(k-1)}_v, \left\{h^{(k-1)}_u \mid u \in \mathcal{N}_v\right\}; \theta_k\right) \tag{6.1}$$
where Nv is the set of nodes that have edges to v: Nv = {u ∈ V | (u, v) ∈ E}. The
total number of layers K is usually determined empirically as a hyperparameter.
The design of the function f is what mostly distinguishes one type of GNN from
the other. For example, graph convolutional networks (GCN) define f as:
$$h^{(k)}_v = \sigma\left(\sum_{u \in \mathcal{N}_v \cup \{v\}} \frac{1}{c_{u,v}} W^{(k)} h^{(k-1)}_u\right) \tag{6.2}$$

where $\sigma$ is a nonlinearity such as $\mathrm{ReLU}$, and $c_{u,v}$ is a normalization factor often set to $\sqrt{|\mathcal{N}_v| \cdot |\mathcal{N}_u|}$ or $|\mathcal{N}_v|$ (Hamilton et al., 2017). Usually, the last ($K$-th) layer's output is used for prediction: in node-prediction, $h^{(K)}_v$ is used to predict a label for $v$; in graph-prediction, a permutation-invariant "readout" function aggregates the nodes of the final layer using summation, averaging, or a weighted sum (Li et al., 2016).
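As a minimal illustration of Equation (6.2), a single GCN layer on a dense adjacency matrix can be sketched as follows (symmetric normalization; a toy setup rather than any particular library implementation):

    # Sketch of one GCN layer (Equation 6.2) with symmetric normalization.
    import torch
    import torch.nn as nn

    def gcn_layer(H, A, W):
        # H: (n, d) node features; A: (n, n) 0/1 adjacency without self-loops; W: a linear layer.
        A_hat = A + torch.eye(A.size(0))              # add self-loops: N_v ∪ {v}
        degrees = A_hat.sum(dim=1)
        c = torch.outer(degrees, degrees).sqrt()      # c_{u,v} = sqrt(|N_v| * |N_u|)
        return torch.relu((A_hat / c) @ W(H))         # sigma( sum_u (1 / c_{u,v}) W h_u )

    n, d = 5, 16
    A = (torch.rand(n, n) > 0.7).float()
    A = ((A + A.t()) > 0).float()
    A.fill_diagonal_(0)                               # self-loops are added inside gcn_layer
    H = torch.randn(n, d)
    H1 = gcn_layer(H, A, nn.Linear(d, d, bias=False))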
Figure 6.2: The NeighborsMatch problem: green nodes have blue neighbors and an
alphabetical label. The goal is to predict the label (A, B, or C) of the green node that
has the same number of blue neighbors as the target node in the same graph. In this
example, the correct label is C, because the target node has two blue neighbors, like
the node marked with C in the same graph.
correct answer is C in this case, because the target node has two blue neighbors, like
the node marked with C in the same graph. Every example in the dataset has a
different mapping from numbers of neighbors to labels, and thus message propagation
and matching between the target node and all the green nodes must be performed for
every graph in the dataset.
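As a toy illustration of the task's structure (not the dataset-generation code used in Alon and Yahav (2021)), one NeighborsMatch example could be constructed as follows:

    # Sketch: construct one NeighborsMatch example (illustrative only).
    import random
    import string

    def make_neighbors_match_example(num_green=4, max_blue=6):
        counts = random.sample(range(max_blue + 1), num_green)      # distinct numbers of blue neighbors
        labels = random.sample(string.ascii_uppercase, num_green)   # a fresh label mapping per example
        green_nodes = list(zip(labels, counts))                     # e.g. [("C", 2), ("A", 0), ...]
        target_count = random.choice(counts)                        # blue neighbors of the target node
        answer = next(label for label, count in green_nodes if count == target_count)
        return green_nodes, target_count, answer

    green_nodes, target_count, answer = make_neighbors_match_example()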
Since the model must propagate information from all green nodes before predicting
the label, a bottleneck at the target node is inevitable. This bottleneck causes over-
squashing, which can prevent the model from fitting the training data perfectly. In Alon and Yahav (2021), we provide theoretical lower bounds for the GNN's hidden size. Obviously, adding direct edges
between the target node and the green nodes, or making the existing edges bidirectional,
could ease information flow for this specific problem. However, in real-life domains (e.g.,
molecules), we do not know the optimal message propagation structure a priori, and
must use the given relations (such as bonds between atoms) as the graph’s edges.
Although this is a contrived problem, it resembles real-world problems that are often
modeled as graphs. For example, a computer program in a language such as Python
may declare multiple variables (i.e., the green nodes in Figure 6.2) along with their
types and values (their numbers of blue neighbors in Figure 6.2); later in the program,
predicting which variable should be used in a specific location (predict the alphabetical
label in Figure 6.2) must use one of the variables that are available in scope based on
the required type and the required value at that point.
Short- vs. long-range problems Much of prior GNN work has focused on problems
that were local in nature, with small problem radii, where the underlying inductive
bias was that a node’s most relevant context is its local neighborhood, and long-range
interaction was not necessarily needed. With the growing popularity of GNNs, their
adoption expanded to domains that required longer-range information propagation as
well, without addressing the inherent bottleneck. In this work, we focus on problems
that require long-range information. That is, a correct prediction requires considering
the local environment of a node and interactions beyond the close neighborhood. For
example, a chemical property of a molecule (Ramakrishnan et al., 2014; Gilmer et al.,
2017) can depend on the combination of atoms that reside in the molecule’s opposite
sides. Problems of this kind require long-range interaction, and thus, a large number
of GNN layers. Since the receptive field of each node grows exponentially with the
number of layers, the more layers there are, the more harmful over-squashing becomes.
In problems that are local in nature (small r) – the bottleneck is less troublesome,
because a GNN can perform well with only few layers (e.g., K=2 layers in Kipf and
Welling (2017)), and the receptive field of a node can be exponentially smaller. Domains
such as citation networks (Sen et al., 2008), social networks (Leskovec and Mcauley,
2012), and product recommendations (Shchur et al., 2018) usually raise short-range
problems and are thus not the focus of this work. So, how long is long-range? We
discuss and analyze this question theoretically in Alon and Yahav (2021).
Evaluation In Alon and Yahav (2021), we present an empirical evaluation demonstrating
that the GNN bottleneck exists and causes over-squashing starting from values
of r as small as r = 4: first, we generated a synthetic benchmark that is theoretically
solvable, yet in practice all GNNs fail to reach 100% training accuracy because of the
bottleneck. Second, we find that the bottleneck exists in prior work that addressed
real-world problems, by showing that the authors' original implementations can
be further improved by considering the bottleneck. Finally, we find that GNNs that
absorb incoming edges equally, like GCN (Kipf and Welling, 2017) and GIN (Xu et al.,
2019), are more susceptible to over-squashing than GNNs that use attention to weigh
incoming edges like GAT (Veličković et al., 2018) and GGNN (Li et al., 2016).
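To make the distinction concrete, here is a schematic NumPy sketch of the two aggregation styles (learned weight matrices, normalization constants, and nonlinearities are omitted; this is an illustration, not the implementation of any specific model):

import numpy as np

def mean_aggregate(h, neighbors):
    # GCN/GIN-style: every incoming edge contributes with the same weight,
    # so signals from many nodes are squashed uniformly into one vector.
    return np.stack([h[u] for u in neighbors]).mean(axis=0)

def attention_aggregate(h, neighbors, score_fn):
    # GAT-style: a learned score lets the node up-weight the few incoming
    # edges that carry the relevant (possibly long-range) information.
    neigh = np.stack([h[u] for u in neighbors])           # (N, dim)
    scores = np.array([score_fn(h[u]) for u in neighbors])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                              # softmax over neighbors
    return (weights[:, None] * neigh).sum(axis=0)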
[Two attention heat maps over keys k0 to k9 for queries q0 to q9; the numeric attention values are omitted here.]
(a) Attention in standard GAT (Veličković et al., 2018). (b) Attention in GATv2, our fixed version of GAT.
Figure 6.3: Standard GAT (Figure 6.3a) computes static attention: the ranking of
attention coefficients is global for all nodes in the graph, and is unconditioned on the
query node. For example, all queries (q0 to q9) attend mostly to the 8th key (k8). In
contrast, GATv2 (Figure 6.3b) can actually compute dynamic attention, where every
query has a different ranking of attention coefficients of the keys.
Graph neural networks (GNNs; Gori et al., 2005; Scarselli et al., 2008) have seen
increasing popularity over the past few years (Duvenaud et al., 2015; Atwood and
Towsley, 2016; Bronstein et al., 2017; Monti et al., 2017). GNNs provide a general
and efficient framework to learn from graph-structured data. Thus, GNNs are easily
applicable in domains where the data can be represented as a set of nodes and the
prediction depends on the relationships (edges) between the nodes. Such domains
include molecules, social networks, product recommendation, computer programs and
more.
A GNN can be viewed as a message-passing network (Gilmer et al., 2017), where
each node iteratively updates its state by interacting with its neighbors. GNN vari-
ants (Wu et al., 2019; Xu et al., 2019; Li et al., 2016) mostly differ in how each node
aggregates the representations of its neighbors and combines them with its own repre-
sentation. Veličković et al. (2018) pioneered the use of attention-based neighborhood
aggregation, in one of the most popular GNN variants – Graph Attention Network
(GAT). In GAT, every node updates its representation by attending to its neighbors
using its own representation as the query. This generalizes the standard averaging or
max-pooling of neighbors (Kipf and Welling, 2017; Hamilton et al., 2017), by allowing
every node to compute a weighted average of its neighbors. The work of Veličković et al.
also generalizes the Transformer’s (Vaswani et al., 2017) self-attention mechanism, from
sequences to graphs (Joshi, 2020).
While GAT is one of the most popular GNN architectures (Bronstein et al., 2021)
and is considered the state-of-the-art neural architecture for learning with graphs
(Wang et al., 2019a), we show that GATs do not actually compute dynamic attention,
a fact that severely hinders their expressiveness. Instead, we show that GAT only uses
a restricted “static” form of attention: for every query node, attention is monotonic
with respect to its neighbor key scores. That is, the ranking (the argsort) of attention
coefficients is shared across all nodes in the graph, and is unconditioned on the query
node. This limitation of the standard GAT is demonstrated in Figure 6.3a.
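To see why, following the formulation of Veličković et al. (2018) and its analysis in Brody et al. (2021), the GAT scoring function can be decomposed by writing the learned attention vector $a$ as a concatenation $[a_1 \| a_2]$:

$$e(h_i, h_j) \;=\; \mathrm{LeakyReLU}\!\left(a^{\top}\,[W h_i \,\|\, W h_j]\right) \;=\; \mathrm{LeakyReLU}\!\left(a_1^{\top} W h_i + a_2^{\top} W h_j\right).$$

Since LeakyReLU is monotonic and the term $a_1^{\top} W h_i$ is constant for a fixed query $i$, the ranking of the scores over the keys $j$ is determined by $a_2^{\top} W h_j$ alone, independently of the query; hence the attention is static. GATv2 moves the nonlinearity inside,

$$e(h_i, h_j) \;=\; a^{\top}\,\mathrm{LeakyReLU}\!\left(W\,[h_i \,\|\, h_j]\right),$$

so the score is a learned function of the pair $(h_i, h_j)$ and can rank keys differently for different queries.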
Supposedly, the conceptual idea of attention as the form of interaction between GNN
nodes is orthogonal to the specific choice of attention function. However, Veličković
et al.’s original design of GAT has spread to a variety of domains (Wang et al., 2019a;
Qiu et al., 2018; Yang et al., 2020; Wang et al., 2019c; Huang and Carley, 2019; Ma
et al., 2020; Kosaraju et al., 2019; Nathani et al., 2019; Wu et al., 2020; Zhang et al.,
2020) and has become the default implementation of “graph attention network” in
all popular GNN libraries such as PyTorch Geometric (Fey and Lenssen, 2019), DGL
(Wang et al., 2019b), and others (Dwivedi et al., 2020; Gordić, 2020; Brockschmidt,
2020).
To overcome the limitation we identified in GAT, we introduce a simple fix to its
attention function by modifying the order of internal operations. The result is GATv2
– a graph attention variant that has a universal approximator attention function, and
is thus strictly more expressive than GAT. The effect of fixing the attention function in
GATv2 is demonstrated in Figure 6.3b.
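A minimal sketch of the two scoring functions (PyTorch-style, single attention head, no biases or dropout; shapes and names are illustrative rather than the reference implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GATScore(nn.Module):
    # GAT: e(h_i, h_j) = LeakyReLU(a^T [W h_i || W h_j]).  Because the
    # nonlinearity is applied after the product with a, the ranking over
    # neighbors j is governed by a_2^T W h_j alone, i.e., static attention.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h_i, h_j):                          # (batch, in_dim) each
        z = torch.cat([self.W(h_i), self.W(h_j)], dim=-1)
        return F.leaky_relu(self.a(z), negative_slope=0.2)

class GATv2Score(nn.Module):
    # GATv2: e(h_i, h_j) = a^T LeakyReLU(W [h_i || h_j]).  Applying W and the
    # nonlinearity before a makes the score depend jointly on the query and
    # the key, i.e., dynamic attention.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(2 * in_dim, out_dim, bias=False)
        self.a = nn.Linear(out_dim, 1, bias=False)

    def forward(self, h_i, h_j):
        z = F.leaky_relu(self.W(torch.cat([h_i, h_j], dim=-1)), negative_slope=0.2)
        return self.a(z)

In both cases the attention coefficients are obtained by a softmax of these scores over each node's neighbors.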
In summary, our main contribution is identifying that one of the most popular
GNN types, the graph attention network, cannot actually compute dynamic attention.
We introduce formal definitions for analyzing the expressive power of graph attention
mechanisms, and derive our claims theoretically from the equations of Veličković et al.
(2018). Empirically, we use a synthetic problem to show that standard GAT cannot
express alignment problems that require dynamic attention. We introduce a simple
fix by switching the order of internal operations in the attention function of GAT,
and propose GATv2, which does compute dynamic attention. We further conduct a
thorough empirical comparison of GAT and GATv2 and find that GATv2 outperforms
GAT across 11 benchmarks of node-, link-, and graph-prediction. For example, GATv2
outperforms extensively tuned GNNs by over 1.4% in the difficult “UnseenProj Test” set
of the VarMisuse task (Allamanis et al., 2018), without any hyperparameter tuning; and
GATv2 improves over an extensively-tuned GAT by 11.5% in 13 prediction objectives
in QM9. In node-prediction benchmarks from OGB (Hu et al., 2020), not only does
GATv2 outperform GAT with respect to accuracy – we find that GATv2 is also much
more robust to noise.
Chapter 7
Additional Applications
Adversarial Examples Neural models of code have shown impressive results when
performing tasks such as predicting method names and identifying certain kinds of
bugs. We show that these models are vulnerable to adversarial examples, and introduce
a novel approach for attacking trained models of code using adversarial examples. The
main idea of our approach is to force a given trained model to make an incorrect
prediction, as specified by the adversary, by introducing small perturbations that do
not change the program’s semantics, thereby creating an adversarial example. To find
such perturbations, we present a new technique for Discrete Adversarial Manipulation
of Programs (DAMP). DAMP works by deriving the desired prediction with respect
to the model’s inputs, while holding the model weights constant, and following the
gradients to slightly modify the input code.
We show that our DAMP attack is effective across three neural architectures:
code2vec, GGNN, and GNN-FiLM, in both Java and C#. Our evaluations demon-
strate that DAMP has up to 89% success rate in changing a prediction to the adversary’s
choice (a targeted attack) and a success rate of up to 94% in changing a given predic-
tion to any incorrect prediction (a non-targeted attack). To defend a model against
such attacks, we empirically examine a variety of possible defenses and discuss their
trade-offs. We show that some of these defenses can dramatically drop the success rate
of the attacker, with a minor penalty of 2% relative degradation in accuracy when the
model is not under attack.
Our code, data, and trained models are available at https://github.com/tech-srl/adversarial-examples.
The edit in the context C and the resulting edit in P, shown as diffs:

- if (self.isDisabled())
+ if (attack == null || attack.IsTraitDisabled)
      return false;

- var targetPos = attack != null
      ? attack.GetTargetPosition(pos, target) : target.CenterPosition;
+ var targetPos = attack.GetTargetPosition(pos, target);
(a) The predicate of the if statement in C was edited to include a null check for attack. Thus,
in P, the checking of attack != null and the ternary operator can be removed.
Figure 7.2: An example of two edits. These examples are different and the edits
operate on different values. However, observing the structure of these edits reveals
the similarity between them and allows a learning model to generalize better. This
similarity is expressed as almost identical AST paths. For simplicity, only the program
fragment that should be edited P is shown, without the context C.
Figure 7.1b shows another example, in which the edit in the context is a modification
of a function signature. In C ′ , the return type was changed to FileCharacteristics, and
the output parameter fileCharacteristics for the function was removed. P consists of
an assignment to the parameter fileCharacteristics, and a return statement returning
the value true. The edit in the context implies a necessary edit in P, in which the assignment
statement has to be removed (since fileCharacteristics is no longer defined) and the
return statement must include a variable of type FileCharacteristics. Our model
successfully predicted the correct edit for P. P ′ consists of returning an object of type
FileCharacteristics.
Representing Code Edits The main design decision in learning code edits is how
to represent the edit, i.e., how to represent the difference between the code in its original
form and its desired, altered, form. Naïvely, differencing programs can be performed by
treating the code as text and using text-diff algorithms for line- or inline-differencing.
In contrast, we model the difference between the abstract syntax trees (ASTs) of the
original and the edited code. This allows us to naturally use paths in the AST (AST
paths) to model edits.
One of the edits in Figure 7.2, for instance, rewrites a call into First(<predicate>). Although the predicates are different and these edits operate
on different values, the structure of the edits in Figure 7.2a and Figure 7.2b is identical.
This similarity is expressed in the AST paths that represent these edits. For example,
consider the identical structure of the path ① in the two figures, where it operates on
a different value in each figure (FirstOrDefault and First).
Our use of AST paths allows the model to generalize these edits, even though these
edits are not identical and their predicates are different.
We apply a Pointer Network (Vinyals et al., 2015) to point to paths in the AST of
P and create an edit operation sequence, i.e., an edit script. While prior work used
AST paths to read programs and predict a label (Chapters 3 and 4), we generate an edit
script by predicting AST paths, i.e., making AST paths the output of our model.
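A schematic sketch of a single pointing step (PyTorch-style; the tensor names here are illustrative assumptions, not the actual implementation):

import torch.nn.functional as F

def point_to_next_operation(candidate_encodings, decoder_state):
    # candidate_encodings: (num_candidates, dim), one vector per valid edit
    # operation (an AST path combined with an operation kind), plus an EOS entry.
    # decoder_state: (dim,), summarizing the context edit and the operations
    # predicted so far.
    scores = candidate_encodings @ decoder_state          # (num_candidates,)
    return F.softmax(scores, dim=0)                       # pointer distribution

# Decoding repeatedly picks (or samples) a candidate, appends it to the edit
# script, updates decoder_state, and stops when the EOS entry is selected.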
We show the effectiveness of C3 on EditCompletion on a new dataset, scraped from
over 300,000 commits in GitHub.
Our approach significantly outperforms textual and syntactic approaches that either
model the code or model only the edit, and that are driven by strong neural models.
The main contributions of this work are:
• We introduce the EditCompletion task: given a program P and edits that oc-
curred in its context, predict the likely edits that should be applied to P.
• Our technique directly captures the relationships between subtrees that are changed
in an edit using paths in the AST. The output of our technique is an edit script
that is executed to edit the program fragment P.
• A new EditCompletion dataset of source code edits and their surrounding con-
text edits, scraped from over 300,000 commits in GitHub.
• A thorough ablation study that examines the contribution of syntactic and textual
representations in different components of our model.
(a)–(d): [AST illustrations of the edit operations, omitted.]
(e) P and P′, shown as a diff:

- productType.AddNavigation(
-     new Navigation(
-         featuredProductFk,
-         "FeaturedProductCategory",
-         pointsToPrincipal: false));
+ productType.AddNavigation(
+     "FeaturedProductCategory",
+     featuredProductFk,
+     pointsToPrincipal: false);
Figure 7.3: An EditCompletion example from our test set. Figure 7.3a shows the
edit that transforms C into C ′ – overloading the function AddNavigation. Figure 7.3e
shows P and P ′ as code in red and green, respectively. Figure 7.3b depicts the partial
AST and the first three edit operations of the edit. Figure 7.3c shows the AST after
applying the first three operations, and shows the next three operations as AST paths.
Figure 7.3d illustrates the AST after performing all operations, resulting in an AST
that corresponds to P ′ . Every edit operation is represented by an AST path having
the same color and number as the edit command. Dotted contours represent subtrees
that will be affected by applying these operations.
High-level Overview Consider the edit that occurred in the context of Figure 7.3a
– insertion of a new definition of the method AddNavigation, which overloads previous
definitions. After applying this edit, it is possible to use this new signature when calling
AddNavigation. Consider the original code snippet P at the top of Figure 7.3e. The edit
in the context allows us to simplify the call to AddNavigation using the new signature,
as shown in the “edited” code snippet P ′ at the bottom of Figure 7.3e. Consider the
partial AST of P in Figure 7.3b. The desired edit can be described as an edit script
consisting of six edit operations to the AST of P. Consider the first operation: ① MOV.
The meaning of this operation is to move the node Expr with its subtree to be the
leftmost child of the node Unit. This edit operation can be represented by the red ①
path: Expr → Arg → ArgList → Call → Expr → Unit. Note how this path directly
captures the syntactic relationship between the node Expr and the node Unit, allowing
our model to predict a MOV operation as part of the edit script.
In Figure 7.3c we can see the result of applying the first three operations:
① MOV, ② MOV, ③ MOV, moving subtrees to new locations in the tree. The last three
commands are DEL operations, expressing deletion of a node and its underlying subtree.
These operations can be represented using paths as well. For instance, ④ DEL is
represented by the green ④ path: Navigation → Call → Expr → Unit → DEL, where DEL
is an artificial node that we add as a child of the AST’s root. In Figure 7.3d we can
see the AST after applying all six operations. After executing all six operations, our
model produces P ′ , shown in Figure 7.3e.
Path Extraction To inform the model about the available edits it can use for predic-
tion, we parse the AST of P to extract all AST paths that represent valid edits. Every
path can represent different edit “commands” that use the same path. For example,
consider the blue ② path in Figure 7.3b: Name → Call → ArgList → Arg → Expr →
Call. This path can represent a move operation – MOV, i.e., moving the node Name with
its subtree, to be the leftmost child of Call; alternatively, this path can represent an
insertion operation – INS, i.e., copy Name with its subtree, and insert it as the leftmost
child of Call. To distinguish between different edit operations that are represented
using the same AST path, each path is encoded as a vector once, and projected into
three vectors using different learned functions. Each resulting vector corresponds to a
different kind of edit operation. For example, the orange ③ path in Figure 7.3b can
represent either “move” (MOV), “update” (UPD) or “insert” (INS) operations. In this case,
this path was projected using the learned function that represents “move”.
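A minimal sketch of this projection step (illustrative assumptions only; dimensions, operation kinds, and layer types may differ from the actual model):

import torch.nn as nn

class OperationProjections(nn.Module):
    # Encode each AST path once, then project the encoding with a separate
    # learned function per edit-operation kind (here MOV, UPD, INS).
    def __init__(self, path_dim):
        super().__init__()
        self.proj = nn.ModuleDict({op: nn.Linear(path_dim, path_dim)
                                   for op in ("MOV", "UPD", "INS")})

    def forward(self, path_encoding):                     # (path_dim,)
        # One candidate vector per (path, operation-kind) pair; the pointer
        # network later scores all candidates jointly.
        return {op: proj(path_encoding) for op, proj in self.proj.items()}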
Edit Script Prediction We predict one edit operation at each step by pointing at
a path and its associated operation from among the valid edit operations. This results
in an edit script. For example, in Figure 7.3, our model finds that the red ① path with
MOV is most likely to be the first operation. Then, given this edit, our model finds that
the blue ② path with MOV is most likely to be the next operation, and so on, until we
predict a special “end of sequence” (EOS) symbol.
Modeling Code Likelihood vs. Modeling Edit Likelihood Modeling edits using
AST paths provides an effective way to model only the difference between P and P ′ .
For example, consider the red ① path that moves the subtree rooted at Expr from its
original place to be the first child of Unit. To predict this edit, our model only needs
to select the red ① path out of the other available operations. In contrast, a model
that attempts to generate P ′ entirely (Chen et al., 2019), would need to generate the
entire subtree from scratch in the new location.
Pairwise Edit Operations Most edit operations, such as “move” and “update”, can
be described as pairwise operations, having the “source” and the “target” locations as
their two arguments. AST paths provide a natural way to represent pairwise relations,
originating from the “source” location, and reaching the “target” location through the
shortest path between them in the tree. In contrast, prior work that used only unary
edit operations, such as HOPPITY (Dinella et al., 2020), is limited to inserting each
node individually, and thus uses multiple edit commands to express the ① MOV operation.
Our model represents this edit operation as a single AST path – the red ① path.
Key aspects The example in Figure 7.3 demonstrates several key aspects of our
method:
• Edits applied to the context of P can provide useful information for the required
edit to P.
• A neural model, trained on these paths, can generalize well to other programs,
thanks to the direct modeling of code edits as paths.
• By pointing at the available edit operations, the task that the model addresses
becomes choosing the most likely edit, rather than generating P ′ from scratch,
and thus significantly eases the learning task.
Correctly predicted example: f1. Adversarial perturbations: f2 (Target: contains) and f3 (Target: escape).

void f1(int[] array) {
    boolean swapped = true;
    for (int i = 0; i < array.length && swapped; i++) {
        swapped = false;
        for (int j = 0; j < array.length - 1 - i; j++) {
            if (array[j] > array[j+1]) {
                int temp = array[j];
                array[j] = array[j+1];
                array[j+1] = temp;
                swapped = true;
            }
        }
    }
}

f2 is identical to f1 except that the variable array is renamed to ttypes; f3 is identical to f1 except for an unused declaration int upperhexdigits; added at the end of the method body.
Figure 7.4: A Java snippet f1 is classified correctly as sort by the model of code2vec.org.
Given f1 and the target contains, our approach generates f2 by renaming array
to ttypes. Given the target escape, our approach generates f3 by adding an unused
variable declaration of int upperhexdigits. Additional examples can be found in Yefet
et al. (2020).
Adversarial examples are typically generated by adding specially-crafted noise to a correctly labeled input, such that the model under attack
yields a desired incorrect label when presented with the modified input (i.e., with the
addition of noise). Adding noise to a continuous object to change the prediction of a
model is relatively easy to achieve mathematically. For example, for an image, this can
be achieved by changing the intensity of pixel values (Szegedy et al., 2013; Goodfellow
et al., 2014b). Unfortunately, this does not carry over to the domain of programs, since
a program is a discrete object that must maintain semantic properties.
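For instance, in the image domain, a single step of the fast gradient sign method of Goodfellow et al. (2014b) often suffices:

$$x' \;=\; x + \epsilon \cdot \mathrm{sign}\!\left(\nabla_{x}\, \mathcal{L}\big(M(x),\, y\big)\right),$$

where a small $\epsilon$ keeps the perturbation imperceptible while increasing the loss on the correct label $y$. A program, in contrast, admits no such infinitesimal step: any perturbation must map to another discrete, compilable, and semantically equivalent program.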
In this work, we present a novel approach for generating adversarial examples for
neural models of code. More formally:
7.2.1 Goal
Given a program P and a correct prediction y made by a model M, such that M(P) = y,
our goal is to find a semantically equivalent program P′ such that M makes a given
adversarial prediction y_bad of the adversary’s choice: M(P′) = y_bad.
The main challenge in tackling the above goal lies in exploring the vast space of
programs that are semantically equivalent to P, and finding a program for which M
will predict y_bad.
Generally, we can define a set of semantic-preserving transformations, which in
turn induce a space of semantically equivalent programs. For example, we can rename
variables and add dead code.
There are clearly many other semantic-preserving transformations (e.g., re-ordering
independent statements), but their application would require a deeper analysis of the
program to guarantee that they are indeed semantic-preserving. In this work, therefore,
we focus on the above two semantic-preserving transformations, which can be safely
applied without any semantic analysis.
Figure 7.5: Perturbing a variable name: the original variable name is represented as a
one-hot vector over the variable-name vocabulary. After perturbation, the vector is no
longer one-hot. We apply argmax to find the most likely adversarial name, resulting
with another one-hot vector over the variable-name vocabulary.
One naïve approach for exploring the space of equivalent programs is to randomly
apply transformations using brute-force. We can apply transformations randomly to
generate new programs and use the model to make a prediction for each generated
program. However, the program space to be explored is exponentially large, making
exhaustive exploration prohibitively expensive.
Our approach is general and is applicable to any model that can be derived with respect to
its inputs, i.e., any neural model. We do not make any assumptions about the internal
details or specific architecture of the model under attack.
To mitigate these attacks, we evaluate and compare a variety of defensive ap-
proaches. Some of these defenses work by re-training the model using another loss
function or a modified version of the same dataset. Other defensive approaches are
“modular”, in the sense that they can be placed in front of an already-trained model,
identify perturbations in the input, and feed a masked version of the input into the
vulnerable model. These defense mechanisms allow us to trade off the accuracy of the
original model for improved robustness.
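As an illustrative sketch of one such modular defense (the masking policy, names, and vocabulary check below are assumptions for exposition, not the exact defenses we evaluate):

UNK = "<UNK>"

def mask_rare_identifiers(tokens, training_vocab, is_identifier):
    # Replace identifiers the model has rarely or never seen with a neutral
    # UNK token before feeding the program to the (unchanged) trained model.
    # Adversarially chosen names such as 'scsqbhj' tend to be such outliers;
    # the cost is a small accuracy penalty on benign but rare names.
    return [UNK if is_identifier(tok) and tok not in training_vocab else tok
            for tok in tokens]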
• The first technique for generating targeted adversarial examples for models of
code. Our technique, called Discrete Adversarial Manipulation of Programs
(DAMP), is general and only requires that the attacker is able to compute gra-
dients in the model under attack (or in a similar model). DAMP is effective in
generating both targeted and non-targeted attacks.
Figure 7.6: (a) a correctly predicted example; (b) an adversarial perturbation with target SourceType.

struct TypePair : IEquatable<TypePair>
{
    public static TypePair Create<TSource, TDestination>(TSource source,
        TDestination destination, ...)
    {
        ...
    }
    ...
    public Type SourceType { get; }
    public Type DestinationType { get; }
    public bool Equals(TypePair other) =>
        SourceType == other.SourceType
        && DestinationType == other.DestinationType;
}

In (b), the parameter destination is renamed to scsqbhj, and the model completes the last comparison with other.SourceType instead of other.DestinationType.
The code2vec model (Alon et al., 2019c), applied to the code snippet f1 of Figure 7.4, predicts the correct
name, sort, with a probability of 98.54%.
Given the code snippet f1 and the arbitrarily chosen adversarial target contains,
our approach finds that renaming the original variable array to ttypes in f2 forces
the model to predict the label contains with an even higher probability, although the
functionality remained the same. We denote this as a VarName attack.
Given the code snippet f1 and the adversarial target escape, our approach finds
that adding an unused variable (i.e., dead code) named upperhexdigits to the end of
f1 forces the model to predict the label escape with a probability of 100% (see f3 in
Figure 7.4). We denote this attack as a DeadCode targeted attack.
In general, bypassing semantic labeling models can allow a variety of malicious
behaviors. For example, a malicious developer can spread malicious code to users by
having the code classified as “benign” by a malware detection model (David et al.,
2020; Saxe and Berlin, 2015; Wang et al., 2017; Arp et al., 2014; Grosse et al., 2016).
In this work, we demonstrate the ability to bypass semantic labeling by applying our
attack on the code2vec model (Figure 7.4), forcing the model to predict a label of our
choice.
Adversarial perturbations can also force a model to suggest an obvious bug as the correct completion. In Figure 7.6a, a GNN model trained on the
VarMisuse task (Allamanis et al., 2018; Brockschmidt, 2020) in C# correctly chooses
to “fill the blank” using the field DestinationType inside the method Equals. By renam-
ing a local variable called destination in another method to the specific name scsqbhj
(Figure 7.6b), the model chooses the incorrect field SourceType in the method Equals.
The fields DestinationType (correct) and SourceType (incorrect) both have the same
type; thus, the code still compiles and the attack causes a real bug in Equals.
More generally, bypassing a bug detection model (Pradel and Sen, 2018; Rice et al.,
2017; Bader et al., 2019) can allow a malicious developer inside an organization or inside
an open-source project to intentionally introduce bugs. In this work, we demonstrate
this ability using the VarMisuse task on Graph Neural Networks (GNNs) (Figure 7.6),
forcing the model to choose an incorrect (but type-correct) variable.
In addition to the code2vec and VarMisuse tasks that we address in this work, we
believe adversarial examples can be applied to neural code search (Sachdev et al., 2018;
Liu et al., 2019a; Cambronero et al., 2019). A developer can attract users to a specific
library or an open-source project by introducing code that will be disproportionately
highly ranked by a neural code search model.
Consider the code snippet f1 of Figure 7.4 that sorts a given array. The code2vec
model (Alon et al., 2019c) applied to this code snippet predicts the correct name, sort.
Our goal is to find semantically equivalent snippets that will cause an underlying model
to yield an incorrect target prediction of our choice.
Deriving with Respect to a Discrete Input In settings where the input is dis-
crete, the first layer of a neural network is typically an embedding layer that embeds
discrete objects, such as names and tokens, into a continuous space (Alon et al., 2019a;
Allamanis et al., 2016; Iyer et al., 2016). The input is the index of the symbol, which
is used to look up its embedding in the embedding matrix. The question for discrete
inputs is therefore: what does it mean to derive with respect to the model’s inputs?
One approach is to derive with respect to the embedding vector, which is the result of
the embedding layer. In this approach, after the gradient is obtained, we need to reflect
the update of the embedding vector back to discrete-input space. This can be done
by looking for the nearest-neighbors of the updated embedding vector in the original
embedding space, and finding a nearby vector that has a corresponding discrete input.
In this approach, there is no guarantee that following the gradient is the best step.
In contrast, our Discrete Adversarial Manipulation of Programs (DAMP) approach
derives with respect to a one-hot vector that represents the distribution over discrete
values (e.g., over variable names). Instead of deriving with respect to the input itself, the gradient
is taken with respect to the distribution over the inputs. Intuitively, this allows us to
directly obtain the best discrete value for following the gradient.
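As a schematic, single-step sketch of this idea (PyTorch-style; the model signature and variable names are illustrative assumptions, and the real attack iterates while preserving program semantics):

import torch
import torch.nn.functional as F

def damp_rename_step(model, name_one_hot, other_inputs, y_bad, vocab):
    # name_one_hot: (vocab_size,) one-hot distribution over variable names.
    # The (hypothetical) model consumes the distribution directly, e.g. by
    # computing the embedding as name_one_hot @ embedding_matrix.
    x = name_one_hot.clone().requires_grad_(True)
    logits = model(x, other_inputs)                       # (num_labels,)
    # Loss that is minimized when the model predicts the adversarial label:
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([y_bad]))
    loss.backward()
    # Deriving w.r.t. the *distribution* scores every vocabulary entry at
    # once; stepping against the gradient and taking argmax yields the
    # discrete name that best follows the gradient.
    perturbed = x - x.grad
    return vocab[perturbed.argmax().item()]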
More details can be found in Yefet et al. (2020). In Finkelshtein et al. (2020), we
extend adversarial attacks on programs to general graphs.
Chapter 8
Conclusion
In this dissertation, we presented a simple and general approach for learning from
programs. The main idea is to represent a program using paths in its abstract syntax
tree (AST). This allows a learning model to leverage the structured nature of source
code rather than treating it as a flat sequence of tokens.
We showed that this representation can serve as a basis for models that are trained
on up to 16M method examples, and be useful for a variety of programming languages
and prediction tasks: predicting variable names, predicting method names, generating
a natural language sequence given a code snippet, and the most challenging task: any-
code completion.
We showed that these models are easily generalizable to different programming lan-
guages, including JavaScript, Java, Python, and C#. Our models perform significantly
better than previous programming-language-oriented works and state-of-the-art NMT
models applied in our settings. Our approaches generalize much previous work in this
area while reaching state-of-the-art performance on challenging benchmarks.
While comparing our approach with existing methods, we found theoretical explanations
for the empirical differences between different models. In particular, we found
a novel explanation for a well-known limitation in training graph neural networks: a
bottleneck that causes over-squashing. As a result, GNNs fail to propagate long-range
information, learn only short-range signals from the training data instead, and perform
poorly when the prediction task depends on long-range interaction.
We believe that the principles presented in this thesis can serve as a basis for a wide
range of tasks that involve source code, or source code and natural language, and can
be extended to other kinds of generated outputs. Since the representation of programs
using AST paths is fundamental to programming languages, it can be used in a variety
of other machine learning tasks, including different applications and different learning
models. We also believe that structural language modeling enables a wide range of
future applications, similarly to how language modeling research has contributed to
NLP in recent years.
Code, data, and trained models can be found at:
https://github.com/tech-srl/PigeonJS/
https://github.com/tech-srl/code2vec/
https://github.com/tech-srl/code2seq/
https://github.com/tech-srl/slm-code-generation/
https://github.com/tech-srl/bottleneck/
https://github.com/tech-srl/how_attentive_are_gats/
Bibliography
JavaParser. http://javaparser.org.
Roslyn. https://github.com/dotnet/roslyn.
UglifyJS. https://github.com/mishoo/UglifyJS.
UnuglifyJS. https://github.com/eth-srl/UnuglifyJS.
Roee Aharoni and Yoav Goldberg. Towards string-to-tree neural machine translation.
In Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics (Volume 2: Short Papers), pages 132–140, Vancouver, Canada, July
2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-2021. URL
https://www.aclweb.org/anthology/P17-2021.
Miltiadis Allamanis. The adverse effects of code duplication in machine learning models
of code. arXiv preprint arXiv:1812.06469, 2018b.
Miltiadis Allamanis and Charles Sutton. Mining source code repositories at massive
scale using language modeling. In Proceedings of the 10th Working Conference on
Mining Software Repositories, MSR ’13, pages 207–216, Piscataway, NJ, USA, 2013.
IEEE Press. ISBN 978-1-4673-2936-1. URL http://dl.acm.org/citation.cfm?
id=2487085.2487127.
Miltiadis Allamanis and Charles Sutton. Mining idioms from source code. In Pro-
ceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations
of Software Engineering, FSE 2014, pages 472–483, New York, NY, USA, 2014.
ACM. ISBN 978-1-4503-3056-5. doi: 10.1145/2635868.2635901. URL http:
//doi.acm.org/10.1145/2635868.2635901.
Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Learning natural
coding conventions. In Proceedings of the 22Nd ACM SIGSOFT International Sympo-
sium on Foundations of Software Engineering, FSE 2014, pages 281–293, New York,
NY, USA, 2014. ACM. ISBN 978-1-4503-3056-5. doi: 10.1145/2635868.2635883.
URL http://doi.acm.org/10.1145/2635868.2635883.
Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Suggesting
accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on
Foundations of Software Engineering, ESEC/FSE 2015, pages 38–49, New York, NY,
USA, 2015a. ACM. ISBN 978-1-4503-3675-8. doi: 10.1145/2786805.2786849. URL
http://doi.acm.org/10.1145/2786805.2786849.
Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. Bimodal Mod-
elling of Source Code and Natural Language. In Proceedings of the 32nd International
Conference on Machine Learning, volume 37 of JMLR Proceedings, pages 2123–2132.
JMLR.org, 2015b.
Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. A survey
of machine learning for big code and naturalness. arXiv preprint arXiv:1709.06182,
2017.
Uri Alon and Eran Yahav. On the bottleneck of graph neural networks and its practical
implications. In International Conference on Learning Representations, 2021. URL
https://openreview.net/forum?id=i80OPhOCVH2.
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. A general path-based
representation for predicting program properties. In Proceedings of the 39th ACM
SIGPLAN Conference on Programming Language Design and Implementation, PLDI
2018, pages 404–419, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-5698-5. doi:
10.1145/3192366.3192412. URL http://doi.acm.org/10.1145/3192366.3192412.
Uri Alon, Omer Levy, and Eran Yahav. code2seq: Generating sequences from struc-
tured representations of code. In International Conference on Learning Representa-
tions, 2019a. URL https://openreview.net/forum?id=H1gKYo09tX.
Uri Alon, Golan Pundak, and Tara N Sainath. Contextual speech recognition with
difficult negative training examples. In ICASSP 2019-2019 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), pages 6440–6444.
IEEE, 2019b.
Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. Code2vec: Learning
distributed representations of code. Proc. ACM Program. Lang., 3(POPL):40:1–
40:29, January 2019c. ISSN 2475-1421. doi: 10.1145/3290353. URL http:
//doi.acm.org/10.1145/3290353.
Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models of
code. In International Conference on Machine Learning, pages 245–256. PMLR,
2020.
Matthew Amodio, Swarat Chaudhuri, and Thomas Reps. Neural attribute machines
for program generation. arXiv preprint arXiv:1705.09231, 2017.
Christof Angermueller, Tanel Pärnamaa, Leopold Parts, and Oliver Stegle. Deep learn-
ing for computational biology. Molecular systems biology, 12(7):878, 2016.
Daniel Arp, Michael Spreitzenbarth, Malte Hübner, Hugo Gascon, Konrad Rieck, and
CERT Siemens. Drebin: Effective and explainable detection of android malware in
your pocket. 2014.
Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition
with visual attention. arXiv preprint arXiv:1412.7755, 2014.
Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra. Getafix: Learning
to fix bugs automatically. Proceedings of the ACM on Programming Languages, 3
(OOPSLA):1–27, 2019.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation
by jointly learning to align and translate. CoRR, abs/1409.0473, 2014. URL http:
//arxiv.org/abs/1409.0473.
Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua
Bengio. End-to-end attention-based large vocabulary speech recognition. In Acous-
tics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference
on, pages 4945–4949. IEEE, 2016.
Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel
Tarlow. Deepcoder: Learning to write programs. In ICLR, 2017.
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural
probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, March 2003. ISSN
1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944966.
Pavol Bielik, Veselin Raychev, and Martin T. Vechev. PHOG: probabilistic model for
code. In Proceedings of the 33nd International Conference on Machine Learning,
ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2933–2942, 2016.
URL http://jmlr.org/proceedings/papers/v48/bielik16.html.
Pavol Bielik, Veselin Raychev, and Martin Vechev. Program synthesis for character
level language modeling. In ICLR, 2017.
Marc Brockschmidt. Gnn-film: Graph neural networks with feature-wise linear mod-
ulation. Proceedings of the 36th International Conference on Machine Learning,
ICML, 2020.
Shaked Brody, Uri Alon, and Eran Yahav. A structural model for contextual code
changes. Proceedings of the ACM on Programming Languages, 4(OOPSLA):1–28,
2020.
Shaked Brody, Uri Alon, and Eran Yahav. How attentive are graph attention networks?
arXiv preprint arXiv:2105.14491, 2021.
Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Van-
dergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal
Processing Magazine, 34(4):18–42, 2017.
Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep
learning: Grids, groups, graphs, geodesics, and gauges, 2021.
Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. When
deep learning met code search. In Proceedings of the 2019 27th ACM Joint Meeting
on European Software Engineering Conference and Symposium on the Foundations
of Software Engineering, pages 964–974, 2019.
Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on
speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1–7.
IEEE, 2018.
Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. Measuring and relieving
the over-smoothing problem for graph neural networks from the topological view. In
Proceedings of the Thirty-Fourth Conference on Association for the Advancement of
Artificial Intelligence (AAAI), 2020a.
Jianfei Chen, Jun Zhu, and Le Song. Stochastic training of graph convolutional net-
works with variance reduction. In International Conference on Machine Learning,
pages 942–950, 2018a.
Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep
graph convolutional networks. In International Conference on Machine Learning,
pages 1725–1735. PMLR, 2020b.
Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program
translation. In Advances in Neural Information Processing Systems, pages 2547–2557,
2018b.
Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshy-
vanyk, and Martin Monperrus. Sequencer: Sequence-to-sequence learning for end-
to-end program repair. CoRR, abs/1901.01808, 2019. URL http://arxiv.org/abs/
1901.01808.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On
the properties of neural machine translation: Encoder–decoder approaches. In Pro-
ceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statis-
tical Translation, pages 103–111, 2014a.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations
using rnn encoder-decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014b.
Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua
Bengio. Attention-based models for speech recognition. In Advances in Neural In-
formation Processing Systems, pages 577–585, 2015.
Ronan Collobert and Jason Weston. A unified architecture for natural language pro-
cessing: Deep neural networks with multitask learning. In Proceedings of the 25th
International Conference on Machine Learning, ICML ’08, pages 160–167, New York,
NY, USA, 2008. ACM. ISBN 978-1-60558-205-4. doi: 10.1145/1390156.1390177. URL
http://doi.acm.org/10.1145/1390156.1390177.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu,
and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of
machine learning research, 12(ARTICLE):2493–2537, 2011.
Rhys Compton, Eibe Frank, Panos Patros, and Abigail Koay. Embedding java classes
with code2vec: Improvements from variable obfuscation. In Proceedings of the 17th
International Conference on Mining Software Repositories, pages 243–253, 2020.
Yaniv David, Uri Alon, and Eran Yahav. Neural reverse engineering of stripped binaries
using augmented control flow graphs. Proceedings of the ACM on Programming
Languages, 4(OOPSLA):1–28, 2020.
Elizabeth Dinella, Hanjun Dai, Ziyang Li, Mayur Naik, Le Song, and Ke Wang. Hop-
pity: Learning graph transformations to detect and fix bugs in programs. In Interna-
tional Conference on Learning Representations, 2020. URL https://openreview.
net/forum?id=SJeqs6EFvB.
Li Dong and Mirella Lapata. Coarse-to-fine decoding for neural semantic parsing.
In Proceedings of the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 731–742, 2018.
Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. Hotflip: White-box adver-
sarial examples for text classification. arXiv preprint arXiv:1712.06751, 2017.
Kevin Ellis, Maxwell Nye, Yewen Pu, Felix Sosa, Josh Tenenbaum, and Armando
Solar-Lezama. Write, execute, assess: Program synthesis with a repl. In
H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Gar-
nett, editors, Advances in Neural Information Processing Systems 32, pages
9165–9174. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/
9116-write-execute-assess-program-synthesis-with-a-repl.pdf.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun
Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model
for programming and natural languages. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing: Findings, pages 1536–1547,
2020.
Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch
Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds,
2019.
Ben Finkelshtein, Chaim Baskin, Evgenii Zheltonozhskii, and Uri Alon. Single-node
attack for fooling graph neural networks. arXiv preprint arXiv:2011.03574, 2020.
Martin Fowler and Kent Beck. Refactoring: improving the design of existing code.
Addison-Wesley Professional, 1999.
Shogo Fujita, Hidetaka Kamigaito, Hiroya Takamura, and Manabu Okumura. Pointing
to subwords for generating function names in source code. In Proceedings of the 28th
International Conference on Computational Linguistics, pages 316–327, 2020.
Alexander L Gaunt, Marc Brockschmidt, Nate Kushman, and Daniel Tarlow. Differ-
entiable programs with neural libraries. In Proceedings of the 34th International
Conference on Machine Learning-Volume 70, pages 1213–1222. JMLR. org, 2017.
Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E
Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th
International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR.
org, 2017.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feed-
forward neural networks. In Proceedings of the Thirteenth International Conference
on Artificial Intelligence and Statistics, pages 249–256, 2010.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-
scale sentiment classification: A deep learning approach. In Proceedings of the 28th
international conference on machine learning (ICML-11), pages 513–520, 2011.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Advances in neural information processing systems, pages 2672–2680, 2014a.
Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing
adversarial examples. arXiv preprint arXiv:1412.6572, 2014b.
Marco Gori, Gabriele Monfardini, and Franco Scarselli. A new model for learning in
graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural
Networks, 2005., volume 2, pages 729–734. IEEE, 2005.
Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick
McDaniel. Adversarial perturbations against deep neural networks for malware clas-
sification. arXiv preprint arXiv:1606.04435, 2016.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mecha-
nism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393, 2016.
Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. Program synthesis. Founda-
tions and Trends® in Programming Languages, 4(1-2):1–119, 2017.
Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie LIU, Long Zhou,
Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin
Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou.
GraphCodeBERT: Pre-training code representations with data flow. In International
Conference on Learning Representations, 2021. URL https://openreview.net/
forum?id=jLoC4ez43PZ.
Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on
large graphs. In Advances in neural information processing systems, pages 1024–1034,
2017.
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical
Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA,
2001.
Vincent J Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and David
Bieber. Global relational models of source code. In International conference on
learning representations, 2019.
Jordan Henkel, Shuvendu K Lahiri, Ben Liblit, and Thomas Reps. Code vectors: Un-
derstanding programs through embedded abstracted symbolic traces. In Proceedings
of the 2018 26th ACM Joint Meeting on European Software Engineering Conference
and Symposium on the Foundations of Software Engineering, pages 163–174, 2018.
Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu.
On the naturalness of software. In Proceedings of the 34th International Conference
on Software Engineering, ICSE ’12, pages 837–847, Piscataway, NJ, USA, 2012.
IEEE Press. ISBN 978-1-4673-1067-3. URL http://dl.acm.org/citation.cfm?
id=2337223.2337322.
Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep
Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al.
Deep neural networks for acoustic modeling in speech recognition: The shared views
of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput.,
9(8):1735–1780, November 1997. ISSN 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
URL http://dx.doi.org/10.1162/neco.1997.9.8.1735.
Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu,
Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine
learning on graphs. arXiv preprint arXiv:2005.00687, 2020.
Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. Deep code comment generation.
In Proceedings of the 26th Conference on Program Comprehension, pages 200–210.
ACM, 2018.
Binxuan Huang and Kathleen M Carley. Syntax-aware aspect level sentiment classi-
fication with graph attention networks. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International Joint
Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5472–5480,
2019.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Summarizing
source code using a neural attention model. In Proceedings of the 54th Annual
Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12,
2016, Berlin, Germany, Volume 1: Long Papers, 2016. URL http://aclweb.org/
anthology/P/P16/P16-1195.pdf.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping lan-
guage to code in programmatic context. arXiv preprint arXiv:1808.09588, 2018.
Srinivasan Iyer, Alvin Cheung, and Luke Zettlemoyer. Learning programmatic idioms
for scalable semantic parsing. arXiv preprint arXiv:1904.09086, 2019.
Chaitanya Joshi. Transformers are graph neural networks. The Gradient, 2020.
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. Learning and
evaluating contextual embedding of source code. In International Conference on
Machine Learning, pages 5110–5121. PMLR, 2020.
Hong Jin Kang, Tegawendé F Bissyandé, and David Lo. Assessing the generalizability
of code2vec token embeddings. In 2019 34th IEEE/ACM International Conference
on Automated Software Engineering (ASE), pages 1–12. IEEE, 2019.
Seohyun Kim, Jinman Zhao, Yuchi Tian, and Satish Chandra. Code prediction by
feeding trees to transformers. In 2021 IEEE/ACM 43rd International Conference on
Software Engineering (ICSE), pages 150–162. IEEE, 2021.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolu-
tional networks. In ICLR, 2017.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico,
Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris
Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source
toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting
of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–
180, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics. URL
http://dl.acm.org/citation.cfm?id=1557769.1557821.
Devin Kreuzer, Dominique Beaini, William L Hamilton, Vincent Létourneau, and Pru-
dencio Tossou. Rethinking graph transformers with spectral attention. arXiv preprint
arXiv:2106.03893, 2021.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. Advances in neural information processing sys-
tems, 25:1097–1105, 2012.
Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon
Whiteson. My body is a cage: the role of morphology in graph-based incompati-
ble control. In International Conference on Learning Representations, 2021. URL
https://openreview.net/forum?id=N3zUDGN5lO.
Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks.
In Advances in neural information processing systems, pages 539–547, 2012.
Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In ACL (2),
pages 302–308. Citeseer, 2014a.
Omer Levy and Yoav Goldberg. Linguistic regularities in sparse and explicit word
representations. In Proceedings of the eighteenth conference on computational natural
language learning, pages 171–180, 2014b.
Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation
extraction via reading comprehension. In Proceedings of the 21st Conference on
Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, Au-
gust 3-4, 2017, pages 333–342, 2017. doi: 10.18653/v1/K17-1034. URL https:
//doi.org/10.18653/v1/K17-1034.
Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional
networks for semi-supervised learning. In Thirty-Second AAAI Conference on Arti-
ficial Intelligence, 2018.
Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence
neural networks. In International Conference on Learning Representations, 2016.
Ben Liblit, Andrew Begel, and Eve Sweeser. Cognitive perspectives on the role of
naming in computer programs. In Proceedings of the 18th Annual Psychology of Pro-
gramming Workshop, Sussex, United Kingdom, sep 2006. Psychology of Programming
Interest Group.
Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský,
Fumin Wang, and Andrew Senior. Latent predictor networks for code generation.
In Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 599–609, Berlin, Germany, August 2016.
Association for Computational Linguistics. doi: 10.18653/v1/P16-1057. URL https:
//www.aclweb.org/anthology/P16-1057.
Jason Liu, Seohyun Kim, Vijayaraghavan Murali, Swarat Chaudhuri, and Satish Chan-
dra. Neural query expansion for code search. In Proceedings of the 3rd ACM SIG-
PLAN International Workshop on Machine Learning and Programming Languages,
MAPL 2019, pages 29–37, New York, NY, USA, 2019a. ACM. ISBN 978-1-4503-6719-
6. doi: 10.1145/3315508.3329975. URL http://doi.acm.org/10.1145/3315508.
3329975.
Kui Liu, Dongsun Kim, Tegawendé F Bissyandé, Taeyoung Kim, Kisub Kim, Anil
Koyuncu, Suntae Kim, and Yves Le Traon. Learning to spot and refactor inconsistent
method names. In 2019 IEEE/ACM 41st International Conference on Software
Engineering (ICSE), pages 1–12. IEEE, 2019b.
Shangqing Liu, Yu Chen, Xiaofei Xie, Jingkai Siow, and Yang Liu. Retrieval-augmented
generation for code summarization via hybrid gnn. In International Conference on
Learning Representations, 2021.
Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny,
Hitesh Sajnani, and Jan Vitek. DéjàVu: A map of code duplicates on GitHub. Proc.
ACM Program. Lang., 1(OOPSLA):84:1–84:28, October 2017. ISSN 2475-1421. doi:
10.1145/3133908. URL http://doi.acm.org/10.1145/3133908.
Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. A neural architecture for
generating natural language descriptions from source code changes. In Proceedings of
the 55th Annual Meeting of the Association for Computational Linguistics (Volume
2: Short Papers), pages 287–292. Association for Computational Linguistics, 2017.
doi: 10.18653/v1/P17-2045. URL http://www.aclweb.org/anthology/P17-2045.
Denis Lukovnikov and Asja Fischer. Gated relational graph attention networks, 2021.
URL https://openreview.net/forum?id=v-9E8egy_i.
Nianzu Ma, Sahisnu Mazumder, Hao Wang, and Bing Liu. Entity-aware dependency-
based deep graph attention network for comparative preference classification. In
Proceedings of the 58th Annual Meeting of the Association for Computational Lin-
guistics, pages 5782–5788, 2020.
Chris J. Maddison and Daniel Tarlow. Structured generative models of natural source
code. In Proceedings of the International Conference on Machine Learning - Volume
32, ICML’14, pages II–649–II–657. JMLR.org, 2014. URL http://dl.acm.org/
citation.cfm?id=3044805.3044965.
Alessio Micheli. Neural network for graphs: A contextual constructive approach. IEEE
Transactions on Neural Networks, 20(3):498–511, 2009.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space. CoRR, abs/1301.3781, 2013a. URL http:
//arxiv.org/abs/1301.3781.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed
representations of words and phrases and their compositionality. In Proceedings of the
26th International Conference on Neural Information Processing Systems, NIPS’13,
pages 3111–3119, USA, 2013b. Curran Associates Inc. URL http://dl.acm.org/
citation.cfm?id=2999792.2999959.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous
space word representations. In Proceedings of the 2013 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies,
pages 746–751, Atlanta, Georgia, June 2013c. Association for Computational Linguistics.
Alon Mishne, Sharon Shoham, and Eran Yahav. Typestate-based semantic code search
over partial programs. In Proceedings of the ACM International Conference on Object
Oriented Programming Systems Languages and Applications, OOPSLA ’12, pages
997–1016, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1561-6. doi: 10.1145/
2384616.2384689. URL http://doi.acm.org/10.1145/2384616.2384689.
Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent mod-
els of visual attention. In Proceedings of the 27th International Conference on Neural
Information Processing Systems, NIPS’14, pages 2204–2212, Cambridge, MA, USA,
2014. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969033.2969073.
Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda,
and Michael M Bronstein. Geometric deep learning on graphs and manifolds using
mixture model CNNs. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 5115–5124, 2017.
Christopher Morris, Matthias Fey, and Nils M Kriege. The power of the Weisfeiler-Leman
algorithm for machine learning with graphs. arXiv preprint arXiv:2105.05911, 2021.
Dana Movshovitz-Attias and William W Cohen. Natural language models for predicting
programming comments. In Proceedings of the 51st Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short Papers), pages 35–40, Sofia, Bulgaria,
August 2013. Association for Computational Linguistics.
Vijayaraghavan Murali, Swarat Chaudhuri, and Chris Jermaine. Bayesian sketch learn-
ing for program synthesis. CoRR, abs/1703.05698, 2018. URL http://arxiv.org/
abs/1703.05698.
Deepak Nathani, Jatin Chauhan, Charu Sharma, and Manohar Kaul. Learning
attention-based embeddings for relation prediction in knowledge graphs. In Proceed-
ings of the 57th Annual Meeting of the Association for Computational Linguistics,
pages 4710–4723, 2019.
Yurii E Nesterov. A method for solving the convex programming problem with convergence
rate O(1/k^2). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547,
1983.
Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive
power for node classification. In International Conference on Learning Representa-
tions, 2020. URL https://openreview.net/forum?id=S1ldO2EFPr.
Sheena Panthaplackel, Pengyu Nie, Milos Gligoric, Junyi Jessy Li, and Raymond
Mooney. Learning to update natural language comments based on code changes.
In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pages 1853–1868, 2020.
Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou,
and Pushmeet Kohli. Neuro-symbolic program synthesis. In ICLR, 2017.
Dinglan Peng, Shuxin Zheng, Yatao Li, Guolin Ke, Di He, and Tie-Yan Liu. How could
neural networks understand programs? In International Conference on Machine Learning.
PMLR, 2021.
Amir Pnueli and Roni Rosner. On the synthesis of a reactive module. In Proceed-
ings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming
languages, pages 179–190. ACM, 1989.
Oleksandr Polozov and Sumit Gulwani. FlashMeta: a framework for inductive program
synthesis. In ACM SIGPLAN Notices, volume 50, pages 107–126. ACM, 2015.
Danish Pruthi, Bhuwan Dhingra, and Zachary C Lipton. Combating adversarial mis-
spellings with robust word recognition. ACL, 2019.
Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie Tang.
DeepInf: Social influence prediction with deep learning. In Proceedings of the 24th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD’18), 2018.
Maxim Rabinovich, Mitchell Stern, and Dan Klein. Abstract syntax networks for code
generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), pages 1139–
1149. Association for Computational Linguistics, 2017. doi: 10.18653/v1/P17-1105.
URL http://www.aclweb.org/anthology/P17-1105.
Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilien-
feld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific
data, 1:140022, 2014.
Veselin Raychev, Martin Vechev, and Eran Yahav. Code completion with statistical
language models. SIGPLAN Not., 49(6):419–428, June 2014. ISSN 0362-1340. doi:
10.1145/2666356.2594321. URL http://doi.acm.org/10.1145/2666356.2594321.
Veselin Raychev, Martin Vechev, and Andreas Krause. Predicting program proper-
ties from “big code”. In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, POPL ’15, pages 111–124,
New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3300-9. doi: 10.1145/2676726.
2677009. URL http://doi.acm.org/10.1145/2676726.2677009.
Veselin Raychev, Pavol Bielik, and Martin Vechev. Probabilistic model for code with
decision trees. In Proceedings of the 2016 ACM SIGPLAN International Conference
on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA
2016, pages 731–747, New York, NY, USA, 2016a. ACM. ISBN 978-1-4503-4444-
9. doi: 10.1145/2983990.2984041. URL http://doi.acm.org/10.1145/2983990.
2984041.
Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. Learning programs
from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Sympo-
sium on Principles of Programming Languages, POPL ’16, pages 761–774, New York,
NY, USA, 2016b. ACM. ISBN 978-1-4503-3549-2. doi: 10.1145/2837614.2837671.
URL http://doi.acm.org/10.1145/2837614.2837671.
Andrew Rice, Edward Aftandilian, Ciera Jaspan, Emily Johnston, Michael Pradel, and
Yulissa Arroyo-Paredes. Detecting argument selection defects. Proceedings of the
ACM on Programming Languages, 1(OOPSLA):104, 2017.
Yu Rong, Wenbing Huang, Tingyang Xu, and Junzhou Huang. DropEdge: Towards
deep graph convolutional networks on node classification. In International Confer-
ence on Learning Representations, 2020. URL https://openreview.net/forum?
id=Hkx1qkrKPr.
Reuven Rubinstein. The cross-entropy method for combinatorial and continuous opti-
mization. Methodology and Computing in Applied Probability, 1(2):127–190, 1999.
Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model
for abstractive sentence summarization. In Proceedings of the 2015 Conference on
Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal,
September 17-21, 2015, pages 379–389, 2015. URL http://aclweb.org/anthology/
D/D15/D15-1044.pdf.
Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish
Chandra. Retrieval on source code: a neural code search. In Proceedings of the 2nd
ACM SIGPLAN International Workshop on Machine Learning and Programming
Languages, MAPL@PLDI 2018, Philadelphia, PA, USA, June 18-22, 2018, pages
31–41, 2018. doi: 10.1145/3211346.3211353. URL http://doi.acm.org/10.1145/
3211346.3211353.
Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec,
and Peter Battaglia. Learning to simulate complex physics with graph networks. In
International Conference on Machine Learning, pages 8459–8468. PMLR, 2020.
Joshua Saxe and Konstantin Berlin. Deep neural network based malware detection using
two dimensional binary program features. In 2015 10th International Conference on
Malicious and Unwanted Software (MALWARE), pages 11–20. IEEE, 2015.
Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele
Monfardini. The graph neural network model. IEEE Transactions on Neural Net-
works, 20(1):61–80, 2008.
Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov,
and Max Welling. Modeling relational data with graph convolutional networks. In
European Semantic Web Conference, pages 593–607. Springer, 2018.
Roei Schuster, Congzheng Song, Eran Tromer, and Vitaly Shmatikov. You autocom-
plete me: Poisoning vulnerabilities in neural code completion. In 30th USENIX
Security Symposium (USENIX Security 21), 2021.
Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and
Tina Eliassi-Rad. Collective classification in network data. AI magazine, 29(3):
93–93, 2008.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of
rare words with subword units. In Proceedings of the 54th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725,
Berlin, Germany, August 2016. Association for Computational Linguistics. URL
http://www.aclweb.org/anthology/P16-1162.
Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirec-
tional attention flow for machine comprehension. arXiv preprint arXiv:1611.01603,
2016.
Xujie Si, Yuan Yang, Hanjun Dai, Mayur Naik, and Le Song. Learning a meta-solver
for syntax-guided program synthesis. In International Conference on Learning Rep-
resentations, 2019. URL https://openreview.net/forum?id=Syl8Sn0cK7.
Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christopher D. Manning. Parsing Nat-
ural Scenes and Natural Language with Recursive Neural Networks. In Proceedings
of the 26th International Conference on Machine Learning (ICML), 2011.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15(1):1929–1958, 2014.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Advances in Neural Information Processing Systems, pages 3104–3112,
2014.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan,
Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv
preprint arXiv:1312.6199, 2013.
Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic rep-
resentations from tree-structured long short-term memory networks. In Proceedings
of the 53rd Annual Meeting of the Association for Computational Linguistics and
the 7th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), pages 1556–1566. Association for Computational Linguistics, 2015.
doi: 10.3115/v1/P15-1150. URL http://aclweb.org/anthology/P15-1150.
Armstrong A. Takang, Penny A. Grubb, and Robert D. Macredie. The effects of
comments and identifier names on program comprehensibility: an experimental in-
vestigation. J. Prog. Lang., 4(3):143–167, 1996. URL http://compscinet.dcs.kcl.
ac.uk/JP/jp040302.abs.html.
Rohan Taori, Amog Kamsetty, Brenton Chu, and Nikita Vemuri. Targeted adversarial
examples for black box audio systems. In 2019 IEEE Security and Privacy Workshops
(SPW), pages 15–20. IEEE, 2019.
Joseph Turian, Lev Ratinov, and Yoshua Bengio. Word representations: A simple
and general method for semi-supervised learning. In Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics, ACL ’10, pages 384–394,
Stroudsburg, PA, USA, 2010. Association for Computational Linguistics. URL http:
//dl.acm.org/citation.cfm?id=1858681.1858721.
Diego Valsesia, Giulia Fracastoro, and Enrico Magli. Don’t stack layers in graph neu-
ral networks, wire them randomly. In ICLR 2021 Workshop on Geometrical and
Topological Representation Learning, 2021.
Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. Neu-
ral program repair by jointly learning to localize and repair. In International Con-
ference on Learning Representations, 2019. URL https://openreview.net/forum?
id=ByloJ20qtm.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances
in Neural Information Processing Systems, pages 6000–6010, 2017.
Martin T. Vechev and Eran Yahav. Programming with “big code”. Foundations and
Trends in Programming Languages, 3(4):231–284, 2016. doi: 10.1561/2500000028.
URL https://doi.org/10.1561/2500000028.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò,
and Yoshua Bengio. Graph attention networks. In International Conference on
Learning Representations, 2018.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in
Neural Information Processing Systems, pages 2692–2700, 2015.
Richard J Waldinger and Richard CT Lee. Prow: A step toward automatic program
writing. In Proceedings of the 1st international joint conference on Artificial intelli-
gence, pages 241–252. Morgan Kaufmann Publishers Inc., 1969.
Eric Wallace, Mitchell Stern, and Dawn Song. Imitation attacks and defenses for black-
box machine translation systems. In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 5531–5546, 2020.
Guangtao Wang, Rex Ying, Jing Huang, and Jure Leskovec. Improving graph attention
networks with large margin-based constraints. arXiv preprint arXiv:1910.11945,
2019a.
Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou,
Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li,
and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package
for graph neural networks. arXiv preprint arXiv:1909.01315, 2019b.
Qinglong Wang, Wenbo Guo, Kaixuan Zhang, Alexander G Ororbia II, Xinyu Xing,
Xue Liu, and C Lee Giles. Adversary resistant deep neural networks with an applica-
tion to malware detection. In Proceedings of the 23rd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 1145–1153. ACM, 2017.
Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu.
Heterogeneous graph attention network. In The World Wide Web Conference, pages
2022–2032, 2019c.
Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Wein-
berger. Simplifying graph convolutional networks. In International conference on
machine learning, pages 6861–6871. PMLR, 2019.
Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S.
Yu. A comprehensive survey on graph neural networks. IEEE Transactions on
Neural Networks and Learning Systems, 2020.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan
Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image
caption generation with visual attention. In International Conference on Machine
Learning, pages 2048–2057, 2015.
Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph
neural networks? In International Conference on Learning Representations, 2019.
URL https://openreview.net/forum?id=ryGs6iA5Km.
Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, and Xinchao Wang. Distilling
knowledge from graph convolutional networks. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
Ziyu Yao, Frank F. Xu, Pengcheng Yin, Huan Sun, and Graham Neubig. Learn-
ing structural edits via incremental tree transformations. In International Confer-
ence on Learning Representations, 2021. URL https://openreview.net/forum?
id=v9hAX77--cZ.
Noam Yefet, Uri Alon, and Eran Yahav. Adversarial examples for models of code.
Proceedings of the ACM on Programming Languages, 4(OOPSLA):1–30, 2020.
Pengcheng Yin and Graham Neubig. A syntactic neural model for general-purpose
code generation. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 440–450. Association for
Computational Linguistics, 2017. doi: 10.18653/v1/P17-1041. URL http://www.
aclweb.org/anthology/P17-1041.
Pengcheng Yin and Graham Neubig. TRANX: A transition-based neural abstract syntax
parser for semantic parsing and code generation. In Proceedings of the 2018 Confer-
ence on Empirical Methods in Natural Language Processing: System Demonstrations,
pages 7–12, 2018.
Pengcheng Yin, Graham Neubig, Miltiadis Allamanis, Marc Brockschmidt, and Alexan-
der L. Gaunt. Learning to represent edits. In International Conference on Learning
Representations, 2019. URL https://openreview.net/forum?id=BJl6AjC5F7.
Halley Young, Osbert Bastani, and Mayur Naik. Learning neurosymbolic generative
models via program synthesis. In Kamalika Chaudhuri and Ruslan Salakhutdinov,
editors, Proceedings of the 36th International Conference on Machine Learning, vol-
ume 97 of Proceedings of Machine Learning Research, pages 7144–7153, Long Beach,
California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/
v97/young19a.html.
Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. TypeSQL: Knowledge-
based type-aware neural text-to-SQL generation. In Proceedings of the 2018 Con-
ference of the North American Chapter of the Association for Computational Lin-
guistics: Human Language Technologies, Volume 2 (Short Papers), pages 588–594,
New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi:
10.18653/v1/N18-2093. URL https://www.aclweb.org/anthology/N18-2093.
Xiaohan Yu, Quzhe Huang, Zheng Wang, Yansong Feng, and Dongyan Zhao. Towards
context-aware code comment generation. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing: Findings, pages 3938–3947,
2020.
Kai Zhang, Yaokang Zhu, Jun Wang, and Jie Zhang. Adaptive structural fingerprints
for graph attention networks. In International Conference on Learning Representa-
tions, 2020. URL https://openreview.net/forum?id=BJxWx0NYPr.
Lingxiao Zhao and Leman Akoglu. PairNorm: Tackling oversmoothing in GNNs. In Inter-
national Conference on Learning Representations, 2020. URL https://openreview.
net/forum?id=rkecl1rtwB.
Rui Zhao, David Bieber, Kevin Swersky, and Daniel Tarlow. Neural networks for
modeling source code edits, 2019.
Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, and Stephan Gün-
nemann. Language-agnostic representation learning of source code from structure
and context. In International Conference on Learning Representations, 2021. URL
https://openreview.net/forum?id=Xh5eMZVONGF.
The research was conducted under the supervision of Prof. Eran Yahav, in the Faculty of Computer Science.

Abstract

Over the past decade, the deep learning revolution, powered by artificial neural networks, has transformed a broad range of fields in computer science, such as computer vision, speech recognition, and natural language processing. In parallel, the number of publicly available codebases and open-source projects has grown dramatically, enabling the use of neural networks for a wide variety of tasks related to programming and writing code, a field that we call Programming Language Processing (PLP), by analogy with Natural Language Processing (NLP).

Nevertheless, the question of how to represent computer programs in learning algorithms and deep learning systems remains open. Programs clearly lack the direct, simple "matrix-like" representation that images, for example, have. Although a program can be represented as a sequence of "words" or a sequence of characters, like natural language text, programs are far more structured than free text, because they must obey a rigid and rich syntax defined by a context-free grammar. Moreover, every programming language has predefined semantics that describe what syntactically valid programs mean and what they do.

This thesis focuses on the following general problem: representing computer programs in learning algorithms and deep learning systems in a way that eases learning, while capturing as much information as possible from the input programs and allowing the learned model to remain as general as possible. The thesis presents the AST-paths approach, which represents programs using paths in the program's Abstract Syntax Tree (AST). This representation makes it possible to build neural models that are powerful and accurate, yet lightweight and trainable on massive amounts of data. Specifically, this thesis shows how such models can be trained on millions of examples, for tasks that include predicting properties of individual program elements, predicting properties of entire code snippets, predicting a natural language sentence for a given code snippet, and generating code completions. These models were released publicly as interactive demo websites, along with open-source implementations and data. Some of these models, such as code2vec and code2seq, are especially popular and are widely used in academia and industry.
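To make the AST-paths representation above concrete, the following is a minimal, illustrative sketch only, not the implementation used in the thesis: it parses a toy snippet with Python's standard ast module, treats identifiers and constants as leaves, and pairs every two leaves with the syntactic path that connects them through their lowest common ancestor. The helper names (leaves_with_paths, extract_leaf_paths) are hypothetical.

import ast
import itertools

def leaves_with_paths(node, path=()):
    """Yield (leaf_token, root-to-leaf tuple of AST node types) pairs."""
    path = path + (type(node).__name__,)
    if isinstance(node, ast.Name):        # identifiers are leaves in this sketch
        yield node.id, path
        return
    if isinstance(node, ast.Constant):    # literals are leaves in this sketch
        yield repr(node.value), path
        return
    for child in ast.iter_child_nodes(node):
        yield from leaves_with_paths(child, path)

def extract_leaf_paths(code):
    """Return (leaf1, connecting syntactic path, leaf2) triplets for a snippet."""
    leaves = list(leaves_with_paths(ast.parse(code)))
    triplets = []
    for (tok1, p1), (tok2, p2) in itertools.combinations(leaves, 2):
        # Skip the common prefix to find the lowest common ancestor (LCA),
        # then go up from leaf1 to the LCA and down from the LCA to leaf2.
        i = 0
        while i < min(len(p1), len(p2)) and p1[i] == p2[i]:
            i += 1
        connecting = list(reversed(p1[i - 1:])) + list(p2[i:])
        triplets.append((tok1, connecting, tok2))
    return triplets

# For "x = y + 1" this yields, among others:
# ('x', ['Name', 'Assign', 'BinOp', 'Name'], 'y')
print(extract_leaf_paths("x = y + 1"))

In the thesis's models, such path contexts (two terminal tokens plus the connecting path) are what gets embedded and aggregated, for example with attention in code2vec and with an encoder-decoder in code2seq.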
Finally, this thesis investigates the theoretical differences between different representations of programs. Specifically, research that began with a focus on programs led to the discovery of broader, inherent limitations of another popular representation: graph neural networks (GNNs). GNNs have become extremely popular over the last three years thanks to their generality and versatility; essentially, any data that can be represented as a graph, such as a program's syntax tree, a social network, or a molecule, can easily be fed into a GNN. Training a GNN is relatively fast, and its asymptotic cost is linear in the number of nodes and edges. However, when we empirically compared models based on AST paths with models based on GNNs, we found that GNNs struggle to learn long-range patterns in the data. That is, if the "correct" prediction depends on patterns that involve nodes that are distant from one another in the graph, the network fails to learn them and instead overfits to short-range patterns. This phenomenon was surprising, because AST paths usually have no difficulty at all learning long-range patterns. Surveying the literature, we found that ever since GNNs were introduced in 2005, their difficulty in learning long-range patterns had been observed and was known as a problem in training these networks; however, this problem had never been properly explained or studied. In this thesis, we present the "bottleneck" problem of graph neural networks, propose a new explanation for this phenomenon, show that it occurs even in existing models, and show that different types of GNNs are harmed by the bottleneck to different degrees. This finding sheds light on, and completes the picture of, the various popular representations of programs, along with their weaknesses and strengths.
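The bottleneck phenomenon described in the abstract can be restated as a rough, back-of-the-envelope argument. The sketch below is illustrative only; the symbols b (branching factor), K (number of message-passing layers), and d (hidden dimension) are notation introduced here, not taken from the abstract.

% After K rounds of message passing, a node v's hidden state may need to
% summarize its entire K-hop neighborhood. In a graph with branching factor b,
% that neighborhood contains on the order of b^K nodes, yet all of their
% information must be compressed into a single fixed-size vector:
\[
    \underbrace{|\mathcal{N}_K(v)|}_{\mathcal{O}(b^K)\ \text{nodes}}
    \;\longrightarrow\;
    \underbrace{h_v^{(K)} \in \mathbb{R}^{d}}_{\text{fixed-size vector}} .
\]
% Since d is fixed while b^K grows exponentially with the range K of the
% required pattern, long-range signals are "over-squashed", which is consistent
% with the observation that GNNs overfit to short-range patterns instead.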
Acknowledgements

I am deeply grateful to the Technion, to the Henry and Marilyn Taub Faculty of Computer Science, and to the Irwin and Joan Jacobs Fellowship for their generous funding of this research.