Detecting Code Clones With Graph Neural Network and Flow-Augmented Abstract Syntax Tree
Abstract—Code clones are pairs of code fragments that are semantically similar but may be syntactically similar or different. Detecting code clones can help reduce the cost of software maintenance and prevent bugs. Numerous approaches to detecting code clones have been proposed, but most of them focus on syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection and automatically learn latent semantic features from data. In particular, to leverage grammar information, several approaches use abstract syntax trees (AST) as input and have achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still cannot fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper we build a graph representation of programs called the flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. We then apply two different types of graph neural networks (GNN) to FA-AST to measure the similarity of code pairs. To the best of our knowledge, we are the first to apply graph neural networks to code clone detection.
We apply our FA-AST and graph neural networks to two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both the Google Code Jam and BigCloneBench tasks.

Index Terms—clone detection, data flow, control flow, deep learning, graph neural network

I. INTRODUCTION

Code clone detection aims to measure the similarity between two code snippets. Commonly, there are two kinds of similarity within code clones: syntactic similarity and semantic similarity. Syntactic similarity is often introduced when programmers copy a code fragment and paste it to another location, while semantic similarity occurs when developers implement a functionality that is identical or similar to an existing code fragment.
To better study the effectiveness of clone detectors on different types of code similarity, researchers have systematically categorized code clones into multiple classes. One common taxonomy, proposed by [1], groups code clones into four types. The first three types can be regarded as syntactic similarity, while type-4 clones can be seen as semantic similarity. As type-4 clones include clones that are highly dissimilar syntactically, they are the hardest clone type to detect for most clone detection approaches. Syntactic code similarity has already been well studied, while in recent years researchers have started to focus on detecting semantic code similarity. Along with the advances of deep neural networks, several deep learning-based approaches have tried to capture semantic similarity by learning from data. Most of these approaches include two steps: use neural networks to compute a vector representation for each code fragment, then compute the similarity between the two code vector representations to detect clones. To leverage the explicit structural information in programs, these approaches often use the abstract syntax tree (AST) as the input of their models [2]–[4]. A typical example of these approaches is CDLH [3], which encodes code fragments by directly applying Tree-LSTM [5] to binarized ASTs. Although an AST can reflect rich structural information about program syntax, it does not contain semantic information such as control flow and data flow.
To exploit explicit control flow information, some researchers use control flow graphs (CFG) to detect code clones. For example, DeepSim [6] extracts semantic features from CFGs to build semantic matrices for clone detection. But a CFG still lacks data flow information. Furthermore, most CFGs only contain control flows between code blocks and exclude the low-level syntactic structure within code blocks. Another drawback of CFGs is that in some programming languages, CFGs are much harder to obtain than ASTs.
In this paper, we aim to build a graph representation of programs that reflects both syntactic and semantic information starting from ASTs. To detect code clones with the graphs we build, we propose a new approach that uses graph neural networks (GNN). Our approach mainly includes three steps: first, create a graph representation for each program; second, compute vector representations for code fragments using graph neural networks; third, measure code similarity by measuring the similarity of the code vector representations. To fully leverage the control flow and data flow information in programs, we construct an AST-based graph representation of programs which we call the flow-augmented AST (FA-AST). FA-AST is constructed by adding various types of edges, representing different types of control and data flow, to ASTs. After we build the FA-AST for code fragments, we apply two different GNN models, the gated graph neural network (GGNN) [7] and the graph matching network (GMN) [8], to FA-ASTs to learn feature vectors for code fragments. The first model computes vector representations for different code fragments separately, while the latter jointly computes vector representations for a code pair. After we obtain the vector representations of a code pair, we can determine whether the two code fragments form a clone by measuring the similarity between them.
In this paper, we build FA-AST for Java programs and evaluate FA-AST and graph neural networks on two code clone datasets: the Google Code Jam dataset collected by [6] and the widely used clone detection benchmark BigCloneBench [9]. The results show that our approach outperforms most existing clone detection approaches, especially several AST-based deep learning approaches including RtvNN [2], CDLH [3], and ASTNN [4].
The main contributions of this paper are as follows:
1) To the best of our knowledge, we are the first to apply graph neural networks to code clone detection. We adopt two different types of graph neural networks and analyze the difference between their performances.
2) We design a novel graph representation, FA-AST, for Java programs that leverages both the control and the data flow of programs. Our graph representation is purely AST-based and can easily be extended to other programming languages.
3) We evaluate our approach on two datasets: Google Code Jam and BigCloneBench. Our approach performs comparably to state-of-the-art approaches on BigCloneBench and outperforms state-of-the-art approaches on Google Code Jam.
The remainder of this paper is structured as follows: Section II introduces the background knowledge. Section III defines the problem we aim to solve. We present the details of our approach in Section IV. We evaluate our approach and analyze its performance in Section V. In Section VI we discuss some findings from our experiments and some possible future improvements to our work. Section VII lists the related work. Finally, in Section VIII, we conclude the paper.
II. BACKGROUND

In this section we introduce the background knowledge of code clone detection and graph neural networks (GNNs).

A. Code Clone Detection

According to [1], code clones can be categorized into the following four types:
Type-1 (T1): Syntactically identical code fragments, except for differences in white space and comments.
Type-2 (T2): In addition to Type-1 clone differences, syntactically identical code fragments, except for differences in identifier names and literal values.
Type-3 (T3): In addition to Type-1 and Type-2 clone differences, syntactically similar code fragments that differ at the statement level. These fragments can have statements added, modified, and/or removed with respect to each other.
Type-4 (T4): Syntactically dissimilar code fragments that still share the same functionality. For example, one code fragment implementing bubble sort and another code fragment implementing quick sort are considered a Type-4 clone pair.
As the boundary between type-3 and type-4 clones is often ambiguous, in benchmarks like BigCloneBench [9] researchers further divide these two clone types into three categories: strongly type-3, moderately type-3, and weakly type-3/type-4. Each category is harder to detect than the former one. In this paper, we refer to weakly type-3/type-4 clones as semantic clones.

B. Graph Neural Networks

Traditional deep neural network models such as the convolutional neural network (CNN) and the recurrent neural network (RNN) have shown success on Euclidean data like images and on sequential data like natural language. Different from images and natural language, graph data is much more complex. An image can be seen as a set of pixels and a text as a sequence of words, while in a graph there are at least two types of information: nodes and the relationships between nodes (edges). It is therefore important to build neural network architectures designed for graphs.
The concept of GNN was first proposed in [10]. The goal of a GNN is to learn a state embedding for each node that contains the information of its neighborhood, and sometimes also an embedding of the whole graph. Most existing GNN models fit into the general framework of message passing neural networks (MPNN) [11], whose overall architecture is depicted in Figure 1. In the MPNN framework, a neural network model consists of two phases: message passing and readout. Suppose we have a graph G = (V, E), where V is the set of vertices and E is the set of edges. Each node in G retains a state h, and each edge is assigned an embedding e. The message passing step updates the hidden states of nodes by:

m_{j→i} = f_message(h_i^(t), h_j^(t), e_{ij}),  ∀(i, j) ∈ E        (1)

m_i = f_aggregate({m_{j→i} | ∀(i, j) ∈ E})                         (2)

h_i^(t+1) = f_update(h_i^(t), m_i)                                 (3)

where f_message is the message function and f_update is the vertex update function. f_aggregate is an aggregation function, for which we often use a direct sum. Equations (1) and (2) can be seen as an aggregator in which each node gathers information from its neighbors. Equation (3) is an updater that updates the hidden states of all nodes [12]. During the message passing phase, the above updating process runs for T steps. In the readout phase, the model computes a vector representation for the whole graph with the readout function f_R:

h_G = f_R({h_i^(T) | i ∈ V})                                       (4)
Fig. 1. Example of a GNN model applied on a directed graph.
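To make the propagation phase concrete, the following is a minimal PyTorch sketch of Equations (1)-(4). The choice of an MLP for f_message, a direct sum for f_aggregate, a GRU cell for f_update, and a plain sum for the readout mirrors the description above, but the exact layer shapes are illustrative assumptions rather than the configuration used later in this paper.

import torch
import torch.nn as nn

class MPNN(nn.Module):
    # Minimal sketch of Eqs. (1)-(4): an MLP as f_message, a direct sum as
    # f_aggregate, a GRU cell as f_update, and a plain sum as the readout f_R.
    def __init__(self, dim):
        super().__init__()
        self.f_message = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.f_update = nn.GRUCell(dim, dim)

    def forward(self, h, edge_index, edge_attr, T=4):
        # h: [num_nodes, dim] node states; edge_index: [2, num_edges] with
        # row 0 = source node j, row 1 = target node i; edge_attr: [num_edges, dim].
        src, dst = edge_index
        for _ in range(T):
            m = self.f_message(torch.cat([h[dst], h[src], edge_attr], dim=-1))  # Eq. (1)
            agg = torch.zeros_like(h).index_add_(0, dst, m)                     # Eq. (2)
            h = self.f_update(agg, h)                                           # Eq. (3)
        return h.sum(dim=0)                                                     # Eq. (4)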
III. PROBLEM DEFINITION

Given two code fragments C_i and C_j, we set a constant label y_{ij} for them to indicate whether (C_i, C_j) is a clone pair or not. Then, for a set of code fragment pairs with known clone labels, we can build a training set D = {(C_i, C_j, y_{ij})}. We aim to train a deep learning model that learns a function φ mapping a code fragment C to a feature vector v, so that for any pair of code fragments (C_i, C_j) the similarity score s_{ij} computed from their vectors is as close as possible to the corresponding label y_{ij}.
In the inference phase, in order to determine whether a pair of code fragments (C_i, C_j) is a clone pair, we set a threshold value σ between true and false clone pairs. (C_i, C_j) is predicted to be a true clone pair if its similarity score satisfies s_{ij} ≥ σ, and a false clone pair otherwise.
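A minimal sketch of this inference rule, assuming a hypothetical encoder phi that maps a code fragment to its feature vector and using the cosine similarity adopted later in Section IV-A; the threshold value 0.5 is only a placeholder.

import torch.nn.functional as F

def is_clone_pair(phi, code_i, code_j, sigma=0.5):
    # phi is assumed to return a 1-D torch feature vector for a code fragment.
    v_i, v_j = phi(code_i), phi(code_j)
    s_ij = F.cosine_similarity(v_i, v_j, dim=-1)
    # (C_i, C_j) is reported as a true clone pair iff s_ij >= sigma.
    return bool(s_ij >= sigma)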
IV. PROPOSED APPROACH

In this section, we first give an overview of our proposed approach based on program graphs and graph neural networks. Next, we describe the process of building our graph representation, the flow-augmented abstract syntax tree (FA-AST), for code fragments. We then explain the technical details of our neural network models: gated graph neural networks (GGNN) and graph matching networks (GMN).
A. Approach Overview

Figure 2 shows an overview of our approach. To process a code fragment, we first parse it into its AST. Next, we build the FA-AST graph representation for the code fragment by adding edges representing control and data flow to its AST. We then initialize the embeddings of FA-AST nodes and edges before jointly feeding a pair of vectorized FA-ASTs into a graph matching network. The graph matching network computes vector representations for all nodes in both FA-ASTs. To detect code clones, we use a readout function to pool the node vectors into a graph-level vector representation for each FA-AST separately. After we obtain the vector representations of both programs, we use the cosine similarity of these two vectors to measure their similarity. If the similarity score is larger than the threshold σ, we consider the two code fragments a clone pair. We apply the mean squared error (MSE) loss to train our model:

(1/d) Σ_{i=1}^{d} (y_i − ŷ_i)^2                                     (5)

Here d is the dimension of y_i and ŷ_i. Since in our clone detection task the prediction is a single real value, namely the similarity between two code snippets, d = 1 in our model.
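A sketch of one training step for this pipeline, assuming the model returns the two graph-level vectors and that clone labels are encoded as +1 for true pairs and -1 for false pairs (the label encoding is not stated explicitly in the paper):

import torch.nn.functional as F

def train_step(model, optimizer, graph_i, graph_j, y):
    # y is a float tensor holding the clone label of this pair.
    optimizer.zero_grad()
    v_i, v_j = model(graph_i, graph_j)            # graph-level vectors of both FA-ASTs
    s_ij = F.cosine_similarity(v_i, v_j, dim=-1)  # similarity in [-1, 1]
    loss = F.mse_loss(s_ij, y)                    # Eq. (5) with d = 1
    loss.backward()
    optimizer.step()
    return loss.item()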
Fig. 2. The overview of our approach.

[Figure: an IfStatement node with Condition, ThenStatement, and ElseStatement children, together with the added CondTrue and CondFalse edges.]

B. Building Graphs Based on Abstract Syntax Trees

Program ASTs only represent the syntactic structure of code, so we add different types of edges to ASTs. Although programs in some languages can be converted into control flow graphs over assembly code or some intermediate representation (IR), we do not directly use these control flow graphs for the following reasons:
1) For a single program, the number of edges in a control flow graph is often far smaller than in an AST. For graph neural networks, fewer edges mean less information passing between nodes, and the node states are updated less.
2) Most nodes in a control flow graph are statement expressions rather than single tokens. If we embed those nodes using simple approaches like bag of words, we lose the semantic information within them. Another reasonable approach is to build a sub-graph for each statement, but this would significantly increase the computational cost of our neural network model.
To extract ASTs from Java programs, we use the Python package javalang (https://github.com/c2nes/javalang). Below we use the node types, values, and production rules in javalang to describe Java ASTs.
To build the graph representation for programs, we construct the following types of edges based on abstract syntax trees:
Child: connect a non-terminal AST node to each of its children according to the AST.
Parent: connect a non-root AST node to its parent node.
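The sketch below shows one way to obtain these Child and Parent edges with javalang; the node numbering scheme and the handling of javalang's child lists are our own simplifications, not the paper's exact implementation.

import javalang

def build_ast_edges(source):
    # Parse a Java compilation unit and collect Child / Parent edges as
    # (parent_index, child_index) and (child_index, parent_index) pairs.
    tree = javalang.parse.parse(source)
    node_ids, child_edges, parent_edges = {}, [], []

    def visit(node):
        node_ids.setdefault(id(node), len(node_ids))
        for child in node.children:
            # A child slot may hold a nested list, a plain value, or an AST node.
            for c in (child if isinstance(child, list) else [child]):
                if isinstance(c, javalang.ast.Node):
                    node_ids.setdefault(id(c), len(node_ids))
                    child_edges.append((node_ids[id(node)], node_ids[id(c)]))
                    parent_edges.append((node_ids[id(c)], node_ids[id(node)]))
                    visit(c)

    visit(tree)
    return node_ids, child_edges, parent_edges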
3. For statements: A For node has two children: a ForControl node and a body node. Similar to the While nodes, we add a ForExec edge and a ForNext edge between the two children of For nodes.
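A sketch of how the ForExec and ForNext edges could be collected with javalang, reusing the node_ids mapping from the previous sketch. Treating ForStatement.control and ForStatement.body as the two children, and the chosen edge directions, are our reading of the description above; the analogous handling of While nodes is omitted.

import javalang

def add_for_edges(tree, node_ids):
    # node_ids is the id-to-index mapping produced by the previous sketch.
    for_exec, for_next = [], []
    for _, node in tree.filter(javalang.tree.ForStatement):
        control, body = node.control, node.body
        if control is not None and body is not None:
            # ForExec: from the loop control to the loop body;
            # ForNext: from the loop body back to the loop control.
            for_exec.append((node_ids[id(control)], node_ids[id(body)]))
            for_next.append((node_ids[id(body)], node_ids[id(control)]))
    return for_exec, for_next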
1) Graph Embedding Model: In this model, we use a gated graph neural network (GGNN) [7] to learn the embeddings for graphs. GGNN follows the GNN framework we introduced in Section II. For GGNN, we use a multilayer perceptron (MLP) as f_message and a gated recurrent unit (GRU) [13] as f_update. Namely, the propagation process of GGNN is:

m_{j→i} = MLP(h_i^(t), h_j^(t), e_{ij}),  ∀(i, j) ∈ E_1 ∪ E_2

m_i = Σ_j m_{j→i}                                                  (6)

h_i^(t+1) = GRU(h_i^(t), m_i)

For the readout function f_G, we follow the function proposed in [7]:

h_G = MLP_G(Σ_{i∈V} σ(MLP_gate(h_i^(T))) ⊙ MLP(h_i^(T)))           (7)
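A compact PyTorch sketch of the GGNN propagation in Equation (6) and the gated readout in Equation (7). The elementwise product in the readout follows [7]; the exact MLP shapes are illustrative assumptions.

import torch
import torch.nn as nn

class GGNNEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.gru = nn.GRUCell(dim, dim)
        self.gate_mlp = nn.Linear(dim, dim)
        self.node_mlp = nn.Linear(dim, dim)
        self.graph_mlp = nn.Linear(dim, dim)

    def propagate(self, h, edge_index, edge_attr):
        src, dst = edge_index
        m = self.msg_mlp(torch.cat([h[dst], h[src], edge_attr], dim=-1))
        agg = torch.zeros_like(h).index_add_(0, dst, m)   # sum aggregation
        return self.gru(agg, h)                           # Eq. (6)

    def readout(self, h):
        # Eq. (7): gated sum of node states, followed by a graph-level MLP.
        gated = torch.sigmoid(self.gate_mlp(h)) * self.node_mlp(h)
        return self.graph_mlp(gated.sum(dim=0))

    def forward(self, h, edge_index, edge_attr, T=4):
        for _ in range(T):
            h = self.propagate(h, edge_index, edge_attr)
        return self.readout(h)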
2) Graph Matching Networks: The graph matching network (GMN) framework defined by [8] can jointly learn embeddings for a pair of graphs. Apart from the traditional GNN propagation process, GMN additionally computes a cross-graph attention between the nodes of the two graphs. Figure 7 illustrates the difference between our GGNN graph embedding approach and GMN. Although a GMN model takes two graphs as input at a time, it can still produce a separate embedding for each input graph. The complete propagation process is as follows:

m_{j→i} = f_message(h_i^(t), h_j^(t), e_{ij}),  ∀(i, j) ∈ E_1 ∪ E_2             (8)

µ_{j→i} = f_match(h_i^(t), h_j^(t)),  ∀i ∈ V_1, j ∈ V_2 or i ∈ V_2, j ∈ V_1     (9)

h_i^(t+1) = f_node(h_i^(t), Σ_j m_{j→i}, Σ_{j′} µ_{j′→i})                       (10)

h_{G_1} = f_G({h_i^(T)}_{i∈V_1})                                                (11)

h_{G_2} = f_G({h_i^(T)}_{i∈V_2})                                                (12)

where f_message is an MLP and f_match is an attention mechanism defined by:

a_{j→i} = exp(s_h(h_i^(t), h_j^(t))) / Σ_{j′} exp(s_h(h_i^(t), h_{j′}^(t)))     (13)

µ_{j→i} = a_{j→i}(h_i^(t) − h_j^(t))                                            (14)

Here s_h is a vector similarity function, for which we use the dot product in this paper. f_node is a GRU cell whose hidden state at time step t is h_i^(t) and whose input is the concatenation of Σ_j m_{j→i} and Σ_{j′} µ_{j′→i}. Similar to GGNN, we also use the readout function in Equation (7) for GMN. With these settings, our GMN model is similar to the GGNN model; the only difference is that GMN adds a cross-graph matching vector to the input of the updater GRU.
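The cross-graph part of Equations (9), (13), and (14) reduces to an attention between the node states of the two graphs. Below is a minimal sketch with a dot-product score s_h; summing Equation (14) over j simplifies to h_i minus the attention-weighted states of the other graph, which is exactly what the code computes.

import torch

def cross_graph_matching(h1, h2):
    # h1: [n1, dim] node states of graph 1; h2: [n2, dim] node states of graph 2.
    # a_{j->i} (Eq. 13) with a dot-product s_h, normalized over the other graph's nodes.
    a12 = torch.softmax(h1 @ h2.t(), dim=1)
    a21 = torch.softmax(h2 @ h1.t(), dim=1)
    # Summed Eq. (14): h_i minus the attention-weighted sum of the other graph's states.
    mu1 = h1 - a12 @ h2
    mu2 = h2 - a21 @ h1
    return mu1, mu2

In the full model, these matching vectors are concatenated with the summed within-graph messages and fed into the updater GRU, as described above.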
Fig. 7. Basic architecture of the GGNN embedding model (left) and GMN model (right).

V. EXPERIMENTS

A. Experiment Data

We evaluate our approach on two datasets: Google Code Jam (GCJ) [14] and BigCloneBench [9]. Google Code Jam [14] is an online programming competition held annually by Google. In this paper, we use the version of the dataset collected by [6]. The GCJ dataset consists of 1,669 Java files from 12 different competition problems; each file is a Java class. As inspected in [6], very few files within a competition problem are syntactically similar, so we can assume that most code pairs from the same problem are type-4 clones.
The second dataset, BigCloneBench, is a widely used large code clone benchmark that contains over 6,000,000 true clone pairs and 260,000 false clone pairs from 10 different functionalities. In BigCloneBench, each code fragment is a Java method. As the boundary between type-3 and type-4 clones is often ambiguous, type-3/type-4 clone pairs in BigCloneBench are further divided by a statement-level similarity score within [0, 1): strongly type-3 (ST3) with similarity in [0.7, 1.0), moderately type-3 (MT3) with similarity in [0.5, 0.7), and weakly type-3/type-4 (WT3/T4) with similarity in [0.0, 0.5). Table I summarizes the distribution of all clone types in BigCloneBench. Since the majority of clone pairs are weakly type-3/type-4 clones, BigCloneBench is quite appropriate for evaluating semantic clone detection. In our experiment, we follow the settings of the CDLH paper [3], which discards code fragments without any tagged true or false clone pairs, leaving 9,134 code fragments.
Table II shows the basic information about the two datasets in our experiment. Generally, since BigCloneBench contains far more code fragments than GCJ, its vocabulary size is significantly larger. On the other hand, code fragments in GCJ are usually longer than in BigCloneBench. This is mainly because each BigCloneBench code fragment implements only a single functionality, such as bubble sort or file copy, while in GCJ programmers are often required to solve a more complicated algorithmic problem. For both datasets, false clone pairs greatly outnumber true clone pairs, especially in BigCloneBench. By evaluating on these two datasets, we can assess the generalizability of our approach over code clones of different domains and granularities.

TABLE I
PERCENTAGE OF DIFFERENT CLONE TYPES IN BIGCLONEBENCH

Clone Type      T1     T2     ST3    MT3    WT3/T4
Percentage (%)  0.455  0.058  0.243  1.014  98.23

TABLE II
BASIC INFORMATION OF THE TWO DATASETS

                         GCJ        BigCloneBench
Code fragments           1,669      9,134
Average lines of code    58.79      32.89
Average number of nodes  396.98     241.46
Vocabulary size          8,033      77,535
True clone pairs         275,570    336,498
False clone pairs        1,116,376  2,080,088
To assess the importance of control flow, we further analyze the frequency of different control flow constructs in our datasets. Table III shows the number of occurrences of different control flow nodes in the two datasets. For both datasets, BlockStatement is the most frequent control flow node, since sequential execution exists in nearly all programs. WhileStatement is the fewest or second-fewest among the four control flow types we include in FA-AST. An interesting difference between the two datasets is that ForStatement appears far more often in GCJ than in BigCloneBench. This is probably because in programming contests, programmers sometimes need to implement complicated algorithms that contain many for loops. Since DoStatement and SwitchStatement appear much less often than the other control flow constructs, we decide not to add edges for these two, as discussed in Section IV. In general, as code fragments in GCJ are usually longer than in BigCloneBench, control flow nodes appear more often in the GCJ dataset.

TABLE III
AVERAGE OCCURRENCES OF CONTROL FLOW NODES IN OUR DATASETS

                 GCJ    BigCloneBench
IfStatement      3.114  2.724
WhileStatement   0.437  0.441
ForStatement     4.064  0.422
BlockStatement   7.049  3.274
DoStatement      0.006  0.014
SwitchStatement  0.013  0.012

B. Experiment Settings

We compare our approach with the following clone detection approaches:
DECKARD [15] is an AST-based clone detector which generates characteristic vectors for each AST subtree using predefined rules and then clusters them to detect code clones.
RtvNN [2] first uses an RNN language model to learn embeddings for program tokens, then uses a recursive autoencoder [16] to learn representations for ASTs. In order to represent ASTs with recursive neural networks, the ASTs are turned into full binary trees.
CDLH [3] uses a binary Tree-LSTM [5] to encode ASTs, and a hash function to optimize the Hamming distance between the vector representations of AST pairs.
ASTNN [4] uses recursive neural networks to encode the AST subtrees of statements, then feeds the encodings of all statement trees into an RNN to compute the vector representation for a program. The similarity score between code pairs is measured by the L1 norm.
We implement the GGNN and GMN models with PyTorch (https://pytorch.org) and its extension library PyTorch Geometric [17]. We set the dimension of the graph neural network layers and the token embeddings to 100. In both experiments, we run the GNN propagation for 4 steps. Token embeddings are initialized randomly and trained together with the model. We train our neural networks using the Adam optimizer [18] with a learning rate of 0.001 and a batch size of 32. The threshold between true and false clones is tuned on the validation set. We run all experiments on a server with a 32-core 2.1 GHz CPU and an NVIDIA Titan Xp GPU. For all neural network baselines (RtvNN, CDLH, and ASTNN), we also set the hidden layer size to 100, matching our approach. For the remaining settings of these baselines, we follow the descriptions in their original papers or released code.
For both datasets, we split the data into training, validation, and test sets by 8:1:1. For the BigCloneBench dataset, we use the same 9,134 code fragments as [3]. Since for both datasets the number of false clone pairs is far larger than the number of true clone pairs, we apply data balancing to the training sets: for the training sets of both tasks, we randomly downsample the false clone pairs to make the ratio between true and false pairs 1:1.
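A sketch of the data balancing and optimizer setup described above; the random seed and the representation of pairs as lists of (C_i, C_j, y_ij) tuples are our own assumptions.

import random
import torch

def balance_training_pairs(true_pairs, false_pairs, seed=0):
    # Randomly downsample false clone pairs so the training ratio of true to
    # false pairs is 1:1; validation and test sets are left untouched.
    rng = random.Random(seed)
    kept = rng.sample(false_pairs, k=min(len(true_pairs), len(false_pairs)))
    return true_pairs + kept

def make_optimizer(model):
    # Settings reported above: Adam with a learning rate of 0.001; the hidden
    # and embedding sizes are 100 and the batch size is 32.
    return torch.optim.Adam(model.parameters(), lr=0.001)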
C. Experiment Results

TABLE IV
RESULTS ON THE GCJ DATASET

Model        Precision  Recall  F1
Deckard      0.45       0.44    0.44
RtvNN        0.20       0.90    0.33
ASTNN        0.98       0.93    0.95
FA-AST+GGNN  0.96       1.0     0.97
FA-AST+GMN   0.99       0.97    0.98

1) Results on Google Code Jam: Table IV shows the precision, recall, and F1 values of our approach on the GCJ dataset.
We observe that our approach far outperforms all baselines in precision, recall, and F1. By exploiting both the syntactic information in the AST and the semantic information of control and data flow, our approach (FA-AST+GMN) improves the F1-score on GCJ from 0.95 (ASTNN) to 0.98.
TABLE V
RESULTS ON THE BIGCLONEBENCH DATASET

Fig. 9. The corresponding locations in source code for the node pairs with the ten highest attention scores in a clone pair in BigCloneBench.
We further draw the ROC curves of our approaches and compare them with the best baseline, ASTNN. The ROC curves and ROC_AUC scores for our approaches and ASTNN on BigCloneBench are shown in Figure 10. Similar to the results shown in Table V, FA-AST achieves the highest ROC_AUC score among the three approaches, and the result of ASTNN is a little higher than that of FA-AST+GGNN but lower than that of FA-AST+GMN.

A. The Advantage of Our Approach Over ASTNN

We believe our approach outperforms previous deep learning-based clone detection approaches for the following reasons:
1) Our graph representation of programs, FA-AST, contains both the syntactic information of ASTs and the control and data flows of CFGs, while previous approaches are based purely on either ASTs or CFGs.
2) We treat a code fragment as a whole graph and directly input the graph into our neural network, while some previous approaches do not keep all structural information. For example, CDLH converts ASTs into binary trees before feeding them into a Tree-LSTM. DeepSim [6] builds a semantic matrix for