
Detecting Code Clones with Graph Neural Network

and Flow-Augmented Abstract Syntax Tree


Wenhan Wang∗† , Ge Li∗†§ , Bo Ma∗† , Xin Xia‡ , Zhi Jin∗†§
∗ Key laboratory of High Confidence Software Technologies (Peking University), Ministry of Education
† Institute of Software, EECS, Peking University, Beijing, China

{wwhjacob, lige, 1700012844, zhijin}@pku.edu.cn


‡ Faculty of Information Technology, Monash University, Melbourne, Australia

[email protected]
§ Corresponding Authors

arXiv:2002.08653v1 [cs.SE] 20 Feb 2020

Abstract—Code clones are semantically similar code fragment pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches to detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. Especially, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still cannot fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) on FA-AST to measure the similarity of code pairs. To the best of our knowledge, we are the first to apply graph neural networks to the domain of code clone detection.

We apply our FA-AST and graph neural networks on two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both the Google Code Jam and BigCloneBench tasks.

Index Terms—clone detection, data flow, control flow, deep learning, graph neural network

I. INTRODUCTION

Code clone detection aims to measure the similarity between two code snippets. Commonly, there are two kinds of similarities within code clones: syntactic similarity and semantic similarity. Syntactic similarity is often introduced when programmers copy a code fragment and paste it to another location, while semantic similarity occurs when developers try to implement a certain functionality which is identical or similar to an existing code fragment.

To better study the effectiveness of clone detectors on different types of code similarities, researchers started to systematically categorize code clones into multiple classes. One common taxonomy proposed by [1] groups code clones into four types. The first three types of clones can be regarded as syntactic similarity, while type-4 clones can be seen as semantic similarity. As type-4 clones include clones that are highly dissimilar syntactically, it is the hardest clone type to detect for most clone detection approaches. Code syntactic similarity has already been well studied, while in recent years researchers have started to focus on detecting code semantic similarity. Along with the advances of deep neural networks, several deep learning-based approaches have tried to capture semantic similarities through learning from data. Most of these approaches include two steps: use neural networks to calculate a vector representation for each code fragment, then calculate the similarity between the two code vector representations to detect clones. To leverage the explicit structural information in programs, these approaches often use the abstract syntax tree (AST) as the input of their models [2]–[4]. A typical example of these approaches is CDLH [3], which encodes code fragments by directly applying Tree-LSTM [5] on binarized ASTs. Although the AST can reflect rich structural information about program syntax, it does not contain some semantic information such as control flow and data flow.

To exploit explicit control flow information, some researchers use control flow graphs (CFG) to detect code clones. For example, DeepSim [6] extracts semantic features from CFGs to build semantic matrices for clone detection. But the CFG still lacks data flow information. Furthermore, most CFGs only contain control flows between code blocks and exclude the low-level syntactic structure within code blocks. Another drawback of CFGs is that in some programming languages, CFGs are much harder to obtain than ASTs.

In this paper, we aim to build a graph representation of programs which can reflect both syntactic and semantic information from ASTs. In order to detect code clones with the graphs we have built, we propose a new approach that uses graph neural networks (GNN) to detect code clones. Our approach mainly includes three steps: first, create a graph representation for programs; second, calculate vector representations for code fragments using graph neural networks; third, measure code similarity by measuring the similarity of the code vector representations. To fully leverage the control flow and data flow information in programs, we construct an AST-based graph representation of programs
which we call the flow-augmented AST (FA-AST). FA-AST is constructed by adding various types of edges representing different types of control and data flow to ASTs. After we build the FA-AST for code fragments, we apply two different GNN models, the gated graph neural network (GGNN) [7] and the graph matching network (GMN) [8], on FA-ASTs to learn feature vectors for code fragments. The first model separately computes vector representations for different code fragments, while the latter jointly computes vector representations for a code pair. After we get the vector representations for a code pair, by measuring the similarity between them, we can determine whether these two code fragments form a clone.

In this paper, we build FA-AST for Java programs and evaluate FA-AST and graph neural networks on two code clone datasets: the Google Code Jam dataset collected by [6] and the widely used clone detection benchmark BigCloneBench [9]. The results show that our approach outperforms most existing clone detection approaches, especially several AST-based deep learning approaches including RtvNN [2], CDLH [3] and ASTNN [4].

The main contributions of this paper are as follows:

1) To the best of our knowledge, we are the first to apply graph neural networks to code clone detection. We adopt two different types of graph neural networks and analyze the difference between their performances.

2) We design a novel graph representation form, FA-AST, for Java programs that leverages both control and data flow of programs. Our graph representation is purely AST-based and can easily be extended to other programming languages.

3) We evaluate our approach on two datasets: Google Code Jam and BigCloneBench. Our approach performs comparably to state-of-the-art approaches on BigCloneBench and outperforms state-of-the-art approaches on Google Code Jam.

The remainder of this paper is structured as follows: Section II introduces the background knowledge. Section III defines the problem we aim to solve. We present the details of our approach in Section IV. We evaluate our approach and analyze its performance in Section V. In Section VI we discuss some findings in our experiments and some possible future improvements to our work. Section VII lists the related work. Finally, in Section VIII, we conclude our work.

II. BACKGROUND

In this section we introduce the background knowledge of code clone detection and graph neural networks (GNNs).

A. Code Clone Detection

According to [1], code clones can be categorized into the following four types:

Type-1 (T1): Syntactically identical code fragments, except for differences in white space and comments.

Type-2 (T2): In addition to Type-1 clone differences, syntactically identical code fragments, except for differences in identifier names and literal values.

Type-3 (T3): In addition to Type-1 and Type-2 clone differences, syntactically similar code fragments that differ at the statement level. These fragments can have statements added, modified, and/or removed with respect to each other.

Type-4 (T4): Syntactically dissimilar code fragments that still share the same functionality. For example, one code fragment implementing bubble sort and another code fragment implementing quick sort are considered a Type-4 code clone pair.

As the boundary between type-3 and type-4 clones is often ambiguous, in benchmarks like BigCloneBench [9] researchers further divide these two clone types into three categories: strongly type-3, moderately type-3, and weakly type-3/type-4. Each category is harder to detect than the former one. In this paper, we refer to weakly type-3/type-4 clones as semantic clones.

B. Graph Neural Networks

Traditional deep neural network models like the convolutional neural network (CNN) and the recurrent neural network (RNN) have shown success on Euclidean data like images and sequential data like natural language. Different from images and natural language, graph data is much more complex. An image can be seen as a set of pixels and a text as a sequence of words, while in a graph there are at least two types of information: the nodes and the relationships between nodes (edges). So it is important to build novel neural network architectures for graphs.

The concept of GNN was first proposed in [10]. The target of a GNN is to learn a state embedding for each node which contains the information of its neighborhood, and sometimes to learn the embedding of a whole graph. Most existing GNN models fit into the general framework of message passing neural networks (MPNN) [11], whose overall architecture is depicted in Figure 1. In the MPNN framework, a neural network model consists of two phases: message passing and readout. Suppose we have a graph G = (V, E) where V is the set of vertices and E is the set of edges. Each node in G retains a state h, and each edge is assigned an embedding e. The message passing step updates the hidden states of nodes by:

    m_{j→i}^{(t)} = f_{message}(h_i^{(t)}, h_j^{(t)}, e_{ij}),  ∀(i, j) ∈ E    (1)

    m_i = f_{aggregate}({m_{j→i} | ∀(i, j) ∈ E})    (2)

    h_i^{(t+1)} = f_{update}(h_i^{(t)}, m_i)    (3)

where f_{message} is the message function and f_{update} is the vertex update function. f_{aggregate} is an aggregation function, for which we often use the direct sum. Equations (1) and (2) can be seen as an aggregator in which each node gathers information from its neighbors. Equation (3) is an updater that updates the hidden states of all nodes [12]. During the message passing phase, the above updating process runs for T steps. In the readout phase, the model computes a vector representation for the whole graph with the readout function f_R by:

    h_G = f_R({h_i^{(T)} | i ∈ V})    (4)
Fig. 1. Example of a GNN model applied on a directed graph.
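To make the two MPNN phases concrete, here is a minimal sketch of one propagation step and the readout, assuming dense node-state tensors and caller-supplied f_message, f_update, and f_R modules; the function names and the concatenation-based update below are our illustrative choices, not details from the paper.

import torch

def mpnn_step(h, edges, edge_emb, f_message, f_update):
    """One message-passing step over node states h: Equations (1)-(3)."""
    msgs = torch.zeros_like(h)
    for (i, j), e_ij in zip(edges, edge_emb):
        # Eq. (1): message along edge (i, j), sent from node j to node i
        m = f_message(torch.cat([h[i], h[j], e_ij], dim=-1))
        # Eq. (2): aggregate incoming messages by direct sum
        msgs[i] = msgs[i] + m
    # Eq. (3): update every node state from its aggregated messages
    return f_update(torch.cat([h, msgs], dim=-1))

def readout(h_final, f_R):
    """Eq. (4): pool the final node states into one graph-level vector."""
    return f_R(h_final.sum(dim=0))

Running the step for T iterations and then calling readout yields h_G for one graph.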

III. PROBLEM DEFINITION

Given two code fragments C_i and C_j, we set a constant label y_ij for them to indicate whether (C_i, C_j) is a clone pair or not. Then, for a set of code fragment pairs with known clone labels, we can build a training set D = {(C_i, C_j, y_ij)}. We aim to train a deep learning model to learn a function φ that maps a code fragment C to a feature vector v, so that for any pair of code fragments (C_i, C_j), the similarity score s_ij computed from φ(C_i) and φ(C_j) is as close as possible to the corresponding label y_ij. In the inference phase, in order to determine whether a pair of code fragments (C_i, C_j) is a clone pair, we set a threshold value σ between true and false clone pairs: (C_i, C_j) is a true clone pair if their similarity score s_ij ≥ σ, and a false clone pair otherwise.

IV. PROPOSED APPROACH

In this section, we first give an overview of our proposed approach based on program graphs and graph neural networks. Next, we describe the process of building our graph representation, the flow-augmented abstract syntax tree (FA-AST), for code fragments. We then explain the technical details of our neural network models: gated graph neural networks (GGNN) and graph matching networks (GMN).

A. Approach Overview

Figure 2 shows an overview of our approach. To process a code fragment, we first parse it into its AST. Next, we build a graph representation, FA-AST, for the code fragment by adding edges representing control and data flow to its AST. Then we initialize the embeddings of FA-AST nodes and edges before jointly feeding a pair of vectorized FA-ASTs into a graph matching network. The graph matching network then computes vector representations for all nodes in both FA-ASTs. To detect code clones, we use a readout function to pool the vectors of nodes into a graph-level vector representation for each FA-AST separately. After we get the vector representations of both programs, we use the cosine similarity of these two vectors to measure their similarity. If the similarity score is larger than the threshold σ, we consider the two code fragments a clone pair. We apply the mean squared error (MSE) loss to train our model:

    (1/d) Σ_{i=1}^{d} (y_i − ŷ_i)^2    (5)

Here d is the dimension of y_i and ŷ_i. Since in our clone detection task the prediction is a single real value, the similarity between two code snippets, in our model d = 1.

Fig. 2. The overview of our approach.
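A rough sketch of this pairwise objective and the thresholded inference, assuming two already-computed graph-level vectors v_i and v_j; the default value of sigma below is a placeholder (in our experiments the threshold is tuned on the validation set):

import torch.nn.functional as F

def clone_loss(v_i, v_j, y_ij):
    """Eq. (5) with d = 1: MSE between the cosine similarity of two
    graph-level vectors and the clone label y_ij in {-1.0, +1.0}."""
    s_ij = F.cosine_similarity(v_i, v_j, dim=-1)
    return F.mse_loss(s_ij, y_ij)

def predict_clone(v_i, v_j, sigma=0.5):
    """Inference: report a clone pair when the similarity reaches sigma."""
    return F.cosine_similarity(v_i, v_j, dim=-1) >= sigma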
B. Building Graphs Based on Abstract Syntax Trees

Program ASTs only represent the syntactic structure of code, so we add different types of edges to ASTs. Although programs in some languages can be converted into control flow graphs in the form of assembly language or some intermediate representations (IR), we do not directly use these control flow graphs for the following reasons:

1) For a single program, the number of edges in a control flow graph is often far smaller than in an AST. For graph neural networks, fewer edges mean less information passing between nodes, and the states of nodes are updated less.

2) Most nodes in a control flow graph are statement expressions rather than single tokens. If we embed those nodes using simple approaches like bag of words, we will lose the semantic information within these nodes. Another reasonable approach is to build a sub-graph for each statement, but this will significantly increase the computational cost of our neural network model.

To extract ASTs from Java programs, we use the Python package javalang (https://github.com/c2nes/javalang). Below we use the node types, values, and production rules in javalang to describe Java ASTs.

To build the graph representation for programs, we construct the following types of edges based on abstract syntax trees:

Child: connect a non-terminal AST node to each of its children according to the AST.

Parent: connect a non-root AST node to its parent node.
NextSib: connect a node to its next sibling (from left to right). Because graph neural networks do not consider the order of nodes, it is necessary to provide the order of children to our neural network model.

NextToken: connect a terminal node to the next terminal node. In ASTs, terminal nodes refer to the identifier tokens in program source code, so a NextToken edge connects an identifier token to the next token in the corresponding source code.

NextUse: a NextUse edge connects a node of a variable use to its next appearance. NextUse edges can exploit useful data flow information from ASTs.

Apart from the above edge types, we add several types of edges to represent the control flow of programs. In this paper we focus on the following basic control flow types: sequential execution, If statements, and While and For loops. Since other control flow structures like DoWhile statements and Case blocks appear much less often in programs and are not supported by some programming languages (e.g., Python), we omit control flow edges for them. We describe the details of the control flow edges in FA-AST as follows (a code sketch of the whole edge-construction process follows this list):

1. If statements: In an AST, an IfStatement node contains two or three children. The first child is the If condition. The second (and third) child is the If body when the condition is true (or false). As shown in Figure 3, we add a CondTrue edge from the condition node to the ThenStatement node and a CondFalse edge from the condition node to the ElseStatement node.
Fig. 3. Control flow edges for If statements.
2. While statements: A While node has two children: a condition node and a body node. We connect a WhileExec edge from the condition node to the body node, and a WhileNext edge from the body node to the condition node to simulate the execution process of loops.

Fig. 4. Control flow edges for While statements.

3. For statements: A For node has two children: a ForControl node and a body node. Similar to the While nodes, we add a ForExec edge and a ForNext edge between the two children of For nodes.

Fig. 5. Control flow edges for For statements.

4. Sequential execution: in Java, the sequential execution of statements exists in code blocks such as method bodies or loop bodies. A BlockStatement node is the root of a sequence of statement AST subtrees which are executed sequentially. Different from the control flow nodes we mentioned before, a BlockStatement node can have an arbitrary number of children, so we add a NextStmt edge from the root of each statement subtree to its next sibling.

Fig. 6. Control flow edges for sequences of statements.

Finally, to increase the frequency of message passing, for each edge type without a backward counterpart (e.g., CondTrue and NextStmt), we add an additional backward edge.
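Below is a minimal sketch, under the javalang node types named above, of how these edge sets could be collected for one parsed code fragment. It covers Child/Parent, NextSib, and the four control-flow edge types; NextToken, NextUse, and the remaining backward edges are omitted for brevity, and the helper names are ours, not javalang's.

import javalang
from javalang.tree import (Node, IfStatement, WhileStatement,
                           ForStatement, BlockStatement)

def child_nodes(node):
    """AST children of a javalang node, flattening list-valued attributes."""
    out = []
    for c in node.children:
        if isinstance(c, Node):
            out.append(c)
        elif isinstance(c, (list, tuple)):
            out.extend(x for x in c if isinstance(x, Node))
    return out

def fa_ast_edges(root):
    """Collect (edge_type, src, dst) triples over the AST rooted at root."""
    edges, stack = [], [root]
    while stack:
        node = stack.pop()
        kids = child_nodes(node)
        for k, child in enumerate(kids):
            edges.append(("Child", node, child))
            edges.append(("Parent", child, node))  # backward edge of Child
            if k + 1 < len(kids):
                edges.append(("NextSib", child, kids[k + 1]))
        if isinstance(node, IfStatement):
            edges.append(("CondTrue", node.condition, node.then_statement))
            if node.else_statement is not None:
                edges.append(("CondFalse", node.condition, node.else_statement))
        elif isinstance(node, WhileStatement):
            edges.append(("WhileExec", node.condition, node.body))
            edges.append(("WhileNext", node.body, node.condition))
        elif isinstance(node, ForStatement):
            edges.append(("ForExec", node.control, node.body))
            edges.append(("ForNext", node.body, node.control))
        elif isinstance(node, BlockStatement):
            stmts = node.statements or []
            for a, b in zip(stmts, stmts[1:]):
                edges.append(("NextStmt", a, b))
        stack.extend(kids)
    return edges

tree = javalang.parse.parse("class A { void m() { while (true) { m(); } } }")
print(len(fa_ast_edges(tree)))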
C. Neural Network Model for Modeling Code Pairs

In this paper, we use two different types of graph neural networks: a traditional GNN for graph embeddings and a graph matching network [8], which jointly models two graphs simultaneously.

1) Graph Embedding Model: In this model, we use a gated graph neural network (GGNN) [7] to learn the embeddings for graphs. GGNN follows the GNN framework we introduced in Section II. For GGNN, we use a multilayer perceptron (MLP) as f_{message} and a gated recurrent unit (GRU) [13] as f_{update}. Namely, the propagation process of GGNN is:

    m_{j→i} = MLP(h_i^{(t)}, h_j^{(t)}, e_{ij}),  ∀(i, j) ∈ E_1 ∪ E_2
    m_i = Σ_j m_{j→i}                                                (6)
    h_i^{(t+1)} = GRU(h_i^{(t)}, m_i)

For the readout function f_G, we follow the function proposed in [7]:

    h_G = MLP_G( Σ_{i∈V} σ(MLP_{gate}(h_i^{(T)})) ⊙ MLP(h_i^{(T)}) )    (7)

2) Graph Matching Networks: The graph matching network (GMN) framework defined by [8] can jointly learn embeddings for a pair of graphs. Apart from the traditional GNN propagation process, GMN additionally computes a cross-graph attention between nodes from the two graphs. Figure 7 illustrates the difference between our GGNN graph embedding approach and GMN. Although a GMN model takes two graphs as input at a time, it can still produce a separate embedding for each input graph. The complete propagation process is as follows:

    m_{j→i} = f_{message}(h_i^{(t)}, h_j^{(t)}, e_{ij}),  ∀(i, j) ∈ E_1 ∪ E_2    (8)

    μ_{j→i} = f_{match}(h_i^{(t)}, h_j^{(t)}),  ∀i ∈ V_1, j ∈ V_2 or i ∈ V_2, j ∈ V_1    (9)

    h_i^{(t+1)} = f_{node}(h_i^{(t)}, Σ_j m_{j→i}, Σ_{j′} μ_{j′→i})    (10)

    h_{G_1} = f_G({h_i^{(T)}}_{i∈V_1})    (11)

    h_{G_2} = f_G({h_i^{(T)}}_{i∈V_2})    (12)

where f_{message} is an MLP and f_{match} is an attention mechanism defined by:

    a_{j→i} = exp(s_h(h_i^{(t)}, h_j^{(t)})) / Σ_{j′} exp(s_h(h_i^{(t)}, h_{j′}^{(t)}))    (13)

    μ_{j→i} = a_{j→i} (h_i^{(t)} − h_j^{(t)})    (14)

Here s_h is a vector similarity function, for which we use the dot product in our paper. f_{node} is a GRU cell whose current hidden state at timestep t is h_i^{(t)} and whose input is the concatenation of Σ_j m_{j→i} and Σ_{j′} μ_{j′→i}. Similar to GGNN, we also use the readout function in Equation (7) for GMN. With these settings, our GMN model is similar to the GGNN model, and the only difference is that GMN adds a cross-graph matching vector to the input of the updater GRU.

Fig. 7. Basic architecture of the GGNN embedding model (left) and GMN model (right).
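For reference, below are compact PyTorch sketches of the two propagation schemes, written against the settings reported in Section V (dimension 100, T = 4 propagation steps). The edge-type embedding table, the ReLU message MLP, and the number of edge types are our assumptions for illustration, not details fixed by the paper. First, GGNN propagation and the gated readout of Equations (6) and (7):

import torch
import torch.nn as nn

class GGNNEncoder(nn.Module):
    """Sketch of GGNN propagation (Eq. 6) and the gated readout (Eq. 7)."""
    def __init__(self, dim=100, steps=4, num_edge_types=16):
        super().__init__()
        self.edge_emb = nn.Embedding(num_edge_types, dim)  # e_ij (assumed learned)
        self.message = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.update = nn.GRUCell(dim, dim)
        self.gate = nn.Linear(dim, dim)       # MLP_gate in Eq. (7)
        self.transform = nn.Linear(dim, dim)  # inner MLP in Eq. (7)
        self.out = nn.Linear(dim, dim)        # MLP_G in Eq. (7)
        self.steps = steps

    def forward(self, h, edge_index, edge_type):
        src, dst = edge_index  # each edge passes a message from src to dst
        for _ in range(self.steps):
            e = self.edge_emb(edge_type)
            m = self.message(torch.cat([h[dst], h[src], e], dim=-1))  # Eq. (6)
            agg = torch.zeros_like(h).index_add_(0, dst, m)           # sum over j
            h = self.update(agg, h)                                   # GRU update
        gated = torch.sigmoid(self.gate(h)) * self.transform(h)       # Eq. (7)
        return self.out(gated.sum(dim=0))

For GMN, the extra cross-graph term of Equations (13) and (14) collapses to a simple closed form under the dot-product s_h, because the attention weights for each node sum to one:

import torch

def cross_graph_match(h1, h2):
    """Eqs. (13)-(14): for each node i of graph 1, return
    mu_i = sum_j a_{j->i} (h1_i - h2_j) = h1_i - sum_j a_{j->i} h2_j."""
    a = torch.softmax(h1 @ h2.t(), dim=1)  # s_h = dot product, rows sum to 1
    return h1 - a @ h2                     # attention-weighted difference

In the GMN updater, this matching vector is concatenated with the aggregated within-graph messages to form the GRU input; calling the function with the arguments swapped yields the matching vectors for the second graph.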

V. EXPERIMENTS

A. Experiment Data

We evaluate our approach on two datasets: Google Code Jam (GCJ) [14] and BigCloneBench [9]. Google Code Jam [14] is an online programming competition held annually by Google. In this paper, we use the version of the dataset collected by [6]. The GCJ dataset consists of 1,669 Java files from 12 different competition problems. Each file is a Java class. As [6] has inspected, very few files within a competition problem are syntactically similar, so we can assume that most code pairs from the same problem are type-4 clones.

The second dataset, BigCloneBench, is a widely used large code clone benchmark that contains over 6,000,000 true clone pairs and 260,000 false clone pairs from 10 different functionalities. In BigCloneBench, each code fragment is a Java method. As the boundary between type-3 and type-4 clones is often ambiguous, type-3/type-4 clone pairs in BigCloneBench are further divided by a statement-level similarity score within [0, 1): strongly type-3 (ST3) with similarity in [0.7, 1.0), moderately type-3 (MT3) with similarity in [0.5, 0.7), and weakly type-3/type-4 (WT3/T4) with similarity in [0.0, 0.5). Table I summarizes the distribution of all clone types in BigCloneBench. Since the majority of code clone pairs are weakly type-3/type-4 clones, BigCloneBench is quite appropriate for evaluating semantic clone detection. In our experiment, we follow the settings in the CDLH paper [3], which discards code fragments without any tagged true or false clone pairs, leaving 9,134 code fragments.

TABLE I
PERCENTAGE OF DIFFERENT CLONE TYPES IN BIGCLONEBENCH

Clone Type     T1      T2      ST3     MT3     WT3/T4
Percentage(%)  0.455   0.058   0.243   1.014   98.23

Table II shows the basic information about the two datasets in our experiment. Generally, since BigCloneBench contains far more code fragments than GCJ, its vocabulary size is significantly larger. On the other hand, code fragments in GCJ are usually longer than in BigCloneBench. This is mainly because each BigCloneBench code fragment only implements a single functionality like bubble sort or file copy, while in GCJ programmers are often required to solve a more complicated algorithmic problem. For both datasets, false clone pairs far outnumber true clone pairs, especially in the BigCloneBench dataset. By evaluating on these two datasets, we can assess the generalizability of our approach over code clones in different domains and granularities.

TABLE II
BASIC INFORMATION OF THE TWO DATASETS

                         GCJ        BigCloneBench
Code fragments           1,669      9,134
Average lines of code    58.79      32.89
Average number of nodes  396.98     241.46
Vocabulary size          8,033      77,535
True clone pairs         275,570    336,498
False clone pairs        1,116,376  2,080,088

To examine the importance of control flow, we further analyze the frequency of different control flows in our datasets. Table III shows the number of occurrences of different control flow nodes in the two datasets. For both datasets, BlockStatement is the most frequent control flow, since sequential execution widely exists in nearly all programs. WhileStatement is the fewest or second-fewest among the four control flow types we use in FA-AST. An interesting difference between the two datasets is that ForStatement appears far more often in GCJ than in BigCloneBench. This is probably because in programming contests, programmers sometimes need to implement complicated algorithms that contain a lot of For loops. Since DoStatement and SwitchStatement appear much less often than the other control flows, we decided not to add edges for these two control flows, as noted in Section IV. In general, as code fragments in GCJ are usually longer than in BigCloneBench, control flow nodes appear more often in the GCJ dataset.

TABLE III
AVERAGE OCCURRENCES OF CONTROL FLOW NODES IN OUR DATASETS

                 GCJ    BigCloneBench
IfStatement      3.114  2.724
WhileStatement   0.437  0.441
ForStatement     4.064  0.422
BlockStatement   7.049  3.274
DoStatement      0.006  0.014
SwitchStatement  0.013  0.012

B. Experiment Settings

We compare our approach with the following clone detection approaches:

DECKARD [15] is an AST-based clone detector which generates characteristic vectors for each AST subtree using predefined rules and then clusters them to detect code clones.

RtvNN [2] first uses an RNN language model to learn the embeddings for program tokens, then uses a recursive autoencoder [16] to learn representations for ASTs. In order to represent ASTs by recursive neural networks, the ASTs are turned into full binary trees.

CDLH [3] uses a binary Tree-LSTM [5] to encode ASTs, and a hash function to optimize the distance between the vector representations of AST pairs by Hamming distance.

ASTNN [4] uses recursive neural networks to encode AST subtrees for statements, then feeds the encodings of all statement trees into an RNN to compute the vector representation for a program. The similarity score between code pairs is measured by the L1 norm.

We implement the GGNN and GMN models with PyTorch (https://pytorch.org) and its extension library PyTorch Geometric [17]. We set the dimension of graph neural network layers and token embeddings to 100. In both experiments, we run the GNN propagation for 4 steps. Token embeddings are initialized randomly and trained together with the model. We train our neural networks using the Adam optimizer [18] with a learning rate of 0.001. We set the batch size to 32. The threshold between true and false clones is tuned on the validation set. We run all experiments on a server with 32 cores of 2.1GHz CPU and an NVIDIA Titan Xp GPU. Similar to our approaches, for all neural network baselines (RtvNN, CDLH, and ASTNN) we also set the hidden layer size to 100. For the remaining settings of these baselines, we follow the descriptions in their original papers or released code.

For both datasets, we split the data into training, validation, and test sets by 8:1:1. For the BigCloneBench dataset, we use the same 9,134 code fragments as [3]. As for both datasets the number of false clone pairs is far larger than that of true clone pairs, we apply data balancing on the training sets: for the training sets of both tasks, we randomly downsample the false clone pairs to make the ratio between true and false pairs 1:1.
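A small sketch of this balancing step; the ±1.0 labels match the cosine-similarity target range, and the pair representation and fixed seed are placeholder choices of ours:

import random

def balance_training_pairs(true_pairs, false_pairs, seed=0):
    """Downsample false clone pairs to a 1:1 ratio with true pairs
    (applied to the training split only; validation/test stay as-is)."""
    rng = random.Random(seed)
    sampled = rng.sample(false_pairs, len(true_pairs))
    data = [(pair, 1.0) for pair in true_pairs] + [(pair, -1.0) for pair in sampled]
    rng.shuffle(data)
    return data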
C. Experiment Results

1) Results on Google Code Jam: Table IV shows the precision, recall, and F1 value of our approach on the GCJ dataset.

TABLE IV
RESULTS ON THE GCJ DATASET

Model        Precision  Recall  F1
Deckard      0.45       0.44    0.44
RtvNN        0.20       0.90    0.33
ASTNN        0.98       0.93    0.95
FA-AST+GGNN  0.96       1.0     0.97
FA-AST+GMN   0.99       0.97    0.98
We observe that our approach far outperforms all baselines in precision, recall, and F1. By exploiting both the syntactic information in the AST and the semantic information of control and data flow, our approach (FA-AST+GMN) improves the F1-score on GCJ from 0.95 (ASTNN) to 0.98.

TABLE V
RESULTS ON THE BIGCLONEBENCH DATASET

Model        Precision  Recall  F1
Deckard      0.93       0.02    0.03
RtvNN        0.95       0.01    0.01
CDLH         0.92       0.74    0.82
ASTNN        0.92       0.94    0.93
FA-AST+GGNN  0.85       0.90    0.88
FA-AST+GMN   0.96       0.94    0.95

2) Results on BigCloneBench: Table V shows the results on BigCloneBench. Our approach achieves much higher recall (0.94) and F1 (0.95) than most baselines. Notably, our approach outperforms ASTNN in precision and F1.

On both tasks, the F1 of the GMN model outperforms that of the GGNN model, confirming our assumption that adding cross-graph attention to the GNN propagation process can enhance the power of the model to capture code similarities. Another noticeable phenomenon is that, compared to the GMN models, the recall of the GGNN models is often higher than their precision scores.

To further analyze the behavioral difference between GMN and GGNN on clone detection, we study how different clone metrics change when we adjust the threshold similarity score between true and false clone pairs. Figure 8 shows the precision, recall, and F1 on the BigCloneBench test set when we gradually change the threshold similarity score from -1 to 1. Although GGNN achieves recall similar to GMN, its precision is lower than GMN's, especially when the threshold is low. As a result, GGNN only achieves high F1 values in a small interval (0.5, 0.75), while GMN reaches a near-best F1 in a large interval (-0.5, 0.75). A small change of the threshold value may significantly affect the result of GGNN models, while GMN performs more stably. After inspecting the output similarity scores of both models, we found that for a large part of the false clone pairs, the outputs of GGNN are closer to 0 than to the ground-truth label -1. This indicates that GGNN cannot effectively distinguish dissimilar code fragments, which fits the fact that GGNN achieves recall values higher than precision on both datasets. In practice, the data distributions of the validation set and the test set can be largely different, so a threshold tuned on the validation set may not suit the test set. Compared to GGNN, we therefore believe GMN is more robust to the variation of the validation set.

Fig. 8. Precision, recall and F1 curve when changing the threshold value for BigCloneBench.

Additionally, we make a visualization study of the attention scores of the GMN model. In GMN, the cross-graph attention scores (a_{j→i} in Equation (13)) measure the similarity of two nodes from two different code fragments. After training, we visualize these attention scores, assuming that AST node pairs with similar semantics and context should have larger attention values than other node pairs. In Figure 9, we select the ten highest attention values a_{j→i} within the whole attention matrix and display their corresponding locations in the source code text. The dotted lines connect the node pairs with the highest similarities between the upper code fragment and the lower one. We can observe that GMN can learn cross-graph similarities at both a low level (like the method name close()) and a higher level (like a While code block). Although most attention links are intuitive to human readers, there still exist a few links that cannot be well explained (like several edges from the upper code fragment to the method name fetchUrl in the lower code fragment). This is likely because existing graph neural networks are not suited for modeling hierarchies in tree structures. As GNNs do not consider the order of neighbours, the children and parent nodes of a node are treated equally as its neighbours. Although we add directed Child and Parent edges to FA-ASTs, this still does not change the distribution of a node's neighbours.
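The threshold study in Figure 8 amounts to sweeping σ across the cosine-similarity range and recomputing the metrics at each point. A minimal sketch (the score and label arrays below are random placeholders standing in for model outputs and ±1 ground truth over all test pairs):

import numpy as np

def prf_at_threshold(scores, labels, sigma):
    """Precision/recall/F1 when pairs with score >= sigma are called clones."""
    pred = scores >= sigma
    pos = labels == 1
    tp = float(np.sum(pred & pos))
    precision = tp / max(float(np.sum(pred)), 1.0)
    recall = tp / max(float(np.sum(pos)), 1.0)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

rng = np.random.default_rng(0)
scores = rng.uniform(-1, 1, size=1000)   # placeholder similarity scores
labels = rng.choice([-1, 1], size=1000)  # placeholder clone labels
curve = [prf_at_threshold(scores, labels, s) for s in np.linspace(-1, 1, 81)]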
protected String downloadURLtoString(URL url) throws IOException {
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
    StringBuffer sb = new StringBuffer(100 * 1024);
    String str;
    while ((str = in.readLine()) != null) {
        sb.append(str);
    }
    in.close();
    return sb.toString();
}

public static String fetchUrl(String urlString) {
    try {
        URL url = new URL(urlString);
        BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
        String line = null;
        StringBuilder builder = new StringBuilder();
        while ((line = reader.readLine()) != null) {
            builder.append(line);
        }
        reader.close();
        return builder.toString();
    } catch (MalformedURLException e) {
    } catch (IOException e) {
    }
    return "";
}

Fig. 9. The corresponding locations in source code for the node pairs with the ten highest attention scores in a clone pair in BigCloneBench.

As BigCloneBench has already labeled clone pairs with their different types, we analyze the ability of our model to detect the different clone types individually. Table VI shows the results on different clone types in BigCloneBench. As the results of most of our baselines are much lower than our FA-AST+GMN, here we only compare our approach (FA-AST+GMN) with ASTNN. When comparing our approach with ASTNN, we can see that our approach outperforms ASTNN on the WT3/T4 semantic clones, which concern us most.

TABLE VI
RESULTS ON DIFFERENT CLONE TYPES IN BIGCLONEBENCH

          ASTNN                       FA-AST+GMN
Type      Precision  Recall  F1       Precision  Recall  F1
T1        100        100     100      100        100     100
T2        100        100     100      100        100     100
ST3       100        99.6    99.8     100        99.6    99.8
MT3       100        97.9    98.9     100        96.5    98.2
WT3/T4    93.3       92.2    92.8     95.7       93.5    94.6

We further draw the ROC curves of our approaches and compare them with the best baseline, ASTNN. The ROC curves and ROC_AUC scores for our approaches and ASTNN on BigCloneBench are shown in Figure 10. Similar to the results shown in Table V, FA-AST+GMN achieves the highest ROC_AUC score among the three approaches, and the result of ASTNN is a little higher than FA-AST+GGNN but lower than FA-AST+GMN.

Fig. 10. The ROC curve and ROC_AUC score for FA-AST approaches and ASTNN on the test set of BigCloneBench (FA-AST+GMN: AUC=0.996, FA-AST+GGNN: AUC=0.986, ASTNN: AUC=0.988).

VI. DISCUSSION

In this section, we first discuss the different behaviors of our approaches and the other baselines. Then we discuss some issues which our work does not solve at this point and which are worth investigating in the future.

A. The Advantage of Our Approach Over ASTNN

We believe our approach outperforms previous deep learning-based clone detection approaches for the following reasons:

1) Our graph representation of programs, FA-AST, contains both the syntactic information in ASTs and the control and data flows in CFGs, while previous approaches are based purely on either the AST or the CFG.
2) We treat a code fragment as a whole graph and directly input the graph into our neural network, while some previous approaches do not keep all structural information. For example, CDLH converts ASTs into binary trees before feeding them into a Tree-LSTM. DeepSim [6] builds a semantic matrix for a code fragment by manually extracting several human-defined types of semantic features from the CFG. ASTNN decomposes an AST into a sequence of statement subtrees in order of depth-first traversal, so it may lose relationships between statements such as nesting and if/else branches.

To have an intuitive view of the power of our approach over ASTNN, we demonstrate a few example clone pairs on which our approach (FA-AST+GMN) made the correct prediction while ASTNN did not. Figure 11(a) and Figure 11(b) belong to a true clone pair in BigCloneBench, in which both code fragments implement a file copy functionality. The similarity score predicted by ASTNN is 5.8e-07 (range from 0 to 1), while the similarity predicted by FA-AST is 0.94. These two code fragments are significantly different in their statements, so ASTNN cannot capture the similarity between them, while GMN can learn these similarities between entire methods from the training data. Figure 12 shows a false clone pair in BigCloneBench (Figure 12(a) implements a decompress-zip functionality, Figure 12(b) implements a file copy functionality), for which the similarity predicted by ASTNN is 0.94, while the similarity predicted by FA-AST is -0.27. We can see that these two code fragments are similar at both the token level and the statement level, so ASTNN predicted a high similarity score. From the two examples above, we believe that our approach can better capture the semantics of code fragments than ASTNN.

public static void copyFile(File source, File dest) throws IOException {
    FileChannel in = null, out = null;
    try {
        in = new FileInputStream(source).getChannel();
        out = new FileOutputStream(dest).getChannel();
        in.transferTo(0, in.size(), out);
    } catch (FileNotFoundException fnfe) {
        Log.debug(fnfe);
    } finally {
        if (in != null) in.close();
        if (out != null) out.close();
    }
}

(a)

private void createButtonCopyToClipboard() {
    buttonCopyToClipboard = new Button(shell, SWT.PUSH);
    buttonCopyToClipboard.setText("Co&py to Clipboard");
    buttonCopyToClipboard.setLayoutData(SharedStyle.relativeToBottomRight(buttonClose));
    buttonCopyToClipboard.addSelectionListener(new SelectionAdapter() {
        @Override
        public void widgetSelected(final SelectionEvent event) {
            IOUtils.copyToClipboard(Version.getEnvironmentReport());
        }
    });
}

(b)

Fig. 11. Example of a true clone pair in BigCloneBench which FA-AST+GMN correctly predicted as true while ASTNN wrongly predicted.

@Test
public void testLoadHttpGzipped() throws Exception {
    String url = HTTP_GZIPPED;
    LoadingInfo loadingInfo = Utils.openFileObject(fsManager.resolveFile(url));
    InputStream contentInputStream = loadingInfo.getContentInputStream();
    byte[] actual = IOUtils.toByteArray(contentInputStream);
    byte[] expected = IOUtils.toByteArray(new GZIPInputStream(new URL(url).openStream()));
    assertEquals(expected.length, actual.length);
}

(a)

@Test
public void testCopyUnknownSize() throws IOException {
    final InputStream in = new ByteArrayInputStream(TEST_DATA);
    final ByteArrayOutputStream out = new ByteArrayOutputStream(TEST_DATA.length);
    final int cpySize = ExtraIOUtils.copy(in, out, (-1));
    assertEquals("Mismatched copy size", TEST_DATA.length, cpySize);
    final byte[] outArray = out.toByteArray();
    assertArrayEquals("Mismatched data", TEST_DATA, outArray);
}

(b)

Fig. 12. Example of a false clone pair in BigCloneBench which FA-AST+GMN correctly predicted as false while ASTNN wrongly predicted.

B. The Quality of Code Clone Datasets

Our approaches have already shown very high results (F1 close to 1.0) on both of our tasks, but the results of ASTNN are close to ours, and the room for improvement on these two tasks is small. So we assume that some widely used code clone datasets (like the two datasets in our paper) are not difficult enough to test the power of current deep learning models. In the future, to test the power of up-to-date deep learning models on clone detection, we need to build larger and more complex code clone datasets. One direction is to increase the number of different functionalities in a dataset. For example, the current GCJ dataset contains 12 different functionalities, and BigCloneBench contains only ten functionalities. However, in real applications, the code fragments can likely not be categorized into a few classes by their functionalities. So building larger datasets with more types of different functionalities can help to test the ability of code clone detection approaches in more close-to-reality scenarios.

C. Generalizability of Our Approach to Other Programming Languages

In this paper, we use Java as an example to demonstrate the construction of FA-ASTs. In our approach, FA-AST is built from the AST, control flow, and data flow, all of which exist in most programming languages. We can follow the FA-AST building process in this paper to build graphs for other programming languages with only small modifications.

VII. RELATED WORK

We introduce the related work from two perspectives: first, the application of deep learning to clone detection, and second, the application of graph neural networks to various software engineering tasks.

A. Code Clone Detection with Deep Learning

As deep learning has made breakthroughs in natural language processing, researchers have considered applying deep learning models to programming languages, for which code clone detection is a well-suited task. White et al. [2] used a recursive autoencoder [16] to learn representations of Java ASTs in an unsupervised manner, then used the representations to compute the similarity between code pairs. Li et al. [19] proposed CCLearner, a purely token-based clone detector. CCLearner categorizes source code tokens into eight classes.
For a pair of code fragments (methods), it calculates eight similarity scores in terms of token frequency in each category to form a feature vector that is then fed into a feedforward neural network. Wei et al. [3] proposed CDLH, which uses a hash loss to measure the similarity of two code pairs. CDLH first converts program ASTs into binary trees, then uses a binary Tree-LSTM [5] to represent these trees. Wei et al. [20] proposed CDPU (Clone Detection with Positive-Unlabeled learning), which extends CDLH with adversarial training. Different from previous supervised approaches, CDPU can be trained in a semi-supervised way using a small number of labeled clones and a large number of unlabeled code pairs. Zhao et al. [6] proposed a deep learning-based clone detection framework, DeepSim. Different from other deep learning-based clone detection techniques, the inputs of DeepSim are not code fragments but semantic matrices with manually extracted semantic features from CFGs. Although DeepSim first applied a deep learning approach to the Google Code Jam dataset and achieved the previous state of the art, we do not compare it with our approach because we cannot reproduce their experiments from the code they released. Saini et al. [21] proposed Oreo, which uses a Siamese network consisting of two feedforward networks to predict code clones. The inputs of the neural network are a series of human-defined software metrics. They trained Oreo using 50k Java projects from GitHub and evaluated their approach on BigCloneBench. Zhang et al. [4] proposed a program representation model, ASTNN, which aims to mitigate the long-dependency problem in previous sequential models. The authors evaluated their model on code classification and clone detection.

B. Graph Neural Networks for Software Engineering

Li et al. [7] proposed the gated graph neural network (GGNN), which uses a GRU cell to update the states of nodes. They evaluated their model on a simple program verification task to detect null pointers. The input they used is not the entire program but the memory heap states of programs. Other works try to apply GNNs to entire code fragments. To represent a program with a graph, one straightforward approach is to use control flow graphs [8], [22]. Phan et al. [22] used a graph convolutional network for defect detection on control flow graphs of C programs. To produce CFGs for C, they first compiled C source code to assembly code, then generated CFGs from the compiled assembly code. Li et al. [8] proposed graph matching networks (GMN) for learning the similarity between two graphs. They applied their model to compute the similarity between control flow graphs of binary functions. Another group of works tries to create program graphs using the AST [23], [24]. Allamanis et al. [23] used GGNN to learn representations for C# programs for two tasks: variable naming and correcting variable misuse. Brockschmidt et al. [24] used GGNN to generate program expressions for code completion in C#.

VIII. CONCLUSION AND FUTURE WORK

Code clone detection has been a widely studied field in software engineering, but few existing approaches can effectively detect semantic clones (i.e., clones that are very different syntactically). In this paper, we propose a novel approach that leverages explicit control and data flow information for code clone detection. Our approach applies two different GNNs, gated graph neural networks and graph matching networks, over a flow-augmented AST (FA-AST). By building FA-AST from original ASTs and flow edges, our approach can directly capture the syntactic and semantic structure of programs. Experimental results on two datasets (Google Code Jam and BigCloneBench) show that by combining graph neural networks and control/data flow information, we can enhance the performance of detecting semantic code clones.

In the future, we plan to improve our neural model and explore other program representation forms to capture more accurate syntactic and semantic features of source code. Another feasible extension to our existing work is to combine ASTs with other program structures, like token sequences or data dependence graphs.

ACKNOWLEDGMENT

This research is supported by the National Key R&D Program under Grant No. 2018YFB1003904, and the National Natural Science Foundation of China under Grant No. 61832009.

REFERENCES

[1] C. K. Roy and J. R. Cordy, "A survey on software clone detection research," Queen's School of Computing TR, vol. 541, no. 115, pp. 64–68, 2007.
[2] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, "Deep learning code fragments for code clone detection," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 2016, pp. 87–98.
[3] H. Wei and M. Li, "Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code," in IJCAI, 2017, pp. 3034–3040.
[4] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu, "A novel neural source code representation based on abstract syntax tree," in Proceedings of the 41st International Conference on Software Engineering. IEEE Press, 2019, pp. 783–794.
[5] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015, pp. 1556–1566.
[6] G. Zhao and J. Huang, "DeepSim: deep learning code functional similarity," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2018, pp. 141–151.
[7] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, "Gated graph sequence neural networks," arXiv preprint arXiv:1511.05493, 2015.
[8] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli, "Graph matching networks for learning the similarity of graph structured objects," in International Conference on Machine Learning, 2019, pp. 3835–3845.
[9] J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, "Towards a big data curated benchmark of inter-project code clones," in 2014 IEEE International Conference on Software Maintenance and Evolution. IEEE, 2014, pp. 476–480.
[10] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[11] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 1263–1272.
[12] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, "Graph neural networks: A review of methods and applications," arXiv preprint arXiv:1812.08434, 2018.
[13] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[14] "Google code jam," https://code.google.com/codejam/contests.html, 2016, accessed: 2016-10-8.
[15] L. Jiang, G. Misherghi, Z. Su, and S. Glondu, "DECKARD: Scalable and accurate tree-based detection of code clones," in Proceedings of the 29th International Conference on Software Engineering. IEEE Computer Society, 2007, pp. 96–105.
[16] R. Socher, J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning, "Semi-supervised recursive autoencoders for predicting sentiment distributions," in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 151–161.
[17] M. Fey and J. E. Lenssen, "Fast graph representation learning with PyTorch Geometric," arXiv preprint arXiv:1903.02428, 2019.
[18] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [Online]. Available: http://arxiv.org/abs/1412.6980
[19] L. Li, H. Feng, W. Zhuang, N. Meng, and B. Ryder, "CCLearner: A deep learning-based clone detection approach," in 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2017, pp. 249–260.
[20] H. Wei and M. Li, "Positive and unlabeled learning for detecting software functional clones with adversarial training," in IJCAI, 2018, pp. 2840–2846.
[21] V. Saini, F. Farmahinifarahani, Y. Lu, P. Baldi, and C. V. Lopes, "Oreo: Detection of clones in the twilight zone," in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2018, pp. 354–365.
[22] A. V. Phan, M. Le Nguyen, and L. T. Bui, "Convolutional neural networks over control flow graphs for software defect prediction," in 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2017, pp. 45–52.
[23] M. Allamanis, M. Brockschmidt, and M. Khademi, "Learning to represent programs with graphs," arXiv preprint arXiv:1711.00740, 2017.
[24] M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov, "Generative code modeling with graphs," arXiv preprint arXiv:1805.08490, 2018.