[ICSE18] Deep Code Search

The document introduces CODEnn, a deep neural network designed for code search that embeds code snippets and natural language descriptions into a unified vector space to improve retrieval accuracy. It addresses the limitations of traditional information retrieval methods by focusing on semantic relationships rather than textual similarity. The effectiveness of the proposed approach is demonstrated through the implementation of a code search tool called DeepCS, which outperforms existing techniques in retrieving relevant code snippets from a large-scale codebase.


Deep Code Search

Xiaodong Gu 1, Hongyu Zhang 2, and Sunghun Kim 1,3

1 The Hong Kong University of Science and Technology, Hong Kong
[email protected], [email protected]
2 The University of Newcastle, Callaghan, Australia
[email protected]
3 Clova AI Research, NAVER

ABSTRACT

To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and the natural language query. They lack a deep understanding of the semantics of queries and source code.

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that a code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized, and irrelevant/noisy keywords in queries can be handled.

As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large-scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.

CCS CONCEPTS

• Software and its engineering → Reusability;

KEYWORDS

code search, deep learning, joint embedding

ACM Reference Format:
Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep Code Search. In ICSE '18: 40th International Conference on Software Engineering, May 27-June 3, 2018, Gothenburg, Sweden. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3180155.3180167

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICSE '18, May 27-June 3, 2018, Gothenburg, Sweden
© 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.
ACM ISBN 978-1-4503-5638-1/18/05...$15.00
https://doi.org/10.1145/3180155.3180167

1 INTRODUCTION

Code search is a very common activity in software development practices [57, 68]. To implement a certain functionality, for example, to parse XML files, developers usually search and reuse previously written code by performing free-text queries over a large-scale codebase.

Many code search approaches have been proposed [13, 15, 29, 31, 32, 35, 44, 45, 47, 62], most of them based on information retrieval (IR) techniques. For example, Linstead et al. [43] proposed Sourcerer, an information retrieval based code search tool that combines the textual content of a program with structural information. McMillan et al. [47] proposed Portfolio, which returns a chain of functions through keyword matching and PageRank. Lu et al. [44] expanded a query with synonyms obtained from WordNet and then performed keyword matching of method signatures. Lv et al. [45] proposed CodeHow, which combines text similarity and API matching through an extended Boolean model.

A fundamental problem of IR-based code search is the mismatch between the high-level intent reflected in natural language queries and the low-level implementation details in source code [12, 46]. Source code and natural language queries are heterogeneous. They may not share common lexical tokens, synonyms, or language structures. Instead, they may only be semantically related. For example, a relevant snippet for the query "read an object from an xml" could be as follows:

    public static <S> S deserialize(Class c, File xml) {
        try {
            JAXBContext context = JAXBContext.newInstance(c);
            Unmarshaller unmarshaller = context.createUnmarshaller();
            S deserialized = (S) unmarshaller.unmarshal(xml);
            return deserialized;
        } catch (JAXBException ex) {
            log.error("Error deserializing object from XML", ex);
            return null;
        }
    }

Existing approaches may not be able to return this code snippet as it does not contain keywords such as read and object or their synonyms such as load and instance. Therefore, an effective code search engine requires a higher-level semantic mapping between code and natural language queries. Furthermore, the existing approaches have difficulties in query understanding [27, 29, 45]. They cannot effectively handle irrelevant/noisy keywords in queries [27]. Therefore, an effective code search engine should also be able to understand the semantic meanings of natural language queries and source code in order to improve the accuracy of code search.

In our previous work, we introduced the DeepAPI framework [27], which is a deep learning based method that learns the semantics of queries and the corresponding API sequences. However, searching source code is much more difficult than generating APIs, because
the semantics of code snippets are related not only to the API sequences but also to other source code aspects such as tokens and method names. For example, DeepAPI could return the same API ImageIO.write for the queries "save image as png" and "save image as jpg". Nevertheless, the actual code snippets for answering the two queries are different in terms of source code tokens. Therefore, the code search problem requires models that can exploit more aspects of the source code.

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). To bridge the lexical gap between queries and source code, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that a code snippet and its corresponding description have similar vectors. With the unified vector representation, code snippets semantically related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized, and irrelevant/noisy keywords in queries can be handled.

Using CODEnn, we implement a code search tool, DeepCS, as a proof of concept. DeepCS trains the CODEnn model on a corpus of 18.2 million Java code snippets (in the form of commented methods) from GitHub. Then, it reads code snippets from a codebase and embeds them into vectors using the trained CODEnn model. Finally, when a user query arrives, DeepCS finds code snippets that have the nearest vectors to the query vector and returns them.

To evaluate the effectiveness of DeepCS, we perform code search on a search codebase using 50 real-world queries obtained from Stack Overflow. Our results show that DeepCS returns more relevant code snippets than two related approaches, that is, CodeHow [45] and a conventional Lucene-based code search tool [5]. On average, the first relevant code snippet returned by DeepCS is ranked 3.5, while the first relevant results returned by CodeHow [45] and Lucene [43] are ranked 5.5 and 6.0, respectively. For 76% of the queries, the relevant code snippets can be found within the top 5 returned results. The evaluation results confirm the effectiveness of DeepCS.

To our knowledge, we are the first to propose deep learning based code search. The main contributions of our work are as follows:

• We propose a novel deep neural network, CODEnn, to learn a unified vector representation of both source code and natural language queries.
• We develop DeepCS, a tool that utilizes CODEnn to retrieve relevant code snippets for given natural language queries.
• We empirically evaluate DeepCS using a large scale codebase.

The rest of this paper is organized as follows. Section 2 describes the background of the deep learning based embedding models. Section 3 describes the proposed deep neural network for code search. Section 4 describes the detailed design of our approach. Section 5 presents the evaluation results. Section 6 discusses our work, followed by Section 7 that presents the related work. We conclude the paper in Section 8.

2 BACKGROUND

Our work adopts recent advanced techniques from deep learning and natural language processing [10, 17, 70]. In this section, we discuss the background of these techniques.

[Figure 1: Illustration of the RNN Sentence Embedding. (a) the basic RNN structure with an input layer, a recurrent hidden layer, and an output layer; (b) the RNN unrolled over the sentence "parse xml file", producing hidden states h1, h2, h3.]

2.1 Embedding Techniques

Embedding (also known as distributed representation [50, 72]) is a technique for learning vector representations of entities such as words, sentences and images in such a way that similar entities have vectors close to each other [48, 50].

A typical embedding technique is word embedding, which represents words as fixed-length vectors so that similar words are close to each other in the vector space [48, 50]. For example, suppose the word execute is represented as [0.12, -0.32, 0.01] and the word run is represented as [0.12, -0.31, 0.02]. From their vectors, we can estimate their distance and identify their semantic relation. Word embedding is usually realized using a machine learning model such as CBOW and Skip-Gram [48]. These models build a neural network that captures the relations between a word and its contextual words. The vector representations of words, as parameters of the network, are trained with a text corpus [50].

Likewise, a sentence (i.e., a sequence of words) can also be embedded as a vector [59]. A simple way of sentence embedding is, for example, to view it as a bag of words and add up all its word vectors [39].

2.2 RNN for Sequence Embedding

We now introduce a widely-used deep neural network, the Recurrent Neural Network (RNN) [49, 59], for the embedding of sequential data such as natural language sentences. The Recurrent Neural Network is a class of neural networks where hidden layers are recurrently used for computation. This creates an internal state of the network to record dynamic temporal behavior. Figure 1a shows the basic structure of an RNN. The neural network includes three layers: an input layer which maps each input to a vector, a recurrent hidden layer which recurrently computes and updates a hidden state after reading each input, and an output layer which utilizes the hidden state for specific tasks. Unlike traditional feed-forward neural networks, RNNs can embed sequential inputs such as sentences using their internal memory [25].

Consider a natural language sentence with a sequence of T words s = w_1, ..., w_T. The RNN embeds it through the following computations: it reads words in the sentence one by one, and updates a hidden state at each time step. Each word w_t is first mapped to a d-dimensional vector w_t ∈ R^d by a one-hot representation [72] or word embedding [50]. Then, the hidden state (the values in the hidden layer) h_t is updated at time t by considering the input word w_t and the preceding hidden state h_{t-1}:

    h_t = tanh(W [h_{t-1}; w_t]),  ∀t = 1, 2, ..., T    (1)
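To make the recurrence concrete, the following sketch implements the update of Equation (1) with plain arrays, together with the max-pooling summarization of the hidden states that is discussed next. The class and method names are hypothetical, and the vanilla tanh cell is a simplification: the paper's actual model uses bi-directional LSTM cells (see Section 4.2).

```java
// Illustrative sketch only: a vanilla RNN step and max pooling over time.
// Names and dimensions are our assumptions, not the paper's implementation.
public class RnnSketch {

    // One step of Equation (1): h_t = tanh(W [h_{t-1}; w_t]), with W in R^{2d x d}.
    public static double[] step(double[][] W, double[] hPrev, double[] wt) {
        int d = hPrev.length;
        double[] concat = new double[2 * d];            // [h_{t-1}; w_t]
        System.arraycopy(hPrev, 0, concat, 0, d);
        System.arraycopy(wt, 0, concat, d, d);
        double[] h = new double[d];
        for (int i = 0; i < d; i++) {
            double sum = 0.0;
            for (int j = 0; j < 2 * d; j++) {
                sum += concat[j] * W[j][i];             // (2d-vector) x (2d x d matrix)
            }
            h[i] = Math.tanh(sum);
        }
        return h;
    }

    // Summarize hidden states [h_1, ..., h_T] into one fixed-length vector by
    // taking the element-wise maximum over time (max pooling, 1 x T window).
    public static double[] maxPool(double[][] hiddenStates) {
        int d = hiddenStates[0].length;
        double[] s = new double[d];
        java.util.Arrays.fill(s, Double.NEGATIVE_INFINITY);
        for (double[] h : hiddenStates) {
            for (int i = 0; i < d; i++) {
                s[i] = Math.max(s[i], h[i]);
            }
        }
        return s;
    }
}
```

Feeding the words of a sentence through step and then maxPool yields a vector of the same dimension d for any sentence length, which is what makes the summarized hidden states usable as a sentence embedding.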
where [a; b] ∈ R^{2d} represents the concatenation of two vectors, W ∈ R^{2d×d} is the matrix of trainable parameters in the RNN, and tanh is a non-linear activation function of the RNN. Finally, the embedding vector of the sentence is summarized from the hidden states h_1, ..., h_T. A typical way is to select the last hidden state h_T as the embedding vector. The embedding vector can also be summarized using other computations such as maxpooling [36]:

    s = maxpooling([h_1, ..., h_T])    (2)

Maxpooling is an operation that selects the maximum value in each fixed-size region over a matrix. Figure 2 shows an example of maxpooling over a sequence of hidden vectors h_1, ..., h_T. Each column represents a hidden vector. The window size of each region is set to 1×T in this example. The result is a fixed-length vector whose elements are the maximum values of each row. Maxpooling can capture the most important feature (the one with the highest value) for each region and can transform sentences of variable lengths into a fixed-length vector.

[Figure 2: Illustration of max pooling with a 1×4 window: each row of the matrix of hidden values (e.g., [3 4 7 5], [1 5 2 0], [8 3 2 4]) is reduced to its maximum, yielding [7, 5, 8].]

Figure 1b shows an example of how an RNN embeds a sentence (e.g., parse an xml file) into a vector. To facilitate understanding, we expand the recurrent hidden layer for each time step. The RNN reads words in the sentence one by one, and updates a hidden state at each time step. When it reads the first word parse, it maps the word into a vector w_1 and computes the current hidden state h_1 using w_1. Then, it reads the second word xml, embeds it into w_2, and updates the hidden state h_1 to h_2 using w_2. The procedure continues until the RNN receives the last word file and outputs the final state h_3. The final state h_3 can be used as the embedding c of the whole sentence.

The embedding of the sentence, i.e., the sentence vector, can be used for specific applications. For example, one can build a language model conditioning on the sentence vector for machine translation [17]. We can also embed two sentences (a question sentence and an answer sentence) and compare their vectors for answer selection [21, 71].

2.3 Joint Embedding of Heterogeneous Data

Suppose there are two heterogeneous data sets X and Y. We want to learn a correlation between them, namely,

    f: X → Y    (3)

For example, suppose X is a set of images and Y is a set of natural language sentences; f can be the correlation between the images and the sentences (i.e., image captioning). Since the two data sources are heterogeneous, it is difficult to discover the correlation f directly. Thus, we need a bridge to connect these two levels of information.

Joint embedding, also known as multi-modal embedding [78], is a technique to jointly embed/correlate heterogeneous data into a unified vector space so that semantically similar concepts across the two modalities occupy nearby regions of the space [33]. The joint embedding of X and Y can be formulated as:

    X --ϕ--> V_X → J(V_X, V_Y) ← V_Y <--ψ-- Y    (4)

where ϕ: X → R^d is an embedding function to map X into a d-dimensional vector space V; ψ: Y → R^d is an embedding function to map Y into the same vector space V; and J(·, ·) is a similarity measure (e.g., cosine) to score the matching degrees of V_X and V_Y in order to learn the mapping functions. Through joint embedding, heterogeneous data can be easily correlated through their vectors.

Joint embedding has been widely used in many tasks [22, 74, 78]. For example, in computer vision, Karpathy and Li [33] use a Convolutional Neural Network (CNN) [22], a deep neural network, as the ϕ and an RNN as the ψ, to jointly embed both images and text into the same vector space for labeling images [33].

3 A DEEP NEURAL NETWORK FOR CODE SEARCH

Inspired by existing joint embedding techniques [21, 22, 33, 78], we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network) for the code search problem. Figure 3 illustrates the key idea. Natural language queries and code snippets are heterogeneous and cannot be easily matched according to their lexical tokens. To bridge the gap, CODEnn jointly embeds code snippets and natural language descriptions into a unified vector space so that a query and the corresponding code snippets are embedded into nearby vectors and can be matched by vector similarities.

[Figure 3: An example showing the idea of joint embedding for code and queries. The yellow points represent query vectors while the blue points represent code vectors. For example, the query "read a text file line by line" lies near a readText(String file) method, and the query "read an object from an xml file" lies near the deserialize(Class c, File xml) method shown in Section 1.]

3.1 Architecture

As introduced in Section 2.3, a joint embedding model requires three components: the embedding functions ϕ: X → R^d and ψ: Y → R^d, as well as the similarity measure J(·, ·). CODEnn realizes these components with deep neural networks.

Figure 4 shows the overall architecture of CODEnn. The neural network consists of three modules, each corresponding to a component of joint embedding:

• a code embedding network (CoNN) to embed source code into vectors.
• a description embedding network (DeNN) to embed natural language descriptions into vectors.
• a similarity module that measures the degree of similarity between code and descriptions.

The following subsections describe the detailed design of these modules.

3.1.1 Code Embedding Network. The code embedding network embeds source code into vectors. Source code is not simply plain text. It contains multiple aspects of information such as tokens, control flows and APIs [46]. In our model, we consider three aspects
of source code: the method name, the API invocation sequence, and the tokens contained in the source code. They are commonly used in existing code search approaches [19, 27, 41, 44, 45]. For each code snippet (at the method level), we extract these three aspects of information. Each is embedded individually and then combined into a single vector representing the entire code.

[Figure 4: The structure of the Code-Description Embedding Neural Network. (a) Overall architecture: a code embedding network (CoNN) and a description embedding network (DeNN) produce a code vector c and a description vector d, which are compared by cosine similarity. (b) Detailed structure: the method name and the API sequence are each embedded by an RNN with max pooling, the tokens by an MLP with max pooling; the resulting vectors m, a, and t are fused into c, while the description words are embedded by an RNN with max pooling into d.]

Consider an input code snippet C = [M, A, Γ], where M = w_1, ..., w_{N_M} is the method name represented as a sequence of N_M camel-split tokens [1]; A = a_1, ..., a_{N_A} is the API sequence with N_A consecutive API method invocations; and Γ = {τ_1, ..., τ_{N_Γ}} is the set of tokens in the snippet. The neural network embeds the three aspects as follows. For the method name M, it embeds the sequence of camel-split tokens using an RNN with maxpooling:

    h_t = tanh(W^M [h_{t-1}; w_t]),  ∀t = 1, 2, ..., N_M
    m = maxpooling([h_1, ..., h_{N_M}])    (5)

where w_t ∈ R^d is the embedding vector of token w_t, [a; b] ∈ R^{2d} represents the concatenation of two vectors, W^M ∈ R^{2d×d} is the matrix of trainable parameters in the RNN, and tanh is the activation function of the RNN. A method name is thus embedded as a d-dimensional vector m.

Likewise, the API sequence A is embedded into a vector a using an RNN with maxpooling:

    h_t = tanh(W^A [h_{t-1}; a_t]),  ∀t = 1, 2, ..., N_A
    a = maxpooling([h_1, ..., h_{N_A}])    (6)

where a_t ∈ R^d is the embedding vector of API a_t, and W^A is the matrix of trainable parameters in the RNN.

For the tokens Γ, as they have no strict order in the source code, they are simply embedded via a multilayer perceptron (MLP), i.e., the conventional fully connected layer [52]:

    h_i = tanh(W^Γ τ_i),  ∀i = 1, 2, ..., N_Γ    (7)

where τ_i ∈ R^d represents the embedded representation of the token τ_i, W^Γ is the matrix of trainable parameters in the MLP, and h_i, i = 1, ..., N_Γ are the embedding vectors of all individual tokens. The individual vectors are then summarized into a single vector t via maxpooling:

    t = maxpooling([h_1, ..., h_{N_Γ}])    (8)

Finally, the vectors of the three aspects are fused into one vector through a fully connected layer:

    c = tanh(W^C [m; a; t])    (9)

where [a; b; c] represents the concatenation of three vectors, and W^C is the matrix of trainable parameters in the MLP. The output vector c represents the final embedding of the code snippet.

3.1.2 Description Embedding Network. The description embedding network (DeNN) embeds natural language descriptions into vectors. Consider a description D = w_1, ..., w_{N_D} comprising a sequence of N_D words. DeNN embeds it into a vector d using an RNN with maxpooling:

    h_t = tanh(W^D [h_{t-1}; w_t]),  ∀t = 1, 2, ..., N_D
    d = maxpooling([h_1, ..., h_{N_D}])    (10)

where w_t ∈ R^d represents the embedded representation of the description word w_t, W^D is the matrix of trainable parameters in the RNN, and h_t, t = 1, ..., N_D are the hidden states of the RNN.

3.1.3 Similarity Module. We have described the transformations that map the code and description into vectors (i.e., the c and d). Since we want the vectors of code and description to be jointly embedded, we measure the similarity between the two vectors. We use the cosine similarity for the measurement, which is defined as:

    cos(c, d) = c^T d / (‖c‖ ‖d‖)    (11)

where c and d are the vectors of a code snippet and a description, respectively. The higher the similarity, the more related the code is to the description.

Overall, CODEnn takes a ⟨code, description⟩ pair as input and predicts their cosine similarity cos(c, d).

3.2 Model Training

Now we present how to train the CODEnn model to embed both code and descriptions into a unified vector space. The high-level goal of the joint embedding is: if a code snippet and a description have similar semantics, their embedded vectors should be close to each other. In other words, given an arbitrary code snippet C and an arbitrary description D, we want the model to predict a high similarity if D is a correct description of C, and a small similarity otherwise.
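The training objective just described rests on two computations: the cosine similarity of Equation (11) and a margin-based ranking loss over ⟨C, D+, D−⟩ triples, which the following section defines precisely. A self-contained sketch of both, with hypothetical names and the paper's margin ϵ = 0.05 (gradient computation is omitted):

```java
// Illustrative sketch of CODEnn's similarity measure and per-triple loss.
// cosine() follows Equation (11); rankingLoss() computes
// max(0, eps - cos(c, dPlus) + cos(c, dMinus)) for one <C, D+, D-> triple.
public class TrainingSketch {

    // Cosine similarity between a code vector c and a description vector d.
    public static double cosine(double[] c, double[] d) {
        double dot = 0.0, nc = 0.0, nd = 0.0;
        for (int i = 0; i < c.length; i++) {
            dot += c[i] * d[i];
            nc += c[i] * c[i];
            nd += d[i] * d[i];
        }
        return dot / (Math.sqrt(nc) * Math.sqrt(nd));
    }

    // The loss is zero once the correct description outranks the
    // incorrect one by at least the margin eps.
    public static double rankingLoss(double[] c, double[] dPlus,
                                     double[] dMinus, double eps) {
        return Math.max(0.0, eps - cosine(c, dPlus) + cosine(c, dMinus));
    }
}
```

Minimizing this loss pushes cos(c, d+) up and cos(c, d−) down until the correct description outranks the incorrect one by the margin, which is exactly the training intuition stated above.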
At training time, we construct each training instance as a triple ⟨C, D+, D−⟩: for each code snippet C, there is a positive description D+ (a correct description of C) as well as a negative description D− (an incorrect description of C) randomly chosen from the pool of all D+'s. When trained on the set of ⟨C, D+, D−⟩ triples, CODEnn predicts the cosine similarities of both ⟨C, D+⟩ and ⟨C, D−⟩ pairs and minimizes the ranking loss [18, 22]:

    L(θ) = Σ_{⟨C, D+, D−⟩ ∈ P} max(0, ϵ − cos(c, d+) + cos(c, d−))    (12)

where θ denotes the model parameters, P denotes the training dataset, and ϵ is a constant margin. c, d+ and d− are the embedded vectors of C, D+ and D−, respectively. A small, fixed ϵ value of 0.05 is used in all the experiments. Intuitively, the ranking loss encourages the cosine similarity between a code snippet and its correct description to go up, and the cosine similarities between a code snippet and incorrect descriptions to go down.

4 DEEPCS: DEEP LEARNING BASED CODE SEARCH

In this section, we describe DeepCS, a code search tool based on the proposed CODEnn model. DeepCS recommends the top K most relevant code snippets for a given natural language query. Figure 5 shows the overall architecture. It includes three main phases: offline training, offline code embedding, and online code search.

[Figure 5: The overall workflow of DeepCS: offline training (extracting aspects from commented code snippets and training the CODEnn model), offline embedding (computing code vectors for the search codebase), and online search (embedding the query and selecting the nearest code vectors as recommended instances).]

We begin by collecting a large-scale corpus of code snippets, i.e., Java methods with corresponding descriptions. We extract sub-elements (including method names, tokens, and API sequences) from the methods. Then, we use the corpus to train the CODEnn model (the offline training phase). For a given codebase from which users would like to search for code snippets, DeepCS extracts code elements for each Java method in the search codebase, and computes a code vector using the CoNN module of the trained CODEnn model (the offline embedding phase). Finally, when a user query arrives, DeepCS first computes the vector representation of the query using the DeNN module of the CODEnn model, and then returns code snippets whose vectors are close to the query vector (the online code search phase).

In theory, our approach could search for source code written in any programming language. In this paper, we limit our scope to Java code. The following sections describe the detailed steps of our approach.

4.1 Collecting Training Corpus

As described in Section 3, the CODEnn model requires a large-scale training corpus that contains code elements and the corresponding descriptions, i.e., the ⟨method name, API sequence, tokens, description⟩ tuples. Figure 6 shows an excerpt of the training corpus.

We build the training tuples using Java methods that have documentation comments¹ from open-source projects on GitHub [3]. For each Java method, we use the method declaration as the code element and the first sentence of its documentation comment as its natural language description. According to the Javadoc guidance², the first sentence is usually a summary of the method. To prepare the data, we download Java projects from GitHub created from August 2008 to June 2016. To remove toy or experimental programs, we exclude any projects without a star. We select only the Java methods that have documentation comments from the downloaded projects. Finally, we obtain a corpus comprising 18,233,872 commented Java methods.

Having collected the corpus of commented code snippets, we extract the ⟨method name, API sequence, tokens, description⟩ tuples as follows:

Method Name Extraction: For each Java method, we extract its name and parse the name into a sequence of tokens according to camel case [1]. For example, the method name listFiles will be parsed into the tokens list and files.

API Sequence Extraction: We extract an API sequence from each Java method using the same procedures as described in DeepAPI [27]: parsing the AST using the Eclipse JDT compiler [2] and traversing the AST. The API sequences are produced as follows [27]:

• For each constructor invocation new C(), we produce C.new and append it to the API sequence.
• For each method call o.m() where o is an instance of class C, we produce C.m and append it to the API sequence.
• For a method call passed as a parameter, we append the method before the calling method. For example, for o1.m1(o2.m2(), o3.m3()), we produce the sequence C2.m2-C3.m3-C1.m1, where Ci is the class of the instance oi.
• For a sequence of statements s1; s2; ...; sN, we extract the API sequence ai from each statement si and concatenate them into the API sequence a1-a2-...-aN.
• For conditional statements such as if(s1){s2;} else{s3;}, we create a sequence from all possible branches, that is, a1-a2-a3, where ai is the API sequence extracted from the statement si.
• For loop statements such as while(s1){s2;}, we produce a sequence a1-a2, where a1 and a2 are the API sequences extracted from the statements s1 and s2, respectively.

Token Extraction: To collect tokens from a Java method, we tokenize the method body, split each token according to camel case [1], and remove the duplicated tokens. We also remove stop words (such as the and in) and Java keywords, as they frequently occur in source code and are not discriminative.

Description Extraction: To extract the documentation comment,

¹ A documentation comment in Java starts with slash-asterisk-asterisk (/**) and ends with asterisk-slash (*/).
² http://www.oracle.com/technetwork/articles/java/index-137868.html
ICSE ’18, May 27-June 3, 2018, Gothenburg, Sweden Xiaodong Gu, Hongyu Zhang, and Sunghun Kim

Method Name API Sequence Tokens Description (English)


1 file reader InputStream.read→OutputStream.write input, output, stream, write copy a file from an inputstream
2 open URL.new→URL.openConnection url, open, conn open a url
3 test exists File.new→File.exists file, create, exists test file exists
⋮ ⋮ ⋮ ⋮ ⋮
Figure 6: An excerpt of training tuples
/**
 * Converts a Date into a Calendar.
 * @param date the date to convert to a Calendar
 * @return the created Calendar
 * @throws NullPointerException if null is passed in
 * @since 3.0
 */
public static Calendar toCalendar(final Date date) {
    final Calendar c = Calendar.getInstance();
    c.setTime(date);
    return c;
}

Method Name: to calendar
API sequence: Calendar.getInstance Calendar.setTime
Tokens: calendar, get, instance, set, time, date
Description: converts a date into a calendar.

Figure 7: An example of extracting code elements from a Java method DateUtils.toCalendar3

We use the Eclipse JDT compiler [2] to parse the AST from a Java method and extract the JavaDoc comment from the AST. Figure 7 shows an example of code elements and documentation comments extracted from the Java method DateUtils.toCalendar3 in the Apache commons-lang library.

3 https://github.com/apache/commons-lang/blob/master/src/main/java/org/apache/commons/lang3/time/DateUtils.java

4.2 Training CODEnn Model

We use the large-scale corpus described in the previous section to train the CODEnn model, following the method described in Section 3.2.

The detailed implementation of the CODEnn model is as follows: we use the bi-directional LSTM [70], a state-of-the-art kind of RNN, for the RNN implementation. All LSTMs have 200 hidden units in each direction. We set the dimension of word embeddings to 100. CODEnn has two types of MLPs: the embedding MLP, which embeds individual tokens, and the fusion MLP, which combines the embeddings of the different aspects. We set the number of hidden units to 100 for the embedding MLP and 400 for the fusion MLP.

The CODEnn model is trained via the mini-batch Adam algorithm [37, 40]. We set the batch size (i.e., the number of instances per batch) to 128. For training the neural networks, we limit the vocabulary to the 10,000 words that are most frequently used in the training dataset.

We build our model on Keras [4] and Theano [6], two open-source deep learning frameworks. We train our models on a server with one Nvidia K40 GPU. The training lasts ∼50 hours over 500 epochs.

4.3 Searching Code Snippets

Given a user's free-text query, DeepCS returns relevant code snippets through the trained CODEnn model. It first computes the code vector for each code snippet (i.e., each Java method) in the search codebase. Then, it selects and returns the code snippets whose vectors are the top K nearest to the query vector. More specifically, before a search starts, DeepCS embeds all code snippets in the codebase into vectors using the trained CoNN module of CODEnn in an off-line manner. During the on-line search, when a developer enters a natural language query, DeepCS first embeds the query into a vector using the trained DeNN module of CODEnn. Then, it estimates the cosine similarities between the query vector and all code vectors using Equation 11. Finally, the top K code snippets whose vectors are most similar to the query vector are returned as the search results. K is set to 10 in our experiments.

5 EVALUATION

In this section, we evaluate DeepCS through experiments. We also compare DeepCS with related code search approaches.

5.1 Experimental Setup

5.1.1 Search Codebase. To better evaluate DeepCS, our experiments are performed over a search codebase, which is different from the training corpus. Code snippets that match a user query are retrieved from the search codebase. In practice, the search codebase could be an organization's local codebase or any codebase created from open source projects.

To construct the search codebase, we choose the Java projects that have at least 20 stars on GitHub. Different from the training corpus, they are considered in isolation and contain all code (including methods that do not have Javadoc comments). There are 9,950 projects in total. We select all 16,262,602 methods from these projects. For each Java method, we extract a ⟨method name, API sequence, tokens⟩ triple to generate its code vector.

5.1.2 Query Subjects. To select code search queries for the evaluation, we adopt a systematic procedure used in [41]4. We build a benchmark of queries from the top 50 voted Java programming questions on Stack Overflow. To do so, we browse the list of Java-tagged questions on Stack Overflow and sort them according to the votes that each one receives5. We manually check the sorted list sequentially and add questions that satisfy the following conditions to the benchmark:

(1) The question is a concrete Java programming task. We exclude questions about problems, knowledge, configurations, and experience, as well as questions whose descriptions are vague and abstract, for example, Failed to load the JNI Library, What is the difference between StringBuilder and StringBuffer?, and Why does Java have transient fields?. (2) The accepted answer to the question contains a Java code snippet. (3) The question is not a duplicate of a previous question. We filter out questions that are tagged as "duplicated".

4 http://taoxie.cs.illinois.edu/racs/subjects.html
5 http://stackoverflow.com/questions/tagged/java?sort=votes&pagesize=15

The full list of the 50 selected queries can be found in Table 1. For each query, two developers manually inspect the top 10 results returned by DeepCS and label their relevance to the query. Then
they discuss the inconsistent labels and relabel them. The procedure repeats until a consensus is reached.

5.1.3 Performance Measure. We use four common metrics to measure the effectiveness of code search, namely, FRank, SuccessRate@k, Precision@k, and Mean Reciprocal Rank (MRR). They are widely used metrics in information retrieval and the code search literature [41, 45, 62, 79].

The FRank (also known as best hit rank [41]) is the rank of the first hit result in the result list [62]. It is important as users scan the results from top to bottom. A smaller FRank implies lower inspection effort for finding the desired result. We use FRank to assess the effectiveness of a single code search query.

The SuccessRate@k (also known as success percentage at k [41]) measures the percentage of queries for which at least one correct result exists in the top k ranked results [35, 41, 79]. In our evaluations it is calculated as follows:

    SuccessRate@k = (1/|Q|) Σ_{q=1}^{|Q|} δ(FRank_q ≤ k)    (13)

where Q is a set of queries and δ(·) is a function that returns 1 if the input is true and 0 otherwise. SuccessRate@k is important because a better code search engine should allow developers to discover the needed code by inspecting fewer returned results. The higher the metric value, the better the code search performance.

The Precision@k [45, 57] measures the percentage of relevant results in the top k returned results for each query. In our evaluations it is calculated as follows:

    Precision@k = (#relevant results in the top k results) / k    (14)

Precision@k is important because developers often inspect multiple results of different usages to learn from [62]. A better code search engine should allow developers to inspect fewer noisy results. The higher the metric value, the better the code search performance. We evaluate SuccessRate@k and Precision@k when k is 1, 5, and 10. These values reflect the typical sizes of results that users would inspect [41].

The MRR [45, 79] is the average of the reciprocal ranks of the results of a set of queries Q. The reciprocal rank of a query is the inverse of the rank of the first hit result [26]. MRR is calculated as follows:

    MRR = (1/|Q|) Σ_{q=1}^{|Q|} 1/FRank_q    (15)

The higher the MRR value, the better the code search performance.

5.1.4 Comparison Methods. We compare the effectiveness of our approach with CodeHow [45] and a conventional Lucene-based code search tool [5].

CodeHow is a state-of-the-art code search engine proposed recently. It is an information retrieval based code search tool that incorporates an extended Boolean model and API matching. It first retrieves APIs relevant to a query by matching the query with the API documentation. Then, it searches code by considering both the plain code and the related APIs. Like DeepCS, CodeHow also considers multiple aspects of source code, such as method names and APIs, and it combines these aspects using an Extended Boolean Model [45]. The facts that CodeHow also considers APIs and is also built for large-scale code search make it an ideal baseline for our experiments.

Lucene is a popular, conventional text search engine behind many existing code search tools such as Sourcerer [43]. Sourcerer combines Lucene with code properties such as the FQN (fully qualified name) of entities and code popularity to retrieve code snippets. In our implementation of the Lucene-based code search tool, we consider the FQN heuristic. We did not include the code popularity heuristic (computed using PageRank) as it does not significantly improve code search performance [43].

We use the same experimental settings for CodeHow and the Lucene-based tool as used for evaluating DeepCS.

Table 1: Benchmark Queries and Evaluation Results (FRank per approach; NF: not found within the top 10 returned results; LC: Lucene; CH: CodeHow; DCS: DeepCS)

No.  Question ID  Query                                                                LC   CH   DCS
1    309424       convert an inputstream to a string                                    2    1    1
2    157944       create arraylist from array                                          NF   NF    2
3    1066589      iterate through a hashmap                                            NF    4    1
4    363681       generating random integers in a specific range                       NF    6    2
5    5585779      converting string to int in java                                     NF   10    1
6    1005073      initialization of an array in one line                               NF    4    1
7    1128723      how can I test if an array contains a certain value                   6    6    1
8    604424       lookup enum by string value                                           1   NF   10
9    886955       breaking out of nested loops in java                                 NF   NF   NF
10   1200621      how to declare an array                                              NF   NF    4
11   41107        how to generate a random alpha-numeric string                        NF    1    1
12   409784       what is the simplest way to print a java array                        6   NF    1
13   109383       sort a map by values                                                 NF    1    3
14   295579       fastest way to determine if an integer's square root is an integer   NF   NF   NF
15   80476        how can I concatenate two arrays in java                             NF    1    1
16   326369       how do I create a java string from the contents of a file             8   NF    5
17   1149703      how can I convert a stack trace to a string                           3    1    2
18   513832       how do I compare strings in java                                      1    3    1
19   3481828      how to split a string in java                                         1    1    1
20   2885173      how to create a file and write to a file in java                      2    1   NF
21   507602       how can I initialise a static map                                     7    1    2
22   223918       iterating through a collection, avoiding concurrentmodificationexception when removing in loop   3    3    2
23   415953       how can I generate an md5 hash                                        1    3    6
24   1069066      get current stack trace in java                                       3    1    1
25   2784514      sort arraylist of custom objects by property                          1    1    1
26   153724       how to round a number to n decimal places in java                     1    1    4
27   473282       how can I pad an integer with zeros on the left                      NF    3    1
28   529085       how to create a generic array in java                                NF   NF    3
29   4716503      reading a plain text file in java                                     4   NF    7
30   1104975      a for loop to iterate over enum in java                              NF   NF   NF
31   3076078      check if at least two out of three booleans are true                 NF   NF   NF
32   4105331      how do I convert from int to string                                   2    1   NF
33   8172420      how to convert a char to a string in java                             5   10    3
34   1816673      how do I check if a file exists in java                               1    2    1
35   4216745      java string to date conversion                                        6   NF    1
36   1264709      convert inputstream to byte array in java                             7    5    1
37   1102891      how to check if a string is numeric in java                           1   NF    2
38   869033       how do I copy an object in java                                       2    1    1
39   180158       how do I time a method's execution in java                           NF   NF    2
40   5868369      how to read a large text file line by line using java                 1    1    1
41   858572       how to make a new list in java                                        2    1    1
42   1625234      how to append text to an existing file in java                        3    1    1
43   2201925      converting iso 8601-compliant string to date                          3    1    1
44   122105       what is the best way to filter a java collection                     NF    9    2
45   5455794      removing whitespace from strings in java                             NF    3    1
46   225337       how do I split a string with any whitespace chars as delimiters       1    1    2
47   52353        in java, what is the best way to determine the size of an object     NF   NF   NF
48   160970       how do I invoke a java method when given the method name as a string  3    1    2
49   207947       how do I get a platform dependent new line character                  1   NF   10
50   1026723      how to convert a map to list in java                                  6   NF    1

5.2 Results

Table 1 shows the evaluation results of DeepCS and the related approaches for each query in the benchmark. The column Question ID shows the original ID of the question on Stack Overflow where the query comes from. The column FRank shows the FRank result of each approach. The symbol 'NF' stands for Not Found, which means
ICSE ’18, May 27-June 3, 2018, Gothenburg, Sweden Xiaodong Gu, Hongyu Zhang, and Sunghun Kim

that no relevant result has been returned within the top K results (K=10).

Table 2: Overall Accuracy of DeepCS and the Related Approaches

Tool     R@1   R@5   R@10  P@1   P@5   P@10  MRR
Lucene   0.24  0.48  0.62  0.24  0.24  0.26  0.35
CodeHow  0.38  0.58  0.66  0.38  0.29  0.28  0.45
DeepCS   0.46  0.76  0.86  0.46  0.50  0.49  0.60

Figure 8: The statistical comparison of FRank and Precision@k for three code search approaches (panels: (a) FRank, (b) Precision@1, (c) Precision@5, (d) Precision@10; box plots not reproduced here)

The results show that DeepCS produces generally more relevant results than Lucene and CodeHow. Figure 8a shows the statistical summary of FRank for the three approaches. The symbol '+' indicates the average FRank achieved by each approach. We conservatively treat the FRank as 11 for queries that fail to obtain relevant results within the top 10 returned results. We observe that DeepCS achieves more relevant results, with an average FRank of 3.5, which is smaller than the average FRank achieved by CodeHow (5.5) and Lucene (6.0). The FRank values of DeepCS concentrate in the range from 1 to 4, while CodeHow and Lucene produce larger variance and many less relevant results. Figures 8b, 8c and 8d show the statistics of Precision@k for the three approaches when k is 1, 5 and 10, respectively. We observe that DeepCS achieves better overall precision values than CodeHow and the Lucene-based tool.

To test the statistical significance, we apply the Wilcoxon signed-rank test (p<0.05) to the comparison of FRank and Precision@k between DeepCS and the two related approaches over all queries. We conservatively treat the FRank as 11 for queries that fail to obtain relevant results within the top 10 returned results. The p-values for the comparisons of DeepCS with Lucene and CodeHow are all less than 0.05, indicating the statistical significance of the improvement of DeepCS over the related approaches.

Table 2 shows the overall performance of the three approaches, measured in terms of SuccessRate@k, Precision@k and MRR. The columns R@1, R@5 and R@10 show the results of SuccessRate@k when k is 1, 5 and 10, respectively. The columns P@1, P@5 and P@10 show the average Precision@k over all queries when k is 1, 5 and 10, respectively. The column MRR shows the MRR values of the three approaches. The results show that DeepCS returns more relevant code snippets than CodeHow and Lucene. For example, the R@5 value is 0.76, which means that for 76% of the queries, a relevant code snippet can be found within the top 5 returned results. The P@5 value is 0.5, which means that 50% of the top 5 results are deemed accurate. For SuccessRate@k, the improvements over CodeHow are 21%, 31% and 30%, respectively. For Precision@k, the improvements over CodeHow are 21%, 72% and 75%, respectively. For MRR, the improvement over CodeHow is 33%. Overall, our approach improves the accuracy of the related techniques on all metrics.

5.3 Examples of Code Search Results

We now provide concrete examples of code search results that demonstrate the advantages of DeepCS.

Figures 9a and 9b show the results for two queries: queue an event to be run on the thread and run an event on a thread queue. The two queries have the same set of keywords but different word sequences. The keyword queue has different meanings in the two queries, which could be difficult for an IR-based approach to distinguish. Still, DeepCS can understand the meaning of the two queries and return relevant snippets. Apparently, DeepCS has the ability to recognize query semantics.

The ability of query understanding enables DeepCS to perform a more robust code search. Its search results are less affected by irrelevant or noisy keywords. For example, the query get the content of an input stream as a string using a specified character encoding contains 9 keywords. CodeHow returns many snippets that are related to less relevant keywords such as specified and character. DeepCS, on the other hand, can successfully identify the importance of the different keywords and understand the key point of the query (Figure 10).

Another advantage of DeepCS relates to associative search. That is, it not only seeks snippets with matched keywords but also recommends those that have no matched keywords but are semantically related. This is important because it significantly increases the search scope, especially when the codebase is small. Besides, developers need snippets of multiple usages [62]. The associative search provides more options of code snippets for developers to learn from. Figure 11a shows the first result of the query read an object from an xml file. As discussed in Section 1, traditional IR-based approaches may only match snippets that contain keywords such as xml, object and read. However, as shown in the figure, DeepCS successfully recognizes the query semantics and returns results about xml deserialization, even though the keywords do not exist in the result. By contrast, CodeHow only returns snippets containing read, object and xml, narrowing down the search scope. This example indicates that DeepCS searches code by understanding the semantics instead of just matching keywords. Similarly, the query initialization of an arraylist in one line in Table 1 returns snippets containing "new ArrayList⟨⟩" although the snippets do not include the keyword initialization. Figure 11b shows another example of the associative search. When searching play a song, DeepCS not only returns snippets with matching keywords but also recommends results with semantically related words such as audio and voice.
public boolean enqueue(EventHandler handler, Event event) {
    synchronized(monitor) {
        ••••••
        handlers[tail] = handler;
        events[tail] = event;
        tail++;
        if (handlers.length <= tail)
            tail = 0;
        monitor.notify();
    }
    return true;
}

(a) The third result of the query "queue an event to be run on the thread"

public void run() {
    while (!stop) {
        DynamicModelEvent evt;
        while ((evt = eventQueue.poll()) != null) {
            for (DynamicModelListener l : listeners.toArray(new DynamicModelListener[0]))
                l.dynamicModelChanged(evt);
        }
        ••••••
    }
}

(b) The first result of the query "run an event on the thread queue"

Figure 9: Examples showing the query understanding

public static String toStringWithEncoding(InputStream inputStream, String encoding) {
    if (inputStream == null)
        throw new IllegalArgumentException("inputStream should not be null");
    char[] buffer = new char[BUFFER_SIZE];
    StringBuffer stringBuffer = new StringBuffer();
    BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(inputStream, encoding), BUFFER_SIZE);
    int character = -1;
    ••••••
    return stringBuffer.toString();
}

Figure 10: An example showing the search robustness – The first result of the query "get the content of an input stream as a string using a specified character encoding"

public static <S> S deserialize(Class c, File xml) {
    try {
        JAXBContext context = JAXBContext.newInstance(c);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        S deserialized = (S) unmarshaller.unmarshal(xml);
        return deserialized;
    } catch (JAXBException ex) {
        log.error("Error deserializing object from XML", ex);
        return null;
    }
}

(a) The first result of the query "read an object from an xml file"

public void playVoice(int clearedLines) throws Exception {
    int audiosAvailable = audioLibrary.get(clearedLines).size();
    int audioIndex = rand.nextInt(audiosAvailable);
    audioLibrary.get(clearedLines).get(audioIndex).play();
}

(b) The second result of the query "play a song"

Figure 11: Examples showing the associative search

public static byte[] generateRandom256() {
    byte[] randomSeed1 = ByteUtils.longToBytes(System.nanoTime());
    byte[] randomSeed2 = (new SecureRandom()).generateSeed(KEY_SIZE_BYTES);
    byte[] bh1 = ByteUtils.concatenate(randomSeed1, randomSeed2);
    Thread.sleep(100L);
    byte[] randomSeed3 = UUID.randomUUID().toString().getBytes();
    byte[] randomSeed4 = ByteUtils.longToBytes(System.nanoTime());
    byte[] bh2 = ByteUtils.concatenate(randomSeed3, randomSeed4);
    return simpleHash256(ByteUtils.concatenate(bh1, bh2));
}

Figure 12: An example showing the inaccurate results – The first result of the query "generate md5"

6 DISCUSSIONS

6.1 Why does DeepCS Work?

We have identified three advantages of DeepCS that may explain its effectiveness in code search:

A unified representation of heterogeneous data. Source code and natural language queries are heterogeneous. By jointly embedding source code and natural language queries into the same vector representation, their similarities can be measured more accurately.

Better query understanding through deep learning. Unlike traditional techniques, DeepCS learns query and source code representations with deep learning. Characteristics of queries, such as semantically related words and word order, are considered in these models [27]. Therefore, it can recognize the semantics of queries and code better. For example, it can distinguish the query queue an event to be run on the thread from the query run an event on the event queue.

Clustering snippets by natural language semantics. An advantage of our approach is that it embeds semantically similar code snippets into vectors that are close to each other, so semantically similar code snippets are grouped according to their semantics. Therefore, in addition to the exactly matching snippets, DeepCS also recommends the semantically related ones.

6.2 Limitation of DeepCS

Despite advantages such as associative search, DeepCS can still return inaccurate results. It sometimes ranks partially relevant results higher than the exactly matching ones. Figure 12 shows the result for the query generate md5. The exactly matching result is ranked 7 in the result list, while partially related results such as generate checksum are recommended before the exact results. This is because DeepCS ranks results by considering only their semantic vectors. In future work, more code features (such as programming context [58]) could be considered in our model to further adjust the results.

6.3 Threats to Validity

Our goal is to improve the performance of code search over GitHub; thus both training and search are performed over the GitHub corpus. There is a threat of overlap between the training and search codebases. To mitigate this threat, in our experiments, the training and search codebases are constructed to be significantly different. The training codebase only contains code that has corresponding descriptions, while the search codebase is considered in isolation and contains all code (including code that does not have descriptions). We believe the threat of overfitting due to this overlap is not significant, as our training codebase considers a vast majority of the code in GitHub. The most important goal of our experiments is to evaluate DeepCS in a real-world code search scenario. For that, we used 50 real queries collected from Stack Overflow to test the effectiveness of DeepCS. These queries are not descriptions/comments of Java methods and are not used for training.
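The clustering point in Section 6.1 comes down to nearest-neighbor retrieval in the joint vector space: after both code and queries are embedded, search is just a cosine-similarity ranking. A minimal sketch of that ranking step, with toy two-dimensional vectors standing in for the outputs of the trained CoNN and DeNN modules (this is our illustration, not the DeepCS implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, code_vecs, k=10):
    """Return the indices of the top-k code vectors most similar to the query."""
    ranked = sorted(range(len(code_vecs)),
                    key=lambda i: cosine(query_vec, code_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy codebase of three pre-computed code vectors (the off-line step).
code_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
# A query vector (the on-line step); the closest snippets come back first.
print(search([0.9, 0.1], code_vecs, k=2))  # [0, 1]
```

Note how the second-ranked snippet is returned even though it is not the closest match, which is exactly the behavior behind associative search: semantically nearby snippets surface alongside the best hit.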
ICSE ’18, May 27-June 3, 2018, Gothenburg, Sweden Xiaodong Gu, Hongyu Zhang, and Sunghun Kim

In our experiments, the relevancy of returned results was manually graded and could suffer from subjectivity bias. To mitigate this threat, (i) the manual analysis was performed independently by two developers and (ii) the developers held an open discussion to resolve conflicting grades for the 50 questions. In the future, we will further mitigate this threat by inviting more developers for the grading.

In the grading of relevancy, we consider only the top 10 results. Queries that fail are uniformly assigned an FRank of 11, which could deviate from the real relevancy of the code snippets. However, we believe that the setting is reasonable. In real-world code search, developers usually inspect the top K results and ignore the remaining ones. That means it does not make much difference whether a code snippet appears at rank 11 or 20 if K is 10.

Like related work (e.g., [14, 41]), we evaluate DeepCS with popular Stack Overflow questions. SO questions may not be representative of all possible queries for code search engines. To mitigate this threat, (i) DeepCS is not trained on SO questions but on a large-scale GitHub corpus, and (ii) we select the most frequently asked questions, which might also be commonly asked by developers in other search engines. In the future, we will extend the scale and scope of the test queries.

7 RELATED WORK

7.1 Code Search

In code search, a line of work has investigated marrying state-of-the-art information retrieval and natural language processing techniques [13–15, 32, 35, 41, 45–47, 61, 81]. Much of the existing work focuses on query expansion and reformulation [29, 31, 44]. For example, Hill et al. [30] reformulated queries with natural language phrasal representations of method signatures. Haiduc et al. [29] proposed to reformulate queries based on machine learning. Their method trains a machine learning model that automatically recommends a reformulation strategy based on the query properties. Lu et al. [44] proposed to extend a query with synonyms generated from WordNet. There is also much work that takes into account code characteristics. For example, McMillan et al. [47] proposed Portfolio, a code search engine that combines keyword matching with PageRank to return a chain of functions. Lv et al. [45] proposed CodeHow, a code search tool that incorporates an extended Boolean model and API matching. Ponzanelli et al. [61] proposed an approach that automatically retrieves pertinent discussions from Stack Overflow given a context in the IDE. Recently, Li et al. [41] proposed RACS, a code search framework for JavaScript that considers relationships (e.g., sequencing, condition, and callback relationships) among the invoked API methods.

As described in Section 6, DeepCS differs from existing code search techniques in that it does not rely on information retrieval techniques. It measures the similarity between code snippets and user queries through joint embedding and deep learning. Thus, it can better understand code and query semantics.

As keyword-based approaches are ineffective at recognizing semantics, researchers have paid increasing attention to semantics-based code search [34, 65, 69]. For example, Reiss [65] proposed semantics-based code search, which uses user specifications to characterize the requirement and uses transformations to adapt the search results. However, Reiss's approach differs significantly from DeepCS. It does not consider the semantics of natural language queries. Furthermore, it requires users to provide not only natural language queries but also other specifications such as method declarations and test cases.

Besides code search, there have been many other information retrieval tasks in software engineering [8, 9, 16, 23, 24, 29, 51, 55, 63, 67] such as bug localization [66, 73, 80], feature localization [19], traceability link recovery [20] and community Question Answering [11]. Ye et al. [80] proposed to embed words into vector representations to bridge the lexical gap between source code and natural language for SE-related text retrieval tasks. Different from DeepCS, the vector representations learned by their method are at the level of individual words and tokens instead of whole query sentences. Their method is based on a bag-of-words assumption, and word sequences are not considered.

7.2 Deep Learning for Source Code

Recently, researchers have investigated possible applications of deep learning techniques to source code [7, 38, 53, 56, 60, 64, 75, 76]. A typical use of deep learning is code generation [42, 54]. For example, Mou et al. [54] proposed to generate code from natural language user intentions using an RNN Encoder-Decoder model. Their results show the feasibility of applying deep learning techniques to code generation from a highly homogeneous dataset (simple programming assignments). Gu et al. [27] apply deep learning to API learning, that is, generating API usage sequences for a given natural language query. They also apply deep learning to migrate APIs between different programming languages [28]. Deep learning has also been applied to code completion [64, 77]. For example, White et al. [77] applied the RNN language model to source code files and showed its effectiveness in predicting software tokens. Recently, White et al. [76] also applied deep learning to code clone detection. Their framework automatically links patterns mined at the lexical level with patterns mined at the syntactic level. In our work, we explore the application of deep learning to code search.

8 CONCLUSION

In this paper, we propose a novel deep neural network named CODEnn for code search. Instead of matching text similarity, CODEnn learns a unified vector representation of both source code and natural language queries so that code snippets semantically related to a query can be retrieved according to their vectors. As a proof-of-concept application, we implement a code search tool, DeepCS, based on the proposed CODEnn model. Our experimental study has shown that the proposed approach is effective and outperforms related approaches. Our source code and data are available at https://github.com/guxd/deep-code-search.

In the future, we will investigate more aspects of source code, such as control structures, to better represent the high-level semantics of source code. The deep neural network we designed may also benefit other software engineering problems such as bug localization.

Acknowledgment: The authors would like to thank Dongmei Zhang at Microsoft Research Asia for her support for this project and insightful comments on the paper.
REFERENCES
[1] Camel case, https://en.wikipedia.org/wiki/camelcase.
[2] Eclipse JDT. http://www.eclipse.org/jdt/.
[3] Github. https://github.com.
[4] Keras. https://keras.io/.
[5] Lucene. https://lucene.apache.org/.
[6] Theano, http://deeplearning.net/software/theano/.
[7] M. Allamanis, H. Peng, and C. Sutton. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning (ICML), 2016.
[8] J. Anvik and G. C. Murphy. Reducing the effort of bug report triage: Recommenders for development-oriented decisions. ACM Transactions on Software Engineering and Methodology (TOSEM), 20(3):10, 2011.
[9] A. Bacchelli, M. Lanza, and R. Robbes. Linking e-mails and source code artifacts. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1, pages 375–384. ACM, 2010.
[10] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[11] O. Barzilay, C. Treude, and A. Zagalsky. Facilitating crowd sourced software engineering via stack overflow. In Finding Source Code on the Web for Remix and Reuse, pages 289–308. Springer, 2013.
[12] T. J. Biggerstaff, B. G. Mitbander, and D. E. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72–82, 1994.
[13] J. Brandt, M. Dontcheva, M. Weskamp, and S. R. Klemmer. Example-centric programming: integrating web search into the development environment. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 513–522. ACM, 2010.
[14] B. A. Campbell and C. Treude. NLP2Code: Code snippet content assist via natural language tasks. arXiv preprint arXiv:1701.05648, 2017.
[15] W.-K. Chan, H. Cheng, and D. Lo. Searching connected API subgraph via text phrases. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, page 10. ACM, 2012.
[16] O. Chaparro and A. Marcus. On the reduction of verbose queries in text retrieval based software maintenance. In Proceedings of the 38th International Conference on Software Engineering Companion, pages 716–718. ACM, 2016.
[30] E. Hill, L. Pollock, and K. Vijay-Shanker. Improving source code search with natural language phrasal representations of method signatures. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pages 524–527. IEEE Computer Society, 2011.
[31] E. Hill, M. Roldan-Vega, J. A. Fails, and G. Mallet. NL-based query refinement and contextualized code search results: A user study. In Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week-IEEE Conference on, pages 34–43. IEEE, 2014.
[32] R. Holmes, R. Cottrell, R. J. Walker, and J. Denzinger. The end-to-end use of source code examples: An exploratory study. In Software Maintenance, 2009. ICSM 2009. IEEE International Conference on, pages 555–558. IEEE, 2009.
[33] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[34] Y. Ke, K. T. Stolee, C. Le Goues, and Y. Brun. Repairing programs with semantic code search (T). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 295–306. IEEE, 2015.
[35] I. Keivanloo, J. Rilling, and Y. Zou. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering, pages 664–675. ACM, 2014.
[36] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[37] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[38] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. Combining deep learning with information retrieval to localize buggy files for bug reports (n). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 476–481. IEEE, 2015.
[39] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196, 2014.
[40] M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014.
[41] X. Li, Z. Wang, Q. Wang, S. Yan, T. Xie, and H. Mei. Relationship-aware code
[17] K. Cho, B. Van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, search for JavaScript frameworks. In Proceedings of the ACM SIGSOFT 24th
and Y. Bengio. Learning phrase representations using RNN encoder–decoder for International Symposium on the Foundations of Software Engineering. ACM, 2016.
statistical machine translation. In Proceedings of the 2014 Conference on Empirical [42] W. Ling, E. Grefenstette, K. M. Hermann, T. Kocisky, A. Senior, F. Wang, and
Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, P. Blunsom. Latent predictor networks for code generation. arXiv preprint
Oct. 2014. Association for Computational Linguistics. arXiv:1603.06744, 2016.
[18] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. [43] E. Linstead, S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, and P. Baldi. Sourcerer:
Natural language processing (almost) from scratch. Journal of Machine Learning mining and searching internet-scale software repositories. Data Mining and
Research, 12(Aug):2493–2537, 2011. Knowledge Discovery, 18:300–336, 2009.
[19] C. S. Corley, K. Damevski, and N. A. Kraft. Exploring the use of deep learning [44] M. Lu, X. Sun, S. Wang, D. Lo, and Y. Duan. Query expansion via wordnet for
for feature location. In Software Maintenance and Evolution (ICSME), 2015 IEEE effective code search. In 2015 IEEE 22nd International Conference on Software
International Conference on, pages 556–560. IEEE, 2015. Analysis, Evolution, and Reengineering (SANER), pages 545–549. IEEE, 2015.
[20] B. Dagenais and M. P. Robillard. Recovering traceability links between an api [45] F. Lv, H. Zhang, J. Lou, S. Wang, D. Zhang, and J. Zhao. CodeHow: Effective code
and its learning resources. In 2012 34th International Conference on Software search based on API understanding and extended boolean model. In Proceedings
Engineering (ICSE), pages 47–57. IEEE, 2012. of the 30th IEEE/ACM International Conference on Automated Software Engineering
[21] M. Feng, B. Xiang, M. R. Glass, L. Wang, and B. Zhou. Applying deep learning to (ASE 2015). IEEE, 2015.
answer selection: A study and an open task. In 2015 IEEE Workshop on Automatic [46] C. McMillan, M. Grechanik, D. Poshyvanyk, C. Fu, and Q. Xie. Exemplar: A source
Speech Recognition and Understanding (ASRU), pages 813–820. IEEE, 2015. code search engine for finding highly relevant applications. IEEE Transactions on
[22] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: Software Engineering, 38(5):1069–1087, 2012.
A deep visual-semantic embedding model. In Advances in neural information [47] C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu. Portfolio: find-
processing systems, pages 2121–2129, 2013. ing relevant functions and their usage. In Proceedings of the 33rd International
[23] X. Ge, D. C. Shepherd, K. Damevski, and E. Murphy-Hill. Design and evaluation Conference on Software Engineering (ICSE’11), pages 111–120. IEEE, 2011.
of a multi-recommendation system for local code search. Journal of Visual [48] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
Languages & Computing, 2016. representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[24] G. Gousios, M. Pinzger, and A. v. Deursen. An exploratory study of the pull-based [49] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur. Recurrent
software development model. In Proceedings of the 36th International Conference neural network based language model. In INTERSPEECH 2010, 11th Annual
on Software Engineering, pages 345–355. ACM, 2014. Conference of the International Speech Communication Association, Makuhari,
[25] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. Chiba, Japan, September 26-30, 2010, pages 1045–1048, 2010.
A novel connectionist system for unconstrained handwriting recognition. IEEE [50] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed rep-
transactions on pattern analysis and machine intelligence, 31(5):855–868, 2009. resentations of words and phrases and their compositionality. In Advances in
[26] M. Grechanik, C. Fu, Q. Xie, C. McMillan, D. Poshyvanyk, and C. Cumby. A neural information processing systems, pages 3111–3119, 2013.
search engine for finding highly relevant applications. In 2010 ACM/IEEE 32nd [51] I. J. Mojica, B. Adams, M. Nagappan, S. Dienst, T. Berger, and A. E. Hassan. A
International Conference on Software Engineering, volume 1, pages 475–484. IEEE, large scale empirical study on software reuse in mobile apps. IEEE Software,
2010. 31(2):78–86, 2014.
[27] X. Gu, H. Zhang, D. Zhang, and S. Kim. Deep API learning. In Proceedings of [52] D. J. Montana and L. Davis. Training feedforward neural networks using genetic
the ACM SIGSOFT 20th International Symposium on the Foundations of Software algorithms. In IJCAI, volume 89, pages 762–767, 1989.
Engineering (FSE’16), 2016. [53] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin. Convolutional neural networks
[28] X. Gu, H. Zhang, D. Zhang, and S. Kim. DeepAM: Migrate APIs with multi-modal over tree structures for programming language processing. In Proceedings of the
sequence to sequence learning. In Proceedings of the Twenty-Sixth International Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 1287–1293.
Joint Conferences on Artifical Intelligence (IJCAI’17), 2017. AAAI Press, 2016.
[29] S. Haiduc, G. Bavota, A. Marcus, R. Oliveto, A. De Lucia, and T. Menzies. Au- [54] L. Mou, R. Men, G. Li, L. Zhang, and Z. Jin. On end-to-end program generation
tomatic query reformulations for text retrieval in software engineering. In from user intention by deep neural networks. arXiv, 2015.
Proceedings of the 2013 International Conference on Software Engineering, pages [55] A. Nederlof, A. Mesbah, and A. v. Deursen. Software engineering for the web:
842–851. IEEE Press, 2013. the state of the practice. In Companion Proceedings of the 36th International
ICSE ’18, May 27-June 3, 2018, Gothenburg, Sweden Xiaodong Gu, Hongyu Zhang, and Sunghun Kim

Conference on Software Engineering, pages 4–13. ACM, 2014. [69] K. T. Stolee, S. Elbaum, and D. Dobos. Solving the search for source code. ACM
[56] T. D. Nguyen, A. T. Nguyen, H. D. Phan, and T. N. Nguyen. Exploring api Transactions on Software Engineering and Methodology (TOSEM), 23(3):26, 2014.
embedding for api usages and applications. In Proceedings of the 39th International [70] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural
Conference on Software Engineering, pages 438–449. IEEE Press, 2017. networks. In Advances in neural information processing systems, pages 3104–3112,
[57] L. Nie, H. Jiang, Z. Ren, Z. Sun, and X. Li. Query expansion based on crowd 2014.
knowledge for code search. IEEE Transactions on Services Computing, PP(99):1–1, [71] M. Tan, B. Xiang, and B. Zhou. Lstm-based deep learning models for non-factoid
2016. answer selection. arXiv preprint arXiv:1511.04108, 2015.
[58] H. Niu, I. Keivanloo, and Y. Zou. Learning to rank code examples for code search [72] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general
engines. Empirical Software Engineering, pages 1–33, 2016. method for semi-supervised learning. In Proceedings of the 48th annual meeting
[59] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. K. Ward. Deep of the association for computational linguistics, pages 384–394. Association for
sentence embedding using the long short term memory network: Analysis and Computational Linguistics, 2010.
application to information retrieval. CoRR, abs/1502.06922, 2015. [73] Y. Uneno, O. Mizuno, and E.-H. Choi. Using a distributed representation of words
[60] H. Peng, L. Mou, G. Li, Y. Liu, L. Zhang, and Z. Jin. Building program vector rep- in localizing relevant files for bug reports. In Software Quality, Reliability and
resentations for deep learning. In Proceedings of the 8th International Conference Security (QRS), 2016 IEEE International Conference on, pages 183–190. IEEE, 2016.
on Knowledge Science, Engineering and Management - Volume 9403, KSEM 2015, [74] J. Weston, S. Bengio, and N. Usunier. Wsabie: scaling up to large vocabulary image
pages 547–553, New York, NY, USA, 2015. Springer-Verlag New York, Inc. annotation. In Proceedings of the Twenty-Second international joint conference on
[61] L. Ponzanelli, G. Bavota, M. Di Penta, R. Oliveto, and M. Lanza. Mining stackover- Artificial Intelligence-Volume Volume Three, pages 2764–2770. AAAI Press, 2011.
flow to turn the ide into a self-confident programming prompter. In Proceedings [75] M. White, M. Tufano, M. Martinez, M. Monperrus, and D. Poshyvanyk. Sorting
of the 11th Working Conference on Mining Software Repositories, pages 102–111. and transforming program repair ingredients via deep learning code similarities.
ACM, 2014. arXiv preprint arXiv:1707.04742, 2017.
[62] M. Raghothaman, Y. Wei, and Y. Hamadi. SWIM: synthesizing what I mean: code [76] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk. Deep learning code frag-
search and idiomatic snippet synthesis. In Proceedings of the 38th International ments for code clone detection. In Proceedings of the 31th IEEE/ACM International
Conference on Software Engineering, pages 357–367. ACM, 2016. Conference on Automated Software Engineering (ASE 2016), 2016.
[63] M. Rahimi and J. Cleland-Huang. Patterns of co-evolution between requirements [77] M. White, C. Vendome, M. Linares-Vásquez, and D. Poshyvanyk. Toward deep
and source code. In 2015 IEEE Fifth International Workshop on Requirements learning software repositories. In Mining Software Repositories (MSR), 2015
Patterns (RePa), pages 25–31. IEEE, 2015. IEEE/ACM 12th Working Conference on, pages 334–345. IEEE, 2015.
[64] V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language [78] R. Xu, C. Xiong, W. Chen, and J. J. Corso. Jointly modeling deep video and
models. In In Proceedings of the 35th ACM SIGPLAN Conference on Programming compositional text to bridge vision and language in a unified framework. In
Language Design and Implementation. ACM, 2014. AAAI, pages 2346–2352. Citeseer, 2015.
[65] S. P. Reiss. Semantics-based code search. In Proceedings of the 31st International [79] X. Ye, R. Bunescu, and C. Liu. Learning to rank relevant files for bug reports
Conference on Software Engineering, pages 243–253. IEEE Computer Society, 2009. using domain knowledge. In Proceedings of the 22nd ACM SIGSOFT International
[66] M. Renieres and S. P. Reiss. Fault localization with nearest neighbor queries. Symposium on Foundations of Software Engineering, pages 689–699. ACM, 2014.
In Automated Software Engineering, 2003. Proceedings. 18th IEEE International [80] X. Ye, H. Shen, X. Ma, R. Bunescu, and C. Liu. From word embeddings to docu-
Conference on, pages 30–39, Oct 2003. ment similarities for improved information retrieval in software engineering. In
[67] P. C. Rigby and M. P. Robillard. Discovering essential code elements in informal Proceedings of the 38th International Conference on Software Engineering, pages
documentation. In Proceedings of the 2013 International Conference on Software 404–415. ACM, 2016.
Engineering, pages 832–841. IEEE Press, 2013. [81] J. Zhou and R. J. Walker. API Deprecation: A retrospective analysis and detection
[68] J. Singer, T. Lethbridge, N. Vinson, and N. Anquetil. An examination of software method for code examples on the web. In Proceedings of the ACM SIGSOFT 20th
engineering work practices. In CASCON First Decade High Impact Papers, pages International Symposium on the Foundations of Software Engineering (FSE’16).
174–188. IBM Corp., 2010. ACM, 2016.