[ICSE18] Deep Code Search

Xiaodong Gu 1, Hongyu Zhang 2, and Sunghun Kim 1,3
3 Clova AI Research, NAVER
ABSTRACT

To implement a program functionality, developers can reuse previously written code snippets by searching through a large-scale codebase. Over the years, many code search tools have been proposed to help developers. The existing approaches often treat source code as textual documents and utilize information retrieval models to retrieve relevant code snippets that match a given query. These approaches mainly rely on the textual similarity between source code and natural language queries; they lack a deep understanding of the semantics of queries and source code.

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). Instead of matching text similarity, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that a code snippet and its corresponding description have similar vectors. Using the unified vector representation, code snippets related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized, and irrelevant/noisy keywords in queries can be handled.

As a proof-of-concept application, we implement a code search tool named DeepCS using the proposed CODEnn model. We empirically evaluate DeepCS on a large-scale codebase collected from GitHub. The experimental results show that our approach can effectively retrieve relevant code snippets and outperforms previous techniques.

CCS CONCEPTS

• Software and its engineering → Reusability;

KEYWORDS

code search, deep learning, joint embedding

ACM Reference Format:
Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep Code Search. In ICSE '18: 40th International Conference on Software Engineering, May 27-June 3, 2018, Gothenburg, Sweden. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3180155.3180167

1 INTRODUCTION

Code search is a very common activity in software development practices [57, 68]. To implement a certain functionality, for example, to parse XML files, developers usually search for and reuse previously written code by performing free-text queries over a large-scale codebase.

Many code search approaches have been proposed [13, 15, 29, 31, 32, 35, 44, 45, 47, 62], most of them based on information retrieval (IR) techniques. For example, Linstead et al. [43] proposed Sourcerer, an information retrieval based code search tool that combines the textual content of a program with structural information. McMillan et al. [47] proposed Portfolio, which returns a chain of functions through keyword matching and PageRank. Lu et al. [44] expanded a query with synonyms obtained from WordNet and then performed keyword matching of method signatures. Lv et al. [45] proposed CodeHow, which combines text similarity and API matching through an extended Boolean model.

A fundamental problem of IR-based code search is the mismatch between the high-level intent reflected in natural language queries and the low-level implementation details in source code [12, 46]. Source code and natural language queries are heterogeneous: they may not share common lexical tokens, synonyms, or language structures. Instead, they may only be semantically related. For example, a relevant snippet for the query "read an object from an xml" could be as follows:

public static <S> S deserialize(Class c, File xml) {
    try {
        JAXBContext context = JAXBContext.newInstance(c);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        S deserialized = (S) unmarshaller.unmarshal(xml);
        return deserialized;
    } catch (JAXBException ex) {
        log.error("Error deserializing object from XML", ex);
        return null;
    }
}

Existing approaches may not be able to return this code snippet, as it does not contain keywords such as read and object or their synonyms such as load and instance. Therefore, an effective code search engine requires a higher-level semantic mapping between code and natural language queries. Furthermore, the existing approaches have difficulties in query understanding [27, 29, 45]: they cannot effectively handle irrelevant/noisy keywords in queries [27]. Therefore, an effective code search engine should also be able to understand the semantic meanings of natural language queries and source code in order to improve the accuracy of code search.

In our previous work, we introduced the DeepAPI framework [27], a deep learning based method that learns the semantics of queries and the corresponding API sequences. However, searching source code is much more difficult than generating APIs, because
the semantics of code snippets are related not only to the API sequences but also to other source code aspects such as tokens and method names. For example, DeepAPI could return the same API ImageIO.write for the queries save image as png and save image as jpg. Nevertheless, the actual code snippets for answering the two queries are different in terms of source code tokens. Therefore, the code search problem requires models that can exploit more aspects of the source code.

In this paper, we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network). To bridge the lexical gap between queries and source code, CODEnn jointly embeds code snippets and natural language descriptions into a high-dimensional vector space, in such a way that a code snippet and its corresponding description have similar vectors. With the unified vector representation, code snippets semantically related to a natural language query can be retrieved according to their vectors. Semantically related words can also be recognized, and irrelevant/noisy keywords in queries can be handled.

Using CODEnn, we implement a code search tool, DeepCS, as a proof of concept. DeepCS trains the CODEnn model on a corpus of 18.2 million Java code snippets (in the form of commented methods) from GitHub. Then, it reads code snippets from a codebase and embeds them into vectors using the trained CODEnn model. Finally, when a user query arrives, DeepCS finds the code snippets that have the nearest vectors to the query vector and returns them.

To evaluate the effectiveness of DeepCS, we perform code search on a search codebase using 50 real-world queries obtained from Stack Overflow. Our results show that DeepCS returns more relevant code snippets than two related approaches, namely CodeHow [45] and a conventional Lucene-based code search tool [5]. On average, the first relevant code snippet returned by DeepCS is ranked 3.5, while the first relevant results returned by CodeHow [45] and Lucene [43] are ranked 5.5 and 6.0, respectively. For 76% of the queries, relevant code snippets can be found within the top 5 returned results. The evaluation results confirm the effectiveness of DeepCS.

To our knowledge, we are the first to propose deep learning based code search. The main contributions of our work are as follows:

• We propose a novel deep neural network, CODEnn, to learn a unified vector representation of both source code and natural language queries.
• We develop DeepCS, a tool that utilizes CODEnn to retrieve relevant code snippets for given natural language queries.
• We empirically evaluate DeepCS using a large scale codebase.

The rest of this paper is organized as follows. Section 2 describes the background of deep learning based embedding models. Section 3 describes the proposed deep neural network for code search. Section 4 describes the detailed design of our approach. Section 5 presents the evaluation results. Section 6 discusses our work, followed by Section 7, which presents the related work. We conclude the paper in Section 8.

2 BACKGROUND

Our work adopts recent advanced techniques from deep learning and natural language processing [10, 17, 70]. In this section, we discuss the background of these techniques.

2.1 Embedding Techniques

Embedding (also known as distributed representation [50, 72]) is a technique for learning vector representations of entities such as words, sentences, and images, in such a way that similar entities have vectors close to each other [48, 50].

A typical embedding technique is word embedding, which represents words as fixed-length vectors so that similar words are close to each other in the vector space [48, 50]. For example, suppose the word execute is represented as [0.12, -0.32, 0.01] and the word run is represented as [0.12, -0.31, 0.02]. From their vectors, we can estimate their distance and identify their semantic relation. Word embedding is usually realized using a machine learning model such as CBOW or Skip-Gram [48]. These models build a neural network that captures the relations between a word and its contextual words. The vector representations of words, as parameters of the network, are trained with a text corpus [50].
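As an illustration, the semantic closeness of two such word vectors can be measured with cosine similarity. The sketch below is ours (not part of CODEnn) and uses the example values above:

class WordSimilarity {
    // Cosine similarity between two equal-length vectors: (a · b) / (|a| |b|).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] execute = {0.12, -0.32, 0.01};   // example vector for "execute"
        double[] run     = {0.12, -0.31, 0.02};   // example vector for "run"
        System.out.println(cosine(execute, run)); // about 0.999: semantically close
    }
}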
Likewise, a sentence (i.e., a sequence of words) can also be embedded as a vector [59]. A simple way of sentence embedding is, for example, to view the sentence as a bag of words and add up all its word vectors [39].

[Figure 1: Illustration of the RNN Sentence Embedding. (a) The basic RNN structure with an input layer, a recurrent hidden layer, and an output layer. (b) The RNN unrolled over the sentence "parse xml file": each word w_t is mapped to a vector and the hidden state h_t is updated at each time step.]

2.2 RNN for Sequence Embedding

We now introduce a widely used deep neural network, the Recurrent Neural Network (RNN) [49, 59], for the embedding of sequential data such as natural language sentences. The RNN is a class of neural networks where hidden layers are used recurrently for computation, creating an internal state of the network that records dynamic temporal behavior. Figure 1a shows the basic structure of an RNN. The network includes three layers: an input layer, which maps each input to a vector; a recurrent hidden layer, which recurrently computes and updates a hidden state after reading each input; and an output layer, which utilizes the hidden state for specific tasks. Unlike traditional feedforward neural networks, RNNs can embed sequential inputs such as sentences using their internal memory [25].

Consider a natural language sentence with a sequence of T words s = w_1, ..., w_T. The RNN embeds it through the following computations: it reads the words in the sentence one by one and updates a hidden state at each time step. Each word w_t is first mapped to a d-dimensional vector w_t ∈ R^d by a one-hot representation [72] or word embedding [50]. Then, the hidden state (the values in the hidden layer) h_t is updated at time t by considering the input word vector w_t and the preceding hidden state h_{t-1}:

h_t = tanh(W[h_{t-1}; w_t]), ∀t = 1, 2, ..., T    (1)

where W ∈ R^{2d×d} is the matrix of trainable parameters in the RNN, and tanh is the non-linear activation function of the RNN.
Finally, the embedding vector of the sentence is summarized from the hidden states h_1, ..., h_T. A typical way is to select the last hidden state h_T as the embedding vector. The embedding vector can also be summarized using other computations, such as maxpooling [36]:

s = maxpooling([h_1, ..., h_T])    (2)

Maxpooling is an operation that selects the maximum value in each fixed-size region over a matrix. Figure 2 shows an example of maxpooling over a sequence of hidden vectors h_1, ..., h_T. Each column represents a hidden vector. The window size of each region is set to 1×T in this example. The result is a fixed-length vector whose elements are the maximum values of each row. Maxpooling can capture the most important feature (the one with the highest value) in each region and can transform sentences of variable lengths into a fixed-length vector.

[Figure 2: An example of maxpooling over a sequence of hidden vectors: with a 1×T window, each row (e.g., [1 5 2 0 5] and [8 3 2 4 8]) is pooled to its maximum value.]
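As an illustration of Equation (2), the following sketch (ours, with a 1×T pooling window) reduces a variable-length sequence of hidden vectors to one fixed-length vector:

import java.util.Arrays;

class MaxPooling {
    // Pools T hidden vectors of dimension d into one d-dimensional vector
    // by taking the maximum of each row (i.e., of each vector component).
    static double[] maxPool(double[][] hiddenStates) {   // hiddenStates[t][i]
        int d = hiddenStates[0].length;
        double[] pooled = new double[d];
        Arrays.fill(pooled, Double.NEGATIVE_INFINITY);
        for (double[] h : hiddenStates)
            for (int i = 0; i < d; i++)
                pooled[i] = Math.max(pooled[i], h[i]);
        return pooled;
    }

    public static void main(String[] args) {
        // The two rows from Figure 2, [1 5 2 0 5] and [8 3 2 4 8], pool to [5, 8].
        double[][] h = {{1, 8}, {5, 3}, {2, 2}, {0, 4}, {5, 8}};
        System.out.println(Arrays.toString(maxPool(h)));  // [5.0, 8.0]
    }
}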
Figure 1b shows an example of how an RNN embeds a sentence (e.g., parse an xml file) into a vector. To facilitate understanding, we expand the recurrent hidden layer for each time step. The RNN reads the words in the sentence one by one and updates a hidden state at each time step. When it reads the first word parse, it maps the word into a vector w_1 and computes the current hidden state h_1 using w_1. Then, it reads the second word xml, embeds it into w_2, and updates the hidden state h_1 to h_2 using w_2. The procedure continues until the RNN receives the last word file and outputs the final state h_3. The final state h_3 can be used as the embedding c of the whole sentence.

The embedding of the sentence, i.e., the sentence vector, can be used for specific applications. For example, one can build a language model conditioned on the sentence vector for machine translation [17]. We can also embed two sentences (a question sentence and an answer sentence) and compare their vectors for answer selection [21, 71].

2.3 Joint Embedding of Heterogeneous Data

Suppose there are two heterogeneous data sets X and Y. We want to learn a correlation between them, namely,

f: X → Y    (3)

For example, suppose X is a set of images and Y is a set of natural language sentences; then f can be the correlation between the images and the sentences (i.e., image captioning). Since the two data sources are heterogeneous, it is difficult to discover the correlation f directly. Thus, we need a bridge to connect these two levels of information.

Joint embedding, also known as multi-modal embedding [78], is a technique to jointly embed/correlate heterogeneous data into a unified vector space so that semantically similar concepts across the two modalities occupy nearby regions of the space [33]. The joint embedding of X and Y can be formulated as:

X --ϕ--> V_X → J(V_X, V_Y) ← V_Y <--ψ-- Y    (4)

where ϕ: X → R^d is an embedding function that maps X into a d-dimensional vector space V; ψ: Y → R^d is an embedding function that maps Y into the same vector space V; and J(·, ·) is a similarity measure (e.g., cosine) that scores the matching degrees of V_X and V_Y in order to learn the mapping functions. Through joint embedding, heterogeneous data can be easily correlated through their vectors.
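To illustrate how this formulation supports retrieval, the sketch below embeds a query with ϕ, embeds each candidate with ψ, and ranks the candidates by the similarity J (cosine, as in the earlier sketch). This is our example, not the paper's implementation; the embedding functions are passed in as stand-ins for trained networks:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

class JointEmbeddingRetrieval {
    // Ranks candidates y in Y by J(phi(x), psi(y)) for a query x in X.
    static <X, Y> List<Y> rank(X query, List<Y> candidates,
                               Function<X, double[]> phi, Function<Y, double[]> psi) {
        double[] vx = phi.apply(query);
        List<Y> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble((Y y) -> cosine(vx, psi.apply(y))).reversed());
        return ranked;  // most similar candidates first
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; normA += a[i] * a[i]; normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-10);
    }
}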
Joint embedding has been widely used in many tasks [22, 74, 78]. For example, in computer vision, Karpathy and Fei-Fei [33] use a Convolutional Neural Network (CNN) [22] as ϕ and an RNN as ψ to jointly embed both images and text into the same vector space for labeling images.

3 A DEEP NEURAL NETWORK FOR CODE SEARCH

Inspired by existing joint embedding techniques [21, 22, 33, 78], we propose a novel deep neural network named CODEnn (Code-Description Embedding Neural Network) for the code search problem. Figure 3 illustrates the key idea. Natural language queries and code snippets are heterogeneous and cannot be easily matched according to their lexical tokens. To bridge the gap, CODEnn jointly embeds code snippets and natural language descriptions into a unified vector space so that a query and the corresponding code snippets are embedded into nearby vectors and can be matched by vector similarities.

[Figure 3: An example showing the idea of joint embedding for code and queries. A code snippet (e.g., one that reads a text file line by line) and a corresponding query (e.g., "... line by line") are embedded into nearby points; the yellow points represent query vectors, while the blue points represent code vectors.]

3.1 Architecture

As introduced in Section 2.3, a joint embedding model requires three components: the embedding functions ϕ: X → R^d and ψ: Y → R^d, as well as the similarity measure J(·, ·). CODEnn realizes these components with deep neural networks.

Figure 4 shows the overall architecture of CODEnn. The neural network consists of three modules, each corresponding to a component of joint embedding:

• a code embedding network (CoNN) to embed source code into vectors;
• a description embedding network (DeNN) to embed natural language descriptions into vectors;
• a similarity module that measures the degree of similarity between code and descriptions.

The following subsections describe the detailed design of these modules.

3.1.1 Code Embedding Network. The code embedding network embeds source code into vectors. Source code is not simply plain text. It contains multiple aspects of information such as tokens, control flows, and APIs [46]. In our model, we consider three aspects
of source code: the method name, the API invocation sequence, and the tokens.

[Figure 4: The overall architecture of CODEnn. The code embedding network (CoNN) embeds the method name (e.g., text reader), the API sequence (e.g., Scanner.new, Scanner.next, Scanner.close), and the tokens (e.g., str, buff, close) with RNN/MLP layers followed by max pooling and a fusion layer; the description embedding network (DeNN) embeds a description (e.g., read a text file) with an RNN followed by max pooling; a cosine similarity module compares the resulting vectors.]
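Under this architecture, the code embedding can be sketched as follows. This is our simplified rendering: rnnEncode, mlpEncode, and fuse are toy stand-ins for the trained RNN, MLP, max pooling, and fusion components described above.

class CoNNSketch {
    static final int D = 4;  // toy embedding size

    // Embed the three aspects of a method and fuse them into one code vector.
    static double[] embedCode(String[] methodNameWords, String[] apiSequence, String[] tokens) {
        double[] m = rnnEncode(methodNameWords);  // e.g., ["text", "reader"]
        double[] a = rnnEncode(apiSequence);      // e.g., ["Scanner.new", "Scanner.next", "Scanner.close"]
        double[] t = mlpEncode(tokens);           // e.g., ["str", "buff", "close"]
        return fuse(m, a, t);
    }

    // Stand-ins that produce deterministic toy vectors so the sketch runs;
    // CODEnn learns real RNN/MLP parameters instead.
    static double[] rnnEncode(String[] seq) { return toyVector(String.join(" ", seq)); }
    static double[] mlpEncode(String[] bag) { return toyVector(String.join(" ", bag)); }

    static double[] toyVector(String s) {
        double[] v = new double[D];
        for (int i = 0; i < D; i++) v[i] = Math.sin(s.hashCode() * (i + 1.0));
        return v;
    }

    // A simple element-wise average as a placeholder for the fusion layer.
    static double[] fuse(double[] m, double[] a, double[] t) {
        double[] c = new double[D];
        for (int i = 0; i < D; i++) c[i] = (m[i] + a[i] + t[i]) / 3.0;
        return c;
    }
}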
they discuss the inconsistent labels and relabel them. The procedure repeats until a consensus is reached.

5.1.3 Performance Measure. We use four common metrics to measure the effectiveness of code search, namely, FRank, SuccessRate@k, Precision@k, and Mean Reciprocal Rank (MRR). They are widely used metrics in information retrieval and the code search literature [41, 45, 62, 79].

The FRank (also known as the best hit rank [41]) is the rank of the first hit result in the result list [62]. It is important because users scan the results from top to bottom: a smaller FRank implies lower inspection effort for finding the desired result. We use FRank to assess the effectiveness of a single code search query.

The SuccessRate@k (also known as the success percentage at k [41]) measures the percentage of queries for which at least one correct result exists in the top k ranked results [35, 41, 79]. In our evaluations it is calculated as follows:

SuccessRate@k = (1/|Q|) Σ_{q=1}^{|Q|} δ(FRank_q ≤ k)    (13)

where Q is a set of queries and δ(·) is a function that returns 1 if the input is true and 0 otherwise. SuccessRate@k is important because a better code search engine should allow developers to discover the needed code by inspecting fewer returned results. The higher the metric value, the better the code search performance.

The Precision@k [45, 57] measures the percentage of relevant results in the top k returned results for each query. In our evaluations it is calculated as follows:

Precision@k = (#relevant results in the top k results) / k    (14)

Precision@k is important because developers often inspect multiple results of different usages to learn from [62]. A better code search engine should allow developers to inspect less noisy results. The higher the metric value, the better the code search performance. We evaluate SuccessRate@k and Precision@k for k = 1, 5, and 10; these values reflect the typical sizes of result sets that users would inspect [41].

The MRR [45, 79] is the average of the reciprocal ranks of the results for a set of queries Q. The reciprocal rank of a query is the inverse of the rank of its first hit result [26]. MRR is calculated as follows:

MRR = (1/|Q|) Σ_{q=1}^{|Q|} (1/FRank_q)    (15)

The higher the MRR value, the better the code search performance.
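For concreteness, here is a small sketch (ours) that computes SuccessRate@k and MRR from per-query FRank values, following Equations (13) and (15):

import java.util.List;

class SearchMetrics {
    // Equation (13): fraction of queries whose first hit is ranked within the top k.
    static double successRateAtK(List<Integer> fRanks, int k) {
        long hits = fRanks.stream().filter(r -> r <= k).count();
        return (double) hits / fRanks.size();
    }

    // Equation (15): mean of 1/FRank over all queries.
    static double mrr(List<Integer> fRanks) {
        return fRanks.stream().mapToDouble(r -> 1.0 / r).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Example FRanks of four queries; a failed query is capped at 11,
        // mirroring the conservative treatment described in Section 5.2.
        List<Integer> fRanks = List.of(1, 3, 5, 11);
        System.out.println(successRateAtK(fRanks, 5));  // 0.75
        System.out.println(mrr(fRanks));                // (1 + 1/3 + 1/5 + 1/11) / 4
    }
}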
5.1.4 Comparison Methods. We compare the effectiveness of our approach with CodeHow [45] and a conventional Lucene-based code search tool [5].

CodeHow is a state-of-the-art code search engine proposed recently. It is an information retrieval based code search tool that incorporates an extended Boolean model and API matching. It first retrieves the APIs relevant to a query by matching the query with the API documentation. Then, it searches code by considering both the plain code and the related APIs. Like DeepCS, CodeHow also considers multiple aspects of source code, such as method names and APIs; it combines these aspects using an extended Boolean model [45]. The facts that CodeHow also considers APIs and is also built for large-scale code search make it an ideal baseline for our experiments.

Lucene is a popular, conventional text search engine behind many existing code search tools such as Sourcerer [43]. Sourcerer combines Lucene with code properties such as the FQN (fully qualified name) of entities and code popularity to retrieve code snippets. In our implementation of the Lucene-based code search tool, we consider the FQN heuristic. We did not include the code popularity heuristic (computed using PageRank), as it does not significantly improve code search performance [43].

We use the same experimental setting for CodeHow and the Lucene-based tool as used for evaluating DeepCS.

Table 1: Benchmark Queries and Evaluation Results (FRank of each approach; NF: Not Found within the top 10 returned results; LC: Lucene; CH: CodeHow; DCS: DeepCS)

No. | Question ID | Query | LC | CH | DCS
1 | 309424 | convert an inputstream to a string | 2 | 1 | 1
2 | 157944 | create arraylist from array | NF | NF | 2
3 | 1066589 | iterate through a hashmap | NF | 4 | 1
4 | 363681 | generating random integers in a specific range | NF | 6 | 2
5 | 5585779 | converting string to int in java | NF | 10 | 1
6 | 1005073 | initialization of an array in one line | NF | 4 | 1
7 | 1128723 | how can I test if an array contains a certain value | 6 | 6 | 1
8 | 604424 | lookup enum by string value | 1 | NF | 10
9 | 886955 | breaking out of nested loops in java | NF | NF | NF
10 | 1200621 | how to declare an array | NF | NF | 4
11 | 41107 | how to generate a random alpha-numeric string | NF | 1 | 1
12 | 409784 | what is the simplest way to print a java array | 6 | NF | 1
13 | 109383 | sort a map by values | NF | 1 | 3
14 | 295579 | fastest way to determine if an integer's square root is an integer | NF | NF | NF
15 | 80476 | how can I concatenate two arrays in java | NF | 1 | 1
16 | 326369 | how do I create a java string from the contents of a file | 8 | NF | 5
17 | 1149703 | how can I convert a stack trace to a string | 3 | 1 | 2
18 | 513832 | how do I compare strings in java | 1 | 3 | 1
19 | 3481828 | how to split a string in java | 1 | 1 | 1
20 | 2885173 | how to create a file and write to a file in java | 2 | 1 | NF
21 | 507602 | how can I initialise a static map | 7 | 1 | 2
22 | 223918 | iterating through a collection, avoiding concurrentmodificationexception when removing in loop | 3 | 3 | 2
23 | 415953 | how can I generate an md5 hash | 1 | 3 | 6
24 | 1069066 | get current stack trace in java | 3 | 1 | 1
25 | 2784514 | sort arraylist of custom objects by property | 1 | 1 | 1
26 | 153724 | how to round a number to n decimal places in java | 1 | 1 | 4
27 | 473282 | how can I pad an integer with zeros on the left | NF | 3 | 1
28 | 529085 | how to create a generic array in java | NF | NF | 3
29 | 4716503 | reading a plain text file in java | 4 | NF | 7
30 | 1104975 | a for loop to iterate over enum in java | NF | NF | NF
31 | 3076078 | check if at least two out of three booleans are true | NF | NF | NF
32 | 4105331 | how do I convert from int to string | 2 | 1 | NF
33 | 8172420 | how to convert a char to a string in java | 5 | 10 | 3
34 | 1816673 | how do I check if a file exists in java | 1 | 2 | 1
35 | 4216745 | java string to date conversion | 6 | NF | 1
36 | 1264709 | convert inputstream to byte array in java | 7 | 5 | 1
37 | 1102891 | how to check if a string is numeric in java | 1 | NF | 2
38 | 869033 | how do I copy an object in java | 2 | 1 | 1
39 | 180158 | how do I time a method's execution in java | NF | NF | 2
40 | 5868369 | how to read a large text file line by line using java | 1 | 1 | 1
41 | 858572 | how to make a new list in java | 2 | 1 | 1
42 | 1625234 | how to append text to an existing file in java | 3 | 1 | 1
43 | 2201925 | converting iso 8601-compliant string to date | 3 | 1 | 1
44 | 122105 | what is the best way to filter a java collection | NF | 9 | 2
45 | 5455794 | removing whitespace from strings in java | NF | 3 | 1
46 | 225337 | how do I split a string with any whitespace chars as delimiters | 1 | 1 | 2
47 | 52353 | in java, what is the best way to determine the size of an object | NF | NF | NF
48 | 160970 | how do I invoke a java method when given the method name as a string | 3 | 1 | 2
49 | 207947 | how do I get a platform dependent new line character | 1 | NF | 10
50 | 1026723 | how to convert a map to list in java | 6 | NF | 1

5.2 Results

Table 1 shows the evaluation results of DeepCS and the related approaches for each query in the benchmark. The column Question ID shows the ID of the original Stack Overflow question from which the query comes. The columns LC, CH, and DCS show the FRank achieved by each approach. The symbol 'NF' stands for Not Found, which means
that no relevant result has been returned within the top K results (K = 10).

The results show that DeepCS produces generally more relevant results than Lucene and CodeHow. Figure 8a shows the statistical summary of FRank for the three approaches. The symbol '+' indicates the average FRank achieved by each approach. We conservatively treat the FRank as 11 for queries that fail to obtain relevant results within the top 10 returned results. We observe that DeepCS achieves smaller (better) FRank values than the other two approaches.

[Figure 8: Statistical summaries of (a) FRank and (b) Precision@1 for each approach; '+' marks the average.]

(a) The third result of the query "queue an event to be run on the thread"

public void run() {
    while (!stop) {
        DynamicModelEvent evt;
        while ((evt = eventQueue.poll()) != null) {
            for (DynamicModelListener l : listeners.toArray(new DynamicModelListener[0]))
                l.dynamicModelChanged(evt);
        }
        ......
    }
}

(b) The first result of the query "run an event on the thread queue"

Figure 9: Examples showing the query understanding

public static String toStringWithEncoding(InputStream inputStream, String encoding) {
    if (inputStream == null)
        throw new IllegalArgumentException("inputStream should not be null");
    char[] buffer = new char[BUFFER_SIZE];
    StringBuffer stringBuffer = new StringBuffer();
    BufferedReader bufferedReader = new BufferedReader(
        new InputStreamReader(inputStream, encoding), BUFFER_SIZE);
    int character = -1;
    ......
    return stringBuffer.toString();
}

Figure 10: An example showing the search robustness – The first result of the query "get the content of an input stream as a string using a specified character encoding"

public static <S> S deserialize(Class c, File xml) {
    try {
        JAXBContext context = JAXBContext.newInstance(c);
        Unmarshaller unmarshaller = context.createUnmarshaller();
        S deserialized = (S) unmarshaller.unmarshal(xml);
        return deserialized;
    } catch (JAXBException ex) {
        log.error("Error deserializing object from XML", ex);
        return null;
    }
}

(a) The first result of the query "read an object from an xml file"

public void playVoice(int clearedLines) throws Exception {
    int audiosAvailable = audioLibrary.get(clearedLines).size();
    int audioIndex = rand.nextInt(audiosAvailable);
    audioLibrary.get(clearedLines).get(audioIndex).play();
}

(b) The second result of the query "play a song"

Figure 11: Examples showing the associative search

6 DISCUSSIONS

6.1 Why does DeepCS Work?

We have identified three advantages of DeepCS that may explain its effectiveness in code search:

A unified representation of heterogeneous data. Source code and natural language queries are heterogeneous. By jointly embedding source code and natural language queries into the same vector representation, their similarities can be measured more accurately.

Better query understanding through deep learning. Unlike traditional techniques, DeepCS learns query and source code representations with deep learning. Characteristics of queries, such as semantically related words and word order, are considered in these models [27]. Therefore, it can recognize the semantics of queries and code better. For example, it can distinguish the query queue an event to be run on the thread from the query run an event on the event queue.

Clustering snippets by natural language semantics. An advantage of our approach is that it embeds semantically similar code snippets into vectors that are close to each other, so semantically similar code snippets are grouped according to their semantics. Therefore, in addition to the exactly matching snippets, DeepCS also recommends semantically related ones.

6.2 Limitation of DeepCS

Despite advantages such as associative search, DeepCS may still return inaccurate results. It sometimes ranks partially relevant results higher than exactly matching ones. Figure 12 shows the results for the query generate md5. The exactly matching result is ranked 7 in the result list, while partially related results such as generate checksum are recommended before the exact results. This is because DeepCS ranks results by considering only their semantic vectors. In future work, more code features (such as programming context [58]) could be considered in our model to further adjust the results.

6.3 Threats to Validity

Our goal is to improve the performance of code search over GitHub; thus, both training and search are performed over a GitHub corpus. There is a threat of overlap between the training and search codebases. To mitigate this threat, in our experiments, the training and search codebases are constructed to be significantly different. The training codebase only contains code that has corresponding descriptions, while the search codebase is considered in isolation and contains all code (including code that does not have descriptions). We believe the threat of overfitting due to this overlap is not significant, as our training codebase covers a vast majority of the code on GitHub. The most important goal of our experiments is to evaluate DeepCS in a real-world code search scenario. For that, we used 50 real queries collected from Stack Overflow to test the effectiveness of DeepCS. These queries are not descriptions/comments of Java methods and are not used for training.
In our experiments, the relevancy of the returned results was manually graded and could suffer from subjectivity bias. To mitigate this threat, (i) the manual analysis was performed independently by two developers, and (ii) the developers held an open discussion to resolve conflicting grades for the 50 questions. In the future, we will further mitigate this threat by inviting more developers for the grading.

In the grading of relevancy, we consider only the top 10 results. Queries that fail are uniformly assigned an FRank of 11, which could deviate from the real relevancy of the code snippets. However, we believe that this setting is reasonable. In real-world code search, developers usually inspect the top K results and ignore the remaining ones. That means it does not make much difference whether a code snippet appears at rank 11 or 20 if K is 10.

Like related work (e.g., [14, 41]), we evaluate DeepCS with popular Stack Overflow questions. SO questions may not be representative of all possible queries to code search engines. To mitigate this threat, (i) DeepCS is not trained on SO questions but on a large-scale GitHub corpus, and (ii) we select the most frequently asked questions, which are likely also commonly asked by developers in other search engines. In the future, we will extend the scale and scope of the test queries.

7 RELATED WORK

7.1 Code Search

In code search, a line of work has investigated marrying state-of-the-art information retrieval and natural language processing techniques [13-15, 32, 35, 41, 45-47, 61, 81]. Much of the existing work focuses on query expansion and reformulation [29, 31, 44]. For example, Hill et al. [30] reformulated queries with natural language phrasal representations of method signatures. Haiduc et al. [29] proposed to reformulate queries based on machine learning. Their method trains a machine learning model that automatically recommends a reformulation strategy based on the query properties. Lu et al. [44] proposed to extend a query with synonyms generated from WordNet. There is also much work that takes code characteristics into account. For example, McMillan et al. [47] proposed Portfolio, a code search engine that combines keyword matching with PageRank to return a chain of functions. Lv et al. [45] proposed CodeHow, a code search tool that incorporates an extended Boolean model and API matching. Ponzanelli et al. [61] proposed an approach that automatically retrieves pertinent discussions from Stack Overflow given a context in the IDE. Recently, Li et al. [41] proposed RACS, a code search framework for JavaScript that considers relationships (e.g., sequencing, condition, and callback relationships) among the invoked API methods.

As described in Section 6, DeepCS differs from existing code search techniques in that it does not rely on information retrieval techniques; it measures the similarity between code snippets and user queries through joint embedding and deep learning. Thus, it can better understand code and query semantics.

As keyword-based approaches are inefficient at recognizing semantics, researchers have paid increasing attention to semantics-based code search [34, 65, 69]. For example, Reiss [65] proposed semantics-based code search, which uses user specifications to characterize the requirement and uses transformations to adapt the search results. However, Reiss's approach differs significantly from DeepCS. It does not consider the semantics of natural language queries. Furthermore, it requires users to provide not only natural language queries but also other specifications such as method declarations and test cases.

Besides code search, there have been many other information retrieval tasks in software engineering [8, 9, 16, 23, 24, 29, 51, 55, 63, 67], such as bug localization [66, 73, 80], feature localization [19], traceability link recovery [20], and community question answering [11]. Ye et al. [80] proposed to embed words into vector representations to bridge the lexical gap between source code and natural language for SE-related text retrieval tasks. Different from DeepCS, the vector representations learned by their method are at the level of individual words and tokens instead of whole query sentences. Their method is based on a bag-of-words assumption, and word sequences are not considered.

7.2 Deep Learning for Source Code

Recently, researchers have investigated possible applications of deep learning techniques to source code [7, 38, 53, 56, 60, 64, 75, 76]. A typical use of deep learning is code generation [42, 54]. For example, Mou et al. [54] proposed to generate code from natural language user intentions using an RNN Encoder-Decoder model. Their results show the feasibility of applying deep learning techniques to code generation from a highly homogeneous dataset (simple programming assignments). Gu et al. [27] apply deep learning to API learning, that is, generating API usage sequences for a given natural language query. They also apply deep learning to migrate APIs between different programming languages [28]. Deep learning has also been applied to code completion [64, 77]. For example, White et al. [77] applied the RNN language model to source code files and showed its effectiveness in predicting software tokens. Recently, White et al. [76] also applied deep learning to code clone detection. Their framework automatically links patterns mined at the lexical level with patterns mined at the syntactic level. In our work, we explore the application of deep learning to code search.

8 CONCLUSION

In this paper, we propose a novel deep neural network named CODEnn for code search. Instead of matching text similarity, CODEnn learns a unified vector representation of both source code and natural language queries so that code snippets semantically related to a query can be retrieved according to their vectors. As a proof-of-concept application, we implement a code search tool, DeepCS, based on the proposed CODEnn model. Our experimental study has shown that the proposed approach is effective and outperforms related approaches. Our source code and data are available at https://github.com/guxd/deep-code-search.

In the future, we will investigate more aspects of source code, such as control structures, to better represent the high-level semantics of source code. The deep neural network we designed may also benefit other software engineering problems such as bug localization.

Acknowledgment: The authors would like to thank Dongmei Zhang at Microsoft Research Asia for her support for this project and insightful comments on the paper.
REFERENCES
[1] Camel case. https://en.wikipedia.org/wiki/camelcase.
[2] Eclipse JDT. http://www.eclipse.org/jdt/.
[3] GitHub. https://github.com.
[4] Keras. https://keras.io/.
[5] Lucene. https://lucene.apache.org/.
[6] Theano. http://deeplearning.net/software/theano/.
[7] M. Allamanis, H. Peng, and C. Sutton. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning (ICML), 2016.
[8] J. Anvik and G. C. Murphy. Reducing the effort of bug report triage: Recommenders for development-oriented decisions. ACM Transactions on Software Engineering and Methodology (TOSEM), 20(3):10, 2011.
[9] A. Bacchelli, M. Lanza, and R. Robbes. Linking e-mails and source code artifacts. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, pages 375-384. ACM, 2010.
[10] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[11] O. Barzilay, C. Treude, and A. Zagalsky. Facilitating crowd sourced software engineering via Stack Overflow. In Finding Source Code on the Web for Remix and Reuse, pages 289-308. Springer, 2013.
[12] T. J. Biggerstaff, B. G. Mitbander, and D. E. Webster. Program understanding and the concept assignment problem. Communications of the ACM, 37(5):72-82, 1994.
[13] J. Brandt, M. Dontcheva, M. Weskamp, and S. R. Klemmer. Example-centric programming: integrating web search into the development environment. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 513-522. ACM, 2010.
[14] B. A. Campbell and C. Treude. NLP2Code: Code snippet content assist via natural language tasks. arXiv preprint arXiv:1701.05648, 2017.
[15] W.-K. Chan, H. Cheng, and D. Lo. Searching connected API subgraph via text phrases. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, page 10. ACM, 2012.
[16] O. Chaparro and A. Marcus. On the reduction of verbose queries in text retrieval based software maintenance. In Proceedings of the 38th International Conference on Software Engineering Companion, pages 716-718. ACM, 2016.
[17] K. Cho, B. Van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724-1734, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.
[18] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493-2537, 2011.
[19] C. S. Corley, K. Damevski, and N. A. Kraft. Exploring the use of deep learning for feature location. In Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on, pages 556-560. IEEE, 2015.
[20] B. Dagenais and M. P. Robillard. Recovering traceability links between an API and its learning resources. In 2012 34th International Conference on Software Engineering (ICSE), pages 47-57. IEEE, 2012.
[21] M. Feng, B. Xiang, M. R. Glass, L. Wang, and B. Zhou. Applying deep learning to answer selection: A study and an open task. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 813-820. IEEE, 2015.
[22] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121-2129, 2013.
[23] X. Ge, D. C. Shepherd, K. Damevski, and E. Murphy-Hill. Design and evaluation of a multi-recommendation system for local code search. Journal of Visual Languages & Computing, 2016.
[24] G. Gousios, M. Pinzger, and A. v. Deursen. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, pages 345-355. ACM, 2014.
[25] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855-868, 2009.
[26] M. Grechanik, C. Fu, Q. Xie, C. McMillan, D. Poshyvanyk, and C. Cumby. A search engine for finding highly relevant applications. In 2010 ACM/IEEE 32nd International Conference on Software Engineering, volume 1, pages 475-484. IEEE, 2010.
[27] X. Gu, H. Zhang, D. Zhang, and S. Kim. Deep API learning. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE'16), 2016.
[28] X. Gu, H. Zhang, D. Zhang, and S. Kim. DeepAM: Migrate APIs with multi-modal sequence to sequence learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI'17), 2017.
[29] S. Haiduc, G. Bavota, A. Marcus, R. Oliveto, A. De Lucia, and T. Menzies. Automatic query reformulations for text retrieval in software engineering. In Proceedings of the 2013 International Conference on Software Engineering, pages 842-851. IEEE Press, 2013.
[30] E. Hill, L. Pollock, and K. Vijay-Shanker. Improving source code search with natural language phrasal representations of method signatures. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pages 524-527. IEEE Computer Society, 2011.
[31] E. Hill, M. Roldan-Vega, J. A. Fails, and G. Mallet. NL-based query refinement and contextualized code search results: A user study. In Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE), 2014 Software Evolution Week - IEEE Conference on, pages 34-43. IEEE, 2014.
[32] R. Holmes, R. Cottrell, R. J. Walker, and J. Denzinger. The end-to-end use of source code examples: An exploratory study. In Software Maintenance, 2009. ICSM 2009. IEEE International Conference on, pages 555-558. IEEE, 2009.
[33] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128-3137, 2015.
[34] Y. Ke, K. T. Stolee, C. Le Goues, and Y. Brun. Repairing programs with semantic code search (T). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 295-306. IEEE, 2015.
[35] I. Keivanloo, J. Rilling, and Y. Zou. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering, pages 664-675. ACM, 2014.
[36] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[37] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[38] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. Combining deep learning with information retrieval to localize buggy files for bug reports (N). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 476-481. IEEE, 2015.
[39] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188-1196, 2014.
[40] M. Li, T. Zhang, Y. Chen, and A. J. Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 661-670. ACM, 2014.
[41] X. Li, Z. Wang, Q. Wang, S. Yan, T. Xie, and H. Mei. Relationship-aware code search for JavaScript frameworks. In Proceedings of the ACM SIGSOFT 24th International Symposium on the Foundations of Software Engineering. ACM, 2016.
[42] W. Ling, E. Grefenstette, K. M. Hermann, T. Kocisky, A. Senior, F. Wang, and P. Blunsom. Latent predictor networks for code generation. arXiv preprint arXiv:1603.06744, 2016.
[43] E. Linstead, S. Bajracharya, T. Ngo, P. Rigor, C. Lopes, and P. Baldi. Sourcerer: mining and searching internet-scale software repositories. Data Mining and Knowledge Discovery, 18:300-336, 2009.
[44] M. Lu, X. Sun, S. Wang, D. Lo, and Y. Duan. Query expansion via WordNet for effective code search. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pages 545-549. IEEE, 2015.
[45] F. Lv, H. Zhang, J. Lou, S. Wang, D. Zhang, and J. Zhao. CodeHow: Effective code search based on API understanding and extended Boolean model. In Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015). IEEE, 2015.
[46] C. McMillan, M. Grechanik, D. Poshyvanyk, C. Fu, and Q. Xie. Exemplar: A source code search engine for finding highly relevant applications. IEEE Transactions on Software Engineering, 38(5):1069-1087, 2012.
[47] C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu. Portfolio: finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering (ICSE'11), pages 111-120. IEEE, 2011.
[48] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[49] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045-1048, 2010.
[50] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[51] I. J. Mojica, B. Adams, M. Nagappan, S. Dienst, T. Berger, and A. E. Hassan. A large scale empirical study on software reuse in mobile apps. IEEE Software, 31(2):78-86, 2014.
[52] D. J. Montana and L. Davis. Training feedforward neural networks using genetic algorithms. In IJCAI, volume 89, pages 762-767, 1989.
[53] L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 1287-1293. AAAI Press, 2016.
[54] L. Mou, R. Men, G. Li, L. Zhang, and Z. Jin. On end-to-end program generation from user intention by deep neural networks. arXiv preprint, 2015.
[55] A. Nederlof, A. Mesbah, and A. v. Deursen. Software engineering for the web: the state of the practice. In Companion Proceedings of the 36th International Conference on Software Engineering, pages 4-13. ACM, 2014.
[56] T. D. Nguyen, A. T. Nguyen, H. D. Phan, and T. N. Nguyen. Exploring API embedding for API usages and applications. In Proceedings of the 39th International Conference on Software Engineering, pages 438-449. IEEE Press, 2017.
[57] L. Nie, H. Jiang, Z. Ren, Z. Sun, and X. Li. Query expansion based on crowd knowledge for code search. IEEE Transactions on Services Computing, PP(99):1-1, 2016.
[58] H. Niu, I. Keivanloo, and Y. Zou. Learning to rank code examples for code search engines. Empirical Software Engineering, pages 1-33, 2016.
[59] H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. K. Ward. Deep sentence embedding using the long short term memory network: Analysis and application to information retrieval. CoRR, abs/1502.06922, 2015.
[60] H. Peng, L. Mou, G. Li, Y. Liu, L. Zhang, and Z. Jin. Building program vector representations for deep learning. In Proceedings of the 8th International Conference on Knowledge Science, Engineering and Management - Volume 9403, KSEM 2015, pages 547-553, New York, NY, USA, 2015. Springer-Verlag New York, Inc.
[61] L. Ponzanelli, G. Bavota, M. Di Penta, R. Oliveto, and M. Lanza. Mining StackOverflow to turn the IDE into a self-confident programming prompter. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 102-111. ACM, 2014.
[62] M. Raghothaman, Y. Wei, and Y. Hamadi. SWIM: synthesizing what I mean: code search and idiomatic snippet synthesis. In Proceedings of the 38th International Conference on Software Engineering, pages 357-367. ACM, 2016.
[63] M. Rahimi and J. Cleland-Huang. Patterns of co-evolution between requirements and source code. In 2015 IEEE Fifth International Workshop on Requirements Patterns (RePa), pages 25-31. IEEE, 2015.
[64] V. Raychev, M. Vechev, and E. Yahav. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, 2014.
[65] S. P. Reiss. Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering, pages 243-253. IEEE Computer Society, 2009.
[66] M. Renieres and S. P. Reiss. Fault localization with nearest neighbor queries. In Automated Software Engineering, 2003. Proceedings. 18th IEEE International Conference on, pages 30-39, Oct 2003.
[67] P. C. Rigby and M. P. Robillard. Discovering essential code elements in informal documentation. In Proceedings of the 2013 International Conference on Software Engineering, pages 832-841. IEEE Press, 2013.
[68] J. Singer, T. Lethbridge, N. Vinson, and N. Anquetil. An examination of software engineering work practices. In CASCON First Decade High Impact Papers, pages 174-188. IBM Corp., 2010.
[69] K. T. Stolee, S. Elbaum, and D. Dobos. Solving the search for source code. ACM Transactions on Software Engineering and Methodology (TOSEM), 23(3):26, 2014.
[70] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.
[71] M. Tan, B. Xiang, and B. Zhou. LSTM-based deep learning models for non-factoid answer selection. arXiv preprint arXiv:1511.04108, 2015.
[72] J. Turian, L. Ratinov, and Y. Bengio. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384-394. Association for Computational Linguistics, 2010.
[73] Y. Uneno, O. Mizuno, and E.-H. Choi. Using a distributed representation of words in localizing relevant files for bug reports. In Software Quality, Reliability and Security (QRS), 2016 IEEE International Conference on, pages 183-190. IEEE, 2016.
[74] J. Weston, S. Bengio, and N. Usunier. WSABIE: scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, pages 2764-2770. AAAI Press, 2011.
[75] M. White, M. Tufano, M. Martinez, M. Monperrus, and D. Poshyvanyk. Sorting and transforming program repair ingredients via deep learning code similarities. arXiv preprint arXiv:1707.04742, 2017.
[76] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016), 2016.
[77] M. White, C. Vendome, M. Linares-Vásquez, and D. Poshyvanyk. Toward deep learning software repositories. In Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on, pages 334-345. IEEE, 2015.
[78] R. Xu, C. Xiong, W. Chen, and J. J. Corso. Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In AAAI, pages 2346-2352. Citeseer, 2015.
[79] X. Ye, R. Bunescu, and C. Liu. Learning to rank relevant files for bug reports using domain knowledge. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 689-699. ACM, 2014.
[80] X. Ye, H. Shen, X. Ma, R. Bunescu, and C. Liu. From word embeddings to document similarities for improved information retrieval in software engineering. In Proceedings of the 38th International Conference on Software Engineering, pages 404-415. ACM, 2016.
[81] J. Zhou and R. J. Walker. API deprecation: A retrospective analysis and detection method for code examples on the web. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (FSE'16). ACM, 2016.