CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
ABSTRACT
Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas.

To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and are presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained from mechanically scraping and preprocessing associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task.

We hope that the CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further, and we will host a competition and leaderboard to track progress on the challenge. We are also keen on extending the CodeSearchNet Challenge to more queries and programming languages in the future.

1 INTRODUCTION
The deep learning revolution has fundamentally changed how we approach perceptive tasks such as image and speech recognition and has shown substantial successes in working with natural language data. These have been driven by the co-evolution of large (labelled) datasets, substantial computational capacity, and a number of advances in machine learning models.

However, deep learning models still struggle on highly structured data. One example is semantic code search: while search on natural language documents and even images has made great progress, searching code is often still unsatisfying. Standard information retrieval methods do not work well in the code search domain, as there is often little shared vocabulary between search terms and results (e.g. consider a method called deserialize_JSON_obj_from_stream that may be a correct result for the query "read JSON data"). Even more problematic is that evaluating methods for this task is extremely hard, as there are no substantial datasets that were created for this task; instead, the community tries to make do with small datasets from related contexts (e.g. pairing questions on web forums to code chunks found in answers).

To enable evaluation on this task, we are releasing the CodeSearchNet Challenge on top of a new CodeSearchNet Corpus. The CodeSearchNet Corpus was programmatically obtained by scraping open-source repositories and pairing individual functions with their (processed) documentation as natural language annotation. It is large enough (2 million datapoints) to enable training of high-capacity deep neural models on the task. We discuss this process in detail in section 2 and also release the data preprocessing pipeline to encourage further research in this area.

The CodeSearchNet Challenge is defined on top of this, providing realistic queries and expert annotations for likely results. Concretely, in version 1.0, it consists of 99 natural language queries paired with likely results for each of six considered programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). Each query/result pair was labeled by a human expert, indicating the relevance of the result for the query. We discuss the methodology in detail in section 3.

Finally, we create a number of baseline methods using a range of state-of-the-art neural sequence processing techniques (bag of words, RNNs, CNNs, attentional models) and evaluate them on our datasets. We discuss these models in section 4 and present some preliminary results.

2 THE CODE SEARCH CORPUS
As it is economically infeasible to create a dataset large enough for training high-capacity models using expert annotations, we instead create a proxy dataset of lower quality. For this, we follow other attempts in the literature [5, 6, 9, 11] and pair functions in open-source software with the natural language present in their respective documentation. However, doing so requires a number of preprocessing steps and heuristics. In the following, we discuss some general principles and decisions driven by in-depth analysis of common error cases.

CodeSearchNet Corpus Collection. We collect the corpus from publicly available open-source non-fork GitHub repositories, using libraries.io to identify all projects that are used by at least one other project, and sort them by "popularity" as indicated by the number of stars and forks. Then, we remove any projects that do not have a license or whose license does not explicitly permit the re-distribution of parts of the project. We then tokenize all Go, Java, JavaScript, Python, PHP and Ruby functions (or methods) using TreeSitter (GitHub's universal parser) and, where available, their respective documentation text using a heuristic regular expression.
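To illustrate the collection step, the following is a minimal sketch of extracting Python functions and their docstrings with tree-sitter. It assumes the py-tree-sitter bindings and the tree_sitter_python grammar package; the traversal and docstring heuristic are illustrative only, and the released pipeline differs in detail.

    from tree_sitter import Language, Parser
    import tree_sitter_python as tspython  # grammar package for Python

    # Parse a function and pair it with its docstring, if one exists.
    parser = Parser(Language(tspython.language()))
    source = b'def read_json(stream):\n    """Read JSON data."""\n    return stream.read()\n'
    tree = parser.parse(source)

    for node in tree.root_node.children:
        if node.type != "function_definition":
            continue
        name = node.child_by_field_name("name").text.decode()
        first = node.child_by_field_name("body").children[0]
        doc = None
        if first.type == "expression_statement" and first.children[0].type == "string":
            doc = first.children[0].text.decode().strip('"\'')
        print(name, "->", doc)  # read_json -> Read JSON data.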
Filtering. To generate training data for the CodeSearchNet Challenge, we first consider only those functions in the corpus
4 BASELINE CODESEARCH MODELS
We implemented a range of baseline models for the code search task, using standard techniques from neural sequence processing and web search.

4.1 Joint Vector Representations for Code Search
Following earlier work [11, 20], we use joint embeddings of code and queries to implement a neural search system. Our architecture employs one encoder per input (natural or programming) language and trains them to map inputs into a single, joint vector space. Our training objective is to map code and the corresponding language onto vectors that are near to each other, as we can then implement a search method by embedding the query and then returning the set of code snippets that are "near" in embedding space. Although more complex models considering more interactions between queries and code can perform better [20], generating a single vector per query/snippet allows for efficient indexing and search.

To learn these embedding functions, we combine standard sequence encoder models in the architecture shown in Figure 3. First, we preprocess the input sequences according to their semantics: identifiers appearing in code tokens are split into subtokens (i.e. a variable camelCase yields two subtokens camel and case), and natural language tokens are split using byte-pair encoding (BPE) [10, 21].
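For illustration, identifier splitting can be done with a small regular expression; this is a minimal sketch under our own naming, not the code of the released pipeline.

    import re

    # Illustrative helper (not from the released pipeline): split an identifier
    # into lowercase subtokens, handling camelCase, ALLCAPS runs and digits.
    SUBTOKEN_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

    def split_identifier(identifier):
        return [m.group(0).lower() for m in SUBTOKEN_RE.finditer(identifier)]

    assert split_identifier("camelCase") == ["camel", "case"]
    assert split_identifier("deserialize_JSON_obj") == ["deserialize", "json", "obj"]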
Then, the token sequences are processed to obtain (contextualized) token embeddings, using one of the following architectures.
• Neural Bag of Words, where each (sub)token is mapped to a learnable embedding (vector representation).
• Bidirectional RNN models, where we employ the GRU cell [7] to summarize the input sequence.
• 1D Convolutional Neural Network over the input sequence of tokens [15].
• Self-Attention, where multi-head attention [22] is used to compute representations of each token in the sequence.
The token embeddings are then combined into a sequence embedding using a pooling function, for which we have implemented mean/max-pooling and an attention-like weighted sum mechanism.
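For concreteness, the weighted sum can be realized as a softmax over learned per-token scores. The sketch below shows one plausible variant (a single learned scoring vector); it is illustrative and not necessarily the exact mechanism used in our released code.

    import torch
    import torch.nn as nn

    class WeightedSumPooling(nn.Module):
        """Pool token embeddings [batch, seq, dim] into a single [batch, dim]
        vector using a softmax over learned per-token scores."""
        def __init__(self, dim):
            super().__init__()
            self.scorer = nn.Linear(dim, 1, bias=False)  # learned scoring vector

        def forward(self, tokens, mask):
            # tokens: [batch, seq, dim]; mask: [batch, seq] bool, False at padding.
            scores = self.scorer(tokens).squeeze(-1).masked_fill(~mask, float("-inf"))
            weights = torch.softmax(scores, dim=-1).unsqueeze(-1)  # [batch, seq, 1]
            return (weights * tokens).sum(dim=1)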
Figure 3: Model Architecture Overview.

Concretely, the training objective maximizes the inner product of the code and query encodings of each pair (ci, di), while minimizing the inner product between each ci and the distractor snippets cj (i ≠ j). Note that we have experimented with other similar objectives (e.g. considering cosine similarity and max-margin approaches) without significant changes in results on our validation dataset. The code for the baselines can be found at https://github.com/github/CodeSearchNet.
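This objective amounts to a softmax classification that selects the correct code snippet for each query among the distractors. A minimal sketch, assuming inner-product similarity and distractors drawn from the same minibatch (function names are illustrative):

    import torch
    import torch.nn.functional as F

    def in_batch_softmax_loss(code_vecs, query_vecs):
        # code_vecs, query_vecs: [batch, dim] encodings of paired (c_i, d_i).
        sims = query_vecs @ code_vecs.t()  # [batch, batch] inner products
        # The correct snippet for query i is code i, i.e. the diagonal.
        targets = torch.arange(sims.size(0), device=sims.device)
        return F.cross_entropy(sims, targets)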
At test time, we index all functions in CodeSearchNet Corpus using Annoy. Annoy offers fast, approximate nearest neighbor indexing and search. The index includes all functions in the CodeSearchNet Corpus, including those that do not have an associated documentation comment.
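A minimal sketch of the indexing and lookup steps with Annoy follows; the dimensionality, tree count, and the random stand-in vectors are illustrative assumptions.

    import random
    from annoy import AnnoyIndex

    DIM = 128  # assumed embedding dimensionality
    # Stand-in for vectors produced by the trained code encoder.
    code_vectors = [[random.random() for _ in range(DIM)] for _ in range(1000)]

    index = AnnoyIndex(DIM, "dot")  # inner-product similarity, matching training
    for i, vec in enumerate(code_vectors):
        index.add_item(i, vec)
    index.build(100)  # number of trees; more trees improve recall

    # At query time: encode the query, then retrieve the nearest functions.
    query_vector = [random.random() for _ in range(DIM)]
    print(index.get_nns_by_vector(query_vector, 10))  # ids of top-10 neighbors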
4.2 ElasticSearch Baseline
In our experiments, we additionally included ElasticSearch, a widely used search engine, using its default parameters. We configured it with an index using two fields for every function in our dataset: the function name, split into subtokens, and the text of the entire function. We use the default ElasticSearch tokenizer.
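A hedged sketch of such an index using the official Python client is shown below; the index name, field names, and client call style (here, the 8.x API) are assumptions rather than our exact configuration.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    # Two analyzed text fields per function, using the default tokenizer.
    es.indices.create(
        index="code-functions",  # illustrative index name
        mappings={"properties": {
            "name_subtokens": {"type": "text"},  # e.g. "deserialize json obj ..."
            "function_text": {"type": "text"},   # the entire function source
        }},
    )
    es.index(index="code-functions", document={
        "name_subtokens": "deserialize json obj from stream",
        "function_text": "def deserialize_JSON_obj_from_stream(s): ...",
    })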
4.3 Evaluation
Following the training/validation/testing data split, we train our baseline models using our objective from above. While it does not directly correspond to the real target task of code search, it has been widely used as a proxy for training similar models [6, 23].

For testing purposes on CodeSearchNet Corpus, we fix a set of 999 distractor snippets cj for each test pair (ci, di) and test all trained models. Table 3 presents the Mean Reciprocal Rank results on this task. Overall, we see that the models achieve relatively good performance on this task, with the self-attention-based model performing best. This is not unexpected, as the self-attention model has the highest capacity of all considered models.
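For reference, Mean Reciprocal Rank over this fixed-distractor setup can be computed as in the following sketch (a generic implementation of the standard metric, not our exact evaluation script):

    import numpy as np

    def mean_reciprocal_rank(query_vecs, code_vecs):
        # Row i of each array encodes test pair (c_i, d_i); each evaluation batch
        # holds 1000 snippets: the correct one plus 999 distractors.
        sims = query_vecs @ code_vecs.T  # [n, n] inner products
        # 1-based rank of the correct (diagonal) snippet within each row.
        ranks = (sims > np.diag(sims)[:, None]).sum(axis=1) + 1
        return float((1.0 / ranks).mean())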
We have also run our baselines on CodeSearchNet Challenge and show the results in Table 4. Here, the neural bag of words model performs very well, whereas the neural models that are stronger on the training task do less well. We note that the bag of words model is particularly good at keyword matching, which seems to be a crucial facility in implementing search methods. This hypothesis is further validated by the fact that the non-neural ElasticSearch-based baseline performs best among all baseline models we have tested. As noted by Cambronero et al. [6], this can be attributed to the fact that the training data constructed from code documentation is not a good match for the code search task.

5 RELATED WORK
Applying machine learning to code has been widely considered [2]. A few academic works have looked into related tasks. First, semantic parsing has received a lot of attention in the NLP community.
Although most approaches are usually aimed towards creating an executable representation of a natural language utterance with a domain-specific language, general-purpose languages have been recently considered by Hashimoto et al. [13], Lin et al. [16], Ling et al. [17], Yin and Neubig [25].

Iyer et al. [14] generate code from natural language within the context of existing methods, whereas Allamanis et al. [3], Alon et al. [4] consider the task of summarizing functions to their names. Finally, Fernandes et al. [9] consider the task of predicting the documentation text from source code.

More related to CodeSearchNet is prior work in code search with deep learning. In the last few years there has been research in this area (Cambronero et al. [6], Gu et al. [11, 12], Yao et al. [23]), and architectures similar to those discussed previously have been shown to work to some extent. Recently, Cambronero et al. [6] looked into the same problem that CodeSearchNet is concerned with and reached conclusions similar to those discussed here. In contrast to the aforementioned works, here we provide a human-annotated dataset of relevance scores and test a few more neural search architectures along with a standard information retrieval baseline.
6 CONCLUSIONS & OPEN CHALLENGES
We hope that CodeSearchNet is a good step towards engaging the machine learning, IR and NLP communities in developing new machine learning models that understand source code and natural language. Although this report emphasizes semantic code search, we look forward to other uses of the presented datasets. There are still plenty of open challenges in this area.
• Our ElasticSearch baseline, which performs traditional keyword-based search, performs quite well. It has the advantage of being able to efficiently use rare terms, which often appear in code. Researching neural methods that can efficiently and accurately represent rare terms will improve performance.
• Code semantics such as control and data flow are not exploited explicitly by existing methods; instead, search methods seem to mainly operate on identifier (such as variable and function) names. How to leverage semantics to improve results remains an open problem.
• Recently, in NLP, pretraining methods such as BERT [8] have found great success. Can similar methods be useful for the encoders considered in this work?
• Our data covers a wide range of general-purpose code queries. However, anecdotal evidence indicates that queries in specific projects are usually more specialized. Adapting search methods to such use cases could yield substantial performance improvements.
• Code quality of the searched snippets was a recurrent issue with our expert annotators. Despite its subjective nature, there seems to be agreement on what constitutes very bad code. Using code quality as an additional signal that allows for filtering of bad results (at least when better results are available) could substantially improve the satisfaction of search users.
REFERENCES
[1] Miltiadis Allamanis. 2018. The Adverse Effects of Code Duplication in Machine Learning Models of Code. arXiv preprint arXiv:1812.06469 (2018).
[2] Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 81.
[3] Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A Convolutional Attention Network for Extreme Summarization of Source Code. In Proceedings of the International Conference on Machine Learning (ICML).
[4] Uri Alon, Omer Levy, and Eran Yahav. 2018. code2seq: Generating sequences from structured representations of code. arXiv preprint arXiv:1808.01400 (2018).
[5] Antonio Valerio Miceli Barone and Rico Sennrich. 2017. A parallel corpus of Python functions and documentation strings for automated code documentation and code generation. arXiv preprint arXiv:1707.02275 (2017).
[6] Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When Deep Learning Met Code Search. arXiv preprint arXiv:1905.03813 (2019).
[7] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Syntax, Semantics and Structure in Statistical Translation (2014).
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[9] Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2018. Structured Neural Summarization. arXiv preprint arXiv:1811.01824 (2018).
[10] Philip Gage. 1994. A new algorithm for data compression. The C Users Journal 12, 2 (1994), 23–38.
[11] Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 933–944.
[12] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API Learning. In Proceedings of the International Symposium on Foundations of Software Engineering (FSE).
[13] Tatsunori B Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S Liang. 2018. A retrieve-and-edit framework for predicting structured outputs. In Advances in Neural Information Processing Systems. 10073–10083.
[14] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2018. Mapping language to code in programmatic context. arXiv preprint arXiv:1808.09588 (2018).
[15] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[16] Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. 2018. NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System. In International Conference on Language Resources and Evaluation.
[17] Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Andrew Senior, Fumin Wang, and Phil Blunsom. 2016. Latent Predictor Networks for Code Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
[18] Cristina V Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. 2017. DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 84.
[19] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
[20] Bhaskar Mitra, Nick Craswell, et al. 2018. An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval 13, 1 (2018), 1–126.
[21] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[23] Ziyu Yao, Jayavardhan Reddy Peddamail, and Huan Sun. 2019. CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning. (2019).
[24] Ziyu Yao, Daniel S Weld, Wei-Peng Chen, and Huan Sun. 2018. StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow. In Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1693–1703.
[25] Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).