A Language-Agnostic Model for Semantic Source Code Labeling

Gelman, Ben; Hoyle, Bryan; Moore, Jessica; Saxe, Joshua; Slater, David

doi:10.1145/3243127.3243132

arXiv:1906.01032 (cs)

[Submitted on 3 Jun 2019]

Abstract:Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under ROC of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and we obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
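
The headline metric in the abstract, a mean area under ROC taken over many tags, can be illustrated with a small sketch. It assumes the mean is an unweighted (macro) average of per-tag AUCs, which the abstract does not state explicitly. The function names, tag names, and toy scores below are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve for one tag, via pairwise ranking."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both positive and negative examples")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(y_true: Dict[str, List[int]],
             y_score: Dict[str, List[float]]) -> float:
    """Unweighted mean of per-tag AUCs (assumed macro average)."""
    aucs = [roc_auc(y_true[tag], y_score[tag]) for tag in y_true]
    return sum(aucs) / len(aucs)


# Toy data: three snippets scored against two hypothetical tags.
y_true = {"python": [1, 0, 1], "sql": [0, 1, 0]}
y_score = {"python": [0.9, 0.2, 0.4], "sql": [0.1, 0.3, 0.6]}
result = mean_auc(y_true, y_score)
```

With a long-tailed tag list, an unweighted mean of this kind gives rare tags the same weight as common ones, which is one reason a per-tag average is a natural choice for this setting. The pairwise loop is quadratic per tag and is meant only to show the definition, not to scale to thousands of tags.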

Comments:	MASES 2018 Publication
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE); Machine Learning (stat.ML)
Cite as:	arXiv:1906.01032 [cs.LG]
	(or arXiv:1906.01032v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1906.01032
Related DOI:	https://doi.org/10.1145/3243127.3243132

Submission history

From: Ben Gelman
[v1] Mon, 3 Jun 2019 19:21:42 UTC (2,229 KB)