mining-soft-eg-data-github
Abstract—GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and to present pitfall avoidance strategies.

Index Terms—GitHub; GHTorrent; empirical software engineering; Git

I. DESCRIPTION

GitHub is a collaborative code hosting site built on top of the Git version control system. It includes a variety of features that encourage teamwork and continued discussion over the life of a project. GitHub uses a “fork & pull” collaboration model [1], where developers create their own copies of a repository and submit pull requests when they want the project maintainer to incorporate their changes into the project’s main branch, thus providing an environment in which people can easily conduct code reviews. Every repository can optionally use GitHub’s issue tracking system to report and discuss bugs and other concerns. GitHub also contains integrated social features: users are able to subscribe to updates by “watching” projects and “following” other users, resulting in a constant stream of data about people and projects of interest. The system supports user profiles that provide a summary of a person’s recent activity within the site, such as their commits, the projects they forked or the issues they reported.

Due to a combination of reasons such as data availability, data homogeneity and volume, GitHub has become both the target of choice and the source of data for various research efforts, ranging from distributed collaboration [1] to deep learning on software data [2]. However, GitHub data do not come for free for researchers. First, GitHub imposes limits on its API, which, given the volume of interesting projects, can significantly delay data acquisition; the technicalities of retrieval further complicate the acquisition process. Moreover, there is no data schema (GitHub only exposes its data as JSON responses through a REST API), and the data represent only the current project state. In addition, the selection and filtering of GitHub data pose threats to the quality of the study itself; as a result, many GitHub-bound studies suffer from data validity issues [3], [4].

A notable archiving attempt, which emerged through the repository mining community, is the GHTorrent project [5], a scalable, offline mirror of all data offered through the GitHub API. GHTorrent follows the GitHub event stream and systematically retrieves from it all data, their metadata and their dependencies. It then processes and stores all retrieved items in a relational database, while also storing the original data in a MongoDB database. GHTorrent offers interested researchers both downloads of the corresponding database dumps (currently, >15 TB of data) and online access facilities (including live database access and Google BigQuery).¹ GHTorrent has been very successful: more than 200 researchers have subscribed to and used the online access points, 33% of all empirical research publications on GitHub are based on it [4], and it is being used in production by companies such as Microsoft.

As GHTorrent is becoming the de facto standard dataset for large-scale quantitative analysis of GitHub data, we believe that it is crucial for researchers to know how to use GHTorrent to sample projects, how to treat the data and how to avoid common pitfalls in order to minimize the risk of doing unsound research.

In the tutorial, we address the following topics.
• GitHub data collection strategies, including querying the API and using online services such as GitHub Archive and GHTorrent.
• Using GHTorrent to sample appropriate repositories for various types of research questions.
• Writing, managing, and optimizing complex and expensive relational queries on GHTorrent relational data.
• Using GHTorrent effectively: understanding the data collection challenges and avoiding common pitfalls.
• Copyright and privacy issues when using the GitHub data.

II. SPEAKER BIOGRAPHIES

The two speakers have pioneered the use of data from GitHub in software engineering research with the introduction of the GHTorrent data collection framework [6]. They have both been active in GitHub-related research ever since. Apart from creating GHTorrent, the speakers have studied the pull-request based distributed software development practice [1],

¹ https://bigquery.cloud.google.com/queries/ghtorrent-bq
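As an illustration of the API limits discussed in Section I, the following minimal Python sketch shows one way a collector might pause when its request quota is spent. It is not the GHTorrent implementation. It assumes the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers of the GitHub REST API; the function names `seconds_to_wait` and `fetch_json` are hypothetical, and retries, pagination and error handling are omitted.

```python
import json
import time
import urllib.request


def seconds_to_wait(headers, now=None):
    """Return how long to pause before the next request, in seconds.

    `headers` is a mapping of response headers. The sketch assumes the
    remaining quota is in X-RateLimit-Remaining and the reset time
    (UTC epoch seconds) is in X-RateLimit-Reset.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    if remaining > 0:
        return 0.0
    # Quota exhausted: wait until the reset time (never a negative duration).
    return max(0.0, reset_at - now)


def fetch_json(url, token=None):
    """Fetch one API resource as parsed JSON, pausing if the quota is spent.

    Hypothetical helper for illustration only.
    """
    request = urllib.request.Request(url)
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
        # Sleep only when this response reports that no requests remain.
        time.sleep(seconds_to_wait(response.headers))
    return body
```

A caller would invoke `fetch_json` once per resource URL. Because each response describes only the current state of the resource, a study that needs history has to collect it from the event stream over time, which is the gap that archives such as GHTorrent aim to fill.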