
mining-soft-eg-data-github

The document discusses the challenges and methodologies for mining software engineering data from GitHub, highlighting the GHTorrent project as a key resource for researchers. It emphasizes the importance of understanding data collection strategies and avoiding common pitfalls to ensure valid research outcomes. The tutorial aims to equip researchers with the necessary knowledge to effectively utilize GitHub data for empirical studies.


2017 IEEE/ACM 39th IEEE International Conference on Software Engineering Companion

Mining Software Engineering Data from GitHub


Georgios Gousios
Department of Software Technology
Delft University of Technology
Delft, The Netherlands
[email protected]

Diomidis Spinellis
Department of Management Science and Technology
Athens University of Economics and Business
Athens, Greece
[email protected]

978-1-5386-1589-8/17 $31.00 © 2017 IEEE
DOI 10.1109/ICSE-C.2017.164

Abstract—GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.

Index Terms—GitHub; GHTorrent; empirical software engineering; Git

I. DESCRIPTION

GitHub is a collaborative code hosting site built on top of the Git version control system. It includes a variety of features that encourage teamwork and continued discussion over the life of a project. GitHub uses a “fork & pull” collaboration model [1], where developers create their own copies of a repository and submit pull requests when they want the project maintainer to incorporate their changes into the project’s main branch, thus providing an environment in which people can easily conduct code reviews. Every repository can optionally use GitHub’s issue tracking system to report and discuss bugs and other concerns. GitHub also contains integrated social features: users are able to subscribe to updates by “watching” projects and “following” other users, resulting in a constant stream of data about people and projects of interest. The system supports user profiles that provide a summary of a person’s recent activity within the site, such as their commits, the projects they forked or the issues they reported.

Due to a combination of reasons such as data availability, data homogeneity and volume, GitHub has become both the target of choice and the source of data for various research efforts, ranging from distributed collaboration [1] to deep learning on software data [2]. However, GitHub data do not come for free for researchers: to begin with, GitHub imposes limits on its API, which, given the volume of interesting projects, can put a significant delay on data acquisition; the technicalities of the retrieval further complicate the acquisition process. Moreover, there is no data schema (GitHub only exposes its data as JSON responses through a REST API), while the data only represent the current project state. In addition, the selection and filtering of GitHub data pose threats to the quality of the study itself; as a result, many GitHub-bound studies suffer from data validity issues [3], [4].

A notable archiving attempt, which emerged through the repository mining community, is the GHTorrent project [5], a scalable, off-line mirror of all data offered through the GitHub API. GHTorrent follows the GitHub event stream and systematically retrieves from it all data, their metadata and their dependencies. It then processes and stores all retrieved items in a relational database, while also storing the original data in a MongoDB database. GHTorrent offers to interested researchers both downloads of the corresponding database dumps (currently, >15 TB of data) and online access facilities (including live database access and Google BigQuery).¹ GHTorrent has been very successful: more than 200 researchers have subscribed to and used the online access points, 33% of all empirical research publications on GitHub are based on it [4], and it is used in production by companies such as Microsoft.

As GHTorrent is becoming the de facto standard dataset for large-scale quantitative analysis of GitHub data, we believe that it is crucial for researchers to know how to use GHTorrent to sample for projects, how to treat the data and how to avoid common pitfalls in order to minimize the risk of doing unsound research.

In the tutorial, we address the following topics.
• GitHub data collection strategies, including querying the API, using online services such as GitHub Archive and GHTorrent.
• Using GHTorrent to sample appropriate repositories for various types of research questions.
• Writing, managing, and optimizing complex and expensive relational queries on GHTorrent relational data.
• Using GHTorrent effectively: understanding the data collection challenges and avoiding common pitfalls.
• Copyright and privacy issues when using the GitHub data.

¹ https://bigquery.cloud.google.com/queries/ghtorrent-bq

II. SPEAKER BIOGRAPHIES

The two speakers have pioneered the use of data from GitHub in software engineering research with the introduction of the GHTorrent data collection framework [6]. They have both been active in GitHub-related research ever since. Apart from creating GHTorrent, the speakers have studied the pull-request-based distributed software development practice [1],
co-created an automated work-prioritization framework for pull-request-based projects [7], co-examined the build and test practices of projects on Travis CI [8], and co-documented the pitfalls of using GitHub-based datasets [3].

Georgios Gousios is an assistant professor at the Web Information Systems group, Delft University of Technology. His research interests include software engineering, software analytics and programming languages. He works in the fields of distributed software development processes, software quality, software testing, developer productivity assessment, research infrastructures and software security. His research has been published in top venues, where he has received four best paper awards and various nominations. In total, he has published more than 50 papers and also co-edited the “Beautiful Architectures” book (O’Reilly, 2009). He is the main author of the GHTorrent data collection and curation framework and the Alitheia Core repository mining platform. Georgios holds an MSc from the University of Manchester and a PhD from the Athens University of Economics and Business, both in software engineering. In addition to research, he is also active as a speaker, both in research and practitioner-oriented conferences.

Diomidis Spinellis is a Professor in the Department of Management Science and Technology at the Athens University of Economics and Business, Greece. His research interests include software engineering, IT security, and cloud systems engineering. He has written two award-winning, widely-translated books: Code Reading and Code Quality: The Open Source Perspective. His most recent book is Effective Debugging: 66 Specific Ways to Debug Software and Systems, which was published as part of Addison-Wesley’s Effective Software Development Series in 2016. Dr. Spinellis has also published more than 200 technical papers in journals and refereed conference proceedings, which have received more than 5000 citations. He served for a decade as a member of the IEEE Software editorial board, authoring the regular Tools of the Trade column. He has contributed code that ships with macOS and BSD Unix and is the developer of CScout, UMLGraph, ckjm, dgsh, and other open-source software packages, libraries, and tools. He holds an MEng in Software Engineering and a PhD in Computer Science, both from Imperial College London. Dr. Spinellis has served as an elected member of the IEEE Computer Society Board of Governors (2013–2015), and is a senior member of the ACM and the IEEE. Since January 2015, he has been serving as the Editor-in-Chief of IEEE Software.

III. RELATED PRESENTATIONS

• The GHTorrent dataset and toolsuite²
• Mining GitHub for Fun and Profit³
• The issue #32 incident⁴
• Working Effectively with Pull Requests⁵

² https://speakerdeck.com/gousiosg/the-ghtorrent-dataset-and-toolsuite
³ https://speakerdeck.com/gousiosg/mining-github-for-fun-and-profit
⁴ https://speakerdeck.com/gousiosg/the-number-issue32-incident
⁵ https://speakerdeck.com/gousiosg/working-effectively-with-pull-requests

Acknowledgement

The project associated with this work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732223.

REFERENCES

[1] G. Gousios, M. Pinzger, and A. v. Deursen, “An exploratory study of the pull-based software development model,” in ICSE 2014: Proceedings of the 36th International Conference on Software Engineering. New York, NY, USA: ACM, 2014, pp. 345–355.
[2] X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep API learning,” in FSE 2016: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. New York, NY, USA: ACM, 2016, pp. 631–642.
[3] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. German, and D. Damian, “An in-depth study of the promises and perils of mining GitHub,” Empirical Software Engineering, vol. 21, no. 5, pp. 2035–2071, 2016.
[4] V. Cosentino, J. Luis, and J. Cabot, “Findings from GitHub: methods, datasets and limitations,” in MSR 2016: Proceedings of the 13th International Workshop on Mining Software Repositories. ACM, 2016, pp. 137–141.
[5] G. Gousios, “The GHTorrent dataset and tool suite,” in MSR 2013: Proceedings of the 10th Working Conference on Mining Software Repositories, May 2013, pp. 233–236.
[6] G. Gousios and D. Spinellis, “GHTorrent: GitHub’s data from a firehose,” in MSR 2012: Proceedings of the 9th Working Conference on Mining Software Repositories, M. W. Godfrey and J. Whitehead, Eds. IEEE, Jun. 2012, pp. 12–21.
[7] E. van der Veen, G. Gousios, and A. Zaidman, “Automatically prioritizing pull requests,” in MSR 2015: Proceedings of the 12th Working Conference on Mining Software Repositories, May 2015, pp. 357–361.
[8] M. Beller, G. Gousios, and A. Zaidman, “TravisTorrent: Synthesizing Travis CI and GitHub for full-stack research on continuous integration,” in MSR 2017: Proceedings of the 14th Working Conference on Mining Software Repositories, 2017.
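
As an illustration of the API limits discussed in Section I, the following minimal sketch estimates how much delay a request quota adds to data acquisition. It is not part of GHTorrent. The 5,000 requests per hour default and the `X-RateLimit-Remaining` / `X-RateLimit-Reset` header names come from GitHub's REST API documentation rather than from this text, and the function names `quota_windows_needed` and `seconds_to_wait` are hypothetical.

```python
import math

# Assumption: GitHub documents 5,000 requests per hour for authenticated
# REST API users. Treat the value as a parameter, not a guarantee.
DEFAULT_LIMIT_PER_HOUR = 5000


def quota_windows_needed(num_requests: int,
                         limit_per_hour: int = DEFAULT_LIMIT_PER_HOUR) -> int:
    """Return how many hourly quota windows num_requests API calls consume."""
    if num_requests < 0 or limit_per_hour <= 0:
        raise ValueError("num_requests must be >= 0 and limit_per_hour > 0")
    return math.ceil(num_requests / limit_per_hour)


def seconds_to_wait(remaining: int, reset_epoch: int, now_epoch: int) -> int:
    """Return how long a client should pause before its next request.

    remaining and reset_epoch mirror the values a client would read from
    the X-RateLimit-Remaining and X-RateLimit-Reset response headers.
    """
    if remaining > 0:
        return 0  # quota left in the current window: no pause needed
    # Quota exhausted: wait until the window resets (never a negative time).
    return max(0, reset_epoch - now_epoch)
```

Under these assumptions, a study that needs one million API requests consumes 200 hourly windows with a single access token, which is roughly eight days of collection time. This is the kind of delay that an offline mirror is meant to avoid.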
