2017 IEEE/ACM 39th IEEE International Conference on Software Engineering Companion

Mining Software Engineering Data from GitHub

Georgios Gousios
Department of Software Technology
Delft University of Technology
Delft, The Netherlands
[email protected]

Diomidis Spinellis
Department of Management Science and Technology
Athens University of Economics and Business
Athens, Greece
[email protected]

Abstract—GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.

Index Terms—GitHub; GHTorrent; empirical software engineering; Git

I. DESCRIPTION

GitHub is a collaborative code hosting site built on top of the Git version control system. It includes a variety of features that encourage teamwork and continued discussion over the life of a project. GitHub uses a “fork & pull” collaboration model [1], where developers create their own copies of a repository and submit pull requests when they want the project maintainer to incorporate their changes into the project’s main branch, thus providing an environment in which people can easily conduct code reviews. Every repository can optionally use GitHub’s issue tracking system to report and discuss bugs and other concerns. GitHub also contains integrated social features: users are able to subscribe to updates by “watching” projects and “following” other users, resulting in a constant stream of data about people and projects of interest. The system supports user profiles that provide a summary of a person’s recent activity within the site, such as their commits, the projects they forked or the issues they reported.

Due to a combination of reasons such as data availability, data homogeneity and volume, GitHub has become both the target of choice and the source of data for various research efforts, ranging from distributed collaboration [1] to deep learning on software data [2]. However, GitHub data do not come for free for researchers. First, GitHub imposes limits on its API, which, given the volume of interesting projects, can significantly delay data acquisition, and the technicalities of the retrieval complicate the acquisition process. Moreover, there is no data schema (GitHub only exposes its data as JSON responses through a REST API), while the data only represent the current project state. In addition, selection and filtering of GitHub data imposes threats to the quality of the study itself; as a result, many GitHub-bound studies suffer from data validity issues [3], [4].

A notable archiving attempt, which emerged through the repository mining community, is the GHTorrent project [5], a scalable, offline mirror of all data offered through the GitHub API. GHTorrent follows the GitHub event stream and systematically retrieves from it all data, their metadata and their dependencies. It then processes and stores all retrieved items in a relational database, while also storing the original data in a MongoDB database. GHTorrent offers to interested researchers both downloads of the corresponding database dumps (currently, >15 TB of data) and online access facilities (including live database access and Google BigQuery).¹ GHTorrent has been very successful: more than 200 researchers have subscribed and used the online access points, 33% of all empirical research publications on GitHub are based on it [4], while it is being used in production by companies, such as Microsoft.

As GHTorrent is becoming the de facto standard dataset for large-scale quantitative analysis of GitHub data, we believe that it is crucial for researchers to know how to use GHTorrent to sample projects, how to treat the data and how to avoid common pitfalls in order to minimize the risk of doing unsound research.

In the tutorial, we address the following topics.
• GitHub data collection strategies, including querying the API, using online services such as GitHub Archive and GHTorrent.
• Using GHTorrent to sample appropriate repositories for various types of research questions.
• Writing, managing, and optimizing complex and expensive relational queries on GHTorrent relational data.
• Using GHTorrent effectively: understanding the data collection challenges and avoiding common pitfalls.
• Copyright and privacy issues when using the GitHub data.

¹ https://bigquery.cloud.google.com/queries/ghtorrent-bq

II. SPEAKER BIOGRAPHIES

The two speakers have pioneered the use of data from GitHub in software engineering research with the introduction of the GHTorrent data collection framework [6]. They have both been active in GitHub-related research ever since. Apart from creating GHTorrent, the speakers have studied the pull-request based distributed software development practice [1],
co-created an automated work-prioritization framework for pull request based projects [7], co-examined the build and test practices of projects on TravisCI [8] and co-documented the pitfalls of using GitHub-based datasets [3].

Georgios Gousios is an assistant professor at the Web Information Systems group, Delft University of Technology. His research interests include software engineering, software analytics and programming languages. He works in the fields of distributed software development processes, software quality, software testing, developer productivity assessment, research infrastructures and software security. His research has been published in top venues, where he has received four best paper awards and various nominations. In total, he has published more than 50 papers and also co-edited the “Beautiful Architectures” book (O’Reilly, 2009). He is the main author of the GHTorrent data collection and curation framework and the Alitheia Core repository mining platform. Georgios holds an MSc from the University of Manchester and a PhD from the Athens University of Economics and Business, both in software engineering. In addition to research, he is also active as a speaker, both in research and practitioner-oriented conferences.

Diomidis Spinellis is a Professor in the Department of Management Science and Technology at the Athens University of Economics and Business, Greece. His research interests include software engineering, IT security, and cloud systems engineering. He has written two award-winning, widely-translated books: Code Reading and Code Quality: The Open Source Perspective. His most recent book is Effective Debugging: 66 Specific Ways to Debug Software and Systems, which was published as part of Addison-Wesley’s Effective Software Development Series in 2016. Dr. Spinellis has also published more than 200 technical papers in journals and refereed conference proceedings, which have received more than 5000 citations. He served for a decade as a member of the IEEE Software editorial board, authoring the regular Tools of the Trade column. He has contributed code that ships with macOS and BSD Unix and is the developer of CScout, UMLGraph, ckjm, dgsh, and other open-source software packages, libraries, and tools. He holds an MEng in Software Engineering and a PhD in Computer Science, both from Imperial College London. Dr. Spinellis has served as an elected member of the IEEE Computer Society Board of Governors (2013–2015), and is a senior member of the ACM and the IEEE. Since January 2015 he has been serving as the Editor-in-Chief of IEEE Software.

III. RELATED PRESENTATIONS

• The GHTorrent dataset and toolsuite²
• Mining GitHub for Fun and Profit³
• The issue #32 incident⁴
• Working Effectively with Pull Requests⁵

² https://speakerdeck.com/gousiosg/the-ghtorrent-dataset-and-toolsuite
³ https://speakerdeck.com/gousiosg/mining-github-for-fun-and-profit
⁴ https://speakerdeck.com/gousiosg/the-number-issue32-incident
⁵ https://speakerdeck.com/gousiosg/working-effectively-with-pull-requests

ACKNOWLEDGEMENT

The project associated with this work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732223.

REFERENCES

[1] G. Gousios, M. Pinzger, and A. v. Deursen, “An exploratory study of the pull-based software development model,” in ICSE 2014: Proceedings of the 36th International Conference on Software Engineering. New York, NY, USA: ACM, 2014, pp. 345–355.
[2] X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep API learning,” in FSE 2016: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. New York, NY, USA: ACM, 2016, pp. 631–642.
[3] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. German, and D. Damian, “An in-depth study of the promises and perils of mining GitHub,” Empirical Software Engineering, vol. 21, no. 5, pp. 2035–2071, 2016.
[4] V. Cosentino, J. Luis, and J. Cabot, “Findings from GitHub: methods, datasets and limitations,” in MSR 2016: Proceedings of the 13th International Workshop on Mining Software Repositories. ACM, 2016, pp. 137–141.
[5] G. Gousios, “The GHTorrent dataset and tool suite,” in MSR 2013: Proceedings of the 10th Working Conference on Mining Software Repositories, May 2013, pp. 233–236.
[6] G. Gousios and D. Spinellis, “GHTorrent: GitHub’s data from a firehose,” in MSR 2012: Proceedings of the 9th Working Conference on Mining Software Repositories, M. W. Godfrey and J. Whitehead, Eds. IEEE, Jun. 2012, pp. 12–21.
[7] E. van der Veen, G. Gousios, and A. Zaidman, “Automatically prioritizing pull requests,” in MSR 2015: Proceedings of the 12th Working Conference on Mining Software Repositories, May 2015, pp. 357–361.
[8] M. Beller, G. Gousios, and A. Zaidman, “TravisTorrent: Synthesizing Travis CI and GitHub for full-stack research on continuous integration,” in MSR 2017: Proceedings of the 14th Working Conference on Mining Software Repositories, 2017.

DOI 10.1109/ICSE-C.2017.164
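
Section I notes that GitHub limits use of its API and that this can delay data acquisition. The following is a minimal Python sketch, not part of the tutorial material, of how a collection script might pace itself. It assumes the rate-limit response headers described in the GitHub REST API documentation (X-RateLimit-Remaining and X-RateLimit-Reset); the helper names seconds_to_wait and fetch_json are hypothetical, and authentication, pagination and error handling are left out.

```python
import json
import time
import urllib.request

API_ROOT = "https://api.github.com"


def seconds_to_wait(headers, now=None):
    """Return how long to pause (in seconds) before the next request.

    `headers` maps response header names to values. The remaining call
    budget is read from X-RateLimit-Remaining and the end of the current
    window (epoch seconds) from X-RateLimit-Reset.
    """
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", 1))
    reset_at = int(lowered.get("x-ratelimit-reset", 0))
    if remaining > 0:
        return 0.0
    # Budget exhausted: wait until the window resets (never a negative time).
    return max(0.0, float(reset_at - now))


def fetch_json(path):
    """Fetch one API resource (hypothetical helper, unauthenticated).

    Returns the parsed JSON body and the response headers as a dict.
    """
    with urllib.request.urlopen(API_ROOT + path) as response:
        body = json.loads(response.read().decode("utf-8"))
        return body, dict(response.headers.items())
```

A crawler built on this sketch would call seconds_to_wait on the headers of each response and sleep for the returned time before issuing the next request.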

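
One of the tutorial topics listed in Section I is sampling appropriate repositories with relational queries. As a hedged illustration only, the Python sketch below runs such a query against an in-memory SQLite database. The two toy tables are loosely modelled on a relational layout of the kind GHTorrent provides; the table names, the column names (forked_from, deleted) and the function names are assumptions made for this example, and excluding forks and deleted repositories is just one possible filter, not the sampling strategy prescribed by the tutorial.

```python
import sqlite3

# Toy schema (assumed for illustration; not the actual GHTorrent schema).
SCHEMA = """
CREATE TABLE projects (
    id INTEGER PRIMARY KEY,
    name TEXT,
    language TEXT,
    forked_from INTEGER,   -- NULL for original (non-fork) repositories
    deleted INTEGER        -- 1 if the repository no longer exists
);
CREATE TABLE watchers (repo_id INTEGER, user_id INTEGER);
"""

# Keep original, still existing repositories in one language that have
# at least a given number of watchers.
SAMPLE_QUERY = """
SELECT p.id, p.name, COUNT(w.user_id) AS watcher_count
FROM projects p
LEFT JOIN watchers w ON w.repo_id = p.id
WHERE p.forked_from IS NULL
  AND p.deleted = 0
  AND p.language = ?
GROUP BY p.id, p.name
HAVING COUNT(w.user_id) >= ?
ORDER BY watcher_count DESC, p.id
"""


def build_toy_database():
    """Create an in-memory database with a handful of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO projects VALUES (?, ?, ?, ?, ?)",
        [
            (1, "original-a", "Python", None, 0),
            (2, "fork-of-a", "Python", 1, 0),
            (3, "deleted-b", "Python", None, 1),
            (4, "original-c", "Python", None, 0),
            (5, "original-d", "Java", None, 0),
        ],
    )
    conn.executemany(
        "INSERT INTO watchers VALUES (?, ?)",
        [(1, 10), (1, 11), (1, 12), (2, 10), (2, 11),
         (3, 10), (3, 11), (4, 10), (5, 10), (5, 11)],
    )
    return conn


def sample_projects(conn, language, min_watchers):
    """Return (id, name, watcher_count) rows that pass the filters."""
    return conn.execute(SAMPLE_QUERY, (language, min_watchers)).fetchall()


if __name__ == "__main__":
    print(sample_projects(build_toy_database(), "Python", 2))
```

Passing the language and threshold as query parameters keeps the sampling criteria explicit, which makes it easier to report them alongside a study's results.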


978-1-5386-1589-8/17 $31.00 © 2017 IEEE
