Keywords: repository mining, msr, git, quality analysis, version control system, vcs,
source control management, scm, data mining, data extraction
Acknowledgements
First and foremost I would like to thank my supervisor Daniel Toll for all the time and
energy he has put into supervising me. He has gone above and beyond what could be
expected of him, both for this thesis and throughout the entire school year. He has been
a mentor and an inspiration for me to perform at my best.
I would also like to thank Caroline Millgårdh, Fredrik Forsmo, and Josef Ottosson for
reading the first drafts, giving me feedback, submitting to being guinea pigs, and testing
code examples every now and then.
Last but not least, a big thank you and acknowledgment to D. W. McCray for helping
me proofread this thesis.
Table of contents
Abstract
Acknowledgements
1. Introduction
2. Problem
2.1 Introduction to repository mining
2.2 Problem background
2.3 Problem definition
9. Conclusion
9.1 Common problems encountered with repository mining
9.2 Profits of mining Git repositories
9.3 Successful data extraction for software quality analysis
9.4 Best practice
9.5 Hypotheses
9.6 Future work
Sources
Appendices
A. Articles included in literature study
B. Usage documentation of Doris
C. Source code metrics measurement
1. Introduction
Repository mining is a useful technique when performing software engineering
research. With the help of successful repository mining you can, for example, establish
links between bug fixes [1], propose other classes that have also been edited when a
function was changed [2], or measure productivity in software projects [3]. All of these
techniques have code metrics and repository mining in common.
To be able to perform an analysis of source code evolution, you need access to at
least two states of that source code. The more states you have available to analyze, the
finer the granularity of the analysis will be. Repository mining helps to gather many
different snapshots of the state of the code. This can be done manually through
programs such as Tortoise [4], but in a research situation automation can be of great
benefit.
When performing repository mining, a researcher is faced with a variety of problems.
Planning for these problems is essential to ensure useful results. This paper will identify
the problems that a programmer or researcher might come across while performing
repository mining. Problems that were discovered in the process of writing this thesis
were investigated to see if solutions already existed elsewhere. If no existing solution
was found, an attempt was made to find an implementation of a solution to a similar
problem; failing that, a theoretical solution was proposed.
This paper is intended to further knowledge of the problems that can be
encountered when mining Git repositories. Git was chosen because there is little
research on that version control system, despite a recent large increase in its usage. The
result of this thesis is a list of commonly found problems, a test of existing repository
mining tools for Git, and a repository mining tool written to match the specifications
(section 2.3.1) for mining Git repositories.
When studying articles and papers on the subject it was discovered that repository
mining is very common. However, as noted by S. Kim et al. [5] and H. Nakamura et al.
[6], there is a lack of sharing of how this mining is done.
The thesis starts with an introduction and a description of the problem, followed by
a short description of the research method. This description is then further elaborated in
three sections: a literature study, testing of some existing repository mining tools, and
an exploration of how to create a repository mining tool for Git. Finally, it is
summarized by a discussion chapter and a clarification of the results of the research.
2. Problem
This section gives a brief introduction to repository mining, the problem background,
the problem definition, and a hypothesis.
Given the increased usage of Git within different open source communities, there is a
need for a working tool to extract data from Git repositories.
The big challenge is to create a tool that can perform automated repository mining
with Git. Previous research has mostly used centralized version control systems such as
CVS. Performing repository mining on a decentralized repository differs from doing so
on a centralized one.
To ensure that the work had not already been done, and to gain knowledge of the
current state of the art, a pre-study consisting of reading previous research and short
evaluations of existing tools was performed. This pre-study (sections 4 and 5) resulted
in a list of common problems and also showed that there was no tool that fulfilled the
requirements (section 2.3.1).
3. Method ontology
A pre-study was performed (sections 4 and 5) to answer how data can be extracted from
Git repositories. The main focus of the pre-study was to see if tools existed that fulfilled
the requirements (section 2.3.1). The secondary focus was to gain knowledge of
repository mining practices and the current state of the art.
This pre-study gave insight into how repository mining is used and what can be
expected of a tool created for this purpose. It also gave better knowledge of what kinds
of problems could be anticipated (section 4.6). The study showed that there was no
extant tool that fulfilled the requirements. It was also discovered that there are few
repository mining tools that actually extract data and store it in a manner that allows
software quality measurements to be performed easily. Most currently available tools
work more with the log system of Git than with the actual source code. The software
that was found was hard to modify and use due to a lack of documentation
(section 5.7).
The results of the pre-study showed that it was necessary to create a new repository
mining tool for Git (section 7). When creating this tool, some hardware-related
problems were discovered. The problems found are not restricted to repository mining,
but they are relevant to performing it.
These problems had to be solved without compromising the requirements previously
stated. To ensure that the tool would actually work in a realistic situation where
repository mining could be useful, a practical usage test implementation (section 7.4)
was performed. The development of this program also centered on making the program
easy to modify and on creating clear source code documentation and a user guide.
The results and experiences gained from the pre-study are then combined with the
results and experience of creating a repository mining tool, to generate an answer
(sections 8 and 9) to the problem posed: How can data be extracted from Git
repositories to perform a software metrics and software quality analysis?
4. Current state-of-the-art and practice
For this thesis, previous work was of great importance. This resulted in a literature
study of prior research, to gain insight into the field and see what can be expected in the
future. This also provides a benchmark for the current state of the art and current
practice within the field of mining software repositories.
The papers and articles that have been used for this study can be found in
Appendix A: Articles included in literature study.
4.2 Execution
The articles were read with a focus on theories of how repository mining can be
performed. A critical view was taken of each writer's preferred VCS. Problems were
scrutinized: article-specific problems were given lower priority than those which
applied in multiple situations. Substantial attention was given to the actual usage of
repository mining and the intended results of the paper in question.
project with commit access [5], [13], [12]. This information might be needed to find a
correlation between new functions and bug reports, or between refactoring and project
size.
Careful selection of a project to study is necessary to overcome these issues. A
researcher's pre-existing relationship with a project can result in biased findings. Project
size and current use of metric data are also important points to consider: many smaller
projects do not make use of metrics, and a project large enough to do so might quickly
generate enough data to overwhelm a researcher. A researcher should also consider his
or her communication skills as well as those of the project contact; if a contact does not
know that the researcher is interested in certain information, he or she might not pass it
on.
One way of handling this could be to use a project that has previously been used by
another researcher. Some projects are more popular than others; most of these are
well-known open-source projects, e.g., Apache HTTPD [21], GNOME [22],
Eclipse [23], and NetBeans [24]. These projects also recur frequently in the papers read
for this study. D. M. German [25] also makes a valid point that using a closed-source
project will generally restrict what data you are able to gather and what data you are
able to publish. These problems generally do not apply to an open source project. That
said, there are some ethical issues that should be considered, some of which are
discussed in section 7.2.
To create a repository mining tool that can handle every VCS on the market would
be a large project, as there are many different VCS available. The scope could be
narrowed by using only the most popular ones, but details of market share among VCS
implementations are currently unavailable and not likely to become available. The only
figures found were statistics from GitHub [26], and these statistics include
unmaintained projects and non-programming repositories.
5. Evaluation of existing programs
The practical experiments are divided into testing of existing tools and creating a
mining tool in Java. The reasons for testing existing tools were both to find a
benchmark and to see if a new tool would have to be created. The second part was to
gain some knowledge of how much effort is required to create a repository mining
tool.
5.3 Limitation
Different initial testing stages were chosen to minimize time consumption as far as
possible. Focus was on getting the programs running in a short period of time, in order
to spend more time on extensive testing of how much RAM and time the programs
consumed in detail.
5.5 Initial testing
For a program to be considered usable there were some main criteria that needed to be
fulfilled. These were:
1. Existence of public webpage with download.
2. Date of last update.
3. Handling of Git.
4. Used in articles read for this paper.
If these four criteria could be matched on the Linux installation, the programs were
later also tested on the Windows installation.
All programs were set up in accordance with their documentation and the results
expected from them. No changes to the source code were made in this step.
A time limit of one hour was given to each program to go from completed download
to runnable. This time also included setting up dependencies of that particular program
that were not reusable across programs, such as a MySQL database, a language
interpreter/compiler, etc. The time limit was based on a tight time schedule and on
giving all tested programs a level playing field.
An initial testing phase was used to save time when establishing the basic
functionality of each program. Both Linux and Windows were used to make sure that a
program could be used in both environments. A short summary of the outcome of the
testing can be found in table 5.1.
5.7 Results
This section contains the results and a written evaluation of how the different programs
performed during the initial test phase.
5.7.1 Kenyon
Website: http://dforge.cse.ucsc.edu/projects/kenyon/
After searching for Kenyon, neither the source code nor executable binaries could be
found. The link to the software at the University of California Santa Cruz was dead.
5.7.2 APFEL
Website: http://www.st.cs.uni-saarland.de/softevo/apfel/
The search for APFEL resulted in a webpage stating that APFEL is no longer
supported and the source code has been removed. An e-mail was sent to the person
listed as the contact for APFEL (Thomas Zimmermann), who confirmed that the source
is no longer public. The reason for this was a lack of time on the programmers' part to
maintain the source code.
5.7.3 Evolizer
Website: http://www.evolizer.org/
The source of Evolizer was investigated and no libraries were found that support
Git. This means that if Evolizer is to be used, it has to be extended to support Git. This
is a possible option for future research, but it was not considered a valid option at this
time due to the time constraints of this project and the involved process of
re-engineering Evolizer to work with Git.
5.7.4 git_mining_tools
Website: https://github.com/cabird/git_mining_tools/
This is by far the most promising tool found. It is written in Ruby and mines Git
repositories. Sadly, the documentation was scarce and it is hard to understand how the
tool is started. Ruby is an unfamiliar language for the tester, and the code could not be
reverse engineered within the available time to clarify how to make it work.
Dependency issues also made getting this tool working too time-intensive to be
considered.
The fact that the tool stores its results in a database also creates the need for some
sort of extraction tool for the local database. As a result, the outcome of a full mine is
not available as compilable source code without a second tool to extract the files.
5.7.5 Shrimp
Website: http://sourceforge.net/projects/chiselgroup/
When reading about Shrimp it was unclear whether it is a visualization library, a
mining library, or an all-in-one tool. Further investigation showed that it is a
visualization tool that depends on other tools. No further research on Shrimp was made.
5.7.6 Gitdm
Website: https://github.com/markmc/openstack-gitdm
Gitdm (Git data mining) is a tool written in Python. It is a plugin to the Git client,
accessed through the logging functionality built into Git. The tool does not mine source
code but mines the Git log. It can be a very useful tool when performing analysis of the
Git log and commit actions, but it is not a tool for mining Git repositories for source
code.
5.8 Problems
When researching the currently available tools for mining Git repositories, some
major problems and shortcomings were found with every tested tool. Most of them
cannot be tied to a specific program. Some language-specific drawbacks are also not
considered problems; for example, a program written in C#¹ need not have the
drawback of being platform dependent, because the program, or the programmer, can
work around the C# limitations.
¹ Using .NET and not Mono.
6. Conclusion of pre-study
There are many existing tools that can be used when performing repository mining, but
few are specialized for Git. Most current tools were created for CVS. The few tools
created for Git are poorly documented, and knowledge of the language they are written
in is needed to use them.
Because of the timeframe in which this thesis was written, not all possible tools
could be tested. The one that was tested stored information about commits in a
database, which would require a database to be installed that the host system could
access. It would also require a second program to extract the information for any
measurements. With this in mind, the need to create a new tool for mining Git
repositories became apparent. This tool should need no external storage systems other
than a regular file system.
There are many problems associated with repository mining. They vary from being
bound to the version control system mined, to how the version control system is being
used. They also depend on what kind of research is being performed.
In some fields of research there is the problem of how bug reports are made and what
the hidden heuristics look like [11], [13]. In others, there might be the problem that the
version control system has been swapped, for example from Subversion to Git.
Compiling a complete set of problems is virtually impossible, and such a set would also
become outdated as version control systems continue to be developed.
However, some general problems were found (see section 4.5), and ways to handle
them could be suggested. These problems should be kept in mind when starting
research involving repository mining. The primary concern when performing
repository mining is to remember what information from the repository is needed.
7. Mining Git with Doris
Shortcomings were found in every investigated program for mining Git repositories,
including being outdated and lacking documentation. This led to the development of a
new program.
The goal of this development can be divided into two parts:
1. Gain knowledge of problems when creating a repository mining tool.
2. See how the found shortcomings can be managed or eliminated.
After reading articles and testing existing tools, some flaws and problems were
discovered. The biggest problem is that there are very few Git repository mining tools.
Out of all the repository mining tools found within the limitations set up (see 5.2
Program selection), only one incorporated Git. That program failed to even start
because of a Ruby dependency which would not install correctly.
This made it clear that software needed to be developed that was easy to start out of
the box, supported most providers of Git repositories, and would work on as many
operating systems as possible. The program was named Data Oriented Repository
Information System (Doris).
7.1.5 Metadata
The metadata for the repository should be stored in a format that makes it easy to
extract information for particular commits.
Metadata is in this case used as a reference to information about a particular commit,
such as committer, commit time, commit name, etc.
7.2 Implementation
The program was implemented in Java to achieve platform independence, which makes
installation easy for researchers wanting to use the tool. With Java, external
dependencies can also be kept to a minimum: drivers for databases can be kept internal
and are therefore of little concern to the user.
The JGit application programming interface (API) was chosen based on its extensive
Git support and thorough documentation.
To achieve configurability, a system of flags was developed (full documentation can
be found in Appendix B: Usage documentation of Doris).
The supported formats for retrieving .git-files are the hypertext transfer protocol
(http) and the Git protocol. The local file:// notation can also be used to pass a link to a
.git-file. The decision to leave out secure shell (SSH) was based on the extra internal
functionality that would have had to be included in the program. If the only way to
access a .git-file is through SSH, the user has to clone a bare repository of the head
revision and then pass the file:// link to that bare repository.
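For example, a repository reachable only over SSH could first be cloned bare with the
regular Git client and then passed to Doris via file:// (the paths and repository here are
illustrative):
emil@linux-computer ~ $ git clone --bare git@github.com:gingerswede/doris.git /tmp/doris.git
emil@linux-computer ~ $ java -jar doris.jar -u file:///tmp/doris.git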
As the metadata log format, extensible markup language (XML) was selected due to
the ability to customize the structure while still maintaining a standardized format that
does not require a custom-built parser.
To retrieve the commits, multithreading is used. If there are fewer than two cores,
multiple threads per core are created automatically; otherwise the thread count is
adapted to the number of cores on the host system. Tests showed that even on a
single-core processor, multiple threads were faster than a single thread. This is most
likely due to IO-wait time and the possibility of performing computations while this
IO-wait occurred.
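A minimal sketch of this thread-count heuristic follows; the fixed count of four threads
on a single-core machine is an illustrative assumption, not necessarily the value Doris
uses.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MinerThreads {
    public static ExecutorService newPool() {
        int cores = Runtime.getRuntime().availableProcessors();
        // Fewer than two cores: still run several threads, since some can
        // compute while others are blocked on IO. Otherwise match the cores.
        int threads = (cores < 2) ? 4 : cores;
        return Executors.newFixedThreadPool(threads);
    }
}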
7.2.1 JGit
JGit [31] is an API for using the Git VCS through Java. It is an open source library
hosted by the Eclipse Foundation. JGit has few external dependencies, which makes it
favorable for embedding into software and keeps the overall size of the software small.
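As a minimal sketch of the kind of JGit usage a mining tool builds on, a repository can
be cloned bare and its commit log walked as follows (the URI and directory are
illustrative):
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.revwalk.RevCommit;
import java.io.File;

public class CloneAndWalk {
    public static void main(String[] args) throws Exception {
        // Clone a bare copy of the repository, the first step of a mine.
        Git git = Git.cloneRepository()
                .setURI("https://github.com/gingerswede/doris.git")
                .setDirectory(new File("doris.git"))
                .setBare(true)
                .call();
        // Walk the commit log; each RevCommit carries the metadata that
        // Doris logs: name (SHA-1), commit time (UNIX time), and message.
        for (RevCommit commit : git.log().all().call()) {
            System.out.printf("%s %d %s%n", commit.getName(),
                    commit.getCommitTime(), commit.getShortMessage());
        }
        git.close();
    }
}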
7.3 Testing, problems encountered and solutions
During the creation of Doris, some problems were encountered. Some of the
problems found were expected thanks to the results of the pre-study, while others were
discovered only through experimentation. Some problems expected from the pre-study
were not encountered, but remain a theoretical possibility. The results of these tests can
be found in sections 7.3.4 to 7.3.9. Table 7.1 also gives a short comparison between the
different results.
7.3.2 Metadata
The problem with metadata logs was deciding what to include, as the use and
importance of metadata differ depending on what kind of research is being performed.
This means that there is no "cheat sheet" that can be used to find what information is
important to the general public. An educated guess had to be made for the "out-of-the-
box" logging (see appendix B, section Log file). Making the log creation class easy to
modify was given precedence over trying to optimize what metadata to include. Git
stores every bit of information locally, so the metadata can be extracted from the
.git-file [15]. This means it is not as crucial to store all details when performing the
mining operation.
The actual log is stored as XML. The advantage of this is that most programming
languages have native XML parsing libraries and support query languages such as
XPath [33], which makes the output of Doris easier to analyze with automated software.
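For instance, the Doris log (structure as shown in Appendix B; the file path here is an
illustrative assumption) can be queried with Java's built-in XPath support:
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class LogQuery {
    public static void main(String[] args) throws Exception {
        // Parse the XML log produced by Doris for a mined repository.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("ExampleRepository/log.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Fetch the commit message of the initial commit (commit_number 0).
        String message = xpath.evaluate(
                "/project/commit[@commit_number='0']/commit_message", doc);
        System.out.println(message.trim());
    }
}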
A drawback of using XML for this is that it takes quite a bit of memory, which is an
issue when there is a large number of commits to process. To solve this problem, either
another format for storing metadata would have to be created, or another XML library
would need to be used for creating the file. Another solution could be to decrease the
metadata extracted while mining a repository.
As a temporary solution, a flag to turn metadata storage off was included for larger
repositories. This does not pose any real problem: if the metadata is essential, all of the
information can, as mentioned earlier, be extracted from the bare clone of the
repository.
One other problem, encountered when mining Git repositories that were not
controlled for testing, was that some characters were translated into invalid XML. This
resulted in an exception being thrown by the XML parser when the file was loaded into
memory.
After an investigation of which special characters have no XML/HTML
replacements [34], an array with their ASCII numeral representations was created, and
all messages were scanned for these characters prior to adding them as node content.
During that scan all such characters were removed. Since the characters that lacked
representation were special characters (such as escape and substitute), this could be
done without interfering with the meaning conveyed by the message.
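A minimal sketch of this kind of scrubbing follows; the character list is illustrative, not
Doris's exact array.
public class XmlSanitizer {
    // A few control characters without XML/HTML representations.
    private static final char[] INVALID = {0x00, 0x01, 0x08, 0x0B, 0x0C, 0x1A, 0x1B};

    public static String strip(String message) {
        StringBuilder sb = new StringBuilder(message.length());
        for (char c : message.toCharArray()) {
            boolean bad = false;
            for (char invalid : INVALID) {
                if (c == invalid) { bad = true; break; }
            }
            if (!bad) sb.append(c); // drop offending characters, keep the rest
        }
        return sb.toString();
    }
}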
7.3.3 Documentation
Whether documentation is good or bad is very subjective; hence a guarantee that this
project leads to exemplary documentation cannot be given. Good documentation will,
in my opinion, only come from a community that is involved in its development.
As a short test, some people were asked to run Doris using only the usage guide.
The test subjects had no previous experience of repository mining but had
programming knowledge. With the help of the documentation they were able to make
Doris mine repositories, and they were also able to find bugs.
The JavaDocs of Doris were also published through the GitHub repository using a
feature called gh-pages [35]. This was done to simplify modifications that might be
made by other users of Doris. It also forces the programmer to create useful comments
in the source code.
Without removing internal .git files, the entire mined repository required 17.1
gigabytes of disk space on Windows (excluding the log file), compared to 9.48
gigabytes used after deleting the .git directories in the individual commit folders.
Since the head .git file contains all information about previous commits and the
entire log, the internal .git files for each commit can be deleted without any information
loss. During the test with automatic cleanup, nine files out of approximately 791,000
could not be deleted. In a subsequent test on the same repository, five files could not be
deleted. Doris informed the user which files were not deleted and reported the file paths
as expected.
It was during this test that many of the hard-to-predict bugs appeared. One example
of such a bug is the XML character problem.
7.4.2 Manner of execution
The measurements were made in a very simple manner. The source code files of a
particular sort (e.g., .java, .c, .js) were read line by line. If a line started with the
characters // it was considered a comment line. If a /* was introduced, all following
lines until a */ was encountered were also considered comment lines. Lines consisting
of only white space were not included in the calculation. After this was done for all
files of the requested type, the comment lines and source code lines were summed to
give the total number of lines.
The values from the initial commit were stored as a base value (chain index
[38], [20]), and every commit after this was compared to that value. The comparison
was made by dividing the value for each commit by the base value and multiplying by
100. Since no compensation was made for empty initial commits, only projects with a
non-empty initial commit could be included, to prevent division-by-zero errors.
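As a worked example of this chain-index calculation, with illustrative numbers:
public class ChainIndex {
    public static void main(String[] args) {
        int baseLines = 200;   // lines counted in the initial (base) commit
        int laterLines = 250;  // lines counted in some later commit
        // Each commit is expressed as a percentage of the base value.
        double index = (double) laterLines / baseLines * 100;
        System.out.println(index); // 125.0, i.e., 25% more lines than the base
    }
}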
7.4.3 Results
The Facebook iOS SDK (Figure 7.1) had a larger change in lines of comments than in
lines of source code. Total lines almost mirror the comment values, except that the
change is larger. After approximately half the timeline, the comments changed
somewhat less. The lines of source code changed less than both the total lines in the
project and the comments.
[Line chart: number of lines per commit number, with series for the base value, total
lines, lines of source code, and lines of comments.]
Figure 7.1 Measurement results of Facebook iOS SDK.
The Hosebird client (Figure 7.2) had a large increase in lines of comments and total
lines very early. This later settles down and then spikes again at the 24th commit. Lines
of source code are fairly stable until the 88th commit, where a large increase is made
that is mirrored in the total lines graph.
[Line chart: number of lines per commit number, with series for the base value, total
lines, lines of source code, and lines of comments.]
Figure 7.2 Measurement results of the Hosebird client.
With the JPacman project (Figure 7.3) there was a larger change in the lines of
comments than in lines of source code and total lines. There was not as much change as
with the Facebook iOS SDK. The series follow each other better, and the changes in
total lines and lines of source code are almost identical, differing by at most 10%.
[Line chart: number of lines per commit number, with series for the base value, total
lines, lines of source code, and lines of comments.]
Figure 7.3 Measurement results of JPacman.
The twitter async project (Figure 7.4) showed a more interesting development. The
dominant change in the lines of source code at the start was fairly unique. At around
commit 42 there is a peak in the comment changes that quickly goes back down again
at commit 44. Between commits 55 and 88 the changes in comments and source code
are almost identical. At about commit 90 the comments peak, and after that the values
are fairly stable.
[Line chart: number of lines per commit number, with series for the base value, total
lines, lines of source code, and lines of comments.]
Figure 7.4 Measurement results of the twitter async project.
7.4.4 Conclusion
The projects showed no similarities in how the lines of comments and lines of source
code changed relative to each other. However, only four projects were included in the
study, and the tool used to perform the measurements is not the best method available.
Since an analysis of these results is out of scope for this thesis, further study is needed
to reach a valid conclusion.
However, the main purpose of the study was to show that Doris could be useful in a
situation where repository mining is needed to perform a measurement. This could be
done with no problem. The measurement class used could simply enter each directory
with a name consisting of an integer and automatically compare them to each other.
This proved that Doris can be used to perform repository mining, and that the result can
later be used to compare different commits to gain insight into the changes between
them.
By adding another namespace to Doris, it was also shown that Doris can easily be
modified to perform an analysis automatically, paired with the mining.
8. General discussion
This section holds a general discussion of different problems found during the work on
this thesis and the pre-study. It focuses on a few larger points.
surprising, as Git is now a widely used version control system. Git is used by the Linux
Foundation, Facebook, WordPress, and jQuery: fairly large organizations where
research that requires repository mining could be done to great benefit.
Knowing the goal of your repository mining is paramount when choosing a mining
tool. If the information needed can be gathered from the repository-provided metadata,
downloading the complete source code archive would be wasteful; if the needed
information can only be obtained from the source code, a system which relies on
metadata, log messages, and difference information will require more processing to
gather the needed data.
9. Conclusion
This section presents the conclusions of the thesis and a short summarizing discussion.
9.3 Successful data extraction for software quality analysis
The main purpose of this thesis was to see how extraction of data for software quality
analysis could be performed. Determining whether Doris fully succeeds at this would
require an entire thesis of its own, but with the help of a basic practical usage test
(section 7.4) I believe it can be used.
When performing repository mining for the purpose of software quality analysis,
some things are more important than others. In this case the possibility of comparing
different commits to each other and gaining access to the full source code (e.g.,
compiling and running each commit) were two high priorities. For that, a mining tool
that stores information in a directory structure is preferred. Metadata is also of little
interest, except for commit messages that make it possible to weed out potential
back-up commits.
The analysis should be able to either hook onto the mining tool or run as a batch job
when the mining is done. This requires that the mining tool is verbose or easy to
modify, preferably both. That the mined commits are logically structured is also a
requirement. If there is metadata, it should be easily connected to a certain commit.
Metadata should be stored in a format that can easily be processed by automated
software, with a clear connection to the source code it belongs to. This rules out
elaborate models and promotes solutions such as JSON or XML. In these formats,
objects can be represented without any external dependencies, because there is a
standardized notation and most programming languages have libraries to parse them. If
the metadata log file is created correctly, an automated analysis can ignore certain
commits, for example those whose commit message contains the line "back-up". It can
also be reversed: commits tagged with "refactoring" might trigger a comparison
between that commit and the prior one.
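A minimal sketch of such filtering, using the log structure shown in Appendix B (the
log file path is an illustrative assumption):
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class BackupFilter {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse("ExampleRepository/log.xml");
        // Select the names of commits whose message mentions "back-up",
        // so an automated analysis can skip the matching commit directories.
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                "/project/commit[contains(commit_message, 'back-up')]/@commit_name",
                doc, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
            System.out.println("skip " + hits.item(i).getNodeValue());
        }
    }
}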
I believe that the extensive documentation and verbosity of Doris make it easy to
extend with analysis tools, both through the source code and through scripts that read
its output. The general design of the log file makes it easy to connect metadata to a
particular commit programmatically.
9.5 Hypotheses
The answer to the first hypothesis, "An existing repository mining tool exists that can
extract data from Git repositories", is "yes", but that yes comes with many problems.
First, there are several problems with documentation and with simply getting the tools
that were found to start, without detailed knowledge of both the tool and the language it
is created in. The tools also tend to be very purpose-specific, which makes it likely that
new tools are created for each study.
The second hypothesis, "Repository mining can be conducted on decentralized
repositories in the same way as on centralized repositories", also gave a positive
outcome. With the help of Doris, decentralized repositories were mined in a similar
fashion to centralized repositories. The main difference is that all mining can be
performed on the local computer, which eliminates the need for a connection to the
central server when performing repository mining.
One benefit of this is that researchers with a poor network connection can mine
decentralized repositories as efficiently as researchers with a better network connection.
In the end, this can open up research in computer science that requires repository
mining to universities without a stable internet connection, by having large repositories
sent to them on a Universal Serial Bus (USB) flash drive by someone with a more
stable connection.
Sources
[1] A. Bachmann and A. Bernstein, "Software Process Data Quality and
Characteristics," in IWPSE-Evol '09, Amsterdam, Netherlands, 2009.
[2] T. Zimmermann, S. Diehl and A. Zeller, "Mining Version Histories to Guide
Software Changes," in ICSE '04, Edinburgh, UK, 2005.
[3] B. Mas y Parareda and M. Pizka, "Measuring Productivity Using the Infamous
Line of Code Metric," in APSEC '07, Nagoya, Jp, 2007.
[4] Tigris.org, "tortoise.tigris.org," CollabNet, [Online]. Available:
http://tortoisesvn.tigris.org/. [Accessed 8 April 2013].
[5] S. Kim, T. Zimmermann, M. Kim, A. Hassan, A. Mockus, T. Girba, M. Pinzger,
E. J. Whitehead Jr. and A. Zeller, "TA-RE: an exchange language for mining
software repositories," in MSR '06, Shanghai, Ch, 2006.
[6] H. Nakamura, R. Nagano, K. Hisazumi, Y. Kamei, N. Ubayashi and A. Fukuda,
"QORAL: An external domain-specific language for mining software
repositories," in IWESEP '12, Osaka, 2012.
[7] Tigris.org, "subversion.tigris.org," CollabNet, 2009. [Online]. Available:
http://subversion.tigris.org/. [Accessed 5 April 2013].
[8] Free Software Foundation, "Concurrent Version System," Free Software
Foundation, [Online]. Available: http://savannah.nongnu.org/projects/cvs.
[Accessed 4 April 2013].
[9] "Mercurial SCM," punct, [Online]. Available: http://mercurial.selenic.com/.
[Accessed 4 April 2013].
[10] "Git," GitHub, [Online]. Available: http://git-scm.com/. [Accessed 4 April 2013].
[11] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. German and P. Devanbu,
"The promises and perils of mining git," in MSR '09, Vancouver, Canada, 2009.
[12] S. Matsumoto and M. Nakamura, "Service Oriented Framework for Mining
Software Repository," in IWSM-MENSURA 2011, Nara, Jp, 2011.
[13] A. Bachmann, C. Bird, F. Rahman, P. Devanbu and A. Bernstein, "The missing
links: Bugs and bug-fix commits," in SIGSOFT 2010/FSE-18, Santa Fe, USA,
2010.
[14] C. Kiefer, A. Bernstein and J. Tappolet, "Mining software repositories with
iSPARQL and a Software Evolution Ontology," in MSR '07, Minneapolis, USA,
2007.
[15] S. Chacon, Pro Git, New York: Apress, 2009.
[16] G. Meszaros, xUnit Test Patterns, Massachusetts, USA: Pearson Education Inc.,
2007.
[17] L. Bass, P. Clements and R. Kazman, Software Architecture in Practice,
Massachusetts, USA: Pearson Education Inc., 2012.
[18] T. J. McCabe, "A Complexity Measure," IEEE Transactions on Software
Engineering, vol. SE-2, pp. 308-320, 1976.
[19] Association for Computing Machinery, "ACM Digital Library," ACM, [Online].
Available: http://dl.acm.org/.
[20] IEEE, "IEEE Xplore," IEEE, [Online]. Available: http://ieeexplore.ieee.org/.
[21] Apache Software Foundation, "Welcome! - The Apache HTTP Server Project,"
Apache Software Foundation, [Online]. Available: http://httpd.apache.org/.
[22] The GNOME Project, "GNOME," Canonical, [Online]. Available:
http://www.gnome.org/.
[23] The Eclipse Foundation, "Eclipse.org," The Eclipse Foundation. [Online].
[24] Oracle, "Welcome To NetBeans," Oracle, [Online]. Available:
http://netbeans.org/.
[25] D. M. German, "Mining CVS repositories, the softChange experience," in MSR
'04, Edinburgh, UK, 2004.
[26] Github Inc, "Press - Github," Github Inc, [Online]. Available:
https://github.com/about/press. [Accessed 13 April 2013].
[27] Linux Mark Institute, "Main Page - Linux Mint," Linux Mark Institute, [Online].
Available: http://www.linuxmint.com/.
[28] Microsoft Corp., "Windows Server 2008 R2 and Windows Server 2008,"
Microsoft Corp., [Online]. Available: http://technet.microsoft.com/en-
us/library/dd349801(v=WS.10).aspx.
[29] GNU Foundation, "The GNU Operating System," The GNU Foundation, [Online].
Available: http://www.gnu.org/philosophy/philosophy.html.
[30] E. Carlsson, "gingerswede/doris," Github inc, 20 April 2013. [Online]. Available:
https://github.com/gingerswede/doris. [Accessed 21 April 2013].
[31] Eclipse Foundation, "JGit," Eclipse Foundation, [Online]. Available:
http://eclipse.org/jgit/.
[32] Eclipse foundation, "JGit - Documentation," Eclipse foundation, [Online].
Available: http://www.eclipse.org/jgit/documentation/. [Accessed 26 March
2013].
[33] W3C, "XMLP Path Language (XPath)," 16 November 1999. [Online]. Available:
http://www.w3.org/TR/xpath/. [Accessed 9 May 2013].
[34] DeGraeve.com, "Special Characters in HTML," DeGraeve.com, [Online].
Available: http://www.degraeve.com/reference/specialcharacters.php. [Accessed
15 April 2013].
[35] GitHub Inc., "What are GitHub pages," [Online]. Available:
https://help.github.com/articles/what-are-github-pages. [Accessed 26 April 2013].
[36] JodaOrg, "JodaOrg/joda-time · GitHub," GitHub, [Online]. Available:
https://github.com/JodaOrg/joda-time/.
[37] Git, "git/git · GitHub," GitHub, [Online]. Available: https://github.com/git/git.
[38] European Commission, "Glossary: Chain index," [Online]. Available:
http://epp.eurostat.ec.europa.eu/statistics_explained/index.php/Glossary:Chain_in
dex. [Accessed 8 February 2013].
[39] Atlassian, "Free source code hosting for Git and Mercurial by Bitbucket,"
Atlassian, [Online]. Available: http://www.bitbucket.org/.
[40] Google, "Google Code," Google, [Online]. Available: http://code.google.com/.
Appendix A. Articles included in literature study
Articles are listed in chronological ascending order.
D. M. German, “Mining CVS repositories, the softChange experience” (2004)
T. Zimmermann, S. Diehl, A. Zeller, “Mining Version Histories to Guide Software
Changes” (2005)
I. Hammouda, K. Koskimies, “Concern-Based Mining of Heterogeneous Software
Repositories” (2006)
L. Voinea, A. Telea, “Mining Software Repositories with CVSgrab” (2006)
S. Kim, T. Zimmermann, M. Kim, A. Hassan, A. Mockus, T. Girba, M. Pinzger, E. J.
Whitehead Jr., A. Zeller, “TA-RE: An Exchange Language for Mining Software
Repositories” (2006)
B. Mas y Parareda and M. Pizka, "Measuring Productivity Using the Infamous Line of
Code Metric" (2007)
C. Kiefer, A. Bernstein, J. Tappolet, “Mining Software Repositories with iSPARQL and
a Software Ontology” (2007)
A. Bachmann, A. Bernstein, “Software Process Data Quality and Characteristics – A
Historical View on Open and Closed Source Projects” (2009)
C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. German, P. Devanbu, “The
Promises and Perils of Mining Git” (2009)
A. Bachmann, C. Bird, F. Rahman, P. Devanbu, A. Bernstein, “The Missing Links:
Bugs and Bug-fix Commits” (2010)
S. Matsumoto, M. Nakamura, “Service Oriented Framework for Mining Software
Repository” (2011)
R. Peters, A. Zaidman, “Evaluating the Lifespan of Code Smells using Software
Repository Mining” (2012)
B. Ray, C. Wiley, M. Kim, “REPERTOIRE: A Cross-System Porting Analysis Tool for
Forked Software Projects” (2012)
H. Nakamura, R. Nagano, K. Hisazumi, Y. Kamei, N. Ubayashi, A. Fukuda, “QORAL:
An External Domain-Specific Language for Mining Software Repositories” (2012)
Appendix B. Usage documentation of Doris
Doris
Table of contents
License
About
Dependencies
Usage guide
o Help
o URI
o Target
o Start point
o End point
o Limit
o No log
o Metrics
o Important
Log file
License
Doris is free software: you can redistribute it and/or modify it under the terms of the
GNU General Public License as published by the Free Software Foundation, either
version 3 of the License, or (at your option) any later version. Doris is distributed in the
hope that it will be useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more details. You should have
received a copy of the GNU General Public License along with Doris. If not, see
<http://www.gnu.org/licenses/>.
About
Doris was created by Emil Carlsson as part of a bachelor thesis about problems
encountered when mining software repositories. The main goal of the thesis was to find
a mining tool that could handle Git, work with as few dependencies as possible, and
also provide an automated, reproducible extraction and measurement pipeline.
Dependencies
Doris is written in Java and requires Java (JRE 1.7 or newer) to be installed on the
computer running it.
Usage guide
When using parameters without specifying a target directory, Doris will automatically
create a directory with the same name as the .git file used for mining. If no parameters
are passed, Doris will prompt for the URI of the .git file and the target directory in
which to store the results of the mining. All flags are appended after the command that
initializes Doris. When using flags, the URI flag must be included as a minimum.
Run Doris on *nix:
emil@linux-computer ~ $ ./doris.jar
Run Doris on Windows:
C:\> java -jar c:\path\to\doris.jar
Help
-h, --help [flag]
Shows help information. If a flag is appended it will show help information of that
particular flag.
URI
-u, --uri <link to .git-file>
Specifies the URI where the .git file can be found. The protocols that Doris can handle
are http(s)://, git://, and file://. Example of
formatting: git://github.com/GingerSwede/doris.git.
Target
-t, --target <path to target directory>
Specifies the target where the different commits should be stored. When omitted Doris
will use the current working directory and set up a folder named after the .git-file used
in the URI.
Start point
-s, --startpoint <commit sha-1>
Set a starting point for Doris to start mining the repository from. Full sha-1 is needed. If
the sha-1 value is incorrect the mining will never be started.
End point
-e, --endpoint <commit sha-1>
Set a commit where Doris should stop mining. Full sha-1 is needed. If the sha-1 value
is incorrect the mining will not stop. The given sha-1 commit will not be included in
the mining results.
Limit
-l, --limit <max number of commits>
Set the maximum number of commits Doris should mine. The amount is given as an
integer (e.g., 6, 10, or 600).
No log
-n, --nolog
When this flag is passed, the logging option in Doris is turned off. This is recommended
when mining larger repositories that will generate many commits. All information
normally logged by Doris can be obtained manually through the .git-file copied for
local access; it can be found in the same directory as the mining results.
Metrics
-m, --metrics <file ending[,file ending[,file ending]]>
Creates a simple software metrics analysis where the amount of source code is
compared with the amount of comments, in percent. Multiple file endings are separated
with commas and no spaces.
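For example, a hypothetical invocation that mines at most 100 commits and measures
Java files:
emil@linux-computer ~ $ java -jar doris.jar -u git://github.com/GingerSwede/doris.git -l 100 -m .java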
Important
If the -e and -l flags are used in combination, Doris will stop at whichever criterion is
reached first.
Log file
Unless the -n flag is used, Doris will automatically log basic information about the
different commits in an XML file. The log contains information about the parent
commit, author, committer, commit message, and commit time (given in UNIX time).
Example:
<project project_name="ExampleRepository">
<commit commit_name="08046e7b57f772f270619601d1a9420f76320066"
commit_number="0" commit_time="1358168496">
<author e_mail="[email protected]" name="John Doe"/>
<committer e_mail="[email protected]" name="John Doe"/>
<commit_message>
Initial commit
</commit_message>
</commit>
</project>
Appendix C. Source code metrics measurement
package com.gingerswede.source.metrics;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

/**
 * This file is a part of Doris
 *
 * Doris is free software: you can redistribute it and/or modify it under the
 * terms of the GNU General Public License as published by the Free Software
 * Foundation, either version 3 of the License, or (at your option) any later
 * version. Doris is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
 * for more details.
 *
 * You should have received a copy of the GNU General Public License along
 * with Doris. If not, see <http://www.gnu.org/licenses/>.
 *
 * @author Emil Carlsson
 */
public class SLOC {

    // NOTE: the original appendix prints this class only in fragments. The
    // field declarations, method signatures, and the two helper methods are
    // reconstructed here from those fragments and from the description in
    // section 7.4.2, so that the listing reads as a whole.
    private int m_baseValueTotal = -1;    // chain-index base values, taken
    private int m_baseValueComments = -1; // from the initial commit
    private int m_baseValueCode = -1;
    private String[] m_fileEndings;       // file endings given via the -m flag

    public SLOC(String[] fileEndings) {
        this.m_fileEndings = fileEndings;
    }

    /**
     * Counts the lines of one commit directory and appends the chain-indexed
     * values (total;comments;code, in percent of the base) to a CSV file.
     */
    public void measureCommit(File commitDir, File csvFile) throws IOException {
        int slocmt = this.countLines(commitDir, true);  // comment lines
        int slocd = this.countLines(commitDir, false);  // source code lines
        int sloct = slocmt + slocd;                     // total lines

        if (this.m_baseValueTotal < 0) {
            this.m_baseValueTotal = sloct;
            this.m_baseValueComments = slocmt;
            this.m_baseValueCode = slocd;
            sloct = 100;
            slocmt = 100;
            slocd = 100;
        } else {
            sloct = (int) ((double) sloct / (double) this.m_baseValueTotal * 100);
            slocmt = (int) ((double) slocmt / (double) this.m_baseValueComments * 100);
            slocd = (int) ((double) slocd / (double) this.m_baseValueCode * 100);
        }

        String appendString = sloct + ";" + slocmt + ";" + slocd + "\n";
        Writer writer = null;
        try {
            writer = new BufferedWriter(new FileWriter(csvFile.getAbsolutePath(), true));
            writer.append(appendString);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (writer != null) {
                    writer.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    /** Recursively counts comment lines (countComments true) or code lines. */
    private int countLines(File file, boolean countComments) throws IOException {
        int sloc = 0;
        boolean readFile = false;
        if (file.isDirectory()) {
            for (File f : file.listFiles()) {
                sloc += this.countLines(f, countComments);
            }
        } else {
            if (this.m_fileEndings == null) {
                readFile = true;
            } else {
                for (String s : this.m_fileEndings) {
                    if (file.getName().endsWith(s)) {
                        readFile = true;
                        break;
                    } else {
                        readFile = false;
                    }
                }
            }
            if (readFile) {
                BufferedReader br = new BufferedReader(new FileReader(file));
                boolean isEOF = true;
                boolean isComment;
                boolean isBlankLine;
                boolean inMultiLineComment = false;
                boolean prevMultiLineComment = inMultiLineComment;
                do {
                    String t = br.readLine();
                    if (t != null) {
                        isComment = this.lineIsComment(t);
                        isBlankLine = t.trim().equals("");
                        prevMultiLineComment = inMultiLineComment;
                        inMultiLineComment = this.resolveMultiLineComment(t, inMultiLineComment);
                        isEOF = false;
                        // Count the line as code or comment depending on mode. The
                        // "||" here (the fragment read "&&") lets both //-lines and
                        // lines inside a multi-line comment count as comments.
                        if (!isBlankLine
                                && (!countComments ? !isComment && !prevMultiLineComment
                                        : isComment || prevMultiLineComment)) {
                            sloc++;
                        }
                    } else {
                        isEOF = true;
                    }
                } while (!isEOF);
                br.close();
            }
        }
        return sloc;
    }

    /** A line is a single-line comment if it starts with the characters //. */
    private boolean lineIsComment(String line) {
        return line.trim().startsWith("//");
    }

    /** Tracks whether the reader is currently inside a multi-line comment. */
    private boolean resolveMultiLineComment(String line, boolean inComment) {
        String t = line.trim();
        if (inComment) {
            return !t.contains("*/"); // a closing marker ends the block
        }
        return t.contains("/*") && !t.contains("*/"); // an unclosed opener starts one
    }
}