
Information and Software Technology: Gillian J. Greene, Marvin Esterhuizen, Bernd Fischer


JID: INFSOF

ARTICLE IN PRESS [m5G;December 12, 2016;20:48]

Information and Software Technology 000 (2016) 1–19

Contents lists available at ScienceDirect

Information and Software Technology


journal homepage: www.elsevier.com/locate/infsof

Visualizing and exploring software version control repositories using interactive tag clouds over formal concept lattices
Gillian J. Greene∗, Marvin Esterhuizen, Bernd Fischer
CAIR, CSIR Meraka, Computer Science Division, Stellenbosch University, Stellenbosch, South Africa

Article info

Article history:
Received 29 January 2016
Revised 2 December 2016
Accepted 5 December 2016
Available online xxx

Keywords:
Formal concept analysis
Tag clouds
Browsing software repositories
Interactive tag cloud visualization

Abstract

Context: version control repositories contain a wealth of implicit information that can be used to answer many questions about a project’s development process. However, this information is not directly accessible in the repositories and must be extracted and visualized.

Objective: the main objective of this work is to develop a flexible and generic interactive visualization engine called ConceptCloud that supports exploratory search in version control repositories.

Method: ConceptCloud is a flexible, interactive browser for SVN and Git repositories. Its main novelty is the combination of an intuitive tag cloud visualization with an underlying concept lattice that provides a formal structure for navigation. ConceptCloud supports concurrent navigation in multiple linked but individually customizable tag clouds, which allows for multi-faceted repository browsing, and scriptable construction of unique visualizations.

Results: we describe the mathematical foundations and implementation of our approach and use ConceptCloud to quickly gain insight into the team structure and development process of three projects. We perform a user study to determine the usability of ConceptCloud. We show that untrained participants are able to answer historical questions about a software project better using ConceptCloud than using a linear list of commits.

Conclusion: ConceptCloud can be used to answer many difficult questions such as “What has happened in this project while I was away?” and “Which developers collaborate?”. Tag clouds generated from our approach provide a visualization in which version control data can be aggregated and explored interactively.
© 2016 Elsevier B.V. All rights reserved.

∗ Corresponding author.
E-mail addresses: [email protected] (G.J. Greene), [email protected] (M. Esterhuizen), bfi[email protected] (B. Fischer).

http://dx.doi.org/10.1016/j.infsof.2016.12.001
0950-5849/© 2016 Elsevier B.V. All rights reserved.

Please cite this article as: G.J. Greene et al., Visualizing and exploring software version control repositories using interactive tag clouds over formal concept lattices, Information and Software Technology (2016), http://dx.doi.org/10.1016/j.infsof.2016.12.001

1. Introduction

Version control repositories contain a wealth of implicit information that can be used to answer many questions about a project’s development process, such as “Who worked on these files?”, “Which developers collaborate?”, “What are the co-changed methods?”, or “What has happened in this project while I was away?”. Answering such questions is a daily task for software developers [1]. Developers also rely on examining the history of a software project to keep up with changes, understand coding decisions and debug [2]. In co-located teams new developers rely on members of the team to help them ramp up [3], but in large open-source projects, where this is no longer possible, the repository information becomes a valuable resource for new developers as well [4].

While it is well-known that version control repositories are a valuable source of information, repository tools are not set up to provide insights into the history of a project directly and can only be used to see information about individual commits. Manually examining the last few commits to a software project in the version control repository is feasible for regular contributors to the project who are only seeking information about the most recent changes. However, manually examining the entire commit log of a repository in order to answer more complex questions becomes infeasible. Individual commits provide information about a single change to the project, but only when a large number of commits is aggregated does the information become accessible to developers seeking answers to complex questions.

We develop ConceptCloud, an interactive tag cloud visualization engine for software repositories that aggregates commit data and lets users easily construct uniform visualizations of many different

aspects of the project history. ConceptCloud makes use of a novel combination of tag clouds and an underlying concept lattice [5] to support exploratory search [6,7] tasks on software repositories.

When users have no previous knowledge of a project or have not yet formulated a direct query, their task becomes one of exploratory search instead of direct search or retrieval [7]. While there are already approaches supporting specific retrieval tasks and visualizing aspects of software repositories [8], support for exploratory search in software repository data remains unavailable. The goal of our work is to build a flexible and interactive visualization engine that allows users to visualize different aspects of a project interactively and therefore supports exploratory search tasks, instead of presenting the user with one static, pre-configured view.

An exploratory search approach can provide an overview of the repository data and allow the user to further investigate any aspects of the project which they might find interesting. Therefore, exploratory search approaches can support new developers on a project in understanding the project history and team structure. An exploratory approach can also be used to answer more general questions (e.g., “Which developers collaborate?”) that cannot be formulated as a single search query, as would be possible if the question were more focused (e.g., “Who collaborates with Alice?”).

Tag clouds (or word clouds) are a simple visualization method for textual data where the frequency of each tag is reflected in its size. We use a tag cloud visualization to present aggregated software repository data, as tag clouds support exploratory search tasks and have been found to be effective when the information discovery task is broad [9]. While our tag cloud visualization may not be the optimal visualization for all aspects of the data, it is flexible enough to visualize many aspects of the software project such as developer expertise (e.g., which developers have worked on particular files or directories and would be good candidates to ask questions about this functionality), co-changed methods in a software project, project activity (e.g., in which years and months has there been a lot of development, and on which parts of the system), or developer collaboration (e.g., which developers are working together on which parts of the project) in a uniform way. Our interactive tag clouds allow developers to aggregate commits into groups and filter commits that apply to a certain topic, which has been noted by developers to be useful [2].

We generate tags directly from the data that we extract from software repositories, instead of relying on user-generated labels as tags for particular content, as often done in Web 2.0 applications (such as Flickr’s early tag cloud view). The data available in a version control archive is often large (for example, more than 500,000 revisions for the Linux [10] repository) and so we allow the user to make incremental refinements (i.e., navigate) in the tag cloud in order to generate smaller, more detailed visualizations. The navigation in our tag clouds is crucial for facilitating exploratory search tasks. Navigation using tag clouds has previously been explored using a Bayesian approach [11]; however, navigation in our browser is supported by a novel combination of tag clouds and concept lattices [5,12,13].

We conjecture that a concept lattice [5] provides a high level of internal structure for the repository data and therefore allows users to explore the data through multiple navigation paths. Concept lattices have been shown to be useful for browsing data [14–16] but large lattices do not provide a suitable data visualization because the relationships between the concepts are difficult to identify in a large Hasse diagram. Therefore, we make use of a concept lattice to facilitate navigation in the more intuitive and scalable tag cloud visualization.

Fig. 1 shows an overview of our approach. We construct a formal context from data in a version control archive (see Section 4.1) and generate a concept lattice directly from the context. Note that we have used a small illustrative example as larger context tables and lattices (for example, one derived from the Linux repository) become incomprehensible. Our refinement-based navigation algorithm (see Section 3.4) then enables interactive repository browsing through our tag cloud interface (see Section 4.1). Our navigation algorithm maintains a focus concept in the underlying lattice which represents the user’s current tag selection. We derive the tag cloud visualization from the current focus concept and update it after each navigation step. Navigation is driven by the user’s selection (or de-selection) of tags in the tag cloud. Fig. 1(i) shows the initial focus concept generating the first tag cloud; after the selection of tag “Alice” the focus moves to (ii) and the tag cloud is updated.

By using different objects in the formal contexts (see Section 3.2) that are used to construct concept lattices, we are able to generate tag clouds that provide different perspectives on the same underlying data in the same familiar visualization. Our foundation in formal concept analysis allows us to change the objects easily to get different insights on the same repository.

We have implemented our approach in the ConceptCloud browser (available at www.conceptcloud.org) which includes advanced visualizations, such as multiple interlinked tag clouds. Section 5 shows the application of ConceptCloud to three different repositories.

In this paper, we extend our previous work [17] by providing a formalization for our formal context construction from repositories (see Section 2), combining multiple archives (such as issue databases and version control repositories) in the same context in order to support data fusion (see Section 3.2.5) and developing a browser scripting language for ConceptCloud to support advanced customizations (see Section 4.2.4). We have also conducted additional evaluation in the form of a user study (see Section 7).

2. Modeling software repositories

We use a simple repository model derived from Hindle and Germán’s SCQL [18] to formalize how we construct the contexts that underpin our browser: a repository is simply a collection of versions of a set of files that are grouped into revisions. Note that we follow the SVN terminology [19] here. Hindle and Germán [18] refer to versions as revisions, while revisions are called modification requests; elsewhere revisions are called transactions.

A version v ∈ V denotes the abstract state of a file f ∈ F created by an author a ∈ A at a time t ∈ T. We ignore the actual file contents and only use meta-data and abstract modifications. Versions constitute a version history if they are ordered by a precedence relation ≺ that holds only between versions of the same file and is compatible with the file creation times. We say that v evolves into v′ if v ≺ v′ holds; two versions v1 and v2 are merged into v if v1 ≺ v and v2 ≺ v.

Definition 1. Let V ⊆ F × T × A be a set of versions over files F and ≺ ⊆ V × V be an irreflexive partial order. (V, ≺) is called a version history iff v = (f, t, a) ∈ V, v′ = (f′, t′, a′) ∈ V, and v ≺ v′ imply f = f′ and t < t′.

A revision r is a set V of file versions that are committed to the repository R at time t by an author a; on commit, some meta-data (i.e., author, time, and an additional log message l ∈ L) is stored together with the versions. We assume that each revision r ∈ R contains only one version of a file (which need not be the most recent version), and that each revision is uniquely determined by an abstract identifier id(r).

Definition 2. Let (V, ≺) be a version history and R ⊆ P(V) × T × A × L be a set of revisions. R is called a repository iff r = (V, t_r, a_r, l) ∈ R and v = (f, t_v, a_v) ∈ V imply t_v ≤ t_r and v ⊀ v′ for all v′ ∈ V.
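
The repository model of Definitions 1 and 2 can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the class names Version and Revision, the integer timestamps, and the representation of ≺ as a set of pairs are hypothetical and not taken from ConceptCloud. It checks the file and time condition of Definition 1 (transitivity of ≺ is assumed rather than verified) and, for Definition 2, that no version is newer than its revision and that no version in a revision precedes another version of the same revision, which is one reading of the assumption that a revision contains only one version of a file.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Tuple


@dataclass(frozen=True)
class Version:
    """A version (f, t, a): abstract state of a file (hypothetical type)."""
    file: str
    time: int      # assumption: timestamps are plain integers
    author: str


@dataclass(frozen=True)
class Revision:
    """A revision (V, t, a, l): committed versions plus commit meta-data."""
    versions: FrozenSet[Version]
    time: int
    author: str
    log: str


# Assumption: the precedence relation is given as pairs (v, w) meaning v ≺ w.
Precedence = Set[Tuple[Version, Version]]


def is_version_history(prec: Precedence) -> bool:
    """Definition 1: v ≺ w implies same file and strictly earlier time."""
    return all(v.file == w.file and v.time < w.time for v, w in prec)


def is_repository(revisions: Iterable[Revision], prec: Precedence) -> bool:
    """Definition 2, as read above: versions are not newer than their
    revision, and no two versions of one revision are related by ≺."""
    for r in revisions:
        for v in r.versions:
            if v.time > r.time:
                return False
            if any((w, v) in prec for w in r.versions):
                return False
    return True
```

For example, a revision that commits two successive versions of build.xml at once is rejected by is_repository, while committing them in two separate revisions is accepted.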


Fig. 1. Navigating concept lattices with tag clouds: tag clouds correspond to the matching colored concepts in the lattice (tag clouds from left to right correspond to concepts
i, ii and iii respectively). Context table (top left) used to generate concept lattice (top right). Tag clouds are refined on each tag selection (selected tags shown in red). (For
interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

We can easily extend this basic model towards common revision control systems. For example, in CVS [20], the notions of versions and revisions are conflated; in our model we thus have for all revisions r = (V, t, a, l) ∈ R that V = (f, t, a). Note that we do not model revision tagging explicitly, but assume that the tags are part of the log messages. In SVN, each revision can only contain the most recent version of a file, and only the commit author and time are recorded but not the file author or modification time. Hence, in our model we have for all r = (V, t_r, a_r, l) ∈ R and v = (f, t_f, a_f) ∈ V that v ∈ V implies that t_f = t_r and a_f = a_r. Note that we are only interested in the linear sequence of revisions and therefore do not model explicit branching and merging, but again assume that this information is encoded into the log messages, if requested. For distributed revision control systems such as Git we analyze a clone of the repository. Note that clones of the repository in different states will generate different contexts, as the contexts are generated using the commit information extracted from the repository. Therefore, if a repository is not up-to-date (i.e., has changes available to be pulled) then the generated context will differ from that of the up-to-date repository, as the list of commits differs.

3. Navigation framework

In our model, we have a set of revisions and a set of attributes for each revision; the attributes are divided into separate categories such as author, date, or file name. Our goal in browsing is to retrieve a set of revisions which share a common attribute such as the same author, and then to refine this set gradually by adding more attributes. We use formal concept analysis (FCA) as the framework to achieve this goal.

3.1. Formal concept analysis

Formal concept analysis (FCA) [5,12,13] uses lattice-theoretic methods to investigate abstract relations between objects and their attributes. Such contexts can be imagined as cross tables where the rows are objects and the columns are attributes (cf. Fig. 1).

Definition 3. A formal context is a triple (𝒪, 𝒜, I) where 𝒪 and 𝒜 are sets of objects and attributes, respectively, and I ⊆ 𝒪 × 𝒜 is an arbitrary incidence relation.

Definition 4. Let (𝒪, 𝒜, I) be a context, O ⊆ 𝒪, and A ⊆ 𝒜. The common attributes of O are defined by α(O) = {a ∈ 𝒜 | ∀o ∈ O : (o, a) ∈ I}, the common objects of A by ω(A) = {o ∈ 𝒪 | ∀a ∈ A : (o, a) ∈ I}.

For example, the common attributes of the objects revision-1 and revision-2 in Fig. 1 are Alice, 10/14 and build.xml.

Concepts are pairs of objects and attributes which are synonymous. They are maximal rectangles (modulo permutation of rows and columns) in the context table. For example, ({revision-1, revision-2}, {Alice, 10/14, build.xml}) in Fig. 1 is a concept, since adding another revision object loses common attributes, while adding another attribute loses common objects.

Definition 5. Let C be a context. c = (O, A) is called a concept of C iff α(O) = A and ω(A) = O. π_O(c) = O and π_A(c) = A are called c’s extent and intent, respectively. The set of all concepts of C is denoted by B(C).

Concepts are partially ordered by inclusion of extents such that a concept’s extent includes the extent of all of its subconcepts; the intent-part follows by duality.

Definition 6. Let C be a context, c1 = (O1, A1), c2 = (O2, A2) ∈ B(C). c1 and c2 are ordered by the subconcept relation, c1 ≤ c2, iff O1 ⊆ O2. The structure of B(C) and ≤ is denoted by B̲(C).

The basic theorem of FCA states that the structure induced by the concepts of a formal context and their ordering is always a complete lattice. Such concept lattices have strong mathematical


properties and reveal hidden structural and hierarchical properties of the original relation. They can be computed automatically from any given relation between objects and attributes. The greatest lower bound or meet and least upper bound or join can also be expressed by the common attributes and objects.

Theorem 7 (Wille [5]). Let C be a context. Then B̲(C) is a complete lattice, the concept lattice of C. Its meet and join operation for any set I ⊂ B(C) of concepts are given by

\[
\bigwedge_{i \in I} (O_i, A_i) = \Bigl( \bigcap_{i \in I} O_i,\; \alpha\bigl(\omega\bigl(\bigcup_{i \in I} A_i\bigr)\bigr) \Bigr)
\]

\[
\bigvee_{i \in I} (O_i, A_i) = \Bigl( \omega\bigl(\alpha\bigl(\bigcup_{i \in I} O_i\bigr)\bigr),\; \bigcap_{i \in I} A_i \Bigr)
\]

Each attribute and object has a uniquely determined defining concept in the lattice. For example, the defining concept for Alice is indicated in blue in the concept lattice in Fig. 1(ii). The defining concepts can be calculated directly from the attribute or object, respectively, and need not be searched in the lattice.

Definition 8. Let B̲(𝒪, 𝒜, I) be a concept lattice. The defining concept of an attribute a ∈ 𝒜 (object o ∈ 𝒪) is the greatest (smallest) concept c such that a ∈ π_A(c) (o ∈ π_O(c)) holds. It is denoted by μ(a) (σ(o)). We use δ(x) to denote μ(x) if x is an attribute and σ(x) otherwise.

Efficient algorithms exist for the computation of the concept lattices and the meet and join of concepts in the lattice, such as Lindig’s algorithm [21].

3.2. Contexts from repositories

In order to construct a concept lattice from repository data we need a context table. The first step in the construction of such a context table is to determine which field in the data will be taken as the object and which fields are suitable as attributes for that object. We use three different object types, namely revisions, files, and revision-file pairs (i.e., changes) in order to construct different types of contexts, which enables us to create different tag cloud visualizations for the same repository, providing new insights into the data. We are able to combine multiple data sources in the same context to support data fusion as object types in the context table need not be homogeneous. We use a combination of issue and version control data, in the same context, to provide a more complete overview of a project.

3.2.1. Basic preprocessing

When we construct context tables we pre-process the meta-data that we extract from the revision control system, in particular the log messages, file names, and commit times from each revision in the repository. We use a function W : L → P(W) that segments each log message into individual words w ∈ W, removes words on a default stop list, and reduces each word to its stem, using the Apache Lucene implementation of Porter’s [22] stemming algorithm. Since the stem is not necessarily a proper word we take the most frequently used word that evaluates to a given stem as representative in the cloud.

We group both file names and commit times into increasingly coarser bins. For file names, we use a function D : F → P(F) that decomposes each file name into a set of all path prefixes, similar to recursively applying the Unix dirname command. For commit times, we use a function T : T → P(T) that truncates the times at different precision levels (days, months, and years).

In addition, we also use aggregators (such as aggregating files with the same names, even across directories) to capture regularities that appear across the bins, e.g., similarities between identically named files such as README.txt in different directories. We use d, n, and t to denote the mappings from each time to the corresponding weekday, and from each file to its base name and type, respectively.

Note that we do not perform more complicated pre-processing steps such as word sense disambiguation [23] or identity merging [24]. We instead prefer to leave the user in control of such decisions.

3.2.2. Revision-based contexts

In a revision-based context we interpret the revisions, represented by their revision number, as objects and the commit meta-data (e.g., author or words from the log message) as attributes; each revision is associated with its own meta-data as attribute. This context type represents the canonical view of repositories. Its concepts are sets of revisions and their common attributes (e.g., all revisions that include a common set of files). It is useful to get a historical overview of a project, for example to identify when the most changes have been made to a project, which developers have worked on particular files and which directories have been development hotspots.

Definition 9. Let R be a repository, and A_R = W ∪ A ∪ T ∪ F. C_R = (id(R), A_R, I_R) is called the revision-based context of R if for all r = (V, t, a, l) ∈ R, v = (f, t′, a′) ∈ V, and x ∈ A_R, we have (r, x) ∈ I_R iff

(i) x ∈ W(l), or
(ii) x = a, or
(iii) x = d(t) or x ∈ T(t), or
(iv) x = n(f) or x ∈ D(f), or
(v) x = t(f).

3.2.3. File-based contexts

In a file-based context we interpret the files as objects but derive the attributes from the revisions’ pre-processed meta-data; more precisely, each file receives all attributes from all revisions that involve the file. Concepts from such contexts are sets of files with common attributes (e.g., the set of all files on which a group of developers have all worked); in particular, each commit induces a concept: since a developer can only commit one set of files at any given time, the set of committed files is maximal with respect to the set of all attributes derived from the commit meta-data.

Definition 10. Let R be a repository, and A_F = W ∪ A ∪ T ∪ id(R). C_F = (F, A_F, I_F) is called the file-based context of R if for all r = (V, t, a, l) ∈ R, v = (f, t′, a′) ∈ V, and x ∈ A_F, we have (f, x) ∈ I_F iff

(i) x ∈ W(l), or
(ii) x = a, or
(iii) x = d(t) or x ∈ T(t), or
(iv) x = n(f) or x ∈ D(f)\{f}, or
(v) x = t(f), or
(vi) x = id(r).

Note that revision- and file-based contexts give complementary views on the repository. For example, the author tags from a revision-based context will be scaled according to the number of revisions that the author has committed over the project lifetime; during browsing only one author tag can be selected at a time since each revision has only one author. In a file-based context, the author tags will be scaled according to how many files a particular author has changed. Selecting an author tag will reveal all collaborators, i.e., all other authors who have also changed any of the same files. Selecting two author tags will then reveal the extent of their collaboration, i.e., all files they have both worked on. Therefore file-based contexts can be used to visualize the collaboration in the project, showing which developers work together and on which files.
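
The context constructions of Definitions 9 and 10 can be illustrated on a toy repository. The Python sketch below is a simplified illustration, not the ConceptCloud implementation: the commit data and all function names are hypothetical, log-message words, weekdays, and file types are omitted, and dates are assumed to be plain ISO strings. It builds a revision-based and a file-based incidence relation and evaluates the common attributes α and common objects ω of Definition 4 on them.

```python
from collections import defaultdict
import posixpath

# Toy commit data, invented purely for illustration.
REVISIONS = [
    {"id": "r1", "author": "Alice", "date": "2014-10-02",
     "files": ["build.xml", "src/Main.java"]},
    {"id": "r2", "author": "Alice", "date": "2014-10-09",
     "files": ["build.xml"]},
    {"id": "r3", "author": "Bob", "date": "2014-11-01",
     "files": ["src/Main.java", "README.txt"]},
]


def path_prefixes(path):
    """D(f): the path itself plus all of its directory prefixes."""
    prefixes = {path}
    parent = posixpath.dirname(path)
    while parent and parent not in prefixes:
        prefixes.add(parent)
        parent = posixpath.dirname(parent)
    return prefixes


def time_bins(date):
    """T(t): truncate an ISO date to day, month, and year precision."""
    return {date, date[:7], date[:4]}


def revision_context(revisions):
    """Revision-based context as a map: revision id -> attribute set."""
    ctx = {}
    for r in revisions:
        attrs = {r["author"]} | time_bins(r["date"])
        for f in r["files"]:
            attrs |= path_prefixes(f) | {posixpath.basename(f)}
        ctx[r["id"]] = attrs
    return ctx


def file_context(revisions):
    """File-based context: a file collects attributes of all its revisions."""
    ctx = defaultdict(set)
    for r in revisions:
        for f in r["files"]:
            ctx[f] |= {r["author"], r["id"]} | time_bins(r["date"])
            ctx[f] |= (path_prefixes(f) - {f}) | {posixpath.basename(f)}
    return dict(ctx)


def common_attributes(ctx, objects):
    """alpha(O): attributes shared by every object in O."""
    objects = list(objects)
    if not objects:  # alpha of the empty set is the full attribute set
        return set().union(*ctx.values())
    return set.intersection(*(ctx[o] for o in objects))


def common_objects(ctx, attributes):
    """omega(A): objects that carry every attribute in A."""
    wanted = set(attributes)
    return {o for o, attrs in ctx.items() if wanted <= attrs}
```

In this toy data, revisions r1 and r2 share the attributes Alice, 2014-10, 2014 and build.xml, and in the file-based context the only file carrying both author tags Alice and Bob is src/Main.java, which mirrors how selecting two author tags reveals the files both authors have worked on.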


Fig. 2. Multiple linked tag clouds of the JUnit Repository in ConceptCloud, showing changed files (top), authors (bottom) and years (left). The tag cloud is constructed from
a revision-based context.

3.2.4. Change-based contexts

In a change-based context we use pairs of files and revisions as objects, so that for example (hello.java, revision-1) and (hello.java, revision-3) become separate objects in the context. This allows us to use the content of the files as additional attributes, which we cannot do with revision- or file-based contexts. In our implementation we focus on the changes (rather than the entire contents), and use a lightweight fact extractor [25] to get the signatures of the changed methods from each file. We could therefore have, for example, the attributes public int equals(), public static void main(), and Alice associated with the object (hello.java, revision-1) to represent the fact that revision-1 by Alice changes the methods equals and main. Selecting a method tag m then produces a tag cloud which contains all other methods that have been co-changed with m, scaled according to how often they have been changed together (cf. Fig. 3). Therefore change-based contexts can be used to construct visualizations that depict the co-changed methods in the project as well as showing other method information, for example, which methods are development hotspots and in which time periods.

In our model, we assume a set M of abstract modifications (in the spirit of the atomic changes of Ren et al. [26]), and use Δ(v′, v) ⊆ M to denote the (non-symmetric) difference between two versions v′ ≺ v of a file.

Definition 11. Let R be a repository, and A_C = W ∪ A ∪ T ∪ F ∪ id(R) ∪ M. C_C = (F × id(R), A_C, I_C) is called the change-based context of R if for all r = (V, t, a, l) ∈ R, v = (f, t′, a′) ∈ V, v′ ∈ V with v′ ≺ v, and x ∈ A_C, we have ((f, r), x) ∈ I_C iff

(i) x ∈ W(l), or
(ii) x = a, or
(iii) x = d(t) or x ∈ T(t), or
(iv) x = n(f) or x ∈ D(f), or
(v) x = id(r), or
(vi) x ∈ Δ(v′, v).

3.2.5. Combined contexts: bug reports and revision control data

Software development projects often make use of dedicated tools for different tasks, such as issue databases, task trackers, and source code repositories, or use a tool that provides a combination of these such as GitHub [27]. Moreover, archive entries can be linked across the different tools, by, for example, adding an issue identifier to the log message of a revision which references that issue. Ideally, visualization tools should be able to “fuse” the information from different archives for the same project into a single combined data structure, such as Hipikat’s uniform artifact database [4] or Codebook’s central graph [28].

Here, we combine data from multiple archives (or different features of GitHub) into a single context using multiple object types. In particular, we combine repository data and GitHub issue data into the same context. In the combined contexts we use the revisions and bug reports as objects (since the object types in the context table need not be homogeneous) and derive the attributes from both the revisions’ pre-processed meta-data and the text from the bug reports. Therefore, where bug reports and revisions share a common attribute they will be grouped together in the same concepts, indicating the relation of the bug reports to the revisions. The combined context gives a more complete overview of the project activities.

Note that the objects in a combined context are a union of revisions and issue IDs; this is different to the construction of the change-based contexts where the objects are pairs of revisions and files. The combined context’s attributes are the union of the original attributes for both the revisions and the issues, and each object keeps its own attributes. We merge corresponding attribute categories from the data sources, e.g., log messages and issue descriptions. This assumes that words have the same meaning in the different archives, but in return it provides us with implicit links between bugs and revisions that both talk about a specific topic (e.g., “Linux”), because their log messages and descriptions share a common attribute. The issues and revisions are therefore connected automatically, without the need to create any links, as for example described by Śliwerski et al. [29]. However, for a data source such as GitHub, which stores explicit references between commits and issues, we are able to link these in the context table by using a “surrogate key” attribute which we assign to both the revision object and the issue object in the context table. A surrogate key is therefore an additional attribute which serves exclusively to indicate an explicit link between the revision and the issue in the concept lattice. Section 5.2 provides examples of tag clouds generated from Git repositories and issues in the GitHub issue-tracking


system. Combined contexts can be used to visualize which files have been changed when a bug has been fixed as well as showing the project activity both in terms of commits and issue reports.

Fig. 3. JUnit: vacation cloud for David Saff constructed from a change-based context. Main tag cloud view (top). Changes by Alex Yursha (bottom left) and Kevin Cooney (bottom right). Alex Yursha and Kevin Cooney are selected with sticky tags. Only tags with occurrence greater than two are shown.

3.3. Tag clouds from concepts

We visualize repository data with a tag cloud that we construct from the focus concept in the lattice. Since a concept comprises a set of objects and a set of attributes, it is tempting to use the attributes (i.e., the intent) as the tag cloud. However, this produces degraded clouds because (i) the intent only contains the attributes common to all objects, and (ii) each attribute only occurs once so that all tags would have the same size. Instead, we use the intents of the extents; more precisely, we collect all attributes of the defining concept of each object in the extent of the focus concept; we also add the objects themselves, to allow their direct selection in the tag cloud.

Definition 12. The tag cloud from a concept $c = (O, A) \in B(C)$ is defined as $\tau(c) = O \uplus \biguplus_{o \in O} \pi_A\,\sigma(o)$.

Here $\uplus$ denotes multiset union. By construction, the objects in the tag cloud induce subconcepts of the concept from which the tag cloud was derived; moreover, all tags have a non-bottom meet with that concept.

3.4. Navigating concept lattices with tag clouds

The browser maintains a focus concept, from which it renders the tag cloud as described above; when the user selects (or deselects) a tag, the browser updates the focus and re-renders the tag cloud. The focus, or more precisely, its extent contains the subset of objects in the repository that share all currently selected tags. The initial focus (corresponding to an empty selection set) is therefore the lattice's top element, whose extent contains the entire repository (see Fig. 1(i)).

Navigation is refinement-based: when the user selects another tag, the browser updates the focus by computing the meet of that tag's defining concept and the old focus.

Intuitively, deselection should be the inverse of selection: deselecting the last selected tag should move the focus back to its previous position. Because of the duality in the concept lattice, we would expect the deselection operation to be implemented by the join in the lattice. However, using the join operation to deselect an attribute a would move the focus up in the lattice and effectively deselect all other currently selected attributes except a, which leads to counterintuitive results. We must therefore recompute the focus as the meet of the defining concepts of the remaining selected tags, in order to provide a deselection operation which is the inverse of the selection operation.

3.5. Relation to information retrieval

Our lattice-based browsing approach is related to classical information retrieval (IR) [30,31]. The context table can be seen as a Boolean version of the document-term matrix, while the concept lattice can be seen as a representation of the usual indexes. A concept in the lattice contains, for each document in its extent, the set of terms that occur in the document in its intent. For each term the set of objects in its introducing concept is its inverted index entry. If we see the selected tags as a conjunctive query, then the focus' extent is the query's result.

The tag cloud can also be seen as the aggregation of the Boolean term frequencies for each document in the query result, scaled according to the size of the document collection. The concept lattice provides us with an efficient way to compute this tag cloud; a computation from only the inverted index would be impractically inefficient: we would first need to retrieve all documents indexed by the selected tags, then iterate over the entire vocabulary and compute the size of the intersection of each term's inverted index with the query's result. Hence, any efficient IR-based implementation must use the same information in essentially the same way as our lattice-based implementation. However,


we can exploit the lattice structure, e.g., to update the focus incrementally, or to show which other tags are implied by (i.e., always occur along with) the current selection set.

4. ConceptCloud browser

We have implemented our approach in the ConceptCloud browser. The VISSOFT 2015 evaluated artifact [17] is available at vissoft15.conceptcloud.org/ and the continuously updated web application is available at www.conceptcloud.org. Our browser can automatically index Git and SVN repositories and create tag cloud visualizations from them. It also supports more advanced pre-processing and interface customizations.

ConceptCloud comprises three main components that extract meta-data from the revision control system, construct a context table in the desired format, and display the tag cloud of the resulting lattice. ConceptCloud automates the process of creating a tag cloud visualization from a version control archive and its user interface supports customization of the tag clouds. The browser is generic and can show tag clouds of different context types. It is also completely automatic: there are no manual pre-processing steps, and the user only needs to enter the URL of the repository. A more detailed description of the tool architecture and usage is available in [32].

ConceptCloud currently supports extraction of meta-data and construction of context tables from SVN [19] and Git [33] repositories, both locally and remotely. For Git repositories, the hashes are converted into sequential revision numbers. Both extractors support the revision-, file-, and change-based contexts, as described in Section 3.2. The construction of change-based contexts requires the identification of methods changed in consecutive versions, which requires the extraction to be language-aware. Such contexts are currently limited to Java files. The generated context tables can be saved in XML format so that they can be loaded again without extraction.

For the lattice construction, we use a method based on the Colibri/Java library [34] which constructs concepts on the fly. We thus never need to compute the full lattice and are able to render an initial tag cloud relatively quickly.

4.1. Tag cloud interface

We make use of a tag cloud visualization that can be customized to show different views on the repository. Multiple different visualizations for different metrics were found to confuse users [35]. We therefore propose one uniform visualization that can be used to explore various different aspects of a version control archive.

The simplest and most popular tag cloud layout [36] is an alphabetically sorted list of tags in a roughly rectangular shape, which was found by Schrammel et al. to perform better than random or semantic layouts [37]; we use this layout because it simplifies textual search within the tag cloud. We scale each tag $i$ between the given minimum and maximum font sizes $f_{\min}$ and $f_{\max}$, according to its weight $t_i$ in relation to the minimum and maximum weights in the context table, $t_{\min}$ and $t_{\max}$; hence,

$$\mathrm{size}(i) = \frac{(f_{\max} - f_{\min}) \cdot (t_i - t_{\min})}{t_{\max} - t_{\min}} + f_{\min} - 1$$

for $t_i > t_{\min}$ and $\mathrm{size}(i) = f_{\min}$ otherwise.

A variety of alternative tag layout methods have been proposed, such as tag flakes by Caro et al. [38]. Tag flakes are used in order to provide context for tags as basic tag clouds fail to show how the tags are related [38]. However, instead of using a more complex visualization that depicts the relationships between the tags, we use incremental refinement in the tag cloud to provide context and structure to the tag clouds. By selecting a tag in the tag cloud the resulting cloud will provide contextual information for the currently selected tag.

The initial tag cloud shown in ConceptCloud includes tags from all attributes and objects in the context table (using the top concept in the lattice as the focus). This allows the user to select any tag from the extracted repository information. Tags in the initial tag cloud will be at their largest size because we scale all tags according to the maximum and minimum tags in this cloud. Making selections in the initial tag cloud will result in clouds with smaller tags (cf. Fig. 1), indicating that the cloud is only showing attribute tags from a subset of the total objects in the context table.

By construction, the objects in the tag cloud induce subconcepts of the concept from which the tag cloud was derived; moreover, all tags have a non-bottom meet with that concept.

Proposition 13. Let $c \in B(O, A, I)$ be a concept, $o \in O$, and $t \in O \cup A$. Then (i) $o \in \tau(c) \Rightarrow \sigma(o) \le c$, and (ii) $t \in \tau(c) \Rightarrow \delta(t) \wedge c \ne \bot$.

Since the tag clouds can be very large we provide functionality in the interface to limit clouds to one particular category (e.g., commit authors), or to remove unwanted categories from them. The cloud can also be adjusted to show only a certain number of tags or to show only tags that occur more than a given number of times. Since all the tags are textual, users are also able to search in the tag cloud to find a tag if they already know which tag they want to select (such as their commit name).

Customized visualizations can be created from the initial tag cloud by selecting relevant tags and by moving categories of tags into separate viewers. For example, Fig. 2 shows a view of the year, filename and author clouds for the JUnit repository where the filename tag AllTests.java has been selected. The visualization shows in which years this file has been changed, who has changed this file and what other files are often changed in the same commit as this one, scaled according to how often they are changed together. Fig. 2 allows us to answer questions such as "Who has changed this file?" (i.e., expertise), "Is this file still under development?" and "What other files should I be looking at if I want to change this file?" (i.e., co-changed files).

Viewers can also be opened with a "sticky" tag that always remains selected and cannot be deselected. This enables us to open multiple parallel viewers with different tag selections in the same category (such as months, cf. Fig. 4) which update simultaneously when another tag is selected in any viewer. Sticky tags therefore enable us to show mutually exclusive views in two tag clouds next to each other.

A tag is implied if it has not been selected explicitly, but corresponds to an attribute in the focus' intent. Implied tags reveal the repository's internal structure, similar to the way association rules reveal the implicit structure of shopping baskets [39] but without any additional cost.

4.2. Advanced visualization in ConceptCloud

In addition to the interface customizations that can be performed on the tag cloud there are also two customizations that can be performed during construction, namely personalization and filtering. A combination of these two customizations allows us to produce a "vacation cloud" as described in Section 4.2.3 below.

ConceptCloud also supports a number of advanced visualizations such as customizing a specific tag cloud or using a scripting language to automatically lay out the ConceptCloud interface.

4.2.1. Personalization in tag clouds

We can personalize a tag cloud for a particular developer by identifying all tags that apply to that developer (e.g., files they have changed) in our pre-processing step. We then assign these tags


to different categories than the tags from the remaining commits (such as "file of interest"), and render them in a different color. In the personalized tag cloud, the files that have been changed by that particular developer will thus be easily identifiable in views even when the tag for that developer has not been selected.

Fig. 4. JUnit: author clouds (top); changes to TestRunner.java (bottom). Tag clouds constructed from a file-based context and months/files are selected as sticky tags.

4.2.2. Filtering tag clouds

If we want to analyze only a particular section of a repository (e.g., only the portion since we started working on the project) we can restrict the revision range from which the context table is constructed. Our pre-processing offers different ways of specifying the ranges of interest, such as processing only a certain number of commits or processing only commits falling between a specified start and end date.

4.2.3. Customized visualizations

The combination of personalization and filtering steps allows ConceptCloud to highlight answers to the question "What happened in my project while I was away?" with a vacation cloud, as shown for example in Fig. 3. This is constructed from a change-based context where file and method tags have been personalized to the developer (here David Saff) and revisions have been filtered by the date of his last commit.

The initial tag cloud shows in which revision most files were changed (1856), when most changes happened (2014/06/18), or which developers have made most changes (Alex Yursha and Kevin Cooney, cf. Fig. 3). Tag colors indicate the corresponding categories and selected tags are shown in red. The words from the commit messages indicate that most changes were either pull requests or stylistic in nature, as indicated by prominent tags such as Change, Codingstyle, Legacycodingstyle, or Remove. However, the overall view of the changes in Fig. 3 does not provide us with many detailed insights into the data and we refine the view by selecting tags in order to discover more insights. Selecting a developer gives a more detailed view of their changes and selecting one of the most active developers, Yursha, reveals that he has only committed one revision that contains stylistic changes to many files. Alternatively, selecting Cooney reveals that he has merged in several pull requests (cf. Fig. 3) which contain changes to files that Saff has previously worked on (such as AllTests.java). Selecting further tags (e.g., From and Rowanhill) brings out further details (e.g., about the pull requests from Hill). The cloud also shows how often files and methods have been changed; it uses different colors to distinguish changes in files previously changed by Saff from those in other files. We can therefore see that different variants of the method skipped were a development hotspot during Saff's absence; we can further see that variants with different signatures were added (shown in light grey), on top of the changes to the variants that Saff has also worked on (shown in dark grey).

4.2.4. Scripting tag cloud viewers

We have developed a scripting language, ConSL, in order to construct and lay out multiple viewers in ConceptCloud simultaneously. Scripts can be written in order to open viewers for specific categories, open viewers with sticky tags (i.e., selections that are unique to the view and cannot be modified in the view) and to customize the layout of the viewers in ConceptCloud's interface. While manually opening viewers from the ConceptCloud interface is useful for exploration of a repository, opening multiple viewers (such as all tags from a particular category) and manually laying them out can be time consuming. ConSL scripts provide a mechanism to easily recreate a particular viewer layout and can be used on multiple datasets so that the datasets can be compared using the same custom layout. The same script can also be loaded every time a dataset is loaded so that there is no need to manually configure the tag cloud layout on opening ConceptCloud. After executing a ConSL script the user can still perform all available customizations through the interface.

Listing 1. Example of a script written in ConSL. The author view shows only author tags and the for-loop opens an author view for each year tag selection. Views are sized at 50% of the full screen width and viewer menus are hidden.

ConSL scripts are compiled and used to generate JavaScript code that is executed in the browser where ConceptCloud is loaded. ConSL provides four main operations: defining a view, for-loop constructs, opening a view and setting layout. A view can be defined with one, multiple or all categories of tags in the tag cloud. Views can then be opened with optional sticky tag selection arguments. For example, a view showing only the authors in the project can be defined and then this view can be opened with selection of year tag 2015, to open a view showing all project authors in 2015. A for-loop construct is provided to open multiple viewers with sticky tags from all tags in a certain category, such as a sticky tag for each year (see Listing 1), which can be tedious to achieve manually through the interface. ConSL's layout functionality allows the user to specify a precise layout and ordering for all the viewers. For example, Listing 1 shows a layout where each row will contain two viewers of equal width. The internal menus of each of these viewers will also be collapsed. Alternatively, using the interface's drag and drop functionality to manually resize and lay out multiple viewers can often lead to imprecise layouts. ConSL scripts can be loaded at the same time as a saved ConceptCloud context table, or from the tag cloud view. This enables users to load scripts after initial exploration of the dataset.
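The mechanics that such a script relies on (a focus determined by the selected tags, a tag cloud collected from the focus' extent, the font scaling of Section 4.1, and one author view per year with a sticky tag) can be illustrated with a small Python sketch. This is not ConSL syntax and not ConceptCloud's implementation: the toy context table, the `category:value` tag encoding, and the function names `extent`, `tag_cloud`, `font_size` and `open_view` are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy context table: object (revision) -> set of "category:value" tags.
CONTEXT = {
    "r1": {"author:egamma", "year:2000", "file:TestRunner.java"},
    "r2": {"author:egamma", "year:2001", "file:AllTests.java"},
    "r3": {"author:kbeck", "year:2001", "file:TestRunner.java"},
    "r4": {"author:kbeck", "year:2001", "file:AllTests.java"},
}

def extent(context, selected):
    """Objects that carry every selected tag (the extent of the focus)."""
    return {o for o, attrs in context.items() if selected <= attrs}

def tag_cloud(context, selected):
    """Multiset of the objects in the focus extent plus all of their attributes.

    The cloud is always recomputed from the full selection set, so removing a
    tag from `selected` and calling this again mirrors the de-selection rule
    of Section 3.4 (meet over the remaining selected tags).
    """
    cloud = Counter()
    for o in extent(context, selected):
        cloud[o] += 1              # the object itself is a selectable tag
        cloud.update(context[o])   # every attribute of the object counts once
    return cloud

def font_size(t_i, t_min, t_max, f_min=10, f_max=40):
    """Scaling rule of Section 4.1 as printed; f_min and f_max are example values."""
    if t_i <= t_min or t_max == t_min:
        return f_min
    return (f_max - f_min) * (t_i - t_min) / (t_max - t_min) + f_min - 1

def open_view(context, categories, sticky=frozenset(), selected=frozenset()):
    """A viewer: sticky tags always stay selected; only the given categories are shown."""
    cloud = tag_cloud(context, set(sticky) | set(selected))
    return {t: n for t, n in cloud.items() if t.split(":")[0] in categories}

# Counterpart of the for-loop described for Listing 1: one author view per year tag.
years = sorted({t for attrs in CONTEXT.values() for t in attrs if t.startswith("year:")})
views = {y: open_view(CONTEXT, {"author"}, sticky={y}) for y in years}
```

In this toy data, selecting a further tag in any view only adds to that view's selection set, while its sticky year tag remains part of every recomputation, which is the behaviour the text attributes to sticky tags.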


5. Illustrative application examples

We apply our ConceptCloud browser to two open source repositories and one industrial application to demonstrate the insights that can be obtained using the browser. We repeat and expand on a previous case study on the JUnit repository in Section 5.1 to highlight the flexibility of our browser. We also show how the browser can be used to explore both combined version control and issue data simultaneously using the RubyGems repository in Section 5.2. We have also applied our browser to generate insights from a small industrial project (see Section 5.3) in order to evaluate the appropriateness of the insights that can be gathered with ConceptCloud.

5.1. JUnit repository

JUnit is a popular open-source testing framework for Java which has been used in previous studies [40,41]. Here we repeat Weissgerber's study [40], which investigates developer roles up until 2006, and extend it to a more current date. We show that we can easily extend the previous observations on the repository through our interface even though our interface was not specialized only to identify collaboration patterns. We also show that we can make the same observations using our ConceptCloud browser as the customized visualizations for each aspect presented in [40]. We created the revision-based context for the JUnit project from its first revision in 03/12/2000 up until 26/02/2014 (1772 revisions).

Overview: in order to get an initial view of the project we open a commit time view and restrict it to years. This shows that project activity increases dramatically from the first full year in 2001 until 2007 and remains relatively steady thereafter. Selecting the year tag 2000 in the full cloud shows us that developer egamma started the project in December 2000. In an author cloud for the first full year of development (2001) we see that developers kbeck and emeade join the project in 2001 but egamma remains the most prolific author in that year (cf. [40]).

Authors by month: Weissgerber et al. [40] look specifically at the file changes made in the months March to June 2002. To repeat this we open viewers with "sticky tags" for March, April, and June 2002 (there was no commit in May 2002) and limit these to show only authors (cf. Fig. 4, top). Selecting an author tag shows us which files the author has worked on in each month. Fig. 4 shows that kbeck contributes less and less in the given period. The cloud for June 2002 shows the addition of developers vbossica and clarkware to the project.

Selecting the file TestRunner.java shows that egamma and kbeck have changed this file in April 2002 and only egamma has made changes in June 2002 (cf. bottom of Fig. 4). We also see that there is a group of files which have been changed at the same time.

Listing 2. ConSL script for generating author by month view of the JUnit repository (Fig. 4).

Differences in visualizations: while the visualization presented by Weissgerber et al. [40] is an author file graph which shows for each author lines connecting the author to a specific file (which is represented as a dot in the graph), our visualization shows the different tag sizes for the developers according to their amount of contribution. In the author file graph visualization the number of nodes connected to a developer can be used to assess the amount of their activity, whereas in the tag cloud their tag size directly corresponds to the amount of activity. Additionally, by selecting author names in the tag cloud the names of the corresponding files that these authors have been changing will be shown in the tag cloud. It is unclear how the author file graph presents the names of the files which have been changed. The author file graph [40] also allows the identification of developer collaboration: if two developers are linked to the same node they have collaborated on a file. In our tag cloud view from a file-based context the selection of a particular author would update all other author tags to show only authors that have been collaborating with the selected author in a size that represents the amount of collaboration. Therefore, in our tag cloud view the identification of collaboration is interactive and it is also scalable, since the tags for all collaborating developers can be shown at the same time. Using the sticky tag function, comparisons between different groups of collaborators can also be easily drawn, by comparing the tag clouds. The file author matrix [40] shows a grid-like summary of which developers have been working on which files across the project, where each pixel color indicates the amount of activity on a file. In our tag cloud visualization files can be selected to see which authors have been working most actively on a file and authors can also be selected to indicate on which files they have been working. A summary view across a group of developers (or files) can be created by making sticky tag viewers for the group of developers and comparing the tag clouds created.

Conclusions: ConceptCloud allowed us to gather the same insights as the dedicated tool presented by Weissgerber et al. [40]. However, ConceptCloud does not produce a static picture but allows the user to refine the analysis, and access the other information (e.g., log messages) that remains available.

5.2. RubyGems repository

We constructed the combined context for commits and issues from the RubyGems GitHub repository [42] to show how we can combine issue and repository data in the same tag cloud. The GitHub issue tracking system provides links between issues and commits that either close an issue or reference it. We extract these links, using the GitHub API, to create explicit links between issues and commits in our tag cloud, but we also extract keywords from the issues and commit messages and use these to create implicit links between issues and commits that discuss the same topics. For other issue tracking systems that do not include explicit links between issues and commits we would still be able to extract implicit links from the commit messages.

Linked issues and commits appear in the same tag cloud, showing which files have been changed in order to close an issue. For example, Fig. 5 shows the tag cloud containing information for issue 227 which was closed by commit 3642. We can immediately see that files rubygems.rb and specification.rb were fixed in relation to the bug reported about inactive gems. We see here tags #227 as well as tag 227, where #227 represents the issue object and 227 is part of the commit message for commit 3642. We also see tags r3642 and 3642, where 3642 represents the revision object and r3642 is used as a link between both the revision and


the issue objects. Therefore, while 3642 and #227 occur only once, r3642 occurs twice (and appears bigger) in the tag cloud as it is an attribute which applies to both the issue and the revision objects in the tag cloud.

Fig. 5. RubyGems: tag cloud for commit 3642 which closes issue 227 in RubyGems.

Additionally, we can explore all commits and issues that discuss a particular topic such as gem install. Fig. 6 shows the main files (orange tags), committers (green tags) and GitHub issue reporters (maroon tags) that are associated with the keywords "gem" and "install". We can see the main files changed that fix issues mentioning gem install and also files changed in commits where the commit message mentions gem and install. We can further restrict the cloud to showing only commits that have closed a bug report (by selecting the bug report status closed tag) mentioning the words gem and install (Fig. 7). We see that Eric Hodel is the only author that makes commits closing issues that mention gem and install and these commits only occur in 2013 and 2014. This indicates that while other authors have also made commits mentioning gem and install, Eric Hodel is responsible for this area as he has either fixed issues referring to gem install or has been responsible for closing these issues when merging a pull request from another developer.

Listing 3. ConSL script for the view in Fig. 6.

Conclusions: ConceptCloud can be used successfully to combine multiple data sources to get more detailed information on a specific project. We can fuse issue and repository data into the same underlying context table and explore commits that are related to specific issues in the tag cloud interface.

5.3. Industrial application

We used ConceptCloud to analyze the Git repository of a small, local non-profit organization. This project develops an educational service comprising a mobile app, backend, and data analytics. The goal of this application was to determine whether the insights that we gather using ConceptCloud are appropriate and can be confirmed by the project manager. Since this is a small localized development team the insights that we gather might not be surprising to the project manager, but we aim to validate the accuracy of our observations. We analyzed the project from its start (08/2015) up until the app's release to the Google Play Store (01/2016). Note that we use only abbreviations of the developers' commit names here to preserve their anonymity.

Project contributors: by creating author viewers for each month from the revision-based context of the project (Fig. 8) we saw that the project started towards the end of August 2015 with only three developers. In September all three developers contributed in similar amounts to the project. In October two more developers (CM and P9) joined and overall commit activity of the developers greatly increased. Additional contributors RS and S also joined the project in November, and this team remained relatively stable with only the addition of PW in December. The team structure changed again in January with SM, PW and RS leaving but two new (and therefore less-active) contributors, HW and F, joining the project. In each month the developers (excluding the new additions) appeared to be sharing the workload uniformly. We see that the project was expanding but also that there was a high developer churn. The project manager confirmed that contractor SM was replaced by two new full time developers in January.

We can also observe other apparent small contributors (with one to three commits) which on further investigation appear to be alternative aliases (particularly GitHub usernames) for some of the contributors (such as a developer editing the ReadMe file directly on GitHub and the commit being recorded with their GitHub username). These alias characteristics could also be incorporated into identity merging techniques as identity merging in projects is a difficult problem [43].

Developer collaboration: by creating the file-based context of the project we can observe collaborations between the developers. Selecting developer LS (Fig. 9) showed that he often collaborated with developers SM and P9; however, there are a small number of files common to other team members as well. Developers F and HW have also collaborated with LS since they joined the project. If we select an additional tag for developer AV (who showed up only as a small tag) and show which files both AV and LS have changed, we see that the gitignore was the only file common to most of the development team, which indicates that they also have different project focuses.

If we select the tags for LS's collaborators SM and P9 and show the tag cloud for the changed directories (Fig. 10(a)), we see a directory structure that indicates that these developers worked on the Android client. This cloud confirms that developers F and HW had also begun working on the Android client.

The directory cloud of developer AV (Fig. 10(b)), who collaborated mostly with S and CM, shows a very different directory structure (appearing to be concerned with backend development). Therefore, we see a clear separation of responsibilities among the development team. However, when we investigate the collaboration clouds of S and CM individually we see that these three developers each worked on a number of files that are not touched by other team members. If one of these developers were to leave the team there would be a large number of files that no other team


member would be familiar with. Therefore, we see that the backend team has a very low "bus factor" [44].

Contributor RS did not appear as one of the main collaborators of AV's team (backend development) or LS's team (Android application) and on further investigation of RS's changed directories we see that he contributed mostly images to the project.

Fig. 6. RubyGems: main changed files, committers and bug reporters from commits and issues mentioning Gem Install. Tag clouds constructed from a combined context of GitHub issues and commits.

Fig. 7. RubyGems: changed files, committers from commits closing issues mentioning Gem Install.

Fig. 8. Industrial application study: developer contributors over project from project start to first release. Tag clouds generated from a revision-based context. Months are selected as sticky tags.

Fig. 9. Industrial application study: collaboration with developer LS. Tag cloud built from a file-based context.

Commit activity: comparing the revision-based and file-based views on the weekdays on which the developers have been making commits (Fig. 11) we see that the most commit activity occurs between Tuesdays and Thursdays, with less activity on Mondays and Fridays and very little over the weekends (Fig. 11, left). This is consistent with what we expect from a full-time commercial development team.

Observing the number of files changed on each weekday (Fig. 11, right) shows that while fewer commits are made on Fridays these commits generally touch more files. This would be consistent with developers committing their changes before the weekend in fewer but larger commits. The project manager also indicated that bi-weekly sprint planning takes place on a Friday, which could also explain the fewer but larger commits observed on Fridays.

Commit messages: examining the most frequent words used in commit messages in the first full month of the project (09/2015) and comparing those to the commit messages in 01/2016 (Fig. 12) we see that the initial activity was largely concerned with Facebook and Database integration. In the last month examined (01/2016) we see that the activity is more centered around bug fixes and the user interface changes (Images, Styling), which indicates that the project was about to be released.

5.3.1. Threats to validity

The study on the industrial application has been performed by the first author. However, to mitigate risks the first author had no previous knowledge of the project or development team. We


subsequently verified all observations with the project manager in order to establish their correctness.

Fig. 10. Industrial application study: directories and collaboration of developers (a) LS, SM and P9 and (b) AV. Tag cloud built from a file-based context.

Fig. 11. Industrial application study: weekdays of developer commits. (left) tags sized according to number of commits (right) tags sized according to number of files changed.

Fig. 12. Industrial application study: popular keywords from commit messages in the first and last month analyzed.

5.3.2. Discussion
We have applied ConceptCloud to an industrial project to determine whether the insights that we can gather are appropriate and can be confirmed by the project manager. For this application, the team was small and co-located so the insights might not be surprising to them. However, we have seen that there are many valuable insights, such as team collaboration, areas of expertise and activities, contained in the version control repository. Using ConceptCloud we were able to gather these insights which would be very valuable for new developers starting on the team and teams in which the collaboration patterns or activities are not obvious to the project manager. We could identify the different roles of developers in the team by examining the directory structure of the files they committed, which indicated what parts of the system members were working on. We were also able to identify when developers joined and left the team and how the different team members collaborate.

5.4. Conclusions
Using ConceptCloud we are able to get an overview of the collaboration and work patterns in both open source and industrial projects. We see that the different context types give us different views on the project (e.g., collaboration vs. commit activity) and that additional project information such as the issue-reports can be merged into the same context to provide more detailed information on a project.
Comparing our observations from an industrial project to those made from open-source projects we observe that the commit activity in the industrial project is much more regular and the contributions are shared relatively evenly among the contributors. The development team of the commercial project is also separated into smaller groups (of about 3) that work consistently on one aspect of the project. In contrast, in the open-source projects we observe one main contributor who has a much higher activity than the others during their involvement; when this contributor leaves the project another developer takes over this role.

6. Performance evaluation
FCA is commonly associated with high run-times and so we evaluate ConceptCloud's performance on a variety of repositories to illustrate the feasibility of our approach. In particular, we used ConceptCloud on a medium-sized server with 64 GB RAM and two


Table 1
Metrics for revision-based contexts.

Project       Type   |O|        |A|        |I|          Indexing time (s)   Initial cloud creation time (s)
Subversiveᵃ   SVN    1511       8222       88,090       55.5                1.8
JUnitᵇ        Git    1905       5959       66,242       8.0                 1.9
AngularJSᶜ    Git    5547       9055       133,436      116.2               2.8
Springᵈ       Git    9017       40,332     540,813      43.4                14.8
Valgrindᵉ     SVN    10,989     29,009     348,136      176.6               40.0
Djangoᶠ       Git    18,471     38,821     583,701      58.4                11.0
Moodleᵍ       Git    69,550     154,834    2,222,486    333.2               45.7
DPortsʰ       Git    155,627    196,850    2,917,269    2,049.9             892.8

ᵃ https://dev.eclipse.org/svnroot/technology/org.eclipse.subversive/. Analyzed December 2006 to September 2014.
ᵇ https://github.com/junit-team/junit. Analyzed December 2000 to September 2014.
ᶜ https://github.com/angular/angular.js. Analyzed December 2013 to September 2014.
ᵈ https://github.com/spring-projects/spring-framework. Analyzed July 2008 to September 2014.
ᵉ svn://svn.valgrind.org/valgrind/trunk. Analyzed March 2002 to September 2014.
ᶠ https://github.com/django/django. Analyzed July 2005 to September 2014.
ᵍ https://github.com/moodle/moodle. Analyzed November 2001 to September 2014.
ʰ https://github.com/DragonFlyBSD/DPorts. Analyzed October 2012 to September 2014.
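
The columns |O|, |A| and |I| of Table 1 count the objects (revisions), the attributes (tags) and the object/attribute pairs of a context table. The following minimal Python sketch shows how these counts could be obtained for a toy revision-based context. It is only an illustration and not ConceptCloud's implementation: the revision identifiers, the tag names and the function name context_metrics are hypothetical, and the density is computed as |I| / (|O| · |A|), which is one common definition and an assumption here.

```python
# Toy revision-based context: each revision (object) maps to its tags (attributes).
# All revision ids and tags are hypothetical and only serve as an illustration.
commits = {
    "r1": {"author:alice", "file:core.py", "year:2014"},
    "r2": {"author:bob", "file:core.py", "file:ui.py", "year:2014"},
    "r3": {"author:alice", "file:ui.py", "year:2015"},
}


def context_metrics(context):
    """Return (|O|, |A|, |I|, density) for a context given as {object: set of attributes}."""
    objects = set(context)                          # O: the revisions
    attributes = set().union(*context.values())     # A: all distinct tags
    # I: the incidence relation, one pair per tag attached to a revision.
    incidence = {(obj, attr) for obj, attrs in context.items() for attr in attrs}
    cells = len(objects) * len(attributes)
    # Assumed definition: share of filled cells in the object/attribute table.
    density = len(incidence) / cells if cells else 0.0
    return len(objects), len(attributes), len(incidence), density


n_obj, n_attr, n_inc, density = context_metrics(commits)
print(n_obj, n_attr, n_inc, round(density, 3))
```

Under the same assumed definition, the rows of Table 1 are very sparse; for Subversive, 88,090 pairs over 1511 × 8222 cells is well below 1%.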

Xeon 8-core 2.0 GHz CPUs to analyze several Git and SVN repositories in order to evaluate its performance.
We created revision-based contexts (using local clones of Git repositories and remotely accessing the SVN repositories). Table 1 summarizes the characteristics of and runtimes for these repositories, showing the number of revisions |O|, the number of attributes |A|, and the size of the incidence relation (i.e., the number of object/attribute pairs) |I|, as well as the time to create the context table (i.e., indexing) and to draw the repository's full tag cloud.
We see that the indexing times (including the extraction of all of the log information for the repositories) are only a few seconds for smaller repositories, and a few minutes for medium-sized ones; even the largest repository with 155,627 revisions requires only 34 min. Note that these times are not directly related to either the size or the density (i.e., |I|) of the context tables but are to a large extent determined by the (lexical) pre-processing.
The initial cloud creation times are given for the full tag cloud for the repository, which contains |O| + |A| tags. The table thus gives an indication of the cloud computation in the worst case; in practice, we can limit the number of tags shown to substantially improve this. However, the initial tag cloud is cached and so can be generated off-line in a pre-processing step. Subsequent loads of the initial tag cloud from cache are instantaneous.
Tag clouds become smaller with subsequent navigation steps and are therefore created substantially faster. Overall, navigation is instantaneous for small and medium repositories, with some degradation on the initial clouds for very large repositories.
Note that drawing the initial cloud requires us to compute the defining concepts of all objects; however, since we use an incremental lattice construction approach and therefore never compute the full lattice, we do not experience the high runtimes commonly associated with FCA.
To reduce drawing time for larger repositories we could limit the number of tags shown in the initial tag cloud to only those that apply to a larger portion of the revisions in the repository and then show the full tag set when the user has made selections to refine the tag cloud. For large repositories that are indexed repeatedly, our approach allows us to incrementally update the context table (and therefore the concept lattice) so that updates can be performed quickly and the initial indexing needs to only be performed once.

7. User study

We performed a user study in order to evaluate whether untrained users are able to answer questions about the history of a software project using ConceptCloud more or less effectively than with current widely-used interfaces. In particular we compare ConceptCloud to the default list-view of commits as implemented in GitK and the GitHub interface, which is graph-based. Both GitHub and linear list commit views, such as GitK, are widely used in practice and we therefore use these interfaces as the controls for comparison against ConceptCloud. Linear list commit views are implemented in many popular Git GUIs (such as SourceTree¹ and TortoiseGit²), but we use GitK as it is packaged standard with Git. GitK provides a searchable linear list of commits and shows the diffs between two revisions. GitHub's interface is widely used in order to visualize the history of a software project and provides graph views of users' activity in repositories. GitHub also provides a code search interface. GitK, GitHub and ConceptCloud present the same underlying information through different interfaces. We therefore compare the effectiveness of our tag cloud interface to that of a searchable list interface and an interactive graph-based interface.
Since the participants in our study had never used our ConceptCloud browser before, we also investigate whether the browser can be used successfully by untrained users.
In this study, we aim to answer the following research questions:

RQ1: is a rich exploratory interface, such as our interactive tag cloud interface, accessible to untrained users?
RQ2: does our interactive tag cloud interface allow users to achieve higher correctness than the familiar linear list view of commits when answering questions about the history of a software project in a set time period?
RQ3: does our interactive tag cloud interface allow users to achieve higher correctness than a graph-based interface, such as the one provided by GitHub, when answering questions about the history of a software project in a set time period?

¹ https://www.sourcetreeapp.com/.
² https://tortoisegit.org/.


Table 2
Question set for user study, (a) Ruby Gems (b) Backbone (c) Retrofit.

(a) RubyGems:
1 Who is the contributor with the most commits on the Ruby Gems project?
2 In which year were the most commits made to the project?
3 Which file types has Charlie Somerville changed in his commits?
4 Which contributors have worked on the file lib/rubygems/psych_additions.rb?
5 Who has been making the most changes on the project since Samuel E. Giddins last worked on it?
6 When was this repository created?
(b) Backbone:
1 In which month was the most activity on the project?
2 Who was the most active developer in this month?
3 Who is the most prolific author of the backbone/test directory?
4 Who was the last person to change the file backbone.js?
5 Which file has been changed the most in this project?
6 Who has made the most changes to the images in the project (jpg, png)?
7 Who has changed the most files that Brad Dunbar has also changed?
(c) Retrofit:
1 Where are the tests for the main project located?
2 Who has edited the .yml files?
3 Who contributed the most to this project in its first year?
4 Who has worked on JacksonConverter.java?
5 Who merged pull request #1017?

Table 3
Descriptive statistics for average percentages obtained with each of the three tools across all questions.

         GitK    ConceptCloud    GitHub
Mean     0.52    0.71            0.67
Sd       0.21    0.10            0.10
Min.     0.27    0.55            0.53
Max.     0.84    0.90            0.85
Range    0.57    0.36            0.32

7.1. Experimental setup

We used a between-subjects design to conduct the experiment, where each participant uses only one of the three tools to answer questions about the software development process in specified projects. We constructed three question sets, based on three different software repositories that were also available on GitHub. All participants were asked to answer three question sets using a tool (GitK, GitHub or ConceptCloud) which was randomly assigned to them. We then evaluated the correctness of the answers supplied by the participants. Each participant was supplied with a user manual, detailing how their tool showed the history of software projects. We marked all of the question answers that were submitted by the participants and calculated their results. We investigate the hypothesis that there is no difference between the correctness results obtained by the participants over all three tools.
Our user study took place in a computer lab at Stellenbosch University. All participants took part at the same time to avoid communication about the tasks. Participants were not permitted to communicate during the study.
Resources for the user study, including question sets, sample solutions and the versions of the repositories used, are available at www.conceptcloud.org/userstudy15.

7.2. Population

We performed our user study with students in our third year Software Engineering class of 2015. Previous courses required the students to submit assignments using Git repositories, so all were familiar with Git. The participating group consisted of 47 students in total. Participation was voluntary for all students.

7.3. Tasks

We developed three question sets using three different repositories available on GitHub, namely RubyGems [42], Backbone [45] and Retrofit [46] (see Table 2). We selected these repositories as they are popular projects available on GitHub, and they differ in size. At the time of the user study the RubyGems repository was the largest with 6388 revisions, Backbone consisted of 3130 revisions and Retrofit had 998 revisions. We used repositories of different sizes so that the results of our study would not be biased towards one repository size.
The question sets were developed by exploring the repositories equally using GitK [47], GitHub [27] and ConceptCloud. Question sets included questions about the location of files, collaboration of users, expertise of the contributors as well as the history of the projects. The question answers were then verified using all three tools to make sure that the results were consistent. All questions were weighted equally. We used all three tools to generate the question sets because the different tools have different strengths and weaknesses and using only one tool would have made the questions easier to answer for the participants assigned to a specific tool.
Participants were given 15 min to answer each question set (6, 7 and 5 questions respectively), after which they were given the next question set and corresponding repository. Participants were made aware of this time limit at the beginning of the user study and before each new question set was started. Participants were asked to answer as many questions as they could in the time provided and to move on from a question when they were unable to answer it.

7.4. Analysis and results

We used the R package for analysis of the experimental results. We performed the Shapiro and Wilk [48] test to determine whether participants' scores were normally distributed, in order to determine what further analysis could be performed. We obtained a p-value of 0.06, and at a significance level of 0.05 we cannot reject the null hypothesis and therefore treat the data as normally distributed.

7.4.1. Summary statistics
Fig. 13 shows a summary of average correctness percentages achieved by participants for each question set, in the order that the question sets were answered (Ruby Gems, Backbone and Retrofit). In the first question set users of GitHub performed the best, and for the second and third question sets users of ConceptCloud performed the best. Fig. 14 shows a box-and-whisker plot for the average scores obtained across all questions for each tool. We see that the median as well as the minimum value for participants using ConceptCloud is the highest, followed by GitHub and then GitK. The range of results of participants using GitK is the highest, with some participants achieving high averages and others achieving much lower results than those using either GitHub or ConceptCloud. Fig. 15 shows the box-and-whisker diagrams for the percentages obtained across each of the question sets for each of the tools. Participants using ConceptCloud achieved higher median percentages for each new question set, which indicates there might have been some learning effect observed over the different question sets. However, participants using GitK or GitHub performed worse in the second question set and then again better in the third question set.
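
The per-tool summary statistics of the kind reported in Table 3 (mean, standard deviation, minimum, maximum and range) can be computed along the following lines. This is a minimal sketch using Python's standard library rather than the R analysis used in the study; the function name describe is hypothetical and the per-participant scores are placeholders, not the study data.

```python
import statistics


def describe(scores):
    """Return mean, sample standard deviation, min, max and range of a list of scores."""
    lowest, highest = min(scores), max(scores)
    return {
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),  # sample standard deviation (divides by n - 1)
        "min": lowest,
        "max": highest,
        "range": highest - lowest,
    }


# Hypothetical average correctness per participant, grouped by assigned tool.
scores_by_tool = {
    "GitK": [0.25, 0.50, 0.75],
    "ConceptCloud": [0.60, 0.70, 0.80],
}

summary = {tool: describe(scores) for tool, scores in scores_by_tool.items()}
for tool, stats in summary.items():
    print(tool, {name: round(value, 2) for name, value in stats.items()})
```

Because the range is the difference of the unrounded maximum and minimum, a reported range can differ in the last digit from the difference of the rounded values.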


Fig. 13. Average percentage obtained by participants, across all three tasks, using ConceptCloud, GitK or GitHub. (a) Bars from left to right indicate: ConceptCloud, GitHub and GitK (b) Bars from left to right indicate: Ruby Gems, Backbone and Retrofit Questions.

Fig. 14. Box and whisker plots for average percentages obtained using ConceptCloud, GitK or GitHub.

Fig. 15. Average percentage obtained by participants using (a) GitK, (b) ConceptCloud, (c) GitHub across all Three Question Sets.

Table 4
P-values for Tukey test.

Tool comparison        P-value
GitHub–ConceptCloud    0.6343916
GitK–ConceptCloud      0.0006546
GitK–GitHub            0.0059474

7.4.2. Statistical significance
We performed a two-way ANOVA test on the correctness obtained by each participant across all three question sets to determine if there was any statistically significant difference in the correctness obtained by users of the different tools. We tested our null hypothesis that the results of the participants would be the same over all three tools. We formulated this null hypothesis so that we would be able to conclude whether there was any difference in the performance of the tools, rather than only investigating whether one tool was better than another. Since we have more than two tools to compare we perform a two-way ANOVA test as opposed to a t-test because the t-test only accounts for the comparison of two tools. We first checked for interaction effects of the tools and the question sets. We found that the interaction effects were not statistically significant (p = 0.11). Therefore there is no evidence that the variation of correctness between the three tools depends on the question set. This therefore allows us to compare the performance of participants using each tool across all three question sets to draw conclusions on the accuracy obtained with each of the tools. We obtained a p-value of 0.000548 from the ANOVA test for the comparison of the tools. We therefore rejected the null hypothesis at a significance level of 0.05 and concluded that the mean values of percentages obtained by participants differed statistically significantly over the three tools. We further performed a post-hoc Tukey test [49] to determine in which tool comparisons statistically significant differences exist (GitHub vs. GitK etc.). The p-values obtained for all comparisons are listed in Table 4. Using a significance level of 0.05 we find that the differences between results obtained using GitK and ConceptCloud as well as GitK and GitHub are statistically significant. A graph plot of the confidence intervals is given in Fig. 16. Therefore we can conclude that participants using ConceptCloud or GitHub were able to answer questions about software projects statistically significantly better than those using GitK.
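
To make the idea behind the ANOVA comparison more concrete, the following sketch computes the F statistic of a one-way ANOVA over the tool factor. It is a simplified stand-in and not the analysis performed in the study, which used a two-way ANOVA (tool and question set) and a Tukey post-hoc test in R. The function name one_way_anova_f is hypothetical and the scores are made-up placeholders, not the study data. Turning F into a p-value additionally requires the F distribution, which is left to a statistics library and is not shown here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists, one list per tool."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    group_means = [sum(group) / len(group) for group in groups]
    # Between-group sum of squares: how far each tool's mean lies from the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of the scores around their own tool's mean.
    ss_within = sum(
        (score - mean) ** 2
        for group, mean in zip(groups, group_means)
        for score in group
    )
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical correctness scores (not the study data), one list per tool.
gitk = [0.1, 0.2, 0.3]
github = [0.2, 0.3, 0.4]
conceptcloud = [0.5, 0.6, 0.7]

f_stat, df_b, df_w = one_way_anova_f([gitk, github, conceptcloud])
print(round(f_stat, 2), df_b, df_w)
```

A large F value means that the tool means differ by more than the spread within each tool would suggest, which is the situation in which the null hypothesis of equal means is rejected.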


Fig. 16. Confidence Intervals obtained from Tukey test. The GitHub–ConceptCloud interval includes 0, indicating no statistically significant difference between the means for those two groups.

7.4.3. Question types
We further investigated which user group had the highest result on each question to understand what types of questions were better answered through each interface.
We found that participants using GitHub were best able to answer questions about the activity on the project ("Who is the contributor with the most commits?") as well as which users were the last to change a specific file or who has worked on a particular file. These results are to be expected as the GitHub activity charts prominently show the years and months in which the most commits have been made as well as including an activity chart for each developer. On the GitHub code search interface, specific files can be searched for and these include a list of contributors, so GitHub also makes this information prominent.
Participants using GitK were best able to answer questions about changes occurring after a developer has made their last commit as well as when a project originated. Since the linear list view provides a chronologically ordered list of commits this type of information is easy to obtain from scrolling through the commit list.
Participants using ConceptCloud were best able to answer questions about a user's activity in a specific time period, as well as which files have been changed the most and which developers are changing certain types of files ("Who has made the most changes to the images in the project (jpg, png)?"). ConceptCloud allows users to select a specific time period (month, year etc.) and observe the size of the developers' tags (commits) in this time period. ConceptCloud includes the changed file types as tags, so information about what type of work a developer is doing on a project (front-end etc.) is simple to obtain.

7.5. Discussion

With respect to our research questions we find for RQ1 that untrained users were able to make use of the ConceptCloud tool to answer questions about the history of a software project. On average, participants using ConceptCloud received a correctness percentage of 71% which indicated that they were able to use the tool to answer questions about the project history with a fair amount of accuracy.
In response to RQ2 and RQ3 we observe that while participants using ConceptCloud achieved the highest average over all question sets, these were only statistically significantly better than the results obtained using the linear list view provided by GitK. We can therefore conclude that the ConceptCloud interface allows users to answer questions better than a linear list view which is common in many repository GUI tools (RQ2). However, no statistical significance was observed in comparison with GitHub, so we cannot conclude that the ConceptCloud interface allows users to answer questions about a software project better than a graph-based interface (RQ3).
While all participants were able to answer various types of questions through the different interfaces we also see evidence that each interface makes specific activities more prominent.

7.6. Threats to validity

Our user study was conducted using a centralized lab server and so it is possible that some participants experienced slower loading times than others. However, since all of the participants took part in the user study at the same time the load on the servers would have been largely consistent for all the participants in the lab at the time.
The repositories used were of varying sizes and so the results of a specific tool might be influenced by the size of the repository. However, the average correctness percentages were actually the worst for the middle-sized repository (Backbone) for all tools and so we do not see a direct trend showing higher correctness with either smaller or larger repositories for any of the tools. There were also no statistically significant interaction effects between the question set (i.e., repository used) and the tool used which allowed us to compare the performance of each tool over all question sets.
The question sets that we constructed could have been biased towards one type of visualization. However, to mitigate this risk we constructed questions using observations from all three tools equally and also verified that all questions could be answered correctly using all tools.
Questions were marked by the first author; however, the sample solutions were verified using all tools prior to the marking process and so the questions have all been marked using answers that were consistent across all the tools.
The participants might not be representative of a real-world sample of software developers. However, all participants were also involved in their own software development projects and were familiar with Git.
Our sample size is limited, due to the size of our Software Engineering class; however, we have drawn conclusions from our study only as far as our sample size has allowed.
The participants might have shown a bias towards our tool as it was developed at their university. However, we have never measured the participants' tool preferences, only their performance and since each participant has used only one tool (due to a between-subjects design) their performance should not be affected by any tool preferences.

8. Related work

Our work is related to topics of visualization, navigation using tag clouds and applications of formal concept analysis to software. We discuss a subset of techniques used to visualize version control repositories and bug repositories that can be most closely compared to our approach (see Section 8.1). We also discuss applications of tag clouds directly to software visualization (see Section 8.2) and other navigation techniques used with tag clouds


for wider applications (e.g., clinical trial data, see Section 8.3). In Section 8.4, we discuss previous applications of formal concept analysis to tasks in software engineering that are most closely related to the goals of our approach, for example, detection of co-changed methods and methods related to a particular bug report.

8.1. Visualizing software and bug repositories

8.1.1. Team structure and developer expertise visualizations
Girba et al. use an "ownership map" visualization [50] in order to identify developer interaction and development patterns using the CVS log of a project. Girba et al. also identify several behavioral patterns of developers, such as teamwork, takeover, and cleaning, and show how these can be identified in their ownership map visualization. These collaboration patterns could also be observed in our tag clouds constructed from a file-based context. While the ownership map visualization serves to provide an overview of the project developer patterns in a single visualization, our tag cloud views are aimed at allowing users to interactively explore the contributions. Therefore, while the collaboration patterns might not be visible in a single tag cloud view, our approach aims to support users in exploring the information at varied levels of detail. The user can then also continue exploring other aspects of the project when they have identified interesting collaboration patterns.
Alonso et al. [51] also use a tag cloud visualization to display information from CVS version control repositories. Their "expertise cloud visualization" creates a tag cloud of committers that are identified using a rule-based classification of CVS log information. Users are then able to select the names in this cloud to display a cloud of the developers' expertise. The expertise cloud visualization [51] differs from that of ConceptCloud as the different types of information can only be displayed in separate clouds, meaning that the combinations of tags a user can select are limited. In contrast, our underlying concept lattice only limits the available tag selections to tags that will not cause an empty tag cloud to be displayed.
Weissgerber et al. [40] develop a transaction overview visualization, file-author matrix, and author-file graph to allow identification of team structure, developer collaboration, and project activity over a certain time period from data contained in the version control system. Section 5.1 compares these visualization techniques to the tag cloud view provided by ConceptCloud in the context of the JUnit case study.

8.1.2. Co-evolution of production and test code
Zaidman et al. [52] develop a change-history view and a growth-history view to study the co-evolution of production and

indicate that list-based interfaces do not support exploration tasks effectively. Codebook has been built with the aim of supporting multiple information needs from software development archives. While the Codebook data storage is flexible enough to support users in answering different types of questions, the applications built on top of Codebook are aimed at answering specific questions. With ConceptCloud we aim to have a single application that is flexible enough to support users in answering different types of questions, rather than a centralized data-structure which can be used as the base for different applications. However, our context tables can also be seen as a central data structure for storing multiple types of project information.
Hipikat [4] also monitors multiple information sources (Bugzilla, CVS, email, newsgroups) and builds a uniform artifact database. It has a number of heuristics (based on text similarity and activity times) to create links between the artifacts, and provides lists of related artifacts on request. Hipikat queries are made using the Eclipse IDE and results are displayed in a Hipikat list view Eclipse plugin. However, the goal of Hipikat is more to recommend relevant items to project newcomers and not to provide them with an interface through which to explore the artifacts. Cubranic et al. [4] also note that project artifacts are not easily accessible to developers as searching the archives requires them to know the correct search terms for finding relevant information. In our work we also argue that searching the software development archives does not support all use cases, as to be able to conduct a search the developer already needs to have some information about the archive. In our approach we aim to make the information contained in software development archives accessible to users for interactive exploration so that they can access the information even before they have formulated a direct query. This is a different approach to the recommendations provided in [4] and supports users in exploring the full archives in an unbiased way.
Cubranic et al. [4] also note that while a list-based presentation of results (as used by Hipikat) is very common "when the user's purpose is exploratory browsing of a collection, such a flat-list presentation does not indicate relationships within the results, only to the query itself." We propose interactive tag clouds as an alternative view, as they allow users to explore query results in an aggregated form and support users in further filtering the results and identifying relationships between them.
Information fragments [53] provide answers to developers' questions by combining subsets of relevant project information. Information fragments are comprised of nodes of different types, such as a team member or work item. Node types are similar to tag categories in ConceptCloud. The presentation of results uses an
test code. The change history view is a plot of the changed files Eclipse plugin and supports a counting feature to get an overview
over the revisions of a project’s repository distinguishing between of the number of occurrences of nodes, to get for example, the
production and test code. In our tag clouds we can distinguish be- number of items a developer has been working on. Our tag
tween production and test code by observing the project’s direc- cloud automatically gives the user an overview of the number
tory structure. of occurrences of each item as the tags are sized according to
occurrence frequency. The information fragments prototype is
8.1.3. Centralized data structures and visualizations for multiple aimed at answering specific questions that developers ask on a
software development artifacts day-to-day basis and not on allowing exploration of the underlying
Codebook [28] is a social network inspired toolset to ana- archives. While our approach can be used to answer the questions
lyze information implicitly contained in software repositories. Its identified by Fritz and Murphy [53] it is specifically aimed at
central data structure is a graph, where the nodes represent the supporting exploration of the underlying archives even for users
artifacts and actors (e.g., change set, developer), and the edges who have not yet formulated a direct query. While the list-based
represent the different relations between these (e.g., contains, interface presented in the information fragments prototype groups
committer). This graph is built from different sources including items together, to show for example which developers have been
revision archives, bulletin boards, mails, and directory information. working on a section of the code, our tag clouds present this type
Direct queries in a specific format can be given to Codebook to of information through navigation, where the user can select the
answer different types of questions. Results are displayed in a relevant file or directory and observe the developers that have
web interface that provides a ranked result list including images made changes to it. The “queries” that can be composed through
of people associated with artifacts. Results from our user study our tag cloud interface are also more flexible in that different

kinds of information (e.g., years, files and developers) can be selected at the same time.

8.2. Tag cloud visualizations of software

There have been applications of tag cloud visualizations directly to software for different purposes.

Guido [54] includes a tag cloud to visualize names of types, variables, parameters and methods in source code. Selecting nodes in the graph visualization that Guido also provides will highlight the corresponding tags in the tag cloud, and selecting a name in the tag cloud will highlight corresponding source code elements in the graph view. The visualizations are linked in Guido similarly to the multiple tag clouds that update simultaneously in ConceptCloud. Anslow et al. [55] use a tag cloud to visualize the structure of Java class names. Emerson et al. use tag clouds to visualize Java methods and explore several different tag cloud layouts using the TAGGLE tool [56]. TAGGLE extends basic tag cloud views and allows highlighters to be associated with tags so that if a tag is selected, related tags in the cloud will be highlighted. Tag clouds in TAGGLE are customizable, as they are in ConceptCloud, with TAGGLE additionally allowing tag layouts to be changed.

8.3. Tag clouds and navigation

Mesnage and Carman [11] use a Bayesian approach for navigation in tag clouds that allows tags related to one or more selected tags to be shown in the cloud, where previously clouds could only be created for one selected tag. Gwizdka and Bakelaar [57] look at displaying a tag cloud history, which allows users to keep track of their previous navigation steps, when clouds are used for pivot navigation. This approach is not directly applicable to our tag clouds since we use refinement navigation where multiple tags can be selected. Hernandez et al. [58] use multiple linked tag clouds to browse semi-structured clinical trial data. These tag clouds are generated from the results of an initial search query and each represents one facet (e.g., medical condition) of the data. A multi-faceted view can also be created in ConceptCloud by moving tag categories into separate tag clouds.

8.4. Software and formal concept analysis

Poshyvanyk and Marcus [59] use a combination of latent semantic indexing and concept lattices to find methods that are relevant to a bug report. Girba et al. [60] use concept analysis to detect co-change patterns in revision control systems. Objects are packages, classes, or methods, while properties are the validity of expressions over certain metrics of the objects (e.g., number of classes, methods, or statements); the specific expression is determined by which co-change pattern is to be detected. Similar ideas could be integrated into our approach.

There have also been direct applications of formal concept analysis to source code analysis and re-engineering [61,62], but these only consider an individual program, not a repository.

9. Conclusions and future work

In this paper, we have developed an interactive browser for revision control archives. We use a novel combination of concept lattices and tag clouds to make the information implicitly contained in repositories accessible to users. Our browser can thus be used to answer many difficult questions such as “What has happened in this project while I was away?”, “Which developers collaborate?”, or “What are the co-changed methods?”

By changing the type of objects in the context table (e.g., revisions, files, etc.) we are able to provide complementary views on the same underlying data and observe collaboration patterns of the developers. By using changes (i.e., revision-file pairs) as objects we are able to easily identify the co-changed methods in a project. Additionally, our context tables can be used as a centralized data structure for multiple sources of information, such as version control data and bug reports.

Our tag clouds provide a visualization in which version control data can be aggregated and explored interactively to support developers in tasks such as keeping up with project changes. Our interface is customizable through the use of a scripting language, which can be used to repeatedly access a constructed view on the dataset. Our interactive visualization supports users in exploratory search tasks when they have no previous knowledge of a project.

We have used the ConceptCloud browser to repeat a previous case study [40] and to make observations about the internal structure of a small commercial development project. We have also performed a user study to determine the usability of ConceptCloud and to compare its effectiveness in allowing users to answer historical questions about a project to that of other existing information representations. Through our user study we conclude that untrained users are able to make use of our ConceptCloud browser to answer questions about the history of a software project.

In future, we plan to conduct an additional user study which compares our ConceptCloud browser to other tools mentioned in related work (which index repositories as well as additional information sources such as email archives) to determine how the different tools perform in both search and exploratory search tasks. We are currently working on building a generic framework from our ConceptCloud browser so that this can be used to visualize a variety of semi-structured data archives (such as academic paper data) [63]. We are also applying ConceptCloud in a different domain and conducting another user study in which we specifically evaluate the learning effects present when using the tool.

Acknowledgments

This research is funded in part by a STIAS Doctoral Scholarship, NRF Grant 93582, CAIR, and the MIH Media Lab.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.infsof.2016.12.001.

References

[1] J. Sillito, G.C. Murphy, K. De Volder, Questions programmers ask during software evolution tasks, in: Proceedings of the SIGSOFT ’06/FSE-14 International Symposium on Foundations of Software Engineering, 2006, pp. 23–34.
[2] M. Codoban, S. Srinivasa Ragavan, D. Dig, B. Bailey, Software history under the lens: a study on why and how developers examine it, in: Proceedings of the International Conference on Software Maintenance and Evolution (ICSME), 2015.
[3] S.E. Sim, R.C. Holt, The ramp-up problem in software projects: a case study of how software immigrants naturalize, in: Proceedings of the International Conference on Software Engineering (ICSE), IEEE, 1998, pp. 361–370.
[4] D. Cubranic, G.C. Murphy, J. Singer, K.S. Booth, Hipikat: a project memory for software development, IEEE Trans. Softw. Eng. 31 (6) (2005) 446–465.
[5] R. Wille, Restructuring lattice theory: an approach based on hierarchies of concepts, in: Ordered Sets, Reidel, 1982, pp. 445–470.
[6] R.W. White, R.A. Roth, Exploratory search: beyond the query-response paradigm, Synth. Lect. Inf. Concept. Retr. Serv. 1 (1) (2009) 1–98.
[7] G. Marchionini, Exploratory search: from finding to understanding, Commun. ACM 49 (4) (2006) 41–46.
[8] H. Kagdi, M.L. Collard, J.I. Maletic, A survey and taxonomy of approaches for mining software repositories in the context of software evolution, J. Softw. Maint. Evol. Res. Pract. 19 (2) (2007) 77–131.
[9] J. Sinclair, M. Cardew-Hall, The folksonomy tag cloud: when is it useful? J. Inf. Sci. 34 (1) (2008) 15–29.
[10] Linux GitHub repository, (https://github.com/torvalds/linux).
[11] C.S. Mesnage, M.J. Carman, Tag navigation, in: Proceedings of the SoSEA 2nd International Workshop on Social Software Engineering and Applications, ACM, 2009, pp. 29–32.
[12] B. Ganter, R. Wille, Formal Concept Analysis - Mathematical Foundations, Springer, Berlin, 1999.
[13] B.A. Davey, H.A. Priestley, Introduction to Lattices and Order, 2nd ed., Cambridge University Press, Cambridge, 2002.
[14] B. Fischer, Specification-based browsing of software component libraries, Autom. Softw. Eng. (ASE) 7 (2) (2000) 179–200.
[15] C. Lindig, Concept-based component retrieval, in: Proceedings of IJCAI, 1995, pp. 21–25.
[16] C. Carpineto, G. Romano, A lattice conceptual clustering system and its application to browsing retrieval, Mach. Learn. 24 (2) (1996) 95–122.
[17] G.J. Greene, B. Fischer, Interactive tag cloud visualization of software version control repositories, in: Proceedings of the IEEE 3rd Working Conference on Software Visualization (VISSOFT), IEEE, 2015, pp. 56–65.
[18] A. Hindle, D.M. Germán, SCQL: a formal model and a query language for source control repositories, ACM SIGSOFT Softw. Eng. Notes 30 (4) (2005) 1–5.
[19] C.M. Pilato, B. Collins-Sussman, B.W. Fitzpatrick, Version Control with Subversion - the Standard in Open Source Version Control, O’Reilly Media, Inc., Sebastopol, California, 2008.
[20] J. Vesperman, Essential CVS, O’Reilly Media, Inc., Sebastopol, California, 2006.
[21] C. Lindig, Fast concept analysis, Work. Concept. Struct. Contrib. ICCS 2000 (2000) 152–161.
[22] M.F. Porter, An algorithm for suffix stripping, Prog. Electron. Lib. Inf. Syst. 14 (3) (1980) 130–137.
[23] R. Navigli, Word sense disambiguation: a survey, ACM Comput. Surv. 41 (2) (2009) 10:1–10:69.
[24] G. Robles, J.M. Gonzalez-Barahona, Developer identification methods for integrated data from various sources, SIGSOFT Softw. Eng. Notes 30 (4) (2005) 1–5.
[25] G.C. Murphy, D. Notkin, Lightweight lexical source model extraction, ACM Trans. Softw. Eng. Methodol. 5 (3) (1996) 262–292.
[26] X. Ren, F. Shah, F. Tip, B.G. Ryder, O. Chesley, Chianti: a tool for change impact analysis of java programs, in: Proceedings of the Object-Oriented Programming, Systems, Languages and Applications, OOPSLA, 2004, pp. 432–448.
[27] GitHub, (http://github.com).
[28] A. Begel, Y.P. Khoo, T. Zimmermann, Codebook: discovering and exploiting relationships in software repositories, in: Proceedings of the International Conference on Software Engineering (ICSE), 2010, pp. 125–134.
[29] J. Śliwerski, T. Zimmermann, A. Zeller, When do changes induce fixes? in: Proceedings of the International Workshop on Mining Software Repositories (MSR), ACM, 2005, pp. 1–5.
[30] C. Van Rijsbergen, Information Retrieval.
[31] C. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval.
[32] G.J. Greene, B. Fischer, Conceptcloud: a tagcloud browser for software archives, in: Proceedings of the ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE), 2014, pp. 759–762.
[33] J. Loeliger, M. McCullough, Version Control with Git: Powerful Tools and Techniques for Collaborative Software Development, O’Reilly Media, Inc., Sebastopol, California, 2012.
[34] D.N. Götzmann, Colibri/java, 2007, (http://code.google.com/p/colibri-java/).
[35] C. Anslow, S. Marshall, J. Noble, R. Biddle, Sourcevis: collaborative software visualization for co-located environments, in: Proceedings of the IEEE Working Conference on Software Visualization (VISSOFT), 2013, pp. 1–10.
[36] S. Lohmann, J. Ziegler, L. Tetzlaff, Comparison of tag cloud layouts: task-related performance and visual exploration, in: Proceedings of the International Conference on Human-Computer Interaction (INTERACT), 2009, pp. 392–404.
[37] J. Schrammel, M. Leitner, M. Tscheligi, Semantically structured tag clouds: an empirical evaluation of clustered presentation approaches, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2009, pp. 2037–2040.
[38] L.D. Caro, K.S. Candan, M.L. Sapino, Navigating within news collections using tag-flakes, J. Vis. Lang. Comput. 22 (2) (2011) 120–139.
[39] M.J. Zaki, M. Ogihara, Theoretical foundations of association rules, in: Proceedings of the 3rd ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 1998.
[40] P. Weissgerber, M. Pohl, M. Burch, Visual data mining in software archives to detect how developers work together, in: Proceedings of the Fourth International Workshop on Mining Software Repositories (MSR), 2007, pp. 9–17.
[41] S. Thummalapenta, T. Xie, Spotweb: detecting framework hotspots and coldspots via mining open source code on the web, in: Proceedings of the International Conference on Automated Software Engineering (ASE), 2008, pp. 327–336.
[42] Rubygems, (https://github.com/rubygems/rubygems).
[43] M. Goeminne, T. Mens, A comparison of identity merge algorithms for software repositories, Sci. Comput. Program. 78 (8) (2013) 971–986.
[44] Bus factor, (http://deviq.com/bus-factor/).
[45] Backbone, (https://github.com/jashkenas/backbone).
[46] Retrofit, (https://github.com/square/retrofit).
[47] Gitk, (https://git-scm.com/docs/gitk).
[48] S.S. Shapiro, M.B. Wilk, An analysis of variance test for normality (complete samples), Biometrika 52 (3/4) (1965) 591–611.
[49] J.W. Tukey, Comparing individual means in the analysis of variance, Biometrics 5 (2) (1949) 99–114.
[50] T. Girba, A. Kuhn, M. Seeberger, S. Ducasse, How developers drive software evolution, in: Proceedings of the International Workshop on Principles of Software Evolution, 2005, pp. 113–122.
[51] O. Alonso, P.T. Devanbu, M. Gertz, Expertise identification and visualization from cvs, in: Proceedings of the International Working Conference on Mining Software Repositories (MSR), 2008, pp. 125–128.
[52] A. Zaidman, B. Van Rompaey, S. Demeyer, A. Van Deursen, Mining software repositories to study co-evolution of production and test code, in: Proceedings of the International Conference on Software Testing, Verification, and Validation, IEEE, 2008, pp. 220–229.
[53] T. Fritz, G.C. Murphy, Using information fragments to answer the questions developers ask, in: Proceedings of the International Conference on Software Engineering (ICSE), ACM, 2010, pp. 175–184.
[54] R. Cottrell, B. Goyette, R. Holmes, R. Walker, J. Denzinger, Compare and contrast: visual exploration of source code examples, in: Proceedings of the International Workshop on Visualizing Software for Understanding and Analysis (VISSOFT), 2009, pp. 29–32.
[55] C. Anslow, J. Noble, S. Marshall, E. Tempero, Visualizing the word structure of java class names, in: Proceedings of the Object-Oriented Programming Systems Languages and Applications (OOPSLA), 2008, pp. 777–778.
[56] J. Emerson, N. Churcher, C. Deaker, From toy to tool: extending tag clouds for software and information visualisation, in: Proceedings of the Australian Software Engineering Conference, 2013, pp. 155–164.
[57] J. Gwizdka, P. Bakelaar, Tag trails: navigation with context and history, in: Proceedings of the CHI’09 Extended Abstracts on Human Factors in Computing Systems, ACM, 2009, pp. 4579–4584.
[58] M.-E. Hernandez, S.M. Falconer, M.-A. Storey, S. Carini, I. Sim, Synchronized tag clouds for exploring semi-structured clinical trial data, in: Proceedings of the Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds (CASCON), ACM, 2008, pp. 4:42–4:56.
[59] D. Poshyvanyk, A. Marcus, Combining formal concept analysis with information retrieval for concept location in source code, in: Proceedings of the International Conference on Program Comprehension (ICPC), 2007, pp. 37–48.
[60] T. Gîrba, S. Ducasse, A. Kuhn, R. Marinescu, R. Daniel, Using concept analysis to detect co-change patterns, in: Proceedings of the IWPSE Ninth International Workshop on Principles of Software Evolution: In Conjunction with the 6th ESEC/FSE Joint Meeting, 2007, pp. 83–89.
[61] G. Snelting, Reengineering of configurations based on mathematical concept analysis, ACM Trans. Softw. Eng. Methodol. 5 (2) (1996) 146–189.
[62] G. Snelting, F. Tip, Reengineering class hierarchies using concept analysis, SIGSOFT Softw. Eng. Notes 23 (6) (1998) 99–110.
[63] G.J. Greene, A generic framework for concept-based exploration of semi-structured software engineering data, in: Proceedings of the Automated Software Engineering (ASE), IEEE, 2015, pp. 894–897.