06 Lopez Ijitwe Sna PDF
ABSTRACT
Source code management repositories of large, long-lived libre (free, open source) software
projects can be a source of valuable data about the organizational structure, evolution, and
knowledge exchange in the corresponding development communities. Unfortunately, the sheer
volume of the available information renders it almost unusable without applying methodologies
which highlight the relevant information for a given aspect of the project. Such a methodology
is proposed in this article, based on well-known concepts from the field of social network
analysis, which can be used to study the relationships among developers and how they collaborate
in different parts of a project. It is also applied to data mined from some well-known projects
(Apache, GNOME, and KDE), focusing on the characterization of their collaboration network
architecture. These cases help to understand the potential of the methodology and how it is
applied, and also show some relevant results which open new paths in the understanding of the
informal organization of libre software development communities.
Keywords:
INTRODUCTION
Copyright 2006, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is
prohibited.
28 Int. J. of Information Technology and Web Engineering, 1(3), 27-48, July-September 2006
and where the limits are of those ways of coordinating and exchanging information. We
have designed a methodology following this approach, and have also applied it to some
well-known projects. The aim of our approach is mainly descriptive: we do not propose
novel models for project evolution or agent behavior, but try to describe the organizational
structure of libre software projects in as much detail as possible. Even so, our work
illustrates the power of SNA techniques. To attain this goal, our approach is similar to that
presented in Madey, Freeh, and Tynan (2002) and Xu, Gao, Christley, and Madey (2005): we
consider libre software projects as complex systems and characterize them by using
mathematical formalisms. As a result, some interesting facts related to the organizational
structure of libre software projects have been uncovered.
The remainder of this article is organized as follows. The next section contains a basic
introduction to SNA, and how we intend to apply its techniques to the study of libre software
projects based on the data available in their CVS repositories. The third section specifies in
detail the methodology for such a study, followed by the fourth section with a brief
introduction to a set of classical social network analysis parameters. After that, the fifth
section presents the main characteristics of the networks corresponding to the three projects
used as case examples: Apache, GNOME, and KDE. This serves as an introduction to the more
detailed comments on several aspects of those projects, presented in the sixth, seventh,
eighth, ninth, and tenth sections. The final section offers some conclusions, comments on
some related work, and discusses further lines of research.
APPLICATION OF SNA TO
LIBRE SOFTWARE PROJECTS
its single components and relationships. Network characterization is widely used in many
scientific and technological disciplines, such as neurobiology (Watts & Strogatz, 1998),
computer networks (Albert, Barabási, Jeong, & Bianconi, 2000), or linguistics (Kumar,
Raghavan, Rajagopalan, & Tomkins, 2002).
Although some voices argue that the software development process found in libre
software projects can hardly be considered a new development paradigm (Fuggetta, 2003),
the way it handles its human resources without doubt differs completely from traditional
organizations (Germán, 2004). In both cases, traditional and libre software environments,
the human factor is of key importance for the development process and how the software
evolves (Gîrba, Kuhn, Seeberger, & Ducasse, 2005), but the volunteer nature of many
contributors in the libre software case makes it a clearly differentiated situation (Robles,
González-Barahona, & Michlmayr, 2005b).
Previous research on this topic has attended to both technical and organizational points
of view. Germán used data from a versioning repository over time to determine feature-adding
and bug-correcting phases. He also found evidence for developer territoriality (software
artifacts that are mainly, if not uniquely, touched by a single developer) (Germán, 2004).
The intention of other papers has been to uncover the social structure of the underlying
community. The first efforts in the libre software world are due to Madey et al. (2002), who
took data from the largest libre software projects repository, SourceForge.net, and inferred
relationships among developers that contributed to projects in common. A statistical analysis
of some basic social network parameters can also be found in López, Gonzalez-Barahona,
and Robles (2004) for some large libre software projects. Xu et al. (2005) have presented
a more profound topological analysis of the libre software community, joining in the same
work characteristics from previous papers: data based on the SourceForge platform and
a statistical analysis of some parameters with the goal of gaining knowledge on the topology
METHODOLOGY
Modules network. Each vertex represents a particular software module (usually a directory
in the CVS repository) of the project. Two modules are linked together by an edge when there
is at least one committer who has contributed to both. Those edges are weighted using a
degree of relationship between the two modules, defined as the total number of commits
performed by common committers.
Committers network. In this case, each vertex represents a particular committer
(developer). Two committers are linked by an edge when they have contributed to at least
one common module. Again, edges are weighted by a degree of relationship, defined as the
total number of commits performed by both developers on modules to which both have
contributed.
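The two definitions above can be sketched in a few lines of Python. This is only an illustrative reading of the text, not the tooling used for the study: the input format (one `(committer, module)` pair per commit), the function name `build_networks`, and the toy log are all assumptions. In particular, the weight of a module-module edge is taken here to be the commits made to either module by the committers the two modules share.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_networks(commits):
    """Build both networks from (committer, module) records, one per commit.

    Each returned dict maps an unordered pair (a sorted 2-tuple) to its
    degree of relationship, measured in commits.
    """
    # Number of commits each committer made to each module.
    count = Counter(commits)

    committers_of = defaultdict(set)  # module -> committers who touched it
    modules_of = defaultdict(set)     # committer -> modules they touched
    for committer, module in count:
        committers_of[module].add(committer)
        modules_of[committer].add(module)

    # Modules network: two modules are linked when they share a committer.
    # Weight (assumed reading): commits to either module by shared committers.
    module_edges = {}
    for m1, m2 in combinations(sorted(committers_of), 2):
        common = committers_of[m1] & committers_of[m2]
        if common:
            module_edges[(m1, m2)] = sum(count[c, m1] + count[c, m2] for c in common)

    # Committers network: two committers are linked when they share a module.
    # Weight: commits by both developers on the modules they have in common.
    committer_edges = {}
    for c1, c2 in combinations(sorted(modules_of), 2):
        shared = modules_of[c1] & modules_of[c2]
        if shared:
            committer_edges[(c1, c2)] = sum(count[c1, m] + count[c2, m] for m in shared)

    return module_edges, committer_edges


# Toy commit log, invented for illustration only.
log = [("ana", "core"), ("ana", "core"), ("ana", "docs"),
       ("bob", "core"), ("bob", "ui"), ("eve", "ui")]
module_edges, committer_edges = build_networks(log)
print(module_edges)     # {('core', 'docs'): 3, ('core', 'ui'): 2}
print(committer_edges)  # {('ana', 'bob'): 3, ('bob', 'eve'): 2}
```

In the toy log, "docs" and "ui" share no committer, so no edge joins them, which is how sparse regions of the modules network arise.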
Parameters
(1)
$$\frac{1}{k_v (k_v - 1)} \sum_{i \neq j \in N_G(v)} w_{ij} \qquad (2)$$
(3)
$$\sum_{s \neq v \neq t \in G} \frac{\sigma_{st}(v)}{\sigma_{st}} \qquad (4)$$
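Betweenness centrality counts, for a vertex v, the share of shortest paths between every other pair of vertices (s, t) that pass through v. A minimal sketch of that computation follows. It makes two assumptions that the surviving text does not confirm: paths are measured in unweighted hops (rather than by the cost of relationship), and the sum runs over ordered pairs, so each unordered pair is counted twice. The function names are illustrative.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS from `source`: hop distance and number of shortest paths to each vertex."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Sum over ordered pairs (s, t), with s != v != t, of sigma_st(v) / sigma_st."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    result = {v: 0.0 for v in adj}
    for s in adj:
        dist_s, sigma_s = info[s]
        for t in dist_s:
            if t == s:
                continue
            for v in adj:
                if v == s or v == t or v not in dist_s:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest s-t path iff the distances add up.
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    result[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return result


# Star graph: hub "h" linked to three leaves (toy data).
star = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
bc = betweenness(star)
print(bc["h"], bc["a"])  # 6.0 0.0
```

The star graph shows the typical pattern: every leaf has zero betweenness, and the hub carries all of it. This brute-force version is cubic in the number of vertices; it is meant to mirror the definition, not to scale.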
Table 1. Summary of the SNA parameters described in this article, their meaning and their
interpretation
Parameter | Meaning | Interpretation
Degree of relationship | Common activity among two entities (measured in commits) | How strong the relationship is
Cost of relationship | Inverse of the degree of relationship | Gives the cost of reaching one vertex from the other
Degree | Number of vertices connected to a node | Popularity of a vertex
Distribution degree | Probability of a vertex having a given degree | Topology of the network (Poisson or power law distributions)
Weighted degree | Degree considering weights of the links among vertices | Maximum capacity to receive information for a vertex. Effort in maintaining the relationships
Clustering coefficient | Fraction of the total number of edges that could exist for a given vertex that really exist | Transitivity of a network: tendency of a vertex to promote relationships among its neighbors. Helps identifying hot spots of knowledge interchange in dynamic networks
Weighted clustering coefficient | Generalization of the clustering coefficient concept to weighted networks | Local efficiency of the network around a vertex. Redundancy of interactions around a vertex
Distance centrality | Measurement of the proximity of a vertex to the rest | Gives the influence of a vertex in a graph. The higher the value the easier it is for the vertex to spread information through the network
Betweenness centrality | Number of shortest paths traversing a vertex | Measurement of the information control. Higher values mean that the vertex is an intermediate node for the communication of the rest. Vertices with high values are known to cover structural holes
Small world | Diameter (or average distance among vertices) similar but higher average clustering coefficient than random network | Optimizes short and long term information flow efficiency. Especially well adapted to solve the problem of searching knowledge through their vertices
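The simpler parameters of Table 1 can be computed directly from a weighted edge list. The following sketch covers the cost of relationship, degree, weighted degree, and (unweighted) clustering coefficient as the table describes them. The function name, the edge-list format, and the convention of assigning a clustering coefficient of 0 to vertices with fewer than two neighbors are assumptions made for illustration.

```python
from collections import defaultdict


def basic_parameters(weighted_edges):
    """Compute per-vertex degree, weighted degree and clustering coefficient,
    plus the per-edge cost of relationship.

    `weighted_edges` maps a pair (u, v) to its degree of relationship.
    """
    neighbors = defaultdict(set)
    weighted_degree = defaultdict(int)
    for (u, v), w in weighted_edges.items():
        neighbors[u].add(v)
        neighbors[v].add(u)
        weighted_degree[u] += w
        weighted_degree[v] += w

    # Cost of relationship: inverse of the degree of relationship.
    cost = {edge: 1.0 / w for edge, w in weighted_edges.items()}

    # Degree: number of vertices connected to a node.
    degree = {v: len(ns) for v, ns in neighbors.items()}

    # Clustering coefficient: existing edges among the neighbors of v,
    # divided by the k(k-1)/2 edges that could exist among them.
    clustering = {}
    for v, ns in neighbors.items():
        k = len(ns)
        if k < 2:
            clustering[v] = 0.0  # assumed convention for isolated or leaf vertices
            continue
        links = sum(1 for a in ns for b in ns if a < b and b in neighbors[a])
        clustering[v] = links / (k * (k - 1) / 2)

    return degree, dict(weighted_degree), clustering, cost


# Toy weighted network, invented for illustration.
edges = {("a", "b"): 4, ("a", "c"): 2, ("b", "c"): 1, ("c", "d"): 5}
degree, wdegree, clustering, cost = basic_parameters(edges)
print(degree)   # {'a': 2, 'b': 2, 'c': 3, 'd': 1}
print(wdegree)  # {'a': 6, 'b': 5, 'c': 8, 'd': 5}
```

In the toy network, vertex "c" has three neighbors of which only one pair ("a", "b") is linked, so its clustering coefficient is 1/3, while "a" and "b" each sit in a closed triangle and score 1.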
Tables 2 and 3 summarize the main parameters of both kinds of networks. In the case of
committer networks the GNOME case has been omitted.
By comparing the data in both tables some interesting conclusions can already be drawn.
It may be observed, for instance, that the average number of committers per module is greater
in KDE (12.5) than in Apache (4.3), meaning that more people are involved in the average KDE
subproject. It can also be highlighted that the average degree in the committers networks is
in general larger than in the modules ones. This is especially true for KDE, which rises from
a value of 21.4 in the latter case to 225 in the former. In the case of Apache it only rises
from 14.2 to 31.1. Therefore, we can conclude that in those cases, committers are much more
linked than modules. The percentage of modules linked
gives an idea of the synergy (in form of shar-
Table 2. Number of vertices and edges of the module networks in the Apache, GNOME, and
KDE projects
Project name | Modules (Vertices) | Edges | Average | % of edges (avg)
Apache | 175 | 2,491 | 14.23 | 8.13
KDE | 73 | 1,560 | 21.37 | 29.27
GNOME | 667 | 121,134 | 181.61 | 27.23
Table 3. Number of vertices and edges of the committer networks in the Apache and KDE projects
Project name | Committers (Vertices) | Edges | Committers per module | Avg Number of edges
Apache | 751 | 23,324 | 4.3 | 31.06
KDE | 915 | 205,877 | 12.5 | 225.00
GNOME | 869 | N/A | 1.3 | N/A
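As a consistency check, the derived columns of Tables 2 and 3 can be recomputed from the reported vertex and edge counts. The sketch below assumes (this is inferred from the numbers, not stated in the text) that the "Average" columns are edges divided by vertices and that "% of edges" is edges divided by the square of the number of vertices. Under that reading every reported value is reproduced. Note that the mean vertex degree in the usual graph-theoretic sense, 2E/V, would be twice the "Average" figure.

```python
# (vertices, edges) as reported in Tables 2 and 3.
module_nets = {"Apache": (175, 2491), "KDE": (73, 1560), "GNOME": (667, 121134)}
committer_nets = {"Apache": (751, 23324), "KDE": (915, 205877)}
committer_counts = {"Apache": 751, "KDE": 915, "GNOME": 869}


def edges_per_vertex(vertices, edges):
    """Assumed definition of the 'Average' columns."""
    return edges / vertices


def percent_of_edges(vertices, edges):
    """Assumed definition of the '% of edges (avg)' column."""
    return 100.0 * edges / vertices ** 2


def committers_per_module(project):
    """Committers divided by modules for one project."""
    return committer_counts[project] / module_nets[project][0]


for name, (v, e) in module_nets.items():
    print(name,
          round(edges_per_vertex(v, e), 2),
          round(percent_of_edges(v, e), 2),
          round(committers_per_module(name), 1))
# Apache 14.23 8.13 4.3
# KDE 21.37 29.27 12.5
# GNOME 181.61 27.23 1.3

for name, (v, e) in committer_nets.items():
    print(name, round(edges_per_vertex(v, e), 2))
# Apache 31.06
# KDE 225.0
```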
Figure 2. Assortativity (degree - degree distribution) for Apache (), KDE (+) and GNOME ().
Cumulative weighted degree distribution for Apache ( ), KDE (+) and GNOME ()
(a) Assortativity
CLUSTERING COEFFICIENT
IN THE MODULES NETWORK
For the analysis of the clustering coefficient, we have represented its distribution in
Figure 3a.
Table 4 shows the average distance <d> among vertices, together with the average clustering
coefficient <cc>, for our three networks and their equivalent random counterparts (<rd> is the
random average distance and <rcc> is the random average clustering coefficient). As can be
observed, the three networks satisfy the small world condition: their average distances are
only slightly above those of their random counterparts, while their clustering coefficients
are clearly higher.
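The comparison against a random counterpart can be sketched as follows. The surviving text does not say how the random networks were generated, so this example assumes a uniformly random graph with the same number of vertices and edges. Distances are counted in hops, and all function names are illustrative.

```python
import random
from collections import deque
from itertools import combinations


def to_adjacency(n, edges):
    """Adjacency sets for an undirected graph on vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def average_distance(adj):
    """Mean hop distance over all ordered pairs of mutually reachable vertices."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def average_clustering(adj):
    """Mean clustering coefficient; vertices with fewer than 2 neighbors count as 0."""
    values = []
    for ns in adj.values():
        k = len(ns)
        if k < 2:
            values.append(0.0)
            continue
        links = sum(1 for a, b in combinations(ns, 2) if b in adj[a])
        values.append(2.0 * links / (k * (k - 1)))
    return sum(values) / len(values)


def random_counterpart(n, m, seed=0):
    """Random graph with n vertices and m edges chosen uniformly (an assumption)."""
    rng = random.Random(seed)
    return to_adjacency(n, rng.sample(list(combinations(range(n), 2)), m))


# Toy input: a ring of 20 vertices, each linked to its 2 nearest neighbors per side.
ring_edges = [(i, (i + j) % 20) for i in range(20) for j in (1, 2)]
ring = to_adjacency(20, ring_edges)
rand = random_counterpart(20, len(ring_edges))

d, cc = average_distance(ring), average_clustering(ring)
rd, rcc = average_distance(rand), average_clustering(rand)
print(f"<d>/<rd> = {d:.2f}/{rd:.2f}   <cc>/<rcc> = {cc:.2f}/{rcc:.2f}")
```

A network is read as a small world when `d` stays close to `rd` while `cc` is well above `rcc`. The ring used here is only a toy input for exercising the functions and is not meant to resemble any of the three projects.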
The average random clustering coefficients for KDE and GNOME are close to the real ones,
due to the high density of those networks. This could be an indication of over-redundancy in
their links. That would mean that the same efficiency of information could be obtained with
fewer relationships (i.e., eliminating many edges in the network without significantly
increasing the diameter or reducing the clustering coefficient). In this sense, the Apache
network seems to be more optimized. To interpret this fact, the reader should recall that
links in this network are related to the existence of common developers for the linked
modules. It should be noted that
DISTANCE CENTRALITY IN
THE MODULES NETWORK
Figure 3. Clustering coefficient distribution for Apache ( ), KDE (+) and GNOME (). Average weighted clustering coefficient as a function of the degree of vertices for Apache ( ), KDE
(+) and GNOME (.)
Project name | <d> / <rd> | <cc> / <rcc>
Apache | 2.06 / 1.47 | 0.73 / 0.19
KDE | 1.31 / 1.11 | 0.88 / 0.65
GNOME | 1.46 / 1.10 | 0.87 / 0.54
Figure 4. Distance centrality distribution for Apache ( ), KDE (+), and GNOME (); Average distance centrality as a function of the degree of vertices for Apache ( ), KDE (+), and
GNOME ()
information.
We can also analyze the average distance centrality as a function of the degree (average
distance centrality-degree distribution), which is shown in Figure 4b. It can be observed that
in all three cases the average distance centrality
BETWEENNESS CENTRALITY
IN THE MODULES NETWORK
The distance centrality of a vertex indicates how well new knowledge created in a
vertex spreads to the rest of the network. Betweenness centrality, on the other hand, is a
measurement of how easy it is for a vertex to generate this new information. Vertices with
high betweenness centrality indexes are the crossroads of organizations, where information
from different origins can be intercepted, analyzed, or manipulated. In Figure 5a, the
betweenness centrality distribution for our three networks can be observed. As was the case
for distance centrality, it grows following a multiple power law. Nevertheless, there is a
significant difference between the distributions of these two parameters. Although the
log-log scale of the axes of Figure 5a does not allow it to be visualized, the most probable
value of the betweenness centrality in all three networks is zero. As an example, only 102
out of 677 vertices of the GNOME network have a nonzero betweenness centrality. So, the
distance centrality is a common good of all members of the network, while the betweenness
centrality is owned by a reduced elite. This should not be surprising at all, as projects
usually have modules (i.e., applications) which have a more central position and attract
more development attention. Surrounding these modules, other minor modules may appear.
This fact can also be visualized in Figure 5b, where we represent the average betweenness
centrality as a function of the degree. It can be clearly seen that only vertices of high
degree have nonzero betweenness centralities.
COMMITTER NETWORKS
Figure 5. Betweenness centrality distribution for Apache ( ), KDE (+) and GNOME (). Average betweenness centrality distribution for Apache ( ), KDE (+) and GNOME ()
CONCLUSIONS, LESSONS
LEARNED, AND FURTHER
WORK
Figure 7. Degree - degree distribution for Apache ( ) and KDE (+); Average weighted degree
as a function of the degree for Apache ( ) and KDE (+)
Project name | <d> / <rd> | <cc> / <rcc>
Apache | 2.18 / 1.60 | 0.84 / 0.08
KDE | 1.47 / 1.10 | 0.86 / 0.52
REFERENCES
ENDNOTES
Luis López obtained his PhD in electrical and electronic engineering at Universidad Rey Juan
Carlos in 2003 and his MS in electrical and electronic engineering at Universidad Politécnica
de Madrid and at ENST Télécom-Paris in 1998. He is the author of more than 50 publications,
including 10 papers published in international research journals and 20 contributions
to conferences and workshops.
Gregorio Robles received his Telecommunication Engineering degree from the Universidad Politécnica de Madrid (2001) and has recently defended his PhD thesis at the Universidad Rey Juan
Carlos (2006). His research work is centered on the empirical study of libre software development,
especially from, but not limited to, a software engineering perspective. He has developed or collaborated in the design and implementation of software programmes to automate the analysis of
libre software and the tools used to produce it. He has also been involved in several projects
related to the study and promotion of libre software financed by the European Commission IST
programmes, such as FLOSS (2000-1), CALIBRE (2004-6) or FLOSSWorld (2005-7).
Jesus M. Gonzalez-Barahona teaches and researches at Universidad Rey Juan Carlos, Mostoles
(Spain). He became involved in the promotion of libre software in 1991. Since then, he has
carried out several activities in this area, including the organization of seminars and courses, and
the participation in working groups on libre software, both at the Spanish and European levels.
Currently he collaborates with several libre software projects (including Debian) and associations,
writes in several media about topics related to libre software, and consults for companies and
public administrations on issues related to their strategy on these topics. His research interests
include libre software engineering, and in particular quantitative measures of libre software
development and distributed tools for collaboration in libre software projects. In this area, he
has published several papers, and is participating in some international research projects (more
info at http://libresoft.urjc.es ). He is also one of the promoters of the idea of a European master
program on libre software, and has a specific interest in education in that area.
Israel Herraiz holds an MSc in chemical and mechanical engineering and a BSc in chemical engineering, and he is currently pursuing his PhD in computer science at the Universidad Rey Juan
Carlos in Madrid, Spain. He discovered free software in 2000, and has since then developed
several free tools for chemical engineering.