Networks

The document discusses network models and centrality measures. It provides examples of different centrality measures - closeness, betweenness, and eigenvector - applied to a toy graph. It also discusses the Watts-Strogatz small world network model and how rewiring a regular network to add short cuts can result in a small world where paths are short but clustering remains high.


CS109/Stat121/AC209/E-109

Data Science
Network Models II
Hanspeter Pfister & Joe Blitzstein
This Week

Project proposals due next Monday (Nov 11)
  http://cs109.org/projects/projects.php
  No late days or extensions are possible on project milestones or deadlines!
HW5 due next Friday (Nov 15)
Friday lab 10-11:30 am in MD G115

4.2 Vertex and Edge Characteristics
[Figure: panels (a)-(d)]
Illustration of (b) closeness, (c) betweenness, and (d) eigenvector centrality measures on the graph in (a). Example and figures courtesy of Ulrik Brandes.

Figures 4.4(b)-(d) provide visual summaries of the closeness, betweenness, and eigenvector centralities, respectively, of the vertices in our toy graph. In each case, vertices are arranged using a radial layout, with more central vertices located closer to the center.

Under the closeness centrality, $c_{Cl}(v)$, the dark blue vertex is judged to be most central under this measure, followed closely by the red and yellow vertices. However, under the betweenness centrality, $c_B(v)$, it can be seen that the yellow vertex is now judged to be most central. In fact, note that the relative positions of all but the most extreme vertices (i.e., small green) have changed noticeably from Fig-
Kolaczyk (2009)
Comparing centrality measures
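To make the comparison concrete, here is a minimal Python sketch (not taken from the slides or from Kolaczyk). It assumes closeness is the reciprocal of a vertex's summed shortest-path distances and computes betweenness with Brandes' algorithm. The five-vertex path graph and all function names are illustrative assumptions, and eigenvector centrality is left out.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path distances (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def closeness(adj):
    """c_Cl(v) = 1 / (sum of distances from v); assumes a connected graph."""
    return {v: 1.0 / sum(bfs_distances(adj, v).values()) for v in adj}

def betweenness(adj):
    """c_B(v): number of shortest paths through v, summed over unordered pairs
    and weighted by the share of each pair's shortest paths (Brandes' algorithm)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                        # vertices in order of discovery
        preds = {v: [] for v in adj}      # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}       # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):         # accumulate dependencies back toward s
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # each unordered pair was counted from both endpoints
    return {v: b / 2 for v, b in bc.items()}

# Hypothetical toy graph: the path a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(closeness(path))
print(betweenness(path))
```

On a path both measures agree that the middle vertex is most central; on graphs like the toy example in the figure the rankings can differ, which is the point of the comparison.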
It's a small world after all: Watts-Strogatz Model
Nature © Macmillan Publishers Ltd 1998
letters to nature
NATURE | VOL 393 | 4 JUNE 1998 | 441
removed from a clustered neighbourhood to make a short cut has, at most, a linear effect on C; hence C(p) remains practically unchanged for small p even though L(p) drops rapidly. The important implication here is that at the local level (as reflected by C(p)), the transition to a small world is almost undetectable. To check the robustness of these results, we have tested many different types of initial regular graphs, as well as different algorithms for random rewiring, and all give qualitatively similar results. The only requirement is that the rewired edges must typically connect vertices that would otherwise be much farther apart than $L_{\mathrm{random}}$.

The idealized construction above reveals the key role of short cuts. It suggests that the small-world phenomenon might be common in sparse networks with many vertices, as even a tiny fraction of short cuts would suffice. To test this idea, we have computed L and C for the collaboration graph of actors in feature films (generated from data available at http://us.imdb.com), the electrical power grid of the western United States, and the neural network of the nematode worm C. elegans¹⁷. All three graphs are of scientific interest. The graph of film actors is a surrogate for a social network¹⁸, with the advantage of being much more easily specified. It is also akin to the graph of mathematical collaborations centred, traditionally, on P. Erdős (partial data available at http://www.acs.oakland.edu/~grossman/erdoshp.html). The graph of the power grid is relevant to the efficiency and robustness of power networks¹⁹. And C. elegans is the sole example of a completely mapped neural network.

Table 1 shows that all three graphs are small-world networks. These examples were not hand-picked; they were chosen because of their inherent interest and because complete wiring diagrams were available. Thus the small-world phenomenon is not merely a curiosity of social networks¹³,¹⁴ nor an artefact of an idealized model; it is probably generic for many large, sparse networks found in nature.

We now investigate the functional significance of small-world connectivity for dynamical systems. Our test case is a deliberately simplified model for the spread of an infectious disease. The population structure is modelled by the family of graphs described in Fig. 1. At time t = 0, a single infective individual is introduced into an otherwise healthy population. Infective individuals are removed permanently (by immunity or death) after a period of sickness that lasts one unit of dimensionless time. During this time, each infective individual can infect each of its healthy neighbours with probability r. On subsequent time steps, the disease spreads along the edges of the graph until it either infects the entire population, or it dies out, having infected some fraction of the population in the process.
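The disease model in the last paragraph is simple enough to simulate directly. The following is a rough Python sketch under stated assumptions: time advances in whole steps, every infective gets one independent chance with probability r to infect each healthy neighbour, and the population is an unrewired ring lattice. The names `ring_lattice` and `simulate_epidemic` are made up for this illustration and are not from the paper.

```python
import random

def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbours (k even)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj

def simulate_epidemic(adj, r, seed_vertex=0, rng=None):
    """Run the infect-then-remove process until no infectives remain.

    Returns (fraction of the population ever infected, number of time steps).
    """
    rng = rng or random.Random()
    healthy = set(adj) - {seed_vertex}
    infective = {seed_vertex}
    removed = set()
    steps = 0
    while infective:
        newly_infected = set()
        for u in infective:
            for w in adj[u]:
                # each infective gets one independent try per healthy neighbour
                if w in healthy and w not in newly_infected and rng.random() < r:
                    newly_infected.add(w)
        removed |= infective          # sickness lasts exactly one time unit
        healthy -= newly_infected
        infective = newly_infected
        steps += 1
    return len(removed) / len(adj), steps

lattice = ring_lattice(20, 4)
fraction, steps = simulate_epidemic(lattice, r=0.5, rng=random.Random(1))
print(fraction, steps)
```

Running the same function on rewired graphs, for a range of p, would be one way to explore how short cuts change the fraction infected and the time to reach it.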
[Figure 1: three ring graphs labeled Regular (p = 0), Small-world, and Random (p = 1), arranged along an axis of increasing randomness]
Figure 1 Random rewiring procedure for interpolating between a regular ring lattice and a random network, without altering the number of vertices or edges in the graph. We start with a ring of n vertices, each connected to its k nearest neighbours by undirected edges. (For clarity, n = 20 and k = 4 in the schematic examples shown here, but much larger n and k are used in the rest of this Letter.) We choose a vertex and the edge that connects it to its nearest neighbour in a clockwise sense. With probability p, we reconnect this edge to a vertex chosen uniformly at random over the entire ring, with duplicate edges forbidden; otherwise we leave the edge in place. We repeat this process by moving clockwise around the ring, considering each vertex in turn until one lap is completed. Next, we consider the edges that connect vertices to their second-nearest neighbours clockwise. As before, we randomly rewire each of these edges with probability p, and continue this process, circulating around the ring and proceeding outward to more distant neighbours after each lap, until each edge in the original lattice has been considered once. (As there are nk/2 edges in the entire graph, the rewiring process stops after k/2 laps.) Three realizations of this process are shown, for different values of p. For p = 0, the original ring is unchanged; as p increases, the graph becomes increasingly disordered until for p = 1, all edges are rewired randomly. One of our main results is that for intermediate values of p, the graph is a small-world network: highly clustered like a regular graph, yet with small characteristic path length, like a random graph. (See Fig. 2.)
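The rewiring procedure in the caption can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code: it assumes k is even and n > k, and it draws the new endpoint uniformly from vertices that are neither the current vertex nor already one of its neighbours, which is one simple way to forbid self-loops and duplicate edges. The function names are hypothetical.

```python
import random

def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbours (k even)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj

def watts_strogatz(n, k, p, rng):
    """Rewire each lattice edge once with probability p, lap by lap."""
    adj = ring_lattice(n, k)
    for j in range(1, k // 2 + 1):        # lap j handles j-th nearest neighbours
        for v in range(n):                # move clockwise around the ring
            if rng.random() >= p:
                continue                  # leave this edge in place
            old = (v + j) % n
            # assumption: sample only endpoints that keep the graph simple
            candidates = [w for w in range(n) if w != v and w not in adj[v]]
            if not candidates:
                continue
            new = rng.choice(candidates)
            adj[v].discard(old)
            adj[old].discard(v)
            adj[v].add(new)
            adj[new].add(v)
    return adj

graph = watts_strogatz(20, 4, 0.1, random.Random(0))
print(sum(len(nb) for nb in graph.values()) // 2)  # edge count stays n*k/2
```

Because every rewiring removes one edge and adds one, the number of vertices and edges is unchanged for any p, as the caption requires.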
Table 1 Empirical examples of small-world networks

             L_actual  L_random  C_actual  C_random
Film actors  3.65      2.99      0.79      0.00027
Power grid   18.7      12.4      0.080     0.005
C. elegans   2.65      2.25      0.28      0.05

Characteristic path length L and clustering coefficient C for three real networks, compared to random graphs with the same number of vertices (n) and average number of edges per vertex (k). (Actors: n = 225,226, k = 61. Power grid: n = 4,941, k = 2.67. C. elegans: n = 282, k = 14.) The graphs are defined as follows. Two actors are joined by an edge if they have acted in a film together. We restrict attention to the giant connected component¹⁶ of this graph, which includes ~90% of all actors listed in the Internet Movie Database (available at http://us.imdb.com), as of April 1997. For the power grid, vertices represent generators, transformers and substations, and edges represent high-voltage transmission lines between them. For C. elegans, an edge joins two neurons if they are connected by either a synapse or a gap junction. We treat all edges as undirected and unweighted, and all vertices as identical, recognizing that these are crude approximations. All three networks show the small-world phenomenon: $L \gtrsim L_{\mathrm{random}}$ but $C \gg C_{\mathrm{random}}$.
[Figure 2: L(p)/L(0) and C(p)/C(0) plotted against p; logarithmic horizontal axis from 0.0001 to 1, vertical axis from 0 to 1]
Figure 2 Characteristic path length L(p) and clustering coefficient C(p) for the family of randomly rewired graphs described in Fig. 1. Here L is defined as the number of edges in the shortest path between two vertices, averaged over all pairs of vertices. The clustering coefficient C(p) is defined as follows. Suppose that a vertex v has $k_v$ neighbours; then at most $k_v(k_v - 1)/2$ edges can exist between them (this occurs when every neighbour of v is connected to every other neighbour of v). Let $C_v$ denote the fraction of these allowable edges that actually exist. Define C as the average of $C_v$ over all v. For friendship networks, these statistics have intuitive meanings: L is the average number of friendships in the shortest chain connecting two people; $C_v$ reflects the extent to which friends of v are also friends of each other; and thus C measures the cliquishness of a typical friendship circle. The data shown in the figure are averages over 20 random realizations of the rewiring process described in Fig. 1, and have been normalized by the values L(0), C(0) for a regular lattice. All the graphs have n = 1,000 vertices and an average degree of k = 10 edges per vertex. We note that a logarithmic horizontal scale has been used to resolve the rapid drop in L(p), corresponding to the onset of the small-world phenomenon. During this drop, C(p) remains almost constant at its value for the regular lattice, indicating that the transition to a small world is almost undetectable at the local level.
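The two statistics defined in the caption can be computed with plain breadth-first search. The sketch below is an illustration under two assumptions that the caption does not spell out: a vertex with fewer than two neighbours contributes $C_v = 0$, and L is averaged over reachable ordered pairs (which equals the all-pairs average when the graph is connected). Function names are hypothetical.

```python
from collections import deque

def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbours (k even)."""
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            w = (v + j) % n
            adj[v].add(w)
            adj[w].add(v)
    return adj

def clustering_coefficient(adj):
    """Average over v of (edges among v's neighbours) / (k_v (k_v - 1) / 2)."""
    total = 0.0
    for v, nbrs in adj.items():
        kv = len(nbrs)
        if kv < 2:
            continue                      # assumption: C_v = 0 in this case
        links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        total += links / (kv * (kv - 1) / 2)
    return total / len(adj)

def characteristic_path_length(adj):
    """Mean shortest-path length (in edges) over reachable vertex pairs."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

lattice = ring_lattice(20, 4)             # the n = 20, k = 4 schematic lattice
print(clustering_coefficient(lattice))    # 0.5
print(characteristic_path_length(lattice))
```

For the n = 20, k = 4 lattice each vertex has 4 neighbours with 3 of the 6 possible edges among them, so C(0) = 0.5; dividing the values for rewired graphs by C(0) and L(0) gives the normalized curves the figure describes.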
Watts-Strogatz (Nature, 1998)
Distances and clustering in Watts-Strogatz model
Watts-Strogatz (Nature, 1998)
Scientific Communication as Sequential Art (Bret Victor)
http://worrydream.com/ScientificCommunicationAsSequentialArt/
Class Size Paradox
Why do so many schools boast a small average class size while so many students end up in huge classes?
Simple example: each student takes one course; suppose there is one course with 100 students and fifty courses with 2 students each.
Dean calculates: (100 + 50*2)/51 = 3.92
Students calculate: (100*100 + 100*2)/200 = 51
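The two calculations differ only in who is being averaged over: the dean averages over courses, the students over students. A small Python check of the example above (the function names are just for illustration):

```python
def dean_average(class_sizes):
    """Average class size per course: every course counts once."""
    return sum(class_sizes) / len(class_sizes)

def student_average(class_sizes):
    """Average class size per student: a class of size s is reported s times."""
    return sum(s * s for s in class_sizes) / sum(class_sizes)

# One course with 100 students, fifty courses with 2 students each.
sizes = [100] + [2] * 50

print(round(dean_average(sizes), 2))   # 3.92
print(student_average(sizes))          # 51.0
```

The student average is a size-biased average: large classes are over-represented because more students are there to report them.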
Class Size Paradox in Networks
Popular article on this phenomenon by Strogatz:
http://opinionator.blogs.nytimes.com/2012/09/17/friends-you-can-count-on/?_r=0
The average number of friends of a person's friends is greater than the average number of friends of a person!
Again a reminder of the importance of considering sampling.
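The network version is the same size-biased sampling at work: a person with d friends shows up d times when we list everyone's friends. The sketch below compares the two averages on a hypothetical star graph; the graph and function names are illustrative assumptions, and "friends' average" is taken here over all (person, friend) pairs.

```python
def mean_degree(adj):
    """Average number of friends of a person."""
    return sum(len(friends) for friends in adj.values()) / len(adj)

def mean_friend_degree(adj):
    """Average number of friends of a friend, over all (person, friend) pairs."""
    total = sum(len(adj[f]) for friends in adj.values() for f in friends)
    pairs = sum(len(friends) for friends in adj.values())
    return total / pairs

# Hypothetical star: person 0 is friends with persons 1 through 4.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

print(mean_degree(star))         # 1.6
print(mean_friend_degree(star))  # 2.5
```

The second quantity equals the sum of squared degrees divided by the sum of degrees, the same form as the students' calculation in the class size example.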
Community Detection
Porter et al survey: http://arxiv.org/pdf/0902.3788v2.pdf
Fig. 0.5. The largest connected component (379 nodes) of the network of network scientists (1589 total nodes), determined by coauthorship of papers listed in two well-known review articles [13, 83] and a small number of papers added manually [86]. Each of the nodes in the network, which we depict using a Kamada-Kawai visualization [62], is colored according to its community assignment using the leading-eigenvector spectral method [86].

Applications. Armed with the above ideas and algorithms, we turn to selected demonstrations of their efficacy. The increasing rapidity of developments in network community detection has resulted in part from the ever-increasing abundance of data sets (and the ability to extract them, with user cleverness). This newfound wealth (including large, time-dependent data sets) has, in turn, arisen from the massive amount of information that is now routinely collected on websites and by communication companies, governmental agencies, and others. Electronic databases now provide detailed records of human communication patterns, offering novel avenues to map and explore the structure of social, communication, and collaboration networks. Biologists also have extensive data on numerous systems that can be cast into network form and which beg for additional quantitative analyses.

Because of space limitations, we restrict our discussion to five example applications in which community detection has played a prominent role: scientific coauthorship, mobile phone communication, online social networking sites, biological systems, and legislatures. We make no attempt to be exhaustive for any of these examples; we merely survey research (both by others and by ourselves) that we particularly like.

Scientific Collaboration Networks. We know from the obsessive computation of Erdős numbers that scientists can be quite narcissistic. (If you want any further evidence, just take a look at the selection of topics and citations in this section.) In this spirit, we use scientific coauthorship networks as our first example.

A bipartite (two-mode) coauthorship network, with scientists linked to papers
Community Detection of Committees in Congress
Porter et al survey: http://arxiv.org/pdf/0902.3788v2.pdf
[Figure: two panels, each labeled with the House committees Agriculture, Appropriations, International Relations, Budget, House Administration, Energy/Commerce, Financial Services, Veterans Affairs, Education, Armed Services, Judiciary, Resources, Rules, Science, Small Business, Official Conduct, Transportation, Government Reform, Ways and Means, Intelligence, Homeland Security]
Fig. 0.4. (Left) The network of committees (squares) and subcommittees (circles) in the 108th U.S. House of Representatives (2003-04), color-coded by the parent standing and select committees and visualized using the Kamada-Kawai method [62]. The darkness of each weighted edge between committees indicates how strongly they are connected. Observe that subcommittees of the same parent committee are closely connected to each other. (Right) Coarse-grained plot of the communities in this network. Here one can see some close connections between different committees, such as Veterans Affairs/Transportation and Rules/Homeland Security.
until each node is in its own singleton community. This hierarchical partitioning process can then be represented by a tree, or dendrogram (see Fig. 0.2). Such processes can yield a hierarchy of nested modules (see Fig. 0.3), or a collection of modules at one mesoscopic level might be obtained in an algorithm independently from those at another level. However obtained, the community structure of a network refers to the set of graph partitions obtained at each reasonable step of such procedures. Note that community structure investigations rely implicitly on using connected network components. (We will assume such connectedness in our discussion of community-detection algorithms below.) Community detection can be applied individually to separate components of networks that are not connected.

Many real-world networks possess a natural hierarchy. For example, the committee assignment network of the U.S. House of Representatives includes the House floor, groups of committees, committees, groups of subcommittees within larger committees, and individual subcommittees [100, 101]. As shown in Fig. 0.4, different House committees are resolved into distinct modules within this network. At a different hierarchical level, small groups of committees belong to larger but less densely-connected modules. To give an example closer to home, let's consider the departmental organization at a university and suppose that the network in Fig. 0.3 represents collaborations among professors. (It actually represents grassland species interactions [23].) At one level of inspection, everybody in the mathematics department might show up in one community, such as the large one in the upper left. Zooming in, however, reveals smaller communities that might represent the department's subfields.

Although network community structure is almost always fairly complicated, several forms of it have nonetheless been observed and shown to be insightful in applications. The structures of communities and between communities are important for the demographic identification of network components and the function of dynamical processes that operate on networks (such as the spread of opinions and diseases) [39]. A community in a social network might indicate a circle of friends, a community in the World Wide Web might indicate a group of pages on closely-related topics, and a
Community Detection Algorithms
Girvan-Newman algorithm: iteratively remove the edge with maximum betweenness, recalculating the edge betweennesses after each removal.
Metric called modularity.
http://www.pnas.org/content/103/23/8577/F1.expansion.html
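As a rough illustration of both ideas, the Python sketch below removes maximum-betweenness edges until the graph first splits, then scores the resulting partition with the standard modularity $Q = \sum_c [e_c/m - (d_c/2m)^2]$, where $e_c$ is the number of edges inside community c, $d_c$ its total degree, and m the number of edges. It stops at the first split, whereas the full algorithm keeps going to build a dendrogram. The toy graph (two triangles joined by a bridge) and all function names are assumptions made for this example.

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness of every edge (Brandes-style accumulation)."""
    eb = {}
    for s in adj:
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                c = sigma[u] / sigma[w] * (1 + delta[w])
                edge = frozenset((u, w))
                eb[edge] = eb.get(edge, 0.0) + c
                delta[u] += c
    return {e: b / 2 for e, b in eb.items()}   # each pair was counted twice

def components(adj):
    """Connected components as a list of vertex sets."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        comps.append(comp)
    return comps

def girvan_newman_first_split(adj):
    """Remove max-betweenness edges until the number of components grows."""
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    start = len(components(adj))
    while len(components(adj)) == start:
        eb = edge_betweenness(adj)                # recomputed after each removal
        if not eb:
            break                                 # no edges left to remove
        u, w = max(eb, key=eb.get)
        adj[u].discard(w)
        adj[w].discard(u)
    return components(adj)

def modularity(adj, communities):
    """Q = sum over communities of e_c/m - (d_c/(2m))^2."""
    m = sum(len(nb) for nb in adj.values()) / 2
    q = 0.0
    for comm in communities:
        internal = sum(1 for u in comm for w in adj[u] if w in comm) / 2
        degree = sum(len(adj[u]) for u in comm)
        q += internal / m - (degree / (2 * m)) ** 2
    return q

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
barbell = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
parts = girvan_newman_first_split(barbell)
print(parts, modularity(barbell, parts))
```

The bridge carries every shortest path between the two triangles, so it is removed first; in the full algorithm, modularity is typically used to decide where to cut the resulting dendrogram.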
Bickel-Chen on Community Detection
Inconsistency result for Newman-Girvan.
http://www.stat.berkeley.edu/~bickel/Bickel%20Chen%2021068.full.pdf
Corollary 1: If the conditions of Theorem 1 hold and
$$\hat{W} \equiv L^{-1} O(\hat{c}, A), \qquad \hat{\pi} = \left\{ \frac{1}{n} \sum_{i=1}^{n} I(\hat{c}_i = a) : a = 1, \ldots, K \right\},$$
then
$$\sqrt{n}(\hat{\pi} - \pi) \Rightarrow N(0, D(\pi) - \pi\pi^T),$$
$$\sqrt{n}(\hat{W} - W) \Rightarrow S \circ (\pi\xi^T + \xi\pi^T) - 2(\pi^T S \xi) W, \qquad \xi = N(0, D(\pi) - \pi\pi^T),$$
with $A \circ B$ denoting point-wise product. The limiting variances are what we would get for maximum likelihood estimates if $\hat{c} = c$, i.e. we knew the assignment to begin with. So consistent modularities lead to efficient estimates of the parameters.
This follows since with probability tending to 1, $\hat{c} = c$.
To estimate $w(\cdot, \cdot)$ in the nonparametric case we need $K \to \infty$, and $w(\cdot, \cdot)$ and $\pi(\cdot)$ smooth. We approximate by $W_K \equiv K^{-2} \|w(aK^{-1}, bK^{-1})\|$, $\pi_K(a) \equiv K^{-1} \pi(aK^{-1})$, where $w(\cdot, \cdot)$, $W_K$ are canonical and the modularity defining $F_K$, $F_K(\cdot, \cdot, \cdot)$ is of order $K^2$. We have preliminary results in that direction but their formulation is complicated and we do not treat them further here.
Consistency of N-G, L-M
We show in SI Appendix using the appropriate $F_{NG}$, $F_{LM}$ that the likelihood modularity is always consistent while the Newman-Girvan is not. This is perhaps not surprising since N-G focuses on the diagonal of O. In fact, we would hope that N-G is consistent under the submodel $\{(\cdot, \cdot, W) : W_{aa} > \sum_{b \neq a} W_{ab} \text{ for all } a\}$, which corresponds to Newman and Girvan's motivation. We have shown this for K = 2 but it surprisingly fails for K > 2. Here is a counterexample. Let K = 3, $\pi = (1/3, 1/3, 1/3)^T$ and
$$P = \begin{pmatrix} .06 & .04 & 0 \\ .04 & .12 & .04 \\ 0 & .04 & .66 \end{pmatrix}.$$
As $n \to \infty$, with true labeling, $Q_{NG}$ approaches 0.033. However, the maximum $Q_{NG}$, about 0.038, is achieved by merging the first two communities. That is, two sparser communities are merged. This is consistent with an observation of Fortunato and Barthelemy (17).
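The two quoted values can be reproduced numerically. The sketch below assumes the limiting Newman-Girvan criterion takes the form $\sum_a (W_{aa} - W_{a\cdot}^2 / W_{\cdot\cdot})$ with $W_{ab} = \pi_a \pi_b P_{ab}$ aggregated over whatever labeling is being scored. That form is an assumption made here because it matches the numbers in the excerpt; it is not copied from the paper, and the function name is hypothetical.

```python
def population_modularity(P, pi, labels):
    """Limiting N-G score for a labeling of the true blocks.

    labels[a] is the community that true block a is assigned to, so
    labels = [0, 1, 2] is the true labeling and [0, 0, 1] merges blocks 0 and 1.
    Assumed form: sum_c (W_cc - W_c.^2 / W..), with W_ab = pi_a * pi_b * P_ab.
    """
    k = max(labels) + 1
    W = [[0.0] * k for _ in range(k)]
    for a in range(len(pi)):
        for b in range(len(pi)):
            W[labels[a]][labels[b]] += pi[a] * pi[b] * P[a][b]
    total = sum(map(sum, W))
    return sum(W[c][c] - sum(W[c]) ** 2 / total for c in range(k))

P = [[0.06, 0.04, 0.00],
     [0.04, 0.12, 0.04],
     [0.00, 0.04, 0.66]]
pi = [1 / 3, 1 / 3, 1 / 3]

q_true = population_modularity(P, pi, [0, 1, 2])
q_merged = population_modularity(P, pi, [0, 0, 1])
print(round(q_true, 3), round(q_merged, 3))  # 0.033 0.038
```

Under this reading the merged labeling scores higher than the true one, which is the inconsistency: maximizing the criterion prefers to fuse the two sparser blocks.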
If for the profile likelihood we maximize only over e such that $\hat{W}_{aa}(e) > \sum_{b \neq a} \hat{W}_{ab}(e)$ for all a, we obtain $\hat{c}$ which is consistent under the submodel above, and in the Karate Club example performs like N-G.
Computational Issues
Computation of optimal assignments using modularities is, in principle, NP hard. However, although the surface is multimodal, in the examples we have considered and generally when the signal is strong, optimization from several starting points using a label switching algorithm (19) works well.
Simulation
We generate random matrices A and maximize $Q_{NG}$, $Q_{LM}$ to obtain node labels respectively, where $Q_{LM}$ is maximized using a label switching algorithm. To make a fair comparison, the initial labeling for $Q_{NG}$ and $Q_{LM}$ is to randomly choose 50% of the nodes with correct labels and the other 50% with random labels. For spectral clustering, we adopt the algorithm of (18) by using the first K eigenvectors of $D(d)^{-1/2} A D(d)^{-1/2}$, where $d = (d_1, \ldots, d_n)^T$ and $d_i$ is the degree of the i-th node. We generate the P matrix randomly by forcing symmetry and then add a constant to diagonal entries such that I holds. The $\pi$ is generated randomly from the simplex. To be precise, the values for Fig. 1 are $\pi = (.203, .286, .511)^T$ and
$$P = b n^{-1} \log n \begin{pmatrix} .43 & .06 & .13 \\ .06 & .34 & .17 \\ .13 & .17 & .40 \end{pmatrix},$$
where n varies from 200 to 1,500 and b varies from 10 to 100. Obviously, Fig. 1 says that the likelihood method exhibits much less incorrect labeling than Newman-Girvan and spectral clustering. This is consistent with theoretical comparison.
Fig. 1. Empirical comparison of Newman-Girvan, likelihood modularities and spectral clustering (18), where K = 3, the number of nodes n varies from 200 to 1500, and the percent of correct labeling is computed from 100 replicates of each simulation case. Here $\pi$, P are given in the text.
Data Examples
We compare the L-M and N-G modularity algorithms below with applications to two real data sets. To deal with the issue of non-convex optimization, we simply use many restarting points.
Zachary's Karate Club Network. We first compare L-M and N-G with the famous Karate Club network of ref. 20, from the social science literature, which has become something of a standard test for community detection algorithms. The network shows the patterns of friendship between the members of a karate club at a US university in the 1970s. The example is of particular interest because shortly after the observation and construction of the network, the club in question split into two components separated by the dashed line as shown in Figs. 2 and 3 as a result of an internal dispute. Fig. 2 Left shows two communities identified by maximizing the likelihood modularity where the shapes of the vertices denote the membership of the corresponding individuals, and similarly the right panel shows communities identified by N-G. Obviously, the N-G communities match the two sub-divisions identified by the split save for one mis-classified individual. The L-M communities are quite different, and obviously one community consists of five individuals with central importance that connect with many other nodes while the other community consists of the remaining individuals. Although not reflecting the split this corresponds to other plausible distinguishing characteristics of the individuals. However, if we force the constraint that within-community density is no less than the density of relationship to all other communities, the submodel we discussed, then we obtain two L-M communities that match the split perfectly. The same partitions as ours with and without constraint have also been reported
Bickel and Chen, PNAS, December 15, 2009, vol. 106, no. 50, 21071
Respondent Driven Sampling (RDS)

sampling scheme for hard-to-reach populations, based on link-tracing across a social network with coupon incentives
becoming extremely widely used all over the world; hundreds of studies done or ongoing, e.g., CDC National HIV Behavioral Surveillance (NHBS) studies of injection drug users
RDS as sampling vs. RDS estimation
Is RDS contact tracing?
Source: http://www.eurosurveillance.org/
Recruitment Tree Example
Volz-Heckathorn RDS Estimator
This is a form of Horvitz-Thompson estimator, reweighting as in importance sampling.

$$\hat{E}(Y) = \frac{\sum_{j=1}^{n} Y_j / d_j}{\sum_{j=1}^{n} 1 / d_j}$$
Relies on a long list of strong assumptions; Handcock-Gile
and Blitzstein-Nesterko perform sensitivity analyses under
various conditions.
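A minimal Python sketch of the estimator as written above, with $Y_j$ the outcome and $d_j$ the reported degree of the j-th sampled person. The function name and the example numbers are made up for illustration, and the sketch does nothing to check the assumptions the estimator relies on.

```python
def volz_heckathorn(values, degrees):
    """Degree-weighted mean: sum(Y_j / d_j) / sum(1 / d_j)."""
    if len(values) != len(degrees):
        raise ValueError("values and degrees must have the same length")
    if any(d <= 0 for d in degrees):
        raise ValueError("every sampled person needs a positive degree")
    numerator = sum(y / d for y, d in zip(values, degrees))
    denominator = sum(1 / d for d in degrees)
    return numerator / denominator

# Hypothetical sample: outcome (1 = positive) and reported network degree.
outcomes = [1, 0, 1, 0]
degrees = [1, 2, 4, 4]

print(volz_heckathorn(outcomes, degrees))  # 0.625
print(sum(outcomes) / len(outcomes))       # 0.5, the unweighted sample mean
```

People with many contacts are more likely to be recruited, so each observation is down-weighted by its degree; when all degrees are equal the estimate reduces to the plain sample mean.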
Goel-Salganik (Stats in Medicine 2009, PNAS 2010):
RDS variances can be extremely large, especially if there
are bottlenecks in the network from modularity/
communities, and from multiple recruitment.
Typical design effects of 5-10, and coverage probabilities
much lower than the nominal 95% values




RDS AS MCMC
[Figure 2: two groups of nodes, A and B]
Figure 2. Hypothetical network with an edge between every pair of nodes, where within-group edges have higher weight than between-group edges. Here the two groups are defined by infection status, and a bottleneck exists between healthy and infected individuals. This is the only type of bottleneck that had been considered in the previous RDS literature.

that as long as there were sufficient connections between infected and uninfected individuals, the RDS estimates would be reasonably precise. While this structural feature is certainly a concern, taken in isolation it underestimates the effect of network structure on the variance of RDS estimates. Even when infected and uninfected individuals are relatively well connected, bottlenecks in other parts of the network can lead to large variance.

To illustrate this point, we analyze RDS on two network models in detail. Our examples, while motivated by the qualitative features of real social networks, are not intended to be accurate models of any specific social network. Rather, they provide insight by allowing for exact and interpretable results.

3.1. Two Network Models.
3.1.1. A Two-Group Model. Consider a population V consisting of two groups, A and B, of equal size N/2. Edges exist between every pair of individuals, however within-group edges have weight 1 − c while between-group edges have weight c where 0 < c < 1/2 (see Figure 3(a)).¹² That is, within-group relationships are stronger than between-group relationships. In this model, c parameterizes homophily based on group membership: the well-observed social tendency for people to form ties

¹² This network model allows for self-edges, which means that it allows for self-recruitment during the sampling process. This assumption departs from the actual RDS sampling process, but has minimal effect on the qualitative results.
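One way to see why c acts as a bottleneck is to treat recruitment as a random walk that picks the next person with probability proportional to edge weight. That recruitment rule is an assumption of this sketch, in the spirit of the "RDS as MCMC" framing, and is not stated in the excerpt above. With self-edges allowed, a person has N/2 within-group edges of weight 1 − c and N/2 between-group edges of weight c, so each recruitment crosses groups with probability c. The function names are hypothetical.

```python
import random

def cross_group_probability(group_size, c):
    """Chance that one weighted-random-walk step leaves the current group."""
    within = group_size * (1 - c)    # total weight to own group (self-edge included)
    between = group_size * c         # total weight to the other group
    return between / (within + between)

def simulate_group_chain(c, steps, rng, start="A"):
    """Sequence of group memberships along one recruitment chain."""
    other = {"A": "B", "B": "A"}
    chain = [start]
    for _ in range(steps):
        current = chain[-1]
        chain.append(other[current] if rng.random() < c else current)
    return chain

print(cross_group_probability(50, 0.1))
chain = simulate_group_chain(0.1, 30, random.Random(0))
print("".join(chain))
```

For small c the chain lingers in one group for long stretches, so a sample of moderate length can badly over-represent the group it started in; this is the sense in which homophily inflates the variance.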
To consult a statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.
-- R. A. Fisher
What would Fisher say?
To Model or Not To Model;
Design-based vs. model-based

Model the underlying network? What about unknown nodes?
the recruitment process?
coupon refusal?
the outcome variables (such as HIV status)?
