Networks
Data Science
Network Models II
Hanspeter Pfister & Joe Blitzstein
This Week
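$\hat{W} \equiv L^{-1} O(\hat{c}, A)$, $\hat{\pi} = \left\{ \frac{1}{n} \sum_{i=1}^{n} I(\hat{c}_i = a) : a = 1, \dots, K \right\}$, then,

$$\sqrt{n}(\hat{\pi} - \pi) \Rightarrow N(0, D(\pi) - \pi\pi^T),$$
$$\sqrt{n}(\hat{W} - W) \Rightarrow S \circ (\pi\Delta^T + \Delta\pi^T) - 2(\pi^T S \Delta) W,$$
$$\Delta = N(0, D(\pi) - \pi\pi^T),$$

with $A \circ B$ denoting point-wise product. The limiting variances are what we would get for maximum likelihood estimates if $\hat{c} = c$, i.e. we knew the assignment to begin with. So consistent modularities lead to efficient estimates of the parameters. This follows since with probability tending to 1, $\hat{c} = c$.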
To estimate $w(\cdot, \cdot)$ in the nonparametric case we need $K \to \infty$, and $w(\cdot, \cdot)$ and $\pi(\cdot)$ smooth. We approximate by $W_K \equiv K^{-2} \| w(aK^{-1}, bK^{-1}) \|$, $\pi_K(a) \equiv K^{-1} \pi(aK^{-1})$, where $w(\cdot, \cdot)$, $W_K$ are canonical and the modularity defining $F_K$, $F_K(\cdot, \cdot, \cdot)$ is of order $K^2$. We have preliminary results in that direction but their formulation is complicated and we do not treat them further here.
Consistency of N-G, L-M
We show in SI Appendix using the appropriate $F_{NG}$, $F_{LM}$ that the likelihood modularity is always consistent while the Newman-Girvan is not. This is perhaps not surprising since N-G focuses on the diagonal of $O$. In fact, we would hope that N-G is consistent under the submodel $\{(\cdot, \cdot, W) : W_{aa} > \sum_{b \neq a} W_{ab} \text{ for all } a\}$, which corresponds to Newman and Girvan's motivation. We have shown this for $K = 2$ but it surprisingly fails for $K > 2$. Here is a counterexample. Let $K = 3$, $\pi = (1/3, 1/3, 1/3)^T$ and

$$P = \begin{pmatrix} .06 & .04 & 0 \\ .04 & .12 & .04 \\ 0 & .04 & .66 \end{pmatrix}.$$

As $n \to \infty$, with true labeling, $Q_{NG}$ approaches 0.033. However, the maximum $Q_{NG}$, about 0.038, is achieved by merging the first two communities. That is, two sparser communities are merged. This is consistent with an observation of Fortunato and Barthelemy (17).
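The two limiting values in the counterexample can be checked numerically. The sketch below is not the authors' code: it assumes the population modularity has the form $\sum_a (W_{aa} - W_{a+}^2 / W_{++})$ with $W_{ab} = \pi_a \pi_b P_{ab}$, a normalization chosen here because it reproduces the quoted 0.033 and 0.038. The function name `population_ng_modularity` is hypothetical.

```python
def population_ng_modularity(P, pi, groups):
    """Limiting Newman-Girvan modularity for a partition of the K blocks.

    groups: list of lists; each inner list holds the block indices that are
    merged into one community.
    Assumed form (not taken from the source): sum_a (W_aa - W_a+^2 / W_++),
    where W_ab = pi_a * pi_b * P_ab.
    """
    K = len(pi)
    W = [[pi[a] * pi[b] * P[a][b] for b in range(K)] for a in range(K)]
    total = sum(sum(row) for row in W)
    q = 0.0
    for members in groups:
        # Edge mass inside the (possibly merged) community
        within = sum(W[a][b] for a in members for b in members)
        # Edge mass touching the community (its share of total degree)
        row = sum(W[a][b] for a in members for b in range(K))
        q += within - row ** 2 / total
    return q


P = [[0.06, 0.04, 0.00],
     [0.04, 0.12, 0.04],
     [0.00, 0.04, 0.66]]
pi = [1 / 3, 1 / 3, 1 / 3]

q_true = population_ng_modularity(P, pi, [[0], [1], [2]])    # true labeling
q_merged = population_ng_modularity(P, pi, [[0, 1], [2]])    # first two merged
print(round(q_true, 3), round(q_merged, 3))
```

Under this assumed normalization the true labeling gives 0.30/9 and the merged labeling gives 0.34/9, so the merged partition scores higher, which is the failure of consistency described in the text.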
If for the profile likelihood we maximize only over $e$ such that $W_{aa}(e) > \sum_{b \neq a} W_{ab}(e)$ for all $a$, we obtain $\hat{c}$ which is consistent under the submodel above, and in the Karate Club example performs like N-G.
Computational Issues
Computation of optimal assignments using modularities is, in principle, NP hard. However, although the surface is multimodal, in the examples we have considered and generally when the signal is strong, optimization from several starting points using a label switching algorithm (19) works well.
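A minimal sketch of what a label switching search can look like, using the Newman-Girvan modularity $\sum_a (O_{aa}/L - (D_a/L)^2)$ as the objective. This is a plain greedy variant written for illustration; the algorithm of ref. 19 may differ in its move rule. The helper names (`ng_modularity`, `label_switching`, `two_cliques`) and the toy graph are assumptions, not from the source.

```python
def ng_modularity(A, labels, K):
    """Newman-Girvan modularity for a symmetric 0/1 adjacency matrix A."""
    n = len(A)
    L = sum(sum(row) for row in A)   # sum of degrees (twice the edge count)
    O = [0.0] * K                    # within-community adjacency sums
    D = [0.0] * K                    # total degree per community
    for i in range(n):
        for j in range(n):
            if A[i][j]:
                D[labels[i]] += A[i][j]
                if labels[i] == labels[j]:
                    O[labels[i]] += A[i][j]
    return sum(O[a] / L - (D[a] / L) ** 2 for a in range(K))


def label_switching(A, K, init_labels, max_sweeps=50):
    """Greedy search: move one node at a time to the label that most
    increases the modularity; stop when a full sweep changes nothing."""
    labels = list(init_labels)
    best = ng_modularity(A, labels, K)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(labels)):
            keep = labels[i]
            for a in range(K):
                labels[i] = a
                q = ng_modularity(A, labels, K)
                if q > best + 1e-12:
                    best, keep, improved = q, a, True
            labels[i] = keep
        if not improved:
            break
    return labels, best


def two_cliques(size):
    """Toy graph (hypothetical): two cliques joined by a single edge."""
    n = 2 * size
    A = [[0] * n for _ in range(n)]
    for block in (range(size), range(size, n)):
        for i in block:
            for j in block:
                if i != j:
                    A[i][j] = 1
    A[size - 1][size] = A[size][size - 1] = 1
    return A


A = two_cliques(4)
start = [0, 0, 0, 1, 1, 1, 1, 1]     # node 3 starts with the wrong label
labels, q = label_switching(A, 2, start)
print(labels, round(q, 4))
```

Because the surface is multimodal, a single greedy run can stall at a local optimum (for example, starting with every node in one community). In practice one would repeat the search from several starting labelings and keep the result with the largest modularity.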
Simulation
We generate random matrices $A$ and maximize $Q_{NG}$, $Q_{LM}$ to obtain node labels respectively, where $Q_{LM}$ is maximized using a label switching algorithm. To make a fair comparison, the initial labeling for $Q_{NG}$ and $Q_{LM}$ is to randomly choose 50% of the nodes with correct labels and the other 50% with random labels. For spectral clustering, we adopt the algorithm of (18) by using the first $K$ eigenvectors of $D(d)^{-1/2} A D(d)^{-1/2}$, where $d = (d_1, \dots, d_n)^T$ and $d_i$ is the degree of the $i$-th node. We generate the $P$ matrix randomly by forcing symmetry and then add a constant to diagonal entries such that I holds. The $\pi$ is generated randomly from the simplex. To be precise, the values for Fig. 1 are $\pi = (.203, .286, .511)^T$ and

$$P = b\,n^{-1} \log n,$$

where $n$ varies from 200 to 1,500 and $b$ varies from 10 to 100. Obviously, Fig. 1 says that the likelihood method exhibits much less incorrect labeling than Newman-Girvan and spectral clustering. This is consistent with theoretical comparison.

Fig. 1. Empirical comparison of Newman-Girvan, likelihood modularities and spectral clustering (18), where $K = 3$, the number of nodes $n$ varies from 200 to 1500, and the percent of correct labeling is computed from 100 replicates of each simulation case. Here $\pi$, $P$ are given in the text.
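The spectral step described above can be sketched as follows. This is an illustration only, not the algorithm of ref. 18 in full: it computes the embedding from $D(d)^{-1/2} A D(d)^{-1/2}$ and leaves out the clustering of the embedded rows. It assumes "first $K$ eigenvectors" means those with the $K$ largest eigenvalues and that no node has degree zero; the function name `spectral_embedding` and the toy graph are hypothetical.

```python
import numpy as np


def spectral_embedding(A, K):
    """Return the K largest eigenvalues and eigenvectors of D^{-1/2} A D^{-1/2}.

    Assumes A is symmetric and every node has positive degree.
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                          # node degrees d_i
    inv_sqrt = 1.0 / np.sqrt(d)
    M = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:K]         # indices of the K largest
    return vals[order], vecs[:, order]


# Toy graph (hypothetical): two 4-node cliques joined by one edge.
A = np.zeros((8, 8))
A[:4, :4] = 1
A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1

vals, vecs = spectral_embedding(A, 2)
# With K = 2, the sign of the second eigenvector already separates the cliques.
signs = np.sign(vecs[:, 1])
print(vals, signs)
```

For $K > 2$ the rows of the returned eigenvector matrix would be passed to a clustering routine such as k-means to obtain node labels.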
Data Examples
We compare the L-M and N-G modularity algorithms below
with applications to two real data sets. To deal with the issue of
non-convex optimization, we simply use many restarting points.
Zachary's Karate Club Network. We first compare L-M and N-G with the famous Karate Club network of ref. 20, from the social science literature, which has become something of a standard test for community detection algorithms. The network shows the patterns of friendship between the members of a karate club at a US university in the 1970s. The example is of particular interest because shortly after the observation and construction of the network, the club in question split into two components separated by the dashed line as shown in Figs. 2 and 3 as a result of an internal dispute. Fig. 2 Left shows two communities identified by maximizing the likelihood modularity where the shapes of the vertices denote the membership of the corresponding individuals, and similarly the right panel shows communities identified by N-G. Obviously, the N-G communities match the two sub-divisions identified by the split save for one misclassified individual. The L-M communities are quite different, and obviously one community consists of five individuals with central importance that connect with many other nodes while the other community consists of the remaining individuals. Although not reflecting the split this corresponds to other plausible distinguishing characteristics of the individuals. However, if we force the constraint that within-community density is no less than the density of relationship to all other communities, the submodel we discussed, then we obtain two L-M communities that match the split perfectly. The same partitions as ours with and without constraint have also been reported
Bickel and Chen PNAS December 15, 2009 vol. 106 no. 50 21071
Respondent Driven Sampling (RDS)
$$\hat{E}(Y) = \frac{\sum_{j=1}^{n} Y_j / d_j}{\sum_{j=1}^{n} 1 / d_j}$$
Relies on a long list of strong assumptions; Handcock-Gile
and Blitzstein-Nesterko perform sensitivity analyses under
various conditions.
Goel-Salganik (Stats in Medicine 2009, PNAS 2010):
RDS variances can be extremely large, especially if there
are bottlenecks in the network from modularity/
communities, and from multiple recruitment.
Typical design effects of 5-10, and coverage probabilities
much lower than the nominal 95% values
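As a small worked example of the estimator at the top of this slide, the sketch below computes the inverse-degree weighted mean $\sum_j (Y_j/d_j) / \sum_j (1/d_j)$. The function name `rds_estimate` and the sample values are made up for illustration.

```python
def rds_estimate(y, degrees):
    """Inverse-degree weighted mean: sum(y_j / d_j) / sum(1 / d_j)."""
    if len(y) != len(degrees) or len(y) == 0:
        raise ValueError("y and degrees must be non-empty and the same length")
    weights = [1.0 / d for d in degrees]   # each respondent weighted by 1/d_j
    return sum(w * yj for w, yj in zip(weights, y)) / sum(weights)


# Hypothetical sample: y_j = 1 if respondent j has the trait, d_j = reported degree.
y = [1, 0, 1, 0]
degrees = [10, 2, 5, 1]

estimate = rds_estimate(y, degrees)
naive_mean = sum(y) / len(y)
print(estimate, naive_mean)
```

Here the respondents with the trait report high degrees, so they are down-weighted and the estimate (0.3/1.8) falls well below the unweighted sample mean of 0.5. The weighting only corrects for degree; it does not address the variance problems listed on the slide.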
RDS AS MCMC 9
Figure 2. Hypothetical network with an edge between every pair of nodes, where within-group edges have higher weight than between-group edges. Here the two groups (labeled A and B in the figure) are defined by infection status, and a bottleneck exists between healthy and infected individuals. This is the only type of bottleneck that had been considered in the previous RDS literature.
that as long as there were sufficient connections between infected and uninfected individuals, the RDS estimates would be reasonably precise. While this structural feature is certainly a concern, taken in isolation it underestimates the effect of network structure on the variance of RDS estimates. Even when infected and uninfected individuals are relatively well connected, bottlenecks in other parts of the network can lead to large variance.
To illustrate this point, we analyze RDS on two network models in detail. Our examples, while motivated by the qualitative features of real social networks, are not intended to be accurate models of any specific social network. Rather, they provide insight by allowing for exact and interpretable results.
3.1. Two Network Models.
3.1.1. A Two-Group Model. Consider a population V consisting of two groups, A and B, of equal size N/2. Edges exist between every pair of individuals, however within-group edges have weight $1 - c$ while between-group edges have weight $c$ where $0 < c < 1/2$ (see Figure 3(a)).[12] That is, within-group relationships are stronger than between-group relationships. In this model, $c$ parameterizes homophily based on group membership: the well-observed social tendency for people to form ties

[12] This network model allows for self-edges which means that it allows for self-recruitment during the sampling process. This assumption departs from the actual RDS sampling process, but has minimal effect on the qualitative results.
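To see how $c$ controls the bottleneck in the two-group model, the sketch below tracks only which group a recruitment chain is in. It assumes each step picks the next individual with probability proportional to edge weight, self-edges included as in footnote 12. With equal group sizes the chance of crossing groups in one step is then $c(N/2) / [(1-c)(N/2) + c(N/2)] = c$, regardless of $N$. The function name `same_group_probability` is hypothetical.

```python
def same_group_probability(c, t):
    """Probability that a weighted random walk on the two-group model is in
    its starting group after t steps, found by iterating the 2-state chain
    with crossing probability c."""
    p_same = 1.0
    for _ in range(t):
        # Stay in the start group w.p. 1 - c, or return from the other group w.p. c
        p_same = p_same * (1 - c) + (1 - p_same) * c
    return p_same


def closed_form(c, t):
    """Same quantity from the chain's second eigenvalue 1 - 2c."""
    return 0.5 + 0.5 * (1 - 2 * c) ** t


for c in (0.05, 0.25, 0.45):
    print(c, same_group_probability(c, 10), closed_form(c, 10))
```

When $c$ is small the second eigenvalue $1 - 2c$ is close to 1, so the chain takes many steps to forget its starting group; this slow mixing is what the bottleneck does to the sampling process.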
To consult a statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of.
-- R. A. Fisher
What would Fisher say?
To Model or Not To Model;
Design-based vs. model-based
coupon refusal?