Lecture 02
Sampling
36-720, Fall 2016
Scribes: Cristobal De La Maza and Valerie Yuan
http://www.stat.cmu.edu/~cshalizi/networks/16-1 for updates
31 August 2016
Contents
1 Sampling procedures
 1.1 Ideal Data: Network Census
  1.1.1 Imperfections in network censuses
2 Sampling Designs
 2.1 Induced and Incident Subgraphs
  2.1.1 Example of a Bias from Induced-Subgraph Sampling
 2.2 “Exploratory” Sampling Designs
  2.2.1 Snowball sampling
  2.2.2 Respondent-driven sampling
  2.2.3 Trace-route sampling
3 Coping strategies
 3.1 Head in sand
 3.2 Learn sampling theory (design-based inference)
  3.2.1 Strengths and Weaknesses
 3.3 Missing data tools
 3.4 Model the Effective Network
5 Exercises
The real foundation of any branch of statistics is data collection. For the
sorts of statistics we’ve mostly seen before, where data are IID (or IID-ish), data
come from samples or from experiments. It’s hard (though not impossible) to
experiment with networks, so we mostly have to deal with samples, and even
there, things become much more complicated than we’re used to. Unfortunately,
this complexity is all too often ignored when analyzing empirical networks.
1 Sampling procedures
1.1 Ideal Data: Network Census
The ideal data would be a census or enumeration of the network. This would
record every node, and every edge between nodes, with no spurious additional
nodes or edges. If you are in the fortunate situation of having a complete
network census, you can pretty much ignore the sampling process, and proceed
to model network formation.
Footnote: The mean number must, necessarily, be equal for men and for women. (This is not true of the median; why?) However, in every study of this sort known to me, the mean number reported by men is substantially higher than the mean number reported by women.
2 Sampling Designs
If we cannot get hold of the true, “population” graph G = (V, E), we may,
guided by the example of IID statistics, try to measure a “sampled” graph
G∗ = (V∗, E∗), with V∗ ⊆ V and E∗ ⊆ E. Different sampling designs amount
to different ways of obtaining such sampled subgraphs. In baby statistics, our
first step in understanding sampling is the concept of a simple random sample
(SRS) of units from the population. In networks, even a simple random sample
is complicated.
graph G∗ has an observed adjacency matrix A∗, and A∗ij = 1 iff Aij = 1 and
both i and j are in the sample. Writing Zi for the indicator that node i is
sampled, π for the common inclusion probability, and m = |V∗| for the number
of sampled nodes, what is the expected value of the plug-in estimate of the
mean degree k̄ from G∗, say k̄∗?
$$
\begin{aligned}
\mathbb{E}\left[\bar{k}^*\right] &= \mathbb{E}\left[\frac{1}{m} \sum_{i \in V^*} k_i^*\right] && (1)\\
&= \frac{1}{m} \sum_{i \in V^*} \sum_{j \in V^*} \mathbb{E}\left[A^*_{ij}\right] && (2)\\
&= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}\left[A_{ij} Z_i Z_j\right] && (3)\\
&= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}\, \mathbb{E}\left[Z_i Z_j\right] && (4)\\
&= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}\, \pi^2 && (5)\\
&= \frac{1}{n\pi}\, \pi^2 \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} && (6)\\
&= \frac{\pi}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} = \pi \bar{k} && (7)
\end{aligned}
$$
Here, the key step is replacing the expectation of the indicator variables with
the sampling probability. Since π is a sampling probability, and therefore < 1,
the mean degree of the induced subgraph is less than the true mean degree, by
a factor of π.
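This πk̄ bias is easy to check by simulation. The sketch below is illustrative only (the graph, seed, and numbers are my choices, not from the notes): it draws Bernoulli(π) node samples from a small fixed graph and compares the average plug-in mean degree against π times the true mean degree.

```python
import random

# Illustrative "population" graph: a 30-node ring plus chords (i, i+7 mod 30).
# Every node then has degree 4, so the true mean degree k-bar is 4.
n = 30
adj = {i: set() for i in range(n)}
for i in range(n):
    for j in ((i + 1) % n, (i + 7) % n):
        adj[i].add(j)
        adj[j].add(i)

true_mean_degree = sum(len(adj[i]) for i in adj) / n

def induced_mean_degree(pi, rng):
    """Sample each node independently with probability pi; return the
    plug-in mean degree of the induced subgraph (None if no nodes drawn)."""
    v_star = {i for i in adj if rng.random() < pi}
    if not v_star:
        return None
    return sum(len(adj[i] & v_star) for i in v_star) / len(v_star)

rng = random.Random(36720)
pi = 0.5
draws = [induced_mean_degree(pi, rng) for _ in range(20000)]
draws = [d for d in draws if d is not None]
avg = sum(draws) / len(draws)
# avg should come out near pi * true_mean_degree = 2, not near 4
print(round(true_mean_degree, 2), round(avg, 2))
```

(The derivation above quietly treats the sample size m as fixed at nπ; the simulation lets m be random, so the agreement is approximate rather than exact.)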
sometimes called a star design, though I have certainly also heard it called
an “egocentric design”. When we deal with star designs, we collect multiple
local graph neighborhoods, and an important question is whether those overlap;
depending on the recording process, this information might be available (so we
know that Karl appears in the ego networks of both Irene and Joey) or not.
1. Pick a set of source nodes.
2. Pick a set of target nodes.
3. For each source-target combination, find a path from the source to the
target, and record all nodes and edges traversed along the path.
Clearly, a lot will depend on how, precisely, paths are found, but this is an
application-specific issue.
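The three steps above can be sketched concretely. Everything here is an illustrative toy (the graph, and the choice of BFS shortest paths as the routing rule, are my assumptions, since path-finding is application-specific):

```python
from collections import deque

# Toy undirected graph as an adjacency dict (illustrative only).
adj = {
    0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4],
}

def bfs_path(src, dst):
    """Return one shortest path from src to dst, or None if unreachable."""
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    return None

# Steps 1-2: pick sources and targets; step 3: trace and record each route.
sources, targets = [0], [5]
V_star, E_star = set(), set()
for s in sources:
    for t in targets:
        path = bfs_path(s, t)
        if path is None:
            continue  # a "failed" route; here we record nothing from it
        V_star.update(path)
        E_star.update(frozenset(e) for e in zip(path, path[1:]))

print(sorted(V_star), len(E_star))  # the sampled subgraph G* = (V*, E*)
```

Note how much of the graph a single source-target pair misses: only nodes and edges on the one traced route enter the sample.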
While the name “trace-route” comes from a Unix utility which finds paths
across the Internet, the practice is actually older, and not just limited to
computer networks. The most famous instance of trace-route sampling is probably
the Milgram study which led to the folklore that any two people in the US
have at most “six degrees of separation”. In that study, sources in the midwest
were tasked to get an envelope to a target, a stock-broker living outside Boston,
where at each step the envelope had to be passed on to someone known on a
first-name basis. Dodds et al. (2003) was a comparatively recent attempt to
do something similar but with e-mail.
Depending on exactly how route-tracing gets done, one may or may not get
information from “failed” routes, i.e., ones which didn’t succeed in getting from
source to target; that’s long been appreciated. What was not realized until
Achlioptas et al. (2005) is that trace-route sampling systematically distorts the
degree distribution, making all kinds of graphs look like they have heavy-tailed
distributions whether they do or not.
3 Coping strategies
Sampling issues with networks are real and potentially affect all aspects of
inference — just like every other part of statistics. Network data analysis has
therefore developed several strategies for coping with sampling problems.
3.2 Learn sampling theory (design-based inference)
Classical sampling theory is a theory of statistical inference in which probability
assumptions are only made about the sampling process. The true population
is regarded as unknown but fixed, and no stochastic assumptions are made
about how it is generated. (One can always regard this as conditioning on
the unknown population.) Because all the probability assumptions refer to the
sampling design, and the validity of the inference depends only on whether the
design has been accurately modeled, this is sometimes called design-based
inference.
As an example of how this works, consider trying to estimate the mean µ
of some quantity Xi over a finite population of size n, using a sample of units
S. If every unit is sampled with equal probability, the familiar sample mean
$\bar{X} = |S|^{-1} \sum_{i \in S} X_i$ is a good estimate. With unequal
probabilities, however, what should one do?
A simple, classic solution is the Horvitz-Thompson estimator:
$$
\hat{\mu}_{HT} \equiv \frac{1}{n} \sum_{i \in S} \frac{X_i}{\pi_i} \qquad (8)
$$
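Eq. 8 translates directly into code. The function and numbers below are an illustrative sketch of my own, not anything from the notes:

```python
def horvitz_thompson(sample, x, pi_incl, n):
    """Horvitz-Thompson estimate of the population mean (Eq. 8).
    sample: indices of sampled units; x[i]: value of unit i;
    pi_incl[i]: inclusion probability of unit i; n: population size."""
    return sum(x[i] / pi_incl[i] for i in sample) / n

x = {1: 2.0, 2: 4.0, 3: 6.0}         # values of the sampled units (toy numbers)
pi_incl = {1: 0.5, 2: 0.5, 3: 0.25}  # their inclusion probabilities
print(horvitz_thompson([1, 2, 3], x, pi_incl, n=10))  # (4 + 8 + 24)/10 = 3.6
```

Each unit is up-weighted by the inverse of its inclusion probability, so rarely-sampled units stand in for the many similar units the design tends to miss.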
This estimator is unbiased:
$$
\begin{aligned}
\mathbb{E}\left[\hat{\mu}_{HT}\right] &= \mathbb{E}\left[\frac{1}{n} \sum_{i \in S} \frac{X_i}{\pi_i}\right] && (9)\\
&= \mathbb{E}\left[\frac{1}{n} \sum_{i \in 1:n} \frac{X_i}{\pi_i} Z_i\right] && (10)\\
&= \frac{1}{n} \sum_{i \in 1:n} \frac{X_i}{\pi_i}\, \mathbb{E}\left[Z_i\right] && (11)\\
&= \frac{1}{n} \sum_{i \in 1:n} \frac{X_i}{\pi_i}\, P(Z_i = 1) && (12)\\
&= \frac{1}{n} \sum_{i \in 1:n} \frac{X_i}{\pi_i}\, \pi_i && (13)\\
&= \frac{1}{n} \sum_{i \in 1:n} X_i && (14)\\
&= \mu && (15)
\end{aligned}
$$
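The unbiasedness shown in Eqs. 9-15 can also be checked by Monte Carlo, at least under independent (Bernoulli) sampling, where each Zi is Bernoulli(πi). All numbers below are illustrative:

```python
import random

# Toy population values and unequal inclusion probabilities (illustrative).
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
pi_incl = [0.2, 0.9, 0.5, 0.5, 0.3, 0.7, 0.4, 0.8]
n = len(x)
mu = sum(x) / n  # the true population mean we hope to recover

rng = random.Random(2016)
total = 0.0
reps = 100000
for _ in range(reps):
    # Draw each Z_i independently, then form the HT estimate of Eq. 8.
    est = sum(x[i] / pi_incl[i] for i in range(n)
              if rng.random() < pi_incl[i]) / n
    total += est

# The average estimate over many replications should be close to mu.
print(round(mu, 4), round(total / reps, 4))
```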
One can show (Exercise 2a) that the variance of the estimator is
$$
\mathrm{Var}\left[\hat{\mu}_{HT}\right] = \frac{1}{n^2} \sum_{i \in 1:n} \sum_{j \in 1:n} X_i X_j \left(\frac{\pi_{ij}}{\pi_i \pi_j} - 1\right) \qquad (16)
$$
with πij being the joint inclusion probability, i.e., the probability of including
both i and j in the sample (with πii = πi ). Notice that if all the πi → 1, the
variance goes to 0, as is reasonable (and as is required if the estimator is to
be consistent). We can’t actually calculate this true variance, since we can’t
sum over all the unknown units in the population, but there is a (consistent)
empirical counter-part:
$$
\widehat{\mathrm{Var}}\left[\hat{\mu}_{HT}\right] = \frac{1}{n^2} \sum_{i \in S} \sum_{j \in S} X_i X_j \left(\frac{1}{\pi_i \pi_j} - \frac{1}{\pi_{ij}}\right) \qquad (17)
$$
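Eq. 17 can be computed from the sample alone once the joint inclusion probabilities are known. The sketch below (numbers are illustrative) takes the independent-sampling case, where πij = πi πj for i ≠ j and πii = πi, so only the diagonal terms survive:

```python
def ht_variance_estimate(sample, x, pi_incl, n):
    """Empirical variance estimate of Eq. 17, specialized to independent
    (Bernoulli) sampling: pi_ij = pi_i * pi_j when i != j, pi_ii = pi_i."""
    def pi_joint(i, j):
        return pi_incl[i] if i == j else pi_incl[i] * pi_incl[j]
    total = 0.0
    for i in sample:
        for j in sample:
            total += x[i] * x[j] * (1.0 / (pi_incl[i] * pi_incl[j])
                                    - 1.0 / pi_joint(i, j))
    return total / n ** 2

x = [2.0, 1.0, 4.0]          # population values (toy numbers)
pi_incl = [0.5, 0.5, 0.25]   # inclusion probabilities
# Off-diagonal terms vanish, leaving sum over S of x_i^2 (1 - pi_i) / pi_i^2:
print(ht_variance_estimate([0, 2], x, pi_incl, n=3))  # (8 + 192)/9
```

For designs with dependent inclusions (e.g. fixed sample size), the same double loop applies, but pi_joint must encode the design's actual joint probabilities.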
3.3 Missing data tools
Another approach is to treat the unobserved part of the network as missing
data, and try to infer it. This can range from simple imputation strategies,
to complex model-based strategies for inference, such as the EM algorithm.
Successful imputation or EM is not design-based but model-based, and requires
a model both of the network, and of the sampling process. It is very, very
rare for anything to be “missing at random”, let alone “missing completely at
random”. Perhaps for this reason, comparatively little has been done in this
direction (Handcock and Gile, 2010); more should be.
Newman” refer to the same researcher (or, rather, when they do); in citation
networks, papers may be cited differently; the same person may have multiple
phone numbers; etc., etc. Network structure can in fact be an important clue in
entity resolution (Bhattacharya and Getoor, 2007), but that gets a bit circular...
Diffusion refers to the way that many of the automatically-recorded networks
which provide us with our big data have themselves spread over other, older
social networks. What we see when we look at the network of (say) Facebook
ties is a combination of the pre-Facebook social network and the results
of the diffusion process. Comparatively little has been done to understand the
results. One of the best studies is that of Schoenebeck (2013), who showed how
even if the diffusion process treats all nodes homogeneously, the network-as-
diffused can differ radically in its properties from the underlying network. If
you say “I only care about Facebook, not about the social network”, this may
not matter, but even then it can change your understanding of why Facebook
looks the way it does.
The third issue is performativity, the way theories can become (partially)
self-fulfilling prophecies. The companies which run online social networks are
all very invested in getting very big, very dense networks of users. This is why
they all offer link suggestion or link recommendation services. The algorithms
behind these recommendations implement theories about how social networks
form, and what sort of link patterns they should have. To the extent that people
follow these recommendations, then, the recorded network will seem to conform
to the theory. The only worthwhile study of this I know of is Healy (2015).
5 Exercises
To think through, not to hand in.
they can charge a user, either monetarily or in hassle, time spent watching ads, etc., is just
less than the “switching cost” to the user of changing over to a different network. Given a
choice between two equally-functional websites, users will typically prefer ones with more of
their friends / contacts / peers, so switching costs will increase as the network gets larger and
denser. (Said slightly differently, large groups of users would have to coordinate switching to
a different service, and coordination itself creates switching costs.) The economics of lock-in,
switching costs and network externalities were all laid out very lucidly long ago by Shapiro
and Varian (1998) — from the point of view of the companies running the networks, not of
the users who are, as the saying goes, the product being sold. (Note that Varian is now the
chief economist at Google.)
(a) Derive the variance of µ̂HT . Hint: Repeat the trick with the Zi
indicators used to show µ̂HT is unbiased.
(b) To use Eq. 8, we need to know n, the total population size. Suppose
we replace n by $\sum_{i \in S} 1/\pi_i$. Will this still be an unbiased estimate?
Will the variance be larger or smaller than that of Eq. 8?
References
Achlioptas, Dimitris, Aaron Clauset, David Kempe and Cristopher Moore
(2005). “On the Bias of Traceroute Sampling (or: Why almost every network
looks like it has a power law).” In Proceedings of the 37th ACM Symposium
on Theory of Computing. URL http://arxiv.org/abs/cond-mat/0503087.
Bhattacharya, Indrajit and Lise Getoor (2007). “Collective Entity Resolution In
Relational Data.” ACM Transactions on Knowledge Discovery from Data,
1(1): 5. doi:10.1145/1217299.1217304.
Borgatti, Stephen P., Kathleen M. Carley and David Krackhardt (2006). “On
the robustness of centrality measures under conditions of imperfect data.”
Social Networks, 28: 124–136. doi:10.1016/j.socnet.2005.05.001.
Dodds, Peter Sheridan, Roby Muhamad and Duncan J. Watts (2003). “An
Experimental Study of Search in Global Social Networks.” Science, 301:
827–829. doi:10.1126/science.1081058.
Goffman, Erving (1959). The Presentation of Self in Everyday Life. New York:
Anchor Books.
Kleinfeld, Judith (2002). “Could It Be a Big World After All? What the
Milgram Papers in the Yale Archive Reveal About the Original Small World
Study.” Society, 39: 61–66. URL http://www.uaf.edu/northern/big_
world.html.
Kolaczyk, Eric D. (2009). Statistical Analysis of Network Data. New York:
Springer-Verlag.
Kossinets, Gueorgi (2006). “Effects of Missing Data in Social Networks.” Social
Networks, 28: 247–268. URL http://arxiv.org/abs/cond-mat/0306335.
doi:10.1016/j.socnet.2005.07.002.
Lee, Sang Hoon, Pan-Jun Kim and Hawoong Jeong (2006). “Statistical Prop-
erties of Sampled Networks.” Physical Review E , 73: 016102. URL http:
//arxiv.org/abs/cond-mat/0505232. doi:10.1103/PhysRevE.73.016102.
Schoenebeck, Grant (2013). “Potential Networks, Contagious Communities,
and Understanding Social Network Structure.” In Proceedings of the 22nd
International World Wide Web Conference [WWW 2013] (Daniel Schwabe
and Virgilio Almeida and Hartmut Glaser and Ricardo Baeza-Yates and Sue
Moon, eds.), pp. 1123–1132. Geneva, Switzerland: International World Wide
Web Conferences Steering Committee. URL http://arxiv.org/abs/1304.
1845.
Shapiro, Carl and Hal R. Varian (1998). Information Rules: A Strategic Guide
to the Network Economy. Boston: Harvard Business School Press, 1st edn.
Stumpf, Michael P. H. and Carsten Wiuf (2005). “Sampling Properties of Ran-
dom Graphs: The Degree Distribution.” Physical Review E , 72: 036117.
URL http://arxiv.org/abs/cond-mat/0507345.
Stumpf, Michael P. H., Carsten Wiuf and Robert M. May (2005). “Subnets
of Scale-free Networks are not Scale-free: Sampling Properties of Networks.”
Proceedings of the National Academy of Sciences (USA), 102: 4221–4224.
doi:10.1073/pnas.0501179102.
Virkar, Yogesh and Aaron Clauset (2014). “Power-law distributions in binned
empirical data.” Annals of Applied Statistics, 8: 89–119. URL http://
arxiv.org/abs/1208.3524. doi:10.1214/13-AOAS710.