Lecture 02
Sampling
36-720, Fall 2016
Scribes: Cristobal De La Maza and Valerie Yuan
http://www.stat.cmu.edu/~cshalizi/networks/16-1 for updates
31 August 2016
Contents
1 Sampling procedures
 1.1 Ideal Data: Network Census
  1.1.1 Imperfections in network censuses
2 Sampling Designs
 2.1 Induced and Incident Subgraphs
  2.1.1 Example of a Bias from Induced-Subgraph Sampling
 2.2 “Exploratory” Sampling Designs
  2.2.1 Snowball sampling
  2.2.2 Respondent-driven sampling
  2.2.3 Trace-route sampling
3 Coping strategies
 3.1 Head in sand
 3.2 Learn sampling theory (design-based inference)
  3.2.1 Strengths and Weaknesses
 3.3 Missing data tools
 3.4 Model the Effective Network
5 Exercises
The real foundation of any branch of statistics is data collection. For the
sorts of statistics we’ve mostly seen before, where data are IID (or IID-ish), data
come from samples or from experiments. It’s hard (though not impossible) to
experiment with networks, so we mostly have to deal with samples, and even
there, things become much more complicated than we’re used to. Unfortunately,
this complexity is all too often ignored when analyzing empirical networks.
1 Sampling procedures
1.1 Ideal Data: Network Census
The ideal data would be a census or enumeration of the network. This would
record every node, and every edge between nodes, with no spurious additional
nodes or edges. If you are in the fortunate situation of having a complete
network census, you can pretty much ignore the sampling process, and proceed
to model network formation.
Footnote: The mean number must, necessarily, be equal for men and for women. (This is not true of the median; why?) However, in every study of this sort known to me, the mean number reported by men is substantially higher than the mean number reported by women.
2 Sampling Designs
If we cannot get hold of the true, “population” graph G = (V, E), we may,
guided by the example of IID statistics, try to measure a “sampled” graph
G∗ = (V∗, E∗), with V∗ ⊆ V and E∗ ⊆ E. Different sampling designs amount
to different ways of obtaining such sampled subgraphs. In baby statistics, our
first step in understanding sampling is the concept of a simple random sample
(SRS) of units from the population. In networks, even a simple random sample
is complicated.
graph G∗ has an observed adjacency matrix A∗, and A∗ij = 1 iff Aij = 1 and
both i and j are in the sample. Writing Zi for the indicator that node i is
sampled, π for the common inclusion probability, and m = |V∗| for the number
of sampled nodes, what is the expected value of the plug-in estimate of the
mean degree k̄ from G∗, say k̄∗?
$$
\begin{aligned}
\mathbb{E}\left[\bar{k}^*\right] &= \mathbb{E}\left[\frac{1}{m} \sum_{i \in V^*} k_i^*\right] && (1)\\
&= \frac{1}{m} \sum_{i \in V^*} \sum_{j \in V^*} \mathbb{E}\left[A^*_{ij}\right] && (2)\\
&= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}\left[A_{ij} Z_i Z_j\right] && (3)\\
&= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}\, \mathbb{E}\left[Z_i Z_j\right] && (4)\\
&= \frac{1}{m} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij}\, \pi^2 && (5)\\
&= \frac{1}{n\pi}\, \pi^2 \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} && (6)\\
&= \frac{\pi}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} = \pi \bar{k} && (7)
\end{aligned}
$$
Here, the key step is replacing the expectation of the indicator variables with
the sampling probability. Since π is a sampling probability, and therefore < 1,
the mean degree of the induced subgraph is less than the true mean degree, by
a factor of π.
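This πk̄ bias is easy to check by simulation. The sketch below is illustrative only (the graph, seed, and numbers are my choices, not from the notes): it draws Bernoulli(π) node samples from a small fixed graph and compares the average plug-in mean degree against π times the true mean degree.

```python
import random

# Illustrative "population" graph: a 30-node ring plus chords (i, i+7 mod 30).
# Every node then has degree 4, so the true mean degree k-bar is 4.
n = 30
adj = {i: set() for i in range(n)}
for i in range(n):
    for j in ((i + 1) % n, (i + 7) % n):
        adj[i].add(j)
        adj[j].add(i)

true_mean_degree = sum(len(adj[i]) for i in adj) / n

def induced_mean_degree(pi, rng):
    """Sample each node independently with probability pi; return the
    plug-in mean degree of the induced subgraph (None if no nodes drawn)."""
    v_star = {i for i in adj if rng.random() < pi}
    if not v_star:
        return None
    return sum(len(adj[i] & v_star) for i in v_star) / len(v_star)

rng = random.Random(36720)
pi = 0.5
draws = [induced_mean_degree(pi, rng) for _ in range(20000)]
draws = [d for d in draws if d is not None]
avg = sum(draws) / len(draws)
# avg should come out near pi * true_mean_degree = 2, not near 4
print(round(true_mean_degree, 2), round(avg, 2))
```

(The derivation above quietly treats the sample size m as fixed at nπ; the simulation lets m be random, so the agreement is approximate rather than exact.)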
sometimes called a star design, though I have certainly also heard it called
an “egocentric design”. When we deal with star designs, we collect multiple
local graph neighborhoods, and an important question is whether those overlap;
depending on the recording process, this information might be available (so we
know that Karl appears in the ego networks of both Irene and Joey) or not.
1. Pick a set of source nodes.
2. Pick a set of target nodes.
3. For each source-target combination, find a path from the source to the
target, and record all nodes and edges traversed along the path.
Clearly, a lot will depend on how, precisely, paths are found, but this is an
application-specific issue.
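The three steps above can be sketched concretely. Everything here is an illustrative toy (the graph, and the choice of BFS shortest paths as the routing rule, are my assumptions, since path-finding is application-specific):

```python
from collections import deque

# Toy undirected graph as an adjacency dict (illustrative only).
adj = {
    0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4],
}

def bfs_path(src, dst):
    """Return one shortest path from src to dst, or None if unreachable."""
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    return None

# Steps 1-2: pick sources and targets; step 3: trace and record each route.
sources, targets = [0], [5]
V_star, E_star = set(), set()
for s in sources:
    for t in targets:
        path = bfs_path(s, t)
        if path is None:
            continue  # a "failed" route; here we record nothing from it
        V_star.update(path)
        E_star.update(frozenset(e) for e in zip(path, path[1:]))

print(sorted(V_star), len(E_star))  # the sampled subgraph G* = (V*, E*)
```

Note how much of the graph a single source-target pair misses: only nodes and edges on the one traced route enter the sample.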
While the name “trace-route” comes from a Unix utility which finds paths
across the Internet, the practice is actually older, and not just limited to
computer networks. The most famous instance of trace-route sampling is probably
the Milgram study which led to the folklore that any two people in the US
have at most “six degrees of separation”. In that study, sources in the midwest
were tasked to get an envelope to a target, a stock-broker living outside Boston,
where at each step the envelope had to be passed on to someone known on a
first-name basis. Dodds et al. (2003) was a comparatively recent attempt to
do something similar but with e-mail.
Depending on exactly how route-tracing gets done, one may or may not get
information from “failed” routes, i.e., ones which didn’t succeed in getting from
source to target; that’s long been appreciated. What was not realized until
Achlioptas et al. (2005) is that trace-route sampling systematically distorts the
degree distribution, making all kinds of graphs look like they have heavy-tailed
distributions whether they do or not.
3 Coping strategies
Sampling issues with networks are real and potentially affect all aspects of
inference — just like every other part of statistics. Network data analysis has
therefore developed several strategies for coping with sampling problems.
3.2 Learn sampling theory (design-based inference)
Classical sampling theory is a theory of statistical inference in which probability
assumptions are only made about the sampling process. The true population
is regarded as unknown but fixed, and no stochastic assumptions are made
about how it is generated. (One can always regard this as conditioning on
the unknown population.) Because all the probability assumptions refer to the
sampling design, and the validity of the inference depends only on whether the
design has been accurately modeled, this is sometimes called design-based
inference.
As an example of how this works, consider trying to estimate the mean µ
of some quantity Xi over a finite population of size n, using a sample of units
S. If every unit is sampled with equal probability, the familiar sample mean
$\bar{X} = |S|^{-1} \sum_{i \in S} X_i$ is a good estimate. With unequal
probabilities, however, what should one do?
A simple, classic solution is the Horvitz-Thompson estimator:
$$
\hat{\mu}_{HT} \equiv \frac{1}{n} \sum_{i \in S} \frac{X_i}{\pi_i} \qquad (8)
$$
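Eq. 8 translates directly into code. The function and numbers below are an illustrative sketch of my own, not anything from the notes:

```python
def horvitz_thompson(sample, x, pi_incl, n):
    """Horvitz-Thompson estimate of the population mean (Eq. 8).
    sample: indices of sampled units; x[i]: value of unit i;
    pi_incl[i]: inclusion probability of unit i; n: population size."""
    return sum(x[i] / pi_incl[i] for i in sample) / n

x = {1: 2.0, 2: 4.0, 3: 6.0}         # values of the sampled units (toy numbers)
pi_incl = {1: 0.5, 2: 0.5, 3: 0.25}  # their inclusion probabilities
print(horvitz_thompson([1, 2, 3], x, pi_incl, n=10))  # (4 + 8 + 24)/10 = 3.6
```

Each unit is up-weighted by the inverse of its inclusion probability, so rarely-sampled units stand in for the many similar units the design tends to miss.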
This estimator is unbiased:
$$
\begin{aligned}
\mathbb{E}\left[\hat{\mu}_{HT}\right] &= \mathbb{E}\left[\frac{1}{n} \sum_{i \in S} \frac{X_i}{\pi_i}\right] && (9)\\
&= \mathbb{E}\left[\frac{1}{n} \sum_{i \in 1:n} \frac{X_i}{\pi_i} Z_i\right] && (10)\\
&= \frac{1}{n} \sum_{i \in 1:n} \frac{X_i}{\pi_i}\, \mathbb{E}\left[Z_i\right] && (11)\\
&= \frac{1}{n} \sum_{i \in 1:n} \frac{X_i}{\pi_i}\, P(Z_i = 1) && (12)\\
&= \frac{1}{n} \sum_{i \in 1:n} \frac{X_i}{\pi_i}\, \pi_i && (13)\\
&= \frac{1}{n} \sum_{i \in 1:n} X_i && (14)\\
&= \mu && (15)
\end{aligned}
$$
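The unbiasedness shown in Eqs. 9-15 can also be checked by Monte Carlo, at least under independent (Bernoulli) sampling, where each Zi is Bernoulli(πi). All numbers below are illustrative:

```python
import random

# Toy population values and unequal inclusion probabilities (illustrative).
x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
pi_incl = [0.2, 0.9, 0.5, 0.5, 0.3, 0.7, 0.4, 0.8]
n = len(x)
mu = sum(x) / n  # the true population mean we hope to recover

rng = random.Random(2016)
total = 0.0
reps = 100000
for _ in range(reps):
    # Draw each Z_i independently, then form the HT estimate of Eq. 8.
    est = sum(x[i] / pi_incl[i] for i in range(n)
              if rng.random() < pi_incl[i]) / n
    total += est

# The average estimate over many replications should be close to mu.
print(round(mu, 4), round(total / reps, 4))
```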
One can show (Exercise 2a) that the variance of the estimator is
$$
\mathrm{Var}\left[\hat{\mu}_{HT}\right] = \frac{1}{n^2} \sum_{i \in 1:n} \sum_{j \in 1:n} X_i X_j \left(\frac{\pi_{ij}}{\pi_i \pi_j} - 1\right) \qquad (16)
$$
with πij being the joint inclusion probability, i.e., the probability of including
both i and j in the sample (with πii = πi ). Notice that if all the πi → 1, the
variance goes to 0, as is reasonable (and as is required if the estimator is to
be consistent). We can’t actually calculate this true variance, since we can’t
sum over all the unknown units in the population, but there is a (consistent)
empirical counter-part:
$$
\widehat{\mathrm{Var}}\left[\hat{\mu}_{HT}\right] = \frac{1}{n^2} \sum_{i \in S} \sum_{j \in S} X_i X_j \left(\frac{1}{\pi_i \pi_j} - \frac{1}{\pi_{ij}}\right) \qquad (17)
$$
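Eq. 17 can be computed from the sample alone once the joint inclusion probabilities are known. The sketch below (numbers are illustrative) takes the independent-sampling case, where πij = πi πj for i ≠ j and πii = πi, so only the diagonal terms survive:

```python
def ht_variance_estimate(sample, x, pi_incl, n):
    """Empirical variance estimate of Eq. 17, specialized to independent
    (Bernoulli) sampling: pi_ij = pi_i * pi_j when i != j, pi_ii = pi_i."""
    def pi_joint(i, j):
        return pi_incl[i] if i == j else pi_incl[i] * pi_incl[j]
    total = 0.0
    for i in sample:
        for j in sample:
            total += x[i] * x[j] * (1.0 / (pi_incl[i] * pi_incl[j])
                                    - 1.0 / pi_joint(i, j))
    return total / n ** 2

x = [2.0, 1.0, 4.0]          # population values (toy numbers)
pi_incl = [0.5, 0.5, 0.25]   # inclusion probabilities
# Off-diagonal terms vanish, leaving sum over S of x_i^2 (1 - pi_i) / pi_i^2:
print(ht_variance_estimate([0, 2], x, pi_incl, n=3))  # (8 + 192)/9
```

For designs with dependent inclusions (e.g. fixed sample size), the same double loop applies, but pi_joint must encode the design's actual joint probabilities.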
3.3 Missing data tools
Another approach is to treat the unobserved part of the network as missing
data, and try to infer it. This can range from simple imputation strategies,
to complex model-based strategies for inference, such as the EM algorithm.
Successful imputation or EM is not design-based but model-based, and requires
a model both of the network, and of the sampling process. It is very, very
rare for anything to be “missing at random”, let alone “missing completely at
random”. Perhaps for this reason, comparatively little has been done in this
direction (Handcock and Gile, 2010); more should be.
Newman” refer to the same researcher (or, rather, when they do); in citation
networks, papers may be cited differently; the same person may have multiple
phone numbers; etc., etc. Network structure can in fact be an important clue in
entity resolution (Bhattacharya and Getoor, 2007), but that gets a bit circular...
Diffusion refers to the way that many of the automatically-recorded networks
which provide us with our big data have themselves spread over other, older
social networks. What we see when we look at the network of (say) Facebook
ties is a combination of the pre-Facebook social network and the results
of the diffusion process. Comparatively little has been done to understand the
results. One of the best studies is that of Schoenebeck (2013), who showed how
even if the diffusion process treats all nodes homogeneously, the network-as-
diffused can differ radically in its properties from the underlying network. If
you say “I only care about Facebook, not about the social network”, this may
not matter, but even then it can change your understanding of why Facebook
looks the way it does.
The third issue is performativity, the way theories can become (partially)
self-fulfilling prophecies. The companies which run online social networks are
all very invested in getting very big, very dense networks of users. This is why
they all offer link suggestion or link recommendation services. The algorithms
behind these recommendations implement theories about how social networks
form, and what sort of link patterns they should have. To the extent that people
follow these recommendations, then, the recorded network will seem to conform
to the theory. The only worthwhile study of this I know of is Healy (2015).
5 Exercises
To think through, not to hand in.
they can charge a user, either monetarily or in hassle, time spent watching ads, etc., is just
less than the “switching cost” to the user of changing over to a different network. Given a
choice between two equally-functional websites, users will typically prefer ones with more of
their friends / contacts / peers, so switching costs will increase as the network gets larger and
denser. (Said slightly differently, large groups of users would have to coordinate switching to
a different service, and coordination itself creates switching costs.) The economics of lock-in,
switching costs and network externalities were all laid out very lucidly long ago by Shapiro
and Varian (1998) — from the point of view of the companies running the networks, not of
the users who are, as the saying goes, the product being sold. (Note that Varian is now the
chief economist at Google.)
(a) Derive the variance of µ̂HT . Hint: Repeat the trick with the Zi
indicators used to show µ̂HT is unbiased.
(b) To use Eq. 8, we need to know n, the total population size. Suppose
we replace n by $\sum_{i \in S} 1/\pi_i$. Will this still be an unbiased estimate?
Will the variance be larger or smaller than that of Eq. 8?
References
Achlioptas, Dimitris, Aaron Clauset, David Kempe and Cristopher Moore
(2005). “On the Bias of Traceroute Sampling (or: Why almost every network
looks like it has a power law).” In Proceedings of the 37th ACM Symposium
on Theory of Computing. URL http://arxiv.org/abs/cond-mat/0503087.
Bhattacharya, Indrajit and Lise Getoor (2007). “Collective Entity Resolution In
Relational Data.” ACM Transactions on Knowledge Discovery from Data,
1(1): 5. doi:10.1145/1217299.1217304.
Borgatti, Stephen P., Kathleen M. Carley and David Krackhardt (2006). “On
the robustness of centrality measures under conditions of imperfect data.”
Social Networks, 28: 124–136. doi:10.1016/j.socnet.2005.05.001.
Dodds, Peter Sheridan, Roby Muhamad and Duncan J. Watts (2003). “An
Experimental Study of Search in Global Social Networks.” Science, 301:
827–829. doi:10.1126/science.1081058.
Goffman, Erving (1959). The Presentation of Self in Everyday Life. New York:
Anchor Books.
Kleinfeld, Judith (2002). “Could It Be a Big World After All? What the
Milgram Papers in the Yale Archive Reveal About the Original Small World
Study.” Society, 39: 61–66. URL http://www.uaf.edu/northern/big_
world.html.
Kolaczyk, Eric D. (2009). Statistical Analysis of Network Data. New York:
Springer-Verlag.
Kossinets, Gueorgi (2006). “Effects of Missing Data in Social Networks.” Social
Networks, 28: 247–268. URL http://arxiv.org/abs/cond-mat/0306335.
doi:10.1016/j.socnet.2005.07.002.
Lee, Sang Hoon, Pan-Jun Kim and Hawoong Jeong (2006). “Statistical Prop-
erties of Sampled Networks.” Physical Review E , 73: 016102. URL http:
//arxiv.org/abs/cond-mat/0505232. doi:10.1103/PhysRevE.73.016102.
Schoenebeck, Grant (2013). “Potential Networks, Contagious Communities,
and Understanding Social Network Structure.” In Proceedings of the 22nd
International World Wide Web Conference [WWW 2013] (Daniel Schwabe
and Virgilio Almeida and Hartmut Glaser and Ricardo Baeza-Yates and Sue
Moon, eds.), pp. 1123–1132. Geneva, Switzerland: International World Wide
Web Conferences Steering Committee. URL http://arxiv.org/abs/1304.
1845.
Shapiro, Carl and Hal R. Varian (1998). Information Rules: A Strategic Guide
to the Network Economy. Boston: Harvard Business School Press, 1st edn.
Stumpf, Michael P. H. and Carsten Wiuf (2005). “Sampling Properties of Ran-
dom Graphs: The Degree Distribution.” Physical Review E , 72: 036117.
URL http://arxiv.org/abs/cond-mat/0507345.
Stumpf, Michael P. H., Carsten Wiuf and Robert M. May (2005). “Subnets
of Scale-free Networks are not Scale-free: Sampling Properties of Networks.”
Proceedings of the National Academy of Sciences (USA), 102: 4221–4224.
doi:10.1073/pnas.0501179102.
Virkar, Yogesh and Aaron Clauset (2014). “Power-law distributions in binned
empirical data.” Annals of Applied Statistics, 8: 89–119. URL http://
arxiv.org/abs/1208.3524. doi:10.1214/13-AOAS710.