Walking in Facebook:
A Case Study of Unbiased
Sampling of OSNs
Minas Gjoka, Maciej Kurant,
Carter Butts, Athina Markopoulou
UC Irvine, EPFL
Minas Gjoka, UC Irvine
Walking in Facebook
Outline
Motivation and Problem Statement
Sampling Methodology
Data Analysis
Conclusion
Online Social Networks (OSNs)
A network of declared friendships
between users
Allows users to maintain relationships
Many popular OSNs with different focus
e.g., Facebook, LinkedIn, Flickr
[Figure: social graph]
Why Sample OSNs?
Representative samples desirable
study properties
test algorithms
Obtaining complete dataset difficult
companies usually unwilling to share data
tremendous overhead to measure all (~100TB for
Facebook)
Problem statement
Obtain a representative sample of users in a
given OSN by exploration of the social graph.
in this work we sample Facebook (FB)
explore graph using various crawling techniques
Related Work
Graph traversal (BFS)
A. Mislove et al, IMC 2007
Y. Ahn et al, WWW 2007
C. Wilson, Eurosys 2009
Random walks (MHRW, RDS)
M. Henzinger et al, WWW 2000
D. Stutzbach et al, IMC 2006
A. Rasti et al, Mini Infocom 2009
Our Contributions
Compare various crawling techniques in FB's social graph
Breadth-First-Search (BFS)
Random Walk (RW)
Metropolis-Hastings Random Walk (MHRW)
Practical recommendations
online convergence diagnostic tests
proper use of multiple parallel chains
methods that perform better and tradeoffs
Uniform sample of Facebook users
collection and analysis
made available to researchers
Outline
Motivation and Problem Statement
Sampling Methodology
crawling methods
data collection
convergence evaluation
method comparisons
Data Analysis
Conclusion
(1) Breadth-First-Search (BFS)
Starting from a seed, explores all
neighbor nodes. Process continues
iteratively without replacement.
BFS leads to bias towards high
degree nodes
Lee et al, Statistical properties of Sampled
Networks, Phys Review E, 2006
Early measurement studies of
OSNs use BFS as primary sampling
technique
e.g., [Mislove et al.], [Ahn et al.], [Wilson et al.]
(2) Random Walk (RW)
Explores graph one node at
a time with replacement
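$P^{RW}_{v,w} = \frac{1}{k_v}$ for each neighbor $w$ of $v$, where $k_v$ is the degree of node $v$
In the stationary distribution
$\pi_v = \frac{k_v}{2|E|}$, where $|E|$ is the number of edges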
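As an illustration of the degree bias in the stationary distribution, here is a minimal Python sketch of a simple random walk. The toy graph, the function name `random_walk`, and the step count are assumptions made for this example and are not taken from the Facebook crawl.

```python
import random
from collections import Counter

def random_walk(graph, seed, steps, rng):
    """Simple random walk: move to a uniformly chosen neighbor at every step."""
    node = seed
    visits = Counter()
    for _ in range(steps):
        node = rng.choice(graph[node])  # each neighbor has probability 1/k_v
        visits[node] += 1
    return visits

# Hypothetical toy graph as adjacency lists (connected, contains a triangle).
graph = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B"], "D": ["A"]}
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
steps = 200_000
visits = random_walk(graph, "A", steps, random.Random(0))

for node in sorted(graph):
    empirical = visits[node] / steps
    expected = len(graph[node]) / (2 * num_edges)  # k_v / (2|E|)
    print(node, round(empirical, 3), round(expected, 3))
```

The visit frequencies should approach $k_v / (2|E|)$, so the degree-3 node is visited about three times as often as the degree-1 node.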
(3) Re-Weighted Random Walk (RWRW)
Hansen-Hurwitz estimator
Corrects for degree bias at the end of collection
Without re-weighting, the probability distribution for
node property A is:
$p(A_i) = \frac{|A_i|}{|V|} = \frac{\sum_{u \in A_i} 1}{\sum_{u \in V} 1}$
where $A_i$ is the subset of sampled nodes with value $i$ and $V$ is the set of all sampled nodes
Re-weighted probability distribution:
$p(A_i) = \frac{\sum_{u \in A_i} 1/k_u}{\sum_{u \in V} 1/k_u}$
where $k_u$ is the degree of node $u$
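The re-weighting step can be sketched in a few lines of Python. This is a minimal illustration under assumed inputs: the function name `reweighted_distribution`, the `(degree, value)` sample format, and the toy sample are hypothetical.

```python
from collections import defaultdict

def reweighted_distribution(samples):
    """Estimate p(A_i) from random-walk samples, weighting each sample by 1/k_u.

    samples: list of (degree, value) pairs, one per sampled node (repeats allowed).
    """
    weights = defaultdict(float)
    total = 0.0
    for degree, value in samples:
        weights[value] += 1.0 / degree  # numerator: sum over u in A_i of 1/k_u
        total += 1.0 / degree           # denominator: sum over all sampled u of 1/k_u
    return {value: weight / total for value, weight in weights.items()}

# Hypothetical RW sample: (node degree k_u, value of property A).
samples = [(4, "public"), (4, "public"), (2, "private"), (1, "private")]
estimate = reweighted_distribution(samples)
print(estimate)
```

In this toy sample the unweighted fractions are 0.5 and 0.5, while the re-weighted estimate is 0.25 for "public" and 0.75 for "private", because the high-degree samples are down-weighted.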
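(4) Metropolis-Hastings Random Walk (MHRW)
Explore graph one node at
a time with replacement
$P^{MH}_{v,w} = \begin{cases} \frac{1}{k_v} \min\left(1, \frac{k_v}{k_w}\right) & \text{if } w \text{ is a neighbor of } v \\ 1 - \sum_{y \neq v} P^{MH}_{v,y} & \text{if } w = v \end{cases}$
In the stationary distribution
$\pi_v = \frac{1}{|V|}$
[Figure: worked example of the transition probabilities $P^{MH}_{AA}$ and $P^{MH}_{AC}$ on a small graph]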
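The MHRW transition rule can be sketched as a propose-and-accept loop. This is a minimal Python illustration; the toy graph, the function name `mhrw`, and the step count are assumptions for the example, not the crawler used in the study.

```python
import random
from collections import Counter

def mhrw(graph, seed, steps, rng):
    """Metropolis-Hastings random walk targeting the uniform distribution over nodes."""
    node = seed
    visits = Counter()
    for _ in range(steps):
        candidate = rng.choice(graph[node])  # propose a neighbor with probability 1/k_v
        # Accept with probability min(1, k_v / k_w); otherwise stay at the current node.
        if rng.random() < len(graph[node]) / len(graph[candidate]):
            node = candidate
        visits[node] += 1  # a rejected move counts as another sample of the same node
    return visits

# Hypothetical toy graph as adjacency lists.
graph = {"A": ["B", "C", "D"], "B": ["A", "C"], "C": ["A", "B"], "D": ["A"]}
steps = 200_000
visits = mhrw(graph, "A", steps, random.Random(0))

for node in sorted(graph):
    print(node, round(visits[node] / steps, 3))
```

Each node should be visited with frequency close to $1/|V|$ regardless of its degree. Note that repeated samples from rejected moves must be kept for the sample to be uniform.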
Uniform userID Sampling (UNI)
As a basis for comparison, we collect a
uniform sample of Facebook userIDs (UNI)
rejection sampling on the 32-bit userID space
UNI not a general solution for sampling OSNs
userID space must not be sparse
some OSNs use names instead of numbers as userIDs
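Rejection sampling over an ID space can be sketched as follows. This is a minimal Python illustration: the `user_exists` callback stands in for a profile lookup and is hypothetical, and the example uses a small 16-bit ID space so that it runs quickly.

```python
import random

def uniform_user_sample(user_exists, n, rng, id_bits=32):
    """Draw IDs uniformly from the ID space and keep only those that are allocated."""
    sample = []
    while len(sample) < n:
        candidate = rng.getrandbits(id_bits)  # uniform over [0, 2**id_bits)
        if user_exists(candidate):            # reject IDs with no user behind them
            sample.append(candidate)
    return sample

# Hypothetical stand-in for the OSN: 1 in 4 IDs of a 16-bit space is allocated.
valid_ids = set(range(0, 2**16, 4))
sample = uniform_user_sample(valid_ids.__contains__, 1000, random.Random(0), id_bits=16)
print(len(sample), len(set(sample)))
```

The expected number of lookups per accepted user is the inverse of the fraction of allocated IDs, which is why a sparse userID space makes this approach impractical.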
Summary of Datasets
Sampling method    MHRW      RW        BFS       UNI
# Valid Users      28x81K    28x81K    28x81K    984K
# Unique Users     957K      2.19M     2.20M     984K
Egonets for a subsample of MHRW
- local properties of nodes
Datasets available at:
http://odysseas.calit2.uci.edu/research/osn.html
Data Collection
Basic Node Information
What information do we collect for each sampled
node u?
Detecting Convergence
Number of samples (iterations) needed to lose
dependence on the starting points?
Online Convergence Diagnostics
Geweke
Detects convergence for a single walk. Let X be a
sequence of samples for metric of interest.
[Figure: a single walk with an early segment $X_a$ and a late segment $X_b$]
$z = \frac{E(X_a) - E(X_b)}{\sqrt{Var(X_a) + Var(X_b)}}$
J. Geweke, Evaluating the accuracy of sampling based approaches to calculate posterior
moments in Bayesian Statistics 4, 1992
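A minimal Python sketch of the z-score follows. Several choices here are assumptions and not taken from the slides: the segments are the first 10% and last 50% of the sequence (the fractions commonly used with Geweke's diagnostic), and each Var term is taken as the variance of the segment mean under the simplifying assumption of uncorrelated samples. Geweke's original diagnostic estimates these variances from the spectral density instead.

```python
import statistics

def geweke_z(xs, first=0.1, last=0.5):
    """z-score comparing the mean of an early segment with the mean of a late segment."""
    n = len(xs)
    xa = xs[: int(first * n)]      # early segment X_a
    xb = xs[n - int(last * n):]    # late segment X_b
    # Assumption: variance of each segment mean, treating samples as uncorrelated.
    var_a = statistics.variance(xa) / len(xa)
    var_b = statistics.variance(xb) / len(xb)
    return (statistics.mean(xa) - statistics.mean(xb)) / (var_a + var_b) ** 0.5

# Hypothetical metric sequences: one with a stable mean, one that keeps drifting.
stable = [i % 2 for i in range(1000)]
drifting = list(range(1000))
z_stable = geweke_z(stable)
z_drifting = geweke_z(drifting)
print(z_stable, z_drifting)
```

A z-score near zero suggests the early and late parts of the walk agree on the metric, while a large absolute value indicates the walk still depends on its starting point.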
Online Convergence Diagnostics
Gelman-Rubin
Detects convergence for m>1 walks
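[Figure: three parallel walks (Walk 1, Walk 2, Walk 3)]
$R = \frac{n-1}{n} + \frac{m+1}{mn} \cdot \frac{B}{W}$
where $B$ is the between-walks variance and $W$ is the within-walks variance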
A. Gelman, D. Rubin, Inference from iterative simulation using multiple sequences in
Statistical Science Volume 7, 1992
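The R statistic can be sketched in Python as below. The definitions of $B$ and $W$ used here follow the standard Gelman-Rubin formulation ($n$ samples per walk, $B$ equal to $n$ times the variance of the walk means, $W$ equal to the average within-walk variance); they are an assumption, since the slides do not spell them out. The function name and toy walks are hypothetical.

```python
import statistics

def gelman_rubin(walks):
    """R statistic for m > 1 walks of equal length n, each a list of metric values."""
    m = len(walks)
    n = len(walks[0])
    means = [statistics.mean(walk) for walk in walks]
    # B: between-walks variance, scaled by n (standard Gelman-Rubin definition).
    b = n * statistics.variance(means)
    # W: average of the within-walk sample variances.
    w = statistics.mean([statistics.variance(walk) for walk in walks])
    return (n - 1) / n + (m + 1) / (m * n) * b / w

# Hypothetical walks: three that agree, and three stuck in different regions.
agreeing = [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
disagreeing = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
r_agreeing = gelman_rubin(agreeing)
r_disagreeing = gelman_rubin(disagreeing)
print(r_agreeing, r_disagreeing)
```

When the walks have converged to the same distribution, $B/W$ is small and R stays close to 1 for large $n$; walks that have not mixed produce a large R.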
When do we reach equilibrium?
Node Degree
Burn-in determined to be 3K
Methods Comparison
Node Degree (28 crawls)
Poor performance for BFS, RW
MHRW, RWRW produce good estimates
per chain
overall
Sampling Bias
BFS
Low degree nodes under-represented by two orders of magnitude
BFS is biased
Sampling Bias
MHRW, RW, RWRW
Degree distribution identical to UNI (MHRW,RWRW)
RW as biased as BFS but with smaller variance in each walk
Practical Recommendations
for Sampling Methods
Use MHRW or RWRW. Do not use BFS, RW.
Use formal convergence diagnostics
assess convergence online
use multiple parallel walks
MHRW vs RWRW
RWRW slightly better performance
MHRW provides a ready-to-use sample
Outline
Motivation and Problem Statement
Sampling Methodology
Data Analysis
Conclusion
FB Social Graph
Degree Distribution
[Figure: degree distribution with fitted exponents $a_1 = 1.32$ and $a_2 = 3.38$]
Degree distribution not a power law
FB Social Graph
Topological Characteristics
Our MHRW sample
Assortativity Coefficient = 0.233
range of Clustering Coefficient [0.05, 0.35]
[Wilson et al, Eurosys 09]
Assortativity Coefficient = 0.17
range of Clustering Coefficient [0.05, 0.18]
More details in our paper and technical report
Conclusion
Compared graph crawling methods
MHRW, RWRW performed remarkably well
BFS, RW lead to substantial bias
Practical recommendations
correct for bias
usage of online convergence diagnostics
proper use of multiple chains
Datasets publicly available
http://odysseas.calit2.uci.edu/research/osn.html
Thank you
Questions?