CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
2/7/2011
Two types of directed graphs:
Strongly connected: any node can reach any other node via a directed path
DAG (directed acyclic graph): has no cycles
Any directed graph can be expressed in terms of these two types of graphs
A strongly connected component (SCC) is a set of nodes S such that:
Every pair of nodes in S can reach each other
There is no larger set containing S with this property
Any directed graph is a DAG on its SCCs
Take a large snapshot of the Web and try to understand how its SCCs fit together as a DAG
Computational issues:
Say we want to find the SCC containing a specific node v
Observation:
Out(v) = nodes reachable from v (via out-edges)
In(v) = nodes that can reach v (via in-edges)
SCC containing v = Out(v, G) ∩ In(v, G) = Out(v, G) ∩ Out(v, G~)
where G~ is G with the directions of all edges flipped
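A minimal sketch of this observation in Python (the graph encoding and function names are mine, not from the lecture): compute Out(v) in G by BFS, compute Out(v) in the flipped graph G~, and intersect.

    from collections import deque

    def out_set(graph, v):
        # Out(v): every node reachable from v via out-edges (BFS)
        seen, queue = {v}, deque([v])
        while queue:
            u = queue.popleft()
            for w in graph.get(u, []):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return seen

    def flipped(graph):
        # G~: same nodes, every edge direction reversed
        rev = {u: [] for u in graph}
        for u, nbrs in graph.items():
            for w in nbrs:
                rev.setdefault(w, []).append(u)
        return rev

    def scc_containing(graph, v):
        # SCC(v) = Out(v, G) ∩ Out(v, G~)
        return out_set(graph, v) & out_set(flipped(graph), v)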
[Figure: degree distribution of the web graph — normalized count p_k vs. degree k, following a power law p_k ∝ k^(-α)]
[Figure: a random network vs. a power-law network]
Since there is large diversity in the connectivity of the web graph, we can rank pages by the link structure
We will cover the following link analysis approaches to computing the importance of nodes in a graph:
Page Rank
Hubs and Authorities (HITS)
Topic-Specific (Personalized) Page Rank
Spam Detection Algorithms
First try:
Page is more important if it has more links
Each link's vote is proportional to the importance of its source page
If page P with importance x has n out-links, each link gets x/n votes
Page P's own importance is the sum of the votes on its in-links
[Figure: three-page web graph — Yahoo (y), Amazon (a), Msoft (m) — each page splitting its importance equally over its out-links: edge weights y/2, a/2, m]

Flow equations:
y = y/2 + a/2
a = y/2 + m
m = a/2
The Gaussian elimination method works for small examples, but we need a better method for large, web-scale graphs
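For the three-page example above, the system (with the redundant third equation replaced by the normalization y + a + m = 1) can be solved exactly; a small sketch assuming numpy:

    import numpy as np

    # y - y/2 - a/2 = 0;  a - y/2 - m = 0;  y + a + m = 1
    A = np.array([[ 0.5, -0.5,  0.0],
                  [-0.5,  1.0, -1.0],
                  [ 1.0,  1.0,  1.0]])
    b = np.array([0.0, 0.0, 1.0])
    y, a, m = np.linalg.solve(A, b)   # Gaussian elimination under the hood
    print(y, a, m)                    # -> 0.4 0.4 0.2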
Matrix M has one row and one column for each web page
Suppose page j has n out-links: if j → i, then M_ij = 1/n, else M_ij = 0
M is a column-stochastic matrix: its columns sum to 1
Suppose r is a vector with one entry per web page:
r_i is the importance score of page i
Call it the rank vector; |r| = 1
The flow equations can be written r = Mr
So the rank vector r is an eigenvector of the stochastic web matrix M
In fact, it is M's first or principal eigenvector, with corresponding eigenvalue 1
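As a quick sanity check (mine, not from the slides), numpy confirms that M from the example has eigenvalue 1 with eigenvector proportional to the rank vector:

    import numpy as np

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    vals, vecs = np.linalg.eig(M)
    i = np.argmax(vals.real)      # principal eigenvalue (should be 1)
    r = vecs[:, i].real
    r /= r.sum()                  # rescale so that |r| = 1
    print(vals[i].real, r)        # -> 1.0 [0.4 0.4 0.2]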
[Figure: Yahoo (Y!), Amazon (A), Msoft (MS) web graph]

In matrix form, r = Mr:

        Y!   A   MS
Y!  [ 1/2  1/2   0 ]   [y]   [y]
A   [ 1/2   0    1 ] · [a] = [a]
MS  [  0   1/2   0 ]   [m]   [m]

which is exactly the flow equations:
y = y/2 + a/2
a = y/2 + m
m = a/2
Simple iterative scheme
Suppose there are N web pages
Initialize: r(0) = [1/N, …, 1/N]^T
Iterate: r(k+1) = M·r(k)
Stop when |r(k+1) − r(k)|_1 < ε
|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) also works
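A direct sketch of this scheme (numpy assumed; function and variable names are mine):

    import numpy as np

    def pagerank_power(M, eps=1e-8):
        N = M.shape[0]
        r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]^T
        while True:
            r_next = M @ r                       # r(k+1) = M r(k)
            if np.abs(r_next - r).sum() < eps:   # L1 stopping rule
                return r_next
            r = r_next

    M = np.array([[0.5, 0.5, 0.0],               # column-stochastic example
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    print(pagerank_power(M))                     # -> approx [0.4, 0.4, 0.2]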
Power iteration:
Set r_i = 1/N
r'_i = Σ_j M_ij·r_j, and iterate

        Y!   A   MS
Y!  [ 1/2  1/2   0 ]
A   [ 1/2   0    1 ]
MS  [  0   1/2   0 ]

Example:
[y]   [1/3]   [1/3]   [5/12]   [3/8  ]         [2/5]
[a] = [1/3] → [1/2] → [1/3 ] → [11/24] → ... → [2/5]
[m]   [1/3]   [1/6]   [1/4 ]   [1/6  ]         [1/5]
Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
p(t) is a probability distribution on pages
Where is the surfer at time t+1?
Follows a link uniformly at random: p(t+1) = M·p(t)
Suppose the random walk reaches a state such that p(t+1) = M·p(t) = p(t)
Then p(t) is called a stationary distribution for the random walk
Our rank vector r satisfies r = Mr
So it is a stationary distribution for the random surfer
A central result from the theory of random walks (aka Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.
Power iteration:
Set r_i = 1
r'_i = Σ_j M_ij·r_j, and iterate

        Y!   A   MS
Y!  [ 1/2  1/2   0 ]
A   [ 1/2   0    0 ]
MS  [  0   1/2   1 ]

(Msoft now links only to itself: a spider trap)

Example:
[y]   [1]   [1  ]   [3/4]   [5/8]         [0]
[a] = [1] → [1/2] → [1/2] → [3/8] → ... → [0]
[m]   [1]   [3/2]   [7/4]   [2  ]         [3]
The Google solution for spider traps: at each time step, the random surfer has two options:
With probability β, follow a link at random
With probability 1−β, jump to some page uniformly at random
Common values for β are in the range 0.8 to 0.9
The surfer will teleport out of the spider trap within a few time steps
[Figure: Yahoo/Amazon/Msoft graph with teleports — each link weight 1/2 multiplied by 0.8, plus a 0.2·(1/3) teleport edge between every pair of pages]

          [1/2 1/2  0]         [1/3 1/3 1/3]
A = 0.8 · [1/2  0   0] + 0.2 · [1/3 1/3 1/3]
          [ 0  1/2  1]         [1/3 1/3 1/3]

    [7/15 7/15  1/15]
  = [7/15 1/15  1/15]
    [1/15 7/15 13/15]

Example (starting from r = [1, 1, 1]):
[y]   [1]   [1.00]   [0.84]   [0.776]         [ 7/11]
[a] = [1] → [0.60] → [0.60] → [0.536] → ... → [ 5/11]
[m]   [1]   [1.40]   [1.56]   [1.688]         [21/11]
        Y!   A   MS
Y!  [ 1/2  1/2   0 ]
A   [ 1/2   0    0 ]
MS  [  0   1/2   0 ]

(Msoft now has no out-links: a dead end, so importance leaks out)

Example:
[y]   [1]         [0]
[a] = [1] → ... → [0]
[m]   [1]         [0]
Teleports: the solution for dead ends
Follow random teleport links with probability 1.0 from dead ends
Adjust the matrix accordingly (each dead end's column becomes uniform over all pages)
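A sketch of that adjustment (numpy assumed; the helper name is mine): any column of M with no out-links becomes a uniform teleport column.

    import numpy as np

    def fix_dead_ends(M):
        M = M.copy()
        N = M.shape[0]
        dead = M.sum(axis=0) == 0    # dead ends: all-zero columns
        M[:, dead] = 1.0 / N         # teleport with probability 1.0
        return M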
Construct the N x N matrix A as follows:
A_ij = β·M_ij + (1−β)/N
Verify that A is a stochastic matrix
The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar
Equivalently, r is the stationary distribution of the random walk with teleports
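A small sketch (numpy assumed) building A for the spider-trap example with β = 0.8, checking stochasticity, and iterating r = A·r:

    import numpy as np

    beta = 0.8
    M = np.array([[0.5, 0.5, 0.0],        # spider-trap example from above
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0]])
    N = M.shape[0]
    A = beta * M + (1 - beta) / N         # A_ij = beta*M_ij + (1-beta)/N
    assert np.allclose(A.sum(axis=0), 1)  # columns sum to 1: stochastic

    r = np.full(N, 1.0 / N)
    for _ in range(100):
        r = A @ r                         # power iteration
    print(r)                              # -> approx [7/33, 5/33, 21/33]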
This is easy if we have enough main memory to hold A, r_old, r_new
Say N = 1 billion pages and we need 4 bytes for each entry (say):
2 billion entries for the two vectors, approx 8 GB
But matrix A has N² = 10^18 entries, a very large number!
r = Ar, where A_ij = β·M_ij + (1−β)/N
r_i = Σ_{1≤j≤N} A_ij·r_j
r_i = Σ_{1≤j≤N} [β·M_ij + (1−β)/N]·r_j
    = β·Σ_{1≤j≤N} M_ij·r_j + (1−β)/N · Σ_{1≤j≤N} r_j
    = β·Σ_{1≤j≤N} M_ij·r_j + (1−β)/N, since |r| = 1
So: r = β·M·r + [(1−β)/N]_N
where [x]_N is an N-vector with all entries x
M is a sparse matrix!
With approx 10 links per node, it has approx 10N nonzero entries
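So one update sweep can be computed straight from the out-link lists, never materializing the dense A; a sketch in plain Python (names mine; assumes dead ends have already been handled):

    # r_new = beta * M @ r_old + (1 - beta)/N, using sparse out-link lists
    def pagerank_sweep(links, r_old, beta=0.8):
        N = len(r_old)
        r_new = [(1 - beta) / N] * N              # the [(1-beta)/N]_N term
        for j, dests in links.items():            # links: source -> destinations
            share = beta * r_old[j] / len(dests)  # M_ij = 1/deg(j) for j -> i
            for i in dests:
                r_new[i] += share
        return r_new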
[Table: sparse encoding of M — for each source node (0, 1, 2, …), store its out-degree and the list of destination nodes; one update sweep reads r_old and the edge lists sequentially and accumulates into r_new]
Questions:
What if we had enough memory to fit both r_new and r_old?
What if we could not even fit r_new in memory?
See reading: http://i.stanford.edu/~ullman/mmds/ch5.pdf
Interesting pages fall into two classes:
1. Authorities are pages containing useful information
Newspaper home pages
Course home pages
Home pages of auto manufacturers
2. Hubs are pages that link to authorities
A good hub links to many good authorities A good authority is linked from many good hubs Model using two scores for each node:
Hub score and Authority score Represented as vectors h and a
[Figure: Yahoo/Amazon/Msoft link graph]

       y  a  m
    y [ 1  1  1 ]
A = a [ 1  0  1 ]
    m [ 0  1  0 ]
Notation:
Vectors a = (a_1, …, a_n), h = (h_1, …, h_n)
Adjacency matrix A (n x n): A_ij = 1 if i→j, else A_ij = 0
Then:
h_i = Σ_{j: i→j} a_j, i.e., h_i = Σ_j A_ij·a_j
So: h = A·a
Likewise: a = A^T·h
The hub score of page i is proportional to the sum of the authority scores of the pages it links to: h = λ·A·a
The constant λ is a scaling factor: λ = 1/Σ_i h_i
The authority score of page i is proportional to the sum of the hub scores of the pages that link to it: a = μ·A^T·h
The constant μ is a scaling factor: μ = 1/Σ_i a_i
      [ 1 1 0 ]
A^T = [ 1 0 1 ]
      [ 1 1 0 ]

[Figure: Yahoo/Amazon/Msoft link graph]

Iterating from h = a = (1, 1, 1), normalizing each step, the authority vector evolves:
a: (1, 1, 1) → (1, 1, 1) → (1, 4/5, 1) → ...
Algorithm:
Set: a = h = 1_n (the all-ones vector)
Repeat:
h = A·a (new h)
a = A^T·h (new a)
Normalize
Then: a = A^T·(A·a)
So a is being updated (in 2 steps): A^T·(A·a) = (A^T·A)·a
Likewise h is updated (in 2 steps): A·(A^T·h) = (A·A^T)·h
Repeated matrix powering
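A compact sketch of this loop on the example adjacency matrix above (numpy assumed; normalizing by the largest entry is one common convention):

    import numpy as np

    A = np.array([[1., 1., 1.],    # Yahoo/Amazon/Msoft example
                  [1., 0., 1.],
                  [0., 1., 0.]])
    h = np.ones(3)
    a = np.ones(3)
    for _ in range(100):
        h = A @ a;  h /= h.max()   # new h = lambda * A a
        a = A.T @ h; a /= a.max()  # new a = mu * A^T h
    print(h, a)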
Under reasonable assumptions about A, the HITS iterative algorithm converges to vectors h* and a*:
h* is the principal eigenvector of the matrix A·A^T
a* is the principal eigenvector of the matrix A^T·A
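A quick numerical check of this claim (a sketch, numpy assumed):

    import numpy as np

    A = np.array([[1., 1., 1.],
                  [1., 0., 1.],
                  [0., 1., 0.]])

    def principal(S):
        vals, vecs = np.linalg.eigh(S)          # S is symmetric
        v = np.abs(vecs[:, np.argmax(vals)])    # principal eigenvector
        return v / v.max()                      # scale largest entry to 1

    print(principal(A @ A.T))   # matches the converged h*
    print(principal(A.T @ A))   # matches the converged a*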