CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
2/7/2011
Two types of directed graphs:
Strongly connected: any node can reach any other node via a directed path
DAG (directed acyclic graph): has no cycles
Any directed graph can be expressed in terms of these two types of graphs
A strongly connected component (SCC) is a set of nodes S such that:
Every pair of nodes in S can reach each other
There is no larger set containing S with this property
Any directed graph is a DAG on its SCCs
Take a large snapshot of the Web and try to understand how its SCCs fit together as a DAG
Computational issues:
Say we want to find the SCC containing a specific node v
Observation:
Out(v) = nodes reachable from v (via out-edges)
In(v) = nodes that can reach v (via in-edges)
SCC containing v = Out(v, G) ∩ In(v, G) = Out(v, G) ∩ Out(v, G~)
where G~ is G with the directions of all edges flipped
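A minimal sketch of this observation in Python (the graph encoding and function names are mine, not from the lecture): compute Out(v) in G by BFS, compute Out(v) in the flipped graph G~, and intersect.

    from collections import deque

    def out_set(graph, v):
        # Out(v): every node reachable from v via out-edges (BFS)
        seen, queue = {v}, deque([v])
        while queue:
            u = queue.popleft()
            for w in graph.get(u, []):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return seen

    def flipped(graph):
        # G~: same nodes, every edge direction reversed
        rev = {u: [] for u in graph}
        for u, nbrs in graph.items():
            for w in nbrs:
                rev.setdefault(w, []).append(u)
        return rev

    def scc_containing(graph, v):
        # SCC(v) = Out(v, G) ∩ Out(v, G~)
        return out_set(graph, v) & out_set(flipped(graph), v)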
[Figure: degree distribution of the web graph — normalized count p_k vs. degree k, following a power law p_k ∝ k^(-α)]
[Figure: a random network vs. a power-law network]
Since there is large diversity in the connectivity of the web graph, we can rank pages by the link structure
We will cover the following link analysis approaches to computing the importance of nodes in a graph:
Page Rank
Hubs and Authorities (HITS)
Topic-Specific (Personalized) Page Rank
Spam Detection Algorithms
First try:
Page is more important if it has more links
Each link's vote is proportional to the importance of its source page
If page P with importance x has n out-links, each link gets x/n votes
Page P's own importance is the sum of the votes on its in-links
[Figure: three-page web graph — Yahoo (y), Amazon (a), Msoft (m) — each page splitting its importance equally over its out-links: edge weights y/2, a/2, m]

Flow equations:
y = y/2 + a/2
a = y/2 + m
m = a/2
The Gaussian elimination method works for small examples, but we need a better method for large, web-scale graphs
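For the three-page example above, the system (with the redundant third equation replaced by the normalization y + a + m = 1) can be solved exactly; a small sketch assuming numpy:

    import numpy as np

    # y - y/2 - a/2 = 0;  a - y/2 - m = 0;  y + a + m = 1
    A = np.array([[ 0.5, -0.5,  0.0],
                  [-0.5,  1.0, -1.0],
                  [ 1.0,  1.0,  1.0]])
    b = np.array([0.0, 0.0, 1.0])
    y, a, m = np.linalg.solve(A, b)   # Gaussian elimination under the hood
    print(y, a, m)                    # -> 0.4 0.4 0.2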
Matrix M has one row and one column for each web page
Suppose page j has n out-links: if j → i, then M_ij = 1/n, else M_ij = 0
M is a column-stochastic matrix: its columns sum to 1
Suppose r is a vector with one entry per web page:
r_i is the importance score of page i
Call it the rank vector; |r| = 1
The flow equations can be written r = Mr
So the rank vector r is an eigenvector of the stochastic web matrix M
In fact, it is M's first or principal eigenvector, with corresponding eigenvalue 1
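As a quick sanity check (mine, not from the slides), numpy confirms that M from the example has eigenvalue 1 with eigenvector proportional to the rank vector:

    import numpy as np

    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    vals, vecs = np.linalg.eig(M)
    i = np.argmax(vals.real)      # principal eigenvalue (should be 1)
    r = vecs[:, i].real
    r /= r.sum()                  # rescale so that |r| = 1
    print(vals[i].real, r)        # -> 1.0 [0.4 0.4 0.2]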
[Figure: Yahoo (Y!), Amazon (A), Msoft (MS) web graph]

In matrix form, r = Mr:

        Y!   A   MS
Y!  [ 1/2  1/2   0 ]   [y]   [y]
A   [ 1/2   0    1 ] · [a] = [a]
MS  [  0   1/2   0 ]   [m]   [m]

which is exactly the flow equations:
y = y/2 + a/2
a = y/2 + m
m = a/2
Simple iterative scheme
Suppose there are N web pages
Initialize: r(0) = [1/N, …, 1/N]^T
Iterate: r(k+1) = M·r(k)
Stop when |r(k+1) − r(k)|_1 < ε
|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) also works
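A direct sketch of this scheme (numpy assumed; function and variable names are mine):

    import numpy as np

    def pagerank_power(M, eps=1e-8):
        N = M.shape[0]
        r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]^T
        while True:
            r_next = M @ r                       # r(k+1) = M r(k)
            if np.abs(r_next - r).sum() < eps:   # L1 stopping rule
                return r_next
            r = r_next

    M = np.array([[0.5, 0.5, 0.0],               # column-stochastic example
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    print(pagerank_power(M))                     # -> approx [0.4, 0.4, 0.2]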
Power iteration:
Set r_i = 1/N
r'_i = Σ_j M_ij·r_j, and iterate

        Y!   A   MS
Y!  [ 1/2  1/2   0 ]
A   [ 1/2   0    1 ]
MS  [  0   1/2   0 ]

Example:
[y]   [1/3]   [1/3]   [5/12]   [3/8  ]         [2/5]
[a] = [1/3] → [1/2] → [1/3 ] → [11/24] → ... → [2/5]
[m]   [1/3]   [1/6]   [1/4 ]   [1/6  ]         [1/5]
Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
p(t) is a probability distribution on pages
Where is the surfer at time t+1?
Follows a link uniformly at random: p(t+1) = M·p(t)
Suppose the random walk reaches a state such that p(t+1) = M·p(t) = p(t)
Then p(t) is called a stationary distribution for the random walk
Our rank vector r satisfies r = Mr
So it is a stationary distribution for the random surfer
A central result from the theory of random walks (aka Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.
Power iteration:
Set r_i = 1
r'_i = Σ_j M_ij·r_j, and iterate

        Y!   A   MS
Y!  [ 1/2  1/2   0 ]
A   [ 1/2   0    0 ]
MS  [  0   1/2   1 ]

(Msoft now links only to itself: a spider trap)

Example:
[y]   [1]   [1  ]   [3/4]   [5/8]         [0]
[a] = [1] → [1/2] → [1/2] → [3/8] → ... → [0]
[m]   [1]   [3/2]   [7/4]   [2  ]         [3]
The Google solution for spider traps: at each time step, the random surfer has two options:
With probability β, follow a link at random
With probability 1−β, jump to some page uniformly at random
Common values for β are in the range 0.8 to 0.9
The surfer will teleport out of the spider trap within a few time steps
[Figure: Yahoo/Amazon/Msoft graph with teleports — each link weight 1/2 multiplied by 0.8, plus a 0.2·(1/3) teleport edge between every pair of pages]

          [1/2 1/2  0]         [1/3 1/3 1/3]
A = 0.8 · [1/2  0   0] + 0.2 · [1/3 1/3 1/3]
          [ 0  1/2  1]         [1/3 1/3 1/3]

    [7/15 7/15  1/15]
  = [7/15 1/15  1/15]
    [1/15 7/15 13/15]

Example (starting from r = [1, 1, 1]):
[y]   [1]   [1.00]   [0.84]   [0.776]         [ 7/11]
[a] = [1] → [0.60] → [0.60] → [0.536] → ... → [ 5/11]
[m]   [1]   [1.40]   [1.56]   [1.688]         [21/11]
        Y!   A   MS
Y!  [ 1/2  1/2   0 ]
A   [ 1/2   0    0 ]
MS  [  0   1/2   0 ]

(Msoft now has no out-links: a dead end, so importance leaks out)

Example:
[y]   [1]         [0]
[a] = [1] → ... → [0]
[m]   [1]         [0]
Teleports: the solution for dead ends
Follow random teleport links with probability 1.0 from dead ends
Adjust the matrix accordingly (each dead end's column becomes uniform over all pages)
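A sketch of that adjustment (numpy assumed; the helper name is mine): any column of M with no out-links becomes a uniform teleport column.

    import numpy as np

    def fix_dead_ends(M):
        M = M.copy()
        N = M.shape[0]
        dead = M.sum(axis=0) == 0    # dead ends: all-zero columns
        M[:, dead] = 1.0 / N         # teleport with probability 1.0
        return M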
Construct the N x N matrix A as follows:
A_ij = β·M_ij + (1−β)/N
Verify that A is a stochastic matrix
The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar
Equivalently, r is the stationary distribution of the random walk with teleports
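A small sketch (numpy assumed) building A for the spider-trap example with β = 0.8, checking stochasticity, and iterating r = A·r:

    import numpy as np

    beta = 0.8
    M = np.array([[0.5, 0.5, 0.0],        # spider-trap example from above
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0]])
    N = M.shape[0]
    A = beta * M + (1 - beta) / N         # A_ij = beta*M_ij + (1-beta)/N
    assert np.allclose(A.sum(axis=0), 1)  # columns sum to 1: stochastic

    r = np.full(N, 1.0 / N)
    for _ in range(100):
        r = A @ r                         # power iteration
    print(r)                              # -> approx [7/33, 5/33, 21/33]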
This is easy if we have enough main memory to hold A, r_old, r_new
Say N = 1 billion pages and we need 4 bytes for each entry (say):
2 billion entries for the two vectors, approx 8 GB
But matrix A has N² = 10^18 entries, a very large number!
r = Ar, where A_ij = β·M_ij + (1−β)/N
r_i = Σ_{1≤j≤N} A_ij·r_j
r_i = Σ_{1≤j≤N} [β·M_ij + (1−β)/N]·r_j
    = β·Σ_{1≤j≤N} M_ij·r_j + (1−β)/N · Σ_{1≤j≤N} r_j
    = β·Σ_{1≤j≤N} M_ij·r_j + (1−β)/N, since |r| = 1
So: r = β·M·r + [(1−β)/N]_N
where [x]_N is an N-vector with all entries x
M is a sparse matrix!
With approx 10 links per node, it has approx 10N nonzero entries
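So one update sweep can be computed straight from the out-link lists, never materializing the dense A; a sketch in plain Python (names mine; assumes dead ends have already been handled):

    # r_new = beta * M @ r_old + (1 - beta)/N, using sparse out-link lists
    def pagerank_sweep(links, r_old, beta=0.8):
        N = len(r_old)
        r_new = [(1 - beta) / N] * N              # the [(1-beta)/N]_N term
        for j, dests in links.items():            # links: source -> destinations
            share = beta * r_old[j] / len(dests)  # M_ij = 1/deg(j) for j -> i
            for i in dests:
                r_new[i] += share
        return r_new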
[Table: sparse encoding of M — for each source node (0, 1, 2, …), store its out-degree and the list of destination nodes; one update sweep reads r_old and the edge lists sequentially and accumulates into r_new]
Questions:
What if we had enough memory to fit both r_new and r_old?
What if we could not even fit r_new in memory?
See reading: http://i.stanford.edu/~ullman/mmds/ch5.pdf
Interesting pages fall into two classes:
1. Authorities are pages containing useful information
Newspaper home pages
Course home pages
Home pages of auto manufacturers
2. Hubs are pages that link to authorities
A good hub links to many good authorities A good authority is linked from many good hubs Model using two scores for each node:
Hub score and Authority score Represented as vectors h and a
[Figure: Yahoo/Amazon/Msoft link graph]

       y  a  m
    y [ 1  1  1 ]
A = a [ 1  0  1 ]
    m [ 0  1  0 ]
Notation:
Vectors a = (a_1, …, a_n), h = (h_1, …, h_n)
Adjacency matrix A (n x n): A_ij = 1 if i→j, else A_ij = 0
Then:
h_i = Σ_{j: i→j} a_j, i.e., h_i = Σ_j A_ij·a_j
So: h = A·a
Likewise: a = A^T·h
The hub score of page i is proportional to the sum of the authority scores of the pages it links to: h = λ·A·a
The constant λ is a scaling factor: λ = 1/Σ_i h_i
The authority score of page i is proportional to the sum of the hub scores of the pages that link to it: a = μ·A^T·h
The constant μ is a scaling factor: μ = 1/Σ_i a_i
      [ 1 1 0 ]
A^T = [ 1 0 1 ]
      [ 1 1 0 ]

[Figure: Yahoo/Amazon/Msoft link graph]

Iterating from h = a = (1, 1, 1), normalizing each step, the authority vector evolves:
a: (1, 1, 1) → (1, 1, 1) → (1, 4/5, 1) → ...
Algorithm:
Set: a = h = 1_n (the all-ones vector)
Repeat:
h = A·a (new h)
a = A^T·h (new a)
Normalize
Then: a = A^T·(A·a)
So a is being updated (in 2 steps): A^T·(A·a) = (A^T·A)·a
Likewise h is updated (in 2 steps): A·(A^T·h) = (A·A^T)·h
Repeated matrix powering
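A compact sketch of this loop on the example adjacency matrix above (numpy assumed; normalizing by the largest entry is one common convention):

    import numpy as np

    A = np.array([[1., 1., 1.],    # Yahoo/Amazon/Msoft example
                  [1., 0., 1.],
                  [0., 1., 0.]])
    h = np.ones(3)
    a = np.ones(3)
    for _ in range(100):
        h = A @ a;  h /= h.max()   # new h = lambda * A a
        a = A.T @ h; a /= a.max()  # new a = mu * A^T h
    print(h, a)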
Under reasonable assumptions about A, the HITS iterative algorithm converges to vectors h* and a*:
h* is the principal eigenvector of the matrix A·A^T
a* is the principal eigenvector of the matrix A^T·A
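A quick numerical check of this claim (a sketch, numpy assumed):

    import numpy as np

    A = np.array([[1., 1., 1.],
                  [1., 0., 1.],
                  [0., 1., 0.]])

    def principal(S):
        vals, vecs = np.linalg.eigh(S)          # S is symmetric
        v = np.abs(vecs[:, np.argmax(vals)])    # principal eigenvector
        return v / v.max()                      # scale largest entry to 1

    print(principal(A @ A.T))   # matches the converged h*
    print(principal(A.T @ A))   # matches the converged a*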