CS246: Mining Massive Datasets Jure Leskovec, Stanford University



What is the structure of the Web? How is it organized?


Jure Leskovec, Stanford C246: Mining Massive Datasets


Web as a directed graph

2/7/2011 Jure Leskovec, Stanford CS246: Mining Massive Datasets

Two types of directed graphs:

Strongly connected:
Any node can reach any node via a directed path

DAG (Directed Acyclic Graph):
Has no cycles: if u can reach v, then v cannot reach u

Any directed graph can be expressed in terms of these two types of graphs


A strongly connected component (SCC) is a set of nodes S such that:
Every pair of nodes in S can reach each other
There is no larger set containing S with this property

Any directed graph is a DAG on its SCCs:
Each SCC is a super-node
Super-node A links to super-node B if a node in A links to a node in B

Take a large snapshot of the Web and try to understand how its SCCs fit together as a DAG

Computational issue:
Say we want to find the SCC containing a specific node v

Observation:
Out(v): nodes reachable from v (via out-edges)
In(v): nodes that can reach v (via in-edges)
SCC containing v: $\mathrm{Out}(v, G) \cap \mathrm{In}(v, G) = \mathrm{Out}(v, G) \cap \mathrm{Out}(v, \bar{G})$
where $\bar{G}$ is G with the directions of all edges flipped
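The observation above can be sketched in a few lines: run one breadth-first search on the graph and one on the graph with all edges flipped, then intersect the two reachable sets. The function names and the toy graph below are illustrative assumptions, not part of the lecture, and an in-memory dictionary stands in for a Web-scale edge list.

```python
from collections import deque

def reachable(adj, start):
    """Return the set of nodes reachable from start by BFS over adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def reverse_graph(adj):
    """Return a copy of the graph with the direction of every edge flipped."""
    rev = {}
    for u, outs in adj.items():
        for w in outs:
            rev.setdefault(w, []).append(u)
    return rev

def scc_containing(adj, v):
    """SCC(v) = Out(v, G) intersected with Out(v, G reversed)."""
    return reachable(adj, v) & reachable(reverse_graph(adj), v)

# Hypothetical toy graph: 1 -> 2 -> 3 -> 1 is a cycle, and 3 -> 4 leaves it.
toy = {1: [2], 2: [3], 3: [1, 4], 4: []}
print(scc_containing(toy, 1))
print(scc_containing(toy, 4))
```

Each call costs two graph traversals, which is why the approach suits questions about one specific node rather than a full SCC decomposition.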


[Broder et al., 00]

250 million webpages, 1.5 billion links [Altavista]



Out-/In-Degree Distribution:
$p_k$: fraction of nodes with k out-/in-links
Histogram of $p_k$ vs. k

[Plot: normalized count $p_k$ vs. k]

Plot the same data on log-log axes:

$p_k = \beta k^{-\alpha}$

$\log p_k = \log \beta - \alpha \log k$

[Plot: normalized count $p_k$ vs. k on log-log axes]
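A power law of the assumed form p_k = beta * k^(-alpha) becomes a straight line on log-log axes, with slope -alpha and intercept log(beta). The sketch below recovers both from synthetic data with an ordinary least-squares line. The data, the function name, and the choice of least squares are assumptions made for illustration; this is a simple demonstration rather than a statistically careful estimator for real degree data.

```python
import math

def fit_power_law(ks, pks):
    """Least-squares line through (log k, log p_k); returns (alpha, beta)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in pks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # slope = -alpha and intercept = log(beta)
    return -slope, math.exp(intercept)

# Synthetic, noise-free data generated from p_k = 0.5 * k^(-2).
ks = list(range(1, 51))
pks = [0.5 * k ** -2.0 for k in ks]
alpha, beta = fit_power_law(ks, pks)
print(alpha, beta)
```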

[Broder et al., 00]




Random network: degree distribution is binomial, i.e., all nodes have similar degree

Power-law network: degrees follow a power law, i.e., heavily skewed


Web pages are not equally important

www.joe-schmoe.com vs. www.stanford.edu

Since there is large diversity in the connectivity of the webgraph we can rank the pages by the link structure




We will cover the following link analysis approaches to computing the importance of nodes in a graph:
PageRank
Hubs and Authorities (HITS)
Topic-Specific (Personalized) PageRank
Spam detection algorithms


First try:
A page is more important if it has more links
In-coming links? Out-going links?

Think of in-links as votes:
www.stanford.edu has 23,400 in-links
www.joe-schmoe.com has 1 in-link

Are all in-links equal?
Links from important pages count more
Recursive question!


Each link's vote is proportional to the importance of its source page
If page P with importance x has n out-links, each link gets x/n votes
Page P's own importance is the sum of the votes on its in-links


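The web in 1839

[Figure: three-page web graph. Yahoo (y) links to itself and to Amazon; Amazon (a) links to Yahoo and to Msoft; Msoft (m) links to Amazon. Each out-link carries an equal share of its source's importance: y/2, a/2, or m.]

y = y/2 + a/2
a = y/2 + m
m = a/2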


3 equations, 3 unknowns, no constants
No unique solution
All solutions equivalent modulo scale factor

Additional constraint forces uniqueness
y + a + m = 1
y = 2/5, a = 2/5, m = 1/5

Gaussian elimination works for small examples, but we need a better method for large web-size graphs
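For the three-page example, the solution can be checked by Gaussian elimination in exact arithmetic. In the sketch below, one of the three flow equations is redundant, so it is replaced by the constraint y + a + m = 1. The `solve` helper is a hypothetical minimal Gauss-Jordan routine written for this example, not a general-purpose solver.

```python
from fractions import Fraction as F

def solve(aug):
    """Gauss-Jordan elimination on an augmented matrix of Fractions."""
    n = len(aug)
    rows = [row[:] for row in aug]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pv = rows[col][col]
        rows[col] = [x / pv for x in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * p for x, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

system = [
    [F(1, 2), F(-1, 2), F(0), F(0)],   # y = y/2 + a/2
    [F(-1, 2), F(1), F(-1), F(0)],     # a = y/2 + m
    [F(1), F(1), F(1), F(1)],          # y + a + m = 1
]
y, a, m = solve(system)
print(y, a, m)
```

Elimination takes time cubic in the number of pages, which is the reason a different method is needed at Web scale.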


Matrix M has one row and one column for each web page

Suppose page j has n out-links
If $j \to i$, then $M_{ij} = 1/n$, else $M_{ij} = 0$

M is a column stochastic matrix
Columns sum to 1

Suppose r is a vector with one entry per web page:
$r_i$ is the importance score of page i
Call it the rank vector
$|r| = 1$

Suppose page j links to 3 pages, including i

[Figure: the product $M r = r$, with entry $M_{ij} = 1/3$ marked in column j, row i]
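As a small illustration of the definition, the sketch below builds M from an adjacency list and checks that every column sums to 1. The function name and the dictionary encoding are assumptions made here; the link structure is the three-page Yahoo/Amazon/Msoft example, and each destination list is assumed to contain no duplicates.

```python
from fractions import Fraction

def column_stochastic(out_links, pages):
    """M[i][j] = 1/n if page j (with n out-links) links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    size = len(pages)
    M = [[Fraction(0)] * size for _ in range(size)]
    for j, dests in out_links.items():
        for i in dests:
            M[index[i]][index[j]] = Fraction(1, len(dests))
    return M

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = column_stochastic(out_links, pages)
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
print(M)
print(column_sums)
```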


The flow equations can be written $r = Mr$
So the rank vector is an eigenvector of the stochastic web matrix
In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1



$r = Mr$

       Y!    A     MS
Y!    1/2   1/2    0
A     1/2    0     1
MS     0    1/2    0

$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}$$

y = y/2 + a/2
a = y/2 + m
m = a/2


Simple iterative scheme
Suppose there are N web pages
Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
Iterate: $r^{(k+1)} = M r^{(k)}$
Stop when $|r^{(k+1)} - r^{(k)}|_1 < \varepsilon$

$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
Can use any other vector norm, e.g., Euclidean


Power iteration:
Set $r_i = 1/N$
$r_i = \sum_j M_{ij} r_j$
And iterate

       Y!    A     MS
Y!    1/2   1/2    0
A     1/2    0     1
MS     0    1/2    0

Successive iterates:

y     1/3   1/3   5/12   3/8     ...   2/5
a  =  1/3   1/2   1/3    11/24   ...   2/5
m     1/3   1/6   1/4    1/6     ...   1/5
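The iteration can be reproduced with a short dense-matrix sketch. It starts from the uniform vector and stops once the L1 change drops below a threshold. The function name, the default threshold, and the iteration cap are assumptions chosen for this small example.

```python
def power_iterate(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic matrix of the Yahoo / Amazon / Msoft example.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iterate(M)
print(r)  # close to [0.4, 0.4, 0.2], i.e. y = a = 2/5 and m = 1/5
```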

Imagine a random web surfer

At any time t, the surfer is on some page P
At time t+1, the surfer follows an out-link from P uniformly at random
Ends up on some page Q linked from P
Process repeats indefinitely

Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
p(t) is a probability distribution on pages


Where is the surfer at time t+1?
Follows a link uniformly at random
p(t+1) = Mp(t)

Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)
Then p(t) is called a stationary distribution for the random walk

Our rank vector r satisfies r = Mr
So it is a stationary distribution for the random surfer


A central result from the theory of random walks (aka Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.




Some pages are dead ends (have no out-links)

Such pages cause importance to leak out

Spider traps (all out links are within the group)

A group of pages is a spider trap if there are no links from within the group to the outside of the group
Random surfer gets trapped
And eventually spider traps absorb all importance


Power iteration:
Set $r_i = 1$
$r_i = \sum_j M_{ij} r_j$
And iterate

       Y!    A     MS
Y!    1/2   1/2    0
A     1/2    0     0
MS     0    1/2    1

Successive iterates (MS is a spider trap):

y     1    1     3/4   5/8   ...   0
a  =  1    1/2   1/2   3/8   ...   0
m     1    3/2   7/4   2     ...   3


The Google solution for spider traps
At each time step, the random surfer has two options:
With probability $\beta$, follow a link at random
With probability $1-\beta$, jump to some page uniformly at random
Common values for $\beta$ are in the range 0.8 to 0.9

Surfer will teleport out of spider trap within a few time steps



[Figure: the spider-trap graph with teleport links added; edge labels 0.8*1/2 on followed links and 0.2*1/3 on teleport links]

$$0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$$

Successive iterates:

y     1   1.00   0.84   0.776   ...   7/11
a  =  1   0.60   0.60   0.536   ...   5/11
m     1   1.40   1.56   1.688   ...   21/11
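The teleport example can be reproduced in exact arithmetic. The sketch below mixes the spider-trap matrix with the uniform matrix, using a follow-link probability of 0.8 written as the fraction 4/5, and then iterates from the all-ones vector. The function names and the fixed count of 60 iterations are assumptions made for this illustration.

```python
from fractions import Fraction as F

def with_teleports(M, beta):
    """Entry (i, j) becomes beta * M[i][j] + (1 - beta) / N."""
    n = len(M)
    return [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

def step(A, r):
    """One matrix-vector product A r."""
    return [sum(A[i][j] * r[j] for j in range(len(r))) for i in range(len(A))]

# Spider-trap matrix: Msoft links only to itself.
M = [[F(1, 2), F(1, 2), F(0)],
     [F(1, 2), F(0), F(0)],
     [F(0), F(1, 2), F(1)]]
A = with_teleports(M, F(4, 5))

r = [F(1), F(1), F(1)]
first = step(A, r)          # first iterate: 1, 3/5, 7/5
for _ in range(60):
    r = step(A, r)
approx = [float(x) for x in r]
print(A[0], first, approx)
```

The trap page still ends up with the largest score, but it no longer absorbs all of the importance.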


Some pages are dead ends (have no out-links)
Such pages cause importance to leak out

Power iteration:
Set $r_i = 1$
$r_i = \sum_j M_{ij} r_j$
And iterate

       Y!    A     MS
Y!    1/2   1/2    0
A     1/2    0     0
MS     0    1/2    0

Successive iterates (MS is a dead end):

y     1    1     3/4   5/8   ...   0
a  =  1    1/2   1/2   3/8   ...   0
m     1    1/2   1/4   1/4   ...   0

Follow random teleport links with probability 1.0 from dead ends
Adjust matrix accordingly
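One way to read "adjust matrix accordingly" is to replace every all-zero column (a dead end) with a uniform column whose entries are all 1/N, so that a surfer at a dead end always teleports. The sketch below applies that interpretation to the dead-end example; the slide does not spell out the adjustment, so the function and its name are assumptions.

```python
from fractions import Fraction as F

def patch_dead_ends(M):
    """Replace every all-zero column (a dead end) by a uniform column 1/N."""
    n = len(M)
    patched = [row[:] for row in M]
    for j in range(n):
        if all(M[i][j] == 0 for i in range(n)):
            for i in range(n):
                patched[i][j] = F(1, n)
    return patched

# Dead-end matrix: Msoft has no out-links, so its column is all zeros.
M_dead = [[F(1, 2), F(1, 2), F(0)],
          [F(1, 2), F(0), F(0)],
          [F(0), F(1, 2), F(0)]]
M_fixed = patch_dead_ends(M_dead)
column_sums = [sum(M_fixed[i][j] for i in range(3)) for j in range(3)]
print(M_fixed)
print(column_sums)
```

After the patch every column sums to 1 again, so importance no longer leaks out during power iteration.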


Suppose there are N pages

Consider a page j, with set of out-links O(j)
We have $M_{ij} = 1/|O(j)|$ when $j \to i$ and $M_{ij} = 0$ otherwise
The random teleport is equivalent to
adding a teleport link from j to every other page with probability $(1-\beta)/N$
reducing the probability of following each out-link from $1/|O(j)|$ to $\beta/|O(j)|$
Equivalent: tax each page a fraction $(1-\beta)$ of its score and redistribute evenly


Construct the N x N matrix A as follows
$A_{ij} = \beta M_{ij} + (1-\beta)/N$
Verify that A is a stochastic matrix
The page rank vector r is the principal eigenvector of this matrix
satisfying $r = Ar$
Equivalently, r is the stationary distribution of the random walk with teleports


Key step is matrix-vector multiplication
$r^{new} = A r^{old}$

Easy if we have enough main memory to hold A, $r^{old}$, $r^{new}$
Say N = 1 billion pages
We need 4 bytes for each entry (say)
2 billion entries for vectors, approx 8GB
Matrix A has $N^2$ entries
$10^{18}$ is a large number!


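$r = Ar$, where $A_{ij} = \beta M_{ij} + (1-\beta)/N$

$$\begin{aligned}
r_i &= \sum_{1 \le j \le N} A_{ij} r_j \\
&= \sum_{1 \le j \le N} \left[ \beta M_{ij} + (1-\beta)/N \right] r_j \\
&= \beta \sum_{1 \le j \le N} M_{ij} r_j + \frac{1-\beta}{N} \sum_{1 \le j \le N} r_j \\
&= \beta \sum_{1 \le j \le N} M_{ij} r_j + \frac{1-\beta}{N}, \quad \text{since } |r| = 1
\end{aligned}$$

$r = \beta M r + [(1-\beta)/N]_N$
where $[x]_N$ is an N-vector with all entries x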

We can rearrange the PageRank equation
$r = \beta M r + [(1-\beta)/N]_N$
$[(1-\beta)/N]_N$ is an N-vector with all entries $(1-\beta)/N$

M is a sparse matrix!
10 links per node, approx 10N entries

So in each iteration, we need to:
Compute $r^{new} = \beta M r^{old}$
Add a constant value $(1-\beta)/N$ to each entry in $r^{new}$


Encode sparse matrix using only nonzero entries
Space proportional roughly to number of links
say 10N, or 4*10*1 billion = 40GB
still won't fit in memory, but will fit on disk

source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23

Assume enough RAM to fit $r^{new}$ into memory
Store $r^{old}$ and matrix M on disk

Initialize all entries of $r^{new}$ to $(1-\beta)/N$
For each page p (of out-degree n):
Read into memory: p, n, $dest_1, \ldots, dest_n$, $r^{old}(p)$
for j = 1..n: $r^{new}(dest_j)$ += $\beta \, r^{old}(p) / n$

src   degree   destination
0     3        1, 5, 6
1     4        17, 64, 113, 117
2     2        13, 23

[Figure: vectors $r^{new}$ and $r^{old}$, each with entries indexed 0 to 6]
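The update loop can be sketched as a single streaming pass over the sparse encoding. In the sketch below, an in-memory text buffer stands in for the on-disk matrix, with one line per source page in the form "src degree dest1 ... destn". That text format, the function name, and the value 0.8 for the follow-link probability are assumptions made for illustration; for simplicity the old rank vector is also held in memory, and the graph is assumed to have no dead ends.

```python
import io

def one_pass(records, r_old, beta):
    """One iteration: stream (src, degree, dests) records and scatter rank."""
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n           # teleport share for every page
    for line in records:
        fields = [int(tok) for tok in line.split()]
        src, degree, dests = fields[0], fields[1], fields[2:]
        if degree != len(dests):
            raise ValueError("degree does not match destination count")
        for d in dests:
            r_new[d] += beta * r_old[src] / degree
    return r_new

# Hypothetical encoding of the spider-trap example with pages 0 = y, 1 = a, 2 = m.
encoded = "0 2 0 1\n1 2 0 2\n2 1 2\n"
r = [1.0 / 3] * 3
for _ in range(100):
    r = one_pass(io.StringIO(encoded), r, beta=0.8)
print(r)  # close to [7/33, 5/33, 21/33]
```

Only the new rank vector is written to at random positions; the matrix records are read sequentially, which is what makes the disk-based layout workable.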

In each iteration, we have to:
Read $r^{old}$ and M
Write $r^{new}$ back to disk
IO cost = 2|r| + |M|

What if we had enough memory to fit both $r^{new}$ and $r^{old}$?
What if we could not even fit $r^{new}$ in memory?
See reading: http://i.stanford.edu/~ullman/mmds/ch5.pdf


Measures generic popularity of a page
Biased against topic-specific authorities
Solution: Topic-Specific PageRank (next lecture)

Uses a single measure of importance
Other models exist, e.g., hubs-and-authorities
Solution: Hubs-and-Authorities (next)

Susceptible to link spam
Artificial link topographies created in order to boost page rank
Solution: TrustRank (next lecture)


Interesting pages fall into two classes:

1. Authorities are pages containing useful information
Newspaper home pages
Course home pages
Home pages of auto manufacturers

2. Hubs are pages that link to authorities
List of newspapers
Course bulletin
List of US auto manufacturers

[Figure labels: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]


A good hub links to many good authorities
A good authority is linked from many good hubs
Model using two scores for each node:
Hub score and Authority score
Represented as vectors h and a


HITS uses adjacency matrix
A[i, j] = 1 if page i links to page j, 0 else

$A^T$, the transpose of A, is similar to the PageRank matrix M but $A^T$ has 1s where M has fractions

[Figure: three-page graph of Yahoo (y), Amazon (a), Msoft (m)]

A =
      y   a   m
y     1   1   1
a     1   0   1
m     0   1   0


Notation:
Vector $a = (a_1, \ldots, a_n)$, $h = (h_1, \ldots, h_n)$
Adjacency matrix (n x n): $A_{ij} = 1$ if $i \to j$, else $A_{ij} = 0$

Then:
$h_i = \sum_{i \to j} a_j \iff h_i = \sum_j A_{ij} a_j$

So: $h = A a$
Likewise: $a = A^T h$

The hub score of page i is proportional to the sum of the authority scores of the pages it links to: $h = \lambda A a$
Constant $\lambda$ is a scaling factor, $\lambda = 1/\sum_i h_i$

The authority score of page i is proportional to the sum of the hub scores of the pages it is linked from: $a = \mu A^T h$
Constant $\mu$ is a scaling factor, $\mu = 1/\sum_i a_i$


The HITS algorithm:

Initialize h, a to all 1s
Repeat:
$h = A a$
Scale h so that it sums to 1.0
$a = A^T h$
Scale a so that it sums to 1.0
Until h, a converge (i.e., change very little)
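The loop above translates almost directly into code. The sketch below uses the three-page adjacency matrix from the Yahoo/Amazon/Msoft example and a fixed number of rounds in place of a convergence test; the function name and the round count are assumptions made for this illustration.

```python
def hits(A, iters=50):
    """Alternate h = A a and a = A^T h, rescaling each vector to sum to 1."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        # Hub update: sum the authority scores of the pages that i links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        total = sum(h)
        h = [x / total for x in h]
        # Authority update: sum the hub scores of the pages that link to j.
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        total = sum(a)
        a = [x / total for x in a]
    return h, a

# Rows and columns are ordered Yahoo, Amazon, Msoft.
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a = hits(A)
print(h)
print(a)
```

Because the scores are only defined up to scale, it is the ratios between entries that carry meaning; rescaling so that the largest entry is 1 instead of the sum gives the same ranking.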


$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \qquad A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

Successive iterates (here each vector is scaled so that its largest entry is 1):

a(yahoo)  =  1   1     1      1      ...   1
a(amazon) =  1   1     4/5    0.75   ...   0.732
a(msoft)  =  1   1     1      1      ...   1

h(yahoo)  =  1   1     1      1      ...   1.000
h(amazon) =  1   2/3   0.71   0.73   ...   0.732
h(msoft)  =  1   1/3   0.29   0.27   ...   0.268

Here M denotes the adjacency matrix.

Set: $a = h = 1^n$
Repeat:
$h = M a$, $a = M^T h$
Normalize

Then: $a = M^T (M a)$, where the inner product $M a$ is the new h and the result is the new a

Thus:
$a = (M^T M) a$
$h = (M M^T) h$

a is being updated (in 2 steps): $M^T (M a) = (M^T M) a$
h is updated (in 2 steps): $M (M^T h) = (M M^T) h$
Repeated matrix powering

Under reasonable assumptions about A, the HITS iterative algorithm converges to vectors h* and a*:
h* is the principal eigenvector of matrix $A A^T$
a* is the principal eigenvector of matrix $A^T A$


PageRank and HITS are two solutions to the same problem:
What is the value of an in-link from u to v?
In the PageRank model, the value of the link depends on the links into u
In the HITS model, it depends on the value of the other links out of u

The destinies of PageRank and HITS post-1998 were very different

