CS246: Mining Massive Datasets
Jure Leskovec, Stanford University

http://cs246.stanford.edu

What is the structure of the Web? How is it organized?

2/7/2011

Jure Leskovec, Stanford CS246: Mining Massive Datasets


Web as a directed graph



Two types of directed graphs:

Strongly connected:
  Any node can reach any node via a directed path

DAG (Directed Acyclic Graph):
  Has no cycles: if u can reach v, then v cannot reach u

Any directed graph can be expressed in terms of these two types of graphs

Strongly connected component (SCC) is a set of nodes S such that:
  Every pair of nodes in S can reach each other
  There is no larger set containing S with this property

Any directed graph is a DAG on its SCCs:
  Each SCC is a super-node
  Super-node A links to super-node B if a node in A links to a node in B

Take a large snapshot of the Web and try to understand how its SCCs fit together as a DAG

Computational issue: say we want to find the SCC containing a specific node v

Observation:
  Out(v): nodes reachable from v (via out-edges)
  In(v): nodes that can reach v (found by following in-edges backward from v)
  SCC containing v: $\mathrm{Out}(v, G) \cap \mathrm{In}(v, G) = \mathrm{Out}(v, G) \cap \mathrm{Out}(v, G')$
  where G' is G with directions of edges flipped
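The observation can be sketched in a few lines of Python. This is a minimal illustration, assuming the graph fits in memory as a dict of adjacency lists; the function names (`reachable`, `reverse_graph`, `scc_containing`) and the toy graph are made up for this example and are not from the lecture.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of nodes reachable from start, using BFS over adj."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def reverse_graph(adj):
    """Return the graph with the direction of every edge flipped."""
    rev = {}
    for u, outs in adj.items():
        rev.setdefault(u, [])
        for w in outs:
            rev.setdefault(w, []).append(u)
    return rev


def scc_containing(adj, v):
    """SCC of v = (nodes reachable from v) intersected with (nodes that reach v)."""
    return reachable(adj, v) & reachable(reverse_graph(adj), v)


# Hypothetical toy graph: a -> b -> c -> a forms a cycle; d and e hang off it.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["e"], "e": []}
print(sorted(scc_containing(toy, "a")))
```

Each call costs two graph traversals, which is why the slide frames this as a way to find the SCC of one specific node rather than all SCCs at once.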

[Broder et al., 00]

250 million webpages, 1.5 billion links [Altavista]



Out-/In-Degree Distribution:
  $p_k$: fraction of nodes with k out-/in-links

[Figure: histogram of normalized count $p_k$ vs. k]

Plot the same data on log-log axes:

[Figure: normalized count $p_k$ vs. k, log-log axes]

$p_k = \beta k^{-\alpha}$
$\log p_k = \log \beta - \alpha \log k$
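On log-log axes a power law $p_k = \beta k^{-\alpha}$ is a straight line with slope $-\alpha$, so the exponent can be estimated with a least-squares line fit. A minimal sketch on synthetic data follows; the function name `fit_power_law` and the values 0.5 and 2.1 are illustrative assumptions, not measurements from the lecture.

```python
import math


def fit_power_law(ks, pks):
    """Fit log p = log(beta) - alpha * log k by ordinary least squares."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in pks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)  # (alpha, beta)


# Synthetic, noise-free data generated from an assumed power law.
ks = range(1, 101)
pks = [0.5 * k ** -2.1 for k in ks]
alpha, beta = fit_power_law(ks, pks)
print(alpha, beta)
```

On real degree data the tail is noisy, so a plain line fit like this is only a first look at the exponent.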


[Broder et al., 00]


Random network:
  Degree distribution is Binomial, i.e., all nodes have similar degree

Power-law network:
  Degrees are Power-law, i.e., heavily skewed

Web pages are not equally important


www.joe-schmoe.com vs. www.stanford.edu

Since there is large diversity in the connectivity of the webgraph we can rank the pages by the link structure


We will cover the following Link Analysis approaches to computing the importance of nodes in a graph:
  PageRank
  Hubs and Authorities (HITS)
  Topic-Specific (Personalized) PageRank
  Spam Detection Algorithms

First try:
  Page is more important if it has more links
  In-coming links? Out-going links?

Think of in-links as votes:
  www.stanford.edu has 23,400 in-links
  www.joe-schmoe.com has 1 in-link

Are all in-links equal?
  Links from important pages count more
  Recursive question!

Each link's vote is proportional to the importance of its source page
If page P with importance x has n out-links, each link gets x/n votes
Page P's own importance is the sum of the votes on its in-links

The web in 1839

[Figure: three-page graph. Yahoo (y) links to itself and to Amazon; Amazon (a) links to Yahoo and to Msoft; Msoft (m) links to Amazon. Edge labels: y/2, y/2, a/2, a/2, m]

y = y/2 + a/2
a = y/2 + m
m = a/2

3 equations, 3 unknowns, no constants
  No unique solution
  All solutions equivalent modulo scale factor

Additional constraint forces uniqueness
  y + a + m = 1
  y = 2/5, a = 2/5, m = 1/5

Gaussian elimination method works for small examples, but we need a better method for large web-size graphs
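For a graph this small the system can be solved exactly. The sketch below assumes the three-page example (flow equations y = y/2 + a/2, a = y/2 + m, m = a/2), replaces the redundant third equation with the constraint y + a + m = 1, and runs Gauss-Jordan elimination over exact fractions. The helper `solve` is written for this illustration only.

```python
from fractions import Fraction as F


def solve(aug):
    """Gauss-Jordan elimination on an augmented matrix [A | b] of Fractions."""
    rows = [row[:] for row in aug]  # work on a copy
    n = len(rows)
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        lead = rows[col][col]
        rows[col] = [x / lead for x in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * p for x, p in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]


# Unknowns are (y, a, m); each row is [coefficients | right-hand side].
system = [
    [F(-1, 2), F(1, 2), F(0), F(0)],  # y/2 + a/2 - y = 0
    [F(1, 2), F(-1), F(1), F(0)],     # y/2 + m - a = 0
    [F(1), F(1), F(1), F(1)],         # y + a + m = 1
]
y, a, m = solve(system)
print(y, a, m)
```

Elimination costs on the order of N^3 operations for N pages, which is the reason the slide asks for a different method at web scale.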

Matrix M has one row and one column for each web page
Suppose page j has n out-links
  If j → i, then $M_{ij} = 1/n$, else $M_{ij} = 0$
M is a column stochastic matrix
  Columns sum to 1
Suppose r is a vector with one entry per web page:
  $r_i$ is the importance score of page i
  Call it the rank vector
  $|r| = 1$
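As a concrete illustration of this definition, the sketch below builds M from a dict of out-links and checks that every column sums to 1. It assumes every page has at least one out-link and no duplicate links; the function name `stochastic_matrix` and the `links` dict (the Yahoo / Amazon / Msoft example from these slides) are choices made for this example.

```python
from fractions import Fraction


def stochastic_matrix(out_links, pages):
    """Build M with M[i][j] = 1/n if page j (with n out-links) links to page i."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for j, dests in out_links.items():
        for i in dests:
            M[index[i]][index[j]] = Fraction(1, len(dests))
    return M


# y links to y and a; a links to y and m; m links to a.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(links, ["y", "a", "m"])
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
print(M)
print(column_sums)
```

Column j holds the out-links of page j, so reading down a column shows how that page splits its importance.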


Suppose page j links to 3 pages, including i

[Figure: the product $Mr = r$, with the entry $M_{ij} = 1/3$ in row i, column j]

The flow equations can be written $r = Mr$
So the rank vector is an eigenvector of the stochastic web matrix
  In fact, its first or principal eigenvector, with corresponding eigenvalue 1

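[Figure: the Yahoo / Amazon / Msoft graph]

        Y!    A     MS
  Y!    1/2   1/2   0
  A     1/2   0     1
  MS    0     1/2   0

r = Mr:

$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}$

y = y/2 + a/2
a = y/2 + m
m = a/2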

Simple iterative scheme
Suppose there are N web pages
  Initialize: $r^{0} = [1/N, \ldots, 1/N]^T$
  Iterate: $r^{k+1} = M r^{k}$
  Stop when $|r^{k+1} - r^{k}|_1 < \varepsilon$
    $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
    Can use any other vector norm, e.g., Euclidean

Power iteration:
  Set $r_i = 1/n$
  $r_i = \sum_j M_{ij} r_j$
  And iterate

        Y!    A     MS
  Y!    1/2   1/2   0
  A     1/2   0     1
  MS    0     1/2   0

Example (successive iterates, left to right):

  y     1/3   1/3   5/12   3/8     ...   2/5
  a  =  1/3   1/2   1/3    11/24   ...   2/5
  m     1/3   1/6   1/4    1/6     ...   1/5
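The iteration is short enough to write out directly. The following is a minimal sketch, assuming a dense list-of-lists matrix for the three-page Yahoo / Amazon / Msoft example and an L1 stopping threshold; the name `power_iterate` and the defaults `eps` and `max_iter` are assumptions for illustration.

```python
def power_iterate(M, eps=1e-10, max_iter=1000):
    """Repeat r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        converged = sum(abs(x - y) for x, y in zip(r_next, r)) < eps
        r = r_next
        if converged:
            break
    return r


# Column-stochastic matrix: y -> {y, a}, a -> {y, m}, m -> {a}.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
r = power_iterate(M)
print(r)
```

The result approaches (2/5, 2/5, 1/5), the same vector obtained by solving the flow equations with the constraint y + a + m = 1.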

Imagine a random web surfer
  At any time t, surfer is on some page P
  At time t+1, the surfer follows an outlink from P uniformly at random
  Ends up on some page Q linked from P
  Process repeats indefinitely

Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
  p(t) is a probability distribution on pages

Where is the surfer at time t+1?
  Follows a link uniformly at random
  $p(t+1) = M p(t)$

Suppose the random walk reaches a state such that $p(t+1) = M p(t) = p(t)$
  Then p(t) is called a stationary distribution for the random walk

Our rank vector r satisfies $r = Mr$
  So it is a stationary distribution for the random surfer

A central result from the theory of random walks (aka Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.


Some pages are dead ends (have no out-links)
  Such pages cause importance to leak out

Spider traps (all out-links are within the group)
  A group of pages is a spider trap if there are no links from within the group to the outside of the group
  Random surfer gets trapped
  And eventually spider traps absorb all importance

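Power iteration:
  Set $r_i = 1$
  $r_i = \sum_j M_{ij} r_j$
  And iterate

[Figure: the Yahoo / Amazon / Msoft graph, with Msoft linking only to itself]

        Y!    A     MS
  Y!    1/2   1/2   0
  A     1/2   0     0
  MS    0     1/2   1

Example (successive iterates, left to right):

  y     1   1     3/4   5/8   ...   0
  a  =  1   1/2   1/2   3/8   ...   0
  m     1   3/2   7/4   2     ...   3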

The Google solution for spider traps
At each time step, the random surfer has two options:
  With probability β, follow a link at random
  With probability 1-β, jump to some page uniformly at random
  Common values for β are in the range 0.8 to 0.9

Surfer will teleport out of spider trap within a few time steps

[Figure: the Yahoo / Amazon / Msoft graph with β = 0.8; each link out of Yahoo is weighted 0.8*1/2 and each teleport edge 0.2*1/3]

Column for Yahoo: $0.8 \, (1/2, 1/2, 0)^T + 0.2 \, (1/3, 1/3, 1/3)^T$

$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}$

        y      a      m
  y     7/15   7/15   1/15
  a     7/15   1/15   1/15
  m     1/15   7/15   13/15

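[Figure: the Yahoo / Amazon / Msoft graph]

$A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$

Power iteration (successive iterates, left to right):

  y     1   1.00   0.84   0.776   ...   7/11
  a  =  1   0.60   0.60   0.536   ...   5/11
  m     1   1.40   1.56   1.688   ...   21/11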
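The same numbers can be reproduced with a short sketch of power iteration with teleports. It assumes β = 0.8 and the spider-trap matrix in which Msoft links only to itself, and it keeps r normalized to sum to 1, so its output equals the slide's values (which start from all 1s and sum to 3) divided by 3. The function name `pagerank` and the defaults `eps` and `max_iter` are illustrative.

```python
def pagerank(M, beta=0.8, eps=1e-12, max_iter=1000):
    """Iterate r <- beta * M r + (1 - beta) / N until the L1 change is below eps.

    Assumes M is column stochastic (no dead ends), so r keeps summing to 1.
    """
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [
            beta * sum(M[i][j] * r[j] for j in range(n)) + (1 - beta) / n
            for i in range(n)
        ]
        converged = sum(abs(x - y) for x, y in zip(r_next, r)) < eps
        r = r_next
        if converged:
            break
    return r


# Spider trap: y -> {y, a}, a -> {y, m}, m -> {m}.
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
r = pagerank(M_trap)
scaled = [3 * x for x in r]  # rescale to the slide's convention (entries sum to 3)
print(scaled)
```

Without teleports the same matrix sends all importance to Msoft; with β = 0.8 Msoft still ranks highest, but Yahoo and Amazon keep a nonzero share.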


Some pages are dead ends (have no out-links)
  Such pages cause importance to leak out

Power iteration:
  Set $r_i = 1$
  $r_i = \sum_j M_{ij} r_j$
  And iterate

[Figure: the Yahoo / Amazon / Msoft graph, with Msoft having no out-links]

        Y!    A     MS
  Y!    1/2   1/2   0
  A     1/2   0     0
  MS    0     1/2   0

Example (successive iterates, left to right):

  y     1   1     3/4   5/8   ...   0
  a  =  1   1/2   1/2   3/8   ...   0
  m     1   1/2   1/4   1/4   ...   0

Teleports
  Follow random teleport links with probability 1.0 from dead-ends
  Adjust matrix accordingly
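One way to "adjust the matrix accordingly" is to replace every all-zero column with a uniform column, so that a surfer at a dead end jumps to a random page with probability 1.0. The sketch below makes that assumption explicit; the function name `fix_dead_ends` is hypothetical, and the matrix is the dead-end variant of the Yahoo / Amazon / Msoft example in which Msoft has no out-links.

```python
def fix_dead_ends(M):
    """Return a copy of M in which every all-zero column is replaced by 1/N entries."""
    n = len(M)
    fixed = [row[:] for row in M]
    for j in range(n):
        if all(M[i][j] == 0 for i in range(n)):  # page j has no out-links
            for i in range(n):
                fixed[i][j] = 1.0 / n
    return fixed


# Dead end: y -> {y, a}, a -> {y, m}, m -> {} (third column is all zeros).
M_dead = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]
M_fixed = fix_dead_ends(M_dead)
column_sums = [sum(M_fixed[i][j] for i in range(3)) for j in range(3)]
print(M_fixed)
print(column_sums)
```

After the adjustment every column sums to 1 again, so the matrix is column stochastic and importance no longer leaks out.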


Suppose there are N pages
  Consider a page j, with set of outlinks O(j)
  We have $M_{ij} = 1/|O(j)|$ when j → i and $M_{ij} = 0$ otherwise

The random teleport is equivalent to:
  adding a teleport link from j to every other page with probability (1-β)/N
  reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
  Equivalent: tax each page a fraction (1-β) of its score and redistribute evenly

Construct the N x N matrix A as follows:
  $A_{ij} = \beta M_{ij} + (1-\beta)/N$

Verify that A is a stochastic matrix

The page rank vector r is the principal eigenvector of this matrix
  satisfying $r = Ar$

Equivalently, r is the stationary distribution of the random walk with teleports

Key step is matrix-vector multiplication
  $r^{new} = A r^{old}$

Easy if we have enough main memory to hold A, $r^{old}$, $r^{new}$

Say N = 1 billion pages
  We need 4 bytes for each entry (say)
  2 billion entries for vectors, approx 8GB
  Matrix A has $N^2$ entries
    $10^{18}$ is a large number!

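$r = Ar$, where $A_{ij} = \beta M_{ij} + (1-\beta)/N$

$r_i = \sum_{1 \le j \le N} A_{ij} r_j$
$r_i = \sum_{1 \le j \le N} [\beta M_{ij} + (1-\beta)/N] \, r_j$
$\;\; = \beta \sum_{1 \le j \le N} M_{ij} r_j + \frac{1-\beta}{N} \sum_{1 \le j \le N} r_j$
$\;\; = \beta \sum_{1 \le j \le N} M_{ij} r_j + \frac{1-\beta}{N}$, since $|r| = 1$

$r = \beta M r + [(1-\beta)/N]_N$
  where $[x]_N$ is an N-vector with all entries x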

We can rearrange the PageRank equation
  $r = \beta M r + [(1-\beta)/N]_N$
  $[(1-\beta)/N]_N$ is an N-vector with all entries $(1-\beta)/N$

M is a sparse matrix!
  10 links per node, approx 10N entries

So in each iteration, we need to:
  Compute $r^{new} = \beta M r^{old}$
  Add a constant value $(1-\beta)/N$ to each entry in $r^{new}$

Encode sparse matrix using only nonzero entries
  Space proportional roughly to number of links
  Say 10N, or 4*10*1 billion = 40GB
  Still won't fit in memory, but will fit on disk

  source node   degree   destination nodes
  0             3        1, 5, 7
  1             5        17, 64, 113, 117, 245
  2             2        13, 23

Assume enough RAM to fit $r^{new}$ into memory
  Store $r^{old}$ and matrix M on disk

Initialize all entries of $r^{new}$ to (1-β)/N
For each page p (of out-degree n):
  Read into memory: p, n, dest_1, ..., dest_n, $r^{old}(p)$
  for j = 1...n: $r^{new}(dest_j)$ += β $r^{old}(p)$ / n

[Figure: vectors $r^{new}$ and $r^{old}$, each with entries 0 to 6, next to the encoded matrix]

  src   degree   destination
  0     3        1, 5, 6
  1     4        17, 64, 113, 117
  2     2        13, 23
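One pass of this update over the sparse encoding can be sketched as follows. The sketch keeps everything in memory for brevity, whereas the slide streams the records and the old rank vector from disk; the name `sparse_pagerank_step`, the `records` list of (source, degree, destinations) tuples, and β = 0.8 are assumptions for illustration. The records encode the three-page spider-trap example (0 = Yahoo, 1 = Amazon, 2 = Msoft).

```python
def sparse_pagerank_step(records, r_old, beta=0.8):
    """One iteration of r_new = beta * M * r_old + (1 - beta) / N.

    records holds one (source, out_degree, destinations) tuple per page,
    i.e. only the nonzero entries of M.
    """
    n = len(r_old)
    r_new = [(1 - beta) / n] * n  # teleport share for every page
    for src, degree, dests in records:
        share = beta * r_old[src] / degree  # what src passes along each out-link
        for dest in dests:
            r_new[dest] += share
    return r_new


# 0 -> {0, 1}, 1 -> {0, 2}, 2 -> {2}
records = [(0, 2, [0, 1]), (1, 2, [0, 2]), (2, 1, [2])]
r_old = [1 / 3, 1 / 3, 1 / 3]
r_new = sparse_pagerank_step(records, r_old)
print(r_new)
```

Each record is read once per pass and only the new rank vector is updated at random positions, which is why it is the new vector that has to fit in RAM.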


In each iteration, we have to:
  Read $r^{old}$ and M
  Write $r^{new}$ back to disk
  IO Cost = 2|r| + |M|

Questions:
  What if we had enough memory to fit both $r^{new}$ and $r^{old}$?
  What if we could not even fit $r^{new}$ in memory?
  See reading: http://i.stanford.edu/~ullman/mmds/ch5.pdf

Measures generic popularity of a page
  Biased against topic-specific authorities
  Solution: Topic-Specific PageRank (next lecture)

Uses a single measure of importance
  Other models e.g., hubs-and-authorities
  Solution: Hubs-and-Authorities (next)

Susceptible to link spam
  Artificial link topographies created in order to boost page rank
  Solution: TrustRank (next lecture)

Interesting pages fall into two classes:

1. Authorities are pages containing useful information
  Newspaper home pages
  Course home pages
  Home pages of auto manufacturers

2. Hubs are pages that link to authorities
  List of newspapers
  Course bulletin
  List of US auto manufacturers

[Figure: example pages labeled NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]


A good hub links to many good authorities
A good authority is linked from many good hubs
Model using two scores for each node:
  Hub score and Authority score
  Represented as vectors h and a

HITS uses adjacency matrix A[i, j] = 1 if page i links to page j, 0 else

$A^T$, the transpose of A, is similar to the PageRank matrix M, but $A^T$ has 1s where M has fractions

[Figure: the Yahoo / Amazon / Msoft graph]

A =
        y   a   m
  y     1   1   1
  a     1   0   1
  m     0   1   0


Notation:
  Vector $a = (a_1, \ldots, a_n)$, $h = (h_1, \ldots, h_n)$
  Adjacency matrix (n x n): $A_{ij} = 1$ if i → j, else $A_{ij} = 0$

Then:
  $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j A_{ij} a_j$
  So: $h = A a$

Likewise:
  $a = A^T h$

The hub score of page i is proportional to the sum of the authority scores of the pages it links to: $h = \lambda A a$
  Constant λ is a scaling factor, $\lambda = 1/\sum_i h_i$

The authority score of page i is proportional to the sum of the hub scores of the pages it is linked from: $a = \mu A^T h$
  Constant μ is a scaling factor, $\mu = 1/\sum_i a_i$

The HITS algorithm:
  Initialize h, a to all 1s
  Repeat:
    h = A a
    Scale h so that it sums to 1.0
    a = $A^T$ h
    Scale a so that it sums to 1.0
  Until h, a converge (i.e., change very little)
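The loop can be sketched directly from these steps. The example below assumes the three-page adjacency matrix used in this lecture (Yahoo links to all three pages, Amazon to Yahoo and Msoft, Msoft to Amazon), runs a fixed number of iterations instead of testing convergence, and assumes the graph has at least one edge so the scaling never divides by zero. The function name `hits` is illustrative.

```python
def hits(A, iterations=100):
    """Alternate h = A a and a = A^T h, scaling each vector to sum to 1.0."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        # Hub score: sum of authority scores of the pages i links to.
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        total = sum(h)
        h = [x / total for x in h]
        # Authority score: sum of hub scores of the pages linking to i.
        a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]
        total = sum(a)
        a = [x / total for x in a]
    return h, a


# Rows and columns are ordered (yahoo, amazon, msoft); A[i][j] = 1 if i links to j.
A = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]
h, a = hits(A)
print([x / h[0] for x in h])  # hub scores relative to yahoo
print([x / a[0] for x in a])  # authority scores relative to yahoo
```

Scaling to sum 1.0 or to a largest entry of 1 only changes the overall scale of the vectors, not the ratios between pages, so the relative scores printed here do not depend on that choice.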


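$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$, $A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$

[Figure: the Yahoo / Amazon / Msoft graph]

Successive iterates, left to right (in this example each vector is scaled so that its largest entry is 1):

  a(yahoo)  =  1   1     1      1      ...   1
  a(amazon) =  1   1     4/5    0.75   ...   0.732
  a(msoft)  =  1   1     1      1      ...   1

  h(yahoo)  =  1   1     1      1      ...   1.000
  h(amazon) =  1   2/3   0.71   0.73   ...   0.732
  h(msoft)  =  1   1/3   0.29   0.27   ...   0.268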

Algorithm (here M plays the role of the adjacency matrix A):
  Set: $a = h = 1^n$
  Repeat:
    $h = M a$, $a = M^T h$
    Normalize

Then: $a = M^T (M a)$, where the inner product $M a$ is the new h and the result is the new a

Thus: $a = (M^T M) a$ and $h = (M M^T) h$
  a is being updated (in 2 steps): $M^T (M a) = (M^T M) a$
  h is updated (in 2 steps): $M (M^T h) = (M M^T) h$

Repeated matrix powering

Under reasonable assumptions about A, the HITS iterative algorithm converges to vectors h* and a*:
  h* is the principal eigenvector of matrix $A A^T$
  a* is the principal eigenvector of matrix $A^T A$

PageRank and HITS are two solutions to the same problem:
  What is the value of an in-link from u to v?
  In the PageRank model, the value of the link depends on the links into u
  In the HITS model, it depends on the value of the other links out of u

The destinies of PageRank and HITS post-1998 were very different