Unit 4

Note to other teachers and users of these slides: We would be delighted if you found our material useful for giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets


Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
 High dim. data: Locality-sensitive hashing, Clustering, Dimensionality reduction
 Graph data: PageRank, SimRank, Community Detection, Spam Detection
 Infinite data: Filtering data streams, Web advertising, Queries on streams
 Machine learning: SVM, Decision Trees, Perceptron, kNN
 Apps: Recommender systems, Association Rules, Duplicate document detection

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
Citation networks and Maps of science
[Börner et al., 2012]
[Figure: the Internet as a graph — routers connecting domain1, domain2, domain3]
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each
link of the graph once and only once.



 Web as a directed graph:
▪ Nodes: Webpages
▪ Edges: Hyperlinks
[Figure: a small web graph — a personal page ("I teach a class on Networks. CS224W: Classes are in the Gates building") linking to the Computer Science Department at Stanford, which links to Stanford University, with hyperlinks as directed edges]




 How to organize the Web?
 First try: Human curated
Web directories
▪ Yahoo, DMOZ, LookSmart
 Second try: Web Search
▪ Information Retrieval investigates:
Find relevant docs in a small
and trusted set
▪ Newspaper articles, Patents, etc.
▪ But: Web is huge, full of untrusted documents,
random things, web spam, etc.
2 challenges of web search:
 (1) Web contains many sources of information
Who to “trust”?
▪ Trick: Trustworthy pages may point to each other!
 (2) What is the “best” answer to query
“newspaper”?
▪ No single right answer
▪ Trick: Pages that actually know about newspapers
might all be pointing to many newspapers



 All web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
 There is large diversity in the web-graph node connectivity
Let’s rank the pages by the link structure!


 We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
▪ PageRank
▪ Topic-Specific (Personalized) PageRank
▪ Web Spam Detection Algorithms


 Idea: Links as votes
▪ A page is more important if it has more links
▪ In-coming links? Out-going links?
 Think of in-links as votes:
▪ www.stanford.edu has 23,400 in-links
▪ www.joe-schmoe.com has 1 in-link
 Are all in-links equal?
▪ Links from important pages count more
▪ Recursive question!


[Figure: example PageRank scores on a small graph — A: 3.3, B: 38.4, C: 34.3, D: 3.9, E: 8.1, F: 3.9, and the remaining nodes: 1.6 each]


 Each link’s vote is proportional to the importance of its source page
 If page j with importance rj has n out-links, each link gets rj / n votes
 Page j’s own importance is the sum of the votes on its in-links

[Figure: page j receives ri/3 from page i (which has 3 out-links) and rk/4 from page k (which has 4 out-links), so rj = ri/3 + rk/4; each of j’s 3 out-links then carries rj/3]


 A “vote” from an important page is worth more
 A page is important if it is pointed to by other important pages
 Define a “rank” rj for page j:

   rj = Σi→j ri / di        (di … out-degree of node i)

[Figure: “The web in 1839” — three pages y, a, m; y links to itself and to a, a links to y and m, m links to a]

“Flow” equations:
   ry = ry /2 + ra /2
   ra = ry /2 + rm
   rm = ra /2
Flow equations:
   ry = ry /2 + ra /2
   ra = ry /2 + rm
   rm = ra /2
 3 equations, 3 unknowns, no constants
▪ No unique solution
▪ All solutions equivalent modulo the scale factor
 Additional constraint forces uniqueness:
▪ ry + ra + rm = 1
▪ Solution: ry = 2/5, ra = 2/5, rm = 1/5
 Gaussian elimination works for small examples, but we need a better method for large web-scale graphs
 We need a new formulation!
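For this tiny example the flow equations can be solved directly by substitution; a minimal pure-Python sketch using exact fractions (variable names are illustrative):

```python
from fractions import Fraction

# Solve the flow equations by substitution (exact arithmetic):
#   r_y = r_y/2 + r_a/2  =>  r_y = r_a
#   r_a = r_y/2 + r_m
#   r_m = r_a/2
# With r_y = r_a and r_m = r_a/2, the constraint r_y + r_a + r_m = 1
# gives (1 + 1 + 1/2) * r_a = 1.
ra = Fraction(1) / (1 + 1 + Fraction(1, 2))
ry = ra
rm = ra / 2
print(ry, ra, rm)  # 2/5 2/5 1/5
```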
 Stochastic adjacency matrix M
▪ Let page i have di out-links
▪ If i → j, then Mji = 1/di, else Mji = 0
▪ M is a column stochastic matrix: columns sum to 1
 Rank vector r: a vector with one entry per page
▪ ri is the importance score of page i
▪ Σi ri = 1
 The flow equations rj = Σi→j ri / di can then be written

   r = M · r
 Remember the flow equation: rj = Σi→j ri / di
 Flow equation in matrix form: M · r = r
▪ Suppose page i links to 3 pages, including j
▪ Then column i of M has the entry Mji = 1/3, and row j of the product, rj = (M · r)j, picks up the term ri/3
 The flow equations can be written r = M · r
 So the rank vector r is an eigenvector of the stochastic web matrix M
▪ In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
(NOTE: x is an eigenvector of A with corresponding eigenvalue λ if Ax = λx)
▪ The largest eigenvalue of M is 1, since M is column stochastic (with non-negative entries)
▪ We know r has unit (L1) length and each column of M sums to one, so ‖Mr‖1 ≤ 1
 We can now efficiently solve for r! The method is called Power iteration
Example graph: y links to y and a; a links to y and m; m links to a.

        y    a    m
   y [  ½    ½    0 ]
   a [  ½    0    1 ]
   m [  0    ½    0 ]

r = M · r:

   ry     ½ ½ 0     ry           ry = ry /2 + ra /2
   ra  =  ½ 0 1  ·  ra    i.e.,  ra = ry /2 + rm
   rm     0 ½ 0     rm           rm = ra /2


 Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
 Power iteration: a simple iterative scheme
▪ Initialize: r(0) = [1/N, …, 1/N]T
▪ Iterate: r(t+1) = M · r(t), i.e., rj(t+1) = Σi→j ri(t) / di   (di … out-degree of node i)
▪ Stop when |r(t+1) − r(t)|1 < ε
|x|1 = Σ1≤i≤N |xi| is the L1 norm; any other vector norm, e.g., Euclidean, can be used
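The scheme above can be sketched in pure Python; the adjacency dict encodes the y–a–m example from the slides (names and tolerances are illustrative):

```python
# Power iteration on the y-a-m web graph.
# out_links maps each page to the pages it links to.
out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}
N = len(out_links)

r = {p: 1.0 / N for p in out_links}        # r(0) = [1/N, ..., 1/N]
for _ in range(200):
    r_new = {p: 0.0 for p in out_links}
    for i, dests in out_links.items():
        for j in dests:
            r_new[j] += r[i] / len(dests)  # each out-link carries r_i / d_i
    diff = sum(abs(r_new[p] - r[p]) for p in out_links)  # L1 distance
    r = r_new
    if diff < 1e-10:                       # stop when |r(t+1) - r(t)|_1 < eps
        break
print(r)  # approx {'y': 0.4, 'a': 0.4, 'm': 0.2}
```

The limit matches the exact solution ry = ra = 2/5, rm = 1/5 obtained from the flow equations.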


 Power Iteration:
▪ Set rj = 1/N
▪ 1: r′j = Σi→j ri / di
▪ 2: r = r′
▪ Goto 1

        y    a    m
   y [  ½    ½    0 ]        ry = ry /2 + ra /2
   a [  ½    0    1 ]        ra = ry /2 + rm
   m [  0    ½    0 ]        rm = ra /2

 Example:
   ry     1/3   1/3   5/12   9/24         6/15
   ra  =  1/3   3/6   1/3    11/24   …    6/15
   rm     1/3   1/6   3/12   1/6          3/15
   Iteration 0, 1, 2, …




Details!
 Power iteration: a method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
▪ r(1) = M · r(0)
▪ r(2) = M · r(1) = M(M r(0)) = M2 · r(0)
▪ r(3) = M · r(2) = M(M2 r(0)) = M3 · r(0)
 Claim: the sequence M · r(0), M2 · r(0), …, Mk · r(0), … approaches the dominant eigenvector of M


Details!
 Claim: the sequence M · r(0), M2 · r(0), …, Mk · r(0), … approaches the dominant eigenvector of M
 Proof:
▪ Assume M has n linearly independent eigenvectors x1, x2, …, xn with corresponding eigenvalues λ1, λ2, …, λn, where λ1 > λ2 > ⋯ > λn
▪ Vectors x1, x2, …, xn form a basis, and thus we can write: r(0) = c1x1 + c2x2 + ⋯ + cnxn
▪ M r(0) = M(c1x1 + c2x2 + ⋯ + cnxn)
         = c1(Mx1) + c2(Mx2) + ⋯ + cn(Mxn)
         = c1(λ1x1) + c2(λ2x2) + ⋯ + cn(λnxn)
▪ Repeated multiplication on both sides produces
   Mk r(0) = c1(λ1k x1) + c2(λ2k x2) + ⋯ + cn(λnk xn)
Details!
 Claim (continued): the sequence M · r(0), M2 · r(0), …, Mk · r(0), … approaches the dominant eigenvector of M
 Proof (continued):
▪ Repeated multiplication on both sides produces
   Mk r(0) = c1(λ1k x1) + c2(λ2k x2) + ⋯ + cn(λnk xn)
▪ Mk r(0) = λ1k [c1x1 + c2(λ2/λ1)k x2 + ⋯ + cn(λn/λ1)k xn]
▪ Since λ1 > λ2, the fractions λ2/λ1, λ3/λ1, … are < 1,
   and so (λi/λ1)k → 0 as k → ∞ (for all i = 2 … n)
▪ Thus: Mk r(0) ≈ c1 λ1k x1
▪ Note: if c1 = 0 then the method won’t converge
 Imagine a random web surfer:
▪ At any time t, the surfer is on some page i
▪ At time t + 1, the surfer follows an out-link from i uniformly at random
▪ Ends up on some page j linked from i
▪ Process repeats indefinitely
   (this matches rj = Σi→j ri / dout(i))
 Let p(t) be the vector whose ith coordinate is the probability that the surfer is at page i at time t
▪ So, p(t) is a probability distribution over pages


 Where is the surfer at time t + 1?
▪ Follows a link uniformly at random: p(t + 1) = M · p(t)
 Suppose the random walk reaches a state where p(t + 1) = M · p(t) = p(t);
then p(t) is a stationary distribution of the random walk
 Our original rank vector r satisfies r = M · r
▪ So, r is a stationary distribution for the random walk
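The random-surfer interpretation can be checked empirically: simulating a long walk on the y–a–m graph, the visit frequencies approach the stationary distribution (a sketch; the walk length and seed are arbitrary):

```python
import random

# Simulate the random surfer on the y-a-m graph: long-run visit
# frequencies should approach the stationary distribution (0.4, 0.4, 0.2).
out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}

random.seed(0)
visits = {p: 0 for p in out_links}
page = 'y'
steps = 200_000
for _ in range(steps):
    page = random.choice(out_links[page])  # follow an out-link uniformly
    visits[page] += 1
freq = {p: v / steps for p, v in visits.items()}
print(freq)  # roughly {'y': 0.4, 'a': 0.4, 'm': 0.2}
```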


 A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0


   rj(t+1) = Σi→j ri(t) / di,   or equivalently r = M r

 Does this converge?
 Does it converge to what we want?
 Are the results reasonable?


 Example (the “a ⇄ b” loop): rj(t+1) = Σi→j ri(t) / di

   ra     1   0   1   0
   rb  =  0   1   0   1   …
   Iteration 0, 1, 2, …

The iteration oscillates forever and never converges.
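The oscillation is easy to reproduce with a minimal sketch of the 2-node loop:

```python
# Two-node loop a <-> b: power iteration r(t+1) = M r(t) oscillates
# and never settles when started from r = [1, 0].
M = [[0.0, 1.0],   # a's new score comes entirely from b
     [1.0, 0.0]]   # b's new score comes entirely from a
r = [1.0, 0.0]
history = []
for _ in range(4):
    r = [M[0][0] * r[0] + M[0][1] * r[1],
         M[1][0] * r[0] + M[1][1] * r[1]]
    history.append(tuple(r))
print(history)  # [(0.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
```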


 Example (a → b, where b has no out-links): rj(t+1) = Σi→j ri(t) / di

   ra     1   0   0   0
   rb  =  0   1   0   0   …
   Iteration 0, 1, 2, …

All the importance “leaks out”: b is a dead end.
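The leak is equally easy to reproduce; b's all-zero column drains all the mass:

```python
# Dead end: a -> b, and b has no out-links, so column b of M is all
# zeros and the total score shrinks to nothing.
M = [[0.0, 0.0],   # nothing points to a
     [1.0, 0.0]]   # a points to b; b's column is all zeros (dead end)
r = [1.0, 0.0]
for _ in range(3):
    r = [M[0][0] * r[0] + M[0][1] * r[1],
         M[1][0] * r[0] + M[1][1] * r[1]]
print(r)  # [0.0, 0.0] -- all the importance has leaked out
```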


2 problems:
 (1) Some pages are dead ends (have no out-links)
▪ Random walk has “nowhere” to go to
▪ Such pages cause importance to “leak out”
 (2) Spider traps (all out-links are within the group)
▪ Random walker gets “stuck” in a trap
▪ And eventually spider traps absorb all importance
 Power Iteration (m is a spider trap: its only out-link points to itself):
▪ Set rj = 1/N
▪ rj = Σi→j ri / di
▪ And iterate

        y    a    m
   y [  ½    ½    0 ]        ry = ry /2 + ra /2
   a [  ½    0    0 ]        ra = ry /2
   m [  0    ½    1 ]        rm = ra /2 + rm

 Example:
   ry     1/3   2/6   3/12   5/24         0
   ra  =  1/3   1/6   2/12   3/24    …    0
   rm     1/3   3/6   7/12   16/24        1
   Iteration 0, 1, 2, …

All the PageRank score gets “trapped” in node m.
 The Google solution for spider traps: At each time step, the random surfer has two options
▪ With prob. β, follow a link at random
▪ With prob. 1−β, jump to some random page
▪ Common values for β are in the range 0.8 to 0.9
 Surfer will teleport out of a spider trap within a few time steps

[Figure: the y–a–m graph before and after adding teleport links out of the trap at m]
 Power Iteration (m is a dead end):
▪ Set rj = 1/N
▪ rj = Σi→j ri / di
▪ And iterate

        y    a    m
   y [  ½    ½    0 ]        ry = ry /2 + ra /2
   a [  ½    0    0 ]        ra = ry /2
   m [  0    ½    0 ]        rm = ra /2

 Example:
   ry     1/3   2/6   3/12   5/24         0
   ra  =  1/3   1/6   2/12   3/24    …    0
   rm     1/3   1/6   1/12   2/24         0
   Iteration 0, 1, 2, …

Here the PageRank “leaks” out since the matrix is not column stochastic.
 Teleports: Follow random teleport links with probability 1.0 from dead ends
▪ Adjust matrix accordingly:

        y    a    m                   y    a    m
   y [  ½    ½    0 ]            y [  ½    ½    ⅓ ]
   a [  ½    0    0 ]    -->     a [  ½    0    ⅓ ]
   m [  0    ½    0 ]            m [  0    ½    ⅓ ]
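The matrix adjustment amounts to replacing each all-zero column with 1/N; a minimal sketch on the y–a–m example:

```python
# Teleport out of dead ends: replace every all-zero column of M with
# 1/N, turning M into a column-stochastic matrix (m is the dead end).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]   # columns y, a, m; column m sums to 0
N = len(M)
for col in range(N):
    if sum(M[row][col] for row in range(N)) == 0:   # dead end found
        for row in range(N):
            M[row][col] = 1.0 / N                   # teleport uniformly
print(M)  # column m becomes [1/3, 1/3, 1/3]; all columns now sum to 1
```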


Why are dead-ends and spider traps a problem
and why do teleports solve the problem?
 Spider-traps are not a problem, but with traps
PageRank scores are not what we want
▪ Solution: Never get stuck in a spider trap by
teleporting out of it in a finite number of steps
 Dead-ends are a problem
▪ The matrix is not column stochastic so our initial
assumptions are not met
▪ Solution: Make matrix column stochastic by always
teleporting when there is nowhere else to go
 Google’s solution that does it all:
At each step, the random surfer has two options:
▪ With probability β, follow a link at random
▪ With probability 1−β, jump to some random page

 PageRank equation [Brin-Page, ’98]:

   rj = Σi→j β ri / di + (1−β)/N        (di … out-degree of node i)

This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead ends.
 PageRank equation [Brin-Page, ’98]:

   rj = Σi→j β ri / di + (1−β)/N

 The Google Matrix A:

   A = β M + (1−β) [1/N]N×N        ([1/N]N×N … N-by-N matrix with all entries 1/N)

 We have a recursive problem: r = A · r
And the Power method still works!
 What is β?
▪ In practice β = 0.8–0.9 (the surfer makes about 5 steps on avg., then jumps)
Example (β = 0.8; m is a spider trap):

             M                [1/N]N×N
           ½  ½  0          1/3  1/3  1/3
   0.8 ·   ½  0  0  + 0.2 · 1/3  1/3  1/3
           0  ½  1          1/3  1/3  1/3

               A
         7/15  7/15   1/15
    =    7/15  1/15   1/15
         1/15  7/15  13/15

   ry     1/3   0.33   0.24   0.26          7/33
   ra  =  1/3   0.20   0.20   0.18    …     5/33
   rm     1/3   0.46   0.52   0.56         21/33
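The example can be reproduced by building A explicitly and iterating (a minimal sketch; with the dense teleport term folded in, the spider trap no longer absorbs everything):

```python
# Build the Google matrix A = beta*M + (1-beta)/N for the spider-trap
# example (beta = 0.8) and run power iteration.
beta = 0.8
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]   # columns y, a, m; m is a spider trap
N = len(M)
A = [[beta * M[i][j] + (1 - beta) / N for j in range(N)]
     for i in range(N)]                     # every entry of A is > 0

r = [1.0 / N] * N
for _ in range(200):
    r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]
print(r)  # approx [7/33, 5/33, 21/33]: m is big, but no longer absorbs all
```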


 Key step is matrix-vector multiplication
▪ rnew = A · rold
 Easy if we have enough main memory to hold A, rold, rnew
 Say N = 1 billion pages, and we need 4 bytes for each entry (say)
▪ 2 billion entries for the two vectors: approx 8GB
▪ But A = β·M + (1−β)[1/N]N×N is dense: it has N2 = 1018 entries, far too many to store
 Suppose there are N pages
 Consider page i, with di out-links
 We have Mji = 1/di when i → j, and Mji = 0 otherwise
 The random teleport is equivalent to:
▪ Adding a teleport link from i to every other page and setting the transition probability to (1−β)/N
▪ Reducing the probability of following each out-link from 1/di to β/di
▪ Equivalently: taxing each page a fraction (1−β) of its score and redistributing it evenly


 r = A · r, where Aji = β Mji + (1−β)/N
 rj = Σi=1..N Aji · ri
 rj = Σi=1..N [β Mji + (1−β)/N] · ri
     = Σi=1..N β Mji · ri + (1−β)/N · Σi=1..N ri
     = Σi=1..N β Mji · ri + (1−β)/N        (since Σi ri = 1)
 So we get: r = β M · r + [(1−β)/N]N
   ([x]N … a vector of length N with all entries x)
Note: here we assumed M has no dead ends.
 We just rearranged the PageRank equation

   r = β M · r + [(1−β)/N]N

▪ where [(1−β)/N]N is a vector with all N entries equal to (1−β)/N
 M is a sparse matrix! (with no dead ends)
▪ 10 links per node on average, approx 10N entries
 So in each iteration, we need to:
▪ Compute rnew = β M · rold
▪ Add a constant value (1−β)/N to each entry in rnew
▪ Note: if M contains dead ends then Σj rjnew < 1, and we also have to renormalize rnew so that it sums to 1
 Input: Graph G and parameter β
▪ Directed graph G (can have spider traps and dead ends)
▪ Parameter β
 Output: PageRank vector rnew
▪ Set: rjold = 1/N
▪ Repeat until convergence (Σj |rjnew − rjold| < ε):
   ▪ ∀j: r′jnew = Σi→j β riold / di
     (r′jnew = 0 if the in-degree of j is 0)
   ▪ Re-insert the leaked PageRank:
     ∀j: rjnew = r′jnew + (1−S)/N, where S = Σj r′jnew
   ▪ rold = rnew
If the graph has no dead ends then the amount of leaked PageRank is 1−β. But since we have dead ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
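The algorithm above can be sketched in pure Python; the 3-node graph with a dead end is hypothetical illustrative data:

```python
# PageRank with teleports, re-inserting the leaked mass (1 - S)/N each
# round; node 2 is a dead end in this tiny example graph.
out_links = {0: [1, 2], 1: [2], 2: []}
N = len(out_links)
beta = 0.8

r = {j: 1.0 / N for j in out_links}
for _ in range(200):
    r_prime = {j: 0.0 for j in out_links}
    for i, dests in out_links.items():
        for j in dests:
            r_prime[j] += beta * r[i] / len(dests)
    S = sum(r_prime.values())            # < 1: teleport + dead-end leakage
    r_new = {j: v + (1.0 - S) / N for j, v in r_prime.items()}
    if sum(abs(r_new[j] - r[j]) for j in out_links) < 1e-12:
        r = r_new
        break
    r = r_new
print(r, sum(r.values()))  # scores sum to 1: the leak is fully re-inserted
```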
 Encode sparse matrix using only nonzero entries
▪ Space proportional roughly to number of links
▪ Say 10N, or 4·10·1 billion = 40GB
▪ Still won’t fit in memory, but will fit on disk

   source node | degree | destination nodes
   0           | 3      | 1, 5, 7
   1           | 5      | 17, 64, 113, 117, 245
   2           | 2      | 13, 23


 Assume enough RAM to fit rnew into memory
▪ Store rold and matrix M on disk
 1 step of power-iteration is:
   Initialize all entries of rnew = (1−β)/N
   For each page i (of out-degree di):
     Read into memory: i, di, dest1, …, destdi, rold(i)
     For j = 1…di:
       rnew(destj) += β · rold(i) / di

[Figure: rnew (entries 0–6) held in memory; the sparse records (source, degree, destinations) and rold are streamed from disk]
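One streamed update step can be sketched as follows; the records list stands in for the on-disk encoding, and the graph is hypothetical:

```python
# One power-iteration step streaming sparse records from "disk".
# Each record is (source, out_degree, destinations).
beta = 0.8
N = 4
records = [(0, 2, [1, 3]), (1, 1, [2]), (2, 1, [0]), (3, 1, [0])]

r_old = [1.0 / N] * N
r_new = [(1 - beta) / N] * N            # initialize with the teleport share
for i, di, dests in records:            # read one record at a time
    for j in dests:
        r_new[j] += beta * r_old[i] / di
print(r_new)  # [0.45, 0.15, 0.25, 0.15]; sums to 1 (no dead ends here)
```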
 Assume enough RAM to fit rnew into memory
▪ Store rold and matrix M on disk
 In each iteration, we have to:
▪ Read rold and M
▪ Write rnew back to disk
▪ Cost per iteration of Power method:
= 2|r| + |M|

 Question:
▪ What if we could not even fit rnew in memory?



[Figure: rnew split into blocks {0, 1}, {2, 3}, {4, 5}; M (src | degree | destinations: 0 | 4 | 0, 1, 3, 5; 1 | 2 | 0, 5; 2 | 2 | 3, 4) and rold on disk]

▪ Break rnew into k blocks that fit in memory
▪ Scan M and rold once for each block
 Similar to nested-loop join in databases
▪ Break rnew into k blocks that fit in memory
▪ Scan M and rold once for each block
 Total cost:
▪ k scans of M and rold
▪ Cost per iteration of Power method:
k(|M| + |r|) + |r| = k|M| + (k+1)|r|
 Can we do better?
▪ Hint: M is much bigger than r (approx 10-20x), so
we must avoid reading it k times per iteration
Break M into stripes! Each stripe contains only destination nodes in the corresponding block of rnew:

   stripe for rnew block {0, 1}:   src | degree | destination
                                   0   | 4      | 0, 1
                                   1   | 2      | 0
   stripe for rnew block {2, 3}:   0   | 4      | 3
                                   2   | 2      | 3
   stripe for rnew block {4, 5}:   0   | 4      | 5
                                   1   | 2      | 5
                                   2   | 2      | 4

(Note: each record keeps the full out-degree of its source, since the update needs ri/di.)
 Break M into stripes
▪ Each stripe contains only destination nodes in the corresponding block of rnew
 Some additional overhead per stripe
▪ But it is usually worth it
 Cost per iteration of Power method:
= |M|(1+ε) + (k+1)|r|


 Measures generic popularity of a page
▪ Biased against topic-specific authorities
▪ Solution: Topic-Specific PageRank (next)
 Uses a single measure of importance
▪ Other models of importance exist
▪ Solution: Hubs-and-Authorities
 Susceptible to link spam
▪ Artificial link topologies created in order to boost page rank
▪ Solution: TrustRank
