Unit 4
Note to other teachers and users of these slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
[Course overview diagram: Locality-sensitive hashing; Filtering data streams; PageRank, SimRank; Recommender systems; SVM; Dimensionality reduction; Duplicate document detection; Spam detection; Queries on streams; Perceptron, kNN]
[Figure: the Internet as a graph — routers connecting domains (domain1, domain3, …)]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 6
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each
link of the graph once and only once.
The "flow" equation: rj = Σi→j ri/di
▪ di … out-degree of node i
Example: if node j has in-links from node i (out-degree 3) and node k (out-degree 4):
  rj = ri/3 + rk/4
[Figure: node j passes rj/3 along each of its 3 out-links]
Flow equations for the y/a/m example:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
Flow equations:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
3 equations, 3 unknowns, no constant terms
▪ No unique solution
▪ All solutions equivalent modulo the scale factor
Additional constraint forces uniqueness:
▪ 𝒓𝒚 + 𝒓𝒂 + 𝒓𝒎 = 𝟏
▪ Solution: ry = 2/5, ra = 2/5, rm = 1/5
Gaussian elimination method works for
small examples, but we need a better
method for large web-size graphs
We need a new formulation!
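For a small example, the flow equations plus the normalization constraint can indeed be solved directly. A minimal NumPy sketch for the y/a/m system, replacing one redundant equation with the constraint ry + ra + rm = 1:

```python
import numpy as np

# Flow equations rewritten as (M - I) r = 0:
#   ry = ry/2 + ra/2
#   ra = ry/2 + rm
#   rm = ra/2
# The system is rank-deficient (no constants), so we replace the last
# equation with the constraint ry + ra + rm = 1 to force uniqueness.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
A = M - np.eye(3)
A[2, :] = 1.0                      # last row becomes: ry + ra + rm = 1
b = np.array([0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)
print(r)  # [0.4 0.4 0.2], i.e. ry = 2/5, ra = 2/5, rm = 1/5
```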
Stochastic adjacency matrix 𝑴
▪ Let page i have di out-links
▪ If i → j, then Mji = 1/di, else Mji = 0
▪ 𝑴 is a column stochastic matrix
▪ Columns sum to 1
Rank vector 𝒓: vector with an entry per page
▪ 𝑟𝑖 is the importance score of page 𝑖
▪ Σi ri = 1
The flow equations can be written rj = Σi→j ri/di, or in matrix form:
  r = M · r
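Building M from the edge list and checking r = M · r can be sketched as follows (the node indexing y=0, a=1, m=2 is an assumption of this sketch):

```python
import numpy as np

# Edges of the y/a/m example: y -> {y, a}, a -> {y, m}, m -> {a}.
# Assumed node indices: y=0, a=1, m=2.
edges = {0: [0, 1], 1: [0, 2], 2: [1]}
N = 3
M = np.zeros((N, N))
for i, outs in edges.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)   # M[j, i] = 1/d_i when i -> j

print(M.sum(axis=0))                # each column sums to 1 (column stochastic)

r = np.array([0.4, 0.4, 0.2])       # the solution of the flow equations
print(np.allclose(M @ r, r))        # True: r = M·r
```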
Remember the flow equation: rj = Σi→j ri/di
Flow equation in matrix form:
  M · r = r
▪ Suppose page i links to 3 pages, including j
[Figure: if page i (out-degree 3) links to page j, then Mji = 1/3, so the j-th entry of M · r picks up ri/3]
The flow equations can be written
𝒓 = 𝑴 ∙ 𝒓
So the rank vector r is an eigenvector of the
stochastic web matrix M
▪ In fact, r is the first or principal eigenvector of M, with corresponding eigenvalue 1
  (NOTE: x is an eigenvector of A with corresponding eigenvalue λ if Ax = λx)
▪ The largest eigenvalue of M is 1, since M is column stochastic (with non-negative entries)
▪ We know r is unit length and each column of M sums to one, so ‖Mr‖ ≤ 1
r = M · r for the y/a/m example:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2
        y  a  m
  M = y[ ½  ½  0 ]
      a[ ½  0  1 ]
      m[ 0  ½  0 ]
Example (iterating from r = (1/3, 1/3, 1/3)):
  Iteration:  0     1     2      3     …  limit
  ry:        1/3   1/3   5/12   9/24  …  6/15
  ra:        1/3   3/6   1/3    11/24 …  6/15
  rm:        1/3   1/6   3/12   1/6   …  3/15
Power iteration:
A method for finding the dominant eigenvector (the eigenvector corresponding to the largest eigenvalue)
▪ r(1) = M · r(0)
▪ r(2) = M · r(1) = M(M r(0)) = M² · r(0)
▪ r(3) = M · r(2) = M(M² r(0)) = M³ · r(0)
Claim:
Sequence 𝑴 ⋅ 𝒓 𝟎 , 𝑴𝟐 ⋅ 𝒓 𝟎 , … 𝑴𝒌 ⋅ 𝒓 𝟎 , …
approaches the dominant eigenvector of 𝑴
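A minimal NumPy sketch of power iteration on the y/a/m matrix from the earlier example (the tolerance and iteration cap are illustrative choices):

```python
import numpy as np

def power_iterate(M, eps=1e-12, max_iter=1000):
    """Repeatedly apply M to a uniform start vector until the L1 change is tiny."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)
    for _ in range(max_iter):
        r_new = M @ r
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new
    return r

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = power_iterate(M)
print(r)  # approaches (2/5, 2/5, 1/5), matching the flow-equation solution
```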
Proof:
▪ Assume M has n linearly independent eigenvectors,
𝑥1 , 𝑥2 , … , 𝑥𝑛 with corresponding eigenvalues
𝜆1 , 𝜆2 , … , 𝜆𝑛 , where 𝜆1 > 𝜆2 > ⋯ > 𝜆𝑛
▪ Vectors 𝑥1 , 𝑥2 , … , 𝑥𝑛 form a basis and thus we can write:
𝑟 (0) = 𝑐1 𝑥1 + 𝑐2 𝑥2 + ⋯ + 𝑐𝑛 𝑥𝑛
▪ 𝑴𝒓(𝟎) = 𝑴 𝒄𝟏 𝒙𝟏 + 𝒄𝟐 𝒙𝟐 + ⋯ + 𝒄𝒏 𝒙𝒏
= 𝑐1 (𝑀𝑥1 ) + 𝑐2 (𝑀𝑥2 ) + ⋯ + 𝑐𝑛 (𝑀𝑥𝑛 )
= 𝑐1 (𝜆1 𝑥1 ) + 𝑐2 (𝜆2 𝑥2 ) + ⋯ + 𝑐𝑛 (𝜆𝑛 𝑥𝑛 )
▪ Repeated multiplication on both sides produces
  Mᵏ r(0) = c1(λ1ᵏ x1) + c2(λ2ᵏ x2) + ⋯ + cn(λnᵏ xn)
▪ Mᵏ r(0) = λ1ᵏ [ c1 x1 + c2 (λ2/λ1)ᵏ x2 + ⋯ + cn (λn/λ1)ᵏ xn ]
▪ Since λ1 > λ2, the fractions λ2/λ1, λ3/λ1, … < 1
  and so (λi/λ1)ᵏ → 0 as k → ∞ (for all i = 2…n)
▪ Therefore Mᵏ r(0) ≈ c1(λ1ᵏ x1): the iterates approach the direction of the dominant eigenvector x1 (assuming c1 ≠ 0)
Example (two nodes that link only to each other — power iteration oscillates):
  Iteration:  0  1  2  3  …
  ra:         1  0  1  0  …
  rb:         0  1  0  1  …
  The sequence never converges.
Example (dead end — all importance leaks out):
  Iteration:  0  1  2  3  …
  ra:         1  0  0  0
  rb:         0  1  0  0
[Figures: small graphs on nodes a, m illustrating the two failure modes]
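The oscillation failure can be reproduced directly. A small sketch, assuming the oscillating example is two nodes that link only to each other:

```python
import numpy as np

# Two nodes pointing only at each other: the eigenvalues of M are +1 and -1,
# so power iteration oscillates and never converges.
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
r = np.array([1.0, 0.0])
history = []
for _ in range(4):
    r = M @ r
    history.append(r.copy())
print(history)  # [0,1], [1,0], [0,1], [1,0] — oscillation, no convergence
```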
Power Iteration:
▪ Set rj = 1/N
▪ rj = Σi→j ri/di
▪ And iterate

Example: y/a/m graph where m is a dead end:
        y  a  m
  M = y[ ½  ½  0 ]
      a[ ½  0  0 ]
      m[ 0  ½  0 ]
  ry = ry/2 + ra/2
  ra = ry/2
  rm = ra/2
Example:
  Iteration:  0     1     2      3     …  limit
  ry:        1/3   2/6   3/12   5/24  …  0
  ra:        1/3   1/6   2/12   3/24  …  0
  rm:        1/3   1/6   1/12   2/24  …  0
Here the PageRank “leaks” out since the matrix is not stochastic.
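The leak can be seen numerically. A short sketch iterating the deficient (non-stochastic) matrix above:

```python
import numpy as np

# y/a/m graph where m is a dead end: its column in M is all zeros,
# so the matrix is not column stochastic and total rank mass shrinks.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
r = np.full(3, 1.0 / 3.0)
for _ in range(100):
    r = M @ r
print(r.sum())   # total mass has leaked toward 0 instead of staying 1
```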
Teleports: Follow random teleport links with
probability 1.0 from dead-ends
▪ Adjust matrix accordingly
Before (m is a dead end) → after (m's column replaced by 1/N each):
      y  a  m          y  a  m
  y[ ½  ½  0 ]     y[ ½  ½  ⅓ ]
  a[ ½  0  0 ]  →  a[ ½  0  ⅓ ]
  m[ 0  ½  0 ]     m[ 0  ½  ⅓ ]
  rj = Σi→j β (ri/di) + (1 − β) (1/N)
This formulation assumes that 𝑴 has no dead ends. We can either
preprocess matrix 𝑴 to remove all dead ends or explicitly follow random
teleport links with probability 1.0 from dead-ends.
PageRank equation [Brin-Page, ‘98]
  rj = Σi→j β (ri/di) + (1 − β) (1/N)
The Google Matrix A:
  A = β M + (1 − β) [1/N]N×N
  where [1/N]N×N is the N×N matrix with all entries 1/N
We have a recursive problem: 𝒓 = 𝑨 ⋅ 𝒓
And the Power method still works!
What is β?
▪ In practice β = 0.8–0.9 (make ~5 steps on average, then jump)
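Putting it together, a sketch of PageRank via power iteration on the Google matrix, using the y/a/m spider-trap example (β = 0.8; the iteration count is an illustrative choice):

```python
import numpy as np

# A = beta*M + (1-beta)*[1/N]_{NxN}, with m a spider trap (links only to itself).
beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = M.shape[0]
A = beta * M + (1 - beta) * np.full((N, N), 1.0 / N)

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
print(r)        # teleports keep the trap from absorbing all the rank
print(r.sum())  # mass is conserved: still sums to 1
```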
Example (β = 0.8; m is a spider trap linking only to itself):
            M                [1/N]N×N            A
      [ ½  ½  0 ]         [ ⅓  ⅓  ⅓ ]     [ 7/15  7/15   1/15 ]
  0.8 [ ½  0  0 ]  + 0.2  [ ⅓  ⅓  ⅓ ]  = [ 7/15  1/15   1/15 ]
      [ 0  ½  1 ]         [ ⅓  ⅓  ⅓ ]     [ 1/15  7/15  13/15 ]
Question:
▪ What if we could not even fit rnew in memory?
[Table: one stripe of M, holding only destinations in block {4, 5} of rnew —
  src 0, degree 4, destinations 4, 5
  src 1, degree 3, destination 5
  src 2, degree 2, destination 4]
Break M into stripes! Each stripe contains only
destination nodes in the corresponding block of rnew
Some additional overhead per stripe
▪ But it is usually worth it
Cost per iteration of the Power method:
  = |M|(1 + ε) + (k + 1)|r|
(each stripe is read once — ε is the extra per-stripe overhead — and rold is read once per stripe, plus one write of rnew; k is the number of stripes)
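One block-stripe update can be sketched as follows. The tiny graph, block sizes, and β here are illustrative assumptions, not from the slides; the point is that each stripe updates only one block of rnew while reading the full rold:

```python
import numpy as np

# Each stripe of M holds only the edges whose destination falls in one
# block of r_new, so a pass over a stripe touches just that block.
edges = {0: [1, 2, 3], 1: [0, 3], 2: [1], 3: [2]}   # i -> list of destinations
N, beta = 4, 0.8
blocks = [(0, 2), (2, 4)]                           # r_new split into 2 blocks

r_old = np.full(N, 1.0 / N)
r_new = np.zeros(N)
for lo, hi in blocks:                               # one stripe per block
    # stripe: for each source node, keep only destinations inside [lo, hi)
    stripe = {i: [j for j in dests if lo <= j < hi]
              for i, dests in edges.items()}
    for i, dests in stripe.items():
        for j in dests:
            # d_i is the out-degree from the FULL edge list, not the stripe
            r_new[j] += beta * r_old[i] / len(edges[i])
r_new += (1 - beta) / N                             # teleport contribution
print(r_new)  # one full power-method step; mass sums to 1 (no dead ends here)
```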