Lecture 9
[Course map: Dimensionality reduction · Duplicate document detection · Spam detection · Queries on streams · Perceptron · kNN]
Examples of graphs:
[Figure: Facebook social graph — 4 degrees of separation, Backstrom-Boldi-Rosa-…]
[Figure: Connections between political blogs]
[Figure: Citation networks and maps of science, Börner et al., 2012]
[Figure: the Internet as a graph — routers connecting domain1, domain2, domain3]
Seven Bridges of Königsberg [Euler, 1735]:
Return to the starting point by traveling each link of the graph once and only once.
Web as a directed graph:
Nodes: Webpages
Edges: Hyperlinks
[Figure: example pages linking to each other — "I teach a class on Networks." → "CS224W: Classes are in the Gates building." → "Computer Science Department at Stanford" → "Stanford University"]
How to organize the Web?
First try: human-curated Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval investigates: finding relevant docs in a small and trusted set
Newspaper articles, Patents, etc.
But: the Web is huge and full of untrusted documents.
2 challenges of web search:
(1) The Web contains many sources of information. Who to "trust"?
Trick: trustworthy pages may point to each other!
All web pages are not equally "important":
thispersondoesnotexist.com vs. www.stanford.edu
We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
PageRank
Topic-Specific (Personalized) PageRank
Web Spam Detection Algorithms
Idea: Links as votes
A page is more important if it has more links.
In-coming links? Out-going links?
Think of in-links as votes:
www.stanford.edu has millions of in-links
thispersondoesnotexist.com has a few thousand in-links
[Figure: example PageRank scores on a small graph — e.g. nodes with scores 8.1, 3.9, 3.9, 1.6, …]
Each link's vote is proportional to the importance of its source page.
If page j with importance r_j has n out-links, each link gets r_j / n votes.
[Figure: page j with in-links from page i (3 out-links) and page k (4 out-links), so r_j = r_i/3 + r_k/4; each of j's 3 out-links then carries r_j/3 votes]
A "vote" from an important page is worth more.
A page is important if it is pointed to by other important pages.
Define a "rank" r_j for page j:

  r_j = Σ_{i→j} r_i / d_i      (d_i … out-degree of node i)

Example ("the web in 1839"): three pages y, a, m with links y→y, y→a, a→y, a→m, m→a.
"Flow" equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
Flow equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
3 equations, 3 unknowns, no constants → no unique solution (any solution can be scaled).
Adding a normalization constraint forces uniqueness: r_y + r_a + r_m = 1.
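The flow equations can be solved directly once the sum-to-one constraint is added; a minimal sketch with NumPy (matrix entries taken from the y/a/m example):

```python
import numpy as np

# Flow equations r = M r for the y/a/m example, rewritten as
# (M - I) r = 0 plus the normalization constraint sum(r) = 1.
M = np.array([[0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
              [0.5, 0.0, 1.0],   # r_a = r_y/2 + r_m
              [0.0, 0.5, 0.0]])  # r_m = r_a/2
A = np.vstack([M - np.eye(3), np.ones((1, 3))])  # 4 equations, 3 unknowns
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)        # least-squares solve
print(r)  # approx [0.4, 0.4, 0.2], i.e. r_y = r_a = 2/5, r_m = 1/5
```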
Matrix formulation: define the stochastic adjacency matrix M with M_ji = 1/d_i if i → j, else 0 (columns sum to 1). The flow equations then become r = M·r:

       y   a   m
  y  [ ½   ½   0 ]      r_y = r_y/2 + r_a/2
  a  [ ½   0   1 ]      r_a = r_y/2 + r_m
  m  [ 0   ½   0 ]      r_m = r_a/2

  r = M·r
The flow equations can be written as r = M·r.
So the rank vector r is an eigenvector of M with eigenvalue 1 (recall: x is an eigenvector of A with corresponding eigenvalue λ if A·x = λ·x).
The math: the limiting distribution is the principal eigenvector of M — this is PageRank.
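As a quick numerical check (a sketch, not part of the slides), NumPy's eigendecomposition recovers r as the eigenvector of M for eigenvalue 1:

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
vals, vecs = np.linalg.eig(M)
k = int(np.argmin(np.abs(vals - 1.0)))  # pick the eigenvalue closest to 1
r = np.real(vecs[:, k])
r = r / r.sum()                          # normalize to a distribution
print(np.real(vals[k]), r)               # eigenvalue 1, r approx [0.4 0.4 0.2]
```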
Power Iteration:
Set r_j = 1/N.
1: r'_j = Σ_{i→j} r_i / d_i
2: r = r'
Goto 1.

       y   a   m
  y  [ ½   ½   0 ]      r_y = r_y/2 + r_a/2
  a  [ ½   0   1 ]      r_a = r_y/2 + r_m
  m  [ 0   ½   0 ]      r_m = r_a/2

Example (iteration 0, 1, 2, 3, …):
  r_y   1/3   1/3   5/12   9/24   …   6/15
  r_a   1/3   3/6   1/3    11/24  …   6/15
  r_m   1/3   1/6   3/12   1/6    …   3/15
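The power iteration above can be sketched in a few lines (the tolerance is illustrative):

```python
import numpy as np

def power_iterate(M, eps=1e-12):
    n = M.shape[0]
    r = np.full(n, 1.0 / n)        # set r_j = 1/N
    while True:
        r_new = M @ r              # r'_j = sum_{i->j} r_i / d_i
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new                  # r = r', goto 1

M = np.array([[0.5, 0.5, 0.0],     # the y/a/m example matrix
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))  # converges to [6/15, 6/15, 3/15] = [0.4, 0.4, 0.2]
```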
Imagine a random web surfer:
At any time t, the surfer is on some page i.
At time t+1, the surfer follows an out-link from i uniformly at random.
Ends up on some page j linked from i.
The process repeats indefinitely.
Let:
p(t) … vector whose i-th coordinate is the probability that the surfer is at page i at time t.
So, p(t) is a probability distribution over pages.
Where is the surfer at time t+1?
Follows a link uniformly at random: p(t+1) = M·p(t).
Suppose the random walk reaches a state where p(t+1) = M·p(t) = p(t). Then p(t) is a stationary distribution of the random walk.
Our original rank vector r satisfies r = M·r.
So, r is a stationary distribution for the random walk.
A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions, the stationary distribution is unique and is eventually reached regardless of the initial distribution.
Sanity check for an undirected graph with m edges: substitute r_i = d_i/2m into the flow equation to get r_j = Σ_{i→j} (d_i/2m)/d_i = d_j/2m, so degree over 2m is stationary.
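The surfer process can also be simulated directly; a sketch on the y/a/m example (seed and step count are illustrative), where visit frequencies approach the stationary distribution r = (0.4, 0.4, 0.2):

```python
import random
from collections import Counter

out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}
random.seed(0)
node = 'y'
steps = 200_000
visits = Counter()
for _ in range(steps):
    node = random.choice(out_links[node])  # follow an out-link uniformly
    visits[node] += 1
p = {v: c / steps for v, c in visits.items()}
print(p)  # visit frequencies close to 0.4 (y), 0.4 (a), 0.2 (m)
```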
The update applied at each step: r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

Example (a ↔ b, power iteration oscillates):
  r_a   1   0   1   0   …
  r_b   0   1   0   1   …
  Iteration 0, 1, 2, …

Example (a → b, b is a dead end, all score leaks out):
  r_a   1   0   0   0   …
  r_b   0   1   0   0   …
  Iteration 0, 1, 2, …
Dead end
Two problems:
(1) Dead ends: Some
pages have no out-links
Random walk has “nowhere”
to go to
Such pages cause importance
to “leak out”
Power Iteration with a spider trap (m links only to itself):

       y   a   m
  y  [ ½   ½   0 ]      r_y = r_y/2 + r_a/2
  a  [ ½   0   0 ]      r_a = r_y/2
  m  [ 0   ½   1 ]      r_m = r_a/2 + r_m

Set r_j = 1/N, apply r'_j = Σ_{i→j} r_i/d_i, and iterate. m is a spider trap.

Example (iteration 0, 1, 2, 3, …):
  r_y   1/3   2/6   3/12   5/24   …   0
  r_a   1/3   1/6   2/12   3/24   …   0
  r_m   1/3   3/6   7/12   16/24  …   1
All the PageRank score gets "trapped" in node m.
The Google solution for spider traps: at each time step, the random surfer has two options:
With probability β, follow a link at random.
With probability 1-β, jump to some random page.
β is typically in the range 0.8 to 0.9.
The surfer will teleport out of a spider trap within a few time steps.
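A sketch of power iteration with teleports on the spider-trap example (β = 0.8 here): the trap still collects the largest share of the score, but no longer everything. For this graph and β = 0.8 the exact fixed point works out to (7/33, 5/33, 21/33).

```python
import numpy as np

def pagerank_teleport(M, beta=0.8, eps=1e-12):
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        # with prob. beta follow a link, with prob. 1-beta teleport
        r_new = beta * (M @ r) + (1.0 - beta) / n
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# spider-trap example: m links only to itself
M_trap = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 1.0]])
print(pagerank_teleport(M_trap))  # approx [7/33, 5/33, 21/33]
```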
Power Iteration with a dead end (m has no out-links):

       y   a   m
  y  [ ½   ½   0 ]      r_y = r_y/2 + r_a/2
  a  [ ½   0   0 ]      r_a = r_y/2
  m  [ 0   ½   0 ]      r_m = r_a/2

Set r_j = 1/N, apply r'_j = Σ_{i→j} r_i/d_i, and iterate.

Example (iteration 0, 1, 2, 3, …):
  r_y   1/3   2/6   3/12   5/24   …   0
  r_a   1/3   1/6   2/12   3/24   …   0
  r_m   1/3   1/6   1/12   2/24   …   0
The solution for dead ends: always teleport from a dead end. Adjust the matrix so that the dead end's column teleports everywhere uniformly:

       y   a   m            y   a   m
  y  [ ½   ½   0 ]     y  [ ½   ½   ⅓ ]
  a  [ ½   0   0 ]  →  a  [ ½   0   ⅓ ]
  m  [ 0   ½   0 ]     m  [ 0   ½   ⅓ ]
Why are dead-ends and spider traps a problem, and why do teleports solve the problem?
Spider traps are not a problem for convergence, but with traps the PageRank scores are not what we want.
Solution: never get stuck in a spider trap by teleporting out of it in a finite number of steps.
Dead ends are a problem: the matrix is not column stochastic, so our initial assumptions are not met.
Solution: make the matrix column stochastic by always teleporting when there is nowhere else to go.
Google's solution that does it all:
At each step, the random surfer has two options:
With probability β, follow a link at random.
With probability 1-β, jump to some random page.

PageRank equation:
  r_j = Σ_{i→j} β r_i/d_i + (1-β) 1/N

Complete algorithm:
Input: graph G and parameter β.
Output: PageRank vector r^new.
Set: r_j^old = 1/N.
Repeat until convergence (Σ_j |r_j^new − r_j^old| < ε):
  ∀j: r'_j^new = Σ_{i→j} β r_i^old / d_i
      r'_j^new = 0 if in-degree of j is 0
  Now re-insert the leaked PageRank:
  ∀j: r_j^new = r'_j^new + (1−S)/N, where S = Σ_j r'_j^new
  r^old = r^new
If the graph has no dead ends, then the amount of leaked PageRank is exactly 1-β. But since we have dead ends, the amount of leaked PageRank may be larger, so we have to account for it explicitly by computing S.
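The complete algorithm can be sketched as follows (the graph encoding and parameters are illustrative); the leaked mass (1−S)/N is re-inserted each iteration, so the scores sum to 1 even with dead ends:

```python
import numpy as np

def pagerank(out_links, n, beta=0.8, eps=1e-12):
    r = np.full(n, 1.0 / n)                  # r_j^old = 1/N
    while True:
        r_prime = np.zeros(n)
        for i, dests in out_links.items():
            for j in dests:
                r_prime[j] += beta * r[i] / len(dests)
        s = r_prime.sum()                    # S = sum_j r'_j
        r_new = r_prime + (1.0 - s) / n      # re-insert leaked PageRank
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# y=0, a=1, m=2; m is a dead end (no out-links)
out_links = {0: [0, 1], 1: [0, 2]}
r = pagerank(out_links, 3)
print(r, r.sum())  # a proper distribution despite the dead end
```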
Encode the sparse matrix using only its nonzero entries.
Space is roughly proportional to the number of links: say 10N, or 4*10*1 billion = 40GB.
Still won't fit in memory, but will fit on disk.

  source node   degree   destination nodes
  0             3        1, 5, 7
  1             5        17, 64, 113, 117, 245
  2             2        13, 23
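A sketch of this encoding (the rows are the example table above):

```python
# store (source, out-degree, destinations) per page instead of a dense
# N x N matrix; space is roughly one integer per link plus a small
# per-row header
rows = [
    (0, 3, [1, 5, 7]),
    (1, 5, [17, 64, 113, 117, 245]),
    (2, 2, [13, 23]),
]
n_links = sum(deg for _, deg, _ in rows)
print(n_links)  # -> 10
```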
Assume enough RAM to fit r^new into memory.
Store r^old and matrix M on disk.
One step of power iteration is (assuming no dead ends):
Initialize all entries of r^new to (1-β)/N.
For each page i (of out-degree d_i):
  Read into memory: i, d_i, dest_1, …, dest_{d_i}, r^old(i)
  For j = 1…d_i:
    r^new(dest_j) += β r^old(i) / d_i

  source   degree   destination
  0        3        1, 5, 6
  1        4        17, 64, 113, 117
  2        2        13, 23
4/27/2021 Jure Leskovec, Stanford C246: Mining Massive Datasets 60
Assume enough RAM to fit r^new into memory; store r^old and matrix M on disk.
In each iteration, we have to:
Read r^old and M
Write r^new back to disk
Cost per iteration of the Power method: = 2|r| + |M|
Question: what if we could not even fit r^new in memory?
If r^new does not fit in memory, break it into k blocks that do fit, and update one block at a time, scanning M and r^old once per block:

  src   degree   destination
  0     4        0, 1, 3, 5
  1     2        0, 5
  2     2        3, 4

Can we do better?
Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration.
Stripe for r^new block {0, 1}:
  src   degree   destination
  0     4        0, 1
  1     2        0
Stripe for r^new block {2, 3}:
  0     4        3
  2     2        3
Stripe for r^new block {4, 5}:
  0     4        5
  1     2        5
  2     2        4
Break M into stripes!
Each stripe contains only destination nodes in the corresponding block of r^new.
There is some additional overhead per stripe, but it is usually worth it.
Cost per iteration of the Power method: = |M|(1 + ε) + (k+1)|r|
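The striping step can be sketched as follows (block size 2, using the 6-node example above); note that each stripe keeps the source's full out-degree, so the r^old(i)/d_i update stays correct:

```python
from collections import defaultdict

# rows of M: (source, out-degree, destinations)
M_rows = [(0, 4, [0, 1, 3, 5]),
          (1, 2, [0, 5]),
          (2, 2, [3, 4])]
block = 2                                   # nodes per block of r_new
stripes = defaultdict(dict)                 # stripe id -> {src: (deg, dests)}
for src, deg, dests in M_rows:
    for d in dests:
        _deg, ds = stripes[d // block].setdefault(src, (deg, []))
        ds.append(d)                        # keep only dests in this stripe
print(dict(stripes[0]))  # -> {0: (4, [0, 1]), 1: (2, [0])}
print(dict(stripes[2]))  # -> {0: (4, [5]), 1: (2, [5]), 2: (2, [4])}
```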