09 PageRank
material useful for giving your own lectures. Feel free to use these slides verbatim, or to
modify them to fit your own needs. If you make use of a significant portion of these slides
in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
[Course overview map: locality-sensitive hashing, PageRank & SimRank, recommender systems, SVM, filtering data streams, dimensionality reduction, duplicate document detection, spam detection, queries on streams, perceptron & kNN]
Jure Leskovec & Mina Ghashami, Stanford C246: Mining Massive Datasets
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
Citation networks and Maps of science
[Börner et al., 2012]
[Figure: the Internet as a graph — routers connecting domain1, domain2, and domain3]
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each
link of the graph once and only once.
¡ Web as a directed graph:
§ Nodes: Webpages
§ Edges: Hyperlinks
[Figure: a web page reading “I teach a class on Networks. CS224W: Classes are in the Gates building”, with hyperlinks leading to pages for the Computer Science Department at Stanford and Stanford University]
¡ How to organize the Web?
¡ First try: Human curated
Web directories
§ Yahoo, DMOZ, LookSmart
¡ Second try: Web Search
§ Information Retrieval investigates:
Find relevant docs in a small
and trusted set
§ Newspaper articles, Patents, etc.
§ But: Web is huge, full of untrusted documents,
random things, web spam, etc.
2 challenges of web search:
¡ (1) Web contains many sources of information
Who to “trust”?
§ Trick: Trustworthy pages may point to each other!
¡ (2) What is the “best” answer to query
“newspaper”?
§ No single right answer
§ Trick: Pages that actually know about newspapers
might all be pointing to many newspapers
¡ All web pages are not equally “important”
thispersondoesnotexist.com vs. www.stanford.edu
¡ We will cover the following Link Analysis
approaches for computing importance
of nodes in a graph:
§ PageRank
§ Topic-Specific (Personalized) PageRank
§ Web Spam Detection Algorithms
¡ Idea: Links as votes
§ Page is more important if it has more links
§ In-coming links? Out-going links?
¡ Think of in-links as votes:
§ www.stanford.edu has millions of in-links
§ thispersondoesnotexist.com has a few thousand in-links
¡ Web pages are important if people visit them
a lot.
¡ But we can’t watch everybody using the Web.
¡ A good surrogate for visiting pages is to
assume people follow links randomly.
¡ Leads to random surfer model:
§ Start at a random page and follow random out-
links repeatedly, from whatever page you are at.
§ PageRank = limiting probability of being at a page
at any point in time.
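The random-surfer definition can be checked by direct simulation. Below is a minimal sketch (plain Python; the three-page graph is the y/a/m example used later in this lecture, and the step count is an arbitrary illustrative choice): follow random out-links for many steps and the visit frequencies approach the limiting probabilities.

```python
import random

# Hypothetical 3-page web: y links to {y, a}, a links to {y, m}, m links to {a}.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

random.seed(0)                 # deterministic run for reproducibility
page = "y"                     # starting page
steps = 200_000
visits = {p: 0 for p in links}
for _ in range(steps):
    page = random.choice(links[page])   # follow a random out-link
    visits[page] += 1

freq = {p: c / steps for p, c in visits.items()}
print(freq)   # close to the limiting probabilities {y: 0.4, a: 0.4, m: 0.2}
```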
¡ Solve the recursive equation: “importance of a
page = its share of the importance of each of its
predecessor pages”
§ Equivalent to the random-surfer definition of
PageRank
[Figure: example PageRank scores on a small graph — B: 38.4, C: 34.3, E: 8.1, D: 3.9, F: 3.9, A: 3.3, and five peripheral nodes at 1.6 each]
¡ Each link’s vote is proportional to the importance of its source page
§ If page j with importance rj has 3 out-links, each out-link carries rj/3 votes
§ If j’s in-links come from page i (with 3 out-links) and page k (with 4 out-links):
rj = ri/3 + rk/4
¡ A “vote” from an important page is worth more:
rj = Σi→j ri / di
𝒅𝒊 … out-degree of node 𝒊
¡ Example (“The web in 1839”, pages y, a, m):
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
𝒓𝒋 are the solutions to the “flow” equations
Flow equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
¡ 3 equations, 3 unknowns, no constants
§ No unique solution
§ All solutions equivalent modulo the scale factor
¡ Additional constraint forces uniqueness:
§ ry + ra + rm = 1
§ Solution: ry = 2/5, ra = 2/5, rm = 1/5
¡ Gaussian elimination method works for
small examples, but we need a better
method for large web-size graphs
¡ We need a new formulation!
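For a system this small, Gaussian elimination is indeed enough. A sketch (pure Python; the helper `solve` is an illustrative implementation) that solves the y/a/m flow equations, with the normalization constraint substituted for one redundant equation:

```python
# Tiny Gauss-Jordan solver -- fine for 3 unknowns, hopeless at web scale,
# which is exactly the slide's point.
def solve(A, b):
    n = len(A)
    A = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        # partial pivoting: pick the row with the largest entry in this column
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Flow equations rewritten as (I - M) r = 0, with the third equation replaced
# by the normalization r_y + r_a + r_m = 1 to force a unique solution.
A = [[0.5, -0.5, 0.0],
     [-0.5, 1.0, -1.0],
     [1.0, 1.0, 1.0]]
b = [0.0, 0.0, 1.0]
ry, ra, rm = solve(A, b)
print(ry, ra, rm)   # ~ 0.4 0.4 0.2, i.e. 2/5, 2/5, 1/5
```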
¡ Define stochastic adjacency matrix 𝑴
§ Let page 𝑖 have 𝑑𝑖 out-links
§ If 𝑖 → 𝑗, then 𝑀𝑗𝑖 = 1/𝑑𝑖 , else 𝑀𝑗𝑖 = 0
§ 𝑴 is a column stochastic matrix
§ Each column sums to 1
¡ Define rank vector 𝒓: a vector with one entry per page; it captures importance of the page
§ 𝑟𝑖 = importance score of page 𝑖
§ Σ𝑖 𝑟𝑖 = 1
¡ The flow equations rj = Σi→j ri / di can be written
𝒓 = 𝑴 ⋅ 𝒓
¡ Remember the flow equation: rj = Σi→j ri / di
¡ Flow equation in the matrix form:
𝑴 ⋅ 𝒓 = 𝒓
§ Suppose page i links to 3 pages, including j
§ Then column i of M has the entry 1/di = 1/3 in row j, and the product M ⋅ r adds ri/3 into rj
        y   a   m
    y [ ½   ½   0 ]
M = a [ ½   0   1 ]
    m [ 0   ½   0 ]

r = M · r
ry = ry /2 + ra /2        ry     ½ ½ 0     ry
ra = ry /2 + rm           ra  =  ½ 0 1  ·  ra
rm = ra /2                rm     0 ½ 0     rm
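A sketch of how such a column-stochastic M can be built from an edge list (plain Python; the function name `stochastic_matrix` and the 0/1/2 indexing of y/a/m are illustrative choices):

```python
# Build the column-stochastic matrix M from an edge list:
# M[j][i] = 1/d_i if i -> j, else 0, so column i holds the out-link
# probabilities of page i.
def stochastic_matrix(edges, n):
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1
    M = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        M[j][i] = 1.0 / out_degree[i]
    return M

# The y/a/m example: y->y, y->a, a->y, a->m, m->a (pages indexed 0, 1, 2).
edges = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)]
M = stochastic_matrix(edges, 3)
print(M)   # [[0.5, 0.5, 0.0], [0.5, 0.0, 1.0], [0.0, 0.5, 0.0]]
```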
¡ The flow equations can be written
𝒓 = 𝑴 ⋅ 𝒓
¡ So the rank vector r is an eigenvector of the stochastic web matrix M
§ Starting from any stochastic vector 𝒖, the limit
𝑴(𝑴(… 𝑴(𝑴𝒖)))
is the long-term distribution of the surfers
§ The math: limiting distribution = principal eigenvector of 𝑴 = PageRank
(NOTE: x is an eigenvector with the corresponding eigenvalue λ if 𝑨𝒙 = 𝝀𝒙)
§ Note: If 𝒓 is the limit of 𝑴𝑴 … 𝑴𝒖, then 𝒓 satisfies the equation 𝒓 = 𝑴𝒓, so r is an eigenvector of 𝑴 with eigenvalue 1
¡ We can now efficiently solve for r!
The method is called Power iteration
¡ Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks
¡ Power iteration: a simple iterative scheme
§ Initialize: r(0) = [1/N,…,1/N]T
(so that r is a distribution: it sums to 1)
§ Iterate: r(t+1) = M · r(t), i.e.
rj(t+1) = Σi→j ri(t) / di        di … out-degree of node i
§ Stop when |r(t+1) – r(t)|1 < ε
|x|1 = Σ1≤i≤N |xi| is the L1 norm
¡ Power Iteration:                      y   a   m
§ Set 𝑟 = [1/N, 1/N, 1/N]          y [ ½   ½   0 ]
§ 1: 𝑟′ = 𝑀 ⋅ 𝑟                    a [ ½   0   1 ]
§ 2: 𝑟 = 𝑟′                        m [ 0   ½   0 ]
§ Goto 1
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
¡ Example:
     ry      1/3   1/3   5/12   9/24         6/15
r =  ra  =   1/3   3/6   1/3    11/24   …    6/15
     rm      1/3   1/6   3/12   1/6          3/15
Iteration 0, 1, 2, …
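The loop above translates directly into code. A minimal sketch (plain Python; `power_iterate` is an illustrative name), stopping on the L1 distance as in the slides:

```python
# Power iteration: r(0) = [1/N,...,1/N], repeat r <- M*r until the L1
# change drops below eps.
def power_iterate(M, eps=1e-10):
    n = len(M)
    r = [1.0 / n] * n
    while True:
        r_new = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:   # L1 distance
            return r_new
        r = r_new

# The y/a/m example matrix from the slide.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iterate(M)
print([round(x, 4) for x in r])   # [0.4, 0.4, 0.2] = [6/15, 6/15, 3/15]
```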
¡ Power iteration:
A method for finding dominant eigenvector (the
vector corresponding to the largest eigenvalue)
§ 𝒓(𝟏) = 𝑴 ⋅ 𝒓(𝟎)
§ 𝒓(𝟐) = 𝑴 ⋅ 𝒓(𝟏) = 𝑴(𝑴𝒓(𝟎)) = 𝑴𝟐 ⋅ 𝒓(𝟎)
§ 𝒓(𝟑) = 𝑴 ⋅ 𝒓(𝟐) = 𝑴(𝑴𝟐𝒓(𝟎)) = 𝑴𝟑 ⋅ 𝒓(𝟎)
¡ Claim:
Sequence 𝑴 ⋅ 𝒓 𝟎 , 𝑴𝟐 ⋅ 𝒓 𝟎 , … 𝑴𝒌 ⋅ 𝒓 𝟎 , …
approaches the dominant eigenvector of 𝑴
¡ Claim: Sequence 𝑴 ⋅ 𝒓(𝟎), 𝑴𝟐 ⋅ 𝒓(𝟎), … 𝑴𝒌 ⋅ 𝒓(𝟎), … approaches the dominant eigenvector of 𝑴
¡ Proof:
§ Assume M has n linearly independent eigenvectors x1, x2, …, xn with corresponding eigenvalues λ1, λ2, …, λn, where λ1 > λ2 > ⋯ > λn
§ Vectors x1, x2, …, xn form a basis and thus we can write:
r(0) = c1 x1 + c2 x2 + ⋯ + cn xn
§ M r(0) = M (c1 x1 + c2 x2 + ⋯ + cn xn)
= c1 (M x1) + c2 (M x2) + ⋯ + cn (M xn)
= c1 (λ1 x1) + c2 (λ2 x2) + ⋯ + cn (λn xn)
§ Repeated multiplication on both sides produces
M^k r(0) = c1 (λ1^k x1) + c2 (λ2^k x2) + ⋯ + cn (λn^k xn)
¡ Claim: Sequence 𝑴 ⋅ 𝒓(𝟎), 𝑴𝟐 ⋅ 𝒓(𝟎), … 𝑴𝒌 ⋅ 𝒓(𝟎), … approaches the dominant eigenvector of 𝑴
¡ Proof (continued):
§ Repeated multiplication on both sides produces
M^k r(0) = c1 (λ1^k x1) + c2 (λ2^k x2) + ⋯ + cn (λn^k xn)
§ M^k r(0) = λ1^k [ c1 x1 + c2 (λ2/λ1)^k x2 + ⋯ + cn (λn/λ1)^k xn ]
§ Since λ1 > λ2, the fractions λ2/λ1, …, λn/λ1 are < 1
and so (λi/λ1)^k → 0 as k → ∞ (for all i = 2 … n)
§ Thus: 𝑴𝒌 𝒓(𝟎) ≈ 𝒄𝟏 𝝀𝟏^𝒌 𝒙𝟏
§ Note: if c1 = 0 then the method won’t converge
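The claim can be sanity-checked numerically: starting power iteration from two different distributions should reach the same dominant eigenvector (a sketch in plain Python; the fixed count of 100 multiplications is an arbitrary but ample choice for this small matrix):

```python
# Start from two different stochastic vectors r1, r2 and check that
# M^k r(0) converges to the same limit for both (both have c1 != 0).
def mat_vec(M, r):
    n = len(M)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r1, r2 = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
for _ in range(100):            # k = 100 multiplications by M
    r1, r2 = mat_vec(M, r1), mat_vec(M, r2)

print(r1)   # both vectors end up near [0.4, 0.4, 0.2]
print(r2)
```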
¡ Imagine a random web surfer:
§ At any time 𝒕, surfer is on some page 𝒊
§ At time 𝒕 + 𝟏, the surfer follows an out-link from 𝒊 uniformly at random
§ Ends up on some page 𝒋 linked from 𝒊:
rj = Σi→j ri / dout(i)
§ Process repeats indefinitely
¡ Let:
§ 𝒑(𝒕) … vector whose 𝒊th coordinate is the prob. that the surfer is at page 𝒊 at time 𝒕
§ So, 𝒑(𝒕) is a probability distribution over pages
¡ Where is the surfer at time t+1?
§ Follows a link uniformly at random:
𝒑(𝒕 + 𝟏) = 𝑴 ⋅ 𝒑(𝒕)
¡ Suppose the random walk reaches a state
𝒑(𝒕 + 𝟏) = 𝑴 ⋅ 𝒑(𝒕) = 𝒑(𝒕)
then 𝒑(𝒕) is a stationary distribution of the random walk
¡ Our original rank vector 𝒓 satisfies 𝒓 = 𝑴 ⋅ 𝒓
§ So, 𝒓 is a stationary distribution for the random walk
¡ A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0
¡ Given an undirected graph with N nodes and m edges, where the nodes are pages and edges are hyperlinks
¡ Claim [Existence]: For node v, rv = dv/2m is a solution (dv … degree of node v)
¡ Proof:
§ Iteration step: r(t+1) = M · r(t), i.e. rj(t+1) = Σi→j ri(t) / di
§ Substitute ri = di/2m: each neighbor i of j contributes (di/2m)/di = 1/2m, so
rj(t+1) = Σi→j 1/2m = dj/2m = rj(t)
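The existence claim is easy to verify numerically: set rv = dv/2m and apply one iteration step; the vector should be unchanged. A sketch on a small hypothetical 4-node undirected graph (each undirected edge acts as two directed links):

```python
# Hypothetical undirected graph on 4 nodes; m = number of edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n, m = 4, len(edges)
deg = [0] * n
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

r = [deg[v] / (2 * m) for v in range(n)]     # candidate solution r_v = d_v/2m

# One power-iteration step: r_new[j] = sum over neighbors i of r[i]/deg[i]
r_new = [0.0] * n
for u, v in edges:
    r_new[v] += r[u] / deg[u]
    r_new[u] += r[v] / deg[v]

print(r, r_new)   # the two vectors coincide: r is stationary
```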
¡ Node 1 has the highest PR, followed by Node 3
¡ Degree ≠ PageRank
¡ Add edge 3 -> 2. Now, which node has highest
PageRank? Second highest?
¡ Node 3 has the highest PR, followed by 2.
¡ Small changes to graph can change PR!
rj(t+1) = Σi→j ri(t) / di
or equivalently r = Mr
¡ Does this converge?
rj(t+1) = Σi→j ri(t) / di

        a   b
M = a [ 0   1 ]
    b [ 1   0 ]

¡ Example:
ra  =  1  0  1  0  …
rb     0  1  0  1
Iteration 0, 1, 2, …
The scores oscillate and never converge: the walk cycles between a and b
rj(t+1) = Σi→j ri(t) / di

        a   b
M = a [ 0   0 ]
    b [ 1   0 ]

¡ Example:
ra  =  1  0  0  0  …
rb     0  1  0  0
Iteration 0, 1, 2, …
b is a dead end: the importance “leaks out”
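The leak is visible numerically: with the dead end b, the total PageRank mass drops to zero within two steps. A tiny sketch (the 2-page matrix reconstructed from the iterates above):

```python
# a -> b, and b has no out-links, so M's column for b is all zeros.
M = [[0.0, 0.0],
     [1.0, 0.0]]

r = [1.0, 0.0]
history = []
for _ in range(3):
    r = [sum(M[j][i] * r[i] for i in range(2)) for j in range(2)]
    history.append(sum(r))          # total mass after each step

print(history)   # [1.0, 0.0, 0.0] -- all importance has leaked out
```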
Two problems:
¡ (1) Dead ends: Some pages have no out-links
§ Random walk has “nowhere” to go to
§ Such pages cause importance to “leak out”
¡ (2) Spider traps: All out-links of a group of pages stay within the group
§ Random walk gets “stuck” in the trap
§ Eventually the trap absorbs all importance
¡ Power Iteration:                      y   a   m
§ Set rj = 1/N                     y [ ½   ½   0 ]
§ rj = Σi→j ri / di                a [ ½   0   0 ]
§ And iterate                      m [ 0   ½   0 ]
m is a dead end:
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2
¡ Example:
     ry      1/3   2/6   3/12   5/24         0
     ra  =   1/3   1/6   2/12   3/24   …     0
     rm      1/3   1/6   1/12   2/24         0
Iteration 0, 1, 2, …
Here the PageRank score “leaks” out since the matrix is not stochastic.
¡ Teleports: Follow random teleport links with probability 1.0 from dead-ends
§ Adjust matrix accordingly

        y   a   m                 y   a   m
    y [ ½   ½   0 ]           y [ ½   ½   ⅓ ]
    a [ ½   0   0 ]    →      a [ ½   0   ⅓ ]
    m [ 0   ½   0 ]           m [ 0   ½   ⅓ ]
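The adjustment amounts to replacing every all-zero column with 1/N. A sketch (plain Python; `fix_dead_ends` is an illustrative name) applied to the y/a/m matrix in which m is a dead end:

```python
# Make M column stochastic: a dead end (all-zero column) teleports to every
# page with probability 1/N.
def fix_dead_ends(M):
    n = len(M)
    for c in range(n):
        if sum(M[r][c] for r in range(n)) == 0:   # column c is a dead end
            for r in range(n):
                M[r][c] = 1.0 / n
    return M

M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]        # m's column is all zeros
M = fix_dead_ends(M)
print(M)   # m's column becomes [1/3, 1/3, 1/3]
```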
Why are dead-ends and spider traps a problem
and why do teleports solve the problem?
¡ Spider-traps are not a problem, but with traps
PageRank scores are not what we want
§ Solution: Never get stuck in a spider trap by
teleporting out of it in a finite number of steps
¡ Dead-ends are a problem
§ The matrix is not column stochastic so our initial
assumptions are not met
§ Solution: Make matrix column stochastic by always
teleporting when there is nowhere else to go
¡ Google’s solution that does it all:
At each step, random surfer has two options:
§ With probability β, follow a link at random
§ With probability 1-β, jump to some random page

rj = Σi→j β ri/di + (1-β) 1/N

This formulation assumes that 𝑴 has no dead ends. We can either preprocess matrix 𝑴 to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.
¡ PageRank equation [Brin-Page, ‘98]
rj = Σi→j β ri/di + (1-β) 1/N
¡ The Google Matrix A: ([1/N]NxN … N by N matrix where all entries are 1/N)
A = β M + (1-β) [1/N]NxN
¡ We have a recursive problem: 𝒓 = 𝑨 ⋅ 𝒓
And the Power method still works!
¡ What is β?
§ In practice β = 0.8, 0.9 (jump every 5 steps on avg.)
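The Google matrix and the power method fit together in a few lines. A sketch in plain Python (`pagerank` is an illustrative name; β = 0.8, and the matrix is the y/a/m example with the spider trap m → m used in the numeric example):

```python
# Google-matrix PageRank: A = beta*M + (1-beta)/N, then power-iterate.
# Assumes M has no dead ends, as the slide notes.
def pagerank(M, beta=0.8, eps=1e-10):
    n = len(M)
    A = [[beta * M[j][i] + (1 - beta) / n for i in range(n)] for j in range(n)]
    r = [1.0 / n] * n
    while True:
        r_new = [sum(A[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
            return r_new
        r = r_new

# y/a/m with the spider trap m -> m.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
r = pagerank(M, beta=0.8)
print([round(x, 3) for x in r])   # ~ [0.212, 0.152, 0.636] = [7/33, 5/33, 21/33]
```

Note how teleportation keeps the trap m from absorbing all the importance: m ends up large (21/33) but not 1.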
            M                  [1/N]NxN
      ½   ½   0           1/3  1/3  1/3
0.8 · ½   0   0   + 0.2 · 1/3  1/3  1/3
      0   ½   1           1/3  1/3  1/3

       7/15   7/15   1/15
A =    7/15   1/15   1/15
       1/15   7/15  13/15

(y/a/m example with the spider trap m → m; e.g. the surfer moves y → a with probability 7/15 and stays at m with probability 13/15)
¡ Key step is matrix-vector multiplication
§ rnew = A · rold
¡ Easy if we have enough main memory to hold A, rold, rnew
¡ Say N = 1 billion pages, and we need 4 bytes for each entry (say)
§ 2 billion entries for the vectors rnew and rold alone: approx 8GB
§ But A = β·M + (1-β)[1/N]NxN is dense: it has N² = 10¹⁸ entries
¡ Assume enough RAM to fit rnew into memory
§ Store rold and matrix M on disk
¡ 1 step of power-iteration is:
Initialize all entries of rnew = (1-β) / N      (assuming no dead ends)
For each page i (of out-degree di):
  Read into memory: i, di, dest1, …, destdi, rold(i)
  For j = 1…di
    rnew(destj) += β rold(i) / di

source   degree   destinations
  0        3      1, 5, 6
  1        4      17, 64, 113, 117
  2        2      13, 23
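The update loop above, written as runnable Python over an adjacency-list representation (a sketch; assumes no dead ends, and the y/a/m graph indexed 0/1/2 stands in for the on-disk link file):

```python
# One power-iteration step over a sparse link representation:
# r_new starts at (1-beta)/N, and each page i spreads beta*r_old[i]/d_i
# to its destinations. Assumes no dead ends.
def sparse_step(links, r_old, beta=0.8):
    n = len(r_old)
    r_new = [(1 - beta) / n] * n
    for i, dests in links.items():       # links: page -> list of out-links
        share = beta * r_old[i] / len(dests)
        for j in dests:
            r_new[j] += share
    return r_new

links = {0: [0, 1], 1: [0, 2], 2: [1]}   # y/a/m graph, indexed 0/1/2
r = [1 / 3] * 3
for _ in range(100):
    r = sparse_step(links, r)
print([round(x, 3) for x in r])   # ~ [0.376, 0.398, 0.226]
```

In the real algorithm only one row of the link file and one entry of rold are in memory at a time; the dictionary here simply plays that file's role.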
Some Problems with PageRank:
¡ Measures generic popularity of a page
§ Biased against topic-specific authorities
§ Solution: Topic-Specific PageRank (next)
¡ Uses a single measure of importance
§ Other models of importance
§ Solution: Hubs-and-Authorities
¡ Susceptible to Link spam
§ Artificial link topologies created in order to boost page rank
§ Solution: TrustRank