
CSE488: Big Data Analytics

Lecture 9: Analysis of Large Graphs
PageRank Algorithm

Dr. Mohammad Rezwanul Huq
Associate Professor
East West University
Course topic map:
• High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
• Graph data: PageRank, SimRank, Community Detection, Spam Detection
• Infinite data: Filtering data streams, Web advertising, Queries on streams
• Machine learning: SVM, Decision Trees, Perceptron, kNN
• Apps: Recommender systems, Association Rules, Duplicate document detection
Graphs are everywhere:
• Facebook social graph: four degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
• Connections between political blogs
• Citation networks and maps of science [Börner et al., 2012]
• The Internet: routers connected within and across domains (domain1, domain2, domain3)
Seven Bridges of Königsberg [Euler, 1735]:
Return to the starting point by traveling each link of the graph once and only once.
Web as a directed graph:
• Nodes: Webpages
• Edges: Hyperlinks

[Figure: example pages linking to one another: "I teach a class on Networks. CS224W", "Classes are in the Gates building", "Computer Science Department at Stanford", "Stanford University"]
How to organize the Web?
First try: Human curated Web directories
• Yahoo, DMOZ, LookSmart
Second try: Web Search
• Information Retrieval investigates finding relevant docs in a small and trusted set
 • Newspaper articles, Patents, etc.
• But: the Web is huge, full of untrusted documents, random things, web spam, etc.
2 challenges of web search:
(1) Web contains many sources of information. Who to "trust"?
• Trick: Trustworthy pages may point to each other!
(2) What is the "best" answer to the query "newspaper"?
• No single right answer
• Trick: Pages that actually know about newspapers might all be pointing to many newspapers
All web pages are not equally "important"
• thispersondoesnotexist.com vs. www.stanford.edu
There is large diversity in web-graph node connectivity.
Let's rank the pages by the link structure!
We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
• PageRank
• Topic-Specific (Personalized) PageRank
• Web Spam Detection Algorithms
Idea: Links as votes
• A page is more important if it has more links
• In-coming links? Out-going links?
Think of in-links as votes:
• www.stanford.edu has millions of in-links
• thispersondoesnotexist.com has a few thousand in-links
Are all in-links equal?
• Links from important pages count more
• Recursive question!


Web pages are important if people visit them a lot. But we can't watch everybody using the Web. A good surrogate for visiting pages is to assume people follow links randomly.
This leads to the random surfer model:
• Start at a random page and follow random out-links repeatedly, from whatever page you are at
• PageRank = limiting probability of being at a page
Solve the recursive equation: "importance of a page = its share of the importance of each of its predecessor pages"
• Equivalent to the random-surfer definition of PageRank
Technically, importance = the principal eigenvector of the transition matrix of the Web
• A few fix-ups needed
[Figure: example PageRank scores on a small web graph: B: 38.4, C: 34.3, E: 8.1, D: 3.9, F: 3.9, A: 3.3, remaining pages: 1.6 each]
Each link's vote is proportional to the importance of its source page
• If page j with importance rj has n out-links, each link gets rj / n votes
• Page j's own importance rj is the sum of the votes on its in-links

Example: pages i and k point to j; i has 3 out-links and k has 4, so
    rj = ri/3 + rk/4
If j itself has 3 out-links, each carries rj/3 votes.
A "vote" from an important page is worth more
A page is important if it is pointed to by other important pages
Define a "rank" rj for page j:
    rj = Σ(i→j) ri / di        (di ... out-degree of node i)

Example ("the web in 1839"): three pages y, a, m, with links y→y, y→a, a→y, a→m, m→a
"Flow" equations:
    ry = ry/2 + ra/2
    ra = ry/2 + rm
    rm = ra/2
Flow equations:
    ry = ry/2 + ra/2
    ra = ry/2 + rm
    rm = ra/2
3 equations, 3 unknowns, no constants
• No unique solution
• All solutions are equivalent modulo a scale factor
An additional constraint forces uniqueness:
    ry + ra + rm = 1
• Solution: ry = 2/5, ra = 2/5, rm = 1/5
The Gaussian elimination method works for small examples, but we need a better method for large web-size graphs.
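
For instance, the 3-page system above can be solved directly with a linear solver. A minimal sketch in Python/numpy (the matrix encodes the y/a/m flow equations; variable names are ours):

    import numpy as np

    # Flow equations rewritten as (M - I) r = 0; the last (redundant)
    # equation is replaced by the normalization constraint sum(r) = 1.
    M = np.array([[0.5, 0.5, 0.0],   # ry = ry/2 + ra/2
                  [0.5, 0.0, 1.0],   # ra = ry/2 + rm
                  [0.0, 0.5, 0.0]])  # rm = ra/2

    A = M - np.eye(3)
    A[2, :] = 1.0                    # row for: ry + ra + rm = 1
    b = np.array([0.0, 0.0, 1.0])

    print(np.linalg.solve(A, b))     # -> [0.4 0.4 0.2], i.e. 2/5, 2/5, 1/5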
Stochastic adjacency matrix M
• Let page i have di out-links
• If i → j, then Mji = 1/di, else Mji = 0
• M is a column stochastic matrix: columns sum to 1
Rank vector r: a vector with one entry per page
• ri is the importance score of page i
• Σi ri = 1
The flow equation rj = Σ(i→j) ri / di can then be written in matrix form:
    M · r = r
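
As a sketch of how M could be built from an edge list (Python/numpy; the helper and edge list below are our own illustration, using the y/a/m graph with y=0, a=1, m=2):

    import numpy as np

    def build_M(edges, N):
        # Column-stochastic matrix: M[j, i] = 1/d_i if i -> j, else 0
        out_degree = [0] * N
        for i, _ in edges:
            out_degree[i] += 1
        M = np.zeros((N, N))
        for i, j in edges:
            M[j, i] = 1.0 / out_degree[i]
        return M

    # Links: y->y, y->a, a->y, a->m, m->a
    edges = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)]
    print(build_M(edges, 3))   # each column sums to 1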
Flow equation in matrix form: M · r = r
• Suppose page i links to 3 pages, including j
• Then column i of M has three entries equal to 1/3, and the product M · r delivers the contribution ri/3 to rj
Example (y, a, m graph):

          y    a    m
    y    ½    ½    0
    a    ½    0    1
    m    0    ½    0

r = M · r:
    ry = ry/2 + ra/2
    ra = ry/2 + rm
    rm = ra/2
The flow equations can be written r = M · r
So the rank vector r is an eigenvector of the stochastic web matrix M, with eigenvalue 1
• (Recall: x is an eigenvector of A with corresponding eigenvalue λ if Ax = λx)
Starting from any stochastic vector u, the limit M(M(... M(M u))) is the long-term distribution of the surfers
• The math: limiting distribution = the principal eigenvector of M = PageRank
• Note: if r is the limit of MM...Mu, then r satisfies the equation r = Mr, so r is an eigenvector of M with eigenvalue 1
We can now efficiently solve for r!
The method is called Power iteration


Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks:
Power iteration: a simple iterative scheme
• Initialize: r(0) = [1/N, ..., 1/N]^T
• Iterate: r(t+1) = M · r(t), i.e.
    rj(t+1) = Σ(i→j) ri(t) / di        (di ... out-degree of node i)
• Stop when |r(t+1) − r(t)|1 < ε
|x|1 = Σ(1≤i≤N) |xi| is the L1 norm; any other vector norm can be used, e.g., Euclidean
About 50 iterations is sufficient to estimate the limiting solution.
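
A direct transcription of this scheme, as a sketch in Python/numpy (assumes M is column stochastic, i.e., no dead ends; the function name is ours):

    import numpy as np

    def power_iteration(M, eps=1e-8, max_iter=100):
        N = M.shape[0]
        r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]
        for _ in range(max_iter):
            r_next = M @ r                       # r(t+1) = M . r(t)
            if np.abs(r_next - r).sum() < eps:   # L1 norm of the change
                return r_next
            r = r_next
        return r

    # y/a/m example from the next slide: converges to [6/15, 6/15, 3/15]
    M = np.array([[0.5, 0.5, 0.0],
                  [0.5, 0.0, 1.0],
                  [0.0, 0.5, 0.0]])
    print(power_iteration(M))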


Power Iteration example (y, a, m graph):

          y    a    m
    y    ½    ½    0
    a    ½    0    1
    m    0    ½    0

• Set rj = 1/N
• 1: r'j = Σ(i→j) ri / di
• 2: r = r'
• Goto 1

Flow equations:
    ry = ry/2 + ra/2
    ra = ry/2 + rm
    rm = ra/2

Iterations 0, 1, 2, ...:
    ry: 1/3   1/3   5/12   9/24    ...   6/15
    ra: 1/3   3/6   1/3    11/24   ...   6/15
    rm: 1/3   1/6   3/12   1/6     ...   3/15
Imagine a random web surfer:
• At any time t, the surfer is on some page i
• At time t+1, the surfer follows an out-link from i uniformly at random
• Ends up on some page j linked from i
• Process repeats indefinitely
Let:
• p(t) ... vector whose i-th coordinate is the probability that the surfer is at page i at time t
• So, p(t) is a probability distribution over pages
Where is the surfer at time t+1?
• Follows a link uniformly at random: p(t+1) = M · p(t)
Suppose the random walk reaches a state p(t+1) = M · p(t) = p(t)
• Then p(t) is a stationary distribution of the random walk
Our original rank vector r satisfies r = M · r
• So, r is a stationary distribution for the random walk of the surfer
A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.
Given an undirected graph with N nodes and m edges, where the nodes are pages and edges are hyperlinks:
Claim [Existence]: For node v, rv = dv/2m is a solution.
Proof:
• Iteration step: r(t+1) = M · r(t)
• Substitute ri = di/2m: each neighbor i of j contributes (1/di)(di/2m) = 1/2m, and j has dj neighbors, so rj(t+1) = dj/2m = rj(t)
Done! Uniqueness: exercise! (m = #edges)
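
A quick numerical check of the claim, as a sketch in Python/numpy (the small undirected graph is our own example):

    import numpy as np

    # Undirected graph: each edge {u, v} yields links u->v and v->u
    edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
    N, m = 4, len(edges)

    deg = [0] * N
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1

    # Column-stochastic M over the symmetrized edges
    M = np.zeros((N, N))
    for u, v in edges:
        M[v, u] = 1.0 / deg[u]
        M[u, v] = 1.0 / deg[v]

    r = np.array([deg[v] / (2.0 * m) for v in range(N)])
    print(np.allclose(M @ r, r))   # True: r_v = d_v/2m is stationary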


[Figure: small example graph] Which node has the highest PageRank? Second highest?
• Node 1 has the highest PR, followed by Node 3
• Degree ≠ PageRank!
Add edge 3 → 2. Now which node has the highest PageRank? Second highest?
• Node 3 has the highest PR, followed by Node 2
• Small changes to the graph can change PR!
Power iteration:
    rj(t+1) = Σ(i→j) ri(t) / di,  or equivalently  r = M r
• Does this converge?
• Does it converge to what we want?
• Are results reasonable?
Example (a ⇄ b; does this converge?):
    rj(t+1) = Σ(i→j) ri(t) / di
Iterations 0, 1, 2, ...:
    ra: 1   0   1   0   ...
    rb: 0   1   0   1   ...
The scores oscillate forever and never converge.
Example (a → b, where b is a dead end; does it converge to what we want?):
    rj(t+1) = Σ(i→j) ri(t) / di
Iterations 0, 1, 2, ...:
    ra: 1   0   0   0   ...
    rb: 0   1   0   0   ...
The importance "leaks out" and everything converges to zero.
Two problems:
(1) Dead ends: Some pages have no out-links
• Random walk has "nowhere" to go to
• Such pages cause importance to "leak out"
(2) Spider traps: All out-links are within a group of pages
• Random walk gets "stuck" in the trap
• The trap eventually absorbs all importance
Power Iteration with a spider trap (m → m):

          y    a    m
    y    ½    ½    0
    a    ½    0    0
    m    0    ½    1

• Set rj = 1/N, then iterate r'j = Σ(i→j) ri / di
Flow equations:
    ry = ry/2 + ra/2
    ra = ry/2
    rm = ra/2 + rm
Iterations 0, 1, 2, ...:
    ry: 1/3   2/6   3/12   5/24    ...   0
    ra: 1/3   1/6   2/12   3/24    ...   0
    rm: 1/3   3/6   7/12   16/24   ...   1
All the PageRank score gets "trapped" in node m.
The Google solution for spider traps: At each time step, the random surfer has two options
• With probability β, follow a link at random
• With probability 1−β, jump to some random page
• β is typically in the range 0.8 to 0.9
The surfer will teleport out of the spider trap within a few time steps.
Power Iteration with a dead end (m has no out-links):

          y    a    m
    y    ½    ½    0
    a    ½    0    0
    m    0    ½    0

• Set rj = 1/N, then iterate r'j = Σ(i→j) ri / di
Flow equations:
    ry = ry/2 + ra/2
    ra = ry/2
    rm = ra/2
Iterations 0, 1, 2, ...:
    ry: 1/3   2/6   3/12   5/24   ...   0
    ra: 1/3   1/6   2/12   3/24   ...   0
    rm: 1/3   1/6   1/12   2/24   ...   0
Here the PageRank score "leaks" out, since the matrix is not column stochastic.

Teleports: Follow random teleport links with probability 1.0 from dead ends
• Adjust the matrix accordingly: replace the all-zero column of the dead end m with 1/N in every entry

          y    a    m                 y    a    m
    y    ½    ½    0            y    ½    ½    ⅓
    a    ½    0    0      →     a    ½    0    ⅓
    m    0    ½    0            m    0    ½    ⅓
Why are dead ends and spider traps a problem, and why do teleports solve it?
Spider traps are not a mathematical problem, but with traps the PageRank scores are not what we want
• Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps
Dead ends are a problem
• The matrix is not column stochastic, so our initial assumptions are not met
• Solution: Make the matrix column stochastic by always teleporting when there is nowhere else to go
Google's solution that does it all:
At each step, the random surfer has two options:
• With probability β, follow a link at random
• With probability 1−β, jump to some random page

PageRank equation [Brin-Page, '98]:
    rj = Σ(i→j) β ri/di + (1−β) 1/N        (di ... out-degree of node i)

This formulation assumes that M has no dead ends. We can either preprocess matrix M to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead ends.
The Google Matrix A:
    A = β M + (1−β) [1/N]N×N
where [1/N]N×N is an N-by-N matrix with all entries 1/N.
We again have a recursive problem: r = A · r, and the Power method still works!
What is β?
• In practice β = 0.8 or 0.9 (the surfer jumps roughly every 5 steps)

Example (β = 0.8, y/a/m graph where m is a spider trap):

               M                  [1/N]N×N
    0.8 ·  ½   ½   0   + 0.2 ·  1/3  1/3  1/3
           ½   0   0            1/3  1/3  1/3
           0   ½   1            1/3  1/3  1/3

              A
        7/15  7/15   1/15
    =   7/15  1/15   1/15
        1/15  7/15  13/15

Iterations 0, 1, 2, ...:
    ry: 1/3   0.33   0.24   0.26   ...   7/33
    ra: 1/3   0.20   0.20   0.18   ...   5/33
    rm: 1/3   0.46   0.52   0.56   ...   21/33
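
The same example in code, as a sketch in Python/numpy (builds the Google matrix A for β = 0.8 and iterates; assumes M has no dead ends):

    import numpy as np

    beta, N = 0.8, 3
    M = np.array([[0.5, 0.5, 0.0],   # y/a/m graph, m is a spider trap
                  [0.5, 0.0, 0.0],
                  [0.0, 0.5, 1.0]])

    A = beta * M + (1 - beta) / N    # adds (1-beta)/N to every entry

    r = np.full(N, 1.0 / N)
    for _ in range(50):
        r = A @ r
    print(r)   # approx [7/33, 5/33, 21/33] = [0.212, 0.152, 0.636]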
Key step is matrix-vector multiplication:
• rnew = A · rold
Easy if we have enough main memory to hold A, rold, rnew. But say N = 1 billion pages:
• We need 4 bytes for each entry (say)
• 2 billion entries for the two vectors, approx 8GB
• Matrix A has N² entries: 10^18 is a large number!

We can rearrange the PageRank equation. Since Aji = β Mji + (1−β)/N and Σi ri = 1:
    rj = Σi Aji · ri = Σi (β Mji + (1−β)/N) ri = Σi β Mji ri + (1−β)/N
So we get:
    r = β M · r + [(1−β)/N]N
where [x]N is a vector of length N with all entries x. Note: here we assume M has no dead ends.
We just rearranged the PageRank equation:
    r = β M · r + [(1−β)/N]N
where [(1−β)/N]N is a vector with all N entries equal to (1−β)/N.
M is a sparse matrix! (with no dead ends)
• About 10 links per node, approx 10N entries
So in each iteration, we need to:
• Compute rnew = β M · rold
• Add a constant value (1−β)/N to each entry in rnew
• Note: if M contains dead ends, then Σj rjnew < 1, and we also have to renormalize rnew so that it sums to 1
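
One iteration of this scheme might look as follows, a sketch using scipy's sparse matrices (assumes no dead ends, so no renormalization is needed):

    import numpy as np
    from scipy.sparse import csr_matrix

    def iterate(M, r_old, beta):
        N = r_old.shape[0]
        r_new = beta * (M @ r_old)   # sparse matrix-vector product
        r_new += (1 - beta) / N      # add the constant teleport term
        return r_new

    # y/a/m example stored sparsely (only nonzero entries kept)
    M = csr_matrix(np.array([[0.5, 0.5, 0.0],
                             [0.5, 0.0, 1.0],
                             [0.0, 0.5, 0.0]]))
    r = np.full(3, 1.0 / 3)
    for _ in range(50):
        r = iterate(M, r, beta=0.8)
    print(r)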
The complete algorithm:
Input:
• Directed graph G (can have spider traps and dead ends)
• Parameter β
Output: PageRank vector rnew

• Set: rjold = 1/N
• Repeat until convergence: Σj |rjnew − rjold| < ε
    ∀j: r'jnew = Σ(i→j) β riold / di
        (r'jnew = 0 if the in-degree of j is 0)
    Now re-insert the leaked PageRank:
        ∀j: rjnew = r'jnew + (1−S)/N, where S = Σj r'jnew
    rold = rnew

If the graph has no dead ends, then the amount of leaked PageRank is 1−β. But since we have dead ends, the amount of leaked PageRank may be larger: we have to explicitly account for it by computing S.
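
A sketch of this pseudocode in Python, using adjacency lists instead of an explicit matrix (names are ours; dead ends are handled via the 1−S correction):

    import numpy as np

    def pagerank(out_links, N, beta=0.8, eps=1e-8, max_iter=100):
        # out_links[i] = list of pages that page i links to
        r_old = np.full(N, 1.0 / N)
        for _ in range(max_iter):
            r_new = np.zeros(N)
            for i, dests in out_links.items():
                for j in dests:
                    r_new[j] += beta * r_old[i] / len(dests)
            S = r_new.sum()              # < 1 if PageRank leaked
            r_new += (1.0 - S) / N       # re-insert the leaked PageRank
            if np.abs(r_new - r_old).sum() < eps:
                break
            r_old = r_new
        return r_new

    # Tiny example: page 2 is a dead end (no out-links)
    print(pagerank({0: [1, 2], 1: [0]}, N=3))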
Encode the sparse matrix using only its nonzero entries:
• Space proportional roughly to the number of links
• Say 10N, or 4 * 10 * 1 billion = 40GB
• Still won't fit in memory, but will fit on disk

    source node    degree    destination nodes
    0              3         1, 5, 7
    1              5         17, 64, 113, 117, 245
    2              2         13, 23
Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
One step of power-iteration is then (assuming no dead ends):
    Initialize all entries of rnew = (1−β)/N
    For each page i (of out-degree di):
        Read into memory: i, di, dest1, ..., destdi, rold(i)
        For j = 1 ... di:
            rnew(destj) += β rold(i) / di

    source    degree    destination
    0         3         1, 5, 6
    1         4         17, 64, 113, 117
    2         2         13, 23
Assume enough RAM to fit rnew into memory
• Store rold and matrix M on disk
In each iteration, we have to:
• Read rold and M
• Write rnew back to disk
• Cost per iteration of the Power method: 2|r| + |M|
Question:
• What if we could not even fit rnew in memory?
Block-based update: break rnew into k blocks that fit in memory; scan M and rold once for each block.

    src    degree    destinations
    0      4         0, 1, 3, 5
    1      2         0, 5
    2      2         3, 4

(rnew is kept in memory one block at a time, e.g., entries [0,1], then [2,3], then [4,5]; rold and M are streamed from disk.)
Similar to a nested-loop join in databases.
• Break rnew into k blocks that fit in memory
• Scan M and rold once for each block
Total cost:
• k scans of M and rold
• Cost per iteration of the Power method: k(|M| + |r|) + |r| = k|M| + (k+1)|r|
Can we do better?
• Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration
Block-Stripe update: break M into stripes! Each stripe contains only the destination nodes in the corresponding block of rnew.

    stripe for rnew block [0,1]:    src    degree    destinations
                                    0      4         0, 1
                                    1      2         0
    stripe for rnew block [2,3]:    0      4         3
                                    2      2         3
    stripe for rnew block [4,5]:    0      4         5
                                    1      2         5
                                    2      2         4
Some additional overhead per stripe
• But it is usually worth it
Cost per iteration of the Power method:
    |M|(1 + ε) + (k+1)|r|
where ε is a small number.
Problems with PageRank:
Measures generic popularity of a page
• Biased against topic-specific authorities
• Solution: Topic-Specific PageRank (next)
Uses a single measure of importance
• Other models of importance exist
• Solution: Hubs-and-Authorities
Susceptible to link spam
• Artificial link topologies created in order to boost PageRank
• Solution: TrustRank
Classic work: Markov chains, citation analysis
RankDex patent [Robin Li, '96]
• Key idea: use backlinks (led to Baidu!)
HITS Algorithm [Kleinberg, SODA '98]
• Key idea: iterative scoring
PageRank [Page et al., '98]