09 PageRank
material useful for giving your own lectures. Feel free to use these slides verbatim, or to
modify them to fit your own needs. If you make use of a significant portion of these slides
in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
[Course overview map: locality-sensitive hashing, PageRank & SimRank, recommender systems, SVM, filtering data streams, dimensionality reduction, duplicate document detection, spam detection, queries on streams, perceptron & kNN]
Jure Leskovec & Mina Ghashami, Stanford C246: Mining Massive Datasets
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
Citation networks and Maps of science
[Börner et al., 2012]
[Figure: the Internet as a graph — routers connecting domain1, domain2, and domain3]
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each
link of the graph once and only once.
¡ Web as a directed graph:
§ Nodes: Webpages
§ Edges: Hyperlinks
[Figure: a web page reading “I teach a class on Networks. CS224W: Classes are in the Gates building”, with hyperlinks leading to pages for the Computer Science Department at Stanford and Stanford University]
¡ How to organize the Web?
¡ First try: Human curated
Web directories
§ Yahoo, DMOZ, LookSmart
¡ Second try: Web Search
§ Information Retrieval investigates:
Find relevant docs in a small
and trusted set
§ Newspaper articles, Patents, etc.
§ But: Web is huge, full of untrusted documents,
random things, web spam, etc.
2 challenges of web search:
¡ (1) Web contains many sources of information
Who to “trust”?
§ Trick: Trustworthy pages may point to each other!
¡ (2) What is the “best” answer to query
“newspaper”?
§ No single right answer
§ Trick: Pages that actually know about newspapers
might all be pointing to many newspapers
¡ All web pages are not equally “important”
thispersondoesnotexist.com vs. www.stanford.edu
¡ We will cover the following Link Analysis
approaches for computing importance
of nodes in a graph:
§ PageRank
§ Topic-Specific (Personalized) PageRank
§ Web Spam Detection Algorithms
¡ Idea: Links as votes
§ Page is more important if it has more links
§ In-coming links? Out-going links?
¡ Think of in-links as votes:
§ www.stanford.edu has millions of in-links
§ thispersondoesnotexist.com has a few thousand in-links
¡ Web pages are important if people visit them
a lot.
¡ But we can’t watch everybody using the Web.
¡ A good surrogate for visiting pages is to
assume people follow links randomly.
¡ Leads to random surfer model:
§ Start at a random page and follow random out-
links repeatedly, from whatever page you are at.
§ PageRank = limiting probability of being at a page
at any point in time.
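The random-surfer definition can be checked by direct simulation. Below is a minimal sketch (plain Python; the three-page graph is the y/a/m example used later in this lecture, and the step count is an arbitrary illustrative choice): follow random out-links for many steps and the visit frequencies approach the limiting probabilities.

```python
import random

# Hypothetical 3-page web: y links to {y, a}, a links to {y, m}, m links to {a}.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

random.seed(0)                 # deterministic run for reproducibility
page = "y"                     # starting page
steps = 200_000
visits = {p: 0 for p in links}
for _ in range(steps):
    page = random.choice(links[page])   # follow a random out-link
    visits[page] += 1

freq = {p: c / steps for p, c in visits.items()}
print(freq)   # close to the limiting probabilities {y: 0.4, a: 0.4, m: 0.2}
```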
¡ Solve the recursive equation: “importance of a
page = its share of the importance of each of its
predecessor pages”
§ Equivalent to the random-surfer definition of
PageRank
[Figure: example PageRank scores on a small graph — B: 38.4, C: 34.3, E: 8.1, D: 3.9, F: 3.9, A: 3.3, and five peripheral nodes at 1.6 each]
¡ Each link’s vote is proportional to the importance of its source page
§ If page j with importance rj has 3 out-links, each out-link carries rj/3 votes
§ If j’s in-links come from page i (with 3 out-links) and page k (with 4 out-links):
rj = ri/3 + rk/4
¡ A “vote” from an important page is worth more:
rj = Σi→j ri / di
𝒅𝒊 … out-degree of node 𝒊
¡ Example (“The web in 1839”, pages y, a, m):
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
𝒓𝒋 are the solutions to the “flow” equations
Flow equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
¡ 3 equations, 3 unknowns, no constants
§ No unique solution
§ All solutions equivalent modulo the scale factor
¡ Additional constraint forces uniqueness:
§ ry + ra + rm = 1
§ Solution: ry = 2/5, ra = 2/5, rm = 1/5
¡ Gaussian elimination method works for
small examples, but we need a better
method for large web-size graphs
¡ We need a new formulation!
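For a system this small, Gaussian elimination is indeed enough. A sketch (pure Python; the helper `solve` is an illustrative implementation) that solves the y/a/m flow equations, with the normalization constraint substituted for one redundant equation:

```python
# Tiny Gauss-Jordan solver -- fine for 3 unknowns, hopeless at web scale,
# which is exactly the slide's point.
def solve(A, b):
    n = len(A)
    A = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        # partial pivoting: pick the row with the largest entry in this column
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Flow equations rewritten as (I - M) r = 0, with the third equation replaced
# by the normalization r_y + r_a + r_m = 1 to force a unique solution.
A = [[0.5, -0.5, 0.0],
     [-0.5, 1.0, -1.0],
     [1.0, 1.0, 1.0]]
b = [0.0, 0.0, 1.0]
ry, ra, rm = solve(A, b)
print(ry, ra, rm)   # ~ 0.4 0.4 0.2, i.e. 2/5, 2/5, 1/5
```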
¡ Define stochastic adjacency matrix 𝑴
§ Let page 𝑖 have 𝑑𝑖 out-links
§ If 𝑖 → 𝑗, then 𝑀𝑗𝑖 = 1/𝑑𝑖 , else 𝑀𝑗𝑖 = 0
§ 𝑴 is a column stochastic matrix
§ Each column sums to 1
¡ Define rank vector 𝒓: a vector with one entry per page; it captures importance of the page
§ 𝑟𝑖 = importance score of page 𝑖
§ Σ𝑖 𝑟𝑖 = 1
¡ The flow equations rj = Σi→j ri / di can be written
𝒓 = 𝑴 ⋅ 𝒓
¡ Remember the flow equation: rj = Σi→j ri / di
¡ Flow equation in the matrix form:
𝑴 ⋅ 𝒓 = 𝒓
§ Suppose page i links to 3 pages, including j
§ Then column i of M has the entry 1/di = 1/3 in row j, and the product M ⋅ r adds ri/3 into rj
        y   a   m
    y [ ½   ½   0 ]
M = a [ ½   0   1 ]
    m [ 0   ½   0 ]

r = M · r
ry = ry /2 + ra /2        ry     ½ ½ 0     ry
ra = ry /2 + rm           ra  =  ½ 0 1  ·  ra
rm = ra /2                rm     0 ½ 0     rm
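A sketch of how such a column-stochastic M can be built from an edge list (plain Python; the function name `stochastic_matrix` and the 0/1/2 indexing of y/a/m are illustrative choices):

```python
# Build the column-stochastic matrix M from an edge list:
# M[j][i] = 1/d_i if i -> j, else 0, so column i holds the out-link
# probabilities of page i.
def stochastic_matrix(edges, n):
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1
    M = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        M[j][i] = 1.0 / out_degree[i]
    return M

# The y/a/m example: y->y, y->a, a->y, a->m, m->a (pages indexed 0, 1, 2).
edges = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1)]
M = stochastic_matrix(edges, 3)
print(M)   # [[0.5, 0.5, 0.0], [0.5, 0.0, 1.0], [0.0, 0.5, 0.0]]
```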
¡ The flow equations can be written
𝒓 = 𝑴 ⋅ 𝒓
¡ So the rank vector r is an eigenvector of the stochastic web matrix M
§ Starting from any stochastic vector 𝒖, the limit
𝑴(𝑴(… 𝑴(𝑴𝒖)))
is the long-term distribution of the surfers
§ The math: limiting distribution = principal eigenvector of 𝑴 = PageRank
(NOTE: x is an eigenvector with the corresponding eigenvalue λ if 𝑨𝒙 = 𝝀𝒙)
§ Note: If 𝒓 is the limit of 𝑴𝑴 … 𝑴𝒖, then 𝒓 satisfies the equation 𝒓 = 𝑴𝒓, so r is an eigenvector of 𝑴 with eigenvalue 1
¡ We can now efficiently solve for r!
The method is called Power iteration
¡ Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks
¡ Power iteration: a simple iterative scheme
§ Initialize: r(0) = [1/N,…,1/N]T
(so that r is a distribution: it sums to 1)
§ Iterate: r(t+1) = M · r(t), i.e.
rj(t+1) = Σi→j ri(t) / di        di … out-degree of node i
§ Stop when |r(t+1) – r(t)|1 < ε
|x|1 = Σ1≤i≤N |xi| is the L1 norm
¡ Power Iteration:                      y   a   m
§ Set 𝑟 = [1/N, 1/N, 1/N]          y [ ½   ½   0 ]
§ 1: 𝑟′ = 𝑀 ⋅ 𝑟                    a [ ½   0   1 ]
§ 2: 𝑟 = 𝑟′                        m [ 0   ½   0 ]
§ Goto 1
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
¡ Example:
     ry      1/3   1/3   5/12   9/24         6/15
r =  ra  =   1/3   3/6   1/3    11/24   …    6/15
     rm      1/3   1/6   3/12   1/6          3/15
Iteration 0, 1, 2, …
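The loop above translates directly into code. A minimal sketch (plain Python; `power_iterate` is an illustrative name), stopping on the L1 distance as in the slides:

```python
# Power iteration: r(0) = [1/N,...,1/N], repeat r <- M*r until the L1
# change drops below eps.
def power_iterate(M, eps=1e-10):
    n = len(M)
    r = [1.0 / n] * n
    while True:
        r_new = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:   # L1 distance
            return r_new
        r = r_new

# The y/a/m example matrix from the slide.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iterate(M)
print([round(x, 4) for x in r])   # [0.4, 0.4, 0.2] = [6/15, 6/15, 3/15]
```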
¡ Power iteration:
A method for finding dominant eigenvector (the
vector corresponding to the largest eigenvalue)
§ 𝒓(𝟏) = 𝑴 ⋅ 𝒓(𝟎)
§ 𝒓(𝟐) = 𝑴 ⋅ 𝒓(𝟏) = 𝑴(𝑴𝒓(𝟎)) = 𝑴𝟐 ⋅ 𝒓(𝟎)
§ 𝒓(𝟑) = 𝑴 ⋅ 𝒓(𝟐) = 𝑴(𝑴𝟐𝒓(𝟎)) = 𝑴𝟑 ⋅ 𝒓(𝟎)
¡ Claim:
Sequence 𝑴 ⋅ 𝒓 𝟎 , 𝑴𝟐 ⋅ 𝒓 𝟎 , … 𝑴𝒌 ⋅ 𝒓 𝟎 , …
approaches the dominant eigenvector of 𝑴
¡ Claim: Sequence 𝑴 ⋅ 𝒓(𝟎), 𝑴𝟐 ⋅ 𝒓(𝟎), … 𝑴𝒌 ⋅ 𝒓(𝟎), … approaches the dominant eigenvector of 𝑴
¡ Proof:
§ Assume M has n linearly independent eigenvectors x1, x2, …, xn with corresponding eigenvalues λ1, λ2, …, λn, where λ1 > λ2 > ⋯ > λn
§ Vectors x1, x2, …, xn form a basis and thus we can write:
r(0) = c1 x1 + c2 x2 + ⋯ + cn xn
§ M r(0) = M (c1 x1 + c2 x2 + ⋯ + cn xn)
= c1 (M x1) + c2 (M x2) + ⋯ + cn (M xn)
= c1 (λ1 x1) + c2 (λ2 x2) + ⋯ + cn (λn xn)
§ Repeated multiplication on both sides produces
M^k r(0) = c1 (λ1^k x1) + c2 (λ2^k x2) + ⋯ + cn (λn^k xn)
¡ Claim: Sequence 𝑴 ⋅ 𝒓(𝟎), 𝑴𝟐 ⋅ 𝒓(𝟎), … 𝑴𝒌 ⋅ 𝒓(𝟎), … approaches the dominant eigenvector of 𝑴
¡ Proof (continued):
§ Repeated multiplication on both sides produces
M^k r(0) = c1 (λ1^k x1) + c2 (λ2^k x2) + ⋯ + cn (λn^k xn)
§ M^k r(0) = λ1^k [ c1 x1 + c2 (λ2/λ1)^k x2 + ⋯ + cn (λn/λ1)^k xn ]
§ Since λ1 > λ2, the fractions λ2/λ1, …, λn/λ1 are < 1
and so (λi/λ1)^k → 0 as k → ∞ (for all i = 2 … n)
§ Thus: 𝑴𝒌 𝒓(𝟎) ≈ 𝒄𝟏 𝝀𝟏^𝒌 𝒙𝟏
§ Note: if c1 = 0 then the method won’t converge
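The claim can be sanity-checked numerically: starting power iteration from two different distributions should reach the same dominant eigenvector (a sketch in plain Python; the fixed count of 100 multiplications is an arbitrary but ample choice for this small matrix):

```python
# Start from two different stochastic vectors r1, r2 and check that
# M^k r(0) converges to the same limit for both (both have c1 != 0).
def mat_vec(M, r):
    n = len(M)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r1, r2 = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
for _ in range(100):            # k = 100 multiplications by M
    r1, r2 = mat_vec(M, r1), mat_vec(M, r2)

print(r1)   # both vectors end up near [0.4, 0.4, 0.2]
print(r2)
```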
¡ Imagine a random web surfer:
§ At any time 𝒕, surfer is on some page 𝒊
§ At time 𝒕 + 𝟏, the surfer follows an out-link from 𝒊 uniformly at random
§ Ends up on some page 𝒋 linked from 𝒊:
rj = Σi→j ri / dout(i)
§ Process repeats indefinitely
¡ Let:
§ 𝒑(𝒕) … vector whose 𝒊th coordinate is the prob. that the surfer is at page 𝒊 at time 𝒕
§ So, 𝒑(𝒕) is a probability distribution over pages
¡ Where is the surfer at time t+1?
§ Follows a link uniformly at random:
𝒑(𝒕 + 𝟏) = 𝑴 ⋅ 𝒑(𝒕)
¡ Suppose the random walk reaches a state
𝒑(𝒕 + 𝟏) = 𝑴 ⋅ 𝒑(𝒕) = 𝒑(𝒕)
then 𝒑(𝒕) is a stationary distribution of the random walk
¡ Our original rank vector 𝒓 satisfies 𝒓 = 𝑴 ⋅ 𝒓
§ So, 𝒓 is a stationary distribution for the random walk
¡ A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0
¡ Given an undirected graph with N nodes and m edges, where the nodes are pages and edges are hyperlinks
¡ Claim [Existence]: For node v, rv = dv/2m is a solution (dv … degree of node v)
¡ Proof:
§ Iteration step: r(t+1) = M · r(t), i.e. rj(t+1) = Σi→j ri(t) / di
§ Substitute ri = di/2m: each neighbor i of j contributes (di/2m)/di = 1/2m, so
rj(t+1) = Σi→j 1/2m = dj/2m = rj(t)
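The existence claim is easy to verify numerically: set rv = dv/2m and apply one iteration step; the vector should be unchanged. A sketch on a small hypothetical 4-node undirected graph (each undirected edge acts as two directed links):

```python
# Hypothetical undirected graph on 4 nodes; m = number of edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n, m = 4, len(edges)
deg = [0] * n
for u, v in edges:
    deg[u] += 1
    deg[v] += 1

r = [deg[v] / (2 * m) for v in range(n)]     # candidate solution r_v = d_v/2m

# One power-iteration step: r_new[j] = sum over neighbors i of r[i]/deg[i]
r_new = [0.0] * n
for u, v in edges:
    r_new[v] += r[u] / deg[u]
    r_new[u] += r[v] / deg[v]

print(r, r_new)   # the two vectors coincide: r is stationary
```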
¡ Node 1 has the highest PR, followed by Node 3
¡ Degree ≠ PageRank
¡ Add edge 3 -> 2. Now, which node has highest
PageRank? Second highest?
¡ Node 3 has the highest PR, followed by 2.
¡ Small changes to graph can change PR!
rj(t+1) = Σi→j ri(t) / di
or equivalently r = Mr
¡ Does this converge?
rj(t+1) = Σi→j ri(t) / di

        a   b
M = a [ 0   1 ]
    b [ 1   0 ]

¡ Example:
ra  =  1  0  1  0  …
rb     0  1  0  1
Iteration 0, 1, 2, …
The scores oscillate and never converge: the walk cycles between a and b
rj(t+1) = Σi→j ri(t) / di

        a   b
M = a [ 0   0 ]
    b [ 1   0 ]

¡ Example:
ra  =  1  0  0  0  …
rb     0  1  0  0
Iteration 0, 1, 2, …
b is a dead end: the importance “leaks out”
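The leak is visible numerically: with the dead end b, the total PageRank mass drops to zero within two steps. A tiny sketch (the 2-page matrix reconstructed from the iterates above):

```python
# a -> b, and b has no out-links, so M's column for b is all zeros.
M = [[0.0, 0.0],
     [1.0, 0.0]]

r = [1.0, 0.0]
history = []
for _ in range(3):
    r = [sum(M[j][i] * r[i] for i in range(2)) for j in range(2)]
    history.append(sum(r))          # total mass after each step

print(history)   # [1.0, 0.0, 0.0] -- all importance has leaked out
```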
Two problems:
¡ (1) Dead ends: Some pages have no out-links
§ Random walk has “nowhere” to go to
§ Such pages cause importance to “leak out”
¡ (2) Spider traps: All out-links of a group of pages stay within the group
§ Random walk gets “stuck” in the trap
§ Eventually the trap absorbs all importance
¡ Power Iteration:                      y   a   m
§ Set rj = 1/N                     y [ ½   ½   0 ]
§ rj = Σi→j ri / di                a [ ½   0   0 ]
§ And iterate                      m [ 0   ½   0 ]
m is a dead end:
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2
¡ Example:
     ry      1/3   2/6   3/12   5/24         0
     ra  =   1/3   1/6   2/12   3/24   …     0
     rm      1/3   1/6   1/12   2/24         0
Iteration 0, 1, 2, …
Here the PageRank score “leaks” out since the matrix is not stochastic.
¡ Teleports: Follow random teleport links with probability 1.0 from dead-ends
§ Adjust matrix accordingly

        y   a   m                 y   a   m
    y [ ½   ½   0 ]           y [ ½   ½   ⅓ ]
    a [ ½   0   0 ]    →      a [ ½   0   ⅓ ]
    m [ 0   ½   0 ]           m [ 0   ½   ⅓ ]
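The adjustment amounts to replacing every all-zero column with 1/N. A sketch (plain Python; `fix_dead_ends` is an illustrative name) applied to the y/a/m matrix in which m is a dead end:

```python
# Make M column stochastic: a dead end (all-zero column) teleports to every
# page with probability 1/N.
def fix_dead_ends(M):
    n = len(M)
    for c in range(n):
        if sum(M[r][c] for r in range(n)) == 0:   # column c is a dead end
            for r in range(n):
                M[r][c] = 1.0 / n
    return M

M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]        # m's column is all zeros
M = fix_dead_ends(M)
print(M)   # m's column becomes [1/3, 1/3, 1/3]
```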
Why are dead-ends and spider traps a problem
and why do teleports solve the problem?
¡ Spider-traps are not a problem, but with traps
PageRank scores are not what we want
§ Solution: Never get stuck in a spider trap by
teleporting out of it in a finite number of steps
¡ Dead-ends are a problem
§ The matrix is not column stochastic so our initial
assumptions are not met
§ Solution: Make matrix column stochastic by always
teleporting when there is nowhere else to go
¡ Google’s solution that does it all:
At each step, random surfer has two options:
§ With probability β, follow a link at random
§ With probability 1-β, jump to some random page

rj = Σi→j β ri/di + (1-β) 1/N

This formulation assumes that 𝑴 has no dead ends. We can either preprocess matrix 𝑴 to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends.
¡ PageRank equation [Brin-Page, ‘98]
rj = Σi→j β ri/di + (1-β) 1/N
¡ The Google Matrix A: ([1/N]NxN … N by N matrix where all entries are 1/N)
A = β M + (1-β) [1/N]NxN
¡ We have a recursive problem: 𝒓 = 𝑨 ⋅ 𝒓
And the Power method still works!
¡ What is β?
§ In practice β = 0.8, 0.9 (jump every 5 steps on avg.)
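The Google matrix and the power method fit together in a few lines. A sketch in plain Python (`pagerank` is an illustrative name; β = 0.8, and the matrix is the y/a/m example with the spider trap m → m used in the numeric example):

```python
# Google-matrix PageRank: A = beta*M + (1-beta)/N, then power-iterate.
# Assumes M has no dead ends, as the slide notes.
def pagerank(M, beta=0.8, eps=1e-10):
    n = len(M)
    A = [[beta * M[j][i] + (1 - beta) / n for i in range(n)] for j in range(n)]
    r = [1.0 / n] * n
    while True:
        r_new = [sum(A[j][i] * r[i] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(r_new, r)) < eps:
            return r_new
        r = r_new

# y/a/m with the spider trap m -> m.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
r = pagerank(M, beta=0.8)
print([round(x, 3) for x in r])   # ~ [0.212, 0.152, 0.636] = [7/33, 5/33, 21/33]
```

Note how teleportation keeps the trap m from absorbing all the importance: m ends up large (21/33) but not 1.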
            M                  [1/N]NxN
      ½   ½   0           1/3  1/3  1/3
0.8 · ½   0   0   + 0.2 · 1/3  1/3  1/3
      0   ½   1           1/3  1/3  1/3

       7/15   7/15   1/15
A =    7/15   1/15   1/15
       1/15   7/15  13/15

(y/a/m example with the spider trap m → m; e.g. the surfer moves y → a with probability 7/15 and stays at m with probability 13/15)
¡ Key step is matrix-vector multiplication
§ rnew = A · rold
¡ Easy if we have enough main memory to hold A, rold, rnew
¡ Say N = 1 billion pages, and we need 4 bytes for each entry (say)
§ 2 billion entries for the vectors rnew and rold alone: approx 8GB
§ But A = β·M + (1-β)[1/N]NxN is dense: it has N² = 10¹⁸ entries
¡ Assume enough RAM to fit rnew into memory
§ Store rold and matrix M on disk
¡ 1 step of power-iteration is:
Initialize all entries of rnew = (1-β) / N      (assuming no dead ends)
For each page i (of out-degree di):
  Read into memory: i, di, dest1, …, destdi, rold(i)
  For j = 1…di
    rnew(destj) += β rold(i) / di

source   degree   destinations
  0        3      1, 5, 6
  1        4      17, 64, 113, 117
  2        2      13, 23
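The update loop above, written as runnable Python over an adjacency-list representation (a sketch; assumes no dead ends, and the y/a/m graph indexed 0/1/2 stands in for the on-disk link file):

```python
# One power-iteration step over a sparse link representation:
# r_new starts at (1-beta)/N, and each page i spreads beta*r_old[i]/d_i
# to its destinations. Assumes no dead ends.
def sparse_step(links, r_old, beta=0.8):
    n = len(r_old)
    r_new = [(1 - beta) / n] * n
    for i, dests in links.items():       # links: page -> list of out-links
        share = beta * r_old[i] / len(dests)
        for j in dests:
            r_new[j] += share
    return r_new

links = {0: [0, 1], 1: [0, 2], 2: [1]}   # y/a/m graph, indexed 0/1/2
r = [1 / 3] * 3
for _ in range(100):
    r = sparse_step(links, r)
print([round(x, 3) for x in r])   # ~ [0.376, 0.398, 0.226]
```

In the real algorithm only one row of the link file and one entry of rold are in memory at a time; the dictionary here simply plays that file's role.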
Some Problems with PageRank:
¡ Measures generic popularity of a page
§ Biased against topic-specific authorities
§ Solution: Topic-Specific PageRank (next)
¡ Uses a single measure of importance
§ Other models of importance
§ Solution: Hubs-and-Authorities
¡ Susceptible to Link spam
§ Artificial link topologies created in order to boost page rank
§ Solution: TrustRank