Lecture 9
[Course map: Dimensionality reduction · Duplicate document detection · Spam detection · Queries on streams · Perceptron · kNN]
Examples of graphs:
[Figure: Facebook social graph — 4 degrees of separation, Backstrom-Boldi-Rosa-…]
[Figure: Connections between political blogs]
[Figure: Citation networks and maps of science, Börner et al., 2012]
[Figure: the Internet as a graph — routers connecting domain1, domain2, domain3]
Seven Bridges of Königsberg [Euler, 1735]:
Return to the starting point by traveling each link of the graph once and only once.
Web as a directed graph:
Nodes: Webpages
Edges: Hyperlinks
[Figure: example pages linking to each other — "I teach a class on Networks." → "CS224W: Classes are in the Gates building." → "Computer Science Department at Stanford" → "Stanford University"]
How to organize the Web?
First try: human-curated Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval investigates: finding relevant docs in a small and trusted set
Newspaper articles, Patents, etc.
But: the Web is huge and full of untrusted documents.
2 challenges of web search:
(1) The Web contains many sources of information. Who to "trust"?
Trick: trustworthy pages may point to each other!
All web pages are not equally "important":
thispersondoesnotexist.com vs. www.stanford.edu
We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
PageRank
Topic-Specific (Personalized) PageRank
Web Spam Detection Algorithms
Idea: Links as votes
A page is more important if it has more links.
In-coming links? Out-going links?
Think of in-links as votes:
www.stanford.edu has millions of in-links
thispersondoesnotexist.com has a few thousand in-links
[Figure: example PageRank scores on a small graph — e.g. nodes with scores 8.1, 3.9, 3.9, 1.6, …]
Each link's vote is proportional to the importance of its source page.
If page j with importance r_j has n out-links, each link gets r_j / n votes.
[Figure: page j with in-links from page i (3 out-links) and page k (4 out-links), so r_j = r_i/3 + r_k/4; each of j's 3 out-links then carries r_j/3 votes]
A "vote" from an important page is worth more.
A page is important if it is pointed to by other important pages.
Define a "rank" r_j for page j:

  r_j = Σ_{i→j} r_i / d_i      (d_i … out-degree of node i)

Example ("the web in 1839"): three pages y, a, m with links y→y, y→a, a→y, a→m, m→a.
"Flow" equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
Flow equations:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2
3 equations, 3 unknowns, no constants → no unique solution (any solution can be scaled).
Adding a normalization constraint forces uniqueness: r_y + r_a + r_m = 1.
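The flow equations can be solved directly once the sum-to-one constraint is added; a minimal sketch with NumPy (matrix entries taken from the y/a/m example):

```python
import numpy as np

# Flow equations r = M r for the y/a/m example, rewritten as
# (M - I) r = 0 plus the normalization constraint sum(r) = 1.
M = np.array([[0.5, 0.5, 0.0],   # r_y = r_y/2 + r_a/2
              [0.5, 0.0, 1.0],   # r_a = r_y/2 + r_m
              [0.0, 0.5, 0.0]])  # r_m = r_a/2
A = np.vstack([M - np.eye(3), np.ones((1, 3))])  # 4 equations, 3 unknowns
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)        # least-squares solve
print(r)  # approx [0.4, 0.4, 0.2], i.e. r_y = r_a = 2/5, r_m = 1/5
```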
Matrix formulation: define the stochastic adjacency matrix M with M_ji = 1/d_i if i → j, else 0 (columns sum to 1). The flow equations then become r = M·r:

       y   a   m
  y  [ ½   ½   0 ]      r_y = r_y/2 + r_a/2
  a  [ ½   0   1 ]      r_a = r_y/2 + r_m
  m  [ 0   ½   0 ]      r_m = r_a/2

  r = M·r
The flow equations can be written as r = M·r.
So the rank vector r is an eigenvector of M with eigenvalue 1 (recall: x is an eigenvector of A with corresponding eigenvalue λ if A·x = λ·x).
The math: the limiting distribution is the principal eigenvector of M — this is PageRank.
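As a quick numerical check (a sketch, not part of the slides), NumPy's eigendecomposition recovers r as the eigenvector of M for eigenvalue 1:

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
vals, vecs = np.linalg.eig(M)
k = int(np.argmin(np.abs(vals - 1.0)))  # pick the eigenvalue closest to 1
r = np.real(vecs[:, k])
r = r / r.sum()                          # normalize to a distribution
print(np.real(vals[k]), r)               # eigenvalue 1, r approx [0.4 0.4 0.2]
```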
Power Iteration:
Set r_j = 1/N.
1: r'_j = Σ_{i→j} r_i / d_i
2: r = r'
Goto 1.

       y   a   m
  y  [ ½   ½   0 ]      r_y = r_y/2 + r_a/2
  a  [ ½   0   1 ]      r_a = r_y/2 + r_m
  m  [ 0   ½   0 ]      r_m = r_a/2

Example (iteration 0, 1, 2, 3, …):
  r_y   1/3   1/3   5/12   9/24   …   6/15
  r_a   1/3   3/6   1/3    11/24  …   6/15
  r_m   1/3   1/6   3/12   1/6    …   3/15
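The power iteration above can be sketched in a few lines (the tolerance is illustrative):

```python
import numpy as np

def power_iterate(M, eps=1e-12):
    n = M.shape[0]
    r = np.full(n, 1.0 / n)        # set r_j = 1/N
    while True:
        r_new = M @ r              # r'_j = sum_{i->j} r_i / d_i
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new                  # r = r', goto 1

M = np.array([[0.5, 0.5, 0.0],     # the y/a/m example matrix
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))  # converges to [6/15, 6/15, 3/15] = [0.4, 0.4, 0.2]
```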
Imagine a random web surfer:
At any time t, the surfer is on some page i.
At time t+1, the surfer follows an out-link from i uniformly at random.
Ends up on some page j linked from i.
The process repeats indefinitely.
Let:
p(t) … vector whose i-th coordinate is the probability that the surfer is at page i at time t.
So, p(t) is a probability distribution over pages.
Where is the surfer at time t+1?
Follows a link uniformly at random: p(t+1) = M·p(t).
Suppose the random walk reaches a state where p(t+1) = M·p(t) = p(t). Then p(t) is a stationary distribution of the random walk.
Our original rank vector r satisfies r = M·r.
So, r is a stationary distribution for the random walk.
A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions, the stationary distribution is unique and is eventually reached regardless of the initial distribution.
Sanity check for an undirected graph with m edges: substitute r_i = d_i/2m into the flow equation to get r_j = Σ_{i→j} (d_i/2m)/d_i = d_j/2m, so degree over 2m is stationary.
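The surfer process can also be simulated directly; a sketch on the y/a/m example (seed and step count are illustrative), where visit frequencies approach the stationary distribution r = (0.4, 0.4, 0.2):

```python
import random
from collections import Counter

out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}
random.seed(0)
node = 'y'
steps = 200_000
visits = Counter()
for _ in range(steps):
    node = random.choice(out_links[node])  # follow an out-link uniformly
    visits[node] += 1
p = {v: c / steps for v, c in visits.items()}
print(p)  # visit frequencies close to 0.4 (y), 0.4 (a), 0.2 (m)
```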
The update applied at each step: r_j^(t+1) = Σ_{i→j} r_i^(t) / d_i

Example (a ↔ b, power iteration oscillates):
  r_a   1   0   1   0   …
  r_b   0   1   0   1   …
  Iteration 0, 1, 2, …

Example (a → b, b is a dead end, all score leaks out):
  r_a   1   0   0   0   …
  r_b   0   1   0   0   …
  Iteration 0, 1, 2, …
Dead end
Two problems:
(1) Dead ends: Some
pages have no out-links
Random walk has “nowhere”
to go to
Such pages cause importance
to “leak out”
Power Iteration with a spider trap (m links only to itself):

       y   a   m
  y  [ ½   ½   0 ]      r_y = r_y/2 + r_a/2
  a  [ ½   0   0 ]      r_a = r_y/2
  m  [ 0   ½   1 ]      r_m = r_a/2 + r_m

Set r_j = 1/N, apply r'_j = Σ_{i→j} r_i/d_i, and iterate. m is a spider trap.

Example (iteration 0, 1, 2, 3, …):
  r_y   1/3   2/6   3/12   5/24   …   0
  r_a   1/3   1/6   2/12   3/24   …   0
  r_m   1/3   3/6   7/12   16/24  …   1
All the PageRank score gets "trapped" in node m.
The Google solution for spider traps: at each time step, the random surfer has two options:
With probability β, follow a link at random.
With probability 1-β, jump to some random page.
β is typically in the range 0.8 to 0.9.
The surfer will teleport out of a spider trap within a few time steps.
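A sketch of power iteration with teleports on the spider-trap example (β = 0.8 here): the trap still collects the largest share of the score, but no longer everything. For this graph and β = 0.8 the exact fixed point works out to (7/33, 5/33, 21/33).

```python
import numpy as np

def pagerank_teleport(M, beta=0.8, eps=1e-12):
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        # with prob. beta follow a link, with prob. 1-beta teleport
        r_new = beta * (M @ r) + (1.0 - beta) / n
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# spider-trap example: m links only to itself
M_trap = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 1.0]])
print(pagerank_teleport(M_trap))  # approx [7/33, 5/33, 21/33]
```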
Power Iteration with a dead end (m has no out-links):

       y   a   m
  y  [ ½   ½   0 ]      r_y = r_y/2 + r_a/2
  a  [ ½   0   0 ]      r_a = r_y/2
  m  [ 0   ½   0 ]      r_m = r_a/2

Set r_j = 1/N, apply r'_j = Σ_{i→j} r_i/d_i, and iterate.

Example (iteration 0, 1, 2, 3, …):
  r_y   1/3   2/6   3/12   5/24   …   0
  r_a   1/3   1/6   2/12   3/24   …   0
  r_m   1/3   1/6   1/12   2/24   …   0
The solution for dead ends: always teleport from a dead end. Adjust the matrix so that the dead end's column teleports everywhere uniformly:

       y   a   m            y   a   m
  y  [ ½   ½   0 ]     y  [ ½   ½   ⅓ ]
  a  [ ½   0   0 ]  →  a  [ ½   0   ⅓ ]
  m  [ 0   ½   0 ]     m  [ 0   ½   ⅓ ]
Why are dead-ends and spider traps a problem, and why do teleports solve the problem?
Spider traps are not a problem for convergence, but with traps the PageRank scores are not what we want.
Solution: never get stuck in a spider trap by teleporting out of it in a finite number of steps.
Dead ends are a problem: the matrix is not column stochastic, so our initial assumptions are not met.
Solution: make the matrix column stochastic by always teleporting when there is nowhere else to go.
Google's solution that does it all:
At each step, the random surfer has two options:
With probability β, follow a link at random.
With probability 1-β, jump to some random page.

PageRank equation:
  r_j = Σ_{i→j} β r_i/d_i + (1-β) 1/N

Complete algorithm:
Input: graph G and parameter β.
Output: PageRank vector r^new.
Set: r_j^old = 1/N.
Repeat until convergence (Σ_j |r_j^new − r_j^old| < ε):
  ∀j: r'_j^new = Σ_{i→j} β r_i^old / d_i
      r'_j^new = 0 if in-degree of j is 0
  Now re-insert the leaked PageRank:
  ∀j: r_j^new = r'_j^new + (1−S)/N, where S = Σ_j r'_j^new
  r^old = r^new
If the graph has no dead ends, then the amount of leaked PageRank is exactly 1-β. But since we have dead ends, the amount of leaked PageRank may be larger, so we have to account for it explicitly by computing S.
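The complete algorithm can be sketched as follows (the graph encoding and parameters are illustrative); the leaked mass (1−S)/N is re-inserted each iteration, so the scores sum to 1 even with dead ends:

```python
import numpy as np

def pagerank(out_links, n, beta=0.8, eps=1e-12):
    r = np.full(n, 1.0 / n)                  # r_j^old = 1/N
    while True:
        r_prime = np.zeros(n)
        for i, dests in out_links.items():
            for j in dests:
                r_prime[j] += beta * r[i] / len(dests)
        s = r_prime.sum()                    # S = sum_j r'_j
        r_new = r_prime + (1.0 - s) / n      # re-insert leaked PageRank
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# y=0, a=1, m=2; m is a dead end (no out-links)
out_links = {0: [0, 1], 1: [0, 2]}
r = pagerank(out_links, 3)
print(r, r.sum())  # a proper distribution despite the dead end
```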
Encode the sparse matrix using only its nonzero entries.
Space is roughly proportional to the number of links: say 10N, or 4*10*1 billion = 40GB.
Still won't fit in memory, but will fit on disk.

  source node   degree   destination nodes
  0             3        1, 5, 7
  1             5        17, 64, 113, 117, 245
  2             2        13, 23
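A sketch of this encoding (the rows are the example table above):

```python
# store (source, out-degree, destinations) per page instead of a dense
# N x N matrix; space is roughly one integer per link plus a small
# per-row header
rows = [
    (0, 3, [1, 5, 7]),
    (1, 5, [17, 64, 113, 117, 245]),
    (2, 2, [13, 23]),
]
n_links = sum(deg for _, deg, _ in rows)
print(n_links)  # -> 10
```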
Assume enough RAM to fit r^new into memory.
Store r^old and matrix M on disk.
One step of power iteration is (assuming no dead ends):
Initialize all entries of r^new to (1-β)/N.
For each page i (of out-degree d_i):
  Read into memory: i, d_i, dest_1, …, dest_{d_i}, r^old(i)
  For j = 1…d_i:
    r^new(dest_j) += β r^old(i) / d_i

  source   degree   destination
  0        3        1, 5, 6
  1        4        17, 64, 113, 117
  2        2        13, 23
4/27/2021 Jure Leskovec, Stanford C246: Mining Massive Datasets 60
Assume enough RAM to fit r^new into memory; store r^old and matrix M on disk.
In each iteration, we have to:
Read r^old and M
Write r^new back to disk
Cost per iteration of the Power method: = 2|r| + |M|
Question: what if we could not even fit r^new in memory?
If r^new does not fit in memory, break it into k blocks that do fit, and update one block at a time, scanning M and r^old once per block:

  src   degree   destination
  0     4        0, 1, 3, 5
  1     2        0, 5
  2     2        3, 4

Can we do better?
Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration.
Stripe for r^new block {0, 1}:
  src   degree   destination
  0     4        0, 1
  1     2        0
Stripe for r^new block {2, 3}:
  0     4        3
  2     2        3
Stripe for r^new block {4, 5}:
  0     4        5
  1     2        5
  2     2        4
Break M into stripes!
Each stripe contains only destination nodes in the corresponding block of r^new.
There is some additional overhead per stripe, but it is usually worth it.
Cost per iteration of the Power method: = |M|(1 + ε) + (k+1)|r|
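The striping step can be sketched as follows (block size 2, using the 6-node example above); note that each stripe keeps the source's full out-degree, so the r^old(i)/d_i update stays correct:

```python
from collections import defaultdict

# rows of M: (source, out-degree, destinations)
M_rows = [(0, 4, [0, 1, 3, 5]),
          (1, 2, [0, 5]),
          (2, 2, [3, 4])]
block = 2                                   # nodes per block of r_new
stripes = defaultdict(dict)                 # stripe id -> {src: (deg, dests)}
for src, deg, dests in M_rows:
    for d in dests:
        _deg, ds = stripes[d // block].setdefault(src, (deg, []))
        ds.append(d)                        # keep only dests in this stripe
print(dict(stripes[0]))  # -> {0: (4, [0, 1]), 1: (2, [0])}
print(dict(stripes[2]))  # -> {0: (4, [5]), 1: (2, [5]), 2: (2, [4])}
```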