CSF-469-L11-13 (Link Analysis Page Rank)

This document discusses link analysis and PageRank algorithms for analyzing large graphs, specifically the web graph. It describes: 1) The web can be modeled as a directed graph with webpages as nodes and hyperlinks as edges. 2) PageRank is an algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. It defines a "random surfer" model and considers the probability of ending up at a page as an indication of its importance. 3) The PageRank algorithm can be formulated as the principal eigenvector of the normalized link matrix of the web graph.

Analysis of Large Graphs:

Link Analysis, PageRank


Web as a Graph
◼ Web as a directed graph:
▪ Nodes: Webpages
▪ Edges: Hyperlinks

[Figure: example webpages connected by hyperlinks — "I teach a class on IR", "CS F469: Classes are in the LTC building", "CSIS Dept at BITS HYD", "BITS Pilani Hyderabad Campus"]
Web as a Directed Graph

Broad Question
◼ How to organize the Web?
◼ First try: Human-curated web directories
▪ Yahoo, DMOZ, LookSmart
◼ Second try: Web search
▪ Information Retrieval investigates: find relevant docs in a small and trusted set
▪ Newspaper articles, patents, etc.
▪ But: the Web is huge, full of untrusted documents, random things, web spam, etc.
Web Search: 2 Challenges
2 challenges of web search:
◼ (1) The Web contains many sources of information. Whom to "trust"?
▪ Trick: Trustworthy pages may point to each other!
◼ (2) What is the "best" answer to the query "newspaper"?
▪ No single right answer
▪ Trick: Pages that actually know about newspapers might all be pointing to many newspapers
Ranking Nodes on the Graph
◼ All web pages are not equally "important":
http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile
vs.
http://www.bits-pilani.ac.in/Hyderabad/index.aspx

◼ There is large diversity in the web-graph node connectivity.
Let's rank the pages by the link structure!
Link Analysis Algorithms
◼ We will cover the following link analysis approaches for computing the importance of nodes in a graph:
▪ PageRank
▪ Topic-Specific (Personalized) PageRank
▪ Web Spam Detection Algorithms
PageRank:
The “Flow” Formulation
Links as Votes
◼ Idea: Links as votes
▪ A page is more important if it has more links
▪ In-coming links? Out-going links?
◼ Think of in-links as votes:
▪ http://www.bits-pilani.ac.in/Hyderabad/index.aspx has 23,400 in-links
▪ http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile has 1 in-link
◼ Are all in-links equal?
▪ Links from important pages count more
▪ Recursive question!
Example: PageRank Scores

[Figure: example graph of eleven nodes with PageRank scores — the two most heavily linked nodes score 38.4 and 34.3, the others score 8.1, 3.9, 3.9, and 3.3, and five leaf nodes score 1.6 each]
Simple Recursive Formulation
◼ Each link's vote is proportional to the importance of its source page
◼ If page i with importance ri has di out-links, each link gets ri / di votes
◼ Page j's own importance rj is the sum of the votes on its in-links:
rj = ri/3 + rk/4

[Figure: page i with 3 out-links and page k with 4 out-links both point to j, so j receives ri/3 + rk/4; j's own 3 out-links each carry rj/3]
PageRank: The "Flow" Model
◼ A "vote" from an important page is worth more
◼ A page is important if it is pointed to by other important pages
◼ Define a "rank" rj for page j:
rj = Σi→j ri / di    (di … out-degree of node i)

[Figure: "The web in 1839" — three pages y, a, m; y links to itself and to a (each link carries ry/2); a links to y and to m (each carries ra/2); m links only to a]

"Flow" equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Solving the Flow Equations
Flow equations:
◼ ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
◼ 3 equations, 3 unknowns, no constants: no unique solution (any scalar multiple of a solution is also a solution)
◼ Additional constraint forces uniqueness: ry + ra + rm = 1
◼ Solution: ry = 2/5, ra = 2/5, rm = 1/5
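The flow equations can be checked numerically. A minimal sketch (variable names are mine): replace the redundant third equation with the normalization constraint and solve the resulting linear system.

```python
import numpy as np

# Flow equations rewritten as C @ r = b, with the redundant third
# equation replaced by the constraint ry + ra + rm = 1:
#   ry - ry/2 - ra/2 = 0
#   ra - ry/2 - rm   = 0
#   ry + ra + rm     = 1
C = np.array([[ 0.5, -0.5,  0.0],
              [-0.5,  1.0, -1.0],
              [ 1.0,  1.0,  1.0]])
b = np.array([0.0, 0.0, 1.0])

ry, ra, rm = np.linalg.solve(C, b)
print(ry, ra, rm)  # 0.4 0.4 0.2
```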
PageRank: Matrix Formulation
◼ Stochastic adjacency matrix M
▪ Page i has di out-links; if i → j, then Mji = 1/di, else Mji = 0
▪ M is column stochastic: each column sums to 1
◼ Rank vector r: one entry ri per page
▪ ri is the importance score of page i
▪ Σi ri = 1
◼ The flow equations can be written as r = M · r

Eigenvector Formulation
◼ The flow equations r = M · r say that the rank vector r is an eigenvector of the stochastic web matrix M, with eigenvalue 1
▪ In fact, r is M's principal eigenvector

Example: Flow Equations & M

      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

r = M·r:

ry = ry /2 + ra /2        y     ½ ½ 0   y
ra = ry /2 + rm           a  =  ½ 0 1 · a
rm = ra /2                m     0 ½ 0   m
Power Iteration Method
◼ Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
◼ Power iteration: a simple iterative scheme
▪ Initialize: r(0) = [1/N,…,1/N]T
▪ Iterate: r(t+1) = M · r(t), i.e., rj(t+1) = Σi→j ri(t)/di   (di … out-degree of node i)
▪ Stop when |r(t+1) – r(t)|1 < ε
|x|1 = ∑1≤i≤N |xi| is the L1 norm
Can use any other vector norm, e.g., Euclidean
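The scheme above is a few lines of code. A minimal sketch (function name is mine; assumes M is a dense column-stochastic NumPy array):

```python
import numpy as np

def power_iterate(M, eps=1e-8, max_iter=1000):
    """Iterate r <- M @ r until the L1 change drops below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)          # r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = M @ r               # r(t+1) = M . r(t)
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic M for the y/a/m example (columns: y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))  # ≈ [0.4, 0.4, 0.2], matching ry = ra = 2/5, rm = 1/5
```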
PageRank: How to solve?
◼ Power iteration on the example:

      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

Iteration 0, 1, 2, … converges to r = (ry, ra, rm) = (2/5, 2/5, 1/5)
Random Walk Interpretation
◼ Imagine a random web surfer:
▪ At any time t, the surfer is on some page i
▪ At time t+1, the surfer follows an out-link from i uniformly at random
▪ Ends up on some page j linked from i
▪ The process repeats indefinitely
◼ Let p(t) be the vector whose ith coordinate is the probability that the surfer is at page i at time t
▪ p(t) is a probability distribution over pages

The Stationary Distribution
◼ Where is the surfer at time t+1? Follows a link uniformly at random: p(t+1) = M · p(t)
◼ Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk
◼ Our rank vector r satisfies r = M · r, so r is a stationary distribution for this random walk
Existence and Uniqueness
◼ A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.
PageRank:
The Google Formulation

PageRank: Three Questions
r(t+1) = M · r(t), or equivalently r = M · r
◼ Does this converge?
◼ Does it converge to what we want?
◼ Are results reasonable?
Does this converge?

◼ Example: two nodes, a → b and b → a

ra   1 0 1 0 …
rb = 0 1 0 1 …

Iteration 0, 1, 2, … — the scores oscillate forever and never converge

Does it converge to what we want?

◼ Example: two nodes, a → b, where b has no out-links

ra   1 0 0 0 …
rb = 0 1 0 0 …

Iteration 0, 1, 2, … — all the importance leaks out and r goes to 0
PageRank: Problems
2 problems:
◼ (1) Some pages are dead ends (have no out-links)
▪ Random walk has "nowhere" to go to
▪ Such pages cause importance to "leak out"
◼ (2) Spider traps (all out-links are within the group)
▪ Random walk gets "stuck" in a trap
▪ And eventually spider traps absorb all importance
Problem: Spider Traps

      y   a   m
  y   ½   ½   0
  a   ½   0   0
  m   0   ½   1

m is a spider trap (its only out-link points to itself):
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm

Iteration 0, 1, 2, … converges to r = (0, 0, 1).
All the PageRank score gets "trapped" in node m.
Solution: Teleports!
◼ The Google solution for spider traps: at each time step, the random surfer has two options
▪ With prob. β, follow a link at random
▪ With prob. 1-β, jump to some random page
▪ Common values for β are in the range 0.8 to 0.9
◼ Surfer will teleport out of spider trap within a few time steps

[Figure: the y/a/m graph before and after adding teleport links out of the trap]
Problem: Dead Ends

      y   a   m
  y   ½   ½   0
  a   ½   0   0
  m   0   ½   0

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2

Iteration 0, 1, 2, … converges to r = (0, 0, 0).

Here the PageRank "leaks" out since the matrix is not column stochastic: m's column sums to 0.
Solution: Always Teleport!
◼ Teleports: Follow random teleport links with probability 1.0 from dead-ends
▪ Adjust matrix accordingly: replace the dead-end's all-zero column with 1/N in every entry

      y   a   m             y   a   m
  y   ½   ½   0        y    ½   ½   ⅓
  a   ½   0   0   →    a    ½   0   ⅓
  m   0   ½   0        m    0   ½   ⅓
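The dead-end adjustment above can be sketched as a small helper (the function name is mine; assumes a dense column-oriented matrix where column i holds page i's out-links):

```python
import numpy as np

def fix_dead_ends(M):
    """Replace each all-zero column (a dead end) with uniform 1/N,
    making M column stochastic."""
    M = M.copy()
    N = M.shape[0]
    dead = M.sum(axis=0) == 0        # columns with no out-links
    M[:, dead] = 1.0 / N
    return M

# y/a/m example where m is a dead end (columns: y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
print(fix_dead_ends(M))  # m's column becomes [1/3, 1/3, 1/3]
```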
Why Teleports Solve the Problem?
Theory of Markov chains — teleports:
▪ Make M stochastic (no dead ends: every column sums to 1)
▪ Make M aperiodic (the walk cannot get locked into a fixed cycle)
▪ Make M irreducible (every state is reachable from every other state)
Solution: Random Teleports
◼ Google's solution, which handles both problems: at each step the surfer follows a link at random with probability β and teleports to a random page with probability 1-β
◼ PageRank equation:
rj = Σi→j β ri /di + (1-β)/N
di … out-degree of node i
(This formulation assumes M has no dead-ends.)

The Google Matrix
◼ The Google Matrix A:
A = β M + (1-β) [1/N]NxN
[1/N]NxN … N by N matrix where all entries are 1/N
◼ Then the PageRank equation becomes r = A · r, and the power method still works
Random Teleports (β = 0.8)

A = 0.8 · M + 0.2 · [1/N]NxN:

        ½ ½ 0          1/3 1/3 1/3       7/15 7/15  1/15
  0.8 · ½ 0 0  + 0.2 · 1/3 1/3 1/3   =   7/15 1/15  1/15
        0 ½ 1          1/3 1/3 1/3       1/15 7/15 13/15

Power iteration on A:

  y     1/3   0.33  0.24  0.26  …   7/33
  a  =  1/3   0.20  0.20  0.18  …   5/33
  m     1/3   0.46  0.52  0.56  …  21/33
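The β = 0.8 computation above can be reproduced in a few lines. A sketch: build the Google matrix for the y/a/m spider-trap example and iterate to the fixed point.

```python
import numpy as np

beta, N = 0.8, 3

# Column-stochastic M for the y/a/m example where m is a spider trap
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])

# Google matrix: A = beta*M + (1-beta)*[1/N]_{NxN}
A = beta * M + (1 - beta) * np.full((N, N), 1.0 / N)

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
print(r)  # ≈ [7/33, 5/33, 21/33] ≈ [0.212, 0.152, 0.636]
```

Note that the teleport keeps m from absorbing everything: it ends with 21/33 of the score instead of all of it.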
How do we actually compute
the PageRank?

Computing PageRank
◼ Key step is matrix-vector multiplication
▪ rnew = A · rold
◼ Easy if we have enough main memory to hold A, rold, rnew
◼ Say N = 1 billion pages
▪ We need 4 bytes for each entry (say)
▪ 2 billion entries for the two vectors rold and rnew, approx 8GB
▪ But matrix A has N^2 = 10^18 entries — 10^18 is a large number!

A = β·M + (1-β) [1/N]NxN

        ½ ½ 0          1/3 1/3 1/3       7/15 7/15  1/15
= 0.8 · ½ 0 0  + 0.2 · 1/3 1/3 1/3   =   7/15 1/15  1/15
        0 ½ 1          1/3 1/3 1/3       1/15 7/15 13/15
Matrix Formulation
◼ Suppose there are N pages
◼ Consider page i, with di out-links
◼ We have Mji = 1/|di| when i → j
and Mji = 0 otherwise
◼ The random teleport is equivalent to:
▪ Adding a teleport link from i to every other page
and setting transition probability to (1-β)/N
▪ Reducing the probability of following each
out-link from 1/|di| to β/|di|
▪ Equivalent: Tax each page a fraction (1-β) of its
score and redistribute evenly
Rearranging the Equation
◼ r = A · r, where Aji = β Mji + (1-β)/N
◼ rj = Σi [β Mji + (1-β)/N] ri
   = Σi β Mji ri + (1-β)/N Σi ri
   = Σi β Mji ri + (1-β)/N    (since Σi ri = 1)
◼ So r = β M · r + [(1-β)/N]N
Note: Here we assumed M has no dead-ends.
[x]N … a vector of length N with all entries x
PageRank: The Complete Algorithm
◼ Input: Graph G and parameter β
◼ Output: PageRank vector r
▪ Initialize: rj(0) = 1/N
▪ Repeat until convergence (Σj |rj(t+1) – rj(t)| < ε):
  ∀j: r′j(t+1) = Σi→j β ri(t)/di   (r′j(t+1) = 0 if j has no in-links)
  Re-insert the leaked PageRank: ∀j: rj(t+1) = r′j(t+1) + (1-S)/N, where S = Σj r′j(t+1)

If the graph has no dead-ends then the amount of leaked PageRank is 1-β. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
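The complete algorithm, with the leaked score S re-inserted each iteration, can be sketched as follows (function and variable names are mine; the graph is an out-link adjacency dict):

```python
import numpy as np

def pagerank(out_links, N, beta=0.8, eps=1e-10):
    """out_links[i] = list of pages that page i links to.
    Dead ends (empty lists) contribute nothing, so the leaked
    score (1 - S) is redistributed uniformly every iteration."""
    r = np.full(N, 1.0 / N)
    while True:
        r_new = np.zeros(N)
        for i, targets in out_links.items():
            for j in targets:                 # each out-link carries beta*r[i]/d[i]
                r_new[j] += beta * r[i] / len(targets)
        r_new += (1.0 - r_new.sum()) / N      # re-insert leaked PageRank (1 - S)
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# y=0, a=1, m=2; m is a dead end here
graph = {0: [0, 1], 1: [0, 2], 2: []}
print(pagerank(graph, 3))
```

When the graph has no dead ends this reduces to the Google-matrix iteration, since S = β exactly.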
Some Problems with PageRank
◼ Measures generic popularity of a page
▪ Biased against topic-specific authorities
▪ Solution: Topic-Specific PageRank (next)
◼ Uses a single measure of importance
▪ Other models of importance exist
▪ Solution: Hubs-and-Authorities
◼ Susceptible to link spam
▪ Artificial link topologies created in order to boost PageRank
▪ Solution: TrustRank
