CSF-469-L11-13 (Link Analysis Page Rank)

This document discusses link analysis and PageRank algorithms for analyzing large graphs, specifically the web graph. It describes: 1) The web can be modeled as a directed graph with webpages as nodes and hyperlinks as edges. 2) PageRank is an algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. It defines a "random surfer" model and considers the probability of ending up at a page as an indication of its importance. 3) The PageRank algorithm can be formulated as the principal eigenvector of the normalized link matrix of the web graph.

Analysis of Large Graphs:

Link Analysis, PageRank


Web as a Graph
◼ Web as a directed graph:
▪ Nodes: Webpages
▪ Edges: Hyperlinks

[Figure: example webpages connected by hyperlinks — "I teach a class on IR", "CS F469: Classes are in the LTC building", "CSIS Dept at BITS HYD", "BITS Pilani Hyderabad Campus"]
Web as a Directed Graph

Broad Question
◼ How to organize the Web?
◼ First try: Human-curated web directories
▪ Yahoo, DMOZ, LookSmart
◼ Second try: Web search
▪ Information Retrieval investigates: find relevant docs in a small and trusted set
▪ Newspaper articles, patents, etc.
▪ But: the Web is huge, full of untrusted documents, random things, web spam, etc.
Web Search: 2 Challenges
2 challenges of web search:
◼ (1) The Web contains many sources of information. Whom to "trust"?
▪ Trick: Trustworthy pages may point to each other!
◼ (2) What is the "best" answer to the query "newspaper"?
▪ No single right answer
▪ Trick: Pages that actually know about newspapers might all be pointing to many newspapers
Ranking Nodes on the Graph
◼ All web pages are not equally "important":
http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile
vs.
http://www.bits-pilani.ac.in/Hyderabad/index.aspx

◼ There is large diversity in the web-graph node connectivity.
Let's rank the pages by the link structure!
Link Analysis Algorithms
◼ We will cover the following link analysis approaches for computing the importance of nodes in a graph:
▪ PageRank
▪ Topic-Specific (Personalized) PageRank
▪ Web Spam Detection Algorithms
PageRank:
The “Flow” Formulation
Links as Votes
◼ Idea: Links as votes
▪ A page is more important if it has more links
▪ In-coming links? Out-going links?
◼ Think of in-links as votes:
▪ http://www.bits-pilani.ac.in/Hyderabad/index.aspx has 23,400 in-links
▪ http://universe.bits-pilani.ac.in/hyderabad/arunamalapati/Profile has 1 in-link
◼ Are all in-links equal?
▪ Links from important pages count more
▪ Recursive question!
Example: PageRank Scores

[Figure: example graph of eleven nodes with PageRank scores — the two most heavily linked nodes score 38.4 and 34.3, the others score 8.1, 3.9, 3.9, and 3.3, and five leaf nodes score 1.6 each]
Simple Recursive Formulation
◼ Each link's vote is proportional to the importance of its source page
◼ If page i with importance ri has di out-links, each link gets ri / di votes
◼ Page j's own importance rj is the sum of the votes on its in-links:
rj = ri/3 + rk/4

[Figure: page i with 3 out-links and page k with 4 out-links both point to j, so j receives ri/3 + rk/4; j's own 3 out-links each carry rj/3]
PageRank: The "Flow" Model
◼ A "vote" from an important page is worth more
◼ A page is important if it is pointed to by other important pages
◼ Define a "rank" rj for page j:
rj = Σi→j ri / di    (di … out-degree of node i)

[Figure: "The web in 1839" — three pages y, a, m; y links to itself and to a (each link carries ry/2); a links to y and to m (each carries ra/2); m links only to a]

"Flow" equations:
ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
Solving the Flow Equations
Flow equations:
◼ ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2
◼ 3 equations, 3 unknowns, no constants: no unique solution (any scalar multiple of a solution is also a solution)
◼ Additional constraint forces uniqueness: ry + ra + rm = 1
◼ Solution: ry = 2/5, ra = 2/5, rm = 1/5
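The flow equations can be checked numerically. A minimal sketch (variable names are mine): replace the redundant third equation with the normalization constraint and solve the resulting linear system.

```python
import numpy as np

# Flow equations rewritten as C @ r = b, with the redundant third
# equation replaced by the constraint ry + ra + rm = 1:
#   ry - ry/2 - ra/2 = 0
#   ra - ry/2 - rm   = 0
#   ry + ra + rm     = 1
C = np.array([[ 0.5, -0.5,  0.0],
              [-0.5,  1.0, -1.0],
              [ 1.0,  1.0,  1.0]])
b = np.array([0.0, 0.0, 1.0])

ry, ra, rm = np.linalg.solve(C, b)
print(ry, ra, rm)  # 0.4 0.4 0.2
```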
PageRank: Matrix Formulation
◼ Stochastic adjacency matrix M
▪ Page i has di out-links; if i → j, then Mji = 1/di, else Mji = 0
▪ M is column stochastic: each column sums to 1
◼ Rank vector r: one entry ri per page
▪ ri is the importance score of page i
▪ Σi ri = 1
◼ The flow equations can be written as r = M · r

Eigenvector Formulation
◼ The flow equations r = M · r say that the rank vector r is an eigenvector of the stochastic web matrix M, with eigenvalue 1
▪ In fact, r is M's principal eigenvector

Example: Flow Equations & M

      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

r = M·r:

ry = ry /2 + ra /2        y     ½ ½ 0   y
ra = ry /2 + rm           a  =  ½ 0 1 · a
rm = ra /2                m     0 ½ 0   m
Power Iteration Method
◼ Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
◼ Power iteration: a simple iterative scheme
▪ Initialize: r(0) = [1/N,…,1/N]T
▪ Iterate: r(t+1) = M · r(t), i.e., rj(t+1) = Σi→j ri(t)/di   (di … out-degree of node i)
▪ Stop when |r(t+1) – r(t)|1 < ε
|x|1 = ∑1≤i≤N |xi| is the L1 norm
Can use any other vector norm, e.g., Euclidean
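The scheme above is a few lines of code. A minimal sketch (function name is mine; assumes M is a dense column-stochastic NumPy array):

```python
import numpy as np

def power_iterate(M, eps=1e-8, max_iter=1000):
    """Iterate r <- M @ r until the L1 change drops below eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)          # r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = M @ r               # r(t+1) = M . r(t)
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic M for the y/a/m example (columns: y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))  # ≈ [0.4, 0.4, 0.2], matching ry = ra = 2/5, rm = 1/5
```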
PageRank: How to solve?
◼ Power iteration on the example:

      y   a   m
  y   ½   ½   0
  a   ½   0   1
  m   0   ½   0

ry = ry /2 + ra /2
ra = ry /2 + rm
rm = ra /2

Iteration 0, 1, 2, … converges to r = (ry, ra, rm) = (2/5, 2/5, 1/5)
Random Walk Interpretation
◼ Imagine a random web surfer:
▪ At any time t, the surfer is on some page i
▪ At time t+1, the surfer follows an out-link from i uniformly at random
▪ Ends up on some page j linked from i
▪ The process repeats indefinitely
◼ Let p(t) be the vector whose ith coordinate is the probability that the surfer is at page i at time t
▪ p(t) is a probability distribution over pages

The Stationary Distribution
◼ Where is the surfer at time t+1? Follows a link uniformly at random: p(t+1) = M · p(t)
◼ Suppose the random walk reaches a state where p(t+1) = M · p(t) = p(t); then p(t) is a stationary distribution of the random walk
◼ Our rank vector r satisfies r = M · r, so r is a stationary distribution for this random walk
Existence and Uniqueness
◼ A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution is at time t = 0.
PageRank:
The Google Formulation

PageRank: Three Questions
r(t+1) = M · r(t), or equivalently r = M · r
◼ Does this converge?
◼ Does it converge to what we want?
◼ Are results reasonable?
Does this converge?

◼ Example: two nodes, a → b and b → a

ra   1 0 1 0 …
rb = 0 1 0 1 …

Iteration 0, 1, 2, … — the scores oscillate forever and never converge

Does it converge to what we want?

◼ Example: two nodes, a → b, where b has no out-links

ra   1 0 0 0 …
rb = 0 1 0 0 …

Iteration 0, 1, 2, … — all the importance leaks out and r goes to 0
PageRank: Problems
2 problems:
◼ (1) Some pages are dead ends (have no out-links)
▪ Random walk has "nowhere" to go to
▪ Such pages cause importance to "leak out"
◼ (2) Spider traps (all out-links are within the group)
▪ Random walk gets "stuck" in a trap
▪ And eventually spider traps absorb all importance
Problem: Spider Traps

      y   a   m
  y   ½   ½   0
  a   ½   0   0
  m   0   ½   1

m is a spider trap (its only out-link points to itself):
ry = ry /2 + ra /2
ra = ry /2
rm = ra /2 + rm

Iteration 0, 1, 2, … converges to r = (0, 0, 1).
All the PageRank score gets "trapped" in node m.
Solution: Teleports!
◼ The Google solution for spider traps: at each time step, the random surfer has two options
▪ With prob. β, follow a link at random
▪ With prob. 1-β, jump to some random page
▪ Common values for β are in the range 0.8 to 0.9
◼ Surfer will teleport out of spider trap within a few time steps

[Figure: the y/a/m graph before and after adding teleport links out of the trap]
Problem: Dead Ends

      y   a   m
  y   ½   ½   0
  a   ½   0   0
  m   0   ½   0

ry = ry /2 + ra /2
ra = ry /2
rm = ra /2

Iteration 0, 1, 2, … converges to r = (0, 0, 0).

Here the PageRank "leaks" out since the matrix is not column stochastic: m's column sums to 0.
Solution: Always Teleport!
◼ Teleports: Follow random teleport links with probability 1.0 from dead-ends
▪ Adjust matrix accordingly: replace the dead-end's all-zero column with 1/N in every entry

      y   a   m             y   a   m
  y   ½   ½   0        y    ½   ½   ⅓
  a   ½   0   0   →    a    ½   0   ⅓
  m   0   ½   0        m    0   ½   ⅓
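The dead-end adjustment above can be sketched as a small helper (the function name is mine; assumes a dense column-oriented matrix where column i holds page i's out-links):

```python
import numpy as np

def fix_dead_ends(M):
    """Replace each all-zero column (a dead end) with uniform 1/N,
    making M column stochastic."""
    M = M.copy()
    N = M.shape[0]
    dead = M.sum(axis=0) == 0        # columns with no out-links
    M[:, dead] = 1.0 / N
    return M

# y/a/m example where m is a dead end (columns: y, a, m)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])
print(fix_dead_ends(M))  # m's column becomes [1/3, 1/3, 1/3]
```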
Why Teleports Solve the Problem?
Theory of Markov chains — teleports:
▪ Make M stochastic (no dead ends: every column sums to 1)
▪ Make M aperiodic (the walk cannot get locked into a fixed cycle)
▪ Make M irreducible (every state is reachable from every other state)
Solution: Random Teleports
◼ Google's solution, which handles both problems: at each step the surfer follows a link at random with probability β and teleports to a random page with probability 1-β
◼ PageRank equation:
rj = Σi→j β ri /di + (1-β)/N
di … out-degree of node i
(This formulation assumes M has no dead-ends.)

The Google Matrix
◼ The Google Matrix A:
A = β M + (1-β) [1/N]NxN
[1/N]NxN … N by N matrix where all entries are 1/N
◼ Then the PageRank equation becomes r = A · r, and the power method still works
Random Teleports (β = 0.8)

A = 0.8 · M + 0.2 · [1/N]NxN:

        ½ ½ 0          1/3 1/3 1/3       7/15 7/15  1/15
  0.8 · ½ 0 0  + 0.2 · 1/3 1/3 1/3   =   7/15 1/15  1/15
        0 ½ 1          1/3 1/3 1/3       1/15 7/15 13/15

Power iteration on A:

  y     1/3   0.33  0.24  0.26  …   7/33
  a  =  1/3   0.20  0.20  0.18  …   5/33
  m     1/3   0.46  0.52  0.56  …  21/33
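The β = 0.8 computation above can be reproduced in a few lines. A sketch: build the Google matrix for the y/a/m spider-trap example and iterate to the fixed point.

```python
import numpy as np

beta, N = 0.8, 3

# Column-stochastic M for the y/a/m example where m is a spider trap
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])

# Google matrix: A = beta*M + (1-beta)*[1/N]_{NxN}
A = beta * M + (1 - beta) * np.full((N, N), 1.0 / N)

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
print(r)  # ≈ [7/33, 5/33, 21/33] ≈ [0.212, 0.152, 0.636]
```

Note that the teleport keeps m from absorbing everything: it ends with 21/33 of the score instead of all of it.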
How do we actually compute
the PageRank?

Computing PageRank
◼ Key step is matrix-vector multiplication
▪ rnew = A · rold
◼ Easy if we have enough main memory to hold A, rold, rnew
◼ Say N = 1 billion pages
▪ We need 4 bytes for each entry (say)
▪ 2 billion entries for the two vectors rold and rnew, approx 8GB
▪ But matrix A has N^2 = 10^18 entries — 10^18 is a large number!

A = β·M + (1-β) [1/N]NxN

        ½ ½ 0          1/3 1/3 1/3       7/15 7/15  1/15
= 0.8 · ½ 0 0  + 0.2 · 1/3 1/3 1/3   =   7/15 1/15  1/15
        0 ½ 1          1/3 1/3 1/3       1/15 7/15 13/15
Matrix Formulation
◼ Suppose there are N pages
◼ Consider page i, with di out-links
◼ We have Mji = 1/|di| when i → j
and Mji = 0 otherwise
◼ The random teleport is equivalent to:
▪ Adding a teleport link from i to every other page
and setting transition probability to (1-β)/N
▪ Reducing the probability of following each
out-link from 1/|di| to β/|di|
▪ Equivalent: Tax each page a fraction (1-β) of its
score and redistribute evenly
Rearranging the Equation
◼ r = A · r, where Aji = β Mji + (1-β)/N
◼ rj = Σi [β Mji + (1-β)/N] ri
   = Σi β Mji ri + (1-β)/N Σi ri
   = Σi β Mji ri + (1-β)/N    (since Σi ri = 1)
◼ So r = β M · r + [(1-β)/N]N
Note: Here we assumed M has no dead-ends.
[x]N … a vector of length N with all entries x
PageRank: The Complete Algorithm
◼ Input: Graph G and parameter β
◼ Output: PageRank vector r
▪ Initialize: rj(0) = 1/N
▪ Repeat until convergence (Σj |rj(t+1) – rj(t)| < ε):
  ∀j: r′j(t+1) = Σi→j β ri(t)/di   (r′j(t+1) = 0 if j has no in-links)
  Re-insert the leaked PageRank: ∀j: rj(t+1) = r′j(t+1) + (1-S)/N, where S = Σj r′j(t+1)

If the graph has no dead-ends then the amount of leaked PageRank is 1-β. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
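The complete algorithm, with the leaked score S re-inserted each iteration, can be sketched as follows (function and variable names are mine; the graph is an out-link adjacency dict):

```python
import numpy as np

def pagerank(out_links, N, beta=0.8, eps=1e-10):
    """out_links[i] = list of pages that page i links to.
    Dead ends (empty lists) contribute nothing, so the leaked
    score (1 - S) is redistributed uniformly every iteration."""
    r = np.full(N, 1.0 / N)
    while True:
        r_new = np.zeros(N)
        for i, targets in out_links.items():
            for j in targets:                 # each out-link carries beta*r[i]/d[i]
                r_new[j] += beta * r[i] / len(targets)
        r_new += (1.0 - r_new.sum()) / N      # re-insert leaked PageRank (1 - S)
        if np.abs(r_new - r).sum() < eps:
            return r_new
        r = r_new

# y=0, a=1, m=2; m is a dead end here
graph = {0: [0, 1], 1: [0, 2], 2: []}
print(pagerank(graph, 3))
```

When the graph has no dead ends this reduces to the Google-matrix iteration, since S = β exactly.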
Some Problems with PageRank
◼ Measures generic popularity of a page
▪ Biased against topic-specific authorities
▪ Solution: Topic-Specific PageRank (next)
◼ Uses a single measure of importance
▪ Other models of importance exist
▪ Solution: Hubs-and-Authorities
◼ Susceptible to link spam
▪ Artificial link topologies created in order to boost PageRank
▪ Solution: TrustRank
