
Chapter 1

PageRank
1.1 PageRank description
PageRank is a link analysis algorithm developed at Stanford University by
Sergey Brin and Larry Page [?] in 1997 and implemented by Google. The
PageRank algorithm uses the inherent link structure of the web to generate
relevant results for query terms, because links often provide more complete
and concise information about documents than the text contained in the
documents themselves.
The PageRank values are pre-calculated and stored for all pages known to
the IR system. This means every page in the web has a PageRank score that
is completely independent of query terms. A search that returns PageRank
scores is reporting the importance hierarchy of the pages containing the
query terms.
1.2 Hyperlink structure of the Web
A set of pages in the web may be modeled as nodes in a directed graph.
The edges between nodes represent links between pages. A graph of a simple
6-page web is depicted in Figure 1.1. The directed edge from node two to
node three signifies that page two links to page three. However, page three
does not link to page two, so there is no edge from node three to node two.

Figure 1.1: Graph modelling a 6-node web

The PageRank thesis [?] constructs page importance hierarchies based
upon the link structure of the web. Generally, more important pages will
have more inlinks. Inlinks from important pages will also have a greater
effect on the PageRank of a particular page than inlinks from marginal pages.
The calculation of PageRank is recursive, building the rank for a particular
page from the ranks of the pages that link to it.
To calculate PageRank, a mathematical model of the link structure of
the web is built. An adjacency matrix L is constructed from the graph of
the Web, with entries defined as

L_{ij} = \begin{cases} 1 & \text{if there exists a link from node } i \text{ to node } j \\ 0 & \text{otherwise} \end{cases}

All entries of L are then either 0 or 1. An entry of 1 in the second row,
third column would indicate that page two links to page three. For the 6-node
graph in Figure 1.1, the adjacency matrix L is

L = \begin{pmatrix}
0 & 1 & 0 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}
The PageRank thesis [?] constructs the page ranking based on the importance
of inlinks and outlinks. The underlying assumption is that a web user is
more likely to visit more important pages. Consider a theoretical web user
navigating through a series of web pages. As the user views a page, he or
she may follow any one of the page's outlinks to another page in the web.
Each row i of the matrix L represents the outlinks from the corresponding
page. The row may be reconstructed as a probability distribution for the
movement of the web user, so that each entry in a row instead signifies the
conditional probability that a user currently visiting page i would next
visit page j.
Consider row one of the adjacency matrix L above. The matrix reveals
that page one has two outlinks, to pages two and four. Assuming that a web
user is equally likely to follow every outlink on any given page, a new matrix
P can be constructed, where P_{ij} is the probability of moving from node i to
node j. For the above example:
P = \begin{pmatrix}
0 & 0.5 & 0 & 0.5 & 0 & 0 \\
0.5 & 0 & 0.5 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}
The accuracy of the PageRank scores for the web may be improved by
analysing web usage logs to determine whether the probability of moving
from node i to node j really is the same for each outlink. For example, if
usage logs show that a user is twice as likely to follow the link from page
one to page two as the link from page one to page four, the first row of P
may be

P_1 = \begin{pmatrix} 0 & 0.6667 & 0 & 0.3333 & 0 & 0 \end{pmatrix}
For simplicity, it is assumed that a user is equally likely to move from
node i to each outlinked node, so the formal rule for filling in the entries
of P is

P_{ij} = \begin{cases} 1/|O_i| & \text{if there exists an edge from node } i \text{ to node } j \\ 0 & \text{otherwise} \end{cases}

where |O_i| is the cardinality of the set of nodes outlinked from node i. The
construction of the transition probability matrix P from the network graph
in this manner is the first step towards the mathematical model defining
PageRank.
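The construction of L and P is mechanical enough to script. The following
is a minimal sketch in Python (NumPy assumed; the helper name
transition_matrix is ours, not from the source) that reproduces the two
matrices for the 6-node example:

    import numpy as np

    # Adjacency matrix L for the 6-node web of Figure 1.1.
    L = np.array([
        [0, 1, 0, 1, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],   # page five has no outlinks (dangling node)
        [0, 0, 0, 1, 0, 0],
    ], dtype=float)

    def transition_matrix(L):
        """Build P with P[i, j] = 1/|O_i| for each outlink, 0 otherwise.
        Rows of all zeros (dangling nodes) are left as zeros here;
        they are repaired in Section 1.4."""
        out_degree = L.sum(axis=1)      # |O_i| for each row i
        P = np.zeros_like(L)
        for i, d in enumerate(out_degree):
            if d > 0:
                P[i] = L[i] / d
        return P

    P = transition_matrix(L)
    print(P)   # matches the matrix P shown above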
1.3 Markov Chain Model of the Random Surfer
PageRank may be assigned to pages based on the information given in the
matrix P. Suppose a random surfer navigates through a series of pages,
successively following links at random. The PageRank of a particular page
may be defined as the long-term probability that the surfer will end up at
that page, regardless of starting position. Equivalently, if millions of users
are surfing the web simultaneously, a page's PageRank is the percentage of
viewers expected to be on that page at any given time. This probability
vector can be recognised as the stationary vector of a Markov chain. Consider
the following vector:
m = \begin{pmatrix} 0 \\ 0.60 \\ 0.20 \\ 0.20 \end{pmatrix}
The vector m could be a probability distribution for a set of four pages,
indicating that from the current location a user is 60% likely to visit page
two next, 20% likely to visit page three, 20% likely to visit page four, and
not likely to visit page one at all. The vector m is called a discrete
probability vector if all entries are nonnegative and they sum to 1, i.e.

\|m\|_1 = \sum_i m_i = 1
A stochastic matrix P is a non-negative matrix composed entirely of
probability vectors, so that Pe = e (where e is a vector of all ones) and
P \ge 0 (P_{ij} \ge 0 for all i, j). The condition Pe = e states that the sum
of each row is equal to 1. A Markov chain is then defined as a sequence of
probability vectors m_0, m_1, m_2, m_3, \dots such that

m_1^T = m_0^T P, \quad m_2^T = m_1^T P, \quad m_3^T = m_2^T P, \dots

and in general

m_{k+1}^T = m_k^T P \quad \text{for } k = 0, 1, 2, \dots
An entry of the state vector m_k describes the probability that a user
is visiting a particular page at time step k. Under certain assumptions,
subsequent iterations of

m_{k+1}^T = m_k^T P

will converge to a stationary vector q, independent of m_0, such that

q^T P = q^T
The stationary vector q may be interpreted as the long-term probability
distribution for the pages in the web (reference). Suppose an initial state
vector for a set of four web pages is

m_0 = \begin{pmatrix} 0 \\ 0.60 \\ 0.20 \\ 0.20 \end{pmatrix}

and say it eventually converges to

q = \begin{pmatrix} 0.10 \\ 0.50 \\ 0.15 \\ 0.25 \end{pmatrix}

This implies that a user who is 60% likely to be at page two initially is 50%
likely to be visiting page two at some distant future time, independent of the
initial location. The vector q could also be interpreted as saying that, at
any given time, half of all surfers are visiting page two.
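The convergence to a stationary vector is easy to observe numerically. A
minimal sketch follows; note that the 4x4 transition matrix P4 below is
invented for illustration (the text specifies only m_0 and a hypothetical
limit q, not the chain connecting them), so the printed limit will not be the
q quoted above:

    import numpy as np

    # A made-up 4-page row-stochastic transition matrix (ours).
    P4 = np.array([
        [0.0, 0.5, 0.5, 0.0],
        [0.3, 0.0, 0.3, 0.4],
        [0.0, 0.5, 0.0, 0.5],
        [0.2, 0.4, 0.4, 0.0],
    ])

    m = np.array([0.0, 0.60, 0.20, 0.20])   # initial state vector m_0

    for _ in range(200):
        m = m @ P4                            # m_{k+1}^T = m_k^T P

    print(m)            # the stationary vector q for this chain
    print(m @ P4 - m)   # residual of q^T P = q^T, approximately zero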
The PageRank thesis [?] posits that a page is important if other important
pages link to it. The importance z_j of a page is calculated as

z_j = \sum_i P_{ij} z_i \qquad (1.1)
Thus, the PageRank of page j is the sum of the PageRanks of the pages
that link to page j, each multiplied by the corresponding transition
probability in P. If a particular page that links to page j has a high
PageRank, this will affect z_j more than an inlinking page with low PageRank.
The PageRanks of a web of n pages are given by
z_1 = P_{11} z_1 + P_{21} z_2 + P_{31} z_3 + \dots + P_{n1} z_n \qquad (1.2)
z_2 = P_{12} z_1 + P_{22} z_2 + P_{32} z_3 + \dots + P_{n2} z_n \qquad (1.3)
z_3 = P_{13} z_1 + P_{23} z_2 + P_{33} z_3 + \dots + P_{n3} z_n \qquad (1.4)
\vdots \qquad (1.5)
z_n = P_{1n} z_1 + P_{2n} z_2 + P_{3n} z_3 + \dots + P_{nn} z_n \qquad (1.6)
In matrix terms, this may be written as

z = P^T z \quad \text{or} \quad z^T = z^T P \qquad (1.8)
One can recognize that z^T is a left-hand eigenvector of P corresponding
to the eigenvalue \lambda = 1. Thus computing PageRank, the stationary vector
of a Markov chain, is computationally equivalent to finding the eigenvector
associated with the known eigenvalue \lambda = 1.
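This equivalence can be checked numerically: finding z amounts to extracting
the eigenvector of P^T for the eigenvalue 1. A minimal sketch (the 3x3
row-stochastic matrix is a made-up example of ours, not the 6-node web, whose
P is not stochastic until it is repaired in Section 1.4):

    import numpy as np

    # A small made-up row-stochastic matrix.
    P = np.array([
        [0.0, 0.5, 0.5],
        [0.5, 0.0, 0.5],
        [0.5, 0.5, 0.0],
    ])

    # Left eigenvectors of P are right eigenvectors of P^T.
    vals, vecs = np.linalg.eig(P.T)
    i = np.argmin(np.abs(vals - 1.0))   # locate the eigenvalue lambda = 1
    z = np.real(vecs[:, i])
    z = z / z.sum()                     # normalise so the entries sum to 1
    print(z)                            # stationary vector: z^T P = z^T
    print(z @ P - z)                    # residual, approximately zero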
A Markov chain is irreducible if every state is reachable from every other
state. If P is positive (P > 0, i.e. P_{ij} > 0 for all i, j), then each
state is reachable from every other state: since all entries are greater than
zero, given any initial state, any other state may be reached in only one
step. Thus, if P is positive, it is also irreducible. An irreducible,
stochastic matrix P is guaranteed to have a stationary vector by the
Perron-Frobenius theorem (reference).
A few definitions are introduced before stating the Perron-Frobenius theorem.
Definition 1.1. An n \times n matrix A with real entries a_{ij} is called a
stochastic matrix provided
1. all the entries a_{ij} satisfy 0 \le a_{ij} \le 1;
2. each of the columns sums to one, i.e.

\sum_i a_{ij} = 1 \quad \text{for all } j \qquad (1.9)

3. each row has some nonzero entry (i.e. it is possible to make a transition
to the i-th state from some other state);
4. some column has more than one nonzero entry (i.e. at least one node
has more than one outlink).
Note that an entry a_{ij} is the probability of going from the j-th state to
the i-th state, so this definition uses the column convention (the transpose
of the row convention used for P above).
Definition 1.2. A stochastic matrix A is called regular, or eventually
positive, provided there is a q > 0 such that A^q has all positive entries.
This means that for this iterate it is possible to make a transition from any
state to any other state. It then follows that A^p has all positive entries
for all p \ge q. A regular stochastic matrix automatically satisfies
conditions 3 and 4 of Definition 1.1.

Theorem 1.3 (Perron-Frobenius theorem). Let A be a regular stochastic
matrix. Then:
1. A has 1 as an eigenvalue of geometric and algebraic multiplicity one.
The eigenvector v^1 can be chosen with all positive entries and

\sum_i v^1_i = 1 \qquad (1.10)
(Any eigenvector for this eigenvalue must have all positive entries or all
negative entries.)
2. All the other eigenvalues \lambda_j satisfy |\lambda_j| < 1. If v^j is an
eigenvector for \lambda_j, then

\sum_i v^j_i = 0 \qquad (1.11)

3. If p is any probability distribution with \sum_i p_i = 1, then

p = v^1 + \sum_{j=2}^{n} y_j v^j \qquad (1.12)

where the y_j are coefficients from \mathbb{R}. Also, A^q p goes to v^1 as q
goes to infinity.
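The last statement is what the power method of Chapter 2 exploits; one line
of algebra, using items 1-3 above, shows why:

A^q p = A^q \left( v^1 + \sum_{j=2}^{n} y_j v^j \right)
      = v^1 + \sum_{j=2}^{n} y_j \lambda_j^q v^j
      \longrightarrow v^1 \quad \text{as } q \to \infty,

since A v^1 = v^1 and |\lambda_j| < 1 for j \ge 2 forces \lambda_j^q \to 0.
The error thus decays geometrically, at a rate set by the second-largest
eigenvalue modulus |\lambda_2|.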
1.4 Modelling the Human Surfer
Unfortunately, the Markov matrix P does not satisfy P > 0, and may not be
irreducible or stochastic. Fortunately, these mathematical technicalities
coincide with web-modelling issues. The random surfer model does not closely
model the movement of a human surfer, in that a human user always has the
option of randomly jumping to another page in the web. Should a user come
to a page with no outlinks (as is the case with page five in our 6-node web),
he or she will presumably not remain on that page forever. At this point, to
continue surfing, the user is forced to jump to another page in the web. This
ability is always present, since at any point in time the user may manually
enter a URL. Therefore, every page in the web is implicitly linked to every
other through the ability of the user to jump to a random page.

Since the human user will presumably not remain on a page with no outlinks
indefinitely, P must be adjusted accordingly. It is assumed that the user is
equally likely to jump to any other page in the web. From the matrix P,
construct \tilde{P} by replacing each row of zeros with (1/n) e^T, where
e = (1, 1, \dots, 1)^T is a vector of all ones of size n and n is the order
of P. This also makes \tilde{P} stochastic, since every row of \tilde{P} is
now a probability vector.
Brin and Page [?] add an adjustment matrix E to \tilde{P}, which directly
links every page in the web. The adjustment matrix is constructed as
E = (1/n) ee^T. Even though the user always has the option of jumping to any
other page in the web, he or she will not always choose to do so. Therefore,
another factor, \alpha, is introduced. Google uses \alpha = 0.85, which
indicates that the model used by Google assumes that approximately 85% of
the time a user will simply follow successive links on pages in the web,
while 15% of the time the user will choose instead to jump to another page
in the web. Thus the new matrix A is constructed as

A = \alpha \tilde{P}^T + (1 - \alpha) E
  = \alpha \tilde{P}^T + \frac{(1 - \alpha)}{n} ee^T \qquad (1.13)

where \alpha is the probability that the user will choose to follow a link on
the page, and (1 - \alpha) is the probability that the user will opt to jump
randomly to another page in the web. Note that the original P has been
transposed to conform with the usual convention of finding right, instead of
left, eigenvectors. The introduction of E also serves to make A irreducible,
since A is now positive, and it follows from the Perron-Frobenius theorem
that a positive, stochastic, irreducible matrix is guaranteed to have a
positive stationary eigenvector.
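The whole construction of equation (1.13), including the dangling-node repair
that turns P into \tilde{P}, fits in a few lines. A dense sketch (the helper
name google_matrix is ours; only suitable for small illustrative graphs):

    import numpy as np

    def google_matrix(L, alpha=0.85):
        """Build A = alpha * P~^T + (1 - alpha)/n * e e^T from an
        adjacency matrix L, as in equation (1.13)."""
        n = L.shape[0]
        P = np.array(L, dtype=float)
        for i in range(n):
            d = P[i].sum()
            if d > 0:
                P[i] /= d                 # usual outlink probabilities
            else:
                P[i] = np.ones(n) / n     # dangling row -> (1/n) e^T
        return alpha * P.T + (1 - alpha) / n * np.ones((n, n))

    # 6-node web of Figure 1.1
    L = np.array([
        [0, 1, 0, 1, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
    ])

    A = google_matrix(L)
    print(A.sum(axis=0))   # every column sums to 1: A is column-stochastic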
Chapter 2

PageRank Computation Using the Power Method

Finding the eigenvector for a given eigenvalue and matrix can be a complex
computation. Although several methods for eigenvector calculation are
available, the power method is often the method of choice, due to issues of
storage, computation time, and complexity.
2.1 Formulation
The power method is an iterative method that finds the vector x_{k+1} from
x_k as x_{k+1} = A x_k, repeating until x_{k+1} converges to the desired
tolerance. When x_{k+1} converges, that vector is the eigenvector for the
given matrix and the dominant eigenvalue (which here is 1). The PageRank
algorithm is an application of the power method:
Algorithm 2.1. PageRank(A, x_0)
repeat
    x_{k+1} = A x_k
    \delta = \|x_{k+1} - x_k\|_1
until \delta < \epsilon
return x_{k+1}
where \delta is the change from the k-th iteration to the (k+1)-th iteration
and \epsilon is the desired convergence threshold. Note that our matrix A is

A = \alpha \tilde{P}^T + (1 - \alpha) E \qquad (2.1)
Observe that P is very sparse, and the sparsity is lost by taking the convex
combination. Since the matrices representing the hyperlink structure of the
web are enormous, forming the full matrix A and applying the power method to
it would lead to memory overflow. This can be overcome by replacing the
matrix-vector product Ax as follows:

Algorithm 2.2. PageRank()
    y = \alpha P^T x
    w = \|x\|_1 - \|y\|_1
    y = y + w v

where v is the personalization vector [?], which is (1/n) e. The
irreducibility of A implies that the dominant eigenvalue is 1. In addition,
it can be shown that \lambda_2(A) \le \alpha < 1, and that choosing an
\alpha farther from 1 will speed convergence of the power method.
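A sketch of Algorithm 2.2 embedded in the power loop of Algorithm 2.1,
assuming SciPy's sparse matrices (the function name pagerank_power is ours).
Only the sparse product \alpha P^T x is formed; the dense rank-one part of A
(dangling rows plus teleportation) is folded into the scalar w:

    import numpy as np
    from scipy import sparse

    def pagerank_power(P, alpha=0.85, tol=1e-8, max_iter=1000):
        """Power method with the sparse update of Algorithm 2.2.
        P: sparse row-substochastic matrix (dangling rows all zero)."""
        n = P.shape[0]
        v = np.full(n, 1.0 / n)            # personalization vector (1/n) e
        x = v.copy()
        for _ in range(max_iter):
            y = alpha * (P.T @ x)          # sparse matrix-vector product
            w = x.sum() - y.sum()          # ||x||_1 - ||y||_1 (entries >= 0)
            y = y + w * v                  # add back teleport/dangling mass
            if np.abs(y - x).sum() < tol:  # delta = ||x_{k+1} - x_k||_1
                return y
            x = y
        return x

    # 6-node example from Chapter 1, stored sparsely
    rows = [0, 0, 1, 1, 2, 3, 5]
    cols = [1, 3, 0, 2, 3, 4, 3]
    vals = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
    P = sparse.csr_matrix((vals, (rows, cols)), shape=(6, 6))
    print(pagerank_power(P))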
Theorem 2.3. If |\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \dots \ge
|\lambda_n| are the eigenvalues of A, then the power method on A converges at
a geometric rate of |\lambda_2| / |\lambda_1| (here simply |\lambda_2|, since
\lambda_1 = 1).

The value of \alpha affects the size of |\lambda_2|, and hence how fast the
power method converges.
Theorem 2.4. Let A be the column-stochastic matrix defined as

A = \alpha \tilde{P}^T + \frac{(1 - \alpha)}{n} ee^T \qquad (2.2)

If the eigenvalues of \tilde{P}^T are 1, \lambda_2, \lambda_3, \dots,
\lambda_n, then the eigenvalues of A are given by \sigma(A) = \{1,
\alpha\lambda_2, \alpha\lambda_3, \dots, \alpha\lambda_n\}.
2.2 Drawbacks of the Power Method
Since |\lambda_2| \le \alpha, and the rate of convergence of the power method
depends on the size of the second-largest eigenvalue, the choice of \alpha
determines the rate of convergence of the power method. A change in \alpha
will greatly change the total number of iterations needed by the PageRank
algorithm, and may also drastically affect the PageRanks of pages in the web.
A higher value of \alpha will place more weight on the true link structure of
the web, but will cost more in terms of computation time.
Chapter 3

Extrapolation Techniques for Computing PageRank

PageRank computes the principal eigenvector of the matrix describing the
hyperlinks of the web using the famous power method. Due to the sheer size
of the web (over 3 billion links), this computation can take several days.
Speeding up this computation is important for two reasons. First, computing
PageRank quickly is necessary to reduce the lag time from when a new crawl
is completed to when that crawl can be made available for searching.
Secondly, recent approaches to personalized and topic-sensitive PageRank
schemes [?] require computing many PageRank vectors, each biased towards
certain types of pages. These approaches intensify the need for faster
methods for computing PageRank. This chapter gives a short description of
two extrapolation methods that accelerate the PageRank computation.
3.1 Aitken Extrapolation
The intuition behind Aitken extrapolation is as follows. Assume that the
iterate x_{k-2} can be expressed as a linear combination of the first two
eigenvectors. Under this assumption, the principal eigenvector v^1 can be
solved for in closed form using the successive iterates x_{k-2}, x_{k-1},
x_k.
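A minimal sketch of the component-wise Aitken delta-squared step is shown
below. This is our own rendering of the standard Aitken identity, applied
entrywise to three successive power-method iterates; the source only
describes the idea, so treat the details as an assumption:

    import numpy as np

    def aitken_extrapolate(x0, x1, x2, eps=1e-12):
        """Component-wise Aitken delta-squared step on three successive
        iterates x_{k-2}, x_{k-1}, x_k (x0 oldest, x2 newest). Returns an
        estimate of the limit, assuming each iterate is dominated by the
        first two eigenvector components."""
        num = (x1 - x0) ** 2              # (Delta x)^2, entrywise
        den = x2 - 2 * x1 + x0            # Delta^2 x, entrywise
        safe = np.abs(den) > eps          # avoid dividing by ~0 components
        x_star = x2.copy()
        x_star[safe] = x0[safe] - num[safe] / den[safe]
        return x_star / x_star.sum()      # renormalise to a probability vector

In practice the extrapolated vector would be fed back into the power
iteration of Algorithm 2.1 every few steps, rather than used as a final
answer.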
3.2 Quadratic Extrapolation
In quadratic extrapolation the assumption is that the matrix A has only 3
eigenvectors contributing to the iterate, and that the iterate x_{k-3} can
be expressed as a linear combination of these 3 eigenvectors. With these
assumptions one can solve for the principal eigenvector v^1 in closed form
using the successive iterates x_{k-3}, \dots, x_k.
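One published formulation of this step (due to Kamvar et al.; the present
chapter only describes the idea, so the details below are our reconstruction
of that formulation, not this document's own listing) solves a small least-
squares problem for the combination coefficients:

    import numpy as np

    def quadratic_extrapolate(x3, x2, x1, x0):
        """One quadratic-extrapolation step from four successive iterates
        x_{k-3}, x_{k-2}, x_{k-1}, x_k (x3 oldest, x0 newest)."""
        y2 = x2 - x3                      # differences relative to x_{k-3}
        y1 = x1 - x3
        y0 = x0 - x3
        Y = np.column_stack([y2, y1])     # n x 2 least-squares system
        g1, g2 = np.linalg.lstsq(Y, -y0, rcond=None)[0]
        g3 = 1.0
        b0, b1, b2 = g1 + g2 + g3, g2 + g3, g3
        x_star = b0 * x2 + b1 * x1 + b2 * x0   # cancel the 2nd/3rd components
        return x_star / x_star.sum()           # renormalise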
The power method discussed previously is very effective in annihilating the
components of the error along the non-principal eigenvector directions,
provided the second eigenvalue is far smaller than one. If it is not, the
power method takes very long to converge; in these cases, extrapolation
techniques like the two above remove the error components and help the power
method converge faster. The use of extrapolation saves up to 30-70% of the
computation time.
Chapter 4

Topic-Sensitive PageRank

4.1 Introduction
In our approach to topic-sensitive PageRank, we precompute the importance
scores offline, as with ordinary PageRank. However, we compute multiple
importance scores for each page: a set of scores of the importance of a page
with respect to various topics. At query time, these importance scores are
combined, based on the topics of the query, to form a composite PageRank
score for the pages matching the query. In our work we consider two
scenarios. In the first, we assume a user with a specific information need
issues a query to our search engine in the conventional way, by entering a
query into a search box. In this scenario, we determine the topics most
closely associated with the query, and use the appropriate topic-sensitive
PageRank vectors for ranking the documents satisfying the query. This
ensures that the importance scores reflect a preference for the link
structure of pages that have some bearing on the query. In the second
scenario, we assume the user is viewing a document (for instance, browsing
the Web or reading email), and selects a term from the document for which he
would like more information. For instance, if a query for "architecture" is
performed by highlighting a term in a document discussing famous building
architects, we would like the result to be different than if the query
"architecture" is performed by highlighting a term in a document on CPU
design. By selecting the appropriate topic-sensitive PageRank vectors based
on the context of the query, we hope to provide more accurate search
results. Note that even when a query is issued in the conventional way,
without highlighting a term, the history of queries issued constitutes a
form of query context. Yet another source of context comes from the user who
submitted the query. For instance, the user's bookmarks and browsing history
could be used in selecting the appropriate topic-sensitive rank vectors. By
making PageRank topic-sensitive, we avoid the problem of heavily linked
pages getting highly ranked for queries for which they have no particular
authority. Pages considered important in some subject domains may not be
considered important in others, regardless of what keywords may appear
either in the page or in anchor text referring to the page.
4.2 ODP-biasing
The first step in our approach is to generate a set of biased PageRank
vectors using a set of basis topics. This step is performed once, offline,
during the preprocessing of the Web crawl. For the personalization vector p,
we use the URLs present in the various categories of the ODP. We create 16
different biased PageRank vectors by using the URLs present below each of
the 16 top-level categories of the ODP as the personalization vectors. In
particular, let T_j be the set of URLs in the ODP category c_j. Then, when
computing the PageRank vector for topic c_j, in place of the uniform damping
vector p = [1/N]_{N \times 1} we use the nonuniform vector p = v_j, where

v_{ji} = \begin{cases} 1/|T_j| & i \in T_j, \\ 0 & i \notin T_j. \end{cases}
The PageRank vector for topic c_j will be referred to as PR(\alpha, v_j). We
also generate the single unbiased PageRank vector (denoted NoBias) for the
purpose of comparison. We also compute the 16 class term-vectors D_j,
consisting of the terms in the documents below each of the 16 top-level
categories; D_{jt} simply gives the total number of occurrences of term t in
documents listed below class c_j of the ODP. One could envision using other
sources for creating topic-sensitive PageRank vectors; however, the ODP data
is freely available, and as it is compiled by thousands of volunteer editors,
it is less susceptible to influence by any one party.
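Building the biased personalization vectors v_j is straightforward. A small
sketch (the helper name topic_vectors and the tiny topic-to-page mapping are
invented for illustration):

    import numpy as np

    def topic_vectors(n_pages, topic_page_ids):
        """Build one personalization vector v_j per topic, following the
        definition of v_ji above. topic_page_ids maps a topic name to the
        set of page indices (URLs) listed under that ODP category."""
        vectors = {}
        for topic, pages in topic_page_ids.items():
            v = np.zeros(n_pages)
            v[list(pages)] = 1.0 / len(pages)   # 1/|T_j| on pages in T_j
            vectors[topic] = v
        return vectors

    # hypothetical 10-page web with two topic categories
    vs = topic_vectors(10, {"Arts": {0, 3, 7}, "Computers": {2, 5}})
    print(vs["Arts"])   # nonzero (= 1/3) only on pages 0, 3, 7

Each v_j then replaces the uniform vector v in the power method of
Algorithm 2.2 to produce the biased rank vector PR(\alpha, v_j).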
4.3 Query-Time Importance Score
The second step in our approach is performed at query time. Given a query q,
let q' be the context of q. In other words, if the query was issued by
highlighting the term q in some Web page u, then q' consists of the terms in
u. For ordinary queries not done in context, let q' = q. Using a unigram
language model, with parameters set to their maximum-likelihood estimates,
we compute the class probabilities for each of the 16 top-level ODP classes,
conditioned on q'. Let q'_i be the i-th term in the query (or query context)
q'. Then, given the query q, we compute for each c_j the following:

P(c_j | q') = \frac{P(c_j) \cdot P(q' | c_j)}{P(q')}
            \propto P(c_j) \cdot \prod_i P(q'_i | c_j)
P(q'_i | c_j) is easily computed from the class term-vector D_j. The quantity
P(c_j) is not as straightforward. We chose to make it uniform, although we
could personalize the query results for different users by varying this
distribution. In other words, for some user k, we can use a prior
distribution P_k(c_j) that reflects the interests of user k. This method
provides an alternative framework for user-based personalization, rather
than directly varying the damping vector p.
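The class-probability computation is a standard naive-Bayes step over the
class term-vectors. A sketch with a uniform prior (add-one smoothing is our
assumption, since the text does not say how zero counts are handled; the toy
term counts are invented):

    import numpy as np

    def class_probabilities(query_terms, term_counts):
        """Unigram class probabilities P(c_j | q') with a uniform prior.
        term_counts[j] is the class term-vector D_j (term -> count)."""
        log_p = np.zeros(len(term_counts))
        for j, D_j in enumerate(term_counts):
            total = sum(D_j.values())
            for t in query_terms:
                # P(q'_i | c_j) from D_j, smoothed to avoid log(0)
                log_p[j] += np.log((D_j.get(t, 0) + 1) / (total + len(D_j)))
        p = np.exp(log_p - log_p.max())   # stable normalisation
        return p / p.sum()

    # toy class term-vectors for two classes (invented numbers)
    D = [{"building": 40, "architect": 30, "cpu": 1},
         {"cpu": 50, "design": 40, "architect": 5}]
    print(class_probabilities(["architect", "building"], D))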
Using a text index, we retrieve the URLs of all documents containing the
original query terms q. Finally, we compute the query-sensitive importance
score of each of these retrieved URLs as follows. Let rank_{jd} be the rank
of document d given by the rank vector PR(\alpha, v_j) (i.e., the rank
vector for topic c_j). For the Web document d, we compute the
query-sensitive importance score s_{qd} as

s_{qd} = \sum_j P(c_j | q') \cdot rank_{jd}

The results are ranked according to this composite score s_{qd}.
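Combining the pieces at query time is a single weighted sum over topics. A
sketch (the function name composite_scores and all numbers are invented for
illustration):

    import numpy as np

    def composite_scores(class_probs, topic_ranks, candidate_docs):
        """Combine per-topic rank vectors into s_qd = sum_j P(c_j|q') rank_jd.
        topic_ranks: (topics x docs) array of precomputed PageRank values;
        candidate_docs: indices returned by the text index for query q."""
        s = class_probs @ topic_ranks          # weighted sum over topics
        # rank only the documents that actually match the query terms
        return sorted(candidate_docs, key=lambda d: s[d], reverse=True)

    # two topics, five documents (all numbers invented)
    topic_ranks = np.array([[0.30, 0.10, 0.25, 0.20, 0.15],
                            [0.05, 0.40, 0.10, 0.15, 0.30]])
    print(composite_scores(np.array([0.8, 0.2]), topic_ranks, [0, 1, 4]))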
The above query-sensitive PageRank computation has the following
probabilistic interpretation in terms of the random surfer model. Let w_j be
the coefficient used to weight the j-th rank vector, with \sum_j w_j = 1
(e.g., let w_j = P(c_j | q')). Then note the equality

\sum_j [w_j \cdot PR(\alpha, v_j)] = PR\left(\alpha, \sum_j [w_j \cdot v_j]\right)

Thus we see that the following random walk on the Web yields the
topic-sensitive score s_{qd}. With probability 1 - \alpha, a random surfer on
page u follows an outlink of u (where the particular outlink is chosen
uniformly at random). With probability \alpha \cdot P(c_j | q'), the surfer
instead jumps to one of the pages in T_j (where the particular page in T_j is
chosen uniformly at random). The long-term visit probability that the surfer
is at page v is exactly given by the composite score s_{qd} defined above.
Thus, topics exert influence over the final score in proportion to their
affinity with the query (or query context).