
Pagerank (Brin and Page 1998)

The idea that made Google great

This document introduces the famous Pagerank algorithm (Brin and Page 1998). These lecture notes are
not meant as an exhaustive and detailed document; instead, they contain a quick description of the main
concepts and features of the algorithm, with links to references and other more complete resources. Students
are strongly encouraged to read these; doing so should help with the “AA” transversal skill for the CAI-GCED
course. In the mid-term or final exams you will be asked questions regarding this topic, so make sure you
understand the algorithm and some of the extensions mentioned here.

1. What is pagerank? Main ideas


As its name suggests, Pagerank is an algorithm that generates a ranking. Namely, given a directed graph
as input (a web graph, for example, where nodes are pages and links are html hyperlinks) it assigns an
importance score to each page or node (its “pagerank”). The algorithm was proposed in the context of web
search, and the pagerank of pages was used to rank search results. Using the global structure of the web to
improve the ranking of web results was the magic ingredient that gave Google an advantage over its competitors.
In this text I talk about pages and nodes interchangeably, and it should be understood that when I say page I
mean a node in the input graph.
The main idea of the algorithm lies in the fact that a link from page A to page B should be understood as an
endorsement of B’s importance by A. Not all endorsements are equal, and being endorsed by the Queen of
England is not the same as being endorsed by a random citizen.

A page is important if it is pointed to by other important pages

To formalize this seemingly circular concept, let us start by introducing some notation. Let G = (V, E) be a
directed graph (the web graph) with nodes V = {1, ..., n} – so, there are n pages¹ – and (i, j) ∈ E if page i
points to page j.
As our running example, we may have the following 4-node graph given by

• V = {1, 2, 3, 4}, and


• E = {(1, 1), (1, 3), (1, 4), (2, 1), (2, 4), (3, 2), (3, 4), (4, 2)}

The pagerank pi of a page i ∈ V is a real value (a positive score) associated to the page. It corresponds
to the importance of node i globally in the graph. The intuition behind the mathematical definition of
pagerank is that

¹ Think of n as extremely large.

The pagerank (prestige) of a node is passed in equal parts to the nodes to which it points.

So we define:

Definition (pagerank): The vector (pi)i∈V of pageranks should satisfy

1. ∑i pi = 1, and
2. for all i: pi = ∑(j,i)∈E pj / out(j)

where out(j) is the outdegree of vertex j.

Example with toy graph


The definition leads to a system of n + 1 linear equations. For example, let us instantiate the equation for p1
using the general definition pi = ∑(j,i)∈E pj / out(j). We have to go over all nodes j pointing to 1, which are
nodes 1 and 2. Therefore we get that:

    p1 = p1/3 + p2/2

Notice the different denominators: node 1 shares its pagerank equally among 3 nodes (its outdegree is 3), while
node 2 has outdegree 2 and hence shares its pagerank equally between nodes 1 and 4. The following picture should
make the “flow” of pageranks clear:

This leads to the set of n + 1 linear equations:

    p1 = p1/3 + p2/2
    p2 = p3/2 + p4
    p3 = p1/3
    p4 = p1/3 + p2/2 + p3/2
    1  = p1 + p2 + p3 + p4

Notice also that the pageranks are distributed across the graph but the net pagerank should stay constant
(and add up to 1)².
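As a quick numerical check (a sketch using numpy; not part of the original notes), we can stack the n equations with the normalization equation and solve the resulting overdetermined system by least squares:

```python
import numpy as np

# Transpose of the transition matrix of the toy graph:
# entry (i, j) holds 1/out(j) when page j points to page i.
Mt = np.array([[1/3, 1/2, 0,   0],
               [0,   0,   1/2, 1],
               [1/3, 0,   0,   0],
               [1/3, 1/2, 1/2, 0]])

# The n equations p = Mt p rewritten as (I - Mt) p = 0,
# stacked with the normalization equation sum(p) = 1.
A = np.vstack([np.eye(4) - Mt, np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])

# Overdetermined (n + 1 equations, n unknowns) but consistent,
# so least squares recovers the exact solution.
p, *_ = np.linalg.lstsq(A, b, rcond=None)
print(p.round(2))  # approximately [0.26, 0.35, 0.09, 0.30]
```

The exact solution is (6/23, 8/23, 2/23, 7/23), which matches the power-method output reported later in these notes.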

2. Linear algebra view


We may write the system of linear equations compactly, as usual, using matrix notation. Let M (called the
transition matrix) be the matrix such that:

• Mij = 1/out(i) if (i, j) ∈ E
• Mij = 0 if (i, j) ∉ E

Then the system of equations above is equivalent to the matrix equation

    p = M^T p

² For well-behaved graphs at least; we shall see this later.

Notice that p is an eigenvector of M^T associated to eigenvalue 1, and so finding the solution to our system
of equations is equivalent to finding the leading eigenvector of the matrix M^T.
In conclusion, a node’s importance is given by its coordinate in the leading eigenvector of the transpose M^T
of the transition matrix M.

Example with toy graph


Rows of M add to 1 (M is row-stochastic). Columns of M^T add to 1 (M^T is column-stochastic).

    M   = [ 1/3   0   1/3  1/3 ]        M^T = [ 1/3  1/2   0    0  ]
          [ 1/2   0    0   1/2 ]              [  0    0   1/2   1  ]
          [  0   1/2   0   1/2 ]              [ 1/3   0    0    0  ]
          [  0    1    0    0  ]              [ 1/3  1/2  1/2   0  ]

The matrix equation p = M^T p for the toy graph reads:

    [ p1 ]   [ 1/3  1/2   0    0  ] [ p1 ]
    [ p2 ] = [  0    0   1/2   1  ] [ p2 ]
    [ p3 ]   [ 1/3   0    0    0  ] [ p3 ]
    [ p4 ]   [ 1/3  1/2  1/2   0  ] [ p4 ]
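As a side note, the matrices above can be built programmatically from the edge list; the following numpy sketch (variable names are illustrative, not from the text) constructs M and checks that it is row-stochastic:

```python
import numpy as np

# Toy graph edges, 1-indexed as in the text: (i, j) means page i links to page j.
edges = [(1, 1), (1, 3), (1, 4), (2, 1), (2, 4), (3, 2), (3, 4), (4, 2)]
n = 4

# Outdegree of each node.
out = np.zeros(n)
for i, _ in edges:
    out[i - 1] += 1

# Row-stochastic transition matrix: M[i, j] = 1/out(i) if (i, j) in E.
M = np.zeros((n, n))
for i, j in edges:
    M[i - 1, j - 1] = 1 / out[i - 1]

print(M.sum(axis=1))  # every row sums to 1: M is row-stochastic
```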

3. Probabilistic view (random surfer)


An equivalent but useful view of pagerank is given by a probabilistic interpretation of a random surfer that
jumps from page to page at random (following links in the web graph at random).
Following our toy example, assume that the surfer starts at node 1, and follows links uniformly at random.

Possible sequences of nodes visited by the random surfer:


• 1, 1, 1, 1, 4, 2, ..
• 1, 3, 4, 2, 1, 3, ..
• 1, 4, 2, 1, 4, 2, ..

We now view the “pagerank vector” as a distribution over the nodes of the web graph, describing the location
of the random surfer at time t.
For example:
• p(t = 0) = (1, 0, 0, 0)^T means that at time t = 0 the random surfer is at node 1.
• p(t = 0) = (1/4, 1/4, 1/4, 1/4)^T means that at time t = 0 the random surfer could be at any node with
equal probability.

Exercise: Supposing that the random surfer starts from node 1, where could we find her at time t = 1, and at
time t = 2? And at any given time t?
Here, the transition matrix is telling us the probabilities of jumping between nodes. In particular, the first
row of M tells us that, at time t = 1, the location of the surfer is given by p(t = 1) = (1/3, 0, 1/3, 1/3)^T, which we
obtain by the matrix-vector multiplication p(t = 1) = M^T p(t = 0). In general, to figure out the location at
the following time step, we can use

    p(t + 1) = M^T p(t) = (M^T)^2 p(t − 1) = ... = (M^T)^(t+1) p(0)

And the fixed point of this recurrence gives us the solution to pagerank as well: when p(t + 1) = p(t) =
M^T p(t), we have clearly found the pagerank solution, since it satisfies the linear equations as required.
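The recurrence can be explored numerically. A small numpy sketch, starting the surfer at node 1 (this also illustrates the exercise above):

```python
import numpy as np

# Transpose of the toy graph's transition matrix.
Mt = np.array([[1/3, 1/2, 0,   0],
               [0,   0,   1/2, 1],
               [1/3, 0,   0,   0],
               [1/3, 1/2, 1/2, 0]])

p = np.array([1.0, 0.0, 0.0, 0.0])  # surfer starts at node 1
for t in range(1, 4):
    p = Mt @ p                      # p(t+1) = M^T p(t)
    print(f"p(t={t}) = {p.round(3)}")
# p(t=1) = (1/3, 0, 1/3, 1/3), matching the text.
```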

4. Power iteration method


The last recurrence suggests a method for finding the pagerank values of a graph, called the power iteration
method for obvious reasons:
• t = 0
• p(0) = (1/n, ..., 1/n)^T
• repeat until convergence:
  – p(t + 1) = M^T p(t)
  – t = t + 1

Implementation of power method and application to toy example


import numpy as np

Mt = np.array([[1/3, 0, 1/3, 1/3],
               [1/2, 0, 0, 1/2],
               [0, 1/2, 0, 1/2],
               [0, 1, 0, 0]]).transpose()

n = 4

t = 0
p = np.array([1/n] * n)
p_old = p + 10

while not np.allclose(p, p_old):
    t += 1
    p_old = p
    p = Mt @ p

print(f'converged after {t} iterations to vector {p}')

produces the output:
converged after 22 iterations to vector [0.26 0.35 0.09 0.3]
Worth noting is that this type of implementation, relying directly on matrix-vector multiplication, may be
wasteful in space: storing the matrix M in explicit form is wasteful because it may contain lots of 0s for
sparse graphs – and most real graphs are actually sparse – and, consequently, wasteful in time too. More
efficient implementations exist where each iteration of the power method can be done in time O(n + m), where
n is the number of nodes and m is the number of edges in the graph. Note that the matrix representation uses
space, and hence time, O(n²) in each iteration, which for large graphs may be prohibitive. More on this in
the corresponding pagerank lab session.
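As a sketch of the sparse idea (an illustrative implementation, not the lab's actual code), each iteration can traverse the edge list once, giving O(n + m) time per iteration without ever materializing the n × n matrix:

```python
import numpy as np

def pagerank_sparse(edges, n, tol=1e-10):
    """Power iteration over an edge list: O(n + m) time and space per
    iteration; the n x n matrix is never built explicitly."""
    out = [0] * n
    for i, _ in edges:
        out[i] += 1
    p = np.full(n, 1 / n)
    while True:
        p_new = np.zeros(n)
        for i, j in edges:
            p_new[j] += p[i] / out[i]  # node i passes p[i]/out(i) along (i, j)
        if np.allclose(p_new, p, atol=tol):
            return p_new
        p = p_new

# Toy graph, 0-indexed
edges = [(0, 0), (0, 2), (0, 3), (1, 0), (1, 3), (2, 1), (2, 3), (3, 1)]
print(pagerank_sparse(edges, 4).round(2))  # approximately [0.26, 0.35, 0.09, 0.30]
```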

5. Convergence guarantees of the power method


Now, does the power method always work? If not, when is it guaranteed to work? We need to figure out the
following:
• The method converges to some solution
• The method converges to a unique solution
• The method converges fast to the unique solution
• The method converges fast to the unique solution for any starting point
It turns out that the power method can indeed fail for some bad inputs M. Let us look at some of the things
that could go wrong:

5.1. Dangling nodes


Dangling nodes are nodes with no outgoing links. The presence of such nodes is problematic for the
definition of pagerank. In fact, it is not hard to find a simple input graph with a dangling node where there
is no solution to the pagerank system of equations (Exercise: find it!).
Note that in the matrix M, the row corresponding to a dangling node is all 0, and so the matrix is not
stochastic (the rows corresponding to dangling nodes sum to 0 rather than to the expected 1 in stochastic
matrices). In fact, when the transition matrix M is stochastic (equivalently, M^T is column-stochastic), we are
guaranteed to find at least one solution by the Perron-Frobenius theorem of linear algebra.
So, to guarantee the existence of a solution, we shall force the matrices M over which we run the power
method to be stochastic. How do we do that? Essentially, by substituting the all-zero rows corresponding to
dangling nodes with uniform rows where all entries are 1/n. This corresponds to adding outgoing links from
dangling nodes to all nodes in the graph (including themselves). Or, equivalently, we are redistributing the
pagerank of dangling nodes across all nodes equally.
Remember that we said at some point that if all goes well the total pagerank always adds up to 1? Well, in
the presence of dangling nodes this is not so (try to imagine why!), and when executing the iterations of the
power method we will see that the total pagerank gets smaller and smaller. In other words, pagerank seems to
leak. If you find this is the case in your implementation, then chances are you are not dealing with dangling
nodes appropriately.
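A minimal numpy sketch of the fix, on a hypothetical 3-node graph where node 3 is dangling:

```python
import numpy as np

# Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 1; node 3 has no out-links.
M = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])   # all-zero row: M is not stochastic

n = M.shape[0]
dangling = M.sum(axis=1) == 0    # rows that sum to 0
M[dangling] = 1 / n              # replace them with the uniform row

print(M.sum(axis=1))  # every row now sums to 1: M is stochastic
```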

5.2. Sink nodes


Sink nodes are nodes whose only outgoing link is to themselves. Try to think what happens to a random
surfer once she enters a sink node. There is no escape! So, these types of nodes end up hoarding all the
pagerank, rendering it useless.
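A quick numerical illustration on a hypothetical 2-node graph where node 2 is a sink:

```python
import numpy as np

# Node 1 links to node 2; node 2 links only to itself (a sink).
M = np.array([[0.0, 1.0],
              [0.0, 1.0]])

p = np.array([0.5, 0.5])
for _ in range(50):
    p = M.T @ p
print(p)  # the sink has hoarded all the pagerank: (0, 1)
```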

5.3. Disconnected components


Something similar happens in graphs that are not strongly connected, for example graphs with disconnected
components. Once the surfer enters such a component there is no way out, and so uniqueness of the pagerank
solution is not guaranteed; basically, pagerank is ill-defined in this case.
Think of the following scenario: two disconnected nodes, each with only a self-loop, so that M (and hence
M^T) is the 2 × 2 identity matrix. The system of equations is

    p = [ 1  0 ] p
        [ 0  1 ]

and so, in this extreme case, any vector is a solution!

Graphs with disconnected components have more than one eigenvector associated to the eigenvalue 1. If the
graph is strongly connected this does not happen – the eigenvalue 1 has multiplicity 1 – and then uniqueness
of the solution is guaranteed.

5.4. Certain cyclic patterns


A certain kind of cyclic graph is problematic, since it can make the power iteration fail. See, for example,
the case of a directed cycle on 4 nodes.

This has a unique solution, (1/4, 1/4, 1/4, 1/4)^T; however, the power method fails to converge if it starts
from any vector other than the solution vector.

Not all cyclic graphs are problematic; in fact, the problematic ones are those that are periodic. Hence the
solution will be to make sure that all input graphs are aperiodic.
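The failure is easy to reproduce on the smallest periodic example, a 2-cycle (an illustrative sketch, not from the text):

```python
import numpy as np

# Two pages that point only at each other: a cycle of period 2.
Mt = np.array([[0.0, 1.0],
               [1.0, 0.0]])  # M^T (here M is symmetric, so M^T = M)

p = np.array([1.0, 0.0])  # start at node 1
for t in range(4):
    print(f"p(t={t}) = {p}")
    p = Mt @ p
# The iterates oscillate between (1, 0) and (0, 1) forever and never
# reach the unique solution (1/2, 1/2).
```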

5.5. Fixing the problematic cases: damping factor (λ)


We fix the problematic cases by modifying the transition matrix so that it is stochastic and its graph is
aperiodic and strongly connected. This way none of the problematic cases can happen, a unique solution
exists, and the power method is guaranteed to find it fast.
Now, to solve the issues of sink nodes, disconnected components, and periodic input graphs, Google’s
founders came up with the notion of a damping factor. They define the Google matrix G as a mixture of the
original transition matrix M (forced to be stochastic as described above) and the transition matrix
corresponding to a complete digraph, to guarantee success of the power method:

    G = λM + (1 − λ)(1/n) J

where J is the matrix containing all 1s and 0 < λ < 1 is the damping factor.
Instead of running the power method on M, we are going to use the Google matrix G, which is guaranteed to
have a unique pagerank solution that the power method will find fast. In essence, the graph that corresponds
to G is strongly connected and aperiodic, and G is stochastic, so the Perron-Frobenius theorem guarantees
existence and uniqueness of the solution, and the second eigenvalue of G governs the convergence rate of the
power method.
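A sketch of the power method run on G for the toy graph, with λ = 0.85 chosen arbitrarily within the typical range:

```python
import numpy as np

# Row-stochastic transition matrix of the toy graph.
M = np.array([[1/3, 0,   1/3, 1/3],
              [1/2, 0,   0,   1/2],
              [0,   1/2, 0,   1/2],
              [0,   1,   0,   0]])
n, lam = 4, 0.85

# Google matrix: mix M with the uniform complete-graph matrix J/n.
G = lam * M + (1 - lam) / n * np.ones((n, n))

p = np.full(n, 1 / n)
for _ in range(100):
    p = G.T @ p          # power iteration on G instead of M
print(p.round(3))
```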

5.6. Damping factor and teleportation


The way we can interpret the damping factor in the random surfer interpretation of pagerank is as follows.
Before following a link in the web graph, the random surfer tosses a (biased) coin: with probability λ she
follows an outgoing link from the current location, but with probability 1 − λ, she jumps (i.e. teleports) to
any node in the web graph.

Final observations on the damping factor λ:
• λ is there to ensure uniqueness and (fast) convergence; it is not part of the pagerank definition.
• As λ → 1, the solution gets closer to the “true” pagerank.
• As λ → 0, the solution gets closer to uniform (not interesting).
• As λ → 0, convergence is guaranteed to be faster.
• λ balances speed against accuracy.
• Values between 0.8 and 0.9 are typical.

6. Topic-sensitive pagerank
The pagerank vector is defined on the basis of the linear system of equations p = G^T p, or equivalently

    p = λ M^T p + (1 − λ) u

where u is the uniform vector (all of its entries are 1/n).


This alternative formulation can be useful for defining personalized pageranks, by modifying the teleportation
to bias the result towards some subset of pages. So, instead of teleporting to any page uniformly at random
(as given by u), we can teleport to a particular subset of pages given by some personalized vector r:

    p = λ M^T p + (1 − λ) r

This is the idea in the following paper, which describes topic-sensitive pagerank. Notice that depending on
the nature of r, the resulting aperiodicity, strong connectivity, etc. may be broken, and so extra care needs
to be taken in those cases.
For more information, please read Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web
search by T.H. Haveliwala (Haveliwala 2003).
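A sketch of the personalized iteration on the toy graph, with an illustrative choice of r that teleports only to page 1:

```python
import numpy as np

# Transpose of the toy graph's transition matrix.
Mt = np.array([[1/3, 1/2, 0,   0],
               [0,   0,   1/2, 1],
               [1/3, 0,   0,   0],
               [1/3, 1/2, 1/2, 0]])
lam = 0.85
r = np.array([1.0, 0.0, 0.0, 0.0])  # teleportation always lands on page 1

p = np.full(4, 0.25)
for _ in range(100):
    p = lam * (Mt @ p) + (1 - lam) * r   # p = λ M^T p + (1 − λ) r
print(p.round(3))  # biased towards page 1 and the pages it endorses
```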

References
Brin, Sergey, and Lawrence Page. 1998. “The Anatomy of a Large-Scale Hypertextual Web Search Engine.”
Comput. Networks 30 (1-7): 107–17. https://doi.org/10.1016/S0169-7552(98)00110-X.
Haveliwala, T. H. 2003. “Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search.”
IEEE Transactions on Knowledge and Data Engineering 15 (4): 784–96. https://doi.org/10.1109/TKDE.2003.
1208999.
