Math 551 Lab 12
Goals: To apply linear algebra techniques to the ranking of webpages in the World Wide Web.
To get started:
Matlab commands used: load, size, numel, nnz, for... end, gplot
What you have to submit: The file lab12.m which you will modify during the lab session.
INTRODUCTION
According to user surveys, the majority of users look only at the first few results of an online
search, and very few look past the first page of results. Hence, it is crucially important
to rank the pages in the “right” order so that the most reputable and relevant results
come first. The simplest way to determine the rank of a webpage in a network is to look at how
many times it has been referred to by other webpages. This simple ranking method leaves a
lot to be desired. In particular, it can be easily manipulated by referring to a certain webpage
from a lot of “junk” webpages. The quality of the webpages referring to the page we are trying
to rank should matter too. This is the main idea behind the Google PageRank algorithm.
The Google PageRank algorithm is the oldest algorithm used by Google to rank webpages;
the pages are ranked offline, before any query is made, and the PageRank score of every
webpage is recomputed each time Google crawls the web. Let us look at the theory behind the
algorithm. As it turns out, it is based on theorems of linear algebra!
The main assumption of the algorithm is that if you are located on any webpage then with equal
probability you can follow any of the hyperlinks from that page to another page. This allows
us to represent a webpage network as a directed graph with the webpages being the nodes,
and the edges being the hyperlinks between the webpages. The adjacency matrix of such a
network is built in the following way: the (i, j)th element of this matrix is equal to 1 if there
is a hyperlink from the webpage i to the webpage j and is equal to 0 otherwise. Then the row
sums of this matrix represent the numbers of hyperlinks going out of each webpage, and the
column sums represent the numbers of times each webpage has been referred to by other webpages.
Furthermore, we can generate a matrix of probabilities S such that the (i, j)th element of this
matrix is equal to the probability of traveling from the ith webpage to the jth webpage in the
network. This probability is equal to zero if there is no hyperlink from the ith page to the jth
page and is equal to 1/Ni if there is a hyperlink from the ith page to the jth page, where Ni is
the total number of hyperlinks from the ith page. For instance, consider a sample network of
only four webpages, as shown in Fig. 1. The matrix S for this network can be written as:
    S = [  0    1/3   1/3   1/3
          1/2    0     0    1/2
           0    1/2    0    1/2        (1)
           0     0     0     0  ]

Figure 1: Sample network of four webpages

¹The original network data is available here: https://snap.stanford.edu/data/.
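As a quick illustration (this snippet is not part of lab12.m), the matrix S above can be
generated in Matlab from the adjacency matrix of the sample network:

% Adjacency matrix of the sample network in Fig. 1: entry (i,j) is 1
% if there is a hyperlink from webpage i to webpage j
A = [0 1 1 1;
     1 0 0 1;
     0 1 0 1;
     0 0 0 0];
Ni = sum(A,2);                   % number of hyperlinks on each page
S = zeros(4);
for i = 1:4
    if Ni(i) > 0
        S(i,:) = A(i,:)/Ni(i);   % each link on page i is equally likely
    end                          % row 4 (no outgoing links) stays zero
end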
There are several issues which make working with the matrix S inconvenient. First of all, there
are webpages that do not have any outgoing hyperlinks; these are called “dangling nodes” (such
as node 4 in Fig. 1). These nodes correspond to zero rows of the matrix S. Moreover, the webpages
in the network may not be connected to each other and the graph of the network may consist
of several disconnected components. These possibilities lead to undesirable properties of the
matrix S which make computations more complicated and sometimes even impossible.
The problem of dangling nodes can be solved by setting all elements of the matrix S in the
rows corresponding to the dangling nodes to the equal probability 1/N, where N is the number
of nodes in the network. This can be understood in the following way: if we are at a dangling
node, we can with equal probability jump to any other page in the network. To solve the
potential disconnectedness problem, we assume that a user can follow hyperlinks on any page
with a probability 1 − α and can jump (or “teleport”) to any other page in the network with
a probability α. The number α is called the damping factor. The value α = 0.15 is usually
taken in practical applications. The “teleport” surfing of the network can be interpreted as a
user manually typing the webpage address in the browser or using a saved hyperlink from their
bookmarks to move from one page to another. The introduction of the damping factor allows
us to obtain the Google matrix G in the form:
G = (1 − α)S + αE,
where E is the matrix with all elements equal to 1/N, and N is the number of webpages in
the network.
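Continuing the small example above, a minimal sketch of forming G (with the dangling row of
S first replaced by the uniform probabilities 1/N, as described above):

N = 4; alpha = 0.15;           % damping factor
S(4,:) = 1/N;                  % fix the dangling node (page 4)
E = ones(N)/N;                 % uniform "teleportation" matrix
G = (1-alpha)*S + alpha*E;     % the Google matrix
sum(G,2)                       % every row of G sums to 1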
The matrix G has nice properties. In particular, it has only positive entries and all of its rows
sum up to 1. In mathematical language, this matrix is stochastic and irreducible (you can look
up the precise definitions of these terms if you are interested). The matrix G satisfies the
assumptions of the following Perron-Frobenius theorem:
Theorem 1 (Perron-Frobenius) Every square matrix with positive entries has a unique unit
eigenvector with all positive entries. The eigenvalue corresponding to this eigenvector is real
and positive. Moreover, this eigenvalue is simple and is the largest in absolute value among all
the eigenvalues of this matrix.
Let us apply this theorem to the matrix G. First of all, observe that the row sums of the matrix
G are equal to 1. Consider the vector 1 = (1, 1, ..., 1)^T (the vector of all 1s). It is easy to see
that
G1 = 1.
(Multiplying the matrix G by 1 has the same effect as the MATLAB command sum(G,2).) But
then it follows that v1 = 1/‖1‖ = (1/√N) 1 is the unique unit eigenvector with all positive
components, and, therefore, by the Perron-Frobenius theorem, λ1 = 1 is the largest eigenvalue!
We are interested in the left eigenvector for the eigenvalue λ1 = 1:
u1^T G = u1^T.
Again, by the Perron-Frobenius theorem, the vector u1 is the unique unit eigenvector with all
positive components corresponding to the largest in absolute value eigenvalue λ1 = 1. We will
use the components of this vector for the ranking of webpages in the network.
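For the small example, this left eigenvector can be computed with eig (a sketch; Tasks 8
and 9 below do the same for a larger network):

[V,D] = eig(G');                   % eigenvectors of G' are the left eigenvectors of G
[~,k] = max(real(diag(D)));        % locate the largest eigenvalue, lambda_1 = 1
u1 = abs(V(:,k))/norm(V(:,k),1);   % scale to a probability vector (entries sum to 1)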
Let us look at the justification behind this algorithm. We have already established that the
vector u1 exists. Consider the following iterative process. Assume that at the beginning a user
can be on any webpage in the network with equal probability:
w0 = (1/N, 1/N, ..., 1/N ).
After one step (one move from one webpage to another using hyperlinks or teleporting), the
probability of being on the ith webpage is given by the ith component of the vector
w1 = w0 G.
After two moves the vector of probabilities becomes
w2 = w1 G = w0 G^2,
and so on.
We hope that after a large number of steps n, the vector wn = w0 G^n starts approaching some
kind of limit vector w∗, that is, wn → w∗. It turns out that due to the properties of the matrix G
this limit vector w∗ indeed exists and it is exactly the left eigenvector u1 corresponding to the
largest eigenvalue, namely, λ1 = 1. Moreover, numerical computation of matrix eigenvalues is actually
based on taking the powers of the matrix (it is called the Power method) and not on solving
the characteristic equation!
Notice that for any row vector w, the dot product w1 is simply the sum of all entries of w.
(That is, w*ones(N,1) == sum(w).) Also,
G1 = 1 ⇒ G^2 1 = 1 ⇒ G^3 1 = 1 ⇒ ···
(see if you can figure out why) so
wn 1 = w0 G^n 1 = w0 1 = 1.
From this, it is possible to show that w∗ is a non-negative vector whose entries sum to 1.
The ith component of this vector represents the probability of being on the ith webpage in the
network after a very large number of moves along the hyperlinks. Thus, it is reasonable to take
these probabilities as the ranking of the webpages in the network.
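For the four-page example above, the iteration can be carried out directly; a sketch reusing
the matrix G built earlier (note that sum(w) stays equal to 1 at every step, as argued above):

w = ones(1,4)/4;         % w0: equal probability of starting on any page
for n = 1:100
    w = w*G;             % one move: wn = w(n-1)*G = w0*G^n
end
sum(w)                   % still equals 1
w                        % approximates the limit vector w*, i.e., u1'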
TASKS
1. Open the file lab12.m. In the code cell titled %%Load the network data load the data
from the file AdjMatrix.mat into Matlab by using the load command. Save the resulting
matrix as AdjMatrix. Observe that the adjacency matrices of real networks are likely to
be very large (they may contain millions of nodes or more) and sparse. Check the sparsity of
the matrix AdjMatrix using the functions numel and nnz. Denote the ratio of the number of
non-zero elements to the total number of entries in AdjMatrix as RatioNnzAdjMatrix.
Variables: AdjMatrix, RatioNnzAdjMatrix
2. Check the dimensions of the matrix AdjMatrix using the size function. Save the dimen-
sions as new variables m and n.
Variables: m, n
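A minimal sketch of Tasks 1 and 2 (assuming the file AdjMatrix.mat stores the matrix under
the name AdjMatrix; if not, rename the loaded variable accordingly):

load('AdjMatrix.mat');                                 % assumed to create the variable AdjMatrix
RatioNnzAdjMatrix = nnz(AdjMatrix)/numel(AdjMatrix);   % fraction of non-zero entries
[m,n] = size(AdjMatrix);                               % dimensions of the matrix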
3. Observe that while the network described by the matrix AdjMatrix is not large at all from
the viewpoint of practical applications, computations with this matrix may still take a
noticeable amount of time. To save time, we will cut a subset out of this network and use
it to illustrate the Google PageRank algorithm. Introduce a new variable NumNetwork and
set its value to 500. Then cut a submatrix AdjMatrixSmall out of the matrix AdjMatrix
and plot the graph represented by the matrix AdjMatrixSmall by running the following
code cell (see the file lab12.m):
%% Display a small amount of network
NumNetwork = 500;
AdjMatrixSmall = AdjMatrix(1:NumNetwork,1:NumNetwork);
coordinates = zeros(NumNetwork,2);     % preallocate random node locations
for j = 1:NumNetwork
    coordinates(j,1) = NumNetwork*rand;
    coordinates(j,2) = NumNetwork*rand;
end
gplot(AdjMatrixSmall,coordinates,'k-*');
This will plot the subgraph of the first 500 nodes in the network with random locations
of the nodes. Notice the use of the function gplot to produce this plot. Matlab also has
dedicated functions graph and digraph for working with graphs (documented under “Graph
and Network Algorithms”), but they are available only in newer Matlab releases. Simpler
methods, as shown above, will be sufficient for our purposes.
Variables: AdjMatrixSmall, coordinates, NumNetwork
4. Use sum and max to check the number of links originating from each webpage, namely,
find the largest out-degree and the page with the largest out-degree.
Variables: NumLinks, MaxOutLinks, PageMaxOutLinks
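One possible sketch (out-degrees are the row sums of AdjMatrixSmall; full converts from the
sparse format, in case the matrix is stored as sparse):

NumLinks = full(sum(AdjMatrixSmall,2));          % number of links leaving each page
[MaxOutLinks,PageMaxOutLinks] = max(NumLinks);   % largest out-degree and its page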
5. Create a matrix of probabilities (the Google matrix). Element (i, j) of the matrix shows the
probability of moving from the i-th page of the network to the j-th page. It is assumed that
the user can follow any link on the page with a total probability of 85% (all hyperlinks
are equally likely), and jump (teleport) to any other page in the network with a total probability
of 15% (again, all pages are equally likely). Namely, we set the parameter α = 0.15 and proceed
as follows (see the file lab12.m):
alpha=0.15;                                % damping (teleportation) factor
GoogleMatrix=zeros(NumNetwork,NumNetwork);
for i=1:NumNetwork
    if NumLinks(i)~=0
        % page i has links: each of them is followed with equal probability
        GoogleMatrix(i,:)=AdjMatrixSmall(i,:)./NumLinks(i);
    else
        % dangling node: jump to any page with equal probability
        GoogleMatrix(i,:)=1./NumNetwork;
    end
end
GoogleMatrix=(1-alpha)*GoogleMatrix+alpha*ones(NumNetwork,NumNetwork)./NumNetwork;
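A quick sanity check (not required by the lab): every row of GoogleMatrix should now sum
to 1 up to round-off:

max(abs(sum(GoogleMatrix,2) - 1))   % should be on the order of machine epsilon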
7. Introduce the 1 × NumNetwork row vector w0 with each entry equal to 1/NumNetwork,
and compute the successive vectors w1 = w0 G, w2 = w1 G, w3 = w2 G, w90 = w0 G^90,
w100 = w0 G^100, where G is the GoogleMatrix. Compute the difference δw = w100 − w90.
Observe that the sequence wn converges to a certain limit vector w∗ very fast.
Variables: w0, w1, w2, w3, w90, w100, deltaw
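One possible sketch of this task:

w0 = ones(1,NumNetwork)/NumNetwork;   % uniform initial probability vector
w1 = w0*GoogleMatrix;
w2 = w1*GoogleMatrix;
w3 = w2*GoogleMatrix;
w90  = w0*GoogleMatrix^90;            % w0*G^90
w100 = w0*GoogleMatrix^100;           % w0*G^100
deltaw = w100 - w90;
norm(deltaw)                          % very small: the sequence has essentially converged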
8. Compute the eigenvalues and the left and the right eigenvectors of the matrix G using the
function eig. Observe that the right eigenvector corresponding to the eigenvalue λ1 = 1
is proportional to the vector v1 = (1, 1, ..., 1). To compute the left eigenvectors, use the
function eig on the matrix G’. Select the left eigenvector corresponding to the eigenvalue
λ1 = 1 and denote it as u1.
Variables: u1
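A hedged sketch (eig applied to G' returns the right eigenvectors of G', which are the left
eigenvectors of G; real discards the zero imaginary round-off):

[V,D] = eig(GoogleMatrix');    % columns of V are left eigenvectors of GoogleMatrix
[~,idx] = max(real(diag(D)));  % position of the largest eigenvalue, lambda_1 = 1
u1 = real(V(:,idx));           % left eigenvector for lambda_1 = 1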
9. In general, the vector u1 returned by eig is not scaled to have all positive components
(even though they will all have the same sign). Normalize this vector by using the code:
u1=abs(u1)/norm(u1,1);
This will create a probability vector with all positive components whose entries sum to 1.
10. Use the function max to find the maximal element of the vector u1 and its index.
Variables: MaxRank, PageMaxRank
Q2: Which page is the most important in the network?
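For instance:

[MaxRank,PageMaxRank] = max(u1);   % largest PageRank score and the page attaining it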
11. Find out whether the highest ranking webpage is the same as the page with the most
hyperlinks pointing to it. To do so, create the vector of column sums of the matrix
AdjMatrixSmall and save it as MaxInLinks. Use the function max again to select the
page with the maximum number of in-links.
Variables: MaxInLinks, PageMaxInLinks
Q3: What is the number of hyperlinks pointing to the webpage PageMaxRank?
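A sketch, reading MaxInLinks as the vector of column sums named in the task:

MaxInLinks = full(sum(AdjMatrixSmall,1));   % in-degree of each page (column sums)
[~,PageMaxInLinks] = max(MaxInLinks);       % page with the most incoming links
MaxInLinks(PageMaxRank)                     % Q3: number of links pointing to PageMaxRank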