Project 12
Project overview
Each of the projects in this book is designed to test several concepts from linear algebra. The
goals of each project and linear algebra concepts used in it are summarized below:
Project 1: To introduce operations with matrices in Matlab such as matrix addition, scalar
multiplication, matrix multiplication, and matrix input functions in Matlab.
Project 2: To use the matrix operations introduced in the previous project to manipulate
images. Lightening/darkening, cropping, and changing the contrast of the images are discussed.
Project 3: To use matrix multiplication to manipulate image colors similarly to photo filters
for images.
Project 5: To discuss some of the strategies for the ranking of sports teams with an example
of the Big 12 conference in college football.
Project 7: To create a simple recommender system based on the use of norms and inner
products.
Project 8: To study basic interpolation techniques and the least squares method with
application to climate data.
Project 10: To apply matrix mappings in order to generate fractals with the help of the Chaos
game.
Project 11: To apply eigenvalue problems and orthogonal projections in order to create a
simple face recognition algorithm.
Project 12: To introduce Google’s PageRank algorithm and look at the eigenproblems for
the ranking of webpages.
Project 13: Application of eigenvalues and eigenvectors to data clustering with the
example of a Facebook network.
Project 14: Application of singular value decomposition for image compression and noise
reduction.
Sample questions to check students’ understanding and checkpoints for introduced variables
are highlighted in the text of the projects in yellow. The templates of the labs 9-14 are given in
the appendix.
Project 12: Matrix eigenvalues and Google’s PageRank algorithm
Goals: To apply matrix eigenvalues and eigenvectors to ranking of webpages in the World Wide
Web.
To get started:
• Download the file AdjMatrix.mat, which contains the adjacency matrix of a so-called “wiki-vote”
network with 8297 nodes [9] (see footnote 16).
Matlab commands used: load, size, numel, nnz, for... end, gplot
What you have to submit: The file lab12.m which you will modify during the lab session.
INTRODUCTION
According to social polls, the majority of users only look at the first few results of the online
search and very few users look past the first page of results. Hence, it is crucially important to
rank the pages in the “right” order so that the most respectable and relevant results will come
first. The simplest way to determine the rank of a webpage in a network is to look at how many
times it has been referred to by other webpages. This simple ranking method leaves a lot to be
desired. In particular, it can be easily manipulated by referring to a certain webpage from a lot
of “junk” webpages. The quality of the webpages referring to the page we are trying to rank
should matter too. This is the main idea behind the Google PageRank algorithm.
The Google PageRank algorithm is the oldest algorithm used by Google to rank the web
pages which are preranked offline. The PageRank scores of the webpages are recomputed each
time Google crawls the web. Let us look at the theory behind the algorithm. As it turns out,
it is based on theorems of linear algebra!
The main assumption of the algorithm is that if you are located on any webpage then with
equal probability you can follow any of the hyperlinks from that page to another page. This
allows us to represent a webpage network as a directed graph with the webpages being the nodes,
and the edges being the hyperlinks between the webpages. The adjacency matrix of such a
network is built in the following way: the (i, j)th element of this matrix is equal to 1 if there
is a hyperlink from the webpage i to the webpage j and is equal to 0 otherwise. Then the row
sums of this matrix will represent numbers of hyperlinks from each webpage and the column
sums will represent numbers of times each webpage has been referred to by other webpages.
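The row and column sums described above can be sketched in Matlab as follows. The four-page adjacency matrix A is an assumption chosen purely for illustration, and the variable names outLinks and inLinks are hypothetical.

```matlab
% Adjacency matrix of a hypothetical four-page network:
% A(i,j) = 1 if page i contains a hyperlink to page j, and 0 otherwise.
A = [0 1 1 1;
     1 0 0 1;
     0 1 0 1;
     0 0 0 0];

outLinks = sum(A, 2);   % row sums: number of hyperlinks leaving each page
inLinks  = sum(A, 1);   % column sums: number of hyperlinks pointing to each page

disp(outLinks');        % displays 3 2 2 0
disp(inLinks);          % displays 1 2 1 3
```

Here page 4 receives the most hyperlinks (three) while having no outgoing hyperlinks of its own.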
Going further, we can generate a matrix of probabilities S such that the (i, j)th element of
this matrix is equal to the probability of traveling from the ith webpage to the jth webpage in the
network. This probability is equal to zero if there is no hyperlink from the ith page to the jth page and
is equal to $1/N_i$ if there is such a hyperlink, where $N_i$ is the total number of
hyperlinks from the ith page. For instance, consider the sample network of only four webpages shown
in Fig. 17. The matrix S for this network can be written as:
\[
S = \begin{pmatrix}
0 & 1/3 & 1/3 & 1/3 \\
1/2 & 0 & 0 & 1/2 \\
0 & 1/2 & 0 & 1/2 \\
0 & 0 & 0 & 0
\end{pmatrix} \tag{1}
\]
Figure 17: Sample network of four webpages
Footnote 15: The printout of the file lab12.m is given in the appendices of this book.
Footnote 16: The original network data is available here: https://snap.stanford.edu/data/.
There are several issues which make working with the matrix S inconvenient. First of all,
there are webpages that do not have any hyperlinks - the so-called “dangling nodes” (such as
the node 4 in Fig. 17). These nodes will correspond to zero rows of the matrix S. Moreover, the
webpages in the network may not be connected to each other and the graph of the network may
consist of several disconnected components. These possibilities lead to undesirable properties of
the matrix S which make computations with it complicated and not even always possible.
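As a minimal sketch of how the matrix S and its zero rows arise, the code below builds S from an adjacency matrix and locates the dangling nodes. The matrix A is an assumed adjacency matrix consistent with the matrix S in (1); the names outLinks and dangling are hypothetical.

```matlab
% Assumed adjacency matrix consistent with the matrix S in (1).
A = [0 1 1 1;
     1 0 0 1;
     0 1 0 1;
     0 0 0 0];
N = size(A, 1);

outLinks = sum(A, 2);          % N_i: number of hyperlinks from page i
S = zeros(N);
for i = 1:N
    if outLinks(i) > 0
        S(i, :) = A(i, :) / outLinks(i);   % probability 1/N_i for each hyperlink
    end
end

dangling = find(outLinks == 0);  % pages without hyperlinks give zero rows of S
disp(S);
disp(dangling);                  % displays 4
```

The zero row of S is what prevents S from being a matrix of probabilities: the entries in row 4 do not sum to 1.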
The problem of the dangling nodes can be solved by assigning all elements of the matrix S in
the rows corresponding to the dangling nodes equal probabilities 1/N , where N is the number
of the nodes in the network. This can be understood in the following way: if we are at the
dangling node we can with equal probability jump to any other page in the network. To solve
the potential disconnectedness problem, we assume that a user can follow hyperlinks on any
page with a probability 1 − α and can jump (or “teleport”) to any other page in the network
with a probability α. The number α is called a damping factor. The value of α = 0.15 is usually
taken in practical applications. The “teleport” surfing of the network can be interpreted as a
user manually typing the webpage address in the browser or using a saved hyperlink from their
bookmarks to move from one page onto another. The introduction of the damping factor allows
us to obtain the Google matrix G in the form:
G = (1 − α)S + αE,
where E is the matrix with all elements equal to 1/N and N is the number of webpages in
the network. Here S is the matrix in which the rows of the dangling nodes have already been replaced.
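The two fixes described above (uniform rows for dangling nodes, then damping) can be sketched in a few lines of Matlab. This is only an illustration under stated assumptions: S is the four-page matrix from (1), α = 0.15 as mentioned in the text, and the names rowSums and allPositive are hypothetical.

```matlab
% Matrix S from (1); row 4 corresponds to a dangling node.
S = [0   1/3 1/3 1/3;
     1/2 0   0   1/2;
     0   1/2 0   1/2;
     0   0   0   0];
N = size(S, 1);
alpha = 0.15;                     % damping factor as defined in the text

% Replace each zero row (dangling node) by the uniform row 1/N.
for i = 1:N
    if sum(S(i, :)) == 0
        S(i, :) = ones(1, N) / N;
    end
end

E = ones(N) / N;                  % all entries equal to 1/N
G = (1 - alpha) * S + alpha * E;  % Google matrix

rowSums = sum(G, 2);              % every entry should equal 1 up to rounding
allPositive = all(G(:) > 0);      % true: every entry is at least alpha/N
disp(G);
```

For a network of realistic size one would avoid forming the dense matrix E explicitly; for a small example like this one the direct formula is the clearest.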
The matrix G has nice properties. In particular, it has only positive entries and all of its
rows sum up to 1. In mathematical language, this matrix is stochastic and irreducible (you can
look up the precise definitions of these terms if you are interested). Since all entries of G are
positive, the following Perron-Frobenius theorem applies to it:
Theorem 1 (Perron-Frobenius) Every square matrix with positive entries has a unique unit
eigenvector with all positive entries. The eigenvalue corresponding to this eigenvector is real and
positive. Moreover, this eigenvalue is simple and is the largest in absolute value among all the
eigenvalues of this matrix.
Let us apply this theorem to the matrix G. First of all, observe that the row sums of the
matrix G are equal to 1. Consider the vector $v_1 = (1, 1, ..., 1)^T / N$. Since every row of G sums to 1, it is easy to see that
$G v_1 = v_1$.
Thus $v_1$ is an eigenvector of G with all positive components, corresponding to the eigenvalue 1. By the
Perron-Frobenius theorem, it is the unique unit eigenvector with this property and, therefore, $\lambda_1 = 1$ is the largest eigenvalue!
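The identity for the vector of equal components can be checked one component at a time, using only the fact that each row of G sums to 1 (here $G_{ij}$ denotes the (i, j)th element of G):

```latex
\[
(G v_1)_i = \sum_{j=1}^{N} G_{ij} \, (v_1)_j
          = \frac{1}{N} \sum_{j=1}^{N} G_{ij}
          = \frac{1}{N}
          = (v_1)_i , \qquad i = 1, \dots, N .
\]
```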
We are interested in the left eigenvector for the eigenvalue $\lambda_1 = 1$:
$u_1^T G = u_1^T$.
Again, by the Perron-Frobenius theorem (applied to the transposed matrix $G^T$), the vector $u_1$ is the unique unit eigenvector with all
positive components corresponding to the eigenvalue $\lambda_1 = 1$, which is the largest in absolute value. We will
use the components of this vector for the ranking of webpages in the network.
Let us look at the justification behind this algorithm. We have already established that the
vector $u_1$ exists. Consider the following iterative process. Assume that at the beginning a user
can be on any webpage in the network with equal probability:
$w_0 = (1/N, 1/N, ..., 1/N)$.
After one step (one move from one webpage to another using hyperlinks or teleporting), the
probability of being on the ith webpage is given by the ith component of the vector
$w_1 = w_0 G$.
After two steps, the corresponding vector is
$w_2 = w_1 G = w_0 G^2$,
and so on.
We hope that after a large number of steps n, the vector $w_n = w_0 G^n$ starts approaching
some kind of limit vector $w^*$, $w_n \to w^*$. It turns out that due to the properties of the matrix G
this limit vector $w^*$ indeed exists and it is exactly the left eigenvector corresponding to the largest
eigenvalue, namely, $\lambda_1 = 1$. Moreover, numerical computation of the largest eigenvalue of a big matrix is often
based on taking powers of the matrix in exactly this way (this is called the Power method) and not on solving the
characteristic equation!
Let us assume that the vector $w^*$ is a non-negative vector whose entries sum to 1. Then the
components of this vector represent the probabilities of being on each webpage in the network
after a very large number of moves along the hyperlinks. Thus, it is perfectly reasonable to take
these probabilities as the ranking of the webpages in the network.
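The iterative process above can be sketched as a short loop. This is a minimal illustration rather than the code of lab12.m: the matrix S is the four-page example with the dangling row already replaced by 1/N, and the iteration cap maxSteps and the tolerance tol are assumed values.

```matlab
% Four-page example: S with the dangling row replaced by the uniform row 1/N.
S = [0   1/3 1/3 1/3;
     1/2 0   0   1/2;
     0   1/2 0   1/2;
     1/4 1/4 1/4 1/4];
N = size(S, 1);
alpha = 0.15;
G = (1 - alpha) * S + alpha * ones(N) / N;

w = ones(1, N) / N;     % w0: every page is equally likely at the start
maxSteps = 1000;        % assumed cap on the number of steps
tol = 1e-12;            % assumed stopping tolerance

for k = 1:maxSteps
    wNext = w * G;                    % one more move through the network
    if norm(wNext - w, 1) < tol       % stop once the vector no longer changes
        w = wNext;
        break;
    end
    w = wNext;
end

residual = norm(w * G - w, 1);        % close to 0 if w is a left eigenvector
disp(w);
```

The row vector w stays a probability vector at every step because the rows of G sum to 1, so no renormalization is needed inside the loop.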
TASKS
1. Open the file lab12.m. In the code cell titled %%Load the network data load the data
from the file AdjMatrix.mat into Matlab by using the load command. Save the resulting
matrix as AdjMatrix. Observe that the adjacency matrices of real networks are likely to
be very large (may contain millions of nodes or more) and sparse. Check the sparsity of
the matrix AdjMatrix using the functions numel and nnz. Denote the ratio of non-zero
elements as nnzAdjMatrix. If you did everything correctly you should obtain that only
0.15% of the elements of the matrix AdjMatrix are non-zero.
Variables: AdjMatrix, nnzAdjMatrix
2. Check the dimensions of the matrix AdjMatrix using the size function. Save the dimen-
sions as new variables m and n.
Variables: m, n
3. Observe that while the network described by the matrix AdjMatrix is not large at all from
the viewpoint of practical applications, computations with this matrix may still take a
noticeable amount of time. To save time, we will cut a subset out of this network and use
it to illustrate the Google PageRank algorithm. Introduce a new variable NumNetwork and
set its value to 500. Then cut a submatrix AdjMatrixSmall out of the matrix AdjMatrix
and plot the graph represented by the matrix AdjMatrixSmall by running the following
code cell:
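%% Display a small amount of network
NumNetwork=500;
AdjMatrixSmall=AdjMatrix(1:NumNetwork,1:NumNetwork);
coordinates=zeros(NumNetwork,2); % preallocate the node coordinates
for j=1:NumNetwork
    % place node j at a random point of the square with side NumNetwork
    coordinates(j,1)=NumNetwork*rand;
    coordinates(j,2)=NumNetwork*rand;
end
gplot(AdjMatrixSmall,coordinates,'k-*');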
This will plot the subgraph of the first 500 nodes in the network with random locations
of the nodes. Notice the use of the function gplot to produce this graph. Observe
that Matlab also has the functions graph and digraph for working with graphs, but those
functions (documented under “Graph and Network Algorithms”) were introduced in release R2015b and may
not be available in older installations. Simpler methods, as shown above, will be sufficient for our
purposes.
Variables: AdjMatrixSmall, coordinates, NumNetwork
5. Compute the eigenvalues and the left and the right eigenvectors of the matrix G using the
function eig. Observe that the right eigenvector corresponding to the eigenvalue λ1 = 1
is proportional to the vector v1 = (1, 1, ..., 1). To compute the left eigenvectors, use the
function eig on the transposed matrix G'. Select the left eigenvector corresponding to the eigenvalue
λ1 = 1 and denote it as u1.
Variables: u1
6. Observe that by default the vector u1 is not scaled to have all positive components (even
though all the components of the vector u1 will have the same sign). Normalize this vector
by using the code:
u1=abs(u1)/norm(u1,1);
This will create a vector with all positive components whose entries sum to 1 (called a
probability vector).
7. Use the function max to select the maximal element of the vector u1 and its index in the array.
Save them as MaxRank and PageMaxRank, respectively.
Variables: MaxRank, PageMaxRank
8. Find out whether the highest ranking webpage is the same as the page with the most
hyperlinks pointing to it. To do so, create the vector of column sums of the matrix G and
save it as MostLinks. Use the function max again to select the element with the maximal
number of links.
Variables: MostLinks, MaxLinks, PageMaxLinks
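Before working with the 500-node subnetwork, the steps of the tasks above can be rehearsed on a tiny example. The sketch below is not the contents of lab12.m: it uses an assumed four-page adjacency matrix A in place of AdjMatrixSmall, builds G with α = 0.15, and uses the hypothetical helper names nnzRatio, outLinks and idx.

```matlab
% Assumed four-page network used in place of AdjMatrixSmall.
A = [0 1 1 1;
     1 0 0 1;
     0 1 0 1;
     0 0 0 0];

% Sparsity and dimensions (numel, nnz, size).
[m, n] = size(A);
nnzRatio = nnz(A) / numel(A);          % fraction of non-zero entries

% Build S (uniform rows for dangling nodes) and the Google matrix G.
alpha = 0.15;
outLinks = sum(A, 2);
S = zeros(m, n);
for i = 1:m
    if outLinks(i) > 0
        S(i, :) = A(i, :) / outLinks(i);
    else
        S(i, :) = ones(1, n) / n;
    end
end
G = (1 - alpha) * S + alpha * ones(m, n) / n;

% Left eigenvectors of G are the right eigenvectors of G'.
[V, D] = eig(G');
[~, idx] = min(abs(diag(D) - 1));      % position of the eigenvalue closest to 1
u1 = real(V(:, idx));

% Rescale to a probability vector (positive entries that sum to 1).
u1 = abs(u1) / norm(u1, 1);

% Highest PageRank score and the corresponding page.
[MaxRank, PageMaxRank] = max(u1);

% Column sums of G and the page with the largest column sum.
MostLinks = sum(G, 1);
[MaxLinks, PageMaxLinks] = max(MostLinks);

fprintf('Top page by PageRank: %d, by column sums: %d\n', PageMaxRank, PageMaxLinks);
```

In this small example both criteria select page 4; on a larger network the two answers need not agree, which is exactly what the last task asks you to check.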