The Use of Linear Algebra by Web Search Engines
1 Introduction
Nearly all the major Web search engines today use link analysis to improve their search results. That’s
exciting news for linear algebraists because link analysis, the use of the Web’s hyperlink structure, is
built from fundamentals of matrix theory. Link analysis and its underlying linear algebra have helped
revolutionize Web search, so much so that search results from the pre-link analysis era (before 1998)
pale in comparison to today's remarkably accurate results.
HITS [13] and PageRank [2, 3] are two of the most popular link analysis algorithms. Both were
developed around 1998 and both have dramatically improved the search business. In order to appreciate
the impact of link analysis, recall for a minute the state of search prior to 1998. Because of the immense
number of pages on the Web, a query to an engine often produced a very long list of relevant pages,
sometimes thousands of pages long. A user had to sort carefully through the list to find the most relevant
pages. The order in which the pages were presented was little help because spamming was so easy then. To
trick a search engine into ranking their pages higher than deserved, spammers stuffed meta-tags with
popular search terms that never actually appeared in the page, rendering meta-tags useless for search
engines. Spammers also repeated popular search terms in invisible text (white text on a white
background) to fool engines.
2 HITS

HITS, devised by Jon Kleinberg [13], assigns each Web page $i$ two scores: a hub score $h_i$ and an
authority score $a_i$. Good hubs point to good authorities, and good authorities are pointed to by good
hubs, which leads to the mutually reinforcing summations
$$a_i^{(k)} = \sum_{j:\,e_{ji} \in E} h_j^{(k-1)} \qquad \text{and} \qquad h_i^{(k)} = \sum_{j:\,e_{ij} \in E} a_j^{(k)} \qquad \text{for } k = 1, 2, 3, \ldots,$$
where $e_{ij}$ represents a hyperlink from page $i$ to page $j$ and $E$ is the set of hyperlinks. To compute the
scores for a page, Kleinberg started with uniform scores for all pages, i.e., $h_i^{(1)} = 1/n$ and $a_i^{(1)} = 1/n$, where $n$ is
the number of pages in a so-called neighborhood set for the query list. The neighborhood set consists of
all pages in the query list plus all pages pointing to or from the query pages. Depending on the query, the
neighborhood set could contain just a hundred pages or a hundred thousand pages. (The neighborhood
set allows latent semantic associations to be made.) The hub and authority scores are iteratively refined
until convergence to stationary values.
Using linear algebra we can replace the summation equations with matrix equations. Let $h$ and $a$ be
column vectors holding the hub and authority scores, and let $L$ be the adjacency matrix for the neighborhood
set; that is, $L_{ij} = 1$ if page $i$ links to page $j$, and 0 otherwise. These definitions show that
$$a^{(k)} = L^T L\, a^{(k-1)} \qquad \text{and} \qquad h^{(k)} = L L^T h^{(k-1)}.$$
These equations make it clear that Kleinberg's algorithm is really the power method applied to the
positive semi-definite matrices $L^T L$ and $L L^T$; $L^T L$ is called the authority matrix and $L L^T$ the hub
matrix. Thus, HITS amounts to solving the eigenvector problems $L^T L\, a = \lambda_1 a$ and $L L^T h = \lambda_1 h$, where
$\lambda_1$ is the largest eigenvalue of $L^T L$ (and $L L^T$), and $a$ and $h$ are corresponding eigenvectors.
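To make the iteration concrete, here is a minimal NumPy sketch of HITS on a small, hypothetical neighborhood graph; the particular adjacency matrix, iteration cap, and tolerance are illustrative choices, not part of the original algorithm.

```python
import numpy as np

# Hypothetical 4-page neighborhood set: L[i, j] = 1 if page i links to page j.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

n = L.shape[0]
h = np.ones(n) / n              # uniform initial hub scores
a = np.ones(n) / n              # uniform initial authority scores

for _ in range(1000):
    a_new = L.T @ h             # a^(k) = L^T h^(k-1): sum hub scores of in-linking pages
    h_new = L @ a_new           # h^(k) = L a^(k):     sum authority scores of out-linked pages
    a_new /= a_new.sum()        # normalize so the scores stay bounded
    h_new /= h_new.sum()
    if np.abs(a_new - a).sum() + np.abs(h_new - h).sum() < 1e-12:
        a, h = a_new, h_new
        break
    a, h = a_new, h_new

print("authority scores:", a.round(4))
print("hub scores      :", h.round(4))
```

After substituting one update into the other, the loop is exactly the power method on $L^T L$ and $L L^T$, so the normalized iterates converge to the dominant eigenvectors $a$ and $h$ described above.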
While this is the basic linear algebra required by the HITS method, there is much more to consider,
including the convergence, existence, uniqueness, and numerical computation of these scores [5, 7, 14].
Several modifications to HITS have been suggested, each bringing various advantages and disadvantages
[4, 6, 8]. A variation of Kleinberg's HITS concept underlies the search engine TEOMA
(http://www.teoma.com), which is owned by Ask Jeeves, Inc.
3 PageRank

PageRank, the ranking algorithm developed by Brin and Page [2, 3], gives each page a single score that
is determined by the scores of the pages pointing to it:
$$r_i^{(k+1)} = \sum_{j \in I_i} \frac{r_j^{(k)}}{|O_j|},$$
where $r_i^{(k)}$ is the PageRank of page $i$ at iteration $k$, $I_i$ is the set of pages pointing into page $i$, and $|O_j|$ is
the number of outlinks from page $j$. Like HITS, PageRank starts with a uniform rank for all pages, i.e.,
$r_i^{(0)} = 1/n$, and successively refines these scores, where $n$ is the total number of Web pages.
As with HITS, this process can be written in matrix notation. Let the row vector $\pi^{(k)T}$ be the PageRank
vector at the $k$th iteration. The summation equation for PageRank can then be written compactly as
$$\pi^{(k+1)T} = \pi^{(k)T} H,$$
where $H$ is the row-normalized hyperlink matrix, i.e., $h_{ij} = 1/|O_i|$ if there is a link from page $i$ to page $j$,
and 0 otherwise. Unfortunately, this iterative procedure has convergence problems: it can cycle, or the
limit may depend on the starting vector.
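The cycling is easy to reproduce. In the following sketch (a hypothetical two-page Web in which each page links only to the other), $H$ is a permutation matrix with eigenvalues $\pm 1$, so the iterates oscillate forever:

```python
import numpy as np

# Two pages linking only to each other: H is a permutation matrix.
H = np.array([[0.0, 1.0],
              [1.0, 0.0]])

pi = np.array([0.7, 0.3])       # a non-uniform starting vector
for k in range(6):
    print(k, pi)                # prints [0.7 0.3], [0.3 0.7], [0.7 0.3], ...
    pi = pi @ H                 # pi^(k+1)T = pi^(k)T H
```

Starting instead from the uniform vector $[0.5, 0.5]$, the iteration sits still immediately, which illustrates the dependence of the limit on the starting vector.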
To fix these problems, Brin and Page revised their basic PageRank concept. Still using the hyperlink
structure of the Web, they built an irreducible aperiodic Markov chain characterized by a primitive (irre-
ducible with only one eigenvalue on the spectral circle) transition probability matrix. The irreducibility
guarantees the existence of a unique stationary distribution vector $\pi^T$, which becomes the PageRank
vector. The power method with a primitive stochastic iteration matrix will always converge to $\pi^T$ inde-
pendent of the starting vector, and the asymptotic rate of convergence is governed by the magnitude of
the subdominant eigenvalue $\lambda_2$ of the transition matrix [19].
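For reference, the rate statement can be made quantitative; a standard power method bound (up to polynomial factors that appear when $\lambda_2$ is defective) is
$$\|\pi^{(k)T} - \pi^T\|_1 = O(|\lambda_2|^k) \qquad \text{as } k \to \infty.$$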
Here’s how Google turns the hyperlink structure of the Web into a primitive stochastic matrix. If
there are $n$ pages in the Web, let $H$ be the $n \times n$ matrix whose element $h_{ij}$ is the probability of moving
from page $i$ to page $j$ in one click of the mouse. The simplest model takes $h_{ij} = 1/|O_i|$, which assumes
that, from any Web page, a surfer is equally likely to follow any of the outgoing links to arrive at
another page.
However, some rows of $H$ may contain all zeros, so $H$ is not necessarily stochastic. This occurs
whenever a page contains no outlinks; many such pages exist on the Web and are called dangling nodes.
An easy fix is to replace all zero rows with $e^T/n$, where $e^T$ is the row vector of all ones. The revised
(now stochastic) matrix $S$ can be written as a rank-one update to the sparse $H$. Let $a$ be the dangling
node vector in which
$$a_i = \begin{cases} 1 & \text{if page } i \text{ is a dangling node,} \\ 0 & \text{otherwise.} \end{cases}$$
Then
$$S = H + ae^T/n.$$
Actually, any probability vector $p^T > 0$ with $p^T e = 1$ can be used in place of the uniform vector $e^T/n$.
We’re not home yet because the adjustment that produces the stochastic matrix $S$ isn’t enough to
ensure the existence of a unique stationary distribution vector (needed to make PageRank well defined).
Irreducibility on top of stochasticity is required. But the link structure of the Web is reducible; the Web
graph is not strongly connected. Consequently, an adjustment to make $S$ irreducible is needed. This last
adjustment brings us to the Google matrix, which is defined to be
$$G = \alpha S + (1 - \alpha)E,$$
where $0 \le \alpha < 1$ and $E = ee^T/n$. Google eventually replaced the uniform vector $e^T/n$ with a more
general probability vector $v^T$ (so that $E = ev^T$), giving the flexibility to adjust
PageRanks as well as to personalize them. See [10, 15] for more about the personalization vector $v^T$.
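The two adjustments are easy to carry out on a toy example. The sketch below builds $H$, $S$, and $G$ for a hypothetical 4-page Web with one dangling node; the link structure, $\alpha = 0.85$, and the uniform $v$ are illustrative choices.

```python
import numpy as np

n = 4
links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}   # page 3 is a dangling node

# Raw hyperlink matrix: h_ij = 1/|O_i| whenever page i links to page j.
H = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        H[i, j] = 1.0 / len(outs)

# Stochasticity adjustment: replace each zero row with e^T/n via S = H + a e^T/n.
e = np.ones(n)
a = np.array([0.0 if links[i] else 1.0 for i in range(n)])  # dangling node vector
S = H + np.outer(a, e) / n

# Primitivity adjustment: G = alpha S + (1 - alpha) e v^T.
alpha = 0.85
v = e / n                     # uniform personalization (teleportation) vector
G = alpha * S + (1 - alpha) * np.outer(e, v)

print(G.round(3))             # every row sums to 1 and every entry is positive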
Because $G$ is a convex combination of the two stochastic matrices $S$ and $E$, it follows that $G$ is
both stochastic and irreducible. Furthermore, every node is now directly connected to every other node
(although the probability of transition may be very small in some cases), so $G > 0$. Consequently, $G$ is a
primitive matrix, and this ensures that the power method $\pi^{(k+1)T} = \pi^{(k)T} G$ will converge, independent
of the starting vector, to a unique stationary distribution $\pi^T$ [19]. This is the mathematical part of
Google’s PageRank vector.
Written in this way, it becomes clear that the power method applied to G can be implemented with
vector-matrix multiplications on the extremely sparse H, and G and S are never formed or stored. A
matrix-free method such as the power method is required due to the size of the matrices and vectors
involved (Google’s index is currently 4.3 billion pages). Fortunately, since H is sparse, each vector-
matrix multiplication required by the power method can be computed in nnz(H) flops, where nnz(H) is
the number of nonzeros in H. And since the average number of nonzeros per row in H is significantly less
than 10, we have O(nnz(H)) ≈ O(n). Furthermore, at each iteration, the power method only requires
the storage of one vector, the current iterate, whereas other accelerated matrix-free methods, such as
restarted GMRES or BiCGStab, require storage of at least several vectors, depending on the size of the
subspace chosen. Finally, the power method applied in this way converges quickly. Brin and Page report
success using only 50 to 100 power iterations [3]. This is due in large part to the fact that it can be
proven [15] that the subdominant eigenvalue of $G$ satisfies $|\lambda_2| \le \alpha$, and Google originally set $\alpha = .85$;
since the error decays like $\alpha^k$, roughly 114 iterations already reduce it below $10^{-8}$ because $.85^{114} \approx 10^{-8}$.
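Below is a minimal sketch of the matrix-free computation, assuming SciPy's sparse format and reusing the toy 4-page example (the link data, $\alpha$, tolerance, and uniform $v$ are illustrative): only the sparse product with $H$ is formed, and the dangling and teleportation corrections are applied as cheap vector operations, so $S$ and $G$ are never built.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Sparse hyperlink matrix H for the toy 4-page Web (page 3 is dangling).
rows = [0, 0, 1, 2, 2]
cols = [1, 2, 2, 0, 3]
vals = [0.5, 0.5, 1.0, 0.5, 0.5]
n = 4
H = csr_matrix((vals, (rows, cols)), shape=(n, n))

dangling = np.array([0.0, 0.0, 0.0, 1.0])   # dangling node indicator a
alpha = 0.85
v = np.ones(n) / n                          # teleportation vector v^T
pi = np.ones(n) / n                         # starting vector

for k in range(200):
    # pi^(k+1)T = alpha pi^T H + alpha (pi^T a)/n e^T + (1 - alpha) v^T;
    # H.T @ pi computes (pi^T H)^T in nnz(H) flops, the rest is O(n) vector work.
    new = alpha * (H.T @ pi) + alpha * (pi @ dangling) / n + (1 - alpha) * v
    if np.abs(new - pi).sum() < 1e-10:
        pi = new
        break
    pi = new

print("PageRank:", pi.round(4))             # pi sums to 1
```

Each pass costs nnz($H$) flops for the sparse product plus $O(n)$ for the corrections, matching the operation count above, and only the current iterate needs to be stored.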
Like HITS, the basic concepts of PageRank are simple, but there are many subtle issues that lurk just
below the surface. For example, there are complicated and unresolved issues concerning personalization,
computation, accelerated computation, sensitivity, and updating; more information is available in [7, 11,
12, 15, 17, 18, 20, 21].
This brief introduction describes only the mathematical component of Google’s ranking system. How-
ever, it’s known that there are non-mathematical “metrics” that are also considered when Google responds
to a query, so the results seen by a user are in fact PageRank tempered by other metrics. While Google is
secretive about these other metrics, they state on their Web site (http://www.google.com/technology)
that “The heart of our software is PageRank... .”
4 Books
SIAM is publishing a second edition of the popular Understanding Search Engines: Mathematical Mod-
eling and Text Retrieval [1] by Michael W. Berry and Murray Browne in 2005. The new edition contains
a chapter devoted to link analysis. As a result, readers can see how link analysis and ranking algorithms
fit into the overall search process.
Also due out in 2005 is our book, Understanding Web Search Engine Rankings: Google’s PageRank,
Teoma’s HITS, and other ranking algorithms [16]. This book from Princeton University Press will contain
over 250 pages devoted to link analysis algorithms with several introductory chapters, examples, and code,
as well as chapters dealing with more advanced issues in Web search ranking.
References
[1] Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval.
SIAM, Philadelphia, 1999.
[2] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks
and ISDN Systems, 33:107–117, 1998.
[3] Sergey Brin, Lawrence Page, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: bringing order to
the web. Technical report, Computer Science Department, Stanford University, 1998.
[4] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst Simon. PageRank, HITS and a unified framework for link analysis.
In Proceedings of the 25th ACM SIGIR Conference, pages 353–354, Tampere, Finland, August 2002.
[5] Chris H. Q. Ding, Hongyuan Zha, Xiaofeng He, Parry Husbands, and Horst D. Simon. Link analysis: hubs and
authorities on the World Wide Web. SIAM Review, 46(2):256–268, 2004.
[6] Ayman Farahat, Thomas Lofaro, Joel C. Miller, Gregory Rae, F. Schaefer, and Lesley A. Ward. Modifications of
Kleinberg’s HITS algorithm using matrix exponentiation and web log records. In ACM SIGIR Conference, pages
444–445, September 2001.
[7] Ayman Farahat, Thomas Lofaro, Joel C. Miller, Gregory Rae, and Lesley A. Ward. Existence and uniqueness of
ranking vectors for linear link analysis. In ACM SIGIR Conference, September 2001.
[8] Francois Fouss, Jean-Michel Renders, and Marco Saerens. Some relationships between Kleinberg’s hubs and authorities,
correspondence analysis, and the SALSA algorithm. In Proceedings of the 7th International Conference on the Statistical
Analysis of Textual Data (JADT 2004), pages 445–455, 2004.
[9] Taher H. Haveliwala and Sepandar D. Kamvar. The second eigenvalue of the Google matrix. Technical report, Stanford
University, 2003.
[10] Taher H. Haveliwala, Sepandar D. Kamvar, and Glen Jeh. An analytical comparison of approaches to personalizing
PageRank. Technical report, Stanford University, 2003.
[11] Sepandar D. Kamvar and Taher H. Haveliwala. The condition number of the PageRank problem. Technical report,
Stanford University, 2003.
[12] Sepandar D. Kamvar, Taher H. Haveliwala, and Gene H. Golub. Adaptive methods for the computation of PageRank.
Technical report, Stanford University, 2003.
[13] Jon Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[14] Amy N. Langville and Carl D. Meyer. A survey of eigenvector methods of web information retrieval. SIAM Review,
2003. Accepted in December 2003.
[15] Amy N. Langville and Carl D. Meyer. Deeper inside PageRank. Internet Mathematics Journal, 2004. Accepted in
February 2004.
[16] Amy N. Langville and Carl D. Meyer. Understanding Web Search Engine Rankings: Google’s PageRank, Teoma’s
HITS, and other ranking algorithms. Princeton University Press, Princeton, 2005.
[17] Chris Pan-Chi Lee, Gene H. Golub, and Stefanos A. Zenios. Partial state space aggregation based on lumpability and
its application to PageRank. Technical report, Stanford University, 2003.
[18] Ronny Lempel and Shlomo Moran. Rank-stability and rank-similarity of link-based web ranking algorithms in
authority-connected graphs. In Second Workshop on Algorithms and Models for the Web-Graph (WAW 2003), Bu-
dapest, Hungary, May 2003.
[19] Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM, Philadelphia, 2000.
[20] Andrew Y. Ng, Alice X. Zheng, and Michael I. Jordan. Link analysis, eigenvectors and stability. In Seventeenth
International Joint Conference on Artificial Intelligence (IJCAI), 2001.
[21] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub. Extrapolation methods for
accelerating PageRank computations. In Twelfth International World Wide Web Conference, 2003.