Web Search Algorithms and PageRank
2008
Recommended Citation
Samarbakhsh, Laleh, "Web Search Algorithms and PageRank" (2008). Theses and Dissertations (Comprehensive). Paper 872.
This Thesis is brought to you for free and open access by Scholars Commons @ Laurier. It has been accepted for inclusion in Theses and Dissertations
(Comprehensive) by an authorized administrator of Scholars Commons @ Laurier.
WEB SEARCH ALGORITHMS
AND
PAGERANK
by
Laleh Samarbakhsh
THESIS
2008
positive.
ii ABSTRACT
ple for graphs which are not power law. For these graphs,
binary tree
Acknowledgements
like to thank Dr. George Lai, Dr. Zilin Wang, and Dr. De-
Abstract
Acknowledgements
Chapter 1. Introduction
1.1. Motivation
PageRank
4.1. Introduction
Appendix
Bibliography
List of Figures
row.
Introduction
1.1. Motivation
tence of the web itself. Since the birth of the web, it has
engines.
Brin and Lawrence Page [3], the founders of Google. In
1998, Brin and Page were PhD students. They took a leave
trix.
web pages follow similar power laws in the web graph. Fur-
in-degree values.
that u and v are incident to the edge uv, and that u and
of degree 0 is isolated.
such that every edge has its ends in different classes (hence,
\[ N_{k,G} = |\{x \in V(G) : \deg_G(x) = k\}|. \]
et al. [4], which sampled 200 million web pages and their
A > 0. More generally, A > B means that each $a_{ij} > b_{ij}$.
is defined by
\[
a_{ij} =
\begin{cases}
1 & \text{if } ij \in E(G), \\
0 & \text{otherwise.}
\end{cases}
\]
are non-negative.
\[ \det(A - \lambda I) = 0 \]
(See, for example, [2].) The first (that is, largest in absolute
eigenvalue.
\[ \|x\|_1 = \sum_{i=1}^{n} |x_i| \]
rx.
Ap = rp
ples of p.
we write
\[ \lim_{t \to \infty} M^t = L \]
[11].
does not remember the way it reached the state $X_{t-1}$. This
property does not imply that the state $X_t$ does not depend
probability
\[ P_{i,j} = \mathbb{P}(X_t = j \mid X_{t-1} = i) \]
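\[
P =
\begin{pmatrix}
P_{0,0} & P_{0,1} & \cdots & P_{0,j} & \cdots \\
P_{1,0} & P_{1,1} & \cdots & P_{1,j} & \cdots \\
\vdots & \vdots & & \vdots &
\end{pmatrix}
\]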
that
sT = sTP.
and then classify and retrieve the query from this database.
query engine; see [13]. See Figure 2.1 for a simplified model
of a search engine.
18 2. THE PAGERANK ALGORITHM
[Figure 2.1: simplified model of a search engine, showing the Crawler, the Indexer, the Query Processor, and the Ranking of the Results.]
The links are classified once they are entered into the
After the search is done, let us say there are 200 pages
ticle may move from its current state to any of its neigh-
\[
P_{ij} =
\begin{cases}
\frac{1}{\deg(i)} & \text{if } ij \in E(G), \\
0 & \text{otherwise.}
\end{cases}
\]
form random walk is stochastic; that is, the row sums are
$s^T$ such that
\[ s^T P = s^T. \]
\[ s_i = \frac{\deg(i)}{2|E(G)|} \]
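The degree formula for the stationary distribution can be checked on a small example. The sketch below is illustrative only: the four-node graph is a hypothetical choice, not one taken from the thesis. It uses exact rational arithmetic to confirm that the vector with entries deg(i) / (2|E(G)|) satisfies s^T P = s^T for the uniform random walk.

```python
from fractions import Fraction

# Hypothetical undirected graph (an assumption for illustration):
# a triangle 0-1-2 with a pendant node 3 attached to node 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4

# Build neighbour lists and degrees.
neighbours = {v: [] for v in range(n)}
for u, v in edges:
    neighbours[u].append(v)
    neighbours[v].append(u)
deg = [len(neighbours[v]) for v in range(n)]

# Uniform random walk: P[i][j] = 1/deg(i) if ij is an edge, 0 otherwise.
P = [[Fraction(1, deg[i]) if j in neighbours[i] else Fraction(0)
      for j in range(n)]
     for i in range(n)]

# Candidate stationary vector s_i = deg(i) / (2|E(G)|).
s = [Fraction(deg[i], 2 * len(edges)) for i in range(n)]

# Left-multiply: (s^T P)_j = sum_i s_i * P[i][j].
sP = [sum(s[i] * P[i][j] for i in range(n)) for j in range(n)]

print("s     =", s)
print("s^T P =", sP)
```

Because the entries are `Fraction` objects, the comparison between `s` and `sP` is exact rather than subject to floating-point rounding.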
2.3. THE GOOGLE MATRIX 21
rected graphs.
I 0 otherwise.
web pages that do not link to any other web pages. These
that any zero rows are replaced with the vector with each
matrix of all 1's. (We do not use the notation G for the
tions hold.
must show that the row sums in P are all equal to 1. For
\[
r_i = \sum_{1 \le j \le n} \left( \alpha (P_2)_{ij} + \frac{1-\alpha}{n} (J_{n,n})_{ij} \right)
    = \alpha \sum_{1 \le j \le n} (P_2)_{ij} + (1 - \alpha).
\]
In this case,
\[ (P_2)_{ij} = \frac{1}{n}. \]
Hence,
\[
r_i = \alpha \sum_{1 \le j \le n} \frac{1}{n} + (1 - \alpha)
    = \alpha + (1 - \alpha) = 1.
\]
In this case,
\[ (P_2)_{ij} = (P_1)_{ij}. \]
Hence,
\[
r_i = \alpha \sum_{1 \le j \le n} (P_1)_{ij} + (1 - \alpha)
    = \alpha + (1 - \alpha) = 1.
\]
tive. •
[13].
s T P = sT. •
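The construction of the Google matrix can also be checked numerically. The following Python sketch rests on stated assumptions: the four-page link matrix `P1`, the helper name `google_matrix`, and the value alpha = 0.85 are all illustrative choices. It replaces zero rows by the uniform vector, forms alpha * P2 + ((1 - alpha)/n) * J, and reports the row sums and the smallest entry.

```python
def google_matrix(P1, alpha):
    """Return alpha * P2 + ((1 - alpha) / n) * J, where P2 is P1 with every
    zero row (a dangling page) replaced by the uniform row (1/n, ..., 1/n)."""
    n = len(P1)
    P2 = [row[:] if any(row) else [1.0 / n] * n for row in P1]
    return [[alpha * P2[i][j] + (1.0 - alpha) / n for j in range(n)]
            for i in range(n)]


# Hypothetical row-normalised link matrix; page 2 (second row) is dangling.
P1 = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

P = google_matrix(P1, alpha=0.85)
row_sums = [sum(row) for row in P]
smallest_entry = min(min(row) for row in P)

print("row sums:", row_sums)
print("smallest entry:", smallest_entry)
```

With alpha strictly between 0 and 1, every entry is at least (1 - alpha)/n, which is the positivity property the power method relies on.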
Pi and let In(Pi) denote the set of web pages that point to
at (k + 1)-th step as
(2.2) PRM(P,)= £ ^ p .
(See [2].)
s T P = sT.
(2) Define
\[ z_{k+1}^T = z_k^T P = (z_0^T) P^{k+1} \]
\[ \lim_{k \to \infty} z_{k+1} \]
matrix.
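\[
P_1 =
\begin{pmatrix}
0 & 1/2 & 1/2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 \\
1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
0 & 0 & 0 & 1/2 & 0 & 1/2 \\
0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}
\]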
The P2 matrix is
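\[
\begin{pmatrix}
0 & 1/2 & 1/2 & 0 & 0 & 0 \\
1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\
1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
0 & 0 & 0 & 1/2 & 0 & 1/2 \\
0 & 0 & 0 & 1 & 0 & 0
\end{pmatrix}
\]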
\[
\begin{pmatrix}
b & a+b & a+b & b & b & b \\
b+c & b+c & b+c & b+c & b+c & b+c \\
b+d & b+d & b & b & b+d & b \\
b & b & b & b & a+b & a+b \\
b & b & b & a+b & b & a+b \\
b & b & b & a+b & b & b
\end{pmatrix}
\]
equal to 20.
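% Implementation of PageRank calculator
% using the power method
% pi0: initial row vector, H: row-normalised link matrix, n: number of pages,
% alpha: scaling parameter. epsilon is accepted but not used below.
function [pi,time,numiter] = PageRank(pi0,H,n,alpha,epsilon)
% Find the dangling pages, i.e. the zero rows of H
rowsumvector = ones(1,n)*H';
nonzerorows = find(rowsumvector);
zerorows = setdiff(1:n,nonzerorows);
l = length(zerorows);
% a is the n-by-1 indicator vector of the dangling pages
a = sparse(zerorows,ones(l,1),ones(l,1),n,1);
k = 0;
residual = 1;
pi = pi0;
tic;
% Fixed number of iterations (i = 0,...,20)
for i = 0:20
  prevpi = pi;
  k = k+1;
  pi = alpha*pi*H + (alpha*(pi*a)+1-alpha)*((1/n)*ones(1,n));
  residual = norm(pi-prevpi,1);
end
numiter = k;
time = toc;
%save pi;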
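For readers without MATLAB, the same update rule can be sketched in plain Python. This is an illustrative translation, not the thesis's code. The function name `pagerank_power`, the value alpha = 0.85, the tolerance, and the stopping rule on the 1-norm residual are assumptions added here, whereas the listing above runs a fixed number of iterations. The matrix `H` is the six-page matrix P1 from the example in this chapter.

```python
def pagerank_power(H, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method for PageRank on a row-normalised link matrix H.

    Dangling pages (zero rows of H) are handled implicitly, as in the
    update pi = alpha*pi*H + (alpha*(pi*a) + 1 - alpha) * (1/n) * ones(1, n).
    Returns the PageRank row vector and the number of iterations used.
    """
    n = len(H)
    dangling = [i for i in range(n) if not any(H[i])]
    pi = [1.0 / n] * n  # uniform starting vector
    k = 0
    for k in range(1, max_iter + 1):
        prev = pi
        dangling_mass = sum(prev[i] for i in dangling)  # pi * a
        shift = (alpha * dangling_mass + 1.0 - alpha) / n
        pi = [alpha * sum(prev[i] * H[i][j] for i in range(n)) + shift
              for j in range(n)]
        residual = sum(abs(pi[j] - prev[j]) for j in range(n))  # 1-norm
        if residual < tol:
            break
    return pi, k


# The six-page example matrix P1 (page 2 is dangling).
H = [
    [0, 1/2, 1/2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1/3, 1/3, 0, 0, 1/3, 0],
    [0, 0, 0, 0, 1/2, 1/2],
    [0, 0, 0, 1/2, 0, 1/2],
    [0, 0, 0, 1, 0, 0],
]

pi, numiter = pagerank_power(H)
print("PageRank vector:", [round(x, 4) for x in pi])
print("iterations:", numiter)
```

Stopping on the residual instead of after a fixed count is a design choice of this sketch: because the error shrinks roughly like alpha^k, the tolerance decides how many iterations are spent.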
CHAPTER 3
from [14].
linear algebra.
10% of the nodes always follows a power law with the same
rule: the tail will cover 70 percent of the value of the distri-
can be stated as
\[ \lim_{x \to \infty} \frac{\mathbb{P}(PR > x)}{p(x)} = 1. \]
of PageRank by p(x).
\[ \lim_{x \to \infty} \frac{RV(tx)}{RV(x)} = t^{\beta}. \]
\[ \lim_{x \to \infty} \frac{SV(tx)}{SV(x)} = 1 \]
3.1. REGULARLY VARYING RANDOM VARIABLES 37
\[ RV(x) = x^{\beta} SV(x), \]
\[ \lim_{x \to \infty} \frac{a(x)}{b(x)} = 1 \]
\[ 1 - F(x) \sim x^{-\beta} SV(x), \]
Stieltjes transform of X is
\[ f(s) = E[e^{-sX}], \]
38 3. PAGERANK IN POWER LAW GRAPHS
\[ \xi_n = \int_0^{\infty} x^n \, dF(x). \]
such that
n p
s—>0 •«•—' ?!
i=0
\[ f_n(s) = (-1)^{n+1} \left( f(s) - \sum_{i=0}^{n} \frac{\xi_i}{i!} (-s)^i \right) \]
precise.
and if $\xi_n < \infty$, $\beta = n + \eta$, and $\eta \in (0,1)$, then the following
are equivalent
PageRank
links.
(5) All Rj's are independent and have the same distri-
The equation (3.3) has the same form as the original PageR-
of Poisson arrivals on the time interval [0, x]. For more de-
Stieltjes transforms.
technical lemma.
variable X.
we have that
implies that
\[ f_n(t) \sim (-1)^n \, \Gamma(1 - \beta) \, t^{\beta} \, SV\!\left(\tfrac{1}{t}\right), \quad \text{as } t \to 0 \]
\[ R \overset{d}{=} \alpha \sum_{j=1}^{N(X)} \frac{1}{d} R_j + (1 - \alpha), \tag{3.4} \]
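\[
\begin{aligned}
r(s) = E[e^{-sR}]
  &= E\left[e^{-s(1-\alpha)}\right] E\left[\exp\left(-s\frac{\alpha}{d}\sum_{i=1}^{N(X)} R_i\right)\right] \\
  &= e^{-s(1-\alpha)} \sum_{k=0}^{\infty} E\left[\exp\left(-s\frac{\alpha}{d}\sum_{i=1}^{k} R_i\right)\right] \mathbb{P}(N(X) = k) \\
  &= e^{-s(1-\alpha)} \sum_{k=0}^{\infty} r^k\!\left(s\frac{\alpha}{d}\right) \mathbb{P}(N(X) = k) \\
  &= e^{-s(1-\alpha)} \, G_{N(X)}\!\left(r\!\left(s\frac{\alpha}{d}\right)\right).
\end{aligned}
\]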
\[ G_{N(X)}(s) = f(1 - s) \]
3.3. STOCHASTIC EQUATIONS 47
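\[
\begin{aligned}
G_{N(X)}(s) &= E[s^{N(X)}] \\
            &= \int_0^{\infty} E[s^{N(t)}] \, dF_X(t) \\
            &= \int_0^{\infty} e^{-t(1-s)} \, dF_X(t) \\
            &= f(1 - s). \qquad \blacksquare
\end{aligned}
\]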
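As a worked check of the identity just proved, consider one concrete case. The choice of an exponential distribution with rate $\lambda > 0$ for $X$ is an assumption made here for illustration and is not taken from the thesis. In that case $f(s) = \lambda/(\lambda + s)$, and the mixed Poisson variable $N(X)$ turns out to be geometric:

```latex
\begin{aligned}
\mathbb{P}(N(X) = k)
  &= \int_0^{\infty} e^{-t} \frac{t^k}{k!} \, \lambda e^{-\lambda t} \, dt
   = \frac{\lambda}{k!} \int_0^{\infty} t^k e^{-(1+\lambda)t} \, dt
   = \frac{\lambda}{(1+\lambda)^{k+1}}, \qquad k \ge 0, \\
G_{N(X)}(s)
  &= \sum_{k=0}^{\infty} s^k \frac{\lambda}{(1+\lambda)^{k+1}}
   = \frac{\lambda}{1+\lambda} \cdot \frac{1}{1 - \frac{s}{1+\lambda}}
   = \frac{\lambda}{\lambda + (1 - s)}
   = f(1 - s), \qquad |s| < 1 + \lambda.
\end{aligned}
```

The last expression is $f$ evaluated at $1 - s$, in agreement with the lemma.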
and the Rj's is heavily used. For example, using this in-
index $\beta$.
<
(2) 1 — FR{X) ~ d0 ^a/3dx~f3SV(x), as x —> oo. In partic-
\[ 1 - F(x) = \mathbb{P}(X > x), \]
Theorem 3.3.2 does not fit the results from their web crawls.
web page.
tributions.
CHAPTER 4
4.1. Introduction
Markov chain.
54 4. PAGERANK AND IN-DEGREE
\[ s_{k+1}^T = s_k^T P = (s_0^T) P^{k+1} \tag{4.2} \]
degree of the ith node, simply find the ith column sum of
is true for power law graphs, but not for arbitrary graphs.
are distinct.
rooted tree in which every node other than the leaves has
with a set of finite 0-1 sequences (or strings), with the root
in the binary tree, and then compare the ranking with the
4.2. BINARY TREES 57
[FIGURE: binary tree whose nodes are labelled by 0-1 strings; the third row reads 00, 01, 10, 11.]
random walk on the binary tree does not correlate with the
in-degree distribution.
$x_{j,i}$ denote the i-th node on the j-th row of the binary tree.
[FIGURE: binary tree with rows $x_{1,1}$; $x_{2,1}, x_{2,2}$; $x_{3,1}, x_{3,2}, x_{3,3}, x_{3,4}$.]
(1) For 0 < i < r, the number of nodes on the i-th row
tree $T_2(r)$, is $2^{i-1}$.
leaves are on the same level and every non-leaf node has
two children.
has two children (since the binary tree is full) and so the
on row i:
\[ {} = 2 \times 2^{i-1} = 2^{i}. \]
For item (2), the total number of nodes in a full binary tree
\[
|T_2(r)| = 2^{1-1} + 2^{2-1} + 2^{3-1} + \cdots + 2^{r-1}
         = \sum_{i=0}^{r-1} 2^i
         = 2^r - 1. \qquad \blacksquare
\]
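The two counts in this lemma can be verified mechanically. The sketch below is illustrative: the helper name `build_rows` and the labelling of nodes by consecutive integers are assumptions. It grows the full binary tree row by row, giving every node of the last row exactly two children, and the counts can then be compared with 2^(i-1) and 2^r - 1.

```python
def build_rows(r):
    """Grow the full binary tree T_2(r) row by row.

    Nodes are labelled 1, 2, 3, ... from top to bottom and left to right.
    Returns a list of rows, each row being the list of labels on it.
    """
    rows = [[1]]                      # row 1 holds only the root
    next_label = 2
    for _ in range(r - 1):            # add rows 2, ..., r
        new_row = []
        for _parent in rows[-1]:      # every node on the last row gets two children
            new_row.extend([next_label, next_label + 1])
            next_label += 2
        rows.append(new_row)
    return rows


for r in range(1, 8):
    rows = build_rows(r)
    row_sizes = [len(row) for row in rows]
    total = sum(row_sizes)
    print(r, row_sizes, total)
```

The function never uses the closed formulas, so agreement with 2^(i-1) per row and 2^r - 1 in total is a genuine check and not a restatement.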
FIGURE 4.3. An arbitrary binary tree
structure.
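\[
P_1 =
\begin{pmatrix}
0 & 0 & 0 & \cdots \\
1 & 0 & 0 & \cdots \\
1 & 0 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\]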
second row, every two consecutive rows are equal. The next
the 1's reach the $(2^{r-1})$-th column of the matrix. Since the
leaves of the tree have zero in-degree, the matrix will have
root or the node $x_{1,1}$ is the unique dangling node, where the
therefore,
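\[
M = P_2 =
\begin{pmatrix}
\frac{1}{n} & \frac{1}{n} & \frac{1}{n} & \cdots \\
1 & 0 & 0 & \cdots \\
1 & 0 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\]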
M
6 — J n . l ~~
w
We now state the main results of this section
tree $T_2(r)$. For all $k \ge 1$, $1 \le i < j \le 2^{p-1}$ and $1 \le p \le r - 1$, we have that
where $n = 2^r - 1$.
defer the proofs of Theorem 4.3.1 and 4.3.2 until the fol-
(1) For any two nodes on the same row of T2(r), the
corresponding entries in s are equal.
(p + 1)-st row of s.
\[ s^T = \lim_{t \to \infty} e^T H^t. \tag{4.3} \]
[sr],- = [lime^H'b
t—>oo
= nm[eTHt]i
t—*oo
= l-limlH*],-.
[H]p,i = [H ]p,j.
were arbitrary, any two nodes on the same row have equal
[s T ], = [limeTHt]i
t—>oo
= lim^H*],
fact that the jth node is located on the next row right after,
we can write for any fixed $t > 0$ and for $1 \le i, j \le 2^{p-1}$,
value for the node $x_{p+1,j}$: any node on the (p + 1)-th row.
law, since the leaves are the most abundant nodes. How-
ure 4.4.
4.3. CALCULATING THE STATIONARY DISTRIBUTION 67
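\[
P_2 =
\begin{pmatrix}
\frac{1}{n} & \frac{1}{n} & \frac{1}{n} & \cdots \\
1 & 0 & 0 & \cdots \\
1 & 0 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
\]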
Assuming the teleportation factor a to be equal to 0.85
as
( 1 2 2 3 3 3 3 4 4 4 4 4 4 4 4),
which implies that the root is the highest ranked page (as
rank.
of the tree):
\[
ID =
\begin{pmatrix}
2 \\ 2 \\ \vdots \\ 0
\end{pmatrix}
\]
Thus, the in-degree ranks the pages as
( 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2),
implying that the first seven pages in the graph (that is,
all the non-leaf nodes) have equal ranking. All the leaves
[Figure: PageRank and in-degree values plotted for each node of the binary tree; horizontal axis "Nodes in Binary Tree", from 0 to 15.]
is a step function.
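The comparison between the two rankings on the 15-node tree can be reproduced with a short Python sketch. It rests on stated assumptions: every node links to its parent (so the root is the only dangling node), nodes are numbered top to bottom and left to right, alpha = 0.85 as in the text, and the helper names and the fixed iteration count are illustrative choices.

```python
def binary_tree_link_matrix(r):
    """Link matrix of the full binary tree T_2(r): each non-root node has a
    single out-link to its parent. With labels 1..n numbered top to bottom
    and left to right, the parent of node c is c // 2."""
    n = 2 ** r - 1
    H = [[0.0] * n for _ in range(n)]
    for child in range(2, n + 1):
        H[child - 1][child // 2 - 1] = 1.0
    return H


def pagerank(H, alpha=0.85, iterations=200):
    """Power iteration; zero rows of H are treated as uniform rows."""
    n = len(H)
    dangling = [i for i in range(n) if not any(H[i])]
    pi = [1.0 / n] * n
    for _ in range(iterations):
        shift = (alpha * sum(pi[i] for i in dangling) + 1.0 - alpha) / n
        pi = [alpha * sum(pi[i] * H[i][j] for i in range(n)) + shift
              for j in range(n)]
    return pi


def dense_ranks(values):
    """Rank 1 for the largest value, equal values sharing a rank."""
    keys = [round(v, 12) for v in values]
    distinct = sorted(set(keys), reverse=True)
    return [distinct.index(k) + 1 for k in keys]


H = binary_tree_link_matrix(4)          # 15 nodes
n = len(H)
pr = pagerank(H)
in_degree = [sum(1 for i in range(n) if H[i][j] > 0) for j in range(n)]

pr_ranks = dense_ranks(pr)
id_ranks = dense_ranks(in_degree)
print("PageRank ranks: ", pr_ranks)
print("in-degree ranks:", id_ranks)
```

PageRank separates the four rows of the tree, while in-degree only separates non-leaf nodes from leaves, which is the contrast the plot illustrates.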
[Figure: PageRank and in-degree; horizontal axis "in-degree value", from 0 to 0.05.]
Figure 4.6.
4.4. PROOFS OF MAIN RESULTS 71
us say the first row starts with the node $x_{r,1}$ and the second
row starts with the node $x_{r+1,1}$. Note that we already know
counting from top to bottom and left to right, the node $x_{i,j}$
is the j-th node on the i-th row. So, counting the nodes,
adjacency matrices.
[Figure: two consecutive rows of the tree, with nodes $x_{r,i}$ ($1 \le i \le 2^{r-1}$) and $x_{r+1,i}$ ($1 \le i \le 2^{r}$).]
\[
\alpha =
\begin{pmatrix}
\alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n
\end{pmatrix}
\]
we have that
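\[
A \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{pmatrix}
= [A_1 | A_2 | \cdots | A_n]
  \begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{pmatrix}
= \alpha_1 A_1 + \alpha_2 A_2 + \cdots + \alpha_n A_n,
\]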
\[ [H^{k+1}]_{r,i} = [H^k \cdot H]_{r,i} \]
Similarly,
\[ [H^{k+1}]_{r,j} = [H^k \cdot H]_{r,j} \]
x A
r+1,2i-1 r+1,2j-1
the same row (which is the case, since $1 \le i < j \le 2^{r-1}$
Hence,
\[ [H^{k+1}]_{r,i} = [H^{k+1}]_{r,j}. \]
The final step of the induction is carried out and hence, for
all $k \ge 1$,
\[ [H^k]_{r,i} = [H^k]_{r,j}. \qquad \square \]
the above sums equal, but also the value for each element
\[ [H^k]_{p+1,2i-1} = [H^k]_{p+1,2i}. \]
Hence,
for every row we move away from the root. This behaviour
binary trees.
degree. This follows from the fact that the degree of a node
in [2].)
power laws.
82 5. CONCLUSION AND FUTURE WORK
end;
fclose(fid);
86 APPENDIX
ID = sum(X);
ID = (ID - mean(ID))/max(abs((ID-mean(ID))));
ID = (ID + 1)/2;
save ID;
end
% Uniform row vector with every entry equal to 1/n
Inl = zeros(1,n);
for i=1:n
Inl(i)=1/n;
end%for
%return Inl;
save Inl;
function J = RandomSample(n)
% Random n-by-n 0-1 matrix with a zero diagonal (no self-links)
X = randint(n,n);
for j=1:n
for i=1:n
if i==j
X(i,j)=0;
end%if
end%for
end%for
% Divide each row by its row sum
y = sum(X,2);
for j=1:n
for i=1:n
J(i,j)=X(i,j)/y(i);
end%for
end%for
save J;
hold on;
subplot(2,1,1); hist(PR_15_1);
subplot(2,1,2); hist(ID_15_2);
Bibliography
309-320.
2005.
(2004) 38 239-243.