0% found this document useful (0 votes)

64 views64 pages

Accelerating Ranking System Using Web Graph

This project aims to accelerate the ranking system of the Needle search engine using a webgraph representation. It implements Google's PageRank and a Cluster Rank algorithm, which assigns importance scores to web pages based on their inbound links. The project adds database tables to store ranking scores and represents the webgraph to improve time performance. Experimental results on datasets of varying sizes show that the webgraph representation provides significant time gains over the original implementation for ranking algorithms.

Uploaded by

Complete name

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOC, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

64 views64 pages

Accelerating Ranking System Using Web Graph

Uploaded by

Complete name

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOC, PDF, TXT or read online on Scribd

You are on page 1/ 64

ACCELERATING RANKING

SYSTEM USING

WEBGRAPH
By

Random

A project submitted to the graduate faculty of

The University of Colorado at Colorado Springs in partial

Fulfillment of the Master of Science degree

DEPARTMENT OF COMPUTER SCIENCE

2007
2

This project for Master’s of Science degree by Padmaja Adipudi has

been approved for the Department of Computer Science

Dr. J. Kalita (Advisor)

Dr. E. Chow (Committee Member)

Dr. T. Chamillard (Committee Member)

Date
3

ACKNOWLEDGEMENTS

Many people have shared their time and expertise to help me accomplish this project.
First I would like to sincerely thank my advisor, Dr. Jugal K. Kalita for his guidance and
help. And also many thanks to Dr. T. Chamillard and Dr. C. Edward Chow for their
supports.

I wish to pay special tributes to the fellow engineers Srinivas Guntupalli, Sunil Bhave
and Shawn Stoffer who provided constructive suggestions. I would like to thank Sonali
Patankar who provided a large set of sample data.

Finally, I need to acknowledge that all the friends in the research team are of great help.
Thank you!
4

Table of Contents

1 Abstract...................................................................................................................................7

2 Introduction............................................................................................................................8

2.1 Cluster Rank Algorithm........................................................................................8

2.2 Problem Statement................................................................................................9
2.3 Summary of Work.................................................................................................9
2.4 Related Work..........................................................................................................9

3 Implementation....................................................................................................................12

3.1 Workflow..............................................................................................................13
3.2 Google’s PageRank..............................................................................................13
3.3 Cluster Rank........................................................................................................14
3.4 Source Rank.........................................................................................................15
3.5 Truncated PageRank...........................................................................................16
3.6 Software & Packages Used..................................................................................17
3.7 Database Implementation...................................................................................18
3.8 Database Tables Added........................................................................................18
3.8.1 Table Source.........................................................................................................18
3.8.2 Table PageRankBySource...................................................................................19
3.8.3 Table PageRankTruncated..................................................................................20
3.8.4 View vwURLLinkSource.....................................................................................20
3.9 Database Changed Made....................................................................................20
3.10 Database Relationships.......................................................................................22
3.11 Module Implementation......................................................................................23
3.11.1 Original Implementation.....................................................................................23
3.11.2 Current Implementation.....................................................................................23
3.11.3 Current Implementation – Module Details.......................................................24

4 Experimental Results...........................................................................................................28
5
4.1 Experimental Data Setup....................................................................................28
4.2 Challenges & Key Observations.........................................................................29
4.3 Future Upgrade of The Search Engine..............................................................31
4.4 Time Comparisons For Cluster Rank Before & After Using WebGraph.......32
4.5 Time Measure Between Algorithms Using WebGraph....................................35
4.5.1 For 300,000 URLs................................................................................................35
4.5.2 For 600,000 URLs................................................................................................36
4.5.3 For 4 Million URLs..............................................................................................37
4.5.4 Graph Representation For Time Measure (in Sec)...........................................38
4.5.5 Node In-Link Distribution across Nodes...........................................................42
4.5.6 Cluster In-Link Distribution across Clusters....................................................45
4.5.7 Source In-Link Distribution across Sources......................................................48
4.5.8 Time Gain Analysis Between Algorithms..........................................................51
4.6 Quality Measure Between Algorithms...............................................................51
4.6.1 Survey Results......................................................................................................52
4.7 Conclusion of Experimental Results..................................................................52

5 References.............................................................................................................................54

6 Appendix A – Survey Questions to Evaluate Algorithm Quality.....................................56

6.1 Search Keywords.................................................................................................56

6.2 Survey Questions.................................................................................................56

7 Appendix B – Software and hardware Environment.......................................................57

8 Appendix C – Database Scripts..........................................................................................58

8.1 Database Table/View Setup.................................................................................58

8.2 Database Table/View Cleanup............................................................................62

9 Appendix D – Software Setup.............................................................................................63

10 Appendix E – Using the Search Engine................................................................................65

1 Abstract

Search Engine is a tool to find the information on any topic in the Web. The basic
components of a Search Engine are Web Crawler, Parser, Page-Rank System, Repository
and a Front-End. In a nut shell here is how the Search Engine operates. The Web Crawler
fetches the web pages from Web, the Parser takes all downloaded raw results, analyzes
and eventually tries to make sense out of them. Finally the Page-Rank system finds the
importance of pages, and the Search Engine lists the results in the order of relevance and
importance.

In short, a Page-Rank is a “vote”, by all the other pages on the Web, about how important
a page is. Studying the Web graph, which is used in Page-Rank System, is often difficult
due to their large size. In Web graph the Web pages are represented as nodes and the
hyperlinks between the Web pages are represented as directed links from one node to
other node. Different kinds of algorithms were proposed because of the large Web graph
to get efficient Page-Rank Systems.

The Needle is a Search Engine at UCCS for educational domains developed by a group
of previous students at UCCS under the guidance of Dr. J. Kalita. The goal for this
project is to accelerate the Page-Rank System of Needle Search Engine, at the same time
upgrading the Search Engine with 1 Million URLs. The acceleration of the Page-Rank
System will be accomplished by applying a package called “WebGraph” [1] with
compression techniques to represent the Web graph compactly, and also compare the
ranking efficiency, by using two recently published ranking algorithms called Truncated
PageRank [7] and Source Rank [10]. Finally deploy the best to upgrade the Needle
Search Engine with 1 Million pages.
7

2 Introduction

Search Engine technology was born almost at the same time as the World Wide Web [9].
The Web is potentially a terrific place to get information on almost any topic. Doing
research without leaving your desk sounds like a great idea, but all too often you end up
wasting precious time chasing down useless URLs if the search engine is not designed
properly.

The dramatic growth of the World Wide Web is forcing modern search engines to be
more efficient and research is being done to improve the existing technology. The design
of the Search Engine is a tedious process because of the dynamic nature and sheer
volume of the data.

Page-Rank system is a component of Search Engine to find the importance of a Web page
relevant to search topic. PageRank [6] is a system of scoring nodes in a directed graph
based on the stationary distribution of a random walk on the directed graph. A graduate
student Yi Zhang implemented Cluster Rank algorithm [4], which is based on the famous
Google’s PageRank algorithm [6]. In Google’s PageRank the importance of a page is
based on the importance of parent web pages.

2.1 Cluster Rank Algorithm

 Group all pages in to clusters.

o Perform first level clustering for dynamically generated pages

o Perform second level clustering on virtual directory and graph density

 Calculate the rank for each cluster with the original PageRank [6] algorithm

 Distribute the rank number to its members by weighted average.

8
The Cluster Rank [4] is designed and implemented to achieve similar goals as that of
existing PageRank [6] while providing the similar performance and providing an
additional feature for managing similar pages in search results.

2.2 Problem Statement

The existing Page-Rank System of the Needle takes long update times. It took around
2hours to calculate the Page Rank for 300,000 URL and it will take months to update the
system with the World Wide Web because of sheer volume of data. A group of people
from Italy developed a package called “WebGraph” [1] to represent the Web graph
compactly, which resolves the long update times for the World Wide Web.

2.3 Summary of Work

The Page-Rank System of the Needle Search Engine is designed and implemented using
Cluster Rank [4] algorithm, which is similar to famous Google’s PageRank [4] algorithm.
Google’s PageRank [4] algorithm is based on the link structure of the graph. A
“WebGraph” [1] package is used to represent the graph in most efficient manner, which
helps in accelerating the ranking procedure of the World Wide Web. Two latest Page-
Rank algorithms called Source Rank [10], Truncated PageRank [7] are taken to compare
the existing ranking system, which is Cluster Rank [4], and deploy the best in the Needle
Search Engine. Two attributes are taken in to consideration for selecting the best
algorithm. The first one is the time and second one is human evaluation for the quality of
the search. A survey is conducted with the help of the research team on finding the best
algorithm on different search topics.

2.4 Related Work

The existing Page-Rank system of the Needle Search Engine takes long update time as
the number of URLs increases. Research was done on the published ranking system
papers, and below are the details of those papers.

There is a paper called “Efficient Computation of page-rank” written by Taher

H.Haveliwala [3]. This paper discusses efficient techniques for computing Page-Rank, a
9
ranking metric for hypertext documents and showed that the Page-Rank can be computed
for very large sub graphs of the Web on machines with limited main memory. They
discussed several methods for analyzing the convergence of Page-Rank based on the
induced ordering of pages.

The main advantage of the Google’s PageRank [6] measure is that it is independent of the
query posed by user, this means that it can be pre computed and then used to optimize the
layout of the inverted index structure accordingly. However, computing the Page-Rank
requires implementing an iterative process on a massive graph corresponding to billions
of Web pages and hyperlinks. There is a paper written by Yen-Yu Chen and Qingqing gan
[2] on Page-Rank calculation by using efficient techniques to perform iterative
computation. They derived two algorithms for Page-Rank and compared those with two
existing algorithms proposed by Havveliwala [3], and the results were impressive.

In this paper [6], the authors namely Lawrence Page, Sergey Brin, Motwani and Terry
Winograd took advantage of the link structure of the Web to produce a global
“importance” ranking of every Web page. This ranking, called PageRank [6], helps
search engines and users quickly make sense of the vast heterogeneity of the World Wide
Web.

This paper introduces a family of link-based ranking algorithms that propagate page
importance through links [7]. In these algorithms there is a damping function that
decreases with distance, so a direct link implies more endorsement than a link through a
long path. PageRank [6] is the most widely known ranking function of this family. The
main objective of the paper is to determine whether this family of ranking techniques has
some interest per se, and how different choices for the damping function impact on rank
quality and on convergence speed. The Page Rank is computed similar to Google’s
PageRank [6] , except that the supporters that are too close to a target node, do not
contribute to wards it ranking. Spammers can afford spam up to few levels only. Using
this technique, a group of pages that are linked together with the sole purpose of
obtaining an undeservedly high score can be detected. The authors of this paper apply
10
only link-based methods that are they study the topology of the Web graph with out
looking at the content of the web pages.

In this paper [10], they develop a spam-resilient Page-Rank system that promotes a
source-based view of the Web. One of the most salient features of the spam-resilient
ranking algorithm is the concept of influence throttling. Through formal analysis and
experimental evaluation, they show the effectiveness and robustness of our spam-resilient
ranking model in comparison with Google’s PageRank [6] algorithm.

The need to run different kinds of algorithms over large Web graph motivates the
research for compressed graph representations that permit accessing without
decompressing them [1]. At this point there exists a few such compression proposals,
some of them are very efficient in practice.

Studying the Web graph is often difficult due to their large size [1]. It currently contains
some 3 billion nodes, and more than 50 billion arcs. Recently, several proposals have
been published about various techniques that allow storing a Web graph in memory in a
limited space, exploiting the inner redundancies of the Web. The WebGraph [1]
framework is a suit of codes, algorithms and tools that aims at making it easy to
manipulate large Web graphs. The WebGraph can compress the WebBase graph [12],
(118 Mnodes, 1Glinks) in as little as 3.08 bits per link, and its transposed version in as
little as 2.89 bits per link. It consists of a set of flat codes suitable for storing Web graphs
(or, in general, integers with power-law distribution in a certain exponent range),
compression algorithms that provide a high compression ratio, algorithms for accessing a
compressed graph without actually decompressing it (decompression is delayed until it is
actually necessary, and documentation and data sets.
11

3 Implementation

A package called “WebGraph” [1] is used to represent the graph compactly. This package
is developed in Java. The existing Page-Rank system is developed using Perl. A Perl
library called “Inline-Java” is used to call the java modules of WebGraph [1] package to
reuse the existing Perl code of the Cluster Rank [1] algorithm. Here is simple work flow
diagram.

Listed below is the snippet of code that shows how to call Java from a Perl module:

Use Inline java => <<’DATA’;

/ JAVA Code Begins /

import it.unimi.dsi.webgraph.ImmutableGraph;
…
public class MyGraph extends ImmutableGraph{
…

ImmutableGraph graph;
public MyGraph(String basename) throws IOException{
graph = ImmutableGraph.load( basename );
}
…
public int getSuccCount( int n ) throws IOException ...{
…
}
}
/** JAVA Code Ends **/

DATA

## Perl Code Begins ##

my $vargraph, $varnode, varcount;
$vargraph = new MyGraph("bvnodein");
$id = 2;
$varcount = $vargraph->getSuccCount($id);

The Page-Rank system gets the information stored by Crawler. WebGraph [1] package
generates the compressed graph by taking a graph that is in ASCII graph format. In
ASCII graph the first line contains the number of nodes ‘n’, then ‘n’ lines follow, the i-th
12
line containing the successors of node ‘i’ in increasing order (nodes are numbered from 0
to n-1). Successors are separated by a single space. This compressed graph is given as
input to the Page-Rank system for calculation.

3.1 Workflow

3.2 Google’s PageRank

The three algorithms Cluster Rank [4], Source Rank [10], Truncated PageRank [7] are
based on the famous Google’s PageRank.

The published Page Rank algorithm can be described in a very simple manner:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

13
In the equation above:

PR(Tn): Each page has a notion of its own self-importance. That’s “PR(T1)” for the first
page in the web all the way up to PR(Tn) for the last page.

C(Tn): Each page spreads its vote out evenly amongst all of its outgoing links. The
count, or number, of outgoing links for page 1 is C(T1), C(Tn) for page n, and so on for
all pages.

PR(Tn)/C(Tn): if a page (page A) has a back link from page N, the share of the vote page
A gets is PR(Tn)/C(Tn).

d: All these fractions of votes are added together but, to stop the other pages having too
much influence, this total vote is "damped down" by multiplying it by 0.85 (the factor d).

The definition of d also came from an intuitive basis in random walks on graphs. The idea
is that a random surfer keeps clicking on successive links at random, but the surfer
periodically “gets bored” and jumps to a random page. The probability that the surfer gets
bored is the dampening factor.

(1 - d): The (1 – d) bit at the beginning is a probability math magic so the "sum of all Web
pages" Page Rank is 1, achieved by adding the part lost by the d(....) calculation. It also
means that if a page has no links to it, it still gets a small PR of 0.15 (i.e. 1 – 0.85).

3.3 Cluster Rank

The original PageRank algorithm is applied on Clusters and then the rank is distributed to
the members by weighted average.

1. Group all pages into clusters.

Perform first level clustering for dynamically generated pages

Perform second level clustering on virtual directory and graph density

2. Calculate the rank for each cluster with the original PageRank algorithm.
14
3. Distribute the rank number to its members by weighted average by using

PR = CR * Pi/Ci.

The notations here are:

PR: The rank of a member page

CR: The cluster rank from previous stage

Pi: The incoming links of this page

Ci: Total incoming links of this cluster.

3.4 Source Rank

The original PageRank algorithm is applied on Sources and then the rank is distributed
directly to the members.

1. Group all pages into Sources based on “Domain”.

2. Calculate the rank for each Source with the original PageRank algorithm.

3. Distribute the rank number to its members by weighted average by using

PR = SR * Si.

The notations here are:

PR: The rank of a member page

SR: The source rank from previous stage

Si: Total incoming unique links of this source

3.5 Truncated PageRank

Truncated Page rank is link based ranking function that decreases the importance of
neighbors that are topologically close to the target node. A damping function is
introduced to remove the direct contribution of the first level of the linking.

They suggest generalization to the PageRank equation to:

The rank propagates through links.

We can calculate the Page Rank of a page by summing up contributions from different
distances.

PR(p) = å at · Mt = å damping(t) · Mt

The notations here are:

C: Normalization constant

a : The damping factor

3.6 Software & Packages Used

WebGraph:

WebGraph is a java package used to represent the graph compactly.

Java:

Java is used as the programming language to use the WebGraph package.

jdbc::mysql:

To update the MySQL database tables with the Page Rank information in the Truncated
PageRank module written in Java.

Inline-java:

This is a Perl library to access the java modules from Perl.

Perl:

Perl (version 5.8.8) is used as the programming language. A fast interpreter, its features to
handle and manipulate strings and relatively small memory signatures of its modules
make it an ideal language for this project.

MySQL:

The database is designed and implemented using MySQL v3.23.58 and v4.1.1. MySQL is
free, scalable and Perl has a rich API for MySQL.

PhpMyAdmin:

It is a good and intuitive front-end user interface for the MySQL database. Many features
are provided to create, manipulate and manage databases and users in the MySQL
environment. One can also see and adjust MySQL environment variables.

Apache:
17
Apache server v2.0.54 is used for the machines to communicate using the CGI module.

Software used for documentation:

Microsoft Excel was used for the diagrams and Microsoft Word was used to write the
project report.

3.7 Database Implementation

The original Needle [4] database consists of 15 tables and 2 views. Changes were made
to this Needle database schema in order to accommodate the Source Rank [10] and
Truncated PageRank [7] calculations.

Three new tables namely Source, PageRankBySource, PageRankTruncated and view

named vwURLLinkSource are added. An explanation of each table and changes made to
the database schema is shown below.

3.8 Database Tables Added

3.8.1 Table Source

This table was added to store the source ids, which are required to compute the weighted

Page Rank for the URLs using the Source Rank algorithm. Listed below are the table

columns and their purpose.

o source_id: To compute the Page Rank of URLs using the Source Rank algorithm,

the URLs are grouped in to individual sources based on domain name of the URL.

This column uniquely identifies that source.

o base_url: The base URL of the source

o source_rank: The rank given to the source which will be used later during the

source rank computation.

18
o source_rank_date: The date on which the source was computed

o out_link_count: The number of out links of the source

o in_link_count: The number of in links of the source

o cal_current_iter: The iteration number used to converge the source rank

o old_date1: The previous date 1 on which the Page Rank was computed

o old_prc1: The previous weighted Page Rank 1 of the URL

o old_date1: The previous date 2 on which the Page Rank was computed

o old_prc1: The previous weighted Page Rank 2 of the URL

3.8.2 Table PageRankBySource

This table was added to store the weighted Page Rank for the URLs computed using the
Source Rank algorithm. Listed below are the table columns and the purpose.

o url_id: The URL id for which the weighted Page Rank is given. This is the
primary key of the table.

o c_date: The date on which the weighted Page Rank was computed

o c_prc: The weighted Page Rank

o old_date1: The previous date 1 on which the Page Rank was computed

o old_prc1: The previous weighted Page Rank 1 of the URL

o old_date1: The previous date 2 on which the Page Rank was computed

o old_prc1: The previous weighted Page Rank 2 of the URL

19
3.8.3 Table PageRankTruncated

This table was added to store the weighted Page Rank for the URLs computed using the
Truncated PageRank algorithm. Listed below are the table columns and their purpose.

o url_id: The URL id for which the weighted Page Rank is given. This is the
primary key of the table.

o c_date: The date on which the weighted Page Rank was computed

o c_prc: The weighted Page Rank

o old_date1: The previous date 1 on which the Page Rank was computed

o old_prc1: The previous weighted Page Rank 1 of the URL

o old_date1: The previous date 2 on which the Page Rank was computed

o old_prc1: The previous weighted Page Rank 2 of the URL

3.8.4 View vwURLLinkSource

This view uses provides the information from URL and URLLinkStructure to obtain the
out-links of each source. The view also makes sure that the url_id exists in both URL and
URLLinkStructure tables and is not NULL. Listed below are the table columns and their
purpose.

o fromsource: The source_id of URL table that acts as the from source id

o tosource: The source_id of URL table that acts as the to source id (out-link)

3.9 Database Changed Made

It has been noticed that there is scope to improve the performance of the SQL query
execution in the existing Page-Rank System implementation. Changes were also made to
the existing database schema in order to accommodate the Page-Rank systems using the
20
new algorithms namely Source Rank and Truncated Page Rank. Listed below are the
details of the changes made.

o Index has been created for base_url of Cluster table to improve performance of
the SQL query execution in the Perl modules.

o The PageRank table was populated using the PageRank module during the
original Cluster Rank implementation, and used later in order to obtain the out-
links of a specific URL. We replaced the need of this table by creating Web graph
for the node out-links.

o Index has been created to sec_base_url of SecondLvlClusterWork table to

improve the performance of SQL query execution in Perl modules.

o The column source_id has been added to URL table and index was created for
source_id to accommodate the Source Rank computation.
21

3.10 Database Relationships

3.11 Module Implementation

3.11.1 Original Implementation

The original Needle Search Engine was implemented using the Perl programming
language and MySQL as the backend database. After crawling the web pages, the web
page URLs were stored in a MySQL database table called URL. The link structure
between the URLs was stored in a table called URLLinkStructure.

This implementation generates the Page Rank for the URLs in a reasonable amount of
time, as long as the URL linking structure (URL graph) is of small size. As the number of
URLs grow with bigger URL linking structure, this ranking system becomes less
efficient, in other words, this ranking system will take long update times and requires
more machine resources such as memory and CPU to compute the Page Rank.

3.11.2 Current Implementation

In order to accelerate the original implementation of the ranking system, a package called
WebGraph [1] is used in the current implementation. The WebGraph [1] package was
developed by a group of people from Italy using Java programming language. By using
this package, the URL linking structure, also called as ‘web graph’ can be represented
compactly using compression techniques. The package provides several Java methods to
access the compressed format of the web graph.

To make use of the WebGraph [1] package in the current implementation, a Perl library
called Inline::Java is used. Using this library, the Perl programs can call the Java methods
of the WebGraph [1] package.

To improve the SQL query performance in the original implementation, indexes were
added on as needed basis to the backend database tables.

Two other recent ranking algorithms namely SourceRank [10] and the Truncated
PageRank [7] were implemented to compare against the original Cluster Rank algorithm
[4], in terms of the efficiency and quality of the ranking system. In order to accommodate
23
these two algorithms in the current implementation, the original database schema has
been changed as needed. The three algorithms will be compared for efficiency, using the
metrics generated against URLs of sizes 300K, 600K and 4 Millions. The quality will be
measured by conducting a survey among a group of people who will perform the
keyword queries on three different search engines that implement these algorithms.

3.11.3 Current Implementation – Module Details

The implementation has 3 phases namely graph generation, rank generation and search
engine. The modules listed below explain these in detail.

3.11.3.1 Graph Generation Modules

To make use of the WebGraph package, the URL link structure is represented in ASCII
format, in the form of a file named basename.graph-txt. The first line contains the number
of nodes ‘n’, then ‘n’ lines follow the i-th line containing the successors of the node ‘i’ in
the increasing order (nodes are numbered from 0 to n-1). The successors are separated by
a single space.

The ASCII formatted URL link structure will then be converted in to the compressed
format called BVGraph format (Boldi-Vigna Graph format, in the name of WebGraph
package authors Paolo Boldi & Sebastiano Vigna). The compressed BVGraph is
described by a graph file (with extension .graph), an offset file (with extension
.offsets) and a property file (with extension .properties) [13].

The BVGraph can be generated from an ASCII formatted Graph using the command
listed below:

java it.unimi.dsi.webgraph.BVGraph -g ASCIIGraph example bvexample

where example is the basename of the ASCII formatted graph (example.graph-txt) and
the bvexample if the name for the resulting BVGraph (bvexample.graph,
bvexample.offsets, bvexample.properties).
24
In the current implementation, we represent the URL link structure (graph) in two
different ASCII format files named nodein.graph-txt and nodeout.graph-txt. The
nodein.graph-txt represents each node and it’s in-links. The nodeout.graph-txt represents
the node and it’s out-links.

The ASCII formatted URL link structure will then be converted in to the compressed
BVGraph format. The BVGraph will be represented as bvnodein.graph, bvnodein.offsets
and bvnodein.properties for the ASCII graph nodein.graph-txt. The BVGraph will be
represented as bvnodeout.graph, bvnodeout.offsets, and bvnodeout.properties for the
ASCII graph nodein.graph-txt.

a) nodegraphin.pl: This Perl module will generate the ASCII formatted

nodein.graph-txt by executing SQL queries against URLLinkStructure table. Once
the nodein.graph-txt is available, the same module will generate the compressed
BVGraph files named bvnodein.graph, bvnodein.offsets and bvnodein.properties.

b) nodegraphout.pl: This Perl module will generate the ASCII formatted

nodeout.graph-txt by executing SQL queries against URLLinkStructure table.
Once the nodeout.graph-txt is available, the same module will generate the
compressed BVGraph files named bvnodeout.graph, bvnodeout.offsets and
bvnodeout.properties..

The Source Rank algorithm groups up a set of nodes in to sources. During the Source
Rank calculation, the algorithm will require to find the in-links for each source. In order
to accelerate the Source Ranking system, the Source linking structure is represented in
BVGraph format.

a) sourcegraphin.pl: This Perl module will generate the ASCII formatted

sourcein.graph-txt by executing SQL queries against vwURLLinkSource view.
Once the sourcein.graph-txt is available, the same module will generate the
compressed BVGraph files named bvsourcein.graph, bvsourcein.offsets and
bvsourcein.properties.
25
3.11.3.2 Rank Generation Modules

 ClusterRank: The Cluster Ranking system uses two Perl modules namely
clustering.pl and clusterrank.pl as described below.

a) clustering.pl: In this module, two phases occur namely the first level clustering
and the second level clustering. The first level clustering selects each url id from
the URL table, based on the content of the URL, finds out if it belongs to an
existing cluster. If it belongs to an existing cluster in Cluster table, updates the
cluster id in URL table. If it does not, creates a new cluster in Cluster table,
updates the cluster id in URL table. The second level clustering will calculate the
density for each cluster in Cluster table, approves it depending upon the density
threshold value.

b) clusterrank.pl: In this module, we generate the cluster rank for each cluster based
on Google’s PageRank algorithm. The Page Rank for each URL contained in the
Cluster will then be calculated and stored in a table named PageRankByCluster.

 SourceRank: The Source Ranking system uses two Perl modules namely
sourcing.pl and sourcerank.pl as described below.

a) sourcing.pl: In this module, the sourcing selects each url id from

the URL table, based on the domain of the URL, finds out if it belongs to an
existing source. If it belongs to an existing source in Source table, updates the
source id in URL table. If it does not, creates a new source in Source table,
updates the source id in URL table.

a) sourcerank.pl: In this module, we generate the source rank for each source based
on Google’s PageRank algorithm. The Page Rank for each URL contained in the
Source will then be calculated and stored in a table named PageRankBySource.

 Truncated PageRank: The Truncated PageRank Ranking system uses a

Java module namely TruncatedPageRank.java as described below.
26
a) TruncatedPageRank.java: In this module, the truncation eliminates the neighbors
of the URL based on the predefined truncation factor. The rank for the remaining
URLs will be calculated and stored in a table named PageRankTruncated.

3.11.3.3 Search Engine Modules

 Search Engine: The Search Engine gets the query words from users, it
then selects the URLs, the keyword total_weight from Keyword table and the
Page-Rank from the Page Rank table associated to the Algorithm used in the
Search Engine. For example if Cluster rank algorithm is used the Page Rank table
will be PageRankByCluster. The keyword weight is multiplied by the Page Rank
value for each URL. The URLs then be displayed to the user in the descending
sort order by the multiplied value.

a) search.pl: This module presents the Search Engine UI to the end user. Three
different forms, one for each algorithm namely Cluster Rank, Source Rank
and Truncated Page-Rank will be listed. The User will be able to perform the
search using any of the three implemented algorithms using this UI.

b) search_list.pl: This module gets the query keywords and displays search
results using the PageRankByCluster table

c) search_source.pl: This module gets the query keywords and displays search
results using the PageRankBySource table

d) search_truncated.pl: This module gets the query keywords and displays

search results using the PageRankTruncated table

e) search_sublist.pl: This module will group the similar pages during the display
of search results.
27

4 Experimental Results

4.1 Experimental Data Setup

The experiments were conducted on 3 different data sets with URLs of 300K, 600K and 4
Million. Importing the data from the Crawler was tedious and time consuming because of
the large size of the data. The Crawler, Ranking system, Parser modules uses different
database schemas. To setup the data for the current ranking system for the above listed
data sets, the following steps were taken:

o Identified the URL and URLLinkStructure tables generated by the Crawler

module with sufficient number of URLs such as 600K and 4 Million. The
database with these tables was on a server machine (128.198.144.19) that is
different than the server machine (128.198.144.16) where the current project has
been implemented.

o Created/Copied necessary tables needed by the search engine from the search
engine server machine to this database. These tables are namely Crawler,
Dictionary, KeyWord, KeyWordWork, TextParser and PageLocation. Executed
the Parser in order to populate these new tables using

 The URL and URLLinkStructure tables

 Other data files such as crawled documents generated by Crawler module.

o Took a backup of these database tables needed by the Page-Rank system.

Transported the backup file from the Crawler machine to Search Engine machine.

o Restored the tables on to the database on the Search Engine machine. Made
necessary database schema changes such as:

 Adding new columns to accommodate the Page-Rank calculations using

algorithms namely Source Rank and Truncated PageRank
28
 Adding indexes to the tables obtained from the Crawler machine to
improve the performance of SQL queries during the Page-Rank
generation. The indexes were added to columns of the tables that were
used in the WHERE clause of the SQL query coded in the Perl and Java
modules.

4.2 Challenges & Key Observations

The original implementation uses the multiple database tables to compute the Page Rank.
The linking structure between the URLs, which is the key factor while computing the
Page Rank, was also represented in the form a database called URLLinkStructure. Most
of the Page-Rank algorithms use the in-links and out-links of URL while computing the
Page Rank of that URL.

As the number of URLs and the linking structure grow, it becomes complex and time
consuming, to obtain the linkage information by sending repeated SQL queries against
the huge URLLinkStructure table. It helps if the in-link and the out-link information is
readily available during the Page Rank computation. A package called WebGraph was
developed by a group of people from Italy to achieve this purpose. When the URL link
structure is represented in an ASCII text file (ASCII web graph) in certain format, using
the Classes and Methods described in this JAVA based package, the in-link, out-link
information can be accessed very efficiently, as a result of which the Page-Rank
computation times can be reduced significantly.

In brief, in ASCII web graph, each node’s in-links or out-links are listed in a single line
separated by space in text file starting node 0 from second line to node n where n is the
total number of nodes which is listed in the first line. Using the WebGraph package, this
ASCII Graph can then be compressed and an equivalent BVGraph of significantly
smaller in size compared to original ASCII graph can be generated. During the Page
Rank computation, using the methods provided by the WebGraph package, the BVGraph
of in-links can be loaded, the in-links of a node (successors) and the total number of in-
links of a node can be efficiently accessed during the Page-Rank computation. The same
is true for BVGraph of out-links.
29
It has been observed that the BVGraph can be loaded mainly in two different ways before
they can be accessed, namely load and loadOffline methods. The load method will load
the graph in to memory. This helps to access the successors and an outdegree of a
particular node directly by sending the node number as parameter to the methods provide
by WebGraph package. This is very efficient but works only with smaller BVGraphs
(approximately less than a million nodes). In case of large graphs, the graph needs to be
loaded using loadOffline method. This method does not load the graph in to memory. In
order to get the successors and outdegree of a node n, we need to start from the beginning
of the node 0, iterate the graph until we reach the node n and read the necessary
information.

It has been observed that the total time it takes i) to generate the ASCII web graph using
an optimized SQL query logic ii) to generate the equivalent BVGraph and iii) to access
the node successor information using the graph, is lot less compared to the time it takes to
access the same using SQL queries especially for large web graphs.

The current implementation does use the URL table during the Page-Rank computation.
It has been observed that, instead of accessing the in-link and out-link information from
the web graph, it will be very efficient if the number of in-links and out-links information
is available for each URL in the URL table itself. Because of this, the current
implementation performs the step of updating the in-link and out-link information in
URL table for each URL, at the beginning of the Page-Rank computation process. It was
proven that using this step, the overall time for Page-Rank, is reduced to a greater extent
especially for large graphs.

The original implementation was written fully in Perl. Since the WebGraph package was
developed in Java, in order to take advantage of the better features of the two languages,
a Perl library called Inline::Java was used. During the page-rank computation of Source
Rank and Cluster Rank, it has been noticed that separating out the BVGraph access to
Java, Page Rank computation to Perl and passing the data between the two using
Inline::Java concepts, was efficient.
30
For Truncated PageRank algorithm source code is available under GNU license is taken
and made necessary changes to fit in to the current implementation. Java and the
jdbc::mysql interface was used in order to store the Page Rank in to database table for
later access during the search.

4.3 Future Upgrade of The Search Engine

The entire process of generating page-rank using algorithm namely Cluster Rank, Source
Rank and Truncated PageRank was automated. The current implementation uses the URL
table that contains more than 4 Million URLs.

Listed below are the steps to compute the Page-Rank for a different set of URLs.

o Make sure that the table URL provides the web page URLs and the table
URLLinkStructure provides the link structure between the URLs respectively.

o If these two tables are available from a Crawler process, make sure that the tables
namely Crawler, Dictionary, KeyWord, KeyWordWork, TextParser and
PageLocation and made available in the same database

o Run the Parser module (perl textparser.pl) to update the KeyWord info.

o Copy all these tables to the database where the other tables listed in Appendix D.

o Make sure that the URL table has the columns as listed in Create URL table script
of Appendix C (section 8.1). Make sure that the indexes are available on tables as
documented in Appendix C (section 8.1).

o Run the Perl file called kickoff.pl to perform the steps listed below.

o system("perl clustering.pl");

o system("perl sourcing.pl");

o system("perl nodegraphin.pl");
31
o system("perl nodegraphout.pl");

o system("perl sourcegraphin.pl");

o system("perl sourcerank.pl");

o system("java -Xmx1000M TruncatedPageRank bvnodeout -t 2");

o system("perl clusterrank.pl");

4.4 Time Comparisons For Cluster Rank Before & After Using
WebGraph

The Original Search Engine used the Cluster Rank ranking system. This ranking system
takes 6900 seconds, which is 1 hour 55 min per iteration for 289,503 nodes. In this
algorithm we assume that it converges after 40 iterations.

Even though the ranking system calculates the rank for 289,503 nodes (URLs), all these
pages are not considered during search results display process. This is because, not all
these URLs are crawled by the Crawler module. For example, out of these 289,503
nodes, the Crawler module in the original Search Engine, crawled 54,201 pages. The
Parser module of the Search Engine, works only on these crawled URLs to generate the
keywords that will be used during the search results display.

The original Cluster Rank algorithm was run against Cluster Ranking system using
WebGraph package. The ranking system takes 6780 seconds, which is 1 hour 53 min per
iteration for 289,503 nodes. This experiment concludes that the ranking system takes less
time for Page-Rank computation using the WebGraph package.

Listed below is the time comparison table for Cluster Rank before and after applying the
WebGraph package.

Function Time for 300K URLs before Time for 300K URLs after
using WebGraph (in seconds) using WebGraph (in seconds)
32

prepareCR 264 809

prepareIncomingCount 2288 192

doClusterRank (per 6900 6780

iteration)

Total: 9452 7781

Table: Time per iteration for 300K URLs before/after using WebGraph

The overall time gain with WebGraph in Page-Rank computation using Cluster Rank
algorithm is approximately 20% for 300K URLs. Listed below is the brief description for
the overall time again using the WebGraph.

In prepareCR step above, we replaced the need of using the ‘PageRank’ table (and there
by replacing the need of generating the ‘PageRank’ table solely for this purpose) by using
the Web graph of out-links, for calculating the out-link information of URL. There
appears to be a loss in terms of time it takes for this step, but in fact we did not have to
generate the PageRank table which in directly saves significant amount of time.

In prepareIncomingCount, we replaced the need of updating the in-links of URLs using

SQL query against URLLinkStructure, by using the Web graph of in-links. We gained
significant amount of time in this step.

In doClusterRank step, we should have used the graph for Cluster in-links. But the
original implementation of the Cluster Rank, generates a table during the ‘Clustering’
phase that is used for Cluster in-links. It has been observed that usage of the graph for
Cluster in-links, in this step, does not gain much time compared to the usage of the table
generated by Clustering phase (the first phase of original Cluster Rank implementation).
This statement applies to graphs ranging from 300K URLs to 4 Million URLs based on
the experiments. The process of generating the Cluster in-links is made available for
future development to leverage the graph’s usage over the table usage while dealing with
33
huge sets of URLs over 4 Million, to improve the efficiency of the original
implementation of the Cluster Rank algorithm.

Listed below is the time distribution represented in the form of Pie Chart for Cluster
Rank before and after applying the WebGraph package.

ClusterRank Without WebGraph

PrepareCR
3%
IncomingLinkCount

24%

1
2
3
doClusterRank
73%

Chart: Time distribution per iteration for 300K URLs before using WebGraph

Chart: Time distribution per iteration for 300K URLs after using WebGraph
34

Listed below is the time gain represented in the form of Bar Chart for Cluster Rank
before and after applying the WebGraph package

Without/With BVGraph Total time gain using WebGraph for 300K URLs

7737
With WebGraph
1
Without WebGraph
9452

0 2000 4000 6000 8000 10000

Time in seconds

Chart: Time gain per iteration for 300K URLs using WebGraph

4.5 Time Measure Between Algorithms Using WebGraph

There are two recent algorithms called Source Rank, Truncated PageRank algorithm. The
authors of these algorithms used the WebGraph package to represent the graph compactly
during their Page-Rank calculation. Because of this reason, these two algorithms were
considered for Page-Rank comparisons using WebGraph package. The experiments were
performed using 300K, 600K and 4 Million URLs for Page-Rank calculation.

The time comparison of Page-Rank computation, for different sets of data using the three
algorithms is listed below.

4.5.1 For 300,000 URLs

Listed below is the table that represents the time taken by each algorithm for the Page-
Rank computation of 300K URLs
35

Step Cluster Rank (in Source Rank (in Truncated PageRank

seconds) seconds) (in seconds)

Node-In BVGraph 589 589 N/A

Node-Out BVGraph 857 857 857

Source-In BVGraph N/A 92 N/A

Rank calculation 6780 660 22

(per iteration)

Table 4.5.1: Time measure for Page-Rank computation between three algorithms (300 K)

4.5.2 For 600,000 URLs

Listed below is the table that represents the time taken by each algorithm for the Page-
Rank computation of 600K URLs

Step Cluster Rank (in Source Rank (in Truncated PageRank

seconds) seconds) (in seconds)

Node-In BVGraph 79 79 N/A

Node-Out BVGraph 68 68 68

Source-In BVGraph N/A 3 N/A

Rank calculation 422 3 32

(per iteration)

Table 4.5.2: Time measure for Page-Rank computation between three algorithms

4.5.3 For 4 Million URLs

Listed below is the table that represents the time taken by each algorithm for the Page-
Rank computation of 4 Million URLs
36

Step Cluster Rank (in Source Rank (in Truncated PageRank

seconds) seconds) (in seconds)

Node-In BVGraph 967 967 N/A

Node-Out BVGraph 909 909 909

Source-In BVGraph N/A 14 N/A

Rank calculation 2520 21 295

(per iteration)

Table 4.5.3: Time measure for Page-Rank computation between three algorithms
37

4.5.4 Graph Representation For Time Measure (in Sec)

Algorithm URLs: 633061 URLs: 289503 URLs: 4 M

Node InLinks: 2905183 Node InLinks: 21781790 Node InLinks: 28346447
Average InLinks per Node: 4.6 Average InLinks per Node: 78.06 Average InLinks per Node: 5.82
Clusters: 48271 Clusters: 164136 Clusters: 256919
Cluster InLinks: 983579 Cluster InLinks: 18210270 Cluster InLinks: 9120926
Average InLinks per Cluster: 16.35 Average InLinks per Cluster: 109.35 Average InLinks per Cluster: 32.54
Sources: 425 Sources: 14892 Sources: 482
Source InLinks: 75217 Source InLinks: 9988138 Source InLinks: 509693
Average InLinks per Source: 176.98 Average InLinks per Source: 670.8 Average InLinks per Source: 1057.45

Cluster Rank (Time is

directly proportional
to Cluster InLinks) 422 6780 2520

Source Rank (Time is

directly proportional
to Source InLinks) 3 660 21

Truncated PageRank
(Time is directly
proportional to
number Node
InLinks) 2 12 17

Table Table 4.5.4: Time measure between algorithms per iteration (in Seconds)
38

Time Measure between algorithms per iteration

8000
Time in secondsds 7000 6780
6000
5000 Cluster Rank
4000 Source Rank
3000 Truncated PageRank
2520
2000
1000
422 660
0 3
2 12 21
17
1 2 3
Node InLinks
(1: 2905183, 2: 21781790, 3: 28346447)

Line Graph (Y-Axis: Time in seconds, X-axis: Node Graph Size)

Time Measure between algorithms per iteration

(1: 2905183, 2: 21781790,

17
3 21
Node Graph Size

2520
3: 28346447)

Truncated PageRank
12
2 660 Source Rank
6780
Cluster Rank
2
1 3
422

0 2000 4000 6000 8000

Time in seconds

Bar Graph (X-Axis: Time in seconds, Y-axis: Node Graph Size)

Cluster Rank Time Measure based on Cluster InLinks

8000

Time in secondsds
7000 6780
6000
5000
4000 Cluster Rank
3000
2520
2000
1000
422
0
1 2 3
Cluster InLinks
(1: 983579, 2: 9120926, 3: 18210270)

Line Graph (Y-Axis: Time in seconds, X-axis: Cluster InLinks)

Source Rank Time Measure based on Source InLinks

700
660
Time in secondsds

600
500
400
Source Rank
300
200
100
0 3 21
1 2 3
Source InLinks
(1: 75217, 2: 509693, 3: 9988138)

Line Graph (Y-Axis: Time in seconds, X-axis: Source InLinks)

Truncated PageRank Time Measure based on Node InLinks

18
17
Time in secondsds
16
14
12 12
10
Truncated PageRank
8
6
4
2 2
0
1 2 3
Node InLinks
(1: 2905183, 2: 21781790, 3: 28346447)

Line Graph (Y-Axis: Time in seconds, X-axis: Node InLinks)

4.5.5 Node In-Link Distribution across Nodes

4.5.5.1 Node In-Link Distribution across Nodes for 300K

Distribution of Nodes and InLinks for 300K

250000

200000
# of Nodes

150000

Nodes

100000

50000

0
1
30
59

150

189
245
330
424

575
807

1310
1400

2579

3939
1089

2381

2671
3399

5110
117

# of InLinks
42
4.5.5.2 Node In-Link Distribution across Nodes for 600K

Distribution of Nodes and InLinks for 600K

600000

500000

400000
# of Nodes

300000 Nodes

200000

100000

0
1
29
57
85

141

169
198
234
269
317

356
391
467
640
777

2140
2566
1015
1396
1784
113

# of InLinks
43
4.5.5.3 Node In-Link Distribution across Nodes for 4M

Distribution of Nodes and InLinks for 4M

4500000

4000000

3500000

3000000
# of Nodes

2500000
Nodes
2000000

1500000

1000000

500000

0
1068
1325

2444
3527
6298
9058
12648
1
70
139
208
278
349
425
523
642

751
900

1579
1920

65578
# of InLinks
44
4.5.6 Cluster In-Link Distribution across Clusters

4.5.6.1 Cluster In-Link Distribution across Clusters for 300K

Distribution of Clusters and InLinks for 300K

140000

120000

100000
# of Clustersrs

80000
Nodes
60000

40000

20000

0
1
29
57

148
182

230
310
368

545
713

950
1269

2486

3788
3971

8751
1316
1430

2588
2698
113

# of InLinks
45
4.5.6.2 Cluster In-Link Distribution across Clusters for 600K

Distribution of Clusters and InLinks for 600K

35000

30000

25000
# of Clustersrs

20000
Nodes
15000

10000

5000

0
1

61
81
101

121
143
172
207

277
321
371

446
576

927

2128
2606
21

244

1516

2174
711

1195
# of InLinks
46
4.5.6.3 Cluster In-Link Distribution across Clusters for 4M

Distribution of Clusters and InLinks for 4M

140000

120000

100000
# of Clustersrs

80000
Nodes
60000

40000

20000

0
40
79

157
196

237
283

340
412

489
608

744
936

1815

6033
9997
1

1448

2376
3158

54223
118

1109

# of InLinks
47
4.5.7 Source In-Link Distribution across Sources

4.5.7.1 Source In-Link Distribution across Sources for 300K

Distribution of Source and InLinks for 300K

6000

5000

4000
# of Sources

3000 Nodes

2000

1000

0
1
28

55
85
128
210
452
1078
1300
1397

2721
3788
3936

5248
7626
9214
12712
37856
1508
2554
2664

4050

# of InLinks
48
4.5.7.2 Source In-Link Distribution across Sources for 600K

Distribution of Sources and InLinks for 600K

100

70
# of Sources

50 Nodes

0
16

52
61

106

135

289

507

1414
1
6

156

718

2892
11

211
# of InLinks
49
4.5.7.3 Source In-Link Distribution across Sources for 4M

Distribution of Sources and InLinks for 4M

100

70
# of Sources

50 Nodes

0
13
19
25
31
41

53
66
77
88
102

164
195
249

318
565
865
1579
3291
1
7

13950
117

# of InLinks
50

4.5.8 Time Gain Analysis Between Algorithms

Based on the results obtained from the experiments, it is proven that:

o ClusterRank computational time is directly proportional to the Cluster in-links.

This is because, in this algorithm the nodes (URLs) are grouped into Clusters
based on the ‘virtual directory and dynamically generated links (containing ?, #)’,
and the rank of the Cluster is distributed to the nodes contained in the Cluster. The
rank of the Cluster is computed using Google’s PageRank algorithm, by treating
the Clusters as nodes.

o SourceRank computational time is directly proportional to the Source in-links.

This is because, in this algorithm the nodes (URLs) are grouped into Sources
based on the ‘domain’, and the rank of the Source is distributed to the nodes
contained in the Source. The rank of the Source is computed using Google’s
PageRank algorithm, by treating the Sources as nodes.

o Truncated PageRank computation time is directly proportional to the number of

URLs. This is because, in this algorithm, the Truncated PageRank is computed
using Google’s PageRank algorithm.

4.6 Quality Measure Between Algorithms

A survey is performed among a group of people to measure the quality of the three
algorithms. The survey is performed based on the questions listed in appendix A, using
25 different key words. These 25 keywords were identified by using Google’s tool that is
available at: https://adwords.google.com/select/KeywordToolExternal. The purpose of
this tool is to identify the relevant keywords for websites based on their content. URLs of
multiple universities were considered, while using this tool, to identify relevant search
keywords for educational domains.

The quality is measured by following the steps:

 Each person participated in the survey perform a keyword search using 25

keywords against three Search-Engine systems. The first Search Engine system
uses the data generated by ClusterRank algorithm, the second one uses the data
generated by SourceRank algorithm and third one uses the data generated by
Truncated PagaRank algorithm

 The average of the quality points is calculated for each question for different
keywords.

 The average of the quality points from all the users calculated for each question
from step one.

 Finally, the average of the quality points for all the questions from step two, is
calculated for each algorithm

 Based on result from step three, conclude the better algorithm in terms of quality.

4.6.1 Survey Results

Algorithm Quality measure based on the scale 1 to 5

(1 being the best) for 25 keywords

ClusterRank 2.06

SourceRank 1.65

Truncated PageRank 2.94

4.7 Conclusion of Experimental Results

From the above results we observed that the Page-Rank computation time depends on the
URL link structure (Web graph) and also the algorithm used for the computation.

It has been observed that, for algorithms that use only the URL linking structure such as
Truncated PageRank, the Page-Rank computation time is directly proportional to the
number of URL in-links.
52
For algorithms such as SourceRank, where the Page-Rank is calculated for the Source
and then is propagated to URLs contained within that Source, the Page-Rank
computation time is directly proportional to the number of links between the Sources
(Source graph size). The same is true for Cluster Rank algorithm.

It is important to note that the number of URLs has no direct relation with the number of
Sources or number of Clusters, as it significantly varies depending on the crawling
process. For example, if URLs of multiple domains are crawled, the number of Sources
generated will be more. The number of Clusters should always be more than the number
of Sources for the same set of URLs. This is because the Source is defined based on
‘domain’ and the Cluster is defined based on ‘virtual directory and dynamically generated
pages within the URL’.

Based on the results from the manual survey, the quality of the SourceRank is proven as
better algorithm than the ClusterRank and Truncated PageRank.

Based on the results from the time-measure experiments, the SoureRank is proven to take
less time compared to ClusterRank. This is because, if the URLs of multiple domains are
crawled, the number of Sources generated will be more. The number of Clusters should
always be more than the number of Sources for the same set of URLs, as the Source is
defined based on ‘domain’ and the Cluster is defined based on ‘virtual directory and
dynamically generated pages within the URL’.

To take advantage of both Efficiency & Quality, the Source Rank is proven to be the
better algorithm out of the three based on the experiments conducted using the available
data.
53

5 References

[1] Paolo Boldi, Sebastiano Vigna. The WebGraoh Framework 1: Compression

Techniques. In Proceedings of The 14th International World Wide Web Conference (New
York), http://www2004.org/proceedings/docs/1p595.pdf, Pages 595 – 602, ACM Press,
2004.

[2] Yen-Yu Chen, Qingqing Gan, Torsten Suel. I/O-Efficient Techniques for Computing
Pagerank. In Proceedings of the Eleventh ACM Conference on Information and
Knowledge Management (CIKM), http://cis.poly.edu/suel/papers/pagerank.pdf, Pages
549-557, 2002.

[3] G. Jeh and J. Widom. “SimRank: A Measure of Structural-Context Similarity”,

Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data
Mining (SIGKDD), Pages 538-543, 2002,http://www-cs-
students.stanford.edu/~glenj/simrank.pdf

[4] Yi Zhang. Design and Implementation of a Search Engine with the Cluster Rank
Algorithm. UCCS Computer Science Master’s Thesis, 2006.

[5] John A. Tomlin. A New Paradigm for Ranking Pages on the World Wide
Web,http://www2003.org/cdrom/papers/refereed/p042/paper42_html/p42-tomlin.htm,
Pages 350-355, WWW 2003.

[6] Page, Lawrence, Brin, Sergey, Motwani, Rajeev, Winograd, Terry. The PageRank
Citation Ranking: Bringing Order to the Web, http://www.cs.huji.ac.il/~csip/1999-66.pdf,
1999.

[7] Ricardo BaezaYates, Paolo Boldi, Carlos Castillo. Generalizing PageRank: Damping
Functions for LinkBased Ranking Algorithms. In Proceedings of the 29th Annual
International ACM SIGIR,
http://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions
_link_ranking.pdf, Pages 308-315, ACM Press, 2006.
54
[8] Gonzalo Navarro. Compressing Web Graphs like Texts,
ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/graphcompression.ps.gz, 2007.

[9] The Spiders Apprentice, http://www.monash.com/spidap1.html, 2004.

[10] James Caverlee, Ling Liu, S.Webb. Spam-Resilient Web Ranking via influence
Throttling. 21st IEEE International Parallel and Distributed Processing Symposium
(IPDPS), http://www-static.cc.gatech.edu/~caverlee/pubs/caverlee07ipdps.pdf,
LongBeach, 2007.

[11] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates, “Using rank

propagation and probabilistic counting for link-based spam detection. Technical report”,
2006,
http://www.dcc.uchile.cl/~ccastill/papers/becchetti_06_automatic_link_spam_detection_r
ank_propagation.pdf

[12] Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. WebBase:
A repository of Web page

[13] http://webgraph.dsi.unimi.it/docs/index.html World Wide Web

6 Appendix A – Survey Questions to Evaluate Algorithm

Quality

Perform 25 different key-word searches to measure the quality of the algorithm based on
the information displayed to the user that is relevant to the key-word.

6.1 Search Keywords

pictures university Faculty stadium undergraduate

map admissions Scholarships loan mba

alumni computer Graduate business research

students technology Accommodation campus vacations

dean department Aid gpa parking

6.2 Survey Questions

1. First page accuracy (scale 1 to 5, 1 being the best)

2. Second page accuracy (scale 1 to 5, 1 being the best)

3. Result order on the first page (scale 1 to 5, 1 being the best)

4. Result order on the second page (scale 1 to 5, 1 being the best)

5. Overall, are the important pages showing up early? (scale 1 to 5, 1 being the best)

6. Overall, the percentage in result hits are relevant? (Give a percentage)

7 Appendix B – Software and hardware Environment

Software Environment:

o Fedora Core 4, MySQL server 5.0.26, Perl v5.8.6, Apache 2.2.3, Java 1.5

Hardware Environment:

o Pentium 4, 2.0 GHz CPU and 1 GB RAM

8 Appendix C – Database Scripts

8.1 Database Table/View Setup

CREATE TABLE MediaType (
doc_type_id TINYINT UNSIGNED NOT NULL auto_increment,
doc_type VARCHAR(20) NOT NULL,
PRIMARY KEY(doc_type_id));

INSERT INTO MediaType SET doc_type="text";

INSERT INTO MediaType SET doc_type="image";
INSERT INTO MediaType SET doc_type="OtherBinary";

CREATE TABLE URL (

url_id INT UNSIGNED NOT NULL auto_increment,
url VARCHAR(255) NOT NULL,
doc_type_id TINYINT UNSIGNED NOT NULL,
container_url INT UNSIGNED,
title VARCHAR(255),
cluster_id INT UNSIGNED,
source_id INT UNSIGNED,
in_plink_count INT UNSIGNED,
in_clink_count INT UNSIGNED,
INDEX (url(255)),
FOREIGN KEY(doc_type_id) REFERENCES MediaType(doc_type_id),
FOREIGN KEY(cluster_id) REFERENCES Cluster(cluster_id),
FOREIGN KEY(source_id) REFERENCES Source(source_id),
PRIMARY KEY (url_id) );

CREATE TABLE Crawler (

url_id INT UNSIGNED NOT NULL,
crawled_date DATE NOT NULL,
localfullname VARCHAR(255) NOT NULL,
size INT UNSIGNED NOT NULL,
FOREIGN KEY(url_id) REFERENCES URL(url_id));

CREATE TABLE ImageProcessor (

url_id INT UNSIGNED NOT NULL,
surrounding_words_before VARCHAR(255) NOT NULL,
surrounding_words_after VARCHAR(255) NOT NULL,
processed_date DATE NOT NULL,
FOREIGN KEY(url_id) REFERENCES URL(url_id));

CREATE TABLE URLLinkStructure (

link_id INT UNSIGNED NOT NULL auto_increment,
from_url_id INT UNSIGNED NOT NULL,
to_url_id INT UNSIGNED NOT NULL,
anchor_text VARCHAR(100),
update_date DATE NOT NULL,
FOREIGN KEY(from_url_id) REFERENCES URL(url_id),
FOREIGN KEY(to_url_id) REFERENCES URL(url_id),
UNIQUE INDEX (from_url_id, to_url_id),
PRIMARY KEY (link_id) );
58

CREATE TABLE PageLocation (

id SMALLINT UNSIGNED NOT NULL auto_increment,
description VARCHAR(32) NOT NULL,
htmltag VARCHAR(32) NOT NULL,
weight MEDIUMINT UNSIGNED,
weight_date DATE,
PRIMARY KEY(id));

CREATE TABLE TextParser (

url_id INT UNSIGNED NOT NULL,
processed_date DATE NOT NULL,
UNIQUE INDEX (url_id),
FOREIGN KEY(url_id) REFERENCES URL(url_id));

CREATE TABLE KeyWordWork (

keyword_id INT UNSIGNED NOT NULL,
url_id INT UNSIGNED NOT NULL,
location_id SMALLINT UNSIGNED NOT NULL,
update_date DATE NOT NULL,
frequency MEDIUMINT UNSIGNED,
UNIQUE INDEX (keyword_id, url_id, location_id),
FOREIGN KEY(url_id) REFERENCES URL(url_id),
FOREIGN KEY(location_id) REFERENCES PageLocation(id),
FOREIGN KEY(keyword_id) REFERENCES Dictionary(id));

CREATE TABLE KeyWord (

keyword_id INT UNSIGNED NOT NULL,
url_id INT UNSIGNED NOT NULL,
total_weight INT UNSIGNED NOT NULL,
total_weight_date DATE NOT NULL,
UNIQUE INDEX (keyword_id, url_id),
FOREIGN KEY(url_id) REFERENCES URL(url_id),
FOREIGN KEY(keyword_id) REFERENCES Dictionary(id));

CREATE TABLE Dictionary (

id INT UNSIGNED NOT NULL auto_increment,
word VARCHAR(32) NOT NULL,
UNIQUE INDEX (word(32)),
PRIMARY KEY (id));

CREATE TABLE Cluster (

cluster_id INT UNSIGNED NOT NULL,
base_url VARCHAR(255) NOT NULL,
cluster_rank FLOAT ZEROFILL,
cluster_rank_date DATE,
out_link_count INT UNSIGNED,
in_link_count INT UNSIGNED,
cal_last_update DATE,
cal_reserved_by VARCHAR(255),
cal_current_iter SMALLINT UNSIGNED,
old_cr1 FLOAT ZEROFILL,
old_cr1_date DATE,
old_cr2 FLOAT ZEROFILL,
old_cr2_date DATE,
prop_sec_cluster_id INT UNSIGNED,
PRIMARY KEY (cluster_id),
59
FOREIGN KEY(prop_sec_cluster_id) REFERENCES
SecondLvlClusterWork(sec_cluster_id)
);

CREATE TABLE SecondLvlClusterWork (

sec_cluster_id INT UNSIGNED NOT NULL,
sec_base_url VARCHAR(255) NOT NULL,
graph_density FLOAT ZEROFILL,
PRIMARY KEY (sec_cluster_id)
);

CREATE TABLE PageRankByCluster (

url_id INT UNSIGNED NOT NULL,
c_date DATE,
c_prc FLOAT ZEROFILL,
old_date1 DATE,
old_prc1 FLOAT ZEROFILL,
old_date2 DATE,
old_prc2 FLOAT ZEROFILL,
UNIQUE INDEX (url_id),
CONSTRAINT FOREIGN KEY(url_id) REFERENCES URL(url_id)
ON DELETE CASCADE ON UPDATE CASCADE
);

CREATE OR REPLACE VIEW vwURLLinkSource

AS SELECT ufrom.source_id fromsource, uto.source_id tosource
FROM URLLinkStructure, URL ufrom, URL uto
Where ufrom.source_id is not NULL
AND uto.source_id is not NULL
AND URLLinkStructure.from_url_id = ufrom.url_id
AND URLLinkStructure.to_url_id = uto.url_id
AND ufrom.source_id <> uto.source_id;

CREATE TABLE Source (

source_id INT UNSIGNED NOT NULL,
base_url VARCHAR(255) NOT NULL,
source_rank FLOAT ZEROFILL,
source_rank_date DATE,
out_link_count INT UNSIGNED,
in_link_count INT UNSIGNED,
cal_last_update DATE,
cal_reserved_by VARCHAR(255),
cal_current_iter SMALLINT UNSIGNED,
old_cr1 FLOAT ZEROFILL,
old_cr1_date DATE,
old_cr2 FLOAT ZEROFILL,
old_cr2_date DATE,
prop_sec_source_id INT UNSIGNED,
PRIMARY KEY (source_id),
FOREIGN KEY(prop_sec_source_id) REFERENCES
SecondLvlSourceWork(sec_source_id)
);

CREATE TABLE PageRankBySource (

url_id INT UNSIGNED NOT NULL,
c_date DATE,
c_prc FLOAT ZEROFILL,
60
old_date1 DATE,
old_prc1 FLOAT ZEROFILL,
old_date2 DATE,
old_prc2 FLOAT ZEROFILL,
UNIQUE INDEX (url_id),
CONSTRAINT FOREIGN KEY(url_id) REFERENCES URL(url_id)
ON DELETE CASCADE ON UPDATE CASCADE
);

alter table Cluster ADD INDEX (base_url);

alter table Source ADD INDEX (base_url);
alter table SecondLvlClusterWork ADD INDEX (sec_base_url);
61

8.2 Database Table/View Cleanup

DROP VIEW vwURLLinkCluster;
DROP VIEW vwURLLinkSource;
DROP TABLE MediaType ;
DROP TABLE URL;
DROP TABLE Crawler;
DROP TABLE ImageProcessor;
DROP TABLE URLLinkStructure;
DROP TABLE PageLocation;
DROP TABLE TextParser;
DROP TABLE KeyWord;
DROP TABLE KeyWordWork;
DROP TABLE Dictionary;
DROP TABLE PageRankByCluster;
DROP TABLE SecondLvlClusterWork;
DROP TABLE Cluster;
DROP TABLE PageRankBySource;
DROP TABLE Source;
commit;
62

9 Appendix D – Software Setup

WebGraph:

WebGraph is a java package used to represent the graph compactly.

Can be obtained from URL: http://webgraph.dsi.unimi.it/

Java:

Java is used as the programming language to use the WebGraph package.

Can be obtained from URL: http://www.java.com/en/download/linux_manual.jsp

jdbc::mysql:

To update the MySQL database tables with the Page Rank information in the Truncated

PageRank module written in Java

Can be obtained from URL: http://dev.mysql.com/downloads/connector/j/5.1.html

Inline-java:

This is a Perl library to access the java modules from Perl.

Can be obtained from URL: http://search.cpan.org/CPAN/authors/id/P/PA/PATL/Inline-

Java-0.52.tar.gz

Perl:

Perl (version 5.8.8) is used as the programming language. A fast interpreter, its features to

handle and manipulate strings and relatively small memory signatures of its modules

make it an ideal language for this project.

Can be obtained from URL: http://www.perl.com/download.csp

MySQL:

The database is designed and implemented using MySQL v3.23.58 and v4.1.1. MySQL is

free, scalable and Perl has a rich API for MySQL.

63
Can be obtained from URL: http://dev.mysql.com/downloads/

Apache:

Apache server v2.0.54 is used for the machines to communicate using the CGI module.

Can be obtained from URL: http://httpd.apache.org/

10 Appendix E – Using the Search Engine

All the experiments were performed using machine with ip address 128.198.144.16. The
modules are made available at:

o /home/padipudi/webgrap/truncated-pagerank-1.0

The 600K and 4M URL datasets were obtained from sonali_new database on
128.198.144.19.

The textparser.pl was run on 128.198.144.19 and the database tables were copied on to
128.198.144.16 to run the page-rank computation modules.

The Apache instance that serves up all the three Page-Rank systems is available at:

o http://128.198.144.16:1180/cgi-bin/search.pl

The cgi-bin virtual directory that Apache uses to serve up the pages, is available at:

o /home/padipudi/apache/cgi-bin.

ML Movie Review
100% (1)
ML Movie Review
2 pages
Academic Performance in Math and Science-Related Subjects of Grade 12 Students in Relation To Class Schedules
No ratings yet
Academic Performance in Math and Science-Related Subjects of Grade 12 Students in Relation To Class Schedules
26 pages
Organizing & Presenting Data
No ratings yet
Organizing & Presenting Data
36 pages
Janus Faces
100% (1)
Janus Faces
8 pages
TIFRAC, India's First Computer
No ratings yet
TIFRAC, India's First Computer
10 pages
Completing Story: 01. King Midas and The Golden Touch Story
100% (2)
Completing Story: 01. King Midas and The Golden Touch Story
4 pages
Python Design and Implementation of A Simple Web Search E
No ratings yet
Python Design and Implementation of A Simple Web Search E
9 pages
Modified Ranking Engine
No ratings yet
Modified Ranking Engine
4 pages
Chapter 6 Planning
No ratings yet
Chapter 6 Planning
2 pages
Relevancy Based Content Search in Semantic Web
No ratings yet
Relevancy Based Content Search in Semantic Web
2 pages
The Implementation of A Web Crawler URL Filter Algorithm Based On Caching
No ratings yet
The Implementation of A Web Crawler URL Filter Algorithm Based On Caching
4 pages
Our Favorite Expert For The Position National Hydraulic Engineer
No ratings yet
Our Favorite Expert For The Position National Hydraulic Engineer
1 page
DWM Expt9
No ratings yet
DWM Expt9
6 pages
Article28
No ratings yet
Article28
7 pages
IR Module 3 (1)
No ratings yet
IR Module 3 (1)
45 pages
IR Unit V Notes remaining
No ratings yet
IR Unit V Notes remaining
10 pages
ir5
No ratings yet
ir5
18 pages
RMA KS2 BoSYScoresheetv1
No ratings yet
RMA KS2 BoSYScoresheetv1
19 pages
Iste Search Engine
No ratings yet
Iste Search Engine
6 pages
Mathematical Operations in Sadratnamālā An Analysis With Modern Interpretation
No ratings yet
Mathematical Operations in Sadratnamālā An Analysis With Modern Interpretation
3 pages
Implementation and Analysis of Google's Page Rank Algorithm Using Network Dataset
No ratings yet
Implementation and Analysis of Google's Page Rank Algorithm Using Network Dataset
5 pages
S O W C A: Urvey F EB Rawling Lgorithms
No ratings yet
S O W C A: Urvey F EB Rawling Lgorithms
8 pages
Report PDF
No ratings yet
Report PDF
35 pages
08 Web Search and Web Crawling
No ratings yet
08 Web Search and Web Crawling
33 pages
Design and Implementation of A Simple Web Search E
No ratings yet
Design and Implementation of A Simple Web Search E
9 pages
SEO: The PAGE RANK Algorithm: Presidency University, Bengaluru School of Engineering
No ratings yet
SEO: The PAGE RANK Algorithm: Presidency University, Bengaluru School of Engineering
56 pages
Conclusion For Srs
No ratings yet
Conclusion For Srs
5 pages
Keyw Word Quer Ry Based D Focused Dwebc Rawler: Sciencedirect
No ratings yet
Keyw Word Quer Ry Based D Focused Dwebc Rawler: Sciencedirect
7 pages
Enhancing Link Evaluation Through a Coor
No ratings yet
Enhancing Link Evaluation Through a Coor
21 pages
Java Web Crawler
No ratings yet
Java Web Crawler
1 page
Deep Crawling of Web Sites Using Frontier Technique: Samantula Hemalatha
No ratings yet
Deep Crawling of Web Sites Using Frontier Technique: Samantula Hemalatha
11 pages
Lect 02-Crawling Part a
No ratings yet
Lect 02-Crawling Part a
21 pages
TQM Module 1
100% (5)
TQM Module 1
36 pages
20 Different Number Pattern Programs in Java
No ratings yet
20 Different Number Pattern Programs in Java
21 pages
CPP Report
No ratings yet
CPP Report
9 pages
Searching The Web
No ratings yet
Searching The Web
24 pages
Chapter 13
No ratings yet
Chapter 13
13 pages
Comparative Study of Page Rank and Weighted Page Rank Algorithm
No ratings yet
Comparative Study of Page Rank and Weighted Page Rank Algorithm
9 pages
Unit-2
No ratings yet
Unit-2
14 pages
CAN BUS Steering Gear Control
No ratings yet
CAN BUS Steering Gear Control
7 pages
Unit 7_ Search Engine
No ratings yet
Unit 7_ Search Engine
10 pages
Groups and Individuals
No ratings yet
Groups and Individuals
3 pages
Ponyprog Tutorial PDF
No ratings yet
Ponyprog Tutorial PDF
10 pages
Chapter - 01: "Variables, Constants & Expressions": (C - Programming)
No ratings yet
Chapter - 01: "Variables, Constants & Expressions": (C - Programming)
17 pages
4
No ratings yet
4
16 pages
IR_workbook_answers
No ratings yet
IR_workbook_answers
36 pages
BOILEAU 1959 A Critical Study of The Linear Suburb of Madrid PDF
No ratings yet
BOILEAU 1959 A Critical Study of The Linear Suburb of Madrid PDF
14 pages
Dbms Review-3: G.BALAVIGNESH-10MSE1072 Harshavardhan-10Mse1077
No ratings yet
Dbms Review-3: G.BALAVIGNESH-10MSE1072 Harshavardhan-10Mse1077
35 pages
Anisotropic Diffusion
No ratings yet
Anisotropic Diffusion
184 pages
Web Search Engingine Indexing Crawling and Ranking
No ratings yet
Web Search Engingine Indexing Crawling and Ranking
63 pages
Adjetivos Comparativos - Teste
No ratings yet
Adjetivos Comparativos - Teste
2 pages
Moyno 1000-Serivce-Manual PDF
No ratings yet
Moyno 1000-Serivce-Manual PDF
34 pages
Liuty
No ratings yet
Liuty
50 pages
21jul201512071432 DAIWAT A VYAS 1-6
No ratings yet
21jul201512071432 DAIWAT A VYAS 1-6
6 pages
Set 2 AK
No ratings yet
Set 2 AK
11 pages
Information Retrieval Lecture 10 - Web Crawling
No ratings yet
Information Retrieval Lecture 10 - Web Crawling
8 pages
Naturvital-Plus - ING - Technical Data Sheet PDF
No ratings yet
Naturvital-Plus - ING - Technical Data Sheet PDF
1 page
Prakash J. y R. Kumar. 2015. Web Crawling Through Shark-Search Using Pagerank
No ratings yet
Prakash J. y R. Kumar. 2015. Web Crawling Through Shark-Search Using Pagerank
7 pages
Erformance Valuation EB Rawler: P E O W C
No ratings yet
Erformance Valuation EB Rawler: P E O W C
34 pages
Cse3024 Web-Mining Eth 1.1 47 Cse3024 PDF
No ratings yet
Cse3024 Web-Mining Eth 1.1 47 Cse3024 PDF
12 pages
Career Genogram
No ratings yet
Career Genogram
3 pages
IEEE Paper Format Template
No ratings yet
IEEE Paper Format Template
2 pages
CNH-1015 MaintProducts Case r18 LR
100% (1)
CNH-1015 MaintProducts Case r18 LR
28 pages
A Review of 'The Anatomy of A Large-Scale Hypertextual Web Search Engine'
No ratings yet
A Review of 'The Anatomy of A Large-Scale Hypertextual Web Search Engine'
27 pages
Balancing Volume, Quality and Freshness in Web Crawling: Ricardo Baeza-Yates and Carlos Castillo
No ratings yet
Balancing Volume, Quality and Freshness in Web Crawling: Ricardo Baeza-Yates and Carlos Castillo
14 pages
Major Project PROPOSAL-BACHELOR OF ENGINEERING
No ratings yet
Major Project PROPOSAL-BACHELOR OF ENGINEERING
37 pages
SearchLand: Search Quality For Beginners
No ratings yet
SearchLand: Search Quality For Beginners
29 pages
Different Types of Web Crawlers
No ratings yet
Different Types of Web Crawlers
40 pages
13 Building Search Engine Using Machine Learning Technique
No ratings yet
13 Building Search Engine Using Machine Learning Technique
4 pages
Search Engine: Amit Kamath Ancy Alphonso
No ratings yet
Search Engine: Amit Kamath Ancy Alphonso
22 pages
Career Cluster Finder Naviance Student
No ratings yet
Career Cluster Finder Naviance Student
11 pages
Completed Final UNIT-V 9.10.17
100% (1)
Completed Final UNIT-V 9.10.17
74 pages
Probability of Combined Events - Past Paper Questions: Year Series Paper Number
No ratings yet
Probability of Combined Events - Past Paper Questions: Year Series Paper Number
18 pages
Practical Research 1
No ratings yet
Practical Research 1
82 pages
Competency Mapping
No ratings yet
Competency Mapping
8 pages
Chapter11 The View of Reliability
No ratings yet
Chapter11 The View of Reliability
41 pages
Isa RP16.4 1960
No ratings yet
Isa RP16.4 1960
10 pages
Data Empowerment: Harnessing Advanced Mathematical and Statistical Methods for Data Science and Machine Learning
From Everand
Data Empowerment: Harnessing Advanced Mathematical and Statistical Methods for Data Science and Machine Learning
NAGARAJU CHEVURU
No ratings yet
The Future of Learning: Revolutionizing Education Through Generative AI: AI Books, #11
From Everand
The Future of Learning: Revolutionizing Education Through Generative AI: AI Books, #11
Mohammad
No ratings yet
Mastering Python Advanced Concepts and Practical Applications
From Everand
Mastering Python Advanced Concepts and Practical Applications
Aissa Younes
No ratings yet
Plain JavaScript: Learning the Front-End
From Everand
Plain JavaScript: Learning the Front-End
Roger Beans-Rivet
No ratings yet
Series 7 Exam Prep Complete Review and Study Guide for FINRA Certification, Including Practice Questions, Proven Test Strategies, Expert Tips, and Detailed Exam Topic Explanations
From Everand
Series 7 Exam Prep Complete Review and Study Guide for FINRA Certification, Including Practice Questions, Proven Test Strategies, Expert Tips, and Detailed Exam Topic Explanations
Jonathan L Reese
No ratings yet
Design and Technology in Today's World: A First Look
From Everand
Design and Technology in Today's World: A First Look
Baz Professor
No ratings yet
Unlocking Statistics for the Social Sciences
From Everand
Unlocking Statistics for the Social Sciences
Norma Sinclair
No ratings yet
Blog Smarter, Not Harder: SEO, Blogging, and AI Strategies to Skyrocket Your Traffic
From Everand
Blog Smarter, Not Harder: SEO, Blogging, and AI Strategies to Skyrocket Your Traffic
Jay Nans
No ratings yet
Human Nature Potential in Nurture
From Everand
Human Nature Potential in Nurture
David L. Hawk
No ratings yet
ChatGPT for Business: Strategies for Success
From Everand
ChatGPT for Business: Strategies for Success
Matthew C. Smith
1/5 (1)
A To Z of Internet: Everything You Wanted to Know
From Everand
A To Z of Internet: Everything You Wanted to Know
Bittu Kumar
No ratings yet
Gray Hat Hacking the Ethical Hacker's
From Everand
Gray Hat Hacking the Ethical Hacker's
Çağatay Şanlı
5/5 (1)
Software Patterns Made Easy
From Everand
Software Patterns Made Easy
Justice Nanhou
No ratings yet

Accelerating Ranking System Using Web Graph

Uploaded by

Accelerating Ranking System Using Web Graph

Uploaded by

ACCELERATING RANKING

A project submitted to the graduate faculty of

The University of Colorado at Colorado Springs in partial

Fulfillment of the Master of Science degree

DEPARTMENT OF COMPUTER SCIENCE

This project for Master’s of Science degree by Padmaja Adipudi has

been approved for the Department of Computer Science

Dr. J. Kalita (Advisor)

Dr. E. Chow (Committee Member)

Dr. T. Chamillard (Committee Member)

2.1 Cluster Rank Algorithm........................................................................................8

6 Appendix A – Survey Questions to Evaluate Algorithm Quality.....................................56

6.1 Search Keywords.................................................................................................56

7 Appendix B – Software and hardware Environment.......................................................57

8 Appendix C – Database Scripts..........................................................................................58

8.1 Database Table/View Setup.................................................................................58

9 Appendix D – Software Setup.............................................................................................63

10 Appendix E – Using the Search Engine................................................................................65

2.1 Cluster Rank Algorithm

 Group all pages in to clusters.

o Perform first level clustering for dynamically generated pages

o Perform second level clustering on virtual directory and graph density

 Distribute the rank number to its members by weighted average.

2.2 Problem Statement

2.3 Summary of Work

2.4 Related Work

There is a paper called “Efficient Computation of page-rank” written by Taher

Use Inline java => <<’DATA’;

/** JAVA Code Begins **/

## Perl Code Begins ##

3.2 Google’s PageRank

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

3.3 Cluster Rank

1. Group all pages into clusters.

Perform first level clustering for dynamically generated pages

Perform second level clustering on virtual directory and graph density

The notations here are:

PR: The rank of a member page

CR: The cluster rank from previous stage

Pi: The incoming links of this page

Ci: Total incoming links of this cluster.

3.4 Source Rank

1. Group all pages into Sources based on “Domain”.

3. Distribute the rank number to its members by weighted average by using

The notations here are:

PR: The rank of a member page

SR: The source rank from previous stage

Si: Total incoming unique links of this source

3.5 Truncated PageRank

They suggest generalization to the PageRank equation to:

The rank propagates through links.

The notations here are:

a : The damping factor

3.6 Software & Packages Used

WebGraph is a java package used to represent the graph compactly.

Java is used as the programming language to use the WebGraph package.

This is a Perl library to access the java modules from Perl.

Software used for documentation:

3.7 Database Implementation

Three new tables namely Source, PageRankBySource, PageRankTruncated and view

3.8 Database Tables Added

3.8.1 Table Source

columns and their purpose.

This column uniquely identifies that source.

o base_url: The base URL of the source

source rank computation.

o out_link_count: The number of out links of the source

o in_link_count: The number of in links of the source

o cal_current_iter: The iteration number used to converge the source rank

o old_prc1: The previous weighted Page Rank 1 of the URL

o old_prc1: The previous weighted Page Rank 2 of the URL

3.8.2 Table PageRankBySource

o c_prc: The weighted Page Rank

o old_prc1: The previous weighted Page Rank 1 of the URL

o old_prc1: The previous weighted Page Rank 2 of the URL

o c_prc: The weighted Page Rank

/ JAVA Code Begins /