The Implementation of A Web Crawler URL Filter Algorithm Based On Caching
Abstract — For large-scale Web information collection, the URL filter module plays an important role in a Web crawler, which is a central component of a search engine. The performance of the URL filter module directly influences the efficiency of the entire collection system. This paper introduces a URL filter algorithm based on caching and its implementation. The stability and parallelism of the algorithm are verified by experiments on Websites which handle a large number of web pages. Experiment results show that the algorithm proposed in this paper can achieve satisfactory performance through reasonable adjustment of some of its parameters, and that it is especially suitable for URL filtering of a Website which has a large number of page navigator links and index pages.

Keywords - Web Crawler; URL Filter; Caching
I. INTRODUCTION

Accompanied by the informationization of society and the rapid development of the Internet, more and more information is available on the Internet. This makes the search engine, one of the main applications of modern information technology, a necessary tool by which people can get information through the Internet more easily and quickly. A web crawler, which is a central component of a search engine, impacts not only the recall ratio and precision of a search engine, but also its data storage capacity and efficiency.

The World Wide Web can be viewed as a directed graph without rules and borders, and a wide range of cycles exists in it. Due to the large number of index pages and navigation contents in a Website, as well as the large number of relevant content links in a web page, the URLs extracted from web pages are far more numerous than the actual number of distinct URLs on the Internet.

The URL filter module, which is an important component of a crawler, is used to filter the URLs extracted from the web pages downloaded by the crawler of a search engine, so as to improve the efficiency of the crawler. We design a URL filter algorithm based on caching and implement it. Experiment results show that the algorithm proposed in this paper achieves satisfactory performance.

II. A WEB CRAWLER

A Web crawler (also known as a Web spider or a Web robot) is a program or an automated script which can get specified pages from the Internet, extract the links from these pages and follow the links to find new pages and links [1,2]. A Web crawler usually starts from a single URL address or a URL list, visits each page specified by each URL, extracts the links in each page by content analysis, eliminates the repetitive URLs after downloading each page, and then adds the new links to the URL list.

Figure 1. A simplified web crawler (components: Seed URL, Internet, URL List, Page Downloader, Link Extractor, URL Filter, Crawling Parameter Assistor)

Figure 1 shows a simplified Web crawler. According to Figure 1, a Web crawler starts from a URL, called the Seed URL, to visit the Internet. The Page Downloader gets a URL from the URL List, downloads the corresponding page from the Internet, and transfers the page to the Link Extractor. The Page Downloader checks the parameters in accordance with the requirements of crawling to decide whether or not to download a page. As the crawler visits these URLs, the Link Extractor identifies all the hyperlinks in the pages in accordance with the requirements of crawling and transfers them to the URL Filter, which stores the results into the URL List. The Crawling Parameter Assistor provides the parameter settings needed by all parts of the crawler.
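The control flow just described is easy to state in code. The following is a minimal sketch of the Figure 1 loop, not the paper's implementation; the helper names (download_page, extract_links) and the crude regular-expression link extraction are illustrative assumptions.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def download_page(url):
    # Page Downloader: fetch the raw HTML of one page; failures are skipped.
    try:
        return urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    except Exception:
        return None

def extract_links(base_url, html):
    # Link Extractor: a crude href scan; a real crawler would parse the HTML.
    return [urljoin(base_url, h) for h in re.findall(r'href="([^"#]+)"', html)]

def crawl(seed_url, max_pages=100):
    # The URL List, initialized with the Seed URL.
    url_list = deque([seed_url])
    # A plain set stands in for the URL Filter; Sections III and IV refine it.
    seen = {seed_url}
    pages = 0
    while url_list and pages < max_pages:
        url = url_list.popleft()
        html = download_page(url)
        if html is None:
            continue
        pages += 1
        for link in extract_links(url, html):
            if link not in seen:          # eliminate repetitive URLs
                seen.add(link)
                url_list.append(link)     # new links go back into the URL List
    return pages
```

Here the seen set plays the role of the URL Filter; the following sections replace it with the caching structure this paper proposes.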
III. THE COMMON METHODS OF THE URL FILTER

In order to avoid downloading the same page repeatedly, we need to eliminate the repetitive URLs. The URL filter can easily become a performance bottleneck of the crawler system, because the URL list to be queried grows bigger and bigger. The main idea of the URL filter is to determine whether a URL already exists in the known URL list, so the URL query algorithm is the key. The basic process is for each string of
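As a point of reference before the caching algorithm, the membership test named above can be sketched with a single in-memory set. This baseline is our illustration, not a method from the paper, and it shows why the lookup structure becomes a bottleneck: the set must hold every known URL.

```python
class NaiveUrlFilter:
    """Baseline URL filter: one in-memory set holding every known URL.

    Lookup is O(1) on average, but the set must grow with the known URL
    list, which is exactly the bottleneck described above.
    """

    def __init__(self):
        self.known = set()

    def is_new(self, url):
        # Return True (and remember the URL) only the first time it is seen.
        if url in self.known:
            return False
        self.known.add(url)
        return True
```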
1) If there is a URL list of the Websites to be crawled in the database, select the top initNum URLs, ordered descending by their number of repetitions, to initialize the visitingList; otherwise, initialize the visitingList with the index page of the Website.

Figure 2. URL filter algorithm based on caching (flowchart: a parsed URL is looked up in the visitedList, then the visitingList, then the database; each hit increases the URL's duplicated number, and a miss everywhere adds a new URL to the visitingList)
2) To add the URLs parsed from one page, with duplicates eliminated, into the related URL lists in the URL Filter. The specific filter steps are as follows (see the code sketch after step 3): for a parsed URL, search the visitedList of the URL Filter first; if it is found, it is a repetition of some URL, so increase its duplicated number. Otherwise, search the visitingList; if it is found there, it is likewise a repetition, so increase its duplicated number. Otherwise, search the database; if it is not found, it is a new URL, so add it to the visitingList. If it is found in the database, it is a repetition of some URL, so increase its duplicated number and add it to the related URL list according to whether or not it has been visited.

3) If the size of the URL caching lists in the URL Filter is larger than the value set in the crawling parameters, transfer a URL collection of the set capacity into the database for storage; otherwise, load a block of URLs from the database into the URL lists according to the replacement strategy.
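Steps 1-3 can be made concrete with a short sketch. This is a hedged reconstruction rather than the authors' code: the database is modeled as an in-memory dict, the replacement strategy is simplified to FIFO spilling (the load-back path of step 3 is omitted), and the names CachedUrlFilter, init_num, and cache_capacity are illustrative assumptions.

```python
from collections import OrderedDict

class CachedUrlFilter:
    """Illustrative sketch of the caching URL filter (steps 1-3, Figure 2)."""

    def __init__(self, db, index_page, init_num=1000, cache_capacity=10000):
        # db models the external database: {url: (duplicated_number, visited)}.
        self.db = db
        self.cache_capacity = cache_capacity
        self.visited_list = OrderedDict()    # cached URLs already crawled
        if db:
            # Step 1: seed the visitingList with the initNum most-repeated URLs.
            top = sorted(db, key=lambda u: db[u][0], reverse=True)[:init_num]
            self.visiting_list = OrderedDict((u, db[u][0]) for u in top)
        else:
            # Step 1, fallback: start from the index page of the Website.
            self.visiting_list = OrderedDict([(index_page, 0)])

    def is_new(self, url):
        # Step 2: cascade visitedList -> visitingList -> database.
        if url in self.visited_list:          # hit in the visitedList
            self.visited_list[url] += 1
            return False
        if url in self.visiting_list:         # hit in the visitingList
            self.visiting_list[url] += 1
            return False
        if url not in self.db:                # miss everywhere: a new URL
            self.visiting_list[url] = 0
            self._spill()
            return True
        dup, visited = self.db[url]           # hit in the database
        target = self.visited_list if visited else self.visiting_list
        target[url] = dup + 1
        self._spill()
        return False

    def _spill(self):
        # Step 3 (simplified): when a cached list outgrows its capacity,
        # move its oldest entries back to the database (FIFO replacement).
        for lst, visited in ((self.visiting_list, False),
                             (self.visited_list, True)):
            while len(lst) > self.cache_capacity:
                url, dup = lst.popitem(last=False)
                self.db[url] = (dup, visited)
```

Starting from an empty database, is_new returns True exactly once per distinct URL, while the accumulated duplicated numbers supply the repetition statistics used in step 1 and in the evaluation below.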
V. ALGORITHM IMPLEMENTATION AND PERFORMANCE EVALUATION

The performance of the URL filter algorithm based on caching, which is used in large-scale webpage collection to eliminate duplication, was analyzed and validated through experiments on the prototype system.

In the experiments, having set the parameters (caching capacity, exchange rate between cache and external storage, and depth of crawling) according to different crawling situations, we obtained basic data from a wide variety of Websites, and then derived the URL filter data from a synthetic analysis of the basic data. The effect of the URL filter algorithm implemented in this paper is compared in Figure 3, in which the X-coordinate is the average links of a page (the average number of effective links per page in a Website) and the Y-coordinate is the rate of hitting the cache (the ratio of pending URLs that can be found in the cache).

Figure 3. The effect comparison of URL filter (X-axis: the average links of a page; Y-axis: the rate of hitting the cache, 50-100%; series: First and Non-First crawls)
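The two quantities plotted in Figure 3 reduce to simple counters; a small sketch under assumed names (links_extracted, cache_hits, lookups, none of which appear in the paper) follows.

```python
def average_links_of_a_page(links_extracted, pages_downloaded):
    # X-coordinate of Figure 3: effective links per downloaded page.
    return links_extracted / pages_downloaded

def rate_of_hitting_the_cache(cache_hits, lookups):
    # Y-coordinate of Figure 3: share of pending URLs resolved in the cache, %.
    return 100.0 * cache_hits / lookups

# Worked example with the http://www.jtstar.com row of Table I below:
# a first-crawl hit rate of 89.7% over 13986 links means roughly
# 0.897 * 13986, i.e. about 12545 pending URLs were resolved in the cache.
print(rate_of_hitting_the_cache(12545, 13986))  # ~89.7
```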
We know from the experimental data that the higher the average links of a page, the higher the rate of hitting the cache and the more efficient the URL filter. In particular, when a Website is crawled again, the rate of hitting the cache increases significantly and the performance of the URL filter improves accordingly. It is therefore very efficient to crawl a Website which has a large number of page navigator links and index pages by using the web crawler URL filter algorithm based on caching. In addition, a large amount of experimental data shows that the algorithm performs with consistent stability and parallelism while crawling all types of Websites. The experimental data are shown in Table I.

The experimental data in Table I indicate that as the number of links increases, the rate of hitting the cache also increases; in particular, the higher the average links of a page, the higher the efficiency of eliminating duplication. The prototype system, which centers on the database storage, has more than one processing node, and each node can crawl in parallel. The duplicate-elimination performance of each crawling node does not decrease as nodes are added, and the processing performance of the whole system is influenced only by database concurrency and the network environment, so the stability and parallelism of the algorithm are good.
Table I. URL filter data for different types of Websites

Website URL                   NL      NLAF    RHCF (%)   RHCNF (%)
http://cs.scu.edu.cn           1408    399     59.5       100
http://www.jtstar.com         13986   1321    89.7       96.6
http://www.scu.edu.cn         23550   7837    64.5       77.9
http://news.sina.com.cn       38763   11861   67.1       78.7
www.xinhuanet.com             61918   10368   68.0       88.6

where:
NL — Number of Links
NLAF — Number of Links After Filtering
RHCF — the Rate of Hitting the Cache on the First crawl (%)
RHCNF — the Rate of Hitting the Cache on a Non-First crawl (%)

VI. CONCLUSION

This paper introduces a URL filter algorithm based on caching and its implementation. The stability and parallelism of the algorithm are verified by experiments on Websites which handle a large number of web pages. Experiment results show that the algorithm proposed in this paper can achieve satisfactory performance through reasonable adjustment of some of its parameters, and that it is especially suitable for URL filtering of a Website which has a large number of page navigator links and index pages.
REFERENCES

[1] Yuan Wan and Hengqing Tong, "URL Assignment Algorithm of Crawler in Distributed System Based on Hash," IEEE International Conference on Networking, Sensing and Control, 2008.

[2] Wu Lihui, Wang Bin, and Yu Zhihua, "Design and Realization of a General Web Crawler," Computer Engineering, February 2005, pp. 123-124.

[3] Christopher Martinez, Wei-Ming Lin, and Parimal Patel, "Optimal XOR Hashing for a Linearly Distributed Address Lookup in Computer Networks," Symposium on Architecture for Networking and Communications Systems, 2005, pp. 203-210.

[4] Xiao-Guang Liu and Jun Lee, "K-Divided Bloom Filter Algorithm and Its Analysis," Future Generation Communication and Networking (FGCN 2007), 2007, 1(6-8), pp. 220-224.