Java Web Crawler
I. INTRODUCTION
The World Wide Web currently contains nearly 2 billion indexed web pages [1], a figure that does not count un-indexed pages, whose number is far larger. The primary method of indexing and searching these sites is the web crawler: an application that
systematically browses the web, going from link to link, and
indexing the sites it comes upon. These sites can then be stored in memory (for smaller, single-use crawlers) or in a database for later traversal. One well-known web crawler is Googlebot, which Google uses to crawl and index the sites that later appear in its search engine [2]. There are multiple theories about how to achieve the best crawl performance; in practice, however, the situation at hand determines the most appropriate and efficient crawling method.
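To make the link-to-link traversal concrete, a minimal breadth-first crawl in Java might look like the following sketch. It uses only the JDK's java.net.http client (Java 11+) and a crude regular expression for link extraction; the seed URL and page limit are illustrative assumptions, not details taken from this paper.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal breadth-first crawler: fetch a page, extract its links,
// and enqueue any URL that has not been seen before.
public class SimpleCrawler {
    // Crude link extraction; a real crawler would use an HTML parser.
    private static final Pattern HREF =
        Pattern.compile("href=[\"'](https?://[^\"']+)[\"']");

    public static void main(String[] args) {
        String seed = "https://example.com"; // hypothetical starting point
        int maxPages = 50;                   // bound the crawl for this sketch

        HttpClient client = HttpClient.newHttpClient();
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>(); // in-memory record of seen URLs
        frontier.add(seed);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue; // already indexed, skip it
            try {
                HttpRequest request =
                    HttpRequest.newBuilder(URI.create(url)).GET().build();
                String body = client
                    .send(request, HttpResponse.BodyHandlers.ofString())
                    .body();
                System.out.println("Indexed: " + url);

                // Follow every absolute link found on the page.
                Matcher m = HREF.matcher(body);
                while (m.find()) {
                    frontier.add(m.group(1));
                }
            } catch (Exception e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
    }
}

Here the visited set plays the role of the in-memory storage mentioned above; swapping it and the frontier for database-backed equivalents turns the sketch into the persistent variant.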
With many web crawlers, the process is to visit a top-level
domain, search for links within the page, and follow those
links. Meanwhile, the crawler caches these links in a database of some sort and passes control of that database to a searching application, potentially a search engine, to use for searching. This approach is highly efficient if the goal is to build a database that will be used many times over and the storage space is available. However, it can be excessive for a simple, one-time query over a specific domain.
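The paper does not specify a storage schema, but the caching step could be sketched as follows, assuming a SQLite database reached through JDBC (the sqlite-jdbc driver on the classpath, and a hypothetical links table):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Persists discovered links so a separate searching application
// can traverse or query them later without re-crawling.
public class LinkStore implements AutoCloseable {
    private final Connection conn;

    public LinkStore(String jdbcUrl) throws SQLException {
        conn = DriverManager.getConnection(jdbcUrl);
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS links ("
                     + "url TEXT PRIMARY KEY, "       // one row per unique URL
                     + "source TEXT, "                // page the link was found on
                     + "crawled INTEGER DEFAULT 0)"); // 0 = queued, 1 = visited
        }
    }

    // Insert a newly discovered link; duplicates are ignored
    // because url is the primary key (SQLite's INSERT OR IGNORE).
    public void save(String url, String source) throws SQLException {
        String sql = "INSERT OR IGNORE INTO links(url, source) VALUES(?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, url);
            ps.setString(2, source);
            ps.executeUpdate();
        }
    }

    @Override
    public void close() throws SQLException {
        conn.close();
    }
}

A search application handed such a database can read the rows where crawled = 0 to continue the traversal, or query the full table repeatedly without fetching any page again.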
The crawler discussed in this paper systematically crawls
and matches query strings based on a user's input on a case-by-case basis. The user inputs their targeted domain, the query they are looking for, and their choices for searching.
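As an illustration of that interaction, the sketch below reads a target domain and a query string from the console and reports whether the query occurs on the fetched page; the class name, prompts, and case-insensitive matching are assumptions made for the example, not the paper's actual implementation.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Scanner;

// Prompts the user for a domain and a query string, then checks
// whether the query appears on the fetched page (case-insensitive).
public class QueryMatcher {
    public static void main(String[] args) throws Exception {
        Scanner in = new Scanner(System.in);
        System.out.print("Target domain: ");
        String domain = in.nextLine().trim(); // e.g. https://example.com
        System.out.print("Query string: ");
        String query = in.nextLine().trim();

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request =
            HttpRequest.newBuilder(URI.create(domain)).GET().build();
        String body = client
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body();

        boolean match = body.toLowerCase().contains(query.toLowerCase());
        System.out.println(match
            ? "Match found on " + domain
            : "No match on " + domain);
    }
}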