19EAI441: Web Mining
Module: IV
Web Crawling
Syllabus
Module: IV
Web Crawling – A basic Crawler Algorithm: Breadth-First
Crawlers and Preferential Crawlers. Implementation Issues –
Fetching, Parsing, stop word removal and stemming, link
extraction and canonicalization, spider traps, Page repository and
concurrency.
Web Crawling:
Web crawlers, also known as spiders or robots, are programs that
automatically download Web pages.
Well-known search engines such as Google, Yahoo!, and MSN run
very efficient universal crawlers designed to gather all pages
irrespective of their content.
Other crawlers, sometimes called preferential crawlers, are more
targeted. They attempt to download only pages of certain types or
topics.
A Basic Crawler Algorithm:
In its simplest form, a crawler starts from a set of seed pages
(URLs) and then uses the links within them to fetch other pages.
The links in these pages are, in turn, extracted and the corresponding
pages are visited.
The process repeats until a sufficient number of pages are visited
or some other objective is achieved.
This simple description hides many delicate issues related to
network connections, spider traps, URL canonicalization, page
parsing, and crawling ethics.
The figure below shows the flow of a basic sequential crawler.
Such a crawler fetches one page at a time, making inefficient use of
its resources.
Fig: Flow chart of a basic sequential crawler. The main data
operations are shown on the left, with dashed arrows.
The crawler maintains a list of unvisited URLs called the frontier.
The list is initialized with seed URLs which may be provided by
the user or another program.
In each iteration of its main loop, the crawler picks the next URL
from the frontier, fetches the page corresponding to the URL through
HTTP, parses the retrieved page to extract its URLs, adds newly
discovered URLs to the frontier, and stores the page (or other
extracted information, possibly index terms) in a local disk
repository.
The crawling process may be terminated when a certain number
of pages have been crawled.
The crawler may also be forced to stop if the frontier becomes
empty, although this rarely happens in practice due to the high
average number of links (on the order of ten out-links per page across
the Web).
A crawler is, in essence, a graph search algorithm. The Web can be
seen as a large graph with pages as its nodes and hyperlinks as its
edges.
A crawler starts from a few of the nodes (seeds) and then follows
the edges to reach other nodes.
Note that given some maximum size, the frontier will fill up
quickly due to the high fan-out of pages.
Even more importantly, the crawler algorithm must specify the
order in which new URLs are extracted from the frontier to be
visited.
These mechanisms determine the graph search algorithm
implemented by the crawler.
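To make this concrete, the following is a minimal sketch of such a sequential crawler in Python, using only the standard library. The seed list, the page budget MAX_PAGES, the LinkExtractor helper, and the in-memory repository dictionary are illustrative assumptions rather than part of the algorithm as stated above; a production crawler would add the refinements (canonicalization, spider-trap avoidance, concurrency) discussed under implementation issues.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

MAX_PAGES = 100                       # assumed termination condition: page budget

class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags while parsing a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds):
    frontier = list(seeds)            # list of unvisited URLs (the frontier)
    visited = set()                   # URLs whose pages have been fetched
    repository = {}                   # local repository: URL -> page text
    while frontier and len(visited) < MAX_PAGES:
        url = frontier.pop(0)         # pick the next URL; the picking order
                                      # determines the search strategy (see below)
        if url in visited:
            continue
        try:                          # fetch the page through HTTP
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                  # skip pages that cannot be fetched
        visited.add(url)
        repository[url] = page        # store the page (or extracted index terms)
        parser = LinkExtractor()
        parser.feed(page)             # parse the page to extract its links
        for link in parser.links:
            absolute = urljoin(url, link)     # resolve relative links
            if absolute not in visited:       # add newly discovered URLs
                frontier.append(absolute)
    return repository
```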
Breadth-First Crawlers:
The frontier may be implemented as a first-in-first-out (FIFO)
queue, corresponding to a breadth-first crawler.
The URL to crawl next comes from the head of the queue and
new URLs are added to the tail of the queue.
Once the frontier reaches its maximum size, the breadth-first
crawler can add to the queue only one unvisited URL from each
new page crawled.
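As a sketch, assuming a frontier bound MAX_FRONTIER and helper functions named here only for illustration, the FIFO policy might look as follows:

```python
from collections import deque

MAX_FRONTIER = 10000                  # assumed maximum frontier size

def next_url(frontier):
    """The URL to crawl next comes from the head of the FIFO queue."""
    return frontier.popleft()

def enqueue_links(frontier, new_links, visited):
    """Add new URLs to the tail of the queue; once the frontier is full,
    admit at most one unvisited URL from the newly crawled page."""
    budget = len(new_links) if len(frontier) < MAX_FRONTIER else 1
    for link in new_links:
        if budget == 0:
            break
        # a membership test on a deque is O(n); in practice a hash set mirrors
        # the frontier for O(1) duplicate checks, as discussed below
        if link not in visited and link not in frontier:
            frontier.append(link)
            budget -= 1

# usage sketch: frontier = deque(seed_urls); url = next_url(frontier); ...
```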
Because a breadth-first crawler follows links in the order in which
they are discovered, pages with many incoming links tend to be reached
early in the crawl. It is therefore not surprising that the order in
which pages are visited by a breadth-first crawler is highly correlated
with their PageRank or indegree values.
An important implication of this phenomenon is an intrinsic bias of
search engines to index well connected pages.
Topical locality measures indicate that pages in the link neighborhood
of a seed page are much more likely to be related to the seed page than
randomly selected pages.
These and other types of bias are important to universal crawlers.
As mentioned earlier, only unvisited URLs are to be added to the
frontier. This requires some data structure to be maintained with
visited URLs.
The crawl history is a time-stamped list of URLs fetched by the
crawler tracking its path through the Web.
A URL is entered into the history only after the corresponding
page is fetched. This history may be used for post-crawl analysis and
evaluation.
Besides being logged, the set of visited URLs is kept in memory for
fast look-up, so the crawler can check whether a page has already been
crawled. This check is required to avoid revisiting pages or wasting
space in the limited-size frontier. Typically a hash table is
appropriate to obtain quick URL insertion and look-up times (O(1)).
The look-up process assumes that one can identify two URLs
effectively pointing to the same page.
Another important detail is the need to prevent duplicate URLs
from being added to the frontier.
A separate hash table can be maintained to store the frontier
URLs for fast look-up to check whether a URL is already in it.
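The bookkeeping just described might be sketched as follows; the normalise() rules shown here (lower-casing the scheme and host, dropping the fragment and the default HTTP port) are illustrative assumptions about how two URLs pointing to the same page can be identified, a topic treated in more detail under canonicalization.

```python
import time
from urllib.parse import urlsplit, urlunsplit

def normalise(url):
    """Reduce a URL to a canonical form so that two URLs effectively
    pointing to the same page compare equal (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]            # drop the default HTTP port
    path = parts.path or "/"            # empty path and "/" are the same page
    return urlunsplit((scheme, netloc, path, parts.query, ""))  # drop fragment

history = []          # time-stamped list of URLs fetched (the crawl history)
visited = set()       # hash table of visited URLs: O(1) insert and look-up
in_frontier = set()   # hash table mirroring the frontier contents

def record_fetch(url):
    """Enter a URL into the history only after its page has been fetched."""
    url = normalise(url)
    history.append((time.time(), url))
    visited.add(url)

def should_enqueue(url):
    """Add a URL to the frontier only if it is neither visited nor queued."""
    url = normalise(url)
    return url not in visited and url not in in_frontier
```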
Preferential Crawlers:
A different crawling strategy is obtained if the frontier is
implemented as a priority queue rather than a FIFO queue.
Typically, preferential crawlers assign each unvisited link a
priority based on an estimate of the value of the linked page.
The estimate can be based on topological properties (e.g., the
indegree of the target page), content properties (e.g., the similarity
between a user query and the source page), or any other combination
of measurable features.
If pages are visited in the order specified by the priority values in
the frontier, then we have a best-first crawler.
The priority queue may be a dynamic array that is always kept
sorted by URL scores. At each step, the best URL is picked from
the head of the queue.
Once the corresponding page is fetched, the URLs extracted
from it must, in turn, be scored.
They are then added to the frontier in such a manner that the
sorting order of the priority queue is maintained.
As for breadth-first, best-first crawlers also need to avoid
duplicate URLs in the frontier.
Keeping a separate hash table for look-up is an efficient way to
achieve this.
The time complexity of inserting a URL into the priority queue is
O(logF), where F is the frontier size (looking up the hash requires
constant time).
To dequeue a URL, it must first be removed from the priority
queue (O(logF)) and then from the hash table (again O(1)).
Thus the parallel use of the two data structures yields a
logarithmic total cost per URL.
Once the frontier’s maximum size is reached, only the best
URLs are kept; the frontier must be pruned after each new set of
links is added.
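A minimal sketch of such a best-first frontier, assuming a placeholder score() function and a bound MAX_FRONTIER, is given below. Python's heapq is a min-heap, so priorities are negated to pop the highest-scoring URL first; each insertion and removal costs O(log F) in the heap plus O(1) in the hash set, matching the analysis above.

```python
import heapq

MAX_FRONTIER = 10000                  # assumed maximum frontier size

def score(url, source_page):
    """Placeholder estimate of the value of the linked page (assumed)."""
    return 0.0

class BestFirstFrontier:
    def __init__(self):
        self.heap = []                # (negated score, URL) pairs
        self.members = set()          # hash table for O(1) duplicate look-up

    def add(self, url, priority):
        """Insert a newly scored URL: O(log F) push plus O(1) duplicate check."""
        if url in self.members:
            return                    # avoid duplicate URLs in the frontier
        heapq.heappush(self.heap, (-priority, url))
        self.members.add(url)

    def prune(self, max_size=MAX_FRONTIER):
        """Keep only the best URLs; called after each new set of links is added."""
        if len(self.heap) <= max_size:
            return
        best = heapq.nsmallest(max_size, self.heap)   # smallest negated = best score
        self.members = {url for _, url in best}
        self.heap = best              # a sorted list already satisfies the heap invariant

    def pop_best(self):
        """Remove the best URL: O(log F) pop plus O(1) hash removal."""
        neg_score, url = heapq.heappop(self.heap)
        self.members.discard(url)
        return url, -neg_score
```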
Implementation Issues:
• Fetching, parsing, stop word removal and stemming, link extraction
and canonicalization, spider traps, page repository, and concurrency.