UNIT III-Web Crawlers Why Do We Need Web Crawlers?

UNIT III- Web Crawlers
Why Do We Need Web Crawlers?
Web crawlers, also known as spiders or robots, are programs that automatically download Web
pages. Since information on the Web is scattered among billions of pages served by millions of servers
around the globe, users who browse the Web can follow hyperlinks to access information, virtually
moving from one page to the next.
A crawler can visit many sites to collect information that can be analyzed and mined in a central location,
either online (as it is downloaded) or off-line (after it is stored). Were the Web a static collection of pages,
we would have little long term use for crawling. Once all the pages are fetched and saved in a repository,
we are done. However, the Web is a dynamic entity evolving at rapid rates.
Hence there is a continuous need for crawlers to help applications stay current as pages and links are
added, deleted, moved or modified.
Main Uses of Web Crawlers:
There are many applications for Web crawlers. One is business intelligence, whereby organizations
collect information about their competitors and potential collaborators. Another use is to monitor Web
sites and pages of interest, so that a user or community can be notified when new information appears in
certain places. There are also malicious applications of crawlers,
for example, that harvest email addresses to be used by spammers or collect personal information to be
used in phishing and other identity theft attacks. The most widespread use of crawlers is, however, in
support of search engines.
Types of Web Crawlers:
In fact, crawlers are the main consumers of Internet bandwidth. They collect pages for search engines to
build their indexes. Well known search engines such as Google, Yahoo! and MSN run very efficient
universal crawlers designed to gather all pages irrespective of their content.
Other crawlers, sometimes called preferential crawlers, are more targeted. They attempt to download
only pages of certain types or topics.
Some of the crawlers
SCRAPY, HERITRIX, APACHE NUTCH, HTTRACK
Web Crawler and a Search Engine
Web Crawler: The Data Collector
 Role: A web crawler is a program that automatically scans and downloads web pages.
 Function: It follows links from one page to another, collecting data about the content of
each page.
 Example: Think of a web crawler as a librarian who goes around gathering books from
all over the world to bring back to the library.
Search Engine: The Data Organizer and Retriever
 Role: A search engine is a software system that organizes, indexes, and retrieves
information from the internet based on user queries.
 Function: It uses the data collected by the web crawler to create an index, which allows
it to quickly find relevant web pages when a user searches for something.
 Example: The search engine is like the catalog system in a library that helps you find the
book you need based on the librarian’s collection.
Example Workflow
1. Crawling:
o The crawler visits a new blog post about "10 Best Pizza Places in New York".
o It downloads the content and follows links to other related pages, such as
restaurant websites.
2. Indexing:
o The search engine processes the downloaded content, extracting key information
about pizza places and adding it to its index.
3. Query Processing and Ranking:
o A user searches for "best pizza places in New York".
o The search engine looks up its index and finds the blog post along with other
relevant pages.
o It ranks the pages based on factors like relevance, quality of content, and the
number of other sites linking to them.
4. Retrieval:
o The search engine displays the search results, with the blog post about "10 Best
Pizza Places in New York" appearing near the top.
A Basic Crawler Algorithm
In its simplest form, a crawler starts from a set of seed pages (URLs) and then uses the links
within them to fetch other pages. The links in these pages are, in turn, extracted and the
corresponding pages are visited. The process repeats until a sufficient number of pages are
visited or some other objective is achieved.
This simple description hides many delicate issues related to network connections, spider traps,
URL canonicalization, page parsing, and crawling ethics. In fact, Google founders Sergey Brin
and Lawrence Page, in their seminal paper, identified the Web crawler as the most sophisticated
yet fragile component of a search engine.
Figure shows the flow of a basic sequential crawler. Such a crawler fetches one page at a
time, making inefficient use of its resources.
The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with
seed URLs which may be provided by the user or another program. In each iteration of its main
loop, the crawler picks the next URL from the frontier, fetches the page corresponding to the
URL through HTTP, parses the retrieved page to extract its URLs, adds newly discovered
URLs to the frontier, and stores the page (or other extracted information, possibly index terms)
in a local disk repository.
The crawling process may be terminated when a certain number of pages have been crawled.
The crawler may also be forced to stop if the frontier becomes empty, although this rarely
happens in practice due to the high average number of links (on the order of ten out-links per
page across the Web).
A crawler is, in essence, a graph search algorithm. The Web can be seen as a large graph with
pages as its nodes and hyperlinks as its edges. A crawler starts from a few of the nodes (seeds)
and then follows the edges to reach other nodes. The process of fetching a page and extracting
the links within it is analogous to expanding a node in graph search.
The frontier is the main data structure, which contains the URLs of unvisited pages. Typical
crawlers attempt to store the frontier in the main memory for efficiency. Based on the declining
price of memory and the spread of 64-bit processors, quite a large frontier size is feasible. Yet
the crawler designer must decide which URLs have low priority and thus get discarded when
the frontier is filled up. Note that given some maximum size, the frontier will fill up quickly due
to the high fan-out of pages. Even more importantly, the crawler algorithm must specify the
order in which new URLs are extracted from the frontier to be visited. These mechanisms
determine the graph search algorithm implemented by the crawler.
URL canonicalization is the process of converting different URLs that point to the same content
into a single, standard URL to avoid duplication and ensure consistency. This is important
because the same web page can often be accessed through multiple URLs due to variations like
"http" vs. "https", "www" vs. non-www, trailing slashes, and URL parameters. For example,
"http://example.com", "https://example.com", and "http://www.example.com/index.html" might
all lead to the same content. Canonicalization helps search engines and web crawlers understand
that these different URLs are actually the same page, which consolidates link equity and prevents
issues with duplicate content. This process typically involves selecting a preferred URL
(canonical URL) and using techniques like redirects, the rel="canonical" tag, and consistent
internal linking to guide both users and search engines to the correct, canonical version of the
URL.
Breadth-First Crawler:
The frontier may be implemented as a first-in-first-out (FIFO) queue, corresponding to a
breadth-first crawler. The URL to crawl next comes from the head of the queue and new URLs
are added to the tail of the queue. Once the frontier reaches its maximum size, the breadth-first
crawler can add to the queue only one unvisited URL from each new page crawled. The breadth-
first strategy does not imply that pages are visited in “random” order.
To understand why, we have to consider the highly skewed, long-tailed distribution of in degree
in the Web graph. Some pages have a number of links pointing to them that are orders of
magnitude larger than the mean.
Popular pages on the web have many links pointing to them, making them act like magnets for
breadth-first crawlers. As a result, these crawlers visit popular pages earlier and more frequently.
This behavior is closely related to PageRank or the number of incoming links (indegree) a page
has. Consequently, search engines tend to index these well-connected pages first, showing an
inherent bias towards popular content.
Breadth-first crawlers aren't random because the choice of initial seed pages significantly
influences their behavior. Pages linked to by these seed pages are often related in topic, meaning
the crawler tends to stay within the same topical area. This introduces a bias, as the crawler is
more likely to visit pages similar to the seed pages rather than a random assortment of pages.
The crawl history is a time-stamped list of URLs fetched by the crawler tracking its path
through the Web. A URL is entered into the history only after the corresponding page is fetched.
This history may be used for post-crawl analysis and evaluation. For example, we want to see if
the most relevant or important resources are found early in the crawl process. While history may
be stored on disk, it is also maintained as an in-memory data structure for fast look-up, to check
whether a page has been crawled or not. This check is required to avoid revisiting pages or
wasting space in the limited-size frontier. Typically a hash table is appropriate to obtain quick
URL insertion and look-up times (O(1)). The look-up process assumes that one can identify two
URLs effectively pointing to the same page.
Preferential Crawlers
A different crawling strategy is obtained if the frontier is implemented as a priority queue

rather than a FIFO queue. Typically, preferential crawlers assign each unvisited link a
priority based on an estimate of the value of the linked page. The estimate can be based on
topological properties (e.g., the in degree of the target page), content properties (e.g., the
similarity between a user query and the source page), or any other combination of measurable
features. For example, the goal of a topical crawler is to follow edges that are expected to lead to
portions of the Web graph that are relevant to a user-selected topic. The choice of seeds is even
more important in this case than for breadth-first crawlers.
For now let us simply assume that some function exists to assign a priority value or score to each
unvisited URL. If pages are visited in the order specified by the priority values in the frontier,
then we have a best-first crawler.
The priority queue may be a dynamic array that is always kept sorted by URL scores. At each
step, the best URL is picked from the head of the queue. Once the corresponding page is fetched,
the URLs extracted from it must, in turn, be scored. They are then added to the frontier in such a
manner that the sorting order of the priority queue is maintained. As for breadth-first, best-first
crawlers also need to avoid duplicate URLs in the frontier.
Keeping a separate hash table for look-up is an efficient way to achieve this. The time
complexity of inserting a URL into the priority queue is O(logF), where F is the frontier size
(looking up the hash requires constant time). To dequeue a URL, it must first be removed from
the priority queue (O(logF)) and then from the hash table (again O(1)).
Thus the parallel use of the two data structures yields a logarithmic total cost per URL. Once the
frontier’s maximum size is reached, only the best URLs are kept; the frontier must be pruned
after each new set of links is added.
Implementation Issues
1. Fetching
To fetch pages, a crawler acts as a Web client; it sends an HTTP request to the server hosting the
page and reads the response. The client needs to timeout connections to prevent spending
unnecessary time waiting for responses from slow servers or reading huge pages. In fact, it is
typical to restrict downloads to only the first 10-100 KB of data for each page. The client parses
the response headers for status codes and redirections.
Redirect loops are to be detected and broken by storing URLs from a redirection chain in a hash
table and halting if the same URL is encountered twice. One may also parse and store the last-
modified header to determine the age of the document, although this information is known to
be unreliable.
Error-checking and exception handling is important during the page fetching process since the
same code must deal with potentially millions of remote servers. In addition, it may be beneficial
to collect statistics on timeouts and status codes to identify problems or automatically adjust
timeout values. (like 200 for success or 404 for not found) .
Programming languages such as Java, Python and Perl provide simple programmatic interfaces
for fetching pages from the Web. However, one must be careful in using high-level interfaces
where it may be harder to detect lower-level problems. For example, a robust crawler in Perl
should use the Socket module to send HTTP requests rather than the higher-level LWP library
(the World-Wide Web library for Perl). The latter does not allow fine control of connection
timeouts.
2 Parsing
Once (or while) a page is downloaded, the crawler parses its content, i.e., the HTTP payload, and
extracts information both to support the crawler’s master application (e.g., indexing the page if
the crawler supports a search engine) and to allow the crawler to keep running (extracting links
to be added to the frontier). Parsing may imply simple URL extraction from hyperlinks, or more
involved analysis of the HTML code.
The Document Object Model (DOM) establishes the structure of an HTML page as a tag tree, as
illustrated in Fig.. HTML parsers build the tree in a depth-first manner, as the HTML source
code of a page is scanned linearly.
Unlike program code, which must compile correctly or else will fail with a syntax error,
correctness of HTML code tends to be laxly enforced by browsers. Even when HTML standards
call for strict interpretation, de facto standards imposed by browser implementations are very
forgiving.
This, together with the huge population of non-expert authors generating Web pages, imposes
significant complexity on a crawler's HTML parser.
Many pages are published with missing required tags, tags improperly nested, missing close tags,
misspelled or missing attribute names and values, missing quotes around attribute values,
unescaped special characters, and so on. As an example, the double quotes character in HTML is
reserved for tag syntax and thus is forbidden in text.
The special HTML entity " is to be used in its place. However, only a small number of
authors are aware of this, and a large fraction of Web pages contains this illegal character. Just
like browsers, crawlers must be forgiving in these cases; they cannot afford to discard many
important pages as a strict parser would do.
A wise preprocessing step taken by robust crawlers is to apply a tool such as tidy
(www.w3.org/People/Raggett/tidy) to clean up the HTML content prior to parsing. To add to
the complexity, there are many coexisting HTML and XHTML reference versions. However, if
the crawler only needs to extract links within a page and/or the text in the page, simpler parsers
may suffice. The HTML parsers available in high-level languages such as Java and Perl are
becoming increasingly sophisticated and robust.
A growing portion of Web pages are written in formats other than HTML. Crawlers supporting
large-scale search engines routinely parse and index documents in many open and proprietary
formats such as plain text, PDF, Microsoft Word and Microsoft PowerPoint.
Depending on the application of the crawler, this may or may not be required. Some formats
present particular difficulties as they are written exclusively for human interaction and thus are
especially unfriendly to crawlers.
For instance, some commercial sites use graphic animations in Flash; these are difficult
for a crawler to parse in order to extract links and their textual content. Other examples include
image maps and pages making heavy use of Javascript for interaction.
New challenges are going to come as new standards such as Scalable Vector Graphics (SVG),
Asynchronous Javascript and XML (AJAX), and other XML-based languages gain popularity.
3. Stopword Removal and Stemming

When parsing a Web page to extract the content or to score new URLs suggested by the page, it
is often helpful to remove so-called stopwords, i.e., terms such as articles and conjunctions,
which are so common that they hinder the discrimination of pages on the basis of content.
Another useful technique is stemming, by which morphological variants of terms are conflated
into common roots (stems). In a topical crawler where a link is scored based on the similarity
between its source page and the query, stemming both the page and the query helps improve the
matches between the two sets and the accuracy of the scoring function.
4. Link Extraction and Canonicalization
HTML Parsing:
 HTML parsers identify tags and attribute-value pairs in web pages.

 To extract hyperlinks, parsers look for <a> tags and grab href attributes.
URL Filtering:
Filter URLs to exclude unwanted file types using white lists (e.g., only follow links to text/html
content pages) or black lists (e.g., discard links to PDF files)).
 File extensions can help but are unreliable, so an HTTP HEAD request can check the
content-type header instead.
Dynamic Content:
 Dynamic pages, often indicated by special characters or directory names (like /cgi-
bin/), were previously filtered out. However, this is less common now due to the
widespread use of dynamic content.
Converting Relative URLs:
 Relative URLs need to be converted to absolute URLs for accuracy.

 Example: Convert news/today.html on http://www.somehost.com/index.html to
http://www.somehost.com/news/today.html.
 Base URL is used for this conversion and can be specified by HTTP headers, meta-tags,
or default to the source page's directory.
URL Canonicalization:
 Canonicalization standardizes URLs to a consistent format.

 Steps include specifying port numbers consistently and converting relative URLs.
 Heuristic rules help detect when different URLs point to the same page to avoid
duplication.
 Consistent canonicalization ensures reliable crawling and indexing.
5. Spider Trap
A spider trap is a situation where a web crawler gets stuck in an infinite loop, repeatedly
fetching the same or similar pages. This happens when a website generates an endless
number of URLs, often through dynamic content or poorly designed pagination, causing the
crawler to waste resources without making meaningful progress.
 A crawler must be aware of spider traps. These are Web sites where the URLs of dynamically
created links are modified based on the sequence of actions taken by the browsing user (or
crawler). Some e-commerce sites such as Amazon.com may use URLs to encode which sequence
of products each user views. This way, each time a user clicks a link, the server can log detailed
information on the user's shopping behavior for later analysis.
 As an illustration, consider a dynamic page for product x, whose URL path is /x and that contains
a link to product y. The URL path for this link would be /x/y to indicate that the user is going
from page x to page y. Now suppose the page for y has a link back to product x. The dynamically
created URL path for this link would be /x/y/x, so that the crawler would think this is a new page
when in fact it is an already visited page with a new URL. As a side effect of a spider trap, the
server may create an entry in a database every time the user (or crawler) clicks on certain
dynamic links.
Example of a Spider Trap:
Infinite URL Generation:
 Scenario: A website uses a calendar interface to display events. Each time the crawler
requests a different date, the server generates a new URL, such as:
o http://example.com/events?date=2024-07-01
o http://example.com/events?date=2024-07-02
o http://example.com/events?date=2024-07-03 and so on.
 Trap: The crawler keeps finding new dates to request, leading to an infinite number of
URLs to visit, even if there are no new events. This can also happen with pagination
where the crawler continually follows "next" links:
o http://example.com/page1
o http://example.com/page2
o http://example.com/page3 and so forth, without end.
Universal Crawlers
General purpose search engines use Web crawlers to maintain their indices , amortizing the cost of
crawling and indexing over the millions of queries received between successive index updates (though
indexers are designed for incremental updates).
This passage discusses the differences between two types of web crawlers: large-scale universal
crawlers and concurrent breadth-first crawlers. Here's a clearer breakdown:
1. Performance
 Large-scale Universal Crawlers: These crawlers are designed to handle massive

workloads. They need to fetch and process hundreds of thousands of web pages every
second. This requires sophisticated architecture, such as distributed systems, efficient
data storage, and optimized algorithms to manage such a high volume of data
without slowing down.
 Concurrent Breadth-First Crawlers: These crawlers typically work by exploring one
level of a website's links before moving on to the next level, like going from one layer of
a web to the next. They operate concurrently, meaning they handle multiple tasks at the
same time, but their performance is generally on a smaller scale compared to large-scale
crawlers.
2. Policy
 Large-scale Universal Crawlers: These crawlers aim to cover a vast portion of the web,
prioritizing the most important or popular pages. They also need to keep their data up-
to-date by regularly revisiting and updating their index (a database of web content).
However, there is a challenge here: trying to cover as many pages as possible while
also keeping the information fresh. Since these goals can conflict (focusing on new
pages might mean missing updates on old ones), the crawlers need to be designed in a
way that balances these objectives effectively.
 Concurrent Breadth-First Crawlers: These crawlers don't necessarily have the same
ambitious goals. They may focus on specific websites or smaller parts of the web, so their
policies are less complex and they don't need to make as many tradeoffs between
coverage and freshness.
Large-scale universal crawlers are built for high performance and wide coverage, but they have
to carefully balance between covering a broad range of pages and keeping their index updated.
Concurrent breadth-first crawlers are simpler and operate on a smaller scale.
Scalability
Figure illustrates the architecture of a large-scale crawler. The most important change from the
concurrent model discussed earlier is the use of asynchronous sockets in place of threads or processes
with synchronous sockets.
Asynchronous sockets are non-blocking, so that a single process or thread can keep hundreds of
network connections open simultaneously and make efficient use of network bandwidth. Not only
does this eliminate the overhead due to managing threads or processes, it also makes locking access to
shared data structures unnecessary.
Instead, the sockets are polled to monitor their states(polling refers to the process of continuously
checking the status of multiple network connections to see if they are ready to perform a specific
operation, such as reading data or writing data.).
When an entire page has been fetched into memory, it is processed for link extraction and indexing. This
“pull” model eliminates contention for resources and the need for locks.
The frontier manager can improve the efficiency of the crawler by maintaining several parallel
queues, where the URLs in each queue refer to a single server.
In addition to spreading the load across many servers within any short time interval, this approach allows
to keep connections with servers alive over many page requests, thus minimizing the overhead
of TCP opening and closing handshakes.
The crawler needs to resolve host names in URLs to IP addresses. The connections to the Domain Name
System (DNS) servers for this purpose are one of the major bottlenecks of a naïve crawler, which
opens a new TCP connection to the DNS server for each URL. (DNS servers translate human-readable
domain names (e.g., example.com) into IP addresses (e.g., 192.0.2.1) )
To address this bottleneck, the crawler can take several steps.
First, it can use UDP instead of TCP as the transport protocol for DNS requests. While UDP does not
guarantee delivery of packets and a request can occasionally be dropped, this is rare. On the other hand,
UDP incurs no connection overhead with a significant speed-up over TCP.
Second, the DNS server should employ a large, persistent, and fast (in-memory) cache.
Finally, the pre-fetching of DNS requests can be carried out when links are extracted from a page. In
addition to being added to the frontier, the URLs can be scanned for host names to be sent to the DNS
server. This way, when a URL is later ready to be fetched, the host IP address is likely to be found in the
DNS cache, obviating the need to propagate the request through the DNS tree.
In addition to making more efficient use of network bandwidth through asynchronous sockets, large-
scale crawlers can increase network bandwidth by using multiple network connections switched to
multiple routers, thus utilizing the networks of multiple Internet service providers. Similarly, disk I/O
throughput can be boosted via a storage area network connected to a storage pool through a fibre channel
switch.
Challenge of Coverage vs. Freshness:
- Search engines face a tradeoff between covering as many important pages as possible and
keeping their index up-to-date.
- The web changes rapidly, with new pages, links, and content appearing frequently.
- Studies show that about 8% of web pages are newly created each week, but not all of this
content is unique.
- The link structure is even more dynamic, with 25% new links each week.
- Most changes on the web come from new additions or deletions rather than modifications to
existing pages.
Revisit Strategies:
- Since some pages change more frequently than others, it’s important for crawlers to revisit
them to maintain freshness.
- However, strategies based on how often a page changes may not be the best. Instead, the
degree of change (how much the page changes when it does) is a better predictor of when a page
is likely to change again.
This highlights the complexity of balancing the need for a comprehensive index with the need to
keep that index up-to-date.
Focused Crawler
A focused crawler is a specialized web crawler designed to search and explore only specific
categories or topics on the web. Instead of crawling the entire web, it focuses on areas that are of
particular interest to the user.
Why Use a Focused Crawler
Imagine you're in charge of maintaining a directory, like Yahoo! Directory or the Open Directory
Project (ODP), which organizes websites into categories. You want to find new, relevant pages
to add to these categories without wasting time on unrelated content. A focused crawler helps
you do this by only looking for pages that match certain categories.
Web Taxonomy?
- A web taxonomy is a way of organizing websites into categories and subcategories, making it
easier to find relevant information. For example, in a directory, you might have a main category
like "Sports" and subcategories like "Soccer," "Basketball," etc.
The Role of an ODP Editor:

- Suppose you are an editor for a specific category in the ODP (like "Soccer"). Your job is to
ensure that this category includes the most relevant and up-to-date web pages.
Bias Towards Relevant Pages:
The key feature of a focused crawler is that it biases its search towards pages that are likely to be
relevant to the categories you’re interested in. This means it spends more time looking at pages
that are more likely to be useful, rather than wasting time on unrelated content.
How Does It Work?
proposed a focused crawler based on a classifier. The idea is to first build a text classifier using labeled
example pages from, say, the ODP. Then the classifier would guide the crawler by preferentially
selecting from the frontier those pages that appear most likely to belong to the categories of interest,
according to the classifier's prediction. To train the classifier, example pages are drawn from various
categories in the taxonomy as shown in Fig
The classification algorithm used was the naïve Bayesian method . For each category c in the taxonomy
we can build a Bayesian classifier to compute the probability Pr(c|p) that a crawled page p belongs to c
(by definition, Pr(top|p) = 1 for the top or root category). The user can select a set c* of categories of
interest. Each crawled page is assigned a relevance score.
Two strategies were explored. In the “soft” focused strategy, the crawler uses the score R(p) of each
crawled page p as a priority value for all unvisited URLs extracted from p. The URLs are then added to
the frontier, which is treated as a priority queue . In the “hard” focused strategy, for a crawled page p, the
classifier first finds the leaf category cˆ( p) in the taxonomy most likely to include p:
Another element of the focused crawler is the use of a distiller. The distiller applies a modified version of
the HITS algorithm to find topical hubs. These hubs provide links to authoritative sources on a focus
category.
Example Scenario:
Let's say your focused crawler is set to find pages about "Soccer." If the crawler stumbles upon a
page dedicated to the FIFA World Cup 2006, the classifier would recognize that this page is
relevant because it fits under "Sports/Soccer." The links on this page would then be added to the
list for further crawling.
What is a Context-Focused Crawler?
A context-focused crawler is a special type of web crawler designed to find pages that are closely
related to a specific topic, even if those pages don’t directly contain the target keywords. It uses a
technique to estimate how "far away" (in terms of link distance) a page is from pages that are
already known to be relevant. They also use naïve Bayesian classifiers as a guide.
How Does It Work
1. Understanding the Scenario:

o Imagine you’re looking for pages related to "machine learning." A standard
crawler might miss some relevant pages because it only looks for pages
containing the exact keywords "machine learning."
o However, important pages could be a few clicks away from other pages (like a
university's computer science department page), even if those pages don’t
mention "machine learning" directly.
2. Layered Context Graph:
o The context-focused crawler builds a context graph with multiple layers.
 Layer 0: Contains pages you already know are relevant (like specific
machine learning papers).
 Layer 1: Contains pages that link to the Layer 0 pages.
 Layer 2: Contains pages that link to Layer 1 pages, and so on.
o The idea is that pages closer to Layer 0 in this graph are more likely to be relevant
to your topic.
The seed pages in layer 0 (and possibly those in layer 1) are then concatenated into a single large
document, and the top few terms according to the TF-IDF weighting scheme (see Chap. 6) are selected as
the vocabulary (feature space) to be used for classification. A naïve Bayesian classifier is built for each
layer in the context graph. A prior probability Pr( ) = 1/L is assigned to each layer. All the pages in a
layer are used to compute Pr(t| ), the probability of occurrence of a term t given the layer (class) . At
the crawling time, these are used to compute Pr(p| ) for each crawled page p. The posterior probability
Pr( |p) of p belonging to layer can then be computed for each layer from Bayes’ rule. The layer *
with highest posterior probability wins:
3. Training the Classifier:

o The crawler uses a naïve Bayesian classifier (a type of algorithm) to learn which
pages are likely to be relevant based on the layers of the context graph.
o It analyzes the content of pages in each layer to identify common terms and
patterns.
oThis helps the crawler estimate the likelihood that a new page belongs to a
specific layer.
4. Crawling Strategy:
o When the crawler finds a new page, it calculates the probability that the page
belongs to a specific layer.
o If the probability is high enough, the page is classified into that layer, meaning it’s
close to relevant content.
o The crawler prioritizes pages from layers closer to the relevant content, meaning
it follows links that seem to lead directly to relevant information.
o
5. Improvement Over Standard Crawlers:
o This approach helps the crawler find relevant pages that might be missed by
standard methods, making it more effective at finding content related to your
target topic.
6. Use of Advanced Classifiers:
o While the naïve Bayesian method is commonly used, research has shown that
using more advanced algorithms like Support Vector Machines (SVMs) or
neural networks can improve the accuracy and effectiveness of the crawler even
more.
Topical Crawlers
For many preferential crawling tasks, labeled (positive and negative) examples of pages are not
available in sufficient numbers to train a focused crawler before the crawl starts. Instead, we
typically have a small set of seed pages and a description of a topic of interest to a user or user
community.
The topic can consist of one or more example pages (possibly the seeds) or even a short query.
Preferential crawlers that start with only such information are often called topical crawlers. They do
not have text classifiers to guide crawling.
Even without the luxury of a text classifier, a topical crawler can be smart about preferentially exploring
regions of the Web that appear relevant to the target topic by comparing features collected from visited
pages with cues in the topic description.
To illustrate a topical crawler with its advantages and limitations, let us consider the MySpiders applet
(myspiders.informatics.indiana.edu). Figure shows a screenshot of this application.
The applet is designed to demonstrate two topical crawling algorithms, best-N-first and InfoSpiders,
both discussed
MySpiders is interactive in that a user submits a query just like one would do with a search engine, and
the results are then shown in a window. However, unlike a search engine, this application has no index to
search for results. Instead the Web is crawled in real time. As pages deemed relevant are crawled, they
are displayed in a list that is kept sorted by a user-selected criterion: score or recency.
The score is simply the content (cosine) similarity between a page and the query;
The recency of a page is estimated by the last-modified header, if returned by the server (as noted
earlier this is not a very reliable estimate).
One of the advantages of topic crawling is that all hits are fresh by definition. No stale results are
returned by the crawler because the pages are visited at query time. This makes this type of crawlers
suitable for applications that look for very recently posted documents, which a search engine may not
have indexed yet. On the down side, the search is slow compared to a traditional search engine because
the user has to wait while the crawler fetches and analyzes pages. If the user's client machine (where
the applet runs) has limited bandwidth, e.g., a dial-up Internet connection, the wait is likely infeasible.
Another disadvantage is that the ranking algorithms cannot take advantage of global prestige measures,
such as PageRank, available to a traditional search engine.
Crawler Ethics and Conflicts

Crawlers, especially when efficient, can put a significant strain on the resources of Web servers, mainly
on their network bandwidth. A crawler that sends many page requests to a server in rapid succession, say
ten or more per second, is considered impolite. The reason is that the server would be so busy responding
to the crawler that its service to other requests, including those from human browsing interactively, would
deteriorate. In the extreme case a server inundated with requests from an aggressive crawler would
become unable to respond to other requests, resulting in an effective denial of service attack by the
crawler.
To prevent such incidents, it is essential for a crawler to put in place measures to distribute its requests
across many servers, and to prevent any one server (fully qualified host name) from receiving requests at
more than some reasonably set maximum rate (say, one request every few seconds).
In a concurrent crawler, this task can be carried out by the frontier manager, when URLs are dequeued
and passed to individual threads or processes. This practice not only is required by politeness toward
servers, but also has the additional benefits of limiting the impact of spider traps and not overloading
the server, which will respond slowly.
Preventing server overload is just one of a number of policies required of ethical Web agents . Such
policies are often collectively referred to as crawler etiquette. Another requirement is to disclose the
nature of the crawler using the User-Agent HTTP header. The value of this header should include not
only a name and version number of the crawler, but also a pointer to where Web administrators
may find information about the crawler. Often a Web site is created for this purpose and its URL is
included in the User-Agent field. Another piece of useful information is the email contact to be specified
in the From header.
Finally, crawler etiquette requires compliance with the Robot Exclusion Protocol. This is a de facto
standard providing a way for Web server administrators to communicate which files may not be
accessed by a crawler.
This is accomplished via an optional file named robots.txt in the root directory of the Web server (e.g.,
http://www.somehost.com/robots.txt).
The file provides access policies for different crawlers, identified by the User-agent field. For any
user-agent value (or the default “*”) a number of Disallow entries identify directory subtrees to be
avoided. Compliant crawlers must fetch and parse a server's robots.txt file before sending requests to that
server. For example, the following policy in robots.txt:
User-agent: *
Disallow: / directs any crawler to stay away from the entire server.
Some high-level languages such as Perl provide modules to parse robots.txt files. It is wise for a crawler
to cache the access policies of recently visited servers, so that the robots.txt file need not be fetched and
parsed every time a request is sent to the same server. Additionally, Web authors can indicate if a page
may or may not be indexed, cached, or mined by a crawler using a special HTML meta-tag. Crawlers
need to fetch a page in order to parse this tag,
therefore this approach is not widely used. More details on the robot exclusion protocols can be found at
http://www.robotstxt.org/wc/robots.html.
Thus it is likely that a crawler which does not comply with the Exclusion Protocol and does not follow
proper etiquette will be quickly blocked by many servers. Crawlers may disguise themselves as browsers
by sending a browser's identifying string in the User-Agent header. This way a server administrator may
not immediately detect lack of compliance with the Exclusion Protocol, but an aggressive request profile
is likely to reveal the true nature of the crawler.
Deception does not occur only by crawlers against servers. Some servers also attempt to deceive
crawlers. For example, Web administrators may attempt to improve the ranking of their pages in a search
engine by providing different content depending on whether a request originates from a browser or a
search engine crawler, as determined by inspecting the request's User-Agent header. This technique,
called cloaking, is frowned upon by search engines, which remove sites from their indices when such
abuses are detected.
One of the most serious challenges for crawlers originates from the rising popularity of pay-per-click
advertising. If a crawler is not to follow advertising links, it needs to have a robust detection algorithm to
discriminate ads from other links. A bad crawler may also pretend to be a genuine user who clicks on the
advertising links in order to collect more money from merchants for the hosts of advertising links.
Some New Developments
The typical use of (universal) crawlers thus far has been for creating and maintaining indexes for general
purpose search engines. However a more diverse use of (topical) crawlers is emerging both for client and
server based applications. Topical crawlers are becoming important tools to support applications such as
specialized Web portals (a.k.a. “vertical” search engines), live crawling, and competitive intelligence.
Another characteristic of the way in which crawlers have been used by search engines up to now is the
one-directional relationship between users, search engines, and crawlers.
Users are consumers of information provided by search engines, search engines are consumers of
information provided by crawlers, and crawlers are consumers of information provided by users
(authors). This one-directional loop does not allow, for example, information to flow from a search engine
(say, the queries submitted by users) to a crawler. It is likely that commercial search engine will soon
leverage the huge amounts of data collected from their users to focus their crawlers on the topics most
important to the searching public.
To investigate this idea in the context of a vertical search engine, a system was built in which the
crawler and the search engine engage in a symbiotic relationship [44]. The crawler feeds the search
engine which in turn helps the crawler. It was found that such a symbiosis can help the system learn about
a community's interests and serve such a community with better focus.
As discussed , universal crawlers have to somehow focus on the most “important” pages given the
impossibility to cover the entire Web and keep a fresh index of it. This has led to the use of global
prestige measures such as PageRank to bias universal crawlers, either explicitly or implicitly through the
long-tailed structure of the Web graph. An important problem with these approaches is that the focus is
dictated by popularity among “average” users and disregards the heterogeneity of user interests. A page
about a mathematical theorem may appear quite uninteresting to the average user, if one compares it to a
page about a pop star using indegree or PageRank as a popularity measure. Yet the math page may be
highly relevant and important to a small community of users (mathematicians).
Future crawlers will have to learn to discriminate between low-quality pages and high-quality pages that
are relevant to very small communities.
Social networks have recently received much attention among Web users as vehicles to capture
commonalities of interests and to share relevant information. We are witnessing an explosion of social
and collaborative engines in which user recommendations, opinions, and annotations are

UNIT III-Web Crawlers Why Do We Need Web Crawlers?

Uploaded by

Copyright:

Available Formats

UNIT III-Web Crawlers Why Do We Need Web Crawlers?

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

UNIT III-Web Crawlers Why Do We Need Web Crawlers?

Uploaded by

Copyright:

Available Formats

UNIT III- Web Crawlers

Why Do We Need Web Crawlers?

Main Uses of Web Crawlers:

Types of Web Crawlers:

Web Crawler and a Search Engine

Web Crawler: The Data Collector

A Basic Crawler Algorithm

A different crawling strategy is obtained if the frontier is implemented as a priority queue

3. Stopword Removal and Stemming

4. Link Extraction and Canonicalization

 HTML parsers identify tags and attribute-value pairs in web pages.

Converting Relative URLs:

 Relative URLs need to be converted to absolute URLs for accuracy.

 Canonicalization standardizes URLs to a consistent format.

Example of a Spider Trap:

Infinite URL Generation:

 Large-scale Universal Crawlers: These crawlers are designed to handle massive

Why Use a Focused Crawler

The Role of an ODP Editor:

Bias Towards Relevant Pages:

How Does It Work?

What is a Context-Focused Crawler?

How Does It Work

1. Understanding the Scenario:

3. Training the Classifier:

Crawler Ethics and Conflicts

You might also like