Preparation
The good news about the Internet and its most visible component, the World Wide Web, is that there are
hundreds of millions of pages available, waiting to present information on an amazing variety of topics. The bad
news about the Internet is that there are hundreds of millions of pages available, most of them titled according to
the whim of their author, almost all of them sitting on servers with cryptic names. When you need to know about a
particular subject, how do you know which pages to read? If you're like most people, you visit an Internet search
engine.
Internet search engines are special sites on the Web that are designed to help people find information stored on
other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:
They search the Internet -- or select pieces of the Internet -- based on important words.
They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.
Early search engines held an index of a few hundred thousand pages and documents, and received
maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of
millions of pages, and respond to tens of millions of queries per day. In this article, we'll tell you how these
major tasks are performed, and how Internet search engines put the pieces together in order to let you
find the information you need on the Web.
Finding key information on the gigantic World Wide Web is like finding a needle lost in a haystack. For
this purpose we would use a special magnet that would automatically, quickly and effortlessly attract
that needle for us. In this scenario the magnet is the "Search Engine".
Search Engine: A software program that searches a database and gathers and reports information that
contains or is related to specified terms.
OR
A website whose primary function is providing a search facility for gathering and reporting information
available on the Internet or a portion of the Internet.
1990 - The first search engine, Archie, was released. There was no World Wide Web at the time. Data
resided on defense contractor, university, and government computers, and techies were the only
people accessing the data. The computers were interconnected by Telenet. File Transfer Protocol (FTP)
was used for transferring files from computer to computer. There was no such thing as a browser. Files were
transferred in their native format and viewed using the associated file type software. Archie searched
FTP servers and indexed their files into a searchable directory.
1991 - Gopherspace came into existence with the advent of Gopher. Gopher cataloged FTP sites, and the
resulting catalog became known as Gopherspace.
1994 - WebCrawler, a new type of search engine that indexed the entire content of a web page, was
introduced. Telenet/FTP passed information among the new web browsers, which accessed not FTP sites but
WWW sites. Webmasters and web site owners began submitting sites for inclusion in the growing
number of web directories.
1995 - Meta tags in the web page were first utilized by some search engines to determine relevancy.
1997 - Search engine rank-checking software was introduced. It provided an automated tool to
determine a web site's position and ranking within the major search engines.
1998 - Search engines began incorporating more esoteric information into their ranking algorithms,
e.g. the number of links to a web site, used to determine its "link popularity." Another
ranking approach was to count the number of clicks (visitors) to a web site based upon keyword
and phrase relevancy.
2000 - Marketers determined that pay-per-click campaigns were an easy yet expensive approach to
gaining top search rankings. To elevate sites in the search engine rankings, web sites started adding
useful and relevant content while optimizing their web pages for each specific search engine.
Determining relevance: The system must determine whether a document contains the
required information or not.
Crawler-based Search Engines
Crawler-based search engines use automated software programs, known as 'spiders', 'crawlers', 'robots' or
'bots', to survey and categorise web pages.
A spider will find a web page, download it and analyse the information presented on the web
page. The web page will then be added to the search engine's database.
When a user performs a search, the search engine checks its database of web pages for the
key words the user searched for.
The results (a list of suggested links to follow) are listed in order of which is 'closest' (as
defined by the 'bots').
Examples of crawler-based search engines are:
Google (www.google.com)
Robot Algorithm
All robots use the following algorithm for retrieving documents from the Web:
1. The algorithm uses a list of known URLs. This list contains at least one URL to start with.
2. A URL is taken from the list, and the corresponding document is retrieved from the
Web.
3. The document is parsed to retrieve information for the index database and to extract
the embedded links to other documents.
4. The URLs of the links found in the document are added to the list of known URLs.
5. If the list is empty or some limit is exceeded (number of documents retrieved, size of
the index database, time elapsed since startup, etc.), the algorithm stops;
otherwise the algorithm continues at step 2.
The crawler program treats the World Wide Web as a big graph, with pages as nodes and the
hyperlinks as arcs.
The crawler works with a simple goal: indexing all the keywords in web pages' titles.
Data structures used to support this include a heap and a hash table. A minimal Python sketch of this
crawl loop is shown below.
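To make these steps concrete, here is a minimal Python sketch of the crawl loop described above. It is illustrative only: the regex-based title and link extraction, the MAX_PAGES limit, and the use of a plain list and set in place of more elaborate data structures are simplifying assumptions, not how any production crawler is built.

import re
import urllib.request
from urllib.parse import urljoin

MAX_PAGES = 100  # hypothetical limit used for the stop condition in step 5

def crawl(seed_urls):
    frontier = list(seed_urls)   # step 1: the list of known URLs
    seen = set(frontier)         # URLs already added to the list (hash table)
    index = {}                   # keyword -> list of URLs containing it in their title
    pages_fetched = 0

    while frontier and pages_fetched < MAX_PAGES:      # step 5: stop conditions
        url = frontier.pop(0)                          # step 2: take a URL from the list
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue                                   # skip documents that cannot be retrieved
        pages_fetched += 1

        # step 3: parse the document; this sketch only indexes words in the <title>
        title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
        for word in (title.group(1).lower().split() if title else []):
            index.setdefault(word, []).append(url)

        # step 4: extract embedded links and add previously unseen URLs to the list
        for link in re.findall(r'href="([^"#]+)"', html, re.I):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

    return index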
Directories
A 'directory' uses human editors who decide what category the site belongs to.
They place websites within specific categories or subcategories in the directory's database.
By focusing on particular categories and subcategories, the user can narrow the search to those
records that are most likely to be relevant to his/her interests.
The human editors comprehensively check the website and rank it, based on the information
they find, using a pre-defined set of rules.
Hybrid search engines use a combination of both crawler-based results and directory results.
Yahoo (www.yahoo.com)
Google (www.google.com)
Meta search engines query several other Web search engine databases in parallel and then
combine the results in one list (a sketch of this fan-out-and-merge approach follows the examples below).
Metacrawler (www.metacrawler.com)
Dogpile (www.dogpile.com)
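The sketch below illustrates the fan-out-and-merge approach in Python. The per-engine search functions are hypothetical stand-ins rather than real engine APIs, and the reciprocal-rank merging rule is just one plausible way of combining the lists, not what any particular metasearch engine actually does.

from concurrent.futures import ThreadPoolExecutor

def engine_a(query):    # hypothetical stand-in for one underlying search engine
    return ["http://a.example/1", "http://shared.example/x"]

def engine_b(query):    # hypothetical stand-in for another underlying search engine
    return ["http://shared.example/x", "http://b.example/2"]

ENGINES = [engine_a, engine_b]

def metasearch(query, top_k=10):
    # Query every engine in parallel; each contributes only its top hits,
    # which is why metasearch recall can be lower than a direct search.
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query)[:top_k], ENGINES))

    # Merge: score each URL by summing reciprocal ranks across engines, so a
    # result ranked highly by several engines floats to the top of the combined list.
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

print(metasearch("search engines"))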
Pros:
Easy to use.
It will get at least some results even when no results would have been obtained with a traditional
search engine.
Cons:
Metasearch engine results are less relevant, since the metasearch engine doesn't know the internal
"alchemy" of the search engines it uses.
Since only the top 10-50 hits are retrieved from each search engine, the total number of
hits retrieved may be considerably less than found by doing a direct search.
Advanced search features (like searches with Boolean operators and field limiting; use
of " ", +/-, default AND between words, etc.) are not usually available.
2. "Pseudo" MSEs type I which exclusively group the results by search engine
3. "Pseudo" MSEs type II which open a separate browser window for each search engine used and
CONCLUSION: Search engines play an important role in accessing content over the Internet; they
fetch the pages requested by the user.
They have made the Internet and accessing information just a click away.
Search engine sites are among the most popular websites.
Google:
If you aren’t interested in learning how Google creates the index and the database of documents that it
accesses when processing a query, skip this description. I adapted the following overview from Chris
Sherman and Gary Price’s wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web.
Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast
parallel processing. Parallel processing is a method of computation in which many calculations can be
performed simultaneously, significantly speeding up data processing. Google has three distinct parts:
Googlebot, a web crawler that finds and fetches web pages.
The indexer that sorts every word on every page and stores the resulting index of words in a huge
database.
The query processor, which compares your search query to the index and recommends the documents
that it considers most relevant.
1. Googlebot, Google’s Web Crawler
Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to
the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of
cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web
browser: it sends a request to a web server for a web page, downloads the entire page, then hands it off to
Google’s indexer.
Googlebot consists of many computers requesting and fetching pages much more quickly than you can with
your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid
overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes
requests of each individual web server more slowly than it’s capable of doing.
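The per-server politeness described above can be sketched roughly as follows. Googlebot's real scheduling policy is not public; the fixed minimum delay and the polite_fetch helper below are purely illustrative assumptions.

import time
from urllib.parse import urlparse

MIN_DELAY_PER_HOST = 2.0   # hypothetical minimum seconds between requests to one host
last_hit = {}              # host -> time of the most recent request to it

def polite_fetch(url, fetch):
    # Many pages can be requested overall, but each individual host is
    # contacted no more often than the minimum delay allows.
    host = urlparse(url).netloc
    wait = MIN_DELAY_PER_HOST - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)   # back off rather than crowd out requests from human users
    last_hit[host] = time.time()
    return fetch(url)      # `fetch` is any page-downloading callable supplied by the caller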
Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through
finding links by crawling the web.
Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with
millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add
URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or
links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky
redirects, creating doorways, domains, or sub-domains with substantially similar content, sending
automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it
displays some squiggly letters designed to fool automated “letter-guessers” and asks you to enter the
letters you see.
When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for
subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what
they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can
quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling,
also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can
reach almost every page in the web. Because the web is vast, this can take some time, so some pages may
be crawled only once a month.
Although its function is simple, Googlebot must be programmed to handle several challenges. First, since
Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be
constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be
eliminated to prevent Googlebot from fetching the same page again. Googlebot must determine how often
to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other
hand, Google wants to re-index changed pages promptly so that it can deliver up-to-date results.
To keep the index current, Google continuously recrawls popular frequently changing web pages at a rate
roughly proportional to how often the pages change. Such crawls keep an index current and are known
as fresh crawls. Newspaper pages are downloaded daily, pages with stock quotes are downloaded much
more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the
two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably
current.
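The rule of thumb above (recrawl a page at a rate roughly proportional to how often it changes) might be sketched like this. The interval bounds and the change-rate estimate are made-up values for illustration, not Google's actual policy.

MIN_INTERVAL_HOURS = 1        # e.g. pages with stock quotes
MAX_INTERVAL_HOURS = 24 * 30  # e.g. pages that essentially never change

def next_crawl_interval(observed_changes, observation_hours):
    # Hours to wait before revisiting, roughly proportional to how rarely the page changes.
    # The change count is assumed to come from comparing page checksums between visits.
    if observed_changes == 0:
        return MAX_INTERVAL_HOURS
    hours_per_change = observation_hours / observed_changes
    return max(MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, hours_per_change))

# A page that changed 24 times in the last day is revisited every hour;
# a page that changed once in 30 days is revisited roughly monthly.
print(next_crawl_interval(24, 24), next_crawl_interval(1, 24 * 30))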
2. Google’s Indexer
Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index
database. This index is sorted alphabetically by search term, with each index entry storing a list of
documents in which the term appears and the location within the text where it occurs. This data structure
allows rapid access to documents that contain user query terms. To improve search performance, Google
ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as
certain single digits and single letters). Stop words are so common
that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores
some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google’s
performance.
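A minimal sketch of this kind of index entry follows, assuming a simplified tokenizer and a tiny stop-word list: each term maps to the documents it appears in and the word positions where it occurs.

import re

STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "a", "as"}  # small illustrative list

def build_index(documents):
    # documents: dict of doc_id -> text. Returns term -> {doc_id: [positions]}.
    index = {}
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())   # lowercase, drop punctuation
        for position, word in enumerate(words):
            if word in STOP_WORDS or len(word) == 1:
                continue                                 # ignore stop words and single characters
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index

docs = {
    "doc1": "How search engines work: the indexer sorts every word.",
    "doc2": "The query processor compares the query to the index.",
}
index = build_index(docs)
print(index["query"])   # {'doc2': [1, 5]}: doc2 contains "query" at word positions 1 and 5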
3. Google’s Query Processor
The query processor has several parts, including the user interface (search box), the “engine” that evaluates
queries and matches them to relevant documents, and the results formatter.
PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more
important and is more likely to be listed above a page with a lower PageRank.
Google considers over a hundred factors in computing a PageRank and determining which documents are
most relevant to a query, including the popularity of the page, the position and size of the search terms
within the page, and the proximity of the search terms to one another on the page. A patent
application discusses other factors that Google considers when ranking a page. Visit SEOmoz.org’s report for
an interpretation of the concepts and the practical applications contained in Google’s patent application.
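The core PageRank computation, as published in Brin and Page's original paper, can be sketched as a power iteration over a toy link graph. It is only one of the many ranking factors mentioned above, and the damping factor and iteration count below are conventional textbook choices, not Google's production settings.

def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to.
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            if not outgoing:                         # dangling page: spread its rank evenly
                for other in pages:
                    new_rank[other] += damping * rank[page] / len(pages)
            else:                                    # otherwise share rank among linked pages
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Three-page example: A and C both link to B, so B ends up with the highest rank.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))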
Google also applies machine-learning techniques to improve its performance automatically by learning
relationships and associations within the stored data. For example, the spelling-correcting system uses such
techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate
relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques
used by spammers.
Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google
gives more priority to pages that have search terms near each other and in the same order as the query.
Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to
the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title,
in the URL, in the body, and in links to the page, options offered by Google’s Advanced Search.
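Building on the positional index sketched earlier, the snippet below shows how stored word positions make phrase and proximity matching possible. The helper names and the scoring idea are illustrative, not Google's actual matching logic.

def appears_as_phrase(index, first, second, doc_id):
    # True if `first` is immediately followed by `second` in the given document.
    first_positions = index.get(first, {}).get(doc_id, [])
    second_positions = set(index.get(second, {}).get(doc_id, []))
    return any(pos + 1 in second_positions for pos in first_positions)

def proximity(index, first, second, doc_id):
    # Smallest distance in words between occurrences of the two terms (None if either is absent),
    # so that documents with the query terms near each other can be preferred.
    a = index.get(first, {}).get(doc_id, [])
    b = index.get(second, {}).get(doc_id, [])
    if not a or not b:
        return None
    return min(abs(i - j) for i in a for j in b)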