
ABSTRACT:

The good news about the Internet and its most visible component, the World Wide Web, is that there are
hundreds of millions of pages available, waiting to present information on an amazing variety of topics. The bad
news about the Internet is that there are hundreds of millions of pages available, most of them titled according to
the whim of their author, almost all of them sitting on servers with cryptic names. When you need to know about a
particular subject, how do you know which pages to read? If you're like most people, you visit an Internet search
engine.

Internet search engines are special sites on the Web that are designed to help people find information stored on
other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:

 They search the Internet -- or select pieces of the Internet -- based on important words.
 They keep an index of the words they find, and where they find them.
 They allow users to look for words or combinations of words found in that index.

Early search engines held an index of a few hundred thousand pages and documents, and received
maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of
millions of pages, and respond to tens of millions of queries per day. In this article, we'll tell you how these
major tasks are performed, and how Internet search engines put the pieces together in order to let you
find the information you need on the Web.

Finding key information on the gigantic World Wide Web is like finding a needle lost in a haystack. For that task we would use a special magnet that would automatically, quickly and effortlessly attract the needle for us. In this scenario the magnet is the “Search Engine”.

Search Engine: A software program that searches a database and gathers and reports information that
contains or is related to specified terms.

OR

A website whose primary function is to provide a search facility for gathering and reporting information available on the Internet or a portion of the Internet.

1990 - The first search engine, Archie, was released. There was no World Wide Web at the time. Data resided on defense contractor, university, and government computers, and techies were the only people accessing the data. The computers were interconnected by Telenet, and the File Transfer Protocol (FTP) was used for transferring files from computer to computer. There was no such thing as a browser. Files were transferred in their native format and viewed using the associated file-type software. Archie searched FTP servers and indexed their files into a searchable directory.

1991 - Gopherspace came into existence with the advent of Gopher. Gopher cataloged FTP sites, and the resulting catalog became known as Gopherspace.

1994 - WebCrawler, a new type of search engine that indexed the entire content of a web page, was introduced. Telenet/FTP passed information among the new web browsers, which accessed not FTP sites but WWW sites. Webmasters and web site owners began submitting sites for inclusion in the growing number of web directories.

1995 - Meta tags in the web page were first utilized by some search engines to determine relevancy.

1997 - Search engine rank-checking software was introduced. It provided an automated tool to determine a web site's position and ranking within the major search engines.

1998 - Search engine algorithms began incorporating esoteric information into their rankings, e.g., the number of links to a web site as a measure of its “link popularity.” Another ranking approach was to count the number of clicks (visitors) to a web site based upon keyword and phrase relevancy.

2000 - Marketers determined that pay-per-click campaigns were an easy yet expensive approach to gaining top search rankings. To elevate sites in the search engine rankings, web sites started adding useful and relevant content while optimizing their web pages for each specific search engine.

Stages in information retrieval

 Finding documents: Interesting documents must first be located on the Web, which consists of millions of documents distributed over tens of thousands of servers.

 Formulating queries: The user needs to express exactly what kind of information is to be retrieved.

 Determining relevance: The system must determine whether a document contains the required information or not.
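As a rough, hypothetical illustration of these three stages, the snippet below represents "finding documents" with a fixed list, turns a user's request into query terms, and applies a very crude relevance test; a real system would use an index and ranking instead.

```python
# Minimal sketch of the three retrieval stages; the document list and
# relevance test are hypothetical stand-ins for a real index and ranker.
documents = {
    "http://example.org/astronomy": "telescopes and the night sky",
    "http://example.org/cooking": "recipes for pasta and sauces",
}

def formulate_query(text):
    # Stage 2: turn the user's information need into query terms.
    return set(text.lower().split())

def is_relevant(doc_text, query_terms):
    # Stage 3: crude relevance test -- does the document contain every term?
    return query_terms.issubset(set(doc_text.lower().split()))

query = formulate_query("night sky telescopes")
# Stage 1 (finding documents) is represented here by the fixed list above.
hits = [url for url, text in documents.items() if is_relevant(text, query)]
print(hits)   # ['http://example.org/astronomy']
```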

 Types of Search Engine

Based on how they work, search engines can be categorized into the following groups:

 Crawler-Based Search Engines

 Directories

 Hybrid Search Engines

 Meta Search Engines

 Crawler-Based Search Engines

 Crawler-based engines use automated software programs, known as ‘spiders’, ‘crawlers’, ‘robots’ or ‘bots’, to survey and categorize web pages.

 A spider will find a web page, download it and analyse the information presented on the web page. The web page will then be added to the search engine’s database.

 When a user performs a search, the search engine will check its database of web pages for the key words the user searched.

 The results (a list of suggested links to go to) are listed on pages, ordered by which is ‘closest’ (as defined by the ‘bots’).
Examples of crawler-based search engines are:

 Google (www.google.com)

 Ask Jeeves (www.ask.com)

 Robot Algorithm

All robots use the following algorithm for retrieving documents from the Web:

1. The algorithm uses a list of known URLs. This list contains at least one URL to start with.

2. A URL is taken from the list, and the corresponding document is retrieved from the
Web.

3. The document is parsed to retrieve information for the index database and to extract
the embedded links to other documents.

4. The URLs of the links found in the document are added to the list of known URLs.

5. If the list is empty or some limit is exceeded (number of documents retrieved, size of the index database, time elapsed since startup, etc.), the algorithm stops; otherwise it continues at step 2.

The crawler treats the World Wide Web as a big graph, with pages as nodes and the hyperlinks as arcs. It works with a simple goal: indexing all the keywords in web pages’ titles. Three data structures are needed for the crawler (robot) algorithm (a minimal sketch follows the list):

1. A large linear array, url_table

2. A heap

3. A hash table
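Putting the five steps and the three data structures together, a minimal, hypothetical crawl loop might look like the sketch below. A deque stands in for the url_table of known URLs, a Python set plays the role of the hash table for detecting already-seen URLs, and title keywords are collected into a simple dictionary; the seed URL, the page limit, and the use of the requests and BeautifulSoup libraries are assumptions made for illustration.

```python
# Minimal crawler sketch of the robot algorithm above; the seed URL, page
# limit, and the requests/BeautifulSoup libraries are illustrative choices.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    url_table = deque([seed_url])   # step 1: list of known URLs (url_table)
    seen = {seed_url}               # hash table: URLs already encountered
    title_index = {}                # keyword from page titles -> list of URLs
    pages_fetched = 0

    while url_table and pages_fetched < max_pages:   # step 5: stop conditions
        url = url_table.popleft()                    # step 2: take a URL
        try:
            page = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                                 # skip unreachable pages
        pages_fetched += 1

        soup = BeautifulSoup(page, "html.parser")    # step 3: parse the page
        title = soup.title.string if soup.title and soup.title.string else ""
        for word in title.lower().split():           # index title keywords
            title_index.setdefault(word, []).append(url)

        for link in soup.find_all("a", href=True):   # step 4: collect links
            absolute = urljoin(url, link["href"])
            if absolute not in seen:                 # duplicate check
                seen.add(absolute)
                url_table.append(absolute)

    return title_index

# Example (hypothetical starting point):
# index = crawl("http://example.org/")
```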

 Directories

 A ‘directory’ uses human editors who decide what category the site belongs to.

 They place websites within specific categories or subcategories in the directory’s database.

 By focusing on particular categories and subcategories, the user can narrow the search to those records that are most likely to be relevant to his/her interests.

 The human editors comprehensively check the website and rank it, based on the information they find, using a pre-defined set of rules.

There are two major directories:

Yahoo Directory (www.yahoo.com)

Open Directory (www.dmoz.org)

Hybrid Search Engines

Hybrid search engines use a combination of both crawler-based results and directory results.

Examples of hybrid search engines are:

Yahoo (www.yahoo.com)

Google (www.google.com)

Meta Search Engines

• Also known as Multiple Search Engines or Metacrawlers.

• Meta search engines query several other Web search engine databases in parallel and then
combine the results in one list.

Examples of Meta search engines include:

Metacrawler (www.metacrawler.com)

Dogpile (www.dogpile.com)
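The core of a metasearch engine can be sketched as sending the same query to several engines in parallel and merging the ranked lists into one. The snippet below is only an illustration: the two engine functions are hypothetical stand-ins for real engine APIs, and the rank-based merge is a simple invented scheme, since real metasearch engines use their own (undisclosed) merging rules.

```python
# Hypothetical metasearch sketch: query several engines in parallel and
# merge their top hits into one ranked list.
from concurrent.futures import ThreadPoolExecutor

# Placeholder functions standing in for real engine APIs; each would
# return an ordered list of result URLs for the query.
def search_engine_a(query):
    return ["http://a.example/1", "http://shared.example/x"]

def search_engine_b(query):
    return ["http://shared.example/x", "http://b.example/2"]

ENGINES = [search_engine_a, search_engine_b]

def metasearch(query, top_n=10):
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda engine: engine(query), ENGINES))

    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results[:top_n]):
            # Simple rank-based scoring: earlier hits and hits returned by
            # several engines score higher.
            scores[url] = scores.get(url, 0) + (top_n - rank)
    return sorted(scores, key=scores.get, reverse=True)

print(metasearch("search engines"))
# URLs returned by both engines bubble to the top of the merged list.
```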

Pros and Cons of Meta Search Engines

Pros :-

 Easy to use

 Able to search more web pages in less time.

 High probability of finding the desired page(s)

 Will often return at least some results when none had been obtained with traditional search engines.

Cons :-
 Metasearch engine results are less relevant, since the metasearch engine doesn’t know the internal “alchemy” of each search engine it uses.

 Since only the top 10-50 hits are retrieved from each search engine, the total number of hits retrieved may be considerably less than would be found by doing a direct search.

 Advanced search features (e.g., searches with Boolean operators and field limiting; use of quotation marks, +/-, default AND between words, etc.) are not usually available.

Meta Search Engines (MSEs) come in four flavors:

1. "Real" MSEs which aggregate/rank the results in one page

2. "Pseudo" MSEs type I which exclusively group the results by search engine

3. "Pseudo" MSEs type II which open a separate browser window for each search engine used and

4. Search Utilities, software search tools.

 CONCLUSION: Search engines play an important role in accessing content over the internet; they fetch the pages requested by the user.

 They have made the internet and accessing information just a click away.

 The need for better search engines only increases.

 The search engine sites are among the most popular websites.

Google:

 Google uses spiders.

 It keeps a large index of keywords.

 Google’s PageRank considers factors such as:

1. the frequency and location of keywords within the web page,

2. the web page’s history, and

3. the number of other web pages that link to the page in question.


Google Guide > Part II: Understanding Results > How Google Works

Next: Results Page »

How Google Works

If you aren’t interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price’s wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

 Googlebot, a web crawler that finds and fetches web pages.

 The indexer that sorts every word on every page and stores the resulting index of words in a huge database.

 The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

Let’s take a closer look at each part.

1. Googlebot, Google’s Web Crawler

Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web browser, by sending a request to a web server for a web page, downloading the entire page, then handing it off to Google’s indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it’s capable of doing.

Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web.

Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated “letter-guessers”; it asks you to enter the letters you see, something like an eye-chart test to stop spambots.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Second, Googlebot must determine how often to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results.

To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily; pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.
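Google’s actual crawl-scheduling policy is not public, but the idea of recrawling at a rate roughly proportional to how often a page has been observed to change can be shown with a toy calculation; the observation counts and the interval bounds below are invented for the example.

```python
# Toy illustration of change-proportional recrawl scheduling; the page
# histories and the bounds on the interval are invented for the example.
def next_crawl_interval(days_observed, changes_seen,
                        min_days=0.5, max_days=30.0):
    if changes_seen == 0:
        return max_days                    # apparently static: deep-crawl pace
    avg_days_between_changes = days_observed / changes_seen
    # Revisit roughly as often as the page changes, within sane bounds.
    return max(min_days, min(max_days, avg_days_between_changes))

print(next_crawl_interval(30, 30))   # news page, changes daily  -> 1.0
print(next_crawl_interval(30, 300))  # stock quotes page         -> 0.5 (floor)
print(next_crawl_interval(30, 1))    # rarely changing page      -> 30.0
```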

2. Google’s Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.


To improve search performance, Google ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google’s performance.
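As a rough sketch of such an index, the snippet below builds an inverted index mapping each term to the documents and positions where it occurs, lowercasing the text and skipping stop words; the stop-word list and sample documents are illustrative, not Google’s actual ones.

```python
# Minimal inverted index sketch: term -> list of (document id, position),
# with lowercasing and a small illustrative stop-word list.
import re

STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "a"}

def build_index(documents):
    index = {}
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())   # strip punctuation
        for position, word in enumerate(words):
            if word in STOP_WORDS:
                continue                                  # ignore stop words
            index.setdefault(word, []).append((doc_id, position))
    return index

docs = {
    "doc1": "The crawler fetches pages on the web.",
    "doc2": "The indexer stores the words of every page.",
}
index = build_index(docs)
print(index["pages"])   # [('doc1', 3)]
```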

3. Google’s Query Processor

The query processor has several parts, including the user interface (search box), the “engine” that evaluates queries and matches them to relevant documents, and the results formatter.

PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank.
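The core PageRank idea, as published, can be illustrated with a small power-iteration sketch over a made-up link graph: each page repeatedly shares its current score among the pages it links to, with a damping factor. The graph, the damping value of 0.85, and the iteration count below are standard textbook choices, not Google’s production settings.

```python
# Simplified PageRank sketch (power iteration) on a tiny made-up link graph.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            if not outlinks:          # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for target in pages:
                    new_rank[target] += share
            else:                     # share rank equally among outlinks
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))   # C accumulates the highest rank in this graph
```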

Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. A patent application discusses other factors that Google considers when ranking a page. Visit SEOmoz.org’s report for an interpretation of the concepts and the practical applications contained in Google’s patent application.

Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.

Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google gives more priority to pages that have search terms near each other and in the same order as the query. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title, in the URL, in the body, and in links to the page, options offered by Google’s Advanced Search Form and Using Search Operators (Advanced Operators).
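Google’s exact scoring formulas are secret, but the idea of rewarding pages whose terms appear close together and in query order can be illustrated with a toy scorer over word positions; everything in the snippet, including the weights, is invented for the illustration.

```python
# Toy proximity scorer: rewards documents whose words contain the query
# terms close together and in the same order; the weights are invented.
def proximity_score(doc_words, query_terms):
    positions = [doc_words.index(term) for term in query_terms
                 if term in doc_words]
    if len(positions) < len(query_terms):
        return 0.0                     # not all query terms present
    span = max(positions) - min(positions) + 1
    in_order = positions == sorted(positions)
    score = len(query_terms) / span    # tighter span -> higher score
    return score * (2.0 if in_order else 1.0)

doc = "the query processor matches search terms to indexed documents".split()
print(proximity_score(doc, ["search", "terms"]))      # adjacent, in order -> 2.0
print(proximity_score(doc, ["terms", "processor"]))   # apart, out of order -> 0.5
```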


Let’s see how Google processes a query.

[Diagram omitted: how Google processes a query. Copyright © 2003 Google Inc. Used with permission.]
