Preparation
The good news about the Internet and its most visible component, the World Wide Web, is that there are
hundreds of millions of pages available, waiting to present information on an amazing variety of topics. The bad
news about the Internet is that there are hundreds of millions of pages available, most of them titled according to
the whim of their author, almost all of them sitting on servers with cryptic names. When you need to know about a
particular subject, how do you know which pages to read? If you're like most people, you visit an Internet search
engine.
Internet search engines are special sites on the Web that are designed to help people find information stored on
other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:
They search the Internet -- or select pieces of the Internet -- based on important words.
They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.
Early search engines held an index of a few hundred thousand pages and documents, and received
maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of
millions of pages, and respond to tens of millions of queries per day. In this article, we'll tell you how these
major tasks are performed, and how Internet search engines put the pieces together in order to let you
find the information you need on the Web.
Finding key information on the gigantic World Wide Web is like finding a needle lost in a haystack. For
this purpose we would use a special magnet that would automatically, quickly and effortlessly attract
that needle for us. In this scenario the magnet is the "Search Engine".
Search Engine: A software program that searches a database and gathers and reports information that
contains or is related to specified terms.
OR
A website whose primary function is providing a search facility for gathering and reporting information
available on the Internet or a portion of the Internet.
1990 - The first search engine, Archie, was released. There was no World Wide Web at the time. Data
resided on defense contractor, university, and government computers, and techies were the only
people accessing the data. The computers were interconnected by Telenet. File Transfer Protocol (FTP)
was used for transferring files from computer to computer. There was no such thing as a browser. Files were
transferred in their native format and viewed using the associated file type software. Archie searched
FTP servers and indexed their files into a searchable directory.
1991 - Gopherspace came into existence with the advent of Gopher. Gopher cataloged FTP sites, and the
resulting catalog became known as Gopherspace.
1994 - WebCrawler, a new type of search engine that indexed the entire content of a web page, was
introduced. Telenet/FTP passed information among the new web browsers, which accessed not FTP sites but
WWW sites. Webmasters and web site owners began submitting sites for inclusion in the growing
number of web directories.
1995 - Meta tags in the web page were first utilized by some search engines to determine relevancy.
1997 - Search engine rank-checking software was introduced. It provided an automated tool to
determine a web site's position and ranking within the major search engines.
1998 - Search engines began incorporating more esoteric information into their ranking algorithms,
e.g. the number of links to a web site, used to determine its "link popularity." Another
ranking approach was to count the number of clicks (visitors) to a web site based upon keyword
and phrase relevancy.
2000 - Marketers determined that pay-per-click campaigns were an easy yet expensive approach to
gaining top search rankings. To elevate sites in the search engine rankings, web sites started adding
useful and relevant content while optimizing their web pages for each specific search engine.
Determining relevance: The system must determine whether a document contains the
required information or not.
Crawler-based Search Engines
Crawler-based search engines use automated software programs, known as 'spiders', 'crawlers', 'robots' or
'bots', to survey and categorise web pages.
A spider will find a web page, download it and analyse the information presented on the web
page. The web page will then be added to the search engine's database.
When a user performs a search, the search engine checks its database of web pages for the
key words the user searched for.
The results (a list of suggested links to follow) are listed in order of which is 'closest' (as
defined by the 'bots').
Examples of crawler-based search engines are:
Google (www.google.com)
Robot Algorithm
All robots use the following algorithm for retrieving documents from the Web:
1. The algorithm uses a list of known URLs. This list contains at least one URL to start with.
2. A URL is taken from the list, and the corresponding document is retrieved from the
Web.
3. The document is parsed to retrieve information for the index database and to extract
the embedded links to other documents.
4. The URLs of the links found in the document are added to the list of known URLs.
5. If the list is empty or some limit is exceeded (number of documents retrieved, size of
the index database, time elapsed since startup, etc.), the algorithm stops;
otherwise the algorithm continues at step 2.
The crawler program treats the World Wide Web as a big graph, with pages as nodes and the
hyperlinks as arcs.
The crawler works with a simple goal: indexing all the keywords in web pages' titles.
Data structures used to support this include a heap and a hash table. A minimal Python sketch of this
crawl loop is shown below.
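To make these steps concrete, here is a minimal Python sketch of the crawl loop described above. It is illustrative only: the regex-based title and link extraction, the MAX_PAGES limit, and the use of a plain list and set in place of more elaborate data structures are simplifying assumptions, not how any production crawler is built.

import re
import urllib.request
from urllib.parse import urljoin

MAX_PAGES = 100  # hypothetical limit used for the stop condition in step 5

def crawl(seed_urls):
    frontier = list(seed_urls)   # step 1: the list of known URLs
    seen = set(frontier)         # URLs already added to the list (hash table)
    index = {}                   # keyword -> list of URLs containing it in their title
    pages_fetched = 0

    while frontier and pages_fetched < MAX_PAGES:      # step 5: stop conditions
        url = frontier.pop(0)                          # step 2: take a URL from the list
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue                                   # skip documents that cannot be retrieved
        pages_fetched += 1

        # step 3: parse the document; this sketch only indexes words in the <title>
        title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
        for word in (title.group(1).lower().split() if title else []):
            index.setdefault(word, []).append(url)

        # step 4: extract embedded links and add previously unseen URLs to the list
        for link in re.findall(r'href="([^"#]+)"', html, re.I):
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

    return index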
Directories
A 'directory' uses human editors who decide what category the site belongs to.
They place websites within specific categories or subcategories in the directory's database.
By focusing on particular categories and subcategories, the user can narrow the search to those
records that are most likely to be relevant to his/her interests.
The human editors comprehensively check the website and rank it, based on the information
they find, using a pre-defined set of rules.
Hybrid search engines use a combination of both crawler-based results and directory results.
Yahoo (www.yahoo.com)
Google (www.google.com)
Meta search engines query several other Web search engine databases in parallel and then
combine the results in one list (a sketch of this fan-out-and-merge approach follows the examples below).
Metacrawler (www.metacrawler.com)
Dogpile (www.dogpile.com)
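The sketch below illustrates the fan-out-and-merge approach in Python. The per-engine search functions are hypothetical stand-ins rather than real engine APIs, and the reciprocal-rank merging rule is just one plausible way of combining the lists, not what any particular metasearch engine actually does.

from concurrent.futures import ThreadPoolExecutor

def engine_a(query):    # hypothetical stand-in for one underlying search engine
    return ["http://a.example/1", "http://shared.example/x"]

def engine_b(query):    # hypothetical stand-in for another underlying search engine
    return ["http://shared.example/x", "http://b.example/2"]

ENGINES = [engine_a, engine_b]

def metasearch(query, top_k=10):
    # Query every engine in parallel; each contributes only its top hits,
    # which is why metasearch recall can be lower than a direct search.
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query)[:top_k], ENGINES))

    # Merge: score each URL by summing reciprocal ranks across engines, so a
    # result ranked highly by several engines floats to the top of the combined list.
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

print(metasearch("search engines"))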
Pros:
Easy to use.
It will get at least some results even when no results would have been obtained with a traditional
search engine.
Cons:
Metasearch engine results are less relevant, since the metasearch engine doesn't know the internal
"alchemy" of the search engines it uses.
Since only the top 10-50 hits are retrieved from each search engine, the total number of
hits retrieved may be considerably less than found by doing a direct search.
Advanced search features (like searches with Boolean operators and field limiting; use
of " ", +/-, default AND between words, etc.) are not usually available.
2. "Pseudo" MSEs type I which exclusively group the results by search engine
3. "Pseudo" MSEs type II which open a separate browser window for each search engine used and
CONCLUSION: Search engines play an important role in accessing content over the Internet; they
fetch the pages requested by the user.
They have made the Internet and accessing information just a click away.
Search engine sites are among the most popular websites.
Google:
If you aren’t interested in learning how Google creates the index and the database of documents that it
accesses when processing a query, skip this description. I adapted the following overview from Chris
Sherman and Gary Price’s wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web.
Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast
parallel processing. Parallel processing is a method of computation in which many calculations can be
performed simultaneously, significantly speeding up data processing. Google has three distinct parts:
Googlebot, a web crawler that finds and fetches web pages.
The indexer that sorts every word on every page and stores the resulting index of words in a huge
database.
The query processor, which compares your search query to the index and recommends the documents
that it considers most relevant.
1. Googlebot, Google’s Web Crawler
Googlebot is Google’s web crawling robot, which finds and retrieves pages on the web and hands them off to
the Google indexer. It’s easy to imagine Googlebot as a little spider scurrying across the strands of
cyberspace, but in reality Googlebot doesn’t traverse the web at all. It functions much like your web
browser: it sends a request to a web server for a web page, downloads the entire page, then hands it off to
Google’s indexer.
Googlebot consists of many computers requesting and fetching pages much more quickly than you can with
your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid
overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes
requests of each individual web server more slowly than it’s capable of doing.
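The per-server politeness described above can be sketched roughly as follows. Googlebot's real scheduling policy is not public; the fixed minimum delay and the polite_fetch helper below are purely illustrative assumptions.

import time
from urllib.parse import urlparse

MIN_DELAY_PER_HOST = 2.0   # hypothetical minimum seconds between requests to one host
last_hit = {}              # host -> time of the most recent request to it

def polite_fetch(url, fetch):
    # Many pages can be requested overall, but each individual host is
    # contacted no more often than the minimum delay allows.
    host = urlparse(url).netloc
    wait = MIN_DELAY_PER_HOST - (time.time() - last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)   # back off rather than crowd out requests from human users
    last_hit[host] = time.time()
    return fetch(url)      # `fetch` is any page-downloading callable supplied by the caller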
Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through
finding links by crawling the web.
Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with
millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add
URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or
links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky
redirects, creating doorways, domains, or sub-domains with substantially similar content, sending
automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it
displays some squiggly letters designed to fool automated “letter-guessers” and asks you to enter the
letters you see.
When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for
subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what
they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can
quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling,
also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can
reach almost every page in the web. Because the web is vast, this can take some time, so some pages may
be crawled only once a month.
Although its function is simple, Googlebot must be programmed to handle several challenges. First, since
Googlebot sends out simultaneous requests for thousands of pages, the queue of “visit soon” URLs must be
constantly examined and compared with URLs already in Google’s index. Duplicates in the queue must be
eliminated to prevent Googlebot from fetching the same page again. Googlebot must determine how often
to revisit a page. On the one hand, it’s a waste of resources to re-index an unchanged page. On the other
hand, Google wants to re-index changed pages promptly so that it can deliver up-to-date results.
To keep the index current, Google continuously recrawls popular frequently changing web pages at a rate
roughly proportional to how often the pages change. Such crawls keep an index current and are known
as fresh crawls. Newspaper pages are downloaded daily, pages with stock quotes are downloaded much
more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the
two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably
current.
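The rule of thumb above (recrawl a page at a rate roughly proportional to how often it changes) might be sketched like this. The interval bounds and the change-rate estimate are made-up values for illustration, not Google's actual policy.

MIN_INTERVAL_HOURS = 1        # e.g. pages with stock quotes
MAX_INTERVAL_HOURS = 24 * 30  # e.g. pages that essentially never change

def next_crawl_interval(observed_changes, observation_hours):
    # Hours to wait before revisiting, roughly proportional to how rarely the page changes.
    # The change count is assumed to come from comparing page checksums between visits.
    if observed_changes == 0:
        return MAX_INTERVAL_HOURS
    hours_per_change = observation_hours / observed_changes
    return max(MIN_INTERVAL_HOURS, min(MAX_INTERVAL_HOURS, hours_per_change))

# A page that changed 24 times in the last day is revisited every hour;
# a page that changed once in 30 days is revisited roughly monthly.
print(next_crawl_interval(24, 24), next_crawl_interval(1, 24 * 30))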
2. Google’s Indexer
Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google’s index
database. This index is sorted alphabetically by search term, with each index entry storing a list of
documents in which the term appears and the location within the text where it occurs. This data structure
allows rapid access to documents that contain user query terms. To improve search performance, Google
ignores (doesn’t index) common words called stop words (such as the, is, on, or, of, how, why, as well as
certain single digits and single letters). Stop words are so common
that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores
some punctuation and multiple spaces, as well as converting all letters to lowercase, to improve Google’s
performance.
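A minimal sketch of this kind of index entry follows, assuming a simplified tokenizer and a tiny stop-word list: each term maps to the documents it appears in and the word positions where it occurs.

import re

STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "a", "as"}  # small illustrative list

def build_index(documents):
    # documents: dict of doc_id -> text. Returns term -> {doc_id: [positions]}.
    index = {}
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())   # lowercase, drop punctuation
        for position, word in enumerate(words):
            if word in STOP_WORDS or len(word) == 1:
                continue                                 # ignore stop words and single characters
            index.setdefault(word, {}).setdefault(doc_id, []).append(position)
    return index

docs = {
    "doc1": "How search engines work: the indexer sorts every word.",
    "doc2": "The query processor compares the query to the index.",
}
index = build_index(docs)
print(index["query"])   # {'doc2': [1, 5]}: doc2 contains "query" at word positions 1 and 5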
3. Google’s Query Processor
The query processor has several parts, including the user interface (search box), the “engine” that evaluates
queries and matches them to relevant documents, and the results formatter.
PageRank is Google’s system for ranking web pages. A page with a higher PageRank is deemed more
important and is more likely to be listed above a page with a lower PageRank.
Google considers over a hundred factors in computing a PageRank and determining which documents are
most relevant to a query, including the popularity of the page, the position and size of the search terms
within the page, and the proximity of the search terms to one another on the page. A patent
application discusses other factors that Google considers when ranking a page. Visit SEOmoz.org’s report for
an interpretation of the concepts and the practical applications contained in Google’s patent application.
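The core PageRank computation, as published in Brin and Page's original paper, can be sketched as a power iteration over a toy link graph. It is only one of the many ranking factors mentioned above, and the damping factor and iteration count below are conventional textbook choices, not Google's production settings.

def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to.
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            if not outgoing:                         # dangling page: spread its rank evenly
                for other in pages:
                    new_rank[other] += damping * rank[page] / len(pages)
            else:                                    # otherwise share rank among linked pages
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Three-page example: A and C both link to B, so B ends up with the highest rank.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))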
Google also applies machine-learning techniques to improve its performance automatically by learning
relationships and associations within the stored data. For example, the spelling-correcting system uses such
techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate
relevance; they’re tweaked to improve quality and performance, and to outwit the latest devious techniques
used by spammers.
Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google
gives more priority to pages that have search terms near each other and in the same order as the query.
Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to
the text on the page, users can restrict searches on the basis of where query words appear, e.g., in the title,
in the URL, in the body, and in links to the page, options offered by Google’s Advanced Search.
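Building on the positional index sketched earlier, the snippet below shows how stored word positions make phrase and proximity matching possible. The helper names and the scoring idea are illustrative, not Google's actual matching logic.

def appears_as_phrase(index, first, second, doc_id):
    # True if `first` is immediately followed by `second` in the given document.
    first_positions = index.get(first, {}).get(doc_id, [])
    second_positions = set(index.get(second, {}).get(doc_id, []))
    return any(pos + 1 in second_positions for pos in first_positions)

def proximity(index, first, second, doc_id):
    # Smallest distance in words between occurrences of the two terms (None if either is absent),
    # so that documents with the query terms near each other can be preferred.
    a = index.get(first, {}).get(doc_id, [])
    b = index.get(second, {}).get(doc_id, [])
    if not a or not b:
        return None
    return min(abs(i - j) for i in a for j in b)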