Web Search Engine

This document provides an overview of web search engines. It discusses the difficulties in building search engines, including distributed and dynamic data. It describes how search engines work, including crawling websites, building indices of words and pages, and ranking results. The document also discusses types of search engines, popular engines in 1998, and features like PageRank, anchor text, and metasearch engines.

Uploaded by

pradhanritesh6

This is the html version of the file

http://softbase.uwaterloo.ca/~tozsu/courses/cs748t/surveys/sunny-slides.pdf.

The Overview of Web Search Engines
Presented by Sunny Lam

Outline
Introduction
Information Retrieval
Searching Problems
Types of Search Engines
The Largest Search Engines
Architectures
User Interfaces
Web Directories
Ranking
Web Crawlers
Indices
Metasearchers
Add-on Tools
Future Work
Conclusion

Questions about the Web
Q: How many computers are in the world?
A: Over 40 million.
Q: How many of them are Web servers?
A: Over 3 million.
Q: How many Web pages are in the world?
A: Over 350 million.
Q: What are the most popular formats of Web documents?
A: HTML, GIF, JPG, ASCII files, PostScript and ASP.
Q: What is the average size of a Web document?
A: Mean: 5 KB; median: 2 KB.
Q: How many queries does a search engine answer every day?
A: Tens of millions.

Characteristics of the Web
Huge (1.75 terabytes of text)
Allows people to share information globally and freely
Hides the details of communication protocols, machine locations, and operating systems
Data are unstructured
Exponential growth
Increasingly commercial over time (from 1.5% .com in 1993 to 60% .com in 1997)

Difficulties of Building a Search Engine
Search engines are built by companies that hide the technical details
Distributed data
High percentage of volatile data
Large volume
Unstructured and redundant data
Quality of data
Heterogeneous data
Dynamic data
How to specify a query from the user
How to interpret the answer provided by the system

Information Retrieval
Search engines belong to the field of information retrieval (IR)
Searching authors, titles, and subjects in library card catalogs or computers
Document classification and categorization, user interfaces, data visualization, filtering
Should make it easy to retrieve the information of interest
IR can be inaccurate as long as the error is insignificant
Data is usually natural language text, which is not always well structured and can be semantically ambiguous
Goal: to retrieve all the documents that are relevant to a query while retrieving as few non-relevant documents as possible
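The retrieval goal stated on the Information Retrieval slide is usually measured with recall (how many of the relevant documents were retrieved) and precision (how many of the retrieved documents are relevant). The slides do not name these measures, so the sketch below is an illustration under that assumption; the function name and the document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the engine returned (hypothetical ids)
    relevant:  documents a human judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision penalizes non-relevant documents in the answer.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall penalizes relevant documents that were missed.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Retrieving every document would give perfect recall and very poor precision, which is why the goal names both conditions.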

User Problems
Do not understand exactly how to provide a sequence of words for the search
Are not aware of the input requirements of the search engine
Have problems understanding Boolean logic, so they cannot use advanced search
Novice users do not know how to start using a search engine
Do not care about advertisements → no funding
Around 85% of users look only at the first page of results, so relevant answers might be skipped

Searching Guidelines
Specify the words clearly (+, -)
Use Advanced Search when necessary
Provide as many specific terms as possible
If looking for a company, institution, or organization, try: www.name [.com | .edu | .org | .gov | country code]
Some search engines specialize in particular areas
For broad queries, use Web directories as starting points
Anyone can publish data on the Web, so information obtained from search engines might not be accurate

Types of Search Engines
Search by keywords (e.g. AltaVista, Excite, Google, and Northern Light)
Search by categories (e.g. Yahoo!)
Specialized in other languages (e.g. Chinese Yahoo! and Yahoo! Japan)
Interview simulation (e.g. Ask Jeeves!)

The Largest Search Engines (1998)
Search engine    URL                      Web pages indexed (millions)
AltaVista        www.altavista.com        140
AOL Search       search.aol.com           N/A
Excite           www.excite.com           55
Google           google.stanford.edu      25
GoTo             goto.com                 N/A
HotBot           www.hotbot.com           110
Go               www.go.com               30
Lycos            www.lycos.com            30
Magellan         magellan.excite.com      55
Microsoft        search.msn.com           N/A
Northern Light   www.northernlight.com    67
Open Text        www.opentext.com         N/A
WebCrawler       www.webcrawler.com       2

Search Engine Architectures
AltaVista
Harvest
Google

AltaVista Architecture
[Figure: AltaVista architecture, with components User, Interface, Query Engine, Crawler, Indexer, Index, and Web]

Harvest Architecture
[Figure: Harvest architecture, with components User, Replication Manager, Broker, Object Cache, Web site, Gatherer, and a second Broker]

Google Architecture

User Interfaces
Query Interface
  A box in which the user enters a sequence of words (AltaVista uses union, HotBot uses intersection)
  Complex query interfaces (e.g. Boolean logic, phrase search, title search, URL search, date range search, data type search)
Answer Interface
  Relevant pages appear at the top of the list
  Each entry in the list includes the title of the page, a URL, a brief summary, a size, a date, and the language the page is written in

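The difference between union and intersection on the User Interfaces slide can be shown on a toy collection. This is only a sketch of the two query semantics, not a description of how AltaVista or HotBot were implemented; the function name, the mode parameter, and the sample pages are assumptions made for illustration.

```python
def match(query, pages, mode="intersection"):
    """Return the URLs of pages that match a multi-word query.

    mode="union":        a page matches if it contains ANY query word
    mode="intersection": a page matches only if it contains ALL query words
    """
    words = query.lower().split()
    if not words:
        return []
    matches = []
    for url, text in pages.items():
        page_words = set(text.lower().split())
        found = [word in page_words for word in words]
        if (any(found) if mode == "union" else all(found)):
            matches.append(url)
    return matches


# Hypothetical three-page collection.
pages = {
    "p1": "web search engines",
    "p2": "web directories",
    "p3": "library card catalogs",
}
union_hits = match("web search", pages, mode="union")
intersection_hits = match("web search", pages, mode="intersection")
```

Union tends to return longer answer lists, so it depends more heavily on ranking to put the best pages first.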
Web Directories
Also called catalogs, yellow pages, or subject directories
Hierarchical taxonomies that classify human knowledge
The first level of a taxonomy ranges from 12 to 26 categories
Popular directories: Yahoo!, eBLAST, LookSmart, Magellan, and Nacho
Most allow keyword searches
Category services: AltaVista Categories, AOL Netfind, Excite Channels, HotBot, Infoseek, Lycos Subjects, and WebCrawler Select

The Most Popular Web Directories in 1998
Web directory    URL                    Number of Web sites (thousands)   Categories
eBLAST           www.eblast.com         125                               N/A
LookSmart        www.looksmart.com      300                               24
Lycos Subjects   www.lycos.com          50                                N/A
Magellan         magellan.excite.com    60                                N/A
NewHoo           www.newhoo.com         100                               23
Netscape         search.netscape.com    N/A                               N/A
Search.com       www.search.com         N/A                               N/A
Snap             www.snap.com           N/A                               N/A
Yahoo!           www.yahoo.com          750                               N/A

Ranking
Ranking algorithms are not publicly available
Ranking cannot access the text, only the indices
Sometimes there are too many relevant pages for a simple query
Hard to compare the quality of ranking between two search engines
PageRank, Anchor Text

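PageRank
Used by WebQuery and Google
The equation:
$$PR(a) = q + (1 - q) \sum_{i=1}^{N} \frac{PR(p_i)}{C(p_i)}$$
where p_1, ..., p_N are the pages that link to page a, C(p_i) is the number of outgoing links on page p_i, and q is the probability of jumping to a random page instead of following a link
PageRank simulates a user navigating the Web at random to rank documents
Google uses the citation graph (518 million links)
Google computes PageRank for 26 million pages in a few hours
Many pages point to the result page → high ranking
Some high-ranking pages point to the result page → high ranking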
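The PageRank equation on this slide can be evaluated by repeated substitution: start every page at the same rank and reapply the formula until the values settle. The sketch below follows that equation directly. The three-page graph, the value q = 0.15, and the fixed iteration count are assumptions chosen for illustration, and pages without outgoing links simply contribute nothing here, which is a simplification.

```python
def pagerank(links, q=0.15, iterations=50):
    """Iterate PR(a) = q + (1 - q) * sum(PR(p_i) / C(p_i)).

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # C(p): number of distinct outgoing links on page p.
    out_degree = {p: len(set(links.get(p, []))) for p in pages}
    # incoming[a]: the pages p_i that point to a.
    incoming = {p: [] for p in pages}
    for source, targets in links.items():
        for target in set(targets):
            incoming[target].append(source)

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        rank = {
            a: q + (1 - q) * sum(rank[p] / out_degree[p] for p in incoming[a])
            for a in pages
        }
    return rank


# Hypothetical link graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this toy graph C receives links from both other pages and ends up ranked highest, while A ranks above B because its single incoming link comes from the highly ranked C, matching the last two bullets of the slide.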

Anchor Text
Most search engines associate the text of a link with the page that the link is on
Google also associates it with the page that the link points to
Advantages: more accurate descriptions of Web pages, and documents that a text-based engine cannot index (such as images) can still be indexed
259 million anchors
The idea originated with the World Wide Web Worm (WWWW)

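Associating anchor text with the page a link points to can be sketched with Python's standard HTML parser. This is a minimal illustration, not Google's implementation; the class name, the function name, and the example URL are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (target URL, anchor text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.anchors = []
        self._href = None   # target of the <a> element currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def index_anchor_text(pages):
    """Map each TARGET URL to the anchor texts of links that point to it."""
    index = defaultdict(list)
    for url, html in pages.items():
        collector = AnchorCollector(url)
        collector.feed(html)
        collector.close()
        for target, text in collector.anchors:
            index[target].append(text)
    return dict(index)


anchor_index = index_anchor_text(
    {"http://a.example/": '<p>See the <a href="/logo.gif">company logo</a>.</p>'}
)
```

The image at /logo.gif contains no text of its own, yet it becomes searchable through the words "company logo" taken from the linking page.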
Other Features
Keeps track of location information for all hits
Keeps track of visual presentation (e.g. font size of words)

Web Crawlers
Software agents that traverse the Web, sending new or updated pages to a main server where they are indexed
Also called robots, spiders, worms, wanderers, walkers, and knowbots
The first crawler, the Wanderer, was developed in 1993
Crawler internals have not been publicly described
Runs on a local machine and sends requests to remote Web servers
The most fragile application
Traverses the Web in a breadth-first or depth-first manner
Avoids crawling the same pages again
Web pages change dynamically
Invalid links: 2% to 9%
The fastest crawlers can traverse up to 10 million pages per day
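Breadth-first traversal that avoids crawling the same page twice can be sketched with a queue and a set of URLs already seen. To keep the example runnable without network access, the Web is replaced by a small in-memory link table; the function name, the get_links callback, and the toy graph are assumptions for illustration.

```python
from collections import deque


def crawl_breadth_first(seed, get_links, max_pages=100):
    """Visit pages level by level, never queueing the same URL twice.

    get_links(url) stands in for "download the page and extract its links".
    """
    visited = {seed}        # every URL that has ever been queued
    queue = deque([seed])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()   # FIFO order gives breadth-first traversal
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


# Hypothetical four-page Web with a cycle between a and b.
toy_web = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": []}
visit_order = crawl_breadth_first("a", lambda url: toy_web.get(url, []))
```

Replacing `queue.popleft()` with `queue.pop()` turns the same loop into a depth-first traversal.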

Google Crawler
Fast distributed crawling system
How does it work?
Peak speed: more than 100 pages per second, or about 600 KB per second, using 4 crawlers
Uses a DNS cache to avoid repeated DNS lookups
Each connection can be in one of these states:
  Looking up DNS
  Connecting to host
  Sending request
  Receiving response
Crawling problems
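The four connection states and the DNS cache from the Google Crawler slide can be modeled in a few lines. This is a sketch under assumptions, not Google's code: the class names are hypothetical, and the resolver is injectable so the cache can be exercised without real DNS traffic.

```python
import socket
from enum import Enum, auto


class ConnectionState(Enum):
    """The states a crawler connection moves through, in order."""
    LOOKING_UP_DNS = auto()
    CONNECTING_TO_HOST = auto()
    SENDING_REQUEST = auto()
    RECEIVING_RESPONSE = auto()


class DnsCache:
    """Remember host-to-address answers so each host is resolved only once."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver
        self._cache = {}
        self.lookups = 0    # number of real lookups performed

    def resolve(self, host):
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]


# A stand-in resolver; 192.0.2.1 is a documentation-only address.
cache = DnsCache(resolver=lambda host: "192.0.2.1")
cache.resolve("www.example.com")
cache.resolve("www.example.com")   # served from the cache
```

Because a crawler fetches many pages from the same host, most connections can skip the first state entirely once the cache is warm.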

Internet Archive
Uses multiple machines
Each crawler is a single thread
Each crawler is assigned up to 64 sites
No site is assigned to more than one crawler
Each crawler reads a list of URLs into per-site queues
Each crawler uses asynchronous I/O to fetch pages from these queues in parallel
Each crawler extracts the links inside each downloaded page
The crawler assigns the links to the appropriate site queues
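The site assignment and per-site queues described on the Internet Archive slide can be sketched with asyncio, which gives parallel fetching inside a single thread. This is an illustration under assumptions, not the Internet Archive's code: all function names are hypothetical, the fetch coroutine is supplied by the caller, link extraction is left out, and URLs for sites the crawler does not own are simply dropped.

```python
import asyncio
from collections import deque
from urllib.parse import urlsplit


def assign_sites(sites, sites_per_crawler=64):
    """Split sites into disjoint groups, one group per crawler."""
    sites = sorted(set(sites))
    return [sites[i:i + sites_per_crawler]
            for i in range(0, len(sites), sites_per_crawler)]


def make_queues(assigned_sites):
    """One FIFO queue per site owned by this crawler."""
    return {site: deque() for site in assigned_sites}


def enqueue(queues, urls):
    """Route each URL to the queue of its site; skip sites not owned."""
    for url in urls:
        site = urlsplit(url).netloc
        if site in queues:
            queues[site].append(url)


async def drain_site(queue, fetch):
    """Fetch one site's URLs one after another."""
    pages = []
    while queue:
        pages.append(await fetch(queue.popleft()))
    return pages


async def crawl(queues, fetch):
    """Single thread; different sites overlap through asynchronous I/O."""
    results = await asyncio.gather(
        *(drain_site(queue, fetch) for queue in queues.values())
    )
    return dict(zip(queues, results))
```

A caller would pass its own `fetch` coroutine, for example one built on an asynchronous HTTP client, and would feed the links extracted from each downloaded page back through `enqueue`.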

Mercator
Named after the Flemish cartographer Mercator
Developed by Compaq
Written in Java
Scalable: can scale up to the entire Web (has fetched tens of millions of Web documents)
Extensible: designed in a modular way, so third parties can add new functionality

Indices
Use inverted files
An inverted file is a sorted list of words
Each word points to the pages that contain it
A short description is associated with each pointer
About 500 bytes for each description and pointer
The answer is kept in memory
Compression can reduce the inverted file to about 30% of the text size
A single keyword is found with a binary search
A multiple-keyword search runs an independent binary search for each keyword, then combines the results
How search engines implement phrase search is not publicly known
Phrase search looks for words that appear near each other
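The inverted file on the Indices slide can be sketched as a sorted vocabulary with one posting set per word, searched with Python's bisect module. The function names and sample pages are hypothetical, descriptions and compression are left out, and combining the per-keyword results by intersection is one possible choice that the slide does not specify.

```python
from bisect import bisect_left


def build_inverted_file(pages):
    """Return (sorted vocabulary, posting sets aligned with the vocabulary)."""
    postings = {}
    for page_id, text in pages.items():
        for word in set(text.lower().split()):
            postings.setdefault(word, set()).add(page_id)
    vocabulary = sorted(postings)
    return vocabulary, [postings[word] for word in vocabulary]


def lookup(vocabulary, posting_sets, word):
    """Binary search for a single keyword."""
    word = word.lower()
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return posting_sets[i]
    return set()


def search(vocabulary, posting_sets, words):
    """One independent binary search per keyword, combined by intersection."""
    results = [lookup(vocabulary, posting_sets, word) for word in words]
    return set.intersection(*results) if results else set()


# Hypothetical three-page collection.
vocabulary, posting_sets = build_inverted_file({
    "p1": "web search engines",
    "p2": "web directories",
    "p3": "search the library",
})
hits = search(vocabulary, posting_sets, ["web", "search"])
```

Each lookup costs a number of comparisons that grows with the logarithm of the vocabulary size, so the cost of a query is dominated by combining the posting sets rather than by finding the words.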

Metasearchers
A Web server that takes a query from the user and sends it to several sources
Collects the answers from these sources
Returns a unified result to the user
Can sort by host, keyword, date, and popularity
Can run on the client machine as well
The number of sources is adjustable
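The metasearcher steps (send the query to several sources, collect the answers, return one unified list) can be sketched with plain functions standing in for search engines. This is an illustration only: the function name, the popularity-based ordering, and the two fake sources are assumptions, and a real metasearcher would query remote engines over HTTP.

```python
from collections import Counter


def metasearch(query, sources, max_sources=None):
    """Query several sources and merge their answers into one ranked list.

    sources maps a source name to a function query -> list of URLs.
    URLs returned by more sources rank first (a simple popularity measure);
    ties are broken by the best position seen in any source, then by URL.
    """
    selected = list(sources.items())[:max_sources]   # adjustable source count
    votes = Counter()
    best_position = {}
    for _name, search in selected:
        seen = set()
        for position, url in enumerate(search(query)):
            if url in seen:          # ignore duplicates within one source
                continue
            seen.add(url)
            votes[url] += 1
            best_position[url] = min(position, best_position.get(url, position))
    return sorted(votes, key=lambda u: (-votes[u], best_position[u], u))


# Two hypothetical sources with overlapping answers.
sources = {
    "engine_one": lambda q: ["u1", "u2"],
    "engine_two": lambda q: ["u2", "u3"],
}
merged = metasearch("web search", sources)
```

Here u2 is returned by both sources and therefore comes first, followed by u1 and u3 in the order of their positions in the individual answers.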

Metasearchers in 1998
Metasearcher    URL                     Sources used
C4              www.c4.com              14
Dogpile         www.dogpile.com         25
Highway61       www.highway61.com       5
InFind          www.infind.com          6
Mamma           www.mamma.com           7
MetaCrawler     www.metacrawler.com     7
MetaMiner       www.miner.uol.com.br    13
Local Find      local.find.com          N/A

Inquirus
Developed by NEC Research Institute
Downloads and analyzes Web pages
Displays each page progressively, with query terms highlighted
Discards non-existent pages
Not publicly available

Savvy Search
Available in 1997, but not now
Goal #1: maximize the likelihood of returning good links
Goal #2: minimize computational and Web resource consumption
Determines which search engines to contact and in what order
Ranks search engines based on the query terms and on search engine performance

STARTS
Stanford Protocol Proposal for Internet Retrieval and Search
Supported by 11 companies
Facilitates the task of querying multiple document sources:
1. Choose the best sources to evaluate a query
2. Submit the query at these sources
3. Merge the query results from these sources

STARTS Protocol
The Query-Language Problem
The Rank-Merging Problem
The Source-Metadata Problem

Add-on Tools: Alexa
Free: www.alexa.com
Appears as a toolbar in IE 5.x
Provides useful information about sites
Allows users to browse related sites
Performs searches within the Web site, related sites, or the whole Web
Shop online
Provides popularity
Provides speed of access
Provides freshness
Provides overall quality as rated by Alexa users

Future Work
1. Provide better information filtering
2. Pose queries more visually
3. New techniques to traverse the Web as it grows
4. New techniques to increase efficiency
5. Better ranking algorithms
6. Algorithms that choose which pages to index
7. Techniques to find dynamic pages, which are created on demand
8. Techniques to avoid searching for duplicated data
9. Techniques to search multimedia documents on the Web
10. Friendly user interfaces
11. Standard protocol to query search engines
12. Web mining
13. Development of reliable and secure intranets

Conclusion
