Unit-5. Search Engines
Information Retrieval & Search Engines
Search and Information Retrieval
Search on the Web is a daily activity for many
people throughout the world
Data processing, search, and communication are
the most popular uses of the computer
Applications involving search are everywhere
The field of computer science that is most involved
with R&D for search is information retrieval (IR)
IR problem
First applications: in libraries (1950s)
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation,
analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>
external attributes and internal attribute (content)
Search by external attributes = Search in DB
IR: Search by content
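The distinction between the two kinds of search can be sketched in a few lines of Python. This is only an illustration: the `catalog` list, the second record, and the `content` strings are hypothetical placeholders (the slide gives the content only as <Text>), and the function names are made up for this example.

```python
# Hypothetical catalog records modelled on the slide's example.
# The "content" strings are made-up placeholders, not the books' actual text.
catalog = [
    {
        "isbn": "0-201-12227-8",
        "author": "Salton, Gerard",
        "title": "Automatic text processing",
        "date": 1989,
        "content": "placeholder text about indexing and retrieval of documents",
    },
    {
        "isbn": "hypothetical-0001",
        "author": "Doe, Jane",
        "title": "Cooking basics",
        "date": 2001,
        "content": "placeholder text about recipes and ingredients",
    },
]


def search_by_attribute(records, field, value):
    """Database-style lookup: exact match on an external attribute."""
    return [r for r in records if r[field] == value]


def search_by_content(records, term):
    """IR-style lookup: match a term inside the document text itself."""
    term = term.lower()
    return [r for r in records if term in r["content"].lower().split()]


by_author = search_by_attribute(catalog, "author", "Salton, Gerard")
by_content = search_by_content(catalog, "retrieval")
```

Both calls return the Salton record here, but only the second one had to look inside the text to find it.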
Information retrieval (IR)
What comes to mind when I say “information
retrieval”?
Where have you seen IR? What are some real-
world examples/uses?
Search engines
File search (e.g. Windows Instant Search, Google Desktop)
Databases?
Catalog search (e.g. library)
Intranet search (i.e. corporate networks)
Information retrieval
Information retrieval is a field concerned with
the structure, analysis, organization, storage,
searching, and retrieval of information.
An information retrieval system is a system that is
capable of storage, retrieval, and maintenance of
information.
Information Retrieval
Information Retrieval is finding material in documents of
an unstructured nature that satisfy an information need
from within large collections of digitally stored content
Goal = find documents relevant to an information need from a
large document set
[Figure: IR system diagram with the labels Info. need, Query, Retrieval, IR system, Document collection, Answer list]
The exchange doesn’t end with the first answer
Users can recognize elements of a useful answer, even
when incomplete
Questions and understanding change as the process continues
Challenges
storage
discovery (web)
Data is unstructured
Querying/Understanding user intent
SPAM
Data quality
IR vs. databases
Structured data tends to refer to information in “tables”
Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000
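Because the data above is structured, it supports exact queries over named columns. A minimal sketch using Python's built-in `sqlite3` module follows; the table name `staff` and the particular query are assumptions chosen for illustration.

```python
import sqlite3

# Load the slide's table into an in-memory database.
# The table name "staff" is an assumption made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (employee TEXT, manager TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Smith", "Jones", 50000), ("Chang", "Smith", 60000), ("Ivy", "Smith", 50000)],
)

# Structured query: exact match on one column plus a numeric range on another.
rows = conn.execute(
    "SELECT employee, salary FROM staff "
    "WHERE manager = ? AND salary < ? ORDER BY employee",
    ("Smith", 60000),
).fetchall()
conn.close()
```

A query like this has one well-defined answer set. Unstructured text offers no columns to match against, which is why IR ranks documents by estimated relevance instead.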
Relevance
A relevant document contains the information
that a person was looking for when they
submitted a query to the search engine
Precision:
Proportion of retrieved
documents that are relevant
Recall:
Proportion of relevant documents
that are retrieved
Assumption: All relevant documents are known.
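The two measures can be computed directly from sets of document identifiers. The sketch below assumes, as the slide does, that the full set of relevant documents is known; the document IDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall 2/3).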
Information Retrieval
Information Retrieval – a Definition. The aim of information
retrieval (IR) is to make machine-stored data discoverable: unlike
data mining which extracts structures from online records, IR is
concerned with filtering specific information from a set of data.
A “Search engine” is one of many different kinds of “information
retrieval systems.”
An information retrieval system is any multi-tiered system by
which underlying stored data (raw, structured, or even
documents, images and sound/video) is indexed by keywords or
codes to allow it to be searched, enumerated and retrieved.
There may be a GUI that allows a human to perform these
functions, a documented API may be provided for integrating
Job of Spider
Crawling
Crawling is the discovery process in which search engines
send out a team of robots (known as crawlers or spiders) to
find new and updated content. Content can vary — it could be
a webpage, an image, a video, a PDF, etc. — but regardless of
the format, content is discovered by links.
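Discovery by links can be sketched as a breadth-first traversal. The example below is a simplification that runs over a small in-memory link graph instead of the real web: the `WEB` dictionary and its URLs are hypothetical, and a real crawler would fetch and parse pages, respect robots.txt, and throttle its requests.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["b.example/paper.pdf", "a.example/about"],
    "b.example/paper.pdf": [],
    "c.example/": ["a.example/"],  # nothing links here, so it is never found
}


def crawl(seed, fetch_links, limit=100):
    """Breadth-first discovery: follow links outward from a seed URL."""
    seen = {seed}
    frontier = deque([seed])
    order = []
    while frontier and len(order) < limit:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # visit each URL only once
                seen.add(link)
                frontier.append(link)
    return order


discovered = crawl("a.example/", lambda url: WEB.get(url, []))
```

Note that `c.example/` is never discovered: it links out, but no crawled page links to it.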
Processing is distributed
among several peers in a
decentralized
The usual search scenario involves someone typing in a query to
a search engine and receiving answers in the form of a list of
documents in ranked order.
World Wide Web (web search) is by far the most common application
involving information retrieval.
Vertical search is a specialized form of web search where the domain of
the search is restricted to a particular topic.
Enterprise search involves finding the required information in the huge
variety of computer files scattered across a corporate intranet. Web pages
are certainly a part of that distributed information store, but most
information will be found in sources such as email, reports, presentations,
spreadsheets, and structured data in corporate databases.
Directories
A Web Directory or Web Guide is a
hierarchical representation of
hyperlinks.
The top level is typically a wide range of
very general topics.
Each topic contains hyperlinks of more
specialized sub-topics.
Very easy to use.
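One way to picture the hierarchy is as a nested mapping from topics to sub-topics, with hyperlinks at the leaves. The topics and URLs below are hypothetical and serve only to show how browsing narrows from general to specialized.

```python
# Hypothetical directory: topics nest into sub-topics; "links" holds hyperlinks.
directory = {
    "Recreation": {
        "Cooking": {"links": ["cooking.example/basics"]},
        "Travel": {
            "Guides": {"links": ["travel.example/guides"]},
        },
    },
    "Science": {"links": ["science.example/"]},
}


def browse(tree, path):
    """Follow a list of topic names and return (sub-topics, links) at that node."""
    node = tree
    for topic in path:
        node = node[topic]
    subtopics = sorted(key for key in node if key != "links")
    return subtopics, node.get("links", [])


top_level = browse(directory, [])
cooking = browse(directory, ["Recreation", "Cooking"])
```

The user never types a query: each step is a choice among the sub-topics shown at the current level.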
Hierarchical Representation
Comparison: Directory vs. Search Engine
Directory: A directory allows you to explore and get what you want eventually.
Search engine: A search engine brings you to the exact page containing the words or phrases you are looking for.
Directory: Use a directory to find cooking-related websites.
Search engine: Use a search engine to find a specific recipe by providing the names of the ingredients.
Directory: Use a directory to find travel guides in a country.
Search engine: Use a search engine to find the train schedule in South Africa.
Search Engine Issues
Performance
improving the efficiency of search
e.g., reducing response time, increasing query
throughput, increasing indexing speed
Indexes are data structures designed to improve
search efficiency
designing and implementing them are major issues for
search engines
Search Engine Issues
Dynamic data
The “collection” for most real applications is
constantly changing in terms of updates,
additions, deletions
e.g., web pages
For Web search, spam in all its forms is one of the
major issues
Affects the efficiency of search engines and, more
seriously, the effectiveness of the results
Many types of spam will degrade performance.
New subfield called adversarial IR, since
spammers are “adversaries” with different goals
White Pages / Yellow Pages
White pages allow users to look up information
about individuals.
We can use white pages to track down
telephone numbers and email addresses.
People can abuse white pages
Some people think that white pages are an
invasion of their privacy.
Yellow pages contain information about
Search Engine Components
Web crawler
Also known as a spider or bot. It is a software component
that traverses the web to gather information.
Database
The information gathered from the web is stored in a database,
which holds a huge collection of web resources.
Search Interfaces
This component is an interface between the user and the database.
It helps the user to search through the database.
Generally, a search engine has three basic components, as listed above:
the web crawler, the database, and the search interface.
Once the web crawler finds the pages, the search engine
shows the relevant web pages as results. Each retrieved
web page is generally listed with the title of the page, the size
of the text portion, the first several sentences, etc.
Architecture
SE Processing - Index Processing
Text acquisition
identifies and stores documents for indexing
Text transformation
transforms documents into index terms or
features
Index creation
takes index terms and creates data
structures (indexes) to support fast searching
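The three steps above can be sketched with an inverted index, a common index structure that maps each term to the set of documents containing it. This is a minimal illustration: the documents are made up, text transformation is reduced to lowercasing and whitespace splitting, and real indexes also store positions and weights.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Index creation: map each index term to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():        # text acquisition (already in memory)
        for term in text.lower().split():    # text transformation (very simplified)
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return IDs of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical three-document collection.
docs = {
    1: "search engines index web pages",
    2: "libraries index books",
    3: "web crawlers discover pages",
}
index = build_inverted_index(docs)
```

A lookup such as `search_all(index, "web pages")` touches only the posting sets for "web" and "pages" instead of scanning every document, which is where the speed-up comes from.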
Indexing Process
Text Acquisition
Stores text, metadata, and other related content for
documents
Metadata is information about a document, such as type and
creation date
Other content includes links, anchor text
Provides fast access to document contents for search
engine components
Stemming
Group words derived from a common stem
e.g., “computer”, “computers”, “computing”, “compute”
Usually effective, but not for all queries
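A deliberately naive suffix-stripping stemmer shows the idea. This is not the Porter stemmer or any other published algorithm: the suffix list and the `min_stem` threshold are assumptions chosen so that the slide's four example words collapse to one stem.

```python
# Suffixes are tried in this order, longer ones first.
SUFFIXES = ("ers", "ing", "er", "es", "s", "e")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ["computer", "computers", "computing", "compute"]}
```

All four words map to the stem "comput", so a query for one of them matches the others. The same rules also turn "news" into "new", which illustrates why stemming is not effective for all queries.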
Text Transformation
Link Analysis
Makes use of links and anchor text in web pages
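A small sketch of what "making use of links and anchor text" can mean in practice: counting incoming links per page and collecting the anchor words that other pages use to describe it. The link triples are hypothetical, and real link analysis algorithms go well beyond simple counts.

```python
from collections import defaultdict

# Hypothetical links: (source page, target page, anchor text).
links = [
    ("a.example", "c.example", "train schedule"),
    ("b.example", "c.example", "South Africa trains"),
    ("c.example", "a.example", "travel guide"),
]


def analyze_links(links):
    """Count in-links per page and gather the anchor-text terms pointing at it."""
    inlinks = defaultdict(int)
    anchor_terms = defaultdict(list)
    for source, target, anchor in links:
        inlinks[target] += 1
        anchor_terms[target].extend(anchor.lower().split())
    return dict(inlinks), dict(anchor_terms)


inlink_count, anchors = analyze_links(links)
```

The anchor terms collected for `c.example` can be indexed as extra description of that page, even if the page itself never uses those words.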
User interaction
supports creation and refinement of query, display
of results
Ranking
uses query and indexes to generate ranked list of
documents
Evaluation
monitors and measures effectiveness and efficiency
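The ranking step can be illustrated with a basic TF-IDF score, one common weighting scheme; the slides do not say which ranking function is used, so treat this as an assumed stand-in. The documents are made up, and the sketch scans the documents directly instead of using a prebuilt index.

```python
import math
from collections import Counter


def rank(query, docs):
    """Return document IDs ordered by a simple TF-IDF score for the query."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = sum(
            tf[t] * math.log(n_docs / df[t])
            for t in query.lower().split()
            if t in tf
        )
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical collection.
docs = {
    "d1": "web search engines rank web pages",
    "d2": "library catalog search",
    "d3": "cooking recipes",
}
ranking = rank("web search", docs)
```

Document d1 ranks first because it contains both query terms and the rarer term "web" twice; d3 matches nothing and is left out of the list.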
Query Process
User Interaction
Query input
Provides interface and parser for query
language
Query language used to describe more complex
queries and results of query transformation
IR query languages also allow content and
structure specifications, but focus on content
User Interaction
Query transformation
Improves initial query
Spell checking and query suggestion provide
alternatives to original query using search log
Query expansion and relevance feedback
modify the original query with additional terms
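Query suggestion from a search log can be approximated with the standard library's `difflib.get_close_matches`, which returns the log entries most similar to the typed query. The `query_log` contents and the `cutoff` value are assumptions made for this sketch; production systems use much larger logs and more elaborate models.

```python
import difflib

# Hypothetical search log: queries that earlier users submitted.
query_log = ["search engine", "search engines", "web crawler", "inverted index"]


def suggest(query, log, n=3, cutoff=0.8):
    """Offer alternatives from the log when the query itself is not in it."""
    if query in log:
        return []  # the query is already a known one; nothing to suggest
    return difflib.get_close_matches(query, log, n=n, cutoff=cutoff)


suggestions = suggest("serch engine", query_log)
```

For the misspelled query "serch engine", the closest log entry is "search engine", which the interface could show as a "did you mean" alternative.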
Search Engine - How does it work?
User Interface – Allows you to type a query and
displays the results.
Searcher – The engine searches the database for
matches to your query.
Evaluator – The engine assigns scores to the
retrieved information.
Gatherer – The component that travels the Web
and collects information.
Indexer – The engine that categorizes the data
collected by the gatherer.
User Interface
reference
Gatherer
ALL THE
BEST