IR Chapter 1
Information Retrieval
▪ It is a research field traditionally separate from Databases
▪ Goes back to IBM, Rand and Lockheed in the 50’s
▪ G. Salton at Cornell in the 60’s
▪ Lots of research since then
▪ Products traditionally separate
▪ Originally, document management systems for libraries,
government, law, etc.
▪ Gained prominence in recent years due to web search
What is IR?
▪ The study of methods and structures used to
represent and access information (Witten et al.)
▪ IR deals with the representation, storage,
organization, and access to information items
(Salton).
▪ Information Retrieval (IR) is finding material
(usually documents) of an unstructured nature
(usually text) that satisfies an information need from
within large collections (usually stored on
computers) (Manning et al.).
Broad Sense of IR
▪ It is a discipline that finds information that people
want
▪ The motivations behind it include
▪ Humans’ desire to understand the world and to gain
knowledge
▪ The need to acquire sufficient and accurate information or answers to
accomplish a task
Broad Sense of IR
▪ Because information can be found in so many
different ways, IR in the broad sense involves:
• Classification
• Re-Clustering
• Recommendation
• Social networks
• Interpreting natural languages
• Question answering
• Knowledge bases
• Human-computer interaction
• Psychology and Cognitive Science
Narrow Sense of IR
• It is ‘search’
– Mostly searching for documents
• It is a computer science discipline that designs and implements
algorithms and tools to help people find information that they
want
– from one or multiple large collections of materials (text or
multimedia, structured or unstructured, with or without
hyperlinks, with or without metadata, in a foreign language
or not),
– where people can be a single user or a group,
– who initiate the search process by an information need,
– and, the resulting information should be relevant to the
information need (based on the judgment by the person
who starts the search)
Narrow Sense of IR
• It helps people find relevant documents
– from one large collection of material (which is the Web or
a TREC collection),
– where there is a single user,
– who initiates the search process by a query driven by an
information need,
– and, the resulting documents should be ranked (from the
most relevant to the least) and returned in a list
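The narrow definition above (one collection, one user, one query, a ranked list) can be illustrated with a minimal sketch. The scoring function here is an assumption chosen for brevity: it simply counts occurrences of query terms in each document. The names `score`, `rank`, and the sample `collection` are hypothetical and not taken from any particular IR system.

```python
from collections import Counter


def score(query: str, document: str) -> int:
    """Toy relevance score: total occurrences of the query terms in the document."""
    doc_counts = Counter(document.lower().split())
    return sum(doc_counts[term] for term in set(query.lower().split()))


def rank(query: str, collection: dict[str, str]) -> list[str]:
    """Return ids of documents with a non-zero score, most relevant first."""
    scored = [(score(query, text), doc_id) for doc_id, text in collection.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    # Sort by descending score; break ties by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Hypothetical three-document collection.
collection = {
    "d1": "information retrieval finds relevant documents",
    "d2": "databases store structured records",
    "d3": "web search is information retrieval on the web",
}
ranking = rank("web information retrieval", collection)
print(ranking)
```

Real systems use far richer scoring, but the shape is the same: a query goes in, and a list ordered from most to least relevant comes out.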
Relationships to Sister Disciplines
Databases vs. IR
Data Retrieval vs. Information Retrieval
▪ While both involve the search for specific data, they differ in
their scope and purpose.
Definition
Information Retrieval
▪ It is the process of retrieving relevant information from a
collection of unstructured or semi-structured data.
▪ It involves the use of search engines or other information
retrieval systems to find documents or other sources of
information that match a particular query.
Data Retrieval
▪ It is the process of retrieving specific data from a structured
database or other data storage system.
▪ It involves the use of queries or other data retrieval techniques
to extract the desired data from a larger data set.
Scope
Information Retrieval
▪ The scope of information retrieval is generally broader than
that of data retrieval.
▪ Information retrieval systems are designed to search large
collections of data, such as the internet or a digital library, and
return a set of relevant documents or other sources of
information.
Data Retrieval
▪ The scope of data retrieval is narrower: it is limited to
retrieving specific data from a structured database or other
data storage system.
▪ Queries extract exactly the desired data from a larger data set
rather than a set of possibly relevant documents.
Purpose
▪ The purpose of Information Retrieval is to help users find
relevant information quickly and efficiently.
▪ It is often used in situations where the user is not sure exactly
what they are looking for, and needs to explore a large
collection of data to find relevant information.
▪ The purpose of Data Retrieval is to extract specific data
elements for analysis or processing.
▪ It is often used in business intelligence or data analysis
applications, where the user needs to extract specific data
elements from a larger data set for further analysis.
Data Retrieval vs. Information Retrieval
Aspect                Data retrieval                          Information retrieval
Data organization     Structured                              Unstructured
Fields                Clear semantics (ID, name, age)         No fields (other than text)
Query language        Artificial (defined, SQL)               Free text (“natural language”), Boolean
Matching              Exact (results are always “correct”)    Partial match, best match
Query specification   Complete                                Incomplete
Items wanted          Matching                                Relevant
Accuracy              100%                                    < 50%
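The contrast in the table can be made concrete with a small sketch. The first half uses Python's standard `sqlite3` module for an exact-match query over a structured table; the second half scores free-text documents by word overlap, which is only one simple stand-in for "best match". The table `person`, the sample rows, the documents, and the helper `overlap` are all illustrative assumptions.

```python
import sqlite3

# Data retrieval: a structured table queried with an exact-match SQL condition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Ada", 36), (2, "Alan", 41), (3, "Grace", 36)],
)
exact_rows = conn.execute(
    "SELECT name FROM person WHERE age = ? ORDER BY id", (36,)
).fetchall()
conn.close()

# Information retrieval: free-text query, partial match, best match first.
documents = {
    "doc1": "cheap flights to paris",
    "doc2": "paris hotels and restaurants",
    "doc3": "train timetable for berlin",
}


def overlap(query: str, text: str) -> int:
    """Number of distinct query words that also appear in the text."""
    return len(set(query.split()) & set(text.split()))


query = "cheap paris hotels"
best_match = sorted(
    (doc_id for doc_id in documents if overlap(query, documents[doc_id]) > 0),
    key=lambda doc_id: (-overlap(query, documents[doc_id]), doc_id),
)
print(exact_rows, best_match)
```

Every row returned by the SQL query satisfies the condition exactly. Neither document returned by the free-text query contains all three query words, yet both are plausible answers, which is what "partial match" and "relevant" mean in the table.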
Information Retrieval systems
▪ An Information Retrieval System is a system that is capable of
storage, retrieval, and maintenance of information.
▪ It consists of a software program that helps a user find the
information they need.
▪ Modern information retrieval systems handle the storage,
organization, and access of not only textual information but also
multimedia information comprising text, audio, images, and video.
Objectives of Information Retrieval Systems
Scope of IR System
Unstructured Information
▪ This information either does not have a pre-defined data model
or is not organized in a pre-defined order.
▪ Unstructured information is typically text-heavy, but may
contain data such as dates, numbers, and facts as well.
▪ Examples of “unstructured data” may include:
▪ books,
▪ journals,
▪ documents,
▪ metadata,
▪ health records,
▪ audio,
▪ video,
▪ analog data,
▪ images,
▪ files, and
▪ unstructured text such as the body of an e-mail message, Web
page, or word-processor document.
Scope of IR System
Structured Information
Structure of an IR System
▪ The black box is the information retrieval system.
▪ The notion of relevance is at the centre of IR.
▪ The primary goal of an IR system is to retrieve all the
documents which are relevant to a user query while retrieving
as few non-relevant documents as possible.
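The two halves of this goal are conventionally measured by recall (how many of the relevant documents were retrieved) and precision (how many of the retrieved documents are relevant). These measure names are standard in IR but are not introduced in the text above. The sketch below assumes relevance judgments are available as a set of document ids; the function name and sample ids are hypothetical.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query, given sets of document ids."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result set and relevance judgments for a single query.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Retrieving everything in the collection gives perfect recall but poor precision, and retrieving a single sure hit does the opposite, so a system has to balance the two.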
Typical IR System Architecture
IR System vs. Web Search System
[Figure: a Web spider collects pages into the document corpus; the IR system takes the query string and the corpus and returns ranked relevant documents (1. Page1, 2. Page2, 3. Page3, ...).]
The Retrieval Process
• It is necessary to define the text database before any of the
retrieval processes are initiated
• This is usually done by the manager of the database and includes
specifying the following
– The documents to be used
– The operations to be performed on the text
– The text model to be used (the text structure and what
elements can be retrieved)
• The text operations transform the original documents and the
information needs and generate a logical view of them
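A minimal sketch of such text operations follows, under stated assumptions: the logical view is taken to be a list of lowercased tokens with a few stopwords removed, and the same operations are applied to documents and to the information need. The stopword list, the function names, and the inverted index built at the end are illustrative choices; the text above does not prescribe any of them.

```python
import re
from collections import defaultdict

# Illustrative stopword list only; real systems use larger, tuned lists.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}


def logical_view(text: str) -> list[str]:
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each term of the logical view to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in logical_view(text):
            index[term].add(doc_id)
    return dict(index)


# Hypothetical two-document text database.
documents = {
    "d1": "The retrieval of information in libraries.",
    "d2": "Information needs and the Web.",
}
index = build_index(documents)
query_terms = logical_view("the information need")
print(query_terms, sorted(index))
```

Because no stemming is applied in this sketch, the query term `need` does not match the indexed term `needs`. Decisions of this kind are part of choosing the operations to be performed on the text.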