Chapter 1 Introduction To IR
Information Retrieval
Information retrieval (IR) is the process of finding
material (usually documents) of an unstructured
nature (usually text) that satisfies an information
need from within large collections (usually stored on
computers).
Information is organized into (a large number of)
documents
Large collections of documents from various sources: news
articles, research papers, books, digital libraries, Web pages,
etc.
Example: Web search engines such as Google claim to index over
1 trillion pages
General Goal of Information Retrieval
To help users find useful information based on
their information needs (with a minimum effort)
despite
The increasing complexity of information
The changing needs of users
Information Retrieval vs. Data Retrieval
Emphasis of IR is on the retrieval of information, rather than on
the retrieval of data
Data retrieval
Consists mainly of determining which documents contain a set
of keywords in the user query (which is not enough to satisfy
the user's information need)
Aims at retrieving all objects that satisfy well-defined semantics
A single erroneous object among a thousand retrieved objects
implies failure
Information retrieval
Is concerned with retrieving information about a subject or topic
rather than retrieving data that satisfies a given query
Semantics are frequently loose: the retrieved objects might be
inaccurate
Small errors are tolerated
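The contrast between the two kinds of retrieval can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents, the query, and the function names `exact_match` and `best_match` are invented here, and counting shared words is a deliberately crude stand-in for a real relevance score.

```python
# Toy collection; the documents and the query are invented for illustration.
docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "databases store structured data in tables",
    "d3": "retrieval of documents from large text collections",
}

def tokens(text):
    """Split a text into a set of lowercase words."""
    return set(text.lower().split())

def exact_match(query, docs):
    """Data-retrieval style: keep only documents containing every query term."""
    q = tokens(query)
    return [doc_id for doc_id, text in docs.items() if q <= tokens(text)]

def best_match(query, docs):
    """IR style: rank documents by how many query terms they share."""
    q = tokens(query)
    scored = [(len(q & tokens(text)), doc_id) for doc_id, text in docs.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    # Highest score first; ties are broken by document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

query = "retrieval of text documents"
exact_hits = exact_match(query, docs)   # only documents with all four terms
ranked_hits = best_match(query, docs)   # partial matches are kept and ordered
```

With this toy data the exact match keeps a single document, while the best-match ranking also returns a partially matching document lower in the list, which is the "small errors are tolerated" behaviour described above.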
Information Retrieval vs. Data Retrieval
An example of a data retrieval system is a relational database
                      Data Retrieval                    Information Retrieval
Data organization     Structured                        Unstructured
Fields                Clear semantics                   No fields (other than text)
                      (ID, Name, age, …)
Query language        Artificial (defined, SQL)         Free text (“natural language”), Boolean
Matching              Exact (results are always         Partial match, best match
                      “correct”)
Query specification   Complete                          Incomplete
Items wanted          Matching                          Relevant
Accuracy              100%                              < 50%
Error response        Sensitive                         Insensitive
Why is IR so hard?
Traditional information retrieval (IR) systems
attempt to find documents relevant to a
user’s request.
Information retrieval problem: locating relevant
documents based on user input, such as keywords or
example documents
The real problem boils down to matching the language of
the query to the language of the document.
Simply matching on words is a very brittle (inflexible)
approach, because one word can have several different
meanings. Consider the word “take”:
“take a place at the table”
“take money to the bank”
“take a picture”
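A short Python sketch makes the brittleness concrete. It assumes a naive matcher (the hypothetical function `keyword_match`) that only checks whether the query word occurs in a text; the query word "photograph" is an invented example.

```python
# The three phrases from the example above.
phrases = [
    "take a place at the table",
    "take money to the bank",
    "take a picture",
]

def keyword_match(term, texts):
    """Return every text that contains the term as a whole word."""
    return [t for t in texts if term in t.lower().split()]

# Every phrase matches, although "take" means something different in each.
matches = keyword_match("take", phrases)

# A user who asks for "photograph" gets nothing back, even though
# "take a picture" is about exactly that: the words simply do not overlap.
missed = keyword_match("photograph", phrases)
```

The matcher fails in both directions: it cannot separate the different senses of one word, and it cannot connect different words that share one sense.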
Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of documents
[Figure: the USER interacts with the DB through two tasks, Retrieval and Browsing]
The User Task
Retrieval
• It is the process of retrieving information whereby the main
objective is clearly defined from the beginning of the
searching process.
• The user of a retrieval system has to translate his
information need into a query in the language provided by
the system.
• In this context (i.e. by specifying a set of words), the user
searches for useful information by executing a retrieval task.
• English-language statement:
I want a book by J. K. Rowling titled The Chamber of Secrets
Browsing
• It is the process of retrieving information, whereby the
main objective is not clearly defined from the beginning
and whose purpose might change during the interaction
with the system.
• E.g. a user might search for documents about ‘car racing’.
Meanwhile, he might find interesting documents about ‘car
manufacturers’. While reading about car manufacturers in
Addis, he might turn his attention to a document
providing ‘directions to Addis’, and from this to documents
which cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing the collection
rather than searching, since the user may simply be interested
in glancing around.
Logical View of Documents
Documents in a collection are frequently represented by a
set of index terms or keywords
Such keywords are mostly extracted directly from the text of
the document
These representative keywords provide a logical view of the
document
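As a rough illustration of how such a logical view might be derived, the sketch below extracts index terms from a sentence. The stopword list and the function name `index_terms` are assumptions made for this example; real systems use much larger stopword lists and often add steps such as stemming.

```python
import re

# A tiny, hand-picked stopword list (an assumption for this example only).
STOPWORDS = {"a", "an", "and", "are", "by", "in", "is", "of", "the", "to"}

def index_terms(text):
    """Return the sorted set of index terms extracted from a text."""
    # Lowercase the text and keep runs of letters and digits as words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Drop stopwords and duplicates; what remains is the logical view.
    return sorted({w for w in words if w not in STOPWORDS})

doc = "Documents in a collection are frequently represented by a set of index terms."
terms = index_terms(doc)
```

The resulting list of keywords stands in for the full text of the document when queries are matched against the collection.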
Structure of an IR System
An information retrieval system serves as a bridge between the
world of authors and the world of readers/users.
That is, writers present a set of ideas in a document using a set of
concepts, and users then query the IR system for relevant documents
that satisfy their information need.
[Figure: the IR system as a black box. Inputs: a document corpus and the user's query string. Output: ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
Web Search System
[Figure: a web spider collects pages into the document corpus. Inputs to the IR system: the document corpus and a query string. Output: ranked documents (1. Page1, 2. Page2, 3. Page3, ...)]
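The role of the web spider, collecting pages into the document corpus, can be sketched without any network access. Everything below is hypothetical: `TOY_WEB` is an invented in-memory stand-in for the Web, the `.example` addresses are placeholders, and `crawl` is a plain breadth-first traversal rather than a description of how any real search engine crawls.

```python
from collections import deque

# Hypothetical in-memory "web": address -> (page text, outgoing links).
TOY_WEB = {
    "a.example": ("home page", ["b.example", "c.example"]),
    "b.example": ("car racing news", ["c.example"]),
    "c.example": ("tourism in ethiopia", ["a.example"]),
    "d.example": ("unlinked page", []),
}

def crawl(seed, web):
    """Breadth-first crawl from a seed page; returns the collected corpus."""
    corpus = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        # Skip pages already collected and links that lead nowhere.
        if url in corpus or url not in web:
            continue
        text, links = web[url]
        corpus[url] = text
        queue.extend(links)
    return corpus

corpus = crawl("a.example", TOY_WEB)
```

Only pages reachable by links from the seed end up in the corpus, so the page that nothing links to is never collected and can never be returned by a search.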
What is Information Retrieval?
A good formal definition of information retrieval is
given in Baeza-Yates & Ribeiro-Neto (1999, p. 1)
“Information retrieval deals with representation,
storage, organization of, and access to information
items. The organization and access of information
items should provide the user with easy access to the
information in which he is interested”
The definition incorporates all important features of
a good information retrieval system
Representation
Storage
Organization
Access
The focus is on the user information need
The Retrieval Process
It is necessary to define the text database before
any of the retrieval processes are initiated
This is usually done by the manager of the database
and includes specifying the following
The documents to be used
The operations to be performed on the text
The text model to be used (the text structure and what
elements can be retrieved)
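The three choices above (documents, text operations, text model) can be made concrete with a minimal sketch. It assumes the simplest options throughout: three invented one-line documents, a text operation that only lowercases and splits on whitespace, and an inverted index queried with "all terms must occur" semantics. The names `text_operations`, `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

# The documents to be used (invented for illustration).
documents = {
    1: "Car racing in Ethiopia",
    2: "Car manufacturers in Addis",
    3: "Tourism in Ethiopia",
}

def text_operations(text):
    """The operations performed on the text: lowercase and split into words."""
    return text.lower().split()

def build_index(documents):
    """Indexing: map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text_operations(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching: return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in text_operations(query)]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(documents)
result = search(index, "car ethiopia")
```

The same text operations are applied to both documents and queries, so that the two sides are compared in the same vocabulary. The index is built once, before any query arrives, which is why the text database has to be defined first.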
Detailed view of the Retrieval Process
[Figure: detailed view of the retrieval process. Components: User Interface, Text Operations, Query Language & Operations, Indexing, Searching, Index, DB Manager Module. Flow labels: user need, text, logical view, user feedback]