
Chapter 1 Introduction To IR

Uploaded by

Dawit Sebhat
Information Retrieval
 Information retrieval (IR) is the process of finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
 Information is organized into (a large number of) documents
   Large collections of documents come from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
   Example: Web search engines such as Google claim to index over 1 trillion pages

General Goal of Information Retrieval
 To help users find useful information based on their information needs (with minimum effort) despite
   Increasing complexity of information
   Changing needs of users
 Provide immediate random access to the document collection.
 Retrieval systems such as Google and Yahoo are developed with this aim.

Information Retrieval vs. Data Retrieval
 The emphasis of IR is on the retrieval of information rather than on the retrieval of data
 Data retrieval
   Consists mainly of determining which documents contain the set of keywords in the user query (which is not enough to satisfy the user's information need)
   Aims at retrieving all objects that satisfy well-defined semantics
   A single erroneous object among a thousand retrieved objects implies failure
 Information retrieval
   Is concerned with retrieving information about a subject or topic rather than retrieving data that satisfies a given query
   Semantics are frequently loose: the retrieved objects might be inaccurate
   Small errors are tolerated

Information Retrieval vs. Data Retrieval
 An example of a data retrieval system is a relational database
                     Data Retrieval                         Info Retrieval
Data organization    Structured                             Unstructured
Fields               Clear semantics (ID, Name, age, …)     No fields (other than text)
Query language       Artificial (defined, SQL)              Free text (“natural language”), Boolean
Matching             Exact (results are always “correct”)   Partial match, best match
Query specification  Complete                               Incomplete
Items wanted         Matching                               Relevant
Accuracy             100%                                   < 50%
Error response       Sensitive                              Insensitive
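
The "Matching" row can be illustrated with a small sketch. The records, documents, and function names below are invented for illustration and do not come from the slides; the term-overlap score is only one simple way to express "best match", not a method the slides prescribe.

```python
# Toy data; every name and value here is illustrative only.
records = [
    {"id": 1, "name": "Abebe", "age": 30},
    {"id": 2, "name": "Sara", "age": 25},
]

docs = {
    "d1": "car racing events in addis",
    "d2": "tourism in ethiopia",
    "d3": "racing bikes and racing cars",
}


def data_retrieval(rows, field, value):
    """Exact match: a row either satisfies the condition or it does not."""
    return [row for row in rows if row[field] == value]


def best_match(collection, query):
    """Partial match: rank documents by how many query terms they share."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in collection.items():
        overlap = len(query_terms & set(text.split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


print(data_retrieval(records, "age", 25))
print(best_match(docs, "car racing in ethiopia"))
```

The first call behaves like a database query: only the row that meets the condition exactly is returned. The second call returns every document that shares at least one term with the query, ordered by overlap, so partially matching documents still appear in the result.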
Why is IR so hard?
 Traditional information retrieval (IR) systems attempt to find documents relevant to a user’s request.
 Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
   The real problem boils down to matching the language of the query to the language of the document.
   Simply matching on words is a very brittle approach: one word can have several different meanings. Consider the word “take”:
 “take a place at the table”
 “take money to the bank”
 “take a picture”
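
A minimal sketch of why word matching is brittle, using the three phrases above as a toy collection. The function name `keyword_match` and the query word "photograph" are assumptions added for illustration.

```python
sentences = [
    "take a place at the table",
    "take money to the bank",
    "take a picture",
]


def keyword_match(query_word, texts):
    """Return every text that contains the query word as a whole token."""
    return [text for text in texts if query_word in text.split()]


# "take" matches all three sentences although it means something
# different in each of them.
print(keyword_match("take", sentences))

# "photograph" matches nothing, although "take a picture" is about the
# same idea expressed in other words.
print(keyword_match("photograph", sentences))
```

The matcher cannot tell the senses of "take" apart, and it misses a relevant sentence whenever the query uses different vocabulary from the document.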
Basic Concepts in Information Retrieval:
(i) User Task and (ii) Logical View of Documents

The User Task:
 There are two user tasks: retrieval and browsing

[Figure: the USER reaches the DB through either Retrieval or Browsing]
The User Task
Retrieval
• It is the process of retrieving information where the main objective is clearly defined from the beginning of the search process.
• The user of a retrieval system has to translate his information need into a query in the language provided by the system.
• In this context (i.e. by specifying a set of words), the user searches for useful information by executing a retrieval task.
• English-language statement:
I want a book by J. K. Rowling titled The Chamber of Secrets

Browsing
• It is the process of retrieving information where the main objective is not clearly defined from the beginning and may change during the interaction with the system.
• E.g. a user might search for documents about ‘car racing’. Meanwhile, he might find interesting documents about ‘car manufacturers’. While reading about car manufacturers in Addis, he might turn his attention to a document providing ‘directions to Addis’, and from this to documents which cover ‘Tourism in Ethiopia’.
• In this context, the user is said to be browsing the collection rather than searching, since he may simply be interested in glancing around.
Logical View of Documents
Documents in a collection are frequently represented by a
set of index terms or keywords
Such keywords are mostly extracted directly from the text of
the document
These representative keywords provide a logical view of the
document

[Figure: Docs -> tokenization -> stop-word removal -> stemming -> indexing; the logical view ranges from full text to index terms]

Document representation can be viewed as a continuum, in which the logical view of documents might shift from full text to index terms
Logical view of documents
 If full text:
   Each word in the text is a keyword
   Most complex form
   Expensive
 If the full text is too large, the set of representative keywords can be reduced through a transformation process called text operations
   Text operations reduce the complexity of the document representation and allow the logical view to move from that of full text to a set of index terms
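
The text operations named on the previous slide (tokenization, stop-word removal, stemming) can be sketched as a small pipeline. The stop-word list and the suffix-stripping rule below are deliberately crude assumptions made for this example; a real system would normally use a proper stemmer such as Porter's algorithm.

```python
import re

# Illustrative lists only; both are assumptions, not part of the slides.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are", "for"}
SUFFIXES = ("ing", "ers", "er", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little meaning on their own."""
    return [token for token in tokens if token not in STOP_WORDS]


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Reduce full text to a sorted set of index terms."""
    tokens = remove_stop_words(tokenize(text))
    return sorted({crude_stem(token) for token in tokens})


print(index_terms("Retrieving documents and indexing the documents"))
```

Six words of full text shrink to three index terms, which is the shift along the continuum from full text to index terms described above.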

Structure of an IR System
 An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
 That is, writers present a set of ideas in a document using a set of concepts. Users then query the IR system for relevant documents that satisfy their information need.

[Figure: User <-> Black box <-> Documents]

 The black box is the information retrieval system.
 To be effective in its attempt to satisfy the information need of users, the IR system must ‘interpret’ the contents of the documents in a collection and rank them according to their degree of relevance to the user query.
 Thus the notion of relevance is at the center of IR
 The primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant
documents as possible
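
The two halves of this goal are commonly quantified as recall (the share of relevant documents that were retrieved) and precision (the share of retrieved documents that are relevant). The slides do not name these measures at this point, so the sketch below is an added illustration with invented document ids.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the system returned
    relevant:  ids a human judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, two of which are among the three relevant ones.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(precision, recall)
```

Retrieving "all the relevant documents" pushes recall toward 1, while retrieving "as few non-relevant documents as possible" pushes precision toward 1.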
Typical IR System Architecture

[Figure: a query string and the document corpus are the inputs to the IR system, which outputs ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...)]
Web Search System

[Figure: a spider crawls the Web to build the document corpus; a query string and the corpus are the inputs to the IR system, which outputs ranked documents (1. Page1, 2. Page2, 3. Page3, ...)]

What is Information Retrieval?
 A good formal definition of information retrieval is given in Baeza-Yates & Ribeiro-Neto (1999, p. 1):
“Information retrieval deals with representation, storage, organization of, and access to information items. The organization and access of information items should provide the user with easy access to the information in which he is interested”
 The definition incorporates all important features of
a good information retrieval system
 Representation
 Storage
 Organization
 Access
 The focus is on the user information need
The Retrieval Process
 It is necessary to define the text database before any of the retrieval processes are initiated
 This is usually done by the manager of the database and includes specifying the following
   The documents to be used
   The operations to be performed on the text
   The text model to be used (the text structure and what elements can be retrieved)
 The text operations transform the original documents and the information needs and generate a logical view of them
Retrieval Process …
 Once the logical view of the documents is defined, the database module builds an index of the text
   An index is a critical data structure
   It allows fast searching over large volumes of data
 Different index structures might be used, but the most popular one is the inverted file
 Given that the document database is indexed, the
retrieval process can be initiated
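
A minimal sketch of an inverted file, assuming whitespace tokenization and a Boolean AND query. The document ids, texts, and function names are invented for illustration; the slides do not specify an implementation.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search_all_terms(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = {
    1: "information retrieval systems",
    2: "data retrieval with databases",
    3: "information needs of users",
}
index = build_inverted_index(docs)

print(index["retrieval"])
print(search_all_terms(index, "information retrieval"))
```

The index is built once, before any query arrives. Answering a query then only touches the posting lists of the query terms instead of scanning every document, which is what makes searching large collections fast.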

Detail view of the Retrieval Process
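[Figure: the user need enters through the user interface as text; text operations produce a logical view of the query and of the documents; query operations, informed by user feedback, form the query; the indexing module, set up through the DB manager, builds the index (inverted file) over the text database; searching the index returns retrieved docs, and ranking turns them into ranked docs for the user]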
