IR Unit-1 - Updated

WELCOME

Department Optional Course -4

Subject: Information Retrieval.
Course Code: CSDC7023

Mr. S. G. Shaikh
Asst. Professor
Dept. Of Computer Engg,
AIKTC, New Panvel
Cell. +91 9960726716

Mr. S. G. Shaikh, Assistant Professor, Department of Computer Engineering ,AIKTC, New Panvel
Subject: Information Retrieval CSDC7023
Prerequisite: Data structures and algorithms
Course Objectives:
The course aims to enable students:
1 To learn the fundamentals of Information Retrieval
2 To analyze various Information Retrieval modeling techniques
3 To understand query processing and its applications
4 To explore the various indexing and scoring techniques
5 To assess the various evaluation methods
6 To analyze various information retrieval techniques for real-world applications

Course Outcomes:
Learners will be able to:
1 Describe and analyze the concepts and challenges of information retrieval systems.
2 Design the various modeling techniques for information retrieval systems.
3 Implement the query structure and various query operations.
4 Analyze the indexing and scoring operations in information retrieval systems.
5 Perform the evaluation of information retrieval systems.
6 Analyze various information retrieval techniques for real-world applications.

Unit No-01: Introduction to Information Retrieval

Syllabus:-
Introduction to Information Retrieval, Basic Concepts, Information Versus Data,
Trends and research issues in information retrieval. The retrieval process,
Information retrieval in the library, web and digital libraries.

• Reference Books Used

• T1: Modern Information Retrieval, Baeza-Yates, R. and Ribeiro-Neto, B., 1999. ACM Press
• T2: Introduction to Modern Information Retrieval, G. G. Chowdhury. Neal-Schuman
• R1: Storage Network Management and Retrieval, Vaishali Khairnar

Introduction to Information Retrieval


• Information retrieval (IR) deals with the representation, storage,
organization of, and access to information items.
• The representation and organization of the information items should
give users easy access to the information they are interested in.
• IR is a field that has been developing in parallel
with database systems for many years.
• Unlike database systems, which focus on query and
transaction processing of structured data, information retrieval is
concerned with the organization and retrieval of information from large
collections of text-based documents.

• Because information retrieval and database systems handle different
kinds of data, some database problems are usually absent from
information retrieval systems, such as concurrency control, recovery,
transaction management, and update.
• Conversely, some common information retrieval problems are usually not
encountered in traditional database systems, such as unstructured
documents, approximate search based on keywords, and the notion of
relevance.
• Because of the abundance of text data, information retrieval has found
many applications.
• Existing information retrieval systems include online library
catalog systems, online records management systems, and the more
recently developed Web search engines.

• A typical information retrieval problem is to locate relevant documents in a
document collection based on a user's query, which is often a few keywords
describing an information need, although it can also be an example of a relevant
document.
• This setting is most suitable when a user has an ad hoc (i.e., short-term)
information need, such as finding information to buy a used car. When a user has
a long-term information need (e.g., a researcher's interests), a retrieval system
can also take the initiative to "push" any newly arrived information item to the
user if the item is judged relevant to the user's information need.
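
The "push" behaviour for a long-term need can be sketched as a simple filter over newly arrived items. This is only an illustration: the profile keywords, the `min_overlap` threshold, and the function names are assumptions made for the example, and a real system would use a much richer relevance judgment.

```python
# Hypothetical sketch of "push"-style filtering for a long-term information need.
# The profile, the threshold, and all function names are illustrative assumptions.

def tokenize(text):
    """Lowercase the text and keep only the alphabetic part of each word."""
    words = []
    for raw in text.lower().split():
        word = "".join(ch for ch in raw if ch.isalpha())
        if word:
            words.append(word)
    return words


def is_relevant(document, profile, min_overlap=2):
    """Judge a document relevant if it shares enough keywords with the profile."""
    overlap = set(tokenize(document)) & profile
    return len(overlap) >= min_overlap


def push_new_documents(arrivals, profile):
    """Return the newly arrived documents that should be pushed to the user."""
    return [doc for doc in arrivals if is_relevant(doc, profile)]


researcher_profile = {"retrieval", "indexing", "ranking"}
arrivals = [
    "A new ranking method for web retrieval",
    "Weekend football results",
    "Indexing and retrieval of scanned archives",
]
pushed = push_new_documents(arrivals, researcher_profile)
print(pushed)  # the first and third documents
```

Here the user never issues a query; the stored profile plays the role of a standing query that is matched against each arriving item.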

There are two basic measures for assessing the quality of text retrieval:

Precision − This is the percentage of retrieved documents that are actually relevant
to the query (i.e., "correct" responses). It is formally defined as
$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

Recall − This is the percentage of documents relevant to the query that
were actually retrieved. It is formally defined as
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$

An information retrieval system often has to trade off recall for
precision or vice versa. One commonly used trade-off measure is the F-score,
defined as the harmonic mean of recall and precision:
$$F\text{-score} = \frac{\text{recall} \times \text{precision}}{(\text{recall} + \text{precision})/2}$$
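
The three measures can be computed directly from sets of document identifiers. The sketch below is a minimal illustration; the document IDs are made up, and returning 0.0 for an empty denominator is a convention assumed here rather than something the definitions prescribe.

```python
# Minimal sketch: precision, recall, and F-score from sets of document IDs.
# The IDs are made-up illustration data; the zero-division handling is an assumption.

def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(relevant & retrieved) / len(retrieved)


def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(relevant & retrieved) / len(relevant)


def f_score(relevant, retrieved):
    """Harmonic mean of recall and precision."""
    p = precision(relevant, retrieved)
    r = recall(relevant, retrieved)
    if p + r == 0:
        return 0.0
    return (r * p) / ((r + p) / 2)


relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}
retrieved = {"d1", "d2", "d3", "d7"}

print(precision(relevant, retrieved))  # 3 of 4 retrieved are relevant: 0.75
print(recall(relevant, retrieved))     # 3 of 6 relevant were retrieved: 0.5
print(f_score(relevant, retrieved))    # 0.375 / 0.625, about 0.6
```

Retrieving more documents can only keep recall the same or raise it, but it usually lowers precision, which is the trade-off the F-score summarizes in one number.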

• An information retrieval system searches a collection of natural language
documents with the goal of retrieving exactly the set of documents that
matches a user's question.
• Such systems have their origin in library systems.

• They assist users in finding the information they require, but they do
not attempt to deduce or generate answers.
• An IR system reports the existence and location of documents that might contain
the information the user requires.
• The documents that satisfy the user's requirement are called relevant
documents. A perfect IR system would retrieve only relevant
documents.

Basics of IR Systems

• A user who needs information formulates a request in the form of a
natural-language query. The IR system then responds by retrieving the relevant
documents about the required information.

The step-by-step procedure followed by these systems is as follows:

• Indexing the collection of documents.
• Transforming the query in the same way as the document content is
represented.
• Comparing the description of each document with that of the query.
• Listing the results in order of relevance.

Retrieval systems consist mainly of two processes:

• Indexing
• Matching

Indexing
It is the process of selecting terms to represent a text.

Indexing involves:
• Tokenization of the text string
• Removing frequent words (stopwords)
• Stemming

Two common retrieval models that build on indexed terms:

• Boolean model
• Vector space model
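
The indexing steps listed above can be sketched in a few lines, ending with a Boolean AND query over an inverted index. This is a simplified illustration under stated assumptions: `STOPWORDS` is a tiny hand-picked list and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm; neither comes from the course material.

```python
# Minimal sketch of indexing: tokenization, stopword removal, stemming,
# and an inverted index queried with a Boolean AND.
# STOPWORDS and naive_stem are simplified, hypothetical stand-ins.
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(word):
    """Strip a few common English suffixes (far cruder than a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stopwords, and stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each index term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return IDs of documents containing every term of the query."""
    terms = index_terms(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Indexing the collection of documents",
    2: "Stemming reduces words to a common root",
    3: "The index makes searching documents faster",
}
index = build_inverted_index(docs)
hits = boolean_and(index, "indexing documents")
print(hits)  # documents 1 and 3
```

Note that the query passes through the same `index_terms` function as the documents, so "indexing" in the query and "index" in document 3 end up as the same term.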

Matching

It is the process of computing a measure of similarity between two text
representations.

The relevance of a document is computed from the following parameters:

1. TF: Term Frequency, a measure of how often a given term appears in a
document, normalized here by the document length.

TF (i, j) = (count of ith term in jth document)/(total terms in jth document)

2. IDF: Inverse Document Frequency, a measure of the general importance of
the term across the collection. It is usually taken on a logarithmic scale.

IDF (i) = log((total no. of documents)/(no. of documents containing ith term))

3. TF-IDF Score (i, j) = TF (i, j) * IDF (i)
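
As a rough illustration of matching, the sketch below weights terms by TF-IDF and ranks documents by cosine similarity to the query, in the spirit of the vector space model. It assumes the common logarithmic form of IDF, uses a made-up three-document corpus, and treats a term that appears in no document as having IDF 0; all function names are illustrative.

```python
# Minimal sketch of matching with TF-IDF weights and cosine similarity.
# Assumes the logarithmic form of IDF; the tiny corpus is made-up data.
import math


def tf(term, tokens):
    """Count of the term in the document divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """log(total documents / documents containing the term); 0 if unseen."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0
    return math.log(len(corpus) / containing)


def tfidf_vector(tokens, corpus):
    """Sparse vector mapping each term of the text to its TF-IDF weight."""
    return {term: tf(term, tokens) * idf(term, corpus) for term in set(tokens)}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


corpus = [
    "car racing news".split(),
    "car insurance prices".split(),
    "tourism in france".split(),
]
query = "car racing".split()

query_vec = tfidf_vector(query, corpus)
scores = [cosine(query_vec, tfidf_vector(doc, corpus)) for doc in corpus]
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document 0 first, then 1, then 2
```

The term "car" occurs in two of the three documents, so its IDF is low and it contributes less to the ranking than the rarer term "racing".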

The effective retrieval of relevant information is directly affected both by
the user task and by the logical view of the documents adopted by the
retrieval system.

The User Task

• The user of a retrieval system has to translate their information need into a
query in the language provided by the system.
• With an information retrieval system, this normally means specifying a set
of words that convey the semantics of the information need.
• With a data retrieval system, a query expression (for instance,
a regular expression) is used to convey the constraints that must be
satisfied by objects in the answer set.
• In both cases, we say that the user searches for useful information by
executing a retrieval task.
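
The contrast between the two kinds of query can be illustrated with a small sketch. The records and both helper functions are hypothetical: the first applies a regular expression as a hard constraint, while the second ranks records by keyword overlap, a deliberately simple stand-in for a real relevance score.

```python
# Illustrative contrast between a data retrieval query and an IR query.
# The records and both helper functions are hypothetical examples.
import re

records = [
    "Formula 1 racing calendar 2024",
    "Used car buying guide",
    "History of car racing at Le Mans",
]


def data_retrieval(pattern, items):
    """Return exactly the items that satisfy the regular expression."""
    regex = re.compile(pattern)
    return [item for item in items if regex.search(item)]


def information_retrieval(keywords, items):
    """Rank items by how many query keywords they contain; drop non-matches."""
    wanted = {k.lower() for k in keywords}
    scored = []
    for item in items:
        words = set(re.findall(r"[a-z0-9]+", item.lower()))
        score = len(wanted & words)
        if score > 0:
            scored.append((score, item))
    # sort() is stable, so ties keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]


exact = data_retrieval(r"^Used car", records)
ranked = information_retrieval(["car", "racing"], records)
print(exact)   # only the record that satisfies the constraint
print(ranked)  # all partially matching records, best match first
```

A record either satisfies the regular expression or it does not, whereas the keyword query returns every partially matching record in an order that reflects how well it matches.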

• Consider now a user who has an interest which is either poorly defined or
which is inherently broad.
• For instance, the user might be interested in documents about car racing in
general. In this situation, the user might use an interactive interface to simply look
around in the collection for documents related to car racing.
• For instance, he might find interesting documents about Formula 1 racing, about
car manufacturers, or about the "24 Hours of Le Mans". Furthermore, while reading
about the "24 Hours of Le Mans", he might turn his attention to a document which
provides directions to Le Mans and, from there, to documents which cover tourism
in France. In this situation, we say that the user is browsing the documents in the
collection, not searching.
• It is still a process of retrieving information, but one whose main objectives are not
clearly defined in the beginning and whose purpose might change during the
interaction with the system.


Figure: Interaction of the user with the retrieval system through distinct tasks.

Logical View of the Documents
• For historical reasons, documents in a collection are frequently represented
through a set of index terms or keywords. Such keywords might be extracted directly from
the text of the document or might be specified by a human subject (as frequently done in the
information sciences arena).
• Whether these representative keywords are derived automatically or generated by a
specialist, they provide a logical view of the document.
• Modern computers make it possible to represent a document by its full set of words. In
this case, we say that the retrieval system adopts a full text logical view (or representation) of
the documents.
• With very large collections, however, even modern computers might have to reduce the set of
representative keywords.
• This can be accomplished through the elimination of stopwords (such as articles and
connectives), the use of stemming (which reduces distinct words to their common
grammatical root), and the identification of noun groups (which eliminates adjectives,
adverbs, and verbs).
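
The reduction from a full text logical view to a smaller set of index terms can be made concrete by counting what survives each text operation. In this sketch `STOPWORDS` and `strip_suffix` are tiny hypothetical stand-ins for a real stopword list and stemmer, and noun-group identification is left out because it would need a part-of-speech tagger.

```python
# Sketch of moving from a full text logical view to a smaller set of index terms.
# STOPWORDS and strip_suffix are deliberately tiny, hypothetical stand-ins;
# noun-group identification is omitted because it needs a part-of-speech tagger.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "are", "is"}


def full_text_view(text):
    """Every word of the document, in order (the full text logical view)."""
    return re.findall(r"[a-z]+", text.lower())


def without_stopwords(tokens):
    """Drop articles, connectives, and other very frequent words."""
    return [t for t in tokens if t not in STOPWORDS]


def strip_suffix(word):
    """Crude stemming: remove one common suffix if enough of the word remains."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_term_view(tokens):
    """Distinct stemmed terms left after stopword removal."""
    return sorted({strip_suffix(t) for t in without_stopwords(tokens)})


text = "The user is searching the documents and the document is indexed in the index"
full = full_text_view(text)
reduced = without_stopwords(full)
terms = index_term_view(full)

print(len(full), len(reduced), len(terms))  # 14 6 4
print(terms)  # ['document', 'index', 'search', 'user']
```

Each step shrinks the representation, from 14 words to 6 and then to 4 distinct terms, at the cost of discarding word order and inflection.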

Figure: Logical view of a document: from full text to a set of index terms.

Difference between Information Retrieval and Data Retrieval

Issues in Information Retrieval
Indexing is the most vital part of any information
retrieval system.
It is the process in which the documents required by
users are transformed into searchable data
structures.
• Indexing can also be viewed as a process of
extraction rather than analysis of particular content.
It is a core function of the IR process, since
it is the first step in IR and enables efficient
information retrieval.

In this process, document surrogates are first
created to represent each document.
Second, the original documents are analyzed at two
levels: simple (identifying meta-information,
e.g., author, title, subject) and complex (linguistic
analysis of content).
• Indexes are the data structures that are used to
make search faster.

1. Document and Query Indexing –

The main goal of document and query indexing is to
identify the important meanings and create an internal
representation.
The factors to be considered are accuracy in
representing semantics, exhaustiveness, and ease of
manipulation by a computer.

2. Query Evaluation –
The issue here is how, within the retrieval model, a document
can be represented with the selected keywords, and how document and
query representations are compared to calculate a score.
Information retrieval (IR) also has to deal with uncertainty
and vagueness in information systems.
• Uncertainty:
The available representation typically does not reflect the true
semantics of objects such as images and videos.
• Vagueness:
The information that the user requires lacks clarity and is only
vaguely expressed in a query, feedback, or user action.