Chapter 1 Introduction To ISR
Chapter One
Introduction to ISR
What is Information Storage and Retrieval?
o It is a system used to store information gathered from different sources in
such a way that it can be retrieved easily and effectively upon request.
What is Information Storage?
o Information storage is the collection of information from different sources and
its storage either in a storage room (as paper records) or on storage devices
such as a hard disk, DVD, or CD. The information may take any form: audio,
video, or text.
What is Information Retrieval?
o An IR system mainly focuses on electronically searching for and retrieving
stored documents.
o Information Retrieval (IR) can be defined as a software program that deals
with the organization, storage, retrieval, and evaluation of information
from document repositories, particularly textual information.
o Information retrieval is the process of searching for, fetching, and serving
information to the users who requested it.
o An IR system supports operations such as adding documents to the database,
modifying or deleting them, searching, and serving the appropriate documents
to users.
o Information retrieval is the activity of obtaining the documents relevant to a
user's needs from a collection of documents.
o The IR system assists the users in finding the information they require but it does
not explicitly return the answers to the question.
The difficulty lies not only in knowing how to extract information from the
documents but also in knowing how to use it to decide relevance. Thus, the
notion of relevance is at the center of information retrieval. In fact, the
primary goal of an IR system is to retrieve all the documents that are relevant
to a user query while retrieving as few non-relevant documents as possible.
Difference between Information Retrieval and Data Retrieval
| Information Retrieval | Data Retrieval |
|---|---|
| The software program that deals with the organization, storage, retrieval, and evaluation of information from document repositories, particularly textual information. | Data retrieval deals with obtaining data from a database management system such as an ODBMS. It is a process of identifying and retrieving data from the database, based on the query provided by a user or application. |
| Retrieves information about a subject. | Determines the keywords in the user query and retrieves the data. |
| Small errors are likely to go unnoticed. | A single erroneous object means total failure. |
| Not always well structured and is semantically ambiguous. | Has a well-defined structure and semantics. |
| Does not provide a solution to the user of the database system. | Provides solutions to the user of the database system. |
| The results obtained are approximate matches. | The results obtained are exact matches. |
| Results are ordered by relevance. | Results are not ordered by relevance. |
| It is a probabilistic model. | It is a deterministic model. |
o Data retrieval systems retrieve data directly from database management systems
such as an ODBMS by identifying the keywords in the user's query and matching
them against the data in the database.
o An information retrieval system, by contrast, is a set of algorithms or programs
that store, retrieve, and evaluate document and query representations,
especially text-based ones, and display results based on similarity.
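The contrast between exact and approximate matching can be illustrated with a
small Python sketch. The records, documents, field names, and the word-overlap
score below are all made up for illustration; they are assumptions, not part of
any particular DBMS or IR system.

```python
# Hypothetical toy data: structured records and free-text documents.
records = [
    {"id": 1, "city": "Addis Ababa"},
    {"id": 2, "city": "Nairobi"},
]
documents = {
    "d1": "storage and retrieval of text documents",
    "d2": "retrieval of images",
    "d3": "cooking recipes",
}

def data_retrieval(rows, field, value):
    """Exact match: a row either satisfies the condition or it does not."""
    return [row for row in rows if row[field] == value]

def information_retrieval(docs, query):
    """Approximate match: score documents by shared words, best first."""
    query_words = set(query.lower().split())
    scores = {
        doc_id: len(query_words & set(text.lower().split()))
        for doc_id, text in docs.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Keep only documents that share at least one word with the query.
    return [(doc_id, score) for doc_id, score in ranked if score > 0]

exact = data_retrieval(records, "city", "Nairobi")
ranked = information_retrieval(documents, "text retrieval")
print(exact)
print(ranked)
```

The data retrieval call returns only the rows that satisfy the condition, in no
particular order of relevance, while the information retrieval call returns
partially matching documents ordered by their score.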
What is an IR Model?
o An Information Retrieval (IR) model selects and ranks the documents that
the user requires or has asked for in the form of a query.
o The documents and the queries are represented in a similar manner, so that
document selection and ranking can be formalized by a matching function
that returns a retrieval status value (RSV) for each document in the
collection.
o Many of the Information Retrieval systems represent document contents by
a set of descriptors, called terms, belonging to a vocabulary V.
o An IR model determines the query-document matching function according
to four main approaches. One of these is the estimation of the probability
of the user's relevance judgment rel for each document d and query q with
respect to a set R_q of training documents: Prob(rel | d, q, R_q)
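A matching function that returns a retrieval status value for each document can
be sketched in Python as follows. The vocabulary, the sample texts, and the
choice of the Jaccard coefficient as the matching function are assumptions made
for illustration only; real IR models define the RSV in different ways.

```python
def to_terms(text, vocabulary):
    """Represent a text as the set of vocabulary terms it contains."""
    return {word for word in text.lower().split() if word in vocabulary}

def rsv(doc_terms, query_terms):
    """Jaccard overlap, used here as a stand-in matching function."""
    union = doc_terms | query_terms
    if not union:
        return 0.0
    return len(doc_terms & query_terms) / len(union)

# Hypothetical vocabulary V and document collection.
V = {"information", "retrieval", "storage", "database", "query"}
collection = {
    "d1": "information storage and retrieval",
    "d2": "database query languages",
}

# Documents and the query are represented in the same way: as term sets.
query_terms = to_terms("information retrieval", V)
rsv_values = {
    doc_id: rsv(to_terms(text, V), query_terms)
    for doc_id, text in collection.items()
}
print(rsv_values)
```

Sorting the documents by their RSV, highest first, gives the ranking presented
to the user.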
Types of IR Models
1. Acquisition: In this step, documents and other objects are selected from
various web resources that consist of text-based documents.
o The required data is collected by web crawlers and stored in the
database.
2. Representation: This step consists of indexing, which may use free-text
terms or a controlled vocabulary, with manual or automatic techniques.
o Example: abstracting (summarizing) and bibliographic description
(author, title, sources, data, and metadata).
3. File Organization: There are two basic file organization methods, which
can also be combined.
1. Sequential: Records are stored document by document.
2. Inverted: Records are stored term by term, with a list of records
under each term.
4. Query: An IR process starts when a user enters a query into the system.
o Queries are formal statements of information needs, for example,
search strings in web search engines.
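The two file organization methods in step 3 can be sketched in Python as
follows. The sample records and the function name build_inverted_file are
hypothetical; this is a minimal illustration, not a production index.

```python
from collections import defaultdict

# Sequential file: one record per document, stored document by document.
sequential_file = [
    (1, "information storage and retrieval"),
    (2, "storage devices"),
    (3, "retrieval of information"),
]

def build_inverted_file(records):
    """Build a term-by-term file: each term maps to a list of document ids."""
    inverted = defaultdict(list)
    for doc_id, text in records:
        # set() ensures each document is listed once per term.
        for term in set(text.lower().split()):
            inverted[term].append(doc_id)
    return dict(inverted)

inverted_file = build_inverted_file(sequential_file)
print(inverted_file["retrieval"])
print(inverted_file["storage"])
```

With the sequential file, answering a query means scanning every record; with
the inverted file, the records containing a term are found by a single lookup.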
The User Task: The user must first translate the information need into a query.
o In an information retrieval system, the query is a set of words that conveys
the semantics of the information required, whereas in a data retrieval system
a query expression conveys the constraints that the objects must satisfy.
Example: A user sets out to search for one thing but ends up looking at
something else. This means that the user is browsing rather than
searching. The above figure shows the interaction of the user through
different tasks.
Logical View of the Documents: In the past, documents were represented by a
set of index terms or keywords.
o Modern computers can represent a document by its full set of words. The set
of representative keywords can then be reduced by eliminating stop words,
i.e. articles and connectives. Such operations are called text operations.
They reduce the complexity of the document representation from full text
to a set of index terms.
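Stop-word elimination can be sketched in Python as follows. The stop-word list
here is a small hand-picked sample and the function name index_terms is
hypothetical; real systems use much longer lists and further text operations.

```python
# A small illustrative stop-word list (articles, connectives, prepositions).
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "to", "is"}

def index_terms(full_text):
    """Reduce a full text to its index terms by dropping stop words."""
    # Lowercase each word and strip surrounding punctuation.
    words = [word.strip(".,;:!?").lower() for word in full_text.split()]
    return [word for word in words if word and word not in STOP_WORDS]

terms = index_terms("The storage and retrieval of information in a library.")
print(terms)
```

The nine-word sentence is reduced to four index terms, which is the kind of
reduction in representation complexity described above.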
2. Vector Space Model: This model represents documents and queries as
vectors and retrieves documents according to how similar they are to the
query. The vectors used to rank search results can be of two types:
Binary, in the Boolean VSM.
Weighted, in the non-binary VSM.
3. Probability Distribution Model: In this model, documents are
considered as distributions of terms, and queries are matched based on the
similarity of these representations. Matching uses entropy or the probable
utility of the document. These models are of two types:
Similarity-based Probability Distribution Model
Expected-utility-based Probability Distribution Model
4. Probabilistic Models: The probabilistic model is rather simple and uses
probability ranking to display results. To put it simply, documents are
ranked based on the probability of their relevance to the search query.
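The vector space model can be illustrated with a short Python sketch. Using raw
term counts as weights and cosine similarity as the similarity measure are
assumptions made here for illustration, as are the vocabulary and the sample
documents. Other weightings and measures are possible.

```python
import math
from collections import Counter

def term_vector(text, vocabulary, binary=False):
    """Turn a text into a vector over the vocabulary.

    binary=True gives 0/1 entries; otherwise entries are raw term counts.
    """
    counts = Counter(text.lower().split())
    if binary:
        return [1 if counts[term] > 0 else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary and documents.
vocabulary = ["information", "retrieval", "storage"]
docs = {
    "d1": "information retrieval retrieval",
    "d2": "information storage",
}

query_vec = term_vector("retrieval", vocabulary)
ranking = sorted(
    docs,
    key=lambda doc_id: cosine(term_vector(docs[doc_id], vocabulary), query_vec),
    reverse=True,
)
print(ranking)
```

Documents whose vectors point in a direction closer to the query vector are
ranked first; passing binary=True to term_vector gives the binary variant of
the representation.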
Components of Information Retrieval Model
Here are the prerequisites for an IR model:
1. An automated or manually operated indexing system, used to index
documents and to support search techniques and procedures.
2. A collection of documents in any one of the following formats: text, image,
or multimedia.
3. A set of queries that serve as the input to the system, issued by a human or
a machine.
4. An evaluation metric to measure the system's effectiveness (for
instance, precision and recall), that is, to determine how useful the
information displayed to the user is.
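Precision and recall can be computed as in the following Python sketch. The
document ids and relevance judgments are made up for illustration. Precision is
the fraction of retrieved documents that are relevant, and recall is the
fraction of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments for a single query.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two
of the three relevant documents were retrieved (recall about 0.67).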