Information Retrieval
Semester V    Course Code : BAI515B    CIE Marks : 50
Module-1
Module-2
Module-3
Module-4
Indexing and Searching: Inverted indexes, Signature files, Suffix trees and
suffix arrays, Sequential searching, Multi-dimensional indexing.
Module-5
Web retrieval: The web, Search engine architectures, Search engine ranking,
Managing web data, Search engine user interaction.
Structured Text Retrieval: Structuring Power, Early text retrieval models,
XML retrieval, XML retrieval evaluation.
Information retrieval (IR) is about retrieving information relevant to the user on the basis
of a query. Early IR systems were Boolean systems, which allowed users to specify their
information need using a complex combination of Boolean ANDs, ORs and NOTs.
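A Boolean query of the kind described above can be pictured as set operations over document ids. The following is only a minimal sketch: the toy collection, the helper names (docs_with, boolean_and, boolean_or, boolean_not) and the sample query are assumptions made for illustration, not part of any particular IR system.

```python
# Toy collection; the document ids and texts are made up for illustration.
DOCS = {
    1: "information retrieval with boolean queries",
    2: "boolean algebra and logic",
    3: "web search engines rank information",
}
ALL_IDS = set(DOCS)


def docs_with(term):
    """Return the set of ids of documents that contain `term` as a word."""
    return {doc_id for doc_id, text in DOCS.items() if term in text.split()}


def boolean_and(a, b):
    """AND keeps documents present in both operand sets."""
    return a & b


def boolean_or(a, b):
    """OR keeps documents present in either operand set."""
    return a | b


def boolean_not(a):
    """NOT keeps every document of the collection that is not in `a`."""
    return ALL_IDS - a


# Query: information AND (boolean OR web) AND NOT logic
result = boolean_and(
    boolean_and(
        docs_with("information"),
        boolean_or(docs_with("boolean"), docs_with("web")),
    ),
    boolean_not(docs_with("logic")),
)
print(sorted(result))
```

A document either satisfies the expression or it does not, so a purely Boolean system of this kind returns an unranked set of matches.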
Users of modern IR systems, such as search engines, have a wide range of information
needs. Some users are looking for the link to the homepage of a government, a company or
a college. Others are looking for information required to execute tasks associated with
their jobs or immediate needs.
Sometimes a user has a full description of the information need in mind, but a search
engine cannot process such a description directly. The user must first translate the
information need into a query that can be posed to the system.
Given the user query, the goal of the IR system is to retrieve information that is useful or
relevant to the user.
The key issues with IR models are the selection of a search vocabulary, the formulation of
a search strategy, and information overload.
➥ 1.2.1 The User’s Task
The user of a retrieval system has to translate his information need into a query in the
language provided by the system. With an information retrieval system, this normally
implies specifying a set of words which convey the semantics of the information need.
With a data retrieval system, a query expression is used to convey the constraints that
must be satisfied by objects in the answer set. In both cases, we say that the user searches
for useful information by executing a retrieval task. Fig. 1.2.1 shows the interaction of the
user with the retrieval system.
Fig. 1.2.1 : Interaction of the user with the retrieval system
Suppose the user is interested in web sites about healthcare products. In this situation,
the user might use an interactive interface to simply look around in the collection for
documents related to healthcare products.
While looking around, the user may become interested in new beauty products or in weight
loss or weight gain products. Here the user is browsing the documents in the collection,
not searching. It is still a process of retrieving
information, but one whose main objectives are not clearly defined in the beginning and
whose purpose might change during the interaction with the system.
Pull technology : The user requests information in an interactive manner. It covers three
retrieval tasks: browsing (hypertext), retrieval (classical IR systems), and browsing and
retrieval (modern digital libraries and web systems).
Push technology : Automatic and permanent pushing of information to user. It acts like a
a) Input : Store only a representation of the document or query which means that the text
of a document is lost once it has been processed for the purpose of generating its
representation.
b) A document representative could be a list of extracted words considered to be
significant.
c) Processor : Involved in performing the actual retrieval function, executing the search
strategy in response to a query.
d) Feedback : Improving the subsequent run after sample retrieval.
e) Output : A set of document numbers.
Information retrieval locates relevant documents, on the basis of user input such as
The computer-based retrieval systems store only a representation of the document or query
which means that the text of a document is lost once it has been processed for the purpose
of generating its representation.
The process may involve structuring the information, such as classifying it. It will also
involve performing the actual retrieval function, that is, executing the search strategy in
response to a query.
A text document is the output of an information retrieval system. Web search engines are the
most familiar example of IR systems.
The user’s query is processed by a search engine, which may be running on the user’s local
machine, on a large cluster of machines in a remote geographic location, or anywhere in
between.
A major task of a search engine is to maintain and manipulate an inverted index for a
document collection. This index forms the principal data structure used by the engine for
searching and relevance ranking.
CS-AI&ML, PDIT Hosapete Page 12
As its basic function, an inverted index provides a mapping between terms and the
locations in the collection in which they occur.
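The term-to-location mapping described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions: tokens are produced by lowercasing and splitting on whitespace, a "location" is a (document id, word position) pair, and the function name build_inverted_index and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, position) pairs.

    `docs` maps a document id to its text. Tokenization is a plain
    lowercase whitespace split, which is a simplifying assumption.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id].lower().split()):
            index[term].append((doc_id, position))
    return dict(index)


docs = {
    1: "search engines index documents",
    2: "engines rank documents",
}
index = build_inverted_index(docs)
print(index["documents"])  # occurrences of the term, ordered by document id
```

Looking up a term is then a single dictionary access, which is why the engine does not need to scan the documents themselves at query time.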
To support relevance ranking algorithms, the search engine maintains collection statistics
associated with the index, such as the number of documents containing each term and the
length of each document.
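The two statistics named above, the number of documents containing each term and the length of each document, could be gathered in one pass over the collection. The sketch below is an assumption-laden illustration: the function name collection_statistics is hypothetical, length is measured in whitespace-separated tokens, and the sample documents are made up.

```python
from collections import Counter


def collection_statistics(docs):
    """Return (document frequency per term, length per document).

    Document frequency counts each document at most once per term,
    and length is the number of whitespace-separated tokens.
    """
    doc_freq = Counter()
    doc_length = {}
    for doc_id, text in docs.items():
        terms = text.lower().split()
        doc_length[doc_id] = len(terms)
        doc_freq.update(set(terms))  # set(): count a term once per document
    return doc_freq, doc_length


sample_docs = {
    1: "search engines index documents",
    2: "engines rank documents and documents",
}
doc_freq, doc_length = collection_statistics(sample_docs)
print(doc_freq["documents"], doc_length[2])
```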
In addition the search engine usually has access to the original content of the documents in
order to report meaningful results back to the user.
Using the inverted index, collection statistics, and other data, the search engine accepts
queries from its users, processes these queries, and returns ranked lists of results.
To perform relevance ranking, the search engine computes a score, sometimes called a
Retrieval Status Value (RSV), for each document. After sorting documents according to
their scores, the result list must be subjected to further processing, such as the removal of
duplicate or redundant results.
For example, a web search engine might report only one or two results from a single host or
domain, eliminating the others in favor of pages from different sources.
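The score, sort and post-process pipeline described above might look roughly like the following sketch. The scoring function is a deliberately simple assumption (it counts occurrences of the query terms) and stands in for whatever ranking formula a real engine uses; the names score and rank, the max_per_host parameter and the example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def score(query_terms, doc_terms):
    """Toy RSV: total number of occurrences of the query terms."""
    counts = Counter(doc_terms)
    return sum(counts[term] for term in query_terms)


def rank(query, pages, max_per_host=2):
    """Score every page, sort by score, then cap the results per host."""
    query_terms = query.lower().split()
    scored = [
        (score(query_terms, text.lower().split()), url)
        for url, text in pages.items()
    ]
    # Drop non-matching pages; sort by descending score, then by URL.
    scored = [(s, url) for s, url in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    per_host = Counter()
    results = []
    for s, url in scored:
        host = urlparse(url).netloc
        if per_host[host] < max_per_host:  # keep at most N results per host
            per_host[host] += 1
            results.append((url, s))
    return results


pages = {
    "http://a.example/1": "retrieval retrieval retrieval",
    "http://a.example/2": "retrieval retrieval",
    "http://a.example/3": "retrieval",
    "http://b.example/1": "retrieval systems",
    "http://c.example/1": "cooking",
}
ranked = rank("retrieval", pages)
for url, rsv in ranked:
    print(rsv, url)
```

With this input the third page from a.example is dropped in favor of the page from b.example, even though both have the same score.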
Database management systems support fast and accurate data lookups in business and
industry; in journalism, lookups are related to questions of who, when, and where as
opposed to what, how, and why questions.
In libraries, lookups have been called “known item” searches to distinguish them from
subject or topical searches.
A typical example would be a user wanting to make a reservation at a restaurant and
looking for the phone number on the Web.
On the other hand, exploratory search is described as open-ended, with an unclear
information need, an ill-structured problem of search with multiple targets. This search
activity is evolving and can occur over time.
For example, a user wants to know more about Senegal. She doesn’t really know what kind
of information she wants or what she will discover in this search session; she only knows
she wants to learn more about that topic.
Query formulation is the stage of the interactive information access process in which the user
translates an information need into a query and submits the query to an information access
system such as a search engine.
The system performs some computation to match the query with the documents most likely
to be relevant to the query and returns a ranked list of relevant documents to the user.
🞕 Ans. :
Input : Store only a representation of the document or query which means that the text of a
document is lost once it has been processed for the purpose of generating its representation.
A document representative could be a list of extracted words considered to be significant.
Processor : Involved in performing the actual retrieval function, executing the search strategy in
Objective terms are extrinsic to semantic content, and there is generally no disagreement
about how to assign them. Examples include author name, document URL, and date of
publication.
Nonobjective terms are intended to reflect the information manifested in the document,
and there is no agreement about the choice or degree of applicability of these terms. They
are also known as content terms.
Q.5 Explain the type of natural language technology used in information retrieval.
🞕 Ans. : Two types of natural language technology can be useful in information retrieval :
Natural language interfaces make the task of communicating with the information source
easier, allowing a system to respond to a range of inputs, possibly from inexperienced
users, and to produce more customized output.
Natural language text processing allows a system to scan the source texts, either to retrieve
particular information or to derive knowledge structures that may be used in accessing
information from the texts.
Q.6 What is search engine ?
🞕 Ans. : A search engine is a document retrieval system designed to help find information
stored in a computer system, such as on the WWW. The search engine allows one to ask for
content meeting specific criteria and retrieves a list of items that match those criteria.
Q.7 What is conflation ?
🞕 Ans. : Stemming is the process of reducing inflected words to their stem, base or root
form, generally a written word form. The process of stemming is often called conflation.
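As a rough illustration of conflation, the sketch below strips a few common English suffixes so that inflected forms collapse onto one stem. It is not the Porter stemmer or any standard algorithm: the suffix list, the minimum stem length of three letters and the names naive_stem and conflate are all assumptions chosen for the example.

```python
# Illustrative suffix list only; real stemmers use much richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(words):
    """Group words that share the same naive stem."""
    groups = {}
    for word in words:
        groups.setdefault(naive_stem(word), []).append(word)
    return groups


groups = conflate(["connect", "connected", "connecting", "connects"])
print(groups)
```

All four word forms end up under the single stem "connect", so a query for any one of them can match documents containing the others.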
Q.8 What is an invisible web ?
🞕 Ans. : Many dynamically generated sites are not indexable by search engines; this
phenomenon is known as the invisible web.
Q.9 Define Zipf’s law.
🞕 Ans. : An empirical rule that describes the frequency of the text words. It states that the
i-th most frequent word appears as many times as the most frequent one divided by i^θ, for
some θ > 1.
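The rule can be turned into a small calculation. In this sketch the exponent applied to the rank is passed in as a parameter named theta; that name, the function name zipf_expected_count and the sample numbers are assumptions for illustration only.

```python
def zipf_expected_count(top_count, rank, theta):
    """Expected occurrences of the word at `rank` under Zipf's law.

    top_count: occurrences of the most frequent word (rank 1).
    rank:      position of the word in the frequency ordering (1-based).
    theta:     exponent applied to the rank (an assumed parameter name).
    """
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return top_count / rank ** theta


# If the most frequent word occurs 1000 times and theta is 1.5,
# the estimate for the 4th most frequent word is 1000 / 4**1.5.
estimate = zipf_expected_count(1000, 4, 1.5)
print(estimate)
```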
d All of these
Q.12 Early IR systems were ______ systems which allowed users to specify their
information need using a complex combination of Boolean ANDs, ORs and
NOTs.
a) Boolean   b) vector   c) logical   d) All of these
Q.16 Web browser is a software program that interprets and displays the
contents of ______ web pages.
a) XML   b) HTML   c) static   d) dynamic
Q.17 ______ diagram is a diagram that visually displays all the possible logical
relationships between collections of sets.
a) Text   b) Information   c) Binary   d) Venn