
Information Retrieval

The document outlines the curriculum for an Information Retrieval course, covering topics such as the IR problem, user interfaces, retrieval evaluation, indexing, and web retrieval. It emphasizes the distinction between information retrieval and data retrieval, highlighting the challenges faced in retrieving relevant documents from unstructured data. The course includes various modules with specified textbook chapters for further reading and understanding of the subject matter.

Uploaded by

vasantha
Copyright
© All Rights Reserved

Information Retrieval (1 - 2) Module-1

INFORMATION RETRIEVAL
Semester: V    Course Code: BAI515B    CIE Marks: 50

Module-1

Introduction: Information retrieval, IR problem, IR System, The web.


User interfaces for search: Introduction, How people search, Search
interfaces today, Visualization on search interfaces, Design and evaluation of
search interfaces

Textbook: Chapter 1: 1.1 to 1.4, Chapter 2: 2.1 to 2.5

Module-2

Modeling: IR models, Classic information retrieval, Alternative set theoretic
models, Alternative algebraic models, Alternative probabilistic models, Other
models.

Textbook: Chapter 3: 3.1 to 3.6

Module-3

Retrieval Evaluation: Retrieval metrics, Reference Collections, User-based
evaluation

Relevance feedback and Query expansion: A framework for feedback
methods, Explicit relevance feedback, Explicit feedback through clicks,
Implicit feedback through local analysis, Implicit feedback through global
analysis

Documents - Languages and Properties: Metadata, Document formats,
Text properties, Document preprocessing, Organizing documents, Text
compression

Textbook: Chapter 4: 4.3 to 4.5, Chapter 5: 5.2 to 5.6, Chapter 6: 6.2 to
6.3, 6.5 to 6.8

CS-AI&ML, PDIT Hosapete Page 2

Module-4

Indexing and Searching: Inverted indexes, Signature files, Suffix trees and
suffix arrays, Sequential searching, Multi-dimensional indexing.

Textbook: Chapter 9: 9.2 to 9.6

Module-5

Web retrieval: The web, Search engine architectures, Search engine ranking,
Managing web data, Search engine user interaction.
Structured Text Retrieval: Structuring Power, Early text retrieval models,
XML retrieval, XML retrieval evaluation.

Textbook: Chapter 11: 11.2 to 11.7, Chapter 13: 13.2 to 13.5


Text Books:

1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information
Retrieval, 2nd Edition, Pearson, 2011

➠ 1.1 Introduction to Information Retrieval AU : May-17

 Information Retrieval (IR) is finding material (usually documents) of an unstructured
nature (usually text) that satisfies an information need from within large collections.
 The role of an IR system is to retrieve all the documents that are relevant to a query
while retrieving as few non-relevant documents as possible. IR allows access to whole
documents, whereas search engines do not.
 There is a huge quantity of text, audio, video and other documents available on the Internet,
on almost any subject. Users need to be able to find relevant information to satisfy their
particular information needs.
 There are two ways of searching for information : using a search engine or browsing
directories organized by categories. There is still a large part of the Internet that is not
accessible this way (for example, private databases and intranets).
 Information retrieval is the task of representing, storing, organizing, and offering access to
information items. IR is different from data retrieval, which is about finding precise data in
databases with a given structure.
 In IR systems, the information is not structured. It is contained in free form in text
(web pages or other documents) or in multimedia content. The first IR systems, implemented
in the 1970s, were designed to work with small collections of text. Some of these techniques
are now used in search engines.
 Information retrieval techniques now focus on the challenges faced by search engines.
One particular challenge is the large scale, given the huge number of web pages available
on the Internet.
 Another challenge is inherent to any information retrieval system that deals with text : the
ambiguity of natural language (English or other languages), which makes it difficult to
obtain perfect matches between documents and user queries.
 Information retrieval is never an easy task. The problem with IR is that the document
representation, whether by index terms or by full text, cannot fully capture the
representation of the user's need, which is dynamic and complicated.
 Moreover, traditional IR systems are designed to support only one type of
information-seeking strategy that users engage in : query formulation.

➥ 1.1.1 Early Developments


 Information Retrieval (IR) is about the process of providing answers to a client's
information needs. It is thus concerned with the collection, representation, storage,
organization, accessing, manipulation and display of the information items necessary to
satisfy those needs.
 Definition : Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within large
collections.
 There is a huge quantity of text, audio, video and other documents available on the Internet,
on almost any subject. Users need to be able to find relevant information to satisfy their
particular information needs.
 There are two ways of searching for information : using a search engine or browsing
directories organized by categories.
 IR is the task of representing, storing, organizing, and offering access to information
items. IR is different from data retrieval, which is about finding precise data in databases
with a given structure.
 In IR systems, the information is not structured. It is contained in free form in text (web
pages or other documents) or in multimedia content.
 The first IR systems, implemented in the 1970s, were designed to work with small collections
of text. Some of these techniques are now used in search engines.
 Information retrieval techniques now focus on the challenges faced by search engines :
1. The large scale, given the huge number of web pages available on the Internet.
2. The ambiguity of natural language (English or other languages), which makes it
difficult to obtain perfect matches between documents and user queries.
 Information retrieval is never an easy task. The problem with IR is that the document
representation, whether by index terms or by full text, cannot fully capture the
representation of the user's need, which is dynamic and complicated.
 Traditional IR systems are designed to support only one type of information-seeking
strategy that users engage in : query formulation.

➠ 1.2 The IR Problem

 Information retrieval is about retrieving information relevant to the user on the basis of a
query. Early IR systems were Boolean systems, which allowed users to specify their
information need using a complex combination of Boolean ANDs, ORs and NOTs.
 In a modern IR system, users bring a wide range of information needs to the search engine.
A user may be looking for a link to the homepage of a government body, company or college,
or for information required to carry out tasks associated with their job or immediate needs.
 Sometimes a user starts from a full natural-language description of the information need.
A search engine usually cannot answer such a description directly, so the user must first
translate the information need into a query to be posed to the system.
 Given the user query, the goal of the IR system is to retrieve information that is useful or
relevant to the user.
 The key issues with IR models are the selection of search vocabulary, the formulation of
search strategies, and information overload.
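
The Boolean style of querying described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the three toy documents, the whitespace tokenization and the helper name `term` are invented for the example and are not taken from the textbook.

```python
# Minimal Boolean retrieval sketch over a toy collection (all data is illustrative).
docs = {
    1: "information retrieval from text collections",
    2: "database systems store structured data",
    3: "web search engines use information retrieval",
}

# Map each term to the set of document ids that contain it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

all_ids = set(docs)

def term(t):
    """Documents containing term t (empty set if the term is unknown)."""
    return index.get(t, set())

# information AND retrieval AND NOT web
result_and_not = (term("information") & term("retrieval")) - term("web")
# database OR web
result_or = term("database") | term("web")
# NOT information
result_not = all_ids - term("information")

print(sorted(result_and_not))  # [1]
print(sorted(result_or))       # [2, 3]
print(sorted(result_not))      # [2]
```

AND, OR and NOT map directly onto set intersection, union and difference, which is why a Boolean system returns an unranked set: a document either satisfies the expression or it does not.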

➥ 1.2.1 The User’s Task

 The user of a retrieval system has to translate his information need into a query in the
language provided by the system. With an information retrieval system, this normally
implies specifying a set of words which convey the semantics of the information need.
 With a data retrieval system, a query expression is used to convey the constraints that
must be satisfied by objects in the answer set. In both cases, we say that the user searches
for useful information by executing a retrieval task. Fig. 1.2.1 shows the interaction of the
user with the retrieval system.

Fig. 1.2.1 : Interaction of the user with the retrieval system

 Suppose the user is interested in web sites about healthcare products. In this situation,
the user might use an interactive interface to simply look around in the collection for
documents related to healthcare products.
 While doing so, the user may become interested in new beauty products, or in weight loss or
weight gain products. Here the user is browsing the documents in the collection, not
searching. It is still a process of retrieving information, but one whose main objectives
are not clearly defined in the beginning and whose purpose might change during the
interaction with the system.
 Pull technology : The user requests information in an interactive manner. It covers three
retrieval tasks : browsing (hypertext), retrieval (classical IR systems), and browsing
combined with retrieval (modern digital libraries and web systems).
 Push technology : Information is pushed to the user automatically and continuously. It
acts like a software agent.

➠ 1.3 Information versus Data Retrieval


 An information retrieval system is software that has the features and functions required to
manipulate "information" items, whereas a DBMS is optimized to handle "structured" data.
 Information Retrieval (IR) and Data Retrieval (DR) are often viewed as two mutually
exclusive means of performing different tasks. IR is used for finding relevant documents
in a collection of unstructured or semi-structured documents.
 Data retrieval is used for finding exact matches using stringent queries on structured
data, often in a Relational Database Management System (RDBMS).
 IR is concerned with human interests, i.e., IR selects and ranks documents based on the
likelihood of relevance to the user's needs. DR is different : answers to users' queries are
exact matches, which do not impose any ranking.
 Data retrieval involves the selection of a fixed set of data based on a well-defined query.
Information retrieval involves the retrieval of documents written in natural language.
 IR systems generally do not support transactional updates, whereas database systems do.
Database systems manage structured data, with schemas that define the data organization.
IR systems deal with some querying issues not generally addressed by database systems,
such as approximate searching by keywords.
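
The contrast between exact matching and ranked best matching can be made concrete with a small sketch. The records, field names and the overlap-based score below are assumptions chosen purely for illustration; real systems use far richer scoring.

```python
# Toy contrast between data retrieval (exact match) and information retrieval (best match).
records = [
    {"id": 1, "title": "database systems", "year": 2011},
    {"id": 2, "title": "modern information retrieval", "year": 2011},
    {"id": 3, "title": "information systems for the web", "year": 1999},
]

def data_retrieval(rows, year):
    """Exact match: a row either satisfies the condition or it does not; no ranking."""
    return [r["id"] for r in rows if r["year"] == year]

def information_retrieval(rows, query):
    """Best match: score each row by query-term overlap and rank by score."""
    q = set(query.lower().split())
    scored = []
    for r in rows:
        overlap = len(q & set(r["title"].split()))
        if overlap > 0:
            scored.append((overlap, r["id"]))
    # Highest overlap first; ties broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

print(data_retrieval(records, 2011))                                   # [1, 2]
print(information_retrieval(records, "information retrieval systems"))  # [2, 3, 1]
```

Note that record 1 is still returned by the IR-style query even though it matches only one of the three query terms: partial matches are kept and ranked lower rather than rejected.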

➥ 1.3.1 Difference between Data Retrieval and Information Retrieval


Parameters | Data retrieval | Information retrieval
Example | Database query | WWW search
Matching | Exact match | Partial match, best match
Inference | Deduction | Induction
Model | Deterministic | Probabilistic
Query language | Artificial | Natural
Query specification | Complete | Incomplete
Items wanted | Matching | Relevant
Error response | Sensitive | Insensitive
Classification | Monothetic | Polythetic

➠ 1.4 The IR System

 An information retrieval system is an information system that is used to store items of
information that need to be processed, searched, retrieved, and disseminated to various user
populations.
 Information retrieval is the process of searching some collection of documents in order to
identify those documents which deal with a particular subject. Any system that is designed
to facilitate this literature searching may legitimately be called an information retrieval
system.
 Conceptually, IR is the study of finding needed information. It helps users to find
information that matches their information needs. Historically, IR has been about document
retrieval, emphasizing the document as the basic unit.
 Information retrieval locates relevant documents on the basis of user input such as
keywords or example documents, for example : find documents containing the words
"database systems".
 Fig. 1.4.1 shows the block diagram of an information retrieval system. It consists of three
components : input, processor and output.

Fig. 1.4.1 : IR block diagram

a) Input : The system stores only a representation of the document or query, which means
that the text of a document is lost once it has been processed for the purpose of
generating its representation.
b) A document representative could, for example, be a list of extracted words considered
to be significant.
c) Processor : Performs the actual retrieval function, executing the search strategy in
response to a query.
d) Feedback : Improves the subsequent run after a sample retrieval.
e) Output : A set of document numbers.
 Information retrieval locates relevant documents on the basis of user input such as
keywords or example documents.
 Computer-based retrieval systems store only a representation of the document or query,
which means that the text of a document is lost once it has been processed for the purpose
of generating its representation.
 The process may involve structuring the information, such as classifying it. It will also
involve performing the actual retrieval function, that is, executing the search strategy in
response to a query.
 A text document is the output of an information retrieval system. Web search engines are
the most familiar example of IR systems.
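
The input, processor, feedback and output components described above can be sketched as a tiny pipeline. This is a rough illustration, not the design of any particular system: the stopword list, the overlap score and the function names `represent`, `process` and `feedback` are all assumptions made for the example.

```python
STOPWORDS = {"the", "of", "a", "and", "in", "to", "is"}

def represent(text):
    """Input stage: keep only a set of significant words; the full text is not stored."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def process(query_repr, doc_reprs):
    """Processor stage: match the query representation against each document.

    The returned list of document numbers is the output stage.
    """
    scores = {d: len(query_repr & r) for d, r in doc_reprs.items()}
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: (-scores[d], d))

def feedback(query_repr, doc_reprs, relevant_ids):
    """Feedback stage: expand the query with terms from documents judged relevant."""
    expanded = set(query_repr)
    for d in relevant_ids:
        expanded |= doc_reprs[d]
    return expanded

documents = {
    1: "the design of database systems",
    2: "indexing and searching in database systems",
    3: "searching the web",
}
doc_reprs = {d: represent(t) for d, t in documents.items()}

query = represent("database systems")
first_run = process(query, doc_reprs)
# Suppose the user marks document 2 as relevant after the first run.
second_run = process(feedback(query, doc_reprs, [2]), doc_reprs)

print(first_run)   # [1, 2]
print(second_run)  # [2, 1, 3]
```

After feedback, document 2 moves to the top and document 3 appears for the first time, because the expanded query now contains the term "searching".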

➥ 1.4.1 Process of Information Retrieval


 Information retrieval is often a continuous process during which you will consider,
reconsider and refine your research problem, use various information resources,
information retrieval techniques and library services, and evaluate the information you find.
 Fig. 1.4.2 shows that the stages follow each other during the process, but in reality they are
often active simultaneously and you usually will repeat some stages during the same
information retrieval process.

Fig. 1.4.2 Stages of IR process

 The different stages of the information retrieval process are :
1. Problem / Topic : An information need occurs when more information is required to
solve a problem.
2. Information retrieval plan : Define your information need and choose your information
resources, retrieval techniques and search terms.
3. Information retrieval : Perform your planned information retrieval (information retrieval
techniques).
4. Evaluating the results : Evaluate the results of your information retrieval (number and
relevance of search results).
5. Locating publications : Find out where and how the required publication, e.g. an article,
can be acquired.
6. Using and evaluating the information : Evaluate the final results of the process (critical
and ethical evaluation of the information and information resources).

➥ 1.4.2 The Software Architecture of the IR System


 Fig. 1.4.3 shows the architecture of an IR system.

Fig. 1.4.3 : Architecture of IR system

 The user’s query is processed by a search engine, which may be running on the user’s local
machine, on a large cluster of machines in a remote geographic location, or anywhere in
between.
 A major task of a search engine is to maintain and manipulate an inverted index for a
document collection. This index forms the principal data structure used by the engine for
searching and relevance ranking.

 As its basic function, an inverted index provides a mapping between terms and the
locations in the collection in which they occur.
 To support relevance ranking algorithms, the search engine maintains collection statistics
associated with the index, such as the number of documents containing each term and the
length of each document.
 In addition the search engine usually has access to the original content of the documents in
order to report meaningful results back to the user.
 Using the inverted index, collection statistics, and other data, the search engine accepts
queries from its users, processes these queries, and returns ranked lists of results.
 To perform relevance ranking, the search engine computes a score, sometimes called a
Retrieval Status Value (RSV), for each document. After sorting documents according to
their scores, the result list must be subjected to further processing, such as the removal of
duplicate or redundant results.
 For example, a web search engine might report only one or two results from a single host or
domain, eliminating the others in favor of pages from different sources.
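
The pieces described in this section (inverted index, collection statistics, a score per document, and post-processing of the ranked list) can be sketched together as follows. The URLs, page texts and the very simple scoring formula are assumptions for illustration only; they stand in for whatever retrieval status value a real engine would compute.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Toy collection: URL -> text. URLs and texts are invented for illustration.
pages = {
    "http://a.example/x": "cheap flights cheap hotels",
    "http://a.example/y": "cheap flights",
    "http://b.example/z": "flights and hotels",
}

# Inverted index: term -> {url: [positions]} (where each term occurs in the collection).
index = defaultdict(dict)
doc_length = {}
for url, text in pages.items():
    tokens = text.split()
    doc_length[url] = len(tokens)          # collection statistic: document length
    for pos, tok in enumerate(tokens):
        index[tok].setdefault(url, []).append(pos)

# Collection statistic: number of documents containing each term.
doc_freq = {t: len(postings) for t, postings in index.items()}

def rsv(query, url):
    """Illustrative retrieval status value: query-term occurrences / document length."""
    hits = sum(len(index[t].get(url, [])) for t in query.split() if t in index)
    return hits / doc_length[url]

def search(query, max_per_host=1):
    """Rank by score, then keep at most max_per_host results from any one host."""
    ranked = sorted((u for u in pages if rsv(query, u) > 0),
                    key=lambda u: (-rsv(query, u), u))
    seen = defaultdict(int)
    results = []
    for u in ranked:
        host = urlparse(u).netloc
        if seen[host] < max_per_host:
            seen[host] += 1
            results.append(u)
    return results

print(doc_freq["cheap"], doc_freq["flights"])  # 2 3
print(search("cheap flights"))
# ['http://a.example/y', 'http://b.example/z']
```

Here `http://a.example/x` scores higher than the page on `b.example`, yet it is dropped because another page from the same host is already in the list, which mirrors the host-based filtering mentioned above.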

➥ 1.4.3 The Retrieval and Ranking Processes


 A good retrieval model will find documents that are likely to be considered relevant by the
person who submitted the query. Some retrieval models focus on topical relevance, but a
search engine deployed in a real environment must use ranking algorithms that incorporate
user relevance.
 Relevancy ranking is the method used to order the results list in such a way that the
records most likely to be of interest to a user appear at the front. This makes searching
easier for users, as they will not have to spend as much time looking through records for the
information that interests them.
 Each relevancy ranking algorithm slightly favors one type of data over another. While
almost any relevancy ranking algorithm will make a large difference, it is sometimes
worthwhile to try several ranking methods. This way, you will be able to find the algorithm
which most closely reflects the needs of your application as well as your own and your
users' expectations.
 There are a number of ways of calculating how a given record ranks, and the factors that
are taken into consideration vary with each technique :
a) The number of times the search term occurs within a given record.
b) The number of times the search term occurs across the collection of records.
c) The number of words within a record.
d) The frequencies of words within a record.
e) The number of records in the index.
 Typically, relevancy ranking algorithms rank records in relation to each other. The weight
assigned to a given record reflects its relevance relative to other records within the same
database and for the same query.
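
As one possible way of combining the factors listed above, the sketch below uses a simple TF-IDF style weight. This is a swapped-in, well-known technique chosen for illustration and is not claimed to be the algorithm the notes have in mind; the toy records are invented, and factor (b) is approximated by the number of records containing the term rather than its total occurrence count.

```python
import math

# Toy records, invented for illustration.
records = {
    "r1": "search engines rank search results",
    "r2": "engines power cars",
    "r3": "library catalogues list books",
}

tokens = {rid: text.split() for rid, text in records.items()}
n_records = len(records)                     # factor (e): number of records in the index

def score(term, rid):
    words = tokens[rid]
    tf = words.count(term)                   # factor (a): occurrences within the record
    # Approximation of factor (b): number of records that contain the term.
    df = sum(1 for w in tokens.values() if term in w)
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(n_records / df)           # rarer terms weigh more
    return (tf / len(words)) * idf           # factor (c): normalise by record length

def rank(query):
    """Order records by their summed term scores, highest first."""
    totals = {rid: sum(score(t, rid) for t in query.split()) for rid in records}
    return sorted((rid for rid in totals if totals[rid] > 0),
                  key=lambda rid: (-totals[rid], rid))

print(rank("search engines"))  # ['r1', 'r2']
```

The scores are only meaningful relative to each other for the same query and collection, which matches the remark above that records are ranked in relation to one another.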
University Questions

1. Explain in detail about the components of IR. AU : Dec.-16, Marks 16
2. Explain the issues in the process of information retrieval. Explain in detail the
components of Information Retrieval and Search Engine. AU : Dec.-17, Marks 8

➠ 1.5 The Web


 The World Wide Web is a collection of millions of files stored on thousands of servers all
over the world. These files represent documents, pictures, video, sounds, programs and
interactive environments.
 A web page is an HTML document that is stored on a web server. A web site is a collection
of web pages belonging to a particular organization.
 The URLs of these pages share a common prefix, which is the address of the home page of
the site. Search engines are a bottom-up approach for finding your way around the web. Some
search engines search only the titles of web pages, while others search every word.
Keywords can be combined with Boolean operations, such as AND, OR and NOT, to
produce rather complicated queries.

➥ 1.5.1 The e-Publishing Era


 E-publishing refers to a publishing process in which manuscripts are submitted in
electronic format, then edited, printed and even distributed to users in electronic form by
means of computer and communication technology, which may be online, CD-ROM, networks,
etc. It involves the storage of information in electronic or digital form. It also refers to
a type of publishing that does not include printed books.
 E-publishing has been defined as any non-print media material that is published in
digitized form to an identifiable public. The media in electronic publishing can be text,
numeric, graphic, still or motion pictures, video, sound or, as is frequently the case, a
combination of any or all of these.
 There are four main reasons for the development of e-publishing :
a) Rapid development and wide use of computer technology.
b) The tremendous growth of computer networks.
c) Merging of computer and telecommunication technology.
d) Development of the information industry.

➥ 1.5.2 How the Web Changed Search


 The web has introduced millions of people to search. The information retrieval community
stands ready to suggest helpful strategies for finding information on the Web.
 Let us consider the impact of the web on search engines :
1. Characteristics of the document collection itself
2. Size of the collection and volume of user queries
3. Vast size of the document collection
4. Web advertising
 Search has changed dramatically over the past year and semantic technology has been at the
centre of it all. Consumers increasingly expect search engines to understand natural
language and perceive the intent behind the words they type in, and search engine
algorithms are rising to this challenge.

➠ 1.6 How People Search


➥ 1.6.1 Information Lookup Versus Exploratory Search
 Search activities are commonly divided into two broad categories: lookup and exploratory.
Exploratory search is an increasingly important activity yet challenging for users.
 Lookup search is by far the better understood and assumed to have precise search goals.
The predominant design goal in information retrieval systems has been fast and accurate
completion of lookup searches.
 Exploratory search is presently thought to center around the acquisition of new knowledge
and considered to be challenging for the user.
 Lookup is the most basic kind of search task and has been the focus of development for
database management systems and much of what Web search engines support.
 Lookup tasks return discrete and well-structured objects such as numbers, names, short
statements, or specific files of text or other media.


 Database management systems support fast and accurate data lookups in business and
industry; in journalism, lookups are related to questions of who, when, and where as
opposed to what, how, and why questions.
 In libraries, lookups have been called “known item” searches to distinguish them from
subject or topical searches.
 A typical example would be a user wanting to make a reservation to a restaurant and
looking for the phone number on the Web.
 On the other hand, exploratory search is described as open-ended, with an unclear
information need, an ill-structured problem of search with multiple targets. This search
activity is evolving and can occur over time.
 For example, a user wants to know more about Senegal. She doesn’t really know what kind
of information she wants or what she will discover in this search session; she only knows
she wants to learn more about that topic.

➠ 1.7 Search Interfaces Today


 The job of the search user interface is to aid users in the expression of their information
needs, in the formulation of their queries, in the understanding of their search results, and in
keeping track of the progress of their information seeking efforts.
 The typical search interface today is of the form : type-keywords-in-entry-form, view-
results-in-a-vertical-list.
 Some important reasons for the relative simplicity and unchanging nature of the standard
Web search interface are :
a) Search is a means towards some other end, rather than a goal in itself. When a person is
looking for information, they are usually engaged in some larger task, and do not want
their flow of thought interrupted by an intrusive interface.
b) Search is a mentally intensive task. When a person reads text, they are focused on that
task; it is not possible to read and to think about something else at the same time. Thus,
the fewer distractions while reading, the more usable the interface.
c) Since nearly everyone who uses the Web uses search, the interface design must be
understandable and appealing to a wide variety of users of all ages, cultures and
backgrounds, applied to an enormous variety of information needs.


➥ 1.7.1 Query Specification


 The query specification process involves two aspects :
1. The kind of information the searcher supplies. Query specification input spans a
spectrum from full natural language sentences, to keywords and key phrases, to syntax-
heavy command language-based queries.
2. The interface mechanism the user interacts with to supply this information. These
include command line interfaces, graphical entry form-based interfaces, and interfaces
for navigating links.
 Queries over collections of textual information usually take on a textual form. Keyword
queries consist of a list of one or more words or phrases rather than full natural language
statements.
 Example : English keyword queries include flip cam, fresh chilli paste recipes, and video
game addiction. Some keyword queries consist of lists of different words and phrases,
which together suggest a topic.
 Many others are noun compounds and proper nouns. Less frequently, keyword queries
contain syntactic fragments including prepositions and verbs and, in some cases, full
syntactic phrases.
 Dynamic query term suggestions can be provided as the user types in a term, before they
view the results, or they can be presented following the result display stage.
 The performance of query term suggestions across the three search engines compared here
(Google, Yahoo! and Bing) varies in terms of the number of suggestions and how single-word
and multi-word queries are handled. The following table provides a comparative overview of
the number of suggested query terms for TREC topics.

Search engine | Average number of words suggested | Median number of words suggested
Google | 4 | 4
Yahoo! | 6.46 | 10
Bing | 6.18 | 8
 All the three search engines offer spelling error correction features and around 80% of the
time they provide 4 or more dynamic query suggestions.
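
A very rough sketch of how dynamic suggestions might be produced as the user types is given below. It assumes a hypothetical log of past queries and simply completes the typed prefix by popularity; commercial engines do not publish their exact methods, so this is an illustration of the idea only.

```python
from collections import Counter

# Hypothetical query log; a real engine would use far larger logs and other signals.
query_log = [
    "video game addiction", "video game reviews", "video game addiction",
    "video editing software", "fresh chilli paste recipes", "flip cam",
]

def suggest(prefix, log, limit=4):
    """Return up to `limit` logged queries starting with `prefix`, most frequent first."""
    counts = Counter(q for q in log if q.startswith(prefix.lower()))
    ranked = sorted(counts, key=lambda q: (-counts[q], q))
    return ranked[:limit]

print(suggest("video g", query_log))  # ['video game addiction', 'video game reviews']
print(suggest("fl", query_log))       # ['flip cam']
```

The `limit` parameter corresponds to the number of suggestions shown, which is the quantity compared across engines in the table above.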


➥ 1.7.2 Retrieval Result Display


 When displaying search results, either the documents must be shown in full or else the
searcher must be presented with some kind of representation of the content of those
documents.
 The document surrogate refers to the information that summarizes the document.
 The appearance of search engine results pages is constantly in flux due to experiments
conducted by Google, Bing, and other search engine providers to offer their users a more
intuitive, responsive experience.
 The quality of the surrogate can greatly affect the perceived relevance of the search
results listing. In Web search, the page title is usually shown prominently along with the
URL and sometimes other metadata.
 The user enters their search query, upon which the search engine presents them with a
search engine results page (SERP). Every SERP is unique, even for search queries performed
on the same search engine using the same keywords.
 This is because virtually all search engines customize the experience for their users by
presenting results based on a wide range of factors beyond their search terms, such as the
user’s physical location, browsing history, and social settings. Two SERPs may appear
identical, and contain many of the same results, but will often feature subtle differences.
 A deep link is a hypertext link to a page on a website other than its homepage. Deep links
are often used to link directly to products of an online store or to other appropriate
content.
 Google itself uses deep links in the form of rich snippets or sitelinks. A hyperlink that
points to a deeper level of a domain can also be useful for link hubs, lists of topics or in
citations. Again, the user’s interest is in the foreground.
 Price comparison portals also work with deep links. In this case, this type of link is
necessary because the potential customer wants to find and buy the exact product being
compared.
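
The definition of a deep link given above can be turned into a simple check. This is a simplified sketch: it treats any URL with a non-trivial path or a query string as a deep link, so a homepage served at a path such as `/index.html` would be misclassified. The example URLs are invented.

```python
from urllib.parse import urlparse

def is_deep_link(url):
    """True if the URL points below the site's homepage (non-trivial path or query)."""
    parts = urlparse(url)
    return parts.path not in ("", "/") or bool(parts.query)

print(is_deep_link("https://shop.example/"))             # False
print(is_deep_link("https://shop.example/products/42"))  # True
```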

➥ 1.7.3 Query Reformulation


 After a query is specified and results have been produced, a number of tools exist to help
the user reformulate their query.
 Query formulation is an essential part of successful information retrieval. The challenges in
formulating effective queries are emphasized in web information search, because the web is
used by a diverse population varying in their levels of expertise.


 Query formulation is the stage of the interactive information access process in which the
user translates an information need into a query and submits the query to an information
access system such as a search engine.
 The system performs some computation to match the query with the documents most likely
to be relevant to the query and returns a ranked list of relevant documents to the user.

➠ 1.8 Visualization in Search Interfaces


 Various methods are used in search engines for visualization :
1. Visualizing Boolean syntax
2. Visualizing query terms within retrieval results
3. Visualizing relationships among words and documents
4. Visualization for text mining
 Visualizing Boolean syntax : Boolean queries are rarely used in web search because of
their difficult syntax. A Venn diagram is a better way of representing such a query than raw
Boolean syntax. A further problem with Boolean queries is that they can easily end up with
empty results or too many results.
 Visualizing query terms within retrieval results : In a standard search result listing,
summary sentences are often selected because they contain query terms, and occurrences of
these terms are highlighted or boldfaced where they appear in the title, summary and URL.
Fig. 1.8.1 shows visualization of query terms.

Fig. 1.8.1 : Visualization query
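
Highlighting query terms inside a result summary can be sketched with a regular expression. The function name `highlight` and the use of `<b>` tags are assumptions for illustration; an actual interface may use any styling mechanism.

```python
import re

def highlight(snippet, query, open_tag="<b>", close_tag="</b>"):
    """Wrap whole-word, case-insensitive occurrences of query terms in tags."""
    terms = [re.escape(t) for t in query.split()]
    if not terms:
        return snippet
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, snippet)

print(highlight("Information retrieval finds relevant information.", "information"))
# <b>Information</b> retrieval finds relevant <b>information</b>.
```

Matching on word boundaries avoids highlighting fragments inside longer words, and the original capitalization of the snippet is preserved.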

 Visualizing relationships among words and documents : Visualization developers have
suggested various ways of placing words and documents on a two-dimensional canvas, where
the proximity of glyphs represents the semantic relationship among the terms or documents.
Another method is to map documents or words from a very high-dimensional term space
down into a two-dimensional plane and show where the documents or words fall within
that plane using a 2D or 3D display.
 Visualization for text mining : Text mining is understood as a process of automatically
extracting meaningful, useful, previously unknown and ultimately comprehensible
information from textual document repositories. Text mining can be visualized as
consisting of two phases: Text refining that transforms free-form text documents into a
chosen intermediate form, and knowledge distillation that deduces patterns or knowledge
from the intermediate form.
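
The two phases of text mining can be sketched as follows. This is only an illustration under
stated assumptions : the intermediate form is taken to be a sorted list of distinct content
terms, the "knowledge" is taken to be term pairs that co-occur in several documents, and the
names STOPWORDS, refine and distill are hypothetical.

```python
from collections import Counter
from itertools import combinations
import re

# Assumed, very small stopword list for the example.
STOPWORDS = {"the", "a", "of", "and", "in", "is", "to"}

def refine(text):
    """Text refining: free-form text -> sorted list of distinct content terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(set(w for w in words if w not in STOPWORDS))

def distill(intermediate_forms, min_support=2):
    """Knowledge distillation: term pairs found together in at least min_support documents."""
    pair_counts = Counter()
    for terms in intermediate_forms:
        pair_counts.update(combinations(terms, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

docs = [
    "Ranking of documents in the search engine",
    "The search engine index",
    "Index compression",
]
forms = [refine(d) for d in docs]   # phase 1: intermediate form
patterns = distill(forms)           # phase 2: patterns deduced from it
print(patterns)
```

Here the only pattern with support 2 is the pair ("engine", "search"), which appears together in
the first two documents.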

➠ 1.9 Part A : Short Answer Questions [2 Marks Each]


Q.1 Define information retrieval. AU : Dec.-16
🞕 Ans. : Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from within large
collections (usually stored on computers).
Q.2 Explain difference between data retrieval and information retrieval.
🞕 Ans. :

Parameters    Data Retrieval    Information Retrieval
Example       Database query    WWW search
Matching      Exact             Partial match, best match
Inference     Deduction         Induction
Model         Deterministic     Probabilistic
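
The "Matching" row can be illustrated with a minimal sketch. The records, the documents and the
function names data_retrieval and information_retrieval are assumptions made for the example, and
counting shared terms is only one simple way of producing a best-match ranking.

```python
# Hypothetical records and documents for illustration.
RECORDS = [{"id": 1, "city": "Hosapete"}, {"id": 2, "city": "Mysuru"}]
DOCS = {
    "d1": "information retrieval ranks documents",
    "d2": "database query returns exact rows",
    "d3": "web search is information retrieval on the web",
}

def data_retrieval(city):
    """Exact match: a record either satisfies the condition or it does not."""
    return [r["id"] for r in RECORDS if r["city"] == city]

def information_retrieval(query):
    """Best match: score each document by shared terms and rank the partial matches."""
    q = set(query.lower().split())
    scores = {d: len(q & set(t.split())) for d, t in DOCS.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(d, s) for d, s in ranked if s > 0]

print(data_retrieval("Hosapete"))    # one exact hit
print(data_retrieval("hosapete"))    # a small difference gives no hit at all
print(information_retrieval("web information retrieval"))
```

The data retrieval call fails completely when the value differs slightly, while the information
retrieval call still returns documents that match only part of the query, ordered by score.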
Q.3 List and explain components of IR block diagram.

🞕 Ans. :

 Input : Only a representation of the document or query is stored, which means that the text
of a document is lost once it has been processed for the purpose of generating its
representation. A document representative could be a list of extracted words considered to
be significant.
 Processor : Performs the actual retrieval function, executing the search strategy in
response to a query.
 Feedback : Improves subsequent runs after a sample retrieval.
 Output : A set of document numbers.
Q.4 What are objective terms and nonobjective terms ?
🞕 Ans. :

 Objective terms are extrinsic to semantic content, and there is generally no disagreement
about how to assign them. Examples include author name, document URL, and date of
publication.
 Nonobjective terms are intended to reflect the information manifested in the document,
and there is no agreement about the choice or degree of applicability of these terms. They
are also known as content terms.
Q.5 Explain the types of natural language technology used in information retrieval.
🞕 Ans. : Two types of natural language technology can be useful in information retrieval :
 Natural language interfaces make the task of communicating with the information source
easier, allowing a system to respond to a range of inputs, possibly from inexperienced
users, and to produce more customized output.
 Natural language text processing allows a system to scan the source texts, either to retrieve
particular information or to derive knowledge structures that may be used in accessing
information from the texts.
Q.6 What is a search engine ?
🞕 Ans. : A search engine is a document retrieval system designed to help find information
stored in a computer system, such as on the WWW. The search engine allows one to ask for
content meeting specific criteria and retrieves a list of items that match those criteria.
Q.7 What is conflation ?
🞕 Ans. : Stemming is the process for reducing inflected words to their stem, base or root
form, generally a written word form. The process of stemming is often called conflation.
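
A minimal sketch of conflation by suffix stripping is given below. It is a deliberately naive
illustration, not the Porter stemmer or any other published algorithm; the suffix list and the
names naive_stem and conflate are assumptions made for the example.

```python
# Illustrative suffix list only; real stemmers apply many more rules and conditions.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def conflate(words):
    """Group word forms that reduce to the same stem."""
    groups = {}
    for w in words:
        groups.setdefault(naive_stem(w), []).append(w)
    return groups

print(conflate(["connect", "connected", "connecting", "connects"]))
```

All four forms are conflated to the single stem "connect", so a query containing any one of them
can match documents containing the others.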
Q.8 What is an invisible web ?
🞕 Ans. : Many dynamically generated sites are not indexable by search engines; this
phenomenon is known as the invisible web.
Q.9 Define Zipf’s law.
🞕 Ans. : An empirical rule that describes the frequency of the text words. It states that the
i-th most frequent word appears as many times as the most frequent one divided by $i^{\theta}$,
for some $\theta > 1$.


Q.10 What is supervised learning ?
🞕 Ans. : In supervised learning, both the inputs and the outputs are provided. The network
then processes the inputs and compares its resulting outputs against the desired outputs. Errors
are then propagated back through the system, causing the system to adjust the weights which
control the network.
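
The idea of adjusting weights from the error between actual and desired outputs can be sketched
with a single threshold unit. This is a simplification of the multi-layer case described above :
the training data (logical AND), the learning rate and the names predict and train are
assumptions chosen for the example.

```python
# Training data for logical AND: (inputs, desired output).
SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def predict(weights, bias, inputs):
    """Threshold unit: output 1 if the weighted sum plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(samples, epochs=20, rate=1):
    """Adjust weights and bias in proportion to the error on each sample."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, desired in samples:
            error = desired - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias

weights, bias = train(SAMPLES)
outputs = [predict(weights, bias, x) for x, _ in SAMPLES]
print(outputs)
```

After training, the outputs agree with the desired outputs for all four input pairs, because
every wrong answer moved the weights in the direction that reduces the error.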
Q.11 What is unsupervised learning ?
🞕 Ans. : In unsupervised learning, the network adapts purely in response to its inputs.
Such networks can learn to pick out structure in their input.
Q.12 What is text mining ?
🞕 Ans. : Text mining is understood as a process of automatically extracting meaningful,
useful, previously unknown and ultimately comprehensible information from textual document
repositories. Text mining can be visualized as consisting of two phases : Text refining that
transforms free-form text documents into a chosen intermediate form, and knowledge
distillation that deduces patterns or knowledge from the intermediate form.
Q.13 Specify the role of an IR system. AU : Dec.-16
🞕 Ans. : The role of an IR system is to retrieve all the documents that are relevant to a
query while retrieving as few non-relevant documents as possible. IR allows access to whole
documents, whereas search engines do not.
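
These two goals are commonly measured with recall (the share of relevant documents that were
retrieved) and precision (the share of retrieved documents that are relevant). The sketch below
assumes that the set of relevant documents is known; the document ids and the function name
precision_recall are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a known relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```

Two of the four retrieved documents are relevant, so precision is 0.5, and two of the three
relevant documents were found, so recall is about 0.67.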
Q.14 Outline the impact of the web on information retrieval. AU : May-17
🞕 Ans. : The web is a huge, widely distributed, highly heterogeneous and semi-structured
information source. With the rapid growth of the Internet, a huge amount of information is
available on the web, and web information retrieval presents additional technical challenges
when compared to classic information retrieval due to the heterogeneity and size of the web.
 Web information retrieval is unique due to the dynamism, variety of languages used,
duplication, high linkage, ill-formed queries and wide variance in the nature of users. IR
helps users find information that matches their information needs expressed as queries.
Historically, IR is about document retrieval, emphasizing the document as the basic unit.
Q.15 Compare information retrieval and web search. AU : May-17
🞕 Ans. : In information retrieval, databases usually cover only one language, or index
documents written in different languages with the same vocabulary. In web search, documents
are in many different languages. Search engines usually use full-text indexing, with no
additional subject analysis.


➠ 1.10 Multiple Choice Questions


Q.1 ______ retrieval deals with the representation, storage, organization of and
access to information items such as documents, web pages, online catalogs,
structured and semi-structured records and multimedia objects.
a Data b Audio c Video d Information

Q.2 ______ of a retrieval system has to translate their information need into a
query in the language provided by the system.
a The manager b The user
c The designer d The administrator

Q.3 ______ is usually provided by most modern information retrieval systems.
a Information and knowledge retrieval
b Information or knowledge retrieval
c Information and data retrieval
d Information or data retrieval

Q.4 ______ is an iterative process of formulating a conceptual representation from a
large collection of information.
a Sense making b Data collection
c Information collection d All of these

Q.5 A web page is an ______ document that is stored on a web server.
a XML b HTML c XSL d Java

Q.6 A ______ is a hypertext link to a page on a website other than its homepage.
a hyper link b deep link c URL d HTML

Q.7 Which methods are used in search engines for visualization ?
a Visualizing Boolean syntax
b Visualizing query terms within retrieval results
c Visualizing relationships among words and documents
d All of these


Q.8 URL stands for ______.
a Uniform Ravar Location b Uniform Resource Locator
c Uni Resource Locate d Uniform Reverse Locator

Q.9 Which of the following is a search engine ?
a Google b Yahoo! c Bing d All of these

Q.10 In IR systems, the information is ______.
a structured b semi-structured c not structured d None

Q.11 A search engine is a program to search ______.
a for information b web pages for information using specified search terms
c web pages d web pages for specified index terms

Q.12 Early IR systems were ______ systems which allowed users to specify their
information need using a complex combination of Boolean ANDs, ORs and NOTs.
a Boolean b vector c logical d All of these

Q.13 Information retrieval systems are characterized by ______ data format.
a structured b semi-structured c unstructured d all of these

Q.14 ______ are a set of electronic resources and associated technical capabilities for
creating, searching, and using information.
a Analog libraries b Digital libraries
c Digital information d Digital data


Q.15 Which of the following is NOT a component of the IR block diagram ?
a Input b Processor c Feedback d Information

Q.16 Web browser is a software program that interprets and displays the
contents of ______ web pages.
a XML b HTML c static d dynamic

Q.17 A ______ diagram is a diagram that visually displays all the possible logical
relationships between collections of sets.
a Text b Information c Binary d Venn


➤ Answer Keys for Multiple Choice Questions

Q.1 d     Q.2 b     Q.3 c     Q.4 a
Q.5 b     Q.6 b     Q.7 d     Q.8 b
Q.9 d     Q.10 c    Q.11 b    Q.12 a
Q.13 c    Q.14 b    Q.15 d    Q.16 b
Q.17 d


