Unit I - Irs
Definition of Information Retrieval System
• An Information Retrieval System is a system that is
capable of storage, retrieval, and maintenance of
information.
• Information in this context can be composed of text
(including numeric and date data), images, audio,
video and other multi-media objects.
• Although an item in an information retrieval system can be
built from many types of data, so far only text has
proven to be a data type that is well suited for
complete functional processing.
– The other data types have been treated as highly
informative sources, but are primarily linked for retrieval
based upon search of the text.
• The term “item” is used to represent the smallest
complete unit that is processed and manipulated
by the system.
– A complete document, such as a book, newspaper or
magazine could be an item. At other times each
chapter, or article may be defined as an item.
– An item may address even lower levels of abstraction
such as a contiguous passage of text or a paragraph.
– A video news program could be considered an item. It
is composed of text in the form of closed captioning,
audio text provided by the speakers, and the video
images being displayed.
• There are multiple "tracks" of information possible in a
single item.
– They are typically correlated by time.
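The idea of one item carrying several time-correlated tracks can be sketched as a small data structure. This is only an illustration: the class names `Segment` and `Item`, the track names and the sample content are hypothetical and do not describe any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: float  # seconds from the start of the item
    end: float    # end of the segment (exclusive)
    content: str  # text, or a reference to audio/video data


@dataclass
class Item:
    title: str
    # Track name -> list of segments (hypothetical layout).
    tracks: dict[str, list[Segment]] = field(default_factory=dict)

    def at_time(self, t: float) -> dict[str, str]:
        """Return what each track holds at time t, correlating tracks by time."""
        found = {}
        for name, segments in self.tracks.items():
            for seg in segments:
                if seg.start <= t < seg.end:
                    found[name] = seg.content
                    break
        return found


news = Item(
    title="Evening news",
    tracks={
        "closed_captions": [
            Segment(0, 10, "Good evening."),
            Segment(10, 25, "Our top story tonight."),
        ],
        "audio_transcript": [
            Segment(0, 10, "good evening"),
            Segment(10, 25, "our top story tonight"),
        ],
        "video": [Segment(0, 25, "frames_0000_0625")],
    },
)

print(news.at_time(12.0))
```

A text search that hits the caption track at a given time can then use the same time value to locate the matching audio and video.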
• An Information Retrieval System consists of a software
program that facilitates a user in finding the
information the user needs.
• The system may use standard computer hardware or
specialized hardware to support the search sub-
function and to convert non-textual sources to a
searchable media (e.g., transcription of audio to text).
• A key indicator of an information system's success is how far it
reduces the overhead a user incurs in locating needed information.
Overhead here is the time required to find the information,
excluding the time spent actually reading the relevant data.
• Thus search composition (preparing the search),
search execution, and reading non-relevant items are
all aspects of information retrieval overhead.
• The advent and exponential growth of the Internet,
together with its original WAIS (Wide Area Information
Servers) capability and more recent advanced search
servers (like INFOSEEK and EXCITE), have opened a
new way to access terabytes of information.
• The processing and access of large quantities of
textual data have now become a needed
capability for a large part of the population,
with significant research and development being
done by the private sector.
• Images across the Internet are searchable from
many web sites such as WEBSEEK, DITTO.COM,
ALTAVISTA/IMAGES.
Difference between Information Retrieval System and DBMS
• Information Retrieval is concerned with the
representation, storage, organization of, and
access to information items.
• The main difference between databases and
IR is that databases focus on structured data
while IR focuses mainly on unstructured data.
• Also, databases are concerned with data
retrieval, not information retrieval.
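The contrast between data retrieval on structured records and information retrieval on unstructured text can be illustrated with a toy example. The records, the documents and the `rank` function are all made up for illustration, and scoring by the count of matching query terms is an assumed simplification, not a method taken from these slides.

```python
# Structured data: each record has named fields; a record either matches or not.
employees = [
    {"id": 1, "name": "Asha", "dept": "Sales"},
    {"id": 2, "name": "Ravi", "dept": "Research"},
]
research_staff = [e["name"] for e in employees if e["dept"] == "Research"]

# Unstructured data: free text; items are ordered by how well they match.
documents = {
    "d1": "retrieval of information from large text collections",
    "d2": "sales figures for the last quarter",
    "d3": "text retrieval systems rank items by relevance",
}


def rank(query: str, docs: dict[str, str]) -> list[tuple[str, int]]:
    """Score each document by the number of distinct query terms it contains."""
    terms = set(query.lower().split())
    scores = {
        doc_id: len(terms & set(text.lower().split()))
        for doc_id, text in docs.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Drop documents that share no term with the query.
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


print(research_staff)
print(rank("text retrieval relevance", documents))
```

The first lookup returns exactly the records whose field equals the requested value; the second returns a best-first list in which some items match the query only partially.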
Objectives of Information Retrieval Systems
• The general objective of an Information Retrieval
System is to minimize the overhead of a user locating
needed information.
– Overhead can be expressed as the time a user spends in all
of the steps leading to reading an item containing the
needed information (e.g., query generation, query
execution, scanning results of query to select items to
read, reading non-relevant items).
• The information required and the user's willingness to
absorb overhead determine how successful an
information system will be.
• Under some circumstances, needed information can
be defined as all information that is in the system that
relates to a user’s need.
– In other cases it may be defined as sufficient information
in the system to complete a task, allowing for missed data.
• A system that supports reasonable retrieval requires
fewer features than one that must support
comprehensive retrieval.
– In many cases comprehensive retrieval is a negative
feature because it overloads the user with more
information than is needed.
– This makes it more difficult for the user to filter the
relevant but non-useful information from the critical
items.
• In information retrieval the term “relevant” item is
used to represent an item containing the needed
information.
• In reality the definition of relevance is not a binary
classification but a continuous function.
• The two major measures commonly associated with information
systems are precision and recall.
• When a user decides to issue a search looking for information on a
topic, the total database is logically divided into four segments
(shown in Figure): relevant retrieved, relevant not retrieved,
non-relevant retrieved, and non-relevant not retrieved.
• Relevant items are those documents that contain information that helps
the searcher answer the question. Non-relevant items are those
items that do not provide any directly useful information. There are two
possibilities with respect to each item:
– it can be retrieved or not retrieved by the user’s query.
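The division of the database into four segments can be sketched with set operations. The function name `partition`, the segment labels and the document identifiers are hypothetical, and the sketch assumes that the relevant and retrieved sets are both subsets of the collection.

```python
def partition(
    collection: set[str], relevant: set[str], retrieved: set[str]
) -> dict[str, set[str]]:
    """Split a collection into the four relevant/retrieved segments."""
    return {
        "relevant_retrieved": relevant & retrieved,
        "relevant_not_retrieved": relevant - retrieved,
        "non_relevant_retrieved": retrieved - relevant,
        "non_relevant_not_retrieved": collection - relevant - retrieved,
    }


collection = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}
relevant = {"d1", "d2", "d3"}         # items that help answer the question
retrieved = {"d2", "d3", "d4", "d5"}  # items returned by the query

segments = partition(collection, relevant, retrieved)
for name, items in segments.items():
    print(name, sorted(items))
```

Every item in the collection falls into exactly one of the four segments.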
• Precision and recall are defined as:
– Precision = Number_Retrieved_Relevant / Number_Total_Retrieved
– Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
• where
– Number_Possible_Relevant is the number of relevant
items in the database.
– Number_Total_Retrieved is the total number of items
retrieved from the query.
– Number_Retrieved_Relevant is the number of items
retrieved that are relevant to the user’s search need.
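As a worked example, the two measures can be computed from the three quantities above: precision as the ratio of relevant items retrieved to total items retrieved, and recall as the ratio of relevant items retrieved to possible relevant items. The function names and sample numbers are illustrative, and returning 0.0 for an empty denominator is an assumed convention rather than part of the definitions.

```python
def precision(number_retrieved_relevant: int, number_total_retrieved: int) -> float:
    """Fraction of the retrieved items that are relevant."""
    if number_total_retrieved == 0:
        return 0.0  # assumed convention when nothing was retrieved
    return number_retrieved_relevant / number_total_retrieved


def recall(number_retrieved_relevant: int, number_possible_relevant: int) -> float:
    """Fraction of the relevant items in the database that were retrieved."""
    if number_possible_relevant == 0:
        return 0.0  # assumed convention when the database has no relevant items
    return number_retrieved_relevant / number_possible_relevant


# Hypothetical search: 4 items retrieved, 2 of them relevant,
# and 3 relevant items exist in the database.
p = precision(2, 4)
r = recall(2, 3)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

In this example half of what the user has to read is overhead, and one of the three relevant items was missed.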
• Precision: Precision measures how many of the
retrieved items are relevant to the user's query.
– It is calculated as the ratio of relevant items retrieved to
the total items retrieved. For example, if a search has 85%
precision, it means 85% of the items retrieved are
relevant, and 15% are non-relevant (which represent
overhead for the user).
• Recall: Recall measures how well the system retrieves
all the relevant items from the database that the user
is interested in.
– It is calculated as the ratio of relevant items retrieved to
the total number of relevant items in the database.
– A high recall indicates that the system is good at finding all
relevant items.
• Relationship between Precision and Recall:
– In the ideal case (Figure), the system retrieves all relevant items before any non-relevant
ones. Precision therefore starts at 100% and stays there for as long as every retrieved item
is relevant. Recall starts low and increases as more relevant items are found, until all
relevant items in the database have been retrieved.
– Once all relevant items have been retrieved, recall reaches 100% and stays there, because no
more relevant items can be retrieved.
– Precision is affected by the retrieval of non-relevant items; as more non-relevant items are
retrieved, precision drops.
– Recall, however, is not affected by the retrieval of non-relevant items; it only concerns how
many of the relevant items were successfully retrieved out of all possible relevant items.
• Recall is not directly calculable in operational systems because it requires
knowledge of the total set of relevant items in the database, which may not be
known beforehand. Thus, operational systems often estimate or infer recall
indirectly based on the retrieved items.
• Precision and recall are crucial metrics in evaluating the effectiveness of
information retrieval systems, with precision focusing on the relevance of
retrieved items and recall focusing on the completeness of retrieval for relevant
items.
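How the two measures move as items are retrieved one at a time can be traced on a small, ideally ordered result list in which all relevant items come first. This is a sketch under stated assumptions: the function name `precision_recall_curve` and the six-item ranking are made up, and the total number of relevant items is taken as known, which an operational system usually cannot assume.

```python
def precision_recall_curve(
    ranking: list[bool], total_relevant: int
) -> list[tuple[float, float]]:
    """Return (precision, recall) after each retrieved item.

    ranking[i] is True when the item at rank i + 1 is relevant.
    """
    points = []
    hits = 0
    for n, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / n, hits / total_relevant))
    return points


# Ideal ordering: the 3 relevant items are retrieved before any non-relevant one.
ideal = [True, True, True, False, False, False]
points = precision_recall_curve(ideal, total_relevant=3)
for n, (p, r) in enumerate(points, start=1):
    print(f"{n} retrieved: precision={p:.2f} recall={r:.2f}")
```

Precision holds at 1.00 for the first three items and then falls with each non-relevant item, while recall climbs to 1.00 and stays there.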
• The first objective of an Information Retrieval System
is support of user search generation.
• There are natural obstacles to specification of the
information a user needs that come from ambiguities
inherent in languages, limits to the user’s ability to
express what information is needed and differences
between the user’s vocabulary corpus and that of the
authors of the items in the database.
• Natural languages suffer from word ambiguities such
as homographs and use of acronyms that allow the
same word to have multiple meanings (e.g., the word
“field”).
• Disambiguation techniques exist but introduce
significant system overhead in processing power and
extended search times and often require interaction
with the user.
• Many users have trouble in generating a good search
statement. The typical user does not have significant
experience with nor even the aptitude for Boolean
logic statements.
• It is only with the introduction of Information Retrieval
Systems such as RetrievalWare, TOPIC, AltaVista,
Infoseek and INQUERY that accepting natural language
queries has started to become a standard
system feature.
• This allows users to state in natural language what
they are interested in finding. But the completeness of
the user specification is limited by the user’s
willingness to construct long natural language queries.
Most users on the Internet enter one or two search
terms.
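The difference between a Boolean search statement and a short free-text query can be sketched over a toy inverted index. The documents and the `free_text` function are hypothetical, and ranking by the number of matching terms is an assumed simplification; it does not describe how any of the systems named above process natural language.

```python
from collections import defaultdict

documents = {
    1: "oil field production in texas",
    2: "magnetic field strength measurements",
    3: "oil prices and production costs",
}

# Inverted index: term -> ids of the documents that contain the term.
index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean statement: oil AND production AND NOT field
boolean_hits = (index["oil"] & index["production"]) - index["field"]


def free_text(query: str) -> list[int]:
    """Order documents by how many query terms they contain (ties by id)."""
    counts = {doc_id: 0 for doc_id in documents}
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            counts[doc_id] += 1
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, count in ordered if count > 0]


print(boolean_hits)
print(free_text("oil production"))
print(free_text("field"))
```

The Boolean form requires the user to know the operators and to anticipate unwanted senses, while the one-word query "field" returns items about both oil fields and magnetic fields.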
• Multi-media adds an additional level of complexity in
search specification.
– Where the modality has been converted to text (e.g., audio
transcription, OCR), the normal text techniques are still
applicable.
– Searches for non-text content such as images, sounds or video
are typically achieved by having prestored examples of known
objects in the media and letting the user select them for
the search.
– This type of specification becomes more complex when
coupled with Boolean or natural language textual
specifications.
• In addition to the complexities in generating a query,
quite often the user is not an expert in the area that is
being searched and lacks domain specific vocabulary
unique to that particular subject area.
• A limited knowledge of the vocabulary associated with a
particular area along with lack of focus on exactly what
information is needed leads to use of inaccurate and in some
cases misleading search terms.
• Users usually start with simple queries that suffer from failure
rates approaching 50%.