Unit I - Irs

An Information Retrieval System (IRS) is designed to store, retrieve, and maintain various types of information, primarily focusing on text while also accommodating multimedia. The system aims to minimize user overhead in locating information, with key metrics including precision and recall to evaluate effectiveness. The process involves item normalization, which includes standardizing input formats, identifying processing tokens, and applying stop lists to enhance search efficiency.

Uploaded by

Maneesh Ramaram
Copyright
© All Rights Reserved

Introduction to Information Retrieval
Definition of Information Retrieval System
• An Information Retrieval System is a system that is
capable of storage, retrieval, and maintenance of
information.
• Information in this context can be composed of text
(including numeric and date data), images, audio,
video and other multi-media objects.
• Although an object in an information retrieval system
can take many forms, so far only text has proven to be
a data type that is well suited for complete functional
processing.
– The other data types have been treated as highly
informative sources, but they are primarily linked for
retrieval based upon a search of the text.
Definition of Information Retrieval System
• The term “item” is used to represent the smallest
complete unit that is processed and manipulated
by the system.
– A complete document, such as a book, newspaper or
magazine could be an item. At other times each
chapter, or article may be defined as an item.
– An item may address even lower levels of abstraction
such as a contiguous passage of text or a paragraph.
– A video news program could be considered an item. It
is composed of text in the form of closed captioning,
audio text provided by the speakers, and the video
images being displayed.
• There are multiple "tracks" of information possible in a
single item.
– They are typically correlated by time.
Definition of Information Retrieval System
• An Information Retrieval System consists of a software
program that facilitates a user in finding the
information the user needs.
• The system may use standard computer hardware or
specialized hardware to support the search sub-
function and to convert non-textual sources to a
searchable media (e.g., transcription of audio to text).
• A key indicator of an information system's success is
its ability to reduce the overhead a user incurs in
locating the information they need. Overhead is the time
required to find the information, excluding the time
spent actually reading the relevant data.
• Thus search composition (preparing the query),
search execution, and reading non-relevant items are
all aspects of information retrieval overhead.
Definition of Information Retrieval System
• The advent and exponential growth of the Internet,
with its original WAIS (Wide Area Information
Servers) capability and more recent advanced
search servers (such as INFOSEEK and EXCITE),
has opened a new way to access terabytes of
information.
• Processing and accessing large quantities of
textual data has become a needed capability for
a large part of the population, with significant
research and development being done by the
private sector.
• Images across the Internet are searchable from
many web sites such as WEBSEEK, DITTO.COM,
ALTAVISTA/IMAGES.
Difference between Information Retrieval System and DBMS
• Information Retrieval is concerned with the
representation, storage, organization of, and
access to information items.
• The main difference between databases and
IR is that databases focus on structured data
while IR focuses mainly on unstructured data
• Also, databases are concerned with data
retrieval, not information retrieval.
Objectives of Information Retrieval Systems
• The general objective of an Information Retrieval
System is to minimize the overhead of a user locating
needed information.
– Overhead can be expressed as the time a user spends in all
of the steps leading to reading an item containing the
needed information (e.g., query generation, query
execution, scanning results of query to select items to
read, reading non-relevant items).
• The information required and the user's willingness to
absorb overhead determine how successful an
information system will be.
• Under some circumstances, needed information can
be defined as all information that is in the system that
relates to a user’s need.
– In other cases it may be defined as sufficient information
in the system to complete a task, allowing for missed data.
Objectives of Information Retrieval Systems
• A system that supports reasonable retrieval requires
fewer features than one which requires
comprehensive retrieval.
– In many cases comprehensive retrieval is a negative
feature because it overloads the user with more
information than is needed.
– This makes it more difficult for the user to filter the
relevant but non-useful information from the critical
items.
• In information retrieval the term “relevant” item is
used to represent an item containing the needed
information.
• In reality the definition of relevance is not a binary
classification but a continuous function.
Objectives of Information Retrieval Systems
• The two major measures commonly associated with information
systems are precision and recall.
• When a user decides to issue a search looking for information on a
topic, the total database is logically divided into four segments
(Figure): relevant retrieved, relevant not retrieved, non-relevant
retrieved, and non-relevant not retrieved.
• Relevant items are those documents that contain information that helps
the searcher answer their question. Non-relevant items are those
items that do not provide any directly useful information. There are two
possibilities with respect to each item:
– it can be retrieved or not retrieved by the user’s query.
Objectives of Information Retrieval Systems
• Precision and recall are defined as:
– Precision = Number_Retrieved_Relevant / Number_Total_Retrieved
– Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
• where
– Number_Possible_Relevant is the number of relevant
items in the database.
– Number_Total_Retrieved is the total number of items
retrieved from the query.
– Number_Retrieved_Relevant is the number of items
retrieved that are relevant to the user’s search need.
Objectives of Information Retrieval Systems
• Precision: Precision measures how many of the
retrieved items are relevant to the user's query.
– It is calculated as the ratio of relevant items retrieved to
the total items retrieved. For example, if a search has 85%
precision, it means 85% of the items retrieved are
relevant, and 15% are non-relevant (which represent
overhead for the user).
• Recall: Recall measures how well the system retrieves
all the relevant items from the database that the user
is interested in.
– It is calculated as the ratio of relevant items retrieved to
the total number of relevant items in the database.
– A high recall indicates that the system is good at finding all
relevant items.
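The two ratios described above can be worked through in a short sketch. The function name, the use of Python sets of item ids, and the sample numbers are illustrative assumptions, not part of the original slides; the sample is chosen to reproduce the 85% precision case.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of item ids returned by the query (Number_Total_Retrieved)
    relevant:  set of item ids in the database that are relevant
               (Number_Possible_Relevant)
    """
    retrieved_relevant = retrieved & relevant  # Number_Retrieved_Relevant
    precision = len(retrieved_relevant) / len(retrieved) if retrieved else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 20 items retrieved, 17 of them relevant,
# and 34 relevant items exist in the whole database.
retrieved = set(range(20))     # ids 0..19
relevant = set(range(3, 37))   # ids 3..36 (34 items, 17 of them retrieved)
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.85 recall=0.50
```

Note that recall needs the full set of relevant items, which an operational system usually does not know; here it is simply given.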
Objectives of Information Retrieval Systems
• Relationship between Precision and Recall:
– In the ideal case (Figure), the system retrieves relevant items first: precision starts at 100%
and stays there as long as every retrieved item is relevant. Recall starts low and increases as
more relevant items are found, until all relevant items in the database have been retrieved.
– Once all relevant items have been retrieved, recall has reached 100%; any further items
retrieved can only be non-relevant.
– Precision is affected by the retrieval of non-relevant items; as more non-relevant items are
retrieved, precision drops.
– Recall, however, is not affected by the retrieval of non-relevant items; it only concerns how
many of the relevant items were successfully retrieved out of all possible relevant items.
• Recall is not directly calculable in operational systems because it requires
knowledge of the total set of relevant items in the database, which may not be
known beforehand. Thus, operational systems often estimate or infer recall
indirectly based on the retrieved items.
• Precision and recall are crucial metrics in evaluating the effectiveness of
information retrieval systems, with precision focusing on the relevance of
retrieved items and recall focusing on the completeness of retrieval for relevant
items.
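The relationship between the two measures can be traced item by item. The following is a minimal sketch, assuming a ranked result list in which each entry is simply flagged relevant or not; the function name and the six-item sample are hypothetical.

```python
def precision_recall_curve(ranked_relevance, total_relevant):
    """Return a list of (precision, recall) pairs, one per item retrieved.

    ranked_relevance: list of booleans, True if the item at that rank is relevant
    total_relevant:   number of relevant items in the whole database
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / rank, hits / total_relevant))
    return points


# Ideal ordering: the 3 relevant items come first, then only non-relevant ones.
ideal = [True, True, True, False, False, False]
curve = precision_recall_curve(ideal, total_relevant=3)
for rank, (precision, recall) in enumerate(curve, start=1):
    print(rank, round(precision, 2), round(recall, 2))
```

In this sample precision stays at 1.0 for the first three items and then falls (0.75, 0.6, 0.5), while recall climbs to 1.0 at the third item and is unaffected by the non-relevant items that follow.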
Objectives of Information Retrieval Systems
• The first objective of an Information Retrieval System
is support of user search generation.
• There are natural obstacles to specification of the
information a user needs that come from ambiguities
inherent in languages, limits to the user’s ability to
express what information is needed and differences
between the user’s vocabulary corpus and that of the
authors of the items in the database.
• Natural languages suffer from word ambiguities such
as homographs and the use of acronyms, which allow the
same word to have multiple meanings (e.g., the word
“field”).
• Disambiguation techniques exist but introduce
significant system overhead in processing power and
extended search times and often require interaction
with the user.
Objectives of Information Retrieval Systems
• Many users have trouble in generating a good search
statement. The typical user does not have significant
experience with nor even the aptitude for Boolean
logic statements.
• Only with the introduction of Information Retrieval
Systems such as RetrievalWare, TOPIC, AltaVista,
Infoseek and INQUERY did accepting natural
language queries begin to become a standard
system feature.
• This allows users to state in natural language what
they are interested in finding. But the completeness of
the user specification is limited by the user’s
willingness to construct long natural language queries.
Most users on the Internet enter one or two search
terms.
Objectives of Information Retrieval Systems
• Multi-media adds an additional level of complexity in
search specification.
– Where the medium has been converted to text (e.g., audio
transcription, OCR), the normal text techniques are still
applicable.
– Searches for non-text content (an image, sound, or video
segment) are typically specified by having prestored
examples of known objects in the media and letting the
user select them for the search.
– This type of specification becomes more complex when
coupled with Boolean or natural language textual
specifications.
• In addition to the complexities in generating a query,
quite often the user is not an expert in the area that is
being searched and lacks domain specific vocabulary
unique to that particular subject area.
Objectives of Information Retrieval Systems
• A limited knowledge of the vocabulary associated with a
particular area along with lack of focus on exactly what
information is needed leads to use of inaccurate and in some
cases misleading search terms.
• Users usually start with simple queries that suffer from failure
rates approaching 50%.
• Thus, an Information Retrieval System must provide tools to
help overcome the search specification problems discussed
above.
Functional Overview
• The total information storage and retrieval system consists of four major
functional processes:
– Item Normalization
– Selective Dissemination of Information (i.e., Mail)
– Archival Document Database Search
– Index Database Search along with the Automatic File Build Process
• The next figure shows the logical view of these capabilities in a single
integrated information retrieval system.
Item Normalization
• The first step in any integrated system is to
normalize the incoming items to a standard format.
• Item normalization provides logical restructuring of
the item.
• Additional operations are needed to create a
searchable data structure:
– identification of processing tokens (e.g., words),
– characterization of the tokens (categorizing the individual
units of a sequence)
– stemming (e.g., removing word endings) of the tokens
• The processing tokens and their characterization are used to
define the searchable text from the total received text.
Item Normalization
• The following figure shows the normalization
process.
Item Normalization
• Standardizing the input takes the different external formats of
input data and translates them to the formats acceptable to the
system (e.g., translation of foreign-language text into Unicode).
• One standard encoding that covers English, French, Spanish, etc.
is ISO-Latin.
• Multimedia adds an extra dimension to the normalization
process.
• If the input is video, the likely digital standards will be
MPEG-2, MPEG-1, AVI or Real Media.
• MPEG (Moving Picture Experts Group) standards are the most
universal standards for higher quality video, whereas Real Media is
the most common standard for lower quality video being used on
the Internet.
• Audio standards are typically WAV or Real Media (Real Audio).
• Images vary from JPEG to BMP.
Item Normalization
• The next process is to parse the item into logical sub-
divisions that have meaning to the user. This process is
called “zoning”.
• A typical item is sub-divided into zones, which may
overlap and can be hierarchical, such as Title, Author,
Abstract, Main Text, Conclusion, and References.
• The term “zone” was selected over “field” because of
the variable length nature of the data identified and
because it is a logical sub-division of the total item,
whereas the term “field” implies independence.
• This categorization helps in organizing and managing
data efficiently.
Item Normalization
• Identification involves recognizing the processing
tokens within each zone of the item.
• Once identified, the information (tokens) and their
associated zones are stored in a way that allows for
easy retrieval and management.
• Users can perform searches that are restricted to
specific zones or categories.
• Once a search is complete, the user wants to
efficiently review the results to locate the needed
information. A major limitation to the user is the size
of the display screen which constrains the number of
items that are visible for review.
Item Normalization
• To optimize the number of items reviewed per
display screen, the user wants to display the
minimum data required from each item to
allow determination of the possible relevance
of that item.
• Quite often the user will only display zones
such as the Title or Title and Abstract.
• This allows multiple items to be displayed per
screen. The user can expand those items of
potential interest to see the complete text.
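Zone-restricted search and zone-limited display can be sketched as follows. This is only an illustration: the dictionary layout, the function names `search_zones` and `summary`, and the sample items are assumptions, and a simple case-insensitive substring test stands in for a real search structure.

```python
# Each item is stored as a mapping from zone name to that zone's text.
items = {
    "doc1": {
        "Title": "Emission controls for farm vehicles",
        "Abstract": "Surveys sulfur dioxide limits.",
        "Main Text": "Automobile emissions are discussed in detail.",
    },
    "doc2": {
        "Title": "Sulfur dioxide and crops",
        "Abstract": "Field measurements over two seasons.",
        "Main Text": "Crop yields are compared across regions.",
    },
}


def search_zones(items, term, zones=None):
    """Return ids of items whose selected zones contain the term.

    zones=None searches every zone of each item.
    """
    term = term.lower()
    hits = []
    for item_id, item in items.items():
        selected = zones if zones is not None else item.keys()
        if any(term in item.get(zone, "").lower() for zone in selected):
            hits.append(item_id)
    return hits


def summary(item, zones=("Title",)):
    """Return only the requested zones, so more items fit on one screen."""
    return " | ".join(item[zone] for zone in zones if zone in item)


print(search_zones(items, "sulfur", zones=["Title"]))  # ['doc2']
print(search_zones(items, "sulfur"))                   # ['doc1', 'doc2']
print(summary(items["doc1"]))
```

Restricting the search to the Title zone drops doc1, which mentions the term only in its Abstract.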
Item Normalization
• Once standardization and zoning have been completed, the
information (i.e., words) used in the search process needs
to be identified in the item.
– The term processing token is used because a “word” is not the most
efficient unit on which to base search structures.
• The first step in identification of a processing token consists of
determining a word.
• Systems determine words by dividing input symbols into three
classes:
– valid word symbols,
– inter-word symbols, and
– special processing symbols.
• A word is defined as a contiguous set of word symbols bounded
by inter-word symbols (e.g., word symbols are alphabetic characters
and numbers; inter-word symbols are blanks, periods and semicolons).
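The three symbol classes can be illustrated with a small sketch. The choice of which characters count as special processing symbols, and the names `classify` and `find_words`, are assumptions made for the example; a real system would choose them per language.

```python
# Assumption: hyphen and apostrophe are treated as special processing symbols.
SPECIAL_SYMBOLS = set("-'")


def classify(ch):
    """Assign a character to one of the three symbol classes."""
    if ch.isalnum():
        return "word"
    if ch in SPECIAL_SYMBOLS:
        return "special"
    return "inter-word"  # blanks, periods, semicolons, and everything else


def find_words(text):
    """Split text into words: runs of symbols bounded by inter-word symbols.

    Special symbols are kept inside the word so that later rule-based
    processing can decide what tokens to generate from them.
    """
    words, current = [], []
    for ch in text:
        if classify(ch) == "inter-word":
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(find_words("The state-of-the-art system; it's fast."))
# ['The', 'state-of-the-art', 'system', "it's", 'fast']
```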
Item Normalization
• The significance of these symbols depends on
the language being processed.
– For example, an apostrophe might be critical for
representing foreign names accurately in a
database, even if it's of minimal importance in
English when used for possessives.
• When designing text processing systems,
decisions about which inter-word symbols to
prioritize are based on the required accuracy
of searches and specific language
characteristics.
Item Normalization
• Finally there are some symbols that may
require special processing.
– A hyphen can be used in many ways, often left to the
taste and judgment of the writer.
• When a hyphen (or other special symbol) is
detected, a set of rules is executed to
determine what action is to be taken,
generating one or more processing tokens.
Item Normalization
• A stop list (or stop algorithm) is applied to the list of
potential processing tokens.
– Objective of Stop function: To save system resources by eliminating
from the set of searchable processing tokens those that have little
value to the system.
– They consist of words or terms that are excluded from being indexed
or considered in search queries due to their frequent occurrence or
lack of semantic value in relation to the search context
– Stop lists help filter out words that are extremely common across
documents (like "the", "and", "is") but carry little semantic meaning.
Including these words in search queries could lead to irrelevant
results.
– By excluding these words from indexing and search operations, stop
lists reduce the size of indexes and improve the efficiency of search
queries.
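Applying a stop list amounts to a simple filter over the candidate tokens. The sketch below assumes a tiny hand-picked stop list and a hypothetical function name; production stop lists are much longer and language-specific.

```python
# Assumption: a very small illustrative stop list.
STOP_LIST = {"the", "and", "is", "a", "of", "to", "in"}


def remove_stop_words(tokens, stop_list=STOP_LIST):
    """Keep only the tokens that are not on the stop list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_list]


tokens = "The design of the computer is complete".split()
print(remove_stop_words(tokens))  # ['design', 'computer', 'complete']
```

Seven candidate tokens shrink to three, which is the index-size saving the stop function is after.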
Item Normalization
• According to Zipf's (Zipf-49) hypothesis, when the
frequency of occurrence of the unique words across a
corpus of items is examined, most unique words are
found to occur only a few times.
• Zipf's law states that in a given text or corpus, the
frequency of any word is inversely proportional to its
rank in the frequency table.
• The rank-frequency law of Zipf is:
– Frequency * Rank = constant
• where
– Frequency is the number of times a word occurs and
– Rank is the rank order of the word.
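The rank-frequency relationship can be checked on a token list. The sketch below uses a deliberately contrived toy corpus (an assumption, built so the product comes out exactly constant); on real text the product is only roughly constant.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return rows of (rank, word, frequency, rank * frequency)."""
    ranked = Counter(tokens).most_common()  # most frequent word first
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


# Contrived corpus: frequencies 6, 3, 2 at ranks 1, 2, 3.
tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2
table = rank_frequency(tokens)
for row in table:
    print(row)
# (1, 'the', 6, 6)
# (2, 'of', 3, 6)
# (3, 'and', 2, 6)
```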
Item Normalization
• The next step in finalizing the processing tokens is
identification of any specific word characteristics
(features).
• The feature helps systems distinguish between
different meanings of a given word. This includes a
morphological examination of the part of speech of
the processing token.
• For example, for the word "plane":
– As an adjective: "level or flat"
– As a noun: "aircraft or facet"
– As a verb: "the act of smoothing or evening"
• Another example of characterization is deciding whether
upper case should be preserved.
Item Normalization
• Once the potential processing token has been identified and characterized,
most systems apply stemming algorithms to normalize the token to a
standard semantic representation.
– Stemming is a process where words are reduced to their base or root
form, which helps in grouping together variants of words that have the
same meaning
• By reducing words to their stems, stemming helps improve the recall of
search queries, sometimes at the cost of precision. For example, searching for "run"
might retrieve documents containing "running" or "runner" because they share the same root.
• Stemming standardizes tokens to a common form, reducing the number of
unique terms the system needs to handle. This can simplify indexing and
searching processes.
• By reducing the number of unique terms, stemming can decrease
computational overhead and memory usage associated with processing
and storing text data
• Once the processing tokens have been finalized based upon the stemming
algorithm, they are used as updates to the searchable data structure.
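A deliberately crude suffix-stripping sketch shows how variants collapse onto one stem. This is not the Porter algorithm or any stemmer named in the slides; the suffix list, the minimum stem length, and the consonant-undoubling rule are all simplifying assumptions.

```python
# Assumption: a tiny suffix list, tried in this order.
SUFFIXES = ("ing", "er", "ed", "s")


def simple_stem(word):
    """Strip one known suffix, keeping a stem of at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a final consonant: "runn" -> "run".
            # Doubled l, s and z are left alone ("pass" stays "pass").
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


for word in ["run", "running", "runner", "Runs"]:
    print(word, "->", simple_stem(word))
# run -> run
# running -> run
# runner -> run
# Runs -> run
```

All four variants map to the single stem "run", so a query on any one of them can match items containing the others. A rule set this small also makes mistakes on many words, which is one reason real stemmers are far more elaborate.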
Selective Dissemination of Information
• The Selective Dissemination (spreading) of
Information (Mail) Process provides the
capability to dynamically compare newly
received items in the information system
against standing statements of interest of
users and deliver the item to those users
whose statement of interest matches the
contents of the item.
• The Mail process is composed of the search
process, user statements of interest (Profiles)
and user mail files.
Selective Dissemination of Information
• Search Process: This component compares
each incoming item against every user's
profile.
• User Profiles (Statements of Interest):
Profiles contain broad search statements and
specify which mail files should receive
matching documents.
• User Mail Files: These files store items that
match the user profiles and are typically
viewed in the order of receipt.
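The three components can be sketched together. Everything named here (the profile layout, the `disseminate` function, the mail file names, and the rule that one shared term counts as a match) is an assumption for illustration; real profiles use much richer search statements.

```python
def disseminate(item_id, item_tokens, profiles, mail_files):
    """Compare one incoming item against every profile.

    The item id is appended to the mail file of each profile that shares
    at least one term with the item (assumed matching rule).
    """
    tokens = {token.lower() for token in item_tokens}
    for profile in profiles.values():
        if tokens & profile["terms"]:
            mail_files.setdefault(profile["mail_file"], []).append(item_id)


# Hypothetical standing statements of interest.
profiles = {
    "alice": {"terms": {"emissions", "sulfur"}, "mail_file": "alice/environment"},
    "bob": {"terms": {"stemming", "indexing"}, "mail_file": "bob/ir"},
}

mail_files = {}
disseminate("item-1", ["Automobile", "emissions", "rise"], profiles, mail_files)
disseminate("item-2", ["Stemming", "algorithms"], profiles, mail_files)
disseminate("item-3", ["Sulfur", "dioxide"], profiles, mail_files)
print(mail_files)
# {'alice/environment': ['item-1', 'item-3'], 'bob/ir': ['item-2']}
```

Because items are appended as they arrive, each mail file naturally lists them in order of receipt.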
Selective Dissemination of Information
• User search profiles (push system) differ from
ad hoc queries (pull system) in that they contain
significantly more search terms (10 to 100 times
more) and cover a wider range of interests.
• These profiles define all the areas in which a user is
interested, whereas an ad hoc query is frequently
focused on answering a specific question.
• Studies have shown that
automatically expanded user profiles perform
significantly better than human-generated profiles.
Document Database Search
• The Document Database Search Process provides the
capability for a query to search against all items
received by the system.
• The Document Database Search process is composed
of the search process, user entered queries (typically
ad hoc queries) and the document database which
contains all items that have been received,
processed and stored by the system.
• Any search for information that has already been
processed into the system can be considered a
“retrospective” search for information.
Document Database Search
• Searches can cover a wide range of time periods, not
necessarily limited to recent items.
• The document database typically holds a vast
amount of data spanning extensive periods,
sometimes containing hundreds of millions of items.
• Items in the Document Database are usually not
edited once received, reflecting their original state
when processed.
– Due to the diminishing value of information over time,
databases are often partitioned by time to facilitate
archiving and efficient retrieval.
Index Database Search
• A user may want to save an item of interest for
future reference. This is accomplished via the Index
Process.
• The user can logically store an item in a file along
with additional index terms and descriptive text the
user wants to associate with the item.
• A good analogy to an index file is the card catalog in
a library
• The index database search Process provides the
capability to create indexes and search them.
Index Database Search
• The system also provides the capability to
search the index and then search the items
referenced by the index records that satisfied
the index portion of the query. This Process is
called a Combined File Search.
• In an ideal system the index record could
reference portions of items versus the total
item
Index Database Search
• There are two classes of index files: Public and Private.
– Private Index Files
• Every user can have one or more Private Index
Files, leading to a very large number of files.
• Each references only a small subset of the total
number of items in the Document Database.
• They typically have very limited access lists.
– Public Index Files
• Maintained by professional library services personnel
and typically index every item in the Document
Database.
• Have access lists that allow anyone to search and
retrieve data.
Index Database Search
• To assist the users in generating indexes,
especially the professional indexers, the
system provides a process called Automatic
File Build.
– This capability processes selected incoming
documents and automatically determines potential
indexing for the item.
• The capability to create Private and Public
Index Files is frequently implemented via a
structured DBMS.
Multimedia Database Search
• From a system viewpoint, the multimedia data
is an addition to the Information Retrieval
System's current structures rather than
conceptually its own data structure.
– It will reside almost entirely in the area described
as the Document Database.
• The specialized indexes to allow search of the
multi-media (e.g., vectors representing video
and still images, text created by audio
transcription) will be augmented search
structures
Multimedia Database Search
• The correlation between the multimedia and
the textual domains will be either via Time or
Positional synchronization.
– An example of time synchronization is transcribed text
from audio or composite video sources.
– Positional synchronization is where the
multimedia is localized by a hyperlink in a textual
item.
Relationship to Database Management Systems
• There are two major categories of systems
available to process items: Information
Retrieval Systems and Data Base Management
Systems
– IRS are designed to handle "information" items,
which are characterized as fuzzy text. "Fuzzy" here
refers to the lack of strict standards or controls on
how the information is created.
– DBMS are optimized for handling "structured"
data, which consists of well-defined facts typically
organized into tables.
Relationship to Database Management Systems
• IRS deal with diverse vocabulary and styles
since creators of information items may use
different terminology and approaches.
• Users of IRS need to consider various search
term possibilities due to the ambiguity and
diversity of language used in information
items.
• Each attribute within a table in DBMS has a
semantic description that clearly defines its
meaning, such as "employee name" or
"employee salary."
Relationship to Database Management Systems
• The search results from IRS are often
presented in relevance-ranked order, using
features like relevance feedback to help refine
searches.
• Queries in DBMS yield specific results in a
tabular format, making it easier for users to
retrieve desired information.
• DBMS software is sometimes used to store “information” as well.
– DBMS lack the ranking and relevance feedback
features found in IRS, which are critical for
handling fuzzy, unstructured information
effectively.
Relationship to Database Management Systems
• It is also possible to have structured data used in an information system
(such as TOPIC).
– When this occurs, the user needs to be extremely resourceful to
get the system to deliver the management data and reports
that are easily accessible in a database management system.
• To exploit the strengths of both, DBMS and IRS capabilities must be
integrated.
– One of the first commercial databases to integrate the two systems
into a single view is the INQUIRE DBMS.
– A more current example is the ORACLE DBMS, which now offers an
embedded capability called CONVECTIS, an information
retrieval system that uses a comprehensive thesaurus to
provide the basis for generating “themes” for a particular item.
– The INFORMIX DBMS has the ability to link to RetrievalWare to
provide integration of structured data and information along with
functions associated with Information Retrieval Systems.
Digital Libraries and Data Warehouses
• Two other systems frequently described in the
context of information retrieval are Digital
Libraries and Data Warehouses (or DataMarts).
• There is significant overlap between these two
systems and an Information Storage and
Retrieval System.
• All three systems are repositories of
information and their primary goal is to satisfy
user information needs.
Digital Libraries and Data Warehouses
• Information Retrieval Systems (IRS) are designed to store and retrieve
information based on user queries. They have evolved significantly from
traditional library card catalogs to digital systems capable of handling vast
amounts of electronic data. Their primary goal is to satisfy user information
needs efficiently.
• Digital Libraries are collections of digital materials accessible via the Internet.
They aim to provide electronic access to information previously stored in
physical formats like books, journals, and other media. The transition to digital
formats allows for easier searchability and access, although challenges such as
indexing standards and the preservation of digital information over time
remain.
• Data Warehouses are primarily used in the commercial sector to manage and
analyze large volumes of structured data. They serve as central repositories
that integrate data from different sources within an organization. Data
warehouses facilitate decision-making by providing tools for data
manipulation, analysis, and reporting. Data mining, a process within data
warehouses, involves discovering patterns and relationships in data that were
not initially apparent.
Digital Libraries and Data Warehouses
• While there is overlap among Information Retrieval Systems, Digital Libraries,
and Data Warehouses in terms of their goal to store and retrieve information,
they differ in scope and focus:
– Information Retrieval Systems focus on textual data and user queries.
• Information Retrieval Systems continue to evolve with advancements in search
algorithms and handling of diverse data types
– Digital Libraries manage digital collections of various media types and
emphasize electronic access and preservation.
• Indexing is one of the critical disciplines in library science and significant
effort has gone into the establishment of indexing and cataloging
standards. Migration of many of the library products to a digital format
introduces both opportunities and challenges:
– Converting existing hardcopy materials into digital formats while
preserving their integrity and ensuring accessibility.
– Addressing issues related to copyright and intellectual property
rights, especially in a digital context where laws and regulations vary
globally.
– Data Warehouses primarily handle structured data for decision support and
often include data mining capabilities to uncover hidden patterns.
Information Retrieval System Capabilities
• Search Capabilities
• Browse Capabilities
• Miscellaneous Capabilities
Search Capabilities
• The objective of the search capability is to
allow for a mapping between a user’s
specified need and the items in the
information database that will answer that
need.
– The search statement can be composed of natural
language text and/or query terms with Boolean
logic indicators.
• Some systems allow users to indicate the
importance of search terms using a numeric
value between 0.0 and 1.0.
Search Capabilities
• This weighting helps prioritize certain terms over others when
retrieving relevant items.
• For example, in the query
– "Find articles that discuss automobile emissions(.9) or sulfur
dioxide(.3) on the farming industry,"
• The system would understand that automobile emissions are
more critical than discussions on sulfur dioxide for ranking
purposes.
• The search statement may apply to the complete item or contain
additional parameters.
– Parameters can restrict the search to specific parts or zones
within an item, which helps improve relevance by avoiding
retrieval of non-relevant portions. This can be particularly
useful in larger documents where searching within specific
sections (passage searching) enhances precision.
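One simple way term weights could feed into ranking is sketched below. The scoring rule (summing the weights of the query terms present in an item) is an assumption chosen for clarity, not the method of any particular system, and the query is reduced to single-word terms.

```python
def score(item_tokens, weighted_terms):
    """Sum the weights of the query terms that occur in the item."""
    tokens = {token.lower() for token in item_tokens}
    return sum(weight for term, weight in weighted_terms.items() if term in tokens)


# Weights between 0.0 and 1.0, following the sample query above.
query = {"emissions": 0.9, "sulfur": 0.3}

# Hypothetical items.
items = {
    "A": "automobile emissions and the farming industry".split(),
    "B": "sulfur dioxide and the farming industry".split(),
    "C": "emissions of sulfur dioxide".split(),
}

ranked = sorted(items, key=lambda item_id: score(items[item_id], query), reverse=True)
print(ranked)  # ['C', 'A', 'B']
```

Item C matches both terms and ranks first; item A outranks item B because the emissions term carries the higher weight.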
Search Capabilities
• Depending on the algorithms used in a system,
many different functions are associated with the
system’s understanding of the search statement.
• The functions define the relationships between
the terms in the search statement (e.g., Boolean,
Natural Language, Proximity, Contiguous Word Phrases, and
Fuzzy Searches) and the interpretation of a particular
word (e.g., Term Masking, Numeric and Date Range, Contiguous
Word Phrases, and Concept/Thesaurus expansion).
• The terms "processing token," "word," and "term" are
used interchangeably or contextually to refer to units
that are searchable within documents.
Boolean Logic
• It allows a user to logically relate multiple concepts
together to define what information is needed.
• Boolean functions apply to processing tokens
identified anywhere within an item.
• Typical Boolean operators are AND, OR and NOT,
which are implemented using set intersection, set union
and set difference procedures.
• Placing portions of the search statement in
parentheses overtly specifies the order
of Boolean operations (i.e., nesting function).
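The mapping of AND, OR and NOT onto set operations can be sketched in a few lines. The inverted index below (INDEX) and the helper postings are hypothetical and exist only for illustration.

```python
# Hypothetical inverted index: processing token -> set of item ids.
INDEX = {
    "computer": {1, 2, 3},
    "design": {2, 3, 4},
    "mainframe": {3},
}


def postings(term):
    """Items containing the term (empty set if the term is unknown)."""
    return INDEX.get(term, set())


# COMPUTER AND DESIGN -> set intersection
and_hits = postings("computer") & postings("design")

# COMPUTER OR DESIGN -> set union
or_hits = postings("computer") | postings("design")

# (COMPUTER AND DESIGN) NOT MAINFRAME -> set difference
not_hits = and_hits - postings("mainframe")

print(and_hits, or_hits, not_hits)
```

The parentheses in the last query correspond to computing and_hits first and only then removing the MAINFRAME items.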
Boolean Logic
• If parentheses are not used, the system
follows a default precedence ordering of
operations (e.g., typically NOT then AND then
OR).
• Queries are processed Left to Right unless
parentheses are included.
• A special type of Boolean search is called “M
of N” logic.
– The user lists a set of possible search terms and
identifies, as acceptable, any item that contains a
subset of the terms.
Boolean Logic
• For example, “Find any item containing
any two of the following terms: ‘AA,’ ‘BB,’
‘CC.’” This can be expanded into a Boolean
search that performs an AND between all
combinations of two terms and ORs the
results together: ((AA AND BB) OR (AA AND CC)
OR (BB AND CC)).
• Most information retrieval systems allow
Boolean operations and natural language
interfaces.
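The expansion of “M of N” logic into ANDed combinations that are ORed together can be sketched as below. The index contents and the function name m_of_n are hypothetical, and the sketch assumes 1 <= m <= number of terms.

```python
from itertools import combinations

# Hypothetical inverted index: processing token -> set of item ids.
INDEX = {"AA": {1, 2}, "BB": {2, 3}, "CC": {3, 4}}


def m_of_n(index, terms, m):
    """Items containing at least m of the listed terms (assumes m >= 1)."""
    hits = set()
    for combo in combinations(terms, m):
        # AND the postings of one combination of m terms ...
        matched = set.intersection(*(index.get(t, set()) for t in combo))
        # ... and OR the result into the running hit set.
        hits |= matched
    return hits


print(m_of_n(INDEX, ["AA", "BB", "CC"], 2))
```

With m = 2 the loop evaluates exactly the three pairs from the example: (AA AND BB), (AA AND CC) and (BB AND CC).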
Boolean Logic
• Some search examples and their meanings are as
follows:
Proximity
• Proximity is used to restrict the distance allowed
within an item between two search terms.
• The semantic concept is that the closer two terms
are found in a text, the more likely they are related
in the description of a particular concept.
• Proximity is used to increase the precision of a
search.
– If the terms COMPUTER and DESIGN are found
within a few words of each other then the item
is more likely to be discussing the design of
computers than if the words are paragraphs
apart.
Proximity
• The typical format for proximity is:
– TERM1 within “m” “units” of TERM2
• The distance operator “m” is an integer number and
units are in Characters, Words, Sentences, or
Paragraphs.
• For items containing imbedded images (e.g., digital
photographs), text between the images could help in
precision when the objective is in locating a certain
image.
• The proximity relationship contains a direction
operator indicating the direction (before or after)
that the second term must be found within the
number of units specified. Default is either direction.
Proximity
• A special case of the Proximity operator is the
Adjacent (ADJ) operator that normally has a
distance operator of one and a forward only
direction.
• Another special case is where the distance is
set to zero meaning within the same semantic
unit.
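A minimal sketch of the proximity test, assuming the unit is words and that each term's word offsets within an item are already known. The function names positions, within and adj are hypothetical, and the distance-zero case (same sentence or paragraph) is not covered here.

```python
def positions(tokens, term):
    """Word offsets at which the term occurs in the item."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within(positions1, positions2, m, direction="either"):
    """True if some occurrence of TERM2 lies within m words of TERM1.

    direction: "after" (TERM2 follows TERM1), "before", or "either".
    """
    for p1 in positions1:
        for p2 in positions2:
            gap = p2 - p1
            if direction == "after" and 0 < gap <= m:
                return True
            if direction == "before" and 0 < -gap <= m:
                return True
            if direction == "either" and 0 < abs(gap) <= m:
                return True
    return False


def adj(positions1, positions2):
    """ADJ: a distance of one word, forward direction only."""
    return within(positions1, positions2, 1, direction="after")


tokens = "the design of a computer and computer design tools".split()
computer = positions(tokens, "computer")
design = positions(tokens, "design")
print(within(design, computer, 3), adj(computer, design), adj(design, computer))
```

Here "computer" ADJ "design" holds because of the phrase "computer design", while "design" ADJ "computer" does not, which shows the effect of the forward-only direction.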
Proximity
• Some proximity search statement examples and their
meanings are given in following figure:
Contiguous Word Phrases
• A Contiguous Word Phrase (CWP) can be used as
a unique search operator in addition to a query
word.
• A Contiguous Word Phrase is two or more words
that are treated as a single semantic unit.
• An example of a CWP is “United States of
America.”
• It is four words that specify a search term
representing a single specific semantic concept (a
country).
Contiguous Word Phrases
• A contiguous word phrase also acts like a special search
operator that is similar to the proximity (Adjacency)
operator but allows for additional specificity.
• If two terms are specified, CWP and Proximity operator
are identical.
• For contiguous word phrases of more than two terms
the only way of creating an equivalent search statement
using proximity and Boolean operators is via nested
Adjacencies which are not found in most commercial
systems.
– This is because Proximity and Boolean operators are binary
operators but contiguous word phrases are an “N”ary
operator where “N” is the number of words in the CWP.
Contiguous Word Phrases
• Contiguous Word Phrases are called Literal
Strings in WAIS (Wide Area Information
Servers) and Exact Phrases in RetrievalWare.
• In WAIS multiple Adjacency (ADJ) operators
are used to define a Literal String (e.g.,
“United” ADJ “States” ADJ “of” ADJ
“America”).
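Treating a contiguous word phrase as a single N-ary operator amounts to checking that all N words appear side by side and in order. The sketch below assumes the item is already tokenized and lowercased; contains_phrase is a hypothetical name.

```python
def contains_phrase(tokens, phrase):
    """True if the words of `phrase` occur contiguously, in order, in `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


phrase = ["united", "states", "of", "america"]
item_a = "the united states of america signed the treaty".split()
item_b = "states of the united kingdom and america".split()
print(contains_phrase(item_a, phrase), contains_phrase(item_b, phrase))
```

The second item contains all four words but not as one contiguous unit, so it does not match.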
Fuzzy Searches
• Fuzzy searching is a technique used to find
words that are similar to the search term,
even if they are not an exact match.
• It helps in locating terms with minor spelling
errors or variations.
• Fuzzy searches include variations of a search
term based on spelling similarity. For example,
searching for “computer” might also return
results like “compiter,” “conputer,” and
“computter.”
Fuzzy Searches
• This approach increases recall (finding more
relevant results) but decreases precision (risk of
retrieving irrelevant results). This is because it
might include terms that are close but not
exactly what you were searching for.
• In systems that rank search results, words that
resemble the search term more closely might be
given higher ranks. For instance, “computer” will
be ranked higher than “commuter,” especially if
the latter is an entirely different word with a
different meaning.
Fuzzy Searches
• Fuzzy searches can expand a query by
including other terms with similar spellings.
The expansion might be limited by specifying
the maximum number of similar terms.
• The concept of “closest” (Heuristic Function) is
used to determine which terms to include.
This function can vary depending on the
system.
• Fuzzy searching has its maximum utilization in
systems that accept items that have been
Optical Character Read (OCRed).
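One possible “closest” function is edit distance, that is, the number of single-character insertions, deletions or substitutions needed to turn one word into another. The slides do not say which heuristic a given system uses, so the choice of edit distance, the names edit_distance and fuzzy_expand, and the limits max_distance and max_terms are assumptions of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def fuzzy_expand(term, vocabulary, max_distance=1, max_terms=5):
    """Closest vocabulary words first, capped at max_terms."""
    scored = sorted((edit_distance(term, w), w) for w in vocabulary)
    return [w for d, w in scored if d <= max_distance][:max_terms]


vocab = ["computer", "compiter", "conputer", "computter", "commuter", "design"]
expanded = fuzzy_expand("computer", vocab)
print(expanded)
```

Note that "commuter" is also one edit away from "computer", so it is pulled into the expansion. This is the loss of precision that the slides warn about.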
Fuzzy Searches
• In the OCR process a hardcopy item is scanned into a binary
image.
• The OCR process is a pattern recognition process that segments
the scanned in image into meaningful subregions.
• The OCR process will then determine the character and translate
it to an internal computer encoding (e.g., ASCII).
• OCR can introduce errors due to imperfections in the scanned
image or limitations in character recognition accuracy.
• Typically, OCR systems achieve high accuracy (90-99%), but errors
are still common, especially if the quality of the original document
is poor.
• Role of Fuzzy Searching in OCR: Fuzzy searching helps in retrieving
information from OCR-processed text by compensating for
character recognition errors. It allows users to find relevant
documents despite these errors.
Term Masking
• Term masking is a technique used in search
systems to handle variations in search terms by
allowing some flexibility in matching.
• Instead of requiring an exact match for a search
query, term masking lets you specify a pattern
that can match multiple possible terms.
• This is particularly useful when a search system
doesn't use advanced stemming algorithms or
when it uses only simple stemming.
Term Masking
• Types of Term Masking:
– Fixed Length Masking
– Variable Length Masking
• Fixed Length Masking: This involves masking a specific position
in a word, allowing any character in that position or the
absence of a character there.
• For instance, if you're searching for a term where the third
character is masked, it would match words regardless of what
character is in that third position or even if there is no character
in that position.
– Example: Suppose you want to find words where the third
character is unspecified. The query "HE$LO" could match
"HELLO", "HEALO", or "HELO", depending on how the system
is set up. This type of masking is less common and often not
critical for most systems.
Term Masking

• Variable Length Masking: Variable length
“don’t cares” allow masking of any number
of characters within a processing token.
• The masking may be in the front, at the end,
at both front and end, or imbedded.
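Both kinds of masking can be sketched by translating a mask into a regular expression. The sketch assumes "$" is the fixed-length mask (one character or none, as in the "HE$LO" example) and "*" is the variable-length mask; the function names mask_to_regex and expand_mask are hypothetical.

```python
import re


def mask_to_regex(mask):
    """Translate a masked term into a compiled, case-insensitive pattern."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")   # variable length: any number of characters
        elif ch == "$":
            parts.append(".?")   # fixed length: one character or none
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def expand_mask(mask, vocabulary):
    """Vocabulary words that match the whole mask."""
    pattern = mask_to_regex(mask)
    return [word for word in vocabulary if pattern.fullmatch(word)]


words = ["computer", "minicomputer", "compute", "commuter"]
print(expand_mask("HE$LO", ["HELLO", "HEALO", "HELO", "HELLLO"]))
print(expand_mask("comput*", words))    # masking at the end
print(expand_mask("*computer", words))  # masking at the front
print(expand_mask("*comput*", words))   # masking at both ends
```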
Term Masking
Numeric and Date Ranges
• Term masking is useful when applied to
words, but does not work for finding ranges of
numbers or numeric dates.
– Using a term like "125*" will only find numbers
that start with "125" (e.g., "125," "1254,"
"12556"). It will not find numbers like "130" or
"120" because they do not start with "125."
– Similarly, term masking won’t help in finding dates
in a range, such as from "4/2/93" to "5/2/95."
Term masking can’t interpret or process dates as
ranges.
Numeric and Date Ranges
• To handle numeric and date ranges effectively,
systems use normalization processes to
categorize and interpret data correctly.
– During normalization, data is processed and
converted into a format that makes it easier to
perform operations like sorting or range queries. For
example, words might be identified and classified as
numbers or dates.
• Once data is normalized, the system can apply
specialized processing for numbers and dates.
– This makes it possible to run more complex searches
than term masking alone.
Numeric and Date Ranges
• With numeric or date normalization, users can input
queries that specify ranges or conditions, such as:
– Numeric Ranges:
• "125-425": Finds numbers between 125 and 425, inclusive.
• ">125": Finds numbers greater than 125.
• "<233": Finds numbers less than 233.
– Date Ranges:
• "4/2/93-5/2/95": Finds dates from April 2, 1993, to May 2, 1995.
• ">4/2/93": Finds dates after April 2, 1993.
• "<5/2/95": Finds dates before May 2, 1995.
• These types of queries are processed differently than
simple term-based searches, allowing for more precise
and useful results when dealing with numbers and
dates.
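The range queries listed above can be sketched as predicates over normalized values. The function names range_predicate and parse_date are hypothetical, dates are assumed to be month/day/two-digit-year as in the examples, and negative numbers are not handled because "-" is used as the range separator.

```python
from datetime import datetime


def parse_date(text):
    """Normalize a date such as "4/2/93" (month/day/two-digit year)."""
    return datetime.strptime(text, "%m/%d/%y").date()


def range_predicate(query, parse=int):
    """Turn ">125", "<233" or "125-425" into a test on normalized values."""
    if query.startswith(">"):
        low = parse(query[1:])
        return lambda value: value > low
    if query.startswith("<"):
        high = parse(query[1:])
        return lambda value: value < high
    low_text, high_text = query.split("-", 1)
    low, high = parse(low_text), parse(high_text)
    return lambda value: low <= value <= high


in_range = range_predicate("125-425")
print([n for n in (120, 125, 130, 425, 426) if in_range(n)])

in_dates = range_predicate("4/2/93-5/2/95", parse_date)
print(in_dates(parse_date("1/15/94")), in_dates(parse_date("6/1/95")))
```

Unlike the mask "125*", the numeric range matches 130 because values are compared as numbers and not as character strings.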
Concept/Thesaurus Expansion
• A Thesaurus helps in expanding search terms
by finding synonyms or related words.
• A thesaurus is typically a one-level or two-
level expansion of a term to other terms that
are similar in meaning
• A Concept Class is a tree structure that
expands each meaning of a word into
potential concepts that are related to the
initial term
Concept/Thesaurus Expansion
• An example of Thesaurus and Concept Class
structures
Concept/Thesaurus Expansion
• Thesauri are of two types:
– Semantic Thesaurus: Contains words and their semantically
similar counterparts (words with similar meanings). For
instance, if you search for "happy," a semantic thesaurus might
suggest "joyful" or "content."
– Statistical Thesaurus: Uses statistical methods to identify words
that frequently occur together in a given dataset. This type is
generated based on data patterns rather than predefined
semantic relationships. For example, if "medical" and "doctor"
frequently appear together in documents, the system identifies
a strong statistical relationship between these terms.
• Thesauri help in broadening the search by including related
terms, which can improve the recall of search results. However,
this can sometimes reduce precision if unrelated terms are
included.
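A statistical thesaurus can be approximated by counting how often pairs of terms occur in the same item. The sketch below is one simple way to do that; the sample items, the function names and the min_count cutoff are assumptions, and real systems typically use more refined association measures.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(items):
    """Count, for each pair of terms, the number of items containing both."""
    counts = Counter()
    for tokens in items:
        # sorted(set(...)) gives each unordered pair one canonical form.
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return counts


def related_terms(term, counts, min_count=2):
    """Terms that co-occur with `term` in at least min_count items."""
    related = []
    for (a, b), n in counts.items():
        if n >= min_count and term in (a, b):
            related.append(b if a == term else a)
    return sorted(related)


items = [
    ["medical", "doctor", "hospital"],
    ["medical", "doctor"],
    ["doctor", "golf"],
]
counts = cooccurrence_counts(items)
print(related_terms("medical", counts))
```

Expanding a query on "medical" with the terms returned here broadens the search in the way the slides describe, at the risk of pulling in items that are only loosely related.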
Concept/Thesaurus Expansion
• In some systems, users can manually select and add
terms from thesauri or concept trees to refine their
searches. This makes searches more specific to the
demands of the user.
• Both thesauri and concept class databases are
powerful tools for enhancing search capabilities.
• Thesauri help in expanding terms based on semantic
similarities, while concept classes offer a structured
approach to exploring and relating concepts. Both
approaches have their strengths and are often used
to balance recall and precision in search queries.
Natural Language Queries
• Instead of using specific search terms and Boolean operators
(like AND, OR, NOT) to search for information, you can type a
full sentence or a prose statement (the ordinary language people use in
speaking or writing) describing what you’re looking for.
– The longer the prose, the more accurate the results returned.
• One major challenge is handling negation (like "Do not
include..."). The system needs to correctly understand and
exclude information that doesn't meet the criteria.
– Find for me all the items that discuss oil reserves and current attempts
to find new oil reserves. Include any items that discuss the international
financial aspects of the oil production process. Do not include items
about the oil industry in the United States.
• Systems are designed to find items similar to the ones
described in the query, but excluding certain items that match
part of the query is more complex.
Natural Language Queries
• The system translates this natural language
input into a format it can process.
• The difficulty lies in correctly interpreting and
applying the negation.
• When this capability has been made available,
users have a tendency to enter sentence
fragments that reflect their search need
rather than complete sentences
• This is predictable because the users want to
minimize use of the human resource (their
time).
Natural Language Queries
• The likely input for the above example is:
– oil reserves and attempts to find new oil reserves,
international financial aspects of oil production,
not United States oil industry
• Using the same search statement, a Boolean
query attempting to find the same
information might appear:
– (“locate” AND “new” AND “oil reserves”) OR
(“international” AND “financ*” AND “oil
production”) NOT (“oil industry” AND “United
States”)
Natural Language Queries
• Associated with natural language queries is a function
called relevance feedback. The natural language does
not have to be input by the user but just identified by
the user
– Relevance feedback feature allows users to refine searches
based on items they find relevant, even if they don’t input
full sentences. Users can select relevant items or text
segments, and the system adjusts its search accordingly.
• To accommodate the negation function and provide
users with a transition to the natural language
systems, most commercial systems have a user
interface that provides both a natural language and
Boolean logic capability.
Natural Language Queries

• Negation is handled by the Boolean portion of
a search.
• Natural language interfaces improve the
recall of systems with a decrease in precision
when negation is required.
Multimedia Queries
• The user interface becomes far more complex
once multimedia items are available.
• As multimedia content (e.g., video, audio, images)
becomes more common, searching becomes more
complex compared to searching just text.
• Users need to specify search criteria not only for
text but also for other types of content.
• One current focus is on how still images can be used
in searches. For instance, users can search for
images within a multimedia item or use a still image
to locate specific scenes in a video.
Multimedia Queries
• Videos are analyzed by breaking them into
scenes, with each scene represented by a
series of images. This helps in indexing and
searching for specific scenes or moments in
the video.
• Static text within videos can be extracted
using Optical Character Recognition (OCR)
technology.
– This allows text within the video to be searchable.
Multimedia Queries
• Searching for specific audio segments directly is
challenging because it would require simulating the
audio to find a match.
– Instead, audio is transcribed into text, which can then be
searched.
• However, transcription accuracy varies: it’s more
accurate for controlled content (e.g., news broadcasts)
and less accurate for conversational speech.
• OCR errors will usually create a text string that is not a
valid word.
• In ASR (Automatic Speech Recognition), all errors are
other valid words, since ASR selects entries only from a
dictionary of words.
Multimedia Queries
• Audio analysis can also identify specific
speakers, which adds another layer to
searching. This is relatively accurate and can
be useful in multimedia searches.
• To find relevant content, different types of
information (text, images, audio) are
correlated based on time or location. For
example, in a video news program, all related
information (scene changes, transcribed
audio, closed captioning, and user-assigned
index terms) is synchronized by time.
Multimedia Queries
• An example query might be: “Find where Bill
Clinton is discussing Cuban refugees and there
is a picture of a boat.” The system would
locate segments where:
– Bill Clinton is speaking (using audio track and
speaker identification),
– The text streams (e.g., OCRed text, transcribed
audio, closed captioning) mention refugees and
Cuba,
– A scene change includes a boat.
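Because the tracks are synchronized by time, a query like this one reduces to intersecting time intervals. The sketch assumes each track has already produced a list of (start, end) matches in seconds; the function names and the sample values are hypothetical.

```python
def overlap(a, b):
    """Intersection of two (start, end) intervals in seconds, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def correlate(*tracks):
    """Time spans during which every track has a matching segment."""
    spans = list(tracks[0])
    for track in tracks[1:]:
        narrowed = []
        for span in spans:
            for segment in track:
                hit = overlap(span, segment)
                if hit is not None:
                    narrowed.append(hit)
        spans = narrowed
    return spans


speaker_segments = [(10, 40), (100, 130)]  # speaker identification matches
text_segments = [(20, 35), (90, 95)]       # text streams mention the topic
boat_scenes = [(30, 50)]                   # scenes whose image shows a boat

print(correlate(speaker_segments, text_segments, boat_scenes))
```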
Browse Capabilities
Browse Capabilities
• Browse capabilities give the user the ability to review the results,
determine which items are of interest, and select those to be
displayed.
• There are two ways of displaying a summary of the items that are
associated with a query:
– Line item status:
• This displays search results in a simple list format, where
each item is shown with key information. This approach is
useful for quickly scanning through results to find relevant
items.
– Data visualization:
• This technique makes use of visual aids like graphs and
charts to display search results. It assists users in identifying
patterns and trends that may not be immediately seen from
a list.
Browse Capabilities
• Users can click on items or visual elements to get
more detailed information. They can move between
summary displays (list or visual) and detailed views
easily.
• If a search is very precise (i.e., it returns mostly
relevant results), the need
for advanced browsing features is reduced because
the results are already well-targeted.
• If a search yields many results that aren't relevant,
browse capabilities become more important. They
help users filter through the results to find the ones
that best match their needs.
Ranking
• In traditional Boolean systems, the status
display is a count of the number of items
found by the query.
• All these items satisfy every condition
specified in the query.
• The reasons why an item was selected can
easily be traced to and displayed (e.g., via
highlighting) in the retrieved items.
Ranking
• Modern systems use relevance scoring to rank
search results based on how well each item matches
the search query.
– Relevance score: The search system's estimate of
how closely the item satisfies the search statement.
• Relevance scores are typically normalized between
0.0 (not relevant) and 1.0 (highly relevant). A score
of 1.0 means the system is confident that the item is
very relevant to the search.
• Users can use these scores to decide when to stop
reviewing results, as lower scores suggest less
relevance.
Ranking
• Not all items will have a high relevance score. Systems
usually set a minimum threshold for relevance; items
with scores below this threshold are not displayed by
default, though users can adjust this threshold if
needed.
• In many circumstances Collaborative Filtering enhances
search results by incorporating feedback from users.
– For example, if many users rate certain items highly, the
system will prioritize these items in similar future queries.
• This method is used by sites like Amazon and
MovieFinder to recommend products based on user
behavior.
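The normalized scores and the minimum display threshold described on these slides can be sketched as follows. Dividing by the highest raw score is only one possible normalization and is an assumption here, as are the function names and the default threshold of 0.5.

```python
def normalize(raw_scores):
    """Scale raw scores into the range 0.0 to 1.0 by dividing by the maximum."""
    top = max(raw_scores.values(), default=0.0)
    if top == 0:
        return {item: 0.0 for item in raw_scores}
    return {item: s / top for item, s in raw_scores.items()}


def visible_hits(scores, threshold=0.5):
    """Items at or above the threshold, highest relevance first."""
    kept = [(item, s) for item, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


raw = {"a": 4.0, "b": 1.0, "c": 3.0}
scores = normalize(raw)
print(visible_hits(scores))        # default threshold
print(visible_hits(scores, 0.2))   # user lowers the threshold
```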
Ranking
• In summary displays, relevance might be shown as a single
line of information, often with a truncated title.
• Systems may use colors or categories (like High, Medium,
Low) to indicate relevance instead of showing exact numbers,
though color coding might not be accessible for colorblind
users.
• Some systems use graphical representations like two or
three-dimensional graphs to visualize the relationships
between items and their relevance. This helps users see how
items cluster by topic and navigate through results more
intuitively.
• Advanced systems use graphical displays to show how terms
contributed to an item's relevance. This helps users
understand and refine their search queries for better results.
Zoning
• When users are presented with search results, they want
to see just enough information to determine whether an
item is relevant to their needs.
– This helps in quickly filtering through results without being
overwhelmed by too much data.
• Given that screens have limited space, the system must
show just a few key details of each item.
• For instance, showing only the Title and Abstract of each
item might be sufficient for users to judge its relevance.
• This approach allows more items to be displayed
simultaneously, making efficient use of the user's
cognitive abilities to scan and assess multiple items
quickly.
Zoning
• Instead of retrieving and displaying entire
items, systems can focus on smaller sections
or passages within those items. This is
particularly useful in large documents or
datasets.
• Locality and passage-based search and
retrieval are concepts connected to zoning for
use in reducing the amount of information an
end user must review from a hit item.
Zoning
• Passage Retrieval: The item is divided into uniform-
sized passages (chunks of text) that are indexed.
When a user searches, the system retrieves and
displays these passages rather than the whole item.
This is beneficial when users are only interested in
specific parts of the content.
• Locality-Based Retrieval: This method allows for
dynamic boundaries around relevant sections.
Instead of fixed-size passages, the system identifies
and retrieves relevant local areas of the item based
on the search query.
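Passage retrieval with uniform-sized passages can be sketched as below. The passage size is counted in tokens, and the function names and the default size of 50 are assumptions; locality-based retrieval, with its dynamic boundaries, is not covered by this sketch.

```python
def split_passages(tokens, size):
    """Divide an item's tokens into uniform-sized passages."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def matching_passages(tokens, query_terms, size=50):
    """Passages that contain at least one of the query terms."""
    wanted = set(query_terms)
    return [p for p in split_passages(tokens, size) if wanted & set(p)]


tokens = "a b c d e f g h".split()
print(split_passages(tokens, 3))
print(matching_passages(tokens, ["e"], size=3))
```

Only the passage that contains the query term is returned, so the user reviews three tokens here and not the whole item.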
Zoning

• Users are typically given the option to expand
or view the full item if they find the displayed
passage or section relevant. This ensures that
while the initial display is concise, users can
still access the complete information if
needed.
Highlighting
• Highlighting is used to mark parts of a text
that are relevant to a user's search query.
• This helps users quickly locate important
sections within a document.
• Many systems start displaying the document
from the first highlighted section and allow
users to jump to subsequent highlighted
sections. This makes it easier to scan through
the relevant parts of the document.
Highlighting
• Some advanced systems, like the DCARS
system for RetrievalWare, can automatically
position the view at the most relevant passage
based on the query. This helps users start
their review at the most relevant part of the
document.
• In traditional Boolean search systems,
highlighting was quite effective because there
was a direct correlation between search terms
and highlighted terms in the documents.
Highlighting
• Modern systems often use Natural Language
Processing (NLP), automatic term expansion (e.g.,
through thesauri), and similarity ranking algorithms.
These systems might highlight terms that don’t
directly match the search query, making it less clear
why certain items were retrieved.
• When terms in search results don’t directly match
the query terms, it can be frustrating for users who
may not understand why a document was retrieved.
This makes it difficult to refine their search
effectively.
Highlighting
• In ranking systems, different terms contribute
to the ranking of a document to varying
extents. Therefore, highlighting might use
different colors or intensities to reflect the
importance of each term, but this can still be
confusing.
• Because of the limitations of highlighting in
complex search systems, information
visualization techniques are being explored as
a better way to help users understand search
results and formulate more accurate queries.
Miscellaneous Capabilities
Miscellaneous Capabilities
• These functions are designed to make it easier and faster for users to input
their search queries.
• They also help minimize the chances of users entering ineffective or poor
queries.
– Vocabulary Browse: This function helps users understand the different
tokens (words or terms) that the searchable database uses. It provides
insights into which terms are available and how often they appear in the
database. This can guide users in forming better queries by knowing what
terms are relevant and common.
– Iterative searching and search history log: This involves refining search
queries based on previous searches. For example, if an initial search
doesn’t yield the desired results, users can modify and refine their queries
iteratively to improve outcomes.
• Search History Logs: These logs keep track of previous searches made
by the user. They allow users to revisit past queries and their results,
which can be helpful for tracking progress or repeating useful searches
– Canned queries: These are pre-defined queries that users have created and
saved from previous sessions. They can be quickly accessed and reused,
saving time and effort for commonly performed searches.
Vocabulary Browse
• Vocabulary Browse provides knowledge on the
processing tokens available in the searchable
database.
• It provides the capability to display the words from the
document database in alphabetically sorted order.
• Logically, all unique words (processing tokens) in
the database are kept in sorted order along with
the count of the number of unique items in
which the word is found.
– This helps users understand the frequency and
distribution of words in the database.
Vocabulary Browse
• The user can enter a word or word fragment and the system
will begin to display the dictionary around the entered text.
• The table below shows what is seen in vocabulary browse if the
user enters “comput”.

• Vocabulary browse provides information on the exact words in the database.
• Impact of vocabulary browse on Search Results: By examining how often a term
appears in the database, users can make better decisions about which search
terms to use. For example, if “computer” appears too frequently, it might
generate too many results when used as an “OR” term (which broadens the
search). To get more focused results, users might need to combine “computer”
with additional terms using “AND” to narrow down the search.
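Displaying the dictionary around an entered fragment can be sketched with a binary search over the sorted word list. The vocabulary and its item counts below are invented for illustration and are not taken from the slides; browse and its context parameter are hypothetical names.

```python
from bisect import bisect_left

# Hypothetical dictionary: (processing token, number of items containing it),
# kept in alphabetical order. The counts are invented for illustration.
VOCABULARY = [
    ("compromise", 12),
    ("compulsion", 3),
    ("computation", 40),
    ("compute", 95),
    ("computer", 870),
    ("computerize", 7),
    ("concept", 150),
]


def browse(fragment, context=2):
    """Entries just before and after the place where the fragment would sort."""
    words = [word for word, _ in VOCABULARY]
    at = bisect_left(words, fragment)
    return VOCABULARY[max(0, at - context):at + context]


for word, item_count in browse("comput"):
    print(f"{word:<12} {item_count}")
```

In this made-up vocabulary "computer" occurs in far more items than its neighbors, which is the kind of signal that suggests combining it with other terms using AND.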
Iterative Search and Search History Log
• Frequently a search returns a Hit file (results) containing
many more items than the user wants to review.
• Instead of starting a completely new search, you can use the
results from your previous search to narrow down your
current search.
– This is done by creating a new query that is applied only to
the results of the previous search.
• This technique effectively adds additional constraints to the
original search, functioning similarly to combining the original
query with a new search criterion using an AND condition.
– For instance, if your initial search was for "cats," you might
refine it by adding "black" to search specifically for "black
cats."
• This process of refining the results of a previous search to
focus on the relevant items is called iterative search.
Iterative Search and Search History Log
• During a login session, a user could execute
many queries to locate the needed
information.
• To facilitate locating previous searches as
starting points for new searches, search
history logs are available.
• The search history log is the capability to
display all the previous searches that were
executed during the current session.
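Iterative searching and the search history log can be sketched together: each search is recorded, and a refinement is applied only to the hits of the previous search, which has the same effect as an AND. The class SearchSession and its methods are hypothetical, and refine assumes at least one search has already been run in the session.

```python
class SearchSession:
    """Sketch of one login session with a search history log."""

    def __init__(self, index):
        self.index = index   # processing token -> set of item ids
        self.history = []    # (query description, hit set) per search

    def search(self, term):
        """Run a new single-term search against the whole database."""
        hits = set(self.index.get(term, set()))
        self.history.append((term, hits))
        return hits

    def refine(self, term):
        """Apply a new term only to the hits of the previous search."""
        previous_query, previous_hits = self.history[-1]
        hits = previous_hits & self.index.get(term, set())
        self.history.append((f"({previous_query}) AND {term}", hits))
        return hits


session = SearchSession({"cats": {1, 2, 3}, "black": {2, 3, 9}})
first = session.search("cats")
second = session.refine("black")
for query, hits in session.history:
    print(query, sorted(hits))
```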
Canned Query
• The capability to name a query and store it to be retrieved
and executed during a later user session is called canned or
stored queries.
• Instead of creating a new search from scratch each time,
users can save a query and retrieve it for future use.
• This feature is useful for users who frequently search within
specific areas of interest or need to conduct similar searches
repeatedly.
• A canned query allows a user to create and refine a search that
focuses on the user’s general area of interest one time and
then retrieve it to add additional search criteria to retrieve
data that is currently needed.
• It helps save time and ensures consistency in search results.
Canned Query
• Example: Suppose a user is responsible for monitoring
European investments. Rather than creating a new
search query each time that includes geographic terms
for Europe and other specific criteria, the user can save a
canned query with the geographic terms already
included.
– This saved query can then be easily retrieved and
modified with additional criteria relevant to the
current information need, such as a particular
investment type or time period.
• Since the query is created once and reused many
times, it significantly reduces the effort required to
perform repeated searches.
Multimedia
• In traditional text-based searches, hits (results)
are typically listed one per line, ranked by
relevance. This allows users to scan through a
list efficiently.
• When dealing with multimedia items (e.g., a
combination of text, images, and audio),
displaying the results using a similar text-based
approach becomes problematic.
– The inclusion of additional content like images and
audio clips complicates how hits are presented.
Multimedia
• A common solution is to show a "thumbnail" image alongside
each hit.
• However, this often requires more screen space, which reduces
the number of hits visible at once.
– Consequently, users may see fewer items at a time, which can hinder their
ability to quickly select relevant results.
• Audio presents unique challenges compared to text. Users
process audio information at a slower rate than text.
• To mitigate this, transcribed text of the audio is often provided
alongside the audio files.
– This allows users to scan text quickly, providing context and aiding in
navigation through the audio content.
• Users also benefit from being able to annotate (comment) the
transcriptions. This allows them to add notes or mark
important sections, which enhances their ability to work with
multimedia data.
