IRS Unit-1
Information can be composed of text (including numeric and date data), images,
audio, video, and other multi-media objects.
Item:
The smallest complete unit that is processed and manipulated by the system is
called an item.
Example:
A book
A newspaper
A magazine
A video news program
The introduction and exponential growth of the Internet, along with its initial
WAIS (Wide Area Information Servers) capability and more recent advanced
search servers (e.g., INFOSEEK, EXCITE), have provided a new avenue for
access to terabytes of information.
The algorithms and techniques to optimize the processing and access of large
quantities of textual data were once the sole domain of segments of the
Government, a few industries, and academics.
Images across the Internet are searchable from many websites such as
WEBSEEK, DITTO.COM, and ALTAVISTA/IMAGES.
News organizations such as the BBC are processing the audio news they have
produced and are making historical audio news searchable via the audio-
transcribed versions of the news.
Major video organizations such as Disney are using video indexing to assist in
finding specific images in their previously produced videos to use in future
videos or incorporate in advertising.
It is easy to confuse the software that optimizes functional support of each type
of system with actual information or structured data that is being stored and
manipulated.
1.2 Objectives of Information Retrieval Systems
Search composition, search execution, and reading non-relevant items are all
aspects of information retrieval overhead.
Overhead can be expressed as the time a user spends in all of the steps leading to
reading an item containing the needed information.
Example:
query generation,
query execution,
scanning results of query to select items to read,
reading non-relevant items.
When a user decides to issue a search looking for information on a topic, the
total database is logically divided into four segments.
Relevant items
Non-relevant items
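Two possibilities with respect to each item:
it can be retrieved
it can be not retrieved
Combining relevance with retrieval gives the four segments: relevant retrieved,
relevant not retrieved, non-relevant retrieved, and non-relevant not retrieved.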
The two major measures commonly associated with information systems are
Precision
Recall
Precision = Number_Retrieved_Relevant / Number_Total_Retrieved
Recall = Number_Retrieved_Relevant / Number_Possible_Relevant
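The two measures above can be computed directly from sets of item identifiers. The following is a minimal sketch, assuming the set of relevant items is known in advance (which in practice it usually is not); the function name and the item ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved item ids and relevant item ids."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    # Number_Retrieved_Relevant: items that are both retrieved and relevant
    retrieved_relevant = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero
    precision = retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 items retrieved, 5 relevant items exist, 2 overlap
precision, recall = precision_recall(
    {"d1", "d2", "d3", "d4"},
    {"d2", "d4", "d7", "d8", "d9"},
)
print(precision, recall)  # 0.5 0.4
```

Here half of the retrieved items are relevant (precision 0.5), but only two of the five relevant items were found (recall 0.4).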
1.3 Functional Overview
1) Item Normalization:
The first step in any integrated system is to normalize the incoming items to
a standard format.
Normalization provides logical restructuring of the item.
Additional operations during item normalization that are needed to create a
searchable data structure are:
identification of processing tokens (e.g., words)
characterization of the tokens
stemming (e.g., removing word endings) of the tokens
The processing tokens and their characterization are used to define the
searchable text from the total received text.
Figure 1.5 shows the normalization process.
Standardizing
A system may have a single format for all items or allow multiple
formats.
Standardizing takes the different external formats of input data and translates
them into the formats acceptable to the system.
Example of standardization
Translation of foreign languages into Unicode.
Every language has a different internal binary encoding for the
characters in the language.
One standard encoding that covers English, French, Spanish, etc. is
ISO-Latin.
Systems determine words by dividing input symbols into 3 classes:
Valid word symbols
Inter-word symbols
Special processing symbols.
A word is defined as a contiguous set of word symbols bounded by
inter-word symbols.
Examples of word symbols are alphabetic characters and
numbers.
Inter-word symbols are non-searchable and should be carefully selected.
Examples of possible inter-word symbols are blanks, periods
and semicolons.
The exact definition of an inter-word symbol is dependent
upon the aspects of the language domain of the items to be
processed by the system.
For example, an apostrophe may be of little importance if only
used for the possessive case in English.
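The three symbol classes can be illustrated with a small tokenizer. This is only a sketch under stated assumptions: the choice of the apostrophe and hyphen as special processing symbols, and the rule that they are kept inside a word but stripped from its edges, are illustrative choices and not taken from any particular system.

```python
import string

# Valid word symbols: alphabetic characters and numbers
WORD_SYMBOLS = set(string.ascii_letters + string.digits)
# Special processing symbols (assumed set): kept inside a word, stripped at its edges
SPECIAL_SYMBOLS = {"'", "-"}


def tokenize(text):
    """Split text into words bounded by inter-word symbols."""
    tokens = []
    current = ""
    for ch in text:
        if ch in WORD_SYMBOLS or ch in SPECIAL_SYMBOLS:
            current += ch
        else:
            # Every other character is treated as an inter-word symbol
            if current:
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    # Remove special symbols at token edges and drop tokens left empty
    cleaned = [token.strip("'-") for token in tokens]
    return [token for token in cleaned if token]


tokens = tokenize("The pilot's plane; well-known, flat.")
print(tokens)  # ['The', "pilot's", 'plane', 'well-known', 'flat']
```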
Stop List/Algorithm
A Stop List or Stop algorithm is applied to the list of potential processing tokens.
The objective of the Stop function is to save system resources
by eliminating from the set of searchable processing tokens
those that have little value to the system.
Given the significant increase in available cheap memory,
storage and processing power, the need to apply the Stop
function to processing tokens is decreasing.
Examples of Stop algorithms are:
Stop all numbers greater than “999999”,
Stop any processing token that has numbers and
characters intermixed.
The algorithms are typically source specific, usually eliminating
unique item numbers that are frequently found in systems and have
no search value.
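The two example Stop algorithms above can be sketched as a single predicate. The function names are hypothetical, and a real system would tune such rules to its own sources.

```python
def is_stopped(token):
    """Return True if the token should be dropped by the example Stop algorithms."""
    # Rule 1: stop all numbers greater than 999999
    if token.isascii() and token.isdigit():
        return int(token) > 999999
    # Rule 2: stop any token that has numbers and characters intermixed
    has_digit = any(ch.isdigit() for ch in token)
    has_alpha = any(ch.isalpha() for ch in token)
    return has_digit and has_alpha


def apply_stop(tokens):
    """Keep only the tokens that survive the Stop function."""
    return [token for token in tokens if not is_stopped(token)]


kept = apply_stop(["plane", "1000000", "999999", "AB12CD", "1995"])
print(kept)  # ['plane', '999999', '1995']
```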
Characterize tokens
The next step in finalizing processing tokens is the identification of any
specific word characteristics.
The characteristic is used in systems to assist in disambiguation of a
particular word.
Thus, for a word such as “plane,” the system understands that it could mean
“level or flat” as an adjective, “aircraft or facet” as a noun, or “the act of
smoothing or evening” as a verb.
Another example of characterization is whether upper case should be preserved.
In most systems upper/lower case is not preserved to avoid the system
having to expand a term to cover the case where it is the first word in a
sentence.
But, for proper names, acronyms and organizations, the upper case
represents a completely different use of the processing token versus it being
found in the text.
“Pleasant Grant” should be recognized as a person’s name versus a
“pleasant grant” that provides funding.
Other characterizations that are typically treated separately from text are
numbers and dates.
Applying Stemming
Once the potential processing token has been identified and characterized,
most systems apply stemming algorithms to normalize the token to a
standard semantic representation.
The decision to perform stemming is a tradeoff between precision of a
search (i.e., finding exactly what the query specifies) versus standardization
to reduce system overhead in expanding a search term to similar token
representations with a potential increase in recall.
For example, the system must keep singular, plural, past tense, possessive,
etc. as separate searchable tokens and potentially expand a term at search
time to all its possible representations, or just keep the stem of the word,
eliminating endings.
The amount of stemming that is applied can lead to retrieval of many non-
relevant items.
Some systems, such as RetrievalWare, that use a large dictionary/thesaurus
look up words in the existing dictionary to determine the stemmed version
in lieu of applying a sophisticated algorithm.
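The tradeoff described above can be seen with a deliberately naive suffix-stripping stemmer. This is a toy sketch, not the Porter algorithm or the dictionary lookup used by any real system; the suffix list and the minimum stem length are assumptions chosen for illustration.

```python
# Assumed suffix list, checked in order (longest and most specific first)
SUFFIXES = ("'s", "ing", "ed", "es", "s")


def simple_stem(token, min_stem=3):
    """Strip one common ending, keeping at least min_stem characters."""
    token = token.lower()
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


# Variants collapse to one searchable token, which helps recall
print(simple_stem("computing"), simple_stem("computed"))  # comput comput
# Over-stemming conflates unrelated words, which hurts precision
print(simple_stem("planes"), simple_stem("plans"))  # plan plan
```

A search for "plans" would now also retrieve items about "planes", which is the kind of non-relevant retrieval that aggressive stemming can cause.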
2) Selective Dissemination of Information
Typically items in the Document Database do not change (i.e., are not
edited) once received.
Public Index files
Public Index files are maintained by professional library services personnel and
typically index every item in the Document Database.
There is a small number of Public Index files.
These files have access lists (i.e., lists of users and their privileges) that
allow anyone to search or retrieve data.
To assist the users in generating indexes, especially the professional
indexers, the system provides a process called Automatic File Build
(also called Information Extraction).
1.4 Relationship to Database Management Systems, Digital Libraries, Data
Warehouse
DBMS
Structured data consists of well-defined facts represented in tables.
The user gets the desired information for a specific request.
IRS
There is a high probability of not finding all relevant items.
Items use differing vocabulary and may discuss one or many topics.
Hierarchical search and ranking are used.
Data warehouses are similar to information storage and retrieval systems in that
they both have a need for search and retrieval of information.
But a data warehouse is more focused on structured data and decision support
technologies.
In addition to the normal search process, a complete system provides a flexible
set of analytical tools to “mine” the data.
Data mining (originally called Knowledge Discovery in Databases - KDD) is a
search process that automatically analyzes data and extracts relationships and
dependencies that were not part of the database design.
Information Retrieval System Capabilities
Search Capabilities
Objective, Weighting, Functions, Relationships, Interpretations
Browse Capabilities
Miscellaneous Capabilities
A search statement may apply to the complete item or contain additional parameters
limiting it to a logical division of the item (i.e., to a zone).
Based upon the algorithms used in a system, many different functions are
associated with the system’s understanding of the search statement.
Functions
o define the relationships between the terms in the search statement and the
interpretation of a particular word.
Examples of the relationships are
o Boolean
o Natural Language
o Proximity
o Contiguous Word Phrases
o Fuzzy Searches
Proximity
Proximity is used to restrict the distance allowed within an item between two
search terms.
The semantic concept is that the closer two terms are found in a text the more
likely they are related in the description of a particular concept.
Proximity is used to increase the precision of a search.
The typical format for proximity is:
TERM1 within “m” “units” of TERM2
The distance operator “m” is an integer number and units are in
Characters, Words, Sentences, or Paragraphs.
A special case of the Proximity operator is the Adjacent (ADJ) operator that
normally has a distance operator of one and a forward only direction.
Another special case is where the distance is set to zero meaning within the
same semantic unit.
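The proximity format "TERM1 within m units of TERM2" can be sketched for the simplest case, where the unit is words and the item is already tokenized. The function names and the `forward_only` flag are hypothetical; sentence and paragraph units, and the zero-distance "same semantic unit" case, would need boundary information that this sketch does not model.

```python
def within(tokens, term1, term2, m, forward_only=False):
    """True if term2 occurs within m words of term1 in the token list."""
    positions1 = [i for i, token in enumerate(tokens) if token == term1]
    positions2 = [i for i, token in enumerate(tokens) if token == term2]
    for i in positions1:
        for j in positions2:
            distance = j - i
            if forward_only:
                # term2 must follow term1
                if 0 < distance <= m:
                    return True
            elif 0 < abs(distance) <= m:
                return True
    return False


def adj(tokens, term1, term2):
    """Adjacent (ADJ): distance of one, forward direction only."""
    return within(tokens, term1, term2, 1, forward_only=True)


item = ["united", "states", "of", "america"]
print(within(item, "united", "america", 3))  # True
print(within(item, "united", "america", 2))  # False
print(adj(item, "united", "states"))         # True
print(adj(item, "states", "united"))         # False
```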
A Contiguous Word Phrase (CWP) acts like a special search operator that is similar to the proximity (Adjacency)
operator but allows for additional specificity.
If two terms are specified, the contiguous word phrase and the proximity operator
using directional one-word parameters or the Adjacent operator are identical.
Nested Adjacencies
For contiguous word phrases of more than two terms, the only way of
creating an equivalent search statement using proximity and Boolean
operators is via nested Adjacencies.
Proximity and Boolean operators are binary operators,
but contiguous word phrases are an “N”ary operator where “N” is the number of words
in the CWP.
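The "N"ary nature of a contiguous word phrase can be shown with a single check over all N words at once, rather than a chain of binary adjacency tests. This is a minimal sketch over a tokenized item; the function name is hypothetical.

```python
def contains_phrase(tokens, phrase):
    """True if the words of phrase occur contiguously, in order, in tokens."""
    n = len(phrase)
    # Slide a window of N words across the item and compare it to the phrase
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


item = ["the", "united", "states", "of", "america"]
print(contains_phrase(item, ["united", "states", "of", "america"]))  # True
print(contains_phrase(item, ["united", "of"]))                       # False
```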
Fuzzy Searches
Fuzzy Searches provide the capability to locate spellings of words that are similar
to the entered search term.
They are used to compensate for errors in spelling of words.
Fuzzy searching increases recall at the expense of decreasing precision (i.e., it can erroneously
identify terms as the search term).
In the process of expanding a query term, fuzzy searching includes other terms
that have similar spellings, giving more weight (in systems that rank output) to
words in the database that have similar word lengths and position of the characters
as the entered term.
A Fuzzy Search on the term “computer” would automatically include the
following words from the information database: “computer,” “compiter,”
“conputer,” “computter,” “compute.”
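The "computer" example above can be approximated with Python's standard `difflib` module. This is a stand-in technique: `difflib.get_close_matches` ranks candidates by a sequence-similarity ratio, which is not necessarily the weighting any particular retrieval system uses. The vocabulary, the cutoff of 0.8, and the function name are assumptions.

```python
import difflib


def fuzzy_expand(term, vocabulary, cutoff=0.8, limit=10):
    """Return vocabulary words spelled similarly to term, most similar first."""
    return difflib.get_close_matches(term, vocabulary, n=limit, cutoff=cutoff)


# Hypothetical database vocabulary, including misspellings and an unrelated word
vocabulary = ["computer", "compiter", "conputer", "computter", "compute", "table"]
matches = fuzzy_expand("computer", vocabulary)
print(matches[0])          # computer
print("table" in matches)  # False
```

Lowering the cutoff admits more distant spellings, which raises recall and lowers precision in the way described above.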
The masking may be
o in the front
o at the end
o at both front and end
o imbedded.
The first three of these cases are called suffix search, prefix search and imbedded
character string search, respectively.
An imbedded variable length don’t care is seldom used.
If “*” represents a variable length don’t care then the following are examples of its
use:
“*COMPUTER” Suffix Search
“COMPUTER*” Prefix Search
“*COMPUTER*” Imbedded String Search
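The three masking cases can be sketched by translating the "*" don't care into a regular expression and matching it against the database vocabulary. The vocabulary and function name are hypothetical, and case-insensitive matching is an assumption.

```python
import re


def mask_matches(mask, vocabulary):
    """Return the vocabulary words matched by a mask where '*' is a variable length don't care."""
    # Escape the literal parts and join them with '.*' in place of each '*'
    pattern = ".*".join(re.escape(part) for part in mask.split("*"))
    regex = re.compile(pattern, re.IGNORECASE)
    return [word for word in vocabulary if regex.fullmatch(word)]


vocabulary = ["computer", "computers", "minicomputer", "microcomputers", "compute"]
print(mask_matches("*COMPUTER", vocabulary))   # ['computer', 'minicomputer']
print(mask_matches("COMPUTER*", vocabulary))   # ['computer', 'computers']
print(mask_matches("*COMPUTER*", vocabulary))
# ['computer', 'computers', 'minicomputer', 'microcomputers']
```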
Concept/Thesaurus Expansion
Concept/thesaurus expansion is associated with both Boolean and Natural Language Queries.
It is the ability to expand the search terms via a Thesaurus or Concept Class database
reference tool.
A thesaurus is typically a one-level or two-level expansion of a term to other terms
that are similar in meaning.
A Concept Class is a tree structure that expands each meaning of a word into
potential concepts that are related to the initial term (e.g., in the TOPIC system).
Concept classes are sometimes implemented as a network structure that links word
stems (e.g., in the RetrievalWare system).
Examples of Thesaurus and Concept Class structures are shown in Figure 2.4
and Figure 2.5.
Thesauri are either semantic or based upon statistics.
Semantic thesaurus
A semantic thesaurus is a listing of words and then other words that are semantically similar.
The problem with thesauri is that they are generic to a language.
They can introduce many search terms that are not found in the document
database.
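A one-level thesaurus expansion, and the problem of introduced terms that do not occur in the database, can be sketched as follows. The thesaurus entries and database vocabulary are invented for illustration, and splitting the expansion into usable and unused terms is one possible design choice rather than a described system behavior.

```python
# Hypothetical generic thesaurus: term -> semantically similar terms
THESAURUS = {
    "computer": ["machine", "processor", "calculator", "mainframe"],
}


def expand_term(term, thesaurus, database_vocabulary):
    """One-level expansion, split into terms found and not found in the database."""
    candidates = [term] + thesaurus.get(term, [])
    found = [t for t in candidates if t in database_vocabulary]
    unused = [t for t in candidates if t not in database_vocabulary]
    return found, unused


# Hypothetical set of processing tokens present in the document database
database_vocabulary = {"computer", "processor", "network"}
found, unused = expand_term("computer", THESAURUS, database_vocabulary)
print(found)   # ['computer', 'processor']
print(unused)  # ['machine', 'calculator', 'mainframe']
```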
1.6 Browse Capabilities
Once the search is complete, Browse capabilities provide the user with the
capability to determine which items are of interest and select those to be
displayed.
There are two ways of displaying a summary of the items that are associated with
a query: line item status and data visualization.
From these summary displays, the user can select the specific items and zones
within the items for display.
Ranking
Zoning
Highlighting
Ranking
Hits are retrieved either in sorted order or time order from the newest to oldest.
Based on the relevance score, hit results are ranked.
The relevance score is the system’s estimate of how well an item satisfies the search statement.
Items are displayed with their relevance score and a description of the item.
Typically relevance scores are normalized to a value between 0.0 and 1.0.
The highest value of 1.0 is interpreted to mean that the system is sure that the item is relevant
to the search statement.
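Normalizing and ranking hits can be sketched as below. Dividing each raw score by the highest raw score is only one possible normalization and is an assumption here; the raw scores and item ids are invented.

```python
def rank_hits(raw_scores):
    """Normalize raw scores to 0.0..1.0 and return (item, score) pairs, best first."""
    top = max(raw_scores.values(), default=0.0)
    if top <= 0:
        # No positive evidence of relevance for any item
        return [(item, 0.0) for item in raw_scores]
    normalized = {item: score / top for item, score in raw_scores.items()}
    return sorted(normalized.items(), key=lambda pair: pair[1], reverse=True)


ranked = rank_hits({"item_a": 2.0, "item_b": 8.0, "item_c": 4.0})
print(ranked)  # [('item_b', 1.0), ('item_c', 0.5), ('item_a', 0.25)]
```

Note that with this particular normalization the top hit always scores 1.0, so the score is relative to the other hits and is not an absolute statement of certainty.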
In addition to ranking, based upon the characteristics of the item and the database,
collaborative filtering (matching people with similar interests and making recommendations)
provides an option for selecting and ordering output.
Collaborative filtering has been very successful in sites such as AMAZON.COM,
MovieFinder.com, and CDNow.com in deciding what products to display to users
based upon their queries.
Rather than limiting the number of items that can be assessed by the number of lines
on a screen, other graphical visualization techniques showing the relevance
relationships of the hit items can be used.
For example, a two or three dimensional graph can be displayed where points on the
graph represent items and the location of the points represents their relative
relationship to each other and to the user’s query.
In some cases color is also used in this representation.
It allows a user to see the clustering of items by topics and browse through a cluster
or move to another topical cluster.
Zoning
Users want to see the minimum information needed to determine if the item is
relevant or not.
Once that determination is made, the user wants to display the complete item for detailed
view.
Limited display screen sizes require selectability of which portions of an item are needed
to judge relevancy.
The basic search unit need not be the complete item; it can be an algorithmically defined subdivision of
the item.
Related to zoning, for use in minimizing what an end user needs to review from a
hit item is the idea of locality and passage based search and retrieval.
Passage retrieval
The item is divided into uniform-sized passages that are indexed.
Locality based retrieval
Passage boundaries can be dynamic.
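Passage retrieval with uniform-sized passages can be sketched by cutting a tokenized item into fixed windows and picking the window with the most query terms. The passage size, the scoring by simple term count, and the function names are assumptions; locality based retrieval with dynamic boundaries is not modeled.

```python
def split_passages(tokens, size):
    """Divide an item into uniform-sized passages (the last one may be shorter)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def best_passage(tokens, query_terms, size):
    """Return (index, passage) of the passage containing the most query terms."""
    passages = split_passages(tokens, size)
    if not passages:
        return None
    query = set(query_terms)
    scores = [sum(1 for token in passage if token in query) for passage in passages]
    best = max(range(len(passages)), key=lambda i: scores[i])
    return best, passages[best]


text = ("the budget report covers travel costs and the "
        "new search engine ranks search results by score")
index, passage = best_passage(text.split(), ["search", "engine"], size=4)
print(index, passage)  # 2 ['new', 'search', 'engine', 'ranks']
```

A browse function could use the returned index to position the display at the start of the best passage.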
Highlighting
Highlighting is another display aid.
Most systems allow the display of an item to begin with the first highlight within
the item and allow subsequent jumping to the next highlight.
It also provides the capability to determine the passage in the document most relevant to the
query and position the browse to start at that passage.
The DCARS system that acts as a user front-end to the RetrievalWare search
system allows the user to browse an item in the order of the paragraphs or
individual words that contributed most to the rank value associated with the item.
The highlighting may vary by introducing colors and intensities to indicate the
relative importance of a particular word in the item in the decision to retrieve the
item.
1.7 Miscellaneous Capabilities
Vocabulary Browse
Iterative Search and Search History Log
Canned Query
Vocabulary Browse
Vocabulary Browse provides the capability to display, in alphabetically sorted order,
words from the document database.
Logically, all unique words (processing tokens) in the database are kept in sorted
order along with a count of the number of unique items in which the word is
found.
The user can enter a word or word fragment and the system will begin to display
the dictionary around the entered text.
It helps the user determine the impact of using a fixed or variable length mask on a
search term and potential misspellings.
The user can determine that entering the search term “compul*” in effect is
searching for “compulsion” or “compulsive” or “compulsory.”
It also shows that someone probably entered the word “computen” when they
really meant “computer.”
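Vocabulary Browse can be sketched as a sorted dictionary of processing tokens with item counts, plus a lookup that displays the entries around the text the user typed. The sample items, the window size, and the function names are hypothetical.

```python
import bisect


def build_vocabulary(items):
    """Return sorted (word, number of unique items containing the word) pairs."""
    counts = {}
    for tokens in items.values():
        for word in set(tokens):  # count each item once per word
            counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items())


def browse(vocabulary, fragment, window=2):
    """Display the dictionary entries around the entered word or word fragment."""
    words = [word for word, _ in vocabulary]
    start = bisect.bisect_left(words, fragment)
    low = max(0, start - window)
    return vocabulary[low:start + window + 1]


# Hypothetical document database: item id -> processing tokens
items = {
    1: ["compulsion", "computer"],
    2: ["compulsory", "computer", "computen"],
    3: ["compulsive", "table"],
}
vocabulary = build_vocabulary(items)
print(browse(vocabulary, "compul"))
# [('compulsion', 1), ('compulsive', 1), ('compulsory', 1)]
```

Browsing around "compute" in the same vocabulary would show "computen" with a count of one next to "computer" with a count of two, which hints at a misspelling.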
Iterative Search and Search History Log
Frequently a search returns a Hit file containing many more items than the user
wants to review.
Rather than typing in a complete new query, the results of the previous search can
be used as a constraining list to create a new query that is applied against it.
This has the same effect as taking the original query and adding an additional search
statement to it in an AND condition.
This process of refining the results of a previous search to focus on relevant items
is called iterative search.
This also applies when a user uses relevance feedback to enhance a previous
search.
The search history log is the capability to display all the previous searches that
were executed during the current session.
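Iterative search and the search history log can be sketched with a small inverted index, where a previous Hit file constrains the next search. The index contents and function name are hypothetical; a real system would support full search statements rather than single terms.

```python
# Hypothetical inverted index: term -> set of item ids containing the term
INDEX = {
    "retrieval": {1, 2, 3, 5, 8},
    "fuzzy": {2, 4, 8, 9},
}


def search(index, term, within=None):
    """Search for a term, optionally constrained to a previous hit set."""
    hits = set(index.get(term, set()))
    return hits if within is None else hits & within


history = []  # search history log for the current session

first = search(INDEX, "retrieval")
history.append(("retrieval", first))

# Refine the previous Hit file instead of typing a complete new query
refined = search(INDEX, "fuzzy", within=first)
history.append(("fuzzy within previous hits", refined))

print(sorted(refined))  # [2, 8]
# Same effect as "retrieval AND fuzzy" against the whole database
print(refined == INDEX["retrieval"] & INDEX["fuzzy"])  # True
```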
Canned Query
The capability to name a query and store it to be retrieved and executed during a
later user session is called canned or stored queries.
A canned query allows a user to create and refine a search that focuses on the
user’s general area of interest one time and then retrieve it to add additional
search criteria to retrieve data that is currently needed.
Canned query features also allow for variables to be inserted into the query and
bound to specific values at execution time.
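A canned query with variables bound at execution time can be sketched with Python's standard `string.Template`. The query syntax, the stored query name, and the helper functions are invented for illustration and do not reflect any specific system's query language.

```python
from string import Template

# Hypothetical store of named (canned) queries
CANNED_QUERIES = {}


def store_query(name, text):
    """Save a query under a name; $variables are left unbound until execution."""
    CANNED_QUERIES[name] = Template(text)


def run_query(name, **bindings):
    """Retrieve a canned query and bind its variables to specific values."""
    return CANNED_QUERIES[name].substitute(**bindings)


store_query("topic_since", "$topic AND DATE > $since")
statement = run_query("topic_since", topic="fuzzy search", since="1995-01-01")
print(statement)  # fuzzy search AND DATE > 1995-01-01
```

`substitute` raises `KeyError` if a variable is left unbound, which is a reasonable behavior for a stored query that cannot run without its parameters.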
1.8 Standards for IRS
Z39.50
The Z39.50 standard does not specify an implementation, but the capabilities
within an application (Application Service) and the protocol used to communicate
between applications (Information Retrieval Application Protocol).
It is a computer to computer communications standard for database searching and
record retrieval.
Its objective is to overcome different system incompatibilities associated with
multiple database searching.
Z39.50 was first approved in 1988, and a revised version was approved in 1992.
An international version of Z39.50, called the Search and Retrieve Standard (SR),
was approved by the International Organization for Standardization (ISO) in 1991.
Z39.50-1995, the latest version of Z39.50, replaces SR as the international
information retrieval standard.
The standard describes eight operation types: Init (initialization), Search, Present,
Delete, Scan, Sort, Resource-report, and Extended Services.
There are six types of queries (Types 0, 1, 2, 100, 101, and 102).
The client is identified as the “Origin” and performs the communications functions
relating to initiating a search, translation of the query into a standardized format,
sending a query, and requesting return records.
The server is identified as the “Target” and interfaces to the database at the remote
site, responding to requests from the Origin (e.g., pass query to database, return records in
a standardized format and status).
The end user does not have to be aware of the details of the standard since the Origin
function performs the mapping from the user’s query interface into Z39.50 format.
This makes the dissimilarities of different database systems transparent to the user and
facilitates issuing one query against multiple databases at different sites returning to the
user a single integrated Hit file.
WAIS (Wide Area Information Servers)
Wide Area Information Servers (WAIS) is the standard for many search
environments on the INTERNET.
WAIS was developed by a project started in 1989 by three commercial companies
(Apple, Thinking Machines, and Dow Jones).
The original idea was to create a program that would act as a personal librarian.
A free version of WAIS, called “FreeWAIS,” is still available via the Clearinghouse for Networked
Information Discovery and Retrieval (CNIDR).
The original development of WAIS started with the 1988 Z39.50 protocol as a base,
following the client/server architecture concept.
The developers incorporated the information retrieval concepts that allow for
ranking, relevance feedback and natural language processing functions that apply to
full text searchable databases.
The Corporation for National Research Initiatives (CNRI) is working with the Department
of Defense and also the Association of American Publishers (AAP), focusing on an
Internet implementation that allows for control of electronically published and copyrighted
material.
In addition to the Handle Server architecture, CNRI is also advocating a
communications protocol to retrieve items from existing systems.
Repository Archive Protocol (RAP) defines the mechanisms for clients to use the
handles to retrieve items.
It also includes other administrative functions such as privilege validation.
The Handle system is designed to meet the Internet Engineering Task Force (IETF)
requirements for naming Internet objects via Uniform Resource Names to replace
URLs as defined in the Internet’s RFC-1737 (IETF-96).
WAIS (Wide Area Information Servers) is an Internet system in which specialized
subject databases are created at multiple server locations, kept track of by a
directory of servers at one location, and made accessible for searching by users with
WAIS client programs.
The user of WAIS is provided with or obtains a list of distributed databases.
The user enters a search argument for a selected database and the client then accesses
all the servers on which the database is distributed.
The results provide a description of each text that meets the search requirements.