IRS Unit-2
Indexing
Indexing is the transformation of a received item into a searchable data structure.
This process can be manual or automatic, creating a direct search of items in the
Document Database or an indirect search via Index Files.
Search
Search is a technique that correlates the user-entered query statement with the set of items
in the database, once the searchable data structure has been created.
2.1 History
Cataloging
Indexing was originally called cataloging.
Cataloging is the oldest technique for identifying the contents of items to assist in their
retrieval.
Objective of cataloging:
To provide access points to a collection of information that are expected by, and most
useful to, the users.
History
Up to the 19th Century there was little advancement in cataloging, only changes in the
methods used to represent the basic information.
In the late 1800s subject indexing became hierarchical (e.g., Dewey Decimal System).
In 1963 the Library of Congress initiated a study on the computerization of
bibliographic records.
From 1966 - 1968 the Library of Congress ran its MARC I pilot project. MARC
(MAchine Readable Cataloging) standardizes the structure, contents and coding of
bibliographic records.
In 1965, the earliest commercial cataloging system, DIALOG, was developed by Lockheed
Corporation for NASA.
In 1978, DIALOG became commercially available.
In 1988, DIALOG was sold to Knight-Ridder; by then it contained over 320 index databases.
The initial introduction of computers to assist the cataloging function did not change its
basic operation of a human indexer determining those terms to assign to a particular
item.
The user, instead of searching through physical cards in a card catalog, now performed
a search on a computer and electronically displayed the card equivalents.
In the 1990s, the significant reduction in the cost of processing power and memory in
modern computers, along with access to the full text of an item from the publishing
stages in electronic form, allowed the use of the full text of an item as an alternative to
the indexer-generated subject index.
The searchable availability of the text of items has changed the role of indexers.
The indexer is no longer required to enter index terms that are redundant with words in
the text of an item.
2.2 Objective of Indexing
The objectives of indexing have changed with the evolution of Information Retrieval
Systems.
The key concepts involved are:
Total document indexing
Controlled vocabulary
Uncontrolled vocabularies
Automatic text analysis algorithms
Public File indexer
Private Index files
Selective indexing
Full document indexing
Controlled vocabulary
a finite set of index terms from which all index terms must be selected.
In a manual indexing environment, controlled vocabulary makes the indexing
process slower, but simplifies the search process.
Controlled vocabularies aid the user in knowing the domain of terms.
Uncontrolled vocabularies
make indexing faster but the search process much more difficult.
The source information (frequently called citation data) can automatically be
extracted.
Modern systems, with the automatic use of thesauri and other reference
databases, reduce the need for controlled vocabularies.
Automatic text analysis algorithms
cannot consistently perform abstraction on all concepts that are in an item.
They cannot correlate the facts in an item to determine additional related
concepts to be indexed.
The words used in an item do not always reflect the value of the concepts being
presented.
It is the combination of the words and their semantic implications that contain
the value of the concepts being discussed.
Public File indexer
Public File indexer considers the information needs of all users of the library
system.
Individual users of the system have their own domains of interest that bound the
concepts in which they are interested.
It takes a human being to evaluate the quality of the concepts being discussed
in an item to determine if that concept should be indexed.
Private Index files
allow the user to logically subset the total document file into folders of interest
that, in the user’s judgment, have future value.
allow the user to judge the utility of the concepts based upon his own need, versus
the system’s need, and to perform concept abstraction.
Selective indexing
is based upon the value of concepts to increase the precision of searches.
Full document indexing
Availability of full document indexing saves the indexer from entering index
terms that are identical to words in the document.
2.3 Indexing Process
When an organization with multiple indexers decides to create a public or private index,
some procedural decisions assist the indexers and end users in creating the index
terms. These decisions concern:
the scope of the indexing
the linking of index terms
Scope of Indexing
defines the level of detail that the subject index will contain.
This is based on the usage scenarios of the end users.
When performed manually, the process of determining the bibliographic terms that
represent the concepts in an item is difficult.
Problems arise from the interaction of two sources: the author and the indexer.
The vocabulary domain of the author may be different from that of the indexer, causing
the indexer to misinterpret the emphasis of the item.
The indexer is not an expert in all areas and has different levels of knowledge in the
different areas being presented in the item.
This results in different quality levels of indexing.
The indexer must determine when to stop the indexing process.
There are two factors involved in deciding on what level to index the concepts in an
item:
Exhaustivity
Specificity.
Exhaustivity of indexing
is the extent to which the different concepts in the item are indexed.
For example, if two sentences of a 10-page item on microprocessors discuss on-
board caches, should this concept be indexed?
Specificity
relates to the preciseness of the index terms used in indexing.
For example, whether the term “processor” or “microcomputer” or “Pentium”
should be used in the index of an item.
Using general index terms yields low exhaustivity and specificity.
This approach requires a minimal number of index terms per item and
reduces the cost of generating the index.
Weighting
Weighting of index terms is not common in manual indexing systems.
Weighting is the process of assigning importance to an index term’s use in an
item.
The weight should represent the degree to which the concept is associated with
the index term.
The process of assigning weights adds additional overhead on the indexer and
requires a more complex data structure to store the weights.
Linkages
Another decision on the indexing process is whether linkages are available between
index terms for an item.
Linkages are used to correlate related attributes associated with concepts discussed in
an item.
Precoordination
process of creating term linkages at index creation time.
When index terms are not coordinated at index time, the coordination occurs at
search time.
Postcoordination
coordinating terms after (post) the indexing process.
implemented by “AND”ing index terms together, which only finds items
that contain all of the search terms.
Figure 3.2 shows the different types of linkages.
It assumes that an item discusses the drilling of oil wells in Mexico by CITGO and the
introduction of oil refineries in Peru by the U.S.
When the linked capability is added, the system does not erroneously relate Peru and
Mexico since they are not in the same set of linked items.
It still does not have the ability to discriminate between which country is introducing
oil refineries into the other country.
Introducing roles in the last two examples of Figure 3.2 removes this ambiguity when
modifiers are used.
2.4 Automatic Indexing
Human Indexing
Advantages of human indexing
ability to determine concept abstraction
ability to judge the value of a concept.
Disadvantages of human indexing
cost, processing time, and consistency
Automatic indexing
Capability for the system to automatically determine the index terms to be
assigned to an item.
More complex processing is required.
There are no additional indexing costs, in contrast to the salaries and benefits
regularly paid to human indexers.
Generating the index requires only a few seconds or less of computer time,
depending on the processor's speed and the algorithms' complexity.
Consistency in the index term selection process as indexing is performed
automatically by an algorithm.
Indexes from automated indexing fall into two classes: unweighted and weighted.
Unweighted indexing system
The index terms in a document and their word location(s) are kept in the searchable
data structure.
Queries are based upon Boolean logic, and the items in the resultant Hit file are all
considered equally relevant: the last item in the file is as likely to satisfy the
user’s need as the first.
Weighted indexing system
The weight of the index term is based upon a function associated with the
frequency of occurrence of the term in the item.
Values for the index terms are normalized between zero and one.
The higher the weight, the more the term represents a concept discussed in the
item.
The query process uses the weights along with any weights assigned to terms in
the query to determine a rank value.
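The notes say only that weights derive from frequency of occurrence and are normalized between zero and one; a minimal sketch assuming maximum-frequency normalization (one common choice, not prescribed here):

```python
from collections import Counter

def tf_weights(tokens):
    """Assign each index term a weight in (0, 1] based on its
    frequency of occurrence in the item, normalized by the
    count of the most frequent term."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

item = "oil wells in mexico produce oil for export".split()
print(tf_weights(item))   # 'oil' gets weight 1.0, the other terms 0.5
```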
2.4.1 Indexing by Term
2.4.2 Indexing by Concept
2.4.3 Multimedia Indexing
Indexing by Term
There are two major techniques for the creation of the index terms:
statistical
natural language
Statistical techniques based upon
vector models
probabilistic models
calculation of weights in those models using statistical information such as
frequency of occurrence of words
their distributions in the searchable database.
Vector model (example: SMART system)
The system uses weights for information detection and stores these weights in
a vector form.
Each vector represents a document.
Each position in a vector represents a different unique word (processing token)
in the database.
in the database.
The value assigned to each position is the weight of that term in the document.
A value of zero indicates that the word was not in the document.
Queries can be translated into the vector form.
Search is accomplished by calculating the distance between the query vector
and the document vectors.
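A minimal sketch of this idea; the vocabulary and weights are invented for illustration, and cosine similarity stands in for the unspecified distance calculation:

```python
import math

def cosine(query_vec, doc_vec):
    """Similarity between a query vector and a document vector."""
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    q_norm = math.sqrt(sum(q * q for q in query_vec))
    d_norm = math.sqrt(sum(d * d for d in doc_vec))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

# vocabulary[i] labels position i; a zero weight means the word
# does not occur in the document.
vocabulary = ["oil", "mexico", "refinery", "peru"]   # illustrative only
doc1  = [0.8, 0.5, 0.0, 0.0]
doc2  = [0.6, 0.0, 0.7, 0.4]
query = [1.0, 0.0, 1.0, 0.0]

for name, doc in [("doc1", doc1), ("doc2", doc2)]:
    print(name, round(cosine(query, doc), 3))
```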
Probabilistic model
The Bayesian approach is the most successful model in this area.
It is based upon the theories of evidential reasoning (drawing conclusions from
evidence).
The Bayesian approach can be applied as part of index term weighting by
calculating the relationship between an item and a specific query.
A Bayesian network is a directed acyclic graph
Each node represents a random variable.
Arcs between the nodes represent a probabilistic dependence between
the node and its parents.
Figure shows the basic weighting approach for index terms or associations
between query terms and index terms.
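The network formulation in the figure is beyond a short example, but the underlying evidential-reasoning step, computing the probability of relevance given that a term occurs, can be sketched with Bayes' rule; all probabilities below are invented for illustration:

```python
def bayes_weight(p_term_given_rel, p_term_given_nonrel, p_rel=0.5):
    """P(relevant | term occurs), computed from Bayes' rule."""
    p_term = p_term_given_rel * p_rel + p_term_given_nonrel * (1 - p_rel)
    return p_term_given_rel * p_rel / p_term

# Assumed numbers: the term appears in 60% of relevant items
# but only 5% of non-relevant ones.
print(round(bayes_weight(0.60, 0.05), 3))   # about 0.923
```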
MatchPlus system
An example of concept indexing.
Neural networks facilitate machine learning of concept/word relationships.
The goal is to determine word relationships (e.g., synonyms) and the strength of these
relationships, and to use that information in generating context vectors from the corpus
of items.
Two neural networks are used:
one learning algorithm generates stem context vectors that are sensitive to
similarity of use;
the other performs query modification based upon user feedback.
Context vectors
Word stems, items and queries are represented by high-dimensional (at least
300 dimensions) vectors.
Each dimension in a vector can be viewed as an abstract concept class.
Multimedia Indexing
The automated indexing takes place in multiple passes over the information.
The first pass is a conversion from the analog input mode into a digital structure.
Then algorithms are applied to the digital structure to extract the unit of processing of
the different modalities that will be used to represent the item.
In an abstract sense this could be considered the location of a processing token in the
modality.
This unit will then undergo the final processing that will extract the searchable features
that represent the unit.
For audio input, the system will convert the audio to digital format and determine
the phonemes associated with the utterances.
The phonemes will be used as input to a Hidden Markov Search model that will
determine with a confidence level the words that were spoken.
A single phoneme can be divided into four states for the Markov model.
It is the textual words associated with the audio that becomes the searchable
structure.
In addition to storing the extracted index searchable data, a multimedia item also needs
to store some mechanism to correlate the different modalities during search.
There are two main mechanisms that are used
Positional
Temporal
Positional
is used when the modalities are scattered in a linear sequential composition.
For example, a document that has images or audio inserted can be considered a
linear structure, and the only relationship between the modalities is the
position of each modality.
Temporal
based upon time because the modalities are executing concurrently.
The typical video source of television is inherently a multimedia source.
It contains video, audio, and potentially closed captioning.
2.5 Information Extraction
The process of extracting facts to go into indexes is called Automatic File Build.
Its goal is to process incoming items and extract index terms that will go into a
structured database.
The extraction system analyzes only the portions of a document that contain information
relevant to the extraction criteria.
The objective is to update a structured database with additional facts.
The updates may be from a controlled vocabulary or substrings from the item as defined
by the extraction rules.
Document summarization
The goal is to extract a summary of an item, maintaining the most important ideas while
significantly reducing the size.
Examples of summaries that are already part of most items include titles, tables of
contents, and abstracts.
The abstract can be used to represent the item for search purposes, or to let a user
judge the relevance of the item without having to read it completely.
It is not feasible to automatically generate a coherent narrative summary of an item with
proper discourse, abstraction and language usage.
Restricting the domain of the item significantly improves the quality of the output.
Different algorithms produce different summaries.
Most automated algorithms summarize by calculating a score for each sentence and
then extracting the sentences with the highest scores.
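A minimal sketch of this score-and-extract approach, using word frequency as the (assumed) sentence-scoring function:

```python
from collections import Counter
import re

def summarize(text, n_sentences=2):
    """Score each sentence by the average frequency of its words in
    the whole item, then extract the highest-scoring sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    freq = Counter(words)

    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the extracted sentences in their original order.
    return " ".join(s for s in sentences if s in ranked)

text = ("Oil wells were drilled in Mexico. The refineries in Peru expanded. "
        "Oil production and oil exports increased. Markets reacted calmly.")
print(summarize(text, 2))
```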
2.6 Introduction to Data Structures
From an Information Retrieval System perspective, the two aspects of a data structure
are
its ability to represent concepts and their relationships
how well it supports location of those concepts
2.7 Stemming Algorithms
Goals of stemming:
improve performance
reduce system resources by reducing the number of unique words that a system
has to contain.
The stemming process creates one large index for the stem.
The stem carries the meaning of the concept associated with the word; the affixes
introduce subtle modifications of that concept.
Stemming algorithms
improve the efficiency of the information system
improve recall.
Conflation
refers to mapping multiple morphological variants to a single representation
(stem).
Languages have precise grammars that define their usage, but also evolve based upon
human usage.
Thus exceptions and non-consistent variants are always present in languages that
typically require ‘exception look-up tables’ in addition to the normal reduction rules.
Stemming can cause problems for Natural Language Processing (NLP) systems by loss
of information needed for aggregate levels of natural language processing (discourse
analysis).
The tenses of verbs may be lost in creating a stem, but they are needed to determine
whether a particular concept being indexed occurred in the past or will occur in the future.
Time is one example of the type of relationships that are defined in Natural Language
Processing systems.
A stemming algorithm removes suffixes and prefixes, sometimes recursively, to derive
the final stem.
Other techniques, such as table lookup and successor stemming, provide alternatives that
require additional overhead.
Successor stemmers determine prefix overlap as the length of a stem is increased.
Table lookup requires a large data structure.
2.7b Porter Stemming Algorithm
The Porter Algorithm is based upon a set of conditions of the stem, suffix and prefix
and associated actions given the condition.
Some examples of stem conditions are:
1. The measure, m, of a stem is a function of sequences of vowels (a, e, i, o, u)
followed by a consonant.
If V is a sequence of vowels and C is a sequence of consonants, then every stem can
be written in the form
[C](VC)^m[V]
where the initial C and the final V are optional and m is the number of VC repeats.
Given the word “duplicatable,” the steps in the stemming process are:
duplicat (rule 4 removes “able”)
duplicate (rule 1b1 restores the trailing “e”)
duplic (rule 4 removes “ate”)
Another rule in step 4, removing “ic,” cannot be applied, since only one rule from
each step is allowed to be applied.
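A small sketch that computes the measure m as defined above (ignoring Porter's special handling of 'y', which the full algorithm includes):

```python
def measure(word):
    """Compute Porter's measure m: the number of VC repeats in the
    pattern [C](VC)^m[V], with vowels a, e, i, o, u."""
    vowels = set("aeiou")
    pattern = ""
    for ch in word.lower():
        tag = "V" if ch in vowels else "C"
        if not pattern or pattern[-1] != tag:
            pattern += tag        # collapse runs of the same class
    return pattern.count("VC")

for w in ["tree", "trouble", "oaten", "duplicatable"]:
    print(w, measure(w))          # m = 0, 1, 2, 4
```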
2.7c Dictionary Look-Up Stemmers
In this approach, simple stemming rules may be applied.
The rules are taken from those that have the fewest exceptions (e.g., removing
pluralization from nouns).
The original term or stemmed version of the term is looked up in a dictionary and
replaced by the stem that best represents it.
INQUERY system :
uses a stemming technique called Kstem.
Kstem is a morphological analyzer that reduces morphological variants to a root
form.
For example, ‘elephants’ -> ‘elephant’, ‘amplification’ -> ‘amplify’, and
‘european’ -> ‘europe’.
It tries to avoid collapsing words with different meanings into the same root:
for example, “memorial” and “memorize” would both reduce to “memory,” yet they
are not synonyms and have very different meanings.
The Kstem system uses the following six major data files to control and limit the
stemming process:
Dictionary of words (lexicon)
Supplemental list of words for the dictionary
Exceptions list for those words that should retain an “e” at the end (e.g., “suites”
to “suite” but “suited” to “suit”)
Direct Conflation - allows definition of direct conflation via word pairs that
override the stemming algorithm
Country_Nationality - conflations between nationalities and countries
(“British” maps to “Britain”)
Proper Nouns - a list of proper nouns that should not be stemmed.
Like other stemmers associated with Natural Language Processors and dictionaries,
Kstem returns words instead of truncated word forms.
Generally, Kstem requires a word to be in the dictionary before it reduces one word
form to another.
Some endings are always removed, even if the root form is not found in the dictionary
(e.g.,‘ness’, ‘ly’).
If the word being processed is in the dictionary, it is assumed to be unrelated to any
shorter root, and conflation is not performed (e.g., ‘factorial’ needs to be in the
dictionary, or it is stemmed to ‘factory’).
For irregular morphologies, it is necessary to explicitly map the word variant to the root
desired (for example, “matrices” to “matrix”).
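A toy sketch of this dictionary-driven behavior; the word lists below are stand-ins for Kstem's six data files, and the rules are drastically simplified, not the real Kstem logic:

```python
# Hypothetical miniature lexicon and exception lists.
lexicon = {"elephant", "amplify", "europe", "memorial", "memorize",
           "suite", "suit"}
direct_conflations = {"matrices": "matrix"}     # irregular morphology
always_removed = ("ness", "ly")                 # removed unconditionally

def kstem_like(word):
    """Apply a simple rule, then keep the result only if the lexicon
    (or an explicit conflation) sanctions it."""
    word = word.lower()
    if word in direct_conflations:
        return direct_conflations[word]
    if word in lexicon:                          # already a root form
        return word
    for ending in always_removed:                # even if root not in lexicon
        if word.endswith(ending):
            return word[: -len(ending)]
    if word.endswith("s") and word[:-1] in lexicon:   # simple plural rule
        return word[:-1]
    return word                                  # no sanctioned reduction

for w in ["elephants", "matrices", "memorial", "happiness"]:
    print(w, "->", kstem_like(w))
```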
RetrievalWare System:
The strength of the RetrievalWare System lies in its Thesaurus/Semantic Network
support data structure that contains over 400,000 words.
The dictionaries contain the morphological variants of words.
New words that are not special forms (e.g., dates, phone numbers) are located in the
dictionary to determine simpler forms by stripping off suffixes and respelling plurals as
defined in the dictionary.
2.7d Successor Stemmers
The process determines the successor varieties for a word, uses this information to
divide the word into segments, and selects one of the segments as the stem.
The successor variety of a prefix (segment) of a word is the number of distinct letters
that follow that prefix in the words of the corpus.
Using these counts, a set of measures (such as entropy) can be calculated for a word
and its predecessors.
Selecting the first or second segment in general determines the appropriate stem.
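A minimal sketch of computing successor varieties against a small corpus (the word list is invented); a segment boundary would typically be chosen where the variety peaks:

```python
def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` among the
    corpus words that start with it."""
    followers = {w[len(prefix)] for w in corpus
                 if w.startswith(prefix) and len(w) > len(prefix)}
    return len(followers)

corpus = ["bag", "barn", "bring", "both", "box", "bottle"]
word = "bottle"
for i in range(1, len(word) + 1):
    prefix = word[:i]
    print(prefix, successor_variety(prefix, corpus))
```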
2.8 Inverted File Structure
The inverted file structure is used in both database management and Information
Retrieval Systems.
Inverted file structures are composed of three basic files:
document file
inversion lists (posting files)
dictionary
Each document in the system is given a unique numerical identifier.
The dictionary is a sorted list of all unique words (processing tokens) in the system,
each with a pointer to the location of its inversion list; it is used to locate the
inversion list for a particular word.
For each word, the inversion list stores the identifiers of the documents in which the
word is found.
Additional information may be used from the item to increase precision and provide a
more optimum inversion list file structure.
For example, if zoning is used, the dictionary may be partitioned by zone.
There could be a dictionary and set of inversion lists for the “Abstract” zone in an item
and another dictionary and set of inversion lists for the “Main Body” zone.
The inversion list contains the document identifier for each document in which the word
is found; the word position(s) within each document may be stored as well.
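A minimal sketch of building the three files in memory: a document file, a dictionary, and the inversion lists (document identifiers only, without word positions):

```python
from collections import defaultdict

# Document file: identifier -> text (contents invented for illustration).
documents = {
    1: "oil wells in mexico",
    2: "oil refineries in peru",
    3: "wells and refineries",
}

# Dictionary maps each unique word to its inversion list: the sorted
# identifiers of the documents containing the word.
index = defaultdict(list)
for doc_id, text in sorted(documents.items()):
    for word in set(text.split()):
        index[word].append(doc_id)

print(index["oil"])          # -> [1, 2]
print(index["refineries"])   # -> [2, 3]
```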
2.9 N-Gram Data Structures
N-Grams
a special technique for conflation (mapping).
a unique data structure in information systems that ignores words and treats the
input as continuous data (optionally limiting its processing by interword
symbols).
a fixed-length consecutive series of “n” characters or fixed-length overlapping
symbol segments that define the searchable processing tokens.
These tokens have logical linkages to all the items in which the tokens are found.
To store the linkage data structure, inversion lists, document vectors and other
data structures are used in the search process.
n-grams do not care about semantics, unlike stemming (which determines the
stem of a word that represents the semantic meaning of the word).
Examples of bigrams, trigrams, and pentagrams are given in Figure 4.7 for the word
phrase “sea colony”.
n-grams with n greater than two allow interword symbols to be part of the n-gram set.
The symbol # is used to represent the interword symbol (e.g., blank, period, semicolon,
colon, etc.).
Each of the n-grams created becomes a separate processing token and is searchable.
It is possible that the same n-gram can be created multiple times from a single word.
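A minimal sketch of generating overlapping n-grams over the phrase used in Figure 4.7; note that this sketch keeps the interword symbol # in bigrams as well, whereas bigrams are often generated without it:

```python
def ngrams(text, n, interword="#"):
    """Overlapping n-character tokens; blanks become the interword
    symbol so that word boundaries stay searchable."""
    stream = text.replace(" ", interword)
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]

print(ngrams("sea colony", 2))  # bigrams
print(ngrams("sea colony", 3))  # trigrams
```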
A continuous text input data structure is indexed as contiguous “n”-character tokens
using n-grams, with interword symbols included in the processing tokens.
2.10 PAT Data Structure
A continuous text input data structure is addressed differently using PAT trees and PAT
arrays.
The name PAT is short for Patricia trees (PATRICIA stands for Practical Algorithm
To Retrieve Information Coded In Alphanumerics).
Figure 4.10 gives an example of the sistrings used in generating a PAT Tree.
If the binary representation of “h” is (100), “o” is (110), “m” is (001) and “e”
is (101), then the word “home” produces the input 100110001101.
Using the sistrings, the full PAT binary tree is shown in Figure 4.11.
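A minimal sketch of enumerating the sistrings (semi-infinite strings, here simply the suffixes) of the binary input above; a PAT tree is then built over these sistrings:

```python
# Sistrings: every suffix of the input stream, numbered by its
# starting position (the 12-bit encoding of "home" given above).
stream = "100110001101"
sistrings = [(i + 1, stream[i:]) for i in range(len(stream))]
for pos, s in sistrings[:5]:
    print(pos, s)
```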
Advantages and Disadvantages
PAT trees are ideal for prefix searches.
Suffix, embedded-string, and fixed-length masked searches are also easy if the total
input stream is used in defining the PAT tree.
Fuzzy searches are very difficult because a large number of possible sub-trees
could match the search term.
PAT arrays have more accuracy than signature files and can perform string searches
that are inefficient in inverted files (e.g., suffix searches, approximate string
searches, longest repetition).
PAT data structures are not used in any major commercial products.
2.11 Signature File Structure
A signature file provides a fast, inexact filtering test over the items in the database.
The items that satisfy the test can either be evaluated by another search algorithm to
eliminate the false hits or delivered to the user for review.
The text of the items is represented in a highly compressed form (the signature) that
facilitates the fast test.
Signature file
can be stored as a signature matrix, with each row representing a signature block.
Associated with each row is a pointer to the original text block.
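A minimal sketch of the superimposed-coding idea behind signature files; the signature width and hash scheme are invented for illustration:

```python
import zlib

SIG_BITS = 16   # signature width (assumed for illustration)

def word_signature(word, k=3):
    """Hash a word onto k bit positions (superimposed coding)."""
    sig = 0
    for i in range(k):
        sig |= 1 << (zlib.crc32(f"{word}/{i}".encode()) % SIG_BITS)
    return sig

def block_signature(words):
    """A text block's signature is the OR of its word signatures."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

block = block_signature(["oil", "wells", "mexico"])
for query in ["oil", "peru"]:
    q = word_signature(query)
    # The block *may* contain the word only if all of the word's bits
    # are set in the block signature; false drops are possible,
    # missed hits are not.
    print(query, "possible match" if q & block == q else "rejected")
```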
Application / Advantages
Signature files provide a solution for storing and locating information in a
number of different situations.
Signature files are applied in medium-size databases, on WORM devices, on parallel
processing machines, and in distributed environments.
2.12 Hypertext and XML Data Structures
Hypertext:
A mechanism for representing information structure.
It differs from traditional information storage data structures in format and use.
Hypertext is stored in Hypertext Markup Language (HTML) and eXtensible
Markup Language (XML).
HTML and XML provide detailed descriptions for subsets of text, similar to zoning,
which increases search accuracy and improves the display of hit results.
Hypertext Structure
used in the Internet environment
requires electronic media storage for the item.
Hypertext allows one item to reference another item via an embedded pointer.
Each separate item is called a node.
Reference pointer is called a link.
Each node is displayed by a viewer that is defined for the file type associated with
the node.
Hypertext Markup Language (HTML)
defines the internal structure for information exchange across the World Wide Web
on the Internet.
A document is composed of the text of the item along with HTML tags that describe
how to display the document.
Tags are formatting or structural keywords contained between less-than and greater-than
symbols (e.g., <title>, <strong>).
The HTML tag associated with hypertext linkages is the anchor, <a href="URL#NAME">…</a>,
where <a> and </a> are the anchor start and end tags denoting the text that
the user can activate.
“href” is the hypertext reference, containing either a file name if the referenced item
is on this node, or an address (URL) and a file name if it is on another node.
“#NAME” defines a destination point other than the top of the item to go to.
The URL has three components:
access method the client used to retrieve the item
Internet address of the server where the item is stored
address of the item at the server
Figure 4.14 shows an example of a segment of an HTML document.
Hypertext is a non-sequential directed graph structure, where each node contains its
own information.
A node may have several outgoing links, each of which is then associated with some
smaller part of the node called an anchor.
When an anchor is activated, the associated link is followed to the destination node,
thus navigating the hypertext network.
Dynamic HTML
A combination of the latest HTML tags, style sheets, and programming that
helps to create Web pages that are more animated and responsive to user
interaction.
It supports features such as an object-oriented view of a Web page and its
elements, cascading style sheets, programming that can address most page
elements, and dynamic fonts.
XML:
The Extensible Markup Language (XML) is a standard data structure on the Web.
Its objective is to extend HTML with semantic information.
The logical data structure within an XML document is defined by a Document Type
Definition (DTD).
The user can create any tags needed to describe and manipulate their structure.
The following is a simple example of XML tagging:
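The example itself is missing from these notes; a minimal sketch of user-defined XML tags (all element names invented for illustration) might look like this:

```xml
<document>
   <title>Oil Exploration in Latin America</title>
   <author>
      <lastname>Smith</lastname>
      <firstname>J.</firstname>
   </author>
   <subject region="Mexico">oil wells</subject>
   <subject region="Peru">oil refineries</subject>
</document>
```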
Markov model example (cf. the Hidden Markov search model mentioned under Multimedia
Indexing):
Consider a Markov model of the stock market with three states observed at the closing
of the market: S1 (the market fell), S2 (the market did not change), and S3 (the market
rose).
The movement between states is defined by a state transition matrix.
Given that the market fell on one day (State 1), the matrix suggests that the
probability of the market not changing the next day is .1.
The model then allows questions such as the probability that the market will increase
for the next 4 days and then fall.
This is equivalent to the sequence SEQ = {S3, S3, S3, S3, S1}.
Assume the current state is dependent only upon the previous state (the first-order
Markov assumption).
The probability of the sequence is then the product of the initial state probability
and the transition probabilities:
P(SEQ) = P(S3) * P(S3|S3) * P(S3|S3) * P(S3|S3) * P(S1|S3)
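A minimal sketch of this calculation; the transition matrix and initial probabilities are invented, except for the fall-to-unchanged probability of .1 given in the text:

```python
# Hypothetical three-state market model: S1 = fell, S2 = unchanged,
# S3 = rose. Only a["S1"]["S2"] = 0.1 is fixed by the text above.
a = {                       # a[prev][cur] = P(cur | prev)
    "S1": {"S1": 0.4, "S2": 0.1, "S3": 0.5},
    "S2": {"S1": 0.2, "S2": 0.5, "S3": 0.3},
    "S3": {"S1": 0.1, "S2": 0.2, "S3": 0.7},
}
initial = {"S1": 0.3, "S2": 0.3, "S3": 0.4}   # assumed P(state on day 1)

def sequence_probability(seq):
    """First-order Markov assumption: each state depends only on
    the previous state."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= a[prev][cur]
    return p

seq = ["S3", "S3", "S3", "S3", "S1"]   # rise for 4 days, then fall
print(sequence_probability(seq))       # 0.4 * 0.7**3 * 0.1
```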