The transformation from the received item to the searchable data structure is called Indexing.
This process can be manual or automatic, creating the basis for direct search of items
in the Document Database or indirect search via Index Files.
Once the searchable data structure has been created, techniques must be defined that
correlate the user-entered query statement to the set of items in the database.
This process is called Search.
2.1 History
From 1966 to 1968 the Library of Congress ran its MARC I pilot project. MARC
(MAchine Readable Cataloging) standardizes the structure, contents and coding of
bibliographic records.
1965: DIALOG, the earliest commercial cataloging system, was developed by
Lockheed Corporation for NASA.
1978: DIALOG became commercial.
1988: DIALOG, which contained over 320 index databases, was sold to Knight-Ridder.
Indexing (cataloging), until recently, was accomplished by creating a bibliographic
citation in a structured file that references the original text.
The indexing process is typically performed by professional indexers associated with
library organizations.
Throughout the history of libraries, this has been the most important and most
difficult processing step.
Most items are retrieved based upon what the item is about.
The user’s ability to find items on a particular subject is limited by the indexer.
The initial introduction of computers to assist the cataloguing function did not change
its basic operation of a human indexer determining those terms to assign to a
particular item.
The user, instead of searching through physical cards in a card catalog, now
performed a search on a computer and electronically displayed the card equivalents.
In the 1990s, the significant reduction in the cost of processing power and memory,
along with access to the full text of an item in electronic form from the publishing
stages, allowed the full text of an item to be used as an alternative to the
indexer-generated subject index.
The searchable availability of the text of items has changed the role of indexers.
The indexer is no longer required to enter index terms that are redundant with words
in the text of an item.
2.2 Objective of Indexing
The objectives of indexing have changed with the evolution of Information Retrieval Systems.
total document indexing
The full text searchable data structure for items in the Document File provides
a new class of indexing called total document indexing.
In this environment, all of the words within the item are potential index
descriptors of the subject(s) of the item.
Current systems have the ability to automatically weight the processing tokens
based upon their potential importance in defining the concepts in the item.
Previously, indexing defined the source and major concepts of an item and
provided a mechanism for standardization of index terms (i.e., use of a
controlled vocabulary).
Controlled vocabulary
is a finite set of index terms from which all index terms must be selected (the
domain of the index).
In a manual indexing environment, the use of a controlled vocabulary makes
the indexing process slower, but potentially simplifies the search process.
The extra processing time comes from the indexer trying to determine the
appropriate index terms for concepts that are not specifically in the controlled
vocabulary set.
Controlled vocabularies aid the user in knowing the domain of terms that the
indexer had to select from.
Uncontrolled vocabularies
make indexing faster but the search process much more difficult.
The availability of items in electronic form changes the objectives of manual indexing.
The source information (frequently called citation data) can automatically be extracted.
Modern systems, with the automatic use of thesauri and other reference
databases, can account for diversity of language/vocabulary use and thus
reduce the need for controlled vocabularies.
Most of the concepts discussed in the document are locatable via search of the
total document index.
automatic text analysis algorithms
cannot consistently perform abstraction on all concepts that are in an item.
They cannot correlate the facts in an item in a cause/effect relationship to
determine additional related concepts to be indexed.
The words used in an item do not always reflect the value of the concepts
being presented.
It is the combination of the words and their semantic implications that contain
the value of the concepts being discussed.
The utility of a concept is also determined by the user’s need.
Public File indexer
needs to consider the information needs of all users of the
library system.
Individual users of the system have their own domains of interest that bound
the concepts in which they are interested.
It takes a human being to evaluate the quality of the concepts being discussed
in an item to determine if that concept should be indexed.
Private Index files
allow the user to logically subset the total document file into folders of
interest including only those documents that, in the user’s judgment, have
future value.
allow the user to judge the utility of the concepts based upon his need versus
the system need and perform concept abstraction.
Selective indexing
based upon the value of concepts, increases the precision of searches.
full document indexing
Availability of full document indexing saves the indexer from entering index
terms that are identical to words in the document.
Users may use Public Index files as part of their search criteria to increase the recall.
They may want to constrain the search by their Private Index file to increase the
precision of the search.
Figure 3.1 shows the potential relationship between use of the words in an item to
define the concepts.
Public Indexing of the concept adds additional index terms over the words in the item
to achieve abstraction.
The index file uses fewer terms because it only indexes the important concepts.
Private Index files are more focused, limiting the number of items indexed to those
that have value to the user and within items only the concepts bounded by the specific
user’s interest domain.
There is overlap between the Private and Public Index files, but the Private Index file
is indexing fewer concepts in an item than the Public Index file and the file owner
uses his specific vocabulary of index terms.
In addition to the primary objective of representing the concepts within an item to
facilitate the user’s finding relevant information, electronic indexes to items provide a
basis for other applications to assist the user.
The format of the index supports the ranking of the output to present the items most
likely to be relevant to the user’s needs.
The index can also be used to cluster items by concept.
The clustering of items has the effect of making an electronic system similar to a
physical library.
The paradigm of going to the library and browsing the book shelves in a topical area
is the same as electronically browsing through items clustered by concepts.
2.3 Indexing Process
The indexer is not an expert on all areas and has different levels of knowledge in the
different areas being presented in the item.
This results in different quality levels of indexing.
The indexer must determine when to stop the indexing process.
There are two factors involved in deciding on what level to index the concepts in an
item: exhaustivity and specificity.
Exhaustivity of indexing
is the extent to which the different concepts in the item are indexed.
For example, if two sentences of a 10-page item on microprocessors discuss
on-board caches, should this concept be indexed?
Specificity
relates to the preciseness of the index terms used in indexing.
For example, whether the term “processor” or “microcomputer” or “Pentium”
should be used in the index of an item.
Using general index terms yields low exhaustivity and specificity.
This approach requires a minimal number of index terms per item and
reduces the cost of generating the index.
Weighting of index terms is not common in manual indexing systems.
Weighting is the process of assigning an importance to an index term’s use in
an item.
The weight should represent the degree to which the concept associated with
the index term is represented in the item.
The process of assigning weights adds additional overhead on the indexer and
requires a more complex data structure to store the weights.
Positional roles treat the data as a vector allowing only one value per position.
Thus if the example is expanded so that the U.S. was introducing oil refineries in
Peru, Bolivia and Argentina, then the positional role technique would require three
entries, where the only difference would be in the value in the “affected country” position.
When modifiers are used, only one entry would be required and all three countries
would be listed with three “MODIFIER”s.
2.4 Automatic Indexing
Automatic indexing
Capability for the system to automatically determine the index terms to be
assigned to an item.
More complex processing is required when emulating a human indexer and
determining a limited number of index terms.
No additional indexing costs versus the salaries and benefits regularly paid to
human indexers.
Requires only a few seconds or less of computer time based upon the size of
the processor and the complexity of the algorithms to generate the index.
Consistency in the index term selection process as indexing is performed
automatically by an algorithm.
Indexes from automated indexing fall into two classes:
Unweighted indexing system
The index terms in a document and their word location(s) are kept in the searchable
data structure.
Queries against unweighted systems are based upon Boolean logic and the
items in the resultant Hit file are considered equal in value.
The last item presented in the file is as likely as the first item to be relevant to the
user’s information need.
Indexing by Term
There are two major techniques for creation of the index terms:
natural language
Statistical techniques
based upon vector models and probabilistic models with a special case being
Bayesian models.
The calculation of weights in those models uses statistical information such as the
frequency of occurrence of words and their distributions in the searchable database.
Figure shows the basic weighting approach for index terms or associations
between query terms and index terms.
If the goal is to rank the results of a search by the posteriors, Bayes'
rule can be simplified to a linear decision rule:
    $g(C_i) = \sum_k I(F_{ik}) \, w(F_{ik}, C_i)$
where I(Fik) is an indicator variable that equals 1 only if Fik is present in the
item (and equals zero otherwise) and w is a coefficient corresponding to a specific
feature/concept pair.
The function g is the sum of the weights of the features present in the item.
w is the weight corresponding to each feature (index term).
A careful choice of w produces a ranking in decreasing order that is equivalent to the
order produced by the posterior probabilities.
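The linear decision rule above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the feature names and the weight values are all made up for the example, and the weights are given by hand here where a real system would estimate them from training data.

```python
def score(item_features, weights):
    """g for one concept: sum the weights w of the features present in the item.

    The membership test plays the role of the indicator I(Fik):
    a feature contributes its weight only if it occurs in the item.
    """
    return sum(w for feature, w in weights.items() if feature in item_features)


def rank(items, weights):
    """Return item identifiers in decreasing order of g."""
    return sorted(items, key=lambda item_id: score(items[item_id], weights),
                  reverse=True)


# Hypothetical feature weights for a single concept (illustrative values only).
weights = {"oil": 2.0, "refinery": 1.5, "export": 0.5}

# Hypothetical items, each reduced to the set of features it contains.
items = {
    "doc1": {"oil", "export"},
    "doc2": {"oil", "refinery", "peru"},
    "doc3": {"weather"},
}

ranking = rank(items, weights)
print(ranking)
```

Because only a sum over the present features is needed, ranking an item costs time proportional to the number of weighted features rather than requiring the full posterior to be evaluated.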
The system then uses thesauri or other query expansion techniques to expand a query to
find the different ways the same thing has been represented.
The MatchPlus system
is an example of concept indexing.
Neural networks facilitate machine learning of concept/word relationships.
The goal is to determine word relationships (e.g., synonyms) and the strength of these
relationships, and to use that information in generating context vectors from the
corpus of items.
Multimedia Indexing
The automated indexing takes place in multiple passes of the information versus just a
direct conversion to the indexing structure.
The first pass in most cases is a conversion from the analog input mode into a digital
structure.
Then algorithms are applied to the digital structure to extract the unit of processing of
the different modalities that will be used to represent the item.
In an abstract sense this could be considered the location of a processing token in the
modality.
This unit will then undergo the final processing that will extract the searchable
features that represent the unit.
analog audio input
system will convert the audio to digital format and determine the phonemes
associated with the utterances.
The phonemes will be used as input to a Hidden Markov Search model that
will determine with a confidence level the words that were spoken.
A single phoneme can be divided into four states for the Markov model.
It is the textual words associated with the audio that become the searchable
structure.
In addition to storing the extracted index searchable data, a multimedia item also
needs to store some mechanism to correlate the different modalities during search.
There are two main mechanisms that are used: positional and temporal.
Positional
is used when the modalities are scattered in a linear sequential composition.
For example, a document that has images or audio inserted can be considered
a linear structure, and the only relationship between the modalities will be the
juxtaposition of each modality.
Temporal
is based upon time because the modalities are executing concurrently.
The typical video source off television is inherently a multimedia source.
It contains video, audio, and potentially closed captioning.
2.5 Information Extraction
The process of extracting facts to go into indexes is called Automatic File Build.
Its goal is to process incoming items and extract index terms that will go into a
structured database.
An extraction system analyzes only the portions of a document that contain information
relevant to the extraction criteria.
The objective is to update a structured database with additional facts.
The updates may be from a controlled vocabulary or substrings from the item as
defined by the extraction rules.
Document summarization
goal is to extract a summary of an item maintaining the most important ideas while
significantly reducing the size.
Examples of summaries that are often part of an item are titles, tables of contents, and
abstracts.
The abstract can be used to represent the item for search purposes or to determine the
utility of the item without having to read the complete item.
It is not feasible to automatically generate a coherent narrative summary of an item
with proper discourse, abstraction and language usage.
Restricting the domain of the item significantly improves the quality of the output.
Different algorithms produce different summaries.
Most automated algorithms summarize by calculating a score for each sentence and
then extracting the sentences with the highest scores.
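The score-and-extract approach described above can be sketched as follows. The scoring function used here (average corpus frequency of the words in a sentence) is an assumption chosen for brevity, not a method named in these notes; real summarizers use richer features such as sentence position and cue phrases. The function name `summarize` and its parameters are hypothetical.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole item.
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def sentence_score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Average word frequency, so long sentences are not favored by length.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    # Indices of the best sentences (ties keep the earlier sentence first).
    best = sorted(range(len(sentences)),
                  key=lambda i: sentence_score(sentences[i]),
                  reverse=True)[:max_sentences]
    return [sentences[i] for i in sorted(best)]


text = "Indexing builds the index. The index supports search. Cats sleep."
print(summarize(text, max_sentences=2))
```

Different scoring functions plugged into `sentence_score` would select different sentences, which is one reason different algorithms produce different summaries.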
2.6 Introduction to Data Structures
From an Information Retrieval System perspective, the two important aspects of a data
structure are:
its ability to represent concepts and their relationships
how well it supports location of those concepts
2.7 Stemming Algorithms
Languages have precise grammars that define their usage, but also evolve based upon
human usage.
Thus exceptions and non-consistent variants are always present in languages that
typically require ‘exception look-up tables’ in addition to the normal reduction rules.
Stemming can cause problems for Natural Language Processing (NLP) systems by
loss of information needed for aggregate levels of natural language processing
(discourse analysis).
The tenses of verbs may be lost in creating a stem, but they are needed to determine if
a particular concept being indexed occurred in the past or will be occurring in the future.
Time is one example of the type of relationships that are defined in Natural Language
Processing systems.
Stemming algorithm removes suffixes and prefixes, sometimes recursively, to derive
the final stem.
Other techniques such as table lookup and successor stemming provide alternatives
that require additional overheads.
Successor stemmers determine prefix overlap as the length of a stem is increased.
Table lookup requires a large data structure.
2.7b Porter Stemming Algorithm
The Porter Algorithm is based upon a set of conditions of the stem, suffix and prefix
and associated actions given the condition.
Some examples of stem conditions are:
1. The measure, m, of a stem is a function of sequences of vowels (a, e, i, o, u)
followed by a consonant.
If V is a sequence of vowels and C is a sequence of consonants, then m is defined by:
    $[C](VC)^m[V]$
where the initial C and final V are optional and m is the number of times VC repeats.
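The measure can be computed by mapping every letter to V or C, collapsing runs, and counting the VC pairs. The sketch below is a simplification that treats only a, e, i, o, u as vowels, as in the definition above; the full Porter algorithm also treats "y" as a vowel in some positions, which is ignored here. The function name `measure` is hypothetical.

```python
import re

VOWELS = "aeiou"  # simplification: "y" is always treated as a consonant


def measure(stem):
    """Return m, the number of VC repetitions in the form [C](VC)^m[V]."""
    # Replace each letter by its class, e.g. "trouble" -> "CCVVCCV".
    classes = "".join("V" if ch in VOWELS else "C" for ch in stem.lower())
    # Collapse runs of the same class, e.g. "CCVVCCV" -> "CVCV".
    collapsed = re.sub(r"(.)\1+", r"\1", classes)
    # Each remaining "VC" pair is one repetition.
    return collapsed.count("VC")


for word in ("tree", "trouble", "oaten"):
    print(word, measure(word))
```

Rules in the algorithm are typically guarded by a condition on m (for example, m > 0), so that suffixes are only stripped when enough of a stem remains.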
Given the word “duplicatable,” the following are the steps in the stemming process:
Another rule in step 4, removing “ic,” cannot be applied since only
one rule from each step is allowed to be applied.
2.7c Dictionary Look-Up Stemmers
In this approach, simple stemming rules may be applied.
The rules are taken from those that have the fewest exceptions (e.g., removing
pluralization from nouns).
Even the most consistent rules have exceptions.
The original term or stemmed version of the term is looked up in a dictionary and
replaced by the stem that best represents it.
This technique has been implemented in the INQUERY and RetrievalWare Systems.
The Kstem system uses the following six major data files to control and limit the
stemming process:
Dictionary of words (lexicon)
Supplemental list of words for the dictionary
Exceptions list for those words that should retain an “e” at the end (e.g.,
“suites” to “suite” but “suited” to “suit”)
Direct Conflation - allows definition of direct conflation via word pairs that
override the stemming algorithm
Country_Nationality - conflations between nationalities and countries
(“British” maps to “Britain”)
Proper Nouns - a list of proper nouns that should not be stemmed.
2.7d Successor Stemmers
The process determines the successor varieties for a word and uses this information to
divide a word into segments and selects one of the segments as the stem.
The successor variety of a segment of a word in a set of words is the number of
distinct letters that occupy the segment length plus one character.
The successor varieties of a word are used to segment a word by applying one of the
following four methods :
1. Cutoff method: a cutoff value is selected to define stem length. The value varies
for each possible set of words.
2. Peak and Plateau: a segment break is made after a character whose successor
variety exceeds that of the character immediately preceding it and the character
immediately following it.
3. Complete word method: break on boundaries of complete words
4. Entropy method: uses the distribution of successor variety letters.
Let |Dak| be the number of words beginning with the k length sequence of
letters a.
Let |Dakj| be the number of words in Dak with successor j.
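The entropy (Average Information) of |Dak| is:
    $H_{ak} = \sum_{j} -\frac{|D_{akj}|}{|D_{ak}|} \log_2 \frac{|D_{akj}|}{|D_{ak}|}$
where the sum runs over the possible successor letters j.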
Using this formula a set of entropy measures can be calculated for a word and
its predecessors.
To illustrate the use of successor variety stemming, consider the example below
where the task is to determine the stem of the word READABLE.
Using the complete word segmentation method, the test word "READABLE" will be
segmented into "READ" and "ABLE".
After a word has been segmented, the segment to be used as the stem must be selected.
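Hafer and Weiss used the following rule: if the first segment occurs in 12 or fewer
words in the database, the first segment is the stem; otherwise the second segment is
the stem.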
Selecting the first or second segment in general determines the appropriate stem.
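A minimal sketch of successor variety computation and the Peak and Plateau method for READABLE follows. The word set is an assumed illustrative corpus, not necessarily the one used in the original example, and the end of a word is not counted as a successor here. The function names are hypothetical.

```python
def successor_varieties(word, corpus):
    """For each prefix of word, count the distinct letters that follow it in corpus."""
    varieties = []
    for k in range(1, len(word) + 1):
        prefix = word[:k]
        # Letters that occupy position k in corpus words sharing this prefix.
        successors = {w[k] for w in corpus if w.startswith(prefix) and len(w) > k}
        varieties.append(len(successors))
    return varieties


def peak_and_plateau(word, corpus):
    """Break after a character whose variety exceeds both of its neighbors'."""
    sv = successor_varieties(word, corpus)
    segments, start = [], 0
    for i in range(1, len(word) - 1):
        if sv[i] > sv[i - 1] and sv[i] > sv[i + 1]:
            segments.append(word[start:i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


# Assumed corpus for illustration only.
corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

print(successor_varieties("READABLE", corpus))
print(peak_and_plateau("READABLE", corpus))
```

With this corpus the variety peaks after "READ", so the word is split into "READ" and "ABLE", the same segmentation the complete word method gives above.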
2.8 Inverted File Structure
The inverted file structure is used in both database management and Information
Retrieval Systems.
Inverted file structures are composed of three basic files:
document file
inversion lists (posting files)
dictionary
For each word, a list of documents in which the word is found is stored (the
inversion list for that word).
Each document in the system is given a unique numerical identifier that is stored in
the inversion list.
Dictionary is used to locate the inversion list for a particular word.
Dictionary is a sorted list of all unique words (processing tokens) in the system and a
pointer to the location of its inversion list.
Dictionaries can also store other information used in query optimization such as the
length of inversion lists.
Additional information may be used from the item to increase precision and provide a
more optimum inversion list file structure.
For example, if zoning is used, the dictionary may be partitioned by zone.
There could be a dictionary and set of inversion lists for the “Abstract” zone in an
item and another dictionary and set of inversion lists for the “Main Body” zone.
The inversion list contains the document identifier for each document in which the
word is found.
To support proximity, contiguous word phrases, and term weighting algorithms, all
occurrences of a word are stored in the inversion list along with the word position.
Thus if the word “bit” was the tenth, twelfth and eighteenth word in document #1,
then the inversion list would appear: bit - 1(10), 1(12), 1(18)
Weights can also be stored in inversion lists.
When a search is performed, the inversion lists for the terms in the query are located
and the appropriate logic is applied between inversion lists.
The result is a final list of items that satisfy the query.
For systems that support ranking, the list is reorganized into ranked order.
The document numbers are used to retrieve the documents from the Document File.
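The dictionary, inversion lists and search steps described above can be sketched with in-memory Python structures. This is an illustration under simplifying assumptions: whitespace tokenization, a plain dict standing in for the dictionary file, and only Boolean AND between inversion lists. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to its inversion list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        words = documents[doc_id].lower().split()
        for position, word in enumerate(words, start=1):
            index[word].append((doc_id, position))
    return index


def search_and(index, terms):
    """Locate the inversion list of every term and intersect the document ids."""
    doc_sets = [{doc_id for doc_id, _ in index.get(term.lower(), [])}
                for term in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


# Hypothetical document file: document id -> text.
documents = {
    1: "computer bit byte",
    2: "memory byte",
    3: "computer bit memory",
}

index = build_inverted_index(documents)
dictionary = sorted(index)            # sorted list of unique processing tokens
print(index["bit"])                   # inversion list with word positions
print(search_and(index, ["computer", "bit"]))
```

Storing the position with each document id is what makes proximity and contiguous phrase searches possible; a system that only needs Boolean retrieval could store the document id alone.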
Instead of using a dictionary to point to the inversion list, B-trees can be used.
The inversion lists may be at the leaf level or referenced in higher level pointers.
Figure 4.6 shows how the words in Figure 4.5 would appear.
2.9 N-Gram Data Structures
N-grams can be viewed as
a special technique for conflation (stemming).
a unique data structure in information systems that ignores words and treats
the input as continuous data, optionally limiting its processing by interword
symbols.
An n-gram is
a fixed length consecutive series of “n” characters, or fixed length overlapping
symbol segments, that define the searchable processing tokens.
These tokens have logical linkages to all the items in which the tokens are found.
To store the linkage data structure, inversion lists, document vectors and other
data structures are used in the search process.
N-grams do not care about semantics, unlike stemming (which determines the
stem of a word that represents the semantic meaning of the word).
Examples of bigrams, trigrams and pentagrams are given in Figure 4.7 for the word
phrase “sea colony”.
n-grams with n greater than two allow interword symbols to be part of the n-gram set.
The symbol # is used to represent the interword symbol (e.g., blank, period,
semicolon, colon, etc.).
Each of the n-grams created becomes a separate processing token and is searchable.
It is possible that the same n-gram can be created multiple times from a single word.
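N-gram generation for a phrase such as “sea colony” can be sketched as below. The function name and its `interword_symbol` parameter are hypothetical; the sketch assumes that n-grams never span two words and that, when an interword symbol is requested, it is placed on both sides of each word.

```python
def ngrams(text, n, interword_symbol=None):
    """Return the n-grams of each word in text, in order of occurrence."""
    grams = []
    for word in text.split():
        if interword_symbol is not None:
            # Mark the word boundaries, e.g. "sea" -> "#sea#".
            word = f"{interword_symbol}{word}{interword_symbol}"
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


print(ngrams("sea colony", 2))                        # bigrams, no interword symbol
print(ngrams("sea colony", 3, interword_symbol="#"))  # trigrams with "#"
print(ngrams("banana", 2))                            # "an" and "na" occur twice
```

The last call shows the point made above: a single word can generate the same n-gram more than once, so a system has to decide whether to store each occurrence or only the distinct tokens.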
2.10 PAT Data Structure
A continuous text input data structure is indexed in contiguous “n” character tokens
using n-grams with interword symbols between processing tokens.
A continuous text input data structure is addressed differently using PAT trees and
PAT arrays.
The name PAT is short for PAtricia Trees (PATRICIA stands for Practical Algorithm
To Retrieve Information Coded In Alphanumerics).
PAT tree
is an unbalanced, binary digital tree defined by the sistrings.
The individual bits of the sistrings decide the branching patterns with zeros
branching left and ones branching right.
PAT trees also allow each node in the tree to specify which bit is used to
determine the branching via bit position or the number of bits to skip from the
parent node.
This is useful in skipping over levels that do not require branching.
The key values are stored at the leaf nodes (bottom nodes) in the PAT Tree.
For a text input of size “n” there are “n” leaf nodes and at most “n-1” higher
level nodes.
It is possible to place additional constraints on sistrings for the leaf nodes.
Figure 4.10 gives an example of the sistrings used in generating a PAT Tree.
If the binary representation of “h” is (100), “o” is (110), “m” is (001) and “e”
is (101), then the word “home” produces the input 100110001101.
Using the sistrings, the full PAT binary tree is shown in Figure 4.11.
A more compact tree where skip values are in the intermediate nodes is shown
in Figure 4.12.
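The input stream and its sistrings for “home” can be generated with a short sketch. The three-bit codes are the ones given in the text; the function names are hypothetical, and the sketch assumes a sistring may start at any bit position, whereas a real system might restrict starts to character or word boundaries.

```python
# Three-bit codes from the example in the text.
CODES = {"h": "100", "o": "110", "m": "001", "e": "101"}


def encode(word, codes=CODES):
    """Concatenate the binary code of each character."""
    return "".join(codes[ch] for ch in word)


def sistrings(bits, count=None):
    """Semi-infinite strings: each starts at one bit position and runs to the end."""
    limit = len(bits) if count is None else min(count, len(bits))
    return [bits[start:] for start in range(limit)]


bits = encode("home")
print(bits)
for number, sistring in enumerate(sistrings(bits, count=4), start=1):
    print(f"sistring {number}: {sistring}")
```

Each sistring becomes a key in the PAT tree, and the bits of the key, read left to right, select the left (0) or right (1) branch at each node.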
Advantages and Disadvantages
PAT trees are ideal for prefix searches.
Suffix, imbedded string, and fixed length masked searches are easy if the total
input stream is used in defining the PAT tree.
Fuzzy searches are very difficult because a large number of possible sub-trees
could match the search term.
PAT arrays have more accuracy than Signature files.
They provide the ability to perform string searches that are inefficient in inverted
files (e.g., suffix searches, approximate string searches, longest repetition).
The PAT structure is not used in any major commercial products.
2.11 Signature File Structure
In Figure 4.13 the block size is set at five words
code length is 16 bits
number of bits that are allowed to be “1” for each word is five.
The words in a query are mapped to their signature.
Signature file
can be stored as a signature matrix with each row representing a signature block.
Associated with each row is a pointer to the original text block.
Design objective of a signature file system
is trading off the size of the structure versus the density of the final created
signatures.
Longer code lengths reduce the probability of collision in hashing the words
(i.e., two different words hashing to the same value).
Fewer bits per code reduce the effect of a code word pattern being in the final
block signature.
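Superimposed coding with the parameters mentioned above (blocks of five words, a 16-bit code, five bits set per word) can be sketched as follows. The way bit positions are chosen here, from SHA-256 digests of the word, is an assumption made so that the example is deterministic; it is not the hashing scheme of any particular system. All function names are hypothetical.

```python
import hashlib

CODE_LENGTH = 16    # bits in each signature
BITS_PER_WORD = 5   # bits set to "1" for each word
BLOCK_SIZE = 5      # words per block


def word_signature(word):
    """Set BITS_PER_WORD distinct bit positions derived from hashes of the word."""
    positions, counter = set(), 0
    while len(positions) < BITS_PER_WORD:
        digest = hashlib.sha256(f"{word.lower()}:{counter}".encode()).digest()
        positions.add(digest[0] % CODE_LENGTH)
        counter += 1
    signature = 0
    for position in positions:
        signature |= 1 << position
    return signature


def block_signatures(words):
    """OR together the word signatures of each block of BLOCK_SIZE words."""
    blocks = []
    for start in range(0, len(words), BLOCK_SIZE):
        signature = 0
        for word in words[start:start + BLOCK_SIZE]:
            signature |= word_signature(word)
        blocks.append(signature)
    return blocks


def may_contain(block_signature, query_word):
    """True if every bit of the query signature is set in the block signature."""
    query = word_signature(query_word)
    return (block_signature & query) == query


words = ["computer", "science", "graduate", "students", "study", "more"]
blocks = block_signatures(words)
print([format(b, "016b") for b in blocks])
print(may_contain(blocks[0], "computer"))
```

A match only means the block may contain the word, since other words can set the same bits; the text block still has to be checked. A block that fails the test definitely does not contain the word.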
Application / Advantages
Signature files provide a solution for storing and locating information in a
number of different situations.
Signature files have been applied to medium size databases, WORM devices, parallel
processing machines, and distributed environments.
2.12 Hypertext and XML Data Structures
A mechanism for representing information structure.
It differs from traditional information storage data structures in format and use.
Hypertext is stored in Hypertext Markup Language (HTML) and eXtensible
Markup Language (XML).
HTML and XML provide detailed descriptions for subsets of text, similar to
zoning, that increase search accuracy and improve display of hit results.
Hypertext Structure
used in the Internet environment
requires electronic media storage for the item.
Hypertext allows one item to reference another item via an imbedded pointer.
Each separate item is called a node.
Reference pointer is called a link.
Each node is displayed by a viewer that is defined for the file type associated with
the node.
Hypertext Markup Language (HTML)
defines the internal structure for information exchange across the World Wide
Web on the Internet.
A document is composed of the text of the item along with HTML tags that
describe how to display the document.
Tags are formatting or structural keywords contained between less-than, greater
than symbols (e.g., <title>, <strong>).
The HTML tag associated with hypertext linkages is <a href= …#NAME /a>
where “a” and “/a” are an anchor start tag and anchor end tag denoting the text
that the user can activate.
“href” is the hypertext reference containing either a file name if the referenced
item is on this node or an address (URL) and a file name if it is on another node.
“#NAME” defines a destination point other than the top of the item to go to.
The URL has three components:
access method the client used to retrieve the item
Internet address of the server where the item is stored
address of the item at the server
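The three URL components, together with the “#NAME” destination point, can be separated with Python's standard `urllib.parse` module. The URL below is a made-up example on the reserved example.com domain, and the dictionary keys are labels chosen for this sketch.

```python
from urllib.parse import urlsplit


def url_components(url):
    """Split a URL into the parts described in the text."""
    parts = urlsplit(url)
    return {
        "access_method": parts.scheme,    # how the client retrieves the item
        "server_address": parts.netloc,   # Internet address of the server
        "item_address": parts.path,       # address of the item at the server
        "destination": parts.fragment,    # "#NAME" point within the item
    }


components = url_components("http://www.example.com/docs/item.html#NAME")
for name, value in components.items():
    print(f"{name}: {value}")
```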
Figure 4.14 shows an example of a segment of a HTML document.
Dynamic HTML
Combination of the latest HTML tags, style sheets and programming that
help to create WEB pages that are more animated and responsive to user
interaction.
Supports features such as object-oriented view of a WEB page and its
elements, cascading style sheets, programming that can address most page
elements, and dynamic fonts.
Allows the specification of style sheets in a cascading fashion.
The eXtensible Markup Language (XML) is a standard data structure on the WEB.
Its objective is to extend HTML with semantic information.
The logical data structure within XML is defined by a Document Type Definition (DTD).
The user can create any tags needed to describe and manipulate their structure.
The following is a simple example of XML tagging:
2.13 Hidden Markov Models
The state will be one of three (the market fell, S1; did not change, S2; or rose, S3)
as observed at the closing of the market.
The movement between states can be defined by a state transition matrix with
state transitions.
Given that the market fell on one day (State 1), the matrix suggests that the
probability of the market not changing the next day is 1.
This then allows questions such as the probability that the market will increase
for the next 4 days then fall.
This would be equivalent to the sequence of SEQ = {S3, S3, S3, S3, S1}.
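Assume that, instead of the current state being dependent upon all the
previous states, it is only dependent upon the last state.
The probability of the sequence would then be calculated by the formula:
    $P(SEQ) = P(S_3) \cdot P(S_3 \mid S_3)^3 \cdot P(S_1 \mid S_3)$
where P(S3) is the probability of starting in state S3 and each P(Sj | Si) is the
transition probability from state Si to state Sj.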
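Under the assumption that each state depends only on the previous one, the probability of a sequence is the starting probability multiplied by one transition probability per step. The sketch below uses an invented transition matrix and invented initial probabilities; they are not the values of the matrix referred to in the text, and the function name is hypothetical.

```python
def sequence_probability(sequence, transitions, initial):
    """P(s1, ..., sn) = P(s1) * product of P(s_i | s_(i-1)) for a first-order chain."""
    probability = initial[sequence[0]]
    for previous, current in zip(sequence, sequence[1:]):
        probability *= transitions[previous][current]
    return probability


# Invented transition probabilities: transitions[from_state][to_state].
# S1 = market fell, S2 = no change, S3 = market rose. Each row sums to 1.
transitions = {
    "S1": {"S1": 0.2, "S2": 0.5, "S3": 0.3},
    "S2": {"S1": 0.3, "S2": 0.4, "S3": 0.3},
    "S3": {"S1": 0.2, "S2": 0.3, "S3": 0.5},
}

# Assume for the example that the sequence is known to start in S3.
initial = {"S1": 0.0, "S2": 0.0, "S3": 1.0}

seq = ["S3", "S3", "S3", "S3", "S1"]
p = sequence_probability(seq, transitions, initial)
print(p)
```

With these invented numbers the result is 1.0 × 0.5 × 0.5 × 0.5 × 0.2; only the last state matters at each step, which is what keeps the calculation to a single pass over the sequence.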
To add more flexibility, a probability function was allowed to be associated with
the state. The result is called the Hidden Markov Model.
The complete specification of an HMM requires
specification of the states
the output symbols
three probability measures: the state transitions, the output probability
functions and the initial states
The distributions are frequently called A, B, and π, and the following notation is used
to define the model:
    $\lambda = (A, B, \pi)$