IRS Unit-2

2. CATALOGING AND INDEXING

Indexing
 Indexing is the transformation from the received item to the searchable data structure.
 This process can be manual or automatic, and it enables either a direct search of items in the Document Database or an indirect search via Index Files.

Concept-based representation

 Instead of creating a searchable data structure from the words of an item, some systems transform the item into a different, concept-based representation.
 This concept-based representation is then used as the searchable data structure.

Search
 Search is the technique that correlates the user-entered query statement to the set of items in the database, once the searchable data structure has been created.

2.1 History

Cataloging
 Indexing was originally called cataloging.
 Cataloging is the oldest technique for identifying the contents of items to assist in their
retrieval.

Objective of cataloging:
 To give access points to a collection of information that are expected by and most useful
to the users.

History
 Up to the 19th Century there was little advancement in cataloging, only changes in the
methods used to represent the basic information.
 In the late 1800s subject indexing became hierarchical (e.g., Dewey Decimal System).
 In 1963 the Library of Congress initiated a study on the computerization of bibliographic records.
 From 1966 to 1968 the Library of Congress ran its MARC I pilot project. MARC (MAchine Readable Cataloging) standardizes the structure, contents and coding of bibliographic records.
 In 1965 Lockheed Corporation developed DIALOG, the earliest commercial cataloging system, for NASA.
 In 1978 DIALOG became commercially available.
 In 1988 DIALOG was sold to Knight-Ridder; by then it contained over 320 index databases.

 Indexing (cataloging), until recently, was accomplished by creating a bibliographic citation in a structured file that references the original text.
 The indexing process is typically performed by professional indexers associated with
library organizations.
 Throughout the history of libraries, this has been the most important and most difficult
processing step.
 Most items are retrieved based upon what the item is about.
 The user’s ability to find items on a particular subject is limited by the indexer.

 The initial introduction of computers to assist the cataloging function did not change its
basic operation of a human indexer determining those terms to assign to a particular
item.
 The user, instead of searching through physical cards in a card catalog, now performed
a search on a computer and electronically displayed the card equivalents.
 In the 1990s, the significant reduction in the cost of processing power and memory in
modern computers, along with access to the full text of an item from the publishing
stages in electronic form, allowed the use of the full text of an item as an alternative to
the indexer-generated subject index.
 The searchable availability of the text of items has changed the role of indexers.
 The indexer is no longer required to enter index terms that are redundant with words in
the text of an item.
2.2 Objectives of Indexing
 The objectives of indexing have changed with the evolution of Information Retrieval Systems. The main concepts involved are:
 Total document indexing
 Controlled vocabularies
 Uncontrolled vocabularies
 Automatic text analysis algorithms
 Public file indexing
 Private index files
 Selective indexing
 Full document indexing

 Total document indexing
 The full-text searchable data structure for items in the Document File provides a new class of indexing called total document indexing.
 In this environment, all of the words within the item are potential index descriptors of the subject(s) of the item.
 Current systems automatically weight the processing tokens based on their potential importance in defining the concepts in the item.

 Controlled vocabulary
 A finite set of index terms from which all index terms must be selected.
 In a manual indexing environment, a controlled vocabulary makes the indexing process slower but simplifies the search process.
 Controlled vocabularies aid the user in knowing the domain of terms.

Uncontrolled vocabularies
 Make indexing faster but the search process much more difficult.
 The source information (frequently called citation data) can be extracted automatically.
 Modern systems, with the automatic use of thesauri and other reference databases, reduce the need for controlled vocabularies.
 Automatic text analysis algorithms
 cannot consistently perform abstraction on all concepts that are in an item.
 They cannot correlate the facts in an item to determine additional related concepts to be indexed.
 The words used in an item do not always reflect the value of the concepts being
presented.
 It is the combination of the words and their semantic implications that contain
the value of the concepts being discussed.
 Public File indexer
 Public File indexer considers the information needs of all users of the library
system.
 Individual users of the system have their own domains of interest that bound the
concepts in which they are interested.
 It takes a human being to evaluate the quality of the concepts being discussed
in an item to determine if that concept should be indexed.
 Private Index files
 allows the user to logically subset the total document file into folders of interest
that, in the user’s judgment, have future value.
 allows the user to judge the utility of the concepts based upon his need versus
the system’s need and perform concept abstraction.
 Selective indexing
 is based upon the value of concepts to increase the precision of searches.
 full document indexing
 Availability of full document indexing saves the indexer from entering index
terms that are identical to words in the document.
2.3 Indexing Process

 When an organization with multiple indexers decides to create a public or private index, procedural decisions help the indexers and end users create consistent index terms. Two such decisions concern:
 the scope of the indexing
 the linking of index terms

Scope of Indexing
 defines the level of detail that the subject index will contain.
 This is based on the usage scenarios of the end users.

 When performed manually, the process of determining the bibliographic terms that
represent the concepts in an item is difficult.
 Problems arise from the interaction of two sources: the author and the indexer.
 The vocabulary domain of the author may differ from that of the indexer, causing the indexer to misinterpret the importance of particular concepts.
 The indexer is not an expert in all areas and has different levels of knowledge in the
different areas being presented in the item.
 This results in different quality levels of indexing.
 The indexer must determine when to stop the indexing process.

 There are two factors involved in deciding on what level to index the concepts in an
item:
 Exhaustivity
 Specificity.
 Exhaustivity of indexing
 is the extent to which the different concepts in the item are indexed.
 For example, if two sentences of a 10-page item on microprocessors discuss on-
board caches, should this concept be indexed?
 Specificity
 relates to the preciseness of the index terms used in indexing.
 For example, whether the term “processor” or “microcomputer” or “Pentium”
should be used in the index of an item.
 Using general index terms yields low exhaustivity and specificity.
 This approach requires a minimal number of index terms per item and reduces the cost of generating the index.

 Another decision on indexing is what portions of an item should be indexed.
 The simplest case is to limit the indexing to the Title or the Title and Abstract zones.

 Weighting
 Weighting of index terms is not common in manual indexing systems.
 Weighting is the process of assigning importance to an index term’s use in an
item.
 The weight should represent the degree to which the concept is associated with
the index term.
 The process of assigning weights adds additional overhead on the indexer and
requires a more complex data structure to store the weights.

Linkages
 Another decision on the indexing process is whether linkages are available between
index terms for an item.
 Linkages are used to correlate related attributes associated with concepts discussed in
an item.
 Precoordination
 The process of creating term linkages at index creation time.
 When index terms are not coordinated at index time, the coordination occurs at search time.
 Postcoordination
 Coordinating terms after (post) the indexing process.
 Implemented by "AND"ing index terms together, which only finds items whose indexes contain all of the search terms.
Figure 3.2 shows the different types of linkages.

 It assumes that an item discusses the drilling of oil wells in Mexico by CITGO and the
introduction of oil refineries in Peru by the U.S.
 When the linked capability is added, the system does not erroneously relate Peru and
Mexico since they are not in the same set of linked items.
 It still does not have the ability to discriminate between which country is introducing
oil refineries into the other country.
 Introducing roles in the last two examples of Figure 3.2 removes this ambiguity when
modifiers are used.
2.4 Automatic Indexing
 Human Indexing
 Advantages of human indexing
 ability to determine concept abstraction
 ability to judge the value of a concept.
 Disadvantages of human indexing
 cost, processing time, and lack of consistency
 Automatic indexing
 Capability for the system to automatically determine the index terms to be
assigned to an item.
 More complex processing is required.
 No additional indexing costs versus the salaries and benefits regularly paid to
human indexers.
 Requires only a few seconds or less of computer time based on the processor's
size and the algorithms' complexity to generate the index.
 Consistency in the index term selection process as indexing is performed
automatically by an algorithm.
 Indexes from automated indexing fall into two classes:
 Unweighted, weighted
 Unweighted indexing system
 index terms in a document and their word location(s) are kept in the searchable
data structure.
 Queries are based upon Boolean logic, and all items in the resultant Hit file are considered of equal value.
 The last item in the file is as likely as the first item to be relevant to the user's need.
 Weighted indexing system
 The weight of an index term is based upon a function of the frequency of occurrence of the term in the item.
 Values for the index terms are normalized between zero and one.
 The higher the weight, the more the term represents a concept discussed in the item.
 The query process uses these weights, along with any weights assigned to terms in the query, to determine a rank value (a small sketch follows).
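The following is a minimal Python sketch (not part of the original text) of frequency-based weight assignment; the max-frequency normalization is an illustrative assumption, not the only scheme used in practice.

```python
from collections import Counter

def term_weights(tokens):
    """Assign each index term a weight in (0, 1]: its frequency in the
    item divided by the frequency of the item's most common term.
    (Illustrative normalization; real systems use richer functions.)"""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

weights = term_weights("oil wells in mexico oil refineries in peru".split())
# weights["oil"] == 1.0, weights["mexico"] == 0.5
```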
 2.4.1 Indexing by Term
 2.4.2 Indexing by Concept
 2.4.3 Multimedia Indexing

Indexing by Term
 There are two major techniques for the creation of the index terms:
 statistical
 natural language
 Statistical techniques based upon
 vector models
 probabilistic models
 calculation of weights in those models using statistical information such as
 frequency of occurrence of words
 their distributions in the searchable database.
 Vector model (Example. SMART system)
 The system uses weights for information detection and stores these weights in
a vector form.
 Each vector represents a document.
 each position in a vector represents a different unique word (processing token)
in the database.
 The value assigned to each position is the weight of that term in the document.
 A value of zero indicates that the word was not in the document.
 Queries can be translated into the vector form.
 Search is accomplished by calculating the distance between the query vector and the document vectors, as in the sketch below.
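A minimal sketch of the vector model, assuming simple term-frequency weights and cosine similarity as the closeness measure (the SMART system's actual weighting is more elaborate):

```python
import math

def to_vector(tokens, vocabulary):
    """One position per unique word in the database; the value is the
    weight (here, raw term frequency) of that word in the item."""
    return [tokens.count(word) for word in vocabulary]

def cosine(u, v):
    """Closeness of a query vector to a document vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["oil", "refinery", "mexico", "peru"]
doc = to_vector(["oil", "oil", "refinery", "peru"], vocab)
query = to_vector(["oil", "refinery"], vocab)
print(cosine(query, doc))  # larger value = document closer to the query
```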
 Probabilistic model.
 Bayesian approach is the most successful model in this area.
 It is based upon the theories of evidential reasoning (drawing conclusions from
evidence).
 Bayesian approach could be applied as part of index term weighting by
calculating the relationship between an item and a specific query.
 A Bayesian network is a directed acyclic graph
 Each node represents a random variable.
 Arcs between the nodes represent a probabilistic dependence between
the node and its parents.

 Figure shows the basic weighting approach for index terms or associations
between query terms and index terms.

 The nodes C1 and C2 represent "the item contains concept Ci".
 The F nodes represent "the item has feature Fij" (e.g., words).
 The network can also be interpreted with C representing concepts in a query and F representing concepts in an item.
 The goal is to calculate the probability of Ci given Fij.
 To perform that calculation two sets of probabilities are needed:
 The prior probability P(Ci) that an item is relevant to concept Ci.
 The conditional probability P(Fij | Ci) that the features Fij, j = 1, ..., m, are present in an item given that the item contains topic Ci.
 The automatic indexing task is then to calculate the posterior probability P(Ci | Fi1, ..., Fim): the probability that the item contains concept Ci given the presence of features Fij. This follows from Bayes' rule, as sketched below.
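For illustration only (all numbers below are hypothetical), the posterior can be computed from Bayes' rule under the common assumption that features are conditionally independent given the concept:

```python
def posteriors(priors, cond_probs, features):
    """Return P(Ci | Fi1..Fim) for each concept Ci via Bayes' rule,
    assuming features are conditionally independent given the concept.

    priors[c]        -- prior probability P(Ci)
    cond_probs[c][f] -- conditional probability P(Fij | Ci)
    """
    scores = {}
    for c, prior in priors.items():
        score = prior
        for f in features:
            score *= cond_probs[c].get(f, 1e-6)  # tiny floor for unseen features
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical: does an item with features "oil" and "well" contain the
# concept "drilling" or the concept "refining"?
priors = {"drilling": 0.5, "refining": 0.5}
cond = {"drilling": {"oil": 0.8, "well": 0.7},
        "refining": {"oil": 0.8, "well": 0.1}}
print(posteriors(priors, cond, ["oil", "well"]))  # "drilling" dominates
```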
 Natural language techniques
 Perform more complex parsing to define the final set of index concepts.
 The weighted systems discussed above are also referred to as vectorized information systems.

 DR-LINK (Document Retrieval through Linguistic Knowledge) system
 Defines indexes to items via natural language processing.
 Processes items at the morphological, lexical, semantic, syntactic, and discourse levels.
 Each level uses information from the previous level to perform its additional analysis.
 The discourse level abstracts information beyond the sentence level and can determine abstract concepts.
 This allows the indexing to include specific terms as well as abstract concepts such as time.
Indexing by Concept
 Indexing by term treats each term occurrence as a different index entry, and then uses thesauri or other query-expansion techniques to find the different ways the same idea has been represented.
 The basis for concept indexing is that there are many ways to express the same idea, and retrieval performance increases when a single representation is used.
 Concept indexing determines a related set of concepts based upon a test set of terms, and uses those concepts for indexing all items.

 MatchPlus system
 An example of concept indexing.
 Neural networks facilitate machine learning of concept/word relationships.
 The goal is to determine word relationships (e.g., synonyms) and the strength of these relationships from the corpus of items, and to use that information in generating context vectors.
 Two neural networks are used:
 One neural network learning algorithm generates stem context vectors that are sensitive to similarity of use.
 Another performs query modification based upon user feedback.
 Context vectors
 Word stems, items and queries are represented by high-dimensional (at least 300 dimensions) vectors.
 Each dimension in a vector can be viewed as an abstract concept class.
Multimedia Indexing
 The automated indexing takes place in multiple passes of the information.
 The first pass is a conversion from the analog input mode into a digital structure.
 Then algorithms are applied to the digital structure to extract the unit of processing of
the different modalities that will be used to represent the item.
 In an abstract sense this could be considered the location of a processing token in the
modality.
 This unit will then undergo the final processing that will extract the searchable features
that represent the unit.

 Indexing video or images can be accomplished:
 at the raw data level (e.g., the aggregation of raw pixels)
 at the feature level, distinguishing primitive attributes such as color and luminance
 at the semantic level, where meaningful objects are recognized (e.g., an airplane in the image/video frame)

 analog audio input

 system will convert the audio to digital format and determine the phonemes
associated with the utterances.
 The phonemes will be used as input to a Hidden Markov Search model that will
determine with a confidence level the words that were spoken.
 A single phoneme can be divided into four states for the Markov model.
 It is the textual words associated with the audio that becomes the searchable
structure.
 In addition to storing the extracted index searchable data, a multimedia item also needs
to store some mechanism to correlate the different modalities during search.
 There are two main mechanisms that are used
 Positional
 Temporal
 Positional
 Used when the modalities are scattered in a linear sequential composition.
 For example, a document that has images or audio inserted can be considered a linear structure, and the only relationship between the modalities will be the position of each modality.
 Temporal
 based upon time because the modalities are executing concurrently.
 The typical video source of television is inherently a multimedia source.
 It contains video, audio, and potentially closed captioning.

 Synchronized Multimedia Integration Language (SMIL)
 A mark-up language designed to support multimedia presentations that integrate text (e.g., from slides or free running text) with audio, images and video.
 Time is the mechanism that is used to synchronize the different modalities.
 Indexing must therefore include a time-offset parameter rather than a physical displacement.
2.5 Information Extraction

 There are two processes associated with information extraction:
 determination of facts to go into structured fields in a database
 extraction of text that can be used to summarize an item
 In fact determination, only a subset of the important facts in an item may be identified and extracted.
 In summarization, all the major concepts in the item should be represented.

Extraction
 The process of extracting facts to go into indexes is called Automatic File Build.
 Its goal is to process incoming items and extract index terms that will go into a
structured database.
 Extraction system analyzes only the portions of a document that contain information
relevant to the extraction criteria.
 objective is to update a structured database with additional facts.
 The updates may be from a controlled vocabulary or substrings from the item as defined
by the extraction rules.

 The term "slot" is used to define a particular category of information to be extracted.
 Slots are organized into templates or semantic frames.
 Information extraction requires multiple levels of analysis of the text of an item.
 It must understand the words and their context (discourse analysis).

 Metrics to compare information extraction (a computational sketch follows this list):
 Recall
 Refers to how much information was extracted from an item versus how much should have been extracted from the item.
 It shows the amount of correct and relevant data extracted versus the correct and relevant data in the item.
 Precision
 Refers to how much information was extracted accurately versus the total information extracted.
 Overgeneration
 Measures the amount of irrelevant information that is extracted.
 This could be caused by templates filled on topics that are not intended to be extracted, or by slots filled with non-relevant data.
 Fallout
 Measures how much a system assigns incorrect slot fillers as the number of potential incorrect fillers increases.
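The following sketch shows one way these metrics might be computed from slot-fill counts; the exact formulas used in evaluations such as MUC vary, so treat these as illustrative:

```python
def extraction_metrics(correct, incorrect, missed, spurious):
    """Illustrative extraction metrics from slot-fill counts:
    correct   -- slots filled with the right value
    incorrect -- slots filled with a wrong value
    missed    -- slots that should have been filled but were not
    spurious  -- slots filled that should not have been filled at all
    """
    extracted = correct + incorrect + spurious
    expected = correct + incorrect + missed
    recall = correct / expected if expected else 0.0
    precision = correct / extracted if extracted else 0.0
    overgeneration = spurious / extracted if extracted else 0.0
    return recall, precision, overgeneration

print(extraction_metrics(correct=8, incorrect=1, missed=3, spurious=2))
```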

Document summarization
 goal is to extract a summary of an item maintaining the most important ideas while
significantly reducing the size.
 Examples of summaries are parts of items such as titles, tables of contents, and abstracts.
 The abstract can be used to represent the item for search purposes, or to let the user judge the item without having to read it completely.
 It is not yet feasible to automatically generate a coherent narrative summary of an item with proper discourse, abstraction and language usage.
 Restricting the domain of the item significantly improves the quality of the output.
 Different algorithms produce different summaries.
 Most automated algorithms summarize by calculating a score for each sentence and then extracting the sentences with the highest scores, as in the sketch below.
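A minimal sketch of such a sentence-scoring summarizer, using word frequency in the item as the (assumed) scoring function:

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Extractive summary: score each sentence by the average frequency
    of its words across the whole item, then return the top-scoring
    sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)
```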
2.6 Introduction to Data Structures

 From an Information Retrieval System perspective, the two aspects of a data structure
are
 its ability to represent concepts and their relationships
 how well it supports location of those concepts

 There are two major data structures in any information system:
 One structure stores and manages the received items in their normalized form; the component that does this is called the document manager.
 The other contains the processing tokens and associated data needed to support search.
2.7 Stemming Algorithms

 Goal of stemming
o improve performance
o reduce system resources by reducing the number of unique words that a system
has to contain.
 The stemming process creates one large index for the stem.

 Introduction to the Stemming Process
 Porter Stemming Algorithm
 Dictionary Look-Up Stemmers
 Successor Stemmers

2.7a Introduction to the Stemming Process

 The stem carries the meaning of the concept associated with the word, while the affixes introduce subtle modifications to that concept.
 Stemming algorithms
 improve the efficiency of the information system
 improve recall.
 Conflation
 refers to mapping multiple morphological variants to a single representation
(stem).

 Languages have precise grammars that define their usage, but also evolve based upon
human usage.
 Thus exceptions and non-consistent variants are always present in languages that
typically require ‘exception look-up tables’ in addition to the normal reduction rules.

 The idea of equating multiple representations of a word to a single stem term is to provide significant compression.
 For example, the stem "comput" could associate "computable, computability, computation, computational, computed, computing, computer, computerese, computerize" with one compressed word.
 In practice, the compression from stemming does not significantly reduce storage requirements, and misspellings and proper names reduce the compression even more.
 Another major use of stemming is to improve recall.
 As long as a semantically consistent stem can be identified for a set of words, the stemming process helps to avoid missing relevant items.

 Stemming the words "calculate, calculates, calculation, calculations, calculating" to a single stem "calculat" assures that whichever of those terms is entered by the user, it is translated to the stem and finds all the variants in whatever items they exist.
 Stemming cannot improve precision, since the precision value is not based on finding all relevant items but on minimizing the retrieval of non-relevant items.

 Stemming can cause problems for Natural Language Processing (NLP) systems by loss
of information needed for aggregate levels of natural language processing (discourse
analysis).
 The tenses of verbs may be lost in creating a stem, but they are needed to determine whether a particular concept being indexed occurred in the past or will occur in the future.
 Time is one example of the type of relationships that are defined in Natural Language
Processing systems.
 Stemming algorithm removes suffixes and prefixes, sometimes recursively, to derive
the final stem.

 Other techniques such as table lookup and successor stemming provide alternatives, at the cost of additional overhead.
 Successor stemmers determine prefix overlap as the length of a stem is increased.
 Table lookup requires a large data structure.
2.7b Porter Stemming Algorithm

 The Porter Algorithm is based upon a set of conditions of the stem, suffix and prefix
and associated actions given the condition.
 Some examples of stem conditions are:
 1. The measure, m, of a stem is a function of sequences of vowels (a, e, i, o, u) followed by a consonant.
 If V is a sequence of vowels and C is a sequence of consonants, then every word can be written as
 [C](VC)^m[V]
 where the initial C and final V are optional and m is the number of VC repetitions.

 Suffix conditions take the form: current_suffix == pattern
 Actions are in the form: old_suffix -> new_suffix
 Rules are divided into steps that define the order of applying the rules.

 Given the word "duplicatable," the stemming proceeds in steps: "duplicatable" reduces to "duplicat" (rule 4), then to "duplicate" (rule 1b1), and finally to "duplic" (rule 3).
 The application of another rule in step 4, removing "ic," cannot be applied since only one rule from each step is allowed to be applied.
 A quick way to experiment with the algorithm follows.
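NLTK ships an implementation of the Porter algorithm; its variant differs slightly from the textbook's rule set, so outputs may not match the worked example exactly:

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
for word in ["duplicatable", "computable", "computation", "computing"]:
    print(word, "->", stemmer.stem(word))
```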
2.7c Dictionary Look-Up Stemmers
 In this approach, simple stemming rules may be applied.
 The rules are taken from those that have the fewest exceptions (e.g., removing
pluralization from nouns).
 The original term or stemmed version of the term is looked up in a dictionary and
replaced by the stem that best represents it.

Implementation of dictionary look-up stemmers in commercial systems:
 The dictionary look-up stemmer approach has been implemented in the INQUERY and RetrievalWare Systems.

 INQUERY system:
 Uses a stemming technique called Kstem.
 Kstem is a morphological analyzer that reduces morphological variants to a root form.
 For example, 'elephants' -> 'elephant', 'amplification' -> 'amplify', and 'european' -> 'europe'.
 It tries to avoid collapsing words with different meanings into the same root.
 For example, a purely suffix-stripping stemmer might reduce both "memorial" and "memorize" to "memory", yet "memorial" and "memorize" are not synonyms and have very different meanings.

 The Kstem system uses the following six major data files to control and limit the
stemming process:
 Dictionary of words (lexicon)
 Supplemental list of words for the dictionary
 Exceptions list for those words that should retain an “e” at the end (e.g., “suites”
to “suite” but “suited” to “suit”)
 Direct Conflation - allows definition of direct conflation via word pairs that
override the stemming algorithm
 Country_Nationality - conflations between nationalities and countries
(“British” maps to “Britain”)
 Proper Nouns - a list of proper nouns that should not be stemmed.
 Like other stemmers associated with Natural Language Processors and dictionaries, Kstem returns words instead of truncated word forms.
 Generally, Kstem requires a word to be in the dictionary before it reduces one word
form to another.
 Some endings are always removed, even if the root form is not found in the dictionary
(e.g.,‘ness’, ‘ly’).
 If the word being processed is in the dictionary, it is assumed to be unrelated to the root
after stemming and conflation is not performed (e.g.,‘factorial’ needs to be in the
dictionary or it is stemmed to ‘factory’).
 For irregular morphologies, it is necessary to explicitly map the word variant to the root
desired (for example, “matrices” to “matrix”).

 RetrievalWare System:
 The strength of the RetrievalWare System lies in its Thesaurus/Semantic Network
support data structure that contains over 400,000 words.
 The dictionaries contain the morphological variants of words.
 New words that are not special forms (e.g., dates, phone numbers) are located in the
dictionary to determine simpler forms by stripping off suffixes and respelling plurals as
defined in the dictionary.
2.7d Successor Stemmers

 The process determines the successor varieties for a word and uses this information to
divide a word into segments and selects one of the segments as the stem.
 The successor variety of a segment of a word, computed over a set of words, is the number of distinct letters that follow that segment in the word set.

 A graphical representation of successor variety is shown in a symbol tree.
 The figure shows the symbol tree for the terms bag, barn, bring, both, box, and bottle.
 The successor variety for any prefix of a word is the number of children associated with the node in the symbol tree representing that prefix.
 For example, the successor variety for the first letter "b" is three. The successor variety for the prefix "ba" is two.
 The successor varieties of a word are used to segment a word by applying one of the
following four methods :
1. Cutoff method: a cutoff value is selected to define stem length. The value varies
for each possible set of words.
2. Peak and Plateau: a segment break is made after a character whose successor
variety exceeds that of the character immediately preceding it and the character
immediately following it.
3. Complete word method: break on boundaries of complete words
4. Entropy method: uses the distribution of successor variety letters.
 Let |Dak| be the number of words beginning with the k-length sequence of letters a.
 Let |Dakj| be the number of words in Dak whose successor is the letter j.
 The entropy (average information) of Dak is:
 H(Dak) = Σj −(|Dakj| / |Dak|) · log2(|Dakj| / |Dak|)
 Using this formula a set of entropy measures can be calculated for a word and its predecessors.

Example: determining the stem of the word READABLE.
 Using the complete word segmentation method, the test word "READABLE" is segmented into "READ" and "ABLE".
 After a word has been segmented, the segment to be used as the stem must be selected.
 Hafer and Weiss used the following rule: if the first segment occurs in no more than 12 words of the corpus, it is selected as the stem; otherwise the second segment is selected.
 Selecting the first or second segment in general determines the appropriate stem.
 A sketch of the successor-variety computation follows.
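A small Python sketch of computing successor varieties against a word set, reproducing the symbol-tree example above:

```python
def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` in the word set."""
    followers = {w[len(prefix)] for w in corpus
                 if w.startswith(prefix) and len(w) > len(prefix)}
    return len(followers)

corpus = ["bag", "barn", "bring", "both", "box", "bottle"]
print(successor_variety("b", corpus))   # 3 (a, r, o)
print(successor_variety("ba", corpus))  # 2 (g, r)
```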
2.8 Inverted File Structure

 Inverted file structure used in both database management and Information Retrieval
Systems.
 Inverted file structures are composed of three basic files:
 document file
 inversion lists (posting files)
 dictionary

 Each document in the system is given a unique numerical identifier, and these identifiers are stored in the inversion lists.
 The dictionary is used to locate the inversion list for a particular word.
 For each word, a list of the documents in which the word is found is stored (the inversion list for that word).
 The dictionary is a sorted list of all unique words (processing tokens) in the system, each with a pointer to the location of its inversion list.

 Additional information from the item may be used to increase precision and provide a more optimal inversion list file structure.
 For example, if zoning is used, the dictionary may be partitioned by zone.
 There could be a dictionary and set of inversion lists for the "Abstract" zone in an item and another dictionary and set of inversion lists for the "Main Body" zone.
 An inversion list contains the document identifier for each document in which the word is found; a minimal sketch follows.
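A minimal sketch of the three-part structure (word positions and zoning omitted for brevity):

```python
from collections import defaultdict

def build_inverted_file(documents):
    """documents: mapping of unique numerical identifier -> text.
    Returns the dictionary: each unique word mapped to its inversion
    list (the sorted document identifiers containing that word)."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {1: "oil wells in mexico", 2: "oil refineries in peru"}
index = build_inverted_file(docs)
print(index["oil"])  # [1, 2] -- the inversion list for "oil"
```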
2.9 N-Gram Data Structures

 N-Grams
 a special technique for conflation (mapping).
 a unique data structure in information systems that ignores words and treats the
input as continuous data (optionally limiting its processing by interword
symbols).
 a fixed-length consecutive series of “n” characters or fixed-length overlapping
symbol segments that define the searchable processing tokens.
 These tokens have logical linkages to all the items in which the tokens are found.
 To store the linkage data structure, inversion lists, document vectors and other
data structures are used in the search process.
 n-grams do not care about semantics, unlike stemming (which determines the
stem of a word that represents the semantic meaning of the word).

 Examples of bigrams, trigrams, and pentagrams are given in Figure 4.7 for the word
phrase “sea colony”.
 n-grams with n greater than two allow interword symbols to be part of the n-gram set.
 The symbol # is used to represent the interword symbol (e.g., blank, period, semicolon,
colon, etc.).

 Each of the n-grams created becomes a separate processing token and is searchable.
 It is possible that the same n-gram can be created multiple times from a single word; a generation sketch follows.
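A sketch of n-gram generation, using one common convention for the interword symbol (for bigrams each word is processed separately; for longer n-grams '#' joins the words into one stream):

```python
def ngrams(text, n, interword="#"):
    """Overlapping n-character processing tokens from continuous text."""
    if n > 2:
        stream = interword.join(text.split())
        return [stream[i:i + n] for i in range(len(stream) - n + 1)]
    return [w[i:i + n] for w in text.split()
            for i in range(len(w) - n + 1)]

print(ngrams("sea colony", 2))  # ['se', 'ea', 'co', 'ol', 'lo', 'on', 'ny']
print(ngrams("sea colony", 3))  # ['sea', 'ea#', 'a#c', '#co', 'col', ...]
```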

 Advantages of n-grams
 They place an upper bound on the number of unique n-grams that can be generated, limiting the number of searchable tokens.
 This supports fast processing on minimally sized machines.
 Disadvantages
 False hits can occur under some architectures.
 The inversion lists that store the linkage data structure increase in size, because n-grams expand the number of processing tokens.
 There is no semantic meaning in a particular n-gram, since it is a fragment of a processing token and may not represent a concept.
 This yields a poor representation of concepts and their relationships.
2.10 PAT Data Structure

 N-gram data structures index continuous text as contiguous "n"-character tokens, with interword symbols treated as part of the stream.
 PAT trees and PAT arrays address a continuous text input data structure differently.
 The name PAT is short for Patricia trees (PATRICIA stands for Practical Algorithm To Retrieve Information Coded In Alphanumerics).

 Sistring (semi-infinite string)
 The input stream is transformed into a searchable data structure consisting of substrings called sistrings (semi-infinite strings).
 All sistrings are unique.
 A sistring can start at any point in the text and is uniquely identified by its starting location and length.
 A sistring may go beyond the length of the input stream; additional null characters are appended to pad it.
 Figure 4.9 shows some possible sistrings for an input text; a small generation sketch follows.
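A sketch of fixed-length sistring generation, padding with null characters when a sistring runs past the end of the input stream:

```python
def sistrings(text, length):
    """Fixed-length sistrings: one starting at every text position,
    padded with nulls beyond the end of the input stream."""
    padded = text + "\0" * length
    return [(i, padded[i:i + length]) for i in range(len(text))]

for start, s in sistrings("home", 6):
    print(start, repr(s))
```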
 PAT tree
 is an unbalanced, binary digital tree defined by the sistrings.
 The individual bits of the sistrings decide the branching patterns with zeros
branching left and ones branching right.
 PAT trees also allow each node in the tree to specify which bit is used to
determine the branching via bit position or the number of bits to skip from the
parent node.
 This is useful in skipping over levels that do not require branching.
 The key values are stored at the leaf nodes in the PAT tree.
 For a text input of size "n" there are "n" leaf nodes and at most "n-1" higher-level nodes.
 It is possible to place additional constraints on the sistrings used for the leaf nodes.

 Figure 4.10 gives an example of the sistrings used in generating a PAT Tree.
 If the binary representations of “h” is (100), “o” is (110), “m” is (001) and“e”
is (101) then the word “home” produces the input 100110001101.
 Using the sistrings, the full PAT binary tree is shown in Figure 4.11.
 Advantages and disadvantages
 PAT trees are ideal for prefix searches.
 Suffix, imbedded string, and fixed-length masked searches are also easy if the total input stream is used in defining the PAT tree.
 Fuzzy searches are very difficult, because a large number of possible sub-trees could match the search term.
 PAT arrays have better accuracy than signature files and support string searches that are inefficient in inverted files (e.g., suffix searches, approximate string searches, longest repetition).
 The structure is not used in any major commercial products.
2.11 Signature File Structure

 The goal of a signature file structure is to provide a fast test to eliminate the majority of items that are not related to a query.
 The items that satisfy the test can either be evaluated by another search algorithm to eliminate additional false hits, or delivered to the user for review.
 The text of the items is represented in a highly compressed form that facilitates the fast test.

 Signature file search
 A linear scan of the compressed version of the items, producing a response time that is linear with respect to file size.
 The signature search file is created using superimposed coding.
 Superimposed coding
 The coding is based on the words in the item.
 Each word is mapped into a "word signature".
 Word signature
 A fixed-length code with a fixed number of bits set.
 A bit pattern of size F, with m bits set to "1" and the rest "0".
 The bit positions to set are determined by a hash function of the word.
 The word signatures are ORed together to create the signature of an item.
 To avoid signatures becoming too dense with "1"s, a maximum number of words is specified and an item is partitioned into blocks of that size.
 Example
 In Figure 4.13 the block size is set at five words, the code length is 16 bits, and the number of bits allowed to be "1" for each word is five.
 The words in a query are mapped to their signatures in the same way.
 Signature file
 The signature file can be stored as a matrix, with each row representing a signature block.
 Associated with each row is a pointer to the original text block. A hedged coding sketch follows.
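A sketch of superimposed coding; the hash function and parameters are illustrative assumptions (F = 16 and up to five bits per word, as in the example above):

```python
import hashlib

F, M = 16, 5  # code length and bits set per word signature

def word_signature(word):
    """Set up to M of the F bit positions, chosen by hashing the word.
    (Hash collisions may set fewer than M distinct bits.)"""
    sig, digest = 0, hashlib.md5(word.encode()).digest()
    for i in range(M):
        sig |= 1 << (digest[i] % F)
    return sig

def block_signature(words):
    """OR the word signatures together to form the block signature."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def may_contain(block_sig, query_word):
    """Fast test: all query-signature bits must be set in the block
    signature. False hits are possible; misses are not."""
    q = word_signature(query_word)
    return block_sig & q == q

block = block_signature(["computer", "science", "graduate", "students", "study"])
print(may_contain(block, "computer"))  # True
```

A query word outside the block usually fails this test, but a false hit can occur when its hashed bits happen to be covered by other words' signatures.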

 Applications / advantages
 Signature files provide a solution for storing and locating information in a number of different situations.
 Signature files have been applied to medium-size databases, WORM devices, parallel processing machines, and distributed environments.
2.12 Hypertext and XML Data Structures

Hypertext:
 A mechanism for representing information structure.
 It differs from traditional information storage data structures in format and use.
 Hypertext is stored in Hypertext Markup Language (HTML) and eXtensible
Markup Language (XML).
 HTML and XML provide detailed descriptions for subsets of text, similar to zoning, which increases search accuracy and improves the display of hit results.

 Hypertext Structure
 used in the Internet environment
 requires electronic media storage for the item.
 Hypertext allows one item to reference another item via an imbedded pointer.
 Each separate item is called a node.
 Reference pointer is called a link.
 Each node is displayed by a viewer that is defined for the file type associated with
the node.
 Hypertext Markup Language (HTML)
 defines the internal structure for information exchange across the World Wide Web
on the Internet.
 A document is composed of the text of the item along with HTML tags that describe
how to display the document.
 Tags are formatting or structural keywords contained between less-than, greater
than symbols (e.g., <title>, <strong>).
 The HTML tag associated with hypertext linkages is <a href="…#NAME">anchor text</a>,
 where <a> and </a> are the anchor start and end tags denoting the text that the user can activate.
 "href" is the hypertext reference, containing either a file name if the referenced item is on this node, or an address (URL) and a file name if it is on another node.
 "#NAME" defines a destination point other than the top of the item to go to.
 The URL has three components:
 access method the client used to retrieve the item
 Internet address of the server where the item is stored
 address of the item at the server
 Figure 4.14 shows an example of a segment of a HTML document.

 Hypertext is a non-sequential directed graph structure, where each node contains its
own information.
 A node may have several outgoing links, each of which is then associated with some
smaller part of the node called an anchor.
 When an anchor is activated, the associated link is followed to the destination node,
thus navigating the hypertext network.

 In a hypertext environment, the user "navigates" through the node network by following links.
 Hypertext references are used to include information other than text (e.g., graphics, audio, photographs, video) in a text item.

 Dynamic HTML
 Combination of the latest HTML tags, style sheets, and programming that
help to create WEB pages that are more animated and responsive to user
interaction.
 Supports features such as an object-oriented view of a WEB page and its elements, cascading style sheets, programming that can address most page elements, and dynamic fonts.
XML:
 The Extensible Markup Language (XML) is a standard data structure on the WEB.
 The objective is to extend HTML with semantic information.
 The logical data structure within XML is defined by a Document Type Definition (DTD).
 The user can create any tags needed to describe and manipulate their structure.
 The following is a simple example of XML tagging:
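The original figure is not reproduced here; a hypothetical tagging of bibliographic data with user-defined tags might look like:

```xml
<document>
  <title>Oil Exploration in Mexico</title>
  <author>J. Smith</author>
  <abstract>Drilling of oil wells in Mexico by CITGO ...</abstract>
</document>
```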

 Resource Description Framework (RDF)
 Used to represent properties of WEB resources such as images and documents, and the relationships between them.
 This includes the Platform for Internet Content Selection (PICS) for attaching labels to material for content filtering (e.g., unsuitable for children).
 Hypertext links for XML
 These are defined in the XLink (XML Linking Language) and XPointer (XML Pointer Language) specifications.
 They allow different types of links to locations within a document and external to the document.
 They allow an application to know whether a link is a positioning reference within an item or a link to another document.
 They help in determining what needs to be retrieved to define the total item.
 XML Style Sheet Linking
 Defines how to display items using a particular style sheet and handles cascading style sheets.
2.13 Hidden Markov Models

 Markov process assumption
 The future is independent of the past, given the present.
 In other words, assuming we know our present state, we do not need any other historical information to predict the future state.

 Example: a three-state Markov model of the stock market.
 The states, observed at the closing of the market, are: State 1 - the market fell; State 2 - the market was unchanged; State 3 - the market rose.
 Given that the market fell on one day (State 1), the transition matrix suggests that the probability of the market not changing the next day is .1.
 This allows questions such as: what is the probability that the market will increase for the next 4 days and then fall?
 This is equivalent to the sequence SEQ = {S3, S3, S3, S3, S1}.
 Assume the current state is dependent only upon the previous state.
 The probability is then calculated by the formula:
 P(SEQ) = P(S3) · P(S3|S3) · P(S3|S3) · P(S3|S3) · P(S1|S3)
 The following graph depicts the model.
 The movement between states can be defined by a state transition matrix.
 The directed lines indicate the state transition probabilities ai,j; a small computational sketch follows.
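A small computational sketch of the model; the transition values are illustrative assumptions, except that P(unchanged | fell) = .1 is taken from the discussion above:

```python
# States (zero-indexed): 0 = market fell, 1 = unchanged, 2 = market rose.
A = [[0.4, 0.1, 0.5],   # from State 1; P(unchanged | fell) = .1 as above
     [0.3, 0.4, 0.3],   # from State 2 (illustrative values)
     [0.2, 0.2, 0.6]]   # from State 3 (illustrative values)

def sequence_probability(states, start_prob):
    """First-order Markov chain: probability of the first state times
    the product of the transition probabilities along the sequence."""
    p = start_prob
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

SEQ = [2, 2, 2, 2, 0]  # S3, S3, S3, S3, S1: rises 4 days, then falls
print(sequence_probability(SEQ, 1.0))  # 0.6**3 * 0.2 with these numbers
```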


 Hidden Markov Models (HMMs)
 To add more flexibility, probability function was allowed to be associated with
the state. The result is called the Hidden Markov Model.
 a class of probabilistic graphical model that allow us to predict a sequence of
unknown (hidden) variables from a set of observed variables.
 allow us to compute the joint probability of a set of hidden states (latent states)
given a set of observed states.
 Once we know the joint probability of a sequence of hidden states, we
determine the best possible sequence of hidden states.

 A discrete Hidden Markov Model consists of the following:
 1. S = {S0, ..., Sn-1}, a finite set of states, where S0 always denotes the initial state. Typically the states are interconnected such that any state can be reached from any other state.
 2. V = {v0, ..., vm-1}, a finite set of output symbols.
 3. A = S x S, a transition probability matrix, where ai,j represents the probability of transitioning from state i to state j, with Σj ai,j = 1 for all i = 0, ..., n-1. Every value in the matrix is between 0 and 1. For the case where every state can be reached from every other state, every value in the matrix is non-zero.
 4. B = S x V, an output probability matrix, where element bj,k is the probability of emitting symbol vk in state j, with Σk bj,k = 1 for all j = 0, ..., n-1.
 5. The initial state distribution.

 The complete specification of an HMM requires:
 specification of the states
 the output symbols
 three probability measures: the state transition matrix, the output probability functions, and the initial state distribution
 A sketch of the joint probability computation follows.
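A short sketch (all numbers hypothetical) of the joint probability of a hidden state sequence and an observed symbol sequence, which is the quantity a decoder such as Viterbi maximizes:

```python
pi = [0.6, 0.4]               # initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]  # transition probability matrix
B = [[0.9, 0.1], [0.2, 0.8]]  # output probability matrix

def joint_probability(hidden, observed):
    """P(hidden, observed) = pi[s0] * B[s0][o0] *
    product over t >= 1 of A[s(t-1)][s(t)] * B[s(t)][o(t)]."""
    p = pi[hidden[0]] * B[hidden[0]][observed[0]]
    for t in range(1, len(hidden)):
        p *= A[hidden[t - 1]][hidden[t]] * B[hidden[t]][observed[t]]
    return p

print(joint_probability([0, 0, 1], [0, 0, 1]))  # 0.6*0.9 * 0.7*0.9 * 0.3*0.8
```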

 Applications of Hidden Markov Models (HMM)


 speech recognition
 optical character recognition
 topic identification
 information retrieval search.
