
UNIT 2- IRS

2. CATALOGING AND INDEXING

Indexing
 The transformation from the received item to the searchable data structure is called
Indexing.
 This process can be manual or automatic, creating the basis for direct search of items
in the Document Database or indirect search via Index Files.

Concept based representation


 Instead of trying to create a word-based searchable data structure, some systems transform the item into a completely different representation that is concept based and use this as the searchable data structure.

Search
 Once the searchable data structure has been created, techniques must be defined that
correlate the user-entered query statement to the set of items in the database.
 This process is called Search.

2.1 History

 Indexing was originally called Cataloging.


 Cataloging is the oldest technique for identifying the contents of items to assist in
their retrieval.
The objective of cataloging:
 to give access points to a collection of information that are expected by and most
useful to the users.
 Up to the 19th Century there was little advancement in cataloging, only changes in the
methods used to represent the basic information.
 In the late 1800s subject indexing became hierarchical (e.g., Dewey Decimal System).
 In 1963 the Library of Congress initiated a study on the computerization of
bibliographic surrogates.

 From 1966 - 1968 the Library of Congress ran its MARC I pilot project. MARC
(MAchine Readable Cataloging) standardizes the structure, contents and coding of
bibliographic records.
 1965: DIALOG, the earliest commercial cataloging system, was developed by Lockheed Corporation for NASA.
 In 1978 DIALOG became commercially available.
 In 1988 DIALOG, which by then contained over 320 index databases, was sold to Knight-Ridder.
 Indexing (cataloging), until recently, was accomplished by creating a bibliographic
citation in a structured file that references the original text.
 The indexing process is typically performed by professional indexers associated with
library organizations.
 Throughout the history of libraries, this has been the most important and most
difficult processing step.
 Most items are retrieved based upon what the item is about.
 The user’s ability to find items on a particular subject is limited by the indexer.
 The initial introduction of computers to assist the cataloguing function did not change
its basic operation of a human indexer determining those terms to assign to a
particular item.
 The user, instead of searching through physical cards in a card catalog, now
performed a search on a computer and electronically displayed the card equivalents.
 In the 1990s, the significant reduction in cost of processing power and memory in
modern computers, along with access to the full text of an item from the publishing
stages in electronic form, allow use of the full text of an item as an alternative to the
indexer-generated subject index.
 The searchable availability of the text of items has changed the role of indexers.
 The indexer is no longer required to enter index terms that are redundant with words
in the text of an item.

2.2 Objective of Indexing
 The objectives of indexing have changed with the evolution of Information Retrieval
Systems.
 total document indexing
 The full text searchable data structure for items in the Document File provides
a new class of indexing called total document indexing.
 In this environment, all of the words within the item are potential index
descriptors of the subject(s) of the item.
 Current systems have the ability to automatically weight the processing tokens
based upon their potential importance in defining the concepts in the item.
 Previously, indexing defined the source and major concepts of an item and
provided a mechanism for standardization of index terms (i.e., use of a
controlled vocabulary).
 Controlled vocabulary
 is a finite set of index terms from which all index terms must be selected (the
domain of the index).
 In a manual indexing environment, the use of a controlled vocabulary makes
the indexing process slower, but potentially simplifies the search process.
 The extra processing time comes from the indexer trying to determine the
appropriate index terms for concepts that are not specifically in the controlled
vocabulary set.
 Controlled vocabularies aid the user in knowing the domain of terms that the
indexer had to select from.
 Uncontrolled vocabularies
 make indexing faster but the search process much more difficult.
 The availability of items in electronic form changes the objectives of manual
indexing.
 The source information (frequently called citation data) can automatically be
extracted.
 Modern systems, with the automatic use of thesauri and other reference
databases, can account for diversity of language/vocabulary use and thus
reduce the need for controlled vocabularies.

 Most of the concepts discussed in the document are locatable via search of the
total document index.
 automatic text analysis algorithms
 cannot consistently perform abstraction on all concepts that are in an item.
 They can not correlate the facts in an item in a cause/effect relationship to
determine additional related concepts to be indexed.
 The words used in an item do not always reflect the value of the concepts
being presented.
 It is the combination of the words and their semantic implications that contain
the value of the concepts being discussed.
 The utility of a concept is also determined by the user’s need.
 Public File indexer
 Public File indexer needs to consider the information needs of all users of the
library system.
 Individual users of the system have their own domains of interest that bound
the concepts in which they are interested.
 It takes a human being to evaluate the quality of the concepts being discussed
in an item to determine if that concept should be indexed.
 Private Index files
 allows the user to logically subset the total document file into folders of
interest including only those documents that, in the user’s judgment, have
future value.
 allows the user to judge the utility of the concepts based upon his need versus
the system need and perform concept abstraction.
 Selective indexing
is based upon the value of concepts and increases the precision of searches.
 full document indexing
 Availability of full document indexing saves the indexer from entering index
terms that are identical to words in the document.
 Users may use Public Index files as part of their search criteria to increase the recall.
 They may want to constrain the search by their Private Index file to increase the
precision of the search.
 Figure 3.1 shows the potential relationship between use of the words in an item to
define the concepts.

 Public Indexing of the concept adds additional index terms over the words in the item
to achieve abstraction.
 The index file uses fewer terms because it only indexes the important concepts.
 Private Index files are more focused, limiting the number of items indexed to those
that have value to the user and within items only the concepts bounded by the specific
user’s interest domain.
 There is overlap between the Private and Public Index files, but the Private Index file
is indexing fewer concepts in an item than the Public Index file and the file owner
uses his specific vocabulary of index terms.
 In addition to the primary objective of representing the concepts within an item to
facilitate the user’s finding relevant information, electronic indexes to items provide a
basis for other applications to assist the user.
 The format of the index supports the ranking of the output to present the items most
likely to be relevant to the user’s needs.
 The index can be used to cluster items by concept.
 The clustering of items has the effect of making an electronic system similar to a
physical library.
 The paradigm of going to the library and browsing the book shelves in a topical area
is the same as electronically browsing through items clustered by concepts.

2.3 Indexing Process

 When an organization with multiple indexers decides to create a public or private index, some procedural decisions assist the indexers and end users on how to create the index terms:
 scope of the indexing
 linking index terms
Scope of Indexing
 defines the level of detail that the subject index will contain.
 This is based upon usage scenarios of the end users.
 When performed manually, the process of determining the bibliographic terms that
represent the concepts in an item is difficult.
 Problems arise from interaction of two sources: the author and the indexer.
 The vocabulary domain of the author may be different from that of the indexer, causing the indexer to misinterpret the importance of the concepts being presented.

 The indexer is not an expert on all areas and has different levels of knowledge in the
different areas being presented in the item.
 This results in different quality levels of indexing.
 The indexer must determine when to stop the indexing process.
 There are two factors involved in deciding on what level to index the concepts in an
item:
 Exhaustivity
 Specificity.
 Exhaustivity of indexing
 is the extent to which the different concepts in the item are indexed.
 For example, if two sentences of a 10-page item on microprocessors discusses
on-board caches, should this concept be indexed?
 Specificity
 relates to the preciseness of the index terms used in indexing.
 For example, whether the term “processor” or “microcomputer” or “Pentium”
should be used in the index of an item.
 using general index terms yields low exhaustivity and specificity.
 This approach requires a minimal number of index terms per item.
 reduces the cost of generating the index.

 Another decision on indexing is what portions of an item should be indexed.


 The simplest case is to limit the indexing to the Title or Title and Abstract zones.

 Weighting
 Weighting of index terms is not common in manual indexing systems.
 Weighting is the process of assigning an importance to an index term’s use in
an item.
 The weight should represent the degree to which the concept associated with the index term is discussed in the item.
 The process of assigning weights adds additional overhead on the indexer and
requires a more complex data structure to store the weights.

Pre coordination and Linkages


 Another decision on the indexing process is whether linkages are available between
index terms for an item.
 Linkages are used to correlate related attributes associated with concepts discussed in
an item.
 Precoordination
 process of creating term linkages at index creation time.
 When index terms are not coordinated at index time, the coordination occurs
at search time.
 Post coordination
 coordinating terms after (post) the indexing process.
 implemented by “AND”ing index terms together, which only finds items whose indexes contain all of the search terms.
 Figure 3.2 shows the different types of linkages.
 It assumes that an item discusses the drilling of oil wells in Mexico by CITGO and the
introduction of oil refineries in Peru by the U.S.
 When the linked capability is added, the system does not erroneously relate Peru and
Mexico since they are not in the same set of linked items.
 It still does not have the ability to discriminate between which country is introducing
oil refineries into the other country.
 Introducing roles in the last two examples of Figure 3.2 removes this ambiguity.

 Positional roles treat the data as a vector allowing only one value per position.
 Thus if the example is expanded so that the U.S. was introducing oil refineries in
Peru, Bolivia and Argentina, then the positional role technique would require three
entries, where the only difference would be in the value in the “affected country”
position.
 When modifiers are used, only one entry would be required and all three countries
would be listed with three “MODIFIER”s.

2.4 Automatic Indexing

 Advantages of human indexing


 ability to determine concept abstraction
 ability to judge the value of a concept.
 Disadvantages of human indexing
 Cost
 processing time
 consistency

 Automatic indexing
 Capability for the system to automatically determine the index terms to be
assigned to an item.
 More complex processing is required when the system must emulate a human indexer and determine a limited number of index terms.
 No additional indexing costs versus the salaries and benefits regularly paid to
human indexers.
 Requires only a few seconds or less of computer time based upon the size of
the processor and the complexity of the algorithms to generate the index.
 Consistency in the index term selection process as indexing is performed
automatically by an algorithm.
 Indexes from automated indexing fall into two classes:
 unweighted.
 weighted
 Unweighted indexing system
 index term in a document and its word location(s) are kept in the searchable
data structure.
 Queries against unweighted systems are based upon Boolean logic and the
items in the resultant Hit file are considered equal in value.
 The last item presented in the file is as likely to be relevant to the user’s information need as the first item.

 Weighted indexing system


 weight of the index term is based upon a function associated with the
frequency of occurrence of the term in the item.
 values for the index terms are normalized between zero and one.
 The higher the weight, the more the term represents a concept discussed in the
item.
 query process uses the weights along with any weights assigned to terms in the
query to determine a rank value.
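 A minimal sketch in Python (illustrative; the function name and the normalization-by-maximum-frequency choice are assumptions, not taken from the text) of assigning each processing token a weight between zero and one based on its frequency of occurrence in the item:

    from collections import Counter

    def term_weights(text):
        """Weight each processing token by its frequency in the item,
        normalized by the most frequent token (values between 0 and 1)."""
        tokens = text.lower().split()
        counts = Counter(tokens)
        max_freq = max(counts.values())
        return {term: freq / max_freq for term, freq in counts.items()}

    # The most frequent token gets weight 1.0, rarer tokens proportionally less.
    print(term_weights("oil wells in mexico and oil refineries in peru"))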

 2.4.1 Indexing by Term


 2.4.2 Indexing by Concept
 2.4.3 Multimedia Indexing

Indexing by Term
 There are two major techniques for creation of the index terms:
 statistical
 natural language
 Statistical techniques
 based upon vector models and probabilistic models with a special case being
Bayesian models.
 calculation of weights in those models use statistical information such as the
frequency of occurrence of words and their distributions in the searchable
database.

 Natural language techniques


 perform more complex parsing to define the final set of index concepts.
 weighted systems are discussed as vectorized information systems.

 Vector model (Example: SMART system)


 The system emphasizes weights as a foundation for information detection and
stores these weights in a vector form.
 Each vector represents a document
 each position in a vector represents a different unique word (processing token)
in the database.
 The value assigned to each position is the weight of that term in the document.
 A value of zero indicates that the word was not in the document.
 Queries can be translated into the vector form.
 Search is accomplished by calculating the distance between the query vector
and the document vectors.
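 A minimal sketch in Python (illustrative; the vectors, vocabulary positions, and the cosine measure used here are assumptions rather than the SMART implementation) of ranking documents by the closeness of their weight vectors to a query vector:

    import math

    def cosine(u, v):
        """Cosine similarity between two equal-length weight vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Each vector position is a unique processing token; 0.0 means the word
    # does not occur in that document.
    docs = {"doc1": [0.5, 0.0, 0.8], "doc2": [0.0, 0.9, 0.1]}
    query = [0.7, 0.0, 0.6]
    print(sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True))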
 Probabilistic model.
 Bayesian approach is the most successful model in this area.
 It is based upon the theories of evidential reasoning (drawing conclusions
from evidence).
 The Bayesian approach could be applied as part of index term weighting by
calculating the relationship between an item and a specific query.
 A Bayesian network is a directed acyclic graph
 Each node represents a random variable.
 Arcs between the nodes represent a probabilistic dependence between the node
and its parents.

 Figure shows the basic weighting approach for index terms or associations
between query terms and index terms.

 The nodes C1 and C2 represent “the item contains concept Ci”


 F nodes represent “the item has feature Fij” (e.g., words)

 The network can be interpreted with C representing concepts in a query


 F representing concepts in an item.
 The goal is to calculate the probability of Ci given Fij.

 To perform that calculation two sets of probabilities are needed:


 The prior probability P(Ci) that an item is relevant to concept C.
 The conditional probability P(Fij | Ci) that the features Fij, j = 1, ..., m, are present in an item given that the item contains topic Ci.

 The automatic indexing task is to calculate the posterior probability P(Ci | Fi1, ..., Fim), the probability that the item contains concept Ci given the presence of features Fij.

 The Bayes inference formula that is used is:
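 Assuming the features are conditionally independent given the concept (which the network structure above implies), a standard form of this rule is:

    P(C_i \mid F_{i1}, \ldots, F_{im}) = \frac{P(C_i)\,\prod_{j=1}^{m} P(F_{ij} \mid C_i)}{P(F_{i1}, \ldots, F_{im})}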

 If the goal is to provide ranking as the result of a search by the posteriors, the Bayes
rule can be simplified to a linear decision rule:
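 A standard form of such a rule, consistent with the description below, is:

    g(\mathrm{item}) = \sum_{k=1}^{m} w_k \, I(F_{ik})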

 where I(Fik) is an indicator variable that equals 1 only if Fik is present in the item (equals zero otherwise) and w is a coefficient corresponding to a specific feature/concept pair.
 function g is the sum of the weights of the features
 w is weight corresponding to each feature (index term)
 g produces a ranking in decreasing order that is equivalent to the order produced by the posterior probabilities.

 DR-LINK (Document Retrieval through Linguistic Knowledge) system


 define indexes to items via natural language processing.
 processes items at the morphological, lexical, semantic, syntactic, and
discourse levels.
 Each level uses information from the previous level to perform its additional
analysis.
 The discourse level is abstracting information beyond the sentence level and
can determine abstract concepts.
 This allows the indexing to include specific terms as well as abstract concepts such as time.
Indexing by Concept
 Indexing by term treats each occurrence as a different index.

 Then uses thesauri or other query expansion techniques to expand a query to find the
different ways the same thing has been represented.

 basis for concept indexing


 There are many ways to express the same idea and increased retrieval
performance comes from using a single representation.
 Concept indexing determines a related set of concepts based upon a test set of terms
 uses those concepts for indexing all items
 Latent Semantic Indexing
 Indexing the latent semantic information in items.

 MatchPlus system.
 An example of concept indexing.
 neural networks facilitate machine learning of concept/word relationships.
 goal is to determine word relationships (e.g., synonyms) and the strength of these
relationships and use that information in generating context vectors, from the
corpus of items.

 Two neural networks are used.


 One neural network learning algorithm generates stem context vectors that are
sensitive to similarity of use.
 another one performs query modification based upon user feedback.
 context vectors
 Word stems, items, and queries are represented by high-dimensional (at least 300 dimensions) vectors.
 Each dimension in a vector could be viewed as an abstract concept class.

Multimedia Indexing
 The automated indexing takes place in multiple passes of the information versus just a
direct conversion to the indexing structure.

 The first pass in most cases is a conversion from the analog input mode into a digital
structure.
 Then algorithms are applied to the digital structure to extract the unit of processing of
the different modalities that will be used to represent the item.
 In an abstract sense this could be considered the location of a processing token in the
modality.
 This unit will then undergo the final processing that will extract the searchable
features that represent the unit.

 Indexing video or images can be accomplished


 at the raw data level (e.g., the aggregation of raw pixels),
 the feature level distinguishing primitive attributes such as color and
luminance
 at the semantic level where meaningful objects are recognized (e.g., an
airplane in the image/video frame).

 An example is processing of video.


 The system (e.g., Virage) will periodically collect a frame of video input for
processing.
 It might compare that frame to the last frame captured to determine the differences
between the frames.
 If the difference is below a threshold it will discard the frame.
 For a frame requiring processing, it will define a vector that represents the different
features associated with that frame.
 Each dimension of the vector represents a different feature level aspect of the frame.
 The vector then becomes the unit of processing in the search system.
 This is similar to processing an image.
 Semantic level indexing requires pattern recognition of objects within the images.

 analog audio input

 system will convert the audio to digital format and determine the phonemes
associated with the utterances.
 The phonemes will be used as input to a Hidden Markov Search model that
will determine with a confidence level the words that were spoken.
 A single phoneme can be divided into four states for the Markov model.
 It is the textual words associated with the audio that becomes the searchable
structure.

 In addition to storing the extracted index searchable data, a multimedia item also
needs to store some mechanism to correlate the different modalities during search.
 There are two main mechanisms that are used
 Positional
 Temporal
 Positional
 is used when the modalities are scattered in a linear sequential composition.
 For example, a document that has images or audio inserted can be considered a linear structure, and the only relationship between the modalities is the position of each modality.
 Temporal
 based upon time because the modalities are executing concurrently.
 The typical video source from television is inherently a multimedia source.
 It contains video, audio, and potentially closed captioning.

 Synchronized Multimedia Integration Language (SMIL).


 The creation of multimedia presentations is becoming more common using the Synchronized Multimedia Integration Language (SMIL).
 It is a mark-up language designed to support multimedia presentations that
integrate text (e.g.,from slides or free running text) with audio, images and
video.
 time is the mechanism that is used to synchronize the different modalities.
 indexing must include a time-offset parameter versus a physical displacement.

2.5 Information Extraction

 There are two processes associated with information extraction:


 determination of facts to go into structured fields in a database
 extraction of text that can be used to summarize an item.
 determination of facts
 only a subset of the important facts in an item may be identified and extracted.
 Summarization
 all the major concepts in the item should be represented.

Extraction
 The process of extracting facts to go into indexes is called Automatic File Build.
 Its goal is to process incoming items and extract index terms that will go into a
structured database.
 Extraction system analyzes only the portions of a document that contain information
relevant to the extraction criteria.
 objective is to update a structured database with additional facts.
 The updates may be from a controlled vocabulary or substrings from the item as
defined by the extraction rules.

 The term “slot” is used to define a particular category of information to be extracted.


 Slots are organized into templates or semantic frames.
 Information extraction requires multiple levels of analysis of the text of an item.
 It must understand the words and their context (discourse analysis).

 Metrics to compare information extraction


 Recall
 Refers to how much information was extracted from an item versus how much should have been extracted from the item.
 It shows the amount of correct and relevant data extracted versus the correct and relevant data in the item.
 Precision
 Refers to how much information was extracted accurately versus the total information extracted.
 Over generation
 It measures the amount of irrelevant information that is extracted.
 This could be caused by
 templates filled on topics that are not intended to be
extracted
 Slots that filled with non-relevant data.
 Fallout
 It measures how much a system assigns incorrect slot fillers.

Document summarization
 goal is to extract a summary of an item maintaining the most important ideas while
significantly reducing the size.
 Examples of summaries are part of any item such as titles, table of contents, and
abstracts.
 The abstract can be used to represent the item for search purposes or as a way for the user to judge the utility of the item without having to read the complete item.
 It is not feasible to automatically generate a coherent narrative summary of an item
with proper discourse, abstraction and language usage.
 Restricting the domain of the item significantly improves the quality of the output.
 Different algorithms produce different summaries.
 Most automated algorithms summarize by calculating a score for each sentence and
then extracting the sentences with the highest scores.
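 A minimal sketch in Python (illustrative; the sentence splitting, the word-frequency scoring, and the parameters are assumptions) of such an extractive approach:

    import re
    from collections import Counter

    def summarize(text, num_sentences=2):
        """Extractive summary: keep the sentences whose words occur most
        frequently in the item, preserving their original order."""
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        freq = Counter(re.findall(r'\w+', text.lower()))
        score = lambda s: sum(freq[w] for w in re.findall(r'\w+', s.lower()))
        best = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                      reverse=True)[:num_sentences]
        return ' '.join(sentences[i] for i in sorted(best))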

2.6 Introduction to Data Structures

 From an Information Retrieval System perspective, the two aspects of a data structure
are
 its ability to represent concepts and their relationships
 how well it supports location of those concepts

 There are two major data structures in any information system.


 One structure stores and manages the received items in their normalized form; this is called the document manager.
 The other contains the processing tokens and associated data to support search.

2.7 Stemming Algorithms

 The goal of stemming is to improve performance and reduce system resources by reducing the number of unique words that a system has to contain.
 The stemming process creates one large index for the stem.
 1 Introduction to the Stemming Process
 2 Porter Stemming Algorithm
 3 Dictionary Look-Up Stemmers
 4 Successor Stemmers

2.7a Introduction to the Stemming Process


 Stemming algorithms are used to improve the efficiency of the information system
and to improve recall.
 Conflation
 the term used to refer to mapping multiple morphological variants to a single
representation (stem).
 The principle is that the stem carries the meaning of the concept associated with the word, while the affixes introduce only minor modifications to the concept or serve syntactic purposes.

 Languages have precise grammars that define their usage, but also evolve based upon
human usage.
 Thus exceptions and non-consistent variants are always present in languages that
typically require ‘exception look-up tables’ in addition to the normal reduction rules.

 The idea of equating multiple representations of a word to a single stem term is to provide significant compression.
 For example, the stem “comput” could associate “computable, computability,
computation, computational, computed, computing, computer, computerese,
computerize” to one compressed word.
 compression of stemming does not significantly reduce storage requirements.
 misspellings and proper names reduce the compression even more.
 Another major use of stemming is to improve recall.
 As long as a semantically consistent stem can be identified for a set of words, stemming helps avoid missing relevant items.

 Stemming of the words “calculate, calculates, calculation, calculations, calculating” to a single stem “calculat” assures that whichever of those terms is entered by the user, it is translated to the stem and all the variants are found in any items in which they exist.
 Stemming cannot improve precision, since precision is not based on finding all relevant items but on minimizing the retrieval of non-relevant items.

 Stemming can cause problems for Natural Language Processing (NLP) systems by
loss of information needed for aggregate levels of natural language processing
(discourse analysis).
 The tenses of verbs may be lost in creating a stem, but they are needed to determine if
particular concept being indexed occurred in the past or will be occurring in the
future.
 Time is one example of the type of relationships that are defined in Natural Language
Processing systems.
 Stemming algorithm removes suffixes and prefixes, sometimes recursively, to derive
the final stem.

 Other techniques such as table lookup and successor stemming provide alternatives
that require additional overheads.
 Successor stemmers determine prefix overlap as the length of a stem is increased.
 Table lookup requires a large data structure.

2.7b Porter Stemming Algorithm

 The Porter Algorithm is based upon a set of conditions of the stem, suffix and prefix
and associated actions given the condition.
 Some examples of stem conditions are:
 1. The measure, m, of a stem is a function of sequences of vowels (a, e, i, o, u) followed by a consonant.
 If V is a sequence of vowels and C is a sequence of consonants, then every stem can be written as [C](VC)^m[V], where the bracketed initial C and final V are optional and m is the number of times VC repeats.

 Suffix conditions take the form: current_suffix == pattern
 Actions are in the form: old_suffix -> new_suffix
 Rules are divided into steps to define the order of applying the rules.

 Given the word “duplicatable,” the following are the steps in the stemming process:

 Another rule in step 4, removing “ic,” cannot be applied since only one rule from each step is allowed to be applied.
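 A minimal sketch in Python (an illustrative assumption of how the measure could be computed, not Porter’s published code; the handling of ‘y’ is simplified) that counts the VC repeats in a stem:

    def measure(stem):
        """Return m, the number of vowel-consonant (VC) sequences in the stem."""
        vowels = set("aeiou")
        pattern = "".join("V" if ch in vowels else "C" for ch in stem.lower())
        m, i = 0, 0
        while i < len(pattern) - 1:
            if pattern[i] == "V" and pattern[i + 1] == "C":
                m += 1
                i += 2
            else:
                i += 1
        return m

    print(measure("tree"))     # m = 0 (pattern CCVV contains no VC sequence)
    print(measure("trouble"))  # m = 1
    print(measure("oaten"))    # m = 2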

2.7c Dictionary Look-Up Stemmers
 In this approach, simple stemming rules may be applied.
 The rules are taken from those that have the fewest exceptions (e.g., removing
pluralization from nouns).
 Even the most consistent rules have exceptions.
 The original term or stemmed version of the term is looked up in a dictionary and
replaced by the stem that best represents it.
 This technique has been implemented in the INQUERY and RetrievalWare Systems.

 The INQUERY system uses a stemming technique called Kstem.


 Kstem is a morphological analyzer that reduces morphological variants to a root form.
 For example, `elephants'->`elephant', `amplification'->`amplify', and `european'-
>`europe'.
 It tries to avoid collapsing words with different meanings into the same root.
 For example, a purely rule-based stemmer might reduce “memorial” and “memorize” to “memory”, even though “memorial” and “memorize” are not synonyms and have very different meanings.
 Kstem, like other stemmers associated with Natural Language Processors and
dictionaries, returns words instead of truncated word forms.
 Generally, Kstem requires a word to be in the dictionary before it reduces one word
form to another.
 Some endings are always removed, even if the root form is not found in the dictionary
(e.g.,‘ness’, ‘ly’).
 If the word being processed is in the dictionary, it is assumed to be unrelated to the
root after stemming and conflation is not performed (e.g.,‘factorial’ needs to be in the
dictionary or it is stemmed to ‘factory’).
 For irregular morphologies, it is necessary to explicitly map the word variant to the
root desired (for example, “matrices” to “matrix”).

 The Kstem system uses the following six major data files to control and limit the
stemming process:
 Dictionary of words (lexicon)
 Supplemental list of words for the dictionary
 Exceptions list for those words that should retain an “e” at the end (e.g.,
“suites” to “suite” but “suited” to “suit”)
 Direct Conflation - allows definition of direct conflation via word pairs that
override the stemming algorithm
 Country_Nationality - conflations between nationalities and countries
(“British” maps to “Britain”)
 Proper Nouns - a list of proper nouns that should not be stemmed.

 The strength of the RetrievalWare System lies in its Thesaurus/Semantic Network


support data structure that contains over 400,000 words.
 The dictionaries that are used contain the morphological variants of words.
 New words that are not special forms (e.g., dates, phone numbers) are located in the
dictionary to determine simpler forms by stripping off suffixes and respelling plurals
as defined in the dictionary.

2.7d Successor Stemmers

 The process determines the successor varieties for a word and uses this information to
divide a word into segments and selects one of the segments as the stem.
 The successor variety of a segment (prefix) of a word, given a set of words, is the number of distinct letters that follow that segment in the words of the set.

 A graphical representation of successor variety is shown in a symbol tree.


 Figure shows the symbol tree for the terms bag, barn, bring, both, box, and bottle.
 The successor variety for any prefix of a word is the number of children that are
associated with the node in the symbol tree representing that prefix.
 For example, the successor variety for the first letter “b” is three. The successor
variety for the prefix “ba” is two.
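 A minimal sketch in Python (illustrative; it computes successor variety directly from a word list rather than from an explicit symbol tree):

    def successor_variety(prefix, words):
        """Number of distinct letters that follow the prefix in the word set."""
        return len({w[len(prefix)] for w in words
                    if w.startswith(prefix) and len(w) > len(prefix)})

    words = ["bag", "barn", "bring", "both", "box", "bottle"]
    print(successor_variety("b", words))   # 3 (a, r, o)
    print(successor_variety("ba", words))  # 2 (g, r)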

 The successor varieties of a word are used to segment a word by applying one of the
following four methods :
1. Cutoff method: a cutoff value is selected to define stem length. The value varies
for each possible set of words.
2. Peak and Plateau: a segment break is made after a character whose successor
variety exceeds that of the character immediately preceding it and the character
immediately following it.
3. Complete word method: break on boundaries of complete words
4. Entropy method: uses the distribution of successor variety letters.
 Let |Dak| be the number of words beginning with the k length sequence of
letters a.
 Let |Dakj| be the number of words in Dak with successor j.
 The entropy (Average Information) of |Dak| is:
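 In its standard form (summing over the possible successor letters j) this is:

    H_{ak} = \sum_{j} -\frac{|D_{akj}|}{|D_{ak}|} \log_2 \frac{|D_{akj}|}{|D_{ak}|}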

 Using this formula a set of entropy measures can be calculated for a word and
its predecessors.

 To illustrate the use of successor variety stemming, consider the example below
where the task is to determine the stem of the word READABLE.

 Using the complete word segmentation method, the test word "READABLE" will be segmented into "READ" and "ABLE".
 After a word has been segmented, the segment to be used as the stem must be
selected.
 Hafer and Weiss used the following rule:

 Selecting the first or second segment in general determines the appropriate stem.

2.8 Inverted File Structure

 The inverted file structure is used in both database management and Information Retrieval Systems.
 Inverted file structures are composed of three basic files:
 document file
 inversion lists (posting files)
 dictionary
 For each word, a list of the documents in which the word is found is stored (the inversion list for that word).
 Each document in the system is given a unique numerical identifier that is stored in
the inversion list.
 Dictionary is used to locate the inversion list for a particular word.
 Dictionary is a sorted list of all unique words (processing tokens) in the system and a
pointer to the location of its inversion list.
 Dictionaries can also store other information used in query optimization such as the
length of inversion lists.

 Additional information may be used from the item to increase precision and provide a
more optimum inversion list file structure.
 For example, if zoning is used, the dictionary may be partitioned by zone.
 There could be a dictionary and set of inversion lists for the “Abstract” zone in an
item and another dictionary and set of inversion lists for the “Main Body” zone.
 The inversion list contains the document identifier for each document in which the
word is found.
 To support proximity, contiguous word phrases, and term weighting algorithms, all
occurrences of a word are stored in the inversion list along with the word position.
 Thus if the word “bit” was the tenth, twelfth and eighteenth word in document #1,
then the inversion list would appear: bit -1(10), 1(12), 1(18)
 Weights can also be stored in inversion lists.
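 A minimal sketch in Python (illustrative; the data layout and example document are assumptions) of building such an inversion list with word positions and applying Boolean logic between lists:

    from collections import defaultdict

    def build_inverted_index(docs):
        """Map each word to {document id: [word positions]}."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            for pos, word in enumerate(text.lower().split(), start=1):
                index[word][doc_id].append(pos)
        return index

    docs = {1: "the bit pattern stores a bit for every bit of data"}
    index = build_inverted_index(docs)
    print(dict(index["bit"]))                           # {1: [2, 6, 9]}
    print(index["bit"].keys() & index["data"].keys())   # Boolean AND -> {1}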

 When a search is performed, the inversion lists for the terms in the query are located and the appropriate logic is applied between the inversion lists.
 The result is a final list of items that satisfy the query.
 For systems that support ranking, the list is reorganized into ranked order.
 The document numbers are used to retrieve the documents from the Document File.

 Instead of using a dictionary to point to the inversion list, B-trees can be used.
 The inversion lists are stored at the leaf level or referenced from higher-level pointers.
 Figure 4.6 shows how the words in Figure 4.5 would appear.

 A B-tree of order m is defined as:


 A root node with between 2 and 2m keys
 All other internal nodes have between m and 2m keys
 All keys are kept in order from smaller to larger
 All leaves are at the same level or differ by at most one level.

2.9 N-Gram Data Structures

 N-Grams
 a special technique for conflation (stemming).
 a unique data structure in information systems that ignores words and treats the input as continuous data, optionally limiting its processing by interword symbols.
 a fixed length consecutive series of “n” characters or fixed length overlapping
symbol segments that define the searchable processing tokens.
 These tokens have logical linkages to all the items in which the tokens are
found.
 To store the linkage data structure, inversion lists, document vectors and other
data structures are used in the search process.
 n-grams do not care about semantics, unlike stemming (which determines a stem that represents the semantic meaning of the word).

 Examples of bigrams, trigrams and pentagrams are given in Figure 4.7 for the word
phrase “sea colony”.
 n-grams with n greater than two allow interword symbols to be part of the n-gram set.
 The symbol # is used to represent the interword symbol (e.g., blank, period,
semicolon, colon, etc.).

 Each of the n-grams created becomes a separate processing token and are searchable.
 It is possible that the same n-gram can be created multiple times from a single word.
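 A minimal sketch in Python (illustrative; it keeps the interword symbol for every n, whereas the text above includes it only for n greater than two):

    def ngrams(text, n):
        """Overlapping character n-grams; '#' stands for the interword symbol."""
        padded = text.replace(" ", "#")
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(ngrams("sea colony", 3))
    # ['sea', 'ea#', 'a#c', '#co', 'col', 'olo', 'lon', 'ony']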

 The advantage of n-grams


 limit on the number of searchable tokens (there is a fixed maximum number of unique n-grams that can be generated).
 fast processing on minimally sized machines
 disadvantage
 false hits can occur under some architectures.
 increased size of inversion lists that store the linkage data structure.
 n-grams expand the number of processing tokens.
 There is no semantic meaning in a particular n-gram since it is a fragment of a processing token and may not represent a concept.
 poor representation of concepts and their relationships.

2.10 PAT Data Structure

 With n-grams, a continuous text input data structure is indexed as contiguous “n”-character tokens, with interword symbols between processing tokens.
 PAT trees and PAT arrays address a continuous text input data structure in a different way.
 The name PAT is short for PAtriciaTrees (PATRICIA stands for Practical Algorithm
To Retrieve Information Coded In Alphanumerics.)

 Sistring (Semi-infinite string)


 The input stream is transformed into a searchable data structure consisting of substrings called sistrings (semi-infinite strings).
 In creation of PAT trees, each position in the input string is the anchor point
for a sub-string that starts at that point and includes all new text up to the end
of the input.
 All substrings are unique.
 A substring can start at any point in the text and can be uniquely indexed by its
starting location and length.
 Substring may go beyond the length of the input stream by adding additional
null characters.
 Figure 4.9 shows some possible sistrings for an input text.
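 A minimal sketch in Python (illustrative; the example text is an assumption) of enumerating sistrings, one per input position:

    def sistrings(text):
        """One sistring per input position, numbered from 1."""
        return {i + 1: text[i:] for i in range(len(text))}

    for pos, s in sistrings("home sale").items():
        print(pos, s)
    # 1 home sale
    # 2 ome sale
    # 3 me sale
    # ...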

 PAT tree
 is an unbalanced, binary digital tree defined by the sistrings.
 The individual bits of the sistrings decide the branching patterns with zeros
branching left and ones branching right.
 PAT trees also allow each node in the tree to specify which bit is used to
determine the branching via bit position or the number of bits to skip from the
parent node.
 This is useful in skipping over levels that do not require branching.
 The key values are stored at the leaf nodes (bottom nodes) in the PAT Tree.
 For a text input of size “n” there are “n” leaf nodes and at most “n-1” higher-level nodes.
 It is possible to place additional constraints on sistrings for the leaf nodes.

 Figure 4.10 gives an example of the sistrings used in generating a PAT Tree.
 If the binary representations of “h” is (100), “o” is (110), “m” is (001) and“e”
is (101) then the word “home” produces the input 100110001101.
 Using the sistrings, the full PAT binary tree is shown in Figure 4.11.

 A more compact tree where skip values are in the intermediate nodes is shown
in Figure 4.12.

 Advantages and Disadvantages
 PAT trees are ideal for prefix searches.
 Suffix, imbedded string, and fixed length masked searches are easy if the total
input stream is used in defining the PAT tree.
 Fuzzy searches are very difficult because large number of possible sub-trees
could match the search term.
 PAT arrays have more accuracy than Signature files and support string searches that are inefficient in inverted files (e.g., suffix searches, approximate string searches, longest repetition).
 It is not used in any major commercial products.

2.11 Signature File Structure

 The goal of a signature file structure


 to provide a fast test to eliminate the majority of items that are not related to a
query.
 The items that satisfy the test can either be evaluated by another search algorithm to
eliminate additional false hits or delivered to the user to review.
 The text of the items is represented in a highly compressed form that facilitates the
fast test.

 Signature file search


 a linear scan of the compressed version of items producing a response time
linear with respect to file size.
 signature search file is created using superimposed coding.
 superimposed coding
 The coding is based upon words in the item.
 The words are mapped into a “word signature”.
 Word signature
 a fixed length code with a fixed number of bits.
 a bit pattern of size F, with m bits set to "1", while the rest are "0".
 The bit positions are determined using a hash function of the word.
 The word signatures are ORed together to create the signature of an item.
 To avoid signatures being too dense with “1”s, a maximum number of words
is specified and an item is partitioned into blocks of that size.
 Example

 In Figure 4.13 the block size is set at five words
 code length is 16 bits
 number of bits that are allowed to be “1” for each word is five.
 The words in a query are mapped to their signature.
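 A minimal sketch in Python (illustrative; the hash choice, code length, and bit count are assumptions) of superimposed coding: each word sets up to five of sixteen bits, the word signatures are ORed into the block signature, and a query word whose bits are all present makes the block a candidate (possibly a false hit):

    import hashlib

    F, M = 16, 5   # signature length in bits, bits set per word

    def word_signature(word):
        """Set up to M bit positions chosen by hashing the word."""
        sig, digest = 0, hashlib.md5(word.encode()).digest()
        for i in range(M):
            sig |= 1 << (digest[i] % F)
        return sig

    def block_signature(words):
        sig = 0
        for w in words:
            sig |= word_signature(w)
        return sig

    block = block_signature(["drilling", "oil", "wells", "in", "mexico"])
    query = word_signature("oil")
    print(query & block == query)   # True -> block passes the fast test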

 Signature file
 can be stored as a signature matrix, with each row representing a signature block.
 associated with each row is a pointer to the original text block.
 Design objective of a signature file system
 trading off the size of the structure versus the density of the final created
signatures.
 Longer code lengths reduce the probability of collision in hashing the words
(i.e., two different words hashing to the same value).
 Fewer bits per code reduce the effect of a code word pattern being in the final
block signature.

 Application / Advantages
 Signature files provide a solution for storing and locating information in a
number of different situations.
 Signature files are applied as medium size databases, WORM devices, parallel
processing machines, and distributed environments.

2.12 Hypertext and XML Data Structures

Hypertext:
 A mechanism for representing information structure.
 It differs from traditional information storage data structures in format and
use.
 Hypertext is stored in Hypertext Markup Language (HTML) and eXtensible
Markup Language (XML).
 HTML and XML provide detailed descriptions for subsets of text, similar to zoning, which increase search accuracy and improve display of hit results.

 Hypertext Structure
 used in the Internet environment
 requires electronic media storage for the item.
 Hypertext allows one item to reference another item via an imbedded pointer.
 Each separate item is called a node.
 Reference pointer is called a link.
 Each node is displayed by a viewer that is defined for the file type associated with
the node.
 Hypertext Markup Language (HTML)
 defines the internal structure for information exchange across the World Wide
Web on the Internet.
 A document is composed of the text of the item along with HTML tags that
describe how to display the document.
 Tags are formatting or structural keywords contained between less-than, greater
than symbols (e.g., <title>, <strong>).
 The HTML tag associated with hypertext linkages is <a href="…#NAME"> … </a>
 where “a” and “/a” are an anchor start tag and anchor end tag denoting the text
that the user can activate.
 “href” is the hypertext reference containing either a file name if the referenced item is on this node, or an address (URL) and a file name if it is on another node.
 “#NAME” defines a destination point other than the top of the item to go to.

 The URL has three components:
 access method the client used to retrieve the item
 Internet address of the server where the item is stored
 address of the item at the server
 Figure 4.14 shows an example of a segment of a HTML document.

 Hypertext is a non-sequential directed graph structure, where each node contains


its own information.
 The author assumes the reader can follow the linked data as easily as following the sequential presentation.
 A node may have several outgoing links, each of which is then associated with
some smaller part of the node called an anchor.
 When an anchor is activated, the associated link is followed to the destination
node, thus navigating the hypertext network.

 Conventional items are read sequentially by a user.


 In a hypertext environment, the user “navigates” through the node network by
following links.
 Hypertext references are used to include information that is other than text (e.g.,
graphics, audio, photograph, video) in a text item.

 Dynamic HTML
 Combination of the latest HTML tags, style sheets and programming that
help to create WEB pages that are more animated and responsive to user
interaction.

 Supports features such as an object-oriented view of a WEB page and its elements, cascading style sheets, programming that can address most page elements, and dynamic fonts.
 Allows the specification of style sheets in a cascading fashion.

XML:
 The eXtensible Markup Language (XML) is a standard data structure on the WEB.
 Objective is to extend HTML with semantic information.
 The logical data structure within XML is defined by a Document Type Definition (DTD)
 The user can create any tags needed to describe and manipulate their structure.
 The following is a simple example of XML tagging:

 Resource Description Format (RDF)


 used to represent properties of WEB resources such as images, documents and
relationships between them.
 This will include the Platform for Internet Content Selection (PICS) for
attaching labels to material for content filtering (e.g., unsuitable for children).
 Hypertext links for XML
 These are defined in the XLink (XML Linking Language) and XPointer (XML Pointer Language) specifications.
 allow different types of links to locations within a document and external to
the document.
 allow an application to know if a link is a positioning reference within an item or a link to another document.
 help in determining what needs to be retrieved to define the total item.
 XML Style Sheet Linking
 define how to display items on a particular style sheet and handle cascading
stylesheets.
 allow designers to limit what is displayed to the user and allow expansion to
the whole item if desired.
2.13 Hidden Markov Models

 Markov process assumption


 future is independent of the past given the present
 In other words, assuming we know our present state, we do not need any other
historical information to predict the future state.

 Example of a three state Markov Model of the Stock Market.

 The state, observed at the closing of the market, is one of: State 1 - the market fell, State 2 - the market was unchanged, State 3 - the market rose.
 The movement between states can be defined by a state transition matrix with
state transitions.

 Given that the market fell on one day (State 1), the matrix suggests that the
probability of the market not changing the next day is 1.
 This then allows questions such as the probability that the market will increase
for the next 4 days then fall.
 This would be equivalent to the sequence of SEQ = {S3, S3, S3, S3, S1}.
 Instead of assuming that the current state depends upon all the previous states, assume it depends only upon the last state.
 This would then be calculated by the formula:
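 In its standard first-order form this is the product of the probability of the starting state and the individual transition probabilities:

    P(SEQ \mid \text{model}) = P(S_3)\, a_{3,3}\, a_{3,3}\, a_{3,3}\, a_{3,1}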

 The following graph depicts the model.


 The directed lines indicate the state transition probabilities ai,j.
 In the example, every state corresponded to an observable event (change in the
market).

 To add more flexibility, a probability function is allowed to be associated with each state. The result is called the Hidden Markov Model.

 Hidden Markov Models (HMMs)


 a class of probabilistic graphical model that allow us to predict a sequence of
unknown (hidden) variables from a set of observed variables.
 allow us to compute the joint probability of a set of hidden states (latent states)
given a set of observed states.
 Once we know the joint probability of a sequence of hidden states, we
determine the best possible sequence of hidden states.

 Definition of a discrete Hidden Markov Model is summarized as consisting of the


following:
1. S = {S0, S1, …, Sn-1} is a finite set of states, where S0 always denotes the initial state. Typically the states are interconnected such that any state can be reached from any other state.
2. V = {v0, v1, …, vm-1} is a finite set of output symbols. This corresponds to the physical output from the system being modeled.
3. A = S x S, a transition probability matrix, where ai,j represents the probability of transitioning from state i to state j, such that Σj ai,j = 1 for all i = 0, ..., n-1. Every value in the matrix is a positive value between 0 and 1. For the case where every state can be reached from every other state, every value in the matrix will be non-zero.
4. B = S x V, an output probability matrix, where element bj,k is a function determining the probability of emitting output symbol vk in state sj, such that Σk bj,k = 1 for all j = 0, ..., n-1.


5. The initial state distribution.

 The HMM will generate an output symbol at every state transition.


 The transition probability is the probability of the next state given the current
state.
 The output probability is the probability that a given output is generated upon
arriving at the next state.

 The complete specification of an HMM requires
 specification of the states
 the output symbols
 the three probability measures: the state transition probabilities, the output probability functions, and the initial state distribution

 The distributions are frequently called A, B, and π, and the notation λ = (A, B, π) is used to define the model.

 Issues with HMM are


o how to efficiently calculate the probability of a sequence of observed outputs
given the HMM model.
o how to determine which of a number of competing models should be selected
given an observed set of outputs.
o how best to tune the λ model to maximize the probability of the output
sequence given λ.
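 A minimal sketch in Python (illustrative; the two-state model and its numbers are made up) of the first issue, computing the probability of an observed output sequence with the forward algorithm:

    def forward_probability(obs, pi, A, B):
        """Forward algorithm: P(observation sequence | HMM given by pi, A, B).
        pi[i]   - initial probability of state i
        A[i][j] - transition probability from state i to state j
        B[i][k] - probability of emitting symbol k in state i"""
        n = len(pi)
        alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
        for symbol in obs[1:]:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][symbol]
                     for j in range(n)]
        return sum(alpha)

    pi = [0.6, 0.4]                   # initial state distribution
    A  = [[0.7, 0.3], [0.4, 0.6]]     # state transition probabilities
    B  = [[0.9, 0.1], [0.2, 0.8]]     # output probabilities (symbols 0 and 1)
    print(forward_probability([0, 1, 0], pi, A, B))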

 Applications of Hidden Markov Models (HMM)


 speech recognition
 optical character recognition
 topic identification
 information retrieval search.

