(R18)
Information Retrieval System
Lecture Notes
Prepared by
Mrs. Swapna
(Professor & HOD, CSM)
Dept. CSE(AIML)
Course Objectives:
To learn the important concepts and algorithms in IRS
To understand the data/file structures that are necessary to design and implement information retrieval (IR) systems.
Course Outcomes:
Ability to apply IR principles to locate relevant information in large collections of data
Ability to design different document clustering algorithms
Implement retrieval systems for web search tasks.
Design an Information Retrieval System for web search tasks.
UNIT - I
Introduction to Information Retrieval Systems: Definition of Information Retrieval System, Objectives of Information Retrieval Systems, Functional
Overview, Relationship to Database Management Systems, Digital Libraries and Data Warehouses.
Information Retrieval System Capabilities: Search Capabilities, Browse Capabilities, Miscellaneous Capabilities.
UNIT - II
Cataloging and Indexing: History and Objectives of Indexing, Indexing Process, Automatic Indexing, Information Extraction.
Data Structure: Introduction to Data Structure, Stemming Algorithms, Inverted File Structure, N-Gram Data Structures, PAT Data Structure, Signature File
Structure, Hypertext and XML Data Structures, Hidden Markov Models.
UNIT - III
Automatic Indexing: Classes of Automatic Indexing, Statistical Indexing, Natural Language, Concept Indexing, Hypertext Linkages.
Document and Term Clustering: Introduction to Clustering, Thesaurus Generation, Item Clustering, Hierarchy of Clusters.
UNIT - V
Text Search Algorithms: Introduction to Text Search Techniques, Software Text Search Algorithms, Hardware Text Search Systems.
Multimedia Information Retrieval: Spoken Language Audio Retrieval, Non-Speech Audio Retrieval, Graph Retrieval, Imagery Retrieval, Video Retrieval.
TEXT BOOK:
1. Information Storage and Retrieval Systems – Theory and Implementation, Second Edition, Gerald J. Kowalski, Mark T. Maybury, Springer
REFERENCE BOOKS:
1. W.B. Frakes, Ricardo Baeza-Yates: Information Retrieval: Data Structures and Algorithms, Prentice Hall, 1992.
2. Robert Korfhage: Information Storage & Retrieval, John Wiley & Sons.
3. Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval, Pearson Education.
Chapter1:
Introduction to Information Retrieval Systems
Functional Overview
The effectiveness of an IR system is measured with precision and recall:

    Precision = Number_Retrieved_Relevant / Number_Total_Retrieved
    Recall = Number_Retrieved_Relevant / Number_Possible_Relevant

Where:
• Number_Possible_Relevant is the number of relevant items in the database.
• Number_Total_Retrieved is the total number of items retrieved by the query.
• Number_Retrieved_Relevant is the number of retrieved items that are relevant to the user's search need.
Measures associated with IR systems (cont.)
1. Item Normalization (cont.)
• Once standardization and zoning have been completed, the information (i.e., the words) used in the search process needs to be identified in the item.
• The first step in identifying a processing token is determining a word. Systems determine words by dividing the input symbols into three classes: valid word symbols, inter-word symbols, and special processing symbols.
• A word is defined as a contiguous set of word symbols bounded by inter-word symbols.
• Examples of word symbols are alphabetic characters and numbers.
• Examples of possible inter-word symbols are blanks, periods, and semicolons.
• Next, a Stop List/Algorithm is applied to the list of potential processing tokens.
• The objective of the Stop function is to save system resources by eliminating from the set of searchable processing tokens those that have little value to the system.
• Stop Lists are commonly found in most systems and consist of words (processing tokens) whose frequency and/or semantic use make them of no value as searchable tokens.
• Such words (e.g., "the") have no search value and are not a useful part of a user's query.
Chapter 2
Objectives
• Users may use public index files as part of search criteria to increase recall.
• The primary objective of indexing is representing the concepts within an item to facilitate users finding relevant information.
TEXT PROCESSING
1. Document Parsing. Documents come in all sorts of languages, character sets, and formats;
often, the same document may contain multiple languages or formats, e.g., a French email
with Portuguese PDF attachments. Document parsing deals with the recognition and
“breaking down” of the document structure into individual components. In this preprocessing
phase, unit documents are created; e.g., emails with attachments are split into one
document representing the email and as many documents as there are attachments.
2. Lexical Analysis. After parsing, lexical analysis tokenizes a document, seen as an input
stream, into words. Issues related to lexical analysis include the correct identification of
accents, abbreviations, dates, and cases. The difficulty of this operation depends much on the
language at hand: for example, the English language has neither diacritics nor cases, French
has diacritics but no cases, German has both diacritics and cases. The recognition of
abbreviations and, in particular, of time expressions would deserve a separate chapter due to
its complexity and the extensive literature in the field; see that literature for current approaches.
3. Stop-Word Removal. A subsequent step optionally applied to the results of lexical analysis
is stop-word removal, i.e., the removal of high-frequency words. For example, given the
sentence “search engines are the most visible information retrieval applications” and a classic
stop words set such as the one adopted by the Snowball stemmer, the effect of stop-word
removal would be: “search engines most visible information retrieval applications”.
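The lexical analysis and stop-word removal steps above can be sketched in Python. The stop-word set here is a tiny illustrative subset, not the actual Snowball list:

```python
# A minimal sketch of lexical analysis plus stop-word removal.
# STOP_WORDS is an illustrative assumption, not the Snowball list.
STOP_WORDS = {"are", "the", "a", "an", "of", "is"}

def tokenize(text):
    """Lexical analysis: split an input stream into lowercase word tokens."""
    return text.lower().split()

def remove_stop_words(tokens):
    """Drop high-frequency function words that carry little search value."""
    return [t for t in tokens if t not in STOP_WORDS]

sentence = "search engines are the most visible information retrieval applications"
print(" ".join(remove_stop_words(tokenize(sentence))))
# -> search engines most visible information retrieval applications
```

Note that stop-word removal alone does not alter word forms; stemming or lemmatization (steps 5 below) handle that separately.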
4. Phrase Detection. This step captures text meaning beyond what is possible with pure bag-
of-word approaches, thanks to the identification of noun groups and other phrases. Phrase
detection may be approached in several ways, including rules (e.g., retaining terms that are
not separated by punctuation marks), morphological analysis , syntactic analysis, and
combinations thereof. For example, scanning our example sentence “search engines are the
most visible information retrieval applications” for noun phrases would probably result in
identifying “search engines” and “information retrieval”.
5. Stemming and Lemmatization. Following phrase extraction, stemming and lemmatization
aim at stripping down word suffixes in order to normalize the word. In particular, stemming
is a heuristic process that “chops off” the ends of words in the hope of achieving the goal
correctly most of the time; a classic rule based algorithm for this was devised by Porter
[280]. According to the Porter stemmer, our example sentence “Search engines are the most
visible information retrieval applications” would result in: “Search engin are the most visibl
inform retriev applic”.
• Lemmatization is a process that typically uses dictionaries and morphological analysis of
words in order to return the base or dictionary form of a word, thereby collapsing its
inflectional forms (see, e.g., [278]). For example, our sentence would result in “Search
engine are the most visible information retrieval application” when lemmatized according to
a WordNet-based lemmatizer.
6. Weighting. The final phase of text preprocessing deals with term weighting. As
previously mentioned, words in a text have different descriptive power; hence, index terms
can be weighted differently to account for their significance within a document and/or a
document collection. Such a weighting can be binary, e.g., assigning 0 for term absence and
1 for presence.
SCOPE OF INDEXING
• When indexing is performed manually, problems arise from two sources: the author and the indexer.
• The vocabulary domains of the author and the indexer may differ.
• This results in different quality levels of indexing.
• The indexer must determine when to stop the indexing process.
• Two factors decide the level to which to index the concepts in an item: the exhaustivity and the specificity of indexing desired.
• Exhaustivity of indexing is the extent to which the different concepts in the item are indexed.
• For example, if two sentences of a 10-page item on microprocessors discuss on-board caches, should this concept be indexed?
• Specificity relates to the preciseness of the index terms used in indexing.
• For example, whether the term “processor” or the term “microcomputer” or the term
“Pentium” should be used in the index of an item is based upon the specificity decision.
• Indexing an item only on the most important concepts in it and using general index terms yields low exhaustivity and low specificity.
• Another decision on indexing is what portion of an item is to be indexed. The simplest case is to limit the indexing to the title and abstract (conceptual) zones.
• General indexing leads to loss of precision and recall.
PREORDINATION AND LINKAGES
• Another decision in the linkage process is whether linkages are available between index terms for an item.
• Linkages are used to correlate attributes associated with concepts discussed in an item. This process is called preordination.
• When index terms are not coordinated at index time, the coordination occurs at search time. This is called post-coordination, implemented by "AND"ing index terms.
• One factor that must be determined in the linkage process is the number of terms that can be related.
• Ex.: an item discusses "the drilling of oil wells in Mexico by CITGO and the introduction of oil refineries in Peru by the U.S."
AUTOMATIC INDEXING
• Case: total document indexing.
• Automatic indexing requires only a few seconds, depending on the processor and the complexity of the algorithms that generate the indexes.
• Its advantage is consistency in the index-term selection process.
• Indexes resulting from automated indexing fall into two classes: weighted and unweighted.
• Unweighted indexing system: the existence of an index term in a document, and sometimes its word location, is kept as part of the searchable data structure.
• Weighted indexing system: an attempt is made to place a value on each index term associated with a concept in the document, based on the frequency of occurrence of the term in the item.
• Values are normalized between 0 and 1.
• The results are presented to the user in order of rank value, from highest to lowest.
• Indexing by Term
• The terms (vocabulary) of the original item are used as the basis of the index process.
• There are two major techniques for index creation: statistical and natural language.
• Statistical techniques can be based upon vector models and probabilistic models, with a special case being the Bayesian model (which accounts for the uncertainty inherent in the model selection process).
• They are called statistical because their calculation of weights uses information such as the frequency of occurrence of words.
• Natural language techniques also use some statistical information, but perform more complex parsing to define the final set of index concepts.
• Other weighted systems are discussed as vectorized information systems.
• These systems emphasize weights as a foundation for information detection and store the weights in vector form.
• Each vector represents a document, and each position in a vector represents a unique word (processing token) in the database.
• The value assigned to each position is the weight of that term in the document.
• 0 indicates that the word was not in the document .
• Search is accomplished by calculating the distance between the query vector and document
vector.
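The distance calculation between query and document vectors can be sketched with cosine similarity, a common choice for this comparison (the weights below are illustrative values, not from the text):

```python
import math

def cosine_similarity(query_vec, doc_vec):
    """Rank a document by the angle between query and document vectors.

    Positions are processing tokens; values are term weights (0 = absent).
    """
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    q_norm = math.sqrt(sum(q * q for q in query_vec))
    d_norm = math.sqrt(sum(d * d for d in doc_vec))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

# A document sharing the query's terms scores higher than an unrelated one.
query = [1.0, 1.0, 0.0, 0.0]
doc_a = [0.8, 0.6, 0.0, 0.0]   # shares both query terms
doc_b = [0.0, 0.0, 0.9, 0.4]   # shares none
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # -> True
```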
• Bayesian approach: based on evidence reasoning (drawing conclusions from evidence).
• It could be applied as part of index-term weighting, but it is usually applied as part of the retrieval process, by calculating the relationship between an item and a specific query.
• Graphic representation: each node represents a random variable; arcs between the nodes represent the probabilistic dependencies between a node and its parents.
• In a two-level Bayesian network, "c" represents a concept in a query.
• Another approach is natural language processing.
• DR-LINK (Document Retrieval through LINguistic Knowledge) is an example system.
• Indexing by Concept
• Concept indexing determines a canonical set of concepts based upon a test set of terms and uses them as a basis for indexing all items. An example is latent semantic indexing.
• Ex.: the MatchPlus system developed by HNC Inc.
• It uses neural networks; the strength of the system is its handling of word relationships (synonyms), and it uses this information in generating context vectors.
• Two neural networks are used: one to generate stem context vectors and another to perform the query.
• Interpretation of the values is the same as for weights.
• Multimedia Indexing
• Indexing video or images can be accomplished at the raw data level.
• Positional and temporal (time) searches can be done.
INFORMATION EXTRACTION
There are two processes associated with information extraction:
• 1.determination of facts to go into structured fields in a database and
• 2.extraction of text that can be used to summarize an item.
• The process of extracting facts to go into indexes is called Automatic File Build.
• In establishing metrics to compare information extraction, precision and recall are applied
with slight modifications.
• Recall refers to how much information was extracted from an item versus how much should have been extracted from the item.
• It shows the amount of correct and relevant data extracted versus the correct and relevant
data in the item.
• Precision refers to how much information was extracted accurately versus the total
information extracted.
• Additional metrics used are over generation and fallout.
• Over generation measures the amount of irrelevant information that is extracted.
• This could be caused by templates filled on topics that are not intended to be extracted or
slots that get filled with non-relevant data.
• Fallout measures how much a system assigns incorrect slot fillers as the number of potential incorrect fillers increases.
• These measures are applicable to both human and automated extraction processes.
• Another related information technology is document summarization.
• Rather than trying to determine specific facts, the goal of document summarization is to
extract a summary of an item maintaining the most important ideas while significantly
reducing the size.
• Examples of summaries that are often part of an item are titles, tables of contents, and abstracts, with the abstract being the closest to a true summary.
• The abstract can be used to represent the item for search purposes or as a way for a user to
determine the utility of an item without having to read the complete item.
DATA STRUCTURES
• Introduction to Data Structures
• Stemming Algorithms
• Inverted File Structure
• N-Gram Data Structure
• PAT Data Structure
• Signature File Structure
• Hypertext and XML Data Structures
• Data structure: knowledge of the data structures used gives insight into the capabilities available to the system.
• Each data structure has a set of associated capabilities, such as the ability to represent concepts and their relationships.
Two major data structures exist in any IRS:
1. One structure stores and manages the received items in their normalized form; it is called the document manager.
2. The other data structure contains the processing tokens and associated data needed to support search.
(Figure: item normalization feeds the document file creation process; the document manager and the search manager maintain the two major data structures.)
The results of a search are references to the items that satisfy the search statement; these are passed to the document manager for retrieval.
Focus: data structures that support the search function.
Stemming is a transformation often applied to data before placing it in the searchable data structure.
Stemming reduces a concept (word) to a canonical (authorized, recognized, accepted) morphological (relating to the patterns of word formation in a particular language) representation.
Risk of stemming: concept discrimination information may be lost in the process, causing a decrease in performance.
Advantage: it has the potential to increase recall.
STEMMING ALGORITHMS
• A stemming algorithm is used to improve the efficiency of the IRS and to improve recall.
• Conflation (the process or result of fusing items into one entity; fusion; amalgamation) is the term used to refer to mapping multiple morphological variants to a single representation (the stem).
• The stem carries the meaning of the concept associated with the word, and the affixes (endings) introduce subtle (slight) modifications of that concept.
• Terms with a common stem will usually have similar meanings, for example:
• CONNECT
• CONNECTED
• CONNECTING
• CONNECTION
• CONNECTIONS
• Frequently, the performance of an IR system will be improved if term groups such as this are conflated into a single term. This may be done by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT.
• In addition, the suffix stripping process will reduce the total number of terms in the IR
system, and hence reduce the size and complexity of the data in the system, which is always
advantageous.
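The suffix removal described above can be sketched in Python. The suffix list is limited to the four endings mentioned in the text; a production stemmer (e.g., Porter's) uses ordered rule steps instead:

```python
# A sketch of suffix stripping for the CONNECT example above.
# Suffixes are tried longest first, so IONS is matched before ION.
SUFFIXES = ("IONS", "ION", "ING", "ED")

def conflate(term):
    """Map morphological variants to a single stem by removing one suffix."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix):
            return term[: -len(suffix)]
    return term

variants = ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]
print({conflate(v) for v in variants})  # -> {'CONNECT'}
```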
The major use of stemming is to improve recall.
It is important for a system to categorize a word prior to making the decision to stem it.
Proper names and acronyms (a word formed from the initial letters of a name, say IARE) should not have stemming applied.
Stemming can also cause problems for natural language processing (NLP) systems by causing loss of information.
PORTER STEMMING ALGORITHM
• Based on a set of conditions on the stem.
• A consonant in a word is a letter other than A, E, I, O or U. Some important stem conditions are:
1. The measure m of a stem is a function of sequences of vowels (V) followed by sequences of consonants (C). Every stem can be written as [C](VC)^m[V], where m is the number of VC repeats; the case m = 0 covers the null word.
2. *<X>: the stem ends with the letter X.
3. *v*: the stem contains a vowel.
4. *d: the stem ends in a double consonant (e.g. -TT, -SS).
5. *o: the stem ends in a consonant-vowel-consonant sequence where the final consonant is not w, x or y (e.g. -WIL, -HOP).
Suffix conditions take the form: current_suffix == pattern.
Actions are of the form: old_suffix -> new_suffix.
The rules are divided into steps that define the order in which they are applied.
Example rule (step 1c): *v* Y -> I, e.g. happy -> happi.
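The measure m described in condition 1 can be sketched in Python. Following Porter's definition, Y counts as a vowel only when preceded by a consonant:

```python
# A sketch of Porter's measure m, where a stem has the form [C](VC)^m[V].

def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y preceded by a consonant acts as a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Count VC repeats: the number of vowel-run to consonant-run transitions."""
    word = word.lower()
    cons = [is_consonant(word, i) for i in range(len(word))]
    m = 0
    for i in range(1, len(cons)):
        if cons[i] and not cons[i - 1]:  # a consonant run starts after vowels
            m += 1
    return m

# Porter's own examples: TREE has m=0, TROUBLE m=1, OATEN m=2.
print(measure("tree"), measure("trouble"), measure("oaten"))  # -> 0 1 2
```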
Dictionary Look-up Stemmers
These stemmers use dictionary look-up: the original term or a stemmed version of the term is looked up in a dictionary and replaced by the stem that best represents it.
This technique has been implemented in the INQUERY and RetrievalWare systems.
The INQUERY system uses a technique called KSTEM.
KSTEM is a morphological analyzer that conflates word variants to a root form.
It requires a word to be in the dictionary.
KSTEM uses six major data files to control and limit the stemming process:
1. Dictionary of words (lexicon)
2. Supplemental list of words for the dictionary
3. Exceptions list of words that should retain an "e" at the end (e.g., "suites" to "suite" but "suited" to "suit")
4. Direct_conflation: word pairs that override the stemming algorithm
5. Country_nationality_conflation (e.g., "British" maps to "Britain")
6. Proper nouns: words that should not be stemmed
New words that are not special forms (e.g., dates, phone numbers) are located in the
dictionary to determine simpler forms by stripping off suffixes and respelling plurals as
defined in the dictionary.
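The dictionary look-up process can be sketched in Python. The lexicon and conflation tables below are tiny illustrative stand-ins, not KSTEM's actual data files:

```python
# A sketch of dictionary look-up stemming in the spirit of KSTEM.
# LEXICON and DIRECT_CONFLATION are illustrative assumptions.
LEXICON = {"suite", "suit", "memo", "memorandum"}
DIRECT_CONFLATION = {"memoranda": "memorandum"}   # overrides the algorithm

def dict_stem(word):
    """Return the dictionary form that best represents the word."""
    if word in DIRECT_CONFLATION:          # word-pair override
        return DIRECT_CONFLATION[word]
    if word in LEXICON:                    # already a dictionary form
        return word
    if word.endswith("es") and word[:-1] in LEXICON:
        return word[:-1]                   # "suites" -> "suite" (keep the e)
    if word.endswith("ed") and word[:-2] in LEXICON:
        return word[:-2]                   # "suited" -> "suit"
    if word.endswith("s") and word[:-1] in LEXICON:
        return word[:-1]
    return word                            # unknown words pass through

print(dict_stem("suites"), dict_stem("suited"), dict_stem("memoranda"))
# -> suite suit memorandum
```

Stripping a suffix only when the resulting simpler form is in the dictionary is what keeps this approach from over-stemming.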
3. Successor Stemmers
Based on the lengths of prefixes that distinguish one word from another.
The process uses the successor varieties for a word.
This information is used to divide a word into segments, and one of the segments is selected as the stem.
The successor variety of words is used to segment a word by applying one of the following four methods:
1. Cutoff method: a cutoff value is selected to define the stem length.
2. Peak and plateau: a segment break is made after a character whose successor variety exceeds that of the characters immediately before and after it.
3. Complete word method: breaks are made on the boundaries of complete words.
4. Entropy method: uses the distribution of successor variety letters.
1. Let |Dak| be the number of words beginning with the k-length sequence of letters a.
2. Let |Dakj| be the number of words in Dak with successor j.
3. The probability that a member of Dak has the successor j is given by |Dakj| / |Dak|.
The entropy of |Dak| is:

    Hak = - Σ (j = 1 to 26) (|Dakj| / |Dak|) log2(|Dakj| / |Dak|)
After a word has been segmented, the segment to be used as the stem must be selected.
Hafer and Weiss selected the following rule:
    if (the first segment occurs in <= 12 words in the database)
        the first segment is the stem
    else the second segment is the stem
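The successor variety computation itself can be sketched in Python; the word corpus below is an illustrative assumption:

```python
# A sketch of successor variety computed from a small word corpus.
CORPUS = ["able", "axle", "accident", "ape", "about"]

def successor_variety(prefix, corpus=CORPUS):
    """Number of distinct letters that follow `prefix` in the corpus."""
    followers = {w[len(prefix)] for w in corpus
                 if w.startswith(prefix) and len(w) > len(prefix)}
    return len(followers)

# After "a" the corpus continues with b, x, c, p, b: 4 distinct letters.
print(successor_variety("a"))   # -> 4
print(successor_variety("ab"))  # -> 2  (able, about continue with l, o)
```

A segmentation method such as peak-and-plateau would then look for positions where this count rises relative to its neighbors.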
INVERTED FILE STRUCTURE
The most common data structure.
Inverted file structures are composed of three files:
1. The document file
2. The inversion list (posting list)
3. The dictionary
The inverted file is based on the methodology of storing an inversion of the documents: for each word, a list of the documents in which the word is found is stored (the inversion of the documents).
Each document is given a unique numerical identifier, which is stored in the inversion list.
The dictionary is used to locate the inversion list for a particular word; it is a sorted list of the processing tokens in the system, each with a pointer to the location of its inversion list.
The dictionary can also store other information used in query optimization, such as the lengths of the inversion lists, to increase precision.
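The three components above can be sketched in Python; the documents are illustrative, echoing the oil-wells example earlier in the notes:

```python
# A sketch of an inverted file: the dictionary maps each processing token
# to its inversion (posting) list of numeric document identifiers.
def build_inverted_index(documents):
    index = {}
    for doc_id, text in enumerate(documents, start=1):  # unique numeric ids
        for token in text.lower().split():
            postings = index.setdefault(token, [])
            if not postings or postings[-1] != doc_id:  # ids arrive in order
                postings.append(doc_id)
    return index

documents = [
    "petroleum refineries in mexico",
    "oil wells in mexico",
    "oil refineries in peru",
]
index = build_inverted_index(documents)
print(index["refineries"])  # -> [1, 3]

# Post-coordination: "AND"ing two terms intersects their inversion lists.
print(sorted(set(index["oil"]) & set(index["mexico"])))  # -> [2]
```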
N-GRAM DATA STRUCTURES
N-grams were widely used in cryptography in World War II and in spelling-error detection and correction.
Advantage: they place a finite limit on the number of searchable tokens. The maximum number of unique n-grams that can be generated is

    MaxSeg_n = (Σ)^n

where "n" is the length of the n-grams and Σ is the number of processable symbols.
Disadvantage: the longer the n-gram, the larger the inversion lists become.
Reported performance is about 85% precision.
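N-gram generation and the MaxSeg bound can be sketched in Python (the 26-letter alphabet is an illustrative assumption):

```python
# A sketch of character n-gram generation. With an alphabet of S
# processable symbols, at most MaxSeg_n = S**n unique n-grams can
# ever occur, which bounds the searchable-token vocabulary.
def ngrams(text, n):
    """Slide an n-character window over the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(ngrams("retrieval", 3))
# -> ['ret', 'etr', 'tri', 'rie', 'iev', 'eva', 'val']

# Bound on the trigram vocabulary over a 26-letter alphabet:
print(26 ** 3)  # -> 17576
```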
PAT DATA STRUCTURE
A PAT structure is built over the sistrings (semi-infinite strings) of the input text; the sistrings are unique.
Each position in an input string is an anchor point for a sistring.
In the creation of PAT trees, each position in the input string is the anchor point for a sistring that starts at that point and includes all new text up to the end of the input.
A PAT tree is a binary tree, the most common class of structure for prefix search; but PAT trees are sorted logically, which facilitates range searches, and they are more accurate than an inversion file.
PAT trees provide an alternate structure for supporting string searches.
• The key values are stored at the leaf nodes (bottom nodes) in the PAT Tree.
• For a text input of size “n” there are “n” leaf nodes and at most “n-1” higher-level nodes.
• It is possible to place additional constraints on sistrings for the leaf nodes.
• If the binary representations of “h” (100), “o” (110), “m” (001) and “e” (101) are used, then the word “home” produces the input 100110001101..., giving the sistrings below.
INPUT 100110001101
sistring 1 1001....
sistring 2 001100...
sistring 3 01100....
sistring 4 11.......
sistring 5 1000...
sistring 6 000.....
sistring 7 001101
sistring 8 01101
The value in the intermediate nodes (indicated by rectangles) is the number of bits to skip
until the next bit to compare that causes differences between similar terms.
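The sistring list above can be reproduced with a short Python sketch:

```python
# A sketch of sistring generation for the binary input above.
# Sistring k starts at position k and runs to the end of the input.
def sistrings(text):
    return {k: text[k - 1:] for k in range(1, len(text) + 1)}

INPUT = "100110001101"
sis = sistrings(INPUT)
print(sis[2])  # -> 00110001101   (matches "sistring 2 001100...")
print(sis[8])  # -> 01101         (matches "sistring 8 01101")
```

Each sistring is a distinct suffix of the input, so all of them can be stored in one binary digital tree keyed on their leading bits.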
SIGNATURE FILE STRUCTURE
The coding is based upon the words in the item.
The words are mapped into word signatures.
A word signature is a fixed-length code with a fixed number of bits set to 1.
The bit positions that are set to one are determined via a hash function of the word.
The word signatures are OR'ed together to create the signature of an item.
Partitioning of words is done by block size (a block is simply a set of words); a typical code length is 16 bits.
Search is accomplished by template matching on the bit positions.
Signature files provide a practical solution applied in parallel processing, distributed environments, etc.
• To avoid signatures being too dense with “1”s, a maximum number of words is
specified and an item is partitioned into blocks of that size.
• The block size is set at five words, the code length is 16 bits and the number of bits that are
allowed to be “1” for each word is five.
• TEXT: Computer Science graduate students study (assume block size is five words)
(Table: the 16-bit signature for each word, with the block signature formed by superimposed coding, i.e., OR'ing the word signatures together.)
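Superimposed coding can be sketched in Python; the hash-to-bit-position scheme below is an illustrative assumption, not the text's exact coding:

```python
import hashlib

# A sketch of superimposed coding: each word hashes to a 16-bit signature
# with (up to) 5 bits set; the block signature ORs the word signatures.
CODE_LENGTH = 16
BITS_PER_WORD = 5

def word_signature(word):
    digest = hashlib.md5(word.encode()).digest()
    sig = 0
    for byte in digest[:BITS_PER_WORD]:       # may collide, so <= 5 bits set
        sig |= 1 << (byte % CODE_LENGTH)
    return sig

def block_signature(words):
    sig = 0
    for w in words:
        sig |= word_signature(w)              # superimposed coding
    return sig

block = ["computer", "science", "graduate", "students", "study"]
sig = block_signature(block)

# Template matching: a word can only be in the block if all its bits are set.
print(sig & word_signature("computer") == word_signature("computer"))  # -> True
```

Because OR'ing only adds bits, a query word whose signature bits are not all present in the block signature can be rejected immediately; matches, however, must still be verified against the text (false drops are possible).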
Application(s)/Advantage(s)
• Signature files provide a practical solution for storing and
locating information in a number of different situations.
HIDDEN MARKOV MODELS
The HMM is a probabilistic model for machine learning. It is mostly used in speech recognition; to some extent it is also applied to classification tasks.
A sequence classifier or sequence labeler is a model whose job is to assign some label or class to each unit in a sequence. The HMM is a probabilistic sequence classifier: given a sequence of units (words, letters, morphemes, sentences, etc.), its job is to compute a probability distribution over possible label sequences and choose the best one.
The Hidden Markov Model is one of the most important machine learning models in speech and language processing. In order to define it properly, we need to first introduce the Markov chain.
Markov chains and Hidden Markov Models are both extensions of finite automata. According to Jurafsky and Martin [2005], a weighted finite-state automaton is an automaton in which each arc is associated with a probability, based on the input observation; a Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through. Since Markov chains cannot represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences.
Advantages:
The underlying theoretical basis is much more sound, elegant, and easy to understand.
It is easier to implement and analyze.
HMM taggers are very simple to train (one just needs to compile counts from the training corpus).
HMMs have also proved effective for a number of other tasks, such as speech recognition, handwriting recognition, and sign language recognition.
Because each HMM uses only positive data, they scale well; new words can be added without affecting the learnt HMMs.
Disadvantages:
In order to define the joint probability over observation and label sequences, an HMM needs to enumerate all possible observation sequences.
The main difficulty is that modeling the probability of assigning a tag to a word can be very difficult if the "words" are complex.
In other words, HMM training involves maximizing the observed probabilities for examples belonging to a class; it does not minimize the probability of observing instances from other classes.
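Choosing the best label sequence for an observation sequence is done with the Viterbi algorithm; the two-state weather model below is a standard toy example, not from these notes:

```python
# A sketch of HMM decoding with the Viterbi algorithm: given observations,
# find the most probable sequence of hidden states (labels).
def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s] = (prob of best path ending in state s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        best.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p][0] * trans_p[p][s])
            prob = best[t - 1][prev][0] * trans_p[prev][s] * emit_p[s][obs[t]]
            best[t][s] = (prob, prev)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return list(reversed(path))

states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

print(viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p))
# -> ['Sunny', 'Rainy', 'Rainy']
```

For part-of-speech tagging, the states would be tags, the observations words, and the transition and emission probabilities the counts compiled from a training corpus.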
Unit 3
1. Statistical Indexing (probabilistic weight calculation; vector weights: simple term frequency, inverse document frequency, signal weighting, discrimination value)
2. Concept Indexing
3. Natural Language
4. Hypertext Linkages
1.1 Statistical Indexing : Statistical indexing uses frequency of occurrence of events to calculate a number that is
used to indicate the potential relevance of an item.
Statistical strategies cover the broadest range of indexing techniques and are the most prevalent in commercial
systems.
The basis for a statistical approach is use of frequency of occurrence of events. The events usually are related to
occurrences of processing tokens (words/phrases) within documents and within the database.
The words/phrases are the domain of searchable values. The statistics applied to the event data are probabilistic, Bayesian, and vector-space based. The static approach stores a single statistic, such as how often each word occurs in an item, that is used in generating relevance scores after a standard Boolean search.
1.2 Probabilistic Indexing: stores information that is used in calculating a probability that a particular item satisfies (i.e., is relevant to) a particular query.
There are many different areas in which the probabilistic approach may be applied.
The method of logistic regression is described as an example of how a probabilistic approach is applied to
information retrieval
The approach starts by defining a "Model 0" system which exists before specific probabilistic models are applied. In
a retrieval system there exist query terms qi and document terms di, which have a set of attributes (vl ..... Vn) from
the query (e.g., counts of term frequency in the query), from the document (e.g., counts of term frequency in the
document ) and from the database (e.g., total number of documents in the database divided by the number of
documents indexed by the term).
The logistic reference model uses a random sample of query-document-term triples for which binary relevance judgments have been made from a training sample.
1.3 Vector Weighting: A vector is a one-dimensional set of values, where the order/position of each value in the
set is fixed and represents a particular domain. In information retrieval, each position in the vector typically
represents a processing token.
There are two approaches to the domain of values in the vector: binary and weighted.
Figure 5.2 shows how an item that discusses petroleum refineries in Mexico would be represented. In the example, the major topics discussed are indicated by the index terms for each column (i.e., Petroleum, Mexico, Oil, Taxes, Refineries and Shipping).
Binary vectors require a decision process to determine if the degree that a particular processing token represents
the semantics of an item is sufficient to include it in the vector.
In the example for Figure 5.2, a five-page item may have had only one sentence like "Standard taxation of the shipment of the oil to refineries is enforced." For the binary vector, the concepts of "Tax" and "Shipment" are below the threshold of importance (e.g., assume the threshold is 1.0) and so are not included.
1.3.1 Simple Term Frequency Algorithm:
The simplest approach is to have the weight equal to the term frequency. This approach emphasizes the use of a
particular processing token within an item
1.3.2 Inverse Document Frequency:
Some weighting algorithms specify that the weight assigned to a term should be inversely proportional to the frequency of occurrence of the term in the database. A common formulation is

    WEIGHTij = TFij * (log2(n / IFj) + 1)

where WEIGHTij is the vector weight that is assigned to term "j" in item "i", TFij (term frequency) is the frequency of term "j" in item "i", "n" is the number of items in the database, and IFj (item frequency or document frequency) is the number of items in the database that have term "j" in them.
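A common inverse-document-frequency weighting, WEIGHT = TF * (log2(n / IF) + 1), matching the variables described in the text, can be sketched as (the numbers in the example are illustrative):

```python
import math

# A sketch of inverse-document-frequency weighting:
# WEIGHT_ij = TF_ij * (log2(n / IF_j) + 1)
def idf_weight(tf, n_items, item_freq):
    return tf * (math.log2(n_items / item_freq) + 1)

# A term occurring 10 times in an item, in a database of 2048 items
# where 128 items contain the term: 10 * (log2(16) + 1) = 50.
print(idf_weight(10, 2048, 128))  # -> 50.0
```

A rare term (small IF) thus receives a much larger weight than a term that appears in most items.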
1.3.3 Signal Weighting
The distribution of the frequency of processing tokens within an item can affect the ability to rank items.
For example, assume the terms "SAW" and "DRILL" are found in 5 items with the frequencies defined in Figure 5.5. Both terms are found a total of 50 times in the five items. The term "SAW" does not give any insight into which item is more likely to be relevant to a search for "SAW".
1.3.4 Discrimination Value:
To achieve the objective of finding relevant items, it is important that the index discriminates among items.
There are three possibilities with the DISCRIMi value being positive, close to zero or negative.
A positive value indicates that removal of term "i" has increased the similarity between items. In this case, leaving
the term in the database assists in discriminating between items and is of value.
A value close to zero implies that the term's removal or inclusion does not change the similarity between items.
If the value of DISCRIM_i is negative, the term's effect on the database is to make the items appear more similar,
since their average similarity decreased with its removal.
Once the value of DISCRIM_i is normalized as a positive number, it can be used in the standard weighting formula as:
Weight_ik = TF_ik * DISCRIM_k
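The discrimination value can be sketched with a simple sum-of-products similarity between items (the similarity measure and example data are assumptions for illustration):

```python
def avg_pairwise_similarity(matrix):
    """Average "sum of the products" similarity over all item pairs.

    matrix: rows are items, columns are term weights.
    """
    n = len(matrix)
    sims = [sum(x * y for x, y in zip(matrix[a], matrix[b]))
            for a in range(n) for b in range(a + 1, n)]
    return sum(sims) / len(sims)

def discrimination_value(matrix, k):
    """DISCRIM_k = average similarity without term k minus with term k.

    Positive  -> removing term k increased similarity, so keeping it
                 helps discriminate between items.
    Near zero -> term k has little effect.
    Negative  -> term k itself makes the items appear more similar.
    """
    without = [[w for j, w in enumerate(row) if j != k] for row in matrix]
    return avg_pairwise_similarity(without) - avg_pairwise_similarity(matrix)

# Term 0 is heavily shared by two of the three items, so it makes them
# look similar; its discrimination value is therefore negative.
items = [[3, 1], [3, 1], [0, 1]]
d = discrimination_value(items, 0)
```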
Disadvantages of vector weighting approaches:
1. Dynamically changing the weights whenever new terms are entered into the database becomes complex.
2. Every new term entered requires complex recalculation, since the existing weights tend to change.
3. A large amount of storage space is needed for the calculations, and maintenance is difficult.
4. Vector representation uses 0 and 1 for non-weighted and weighted terms, and when a proximity search is applied
to the vector model, a larger number of irrelevant items may be retrieved.
The objective of creating the index to an item is to represent the semantic information in the item. A Bayesian
network can be used to determine the final set of processing tokens (called topics) and their weights. Figure 5.6
shows a simple view of the process, where T_i represents the relevance of topic "i" in a particular item and PT_j
represents a statistic associated with the event of processing token "j" being present in the item.
Figure 5.7 shows the extended Bayesian network. Extending the network creates new processing tokens for those
cases where there are dependencies between processing tokens.
2 Concept Indexing:
Concept indexing maps a given search term to the concepts it represents, according to the weights with which the
term is indexed against each related concept.
Natural Language Processing (NLP) refers to the AI method of communicating with an intelligent system using a
natural language such as English.
Processing of natural language is required when you want an intelligent system, such as a robot, to perform as per
your instructions, or when you want to hear decisions from a dialogue-based clinical expert system, etc.
The field of NLP involves making computers perform useful tasks with the natural languages humans use. The
input and output of an NLP system can be −
Speech
Written Text
Natural Language Generation (NLG) is the process of producing meaningful phrases and sentences in the form of
natural language from some internal representation.
It involves −
Text planning − It includes retrieving the relevant content from the knowledge base.
Sentence planning − It includes choosing the required words, forming meaningful phrases, and setting the tone of
the sentence.
Difficulties in NLU
For example, “He lifted the beetle with red cap.” − Did he use the cap to lift the beetle, or did he lift a beetle that
had a red cap?
Referential ambiguity − Referring to something using pronouns. For example, Rima went to Gauri. She said, “I am
tired.” − Exactly who is tired?
Syntax − It refers to arranging words to make a sentence. It also involves determining the structural role of words in
the sentence and in phrases.
Semantics − It is concerned with the meaning of words and how to combine words into meaningful phrases and
sentences.
Pragmatics − It deals with using and understanding sentences in different situations and how the interpretation of
the sentence is affected.
Discourse − It deals with how the immediately preceding sentence can affect the interpretation of the next
sentence.
Steps in NLP
Lexical Analysis − It involves identifying and analyzing the structure of words. The lexicon of a language means the
collection of words and phrases in that language. Lexical analysis divides the whole chunk of text into paragraphs,
sentences, and words.
Syntactic Analysis (Parsing) − It involves analysis of words in the sentence for grammar and arranging words in a
manner that shows the relationship among the words. The sentence such as “The school goes to boy” is rejected by
English syntactic analyzer.
Semantic Analysis − It draws the exact meaning or the dictionary meaning from the text. The text is checked for
meaningfulness. It is done by mapping syntactic structures and objects in the task domain. The semantic analyzer
disregards sentence such as “hot ice-cream”.
Discourse Integration − The meaning of any sentence depends upon the meaning of the sentence just before it. In
addition, it also brings about the meaning of immediately succeeding sentence.
Pragmatic Analysis − During this step, what was said is re-interpreted based on what it actually meant. It involves
deriving those aspects of language which require real-world knowledge.
2.2 Index Phrase Generation
The goal of indexing is to represent the semantic concepts of an item in the information system to support finding
relevant information. Single words have conceptual context, but frequently they are too general to help the user
find the desired information. Term phrases allow additional specification and focusing of the concept to provide
better precision and reduce the user's overhead of retrieving non-relevant items. Having the modifier "grass" or
"magnetic" associated with the term "field" clearly disambiguates between very different concepts. One of the
earliest statistical approaches to determining term phrases, proposed by Salton, was the use of a COHESION factor
between terms:

COHESION_kh = SIZE-FACTOR * (PAIR-FREQ_kh / (TOTF_k * TOTF_h))

where SIZE-FACTOR is a normalization factor based upon the size of the vocabulary, PAIR-FREQ_kh is the total
frequency of co-occurrence of the pair Term_k, Term_h in the item collection, and TOTF_k is the total frequency of
Term_k in the collection.
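The cohesion factor can be sketched directly (argument names are illustrative, and SIZE-FACTOR is assumed to be 1.0 here):

```python
def cohesion(pair_freq, totf_k, totf_h, size_factor=1.0):
    """COHESION_kh = SIZE-FACTOR * PAIR-FREQ_kh / (TOTF_k * TOTF_h).

    pair_freq   -- co-occurrence frequency of Term_k and Term_h
    totf_k/h    -- total frequency of each term in the collection
    size_factor -- normalization based on vocabulary size (assumed 1.0)
    """
    return size_factor * pair_freq / (totf_k * totf_h)

# "grass" and "field" co-occur 10 times; each occurs 20 times overall.
c = cohesion(10, 20, 20)
```

Pairs whose co-occurrence is high relative to their individual frequencies receive the highest cohesion, and are the best candidates for index phrases.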
4.HyperText Linkages:
In the context of search indexing, metadata is data that tells search engines what a webpage is about. Often the
meta title and meta description are what will appear on search engine results pages, as opposed to content from
the webpage that's visible to users.
Hyperlink Vs Hypertext
The two terms are closely related and complement each other, but it is easy to confuse which is which. To clear up
that confusion, we will discuss each in detail with proper examples and explain the differences. Both terms are
used in the WWW (World Wide Web).
Hyperlink: A hyperlink contains the URL of a webpage. In general, a hyperlink is referenced when hypertext is
navigated. Hyperlinks are hidden under text, images, graphics, audio, and video, and become highlighted when we
hover the mouse over them. To activate a hyperlink, we click the hypermedia, which opens the new document. It
establishes the connection between knowledge units; the destination is usually known as the target document,
and an alternate name for the hyperlink is anchor or node.
Hypertext: Ted Nelson coined the term hypertext in the early 1960s. Hypertext is visible text that redirects to the
targeted page (whose URL is contained in the hyperlink). It was invented to establish cross-references in the
computer world, similar to those made in books, such as an index. The usual pattern of reading a book is
sequential, but hypertext introduces the idea of cross-referencing the data. Such cross-referencing would be
complicated on paper, but here it makes the work easier. If, while surfing the web and reading a piece of writing,
we suddenly encounter a term we want to understand at that moment, and that term is hypertext, we can go
directly to the page where we will find information about it. This eliminates the additional time spent searching
for that term.
Example: The following example combines both terms.
What is a web crawler bot?
A web crawler, spider, or search engine bot downloads and indexes content from all over the Internet. The goal of
such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved
when it's needed. They're called "web crawlers" because crawling is the technical term for automatically accessing a
website and obtaining data via a software program.
These bots are almost always operated by search engines. By applying a search algorithm to the data collected by
web crawlers, search engines can provide relevant links in response to user search queries, generating the list of
webpages that show up after a user types a search into Google or Bing (or another search engine).
A web crawler bot is like someone who goes through all the books in a disorganized library and puts together a card
catalog so that anyone who visits the library can quickly and easily find the information they need.
The Internet is constantly changing and expanding. Because it is not possible to know how many total webpages
there are on the Internet, web crawler bots start from a seed, or a list of known URLs. They crawl the webpages at
those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list
of pages to crawl next.
Given the vast number of webpages on the Internet that could be indexed for search, this process could go on
almost indefinitely. However, a web crawler will follow certain policies that make it more selective about which
pages to crawl, in what order to crawl them, and how often they should crawl them again to check for content
updates.
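The seed-and-frontier behavior described above can be sketched with an in-memory link graph standing in for real HTTP fetches (the graph, URLs, and function names are illustrative):

```python
from collections import deque

def crawl(seeds, link_graph, limit=100):
    """Breadth-first crawl starting from a list of seed URLs.

    link_graph maps a URL to the hyperlinks found on that page;
    a real crawler would fetch and parse each page instead.
    """
    frontier = deque(seeds)          # pages to crawl next
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        visited.append(url)          # a real crawler would index the page here
        for link in link_graph.get(url, []):
            if link not in seen:     # newly discovered URLs join the frontier
                seen.add(link)
                frontier.append(link)
    return visited

graph = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": ["a.com"],
}
order = crawl(["a.com"], graph)
```

A production crawler would additionally prioritize the frontier by page importance and revisit policy, as described above, rather than crawling in pure first-in, first-out order.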
The relative importance of each webpage: Most web crawlers don't crawl the entire publicly available Internet and
aren't intended to; instead they decide which pages to crawl first based on the number of other pages that link to
that page, the amount of visitors that page gets, and other factors that signify the page's likelihood of containing
important information.
The idea is that a webpage that is cited by a lot of other webpages and gets a lot of visitors is likely to contain high-
quality, authoritative information, so it's especially important that a search engine has it indexed – just as a library
might make sure to keep plenty of copies of a book that gets checked out by lots of people.
Revisiting webpages: Content on the Web is continually being updated, removed, or moved to new locations. Web
crawlers will periodically need to revisit pages to make sure the latest version of the content is indexed.
PART 2 – Unit 3
1.Introduction to Clustering
The goal of clustering is to assist in the location of information. Clustering of words originated with the
generation of thesauri. Thesaurus, coming from the Latin word meaning “treasure,” is similar to a dictionary in that it
stores words. Instead of definitions, it provides the synonyms and antonyms for the words. Its primary purpose is to
assist authors in selection of vocabulary. The goal of clustering is to provide a grouping of similar objects (e.g., terms
or items) into a “class” under a more general title. Clustering also allows linkages between clusters to be specified.
The term class is frequently used as a synonym for the term cluster.
The process of clustering follows the following steps:
Define the domain for the clustering effort. Defining the domain for the clustering identifies those objects to be used
in the clustering process. Ex: Medicine, Education, Finance etc.
Once the domain is determined, determine the attributes of the objects to be clustered. (Ex: Title, Place, job etc
zones)
Determine the strength of the relationships between the attributes whose co-occurrence in objects suggest those
objects should be in the same class.
Apply some algorithm to determine the class(s) to which each item will be assigned.
There are three basic methods for generation of a thesaurus: hand-crafted, co-occurrence, and header-modifier
based. In header-modifier based thesauri, term relationships are found based upon linguistic relationships. Words
appearing in similar grammatical contexts are assumed to be similar. The linguistic parsing of the document
discovers the following syntactical structures: Subject-Verb, Verb- Object, Adjective-Noun, and Noun-Noun. Each
noun has a set of verbs, adjectives and nouns that it co-occurs with, and a mutual information value is calculated for
each using typically a log function.
The art of manual thesaurus construction resides in the selection of the set of words to be included. Care is taken to
not include words that are unrelated to the domain of the thesaurus. If a
concordance is used, other tools such as KWOC, KWIC or KWAC may help in determining useful words. A Key Word
Out of Context (KWOC) is another name for a concordance. Key Word In Context (KWIC) displays a possible term in
its phrase context. It is structured to identify easily the location of the term under consideration in the sentence. Key
Word And Context (KWAC) displays the keywords followed by their context.
In the Figure 6.1 the character “/” is used in KWIC to indicate the end of the phrase. The KWIC and KWAC are useful
in determining the meaning of homographs.
Once the terms are selected they are clustered based upon the word relationship guidelines and the interpretation
of the strength of the relationship. This is also part of the art of manual creation of the thesaurus, using the
judgment of the human analyst.
There are many techniques for the automatic generation of term clusters to create statistical thesauri. When the
number of clusters created is very large, the initial clusters may be used as a starting point to generate more abstract
clusters creating a hierarchy. The basis for automatic generation of a thesaurus is a set of items that represents the
vocabulary to be included in the thesaurus. Selection of this set of items is the first step of determining the domain
for the thesaurus. The processing tokens (words) in the set of items are the attributes to be used to create the
clusters.
Implementation of the other steps differs based upon the algorithms being applied. The automated method of
clustering documents is based upon the polythetic clustering where each cluster is defined by a set of words and
phrases. Inclusion of an item in a cluster is based upon the similarity of the item's words and phrases to those of
other items in the cluster.
In the complete term relation method, the similarity between every term pair is calculated as a basis for
determining the clusters. The easiest way to understand this approach is to consider the vector model. The vector
model is represented by a matrix where the rows are individual items and the
columns are the unique words (processing tokens) in the items. The values in the matrix represent how strongly that
particular word represents concepts in the item.
Figure 6.2 provides an example of a database with 5 items and 8 terms. To determine the relationship between
terms, a similarity measure is required. The measure calculates the similarity between two terms. In Chapter 7 a
number of similarity measures are presented. The exact similarity measure used is not critical; a simple "sum of the
products" measure suffices:

SIM(Term_j, Term_h) = Σ_k (Term_jk * Term_hk)

where "k" is summed across the set of all items. In effect the formula takes the two columns of the two terms being
analyzed, multiplying and accumulating the values in each row. The results can be placed in a resultant "m" by "m"
matrix, called a Term-Term Matrix (Salton-83), where "m" is the number of columns (terms) in the original matrix.
This simple formula is reflexive, so the matrix that is generated is symmetric. Other similarity formulas could
produce a non-symmetric matrix.
Using the data in Figure 6.2, the Term-Term matrix produced is shown in Figure 6.3. There are no values on the
diagonal since that represents the autocorrelation of a word to itself. The next step is to select a threshold that
determines if two terms are considered similar enough to each other to be in the same class. In this example, the
threshold value of 10 is used. Thus two terms are considered similar if the similarity value between them is 10 or
greater. This produces a new binary matrix called the Term Relationship matrix (Figure 6.4) that defines which terms
are similar.
A one in the matrix indicates that the terms specified by the column and the row are similar enough to be in the
same class. Term 7 demonstrates that a term may exist on its own with no other similar terms identified. In any of
the clustering processes described below this term will always migrate to a class by itself.
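The two steps above, building the Term-Term Matrix and thresholding it into the Term Relationship Matrix, can be sketched as follows (the example data is illustrative, not the matrix of Figure 6.2):

```python
def term_term_matrix(item_term):
    """Sum-of-products similarity between every pair of term columns.

    item_term: rows are items, columns are term weights.
    Diagonal entries stay 0 (a term's autocorrelation is ignored).
    """
    m = len(item_term[0])
    tt = [[0] * m for _ in range(m)]
    for j in range(m):
        for h in range(m):
            if j != h:
                tt[j][h] = sum(row[j] * row[h] for row in item_term)
    return tt

def relationship_matrix(tt, threshold):
    """Binary matrix: 1 where the similarity meets the threshold."""
    return [[1 if v >= threshold else 0 for v in row] for row in tt]

items = [
    [2, 0, 1],   # term weights in item 1
    [1, 3, 0],   # item 2
    [0, 4, 1],   # item 3
]
tt = term_term_matrix(items)
rel = relationship_matrix(tt, threshold=3)
```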
The final step in creating clusters is to determine when two objects (words) are in the same cluster.
There are many different algorithms available. The following algorithms are the most common: cliques,
single link, stars and connected components.
Applying the cliques algorithm to Figure 6.4, the following classes are created:
Class 5 (Term 7)
Notice that Term 1 and Term 6 are in more than one class. A characteristic of this approach is that terms
can be found in multiple classes. In single link clustering the strong constraint that every term in a class is
similar to every other term is relaxed.
The rule to generate single link clusters is that any term that is similar to any term in the cluster can be
added to the cluster. It is impossible for a term to be in two different clusters. This in effect partitions the
set of terms into the clusters. The algorithm is:
Applying the algorithm for creating clusters using single link to the Term Relationship Matrix, Figure 6.4,
the following classes are created:
Class 2 (Term 7)
There are many other conditions that can be placed on the selection of terms to be clustered.
Class 3 (Term 7)
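Single link clustering partitions the terms into the connected components of the Term Relationship Matrix; a sketch (the matrix shown is illustrative, not Figure 6.4):

```python
def single_link_clusters(rel):
    """Partition terms into clusters: any term similar to any term
    already in a cluster joins that cluster (connected components)."""
    unassigned = set(range(len(rel)))
    clusters = []
    while unassigned:
        seed = unassigned.pop()
        cluster, stack = {seed}, [seed]
        while stack:
            t = stack.pop()
            for u in list(unassigned):
                if rel[t][u]:            # u is similar to a cluster member
                    unassigned.remove(u)
                    cluster.add(u)
                    stack.append(u)
        clusters.append(sorted(cluster))
    return sorted(clusters)

# Terms 0, 1 and 3 are chained together; term 2 is similar to nothing
# and migrates to a class by itself.
rel = [
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
]
classes = single_link_clusters(rel)
```

Unlike cliques, this produces a true partition: no term can appear in two clusters.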
Item Clustering: item clustering uses the same methodology as term clustering, except that similarity is computed
between the rows (items) of the matrix rather than the columns (terms).
1.4.2 Cluster Using Existing Clusters
2. Hierarchy of Clusters (HACM)
Agglomerative clustering is the most common type of hierarchical clustering used to group objects into clusters
based on their similarity. It is also known as AGNES (Agglomerative Nesting). The algorithm starts by treating each
object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged
into one big cluster containing all objects. The result is a tree-based representation of the objects, called a
dendrogram.
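A minimal single-linkage AGNES sketch over one-dimensional points (the data and distance function are illustrative); each recorded merge corresponds to one junction of the dendrogram:

```python
def agnes(points, dist):
    """Agglomerative (AGNES) clustering with single linkage.

    Starts with each point as a singleton cluster and repeatedly merges
    the closest pair until one cluster remains. The list of merges is a
    flat record of the dendrogram.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between closest members.
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

points = [0.0, 0.1, 5.0]
merges = agnes(points, dist=lambda x, y: abs(x - y))
```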
Objectives of the dendrogram:
Unit 4
User Search Techniques: Search statements and binding, Similarity measures and ranking, Relevance feedback,
Selective dissemination of information search, Weighted searches of Boolean systems, Searching the Internet and
hypertext.
Information Visualization: Introduction, Cognition and perception, Information visualization technologies.
Search statements are the statements of an information need generated by users to specify
the concepts they are trying to locate in items.
In generation of the search statement, the user may have the ability to weight (assign an
importance) to different concepts in the statement. At this point the binding is to the
vocabulary and past experiences of the user. Binding in this sense is when a more abstract
form is redefined into a more specific form. The search statement is the user’s attempt to
specify the conditions needed to subset logically the total item space to that cluster of items
that contains the information needed by the user.
The next level of binding comes when the search statement is parsed for use by a specific
search system.
The final level of binding comes as the search is applied to a specific database. This binding
is based upon the statistics of the processing tokens in the database and the semantics used
in the database. This is especially true in statistical and concept indexing systems.
Figure 7.1 illustrates the three potential different levels of binding. Parentheses are used in
the second binding step to indicate expansion by a thesaurus.
A variety of different similarity measures can be used to calculate the similarity between the
item and the search statement. A characteristic of a similarity formula is that the results of
the formula increase as the items become more similar. The value is zero if the items are
totally dissimilar. An example of a simple "sum of the products" similarity measure, from the example used earlier to
determine the similarity between documents for clustering purposes, is:

SIM(Item_i, Item_j) = Σ_k (Term_ik * Term_jk)

Croft and Harper (Croft-79) proposed an early probabilistic similarity measure. Croft expanded this original concept,
taking into account the frequency of occurrence of terms within an item, producing a refined similarity formula
(Croft-83), where K is a constant that tunes the relative importance of term frequency within an item. The best
values for K seemed to range between 0.3 and 0.5.
Another early similarity formula, used by Salton, treated the index and the search query as n-dimensional vectors.
To determine the "weight" an item has with respect to the search statement, the Cosine formula is used to
calculate the distance between the vector for the item and the vector for the query:

SIM(Item_i, Query_j) = Σ_k (Term_ik * Query_jk) / (sqrt(Σ_k (Term_ik)^2) * sqrt(Σ_k (Query_jk)^2))

where Term_ik is the weight of term "k" in item "i" and Query_jk is the weight of term "k" in query "j".
The similarity measure selects the hit items, and the output is displayed according to the ranking values.
Ranking the output means ordering it from the items most likely to satisfy the query to the least likely.
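Cosine ranking can be sketched over sparse term-weight dictionaries (the document names and weights are illustrative):

```python
import math

def cosine(item, query):
    """Cosine similarity between two sparse term->weight vectors."""
    dot = sum(w * query.get(t, 0.0) for t, w in item.items())
    norm = (math.sqrt(sum(w * w for w in item.values()))
            * math.sqrt(sum(w * w for w in query.values())))
    return dot / norm if norm else 0.0

def rank(items, query):
    """Order item names from most to least similar to the query."""
    return sorted(items, key=lambda name: cosine(items[name], query),
                  reverse=True)

items = {
    "d1": {"oil": 2.0, "tax": 1.0},
    "d2": {"ship": 1.0},
    "d3": {"oil": 1.0, "ship": 3.0},
}
hits = rank(items, {"oil": 1.0})
```

Because the denominator normalizes vector lengths, a long item cannot outrank a short one merely by containing more terms.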
RetrievalWare first uses indexes (inversion lists) to identify potentially relevant items.
It then applies:
1. coarse grain ranking, and
2. fine grain ranking
Coarse grain ranking is based on the presence of query terms within items. It uses a weighted formula that can be
adjusted based on completeness, contextual evidence or variety, and semantic distance.
Fine grain ranking considers the physical location of query terms and related words, using proximity factors in
addition to the three factors used in coarse grain evaluation. If the related terms and query terms occur in close
proximity (same sentence or paragraph), the item is judged more relevant.
Relevance Feedback: The first major work on relevance feedback was published in
1965 by Rocchio (republished in 1971: Rocchio-71). Rocchio was documenting
experiments on reweighting query terms and query expansion based upon a vector
representation of queries and items. The concepts are also found in the probabilistic model
presented by Robertson and Sparck Jones (Robertson-76). The relevance feedback concept
was that the new query should be based on the old query, modified to increase the weight of
terms in relevant items and decrease the weight of terms that are in non-relevant items.
This technique not only modified the terms in the original query but also allowed expansion
with new terms from the relevant items. The following formula was used:

Q_new = Q_old + (1/r) * Σ (vectors of the r relevant items) − (1/nr) * Σ (vectors of the nr non-relevant items)
The positive feedback factor uses the user's judgments on relevant items to increase the values of those terms for
the next iteration of searching. The negative feedback factor decreases the values of terms that appear in items
judged non-relevant to the user's query.
Figure 7.6 gives an example of the impacts of positive and negative feedback. The filled circles
represent non- relevant items; the other circles represent relevant items. The oval represents the
items that are returned from the query. The solid box is logically where the query is initially. The
hollow box is the query modified by relevance feedback (positive only or negative only in the
Figure).
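A Rocchio-style reweighting and expansion sketch over sparse vectors (the alpha/beta/gamma mixing constants and the example data are assumptions; the averaging by 1/r and 1/nr follows the formula above):

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """New query = alpha*old + beta*avg(relevant) - gamma*avg(non-relevant).

    Terms whose adjusted weight drops to zero or below are removed;
    terms appearing only in relevant items are added (query expansion).
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms |= set(doc)
    new_query = {}
    for t in terms:
        pos = (sum(d.get(t, 0.0) for d in relevant) / len(relevant)
               if relevant else 0.0)
        neg = (sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant)
               if non_relevant else 0.0)
        w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if w > 0:
            new_query[t] = w
    return new_query

old_query = {"oil": 1.0}
new_query = rocchio(old_query,
                    relevant=[{"oil": 1.0, "tax": 2.0}],
                    non_relevant=[{"ship": 4.0}])
```

Here "tax" is added by expansion from the relevant item, "oil" is reinforced, and "ship" is suppressed by negative feedback.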
Objectives of SDI:
The MMM technique was expanded by Paice (Paice-84), considering all item weights rather than only the
maximum/minimum. The similarity measure is calculated as:
For instance, if you are demonstrating a change over a set of time periods with more than a small
handful of insights, a line graph is an effective means of visualization. Moreover, lines make it
simple to plot multiple series together.
Ensure that all of your colleagues, staff, and team members understand where your data comes
from and how to access it to ensure the smooth handling of insights across departments.
Keep your data protected and your data handling systems simple, digestible, and updated to
make the visualization process as straightforward and intuitive as humanly possible.
Ensure that you use business dashboards that present your most valuable insights in one easy-to-
access, interactive space - accelerating the visualization process while also squeezing the
maximum value from your information.
Of course, your hierarchy, ordering, and layout will be in a state of constant evolution but by
putting a system in place, you will make your visualization efforts speedier, simpler, and more
successful.
8. Include Comparisons
This may be the briefest of our data visualization methods, but it’s important nonetheless: when
you’re presenting your information and insights, you should include as many tangible
comparisons as possible. By presenting two graphs, charts, or diagrams together, each showing
contrasting versions of the same information over a particular timeframe, such as monthly sales
records for 2016 and 2017 presented next to one another, you will provide a clear-cut guide on
the impact of your data, highlighting strengths, weaknesses, trends, peaks, and troughs that
everyone can ponder and act upon.
We live in a fast-paced, hyper-connected digital age that is far removed from the pen and paper
or even copy and paste mentality of the yesteryears - and as such, to make a roaring visualization
success, you should use the digital tools that will help you make the best possible decisions while
gathering your data in the most efficient, effective way.
3. Calculating the pairwise similarity between all of the indicated items and clustering the
results
Unit 5
Text Search Algorithms: Introduction, Software text search algorithms, Hardware text search systems.
Multimedia Information Retrieval: Spoken Language Audio Retrieval, Non-Speech Audio Retrieval, Graph Retrieval,
Imagery Retrieval, Video Retrieval
The basic concept of a text scanning system is the ability for one or more users to enter queries; the text to be
searched is accessed and compared to the query terms. When all of the text has been accessed, the query is
complete.
The database contains the full text of the items. The term detector is the special hardware/software that contains
all of the terms being searched for, and in some systems the logic between the terms. It inputs the text and detects
the existence of the search terms. It outputs the detected terms to the query resolver to allow final logical
processing of a query against an item. The query resolver performs two functions.
It accepts search statements from the users, extracts the logic and search terms, and passes the search terms to
the detector. It also accepts results from the detector and determines which queries are satisfied by the item, and
possibly the weight associated with a hit. The query resolver passes information to the user interface, which
continually updates the search status for the user and, on request, retrieves any items that satisfy the user's search
statement.
In the case of hardware search machines, multiple parallel search machines (term detectors) may work against
the same data stream, allowing for more queries, or against different data streams, reducing the time to access
the complete database. In software systems, multiple detectors may execute at the same time.
Text search techniques are of two types:
1. Hardware text search
2. Software text search
Data is serially streamed into the detector. When a match occurs, the term comparator informs the external query
resolver (usually in the main computer) by setting status flags.
Specialized hardware that interfaces with computers and is used to search secondary storage
devices was developed from the early 1970s with the most recent product being the Parallel
Searcher (previously the Fast Data Finder). The typical hardware configuration is shown in
Figure 9.9 in the dashed box. The speed of search is then based on the speed of the I/O.
One of the earliest hardware text string search units was the Rapid Search Machine
developed by General Electric. The machine consisted of a special purpose search unit where a
single query was passed against a magnetic tape containing the documents. A more sophisticated
search unit was developed by Operating Systems Inc. called the Associative File Processor (AFP).
It is capable of searching against multiple queries at the same time. Following that initial
development, OSI, using a different approach, developed the High Speed Text Search (HSTS)
machine. One state machine is dedicated to contiguous word phrases, another to embedded term
match, and the final one to exact word match.
In parallel with that development effort, GE redesigned their Rapid Search Machine into the
GESCAN unit. The GESCAN system uses a text array processor (TAP) that simultaneously
matches many terms and conditions against a given text stream. The TAP receives the query
information from the user's computer and directly accesses the textual data from secondary storage.
The TAP consists of a large cache memory and an array of four to 128 query processors. The text is
loaded into the cache and searched by the query processors (Figure 9.10). Each query processor is
independent and can be loaded at any time. A complete query is handled by each query processor.
A query processor works two operations in parallel; matching query terms to input text and
Boolean logic resolution. Term matching is performed by a series of character cells each containing
one character of the query. A string of character cells is implemented on the same LSI chip and the
chips can be connected in series for longer strings. When a word or phrase of the query is
matched, a signal is sent to the resolution sub-process on the LSI chip. The resolution
chip is responsible for resolving the Boolean logic between terms and proximity
requirements. If the item satisfies the query, the information is transmitted to the user's
computer.
The text array processor uses these chips in a matrix arrangement as shown in
Figure9.10. Each row of the matrix is a query processor in which the first chip performs
the query resolution while the remaining chips match query terms. The maximum
number of characters in a query is restricted by the length of a row while the number of
rows limit the number of simultaneous queries that can be processed.
The Fast Data Finder (FDF) is the most recent specialized hardware text search
unit still in use in many organizations. It was developed to search text and has been used
to search English and foreign languages. The early Fast Data Finders consisted of an
array of programmable text processing cells connected in series forming a pipeline
hardware search processor. The cells are implemented using a VLSI chip. In the TREC
tests each chip contained 24 processor cells, with a typical system containing 3600 cells.
Each cell will be a comparator for a single character, limiting the total number of
characters in a query to the number of cells.
The cells are interconnected with an 8-bit data path and an approximately 20-bit
control path. The text to be searched passes through each cell in a pipeline fashion until
the complete database has been searched. As data is analyzed at each cell, the 20 control
lines states are modified depending upon their current state and the results from the
comparator. An example of a Fast Data Finder system is shown in Figure 9.11.
The text to be searched is streamed out of the microprocessor/memory and fed through the comparators.
The search characters are stored in the registers.
The connections between the registers reflect the control lines that are also passing state
information.
Groups of cells are used to detect query terms, along with logic between the terms, by
appropriate programming of the control lines.
When a pattern match is detected, a hit is passed to the internal microprocessor, which
passes it back to the host processor, allowing immediate access by the user.
The Fast Data Finder supports the following functions for a query search:
Boolean Logic including negation
Proximity on an arbitrary pattern
Term masking
Fuzzy matching
Term weights
Numeric ranges
In software streaming techniques, the item to be searched is read into memory, and then
the algorithm is applied.
The major algorithms associated with software text search are:
1. the brute force approach
2. Knuth-Morris-Pratt (in syllabus)
3. Boyer-Moore (in syllabus)
4. the Shift-OR algorithm
5. Rabin-Karp
Knuth-Morris-Pratt (KMP) is an algorithm that checks the characters from left to
right. When the pattern contains a sub-pattern that appears more than once, KMP uses that
property to improve the time complexity, even in the worst case.
Input and Output
Input: Main String: “AAAABAAAAABBBAAAAB”,
The pattern “AAAB”
Output:
Pattern found at location: 1
Pattern found at location: 7
Pattern found at location: 14
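The behaviour above can be reproduced with a standard KMP implementation. The sketch below (function name is illustrative) builds the failure table from the repeated sub-patterns and reports 0-based match positions:

```python
def kmp_search(text, pattern):
    """Knuth-Morris-Pratt search: returns 0-based positions of all matches."""
    # failure[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix -- the "repeated sub-pattern" KMP exploits.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    hits, k = [], 0
    for i, ch in enumerate(text):
        # On a mismatch, fall back through the failure table instead of
        # rescanning the text -- this gives the O(n + m) bound.
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = failure[k - 1]  # allow overlapping matches
    return hits
```

Running it on the example input reproduces the output above: `kmp_search("AAAABAAAAABBBAAAAB", "AAAB")` yields `[1, 7, 14]`.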
The Boyer-Moore algorithm does preprocessing: it processes the pattern and creates different arrays for
each of the two heuristics.
At every step, it slides the pattern by the maximum of the shifts suggested by each of the two heuristics.
Boyer-Moore is a combination of the following two approaches:
1) Bad Character Heuristic
2) Good Suffix Heuristic
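A minimal sketch of the bad character heuristic alone (the good suffix heuristic is omitted for brevity, so this is a simplified Boyer-Moore, not the full algorithm): the pattern is compared right to left, and on a mismatch it slides forward so the mismatched text character lines up with its rightmost occurrence in the pattern.

```python
def boyer_moore_search(text, pattern):
    """Simplified Boyer-Moore using only the bad-character heuristic."""
    m, n = len(pattern), len(text)
    # Rightmost position of each character in the pattern.
    last = {ch: i for i, ch in enumerate(pattern)}
    hits, s = [], 0  # s: current alignment of the pattern against the text
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:  # compare right to left
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1  # full BM would shift by the good-suffix rule here
        else:
            # Align the mismatched ("bad") character with its rightmost
            # occurrence in the pattern; shift at least one position.
            s += max(1, j - last.get(text[s + j], -1))
    return hits
```

When the mismatched character does not occur in the pattern at all, the pattern jumps past it entirely, which is why Boyer-Moore is sub-linear on average.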
Rabin-Karp Method
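Rabin-Karp replaces character-by-character comparison with a rolling hash: only text windows whose hash equals the pattern's hash are verified directly. A compact sketch (the base and modulus are illustrative choices):

```python
def rabin_karp_search(text, pattern, base=256, mod=101):
    """Rabin-Karp search with a rolling hash; hash hits are verified to
    rule out false positives from hash collisions."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):  # hash of the pattern and of the first window
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for s in range(n - m + 1):
        # Verify on hash match only -- equal hashes may still collide.
        if p_hash == t_hash and text[s:s + m] == pattern:
            hits.append(s)
        if s < n - m:
            # Roll the hash: drop text[s], append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base
                      + ord(text[s + m])) % mod
    return hits
```

Because the hash is updated in constant time per window, the expected running time is linear in the text length, with the pattern comparison invoked only on (rare) hash matches.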
Multimedia Information Retrieval: Spoken Language Audio Retrieval, Non-Speech
Audio Retrieval,
Graph Retrieval, Imagery Retrieval, Video Retrieval
1. Introduction
The amount of information available on the Internet, primarily by way of the World
Wide Web, is truly staggering. According to one measurement, in February 1999 there were
about 800 million web pages publicly available on about 3 million web servers, for a total of
approximately 9 terabytes of data. These enormously large numbers are a testament to the
success of the Internet in providing a way for people all around the world to share information.
With computer technology improving at a phenomenal pace, the technology limitations which
dictated the predominant use of text on the Internet in the past are lessening. In the very near
future, non-textual data will be as common a format for publicly available data as text is now.
In light of these trends, it is important to review the state of the art of the retrieval of
such non- textual, multimedia data. Text information retrieval is already well established;
most data retrieval systems, such as web search engines, are text retrieval systems. However,
multimedia information retrieval is less established, and a number of issues remain open.
Of image, audio, and video, image data retrieval is arguably the best developed
technology. As far back as 1986, image databases were being developed and deployed, such
as UC Berkeley’s ImageQuery system, whose “developers believe that this software […] was
the first deployed multi-user networked digital image database system” [2]. With over a
decade of research and development, image data retrieval has had time to grow and mature.
This has allowed the area to address some difficult issues (some of which remain open at
present): image classification, query matching, image standards, attribute classification, and
evaluation. These issues will be explained further below. As a note, though standards,
attribute classification, and evaluation are discussed in terms of image retrieval systems, they
are outstanding issues for audio and video retrieval systems as well. Classification and
querying also apply to the other forms of media, but each medium's unique properties necessitate different approaches.
Image Classification
Image classification refers to the assignment of semantic meaning
to the amalgamation of pixels that make up an image document. Usually the primary
motivation behind such classification is to enable query matching, which is discussed below,
but classification is a complex issue and warrants its own section. This section describes
different ways to classify images, regardless of intent. The context for the discussion is
pattern recognition. For humans, pattern recognition is innate and often subconscious; optical illusions, for example, play on
this fact by often inviting the eye to see patterns that are inaccurate or incorrect. Even babies
learn at an extremely early age to identify a parent’s face. For an automated image processing
system, however, pattern recognition is a surprisingly complex problem. The same level of
detail that allows computers to perform large numerical computations with unerring accuracy
works against computers attempting to recognize patterns in images. Since two images of the
same object can be slightly different, such as different angles of view, different lighting,
different coloring, etc., a computer’s precision does not easily “ignore” such differences.
Humans, of course, with their (relatively) larger lack of precision, can easily see past minor differences.
One approach that has been reported in the literature to address the pattern recognition
problem is the general technique of segmentation. The segmentation technique is based on the
classic strategy of breaking a large problem into smaller
chunks, which are easier to solve and whose solutions can be combined to eventually solve the
larger problem. In this case, the pattern recognition problem is segmented into three levels of
matching: the pixel level, the "stuff" level, and the "thing" level. The pixel level is the
most basic, matching directly on the raw pixels in the images. It is also generally the least useful technique, as minor changes in image
appearance can render a false negative. However, using pixel level matching as a basis,
higher-order matching can be performed, using queries such as “a mostly green area with some
brown vertical strips,” which could be a forest with trees. This level of recognition is the
“stuff” level, as the system now has an awareness of some relationships between pixels to
represent some stuff. Using stuff, an even higher semantic meaning can be assigned to
relationships between stuff – “things,” such as “a mostly cylindrical area with four smaller
cylinders below it, and all cylinders an alternating mix of white and black regions” to (very
crudely) represent a zebra. The “thing” level is the level in which most humans would prefer
to operate, as the semantic units are clear, discrete, and of an appropriate scale. A human
would normally search for all images of zebras, not all images of cylinders with smaller
cylinders below it, where all cylinders have patterns of alternating black and white. This
segmentation into pixel, stuff, and thing levels provides a tractable approach to the problem of
pattern recognition.
One such system, Blobworld, segments an image into contiguous regions of pixels ("blobs") which have similar
color and texture. The authors admit their blobs are not quite at the same semantic level as
“things,” but they state that blobs are semantically higher than “stuff.” Additionally, their
system provides some key features lacking in other image retrieval systems: an interface to
allow the user to sketch blobs for a query, and feedback as to why the system matched an
image with the query. Blobworld, while perhaps not yet well enough developed for general
public usage, is a promising research prototype towards solving the pattern recognition problem.
Query Matching
Tied very closely to the issue of image classification is the issue of query matching.
As previously stated, the primary intent behind classifying images is to allow efficient
searching or browsing of the database of images. The range of types of queries supported by
an image retrieval system will be primarily based on how the images are classified. For
example, a system that classifies its images using segmentation and generates “stuff” would
(hopefully) allow searchers to query the database based on some criteria of stuff. Clearly, any
image retrieval system can support text keyword matching based on manually indexed
metadata, but such querying is generic and essentially ignores the format of the image
documents. Three querying techniques that have been developed which take into account the
unique properties of image data are color histograms, quadtrees of histograms, and basic
shape matching.
Searching by color data is essentially a pixel-level search. Since pixel comparisons are
basically numerical comparisons and do not require semantic reasoning, they are very easy for
computers to perform. An example of such a query could be “find all images with at least 50%
more red pixels than green pixels” or “find all images whose most frequently used color is
similar to this image's most frequently used color." Searches of this type are often answered
using a color histogram, which records, for each color, the number of its occurrences in that image. Histograms are often stored internally as vectors of values, which
are easy to search by the matching algorithms. For example, Columbia University’s
WebSEEK system uses color histograms to keep its query response time less than two seconds
[5].
Though the computer can therefore process pixel level searches with color histograms
very quickly and efficiently, it is clear they are likely to be of very limited use to a human
searcher.
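As a rough illustration of such vector matching (not WebSEEK's actual implementation), the sketch below builds a normalized color histogram from an image's pixel values and compares two histograms by intersection, a common similarity measure for this kind of pixel-level query:

```python
from collections import Counter

def color_histogram(pixels):
    """Normalized color histogram of an image, given its pixels as a list
    of color values (the color representation here is illustrative)."""
    total = len(pixels)
    return {color: count / total for color, count in Counter(pixels).items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical color distributions."""
    return sum(min(h1.get(c, 0.0), h2.get(c, 0.0))
               for c in set(h1) | set(h2))
```

Because each histogram is a fixed-size numeric vector, comparing a query against a large image database reduces to fast numeric operations, which is why systems built on this technique can keep response times low.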
Image Standards
Image standards refers to the standards that define the metadata which describe image
files. The most obvious metadata is the structure of the electronic image file itself. Widely
adopted, open standards, such as JPEG, GIF, and TIFF, have been developed and deployed
and allow the easy sharing of images. The ready availability of the details of the standard
provides a measure of confidence that these file formats will be decodable even in the future.
Repositories of file format information exist (e.g., [17]), even providing decoding information
for long-obsolete formats such as WordStar and dBASE files. Given such repositories, files from
long ago could still be decoded and used, albeit with some effort.
Despite such access to file format information, however, the problem is not yet fully solved.
Not only is image metadata important simply to understand and decode the image
document, but a large amount of other metadata needs to accompany image files for future
reference. Such metadata could include information about how the image was generated (e.g.,
retouched to remove shadows). An indication of the contents of the image would also be
desirable, allowing the comparing of two images (such as two digital photographs of the same
statue taken from different angles) for a measure of equivalency. The metadata could include
information about reproduction rights of the image, or contact information for the holder of the
copyright. Finally, the metadata might include some sort of verification signature to assure the authenticity of the image.
Audio data retrieval systems are not text-based retrieval systems, and they therefore
share the same issues as image retrieval systems. As stated above, the issues of standards,
attribute classification, and evaluation are directly applicable to audio retrieval as well. They
also pose different research problems than image retrieval systems do, for two fundamental
reasons: audio data is (obviously) aurally-based instead of visually-based, and audio data is
time-dependent. The former difference leads to some unique and creative approaches to
solving the querying and retrieval issue, while the latter difference is the root of the presentation problem.
Just as image retrieval systems must address how to support queries for images, audio
retrieval systems must create ways to allow formation of queries for audio documents.
Naturally, (text) keyword matching is a possibility, just as it can be used in image retrieval
systems. However, the natural way for humans to query other human retrieval systems (e.g.,
music librarians, radio DJs, employees at music stores, etc.) is by humming or singing part of a
tune.
Research into how non-professional singers hum or sing familiar songs has led to the
development of a number of systems which can accept such hummed or sung input for queries.
After accepting the acoustic query and transforming it to digital format, there are different
ways to perform the actual matching. Bainbridge et al.’s system describes how they use
frequency analysis to transcribe the acoustic input into musical notes and then compare edit
distances to determine matches. Ghias et al.'s approach differs; they convert the input into a
pitch contour, which is a string over a three-letter alphabet. The pitch contour represents how
the pitch of the input changes between each note: whether the pitch goes up (U), goes down
(D), or stays the same (S). Given this string, familiar string-comparison algorithms can be
used to find matches.
Using just three choices to generate the pitch contour means simpler matching, but it
also means that a large amount of information is discarded which could reduce the search
space. Blackburn and DeRoure suggest various improvements to the query process, including
19
a five-letter alphabet (up a lot, up a little, same, down a little, and down a lot); generating a
secondary pitch contour, where a note is compared to the note two notes ago; and comparing
time contours, which would represent rhythm information. Ghias et al. additionally note that
some errors, such as drop-out errors (skipping notes) may be more common when people hum
or sing a song. They suggest further study to clarify the relative frequency of such errors, so
as to allow tuning of the matching algorithms to be more tolerant of the common errors.
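A small sketch of the three-letter pitch contour idea (function names are illustrative): a sequence of pitches, such as MIDI note numbers, is reduced to a U/D/S string, and edit distance supplies the error-tolerant string comparison. Note that a query hummed in a different key yields the same contour, which is part of the representation's appeal.

```python
def pitch_contour(notes):
    """Reduce a sequence of pitches (e.g. MIDI note numbers) to a string
    over {U, D, S}: up, down, or same relative to the previous note."""
    return "".join("U" if cur > prev else "D" if cur < prev else "S"
                   for prev, cur in zip(notes, notes[1:]))

def edit_distance(a, b):
    """Levenshtein distance between two contour strings (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # deletion, insertion, or substitution (free if characters match)
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]
```

A hummed query's contour can then be matched against stored song contours, ranking songs by increasing edit distance so that small humming errors still find the intended tune.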
The nature of audio and music data presents many opportunities to develop creative
methods to accept and process audio queries. Using error-tolerant abstractions such as
frequency analysis or pitch contours, audio retrieval systems can transform the problem of
audio matching into well-known problems of edit distance calculation or string matching. In
this way, systems can utilize established solutions for these problems to provide efficient and effective audio retrieval.
Presentation
Of course, once the user has input a query and the system has determined some
number of matches against the audio database, the next logical step is presenting the match
results to the user. Here the time-dependent nature of audio data reveals the problem of
presentation. For media that are not time-dependent, such as text or images, the data (or an
abbreviated form) is static and can be displayed without any trouble. For time-dependent
media such as audio, it is unclear what form should be displayed or presented to the user;
playing each matched audio document in full (in real time) is unlikely to be useful to the searcher. Bainbridge et al. enumerate a number of such
issues beyond what is supported in presenting retrieved text. These issues include whether to transpose all matches
to the same key to make comparison easier, using a visual representation to present the audio,
allowing for the equivalent of quickly scanning through a list of matches to find an
appropriate match, and supporting excerpting to show the matched query in context, among others.
A related research effort is how to browse and navigate through databases of audio, including
mechanisms for interconnecting related points in time in these streams. For text, hypertext provides a
structure to indicate relationships between certain parts of the text, both within the same
document and between documents. Blackburn and DeRoure describe their attempts to provide
a similar functionality for music .They propose to use an open hypermedia model to supply
hyperlinks. This model specifies that hyperlinks are not embedded in the contents document,
but instead are stored in a separate, associated document (the “linkbase”). At any point while
browsing a music document, a user may request hyperlinks based on the current location in
the audio stream; the system will then consult the linkbase to present links to related materials.
Video data retrieval shares some properties with image data retrieval, due to the
commonality of their visual nature. However, video data is also time-dependent like audio
data, and, in fact, movies often have synchronized audio tracks accompanying the video data.
This shared commonality naturally lends to applying solutions from the image and audio
retrieval areas to research problems in the video retrieval domain. In some ways this strategy
is successful, but, as usual, video data has some unique properties which again lead to creative solutions.
Some novel approaches have been developed to classify video data for good query
matching.
Gauch et al. describe how their VISION system processes video data for classification
through segmentation. This segmentation is slightly different than the segmentation in terms
of image data; specifically, segmentation here means to identify camera shot changes in the
stream of video data, and from there to group adjacent camera shots into scenes. This is
analogous to segmentation of image data into stuff and things, and unfortunately, the
difficulty of such classification is analogous as well. It has been well researched how to
identify changes of camera shots, such as by observing large changes in color histograms
between frames. However, it is more complicated to properly identify when a scene starts
and ends. The VISION system uses clues from the synchronized audio track to perform this
segmentation; for example, if the speaker changes after a shot change, it may signify a
different scene. By tuning various thresholds, the VISION system can be adjusted to correctly
identify scene boundaries. The system also uses the closed-captioning signal, when it
exists, to help classify the video data. Keywords are extracted from the text of the closed-captioning, using well-understood text manipulation techniques. This provides a reliable
source of metadata information for classification. If the closed-captioning signal is absent, the
VISION system falls back to extracting keywords from the audio stream. They take care to
make the distinction between full continuous speech recognition of the audio stream, which is
a difficult task, and what they call "word-spotting," or selective keyword recognition from the
audio. Gauch et al. admit their word-spotting technique does not yield very good results yet
(about 50% recall but only 20% precision), but they intend to refine and improve the method.
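The histogram-based shot-change detection mentioned above can be sketched as follows; the per-frame histogram representation and the threshold value are hypothetical, not the VISION system's actual parameters:

```python
def detect_shot_changes(frame_histograms, threshold=0.5):
    """Flag frame indices where the color histogram shifts sharply between
    consecutive frames, suggesting a camera shot change.

    frame_histograms: list of dicts mapping color -> normalized frequency.
    threshold: illustrative cutoff on the L1 histogram distance (range 0..2).
    """
    changes = []
    for i in range(1, len(frame_histograms)):
        prev, cur = frame_histograms[i - 1], frame_histograms[i]
        # L1 distance between the two normalized color distributions.
        diff = sum(abs(prev.get(c, 0.0) - cur.get(c, 0.0))
                   for c in set(prev) | set(cur))
        if diff > threshold:
            changes.append(i)
    return changes
```

Grouping the resulting shots into scenes is the harder step; as the text notes, systems like VISION bring in cues from the synchronized audio track rather than relying on the visual signal alone.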
Another classification strategy is the use of keyframes. Keyframes are frames whose
images represent a semantic unit of the stream, such as a scene. Many video retrieval systems
implement some algorithm to identify keyframes. Color features and motion cues can be used
to automatically detect keyframes. By extracting keyframes, the retrieval system can leverage
image retrieval techniques to support queries on keyframe images. Assuming the keyframes
are indeed good representatives of their respective scenes, this classification method is also a
natural basis for presentation.
Video data is made up of both image data and audio data, and this fact provides ways to
approach the problem of classification for queries. Using video segmentation, analyzing the
closed-captioning or audio signal, and extracting keyframes are some of the ways to implement video classification.
Presentation
Like audio data, video data is time-dependent and so raises the issue of
presentation. The distinct properties of video data, however, allow different techniques to
address this issue. As mentioned above, keyframe extraction provides the (supposedly) most
important frames in the video document, and these frames can be used as a summarization of
the entire document. WebClip, for example, calls this model the time-based model since the
timeline is kept in the correct sequential order for presentation. The VISION system uses this
technique as well for its presentation, displaying thumbnails of each keyframe and showing
the full video data if the user selects a specific thumbnail. They also mention that during such
playback, the user interface includes fast-forward and fast-rewind buttons, which display the
video stream at four times the normal rate (usually by dropping frames to achieve the desired
rate), and a slider bar to allow access to any arbitrary moment in the video.
Graph Retrieval
Graph retrieval concerns graph-based information retrieval systems, in which a query can be expressed as a graph of topics/subtopics. Documents are
ranked with respect to a query based upon relationships among documents, relationships among topics/subtopics, and
relationships between query terms and documents.
Such a system has been evaluated and compared with two information retrieval systems on two standard text collections.