String Matching Chapter 12 Goodrich Nep


12.3 Pattern Matching Algorithms (Goodrich)
• In Pattern Matching problem on strings, we are given:
• a text string T of length n and
• a pattern string P of length m, and
we want to find whether P is a substring of T or not.
• Give examples….
• Give applications…..
• The notion of a “match” is that there is a substring of T starting at some index i
that matches P, character by character, such that
T[i] = P[0],
T[i+1] = P[1], ...,
T[i+m−1] = P[m−1].
• That is, P = T[i..i+m−1].
• Thus, the output from a pattern matching algorithm could either be some
indication that the pattern P does not exist in T (the failure case) or an integer
indicating the starting index in T of a substring matching P (the success case).
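• As a quick illustration (a minimal Python sketch; the variable names are ours, not from the text), a successful match at index i simply means that the length-m slice of T starting at i equals P:

T = "abacaab"            # text of length n = 7
P = "aca"                # pattern of length m = 3
i, m = 2, len(P)
print(T[i:i + m] == P)   # True: P = T[2..4], so index 2 is a successful match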
12.3.1 Brute Force
In this approach, we simply test all the possible placements of P relative to T.
• Example 12.4: Suppose we are given the text string
T="abacaabaccabacabaabb" and the pattern string P="abacab".
Exercise: work through one more example and count the number of character comparisons.
Correctness:
• The brute-force pattern matching algorithm consists of two nested loops, with the
outer loop indexing through all possible starting indices of the pattern in the text,
and the inner loop indexing through each character of the pattern, comparing it to
its potentially corresponding character in the text.
• Thus, the correctness of the brute-force pattern matching algorithm follows
immediately from this exhaustive search approach.
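• The two nested loops just described translate directly into code; below is a minimal Python sketch of the brute-force algorithm (the function name find_brute is our own):

def find_brute(T, P):
    """Return the lowest index of T at which substring P begins, or -1 on failure."""
    n, m = len(T), len(P)
    for i in range(n - m + 1):               # try every candidate starting index in T
        j = 0
        while j < m and T[i + j] == P[j]:    # compare P to T character by character
            j += 1
        if j == m:                           # all m characters of P matched
            return i
    return -1                                # every placement failed

print(find_brute("abacaabaccabacabaabb", "abacab"))   # prints 10 (Example 12.4)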
Performance:
• In the worst case, for each candidate index in T, we can perform up to m character
comparisons to discover that P does not match T at the current index.
• The outer for loop is executed at most n−m+1 times, and the inner loop is executed
at most m times.
• Thus, the running time of the brute-force method is O((n−m+1)m), which is
simplified as O(nm). Note that when m=n/2, this algorithm has quadratic running
time O(n²).
12.3.3 The Knuth-Morris-Pratt Algorithm

Revisiting the worst-case performance of the brute-force algorithm:

• Major inefficiency: we perform many comparisons while testing a potential placement of the
pattern against the text, and if we discover a pattern character that does not match in the text,
then we throw away all the information gained by these comparisons and start over again from
scratch with the next incremental placement of the pattern.

• The Knuth-Morris-Pratt (or "KMP") algorithm avoids this waste of information and, in doing so,
achieves a running time of O(n+m), which is optimal in the worst case.

• That is, in the worst case any pattern matching algorithm will have to examine all the characters of
the text and all the characters of the pattern at least once.
The Failure Function
• Main idea of the KMP algorithm --preprocess the pattern string P so
as to compute a failure function, f, that indicates the proper shift of P
so that, to the largest extent possible, we can reuse previously
performed comparisons.
• failure function f(j) is defined as the length of the longest prefix of P
that is a suffix of P[1..j] (note that we did not put P[0..j] here).
• We use the convention f(0) = 0. The importance of this failure function is that it
"encodes" repeated substrings inside the pattern itself.
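• To make the definition concrete, the small sketch below computes f(j) directly from the definition, by checking every prefix of P against the suffixes of P[1..j]; it is for illustration only, since the bootstrapping algorithm shown later computes the same values in O(m) time (the function name is ours):

def failure_by_definition(P):
    m = len(P)
    f = [0] * m                              # convention: f(0) = 0
    for j in range(1, m):
        s = P[1:j + 1]                       # the substring P[1..j]
        for k in range(j, 0, -1):            # try prefix lengths, longest first
            if P[:k] == s[len(s) - k:]:      # is the length-k prefix of P a suffix of P[1..j]?
                f[j] = k
                break
    return f

print(failure_by_definition("abacab"))       # [0, 0, 1, 0, 1, 2]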
Technique of KMP Algorithm ( 3 possibilities)
• 1st: Each time there is a match, we increment the current indices, i.e., we
move on to the next characters in T and P.
• 2nd: If there is a mismatch and we have previously made progress in P, then we
consult the failure function to determine the new index in P where we
need to continue checking P against T, i.e., we consult the failure function
for a new candidate character in P.
• 3rd: If there is a mismatch and we are at the beginning of P, we simply
increment the index for T (and keep the index variable for P at its
beginning), i.e., we start over with the next index in T.

Repeat the above process until we find a match of P in T or the index for
T reaches n, the length of T (indicating that we did not find the pattern
P in T).
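• Putting the three cases together gives the main matching loop. Below is a minimal Python sketch (the name kmp_match is ours); it assumes the failure function of P is already available, so the values for "abacab" are filled in by hand here and computed programmatically in the sketch that follows the dry run below:

def kmp_match(T, P, fail):
    """Return the lowest index of T at which P begins, or -1 if P is absent.
    fail[j] is the failure function of P, assumed precomputed."""
    n, m = len(T), len(P)
    i = j = 0
    while i < n:
        if T[i] == P[j]:            # case 1: match, advance in both T and P
            if j == m - 1:
                return i - m + 1    # the whole pattern matched; report its start index
            i += 1
            j += 1
        elif j > 0:                 # case 2: mismatch after some progress in P
            j = fail[j - 1]         # reuse the previous comparisons
        else:                       # case 3: mismatch at the beginning of P
            i += 1                  # start over with the next index in T
    return -1                       # the index for T reached n without a match

print(kmp_match("abacaabaccabacabaabb", "abacab", [0, 0, 1, 0, 1, 2]))   # prints 10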
• The correctness of the KMP algorithm follows from the definition of the
failure function: any comparisons that are skipped are actually
unnecessary, for the failure function guarantees that all the ignored
comparisons are redundant; they would involve comparing the same
matching characters over again.
• Also note that the algorithm performs fewer overall comparisons than
the brute-force algorithm run on the same strings.
Performance of KMP Algorithm
• Excluding the computation of the failure function, the running time is proportional to
the number of iterations of the while loop.
• For the sake of analysis, let us define k=i−j. Intuitively, k is the total amount by which
the pattern P has been shifted with respect to the text T.
• Note that throughout the execution of the algorithm, we have k≤n. One of the
following three cases occurs at each iteration of the loop. If T[i]=P[j], then i
increases by 1, and k does not change, since j also increases by 1.
• If T[i]!=P[j] and j>0, then i does not change and k increases by at least 1, since, in this
case, k changes from i−j to i−f(j−1), an increase of j−f(j−1), which is positive
because f(j−1)<j.
• If T[i]!=P[j] and j=0, then i increases by 1 and k increases by 1, since j does not
change.
• Thus, at each iteration of the loop, either i or k increases by at least 1 (possibly
both); hence, the total number of iterations of the while loop in the KMP pattern
matching algorithm is at most 2n. Of course, achieving this bound assumes that we
have already computed the failure function for P.
Constructing the KMP Failure Function
• Dry run of KMPFailureFunction for P = "abacab" (done on the board in class along with the given algorithm):

j     0  1  2  3  4  5
P[j]  a  b  a  c  a  b
f(j)  0  0  1  0  1  2

Trace (f(0)=0; i is the position whose f value is being computed, j is the length of the current matching prefix):

i=1, j=0: 1<6, T; P[1]==P[0]? b==a, F; j>0? 0>0, F; so f(1)=0, i=2
i=2, j=0: 2<6, T; P[2]==P[0]? a==a, T; so f(2)=1, i=3, j=1
i=3, j=1: 3<6, T; P[3]==P[1]? c==b, F; j>0? 1>0, T; so j=f(0)=0
i=3, j=0: 3<6, T; P[3]==P[0]? c==a, F; j>0? 0>0, F; so f(3)=0, i=4
i=4, j=0: 4<6, T; P[4]==P[0]? a==a, T; so f(4)=1, i=5, j=1
i=5, j=1: 5<6, T; P[5]==P[1]? b==b, T; so f(5)=2, i=6, j=2
i=6: 6<6, F; the while loop terminates
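• The trace above follows the bootstrapping computation; a minimal Python sketch of it (we call it compute_kmp_fail) is:

def compute_kmp_fail(P):
    """Compute fail[j] = length of the longest prefix of P that is a suffix of P[1..j]."""
    m = len(P)
    fail = [0] * m           # by convention f(0) = 0
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:     # we have matched j + 1 characters ending at i
            fail[i] = j + 1
            i += 1
            j += 1
        elif j > 0:          # j indexes just after a matching prefix: fall back
            j = fail[j - 1]
        else:                # no match starting at i
            fail[i] = 0
            i += 1
    return fail

print(compute_kmp_fail("abacab"))   # [0, 0, 1, 0, 1, 2], as in the table above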
Observations
• KMPFailureFunction(P) is a "bootstrapping" process quite similar to
the KMPMatch algorithm: we compare the pattern to itself as in the
KMP algorithm.
• Each time we have two characters that match, we set f(i)=j+1.
• Note that since we have i>j throughout the execution of the
algorithm, f(j−1) is always defined when we need to use it.

*bootstrapping: getting into or out of a situation using existing resources.


• Algorithm KMPFailureFunction runs in O(m) time. Its analysis is
analogous to that of algorithm KMPMatch.
• Proposition 12.6: The Knuth-Morris-Pratt algorithm performs
pattern matching on a text string of length n and a pattern string of
length m in O(n+m) time.
12.5 Tries
• The pattern matching algorithms above speed up the search in a text by
preprocessing the pattern (e.g., by computing the failure function in the
KMP algorithm). With tries, we take a complementary (opposite)
approach, namely, we preprocess the text.
• This approach is suitable for applications where a series of queries is
performed on a fixed text, so that the initial cost of preprocessing the
text is compensated by a speedup in each subsequent query.
• Example- a Web site that offers pattern matching in Shakespeare’s
Hamlet or a search engine that offers Web pages on the Hamlet topic.
• A trie (pronounced “try”) is a tree-based data structure for storing
strings in order to support fast pattern matching.
• “trie” comes from the word “retrieval.”
• main application--information retrieval.
• e.g., searching for a certain DNA sequence in a genomic database.
• We are given a collection S of strings, all defined using the same alphabet. The
primary query operations that tries support are pattern matching and prefix
matching. The latter operation involves being given a string X and looking for
all the strings in S that contain X as a prefix.
12.5.1 Standard Tries
• Let S be a set of s strings from alphabet Σ such that no string in S is a
prefix of another string.
• A standard trie for S is an ordered tree T with the following properties
• Each node of T, except the root, is labelled with a character of Σ.
• The ordering of the children of an internal node of T is determined by a
canonical ordering of the alphabet Σ.
• T has s external nodes, each associated with a string of S, such that the
concatenation of the labels of the nodes on the path from the root to an
external node v of T yields the string of S associated with v.
• A trie T represents the strings of S with paths from the root to the
external nodes of T.
• Why is it important to assume that no string in S is a prefix of another
string? It ensures that each string of S is uniquely associated with an
external node of T. We can always satisfy this assumption by adding a
special character that is not in the original alphabet Σ at the end of each
string.
• An internal node can have between 1 and d children, where d is the size
of the alphabet. There is an edge going from the root r to one of its
children for each character that is first in some string in the collection S.
• In addition, a path from the root of T to an internal node v at depth i
corresponds to an i-character prefix X[0..i−1] of a string X of S. In fact, for
each character c that can follow the prefix X[0..i−1] in a string of the set S,
there is a child of v labeled with character c. In this way, a trie concisely
stores the common prefixes that exist among a set of strings.
• If there are only two characters in the alphabet, then the trie
becomes a binary tree, with some internal nodes possibly having only
one child.
• In general, if there are d characters in the alphabet, then the trie will
be a multi-way tree where each internal node has between 1 and d
children.
Proposition 12.8: A standard trie storing a collection S of s strings of
total length n from an alphabet of size d has the following properties:

• Every internal node of T has at most d children


• T has s external nodes
• The height of T is equal to the length of the longest string in S
• The number of nodes of T is O(n).
We can implement a trie with a tree storing characters at its nodes.
• The worst case for the number of nodes of a trie occurs when no two strings share a
common nonempty prefix; that is, except for the root, all internal nodes have one child.
• A trie T for a set S of strings can be used to implement a dictionary whose keys are the
strings of S.
• We perform a search in T for a string X by tracing down from the root the path indicated by
the characters in X.
• If this path can be traced and terminates at an external node, then we know X is in the
dictionary.
• If the path cannot be traced or the path can be traced but terminates at an internal node,
then X is not in the dictionary
• Note that in this implementation of a dictionary, single characters are compared instead of
the entire string (key). It is easy to see that the running time of the search for a string of size
m is O(dm), where d is the size of the alphabet.
• Indeed, we visit at most m+1 nodes of T and we spend O(d) time at each node. For some
alphabets, we may be able to improve the time spent at a node to be O(1) or O(log d) by
using a dictionary of characters implemented in a hash table or search table. However, since
d is a constant in most applications, we can stick with the simple approach that takes O(d)
time per node visited.
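• As an illustration of the dictionary search just described, here is a minimal Python sketch that uses nested dictionaries as trie nodes (the layout, the example strings, and the end-of-string marker "$" are our own choices, not from the text):

# a standard trie for the set {"bear", "bell", "bid"}, written out as nested dicts;
# each key is a character of the alphabet, and "$" marks an external (string-ending) node
trie = {
    "b": {
        "e": {"a": {"r": {"$": {}}},
              "l": {"l": {"$": {}}}},
        "i": {"d": {"$": {}}},
    }
}

def trie_search(trie, X):
    """Return True if string X is one of the strings stored in the trie."""
    node = trie
    for c in X:               # trace down from the root along the characters of X
        if c not in node:     # the path cannot be traced: X is not in the dictionary
            return False
        node = node[c]
    return "$" in node        # X is present only if the path ends at an external node

print(trie_search(trie, "bell"))   # True
print(trie_search(trie, "bel"))    # False: the path stops at an internal node

Because each node stores its children in a hash table here, the time spent per visited node is O(1) expected rather than O(d).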
• We can use a trie to perform a special type of pattern matching,
called word matching– in this we want to determine whether a given
pattern matches one of the words of the text exactly.
• Word matching differs from standard pattern matching since the
pattern cannot match an arbitrary substring of the text, but only one
of its words. Using a trie, word matching for a pattern of length m
takes O(dm) time, where d is the size of the alphabet, independent of
the size of the text. If the alphabet has constant size (as is the case for
text in natural languages and DNA strings), a query takes O(m) time,
proportional to the size of the pattern. A simple extension of this
scheme supports prefix matching queries. However, arbitrary
occurrences of the pattern in the text (for example, when the pattern is a
proper suffix of a word or spans two words) cannot be found
efficiently with this structure.
• To construct a standard trie for a set S of strings, we can use an
incremental algorithm that inserts the strings one at a time. Recall the
assumption that no string of S is a prefix of another string. To insert a
string X into the current trie T, we first try to trace the path
associated with X in T. Since X is not already in T and no string in S is a
prefix of another string, we stop tracing the path at an internal node v
of T before reaching the end of X. We then create a new chain of
descendant nodes of v to store the remaining characters of X. The
time to insert X is O(dm), where m is the length of X and d is the size
of the alphabet. Thus, constructing the entire trie for set S takes O(dn)
time, where n is the total length of the strings of S.
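• The incremental construction can be sketched in the same style (nested dictionaries with a "$" end marker; the marker also enforces the assumption that no stored string is a prefix of another; the names and example strings are ours):

def trie_insert(trie, X):
    """Insert string X: trace the existing path as far as possible, then hang a
    new chain of descendant nodes for the remaining characters of X."""
    node = trie
    for c in X:
        node = node.setdefault(c, {})   # follow the child for c, creating it if absent
    node["$"] = {}                      # mark the external node associated with X

def build_trie(S):
    trie = {}
    for X in S:                         # insert the strings one at a time: O(dn) total
        trie_insert(trie, X)
    return trie

t = build_trie(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])
print("$" in t["s"]["t"]["o"]["p"])     # True: "stop" ends at an external node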
• Where does the space inefficiency in the standard trie come from? There are
potentially a lot of nodes in the standard trie that have only one child,
and the existence of such nodes is a waste.
• Solution to the above problem: the compressed trie, or Patricia trie.
12.5.2 Compressed Tries
• A compressed trie is similar to a standard trie but it ensures that each
internal node in the trie has at least two children. It enforces this rule
by compressing chains of single-child nodes into individual edges.
• An internal node v of T is redundant if v has one child and is not the
root.
• Fig 12.9 has eight redundant nodes.
• A chain of k ≥ 2 edges
(v0,v1) (v1,v2)…….. (vk−1,vk)
is redundant if:
• vi is redundant for i = 1,...,k−1
• v0 and vk are not redundant
• Nodes in a compressed trie are labeled with strings, which are
substrings of strings in the collection, rather than with individual
characters.
• Advantage of a compressed trie over a standard trie -- the number of
nodes of the compressed trie is proportional to the number of strings
and not to their total length.
Proposition 12.9: A compressed trie storing a collection S of s strings
from an alphabet of size d has the following properties:
• Every internal node of T has at least two and at most d children
• T has s external nodes
• The number of nodes of T is O(s).
• A compressed trie is truly advantageous only when it is used as an
auxiliary index structure over a collection of strings already stored in a
primary structure, and is not required to actually store all the
characters of the strings in the collection.
• Example- collection S of strings is an array of strings S[0],
S[1], ...,S[s−1].
• Instead of storing the label X of a node explicitly, we represent it
implicitly by a triplet of integers (i,j,k), such that X=S[i][j..k]; that is, X is
the substring of S[i] consisting of the characters from the j-th to the
k-th, inclusive.
• This additional compression scheme allows us to reduce the total
space from O(n) for the standard trie to O(s) for the compressed trie.
We must still store the different strings in S, but we have reduced the
space required.
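• A small sketch of this compact labeling (the array S below and the function name are only an illustration, not from the text): instead of storing the substring at a node, we keep the triple (i,j,k) and recover the label from the primary structure on demand:

# the primary structure: an array of strings S[0..s-1]
S = ["see", "bear", "sell", "stock", "see", "bull", "buy", "bid", "hear", "bell", "stop"]

def label(i, j, k):
    """Recover the node label X = S[i][j..k] from its compact triple (i, j, k)."""
    return S[i][j:k + 1]    # characters from position j to position k, inclusive

print(label(4, 1, 2))       # "ee": the triple (4, 1, 2) stands for S[4][1..2]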
12.5.3 Suffix Tries
Another important application for tries--- when the strings in the
collection S are all the suffixes of a string X. Such a trie is called the
suffix trie or suffix tree or position tree of string X.
• Saving Space Using a suffix trie allows us to save space over a
standard trie by using several space compression techniques,
including those used for the compressed trie. The advantage of the
compact representation of tries now becomes apparent for suffix
tries. Since the total length of the suffixes of a string X of length n is
1+2+···+n = n(n+1)/2,
storing all the suffixes of X explicitly would take O(n²) space.
• However, the suffix trie represents these strings implicitly in O(n)
space, as formally stated in the following proposition.
• Construction: We can construct the suffix trie for a string of length n
with an incremental algorithm like the one given in Section 12.5.1.
This construction takes O(dn²) time because the total length of the
suffixes is quadratic in n.
• However, the (compact) suffix trie for a string of length n can be
constructed in O(n) time with a specialized algorithm, different from
the one for general tries. (This algorithm is fairly complex and not in the syllabus.)
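• A minimal Python sketch of the simple (non-optimized, O(dn²)) construction: insert every suffix of X into a standard trie, after which a pattern P can be traced from the root to decide whether it is a substring of X (the names and the "$" end marker are our own choices):

def suffix_trie(X):
    """Build a standard (uncompressed) trie of all suffixes of X.
    Takes O(d n^2) time, since the total length of the suffixes is n(n+1)/2."""
    X = X + "$"                       # end marker, so no suffix is a prefix of another
    trie = {}
    for i in range(len(X)):           # insert the suffix X[i..]
        node = trie
        for c in X[i:]:
            node = node.setdefault(c, {})
    return trie

def is_substring(trie, P):
    """P is a substring of X iff the path for P can be traced from the root."""
    node = trie
    for c in P:
        if c not in node:
            return False
        node = node[c]
    return True

t = suffix_trie("minimize")
print(is_substring(t, "nimi"))   # True
print(is_substring(t, "zei"))    # False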
• Using a Suffix Trie
Trace the path associated with P in T. P is a substring of X if and only if
such a path can be traced.
The search down the trie T assumes that nodes in T store some
additional information, with respect to the compact representation of
the suffix trie: If node v has label (i, j) and Y is the string of length y
associated with the path from the root to v (included), then X[j−y+1..j]
=Y.
This property ensures that we can easily compute the start index of the
pattern in the text when a match occurs.
12.5.4 Search Engines
The World Wide Web contains a huge collection of text documents (Web pages).
Information about these pages is gathered by a program called a Web crawler, which then
stores this information in a special dictionary database.
A Web search engine allows users to retrieve relevant information from this database,
thereby identifying relevant pages on the Web containing given keywords.
Inverted Files
The core information stored by a search engine is a dictionary, called an inverted index or
inverted file, storing key-value pairs (w,L), where w is a word and L is a collection of pages
containing word w.
The keys (words) in this dictionary are called index terms and form a set of vocabulary entries and
proper nouns.
The elements in this dictionary are called occurrence lists and cover as many Web pages as
possible.
Implementation of an inverted index with a data structure consisting of:
1. An array storing the occurrence lists of the terms (in no particular order)
2. A compressed trie for the set of index terms, where each external node stores the index of the
occurrence list of the associated term.
• Why do we store the occurrence lists outside the trie? To keep the
size of the trie data structure sufficiently small to fit in internal
memory.
• Instead, because of their large total size, the occurrence lists have to
be stored on disk.
• A query for a single keyword is similar to a word matching query--Find
the keyword in the trie and return the associated occurrence list.
• When multiple keywords are given and the desired output is the set of
pages containing all the given keywords, we retrieve the
occurrence list of each keyword using the trie and return their
intersection.
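• For illustration, a minimal sketch of such a multi-keyword query (here a plain Python dict stands in for the compressed trie of index terms, and the page identifiers are made up; in the real structure the occurrence lists would live on disk):

# toy inverted index: each index term maps to its occurrence list (a set of page ids)
index = {
    "hamlet":  {1, 2, 5, 9},
    "denmark": {2, 3, 9},
    "ghost":   {2, 7, 9, 11},
}

def query(keywords):
    """Return the pages containing all the given keywords (intersection of the lists)."""
    lists = [index.get(w, set()) for w in keywords]   # one occurrence list per keyword
    return set.intersection(*lists) if lists else set()

print(query(["hamlet"]))                      # the occurrence list of "hamlet"
print(query(["hamlet", "denmark", "ghost"]))  # pages containing all three keywords: 2 and 9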
• Devising fast and accurate ranking algorithms for search engines is a
major research area.
