Moore Algorithm

The Boyer-Moore algorithm is an efficient pattern searching technique that utilizes two heuristics: the Bad Character heuristic and the Good Suffix heuristic. The Bad Character heuristic shifts the pattern based on the last occurrence of a mismatched character, while the Good Suffix heuristic shifts the pattern based on suffix matches. The document provides an overview of these heuristics, their implementation in Python, and discusses their time complexities.

Uploaded by

Rama Krishna

Boyer Moore Algorithm for Pattern Searching

Pattern searching is an important problem in computer science. When we search for a string in a notepad/word file, browser, or database, pattern searching algorithms are used to show the search results.
A typical problem statement would be:
“Given a text txt[0..n-1] and a pattern pat[0..m-1], where n is the length of the text and m is the length of the pattern, write a function search(char pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.”
Examples:
Input: txt[] = “THIS IS A TEST TEXT”
       pat[] = “TEST”
Output: Pattern found at index 10
Input: txt[] = “AABAACAADAABAABA”
       pat[] = “AABA”
Output: Pattern found at index 0
        Pattern found at index 9
        Pattern found at index 12
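
As a baseline for checking these examples, the problem statement can be sketched with a naive scan that tries every alignment. This is not Boyer Moore; the name naive_search and the choice to return a list of indices instead of printing them are assumptions made for illustration.

```python
def naive_search(txt, pat):
    """Return every index at which pat occurs in txt (hypothetical helper).

    Tries all n - m + 1 alignments, so it costs O(m*n) comparisons.
    """
    n, m = len(txt), len(pat)
    return [s for s in range(n - m + 1) if txt[s:s + m] == pat]


if __name__ == "__main__":
    print(naive_search("THIS IS A TEST TEXT", "TEST"))
    print(naive_search("AABAACAADAABAABA", "AABA"))
```

Any faster searcher discussed later should report the same indices as this baseline on the same inputs.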

Boyer Moore is a combination of the following two approaches.

1. Bad Character Heuristic
2. Good Suffix Heuristic
Each of these heuristics can also be used on its own to search for a pattern in a text. Let us first understand how the two approaches work together in the Boyer Moore algorithm.
Boyer Moore preprocesses the pattern so that the pattern can be shifted by more than one position at a time. It processes the pattern and creates a separate array for each of the two heuristics. At every step, it slides the pattern by the larger of the two shifts suggested by the heuristics.
Unlike pattern searching algorithms that compare from left to right, the Boyer Moore algorithm starts matching from the last character of the pattern.
We discuss the bad character heuristic first and the good suffix heuristic after it.

Bad Character Heuristic

The idea of the bad character heuristic is simple. The character of the text that does not match the current character of the pattern is called the bad character. Upon a mismatch, we shift the pattern until one of the following happens:
1. The mismatch becomes a match.
2. Pattern P moves past the mismatched character.
Case 1 – The mismatch becomes a match
We look up the position of the last occurrence of the mismatched character in the pattern. If the mismatched character exists in the pattern, we shift the pattern so that this occurrence is aligned with the mismatched character in the text T.
Explanation:
In the above example, we got a mismatch at position 3. Here the mismatching character is “A”. We now search for the last occurrence of “A” in the pattern. We find “A” at position 1 in the pattern (displayed in blue), and this is its last occurrence. We therefore shift the pattern 2 times so that the “A” in the pattern is aligned with the “A” in the text.
Case 2 – The pattern moves past the mismatched character
We look up the position of the last occurrence of the mismatching character in the pattern, and if the character does not exist in the pattern, we shift the pattern past the mismatching character.
Explanation:
Here we have a mismatch at position 7. The mismatching character “C” does not exist in the pattern before position 7, so we shift the pattern past position 7, and in the above example this eventually gives a perfect match of the pattern (displayed in green). We do this because “C” does not exist in the pattern, so every smaller shift, which leaves the pattern overlapping position 7, would produce a mismatch and the search would be fruitless.
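
The two cases reduce to one piece of arithmetic: the shift is the mismatch index in the pattern minus the index of the last occurrence of the bad character, with -1 standing in for "not present". A minimal sketch follows; the helper name bad_character_shift and the pattern "TATGTG" are hypothetical, chosen only so that the numbers agree with the Case 1 description (mismatch at 3, last “A” at 1, shift 2).

```python
def bad_character_shift(pat, j, bad_char):
    """Shift suggested by the bad character rule (hypothetical helper).

    pat      -- the pattern
    j        -- index in pat where the mismatch happened
    bad_char -- the text character that caused the mismatch
    """
    last = pat.rfind(bad_char)   # -1 when bad_char is not in pat
    # max(1, ...) guards against a last occurrence to the right of j,
    # which would otherwise give a zero or negative shift.
    return max(1, j - last)


if __name__ == "__main__":
    pat = "TATGTG"                            # assumed example pattern
    print(bad_character_shift(pat, 3, "A"))   # Case 1: 3 - 1
    print(bad_character_shift(pat, 5, "C"))   # Case 2: 5 - (-1)
    print(bad_character_shift(pat, 1, "G"))   # last "G" is right of j
```

In Case 2 the shift j + 1 moves the start of the pattern just past the mismatched text character.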
Implementation:
In the following implementation, we preprocess the pattern and store the last occurrence of every possible character in an array whose size equals the alphabet size. If the mismatched text character is not present in the pattern at all, the shift can be as large as m (the length of the pattern). Therefore, the bad character heuristic takes O(n/m) time in the best case.

# Python3 program for the Bad Character Heuristic
# of the Boyer Moore string matching algorithm

NO_OF_CHARS = 256

def badCharHeuristic(string, size):
    '''
    The preprocessing function for
    Boyer Moore's bad character heuristic
    '''
    # Initialize all occurrences as -1
    badChar = [-1] * NO_OF_CHARS

    # Fill in the actual value of the last occurrence
    for i in range(size):
        badChar[ord(string[i])] = i

    # Return the filled list
    return badChar

def search(txt, pat):
    '''
    A pattern searching function that uses the Bad Character
    Heuristic of the Boyer Moore algorithm
    '''
    m = len(pat)
    n = len(txt)

    # Create the bad character list by calling the preprocessing
    # function badCharHeuristic() for the given pattern
    badChar = badCharHeuristic(pat, m)

    # s is the shift of the pattern with respect to the text
    s = 0
    while s <= n - m:
        j = m - 1

        # Keep reducing index j of the pattern while characters
        # of the pattern and text are matching at this shift s
        while j >= 0 and pat[j] == txt[s + j]:
            j -= 1

        # If the pattern is present at the current shift,
        # then index j will become -1 after the above loop
        if j < 0:
            print("Pattern occurs at shift = {}".format(s))

            # Shift the pattern so that the next character in the text
            # aligns with the last occurrence of it in the pattern.
            # The condition s+m < n is necessary for the case when
            # the pattern occurs at the end of the text
            s += (m - badChar[ord(txt[s + m])] if s + m < n else 1)
        else:
            # Shift the pattern so that the bad character in the text
            # aligns with the last occurrence of it in the pattern. The
            # max function makes sure that we get a positive shift.
            # We may get a negative shift if the last occurrence of
            # the bad character in the pattern is on the right side
            # of the current character.
            s += max(1, j - badChar[ord(txt[s + j])])

# Driver program to test the above function
def main():
    txt = "ABAAABCD"
    pat = "ABC"
    search(txt, pat)

if __name__ == '__main__':
    main()

Output
Pattern occurs at shift = 4
Time Complexity: O(m*n)
Auxiliary Space: O(1), since the table has a fixed number of entries (NO_OF_CHARS) regardless of the input
The Bad Character Heuristic may take O(m*n) time in the worst case. The worst case occurs when all characters of the text and pattern are the same. For example, txt[] = “AAAAAAAAAAAAAAAAAA” and pat[] = “AAAAA”. The Bad Character Heuristic may take O(n/m) time in the best case. The best case occurs when none of the characters of the text occurs in the pattern.
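
The gap between the worst and best case can be made visible by counting character comparisons. The sketch below is an instrumented variant of the bad character search, written under a few assumptions: the name count_comparisons is hypothetical, a dict replaces the fixed 256-entry table, and the function returns the match positions together with the number of comparisons instead of printing.

```python
def count_comparisons(txt, pat):
    """Bad-character-only search that also counts comparisons.

    Returns (list of match positions, number of character comparisons).
    Hypothetical helper for illustration.
    """
    n, m = len(txt), len(pat)
    last = {ch: i for i, ch in enumerate(pat)}   # last occurrence table
    matches, comparisons = [], 0
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0:
            comparisons += 1                     # one pat[j] vs txt[s+j] test
            if pat[j] != txt[s + j]:
                break
            j -= 1
        if j < 0:
            matches.append(s)
            s += (m - last.get(txt[s + m], -1)) if s + m < n else 1
        else:
            s += max(1, j - last.get(txt[s + j], -1))
    return matches, comparisons


if __name__ == "__main__":
    # Worst case from the text: every alignment matches all 5 characters.
    print(count_comparisons("A" * 18, "A" * 5)[1])   # 14 alignments * 5
    # Best case: no text character occurs in the pattern.
    print(count_comparisons("B" * 18, "A" * 5)[1])   # one test per jump of 5
```

With n = 18 and m = 5, the first call performs (n - m + 1) * m = 70 comparisons, while the second needs only 3, one at each of the alignments 0, 5, and 10.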

The following part covers the Good Suffix heuristic for pattern searching. Just like the bad character heuristic, the good suffix heuristic generates a preprocessing table.
Good Suffix Heuristic
Let t be the substring of text T that is matched with a substring of pattern P. Now we shift the pattern until one of the following happens:
1. Another occurrence of t in P is matched with t in T.
2. A prefix of P matches a suffix of t.
3. P moves past t.
Case 1: Another occurrence of t in P matched with t in T
Pattern P might contain a few more occurrences of t. In such a case, we try to shift the pattern to align that occurrence with t in text T. For example –
Explanation:
In the above example, a substring t of text T is matched with pattern P (in green) before the mismatch at index 2. We now search for an occurrence of t (“AB”) in P. We find an occurrence starting at position 1 (in yellow background), so we right shift the pattern 2 times to align t in P with t in T. This is the weak rule of the original Boyer Moore and is not very effective; we will discuss a Strong Good Suffix rule shortly.
Case 2: A prefix of P, which matches with suffix of t in T
It is not always likely that we will find an occurrence of t in P. Sometimes there is no occurrence at all; in such cases we can sometimes search for some suffix of t matching some prefix of P and try to align them by shifting P. For example –
Explanation:
In the above example, t (“BAB”) is matched with P (in green) at index 2-4 before the mismatch. Because there is no other occurrence of t in P, we search for some prefix of P that matches some suffix of t. We find the prefix “AB” (in yellow background) starting at index 0, which matches not the whole of t but its suffix “AB” starting at index 3. So we shift the pattern 3 times to align the prefix with the suffix.
Case 3: P moves past t
If the above two cases are not satisfied, we shift the pattern past t. For example –
Explanation:
In the above example, there is no other occurrence of t (“AB”) in P and no prefix of P matches a suffix of t. In that case we can never find a perfect match before index 4, so we shift P past t, i.e. to index 5.
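
The three cases can be sketched as a brute-force function that tries them in order. This is only an illustration of the weak rule, not the table-based preprocessing described later; the name weak_good_suffix_shift and the three sample patterns are hypothetical, chosen so that the shifts agree with the numbers quoted in the explanations (2, 3, and 5).

```python
def weak_good_suffix_shift(pat, j):
    """Weak good suffix shift for a mismatch at pattern index j.

    t = pat[j+1:] is the part of the pattern that already matched.
    Brute force, for illustration only (hypothetical helper).
    """
    m = len(pat)
    t = pat[j + 1:]
    if not t:                      # nothing matched yet: move by one
        return 1
    # Case 1: nearest other occurrence of t to the left in pat.
    for start in range(j, -1, -1):
        if pat[start:start + len(t)] == t:
            return (j + 1) - start
    # Case 2: longest prefix of pat that equals a suffix of t.
    for k in range(len(t) - 1, 0, -1):
        if pat[:k] == t[-k:]:
            return m - k
    # Case 3: move the whole pattern past t.
    return m


if __name__ == "__main__":
    print(weak_good_suffix_shift("CABAB", 2))   # Case 1, t = "AB"
    print(weak_good_suffix_shift("ABBAB", 1))   # Case 2, t = "BAB"
    print(weak_good_suffix_shift("CBAAB", 2))   # Case 3, t = "AB"
```

The order matters: Case 1 gives the smallest safe shift when it applies, so it has to be tried before Case 2, and Case 3 is the fallback.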
Strong Good Suffix Heuristic
Suppose substring q = P[i to m-1] got matched with t in T and c = P[i-1] is the mismatching character. Now, unlike case 1, we search for an occurrence of t in P that is not preceded by character c. The closest such occurrence is then aligned with t in T by shifting pattern P. For example –
Explanation:
In the above example, q = P[7 to 8] got matched with t in T. The mismatching character c is “C” at position P[6]. If we start searching for t in P, we get the first occurrence of t starting at position 4. But this occurrence is preceded by “C”, which is equal to c, so we skip it and carry on searching. At position 1 we get another occurrence of t (in yellow background). This occurrence is preceded by “A” (in blue), which is not equal to c. So we shift pattern P 6 times to align this occurrence with t in T. We do this because we already know that character c = “C” causes the mismatch. Any occurrence of t preceded by c would again cause a mismatch when aligned with t, which is why it is better to skip it.
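
The strong rule changes only the first case: an occurrence of t is accepted only if the character before it differs from c. A brute-force sketch under stated assumptions follows; the name strong_good_suffix_shift and the pattern "AACCACCAC" are hypothetical, with the pattern chosen to agree with the positions in the explanation (t = P[7 to 8], c = P[6] = “C”, a rejected occurrence at 4, an accepted one at 1).

```python
def strong_good_suffix_shift(pat, j):
    """Strong good suffix shift for a mismatch at pattern index j.

    Brute force, for illustration only (hypothetical helper).
    """
    m = len(pat)
    t = pat[j + 1:]                # the matched suffix
    if not t:
        return 1
    c = pat[j]                     # pattern character that mismatched
    # Nearest occurrence of t that is NOT preceded by c. An occurrence
    # at index 0 has no preceding character and is always acceptable.
    for start in range(j, -1, -1):
        if pat[start:start + len(t)] == t and (start == 0 or pat[start - 1] != c):
            return (j + 1) - start
    # Otherwise fall back to the prefix/suffix case, then to a full shift.
    for k in range(len(t) - 1, 0, -1):
        if pat[:k] == t[-k:]:
            return m - k
    return m


if __name__ == "__main__":
    # Occurrence at 4 is preceded by "C" and skipped; the one at 1 is used.
    print(strong_good_suffix_shift("AACCACCAC", 6))   # 7 - 1
```

On the same input the weak rule would stop at the occurrence starting at 4 and shift by only 3, which is guaranteed to mismatch again.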
Preprocessing for Good Suffix Heuristic
As a part of preprocessing, an array shift is created. Each entry shift[i] contains the distance the pattern will shift if a mismatch occurs at position i-1. That is, the suffix of the pattern starting at position i is matched and a mismatch occurs at position i-1. Preprocessing is done separately for the strong good suffix rule and for case 2 discussed above.
1) Preprocessing for Strong Good Suffix
Before discussing preprocessing, let us first discuss the idea of a border. A border is a substring that is both a proper suffix and a proper prefix. For example, in the string “ccacc”, “c” is a border and “cc” is a border because each appears at both ends of the string, but “cca” is not a border. As a part of preprocessing, an array bpos (border position) is calculated. Each entry bpos[i] contains the starting index of the widest border of the suffix starting at index i in the given pattern P. The empty suffix beginning at position m has no border, so bpos[m] is set to m+1, where m is the length of the pattern. The shift position is obtained from the borders that cannot be extended to the left. Following is the code for preprocessing –
void preprocess_strong_suffix(int *shift, int *bpos,
                              char *pat, int m)
{
    int i = m, j = m+1;
    bpos[i] = j;
    while(i > 0)
    {
        while(j <= m && pat[i-1] != pat[j-1])
        {
            if (shift[j] == 0)
                shift[j] = j-i;
            j = bpos[j];
        }
        i--; j--;
        bpos[i] = j;
    }
}
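
To make the border definition and the meaning of bpos concrete, the sketch below computes both directly from the definition rather than with the linear-time loop above. The helper names borders and bpos_by_definition are hypothetical, and the sample pattern "ABBABAB" is an assumed input.

```python
def borders(s):
    """All borders of s: strings that are both a proper prefix and a
    proper suffix of s, shortest first (hypothetical helper)."""
    return [s[:k] for k in range(1, len(s)) if s[:k] == s[-k:]]


def bpos_by_definition(pat):
    """bpos[i] = start index of the widest border of the suffix pat[i:].

    An empty border starts at m, and the empty suffix at position m
    gets m + 1 by convention. Brute force, for illustration only.
    """
    m = len(pat)
    bpos = []
    for i in range(m):
        widest = max((len(b) for b in borders(pat[i:])), default=0)
        bpos.append(m - widest)
    bpos.append(m + 1)
    return bpos


if __name__ == "__main__":
    print(borders("ccacc"))
    print(bpos_by_definition("ABBABAB"))
```

For “ccacc” this lists “c” and “cc” and excludes “cca”. For “ABBABAB” it gives bpos[5] = 7 and bpos[2] = 4, the same values the table-based preprocessing produces.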

Explanation:
Consider pattern P = “ABBABAB”, m = 7.
• The widest border of the suffix “AB” beginning at position i = 5 is empty (nothing), starting at position 7, so bpos[5] = 7.
• At position i = 2 the suffix is “BABAB”. The widest border for this suffix is “BAB” starting at position 4, so j = bpos[2] = 4.
We can understand bpos[i] = j using the following example –

If the character # at position i-1 is equal to the character ? at position j-1, we know that the border will be ? + the border of the suffix at position i, which starts at position j. This is equivalent to saying that the border of the suffix at i-1 begins at j-1, or bpos[i-1] = j-1, or in the code –
i--;
j--;
bpos[i] = j;

But if the character # at position i-1 does not match the character ? at position j-1, we continue our search to the right. Now we know that –

1. The border width will be smaller than the border starting at position j, i.e. smaller than x…?
2. The border has to begin with # and end with ?, or it could be empty (no border exists).
With the above two facts we continue our search in the substring x…? from position j to m. The next border should be at j = bpos[j]. After updating j, we again compare the character at position j-1 (?) with #. If they are equal, we have found our border; otherwise we continue our search to the right until j > m. This process is shown by the code –
while(j <= m && pat[i-1] != pat[j-1])
{
    j = bpos[j];
}
i--; j--;
bpos[i] = j;

In the above code, look at these conditions –

pat[i-1] != pat[j-1]

This is the condition that we discussed for the strong good suffix rule. When the character preceding the occurrence of t in pattern P is different from the mismatching character in P, we stop skipping occurrences and shift the pattern. So here P[i] == P[j] but P[i-1] != P[j-1], so we shift the pattern from i to j. So shift[j] = j-i is recorded for j. Whenever a mismatch occurs at position j, we shift the pattern shift[j+1] positions to the right. In the above code the following condition is very important –

if (shift[j] == 0)

This condition prevents the value of shift[j] from being overwritten by a later suffix having the same border. For example, consider pattern P = “addbddcdd”. When we calculate bpos[i-1] for i = 4, then j = 7, and we set shift[7] = 3. When we later calculate bpos[i-1] for i = 1, then j = 7 again, and without the test shift[j] == 0 we would overwrite shift[7] with 6. This means that if we have a mismatch at position 6, we shift pattern P 3 positions to the right, not 6 positions.
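
The effect of the shift[j] == 0 test can be checked by running the strong good suffix preprocessing on “addbddcdd” with and without it. The sketch below is a Python transcription of the preprocessing loop; the function name strong_suffix_shift and its guard parameter are hypothetical additions used only to switch the test off for comparison.

```python
def strong_suffix_shift(pat, guard=True):
    """Strong good suffix preprocessing; returns the shift array.

    guard=False disables the `shift[j] == 0` test so the overwrite
    described in the text can be observed (hypothetical switch).
    """
    m = len(pat)
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        while j <= m and pat[i - 1] != pat[j - 1]:
            if not guard or shift[j] == 0:
                shift[j] = j - i       # first (smallest) shift wins with guard
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j
    return shift


if __name__ == "__main__":
    print(strong_suffix_shift("addbddcdd")[7])               # guarded
    print(strong_suffix_shift("addbddcdd", guard=False)[7])  # overwritten
```

With the test in place shift[7] stays at 3; without it the later suffix overwrites the entry with 6, which could skip a valid alignment.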
2) Preprocessing for Case 2
In the preprocessing for case 2, for each suffix the widest border of the whole pattern that is contained in that suffix is determined. The starting position of the widest border of the whole pattern is stored in bpos[0]. In the following preprocessing algorithm, this value bpos[0] is stored initially in all free entries of array shift. But when the suffix of the pattern becomes shorter than the border starting at bpos[0], the algorithm continues with the next-widest border of the pattern, i.e. with bpos[j].

# Python3 program for the Boyer Moore algorithm with the
# Good Suffix heuristic to find a pattern in a given text string

# Preprocessing for the strong good suffix rule
def preprocess_strong_suffix(shift, bpos, pat, m):
    # m is the length of the pattern
    i = m
    j = m + 1
    bpos[i] = j

    while i > 0:
        # If the character at position i-1 is not equal to the
        # character at j-1, continue searching to the right
        # of the pattern for a border
        while j <= m and pat[i - 1] != pat[j - 1]:
            # The character preceding the occurrence of t in
            # pattern P is different from the mismatching
            # character in P, so we stop skipping occurrences
            # and shift the pattern from i to j
            if shift[j] == 0:
                shift[j] = j - i

            # Update the position of the next border
            j = bpos[j]

        # pat[i-1] matched pat[j-1], so a border is found.
        # Store the beginning position of the border
        i -= 1
        j -= 1
        bpos[i] = j

# Preprocessing for case 2
def preprocess_case2(shift, bpos, pat, m):
    j = bpos[0]
    for i in range(m + 1):
        # Set the border position of the first character of the
        # pattern in all indices of array shift having shift[i] = 0
        if shift[i] == 0:
            shift[i] = j

        # The suffix becomes shorter than bpos[0], so use the
        # position of the next widest border as the value of j
        if i == j:
            j = bpos[j]

# Search for a pattern in a given text using the
# Boyer Moore algorithm with the Good Suffix rule
def search(text, pat):
    # s is the shift of the pattern with respect to the text
    s = 0
    m = len(pat)
    n = len(text)

    bpos = [0] * (m + 1)

    # Initialize all entries of shift to 0
    shift = [0] * (m + 1)

    # Do the preprocessing
    preprocess_strong_suffix(shift, bpos, pat, m)
    preprocess_case2(shift, bpos, pat, m)

    while s <= n - m:
        j = m - 1

        # Keep reducing index j of the pattern while characters
        # of the pattern and text are matching at this shift s
        while j >= 0 and pat[j] == text[s + j]:
            j -= 1

        # If the pattern is present at the current shift,
        # then index j will become -1 after the above loop
        if j < 0:
            print("Pattern occurs at shift = %d" % s)
            s += shift[0]
        else:
            # pat[j] != text[s+j], so shift the pattern
            # shift[j+1] times
            s += shift[j + 1]

# Driver code
if __name__ == "__main__":
    text = "ABAAAABAACD"
    pat = "ABA"
    search(text, pat)
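
The two heuristics are presented separately above, while the introduction describes Boyer Moore as taking the larger of the two suggested shifts at every step. A minimal sketch of that combination follows, under stated assumptions: the names build_good_suffix_shift and boyer_moore are hypothetical, a dict replaces the fixed-size last-occurrence array, matches are returned as a list rather than printed, and after a full match only the good suffix shift shift[0] is used.

```python
def build_good_suffix_shift(pat):
    """Shift table from the strong good suffix rule plus case 2."""
    m = len(pat)
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        while j <= m and pat[i - 1] != pat[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j
    # Case 2: fill the remaining entries from the borders of the pattern.
    j = bpos[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:
            j = bpos[j]
    return shift


def boyer_moore(text, pat):
    """Return all match positions, using the max of both heuristics."""
    n, m = len(text), len(pat)
    if m == 0 or m > n:
        return []
    last = {ch: i for i, ch in enumerate(pat)}   # bad character table
    shift = build_good_suffix_shift(pat)
    found = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pat[j] == text[s + j]:
            j -= 1
        if j < 0:
            found.append(s)
            s += shift[0]
        else:
            bad = j - last.get(text[s + j], -1)  # may be <= 0
            s += max(shift[j + 1], bad)          # shift[j+1] is always >= 1
    return found


if __name__ == "__main__":
    print(boyer_moore("ABAAAABAACD", "ABA"))
    print(boyer_moore("AABAACAADAABAABA", "AABA"))
```

Because every entry of the good suffix table is at least 1, the max() never needs the extra clamp to 1 that the bad-character-only version uses. Each heuristic on its own never skips a valid alignment, so taking the larger shift is safe as well.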
