ScienceDirect
Available online at www.sciencedirect.com
Procedia Computer Science 166 (2020) 576–581
www.elsevier.com/locate/procedia

3rd International Conference on Mechatronics and Intelligent Robotics (ICMIR-2019)

Application of Improved BM Algorithm in String Approximate Matching

Ying Duan, Hua Long*1, Yu Quan Qu

Kunming University of Science and Technology, Faculty of Information Engineering and Automation, Yunnan Kunming 650000

Abstract:

In this paper, we propose a new algorithm that improves the error tolerance and flexibility of exact matching by combining the bad character and good suffix rules of the Boyer-Moore algorithm. First, a binary sequence function is used to control the output range of the target segment; second, target segments with partially similar features are matched. The experimental results show that the improved algorithm can improve the accuracy of the target segment, and that the number of segments matched can be increased to more than 7 times that of exact matching.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 3rd International Conference on Mechatronics and Intelligent Robotics, ICMIR-2019.
Keywords: BM algorithm, approximate search, pattern matching

1. Introduction

There are many traditional algorithms for exact string matching. The more common ones are BF, KMP, BM and some improved algorithms [1-3].
The BF (Brute Force) [4] algorithm is the least efficient of the exact matching algorithms. In the worst case its time complexity is $O(nm)$, where n is the text length and m is the pattern length, and the expected number of character comparisons is $2n$ [5]. The KMP (Knuth-Morris-Pratt) [4] algorithm has a search-phase time complexity of $O(n+m)$, and its NEXT array requires $O(m)$ extra space. In most cases, however, the actual execution time of the KMP algorithm shows no significant advantage over the BF algorithm [5]. In 1977, Boyer and Moore proposed a new pattern matching algorithm, the BM algorithm [6]. One of its most important features is that many unnecessary character comparisons can be skipped during the matching process. Experimental data show that the matching speed of the BM algorithm is about 3 to 5 times that of the KMP algorithm [6].
Exact matching, however, has low fault tolerance and little flexibility: only perfectly matching segments can be retrieved. Segments that carry only part of the target information cannot be effectively filtered out [7] in applications such as information extraction, intrusion detection, sensitive word filtering, text editing and information query. To address this problem, this paper proposes an improved BM algorithm.

*1 Corresponding author. Tel.: +(86) 13888051616. E-mail: [email protected]
ISSN 1877-0509. DOI 10.1016/j.procs.2020.02.017

2. BM algorithm

The matching problem is stated as follows:

The text string to be searched is defined as $T = w_i w_{i+1} w_{i+2} \cdots w_{i+n}$, and the pattern segment is defined as $P = p_j p_{j+1} p_{j+2} \cdots p_{j+m}$, where $w_i$ represents the i-th character in T and, similarly, $p_j$ represents the j-th character in P.
The basic idea of the BM algorithm is [8-10]:
The pattern is shifted along the text from left to right (within each alignment, the classical BM algorithm compares characters from right to left).
If a mismatch occurs at $w_i \neq p_j$ and $w_i$ does not appear in the pattern segment P, shift the pattern segment to the right until the first character of P is at the first position to the right of the mismatched character $w_i$. If $w_i$ appears in several places in P, then choose $j = \max\{k \mid p_k = w_i\}$, that is, the rightmost occurrence.
If a segment F of the pattern segment P has already matched a portion of the text T, and that matched part also appears elsewhere in P, the pattern can be shifted to the right so that the other occurrence is aligned directly with the matched text; the aligned part is required to be as long as possible [11].
The BM algorithm uses a "jumping" lookup strategy to improve matching efficiency, and its best-case time complexity is $O(n/m)$. However, the BM algorithm is inflexible in matching: it cannot match strings in which some characters have been scrambled [11]. This paper therefore proposes an improved algorithm that expands the matching range of target segments.
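The bad character rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: it applies only the bad character rule (the good suffix rule is omitted), uses 0-based indices, and the function names `bad_character_table` and `bm_bad_character_search` are hypothetical.

```python
def bad_character_table(pattern):
    """Map each character to the index of its rightmost occurrence in pattern."""
    return {ch: k for k, ch in enumerate(pattern)}


def bm_bad_character_search(text, pattern):
    """Return the start indices of exact occurrences of pattern in text.

    Only the bad character rule is used; the good suffix rule is omitted.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = bad_character_table(pattern)
    hits = []
    s = 0  # current alignment: pattern[0] sits under text[s]
    while s <= n - m:
        j = m - 1
        # Compare right to left within the current alignment.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            hits.append(s)
            s += 1  # conservative shift after a full match
        else:
            # Align the rightmost occurrence of the mismatched text character
            # with it, or jump past it when it does not occur in the pattern.
            s += max(1, j - last.get(text[s + j], -1))
    return hits
```

For example, `bm_bad_character_search("abcabcabd", "abd")` inspects only three alignments (starts 0, 3 and 6) because the character "c" never occurs in the pattern.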

3. IBM Algorithm

The core idea of the IBM (Improved BM) algorithm is first to apply the bad character (BC) rule and the good suffix (GS) idea of the BM algorithm, skipping as many segments as possible that do not meet the requirements or that repeat. Then the pattern segment P is aligned with the candidate segment dP, and the binary sequence function is calculated using an exclusive OR operation. Finally, the output range of the target segment is limited by logical operations on that sequence.
The specific matching process is as follows:
Query whether the character $T(w_i)$ in the text string T exists in the character set of the pattern segment P. If $T(w_i) \notin P$, define this character as a bad character and skip it. If $T(w_i) \in P$, return the position index of the character in the pattern segment P that is the same as $T(w_i)$. When $w_i = p_{j+1}$, index is expressed as:

$$\mathrm{index} = j + 1 \qquad (1)$$

The index value is used to align the characters $w_i$ and $p_{j+1}$.

After the character $p_{j+1}$ is aligned with the character $w_i$, the candidate segment dP to be matched is intercepted:

$$dP = w_{i-1} w_i \cdots w_{m+i-1} \qquad (2)$$

$$P = p_j p_{j+1} \cdots p_{j+m} \qquad (3)$$
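The alignment step can be illustrated with a short Python sketch. It rests on assumptions that the text does not settle: indices are 0-based, every occurrence of the pointer character in P is tried (the paper only says that "the position index" is returned), and the helper name `candidate_segments` is hypothetical.

```python
def candidate_segments(text, pattern, i):
    """Align text[i] with each equal character of pattern and cut out dP.

    Returns a list of (start, dP) pairs. The list is empty when text[i] is a
    bad character (it does not occur in pattern) or when no alignment fits
    inside the text.
    """
    m = len(pattern)
    out = []
    for k, ch in enumerate(pattern):
        if ch != text[i]:
            continue
        start = i - k  # pattern[0] would sit under text[start]
        if start >= 0 and start + m <= len(text):
            out.append((start, text[start:start + m]))
    return out
```

With the made-up strings `text = "zzabxdezz"` and `pattern = "abcde"`, the pointer at index 3 (character "b") aligns with `pattern[1]`, so the candidate starts one position earlier, mirroring the way Eq. (2) starts dP at $w_{i-1}$ when $w_i = p_{j+1}$.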

The binary sequence function is constructed to control the output range of the target segment. It is obtained by XORing the corresponding characters of dP and P:

$$y_j = \begin{cases} 0, & w_{i-1} = p_j \\ 1, & w_{i-1} \neq p_j \end{cases} \qquad (4)$$

This yields a binary sequence Y with a length of m:

$$Y = y_j y_{j+1} \cdots y_{j+m} \qquad (5)$$

The specific process is shown in Fig. 1.

Figure 1. The matching string matches the pattern fragment

Logical operations on the binary sequence obtained from the above formula make it possible to control precisely the search range of the desired target segment. For example, suppose that the maximum allowable number of replacement bits is K and the number of "1"s in Y is r; when $r \le K$, dP is confirmed to be the target segment.
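As a small illustration of the binary sequence and the acceptance test $r \le K$, the following Python sketch compares dP and P character by character. The function names and the sample strings are hypothetical; the strings were chosen only so that Y comes out as 00101, the value used later in the worked example.

```python
def binary_sequence(dp, pattern):
    """Y: '0' where the characters agree, '1' where they differ."""
    if len(dp) != len(pattern):
        raise ValueError("dP and P must have the same length")
    return "".join("0" if a == b else "1" for a, b in zip(dp, pattern))


def is_target(dp, pattern, k_max):
    """Accept dP when r, the number of '1's in Y, does not exceed K."""
    r = binary_sequence(dp, pattern).count("1")
    return r <= k_max


y = binary_sequence("abxdy", "abcde")  # two substituted characters
print(y, is_target("abxdy", "abcde", 2))
```

Counting the "1"s in Y is the same as computing the Hamming distance between dP and P, so K acts as a bound on the number of substitutions.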
In approximate matching, similar segments could be selected repeatedly, which wastes data resources. To avoid this, the algorithm skips ahead when consecutive good characters follow the pointer; this is called the pointer good character rule.
Pointer good character rule: when $F \ge 2$ good characters appear consecutively after the character pointer in dP, the fragment is skipped.
With the transfer function $next(i)$, the formula is

$$next(w_i) = \begin{cases} F, & w_i \in P \text{ and } F \ge 2 \\ 1, & \text{otherwise} \end{cases} \qquad (6)$$

where F is the number of good characters that appear consecutively, and $next(i) = F$ means that matching resumes at the F-th character after i.
For ease of understanding, the matching process of T and P under the IBM algorithm is illustrated with an example. Assume that the character set size is $|\Sigma| = 24$, m=5, K=2, n=11, m=4, i=j=1.
Starting from $T(w_1)$ in the text string T, it is judged whether the character exists in P, as follows:

Figure 2. Query current character

In this example, the text is scanned from left to right. In Fig. 3, Y = 00101 and r = 2. Since $r \le K$, dP is determined to be the target segment. The boxed portion in Fig. 3 marks the position of the current character pointer.

Figure 3. Array matching succeeded

In Fig. 3, because $F \ge 2$, $next(w_1) = 2$ is obtained according to the pointer good character rule, indicating that the next query character is "c".

Figure 4. Next character match

Since "c" does not exist in P, $next(w_3) = 1$.

If $r > K$, the match fails, and the pointer is still advanced according to the pointer good character rule. As shown in Fig. 5, $F \ge 2$, so $next(w_5) = 2$ and the next matching character is "a".

Figure 5. Match failed
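The walkthrough above can be tied together in one loop. The following Python sketch is one possible reading of the procedure, not the authors' program: it uses 0-based indices, tries every occurrence of the pointer character in P, takes the largest F over those alignments as the step, and reports each start position once. The function name `ibm_search` and the sample text are hypothetical, since the strings shown in the paper's figures are not available in the text.

```python
def ibm_search(text, pattern, k_max):
    """Scan text and return (start, dP, Y) for every accepted candidate."""
    n, m = len(text), len(pattern)
    results, seen = [], set()
    i = 0  # character pointer into text
    while i < n:
        step = 1  # default step, also used for bad characters
        if text[i] in pattern:
            for k, ch in enumerate(pattern):
                start = i - k
                if ch != text[i] or start < 0 or start + m > n:
                    continue
                dp = text[start:start + m]
                y = "".join("0" if a == b else "1" for a, b in zip(dp, pattern))
                # Accept when the number of substitutions r is at most K.
                if y.count("1") <= k_max and start not in seen:
                    seen.add(start)
                    results.append((start, dp, y))
                # Pointer good character rule: run of agreements from the pointer.
                f = 0
                while k + f < m and y[k + f] == "0":
                    f += 1
                if f >= 2:
                    step = max(step, f)
        i += step
    return results


for hit in ibm_search("zzabxdyzzabcdezz", "abcde", 2):
    print(hit)
```

On this made-up text the first accepted candidate has Y = 00101 and moves the pointer two characters ahead to "x", which does not occur in P and is skipped with a step of 1, the same sequence of events as in Figs. 3 and 4. Because the step can jump over positions, the loop deliberately trades completeness for fewer repeated selections, as the pointer good character rule intends.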



4. Experimental Result

In this section the IBM algorithm is tested on a text. Assume that $|\Sigma| = 24$, the text length is 210000, and K = 4. Segments similar to P = "cgpnannlhjfslbc" are approximately matched by a program written in Java. After statistical output, some of the results are as follows:
A total of 6663 texts similar to "capnannlhjfslbc", with four or fewer replacement errors, were found.
A total of 207 different kinds of results appeared.
Rank  Segment          Count  Proportion
1     capnannlhjfslbc  850    0.12757
2     cgpnannlhjfslbc  485    0.07279
3     cgpnanblhjfslff  404    0.060633
4     ggpnanblhjfslbq  340    0.051028
5     cgpobknlhjfslbc  297    0.044575
6     sapnannlhjfslbc  153    0.022963
7     gapnannlhjfslbc  144    0.021612
8     dgpnannlhjfslbc  110    0.016509
9     fapnannlhjfslbc  90     0.013507
9     hapnannlhjfslbc  90     0.013507
9     japnannlhjfslbc  90     0.013507
12    iapnannlhjfslbc  81     0.012157
13    cgpnadnlhjfslbc  66     0.009905
14    tgpnannlhjfslbc  60     0.009005
15    kapnannlhjfslbc  54     0.008104

Figure 6. Target segment statistics
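Statistics of this kind can be produced by counting the distinct matched segments. The sketch below is an assumption about how such a table might be built, not the authors' Java code: it treats the last column as count divided by the total number of matches (which agrees with the figures above, for example 850 / 6663 ≈ 0.12757) and gives tied counts the same rank, as in the rows ranked 9, 9, 9, 12. The function name `rank_segments` is hypothetical.

```python
from collections import Counter


def rank_segments(segments):
    """Return rows (rank, segment, count, proportion), most frequent first.

    Tied counts share a rank and the following ranks are skipped (1, 2, 2, 4).
    """
    counts = Counter(segments)
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    for position, (segment, count) in enumerate(ordered, start=1):
        if rows and rows[-1][2] == count:
            rank = rows[-1][0]  # same count as the previous row: share its rank
        else:
            rank = position
        rows.append((rank, segment, count, count / total))
    return rows


sample = ["aa"] * 3 + ["ab"] * 2 + ["ac"] * 2 + ["ad"]
for row in rank_segments(sample):
    print(row)
```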

Fig. 7 and Fig. 8 show the results of a manual comparison of randomly chosen segments. The boxes mark the characters that differ:

Figure 7. Target segment vs. pattern segment

Figure 8. Target segment vs. pattern segment

The experimental results show that the IBM algorithm yields more than 7 times as many target segments as exact matching while preserving the accuracy of the target segments, which effectively expands the search range of the target segment of the BM algorithm.
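As a rough arithmetic check of the "more than 7 times" figure, the total of 6663 matched segments can be divided by the count of the exactly matching segment. The text does not state which row of the statistics is the exact-match baseline, so both candidates are shown here as assumptions: the most frequent segment (850 occurrences) and the segment equal to the stated pattern P (485 occurrences).

```latex
\[
\frac{6663}{850} \approx 7.84,
\qquad
\frac{6663}{485} \approx 13.74
\]
```

Either reading is consistent with an increase of more than 7 times; the first ratio is the one closest to the stated figure.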
Figure 9. Fragment number chart

5. Summary

The IBM algorithm can effectively find the target segment and improves on the search range and fault tolerance of exact matching. It matches target segments flexibly and expands the search range on the basis of the effective information they contain.

6. Acknowledgments

For the completion of my thesis, first, I wish to express my deepest gratitude to my teacher, Prof. Long Hua, who has given me the most valuable suggestions and advice, and made necessary corrections. Then I am greatly indebted to Prof. Shao Yubin and Associate Professor Du Qingzhi for their advice. Finally, I would like to express my appreciation to my classmates, who have generously offered their help.

7. References

[1] Hong-Wei Lu, Kai Wei, Hua-Feng Kong. An Improved KMP Efficient Pattern Matching Algorithm [J]. Journal of Huazhong University of Science and Technology (Natural Science), 2006, 34(10): 41-43.
[2] Wei Yang, Zhao-Feng Ma, Wei Li. Research improvement based on BM pattern matching algorithm [C]. Proceedings of the 19th National Youth Communication Academic Annual Conference. 2014.
[3] Horspool R N. Practical fast searching in strings [J]. Software: Practice and Experience, 1980, 10(6): 501-506.
[4] Charras C. Exact String Matching Algorithms [Z]. http://www-igm.univ-mlv.fr/~lecroq/string/
[5] Koloud Al-Khamaiseh, Shadi ALShagarin. A Survey of String Matching Algorithms [J]. International Journal of Engineering Research and Applications, 2014, 4(7): 144-156.
[6] Boyer R S, Moore J S. A Fast String Searching Algorithm [J]. Communications of the ACM, 1977, 20(10): 762-772.
[7] Wei-Qin Liu. Research on network sensitive information monitoring system [D]. Guangdong University of Technology, 2008.
[8] Lian-Ying Min, Ting-Ting Zhao. Research and Improvement of BM Algorithm [J]. Journal of Wuhan University of Technology: Transportation Science and Engineering Edition, 2006(3): 528-530.
[9] Wen-Jing Sun, Hua Qian. Improved BM Algorithm and Its Application in Network Intrusion Detection [J]. Computer Science, 2013, 40(12): 174-176.
[10] Wei-Wei Yang, Xiang Liao. An improved BM pattern matching algorithm [J]. Journal of Computer Applications, 2006, 26(2): 318-319.
[11] Li Chen, Wen-Long Xiong. Network Intrusion Detection System Based on Linux [J]. Journal of Wuhan University of Technology (Transmission Science and Engineering), 2004, 28(1): 137-140.
