Application of Improved BM Algorithm in String Appr 2020 Procedia Computer S
Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 166 (2020) 576–581
www.elsevier.com/locate/procedia
Kunming University of Science and Technology, Faculty of Information Engineering and Automation, Yunnan Kunming 650000
Abstract:
In this paper, we propose a new algorithm that improves the error tolerance and flexibility of exact matching by combining the bad character and good suffix rules of the Boyer-Moore algorithm. First, a binary sequence function is used to control the output range of the target segment; second, the target segment is matched against segments with partially similar features. The experimental results show that the improved algorithm can improve the accuracy of the target segment, and that the number of segments matched can be increased to more than 7 times that of exact matching.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 3rd International Conference on Mechatronics and Intelligent Robotics, ICMIR-2019.
Keywords: BM algorithm, approximate search, pattern matching
1. Introduction
There are many traditional algorithms for exact string matching. The more common ones are BF, KMP, BM, and some improved algorithms [1-3].
The BF (Brute Force) algorithm [4] is the least efficient of the exact matching algorithms. Its time complexity is $O(nm)$ in the worst case, and the expected number of text character comparisons is $2n$ [5]. The search phase of the KMP (Knuth-Morris-Pratt) algorithm [4] runs in $O(n + m)$ time, but the NEXT array requires $O(m)$ extra space, and in most cases the KMP algorithm has no significant advantage over the BF algorithm [5]. In 1977, Boyer and Moore proposed a new pattern matching algorithm, the BM algorithm [6]. One of its most important features is that more unnecessary characters can be skipped during the matching process. Experimental data show that the matching speed of the BM algorithm is about 3 to 5 times that of the KMP algorithm [6].
Exact matching is less fault tolerant and less flexible: only perfectly matching segments can be found. Segments that carry only part of the target information cannot be effectively filtered out [7] in applications such as information extraction, intrusion detection, sensitive word filtering, text editing, and information query. To address this problem, this paper proposes an improved BM algorithm.
* Corresponding author. Tel.: +(86) 13888051616. E-mail: [email protected]
2. BM algorithm
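The skipping behaviour that the introduction attributes to BM can be illustrated with the bad character rule alone. The sketch below is the Horspool simplification [3], not the full Boyer-Moore algorithm with its good suffix rule; the function name `horspool_search` and the 0-based positions are choices made for this illustration.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Return 0-based start positions of exact matches (Horspool variant)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift for each pattern character except the last: distance from its
    # rightmost occurrence to the end of the pattern.
    shift = {ch: m - 1 - idx for idx, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Characters absent from the table (bad characters) allow a full skip of m.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

When the text character under the last pattern position does not occur in the pattern, the window advances by the whole pattern length, which is where the speed-up over BF and KMP comes from.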
3. IBM Algorithm
The core idea of the IBM (Improved BM) algorithm is to first apply the bad character (BC) rule and the good suffix (GS) idea of the BM algorithm, skipping as many fragments that do not meet the requirements as possible, and to repeat this step. Then, after matching, the pattern segment P is aligned with the segment to be matched, dP, and the binary sequence function is calculated using an exclusive OR operation. Finally, the output range of the target segment is limited by logical operations.
The specific matching process is as follows:
Query whether the character $T[w_i]$ in the text string exists in the character set of the pattern segment P. If $T[w_i] \notin P$, then define this character as a bad character and skip it. If $T[w_i] \in P$, then the position of the same character as $T[w_i]$ in the pattern segment P is returned. When $w_i = p_{j+1}$, $index$ is expressed as:
$$index = j + 1 \quad (1)$$
$$P = p_j p_{j+1} \cdots p_{j+m} \quad (3)$$
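The membership test in this step can be sketched as follows. The text does not say which occurrence is returned when a character appears several times in P, so this illustration returns every 1-based position; the helper names `pattern_positions` and `is_bad_character` are hypothetical.

```python
def pattern_positions(pattern: str, ch: str) -> list[int]:
    """Return all 1-based positions of ch in pattern (empty if ch is absent)."""
    return [j for j, p in enumerate(pattern, start=1) if p == ch]


def is_bad_character(pattern: str, ch: str) -> bool:
    """A text character is 'bad' when it does not occur anywhere in the pattern."""
    return not pattern_positions(pattern, ch)
```

For a large alphabet, a precomputed set of the pattern's characters would avoid rescanning P for every text character.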
The binary sequence function is constructed to control the output range of the target segment. Binary sequence function: XOR the corresponding characters of dP and P:
$$y_j = \begin{cases} 0, & w_{i-1} = p_j \\ 1, & w_{i-1} \neq p_j \end{cases} \quad (4)$$
$$Y = y_j y_{j+1} \cdots y_{j+m} \quad (5)$$
The binary sequence obtained from the above formulas can be processed with logical operations to accurately control the search range of the desired target segment. For example, suppose that the maximum allowable number of replacement bits is K and that the number of "1"s in Y is r. When $r \leq K$, dP is confirmed to be the target segment.
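A minimal sketch of the binary sequence function and the threshold test, assuming dP and P have equal length and that the XOR is taken on character codes. The names `binary_sequence` and `is_target_segment` are illustrative, not taken from the paper.

```python
def binary_sequence(dp: str, p: str) -> str:
    """Build Y: '0' where aligned characters agree, '1' where they differ."""
    if len(dp) != len(p):
        raise ValueError("dP and P must have the same length")
    # XOR of two character codes is zero exactly when the characters are equal.
    return "".join("0" if (ord(a) ^ ord(b)) == 0 else "1" for a, b in zip(dp, p))


def is_target_segment(dp: str, p: str, k: int) -> bool:
    """dP is a target segment when the number of '1's in Y (r) is at most K."""
    r = binary_sequence(dp, p).count("1")
    return r <= k
```

With the hypothetical strings `"abxdy"` and `"abcde"`, `binary_sequence` yields `00101`, so r = 2 and the segment is accepted for K = 2 but rejected for K = 1.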
In approximate matching, to avoid the waste of resources caused by repeatedly selecting similar segments, the pointer skips over the run of consecutive good characters that follows it. This is called the pointer good character rule.
Pointer good character rule: when $F \geq 2$ good characters appear consecutively after the character pointer in dP, the fragment is skipped.
Assuming the transfer function $next(w_i)$, the formula is
$$next(w_i) = \begin{cases} F, & w_i \in P \text{ and } F \geq 2 \\ 1, & \text{otherwise} \end{cases} \quad (6)$$
where F is the number of good characters that appear consecutively, and $next(w_i) = F$ means that matching resumes at the F-th character after $w_i$.
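The transfer function in (6) can be sketched as below. Two points are assumptions of this illustration rather than statements from the paper: a "good character" is taken to be any text character that occurs in P (the opposite of a bad character), and F counts the good characters that follow the pointer inside a window of length m. The names `good_run_length` and `next_shift` are hypothetical.

```python
def good_run_length(text: str, i: int, pattern: str) -> int:
    """Count consecutive good characters after 0-based position i within dP."""
    alphabet = set(pattern)
    window = text[i + 1:i + len(pattern)]  # rest of the length-m segment
    f = 0
    for ch in window:
        if ch not in alphabet:
            break
        f += 1
    return f


def next_shift(text: str, i: int, pattern: str) -> int:
    """Equation (6): advance by F when w_i is in P and F >= 2, otherwise by 1."""
    f = good_run_length(text, i, pattern)
    if text[i] in set(pattern) and f >= 2:
        return f
    return 1
```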
For ease of understanding, the matching of T and P under the IBM algorithm is illustrated below.
Assume that the character set size is $|\Sigma| = 24$, $m = 5$, $K = 2$, $n = 11$, $m = 4$, $i = j = 1$.
Starting from $T[w_1]$ in the text string T, it is judged whether it exists in P, as follows:
Ying Duan et al. / Procedia Computer Science 166 (2020) 576–581 579
In this example, the text is scanned from left to right. In Fig. 3, $Y = 00101$ and $r = 2$. Since $r \leq K$, dP is determined to be the target segment. The boxed portion in Fig. 3 marks the position of the current character pointer.
In Fig. 3, because $F \geq 2$, $next(w_1) = 2$ is obtained according to the pointer good character rule, indicating that the next query character is "c".
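As a reference for what the algorithm is expected to output, the sketch below scans every window of the text and keeps those with at most K replacements. It is a brute-force scan and deliberately omits the bad character and pointer good character skips, so it shows the acceptance criterion rather than the IBM algorithm itself. The function name and the sample strings are hypothetical.

```python
def approximate_matches(text: str, pattern: str, k: int) -> list[tuple[int, str, str]]:
    """Return (0-based start, dP, Y) for every window with at most k replacements."""
    m = len(pattern)
    if m == 0:
        return []
    results = []
    for start in range(len(text) - m + 1):
        dp = text[start:start + m]
        # Y marks mismatching positions with '1'; r is the number of '1's.
        y = "".join("0" if a == b else "1" for a, b in zip(dp, pattern))
        if y.count("1") <= k:
            results.append((start, dp, y))
    return results
```

For example, `approximate_matches("xxabxdyxx", "abcde", 2)` returns the single window `"abxdy"` at position 2 with Y = `00101`, the same Y and r = 2 as in the example above.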
4. Experimental Results
The text is tested with the IBM algorithm. Assume that $|\Sigma| = 24$, text length $m = 210000$, and $K = 4$. Segments similar to P = "cgpnannlhjfslbc" are approximately matched using a Java program. After statistical output, some of the results are as follows:
A total of 6663 texts similar to "capnannlhjfslbc", with four or fewer replacement errors, were found.
A total of 207 distinct results appeared.
Rank  Segment          Count  Frequency
1     capnannlhjfslbc  850    0.12757
2     cgpnannlhjfslbc  485    0.07279
3     cgpnanblhjfslff  404    0.060633
4     ggpnanblhjfslbq  340    0.051028
5     cgpobknlhjfslbc  297    0.044575
6     sapnannlhjfslbc  153    0.022963
7     gapnannlhjfslbc  144    0.021612
8     dgpnannlhjfslbc  110    0.016509
9     fapnannlhjfslbc  90     0.013507
9     hapnannlhjfslbc  90     0.013507
9     japnannlhjfslbc  90     0.013507
12    iapnannlhjfslbc  81     0.012157
13    cgpnadnlhjfslbc  66     0.009905
14    tgpnannlhjfslbc  60     0.009005
15    kapnannlhjfslbc  54     0.008104
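The listing above is consistent with counting each distinct matched segment, dividing by the total of 6663 matches (for instance 850 / 6663 ≈ 0.12757), and giving tied counts the same rank. A minimal sketch of that tabulation follows, assuming the matched segments are available as a list of strings; `rank_table` is a hypothetical helper, not the authors' Java code.

```python
from collections import Counter


def rank_table(segments: list[str]) -> list[tuple[int, str, int, float]]:
    """Return (rank, segment, count, frequency) rows, with ties sharing a rank."""
    counts = Counter(segments)
    total = len(segments)
    # Sort by descending count, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    prev_count, prev_rank = None, 0
    for pos, (seg, cnt) in enumerate(ordered, start=1):
        # Competition ranking: equal counts reuse the rank of the first tied row.
        rank = prev_rank if cnt == prev_count else pos
        rows.append((rank, seg, cnt, cnt / total))
        prev_count, prev_rank = cnt, rank
    return rows
```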
Fig. 7 and Fig. 8 show the results of a manual comparison of randomly selected results. The boxed characters are the ones that differ:
The experimental results show that the IBM algorithm increases the number of target segments to more than 7 times that of exact matching while maintaining the accuracy of the target segments, which effectively expands the search range of the BM algorithm.
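One way to read the factor of 7 from the listed results, assuming that the 850 occurrences of the top-ranked segment are the exact matches (the text does not state which row corresponds to the exact match):

```latex
\frac{6663}{850} \approx 7.84 > 7
```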
5. Summary
The IBM algorithm can effectively find the target segment, extending the search range and fault tolerance beyond those of exact matching. It matches target segments flexibly and expands the search range based on the effective information.
6. Acknowledgments
For the completion of my thesis, I first wish to express my deepest gratitude to my teacher, Prof. Long Hua, who gave me the most valuable suggestions and advice and made necessary corrections. I am also greatly indebted to Prof. Shao Yubin and Associate Professor Du Qingzhi for their advice. Finally, I would like to express my appreciation to my classmates, who have generously offered their help.
7. References
[1] Hong-Wei Lu, Kai Wei, Hua-Feng Kong. An Improved KMP Efficient Pattern Matching Algorithm [J]. Journal of Huazhong University of Science and Technology (Natural Science), 2006, 34(10): 41-43.
[2] Wei Yang, Zhao-Feng Ma, Wei Li. Research improvement based on BM pattern matching algorithm [C]. Proceedings of the 19th National Youth Communication Academic Annual Conference. 2014.
[3] Horspool R N. Practical fast searching in strings [J]. Software: Practice and Experience, 1980, 10(6): 6.
[4] Charras C. Exact String Matching Algorithms [Z]. http://www-igm.univmlv.fr/~lecroq/string/
[5] Koloud Al-Khamaiseh, Shadi ALShagarin. A Survey of String Matching Algorithms [J]. International Journal of Engineering Research and Applications, 2014, 4(7): 144-156.
[6] Boyer R S, Moore J S. A Fast String Searching Algorithm [J]. Communications of the ACM, 1977, 20(10): 762-772.
[7] Wei-Qin Liu. Research on network sensitive information monitoring system [D]. Guangdong University of Technology, 2008.
[8] Lian-Ying Min, Ting-Ting Zhao. Research and Improvement of BM Algorithm [J]. Journal of Wuhan University of Technology: Transportation Science and Engineering Edition, 2006(3): 528-530.
[9] Wen-Jing Sun, Hua Qian. Improved BM Algorithm and Its Application in Network Intrusion Detection [J]. Computer Science, 2013, 40(12): 174-176.
[10] Wei-Wei Yang, Xiang Liao. An improved BM pattern matching algorithm [J]. Journal of Computer Applications, 2006, 26(2): 318-319.
[11] Li Chen, Wen-Long Xiong. Network Intrusion Detection System Based on Linux [J]. Journal of Wuhan University of Technology (Transmission Science and Engineering), 2004, 28(1): 137-140.