Exact String Matching Algorithms Survey Issues and
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2019.2914071, IEEE Access
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI
ABSTRACT String matching has been an extensively studied research domain in the past two decades due
to its various applications in the fields of text, image, signal, and speech processing. As a result, choosing
an appropriate string matching algorithm for current applications and addressing challenges is difficult.
Understanding different string matching approaches (such as exact and approximate string
matching algorithms), integrating several algorithms, and modifying algorithms to address related issues
are also difficult. This article presents a survey on single-pattern exact string matching algorithms. The main
purpose of this survey is to propose a new classification, identify new directions, and highlight the possible
challenges, current trends, and future works in the area of string matching algorithms, with a core focus on
exact string matching algorithms.
INDEX TERMS String matching, Boyer-Moore, Rabin-Karp, Knuth-Morris-Pratt, Exact string matching,
Pattern matching, Pattern recognition, Pattern analysis
VOLUME 4, 2016 1
2169-3536 (c) 2018 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
challenges. The analysis of different bases contributes to the community. We propose a new class
according to concept and basis of the methods to save time in choosing an appropriate method
depending on merits and demerits of the methods. For example, we categorize exact matching
algorithms as single- and multiple-pattern matching methods. The two categories are further
classified as software- and hardware-based methods. This new taxonomy helps readers understand the
concept and suitability of the methods depending on their requirements.

The methods can be broadly classified into two main categories: the exact string matching approach,
which does not allow any tolerance, and the approximate string matching approach (also known as the
k-mismatch approach), which allows tolerance while matching. Exact string matching algorithms can be
further divided into single- and multiple-pattern exact matching approaches, as shown in Figure 3.
Single-pattern exact matching can be grouped into software- and hardware-based exact string matching
algorithms. The software-based string matching algorithms can be divided into character comparison,
hashing, bit-parallel, and hybrid approaches. This work focuses on software-based exact string
matching algorithms; hardware-based exact string matching algorithms are a vast topic that goes
beyond the scope of this work. We discuss approaches that fall under software-based exact string
matching algorithms in the following section.

A. APPROXIMATE STRING MATCHING ALGORITHMS
An approximate string matching algorithm finds a substring that is close to a given pattern string,
in contrast to an exact string matching algorithm, which expects a full match. Deciding the degree of
closeness is challenging and depends on the application and the complexity of the problem. According
to Wu and Manber [14], this approach consists of finding all substrings S within a given text t that
have K or fewer differences from the pattern, that is, d(p, S) ≤ K, where p denotes a short pattern
string with length m, d denotes the distance function, and K denotes an integer with value K ≥ 0. In
other words, the algorithms count the number of differences and use K as the threshold while matching
substrings against the pattern. This approach is generally used when up to K mismatches are tolerated
between the pattern and the given text. Two popular distance measures, namely, Hamming distance and
Levenshtein distance, are used in this approach [15, 16]. These algorithms were introduced to address
spelling errors present in patterns or texts, low quality of texts, and difficulty in searching
foreign names [17]. Approximate string matching algorithms can be classified as follows.
• Filtration-based algorithms: These algorithms are two-stage processes. In the first stage, the
locations of possible occurrences of the pattern within the text are identified. In the second
stage, all those locations are fully verified. Some of the recent algorithms that follow this
approach include those in [18-20]. However, these algorithms are inefficient in worst-case
scenarios [21]. Most filtration indexes [19, 21, 25, 29, 35, 37] usually differ in terms of text
sampling, pattern sampling, and alignment conditions [22].
• Backtracking-based algorithms: These algorithms are generally an extension of exact string
matching algorithms. In this approach, existing exact string matching algorithms are modified to
enable approximate search using edit distance operations. The use of succinct (compressed) and
suffix index-based data structures is encouraged in these types of approaches. Some of the recent
works include those in [23-26].

In many cases, approximate string matching does not work well, particularly in the medical domain,
which expects 100% matching to find a solution without any approximation. In this context, exact
string matching is more useful than approximate string matching.

B. EXACT STRING MATCHING
In the exact string matching approach, all occurrences of a given pattern p in a given text t are
found [27]. The characters present in a pattern window and a text window are compared, and both
windows must be of equal length during the comparison phase. Shifting the pattern in case of a
mismatch is necessary to develop efficient algorithms [27]. As mentioned earlier, exact string
matching algorithms can be classified as single- and multiple-pattern string matching algorithms.

1) Single Pattern Matching
In single-pattern matching algorithms, the algorithm receives only a single pattern as an input and
searches for that specific pattern in the target database. This group can be further divided into
two subgroups of hardware- and software-based matching. Some applications require more than one
pattern to be searched, such as in analyzing mutations in DNA. Multiple-pattern matching algorithms
are proposed for such applications [28].

2) Multiple-pattern Matching
Multiple-pattern matching is an advanced version of single-pattern matching. In multiple-pattern
matching algorithms, a set of patterns is received by the algorithm, and the occurrences of all of
these patterns are searched in the target database [29]. Multiple-pattern matching algorithms are
usually applied in the area of bioinformatics, such as in DNA comparison and protein sequencing [30].
In DNA and protein sequences, these algorithms are used to detect and analyze any anomaly in the
given sequence [31]. Similar to single-pattern string matching algorithms, multiple-pattern
algorithms can be divided into hardware- and software-based string matching algorithms.

3) Hardware-based Pattern Matching
The implementation of hardware-based matching algorithms requires hardware devices, such as graphics
processing units (GPUs) and field-programmable gate arrays (FPGAs). These algorithms can be
implemented using parallel programming platforms, such as CUDA and OpenMP, and other specific
languages. The implementation of string matching algorithms in hardware devices, such as GPU or
FPGA, produces more overhead than that of software-based pattern string matching algorithms, but the
former approach is faster than the latter approach. As mentioned above, hardware-based pattern
matching needs different hardware devices and is thus costly. In addition, once a hardware-based
pattern matching algorithm is implemented, it cannot be applied to different data or applications
because changing the hardware design is impossible. By contrast, software-based pattern matching is
flexible and can be used any number of times on different applications. Therefore, software-based
pattern string matching algorithms are popular [32, 63-65].

III. ANALYSIS OF SOFTWARE-BASED SINGLE-PATTERN MATCHING ALGORITHMS
In contrast to hardware-based string matching algorithms, software-based algorithms use certain
compilers and programming languages for implementation purposes and require less overhead. As shown
in Figure 3, software-based algorithms can be divided into character, hashing, suffix automata,
bit-parallel, and hybrid string matching algorithms. The following section explains these algorithms
in detail.

FIGURE 4. Working of Boyer Moore Algorithm.
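As a baseline for the algorithms analyzed in this section, the distinction drawn in Section II between exact matching (no tolerance) and k-mismatch matching can be sketched with a single brute-force window scan. The sketch below is illustrative only and is not taken from any of the surveyed works: the function names `hamming` and `window_search` and the parameter `k` are hypothetical, the distance d is assumed to be the Hamming distance, and positions are 0-based.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def window_search(text: str, pattern: str, k: int = 0) -> list[int]:
    """Return every 0-based start s such that d(pattern, text[s:s+m]) <= k.

    With k = 0 this is exact matching: the pattern window and the text
    window have equal length m and must agree on every character.
    With k > 0 it is the k-mismatch variant under Hamming distance.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    matches = []
    for s in range(n - m + 1):  # slide the window one position at a time
        if hamming(text[s:s + m], pattern) <= k:
            matches.append(s)
    return matches


if __name__ == "__main__":
    print(window_search("ACACACAG", "ACAG"))       # exact matching
    print(window_search("ACACACAG", "ACAG", k=1))  # at most one mismatch
```

This scan always shifts the window by one position and therefore costs O(nm) comparisons in the worst case; the algorithms surveyed below aim to shift by more than one position or to avoid re-comparing characters.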
3) Hybrid BM Approaches
Hybrid BM approaches are combined or integrated by considering advantages of other methods to
enhance the performance of each algorithm.

Sunday [44] proposed an algorithm popularly known as the Sunday algorithm, which is an improvement
of the BM algorithm. This method uses three key steps to scan the given pattern in three different
orders. The Sunday algorithm combines the logic of the BM and KMP algorithms. In the first algorithm,
a new function Δ1 is used to compute the index of the first leftmost occurrence of a character from
the end of the pattern string. This gives the absolute shift required for shifting the pattern. This
feature is also found in BM, but the shift in BM is relative to the position in p of the last
mismatch. For scanning in a specific order, another function Δ2 is defined. The same feature is used
in the KMP and BM approaches. This function finds a position to shift from the current mismatch
position.

Colussi [45] proposed the Colussi algorithm to improve the efficiency of the KMP algorithm. In the
Colussi algorithm, a formal correctness proof of the KMP algorithm is proposed by defining three
assertions: Mch(b), which asserts that "the pattern matches the text in position b"; NMch(b), which
asserts that "the pattern does not match position b"; and Eq(b, i), which asserts that "the first i
characters of the pattern match the text starting at position b." On the basis of these assertions,
the value of the "true" null statement is searched using the Hoare axiomatic semantics proof rule.
The results indicate that some information is lost. This lost information is utilized to reduce the
subsequent computational effort that is needed to attain the final result.

Xian [46] proposed the KMPBS algorithm by combining the BM and KMP algorithms. The given pattern P of
length m is searched from left to right within the text T. Searching is conducted by comparing the
last character of P with the corresponding character of text T, and the KMP algorithm is used to
compare the rest of the characters in case of a match. Different automata approaches have been
applied in character-based approaches, and a large amount of computational time is required by
character-based algorithms. To reduce computational time, hashing-based algorithms are proposed and
developed.

Cao et al. [47] proposed a character-based string matching algorithm, which calculates the
statistical probability of each English letter in the pattern string in accordance with its special
position in the pattern string. The proposed algorithm uses optimization based on evolution
strategies to calculate the statistical probability and dynamic condition of each character in the
pattern string. The main idea of the proposed algorithm is to search for a character with the lowest
proba-
[54]. Hashing-based approaches can be classified further as q-gram and non q-gram approaches, as
shown in Figure 7.

1) Q-grams Approach:
The q-gram approach divides a given sequence into subsequences of length q (q-grams) for matching.
On this basis, several approaches are developed as described below.

Lecroq [55] proposed a modification to the method of Wu and Manber [56]. The modification handles a
single pattern to accelerate the matching process. For each substring of length l of the pattern,
the approach computes a hash value using a function h. In other words, the hash value is computed
for the window of the text w = t[s + m − l..s + m − 1] with length l. If shift[h(w)] is greater than
0, then a shift of length shift[h(w)] is applied; otherwise, the pattern is naively checked [57].

Boyer-Moore-Horspool with q-grams (BMHq) [57] is an efficient modification of the Horspool algorithm
and is suitable for the DNA alphabet using the q-gram approach. Instead of inspecting a single
character at each alignment, a q-gram is read and an integer called a fingerprint is computed. The
idea consists of mapping the ASCII codes of the characters in the DNA alphabet to a range of four
values, that is, r : {a, c, g, t} → {0, 1, 2, 3}, such that the computation can be minimized or
limited. The comparison for equality is performed by comparing the last q-gram of the pattern with
the corresponding q-gram in the current window.

2) Non q-grams Approach:
In the non q-gram approach, the whole input pattern is encoded and scanned. A few approaches that
use this concept are discussed as follows.

The algorithm of Wu and Manber [56] searches for all the occurrences of the patterns of a finite set
X = {x0, x1, ..., xk−1} in a given text y and is based on the BM algorithm. Substrings are considered
to be of length q. The shift for all possible strings of length q is computed during the
pre-processing phase. From the finite set X, all substrings B of length q are hashed, followed by a
shift. This step is followed by the searching phase, which consists of reading substrings B of
length q. Three tables are used in the pre-processing phase (i.e., SHIFT, HASH, and PREFIX).

The algorithm of S. Kim and Y. Kim [58] follows the hashing approach and fully utilizes the encoding
scheme. The input pattern is encoded, and the given text is scanned from left to right. S. Kim and
Y. Kim claimed that the algorithm is efficient for large patterns and suitable for multiple-pattern
strings.

Faro [59] proposed a condensed alphabet-based string matching algorithm. The proposed algorithm is
an enhanced version of an existing skip-search string matching algorithm. The proposed algorithm
involves two phases: pre-processing and searching. In the pre-processing phase, a numeric value
called a fingerprint is calculated for each substring and then stored in a table. Similarly,
subsequences for each pattern are also stored in a table for searching purposes. In the searching
phase, similar to skip search, a numeric value (fingerprint) is calculated for each text position
and is matched with the stored fingerprint of the pattern. If the fingerprint matches, then the
location is verified for possible matching. If the fingerprint lookup results in an empty location,
then no verification is required. The experimental results show that the proposed algorithm performs
well when the pattern is long. However, the performance of the proposed algorithm decreases when the
pattern is short.

In summary, although hashing-based approaches are faster than character-based approaches, the
hashing process still suffers from certain drawbacks (e.g., hashing collisions) and is unsuitable
for short-length patterns. Hashing approaches are also sensitive to uppercase and lowercase letters
due to the use of different encoding schemes. For these applications, hashing algorithms may not
perform well. For short-length patterns, automata approaches work considerably better than hashing
processes according to our experiments.

C. SUFFIX AUTOMATA-BASED APPROACH
A suffix automaton is an automaton that comprises two related but distinct automata constructs: the
deterministic acyclic finite state automaton (a data structure representing a finite set of strings)
and the suffix automaton (a finite automaton acting as a suffix index) [60] for matching. It can be
defined as D(p) = {Q, q0, F, Σ, δ}.
Here, Q = {q1, q2, q3, ..., qm} is the set of states, q0 is the initial state, F = {qm} is the set of
accepting states, Σ is the alphabet, and δ : Q × Σ → Q is the transition function. This approach
uses a directed acyclic graph in which nodes/vertices are called states, and edges between the nodes
are considered transitions between the states. This approach uses the suffix automaton data
structure that recognizes all the suffixes of the pattern. One of the states (nodes), denoted by
"q0", is called the "initial state" of the suffix automaton, from which all other states in the
automaton can be reached. One or more of these states are marked as "terminal states". Thus, if we
go from "q0" to any of these terminal states and note down the labels of the edges, then we obtain a
suffix of the original string "S". The following example in Figure 9 explains the concept of the
suffix automaton.

As shown in Figure 9, we have a pattern input "abbabb". Each state represents one character, and
searching starts from state "q0", denoted by "0" in the example. Every path of the suffix automaton
from state 0 to a terminal node (denoted by double circles) must represent a suffix of the main
pattern, that is, "abbabb". In this case, the possible suffixes of "abbabb" are "b", "bb", "abb",
"babb", and "bbabb". Figure 9 shows that each terminal node results in a suffix of the main pattern.

The final state is reached using the path given by the terminal states. This process reduces the
large number of comparisons among patterns using the longest suffix. Therefore, time efficiency is
guaranteed [61]. The approaches that use this concept are discussed below.

The KMP string matching algorithm is a basic and fundamental algorithm that uses the concept of
automata in string matching and was proposed by Knuth et al. [29]. The basic idea behind the
algorithm is that the text t is scanned from left to right, and the algorithm decides the number of
positions by which the pattern p is shifted to avoid redundant comparisons during a mismatch. Thus,
this algorithm tracks information gained from previous comparisons. This algorithm skips characters
depending on prefix and suffix rules and is illustrated using the example below.

I        1  2  3  4  5  6
Pattern  A  C  A  C  A  G
Prefix   0  0  1  2  3  0

When i = 1, no possible prefixes and suffixes are available for character "A". Thus, the value of
prefix will be 0, and the pointer will be moved to index 2.

When i = 2, the possible prefix for character "C" is A, and the possible suffix is C. They do not
match. Thus, the value of prefix will be 0.

When i = 3, the possible prefixes for character "A" are A and AC. The possible suffixes are A and
CA. A is both prefix and suffix. Thus, the value of prefix will be 1.

When i = 4, the possible prefixes for character "C" are A, AC, and ACA. The possible suffixes are C,
AC, and CAC. AC is both prefix and suffix. Thus, the value of prefix will be 2.

When i = 5, the possible prefixes for character "A" are A, AC, ACA, and ACAC. The possible suffixes
are A, CA, ACA, and CACA. ACA is the longest string that is both prefix and suffix. Thus, the value
of prefix will be 3.

When i = 6, the possible prefixes for character "G" are A, AC, ACA, ACAC, and ACACA. The possible
suffixes are G, AG, CAG, ACAG, and CACAG.
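The prefix values derived step by step above can also be computed mechanically. The following is a minimal sketch of the standard KMP prefix (failure) table and the left-to-right search that uses it; it is not code from Knuth et al. or from this survey. The function names `prefix_table` and `kmp_search` are hypothetical, and indices are 0-based, so entry i of the table corresponds to index i + 1 in the example.

```python
def prefix_table(pattern: str) -> list[int]:
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of pattern[:i+1]."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter candidate prefix.
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 0-based start positions of all occurrences of pattern."""
    m = len(pattern)
    if m == 0:
        return []
    table = prefix_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, c in enumerate(text):  # the text pointer never moves backward
        while k > 0 and c != pattern[k]:
            k = table[k - 1]  # shift the pattern using the prefix table
        if c == pattern[k]:
            k += 1
        if k == m:
            matches.append(i - m + 1)
            k = table[k - 1]  # continue, allowing overlapping occurrences
    return matches


if __name__ == "__main__":
    print(prefix_table("ACACAG"))
    print(kmp_search("ACACACAGACACAG", "ACACAG"))
```

For the pattern ACACAG, `prefix_table` reproduces the Prefix row of the table above. Because the text pointer only advances, the search performs O(n) character comparisons after O(m) preprocessing.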
For i = 6, the above-mentioned procedure finds no string that is both a prefix and a suffix, so the
value of prefix will be 0. This is the novel idea of automata-based string matching, which is
different from other existing methods. Given that a pre-generated prefix table is used, the procedure
allows skipping certain comparisons during matching. The entire process requires O(n) search time,
and the pre-processing phase requires O(m) space. However, this approach is inefficient for a small
set of data [62]. We discuss the algorithms shown in Figure 8 that use this basis for string matching
as follows.

FIGURE 9. Working of Suffix automaton.

1) Directed Acyclic Word Graph-based Approaches:
A directed acyclic word graph (DAWG) is a data structure that allows fast word searches. In a DAWG,
each node represents a character. The first character represents the entry point. One can travel
from one node to another to find a proper match.

Backward non-deterministic DAWG matching (BNDM) [67] is based on the non-deterministic automaton
approach combined with the bit-parallel concept. A window of length m is shifted over a given text
t. For each alignment, the pattern p is searched by scanning the current window backward while the
automaton configuration is updated accordingly.

Double-forward DAWG matching [69] uses two automata. The key idea is to divide the window into two
parts, and each part is scanned with a factor automaton of p. The two positions for each text window
are represented by α and β, and the algorithm starts at position β and reads the text in the current
window forward for each attempt.

The backward oracle matching (BOM) algorithm [68] is based on an acyclic automaton and recognizes at
least the factors of p with m + 1 states. The key idea is that, if back searching
fails on any letter (e.g., character c) after reading a particular matching process starts in each area from both ends and
word w, then cw is not a factor of pattern p and the window continues toward the center of the area and stops when two
can be moved after c. An intermediate structure, which windows overlap each other. The functionalities of integer
is called factor oracle, is built. Oracle is an automaton to comparison and multi-window (continuous jump) are added
ensure that Q has exactly m + 1 states. This intermediate to the three existing algorithms (e.g., quick search, tuned BM,
structure called factor oracle must satisfy four conditions: 1) and BMHq) for fast and cost-effective string matching.
automaton should be acyclic, 2) states should be as few as
possible, 3) factors of p should be recognized at least, and 4) 3) Automata-based Skipping Approaches:
linear number of transitions should be used. Once a window
Waga et al. [74] proposed an automata-based skipping al-
of size m is moved on text, a pattern is searched by scanning
gorithm for fast and efficient real-time pattern matching in
the current window backward to realize a secure shift.
embedded online applications. The proposed algorithm is a
BOM-based algorithms have different variants, such as ex-
modified version of an existing string matching algorithm
tended BOM, forward BOM, and simplified extended BOM.
called Franek-Jennings-Smyth (FJS). The above-mentioned
Extended BOM [72] is a modification of BOM algorithm by
authors extend the skipping value functionality for timed
enhancing the speed of the algorithm through introducing
pattern matching from FJS string matching algorithm us-
a fast loop. The transitions are computed in one step for
ing automata states by language, substring, and word over-
two rightmost characters to determine an undefined transition
approximation. Two versions of the FJS-type algorithm (e.g.,
with high probability. Forward BOM is an improved version
untimed and timed) are presented for offline and online
of extended BOM and combines the ideas of quick-search
pattern matching. The FJS skipping value functionality is
and extended BOM algorithms. The idea of this algorithm is
the combination of two skip value algorithms, namely, ∇
to compute the shift advancement while focusing on a char-
(quick search) and β (KMP). For online versions of the FJS-
acter that follows the forward character (current window).
type algorithm, zone abstraction is combined with two value
Simplified extended BOM algorithm [73] replaces the two-
skipping functionalities. The skipping value algorithms help
dimensional table with one-dimensional array and has the
in unnecessary matching executions, and the proposed FJS-
same procedure as extended BOM. This algorithm saves
type algorithm uses part of the pattern instead of the whole
memory compared with its predecessor.
target word for fast online pattern matching.
2) Wide window based approaches:
He et al. [70] proposed the wide window (WW) algorithm, D. BIT-PARALLEL APPROACH
which is different from the traditional window system used Automata-based string matching algorithms are excellent
in string matching. WW algorithm divides the text into n/m for long-length patterns but are unsuitable for short-length
overlapping windows (of size 2 m − 1). This algorithm also ones. Thus, bit-parallel approach, which involves parallel
uses the suffix automaton approach. The suffix of the text is scanned from the middle to the right using a forward suffix automaton. The corresponding prefixes are scanned backward using a reverse prefix automaton if required. The m rightmost characters are scanned from left to right, starting from the initial state q0, until a full match or a missing transition is reached. The remaining m − 1 leftmost characters are scanned from right to left.
Liu et al. [71] improved the WW algorithm proposed by He et al. [70] by changing the parameters of the older version. In the improved version, the second phase of the WW algorithm is modified by changing the length of the longest remembered prefix to more than 0 for the first improvement and to m/2 for the second one.
Hongbo et al. [75] proposed multi-window, integer-comparison variants of three suffix string matching algorithms. The proposed algorithms are enhanced versions of three existing suffix string matching algorithms, namely, quick search, tuned BM, and BMHq, with the added functionality of unaligned-read integer comparison and multiple windows. The main objective of the enhancement is to reduce the number of comparisons (integer comparison) and accelerate the matching process (multi-window). In the multi-window scheme (i.e., the jump distance calculation mechanism), the text is divided equally into areas, and two windows belong to one single area. The
processing, was proposed by Domolki in 1968 to accelerate the matching process. This concept is based on parallel computing. In this approach, the number of operations that an algorithm performs can be reduced by a factor of up to the number of bits in the computer word [76]. This approach is fast and efficient, especially when the length of the given pattern p is less than the word length [27]. We classify bit-parallel algorithms depending on bit-level operations: Shift-OR (SO), Shift-AND, and Single Instruction/Multiple Data (SIMD)-based instruction approaches. The classification is presented in Figure 10.
The approaches that use bit operations are discussed as follows. Shift-based algorithms use logical bitwise operations such as NOT, OR, AND, and XOR to compare strings. Bitwise NOT performs logical negation of each input bit. Bitwise OR compares two inputs of equal length bit by bit and outputs 0 where both input bits are 0; otherwise, it outputs 1. Bitwise AND performs logical multiplication of two bit patterns and outputs 1 where both input bits are 1; otherwise, it outputs 0. Similarly, bitwise XOR performs an exclusive OR and outputs 1 where the two input bits differ and 0 where they are equal. The following algorithms implement bitwise operations for string matching.
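As a small illustration of the four bitwise operations just described, the following Python sketch applies them to two 4-bit values. The width of 4 bits, the names WIDTH, MASK, and bits, and the sample operands are assumptions chosen for readability; they do not come from the surveyed papers.

```python
WIDTH = 4                    # assumed word width, for illustration only
MASK = (1 << WIDTH) - 1      # keeps every result inside WIDTH bits

a = 0b1100
b = 0b1010


def bits(x: int) -> str:
    """Render x as a fixed-width binary string."""
    return format(x & MASK, f"0{WIDTH}b")


results = {
    "NOT a": bits(~a),       # negates every bit of a
    "a OR b": bits(a | b),   # 0 only where both bits are 0
    "a AND b": bits(a & b),  # 1 only where both bits are 1
    "a XOR b": bits(a ^ b),  # 1 only where the bits differ
}

for name, value in results.items():
    print(f"{name:8s} = {value}")
```

Because Python integers are unbounded, the result of NOT is masked back to WIDTH bits; on a fixed-size machine word this truncation happens implicitly.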
The SO algorithm [77], unlike the other approaches, uses bitwise operations for string matching. The key idea is to simulate the operation of a non-deterministic finite automaton (NFA) in parallel while searching. Here, the NFA is represented as a vector of m states, and each state i records whether the first i characters of the pattern match the text ending at the current position. The basic methodology follows the concept of KMP and BM as discussed above. Non-active states are represented by 1 and active states by 0.
Fredriksson and Grabowski [78] proposed an algorithm based on the SO algorithm [77] that improves the average- and worst-case running times. The number of mismatches between the pattern and the text is computed using bit-parallelism.
Popular algorithms [14, 77] normally take O(n⌈m/w⌉) time to find the occurrences of pattern P in text T (w denotes the number of bits in a machine word). The bit-parallel approach is extended by using super-alphabets. The idea is to process several characters in a single step. The set of patterns is processed in the same way as in the SO algorithm, but the algorithm scans only the q-th factor of the text. The q-th factor is the set of new patterns generated from the original pattern P, which is calculated during the pre-processing phase. The authors claimed that this technique accelerates the search by a factor of O(log n). However, the admissible length of the input pattern depends on the computer word size (i.e., the number of bits that can be processed by the CPU). This method fails when the input pattern is longer than the computer word size.
SIMD-based algorithms are developed to accelerate string matching in hardware by utilizing the functionality of SIMD instructions. As in shift-based algorithms, logical operations are performed in parallel to accelerate matching. However, in the SIMD-based approach, the capability of the processor core is also utilized to enhance searching. For example, the INTEL SSE4.2 instruction set can perform up to 256 comparison operations in a single instruction. The four main string processing operations are listed as follows:
(i) PCMPESTRI - Packed compare explicit-length strings, return index in ECX/RCX
(ii) PCMPESTRM - Packed compare explicit-length strings, return mask in XMM0
(iii) PCMPISTRI - Packed compare implicit-length strings, return index in ECX/RCX and
(iv) PCMPISTRM - Packed compare implicit-length strings, return mask in XMM0 [79].
All these instructions can be utilized to enhance string matching using different programming models. The algorithms that utilize SIMD instructions are discussed as follows.
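Before the individual SIMD-based algorithms are discussed, the Shift-OR scheme [77] described in this section can be made concrete with a minimal Python sketch. It follows the stated convention (0 = active state, 1 = inactive state) and assumes that the pattern fits in a single machine word. The function name shift_or_search and the word_bits parameter are illustrative assumptions, not the authors' implementation.

```python
def shift_or_search(text: str, pattern: str, word_bits: int = 64) -> list[int]:
    """Return the start positions of all occurrences of pattern in text."""
    m = len(pattern)
    if m == 0 or m > word_bits:
        # Mirrors the limitation noted above: the pattern must fit in one word.
        raise ValueError("pattern length must be between 1 and word_bits")

    all_ones = (1 << m) - 1

    # Pre-processing: masks[c] has bit i cleared exactly when pattern[i] == c.
    masks: dict[str, int] = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, all_ones) & ~(1 << i)

    state = all_ones          # every state starts inactive (all bits are 1)
    hits = []
    for j, c in enumerate(text):
        # The shift brings in a 0 (active) at bit 0; the OR deactivates
        # every state whose pattern character differs from c.
        state = ((state << 1) | masks.get(c, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:   # bit m-1 active: full match
            hits.append(j - m + 1)
    return hits


print(shift_or_search("abracadabra", "abra"))
```

For the text "abracadabra" and the pattern "abra", the sketch reports the start positions 0 and 7. Shift-AND uses the complementary convention, with 1 marking an active state and AND replacing OR in the update step.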
Peltola [80] proposed the bit-parallel length-invariant matcher. In this approach, an alignment matrix is constructed consisting of ω rows (ω is the size of the word in the target machine) such that each row i (0 ≤ i < ω) contains the pattern right shifted by i characters. The algorithm operates by sliding the alignment matrix over the text and checking for any possible placement in the input text. The system must allow hardware implementations to implement SIMD in real time for practical applications.
Külekci et al. [81] proposed the streaming SIMD extension filtering algorithm. SIMD instructions are a feature of microprocessors that supports parallel processing on multiple data items. Two phases, namely, filtering and verification, are used in this algorithm. In the filtering phase, a fast heuristic detects the most probable occurrences of the pattern within portions of the text. The output of filtering is verified in the verification step. This process has a worst-case time complexity of O(nm), a best-case time complexity of O(n/m), and an average time complexity of O(n/m + nm/2^16), where m represents the length of the pattern and n represents the length of the given text.
Külekci et al. [82] concluded that this algorithm is suitable for long patterns. The text and the pattern are processed in 16-byte blocks, with N = [n/16] and M = [m/16]. Considering that this algorithm focuses on long patterns, the lower limit for m is set to 32 (32 ≤ m). This operation cannot be generalized for different types of inputs because these approaches are required to convert non-deterministic automata to deterministic automata.
Wu and Manber [56] proposed the Shift-AND algorithm, which does not convert the NFA to a deterministic finite automaton (DFA) but uses the NFA directly for performing operations in parallel. The NFA is given by {Q, Σ, δ, q0, F} for the language of all words that end with an occurrence of p. Here, Q = {q0, q1, q2, ..., qm}, where q0 is the initial state, F = {qm} is the set of final states, and δ : Q × Σ → P(Q) is the transition function. The key idea is to keep a record of all prefixes of the pattern that match a suffix of the text read so far by creating a table that holds bit masks. In a bit mask, the set of prefixes is kept and updated using bit-parallelism. Thus, the algorithm builds a table from the pattern and updates the bit mask while scanning the text.
Bit-parallel algorithms usually do not keep a record of previous alignments that have been checked. Shift-Vector matching [83] is the first algorithm to introduce partial memory for transferring information to subsequent alignments. A bit vector S is maintained, which provides information regarding the occurrence of the pattern at certain positions [69]. In the searching phase, the algorithm applies an OR operation to the bit vector and updates it with the shift vector corresponding to the text character that is aligned with the rightmost character of the pattern.
The bit-parallel algorithm for small alphabets [84] is based on the principle of a matching matrix of the pattern and the text. For matrix matching, a base-2 logarithm table is used to locate the leftmost "1" bit. This bit indicates the most recent probable occurrence of the pattern in the text. The bad character approach of the BM algorithm and the base-2 logarithm table value of the current flag are used to obtain shifts.
A single algorithm may not work well for different applications because each application may pose different challenges. Therefore, the strengths of different algorithms should be determined to integrate their advantages in a hybrid approach for solving complex issues.
E. HYBRID APPROACH
The hybrid approach operates well on a complex problem because it combines the advantages of different algorithms and is better than individual algorithms [85]. Many approaches use the hybrid concept. The approaches that combine one or more character-based methods are placed under character-based approaches, and methods that use one or more methods from automata-based approaches are placed under automata-based approaches. The approaches that use both character- and automata-based methods are placed in between the two approaches, as shown in Figure 11.
Crochemore et al.’s algorithm [86] is a combination of Aho-Corasick [87] and DAWG [41] and is also known as the reverse factor algorithm. This algorithm makes at most 2n comparisons. Two theoretical tools, namely, the Aho-Corasick machine and the DAWG, are used for implementation purposes. Two processes, namely, PROCESS1 and PROCESS2, are used in this algorithm. PROCESS1 scans the text from left to right with a shift of the ending position of the pattern by m/2 (m denotes the length of the shortest pattern). PROCESS1 remembers the position of each character i in the string (Υ) and passes control to PROCESS2, which starts searching the string backward from i + SHIFT (SHIFT = length of shortest pattern − |Υ|). The feature of this algorithm is that it can make long jumps compared with the Aho-Corasick algorithm [87].
Navarro’s algorithm [88] is a modification of the BDM algorithm and skips characters using a suffix automaton. The modification allows errors in searching the pattern in reverse order.
Wuu’s algorithm [89] is an extension of the KMP algorithm and uses a tree-based approach for pattern matching. The main difference from the KMP algorithm is that the shift is moved horizontally right or left, similar to KMP, and also vertically up and down. The traversal process is bounded for each node within a subject tree. The time complexity of the algorithm is O(n × log n), where n denotes the number of nodes in the tree.
Yuebin’s algorithm [90] is a modification of the Boyer-Moore-Horspool algorithm [38]. The pattern is scanned from right to left. In the pre-processing phase, an array NEXT is generated to compute the shift position. The information in this array is used to determine the number of characters to be skipped.
The AKC algorithm [76, 91] is a modification of the algorithms in [39] and [92]. Characters within windows are scanned from right to left. For each search, the information regarding the factors that match the suffix of the pattern is stored. Once
the pattern is shifted after each search, this algorithm ensures that the previously matched suffix and text factor remain the same.
Fast-search (FS) algorithms [76, 93, 94] constitute a family of BM variants. The basic mechanism of all these algorithms is nearly the same as that of BM: the shift is computed using the bad character rule only when the mismatch occurs at the first character comparison; in all other cases, the good suffix rule is applied. FS is the first algorithm of this family. The pattern is compared with the window from right to left. At each attempt, the pattern and the current window are compared, and the shift is computed using the bad character rule if a mismatch occurs during the first character comparison. In other cases, the good suffix rule is used. Backward FS is another algorithm of this family. It combines the bad character rule with the good suffix rule to obtain the backward good suffix rule. The Forward FS algorithm uses a look-ahead character to compute large shift advancements.
Simplified BNDM (SBNDM) and long BNDM (LBNDM) [76, 95] are proposed by the same authors and are based on the BNDM algorithm [96]. The main loop in SBNDM is made faster than that in BNDM by not memorizing the longest prefix. This makes its shift computation lighter than that of BNDM. If the current alignment position in the text is A and u updates have been performed, then the starting position for the next alignment will be A + m − u + 1. LBNDM handles long patterns better than BNDM. In LBNDM, the pattern is partitioned into subpatterns. The leftmost subpattern is scanned first. Then, all the remaining subpatterns are examined when a match is found.
The two-way non-deterministic DAWG matching (TNDM) algorithm is also a variant of the BNDM algorithm. Backward and forward searches are made alternately. Once the pattern is aligned with the text window and a mismatch occurs, TNDM initializes the state vector D in accordance with the two rightmost characters (instead of 1^m as in BNDM) and scans in the forward direction to examine text characters after the alignment to look for any conflicting characters within the pattern. A further improvement on TNDM is forward non-deterministic DAWG matching (FNDM). When finding the suffix using FNDM, the BNDM backward check is substituted with a naive check of occurrence [27, 76, 97].
Nebel [98] modified the Horspool string matching algorithm to increase the searching speed using the probabilities of different symbols. This algorithm is similar to Sunday’s algorithm [43] except in the way symbols in the pattern are compared with symbols in the text. The relative number of occurrences of different symbols is used to represent probabilities. Pre-processing is divided into two phases. In the first phase,
the positions of pattern P in which the same symbols occur are determined. In the second phase, the values of V (V denotes an array) are set using a min-heap depending on needs. The algorithm takes O(m log(n)) overall running time.
SSABS and TVSBS algorithms [99] are examples of hybrid algorithms. SSABS is a hybrid of the quick-search and Raita algorithms [39]. The comparison is carried out as in the Raita algorithm. First, the rightmost character is compared. Then, once a match is found, a further comparison is carried out using the leftmost character. In this way, the resemblance between the given window and the pattern is established. The remaining characters are compared from right to left until a complete match or a mismatch occurs [99]. Thathoo [100] later improved this algorithm using a shift method from the Berry-Ravindran bad character rule [42]. This algorithm requires a small number of character comparisons due to its large shift advancements.
The FJS algorithm [101] combines the linear worst-case time complexity of KMP with the sublinear average behavior of quick-search [26]. Each attempt involves two steps. In the first step, the given window and the pattern are compared as in the quick-search algorithm, starting from the rightmost character of the pattern. In the case of a mismatch, the quick-search shift is applied; otherwise, FJS invokes the second step. In the second step, KMP pattern matching is used, which starts from the leftmost character, and the shift is performed accordingly, followed by a return to the first step [11, 27].
Alqadi et al.’s algorithm [102] is a multiple skip pattern matching algorithm and is a modification of the BM algorithm. To the best of our knowledge, this algorithm performs skips depending on index values. In this case, the comparison is based on the index values of all substring occurrences of the given text that are equal to pattern p of length m. The skip value is taken from the range 1 to (m − 1).
Huang et al.’s algorithm [103] aims to reduce the memory requirements of the Aho-Corasick algorithm [87]. The approach of magic states is used in the DFA. The algorithm rearranges states in two steps: magic states are found in the first step, and the transition matrix is partitioned depending on a threshold in the second step. A magic state receives the same input character resulting in the same next state. The transition matrix has two submatrices: one with smaller state values than the threshold, and the other is used to generate the bitmap matrix and the state list matrix. In the search phase, all elements in the second matrix are identified, and the bitmap matrix is set to 1 in case no magic state is found; otherwise, it is set to 0. The next state is inserted into the state list matrix.
Hudaib et al.’s algorithm [104] is a modification of Berry et al.’s algorithm [42]. The algorithm divides the text into two equal parts and scans the two parts by using two windows simultaneously. The left window scans the left part of the text from left to right, whereas the right window scans the right part from right to left in parallel. This process makes this algorithm suitable for parallel processors [27]. This algorithm uses the shift rule of Berry et al.’s algorithm to compute shift values during the searching phase.
The genomic-oriented rapid algorithm [105] uses the Horspool algorithm and a filtering approach based on a hash function. In the pre-processing phase, for each character c of the alphabet, the positions within the pattern at which the rightmost character is preceded by c are stored. In the searching phase, the Horspool bad character rule is implemented, in which a fast loop is used to locate the occurrence of the rightmost character of the pattern [27].
Cho [106] used the bad character rule of Horspool along with the factorial number system to compute the shift table for text search and found all the substrings with the same relative orders as pattern p within a given text t. The proposed algorithm is reported to be more efficient than KMP, with a time complexity of O(n + m log m) in the average case and O(n + m log m) in the worst case.
Pandey et al. [107] proposed a hashing with chaining-based hybrid string matching algorithm to reduce the time complexity of string matching algorithms. The proposed algorithm involves two phases: pre-processing and searching. In the pre-processing phase, the given string is divided into substrings, each of which has a size equal to that of the pattern. After division, each substring is assigned a unique integer (ASCII value), and the substrings are stored in a hash table along with their locations using a hash function. In the searching phase, the integer hash value of the pattern is calculated and compared with the integer hash values of the substrings in the hash table. If the hash values of the pattern and a substring match, then the location of the substring is returned. The proposed algorithm cannot reduce the time complexity in most cases and requires O(n − m) extra memory because it stores substrings in a hash table.
Al-Ssulami [108] proposed a hybrid algorithm for string matching called simple string matching. The proposed algorithm is a modified version of the Horspool algorithm with additional string matching conditions for scanning and matching the text (from left to right) and the pattern (from right to left). The proposed algorithm operates in two steps. In the first step (pre-processing), the pivot character in the pattern is determined by computing the character distance and its maximum safe shift. In the second step, the algorithm compares the pivot character of the pattern with characters of the text. If the pivot character matches the character in the text, then the algorithm starts matching the pattern with the text from the rightmost character until the end of the text. If the matching fails, then the hybrid Horspool shift is used for matching. If the pattern and text with equal character (b_i) length are mismatched (where 0 ≤ i < k and b_j ≠ b_{j+1}) at any given position (0 ≤ k < m or j) during the matching process, then the pattern must be shifted by exactly (k − i + 1) or (k − j + 1) positions to accelerate the searching process. The proposed algorithm achieves good performance for pattern matching on human proteins, text of
natural languages (e.g., Arabic, English, Italian, French, and Chinese), and E. coli genome. However, the efficiency of the proposed algorithm decreases when the mismatches of the patterns do not occur at the rightmost end.
In summary, we discuss and analyze different algorithms of software-based pattern string matching in the past two decades. The discussion and analysis indicate that each algorithm has its own merits and demerits in terms of suitability to applications and capability to handle complexity of problems. We summarize the above-mentioned discussion with different parameters in the following section.
IV. COMPARISON ANALYSIS OF SOFTWARE-BASED PATTERN STRING MATCHING ALGORITHMS
The literature review reveals that time complexity, limitation, and dataset are important in deciding the performance of the methods. We analyze the methods in terms of the three parameters and show the results in Tables 1-5.
On the basis of the analysis reported in Tables 1-5, we summarize the advantages and disadvantages of the methods from 1980 to 2018 in Table 6. This summary helps readers understand the strengths and weaknesses of the methods. Accordingly, a hybrid method or a unified method with advantages of the methods can be developed to address new challenges. Appropriate methods can also be easily selected depending on strengths and requirements.
As shown in Table 6, hybrid approaches are more flexible than a single approach and can be extended to solve any string matching problem. However, hybrid approaches require comprehensive knowledge of different algorithms and need to combine their features to obtain the optimal output. Prior to designing an algorithm, other factors that can affect its performance must be determined and addressed.
V. CURRENT TRENDS, APPLICATIONS, CHALLENGES AND FUTURE SCOPE
Analysis of the performance of string matching algorithms until 2018 shows that they are popular and required in several applications. Most of the algorithms are developed to improve time efficiency because matching in string matching algorithms involves complex operations. Thus, new algorithms or extensions of string matching concepts are limited to few applications and solving issues. However, the use of string matching algorithms has increased because they work well regardless of databases, scripts, and applications. In this section, we explore the current trend of string matching algorithms, new applications that require improved strengths of these algorithms, new challenges, and the possible future scope.
is plotted by using various keywords, such as the application of string matching algorithms, exact matching algorithms, and other related keywords. The citation report is created on the basis of a Web of Science tool. Our intention is to show that string matching is still vital for new concepts, such as big data and social media data. Figure 12 shows the trend or importance of string matching over time. String matching plays an important role, and an appropriate method that can be modified or used for new applications and datasets should be selected.
B. APPLICATIONS
As discussed earlier, the scope of string matching algorithms is limited to few applications. Based on the literature review, we investigate the applications where exact matching algorithms can be applied for deeper analysis. Figure 13 shows the applications, namely, multimedia, networking, forensic, and search engines. From this figure, additional applications or sub-applications can be identified in the future.
1) Multimedia Applications
Social media use multimedia information, such as texts, videos, audios, images, and different scripts. Our analysis reveals that most of the string matching algorithms focus on English text and biological data but not on different script data, video data, and image data. The reason is that most of these algorithms involve exact or approximate matching. This matching is insufficient to handle multivalued and multivariate data. Therefore, string matching algorithms should be extended to the above-mentioned applications because they require complex matching procedure [116, 117, 122].
2) Networking Applications
The rapid advancements in technology have made security a primary concern in all networks at present. The threats of hacking and intruder attacks constantly exist. To solve security issues, several methods have directly and indirectly proposed encoding-decoding and encryption-decryption to secure data. For example, most prominent methods include encryption, virtual private networks, and firewalls. Among these techniques, network intrusion detection (NID) is a new technique that is used to detect suspicious activities at the network and host level. NID systems are used to capture data packets that travel through network media, such as cables and wireless. In these cases, string matching algorithms can be investigated to verify the packets. In other words, signature-based intrusion detection and anomaly detection systems can be introduced using string matching algorithms.
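To illustrate how exact matching could support signature-based inspection, the following Python sketch scans a packet payload for byte signatures using the Horspool bad character shift discussed earlier. The signature set, the function names, and the sample payload are hypothetical and serve only as an illustration under these assumptions; a practical NID system would typically rely on a multi-pattern matcher and a maintained rule set.

```python
def horspool_find_all(text: bytes, pattern: bytes) -> list[int]:
    """Return all start positions of pattern in text (Horspool shift)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad character table: distance from the rightmost occurrence of each
    # byte (excluding the last pattern position) to the end of the pattern.
    shift = {b: m - 1 - i for i, b in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text byte aligned with the pattern's end.
        pos += shift.get(text[pos + m - 1], m)
    return hits


# Hypothetical signatures, invented for this example only.
SIGNATURES = {
    "path-traversal": b"../..",
    "shell-invocation": b"/bin/sh",
}


def scan_payload(payload: bytes) -> dict[str, list[int]]:
    """Map each matching signature name to its positions in the payload."""
    report = {}
    for name, signature in SIGNATURES.items():
        positions = horspool_find_all(payload, signature)
        if positions:
            report[name] = positions
    return report


print(scan_payload(b"GET /../../etc/passwd HTTP/1.1"))
```

Scanning one signature at a time keeps the sketch short; the cost grows with the number of signatures, which is one reason multi-pattern algorithms such as Aho-Corasick [87] are attractive in this setting.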
passport to identify persons in the centralized database of the world.
4) Search Engines
Indexing and retrieval methods may fail to retrieve actual information based on user interest through recognition of text, sound, image, or video. String matching plays a vital role in solving such issues because it does not require recognition of each character in the text or content in image or video. Instead, string matching considers the entire input as one pattern to find a match. This advantage is inherent to string matching algorithms.
Character based
Advantages: Long texts require long shifts. Thus, these methods are suitable for applications that search for long patterns in texts.
Disadvantages: The sliding window system of these methods is slower than that of other string matching approaches.
Suitable applications: Natural text.
Automata-based
Advantages: These approaches avoid backtracking (examining each character only once under constant time per character). Thus, these methods are suitable for web-related applications or real-time systems.
Disadvantages: These methods consume much time in construction, such as the construction of the flow chart and computation of the transition table. Building the automaton is also time consuming if the input symbol set is large.
Suitable applications: Security-related applications and real-time applications with optimal input size.
Bit parallel based
Advantages: Bit-parallel algorithms are extremely fast when a pattern fits in a computer word.
Disadvantages: The performance of these algorithms degrades considerably as the pattern length m grows beyond the number of bits per machine word.
Suitable applications: These methods are suitable for applications in which parallelism can easily be performed, such as plagiarism checking, spell checking, data mining, and bio-informatics.
Hashing based
Advantages: The speed of processing is high for these methods due to quadratic comparisons. However, they are suitable for pre-processed texts and hash tables with large entries.
Disadvantages: The cost of a good hash function can be significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree. These approaches are ineffective when the number of entries is small and can cause long delays due to microprocessor cache misses induced by poor locality of reference.
Suitable applications: These approaches are suitable for texts pre-processed with large table entries, such as spell checking, imaging applications, and bio-informatics. These methods can be extended to multi-dimensional pattern matching.
Hybrid based
Advantages: The combination of more than two approaches results in improved solutions.
Disadvantages: Most hybrid methods have overheads.
Suitable applications: The suitability of these methods depends on the application.
FIGURE 13. The possible new applications of string matching algorithms.
C. CHALLENGES
From the analysis of the literature review, it can be observed that numerous challenges within the domain of exact matching algorithms need to be addressed. We list the challenges as follows:
1) Refinement Of Proposed Classification
The performance of string matching algorithms varies across application areas, such as molecular biology, network intrusion detection, and text processing. Some algorithms perform well only for short patterns, others only for long patterns, and others only for patterns of average length. A string matching algorithm that performs well for English text may behave differently in DNA matching. Thus, the performance of string matching algorithms must be analyzed on the basis of different patterns and text formats. Analyzing the performance of these algorithms on the basis of different applications (e.g., DNA sequencing, fingerprint detection, and text processing) and classifying them by performance in the respective areas rather than by the methodologies used are difficult. However, these tasks help researchers implement and optimize only the algorithms pertaining to a specified area. For example, only algorithms that perform well in DNA sequencing can be targeted for optimization rather than selecting algorithms randomly.
2) Performance Analysis Using Different Encoding Techniques
Encoding is the basis for string matching algorithms. Encoding techniques have different types, such as ASCII, UTF-8, and UTF-16. The performance of different string matching algorithms can be checked using different encoding techniques. This analysis is useful for texts that require more than 1 byte for a single character, such as Arabic, Chinese, and Persian texts.
3) Analysis Tool
Many simulation, networking, and programming tools are available to help determine the behavior of a particular model or application. However, no effective tool exists for determining the performance of algorithms. Although many mathematical models are available, not everybody can efficiently determine the performance of algorithms through those models, and those models depend on their underlying hypotheses. The main parameter for determining the performance of an algorithm is execution time. However, other factors, such as the hardware and the size of the text to be searched, play a crucial role in determining the performance of algorithms. The importance of these factors for an algorithm should be determined.
4) Execution Time
Considerable overheads are associated with string matching algorithms, such as pre-processing time followed by searching time. The time and space requirements of these algorithms grow with the data size. Thus, efforts are required to optimize these algorithms for fast execution with minimum overhead.
5) Library
No authenticated library is available for determining the efficiency of string matching algorithms. Although a few libraries, such as that of Faro and Lecroq [11], are used as a smart tool, considerable effort is needed to include more of the existing algorithms. With such a library, future researchers could optimize previous algorithms without considerable effort while keeping time constraints in mind.
6) Development Of Efficient Data Structures
Different data structures are used in string matching algorithms. Some are trees and arrays based on the suffix or prefix approach. Efficient data structures that perform better than those previously used, regardless of application and data, should be developed.
7) Optimal File Size
Not all algorithms perform well across different text and pattern sizes. The performance of some algorithms decreases either linearly or exponentially. The optimal file size or number of words at which an algorithm performs efficiently should be determined. This task is challenging because a large number of available algorithms must be implemented and evaluated.
8) Benchmark Standard
Different developers implement string matching algorithms according to their own logic. If two implementations of the same algorithm are compared on a given data set, then they will behave differently depending on the platforms and logic used. In this case, some benchmark methods are needed to help researchers determine whether the algorithm developed
by the developer is working in accordance with the standard and is the correct version. This approach helps developers identify the correct version of an algorithm with ease of implementation.
D. FUTURE SCOPE
Here, we discuss the possible future scope of software-based pattern string matching algorithms.
1) Factorial Analysis Of String Matching Algorithms
The effect of different factors, such as text size, RAM, and IDE, on string matching algorithms can be determined by designing a factorial model based on factorial design. This procedure helps determine the effect of different factors on the execution time of an algorithm.
2) Implementation Of String Matching Algorithms In MapReduce Environment
MapReduce is a parallel programming paradigm used in Hadoop for big data analysis. Current string matching algorithms can be optimized for the MapReduce framework. This process reduces the execution time of these algorithms through parallel processing. The trend has already started. Athar et al. [118] developed a new algorithm for DNA sequence matching by using an MPI technique. The proposed approach uses multicore processors for parallel processing. After parallelization, the resulting DNA sequence matching algorithm shows the highest performance among the serial versions of the other algorithms [118].
3) Use Of GPU And Field Programmable Gate Array (FPGA)
A CPU is optimized for sequential processing with a few cores. A GPU comprises thousands of small, efficient cores that can handle multiple tasks simultaneously [119, 120]. A GPU is programmed using either CUDA or OpenGL. String matching algorithms are usually written in the C or C++ programming languages. Thus, another future direction is to implement string matching algorithms in GPUs and FPGAs using CUDA, OpenGL, or any other GPU/FPGA-supported language. This approach can increase the efficiency of string matching algorithms manifold.
4) Survey Of Approximate and Hardware-based Approaches
Another future work is to carry out a survey of approximate and exact matching algorithms implemented in hardware devices. The proposed work will focus on evaluating their weaknesses and strengths on the basis of a specific parameter.
5) Arabic Pattern Matching Algorithms
Arabic is the second most spoken language in the world after English [121, 122, 123]. In the Arabic language, connected and unconnected words exist, which take considerable bytes and processing time. Thus, the development of multilingual exact matching algorithms with suitable encoding techniques is a promising and interesting future work.
6) Memory Analysis Of String Matching Algorithms
Existing string matching algorithms have been analyzed in terms of time complexities and performance [66]. Analysis of the memory requirements of existing string matching algorithms on the heap during runtime is an interesting topic. Optimization based on memory consumption on the heap during runtime can be a future research area as well.
7) Other Future Related Works
Other possible future works related to string matching include the classification of multiple-pattern algorithms by application with performance analysis, and a survey on image-based matching algorithms with a focus on the time complexities of the algorithms and possible areas of application. Determining whether a particular algorithm can be used in multiple applications is also an interesting endeavor. The need to propose new algorithms even if present hardware can solve time complexity issues should be supported as well.
VI. CONCLUSION
The field of string matching is vast due to the development of numerous algorithms, and studying the methodology, complexity, and limitations of all those algorithms is a tedious task. This work focuses only on software-based pattern string matching algorithms and their applications. However, other categories of string matching algorithms can also be explored. In this work, we analyze more than 50 string matching methods in terms of strengths, weaknesses, and efficiency with respect to applications, which can help future researchers identify suitable string matching algorithms depending on their application and the complexity of the problems. On the basis of the analysis, we compare the string matching algorithms of each category based on time complexity, limitations, and databases. Furthermore, we identify new challenges, applications, and directions to expand the scope of string matching algorithms. To the best of our knowledge, this comprehensive survey on single-pattern exact string matching algorithms is the first to discuss future directions, challenges, possible applications, and new taxonomies. We plan to extend the review to approximate string matching algorithms in detail by considering implementation issues, real-time applications, and future vector-based matching algorithms.
REFERENCES
[1] A. AbdulRazzaq, Rashid, N., Hasan, A., Abu-Hashem, M., "The exact string matching algorithms efficiency review," presented at the 3rd World Conference on Innovation and Computer Sciences, 2013. Available: http://www.world-education-center.org/index.php/P-ITCS/article/view/2668/2228
[2] S. Hakak, A. Kamsin, P. Shivakumara, M. Y. Idna Idris, and G. A. Gilkar, "A new split based searching for exact pattern matching for natural texts," PloS one 13, no. 7 (2018): e0200912.
[3] K. Hendawi, Baharudin, A., "String Matching Algorithms (SMAs): Survey & Empirical Analysis," Journal of Computer Sciences and Management, pp. 2637-2644, 2013.
[4] W. M. Szeto and M. H. Wong, "Stream segregation algorithm for pattern matching in polyphonic music databases," Multimedia Tools and Applications, vol. 30, no. 1, pp. 109-127, 2006.
[5] S. Shrivastav, S. Kumar, and K. Kumar, "Towards an ontology based framework for searching multimedia contents on the web," Multimedia Tools and Applications, pp. 1-30, 2017.
[6] Y.-H. Kim, H.-J. Kwon, J.-G. Kang, and H. Chang, "The study on content based multimedia data retrieval system," Multimedia Tools and Applications, vol. 57, no. 2, pp. 393-405, 2012.
[7] G. Navarro, "A Guided Tour to Approximate String Matching," ACM Computing Surveys, pp. 31-88, 2001.
[8] M. Boyer and Smith, "Experiments with a very fast substring search algorithm," Software: Practice and Experience, vol. 21, no. 10, pp. 1065-1074, 1991.
[9] P. D. Michailidis and K. G. Margaritis, "On-line string matching algorithms: survey and experimental results," International Journal of Computer Mathematics, vol. 76, no. 4, pp. 411-434, Feb 2001.
[10] C. Charras, and Lecroq, T., Handbook of exact string matching algorithms. H. King’s College Publications, 2004.
[11] S. Faro and T. Lecroq, "The Exact Online String Matching Problem: A Review of the Most Recent Results," ACM Computing Surveys, vol. 45, no. 2, Feb 2013.
[12] G. F. Ahmed and N. Khare, "Hardware based String Matching Algorithms: A Survey," International Journal of Computer Applications, vol. 88, no. 11, pp. 16-19, 2014.
[13] L. Otero-Cerdeira, F. J. Rodriguez-Martinez, and A. Gómez-Rodriguez, "Ontology matching: A literature review," Expert Systems with Applications, vol. 42, no. 2, pp. 949-971, 2015.
[14] S. Wu and U. Manber, "Fast text searching: allowing errors," Communications of the ACM, vol. 35, no. 10, pp. 83-91, Oct 1992.
[15] D. Sankoff, Common Subsequences and Monotone Subsequences. Addison-Wesley, 1983.
[16] V. Levenshtein, "Binary codes capable of correcting spurious insertions and deletions of ones," in Probl. Inf. Transmission, 1965, p. 196.
[17] V. SaiKrishna, and N. Khare, "String Matching and its Applications in Diversified Fields," International Journal of Computer Science Issues (IJCSI), pp. 219-226, 2012.
[18] M. Farach-Colton, G. M. Landau, S. C. Sahinalp, and D. Tsur, "Optimal Spaced Seeds for Faster Approximate String Matching," in Automata, Languages and Programming: 32nd International Colloquium, ICALP 2005, Lisbon, Portugal, July 11-15, 2005. Proceedings, L. Caires, G. F. Italiano, L. Monteiro, C. Palamidessi, and M. Yung, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 1251-1262.
[19] J. Karkkainen and J. C. Na, "Faster Filters for Approximate String Matching," in 2007 Proceedings of the Ninth Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 84-90.
[20] G. Kucherov, L. Noe, and M. Roytberg, "Multiseed lossless filtration," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 1, pp. 51-61, 2005.
[21] G. Kucherov, K. Salikhov, and D. Tsur, "Approximate string matching using a bidirectional index," Theoretical Computer Science, vol. 638, pp. 145-158, 2016.
[22] B. Yates and G. Navarro, "A hybrid indexing method for approximate string matching," Journal of Discrete Algorithms, Special Issue on String matching Patterns, pp. 1-35, 2001.
[23] D. Belazzougui, F. Cunial, J. Karkkainen, and V. Makinen, "Versatile Succinct Representations of the Bidirectional Burrows-Wheeler Transform," in Algorithms - ESA 2013: 21st Annual European Symposium, Sophia Antipolis, France, September 2-4, 2013. Proceedings, H. L. Bodlaender and G. F. Italiano, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 133-144.
[24] T. W. Lam, R. Li, A. Tam, S. Wong, E. Wu, and S. M. Yiu, "High Throughput Short Read Alignment via Bi-directional BWT," in 2009 IEEE International Conference on Bioinformatics and Biomedicine, 2009, pp. 31-36.
[25] L. M. Russo, G. Navarro, A. Oliveira, and P. Morales, "Approximate String Matching with Compressed Indexes," Algorithms, vol. 2, no. 3, p. 1105, 2009.
[26] T. Schnattinger, E. Ohlebusch, and S. Gog, "Bidirectional search in a string with wavelet trees and bidirectional matching statistics," Information and Computation, vol. 213, pp. 13-22, 2012.
[27] A. A. AbdulRazzaq, Rashid, N. A. A., Hasan, A. A., Abu-Hashem, M. A, "The exact string matching algorithms efficiency review," Global Journal on Technology, pp. 576-589, 2013.
[28] V. Alfred, "Algorithms for finding patterns in strings," Algorithms and Complexity, vol. 1, 2014.
[29] D. E. Knuth, Morris, Jr, J. H., Pratt, V. R., "Fast pattern matching in strings," SIAM Journal on Computing, pp. 323-350, 1977.
[30] D. Suleiman, "Enhanced Berry Ravindran Pattern Matching Algorithm (EBR)," Life Science Journal, vol. 11, no. 7, 2014.
[31] D. Suleiman, A. Hudaib, A. Al-Anani, R. Al-Khalid, and M. Itriq, "ERS-A Algorithm for Pattern Matching," Middle East Journal of Scientific Research, vol. 15, no. 7, pp. 1067-1075, 2013.
[32] S. Arudchutha, T. Nishanthy, and R. G. Ragel, "String matching with multicore CPUs: Performing better with the Aho-Corasick algorithm," in Industrial and Information Systems (ICIIS), 2013 8th IEEE International Conference on, 2013, pp. 231-236: IEEE.
[33] E. Rafiq, M. W. El-Kharashi, and Gebali, F, "A fast string search algorithm for deep packet classification," Computer Communications, pp. 1524-1538, 2004.
[34] R. S. Boyer, Moore, J. S, "A fast string searching algorithm," Communications of the ACM, pp. 762-772, 1977.
[35] M. A. Hernandez, "Taxonomy of Some Right-to-Left String-Matching Algorithms," in WFLP’09 Proceedings of the 18th international conference on Functional and Constraint Logic Programming, Heidelberg, 2010, pp. 79-95: Springer.
[36] G. Baeza-Yates, G.H., "A new approach to text searching," Communications of the Association for Computing Machinery, pp. 74-82, 1992.
[37] D. Breslauer, L. Colussi, and L. Toniolo, "Tight comparison bounds for the string prefix-matching problem," Information Processing Letters, vol. 47, no. 1, pp. 51-57, Aug 9 1993.
[38] R. N. Horspool, "Practical fast searching in strings," Software: Practice and Experience, vol. 10, no. 6, pp. 501-506, 1980.
[39] A. Apostolico and R. Giancarlo, "The Boyer-Moore-Galil String Searching Strategies Revisited," SIAM Journal on Computing, vol. 15, no. 1, pp. 98-105, Feb 1986.
[40] T. Raita, "Tuning the boyer-moore-horspool string searching algorithm," Software: Practice and Experience, pp. 879-884, 1992.
[41] M. Crochemore, Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., Rytter, W., "Speeding up two string-matching algorithms," Algorithmica, pp. 247-267, 1994.
[42] T. Berry, Ravindran, S., "A Fast String Matching Algorithm and Experimental Results," Stringology, pp. 16-28, 1999.
[43] M. K. Ahmad, "An Enhanced Boyer-Moore Algorithm (Doctoral dissertation)," Middle East University, 2014.
[44] D. M. Sunday, "A very fast substring search algorithm," Communications of the ACM, pp. 132-142, 1990.
[45] L. Colussi, "Correctness and efficiency of pattern matching algorithms," Information and Computation, vol. 95, no. 2, pp. 225-251, Dec 1991.
[46] H. Xian-feng, Y. Yu-bao, and L. Xia, "Hybrid pattern-matching algorithm based on BM-KMP algorithm," in Advanced Computer Theory and Engineering (ICACTE), 2010 3rd International Conference on, vol. 5, pp. V5-310. IEEE, 2010.
[47] Z. Cao, Y. Zhenzhen, and L. Lihua, "A fast string matching algorithm based on lowlight characters in the pattern," in Advanced Computational Intelligence (ICACI), 2015 Seventh International Conference on, pp. 179-182. IEEE, 2015.
[48] S. Hakak, A. Kamsin, P. Shivakumara, M. Y. Idna Idris, and G. A. Gilkar, "A new split based searching for exact pattern matching for natural texts," PloS one 13, no. 7 (2018): e0200912.
[49] S. Hakak, K. Amirrudin, P. Shivakumara, and M. Y. I. Idris, "Partition-Based Pattern Matching Approach for Efficient Retrieval Of Arabic Text," Malaysian Journal of Computer Science 31, no. 3 (2018): 200-209.
[50] R. M. Karp and M. O. Rabin, "Efficient Randomized Pattern-Matching Algorithms," IBM Journal of Research and Development, vol. 31, no. 2, pp. 249-260, Mar 1987.
[51] W. S. Dorn, "Generalizations of Horner’s rule for polynomial evaluation," IBM Journal of Research and Development, pp. 239-245, 1962.
[52] J. Lee, "Analysis of Fundamental Exact and Inexact Pattern Matching Algorithms," in BIOC 218, Stanford, 2004, pp. 1-15.
[53] J. S. Fide, "A Survey of String Matching Approaches in Hardware," Technical Report TR SPDS 06-01, USA 2008.
[54] A. Abdulrazzaq, Abdul Rashid, N., Hamdani, H., Ghadban, R., Mahmood, A. W., "Influenced Factors on Computation Among Quick Search, Two-Way and Karp-Rabin Algorithms," in Proceedings of the 3rd International Conference on Informatics and Technology (Informatics ’09), 2009, pp. 81-87.
[55] T. Lecroq, "Fast exact string matching algorithms," Information Processing Letters, vol. 102, no. 6, pp. 229-235, Jun 15 2007.
[56] S. Wu and U. Manber, "A fast algorithm for multi-pattern searching," Department of Computer Science, University of Arizona, Tucson, AZ, Report TR-94-17, 1994.
[57] P. Kalsi, and J. Tarhio, "Comparison of exact string matching algorithms for biological sequences," in Proceedings of the Second International Conference on Bioinformatics Research and Development, BIRD, 2008.
[58] S. Kim, Kim, Y., "A fast multiple string pattern matching algorithm," in Proceedings of 17th AoM/IAoM Conference on Computer Science, 1999, pp. 44-49.
[59] F. Simone, "A very fast string matching algorithm based on condensed alphabets," in International Conference on Algorithmic Applications in Management, pp. 65-76. Springer, Cham, 2016.
[60] W. Yang, "Mealy machines are a better model of lexical analyzers," Computer Languages Journal, pp. 27-38, 1996.
[61] R. Navarro, M. A, "Bit-Parallel Approach to Suffix Automata: Fast Extended String Matching," in Proc. of the 9th Annual Symposium on Combinatorial Pattern Matching, Berlin, 1998, pp. 14-33: Springer-Verlag.
[62] K. Rasool, N., "Parallelization of KMP String Matching Algorithm on Different SIMD architectures: MultiCore and GPGPU's," International Journal of Computer Applications, pp. 26-28, 2012.
[63] J. A. Joseph, R. Korah, S. Salivahanan, (2018). Efficient String Matching FPGA for speed up Network Intrusion Detection. Appl. Math, 12(2), 397-404.
[64] M. Aldwairi, Y. Flaifel, K. Mhaidat, (2018). Efficient wu-manber pattern matching hardware for intrusion and malware detection. In 2018 International Conference on Electrical, Electronics, Computers, Communication, Mechanical and Computing (EECCMC). Tamil Nadu, Vellore, India: IEEE.
[65] X. Wang, D. Pao, (2018). Memory-Based Architecture for Multicharacter Aho-Corasick String Matching. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(1), 143-154.
[66] S. Hakak, A. Kamsin, P. Shivakumara, O. Tayan, M. Y. I. Idris, G. amin Gilkar, (2018). An Efficient Text Representation for Searching and Retrieving Classical Diacritical Arabic Text. Procedia Computer Science, 142, 150-157.
[67] B. Commentz-Walter, "A string matching algorithm fast on the average," Springer, pp. 118-132, 1979.
[68] C. Allauzen, Crochemore, M., Raffinot, M., "Factor oracle: A new structure for pattern matching," in SOFSEM'99, 1999, pp. 295-310: Springer.
[69] R. Allauzen, "Simple optimal string matching algorithm," Algorithms, pp. 102-116, 2000.
[70] L. He, B. Fang, and J. Sui, "The wide window string matching algorithm," Theoretical Computer Science, vol. 332, no. 1-3, pp. 391-404, Feb 28 2005.
[71] C. Liu, Wang, Y., Liu, D., and Li, D., "Two improved single pattern matching algorithms," in ICAT Workshops, Hangzhou, China, 2006, pp. 419-422: IEEE Computer Society.
[72] S. Faro and T. Lecroq, "Efficient variants of the Backward-Oracle-Matching algorithm," in Proceedings of the Prague Stringology Conference, Czech Republic, 2008, pp. 146-160: Czech Technical University.
[73] H. Fan, Yao, N., and Ma, H., "Fast variants of the backward-oracle-marching algorithm," in Fourth International Conference on Internet Computing for Science and Engineering, Washington, DC, 2009: IEEE Computer Society.
[74] W. Masaki, I. Hasuo, and K. Suenaga, "Efficient online timed pattern matching by automata-based skipping," in International Conference on Formal Modeling and Analysis of Timed Systems, pp. 224-243. Springer, Cham, 2017.
[75] F. Hongbo, S. Shupeng, Z Jing, and D. Li, "Suffix Type String Matching Algorithms Based on Multi-windows and Integer Comparison," in International Conference on Information and Communications Security, pp. 414-420. Springer, Cham, 2015.
[76] S. Faro and T. Lecroq, "The exact online string matching problem: A review of the most recent results," ACM Computing Surveys (CSUR), vol. 45, no. 2, 2013.
[77] Y. Baeza, R. & Gonnet, "A new approach to text searching," Communications of the ACM, pp. 74-82, 1992.
[78] G. Fredriksson, S, "Practical and optimal string matching," in Proceedings of the International Symposium on String Processing and Information Retrieval SPIRE, 2005.
[79] Intel.com. (2015, 5th April). XML Parsing Accelerator with Intel Streaming SIMD Extensions 4 (Intel-SSE4). Available: https://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4
[80] H. Peltola, "Alternative algorithms for bit-parallel string matching," in Proceedings of the 10th International Symposium on String Processing and Information Retrieval SPIRE, 2003.
[81] K. Ulekci and M., "Filter based fast matching of long patterns by using SIMD instructions," in Proceedings of the Prague Stringology Conference, Prague, Czech Republic, 2009.
[82] K. Ulekci and O. M, "A method to overcome computer word size limitation in bit-parallel pattern matching," in Proceedings of the 19th International Symposium on Algorithms and Computation, ISAAC, 2008.
[83] Peltola, H., Tarhio, J. (2003, October). Alternative algorithms for bit-parallel string matching. In International Symposium on String Processing and Information Retrieval (pp. 80-93). Springer, Berlin, Heidelberg.
[84] G. Zhang, E. Zhu, L. Mao, and M. Yin, "A bit-parallel exact string matching algorithm for small alphabet," in Frontiers in Algorithmics: Springer, 2009, pp. 336-345.
[85] F. J. Franek, C.G. and Smyth, W.F., "A simple fast hybrid pattern matching algorithm," Journal of Discrete Algorithms, pp. 682-695, 2007.
[86] M. Crochemore, A. Czumaj, L. Gąsieniec, T. Lecroq, W. Plandowski, and W. Rytter, "Fast practical multi-pattern matching," Information Processing Letters, vol. 71, no. 3-4, pp. 107-113, Aug 27 1999.
[87] M. C. A.V. Aho, "Efficient string matching: An aid to bibliographic search," Comm. ACM, pp. 333-340, 1975.
[88] G. Navarro, Nrgrep: A fast and flexible pattern matching tool, Tech. Rep. TR/DCC-2000-3, 2000. Dept. of Computer Science, Univ. of Chile, Aug. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/nrgrep.ps.gz.
[89] H. T. L. Wuu, Lu, H. T., & Yang, W., "A simple tree pattern-matching algorithm," in Proceedings of the Workshop on Algorithms and Theory of Computation, 2000.
[90] H. K. B. Yuebin, "New string matching technology for network security," in 17th International Conference on Advanced Information Networking and Applications, AINA 2003, pp. 27-29.
[91] M. Ahmed, M. Kaykobad, and R. A. Chowdhury, "A New String Matching Algorithm," International Journal of Computer Mathematics, vol. 80, no. 7, pp. 825-834, Jul 2003.
[92] G. Boyer Moore, "The Boyer-Moore-Galil string searching strategies revisited," SIAM J. Comput., pp. 89-105, 1986.
[93] D. Cantone and S. Faro, "Fast-search: A new efficient variant of the Boyer-Moore string matching algorithm," Experimental and Efficient Algorithms, Proceedings, vol. 2647, pp. 47-58, 2003.
[94] S. F. D. Cantone, "Searching for a substring with constant extra-space complexity," in Proc. of Third International Conference on Fun with algorithms, 2004, pp. 118-131.
[95] J. T. H. Peltola, "Alternative algorithms for bit-parallel string matching," in Proceedings of the 10th International Symposium on String Processing and Information Retrieval SPIRE’03, 2003.
[96] G. Navarro and M. Raffinot, "A bit-parallel approach to suffix automata: Fast extended string matching," Combinatorial Pattern Matching, vol. 1448, pp. 14-33, 1998.
[97] B. U. J. Holub, "Fast variants of bit parallel approach to suffix automata," in Second Haifa Annual International Stringology Research Workshop of the Israeli Science Foundation, 2005.
[98] M. E. Nebel, "Fast string matching by using probabilities: on an optimal mismatch variant of Horspool’s algorithm," Theoretical Computer Science, pp. 329-343, 2006.
[99] S. S. Sheik, S. K. Aggarwal, A. Poddar, N. Balakrishnan, and K. Sekar, "A FAST pattern matching algorithm," J Chem Inf Comput Sci, vol. 44, no. 4, pp. 1251-6, Jul-Aug 2004.
[100] R. Thathoo, A. Virmani, S. S. Lakshmi, N. Balakrishnan, and K. Sekar, "TVSBS: A fast exact pattern matching algorithm for biological sequences," Current Science, vol. 91, no. 1, pp. 47-53, Jul 10 2006.
[101] F. Franek, Jennings, C. G., and Smyth, W. F., "A simple fast hybrid pattern-matching algorithm," J. Discret. Algorithms, pp. 682-695, 2007.
[102] Z. A. Alqadi, Aqel, M., & El Emary, I. M., "Multiple skip Multiple pattern matching algorithm (MSMPMA)," IAENG International Journal of Computer Science, pp. 14-20, 2007.
[103] N. Huang, Y. Chu, C. Hsieh, C.-H. Tsai, and Y.-J. Tzang, "A Deterministic Cost-effective String Matching Algorithm for Network Intrusion Detection System," in Communications, 2007. ICC ’07. IEEE International Conference, 2007, pp. 1292-1297.
[104] A. Hudaib, D. Suleiman, M. Itriq, and A. Al-anani, "A fast pattern matching algorithm with two sliding windows (TSW)," J. Comput. Sci, pp. 393-401, 2008.
[105] S. Deusdado and P. Carvalho, "GRASPm: an efficient algorithm for exact pattern-matching in genomic sequences," Int J Bioinform Res Appl, vol. 5, no. 4, pp. 385-401, 2009.
[106] S. Cho, J. C. Na, K. Park, and J. S. Sim, "A fast algorithm for order-preserving pattern matching," Information Processing Letters, vol. 115, no. 2, pp. 397-402, Feb 2015.
[107] P. Shivendra Kumar, H. K. Tiwari, and P. Tripathi, "Hybrid approach to reduce time complexity of string matching algorithm using hashing with chaining," in Proceedings of International Conference on ICT for Sustainable Development, pp. 185-193. Springer, Singapore, 2016.
[108] A. M. Al-Ssulami, "Hybrid string matching algorithm with a pivot," Journal of Information Science 41, no. 1 (2015): 82-88.
[109] S. Faro, Lecroq, T., "The exact string matching problem: a comprehensive experimental evaluation," 2010.
[110] Ersin, "A Research into String Matching Algorithms and a new string matching algorithm," Master Thesis, Department of Computer Engineering, Trakya University, Turkey, 2008.
[111] P. P. Kalsi, H. and Tarhio, J., "Exact string matching algorithms for biological sequences," in Proc. BIRD 2008, 2nd International Conference on Bioinformatics Research and Development, Communications in Computer and Information Science, 2008: Springer.
[112] T. Lecroq, "Experimental results on string matching algorithms," Software: Practice and Experience, vol. 25, no. 7, pp. 727-765, Jul 1995.
[113] H. Zhang, "Parallelization of software based intrusion detection system," Master thesis, University of Canterbury, New Zealand, 2011.
[114] K. Kambatla, G. Kollias, V. Kumar, and A. Grama, "Trends in big data analytics," Journal of Parallel and Distributed Computing, vol. 74, no. 7, pp. 2561-2573, Jul 2014.
[115] W. O. Science. (2016). Web of Science. Available: http://isiknowledge.com/wos
[116] S. Hakak, A. Kamsin, O. Tayan, M. Y. I. Idris, G. A. Gilkar, Approaches for preserving content integrity of sensitive online Arabic content: A survey and research challenges. Information Processing Management, 2017.
[117] S. Hakak, A. Kamsin, S. Palaiahnakote, O. Tayan, M. Y. I. Idris, K. Z. Abukhir, Residual-based approach for authenticating pattern of multi-style diacritical Arabic texts. PloS one, 13(6), 2018, e0198284.
AMIRRUDIN KAMSIN is a Senior Lecturer at the Faculty of Computer Science and Information Technology, University of Malaya, Malaysia. He received his BIT (Management) in 2001 and MSc in Computer Animation in 2002 from University of Malaya and Bournemouth University, UK, respectively. He obtained his PhD from University College London (UCL) in 2014. His research areas include human computer interaction (HCI), authentication systems, e-learning, mobile applications, serious games, augmented reality, and mobile health services.
PALAIAHNAKOTE SHIVAKUMARA received B.Sc., M.Sc., M.Sc Technology by research, and Ph.D. degrees from the University of Mysore, Mysore, Karnataka, India in 1995, 1999, 2001, and 2005, all in computer science. Currently, he is working as a Senior Lecturer at University of Malaya (UM), Kuala Lumpur, Malaysia. He worked as a Research Fellow at the National University of Singapore, Singapore from 2005-2007 and 2008-2013. Besides, he worked as Research Consultant at Nanyang Technological University, Singapore for the period of one year from 2007-2008. Based on his work, he has published more than 190 research papers in national and international conferences and journals. He has been serving as Associate Editor for Transactions on Asian Language Information Processing (TALLIP). Further, he was the recipient of a prestigious "Dynamic Indian of the Millennium" award by KG foundation, India for his contributions to the computer science field. He won the "Top Reviewer" award from Pattern Recognition Letters. He has several international collaborators, namely, Nanjing University, China, Hohai University, China, Shantou University, China, Indian Statistical Institute, Kolkata, India, University of Essex, UK, Assiut University, Egypt, University of Technology Sydney, Australia. He has been serving as chair at different levels for international conferences, namely, ICDAR, DAS, ICFHR, ACPR, etc. His
[118] A. Abdulrazzaq, K, N. Rashid, A, and A. Alezzi, H,A, "Parallel Process- area of research includes video text understanding, document analysis,
ing of Hybrid Exact String," 2013. image processing and OCR related.
[119] Quora.com, "What-is-the-difference-among-CPU-GPU-APU-FPGA-
DSP-and-Intel-MIC," in www.quora.co, ed, 2015.
[120] C.-L. Hung, C.-Y. Lin, and H.-H. Wang, "An efficient parallel-network
packet pattern-matching approach using GPUs," (in English), Journal of
Systems Architecture, vol. 60, no. 5, pp. 431-439, May 2014. GULSHAN AMIN received the bachelor’s and
[121] O. Tayan and Y. M. Alginahi, "A review of recent advances on multi- master’s degrees from India. She is currently
media watermarking security and design implications for digital Quran a Lecturer with the Department of Computer
computing," in Biometrics and Security Technologies (ISBAST), 2014
Science and Information Technology, Faculty of
International Symposium on, 2014, pp. 304-309: IEEE.
Computer Science and Information Technology,
[122] S. Hakak, A. Kamsin, O. Tayan, M. Y. Idna Idris, A. Gani, and S.
Zerdoumi, "Preserving Content Integrity of Digital Holy Quran: Survey
Shaqra University, Saudi Arabia. She has vast
and Open Challenges," IEEE Access, vol. PP, no. 99, pp. 1-1, 2017. teaching experience due to having worked in vari-
[123] S. Zerdoumi, A. Q. M. Sabri, A. Kamsin, I. A. T. Hashem, A. Gani, S. ous educational institutions locally and abroad.
Hakak, V. Chang, Image pattern recognition in big data: taxonomy and
open challenges: survey. Multimedia Tools and Applications, 1-31, 2017.