Logical Indexing
Abstract—Searching performance is very important in a modern world built on advanced technology. String matching is one of the most commonly applied kinds of searching algorithm. A string, which contains a sequence of characters, is the simplest representation of complex real-life data such as a search engine database, fingerprint minutiae, or a DNA sequence. The best-known string matching algorithms today are the Knuth-Morris-Pratt (KMP) algorithm and the Boyer-Moore (BM) algorithm, but neither is efficient in some cases. We introduce a new variation of string matching algorithm based on Logical-Indexing (LI), which is more efficient than those algorithms. The Logical-Indexing algorithm implements new functions and defines jumping rules that differ from those of the other algorithms. The indexes of the text and the pattern are used to reduce the number of comparisons: an index lets us analyze the outcome of a comparison without directly comparing each character in the pattern. Theoretically, we have proven that the Logical-Indexing algorithm can skip more characters than the KMP and BM algorithms, by analyzing the weaknesses of KMP and BM and implementing better solutions in LI. Furthermore, experiments conducted on various combinations of pattern matching cases demonstrate that the average number of direct comparisons made by LI is smaller than that made by KMP and BM.

Keywords—LI algorithm; full-jumping; margin-jumping; matched-jumping; paired-character; occurrenceTable

I. INTRODUCTION
Searching algorithms are routinely applied in various computer applications, such as finding part of a DNA sequence in bioinformatics. Consequently, searching performance is an important factor in computer science. There are two main types of string matching algorithm, distinguished by the starting index of comparison: the first is the Knuth-Morris-Pratt (KMP) algorithm, and the second is the Boyer-Moore (BM) algorithm. Each has its own strengths and weaknesses. We also study and analyze other algorithms, such as the Sheik-Sumit-Anindya-Balakrishnan-Sekar (SSABS) algorithm and the Fast Algorithm for Approximate String Matching (FAAST).

In this paper, we present a new method for searching data regardless of its size. We call it the Logical-Indexing (LI) algorithm. The LI algorithm is applicable not only to small alphabets but also to larger ones. It has four core functions: createCharTable(), createOccurrenceTable(), createMargin(), and findPairedChar(). When measuring LI's performance, we analyze and compare only the BM and KMP algorithms with LI, because the other variants are essentially represented by those two.

The organization of this paper is as follows. In the next section, we review existing algorithms. In Section III, we define and describe the LI algorithm, with term definitions, an informal explanation, the main algorithm's pseudo code, the processing stages, and a complexity analysis. In Section IV, we give empirical evidence for our algorithm, based on test results compared with the BM and KMP algorithms. The results and analysis are given in Section V. Finally, the conclusions and future work are presented in Section VI.

II. REVIEW OF EXISTING ALGORITHMS

A. Knuth-Morris-Pratt (KMP) Algorithm
The common-sense approach is to try searching at every starting position of the text, abandoning the search as soon as an incorrect character is found [1]. The Knuth-Morris-Pratt algorithm is not efficient because the length of the pattern's shift depends on the length of the matched-string. In other words, if we have already compared many characters and then find a mismatch, we may obtain a long shift, but only after the long comparison that we have already done, and we must then start comparing again. Therefore, the goal of minimizing comparisons by making long shifts is not achieved efficiently by this algorithm.

B. Boyer-Moore (BM) Algorithm
The Boyer-Moore algorithm is one of the most efficient algorithms available in the literature. The basic idea behind it is that more information is gained by matching the pattern from the right than from the left [2]. The algorithm uses two main techniques: the looking-glass technique and the character-jump technique. The looking-glass technique finds the pattern in the text by matching the pattern from its right side. The character-jump technique shifts the index according to certain conditions; the shift is determined by delta1 or delta2.

The algorithm has been extended since its first version. The most recent version uses two precomputed functions to shift the window to the right, called the good-suffix shift and the bad-character shift.

However, the BM algorithm uses the last-occurrence index less effectively than LI, because LI uses it twice at every
mismatch instead of once. Thus, the LI algorithm is more likely to make a longer shift, because the probability of finding two consecutive characters (a paired-character) is lower than that of finding one character. A detailed explanation of the paired-character is given in the next section.

C. Sheik-Sumit-Anindya-Balakrishnan-Sekar (SSABS) Algorithm
The SSABS algorithm orders its comparisons by first comparing the last character of the window and the pattern and then, after a match, comparing the first character of the window and the pattern. By doing so, an initial resemblance can be established between the pattern and the window, and the remaining characters are compared from right to left until a complete match or a mismatch occurs [3]. This algorithm does not implement any additional rule or logic; it only changes the order of BM's comparisons. Therefore, it also uses the last-occurrence index less effectively than LI, which uses it twice at every mismatch instead of once. Its weakness is similar to that of the BM algorithm.

D. Fast Algorithm for Approximate String Matching (FAAST)
The FAAST algorithm theoretically improves on the BM algorithm in some cases, especially for small alphabets. In brief, the algorithm requires at least x matches in the last k + x characters when calculating shift distances, where x is a small integer (typically 2 or 3 in the authors' experiments). However, as the alphabet size and the value of x grow, the authors observe that the time and memory required for the shift distance calculation increase quickly, which in turn degrades the performance of FAAST [4].

III. THE PROPOSED ALGORITHM

A. Terms Definition
The following terms are used to explain the algorithm. Assume S is a string of size M and P is a string of size N.

S = x_0 x_1 ... x_M
P = y_0 y_1 ... y_N
N < M

• Text is the string in which we want to find a pattern. In this case, the text is the value of S.
• Pattern is a sequence of characters that is shorter than the text. The pattern is the value of P. The pattern's head is y_0 and the pattern's back is y_N.
• Prefix of S is a substring S[1 ... k-1] and suffix of S is a substring S[k-1 ... 1] (k is any index between 1 and m).
• Index is a representation of a position, which runs from zero to the string's length minus one. There are three kinds of index: 1) the pattern's index is an index relative to the pattern; 2) the text's index is an index relative to the text; and 3) the comparison's index is the index of the current comparison between text and pattern. This index is derived from the text's index, since the index of the pattern moves along relative to the text's index.

Assume S is a text, P is a pattern, and n is the comparison number. After the third comparison, a mismatch occurs at the fourth index, and the pattern's indexes are divided into the parts described below (distinguished by color in the original figure):

S = "bacxybaabababaxbaacaabacxaba"
P = "bacxaba"
n =      321

• Matched-string ("ba") is the substring that has already been matched before the mismatch occurs;
• Unmatched-string ("bacxa") is the substring that has not yet been matched when the mismatch occurs;
• Paired-character ("xy") is the substring of the text that consists of two characters, or of one character if the mismatch occurs at the text's first index. Given the mismatch's index relative to the text's index:
  the first character = S[mismatch's index - 1] and
  the second character = S[mismatch's index];
• Margin-string is the longest string that is both a suffix of the matched-string and a prefix of the pattern. In this case, margin-string = "ba". The front-margin is "ba" (purple) and the back-margin is "ba" (blue);
• One phase is the period during which the pattern is not moved or shifted while comparisons take place. When the pattern is shifted, we continue to the next phase of comparison.
• In-pair is the condition in which the paired-character can be found in the unmatched-string.

B. Informal Description
This algorithm is called the Logical-Indexing string matching algorithm because it optimizes the use of indexes by adding logical processing that minimizes the number of direct comparisons. The Logical-Indexing algorithm improves on the Boyer-Moore algorithm by applying new rules that reduce the number of comparisons. When a mismatch occurs, we analyze two characters of the text instead of one. The purpose is to obtain a bigger pattern shift that skips useless characters. Moreover, we use a different combination of precomputed functions to improve on BM's precomputed functions.

The matching process starts by aligning the first index of the pattern with the first index of the text. The algorithm starts the comparison from the last character of the pattern, which is compared with the character of the text at the same index. If the characters match, the algorithm decreases the comparison's index by one and continues comparing characters of the pattern and the text. If a mismatch occurs, there are three possibilities: matched-jumping, margin-jumping, and full-jumping. The algorithm has the following characteristics:

1) Smart Finding: When a mismatch occurs, the paired-character is examined to determine whether it is in-pair or not. There are three possible outcomes of this process: matched-jumping, full-jumping, and margin-jumping. If the paired-character we are looking for is in the unmatched-string, the algorithm continues with matched-jumping. Otherwise, it proceeds according to the margin-jumping or full-jumping condition. The idea of smart finding is to find the paired-character in the unmatched-string without direct comparison, since precomputed values can be used instead.

2) Matched Jumping: When the paired-character is in-pair (it can be found in the unmatched-string), the pattern moves until the second index of the paired-character in the pattern is aligned with the index of the mismatch. As a special case, if the paired-character is not in-pair but its second character is the same as the pattern's first character, we move the pattern until its first index reaches the mismatch's index (the mismatch's index is relative to the text's index). We call this last condition single-match.

3) Margin Jumping: When the conditions for matched-jumping cannot be fulfilled and the margin is not zero, the pattern is moved until its front-margin aligns with the previous position of the back-margin.

4) Full Jumping: When the condition for margin-jumping cannot be applied, the pattern is moved until its first index is placed next to the previous position of its last index. In other words, the pattern is moved to the right by its own length.

5) Avoid Double Comparison: When the algorithm has already compared some characters and all of them matched, it should not compare them again in the next phase. However, this feature is available only under the matched-jumping and margin-jumping conditions. We save the mismatch's index when one of these conditions occurs. First, when margin-jumping occurs, this can skip M-1 characters in the best case. Second, when matched-jumping occurs, this can skip 2 characters in the normal case or one character under the single-match condition. This is very useful when LI is applied in a complex system, because the cost of a direct comparison can be very high.

C. Preprocessing Stage
The first step before doing any comparison is to build some tables that will be useful in the matching stage. The important variables are marginTable, occurrenceTable, and charTable. The first variable is marginTable. It contains the length of the margin-string for every index of the pattern. Its type is a vector or array of integers. Suppose that we have a pattern P = "bacxaba"; Table I shows the pattern's indexes and Table II is the pattern's marginTable.

TABLE I. PATTERN
j     0  1  2  3  4  5  6
P[j]  b  a  c  x  a  b  a

TABLE II. MARGIN TABLE
j               0  1  2  3  4  5  6
marginTable[j]  2  2  2  2  2  0  0

Text           = bacxybaabababaxbaacaabacxaba
Matched-string = bacxaba
Pattern        =      bacxaba
Fig. 1. Illustration of margin "ba".

We use the createMargin() function to calculate the marginTable values. The technique is to find the longest suffix of the matched-string that is the same as a prefix of the pattern. The index j is the position where the mismatch occurs. Therefore, the matched-string is "ba" if the mismatch occurs at the fourth index after the third comparison. At that point, marginTable[4] = 2, which means that a suffix of the matched-string of length 2 is also a prefix of the pattern. It is called a margin because, as Fig. 1 illustrates, the string "ba" can be reached from the left as well as from the right.

The second variable is occurrenceTable, the table that saves the previous occurrence of each of the pattern's characters relative to its index. For example, for the pattern P = "bacxaba" defined in Table I, the occurrenceTable is as follows:

TABLE III. OCCURRENCE TABLE
j                   0   1   2   3   4   5   6
occurrenceTable[j]  -1  -1  -1  -1  1   0   4

The values of Table III are computed by the createOccurrenceTable() function, which iterates over the pattern's indexes starting from the first index. According to the table, the value of occurrenceTable[6] is 4 and the value of occurrenceTable[4] is 1. We know that P[6] is the same as P[4], namely 'a'. Therefore, occurrenceTable[6] gives the previous index of the character 'a' before it appears at the sixth index. Some characters have no previous occurrence at an earlier index. We mark such cases with the value -1, as at j = 0.

There are many ways to implement this function. In our implementation, we use a container (list) that saves the position of each character and is updated with the newest position. At first the container is empty (initialized with -1); once the iteration starts, it saves the index of the current character at the current position. The current value is retrieved before it is updated by the same character. The loop continues until the rest of the pattern has been iterated.

The occurrenceTable is useful in almost every matching case of the LI algorithm. We can try to find the paired-character, until the in-pair condition is satisfied, by accessing this variable recursively. However, before retrieving any data from occurrenceTable, we need to know the position in the pattern of the text character that caused the mismatch. The last position of every character (we use 225 different characters) is mapped by its last occurrence and stored in the variable charTable.

The third variable is charTable, which is an array of integers. It saves the last occurrence in the pattern of every character of the alphabet. If a character never occurs in the pattern, the value of its charTable entry is -1. For example, for the last occurrence of 'a' in the pattern P, charTable['a'] = 6, while for another character such as 'y', the value of charTable['y'] is -1.

The purpose of charTable is to give the starting index for finding the paired-character in the unmatched-string. Suppose that we have a mismatch at Ti; we then get its last occurrence from charTable[Ti]. After that, we compare it with the index of the mismatch (call it j); if
charTable[Ti] > j, we check the value of occurrenceTable[charTable[Ti]], or even the value of occurrenceTable[occurrenceTable[...]], recursively until the proper starting condition is satisfied. The proper condition is satisfied when occurrenceTable[…] < j. However, if the result is -1, the recursion stops and the algorithm starts analyzing the margin-jumping and full-jumping conditions.

D. Matching Stage
Suppose that we have the same pattern P = "bacxaba" defined in Table I. We want to find the pattern P inside the text T = "bacxybaabababaxbaacaabacxaba". This matching process requires the values of marginTable[] and occurrenceTable[] described in the previous section. The values of marginTable[] are given in Table II (Margin Table) and the values of occurrenceTable[] are given in Table III (Occurrence Table). Figures 2 to 7 each consist of three lines: the text, the pattern, and the order of comparison. The third line also shows the number of the comparison at the current position. However, we give only its last digit, so that it is correctly aligned with the character above it.

TABLE IV. TEXT
i      0   1   2   3   4   5   6   7   8   9  10  11  12  13
T[i]   b   a   c   x   y   b   a   a   b   a   b   a   b   a
i     14  15  16  17  18  19  20  21  22  23  24  25  26  27
T[i]   x   b   a   a   c   a   a   b   a   c   x   a   b   a

Initially, the text and the pattern are left-aligned as follows.

Fig. 2. Matching Process (First phase).

In the first phase, the characters of the text and the pattern are compared starting from the rightmost index of the pattern. Starting from the sixth index, we decrease the comparison's index until we get a mismatch at the fourth index. We then know that margin-string = "ba", paired-character = "xy", and unmatched-string = "bacxa". The algorithm tries to find the paired-character in the unmatched-string. In this case, it fails. Therefore, the pattern is shifted until its front-margin aligns with the back-margin, that is, by a distance = Pattern.length - marginTable[4] = 5. After shifting 5 indexes to the right, the alignment is as shown in figure 3.

Fig. 3. Matching Process (Second phase).

The second phase starts from the rightmost index of the pattern, as always. Note that the comparison's index is relative to the text's index, so the comparison in this phase starts at the 11th index and continues until we get a mismatch at the 8th index. As before, the LI algorithm tries to find the paired-character, which is "ab", and fails. However, it still has one chance to make a matched-jump. The alternative condition for matched-jumping is T[mismatch's index] == P[0]. Therefore, the pattern is shifted until its first index reaches the mismatch's index.

Fig. 4. Matching Process (Third phase).

In this third phase, a mismatch occurs at the first comparison. In this condition, we know that the margin-string is empty (zero), the paired-character is "ax", and the unmatched-string is "bacxab". By using findPairedChar(), the LI algorithm determines that the paired-character is not in-pair. Therefore, the pattern is moved according to the full-jumping rule.

Fig. 5. Matching Process (Fourth phase).

In this fourth phase, a mismatch again occurs at the first comparison. In this condition, we know that the margin-string is empty (zero), the paired-character is "ab", and the unmatched-string is "bacxab". By analyzing the values of occurrenceTable and charTable, the LI algorithm determines that the paired-character is located at index 4 and index 5 of the pattern. Calculating the jumping distance then gives one, which means that the pattern is shifted one index to the right.

Fig. 6. Matching Process (Fifth phase).

This fifth phase shows how double comparison is avoided after matched-jumping. After comparing from the 22nd index, we find a mismatch at the 19th index. However, we need only two comparisons, because the 21st and 20th indexes were already marked in the previous phase. In this case, the LI algorithm tries to find the paired-character ("ca") in the unmatched-string, but fails because "ca" is not in-pair. We then move the pattern until its front-margin is aligned with the back-margin position, since marginTable[3] == 2.

Fig. 7. Matching Process (Sixth phase).

In this last phase, the LI algorithm finishes the comparison at index 23, with a total of 16 comparisons. In the previous phase, we saved the index of the mismatch to avoid double comparison. We do not compare the 21st and 22nd indexes, because they must already match according to the previous shift.

E. Algorithm's Pseudo Code
The following pseudo code is a general description of how this algorithm works, especially in the matching stage. We assume that the pattern's index and the text's index always start from zero. In common programming languages such as Java and C++, a string is an array of char whose length is the number of characters in the string. In this case, we store the pattern and the text in a string
variable, so each can be treated as an array of char. For instance, to access the nth character of a string S, we write S[n - 1], because the index starts from zero.

The result after exiting the while loop does not show detailed comparison results, such as the index of the first match and the number of comparisons, because in this paper we give only the main idea of the LI algorithm. However, we have already implemented the idea of this pseudo code successfully in Java and completed the benchmarking that is presented as empirical evidence.

F. Analysis of the Proposed Algorithm
Assume that we have a text of size n and a pattern of size m. The time complexity of the LI algorithm is described in Lemma 1 and Lemma 2 below.

Lemma 1. The time complexity is O(n) in the best case.

Proof. Every character in the text is compared once, so the minimum number of comparisons is n (see figure 8 below). Furthermore, the number of phases is one in every best case.
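The three preprocessing tables of Section III-C can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name LiTables, the 256-entry alphabet, and the simple quadratic loop inside createMargin() are choices made here for illustration, and only the function names createCharTable(), createOccurrenceTable(), and createMargin() come from the text.

```java
import java.util.Arrays;

/** Illustrative sketch of the preprocessing tables (assumed layout, not the authors' code). */
final class LiTables {
    // Assumption: every character code is below 256.
    static final int ALPHABET = 256;

    /** charTable[c] = last index of character c in the pattern, or -1 if c never occurs. */
    static int[] createCharTable(String p) {
        int[] table = new int[ALPHABET];
        Arrays.fill(table, -1);
        for (int j = 0; j < p.length(); j++) {
            table[p.charAt(j)] = j; // later positions overwrite earlier ones
        }
        return table;
    }

    /** occurrenceTable[j] = nearest index below j holding the same character as p[j], or -1. */
    static int[] createOccurrenceTable(String p) {
        int[] last = new int[ALPHABET];
        Arrays.fill(last, -1);
        int[] table = new int[p.length()];
        for (int j = 0; j < p.length(); j++) {
            char c = p.charAt(j);
            table[j] = last[c]; // read the previous position before updating it
            last[c] = j;
        }
        return table;
    }

    /**
     * marginTable[j] = length of the longest suffix of the matched-string p[j+1..]
     * that is also a prefix of p, where j is the pattern index of the mismatch.
     */
    static int[] createMargin(String p) {
        int m = p.length();
        int[] table = new int[m];
        for (int j = 0; j < m; j++) {
            String matched = p.substring(j + 1);
            // Try the longest candidate suffix first; naive, but easy to check by hand.
            for (int len = matched.length(); len > 0; len--) {
                if (p.startsWith(matched.substring(matched.length() - len))) {
                    table[j] = len;
                    break;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        String p = "bacxaba";
        System.out.println(Arrays.toString(createMargin(p)));          // [2, 2, 2, 2, 2, 0, 0]
        System.out.println(Arrays.toString(createOccurrenceTable(p))); // [-1, -1, -1, -1, 1, 0, 4]
        System.out.println(createCharTable(p)['a']);                   // 6
    }
}
```

For P = "bacxaba" the sketch reproduces the values of Table II and Table III, and gives charTable['a'] = 6 and charTable['y'] = -1 as stated in Section III-C. With these values, the margin-jump of the first phase is Pattern.length - marginTable[4] = 7 - 2 = 5.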
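The in-pair test of smart finding (Section III-B) and the recursive use of charTable and occurrenceTable (Section III-C) might look like the sketch below. It is one possible reading of the description, since the text names findPairedChar() but does not list it: the class name PairedCharLookup, the method signature, and the rule of returning the nearest qualifying position below the mismatch index are assumptions. For self-containment the tables are rebuilt on every call, whereas the paper precomputes them once.

```java
import java.util.Arrays;

/** Hypothetical sketch of the in-pair lookup; an interpretation, not the authors' listing. */
final class PairedCharLookup {
    // Assumption: every character code is below 256.
    static final int ALPHABET = 256;

    /**
     * Looks for the paired-character (first, second) in the unmatched-string p[0..j].
     * Returns the pattern index k < j with p[k-1] == first and p[k] == second,
     * or -1 when the paired-character is not in-pair.
     */
    static int findPairedChar(String p, int j, char first, char second) {
        int[] last = new int[ALPHABET]; // becomes charTable once the loop finishes
        Arrays.fill(last, -1);
        int[] occurrenceTable = new int[p.length()];
        for (int i = 0; i < p.length(); i++) {
            occurrenceTable[i] = last[p.charAt(i)];
            last[p.charAt(i)] = i;
        }
        // Start at the last occurrence of the mismatching text character and follow
        // the chain of previous occurrences until it ends with -1.
        int k = last[second];
        while (k != -1) {
            if (k < j && k > 0 && p.charAt(k - 1) == first) {
                return k; // in-pair: the matched-jump distance is j - k
            }
            k = occurrenceTable[k];
        }
        return -1; // not in-pair: fall back to single-match, margin-jumping, or full-jumping
    }
}
```

Checked against the worked example of Section III-D, the sketch returns 5 for the fourth phase (mismatch at pattern index 6, paired-character "ab"), which gives the shift 6 - 5 = 1 reported there, and -1 for the paired-characters "xy", "ax", and "ca" of the first, third, and fifth phases.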
Fig. 10. The y-axis shows the number of direct comparisons used to search for a pattern of 8 characters inside texts of various sizes.

Fig. 11. Semi-log chart depicting the number of direct comparisons made by the KMP, BM, and LI algorithms when they search for patterns of varying size inside a text of 10000 characters.

V. RESULT AND ANALYSIS
Theoretically, we proved that LI's rules are better than those of BM and KMP, because marginTable and occurrenceTable let LI skip more characters than the BM and KMP algorithms. Empirically, we found that Logical-Indexing is much more efficient in reducing the number of comparisons, based on the charts of experimental results in the previous section. We analyzed two types of data, which have various combinations of text and pattern. Fig. 10 shows that the number of comparisons increases for all algorithms, but LI has the smallest increase. Moreover, Fig. 11 also shows that LI has the smallest number of comparisons.

VI. CONCLUSION
In this paper, we proposed a new string matching algorithm that optimizes the use of indexes. The most important variable in this algorithm is occurrenceTable, which saves the previous occurrence of each character of the pattern. It is used to find the paired-character that maximizes the shift for the next phase. The new algorithm has been tested, and it significantly improves on the Boyer-Moore and Knuth-Morris-Pratt algorithms. The algorithm does not merely modify BM's lastOccurrence() function; it also adds new logical processing that is not used by other algorithms, such as finding the paired-character, defining the margin area, calculating matched-jumping, and calculating margin-jumping.

The test set that was used is sufficiently large, and at this point it is enough to conclude that the proposed algorithm is efficient in reducing the number of comparisons. Moreover, it can be applied in other searching contexts, especially in a database where comparing one data item with another takes a long time. Our further interest is to calculate the best length of the paired-character dynamically with respect to the size of the pattern. We also want to apply this algorithm to a real-life problem to support large-scale computation.

REFERENCES
[1] D. E. Knuth, J. H. Morris, and V. R. Pratt, "Fast pattern matching in strings," TR CS-74-440, Stanford U., Stanford, California, 1974.
[2] R. S. Boyer and J. S. Moore, "A fast string searching algorithm," Communications of the ACM, 20(10):761–772, 1977.
[3] S. S. Sheik, K. A. Sumit, P. Anindya, N. Balakrishnan, and K. Sekar, "A FAST Pattern Matching Algorithm," J. Chem. Inf. Comput. Sci., 44, 1251-1256, 2004.
[4] Z. Liu, X. Chen, J. Borneman, and T. Jiang, "A Fast Algorithm for Approximate String Matching on Gene Sequences," unpublished.