Article

Entropy-Based Approach in Selection Exact String-Matching Algorithms
Ivan Markić 1, * , Maja Štula 2 , Marija Zorić 3 and Darko Stipaničev 2
1 Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split,
21000 Split, Croatia
2 Department of Electronics and Computing, Faculty of Electrical Engineering,
Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia;
[email protected] (M.Š.); [email protected] (D.S.)
3 IT Department, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture,
University of Split, 21000 Split, Croatia; [email protected]
* Correspondence: [email protected]; Tel.: +385-(91)-9272123
Abstract: The string-matching paradigm is applied in every branch of computer science and in science in general. The existence of a plethora of string-matching algorithms makes it hard to choose the best one for any particular case. Expressing, measuring, and testing algorithm efficiency is a challenging task with many potential pitfalls. Algorithm efficiency can be measured based on the usage of different resources. In software engineering, algorithmic productivity is a property of an algorithm execution identified with the computational resources the algorithm consumes. Resource usage in algorithm execution can be determined, and for maximum efficiency the goal is to minimize resource usage. Standard measures of algorithm efficiency, such as execution time, directly depend on the number of executed actions. Without touching on the problematics of computer power consumption or memory, which also depend on the algorithm type and the techniques used in algorithm development, we have developed a methodology that enables researchers to choose an efficient algorithm for a specific domain. String searching algorithm efficiency is usually observed independently of the domain texts being searched. This paper presents the idea that algorithm efficiency depends on the properties of the searched string and the properties of the texts being searched, accompanied by a theoretical analysis of the proposed approach. In the proposed methodology, algorithm efficiency is expressed through the character comparison count metric, a formal quantitative measure independent of algorithm implementation subtleties and computer platform differences. The model is developed for a particular problem domain by using appropriate domain data (patterns and texts) and, for that domain, provides a ranking of algorithms according to the patterns' entropy. The proposed approach is limited to on-line exact string-matching problems and is based on the information entropy of the search pattern. Meticulous empirical testing depicts the methodology implementation and supports the soundness of the methodology.

Keywords: exact string-matching; algorithm efficiency; algorithm performance; entropy; comparison; testing framework
1. Introduction

String-matching processes are included in applications in many areas, such as information retrieval, information analysis, computational biology, and multiple variations of practical software implementations in all operating systems. String-matching forms the basis for other computer science fields, and it is one of the most researched areas in theory as well as in practice. The increasing amount and availability of textual data require the development of new approaches and tools to search useful information more effectively from such a large amount of data. Different string-matching algorithms perform better or worse depending on the application domain, making it hard to choose the best one for any particular case [1–5].
The main reason for analyzing an algorithm is to discover its features and compare them with other algorithms in a similar environment. When features are the focus, the primary resources considered are time and space: researchers want to know how long the implementation of a particular algorithm will run on a specific computer and how much space it will require. Implementation quality, compiler properties, computer architecture, etc., have huge effects on performance. Distinguishing between the features of an algorithm and the features of its implementation can be challenging [6,7].
An algorithm is efficient if its resource consumption in the process of execution is below some acceptable or desirable level: the algorithm will finish execution on an available computer in a reasonable amount of time or space. Multiple factors can affect an algorithm's efficiency, such as algorithm implementation, accuracy requirements, and lack of computational power. A few frameworks exist for testing string matching algorithms [8]. Hume and Sunday presented a framework for testing string matching algorithms in 1991; it was developed in the C programming language and was widely used during the 1990s [9]. Faro presented the String Matching Algorithm Research Tool (SMART) framework in 2010, and its improved version six years later. SMART is a framework designed to develop, test, compare, and evaluate string matching algorithms [3].
This paper introduces a model-building methodology for selecting the most efficient string search algorithm based on pattern entropy, while expressing the algorithm's efficiency using platform-independent metrics. The developed methodology for ranking algorithms according to their efficiency considers the properties of the searched string and the properties of the texts being searched. The methodology does not depend on algorithm implementation, computer architecture, or programming language specifics, and it provides a way to investigate algorithm strengths and weaknesses. More information about the formal metric definition is given in Section 3.1.3. The paper covers only the fundamental algorithms; the selected algorithms are the basis for most state-of-the-art algorithms, which belong to the classical approach. More information is given in Section 3.2.2. We also analyzed different approaches, metrics for measuring algorithms' efficiency, and algorithm types.
The paper is organized as follows: The basic concepts of string-matching are de-
scribed in Section 2. Entropy, formal metrics, and related work are described in Section 3
along with the proposed methodology. Experimental results with string search algorithms
evaluation models developed according to the proposed methodology are presented in
Section 4. Section 5 presents the validation results for the developed models and a discus-
sion. Section 6 concludes this paper.
2. String-Matching
String-matching consists of finding all occurrences of a given string in a text. String-
matching algorithms are grouped into exact and approximate string-matching algorithms.
Exact string-matching does not allow any tolerance, while approximate string-matching
allows some tolerance. Further, exact string-matching algorithms are divided into two
groups: single pattern matching and multiple pattern matching. These two categories
are also divided into software and hardware-based methods. The software-based string-
matching algorithms can be divided into character comparison, hashing, bit-parallel, and
hybrid approaches. This research focuses on on-line software-based exact string-matching
using a character comparison approach. On-line searching means that no data structures are built over the text. The character-based approach is known as a classical approach that
compares characters to solve string-matching problems [10,11].
The character-based approach has two key stages: the matching and shift phases. The principle behind algorithms for string comparison covers text scanning with a window of size m, commonly referred to as the sliding window mechanism (or search window). In the process of comparing the main text T[1…n] and a pattern P[1…m], where m ≤ n, the aim is to find all occurrences, if any, of the exact pattern P in the text T (Figure 1). The result of comparing the pattern with the text is the information that they match, if they are equal, or that they mismatch. Both windows must be of equal length during the comparison phase. First, one must align the left ends of the window and the text and then compare the characters of the window with the pattern's characters. After an exact match (or a mismatch) of the pattern with the text, the window is moved to the right. The same procedure repeats until the right end of the window reaches the right end of the text [11–15].
Text: GCATCGCAGAGAGTATACAGTACG
Pattern (search window): GCAGAGAG

Figure 1. Exact string matching.
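To make the sliding-window mechanism and the character comparison (CC) metric concrete, here is a minimal Python sketch (an illustration, not the authors' implementation) that brute-force searches the Figure 1 text for the Figure 1 pattern while counting every character comparison:

```python
def brute_force_search(text: str, pattern: str):
    """Slide a window of size m over the text; return (match positions, CC count)."""
    n, m = len(text), len(pattern)
    positions, comparisons = [], 0
    for i in range(n - m + 1):      # align the window at text position i
        j = 0
        while j < m:
            comparisons += 1        # one character comparison
            if text[i + j] != pattern[j]:
                break               # mismatch: shift the window right by one
            j += 1
        if j == m:                  # the whole window matched
            positions.append(i)
    return positions, comparisons

# Text and pattern from Figure 1:
occ, cc = brute_force_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG")
print(occ, cc)  # one occurrence (0-based position 5) and the CC metric
```

The more sophisticated algorithms compared in this paper (e.g., BM, HOR, QS) differ mainly in how far they shift the window after a mismatch, which is what reduces their CC counts.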
3. Methodology

3.1. Methodology Description
A state-of-the-art survey shows a lack of a platform-independent methodology that would help choose an algorithm for searching a specific string pattern. The proposed approach for evaluating exact string pattern matching algorithms is formalized in a methodology consisting of six steps, shown in Figure 2, to build a model applicable to data sets and algorithms in a given domain.
Figure 2. Methodology for building a model based on the entropy approach for string search algorithm selection.
The first step of the proposed methodology, shown in Figure 2, is selecting representative texts for domain model building. In the second step, the algorithms are selected; the selected algorithms are limited only to the ones that are to be considered. After selecting representative texts for domain model building and algorithms, the searching phase for representative patterns starts in the third step. Representative patterns can be text substrings or can be randomly created from the domain alphabet. The searching phase means that all representative patterns are searched with the algorithms selected in the second step. Search results are collected and expressed in specific metrics. In the fourth step, the patterns' entropy is calculated. In the fifth step, entropy discretization is applied: entropy results are discretized and divided into groups by frequency distribution [16,17]. The last step is to classify algorithms in the built model and present the obtained algorithms' ranking according to the proposed approach.
Also, the range of the data should be calculated by finding the minimum and maximum values. The range is used to determine the class interval, or class width. The following Equation (3) is used [16,17]:

h = (max(values) − min(values)) / C    (3)
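As a small illustration, Equation (3) in Python; the class count C comes from Equation (2) (not reproduced in this excerpt) and is supplied directly here, using the value C = 9 reported below for the DNA domain:

```python
def class_width(values, C):
    """Class width h = (max(values) - min(values)) / C, per Equation (3)."""
    return (max(values) - min(values)) / C

# A subset of the distinct DNA pattern entropies listed below; only the
# minimum (0.00) and maximum (2.00) matter for the range.
dna_entropies = [0.00, 0.34, 0.53, 0.54, 0.70, 1.96, 1.97, 1.98, 1.99, 2.00]
print(round(class_width(dna_entropies, C=9), 2))  # -> 0.22, as reported for DNA
```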
3.1.2. Entropy
Shannon entropy is a widely used concept in information theory that conveys the amount of information contained in a message. Entropy is a standard measure for the state of order or, rather, disorder of symbols in a sequence. The entropy of a sequence of characters describes its complexity, compressibility, and amount of information [21–25].
Suppose that events A1, A2, …, An are defined and that they make a complete set. The following expression is valid: ∑_{i=1}^{n} pi = 1, where pi = p(Ai). A finite system α holds all events Ai, i = 1, 2, …, n, with the corresponding probabilities pi. The following form will denote system α (Equation (4)) [22]:

α = ( A1 A2 … An
      p1 p2 … pn )    (4)
P(G) = 1/8 = 0.125    P(T) = 3/8 = 0.375

Entropy = −(2/8) × log2(2/8) − (2/8) × log2(2/8) − (1/8) × log2(1/8) − (3/8) × log2(3/8) ≈ 1.91
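The computation above can be sketched in a few lines; the 8-character pattern below is a hypothetical stand-in chosen only to reproduce the symbol frequencies of the worked example (counts 2, 2, 1, and 3 over a four-symbol alphabet):

```python
from collections import Counter
from math import log2

def shannon_entropy(pattern: str) -> float:
    """H = -sum(p * log2(p)) over the pattern's symbol frequencies."""
    counts = Counter(pattern)
    n = len(pattern)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(round(shannon_entropy("TACTGTCA"), 3))  # A:2, C:2, G:1, T:3 -> 1.906
```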
For the DNA domain model, the total number of observations in the data is n = 91. The observation data are the distinct values of the calculated entropies rounded to two decimals; for the DNA domain there are 91 items in total (0 | 0.34 | 0.53 | 0.54 | 0.70 | 0.81 | 0.90 | 0.95 | 0.99 | … | 1.96 | 1.97 | 1.98 | 1.99 | 2.00). The number of classes for the DNA domain, applying Equation (2), is 9; the width of the classes, after applying Equation (3), is 0.22. Table 2 shows the entropy classes after discretization, with the number of patterns in each of them.
Table 3 shows just a section of the overall pattern entropy classification for the natural language domain.
For the natural language domain, the total number of observations in the data is n = 105. The observation data for the natural language domain are (0 | 1 | 1.5 | 2 | 2.16 | 2.25 | 2.41 | 2.5 | 2.62 | … | 3.83 | 3.84 | 3.86 | 3.87 | 3.88), 105 items in total. The number of classes for the natural language domain, applying Equation (2), is 9; the width of the classes, after applying Equation (3), is 0.46. Table 4 shows the entropy classes for the natural language domain after discretization, with the number of patterns in each of them.
Entropy classes containing a small number of patterns affect the model the least since
such patterns are rare and occur in less than 0.5% of cases. Examples of such patterns
are TTTTTTTTCTTTTTTT, AAAGAAA, and LL. When a pattern does not belong to any
entropy class, the relevant class is the first closest entropy class.
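The class-assignment rule just described can be sketched as follows; the interval boundaries are hypothetical stand-ins derived from the DNA values reported above (9 classes of width 0.22 starting at 0), not the actual Table 2 rows:

```python
def assign_entropy_class(h_value, lower_bounds, width):
    """Return the 1-based class containing h_value, or the closest class."""
    best, best_dist = None, float("inf")
    for k, lo in enumerate(lower_bounds, start=1):
        hi = lo + width
        if lo <= h_value < hi:
            return k                       # falls inside this class interval
        dist = min(abs(h_value - lo), abs(h_value - hi))
        if dist < best_dist:
            best, best_dist = k, dist      # remember the closest class so far
    return best

bounds = [round(0.22 * i, 2) for i in range(9)]     # 9 classes of width 0.22
print(assign_entropy_class(2.00, bounds, 0.22))     # outside all -> class 9
```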
Table 4. Entropy classes after discretization for the natural language domain.

Table 5. Algorithms ranking model for DNA texts and patterns.

Entropy Class/Algorithm 1 2 3 4 5 6 7 8 9
Quartile 1
AC 20.59% 0.00% 20.00% 12.61% 5.74% 8.99% 8.51% 5.43% 2.53%
BF 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BM 47.92% 100.00% 56.67% 51.82% 57.20% 61.11% 58.26% 61.95% 62.83%
HOR 38.24% 0.00% 46.67% 49.55% 49.88% 51.06% 49.92% 53.92% 54.60%
KMP 14.71% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
MP 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
QS 55.88% 100.00% 53.33% 51.35% 55.36% 53.97% 54.08% 53.68% 54.85%
Quartile 2
AC 29.41% 100.00% 16.67% 35.14% 41.15% 31.22% 38.03% 36.47% 34.34%
BF 23.53% 0.00% 13.33% 22.97% 20.95% 18.52% 21.75% 21.58% 26.36%
BM 18.75% 0.00% 43.33% 28.38% 24.28% 29.63% 22.92% 24.38% 23.21%
HOR 20.59% 100.00% 30.00% 14.41% 20.95% 29.89% 22.54% 24.32% 18.41%
KMP 35.29% 0.00% 13.33% 26.58% 20.95% 18.52% 26.65% 21.58% 27.22%
MP 23.53% 0.00% 13.33% 22.97% 20.95% 18.52% 21.75% 21.58% 27.22%
QS 14.71% 0.00% 43.33% 23.42% 25.94% 28.57% 21.63% 25.08% 18.25%
Quartile 3
AC 23.53% 0.00% 53.33% 25.23% 29.18% 36.24% 26.54% 31.67% 31.41%
BF 26.47% 0.00% 16.67% 24.77% 27.43% 25.66% 24.56% 24.36% 26.01%
BM 25.00% 0.00% 0.00% 19.80% 16.05% 9.26% 18.82% 13.68% 13.95%
HOR 29.41% 0.00% 23.33% 35.59% 26.18% 19.05% 27.38% 21.77% 26.72%
KMP 23.53% 100.00% 43.33% 21.17% 34.66% 41.53% 29.63% 37.77% 25.15%
MP 26.47% 0.00% 33.33% 24.77% 27.43% 25.66% 24.56% 24.56% 25.15%
QS 29.41% 0.00% 3.33% 25.23% 15.96% 17.46% 24.28% 21.24% 26.65%
Quartile 4
AC 26.47% 0.00% 10.00% 27.03% 23.94% 23.54% 26.93% 26.43% 31.72%
BF 50.00% 100.00% 70.00% 52.25% 51.62% 55.82% 53.69% 54.06% 47.63%
BM 8.33% 0.00% 0.00% 0.00% 2.47% 0.00% 0.00% 0.00% 0.01%
HOR 11.76% 0.00% 0.00% 0.45% 2.99% 0.00% 0.17% 0.00% 0.27%
KMP 26.47% 0.00% 43.33% 52.25% 44.39% 39.95% 43.72% 40.65% 47.63%
MP 50.00% 100.00% 53.33% 52.25% 51.62% 55.82% 53.69% 53.87% 47.63%
QS 0.00% 0.00% 0.00% 0.00% 2.74% 0.00% 0.00% 0.00% 0.25%
Table 6. Algorithms ranking model for the natural language texts and patterns.
Entropy Class/Algorithm 1 2 3 4 5 6 7 8 9
Quartile 1
AC 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BF 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BM 66.67% 0.00% 40.56% 60.38% 57.18% 64.05% 65.02% 58.81% 66.06%
HOR 33.33% 0.00% 35.06% 30.19% 38.72% 41.01% 51.24% 56.47% 52.04%
KMP 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
MP 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
QS 100.00% 0.00% 100.00% 84.91% 79.23% 70.13% 59.01% 59.71% 57.01%
Quartile 2
AC 100.00% 0.00% 51.67% 50.94% 50.00% 50.13% 58.66% 50.18% 72.85%
BF 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BM 33.33% 0.00% 59.44% 39.62% 42.82% 35.95% 34.98% 41.19% 33.94%
HOR 66.67% 0.00% 64.94% 69.81% 61.28% 58.99% 48.76% 43.53% 47.96%
KMP 100.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
MP 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
QS 0.00% 0.00% 0.00% 15.09% 20.77% 29.87% 40.99% 40.29% 42.99%
Quartile 3
AC 0.00% 0.00% 48.33% 49.06% 50.00% 49.87% 41.34% 49.82% 27.15%
BF 0.00% 0.00% 36.67% 33.96% 33.08% 32.41% 31.45% 33.45% 32.13%
BM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
HOR 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
KMP 0.00% 0.00% 44.44% 49.06% 45.90% 46.58% 47.70% 46.04% 47.06%
MP 33.33% 0.00% 44.44% 41.51% 45.90% 46.08% 45.94% 45.50% 46.15%
QS 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
Quartile 4
AC 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BF 100.00% 0.00% 63.33% 66.04% 66.92% 67.59% 68.55% 66.55% 67.87%
BM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
HOR 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
KMP 0.00% 0.00% 55.56% 50.94% 54.10% 53.42% 52.30% 53.96% 52.94%
MP 66.67% 0.00% 55.56% 58.49% 54.10% 53.92% 54.06% 54.50% 53.85%
QS 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
When patterns are searched with the BM algorithm, 47.92% of the search results expressed as CC count are in the first quartile, 18.75% are in the second quartile, 25% are in the third, and 8.33% are in the fourth quartile. In this case, for a given pattern, the built model suggests using the QS algorithm as the most efficient algorithm. The selected algorithm is considered an optimal algorithm that will make fewer character comparisons (CC) than the others for most of the searched patterns belonging to entropy class 1. Entropy classes in Table 5 are defined in Table 2.
Figure 5. Algorithms ranking for entropy class 1.
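Reading the built model is mechanical. The sketch below (an illustration, not the authors' tooling) takes the entropy-class-1 column of Table 5 (first-quartile shares) and picks the algorithm whose searches most often land in the cheapest quartile:

```python
# First-quartile (Q1) shares for entropy class 1, taken from Table 5.
q1_share_class1 = {
    "AC": 20.59, "BF": 0.00, "BM": 47.92, "HOR": 38.24,
    "KMP": 14.71, "MP": 0.00, "QS": 55.88,
}

# The suggested algorithm is the one most often in the cheapest quartile.
suggested = max(q1_share_class1, key=q1_share_class1.get)
print(suggested)  # -> "QS", matching the model's suggestion discussed above
```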
5. Methodology Validation and Discussion

For model validation, the seventh and ninth entropy classes (961 and 4692 patterns) were selected for the DNA domain, and the sixth (393 patterns) and ninth (221 patterns) classes were selected for the natural language domain. The model classes chosen for validation have the highest number of representative patterns and are characteristic of the specific domains.

The patterns selected for validation are not part of the pattern set with which the model was created. For the DNA domain model, a different text is also chosen for validation. The DNA domain model is validated with the DNA sequence Homo sapiens isolate HG00514 chromosome 9 genomic scaffold HS_NIOH_CHR9_SCAFFOLD_1, whole genome shotgun sequence, 43,213,237 bp (39 Mb), as the text [58]. The natural language domain is validated with the natural language text set from the Canterbury Corpus [43].

Before the model validation process, a check was made to see whether the selected patterns were sufficiently representative for model validation. The check was done with the central limit theorem. The set of patterns used in the validation phase has a normal distribution (Figure 6, Mean = 1.900, Std. Dev = 0.064), as does the set of patterns used in model building (Figure 7, Mean = 1.901, Std. Dev = 0.059), which means that the patterns used to validate the model represent the domain.
The discretized character comparisons of the other entropy classes of patterns also follow the normal distribution. The basis of the model validation phase is to verify whether the test results differ from the developed model results presented in Tables 5 and 6.
For comparing the two data sets (model results and test results), the double-scaled Euclid distance and the Pearson correlation coefficient were used.
Double-scaled Euclidean distance normalizes raw Euclidean distance into a range of 0–1, where 1 represents the maximum discrepancy between the two variables. The first step in comparing two data sets with the double-scaled Euclidean method is to compute the maximum possible squared discrepancy (md) per variable i of v variables, where v is the number of observed variables in the data set: mdi = (maximum for variable i − minimum for variable i)², where 0 (0%) is used as the minimum and 1 (100%) as the maximum value for double-scaled Euclid distance. The second step's goal is to produce the scaled variable Euclidean distance, where the sum of squared discrepancies per variable is divided by the maximum possible discrepancy for that variable, Equation (7):
d1 = √( ∑_{i=1}^{v} (p1i − p2i)² / mdi )    (7)
Figure 6. Patterns used in the validation phase for the entropy class 9 of the DNA domain.

Figure 7. Patterns used in model building for the entropy class 9 of the DNA domain.

The final step is dividing the scaled Euclidean distance by the root of v, where v is the number of observed variables, Equation (8). The double-scaled Euclid distance easily turns into a measure of similarity by subtracting it from 1.0 [16,59–62]:

d2 = √( ∑_{i=1}^{v} (p1i − p2i)² / mdi ) / √v    (8)
Table 7 shows the usage of the double-scaled Euclidean distance method for entropy class 7 of the DNA domain. Applying Equation (8) to Table 7, column "Scaled Euclidean (d1)", gives a double-scaled Euclidean distance of 0.227. Subtracting the double-scaled Euclidean distance from 1 gives a similarity coefficient of 0.773, or 77%.
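A compact sketch of Equations (7) and (8); the two share vectors are hypothetical placeholders (the Table 7 inputs are not reproduced in this excerpt), and md_i = 1 for every variable since the shares already lie between 0 and 1:

```python
from math import sqrt

def double_scaled_euclid(p1, p2, md=1.0):
    """d2 = sqrt(sum((p1i - p2i)^2 / md_i)) / sqrt(v), per Equations (7)-(8)."""
    v = len(p1)
    d1 = sqrt(sum((a - b) ** 2 / md for a, b in zip(p1, p2)))  # Equation (7)
    return d1 / sqrt(v)                                        # Equation (8)

model      = [0.48, 0.19, 0.25, 0.08]   # hypothetical model shares
validation = [0.45, 0.22, 0.24, 0.09]   # hypothetical validation shares
d2 = double_scaled_euclid(model, validation)
print(round(d2, 3), round(1.0 - d2, 3))  # distance, and similarity = 1 - d2
```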
Table 8 shows the results of the calculated double-scaled Euclid distance and corre-
sponding similarity coefficient.
Table 8. Double-scaled Euclid distance for DNA and natural language classes.
Table 9. Pearson correlation coefficient for DNA and natural language classes.

The seventh and ninth classes from the built model for the DNA domain have a linear Pearson's correlation coefficient with their validation results. The sixth and ninth classes from the natural language domain's built model have a linear Pearson's correlation coefficient with their validation results. The Pearson's correlation coefficients shown in Figure 8 indicate that the values from the built model (x-axis, Model) and their corresponding validation results (y-axis, Validation) follow each other with a strong positive relationship.
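The Pearson check can be reproduced with the Python standard library; statistics.correlation (available since Python 3.10) returns Pearson's r for paired values, shown here on hypothetical placeholder columns:

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

model      = [0.48, 0.19, 0.25, 0.08]   # hypothetical model column
validation = [0.45, 0.22, 0.24, 0.09]   # hypothetical validation column
print(round(correlation(model, validation), 3))  # close to 1: strong positive link
```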
Using the double-scaled Euclidean distance in the validation process shows a strong similarity between the built model and the validation results. In addition to the similarity, a strong positive relationship exists between the classes selected from the built model and the validation results, proven by Pearson's correlation coefficient. The presented results show that it is possible to use the proposed methodology to build a domain model for selecting an optimal algorithm for exact string matching. Besides optimal algorithm selection for a specific domain, this methodology can be used to improve the efficiency of string-matching algorithms in the context of performance, which is in correlation with empirical measurements.
The data used to build and validate the model can be downloaded from the website [63].
6. Conclusions
The proposed methodology for ranking algorithms is based on the properties of the searched string and the properties of the texts being searched. Searched strings are classified according to pattern entropy. The methodology expresses algorithm efficiency using platform-independent metrics and thus does not depend on algorithm implementation, computer architecture, or programming language characteristics. This work focuses on classical software-based algorithms that use exact string-matching techniques with a character comparison approach; for any other type of algorithm, this methodology cannot be used. The character comparison metric used is platform-independent in the context of formal approaches, but the number of comparisons directly affects the time needed for algorithm execution and the usage of computational resources.

Studying the methodology, complexity, and limitations of all available algorithms is a complicated and long-term task. The paper discusses, in detail, the available metrics for evaluating the properties of string searching algorithms and proposes a methodology for building a domain model for selecting an optimal string searching algorithm. The methodology is based on presenting exact string-matching results to express algorithm efficiency regardless of query pattern length and dataset size. We considered the number of compared characters of each algorithm, expressed through the searched string entropy, for our baseline analysis. High degrees of similarity and a strong correlation between the validation results and the built model data have been proven, making this methodology a useful tool that can help researchers choose an efficient string-matching algorithm according to their needs and choose a suitable programming environment for developing new algorithms. All that is needed is a pattern from the specific domain for which the model was built, and the model will suggest the optimal algorithm. The defined model finally selects the algorithm that will most likely incur the lowest character comparison count in pattern matching.

This research does not intend to evaluate algorithm logic or programming environments in any way; the main reason for comparing the results of algorithms is the construction of the algorithm selection model. The built model is straightforwardly extendable with other algorithms; all that is required is an adequate training data set. Further research is directed toward finding additional string characteristics, besides pattern entropy, that can enhance the developed methodology's precision in selecting more efficient string search algorithms.
16. Myatt, G.J.; Johnson, W.P. Making Sense of Data I a Practical Guide to Exploratory Data Analysis and Data Mining, 2nd ed.; John Wiley
& Sons, Inc.: Somerset, NJ, USA, 2014; ISBN 9781118407417.
17. Manikandan, S. Frequency distribution. J. Pharmacol. Pharmacother. 2011, 2, 54. [CrossRef] [PubMed]
18. Bartlett, J.; Kotrlik, J.; Higgins, C. Organizational research: Determining appropriate sample size in survey research. Inf. Technol.
Learn. Perform. J. 2001, 19, 43.
19. Taherdoost, H. Determining Sample Size; How to Calculate Survey Sample Size. Int. J. Econ. Manag. Syst. 2017, 2, 237–239.
20. Israel, G.D. Determining Sample Size; University of Florida: Gainesville, FL, USA, 1992.
21. Mohammed, R. Information Analysis of DNA Sequences. arXiv 2010, arXiv:1010.4205, 1–22.
22. Schmitt, A.O.; Herzel, H. Estimating the entropy of DNA sequences. J. Theor. Biol. 1997, 188, 369–377. [CrossRef]
23. Ebeling, W.; Nicolis, G. Word frequency and entropy of symbolic sequences: A dynamical perspective. Chaos Solitons Fractals
1992, 2, 635–650. [CrossRef]
24. Herzel, H.; Ebeling, W.; Schmitt, A.O. Entropies of biosequences: The role of repeats. Phys. Rev. E 1994, 50, 5061–5071. [CrossRef]
25. Lesne, A.; Blanc, J.L.; Pezard, L. Entropy estimation of very short symbolic sequences. Phys. Rev. E 2009, 79, 1–10. [CrossRef]
26. Rhodes, P.C.; Garside, G.R. Use of maximum entropy method as a methodology for probabilistic reasoning. Knowl. Based Syst.
1995, 8, 249–258. [CrossRef]
27. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [CrossRef]
28. Muchnik, A.; Vereshchagin, N. Shannon entropy vs. kolmogorov complexity. In International Computer Science Symposium in
Russia; Springer: Berlin/Heidelberg, Germany, 2006; pp. 281–291. [CrossRef]
29. Grunwald, P.; Vitanyi, P. Shannon Information and Kolmogorov Complexity. 2004. Available online: https://arxiv.org/pdf/cs/
0410002.pdf (accessed on 4 May 2020).
30. Teixeira, A.; Matos, A.; Souto, A.; Antunes, L. Entropy Measures vs. Kolmogorov Complexity. Entropy 2011, 13, 595–611.
[CrossRef]
31. Goulão, M.; Brito e Abreu, F. Formal definition of metrics upon the CORBA component model. In Quality of Software Architectures
and Software Quality; Springer: Berlin/Heidelberg, Germany, 2005; pp. 88–105. [CrossRef]
32. Barabucci, G.; Ciancarini, P.; Di Iorio, A.; Vitali, F. Measuring the quality of diff algorithms: A formalization. Comput. Stand.
Interfaces 2016, 46, 52–65. [CrossRef]
33. Ivkovic, N.; Jakobovic, D.; Golub, M. Measuring Performance of Optimization Algorithms in Evolutionary Computation. Int. J.
Mach. Learn. Comput. 2016, 6, 167–171. [CrossRef]
34. Aho, A.V.; Hopcroft, J.E.; Ullman, J.D. The Design and Analysis of Computer Algorithms; Addison-Wesley Pub. Co.: Reading, MA,
USA, 1974; ISBN 9780201000290.
35. Hromkovič, J. Theoretical Computer Science: Introduction to Automata, Computability, Complexity, Algorithmics, Randomization,
Communication, and Cryptography; Springer: Berlin/Heidelberg, Germany, 2004; ISBN 3540140158.
36. Jain, P.; Pandey, S. Comparative Study on Text Pattern Matching for Heterogeneous System. Int. J. Comput. Sci. Eng. Technol. 2012,
3, 537–543.
37. Pandiselvam, P.; Marimuthu, T.; Lawrance, R. A comparative study on string matching algorithms of biological sequences. Int.
Conf. Intell. Comput. 2014, 2014, 1–5.
38. Faro, S.; Lecroq, T. The Exact Online String Matching Problem: A Review of the Most Recent Results. Acm Comput. Surv. 2013,
45, 13. [CrossRef]
39. Lecroq, T.; Charras, C. Handbook of Exact String Matching; Laboratoire d'Informatique de Rouen Université de Rouen: Rouen,
France, 2001.
40. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley and Sons: Hoboken, NJ, USA, 2005; ISBN 9780471241959.
41. Kucak, D.; Djambic, G.; Fulanovic, B. An empirical study of algorithms performance in implementations of set in Java. In Pro-
ceedings of the 23rd DAAAM International Symposium on Intelligent Manufacturing and Automation 2012, Zadar, Croatia,
24–27 October 2012; Volume 1, pp. 565–568.
42. Alhendawi, K.M.; Baharudin, A.S. String Matching Algorithms (SMAs): Survey & Empirical Analysis. J. Comput. Sci. Manag. 2013,
2, 2637–2644.
43. The Canterbury Corpus. Available online: http://corpus.canterbury.ac.nz/ (accessed on 21 December 2020).
44. Compeau, P.; Pevzner, P. Bioinformatics Algorithms: An Active Learning Approach; Active Learning Publishers: La Jolla, CA, USA,
2015; Volume 1, ISBN 0990374602.
45. Markić, I.; Štula, M.; Jukić, M. Pattern Searching in Genome. Int. J. Adv. Comput. Technol. 2018, 10, 36–46.
46. Anabarilius grahami isolate AG-KIZ scaffold371_cov124, whole genome sh—Nucleotide—NCBI.
47. Chelonia mydas unplaced genomic scaffold, CheMyd_1.0 scaffold1, whole—Nucleotide—NCBI.
48. Escherichia coli strain LM33 isolate patient, whole genome shotgun seq—Nucleotide—NCBI.
49. Macaca mulatta isolate AG07107 chromosome 19 genomic scaffold ScNM3vo_—Nucleotide—NCBI.
50. The Canterbury Corpus—The King James Version of the Bible. Available online: https://corpus.canterbury.ac.nz/descriptions/.
(accessed on 13 February 2020).
51. Boyer, R.S.; Moore, J.S. A fast string searching algorithm. Commun. ACM 1977, 20, 762–772. [CrossRef]
52. Knuth, D.E.; Morris, J.H.; Pratt, V.R.; Morris, J.H., Jr.; Pratt, V.R. Fast Pattern Matching in Strings. SIAM J. Comput. 1977, 6, 323–350.
[CrossRef]
53. Apostolico, A.; Crochemore, M. Optimal canonization of all substrings of a string. Inf. Comput. 1991, 95, 76–95. [CrossRef]
54. Sunday, D.M. A very fast substring search algorithm. Commun. ACM 1990, 33, 132–142. [CrossRef]
55. Horspool, R.N. Practical fast searching in strings. Softw. Pract. Exp. 1980, 10, 501–506. [CrossRef]
56. Hakak, S.; Kamsin, A.; Shivakumara, P.; Idris, M.Y.I.; Gilkar, G.A. A new split based searching for exact pattern matching for
natural texts. PLoS ONE 2018, 13, e0200912. [CrossRef]
57. Powers, D.M.W. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. Hum.
Commun. Sci. SummerFest 2007, 24. Available online: https://csem.flinders.edu.au/research/techreps/SIE07001.pdf (accessed on
22 March 2020).
58. National Center for Biotechnology Information. Available online: https://www.ncbi.nlm.nih.gov/ (accessed on 15 August 2019).
59. Wheelan, C. Naked Statistics: Stripping the Dread from the Data; WW Norton & Co.: New York, NY, USA, 2013; ISBN 978-0-39307-195-5.
60. Barrett, P. Euclidean Distance Raw, Normalized, and Double-Scaled Coefficients. 2005. Available online: https://www.pbarrett.
net/techpapers/euclid.pdf (accessed on 16 September 2020).
61. Anton, H. Elementary Linear Algebra, 11th ed.; Wiley: New York, NY, USA, 2019; ISBN 978-1-119-62569-8.
62. Rodgers, J.L.; Nicewander, W.A. Thirteen Ways to Look at the Correlation Coefficient. Am. Stat. 1988, 42, 59–66. [CrossRef]
63. Raw Data for Entropy Based Approach in Selection Exact String Matching Algorithms. Available online: https://www.dropbox.
com/t/kXKUZeIIVpw3hU5O (accessed on 3 November 2020).