Article

Entropy-Based Approach in Selection Exact String-Matching Algorithms
Ivan Markić 1, * , Maja Štula 2 , Marija Zorić 3 and Darko Stipaničev 2
1 Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture, University of Split,
21000 Split, Croatia
2 Department of Electronics and Computing, Faculty of Electrical Engineering,
Mechanical Engineering and Naval Architecture, University of Split, 21000 Split, Croatia;
[email protected] (M.Š.); [email protected] (D.S.)
3 IT Department, Faculty of Electrical Engineering, Mechanical Engineering and Naval Architecture,
University of Split, 21000 Split, Croatia; [email protected]
* Correspondence: [email protected]; Tel.: +385-(91)-9272123
Abstract: The string-matching paradigm is applied in every branch of computer science and in science in general. The existence of a plethora of string-matching algorithms makes it hard to choose the best one for any particular case. Expressing, measuring, and testing algorithm efficiency is a challenging task with many potential pitfalls. Algorithm efficiency can be measured based on the usage of different resources. In software engineering, algorithmic productivity is a property of an algorithm execution identified with the computational resources the algorithm consumes. Resource usage in algorithm execution can be determined, and for maximum efficiency the goal is to minimize resource usage. Standard measures of algorithm efficiency, such as execution time, directly depend on the number of executed actions. Without touching on the problematics of computer power consumption or memory, which also depend on the algorithm type and the techniques used in algorithm development, we have developed a methodology that enables researchers to choose an efficient algorithm for a specific domain. String searching algorithm efficiency is usually observed independently of the domain texts being searched. This paper presents the idea that algorithm efficiency depends on the properties of the searched string and the properties of the texts being searched, accompanied by a theoretical analysis of the proposed approach. In the proposed methodology, algorithm efficiency is expressed through the character comparison count metric, a formal quantitative measure independent of algorithm implementation subtleties and computer platform differences. The model is developed for a particular problem domain by using appropriate domain data (patterns and texts) and, for that domain, provides a ranking of algorithms according to the patterns' entropy. The proposed approach is limited to on-line exact string-matching problems and is based on the information entropy of the search pattern. Meticulous empirical testing depicts the methodology implementation and supports the soundness of the methodology.

Keywords: exact string-matching; algorithm efficiency; algorithm performance; entropy; comparison; testing framework
1. Introduction

String-matching processes are included in applications in many areas, such as information retrieval, information analysis, computational biology, and multiple variations of practical software implementations in all operating systems. String-matching forms the basis for other computer science fields, and it is one of the most researched areas in theory as well as in practice. The increasing amount and availability of textual data require the development of new approaches and tools to search useful information more effectively from such a large amount of data. Different string-matching algorithms perform better or worse depending on the application domain, making it hard to choose the best one for any particular case [1–5].
The main reason for analyzing an algorithm is to discover its features and compare them with other algorithms in a similar environment. When features are the focus, the primary resources considered are time and space: researchers want to know how long the implementation of a particular algorithm will run on a specific computer and how much space it will require. Implementation quality, compiler properties, computer architecture, etc., have huge effects on performance. Distinguishing between the features of an algorithm and the features of its implementation can be challenging [6,7].
An algorithm is efficient if its resource consumption in the process of execution is below some acceptable or desirable level: the algorithm will finish execution on an available computer in a reasonable amount of time or space. Multiple factors can affect an algorithm's efficiency, such as algorithm implementation, accuracy requirements, and lack of computational power. A few frameworks exist for testing string matching algorithms [8]. Hume and Sunday presented a framework for testing string matching algorithms in 1991; it was developed in the C programming language and was widely used during the 1990s [9]. Faro presented the String Matching Algorithm Research Tool (SMART) framework in 2010, and its improved version six years later. SMART is a framework designed to develop, test, compare, and evaluate string matching algorithms [3].
This paper introduces a model-building methodology for selecting the most efficient string search algorithm based on pattern entropy, while expressing the algorithm's efficiency using platform-independent metrics. The developed methodology for ranking algorithms according to their efficiency considers the properties of the searched string and the properties of the texts being searched. The methodology does not depend on algorithm implementation, computer architecture, or programming language specifics, and it provides a way to investigate algorithm strengths and weaknesses. More information about the formal metric definition is given in Section 3.1.3. The paper covers only the fundamental algorithms; the selected algorithms are the basis for most state-of-the-art algorithms, which belong to the classical approach. More information is given in Section 3.2.2. We also analyzed different approaches, metrics for measuring algorithms' efficiency, and algorithm types.
The paper is organized as follows: The basic concepts of string-matching are de-
scribed in Section 2. Entropy, formal metrics, and related work are described in Section 3
along with the proposed methodology. Experimental results with string search algorithms
evaluation models developed according to the proposed methodology are presented in
Section 4. Section 5 presents the validation results for the developed models and a discus-
sion. Section 6 concludes this paper.
2. String-Matching
String-matching consists of finding all occurrences of a given string in a text. String-
matching algorithms are grouped into exact and approximate string-matching algorithms.
Exact string-matching does not allow any tolerance, while approximate string-matching
allows some tolerance. Further, exact string-matching algorithms are divided into two
groups: single pattern matching and multiple pattern matching. These two categories
are also divided into software and hardware-based methods. The software-based string-
matching algorithms can be divided into character comparison, hashing, bit-parallel, and
hybrid approaches. This research focuses on on-line software-based exact string-matching
using a character comparison approach. On-line searching means that no data structures are built over the text. The character-based approach is known as a classical approach that
compares characters to solve string-matching problems [10,11].
The character-based approach has two key stages: the matching and shift phases. The principle behind algorithms for string comparison covers text scanning with a window of size m, commonly referred to as the sliding window mechanism (or search window). In the process of comparing the main text T[1…n] and a pattern P[1…m], where m ≤ n, the aim is to find all occurrences, if any, of the exact pattern P in the text T (Figure 1). The result of comparing the pattern with the text is the information that they match, if they are equal, or that they mismatch. Both windows must be of equal length during the comparison phase. First, one must align the left ends of the window and the text and then compare the characters of the window with the pattern's characters. After an exact match (or a mismatch) of the pattern with the text, the window is moved to the right. The same procedure repeats until the right end of the window reaches the right end of the text [11–15].
Text: GCATCGCAGAGAGTATACAGTACG
Pattern (search window): GCAGAGAG

Figure 1. Exact string matching.
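To make the sliding-window mechanism and the character comparison (CC) metric concrete, here is a minimal Python sketch (an illustration, not the authors' implementation) that brute-force searches the Figure 1 text for the Figure 1 pattern while counting every character comparison:

```python
def brute_force_search(text: str, pattern: str):
    """Slide a window of size m over the text; return (match positions, CC count)."""
    n, m = len(text), len(pattern)
    positions, comparisons = [], 0
    for i in range(n - m + 1):      # align the window at text position i
        j = 0
        while j < m:
            comparisons += 1        # one character comparison
            if text[i + j] != pattern[j]:
                break               # mismatch: shift the window right by one
            j += 1
        if j == m:                  # the whole window matched
            positions.append(i)
    return positions, comparisons

# Text and pattern from Figure 1:
occ, cc = brute_force_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG")
print(occ, cc)  # one occurrence (0-based position 5) and the CC metric
```

The more sophisticated algorithms compared in this paper (e.g., BM, HOR, QS) differ mainly in how far they shift the window after a mismatch, which is what reduces their CC counts.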
3. Methodology

3.1. Methodology Description
A state-of-the-art survey shows a lack of a platform-independent methodology that would help choose an algorithm for searching a specific string pattern. The proposed approach for evaluating exact string pattern matching algorithms is formalized in a methodology consisting of six steps, shown in Figure 2, to build a model applicable to data sets and algorithms in a given domain.
Figure 2. Methodology for building a model based on the entropy approach for string search algorithm selection.
The first step of the proposed methodology, shown in Figure 2, is selecting representative texts for domain model building. In the second step, the algorithms are selected; the selected algorithms are limited only to the ones that are to be considered. After selecting representative texts for domain model building and algorithms, the searching phase for representative patterns starts in the third step. Representative patterns can be text substrings or can be randomly created from the domain alphabet. The searching phase means that all representative patterns are searched with the algorithms selected in the second step. Search results are collected and expressed in specific metrics. In the fourth step, the patterns' entropy is calculated. In the fifth step, entropy discretization is applied: entropy results are discretized and divided into groups by frequency distribution [16,17]. The last step is to classify algorithms in the built model and present the obtained algorithms' ranking according to the proposed approach.
Also, the range of the data should be calculated by finding the minimum and maximum values. The range is used to determine the class interval, or class width. The following Equation (3) is used [16,17]:

h = (max(values) − min(values)) / C    (3)
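As a small illustration, Equation (3) in Python; the class count C comes from Equation (2) (not reproduced in this excerpt) and is supplied directly here, using the value C = 9 reported below for the DNA domain:

```python
def class_width(values, C):
    """Class width h = (max(values) - min(values)) / C, per Equation (3)."""
    return (max(values) - min(values)) / C

# A subset of the distinct DNA pattern entropies listed below; only the
# minimum (0.00) and maximum (2.00) matter for the range.
dna_entropies = [0.00, 0.34, 0.53, 0.54, 0.70, 1.96, 1.97, 1.98, 1.99, 2.00]
print(round(class_width(dna_entropies, C=9), 2))  # -> 0.22, as reported for DNA
```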
3.1.2. Entropy
Shannon entropy is a widely used concept in information theory that conveys the amount of information contained in a message. Entropy is a standard measure for the state of order or, rather, disorder of symbols in a sequence. The entropy of a sequence of characters describes its complexity, compressibility, and amount of information [21–25].
Suppose that events A1, A2, …, An are defined and that they make a complete set. The following expression is valid: ∑_{i=1}^{n} pi = 1, where pi = p(Ai). A finite system α holds all events Ai, i = 1, 2, …, n, with the corresponding probabilities pi. The following form will denote system α (Equation (4)) [22]:

α = ( A1 A2 … An
      p1 p2 … pn )    (4)
P(G) = 1/8 = 0.125    P(T) = 3/8 = 0.375

Entropy = −(2/8) × log2(2/8) − (2/8) × log2(2/8) − (1/8) × log2(1/8) − (3/8) × log2(3/8) ≈ 1.91
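The computation above can be sketched in a few lines; the 8-character pattern below is a hypothetical stand-in chosen only to reproduce the symbol frequencies of the worked example (counts 2, 2, 1, and 3 over a four-symbol alphabet):

```python
from collections import Counter
from math import log2

def shannon_entropy(pattern: str) -> float:
    """H = -sum(p * log2(p)) over the pattern's symbol frequencies."""
    counts = Counter(pattern)
    n = len(pattern)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(round(shannon_entropy("TACTGTCA"), 3))  # A:2, C:2, G:1, T:3 -> 1.906
```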
For the DNA domain model, the total number of observations in the data is n = 91. The observation data are the distinct values of the calculated entropies rounded to two decimals; for the DNA domain there are 91 items in total (0 | 0.34 | 0.53 | 0.54 | 0.70 | 0.81 | 0.90 | 0.95 | 0.99 | … | 1.96 | 1.97 | 1.98 | 1.99 | 2.00). The number of classes for the DNA domain, applying Equation (2), is 9; the width of the classes, after applying Equation (3), is 0.22. Table 2 shows the entropy classes after discretization, with the number of patterns in each of them.
Table 3 shows just a section of the overall pattern entropy classification for the natural language domain.
For the natural language domain, the total number of observations in the data is n = 105. The observation data for the natural language domain are (0 | 1 | 1.5 | 2 | 2.16 | 2.25 | 2.41 | 2.5 | 2.62 | … | 3.83 | 3.84 | 3.86 | 3.87 | 3.88), 105 items in total. The number of classes for the natural language domain, applying Equation (2), is 9; the width of the classes, after applying Equation (3), is 0.46. Table 4 shows the entropy classes for the natural language domain after discretization, with the number of patterns in each of them.
Entropy classes containing a small number of patterns affect the model the least since
such patterns are rare and occur in less than 0.5% of cases. Examples of such patterns
are TTTTTTTTCTTTTTTT, AAAGAAA, and LL. When a pattern does not belong to any
entropy class, the relevant class is the first closest entropy class.
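The class-assignment rule just described can be sketched as follows; the interval boundaries are hypothetical stand-ins derived from the DNA values reported above (9 classes of width 0.22 starting at 0), not the actual Table 2 rows:

```python
def assign_entropy_class(h_value, lower_bounds, width):
    """Return the 1-based class containing h_value, or the closest class."""
    best, best_dist = None, float("inf")
    for k, lo in enumerate(lower_bounds, start=1):
        hi = lo + width
        if lo <= h_value < hi:
            return k                       # falls inside this class interval
        dist = min(abs(h_value - lo), abs(h_value - hi))
        if dist < best_dist:
            best, best_dist = k, dist      # remember the closest class so far
    return best

bounds = [round(0.22 * i, 2) for i in range(9)]     # 9 classes of width 0.22
print(assign_entropy_class(2.00, bounds, 0.22))     # outside all -> class 9
```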
Table 4. Entropy classes after discretization for the natural language domain.

Table 5. Algorithms ranking model for DNA texts and patterns.

Entropy Class/Algorithm 1 2 3 4 5 6 7 8 9
Quartile 1
AC 20.59% 0.00% 20.00% 12.61% 5.74% 8.99% 8.51% 5.43% 2.53%
BF 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BM 47.92% 100.00% 56.67% 51.82% 57.20% 61.11% 58.26% 61.95% 62.83%
HOR 38.24% 0.00% 46.67% 49.55% 49.88% 51.06% 49.92% 53.92% 54.60%
KMP 14.71% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
MP 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
QS 55.88% 100.00% 53.33% 51.35% 55.36% 53.97% 54.08% 53.68% 54.85%
Quartile 2
AC 29.41% 100.00% 16.67% 35.14% 41.15% 31.22% 38.03% 36.47% 34.34%
BF 23.53% 0.00% 13.33% 22.97% 20.95% 18.52% 21.75% 21.58% 26.36%
BM 18.75% 0.00% 43.33% 28.38% 24.28% 29.63% 22.92% 24.38% 23.21%
HOR 20.59% 100.00% 30.00% 14.41% 20.95% 29.89% 22.54% 24.32% 18.41%
KMP 35.29% 0.00% 13.33% 26.58% 20.95% 18.52% 26.65% 21.58% 27.22%
MP 23.53% 0.00% 13.33% 22.97% 20.95% 18.52% 21.75% 21.58% 27.22%
QS 14.71% 0.00% 43.33% 23.42% 25.94% 28.57% 21.63% 25.08% 18.25%
Quartile 3
AC 23.53% 0.00% 53.33% 25.23% 29.18% 36.24% 26.54% 31.67% 31.41%
BF 26.47% 0.00% 16.67% 24.77% 27.43% 25.66% 24.56% 24.36% 26.01%
BM 25.00% 0.00% 0.00% 19.80% 16.05% 9.26% 18.82% 13.68% 13.95%
HOR 29.41% 0.00% 23.33% 35.59% 26.18% 19.05% 27.38% 21.77% 26.72%
KMP 23.53% 100.00% 43.33% 21.17% 34.66% 41.53% 29.63% 37.77% 25.15%
MP 26.47% 0.00% 33.33% 24.77% 27.43% 25.66% 24.56% 24.56% 25.15%
QS 29.41% 0.00% 3.33% 25.23% 15.96% 17.46% 24.28% 21.24% 26.65%
Quartile 4
AC 26.47% 0.00% 10.00% 27.03% 23.94% 23.54% 26.93% 26.43% 31.72%
BF 50.00% 100.00% 70.00% 52.25% 51.62% 55.82% 53.69% 54.06% 47.63%
BM 8.33% 0.00% 0.00% 0.00% 2.47% 0.00% 0.00% 0.00% 0.01%
HOR 11.76% 0.00% 0.00% 0.45% 2.99% 0.00% 0.17% 0.00% 0.27%
KMP 26.47% 0.00% 43.33% 52.25% 44.39% 39.95% 43.72% 40.65% 47.63%
MP 50.00% 100.00% 53.33% 52.25% 51.62% 55.82% 53.69% 53.87% 47.63%
QS 0.00% 0.00% 0.00% 0.00% 2.74% 0.00% 0.00% 0.00% 0.25%
Table 6. Algorithms ranking model for the natural language texts and patterns.
Entropy Class/Algorithm 1 2 3 4 5 6 7 8 9
Quartile 1
AC 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BF 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BM 66.67% 0.00% 40.56% 60.38% 57.18% 64.05% 65.02% 58.81% 66.06%
HOR 33.33% 0.00% 35.06% 30.19% 38.72% 41.01% 51.24% 56.47% 52.04%
KMP 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
MP 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
QS 100.00% 0.00% 100.00% 84.91% 79.23% 70.13% 59.01% 59.71% 57.01%
Quartile 2
AC 100.00% 0.00% 51.67% 50.94% 50.00% 50.13% 58.66% 50.18% 72.85%
BF 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BM 33.33% 0.00% 59.44% 39.62% 42.82% 35.95% 34.98% 41.19% 33.94%
HOR 66.67% 0.00% 64.94% 69.81% 61.28% 58.99% 48.76% 43.53% 47.96%
KMP 100.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
MP 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
QS 0.00% 0.00% 0.00% 15.09% 20.77% 29.87% 40.99% 40.29% 42.99%
Quartile 3
AC 0.00% 0.00% 48.33% 49.06% 50.00% 49.87% 41.34% 49.82% 27.15%
BF 0.00% 0.00% 36.67% 33.96% 33.08% 32.41% 31.45% 33.45% 32.13%
BM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
HOR 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
KMP 0.00% 0.00% 44.44% 49.06% 45.90% 46.58% 47.70% 46.04% 47.06%
MP 33.33% 0.00% 44.44% 41.51% 45.90% 46.08% 45.94% 45.50% 46.15%
QS 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
Quartile 4
AC 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
BF 100.00% 0.00% 63.33% 66.04% 66.92% 67.59% 68.55% 66.55% 67.87%
BM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
HOR 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
KMP 0.00% 0.00% 55.56% 50.94% 54.10% 53.42% 52.30% 53.96% 52.94%
MP 66.67% 0.00% 55.56% 58.49% 54.10% 53.92% 54.06% 54.50% 53.85%
QS 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00%
When patterns are searched with the BM algorithm, 47.92% of the search results expressed as CC count are in the first quartile, 18.75% are in the second quartile, 25% are in the third, and 8.33% are in the fourth quartile. In this case, for a given pattern, the built model suggests using the QS algorithm as the most efficient algorithm. The selected algorithm is considered an optimal algorithm that will make fewer character comparisons (CC) than the others for most of the searched patterns belonging to entropy class 1. Entropy classes in Table 5 are defined in Table 2.
Figure 5. Algorithms ranking for entropy class 1.
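Reading the built model is mechanical. The sketch below (an illustration, not the authors' tooling) takes the entropy-class-1 column of Table 5 (first-quartile shares) and picks the algorithm whose searches most often land in the cheapest quartile:

```python
# First-quartile (Q1) shares for entropy class 1, taken from Table 5.
q1_share_class1 = {
    "AC": 20.59, "BF": 0.00, "BM": 47.92, "HOR": 38.24,
    "KMP": 14.71, "MP": 0.00, "QS": 55.88,
}

# The suggested algorithm is the one most often in the cheapest quartile.
suggested = max(q1_share_class1, key=q1_share_class1.get)
print(suggested)  # -> "QS", matching the model's suggestion discussed above
```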
5. Methodology Validation and Discussion

For model validation, the seventh and ninth entropy classes (961 and 4692 patterns) were selected for the DNA domain, and the sixth (393 patterns) and ninth (221 patterns) classes were selected for the natural language domain. The model classes chosen for validation have the highest number of representative patterns and are characteristic of the specific domains.

The patterns selected for validation are not part of the pattern set with which the model was created. For the DNA domain model, a different text is also chosen for validation. The DNA domain model is validated with the DNA sequence Homo sapiens isolate HG00514 chromosome 9 genomic scaffold HS_NIOH_CHR9_SCAFFOLD_1, whole genome shotgun sequence, 43,213,237 bp (39 Mb), as the text [58]. The natural language domain is validated with the natural language text set from the Canterbury Corpus [43].

Before the model validation process, a check was made to see whether the selected patterns were sufficiently representative for model validation. The check was done with the central limit theorem. The set of patterns used in the validation phase has a normal distribution (Figure 6, Mean = 1.900, Std. Dev = 0.064), as does the set of patterns used in model building (Figure 7, Mean = 1.901, Std. Dev = 0.059), which means that the patterns used to validate the model represent the domain.
The discretized character comparisons of the other entropy classes of patterns also follow the normal distribution. The basis of the model validation phase is to verify whether the test results differ from the developed model results presented in Tables 5 and 6.
For comparing the two data sets (model results and test results), the double-scaled Euclid distance and the Pearson correlation coefficient were used.
Double-scaled Euclidean distance normalizes raw Euclidean distance into a range of 0–1, where 1 represents the maximum discrepancy between the two variables. The first step in comparing two data sets with the double-scaled Euclidean method is to compute the maximum possible squared discrepancy (md) per variable i of v variables, where v is the number of observed variables in the data set: mdi = (maximum for variable i − minimum for variable i)², where 0 (0%) is used as the minimum and 1 (100%) as the maximum value for double-scaled Euclid distance. The second step's goal is to produce the scaled variable Euclidean distance, where the sum of squared discrepancies per variable is divided by the maximum possible discrepancy for that variable, Equation (7):
d1 = √( ∑_{i=1}^{v} (p1i − p2i)² / mdi )    (7)
Figure 6. Patterns used in the validation phase for the entropy class 9 of the DNA domain.

Figure 7. Patterns used in model building for the entropy class 9 of the DNA domain.

The final step is dividing the scaled Euclidean distance by the root of v, where v is the number of observed variables, Equation (8). The double-scaled Euclid distance easily turns into a measure of similarity by subtracting it from 1.0 [16,59–62]:

d2 = √( ∑_{i=1}^{v} (p1i − p2i)² / mdi ) / √v    (8)
Table 7 shows the usage of the double-scaled Euclidean distance method for entropy class 7 of the DNA domain. Applying Equation (8) to Table 7, column "Scaled Euclidean (d1)", gives a double-scaled Euclidean distance of 0.227. Subtracting the double-scaled Euclidean distance from 1 gives a similarity coefficient of 0.773, or 77%.
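A compact sketch of Equations (7) and (8); the two share vectors are hypothetical placeholders (the Table 7 inputs are not reproduced in this excerpt), and md_i = 1 for every variable since the shares already lie between 0 and 1:

```python
from math import sqrt

def double_scaled_euclid(p1, p2, md=1.0):
    """d2 = sqrt(sum((p1i - p2i)^2 / md_i)) / sqrt(v), per Equations (7)-(8)."""
    v = len(p1)
    d1 = sqrt(sum((a - b) ** 2 / md for a, b in zip(p1, p2)))  # Equation (7)
    return d1 / sqrt(v)                                        # Equation (8)

model      = [0.48, 0.19, 0.25, 0.08]   # hypothetical model shares
validation = [0.45, 0.22, 0.24, 0.09]   # hypothetical validation shares
d2 = double_scaled_euclid(model, validation)
print(round(d2, 3), round(1.0 - d2, 3))  # distance, and similarity = 1 - d2
```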
Table 8 shows the results of the calculated double-scaled Euclid distance and corre-
sponding similarity coefficient.
Table 8. Double-scaled Euclid distance for DNA and natural language classes.
Table 9. Pearson correlation coefficient for DNA and natural language classes.

The seventh and ninth classes from the built model for the DNA domain have a linear Pearson's correlation coefficient with their validation results. The sixth and ninth classes from the natural language domain's built model have a linear Pearson's correlation coefficient with their validation results. The Pearson's correlation coefficients shown in Figure 8 indicate that the values from the built model (x-axis, Model) and their corresponding validation results (y-axis, Validation) follow each other with a strong positive relationship.
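The Pearson check can be reproduced with the Python standard library; statistics.correlation (available since Python 3.10) returns Pearson's r for paired values, shown here on hypothetical placeholder columns:

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

model      = [0.48, 0.19, 0.25, 0.08]   # hypothetical model column
validation = [0.45, 0.22, 0.24, 0.09]   # hypothetical validation column
print(round(correlation(model, validation), 3))  # close to 1: strong positive link
```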
Using the double-scaled Euclidean distance in the validation process shows a strong similarity between the built model and the validation results. In addition to the similarity, a strong positive relationship exists between the classes selected from the built model and the validation results, proven by Pearson's correlation coefficient. The presented results show that it is possible to use the proposed methodology to build a domain model for selecting an optimal algorithm for exact string matching. Besides optimal algorithm selection for a specific domain, this methodology can be used to improve the efficiency of string-matching algorithms in the context of performance, which is in correlation with empirical measurements.
The data used to build and validate the model can be downloaded from the website [63].
6. Conclusions
The proposed methodology for ranking algorithms is based on the properties of the searched string and the properties of the texts being searched. Searched strings are classified according to pattern entropy. The methodology expresses algorithm efficiency using platform-independent metrics and thus does not depend on algorithm implementation, computer architecture, or programming language characteristics. This work focuses on classical software-based algorithms that use exact string-matching techniques with a character comparison approach; for any other type of algorithm, this methodology cannot be used. The character comparison metric used is platform-independent in the context of formal approaches, but the number of comparisons directly affects the time needed for algorithm execution and the usage of computational resources.

Studying the methodology, complexity, and limitations of all available algorithms is a complicated and long-term task. The paper discusses, in detail, the available metrics for evaluating the properties of string searching algorithms and proposes a methodology for building a domain model for selecting an optimal string searching algorithm. The methodology is based on presenting exact string-matching results to express algorithm efficiency regardless of query pattern length and dataset size. We considered the number of compared characters of each algorithm, expressed through the searched string entropy, for our baseline analysis. High degrees of similarity and a strong correlation between the validation results and the built model data have been proven, making this methodology a useful tool that can help researchers choose an efficient string-matching algorithm according to their needs and choose a suitable programming environment for developing new algorithms. All that is needed is a pattern from the specific domain for which the model was built, and the model will suggest the optimal algorithm. The defined model finally selects the algorithm that will most likely incur the lowest character comparison count in pattern matching.

This research does not intend to evaluate algorithm logic or programming environments in any way; the main reason for comparing the results of algorithms is the construction of the algorithm selection model. The built model is straightforwardly extendable with other algorithms; all that is required is an adequate training data set. Further research is directed toward finding additional string characteristics, besides pattern entropy, that can enhance the developed methodology's precision in selecting more efficient string search algorithms.
16. Myatt, G.J.; Johnson, W.P. Making Sense of Data I a Practical Guide to Exploratory Data Analysis and Data Mining, 2nd ed.; John Wiley
& Sons, Inc.: Somerset, NJ, USA, 2014; ISBN 9781118407417.
17. Manikandan, S. Frequency distribution. J. Pharmacol. Pharmacother. 2011, 2, 54. [CrossRef] [PubMed]
18. Bartlett, J.; Kotrlik, J.; Higgins, C. Organizational research: Determining appropriate sample size in survey research. Inf. Technol.
Learn. Perform. J. 2001, 19, 43.
19. Taherdoost, H. Determining Sample Size; How to Calculate Survey Sample Size. Int. J. Econ. Manag. Syst. 2017, 2, 237–239.
20. Israel, G.D. Determining Sample Size; University of Florida: Gainesville, FL, USA, 1992.
21. Mohammed, R. Information Analysis of DNA Sequences. arXiv 2010, arXiv:1010.4205, 1–22.
22. Schmitt, A.O.; Herzel, H. Estimating the entropy of DNA sequences. J. Theor. Biol. 1997, 188, 369–377. [CrossRef]
23. Ebeling, W.; Nicolis, G. Word frequency and entropy of symbolic sequences: A dynamical perspective. Chaos Solitons Fractals
1992, 2, 635–650. [CrossRef]
24. Herzel, H.; Ebeling, W.; Schmitt, A.O. Entropies of biosequences: The role of repeats. Phys. Rev. E 1994, 50, 5061–5071. [CrossRef]
25. Lesne, A.; Blanc, J.L.; Pezard, L. Entropy estimation of very short symbolic sequences. Phys. Rev. E 2009, 79, 1–10. [CrossRef]
26. Rhodes, P.C.; Garside, G.R. Use of maximum entropy method as a methodology for probabilistic reasoning. Knowl. Based Syst.
1995, 8, 249–258. [CrossRef]
27. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [CrossRef]
28. Muchnik, A.; Vereshchagin, N. Shannon entropy vs. kolmogorov complexity. In International Computer Science Symposium in
Russia; Springer: Berlin/Heidelberg, Germany, 2006; pp. 281–291. [CrossRef]
29. Grunwald, P.; Vitanyi, P. Shannon Information and Kolmogorov Complexity. 2004. Available online: https://arxiv.org/pdf/cs/
0410002.pdf (accessed on 4 May 2020).
30. Teixeira, A.; Matos, A.; Souto, A.; Antunes, L. Entropy Measures vs. Kolmogorov Complexity. Entropy 2011, 13, 595–611.
[CrossRef]
31. Goulão, M.; Brito e Abreu, F. Formal definition of metrics upon the CORBA component model. In Quality of Software Architectures
and Software Quality; Springer: Berlin/Heidelberg, Germany, 2005; pp. 88–105. [CrossRef]
32. Barabucci, G.; Ciancarini, P.; Di Iorio, A.; Vitali, F. Measuring the quality of diff algorithms: A formalization. Comput. Stand.
Interfaces 2016, 46, 52–65. [CrossRef]
33. Ivkovic, N.; Jakobovic, D.; Golub, M. Measuring Performance of Optimization Algorithms in Evolutionary Computation. Int. J.
Mach. Learn. Comput. 2016, 6, 167–171. [CrossRef]
34. Aho, A.V.; Hopcroft, J.E.; Ullman, J.D. The Design and Analysis of Computer Algorithms; Addison-Wesley Pub. Co.: Reading, MA,
USA, 1974; ISBN 9780201000290.
35. Hromkovič, J. Theoretical Computer Science: Introduction to Automata, Computability, Complexity, Algorithmics, Randomization,
Communication, and Cryptography; Springer: Berlin/Heidelberg, Germany, 2004; ISBN 3540140158.
36. Jain, P.; Pandey, S. Comparative Study on Text Pattern Matching for Heterogeneous System. Int. J. Comput. Sci. Eng. Technol. 2012,
3, 537–543.
37. Pandiselvam, P.; Marimuthu, T.; Lawrance, R. A comparative study on string matching algorithms of biological sequences. Int.
Conf. Intell. Comput. 2014, 2014, 1–5.
38. Faro, S.; Lecroq, T. The Exact Online String Matching Problem: A Review of the Most Recent Results. Acm Comput. Surv. 2013,
45, 13. [CrossRef]
39. Lecroq, T.; Charras, C. Handbook of Exact String Matching; Laboratoire d'Informatique de Rouen Université de Rouen: Rouen,
France, 2001.
40. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley and Sons: Hoboken, NJ, USA, 2005; ISBN 9780471241959.
41. Kucak, D.; Djambic, G.; Fulanovic, B. An empirical study of algorithms performance in implementations of set in Java. In Pro-
ceedings of the 23rd DAAAM International Symposium on Intelligent Manufacturing and Automation 2012, Zadar, Croatia,
24–27 October 2012; Volume 1, pp. 565–568.
42. Alhendawi, K.M.; Baharudin, A.S. String Matching Algorithms (SMAs): Survey & Empirical Analysis. J. Comput. Sci. Manag. 2013,
2, 2637–2644.
43. The Canterbury Corpus. Available online: http://corpus.canterbury.ac.nz/ (accessed on 21 December 2020).
44. Compeau, P.; Pevzner, P. Bioinformatics Algorithms: An Active Learning Approach; Active Learning Publishers: La Jolla, CA, USA,
2015; Volume 1, ISBN 0990374602.
45. Markić, I.; Štula, M.; Jukić, M. Pattern Searching in Genome. Int. J. Adv. Comput. Technol. 2018, 10, 36–46.
46. Anabarilius grahami isolate AG-KIZ scaffold371_cov124, whole genome sh—Nucleotide—NCBI.
47. Chelonia mydas unplaced genomic scaffold, CheMyd_1.0 scaffold1, whole—Nucleotide—NCBI.
48. Escherichia coli strain LM33 isolate patient, whole genome shotgun seq—Nucleotide—NCBI.
49. Macaca mulatta isolate AG07107 chromosome 19 genomic scaffold ScNM3vo_—Nucleotide—NCBI.
50. The Canterbury Corpus—The King James Version of the Bible. Available online: https://corpus.canterbury.ac.nz/descriptions/.
(accessed on 13 February 2020).
51. Boyer, R.S.; Moore, J.S. A fast string searching algorithm. Commun. ACM 1977, 20, 762–772. [CrossRef]
52. Knuth, D.E.; Morris, J.H.; Pratt, V.R.; Morris, J.H., Jr.; Pratt, V.R. Fast Pattern Matching in Strings. SIAM J. Comput. 1977, 6, 323–350.
[CrossRef]
53. Apostolico, A.; Crochemore, M. Optimal canonization of all substrings of a string. Inf. Comput. 1991, 95, 76–95. [CrossRef]
54. Sunday, D.M. A very fast substring search algorithm. Commun. ACM 1990, 33, 132–142. [CrossRef]
55. Horspool, R.N. Practical fast searching in strings. Softw. Pract. Exp. 1980, 10, 501–506. [CrossRef]
56. Hakak, S.; Kamsin, A.; Shivakumara, P.; Idris, M.Y.I.; Gilkar, G.A. A new split based searching for exact pattern matching for
natural texts. PLoS ONE 2018, 13, e0200912. [CrossRef]
57. Powers, D.M.W. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. Hum.
Commun. Sci. SummerFest 2007, 24. Available online: https://csem.flinders.edu.au/research/techreps/SIE07001.pdf (accessed on
22 March 2020).
58. National Center for Biotechnology Information. Available online: https://www.ncbi.nlm.nih.gov/ (accessed on 15 August 2019).
59. Wheelan, C. Naked Statistics: Stripping the Dread from the Data; WW Norton & Co.: New York, NY, USA, 2013; ISBN 978-0-39307-195-5.
60. Barrett, P. Euclidean Distance Raw, Normalized, and Double-Scaled Coefficients. 2005. Available online: https://www.pbarrett.
net/techpapers/euclid.pdf (accessed on 16 September 2020).
61. Anton, H. Elementary Linear Algebra, 11th ed.; Wiley: New York, NY, USA, 2019; ISBN 978-1-119-62569-8.
62. Rodgers, J.L.; Nicewander, W.A. Thirteen Ways to Look at the Correlation Coefficient. Am. Stat. 1988, 42, 59–66. [CrossRef]
63. Raw Data for Entropy Based Approach in Selection Exact String Matching Algorithms. Available online: https://www.dropbox.
com/t/kXKUZeIIVpw3hU5O (accessed on 3 November 2020).