Enhancement of Hakak's Split-Based Searching Algorithm Through Multiprocessing
ISSN No:-2456-2165
Abstract:- One of the recent string-matching algorithms classified as a Hybrid Boyer-Moore Approach is Hakak’s Split-Based Searching Algorithm, which works by dividing the pattern into two parts while concentrating most of the searching process on the second half of the pattern. However, the algorithm calls for optimization of its searching phase since it still employs single shifting whenever a mismatch is met. This paper presents an enhanced version of the algorithm that applies multiprocessing, emphasizing its ability to make use of multiple CPU (Central Processing Unit) cores in finding occurrences of the pattern during the searching phase. Through this, concurrency and parallelism can be reached in the existing algorithm, which was limited to using a single processor prior to enhancement. In conclusion, this study successfully enhanced the existing algorithm in terms of time complexity by maximizing the usage of memory resources.

Keywords:- Hakak’s Split-Based Searching Algorithm, String Matching, Multiprocessing.

I. INTRODUCTION

String matching is defined as the process of searching whether a specific group of characters or ‘pattern’ exists in a group of strings or ‘text’. With the continuous amount of information piling up on several databases, many problems may arise with regard to the resources to be spent when executing steps of searching, retrieving, and interpreting text information. Existing string-matching algorithms play a significant role in solving real-world problems in various fields of application such as text mining; this type of algorithm specializes in analyzing information, which can be used by the following: Intrusion Detection Systems, Search Engines, Plagiarism Detection, Bioinformatics, and others. Among the widely known string-matching algorithms, many enhanced algorithms have already been created through intensive and thorough research; one of these is Hakak’s Split-Based Searching Algorithm. Hakak’s Split-Based Searching Algorithm takes a unique approach to searching that aims to outperform similar algorithms in terms of the time consumed in finding patterns in a text string.

Hakak’s Split-Based Searching Algorithm’s process starts with the Preprocessing Phase, where it implements a unique approach of finding occurrences in the text by splitting the string pattern into two; only the second half of the pattern will be compared to the text. The end of the preprocessing phase will be followed by the Searching Phase, where the second half of the pattern will be searched in the text string. It also uses the brute force approach when finding the index of the matching pattern. Once a match is verified, it will then compare the first half of the pattern to the part of the text string that immediately precedes the first character of the pattern’s second half. This algorithm presented positive results where it outperforms other string-matching algorithms in terms of time and space efficiency. However, its implementation of single shifting leads to a disadvantage, especially in instances where patterns are to be searched in large texts. To address the problem, an enhancement of Hakak’s Split-Based Searching Algorithm is proposed.

II. REVIEW OF RELATED LITERATURE

String-matching algorithms usually comprise a preprocessing phase and a searching phase, where the latter makes use of a shift value, keeping in mind the number of character comparisons while searching [1]. The same paper emphasizes that the searching phase may be improved by employing other bad character shift functions, such as that of the Berry-Ravindran algorithm, which calculates the shift value from the bad character shift of two consecutive characters. Other findings regarding the searching phase of other algorithms include the Boyer-Moore algorithm having iterative processes of considerable length in searching for the pattern in the text [2]. Since Hakak’s Split-Based Searching Algorithm is considered a Hybrid Boyer-Moore Approach, it still retains the concept of the bad character rule but utilizes the single shifting method. Considering it only makes use of a single block of characters, the searching for the pattern may not be done efficiently. Since Hakak’s Split-Based Searching Algorithm still employs the single shifting technique in order to search different texts, it is hypothesized that multiprocessing or parallelization of the algorithm will help its performance with respect to time and space complexity.

A methodology is proposed in order to achieve multiple executions of the same task through the usage of multiprocessors. Multiprocessing is a kind of technology that utilizes two or more CPU cores within a system, allowing the allocation and sharing of resources between them [3]. Nowadays, multicore technology has shaped the industry’s way of handling and processing data. Technological advancements suggest the exploration and optimization of current string-matching algorithms’ capability of processing texts. Since most exact
Fig. 1: Average running time of the proposed algorithm using different numbers of CPU cores
Fig. 1 shows how the proposed algorithm’s average runtime is directly affected by the number of CPU cores allotted for running it. Among all the core counts used in testing, 4 cores garnered the shortest average runtime at 689.7155 ms (about 0.69 seconds). In addition, slower average runtimes are attributed to using insufficient or excessive numbers of CPU cores. Not using enough cores causes the algorithm to run slowly, since each partition of subtext will be longer and shifting the pattern across it takes a considerable time. Moreover, using too many cores will be like running the algorithm sequentially, since there is a time overhead when spawning each new child process. Running the algorithm on 4 CPU cores was found to be the best configuration for this algorithm: relative to its average runtime, the 2-core run at 788.9915 ms (about 0.79 seconds) took 14.394% longer and the 8-core run at 816.9395 ms (about 0.82 seconds) took 18.446% longer.
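The two percentages quoted for Fig. 1 can be reproduced from the reported averages if each gap is taken relative to the 4-core runtime. The short check below is only an illustration of that arithmetic; the helper name `percent_longer` is hypothetical and not part of the study.

```python
# Average runtimes in milliseconds, as reported for Fig. 1.
runtime_ms = {2: 788.9915, 4: 689.7155, 8: 816.9395}


def percent_longer(other_ms: float, best_ms: float) -> float:
    """Extra time of `other_ms`, expressed as a percentage of `best_ms`."""
    return (other_ms - best_ms) / best_ms * 100


best = runtime_ms[4]
for cores in (2, 8):
    gap = percent_longer(runtime_ms[cores], best)
    print(f"{cores} cores: {gap:.3f}% longer than 4 cores")
```

Taking the 4-core average as the base is what yields 14.394% and 18.446%; dividing by the slower runtime instead would give smaller figures.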
As shown in Table 1, the Proposed Algorithm was able to surpass Hakak’s Split-Based Searching Algorithm in terms of the average running time based on 10 rounds. Specifically, patterns with lengths of 4 to 7 characters showed a larger gap in milliseconds compared to other pattern lengths, which means that the Proposed Algorithm is at its best performance when searching for patterns of those lengths. The researchers also observed that the number of occurrences of a pattern greatly affects the searching performance.
Fig. 2 shows the existing algorithm’s memory consumption throughout the entire process. The algorithm started from 0 MiB and at once allotted 20 MiB for the whole duration of the process. Also, no additional memory was used in the latter part of the string-matching algorithm. This result is due to the splitting of the pattern and the allotting of most of the process to the second half only. Furthermore, the existing algorithm’s motivation includes eliminating the creation of a shifting table in the preprocessing phase, thus making use only of a single shifting method when a mismatch is met in the searching phase. Compared to traditional string-matching algorithms, Hakak’s Split-Based Searching Algorithm uses minimal memory while also addressing the concern of time consumed. The existing algorithm may still be enhanced by reducing the time consumed while at the same time maximizing the use of the resources the computer has.
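The behaviour described above (no shifting table, the second half of the pattern compared first, a shift of one position on every mismatch) can be illustrated with a minimal sequential sketch. It is written only from the description in this paper and is not the reference implementation of Hakak’s algorithm; the function name `split_search` and the choice of splitting at `len(pattern) // 2` are assumptions.

```python
def split_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of `pattern` in `text`."""
    m = len(pattern)
    if m == 0 or m > len(text):
        return []

    # Preprocessing: split the pattern in two. No shift table is built.
    half = m // 2  # assumed split point
    first, second = pattern[:half], pattern[half:]

    matches = []
    # `i` is the candidate position of the second half in the text.
    # It advances by exactly one position each time (single shifting).
    for i in range(half, len(text) - len(second) + 1):
        # Brute-force comparison of the second half against the text.
        if text[i:i + len(second)] == second:
            # Only then compare the first half with the text just before it.
            if text[i - half:i] == first:
                matches.append(i - half)
    return matches
```

For example, `split_search("abracadabra", "abra")` returns `[0, 7]`. Apart from the list of results, the sketch keeps only the two pattern halves, which is in line with the flat memory profile reported for the existing algorithm.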
Fig. 3 displays the memory consumption of the proposed algorithm, emphasizing the use of CPU cores or child processes. The black line portrays the memory consumed by the main CPU, which is 20 MiB. Also, no added memory is used in the main CPU because of its child processes. Before starting the multiprocessing, a specific CPU core, portrayed as the blue line, was initialized as the Manager, which handles the shared memory allocation for the list of results. After a time, 4 child processes were spawned, reflecting the 4 CPU cores allotted for multiprocessing. These child processes begin their task of [...] them are properly initialized because of the join() method. Each child process is assigned an index from the subtexts provided, which serves as its starting point for searching. They work concurrently, but if any child process finds a match, even if the other processes are not finished, it appends the results to the Manager at once. The Manager’s memory increased by a small amount because of the results that were combined from the child processes.
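The arrangement described for Fig. 3 (a Manager holding a shared list of results, child processes that each start searching from their own index, and join() calls in the parent) might look roughly like the sketch below, which uses Python’s multiprocessing module. The paper names only the Manager, the child processes, and join(); the function names `chunk_bounds`, `search_chunk`, and `parallel_search`, the partitioning rule, and the way each child reads slightly past its own range to catch matches that straddle a boundary are all assumptions made for illustration.

```python
from multiprocessing import Manager, Process


def chunk_bounds(n: int, workers: int) -> list[tuple[int, int]]:
    """Split the start positions 0..n-1 into at most `workers` ranges."""
    size = max(1, -(-n // workers))  # ceiling division, never zero
    return [(s, min(s + size, n)) for s in range(0, n, size)]


def search_chunk(text, pattern, start, stop, results):
    """Append to `results` every match that begins in text[start:stop]."""
    m = len(pattern)
    half = m // 2  # assumed split point of the pattern
    first, second = pattern[:half], pattern[half:]
    # A match may extend past `stop`, but it must fit inside the text.
    last_start = min(stop, len(text) - m + 1)
    for s in range(start, last_start):  # single shifting
        # Second half first, then the first half, as in the existing algorithm.
        if text[s + half:s + m] == second and text[s:s + half] == first:
            results.append(s)  # reported as soon as it is found


def parallel_search(text: str, pattern: str, workers: int = 4) -> list[int]:
    """Search `text` with one child process per range of start positions."""
    if not pattern or len(pattern) > len(text):
        return []
    with Manager() as manager:
        results = manager.list()  # shared list owned by the Manager process
        children = [
            Process(target=search_chunk, args=(text, pattern, a, b, results))
            for a, b in chunk_bounds(len(text), workers)
        ]
        for child in children:
            child.start()
        for child in children:
            child.join()  # wait until every child has finished
        return sorted(results)


if __name__ == "__main__":
    print(parallel_search("abracadabra" * 1000, "abra", workers=4)[:4])
```

Because each child is responsible for a disjoint range of start positions, no match is reported twice, and results arrive in the shared list in whatever order the children find them, hence the final sort. The sketch passes the whole text to every child for simplicity; a real implementation would likely hand each child only its own subtext.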
ACKNOWLEDGMENT