Enhancement of Hakak's Split-Based Searching Algorithm Through Multiprocessing

Volume 7, Issue 4, April – 2022 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Enrico Sebastian Digman, Robbie Shane Orantoy, John Alfred Velasco,
Mark Christopher Blanco, Richard Regala, Dan Michael Cortez
Computer Science Department
Pamantasan ng Lungsod ng Maynila (University of the City of Manila)
Manila, NCR, Philippines

Abstract:- One of the recent string-matching algorithms classified as a Hybrid Boyer-Moore Approach is Hakak’s Split-Based Searching Algorithm, which works by dividing the pattern into two parts while concentrating most of the searching process on the second half of the pattern. However, the algorithm calls for optimization of its searching phase since it still employs single shifting whenever a mismatch is met. This paper presents an enhanced version of the algorithm that applies multiprocessing, making use of multiple CPU (Central Processing Unit) cores to find occurrences of the pattern during the searching phase. Through this, concurrency and parallelism can be reached in an algorithm that was limited to a single processor prior to enhancement. In conclusion, this study successfully enhanced the existing algorithm in terms of time complexity by maximizing the usage of memory resources.

Keywords:- Hakak’s Split-Based Searching Algorithm, String Matching, Multiprocessing.

I. INTRODUCTION

String matching is defined as the process of searching whether a specific group of characters, or ‘pattern’, exists in a group of strings, or ‘text’. With the continuously growing amount of information piling up in databases, many problems may arise with regard to the resources spent when searching, retrieving, and interpreting text information. Existing string-matching algorithms play a significant role in solving real-world problems in various fields of application such as text mining; this type of algorithm specializes in analyzing information, which can be used by Intrusion Detection Systems, Search Engines, Plagiarism Detection, Bioinformatics, and others. Among the widely known string-matching algorithms, many enhanced algorithms have already been created through intensive and thorough research; one of these is Hakak’s Split-Based Searching Algorithm. It takes a unique approach to searching that aims to outperform similar algorithms in terms of the time consumed in finding patterns in a text string.

Hakak’s Split-Based Searching Algorithm starts with the Preprocessing Phase, where it implements a unique approach of finding occurrences in the text by splitting the pattern into two; only the second half of the pattern will be compared to the text. The preprocessing phase is followed by the Searching Phase, where the second half of the pattern is searched in the text string. It also uses the brute force approach when finding the index of the matching pattern. Once a match is verified, it then compares the first half of the pattern to the part of the text string that lies immediately before the first character of the pattern’s second half. This algorithm presented positive results where it outperforms other string-matching algorithms in terms of time and space efficiency. However, its implementation of single shifting leads to a disadvantage, especially in instances where patterns are to be searched in large texts. To address the problem, an enhancement of Hakak’s Split-Based Searching Algorithm is proposed.

II. REVIEW OF RELATED LITERATURE

String matching algorithms usually comprise a preprocessing phase and a searching phase, where the latter makes use of a shift value that keeps in mind the number of character comparisons while searching [1]. The same paper emphasizes that the searching phase may be improved by employing other bad character shift functions, such as that of the Berry-Ravindran algorithm, which calculates the shift value from the bad character shift of two consecutive characters. Other findings regarding the searching phase of other algorithms include the Boyer-Moore algorithm having iterative processes of considerable length in searching for the pattern in the text [2]. Since Hakak's Split-Based Searching Algorithm is considered a Hybrid Boyer-Moore Approach, it still retains the concept of the bad character rule but utilizes the single shifting method. Considering it only makes use of a single block of characters, the search for the pattern may not be done efficiently. Since Hakak's Split-Based Searching Algorithm still employs the single shifting technique in order to search different texts, it is hypothesized that multiprocessing or parallelization of the algorithm will be helpful for its performance with respect to time and space complexity.

A methodology is proposed in order to achieve multiple executions of the same task through the usage of multiprocessors. Multiprocessing is a kind of technology that utilizes two or more CPU cores within a system, allowing the allocation and sharing of resources between them [3]. Nowadays, the relevance of multicore technology has shaped the industry's way of handling and processing data. Technological advancements suggest the exploration and optimization of current string matching algorithms’ capability of processing texts. Since most exact
string matching algorithms make use of a uniprocessor, intensive research and enhancements are needed in order to surpass the limitations encountered by these algorithms, since they are not capable of a high degree of parallelization of tasks.

The prominent challenge in the development of string matching algorithms is the need for high and efficient performance, since applications such as network intrusion detection systems need to keep up with today's network speed. Shortcomings in identifying encountered attacks are highly recommended to be addressed. Also, the relevance of big data in the industry requires efficient handling of big blocks of data [4]. Parallel processing is also highlighted as a technique that varies between multiprocessing and multithreading. Multiprocessing works by the separation of processes for execution of the same tasks in the program, whereas multithreading breaks down large tasks into many smaller tasks. An example of a study where it was implemented is when the Knuth-Morris-Pratt, Karp-Rabin, and Boyer-Moore algorithms were integrated into a distributed multiprocessing environment, introducing the concept of parallelization of exact string matching algorithms. These were applied in Snort's string matching engine utilizing Local Area Networks [5].

III. PROPOSED METHOD

In order to attain the objective of improving the existing algorithm's time complexity, this paper proposes the use of multiprocessing at the algorithm's searching phase.

A. Implementing multiprocessing in the searching phase of Hakak’s Split-Based Searching Algorithm

Since the existing algorithm made use of the divide and conquer approach in the form of splitting, the researchers also took into consideration the same concept but through a different method, which is multiprocessing. [4] also mentioned that multiprocessing works by separation of processes for execution of the same tasks in the program. Since most CPU cores are usually left unused by a computer, it is hypothesized that these can be put to work rather than remaining idle. Time consumption is still the primary focus of a string-matching algorithm, but with today’s technology, algorithms must also be flexible and able to adapt to any computer’s specification. The experiment shall be done in PyCharm Community Edition 2021.3.2 on an AMD Ryzen 5 3500U Processor with 8 cores and 4 GB RAM (Random Access Memory) using Windows 10. To implement this in the experimental research design, the multiprocessing package of Python will be used to give the researchers the ability to maximize multiple processors in a single computer. The Process class is specifically used to initialize the processes for each of the CPU cores provided, and the join() method is then used so that all the processes are completed first before continuing to the next line of code.

B. Comparing the performance of Proposed Algorithm with Hakak’s Split-Based Searching Algorithm

To verify the efficiency of the proposed algorithm, both algorithms (the existing Hakak’s Split-Based Searching Algorithm and the proposed enhanced version) are to be compared side by side in terms of their time and memory consumption. The dataset that will be used for experimentation is Hakak, et al.’s S1 Dataset, specifically the bible.txt file. With regard to the total runtime per pattern search, both algorithms will be run ten (10) times using different pattern lengths: short patterns (having a length less than 4), medium patterns (having a length of 4 to 7), and long patterns (having a length greater than 7). The Python module datetime will be used to measure the average runtime of the iterative processes performed in this comparison in milliseconds (ms). On the other hand, the quantification of both algorithms’ memory consumption requires another Python module, memory-profiler, using mebibytes (MiB) as the unit of measurement. By performing these methods, it will be possible to check and compare the efficiency of the existing and enhanced algorithms.

C. Identification of essential formulas to be used for the Proposed Algorithm and multiprocessing

$$\mathit{length} = \frac{n}{P} + m_2 - 1 \tag{1}$$
$$\mathit{start} = (i \cdot \mathit{length}) - ((m_2 - 1) \cdot i) \tag{2}$$
$$\mathit{end} = \mathit{start} + \mathit{length} \tag{3}$$
$$\mathit{Position\ to\ Map} = p_2[i_0] - m_2 \tag{4}$$

Multiprocessing, as proposed to enhance the existing algorithm, requires designating substrings of the text to each CPU core, where searching is done simultaneously. To obtain the substring length (formula no. 1), it is important to take note of the number of processors among which the text is distributed ($P$) and the length of the text ($n$). Considering that patterns can potentially be found across the ends of two succeeding substrings, the formula adds the length of the second half of the pattern ($m_2$) minus 1. This way, those areas will be reached, as the formula slightly extends the searching range.

Given that the substring length has already been computed, the starting index for searching at each processor must be found by making use of formula no. 2. The formula subtracts the product of the counter ($i$) and $m_2 - 1$, so that each substring starts at an index where it overlaps with the end of the earlier substring. After executing the earlier formula, formula no. 3 figures out the ending index of the substring assigned to each processor by adding the values computed from the first two formulas.

Following the execution of the earlier formulas, the searching phase will begin to search for pattern matches in a concurrent manner. If the second half of the pattern ($p_2$) is found, formula no. 4 must be used, where $p_2[i_0]$ is the current location of $p_2$ in the substring. Subtracting $m_2$ from the current index gives the index where the first half of the pattern must be mapped in the substring.
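The four formulas above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `chunk_bounds` and `search_chunk` are hypothetical, n / P is taken as integer division, the last substring is extended to the end of the text so that no remainder is skipped, and the pattern is split at its midpoint. The first half is mapped at the match index minus the length of the first half, which coincides with formula no. 4 when both halves have the same length.

```python
def chunk_bounds(n, P, m2):
    """Return one (start, end) pair per processor for a text of length n."""
    base = n // P                            # assumption: integer division for n / P
    length = base + m2 - 1                   # formula (1)
    bounds = []
    for i in range(P):
        start = i * length - (m2 - 1) * i    # formula (2); simplifies to i * base
        end = start + length                 # formula (3)
        if i == P - 1:
            end = n                          # assumption: last chunk absorbs the remainder
        bounds.append((start, min(end, n)))
    return bounds


def search_chunk(text, pattern, start, end):
    """Return start indices of full matches whose second half lies in text[start:end]."""
    m1 = len(pattern) // 2                   # assumption: split at the midpoint
    p1, p2 = pattern[:m1], pattern[m1:]
    m2 = len(p2)
    hits = []
    for s in range(start, end - m2 + 1):     # single shifting, one position at a time
        if text[s:s + m2] == p2:
            first = s - m1                   # index where the first half is mapped
            if first >= 0 and text[first:s] == p1:
                hits.append(first)
    return hits


print(chunk_bounds(20, 4, 2))  # [(0, 6), (5, 11), (10, 16), (15, 20)]
```

Because consecutive substrings overlap by only m2 − 1 characters, an occurrence of the second half can never lie entirely inside an overlap, so each match is reported by exactly one substring.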

IV. RESULTS AND DISCUSSION

A. Comparative Analysis of Initialization of Different No. of CPU Cores

Fig. 1: Average running time of the proposed algorithm using different numbers of CPU cores

Fig. 1 shows how the proposed algorithm’s average runtime is directly affected by the number of CPU cores used. Among all the core counts tested, 4 cores garnered the shortest average runtime at 689.7155 ms (about 0.69 second). Slower average runtimes are attributed to using either too few or too many CPU cores. Not using enough cores causes the algorithm to run slowly since each partitioned subtext is longer, taking considerable time in shifting the pattern. Moreover, using too many cores is like running the algorithm sequentially, since there is a time overhead when spawning each new child process. Running the algorithm on 4 CPU cores, which is found to be the best configuration for this algorithm, is 14.394% and 18.446% faster than running it on 2 cores at 788.9915 ms (about 0.79 second) and on 8 cores at 816.9395 ms (about 0.82 second), respectively, with both percentages measured relative to the 4-core runtime.
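The two percentages can be reproduced from the reported runtimes. The short check below assumes they were obtained by dividing the runtime difference by the 4-core runtime, which is the reading that matches the reported figures; the helper name `percent_slower` is illustrative only.

```python
def percent_slower(slower_ms, faster_ms):
    """Extra runtime of the slower run as a percentage of the faster run."""
    return (slower_ms - faster_ms) / faster_ms * 100


t4 = 689.7155             # 4 cores, reported average (ms)
t2 = 788.9914999999999    # 2 cores, reported average (ms)
t8 = 816.9395000000001    # 8 cores, reported average (ms)

print(round(percent_slower(t2, t4), 3))  # 14.394
print(round(percent_slower(t8, t4), 3))  # 18.446
```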

B. Average running time Results

Average running times are in milliseconds (ms), rounded to four decimal places.

Pattern Length              Text          Hakak’s Split-Based     Proposed Algorithm      No. of
                                          Searching Algorithm     Average Running         Occurrences
                                          Average Running         Time (ms)
                                          Time (ms)
Less than 4 characters      God           1395.9605               914.8524                4040
Less than 4 characters      for           1341.6218               1073.3340               12619
4 to 7 characters           wroth         1308.4778               692.2360                47
4 to 7 characters           finish        1386.2564               691.5103                54
Greater than 7 characters   that Adam     1245.5148               782.0433                1
Greater than 7 characters   continually   1386.0278               834.1507                79

Table 1: Comparison of Hakak’s Split-Based Searching Algorithm and Proposed Algorithm
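Averages such as those in Table 1 come from repeated runs timed with the datetime module, as described in the proposed method. The sketch below shows one way such a measurement could be set up. It is an assumption about the procedure rather than the authors' script; `average_runtime_ms` is a hypothetical helper, and `naive_search` is only a stand-in search function, not Hakak's algorithm.

```python
from datetime import datetime


def average_runtime_ms(search, text, pattern, rounds=10):
    """Average wall-clock time of search(text, pattern) over several rounds, in ms."""
    total_ms = 0.0
    for _ in range(rounds):
        started = datetime.now()
        search(text, pattern)
        total_ms += (datetime.now() - started).total_seconds() * 1000
    return total_ms / rounds


def naive_search(text, pattern):
    """Stand-in search used only to exercise the timer."""
    last = len(text) - len(pattern)
    return [i for i in range(last + 1) if text.startswith(pattern, i)]


sample = "In the beginning God created the heaven and the earth. " * 1000
print(f"{average_runtime_ms(naive_search, sample, 'God'):.4f} ms")
```

The printed value depends on the machine, so no expected output is given.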

As shown in Table 1, the Proposed Algorithm was able to surpass Hakak’s Split-Based Searching Algorithm in terms of the average running time based on 10 rounds. Specifically, patterns with a length of 4 to 7 characters showed a larger gap in milliseconds compared to other pattern lengths, which means that the Proposed Algorithm is at its best performance when searching for patterns of that length. The researchers also observed that the number of occurrences of a pattern greatly affects the searching performance.
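The gap mentioned above can be checked directly from the Table 1 values. The snippet below uses the reported averages rounded to four decimal places; the group labels and the helper `mean_gap_by_group` are illustrative names, not part of the original study.

```python
# (pattern, length group, Hakak average ms, proposed average ms) from Table 1
ROWS = [
    ("God", "short", 1395.9605, 914.8524),
    ("for", "short", 1341.6218, 1073.3340),
    ("wroth", "medium", 1308.4778, 692.2360),
    ("finish", "medium", 1386.2564, 691.5103),
    ("that Adam", "long", 1245.5148, 782.0433),
    ("continually", "long", 1386.0278, 834.1507),
]


def mean_gap_by_group(rows):
    """Mean runtime saved (ms) by the proposed algorithm, per pattern-length group."""
    gaps = {}
    for _pattern, group, hakak_ms, proposed_ms in rows:
        gaps.setdefault(group, []).append(hakak_ms - proposed_ms)
    return {group: sum(values) / len(values) for group, values in gaps.items()}


gaps = mean_gap_by_group(ROWS)
for group, gap in gaps.items():
    print(f"{group}: {gap:.2f} ms")
print(max(gaps, key=gaps.get))  # medium
```

The medium group (4 to 7 characters) has the largest mean gap, at roughly 655 ms, compared with roughly 375 ms for short patterns and 508 ms for long patterns.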

C. Memory Consumption Results

Fig. 2: Analysis of Hakak’s Split-Based Searching Algorithm’s Memory Consumption

Fig. 2 shows the existing algorithm’s memory consumption throughout the entire process. The algorithm started from 0 MiB and at once allotted 20 MiB for the whole duration of the process. Also, no other memory was used in the latter part of the string-matching algorithm. This result happened due to the splitting of the pattern and allotting most of the entire process only to the second half. Furthermore, the existing algorithm’s motivation includes eliminating the creation of a shifting table in the preprocessing phase, thus only making use of a single shifting method when a mismatch is met in the searching phase. Compared to traditional string-matching algorithms, Hakak’s Split-Based Searching Algorithm was designed to use minimal memory while concurrently addressing the concern of time consumed. The existing algorithm may still be enhanced so that the time consumed is reduced while at the same time maximizing the resources the computer has.

Fig. 3: Analysis of Proposed Algorithm’s Memory Consumption

Fig. 3 displays the memory consumption of the proposed algorithm, emphasizing the use of CPU cores or child processes. The black line portrays the memory consumed by the main CPU, which is 20 MiB. Also, no added memory is used in the main CPU because of its child processes. Before starting the multiprocessing, a specific CPU core, portrayed as the blue line, was initialized as the Manager, which handles the shared memory allocation for the list of results. After a time, 4 child processes were spawned, taking into consideration the use of 4 CPU cores allotted for multiprocessing. These child processes begin searching once all of them are initialized, and the join() method makes the main process wait until every child has finished. Each child process is assigned an index from the subtexts provided, which serves as its starting point of searching. They work concurrently, but if any child process finds a match, even if the other processes are not finished, it appends the result to the Manager at once. The Manager’s memory increased by a small amount because of the results that were combined from the child processes.
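The arrangement described for Fig. 3 (a Manager holding the shared result list, four child processes, and join()) could look roughly like the sketch below. It uses only the standard multiprocessing package named in the proposed method. The function names `worker` and `parallel_search` are hypothetical, the sketch searches only for the second half of the pattern, and the chunking assumes integer division, so it should be read as an illustration rather than the authors' code.

```python
from multiprocessing import Manager, Process


def worker(text, p2, start, end, results):
    """Append every index in [start, end) where p2 begins to the shared list."""
    m2 = len(p2)
    for s in range(start, end - m2 + 1):
        if text[s:s + m2] == p2:
            results.append(s)            # reported as soon as it is found


def parallel_search(text, p2, cores=4):
    """Search for p2 with one child process per core (illustrative helper)."""
    n, m2 = len(text), len(p2)
    base = n // cores                    # assumption: integer division
    with Manager() as manager:
        results = manager.list()         # shared list owned by the Manager process
        children = []
        for i in range(cores):
            start = i * base
            end = n if i == cores - 1 else min(start + base + m2 - 1, n)
            child = Process(target=worker, args=(text, p2, start, end, results))
            child.start()
            children.append(child)
        for child in children:
            child.join()                 # wait until every child has finished
        return sorted(list(results))
```

On Windows, where the experiment was run, new processes are started by re-importing the main module, so a call such as `parallel_search(text, "he", cores=4)` needs to sit under an `if __name__ == "__main__":` guard.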

V. CONCLUSION AND RECOMMENDATION

Through the researchers' extensive analysis of the gathered data, they were able to draw these conclusions. First, the single shifting text window is inefficient and may compromise the time consumed in searching for the pattern. Since multiprocessing was introduced in this study, avoiding the use of too many or too few CPU cores allows string matching algorithms to provide a high degree of parallelization of processes. Considering its elements, a Manager can be shared across all processes and is more dynamic compared to shared memory; however, it is slower since it spawns a new child process. Also, big datasets are suitable for shared memory resources in order not to compromise the time for passing the arguments to the CPU cores.

After an in-depth evaluation of the study’s results and conclusion, the researchers recommend implementing other forms of parallelization as enhancements for string matching algorithms, such as multithreading and distributed computing. Also, the use of the GPU may enable advanced parallelization over multiple data sets and is capable of efficient computations. Architectures such as CUDA, MPI (Message Passing Interface), and OpenMP allow programmers to manipulate specific functions for parallelization. When it comes to the proposed algorithm, it may produce efficient results when searching texts such as DNA sequences, since the pattern is more likely to be found in the text. Additionally, the proposed algorithm may perform differently depending on the user's computer specification, specifically the number of CPU cores available. Future experimentation may also consider the programming language to be used when coding different string matching algorithms for comparative analysis, since some languages have specific functions that may help in coding.

ACKNOWLEDGMENT

The researchers would like to express their utmost gratitude to God Almighty for granting them the proper skill, intellect, and enough strength to make this endeavor come to reality. Many thanks go to their parents and kind friends for their undying support and continuous encouragement to keep going in pursuit of completing this research. Finally, the researchers are beyond delighted to acknowledge their research adviser and other professors of the Pamantasan ng Lungsod ng Maynila for sharing their ability as well as supplying motivational words and constructive criticisms, which were a great help throughout the course of carrying out this piece of work.

REFERENCES

[1] Hudaib, A., Al-Khalid, R., Suleiman, D., Itriq, M., & Al-Anani, A. (2008). A fast pattern matching algorithm with two sliding windows (TSW). Journal of Computer Science, 4(5), 393.
[2] Xu, B., Zhou, X., & Li, J. (2006, June). Recursive shift indexing: a fast multi-pattern string matching algorithm. In Proc. of the 4th International Conference on Applied Cryptography and Network Security (ACNS) (pp. 64-73). Springer-Verlag Singapore.
[3] Soewito, B., & Weng, N. (2007, October). Methodology for evaluating DNA pattern searching algorithms on multiprocessor. In 2007 IEEE 7th International Symposium on BioInformatics and BioEngineering (pp. 570-577). IEEE.
[4] Hnaif, A. A., Aldahoud, A., Alia, M. A., Al'otoum, I. S., & Nazzal, D. (2018). Multiprocessing scalable string matching algorithm for network intrusion detection system. International Journal of High Performance Systems Architecture, 8(3), 159-168.
[5] Al-Mamory, S. O. (2012). Speed enhancement of snort network intrusion detection system. Journal of Babylon University, Pure and Applied Science, 20(1), 1

IJISRT22APR1563 www.ijisrt.com 1072
