
Performance Comparison of Parallel Quick and Merge Sorting Algorithms on Architecture with Shared Memory

Omar Sokolović, Emir Bećirović and Mirza Ohranović
Faculty of Electrical Engineering, University of Sarajevo
Sarajevo, Bosnia and Herzegovina
[email protected]
[email protected]
[email protected]

Abstract — Sorting algorithms are widely used in more complex algorithms that rely on the correctness of their output. Slowly but surely, the sequential approach is losing its significance because of modern hardware constraints. Programs and algorithms must be redesigned and adjusted to use the full potential of new hardware, and they should be created using a parallel approach. Quick-Sort and Merge-Sort are sorting algorithms that are easy to understand, well studied in parallel algorithms theory, and representative of the rich class of divide and conquer methods. In this paper, a comparison of the performance of sequential and parallel implementations of the Merge-Sort and Quick-Sort algorithms is presented. This is done with the parallel programming platform OpenMP, running on mainstream multi-core computers. Performance was measured on multiple Intel i3, Intel i5 and Intel i7 processors and an Intel Xeon E5-2676 CPU. The paper also describes the OpenMP environment and the way the mentioned algorithms work, and illustrates the results, speedups and comparisons.

Keywords — Sorting algorithm; Parallel Quick-Sort; Parallel Merge-Sort; Speedup

I. INTRODUCTION
Sorting is one of the most critical problems in computer science; sorting algorithms are frequently used by computer scientists in search algorithms for picking relevant results, sorting large amounts of data, converting data and producing human-readable output.
Because of their wide usage and their inclusion in other, more complex algorithms, it is mandatory that they produce correct results and do so as fast as possible. For all these reasons many sorting algorithms have been developed, such as Quick-Sort, Merge-Sort, Selection-Sort, Insertion-Sort, etc. Quick-Sort and Merge-Sort were chosen because they are efficient divide and conquer sorting algorithms and are easier to understand than other divide and conquer methods.
OpenMP was chosen for parallelization in this experiment because it is easier to understand than various other thread libraries, works on a large number of shared memory computers, and is standardized and well documented.

II. SORTING ALGORITHMS AND THEIR PARALLELIZATION

A. Quick-Sort
Quick-Sort is an efficient sorting algorithm, serving as a systematic method for placing the elements of an array in order. Developed by Tony Hoare in 1959 and published in 1961, it is still a commonly used sorting algorithm. When implemented well, it can be about two or three times faster than its main competitors, Merge-Sort and Heap-Sort [1]. Mathematical analysis of Quick-Sort shows that, on average, the algorithm takes O(n log n) comparisons to sort n items. In the worst case, it makes O(n^2) comparisons, though this behavior is rare.
Quick-Sort is a divide and conquer algorithm. It first divides the main array into two smaller subarrays: the low elements and the high elements. Quick-Sort can then recursively sort the subarrays. The steps are:
1. Pick an element, called a pivot, from the array.
2. Partition the array into two subarrays: a lower subarray containing numbers smaller than the pivot, and an upper subarray containing numbers larger than or equal to the pivot.
3. Recursively apply the above steps to the subarray of elements with smaller values and separately to the subarray of elements with greater values.
The base case of the recursion is an array of size zero or one, which is in order by definition and therefore never needs to be sorted. The final sorted result is the concatenation of the sorted lower subarray, the pivot, and the sorted upper subarray.
The pivot selection and partitioning steps can be done in several different ways, and the choice of a specific implementation scheme greatly affects the performance of the algorithm. The pivot can be selected randomly, as the first element of the array, as the last element, as the element at the middle index of the partition, or as the median of the first, middle and last elements of the partition. For the needs of this experiment, the pivot was chosen as the element at the middle index of the partition.

B. Parallel Quick-Sort
The parallel Quick-Sort algorithm can be implemented in different ways; this section summarizes the one used for this experiment. The main idea is to partition the original array using a single thread and then assign the lower subarray to the same thread and the upper subarray to another thread for further partitioning. This is the case when two threads are available; when four threads are available, this step is repeated one more time.

Fig. 1. Visualization of parallel Quick-Sort [2]

C. Merge-Sort
The Merge-Sort algorithm, a main competitor of efficient sorting algorithms like Quick-Sort, is one of the most efficient algorithms for sorting the elements of a given array. The algorithm was developed by John von Neumann in 1945; a bottom-up version was published by Goldstine and von Neumann in 1948 [3]. Time complexity analysis shows that Merge-Sort has O(n log n) time complexity in the best case, and the time complexity is the same for the worst and average cases [4].
The problem is that this algorithm is memory inefficient: every time the array is divided, two new arrays are allocated in memory. The Quick-Sort algorithm, in contrast, works on the same array during its entire execution. Because of that, Merge-Sort is worse than Quick-Sort in terms of data caching.
Just like Quick-Sort, Merge-Sort is a divide and conquer algorithm. It divides the array into two subarrays of the same length and recursively calls itself on both subarrays. The algorithm divides those subarrays into new subarrays until their length is 0 or 1, at which point they are sorted by definition. When the subarrays are sorted, they are merged [5].
The steps of the Merge-Sort algorithm are:
1. Find the middle point of the main array A, where it will be divided into two halves: subarray B from the first to the middle element of A, and subarray C from the middle to the last element of A:
middle = (first(A) + last(A)) / 2;
2. Call Merge-Sort on subarray B:
merge-sort(A, first, middle);
3. Call Merge-Sort on subarray C:
merge-sort(A, middle, last);
4. Merge the two sorted subarrays B and C:
merge(A, first, middle, last);
These steps describe the basic 2-way algorithm, where the array is divided into 2 subarrays. The Merge-Sort algorithm can also be k-way, where k is the number of subarrays obtained by dividing the array.

D. Parallel Merge-Sort
The idea of the parallelization of Merge-Sort is to call the sequential version of the algorithm on both subarrays obtained from the given array, where the Merge-Sort for each subarray is assigned to a different thread. In contrast, the Quick-Sort parallelization is based on a recursive call of the algorithm itself, but on different threads.
The steps of the parallel Merge-Sort algorithm are the same as in the basic sequential algorithm; the only difference is that steps 2 and 3 are performed in parallel on two different threads. The expected speedup, where speedup is the quotient of the execution time on 1 thread and the execution time on 2 threads, is slightly less than 2 because of the overhead introduced by parallelization. Measurements and performance comparisons are shown in a later section.
The problem with the implementation used in the experiment is that it is not scalable, meaning performance will not improve by using more threads. Following this concept, a 3-way Merge-Sort should be used for 3 threads, a 4-way for 4 threads, and in general a K-way Merge-Sort for K threads. In theory, when using K threads and a K-way Merge-Sort, the speedup should be close to K. There are other implementations that may or may not be scalable.
It is important to note that in a large number of Merge-Sort implementations, the function that merges the sorted subarrays is not parallel.
An improvement in the parallelization of Merge-Sort can be achieved by parallelizing that merging operation. This has been done by converting into OpenMP a platform-specific technique developed for the .NET Task Parallel Library [6].

III. TESTING ENVIRONMENT
Various CPU types with different clock speeds and cache memory were used to set up the testing environment. Four different types of Intel CPUs were used: Intel i3-3217U 1.80 GHz, Intel i5-5200U 2.20 GHz, Intel i7-7200U 2.70 GHz and Intel Xeon E5-2676 v3 2.40 GHz on the Amazon Web Services EC2 platform. Amazon EC2 infrastructure is built on Intel Xeon E5 processors, and the M4 instance type was used for the purpose of this paper. Amazon offers a variety of cloud computing services. The M4 instance type uses the Intel Xeon E5-2676 v3 2.40 GHz, and the m4.2xlarge instance, as stated in the official Amazon Web Services documentation, is equipped with four physical cores of this processor and two logical cores per physical core [8].
The operating systems used for the testing environment were Windows 10 for the Intel i3, i5 and i7 CPUs and Ubuntu 16.04 for the Intel Xeon CPU. All of the CPUs used support multi-threaded processing, which was crucial for the purpose of this paper.
As the algorithms were implemented in the C++ programming language, the testing environment used C++ compilers with support for the Open Multi-Processing (OpenMP) application programming interface. OpenMP supports multi-platform shared memory multiprocessing, which was crucial in the process of testing our thesis.
Having set up the processors with multi-thread support on the hardware side and the compilers on the software side, the testing environment was ready for testing the hypothesis.

Fig. 2. Results on i3 and i5 using two threads

IV. EXPERIMENT MEASUREMENTS AND PERFORMANCE COMPARISONS
The algorithms mentioned earlier were executed multiple times on the architectures described before, with different parameters. The results and time measurements presented here are averages of the obtained results. Speedup and efficiency are the main metrics used in these experiments. Speedup is defined as the quotient of the execution time using one thread and the execution time using K threads, where K is the number of threads used for parallel execution. Efficiency is defined as the quotient of speedup and the number of threads used.

A. Experiment Results for Quick-Sort
For this experiment, an array of 117,964,799 unsorted random integer numbers between 1 and 117,964,799 was used. That is approximately 460,800 KB.
Fig. 2 and fig. 3 show the results obtained after execution of the cstdlib qsort function, the sequential Quick-Sort, and the parallel Quick-Sort algorithm. The results shown in fig. 2 were obtained using Intel Core i3 and Intel Core i5 processors, while those shown in fig. 3 were obtained using Intel Core i7 and Intel Xeon E5 processors.

Fig. 3. Results on i7 and Xeon E5 using two threads

Fig. 4 and fig. 5 show the results of executing the same algorithms on the same processors when four threads are used for the parallel Quick-Sort algorithm.

Fig. 4. Results on i3 and i5 using four threads

Fig. 5. Results on i7 and Xeon E5 using four threads

The results for the Quick-Sort algorithm when two threads are used are as expected. The speedups are slightly less than 2 because of overhead work in parallel algorithms and because of the partitioning operation, which must be done sequentially. Time measurements for the cstdlib Quick-Sort (qsort function) are added as a reference point.
When two threads are used, the speedups on the Intel i3, i5, i7 and Xeon E5 vCPUs are, respectively, 1.87, 1.71, 1.74 and 1.91; parallel Quick-Sort executed on two threads is thus 87%, 71%, 74% and 91% faster than the sequential version. The corresponding efficiencies are 0.935, 0.855, 0.87 and 0.955, or in percentages 93.5%, 85.5%, 87% and 95.5%. The Intel Xeon E5 is used most efficiently, as expected, and the Intel Core i5 least efficiently in this case.
When four threads are used, the results are not as expected, except for the Intel Xeon E5. The speedups in this case should be slightly less than 4, but they are 2.47, 2.35 and 2.39 for the Intel i3, i5 and i7 processors, respectively, while the speedup for the Intel Xeon E5 is 3.52. The efficiencies for these processors are 0.6175, 0.5875, 0.5975 and 0.88, respectively, or in percentages 61.75%, 58.75%, 59.75% and 88%. The Intel Xeon E5 is used best by far, and the Intel Core i5 is the worst-used processor. It can be seen that the Intel Core processors used in this experiment have poor speedups and efficiencies in this case, while the Intel Xeon E5 shows good results. That is because every Intel Core processor used in this experiment has two physical cores, with two logical cores per physical core. The Xeon E5, on the M4 Amazon EC2 instance, is the only processor used that has four physical cores, and physical cores are needed for true thread-level parallelism.

B. Experiment Results for Merge-Sort
For this experiment, an array of 1,024,000 unsorted random integer numbers between 1 and 1,024,000 was used. That is approximately 4,000 KB. Merge-Sort is memory inefficient, as mentioned earlier, and an array with more elements could not be used under the current conditions. One temporary array of the same length is needed for the Merge-Sort algorithm, so approximately 8,000 KB of memory is allocated at the start of the experiment; afterwards, in every recursion step, the same amount of memory is allocated again.
Fig. 6 shows the results obtained after execution of the sequential Merge-Sort (marked green) and the parallel Merge-Sort (marked blue) on the Intel Core i3 and i5 processors. For the parallel Merge-Sort, two threads are used. Time is measured in milliseconds.

Fig. 6. Results obtained by execution on Intel Core i3 and i5 processors

The results obtained by execution on the Intel Core i7 and Intel Xeon E5 processors are shown in fig. 7.

Fig. 7. Results obtained by execution on Intel Core i7 and Intel Xeon E5 processors

The obtained results are as expected: on every processor the speedup is slightly less than 2 when 2 threads are used for algorithm execution. The speedups on the Intel i3, i5, i7 and Xeon E5 processors are, respectively, 1.88, 1.69, 1.80 and 1.90; the parallel Merge-Sort algorithm executed on two threads is thus 88%, 69%, 80% and 90% faster than the sequential one. The efficiencies on the Intel i3, i5, i7 and Xeon E5 processors are, respectively, 0.94, 0.845, 0.9 and 0.95, or in percentages 94%, 84.5%, 90% and 95%. The Intel Xeon E5 is used best and the Intel Core i5 worst. In theory the speedup should be 2, but because of the overhead work present in parallel programs it cannot be; part of that overhead is, for example, spawning additional threads and assigning work.
Specific to this algorithm is the merging operation, which merges two sorted subarrays into one array in which all elements are in the correct order. This is a noticeable sequential part of the Merge-Sort algorithm. It has been reported that Merge-Sort with a parallelized merging operation, using the technique mentioned earlier, can be up to 25% faster even than the Quick-Sort algorithm [6].
C. Wrong Readings and Measurements
Many factors affect the results of an experiment like this one, and they can lead to wrong results and conclusions. Some of those factors are the processor being busy with other processes, a too-slow sequential algorithm, and the type of caching. Fig. 8 shows readings for the Quick-Sort algorithm executed using two threads that were taken initially, before checking how the mentioned factors affect execution. Those readings are wrong.

Fig. 8. Wrong readings

The speedup here is 3.37 and the efficiency is 1.685, or 168.5%. The conclusion would be that the speedup is super-linear. Super-linear speedup is speedup greater than K when K processing units are used; it happens rarely and in low-level computations [7]. In this case, processor resources were being taken by other processes in the system. After shutting down the unnecessary processes, the real results, shown in fig. 2 in this case, were obtained.

V. CONCLUSION
Achieving high performance while implementing complex systems is one of the most important aspects of software development at this moment in history. Since hardware has become limited in terms of achieving significantly better performance, adaptation and adjustment of computer software to parallel processing is a must.
The implementation of sorting algorithms, which are part of almost any sophisticated software, was the core of this paper. Two popular sorting algorithms, Merge-Sort and Quick-Sort, were adjusted and redesigned in a manner which allowed them to use parallelization in order to achieve higher performance. Since these algorithms are efficient divide and conquer algorithms, they were the most natural choice in terms of understanding the parallelization process.
The goal was to test and assess the performance behavior of these algorithms on multiple multi-core processors using various numbers of threads at the same time. As mentioned, four different types of Intel processors were used during this experiment: i3, i5, i7 and Xeon.
The process was composed of multiple steps. First, the algorithms were implemented in C++ and adjusted to the OpenMP API. The testing environment was set up on two different operating systems: Windows 10 on laptops and Ubuntu 16.04 on an Amazon EC2 instance. Afterwards, the algorithms were executed multiple times and the records were tracked during the whole process.
For the experiments run on two threads, for both algorithms, the expected results were in line with the actual results. In theory the expected speedup should be 2, but considering the overhead work present in parallel programs, the speedup was slightly lower.
For the experiments run on four threads, the Merge-Sort algorithm could not be scaled in a manner that keeps performance improving beyond two threads. The Quick-Sort algorithm, on the other hand, was suitable for scaling, so a performance improvement was observed when running the algorithm on four threads.
Based on the testing results, choosing parallelization while implementing sorting algorithms that rely on the divide and conquer strategy was an optimal choice. The results were close to the theoretical expectations, and the possibility of achieving better performance through software structure and implementation on the same hardware was more than evident.

REFERENCES
[1] "Quicksort", Wikipedia, 04-Jan-2018. [Online]. Available: https://en.wikipedia.org/wiki/Quicksort. [Accessed: 13-Jan-2018].
[2] N. N. Xiong, International Journal of Advanced Science and Technology. [Online]. Available: https://www.sersc.org/journals/IJAST/. [Accessed: 13-Jan-2018].
[3] "Merge sort", Wikipedia, 09-Jan-2018. [Online]. Available: https://en.wikipedia.org/wiki/Merge_sort. [Accessed: 13-Jan-2018].
[4] "Time Complexities of all Sorting Algorithms", GeeksforGeeks, 08-May-2017. [Online]. Available: https://www.geeksforgeeks.org/time-complexities-of-all-sorting-algorithms/. [Accessed: 13-Jan-2018].
[5] "Merge Sort", GeeksforGeeks, 14-Oct-2017. [Online]. Available: https://www.geeksforgeeks.org/merge-sort/. [Accessed: 13-Jan-2018].
[6] A. Radenski, "Shared Memory, Message Passing, and Hybrid Merge Sorts for Standalone and Clustered SMPs", International Conference on Parallel and Distributed Processing Techniques and Applications, 2011.
[7] "Speedup", Wikipedia, 29-Sep-2017. [Online]. Available: https://en.wikipedia.org/wiki/Speedup#Super-linear_speedup. [Accessed: 16-Jan-2018].
[8] "Amazon EC2 Instance Types – Amazon Web Services (AWS)", Amazon Web Services, Inc. [Online]. Available: https://aws.amazon.com/ec2/instance-types/#instance-details. [Accessed: 16-Jan-2018].
