An OpenCL Method of Parallel Sorting Algorithms For GPU Architecture
S. R. Sathe
Abstract
In this paper, we present a comparative performance analysis of two parallel sorting algorithms: bitonic sort and parallel radix sort. In order to study the interaction between the algorithms and the architecture, we implemented both algorithms in OpenCL and compared their performance with that of the quick sort algorithm, one of the fastest sequential sorting algorithms in practice. In our experiments, we used an Intel Core 2 Duo CPU at 2.67 GHz and an NVIDIA Quadro FX 3800 as the graphics processing unit.
Keywords: GPU, GPGPU, Parallel Computing, Parallel Sorting Algorithms, OpenCL.
1. INTRODUCTION
The GPU (Graphics Processing Unit) [1] is a highly tuned, specialized processor designed specifically for parallel processing at high speed. In recent years, the GPU has evolved into a massively parallel processor for achieving high computing performance. The architecture of the GPU is suitable not only for graphics rendering algorithms but also for general parallel algorithms in a wide variety of application domains.
Sorting is one of the fundamental problems of computer science, and parallel algorithms for sorting have been studied since the beginning of parallel computing. Batcher's O(log^2 n)-depth bitonic sorting network [2] was one of the first methods proposed. Since then, many different parallel sorting algorithms have been proposed [7, 9, 10]. An O(log n)-depth sorting circuit was proposed in [4, 6].
Given the diversity of parallel architectures and the number of parallel sorting algorithms, there is a question of which is the best fit for a given problem instance. The extent to which an application will benefit from these parallel systems depends on the number of cores available and other parameters. Thus, many researchers have become interested in harnessing the power of GPUs for sorting algorithms, and recently there has been increased interest in such research efforts [8, 11, 16]. However, more studies are needed to determine whether a certain algorithm can be recommended for a particular parallel architecture.
In this paper, we present an experimental study of two different parallel sorting algorithms: Bitonic
sort and Parallel Radix sort.
This paper is organized as follows. Section 2 reviews previous work. In Section 3, we present the GPU architecture and the OpenCL programming model. The parallel sorting algorithms are explained in Section 4. Test results and analysis are provided in Section 5. Section 6 concludes our work and outlines future research plans.
2. RELATED WORK
In this section, we review previous work on parallel sorting algorithms. The study of parallel algorithms using OpenCL is still in progress, and not much work has been done on this topic. However, an overview of parallel sorting algorithms is given in [5]. Here we review parallel sorting algorithms with respect to GPU architecture.
A parallel sorting algorithm for general-purpose internal sorting on MIMD machines is presented in [12], where the performance of the algorithm on the Fujitsu AP1000 MIMD supercomputer is discussed. A comparative performance evaluation of parallel sorting algorithms is presented in [13], where the authors implement the algorithms with respect to the architecture of the machine. An on-chip local memory version of radix sort for GPUs has been implemented in [21]; as expected, OpenCL local memory is much faster than global memory. The bitonic sorting algorithm has been implemented using stream processing units and image stream processors in [17, 15].
An O(n) radix sort is implemented in [21]; as reported there, it is roughly twice as fast as the CUDPP [19] radix sort. A quick-sort algorithm for GPUs using CUDA has been implemented in [20], where the results suggest that, given a large data set, quick-sort still gives better performance compared to radix and bitonic sort. A portable OpenCL implementation of the radix sort algorithm is presented in [24], where the authors test radix sort on several GPUs and CPUs. An analysis of parallel and sequential bitonic, odd-even, and rank-sort algorithms for different CPU and GPU architectures is presented in [23], where the authors exploit task parallelism using OpenCL.
3. GPU ARCHITECTURE AND OPENCL PROGRAMMING MODEL

FIGURE 1: The OpenCL device model: a host attached to compute units, each containing processing elements (PE 1 ... PE 8) with their own local memory, all sharing a global memory.
The GPU is programmable using vendor-provided APIs such as NVIDIA's CUDA [18] and the OpenCL specification by the Khronos group [22]. While CUDA targets GPUs specifically, OpenCL targets heterogeneous systems, which may include GPUs and/or CPUs. The OpenCL programming model involves a host program on the host (CPU) side that launches Single Instruction Multiple Threads (SIMT) programs called kernels, which execute on the target device as groups of threads called warps. Although the management of warps is hardware dependent, the programmer can organize the problem domain into work-groups, each consisting of one or more work-items. This is expressed as the ND-Range in the GPU architecture. For more information on managing and optimizing the ND-Range, refer to the OpenCL specification [22]. In summary, the following steps are needed to initialize an OpenCL application.
1. Set up the OpenCL environment: declare an OpenCL context, choose the device type, and create the context and a command queue.
2. Declare buffers and move data between CPU and GPU: declare buffers on the device and enqueue the input data to the device.
3. Compile the kernel at runtime: compile the program from the kernel source, build the program, and define the kernel.
4. Run the program: set the kernel arguments and the work-group size, then enqueue the kernel onto the command queue to execute on the device.
5. Read results back to the host: after the program has run, read the result array back from the device buffer into host memory.
See [25, 26, 27, 22] for more details on this topic.
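As an illustration, the following is a minimal host-side sketch of these five steps in C for an OpenCL 1.x platform. It is a sketch only: the function name run_on_gpu, the kernel name my_kernel, the work-group size of 64, and the int element type are our assumptions, and all error handling is omitted.

#include <CL/cl.h>

/* Minimal OpenCL host program: context, queue, buffer, kernel, launch.
   kernel_src is assumed to hold the kernel source text and "my_kernel"
   its entry point; error checking is omitted for brevity. */
int run_on_gpu(const char *kernel_src, int *data, size_t n)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Step 1: create the context and a command queue */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Step 2: declare a device buffer and enqueue the input data */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(int), NULL, NULL);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(int), data, 0, NULL, NULL);

    /* Step 3: compile the kernel at runtime */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "my_kernel", NULL);

    /* Step 4: set arguments, choose the ND-Range, and enqueue the kernel */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global_size = n;   /* total number of work-items */
    size_t local_size = 64;   /* work-items per work-group (assumed) */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                           0, NULL, NULL);

    /* Step 5: read the result back from the device buffer to host memory */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(int), data, 0, NULL, NULL);

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}

Sketch: Host-side OpenCL setup (illustrative only)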
4. PARALLEL SORTING ALGORITHMS

4.1 Bitonic Sort
Algorithm 1 is a bitonic sort kernel for SIMD architectures where the length of the input data sequence is a multiple of 8. Algorithm 2 is a generalized bitonic sort, and its corresponding kernel is shown in Algorithm 3.
Initially, the host (CPU) distributes the unsorted vector to the GPU cores in the form of work-groups, using the global_size and local_size OpenCL parameters. Alternate work-items in a work-group sort in ascending and descending order. Next, the merging stage is performed and the result is obtained. For more information on these parameters, refer to the OpenCL specification [22]. A sketch of one compare-exchange stage is given below.
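As the paper's Algorithms 1-3 are not reproduced here, the following is a common formulation of one compare-exchange stage of bitonic sort as an OpenCL kernel, given as an illustrative sketch only; the kernel name bitonic_step, its arguments, and the host-side loop shown in the comment are our assumptions, not the paper's code.

// One compare-exchange stage of bitonic sort over n work-items.
// The host enqueues this kernel once per (block, stride) pair:
//   for (block = 2; block <= n; block *= 2)
//     for (stride = block / 2; stride >= 1; stride /= 2)
__kernel void bitonic_step(__global int *data, int block, int stride)
{
    int i = get_global_id(0);
    int partner = i ^ stride;              /* index to compare against */
    if (partner > i) {
        /* alternate blocks of 'block' elements sort ascending/descending */
        bool ascending = ((i & block) == 0);
        int a = data[i];
        int b = data[partner];
        if ((a > b) == ascending) {        /* swap if out of order */
            data[i] = b;
            data[partner] = a;
        }
    }
}

Sketch: A bitonic compare-exchange kernel (illustrative only)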
4.2 Parallel Radix Sort
Like the bitonic sort, the radix sort [14] uses a divide-and-conquer strategy: it splits the dataset into subsets and sorts the elements within the subsets. But instead of sorting bitonic sequences, the radix sort is a multiple-pass distribution sort that distributes each item to a bucket according to the least significant digit of the element. After each pass, the items are collected from the buckets in order and then redistributed according to the next more significant digit.
Suppose the input elements are 34, 12, 42, 32, 44, 41, 34, 11, 32, 63.
After the first pass: {[41, 11], [12, 42, 32, 32], [63], [34, 44, 34]}
After the second pass: {[11, 12], [32, 32, 34, 34], [41, 42, 44], [63]}
When we collect them, they are in order: {11, 12, 32, 32, 34, 34, 41, 42, 44, 63}
In OpenCL, the first step of each pass is to compute a histogram of the least significant digits. Let p be the number of processing elements available on the GPU device. Each processing element is responsible for n / p input elements. In the next step, each processing element counts its elements per digit and then computes the prefix sums of these counts. Next, the prefix sums of all processing elements are combined by computing the prefix sums of the per-processing-element prefix sums. Finally, each processing element places its elements in the output array. More details are given in the pseudo-code below.
b     <- number of bits per element
A     <- input data array
cmp   <- 1                           (bit mask for the current pass)
cnt0  <- count of elements whose current bit is 0
cnt1  <- count of elements whose current bit is 1
One   <- bucket array for elements whose current bit is 1
Mask  <- temporary index array

for (i = 0 to b - 1)
{
    cnt0 <- 0; cnt1 <- 0
    for (j = 0 to A.size - 1)
    {
        if (A[j] AND cmp)            // current bit of A[j] is 1
        {
            One[cnt1] <- A[j]
            cnt1 <- cnt1 + 1
        }
        else                         // current bit of A[j] is 0
        {
            Mask[cnt0] <- j
            cnt0 <- cnt0 + 1
        }
    }
    for (j = cnt0 to A.size - 1)     // entries >= A.size refer to One
        Mask[j] <- A.size - cnt0 + j
    A <- shuffle(A, One, Mask)
    cmp <- left_shift(cmp)
}
result <- A

Pseudo-code: Parallel Radix Sort Kernel
The code performs a bitwise AND of each element with cmp. If the result is non-zero, the element is placed in the One array and the ones counter is incremented. If the result is zero, the element's index is recorded in the Mask array and the zeros counter is incremented. Once every element has been examined, the tail of the Mask array is filled so that its entries identify elements of the One array. The shuffle function then rearranges A according to the Mask array, and the process continues with the next bit.
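A possible C rendering of the shuffle step, reconstructed from the description above (the paper's actual implementation is not shown): Mask entries below A.size select a zero-bit element from A by its original index, while entries of A.size and above select an element of the One bucket.

#include <stdlib.h>
#include <string.h>

/* Rearranges A for one radix pass: Mask[j] < n picks A[Mask[j]] (a zero-bit
   element, in original order); Mask[j] >= n picks One[Mask[j] - n]. */
static void shuffle(int *A, const int *One, const int *Mask, int n)
{
    int *tmp = malloc(n * sizeof(int));
    for (int j = 0; j < n; j++)
        tmp[j] = (Mask[j] < n) ? A[Mask[j]] : One[Mask[j] - n];
    memcpy(A, tmp, n * sizeof(int));
    free(tmp);
}

Sketch: The shuffle step (our reconstruction)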
The histogram computation is shown in Algorithm 4. After this step, the histogram is scanned and its prefix sums are calculated using Algorithm 5. The histogram is then re-ordered, and the final result is obtained by transposing the re-ordered histogram. Other implementation details are omitted here; only the method is presented in this paper. For more information, refer to [27].
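Since Algorithms 4 and 5 are not reproduced here, the following work-group-level inclusive scan (the Hillis-Steele pattern) in OpenCL C illustrates what the prefix-sum step could look like. The kernel name, arguments, and use of local memory are our assumptions; a full implementation would also combine the per-work-group results.

// Inclusive prefix sum within one work-group. 'temp' is local memory
// with get_local_size(0) elements; each work-group scans its own block.
__kernel void scan_block(__global const uint *in, __global uint *out,
                         __local uint *temp)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int lsize = get_local_size(0);

    temp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* double the reach each step, adding the value 'offset' slots back */
    for (int offset = 1; offset < lsize; offset <<= 1) {
        uint add = (lid >= offset) ? temp[lid - offset] : 0;
        barrier(CLK_LOCAL_MEM_FENCE);
        temp[lid] += add;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    out[gid] = temp[lid];
}

Sketch: A work-group prefix-sum kernel (illustrative only)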
5. EXPERIMENTAL RESULTS
In this section, we discuss the machine specifications on which the experiments were carried out and present our experimental results. In all cases, the elements to be sorted were randomly generated 10-bit integers. All experiments were repeated 30 times, and the reported results are averaged over the 30 runs.
FIGURE 2: Execution time in ms (vertical axis, 1000 to 9000) of Quick sort, Bitonic sort, and Radix sort; the horizontal axis ranges from 10 to 18.
REFERENCES
[1]
[2]
K. E. Batcher. Sorting networks and their applications. In Proc. AFIPS Spring Joint Computer Conference, Arlington, VA, Apr. 1968, pp. 307-314.
[3]
D.E. Knuth. The Art of Computer Programming. Vol. 3: Sorting and Searching (second
edition). Menlo Park: Addison-Wesley, 1981.
[4]
M. Ajtai, J. Komlós, E. Szemerédi. Sorting in c log n parallel steps. Combinatorica, 3(1):1-19, 1983.
[5]
[6]
J. H. Reif, L. G. Valiant. A Logarithmic Time Sort for Linear Size Networks. Journal of the ACM, 34(1):60-76, 1987.
[7]
G.E. Blelloch, Vector Models for Data-Parallel Computing. The MIT Press, 1990.
[8]
G.E. Blelloch, C.E. Leiserson, B.M. Maggs, C.G. Plaxton, S.J. Smith, M. Zagha. A Comparison of Sorting Algorithms for the Connection Machine CM-2. In Proc. Annual ACM Symposium on Parallel Algorithms and Architectures, 1991, pp. 3-16.
[9]
[10]
J.H. Reif. Synthesis of Parallel Algorithms. Morgan Kaufmann, San Mateo, CA, 1993.
[11]
H. Li, K.C. Sevcik. Parallel Sorting by Over-partitioning. In Proc. Annual ACM Symposium on Parallel Algorithms and Architectures, 1994, pp. 46-56.
[12]
[16]
[20]
[21]
[22] Khronos OpenCL Working Group. The OpenCL Specification. Khronos Group, http://www.khronos.org/opencl/
[23]
[26]
[27]