An OpenCL Method of Parallel Sorting Algorithms For GPU Architecture
S. R. Sathe
Abstract
In this paper, we present a comparative performance analysis of two parallel sorting algorithms: bitonic sort and parallel radix sort. In order to study the interaction between the algorithms and the architecture, we implemented both algorithms in OpenCL and compared their performance with that of the quick sort algorithm, one of the fastest sequential sorting algorithms in practice. In our experiments, we used an Intel Core 2 Duo CPU at 2.67 GHz and an NVIDIA Quadro FX 3800 as the graphics processing unit.
Keywords: GPU, GPGPU, Parallel Computing, Parallel Sorting Algorithms, OpenCL.
1. INTRODUCTION
The GPU (Graphics Processing Unit) [1] is a highly tuned, specialized processor designed specifically for parallel processing at high speed. In recent years, the GPU has evolved into a massively parallel processor for achieving high computing performance. The architecture of the GPU is suitable not only for graphics rendering algorithms but also for general parallel algorithms in a wide variety of application domains.
Sorting is one of the fundamental problems of computer science, and parallel algorithms for sorting have been studied since the beginning of parallel computing. Batcher's O(log^2 n)-depth bitonic sorting network [2] was one of the first methods proposed. Since then, many different parallel sorting algorithms have been proposed [7, 9, 10]. An O(log n)-depth sorting circuit was proposed in [4, 6].
Given the diversity of parallel architectures and the number of parallel sorting algorithms, there is a question of which is the best fit for a given problem instance. The extent to which an application will benefit from these parallel systems depends on the number of cores available and other parameters. Thus, many researchers have become interested in harnessing the power of GPUs for sorting algorithms, and recently there has been increased interest in such research efforts [8, 11, 16]. However, more studies are needed to determine whether a certain algorithm can be recommended for a particular parallel architecture.
In this paper, we present an experimental study of two different parallel sorting algorithms: Bitonic
sort and Parallel Radix sort.
This paper is organized as follows. Section 2 reviews previous work. In Section 3, we present the GPU architecture and the OpenCL programming model. The parallel sorting algorithms are explained in Section 4. Test results and analysis are provided in Section 5. Section 6 concludes our work and outlines future research plans.
2. RELATED WORK
In this section, we review previous work on parallel sorting algorithms. The study of parallel algorithms using OpenCL is still in progress, and not much work has been done on this topic. However, an overview of parallel sorting algorithms is given in [5]. Here we review parallel sorting algorithms with respect to GPU architecture.
A parallel sorting algorithm for general-purpose internal sorting on MIMD machines is presented in [12], where the performance of the algorithm on the Fujitsu AP1000 MIMD supercomputer is discussed. A comparative performance evaluation of parallel sorting algorithms is presented in [13], where the authors implement the algorithms with respect to the architecture of the machine. An on-chip local memory version of radix sort for GPUs has been implemented in [21]; as expected, OpenCL local memory is much faster than global memory. The bitonic sorting algorithm has been implemented using stream processing units and image stream processors in [17, 15].
An O(n) radix sort is implemented in [21]; as reported there, it is roughly twice as fast as the CUDPP [19] radix sort. A quick-sort algorithm for GPUs using CUDA has been implemented in [20], where the results suggest that, given a large data set, quick-sort still gives better performance compared to radix and bitonic sort. A portable OpenCL implementation of the radix sort algorithm is presented in [24], where the authors test radix sort on several GPUs and CPUs. An analysis of parallel and sequential bitonic, odd-even, and rank-sort algorithms for different CPU and GPU architectures is presented in [23], where the authors exploit task parallelism using OpenCL.
3. GPU ARCHITECTURE AND OPENCL PROGRAMMING MODEL

FIGURE 1: The OpenCL device model: a host attached to compute units, each containing processing elements (PE 1 ... PE 8) with their own local memory, all sharing a global memory.
The GPU is programmable using vendor-provided APIs such as NVIDIA's CUDA [18] and the OpenCL specification by the Khronos group [22]. While CUDA targets GPUs specifically, OpenCL targets heterogeneous systems, which may include GPUs and/or CPUs. The OpenCL programming model involves a host program on the host (CPU) side that launches Single Instruction Multiple Threads (SIMT) programs called kernels, which execute on the target device as groups of threads called warps. Although the management of warps is hardware dependent, the programmer can organize the problem domain into work-groups, each consisting of one or more work-items. This is expressed as the ND-Range in the GPU architecture. For more information on managing and optimizing the ND-Range, refer to the OpenCL specification [22]. In summary, the following steps are needed to initialize an OpenCL application.
1. Set up the OpenCL environment: declare an OpenCL context, choose the device type, and create the context and a command queue.
2. Declare buffers and move data between CPU and GPU: declare buffers on the device and enqueue the input data to the device.
3. Compile the kernel at runtime: compile the program from the kernel source, build the program, and define the kernel.
4. Run the program: set the kernel arguments and the work-group size, then enqueue the kernel onto the command queue to execute on the device.
5. Read results back to the host: after the program has run, read the result array back from the device buffer into host memory.
See [25, 26, 27, 22] for more details on this topic.
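As an illustration, the following is a minimal host-side sketch of these five steps in C for an OpenCL 1.x platform. It is a sketch only: the function name run_on_gpu, the kernel name my_kernel, the work-group size of 64, and the int element type are our assumptions, and all error handling is omitted.

#include <CL/cl.h>

/* Minimal OpenCL host program: context, queue, buffer, kernel, launch.
   kernel_src is assumed to hold the kernel source text and "my_kernel"
   its entry point; error checking is omitted for brevity. */
int run_on_gpu(const char *kernel_src, int *data, size_t n)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Step 1: create the context and a command queue */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Step 2: declare a device buffer and enqueue the input data */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(int), NULL, NULL);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(int), data, 0, NULL, NULL);

    /* Step 3: compile the kernel at runtime */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "my_kernel", NULL);

    /* Step 4: set arguments, choose the ND-Range, and enqueue the kernel */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global_size = n;   /* total number of work-items */
    size_t local_size = 64;   /* work-items per work-group (assumed) */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                           0, NULL, NULL);

    /* Step 5: read the result back from the device buffer to host memory */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(int), data, 0, NULL, NULL);

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}

Sketch: Host-side OpenCL setup (illustrative only)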
4. PARALLEL SORTING ALGORITHMS

4.1 Bitonic Sort
Algorithm 1 is a bitonic sort kernel for SIMD architectures where the length of the input data sequence is a multiple of 8. Algorithm 2 is a generalized bitonic sort, and its corresponding kernel is shown in Algorithm 3.
Initially, the host (CPU) distributes the unsorted vector to the GPU cores in the form of work-groups, using the global_size and local_size OpenCL parameters. Alternate work-items in a work-group sort in ascending and descending order. Next, the merging stage is performed and the result is obtained. For more information on these parameters, refer to the OpenCL specification [22]. A sketch of one compare-exchange stage is given below.
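As the paper's Algorithms 1-3 are not reproduced here, the following is a common formulation of one compare-exchange stage of bitonic sort as an OpenCL kernel, given as an illustrative sketch only; the kernel name bitonic_step, its arguments, and the host-side loop shown in the comment are our assumptions, not the paper's code.

// One compare-exchange stage of bitonic sort over n work-items.
// The host enqueues this kernel once per (block, stride) pair:
//   for (block = 2; block <= n; block *= 2)
//     for (stride = block / 2; stride >= 1; stride /= 2)
__kernel void bitonic_step(__global int *data, int block, int stride)
{
    int i = get_global_id(0);
    int partner = i ^ stride;              /* index to compare against */
    if (partner > i) {
        /* alternate blocks of 'block' elements sort ascending/descending */
        bool ascending = ((i & block) == 0);
        int a = data[i];
        int b = data[partner];
        if ((a > b) == ascending) {        /* swap if out of order */
            data[i] = b;
            data[partner] = a;
        }
    }
}

Sketch: A bitonic compare-exchange kernel (illustrative only)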
4.2 Parallel Radix Sort
Like the bitonic sort, the radix sort [14] uses a divide-and-conquer strategy: it splits the dataset into subsets and sorts the elements within the subsets. But instead of sorting bitonic sequences, the radix sort is a multiple-pass distribution sort that distributes each item to a bucket according to the least significant digit of the element. After each pass, the items are collected from the buckets in order and then redistributed according to the next more significant digit.
Suppose the input elements are 34, 12, 42, 32, 44, 41, 34, 11, 32, 63.
After the first pass: {[41, 11], [12, 42, 32, 32], [63], [34, 44, 34]}
After the second pass: {[11, 12], [32, 32, 34, 34], [41, 42, 44], [63]}
When we collect them, they are in order: {11, 12, 32, 32, 34, 34, 41, 42, 44, 63}
In OpenCL, the first step of each pass is to compute a histogram of the least significant digits. Let p be the number of processing elements available on the GPU device. Each processing element is responsible for n / p input elements. In the next step, each processing element counts its elements per digit and then computes the prefix sums of these counts. Next, the prefix sums of all processing elements are combined by computing the prefix sums of the per-processing-element prefix sums. Finally, each processing element places its elements in the output array. More details are given in the pseudo-code below.
b     <- number of bits per element
A     <- input data array
cmp   <- 1                           (bit mask for the current pass)
cnt0  <- count of elements whose current bit is 0
cnt1  <- count of elements whose current bit is 1
One   <- bucket array for elements whose current bit is 1
Mask  <- temporary index array

for (i = 0 to b - 1)
{
    cnt0 <- 0; cnt1 <- 0
    for (j = 0 to A.size - 1)
    {
        if (A[j] AND cmp)            // current bit of A[j] is 1
        {
            One[cnt1] <- A[j]
            cnt1 <- cnt1 + 1
        }
        else                         // current bit of A[j] is 0
        {
            Mask[cnt0] <- j
            cnt0 <- cnt0 + 1
        }
    }
    for (j = cnt0 to A.size - 1)     // entries >= A.size refer to One
        Mask[j] <- A.size - cnt0 + j
    A <- shuffle(A, One, Mask)
    cmp <- left_shift(cmp)
}
result <- A

Pseudo-code: Parallel Radix Sort Kernel
The code performs a bitwise AND of each element with cmp. If the result is non-zero, the element is placed in the One array and the ones counter is incremented. If the result is zero, the element's index is recorded in the Mask array and the zeros counter is incremented. Once every element has been examined, the tail of the Mask array is filled so that its entries identify elements of the One array. The shuffle function then rearranges A according to the Mask array, and the process continues with the next bit.
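A possible C rendering of the shuffle step, reconstructed from the description above (the paper's actual implementation is not shown): Mask entries below A.size select a zero-bit element from A by its original index, while entries of A.size and above select an element of the One bucket.

#include <stdlib.h>
#include <string.h>

/* Rearranges A for one radix pass: Mask[j] < n picks A[Mask[j]] (a zero-bit
   element, in original order); Mask[j] >= n picks One[Mask[j] - n]. */
static void shuffle(int *A, const int *One, const int *Mask, int n)
{
    int *tmp = malloc(n * sizeof(int));
    for (int j = 0; j < n; j++)
        tmp[j] = (Mask[j] < n) ? A[Mask[j]] : One[Mask[j] - n];
    memcpy(A, tmp, n * sizeof(int));
    free(tmp);
}

Sketch: The shuffle step (our reconstruction)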
The histogram computation is shown in Algorithm 4. After this step, the histogram is scanned and its prefix sums are calculated using Algorithm 5. The histogram is then re-ordered, and the final result is obtained by transposing the re-ordered histogram. Other implementation details are omitted here; only the method is presented in this paper. For more information, refer to [27].
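Since Algorithms 4 and 5 are not reproduced here, the following work-group-level inclusive scan (the Hillis-Steele pattern) in OpenCL C illustrates what the prefix-sum step could look like. The kernel name, arguments, and use of local memory are our assumptions; a full implementation would also combine the per-work-group results.

// Inclusive prefix sum within one work-group. 'temp' is local memory
// with get_local_size(0) elements; each work-group scans its own block.
__kernel void scan_block(__global const uint *in, __global uint *out,
                         __local uint *temp)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int lsize = get_local_size(0);

    temp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* double the reach each step, adding the value 'offset' slots back */
    for (int offset = 1; offset < lsize; offset <<= 1) {
        uint add = (lid >= offset) ? temp[lid - offset] : 0;
        barrier(CLK_LOCAL_MEM_FENCE);
        temp[lid] += add;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    out[gid] = temp[lid];
}

Sketch: A work-group prefix-sum kernel (illustrative only)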
5. EXPERIMENTAL RESULTS
In this section, we discuss the machine specifications on which the experiments were carried out and present our experimental results. In all cases, the elements to be sorted were randomly generated 10-bit integers. All experiments were repeated 30 times, and the reported results are averaged over the 30 runs.
FIGURE 2: Execution time in ms (vertical axis, 1000 to 9000) of Quick sort, Bitonic sort, and Radix sort; the horizontal axis ranges from 10 to 18.
REFERENCES
[1]
[2]
K. E. Batcher. Sorting networks and their applications. In Proc. AFIPS Spring Joint Computer Conference, Arlington, VA, Apr. 1968, pp. 307-314.
[3]
D.E. Knuth. The Art of Computer Programming. Vol. 3: Sorting and Searching (second
edition). Menlo Park: Addison-Wesley, 1981.
[4]
M. Ajtai, J. Komlós, E. Szemerédi. Sorting in c log n parallel steps. Combinatorica, 3(1):1-19, 1983.
[5]
[6]
J. H. Reif, L. G. Valiant. A Logarithmic Time Sort for Linear Size Networks. Journal of the ACM, 34(1):60-76, 1987.
[7]
G.E. Blelloch, Vector Models for Data-Parallel Computing. The MIT Press, 1990.
[8]
G.E. Blelloch, C.E. Leiserson, B.M. Maggs, C.G. Plaxton, S.J. Smith, M. Zagha. A Comparison of Sorting Algorithms for the Connection Machine CM-2. In Proc. Annual ACM Symposium on Parallel Algorithms and Architectures, 1991, pp. 3-16.
[9]
[10]
J.H. Reif. Synthesis of Parallel Algorithms. Morgan Kaufmann, San Mateo, CA, 1993.
[11]
H. Li, K.C. Sevcik. Parallel Sorting by Over-partitioning. In Proc. Annual ACM Symposium on Parallel Algorithms and Architectures, 1994, pp. 46-56.
[12]
[16]
[20]
[21]
[22] Khronos OpenCL Working Group. The OpenCL Specification. Khronos Group, http://www.khronos.org/opencl/
[23]
[26]
[27]