
Analysis of performance enhancement on graphic processor based heterogeneous architecture:
A CUDA and MATLAB experiment

Vilas H. Naik
Asst. Prof., Department of Computer Science and Engineering
Basaveshwar Engineering College,
Bagalkot, Karnataka, India.
[email protected]

Chidanand S. Kusur
PG Scholar, Department of Computer Science and Engineering
B.L.D.E.A’s College of Engineering and Technology,
Bijapur, Karnataka, India.
[email protected]

Abstract—Today, multiprocessors, multicores, clusters and heterogeneous computing are becoming the most popular architectures for achieving high performance computing. System designers take different approaches to enhance system performance, such as increasing the clock frequency of CPUs from MHz to GHz and adding more CPU cores, i.e. moving from single-core processors to dual-core, quad-core, hexa-core, octa-core, ten-core and larger processors. Still, multicore processing creates some challenges of its own: the extra cores result in increased processor size and high power consumption. Meanwhile, General Purpose Graphics Processing Units (GPGPUs) have been designed and implemented that contain hundreds of cores with a large number of Arithmetic and Logic Units and Control Units. These GPGPUs can be used in addition to the CPU for heterogeneous computing, enhancing system performance for selected applications through data parallelism. A heterogeneous programming environment that includes other processors, such as a GPGPU, in addition to the CPU can be used to enhance the execution performance of computationally intensive programs. It is therefore necessary for the programmer to run and analyze the selected computationally intensive programs on both homogeneous and heterogeneous programming platforms. The homogeneous programming environment makes use of the multicore CPU only, whereas the heterogeneous programming environment makes use of different processors, such as General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs) and Digital Signal Processors (DSPs), in addition to the CPU. Hence, the programmer needs to write code that uses both the CPU and the other processors through a heterogeneous software environment, such as parallel MATLAB with GPU-enabled functions, MATLAB-supported CUDA kernels and CUDA C, and to execute the parallel code to achieve high performance in the heterogeneous programming environment in comparison with the homogeneous (sequential) programming approach that uses only the CPU.

Keywords — Multicore, Heterogeneous Computing, High Performance Computing, GPGPU, Parallel MATLAB, CUDA kernels.

I. INTRODUCTION

The architecture of the computing blocks of processors has changed within the past ten years. The technology moved from single-core processors to shared-memory multicore processors, then to highly scalable “many core” processors, and finally to heterogeneous platforms, i.e. combinations of other processors such as GPGPUs, FPGAs and DSPs in addition to the CPU. Hence, computing needs to change to keep up with the evolution of the new hardware architectures, and programmers have to respond to these changes. It is therefore necessary for programmers to write programs that run on heterogeneous platforms and map onto the new hardware architectures.

The multi-core era saw some interesting developments in graphics processing units. Today, GPUs are used for both graphics and non-graphics applications. GPUs contain thousands of cores that are designed for parallel processing, whereas the CPU is designed for serial processing. The GPU exploits advanced semiconductor technology and contains a larger number of ALUs and CUs than the CPU. A GPU has multiple SIMD (Single Instruction Multiple Data) units to perform many arithmetic operations of the same type. GPUs have vector processing capabilities that enable programmers to perform parallel operations on very large data sets with much lower power consumption than the serial processing of similar data sets on CPUs. GPGPUs have therefore become increasingly attractive for general-purpose operations and for heterogeneous parallel programming.

Today, heterogeneous systems have become a common way to run computationally intensive applications. To support heterogeneous computing, the major software and hardware manufacturers have started to create heterogeneous software environments such as the Parallel Computing Toolbox of MATLAB, GPU-enabled MATLAB functions and NVIDIA’s Compute Unified Device Architecture (CUDA) framework. Parallel MATLAB supports three levels of GPU computing for heterogeneous computing. The CUDA architecture provides a scalable programming model for heterogeneous computing that results in high performance computing (HPC).

1.1 Graphics Processing Unit for Heterogeneous Computing

Graphics processing units were initially developed to improve 3D graphics performance by offloading graphics work from the CPU to the GPU. The GPU contains thousands of cores, which are mainly designed for parallel processing. The producers of graphics hardware offer a programming interface through which programmers can use the computational resources of the GPU for general-purpose work; such a device is called a general purpose GPU (GPGPU).
Fig. 1. Computational Resources of CPU and GPU

Figure 1 illustrates the difference between the computational resources of the CPU and the GPU. The CPU is designed for serial processing, whereas the GPU is designed for parallel processing. The CPU has a larger cache and fewer CUs and ALUs, whereas the GPU is designed for parallel processing with a larger number of ALUs and CUs, which helps it run parallel code to solve computationally intensive problems. The CPU is suited to Single Instruction Single Data (SISD) execution, performing arithmetic operations sequentially, whereas the GPU is suited to Single Instruction Multiple Data (SIMD) execution, performing arithmetic operations in parallel.

1.2 Parallel and Heterogeneous Computing with MATLAB

The Parallel Computing Toolbox of MATLAB provides built-in constructs such as parfor (parallel for) and spmd (single program multiple data) to run code simultaneously on multiple workers and thereby make use of the cores of a multicore CPU. The difference between parfor and spmd is that the workers in an spmd block communicate with each other, whereas in a parfor loop they do not.
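As a minimal sketch of the two constructs (assuming the Parallel Computing Toolbox is installed and a local pool of four workers can be opened; the vector length is an illustrative choice), the fragment below computes independent partial sums with parfor and a cooperative global sum with spmd:

    parpool('local', 4);                 % open a pool of MATLAB workers

    n = 1e6;
    s = zeros(1, 4);
    parfor k = 1:4                       % iterations are independent; workers do not communicate
        s(k) = sum(rand(1, n));          % each worker computes its own partial sum
    end

    spmd                                 % the same block runs on every worker
        localSum = sum(rand(1, n));      % per-worker partial result
        totalSum = gplus(localSum);      % workers communicate to form the global sum
    end

    delete(gcp('nocreate'));             % shut the pool down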
MATLAB supports hundreds of GPU-enabled functions, such as the discrete Fourier transform (fft), matrix multiplication (mtimes) and left matrix division (mldivide), which run on data stored on the GPU. The built-in functions gpuArray and gather are used to transmit data from the CPU to the GPU and to bring the result back from the GPU to the CPU, as shown in figure 2. In addition, programmers can run NVIDIA CUDA kernel objects in MATLAB on large data sets.
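A minimal sketch of this CPU-to-GPU round trip is shown below (it assumes a CUDA-capable GPU and the Parallel Computing Toolbox; the data sizes are illustrative). The GPU-enabled overloads of fft and mtimes are picked up automatically once the operands are gpuArray objects:

    x  = rand(1, 2^20);             % data created in CPU (host) memory
    gx = gpuArray(x);               % transmit the data from CPU to GPU
    gy = fft(gx);                   % GPU-enabled fft executes on the device
    y  = gather(gy);                % bring the result back from GPU to CPU

    gA = gpuArray(rand(1000));      % the same pattern with matrix multiplication (mtimes)
    gB = gpuArray(rand(1000));
    C  = gather(gA * gB);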
Fig. 2. Data and Result Exchange between CPU and GPU

II. RELATED STUDY

Early on, researchers increased the performance of parallel applications on clusters with a combined CPU and GPU architecture. They combined OpenMP programming with task-level services such as queuing and scheduling for efficient utilization of cluster-wide GPU devices, i.e. a heterogeneous platform used to support large-scale, high-end parallel computing, as given in [1]. Re-configurable processors and general purpose GPUs are used in [2] with the OpenCL and CUDA programming models in the Scalable Heterogeneous Computing (SHOC) benchmark suite to increase the execution speed of computationally intensive applications. A Gaussian process model is designed and implemented for identification and simulation using Jacket and the Parallel Computing Toolbox with MATLAB in [3]. The parallel processing capability of GPUs is used to increase throughput by processing millions of originally stored images in the Digital Processing Centre (DPC) for the image processing of FamilySearch, as discussed in [4]. A parallel image convolution algorithm is implemented in [5] using the Parallel Computing Toolbox (PCT) and the Distributed Computing Server (DCS) toolbox of MATLAB. In [6], built-in constructs like parfor are used to achieve the highest degree of programmability for computational applications.

The Jacket and GFOR tools of MATLAB are used in [7] to accelerate computationally intensive applications such as matrix multiplication and fft. CUDA technology is used in [8] for point-to-point image processing, implementing techniques like brightening, darkening and thresholding. Loop unrolling and tiling are applied in [9] to gain performance enhancement on a heterogeneous platform. The architecture of the CUDA framework is discussed in [10], where the computationally intensive part of an application is solved on the GPU using grids of blocks and blocks of threads. Matrix multiplication is implemented in MATLAB to parallelize independent operations on the GPU, as discussed in [11, 12]. The OpenCL programming model is used in [13, 14] to compute the discrete Fourier transform (DFT) on the GPU; that work showed that the butterfly structure of the FFT is well suited to parallel hardware architectures.

Different algorithms are designed in [15] for data sharing by code optimization, using the GPU to speed up functions in the data-independent or data-sharing category. The GPU is used in [16] to convert pictures into numbers instead of numbers into pictures, i.e. for image processing and computer vision rather than graphics, which greatly increases processing speed. Efficient parallel video processing techniques on the GPU are designed and implemented in [17] using the CUDA framework to reorganize the execution order and optimize the data structures; the authors propose an efficient parallel framework for an H.264/AVC encoder based on a massively parallel architecture.

A parallel computational-intelligence-based multi-camera surveillance system is designed using GPUs in [18] to address multiple vision tasks at various levels, such as segmentation, representation or characterization, and analysis and monitoring of movement. Different parallel algorithms are discussed in [19] for image processing techniques such as point operations, dithering, smoothing, edge detection and image segmentation. A motion magnification algorithm is parallelized for video processing in [20] using the Parallel Computing Toolbox of MATLAB.

III. ANALYSIS APPROACH

The proposed approach for the analysis of performance enhancement on the graphics-processor-based heterogeneous architecture is to make use of both a homogeneous and a heterogeneous programming environment, as shown in figure 3.
Fig. 3. Proposed Methodology for Performance Analysis

3.1 Homogeneous Programming Environment

The homogeneous programming environment makes use of the CPU only, whereas the heterogeneous programming environment makes use of other processors in addition to the CPU. The C language, sequential MATLAB and parallel MATLAB are used to write and analyse the sequential and parallel code for the selected computationally intensive problems.

Traditional C language: The traditional C language is used to write the sequential code and to observe the execution time taken by the computationally intensive applications using the clock() function from time.h. The data type clock_t is used to define two variables that store the start and end times. The elapsed time is then used for the analysis of performance enhancement in comparison with the heterogeneous programming environment.

Sequential MATLAB: The MATLAB programming language provides many powerful built-in functions for solving computationally intensive problems. The built-in functions tic and toc are used to calculate the elapsed time taken by the processor to solve a computationally intensive task.
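A minimal sketch of this timing pattern in sequential MATLAB is given below; the matrix size is an illustrative choice rather than a value used in the experiments:

    A = rand(1000);                               % operands created in CPU memory
    B = rand(1000);
    tic;                                          % start the stopwatch
    C = A * B;                                    % computationally intensive task
    elapsed = toc;                                % elapsed wall-clock time in seconds
    fprintf('Elapsed time: %.4f s\n', elapsed);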
3.2 Heterogeneous Programming Environment

A programming environment in which both the CPU and a GPGPU are used to solve computationally intensive applications in parallel is called a heterogeneous programming environment. The following languages and tools are used to write the programs and to analyse the execution time of the selected applications.

The Parallel Computing Toolbox of MATLAB: The Parallel Computing Toolbox (PCT) is used to solve computationally and data-intensive problems on multicore processors with high-level parallel for-loops. MATLAB workers run independent tasks concurrently through the parfor loop construct.

GPU-enabled MATLAB functions: The Parallel Computing Toolbox also provides GPU-enabled functions to solve computationally and data-intensive problems using the many cores of the GPGPU processor.

CUDA kernel objects: The Parallel Computing Toolbox allows programmers to parallelize the computationally intensive part of an application on the GPU using CUDA kernel objects. These objects are used to run a CUDA kernel on huge amounts of data stored in MATLAB matrices.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

The following computationally intensive programs are selected for the analysis of performance enhancement on the graphics-processor-based heterogeneous architecture:
• Vector Addition
• Matrix Multiplication
• FFT
• Binary Image Conversion

The experiments are carried out for the above programs in both the homogeneous and the heterogeneous programming environments, the latter combining the multicore CPU with a GPGPU. It is observed that the execution time of the selected computationally intensive applications on the homogeneous platform is higher than on the heterogeneous platform.

The traditional C language and sequential MATLAB are used to run the sequential code. The Parallel Computing Toolbox of MATLAB, the GPU-enabled functions of MATLAB and CUDA C are used to run the parallel code for the analysis of the selected applications.

The GPU computing environment of parallel MATLAB is also used for heterogeneous computing, employing GPU-supported functions such as gpuArray, gather and the static methods that create data directly on the GPU; this results in less execution time for the combined CPU and GPU than for the CPU alone.

The following modes of experiment are used for the analysis of program execution speed; a sketch of the final, CUDA-based mode is given after the list.

Mode 1: In the first mode of experiment, the sequential code is written to complete the given task using the traditional C language, which results in the longest execution time.

Mode 2: In the second mode of experiment, the parallel code is written using the PCT of MATLAB with the parfor construct, which uses the available pool workers to complete the assigned task simultaneously. This results in a shorter execution time than the first attempt (sequential code).

Mode 3: In the third mode of experiment, the computational data is transferred from the CPU to the GPU to perform massively parallel operations on the GPGPU, using built-in functions such as gpuArray, which transmits the data from the CPU to the GPU so that the GPU-supported built-in functions can be used. The execution time is again reduced in comparison with the second attempt (parallel MATLAB).

Mode 4: In the fourth mode of experiment, to overcome the latency of data transmission from the CPU to the GPU, the data is created directly on the GPU before the operations are performed. The execution time is again reduced in comparison with the third attempt (GPU computing).

Mode 5: Finally, CUDA is used to exploit the architecture of the GPGPU and solve the computationally intensive operations simultaneously, which again results in a shorter execution time than all the previous attempts.
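The sketch below shows how such a CUDA kernel object might be invoked from MATLAB for vector addition. The file names vec_add.ptx/vec_add.cu, the kernel signature and the launch configuration are illustrative assumptions rather than details taken from the paper; the .cu file would be compiled beforehand with nvcc -ptx vec_add.cu:

    n = 1e6;
    a = gpuArray.rand(n, 1);                                   % operands created directly on the GPU
    b = gpuArray.rand(n, 1);
    c = gpuArray.zeros(n, 1);                                  % output buffer on the GPU

    k = parallel.gpu.CUDAKernel('vec_add.ptx', 'vec_add.cu');  % load the compiled kernel (hypothetical files)
    k.ThreadBlockSize = 256;                                   % threads per block
    k.GridSize        = ceil(n / 256);                         % blocks per grid

    c = feval(k, c, a, b, n);                                  % launch; assumed signature vec_add(c, a, b, n)
    result = gather(c);                                        % copy the element-wise sum back to the CPU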
The different experimental results noted for the analysis of the execution speed of the selected applications are discussed below.

4.1 Vector Addition

The vector addition of one million corresponding elements is performed on the homogeneous and heterogeneous platforms in different programming environments: sequential MATLAB, parallel MATLAB, GPU computing in MATLAB and, finally, CUDA kernel objects in MATLAB. The experimental results are listed in table 1.

TABLE 1. ANALYSIS OF EXECUTION TIME FOR VECTOR ADDITION

Cases           | Only CPU   | parfor                      | gpuArray | Compute on GPU | gpuArray data | CUDA kernel
Time Elapsed    | 8.96s      | 5.00s                       | 0.015s   | 0.009s         | 0.0017s       | 0.00018s
Time Complexity | Θ(n)       | Ω(1)                        | Ω(1)     | Ω(1)           | Ω(1)          | Ω(1)
Approach        | Sequential | Parallel                    | Parallel | Parallel       | Parallel      | Parallel
Processor       | CPU        | CPU with 4 Cores, 4 Workers | CPU+GPU  | CPU+GPU        | CPU+GPU       | CPU+GPU

Fig. 4. Performance Enhancement of Vector Addition

The performance enhancement observed in the heterogeneous programming environment when using CUDA kernel objects, in comparison with the homogeneous and parallel programming environments, is shown in figure 4.
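The MATLAB-only cases of table 1 can be reproduced with a sketch along the following lines (the CUDA kernel case is sketched after the list of experimental modes above). An open parallel pool is assumed for the parfor case, and the absolute timings depend entirely on the CPU and GPU used, so they will not match the reported values exactly:

    n = 1e6;
    a = rand(1, n);   b = rand(1, n);                 % operands in CPU memory

    tic;  c1 = a + b;  tSeq = toc;                    % sequential MATLAB on the CPU only

    c2 = zeros(1, n);
    tic;
    parfor k = 1:n                                    % parallel MATLAB: element-wise work on pool workers
        c2(k) = a(k) + b(k);
    end
    tPar = toc;

    ga = gpuArray(a);   gb = gpuArray(b);             % data transferred from CPU to GPU
    tXfer = gputimeit(@() ga + gb);                   % element-wise addition on the device

    tOnGpu = gputimeit(@() rand(1, n, 'gpuArray') ...
                         + rand(1, n, 'gpuArray'));   % data created directly on the GPU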
4.2 Matrix Multiplication

The multiplication of two 1000 x 1000 matrices is performed on both the homogeneous and the heterogeneous platforms in different programming environments: sequential MATLAB, parallel MATLAB, GPU computing in MATLAB, and CUDA kernel objects in MATLAB. The experimental results are listed in table 2.

TABLE 2. ANALYSIS OF EXECUTION TIME FOR MATRIX MULTIPLICATION

Cases           | Sequential MATLAB | CPU      | gpuArray | GPU Data | CUDA
Time Elapsed    | 10.89s            | 3.06s    | 0.12s    | 0.11s    | 0.030s
Time Complexity | Θ(n³)             | Ω(1)     | Ω(1)     | Ω(1)     | Ω(1)
Approach        | Sequential        | Parallel | Parallel | Parallel | Parallel
Processor       | CPU               | CPU      | CPU+GPU  | CPU+GPU  | CPU+GPU

Fig. 5. Performance Enhancement of Matrix Multiplication

The performance enhancement observed in the heterogeneous programming environment when using CUDA kernel objects, in comparison with the homogeneous and parallel programming environments, is shown in figure 5.
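A hedged sketch of timing the 1000 x 1000 case with timeit and gputimeit is given below; the measured values vary with the hardware and will not match table 2 exactly:

    A = rand(1000);
    B = rand(1000);
    tCpu = timeit(@() A * B);            % sequential MATLAB (multithreaded BLAS on the CPU)

    gA = gpuArray(A);
    gB = gpuArray(B);
    tGpu = gputimeit(@() gA * gB);       % GPU-enabled mtimes on the device

    fprintf('CPU: %.4f s  GPU: %.4f s  speed-up: %.1fx\n', tCpu, tGpu, tCpu / tGpu);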
4.3 Fast Fourier Transform

Two programming environments, sequential MATLAB and GPU computing in MATLAB, are used to compute the two-dimensional FFT. The experimental results are listed in table 3.

TABLE 3. ANALYSIS OF EXECUTION TIME FOR FFT

Cases           | Sequential MATLAB | GPU-enabled parallel MATLAB
Time Elapsed    | 0.31s             | 0.02s
Time Complexity | Ω(1)              | Ω(1)
Environment     | Homogeneous       | Heterogeneous
Processor       | CPU with 4 cores  | CPU + GPU

Fig. 6. Performance Enhancement of Fast Fourier Transform

The performance enhancement observed in the heterogeneous programming environment when using the GPU-enabled functions of MATLAB, in comparison with the homogeneous programming environment, is shown in figure 6.

4.4 Binary Image Conversion

Two programming environments, sequential MATLAB and CUDA kernel objects in MATLAB, are used for binary image conversion from black to white and vice versa. The results are listed in table 4.

TABLE 4. ANALYSIS OF EXECUTION TIME FOR BINARY IMAGE PROCESSING

Cases           | Sequential MATLAB | Using CUDA Kernel
Time Elapsed    | 0.103609s         | 0.000620s
Time Complexity | Θ(n²)             | Ω(1)
Environment     | Homogeneous       | Heterogeneous
Processor       | CPU with 4 cores  | CPU + GPU

Fig. 7. Performance Enhancement of Binary Image Processing

In this experiment, the performance enhancement observed in the heterogeneous programming environment when using CUDA kernel objects in MATLAB, in comparison with the homogeneous programming environment, is shown in figure 7.
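For comparison, the same black/white conversion can also be expressed with GPU-enabled MATLAB operators on a gpuArray, as in the hedged sketch below; the paper's measured heterogeneous case uses a CUDA kernel object instead, and the synthetic image here is only a stand-in for real pixel data:

    img  = rand(2048);                   % stand-in for a grayscale image scaled to [0, 1]
    gImg = gpuArray(img);                % move the pixel data from CPU to GPU

    gBw  = gImg > 0.5;                   % threshold: a logical (binary) image computed on the GPU
    gInv = ~gBw;                         % swap black and white on the GPU

    bw    = gather(gBw);                 % copy the binary images back to CPU memory
    invBw = gather(gInv);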
V. CONCLUSION

It is observed that the program execution performance of the selected computationally intensive applications, namely vector addition, matrix multiplication, FFT and binary image conversion, is enhanced on the heterogeneous platform, which uses the best features of other processors such as the GPGPU in addition to the CPU, in comparison with the homogeneous platform.

The experiments made in this work show that programmers can exploit the heterogeneous system architecture with the available GPGPU-supported frameworks, such as CUDA, OpenCL and the GPU computing tools of MATLAB, to design and implement algorithms that solve computationally intensive applications from areas such as image processing, video processing, data mining and cloud computing that involve intensive mathematical calculations.

REFERENCES

[1] Amnon Barak, Tal Ben-Nun, Ely Levy and Amnon Shiloh, "A Package for OpenCL Based Heterogeneous Computing on Clusters with Many GPU Devices", IEEE Computer Society, pp. 1-7, 2010.
[2] Anthony Danalis, Philip C. Roth, Gabriel Marin, Kyle Spafford, "The Scalable Heterogeneous Computing (SHOC) Benchmark Suite", ACM, GPGPU '10, March 14, 2010, Pittsburgh, PA, USA.
[3] Jan Prikryl, "Graphics Card as a Cheap Supercomputer", Programs and Algorithms of Numerical Mathematics 16, J. Chleboun, K. Segeth, J. Šístek, T. Vejchodský (Eds.), Institute of Mathematics AS CR, pp. 162-167, Prague, 2013.
[4] Ben Baker, "Using GPUs for Image Processing", FamilySearch, 2010.
[5] Magdalena Szymczyk, Piotr Szymczyk, "MATLAB and Parallel Computing", Image Processing & Communication, vol. 17, no. 4, pp. 207-216, 2012.
[6] Gaurav Sharma, Jos Martin, "MATLAB: A Language for Parallel Computing", International Journal of Parallel Programming, Springer, vol. 37, pp. 3-36, 2009.
[7] Kavita Chauhan, Javed Ashraf, "Accelerating MATLAB Applications on Parallel Hardware", International Journal of Computer Science and Network (IJCSN), Volume 1, Issue 4, August 2012.
[8] Eric Olmedo, Jorge de la Calleja, Antonio Benitez, and Ma. Auxilio Medina, "Point to point processing of digital images using parallel computing", IJCSI International Journal of Computer Science Issues, vol. 9, issue 3, 2012.
[9] Yash Ukidave and David Kaeli, "Analyzing Optimization Techniques for Power Efficiency on Heterogeneous Platforms", IEEE 27th International Symposium on Parallel & Distributed Processing Workshops and PhD Forum, pp. 1040-1049, 2013.
[10] Jason Sanders, Edward Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley, 2010.
[11] André Rigland Brodtkorb, "Matrix-Matrix Multiplication in MATLAB Using the GPU", pp. 1-7, 2006.
[12] José María Cecilia, José Manuel García, Manuel Ujaldón, "The GPU on the Matrix-Matrix Multiply: Performance Study and Contributions", 2009.
[13] Yan Li, Yunquan Zhang, Haipeng Jia, Guoping Long and Ke Wang, "Automatic FFT Performance Tuning on OpenCL GPUs", 2011 IEEE 17th International Conference on Parallel and Distributed Systems, pp. 228-235, 2011.
[14] Yi-Pin Hsu and Shin-Yu Lin, "Parallel-computing approach for FFT implementation on digital signal processor (DSP)", World Academy of Science, Engineering and Technology, vol. 18, 2008.
[15] Jingfei Kong, Martin Dimitrov, Yi Yang, "Accelerating MATLAB Image Processing Toolbox Functions on GPUs", GPGPU'10, March 14, 2010, Pittsburgh, PA, USA, ACM 978-1-60558-935-0/10/03, 2010.
[16] James Fung (NVIDIA Corporation, USA) and Steve Mann (University of Toronto), "Using Graphics Devices in Reverse: GPU-Based Image Processing and Computer Vision", 2007.
[17] Huayou Su, Mei Wen, Nan Wu, Ju Ren, and Chunyuan Zhang, "Efficient Parallel Video Processing Techniques on GPU", Hindawi Publishing Corporation, The Scientific World Journal, Volume 2014, Article ID 716020, 19 pages, March 2014.
[18] Sergio Orts-Escolano, Jose Garcia-Rodriguez, "Parallel Computational Intelligence-Based Multi-Camera Surveillance System", Journal of Sensor and Actuator Networks, ISSN 2224-2708, pp. 95-112, 2014.
[19] Thomas Bräunl, "Tutorial in Data Parallel Image Processing", Australian Journal of Intelligent Information Processing Systems (AJIIPS), vol. 6, no. 3, pp. 164-174, 2001.
[20] Neal Wadhwa, "Parallel Video Processing", pp. 1-7, December 17, 2012.
[21] http://www.nvidia.com/object/what-is-gpu-computing.html (accessed June 15, 2014)
[22] http://www.mathworks.in/products/parallel-computing (accessed June 20, 2014)
[23] docs.nvidia.com/cuda/cuda-c-programming-guide (accessed June 21, 2014)
