Naik 2015
architecture:
A CUDA and MATLAB experiment
Abstract—Today, multiprocessors, multicores, clusters and heterogeneous computing are becoming the most popular architectures for achieving high performance computing. System designers have taken different approaches to enhance system performance, such as increasing the clock frequency of CPUs from MHz to GHz and adding more CPU cores, i.e. moving from single-core processors to dual-core, quad-core, hexa-core, octa-core, ten-core and larger processors. Still, multicore processing creates some challenges of its own. The extra cores result in increased processor size and high power consumption. Meanwhile, General Purpose Graphics Processing Units (GPGPUs) have been designed and implemented that contain hundreds of cores with a larger number of Arithmetic and Logic Units and Control Units. These GPGPUs can be used in addition to the CPU for heterogeneous computing, to enhance system performance for selected applications through data parallelism. A heterogeneous programming environment that includes other processors such as a GPGPU in addition to the CPU can be used to enhance the execution performance of computationally intensive programs. So, it is necessary for the programmer to run and analyze the selected computationally intensive programs on both homogeneous and heterogeneous programming platforms. The homogeneous programming environment makes use of a multicore CPU, whereas the heterogeneous programming environment makes use of different processors such as General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs) and Digital Signal Processors (DSPs) in addition to the CPU. Hence, the programmer needs to write code that makes use of both the CPU and the other processors by using a heterogeneous software environment, such as parallel MATLAB with GPU enabled functions, MATLAB supported CUDA kernels and CUDA C, to execute parallel code and achieve high performance in a heterogeneous programming environment in comparison with the homogeneous (sequential) programming approach that uses only the CPU.

Keywords — Multicore, Heterogeneous Computing, High Performance Computing, GPGPU, Parallel MATLAB, CUDA kernels.

I. INTRODUCTION

The architecture of the computing blocks of processors has changed within the past ten years. The technology moved from single-core processors to shared memory multicore processors, then to highly scalable “many core” processors, and finally to heterogeneous platforms, i.e. combinations of other processors such as GPGPU, FPGA and DSP in addition to the CPU. Hence, computing needs to change to keep up with the evolution of new hardware architecture, and programmers need to respond to the changes in hardware architecture. So, it is necessary for programmers to write programs that run on heterogeneous platforms to map onto the new hardware architectures.

The multi-core era saw some interesting developments in graphics processing units. Today, GPUs are used for both graphics and non-graphics processing applications. GPUs contain thousands of cores designed for parallel processing, whereas the CPU is designed for serial processing. The GPU makes use of advanced semiconductor technology and contains more ALUs and CUs than the CPU. A GPU has multiple SIMD (Single Instruction Multiple Data) units to perform many arithmetic operations of the same type. GPUs have vector processing capabilities that enable programmers to perform parallel operations on very large sets of data with much lower power consumption relative to the serial processing of similar data sets on CPUs. So, GPGPUs became increasingly attractive for more general purpose operations to address heterogeneous parallel programming.

Today, heterogeneous systems have become common for running computationally intensive applications. To support heterogeneous computing, the major software and hardware manufacturers started to create heterogeneous software environments such as the Parallel Computing Toolbox of MATLAB, GPU enabled MATLAB functions and NVIDIA’s Compute Unified Device Architecture (CUDA) framework. Parallel MATLAB supports three levels of GPU computing for heterogeneous computing. The CUDA architecture provides a scalable programming model for heterogeneous computing that results in high performance computing (HPC).

1.1 Graphics Processing Unit for Heterogeneous Computing

Graphics processing units were initially developed to improve 3D graphics performance by offloading graphics from the CPU to the GPU. The GPU contains thousands of cores, which are mainly designed for parallel processing. The producers of graphics hardware offer a programming interface that lets programmers utilize the computational resources of the GPU for general purposes; this is called general purpose GPU (GPGPU) computing.
Fig. 1. Computational Resources of CPU and GPU

Figure 1 illustrates the difference between the computational resources of the CPU and the GPU. The CPU is designed for serial processing, whereas the GPU is designed for parallel processing. The CPU has a larger cache and fewer CUs and ALUs, whereas the GPU has more ALUs and CUs, which helps it run parallel code to solve computationally intensive problems. The CPU is suited to Single Instruction Single Data (SISD) units that perform arithmetic operations sequentially, whereas the GPU is suited to Single Instruction Multiple Data (SIMD) units that perform arithmetic operations in parallel.

1.2 Parallel and Heterogeneous Computing with MATLAB

The Parallel Computing Toolbox of MATLAB provides built-in constructs such as parfor (parallel for) and spmd (single program multiple data) to run code simultaneously on multiple workers and make use of the multiple cores of the CPU. The difference between parfor and spmd is that the workers in spmd communicate with each other, but in parfor they do not communicate.

MATLAB supports hundreds of GPU enabled functions, such as the discrete Fourier transform (fft), matrix multiplication (mtimes) and left matrix division (mldivide), that run on data stored on the GPU. The built-in functions gpuArray and gather are used to transmit data from the CPU to the GPU and to get the result back from the GPU to the CPU, as shown in figure 2. In addition, programmers can run NVIDIA’s CUDA kernel objects in MATLAB on large data sets.

Fig. 2. Data and Result Exchange between CPU and GPU

II. RELATED STUDY

At the beginning, researchers increased the performance of parallel applications on clusters with CPU and GPU architecture. They combined OpenMP programming with task level services such as queuing and scheduling for efficient utilization of cluster-wide GPU devices, i.e. a heterogeneous platform used to support large-scale high-end parallel computing, as given in [1]. Re-configurable processors and general purpose GPUs are used in [2] with the OpenCL and CUDA programming models on scalable heterogeneous computing systems (SHOC) to increase the execution speed of computationally intensive applications. A Gaussian process model is designed and implemented for identification and simulation by using Jacket and the Parallel Computing Toolbox with MATLAB in [3]. The parallel processing capability of GPUs is used to increase throughput by processing millions of originally stored images in the Digital Processing Centre (DPC) for image processing of Family Search, as discussed in [4]. A parallel image convolution algorithm is implemented in [5] by using the Parallel Computing Toolbox (PCT) and the distributed computing server (DCS) toolbox of MATLAB. In [6], built-in constructs such as parfor are used to achieve the highest degree of programmability for computational applications.

The Jacket and GFOR tools of MATLAB are used in [7] to accelerate computationally intensive applications such as matrix multiplication and fft. CUDA technology is used in [8] for point to point image processing to implement techniques such as brightening, darkening and thresholding. Loop unrolling and tiling are done to gain performance enhancement on a heterogeneous platform, as given in [9]. The architecture of the CUDA framework is discussed in [10] to solve the computationally intensive part on the GPU by using grids of blocks and blocks of threads. Matrix multiplication is implemented in MATLAB to parallelize independent operations on the GPU, as discussed in [11,12]. The OpenCL programming model is used in [13,14] to compute the discrete Fourier transform (DFT) by using the GPU. The work proved that the butterfly structure of the FFT is suitable for implementation on parallel hardware architecture.

Different algorithms are designed in [15] for data sharing by code optimization that makes use of the GPU to speed up the functions in the data-independent or data-sharing category. The GPU is used to convert pictures into numbers instead of numbers into pictures in [16], which increased the processing speed in GHz. Efficient parallel video processing techniques on the GPU are designed and implemented in [17] by using the CUDA framework to reorganize the execution order and to optimize the data structure. They proposed an efficient parallel framework for an H.264/AVC encoder based on massively parallel architecture.

A parallel computational intelligence based multi-camera surveillance system is designed by using GPUs in [18] to address multiple vision tasks at various levels, such as segmentation, representation or characterization, and analysis and monitoring of the movement. Different parallel algorithms are discussed in [19] for image processing techniques such as point operations, dithering, smoothening, edge detection and image segmentation. The motion magnification algorithm is parallelized for video processing in [20] by using the Parallel Computing Toolbox of MATLAB.

III. ANALYSIS APPROACH

The proposed approach for the analysis of performance enhancement on graphic processor based heterogeneous architecture is to make use of both homogeneous and heterogeneous programming environments, as shown in figure 3.
Fig. 3. Proposed Methodology for Performance Analysis

3.1 Homogeneous Programming Environment

The homogeneous programming environment makes use of only the CPU, whereas the heterogeneous programming environment makes use of other processors in addition to the CPU. The C language, sequential MATLAB and parallel MATLAB are used to write and analyse the sequential and parallel code for the selected computationally intensive problems.

Traditional C language: The traditional C language is used to write sequential code and to observe the execution time taken by computationally intensive applications by using the clock() function from time.h. The data type clock_t is used to define two variables that store the start and end times. The elapsed time is further used for the analysis of performance enhancement in comparison with the heterogeneous programming environment.

The Sequential MATLAB: The MATLAB programming language provides many powerful built-in functions to solve computationally intensive problems. The MATLAB built-in functions tic and toc are used to calculate the time taken by the processors to solve a computationally intensive task.

3.2 Heterogeneous Programming Environment

The programming environment in which both the CPU and the GPGPU are used to solve computationally intensive applications in parallel is called a heterogeneous programming environment. The following languages are used to write and analyse the program execution time of the selected applications.

The Parallel Computing Toolbox of MATLAB: The Parallel Computing Toolbox (PCT) is used to solve computationally and data-intensive problems using multicore processors with high-level parallel for-loops. The MATLAB workers are used to run independent tasks concurrently by using the parfor loop construct.

GPU enabled MATLAB functions: The Parallel Computing Toolbox also provides GPU enabled functions to solve computationally and data-intensive problems using the many cores of the GPGPU processor.

CUDA kernel objects: The Parallel Computing Toolbox allows programmers to parallelize the computationally intensive part of an application on the GPU using CUDA kernel objects. These objects are used to run a CUDA kernel on large amounts of data stored in MATLAB matrices.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

The following computationally intensive programs are selected for the analysis of performance enhancement on graphic processor based heterogeneous architecture.
• Vector Addition
• Matrix Multiplication
• FFT
• Binary Image Conversion

The experiments are carried out for the above listed programs on both the homogeneous programming environment and the heterogeneous programming environment, which includes a multicore CPU with a GPGPU. It is observed that the execution time of the selected computationally intensive applications on the homogeneous platform is greater than on the heterogeneous platform.

The traditional C language and sequential MATLAB are used to run sequential code. The Parallel Computing Toolbox of MATLAB, GPU enabled functions of MATLAB and CUDA C are used to run parallel code for the analysis of the selected applications.

The GPU computing environment of parallel MATLAB is also used for heterogeneous computing by using GPU supported functions such as gpuArray, gather and the static methods that create data on the GPU. Using both the CPU and the GPU in this way results in less execution time in comparison with using only the CPU.

The following modes of experiment are used for the analysis of program execution speed.

Mode1: In the first mode of experiment, sequential code is written in the traditional C language to complete the given task, which results in more execution time.

Mode2: In the second mode of experiment, parallel code is written with the PCT of MATLAB by using the parfor construct, which makes use of the available pool workers to complete the assigned task simultaneously. This results in less execution time in comparison with the first attempt (sequential code).

Mode3: In the third mode of experiment, the computational data is transferred from the CPU to the GPU to perform massively parallel operations on the GPGPU. Built-in functions such as gpuArray transmit the data from the CPU to the GPU so that GPU supported built-in functions can be used. It is observed that the execution time is again reduced in comparison with the second attempt (parallel MATLAB).

Mode4: In the fourth mode of experiment, to overcome the latency of data transmission from the CPU to the GPU, the data is created on the GPU to perform the operations. It is observed that the execution time is again reduced in comparison with the third attempt (GPU computing).

Mode5: Finally, CUDA is used to exploit the architecture of the GPGPU to solve computationally intensive operations simultaneously, which again results in less execution time in comparison with the previous attempts.
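
For illustration, the MATLAB-based modes (Mode 2 to Mode 4) can be sketched for vector addition, together with a sequential MATLAB baseline. This is a minimal sketch and not the authors' code: the variable names, the use of rand to generate input, and the timing with tic/toc and wait(gpuDevice) are assumptions. It also assumes that the Parallel Computing Toolbox and a supported NVIDIA GPU are available.

```matlab
% Illustrative sketch only: vector addition in the MATLAB-based modes.
% Assumes Parallel Computing Toolbox and a supported NVIDIA GPU.
n = 1e6;                          % one million values (assumed input size)
a = rand(n, 1);                   % random input is an assumption
b = rand(n, 1);

% Sequential MATLAB baseline: element-by-element loop on the CPU.
c = zeros(n, 1);
tic;
for i = 1:n
    c(i) = a(i) + b(i);
end
tSeq = toc;

% Mode 2: parfor distributes the loop iterations over the pool workers.
if isempty(gcp('nocreate'))
    parpool;                      % start the pool before timing
end
cPar = zeros(n, 1);
tic;
parfor i = 1:n
    cPar(i) = a(i) + b(i);
end
tParfor = toc;

% Mode 3: copy the data to the GPU with gpuArray, add there, gather the result.
tic;
aGpu = gpuArray(a);
bGpu = gpuArray(b);
cGpu = aGpu + bGpu;
cFromGpu = gather(cGpu);          % gather copies the result back to the CPU
tGpuTransfer = toc;

% Mode 4: create the data directly on the GPU to avoid the CPU-to-GPU copy.
xGpu = rand(n, 1, 'gpuArray');
yGpu = rand(n, 1, 'gpuArray');
dev = gpuDevice;
wait(dev);                        % make sure data creation has finished
tic;
zGpu = xGpu + yGpu;
wait(dev);                        % block until the GPU addition has finished
tGpuOnly = toc;

fprintf('sequential %.4fs, parfor %.4fs, gpuArray %.4fs, on-GPU %.4fs\n', ...
        tSeq, tParfor, tGpuTransfer, tGpuOnly);
```

The measured times depend on the hardware and on the MATLAB release, so they will not reproduce the values reported in this paper. The first GPU call in a MATLAB session typically includes a one-time initialization cost, so running the script once as a warm-up before recording times is advisable.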
The different experimental results are noted for the analysis of the execution speed of the selected applications, and they are discussed below.

4.1 Vector Addition

The addition of corresponding elements of vectors of one million values is done on homogeneous and heterogeneous platforms in different programming environments: sequential MATLAB, parallel MATLAB, GPU computing in MATLAB, and finally CUDA kernel objects in MATLAB. The experimental results are listed in table 1.

TABLE 1. ANALYSIS OF EXECUTION TIME FOR VECTOR ADDITION
Cases            Only CPU    parfor                       gpuArray  Compute on GPU  gpuArray data  CUDA kernel
Time Elapsed     8.96s       5.00s                        0.015s    0.009s          0.0017s        0.00018s
Time Complexity  Θ(n)        Ω(1)                         Ω(1)      Ω(1)            Ω(1)           Ω(1)
Approach         Sequential  Parallel                     Parallel  Parallel        Parallel       Parallel
Processor        CPU         CPU with 4 Cores, 4 Workers  CPU+GPU   CPU+GPU         CPU+GPU        CPU+GPU

The performance enhancement is observed in the heterogeneous programming environment by using CUDA kernel objects in comparison with the homogeneous and parallel programming environments, as shown in figure 5.

4.3 Fast Fourier Transform

Two programming environments, sequential MATLAB and GPU computing in MATLAB, are used to compute the two dimensional FFT algorithm. The experimental results are listed in table 3.

TABLE 3. ANALYSIS OF EXECUTION TIME FOR FFT
Cases            Sequential MATLAB  GPU enabled parallel MATLAB
Time Elapsed     0.31s              0.02s
Time Complexity  Ω(1)               Ω(1)
Environment      Homogeneous        Heterogeneous
Processor        CPU with 4 cores   CPU + GPU
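
The FFT comparison of section 4.3 can be sketched in MATLAB as follows. This is a hedged illustration, not the authors' program: the paper does not state the input size, so the 2048 x 2048 matrix, the variable names and the error check are assumptions. It assumes the Parallel Computing Toolbox and a supported GPU.

```matlab
% Illustrative sketch only: two dimensional FFT on the CPU and on the GPU.
A = rand(2048, 2048);             % hypothetical input size, not from the paper

% Sequential MATLAB: fft2 on data held in CPU memory.
tic;
F = fft2(A);
tCpu = toc;

% GPU computing in MATLAB: fft2 on a gpuArray runs on the GPU.
Agpu = gpuArray(A);               % copy the input to the GPU before timing
dev = gpuDevice;
wait(dev);
tic;
Fgpu = fft2(Agpu);
wait(dev);                        % block until the GPU has finished
tGpu = toc;

% Sanity check: the two results should agree up to rounding error.
relErr = norm(gather(Fgpu) - F, 'fro') / norm(F, 'fro');
fprintf('CPU %.4fs, GPU %.4fs, relative difference %.2e\n', tCpu, tGpu, relErr);
```

The sketch times only the transform itself on the GPU. Including the gpuArray and gather calls inside the timed region would instead measure the transfer-inclusive cost that corresponds to Mode 3.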