1 Introduction
Directive-based programming models allow developers to annotate code in FORTRAN, C, and C++ so that it can be offloaded from the CPU to an accelerator, which provides portability across operating systems, CPUs, and accelerators. OpenACC is a programming standard for parallel programming developed by PGI, Cray, CAPS, and Nvidia.
The main objective of this work is to utilize the available GPUs efficiently for parallel programming and to compare the time taken to execute scientific applications on CPUs and GPUs using OpenACC and OpenMP. The work focuses on parallelizing benchmark applications such as the fast Fourier transform, the Laplace transform, molecular dynamics, and matrix multiplication using the OpenMP and OpenACC directive-based languages. The experiment section demonstrates the execution of these parallel applications in various programming environments. The result section demonstrates that the benchmark applications implemented using OpenACC perform better than the same applications implemented using OpenMP.
The remaining part of this paper is organized as follows: Sect. 2 highlights various national and international efforts that focus on understanding the concepts of high-performance computing, parallel programming, and the directive-based programming models used for parallel programming. Section 3 focuses on an architectural analysis of the programming models. Section 4 discusses the proposed parallel implementation of various benchmark applications. Section 5 describes the performance evaluation, the experimental setup, and the results of the experiment along with the graphs. Finally, Sect. 6 provides the conclusion and future work to enhance the proposed system.
2 Related Work
This section describes efforts made by several researchers to deal with issues and challenges in applying these methods to real-time applications. Li and Shih [1] present a performance comparison between CUDA and OpenACC. The performance analysis covers the programming models and the underlying compilers. Performance gaps are examined across nineteen kernels of ten benchmarks. They use kernel execution time and data sensitivity as the primary criteria when drawing conclusions. The comparison of data sensitivity is a new index that explores an easily ignored problem: how sensitive each programming model is to changes in data size. Their PRoDS equation gives an objective examination rather than a subjective comparison. The OpenACC programming model is significantly more sensitive to data than CUDA with optimizations, while CUDA is much more sensitive to optimizations than OpenACC. Generally speaking, OpenACC performance is practically equivalent to CUDA under a fair comparison, and OpenACC may be a fair alternative to CUDA, especially for beginners in high-level parallel programming.
The work [2] shows that OpenACC and OpenMP each lack the ability to fully generate a tailored multidimensional grid of threads for GPUs. Although this drawback can often be overcome by flattening the loops via the collapse clause, their experiments show that the performance of such an optimization may still be slower than a CUDA two-dimensional grid, where they achieve their best performance results.
In Ledur et al. [3] and Memeti et al. [4], the authors show the characteristics of OpenCL, OpenMP, OpenACC, and CUDA with respect to programming productivity, performance, and energy consumption. In Wang et al. [5], the authors show how realistic it is to use a single OpenACC source code across hardware with different underlying micro-architectures, Nvidia Kepler and Intel Knights Corner. In this project, we consider Nvidia Tesla and Nvidia Quadro. The performance portability of OpenACC is related to arithmetic intensity, and a big performance gap still exists between platforms for specific benchmarks.
The work [6] demonstrated that GPU-based parallel computer architecture has been gaining popularity as a building block for high-performance computing and for future exascale computing. The authors assessed existing directive-based models by porting application kernels from different scientific domains to CUDA GPUs, which in turn allowed them to identify fundamental issues in the usability, portability, manageability, and debuggability of the current models. The OpenACC programming model adopts a well-known approach: the original application is annotated with directives and calls to a runtime application program interface, and the compiler is directed to generate kernels that execute on the attached GPU or GPUs.
The OpenACC execution model is similar to that of CUDA: a main program runs on the CPU and launches tasks (computational kernels or data transfers) on the GPU. The main program handles synchronization, either through explicit user control or implicitly. OpenACC adopts the inherently weak memory model used in most GPU programming models: the GPU and CPU memory spaces are distinct. OpenACC is an API that provides a set of compiler directives, runtime library routines, and environment variables that can be used to create parallel programs in Fortran, C, and C++ to run on accelerators, including GPUs. Developers can begin by writing their algorithms sequentially and then introduce OpenACC directives into the algorithm. This is like giving hints to the compiler to make the code parallel.
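As a minimal illustration of this execution model (our sketch, not code from the paper), the following SAXPY kernel makes the CPU-to-GPU data transfers explicit with a data region and lets the compiler generate the GPU kernel:

    /* Minimal OpenACC sketch: data region + offloaded loop. */
    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static float x[N], y[N];
        float a = 2.0f;
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* copyin: transfer x to GPU memory; copy: transfer y in and back out */
        #pragma acc data copyin(x[0:N]) copy(y[0:N])
        {
            /* the compiler generates a GPU kernel for this loop */
            #pragma acc parallel loop
            for (int i = 0; i < N; i++)
                y[i] = a * x[i] + y[i];
        }
        printf("y[0] = %f\n", y[0]);
        return 0;
    }

Such a program would be compiled with an OpenACC-capable compiler, for example pgcc -acc saxpy.c; without the -acc flag, the directives are ignored and the code runs sequentially on the CPU.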
The OpenMP standard was created and is maintained by the OpenMP Architecture Review Board, formed by several large organizations, for example, Intel, SGI, Sun Microsystems, IBM, and others, which at the end of 1997 joined forces to create a standard parallel programming model for shared memory. The OpenMP API centers on a set of directives that support the creation of parallel programs with shared memory through the automatic and optimized management of threads. Its features can be used in FORTRAN 77, FORTRAN 90, C, and C++. The benefits of using OpenMP include the simplicity and small number of changes required in the code, strong support for parallel programming, ease of understanding and use of the directives, support for nested parallelism, and the possibility of dynamically changing the number of threads used.
An OpenMP program contains pragma omp directives. When the compiler encounters such a directive, the marked region executes in parallel: the tasks are divided among multiple cores of the CPU, and the program executes the tasks on different CPU cores simultaneously. Hence, parallel execution of a given application can be achieved.
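A minimal sketch of this pattern (our illustration, not the paper's code) divides a loop's iterations across the available CPU cores:

    /* Minimal OpenMP sketch: iterations shared among CPU threads. */
    #include <omp.h>
    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static double a[N];
        /* each thread executes a contiguous chunk of the iteration space */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;
        printf("a[N-1] = %f (threads available: %d)\n",
               a[N - 1], omp_get_max_threads());
        return 0;
    }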
An OpenACC program contains pragma acc directives. When the compiler encounters such a directive, the marked region executes in parallel: the work is divided among the many cores of the GPU. A GPU contains thousands of cores that can be used to execute the program faster, so the time taken to execute a program is reduced, and all GPU cores are used simultaneously to achieve parallel execution.
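The OpenACC counterpart of the loop above (again our sketch, under the assumption of an OpenACC-capable compiler) uses the kernels directive and leaves the parallelization strategy to the compiler:

    /* Minimal OpenACC sketch: the kernels directive offloads the loop. */
    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static double a[N];
        /* the compiler maps these iterations onto GPU threads */
        #pragma acc kernels
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;
        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }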
4 Proposed Methodology
The first benchmark application is matrix multiplication: in the product C = AB, the m entries across row i of A are multiplied with the m entries down column j of B and summed (the precise definition is below):

    C_{ij} = ∑_{k=1}^{m} A_{ik} B_{kj}
Algorithm
A naive implementation assigns one thread to compute one element of matrix C. Each thread loads one row of matrix A and one column of matrix B from global memory, performs the inner product, and stores the result back to matrix C in global memory. To increase the computation-to-memory ratio, tiled matrix multiplication can be applied, where one thread block computes one tile of matrix C (Fig. 1).
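The following is a minimal sketch of the naive kernel (our reconstruction, not the authors' code): the row loop is shared among CPU threads with OpenMP, and the comment shows the OpenACC variant that offloads the same loops to the GPU.

    /* Naive matrix multiplication C = A * B, parallelized over rows. */
    #define N 1024
    static double A[N][N], B[N][N], C[N][N];

    void matmul(void) {
        /* OpenMP version: rows of C are divided among CPU threads.
           For the GPU, replace the directive with:
           #pragma acc kernels loop copyin(A, B) copyout(C) */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;           /* private to each thread */
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }
    }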
An FFT computes the DFT and produces exactly the same result as evaluating the DFT definition directly; the essential practical difference is that an FFT is much faster. Let x_0, …, x_{N−1} be complex numbers. The DFT is defined by the formula

    X_k = ∑_{n=0}^{N−1} x_n e^{−2πikn/N},  k = 0, …, N − 1.

Evaluating this definition directly requires O(N^2) operations: there are N outputs X_k, and each output requires a sum of N terms. An FFT is any method to compute the same results in O(N log N) operations. More precisely, all known FFT algorithms require O(N log N) operations.
Algorithm
The portion of work that can be parallelized in this problem is computing the twiddle factors and the FFT itself. Mapping of concurrent portions of work is achieved by using the pragma omp for directive, which splits parallel iteration spaces across threads in OpenMP, and by using the pragma acc kernels directive in OpenACC to execute the portion of code on the GPU. The data samples and the twiddle factors, in the form of complex numbers, are shared (global) across all the threads. Intermediate results generated during processing are stored in variables that are private (local) to the threads.
Synchronization is needed while moving back and forth in the recursive task division, which is achieved through the barrier directive in OpenMP and the corresponding synchronization mechanism in OpenACC.
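Since each twiddle factor W_N^k = e^{−2πik/N} is independent of the others, their computation maps directly onto a worksharing loop. The helper below is hypothetical (our sketch, not the authors' code); the analogous OpenACC form would use pragma acc kernels on the same loop.

    /* Hypothetical sketch: twiddle factors computed in parallel. */
    #include <complex.h>
    #include <math.h>

    void twiddle_factors(double complex *w, int n) {
        const double pi = acos(-1.0);
        #pragma omp parallel for
        for (int k = 0; k < n / 2; k++)
            w[k] = cexp(-2.0 * pi * I * k / n);  /* each w[k] is independent */
    }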
The Laplace transform is a widely used integral transform in mathematics and electrical engineering, named after Pierre-Simon Laplace, that transforms a function of time, f(t), into a function of complex frequency, F(s) = ∫_0^∞ f(t) e^{−st} dt. The inverse Laplace transform takes a complex frequency-domain function and yields a function defined in the time domain. The Laplace transform is related to the Fourier transform, but whereas the Fourier transform expresses a function or signal as a superposition of sinusoids, the Laplace transform expresses a function, more generally, as a superposition of moments. Given a simple mathematical or functional description of an input to or output of a system, the Laplace transform provides an alternative functional description that often simplifies the process of analyzing the behavior of the system, or of synthesizing a new system based on a set of specifications. Thus, for example, the Laplace transform from the time domain to the frequency domain turns differential equations into algebraic expressions and convolution into multiplication.
Algorithm
The portion of work that can be parallelized in this problem is processing each matrix for a given size and calculating the time taken for it. Mapping of concurrent portions of work is achieved by using the pragma omp for directive, which splits parallel iteration spaces across threads in OpenMP, and by using the pragma acc kernels directive in OpenACC to execute the portion of code on the GPU. The input data, viz. the size of each matrix and its values, are shared (global) across all the threads. Intermediate results generated during processing are stored in variables that are private (local) to the threads. Synchronization is needed while calculating the time taken for each matrix. These calculations can be done within the pragma omp master directive in OpenMP and an equivalent single-threaded region in OpenACC.
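The timing pattern described above can be sketched as follows (our illustration, with a stand-in loop body, not the authors' code): only the master thread records the timestamps, while all threads share the work in between.

    /* Sketch: timing measured once, inside the master construct. */
    #include <omp.h>

    double timed_region(int n, double *data) {
        double start = 0.0, elapsed = 0.0;
        #pragma omp parallel
        {
            #pragma omp master
            start = omp_get_wtime();   /* only the master thread records */
            #pragma omp barrier        /* master has no implied barrier */

            #pragma omp for
            for (int i = 0; i < n; i++)
                data[i] *= 2.0;        /* stand-in for the transform work */
            /* implicit barrier at end of the for construct */

            #pragma omp master
            elapsed = omp_get_wtime() - start;
        }
        return elapsed;
    }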
Algorithm
The portion of work that can be parallelized in this problem is calculating the force on each particle. Mapping of concurrent portions of work is achieved by using the pragma omp for directive, which splits parallel iteration spaces across threads in OpenMP, and by using the pragma acc kernels directive in OpenACC to execute the portion of code on the GPU. The input data, viz. the number of particles, their positions, and their initial velocities, are shared (global) across all the threads. Intermediate results generated during processing are stored in variables that are private (local) to the threads. Synchronization is needed while measuring properties such as total kinetic energy and total potential energy. These calculations can be done within the pragma omp master directive in OpenMP and an equivalent single-threaded region in OpenACC.
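A hedged sketch of the force loop follows (hypothetical structure with a toy one-dimensional pair potential, not the authors' code): the per-particle forces are computed independently in parallel, and the shared accumulation of total potential energy is handled with a reduction clause.

    /* Sketch: parallel force computation with an energy reduction. */
    #include <math.h>

    void compute_forces(int np, const double *pos,
                        double *force, double *pot) {
        double potential = 0.0;
        /* for the GPU version, the same loop would sit under
           #pragma acc kernels */
        #pragma omp parallel for reduction(+:potential)
        for (int i = 0; i < np; i++) {
            force[i] = 0.0;            /* each i is owned by one thread */
            for (int j = 0; j < np; j++) {
                if (i == j) continue;
                double d = pos[j] - pos[i];   /* 1-D for brevity */
                double r = fabs(d);
                potential += 0.5 / r;          /* toy pair potential */
                force[i] += d / (r * r * r);   /* toy pair force */
            }
        }
        *pot = potential;
    }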
5 Performance Evaluation
This section gives a brief description of the experimental setup required for this work to obtain the required results. First, the experimental setup is based on the system specifications needed to meet the requirements. Different applications are used to determine the performance of the CPU and GPU, and we conclude with results and screenshots. Once the system requirements are settled, we need to determine whether a specific software package fits the system requirements or not. The
software stack includes the Nvidia drivers, the PGI compiler, the CUDA toolkit (version 6.5), and the Fedora operating system.
The following is the configuration of the system used in our experimental setup to run the applications on the CPU and GPU (Tables 1 and 2).
The PGI compiler is required to compile and run the OpenACC and OpenMP programs. The following graphs show the comparison of OpenMP and OpenACC on CPUs and GPUs for the scientific applications on the Tesla and Quadro graphics cards, in which the x-axis shows the time in seconds and the y-axis shows the problem size (such as the number of particles); blue shows the results of OpenMP and red shows the results of OpenACC. Performance is measured by varying the problem size of the benchmark applications.
Figure 2 shows the comparison of OpenMP and OpenACC for matrix multiplication. The results on the Tesla graphics card show better performance because that system contains more CPU cores (24) and more GPU cores (448) than the Quadro system, which contains fewer CPU cores (8) and fewer GPU cores (240). As the size of the matrix increases, the OpenACC implementation performs better (Fig. 3).
Figure 3 shows the comparison of OpenMP and OpenACC for the FFT. On both devices, we can observe a large gap in performance between the OpenACC and OpenMP implementations as the problem size increases.
Fig. 3 Fast Fourier transform (FFT) performance on Tesla and Quadro graphics cards
Figure 4 shows the comparison of the OpenMP and OpenACC implementations for the Laplace transform. The results demonstrate a performance improvement with the OpenACC implementation as the size of the data samples increases.
Figure 5 shows the comparison of OpenMP and OpenACC for molecular dynamics. The results on the Tesla graphics card show better performance because that system contains more CPU cores (24) and more GPU cores (448) than the Quadro system, which contains fewer CPU cores (8) and fewer GPU cores (240). The results also demonstrate an improvement in performance as the number of particles increases.
6 Conclusion
The use of parallel programming will keep increasing in the near future, not only in large-scale computing software but also in the systems of small and medium businesses, to achieve more speed and to give the programmer more options to exploit hardware resources. In this project, the performance of the parallel programs is better than the performance of the serial programs. When performance measurement was done, it was observed that the time taken by OpenACC is better than OpenMP for smaller data sizes as well as when the data size increases. Hence, from our work, we can conclude that OpenACC is a good option for data-parallel tasks on GPUs. OpenMP is less complex and can give nearly equivalent performance to OpenACC on CPUs. Developers who want to utilize parallelism must change their programming paradigms to address the demands for faster and better-performing software applications, in order to increase the computation and processing capacity possible.
References
1. Li X, Shih PC (2018) An early performance comparison of CUDA and OpenACC. MATEC Web Conf 208:05002. https://doi.org/10.1051/matecconf/201820805002
2. Gayatri R, Yang C, Kurth T, Deslippe J (2018) A case study for performance portability using OpenMP 4.5. In: WACCPD@SC
3. Ledur CL, Zeve CM, dos Anjos JC (2013) Comparative analysis of OpenACC, OpenMP and CUDA using sequential and parallel algorithms
4. Memeti S, Li L, Pllana S, Kolodziej J, Kessler C (2017) Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption. In: Proceedings of the 2017 workshop on adaptive resource management and scheduling for cloud computing (ARMS-CC '17). Association for Computing Machinery, New York, NY, USA, pp 1–6. https://doi.org/10.1145/3110355.3110356
5. Wang Y, Qin Q, See SCW, Lin J (2013) Performance portability evaluation for OpenACC on Intel Knights Corner and Nvidia Kepler
6. Lee S, Vetter JS (2013) Early evaluation of directive-based GPU programming models for productive exascale computing. In: SC '12: Proceedings of the international conference on high performance computing, networking, storage and analysis, pp 1–11