
Comparative Study of Directive-based Programming Models on CPUs and GPUs for Scientific Applications

C. Navya, H. A. Sanjay, and Sanket Salvi

C. Navya · H. A. Sanjay · S. Salvi
Nitte Meenakshi Institute of Technology, Bangalore, India
e-mail: [email protected]

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022
N. R. Shetty et al. (eds.), Emerging Research in Computing, Information, Communication and Applications, Lecture Notes in Electrical Engineering 790
https://doi.org/10.1007/978-981-16-1342-5_61

1 Introduction

High performance computing has been an interesting field of study for the past two decades. As the need for large and complex applications has increased across many industries, the need for high performance computing has increased with it, and with the need to give more computing resources to users comes the need to utilize all hardware resources efficiently. However, CPU technology alone is not capable of scaling in performance sufficiently to address this demand. GPUs can deliver impressive performance by utilizing the available GPU cores, and the graphics processing unit (GPU) can therefore be used for parallel programming. Parallel programming is the process of dividing complex computational tasks into smaller tasks that can run concurrently. The main objective of HPC is to achieve better performance for large and complex scientific applications by adopting the parallel programming paradigm. Multi-core CPUs improve the performance of an HPC application only up to a point, and this improvement cannot be extrapolated proficiently forward in time. In either case, the critical part of the program remains difficult to overcome in order to use the available computational power efficiently.
For heterogeneous parallel programming, directive-based accelerator models are used: compared with low-level programming models (PThreads, CUDA, and OpenCL), directive-based languages such as OpenMP and OpenACC offer a higher-level alternative. OpenACC is a newer programming standard for parallel programming, whose application program interface defines a collection of compiler directives that describe loops and regions of code in FORTRAN, C, and C++ so that they can be offloaded from the CPU to an accelerator, which provides portability across operating systems, CPUs, and accelerators. OpenACC is a programming standard for parallel programming developed by PGI, Cray, CAPS, and Nvidia.
The main objective of this work is to utilize the available GPUs efficiently for
parallel programming and to compare the time taken for executing scientific appli-
cations on CPUs and GPUs using OpenACC and OpenMP. The work focuses on
parallelizing benchmark applications like fast Fourier transform, Laplace trans-
form, molecular dynamics, and matrix multiplication using OpenMP and OpenACC
directive-based languages. The experiments section demonstrates the execution of these parallel applications in various programming environments. The results section shows that the benchmark applications implemented using OpenACC perform better than the same applications implemented using OpenMP.
The remainder of this paper is organized as follows: Sect. 2 highlights various national and international efforts that focus on understanding the concepts of high performance computing, parallel programming, and the directive-based programming models used for parallel programming. Section 3 presents an architectural analysis of the programming models. Section 4 discusses the proposed parallel implementation of various benchmark applications. Section 5 describes the performance evaluation, the experimental setup, and the results of the experiments along with the graphs. Finally, Sect. 6 provides the conclusion and future work to enhance the proposed system.

2 Related Work

This section describes efforts made by several researchers to deal with issues and challenges in applying these methods to real-world applications. Li and Shih [1] present a performance comparison between CUDA and OpenACC. Their analysis covers both the programming models and the underlying compilers, and performance gaps are reported for nineteen kernels drawn from ten benchmarks. Kernel execution time and data sensitivity are used as the main criteria from which conclusions are drawn. The data-sensitivity comparison is a new index that exposes an easily ignored problem, namely how sensitive each programming model is to changes in data size, and the proposed PRoDS metric yields an objective comparison rather than a subjective one. They find that OpenACC is considerably more sensitive to data than CUDA when optimizations are applied, while CUDA is more sensitive than OpenACC to the optimizations themselves. Overall, the OpenACC performance is practically equivalent to CUDA under a fair comparison, and OpenACC may be a fair alternative to CUDA, especially for beginners in high-level parallel programming.
The work in [2] shows that neither OpenACC nor OpenMP can fully generate a tailor-made multidimensional grid of threads for GPUs. Although this limitation can often be overcome by flattening the loop nest via the collapse clause, their experiments show that the performance of such an optimization may still fall short of a hand-written CUDA two-dimensional grid, which achieved their best performance result.
In Ledur et al. [3] and Memeti et al. [4], the authors examine the characteristics of OpenCL, OpenMP, OpenACC, and CUDA with respect to programming productivity, performance, and energy consumption. In Wang et al. [5], the authors show how realistic it is to use a single OpenACC source code for a set of hardware platforms with different underlying micro-architectures, Nvidia Kepler and Intel Knights Corner. In this project, we are considering Nvidia Tesla and Nvidia Quadro. The performance portability of OpenACC is related to the arithmetic intensity, and a large performance gap still exists between platforms for specific benchmarks.
The work in [6] demonstrates that GPU-based parallel computer architectures have gained increasing popularity as building blocks for high-end computing and for future exascale computing. The authors evaluate existing directive-based models by porting application kernels from several scientific domains onto CUDA GPUs, which in turn allows them to identify fundamental issues in the usability, scalability, tunability, and debuggability of the current models.

3 Architectural Analysis of Parallel Programming Models

Generic directive-based programming frameworks consist of directives, library routines, and designated compilers. In directive-based GPU programming models, a set of directives is used to convey information to the designated compilers, for example, hints on mapping loops onto the GPU and data-sharing rules. The most significant advantage of using directive-based GPU programming models is that they provide a high-level abstraction of GPU programming, since the designated compiler hides most of the complex details specific to the underlying GPU architectures. Another advantage is that the directive framework makes it easy to perform incremental parallelization of applications, as in OpenMP, so that a user can select regions of a host program to be offloaded to a GPU device step by step, after which the compiler automatically generates the corresponding host and device programs. Several directive-based GPU programming models exist. These models provide different levels of abstraction and require different amounts of programming effort to follow their models and optimize the resulting execution and data movement.

3.1 OpenACC Programming Model

The OpenACC programming model adopts a familiar approach: the original application is annotated with directives and with calls to a runtime application program interface, and the compiler is directed to generate kernels that execute on the attached GPU or GPUs.
The OpenACC execution model is similar to that of CUDA; a main program runs on the CPU and launches tasks (computational kernels or data transfers) on the GPU. The main program handles synchronization, either through explicit user control or implicitly. OpenACC adopts the weak memory model used in most GPU programming models: the GPU and CPU memory spaces are distinct. OpenACC is an API that provides a set of compiler directives, runtime library routines, and environment variables that can be used to create parallel programs in Fortran, C, and C++ to run on accelerators, including GPUs. Developers can begin by writing their algorithms sequentially and then introduce OpenACC directives into the algorithm; the directives act as hints that allow the compiler to make the code parallel.
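As a minimal illustration (an assumed example added here, not taken from the paper), the following C fragment shows how a simple sequential loop can be annotated with an OpenACC directive so that the compiler offloads it to the GPU:

/* Minimal OpenACC sketch (illustrative, assumed example): the kernels
   directive asks the compiler to offload the loop to the accelerator,
   and the copy clauses describe data movement between CPU and GPU. */
void vector_add(int n, const float *a, const float *b, float *c)
{
    #pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* each iteration is independent */
}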

3.2 OpenMP Programming Model

The OpenMP standard was created and is maintained by the OpenMP Architecture Review Board, a group formed by several major organizations, for example, Intel, SGI, Sun Microsystems, IBM, and others, which at the end of 1997 joined forces to create a standard parallel programming model for shared memory. The OpenMP API focuses on a set of directives that supports the creation of shared-memory parallel programs through the automatic and optimized management of a set of threads. Its features can be used in FORTRAN 77, FORTRAN 90, C, and C++. The benefits of using OpenMP include the simplicity and small amount of change required in the code, the strong support for parallel programming, the ease of understanding and using the directives, the support for nested parallelism, and the possibility of dynamically changing the number of threads used.
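As a matching illustration (again an assumed example, not from the paper), the same loop used in the OpenACC sketch above can be parallelized for a multi-core CPU with a single OpenMP directive:

/* Minimal OpenMP sketch (illustrative, assumed example): the parallel for
   directive splits the loop iterations across the available CPU threads. */
void vector_add_omp(int n, const float *a, const float *b, float *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* iterations are divided among threads */
}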

3.3 Difference Between OpenACC and OpenMP

An OpenMP program contains pragma omp directives. When the compiler encounters such a directive, it executes the marked region in parallel: the work is divided among the multiple cores of the CPU, and the tasks run on different CPU cores simultaneously. In this way, the parallel execution of a given application is achieved on the CPU.
An OpenACC program contains pragma acc directives. When the compiler encounters such a directive, it executes the marked region in parallel on the GPU: the work is divided among the many GPU cores. A GPU contains thousands of cores that can be used to execute the program faster, so the time taken to execute the program is reduced, and all GPU cores are used simultaneously to achieve parallel execution.

4 Proposed Methodology

The implementation part describes the parallel implementation of the benchmark applications, using the available GPU cores efficiently through OpenACC and the available CPU cores through OpenMP. The performance of the OpenACC programs running on the GPU is compared with that of the OpenMP programs running on the CPU, and the runtime is evaluated on the test dataset in each case. The benchmark HPC applications used for the comparison are matrix multiplication, fast Fourier transform, Laplace transform, and molecular dynamics.
The parallelizing process may incorporate some or all of the following:
• Identifying segments of the code that can be performed concurrently.
• Mapping the concurrent pieces of code onto multiple processes running in parallel.
• Distributing the input, output, and intermediate data associated with the program.
• Managing access to data shared by multiple processors.
• Synchronizing the processors at different phases of the parallel program execution.

4.1 Parallel Implementation of Matrix Multiplication:

In mathematics, matrix multiplication is a binary operation that takes a pair of matrices and produces another matrix. Numbers, such as the real or complex numbers, can be multiplied using ordinary arithmetic. Matrices, on the other hand, are arrays of numbers, so there is no single unique way to define "the" multiplication of matrices. Accordingly, in general, the term "matrix multiplication" refers to a number of different ways to multiply matrices. The key elements of any matrix multiplication are the numbers of rows and columns the original matrices have (called the "size", "order," or "dimension") and the rule specifying how the entries of the matrices generate the new matrix. Like vectors, matrices of any size can be multiplied by scalars, which amounts to multiplying every entry of the matrix by the same number. Analogous to the entry-wise definition of adding or subtracting matrices, multiplication of two matrices of the same size can be defined by multiplying the corresponding entries, and this is known as the Hadamard product. Another definition is the Kronecker product of two matrices, which yields a block matrix. Many other definitions can be formed. However, the most useful definition is motivated by linear equations and linear transformations on vectors, which have numerous applications in applied mathematics, physics, and engineering. This definition is usually called the matrix product [2, 3]. In words, if A is an n × m matrix and B is an m × p matrix, their matrix product AB is an n × p matrix, in which the m entries across a row of A are multiplied with the m entries down a column of B and summed (the exact definition is given below).

Fig. 1 Parallelization of matrix multiplication

Algorithm

1. Input: matrices A[m,p] and B[p,n]
2. Let C[m,n] be a new matrix
3. for i from 1 to m:
4.   for j from 1 to n:
5.     Let sum = 0
6.     for k from 1 to p:
7.       Set sum <- sum + A[i,k] × B[k,j]
8.     Set C[i,j] <- sum
9. Return C

A naive implementation assigns one thread to compute one element of matrix C. Each thread loads one row of matrix A and one column of matrix B from global memory, performs the inner product, and stores the result back to matrix C in global memory. To increase the compute-to-memory ratio, tiled matrix multiplication can be applied, in which one thread block computes one tile of matrix C (Fig. 1).
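As an illustration of how this loop nest can be mapped to the two models, the sketch below shows the multiplication in C; it is a hedged example based on the algorithm above rather than the authors' code, and the row-major indexing and the collapse clause are assumptions.

/* Illustrative sketch of the matrix-multiplication kernel from the algorithm
   above (not the authors' code).  C = A * B with A of size m x p, B of size
   p x n, and C of size m x n, stored row-major in one-dimensional arrays. */
void matmul(int m, int p, int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2)     /* CPU: split the (i, j) pairs across threads */
    /* GPU variant: #pragma acc parallel loop collapse(2) copyin(A[0:m*p], B[0:p*n]) copyout(C[0:m*n]) */
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < p; k++)
                sum += A[i * p + k] * B[k * n + j];
            C[i * n + j] = sum;              /* one element of C per (i, j) pair */
        }
    }
}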

4.2 Parallel Implementation of Fast Fourier Transforms:

An FFT computes the DFT and produces exactly the same result as evaluating the DFT definition directly; the essential practical difference is that an FFT is much faster. Let x_0, …, x_{N−1} be complex numbers. The DFT is defined by the formula

X_k = Σ_{n=0}^{N−1} x_n · e^(−2πi·kn/N),  k = 0, …, N − 1.

Evaluating this definition directly requires O(N²) operations: there are N outputs X_k, and each output requires a sum of N terms. An FFT is any method that computes the same results in O(N log N) operations. More precisely, all known FFT algorithms require on the order of N log N operations, although there is no known proof that a lower complexity is impossible.
To illustrate the savings of an FFT, consider the count of complex multiplications and additions. Evaluating the DFT's sums directly involves N² complex multiplications and N(N − 1) complex additions. The well-known radix-2 Cooley-Tukey algorithm, for N a power of 2, can compute the same result with only (N/2) log2(N) complex multiplications and N log2(N) complex additions. In practice, actual performance on modern computers is usually governed by factors other than the raw count of arithmetic operations, and the analysis is a complicated subject, but the overall improvement from O(N²) to O(N log N) remains.
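As a concrete illustration of this gap (a worked example added here, not in the original text), for N = 1024 the direct evaluation needs N² = 1,048,576 complex multiplications, whereas the radix-2 FFT needs only (N/2) log2 N = 512 × 10 = 5120, roughly a 200-fold reduction.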

Algorithm

1. For k=0 to l // where l=log (N)/log (2)


2. Divide the set into intervals
3. Get the corresponding twiddle factors, W
4. in each interval
5. for each pair of points J, J+Half-Size
6. Analyze using butterfly representation

The portion of work that can be parallelized in this problem is the computation of the twiddle factors and the FFT butterflies. Mapping of the concurrent portions of work is achieved with the pragma omp for directive, which splits the parallel iteration space across threads in OpenMP, and with the pragma acc kernels directive in OpenACC to execute the corresponding portion of code on the GPU. The data samples and the twiddle factors, stored as complex numbers, are shared (global) across all the threads, while intermediate results generated during processing are stored in variables that are private (local) to the threads.
Synchronization is needed while moving back and forth in the recursive task division, which is achieved through the barrier directive in OpenMP and the corresponding barrier mechanism in OpenACC.
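The sketch below (an assumed illustration, not the authors' implementation) shows how the butterfly loop of one FFT stage can be mapped to threads; the names data, twiddle, and half are hypothetical.

/* Hedged sketch of one stage of an iterative radix-2 FFT (illustrative only).
   Within a stage, every butterfly pair (j, j + half) is independent, so the
   outer loop can be split across OpenMP threads or offloaded with OpenACC. */
#include <complex.h>

void fft_stage(double complex *data, const double complex *twiddle, int n, int half)
{
    #pragma omp parallel for                 /* CPU threads */
    /* GPU variant: #pragma acc parallel loop present(data[0:n], twiddle[0:half]) */
    for (int start = 0; start < n; start += 2 * half) {
        for (int j = 0; j < half; j++) {
            double complex a = data[start + j];
            double complex b = twiddle[j] * data[start + j + half];
            data[start + j]        = a + b;  /* upper butterfly output */
            data[start + j + half] = a - b;  /* lower butterfly output */
        }
    }
}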

4.3 Parallel Implementation of Laplace Transforms:

The Laplace transform is a widely used integral transform in mathematics and electrical engineering, named after Pierre-Simon Laplace, that converts a function of time into a function of complex frequency. The inverse Laplace transform takes a complex frequency-domain function and yields a function defined in the time domain. The Laplace transform is related to the Fourier transform, but whereas the Fourier transform expresses a function or signal as a superposition of sinusoids, the Laplace transform expresses a function, more generally, as a superposition of moments. Given a simple mathematical or functional description of an input to or output of a system, the Laplace transform provides an alternative functional description that often simplifies the process of analyzing the behavior of the system, or of synthesizing a new system based on a set of specifications. Thus, for example, the Laplace transform from the time domain to the frequency domain converts differential equations into algebraic expressions and convolution into multiplication.
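For reference, the standard textbook definition of the transform (not reproduced in the original text) is F(s) = ∫_0^∞ f(t) e^(−st) dt, which maps a time-domain function f(t) to a complex-frequency-domain function F(s).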

Algorithm

1. Set particle positions.
2. Assign particle velocities.
3. repeat
   1. Calculate force on each particle.
   2. Update particle positions and velocity.
   3. Measure properties, store results.
4. until the preset time steps
5. Analyze properties, print results

The portion of work that can be parallelized in this problem is processing each matrix size and calculating the time taken for each matrix. Mapping of the concurrent portions of work is achieved with the pragma omp for directive, which splits the parallel iteration space across threads in OpenMP, and with the pragma acc kernels directive in OpenACC to execute the corresponding portion of code on the GPU. The input data, namely the matrix size and the matrix values, are shared (global) across all the threads, while intermediate results generated during processing are stored in variables that are private (local) to the threads. Synchronization is needed while calculating the time taken for each matrix; these calculations are performed within the pragma omp master directive in OpenMP and a corresponding serial (single-threaded) region in OpenACC.
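A hedged sketch of how such a computation might be parallelized is given below; it assumes a simple numerical approximation of the transform on sampled data, all names (nt, ns, dt, f, s, F) are illustrative, and it is not the authors' code.

/* Hedged sketch (not the authors' code): a simple numerical Laplace
   transform F(s) ≈ Σ f(t_j) e^(-s t_j) Δt evaluated for many values of s.
   The outer loop over s values is independent, so it can be split across
   CPU threads with OpenMP or offloaded with OpenACC. */
#include <math.h>

void laplace_transform(const double *f, int nt, double dt,
                       const double *s, double *F, int ns)
{
    #pragma omp parallel for                 /* CPU version */
    /* GPU variant: #pragma acc parallel loop copyin(f[0:nt], s[0:ns]) copyout(F[0:ns]) */
    for (int i = 0; i < ns; i++) {
        double sum = 0.0;
        for (int j = 0; j < nt; j++)         /* simple Riemann sum over the samples */
            sum += f[j] * exp(-s[i] * j * dt) * dt;
        F[i] = sum;                          /* F(s_i) */
    }
}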

4.4 Parallel Implementation of Molecular Dynamics

Molecular dynamics (MD) is a computer simulation of the physical movements of atoms and molecules in the setting of an N-body simulation. The particles are allowed to interact for a period of time, giving a view of the motion of the atoms. In the most common version, the trajectories of atoms and molecules are determined by numerically solving Newton's equations of motion for a system of interacting particles, where the forces between the particles and the potential energies are defined by interatomic potentials or molecular mechanics force fields. The method was originally conceived within theoretical physics in the late 1950s but is applied today mostly in chemical physics, materials science, and the modeling of biomolecules.
Since molecular systems consist of a vast number of particles, it is difficult to determine the properties of such complex systems analytically; MD simulation circumvents this problem by using numerical methods. However, long MD simulations are mathematically ill-conditioned, generating cumulative errors in numerical integration that can be minimized with proper selection of algorithms and parameters, but not eliminated entirely.
For systems that obey the ergodic hypothesis, the evolution of a single molecular dynamics simulation may be used to determine the macroscopic thermodynamic properties of the system: the time averages of an ergodic system correspond to microcanonical ensemble averages. MD has also been termed "statistical mechanics by numbers" and "Laplace's vision of Newtonian mechanics" of predicting the future by animating nature's forces and allowing insight into molecular motion on an atomic scale.

Algorithm

1. Set particle positions.


2. Assign particle velocities.
3. repeat
1. Calculate force on each particle.
2. Update particle positions and velocity.
3. Measure properties, Store results.
4. until the preset time steps
5. Analyze properties, print results

The portion of work that can be parallelized in this problem is calculating the force on each particle. Mapping of the concurrent portions of work is achieved with the pragma omp for directive, which splits the parallel iteration space across threads in OpenMP, and with the pragma acc kernels directive in OpenACC to execute the corresponding portion of code on the GPU. The input data, namely the number of particles, their positions, and their initial velocities, are shared (global) across all the threads, while intermediate results generated during processing are stored in variables that are private (local) to the threads. Synchronization is needed while measuring properties such as the total kinetic energy and the total potential energy; these calculations are performed within the pragma omp master directive in OpenMP and a corresponding serial (single-threaded) region in OpenACC.
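The sketch below (an illustrative example, not the authors' code) shows the force loop that dominates an MD step and how it can be annotated; a simple O(N²) Lennard-Jones interaction in two dimensions is assumed, and all names are hypothetical.

/* Hedged sketch of the force calculation in a simple Lennard-Jones MD step.
   The outer loop over particles is independent because each thread writes
   only its own force entries, so it maps to OpenMP threads or an OpenACC kernel. */
#include <math.h>

void compute_forces(int n, const double *x, const double *y,
                    double *fx, double *fy, double eps, double sigma)
{
    #pragma omp parallel for                 /* CPU threads */
    /* GPU variant: #pragma acc parallel loop copyin(x[0:n], y[0:n]) copyout(fx[0:n], fy[0:n]) */
    for (int i = 0; i < n; i++) {
        double fxi = 0.0, fyi = 0.0;         /* private accumulators for particle i */
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = x[i] - x[j], dy = y[i] - y[j];
            double r2 = dx * dx + dy * dy;
            double sr6 = pow(sigma * sigma / r2, 3.0);
            double f = 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r2;  /* LJ force over r */
            fxi += f * dx;
            fyi += f * dy;
        }
        fx[i] = fxi;                         /* one writer per array element */
        fy[i] = fyi;
    }
}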

5 Experimental Setup and Results

This section gives a brief description of the experimental setup required for this work to obtain the results. First, the experimental setup is based on the system specifications needed to meet the requirements. Different applications are used to determine the performance of the CPU and GPU, and the section concludes with the results and the corresponding graphs. Once the system requirements are settled, we need to determine whether a particular software bundle fits those requirements.

Table 1 Tesla GPU configuration

                         CPU configuration                 GPU configuration
System type              HP Pro 3330 NT PC                 PowerEdge R270
Processor                Intel Core i3-3220 @ 3.30 GHz     Intel Xeon E5-2620 @ 2.00 GHz
RAM                      2 GB                              32 GB
Operating system type    64-bit                            64-bit
Hard disk                500 GB                            500 GB x 3
Graphics card            NA                                Nvidia Tesla M2075 dual-slot graphics card

Table 2 Quadro GPU configuration

                         CPU configuration                   GPU configuration
System type              HP Pro 3330 NT PC                   Dell Precision R5500
Processor                Intel Core i3-3220 @ 3.30 GHz x 4   Intel Xeon E5620 @ 2.40 GHz
RAM                      2 GB                                32 GB
Operating system type    64-bit                              64-bit
Hard disk                500 GB                              500 GB x 3
Graphics card            NA                                  Nvidia Quadro K2000 dual-slot graphics card

The software bundle includes the Nvidia drivers, the PGI compiler, the CUDA toolkit (version 6.5), and the Fedora operating system. Tables 1 and 2 show the configuration of the systems used in our experimental setup to run the applications on the CPU and GPU.
The PGI compiler is required to compile and run the OpenACC and OpenMP programs. The following graphs compare OpenMP and OpenACC on CPUs and GPUs for the scientific applications on the Tesla and Quadro graphics cards; the x-axis shows the time in seconds and the y-axis shows the number of particles, with blue showing the OpenMP results and red showing the OpenACC results. Performance is measured by varying the problem size of the benchmark applications.
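As an assumed illustration (the exact flags depend on the installed PGI release), such programs can typically be built with commands of the form pgcc -mp app.c -o app_omp for the OpenMP version and pgcc -acc -ta=tesla -Minfo=accel app.c -o app_acc for the OpenACC version, where -ta=tesla targets the Nvidia GPU and -Minfo=accel reports how the loops were offloaded.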
Figure 2 shows the comparison of OpenMP and OpenACC for matrix multiplication. The results on the Tesla graphics card show better performance because the Tesla system contains more CPU cores (24) and more GPU cores (448) than the Quadro system, which contains fewer CPU cores (8) and fewer GPU cores (240). As the size of the matrix increases, the OpenACC implementation performs better (Fig. 3).
The figure shows the comparison of OpenMP and OpenACC for FFT. On both devices, we observe a large gap in performance between the OpenACC and OpenMP implementations as the problem size increases.

Fig. 2 Matrix multiplication performance on Tesla and Quadro graphics cards

Fig. 3 Fast Fourier transform (FFT) performance on Tesla and Quadro graphics cards

Figure 4 shows the comparison of the OpenMP and OpenACC implementations for the Laplace transform. The results demonstrate a performance improvement with the OpenACC implementation as the size of the data samples increases (Fig. 5).

Fig. 4 Laplace transform performance on Tesla and Quadro graphics cards

The figure shows the comparison of OpenMP and OpenACC for molecular dynamics. The results on the Tesla graphics card show better performance because it contains more CPU cores (24) and more GPU cores (448) than the Quadro graphics card, which contains fewer CPU cores (8) and fewer GPU cores (240). The results also demonstrate an improvement in performance as the number of particles increases.

Fig. 5 Molecular dynamics performance on Tesla and Quadro graphics cards

6 Conclusion

Parallel programming will keep growing in the near future, not only in massive computing software but also in the systems of small and medium businesses, generating more speed and giving the programmer more options to exploit the hardware resources. In this project, the performance of the parallel programs is better than the performance of the serial programs. Performance measurements show that the time taken by OpenACC is better than that of OpenMP both for smaller data sizes and as the data size increases. Hence, from our work, we can conclude that OpenACC is a good option for data-parallel tasks on GPUs. OpenMP is less complex and can give nearly equivalent performance to OpenACC on CPUs. Developers who need to exploit this parallelism must adapt their programming paradigms to the demands of faster, higher-performing applications in order to increase the computation and processing capacity available.

References

1. Li X, Shih PC (2018) An early performance comparison of CUDA and OpenACC. MATEC Web Conf 208:05002. https://doi.org/10.1051/matecconf/201820805002
2. Gayatri R, Yang C, Kurth T, Deslippe J (2018) A case study for performance portability using OpenMP 4.5. In: WACCPD@SC
3. Ledur CL, Zeve CM, dos Anjos JC (2013) Comparative analysis of OpenACC, OpenMP and CUDA using sequential and parallel algorithms
4. Memeti S, Li L, Pllana S, Kolodziej J, Kessler C (2017) Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: programming productivity, performance, and energy consumption. In: Proceedings of the 2017 workshop on adaptive resource management and scheduling for cloud computing, ARMS-CC '17. Association for Computing Machinery, New York, NY, USA, pp 1–6. https://doi.org/10.1145/3110355.3110356
5. Wang Y, Qin Q, See SCW, Lin J (2013) Performance portability evaluation for OpenACC on Intel Knights Corner and Nvidia Kepler
6. Lee S, Vetter JS (2013) Early evaluation of directive-based GPU programming models for productive exascale computing. In: SC '12: proceedings of the international conference on high performance computing, networking, storage and analysis, pp 1–11
