Parallel Programming for Modern High Performance Computing Systems
Paweł Czarnul
Gdańsk University of Technology, Poland
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
To my daughter Ala
Contents

1.1 INTRODUCTION
1.2 FROM PROBLEM TO PARALLEL SOLUTION – DEVELOPMENT STEPS
1.3 APPROACHES TO PARALLELIZATION
1.4 SELECTED USE CASES WITH POPULAR APIS
1.5 OUTLINE OF THE BOOK
Index
List of figures

2.1 Performance of the first cluster on the TOP500 list over time
2.2 The number of cores of the first cluster on the TOP500 list over time
2.3 CPU clock frequency of the first cluster on the TOP500 list over time
2.4 Performance to power consumption ratio of the first cluster on the TOP500 list over time
Parallel computing systems have recently become more and more accessible to
a wide range of users. Not only programmers in high performance computing
centers but also a typical consumer can now benefit from high performance
computing devices installed even in desktop computers. The vast majority of
new computers sold today feature multicore CPUs and GPUs which can be
used for running parallel programs. Such usage of GPUs is often referred to
as GPGPU (General Purpose Computations on Graphics Processing Units).
Among devices announced by manufacturers are, for instance, a 7th generation Intel® Core™ i7-7920HQ CPU that features 4 cores with HyperThreading for 8 logical processors clocked at 3.1 GHz (up to 4.1 GHz in turbo mode) and a TDP (Thermal Design Power) of 45 W. AMD Ryzen™ 7 1800X features 8 cores for 16 logical processors clocked at 3.6 GHz (4 GHz in turbo mode) and a TDP of 95 W. A high end desktop Intel Core i9-7900X CPU features 10 cores with HyperThreading for 20 logical processors clocked at 3.3 GHz (up to 4.3 GHz in turbo mode) and a TDP of 140 W. NVIDIA® Titan X, based on the NVIDIA® Pascal™ architecture, features 3584 CUDA® cores and 12 GB of memory at the base clock of 1417 MHz (1531 MHz in boost) with a power requirement of 250 W. Workstations or servers can use CPUs such as Intel® Xeon® Scalable processors, for example the Intel Xeon Platinum 8180 processor that features 28 cores and 56 logical processors clocked at 2.5 GHz (up to 3.8 GHz in turbo mode) with a power requirement of 205 W, or the Intel Xeon E5-4669 v4 with 22 cores and 44 logical processors clocked at 2.2 GHz (3 GHz in turbo mode) and a TDP of 135 W. AMD Opteron™ 6386 SE features 16 cores clocked at 2.8 GHz (3.5 GHz in turbo mode) with a TDP of 140 W. High performance oriented GPUs include NVIDIA® Tesla® P100, based on the Pascal architecture, with 3584 CUDA cores at a boost clock of 1480 MHz and a power requirement of 250 W, as well as NVIDIA Tesla V100 with 16 GB of HBM2 memory, 5120 CUDA cores clocked at 1455 MHz in boost and a power requirement of 300 W. AMD FirePro™ W9100 features 2816 Stream Processors, 32 GB or 16 GB of GDDR5 GPU memory and a power requirement of 275 W. High performance oriented machines can use coprocessors such as the Intel® Xeon Phi™ x100 7120A with 61 cores clocked at 1.238 GHz and a TDP of 300 W or, for example, Intel Xeon Phi x200 7250 processors with 68 cores clocked at 1.4 GHz and a TDP of 215 W. As was the case in the past and is still the case today, computer nodes can be interconnected within high performance computing clusters for even greater compute power.
The target audience of this book is students, programmers, and domain specialists who would like to become acquainted with:
Pawel Czarnul
Gdańsk, Poland
CHAPTER 1
Understanding the need for parallel computing
1.1 INTRODUCTION
For the past few years, the increase in performance of computer systems has been possible through several technological and architectural advancements such as:
It can be seen, as discussed further in Section 2.6, that CPU clock frequen-
cies have generally stabilized for the past few years and increasing computing
power has been possible mainly through adding more and more computing
cores to processors. This means that in order to make the most of available
hardware, an application should efficiently use these cores with as little over-
head or performance loss as possible. The latter comes from load imbalance,
synchronization, communication overheads etc.
Nowadays, computing devices typically used for general purpose calcula-
tions, used as building blocks for high performance computing (HPC) systems,
include:
[Figure: performance (PFlop/s), power consumption (MW) and performance to power consumption ratio (PFlop/s/MW) of top TOP500 systems – Sunway TaihuLight: 93.01, 15.371, 6.051; Tianhe-2: 33.86, 17.808, 1.901; Piz Daint: 19.59, 2.27, 8.63]
Figure 1.2 shows, at a high level, which types of systems are typical tar-
gets for parallel applications coded with a specific API considered in this
book. The book also shows examples of combining APIs for hybrid parallel
programming. As more and more HPC systems combine CPUs and accelera-
tors, the presented templates and techniques are very likely to be applied to
forthcoming systems as well.
[Figure 1.2: APIs considered in the book (MPI, OpenMP, OpenCL, OpenACC, CUDA) and the components of a node (computer/workstation/server) they typically target: multicore/manycore CPUs, GPUs and Intel Xeon Phi coprocessors]
The chapter also describes examples of hybrid applications using selected combinations of APIs, i.e. MPI+OpenMP, MPI+CUDA and MPI+Pthreads.
Chapter 5 presents implementations of master-slave, geometric SPMD and
divide-and-conquer paradigms introduced in Chapter 3 using APIs or combi-
nations of APIs presented in Chapter 4.
Finally, Chapter 6 describes selected techniques used to optimize parallel
programs, including data prefetching, overlapping communication and compu-
tations, load balancing techniques, minimization of synchronization overheads.
Data placement and thread affinity are discussed in the context of possible
impact on performance. Furthermore, optimizations typical of GPUs, Xeon
Phi, clusters and hybrid systems are presented.
CHAPTER 2
Overview of selected
parallel and distributed
systems for high
performance computing
CONTENTS
2.1 Generic taxonomy of parallel computing systems . . . . . . . 11
2.2 Multicore CPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Manycore CPUs/coprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.5 Cluster systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Growth of high performance computing systems and
relevant metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.7 Volunteer-based systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.8 Grid systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
1. Shared memory systems – threads and processes may use shared mem-
ory for exchanging data and/or synchronization. There can be systems
composed of many nodes as well as individual nodes that feature more
and more cores in the following computing devices:
(a) multicore CPUs such as Intel Xeon E5 or E7 or manycore CPUs
such as Intel Xeon Phi x200,
(b) accelerators and coprocessors including:
• GPUs,
• coprocessors such as Intel Xeon Phi x100 or coprocessor ver-
sions of Intel Xeon Phi x200.
2. Distributed memory systems – threads and processes of a parallel appli-
cation generally do not have direct access to all memories of the whole
system. Such systems include, in particular:
(a) clusters – a cluster consists of many nodes interconnected with a
fast network such as Infiniband™, nodes will typically include:
• multicore CPU(s),
• accelerator(s)/coprocessor(s),
(b) volunteer computing – this type of system consists of a server or
servers to which volunteers connect in order to download data sets,
process the latter and send results back. This way, a large number
of independent volunteers take part in large scale, usually socially
oriented initiatives.
It should be noted that the term multicore usually refers to cores of traditional multicore CPUs that evolved from earlier CPU designs by adding more and more cores. The term manycore, on the other hand, usually refers to computing devices with tens or hundreds of simpler, less powerful cores designed with the intent of running parallel applications.
The aforementioned systems can be described with the UMLr notation
as shown in Figure 2.1. The key difference between clusters and volunteer
based systems in terms of hardware is geographical distribution of volunteers’
nodes compared to centralized location of cluster nodes. Furthermore, clus-
ters usually feature fast interconnects such as Infiniband compared to slow
WANs in volunteer based systems. Additionally, volunteers are independent
and codes run in an untrusted environment in contrast to a single cluster envi-
ronment typically maintained by one administrator. In this context, a grid is
typically referred to as a collection of clusters and servers, possibly in various
administrative domains.
numbers such as 128 [99] or more along with adoption of other technologies
for storage such as NVRAM. Depending on the platform, a few CPUs could be incorporated into a node of an HPC system.
Examples of server type CPUs used today include Intel Xeon E7-8890 v3
that features 18 cores (36 threads with HyperThreading) clocked at 2.50 GHz
(3.3 GHz in turbo) with a TDP of 165W or AMD Opteron 6386 SE that
features 16 cores clocked at 2.8 GHz (3.5 GHz in turbo) with a TDP of 140W.
From the programming point of view, there are several APIs available for
effective usage of multicore CPUs within a node, including in particular:
2.3 GPUS
In today’s landscape of computer systems, accelerators play an important
role. Such computing devices offer considerable computational power for well
structured computations compared to CPUs. Consequently, both CPUs and
accelerators/coprocessors can be used together for parallel processing. As of now, the latter are generally plugged into PCIe slots on motherboards. Consequently, this communication link may become a bottleneck if input data needs to be sent frequently from the host’s RAM to an accelerator’s memory for relatively short-lived computations.
The architectures of GPUs have been evolving from generation to gen-
eration. For instance, NVIDIA introduced architectures such as NVIDIAr
Fermi™, NVIDIAr Kepler™, NVIDIAr Maxwell™, NVIDIAr Pascal™ and
NVIDIAr Volta™.
From this point of view, a programmer should make sure that the aforemen-
tioned features are utilized to a maximum extent in the application. Specific
optimization points for this architecture are outlined in Section 6.9.2.
One of the advantages of programming for the Intel Xeon Phi platform is the possibility to use well established programming APIs such as OpenMP and MPI. An application can be run in one of the following modes:
• native – Intel Xeon Phi is treated as a separate system onto which the
programmer logs in and runs a parallel application on the Xeon Phi
cores. This mode can be implemented with OpenMP or MPI.
• offload – an application is started on a CPU (there can be more than one CPU with multiple cores) and offloads some computations to one or more Intel Xeon Phi coprocessors installed in the node.
• symmetric – an application runs both on cores of the host CPUs and
cores of Intel Xeon Phi coprocessor(s). It should be kept in mind that
efficiency of a CPU core and an Intel Xeon Phi core is different and
this needs to be accounted for within the application that can be imple-
mented with MPI in this case.
are grouped into clusters. The latter defines how DRAM and MCDRAM are
available to the application and how they can be used. Coprocessor versions
of the x200 series were also developed.
TABLE 2.1 Performance of the first cluster on the TOP500 list over time, based on data from [4]

Date      Performance [TFlop/s]   Change in 3 years [current/previous]
06/2017   93014.6                 2.75
06/2014   33862.7                 4.15
06/2011   8162                    7.96
06/2008   1026                    (baseline)
TABLE 2.2 The number of cores of the first cluster on the TOP500 list over time, based on data from [4]

Date      Number of cores   Change in 3 years [current/previous]
06/2017   10649600          3.41
06/2014   3120000           5.69
06/2011   548352            4.48
06/2008   122400            (baseline)
TABLE 2.3 CPU clock frequency of the first cluster on the TOP500 list over time, based on data from [4]

Date      Clock frequency [MHz]   Change in 3 years [current/previous]
06/2017   1450                    0.66
06/2014   2200                    1.1
06/2011   2000                    0.63
06/2008   3200                    (baseline)
Historically, the most powerful cluster on the TOP500 list would drop to the 500th place on that list after roughly 7-8 years. Furthermore, now more than ever, energy consumption of such large systems and energy efficiency have become a real concern. Specifically, the aim is for future clusters not to exceed 20 MW of power consumption [61], mainly due to the high running costs of such systems. As shown in Table 2.4, the performance to power consumption ratio of the top cluster on the TOP500 list has visibly increased.
Furthermore, increasing the size of a cluster in terms of the number of
computational devices/the number of nodes brings the following matters into
consideration:
2. Comcute [16, 51] extended the basic idea implemented in BOINC with
the following features:
• In order to free the volunteer from the need for installation of
software on the client side, by design computations are performed
within a web browser. However, this potentially limits performance
due to technologies that can be used for computations. The de-
sign of Comcute allows selection of potentially the best technology
out of those supported on the client side. Technologies such as
JavaScript® and Java applets were tested. Comcute still allows the
running of dedicated clients, if needed.
• A multi-level architecture that distinguishes the following layers
(Figure 2.2):
(a) Z layer – allows definition of data partitioners, data mergers,
tasks with codes in various technologies that will be sent to
volunteers for processing.
(b) W layer – upper level task management in which a group of W
servers can be selected that will manage execution of a given
task collectively. This results in:
i. Reliability – in case of W server failure, others can con-
tinue.
ii. Better parallelization – particular W servers can be located
in various geographical regions. Definition of a task in-
cludes: volunteer code, data partitioner and merger codes,
input data as well as arguments such as a desired level of re-
dundancy. Then W servers partition input data and wait
for the lower layer to prefetch data packets.
(c) S layer – includes so-called distribution servers that are proxies
between volunteers who join the system and W servers that
host data packets. Several S servers can connect to a W server
and many volunteers can communicate with an S server. S
servers shall prefetch data packets from W servers and fetch
new packets as volunteers request those.
(d) I layer – volunteer layer: independent volunteers connect to
publicly available Comcute’s website [1] and upon joining com-
putations are redirected to one of S servers from which their
browsers fetch input data packets for processing and to which
results are returned.
3. CrowdCL [111] is a framework supporting volunteer computing and processing when visiting a web page. This open source solution allows the use of OpenCL, WebCL, JavaScript, CrowdCLient and KernelContext on the client side in order to make use of available computing devices.
4. The WeevilScout framework [37] features computing within a browser using JavaScript.
Cluster and volunteer based systems are both clearly compute oriented but are best suited for different application paradigms. Volunteer based systems
are suitable for compute intensive applications in which independently pro-
cessed data chunks can be easily distinguished, usually corresponding to either
parts of a single solution or solutions for various input data sets. It should be
noted that in volunteer computing:
be integrated into one system so that a client can use the resources potentially
transparently. Examples of grid middlewares include Globusr Toolkit [161],
UNICORE [159] or Gridbus [30].
From a user’s point of view, the following services are usually expected:
On the other end, various so-called virtual organizations may expose their
resources to the grid. These resources, such as clusters, may use various tech-
nologies and software for:
These differences should not concern a grid client. Because of that, grid middleware hides these differences behind a uniform interface that exposes the
desired services to users. Essentially, such services can be called directly or
service invocations can be incorporated into higher level systems or modules,
often with a graphical interface, allowing management of specific applications
to be run on the grid.
An example of a grid system with a WWW and Web Service interface is
BeesyCluster [40, 41, 48, 43] deployed as an access portal to high performance
computing clusters of Academic Computer Center in Gdańsk, Poland. It al-
lows users to set up an account that might be connected to an ACC system
account but also accounts on other clusters and perform tasks such as editing
programs, compilation, running either sequential or parallel jobs from com-
mand line or submitting jobs via queuing systems such as PBS, LoadLeveler®
or others. Furthermore, BeesyCluster allows users to couple parallel applica-
tions from multiple nodes and clusters in a workflow application [44].
CHAPTER 3
Typical paradigms for parallel applications
are involved, this may make the problem difficult to parallelize. Synchroniza-
tion and communication between processes or threads involves overhead. Data
partitioning can be:
1. static – if it is/can be done before processing starts,
2. dynamic – if new data chunks/packets are generated at runtime.
It should be noted that partitioning of input data in an algorithm can generate
either:
1. A fixed/predefined number of data chunks/packets/subdomains (at each
step of the algorithm where partitioning is involved) – which is deter-
mined by the algorithm itself. An example would be alpha beta search
in a game tree in which the number of data packets/chunks would cor-
respond to the number of positions generated from a given position –
the number of legal moves.
2. A possibly variable number of data chunks/packets – in this particular
case, the algorithm may adjust the number of data chunks/packets. For
instance, searching for templates within a large text file: the input file
can be partitioned into a certain number of chunks which are processed
by processes/threads in parallel.
In the latter case, adjustment of granularity by setting the size of a packet
typically results in a trade-off between:
• overhead for data packet/chunk management – typically a larger number
of packets/chunks would result in additional time spent on synchroniza-
tion and communication,
• imbalance – a smaller number of data packets/chunks would make it
more difficult to balance load, especially if:
1. data packets require various amounts of time to process,
2. the number of data packets is similar to the number of processing cores and is not evenly divisible by the latter,
3. the number of available processing cores exceeds the number of
data packets.
Such trade-offs can be observed at various levels of parallelism i.e. be-
tween processor cores, cluster nodes and clusters. For instance, in [39] it is
demonstrated how execution time of a parallel compute intensive workflow
implementing adaptive quadrature integration for a given range depends on
how input data (input range) was partitioned. For 16 cluster nodes, the in-
put range was partitioned into various numbers of files with subranges. With
processors of various speeds, a reasonably large number of files is needed for
load balancing, but too large a number of files results in additional overhead.
A similar trade-off is shown in Section 6.2 for a parallel master-slave MPI
application.
3.1.2 Communication
Naturally, in parallel and distributed systems data is typically distributed
among processes or threads running on various cores and possibly nodes. An
algorithm determines data and control flow as well as proper synchronization.
Communication and synchronization involve overheads, depending on whether
processes or threads run on a shared memory machine within a node (including
multicore CPUs but also manycore systems such as Intel Xeon Phi), multiple
cluster nodes, multiple clusters or hybrid environments including many levels
of parallelism.
Modeling communication costs has been widely analyzed in the literature.
Such models and coefficients would depend on the underlying hardware as
well as software stacks. For instance, point-to-point communication time of
message of size d can be modeled [146] as:
tc(d) = ts + d/B    (3.1)
where ts denotes the startup time while B denotes the maximum possible
bandwidth.
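For illustration, with hypothetical values ts = 5 µs and B = 10 GB/s, sending a message of d = 1 MB takes about tc = 5 µs + 100 µs = 105 µs; the startup time dominates for short messages while the bandwidth term dominates for large ones.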
Paper [133] contains formulas describing communication times of various
MPI calls as well as coefficients in the formulas (for a given MPI implementa-
tion and a cluster environment). Additionally, formulas for power consumption
required by computations on a certain number of nodes with a given number
of threads are presented along with dependency of power consumption per
processor versus the number of active threads.
S(N) = t(1)/t(N)    (3.2)

where t(1) denotes the execution time of the application on 1 computing device and t(N) the execution time on N computing devices. In some scenarios the measured speed-up may even exceed N (so-called superlinear speed-up), for instance:
1. The environments are not equal when running the application on 1 com-
puting device and on N computing devices. For instance, the environ-
ment with 1 computing device may have limited memory. For a memory
hungry application this may result in the need for using a swap file
and considerable slow down and consequently large execution time. On
the other hand, a multi node environment may have considerably larger
memory as memories of particular nodes are available to the applica-
tion. In this case, if particular processes of an application consume only
the memory of the size they need, page swapping may be unnecessary.
Consequently, the application running in a multi node environment may
result in time shorter than t(1)/N .
2. Another example of when superlinear speed-up may occur is related to
applications in which processing time may be dependent on particular
input data. For example, if an application is to find a particular argu-
ment, element or a solution that meets a certain criterion in a search
space, it is not known in advance how much time it will take. Specifically,
pe(N) = t(1)/(N · t(N))    (3.3)

where t(1) and t(N) denote execution times of the application on 1 and N computing devices, respectively.
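As a purely hypothetical example, if t(1) = 100 s and t(16) = 8 s, then S(16) = 12.5 and pe(16) = 100/(16 · 8) ≈ 0.78, i.e. the 16 computing devices are used at about 78% of their theoretical potential.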
3.2 MASTER-SLAVE
Master-slave is a programming paradigm that can be applied to many practi-
cal algorithms for parallelization. Specifically, the paradigm distinguishes two
types of actors:
1. Master – partitions input data into data packets which are sent to slaves
for processing. Upon receiving results of processing data packets, these
are merged in order to obtain a final solution to a problem.
2. Slave – responsible for processing of received data packets and return-
ing results. A slave is not aware of existence of other slaves and just
communicates with the master.
depend not only on the input data size but also on the data itself. For
example, integration of a function over a given range using an adaptive
quadrature [170, 38] algorithm will depend not only on the length of
the range but mainly on the function within the range. Analysis of a
FIGURE 3.4 Flow of the basic master-slave application with more data
packets over time, diameters denote execution times, 1 of 2
FIGURE 3.5 Flow of the basic master-slave application with more data
packets over time, diameters denote execution times, 2 of 2
FIGURE 3.7 Flow of the basic master-slave application with more data
packets over time and overlapping communication and computations,
diameters denote execution times, 1 of 2
FIGURE 3.8 Flow of the basic master-slave application with more data
packets over time and overlapping communication and computations,
diameters denote execution times, 2 of 2
A simulation will then proceed in successive iterations that will typically cor-
respond to time steps. When solving the system of linear equations using the
Jacobi or related methods, iterations will usually allow divergence to a solu-
tion. In each iteration, cells throughout the whole domain will need to be up-
dated. In a parallel implementation, the whole domain will be partitioned into
disjoint subdomains (with additionally so-called ghost cells described below)
each of which will be assigned to a distinct process or thread for updates. An
example of such partitioning for a two dimensional domain is shown in Figure
3.9.
After a process or thread has updated all its cells, the next iteration should
follow. Before this can be done, however, processes or threads handling neigh-
boring subdomains will need to exchange values of boundary cells that will
be used by their neighbors for updates of their boundary cells in the next it-
eration. Such cells are called ghost cells. Consequently, in order to reach good
performance and high speed-ups of the parallel code the following will need
to be performed at the same time:
1. balancing computations among parallel processes or threads,
2. minimization of communication times for sending and receiving ghost
cells.
Communication times will depend on the underlying communication in-
frastructure. Proper partitioning should minimize the number of bytes ex-
changed and the number of neighbors in the communication, especially be-
tween processes on various nodes. As an example, for rectangular domains the
following partitioning strategy can be applied (Figure 3.10):
1. Partition the domain with planes parallel to XY, YZ, and XZ planes
respectively.
2. There are x, y and z planes parallel to the YZ, XZ and XY planes respectively. These planes cut the domain into equally sized subdomains. The domain
is of size X, Y and Z which is input data to the partitioning algorithm.
3. The goal of the partitioning strategy is to balance computations while
minimizing communication between subdomains. Since the planes cut
the domain into equally sized parts, computations are balanced assuming
each cell in every subdomain requires the same computational effort.
However, taking into account possibly various sizes of the domain: X, Y
and Z it is necessary to find such numbers of planes cutting the domain
in each dimension that communication is minimized. In a 3D space,
subdomains in the middle of the domain have six neighbors. The total
number of cells that need to be exchanged, from the point of view of a
single subdomain, is as follows:
2 (X/(x+1) · Y/(y+1) + X/(x+1) · Z/(z+1) + Y/(y+1) · Z/(z+1))    (3.4)
Assuming that the number of processes P is given, this forces the following condition: (x + 1)(y + 1)(z + 1) = P.
X, Y and Z are given. The goal then becomes to find such x, y and z that the value given by Equation 3.4 is minimized subject to this condition; a simple search sketch is shown below.
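Since P is usually small, a brute force search over the factorizations of P is sufficient in practice. A minimal sketch of such a search follows; the function and variable names are illustrative and not taken from the book:

#include <float.h>

// find the numbers of cutting planes x, y and z such that
// (x+1)*(y+1)*(z+1) == P and the communication volume given by
// Equation 3.4 is minimized
void findbestcuts(double X,double Y,double Z,int P,
                  int *bestx,int *besty,int *bestz) {
  double bestcost=DBL_MAX;
  for(int px=1;px<=P;px++) {
    if (P%px) continue;               // px must divide P
    for(int py=1;py<=P/px;py++) {
      if ((P/px)%py) continue;        // py must divide P/px
      int pz=(P/px)/py;               // px*py*pz==P
      double cost=2.0*((X/px)*(Y/py)+(X/px)*(Z/pz)+(Y/py)*(Z/pz));
      if (cost<bestcost) {
        bestcost=cost;
        *bestx=px-1; *besty=py-1; *bestz=pz-1;
      }
    }
  }
}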
So far we have considered a problem in which all cells throughout the do-
main require the same computational effort. In case various cells have different
computational weights that result in different update times then it needs to be
taken into account during partitioning of the domain into subdomains. There
are several approaches that can be adopted for load balancing. For simplicity,
in this case, the so-called Recursive Coordinate Bisection (RCB) is demon-
strated. Figure 3.11 presents the same domain as considered before but with
various cell weights. In a real world scenario, this may correspond to a simula-
tion of a phenomenon with a source and distribution of waves from the source
throughout space. Other equations or accuracy might be preferred closer to
the source than in the other part of space. A recursive partitioning algorithm
can work as follows:
1. Partition the current domain (initially the whole domain) into two sub-
domains by cutting it with a plane such that the dimension with the
largest width is cut by a perpendicular plane. In two dimensions, as
shown in Figure 3.11, the algorithm cuts with lines.
2. Apply step 1 to the resulting subdomains until the required number of subdomains is reached (a minimal code sketch of the weighted cut search used in step 1 is given after this list).
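A minimal 2D sketch of the weighted cut search is shown below; the flat weights[] array (row-major, nx columns) and the function names are illustrative assumptions. The same procedure is applied recursively to each resulting subdomain, cutting along its currently longest dimension, until the required number of subdomains is obtained:

// total weight of cells in the rectangle [x0,x1) x [y0,y1)
double rectweight(double *weights,int nx,int x0,int x1,int y0,int y1) {
  double s=0;
  for(int y=y0;y<y1;y++)
    for(int x=x0;x<x1;x++)
      s+=weights[y*nx+x];
  return s;
}

// find a cut along x such that the left part holds roughly half of
// the total weight of the subdomain [x0,x1) x [y0,y1)
int findcutx(double *weights,int nx,int x0,int x1,int y0,int y1) {
  double half=0.5*rectweight(weights,nx,x0,x1,y0,y1);
  double acc=0;
  for(int x=x0;x<x1-1;x++) {
    acc+=rectweight(weights,nx,x,x+1,y0,y1); // weight of column x
    if (acc>=half) return x+1;               // cut between columns x and x+1
  }
  return x1-1;
}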
Then there comes the main simulation loop. As indicated above, the basic
steps of the simulation are outlined in Figure 3.12 with computations and
communication performed in every iteration. One of the disadvantages of this
approach is that communication follows the whole computational step.
1. Update its boundary cells first. It can be done assuming that values for
ghost cells from the previous iteration are available.
2. Start sending the updated boundary cells to processes or threads han-
dling neighboring subdomains. At the same time, start receiving updated
values for ghost cells.
3. Update all remaining cells within the domain i.e. the interior cells. It
should be noted that the communication started in step 2 and these
updates can potentially be performed at the same time thus saving some
time compared to the solution shown in Figure 3.12.
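Using MPI, the overlap described in the steps above is typically obtained with non-blocking calls. A simplified sketch for a one dimensional decomposition follows; the buffer names, neighbor ranks and the update functions are illustrative assumptions:

MPI_Request requests[4];
// step 1: update own boundary cells using ghost values from the previous iteration
update_boundary_cells(subdomain);
// step 2: start exchanging boundary/ghost cells with both neighbors
MPI_Isend(leftboundary,count,MPI_DOUBLE,leftrank,0,MPI_COMM_WORLD,&(requests[0]));
MPI_Isend(rightboundary,count,MPI_DOUBLE,rightrank,0,MPI_COMM_WORLD,&(requests[1]));
MPI_Irecv(leftghost,count,MPI_DOUBLE,leftrank,0,MPI_COMM_WORLD,&(requests[2]));
MPI_Irecv(rightghost,count,MPI_DOUBLE,rightrank,0,MPI_COMM_WORLD,&(requests[3]));
// step 3: update interior cells while communication progresses in the background
update_interior_cells(subdomain);
// make sure the exchange has completed before the next iteration starts
MPI_Waitall(4,requests,MPI_STATUSES_IGNORE);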
The scheme shown in Figure 3.15 compares if time lost from imbalance is
larger than time required for repartitioning and based on this assessment
starts dynamic load balancing or not. Figure 3.16 presents the same scheme
but improved with dynamic adjustment of the following parameters:
FIGURE 3.15 Activity diagram for the dynamic load balancing step
FIGURE 3.16 Activity diagram for the dynamic load balancing step improved with dynamic load balancing step adjustment
3.4 PIPELINING
In pipeline type processing, one can distinguish distinct stages of processing
through which each of the input data packets needs to go. As shown in Figure
3.17, input data is considered as a stream of input data packets that enter
a pipeline. Each of the pipeline stages is associated with code that can be
executed on a separate core, processor or accelerator. Specifically, some codes
might preferably be executed on such a computing device that is especially
well suited for execution of such code. This might be a rationale for selection
of this type of processing. However, such pipeline processing can suffer from
performance bottlenecks if some pipeline stages process data packets much
longer than the others.
3.5 DIVIDE-AND-CONQUER
Divide-and-conquer is a technique that is based on the following steps:
Figure 3.19 shows an imbalanced tree with various node degrees and depths
for various branches of the tree. In terms of computations that are to be
performed within nodes, various configurations can be distinguished as well,
considering:
these tree nodes are to aggregate subresults from nodes closer to leaves and
pass intermediate results to nodes closer to the root.
In alpha beta search in chess, evaluation of a position performed in leaves
may consider not only the pieces each player has on the board but also po-
sitional advantages, immediate threats etc. This may be a time consuming
process. There may be a trade-off between how accurate evaluation is and the
size of the tree to be analyzed within a timeout for a move. On the other hand,
in merge sort processing in leaves only requires comparison and potentially
exchange of numbers which is fast. Next, merging in intermediate nodes takes
more and more time as arrays to be merged are larger and larger.
Let us first consider a balanced tree with the same node degrees equal to 2.
Such a tree can be found in algorithms such as mergesort. If we assume that
pairs of processes/threads can communicate/synchronize at the same time
then computations can be organized as shown in Figure 3.20. Particular pro-
cesses or threads and required communication or synchronization is shown.
It should be noted that in this allocation, some processes or threads are re-
sponsible for nodes of the tree at various levels. Flow of computations, on the
other hand, is shown in Figures 3.21 and 3.22.
It should be noted that, in general, a divide-and-conquer application can
be highly dynamic in terms of the following:
the total running time of the algorithm. Figure 3.23 presents such a general
divide-and-conquer tree.
In order to parallelize a general divide-and-conquer tree efficiently, a dy-
namic approach with partitioning of the tree can be used [38, 47]. Specifically,
if there are several processing nodes then, in terms of load balancing, each of
the processes or threads can perform steps depicted in Figure 3.24.
These steps can be outlined as follows for a process or thread on each
processing node:
1. If there is at least one subtree available then fetch and process one.
2. If there is no data to process then request a subtree from one of the
neighbors of the processing core/node. There may be various load bal-
ancing algorithms that determine:
• which neighbors can be considered for load balancing,
• how much data should be fetched from neighbors – how many sub-
trees in this case.
This algorithm allows cutting off subtrees from nodes that have large trees
available. It should be noted that a subtree can be cut off and be sent for pro-
cessing to another process or thread. This idea is related to concepts of work
stealing and work sharing used in scheduling of multithreaded computations.
Work stealing uses a mechanism in which underloaded processors request and
fetch work from overloaded ones and work sharing uses spawning work on pro-
cessors and migration such that load is balanced. In this case work (a subtree)
is generated on demand. Processing of distinct subtrees is independent. How-
ever, a process or thread that is to merge results will need to wait for results
from all subtrees at a given level. Consequently, the following performance
related issues arise:
1. There may be idle time between when an idle node requests a new
subtree and the time data is available. One solution to cope with this is
data prefetching described in Section 6.1. Specifically, before an available
subtree is to be processed, a request may be sent in the background for
sending another subtree. However, it should be noted that prefetching
too many subtrees may overload the given node because quite often the
real computational effort for a subtree cannot be predicted in advance
[38].
This chapter aims to introduce some of the most useful parts of popular
APIs for parallel programming in today’s modern computing systems. De-
scription of the APIs is meant to introduce functions and methods most often
used in solving practical problems. The presented APIs are representative in
terms of types of systems on which these are used for programming parallel
applications. The APIs include:
In their respective contexts, key areas of the APIs are presented with
the most focus on MPI, OpenMP, CUDA and OpenCL. Selection is based
on required elements for implementation of master-slave, geometric SPMD
and divide-and-conquer parallelism, presented in Chapter 5, as well as useful
and modern elements of the APIs available in the latest versions. The most
important of the presented elements in the APIs are summarized in Tables
4.1 and 4.2 and include:
int MPI_Finalize(void)
1. Blocking – the function blocks the execution of the process until the
function has been executed (such as message received) or progressed
until a certain step (such as copied to a final/additional buffer for a
send operation such that the input buffer can be modified).
2. Non-blocking – the function returns immediately to the calling process
and execution of the call may progress in the background. In this case,
such a function can be thought of as a request for execution. There is
a handle MPI_Request identified with the request which needs to be
passed to additional functions for checking the status of a call or to
terminate the call in a blocking way.
Additionally, the gather functions also include a version that allows all
processes to receive data, not only the root process. This is possible by
invoking function MPI_Allgather:
which allows each process to receive data chunks from each process in
such a way that the data chunk from process with rank i is put into the
i-th part of receivebuffer. Similarly to the aforementioned versions,
MPI_Allgatherv allows receiving data chunks of various lengths:
4. Broadcast – in this case one process broadcasts data to all the processes
in the communicator:
6. All to all communication. In this case all processes send data to all
other processes. This communication mode implemented by function
MPI_Alltoall can be thought of as an extension of MPI_Allgather. In
this particular case, each process sends various data to various receiving
processes. Specifically [6], the process with rank i sends the j-th chunk of data to the process with rank j. The process with rank j receives this data chunk into the i-th chunk location. Details of this function are as follows:
1. Packing and unpacking elements of various data types into a buffer and
passing a buffer to an MPI_Send type function and using the predefined
MPI_PACKED type for sending the message.
2. Creating a custom data type that would already consider various data
types, numbers of elements, strides etc. Such a type can be created using
one of many MPI functions. It needs to be registered using function
MPI_Type_commit and can subsequently be used in the standard MPI
communication functions.
Having obtained sizes for particular blocks of elements of various data types,
the total size for a buffer can be computed by adding required sizes for the
blocks. Subsequently, a buffer can be allocated using malloc and packed with
blocks using MPI_Pack for each of the blocks:
Then the buffer can be sent using MPI_PACKED for the message type. On the
receiver side, the message can be unpacked using MPI_Unpack:
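A minimal end-to-end sketch of this approach is shown below; the packed int, the 100-element array of doubles and the sender/receiver ranks are illustrative assumptions:

// sender side: pack an int and 100 doubles into one contiguous buffer
int position=0,sizeint,sizedoubles,buffersize;
MPI_Pack_size(1,MPI_INT,MPI_COMM_WORLD,&sizeint);
MPI_Pack_size(100,MPI_DOUBLE,MPI_COMM_WORLD,&sizedoubles);
buffersize=sizeint+sizedoubles;
char *buffer=malloc(buffersize);
MPI_Pack(&packetid,1,MPI_INT,buffer,buffersize,&position,MPI_COMM_WORLD);
MPI_Pack(data,100,MPI_DOUBLE,buffer,buffersize,&position,MPI_COMM_WORLD);
MPI_Send(buffer,position,MPI_PACKED,1,0,MPI_COMM_WORLD);

// receiver side: unpack in the same order in which the data was packed
position=0;
MPI_Recv(buffer,buffersize,MPI_PACKED,0,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
MPI_Unpack(buffer,buffersize,&position,&packetid,1,MPI_INT,MPI_COMM_WORLD);
MPI_Unpack(buffer,buffersize,&position,data,100,MPI_DOUBLE,MPI_COMM_WORLD);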
Alternatively, an application may create its own custom data types that
will be used throughout execution. This is especially desirable if a process
needs to send a more complex data type than a simple structure in which
blocks of elements are located one after another. For instance, sending a col-
umn of a matrix may require a single data type but skipping a certain number
of bytes between element locations. MPI offers several functions for creating
custom data types. Definition of a new data type should be followed by invoca-
tion of function MPI_Type_commit(MPI_Datatype *datatype) that registers
the data type. Specific functions for data type definition include:
In such cases, a process might want to receive a message from any process,
regardless of which sends a message first. This can be accomplished by calling
function MPI_Recv with MPI_ANY_SOURCE specified as the rank of the sender.
Obviously, the receiving process may need to find out the actual rank of the
sender in order to communicate with it next. This can be achieved as follows:
MPI_Status status;
MPI_Recv(buffer,count,datatype,MPI_ANY_SOURCE,tag,
communicator,&status);
// find out the real sender rank
int senderrank=status.MPI_SOURCE;
Similarly, a process may wait for a message with a given tag or for a message
with any tag. It can be achieved in an analogous way using MPI_ANY_TAG for
the tag in MPI_Recv. The actual tag of the received message can be read as
follows after MPI_Recv:
int tag=status.MPI_TAG;
1. Find out about the incoming message without actually receiving it, i.e. check if such a message is available. This can be done using the blocking MPI_Probe or non-blocking MPI_Iprobe functions.
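A common pattern combines MPI_Probe with MPI_Get_count in order to allocate a buffer of a matching size before the actual receive; a minimal sketch follows:

MPI_Status status;
int count;
// check for an incoming message from any process without receiving it yet
MPI_Probe(MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
// determine the number of elements in order to allocate a matching buffer
MPI_Get_count(&status,MPI_DOUBLE,&count);
double *buffer=malloc(count*sizeof(double));
MPI_Recv(buffer,count,MPI_DOUBLE,status.MPI_SOURCE,status.MPI_TAG,
MPI_COMM_WORLD,MPI_STATUS_IGNORE);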
– MPI_Send(...) – a basic blocking send function. After the function has re-
turned, the calling process is allowed to modify the send buffer. MPI
can buffer the message which may cause the function to return fast. Al-
ternatively, this call may even block until the moment the sent message
is buffered on the receiver side.
– buffer – a pointer to a buffer for storing the incoming message, the buffer
should have been allocated before,
– count – the number of elements of the given data type,
– datatype – data type for elements used in the message,
– source – the rank of a sender process,
– tag – a label associated with the given message,
– comm – denotes the communicator within which communication is performed
and process identified,
– request – a handler associated with the call that can be used for completion
of the call later.
It should be noted that after the function has completed, the request is
either set to inactive or MPI_REQUEST_NULL. If MPI_Wait is invoked with
MPI_REQUEST_NULL then it has no effect.
MPI_Test, on the other hand, is a non-blocking call. This means that
MPI_Test returns a flag indicating whether a previous operation associated with a given
request has completed. The function call is as follows:
If the flag is set to true then, similarly to MPI_Wait, the request is either
set to inactive or MPI_REQUEST_NULL while status will contain information
on the call.
It should be noted that MPI actually provides a whole family of MPI_Wait
and MPI_Test functions that allow passing more than just one request as
input. These can be very useful in more complex communication patterns
that involve more processes. These functions include:
that allows blocking waiting for completion of any of the calls associated with
requests given as input in arraywithrequests. A value of index will denote
the one that completed. It should be noted that using MPI_Waitany() when
handling communication with many processes might lead to some being served
more frequently than others.
Another useful function is the blocking MPI_Waitall that allows waiting for completion of all non-blocking calls whose requests were given as input. This function can include requests for both send and receive calls. The syntax is as
follows:
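For instance, a master process might issue one non-blocking receive per slave and then wait for all of them at once; in the sketch below SLAVECOUNT, RESULTSIZE and the results buffers are illustrative assumptions:

MPI_Request requests[SLAVECOUNT];
MPI_Status statuses[SLAVECOUNT];
for(int i=0;i<SLAVECOUNT;i++) {
// results[i] points to a buffer for the result from the slave with rank i+1
MPI_Irecv(results[i],RESULTSIZE,MPI_DOUBLE,i+1,0,MPI_COMM_WORLD,&(requests[i]));
}
// other work can be performed here while the receives progress
MPI_Waitall(SLAVECOUNT,requests,statuses);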
MPI allows the creation of windows using several functions. For instance,
allows each process in communicator (the call is collective) to expose its own
window in memory for one-sided (remote memory access) operations. The
function has the following parameters:
which is similar to MPI_Win_allocate but all processes may access the mem-
ory using write/read operations. address can be used on the invoking process.
By default, a contiguous memory across process ranks is allocated with at least
size bytes in each rank. Function:
which finalizes one-sided calls on both sides. MPI_Get, MPI_Rget allow reading
data from a window. The syntax of MPI_Get is as follows:
where origin* variables concern the receive buffer on the origin side
while target* variables for the memory data is received from. Similarly to
MPI_Rput,
– Passive target:
– Active target:
– With collective fence operations. In this case, processes may access
memories of other processes associated with a window. This syn-
chronization mode can be used e.g. for geometric SPMD applica-
tions such as those discussed in Section 3.3 for the communication
phase that follows a computation phase in iterations of an algo-
rithm. The syntax of the fence operation is as follows:
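The fence operation is MPI_Win_fence(assert,win). A minimal active target sketch is given below; the window size COUNT, the ranks and the localbuffer receive buffer are illustrative assumptions:

double *winmemory,*localbuffer;
MPI_Win win;
// each process exposes COUNT doubles for one-sided access
MPI_Win_allocate(COUNT*sizeof(double),sizeof(double),MPI_INFO_NULL,
MPI_COMM_WORLD,&winmemory,&win);
localbuffer=malloc(COUNT*sizeof(double));
MPI_Win_fence(0,win); // open an access epoch
if (myrank==0) // rank 0 reads COUNT doubles from the window of rank 1
MPI_Get(localbuffer,COUNT,MPI_DOUBLE,1,0,COUNT,MPI_DOUBLE,win);
MPI_Win_fence(0,win); // close the epoch; the data read is now valid
MPI_Win_free(&win);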
As an example, consider parallel computation of an approximation of ln(x) using the following power series:

ln(x) = (x−1)/x + ((x−1)/x)^2/2 + ((x−1)/x)^3/3 + ...    (4.1)

which converges for x > 1/2. Such a problem can be parallelized by assignment of an equal
number of elements of the sum to each process of an MPI application. It
should also be noted that each successive element of the sum seems to involve
more computations since the power increases. One possible solution, that can
be adopted, is as follows:
// obtain my rank
MPI_Comm_rank(MPI_COMM_WORLD,&myrank);
// and the number of processes
MPI_Comm_size(MPI_COMM_WORLD,&nproc);
(...)
// read the argument
x=...
mult_coeff=(x-1)/x;
count=maxelemcount/nproc;
// now compute my own partial sum in a loop
power=myrank*count+1;
prev=pow(((x-1)/x),power);
for(;count>0;power++,count--) {
partialsum+=prev/power;
prev*=mult_coeff;
}
// now each process should have its own partial sum ready
// add the values to the process with rank 0
MPI_Reduce(&partialsum,&sum,1,MPI_DOUBLE,MPI_SUM,0,
MPI_COMM_WORLD);
MPI_Finalize();
1. Each process can determine the input data it should process on its own.
No data distribution phase was needed which is often necessary if e.g.
only one process has access to input data or one process generates input
data.
2. Processes can perform operations independently and partial results can
be merged into a final value at the end of computations. Such an appli-
cation is often referred to as embarrassingly parallel. In many problems,
communication and synchronization among processes are much more
complex, as described in Chapter 3.
3. In this specific example computations are balanced among processes. In
many practical problems, as described in Chapter 3, load balancing will
be a real challenge and may even need to be performed at runtime.
[Figure: execution time [s] of the MPI implementation of the ln(x) example for various numbers of processes (1, 2, 4, 8, 16, 32)]
and parameters:
The other function allows for specifying various executables to be used while
spawning:
which returns the parent communicator. A process has been spawned by another process if parentcommunicator!=MPI_COMM_NULL. For instance, if a single process started children, it can be contacted using rank 0 and parentcommunicator.
When using arguments to be passed to spawned processes, argv contains
just arguments to be passed, without the name of the binary. Additionally,
the info argument may be used to pass information when a process should be
started. As an example, the following sequence sets a host name for launch:
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info,"host","<hostname>");
MPI_Comm_spawn("binary",argv,1,info,0,MPI_COMM_SELF,
&intercommunicator,&errorcode);
It should be noted that the processes must refer to the same file with the same
access mode. A file should be closed with a call to function:
4.2 OPENMP
4.2.1 Programming model and application structure
OpenMP is an API [126, 127] that allows easy extension of sequential applications, such as those written in C or Fortran, so that execution of selected regions
within the code is performed in parallel. In essence, an application can be
extended with the following constructs:
int main() {
// a serial region
printf("\nThis is executed by one thread only!");
// serial processing here
// a serial region
printf("\nThis code is executed by one thread only");
// serial processing here
// a serial region
printf("\nThis code is executed by one thread only");
// serial processing here
Directives may have several attributes and arguments that specify how par-
allelization within particular regions is performed. As an example, the pseu-
docode shown in Listing 4.3 distinguishes two parallel regions. 32 threads are
requested (num_threads(32)) for execution of the first one. In the second,
the number of threads will be selected by the runtime system unless specified
by an environment variable as discussed in Section 4.2.3. The second par-
allel region specifies scoping of variables within the region – variable0 and
variable1 are variables private to each thread while variable2 is shared and
will require proper synchronization when written to by threads.
Listing 4.3 Basic OpenMP structure with numbers of threads and func-
tion calls
#include <omp.h>
#include <stdio.h>
int main() {
// a serial region
printf("\nThis is executed by one thread only!");
// serial processing here
// a parallel region
// each thread can fetch its own id unique
// among the threads and the number
// of threads executing the region (set to 32)
#pragma omp parallel num_threads(32)
{
printf("\nHello from thread %d, nthreads %d\n",
omp_get_thread_num(), omp_get_num_threads());
// a serial region
printf("\nThis code is executed by one thread only");
// serial processing here
// a parallel region
// the following specifies a parallel region with scoping of
variables
// in this case the number of threads is not set a’priori
#pragma omp parallel private(variable0, variable1) shared(
variable2)
{
// parallel processing here
}
Each of the threads can fetch the number of threads active in the team
by calling function:
int omp_get_num_threads()
A thread can fetch its own unique id within the team (from 0 up to the number of threads minus 1) by calling function:

int omp_get_thread_num()
– - – subtraction,
– & – bitwise and,
– | – bitwise inclusive or,
– ^ – bitwise exclusive or,
– && – logical and,
– || – logical or.
As an example, parallel sum can be computed using the code in-
cluded in file openmp-reduction-sum.c. An important part of such
an implementation is shown in Listing 4.4.
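The essential part of such an implementation is a parallel region with a sum reduction over per-thread results; a minimal sketch, assuming a compute(id) function as in the max variant shown further below, is:

long result=0;
int id;
#pragma omp parallel private(id) reduction(+:result) num_threads(32)
{
id=omp_get_thread_num();
result=compute(id); // each thread's value is summed into result on exit
}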
printf("Final result=%ld\n",result);
(...)
int id;
(...)
#pragma omp parallel private(id) reduction(max:result)
num_threads(32)
{
id=omp_get_thread_num();
result=compute(id);
}
printf("Final result=%ld\n",result);
(...)
• private(varlist) – as above.
• firstprivate(varlist) – as above.
• lastprivate(varlist) where varlist is a list of variables that will
be private to each thread in the team and the final global value of
the variable will be from the last iteration of the loop.
• ordered – if specified then iterations of the loop will be as in a
sequential code.
export OMP_NUM_THREADS=32
./application
is used then such a section is called unnamed. This means that all
threads will synchronize on entry to the code. Secondly, if there are
more unnamed sections, these would be regarded as one.
• named – in this case it is possible to specify distinct critical regions
in an application each of which is identified with a name specified
as follows:
and performs a flush either using the specified list of variables or the whole
view of a thread if a list has not been specified.
The flush operation is also called by default in many points of code that
in fact synchronize threads. Some of these include (a full list can be found in
[127]):
(...)
// read the argument
x=...
sum=0;
// use the OpenMP pragma for construct
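// (the loop itself is a sketch: assuming maxelemcount terms of the series
// from Equation 4.1 are summed, each term computed directly with pow())
#pragma omp parallel for reduction(+:sum)
for(int i=1;i<=maxelemcount;i++)
sum+=pow((x-1)/x,i)/i;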
printf("\nResult is %f\n",sum);
The code can be compiled and run as follows using a gcc compiler:
This implementation, while simple and fast to develop, has some drawbacks, such as direct computation of each element of the sum (including the power) in every iteration. On the other hand, it is this direct computation that makes computations of various iterations independent of each other.
These can be changed towards a longer implementation, included in file
lnx_OpenMP_1.c resembling the MPI solution from Section 4.1.13 but imple-
mented using a #pragma omp parallel in which each thread runs its own
independent loop. Listing 4.7 shows an important part of such an implemen-
tation.
(...)
// read the argument
x=...
mythreadid=omp_get_thread_num();
threadnum=omp_get_num_threads();
mult_coeff=(x-1)/x;
count=maxelemcount/threadnum;
prev=pow(((x-1)/x),power);
for(;count>0;power++,count--) {
sum+=prev/power;
prev*=mult_coeff;
}
}
// now the main thread should have the result in variable sum
printf("\nResult is %f\n",sum);
(...)
Results for various numbers of threads per process (coded with OpenMP) are
presented in Figure 4.3. It can be seen that within a node this code scales
slightly better than an MPI based version presented in Section 4.1.13 which is
expected within one node. Best times out of three runs for each configuration
are presented.
[Figure 4.3: execution time [s] of the OpenMP implementation for various numbers of threads (1, 2, 4, 8, 16, 32)]
which indicates that iterations of the for loop can be executed concurrently
with SIMD instructions. In particular, there are a few clauses that can be
used with simd. These include in particular: safelen(iterdistance) which
specifies that the iteration numbers of two iterations executed concurrently
must not exceed iterdistance, as well as private(...), lastprivate(...)
and reduction(...).
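For example, the following loop (the array names and length are illustrative) asks the compiler to vectorize a dot product style summation:

#pragma omp simd reduction(+:sum)
for(int i=0;i<n;i++)
sum+=a[i]*b[i];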
Additionally, the #pragma omp declare simd directive can be used before a function definition so that the compiler generates a SIMD version of the function, callable from vectorized loops:
float tab0[1000];
<initialize tab0>
...
return 0;
}
Sample compilation using the gcc compiler and running the application is
shown below:
4.3 PTHREADS
4.3.1 Programming model and application structure
POSIX threads or Pthreads [155] defines an API that allows writing multi-
threaded programs, in particular defining constants, types and functions for:
Function pthread_exit()
can be used to finalize the calling thread and pass a result to a thread that
has called pthread_join() if the calling thread is joinable.
A thread can be joinable or detached. Setting a thread as a joinable thread
can be done with setting a proper attribute when creating a thread i.e.:
pthread_t thread;
int startValue;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
pthread_create(&thread,&attr,function,(void *)(&startValue));
The original thread can wait for another thread (join with it) by calling
function pthread_join().
int *startValue;
double *threadResults; // required type can be used
// perform computations
threadResults[start]=partialsum;
}
for (i=0;i<threadnum;i++) {
// initialize the start value
startValue[i]=i;
// launch a thread for calculations
pthread_create(&thread[i],&attr,Calculate,(void *)(&(
startValue[i])));
}
The value of 0 indicates success while EINVAL indicates an error (the given mutex cannot be made consistent in this case).
Mutex types include:
Functions:
After such a call has returned, the calling thread has locked the mutex. The calling thread will block if the given mutex is already locked.
Unlocking a given mutex can be performed with a call to the following
function:
Function pthread_mutex_trylock(),
as its name suggests, will lock the mutex if it is unlocked or will return at once
otherwise. For a mutex of type PTHREAD_MUTEX_RECURSIVE already locked by
the calling thread, it will increase the corresponding count.
Function pthread_mutex_timedlock()
locks the mutex if it is possible or otherwise waits for the mutex to be unlocked.
The waiting time is determined by timeout.
Locking and unlocking functions return 0 if successful. In case of
pthread_mutex_trylock() the value of 0 is returned in case the mutex has
been locked by the call.
Function pthread_cond_timedwait() acts as a wait on a condition variable
until the given absolutetime. If the system time passes absolutetime, the
function returns ETIMEDOUT, while EINVAL denotes an invalid absolute time.
Function pthread_cond_signal() will cause unblocking of one or more threads
blocked on the pointed condition variable, while function
pthread_cond_broadcast() will cause unblocking of all threads blocked on the
pointed condition variable. Successful completion returns the value of 0.
4.3.4 Barrier
A barrier for a given number of threads (threadcount) can be initialized with
function pthread_barrier_init(). Function pthread_barrier_wait() can then be
used by threads to enter the barrier. After threadcount threads have entered
the barrier, all are allowed to continue. The function returns
PTHREAD_BARRIER_SERIAL_THREAD for one of the threads and the value of 0 for
the other threads.
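As an illustration, a minimal sketch (hypothetical worker function; NTHREADS threads synchronize between two phases of computations):

#include <pthread.h>
#define NTHREADS 4

pthread_barrier_t barrier; // in main(): pthread_barrier_init(&barrier,NULL,NTHREADS);

void *worker(void *arg) {
    // ... phase 1 computations ...
    pthread_barrier_wait(&barrier); // continue only after all NTHREADS threads arrived
    // ... phase 2 computations, may safely use phase 1 results of other threads ...
    return NULL;
}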
4.3.5 Synchronization
An application should not allow a thread to access a variable while there
may be modifications to the variable performed by another thread. Accord-
ing to the specification, there are several functions that synchronize memory
view among threads. Among the functions described in this book, there are:
pthread_create(), pthread_join(), pthread_mutex_lock() (with the ex-
ception of a recursive mutex already locked by the thread that has called
the function), pthread_mutex_trylock(), pthread_mutex_timedlock(),
pthread_mutex_unlock() (with the exception of a recursive mutex with a
counter larger than 1), pthread_cond_wait(), pthread_cond_timedwait(),
pthread_cond_signal() and pthread_cond_broadcast(), pthread_barrier
_wait(). Specification [155] lists additional functions, not covered in this
book.
mult_coeff=(x-1)/x;
count=maxelemcount/threadnum;
// now compute my own partial sum in a loop
power=start*count+1;
prev=pow(((x-1)/x),power);
for(;count>0;power++,count--) {
partialsum+=prev/power;
prev*=mult_coeff;
}
threadResults[start]=partialsum;
}
int i;
void *threadstatus;
(...)
thread=(pthread_t *)malloc(sizeof(pthread_t)*threadnum);
if (thread==NULL) {
// handle error
}
startValue=(int *)malloc(sizeof(int)*threadnum);
if (startValue==NULL) {
// handle error
}
threadResults=(double *)malloc(sizeof(double)*threadnum);
if (threadResults==NULL) {
// handle error
}
for (i=0;i<threadnum;i++) {
// initialize the start value
startValue[i]=i;
// launch a thread for calculations
pthread_create(&thread[i], &attr, Calculate, (void *)(&(
startValue[i])));
}
4.4 CUDA
4.4.1 Programming model and application structure
NVIDIA CUDA is an API that allows parallel programming for a GPU or a
collection of GPUs. It proposes a programming model that exposes massive
parallelism to the programmer.
1. Allocate memory on the host and initialize input data there.
2. Allocate memory on the GPU using cudaMalloc(...).
3. Copy input data from RAM to the memory on the GPU using
cudaMemcpy(...).
4. Launch computations on the GPU by calling a function called a kernel
function. Such a function is declared using the __global__ qualifier and
has void as the returned type.
5. Upon completion copy results from the memory on the GPU to RAM
using cudaMemcpy(...).
6. Display results.
nameofkernelfunction<<<numberofblocksingrid,
numberofthreadsinblock>>>(kernelparameters);
dim3 numberofblocksingrid(32,32,32);
dim3 numberofthreadsinblock(16,16,4);
Each thread would typically find its location in the grid in order to find
out the data/part of computations it should perform. The built-in variables
gridDim, blockIdx, blockDim and threadIdx can be used for this purpose (e.g.
the global index of a thread in a 1D grid is blockIdx.x*blockDim.x+threadIdx.x).
Special qualifiers can be used when declaring variables, for instance:
• __managed__ – variable that can be accessed from both the host and the
device.
Function __syncthreads() acts as a barrier for all the threads of a block. A
typical usage pattern (see the sketch after the following steps) is:
1. Each thread copies part of input data from global to shared memory
(shared memory can be used as kind of cache as discussed in Section
6.9.1).
2. __syncthreads() is called that ensures that all the threads have com-
pleted copying.
3. Each thread computes part of data and stores in shared memory.
4. __syncthreads() is called that ensures that all the threads have written
their parts of data.
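A minimal sketch of this pattern (a hypothetical kernel that doubles every element of array in and writes the result to out; BLOCKSIZE is assumed to equal the number of threads per block):

#define BLOCKSIZE 256

__global__ void scale2(const float *in,float *out) {
    __shared__ float buf[BLOCKSIZE];
    int i=blockIdx.x*blockDim.x+threadIdx.x;
    buf[threadIdx.x]=in[i];   // 1. copy own element from global to shared memory
    __syncthreads();          // 2. make sure all threads of the block finished copying
    buf[threadIdx.x]*=2.0f;   // 3. compute and store the result in shared memory
    __syncthreads();          // 4. make sure all threads have written their results
    out[i]=buf[threadIdx.x];  // finally write results back to global memory
}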
Having independent blocks allows the runtime layer to hide latency when delays
occur. From the programmer's point of view, this results in automatic dynamic
scheduling and load balancing on the GPU, but it also imposes constraints on
the arrangement of computations and data in the application.
The programming model for an application is called SIMT (Single Instruc-
tion Multiple Threads). Essentially, threads working in parallel usually process
various data or perform various computations, the results of which are further
integrated into a final result. Threads within a warp must execute the same
instruction at the same time. It should be noted, however, that if the code
includes conditional statements such as:
if (condition) {
<code of block A>
} else {
<code of block B>
}
then codes of blocks A and B will not be executed in parallel if some threads
have condition equal to true and others equal to false. In fact, codes
of blocks A and B will be executed sequentially. Starting with Volta and
CUDA 9, independent thread scheduling is used with program counter per
thread and statements from divergent branches can be interleaved [65].
Function __syncwarp() can be used for reconvergence. Additionally, Cooperative
Groups, introduced in CUDA 9, allow managing groups of communicating threads
in a flexible way with dedicated functions [122], including the possibility of
whole-grid or multi-GPU synchronization.
CUDA offers several so-called atomic functions that allow atomic opera-
tions that read, modify and store values in either global or shared memories
[122]. Such operations can be called from within functions running on a device.
A particular function takes an input value, performs an operation using the
old value at the given address and the new value, and saves the result to the
location where the old value was stored. Examples of useful functions include: int
atomicAdd(int *oldvaluelocation,int value) and int atomicSub(int
*oldvaluelocation,int value) for adding or subtracting value from a vari-
able stored in a given location, int atomicMax(int *oldvaluelocation,int
value) and int atomicMin(int *oldvaluelocation,int value) for com-
puting a maximum or minimum out of the old value and newly given value,
int atomicExch(int *oldvaluelocation,int value) for exchanging the
old and new values as well as int atomicCAS(int *oldvaluelocation,int
valuetocompare,int valueifequal) that compares the old value with
valuetocompare and stores valueifequal if the comparison succeeded, or does
nothing otherwise. CUDA also supports atomic logical operations.
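As an illustration, consider a hypothetical kernel (names are assumptions, not from the accompanying code) in which each thread atomically accumulates its element of array data into a single global counter total:

__global__ void sumall(const int *data,int *total) {
    int i=blockIdx.x*blockDim.x+threadIdx.x;
    atomicAdd(total,data[i]); // read-modify-write on *total performed atomically
}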
4.4.3 Constraints
When processing on a GPU, there are several constraints that should be taken
into account, in particular:
Then the same approach is applied to the result and the procedure is repeated.
Listing 4.12 presents an important part of such an implementation. The code
creates a grid with blocks in which each thread deals with a separate input
argument. Afterwards, each thread writes the number of iterations for its
number into its own space in a global array. Note that the code also runs for
my_index=0 but this does not affect the final result.
__global__
void checkCollatz(long *result) {
long my_index=blockIdx.x*blockDim.x+threadIdx.x;
unsigned long start=my_index;
char cond=1;
unsigned long counter=0;
for(;cond;counter++) {
start=(start%2)?(3*start+1):(start/2);
cond=(start>1)?1:0;
}
result[my_index]=counter;
}
(...)
long size=threadcount*sizeof(long);
long *hresults=(long *)malloc(size);
if (!hresults) errorexit("Error allocating memory on the host"
);
long *dresults=NULL;
if (cudaSuccess!=cudaMalloc((void **)&dresults,size))
errorexit("Error allocating memory on the GPU");
(...)
if (cudaSuccess!=cudaMemcpy(hresults,dresults,size,
cudaMemcpyDeviceToHost))
errorexit("Error copying results");
cudaDeviceSynchronize();
// release resources
free(hresults);
if (cudaSuccess!=cudaFree(dresults))
errorexit("Error when deallocating space on the GPU");
}
This initial code can be improved in several ways. For instance, this
code does not exploit shared memory that could be used within thread
blocks. As a side effect, the size of the result that needs to be sent back
to the host memory is large and consequently finding a maximum on the
host side can take considerable time. An improved version, included in
file collatz-partialparallelmaxcomputation.cu applies parallel reduction
within each thread block using shared memory. An improved kernel is shown in
Listing 4.13. Specifically, each thread puts its result into its own place in shared
memory and then a reduction algorithm proceeds in the logarithmic number
of steps for finding a maximum value synchronized using __syncthreads()
between iterations. Also, the number of results transferred to the host side and
browsed sequentially for a maximum decreased as a result. This code could
potentially be improved further as shown in [82].
Furthermore, another improvement could be to parallelize finding the max-
imum on the host side, for instance using OpenMP.
for(;cond;counter++) {
start=(start%2)?(3*start+1):(start/2);
cond=(start>1)?1:0;
}
sresults[threadIdx.x]=counter;
__syncthreads();
nvcc collatz.cu
nvprof ./a.out
nvcc collatz-partialparallelmaxcomputation.cu
nvprof ./a.out
Table 4.3 presents lowest execution times for the GPU part and the whole
program for a testbed platform.
TABLE 4.3 Execution times [us] for two CUDA code versions

code version                     2 x Intel Xeon E5-2620v4 + NVIDIA GTX 1070
standard – execution time GPU    54221
improved – execution time GPU    23329
cudaMalloc(...);
cudaMemcpy(...,inputdata,...,cudaMemcpyHostToDevice);
kernel<<<gridsize,blocksize>>>(...);
cudaMemcpy(results,...,...,cudaMemcpyDeviceToHost);
kernelA<<<gridsizeA,blocksizeA>>>(...);
kernelB<<<gridsizeB,blocksizeB>>>(...);
kernelC<<<gridsizeC,blocksizeC>>>(...);
processdataonCPU();
cudaDeviceSynchronize();
– Streams – operations issued to a stream are executed one after another; the
default stream (with id 0) is used if no other stream is specified explicitly.
Operations in various streams (with non-zero ids) can be executed
concurrently.
– Asynchronous communication between a host and a device. Whether a
particular device supports such mode can be checked by reading the
value of field asyncEngineCount in a structure returned by function
cudaGetDeviceProperties(...) described in Section 4.4.8.
kernelA<<<gridsizeA,blocksizeA,smbytes,stream>>>()
where smbytes indicates dynamically reserved shared memory size per thread
block [84, 122] and stream indicates the stream to be used. For the former,
it is possible to specify the data size for shared memory when launching the
kernel as demonstrated in the example in Section 5.1.5.
An asynchronous call starting communication between the host and a de-
vice can be performed as follows:
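for instance (hypothetical device pointer ddata, host pointer hdata and byte count size):

cudaMemcpyAsync(ddata,hdata,size,cudaMemcpyHostToDevice,stream);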
where stream indicates the stream the request should be put into. Conse-
quently, this allows potential overlapping of computations and communication
when using two streams as shown in Section 6.1.2. Asynchronous copying and
overlapping of communication between the host and the device with compu-
tations can be done when page-locked memory on the host is used.
Page-locked (i.e. pinned) memory can be allocated using function
cudaHostAlloc(...). For pinned memory allocated as mapped (flag
cudaHostAllocMapped), a corresponding device-side pointer can be obtained with
function:
cudaError_t cudaHostGetDevicePointer(void
**pointerondevice,void *pointeronhost,
unsigned int flags)
where pointerondevice will return a pointer that can be used on the device
[123].
CUDA allows concurrent kernel execution as outlined in [137]. There are
a few conditions for that to take place:
In CUDA 7 a new possibility for so-called per thread default streams was
introduced [85]. Specifically, it can be useful for multithreaded applications in
which case each thread running on the host will have its own default stream
and such per thread default streams will execute tasks in parallel. Turning
on per thread default streams can be done by compiling using nvcc with the
--default-stream per-thread flag. Furthermore, such per thread streams
will execute tasks in parallel with streams other than the default one. There
are several functions that can be used for synchronization in this context:
cudaDeviceSynchronize(void)
which waits until all tasks previously submitted to the device have completed,
cudaStreamSynchronize(stream)
which waits until tasks previously submitted to the given stream have completed, and
cudaStreamWaitEvent(stream,event,flags)
which forces all tasks that will be submitted to the given stream to wait for
the given event. It is important that this function can be used to synchronize
tasks submitted to various devices if various streams are used. Specifically,
all the tasks to be submitted to the stream will not start until a call to
function cudaEventRecord(...) has completed [123]. In order to implement
such synchronization, an event needs to be created first and recorded later with
function cudaEventRecord(...) [113, 137]. Functions that allow management
of events are as follows:
cudaEventCreate(&event)
which creates an event with standard settings (see below for creation of an
event with flags), and cudaEventSynchronize(event), which enforces waiting
until all tasks submitted before the last call to cudaEventRecord(...) have
completed.
The way the host thread is waiting for results can be affected by setting proper
device flags such as [123]:
– cudaDeviceScheduleSpin – do not yield control to other threads until re-
ceiving result from a GPU and busy wait for result that may affect
performance of other threads running on the CPU,
– cudaDeviceScheduleYield – yield control to other threads before receiving
result from a GPU – may increase the time for getting results but does
not affect performance of other threads running on the CPU that much,
– cudaDeviceScheduleBlockingSync – block the calling thread.
Following this, function:
cudaError_t cudaEventCreateWithFlags(cudaEvent_t
*event,unsigned int flags)
can be used to create an event with the given flags. The status of an event
can be checked with cudaEventQuery(event), which returns cudaSuccess if all
tasks submitted prior to the last call to cudaEventRecord(...) have already
been finalized; cudaErrorNotReady is returned otherwise.
where memory will point at a newly allocated memory space on the de-
vice that will be managed by Unified Memory mechanisms. The size of
the memory is specified in size and given in bytes. It should be noted
that on card generations before Pascal, pages would be populated on a
GPU. If a CPU initializes data then a page(s) will be moved to the CPU
and then to the GPU if a kernel operating on it is launched. On Pascal
cards, no pages are created at the managed malloc, only when refer-
enced which can minimize page migrations [143]. On Pascal, concurrent
accesses from a CPU and a GPU are possible. Flags, if specified, may
include cudaMemAttachGlobal which indicates that the newly allocated
memory can be referred to from any stream on a GPU. The memory can
be accessed in both CPU and GPU threads. In case there are several
GPUs in the system, memory is allocated on the one that was selected
when cudaMallocManaged(...) was called. Threads running on other
GPUs may refer to the memory as well. A sample allocation of memory
can be done as follows:
char *memory;
int memsize=1024;
cudaMallocManaged(&memory,memsize);
2. Initialize data.
3. Spawn a kernel on a device (it will have access to the data in the unified
memory).
4. Wait for completion of computations on the host side which can be
accomplished using a call to cudaDeviceSynchronize().
5. Read results from the unified memory and display to the user.
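A minimal sketch of the above steps (hypothetical kernel incrementing every element of a managed array; this is not the book's code):

#include <stdio.h>

__global__ void inc(int *data,int n) {
    int i=blockIdx.x*blockDim.x+threadIdx.x;
    if (i<n) data[i]++;                       // the kernel accesses managed memory directly
}

int main() {
    int n=1024;
    int *data;
    cudaMallocManaged(&data,n*sizeof(int));   // 1. allocate unified (managed) memory
    for(int i=0;i<n;i++) data[i]=i;           // 2. initialize data on the host
    inc<<<(n+255)/256,256>>>(data,n);         // 3. spawn a kernel on the device
    cudaDeviceSynchronize();                  // 4. wait for completion on the host side
    printf("data[n-1]=%d\n",data[n-1]);       // 5. read results from unified memory
    cudaFree(data);
    return 0;
}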
Unified memory can be accessed from both threads running on a CPU and a
device. It should be noted that, in general, the host thread must not access
managed memory (allocated with cudaMallocManaged(...)) while a kernel
on a GPU is active, even if the latter is not actually using the data or is using
other variables [122]. This is, however, possible on Pascal generation of cards
[143].
Unified memory allows increasing productivity for CUDA-enabled appli-
cations and porting traditional HPC codes to GPU enabled environments. In
particular, management of data stored in complex structures is much easier
using unified memory rather than explicit manual copying between the host
and device memories. Work [83] shows how C++ new and delete operators can be
adapted in this context. A given pointer can be passed to function:
cudaError_t cudaPointerGetAttributes(cudaPointerAttributes
*attributes, const void *pointer)
in order to find out where the data is stored (host or GPU and which GPU).
One of the key differences between UMA and UVA is that in UMA pages with
data can be migrated between a host and a device.
Function cudaGetDeviceCount(&dcount) can be used, for which, upon return,
dcount contains the number of devices that can run CUDA code, while function
cudaSetDevice(devicenumber) sets the device identified with devicenumber as
the current device. Allocating memory and subsequent launching of a kernel are
performed on the current device (by default set to device 0).
– working mode in terms of accessing the device i.e. how many host threads
will be able to call cudaSetDevice(...),
– if it is possible to launch many kernels on the device from the host,
– if sharing memory space between the device and the host is enabled.
As suggested in [109] it is important to always set the current device when
GPU devices are used from the level of a multithreaded host application –
specifically just after new threads are created.
4.5 OPENCL
4.5.1 Programming model and application structure
OpenCL is an open standard that allows parallel programming [17, 158] tar-
geted for heterogeneous systems, in particular including multicore CPUs and
GPUs [102]. OpenCL version 2.1 was released in November 2015 while version
2.2 in May 2017. For instance, a single application can use both CPUs and
GPUs installed within a server/workstation node. It should be noted, however,
that such computing devices may considerably differ in performance for vari-
ous types of applications which may require proper care especially in terms of
load balancing. OpenCL distinguishes several key concepts, most important
of which (in terms of general purpose parallel programming) include:
– Global memory – all work items can read and write locations of this memory,
it can also be accessed by a host thread.
– Constant memory – visibility is as for the global memory. However, while
this memory can be written to by a host it can only be read from by
work items.
– Local memory – accessible only to work items executing a kernel, space
allocated within local memory can be shared by work items of one work
group.
– Private memory – visible only to a single work item, analogously to local
(per-thread) memory in CUDA.
It should be noted that OpenCL 2.0+ also provides Shared Virtual Mem-
ory (SVM) that enables code on the host and device sides to share presum-
ably complex data structures [102]. In coarse-grained mode a memory buffer
is shared. In fine-grained mode memory locations can be read/written concur-
rently if SVM atomic operations are supported or different locations can be
accessed if the latter are not supported. Fine-grain buffer sharing allows buffer
sharing while fine-grain system sharing allows access to the whole memory of
a host.
A typical sequence of operations that are executed for parallel computa-
tions on a device in OpenCL includes:
1. Find out available platforms using a call to clGetPlatformIDs(...).
2. Find out devices (either all or devices of certain types such as CPUs or
GPUs) within a platform using a call to clGetDeviceIDs(...).
3. Create a context for a device(s) using clCreateContext(...).
4. Create a command queue on a device within a context using
clCreateCommandQueueWithProperties(...).
cl_context clCreateContext(
const cl_context_properties *properties,
cl_uint numberofdevices,
const cl_device_id *devices,
void (CL_CALLBACK *pfn_notify)(const char *errorinfo,
const void *privateinfo, size_t cb,
void *callbackinputparameters),
void *callbackinputparameters,
cl_int *errorcode)
cl_command_queue clCreateCommandQueueWithProperties(
cl_context context,
cl_device_id device,
const cl_queue_properties *propertiesofqueue,
cl_int *returnederrorocode)
cl_program
clCreateProgramWithSource(cl_context context,
cl_uint stringcount,
const char **strings,
const size_t *stringlengths,
cl_int *returnederrorcode)
cl_int
clEnqueueMarkerWithWaitList(cl_command_queue commandqueue,
cl_uint numberofeventsinwaitlist,
const cl_event *eventwaitlist,
cl_event *event)
cl_int
clEnqueueBarrierWithWaitList(cl_command_queue commandqueue,
cl_uint numberofeventsinwaitlist,
const cl_event *eventwaitlist,
cl_event *event)
if (CL_SUCCESS!=clGetPlatformIDs(1,platforms,&
numberofplatformsreturned))
errorexit("Error getting platforms");
if (CL_SUCCESS!=clGetDeviceIDs(platforms[0],CL_DEVICE_TYPE_GPU
,6,devices,&numberofdevicesreturned))
errorexit("Error getting devices");
(...)
cl_context_properties properties[]={
CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0],
0};
context=clCreateContext(properties,numberofdevicesreturned,
devices,NULL,NULL,&errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating context");
cl_queue_properties *queueproperties=NULL;
queue=clCreateCommandQueueWithProperties(context,devices[0],
queueproperties,&
errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating a command queue");
program=clCreateProgramWithSource(context,1,
kernelstringspointer,&kernelsourcesize,&errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating a program");
if (CL_SUCCESS!=clBuildProgram(program,0,NULL,"",NULL,NULL))
errorexit("Error building program");
kernel=clCreateKernel(program,"collatzkernel",&errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating a kernel");
size_t globalworksize=1024*10000;
size_t localworksize=1024;
outputbuffer=clCreateBuffer(context,CL_MEM_WRITE_ONLY,
globalworksize*sizeof(long),NULL,&errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating a buffer");
if (CL_SUCCESS!=clSetKernelArg(kernel,0,sizeof(outputbuffer),&
outputbuffer))
errorexit("Error setting a kernel argument");
if (CL_SUCCESS!=clEnqueueNDRangeKernel(queue,kernel,1,NULL,&
globalworksize,&localworksize,0,NULL,NULL))
errorexit("Error enqueuing a kernel");
if (CL_SUCCESS!=clFinish(queue))
errorexit("Error finishing computations within a queue");
if (CL_SUCCESS!=clReleaseKernel(kernel))
errorexit("Error releasing the kernel");
if (CL_SUCCESS!=clReleaseProgram(program))
errorexit("Error releasing the program");
if (CL_SUCCESS!=clReleaseMemObject(outputbuffer))
errorexit("Error releasing the buffer");
if (CL_SUCCESS!=clReleaseCommandQueue(queue))
errorexit("Error releasing the queue");
if (CL_SUCCESS!=clReleaseContext(context))
errorexit("Error releasing the context");
fflush(stdout);
}
This first code can be extended and improved with the following:
1. Finding the maximum of iterations among work items within each work
group, similarly to the solution for CUDA presented in Section 4.4.4.
This works by using local memory for storage of intermediate maximum
iteration values in successive iterations of parallel reduction which has
O(log(n)) complexity where n denotes the number of elements. This step
still generates one subresult per work group. The total number of subre-
sults can still be quite large, depending on an NDRange configuration.
2. Consequently, another kernel (with one work group, assuming the num-
ber of groups is only a few times larger than the maximum number of
work items in a work group) can be launched for parallel reduction of
results from the previous step. In this case, the number of results from
the previous step is equal to the number of work groups. These results
are extended to the nearest multiple of the number of threads within
a group (set to 1024 in the code) such that each work item can first
browse the same number of results. Padded elements need to be ini-
tialized to 0. Specifically, each work item first finds a maximum out of
howmanyelementsperworkitem elements. Note that in each iteration of
a loop all work items collectively refer to consecutive locations in global
memory, skipping 1024 elements in every iteration. Following this step
work items synchronize and then collectively reduce values to a final
result using the aforementioned tree based approach.
start=(start%2)?(3*start+1):(start/2); \
cond=(start>1)?1:0; \
} \
\
sresults[mylocalindex]=counter; \
barrier(CLK_LOCAL_MEM_FENCE); \
\
for(counter=512;counter>0;counter/=2) { \
if (mylocalindex<counter) \
sresults[mylocalindex]=(sresults[mylocalindex]>sresults[
mylocalindex+counter])?sresults[mylocalindex]:sresults[
mylocalindex+counter]; \
barrier(CLK_LOCAL_MEM_FENCE); \
} \
\
if (mylocalindex==0) \
result[get_group_id(0)]=sresults[0]; \
} \
";
for(counter=512;counter>0;counter/=2) { \
if (mylocalindex<counter) \
sresults[mylocalindex]=(sresults[mylocalindex]>sresults[
mylocalindex+counter])?sresults[mylocalindex]:sresults[
mylocalindex+counter]; \
barrier(CLK_LOCAL_MEM_FENCE); \
} \
\
if (mylocalindex==0) \
data[get_group_id(0)]=sresults[0]; \
\
} \
";
(...)
if (CL_SUCCESS!=clGetPlatformIDs(1,platforms,&
numberofplatformsreturned))
errorexit("Error getting platforms");
if (CL_SUCCESS!=clGetDeviceIDs(platforms[0],CL_DEVICE_TYPE_GPU
,6,devices,&numberofdevicesreturned))
errorexit("Error getting devices");
(...)
cl_context_properties properties[]={
CL_CONTEXT_PLATFORM, (cl_context_properties)platforms[0],
0};
context=clCreateContext(properties,numberofdevicesreturned,
devices,NULL,NULL,&errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating context");
cl_queue_properties *queueproperties=NULL;
queue=clCreateCommandQueueWithProperties(context,devices[0],
queueproperties,&
errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating a command queue");
program=clCreateProgramWithSource(context,2,
kernelstringspointer,kernelsourcesize,&errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating a program");
if (CL_SUCCESS!=clBuildProgram(program,0,NULL,"",NULL,NULL))
errorexit("Error building program");
kernel=clCreateKernel(program,"collatzkernel1",&errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating a kernel");
kernel1=clCreateKernel(program,"reducekernel",&errorcode);
if (CL_SUCCESS!=errorcode)
errorexit("Error creating a kernel");
size_t localworksize=1024;
size_t globalworksize=1024*10000;
size_t groupcount=globalworksize/localworksize; // we assume
that this is divisible
size_t howmanyelementsperworkitem=groupcount/localworksize;
if (groupcount%localworksize) howmanyelementsperworkitem+=1;
size_t buffersize=howmanyelementsperworkitem*localworksize;
if (CL_SUCCESS!=clSetKernelArg(kernel,0,sizeof(outputbuffer),&
outputbuffer))
errorexit("Error setting a kernel argument");
if (CL_SUCCESS!=clEnqueueNDRangeKernel(queue,kernel,1,NULL,&
globalworksize,&localworksize,0,NULL,NULL))
errorexit("Error enqueuing a kernel");
if (CL_SUCCESS!=clFinish(queue))
errorexit("Error finishing computations within a queue1");
if (CL_SUCCESS!=clReleaseKernel(kernel))
errorexit("Error releasing the kernel");
if (CL_SUCCESS!=clReleaseProgram(program))
errorexit("Error releasing the program");
if (CL_SUCCESS!=clReleaseMemObject(outputbuffer))
errorexit("Error releasing the buffer");
if (CL_SUCCESS!=clReleaseCommandQueue(queue))
errorexit("Error releasing the queue");
if (CL_SUCCESS!=clReleaseContext(context))
errorexit("Error releasing the context");
printf("\nResult is %ld\n",hostbuffer[0]);
fflush(stdout);
4.6 OPENACC
4.6.1 Programming model and application structure
The OpenACC API [125] specifies ways of expressing parallelism in C and
Fortran codes such that some computations can be offloaded to accelerators,
typically GPUs.
While CUDA and OpenCL allow for (comparably) lower level program-
ming with management of computing devices, explicit management of streams
and queues respectively, OpenACC uses directives to point out sections of code
that may include parallelism or point out specific constructs such as loops that
may be parallelized.
At the level of API, OpenACC is similar to OpenMP, in the sense of using
primarily directives to describe potential parallelism in a standard program.
In essence, OpenACC allows three levels of parallelism:
– gangs – a certain number of gangs execute,
– workers – each gang may run one or more workers,
– vectors – a certain number of vector lanes operate within a worker.
Similarly to OpenMP, program execution starts with a single thread. Some
sections of code will employ parallelism at the aforementioned levels using
constructs described next.
One or more gangs start execution of the code within the block, with
one worker per gang starting execution. Similarly to OpenMP, iterations
of loops within a parallel block can be processed in parallel.
Next to the parallel keyword, several clauses can be specified, with
the most important ones being:
For the atomic directive, a clause can be one of the following, referring to
how a variable is accessed: read, write, update (increments, decrements,
binary operations) and capture (update combined with a read).
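For instance, a minimal sketch in which many iterations increment a shared counter (hypothetical variable names, not from the accompanying code):

#pragma acc parallel loop copy(counter)
for(i=0;i<n;i++) {
    #pragma acc atomic update
    counter++;   // performed atomically on the shared device copy of counter
}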
int i;
double x;
long maxelemcount=10000000;
double sum=0; // final value
long power; // acts as a counter
(...)
sum=0;
// use the OpenACC parallel loop construct
printf("\nResult is %f\n",sum);
(...)
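The elided loop could be written, for instance, as the following sketch (the same power series for ln(x) as in the preceding sections is assumed; this is not necessarily the exact code from the file):

#pragma acc parallel loop reduction(+:sum)
for(power=1;power<=maxelemcount;power++)
    sum+=pow((x-1)/x,power)/power;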
In this particular case, the accULL [138, 139] version 0.4 alpha software
was used for compilation and running the aforementioned example in a Linux
based system.
There are several OpenACC compilers including the OpenACC Toolkit
available from NVIDIA [121] with a PGI Accelerator Fortran/C Compiler,
support for OpenACC in GCC [2] etc.
4.7.1 MPI+Pthreads
File lnx_MPI+Pthreads.c includes a parallel code that uses MPI and the
Pthreads API to implement the problem stated in Section 4.1.13, the former to
start processes working in parallel and the latter to start threads within
processes. Threads dedicated to computations are started with a call to
pthread_create(...) and execute function void *Calculate(void *args), which
can be implemented as follows:
mult_coeff=(x-1)/x;
count=maxelemcount/totalthreads;
// now compute my own partial sum in a loop
power=(myrank*threadnum+start)*count+1;
prev=pow(((x-1)/x),power);
for(;count>0;power++,count--) {
partialsum+=prev/power;
prev*=mult_coeff;
}
threadResults[start]=partialsum;
}
Each thread initializes the power it will start with by fetching the value
of its process rank as well as its thread id passed as an argument to function
Calculate(...). After each thread has computed its partial result it writes
the result to a global array. Each of these values is added after a call to
pthread_join(...) from the main thread for which the key part of the code
is shown in Listing 4.17. Then all results from processes are added with a call
to MPI_Reduce(...).
The sample program can be compiled and run as follows:
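For instance (executable name assumed; the Pthreads and math libraries are linked explicitly):

mpicc lnx_MPI+Pthreads.c -o lnx_MPI+Pthreads -lpthread -lm
mpirun -np 4 ./lnx_MPI+Pthreads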
int i;
int threadsupport;
void *threadstatus;
(...)
MPI_Comm_size(MPI_COMM_WORLD, &proccount);
if (maxelemcount<proccount) {
if (!myrank)
printf("Maxelemcount smaller than the number of processes
- try again.");
MPI_Finalize();
return -1;
}
for (i=0;i<threadnum;i++) {
// initialize the start value
startValue[i]=i;
// launch a thread for calculations
pthread_create(&thread[i], &attr, Calculate, (void *)(&(
startValue[i])));
}
// now synchronize the threads
// and add results from all the threads in a process
for (i=0;i<threadnum;i++) {
pthread_join(thread[i], &threadstatus);
totalprocesssum+=threadResults[i];
}
// now merge results from processes to rank 0
MPI_Reduce(&totalprocesssum,&totalsum,1,
MPI_DOUBLE,MPI_SUM,0,
MPI_COMM_WORLD);
if (!myrank)
printf("Result=%f\n",totalsum);
pthread_attr_destroy(&attr);
// Shut down MPI
MPI_Finalize();
return 0;
}
Figure 4.8 presents execution times for various numbers of processes and
computing threads per process started with Pthreads run on a workstation
with 2 x Intel Xeon E5-2620v4 and 128 GB RAM. This type of application does
not benefit much from HyperThreading. In such an environment it is generally
better to use more threads per process than a larger number of processes
with fewer threads due to lower overhead for creation and switching between
threads compared to processes. Consequently, in a cluster environment it is
recommended to use MPI for internode communication with parallelization
within a node using threads. Results are best times out of 3 runs for each
configuration.
[Figure 4.8: execution time [s] vs. number of processes (1–32), for 1, 2, 4, 8 and 16 threads per process]
4.7.2 MPI+OpenMP
Extending the previous implementations, the code included in file
lnx_MPI+OpenMP.c combines MPI and OpenMP in one application for both
parallel processing among processes and within each process using threads.
Listing 4.18 lists key parts of such an implementation. An OpenMP #pragma
omp parallel directive is used for parallel processing with threads in each
process, with thread results summed using the reduction(+:partialsum)
clause.
MPI_Init_thread(&argc, &argv,MPI_THREAD_FUNNELED,&
threadsupport);
if (threadsupport<MPI_THREAD_FUNNELED) {
printf("\nThe implementation does not support
MPI_THREAD_FUNNELED, it supports level %d\n",threadsupport);
MPI_Finalize();
return -1;
}
if (maxelemcount<proccount) {
if (!myrank)
printf("Maxelemcount smaller than the number of processes
- try again.");
MPI_Finalize();
return -1;
}
[Figure: execution time [s] vs. number of processes (1–32), for 1, 2, 4, 8 and 16 threads per process, MPI+OpenMP version]
In the experiments, --bind-to none gave better results for 1 and 2 pro-
cesses but generally better results were obtained without this option for larger
numbers of processes. Best times are presented.
4.7.3 MPI+CUDA
The parallel application for verification of the Collatz hypothesis, presented
in Section 4.4.4, can be extended to be used in a cluster environment, with
potentially more than one GPU per node. The goal of the code is not only to
verify the hypothesis (in which case computations for each considered start-
ing number should eventually terminate) but also to compute the maximum
number of steps for the starting number to reach 1 with the operations listed
in the Collatz hypothesis.
The structure of such an application is as follows:
MPI_Init(&argc, &argv);
(...)
start=atol(argv[1]);
end=atol(argv[2]);
gpuspernode=atoi(argv[3]);
if (!launchcomputekernel(myrank%gpuspernode,start,end,&
result,myrank,processcount))
errorexit("\nError on the .cu side");
// merge results from processes - the maximum number of steps
MPI_Reduce(&result,&finalresult,1,MPI_LONG,MPI_MAX,0,MPI_COMM_WORLD);
if (!myrank)
printf("\nFinal result is %ld\n",finalresult);
MPI_Finalize();
}
for(;cond;counter++) {
start=(start%2)?(3*start+1):(start/2);
cond=(start>1)?1:0;
}
sresults[threadIdx.x]=counter;
__syncthreads();
long size=blocksingrid*sizeof(long);
// each process sets an active GPU on the node it will manage
if (cudaSuccess!=cudaSetDevice(deviceno)) {
printf("\nError setting a CUDA device. Check the number
of devices per node.\n");
return 0;
}
long *dresults=NULL;
if (cudaSuccess!=cudaMalloc((void **)&dresults,size)) {
printf("\nError allocating memory on the GPU");
return 0;
}
if (cudaSuccess!=cudaMemcpy(hresults,dresults,size,
cudaMemcpyDeviceToHost)) {
printf("\nError copying results");
return 0;
}
cudaDeviceSynchronize();
// release resources
free(hresults);
if (cudaSuccess!=cudaFree(dresults)) {
printf("\nError when deallocating space on the GPU");
return 0;
}
return 1;
Such code organization means that the following steps can be used to compile
and link the code in order to obtain a final executable file:
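A minimal sketch (file names, executable name and CUDA library path are assumptions):

nvcc -c collatz-gpu.cu -o collatz-gpu.o
mpicc MPI+CUDA-collatz.c collatz-gpu.o -o app -L/usr/local/cuda/lib64 -lcudart
mpirun -np 4 ./app 1 100000000 2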
The first two arguments denote the start and the end of the range to check and
the last argument denotes the number of GPUs per node that the application
will use.
Results of running this application on a workstation with 2 x Intel Xeon
E5-2620v4 and 128 GB RAM and two NVIDIA GTX 1070 cards are presented
in Table 4.4.
Programming parallel paradigms using selected APIs

CONTENTS
5.1 Master-slave 185
5.1.1 MPI 186
5.1.2 OpenMP 190
5.1.3 MPI+OpenMP 197
5.1.4 MPI+Pthreads 199
5.1.5 CUDA 207
5.1.6 OpenMP+CUDA 213
5.2 Geometric SPMD 218
5.2.1 MPI 218
5.2.2 MPI+OpenMP 220
5.2.3 OpenMP 225
5.2.4 MPI+CUDA 225
5.3 Divide-and-conquer 229
5.3.1 OpenMP 229
5.3.2 CUDA 232
5.3.3 MPI 235
5.3.3.1 Balanced version 236
5.3.3.2 Version with dynamic process creation 240
5.1 MASTER-SLAVE
Implementations of the master-slave paradigm may differ in how the following
actions are executed:
Furthermore, the code can be simplified in such a way that the active
role of the master is eliminated by transferring master’s responsibilities to
slaves that fetch successive data packets, compute, store results and repeat
this process. In this case, a synchronization mechanism among slaves needs to
be adopted.
5.1.1 MPI
This section proposes a code (available in file master-slave-MPI.c) of a basic
master-slave application implemented with C and MPI. The application is to
perform numerical integration of a given function (defined within C function
fcpu(...)) over a given range. The function can be defined, as an example,
as follows:
double fcpu(double x) {
return 1.0/(1.0+x);
}
For the sake of generalization of the code, several elements realizing the
master-slave concept are distinguished:
typedef struct {
long elemcount;
double *elements;
double *result;
} functionparameters_t;
Listing 5.1 Basic master-slave application using MPI – master’s key code
long packetcounter=0;
// distribute initial data packets to slave processes
for(i=1;i<processcount;i++,packetcounter++)
MPI_Send(&(data[2*packetcounter]),2,MPI_DOUBLE,i,DATA_PACKET,
MPI_COMM_WORLD);
// check who sent the result and send another data packet to that process
MPI_Send(&(data[2*packetcounter]),2,MPI_DOUBLE,mpistatus.
MPI_SOURCE,DATA_PACKET,MPI_COMM_WORLD);
packetcounter++;
MPI_Send(NULL,0,MPI_DOUBLE,i,FINISH_COMPUTATIONS,
MPI_COMM_WORLD);
// display the result
printf("\nResult is %f\n",result);
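The receive step itself is elided above; a minimal sketch consistent with the surrounding variables (assuming a double temporaryresult; not necessarily the exact listing code) could be:

MPI_Recv(&temporaryresult,1,MPI_DOUBLE,MPI_ANY_SOURCE,RESULT,MPI_COMM_WORLD,&mpistatus);
result+=temporaryresult; // merge the partial integral into the final result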
Each slave acts in a loop and does the following (Listing 5.2 presents key
code of the slave process):
Listing 5.2 Basic master-slave application using MPI – slave’s key code
double data[2];
do {
MPI_Probe(0,MPI_ANY_TAG,MPI_COMM_WORLD,&mpistatus);
if (mpistatus.MPI_TAG==DATA_PACKET) {
MPI_Recv(data,2,MPI_DOUBLE,0,DATA_PACKET,MPI_COMM_WORLD,&
mpistatus);
// spawn computations in parallel using OpenMP
step=(data[1]-data[0])/(40*DATA_PACKET_PARTITION_COUNT); // equivalent to other implementations
slavecpu(&temporaryresult,data[0],data[1],step);
// send the result back
MPI_Send(&temporaryresult,1,MPI_DOUBLE,0,RESULT,
MPI_COMM_WORLD);
}
} while (mpistatus.MPI_TAG!=FINISH_COMPUTATIONS);
and run in the following way, with 1 master and 16 slave processes:
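For instance, assuming the compiled executable is named a.out (17 processes in total, i.e. 1 master plus 16 slaves):

mpirun -np 17 ./a.out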
[Figure: execution time [s] vs. number of slave processes (1–16)]
5.1.2 OpenMP
OpenMP allows implementation of master-slave type parallel programming in
several ways. Specifically, the traditional master-slave requires the following:
1. All threads are in fact slaves (processing data) that fetch input data,
merge and store results.
2. After fetching an inputoutput lock, each thread fetches input data, if
available.
3. The thread processes the input.
4. After fetching the lock again, each thread stores its result in the output
buffer and checks whether a sufficient number of results is available. If
this is the case then the thread merges results. Then it checks whether
new input should be generated. If this is the case then the thread gen-
erates new data by invoking function generate_new_input(...).
This way master and slave functions are merged. The main part of the code
implementing the improved approach is shown in Listing 5.3 while full code
is available in file master-slave-OpenMP-1.c.
do {
processdata=0;
finish=0;
omp_set_lock(&inputoutputlock);
if (processedcount<CHUNKCOUNT) {
myinputindex=currentinputindex;
if (currentinputindex<lastgeneratedcount) {
currentinputindex++;
processdata=1;
}
} else finish=1;
omp_unset_lock(&inputoutputlock);
if (processdata) {
// now process the input data chunk
result=process(&(input[myinputindex]));
if (processedcount<CHUNKCOUNT) {
lastgeneratedcount=generate_new_input(input);
currentinputindex=0;
}
}
omp_unset_lock(&inputoutputlock);
}
} while (!finish);
}
print_final_output(&finaloutput);
omp_destroy_lock(&inputoutputlock);
5.1.3 MPI+OpenMP
The code presented in Section 5.1.1 can be extended for parallelization with
OpenMP within each process. Since multicore processors are widely available,
a combination of MPI for communication between nodes and OpenMP for
parallelization of slave computations is a good approach. Such an approach is
presented in Figure 5.6.
Proposed code, available in file master-slave-MPI+OpenMP.c differs from
the MPI only approach from Section 5.1.1 in the following:
MPI_Init_thread(&argc, &argv,MPI_THREAD_FUNNELED,&
threadsupport);
if (threadsupport<MPI_THREAD_FUNNELED) {
printf("\nThe implementation does not support
MPI_THREAD_FUNNELED, it supports level %d\n",threadsupport)
;
MPI_Finalize();
return -1;
}
long countermax;
double arg;
int threadid;
arg=start+step*threadid;
for(counter=0;counter<countermax;counter++) {
sresult+=step*(fcpu(arg)+fcpu(arg+step))/2;
arg+=threadcount*step;
}
}
*result=sresult;
}
5.1.4 MPI+Pthreads
The traditional master-slave scheme can also be implemented with MPI and
Pthreads for hybrid environments i.e. clusters with multiple cores within each
node. MPI can be preferably used for communication between nodes while
Pthreads can be used to manage input data, computations and results across
threads that run on cores within a node. In contrast to the MPI+OpenMP
version, the MPI+Pthreads solution shown in Figure 5.7 is much more flex-
ible but also much more complex. Many threads with various roles are used
within processes. Key code of the master process is shown in Listing 5.6. The
implementation uses the following threads within each slave process:
– master (main) thread – used to receive data packets which are stored and
then fetched by computing threads – key code shown in Listing 5.7,
– computing threads – used to process incoming data packets – key code shown
in Listing 5.8,
– sending thread – one thread used to send processed data packets – key code
shown in Listing 5.9; the thread sends a temporary result back to the
master process which merges results from various processes.
// check who sent the result and send another data packet to that process
MPI_Send(&(data[2*packetcounter]),2,MPI_DOUBLE,mpistatus.
MPI_SOURCE,DATA_PACKET,MPI_COMM_WORLD);
packetcounter++;
inputdatapacketrequestedcounter++;
pthread_cond_signal(&inputqueuecondvariableconsumer);
pthread_mutex_unlock(&inputqueuemutex);
pthread_mutex_lock(&outputqueuemutex);
outputdatapacketrequestedcounter++;
pthread_mutex_unlock(&outputqueuemutex);
}
} while (mpistatus.MPI_TAG!=FINISH_COMPUTATIONS);
fetcheddatapacket=0;
if ((inputqueuesize>0)) {
inputqueuesize--;
start=inputqueue[2*inputqueuesize];
end=inputqueue[2*inputqueuesize+1];
inputdatapacketprocessedcounter++;
fetcheddatapacket=1;
}
pthread_cond_signal(&inputqueuecondvariableproducer);
pthread_mutex_unlock(&inputqueuemutex);
if (fetcheddatapacket) {
// set the step
step=(end-start)/(40*DATA_PACKET_PARTITION_COUNT);
sresult=0;
// process the data packet
countermax=((end-start)/step);
arg=start;
for(counter=0;counter<countermax;counter++) {
sresult+=step*(fcpu(arg)+fcpu(arg+step))/2;
arg+=step;
}
}
if (fetcheddatapacket) {
// insert result into the final queue
pthread_mutex_lock(&outputqueuemutex);
while (outputqueuesize==QUEUE_SIZE)
pthread_cond_wait(&outputqueuecondvariableproducer,&
outputqueuemutex);
outputqueue[outputqueuesize]=sresult;
outputqueuesize++;
pthread_cond_signal(&outputqueuecondvariableconsumer);
pthread_mutex_unlock(&outputqueuemutex);
}
}
while (!threadfinish) {
pthread_cond_wait(&outputqueuecondvariableconsumer,&
outputqueuemutex);
}
if ((outputthreadfinish) && (outputdatapacketrequestedcounter
==outputdatapacketprocessedcounter)) threadfinish=1;
// terminate only after all requested packets have been processed
fetcheddatapacket=0;
if (outputqueuesize>0) {
// copy output
temporaryresult=outputqueue[--outputqueuesize];
outputdatapacketprocessedcounter++;
fetcheddatapacket=1;
}
pthread_cond_signal(&outputqueuecondvariableproducer);
pthread_mutex_unlock(&outputqueuemutex);
if (fetcheddatapacket) {
MPI_Send(&temporaryresult,1,MPI_DOUBLE,0,RESULT,
MPI_COMM_WORLD);
}
}
5.1.5 CUDA
Figure 5.8 presents a basic master-slave scheme that can use many GPUs
installed within a node. This version uses one host thread and the CUDA API
for management of computations among GPUs. Specifically, input data is
partitioned into several data chunks which are then distributed among GPUs.
This particular example uses numerical integration of function f(...) over
range [start,end]. In this application basic data structures and operations
are as follows:
threadcount=gridDim.x*blockDim.x;
countermax=((end-start)/step)/threadcount;
arg=start+step*(blockIdx.x*blockDim.x+threadIdx.x);
sresults[threadIdx.x]=0;
for(counter=0;counter<countermax;counter++) {
sresults[threadIdx.x]+=step*(f(arg)+f(arg+step))/2;
arg+=threadcount*step;
}
__syncthreads();
for(counter=512;counter>0;counter/=2) {
if (threadIdx.x<counter)
sresults[threadIdx.x]+=sresults[threadIdx.x+counter];
__syncthreads();
}
It should be noted that this parallel code would typically invoke kernels on
each GPU several times as shown in the host thread code in Listing 5.11. It has
the potential of load balancing even in heterogeneous environments in which
there are several GPUs with various performance. Effective load balancing is
possible if:
1. the number of data packets is large enough to hide idle times of faster
GPUs by giving more work while slower GPUs still process previous
data chunks,
2. execution time of a data chunk is considerable compared to kernel
launch, data copy and synchronization overheads.
gpuactive[i]=0;
hresults[i]=(double *)malloc(size);
if (!(hresults[i])) errorexit("Error allocating memory on the
host");
cudaSetDevice(i);
if (cudaSuccess!=cudaMalloc((void **)&(dresults[i]),size))
errorexit("Error allocating memory on the GPU");
if (cudaSuccess!=cudaStreamCreate(&(stream[i])))
errorexit("Error creating stream");
startvalue[i].elemcount=blocksingrid;
startvalue[i].elements=hresults[i];
startvalue[i].result=&(functionresults[i]);
gpuresultmergecpufunction(&(startvalue[i]));
finalresult+=*(startvalue[i].result);
__host__
void generatedata(double **data,long *packetcount) {
double start=1,end=1000000;
double counter=start;
double step=(end-start)/(double)DATA_PACKET_COUNT;
long i;
*data=(double *)malloc(sizeof(double)*(1+
DATA_PACKET_COUNT));
if (!(*data))
errorexit("Not enough memory on host when generating
data");
for(i=0;i<=DATA_PACKET_COUNT;i++,counter+=step)
(*data)[i]=counter;
*packetcount=DATA_PACKET_COUNT;
}
1. a base configuration with a certain value of DATA_PACKET_COUNT
and a certain size of step for partitioning within each data packet, shown
in Listing 5.10,
2. configuration with an increased value of DATA_PACKET_COUNT e.g. in-
creased n times while the value of step is decreased n times.
Both versions result in the same width of a subrange used for the trapezoid
method within the GPU kernel but differ in the number of data packets. It is
expected that a larger number of data packets would generate an overhead.
Results obtained on a workstation with 2 x Intel Xeon E5-2620v4, 2 x
NVIDIA GTX 1070 and 128 GB RAM are shown in Table 5.4. For each
configuration the best time out of three runs is presented. Two GPUs were
used.
The code can be compiled and run as follows:
nvcc master-slave-n-GPUs-one-thread-sequential-cpu-merging.cu
./a.out 2
TABLE 5.4 Execution time [s] for various configurations of the CUDA code

DATA_PACKET_COUNT   coefficient used in the denominator for computing step   Execution time [s]
2                   2000                                                     3.992
10                  400                                                      4.020
100                 40                                                       4.118
1000                4                                                        5.210
5.1.6 OpenMP+CUDA
Another implementation uses many host threads to manage computations on
GPU(s) and host CPU(s). Specifically, the idea of this solution, presented in
Figure 5.9 is as follows:
cudaSetDevice(i);
if (cudaSuccess!=cudaMalloc((void **)&(dresults[i]),size))
errorexit("Error allocating memory on the GPU");
if (cudaSuccess!=cudaStreamCreate(&(stream[i])))
errorexit("Error creating stream");
}
{
finalresult=0;
int i=omp_get_thread_num();
finalresult+=*(startvalue[i].result);
}
} while (!finish);
} else { // parallel execution on CPU(s)
do {
finish=0;
#pragma omp critical
{
// check if there are more and get a new data chunk
if (packetcounter<packetcount)
mypacketid=packetcounter++;
else
finish=1;
}
if (finish==0) {
start[i]=data[mypacketid];
end[i]=data[mypacketid+1];
step=(end[i]-start[i])/(40*blocksingrid*threadsinblock
);
// start computations on the CPU(s) -- the result will be one double in this case
slavecpu(&(functionresults[i]),start[i],end[i],step,
cputhreadcount);
// merge results
finalresult+=functionresults[i];
}
} while (!finish);
}
}
printf("\nThe final result is %f\n",finalresult);
// release resources
(...)
arg=start+step*threadid;
for(counter=0;counter<countermax;counter++) {
sresult+=step*(fcpu(arg)+fcpu(arg+step))/2;
arg+=threadcount*step;
}
}
*result=sresult;
}
The -Xcompiler option allows passing options directly to the host compiler
encapsulated by nvcc, while -Xlinker allows passing options to the host linker.
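For instance (source file and executable names are assumptions):

nvcc -Xcompiler -fopenmp master-slave-OpenMP+CUDA.cu -lgomp -o app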
In terms of performance and speed-ups, the following can be noted with
respect to the implementation:
for(z=myminz;z<=mymaxz;z++)
for(y=myminy;y<=mymaxy;y++)
for(x=myminx;x<=mymaxx;x++)
updatecell(datanext,data,x,y,z);
Figure 5.12 presents execution times of the application for various sizes
of the domain and 100 iterations and for various numbers of processes on a
workstation with 2 x Intel Xeon E5-2620v4, 128 GB RAM. Presented times
are best times out of three runs for each configuration.
It should be noted that, depending on a particular problem, the ratio of
cell update time compared to necessary communication and synchronization
times might vary. In a parallel version this ratio is crucial to obtaining good
speed-ups.
5.2.2 MPI+OpenMP
The MPI implementation presented in Section 5.2.1 can be extended with
support for multithreading using OpenMP. For instance, the following changes
can be introduced into the code:
int threadsupport;
MPI_Init_thread(&argc, &argv,MPI_THREAD_FUNNELED,
&threadsupport);
if (threadsupport<MPI_THREAD_FUNNELED) {
printf("\nThe implementation does not support
MPI_THREAD_FUNNELED,
it supports level %d\n",threadsupport);
MPI_Finalize();
return -1;
}
FIGURE 5.12 Execution time of the testbed MPI SPMD application run
on a workstation with 2 x Intel Xeon E5-2620v4, 128 GB RAM, 16
physical cores, 32 logical processors (execution time [s] vs. number of
processes, 1–32, for 400x400x400, 600x600x600 and 800x800x800 domains)
It should be noted that top level iterations are spread among computing
threads statically in blocks of 5 based on the assumption that each cell takes
the same amount of time to be updated. Good values for the block size may
depend on the value of mymaxz-myminz+1. On the other hand, if some cells
require more or less time to be computed, dynamic scheduling should be used
instead, as described in Section 4.2.2.
The #pragma omp parallel directive is used before the main loop, which means
that the overhead for parallel region creation is incurred only once.
Consequently, synchronization is needed within the simulation loop. Important
steps within the main loop are executed as follows:
data=data1;
datanext=data0;
} else {
data=data0;
datanext=data1;
}
}
[Figure: execution time [s] vs. number of processes (1–16) for 400x400x400, 600x600x600 and 800x800x800 domains, MPI+OpenMP version]
5.2.3 OpenMP
Essentially, in a shared memory environment, domain data is accessible to all
threads thus only proper synchronization is required between iterations. Con-
sequently, one viable approach could be to use the code of the MPI+OpenMP
version presented in Section 5.2.2, simplified to use just 1 process. The MPI
part can then be removed altogether, leaving just the OpenMP directives.
5.2.4 MPI+CUDA
The MPI implementation presented in Section 5.2.1 can be extended for use
with GPU cards installed within cluster nodes, as shown in Figure 5.14. This
is especially interesting when the following are assumed:
1. Setting a device.
2. Allocation of memory.
3. Data initialization.
4. Type definition.
5. Main iteration loop with the following steps: MPI communication, invo-
cation of function updatecells(...) for subdomain update, defined in
file SPMD-MPI+CUDA-2.cu.
6. Substitution of pointers to source and destination subdomains.
updatecellgpu<<<blocksingrid,threadsinblock>>>(datanext,
data,myminx,myminy,myminz,myblockxsize,myblockysize,
cudaDeviceSynchronize();
}
1. It might be difficult to balance load within each node due to the follow-
ing:
• the number of cells to be assigned to each GPU may be small and
consequently update times may be small compared to the overhead
for kernel launch and synchronization,
• CPU and GPU performance will most likely require dynamic load
balancing due to differences in processing speeds.
2. Unified Memory originally did not allow referencing from the host while
a GPU is active [122], see a note on Pascal in Section 4.4.7.
1. internode communication,
2. communication between a host and a GPU such as using streams.
It should also be noted that NVIDIA Multi Process Service (MPS) [119] allows
overlapping of kernel execution and memory copy operations from various host
processes. This may bring benefits to application execution time and increase
utilization of the GPU. Its benefits are shown in Section 6.1.2.
5.3 DIVIDE-AND-CONQUER
5.3.1 OpenMP
OpenMP features nested parallelism (discussed in Section 4.2.5) that allows
parallelization at various levels through #pragma omp parallel or #pragma
omp sections directives. However, this may lead to creation of too many
threads resulting in performance drop. On the other hand, the total number
of threads can be controlled by environment variable OMP_THREAD_LIMIT.
It should be noted that a multicore or a manycore type system is prefer-
able in order to make processing of independent subtrees of the divide-and-
conquer tree run in parallel. Paper [47] discusses parallelization of divide-
and-conquer applications using an OpenMP based framework on an Intel
Xeon Phi coprocessor. The paper presents the impact of the maximum depth
for allowing thread spawning on the execution time. The implementation
presented in Listing 5.14 uses OpenMP for parallelization of a divide-and-
conquer application. While the solution presented in [47] first generates a
number of data packets each of which is then processed and its computa-
tions possibly divided and results merged, the solution presented in this book
starts with a single input data set which gets partitioned. Specifically, func-
tion integratedivideandconquer(...) in Listing 5.14 is invoked recursively.
Each invocation is to process an input range submitted as input. If a certain
depth has been reached then an integral for this range is computed using a
rectangle based method with a certain accuracy. Otherwise, a check is
performed whether the range should be partitioned further.
230 Programming parallel paradigms using selected APIs
// excerpt – the beginning of function integratedivideandconquer(...), including
// the computation of stepresult with accuracy ACCURACY, precedes this fragment
double step2result=integratesimple(start,end,ACCURACY/2);
double result;
result=step2result;
if (depth==MAX_RECURSION_DEPTH)
  return result;
if (fabs(step2result-stepresult)>0.0000000001) {
  int threadid,threadcount;
  // run two parts in distinct threads
  result=0;
  #pragma omp parallel private(threadid,threadcount) shared(start,end,depth) reduction(+:result) num_threads(2)
  {
    threadcount=omp_get_num_threads();
    threadid=omp_get_thread_num();
    // each thread integrates one half of the range;
    // the reduction clause sums the two partial results
    if (threadid==0)
      result=integratedivideandconquer(start,(start+end)/2,depth+1);
    else
      result=integratedivideandconquer((start+end)/2,end,depth+1);
  }
}
// the closing brace and else below pair with an enclosing condition
// whose opening is not shown in this excerpt
} else { // continue partitioning within this thread
  result=integratedivideandconquer(start,(start+end)/2,depth+1)+
         integratedivideandconquer((start+end)/2,end,depth+1);
}
}
return result;
}
The code can be compiled with OpenMP support and run using the taskset
command to indicate processors on which the multithreaded application can
run, as shown in Section 6.4.
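For example, assuming the source file and the executable are named DAC-OpenMP.c
and dac-openmp (illustrative names, not the ones used for the actual code),
compilation and pinning to the first eight logical processors might look as follows:

gcc -fopenmp DAC-OpenMP.c -o dac-openmp
taskset -c 0-7 ./dac-openmp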
Table 5.8 presents execution times for the aforementioned code for various
numbers of available cores on a workstation with 2 x Intel Xeon E5-2620v4,
128 GB RAM. Best execution times out of three runs are presented for each
configuration. It should be noted that limited speed-up stems from limited
possibility of parallelization (a tree where only lower levels can be parallelized)
as well as limited sizes of subtrees that can be parallelized.
5.3.2 CUDA
CUDA, as discussed in Section 4.4.6, with proper cards and software versions,
supports dynamic parallelism that in turn allows launching a kernel from
within a kernel. It allows natural mapping of divide-and-conquer applications
onto the API taking into account the technological constraints listed in Section
4.4.6.
In this section, adaptive numerical integration of a function over a given
range is presented using CUDA dynamic parallelism. Assumptions for this
implementation are as follows:
The idea of the approach is presented in Figure 5.17 while the recursive kernel
implementation is presented in Listing 5.15.
// excerpt of kernel integratedp(...) – the kernel signature, the shared memory
// array sresults and the code creating stream1 and launching a child kernel for
// the first half of the range are not shown in this excerpt
sresults[threadIdx.x]=integratesimple(start,end,step/2);
__syncthreads();
for(counter=THREADS_PER_BLOCK/2;counter>0;counter/=2) {
if (threadIdx.x<counter)
sresults[threadIdx.x]+=sresults[threadIdx.x+counter];
__syncthreads();
}
if (threadIdx.x==0)
*(result+blockIdx.x)=sresults[0];
if (depth==MAX_RECURSION_DEPTH)
return;
cudaStreamCreateWithFlags(&stream2,
cudaStreamNonBlocking);
integratedp<<<1,THREADS_PER_BLOCK,THREADS_PER_BLOCK*
sizeof(double),stream2>>>(result+blockIdx.x+(long)BLOCKS
*(1<<(MAX_RECURSION_DEPTH-depth-1)),(start+end)/2,end,step
/2,depth+1);
} else *(result+blockIdx.x+(long)BLOCKS*(1<<(
MAX_RECURSION_DEPTH-depth-1)))=0;
}
__syncthreads();
if (threadIdx.x==0) {
cudaDeviceSynchronize();
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
}
__syncthreads();
if (threadIdx.x==0)
*(result+blockIdx.x)+=*(result+blockIdx.x+(long)BLOCKS*(1<<(
MAX_RECURSION_DEPTH-depth-1)));
}
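Note that code using dynamic parallelism needs to be compiled with relocatable
device code enabled and linked against the CUDA device runtime library, for
instance as follows (the file and executable names are illustrative):

nvcc -arch=sm_35 -rdc=true DAC-CUDA-dp.cu -o dac-cuda-dp -lcudadevrt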
5.3.3 MPI
For MPI, two implementations are proposed and discussed:
1. A balanced implementation in which an input problem (and conse-
quently data) is divided into problems of computationally equal sizes.
This is repeated recursively until a certain depth of the divide-and-
conquer tree is reached and allows structuring computations and com-
munication well. It also requires a number of processes equal to the
number of leaves generated. There are several algorithms that might
benefit from such an approach, for example:
Figure 5.18 presents the working scheme using MPI with process ranks,
functions and MPI calls. Processes are arranged into a binary tree struc-
ture in such a way that in every iteration pairs of processes communicate
and a parent process passes one of its subproblems to a partner process.
This is continued until all processes have received their parts. Then each
process handles the data associated with its subproblem by invoking function
computationslocal(...) in parallel. What follows is parallel merging of
results from the processes, which is done inversely to the partitioning, in a
similar tree-like structure in which pairs of processes communicate in every
iteration.
Listing 5.16 presents MPI code for distribution of input data among MPI
processes while Listing 5.17 presents code for merging partial results
from processes, assuming more than 1 process.
partition(inputdatapacket,&outputdatapackets,&
outputdatapacketscount);
int alreadyreceived=0;
for(int currentskip=proccount/2;currentskip>=1;currentskip
/=2) {
if (!(myrank%currentskip)) { // then I am involved in the
given step
// receive if possible
if ((myrank-currentskip>=0) && (!alreadyreceived)) {
MPI_Recv(inputdatapacket->elements,2,MPI_DOUBLE,
myrank-currentskip,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
alreadyreceived=1;
}
MPI_Send(outputdatapackets[1].elements,2,MPI_DOUBLE,
myrank+currentskip/2,0,MPI_COMM_WORLD);
}
}
}
for(int currentskip=2;currentskip<=proccount;currentskip
*=2) {
if (!(myrank%currentskip)) { // then I (process) am
involved in the given step
MPI_Recv(outputdatapackets[1].elements,1,MPI_DOUBLE,
myrank+currentskip/2,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE
);
Figure 5.19 presents execution times for the code obtained on a work-
station with 2 x Intel Xeon E5-2620v4 and 128 GB RAM. Best out of
three runs for each configuration are reported.
[Figure 5.19 not reproduced: execution time [s] versus the number of processes (1, 2, 4, 8, 16, 32)]
(a) idle processes requesting work from other processes, the latter cut-
ting off parts of the divide-and-conquer tree and sending for pro-
cessing, as suggested in Section 3.5 and [38],
(b) a master-slave scheme in which the master would partition the tree
(presumably into the number of subtrees considerably larger than
the number of processes) while other processes (slaves) would fetch
and process those subtrees.
FIGURE 5.22 Source code structure for the divide-and-conquer MPI ap-
plication with dynamic process creation
MPI_Info_set(*info,"host","<hostname>");
if (currentlevel==maxlevel) {
computationslocal(inputdatapacket,result);
return;
}
// otherwise partition and process the tree further
partition(inputdatapacket,&outputdatapackets,&
outputdatapacketscount);
spawnedforpacket=(char *)malloc(outputdatapacketscount*
sizeof(char));
if (!spawnedforpacket)
errorexit("Not enough memory");
intercommunicators=(MPI_Comm *)malloc(
outputdatapacketscount*sizeof(MPI_Comm));
if (!intercommunicators)
errorexit("Not enough memory");
for(i=0;i<outputdatapacketscount;i++) {
if (spawnedforpacket[i]=decidespawn(&(
outputdatapackets[i]),i)) {
MPI_Info info;
MPI_Info_create(&info);
// invoke a function that would return the host to
run a process on (this function may wait
// until load drops below a threshold)
selecthosttorun(&info);
MPI_Comm_spawn("DAC-MPI-irregular-dynamic-processes
-other",argv,1,info,0, MPI_COMM_SELF,&(
intercommunicators[i]),&errorcode);
if (errorcode!=MPI_SUCCESS)
errorexit("Error spawning a child process.");
// and send the other data packets to following
process(es)
MPI_Send(outputdatapackets[i].elements,2,MPI_DOUBLE
,0,0,intercommunicators[i]);
}
}
for(i=0;i<outputdatapacketscount;i++)
if (!(spawnedforpacket[i])) { // process these
within this process
inputdatapacket=&(outputdatapackets[i]);
// go lower with my data packet
divideandconquer(inputdatapacket,&outputdatapacket,
currentlevel+1,maxlevel,maxspawnlevel);
// now my result is pointed by outputdatapacket
copydatapacket(outputdatapacket,&(outputdatapackets
[i]));
// free the previous data packet
freedata(outputdatapacket);
}
// (fragment – parts of the full listing, including code receiving results
// from the spawned processes, are not shown in this excerpt)
MPI_Comm_disconnect(&(intercommunicators[i]));
}
copydatapacket(outputdatapackettemp,&(
outputdatapacketslowerlevel[i]));
}
freedata(outputdatapackets); // not needed anymore
// now merge results
merge(outputdatapacketslowerlevel,
outputdatapacketscount,&outputdatapacket);
freedatapackets(outputdatapacketslowerlevel,
outputdatapacketscount);
}
*result=outputdatapacket;
}
if (proccount!=1)
errorexit("\nThe must be only 1 root process.\n");
inputdatapacket=generateinputdata(argc,argv);
divideandconquer(inputdatapacket,&outputdatapacket,1,10,4);
// display the result
printf("\nResult is %f\n",outputdatapacket->elements[0]);
freedata(inputdatapacket);
freedata(outputdatapacket);
MPI_Finalize();
MPI_Comm_get_parent(&parentcommunicator);
if (parentcommunicator==MPI_COMM_NULL)
errorexit("No parent communicator.");
divideandconquer(inputdatapacket,&outputdatapacket,level
,10,4);
MPI_Comm_disconnect(&parentcommunicator);
freedata(inputdatapacket);
freedata(outputdatapacket);
MPI_Finalize();
mpicc -c DAC-MPI-irregular-dynamic-processes-common.c \
-o DAC-MPI-irregular-dynamic-processes-common.o
mpicc DAC-MPI-irregular-dynamic-processes-root.c \
DAC-MPI-irregular-dynamic-processes-common.o \
-o DAC-MPI-irregular-dynamic-processes-root
mpicc DAC-MPI-irregular-dynamic-processes-other.c \
DAC-MPI-irregular-dynamic-processes-common.o \
-o DAC-MPI-irregular-dynamic-processes-other
Result is 4.615121
mpirun -hostfile ./hostfile -np 1 \
./DAC-MPI-irregular-dynamic-processes-root 0 1000
Result is 6.908755
CHAPTER 6
Optimization
techniques and best
practices for parallel
codes
CONTENTS
6.1 Data prefetching, communication and computations
overlapping and increasing computation efficiency . . . . . . 252
6.1.1 MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
6.1.2 CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
6.2 Data granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.3 Minimization of overheads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
6.3.1 Initialization and synchronization overheads . . . . 258
6.3.2 Load balancing vs cost of synchronization . . . . . . . 260
6.4 Process/thread affinity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
6.5 Data types and accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
6.6 Data organization and arrangement . . . . . . . . . . . . . . . . . . . . . 261
6.7 Checkpointing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
6.8 Simulation of parallel application execution . . . . . . . . . . . . . 264
6.9 Best practices and typical optimizations . . . . . . . . . . . . . . . . 265
6.9.1 GPUs/CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.9.2 Intel Xeon Phi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
6.9.3 Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
6.9.4 Hybrid systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Such idle times may show up at various levels in a parallel system, including:
6.1.1 MPI
There are at least two programming approaches in MPI that allow overlapping
communication and computations as well as data prefetching.
Listing 6.1 presents an approach with non-blocking API calls described in
detail in Section 4.1.11. The solution uses MPI_I* calls for starting fetching
data. Since these are non-blocking calls, a calling process can perform compu-
tations immediately after the call. The latter only issues a request for starting
communication. After computations, i.e. processing of a data packet, have
completed, non-blocking communication needs to be finalized using MPI_Wait
and processing of the just received data packet can follow. If there are to be
more data packets processed then a new data packet can be fetched before
computations start.
MPI_Recv(inputbuffer,...);
packet=unpack(inputbuffer);
while (shallprocess(packet)) {
// first start receiving the next data packet (non-blocking)
MPI_Irecv(inputbuffer,...,&mpirequest);
// process the current data packet while the next one is being received
// (processing and unpacking routines are shown schematically)
process(packet);
// finalize the non-blocking receive before accessing the buffer again
MPI_Wait(&mpirequest,MPI_STATUS_IGNORE);
packet=unpack(inputbuffer);
}
...
Alternatively, the code without unpacking of data from a buffer and using
two buffers instead is shown in Listing 6.2.
// an initial (blocking) receive into inputbuffer0, setting buffer=inputbuffer0,
// precedes this loop
while (shallprocess(buffer)) {
// swap the roles of the two buffers
if (buffer==inputbuffer1) {
buffer=inputbuffer0;
prevbuffer=inputbuffer1;
} else {
buffer=inputbuffer1;
prevbuffer=inputbuffer0;
}
// start receiving the next data packet into the currently free buffer
MPI_Irecv(buffer,...,&mpirequest);
// process the previously received data packet
process(prevbuffer);
MPI_Wait(&mpirequest,MPI_STATUS_IGNORE);
}
...
In fact, a slave process would normally send back its results to the parent
process. Overlapping sends and processing of subsequent data packets can
also be arranged. Such a solution is shown schematically in Listing 6.3.
while (shallprocess(buffer)) {
if (buffer==inputbuffer1) {
buffer=inputbuffer0;
prevbuffer=inputbuffer1;
prevresultbuffer=outputbuffer1;
} else {
buffer=inputbuffer1;
prevbuffer=inputbuffer0;
prevresultbuffer=outputbuffer0;
}
// start receiving the next data packet into the currently free buffer
MPI_Irecv(buffer,...,&mpirecvrequest);
// process the previous packet, storing the result in the matching output buffer
process(prevbuffer,prevresultbuffer);
// send the result back without blocking so that the send can overlap
// processing in the next iteration (completion of the send should be
// checked before this output buffer is reused)
MPI_Isend(prevresultbuffer,...,&mpisendrequest);
MPI_Wait(&mpirecvrequest,MPI_STATUS_IGNORE);
}
...
6.1.2 CUDA
Implementation of overlapping communication between the host and a GPU
and processing on the GPU can be done using streams described in Section
4.4.5. Specifically, using two streams potentially allows overlapping communi-
cation between page-locked host memory and a device (in one stream), com-
putations on the device (launched in another stream) as well as processing on
the host:
// create two streams so that copies and kernels issued to different
// streams may overlap (the host buffers are assumed to be page-locked)
cudaStreamCreate(&streamS1);
cudaStreamCreate(&streamS2);
// stream S1: copy input data, launch a kernel, copy results back
cudaMemcpyAsync(devicebuffer1,sourcebuffer1,copysize1,
cudaMemcpyHostToDevice,streamS1);
kernelA<<<gridsizeA,blocksizeA,0,streamS1>>>(...);
cudaMemcpyAsync(hostresultbuffer1,
deviceresultbuffer1,copyresultsize1,
cudaMemcpyDeviceToHost,streamS1);
// stream S2: an analogous sequence for the second data set
cudaMemcpyAsync(devicebuffer2,sourcebuffer2,copysize2,
cudaMemcpyHostToDevice,streamS2);
kernelB<<<gridsizeB,blocksizeB,0,streamS2>>>(...);
cudaMemcpyAsync(hostresultbuffer2,
deviceresultbuffer2,copyresultsize2,
cudaMemcpyDeviceToHost,streamS2);
// meanwhile the host can perform its own processing
processdataonCPU();
// wait for all streams to complete
cudaDeviceSynchronize();
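For the copies to be truly asynchronous, the host buffers involved should be
allocated as page-locked memory, e.g. as follows (buffer names as in the code
above):

cudaHostAlloc((void **)&sourcebuffer1,copysize1,cudaHostAllocDefault);
cudaHostAlloc((void **)&hostresultbuffer1,copyresultsize1,cudaHostAllocDefault);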
MPS requires a GPU with compute capability 3.5+ and a 64-bit application running under Linux [119]. An MPS daemon
can be started as follows:
nvidia-cuda-mps-control -d
The following experiment tests a scenario without MPS and with MPS
for the geometric SPMD application implemented with MPI and CUDA and
shown in Section 5.2.4. In every scenario the code was run with various num-
bers of MPI processes. For every configuration execution time for the best out
of three runs is presented in Table 6.1. Application parameters used were 384
384 960 10 2. Tests were performed on a workstation with 2 x Intel Xeon
E5-2620v4, 2 x NVIDIA GTX 1070 GPUs and 128 GB RAM. Two GPUs were
used.
1. Small data packets would allow good balancing of data among comput-
ing nodes/processors. This might be especially useful if there are pro-
cessors of various computing speeds. On the other hand, too many small
data packets would result in considerable overhead for communication.
2. Large data packets might result in poor load balancing among comput-
ing nodes/processors. Also, in case of really large data packets, compu-
tations might start with a delay.
Results were obtained on a workstation with 2 x Intel Xeon E5-2620v4 and 128 GB RAM. For each configuration, the
best out of 3 runs are shown. Testbed results are shown for 4 and 16 processes
of an application.
[Figure not reproduced: execution time [s] versus the number of data packets (10 to 1x10^6), shown for 4 and 16 processes]
It should be noted that proper distribution of such data among e.g. nodes in
a cluster requires efficient load balancing, as discussed in Section 3.1.4.
Such approaches are also discussed in [23] in the context of OpenMP ap-
plications run on an Intel Xeon Phi. In the case of a hybrid MPI+OpenMP
application discussed here, a master thread of a process would be involved in
communication with other processes. Furthermore, processing of subdomain
data assigned to a process must be partitioned into several threads within the
process, in this case using OpenMP.
Two functionally identical implementations are as follows (a sketch of the second version is shown after the list):
1. The main loop is executed by the master thread with some parts paral-
lelized using OpenMP constructs. Specifically:
(a) Exchange of data by the master thread (might be needed first if
each process initializes its own domain independently) without any
special constructs.
(b) Parallel update of subdomain cells – parallelization performed using
#pragma omp parallel for.
(c) Substitution of pointers for source domain and target domain per-
formed by the master thread without any special constructs.
2. Entering a parallel region outside of the loop. Then each thread would
execute loop iterations independently which results in the need for syn-
chronization. Specifically, steps within a loop iteration include:
(a) Exchange of data by the master thread (might be needed first
if each process initializes its own domain independently) in code
within #pragma omp master.
(b) Synchronization using #pragma omp barrier.
(c) Parallel update of subdomain cells – parallelization performed using
#pragma omp for.
(d) Substitution of pointers for source domain and target domain per-
formed by the master thread in code within #pragma omp master.
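A minimal sketch of the second version is shown below. The names data and
datanext follow the earlier code, while the helper functions exchangedata(...)
and updatecell(...) as well as the remaining parameters are illustrative and
not taken from the actual code.

// assumed helper exchanging ghost cells with neighboring processes (illustrative)
void exchangedata(double *data);
// assumed helper updating a single cell based on the previous time step (illustrative)
void updatecell(double *datanext,const double *data,long index);

void mainloop(double *data,double *datanext,long cellcount,int maxiterations) {
  #pragma omp parallel shared(data,datanext)
  {
    for(int iter=0;iter<maxiterations;iter++) {
      #pragma omp master
      exchangedata(data); // only the master thread communicates with other processes
      #pragma omp barrier // cell updates must see the exchanged data
      #pragma omp for
      for(long i=0;i<cellcount;i++)
        updatecell(datanext,data,i);
      // the implicit barrier at the end of the for construct ensures all updates finished
      #pragma omp master
      { double *tmp=data; data=datanext; datanext=tmp; } // pointer substitution
      #pragma omp barrier // make the swapped pointers visible to all threads
    }
  }
}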
TABLE 6.2 Execution times [s] for two versions of MPI+OpenMP SPMD code

version                                          minimum execution time [s]   average execution time out of 10 runs [s]
#pragma omp parallel for inside main loop        161.362                      163.750
#pragma omp parallel outside of the main loop    160.349                      163.010
1. Input data packets are divided into the number of groups equal to the
number of thread groups.
2. Instead of one critical section, a number of critical sections equal to
the number of thread groups is used, one per group. There are then fewer
threads per critical section, which potentially reduces the time a thread
needs to wait to fetch a new data packet; a sketch of this idea is shown
below.
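A minimal sketch of this idea, using one OpenMP lock per thread group rather
than named critical sections, might look as follows; GROUPS as well as the
helper functions are illustrative and not taken from the actual code.

#include <omp.h>

#define GROUPS 4

// assumed helpers (illustrative): fetch the next packet assigned to a group
// (NULL when there are no more packets) and process a single packet
void *fetchpacketfromgroup(int group);
void processpacket(void *packet);

omp_lock_t grouplocks[GROUPS]; // one lock per group of threads

void processallpackets() {
  for(int g=0;g<GROUPS;g++)
    omp_init_lock(&(grouplocks[g]));
  #pragma omp parallel
  {
    // threads are assigned to groups; fewer threads compete for each lock
    int group=omp_get_thread_num()%GROUPS;
    void *packet;
    do {
      omp_set_lock(&(grouplocks[group]));
      packet=fetchpacketfromgroup(group);
      omp_unset_lock(&(grouplocks[group]));
      if (packet)
        processpacket(packet);
    } while (packet);
  }
  for(int g=0;g<GROUPS;g++)
    omp_destroy_lock(&(grouplocks[g]));
}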
float sinf(float a)
double sin(double a)
long double sinl(long double a)
Resulting code may offer better performance and, in case of smaller data types,
open more potential for vectorization e.g. on Intel Xeon Phi.
Similarly, precision for floating point operations can be controlled with
various compiler switches as discussed in [34]. As a result, various trade-offs
between execution times and accuracy can be obtained.
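For instance, with gcc strict IEEE compliance could be traded for speed as
follows (an illustrative switch, not necessarily one of those discussed in [34]):

gcc -O3 -ffast-math compute.c -o compute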
In certain applications, smaller data types can be used effectively, such as
16-bit fixed point instead of 32-bit floating point for training deep networks [79].
In the first arrangement, the loops are ordered such that data cells fetched in
the innermost loop are located next to each other in memory. Let us call this
configuration A.
If, for the sake of a test, the order of loops is reversed, i.e. indices are
browsed in the x, y and z order, this is no longer the case. Let us call this
configuration B.
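As a rough illustration (array names, the update function and the assumption
that the x index varies fastest in memory are all illustrative, not taken from
the actual code), the two configurations could correspond to the following loop
orders:

// assumed function computing a new value of a cell (illustrative)
double updatecellvalue(const double *data,int x,int y,int z);

// configuration A – the innermost loop browses x, i.e. consecutive memory cells
void updateA(double *datanext,const double *data,int xsize,int ysize,int zsize) {
  for(int z=0;z<zsize;z++)
    for(int y=0;y<ysize;y++)
      for(int x=0;x<xsize;x++)
        datanext[((long)z*ysize+y)*xsize+x]=updatecellvalue(data,x,y,z);
}

// configuration B – the loop order reversed; successive accesses in the
// innermost loop are separated by a large stride in memory
void updateB(double *datanext,const double *data,int xsize,int ysize,int zsize) {
  for(int x=0;x<xsize;x++)
    for(int y=0;y<ysize;y++)
      for(int z=0;z<zsize;z++)
        datanext[((long)z*ysize+y)*xsize+x]=updatecellvalue(data,x,y,z);
}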
Table 6.3 presents comparison of execution times of the two configurations
for selected numbers of processes of an MPI application, run on a workstation
with 2 x Intel Xeon E5-2620v4 and 128 GB RAM. For each configuration, the
best out of 3 runs are shown. The codes can be compiled and run analogously
to the previous examples; tests were run using 16 processes for a domain of
size 600x600x600.
Often, tiling or blocking data [110] allows reusing of data that has been
loaded into cache or registers. For instance, if a loop stores data in a large
array (that does not fit into cache) such that in each iteration a successive
element is stored and then this data is used for subsequent updates of other
data, such a large loop can be tiled into a few passes each of which reuses data
from the cache.
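A simple sketch of such blocking for a one-dimensional case is shown below;
the helper functions, the array size and the tile size are purely illustrative.

#define N 10000000L
#define TILE 4096L // chosen so that one tile of a[] fits into cache

// assumed helpers (illustrative)
double computevalue(long i);
double transformvalue(double v);

void tiledcompute(double *a,double *b) {
  for(long start=0;start<N;start+=TILE) {
    long end=(start+TILE<N)?(start+TILE):N;
    for(long i=start;i<end;i++) // produce one tile of a[]
      a[i]=computevalue(i);
    for(long i=start;i<end;i++) // reuse the tile while it is still in cache
      b[i]=transformvalue(a[i]);
  }
}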
6.7 CHECKPOINTING
Checkpointing is a mechanism that allows saving the state of an application
and resuming processing at a later time. Saving the state of a parallel application
is not easy because it requires saving a global consistent state of many pro-
cesses and/or many threads, possibly running on various computing devices,
either within a node or on a cluster, with consideration of communication and
synchronization.
Checkpointing might be useful for maintenance of the hardware. In some
cases, it may also allow moving the state of an application or individual ap-
plication processes to other locations. This effectively implements migration.
The latter might be useful for minimization of application execution times
Examples of such systems and use cases applicable in the context of this
book include:
When data items written by different threads are located close to each other
in memory (within the same cache line), there is a risk of false sharing. There
are a few ways of dealing with false sharing, including:
In case of Intel Xeon Phi x200, efficient execution of parallel codes will
also involve decisions on the following (article [74] discusses these modes in
more detail along with potential use cases):
1. memory mode that defines how the on-package MCDRAM is used:
• Flat – both DRAM and MCDRAM are available for memory al-
location (as NUMA nodes), the latter preferably for bandwidth
critical data.
• Cache – in this case MCDRAM acts as an L3 cache.
• Hybrid – a part of MCDRAM is configured as cache and a part as
memory that can be allocated by an application.
• MCDRAM – only MCDRAM is available.
2. cluster mode that defines how requests to memory are served through
memory controllers:
• All2All (default in case of irregular DIMM configuration) – a core, a
tag directory (to which a memory request is routed) and a memory
channel (to which a request is sent in case data is not in cache) can
be in various parts.
• Quadrant (in case of symmetric DIMM configuration) – a tag di-
rectory and a memory channel are located in the same region.
• Sub-NUMA clustering – in this mode regions (quarter – SNC4,
half – SNC2) will be visible as separate NUMA nodes. A core, a
tag directory and a memory channel are located in the same region.
6.9.3 Clusters
For clusters, that consist of many nodes, the following techniques should be
employed during development of parallel programs:
export MIC_ENV_PREFIX=PHI
export PHI_KMP_AFFINITY=balanced
export PHI_KMP_PLACE_THREADS=60c,3t
export PHI_OMP_NUM_THREADS=180
Resources
CONTENTS
A.1 Software packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
export PATH=/home/pczarnul/openmpi-<version>-bin/bin:$PATH
in file ~/.bashrc
– OpenACC – accULL version 0.4 alpha software was used for compilation
and running of the testbed example.
APPENDIX B
Further reading
CONTENTS
B.1 Context of this book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
B.2 Other resources on parallel programming . . . . . . . . . . . . . . . 275
[76] Gregg S. Process and Thread Affinity for Intel Xeon Phi
Processors, April 2016. https://software.intel.com/en-
us/articles/process-and-thread-affinity-for-intel-xeon-
phi-processors-x200, accessed on 19th July 2017.
[77] W. Gropp, T. Hoefler, R. Thakur, and E. Lusk. Using Advanced MPI:
Modern Features of the Message-Passing Interface (Scientific and En-
gineering Computation). Scientific and Engineering Computation. The
MIT Press, first edition, November 2014. ISBN 978-0262527637.
[78] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel
Programming with the Message-Passing Interface (Scientific and Engi-
neering Computation). Scientific and Engineering Computation. The
MIT Press, third edition, November 2014. ISBN 978-0262527392.
[79] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep
Learning with Limited Numerical Precision. In F. R. Bach and D. M.
Blei, editors, Proceedings of the 32nd International Conference on Ma-
chine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37
of JMLR Workshop and Conference Proceedings, pages 1737–1746.
JMLR.org, 2015.
[80] GWT-TUD GmbH. Vampir - Performance Optimization. https://
www.vampir.eu/, accessed on 19th July 2017.
[81] P. H. Hargrove and J. C. Duell. Berkeley Lab Checkpoint/Restart
(BLCR) for Linux Clusters. Journal of Physics: Conference Series,
46(1):494, 2006.
[82] M. Harris. Optimizing Parallel Reduction in CUDA. NVIDIA
Developer Technology. http://developer.download.nvidia.com/
compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/
reduction.pdf, accessed on 19th July 2017.
[83] M. Harris. Unified Memory in CUDA 6, November 2013.
https://devblogs.nvidia.com/parallelforall/unified-memory-
in-cuda-6/, accessed on 19th July 2017.
[84] M. Harris. Using Shared Memory in CUDA C/C++, Jan-
uary 2013. https://devblogs.nvidia.com/parallelforall/using-
shared-memory-cuda-cc/, accessed on 19th July 2017.
[85] M. Harris. GPU Pro Tip: CUDA 7 Streams Simplify Concurrency,
January 2015. https://devblogs.nvidia.com/parallelforall/gpu-
pro-tip-cuda-7-streams-simplify-concurrency/, accessed on 19th
July 2017.
[109] J. Luitjens. CUDA Pro Tip: Always Set the Current De-
vice to Avoid Multithreading Bugs, September 2014. https:
//devblogs.nvidia.com/parallelforall/cuda-pro-tip-always-
set-current-device-avoid-multithreading-bugs/, accessed on
19th July 2017.
[110] D. Mackay. Optimization and Performance Tuning for In-
tel Coprocessors - Part 1: Optimization Essentials, November
2012. Intel Developer Zone, https://software.intel.com/en-
us/articles/optimization-and-performance-tuning-for-intel-
xeon-phi-coprocessors-part-1-optimization, accessed on 19th
July 2017.
[111] T. MacWilliam and C. Cecka. CrowdCL: Web-based volunteer comput-
ing with WebCL. In High Performance Extreme Computing Conference
(HPEC), 2013 IEEE, pages 1–6, Sept 2013.
[112] I. Massy, N. Pena, and M. Ney. Efficient Perfect Matched Layer for Hy-
brid MRTD-FDTD Computation with Lossy Dielectric Material Bound-
aries. European Physical Journal - Applied Physics, 57(01), January 2012.
[113] P. Micikevicius. Multi-GPU Programming, May 2012. Developer Tech-
nology, NVIDIA, GPU Technology Conference, San Jose, California,
USA.
[114] A. Munshi, B. Gaster, T. G. Mattson, J. Fung, and D. Ginsburg.
OpenCL Programming Guide. Addison-Wesley Professional, Boston,
MA, USA, 1st edition, 2011.
[115] B. Nichols, D. Buttlar, and J. P. Farrell. Pthreads Programming. A
POSIX Standard For Better Multiprocessing. O’Reilly, Beijing, 1998.
[116] K. A. Nuaimi, N. Mohamed, M. A. Nuaimi, and J. Al-Jaroodi. A Survey
of Load Balancing in Cloud Computing: Challenges and Algorithms. In
Proceedings of the 2012 Second Symposium on Network Cloud Comput-
ing and Applications, NCCA ’12, pages 137–142, Washington, DC, USA,
2012. IEEE Computer Society.
[117] A. Nukada, H. Takizawa, and S. Matsuoka. NVCR: A Transparent
Checkpoint-Restart Library for NVIDIA CUDA. In 2011 IEEE Inter-
national Symposium on Parallel and Distributed Processing Workshops
and Phd Forum, pages 104–113, May 2011.
[118] NVIDIA. Whitepaper. NVIDIA Tesla P100. The Most Advanced
Datacenter Accelerator Ever Built Featuring Pascal GP100, the
World's Fastest GPU. https://images.nvidia.com/content/pdf/
tesla/whitepaper/pascal-architecture-whitepaper.pdf, accessed
on 19th July 2017.
[142] F. Roth. Quick Start Guide for the Intel Xeon Phi Processor x200
Product Family, April 2016. https://software.intel.com/en-
us/articles/quick-start-guide-for-the-intel-xeon-phi-
processor-x200-product-family, accessed on 19th July 2017.
[143] N. Sakharnykh. The Future of Unified Memory, April 2016. GPU
Technology Conference, http://on-demand.gputechconf.com/gtc/
2016/presentation/s6216-nikolay-sakharnykh-future-unified-
memory.pdf, accessed on 19th July 2017.
[144] M.-R. Sancho. BSC Best Practices in Professional Training and Teaching
for the HPC Ecosystem. Journal of Computational Science, 14:74 –
77, 2016. Special issue: The Route to Exascale: Novel Mathematical
Methods, Scalable Algorithms and Computational Science Skills.
[145] J. Sanders and E. Kandrot. CUDA by Example: An Introduction
to General-Purpose GPU Programming. Addison-Wesley Professional,
Boston, MA, USA, 2010. ISBN-13: 978-0131387683.
[146] L. P. Santos, V. Castro, and A. Proença. Evaluation of the Commu-
nication Performance on a Parallel Processing System, pages 41–48.
Springer Berlin Heidelberg, Berlin, Heidelberg, 1997.
[147] C. Sarris and L. Katehi. An Efficient Numerical Interface between
FDTD and Haar MRTD-formulation and Applications. Microwave The-
ory and Techniques, IEEE Transactions on, 51(4):1146–1156, Apr 2003.
[165] R. van der Pas. OpenMP Tasking Explained, November 2013. SC13
Talk at OpenMP Booth, Santa Clara, CA, USA, http://openmp.org/
wp-content/uploads/sc13.tasking.ruud.pdf, accessed on 19th July
2017.
[166] A. Vladimirov, R. Asai, and V. Karpusenko. Parallel Programming and
Optimization with Intel Xeon Phi Coprocessors. Colfax International,
Sunnyvale, CA, USA, May 2015. ISBN 978-0-9885234-0-1.
[167] J. P. Walters and V. Chaudhary. Application-Level Checkpointing Tech-
niques for Parallel Programs, pages 221–234. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2006.
[168] W. Wang, W. Liu, and Z. Wang. Survey of Load Balancing Strategies
on Heterogeneous Parallel System, pages 583–589. Springer Netherlands,
Dordrecht, 2011.
[169] X. Wang, Y. Hao, and C.-H. Chu. Stability Comparison between Multi-
Resolution Time-Domain (MRTD) and Finite-Difference Time-Domain
(FDTD) Techniques. In 2008 Loughborough Antennas and Propagation
Conference, pages 269–272, March 2008.
[170] B. Wilkinson and M. Allen. Parallel Programming: Techniques and
Applications Using Networked Workstations and Parallel Computers.
Pearson, Upper Saddle River, NJ, USA, 2nd edition, March 2004.
ISBN 978-0131405639.
TDP, 25
thread affinity, 261, 267
balanced, 267
compact, 267
scatter, 267
thread divergence, 265
tree
balanced, 58
leaf, 58
level, 58
UML, 12
UNICORE, 27
Unified Virtual Addressing, 145
WAN, 12
warp, 132, 265
weather prediction, 42
WebCL, 23
WeevilScout, 23
work group, 14
work item, 14
work sharing, 61
work stealing, 61