
Journal of Parallel and Distributed Computing 156 (2021) 64–85


An improved framework of GPU computing for CFD applications on structured grids using OpenACC
Weicheng Xue, Charles W. Jackson, Christopher J. Roy
Virginia Tech Kevin T. Crofton Department of Aerospace and Ocean Engineering, 215 Randolph Hall, Blacksburg, VA 24061, United States of America

Article info

Article history: Received 28 November 2020; Received in revised form 14 March 2021; Accepted 25 May 2021; Available online 3 June 2021.

Keywords: MPI; OpenACC; CFD; Performance optimization; Structured grid

Abstract

This work is focused on improving multi-GPU performance of a research CFD code on structured grids. MPI and PGI 18.1 OpenACC directives are used to scale the code up to 16 NVIDIA GPUs. This work shows that using 16 P100 GPUs and 16 V100 GPUs can be up to 30x and 90x faster than 16 Xeon CPU E5-2680v4 cores for three different test cases, respectively. A series of performance issues related to the scaling of the multi-block CFD code are addressed by applying various optimizations. Performance optimizations such as the pack/unpack message method, removing temporary arrays as arguments to procedure calls, allocating global memory for limiters and connected boundary data, reordering non-blocking MPI Isend/Irecv and blocking Wait calls, reducing unnecessary implicit derived type member data movement between the host and the device, and the use of GPUDirect can improve the compute utilization, memory throughput, and asynchronous progression in the multi-block CFD code using modern programming features.

(c) 2021 Elsevier Inc. All rights reserved.

1. Introduction

Computational Fluid Dynamics (CFD) is a method to solve problems related to fluids numerically. There are numerous studies applying a variety of CFD solvers to solve different fluid problems. Usually these problems require the CFD results to be generated quickly as well as precisely. However, due to restrictions of the CPU compute capability, system memory size or bandwidth, highly refined meshes or computationally expensive numerical methods may not be feasible. For example, it may take thousands of CPU hours to converge a 3D Navier-Stokes flow case with millions of degrees of freedom. In such a circumstance, high performance parallel computing [7] enables us to solve the problem much faster. Also, parallel computing can provide more memory space (either shared or distributed) so that large problems can be solved.

Parallel computing differs from serial computing in many aspects. On the hardware side, a parallel system commonly has multi/many-core processors or even accelerators such as GPUs, which enable programs to run in parallel. Memory in a parallel system is either shared or distributed [7], with a unified memory address [32] and a non-unified memory address usually being used, respectively. On the software side, there are various programming models for parallel computing including OpenMP [6], MPI [8], CUDA [45], OpenCL [30] and OpenACC [52]. Different parallel applications can utilize different parallel paradigms based on a pure parallel model or even a hybrid model such as MPI+OpenMP, MPI+CUDA, MPI+OpenACC, OpenMP+OpenACC, etc.

For multi/many-core computing, OpenMP, MPI and hybrid MPI+OpenMP have been widely used and their performance has also been frequently analyzed in various areas, including CFD. Gourdain et al. [19,20] investigated the effect of load balancing, mesh partitioning and communication overhead in their MPI implementation of a CFD code, on both structured and unstructured meshes. They achieved good speedups for various cases up to thousands of cores. Amritkar et al. [2] pointed out that OpenMP can improve data locality on a shared memory platform compared to MPI in a fluid-material application. However, Krpic et al. [31] showed that OpenMP performs worse than MPI when running large scale matrix multiplication, even on a shared-memory computer system. Similarly, Mininni et al. [38] compared the performance of the pure MPI implementation and the hybrid MPI+OpenMP implementation of an incompressible Navier-Stokes solver, and found that the hybrid approach does not outperform the pure MPI implementation when scaling up to about 20,000 cores, which in their opinion may be caused by cache contention and memory bandwidth. In summary, it can be concluded that MPI


is more suitable for massively parallel applications as it can help achieve better performance compared to OpenMP.

In addition to accelerating a code on the CPU, accelerators such as the GPU [48] are becoming popular in the area of scientific computing. CUDA [45], OpenCL [30], and OpenACC [52] are the three commonly used programming models for the GPU. CUDA and OpenCL are mainly C/C++ extensions (CUDA has also been extended to Fortran) while OpenACC is a compiler directive based interface; therefore CUDA and OpenCL are more troublesome in terms of programming, requiring a lot of user intervention. CUDA is proprietary to NVIDIA and thus can only run on NVIDIA GPUs. OpenCL supports various architectures but it is a very low level API, which is not easy for domain scientists to adapt to. Also, although OpenCL has good portability across platforms, a code may not run efficiently on various platforms without specific performance optimizations and tuning. OpenACC has some advantages over CUDA and OpenCL. Users only need to add directives in their codes to expose enough parallelism to the compiler, which determines how to accelerate the code. In such a way, a lot of low level implementation can be avoided, which provides a relatively easy way for domain scientists to accelerate their codes on the GPU. Additionally, OpenACC can perform fairly well across different platforms even without significant performance tuning. However, OpenACC may not expose some parallelism if there is a lack of performance optimizations. Therefore, OpenACC is usually assumed to be slower than CUDA and OpenCL, but it is still fairly fast, and on some occasions OpenACC can even be the fastest [23]. To program on multiple GPUs, MPI may be needed, i.e., the MPI+OpenACC hybrid model may be required. CPUs are set as hosts and GPUs are set as accelerator devices, which is referred to as the offload model, in which the most computationally expensive portion of the code is offloaded to the GPU, while the CPU handles control instructions and file I/O.

A lot of work has been done to leverage GPUs for CFD applications. Jacobsen et al. [26] investigated multi-level parallelism for the classic incompressible lid driven cavity problem on GPU clusters using MPI+CUDA and hybrid MPI+OpenMP+CUDA implementations. They found that the MPI+CUDA implementation performs much better than the pure CPU implementation but the hybrid performs worse than the MPI+CUDA implementation. Elsen et al. [17] ported a complex CFD code to a single GPU using BrookGPU [10] and achieved a speedup of 40x for simple geometries and 20x for complex geometries. Brandvik et al. [9] applied CUDA to accelerate a 3D Euler problem using a single GPU and got a speedup of 16x. Luo et al. [34] applied MPI+OpenACC to port a 2D incompressible Navier-Stokes solver to 32 NVIDIA C2050 GPUs and achieved a speedup of 4x over 32 CPUs. They mentioned that OpenACC can increase the re-usability of the code due to OpenACC's similarity to OpenMP. Xia et al. [58] applied OpenACC to accelerate an unstructured CFD solver based on a Discontinuous Galerkin method. Their work achieved a speedup of up to 24x on one GPU compared to one CPU core. They also pointed out that using OpenACC requires the minimum code intrusion and algorithm alteration to leverage the computational power of the GPU. Chandar et al. [13] developed a hybrid multi-CPU/GPU framework on unstructured overset grids using CUDA. Xue et al. [59] applied multiple GPUs for a complicated CFD code on two different platforms but the speedup was not satisfactory (only up to 4x on an NVIDIA P100 GPU), even with some performance optimizations. Also, Xue et al. [60] investigated the multi-GPU performance and its performance optimization of a 3D buoyancy-driven cavity solver using MPI and OpenACC directives. They showed that decomposing the total problem in different dimensions affects the strong scaling performance significantly when using multiple GPUs. Xue et al. [61] further applied heterogeneous computing to accelerate a complicated CFD code on a CPU/GPU platform using MPI and OpenACC. They achieved some performance improvements for some of their test cases, and pointed out that the communication and synchronization overhead between the CPU and GPU may be the performance bottleneck. Both of the works in Ref. [13,61] showed that the hybrid CPU/GPU framework can outperform the pure GPU framework to some degree, but the performance gain depends on the platform and application.

2. Description of the CFD code: SENSEI

SENSEI (Structured, Euler/Navier-Stokes Explicit-Implicit Solver) is our in-house 2D/3D flow solver initially developed by Derlaga et al. [16], and later extended to a turbulence modeling code base in an object-oriented programming manner by Jackson et al. [25] and Xue et al. [62]. SENSEI is written in modern Fortran and is a multi-block finite volume CFD code. An important reason why SENSEI uses structured grids is that the quality of the mesh is better using a multi-block structured grid than using an unstructured grid. In addition, memory can be used more efficiently to obtain better performance since the data are stored in a structured way in memory. The governing equations can be written in weak form as

\frac{\partial}{\partial t}\int_{\Omega} \vec{Q}\, d\Omega + \oint_{\partial\Omega} \left( \vec{F}_{i,n} - \vec{F}_{\nu,n} \right) dA = \int_{\Omega} \vec{S}\, d\Omega    (1)

where \vec{Q} is the vector of conserved variables, and \vec{F}_{i,n} and \vec{F}_{\nu,n} are the inviscid and viscous flux normal components (the dot product of the 2nd order flux tensor and the unit face normal vector), respectively, given as

\vec{Q} = \begin{bmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ \rho e_t \end{bmatrix}, \quad
\vec{F}_{i,n} = \begin{bmatrix} \rho V_n \\ \rho u V_n + n_x p \\ \rho v V_n + n_y p \\ \rho w V_n + n_z p \\ \rho h_t V_n \end{bmatrix}, \quad
\vec{F}_{\nu,n} = \begin{bmatrix} 0 \\ n_x \tau_{xx} + n_y \tau_{xy} + n_z \tau_{xz} \\ n_x \tau_{yx} + n_y \tau_{yy} + n_z \tau_{yz} \\ n_x \tau_{zx} + n_y \tau_{zy} + n_z \tau_{zz} \\ n_x \Theta_x + n_y \Theta_y + n_z \Theta_z \end{bmatrix}    (2)

\vec{S} is the source term from either body forces, chemistry source terms, or the method of manufactured solutions [46]. \rho is the density, u, v, w are the Cartesian velocity components, e_t is the total energy, h_t is the total enthalpy, V_n = n_x u + n_y v + n_z w, and the n_i terms are the components of the outward-facing unit normal vector. \tau_{ij} are the viscous stress components based on Stokes's hypothesis. \Theta_i represents the heat conduction and work from the viscous stresses. In this paper, both the Euler and laminar Navier-Stokes solvers of SENSEI are ported to the GPU, but not the turbulence models, as the turbulence implementation involves a lot of object-oriented programming features such as overloading, polymorphism, type-bound procedures, etc. These newer features of the language are not supported well by the PGI compiler used, as they may require the GPU to jump from an address to a different address at runtime, which should be avoided when programming on GPUs.

In SENSEI, ghost cells are used for multiple purposes. First, boundary conditions can be enforced in a very straightforward way. There are different kinds of boundaries in SENSEI, such as slip wall, non-slip wall, supersonic/subsonic inflow/outflow, farfield, etc. Second, from the perspective of parallel computing, ghost cells for connected boundaries contain data from the neighboring block used during a syncing routine so that every block can be solved


independently. SENSEI uses pointers of a derived type to store the neighboring block information easily. Unless otherwise noted, all of the results presented here will be using a second-order accurate scheme. Second order accuracy is achieved using the MUSCL scheme [55], which calculates the left and right states for the primitive variables on each face of all cells. Time marching can be accomplished using an explicit M-step Runge-Kutta scheme [27] or an implicit time stepping scheme [4,29,57]. In this paper, only the explicit M-step Runge-Kutta scheme is used, as the implicit scheme uses a completely object-oriented style of programming which includes overloading of type-bound procedures.

Even though derived types are used frequently in SENSEI, to promote coalesced memory access and improve cache reuse, struct-of-array (SOA) instead of array-of-struct (AOS) is chosen for SENSEI. This means that, for example, the densities in each cell are stored in contiguous memory locations instead of all of the degrees of freedom for a cell being stored together. Using SOA produces a coalesced memory access pattern which performs well on GPUs and is recommended by NVIDIA [45].

SENSEI has the ability to approximate the inviscid flux with a number of different inviscid flux functions. Roe's flux difference splitting [50], Steger-Warming flux vector splitting [51], and Van Leer's flux vector splitting [56] are available. The viscous flux is calculated using a Green's theorem approach and requires more cells to be added to the inviscid stencil. For more details on the theory and background see Derlaga et al. [16], Jackson et al. [25] and Xue et al. [62].

3. Overview of CPU/GPU heterogeneous system, MPI and OpenACC

3.1. CPU/GPU heterogeneous system

Fig. 1. CPU and GPU.

Fig. 2. The offload model.

As can be seen in Fig. 1, the NVIDIA GPU has more lightweight cores than the CPU, so the compute throughput of the GPU is much higher than that of the CPU. Also, the GPU has higher bandwidth and lower latency to its memory. The CPU and the GPU have discrete memories so there are data movements between them, which can be realized through PCI-E or NVLink. The offload model is commonly used for pure GPU computing, which can be seen in Fig. 2. In CFD, the CPU deals with the geometry input, domain decomposition and some general settings. Then, the CPU offloads the intensive computations to the GPU. The boundary data exchange can happen either on the CPU or the GPU, depending on whether GPUDirect is used or not. After the GPU finishes the computation, it moves the solution to the CPU. The CPU finally outputs the solution to files. To obtain good performance, there should be enough GPU threads running concurrently. Using CUDA [45] or OpenACC [52], there are three levels of tasks: grid, thread block and thread. Thread blocks can be run asynchronously on multiple streaming multiprocessors (SMs) and the communication between thread blocks is expensive. Each thread block has a number of threads. There is only lightweight synchronization overhead for all threads in a block. All threads in a thread block can run in parallel in the Single Instruction Multiple Threads (SIMT) mode [43]. A kernel is launched as a grid of thread blocks. Several thread blocks can share the same SM but all of its resources need to be shared. Each thread block contains multiple 32-thread warps. Threads in a warp can be executed concurrently on a multiprocessor. In comparison to the CPU, which is often optimized for instruction control and for low latency access to cached data, the GPU is optimized for data-parallel, high-throughput computations.

3.2. MPI

MPI (Message Passing Interface) is a programming model for parallel computing [40] which enables data to be exchanged between processors via messages. It can be used on both distributed and shared memory systems. MPI supports point-to-point communication patterns as well as collective communications. MPI also supports the customization of derived data types so transferring data between different processors is easier. It should be noted that a customized derived type may not guarantee fast data transfers. MPI supports the use of C/C++ and Fortran. There are many implementations of MPI including Open MPI [39] and MVAPICH2 [41].

3.3. OpenACC

OpenACC is a standard for parallel programming on heterogeneous CPU/GPU systems [52]. Very similar to OpenMP [6], OpenACC is also directive based, so it requires less code intrusion into the original code base compared with CUDA [45] or OpenCL [30]. OpenACC does not usually provide competitive performance compared to CUDA [24,5,3]; however, the performance it provides can still satisfy many needs. Compilers such as PGI [14] and GCC can support OpenACC in a way that the compiler detects the directives in a program and decides how to parallelize loops by default. The compiler also handles moving the data between discrete memory locations, but it is the user's duty to inform the compiler to do so. Users can provide more information through the OpenACC directives to attempt to optimize performance. These optimizations will be the focus of this paper.
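To make the directive-based workflow concrete, the following minimal sketch (illustrative only; the subroutine, array names and loop are not taken from SENSEI) shows how a Fortran loop can be offloaded with OpenACC data and parallel loop directives:

! Minimal OpenACC offload sketch (illustrative, not SENSEI code).
! The data region keeps the arrays resident on the device and the
! parallel loop directive asks the compiler to build a GPU kernel.
subroutine scale_add(n, a, b, alpha)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: a(n)
  real(8), intent(in) :: b(n)
  real(8), intent(in) :: alpha
  integer :: i

  !$acc data copy(a) copyin(b)
  !$acc parallel loop
  do i = 1, n
    a(i) = a(i) + alpha * b(i)
  end do
  !$acc end data
end subroutine scale_add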


Fig. 3. A 3D domain decomposition.

4. Domain decomposition

There are many strategies to decompose a domain, such as using a Cartesian [21,18] or graph topology [22]. Because SENSEI is a structured multi-block code, Cartesian block splitting will be used. With Cartesian block splitting, there is a tradeoff between decomposing the domain in more dimensions (e.g. 3D or 2D domain decomposition) and fewer dimensions (e.g. 1D domain decomposition). The surface area to volume ratio is larger if decomposing the domain in fewer dimensions, which means more data needs to be transferred between different processors. Also, decomposing the domain in 1D can generate slices that are too thin to support the entire stencil when decomposing the domain into many sub-blocks. However, the fewer the number of dimensions being decomposed, the fewer the neighbors each block needs to communicate with, reducing the number of transfers and their corresponding latency.

By default, SENSEI uses a general 3D or 2D domain decomposition (depending on whether the problem is 3D or 2D) but can switch to a 1D domain decomposition if specified. An example of the 3D decomposition is shown in Fig. 3. The whole domain is decomposed into a number of blocks. Each block connects to 6 neighboring blocks, one on each face. For each sub-iteration step (as the RK multi-step scheme is used), neighboring decomposed blocks need to exchange data with each other, in order to fill their own connected boundaries. Since the data layout of multi-dimensional arrays in Fortran is column-major, we always decompose the domain starting from the most non-contiguous memory dimension. For example, since the unit stride direction of a three-dimensional array A(i,j,k) is the first index (i), i is the last decomposed dimension and k is the first decomposed dimension.

The 3D domain decomposition method shown in Fig. 3 is a processor clustered method. This method is designed for scenarios in which the number of processors (np) is greater than the number of parent blocks (npb), i.e., the number of blocks before the domain decomposition. There are several advantages to this decomposition strategy. First, this method is an "on the fly" approach, which is convenient to use and requires no manual operation or pre-processing of the domain decomposition. Second, it is very robust in that it can handle most situations if np is greater than or equal to npb. Third, the communication overhead is small due to the simple connectivity, making the MPI communication implementation easy. The load can be balanced well if np is significantly larger than npb. Finally, some domain decomposition work can be done in parallel, although the degree may vary for various scenarios.

Fig. 4. An example showing the domain aggregation. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)

This domain decomposition method may have a load imbalance issue if np is not obviously greater than npb, which can be addressed using a domain aggregation technique, similar to building blocks. A simple 2D example of how the domain aggregation works is given in Fig. 4. In this example, the first parent block has twice as many cells as the second parent block. If only two processors are used, the workload cannot be balanced well without over-decomposition and aggregation. With over-decomposition, the first parent block is decomposed into 4 blocks, and one of these decomposed blocks is assigned to the second processor so both processors have the same amount of work to do. It should be noted that the processor boundary length becomes longer due to the processor boundary deflection, increasing the amount of communication required. Using the domain aggregation approach, any decomposed block is required to exchange data with its neighbors on the same processor, but this does not require MPI communications. Only the new connected boundaries (e.g. the red solid lines in Fig. 4) between neighboring processors need to be updated using MPI routines at each sub-iteration step.
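For readers unfamiliar with the MPI machinery referenced here and in the next section, the following hedged sketch (illustrative only; it is not SENSEI's decomposition code) uses standard MPI Cartesian topology routines to lay out np ranks over a 3D grid of blocks and to find face neighbors:

! Illustrative 3D Cartesian rank layout using MPI topology routines;
! not SENSEI's implementation.
program cart_decomp_sketch
  use mpi
  implicit none
  integer :: ierr, np, rank, cart_comm
  integer :: dims(3), coords(3)
  logical :: periods(3)
  integer :: nbr_lo, nbr_hi

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, np, ierr)

  dims = 0                      ! let MPI choose a balanced 3D split
  call MPI_Dims_create(np, 3, dims, ierr)
  periods = .false.             ! no periodic boundaries
  call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., &
                       cart_comm, ierr)
  call MPI_Comm_rank(cart_comm, rank, ierr)
  call MPI_Cart_coords(cart_comm, rank, 3, coords, ierr)

  ! face neighbors along the first topology dimension
  call MPI_Cart_shift(cart_comm, 0, 1, nbr_lo, nbr_hi, ierr)

  call MPI_Finalize(ierr)
end program cart_decomp_sketch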


5. Boundary decomposition in parallel and boundary reordering

Boundaries also need to be decomposed and updated on individual processors. Initially, only the root processor has all the boundary information for all parent blocks, since root reads in the grid and boundaries. After domain decomposition, each parent block is decomposed into a number of child blocks. These child blocks need to update all the boundaries for themselves. For non-connected boundaries this update is very straightforward, as each processor just needs to compare its individual block index range with the boundary index range. For interior boundaries caused by domain decomposition, a family of Cartesian MPI topology routines are used to set up communicators and make communication much less troublesome. However, for connected parent block boundaries, the update (decomposing and re-linking these boundaries) is more difficult, as the update is completed in parallel on individual processors in SENSEI, instead of on the root processor. The parallel process can be beneficial if numerous connected boundaries exist. For every parent block connected boundary, the root processor first broadcasts the boundary to all processors within that parent block and its neighbor parent block, and then returns to deal with the next parent block connected boundary. The processors within that parent block or its neighbor parent block compare the boundary received to their block index ranges. If a processor does not contain any index range of the parent boundary, it moves forward to compare the next parent boundary. Processors in the parent block having this boundary, or processors in the neighbor parent block matching part of the neighbor index range, are colored, but differently. These colored processors will need to update their index range for the connected boundary. To illustrate how we use MPI topology routines and inter-communicators to set up connectivity between neighbor blocks, a 2D example having 3 parent blocks and more than 3 CPUs is given in Fig. 5. Processors which match a parent connected boundary are included in an inter-communicator. The processor in a parent block first sends its index ranges to processors residing in the neighbor communicator. Then a processor in the neighbor communicator matching part of the index range is a neighbor, while others in the neighbor communicator not matching the index range are not neighbor processors. By looping over all the neighbor processors in the neighbor communicator, one processor sets up connectivity with all its connected neighbors. This process is performed in parallel, as the root processor does not need to participate in it except for broadcasting the parent boundary to all processors in the parent block and its neighbor parent block at the beginning. There may be special cases. The first special case is that the root is located in a parent block or its neighbor parent block. The root then needs to participate in the boundary decomposition and re-linking process, as shown in Fig. 5. The second special case is given in the lower right square in Fig. 5, in which a parent block partly connects to itself, which may make a decomposed block partially connect to itself.

Fig. 5. An example of using MPI inter-communicator.

In SENSEI, nonblocking MPI calls instead of blocking calls are used to improve the performance. However, nonblocking MPI calls require a blocking call such as MPI_WAIT to finish the communication, and this may cause a deadlock issue for some multi-block cases. An example of the deadlock issue is shown in Fig. 6. In this example, there are four processors (PA through PD), each with two connected boundaries (bc1 and bc2). Every processor needs to block on an MPI_WAIT call for its bc1 to finish first and then for its bc2. However, the initial order of the boundaries creates a circular dependency among all of the processors, and thus no communication can be completed (deadlock). This deadlock issue may happen for both the parent block connections and the child block connections after decomposition. Fig. 6 shows a solution to the deadlock issue, i.e. reordering boundaries. Therefore, boundary reordering is implemented in SENSEI to automatically deal with such deadlock issues.
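As an illustration of the circular-wait hazard, the sketch below shows one common deadlock-free pattern: post every nonblocking call before any wait. It is not SENSEI's reordering algorithm (which keeps per-boundary waits and reorders them automatically), and all names are hypothetical:

! Hypothetical sketch of a deadlock-free exchange: post all receives
! and sends for every connected boundary first, then wait on all of
! them at once, so no circular wait order can form among processors.
subroutine exchange_all_boundaries(comm, n_bc, msg_len, nbor_rank, &
                                   send_buf, recv_buf)
  use mpi
  implicit none
  integer, intent(in) :: comm, n_bc
  integer, intent(in) :: msg_len(n_bc), nbor_rank(n_bc)
  real(8), intent(in)  :: send_buf(:,:)
  real(8), intent(out) :: recv_buf(:,:)
  integer :: b, nreq, ierr
  integer :: req(2*n_bc)

  nreq = 0
  do b = 1, n_bc
    nreq = nreq + 1
    ! the tag (here the boundary index b) must be agreed with the neighbor
    call MPI_Irecv(recv_buf(:,b), msg_len(b), MPI_DOUBLE_PRECISION, &
                   nbor_rank(b), b, comm, req(nreq), ierr)
  end do
  do b = 1, n_bc
    nreq = nreq + 1
    call MPI_Isend(send_buf(:,b), msg_len(b), MPI_DOUBLE_PRECISION, &
                   nbor_rank(b), b, comm, req(nreq), ierr)
  end do
  call MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE, ierr)
end subroutine exchange_all_boundaries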


Fig. 6. An example of deadlock due to circular dependency.

Fig. 7. An explanation of ssspnt.

6. Platforms and metrics

6.1. Platforms

NewRiver. NewRiver [42] is a cluster at Virginia Tech. Each GPU node on NewRiver is equipped with two Intel Xeon E5-2680v4 (Broadwell) 2.4 GHz CPUs, 512 GB of memory, and two NVIDIA P100 GPUs. Each NVIDIA P100 GPU is capable of up to 4.7 TeraFLOPS of double-precision performance. The compilers used on NewRiver are PGI 18.1 and Open MPI 3.0.0 or MVAPICH2-GDR 2.3b. MVAPICH2-GDR 2.3b is a CUDA-aware MPI wrapper compiler which supports GPUDirect, also available on NewRiver. A compiler optimization level of -O4 is used.

Cascades. Cascades [11] is another cluster at Virginia Tech. Each GPU node on Cascades is equipped with two Intel Skylake Xeon Gold 3 GHz CPUs, 768 GB of memory, and two NVIDIA V100 GPUs. Each NVIDIA V100 GPU is capable of up to 7.8 TeraFLOPS of double-precision performance. The NVIDIA V100 GPU offers the highest GFLOPS among the GPUs we used. The compilers used on Cascades are PGI 18.1 and Open MPI 3.0.0. A compiler optimization level of -O4 is used.

TinkerCliffs. TinkerCliffs [53] is another cluster at Virginia Tech. TinkerCliffs does not contain any GPUs. Each base compute node on TinkerCliffs is equipped with two AMD EPYC 7702 CPUs, with a base frequency of 2 GHz and a boost frequency of up to 3.35 GHz. The compilers used on TinkerCliffs are GCC 10.2.0 and Open MPI 4.0.5. A compiler optimization level of -O4 is used.

Each job can use 8 GPU nodes on NewRiver or Cascades, so the maximum number of GPUs available is 16 on either system. TinkerCliffs is used so that the multi-CPU computing can achieve a wall time similar to the multi-GPU computing, since NewRiver and Cascades do not have enough CPUs.

6.2. Performance metrics

To evaluate the performance of the parallel code, weak scaling and strong scaling are used. Strong scaling measures how the execution time varies when the number of processors changes for a fixed total problem size, while weak scaling measures how the execution time varies with the number of processors when the problem size on each processor is fixed. These two scalings are commonly valuable to investigate together, as we care more about the weak scaling when we have enough compute resources available to run large problems, and more about the strong scaling when we only need to run small problems. In this paper, since our focus is on the acceleration of the computation and data movement in the iterative solver portion, when measuring performance the timing contribution from the I/O portion (reading in the grid, writing out the solution) and the one-time domain decomposition is not taken into account.

Two basic metrics used in this paper are parallel speedup and efficiency. Speedup denotes how much faster the parallel version is compared with the serial version of the code, while efficiency represents how efficiently the processors are used. They are defined as follows,

speedup = t_{serial} / t_{parallel}    (3)

efficiency = speedup / np    (4)

where np is the number of processors (CPUs or GPUs).

In order for the performance of the code to be compared well on different platforms and for different problem sizes, the wall clock time per iteration step is converted to a metric called ssspnt (scaled size steps per np time), which is defined in Eq. (5):

ssspnt = s \frac{size \times steps}{np \times time}    (5)

where s is a scaling factor which scales the smallest platform ssspnt to the range of [0,1]. In this paper, s is set to 10^{-6} for all test cases. size is the problem size, steps is the total number of iteration steps and time is the program solver wall clock time for steps iterations.

Using ssspnt has some advantages. First, GFLOPS requires knowing the number of operations while ssspnt does not. In most codes, especially complicated codes, it is usually difficult to know the total number of operations. The metric ssspnt is a better way of measuring the performance of a problem than the variable time, as time may change if conditions (such as the number of iterations, problem size, etc.) change. Second, using ssspnt is clearer in terms of knowing the relative speed difference under different situations than the metric "efficiency". It is easy to know whether the performance is super-linear, linear or sub-linear, which is shown in Fig. 7(a), as well as to know the relative performance comparison between different scenarios, which is shown in Fig. 7(b). Using ssspnt, different problems, platforms and different scalings can be compared more easily.

Similar to Ref. [60], every time reported in this paper is measured consecutively for at least three instances. The difference for each time point is smaller than 1% (usually less than 1 s out of more than 120 s). We also selected a handful of cases to run again to verify the timings were consistent day to day.
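For clarity, a one-line helper corresponding to Eq. (5) might look like the following (the interface is an illustration, not SENSEI's timing code):

! Illustrative helper implementing Eq. (5); names are hypothetical.
! s is the scaling factor (1.0e-6 in this paper), time is the solver
! wall clock time in seconds for the timed steps.
pure function ssspnt_metric(problem_size, steps, np, time, s) result(val)
  implicit none
  integer, intent(in) :: problem_size, steps, np
  real(8), intent(in) :: time, s
  real(8) :: val
  val = s * (real(problem_size,8) * real(steps,8)) / (real(np,8) * time)
end function ssspnt_metric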


7. OpenACC parallelization and optimization

There is some general guidance for improving the performance of a program on a GPU. First, sufficient parallelism should be exposed to saturate the GPU with enough computational work; that is, the speedup for the parallel portion should compensate for the overhead of data transfers and the parallel setup. Second, the memory bandwidth between the host and the device should be improved to reduce the communication cost, which is affected by the message size and frequency (if using MPI), memory access patterns, etc. It should be noted that all performance optimizations should guarantee the correctness of the implementation. Therefore, this paper proposes and adopts various modifications to increase the speed of various CFD kernels and reduce the communication overhead while always ensuring the correct result is obtained, i.e., the results do not deviate from the serial implementation.

Load balancing, communication overhead, latency, synchronization overhead and data locality are important factors which may affect the performance. The domain decomposition and aggregation methods used in this paper can help solve the load imbalance issue well; however, the number of dimensions that need to be decomposed may require tuning, especially when given a large number of processors. To reduce the communication overhead of data transfers between the CPU and the GPU, the data should be kept on the GPU as long as possible without being frequently moved to the CPU. Also, non-contiguous data transfer between the CPU and the GPU (large stride memory access) should be avoided to improve the memory bandwidth. To hide latency, kernel execution and data transfer should be overlapped as much as possible, which may require reordering some portions of the program. To reduce the synchronization overhead, the number of tasks running asynchronously should be maximized. To improve data locality and increase the use of coalesced fetches, data should be loaded into cache as chunks before they are needed, which makes reads and writes more efficient. This paper addresses some of these issues based on profiling outputs.

We should keep in mind that there are some inherent bottlenecks limiting the actual performance of a CFD code on GPUs. Some CFD codes require data exchange to communicate between partitions, which incurs some communication and synchronization overhead. Data fetching in discrete memory may cost more clock cycles than expected due to low actual memory throughput, system latency, etc. Therefore, the actual compute utilization is sometimes difficult to increase and is application dependent. Another limiter of the performance is the need for branching statements in the code. For instance, certain flux functions might execute different branches depending on the local Mach number. This causes threads in a warp to diverge, reducing the peak performance possible. The actual speedup after enough performance optimization will still be smaller than the theoretical compute power the GPU can provide. The relation of the actual and theoretical speedup the GPU can provide is not covered in this paper.

7.1. V0: baseline

The baseline GPU version of SENSEI was mainly implemented by McCall [36,37]. There are some restrictions when using PGI 18.1 OpenACC. These restrictions mean the following features should be avoided:

1. Reductions with derived type members
2. Temporary arrays as parameters to a procedure call
3. Procedure pointers within PGI OpenACC kernels

The first restriction is a restriction of the OpenACC specification. This restriction can be resolved by using a temporary variable for the reduction operation and subsequently assigning the reduced value to the derived type member. The third restriction can be easily overcome by using select case or if statements. The second restriction indicates that either the compiler needs to automatically generate the temporary arrays or the user should manually create them. However, the temporary arrays deteriorate the code performance significantly, which will be seen later.

Although the work in Ref. [36] and [37] overcame many restrictions to port the code to the GPU, the GPU performance was not satisfactory. An NVIDIA P100 GPU was only 1.3x ~ 3.4x faster than a single Intel Xeon E5-2680v4 CPU core, which indicates that the GPU was not utilized efficiently. Some performance bottlenecks were fixed in Ref. [59]. Profiling-driven optimizations were applied to overcome some performance bottlenecks. First, loops with small sizes were not parallelized, as the launch overhead is more expensive than the benefits. As the warp size for NVIDIA GPUs is 32, the compiler may select a thread length of 128 or 256 to parallelize small loops, but the loop iteration number for these small loops is less than 10. Second, the kernel of extrapolation to ghost cells was moved from the CPU to the GPU in order to improve the performance, by passing the whole array with indices as arguments. Finally, the kernel of updating corners and edges was parallelized. The eventual speedup of using a single GPU compared to a single CPU was raised to 4.1x for a 3D case on an NVIDIA P100 GPU, but no multi-GPU performance results were shown as the parallel efficiency was not satisfactory.

It should be mentioned that the relative solution differences between the CPU and the GPU code in Ref. [36,37,59] are much larger than the round-off error, mainly due to an incorrect implementation of the connected boundary condition and its relevant parallelization. The solution bugs have been fixed in this paper so that the OpenACC framework is extended correctly to multi-block cases. In fact, solution debugging can be troublesome using OpenACC without a good tool such as Parallel Compiler Assisted Software Testing (PCAST) [1] or Arm Forge [49], as intermediate results may be difficult to check directly on the device. Without PCAST or Arm Forge, if the data on the GPU side need to be printed outside of the parallel region, then an update of the data on the host side should be made before printing. If the data need to be known in the parallel region (while a kernel is running), a probe routine declared with an !$acc routine directive should be inserted into the parallel region to print out the desired data. Keep in mind that both the GPU and CPU have a copy of the data with the same name but in discrete memories.

In addition, it should be mentioned that there is a caveat when updating the boundary data between the host and the device using the !$acc update directive, since the ghost cells and interior cells in SENSEI are stored and addressed together, which means that the boundary data are non-contiguous in memory. Much higher memory throughput can be obtained if the whole piece of data (including the interior cells and boundary cells), instead of an array slice, is included. A 2D example can be seen in Fig. 8. As Fortran uses column-major storage, the memory stores the array elements by column first. However, the interior cell columns split the ghost cells in memory. For 3D or multi-dimensional arrays, the data layout is more complicated but the principle is similar. If the method in Listing 1 is used to update the boundary data on the device, OpenACC updates the data slice by slice and there are many more invocations. The memory throughput can be about 1/100 to 1/8 of that using the method in Listing 2, based on the profiling outputs from the NVIDIA visual profiler. In fact, the only implementation difference between the two methods is whether slicing is used or not (in Fortran, array slicing is commonly used), but the performance difference is huge. However, some applications or schemes may require avoiding updating the ghost cell values (due to concerns for solution correctness) at some temporal points when iterating the solver; then a manual data rearrangement, i.e., the pack/unpack optimization, should be applied to overcome the performance deterioration issue.

Table 1 shows the performance of some metrics for the V00 (using Listing 1) and V0 (using Listing 2) comparison on the NewRiver platform. Using slicing for the !$acc update directive reduces the memory throughput greatly, to about 1% for the device to host bandwidth and about 9% for the host to device bandwidth, compared to not using slicing (with ghost cell data included). Also, the total number of invocations using slicing is more than 2 times higher than not using slicing. The last row in Table 1 is a reference (the NVIDIA profiler reports different fractions of low memory throughput data transfers for different code versions) to show the more serious low memory throughput issue in V00. We will show some performance optimizations based on the V0 version next, even though V0 has larger solution errors than the round-off errors for some cases.

7.2. GPU optimization using OpenACC

Although parallelized using the GPU in Ref. [36,37] and optimized in Ref. [59], the speedup for SENSEI is still not satisfactory due to some performance issues. The NVIDIA Visual Profiler is used to detect various performance bottlenecks. The bottlenecks include low memory throughput, low GPU occupancy, inefficient data transfers, etc. Different architectures and problems may show different behaviors, which is one of our interests.


Fig. 8. An example showing ghost cells breaking the non-contiguity of the interior cells.

! IMIN face update


start_indx = 1 - n_ghost_cells(1)
end_indx = 2
!$acc update device( &
!$acc soln%sblock(blck)%rho(start_indx:end_indx,1:jmax,1:kmax)) &
!$acc async(1)

Listing 1: Using slicing to update.

! IMIN face update


start_indx = 1 - n_ghost_cells(1)
end_indx = 2
!$acc update device( &
!$acc soln%sblock(blck)%rho(start_indx:end_indx,:,:)) &
!$acc async(1)

Listing 2: Update including ghost cells.

Table 1
Comparison of V00 and V0 performance metrics.

Metrics V00 V0
Device to Host bandwidth, GB/s 0.129 19.69
Host to Device bandwidth, GB/s 0.8 6.59
Total invocations, times 262144 109349
Compute Utilization, % 1.2 17.2
Low memory throughput 118 MB/s for 96.4% data transfers 138 MB/s for 9.1% data transfers

Second, previously the boundary data needed to be transferred to the CPU first in order to exchange data. We will apply GPUDirect to enable data transfers directly between GPUs.

V1: Pack/Unpack. The goal of this optimization is to improve the memory throughput and reduce the communication cost if the required data are not located sequentially in memory [60]. As Fortran is a column-major language, the first index i of a matrix A(i,j,k) denotes the fastest change. A decomposition in the i index direction can generate chunks of data (on j-k planes) which are highly non-contiguous. Decomposing in the j index direction can also cause non-contiguous data transfers. Therefore, the optimization is targeted at solving this issue by converting the non-contiguous data into a temporary contiguous array in parallel using loop directives and then updating this temporary array between hosts and devices using update directives. Performance gains will be obtained as the threads in a warp can access a contiguous aligned memory region; that is, coalesced memory access is deployed instead of strided memory access. The procedure can be summarized as follows:

1. Allocate send/recv buffers for boundary cells on j-k planes on devices and hosts if decomposition happens in the i dimension, as the non-contiguous data on i planes make data transfer very slow.
2. Pack the noncontiguous block boundary data into the send buffer, which can be explicitly parallelized using !$acc loop directives, then update the send buffer on the host using !$acc update directives.
3. Have the hosts transfer the data through nonblocking MPI_Isend/MPI_Irecv calls and blocking MPI_Wait calls.
4. Update the recv buffer on the devices using OpenACC update device directives and finally unpack the contiguous data stored in the recv buffer back to noncontiguous memory on the devices, which can also be parallelized.

We will show that although extra memory is required for the buffers, the memory throughput can be improved to a level similar to that in V0 (but V0 has larger simulation errors due to the incorrect use of !$acc update, especially for cases having connected boundaries). Using V1, only the boundary data on the i boundary faces are packed/unpacked, as such data are highly noncontiguous. The boundary data on the j and k planes are not buffered. The pack/unpack can be parallelized using !$acc loop directives so that the computational overhead is very small, which can be seen in Listing 3.


! IMIN face update


start_indx = 1 - n_ghost_cells(1)
end_indx = interior_cells
!$acc parallel present(soln, soln%sblock, &
!$acc soln%sblock(blck)%rho, &
!$acc soln%sblock(blck)%vel, &
!$acc soln%sblock(blck)%p, &
!$acc soln%sblock(blck)%temp, &
!$acc rho_buffer, vel_buffer, &
!$acc p_buffer, temp_buffer)
!$acc loop collapse(3)
do k = k_low, k_high
do j = j_low, j_high
do i = start_indx, end_indx
n = n_old + (i - start_indx) + (j - j_low) * i_count + &
(k - k_low) * j_count * i_count
rho_buffer(n) = soln%sblock(blck)%rho(i,j,k)
vel_buffer(:,n) = soln%sblock(blck)%vel(:,i,j,k)
p_buffer(n) = soln%sblock(blck)%p(i,j,k)
temp_buffer(n) = soln%sblock(blck)%temp(i,j,k)
end do
end do
end do
!$acc end parallel
n = n_old + i_count * j_count * k_count
!$acc update host(rho_buffer(n_old:n-1)) async(1)
!$acc update host(vel_buffer(:,n_old:n-1)) async(2)
!$acc update host(p_buffer(n_old:n-1)) async(3)
!$acc update host(temp_buffer(n_old:n-1)) async(4)

Listing 3: A pseudo code of showing how to pack/unpack.

start_indx = 1 - n_ghost_cells(1)
end_indx = interior_cells
do k = k_low, k_high
do j = j_low, j_high
do i = start_indx, end_indx
soln%sblock(blck)%rho(i,j,k) = rho_buffer(n)
soln%sblock(blck)%vel(:,i,j,k) = vel_buffer(:,n)
soln%sblock(blck)%p(i,j,k) = p_buffer(n)
soln%sblock(blck)%temp(i,j,k) = temp_buffer(n)
n = n + 1
end do
end do
end do

Listing 4: An extra step to pack/unpack data to the derived type array.

However, when updating the buffer arrays on either side (device or host), since the host only transfers the derived type arrays such as soln%sblock%array, not the buffer arrays array_buffer, there is an extra step on the host side to pack/unpack the buffer to/from the derived type array, which can be seen in Listing 4. This step may not be needed for some other codes but it is necessary for SENSEI, as SENSEI uses derived type arrays to store the primitive variables. The step adds some overhead to the host side, which will be addressed in V5.

V2: Extrapolating to ghost cells on the GPU. The V1 version executes the kernel of extrapolating to ghost cells on the CPU. However, leaving the extrapolation on the CPU may impede further performance improvement, as this portion becomes the performance bottleneck for the GPU code. Therefore, V2 moves the kernel of extrapolating to ghost cells to the GPU. When passing an intent(out) reshaped array which is located in non-contiguous memory locations to a procedure call, the PGI compiler creates a temporary array that can be passed into the subroutine. The temporary array can reduce the performance significantly and poses a threat of cache contention if it is shared among CUDA threads. In fact, whether to support passing array slices to a procedure call is under internal discussion in the NVIDIA PGI compiler group. To resolve this issue, manually created private temporary arrays are used to enable the GPU to parallelize the extrapolation kernel. An example of how the extrapolation works in SENSEI can be found in Listing 5. The data present directive notifies the compiler that the needed data are located in the GPU memory, the data copyin directive copies the boundary information to the GPU, and the parallel loop directives parallelize the boundary loop iterations. The subroutine set_bc is a device routine which is called in the parallel region. It is difficult for the compiler to automatically know whether there are loops inside the routine, and whether there are dependencies among the loop iterations in the parallel region. The use of the !$acc routine seq directive in set_bc gives the compiler this information. After using the temporary arrays such as rho and vel, each CUDA thread needs to have a copy of the arrays, which occupies a lot of SM registers and thus reduces the concurrency. As can be seen, these temporary arrays are used to store the data in the derived type in the beginning. Then they are used as arguments when invoking the set_bc subroutine. Finally the extrapolated data are copied back to the ghost cells in the original derived type soln.


!$acc data present(soln, soln%rho, soln%vel, soln%p, &


!$acc soln%temp, soln%molecular_weight, &
!$acc grid%grid_vars%volume, &
!$acc grid%grid_vars%xi_n, grid, grid%grid_vars) &
!$acc copyin(bound, bclow, bchigh, n_mmtm)

!$acc parallel
!$acc loop independent
do k = bound%indx_min(3),bound%indx_max(3)
!$acc loop independent vector private(rho, vel, p, temp, vol)
do j = bound%indx_min(2),bound%indx_max(2)
rho(1:length) = soln%rho(high+1:low:order,j,k)
vel(1:n_mmtm,1:length) = soln%vel(:,high+1:low:order,j,k)
p(1:length) = soln%p(high+1:low:order,j,k)
temp(1:length) = soln%temp(high+1:low:order,j,k)

vol = grid%grid_vars%volume(high:low:order,j,k)
call set_bc(bound%bc_label, &
rho, &
vel, &
p, &
temp, &
molweight, &
vol, &
grid%grid_vars%xi_n(:,i,j,k), &
bclow, &
bchigh, &
n_mmtm)

soln%rho(high+1:low:order,j,k) = rho(1:length)
soln%vel(1:n_mmtm,high+1:low:order,j,k) = vel(:,1:length)
soln%p(high+1:low:order,j,k) = p(1:length)
soln%temp(high+1:low:order,j,k) = temp(1:length)
end do
end do
!$acc end parallel
!$acc end data

Listing 5: Using temporary array to do the ghost cell data extrapolation.
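Listing 5 relies on set_bc being compiled as a device routine. A minimal sketch of how such a routine is typically declared is shown below; the interface is illustrative and abbreviated, not SENSEI's full set_bc signature:

! Illustrative declaration of a sequential device routine; the real
! set_bc in SENSEI takes more arguments than shown here.
subroutine set_bc(bc_label, length, n_mmtm, rho, vel, p, temp)
  !$acc routine seq
  implicit none
  integer, intent(in) :: bc_label, length, n_mmtm
  real(8), intent(inout) :: rho(length), vel(n_mmtm, length), &
                            p(length), temp(length)

  select case (bc_label)
  ! ... set the ghost-cell states for each supported boundary type ...
  case default
    ! unsupported boundary label: leave the states unchanged
  end select
end subroutine set_bc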

V3: Removal of temporary variables. Either the automatic or the manual use of temporary arrays in V2 can greatly deteriorate the GPU performance. Instead of passing array slices to a subroutine, the entire array is passed with the indices of the desired slice, as shown in Listing 6, which avoids the use of temporary arrays. This method requires many subroutines to be modified in SENSEI. However, it saves the use of shared resources and improves the concurrency.

V4: Splitting flux calculation and limiter calculation. For cases which require the use of limiters, the CPU calculates the left and right limiters on a face once, as the next loop iteration can reuse two limiter values without computing them again, which can be seen in Eq. (6) and Eq. (7):

Q^{L}_{i+1/2} = Q_i + \frac{\epsilon}{4}\left[ (1-\kappa)\,\Psi^{+}_{i-1/2}\,(Q_i - Q_{i-1}) + (1+\kappa)\,\Psi^{-}_{i+1/2}\,(Q_{i+1} - Q_i) \right]    (6)

Q^{R}_{i+1/2} = Q_{i+1} - \frac{\epsilon}{4}\left[ (1+\kappa)\,\Psi^{+}_{i+1/2}\,(Q_{i+1} - Q_i) + (1-\kappa)\,\Psi^{-}_{i+3/2}\,(Q_{i+2} - Q_{i+1}) \right]    (7)

where \epsilon and \kappa are MUSCL extrapolation parameters and \Psi are limiter function values. L and R denote the left and right states, respectively.

After porting the code to the GPU, since SENSEI calculates the limiters locally for each solution state (in V0 through V3), the limiter cannot be reused, as different CUDA threads have their own copies of the four limiter values; otherwise thread contention may occur. With this fix, the total cost of the limiter calculation on the GPU is twice that on the CPU. Also, storing the limiter locally requires the limiter calculation and flux extrapolation to be together, which is highly compute intensive. V4 uses global arrays to store these limiters so that the flux calculation and limiter calculation can be separated, which is given in Listing 7. This approach leaves more room for kernel concurrency and asynchronization and also avoids thread contention.

V5: Derived type for connected boundaries on the GPU. The previous versions update the connected boundaries between the host and the device using local dynamic arrays. Therefore, it is worthwhile to investigate the effect of using global derived type arrays to store the connected boundary data. This removes the extra data copies on the host side mentioned in V1. An example of using the global derived type is given in Listing 8. If there is no communication required among different CPU processors, the MPI functions are not called.
73
W. Xue, C.W. Jackson and C.J. Roy Journal of Parallel and Distributed Computing 156 (2021) 64–85

!$acc data present(soln, soln%rho, soln%vel, soln%p, &


!$acc soln%temp, soln%molecular_weight, &
!$acc grid%grid_vars%volume, &
!$acc grid%grid_vars%xi_n, grid, grid%grid_vars) &
!$acc copyin(bound, bclow, bchigh, n_mmtm)

!$acc parallel
!$acc loop independent
do k = bound%indx_min(3),bound%indx_max(3)
!$acc loop independent vector
do j = bound%indx_min(2),bound%indx_max(2)
call set_bc(bound%bc_label, &
grid, &
soln, &
soln%rho, &
soln%vel, &
soln%p, &
soln%temp, &
molweight, &
grid%grid_vars%volume, &
grid%grid_vars%xi_n(:,i,j,k), &
bclow, &
bchigh, &
j, &
k, &
n_mmtm, &
boundary_lbl, &
normal_lbl)

end do
end do
!$acc end parallel
!$acc end data

Listing 6: Passing derived type data and index range.
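As a follow-up to Listing 6, the difference between V2 and V3 can be reduced to the call site. The fragment below is a schematic comparison with hypothetical subroutine and index names, not code from SENSEI:

! V2-style call: the slice argument forces the compiler to build a
! per-thread temporary array before entering the routine.
call reconstruct(soln%rho(low:high, j, k))

! V3-style call: pass the whole array plus the index range, so no
! temporary is created and shared-resource pressure is reduced.
call reconstruct_idx(soln%rho, low, high, j, k)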

V6: Change of blocking call locations. Since SENSEI is a multi-block CFD code, a processor may hold multiple blocks and many connected boundaries. Using MPI non-blocking routines, there must be a place to execute a blocking call such as MPI_WAIT to complete the communications. Each Isend/Irecv call needs one MPI_WAIT, or multiple MPI_WAITs can be wrapped up into one MPI_WAITALL. The previous versions block on the MPI_WAITALL call for every decomposed block. An improved way of achieving this is to move the MPI_WAITALL calls to the end, so that these MPI_WAITALL calls are executed after all Isend & Irecv calls have been issued. It should be noted that the relative orders of all Isend & Irecv calls are not changed, so there is no deadlock issue. The goal of this optimization is to initiate all non-blocking calls as early as possible and finish the blocking calls as late as possible so that more overlap of communication and computation can be obtained from the implementation side. An example is given in Fig. 9. In this example, there are two blocks, each having two connected boundaries. However, V6 only improves the performance when multiple connected boundaries exist. All the relative orders of the Irecv and Isend calls are not changed.

Fig. 9. Change of blocking call position.

For platforms on which asynchronous progression is supported completely (from both the software and hardware sides), this optimization may work much better. However, for common platforms on which asynchronous progression is not supported fully, OpenMP may need to be used to promote asynchronous progression [28,54,33,15,12]. Full asynchronous progression is a very complicated issue and is not covered in this paper. This paper will only apply MPI+OpenACC (PGI 18.1) to accelerate the CFD code.

V7: Boundary flux optimization. In SENSEI, the fluxes for the wall and farfield boundaries need to be overwritten to get a more accurate estimate of the solution. These overwritten flux calculations are done after the boundary enforcement. For these two kinds of fluxes, the previous versions do not compute them very efficiently. A lot of temporary variables are allocated for each thread, which deteriorates the concurrency when using PGI 18.1 OpenACC, as registers are limited. The principle of this optimization is similar to that in V3. An example of the optimization is given in Listing 9.

V8: Asynchronicity improvement. Kernels from different streams can be overlapped so that the performance can be improved. This version is exactly the same as V7, but the environment variable PGI_ACC_SYNCHRONOUS is set to 0 when executing SENSEI; that is, asynchronization among some independent kernels is promoted. The !$acc wait directive makes the host wait until asynchronous accelerator activities finish, i.e., it is the synchronization on the host side.
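A hedged sketch of the asynchronous pattern discussed for V8 is given below: two independent kernels are placed on different queues and then synchronized on the host. The subroutine and array names are illustrative, not SENSEI's:

! Illustrative only: independent kernels on different async queues so
! the runtime may overlap them; !$acc wait then synchronizes the host.
subroutine zero_fluxes_async(n, xi_flux, eta_flux)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: xi_flux(n), eta_flux(n)
  integer :: i

  !$acc parallel loop async(1) copy(xi_flux)
  do i = 1, n
    xi_flux(i) = 0.0d0
  end do

  !$acc parallel loop async(2) copy(eta_flux)
  do i = 1, n
    eta_flux(i) = 0.0d0
  end do

  !$acc wait
end subroutine zero_fluxes_async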


! xi limiter
!$acc parallel
!$acc loop independent collapse(3)
do k = 1, k_cells
  do j = 1, j_cells
    do i = 1, imax-1
      call limiter_subroutine_x( sblock, gblock, i, j, k, &
                                 sblock%limiter_xi%left,  &
                                 sblock%limiter_xi%right )
    end do
  end do
end do
!$acc end parallel

! xi flux
!$acc parallel
!$acc loop independent collapse(3) private(qL, qR)
do k = 1, k_cells
  do j = 1, j_cells
    do i = 2, imax-1
      call muscl_extrapolation_xi( sblock, i, j, k,                         &
                                   sblock%limiter_xi%left(1:neq,i-1,j,k),  &
                                   sblock%limiter_xi%left(1:neq,i,j,k),    &
                                   sblock%limiter_xi%right(1:neq,i,j,k),   &
                                   sblock%limiter_xi%right(1:neq,i+1,j,k), &
                                   qL, qR )
      call flux_function(qL, qR,                         &
                         gblock%grid_vars%xi_n(:,i,j,k), &
                         sblock%xi_flux(1:neq,i,j,k))
    end do
  end do
end do
!$acc end parallel

Listing 7: Splitting MUSCL extrapolation and limiter calculation.

!$acc update host(grid%gblock(blck)%bcs_acc(nc)%rho_send( &
!$acc   1:idx_max_nbor(1)-idx_min_nbor(1)+1, &
!$acc   1:idx_max_nbor(2)-idx_min_nbor(2)+1, &
!$acc   1:idx_max_nbor(3)-idx_min_nbor(3)+1))

! SEND and RECV derived type boundary data
call MPI_IRECV( grid%gblock(blck)%bcs_acc(nc)%rho_recv, &
                scalar_count, MPI_DOUBLE_PRECISION,     &
                bound%bound_nbor%process_id, RHO_TAG,   &
                world_comm, req(req_count+1), ierr )

call MPI_ISEND( grid%gblock(blck)%bcs_acc(nc)%rho_send, &
                scalar_count, MPI_DOUBLE_PRECISION,     &
                bound%bound_nbor%process_id, RHO_TAG,   &
                world_comm, req(req_count+5), ierr )

call MPI_WAITALL(req_count, req(1:req_count), &
                 stat(:,1:req_count), ierr)

!$acc update device(grid%gblock(blck)%bcs_acc(nc)%rho_recv( &
!$acc   buff_size_self(1)*buff_size_self(2)* &
!$acc   buff_size_self(3)))

Listing 8: Derived type for connected boundary data.
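Listing 8 shows the calls for a single variable of a single connected boundary. The V6 reordering illustrated in Fig. 9 hoists the blocking wait so that every receive and send is posted before any wait is issued. The following stand-alone sketch (generic buffer, tag and neighbor names, not SENSEI's data structures) shows that structure for two connected boundaries per rank.

program reorder_waitall_sketch
  use mpi
  implicit none
  integer, parameter :: nb = 2          ! connected boundaries per rank (as in Fig. 9)
  integer, parameter :: count = 1024    ! doubles per boundary message
  real(8) :: sendbuf(count, nb), recvbuf(count, nb)
  integer :: req(2*nb), stat(MPI_STATUS_SIZE, 2*nb)
  integer :: rank, nprocs, left, right, ib, nreq, ierr

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  right = mod(rank + 1, nprocs)               ! illustrative ring neighbors only
  left  = mod(rank - 1 + nprocs, nprocs)
  sendbuf = real(rank, 8)

  ! V6 idea: post every non-blocking receive and send for all boundaries
  ! first, and only then block once for all of them.
  nreq = 0
  do ib = 1, nb
     nreq = nreq + 1
     call MPI_IRECV(recvbuf(:, ib), count, MPI_DOUBLE_PRECISION, &
                    left, ib, MPI_COMM_WORLD, req(nreq), ierr)
  end do
  do ib = 1, nb
     nreq = nreq + 1
     call MPI_ISEND(sendbuf(:, ib), count, MPI_DOUBLE_PRECISION, &
                    right, ib, MPI_COMM_WORLD, req(nreq), ierr)
  end do

  ! (independent computation could be overlapped here)

  call MPI_WAITALL(nreq, req(1:nreq), stat(:, 1:nreq), ierr)

  call MPI_FINALIZE(ierr)
end program reorder_waitall_sketch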


! V0 ~ V6
!$acc parallel copyin(i, bound) async(1)
!$acc loop independent
do k = bound%indx_min(3), bound%indx_max(3)
  !$acc loop independent vector private( &
  !$acc   soln_L2, soln_L1, soln_R1,     &
  !$acc   soln_R2, qL, qR, modf,         &
  !$acc   lim_L2, lim_L1, lim_R1, lim_R2,&
  !$acc   vel_xi, rho_xi, p_xi, temp_xi)
  do j = bound%indx_min(2), bound%indx_max(2)

! V7
!$acc parallel copyin(i, bound) async(1)
!$acc loop independent
do k = bound%indx_min(3), bound%indx_max(3)
  !$acc loop independent vector private( &
  !$acc   qL, qR, modf)
  do j = bound%indx_min(2), bound%indx_max(2)

Listing 9: Optimization of the overwritten boundary flux kernel.
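Before turning to V9 below, the following stand-alone sketch (generic variable names, not SENSEI code) shows the kind of explicit data management that avoids the implicit per-iteration copies V9 targets: the arrays are placed on the device once by an enclosing data region, the kernels assert default(present), and the host copy is refreshed only when it is actually needed.

program resident_data_sketch
  implicit none
  integer, parameter :: n = 100000, nsteps = 100
  real(8) :: q(n), dq(n)
  real(8) :: dt
  integer :: i, step

  q  = 1.0d0
  dq = 0.0d0
  dt = 1.0d-3

  ! Copy the solution arrays to the device once, before time marching,
  ! instead of letting each kernel trigger its own implicit transfers.
  !$acc data copy(q) create(dq)
  do step = 1, nsteps
     ! default(present) asserts that q and dq already live on the device,
     ! so the compiler does not insert per-iteration copies for them.
     !$acc parallel loop default(present)
     do i = 1, n
        dq(i) = -0.1d0*q(i)
     end do
     !$acc parallel loop default(present)
     do i = 1, n
        q(i) = q(i) + dt*dq(i)
     end do

     ! Bring data back to the host only when it is actually needed there
     ! (e.g., an occasional monitor print), not every step.
     if (mod(step, 50) == 0) then
        !$acc update host(q(1:1))
        print *, 'step', step, 'q(1) =', q(1)
     end if
  end do
  !$acc end data
end program resident_data_sketch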

V9: Removal of implicit data copies between the host and device. The last general performance optimization is essentially manual tuning work: it requires the user to modify the code guided by profiling. The compiler sometimes does not know which variables need to be updated between the host and the device, so to be safe it may update variables frequently, which can be unnecessary. Different architectures and compilers may handle these updates differently, so the user can optimize them based on the profiler output. The compiler may transfer some scalar variables, small arrays and even derived type data in every iteration, although they only need to be copied once. There are multiple places in SENSEI where the PGI 18.1 compiler makes such unnecessary copies. These extra data transfers are usually small in size but deteriorate the memory throughput. Their effect can be significant for small problems; for compute-intensive computations, however, this optimization may not be very useful.

V10: GPUDirect. GPUDirect is an umbrella term for several GPU communication acceleration technologies. It provides high bandwidth and low latency communication between NVIDIA GPUs. There are three levels of GPUDirect [44]. The first level is GPUDirect Shared Access, introduced with CUDA 3.1. This feature avoids an unnecessary memory copy within host memory between the intermediate pinned buffers of the CUDA driver and the network fabric buffer. The second level is GPUDirect Peer-to-Peer transfer (P2P transfer) and Peer-to-Peer memory access (P2P memory access), introduced with CUDA 4.0. P2P memory access allows buffers to be copied directly between two GPUs on the same node. The last level is GPU RDMA (Remote Direct Memory Access), with which buffers can be sent from GPU memory to a network adapter without staging through host memory. This last feature is not supported on NewRiver, as it requires specific versions of the drivers (from NVIDIA and Mellanox for the GPU and the Infiniband, respectively) which are not installed (other dependencies exist as well, particularly parallel filesystems). Although GPU RDMA is not available, the other aspects of GPUDirect can be utilized to further improve the scaling performance on multiple GPUs.

A CUDA-aware MPI implementation such as MVAPICH2-GDR allows GPU memory buffers to be passed directly to MPI function calls, and GPUDirect built on CUDA-aware MPI does not require extra data copies from/to the host [47]. An example of the GPUDirect optimization is given in Listing 10. Comparing Listing 10 and Listing 8, it can be seen that there is no need to use !$acc update device/host clauses, as the data transfer does not require staging through host memory.

Table 2
Inlet case inflow boundary conditions.

Mach number    4.0
Pressure       12270 Pa
Temperature    217 K

8. Solution and scaling performance

8.1. Supersonic flow through a 2D inlet

The first test case is a simplified 2D 30 degree supersonic inlet, which has only one parent block and no connected boundaries. The inflow conditions are given in Table 2. There are multiple levels of grid for the strong and weak scaling analysis, with total cell counts ranging from 50k to 7 million. The parallel solution and the serial solution have been compared from the beginning to the converged state during the iterations, and the relative errors for all the primitive variables based on the inflow boundary values are within round-off error range (10^-12).

A very coarse level of grid for the 2D inlet flow is shown in Fig. 10(a). The decomposition using 16 GPUs (the highest number of GPUs available) on a 416 × 128 grid is shown in Fig. 10(b). The decomposition is 2D, creating multiple connected boundaries between processors. Ghost cells on the faces of connected boundaries are used to exchange data between neighboring processors. The device needs to communicate with the host if multiple processors are used.

The relative residual L2 norm history is shown in Fig. 11. It can be seen that the iterative errors have been driven down small enough for all the primitive variables when converged. The Mach number and density solutions are shown in Fig. 12. There are multiple flow deflections as the flow goes through the reflected oblique shocks.

Fig. 13 shows the performance comparison of the different optimizations using different flux options on different platforms. The grid size used in Fig. 13 is 416 × 128. The goal of this comparison is to investigate the effect of various flux options, time marching schemes and different generations of GPUs when applying the optimizations introduced earlier in this paper. One observation is that using the Roe flux is the same speed as using the van Leer flux, which is reasonable.


! V10
!$acc host_data use_device( &
!$acc   grid%gblock(blck)%bcs_acc(nc)%rho_recv, &
!$acc   grid%gblock(blck)%bcs_acc(nc)%vel_send, &
!$acc   ...)

call MPI_IRECV( grid%gblock(blck)%bcs_acc(nc)%rho_recv, &
                scalar_count, MPI_DOUBLE_PRECISION,     &
                bound%bound_nbor%process_id, RHO_TAG,   &
                world_comm, req(req_count+1), ierr )

call MPI_ISEND( grid%gblock(blck)%bcs_acc(nc)%rho_send, &
                scalar_count, MPI_DOUBLE_PRECISION,     &
                bound%bound_nbor%process_id, RHO_TAG,   &
                world_comm, req(req_count+5), ierr )

call MPI_WAITALL(req_count, req(1:req_count), &
                 stat(:,1:req_count), ierr)

Listing 10: Enabling GPUDirect using OpenACC directives.

Fig. 10. 2D Euler supersonic inlet.

It should be kept in mind that the ssspnt metric does not take the number of double precision operations per step into account, so ssspnt is not equivalent to GFLOPS. Also, the speed of RK2 and RK4 is comparable, so this paper sticks to RK2 unless otherwise specified.

Comparing the performance of the different versions in Fig. 13, there are two significant performance leaps: from V2 to V3 and from V8 to V9. Since the extrapolation to ghost cells on the GPU runs inefficiently in V2 due to the low compute utilization, removing the use of temporary arrays in the parallel regions reduces the overhead from CUDA threads, and more of the concurrency in the code can therefore be utilized by the GPU. From V8 to V9, since the problem size is small (the compute fraction is not very high), removing unnecessary data movements improves the overall performance by more than 48%. These unnecessary data movements are mainly transfers of some derived type members which do not need to be updated, and they can be detected through careful performance profiling using the NVIDIA visual profiler. For larger problems the performance gain is not as significant, as the computation cost becomes more dominant, which will be shown later. In the meantime, there is a gradual performance improvement from V3 to V4 and from V6 to V8. These optimizations should not be overlooked, as the issues they address will eventually become bottlenecks. Since this case does not have connected boundaries, there is no obvious performance change from V4 to V6. It should be mentioned that the performance optimizations proposed earlier are not for one specific case only, but for general cases with multiple blocks and connected boundaries.

Fig. 11. The relative iterative residual history for the inlet case.

Fig. 14(a) and Fig. 14(b) show the strong and weak scaling performance for the 2D inlet Euler flow, respectively. The CPU scaling performance is also given for reference. A single P100 GPU is more than 43× faster than a single CPU on the 1664 × 1024 grid level, which displays the compute power of the GPU. The strong scaling efficiency decays quickly for small problem sizes but not for the largest problem size in Fig. 14(a); the parallel efficiency using 16 P100 GPUs on the 3328 × 2048 grid is still higher than 86%. For the weak scaling, the parallel efficiency is higher (90.9%) than the strong scaling efficiency, as there is more work to saturate the GPU. The V100 GPU shows higher speedups but lower efficiency, because the V100 GPU needs more computational work as it is faster.
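For reference, the strong and weak scaling efficiencies quoted here can be computed from the conventional definitions below (these are the standard formulas, not an equation reproduced from this paper), where T_1 is the wall time on one device and T_N the wall time on N devices:

\eta_{strong} = \frac{T_1}{N \, T_N}, \qquad \eta_{weak} = \frac{T_1}{T_N} \quad \text{(problem size per device held fixed for weak scaling)}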


Fig. 12. 2D Euler supersonic inlet.

Fig. 13. Performance comparison for the 2D inlet Euler flow.

Fig. 14. The scaling performance for the 2D inlet case.

The boundary connections for this inlet flow case after the domain decomposition are not complicated, which is one important reason why the performance is good.

Table 3
NACA 0012 airfoil farfield boundary conditions.

Mach number           0.25
Static pressure       84307 Pa
Temperature           300 K
Angle of attack, α    5 degrees

8.2. 2D subsonic flow past a NACA 0012 airfoil

The second test case in this paper is the 2D subsonic flow (M∞ = 0.25) past a NACA 0012 airfoil at an angle of attack of 5 degrees. The flow field for all the simulation runs of this case is initialized using the farfield boundary conditions given in Table 3. This case is solved by both the Euler and laminar NS solvers in SENSEI.

Although the airfoil case contains only one parent block, the grid is a C-grid, which means that the single block connects to itself on one face through a connected boundary; this makes the airfoil case different from the 2D inlet flow case. One coarse grid of this airfoil case is shown in Fig. 15(a). For the scaling analysis, the grid size ranges from 400k to 6 million cells. The domain decomposition using 16 GPUs is shown in Fig. 15(b). Near the airfoil surface the grid is refined locally, so processors near the wall take smaller blocks, but the load is balanced.

Fig. 16(a) shows the relative iterative residual L2 norm history for the laminar NS subsonic flow past a NACA 0012 airfoil. This case requires the most iteration steps to converge among all of the test cases considered, so leveraging the compute power of the GPU saves a lot of time.

Fig. 15. 2D NS NACA 0012 airfoil.

Fig. 16. 2D laminar NS NACA 0012 airfoil.

To enable the iterative residual to keep decreasing instead of oscillating, limiter freezing is adopted at around 600k steps. After freezing the limiter, the iterative residual norms continue to reduce smoothly. The iterative errors are driven down small enough to obtain the steady state solution.

The parallel solution and the serial solution have been compared on coarse levels of grid, and the relative errors for all the primitive variables based on the reference values are within round-off error range (10^-12). Fig. 16(b) shows the pressure coefficient solution and the streamlines for the laminar NS subsonic flow past the NACA 0012 airfoil.

Fig. 17 shows the comparison of the different versions for the flow past a NACA 0012 airfoil using a single P100 GPU. Laminar NS has a smaller ssspnt (about 70%) compared to the Euler solver, as laminar NS contains more arithmetic operations in each step. From V2 to V3, the speedup is more than 2 times on different levels of grid, for both the Euler and laminar NS solvers. Using globally allocated derived types to store the connected boundary data cannot improve the performance when only one processor is used, which can be seen from the comparison of V4 and V5, as there are no MPI communication calls. Although the airfoil case has a connected boundary, the data in the ghost cells for that boundary are filled directly through copying. This case only has one connected boundary, so there is no need to reorder the non-blocking MPI I_send/I_recv calls and the MPI_Wait call. Similarly to the 2D inlet case, on coarse levels of grid there is a noticeable performance improvement when applying the optimization in V9; on fine levels of grid, the benefit is limited.

Since no performance gain from V5 to V6 can be seen on a single GPU, multiple GPUs are used to show the benefits. For all the runs shown in Fig. 18, V6 (the red bars) outperforms V5 (the blue bars) by 4.2% to 32.8%, depending on the solver type, grid level and number of GPUs used. With multiple GPUs, multiple connected boundaries are created, which gives the reordering of I_send/I_recv and Wait calls room to work. Intrinsically, this reordering promotes more asynchronous progression on the implementation side. The actual degree of overlap still depends highly on the communication system, which is out of the scope of this paper. Readers who are interested in more overlap and better asynchronous progression may try the combination of MPI+OpenACC+OpenMP.

Fig. 19 and Fig. 20 show the strong and weak scaling performance of this subsonic flow past a NACA 0012 airfoil solved by the Euler and laminar NS solvers on P100 and V100 GPUs, respectively. They show very similar behavior, differing only in scale. Overall, the laminar ssspnt is about 0.7 of the Euler ssspnt using multiple GPUs. The strong scaling parallel efficiency on the 4096 × 1536 grid using 16 P100 GPUs for the Euler and laminar NS solvers is about 85.1% and 84.4%, respectively. The weak scaling efficiency is generally higher, as there is more work for the GPU to do. The efficiencies using V100 GPUs are lower than those using P100 GPUs, which indicates that faster GPUs may need more computational work to maintain high efficiency.

8.3. 3D transonic flow past an ONERA M6 wing

The final case tested in this paper is the 3D transonic flow (M∞ = 0.839) past an ONERA M6 wing at an angle of attack of 3.06 degrees [35]. The flow field is initialized using the farfield boundary conditions given in Table 4. Both the Euler and laminar NS solvers in SENSEI are used to solve this problem. Different from the previous two 2D problems, this 3D case has 4 parent blocks with various sizes.


Fig. 17. The performance of different versions for the NACA 0012 airfoil case (P100 GPU).

Fig. 18. Performance comparison between V 5 and V 6 for the NACA 0012 airfoil case (P100 GPU).

Fig. 19. The scaling performance for the 2D Euler flow past a NACA 0012 airfoil.

Fig. 20. The scaling performance for the 2D laminar NS flow past a NACA 0012 airfoil.


Fig. 21. Grid and domain decomposition for ONERA M6 wing.

Fig. 22. Residual history and solution for ONERA M6 wing.

Under some conditions (when using 2 and 4 processors in this paper), domain aggregation is needed to balance the load on the different processors. This 3D wing case has a total grid size ranging from 300k to 5 million cells.

Table 4
ONERA M6 wing farfield boundary conditions.

Mach number, M∞       0.8395
Temperature, T∞       255.556 K
Pressure, p∞          315979.763 Pa
Angle of attack, α    3.06 degrees

The parallel solution and the serial solution of the wing case have been compared to each other on a coarse mesh, and the relative errors for the primitive variables based on the farfield boundary values are within round-off error (10^-12). A coarse level of grid and the domain decomposition using 16 GPUs are given in Fig. 21(a) and Fig. 21(b), respectively. The relative iterative residual L2 norm history and the pressure coefficient (Cp) contour using the laminar NS solver in SENSEI are given in Fig. 22(a) and Fig. 22(b), respectively. From Fig. 22(a), it can be seen that the iterative errors have been driven down small enough.

Since this wing case is 3D and has multiple parent blocks, we are interested in whether the performance optimizations introduced earlier can improve the performance of this case as well. Fig. 23 shows the performance of the different versions for the ONERA M6 wing case. From grid level h5 to h1, the grid refinement factor is 2 (refined in z, y and x cyclically). V2 runs slightly slower than V1 for all levels of grid, indicating that the extrapolation to ghost cells on the GPU is not as efficient as that on the CPU, although it is parallelized. With proper optimization, V3 is about 3 to 5 times faster than V2, which is similar to the previous two 2D cases. From V3 to V4, there is a performance drop for almost all runs, no matter what the grid level and solver are: splitting one kernel into two kernels for this case incurs some overhead and reduces the compute utilization a bit. There is a slight performance improvement from V4 to V5 when using the derived type to buffer the boundary data for connected boundaries; the data are allocated in the main memory of the GPU before they are needed, which outperforms the use of dynamic data to buffer the boundary data. For a single GPU, V5 and V6 perform equally fast. Further optimization of the boundary flux calculation improves the performance significantly, which can be seen from V6 to V7. Carefully moving the data between the host and the device improves the performance on coarse levels of grid, but not on very fine levels of grid, as the computation becomes more dominant when the grid is refined.

Although the wing case has multiple parent blocks, there is no MPI communication when using only a single GPU; therefore, there is only a negligible difference between V5 and V6 on a single GPU. Similar to the NACA 0012 airfoil case, multiple GPUs are used to show the effect of reordering the non-blocking MPI I_send/I_recv calls and the MPI_Wait calls. Fig. 24 shows that there are some performance gains for all the runs, especially on the finer mesh h1, possibly because more asynchronous progression can be exposed on a finer mesh. V6 accelerates the code by 12% to 32% compared to V5. As noted earlier, more asynchronous progression may be achieved by switching to the MPI+OpenACC+OpenMP model, which is not covered in this paper.


Fig. 23. The single P100 GPU performance of different versions for ONERA M6 wing.

Fig. 24. Performance comparison between V 5 and V 6 for ONERA M6 wing (P100 GPU).

Fig. 25. The scaling performance for the 3D Euler ONERA M6 wing case.

Fig. 25 and Fig. 26 show the strong and weak scaling performance using the Euler and laminar NS solvers, respectively. The behavior differs somewhat from the previous cases because here some processors need to hold multiple blocks, which is different from the 2D inlet and 2D NACA 0012 cases. A single GPU is about 39 and 33 times faster than a single CPU on the h5 grid level, using the Euler and laminar NS solvers, respectively. The weak scaling of the GPU keeps good efficiency over the whole np range shown in Fig. 25(b) and Fig. 26(b).

Fig. 27(a) and Fig. 27(b) show the performance comparison using multiple CPUs and multiple P100 GPUs for the Euler and laminar NS solvers, respectively. When adding CPUs, the number of processors per node (ppn) can be set to 2, 4, 8 and 16 on NewRiver nodes, which can be used to show the impact of ppn on the performance. The ppn value is fixed at 128 on TinkerCliffs. It should be noted that both the horizontal and vertical axes in Fig. 27 are in log scale in order to show all of the data properly. 128 Xeon E5-2680v4 CPU cores are needed to achieve a wall time similar to 4 P100 GPUs, and 1024 AMD EPYC 7702 cores are needed to achieve a wall time similar to 16 V100 GPUs, for both the Euler and laminar NS solvers. In addition, the value of ppn has a negligible impact on the multi-CPU performance.

8.4. GPUDirect

Applying OpenACC, the directive !$acc host_data use_device can be used to enable GPUDirect. It should be noted that with GPUDirect disabled (in V9), the performance difference between a CUDA-aware MPI version (MVAPICH2-GDR/2.3b) and a non CUDA-aware MPI version (MVAPICH2/2.3b) is negligible, as expected; thus the performance of MVAPICH2/2.3b is not shown in this work.


Fig. 26. The scaling performance for the 3D laminar NS ONERA M6 wing case.

Fig. 27. The scaling performance of different types of CPU and GPU (ONERA M6 wing).

Fig. 28. Performance comparison between V 9 and V 10.

Since GPUDirect is not a general performance optimization, requiring support from both the compiler side and the communication system side, a comparison of V9 and V10 is made at the end to give readers more insight into the effect of GPUDirect. GPUDirect was applied to the 2D Euler/laminar flow past the NACA 0012 airfoil and the transonic flow over the 3D ONERA M6 wing. It should be noted that there is no guarantee that using GPUDirect improves the performance substantially without hardware support such as NVLink or RDMA (neither the NewRiver nor the Cascades cluster has NVLink or RDMA, so the memory bandwidth is still not high enough). As seen in Fig. 28, the two cases show different behaviors when applying GPUDirect. For the NACA 0012 case, V10 is generally slower than V9, which means that GPUDirect can make the code run slower. However, for the ONERA wing case, which is a multi-block case before domain decomposition, using GPUDirect improves the performance by 3% to 18%. GPUDirect seems to work better on larger problems. If high memory bandwidth NVLink or RDMA were available, we believe that GPUDirect would be more beneficial to the performance.

9. Conclusions

An improved framework using MPI+OpenACC (PGI 18.1) is developed to accelerate a CFD code on multi-block structured grids. OpenACC directives have advantages in terms of ease of programming, good portability and fair performance. A processor-clustered domain decomposition and a block-clustered domain aggregation method are used to balance the workload among processors. This work demonstrates that the communication overhead is not high using the proposed domain decomposition and aggregation methods. A parallel boundary decomposition method is also proposed with the use of the MPI inter-communicator functions. The boundary reordering for multi-block cases is addressed to avoid the deadlock issue when sending and receiving messages.


A number of performance optimizations are examined, such as using the global derived type to buffer the connected boundary data, removing temporary arrays when making procedure calls, reordering the blocking calls for non-blocking MPI communications in multi-block cases, using GPUDirect, etc. These performance optimizations have been demonstrated to improve single GPU performance by up to 8 times compared to the baseline GPU version, especially for multi-block cases. More importantly, all three test cases show good strong and weak scaling up to 16 GPUs, with good parallel efficiency if the problem is large enough.

CRediT authorship contribution statement

• Weicheng Xue (first author): The first author served as the main contributor and primary author of this study. All the performance optimizations were developed and implemented by the first author. All the results were collected by the first author.
• Charles Jackson (second author): The second author provided many useful suggestions when porting the CFD code SENSEI to multiple CPUs. The second author also provided valuable comments on this manuscript.
• Christopher J. Roy (final author): The final author provided valuable feedback for this study and helped compose this manuscript.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] K. Ahmad, M. Wolfe, Automatic testing of OpenACC applications, in: International Workshop on Accelerator Programming Using Directives, Springer, 2017, pp. 145–159.
[2] A. Amritkar, S. Deb, D. Tafti, Efficient parallel CFD-DEM simulations using OpenMP, J. Comput. Phys. 256 (2014) 501–519.
[3] V. Artigues, K. Kormann, M. Rampp, K. Reuter, Evaluation of performance portability frameworks for the implementation of a particle-in-cell code, Concurr. Comput., Pract. Exp. 32 (2020) e5640.
[4] U.M. Ascher, S.J. Ruuth, R.J. Spiteri, Implicit-explicit Runge-Kutta methods for time-dependent partial differential equations, Appl. Numer. Math. 25 (1997) 151–167.
[5] F. Baig, C. Gao, D. Teng, J. Kong, F. Wang, Accelerating spatial cross-matching on CPU-GPU hybrid platform with CUDA and OpenACC, Frontiers Big Data 3 (2020) 14.
[6] B. Barney, OpenMP, https://computing.llnl.gov/tutorials/openMP/, 2020. (Accessed 24 July 2020).
[7] B. Barney, Introduction to parallel computing, https://computing.llnl.gov/tutorials/parallel_comp/, 2020. (Accessed 24 July 2020).
[8] B. Barney, Message Passing Interface (MPI), https://computing.llnl.gov/tutorials/mpi/, 2020. (Accessed 24 July 2020).
[9] T. Brandvik, G. Pullan, Acceleration of a 3D Euler solver using commodity graphics hardware, in: 46th AIAA Aerospace Sciences Meeting and Exhibit, Reno, Nevada, US, 2008, p. 607.
[10] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, P. Hanrahan, Brook for GPUs: stream computing on graphics hardware, ACM Trans. Graph. 23 (2004) 777–786.
[11] Cascades, https://arc.vt.edu/computing/cascades/, 2020. (Accessed 12 September 2020).
[12] E. Castillo, N. Jain, M. Casas, M. Moreto, M. Schulz, R. Beivide, M. Valero, A. Bhatele, Optimizing computation-communication overlap in asynchronous task-based programs, in: Proceedings of the ACM International Conference on Supercomputing, Washington, DC, USA, 2019, pp. 380–391.
[13] D.D. Chandar, J. Sitaraman, D.J. Mavriplis, A hybrid multi-GPU/CPU computational framework for rotorcraft flows on unstructured overset grids, in: 21st AIAA Computational Fluid Dynamics Conference, San Diego, CA, US, 2013, p. 2855.
[14] PGI Compiler User's Guide, https://www.pgroup.com/resources/docs/19.10/x86/pgi-user-guide/index.htm, 2019. (Accessed 10 May 2020).
[15] A. Denis, F. Trahay, MPI overlap: benchmark and analysis, in: 2016 45th International Conference on Parallel Processing (ICPP), IEEE, Philadelphia, PA, USA, 2016, pp. 258–267.
[16] J.M. Derlaga, T. Phillips, C.J. Roy, SENSEI computational fluid dynamics code: a case study in modern Fortran software development, in: 21st AIAA Computational Fluid Dynamics Conference, San Diego, CA, US, 2013.
[17] E. Elsen, P. LeGresley, E. Darve, Large calculation of the flow over a hypersonic vehicle using a GPU, J. Comput. Phys. 227 (2008) 10148–10161.
[18] R. Farber, Parallel Programming with OpenACC, Newnes, 2016.
[19] N. Gourdain, L. Gicquel, M. Montagnac, O. Vermorel, M. Gazaix, G. Staffelbach, M. Garcia, J. Boussuge, T. Poinsot, High performance parallel computing of flows in complex geometries: I. Methods, Comput. Sci. Discov. 2 (2009) 015003.
[20] N. Gourdain, L. Gicquel, G. Staffelbach, O. Vermorel, F. Duchaine, J. Boussuge, T. Poinsot, High performance parallel computing of flows in complex geometries: II. Applications, Comput. Sci. Discov. 2 (2009) 015004.
[21] G. Hager, G. Wellein, Introduction to High Performance Computing for Scientists and Engineers, CRC Press, 2010.
[22] B. Hendrickson, T.G. Kolda, Graph partitioning models for parallel computing, Parallel Comput. 26 (2000) 1519–1534.
[23] J. Herdman, W. Gaudin, S. McIntosh-Smith, M. Boulton, D.A. Beckingsale, A. Mallinson, S.A. Jarvis, Accelerating hydrocodes with OpenACC, OpenCL and CUDA, in: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, IEEE, Salt Lake City, UT, USA, 2012, pp. 465–471.
[24] T. Hoshino, N. Maruyama, S. Matsuoka, R. Takaki, CUDA vs OpenACC: performance case studies with kernel benchmarks and a memory-bound CFD application, in: 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, IEEE, Delft, Netherlands, 2013, pp. 136–143.
[25] C.W. Jackson, W.C. Tyson, C.J. Roy, Turbulence model implementation and verification in the SENSEI CFD code, in: AIAA Scitech 2019 Forum, San Diego, CA, 2019, p. 2331.
[26] D.A. Jacobsen, I. Senocak, Multi-level parallelism for incompressible flow computations on GPU clusters, Parallel Comput. 39 (2013) 1–20.
[27] A. Jameson, W. Schmidt, E. Turkel, Numerical solution of the Euler equations by finite volume methods using Runge-Kutta time stepping schemes, in: 14th Fluid and Plasma Dynamics Conference, Palo Alto, CA, US, 1981, p. 1259.
[28] M. Jiayin, S. Bo, W. Yongwei, Y. Guangwen, Overlapping communication and computation in MPI by multithreading, in: Proc. of International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, NEV, USA, 2006.
[29] C.A. Kennedy, M.H. Carpenter, Diagonally Implicit Runge-Kutta Methods for Ordinary Differential Equations. A Review, 2016.
[30] Khronos OpenCL Working Group, The OpenCL C 2.0 specification, https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf, 2019. (Accessed 24 July 2020).
[31] Z. Krpic, G. Martinovic, I. Crnkovic, Green HPC: MPI vs. OpenMP on a shared memory system, in: 2012 Proceedings of the 35th International Convention MIPRO, IEEE, Opatija, Croatia, 2012, pp. 246–250.
[32] R. Landaverde, T. Zhang, A.K. Coskun, M. Herbordt, An investigation of unified memory access performance in CUDA, in: 2014 IEEE High Performance Extreme Computing Conference (HPEC), IEEE, Waltham, MA, US, 2014, pp. 1–6.
[33] H. Lu, S. Seo, P. Balaji, MPI+ULT: overlapping communication and computation with user-level threads, in: 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, IEEE, New York, NY, USA, 2015, pp. 444–454.
[34] L. Luo, J.R. Edwards, H. Luo, F. Mueller, Performance assessment of a multiblock incompressible Navier-Stokes solver using directive-based GPU programming in a cluster environment, in: 52nd Aerospace Sciences Meeting, National Harbor, MD, US, 2013.
[35] M. Mani, J. Ladd, A. Cain, R. Bush, An assessment of one- and two-equation turbulence models for internal and external flows, in: 28th Fluid Dynamics Conference, Snowmass Village, CO, USA, 1997, p. 2010.
[36] A.J. McCall, Multi-level Parallelism with MPI and OpenACC for CFD Applications, Master's thesis, Virginia Tech, 2017.
[37] A.J. McCall, C.J. Roy, A multilevel parallelism approach with MPI and OpenACC for complex CFD codes, in: 23rd AIAA Computational Fluid Dynamics Conference, Denver, CO, USA, 2017, p. 3293.
[38] P.D. Mininni, D. Rosenberg, R. Reddy, A. Pouquet, A hybrid MPI–OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence, Parallel Comput. 37 (2011) 316–326.
[39] Open MPI Documentation, https://www.open-mpi.org/doc/, 2020. (Accessed 10 May 2020).
[40] MPI: a message-passing interface standard, https://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf, 2015. (Accessed 20 February 2020).
[41] MVAPICH: MPI over InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE, http://mvapich.cse.ohio-state.edu/userguide/, 2020. (Accessed 10 May 2020).
[42] NewRiver, https://www.arc.vt.edu/computing/newriver/, 2019. (Accessed 12 September 2020).
[43] J. Nickolls, W.J. Dally, The GPU computing era, IEEE Micro 30 (2010) 56–69.
[44] NVIDIA, NVIDIA GPUDirect, https://developer.nvidia.com/gpudirect, 2019.
[45] NVIDIA, CUDA C++ Programming Guide, https://docs.nvidia.com/pdf/CUDA_C_Programming_Guide.pdf, 2019. (Accessed 24 July 2020).
[46] W.L. Oberkampf, C.J. Roy, Verification and Validation in Scientific Computing, Cambridge University Press, 2010.
[47] M. Otten, J. Gong, A. Mametjanov, A. Vose, J. Levesque, P. Fischer, M. Min, An MPI/OpenACC implementation of a high-order electromagnetics solver with GPUDirect communication, Int. J. High Perform. Comput. Appl. 30 (2016) 320–334.
[48] J.D. Owens, M. Houston, D. Luebke, S. Green, J.E. Stone, J.C. Phillips, GPU computing, Proc. IEEE 96 (2008) 879–899.
[49] A. Rico, J.A. Joao, C. Adeniyi-Jones, E. Van Hensbergen, ARM HPC ecosystem and the reemergence of vectors, in: Proceedings of the Computing Frontiers Conference, 2017, pp. 329–334.
[50] P.L. Roe, Approximate Riemann solvers, parameter vectors, and difference schemes, J. Comput. Phys. 43 (1981) 357–372.
[51] J.L. Steger, R. Warming, Flux vector splitting of the inviscid gasdynamic equations with application to finite-difference methods, J. Comput. Phys. 40 (1981) 263–293.
[52] The OpenACC application programming interface: version 3.0, https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC.3.0.pdf, 2019. (Accessed 8 March 2021).
[53] TinkerCliffs, https://arc.vt.edu/tinkercliffs/, 2021. (Accessed 10 March 2021).
[54] K. Vaidyanathan, D.D. Kalamkar, K. Pamnany, J.R. Hammond, P. Balaji, D. Das, J. Park, B. Joó, Improving concurrency and asynchrony in multithreaded MPI applications using software offloading, in: SC'15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, IEEE, Austin, TX, USA, 2015, pp. 1–12.
[55] B. Van Leer, Towards the ultimate conservative difference scheme. V. A second-order sequel to Godunov's method, J. Comput. Phys. 32 (1979) 101–136.
[56] B. Van Leer, Flux-vector splitting for the Euler equation, in: Upwind and High-Resolution Schemes, Springer, 1997, pp. 80–89.
[57] J. Wu, L. Fan, L. Erickson, Three-point backward finite-difference method for solving a system of mixed hyperbolic-parabolic partial differential equations, Comput. Chem. Eng. 14 (1990) 679–685.
[58] Y. Xia, J. Lou, H. Luo, J. Edwards, F. Mueller, OpenACC acceleration of an unstructured CFD solver based on a reconstructed discontinuous Galerkin method for compressible flows, Int. J. Numer. Methods Fluids 78 (2015) 123–139.
[59] W. Xue, C.W. Jackson, C.J. Roy, Multi-CPU/GPU parallelization, optimization and machine learning based autotuning of structured grid CFD codes, in: 2018 AIAA Aerospace Sciences Meeting, Kissimmee, FL, US, 2018, p. 0362.
[60] W. Xue, C.J. Roy, Multi-GPU performance optimization of a computational fluid dynamics code using OpenACC, Concurr. Comput., Pract. Exp. (2020) e6036.
[61] W. Xue, C.J. Roy, Heterogeneous computing of CFD applications on CPU-GPU platforms using OpenACC directives, in: AIAA Scitech 2020 Forum, Orlando, FL, US, 2020, p. 1046.
[62] W. Xue, H. Wang, C.J. Roy, Code verification for 3D turbulence modeling in parallel SENSEI accelerated with MPI, in: AIAA Scitech 2020 Forum, Orlando, FL, US, 2020, p. 0347.

Weicheng Xue received his Bachelor's degree in Mechanical Engineering from Huazhong University of Science and Technology, his Master's degree in Mechanical Engineering from the University of Chinese Academy of Sciences, and is currently a senior Ph.D. student in the Department of Aerospace Engineering at Virginia Tech. His professional interests focus on applying high performance computing, especially GPU computing, to accelerate computational fluid dynamics applications.

Dr. Charles Jackson received his Bachelor's and Doctorate degrees in Aerospace Engineering from Virginia Tech. His current research interests are in computational fluid dynamics, including mesh adaption, software development practices, and high performance computing, specifically GPU acceleration.

Dr. Chris Roy received an undergraduate degree in Mechanical Engineering from Duke University in 1992, a masters in Aerospace Engineering from Texas A&M in 1994, and a doctorate in Aerospace Engineering from North Carolina State University in 1998. After spending 5 years as a Senior Member of the Technical Staff at Sandia National Laboratories in Albuquerque, New Mexico, he moved to academia and is currently a full professor in the Crofton Department of Aerospace and Ocean Engineering at Virginia Tech. Dr. Roy has authored or co-authored over 200 books, book chapters, journal articles, and conference papers in the areas of computational fluid dynamics, verification, validation, and uncertainty quantification. He is the co-author of the book Verification and Validation in Scientific Computing published by Cambridge University Press in 2010.