
An update to this article is included at the end.

Applied Mathematical Modelling 85 (2020) 141–156


A GPU-based algorithm for efficient LES of high Reynolds number flows in heterogeneous CPU/GPU supercomputers

Guillermo Oyarzun a, Iason A. Chalmoukis b, Georgios A. Leftheriotis b, Athanassios A. Dimas b,∗

a Barcelona Supercomputing Center, 08034 Barcelona, Spain
b Laboratory of Hydraulic Engineering, Department of Civil Engineering, University of Patras, 26500 Patras, Greece

Article info

Article history: Received 26 September 2019; Revised 9 March 2020; Accepted 14 April 2020; Available online 3 May 2020.

Keywords: OpenACC; GPU architectures; MPI; LES; High Reynolds number flows

Abstract

An optimized MPI+OpenACC implementation model that performs efficiently in CPU/GPU systems using large-eddy simulation is presented. The code was validated for the simulation of wave boundary-layer flows against numerical and experimental data in the literature. A direct Fast-Fourier-Transform-based solver was developed for the solution of the Poisson equation for pressure, taking advantage of the periodic boundary conditions. This solver was optimized for parallel execution in CPUs and outperforms by 10 times in computational time a typical iterative preconditioned conjugate gradient solver in GPUs. In terms of parallel performance, an overlapping strategy was developed to reduce the overhead of performing MPI communications using GPUs. As a result, the weak scaling of the algorithm was improved by up to 30%. Finally, a large-scale simulation (Re = 2 × 10^5) using a grid of 4 × 10^8 cells was executed, and the performance of the code was analyzed. The simulation was launched using up to 512 nodes (512 GPUs + 6144 CPU-cores) on one of the current top 10 supercomputers of the world (Piz Daint). A comparison of the overall computational time showed that the GPU version was 4.2 times faster than the CPU one. The parallel efficiency of this strategy (47%) is competitive compared with the state-of-the-art CPU implementations, and it has the potential to take advantage of modern supercomputing capabilities.

© 2020 Elsevier Inc. All rights reserved.

1. Introduction

Simulations of coastal flows at high, prototype-scale, Reynolds numbers require the use of supercomputers in order to obtain results in reasonable computational time. However, the constant change of computing systems makes the adaptation of parallel algorithms a rather difficult task. In recent years, this evolution has been driven by the power consumption constraints of the systems [1]. Consequently, the level of intra-node parallelism has been increased by the introduction of multicore CPUs or accelerators. The consolidated trend among the current top supercomputers is the incorporation of Graphics Processing Units (GPUs) as accelerators that boost the performance within the nodes. Such systems require applications with high levels of fine-grain parallelism in order to exploit their full capacities. There is a demand for scalable algorithms with a higher degree of parallelism, as well as compatibility with fundamentally different parallel architectures.

This article belongs to the Special Issue: Applied Mathematical Modelling in Port, Coastal and Offshore Engineering.

∗ Corresponding author.
E-mail address: [email protected] (A.A. Dimas).

https://doi.org/10.1016/j.apm.2020.04.010
0307-904X/© 2020 Elsevier Inc. All rights reserved.

Computational fluid dynamics (CFD) is one of the disciplines with the potential to fully exploit the modern technologies.
The algorithms involved in CFD simulations demand fine meshes and a large number of floating-point operations to attain
accurate results in a reasonable computational time. In addition, algorithms for the imposition of the boundary conditions
on solid surfaces, like the Immersed Boundary (IB) method [2], have a complex memory footprint, which makes their portability to other computing architectures a rather difficult task. Consequently, new challenges arise in the implementation of
such CFD codes, like the need to be adapted or re-designed in order to efficiently utilize the different kinds of available
resources.
One of the early attempts to use heterogeneous systems for simple CFD simulations was focused on 3D finite-difference methods using CUDA for structured meshes and high-order schemes [3]. Compact difference schemes of sixth order in a GPU application were considered in [4], and acceleration up to 16 times in computational time with respect to a CPU execution was reported. Navier-Stokes simulations on accelerators were studied for structured meshes executed on only one GPU [5,6]. The use of multiple GPUs was first explored for the solution of the pressure linear system using a Conjugate Gradient (CG) algorithm in [7], while an approximate inverse preconditioner and ideas for minimizing the communication overhead through an overlapping technique were developed in [8]. The CG solver was accelerated 4 times with respect to a pure-CPU execution; however, this approach yielded a speedup of only 2.3 in the overall CFD execution, because the entire code was not ported to GPUs. A first implementation of the IB method was developed in [9] with special attention to the CG solver. Within the class of CFD algorithms fully executed on GPUs, successful examples, based on the implementation of an MPI+CUDA execution mode [10,11], obtained parallel speedups that range from 4 to 8 depending on the mesh size. A comparison of the GPU performance for explicit and implicit CFD simulations, showing the importance of the memory footprint in the performance of the algorithms, is presented in [12]. Additionally, simulations using RANS models have also profited from the multi-GPU approach, achieving up to 2.2 speedup [13], and recent CFD applications based on high-order discretization methods [14] or heterogeneous implementations [15,16] have been developed to use CPU/GPU supercomputers. The overall implementation in the aforementioned codes seems to be tightly coupled with the framework it relies upon. The portability of those codes requires complex procedures, large programming efforts, and the maintenance of fundamentally different versions of the code for each architecture that is targeted.
In the present work, SimuCoast, a parallel, object-oriented, in-house code, focused on the simulation of coastal flows by means of large-eddy simulations (LES), is presented. Its novel design is oriented to the simulation of high Reynolds number flows with increased computational performance using an MPI+OpenACC strategy. Specifically, the intra-node computing units (multicore CPUs or accelerators) are utilized using the OpenACC standard [17,18]. This is a directive-based parallel programming model that manages the distribution of work to the computing units by adding pragma clauses in a way similar to OpenMP. When utilizing many nodes, the inter-node information is communicated by means of MPI. This hybrid approach reduces the MPI communications, allowing the engagement of a larger number of nodes, and provides the flexibility to exploit different architectures in existing high-performance computing (HPC) facilities. The Navier-Stokes equations are discretized on a Cartesian staggered mesh, following the methodology proposed in [19], where the imposition of boundary conditions on solid surfaces is performed using the IB method [2], and turbulence closure is achieved by the Smagorinsky eddy-viscosity model [20]. Concerning the Poisson equation, which is usually the most computationally intensive part of CFD simulations, SimuCoast uses an iterative CG solver for general coastal applications (e.g., 3D wave breaking). For particular flows with two periodic directions, SimuCoast uses Fourier diagonalization for the transformation of the 3D linear system into a set of 1D subsystems, which can be solved using direct solvers [21].
The current MPI+OpenACC implementation allows selecting the computing units (CPUs or GPUs) just by switching compilation flags. The objective is to demonstrate that, if programmed correctly, the present approach can be as efficient as using a state-of-the-art programming framework (CUDA, OpenCL), but without adding extra complexity in the code. Moreover, an overlapping strategy was developed for reducing the overhead of performing MPI communications using the GPUs. The code was validated against numerical and experimental data in [22] and [23]. Then, the code was applied for the simulation of a large-scale case at a high Reynolds number of 2 × 10^5, which is considerably higher than in recent publications available for coastal flows [19,24]. Physical results are presented and compared to corresponding results of the same case at a lower Reynolds number (2 × 10^4), showing a significantly different flow behaviour. Finally, a performance analysis of the code is presented. The simulations exploiting only CPU systems were executed on the Aris supercomputer of the Greek Research & Technology Network. The GPU development and scalability tests were performed, using up to 512 nodes, on one of the current top 10 supercomputers, Piz Daint (Swiss National Supercomputing Centre), a Tier-0 system of PRACE (www.prace-ri.eu), where each node is equipped with an NVIDIA P100 GPU. The performance is demonstrated in terms of strong speedup and acceleration of the MPI+OpenACC code in comparison with a CPU-only execution.
The rest of the article is organized as follows: the formulation and the numerical methods of SimuCoast are described
briefly in Section 2; the hybrid implementation using an MPI+OpenACC approach is presented in Section 3; numerical
experiments and results are included in Section 4; the performance analysis of SimuCoast is presented in Section 5; and the
concluding remarks are stated in Section 6.

2. Methodology

In the present work, the oscillatory flow over fixed ripples was considered, which characterizes near-bed processes in the coastal zone. The free-stream velocity of the imposed oscillatory flow is assumed to model the near-bed flow of a second-order Stokes wave

U(t) = Uo [cos(ωt) + β cos(2ωt)]    (1)

where Uo = ao ω is the velocity amplitude, ao is the orbital amplitude, ω = 2π/T is the radial frequency, T is the period, and β is the skewness factor.

Table 1
The major algorithm stages in SimuCoast.

A  Computation of intermediate velocity components - Eq. (4)
B  Imposition of no-slip condition on the immersed surface - IB method
C  Computation of dynamic pressure correction by solution of Eq. (6)
D  Computation of final velocity components - Eq. (7)
In the LES approach, flow structures are separated into large, energy-containing eddies, which are directly resolved by the computational mesh, and sub-grid scale (SGS) ones, whose effect on the large eddies is modeled. Using the characteristic scales of length, ao, velocity, Uo, and pressure, 0.5ρUo² (where ρ is the fluid density), to render all variables non-dimensional, the resulting equations of motion for an incompressible flow become:

∂u_i/∂x_i = 0    (2)

∂u_i/∂t + ∂(u_i u_j)/∂x_j = −∂P_d/∂x_i − ∂τ_ij/∂x_j + (1/Re) ∂²u_i/∂x_j∂x_j + f_i    (3)
where xi are the three Cartesian coordinates, denoted hereafter as the streamwise, x, the spanwise, y, and the vertical, z, coordinates, t is the time, ui are the resolved velocity components, Pd is the dynamic pressure, τij are the SGS stresses, Re = Uo ao/ν is the Reynolds number, ν is the water kinematic viscosity, and fi represents a source term associated with the implementation of the IB method for the enforcement of no-slip boundary conditions on the bed surface.
In SimuCoast, the dynamic pressure, Pd = P + p, is defined as the sum of the imposed dynamic pressure of the external
flow, P, and the dynamic pressure correction, p, which in the present wave boundary-layer flows arises due to the effect
of the solid bed. The SGS stresses are modeled using the Smagorinsky eddy-viscosity model [20]. To model the reduction
of turbulent flow fluctuations near impermeable surfaces, the influence of the SGS stresses is damped by reducing the SGS
eddy-viscosity smoothly to become zero at the bed surface using the formula in [25].
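As an illustration of this closure, the sketch below evaluates a Smagorinsky eddy viscosity with Van Driest-type damping toward the bed; the constant values (Cs, A+) and the function signature are assumptions made for the example, not values taken from SimuCoast.

```cpp
#include <cmath>

// Minimal sketch of a Smagorinsky eddy viscosity with Van Driest wall damping.
// Cs, Aplus and all names are illustrative; SimuCoast's actual constants and
// data layout are not given in the paper.
double smagorinsky_nu_sgs(double strain_rate_mag, // |S| = sqrt(2 S_ij S_ij)
                          double delta,           // filter width, e.g. (dx*dy*dz)^(1/3)
                          double z_plus)          // wall distance in wall units
{
    const double Cs    = 0.1;   // Smagorinsky constant (assumed value)
    const double Aplus = 25.0;  // Van Driest damping constant (assumed value)

    // Damping factor goes smoothly to zero at the bed surface.
    const double damping = 1.0 - std::exp(-z_plus / Aplus);

    const double ls = Cs * delta * damping;  // damped mixing length
    return ls * ls * strain_rate_mag;        // nu_sgs = (Cs*delta*f_d)^2 |S|
}
```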
For the introduction of the rippled bed in the computational domain, which is discretized with a structured Cartesian
grid, the IB method is utilized following the methodology in [2]. The advantage of the IB method is that it allows the use
of efficient Cartesian Poisson solvers in problems with complex bed shape. The main characteristic of the implementation
of the IB method is that the bed surface is not aligned with the grid. In such cases, the solution is reconstructed, in the
vicinity of the boundary, in order to enforce the no-slip conditions.
The governing differential equations were spatially discretized using second-order, central finite-differences on a Cartesian staggered grid. The temporal discretization is achieved through a two-stage, time-splitting scheme. The intermediate
velocity, ūi , is computed explicitly in the first stage
 
(ū_i − u_i^n)/Δt = (3/2) H(u_i^n) − (1/2) H(u_i^{n−1}) − (1/2)(∂P^{n+1}/∂x_i + ∂P^n/∂x_i) + f_i^n    (4)

where n is the time-step number, Δt is the time-step, and H is the spatial operator, which includes the convective, the SGS and the viscous terms of Eq. (3), and it is introduced using a second-order Adams-Bashforth scheme. In Eq. (4), the imposed
dynamic pressure of the external flow is computed according to the expression:

∂P/∂x = Uo ω [sin(ωt) + 2β sin(2ωt)]    (5)

and it is introduced using a second-order trapezoidal rule. The dynamic pressure correction is obtained by solving the Poisson equation, whose right-hand side depends on the intermediate velocity,

∇²p^{n+1} = (∇·ū)/Δt    (6)
The computation of the velocity at the next time-step is obtained in the second stage using the computed dynamic
pressure correction

u_i^{n+1} = ū_i − Δt ∇p^{n+1}    (7)

The algorithm is structured as outlined in Table 1.
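A compact sketch of how the four stages of Table 1 chain together within one time step is given below. The field types and every helper routine (compute_H, apply_ib_forcing, solve_poisson, divergence, grad) are hypothetical placeholders standing in for the corresponding SimuCoast stages; only the overall structure follows Eqs. (4)-(7).

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using ScalarField = std::vector<double>;         // one value per cell
using VectorField = std::array<ScalarField, 3>;  // u, v, w components

// Hypothetical helpers standing in for the SimuCoast stages (not its real API):
VectorField compute_H(const VectorField &u);        // convective + SGS + viscous terms H(u)
void apply_ib_forcing(VectorField &u_bar);          // IB no-slip enforcement on the bed
ScalarField solve_poisson(const ScalarField &rhs);  // Poisson equation (6)
ScalarField divergence(const VectorField &u);       // discrete divergence
ScalarField grad(const ScalarField &p, int dir);    // discrete gradient component

// Imposed external pressure gradient of Eq. (5), acting in x only.
double dPdx(double t, double Uo, double omega, double beta) {
    return Uo * omega * (std::sin(omega * t) + 2.0 * beta * std::sin(2.0 * omega * t));
}

// One time step: stages A-D of Table 1 for the two-stage scheme of Eqs. (4)-(7).
void advance(VectorField &u, VectorField &H_old, ScalarField &p,
             double t, double dt, double Uo, double omega, double beta)
{
    const VectorField H_new = compute_H(u);
    VectorField u_bar = u;

    // Stage A: intermediate velocity, Adams-Bashforth + trapezoidal dP/dx.
    const double dP = 0.5 * (dPdx(t + dt, Uo, omega, beta) + dPdx(t, Uo, omega, beta));
    for (int d = 0; d < 3; ++d)
        for (std::size_t c = 0; c < u[d].size(); ++c)
            u_bar[d][c] += dt * (1.5 * H_new[d][c] - 0.5 * H_old[d][c])
                         - ((d == 0) ? dt * dP : 0.0);

    apply_ib_forcing(u_bar);                              // Stage B

    ScalarField rhs = divergence(u_bar);                  // Stage C: Eq. (6)
    for (double &v : rhs) v /= dt;
    p = solve_poisson(rhs);

    for (int d = 0; d < 3; ++d) {                         // Stage D: Eq. (7)
        const ScalarField gp = grad(p, d);
        for (std::size_t c = 0; c < u[d].size(); ++c)
            u[d][c] = u_bar[d][c] - dt * gp[c];
    }
    H_old = H_new;                                        // keep H(u^n) for the next step
}
```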



Fig. 1. Levels of parallelism and programming models exploited by SimuCoast in a hybrid supercomputer.

Fig. 2. Domain decomposition for a grid of 8 × 6 × 8=384 cells and R = 8 non-overlapping subdomains.

3. Implementation details

In the present study, a multilevel parallelization that combines different kinds of parallelism was implemented in SimuCoast to extend its portability to the variety of competing architectures and frameworks in the current supercomputers. In any case, the top level requires the implementation of a multiple instruction multiple data (MIMD) distributed memory parallelization in order to couple nodes of a supercomputer. At this level, the MPI standard is most commonly used for distributed memory parallelization based on a geometric domain decomposition. At the bottom level, a shared memory MIMD parallelism is needed for the multicore CPUs, or a single instruction multiple data (SIMD) one for accelerators like GPUs.

3.1. Hybrid MPI+OpenACC implementation

The strategy is based on the use of a two-level hybrid MPI+OpenACC parallelization (see Fig. 1). By doing so, the MPI
couples nodes within the distributed memory model, while the OpenACC provides portability across different architectures
(CPUs or GPUs). This approach facilitates the utilization of hybrid nodes since the code portability is simplified to just a
change of the compilation flags of OpenACC. An additional benefit is the reduction of the MPI communications. For example,
let us consider a parallel system with Cn computing nodes, where each one comprises a CPU with Ct cores. An MPI-only implementation requires the launch of Cn × Ct processes in order to exploit all the resources of the system. For the
same scenario, the present hybrid MPI+OpenACC strategy allows the launch of only Cn processes where each MPI-process
is linked to a CPU socket and spawns Ct threads. In this way, the MPI communications are reduced by a factor of Ct , since
the threads interaction is achieved by using a shared memory space.
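One common way to realize this pairing in an MPI+OpenACC code is to launch one process per node (or per GPU) and let each rank select its accelerator through the OpenACC runtime API, falling back to the multicore host when no GPU is present. The snippet below is a generic initialization sketch for this pattern, not SimuCoast's actual start-up code.

```cpp
#include <mpi.h>
#include <openacc.h>

// Sketch: bind each MPI process to one intra-node accelerator (or fall back to
// the multicore host), so that only Cn ranks per machine need to be launched.
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // Rank index within the node, obtained from a shared-memory sub-communicator.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    const int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(local_rank % ngpus, acc_device_nvidia); // one GPU per rank
    else
        acc_set_device_type(acc_device_host);                      // CPU-only build

    // ... domain decomposition, time integration ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```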
The first level of parallelization is the geometric domain decomposition. Consequently, the mesh M is decomposed into
R = Rx × Ry × Rz non-overlapping subdomains, M1 , …, MR . The goal of this partitioning is to maintain the load balance
in each subdomain. For each MPI process, the unknowns are categorized into two different sets: (1) owned unknowns are those associated with the nodes of Mi; and (2) halo unknowns are the unknowns from other subdomains coupled with owned
unknowns. The owned unknowns are also subdivided into: (1.1) inner unknowns are the owned unknowns that are coupled
only with other owned unknowns; and (1.2) interface unknowns are the owned unknowns coupled with halo unknowns. A
typical domain decomposition for a Cartesian grid using eight MPI processes is illustrated in Fig. 2, where Rx = Ry = Rz = 2.
In the second parallelization level of SimuCoast, the stencils associated with the operators of Table 1 are formed. These
stencils are data structures, which are designed to efficiently store information related to mesh topology, geometry, location
and physical properties of the flows. When working with incompressible flows, the linear operators do not change during the
simulation, therefore, the creation of the corresponding data structures becomes a preprocess with negligible computational
cost. The execution of the linear operators in SimuCoast is performed through loops that sweep all the directions of the

Fig. 3. Overlapping of the communications in the MPI+OpenACC environment.

owned computing subdomain. Within these loops, the iterations are mutually independent, which facilitates the utilization of directive-based parallel programming tools, e.g., OpenACC or OpenMP. The coherence between subdomains is achieved by communicating the halo cells before applying the linear operators; this operation is also referred to as the halo update. For 3D flows, the halo update consists of point-to-point MPI communications between the six neighboring subdomains.
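The loops described above can be annotated directly with OpenACC directives. The generic sketch below shows a 7-point Laplacian-type sweep over the owned cells of a structured block, assumed to run after the halo update with data already resident on the device; the compile lines in the comment illustrate how the same source can target a multicore CPU or a GPU (flags shown for the NVIDIA HPC compilers as an example).

```cpp
// Sketch of a directive-based stencil sweep over the owned block after the halo
// update. nx, ny, nz include one halo layer per side; data are assumed to be
// resident on the device (e.g., inside an enclosing "#pragma acc data" region).
// The same source targets different hardware by switching compiler flags, e.g.:
//   GPU build:       nvc++ -acc=gpu -O3 laplacian.cpp
//   multicore build: nvc++ -acc=multicore -O3 laplacian.cpp
void laplacian(const double *p, double *lap, int nx, int ny, int nz,
               double idx2, double idy2, double idz2)
{
    const int ntot = nx * ny * nz;
    #pragma acc parallel loop collapse(3) present(p[0:ntot], lap[0:ntot])
    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i) {
                const int c = (k * ny + j) * nx + i;          // linear cell index
                lap[c] = (p[c - 1]       - 2.0 * p[c] + p[c + 1])       * idx2
                       + (p[c - nx]      - 2.0 * p[c] + p[c + nx])      * idy2
                       + (p[c - nx * ny] - 2.0 * p[c] + p[c + nx * ny]) * idz2;
            }
}
```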
When heterogeneous CPU/GPU systems are engaged, the communication must include the data transfers from GPU to CPU. Then, the relative weight of the communications grows due to the memory transfer bottleneck, and it is necessary to explore more efficient ways to perform it. Fig. 3 demonstrates the current strategy to overlap communications with calculations for reducing this bottleneck. The idea is to perform the calculations of the inner cells while the communications are executed. This is possible because these cells are not linked with other subdomains. OpenACC provides the async clause that allows operations to be performed asynchronously on the GPU. In our communication scheme, it is used to execute data transfers and calculations concurrently. Once the halo update is completed, the code proceeds to perform the calculations for the interface cells. By doing so, part of the communication cost is hidden, and this action improves the scalability of the algorithm.
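The sketch below illustrates this overlap pattern for one generic operator application. The packing/exchange helpers, the index arrays and the device routine stencil_at are assumptions introduced for the example; they are not SimuCoast's API.

```cpp
#include <mpi.h>

// Hypothetical helpers for this sketch (not SimuCoast's API):
#pragma acc routine seq
double stencil_at(const double *f, int id);                       // stencil evaluation
void pack_interface(const double *f, double *send_buf);           // device-side packing
void unpack_halo(double *f, const double *recv_buf);              // device-side unpacking
void halo_exchange(double *send_buf, double *recv_buf,
                   int n_halo, MPI_Comm comm);                    // MPI point-to-point

// Overlap of Fig. 3: inner cells are updated asynchronously on the GPU while the
// interface values travel GPU -> host -> MPI -> host -> GPU.
void update_with_overlap(double *f, double *out, int n_total,
                         const int *inner_idx, int n_inner,
                         const int *iface_idx, int n_iface,
                         double *send_buf, double *recv_buf, int n_halo,
                         MPI_Comm cart_comm)
{
    // 1) Inner-cell sweep on asynchronous queue 1 (independent of halo data).
    #pragma acc parallel loop async(1) \
        present(f[0:n_total], out[0:n_total], inner_idx[0:n_inner])
    for (int c = 0; c < n_inner; ++c)
        out[inner_idx[c]] = stencil_at(f, inner_idx[c]);

    // 2) Concurrent halo update: device->host copy, MPI exchange, host->device copy.
    pack_interface(f, send_buf);
    #pragma acc update host(send_buf[0:n_halo])
    halo_exchange(send_buf, recv_buf, n_halo, cart_comm);
    #pragma acc update device(recv_buf[0:n_halo])
    unpack_halo(f, recv_buf);

    // 3) Wait for the inner sweep, then complete the interface cells.
    #pragma acc wait(1)
    #pragma acc parallel loop \
        present(f[0:n_total], out[0:n_total], iface_idx[0:n_iface])
    for (int c = 0; c < n_iface; ++c)
        out[iface_idx[c]] = stencil_at(f, iface_idx[c]);
}
```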

3.2. Fourier diagonalization

In linear algebra, the matrix form of the Poisson Eq. (6) is the following:
L · p = b    (8)
where L is the Laplacian operator and b is the spatial gradient of the intermediate velocity divided by the time-step. The
solution of Eq. (8) is one of the most computationally intensive parts of the simulation, since it has to be solved at each
time-integration step. For incompressible flows, where L remains constant during the simulation, the computational cost of
any pre-processing stage becomes negligible, since it has to be performed only once at the start of the simulation.
For flows with one periodic boundary condition, for example in the streamwise direction, and a constant spatial step (Δx), the couplings in the periodic direction are circulant matrices [26], and the initial algebraic system (8) can be diagonalized by means of a Fast Fourier Transformation (FFT). The spectral Laplacian operator for a case with Nx cells in the periodic direction is:
L̂_i = λ_i Ω_2D + L_2D    (9)

where i = 0, …, Nx−1, L_2D is the Laplacian operator in the plane of the two non-periodic coordinates, and λ_i is the eigenvalue multiplying the diagonal contribution Ω_2D. A general expression for the eigenvalues, found in [26,27], is:
  
λ_i = −(2/Δx) [1 − cos(2πi/Nx)]    (10)
Therefore, the original system (8) is decomposed into a set of Nx mutually independent 2D systems
L̂_i · p̂_i^2D = b̂_i^2D    (11)
where each system, hereafter denoted as “frequency” system, corresponds to a frequency (wavenumber) in the Fourier space.
For flows with two periodic boundary conditions in the horizontal directions, the Fourier diagonalization can be applied
consecutively to each direction resulting in:
L̂_ij · p̂_ij^1D = b̂_ij^1D    (12)

where i = 0, …, Nx−1 and j = 0, …, Ny−1. The original system (8) is decomposed into a set of Nx × Ny mutually independent 1D
frequency systems, which are solved using the Tridiagonal matrix algorithm (TDMA). In these cases, the algorithm of the
Poisson equation solver is summarized in Table 2. Note that the algorithm considers the Laplacian matrix to be constant
during the simulation, therefore, its change-of-basis is performed only once in a preprocessing stage before the start of the
time-integration phase.
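For reference, the snippet below shows the evaluation of the eigenvalues of Eq. (10) and a standard Thomas algorithm (TDMA) for one tridiagonal frequency system. It is a generic direct-solver sketch; the way SimuCoast actually assembles the 1D systems of Eq. (12), in particular how the eigenvalues of both periodic directions enter the diagonal, is not reproduced here.

```cpp
#include <cmath>
#include <vector>

// Eigenvalues lambda_i of the periodic x-direction coupling, Eq. (10).
std::vector<double> fourier_eigenvalues(int Nx, double dx)
{
    const double pi = std::acos(-1.0);
    std::vector<double> lambda(Nx);
    for (int i = 0; i < Nx; ++i)
        lambda[i] = -(2.0 / dx) * (1.0 - std::cos(2.0 * pi * i / Nx));
    return lambda;
}

// Thomas algorithm (TDMA) for one 1D frequency system:
// a[k]*p[k-1] + b[k]*p[k] + c[k]*p[k+1] = rhs[k], k = 0..N-1 (a[0] = c[N-1] = 0).
// b and rhs are taken by value so the caller's data are preserved.
void tdma_solve(const std::vector<double> &a, std::vector<double> b,
                const std::vector<double> &c, std::vector<double> rhs,
                std::vector<double> &p)
{
    const int N = static_cast<int>(b.size());
    for (int k = 1; k < N; ++k) {              // forward elimination
        const double m = a[k] / b[k - 1];
        b[k]   -= m * c[k - 1];
        rhs[k] -= m * rhs[k - 1];
    }
    p.resize(N);
    p[N - 1] = rhs[N - 1] / b[N - 1];
    for (int k = N - 2; k >= 0; --k)           // back substitution
        p[k] = (rhs[k] - c[k] * p[k + 1]) / b[k];
}
```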

Table 2
The algorithm steps of the two-periodic FFT (2pFFT) Poisson solver.

1  Transform b into b′ = FFTy(b)
2  Transform b′ into b″ = FFTx(b′)
3  For j = 1 to Ny do
     For i = 1 to Nx do
       Solve L̂_ij · p̂_ij^1D = b̂_ij^1D
     end for
   end for
4  Transform p″ into p′ = FFTx⁻¹(p″)
5  Transform p′ into p = FFTy⁻¹(p′)

Table 3
The operation types of the algorithm of the parallel 2pFFT Poisson solver.

Type  Operation
iii   Convert 3D partition to slab/pencil partition of b (MPI_Alltoallv)
i     Steps 1-2 of Table 2
ii    Step 3 of Table 2
i     Steps 4-5 of Table 2
iii   Convert slab/pencil partition to 3D partition of p (MPI_Alltoallv)

Fig. 4. Left: Slab partitioning (Rx = 1, Ry = 1, Rz = 8). Right: 2D-pencil partitioning (Rx = 2, Ry = 1, Rz = 4).

3.3. New domain decomposition and solution of the frequency systems

In SimuCoast, the optimization of the 2pFFT algorithm (Table 2) concerns two types of operations: (i) the parallelization of the change-of-basis from the physical to the spectral space and vice versa (steps 1, 2, 4, and 5); and (ii) the solution of the frequency systems (step 3). Note that the FFT is a very communication-intensive algorithm, and the common approach in parallel solvers is to use its sequential version. The idea is to keep the spanwise and streamwise directions of the mesh unpartitioned. The original mesh M is divided into Rz subdomains along the z-direction by configuring a slab partitioning (Fig. 4, left). In this way, the spanwise and streamwise subvectors of any variable are not split between different processes, and a sequential FFT algorithm can be used [28]. An alternative way of accomplishing the same is to use multiple 2D-pencil partitions (Fig. 4, right) in which only one of the directions is not split. These data transformations are executed before applying the FFT transform along the non-split directions. Nevertheless, these partitioning strategies are not necessarily the best choice for the other stages of the code. Consequently, their usage is restricted to the Poisson solver, and the associated data rearrangements are performed before and after the application of the Poisson solver. These data rearrangements are denoted as type (iii) operations of the parallel 2pFFT Poisson solver and are implemented by means of MPI_Alltoallv calls operated across all the processes within the MPI communicator (Table 3). To keep optimal performance, new MPI communicators need to be created associating groups of neighboring subdomains.
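A minimal sketch of such a type (iii) rearrangement is given below. The assumption of equal-sized exchange blocks and the way the sub-communicator is obtained are illustrative simplifications; they do not reproduce SimuCoast's actual bookkeeping of counts and displacements.

```cpp
#include <mpi.h>
#include <vector>

// Sketch of a type (iii) repartitioning step (Table 3): the locally owned part
// of b is redistributed into a pencil partition with a collective exchange.
// How send/recv counts are derived from the two partitions is problem specific;
// the uniform-block version below is only an illustration.
void repartition_to_pencils(const std::vector<double> &b_3d,   // 3D-partition data
                            std::vector<double> &b_pencil,     // pencil-partition data
                            MPI_Comm pencil_comm)              // sub-communicator
{
    int nprocs;
    MPI_Comm_size(pencil_comm, &nprocs);

    // Assume (for illustration) that every rank exchanges equal-sized blocks.
    const int block = static_cast<int>(b_3d.size()) / nprocs;
    std::vector<int> counts(nprocs, block), displs(nprocs, 0);
    for (int r = 1; r < nprocs; ++r) displs[r] = displs[r - 1] + counts[r - 1];

    b_pencil.resize(static_cast<std::size_t>(block) * nprocs);
    MPI_Alltoallv(b_3d.data(),     counts.data(), displs.data(), MPI_DOUBLE,
                  b_pencil.data(), counts.data(), displs.data(), MPI_DOUBLE,
                  pencil_comm);
}

// The sub-communicators grouping neighbouring subdomains can be created once,
// e.g. by splitting MPI_COMM_WORLD with a colour shared by the ranks of a pencil:
//   MPI_Comm_split(MPI_COMM_WORLD, pencil_colour, world_rank, &pencil_comm);
```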

4. Numerical results

In order to validate SimuCoast, first a benchmark case for the transient development of oscillatory flow over a rippled
wall [22] was simulated. Moreover, the code was also validated in comparison to the experimental measurements of the
turbulent oscillatory flow over a rippled wall in [23]. Finally, the code was applied to a high Reynolds number case, and
corresponding physical and performance results are presented. In all cases, the boundary conditions for both intermediate
and final velocities were: zero Dirichlet at the bottom of the domain (below the bed), zero Neumann (rigid lid) at the top,
and periodic along the x and y directions.

Fig. 5. Instantaneous spanwise vorticity at t = T: present results (left), and the results in [22] (right). Dashed contours (left) or bold contours (right)
correspond to negative vorticity, solid contours to positive vorticity, and the contour interval is 3.75.

4.1. Validation against numerical results

First, SimuCoast was set up to reproduce the numerical simulation of pure oscillatory (β = 0) flow over two-dimensional
(2-D) ripples in [22], where the instantaneous spanwise vorticity field during the first wave cycle was presented and discussed. The relative ripple length and height were Lr = 1.33·ao and hr = 0.2·ao, respectively. The shape of the rippled bottom
was described as in [29]:
x = ξ − (1/2) hr sin(2πξ/Lr),   z = (1/2) hr cos(2πξ/Lr)    (13)
The size of the computational domain was equal to one ripple length in both horizontal directions and two ripple lengths in the vertical direction. The grid was uniform with Nx × Ny × Nz = 128 × 128 × 250 = 4 × 10^6 cells. The grid spacing in each direction was set as: Δx/ao = Δy/ao = Δz/ao = 0.01. The flow started from rest and the corresponding Reynolds number was Re = 1,250. As can be seen in Fig. 5, the results are in excellent agreement with the previously mentioned
numerical simulations.

4.2. Validation against experimental measurements

The SimuCoast code was also validated against the experimental measurements of wave boundary-layer flow (Re = 23,163
and β = 0.1) over fixed ripples in [23], where the ripples were shaped as circular arcs with sharp crests. The radius of the
circular arc was
r = Lr²/(8·hr) + 0.5·hr    (14)

where Lr = 2.2·ao and hr = 0.35·ao. The specific oscillatory flow case corresponds to turbulent flow; thus, a fine grid
discretization was required. The size of the computational domain was set equal to two ripple lengths in all directions.
The grid was uniform in the horizontal directions, while it varied in the vertical direction, being finer close to the ripple
crests, with a total of Nx × Ny × Nz = 512 × 128 × 490 = 32 × 10^6 cells. The grid spacing in each direction was set as: Δx/ao = 0.009, Δy/ao = 0.034, and 0.0034 < Δz/ao < 0.034.
The simulation started with fluid at rest, 10 wave periods were required for fully turbulent conditions to be established,
and another 10 wave periods were used for phase-averaging, which was sufficient, since spanwise-averaging was performed
as well. The phase- and spanwise-averaged streamwise velocity and its root mean square (RMS) fluctuations profiles over
the ripple crest are presented in Fig. 6, at the phases with maximum onshore (ωt = 0o ) and offshore (ωt = 180o ) velocity
and after the flow reversals (ωt = 80o and 270o ). The numerical results are compared to the experimental data in [23] with
overall good agreement.
Similar computational studies of wave boundary-layer flows over fixed ripples at modest Reynolds numbers have been
reported recently in the literature [19,24]. However, the realistic-scale situations are large-scale flows at high Reynolds numbers, whose numerical simulation demands very fine discretization, which makes the use of HPC mandatory. In the following
section, a large-scale simulation is presented using the proposed parallel methodology that efficiently uses heterogeneous
CPU/GPU supercomputers.

4.3. Large-scale flow simulation

Experimental results (experiment Mr5b63) of oscillatory flow and sediment transport over ripples were presented in [30]. Specifically, an initially flat sandy bed evolved into a relatively stable rippled bed under oscillatory flow with Uo = 0.54 m/s, ao = 0.195 m, β = 0.176 at Re = 2 × 10^5. The resulting ripple dimensions were: hr = 0.076 m, Lr = 0.41 m, and hr/Lr = 0.19 (ripple steepness). In the present work, this flow was numerically simulated assuming fixed ripples shaped exactly as modified in [31] in order to be periodic along the x axis. The aim of this case was to verify that SimuCoast is able to simulate large-scale flow conditions.

Fig. 6. Phase and spanwise-averaged streamwise velocity (top) and RMS streamwise velocity (bottom) profiles over the ripple crest at four wave phases.
Comparison of present numerical results (solid line) with the experimental data (symbols) in [23].

The computational domain with the immersed rippled boundary is presented in Fig. 7. The streamwise length, Lx , and
the spanwise width, Ly , of the computational domain were set equal to two ripple lengths. The specific width was found
to be sufficiently large to capture the development of turbulent structures after computing the two-point autocorrelation
functions for all three velocity components at several locations on the vertical plane x-z. It was found that the two-point
autocorrelation becomes small enough (< 0.1) within a spanwise distance of half the domain width. The height, Lz, of the computational domain was set equal to 4.5·ao to ensure that the upper boundary does not affect the development of the boundary layer and flow separation over the ripples.
Two cases were simulated, one at Reynolds number Re = 2 × 10^5, as in the experiment Mr5b63 in [30], and one at Reynolds number Re = 2 × 10^4, one order of magnitude smaller, in order to demonstrate that the physics in large-scale
flows cannot be revealed by simulation of modest-scale flows. After a mesh sensitivity analysis, the final computational

Fig. 7. Sketch of the computational domain with the Cartesian grid (shown every 8th node) used in the numerical simulations. The rippled bed is immersed
in the Cartesian grid.

grid consisted of 512 × 128 × 384 nodes for the case of Re = 2 × 10^4 and 1024 × 256 × 1536 nodes for the case of Re = 2 × 10^5. The grid spacing in both cases was uniform in the horizontal directions, while it was non-uniform in the vertical direction with the finer resolution near the rippled bed. The corresponding values in wall units were Δx+ ≤ 5, Δy+ ≤ 18, and Δz+ ≤ 3 for the case of Re = 2 × 10^4 and Δx+ ≤ 8, Δy+ ≤ 33, and Δz+ ≤ 5 for the case of Re = 2 × 10^5. These values are sufficient to obtain a good resolution of the viscous sublayer and the boundary-layer flow structures. The SGS eddy viscosity, νsgs, satisfied νsgs/ν < 0.7 always and everywhere for the case of Re = 2 × 10^4, while it was νsgs/ν < 1 for the case of Re = 2 × 10^5. The specific values suggest that the grid discretization was fine enough to resolve the turbulent flow structures, and the contribution of the SGS eddies is negligible [32]. The convective (CFL) and viscous (VSL) constraints were used for the selection of the computational time-step: CFL < 0.1 and VSL < 0.003 for the case of Re = 2 × 10^4 and CFL < 0.1 and VSL < 0.001 for the case of Re = 2 × 10^5.
Similar to the validation cases, the simulations started with fluid at rest, the first 10 wave periods were required for fully turbulent conditions to be established, and another 10 wave periods were used for averaging. The phase- and spanwise-averaged velocity, vorticity, and turbulent kinetic energy (TKE) for both cases are presented in Fig. 8 at two phases (T/16 and 10T/16). Bagnold [34] characterized as "vortex ripples" those associated with the development of coherent vortices, which are generated at the lee side of the ripple during each half wave-cycle and are lifted upwards during flow reversal [22]. According to Sleath [29], the steepness of vortex ripples is higher than about 0.1. This behavior was also observed in the present simulations, as shown by the velocity and vorticity fields in Fig. 8, since the corresponding hr/Lr = 0.19. The vortices generated by onshore (T/16) and offshore (10T/16) flow separation at the ripple crest are stronger in the case of Re = 2 × 10^5, while they are more diffused and lifted farther upwards in the case of Re = 2 × 10^4. During the phase of the maximum onshore free-stream velocity (T/16), TKE is generated by flow separation in the region onshore of the ripple crest. For the case of Re = 2 × 10^5, it is more intense with higher maximum values, while for the case of Re = 2 × 10^4, it is observed higher up and is more diffused. The same applies during the phase of the maximum offshore free-stream velocity (10T/16), where TKE is generated by flow separation in the region offshore of the ripple crest. Finally, the magnitude and elevation of both vorticity and TKE are increased in the phase of the maximum onshore free-stream velocity in comparison to the offshore one due to the imposed external flow skewness (Eq. (1)).
From the above discussion, it is apparent that there are significant differences between the high and the modest Reynolds
number cases. Therefore, the accurate quantitative prediction of flow parameters in the coastal environment requires the
corresponding numerical simulations to be performed at high Reynolds numbers as close as possible to the prototype scale
one.

5. Performance analysis

5.1. Profiling

In terms of analyzing its performance, the SimuCoast algorithm is divided into: (I) the explicit stages that loop over
the mesh in order to calculate discretized operators (stages A, B and D in Table 1); and (II) the Poisson solver (stage C in
Table 1). A typical profiling of a SimuCoast application is depicted in Fig. 9. The test case corresponds to the oscillatory
flow over the bed geometry of Section 4.3 at Re = 2 × 10^4 executed on one node of the Aris supercomputer. An iterative solver was used for the Poisson equation, while the discretization mesh had about 25 × 10^6 cells. In this case, the Poisson
solver takes up to 90% of the simulation time, so its optimization was the first priority. Regarding the other operations, the

Fig. 8. Phase- and spanwise-averaged velocity and spanwise vorticity (a, b, e, f), and turbulent kinetic energy (c, d, g, h) for the cases of Re = 2 × 10^4 (left) and Re = 2 × 10^5 (right).

Fig. 9. Typical profiling of the simulation of the test case in Section 4.3 at Re = 2 × 10^4.

Fig. 10. Relative speedup of the OpenACC implementation with respect to the MPI-only one for the intermediate velocities (top) and the vector-based operations (bottom).

calculation of the intermediate velocities was the most time-consuming stage, and it was also targeted for optimization.
Note that the IB method and the calculation of the final velocities require approximately the same computational time. This
happens because the IB method is applied to a fixed bed, and thus, its time-consuming setup calculations are performed
only during the preprocessing stage.

5.2. Intra-node performance

The optimization strategy consisted of creating an implementation with the capability to exploit all the computing resources available on a node of a modern supercomputer. All the stages of the algorithm presented in Table 1 were ported to OpenACC in order to have a unified code in which switching between platforms can be controlled only by changing compilation flags.
First, the intra-node performance study was focused on the intermediate velocity computations (stage A in Table 1) and the vector-based operations (stages B and D in Table 1). Four implementations of the code were tested: MPI-only, MPI+OpenMP, MPI+OpenACC (CPU) and MPI+OpenACC (GPU). The MPI-only version was used as reference. In these implementations, the number of tasks was equal to the number of CPU-cores, thus utilizing all the resources of the node. These comparison tests were performed on a single node of Piz Daint, where GPU implementation was possible. The relative speedup of the different implementations is presented in Fig. 10 for 4 different grid sizes. Note that the MPI+OpenACC (CPU) has nearly the same performance as MPI+OpenMP, proving that its use is reasonable for exploiting the CPU resources. On the other hand, the MPI+OpenACC (GPU) outperforms the MPI-only version by 2.6 times for the intermediate velocity computations (Fig. 10 - top). The corresponding speedup for the vector-based operations is 4.7 (Fig. 10 - bottom).

Fig. 11. Relative speedup of the PCG Poisson solver using GPUs with respect to CPUs.

Fig. 12. Profiling of the FFT-based direct Poisson solver for a grid of 10^8 cells. Left: slab-partitioning. Right: pencil-partitioning.

For the Poisson solver, the preconditioned conjugate gradient (PCG) method was used as reference. The PCG algorithm
comprises linear algebra operations that can be easily ported to any computing model. The performance of PCG resembles
the performance of the vector-based operations shown in Fig. 10 (bottom). The main algorithm difference is the existence
of sparse matrix vector multiplication in the PCG method; such an operation reuses some components of the input vector.
Consequently, the arithmetic intensity of the PCG is higher than that of the vector-based operations, and therefore its GPU performance is boosted. The relative speedup of the MPI+OpenACC (GPU) implementation of the PCG with respect to the
MPI-only one is shown in Fig. 11. The acceleration increases with increasing mesh size up to 6.7 times faster than the
MPI-only (CPU) implementation.
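To make the kernel mix behind these measurements concrete, a generic Jacobi-preconditioned CG is sketched below: each iteration performs one sparse matrix-vector product plus a handful of vector updates and dot products, which is what drives the arithmetic intensity discussed above. It is not SimuCoast's PCG implementation, and the diagonal (Jacobi) preconditioner is chosen only to keep the example self-contained.

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec &a, const Vec &b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Generic Jacobi-preconditioned CG for L·p = b. spmv(x) must return L·x and
// diag holds the diagonal of L. p holds the initial guess on entry and the
// solution on exit; the return value is the iteration count.
template <class SpMV>
int pcg(SpMV spmv, const Vec &diag, const Vec &b, Vec &p,
        double tol = 1e-8, int max_iter = 10000)
{
    const std::size_t n = b.size();
    Vec r(n), z(n), d(n), q(n);

    Vec Ap = spmv(p);
    for (std::size_t i = 0; i < n; ++i) r[i] = b[i] - Ap[i];   // r = b - L·p
    for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / diag[i]; // z = M^{-1} r
    d = z;
    double rz = dot(r, z);
    const double bnorm = std::sqrt(dot(b, b));

    for (int it = 0; it < max_iter; ++it) {
        if (std::sqrt(dot(r, r)) <= tol * bnorm) return it;    // converged
        q = spmv(d);                                           // dominant kernel (SpMV)
        const double alpha = rz / dot(d, q);
        for (std::size_t i = 0; i < n; ++i) { p[i] += alpha * d[i]; r[i] -= alpha * q[i]; }
        for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / diag[i];
        const double rz_new = dot(r, z);
        const double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < n; ++i) d[i] = z[i] + beta * d[i];
    }
    return max_iter;
}
```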

5.3. Poisson solver improvements

To achieve an expected level of accuracy, the PCG method requires a certain number of iterations. For complex simulations at high Reynolds numbers, finer grids are required to fulfill the CFL condition, and therefore, a larger equation system has to be solved at each time-integration step. In Krylov methods, such as the PCG, the number of iterations required to converge is directly associated with the number of unknowns in the system [35]. Note that the number of iterations
can also increase if the matrix coefficients change [36] as in two-phase flow simulations. In any case, the iterative Poisson
solver becomes the most computationally demanding stage of the code ranging from 60% to 90% of the time [37].
A better choice for cases with two periodic boundary conditions is to use a direct Poisson solver as presented in
Section 3.2. In Fig. 12 (left), the computational time of each of the three operation types of the 2pFFT Poisson solver (Table 3)
using the slab-partitioning is shown. The numerical results were obtained for a grid of 10^8 cells and using up to 16 nodes of the Aris supercomputer (320 CPU-cores). The FFTs and the 1D-solvers scale up reasonably well from 1 to 16 nodes with a

Fig. 13. Strong speedup of Poisson solvers for a mesh with 10^8 cells.

Table 4
Computational time per cell per timestep for the mesh case of 512 × 256 × 256 cells using 1 to 16 computing nodes.

Computing nodes        1                 2                 4                 8                 16
Local mesh per node    512 × 256 × 256   512 × 128 × 256   256 × 128 × 256   256 × 128 × 128   128 × 128 × 128
μs/cell/timestep       18.22E-02         9.55E-02          4.90E-02          3.64E-02          2.33E-02

parallel efficiency of 78% and 92%, respectively. On the other hand, the slab-partitioning operation becomes the bottleneck
of the algorithm with a lower parallel efficiency of only 7%, because it comprises collective MPI communications that involve
all the MPI processes.
The same test was performed using the pencil-partitioning as depicted in Fig. 12 (right). The main difference with
the slab-partitioning is that the MPI communications are executed in sub-groups of processes. In addition, the pencil-
partitioning is executed twice, once for each periodic direction. Consequently, additional communications are performed,
but for a reduced set of processes. This introduces an overhead of up to 60% of the execution time in the communication algorithm, which is reflected in the performance when using a small number of nodes. However, the overall scalability of the
algorithm is improved by 5.7 times when using 16 nodes (40% parallel efficiency). Hence, the pencil-partitioning is more
appropriate than slab-partitioning for large-scale simulations with many computer nodes.
The strong speedup of the iterative and direct Poisson solvers using 10^8 cells is shown in Fig. 13. As expected, the PCG, which includes only point-to-point communications, scales better than both partitioning versions of the 2pFFT Poisson solver. The 2pFFT scalability using the pencil-partitioning is up to 3.8 times better than the one with slab-partitioning. Finally, the pencil-partitioning shows a parallel efficiency (speedup/nodes) of about 50%.
Note that, despite the differences in scalability, the direct Poisson solver with pencil-partitioning outperforms the iterative one by up to 68 times in computational time, as shown in Fig. 14. The main reason is the large number of iterations required by PCG to reach an accuracy comparable to that of the 2pFFT. Moreover, the 2pFFT Poisson solver executed in CPUs was still approximately 10 times faster than the PCG one executed in GPUs. Considering this acceleration, the relative weight of the Poisson solver in the computational time-step was reduced to the range of 10-15%, and therefore further optimizations were needed for the other stages of the algorithm (Table 1).

5.4. Inter-node performance

The inter-node performance of the non-Poisson stages of the SimuCoast algorithm (stages A, B and D in Table 1) was analyzed with the goal of optimizing its parallel efficiency. Only the GPU version of the code had potential for improvement, since the CPU code was already optimized. The impact of the overlapping strategy presented in Section 3.1 was tested on the Piz Daint supercomputer using up to 16 nodes. The weak speedup of the non-Poisson stages of the algorithm is shown in Fig. 15. Two constant workloads of 3.2 × 10^6 and 6.4 × 10^6 cells per node were tested. The overlapping strategy seems to be effective in reducing the overhead of the communications. As a result, the scaling was up to 30% better than without using this strategy.
The computational time per grid cell per time step (μs/cell/timestep) for the mesh case of 512 × 256 × 256 cells is
presented in Table 4 using 1 to 16 computing nodes. Using more computing nodes means that each one works with a
smaller part of the computational grid (local) due to the decomposition algorithm explained in Section 3. A comparison

Fig. 14. Computational acceleration of the 2pFFT Poisson solver with pencil-partitioning in comparison to the PCG one.

Fig. 15. Slowdown (weak speedup) of the non-Poisson operations using overlapping for two workloads: 3.2 × 10^6 cells/node (left) and 6.4 × 10^6 cells/node (right).

with another code [33] that uses OpenMP for a mesh case of 512 × 128 × 256 cells on a single computational node, shows
that the present solution is 8.4 times faster for the same local mesh. If we use 8 times more computing nodes, we obtain an acceleration of 34.3 times with respect to that code.
Finally, the strong speedup of the whole time-step of the algorithm (all four stages in Table 1) is shown in Fig. 16. The
scalability test was performed in the Piz Daint supercomputer engaging up to 512 computing nodes (6144 CPU-cores). The
strong speedup was measured for two implementations of the code: CPU and GPU. In both cases, the 2pFFT Poisson solver
with pencil-partitioning was utilized, and it was executed completely in CPU. The non-Poisson stages of the algorithm were
computed using either the CPU or the GPU implementation of the MPI+OpenACC. The CPU version of the code has better
scaling (61%) than the GPU one (47%). The slowdown in the GPU version occurs because the halo update requires additional
memory transfers between CPU and GPU memory spaces. Nevertheless, a comparison of the overall computational time
shows that the GPU code runs 4.2 times faster than the CPU one. This result is in agreement with other CFD codes running
in heterogeneous systems [11,28], and proves that SimuCoast has the potential to take advantage of modern supercomputing
capabilities. The development of an optimized 2pFFT Poisson solver in GPUs should be further exploited in the future.

Fig. 16. Strong speedup for a mesh with 10^8 cells at the Piz Daint supercomputer.

6. Conclusions

An MPI+OpenACC strategy was implemented in SimuCoast in order to achieve increased computational performance in modern supercomputers and flexibility in switching between computing systems (CPUs or GPUs). This strategy was proven to be competitive compared with the state-of-the-art CPU implementations in CFD codes (MPI, MPI+OpenMP). Specifically, an acceleration of 4.7 was achieved for the computations in the explicit stages of the code when using the GPUs. This result agrees with other CFD implementations running in GPUs using CUDA. The implementation was further complemented with an overlapping strategy that allowed hiding up to 30% of the communication overheads derived from the GPU memory transfers. For the Poisson solver, a CPU, FFT-based, direct solver was implemented for flows with periodic boundary conditions; the solver outperformed the PCG in GPU by up to 10 times in computational time. Consequently, the optimal solution consists in running the Poisson solver on the CPU and the other stages of the algorithm on the GPU. The numerical tests show that the developed hybrid CPU/GPU implementation accelerates the CPU-only solution by up to 4.2 times, when using up to 512 hybrid nodes (6144 CPU-cores) of the Piz Daint supercomputer. Moreover, its parallel efficiency (47%) is competitive compared with the state-of-the-art CPU implementations, and it has the potential to take advantage of modern supercomputing capabilities.
The code was successfully validated in comparison to available numerical and experimental data in the literature, at low to modest Reynolds numbers. Furthermore, a prototype-scale oscillatory flow at Re = 2 × 10^5 was simulated, and the differences in comparison to the results of the same flow at Re = 2 × 10^4 were demonstrated. The prototype-scale flow was simulated in reasonable computational time. Specifically, about 3 hours were required for the simulation of one wave cycle at Re = 2 × 10^5 (mesh size = 4 × 10^8 cells) on 512 hybrid nodes of the Piz Daint supercomputer.

Acknowledgments

This work was financially supported by the Initial Training Network SEDITRANS (GA number: 607394), implemented
within the 7th Framework Programme of the European Commission under call FP7-PEOPLE-2013-ITN. The CPU simulations
were performed on the Aris supercomputer of the Greek Research & Technology Network (GRNET) under the coastHPC
project. The GPU development and scalability tests were performed in the context of a PRACE Type D project N°2010PA3748
at the hybrid Piz Daint (CSCS) nodes. The authors thankfully acknowledge these institutions.

References

[1] J. Dongarra, P. Beckman, T. Moore, The international exascale software project roadmap, Int. J. High Perform. Comput. Appl. 25 (2011) 3–60, doi:10.1177/1094342010391989.
[2] E. Balaras, Modeling complex boundaries using an external force field on fixed Cartesian grids in large-eddy simulations, Comput. Fluids 33 (2004) 375–404, doi:10.1016/S0045-7930(03)00058-6.
[3] P. Micikevicius, 3D finite-difference computation on GPUs using CUDA, in: GPGPU-2: Proceedings of the 2nd Workshop on General Purpose Processing
on Graphics Processing Units, Washington, DC, USA, 2009, pp. 79–84, doi:10.1145/1513895.1513905.
[4] B. Tutkun, F.O. Edis, A GPU application for high-order compact finite difference scheme, Comput. Fluids 55 (2012) 29–35, doi:10.1016/j.compfluid.2011.10.016.
[5] E. Elsen, P. LeGresley, E. Darve, Large calculation of the flow over a hypersonic vehicle using a GPU, J. Comput. Phys. 227 (2008) 10148–10161, doi:10.1016/j.jcp.2008.08.023.
[6] A. Alfonsi, S. Ciliberti, M. Mancini, L. Primavera, Performances of Navier-Stokes solver on a hybrid CPU/GPU computing system, in: V. Malyshkin (Ed.), Parallel Computing Technologies, Lecture Notes in Computer Science, Springer, Berlin Heidelberg, pp. 404–416, doi:10.1007/978-3-642-23178-0_35.
[7] A. Cevahir, A. Nukada, S. Matsuoka, Fast conjugate gradients with multiple GPUs, in: G. Allen, J. Nabrzyski, E. Seidel, G.D. van Albada, J. Dongarra, P.M.A. Sloot (Eds.), Computational Science – ICCS 2009, Lecture Notes in Computer Science 5544, Springer, Berlin, Heidelberg, 2009, doi:10.1007/978-3-642-01970-8_90.
[8] G. Oyarzun, R. Borrell, A. Gorobets, A. Oliva, MPI-CUDA sparse matrix-vector multiplication for the conjugate gradient method with an approximate inverse preconditioner, Comput. Fluids 92 (2014) 244–252, doi:10.1016/j.compfluid.2013.10.035.
[9] B. Tutkun, F.O. Edis, An implementation of the direct-forcing immersed boundary method using GPU power, Eng. Appl. Comput. Fluid Mech. 11 (2017)
15–29, doi:10.1080/19942060.2016.1236749.
[10] D.A. Jacobsen, I. Senocak, Multi-level parallelism for incompressible flow computations on GPU clusters, Parallel Comput. 39 (2013) 1–20, doi:10.1016/j.parco.2012.10.002.
[11] G. Oyarzun, R. Borrell, A. Gorobets, O. Lehmkuhl, A. Oliva, Direct numerical simulation of incompressible flows on unstructured meshes using hybrid
CPU/GPU supercomputers, Procedia Eng 61 (2013) 87–93, doi:10.1016/j.proeng.2013.07.098.
[12] M. Aissa, T. Verstraete, C. Vuik, Toward a GPU-aware comparison of explicit and implicit CFD simulations on structured meshes, Comput. Math. with
Appl. 74 (2017) 201–217, doi:10.1016/j.camwa.2017.03.003.
[13] M.T. Nguyen, P. Castonguay, E. Laurendeau, GPU parallelization of multigrid RANS solver for three-dimensional aerodynamic simulations on multiblock grids, J. Supercomput. 75 (2019) 2562–2583, doi:10.1007/s11227-018-2653-6.
[14] K.I. Karantasis, E.D. Polychronopoulos, J.A. Ekaterinaris, High order accurate simulation of compressible flows on GPU clusters over software distributed
shared memory, Comput. Fluids 93 (2014) 18–29, doi:10.1016/j.compfluid.2014.01.005.
[15] X. Liu, Z. Zhong, K. Xu, A hybrid solution method for CFD applications on GPU-accelerated hybrid HPC platforms, Future Gener. Comp. Sy. 56 (2016)
759–765, doi:10.1016/j.future.2015.08.002.
[16] R. Borrell, D. Dosimont, M. Garcia-Gasulla, G. Houzeaux, O. Lehmkuhl, V. Mehta, H. Owen, M. Vázquez, G. Oyarzun, Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: application to airplane aerodynamics, Future Gener. Comp. Sy. 107 (2020) 31–48, doi:10.1016/j.future.2020.01.045.
[17] R. Farber, Parallel Programming with OpenACC (1 ed.), Morgan Kaufmann Publishers Inc, San Francisco, CA, USA, 2016.
[18] S. Chandrasekaran, G. Juckeland, OpenACC for Programmers: Concepts and Strategies, 1st Ed., Addison-Wesley Professional, 2017.
[19] A.A. Dimas, G.A. Leftheriotis, Mobility parameter and sand grain size effect on sediment transport over vortex ripples in the orbital regime, J. Geophys.
Res. Earth Surf. 124 (2019) 2–20, doi:10.1029/2018JF004741.
[20] J. Smagorinsky, General circulation experiments with the primitive equations, Monthly Weather Rev. 91 (1963) 99–165, doi:10.1175/1520-0493(1963)091<0099:GCEWTP>2.3.CO;2.
[21] R.B. Wilhelmson, J.H. Ericksen, Direct solutions for Poisson's equation in three dimensions, J. Comput. Phys. 25 (1977) 319–331, doi:10.1016/0021-9991(77)90001-8.
[22] P. Blondeaux, G. Vittori, Vorticity dynamics in an oscillatory flow over a rippled bed, J. Fluid Mech. 226 (1991) 257–289, doi:10.1017/S0022112091002380.
[23] J. Fredsoe, K.H. Andersen, B.M. Sumer, Wave plus current over a ripple-covered bed, Coastal Eng. 38 (1999) 177–221, doi:10.1016/S0378-3839(99)00047-2.
[24] A. Önder, J. Yuan, Turbulent dynamics of sinusoidal oscillatory flow over a wavy bottom, J. Fluid Mech. 858 (2019) 264–314, doi:10.1017/jfm.2018.754.
[25] E.R. Van Driest, On turbulent flow near a wall, J. Aeronaut. Sci. 23 (1956) 1007–1011, doi:10.2514/8.3713.
[26] P.J. Davis, Circulant Matrices, 2nd Edition, Chelsea Publishing, New York, 1994.
[27] R.M. Gray, Toeplitz and circulant matrices: a review, Foundations and Trends in Communications and Information Theory 2 (2006) 155–239, doi:10.1561/0100000006.
[28] G. Oyarzun, R. Borrell, A. Gorobets, A. Oliva, Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers, Int.
J. Comput. Fluid D. 31 (2017) 396–411, doi:10.1080/10618562.2017.1390084.
[29] J.F.A. Sleath, Sea Bed Mechanics, John Wiley & Sons, New York, 1984.
[30] J.J. van der Werf, J.S. Doucette, T. O'Donoghue, J.S. Ribberink, Detailed measurements of velocities and suspended sand concentrations over full-scale ripples in regular oscillatory flow, J. Geophys. Res. 112 (2007), doi:10.1029/2006JF000614.
[31] J.J. van der Werf, V. Magar, J. Malarkey, K. Guizien, T. O’Donoghue, 2DV modelling of sediment transport processes over full-scale ripples in regular
asymmetric oscillatory flow, Cont. Shelf. Res. 28 (2008) 1040–1056, doi:10.1016/j.csr.2008.02.007.
[32] P. Sagaut, Large eddy simulation for incompressible flows, Springer, Berlin, 2006.
[33] D.G.E. Grigoriadis, A.A. Dimas, E. Balaras, Large-eddy simulation of wave turbulent boundary layer over rippled bed, Coast. Eng. 60 (2012) 174–189,
doi:10.1016/j.coastaleng.2011.10.003.
[34] R.A. Bagnold, Motion of waves in shallow water. Interaction between waves and sand bottoms, Proc. of the Royal Society of London, Series A, Mathematical and Physical Sciences 187 (1946) 1–18, doi:10.1098/rspa.1946.0062.
[35] M.R. Hestenes, E. Stiefel, Methods of conjugate gradients for solving linear systems, J. Res. Nat. Bur. Standards 49 (1952) 409–436, doi:10.6028/jres.049.044.
[36] J. Kim, S. Sastry, S. Shontz, A numerical investigation on the interplay amongst geometry, meshes, and linear algebra in the finite element solution of elliptic PDEs, Eng. Comput. 28 (2012), doi:10.1007/s00366-011-0231-0.
[37] M.M. Hafez, Numerical Simulations of Incompressible Flows, World Scientific, River Edge, NJ, 2003.
Update
Applied Mathematical Modelling 87 (2020) 755


Corrigendum

Corrigendum to "A GPU-based algorithm for efficient LES of high Reynolds number flows in heterogeneous CPU/GPU supercomputers" [Applied Mathematical Modelling 85 (2020) 141-156]

Guillermo Oyarzun a, Iason A. Chalmoukis b, Georgios A. Leftheriotis b, Athanassios A. Dimas b,∗

a Barcelona Supercomputing Center, 08034 Barcelona, Spain
b Laboratory of Hydraulic Engineering, Department of Civil Engineering, University of Patras, 26500 Patras, Greece

On page 147 and on the first line after equation (14), the correct expression is:

hr = 0.35ao

The authors would like to apologise for any inconvenience caused.

DOI of original article: 10.1016/j.apm.2020.04.010



Corresponding author.
E-mail address: [email protected] (A.A. Dimas).

https://doi.org/10.1016/j.apm.2020.07.001
0307-904X/© 2020 Elsevier Inc. All rights reserved.
