
Computers and Fluids 173 (2018) 285–292


HPC2 —A fully-portable, algebra-based framework for heterogeneous computing. Application to CFD
X. Álvarez a,∗, A. Gorobets a,b, F.X. Trias a, R. Borrell a,c, G. Oyarzun a

a Heat and Mass Transfer Technological Center, Technical University of Catalonia, C/ Colom 11, Terrassa (Barcelona) 08222, Spain
b Keldysh Institute of Applied Mathematics RAS, Miusskaya Sq. 4, Moscow 125047, Russia
c Termo Fluids, S.L., C/ Magí Colet 8, Sabadell (Barcelona), 08024, Spain

∗ Corresponding author. E-mail addresses: [email protected] (X. Álvarez), [email protected] (A. Gorobets), [email protected] (F.X. Trias).

Article history:
Received 2 November 2017
Accepted 23 January 2018
Available online 10 February 2018

Keywords:
Heterogeneous computing
MPI+OpenMP+OpenCL
Hybrid CPU+GPU systems
CFD
Symmetry-preserving discretization

Abstract

The variety of computing architectures competing in the exascale race makes the portability of codes of major importance. In this work, the HPC2 code is presented as a fully-portable, algebra-based framework suitable for heterogeneous computing. In its application to CFD, the algorithm of the time-integration phase relies on a reduced set of only three algebraic operations: the sparse matrix-vector product, the linear combination of vectors and the dot product. This algebraic approach combined with a multilevel MPI+OpenMP+OpenCL parallelization naturally provides portability. The performance has been studied on different architectures including multicore CPUs, Intel Xeon Phi accelerators and GPUs of AMD and NVIDIA. The multi-GPU scalability is demonstrated up to 256 devices. The heterogeneous execution is tested on a CPU+GPU hybrid cluster. Finally, results of the direct numerical simulation of a turbulent flow in a 3D air-filled differentially heated cavity are presented to show the capabilities of the HPC2 dealing with large-scale CFD simulations.

© 2018 Elsevier Ltd. All rights reserved.
https://doi.org/10.1016/j.compfluid.2018.01.034

1. Introduction

Massively-parallel devices of various architectures are being adopted by the newest supercomputers in order to overcome the current power constraints in the context of the exascale challenge [1]. This trend is reflected in most of the fields that rely on large-scale simulations, such as computational fluid dynamics (CFD). Examples of CFD applications using accelerators can be found, for instance, in [2] (single-GPU, portable, OpenCL), [3–5] (multi-GPU, vendor-locked, CUDA) and [6] (petascale, multi-GPU, portable, CUDA+OpenCL).

Although the majority of problems in the field of mathematical physics involve sparse matrix and vector operations, and hence algorithms with very low arithmetic intensity, most of the emerging HPC architectures are FLOP-oriented, i.e. their ratio of FLOPS to memory bandwidth is very high. Consequently, the achievable performance is usually reduced to a small fraction of the peak performance, as proven by the HPCG Benchmark [7] results.

Therefore, in the design of large-scale simulation tools, software portability and efficiency are of crucial importance. The computing operations that form the algorithm, so-called kernels, must be compatible with distributed- and shared-memory MIMD parallelism and, more importantly, with stream processing, which is a more restrictive parallel paradigm. Consequently, the fewer the kernels of an application, the easier it is to provide portability. Furthermore, if the majority of kernels represent linear algebra operations, then standard optimized libraries (e.g. ATLAS, clBLAST) or specialized in-house implementations can be used and easily switched.

In this context, we proposed in a previous work [8] a portable algebraic implementation approach for direct numerical simulations (DNS) and large eddy simulations (LES) of incompressible turbulent flows on unstructured meshes. Roughly, the implementation consists in replacing classical stencil data structures and sweeps by algebraic data structures and kernels. As a result, the algorithm relies on a reduced set of only three algebraic operations: the sparse matrix-vector product (SpMV), the linear combination of vectors (axpy) and the dot product (dot).

On the other hand, the hybridization of HPC systems makes the design of simulation codes a rather complex problem. Heterogeneous implementations such as an MPI+OpenMP+OpenCL parallelization [9] can target a wide range of architectures and combine different kinds of parallelism. Hence, they are becoming increasingly necessary in order to engage all the available computing power and memory throughput of CPUs and accelerators. Examples of CFD codes capable of heterogeneous computing can be found, for instance, in [10], where the performance of the PyFR framework is tested on a hybrid node with a multicore CPU, NVIDIA and AMD GPUs. Further, in [11] the scalability of the HOSTA code is tested on up to 1024 TianHe-1A hybrid nodes.

Following the spirit of Oyarzun et al. [8], and increasing the level of abstraction, we present in this paper the HPC2 (Heterogeneous Portable Code for HPC). It is a fully-portable, algebra-based framework capable of heterogeneous computing with many potential applications in the fields of computational physics and mathematics. This framework aims to provide a user-friendly environment suitable for writing numerical algorithms in terms of portable linear algebra kernels.
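To make the algebraic approach concrete, the three kernels can be sketched as follows. This is a minimal, illustrative CPU version (CSR storage, OpenMP parallel loops) written for this summary; it is not the actual HPC2 API, whose class and function names are not given in the paper.

#include <vector>
#include <cstddef>

// Minimal CSR sparse matrix: row_ptr has n+1 entries, cols/vals hold the non-zeros.
struct CsrMatrix {
    std::size_t n;                    // number of rows
    std::vector<std::size_t> row_ptr; // size n+1
    std::vector<std::size_t> cols;    // column indices of non-zeros
    std::vector<double> vals;         // non-zero values
};

// y = A*x  (SpMV)
void spmv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    #pragma omp parallel for schedule(dynamic, 256)
    for (std::size_t i = 0; i < A.n; ++i) {
        double sum = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            sum += A.vals[k] * x[A.cols[k]];
        y[i] = sum;
    }
}

// y = a*x + y  (axpy)
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += a * x[i];
}

// dot product
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (std::size_t i = 0; i < x.size(); ++i)
        sum += x[i] * y[i];
    return sum;
}

Because every higher-level algorithm is expressed through these three routines, porting to a new architecture reduces to providing optimized implementations of them for that architecture.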

Fig. 1. Multi-level decomposition example for a cell-centred scheme among two dual-CPU nodes with one and two acceleration devices, respectively.

2. Multi-level domain decomposition

The computational domain is essentially a graph, i.e. a set of objects in which some pairs are in some sense related (such as mesh nodes, cells, faces, vertices, etc.), which may be subject to calculations. It typically arises from the spatial discretization of the physical domain, forming a fundamental part of many applications. The optimal distribution of the workload of the computational domain across the HPC system is of great importance in heterogeneous computing for attaining maximum performance.

By way of example, let us consider a generic numerical algorithm which operates on a computational domain. The algorithm is to be executed on an HPC system that consists of computing nodes interconnected via a communication infrastructure. Hence, a traditional first-level domain decomposition approach with MPI parallelization is used in order to distribute the workload among multiple nodes. In doing so, domain elements are assigned to subdomains using a partitioning library (e.g. ParMETIS [12]) that fulfils the requested load balancing and minimizes the number of couplings between cells of different subdomains. As a result, first-level subdomain elements are classified into Inner and Interface categories (see Fig. 1). Namely, Interface elements are those coupled with elements from other subdomains. Consequently, the neighbouring elements belonging to other subdomains form a Halo. A communication between parallel processes is needed in order to update the data in Halo elements needed by a kernel when processing Interface elements. The Halo update is implemented with non-blocking point-to-point MPI communications between neighbouring processes. An overlap of these communications with computations partially eliminates the data transfer overhead (a minimal sketch of this overlapped update is given at the end of this section). In such a case, the Halo update is carried out simultaneously with the execution of the kernel for the Inner elements only. Afterwards, once the Halo is updated, the Interface elements are computed.

Similarly, first-level subdomains are decomposed further in order to distribute the workload of each node among its computing devices, such as multiple CPUs (called host) and co-processors of different kinds (called devices), as shown in the right part of Fig. 1. This second-level decomposition must conform to the actual performance of the devices for the sake of load balancing. As a result, at the second level, the Interface and Halo elements are classified into (1) external ones, which need MPI communications because they are coupled with other subdomains of the first-level decomposition, and (2) internal ones, which only participate in the intra-node exchanges. Note that the external Interface and Halo elements assigned to a device with a separate memory space require a much more expensive multistage device-host-MPI-host-device communication. Hence, the second-level decomposition must aim at reducing the volume of these expensive external communications (in a one-level case, all the internal Interface and Halo elements would become external).

Finally, the third-level decomposition among the NUMA nodes of the host (e.g. CPU sockets or parts of a multicore CPU grouped in a shared L3 inner cache ring) allows allocating data in accordance with the physical resources to which a group of threads is assigned.
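The overlapped first-level halo update described above can be sketched as follows. This is a minimal MPI illustration written for this summary: the neighbour list and the packing of Interface values into buffers are assumed to be prepared elsewhere, and the helper names (compute_inner, compute_interface) are placeholders, not part of the HPC2 interface.

#include <mpi.h>
#include <vector>

// One send/receive buffer pair per neighbouring process.
struct Neighbour {
    int rank;
    std::vector<double> send_buf; // our Interface values packed for this neighbour
    std::vector<double> recv_buf; // their Interface values, i.e. our Halo
};

// Update the Halo while computing the Inner elements, then finish with the Interface.
void halo_overlapped_kernel(std::vector<Neighbour>& nbs,
                            void (*compute_inner)(),      // kernel on Inner elements only
                            void (*compute_interface)())  // kernel on Interface elements
{
    std::vector<MPI_Request> reqs;
    reqs.reserve(2 * nbs.size());

    // 1. Post non-blocking receives and sends for the Halo update.
    for (auto& nb : nbs) {
        MPI_Request r;
        MPI_Irecv(nb.recv_buf.data(), (int)nb.recv_buf.size(), MPI_DOUBLE,
                  nb.rank, 0, MPI_COMM_WORLD, &r);
        reqs.push_back(r);
        MPI_Isend(nb.send_buf.data(), (int)nb.send_buf.size(), MPI_DOUBLE,
                  nb.rank, 0, MPI_COMM_WORLD, &r);
        reqs.push_back(r);
    }

    // 2. Compute the Inner elements, which do not depend on the Halo.
    compute_inner();

    // 3. Wait for the exchanges to finish, then compute the Interface elements.
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
    compute_interface();
}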
Fig. 2. Representation of the HPC2 code structure.

3. The HPC2 framework

In this section, we present the HPC2: a fully-portable, algebra-based framework suitable for heterogeneous computing with the aim of providing a user-friendly environment for writing algorithms in the fields of computational physics and mathematics.

3.1. Structure of HPC2

The code is structured following a multi-layer design represented in Fig. 2 as concentric rings. In this scheme, the layers are defined to be detached, maintaining the object dependency restricted to the inner layers. Therefore, each outer layer represents a higher level of abstraction.

The first layer (centre) is composed of the Set and Topology objects. This layer represents the computational domain (detailed in Section 2), hence it is the core of any numerical simulation. Firstly, the Set is a basic data structure which aims to mimic the algebraic concept of a set, i.e. a collection of objects of some kind. It is designed to be automatically distributed in the system and assigned to devices at runtime according to the execution parameters. Thus, it becomes generic and architecture-independent from the outer layers' point of view. Secondly, the Topology is designed to be in agreement with a Set. It consists of the representation of the couplings between the objects of the Set. Therefore, it contains the required information to perform the data exchanges (Halo updates). Note that the Topology is bound to a Set, so it can only manage the data exchanges of objects belonging to that Set. However, different Topology objects may be assigned to the same Set depending on the numerical schemes (e.g. the second- and fourth-order schemes define different couplings between the same set of unknowns).

In the second layer, two more complex algebraic objects, the Vector and the Matrix, are derived from Set. The Vector object represents discrete functions on the computational domain (e.g. pressure, velocity, temperature). To comply with the algebraic concept of vector, it must be equipped with the algebraic operations: dot product, scalar multiplication, vector addition, and linear combination. These operations are contained in the dot and axpy kernels. The Matrix object is provided with the matrix multiplication, represented by the SpMV kernel, in order to perform linear transformations.

The third layer contains linear and non-linear algorithms. These algorithms are written using only Matrix and Vector kernels, and hence they only maintain an inner-layer dependency. Some examples of algorithms already implemented are the Conjugate Gradient solver, the Adams-Bashforth time integrator and the Courant–Friedrichs–Lewy (CFL) condition.

The fourth layer consists of the preprocessing mechanisms which can be defined for different numerical methods. Namely, given a mesh, the preprocessing constructs the Set and Topology objects. Then, it generates the coefficients of the operators such as Gradient, Divergence or Laplacian. Additionally, this layer can involve external simulation codes which generate the operators as an input for the HPC2 time-integration core.

Finally, the outermost layer is left for applications. The preprocessing mechanisms are used to generate the required Matrix and Vector objects. The combination of these objects and their kernels, together with the algorithms described in the third layer, allows the implementation of complex algorithms in the fields of computational physics and mathematics. By way of example, the reader is referred to Algorithm 1 (in Section 4) for modelling DNS of incompressible turbulent flows with heat transfer, which is composed of only three linear algebra kernels: SpMV, dot, axpy.

3.2. Software implementation details

Our heterogeneous implementation relies on the MPI, OpenMP and OpenCL frameworks. The structure of the HPC2 aims to restrict the implementation specifics to the inner layers, i.e. the core and algebraic layers. Firstly, non-blocking MPI point-to-point communications are used for distributed-memory parallelization. Secondly, OpenMP is used for shared-memory MIMD parallelization for multicore CPUs and manycore accelerators. Dynamic loop scheduling is mostly used in order to avoid imbalance between threads that may appear due to the interference with OpenCL and MPI processes. Additionally, vectorization for SIMD extensions on the lowest level is achieved with compiler-specific directives, such as #pragma ivdep for the Intel compiler. Finally, the OpenCL implementation provides the kernels portability across various stream processing-based accelerators, including AMD and NVIDIA GPUs, FPGA accelerators and ARM-based systems-on-a-chip (SoCs).

The heterogeneous execution mode is implemented using nested OpenMP regions (Fig. 3; a simplified sketch of this pattern is given after Fig. 3). The outer parallel region spawns two threads: one for handling communications and another for computations. The communication thread executes device-to-host (D2H), MPI, and host-to-device (H2D) transfers. The computing thread submits kernels for the background OpenCL execution, then OpenMP-parallel processing is carried out within the inner parallel region. In doing so, the OpenMP and OpenCL computations are carried out simultaneously with the data exchanges, engaging all available computing resources and hiding the communication overhead.

In order to optimize the matrix data structures and kernels, it is necessary to reorder the matrix rows and provide a proper storage format depending on the sparsity pattern and the target architecture. The reordering aims to improve the data locality by reducing the matrix bandwidth [13]. The implemented storage formats include the standard Compressed Sparse Row (CSR) and different variants of ELLPACK (sliced and block-transposed). Further details on the implementation and the performance of these formats can be found in our previous work [8]. Note that this reordering and storage adaptation are internal routines that are hidden from the outer layers.

3.3. Performance study

Firstly, the performance of the HPC2 depends mainly on the algebraic kernels that compose its core. It must be noted that these algebraic operations have very low arithmetic intensity. For double-precision values, the FLOP per byte ratio is typically around 1/8 (one operation per 8-byte argument). Therefore, it is clearly a memory-bounded application that requires a lot of attention to memory access optimization. For this reason, the theoretically achievable performance can hardly reach more than a few percent of the device's theoretical peak. For instance, for an NVIDIA Tesla 2090 GPU this limit is around 3% (i.e. (0.125 FLOP/Byte · 178 GB/s)/670 GFLOPS; the estimate is written out explicitly below). This performance level is rather consistent with the results of the well-known HPCG Benchmark [7], which reproduces a memory-bounded sparse algebraic application.

Single-device tests have been run to study the performance of the HPC2 on the following devices: Intel Xeon E5-2660, Intel Xeon E5-2620 v2, Intel Xeon 8160, Intel Xeon Phi 7290, NVIDIA Tesla 2090, NVIDIA Tesla K40, and AMD Radeon R9 Nano. Single-device results shown in Fig. 4 (left) illustrate the performance comparison for the overall DNS algorithm (see Algorithm 1 in Section 4) and its three major kernels on different kinds of devices: several generations of multicore and manycore CPUs and GPUs of AMD and NVIDIA. Additionally, performance on ARM-based SoCs can be found in our previous work [14]. The mesh size per device was around 1 million cells (unstructured and hexa-dominant). As expected, the achieved performance is directly related to the bandwidth capacity of the devices. Consequently, the AMD GPU outperforms the FLOPS-oriented high-end computing devices due to its higher memory bandwidth. On the other hand, the attained performance and its ratio between devices differ for each kernel. This requires a careful workload balancing based on the performance of the overall algorithm rather than on the separate kernels.

The benefits of the heterogeneous CPU+GPU execution have been measured on a hybrid cluster using two 8-core CPUs (E5-2660) and two GPUs (NVIDIA 2090). Comparison with the GPU-only execution is shown in Fig. 4 (right) for an unstructured hexa-dominant mesh with 1 million cells per node. The performance gain compared to the GPU-only mode was 19%.
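The "3% of peak" bound quoted at the beginning of this subsection follows from a simple roofline-type estimate; written out for reference with the Tesla 2090 figures given above (a back-of-the-envelope check, not an additional measurement):

P_attainable ≈ min(P_peak, I · B), with arithmetic intensity I ≈ 1/8 FLOP/Byte,

I · B / P_peak = (0.125 FLOP/Byte · 178 GB/s) / 670 GFLOPS ≈ 0.033 ≈ 3%.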

Fig. 3. Heterogeneous execution of a kernel over a multi-level decomposed domain.
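The pattern depicted in Fig. 3 can be sketched with nested OpenMP regions as below. This is a simplified illustration: the OpenCL queue management and the MPI exchange are hidden behind hypothetical helper functions (enqueue_device_kernel_inner, mpi_halo_exchange, etc.), since the paper does not list the actual HPC2 routines.

#include <omp.h>

// Hypothetical helpers standing in for the OpenCL/MPI machinery described in Section 3.2.
void enqueue_device_kernel_inner();     // submit OpenCL kernel for the device-owned inner cells
void enqueue_device_kernel_interface(); // submit OpenCL kernel for the device-owned interface cells
void wait_device();                     // e.g. finish the device command queue
void copy_interface_device_to_host();   // D2H transfer of interface data
void mpi_halo_exchange();               // non-blocking MPI send/recv + wait
void copy_halo_host_to_device();        // H2D transfer of the updated halo
void compute_host_cells(bool inner);    // OpenMP loop over host-owned cells

void heterogeneous_kernel() {
    omp_set_nested(1); // allow the inner parallel region
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            // Communication thread: D2H -> MPI -> H2D, overlapped with the work below.
            copy_interface_device_to_host();
            mpi_halo_exchange();
            copy_halo_host_to_device();
        } else {
            // Computation thread: background OpenCL execution plus host OpenMP processing.
            enqueue_device_kernel_inner();
            #pragma omp parallel          // inner region: remaining cores work on host cells
            compute_host_cells(/*inner=*/true);
        }
    }
    // Once the halo is updated, the interface cells are processed on host and device.
    enqueue_device_kernel_interface();
    compute_host_cells(/*inner=*/false);
    wait_device();
}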

Fig. 4. Left: performance of the overall DNS algorithm (Algorithm 1) and the three basic kernels tested on different devices. Right: heterogeneous execution of the overall DNS algorithm vs GPU-only.

Furthermore, the heterogeneous efficiency (i.e. the ratio between the heterogeneous performance and the sum of the CPU-only and the GPU-only performance) appeared to be near 100% on 8 nodes. This efficiency was expected to decrease since the CPUs should be more involved in communications. However, the communication overhead appears to be much more efficiently hidden when overlapping with GPUs.

Finally, multi-GPU strong and weak scaling results are shown in Fig. 5 for hexahedral meshes that represent the computational domain of the DNS configuration described in the following section. The HPC5 supercomputer of the Kurchatov Institute was used for these tests. Its nodes are equipped with two dual-GPU NVIDIA K80 devices. It can be observed that the speed-up declines rather rapidly, as shown in the strong scaling results in Fig. 5 (left), allowing a speed-up of around 8× at a reasonable efficiency level. This is due to the natural fact that the GPU's throughput drops notably when the workload per device decreases. In contrast, with a sufficient workload, the parallel efficiency obtained from the weak scaling results in Fig. 5 (right) is rather high. Specifically, the parallel efficiency for the mesh of 1.3 billion cells on 256 GPUs is 94% when scaling from one 4-GPU node to 64 nodes with a load of 5 million cells per device. At the smaller workload of 1 million cells per device, the parallel efficiency on 256 GPUs drops to 67% because, firstly, the computing load is not sufficient to hide all the communications and, secondly, the scaling range is 4 times bigger.
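For clarity, the efficiency metrics used in this subsection can be written explicitly (the symbols are introduced here for illustration only):

E_het = P_heterogeneous / (P_CPU-only + P_GPU-only),

E_weak(N) = t_step(1 node) / t_step(N nodes) at a fixed load per device,

E_strong(N) = S(N)/N, with S(N) = t_step(1 GPU) / t_step(N GPUs),

so the 94% figure above corresponds to E_weak at 64 nodes (256 GPUs) with 5 million cells per device.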

Fig. 5. Strong (left) and weak (right) scaling on multiple GPUs for different mesh sizes (strong scaling: 5M cells with and without overlap and 20M cells with overlap, against the linear speed-up; weak scaling: 1M and 5M cells per GPU, against the ideal; horizontal axes: number of GPUs).

Fig. 6. From left to right: DHC schema, instantaneous schlieren-like snapshot from the DNS and the averaged temperature field at the cavity mid-depth (the isotherms are
uniformly distributed between −0.5 and 0.5), the airflow map in the upper part of the cavity.

4. Challenging the HPC2: DNS of a turbulent differentially heated cavity

A DNS of a turbulent air-filled differentially heated cavity (DHC) has been chosen as a first CFD case to show the capabilities of the HPC2 dealing with large-scale CFD simulations. Firstly, the description of the case in conjunction with the numerical methods is detailed. Then, the DHC results are briefly presented.

4.1. Mathematical model and numerical method

We consider a cavity of height H, width L and depth D filled with an incompressible Newtonian viscous fluid of kinematic viscosity ν, thermal diffusivity α and density ρ. The geometry of the problem is displayed in Fig. 6 (left). The Boussinesq approximation is used to account for the density variations. Thermal radiation is neglected. Under these assumptions, the velocity, u, and the temperature, θ, are governed by the following set of dimensionless PDEs

∂_t u + (u · ∇)u = Pr Ra^{−1/2} ∇²u − ∇p + f,   (1)

∂_t θ + (u · ∇)θ = Ra^{−1/2} ∇²θ,   (2)

where Pr = ν/α, Ra = (gβ Δθ H³)/(να) and f = (0, Pr θ, 0) (Boussinesq approximation) are the Prandtl and Rayleigh numbers (based on the cavity height) and the body force vector, respectively. Notice that with the reference quantities, L_ref = H and t_ref = (H²/α) Ra^{−1/2}, the vertical buoyant velocity, Pr^{1/2}, and the characteristic dimensionless Brunt-Väisälä frequency, N, are independent of Ra. The configuration considered here resembles the experimental set-up of Saury et al. [15] and Belleoud et al. [16]: the height, H/L, and depth, D/L, aspect ratios are 3.84 and 0.86, whereas the Rayleigh and Prandtl numbers are Ra = 1.2 × 10^11 and Pr = 0.71 (air), respectively. The cavity is subjected to a temperature difference Δθ across the vertical isothermal walls (θ(0, y, z) = 0.5, θ(L/H, y, z) = −0.5). The temperature at the rest of the walls is given by the "Fully Realistic" boundary conditions proposed in [17]. They are time-independent analytical functions that fit the experimental data of Salat et al. [18]. The no-slip boundary condition is imposed on the walls.

The governing Eqs. (1) and (2) are discretized using a symmetry-preserving discretization [19]. Shortly, the temporal evolution of the spatially discrete velocity vector, u_c, is governed by the following operator-based finite-volume discretization of Eq. (1)

Ω_{3cd} (d u_c / dt) + C_{3cd}(u_s) u_c + D_{3cd} u_c + Ω_{3cd} G_c p_c = f_c,   M u_s = 0_c,

where p_c ∈ R^n and u_c ∈ R^{3n} are the cell-centred pressure and velocity fields.

Table 1
Physical and numerical simulation parameters of the DNS of the turbulent DHC displayed in Fig. 6. From left to right: number of control volumes and concentration factors for each spatial direction, the size of the first off-wall control volume in the x-direction (also in wall-units), the non-dimensional time-step, the starting time for averaging and the time-integration period.

Nx    Ny    Nz    γx   γy   γz   (Δx)_min       (Δx)+_min   Δt             t0      t_avg
450   900   256   2    1    1    4.28 × 10^−5   ≲ 0.5       3.65 × 10^−4   ≈ 300   ≈ 300
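For reference, the hyperbolic-tangent point distribution behind the near-wall spacing listed in Table 1 (the formula is given in Section 4.2) could be generated as follows. This is a small illustrative sketch written for a unit interval; it is not taken from the HPC2 preprocessing code, and the subsequent scaling to the physical cavity dimensions is omitted.

#include <cmath>
#include <vector>

// Wall-clustered node coordinates in [0, 1]:
// x_i = 0.5 * (1 + tanh(gamma * (2*(i-1)/N - 1)) / tanh(gamma)), i = 1..N+1.
std::vector<double> tanh_stretched_nodes(int N, double gamma) {
    std::vector<double> x(N + 1);
    for (int i = 1; i <= N + 1; ++i)
        x[i - 1] = 0.5 * (1.0 + std::tanh(gamma * (2.0 * (i - 1) / N - 1.0)) / std::tanh(gamma));
    return x;
}
// Example: tanh_stretched_nodes(450, 2.0) clusters the points near both walls,
// with the first spacing x[1] - x[0] setting the near-wall resolution.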

Fig. 7. Left: average temperature profiles at the cavity mid-depth at mid-width. Right: averaged Nusselt number at the cavity mid-depth.

For simplicity, u_c is defined as a column vector and arranged as u_c = (u_1, u_2, u_3)^T, where u_i = ((u_i)_1, (u_i)_2, ..., (u_i)_n)^T are the vectors containing the velocity components corresponding to the x_i-spatial direction. The auxiliary discrete staggered velocity u_s = ((u_s)_1, (u_s)_2, ..., (u_s)_m)^T ∈ R^m is related to the centered velocity field via a linear interpolation Γ_{c→s} ∈ R^{m×3n}, u_s ≡ Γ_{c→s} u_c. The dimensions of these vectors, n and m, are the number of control volumes and faces on the computational domain, respectively. The subindices c and s refer to whether the variables are cell-centered or staggered at the faces. The matrices Ω_{3cd} ∈ R^{3n×3n}, C_{3cd}(u_s) ∈ R^{3n×3n} and D_{3cd} ∈ R^{3n×3n} are block diagonal matrices given by

Ω_{3cd} = I_3 ⊗ Ω_c,   C_{3cd}(u_s) = I_3 ⊗ C_c(u_s),   D_{3cd} = I_3 ⊗ D_c,

where I_3 ∈ R^{3×3} is the identity matrix. C_c(u_s) ∈ R^{n×n} and D_c ∈ R^{n×n} are the collocated convective and diffusive operators, respectively. The temporal evolution of the discrete temperature θ_c ∈ R^n (see Eq. (2)) is discretized in the same vein. For a detailed explanation of the spatial discretization, the reader is referred to Trias et al. [19]. Regarding the temporal discretization, a second-order explicit one-leg scheme is used for both the convective and the diffusive terms [20]. Finally, the pressure-velocity coupling is solved by means of a classical fractional step projection method [21]: a predictor velocity, u_s^p, is explicitly evaluated without considering the contribution of the pressure gradient. Then, by imposing the incompressibility constraint, M u_s^{n+1} = 0_c, it leads to a Poisson equation for p̃_c^{n+1} to be solved once each time-step,

L p̃_c^{n+1} = M u_s^p   with   L = −M Ω_s^{−1} M^T,   (3)

where p̃_c = Δt p_c and the discrete Laplacian operator, L, is represented by a symmetric negative semi-definite matrix.

In summary, the method is based on only five basic (linear) operators: the cell-centered and staggered control volumes, Ω_c and Ω_s, the matrix containing the face normal vectors, N_s, the cell-to-face scalar field interpolation, Γ_{c→s}, and the divergence operator, M. Once these operators are constructed, the rest follows straightforwardly from them. The algorithm to solve one time-integration step is outlined in Algorithm 1 (see also the kernel-level sketch below).

Algorithm 1 Time-integration step.
1. Compute the convective, the diffusive and the source term of momentum Eq. (1):
   R(u_s^n, u_c^n, θ_c^n) ≡ −C_{3cd}(u_s^n) u_c^n − D_{3cd} u_c^n + f_c(θ_c^n)
2. Compute the predictor velocity:
   u_c^p = u_c^n + Δt [ (3/2) R(u_s^n, u_c^n, θ_c^n) − (1/2) R(u_s^{n−1}, u_c^{n−1}, θ_c^{n−1}) ]
3. Solve the Poisson equation given in Eq. (3): L p̃_c^{n+1} = M u_s^p, where u_s^p = Γ_{c→s} u_c^p and L = −M Ω_s^{−1} M^T
4. Correct the staggered velocity field: u_s^{n+1} = u_s^p − G p̃_c^{n+1}, where G = −Ω_s^{−1} M^T
5. Correct the cell-centered velocity field: u_c^{n+1} = u_c^p − G_c p̃_c^{n+1}, where G_c = −Γ_{s→c} Ω_s^{−1} M^T
6. Compute the convective and the diffusive terms in the temperature transport Eq. (2):
   R_θ(u_s^n, θ_c^n) ≡ −C_c(u_s^n) θ_c^n − Pr^{−1} D_c θ_c^n
7. Compute the temperature at the next time-step:
   θ_c^{n+1} = θ_c^n + Δt [ (3/2) R_θ(u_s^n, θ_c^n) − (1/2) R_θ(u_s^{n−1}, θ_c^{n−1}) ]

At this point, it must be noted that, except for the non-linear convective terms, C_{3cd}(u_s^n) u_c^n and C_c(u_s^n) θ_c^n, all the operations correspond to sparse matrix-vector products (SpMV), most of them sharing the same matrix portrait. Regarding the convection (steps 1 and 2 in Algorithm 1), it can also be reduced to SpMV operations by simply noticing that the coefficients of the convective operator, C_c(u_s^n), must be re-computed according to the adopted numerical schemes [19]. Moreover, the computation of these coefficients can also be viewed as a SpMV. Therefore, the convective operator is represented as a concatenation of two SpMVs: (i) firstly, to compute the coefficients of the convective operator, C_c(u_s^n); (ii) then, to compute the matrix-vector product to obtain the resulting vector, e.g. C_c(u_s^n) θ_c^n.

Regarding the time-integration scheme (steps 2 and 7 in Algorithm 1), and without loss of generality, a second-order Adams-Bashforth scheme has been adopted here. Since it is a fully explicit scheme, a CFL-like condition is required in order to keep the numerical scheme inside the stability region [20]. This necessarily leads to rather small time-steps, Δt, and subsequently to a good initial guess for the Poisson equation. This justifies the fact that a relatively simple linear solver for the Poisson equation (step 3 in Algorithm 1) suffices to maintain the norm of the divergence of the velocity field, M u_s^{n+1}, at a low enough level [22].
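To connect Algorithm 1 with the kernels of Section 3, one time-integration step can be sketched purely in terms of the three algebraic operations. The sketch reuses the illustrative CsrMatrix/spmv/axpy routines given after the Introduction; the operator and field names are placeholders rather than the actual HPC2 objects, and the cell-to-face interpolations, the control-volume scalings and the buoyancy source are omitted, so this is a structural outline under those assumptions, not a faithful discretization.

// Hypothetical CG solver built on the same spmv/axpy/dot kernels (not shown here).
void cg_solve(const CsrMatrix& A, const std::vector<double>& b, std::vector<double>& x);

// One explicit time step of Algorithm 1 expressed as a sequence of algebraic kernels.
void time_step(const CsrMatrix& C3cd,   // convective operator (coefficients rebuilt each step)
               const CsrMatrix& D3cd,   // diffusive operator
               const CsrMatrix& M,      // divergence operator
               const CsrMatrix& G,      // gradient-like correction operator
               const CsrMatrix& L,      // Laplacian, L = -M*inv(Omega_s)*M^T
               std::vector<double>& u,        // velocity unknowns (flattened)
               std::vector<double>& R,        // R^n
               std::vector<double>& R_prev,   // R^{n-1}
               std::vector<double>& p_tilde,  // pseudo-pressure
               std::vector<double>& tmp,
               double dt)
{
    // Step 1: R = -(C3cd + D3cd)*u   (two SpMVs plus an axpy; source term omitted).
    spmv(C3cd, u, R);
    spmv(D3cd, u, tmp);
    axpy(1.0, tmp, R);
    for (double& r : R) r = -r;

    // Step 2: Adams-Bashforth predictor, u <- u + dt*(1.5*R - 0.5*R_prev).
    axpy( 1.5 * dt, R, u);
    axpy(-0.5 * dt, R_prev, u);
    R_prev = R;

    // Step 3: Poisson equation L*p_tilde = M*u^p (the CG itself is SpMV + axpy + dot).
    spmv(M, u, tmp);
    cg_solve(L, tmp, p_tilde);

    // Steps 4-5: velocity correction, u <- u - G*p_tilde.
    spmv(G, p_tilde, tmp);
    axpy(-1.0, tmp, u);

    // Steps 6-7: the temperature transport follows the same SpMV/axpy pattern (omitted).
}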

Fig. 8. Time evolution of the Nusselt number at the vertical mid-plane (left) and its normalized density power spectrum (right).

Furthermore, since the matrix, −L, is symmetric and positive-definite, a Preconditioned Conjugate Gradient method is used with a simple SpMV-based preconditioner (either the Jacobi or the approximate inverse). In conclusion, the overall Algorithm 1 relies on the set of three basic algebra operations: SpMV, dot and axpy.

4.2. Results and discussion

Since no subgrid-scale model is used in the computation, the grid resolution and the time-step, Δt, have to be fine enough to resolve all the relevant turbulence scales. Furthermore, the starting time for averaging, t_0, and the time-integration period, t_avg, must also be long enough to properly evaluate the flow statistics. The procedure followed to verify the simulation results is analogous to our previous DNS work [23,24]. In this case, the averages over the three statistically invariant transformations (time, mid-depth plane and central point symmetry) have been carried out for all fields, and the grid points in the three wall-normal directions are distributed using a hyperbolic-tangent function, i.e. for the x-direction x_i = (1/2)(1 + tanh{γ_x(2(i−1)/N_x − 1)}/tanh γ_x). For details about the physical and numerical parameters see Table 1. Hereafter, the angular brackets operator ⟨·⟩ denotes averaged variables.

Instantaneous flow fields displayed in Fig. 6 illustrate the inherent flow complexity of this configuration. Namely, the vertical boundary layers remain laminar only in their upstream part, up to the point where the waves traveling downstream grow enough to disrupt the boundary layers, ejecting large unsteady eddies. An accurate prediction of the flow structure in the cavity lies on the ability to correctly locate the transition to turbulence, while the high sensitivity of the thermal boundary layer to external disturbances makes it difficult to predict (see results for a similar DHC configuration in [24,25], for instance). In this case, the transition occurs around y ≈ 0.2 (see the peak of the averaged local Nusselt number displayed in the right part of Fig. 6).

The average temperature field and the airflow map are displayed in Fig. 6 (right). The cavity is almost uniformly stratified with a dimensionless stratification of S ≈ 0.45 (see Fig. 7, left). This value is in rather good agreement with the experimental results obtained by Saury et al. [15] (S = 0.44 ± 0.03 with ε = 0.1 and S = 0.57 ± 0.03 with ε = 0.6, where ε is the wall emissivity). The averaged Nusselt number at the cavity mid-depth is displayed in Fig. 7 (right). The profile is again rather similar to the experimental results obtained by Saury et al. [15]. In this case, the transition point occurs at a slightly more upstream position. The peak of the averaged local Nusselt number is located at y ≈ 0.2, whereas in the experimental results this point is at y ≈ 0.3. Integrating the averaged local Nusselt number over the y-direction, the overall Nusselt number is determined. In this case, Nu = 259.2, a slightly higher value than the one obtained by Saury et al. [15], i.e. Nu = 231 ± 30, but very similar to the value obtained by means of LES, i.e. Nu = 254 (see Fig. 11 in [15]).

Another important feature of this kind of configuration is the presence of internal waves. Although in the cavity core the averaged velocity (and its fluctuations) are much smaller compared with those observed in the vertical boundary layers, simulations show that in this region isotherms oscillate around the mean horizontal profile. As mentioned above, the cavity core remains well stratified (see Figs. 6 and 7, left); therefore, this phenomenon can be attributed to internal waves. This can be confirmed by analysing the Nusselt number through the vertical mid-plane, Nu_c. The time evolution and the normalized density power spectrum are respectively displayed in Fig. 8. The peak in the spectrum is located at 0.096, which is in good agreement with the dimensionless Brunt-Väisälä frequency, N = (S Pr)^{0.5}/(2π), where S is the dimensionless stratification of the time-averaged temperature, i.e. N ≈ 0.09. Both values are very similar, confirming that internal waves are permanently excited by the eddies ejected from the vertical boundary layer. Detailed results including turbulent statistics can be downloaded via the following link [26].
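The agreement quoted above can be checked directly from the numbers given in the text:

N = (S · Pr)^{1/2} / (2π) ≈ (0.45 · 0.71)^{1/2} / (2π) ≈ 0.565 / 6.283 ≈ 0.090,

which is indeed close to the spectral peak at 0.096 in Fig. 8 (right).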

5. Conclusions

Motivated by the constant evolution of HPC architectures, the aim of this paper was to design a fully-portable, algebra-based framework suitable for heterogeneous computing, providing a user-friendly environment for writing algorithms in the fields of computational physics and mathematics. As a computing novelty, the heterogeneous MPI+OpenMP+OpenCL implementation of kernels has been combined with a multi-level domain decomposition strategy for distributing the workload among heterogeneous computing resources. Results have shown that the heterogeneous performance of the HPC2 on a hybrid CPU+GPU cluster is nearly identical to the sum of the CPU-only and the GPU-only performance. The multi-GPU scalability of a CFD simulation has been demonstrated on up to 64 nodes equipped with 4 GPU devices each. In addition, the performance has been studied on various architectures including different generations of multicore CPUs, AMD and NVIDIA GPUs, and manycore accelerators (with the same kernel code, only changing the local workgroup sizes). These results demonstrate the portability of the proposed approach.

Acknowledgments

The work has been financially supported by the Ministerio de Economía y Competitividad, Spain (ENE2014-60577-R). X. Á. is supported by a FI-DGR predoctoral contract (FI_B-2017-00614). A. G. is supported by the Russian Science Foundation (grant 15-11-30039). F. X. T. is supported by a Ramón y Cajal postdoctoral contract (RYC-2012-11996). R. B. is supported by a Juan de la Cierva postdoctoral grant (IJCI-2014-21034). This work has been carried out using computing resources of the federal collective usage center Complex for Simulation and Data Processing for Mega-science Facilities at NRC Kurchatov Institute, http://ckp.nrcki.ru/; the Barcelona Supercomputing Center; the Center for collective use of HPC computing resources at Lomonosov Moscow State University; the Joint Supercomputer Center of the Russian Academy of Sciences; and the KIAM RAS. The authors thankfully acknowledge these institutions.

References

[1] Dongarra J, et al. The international exascale software project roadmap. Int J High Perform Comput Appl 2011;25(1):3–60.
[2] Rossi R, Mossaiby F, Idelsohn SR. A portable OpenCL-based unstructured edge-based finite element Navier–Stokes solver on graphics hardware. Comput Fluids 2013;81:134–44.
[3] Jacobsen DA, Senocak I. Multi-level parallelism for incompressible flow computations on GPU clusters. Parallel Comput 2013;39(1):1–20.
[4] Khajeh-Saeed A, Perot JB. Direct numerical simulation of turbulence using GPU accelerated supercomputers. J Comput Phys 2013;235:241–57.
[5] Zaspel P, Griebel M. Solving incompressible two-phase flows on multi-GPU clusters. Comput Fluids 2013;80(1):356–64.
[6] Vincent P, Witherden FD, Vermeire B, Park JS, Iyer A. Towards green aviation with Python at petascale. In: International conference for high performance computing, networking, storage and analysis, SC; November 2017. p. 1–11.
[7] Dongarra J, Heroux M. HPCG benchmark: a new metric for ranking high performance computing systems. Technical Report; June 2013.
[8] Oyarzun G, Borrell R, Gorobets A, Oliva A. Portable implementation model for CFD simulations. Application to hybrid CPU/GPU supercomputers. Int J Comput Fluid Dyn 2017;31(9):396–411.
[9] Gorobets A, Trias FX, Oliva A. A parallel MPI+OpenMP+OpenCL algorithm for hybrid supercomputations of incompressible flows. Comput Fluids 2013;88:764–72.
[10] Witherden FD, Vermeire B, Vincent P. Heterogeneous computing on mixed unstructured grids with PyFR. Comput Fluids 2015;120:173–86.
[11] Xu C, Deng X, Zhang L, Fang J, Wang G, Jiang Y, Cao W, Che Y, Wang Y, Wang Z, Liu W, Cheng X. Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer. J Comput Phys 2014;278(1):275–97.
[12] LaSalle D, Karypis G. Multi-threaded graph partitioning. In: Proceedings - IEEE 27th international parallel and distributed processing symposium, IPDPS 2013; 2013. p. 225–36.
[13] Cuthill E, McKee J. Reducing the bandwidth of sparse symmetric matrices. In: ACM '69 Proceedings of the 1969 24th national conference; 1969. p. 157–72.
[14] Oyarzun G, Borrell R, Gorobets A, Mantovani F, Oliva A. Efficient CFD code implementation for the ARM-based Mont-Blanc architecture. Future Gener Comput Syst 2018;79:786–96.
[15] Saury D, Rouger N, Djanna F, Penot F. Natural convection in an air-filled cavity: experimental results at large Rayleigh numbers. Int Commun Heat Mass Transfer 2011;38:679–87.
[16] Belleoud P, Saury D, Lemonnier D. Coupled velocity and temperature measurements in an air-filled differentially heated cavity at Ra = 1.2e11. Int J Therm Sci 2018;123:151–61.
[17] Sergent A, Joubert P, Xin S, Le Quéré P. Resolving the stratification discrepancy of turbulent natural convection in differentially heated air-filled cavities. Part II: end walls effects. Int J Heat Fluid Flow 2013;39:15–27.
[18] Salat J, Xin S, Joubert P, Sergent A, Penot F, Le Quéré P. Experimental and numerical investigation of turbulent natural convection in a large air-filled cavity. Int J Heat Fluid Flow 2004;25:824–32.
[19] Trias FX, Lehmkuhl O, Oliva A, Pérez-Segarra CD, Verstappen RWCP. Symmetry-preserving discretization of Navier–Stokes equations on collocated unstructured meshes. J Comput Phys 2014;258:246–67.
[20] Trias FX, Lehmkuhl O. A self-adaptive strategy for the time-integration of Navier–Stokes equations. Numer Heat Transfer Part B 2011;60(2):116–34.
[21] Chorin AJ. Numerical solution of the Navier–Stokes equations. Math Comput 1968;22:745–62.
[22] Gorobets A, Trias FX, Soria M, Oliva A. A scalable parallel Poisson solver for three-dimensional problems with one periodic direction. Comput Fluids 2010;39:525–38.
[23] Trias FX, Gorobets A, Soria M, Oliva A. Direct numerical simulation of a differentially heated cavity of aspect ratio 4 with Ra-number up to 10^11 - Part I: Numerical methods and time-averaged flow. Int J Heat Mass Transfer 2010;53:665–73.
[24] Trias FX, Gorobets A, Pérez-Segarra CD, Oliva A. DNS and regularization modeling of a turbulent differentially heated cavity of aspect ratio 5. Int J Heat Mass Transfer 2013;57:171–82.
[25] Barhaghi DG, Davidson L. Natural convection boundary layer in a 5:1 cavity. Phys Fluids 2007;19(12):125106.
[26] The DNS results presented in this paper are publicly available at http://www.cttc.upc.edu/downloads/DHC_Ra1_2e11.
