Load Balancing Strategies For The DSMC Simulation of Hypersonic Flows Using HPC
1 Institute of Space Systems (IRS), University of Stuttgart, 70569 Stuttgart, Germany, [email protected]
2 Institute of Aerodynamics and Gas Dynamics (IAG), University of Stuttgart, 70569 Stuttgart, Germany, [email protected]
Abstract In the context of the validation of PICLas, a kinetic particle suite for the
simulation of rarefied, non-equilibrium plasma flows, the hypersonic nitrogen flow
around a blunted cone at an angle of attack was simulated with the Direct Simulation
Monte Carlo method. The setup is characterized by a complex flow with strong local
gradients and thermal non-equilibrium, resulting in a highly inhomogeneous computational
load. The load distribution is of particular interest because it determines how efficiently
the available computational resources are exploited. Different load distribution algorithms
are investigated and compared in a strong scaling study. This investigation of the
parallel performance of PICLas is accompanied by simulation results in terms of the
velocity magnitude, the translational temperature and the heat flux, the latter being
compared to experimental measurements.
1 Introduction
For the numerical simulation of highly rarefied plasma flows, a fully kinetic
modelling of Boltzmann’s equation complemented by Maxwell’s equations is
necessary. For this purpose, a particle code that combines the PIC (Particle
in Cell) and the DSMC (Direct Simulation Monte Carlo) method has been developed at
the IAG (Institute of Aerodynamics and Gas Dynamics) and the IRS (Institute of
Space Systems) in recent years [7]. Particle codes are inherently numerically
expensive and are thus an excellent application for parallel computing. The
basis of the method is Boltzmann's equation for the particle distribution function f(x, v, t),

\frac{\partial f}{\partial t} + v \cdot \nabla_x f + \frac{F}{m} \cdot \nabla_v f = \left.\frac{\partial f}{\partial t}\right|_{\mathrm{coll}}, \qquad (1)

which is approximated by a finite number of simulation particles,

f(x, v, t) \approx w_n \sum_{n=1}^{N_{\mathrm{sim}}} \delta(x - x_n)\, \delta(v - v_n),

where the δ-function is applied to position and velocity space, separately, and
the particle weighting factor wn = Nphy /Nsim is used to describe the ratio of
physical to simulated particles.
The DSMC method is briefly reviewed in Section 2. In Section 3, the
numerical setup and results of the simulation of the flow around a 70◦ blunted
cone geometry are presented. The load-distribution algorithms and the parallel
performance of the DSMC code are investigated in detail in Section 4, followed
by a summary and conclusion in Section 5.
2 DSMC Solver
The DSMC method approximates the right-hand side of Eq. (1) by modelling
binary particle collisions in a probabilistic and transient manner. The main
idea of the DSMC method is the non-deterministic, statistical calculation of
changes in the particle velocities utilizing random numbers in a collision process.
Additionally, chemical reactions may occur in such collision events. The original
concept of DSMC was developed by Bird [3] and is commonly applied
to the simulation of rarefied, neutral gas flows. The collision operator in
Eq. (1) is given by
\left.\frac{\partial f}{\partial t}\right|_{\mathrm{coll}} = \int W(v_1, v_2, v_3, v_4) \left\{ f(x, v_1, t)\, f(x, v_2, t) - f(x, v_3, t)\, f(x, v_4, t) \right\} \mathrm{d}v_1\, \mathrm{d}v_2\, \mathrm{d}v_3, \qquad (2)
where W represents the probability per unit time with which two particles collide
and change their velocities from v1 and v2 to v3 and v4 , respectively. However,
the DSMC method does not solve this collision integral directly, but rather
applies a phenomenological approach to the collision process of simulation
particles in a statistical framework.
A single standard DSMC time step is depicted schematically in Fig. 1.
First, a particle pair for the collision process is found by examining each cell
and applying a nearest-neighbour search with an octree-based pre-sorting.
An alternative method is the random pairing of all particles in each cell,
but with additional restrictions on the cell size. The collision probability is
modelled by choosing a cross section for each particle species using microscopic
considerations. As with the PIC method, macro particles are simulated instead
of real particles to reduce computational effort. The collision probability of
two particles, 1 and 2, is determined by methods found in [3, 2], which yields
P_{12} = \frac{N_{p,1} N_{p,2}}{1 + \delta_{12}}\, w\, \frac{\Delta t}{V_c\, S_{12}}\, (\sigma_{12}\, g_{12}), \qquad (3)
where δ12 is the Kronecker delta, Vc the cell volume, ∆t the time step, σ12 the
cross section, S12 the number of particle pairs of species 1 and 2 in Vc , and
g12 the relative velocity of the two particles considered. This probability
is compared to a pseudo random number R ∈ [0, 1), and if R < P12 , the
collision of the pair is performed.
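To make the acceptance step concrete, the following short sketch (in Python, purely for illustration; the function and variable names are chosen here and do not stem from the PICLas implementation) evaluates Eq. (3) for one candidate pair and decides whether the collision is performed:

import random

def collision_accepted(N1, N2, same_species, w, dt, sigma12_g12, V_c, S12):
    # Evaluate the collision probability of Eq. (3) for one candidate pair.
    # N1, N2      : particle numbers of species 1 and 2 in the cell
    # same_species: True if both collision partners belong to the same species
    # w           : particle weighting factor w_n = N_phy / N_sim
    # dt, V_c     : time step and cell volume
    # sigma12_g12 : product of cross section and relative velocity, sigma_12 * g_12
    # S12         : number of particle pairs of species 1 and 2 in the cell
    delta12 = 1.0 if same_species else 0.0   # Kronecker delta
    P12 = N1 * N2 / (1.0 + delta12) * w * dt / (V_c * S12) * sigma12_g12
    R = random.random()                      # pseudo random number in [0, 1)
    return R < P12                           # the collision is performed if R < P12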
Fig. 1: Schematic of a single standard DSMC time step: pairing, collision process (∆t → (∆v)part. ), particle movement ((∆v)part. → (x, v)part. ), localization and boundary treatment, and sampling.
Fig. 2: Geometry of the 70◦ blunted cone model (Rb = 25.0, Rc = 1.25, Rj = 2.08, Rn = 12.5, Rs = 6.25) with the positions 1–9 of the heat flux measurements at S/Rn = 0.00, 0.52, 1.04, 1.56, 2.68, 3.32, 5.06, 6.50 and 7.94.

3 Simulation of the 70◦ Blunted Cone
A popular validation case for rarefied gas flows is the wind tunnel test of the
70◦ blunted cone in a diatomic nitrogen flow at a Mach number of M = 20 [1].
The geometry of the model is depicted in Fig. 2, where the positions of the heat flux
measurements are indicated by the numbers 1–9. While the experiments were
conducted at different rarefaction levels and angles of attack, the case denoted
as Set 2 with α = 30◦ is used for this investigation. The free-stream conditions
and simulation parameters are given in Table 1. Half of the fluid domain was
simulated to exploit the symmetry in the xy-plane.
An exemplary simulation result is shown in Fig. 3. Here, the translational
temperature in the symmetry plane and the velocity streamlines are shown.
The simulation results are compared to the experimental measurements in
terms of the heat flux in Fig. 4. Overall, good agreement can be observed for
the first four thermocouples, where the error is below 10% and within the
experimental uncertainty [1]. The agreement deteriorates for the thermocouples
located further downstream on the sting, with errors of up to 45%.
Fig. 3: Translational temperature T [K] in the symmetry plane and velocity streamlines (velocity magnitude v [m s−1 ]).

Fig. 4: Heat flux qw [kW m−2 ] over the normalized surface coordinate S/Rn [−] at the thermocouple positions 1–9: comparison of the PICLas results with the experimental measurements.

4 Load Distribution and Parallel Performance
The code framework of PICLas utilizes implementations of the MPI 2.0 standard
for parallelization. The load distribution between the MPI processes is a
crucial step. A domain decomposition by grid elements was chosen as the strategy.
In a preprocessing step, all elements within the computational domain
are sorted along a Hilbert curve due to its clustering property [6]. Then, each
MPI process receives a certain segment of the space-filling curve (SFC). To
illustrate an optimal load balance scenario, a simplified grid is considered that
consists of 8 × 8 = 64 elements, which are ordered along an SFC. Fig. 5 depicts
the decomposition of the grid into four regions, each corresponding to
an individual MPI process when the number of processes is Np = 4. For inhomogeneous
particle distributions or elements of significantly different size, the
load has to be assigned carefully. In the DSMC method, the computational
cost L of each grid element is assumed to depend linearly on the contained
particle number. In an optimally balanced case, each process receives
approximately the average load.
Offset elements (i.e., indices I along the SFC) define the segment assigned
to a process. If the SFC covers the interval [1, Nelem ], the segment
of each process p is given by [I(p) + 1, I(p + 1)] with I(Np + 1) = Nelem .
Thus, the total load assigned to a single process results in:
L_{\mathrm{tot}}^{p} = \sum_{i=I(p)+1}^{I(p+1)} L_i \qquad (6)
The main goal of a proper load distribution is to minimize the idle time
of waiting processes, i.e., the maximum of all process-specific total loads
L^p_tot needs to be minimized. To achieve this, several distribution methods are
implemented in PICLas.
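As an illustration of Eq. (6) and of the quantity to be minimized, consider the following sketch (Python, illustrative only; the element loads Li are assumed to be stored in SFC order in the list loads, and offsets holds the 0-based offset elements I(1), ..., I(Np + 1); all names are chosen here and not taken from PICLas):

def process_loads(loads, offsets):
    # Total load per process according to Eq. (6):
    # L^p_tot = sum of L_i for i = I(p)+1, ..., I(p+1).
    # With 0-based Python slices this is loads[offsets[p]:offsets[p+1]].
    return [sum(loads[offsets[p]:offsets[p + 1]])
            for p in range(len(offsets) - 1)]

def max_load(loads, offsets):
    # The quantity a good distribution minimizes: the maximum of all L^p_tot.
    return max(process_loads(loads, offsets))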
Distribution by elements

In the most basic approach, the elements are distributed evenly, i.e., each process receives approximately Nelem /Np consecutive elements of the SFC, irrespective of the load they carry.

"Simple" load balance
The previous method is, however, not applicable if the elements have different
loads, since a subdivision by element number does not necessarily correspond
to the same fraction of the total load. Therefore, while looping through the
processes along the SFC, each process receives in our "simple" balance scheme
an iteratively growing segment until the load gathered so far is equal to or
greater than the ideal fraction. To ensure that the following processes receive
at least one element each, the respective number of assignable elements must
be reduced. A simplified sketch of the algorithm is given below.
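One possible realization of this scheme can be sketched as follows (Python, illustrative only; the element loads are assumed to be given in SFC order with Nelem ≥ Np, and the ideal fraction is taken here as the remaining load divided by the number of remaining processes):

def simple_distribution(loads, n_proc):
    # "Simple" balance scheme (sketch): walk along the SFC and close a process
    # segment once the gathered load reaches the ideal fraction, while leaving
    # at least one element for every following process.
    # Returns the offset elements I(1), ..., I(Np + 1) with I(1) = 0.
    n_elem = len(loads)
    offsets = [0]
    remaining = sum(loads)
    for p in range(n_proc):
        procs_left = n_proc - p
        ideal = remaining / procs_left            # ideal fraction for this process
        last_allowed = n_elem - (procs_left - 1)  # leave one element per remaining process
        i = offsets[-1]
        gathered = 0.0
        while i < last_allowed and (gathered < ideal or i == offsets[-1]):
            gathered += loads[i]
            i += 1
        offsets.append(i)
        remaining -= gathered
    offsets[-1] = n_elem                          # I(Np + 1) = Nelem
    return offsets

For the homogeneous 8 × 8 grid of Fig. 5 with Np = 4, this reduces to the even decomposition into four segments of 16 elements each.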
“Combing” algorithm
The "simple" algorithm ensures a very smooth load distribution for large
element numbers, since the current ideal fraction can be matched well by
iteratively adding elements. However, if some elements have much higher
loads than most of the remaining ones, this load distribution method fails.
For such cases, we developed a smoothing algorithm that "combs" the offset
elements along the SFC iteratively from the beginning towards the end. Here,
only the main characteristics of the method are briefly described; a strongly
simplified sketch of the core idea follows the list:
• The initial load distribution is obtained, e.g., from the "simple" balance method.
• A large number of different distributions is evaluated in terms of the maximum
process-total load max(L^p_tot); the one with the minimum value is
chosen as the final solution.
• If the maximum L^p_tot belongs to a process p with a greater SFC-index
than the minimum one (the maximum is "right" of the minimum), all offset
elements are shifted accordingly to the left.
• Maxima are smoothed to the right, i.e., small L^p_tot-intervals are increased
by shifting elements from maxima towards minima.
• If the resulting optimum distribution has already been reached before, elements
are shifted from the last process all the way towards the first one.
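The actual implementation is more elaborate than can be shown here; the following strongly simplified sketch (Python, illustrative only; it reuses process_loads and max_load from the sketch above) conveys only the core idea of shifting single offset elements between neighbouring processes and keeping the distribution with the smallest max(L^p_tot):

def comb_once(loads, offsets):
    # One smoothing pass: shift a single offset element between neighbouring
    # processes whenever this lowers the larger of the two process loads.
    ltot = process_loads(loads, offsets)
    shifted = False
    for p in range(len(ltot) - 1):
        left, right = ltot[p], ltot[p + 1]
        if left > right and offsets[p + 1] - offsets[p] > 1:
            moved = loads[offsets[p + 1] - 1]      # last element of process p
            if max(left - moved, right + moved) < max(left, right):
                offsets[p + 1] -= 1                # comb the offset to the left
                ltot[p] -= moved; ltot[p + 1] += moved
                shifted = True
        elif right > left and offsets[p + 2] - offsets[p + 1] > 1:
            moved = loads[offsets[p + 1]]          # first element of process p+1
            if max(left + moved, right - moved) < max(left, right):
                offsets[p + 1] += 1                # comb the offset to the right
                ltot[p] += moved; ltot[p + 1] -= moved
                shifted = True
    return shifted

def comb(loads, offsets, max_passes=1000):
    # Apply smoothing passes to the (mutable) offset list and keep the best
    # distribution found, i.e. the one with the minimum max(L^p_tot).
    best, best_val = list(offsets), max_load(loads, offsets)
    for _ in range(max_passes):
        if not comb_once(loads, offsets):
            break                                  # no further local improvement
        val = max_load(loads, offsets)
        if val < best_val:
            best, best_val = list(offsets), val
    return best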
To test the parallelization, multiple simulations were run for a simulated
time of 1 · 10−4 s, corresponding to 2000 iterations. The speed-up between 720
and 5760 cores was calculated by

S_N = \frac{t_{720}}{t_N}. \qquad (8)

The respective parallel efficiency was determined by

\eta_N = \frac{720 \cdot t_{720}}{N \cdot t_N}, \qquad (9)

where t720 and tN are the computational times using 720 and N cores, respectively.
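Combining Eqs. (8) and (9) gives η_N = (720/N) · S_N, i.e., the speed-up follows directly from the efficiency as S_N = (N/720) · η_N; for the largest run with N = 5760 cores and η = 0.87, this corresponds to a speed-up of S ≈ 8 · 0.87 ≈ 7 relative to 720 cores.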
Fig. 6 shows the speed-up over the number of utilized cores with the respective
parallel efficiency as a label. The case without actual load balance (distribution
by elements) and the distribution method by particle number per element
are compared against the ideal scaling behaviour. The "combing" algorithm
resulted in the same performance values as the "simple" balance method;
therefore, only the latter is displayed. The speed-up decreases with an increasing
number of cores due to the more frequent communication between MPI
processes. Nevertheless, a parallel efficiency of η = 0.87 can be achieved using
5760 cores for the blunted cone test case.
5 Summary and Conclusion

The hypersonic flow around a 70◦ blunted cone was simulated with the Direct
Simulation Monte Carlo method. The case features complex flow phenomena
such as a detached compression shock in front of the heat shield and a rarefied
gas flow in its wake. A comparison with the experimentally measured heat flux
yielded good agreement with the simulation results. The test case was utilized
to perform a strong scaling of the DSMC implementation of PICLas. With
regard to the computational duration on 720 cores, a parallel efficiency of 99%
to 87% could be achieved for 1440 and 5760 cores, respectively. The decrease
in parallel efficiency can be explained by the increasing MPI communication
effort. Currently, the implementation of CPU-time measurements in PICLas
is being investigated in order to calculate the element loads directly instead of
simply weighting by the particle number; this will be the focus of future reports.
Fig. 6: Parallel performance of the blunted cone test case between 720 and 5760 cores: speed-up S [-] over the number of cores Nproc [-] with the parallel efficiency η as labels, comparing the distribution by elements and the simple load balance against the ideal scaling.
6 Acknowledgements
References
3. G. A. Bird. Molecular Gas Dynamics and the Direct Simulation of Gas Flows.
Oxford University Press, Oxford, 1994.
4. S. Copplestone, T. Binder, A. Mirza, P. Nizenkov, P. Ortwein, M. Pfeiffer, S. Fasoulas,
and C.-D. Munz. Coupled PIC-DSMC simulations of a laser-driven plasma
expansion. In W. E. Nagel, D. H. Kröner, and M. M. Resch, editors, High
Performance Computing in Science and Engineering ‘15. Springer, 2016.
5. R. W. Hockney and J. W. Eastwood. Computer Simulation Using Particles.
McGraw-Hill, Inc., New York, 1988.
6. B. Moon, H.V. Jagadish, C. Faloutsos, and J.H. Saltz. Analysis of the clustering
properties of the Hilbert space-filling curve. Knowledge and Data Engineering,
IEEE Transactions on, 13(1):124–141, Jan 2001.
7. C.-D. Munz, M. Auweter-Kurtz, S. Fasoulas, A. Mirza, P. Ortwein, M. Pfeiffer,
and T. Stindl. Coupled Particle-In-Cell and Direct Simulation Monte Carlo
method for simulating reactive plasma flows. Comptes Rendus Mécanique,
342(10-11):662–670, October 2014.
8. P. Ortwein, T. Binder, S. Copplestone, A. Mirza, P. Nizenkov, M. Pfeiffer,
T. Stindl, S. Fasoulas, and C.-D. Munz. Parallel performance of a discontinuous
Galerkin spectral element method based PIC-DSMC solver. In W. E. Nagel,
D. H. Kröner, and M. M. Resch, editors, High Performance Computing in Science
and Engineering ‘14. Springer, 2015.
9. A. Stock, J. Neudorfer, B. Steinbusch, T. Stindl, R. Schneider, S. Roller, C.-D.
Munz, and M. Auweter-Kurtz. Three-dimensional gyrotron simulation using a
high-order particle-in-cell method. In W. E. Nagel, D. H. Kröner, and M. M.
Resch, editors, High Performance Computing in Science and Engineering ’11.
Springer Berlin Heidelberg.