SDLEC Parallelization Workshop Presentation Slides
• Solving some problems requires more computing power than what typical machines provide.
• Solution → Different ways of parallel optimization
• Single core: SIMD
  • ax² + bx → x(ax + b)
  • 3 multiplications → 2 multiplications
  • (x0, x1) ∘ (ax0 + b, ax1 + b) → the two lanes can be computed concurrently (see the sketch after this list).
• Multi-core (multi-threading): OpenMP, std::thread, library based, etc.
• Machine clusters: MPI ← our workshop’s scope
• Custom accelerator hardware (GPUs, FPGAs, etc.)
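A minimal sketch of the SIMD idea in plain C++ (illustrative only; the function and variable names are ours): with optimization enabled, compilers typically auto-vectorize this loop so that several lanes of x·(a·x + b) are evaluated per instruction.

#include <cstddef>
#include <vector>

// Evaluate y[i] = x[i]*(a*x[i] + b) for every element of x.
// The factored form needs only 2 multiplications per element, and the
// loop body is identical for every i, which is exactly what SIMD exploits.
std::vector<double> evaluatePoly(const std::vector<double>& x, double a, double b)
{
    std::vector<double> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        y[i] = x[i]*(a*x[i] + b);
    }
    return y;
}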
1. General Introduction
2. Point-to-point communication
3. Collective communication
4. How do I send my own Data?
5. Advanced applications and topics
Data Parallelism
Work units execute the same operations on a (distributed) set of data:
domain decomposition.
Task Parallelism
Work units execute on different control paths, possibly on different data
sets: multi-threading.
Pipeline Parallelism
Work gets split between producer and consumer units that are directly
connected. Each unit executes a single phase of a given task and hands
over control to the next one.
simple
Simple geometric decomposition, in which the domain is split into pieces
by direction
hierarchical
Same as simple, but the order in which the directional split is done can be
specified (see the decomposeParDict sketch after this list)
metis & scotch
Require no geometric input from the user and attempt to minimize the
number of processor boundaries. Weighting for the decomposition
between processors can be specified
manual
Allocation of each cell to a particular processor is specified directly.
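As a hedged illustration (coefficient names can vary slightly between OpenFOAM versions), a hierarchical decomposition into four subdomains is configured in system/decomposeParDict along these lines:

// system/decomposeParDict (illustrative sketch)
numberOfSubdomains 4;

method          hierarchical;

hierarchicalCoeffs
{
    n           (2 2 1);   // number of splits in x, y and z
    order       xyz;       // order in which the directional split is done
}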
• Use of a layer of ghost cells to handle comms with neighboring processes → MPI
calls not self-adjoint
• Artificial increase in the number of computations per process (and does not scale well)
Domain decomposition in OpenFOAM: Processor boundaries
[Figure: decomposed domain with subdomains Ω0, Ω1 and Ω2; the processor boundary Γ0,1 between Ω0 and Ω1 behaves like a cyclic patch + comms]
Communicate efficiently through
processor boundaries
// Swap boundary face lists through
// processor boundary faces
// This code runs on all processors
syncTools::swapBoundaryFaceList(mesh(), faceField);
// Now faceField has values from the other side
// of the processor boundary
Distributed Memory
Message Passing Interface (MPI): Execute on multiple machines.
Shared Memory
Multi-threading capabilities of programming languages, OpenMP:
One machine, many CPU cores.
Data Streaming
CUDA and OpenCL. Applications are organized into streams (of same-type
elements) and kernels (which act on elements of streams), which suits
accelerator hardware (GPUs, FPGAs, etc.).
You don’t have to know the MPI API to parallelise OpenFOAM code! But you do need
the concepts.
MPI Communicators
Objects defining which processes can communicate; Processes are referred
to by their ranks
• MPI_COMM_FOAM in the Foundation version and Foam Extend 5
• MPI_COMM_WORLD (All processes) elsewhere
• Size: Pstream::nProcs()
MPI rank
Process Identifier (an integer).
• Pstream::myProcNo() returns the active process’s ID.
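A minimal sketch of how these show up in practice (assuming the usual OpenFOAM Pstream/Pout facilities are available in the application):

// Every rank reports its own ID and the communicator size;
// Pout prefixes each line with the rank so the output stays readable.
Pout << "I am rank " << Pstream::myProcNo()
     << " of " << Pstream::nProcs() << " processes" << endl;

// Rank-dependent branching: only the master rank prints this.
if (Pstream::master())
{
    Info << "Running on " << Pstream::nProcs() << " ranks" << endl;
}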
[Figure: point-to-point comms between P0 and P1; P0 writes into an OPstream, which sends through an MPI buffer, and P1 reads from an IPstream, which receives from an MPI buffer]
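A hedged sketch of that pattern (the commsTypes spelling differs between versions; older forks use Pstream::blocking instead of Pstream::commsTypes::blocking):

// Rank 0 sends a scalar to rank 1; rank 1 receives it.
if (Pstream::myProcNo() == 0)
{
    OPstream toOne(Pstream::commsTypes::blocking, 1);
    toOne << scalar(3.14);
}
else if (Pstream::myProcNo() == 1)
{
    IPstream fromZero(Pstream::commsTypes::blocking, 0);
    scalar value;
    fromZero >> value;
    Pout << "received " << value << endl;
}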
A Deadlock happens when a process is waiting for a message that never reaches it.
Figure 6: Deadlock possibility due to a two-process send-receive cycle between P0 and P1, each sending through its own MPI buffer (whether it actually deadlocks depends on the MPI implementation used!).
Figure 7: Frequency of usage for each type of OpenFOAM comms (blocking, scheduled, nonBlocking) in OpenFOAM 10 (left), OpenFOAM v2012 (middle) and Foam-Extend 5 (right).
DISCLAIMER: Data generated prematurely; not suitable for comparing forks with respect to
parallel performance → better to compare the history of the same fork instead.
• Avoids Deadlocks
• Minimizes idle time for MPI processes
• Helps skip unnecessary synchronisation
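A hedged sketch of the usual non-blocking exchange with PstreamBuffers (class and enum names as in recent OpenFOAM versions; details vary between forks):

// All sends are posted first, finishedSends() completes the exchange,
// and only then are the receives read → no send/recv cycle, no deadlock.
PstreamBuffers pBufs(Pstream::commsTypes::nonBlocking);

for (label proci = 0; proci < Pstream::nProcs(); ++proci)
{
    if (proci != Pstream::myProcNo())
    {
        UOPstream toProc(proci, pBufs);
        toProc << scalar(Pstream::myProcNo());
    }
}

pBufs.finishedSends();

for (label proci = 0; proci < Pstream::nProcs(); ++proci)
{
    if (proci != Pstream::myProcNo())
    {
        UIPstream fromProc(proci, pBufs);
        scalar value;
        fromProc >> value;
    }
}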
[Figure: measured communication times for message sizes of 64, 32, 16 and 8 Mb]
MPI send modes used in OpenFOAM code
• Standard send (MPI_Send): will not return until the send buffer is safe to reuse;
may also buffer. Default in scheduled comms.
• Non-blocking send: default in nonBlocking comms.
• All processes call the same function with the same set of arguments (see the sketch after this list).
• Although the MPI standard has offered non-blocking collective communications since MPI-3, OpenFOAM uses only the blocking variants.
• NOT a simple wrapper around P2P comms.
• Most collective algorithms are O(log(nProcs)).
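A minimal sketch of the "same call on every rank" rule using OpenFOAM's reduce wrapper (T stands for any volScalarField available in the solver; the surrounding code is assumed):

// Every rank passes its local maximum; reduce() combines the values with
// maxOp and leaves the global maximum on all participating ranks.
scalar localMax = max(T.primitiveField());
reduce(localMax, maxOp<scalar>());
Info << "Global max(T) = " << localMax << endl;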
[Figure: reduce with orOp() across P0, P1 and P2, combining the per-rank values v0, v1 and v2 into v0 || v1 || v2]
[Figure: gatherList/scatterList across P0, P1 and P2; the per-rank values v0, v1 and v2 are gathered into a list, and the complete list is then available on every rank]
gather: 41, gatherList: 121, scatter: 146, scatterList: 75
Figure 13: Frequency of usage for each API call of OpenFOAM collective comms (v2012)
• There are also some fork-specific interface methods we won’t discuss (e.g.
Pstream::exchange)
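A hedged sketch of the gatherList/scatterList pair from Figure 13 (someLocalValue is a placeholder for whatever per-rank quantity is being collected):

// Each rank fills its own slot of a per-processor list; gatherList
// collects all slots on the master and scatterList broadcasts the
// complete list back, so afterwards every rank holds every value.
List<scalar> procValues(Pstream::nProcs());
procValues[Pstream::myProcNo()] = someLocalValue;

Pstream::gatherList(procValues);
Pstream::scatterList(procValues);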
[Figure: reduce with sumOp() across P0, P1 and P2; after the call every rank holds v0 + v1 + v2]
You can still fall into endless loops if you’re not careful!
So, Edges can’t be communicated as MPI messages directly; provide the necessary
serialization and de-serialization operators:
Pstream::scatterList(g);

// Check graph edges on all processes
Pout << g << endl;
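For illustration only (this Edge class and its members are hypothetical, not the workshop's actual code), the stream operators that make a user-defined type transferable through Pstream look roughly like this:

// Hypothetical edge type; the Ostream/Istream operators are what allow
// it to be serialized into and de-serialized from Pstream-based comms.
class Edge
{
    label start_;
    label end_;

public:

    Edge() : start_(-1), end_(-1) {}
    Edge(label s, label e) : start_(s), end_(e) {}

    // Serialization: write the members to an OpenFOAM output stream
    friend Ostream& operator<<(Ostream& os, const Edge& e)
    {
        os << e.start_ << token::SPACE << e.end_;
        return os;
    }

    // De-serialization: read the members back from an input stream
    friend Istream& operator>>(Istream& is, Edge& e)
    {
        is >> e.start_ >> e.end_;
        return is;
    }
};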
1. Refine each processor’s part of the mesh, but we need to keep the global cell count
under a certain value:
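A hedged sketch of the check this requires (maxGlobalCells is a hypothetical limit, and the actual refinement calls are omitted):

// Sum the per-processor cell counts; every rank gets the same total,
// so all ranks take the same decision about further refinement.
const label maxGlobalCells = 1000000;
label nGlobalCells = returnReduce(mesh.nCells(), sumOp<label>());

if (nGlobalCells < maxGlobalCells)
{
    // ... refine this processor's part of the mesh ...
}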
CC BY-SA 2023
This offering is not approved or endorsed by OpenCFD Limited, the producer of the
OpenFOAM software and owner of the OPENFOAM® and OpenCFD® trade marks.
• MPI standards: a blocking send can be used with a non-blocking receive, and
vice versa
• But the OpenFOAM wrapping makes it "non-trivial" to get this to work
• You can still use the MPI API directly, e.g. if you need one-sided communication.