SDLEC Parallelization Workshop Presentation Slides
• Solving some problems requires more computing power than what typical machines provide.
• Solution → Different ways of parallel optimization
• Single core: SIMD
  • ax² + bx → x(ax + b)
  • 3 multiplications → 2 multiplications
  • (x0, x1) ∘ (ax0 + b, ax1 + b) → the two lanes can be computed concurrently (see the sketch after this list).
• Multi-core (multi-threading): OpenMP, std::thread, library based, etc.
• Machine clusters: MPI ← our workshop’s scope
• Custom accelerator hardware (GPUs, FPGAs, etc.)
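A minimal sketch of the SIMD idea in plain C++ (illustrative only; the function and variable names are ours): with optimization enabled, compilers typically auto-vectorize this loop so that several lanes of x·(a·x + b) are evaluated per instruction.

#include <cstddef>
#include <vector>

// Evaluate y[i] = x[i]*(a*x[i] + b) for every element of x.
// The factored form needs only 2 multiplications per element, and the
// loop body is identical for every i, which is exactly what SIMD exploits.
std::vector<double> evaluatePoly(const std::vector<double>& x, double a, double b)
{
    std::vector<double> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        y[i] = x[i]*(a*x[i] + b);
    }
    return y;
}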
1. General Introduction
2. Point-to-point communication
3. Collective communication
4. How do I send my own Data?
5. Advanced applications and topics
Data Parallelism
Work units execute the same operations on a (distributed) set of data:
domain decomposition.
Task Parallelism
Work units execute on different control paths, possibly on different data
sets: multi-threading.
Pipeline Parallelism
Work gets split between producer and consumer units that are directly
connected. Each unit executes a single phase of a given task and hands
over control to the next one.
simple
Simple geometric decomposition, in which the domain is split into pieces
by direction
hierarchical
Same as simple, but the order in which the directional split is done can be
specified (see the decomposeParDict sketch after this list)
metis & scotch
Require no geometric input from the user and attempt to minimize the
number of processor boundaries. Weighting for the decomposition
between processors can be specified
manual
Allocation of each cell to a particular processor is specified directly.
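As a hedged illustration (coefficient names can vary slightly between OpenFOAM versions), a hierarchical decomposition into four subdomains is configured in system/decomposeParDict along these lines:

// system/decomposeParDict (illustrative sketch)
numberOfSubdomains 4;

method          hierarchical;

hierarchicalCoeffs
{
    n           (2 2 1);   // number of splits in x, y and z
    order       xyz;       // order in which the directional split is done
}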
• Use of a layer of ghost cells to handle comms with neighboring processes → MPI
calls not self-adjoint
• Artificial increase in the number of computations per process (and does not scale well)
Domain decomposition in OpenFOAM: Processor boundaries
[Figure: decomposed domain with subdomains Ω0, Ω1 and Ω2; the processor boundary Γ0,1 between Ω0 and Ω1 behaves like a cyclic patch + comms]
Communicate efficiently through
processor boundaries
// Swap boundary face lists through
// processor boundary faces
// This code runs on all processors
syncTools::swapBoundaryFaceList(mesh(), faceField);
// Now faceField has values from the other side
// of the processor boundary
Distributed Memory
Message Passing Interface (MPI): Execute on multiple machines.
Shared Memory
Multi-threading capabilities of programming languages, OpenMP:
One machine, many CPU cores.
Data Streaming
CUDA and OpenCL. Applications are organized into streams (of same-type
elements) and kernels (which act on elements of streams), which suits
accelerator hardware (GPUs, FPGAs, etc.).
You don’t have to know the MPI API to parallelise OpenFOAM code! But you do need
the concepts.
MPI Communicators
Objects defining which processes can communicate; Processes are referred
to by their ranks
• MPI_COMM_FOAM in the Foundation version and Foam Extend 5
• MPI_COMM_WORLD (All processes) elsewhere
• Size: Pstream::nProcs()
MPI rank
Process Identifier (an integer).
• Pstream::myProcNo() returns the active process’s ID.
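A minimal sketch of how these show up in practice (assuming the usual OpenFOAM Pstream/Pout facilities are available in the application):

// Every rank reports its own ID and the communicator size;
// Pout prefixes each line with the rank so the output stays readable.
Pout << "I am rank " << Pstream::myProcNo()
     << " of " << Pstream::nProcs() << " processes" << endl;

// Rank-dependent branching: only the master rank prints this.
if (Pstream::master())
{
    Info << "Running on " << Pstream::nProcs() << " ranks" << endl;
}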
[Figure: point-to-point comms between P0 and P1; P0 writes into an OPstream, which sends through an MPI buffer, and P1 reads from an IPstream, which receives from an MPI buffer]
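A hedged sketch of that pattern (the commsTypes spelling differs between versions; older forks use Pstream::blocking instead of Pstream::commsTypes::blocking):

// Rank 0 sends a scalar to rank 1; rank 1 receives it.
if (Pstream::myProcNo() == 0)
{
    OPstream toOne(Pstream::commsTypes::blocking, 1);
    toOne << scalar(3.14);
}
else if (Pstream::myProcNo() == 1)
{
    IPstream fromZero(Pstream::commsTypes::blocking, 0);
    scalar value;
    fromZero >> value;
    Pout << "received " << value << endl;
}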
A Deadlock happens when a process is waiting for a message that never reaches it.
Figure 6: Deadlock possibility due to a two-process send-receive cycle between P0 and P1, each sending through its own MPI buffer (whether it actually deadlocks depends on the MPI implementation used!).
Figure 7: Frequency of usage for each type of OpenFOAM comms (blocking, scheduled, nonBlocking) in OpenFOAM 10 (left), OpenFOAM v2012 (middle) and Foam-Extend 5 (right).
DISCLAIMER: Data generated prematurely; not suitable for comparing forks with respect to
parallel performance → better to compare the history of the same fork instead.
• Avoids Deadlocks
• Minimizes idle time for MPI processes
• Helps skip unnecessary synchronisation
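A hedged sketch of the usual non-blocking exchange with PstreamBuffers (class and enum names as in recent OpenFOAM versions; details vary between forks):

// All sends are posted first, finishedSends() completes the exchange,
// and only then are the receives read → no send/recv cycle, no deadlock.
PstreamBuffers pBufs(Pstream::commsTypes::nonBlocking);

for (label proci = 0; proci < Pstream::nProcs(); ++proci)
{
    if (proci != Pstream::myProcNo())
    {
        UOPstream toProc(proci, pBufs);
        toProc << scalar(Pstream::myProcNo());
    }
}

pBufs.finishedSends();

for (label proci = 0; proci < Pstream::nProcs(); ++proci)
{
    if (proci != Pstream::myProcNo())
    {
        UIPstream fromProc(proci, pBufs);
        scalar value;
        fromProc >> value;
    }
}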
[Figure: measured communication times for message sizes of 64, 32, 16 and 8 Mb]
MPI send modes used in OpenFOAM code
• Standard send (MPI_Send): will not return until the send buffer is safe to reuse;
may also buffer. Default in scheduled comms.
• Non-blocking send: default in nonBlocking comms.
• All processes call the same function with the same set of arguments (see the sketch after this list).
• Although the MPI standard has offered non-blocking collective communications since MPI-3, OpenFOAM uses only the blocking variants.
• NOT a simple wrapper around P2P comms.
• Most collective algorithms are O(log(nProcs)).
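A minimal sketch of the "same call on every rank" rule using OpenFOAM's reduce wrapper (T stands for any volScalarField available in the solver; the surrounding code is assumed):

// Every rank passes its local maximum; reduce() combines the values with
// maxOp and leaves the global maximum on all participating ranks.
scalar localMax = max(T.primitiveField());
reduce(localMax, maxOp<scalar>());
Info << "Global max(T) = " << localMax << endl;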
[Figure: reduce with orOp() across P0, P1 and P2, combining the per-rank values v0, v1 and v2 into v0 || v1 || v2]
[Figure: gatherList/scatterList across P0, P1 and P2; the per-rank values v0, v1 and v2 are gathered into a list, and the complete list is then available on every rank]
gather: 41, gatherList: 121, scatter: 146, scatterList: 75
Figure 13: Frequency of usage for each API call of OpenFOAM collective comms (v2012)
• There are also some fork-specific interface methods we won’t discuss (e.g.
Pstream::exchange)
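A hedged sketch of the gatherList/scatterList pair from Figure 13 (someLocalValue is a placeholder for whatever per-rank quantity is being collected):

// Each rank fills its own slot of a per-processor list; gatherList
// collects all slots on the master and scatterList broadcasts the
// complete list back, so afterwards every rank holds every value.
List<scalar> procValues(Pstream::nProcs());
procValues[Pstream::myProcNo()] = someLocalValue;

Pstream::gatherList(procValues);
Pstream::scatterList(procValues);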
[Figure: reduce with sumOp() across P0, P1 and P2; after the call every rank holds v0 + v1 + v2]
You can still fall into endless loops if you’re not careful!
So, Edges can’t be communicated as MPI messages directly; provide the necessary
serialization and de-serialization operators:
Pstream::scatterList(g);

// Check graph edges on all processes
Pout << g << endl;
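For illustration only (this Edge class and its members are hypothetical, not the workshop's actual code), the stream operators that make a user-defined type transferable through Pstream look roughly like this:

// Hypothetical edge type; the Ostream/Istream operators are what allow
// it to be serialized into and de-serialized from Pstream-based comms.
class Edge
{
    label start_;
    label end_;

public:

    Edge() : start_(-1), end_(-1) {}
    Edge(label s, label e) : start_(s), end_(e) {}

    // Serialization: write the members to an OpenFOAM output stream
    friend Ostream& operator<<(Ostream& os, const Edge& e)
    {
        os << e.start_ << token::SPACE << e.end_;
        return os;
    }

    // De-serialization: read the members back from an input stream
    friend Istream& operator>>(Istream& is, Edge& e)
    {
        is >> e.start_ >> e.end_;
        return is;
    }
};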
1. Refine each processor’s part of the mesh, but we need to keep the global cell count
under a certain value:
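A hedged sketch of the check this requires (maxGlobalCells is a hypothetical limit, and the actual refinement calls are omitted):

// Sum the per-processor cell counts; every rank gets the same total,
// so all ranks take the same decision about further refinement.
const label maxGlobalCells = 1000000;
label nGlobalCells = returnReduce(mesh.nCells(), sumOp<label>());

if (nGlobalCells < maxGlobalCells)
{
    // ... refine this processor's part of the mesh ...
}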
CC BY-SA 2023
This offering is not approved or endorsed by OpenCFD Limited, the producer of the
OpenFOAM software and owner of the OPENFOAM® and OpenCFD® trade marks.
• MPI standards: a blocking send can be used with a non-blocking receive, and
vice versa
• But the OpenFOAM wrapping makes it "non-trivial" to get this to work
• You can still use the MPI API directly, e.g. if you need one-sided communication.