High Performance Computing – Lecture 2: Parallel Programming with MPI
ADVANCED SCIENTIFIC COMPUTING
Prof. Dr.-Ing. Morris Riedel
Adjunct Associated Professor
School of Engineering and Natural Sciences, University of Iceland, Reykjavik, Iceland
Research Group Leader, Juelich Supercomputing Centre, Forschungszentrum Juelich, Germany
Parallel Programming with MPI
September 9, 2019
Room V02‐156
Review of Lecture 1 – High Performance Computing (HPC)
HPC Basics & HPC Ecosystem Technologies
Multi-core processors with high single-thread performance, used in parallel computing
Many-core processors with moderate single-thread performance, used in parallel computing
Distributed memory architectures using the Message Passing Interface (MPI)
Not only used in physical modeling and simulation sciences today, but also for machine & deep learning and 'big data'
[1] Distributed & Cloud Computing Book; [2] Introduction to High Performance Computing for Scientists and Engineers; [3] J. Haut, G. Cavallaro and M. Riedel et al.; [4] F. Berman: Maximising the Potential of Research Data
Jötunn HPC Environment with Libraries & Modules
Thinking Parallel & Step‐wise Walkthrough for Parallel Programming
Basic Building Blocks of a Parallel Program
Code Compilation & Parallel Executions
Simple PingPong Application Example
Students understand…
Latest developments in parallel processing & high performance computing (HPC)
How to create and use high‐performance clusters
What are scalable networks & data‐intensive workloads
The importance of domain decomposition
Complex aspects of parallel programming
HPC environment tools that support programming or analysing application behaviour
Different abstractions of parallel computing on various levels
Foundations and approaches of scientific domain-specific applications
Students are able to …
Program and use HPC programming paradigms
Take advantage of innovative scientific computing simulations & technology
Work with technologies and tools to handle parallelism complexity
Message Passing Interface (MPI) Concepts
Lectures 12 – 15 will offer more insights into a wide variety of physics & engineering applications that take advantage of HPC with MPI
Parallel Programming with MPI – Data Science Applications for HPC
Machine Learning Algorithms
Example: Highly Parallel Density‐based spatial clustering of applications with noise (DBSCAN)
Selected Applications: Clustering different cortical layers in brain tissue & point cloud data analysis
Clustering
[11] M. Goetz and M. Riedel et al., Proceedings IEEE Supercomputing Conference, 2015
Lecture 8 will provide more details on MPI application examples with a particular focus on parallel and scalable machine learning
Example: Modular Supercomputing Architecture – MPI Usage in Cluster Module
The Cluster Module (CM) offers Cluster Nodes (CNs) with high single-thread performance and a universal InfiniBand interconnect (we focus in this lecture only on this module; the network interconnection is important)
Given the CM architecture setup, it works very well for applications that take advantage of MPI
Pro: Network communication is relatively hidden and supported
Contra: Programming with MPI still requires using 'parallelization methods'
Not easy: Write 'technical code' well integrated in 'problem-domain code'
Example: Race Car Simulation & Heat Dissipation in a Room (evolving over time t)
Apply a good parallelization method (e.g. domain decomposition)
Manually write good MPI code for the (technical) communication between processors (e.g. across 1024 cores)
Integrate the technical code well with the problem-domain code (e.g. computational fluid dynamics & airflow)
[10] Modified from Caterham F1 team; [2] Introduction to High Performance Computing for Scientists and Engineers
Lecture 3 will provide more details on MPI application examples with a particular focus on parallelization fundamentals
Distributed‐Memory Computers – Revisited (cf. Lecture 1)
Features
The Message Passing Interface (MPI) is the dominant programming model
Processors communicate via Network Interfaces (NI)
The NI mediates the connection to a communication network
This setup is rarely built in its pure form today, but it remains the dominant programming model view
[2] Introduction to High Performance Computing for Scientists and Engineers; [10] Modified from Caterham F1 team
Programming with Distributed Memory using MPI – Revisited (cf. Lecture 1)
[5] MPI Standard
Features
No remote memory access on distributed‐memory systems
Requires 'sending messages' back and forth between processes
Many free Message Passing Interface (MPI) libraries available
Programming is tedious & complicated, but it is the most flexible method
Lecture 4 will provide more details on advanced functions of the Message Passing Interface (MPI) standard and its use in applications
GNU OpenMPI Implementation
Message Passing Interface (MPI)
A standardized and portable message‐passing standard
Designed to support different HPC architectures
A wide variety of MPI implementations exist
Standard defines the syntax and semantics of a core of library routines used in C, C++ & Fortran [5] The MPI Standard
OpenMPI Implementation
Open source license based on the BSD license
Full MPI (version 3) standards conformance [6] OpenMPI Web page
Developed & maintained by a consortium of
academic, research, & industry partners
Typically available as modules on HPC systems and used with mpicc compiler
Often built with the GNU compiler set and/or Intel compilers
Lecture 2 will provide a full introduction and many more examples of the Message Passing Interface (MPI) for parallel programming
What is MPI from a Technical Perspective?
‘Communication library’ abstracting from low‐level network view
Offers 500+ available functions to communicate between computing nodes
Practice reveals: parallel applications often require just ~12 (!) functions
Includes routines for efficient ‘parallel I/O’ (using underlying hardware)
Supports ‘different ways of communication’
'Point-to-point communication' between two computing nodes (process to process)
Collective functions involve 'N computing nodes in useful communication'
(Computing nodes are independent computing processors, which may also have N cores each, and are all part of one big parallel computer, e.g. a hybrid architecture, cf. Lecture 1)
Deployment on Supercomputers supporting Applications Portability
Installed on (almost) all parallel computers
Different languages: C, Fortran, Python, R, etc.
Careful: Different versions might be installed
Key reasons for requiring a standard programming library
Technical advancement in supercomputers is extremely fast
Parallel computing experts switch organizations and face another system
Applications using proprietary libraries were not portable
Whole applications had to be created from scratch or needed time-consuming code updates
MPI changed this & is the dominant parallel programming model
Works for different MPI standard implementations
E.g., MPICH
E.g., Parastation MPI
E.g., OpenMPI
Etc.
MPI is an open standard that significantly supports the portability of parallel applications across a wide variety of different HPC systems and supercomputer architectures
(Figure: porting a parallel MPI application from the MPI library on HPC Machine A to the MPI library on HPC Machine B)
TCP/IP and socket programming libraries are plentifully available – do we need a dedicated library for communication & network protocols over the Internet?
Goal: simplify parallel programming
Focus on scientific and engineering applications with mathematical calculations
Enable parallel and scalable machine and deep learning algorithms
Selected reasons
Designed for performance within large parallel computers (e.g. no security)
Supports various interconnects between 'computing nodes' (hardware)
Offers various benefits like 'reliable messages' or 'in-order arrivals'
MPI is not designed to handle arbitrary communication in computer networks and is thus very special
MPI is not good for clients that constantly establish/close connections again and again (this would have very slow performance in MPI)
MPI is not good for internet chat clients or Web service servers in the Internet (e.g. no security beyond firewalls, no message encryption directly available, etc.)
(Figure: an HPC machine with compute nodes, each with processors P and memories M; via MPI point-to-point communications, the value DATA: 17 in the memory of one processor is sent to another processor, where it arrives as NEW: 17)
Each processor has its own data in its memory that cannot be seen/accessed by other processors
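As a minimal sketch (not the exact code from the slides), a point-to-point exchange of the figure's value 17 between two processes could be written with the standard MPI_Send and MPI_Recv routines as follows:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, data;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        data = 17;                                        /* DATA: 17 in the memory of rank 0 */
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received NEW: %d\n", data);        /* arrives as NEW: 17 on rank 1 */
    }

    MPI_Finalize();
    return 0;
}

MPI_Send and MPI_Recv are matched by communicator, source/destination rank, and message tag (here 0); only the memory of the receiving process is updated.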
(Figure: Broadcast example – the value DATA: 17 at one processor is distributed and arrives as NEW: 17 at the other processors)
Broadcast distributes the same data to many or even all other processors
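A minimal sketch using the standard MPI_Bcast routine (the value matches the figure; the code itself is illustrative, not taken from the course material):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank;
    int value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 17;                                    /* DATA: 17 exists only on the root at first */

    /* root rank 0 distributes the same value to all processes in MPI_COMM_WORLD */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d now has NEW: %d\n", rank, value);  /* every rank prints 17 */

    MPI_Finalize();
    return 0;
}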
(Figure: Scatter example – the values DATA: 10, 20, 30 at one processor are distributed and arrive as NEW: 10, NEW: 20, NEW: 30 at different processors)
Scatter distributes different data to many or even all other processors
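A minimal sketch using the standard MPI_Scatter routine; the values 10, 20, 30, ... are chosen here to echo the figure and are generated for however many processes are started:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, size, mine, i;
    int* senddata = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* root prepares one different value per process, e.g. 10, 20, 30, ... */
        senddata = malloc(size * sizeof(int));
        for (i = 0; i < size; i++)
            senddata[i] = 10 * (i + 1);
    }

    /* each process receives exactly one of the different values from root rank 0 */
    MPI_Scatter(senddata, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Rank %d received NEW: %d\n", rank, mine);

    if (rank == 0)
        free(senddata);

    MPI_Finalize();
    return 0;
}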
(Figure: Gather example – the values DATA: 17, 80, 06, 19 located at different processors are collected as NEW values at one specific processor)
Gather collects data from many or even all other processors to one specific processor
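A minimal sketch using the standard MPI_Gather routine (the local values are placeholders, not the figure's numbers):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, size, mine, i;
    int* recvdata = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    mine = (rank + 1) * 7;                         /* stand-in for the local DATA value of each process */

    if (rank == 0)
        recvdata = malloc(size * sizeof(int));     /* only the collecting root needs a receive buffer */

    /* all processes send their local value; root rank 0 collects them in rank order */
    MPI_Gather(&mine, 1, MPI_INT, recvdata, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size; i++)
            printf("NEW value from rank %d: %d\n", i, recvdata[i]);
        free(recvdata);
    }

    MPI_Finalize();
    return 0;
}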
(Figure: Reduce example – a global sum: the values DATA: 17, 80, 06, 19 located at different processors are combined into NEW: 122 at one processor)
Reduce combines collection with computation based on data from many or even all other processors
Usage of reduce includes finding a global minimum or maximum, or the sum or product of the different data located at different processors
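A minimal sketch of a global sum with the standard MPI_Reduce routine (the local values are placeholders, not the figure's numbers):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, local, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;     /* stand-in for the DATA value held by each process */

    /* combine the local values of all processes into a global sum at root rank 0 */
    MPI_Reduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Global sum (NEW): %d\n", global_sum);

    MPI_Finalize();
    return 0;
}

Replacing MPI_SUM with MPI_MIN, MPI_MAX, or MPI_PROD yields the global minimum, maximum, or product mentioned above.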
Using MPI Ranks & Communicators
Answers the following question: How do I know where to send/receive to/from?
(numbers reflect the unique identity of a processor, named 'MPI rank')
Each MPI activity specifies the context in which a corresponding function is performed
MPI_COMM_WORLD (region/context of all processes)
Create (sub-)groups of the processes / virtual groups of processes
Perform communications only within these sub-groups easily with well-defined processes
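A minimal sketch of creating such sub-groups with the standard MPI_Comm_split routine; splitting the processes of MPI_COMM_WORLD into 'even' and 'odd' sub-communicators is chosen here purely for illustration:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int world_rank, sub_rank, sub_size;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* processes with the same 'color' (here: even or odd world rank) end up in the same sub-communicator */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);

    MPI_Comm_rank(sub_comm, &sub_rank);
    MPI_Comm_size(sub_comm, &sub_size);

    printf("World rank %d is rank %d of %d in its sub-group\n",
           world_rank, sub_rank, sub_size);

    /* collectives such as MPI_Bcast or MPI_Reduce can now be performed within sub_comm only */

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}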
Lecture 4 on advanced MPI techniques will provide details about the often used MPI Cartesian communicator & its use in applications
[Video] Introducing MPI – Summary
[9] Introducing MPI, YouTube Video
Check access to the cluster machine
Check MPI standard implementation and its version
Often SSH is used to remotely access clusters
OpenMPI
‘Open Source High Performance Computing’
Using the module environment
(cf. Practical Lecture 0.2)
Other implementations exist
E.g., MPICH implementation
E.g., Parastation MPI implementation
(we don't use those in this course)
[6] OpenMPI Web page; [12] Icelandic HPC Machines & Community
4 Nodes
CPU: 2x Intel Xeon E5-2690 v3 (2.6 GHz, 12 cores)
Memory
128GB DDR4
Interconnect
10 Gb/s Ethernet
Ganglia monitoring
service
Shows usage of CPUs
[12] Icelandic HPC Machines & Community
We will have a visit to computing room of Jötunn to ‘touch metal’ and will meet our HPC System expert Hjörleifur Sveinbjörnsson
SSH Access to HPC System – Jötunn HPC System Example – Revisited
Example: first login via Hekla (if you are not in Uni network)
[12] Icelandic HPC Machines & Community
Jötunn HPC System
Hekla System
#include <stdio.h>    /* for printf() */

int main()
{
    printf("Hello, World!");
    return 0;
}

The main function is 'called' by the operating system when a user runs the C program – but it is essentially a usual C function with optional parameters that we will explore during the course of the lecture series.
The printf() function sends formatted text as output to stdout and is often used for simple debugging of C programs.
return provides a return value to the calling function; in the case of the main function this can be considered an exit status code for the OS. Mostly, a 0 exit code signifies a normal run (no errors) and a non-0 exit code (e.g., 1) usually means there was a problem and the program had to exit abnormally.
Simple C Program
The above file content is stored in the file hello.c
Although it has the .c file extension, it remains a normal text file
hello.c is not executable as a C program – it needs a compilation using a C compiler
Data exchange is key for the design of applications
Sending/receiving data at specific times in the program
No shared memory for sharing variables with other remote processes
Messages can be simple variables (e.g. a word) or complex structures
Start with the basic building blocks using MPI
Building up the 'parallel computing environment'
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank, size;    /* will later be filled by the MPI library with rank and size information */

    printf("Hello World, I am %d out of %d\n", rank, size);

    return 0;
}

The two integer variables rank and size are later useful for working with specific data obtained from the MPI library that we need to add in the next step, in order to fill the integer variables with information about rank and size.
The printf() function sends formatted text as output to stdout and is often used for simple debugging of C programs.
Thinking in parallel in parallel programming means understanding that different processes have an identity and work on different elements of the program. In the example we want to give an output that shows the identity of each MPI process by using the rank and size information.
Extended Simple C Program (still C only)
The above file content is stored in the file hello.c
Selected changes to the basic C program structure prepare for MPI
hello.c is not executable as a C program – it needs a compilation using a C compiler
    MPI_Finalize();

    return 0;
}

[8] LLNL MPI Tutorial
Extended Simple C Program (now with MPI statements)
The above file content is stored in the file hello.c
hello.c is not executable as a C program – it needs a compilation using a C compiler
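A minimal sketch of the complete MPI version of hello.c, following the usual MPI_Init / MPI_Comm_rank / MPI_Comm_size / MPI_Finalize pattern from [8] LLNL MPI Tutorial (the exact code on the original slide may differ slightly):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                    /* set up the parallel computing environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* unique identity of this process (MPI rank) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */

    printf("Hello World, I am %d out of %d\n", rank, size);

    MPI_Finalize();                            /* shut down the MPI environment */

    return 0;
}

Compiled with, e.g., mpicc hello.c -o hello and started via mpirun, each process prints its own rank; because output to the screen is a serial resource, the order of the lines can vary from run to run.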
[12] Icelandic HPC Machines & Community
Knowledge of installed compilers is essential (e.g. C, Fortran90, etc.)
Different versions and types of compilers exist (Intel, GNU, MPI, etc.)
E.g. mpicc pingpong.c -o pingpong
Module environment tool
Avoids having to manually set up environment information for every application
Simplifies shell initialization and lets users easily modify their environment
Modules can be loaded and unloaded
Enables the installation of software in different versions
module avail
Lists all available modules on the HPC system (e.g. compilers, MPI, etc.)
module load
Loads particular modules into the current work environment [12] Icelandic HPC Machines & Community
E.g. module load gnu openmpi
Using modules to get the right C compiler for compiling hello.c: 'module load gnu openmpi'
Note: there are many C compilers available; we here pick one for our particular HPC course that works with the Message Passing Interface (MPI)
(Figure: hello.c is compiled with mpicc into the executable hello)
Note: If there are no errors, the file hello is now a full C program executable that can be started by an OS
New: C program with MPI statements (cf. Practical Lecture 0.2 w/o MPI statements)
[12] Icelandic HPC Machines & Community
Step 6: Parallel Processing – Executing an MPI Program with MPIRun & Script (1)
Compilation done in Step 5
Compilers and linkers need various information on where include files and libraries can be found
E.g. C header files like 'mpi.h'
Compiling is different for each programming language
Example to understand the distribution of the program
E.g., executing the MPI program on 4 processors
Normally batch system allocations (cf. Practical Lecture 0.2)
Understanding the role of mpirun is important: mpirun creates 4 processes that produce 'hello' output in parallel
Output of the program: the order of the outputs can vary because I/O to the screen is a 'serial resource'
(Figure: mpirun starts the hello executable on processors P with memories M)
Step 6: Parallel Processing – Executing an MPI Program with MPIRun & Script (2)
Need for a job script
Example using mpirun: mpirun creates 4 processes that produce 'hello' output in parallel (processors P with memories M)
Step-wise walkthrough: all performed steps should be done in the same manner for all MPI jobs
Step 6: Parallel Processing – Executing an MPI Program with MPIRun & Script (3)
Submission using the scheduler
Example: SLURM on the Jötunn HPC system
The scheduler allocated 4 nodes as requested
mpirun and the scheduler distribute the executable to the right nodes
The output file consists of the combined output of all 4 requested nodes
(Figure: scheduler, Jötunn login node, Jötunn compute nodes, output file)
(Figure, revisited: an HPC machine with compute nodes, each with processors P and memories M; via MPI point-to-point communications, the value DATA: 17 in the memory of one processor is sent to another processor, where it arrives as NEW: 17)
Each processor has its own data in its memory that cannot be seen/accessed by other processors
Example: pingpong.c
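The contents of pingpong.c are not listed on the slide; as a hedged sketch (assumed for illustration, not taken from the course material), a simple ping-pong between ranks 0 and 1 that also times the round trips could look like:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, msg = 17, rounds = 10, i;
    double t_start, t_end;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t_start = MPI_Wtime();
    for (i = 0; i < rounds; i++) {
        if (rank == 0) {
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);                     /* ping to rank 1 */
            MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* pong back */
        } else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    t_end = MPI_Wtime();

    if (rank == 0)
        printf("Average round-trip time: %f seconds\n", (t_end - t_start) / rounds);

    MPI_Finalize();
    return 0;
}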
(Figure, revisited: Broadcast example – the value DATA: 17 at one processor is distributed and arrives as NEW: 17 at the other processors)
Broadcast distributes the same data to many or even all other processors
Example: broadcast.c
(Figure: processors P with their memories M forming a parallel computing environment, modified from [8] LLNL MPI Tutorial)
[Video] OpenMPI
[13] What is OpenMPI, YouTube Video
[1] K. Hwang, G. C. Fox, J. J. Dongarra, 'Distributed and Cloud Computing', Book, Online:
http://store.elsevier.com/product.jsp?locale=en_EU&isbn=9780128002049
[2] Georg Hager & Gerhard Wellein, 'Introduction to High Performance Computing for Scientists and Engineers', Chapman & Hall/CRC Computational Science, ISBN 143981192X, English, ~330 pages, 2010, Online:
http://www.amazon.de/Introduction-Performance-Computing-Scientists-Computational/dp/143981192X
[3] J. Haut, G. Cavallaro and M. Riedel et al., IEEE Transactions on Geoscience and Remote Sensing, 2019, Online:
https://www.researchgate.net/publication/335181248_Cloud_Deep_Networks_for_Hyperspectral_Image_Analysis
[4] Fran Berman, 'Maximising the Potential of Research Data'
[5] The MPI Standard, Online:
http://www.mpi-forum.org/docs/
[6] OpenMPI Web page, Online:
https://www.open-mpi.org/
[7] DEEP Projects Web page, Online:
http://www.deep-projects.eu/
[8] LLNL MPI Tutorial, Online:
https://computing.llnl.gov/tutorials/mpi/
[9] HPC – Introducing MPI, YouTube Video, Online:
http://www.youtube.com/watch?v=kHV6wmG35po
[10] Caterham F1 Team Races Past Competition with HPC, Online:
http://insidehpc.com/2013/08/15/caterham-f1-team-races-past-competition-with-hpc
[11] M. Goetz, C. Bodenstein, M. Riedel, 'HPDBSCAN – Highly Parallel DBSCAN', in proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC2015), Machine Learning in HPC Environments (MLHPC) Workshop, 2015, Online:
https://www.researchgate.net/publication/301463871_HPDBSCAN_highly_parallel_DBSCAN
Lecture Bibliography (2)
[12] Icelandic HPC Machines & Community, Online:
http://ihpc.is
[13] What is OpenMPI, YouTube Video, Online:
http://www.youtube.com/watch?v=D0-xSWBGNAw