Parallel Programming
Computing for Shared and Distributed Memory Models
December 13, 2022 / January 9, 2023
Stéphane Zuckerman ([email protected])
ETIS, CY Cergy-Paris University, ENSEA, CNRS
Outline
5. References
Figure: Symmetric Multi-Processor (SMP) system with Non-Uniform Memory Access (NUMA); multiple DRAM banks.
Figure: Single CPU: CPU, L1 data cache (L1D), L1 instruction cache (L1I), unified L2 cache (L2), unified L3 cache (L3); single DRAM bank.
Hardware context
• The end of Dennard scaling
• Moore's law is now used to add more computing units on a single chip (instead of raising clock frequencies)
• Programming chip multiprocessors (CMPs) is no longer just for scientific/high-performance computing
• Embedded chips also require programming models and execution models that efficiently exploit all of the hardware
Environment Variables
Once set, their values are taken into account at execution time (e.g., OMP_NUM_THREADS to set the number of threads, OMP_SCHEDULE to select the loop schedule).
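For instance (a minimal sketch, assuming OpenMP is enabled at compile time), the value of OMP_NUM_THREADS set in the shell is reflected by omp_get_max_threads() at run time:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  /* Reflects OMP_NUM_THREADS: after `export OMP_NUM_THREADS=4`,
   * the call below typically returns 4. */
  printf("max threads = %d\n", omp_get_max_threads());
  return 0;
}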
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#ifndef _OPENMP
#define omp_get_thread_num() 0
#endif
int main(void)
{
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    printf("[%d]\tHello, World!\n", tid);
  }
  return EXIT_SUCCESS;
}
Figure: omp_hello.c

Sample run:
examples$ ./hello
[0] Hello, World!
[3] Hello, World!
[1] Hello, World!
[2] Hello, World!
#include <stdio.h>
#include <omp.h>
int main(void) {
  float a = 1900.0;
#pragma omp parallel default(none) private(a)
  {
    a = a + 716.;
    printf("[%d]\ta = %.2f\n", omp_get_thread_num(), a);
  }
  printf("[%d]\ta = %.2f\n", omp_get_thread_num(), a);
  return 0;
}
Figure: omp_private.c

Sample run:
[2] a = 716.00
[1] a = 716.00
[0] a = 716.00
[3] a = 716.00
[0] a = 1900.00
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#ifndef _OPENMP
#define omp_get_thread_num() 0
#endif
int main(void)
{
  int ids[] = {0, 1, 2, 3, 4, 5, 6, 7};
#pragma omp parallel default(none) shared(ids)
  {
    printf("[%d]\tHello, World!\n", ids[omp_get_thread_num()]);
  }
  return EXIT_SUCCESS;
}
Figure: hello2.c

Sample run:
examples$ ./hello2
[0] Hello, World!
[3] Hello, World!
[1] Hello, World!
[2] Hello, World!
printf("a = %f\n",a);
return 0;
}
Figure: omp_firstprivate.c
#include <stdio.h>
#include <omp.h>

void sub(void);

int main(void) {
#pragma omp parallel default(shared)
  {
    sub();
  }
  return 0;
}

void sub(void) {
  int a = 19716;
  a += omp_get_thread_num();
  printf("a = %d\n", a);
}
Figure: omp_scope.c.txt
Figure: parallel_for.c
1. The iterator of an omp for loop must use additions/subtractions to get to the next iteration (no i *= 10 in the postcondition).
2. The iterator of the outermost loop (the one immediately following the omp for directive) is always private, but the iterators of nested loops are not!
3. There is an implicit barrier at the end of the loop. You can remove it by adding the nowait clause on the same line: #pragma omp for nowait.
4. How the iterations are distributed among threads can be specified using the schedule clause (see the sketch below).
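A minimal sketch of such a loop (the array a and its size are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
  enum { N = 1000 };
  double *a = malloc(N * sizeof *a);
  if (!a) { perror("malloc"); return EXIT_FAILURE; }

  /* The iterator i of the loop that immediately follows `omp for` is
   * implicitly private; nowait removes the implicit barrier at the
   * end of the loop. */
#pragma omp parallel default(none) shared(a)
  {
#pragma omp for schedule(static) nowait
    for (int i = 0; i < N; ++i)
      a[i] = 2.0 * i;
  }

  printf("a[N-1] = %.1f\n", a[N - 1]);
  free(a);
  return EXIT_SUCCESS;
}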
NumIterations = |FinalVal − InitVal| / Stride + (|FinalVal − InitVal| mod Stride)

The number of iteration chunks is thus computed like this:

NumChunks = NumIterations / ChunkSize + (NumIterations mod ChunkSize)
Static Scheduling
schedule(static,chunksize) distributes the iteration chunks across threads in a round-robin fashion.
• Guarantee: if two loops with the same "header" (precondition, condition, postcondition, and chunksize for the parallel for directive) follow each other, the threads are assigned the same iteration chunks
• By default (no chunksize specified), the iteration space is divided into roughly equal chunks, one per thread (i.e., OMP_NUM_THREADS chunks)
• Very useful when iterations take roughly the same time to perform (e.g., dense linear algebra routines)
Dynamic Scheduling
schedule(dynamic,chunksize) divides the iteration space according to chunksize, and creates an “abstract” queue
of iteration chunks. If a thread is done processing its chunk, it dequeues the next one from the queue. By default,
chunksize is 1.
Very useful if the time to process individual iterations varies.
Guided Scheduling
schedule(guided,chunksize) has the same behavior as dynamic, but the chunk size is divided by two each time a thread dequeues a new chunk. The minimum size is one, and so is the default.
Very useful if the time to process individual iterations varies and the amount of work has a "tail".
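The difference matters most when iteration costs are unbalanced. A minimal sketch (the work() function and the chunk size of 100 are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Iteration i performs O(i) work, so iteration costs vary widely:
 * a good candidate for dynamic or guided scheduling. */
static double work(int i)
{
  double s = 0.0;
  for (int k = 0; k <= i; ++k)
    s += k;
  return s;
}

int main(void)
{
  enum { N = 10000 };
  double acc = 0.0;
#pragma omp parallel default(none) firstprivate(acc)
  {
    /* Compare with schedule(static, 100) and schedule(guided, 100). */
#pragma omp for schedule(dynamic, 100)
    for (int i = 0; i < N; ++i)
      acc += work(i);
    printf("[%d]\tmy partial sum = %.0f\n", omp_get_thread_num(), acc);
  }
  return EXIT_SUCCESS;
}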
return f;
}
Figure: omp_for_schedule.c
Parallel Loops
Specifying the Schedule Mode II
int main(void) {
  printf("MAX = %.2f\n", MAX);
  double acc = 0.0;
  int* sum_until = malloc(MAX * sizeof(int));
  if (!sum_until) perror("malloc"), exit(EXIT_FAILURE);
  for (int i = 0; i < (int)MAX; ++i) sum_until[i] = rand() % 100;
#pragma omp parallel default(none) \
  shared(sum_until) firstprivate(acc)
  { /* Use the OMP_SCHEDULE environment variable on the command
     * line to specify the type of scheduling you want, e.g.:
     * export OMP_SCHEDULE="static" or OMP_SCHEDULE="dynamic,10"
     * or OMP_SCHEDULE="guided,100"; ./omp_schedule
     */
#pragma omp for schedule(runtime)
    for (int i = 0; i < (int)MAX; i += 1) {
      acc += sum( sum_until[i] );
    }
    printf("[%d]\tMy sum = %.2f\n", omp_get_thread_num(), acc);
  }
  free(sum_until);
  return 0;
}
Figure: omp_for_schedule.c
Sample timings (real / user / sys, as reported by time):
real 0m11.911s   user 0m11.930s   sys 0m0.004s
real 0m11.312s   user 0m11.356s   sys 0m0.004s
real 0m0.546s    user 0m0.576s    sys 0m0.004s
real 0m0.023s    user 0m0.059s    sys 0m0.004s
real 0m8.437s    user 0m8.452s    sys 0m0.008s
real 0m5.401s    user 0m5.438s    sys 0m0.008s
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

unsigned long g_COUNTER = 0;  /* shared counter, updated without synchronization */

int
main(void)
{
  int n_threads = 1;
#pragma omp parallel default(none) \
  shared(n_threads, stdout, g_COUNTER)
  {
#pragma omp master
    {
      n_threads = omp_get_num_threads();
      printf("n_threads = %d\t", n_threads); fflush(stdout);
    }
    ++g_COUNTER;  /* data race: several threads may update g_COUNTER at once */
  }
  printf("g_COUNTER = %lu\n", g_COUNTER);
  return EXIT_FAILURE;
}
critical Directive
#pragma omp critical [(name)]
Guarantees that only one thread at a time executes the sequence of instructions contained in the (named) critical section. If no name is specified, an "anonymous" name is automatically generated.
atomic Directive
#pragma omp atomic
Guarantees the atomicity of the single arithmetic update that follows. On architectures that support atomic instructions, the compiler can generate a low-level instruction to ensure the atomicity of the operation. Otherwise, atomic is equivalent to critical.
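For instance, the unprotected ++g_COUNTER update from the previous listing can be protected as follows (a minimal sketch; the counter declaration is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

unsigned long g_COUNTER = 0;

int main(void)
{
#pragma omp parallel default(none) shared(g_COUNTER)
  {
    /* A single update statement: atomic is sufficient (and cheap). */
#pragma omp atomic
    ++g_COUNTER;

    /* A longer sequence of statements would need a (named) critical section:
     * #pragma omp critical (counter_update)
     * { ... several statements ... }
     */
  }
  printf("g_COUNTER = %lu\n", g_COUNTER);
  return EXIT_SUCCESS;
}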
single Directive
Guarantees that a single thread will execute the sequence of instructions located in the
single region, and the region will be executed only once. There is an implicit barrier
at the end of the region.
master Directive
Guarantees that only the master thread (with ID = 0) will execute the sequence of instructions located in the master region, and the region will be executed only once.
There is NO implicit barrier at the end of the region.
nowait Clause
nowait can be used on worksharing directives such as omp for and single to remove the implicit barrier they feature.
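A minimal sketch combining master, single, and nowait (the printed messages are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
  {
    /* Only thread 0 executes this block; no barrier afterwards. */
#pragma omp master
    printf("[%d]\tmaster block\n", omp_get_thread_num());

    /* Exactly one (arbitrary) thread executes this block; nowait
     * removes the implicit barrier single would otherwise add. */
#pragma omp single nowait
    printf("[%d]\tsingle block\n", omp_get_thread_num());
  } /* implicit barrier at the end of the parallel region */
  return 0;
}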
/**
* \brief Computes Fibonacci numbers
* \param n the Fibonacci number to compute
*/
u64 xfib(u64 n) {
  return n < 2 ?            // base case?
         n :                // fib(0) = 0, fib(1) = 1
         xfib(n-1) + xfib(n-2);
}

static inline void usage(const char* progname) {
  printf("USAGE: %s positive_number\n", progname);
  exit(0);
}
void u64_measure(u64 (*func)(u64), u64 n,
                 u64 n_reps, const char* msg);
void u64func_time(u64 (*func)(u64), u64 n,
                  const char* msg);
#endif // UTILS_H_GUARD
#ifndef COMMON_H_GUARD
#define COMMON_H_GUARD
#include "utils.h" // for smalloc(), sfree(), fatal(), scalloc(), ...
#define FIB_THRESHOLD 20
#ifndef MT_H_GUARD
#define MT_H_GUARD
#include <pthread.h>
#include <errno.h>   // for errno and EAGAIN

typedef struct fib_s     { u64 *up, n; }        fib_t;
typedef struct memofib_s { u64 *up, *vals, n; } memofib_t;

static inline pthread_t* spawn(void* (*func)(void*), void* data) {
  pthread_t* thread = smalloc(sizeof(pthread_t)); int error = 0;
  do {
    errno = error = pthread_create(thread, NULL, func, data);
  } while (error == EAGAIN);
  if (error) fatal("pthread_create");
  return thread;
}

static inline void sync(pthread_t* thread) {
  int error = 0; void* retval = NULL;
  if ( (errno = (error = pthread_join(*thread, &retval))) )
    fatal("pthread_join");
  sfree(thread);
}
#endif // MT_H_GUARD
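A possible use of these wrappers, assuming the header above is saved as mt.h and that xfib() is visible through common.h; fib_worker and fib_in_thread are illustrative names, not part of the original code:

#include "mt.h"      /* spawn(), sync(), fib_t */
#include "common.h"  /* u64, xfib(), smalloc(), sfree() */

/* Thread entry point: unpack a fib_t, compute, store the result. */
static void* fib_worker(void* data)
{
  fib_t* arg = data;
  *arg->up = xfib(arg->n);
  return NULL;
}

u64 fib_in_thread(u64 n)
{
  u64 result = 0;
  fib_t arg = { .up = &result, .n = n };
  pthread_t* t = spawn(fib_worker, &arg);
  sync(t);   /* joins the thread and frees its handle */
  return result;
}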
#include "common.h"
#include <omp.h>
#include "common.h"
#include <omp.h>
u64 o_memofib(u64 n, u64* vals) {
  if (n < FIB_THRESHOLD) return sfib(n);
  if (vals[n] == 0) { u64 n1 = 0, n2 = 1;
#   pragma omp task shared(n1,vals)
    n1 = o_memofib(n-1, vals);
#   pragma omp task shared(n2,vals)
    n2 = o_memofib(n-2, vals);
#   pragma omp taskwait
    vals[n] = n1 + n2;
  }
  return vals[n];
}

u64 o_memoFib(u64 n) {
  u64 result = 0, *fibvals = calloc(n+1, sizeof(u64));
# pragma omp parallel
  {
#   pragma omp single nowait
    { fibvals[1] = 1; result = o_memofib(n, fibvals); }
  }
  sfree(fibvals);   /* free the memoization table, as in memoFib below */
  return result;
}
#include "common.h"
u64 memoFib(u64 n) {
u64* fibvals = calloc(n+1, sizeof(u64 ));
fibvals [0] = 0; fibvals [1] = 1;
u64 result = memofib(n,fibvals );
sfree(fibvals );
return result;
}
#include "common.h"
u64 sfib(u64 n) {
u64 n1 = 0, n2 = 1, r = 1;
for (u64 i = 2; i < n; ++i) {
n1 = n2;
n2 = r;
r = n1 + n2;
}
return r;
}
Internet Resources
• “The OpenMP® API specification for parallel programming” at openmp.org
• Provides all the specifications for OpenMP, in particular OpenMP 3.1 and 4.0
• Lots of tutorials (see http://openmp.org/wp/resources/#Tutorials)
• The Wikipedia article at http://en.wikipedia.org/wiki/OpenMP
Execution Model
• Relies on the notion of distributed memory
• All data transfers between MPI processes are explicit
• Processes can also be synchronized with each other
• Achieved using a library API
• MPI-2 (1997):
• Thread support
• MPI-I/O, R-DMA
• MPI-2.1 (2008) and MPI-2.2 (2009):
• Corrections to standard, small additional features
• MPI-3 (2012):
• Lots of new features to standard (briefly discussed at the end)
MPI is a library
You need to use function calls to leverage MPI features.
Compilation
• Regular compilation: use of cc, e.g., gcc -o test test.c
• MPI compilation: mpicc -o test test.c
Execution
• Regular execution: ./test
• MPI execution: mpiexec -np 16 ./test
MPI groups
• Each MPI process belongs to one or more groups
• Each MPI process is given one or more colors
• Group+color = communicator
• All MPI processes belong to MPI_COMM_WORLD when the program starts
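A minimal sketch of deriving a new communicator from MPI_COMM_WORLD by giving each process a color (splitting by rank parity here is purely illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* Processes with the same color end up in the same new communicator;
   * the key (here, world_rank) orders the ranks inside it. */
  int color = world_rank % 2;
  MPI_Comm newcomm;
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &newcomm);

  int new_rank;
  MPI_Comm_rank(newcomm, &new_rank);
  printf("world rank %d -> color %d, new rank %d\n",
         world_rank, color, new_rank);

  MPI_Comm_free(&newcomm);
  MPI_Finalize();
  return 0;
}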
Receive wildcards
• MPI_ANY_SOURCE: accepts data from any sender
• MPI_ANY_TAG: accepts data with any tag (as long as the receiver is a valid target)
Status object
Objects of type MPI_Status have the following accessible fields (assume our object
name is status):
• MPI_SOURCE: the rank of the process which sent the message (useful when using
MPI_ANY_SOURCE)
• MPI_TAG: the tag used to identify the received message (useful when using
MPI_ANY_TAG)
• MPI_ERROR: the error status (assuming the MPI program does not crash when an
error is detected—which is the behavior by default).
To get the number of elements received, the user can query status using the
MPI_Get_count function.
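A minimal sketch of a receive using both wildcards and querying the status object (buffer size and element type are illustrative):

#include <mpi.h>
#include <stdio.h>

/* Called on the receiving side; buf can hold up to 100 ints. */
void receive_any(int* buf)
{
  MPI_Status status;
  MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
           MPI_COMM_WORLD, &status);

  int count = 0;
  MPI_Get_count(&status, MPI_INT, &count);   /* elements actually received */
  printf("got %d ints from rank %d (tag %d)\n",
         count, status.MPI_SOURCE, status.MPI_TAG);
}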
if (rank == 0)
  MPI_Send(data, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
else
  MPI_Recv(data, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Finalize();
return 0;
}
• MPI_Init
• MPI_Comm_rank
• MPI_Comm_size
• MPI_Send
• MPI_Recv
• MPI_Finalize
…are enough to write any application using message passing.
However, to be productive and ensure reasonable performance portability, other
functions are required.
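A minimal sketch using only those six calls (the message buffer and sizes are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int data[100] = { 0 };
  if (size >= 2) {
    if (rank == 0)        /* rank 0 sends to rank 1 */
      MPI_Send(data, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)   /* rank 1 receives from rank 0 */
      MPI_Recv(data, 100, MPI_INT, 0, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
  }
  printf("[%d/%d] done\n", rank, size);

  MPI_Finalize();
  return 0;
}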
API
• int MPI_Isend(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm, MPI_Request *request)
• int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *request)
• int MPI_Wait(MPI_Request *request, MPI_Status *status)
Properties
• Non-blocking operations allow overlapping of computation and communication
• Completion can be tested using MPI_Test(MPI_Request *request, int *flag,
MPI_Status *status)
• Anywhere one uses MPI_Send or MPI_Recv, one can use MPI_Isend/MPI_Wait or
MPI_Irecv/MPI_Wait pairs instead
• Combinations of blocking and non-blocking sends/receives can be used to
synchronize execution instead of barriers
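A minimal sketch of the Isend/Irecv/Wait pattern, overlapping communication with computation (the partner rank and local_work callback are illustrative):

#include <mpi.h>

/* Exchange one int with a partner rank while doing local work. */
void exchange(int partner, int* sendval, int* recvval,
              void (*local_work)(void))
{
  MPI_Request reqs[2];

  MPI_Isend(sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

  local_work();   /* computation overlapped with the communication */

  /* Both operations must complete before the buffers are reused. */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}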
Multiple completions
• int MPI_Waitall(int count, MPI_Request *array_of_requests,
MPI_Status *array_of_statuses)
• int MPI_Waitany(int count, MPI_Request *array_of_requests, int
*index, MPI_Status *status)
• int MPI_Waitsome(int incount, MPI_Request *array_of_requests, int
*outcount, int *array_of_indices, MPI_Status *array_of_statuses)
There are corresponding versions of MPI_Test for each of those.
Properties
• Tags are not used; only communicators matter.
• Non-blocking collective operations were added in MPI-3
• Three classes of operations: synchronization, data movement, collective
computation
• MPI_Bcast
• MPI_Scatter
• MPI_Gather
• MPI_Allgather
• MPI_Alltoall
• MPI_Reduce
• MPI_Scan
MPI_MAX Maximum
MPI_MIN Minimum
MPI_PROD Product
MPI_SUM Sum
MPI_LAND Logical and
MPI_LOR Logical or
MPI_LXOR Logical exclusive or
MPI_BAND Bitwise and
MPI_BOR Bitwise or
MPI_BXOR Bitwise exclusive or
MPI_MAXLOC Maximum and location
MPI_MINLOC Minimum and location
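A minimal sketch of a sum reduction to rank 0 (each process contributing its rank is purely illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Every process contributes its rank; rank 0 gets the global sum. */
  int local = rank, global = 0;
  MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("sum of ranks = %d\n", global);

  MPI_Finalize();
  return 0;
}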
for (;;) {
  if (myid == 0) {
    printf("Enter the number of intervals: (0 quits) ");
    fflush(stdout);
    scanf("%f", &n);
    if (n <= 0) break;
  }
  if (myid == 0)
    printf("pi is %f. Error is %f\n", pi, fabs(pi - g_PI25DT));
}
xMPI_Finalize();
return 0;
}