Parallel Programming
Computing for Shared and Distributed Memory Models
December 13, 2022 / January 9, 2023
Stéphane Zuckerman ([email protected])
ETIS, CY Cergy-Paris University, ENSEA, CNRS
Outline
5. References
Figure: Symmetric Multi-Processor (SMP) system with Non-Uniform Memory Access (NUMA); multiple DRAM banks.
Figure: Single CPU: CPU, L1 data cache (L1D), L1 instruction cache (L1I), unified L2 cache (L2), unified L3 cache (L3); single DRAM bank.
Hardware context
• The end of Dennard scaling
• Moore's law is now used to add more computing units on a single chip (instead of raising clock frequencies)
• Programming chip multiprocessors (CMPs) is no longer just for scientific/high-performance computing
• Embedded chips also require programming models and execution models that efficiently exploit all of the hardware
Environment Variables
Once set, their values are taken into account at execution time (e.g., OMP_NUM_THREADS to set the number of threads, OMP_SCHEDULE to select the loop schedule).
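For instance (a minimal sketch, assuming OpenMP is enabled at compile time), the value of OMP_NUM_THREADS set in the shell is reflected by omp_get_max_threads() at run time:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  /* Reflects OMP_NUM_THREADS: after `export OMP_NUM_THREADS=4`,
   * the call below typically returns 4. */
  printf("max threads = %d\n", omp_get_max_threads());
  return 0;
}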
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#ifndef _OPENMP
#define omp_get_thread_num() 0
#endif
int main(void)
{
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    printf("[%d]\tHello, World!\n", tid);
  }
  return EXIT_SUCCESS;
}
Figure: omp_hello.c

Sample run:
examples$ ./hello
[0] Hello, World!
[3] Hello, World!
[1] Hello, World!
[2] Hello, World!
#include <stdio.h>
#include <omp.h>
int main(void) {
  float a = 1900.0;
#pragma omp parallel default(none) private(a)
  {
    a = a + 716.;
    printf("[%d]\ta = %.2f\n", omp_get_thread_num(), a);
  }
  printf("[%d]\ta = %.2f\n", omp_get_thread_num(), a);
  return 0;
}
Figure: omp_private.c

Sample run:
[2] a = 716.00
[1] a = 716.00
[0] a = 716.00
[3] a = 716.00
[0] a = 1900.00
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#ifndef _OPENMP
#define omp_get_thread_num() 0
#endif
int main(void)
{
  int ids[] = {0, 1, 2, 3, 4, 5, 6, 7};
#pragma omp parallel default(none) shared(ids)
  {
    printf("[%d]\tHello, World!\n", ids[omp_get_thread_num()]);
  }
  return EXIT_SUCCESS;
}
Figure: hello2.c

Sample run:
examples$ ./hello2
[0] Hello, World!
[3] Hello, World!
[1] Hello, World!
[2] Hello, World!
printf("a = %f\n",a);
return 0;
}
Figure: omp_firstprivate.c
#include <stdio.h>
#include <omp.h>

void sub(void);

int main(void) {
#pragma omp parallel default(shared)
  {
    sub();
  }
  return 0;
}

void sub(void) {
  int a = 19716;
  a += omp_get_thread_num();
  printf("a = %d\n", a);
}
Figure: omp_scope.c.txt
Figure: parallel_for.c
1. The iterator of an omp for loop must use additions/subtractions to get to the next iteration (no i *= 10 in the postcondition).
2. The iterator of the outermost loop (the one immediately following the omp for directive) is always private, but the iterators of nested loops are not!
3. There is an implicit barrier at the end of the loop. You can remove it by adding the nowait clause on the same line: #pragma omp for nowait.
4. How the iterations are distributed among threads can be specified using the schedule clause (see the sketch below).
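A minimal sketch of such a loop (the array a and its size are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
  enum { N = 1000 };
  double *a = malloc(N * sizeof *a);
  if (!a) { perror("malloc"); return EXIT_FAILURE; }

  /* The iterator i of the loop that immediately follows `omp for` is
   * implicitly private; nowait removes the implicit barrier at the
   * end of the loop. */
#pragma omp parallel default(none) shared(a)
  {
#pragma omp for schedule(static) nowait
    for (int i = 0; i < N; ++i)
      a[i] = 2.0 * i;
  }

  printf("a[N-1] = %.1f\n", a[N - 1]);
  free(a);
  return EXIT_SUCCESS;
}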
NumIterations = |FinalVal − InitVal| / Stride + (|FinalVal − InitVal| mod Stride)

The number of iteration chunks is thus computed like this:

NumChunks = NumIterations / ChunkSize + (NumIterations mod ChunkSize)
Static Scheduling
schedule(static,chunksize) distributes the iteration chunks across threads in a round-robin fashion.
• Guarantee: if two loops with the same "header" (precondition, condition, postcondition, and chunksize for the parallel for directive) follow each other, the threads are assigned the same iteration chunks
• By default (no chunksize specified), the iteration space is divided into roughly equal chunks, one per thread (i.e., OMP_NUM_THREADS chunks)
• Very useful when iterations take roughly the same time to perform (e.g., dense linear algebra routines)
Dynamic Scheduling
schedule(dynamic,chunksize) divides the iteration space according to chunksize, and creates an “abstract” queue
of iteration chunks. If a thread is done processing its chunk, it dequeues the next one from the queue. By default,
chunksize is 1.
Very useful if the time to process individual iterations varies.
Guided Scheduling
schedule(guided,chunksize) has the same behavior as dynamic, but the chunk size is divided by two each time a thread dequeues a new chunk. The minimum size is one, and so is the default.
Very useful if the time to process individual iterations varies and the amount of work has a "tail".
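The difference matters most when iteration costs are unbalanced. A minimal sketch (the work() function and the chunk size of 100 are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

/* Iteration i performs O(i) work, so iteration costs vary widely:
 * a good candidate for dynamic or guided scheduling. */
static double work(int i)
{
  double s = 0.0;
  for (int k = 0; k <= i; ++k)
    s += k;
  return s;
}

int main(void)
{
  enum { N = 10000 };
  double acc = 0.0;
#pragma omp parallel default(none) firstprivate(acc)
  {
    /* Compare with schedule(static, 100) and schedule(guided, 100). */
#pragma omp for schedule(dynamic, 100)
    for (int i = 0; i < N; ++i)
      acc += work(i);
    printf("[%d]\tmy partial sum = %.0f\n", omp_get_thread_num(), acc);
  }
  return EXIT_SUCCESS;
}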
return f;
}
Figure: omp_for_schedule.c
Parallel Loops
Specifying the Schedule Mode II
int main(void) {
  printf("MAX = %.2f\n", MAX);
  double acc = 0.0;
  int* sum_until = malloc(MAX * sizeof(int));
  if (!sum_until) perror("malloc"), exit(EXIT_FAILURE);
  for (int i = 0; i < (int)MAX; ++i) sum_until[i] = rand() % 100;
#pragma omp parallel default(none) \
  shared(sum_until) firstprivate(acc)
  { /* Use the OMP_SCHEDULE environment variable on the command
     * line to specify the type of scheduling you want, e.g.:
     * export OMP_SCHEDULE="static" or OMP_SCHEDULE="dynamic,10"
     * or OMP_SCHEDULE="guided,100"; ./omp_schedule
     */
#pragma omp for schedule(runtime)
    for (int i = 0; i < (int)MAX; i += 1) {
      acc += sum( sum_until[i] );
    }
    printf("[%d]\tMy sum = %.2f\n", omp_get_thread_num(), acc);
  }
  free(sum_until);
  return 0;
}
Figure: omp_for_schedule.c
Sample timings (real / user / sys, as reported by time):
real 0m11.911s   user 0m11.930s   sys 0m0.004s
real 0m11.312s   user 0m11.356s   sys 0m0.004s
real 0m0.546s    user 0m0.576s    sys 0m0.004s
real 0m0.023s    user 0m0.059s    sys 0m0.004s
real 0m8.437s    user 0m8.452s    sys 0m0.008s
real 0m5.401s    user 0m5.438s    sys 0m0.008s
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

unsigned long g_COUNTER = 0;  /* shared counter, updated without synchronization */

int
main(void)
{
  int n_threads = 1;
#pragma omp parallel default(none) \
  shared(n_threads, stdout, g_COUNTER)
  {
#pragma omp master
    {
      n_threads = omp_get_num_threads();
      printf("n_threads = %d\t", n_threads); fflush(stdout);
    }
    ++g_COUNTER;  /* data race: several threads may update g_COUNTER at once */
  }
  printf("g_COUNTER = %lu\n", g_COUNTER);
  return EXIT_FAILURE;
}
critical Directive
#pragma omp critical [(name)]
Guarantees that only one thread at a time executes the sequence of instructions contained in the (named) critical section. If no name is specified, an "anonymous" name is automatically generated.
atomic Directive
#pragma omp atomic
Guarantees the atomicity of the single arithmetic update that follows. On architectures that support atomic instructions, the compiler can generate a low-level instruction to ensure the atomicity of the operation. Otherwise, atomic is equivalent to critical.
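For instance, the unprotected ++g_COUNTER update from the previous listing can be protected as follows (a minimal sketch; the counter declaration is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

unsigned long g_COUNTER = 0;

int main(void)
{
#pragma omp parallel default(none) shared(g_COUNTER)
  {
    /* A single update statement: atomic is sufficient (and cheap). */
#pragma omp atomic
    ++g_COUNTER;

    /* A longer sequence of statements would need a (named) critical section:
     * #pragma omp critical (counter_update)
     * { ... several statements ... }
     */
  }
  printf("g_COUNTER = %lu\n", g_COUNTER);
  return EXIT_SUCCESS;
}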
single Directive
Guarantees that a single thread will execute the sequence of instructions located in the
single region, and the region will be executed only once. There is an implicit barrier
at the end of the region.
master Directive
Guarantees that only the master thread (with ID = 0) will execute the sequence of instructions located in the master region, and the region will be executed only once.
There is NO implicit barrier at the end of the region.
nowait Clause
nowait can be used on worksharing directives such as omp for and single to remove the implicit barrier they feature.
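A minimal sketch combining master, single, and nowait (the printed messages are illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
  {
    /* Only thread 0 executes this block; no barrier afterwards. */
#pragma omp master
    printf("[%d]\tmaster block\n", omp_get_thread_num());

    /* Exactly one (arbitrary) thread executes this block; nowait
     * removes the implicit barrier single would otherwise add. */
#pragma omp single nowait
    printf("[%d]\tsingle block\n", omp_get_thread_num());
  } /* implicit barrier at the end of the parallel region */
  return 0;
}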
/**
* \brief Computes Fibonacci numbers
* \param n the Fibonacci number to compute
*/
u64 xfib(u64 n) {
  return n < 2 ?            // base case?
         n :                // fib(0) = 0, fib(1) = 1
         xfib(n-1) + xfib(n-2);
}

static inline void usage(const char* progname) {
  printf("USAGE: %s positive_number\n", progname);
  exit(0);
}
void u64_measure(u64 (*func)(u64), u64 n,
                 u64 n_reps, const char* msg);
void u64func_time(u64 (*func)(u64), u64 n,
                  const char* msg);
#endif // UTILS_H_GUARD
#ifndef COMMON_H_GUARD
#define COMMON_H_GUARD
#include "utils.h" // for smalloc(), sfree(), fatal(), scalloc(), ...
#define FIB_THRESHOLD 20
#ifndef MT_H_GUARD
#define MT_H_GUARD
#include <pthread.h>
#include <errno.h>   // for errno and EAGAIN

typedef struct fib_s     { u64 *up, n; }        fib_t;
typedef struct memofib_s { u64 *up, *vals, n; } memofib_t;

static inline pthread_t* spawn(void* (*func)(void*), void* data) {
  pthread_t* thread = smalloc(sizeof(pthread_t)); int error = 0;
  do {
    errno = error = pthread_create(thread, NULL, func, data);
  } while (error == EAGAIN);
  if (error) fatal("pthread_create");
  return thread;
}

static inline void sync(pthread_t* thread) {
  int error = 0; void* retval = NULL;
  if ( (errno = (error = pthread_join(*thread, &retval))) )
    fatal("pthread_join");
  sfree(thread);
}
#endif // MT_H_GUARD
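A possible use of these wrappers, assuming the header above is saved as mt.h and that xfib() is visible through common.h; fib_worker and fib_in_thread are illustrative names, not part of the original code:

#include "mt.h"      /* spawn(), sync(), fib_t */
#include "common.h"  /* u64, xfib(), smalloc(), sfree() */

/* Thread entry point: unpack a fib_t, compute, store the result. */
static void* fib_worker(void* data)
{
  fib_t* arg = data;
  *arg->up = xfib(arg->n);
  return NULL;
}

u64 fib_in_thread(u64 n)
{
  u64 result = 0;
  fib_t arg = { .up = &result, .n = n };
  pthread_t* t = spawn(fib_worker, &arg);
  sync(t);   /* joins the thread and frees its handle */
  return result;
}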
#include "common.h"
#include <omp.h>
#include "common.h"
#include <omp.h>
u64 o_memofib(u64 n, u64* vals) {
  if (n < FIB_THRESHOLD) return sfib(n);
  if (vals[n] == 0) { u64 n1 = 0, n2 = 1;
#   pragma omp task shared(n1,vals)
    n1 = o_memofib(n-1, vals);
#   pragma omp task shared(n2,vals)
    n2 = o_memofib(n-2, vals);
#   pragma omp taskwait
    vals[n] = n1 + n2;
  }
  return vals[n];
}

u64 o_memoFib(u64 n) {
  u64 result = 0, *fibvals = calloc(n+1, sizeof(u64));
# pragma omp parallel
  {
#   pragma omp single nowait
    { fibvals[1] = 1; result = o_memofib(n, fibvals); }
  }
  sfree(fibvals);   /* free the memoization table, as in memoFib below */
  return result;
}
#include "common.h"
u64 memoFib(u64 n) {
u64* fibvals = calloc(n+1, sizeof(u64 ));
fibvals [0] = 0; fibvals [1] = 1;
u64 result = memofib(n,fibvals );
sfree(fibvals );
return result;
}
#include "common.h"
u64 sfib(u64 n) {
u64 n1 = 0, n2 = 1, r = 1;
for (u64 i = 2; i < n; ++i) {
n1 = n2;
n2 = r;
r = n1 + n2;
}
return r;
}
Internet Resources
• “The OpenMP® API specification for parallel programming” at openmp.org
• Provides all the specifications for OpenMP, in particular OpenMP 3.1 and 4.0
• Lots of tutorials (see http://openmp.org/wp/resources/#Tutorials)
• The Wikipedia article at http://en.wikipedia.org/wiki/OpenMP
Execution Model
• Relies on the notion of distributed memory
• All data transfers between MPI processes are explicit
• Processes can also be synchronized with each other
• Achieved using a library API
• MPI-2 (1997):
• Thread support
• MPI-I/O, R-DMA
• MPI-2.1 (2008) and MPI-2.2 (2009):
• Corrections to standard, small additional features
• MPI-3 (2012):
• Lots of new features to standard (briefly discussed at the end)
MPI is a library
You need to use function calls to leverage MPI features.
Compilation
• Regular compilation: use of cc, e.g., gcc -o test test.c
• MPI compilation: mpicc -o test test.c
Execution
• Regular execution: ./test
• MPI execution: mpiexec -np 16 ./test
MPI groups
• Each MPI process belongs to one or more groups
• Each MPI process is given one or more colors
• Group+color = communicator
• All MPI processes belong to MPI_COMM_WORLD when the program starts
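A minimal sketch of deriving a new communicator from MPI_COMM_WORLD by giving each process a color (splitting by rank parity here is purely illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* Processes with the same color end up in the same new communicator;
   * the key (here, world_rank) orders the ranks inside it. */
  int color = world_rank % 2;
  MPI_Comm newcomm;
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &newcomm);

  int new_rank;
  MPI_Comm_rank(newcomm, &new_rank);
  printf("world rank %d -> color %d, new rank %d\n",
         world_rank, color, new_rank);

  MPI_Comm_free(&newcomm);
  MPI_Finalize();
  return 0;
}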
Receive wildcards
• MPI_ANY_SOURCE: accepts data from any sender
• MPI_ANY_TAG: accepts data with any tag (as long as the receiver is a valid target)
Status object
Objects of type MPI_Status have the following accessible fields (assume our object
name is status):
• MPI_SOURCE: the rank of the process which sent the message (useful when using
MPI_ANY_SOURCE)
• MPI_TAG: the tag used to identify the received message (useful when using
MPI_ANY_TAG)
• MPI_ERROR: the error status (assuming the MPI program does not crash when an
error is detected—which is the behavior by default).
To get the number of elements received, the user can query status using the
MPI_Get_count function.
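A minimal sketch of a receive using both wildcards and querying the status object (buffer size and element type are illustrative):

#include <mpi.h>
#include <stdio.h>

/* Called on the receiving side; buf can hold up to 100 ints. */
void receive_any(int* buf)
{
  MPI_Status status;
  MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
           MPI_COMM_WORLD, &status);

  int count = 0;
  MPI_Get_count(&status, MPI_INT, &count);   /* elements actually received */
  printf("got %d ints from rank %d (tag %d)\n",
         count, status.MPI_SOURCE, status.MPI_TAG);
}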
if (rank == 0)
  MPI_Send(data, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
else
  MPI_Recv(data, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Finalize();
return 0;
}
• MPI_Init
• MPI_Comm_rank
• MPI_Comm_size
• MPI_Send
• MPI_Recv
• MPI_Finalize
…are enough to write any application using message passing.
However, to be productive and ensure reasonable performance portability, other
functions are required.
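A minimal sketch using only those six calls (the message buffer and sizes are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int data[100] = { 0 };
  if (size >= 2) {
    if (rank == 0)        /* rank 0 sends to rank 1 */
      MPI_Send(data, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)   /* rank 1 receives from rank 0 */
      MPI_Recv(data, 100, MPI_INT, 0, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
  }
  printf("[%d/%d] done\n", rank, size);

  MPI_Finalize();
  return 0;
}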
API
• int MPI_Isend(const void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm, MPI_Request *request)
• int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm, MPI_Request *request)
• int MPI_Wait(MPI_Request *request, MPI_Status *status)
Properties
• Non-blocking operations allow overlapping of computation and communication
• Completion can be tested using MPI_Test(MPI_Request *request, int *flag,
MPI_Status *status)
• Anywhere one uses MPI_Send or MPI_Recv, one can use MPI_Isend/MPI_Wait or
MPI_Irecv/MPI_Wait pairs instead
• Combinations of blocking and non-blocking sends/receives can be used to
synchronize execution instead of barriers
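A minimal sketch of the Isend/Irecv/Wait pattern, overlapping communication with computation (the partner rank and local_work callback are illustrative):

#include <mpi.h>

/* Exchange one int with a partner rank while doing local work. */
void exchange(int partner, int* sendval, int* recvval,
              void (*local_work)(void))
{
  MPI_Request reqs[2];

  MPI_Isend(sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

  local_work();   /* computation overlapped with the communication */

  /* Both operations must complete before the buffers are reused. */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}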
Multiple completions
• int MPI_Waitall(int count, MPI_Request *array_of_requests,
MPI_Status *array_of_statuses)
• int MPI_Waitany(int count, MPI_Request *array_of_requests, int
*index, MPI_Status *status)
• int MPI_Waitsome(int incount, MPI_Request *array_of_requests, int
*outcount, int *array_of_indices, MPI_Status *array_of_statuses)
There are corresponding versions of MPI_Test for each of those.
Properties
• Tags are not used; only communicators matter.
• Non-blocking collective operations were added in MPI-3
• Three classes of operations: synchronization, data movement, collective
computation
• MPI_Bcast
• MPI_Scatter
• MPI_Gather
• MPI_Allgather
• MPI_Alltoall
• MPI_Reduce
• MPI_Scan
MPI_MAX Maximum
MPI_MIN Minimum
MPI_PROD Product
MPI_SUM Sum
MPI_LAND Logical and
MPI_LOR Logical or
MPI_LXOR Logical exclusive or
MPI_BAND Bitwise and
MPI_BOR Bitwise or
MPI_BXOR Bitwise exclusive or
MPI_MAXLOC Maximum and location
MPI_MINLOC Minimum and location
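A minimal sketch of a sum reduction to rank 0 (each process contributing its rank is purely illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Every process contributes its rank; rank 0 gets the global sum. */
  int local = rank, global = 0;
  MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("sum of ranks = %d\n", global);

  MPI_Finalize();
  return 0;
}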
for (;;) {
  if (myid == 0) {
    printf("Enter the number of intervals: (0 quits) ");
    fflush(stdout);
    scanf("%f", &n);
    if (n <= 0) break;
  }
  if (myid == 0)
    printf("pi is %f. Error is %f\n", pi, fabs(pi - g_PI25DT));
}
xMPI_Finalize();
return 0;
}