Govindarajan - ParallelizationPrinciples NSM AstroPhysics
Fundamentals
R. Govindarajan
CSA/SERC, IISc
[email protected]
1
Overview
Introduction
Parallelization Essentials
Programming Models
Task creation
Synchronization / Communication
Introduction to OpenMP
Introduction to CUDA
Acknowledgments:
Slides for this tutorial are taken from presentation materials available with
the book “Parallel Computer Architecture: A Hardware/Software
Approach” (Culler, Singh and Gupta, Morgan Kaufmann Pub.) and the
associated course material. They have been suitably adapted.
2
Space of Parallel Computing
Parallel Architecture
Shared Memory
Centralized shared memory (UMA)
Distributed Shared Memory (NUMA)
Distributed Memory
A.k.a. Message passing
E.g., Clusters
Data Parallel
SIMD, Vector Machine
GPUs
3
Space of Parallel Computing
Programming Models
What the programmer uses when coding applications
Specifies synchronization and communication
Programming Models:
Shared address space, e.g., OpenMP
Message passing, e.g., MPI
Single Instruction Multiple Threads (SIMT), e.g., CUDA or OpenCL, specifically for GPUs
4
Definitions
Speedup(p) = Execution time on 1 processor / Execution time on p processors
Efficiency(p) = Speedup(p) / p
Amdahl’s Law:
For a program in which a fraction s of the execution is sequential, the speedup is limited by 1/s.
E.g., if 20% of the time is spent in the sequential part, the maximum speedup is 5!
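A standard algebraic form of this bound (not spelled out on the slide), for sequential fraction s and p processors:
Speedup(p) = 1 / (s + (1 - s)/p), which approaches 1/s as p grows.
With s = 0.2: Speedup <= 1/0.2 = 5, no matter how many processors are used.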
5
Understanding Amdahl’s Law
[Figure: concurrency vs. time profiles – (a) serial execution: concurrency 1 for total time 2·n²; (b) execution on p processors: the parallelized phase finishes in n²/p, while the remaining serial phase still takes n²]
8
Definitions
Task
Arbitrary piece of work in parallel computation
Executed sequentially; concurrency is only across
tasks
Fine-grained vs. coarse-grained tasks
Process
Abstract entity that performs the tasks
Communicate and synchronize to perform the
tasks
Process vs. Thread
Coarse vs. Fine grain
Threads typically share the address space
9
Steps involved in Parallelization
10
Steps in Creating a Parallel
Program
[Figure: the four steps – Decomposition of the computation into tasks, Assignment of tasks to processes, Orchestration of communication and synchronization, and Mapping of processes to processors. Decomposition and Assignment together constitute Partitioning; tasks p0–p3 are assigned to processes P0–P3, which are then mapped onto processors]
11
Task vs. Data Decomposition
Computation is decomposed and assigned (partitioned) – task decomposition
Task graphs; synchronization among tasks
fork–join
barrier
[Figure: task graph with nodes A–G and dependence edges]
12
Task vs. Domain Decomposition
Partitioning data is often a natural view too – data or domain decomposition
Grid example; computation follows data: “owner computes”
for i = 1 to m
  for j = 1 to n
    a[i,j] = a[i,j] + v[i]
[Figure: domain decomposition of the grid across processes]
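A minimal C sketch of the owner-computes rule under a block decomposition of the rows (the names me, nprocs, update_owned_rows and the fixed column bound N are illustrative assumptions, not from the slides):

#define N 1024                      /* assumed column bound for the sketch */

/* Process `me` of `nprocs` owns a contiguous block of rows of a[1..m][1..n]
   and updates only the rows it owns ("owner computes"). */
void update_owned_rows(double a[][N + 1], const double v[],
                       int m, int n, int me, int nprocs)
{
    int block = (m + nprocs - 1) / nprocs;   /* rows per process */
    int first = me * block + 1;
    int last  = first + block - 1;
    if (last > m) last = m;

    for (int i = first; i <= last; i++)
        for (int j = 1; j <= n; j++)
            a[i][j] = a[i][j] + v[i];
}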
13
Assignment
[Figure: Block-Cyclic and Tiled assignments of the grid]
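A minimal sketch (hypothetical helper functions, not from the slides) of how a row index i can be mapped to one of P processes under block, cyclic and block-cyclic assignment with block size B:

/* Block: contiguous chunks of rows go to each process. */
int owner_block(int i, int n, int P)        { return i / ((n + P - 1) / P); }

/* Cyclic: consecutive rows go to consecutive processes, round robin. */
int owner_cyclic(int i, int P)              { return i % P; }

/* Block-cyclic: blocks of B rows are dealt out round robin. */
int owner_block_cyclic(int i, int B, int P) { return (i / B) % P; }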
15
Orchestration
16
Overview
Introduction
Parallelization Essentials
Programming Models
Task creation
Synchronization / Communication
Introduction to OpenMP
Introduction to CUDA
17
What is OpenMP?
An API for shared-memory parallel programming in C/C++ and Fortran, consisting of compiler directives, runtime library routines and environment variables.
18
OpenMP Parallel Computing Solution
Stack
[Figure: OpenMP solution stack – user layer (end user, application); programming layer (OpenMP directives, OpenMP library, environment variables); system layer (runtime library)]
[Figure: fork–join execution model – the master thread FORKs a team of threads at the start of a parallel region and JOINs them at its end]
20
OpenMP Memory Model
Shared memory model
Shared variable: single copy of the variable
shared by many threads;
Private variable: variable seen by one thread;
each thread will have its own copy of the variable
22
OpenMP: Parallel Regions
The omp parallel pragma lets all threads execute the enclosed section in parallel – the same code is executed several times (once by each thread).

double D[1000];
#pragma omp parallel
{
  int i; double sum = 0;
  for (i = 0; i < 1000; i++)
    sum += D[i];
  printf("Thread %d computes %f\n", omp_get_thread_num(), sum);
}

How many threads do we have? Set with omp_set_num_threads(n).
What is the use of repeating the same work several times in parallel?
D is shared between the threads; i and sum are private.
23
Parallel Regions – Another Example
Each thread executes the same code redundantly.

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  pooh(ID, A);
}
printf("all done\n");

A single copy of A is shared between all threads.
[Figure: the four threads execute pooh(0,A), pooh(1,A), pooh(2,A) and pooh(3,A) concurrently]
26
OpenMP: Work Sharing
Constructs
Sequential code:
for (int i = 0; i < N; i++)
  a[i] = b[i] + c[i];
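A minimal sketch of the corresponding work-shared version (the slide’s figure with this code is not reproduced): the parallel for directive creates a team of threads and divides the loop iterations among them.

#pragma omp parallel for
for (int i = 0; i < N; i++)
  a[i] = b[i] + c[i];    /* iterations are split across the threads of the team */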
28
OpenMP Sections: Work Sharing Construct
The sections work-sharing construct gives a different structured block to each thread.

#pragma omp parallel
#pragma omp sections
{
  #pragma omp section
    X_calculation();
  #pragma omp section
    y_calculation();
  #pragma omp section
    z_calculation();
}

By default, there is a barrier at the end of the “omp sections”. Use the “nowait” clause to turn off the barrier.
29
OpenMP: Data Environment
Shared Variables
Most variables (including locals declared outside the parallel region) are shared by default
Global variables are shared
File-scope variables, static variables
Some variables are private
Variables can be explicitly declared as private: a local copy is created for each thread
Automatic variables declared inside the statement block are private
Automatic variables in the called functions are private
30
Data Environment: Changing
Storage Attribute
One can selectively change the storage attributes of variables in a construct using the following clauses:
SHARED
PRIVATE
FIRSTPRIVATE
LASTPRIVATE
THREADPRIVATE
The default status can be modified
with:
DEFAULT (PRIVATE | SHARED | NONE)
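A minimal sketch of the private and firstprivate clauses (the names N, a, work and use are illustrative assumptions):

int tmp = 0;
#pragma omp parallel for private(tmp)          /* each thread gets its own, uninitialized tmp */
for (int i = 0; i < N; i++) { tmp = work(i); use(tmp); }

int offset = 10;
#pragma omp parallel for firstprivate(offset)  /* private copies initialized to 10 */
for (int i = 0; i < N; i++) a[i] = i + offset;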
31
OpenMP Synchronization
X = 0;
#pragma omp parallel
  X = X + 1;
What should be the result (assume 2 threads)? Could be 1 or 2!
32
Synchronization Mechanisms
33
Critical Sections & Atomic
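The slide’s figures are not reproduced here; a minimal sketch of both constructs applied to the racy update from the previous slide:

#pragma omp parallel
{
  #pragma omp critical    /* at most one thread at a time executes this block */
  X = X + 1;
}

#pragma omp parallel
{
  #pragma omp atomic      /* the memory update itself is performed atomically */
  X = X + 1;
}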
34
Barrier synchronization
#pragma omp barrier
Performs a barrier synchronization between
all the threads in a team at the given point.
Example:
#pragma omp parallel
{
int result = heavy_computation_part1();
#pragma omp atomic
sum += result;
#pragma omp barrier
heavy_computation_part2(sum);
}
35
Reduction Motivation
36
OpenMP: Reduction Example
#include <omp.h>
#define NUM_THREADS 4
int main()
{
  int i, sum = 0;
  int A[1000], B[1000];
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel for reduction(+:sum)
  for (i = 0; i < 1000; i++) {
    sum += A[i] * B[i];
  }
}

A private copy of sum for each thread computes the partial sum in parallel for the chunk (of the array) assigned to it!
Separate code for computing the global sum (using synchronization) is automatically added by OpenMP.
37
Controlling OpenMP behavior
omp_set_num_threads(int)
Control the number of threads used for parallelization
Must be called from sequential code
Also can be set by OMP_NUM_THREADS environment
variable
omp_get_num_threads()
How many threads are currently available?
omp_get_thread_num()
omp_in_parallel()
Am I currently running in parallel mode?
omp_get_wtime()
A portable way to compute wall clock time
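A minimal sketch using some of these routines (timing a parallel region with omp_get_wtime()):

#include <stdio.h>
#include <omp.h>

int main(void)
{
  omp_set_num_threads(4);              /* must be called from sequential code */
  double t0 = omp_get_wtime();
  #pragma omp parallel
  {
    printf("thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  printf("elapsed: %f s\n", omp_get_wtime() - t0);
  return 0;
}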
38
Overview
Introduction
Parallelization Essentials
Programming Models
Task creation
Synchronization / Communication
Introduction to OpenMP
Introduction to CUDA
39
Intel SIMD Extensions
New SIMD instructions, new registers
Introduced in phases/groups of functionality
MMX (1996 – 1999)
64-bit wide operations
SSE – SSE4 (1999 – 2006)
128-bit wide operations
AVX, FMA, AVX2, AVX-512 (2008 – …)
256 – 512 bit wide operations
40
SIMD Operations
SIMD execution performs an operation in parallel on an array of 2, 4, 8, 16 or 32 values.
The operation ⊗ can be a
  data movement instruction
  arithmetic instruction
  logical instruction
  comparison instruction
  conversion instruction
  shuffle instruction
[Figure: element-wise SIMD operation – Zi = Xi ⊗ Yi for lanes X0…X3, Y0…Y3, Z0…Z3]
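A minimal sketch of such an element-wise operation written with AVX intrinsics (an illustrative choice: 8 single-precision lanes per 256-bit register; n is assumed to be a multiple of 8):

#include <immintrin.h>

/* z[i] = x[i] + y[i], 8 floats at a time */
void vec_add(float *z, const float *x, const float *y, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);   /* load 8 floats */
        __m256 vy = _mm256_loadu_ps(&y[i]);
        _mm256_storeu_ps(&z[i], _mm256_add_ps(vx, vy));
    }
}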
41
Automatic Vectorization
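The slide’s content is not reproduced here; a minimal sketch of a loop that compilers typically vectorize automatically at higher optimization levels (restrict asserts that the arrays do not overlap):

/* Compiled with optimization (e.g., gcc -O3), this loop is usually
   turned into SIMD instructions automatically. */
void saxpy(int n, float a, const float * restrict x, float * restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}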
42
CUDA Programming
[Figures: CUDA thread organization – a grid of thread blocks numbered 0 … 63, each block containing threads numbered 0 … 255]
© RG@SERC,IISc 43–45
Processing Flow
[Figures: CPU–GPU processing flow – copy input data from host (CPU) memory to device (GPU) memory, launch the kernel so the GPU executes it in parallel, then copy the results back to host memory]
© RG@SERC,IISc 46–48
CUDA Programming
49
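A minimal CUDA C sketch consistent with the figures above (a hypothetical vector-add kernel launched with 64 blocks of 256 threads; illustrative, not code from the original slides):

#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread index */
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 64 * 256;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
          *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = i; h_b[i] = 2 * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    /* 1. copy inputs to the GPU */
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* 2. launch the kernel: 64 blocks x 256 threads */
    vecAdd<<<64, 256>>>(d_a, d_b, d_c, n);

    /* 3. copy the result back to the host */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}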
Summary
Introduction
Parallelization Essentials
Introduction to OpenMP
Covered only the basics --- there is more!
Current version is OpenMP 5.x
Support for accelerators
SIMD/Vectorization support
Nested Parallelism
Introduction to CUDA
There is more to learn about CUDA!
Current version is CUDA 12.x
Dynamic kernel invocation, CUDA Graphs, unified
memory, …
50