
Parallel Programming

Fundamentals

R. Govindarajan
CSA/SERC, IISc
[email protected]

1
Overview
 Introduction
 Parallelization Essentials
 Programming Models
 Task creation
 Synchronization / Communication
 Introduction to OpenMP
 Introduction to CUDA
Acknowledgments:

Slides for this tutorial are taken from the presentation materials available with
the book "Parallel Computer Architecture: A Hardware/Software Approach"
(Culler, Singh and Gupta, Morgan Kaufmann Pub.) and the associated course
material. They have been suitably adapted.
2
Space of Parallel Computing

Parallel Architecture
 Shared Memory
 Centralized shared memory (UMA)
 Distributed Shared Memory (NUMA)
 Distributed Memory
 A.k.a. Message passing
 E.g., Clusters
 Data Parallel
 SIMD, Vector Machine
 GPUs

3
Space of Parallel Computing

Programming Models
 What programmer uses in coding
applications
 Specifies synchronization and
communication
 Programming Models:
 Shared address space, e.g., OpenMP
 Message passing, e.g., MPI
 Single Instrn. Multi-Threaded (SIMT) , e.g.,
CUDA or OpenCL, specifically for GPUs

4
Definitions

• Speedup = T(1) / T(p)
  (execution time on one processor / execution time on p processors)

• Efficiency = Speedup / p

• Amdahl's Law:
  For a program that spends a fraction s of its execution time in
  sequential (non-parallelizable) code, the speedup is limited by 1/s.
• If 20% of the time is spent in the sequential part,
  the maximum speedup is 5!
5
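As a quick check of the bound, a minimal sketch in C (the helper name
amdahl_speedup and the sample values are ours, not from the slides):

  #include <stdio.h>

  /* Upper bound on speedup when a fraction s of the execution is
     sequential and the remaining (1 - s) runs perfectly on p processors. */
  double amdahl_speedup(double s, int p) {
      return 1.0 / (s + (1.0 - s) / p);
  }

  int main(void) {
      /* 20% sequential: the bound approaches 1/0.2 = 5 as p grows */
      printf("p = 4:    %.2f\n", amdahl_speedup(0.2, 4));     /* 2.50  */
      printf("p = 1000: %.2f\n", amdahl_speedup(0.2, 1000));  /* ~4.98 */
      return 0;
  }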
Understanding Amdahl’s Law

Example: 2-phase calculation
• sweep over an n x n grid and do some independent computation
• sweep again and add each value to a global sum

[Figure: concurrency vs. time, (a) Serial — both phases run at concurrency 1,
each taking n^2 time]

• Serial Execution Time = n^2 + n^2 = 2n^2


6
Understanding Amdahl’s Law

Naïve Parallel Execution
• Phase 1 is parallel: time for Phase 1 = n^2/p
• Phase 2 is serialized at the global variable: time for Phase 2 = n^2
• Speedup = 2n^2 / (n^2 + n^2/p), i.e., at most 2!

Improved Parallel Execution
• Localize the sum in each of the p processes: time for Phase 2 = n^2/p
• Accumulate the p local sums in a short serial step: time = p
• Speedup = 2n^2 / (2n^2/p + p) ≈ p

[Figure: concurrency vs. time profiles for the naïve and improved versions]

7
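A sketch of the improved scheme in C with OpenMP (introduced later in this
tutorial); the grid size N and the variable names are illustrative assumptions:

  #include <omp.h>

  #define N 1024

  double grid[N][N];

  double improved_sum(void) {
      double sum = 0.0;
      #pragma omp parallel
      {
          double local_sum = 0.0;           /* private partial sum per thread */
          #pragma omp for
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  local_sum += grid[i][j];  /* this phase runs fully in parallel */
          #pragma omp critical              /* short serial accumulation, O(p) */
          sum += local_sum;
      }
      return sum;
  }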


Overview
 Introduction
 Parallelization Essentials
 Programming Models
 Task creation
 Synchronization / Communication
 Introduction to OpenMP
 Introduction to CUDA

8
Definitions
 Task
Arbitrary piece of work in parallel computation
Executed sequentially; concurrency is only across
tasks
Fine-grained vs. coarse-grained tasks

 Process
Abstract entity that performs the tasks
Communicate and synchronize to perform the
tasks
 Process vs. Thread
Coarse vs. Fine grain
Threads typically share the address space
9
Steps involved in Parallelization

 Identify work that can be done in parallel


 work includes computation, data access and I/O
 Partition work and perhaps data among
processes
 Manage data access, communication and
synchronization

10
Steps in Creating a Parallel
Program
[Figure: Sequential computation → (Decomposition) → Tasks → (Assignment) →
Processes → (Orchestration) → Parallel program → (Mapping) → Processors.
Decomposition and Assignment together constitute the Partitioning step.]
11
Task vs. Data Decomposition

• Computation is decomposed and assigned (partitioned) – task decomposition
  – task graphs
• Synchronization among tasks
  – fork-join
  – barrier

[Figure: a task graph with tasks A–G and the dependences among them]
12
Task vs. Domain Decomposition

• Partitioning data is often a natural view too – data or domain decomposition
• Grid example; computation follows the data: owner computes

  for i = 1 to m
    for j = 1 to n
      a[i,j] = a[i,j] + v[i]

[Figure: domain decomposition of the grid across processes]

13
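A hedged sketch of the owner-computes loop in C with OpenMP, where each thread
updates the rows (its part of the domain) assigned to it; the array sizes are
illustrative:

  #include <omp.h>

  #define M 512
  #define N 512

  double a[M][N];
  double v[M];

  void update(void) {
      /* rows are distributed across threads; each thread updates only
         the rows it owns */
      #pragma omp parallel for
      for (int i = 0; i < M; i++)
          for (int j = 0; j < N; j++)
              a[i][j] = a[i][j] + v[i];
  }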
Assignment

• Specifies how to group tasks together for a process
  – balance the workload
  – reduce communication and management cost

[Figure: compute vs. wait time on processes P1–P4 under an imbalanced assignment]

• Static versus dynamic assignment
• Both decomposition and assignment are usually independent of the
  architecture or programming model
  – but the cost and complexity of using the primitives may affect these decisions
14
Assignment (contd.)
[Figure: example assignments of a 2D grid – Block, Cyclic, Block-Cyclic, and Tiled]
15
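As a rough illustration (ours, not from the slides), the owner of iteration i
of an n-iteration loop under block and cyclic assignment over p processes can
be computed as:

  /* Owner of iteration i (0 <= i < n) under the two common assignments. */
  int block_owner(int i, int n, int p)  { return i / ((n + p - 1) / p); }
  int cyclic_owner(int i, int p)        { return i % p; }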
Orchestration

• Different for different programming models/architectures
• Shared address space
  – naming: global address space
  – synchronization through barriers and locks
• Distributed memory / message passing
  – non-shared address space
  – send-receive messages + barrier for synchronization

16
Overview
 Introduction
 Parallelization Essentials
 Programming Models
 Task creation
 Synchronization / Communication
 Introduction to OpenMP
 Introduction to CUDA

17
What is OpenMP?

• What does OpenMP stand for?
  – Open specifications for Multi Processing, developed via collaborative
    work between interested parties from the hardware and software industry,
    government and academia.
• OpenMP is an Application Program Interface (API) that may be used to
  explicitly direct multi-threaded, shared-memory parallelism.
 API components:
 Compiler Directives
 Runtime Library Routines
 Environment Variables
 OpenMP primitives can be included incrementally,
one function or even one loop at a time.

18
OpenMP Parallel Computing Solution Stack

[Figure: the OpenMP solution stack]
• User layer: end user and application
• Programming layer (OpenMP): compiler directives, OpenMP library,
  environment variables
• System layer: OpenMP runtime library; OS/system support for shared memory
• Hardware: processors 1..n connected to a shared memory

19
OpenMP execution model

• Fork and Join: the master thread spawns a team of worker threads as needed

[Figure: the master thread forks a team of worker threads at the start of each
parallel region and joins them at its end]
20
OpenMP Memory Model
• Shared memory model
  – Shared variable: a single copy of the variable is shared by many threads
  – Private variable: a variable seen by only one thread; each thread has its
    own copy of the variable

• Unintended sharing of data causes race conditions or incorrect behavior

  Thread 1 (shared int x)    Thread 2 (shared int x)
  x = x + 10;                x = x + 15;
  print(x);                  print(x);

  Values printed by Threads 1 and 2: (10, 25)? (25, 15)? (10, 15)? ...

• Use synchronization to protect from conflicts
• Synchronization is expensive: change how data is accessed to minimize the
  need for synchronization
21
OpenMP: Contents

 OpenMP’s constructs fall in 5


categories:
Parallel Regions
Worksharing
Data Environment
Synchronization
Runtime functions/environment variables
 OpenMP is basically the same between
Fortran and C/C++

22
OpenMP: Parallel Regions

• The omp parallel pragma lets all threads execute the enclosed section in
  parallel, i.e., the same code is executed several times (once per thread)

  double D[1000];
  #pragma omp parallel
  {
      int i; double sum = 0;
      for (i = 0; i < 1000; i++)
          sum += D[i];
      printf("Thread %d computes %f\n", omp_get_thread_num(), sum);
  }

• How many threads do we have? Set with omp_set_num_threads(n)
• What is the use of repeating the same work several times in parallel?
• D is shared between the threads; i and sum are private to each thread
23
Parallel Regions – Another Example

• You create threads in OpenMP with the parallel pragma; the runtime function
  omp_set_num_threads() requests a certain number of threads.
• For example, to create a 4-thread parallel region:

  double A[1000];
  omp_set_num_threads(4);            // request a certain number of threads
  #pragma omp parallel
  {
      int ID = omp_get_thread_num(); // runtime function returning a thread ID
      pooh(ID, A);
  }

• Each thread executes a copy of the code within the structured block
• Each thread calls pooh(ID, A), for ID = 0 to 3

24
Parallel Regions – Another Example

  double A[1000];
  omp_set_num_threads(4);
  #pragma omp parallel
  {
      int ID = omp_get_thread_num();
      pooh(ID, A);
  }
  printf("all done\n");

• Each thread executes the same code redundantly
• A single copy of A is shared between all threads:
  pooh(0,A)   pooh(1,A)   pooh(2,A)   pooh(3,A)
• Threads wait at the end of the parallel region for all threads to finish
  before proceeding (i.e., a barrier); only then is "all done" printed

25
OpenMP: Contents

 OpenMP’s constructs fall into 5


categories:
 Parallel Regions
 Work-sharing
 The “for” Work-Sharing construct splits up loop
iterations among the threads in a team
 By default, there is a barrier at the end of the “omp
for”. Use the “nowait” clause to turn off the barrier.
 Data Environment
 Synchronization
 Runtime functions/environment variables

26
OpenMP: Work Sharing Constructs

Sequential code:
  for (int i = 0; i < N; i++)
      a[i] = b[i] + c[i];

Parallelization of the for loop using OpenMP:
  #pragma omp parallel
  {
      #pragma omp for schedule(static)
      for (int i = 0; i < N; i++)
          a[i] = b[i] + c[i];
  }
27
OpenMP For construct:
The Schedule Clause
 The schedule clause affects how loop
iterations are mapped onto threads
 schedule(static [,csize])
 Deal-out blocks of iterations of size “csize” to each thread.
 Default: chunks of approximately equal size, one to each thread
 If more chunks than threads: assign in round-robin to the
threads
 Why might we want to use chunks of different size?
 schedule(dynamic[,csize])
 Each thread grabs “csize” iterations off a queue until all
iterations have been handled.
 Threads receive chunk assignments dynamically
 Default csize = 1

28
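A hedged illustration of the two schedule kinds on loops like the one above;
the chunk size of 8 is an arbitrary choice:

  #include <omp.h>
  #define N 1000

  void schedule_examples(double *a, double *b, double *c) {
      /* static: iterations are divided into chunks of 8 and dealt out
         round-robin to the threads at loop entry */
      #pragma omp parallel for schedule(static, 8)
      for (int i = 0; i < N; i++)
          a[i] = b[i] + c[i];

      /* dynamic: each thread grabs the next chunk of 8 iterations off a
         queue, which helps when iteration costs are uneven */
      #pragma omp parallel for schedule(dynamic, 8)
      for (int i = 0; i < N; i++)
          a[i] = a[i] * b[i] + c[i];
  }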
OpenMP Section :
Work Sharing Construct
 The Sections work-sharing construct gives
a different structured block to each
thread.
  #pragma omp parallel
  #pragma omp sections
  {
      #pragma omp section
      X_calculation();
      #pragma omp section
      y_calculation();
      #pragma omp section
      z_calculation();
  }

• By default, there is a barrier at the end of the "omp sections".
  Use the "nowait" clause to turn off the barrier.

29
OpenMP: Data Environment

 Shared Variables
Most variables (including locals) are shared
by default
Global variables are shared
 File scope variables, static variables
 Some variables can be private
Variables can be explicitly declared as
private:
A local copy is created for each thread
Automatic variables inside the statement
block
Automatic variables in the called functions
30
Data Environment: Changing Storage Attributes
• One can selectively change the storage attributes of variables in a
  construct using the following clauses:
 SHARED
 PRIVATE
 FIRSTPRIVATE
 LASTPRIVATE
 THREADPRIVATE
 The default status can be modified
with:
 DEFAULT (PRIVATE | SHARED | NONE)
31
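A small hedged example of the private and firstprivate clauses (the variable
names are ours, not from the slides):

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      int x = 10, y = 10;

      /* private(x): each thread gets its own uninitialized copy of x.
         firstprivate(y): each thread's copy of y is initialized to 10. */
      #pragma omp parallel private(x) firstprivate(y) num_threads(4)
      {
          x = omp_get_thread_num();  /* write before read: the copy starts uninitialized */
          y += x;
          printf("thread %d: y = %d\n", x, y);
      }
      return 0;
  }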
OpenMP Synchronization

  X = 0;
  #pragma omp parallel
      X = X + 1;

What should be the result (assume 2 threads)? It could be 1 or 2!

 OpenMP assumes that the programmer


knows what (s)he is doing
 Regions of code that are marked to run in parallel
are independent
 If access collisions are possible, it is the
programmer’s responsibility to insert protection

32
Synchronization Mechanisms

• Many of the usual mechanisms from shared-memory programming:
  – critical sections, atomic updates
  – barriers
  – ...

33
Critical Sections & Atomic

 #pragma omp critical [name]


 Standard critical section functionality
 Critical sections are global in the program
 Can be used to protect a single resource in
different functions
 Critical sections are identified by the name
 All the critical sections having the same name
are mutually exclusive between themselves
#pragma omp atomic
 Protects a single variable update

34
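A hedged sketch showing both directives protecting shared updates (the
variable and critical-section names are ours):

  #include <omp.h>

  int shared_count = 0;
  double balance = 0.0;

  void updates(void) {
      #pragma omp parallel num_threads(4)
      {
          /* atomic: protects a single variable update */
          #pragma omp atomic
          shared_count += 1;

          /* critical: protects an arbitrary block; all critical sections
             with the same name ("bank") are mutually exclusive */
          #pragma omp critical (bank)
          {
              balance += 10.0;
          }
      }
  }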
Barrier synchronization
 #pragma omp barrier
 Performs a barrier synchronization between
all the threads in a team at the given point.
 Example:
#pragma omp parallel
{
int result = heavy_computation_part1();
#pragma omp atomic
sum += result;
#pragma omp barrier
heavy_computation_part2(sum);
}
35
Reduction Motivation

 How to parallelize this code?


for (i = 0; i < N; i++) {
    sum += a[i] * b[i];
}
 sum is not private; need synchronization to ensure
correct reduction!
 accessing it atomically (or with synchronization)
serializes the execution
 Have a private copy of sum in each thread, then
add the private copies serially to get overall sum.

36
OpenMP: Reduction Example

#include <omp.h>
#define NUM_THREADS 4

int main()
{
    int i;
    int A[1000], B[1000];
    int sum = 0;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000; i++) {
        sum += A[i] * B[i];
    }
}

• A private copy of sum is created for each thread; each thread computes the
  partial sum for the chunk (of the arrays) assigned to it.
• The code for combining the private copies into the global sum (using
  synchronization) is added automatically by OpenMP.
37
Controlling OpenMP behavior

 omp_set_num_threads(int)
 Control the number of threads used for parallelization
 Must be called from sequential code
 Also can be set by OMP_NUM_THREADS environment
variable
 omp_get_num_threads()
 How many threads are currently available?
 omp_get_thread_num()
 omp_in_parallel()
 Am I currently running in parallel mode?
 omp_get_wtime()
 A portable way to compute wall clock time

38
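For example, a minimal hedged sketch of timing a parallel region with
omp_get_wtime():

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      double t0 = omp_get_wtime();

      #pragma omp parallel
      {
          printf("hello from thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }

      double t1 = omp_get_wtime();
      printf("parallel region took %f seconds\n", t1 - t0);
      return 0;
  }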
Overview
 Introduction
 Parallelization Essentials
 Programming Models
 Task creation
 Synchronization / Communication
 Introduction to OpenMP
 Introduction to CUDA

39
Intel SIMD Extensions
 New SIMD instructions, new registers
 Introduced in phases/groups of functionality
 MMX (1993 – 1999)
 64 bit width operations
 SSE – SSE4 (1999 –2006)
 128 bit width operations
 AVX, FMA, AVX2, AVX-512 (2008 – …)
 256 – 512 bit width operations

40
SIMD Operations

• SIMD execution performs an operation in parallel on an array of
  2, 4, 8, 16 or 32 values

[Figure: element-wise operation on 4-element vectors, Zi = Xi (op) Yi]

• The operation (op) can be a
  – data movement instruction
  – arithmetic instruction
  – logical instruction
  – comparison instruction
  – conversion instruction
  – shuffle instruction
41
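As a hedged illustration (not from the slides), the same kind of element-wise
multiply can be written explicitly with SSE intrinsics in C:

  #include <immintrin.h>

  /* Multiply 256 floats element-wise, 4 at a time, using 128-bit SSE
     registers. The arrays are assumed to hold at least 256 elements. */
  void vec_mul(const float *x, const float *y, float *z) {
      for (int i = 0; i < 256; i += 4) {
          __m128 a = _mm_loadu_ps(&x[i]);   /* load 4 floats */
          __m128 b = _mm_loadu_ps(&y[i]);
          __m128 c = _mm_mul_ps(a, b);      /* 4 multiplications in parallel */
          _mm_storeu_ps(&z[i], c);          /* store 4 results */
      }
  }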
Automatic Vectorization

• In gcc, the flags "-O3 -mavx -mavx2" attempt automatic vectorization
• Works pretty well for simple loops

  float X[256], Y[256], Z[256];
  {
      int i;
      for (i = 0; i < 256; i++)
          Z[i] = X[i] * Y[i];
  }

  Generated inner loop (schematically):
  .L1: vmovdqa  xmm1, X[rax]
       vmovdqa  xmm2, Y[rax]
       vpmulld  xmm0, xmm1, xmm2
       vmovaps  Z[rax], xmm0
       add      rax, 4
       cmp      rax, 256
       jne      .L1

• But not for anything complex
  – e.g., naïve bubble sort code is not vectorized at all
42
CUDA Programming

[Slides 43–44: CUDA programming figures and code, shown as images in the
original slides]
CUDA Programming

[Figure: a grid of 64 thread blocks (0 .. 63), each containing 256 threads
(0 .. 255)]
Processing Flow

[Slides 46–48: "Processing Flow" figures, shown as images in the original
slides]
CUDA Programming

  cudaMalloc((void **) &d_x, size);
  cudaMalloc((void **) &d_y, size);
  cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);
  saxpy_parallel<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
  cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
  cudaFree(d_x); cudaFree(d_y);

49
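The kernel definition is not visible in this export; a minimal sketch of what
a saxpy_parallel kernel conventionally looks like in CUDA (the signature
follows the host call above, the body is our assumption):

  __global__ void saxpy_parallel(int n, float a, float *x, float *y)
  {
      /* one thread per element: blocks 0 .. nblocks-1, 256 threads per block */
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)                      /* guard against the last, partial block */
          y[i] = a * x[i] + y[i];
  }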
Summary
 Introduction
 Parallelization Essentials
 Introduction to OpenMP
 Covered only the basics --- there is more!
 Current version is OpenMP 5.x
 Support for accelerators
 SIMD/Vectorization support
 Nested Parallelism
 Introduction to CUDA
 there is more to learn about CUDA!
 Current version is CUDA 12.x
 Dynamic kernel invocation, CUDA Graphs, unified
memory, …
50
