Parallelization Principles
Sathish Vadhiyar
Parallel Programming and Challenges
Recall the advantages and motivation of
parallelism
But parallel programs incur overheads not
seen in sequential programs
Communication delay
Idling
Synchronization
Challenges
[Figure: execution timeline of two processes P0 and P1, showing time spent in computation, communication, synchronization, and idle time]
How do we evaluate a parallel program?
Execution time, Tp
Speedup, S
S(p, n) = T(1, n) / T(p, n)
Usually, S(p, n) < p
Sometimes S(p, n) > p (superlinear speedup)
Efficiency, E
E(p, n) = S(p, n)/p
Usually, E(p, n) < 1
Sometimes, greater than 1
Scalability – the limits on parallel performance, and how they relate to n and p
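For example, if T(1, n) = 100 s and T(4, n) = 30 s, then S(4, n) = 100/30 ≈ 3.33 and E(4, n) = 3.33/4 ≈ 0.83.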
Speedups and efficiency
[Figure: speedup S and efficiency E versus number of processors p – the ideal curves (S = p, E = 1) and the practical curves that fall below them]
Limitations on speedup – Amdahl’s law
Amdahl's law states that the performance
improvement to be gained from using some faster
mode of execution is limited by the fraction of
the time the faster mode can be used.
The overall speedup is expressed in terms of the fractions of computation time spent with and without the enhancement, and the speedup of the enhanced portion.
Places a limit on the speedup due to parallelism.
Speedup = 1 / (fs + fp/P), where fs is the serial (non-parallelizable) fraction, fp = 1 - fs is the parallelizable fraction, and P is the number of processors
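For example, with fs = 0.1 and P = 8, Speedup = 1/(0.1 + 0.9/8) ≈ 4.7; even as P → ∞, the speedup cannot exceed 1/fs = 10.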
Gustafson’s Law
Increase the problem size in proportion to the number of processors, so as to keep the overall time constant
The scaling keeping the problem size
constant (Amdahl’s law) is called strong
scaling
The scaling due to increasing problem size is
called weak scaling
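Under weak scaling, Gustafson's scaled speedup is Speedup = fs + P(1 - fs) = P - fs(P - 1); for example, with fs = 0.1 and P = 8, the scaled speedup is 7.3.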
PARALLEL PROGRAMMING
CLASSIFICATION AND STEPS
Programming Paradigms
Shared memory model – Threads, OpenMP,
CUDA
Message passing model – MPI
Parallelizing a Program
Given a sequential program/algorithm, how to
go about producing a parallel version
Four steps in program parallelization
1. Decomposition
Identifying parallel tasks with large extent of possible
concurrent activity; splitting the problem into tasks
2. Assignment
Grouping the tasks into processes with best load
balancing
3. Orchestration
Reducing synchronization and communication costs
4. Mapping
Mapping of processes to processors (if possible)
Steps in Creating a Parallel Program
[Figure: overall flow – the sequential computation is partitioned into tasks (decomposition), tasks are grouped into processes p0–p3 (assignment), the processes are coordinated (orchestration), and finally placed on processors P0–P3 (mapping)]
Decomposition and Assignment
Specifies how to group tasks together for a process
Balance workload, reduce communication and
management cost
In practice, the two steps are combined into one, answering the question: “What is the role of each parallel processing entity?”
Data Parallelism and Domain
Decomposition
The given data is divided across the processing entities
Each process owns and computes a portion
of the data – owner-computes rule
A multi-dimensional domain in a simulation is divided into as many subdomains as there are processing entities
This is called domain decomposition
Domain decomposition and Process
Grids
The given P processes are arranged in multiple dimensions, forming a process grid
The problem domain is divided among the process grid
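As an illustration, the sketch below (in C with MPI, the course's message-passing model) builds a 2-d process grid; MPI_Dims_create picks a balanced factorization of the process count.

#include <mpi.h>
#include <stdio.h>

/* Arrange the available processes into a 2-d process grid. */
int main(int argc, char **argv)
{
    int nprocs, rank;
    int dims[2] = {0, 0};      /* 0 lets MPI choose each extent */
    int periods[2] = {0, 0};   /* non-periodic grid */
    int coords[2];
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);                      /* e.g., 12 -> 4 x 3 */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    printf("process %d owns grid position (%d, %d) of %d x %d\n",
           rank, coords[0], coords[1], dims[0], dims[1]);
    MPI_Finalize();
    return 0;
}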
Illustrations
[Figures: example decompositions of a domain onto process grids]
Data Distributions
To divide the data along a dimension among the processes in that dimension of the grid, a data distribution scheme is followed
Common data distributions:
Block: for regular computations
Block-cyclic: when there is load imbalance across the space
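A minimal sketch of the two distributions, mapping a global index to its owning process (the function names and the block size b are illustrative, not from the slides):

#include <stdio.h>

/* Block: contiguous chunks of roughly n/p elements per process. */
int block_owner(int i, int n, int p)
{
    int b = (n + p - 1) / p;   /* block size, rounded up */
    return i / b;
}

/* Block-cyclic: fixed-size blocks of b elements dealt round-robin. */
int block_cyclic_owner(int i, int b, int p)
{
    return (i / b) % p;
}

int main(void)
{
    int n = 16, p = 4, b = 2;
    for (int i = 0; i < n; i++)
        printf("i = %2d: block -> P%d, block-cyclic -> P%d\n",
               i, block_owner(i, n, p), block_cyclic_owner(i, b, p));
    return 0;
}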
Task parallelism
Independent tasks identified
The tasks may or may not process different data
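A minimal OpenMP sketch of task parallelism; taskA and taskB are hypothetical independent routines standing in for the identified tasks:

#include <stdio.h>

void taskA(void) { printf("task A\n"); }   /* placeholder independent task */
void taskB(void) { printf("task B\n"); }   /* placeholder independent task */

int main(void)
{
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        taskA();
        #pragma omp task
        taskB();
        #pragma omp taskwait   /* wait for both independent tasks */
    }
    return 0;
}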
Orchestration
Goals
Structuring communication
Synchronization
Challenges
Organizing data structures – packing
Small or large messages?
How to organize communication and synchronization?
Orchestration
Maximizing data locality
Minimizing volume of data exchange
Not communicating intermediate results – e.g., in a dot product, communicate only the accumulated partial sum, not the individual products
Minimizing frequency of interactions - packing
Minimizing contention and hot spots
Avoid having every process use the same communication pattern with the other processes at the same time; stagger the patterns to spread the load
Overlapping computations with interactions
Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
Initiate the communication needed for type 1; while it is in flight, perform the type 2 computations (see the sketch after this list)
Replicating data or computations
Balancing the extra computation or storage cost with
the gain due to less communication
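A minimal sketch of this overlap in C with nonblocking MPI, for a row-wise decomposition exchanging one ghost row with the neighbour above; compute_interior_rows and compute_boundary_rows are hypothetical stand-ins for the type 2 and type 1 phases:

#include <mpi.h>

void compute_interior_rows(void) { /* type 2: needs no communicated data */ }
void compute_boundary_rows(void) { /* type 1: uses the received ghost row */ }

void sweep_overlapped(float *top_ghost, float *top_row, int n,
                      int up, MPI_Comm comm)
{
    MPI_Request reqs[2];
    /* start the ghost-row exchange with the neighbour above */
    MPI_Irecv(top_ghost, n, MPI_FLOAT, up, 0, comm, &reqs[0]);
    MPI_Isend(top_row,   n, MPI_FLOAT, up, 0, comm, &reqs[1]);
    compute_interior_rows();   /* overlap: runs while messages are in flight */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary_rows();   /* safe to use the ghost row now */
}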
Mapping
Which process runs on which particular
processor?
Can depend on network topology, communication
pattern of processes
On processor speeds in case of heterogeneous
systems
Grouping the tasks and assigning the groups to processors is called mapping
Two objectives:
Balance the groups
Minimize inter-group dependencies
The computation is represented as a task graph
The mapping problem is NP-hard
Based on Task Partitioning
[Figure: task-partitioning tree – the tasks are recursively split: first {0, 4}, then {0, 2, 4, 6}, then {0, 1, 2, 3, 4, 5, 6, 7}]
High-level Goals
Decomposition – mostly architecture-independent. Goal: expose enough concurrency, but not too much.
Assignment – mostly architecture-independent. Goals: balance the workload; reduce communication volume.
Orchestration – architecture-dependent. Goals: reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early.
Mapping – architecture-dependent. Goals: put related processes on the same processor if necessary; exploit locality in the network topology.
Example
Given a 2-d array of float values, repeatedly average each element with its immediate neighbours until the difference between two iterations is less than some tolerance value

[Figure: five-point stencil – A[i][j] and its neighbours A[i-1][j], A[i+1][j], A[i][j-1], A[i][j+1]]

do {
  diff = 0.0;
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
      temp = A[i][j];
      A[i][j] = average(neighbours); /* average with the four neighbours */
      diff += abs(A[i][j] - temp);
    }
} while (diff > tolerance);
Assignment
[Figure: the grid partitioned into contiguous blocks of rows, one block per process]
Orchestration
Different for different programming
models/architectures
Shared address space
Naming: global address space
Synch. through barriers and locks
Distributed Memory /Message passing
Non-shared address space
Send-receive messages + barrier for synch.
SAS Version – Generating Processes
1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/
2. float **A, diff = 0;
2a. LockDec (diff_lock);
2b. BarrierDec (barrier1);
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. Read (nprocs);
6. A ← g_malloc (a 2-d array of (n+2) x (n+2) doubles);
6a. Create (nprocs -1, Solve, A);
7. initialize(A); /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main
SAS Version -- Solve
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float temp;
14a. mybegin = 1 + (n/nprocs)*pid;
14b. myend = mybegin + (n/nprocs) - 1;
15. while (!done) do /*outermost loop over sweeps*/
16. diff = 0; /*initialize difference to 0*/
16a. Barrier (barrier1, nprocs);
17. for i ← mybegin to myend do /*sweep for all points of grid*/
18. for j ← 1 to n do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. diff += abs(A[i,j] - temp);
23. end for
24. end for
25. if (diff/(n*n) < TOL) then done = 1;
26. end while
27. end procedure
SAS Version -- Issues
SPMD program
Wait_for_end – all to one communication
How is diff accessed among processes?
Mutex to ensure diff is updated correctly.
A single lock is too much synchronization!
No need to synchronize at every grid point; each process can accumulate a local sum and update the shared diff just once per sweep.
What about access to A[i][j], especially the boundary
rows between processes?
Can loop termination be determined without any
synch. among processes?
Do we need any synchronization statement around the termination-condition check?
SAS Version -- Solve
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. mybegin = 1 + (n/nprocs)*pid;
14b. myend = mybegin + (n/nprocs) - 1;
15. while (!done) do /*outermost loop over sweeps*/
16. mydiff = diff = 0; /*initialize local and global difference to 0*/
16a. Barrier (barrier1, nprocs);
17. for i ← mybegin to myend do /*sweep for all points of grid*/
18. for j ← 1 to n do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. mydiff += abs(A[i,j] - temp);
23. end for
24. end for
24a. lock (diff_lock);
24b. diff += mydiff;
24c. unlock (diff_lock);
24d. Barrier (barrier1, nprocs);
25. if (diff/(n*n) < TOL) then done = 1;
25a. Barrier (barrier1, nprocs);
26. end while
27. end procedure
SAS Program
done condition evaluated redundantly by all
Code that does the update identical to
sequential program
each process has private mydiff variable
Most interesting special operations are for
synchronization
accumulations into shared diff have to be mutually
exclusive
Why the need for all the barriers? They separate the reset, the accumulation, and the test of the shared diff
No explicit ghost rows are needed – boundary rows of neighbours are read directly from shared memory
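For comparison, a compact OpenMP rendering of the same solver (a sketch, not from the slides): the per-process mydiff plus the locked update of the shared diff corresponds to a reduction(+:diff) clause, and the pragma's implicit barrier plays the role of barrier1. Like the pseudocode, neighbouring elements may be read before or after their update within a sweep.

#include <math.h>

void solve(float **A, int n, float TOL)
{
    int done = 0;
    while (!done) {
        float diff = 0.0f;
        /* reduction gives each thread a private diff, combined at the end */
        #pragma omp parallel for reduction(+: diff)
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                  + A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
        /* implicit barrier at the end of the parallel for */
        if (diff / ((float)n * n) < TOL)
            done = 1;
    }
}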
Data Layout and Orchestration
[Figure: message-passing data layout – each process stores only its own block of rows, plus ghost copies of the neighbouring boundary rows]
Message Passing Version – Generating
Processes
1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/
2. float **myA;
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. read (nprocs);
/* 6. A ← g_malloc (a 2-d array of (n+2) x (n+2) doubles); */
6a. Create (nprocs -1, Solve, A);
/* 7. initialize(A); */ /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main
Message Passing Version – Array allocation
and Ghost-row Copying
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. myend = n/nprocs;
6. myA ← malloc (a 2-d array of (n/nprocs + 2) x (n+2) floats); /* local rows plus ghost rows */
7. initialize (myA); /* initialize myA LOCALLY */
15. while (!done) do /*outermost loop over sweeps*/
16. mydiff = 0; /*initialize local difference to 0*/
16a. if (pid != 0) then
SEND (&myA[1,0] , n*sizeof(float), (pid-1), row);
16b. if (pid != nprocs-1) then
SEND (&myA[myend,0], n*sizeof(float), (pid+1), row);
16c. if (pid != 0) then
RECEIVE (&myA[0,0], n*sizeof(float), (pid -1), row);
16d. if (pid != nprocs-1) then
RECEIVE (&myA[myend+1,0], n*sizeof(float), (pid+1), row);
Message Passing Version – Solver
12. begin
… … …
15. while (!done) do /*outermost loop over sweeps*/
… … …
17. for i ← 1 to myend do /*sweep for all points of grid*/
18. for j ← 1 to n do
19. temp = myA[i,j]; /*save old value of element*/
20. myA[i,j] ← 0.2 * (myA[i,j] + myA[i,j-1] + myA[i-1,j] +
21. myA[i,j+1] + myA[i+1,j]); /*compute average*/
22. mydiff += abs(myA[i,j] - temp);
23. end for
24. end for
24a. if (pid != 0) then
24b. SEND (mydiff, sizeof (float), 0, DIFF);
24c. RECEIVE (done, sizeof(int), 0, DONE);
24d. else
24e. for k ← 1 to nprocs-1 do
24f. RECEIVE (tempdiff, sizeof(float), k, DIFF);
24g. mydiff += tempdiff;
24h. endfor
24i. if (mydiff/(n*n) < TOL) then done = 1;
24j. for k ← 1 to nprocs-1 do
24k. SEND (done, sizeof(int), k, DONE);
24l. endfor
25. end while
26. end procedure
Notes on Message Passing Version
Data transfer is sender-initiated: a send moves the data, a receive does not
Unlike SAS, which is usually receiver-initiated (a load fetches the data)
Can there be a deadlock situation due to the sends?
Communication is done all at once, in whole rows at the beginning of each iteration, not grid point by grid point
Core similar, but indices/bounds in local rather
than global space
Synchronization through sends and receives
Update of global diff and event synch for done
condition – mutual exclusion occurs naturally
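In real MPI, the hand-rolled accumulation of mydiff at process 0 and the broadcast of done (lines 24a–24l) collapse into a single collective; a minimal sketch:

#include <mpi.h>

/* Every process passes its local mydiff; all get the same verdict. */
int converged(float mydiff, int n, float TOL, MPI_Comm comm)
{
    float diff;
    MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, comm);
    return diff / ((float)n * n) < TOL;
}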
Orchestration: Summary
Shared address space
Shared and private data explicitly separate
Communication implicit in access patterns
Orchestration: Summary
Message passing
Data distribution among local address spaces needed
No explicit shared structures (implicit in comm. patterns)
Communication is explicit
Grid Solver Program: Summary
Decomposition and Assignment similar in SAS and
message-passing
Orchestration is different
Data structures, data access/naming, communication,
synchronization
Performance?
Grid Solver Program: Summary
[Figure: the SAS and message-passing versions of the solver shown side by side]