
Parallelization Principles

Sathish Vadhiyar
Parallel Programming and Challenges
 Recall the advantages and motivation of
parallelism
 But parallel programs incur overheads not
seen in sequential programs
 Communication delay
 Idling
 Synchronization

2
Challenges

[Figure: execution timeline for two processes P0 and P1 — computation, communication, and synchronization phases and the idle time they cause]

3
How do we evaluate a parallel program?
 Execution time, Tp
 Speedup, S
 S(p, n) = T(1, n) / T(p, n)
 Usually, S(p, n) < p
 Sometimes S(p, n) > p (superlinear speedup)
 Efficiency, E
 E(p, n) = S(p, n)/p
 Usually, E(p, n) < 1
 Sometimes, greater than 1
 Scalability – limitations of parallel computing; how performance behaves as n and p grow

4
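As an illustration: if a program takes T(1, n) = 100 s sequentially and T(8, n) = 16 s on 8 processors, then S(8, n) = 100/16 = 6.25 and E(8, n) = 6.25/8 ≈ 0.78.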
Speedups and efficiency

[Figure: speedup S and efficiency E versus number of processors p — ideal curves (linear S, E = 1) compared with practical curves]
5
Limitations on speedup – Amdahl’s law
 Amdahl's law states that the performance
improvement to be gained from using some faster
mode of execution is limited by the fraction of
the time the faster mode can be used.
 Overall speedup is expressed in terms of the fraction of execution time that must remain serial (fs) and the fraction that benefits from the enhancement (fp = 1 – fs).
 Places a limit on the speedup due to parallelism.
 Speedup = 1 / (fs + fp/P), where P is the number of processors

6
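For example, with fs = 0.1 and fp = 0.9, the speedup on P = 16 processors is 1 / (0.1 + 0.9/16) = 6.4, and no matter how large P grows the speedup can never exceed 1/fs = 10.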
Gustafson’s Law
 Increase problem size proportionally so as to
keep the overall time constant
 The scaling keeping the problem size
constant (Amdahl’s law) is called strong
scaling
 The scaling due to increasing problem size is
called weak scaling

7
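As a small illustration (not part of the slides), the following C sketch tabulates the two laws for a serial fraction fs = 0.1; the function names are chosen here for clarity.

#include <stdio.h>

/* Illustrative sketch: fs is the serial fraction, (1 - fs) the
   parallelizable fraction, p the number of processors. */
double amdahl_speedup(double fs, int p)    { return 1.0 / (fs + (1.0 - fs) / p); }
double gustafson_speedup(double fs, int p) { return fs + p * (1.0 - fs); }

int main(void) {
    /* For fs = 0.1: strong scaling saturates near 1/fs = 10,
       while the scaled (weak-scaling) speedup keeps growing with p. */
    for (int p = 2; p <= 64; p *= 2)
        printf("p=%2d  Amdahl=%6.2f  Gustafson=%6.2f\n",
               p, amdahl_speedup(0.1, p), gustafson_speedup(0.1, p));
    return 0;
}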
PARALLEL PROGRAMMING
CLASSIFICATION AND STEPS

8
Programming Paradigms
 Shared memory model – Threads, OpenMP,
CUDA
 Message passing model – MPI

9
Parallelizing a Program
Given a sequential program/algorithm, how to
go about producing a parallel version
Four steps in program parallelization
1. Decomposition
Identifying parallel tasks with large extent of possible
concurrent activity; splitting the problem into tasks
2. Assignment
Grouping the tasks into processes with best load
balancing
3. Orchestration
Reducing synchronization and communication costs
4. Mapping
Mapping of processes to processors (if possible)
10
Steps in Creating a Parallel Program
[Figure: the four steps — the sequential computation is partitioned into tasks (Decomposition), the tasks are grouped into processes p0–p3 (Assignment), the processes are coordinated into a parallel program (Orchestration), and the processes are placed on processors P0–P3 (Mapping)]

11
Decomposition and Assignment
 Specifies how to group tasks together for a process
 Balance workload, reduce communication and
management cost
 In practical cases, both steps combined into
one step, trying to answer the question “What
is the role of each parallel processing entity?”

12
Data Parallelism and Domain
Decomposition
 The given data is divided across the processing entities
 Each process owns and computes a portion of the data – the owner-computes rule
 A multi-dimensional domain in a simulation is divided into subdomains, one per processing entity
 This is called domain decomposition

13
Domain decomposition and Process
Grids
 The given P processes are arranged in multiple dimensions, forming a process grid
 The domain of the problem is divided across this process grid

14
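The slides do not prescribe an API, but a process grid of this kind is commonly built with MPI's Cartesian topology routines; the sketch below is one possible way to do it (the 2-D grid shape and the variable names are assumptions, not from the slides).

#include <mpi.h>
#include <stdio.h>

/* Minimal sketch: form a 2-D process grid and find this process's
   position in it. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int nprocs, rank, dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Dims_create(nprocs, 2, dims);          /* e.g. 12 processes -> 4 x 3 grid */
    MPI_Comm grid;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);                /* rank within the grid communicator */
    MPI_Cart_coords(grid, rank, 2, coords);    /* my (row, col) position in the grid */

    /* each process would now own the subdomain at (coords[0], coords[1])
       of a dims[0] x dims[1] decomposition of the global domain */
    printf("rank %d -> grid position (%d, %d)\n", rank, coords[0], coords[1]);

    MPI_Finalize();
    return 0;
}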
Illustrations

[Figure: example decompositions of a domain onto process grids]
15
Data Distributions
 To divide the data in one dimension among the processes in that dimension, a data distribution scheme is used
 Common data distributions:
 Block: for regular computations
 Block-cyclic: when there is load imbalance across space
16
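As a concrete (hedged) illustration of the two schemes, the helper functions below compute which process owns a given global index in one dimension; the function names and the block-size parameter b are assumptions, not from the slides.

/* Which process owns global index i (0-based) in one dimension,
   for n elements distributed over p processes -- illustrative only. */
int block_owner(int i, int n, int p) {
    int chunk = (n + p - 1) / p;     /* ceil(n/p) contiguous elements per process */
    return i / chunk;
}

int block_cyclic_owner(int i, int b, int p) {
    return (i / b) % p;              /* blocks of b elements dealt out round-robin */
}

With the block-cyclic scheme, a heavily loaded region of the domain is spread over many processes rather than landing on one, which is why it helps when load is imbalanced across space.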
Task parallelism
 Independent tasks are identified
 The tasks may or may not process different data

17
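A minimal OpenMP sketch of task parallelism (not from the slides): two independent tasks, possibly working on different data, run in parallel sections. The task functions are placeholders; compile with -fopenmp.

#include <stdio.h>

void task_a(void) { puts("task A"); }   /* e.g. one independent computation  */
void task_b(void) { puts("task B"); }   /* e.g. another, on different data   */

int main(void) {
    /* each section is an independent task executed by some thread */
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }
    return 0;
}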
Orchestration
 Goals
 Structuring communication
 Synchronization
 Challenges
 Organizing data structures – packing
 Small or large messages?
 How to organize communication and synchronization?

18
Orchestration
 Maximizing data locality
 Minimizing volume of data exchange
 Not communicating intermediate results – e.g. dot product
 Minimizing frequency of interactions - packing
 Minimizing contention and hot spots
 Avoid having every process use the same communication pattern with the same partners at the same time
 Overlapping computations with interactions
 Split computations into two phases: those that depend on communicated data (type 1) and those that do not (type 2)
 Initiate the communication needed for type 1; while it is in progress, perform the type 2 computation
 Replicating data or computations
 Balancing the extra computation or storage cost with
the gain due to less communication
19
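One common way to realize the type 1/type 2 split above is nonblocking message passing. The sketch below is an assumption (standard MPI rather than the slides' pseudocode): it starts the exchange, performs the independent computation, and only then waits for the data it needs. compute_interior and compute_boundary are placeholders.

#include <mpi.h>

void compute_interior(void) { /* ... type 2: does not need incoming data ... */ }
void compute_boundary(float *ghost) { (void)ghost; /* ... type 1: needs it ... */ }

void overlapped_step(float *sendbuf, float *recvbuf, int n, int partner) {
    MPI_Request reqs[2];
    /* initiate the exchange whose result type 1 computation will need */
    MPI_Irecv(recvbuf, n, MPI_FLOAT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_FLOAT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    compute_interior();                 /* overlap: type 2 work while messages are in flight */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    compute_boundary(recvbuf);          /* type 1 work once the data has arrived */
}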
Mapping
 Which process runs on which particular
processor?
 Can depend on network topology, communication
pattern of processes
 On processor speeds in case of heterogeneous
systems
 Grouping the tasks and assigning the groups to processors is called mapping
 Two objectives:
 Balance the groups
 Minimize inter-group dependencies
 Represented as task graph
 Mapping problem is NP-hard
20
Based on Task Partitioning

 Based on task dependency graph


[Figure: task dependency graph — eight leaf tasks 0–7 combine pairwise into tasks 0, 2, 4, 6, then into 0 and 4, and finally into task 0]

 In general the problem is NP complete

21
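Read as a reduction, the tree above combines eight partial results pairwise in log2(8) = 3 steps. A hedged MPI sketch of that pattern (names assumed, p assumed to be a power of two):

#include <mpi.h>

/* Combine one float per process into rank 0 following the tree on the
   slide: at each step half of the remaining processes send to a partner
   and drop out. */
void tree_reduce(float *value, int rank, int p) {
    for (int step = 1; step < p; step *= 2) {
        if (rank % (2 * step) == step) {
            MPI_Send(value, 1, MPI_FLOAT, rank - step, 0, MPI_COMM_WORLD);
            return;                                    /* this task is done */
        } else if (rank % (2 * step) == 0 && rank + step < p) {
            float partner;
            MPI_Recv(&partner, 1, MPI_FLOAT, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            *value += partner;                         /* combine the two subtrees */
        }
    }
}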
High-level Goals

Table 2.1 Steps in the Parallelization Process and Their Goals

Step            Architecture-Dependent?   Major Performance Goals
Decomposition   Mostly no                 Expose enough concurrency, but not too much
Assignment      Mostly no                 Balance workload; reduce communication volume
Orchestration   Yes                       Reduce non-inherent communication via data locality;
                                          reduce communication and synchronization cost as seen
                                          by the processor; reduce serialization at shared
                                          resources; schedule tasks to satisfy dependences early
Mapping         Yes                       Put related processes on the same processor if
                                          necessary; exploit locality in network topology
22
Example
Given a 2-D array of float values, repeatedly average each element with its immediate neighbours until the difference between two iterations is less than some tolerance value

[Figure: five-point stencil — A[i][j] and its neighbours A[i-1][j], A[i+1][j], A[i][j-1], A[i][j+1]]

do {
    diff = 0.0;
    for (i = 1; i <= n; i++)        /* interior points of the (n+2) x (n+2) grid */
        for (j = 1; j <= n; j++) {
            temp = A[i][j];
            A[i][j] = average (neighbours);
            diff += abs (A[i][j] - temp);
        }
} while (diff > tolerance);

23
Assignment

[Figure: block-row assignment — contiguous bands of grid rows assigned to processes P0, P1, P2, P4]
24
Orchestration
 Different for different programming
models/architectures
 Shared address space
 Naming: global address space
 Synchronization through barriers and locks
 Distributed Memory /Message passing
 Non-shared address space
 Send-receive messages + barrier for synch.

25
SAS Version – Generating Processes
1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/
2. float **A, diff = 0;
2a. LockDec (lock_diff);
2b. BarrierDec (barrier1);
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. Read (nprocs);
6. A  g_malloc (a 2-d array of (n+2) x (n+2) doubles);
6a. Create (nprocs -1, Solve, A);
7. initialize(A); /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main

26
SAS Version -- Solve
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float temp;
14a. mybegin = 1 + (n/nprocs)*pid;
14b. myend = mybegin + (n/nprocs);
15. while (!done) do /*outermost loop over sweeps*/
16. diff = 0; /*initialize difference to 0*/
16a. Barrier (barrier1, nprocs);
17. for i ← mybegin to myend do /*sweep for all points of grid*/
18. for j ← 1 to n do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. diff += abs(A[i,j] - temp);
23. end for
24. end for
25. if (diff/(n*n) < TOL) then done = 1;
26. end while
27. end procedure

27
SAS Version -- Issues
 SPMD program
 Wait_for_end – all to one communication
 How is diff accessed among processes?
 Mutex to ensure diff is updated correctly.
 Single lock → too much synchronization!
 Need not synchronize for every grid point. Can do only
once.
 What about access to A[i][j], especially the boundary
rows between processes?
 Can loop termination be determined without any
synch. among processes?
 Do we need any additional synchronization for the termination-condition statement?

28
SAS Version -- Solve
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. mybegin = 1 + (n/nprocs)*pid;
14b. myend = mybegin + (n/nprocs);
15. while (!done) do /*outermost loop over sweeps*/
16. mydiff = diff = 0; /*initialize local difference to 0*/
16a. Barrier (barrier1, nprocs);
17. for i ← mybegin to myend do /*sweep for all points of grid*/
18. for j ← 1 to n do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. mydiff += abs(A[i,j] - temp);
23. end for
24. end for
24a. lock (lock_diff); /*accumulate local sums into the shared diff*/
24b. diff += mydiff;
24c. unlock (lock_diff);
24d. Barrier (barrier1, nprocs);
25. if (diff/(n*n) < TOL) then done = 1;
25a. Barrier (barrier1, nprocs);
26. end while
27. end procedure

29
SAS Program
 done condition evaluated redundantly by all
 Code that does the update identical to
sequential program
 each process has private mydiff variable
 Most interesting special operations are for
synchronization
 accumulations into shared diff have to be mutually
exclusive
 why the need for all the barriers?
 Good global reduction?
 Utility of this parallel accumulate?
30
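For comparison only (the slides use explicit threads, locks and barriers), the same per-process mydiff idea can be expressed with an OpenMP reduction, which performs the global combine in a single clause; the function name sweep is an assumption.

#include <math.h>

/* One sweep over the interior of the (n+2) x (n+2) grid; returns the
   accumulated difference. Races on neighbouring elements are tolerated,
   as in the slides' solver. */
float sweep(float **A, int n) {
    float diff = 0.0f;
    /* reduction(+:diff) gives each thread a private partial sum (its
       "mydiff") and combines the sums once at the end, instead of
       taking a lock per grid point */
    #pragma omp parallel for reduction(+:diff)
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= n; j++) {
            float temp = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                              + A[i][j+1] + A[i+1][j]);
            diff += fabsf(A[i][j] - temp);
        }
    }
    return diff;   /* caller checks diff/(n*n) < TOL, as in the slides */
}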
Message Passing Version
 Cannot declare A to be a global shared array
 compose it from per-process private arrays
 usually allocated in accordance with the assignment of work – owner-computes rule
 a process assigned a set of rows allocates them locally
 Structurally similar to SPMD SAS
 Orchestration different
 data structures and data access/naming
 communication
 synchronization
 Ghost rows
31
Data Layout and Orchestration
[Figure: each process's partition of the grid — rows owned by P0, P1, P2, P4, with ghost rows above and below each partition]

Data partition allocated per processor:
 Add ghost rows to hold boundary data
 Send edges to neighbors
 Receive into ghost rows
 Compute as in sequential program

32
Message Passing Version – Generating
Processes
1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/
2. float **myA;
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. read (nprocs);
/* 6. A  g_malloc (a 2-d array of (n+2) x (n+2) doubles); */
6a. Create (nprocs -1, Solve, A);
/* 7. initialize(A); */ /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main

33
Message Passing Version – Array allocation
and Ghost-row Copying
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. myend = n/nprocs;
6. myA = malloc (a 2-d array of (n/nprocs + 2) x (n+2) floats); /*local rows plus two ghost rows*/
7. initialize (myA); /* initialize myA LOCALLY */
15. while (!done) do /*outermost loop over sweeps*/
16. mydiff = 0; /*initialize local difference to 0*/
16a. if (pid != 0) then
SEND (&myA[1,0] , n*sizeof(float), (pid-1), row);
16b. if (pid != nprocs-1) then
SEND (&myA[myend,0], n*sizeof(float), (pid+1), row);
16c. if (pid != 0) then
RECEIVE (&myA[0,0], n*sizeof(float), (pid -1), row);
16d. if (pid != nprocs-1) then
RECEIVE (&myA[myend+1,0], n*sizeof(float), (pid+1), row);

34
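A hedged alternative in standard MPI to the ghost-row exchange in lines 16a–16d: MPI_Sendrecv pairs each send with a receive, so the exchange cannot deadlock. Variable names follow the slides; myA is assumed to hold myend+2 rows of n+2 floats (local rows plus the two ghost rows).

#include <mpi.h>

void exchange_ghost_rows(float **myA, int n, int myend, int pid, int nprocs) {
    int up   = (pid == 0)          ? MPI_PROC_NULL : pid - 1;   /* no-op at the edges */
    int down = (pid == nprocs - 1) ? MPI_PROC_NULL : pid + 1;

    /* send my first interior row up; receive the lower ghost row from below */
    MPI_Sendrecv(&myA[1][0],       n + 2, MPI_FLOAT, up,   0,
                 &myA[myend+1][0], n + 2, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send my last interior row down; receive the upper ghost row from above */
    MPI_Sendrecv(&myA[myend][0],   n + 2, MPI_FLOAT, down, 1,
                 &myA[0][0],       n + 2, MPI_FLOAT, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}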
Message Passing Version – Solver
12. begin
… … …
15. while (!done) do /*outermost loop over sweeps*/
… … …
17. for i  1 to myend do/*sweep for all points of grid*/
18. for j  1 to n do
19. temp = myA[i,j]; /*save old value of element*/
20. myA[i,j]  0.2 * (myA[i,j] + myA[i,j-1] +myA[i-1,j] +
21. myA[i,j+1] + myA[i+1,j]); /*compute average*/
22. mydiff += abs(myA[i,j] - temp);
23. end for
24. end for
24a if (pid != 0) then
24b. SEND (mydiff, sizeof (float), 0, DIFF);
24c. RECEIVE (done, sizeof(int), 0, DONE);
24d. else
24e. for k  1 to nprocs-1 do
24f. RECEIVE (tempdiff, sizeof(float), k , DIFF);
24g. mydiff += tempdiff;
24h. endfor
24i. If(mydiff/(n*n) < TOL) then done = 1;
24j. for k  1 to nprocs-1 do
24k. SEND (done, sizeof(float), k , DONE);
24l. endfor
25. end while
26. end procedure
35
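The manual accumulation of mydiff and broadcast of done in lines 24a–24l can also be written as a single collective; this is an alternative sketch in standard MPI, not the slides' code.

#include <mpi.h>

/* returns 1 on every process once the global residual is below tol */
int converged(float mydiff, int n, float tol) {
    float diff;
    /* sum all local mydiff values and leave the result on every process,
       so each one evaluates the termination condition identically */
    MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    return (diff / ((float)n * n)) < tol;
}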
Notes on Message Passing Version
 Receive does not transfer data, send does
 unlike SAS which is usually receiver-initiated (load
fetches data)
 Can there be deadlock situation due to sends?
 Communication done at once in whole rows at
beginning of iteration, not grid-point by grid-point
 Core similar, but indices/bounds in local rather
than global space
 Synchronization through sends and receives
 Update of global diff and event synch for done
condition – mutual exclusion occurs naturally

36
Orchestration: Summary
 Shared address space
 Shared and private data explicitly separate
 Communication implicit in access patterns
 Synchronization via atomic operations on shared data
 Synchronization explicit and distinct from data communication

37
Orchestration: Summary
 Message passing
 Data distribution among local address spaces needed
 No explicit shared structures (implicit in comm. patterns)
 Communication is explicit
 Synchronization implicit in communication (at least in the synchronous case)

38
Grid Solver Program: Summary
 Decomposition and Assignment similar in SAS and
message-passing
 Orchestration is different
 Data structures, data access/naming, communication,
synchronization
 Performance?

39
Grid Solver Program: Summary

                                       SAS        Msg-Passing
Explicit global data structure?        Yes        No
Communication                          Implicit   Explicit
Synchronization                        Explicit   Implicit
Explicit replication of border rows?   No         Yes

40
