Principles of Parallel Algorithm Design
Constructing a Parallel Algorithm
Task Decomposition and Dependency Graphs
Task-dependency graph:
• Node: represents a task.
• Directed edge: represents a control dependence (a task can start only after the tasks it depends on have finished).
Example 1: Dense Matrix-Vector Multiplication
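The decomposition on this slide can be sketched in code (a minimal illustration, assuming row-major storage and an input vector named b as on the later sparse-matrix slide; the OpenMP pragma is just one way to run the tasks in parallel):

    /* Row-wise task decomposition of dense matrix-vector multiplication y = A*b.
       Task i computes y[i]; the n tasks are independent, so the task-dependency
       graph has n nodes and no edges. */
    void dense_matvec(int n, const double *A, const double *b, double *y) {
        #pragma omp parallel for            /* one task per row (illustrative) */
        for (int i = 0; i < n; i++) {       /* Task i */
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A[i * n + j] * b[j]; /* A stored row-major */
            y[i] = sum;
        }
    }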
Example 2: Database Query Processing
• Task: create sets of elements that satisfy one or more criteria.
• Edge: the output of one task serves as input to the next.
• An alternate task-dependency graph for the same query.
Degree of Concurrency
• Maximum degree of concurrency: the maximum number of tasks that can be executed in parallel at any time during execution.
• Average degree of concurrency: the average number of tasks that can run concurrently over the entire execution.
Critical Path of Task Graph
• Critical path: the longest directed path between any pair of start node (node with no incoming edges) and finish node (node with no outgoing edges).
• Critical path length: the sum of the weights of the nodes along the critical path.
  – The weight of a node is the size or the amount of work associated with the corresponding task.
• Average degree of concurrency = total amount of work / critical path length
Example: Critical Path Length
Left graph:
Critical path length = 27
Average degree of concurrency = 63/27 = 2.33
Right graph:
Critical path length = 34
Average degree of concurrency = 64/34 = 1.88
Limits on Parallelization
• Factors that bound parallel execution:
  – The finest task granularity is limited: a problem can be decomposed into only a finite number of tasks.
    • Matrix-vector multiplication: at most O(n²) concurrent tasks.
  – Interactions between tasks.
    • Tasks often share input, output, or intermediate data, which may lead to interactions not shown in the task-dependency graph.
• Speedup = sequential execution time / parallel execution time
• Parallel efficiency = sequential execution time / (parallel execution time × number of processors used)
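A quick illustrative calculation (the numbers are assumed for illustration, not from the slides): if a program takes 100 s sequentially and 30 s in parallel on 4 processors, then speedup = 100/30 ≈ 3.33 and parallel efficiency = 100/(30 × 4) ≈ 0.83.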
Task Interaction Graphs
• Tasks generally share input, output or intermediate data
– Ex. matrix-vector multiplication: there is originally only one copy of b, so tasks have to communicate to share b.
• Task-interaction graph
– To capture interactions among tasks
– Node = task
– Edge (undirected or directed) = interaction or data exchange
• Task-dependency graph vs. task-interaction graph
– Task-dependency graph represents control dependency
– Task-interaction graph represents data dependency
– The edge-set of a task-interaction graph is usually a superset
of the edge-set of the task-dependency graph
Example: Task-Interaction Graph
Sparse matrix-vector multiplication
• Tasks: each task computes an entry of y[]
• Assign ith row of A to Task i. Also assign b[i] to
Task i.
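A small sketch of the computation behind this interaction graph, assuming CSR (compressed sparse row) storage of A with arrays row_ptr, col_idx, and val (the names are chosen for illustration): task i needs b[j] for every nonzero A[i][j], so it interacts with the tasks that own those entries of b.

    /* Task i of sparse matrix-vector multiplication y = A*b (CSR storage).
       Each nonzero A[i][j] with j != i requires b[j] from Task j, which is
       exactly an edge (i, j) in the task-interaction graph. */
    double sparse_matvec_task(int i, const int *row_ptr, const int *col_idx,
                              const double *val, const double *b) {
        double yi = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            int j = col_idx[k];      /* column of this nonzero */
            yi += val[k] * b[j];     /* b[j] is owned by Task j */
        }
        return yi;                   /* y[i], owned by Task i */
    }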
Processes and Mapping
Criteria of Mapping
1. Maximize the use of concurrency by mapping independent tasks onto different processes.
2. Minimize the total completion time by making sure that processes are available to execute the tasks on the critical path as soon as such tasks become executable.
3. Minimize interaction among processes by mapping tasks with a high degree of mutual interaction onto the same process.
(Figure: two different mappings of the example task-dependency graphs onto four processes P0–P3.)
Decomposition Techniques
Recursive Decomposition
Example: quicksort
• Select a pivot.
• Partition the sequence around the pivot.
• Recursively sort each sub-sequence; the two recursive sorts can proceed as independent tasks.
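A sketch of this recursive decomposition using OpenMP tasks (one possible realization; the pragma choice and helper names are assumptions, not from the slides):

    /* Recursive decomposition of quicksort: each recursive call is a task. */
    static int partition(double *A, int lo, int hi) {
        double pivot = A[hi];                              /* select a pivot */
        int i = lo;
        for (int j = lo; j < hi; j++)
            if (A[j] < pivot) { double t = A[i]; A[i] = A[j]; A[j] = t; i++; }
        double t = A[i]; A[i] = A[hi]; A[hi] = t;          /* place pivot */
        return i;
    }

    void quicksort(double *A, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(A, lo, hi);          /* partition around the pivot */
        #pragma omp task                       /* left sub-sequence: new task */
        quicksort(A, lo, p - 1);
        #pragma omp task                       /* right sub-sequence: new task */
        quicksort(A, p + 1, hi);
        #pragma omp taskwait
    }

The top-level call would typically be made inside a parallel region, e.g. under #pragma omp parallel followed by #pragma omp single.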
Recursive Decomposition for Finding Min
Find the minimum in an array of numbers A of length n
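A serial sketch of the recursive decomposition (the two recursive calls are the independent tasks; they could be run in parallel exactly as in the quicksort example above):

    /* Recursive decomposition for finding the minimum of A[lo..hi].
       The two halves are independent tasks; their results are combined
       by a single comparison, giving a task-dependency tree of depth log n. */
    double rec_min(const double *A, int lo, int hi) {
        if (lo == hi) return A[lo];
        int mid = lo + (hi - lo) / 2;
        double left  = rec_min(A, lo, mid);       /* task 1: left half  */
        double right = rec_min(A, mid + 1, hi);   /* task 2: right half */
        return left < right ? left : right;       /* combine step */
    }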
Data Decomposition
• Ideal for problems that operate on large data
structures
• Steps
1. The data on which the computations are
performed are partitioned
2. This data partitioning is used to induce a partitioning of the computations into tasks.
• Data Partitioning
– Partition output data
– Partition input data
– Partition input + output data
– Partition intermediate data
Data Decomposition Based on Partitioning Output Data
• If each element of the output can be computed independently of the others as a function of the input, partitioning the output data automatically induces a decomposition of the computation into tasks.
Example: Output Data Decomposition
Matrix-matrix multiplication: C = A × B
• Partition matrix C into 2 × 2 submatrices.
• The computation of C can then be partitioned into four tasks, one per submatrix of C (written out below).
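Written out (standard 2 × 2 block formulas, consistent with the partitioning above; the slides' figure is not reproduced):
Task 1: C_{1,1} = A_{1,1} B_{1,1} + A_{1,2} B_{2,1}
Task 2: C_{1,2} = A_{1,1} B_{1,2} + A_{1,2} B_{2,2}
Task 3: C_{2,1} = A_{2,1} B_{1,1} + A_{2,2} B_{2,1}
Task 4: C_{2,2} = A_{2,1} B_{1,2} + A_{2,2} B_{2,2}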
Data Decomposition Based on Partitioning Intermediate Data
Example: Intermediate Data Decomposition
Let D_{k,i,j} = A_{i,k} · B_{k,j}
• Stage 1: compute the intermediate blocks D_{k,i,j} = A_{i,k} · B_{k,j}
• Stage 2: compute C_{i,j} = Σ_k D_{k,i,j}
Task-dependency graph: each Stage-2 task for C_{i,j} depends on the Stage-1 tasks that produce D_{k,i,j}.
Owner-Computes Rule
• The process assigned a particular piece of data is responsible for all computations associated with it.
  – Output-data decomposition: each process performs all computations for the outputs it owns.
  – Input-data decomposition: each process performs all computations that use the inputs it owns.
Characteristics of Tasks
Key characteristics of tasks influencing choice of mapping and
performance of parallel algorithm:
1. Task generation
• Static or dynamic generation
– Static: all tasks are known before the algorithm starts execution. Data or
recursive decomposition often leads to static task generation.
Ex. Matrix-multiplication. Recursive decomposition in finding min. of a set of
numbers.
– Dynamic: the actual tasks and the task-dependency graph are not explicitly
available a priori. Recursive, exploratory decomposition can generate tasks
dynamically.
Ex. Recursive decomposition in Quicksort, in which tasks are generated
dynamically.
2. Task sizes
• The amount of time required to complete each task: uniform or non-uniform
3. Knowledge of task sizes
4. Size of data associated with tasks
• Data associated with the task must be available to the process
performing the task. The size and location of data may determine the
data-movement overheads.
Characteristics of Task Interactions
Static vs. Dynamic Interactions
• Static interaction
  – Tasks and their interactions are predetermined: the task-interaction graph and the times at which interactions occur are known in advance (e.g., matrix multiplication).
  – Easy to program.
• Dynamic interaction
  – The timing of interactions, or the set of tasks to interact with, cannot be determined prior to execution.
  – Difficult to program using message passing; shared-address-space programming may be simpler.
Regular vs. Irregular Interactions
• Regular interactions
  – The interaction pattern has a spatial structure that can be exploited for efficient implementation (e.g., ring, mesh).
  – Example: explicit finite-difference methods for solving PDEs.
• Irregular interactions
  – The interaction pattern has no well-defined structure.
  – Example: sparse matrix-vector multiplication.
Mapping Technique for Load Balancing
Minimize execution time → Reduce overheads of execution
• Sources of overheads:
– Inter-process interaction
– Idling
– Both interaction and idling are often a function of mapping
• Goals to achieve:
– To reduce interaction time
– To reduce the total amount of time that some processes are idle (the goal of load balancing)
– Remark: these two goals often conflict
• Classes of mapping:
– Static
– Dynamic
Remark:
1. Load balancing is only a necessary, but not sufficient, condition for reducing idling.
• Task-dependency graph determines which tasks can execute in parallel and
which must wait for some others to finish at a given stage.
2. Good mapping must ensure that computations and interactions among processes
at each stage of execution are well balanced.
Two mappings of a 12-task decomposition in which the last 4 tasks can be started only after the first 8 are finished, due to task dependencies.
Schemes for Static Mapping
Mapping Based on Data Partitioning
1D Block Distribution
Example: distribute rows or columns of the matrix to different processes (a row-block sketch follows below).
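A minimal sketch of a 1D row-block distribution (names are illustrative; when p does not divide n, the first n mod p processes receive one extra row):

    /* 1D block distribution: rows [first, first+count) are owned by process rank. */
    void block_rows(int n, int p, int rank, int *first, int *count) {
        int base = n / p, extra = n % p;             /* rows per process, leftovers */
        *count = base + (rank < extra ? 1 : 0);      /* first 'extra' ranks get +1  */
        *first = rank * base + (rank < extra ? rank : extra);
    }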
Multi-D Block Distribution
Example: distribute 2D blocks of the matrix to different processes.
Load-Balance for Block Distribution
Suppose the matrix is n × n and p processes are used.
(a) 1D block distribution: a process needs to access n²/p + n² amount of data.
(b) 2D block distribution: a process needs to access O(n²/√p) amount of data.
Cyclic and Block Cyclic Distributions
Doolittle’s method of LU factorization
By matrix-matrix multiplication (A = LU), the entries of L and U satisfy, e.g. for the last diagonal entry:
u_{n,n} = a_{n,n} − Σ_{t=1}^{n−1} l_{n,t} u_{t,n}
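A compact serial sketch of Doolittle factorization that implements these formulas (unit-diagonal L, no pivoting, L and U overwriting A in place; an illustration, not the slides' exact pseudocode):

    /* Doolittle LU factorization of an n x n matrix A (row-major, flat array).
       L has a unit diagonal; L and U overwrite A in place. No pivoting. */
    void doolittle_lu(int n, double *A) {
        for (int k = 0; k < n; k++) {
            /* Row k of U: u[k][j] = a[k][j] - sum_{t<k} l[k][t]*u[t][j] */
            for (int j = k; j < n; j++)
                for (int t = 0; t < k; t++)
                    A[k*n + j] -= A[k*n + t] * A[t*n + j];
            /* Column k of L: l[i][k] = (a[i][k] - sum_{t<k} l[i][t]*u[t][k]) / u[k][k] */
            for (int i = k + 1; i < n; i++) {
                for (int t = 0; t < k; t++)
                    A[i*n + k] -= A[i*n + t] * A[t*n + k];
                A[i*n + k] /= A[k*n + k];
            }
        }
    }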
Serial Column-Based LU
• A block distribution of LU factorization tasks leads to load imbalance: the active part of the matrix shrinks as the factorization proceeds, so processes that own only the leading rows/columns finish early and then idle.
Block-Cyclic Distribution
• Steps (see the sketch below)
  1. Partition the array into many more blocks than the number of available processes.
  2. Assign blocks to processes in a round-robin manner so that each process gets several non-adjacent blocks.
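A minimal sketch of the round-robin block assignment (1D block-cyclic over rows; the block size b and the function name are assumptions):

    /* 1D block-cyclic distribution: row i belongs to block i/b, and blocks are
       dealt to the p processes in round-robin order. */
    int owner_of_row(int i, int b, int p) {
        int block = i / b;          /* which block the row falls into   */
        return block % p;           /* round-robin assignment of blocks */
    }

With b = 1 this reduces to the cyclic distribution mentioned below.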
(a) The rows of the array are grouped into blocks each consisting of two rows,
resulting in eight blocks of rows. These blocks are distributed to four processes
in a wrap-around fashion.
(b) The matrix is blocked into 16 blocks each of size 4×4, and it is mapped onto a
2×2 grid of processes in a wraparound fashion.
• Cyclic distribution: the special case where the block size = 1.
Randomized Block Distribution
Graph Partitioning
Sparse matrix-vector multiplication
• Work: the nodes of the graph
• Interaction/communication: the edges of the graph
Mapping a Sparse Graph
Example. Sparse matrix-vector multiplication using 3
processes
• Row-wise distribution of the matrix
• Partitioning the task-interaction graph to reduce interaction overhead
Techniques to Minimize Interaction Overheads
• Minimize contention and hot spots
  – Contention occurs when multiple tasks try to access the same resource concurrently: e.g., multiple processes sending messages to the same process, or multiple simultaneous accesses to the same memory block.
  – Example: block matrix multiplication C = A × B.
    • Using C_{i,j} = Σ_{k=0}^{p−1} A_{i,k} B_{k,j} causes contention. For example, at k = 0 the tasks computing C_{0,0}, C_{0,1}, …, C_{0,p−1} all attempt to read A_{0,0} at the same time.
    • A contention-free formulation is
        C_{i,j} = Σ_{k=0}^{p−1} A_{i,(i+j+k) % p} B_{(i+j+k) % p, j}
      With this schedule, all tasks that work on the same row of C access a different block of A at any given step, so there is no contention (see the sketch below).
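The staggered access pattern can be seen with a tiny demo (illustrative code, not from the slides; p = 4 is an arbitrary choice):

    #include <stdio.h>

    /* Print the block of A read by each task in row 0 of C at every step k,
       under the contention-free schedule s = (i + j + k) % p. At any step,
       the p tasks of a row read p distinct blocks of A, so no two of them
       contend for the same block. */
    int main(void) {
        int p = 4, i = 0;                       /* p x p block partitioning */
        for (int k = 0; k < p; k++) {
            printf("step %d:", k);
            for (int j = 0; j < p; j++)         /* tasks C(0,0) .. C(0,p-1) */
                printf("  C(%d,%d) reads A(%d,%d)", i, j, i, (i + j + k) % p);
            printf("\n");
        }
        return 0;
    }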
• Overlap computations with interactions
  – Use non-blocking communication (see the sketch after this list)
• Replicate data or computations
– Some parallel algorithms require only read-only access to a shared data structure. If enough local memory is available, replicate a copy of the shared data on each process, so that the only interaction is the initial replication.
• Use collective interaction operations
• Overlap interactions with other interactions
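One common way to realize "overlap computations with interactions" is non-blocking message passing. A minimal MPI sketch (the neighbor ranks left/right, the tag, and the buffer sizes are assumptions for illustration):

    #include <mpi.h>

    /* Overlap computation with communication using non-blocking MPI calls. */
    void exchange_and_compute(double *send_buf, double *recv_buf, int n,
                              int left, int right, MPI_Comm comm) {
        MPI_Request reqs[2];
        /* Start the exchange with the two neighbors ... */
        MPI_Irecv(recv_buf, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Isend(send_buf, n, MPI_DOUBLE, right, 0, comm, &reqs[1]);

        /* ... do computation that does not depend on recv_buf here ... */

        /* Block only when the received data is actually needed. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

The key point is that useful work independent of recv_buf is done between starting the communication and waiting for its completion.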
Parallel Algorithm Models
• Data parallel
– Each task performs similar operations on different data
– Typically statically map tasks to processes
• Task graph
– Use task dependency graph to promote locality or reduce
interactions
• Master-slave
  – One or more master processes generate tasks
  – Tasks are allocated to slave processes
  – Allocation may be static or dynamic
• Pipeline/producer-consumer
– Pass a stream of data through a sequence of processes
– Each performs some operation on it
• Hybrid
– Apply multiple models hierarchically, or apply multiple models
in sequence to different phases
• Reference
  – A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel Computing, Chapter 3.