Lecture 4: Principles of Parallel Algorithm Design (Part 4)
Mapping Technique for Load Balancing
• Sources of overheads:
– Inter-process interaction
– Idling
• Goals to achieve:
– To reduce interaction time
– To reduce the total amount of time that processes spend
idle
– Remark: these two goals often conflict
• Classes of mapping:
– Static
– Dynamic
Schemes for Static Mapping
Mapping Based on Data Partitioning
1D Block Distribution
Example. Distribute contiguous groups of rows or columns of a matrix to
different processes
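A minimal sketch of a 1D block distribution by rows. The function name `block_range` and the rule that the first `n % p` processes take one extra row are assumptions made for illustration; the slides do not prescribe how leftover rows are assigned.

```python
def block_range(n, p, rank):
    """Return the half-open row range [start, stop) owned by process `rank`.

    The n rows are split into p contiguous blocks. Assumption: when p does
    not divide n, the first n % p processes each receive one extra row.
    """
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


if __name__ == "__main__":
    # 10 rows over 4 processes: block sizes 3, 3, 2, 2.
    for rank in range(4):
        print(rank, block_range(10, 4, rank))
```

Each process only needs `n`, `p`, and its own rank to find its rows, so no communication is required to set up the mapping.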
Multi-D Block Distribution
Example. Distribute blocks of a matrix to different processes
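A hedged sketch of the 2D case: an n×n matrix is cut into q×q equal blocks, one per process of a q×q process grid. The name `block_owner_2d` is hypothetical, and the sketch assumes q divides n.

```python
def block_owner_2d(i, j, n, q):
    """Grid coordinates (row, col) of the process owning element (i, j).

    Assumptions: the matrix is n x n, the process grid is q x q, and q
    divides n, so every block is (n // q) x (n // q).
    """
    b = n // q  # block edge length
    return i // b, j // b


if __name__ == "__main__":
    # 8 x 8 matrix on a 2 x 2 grid: element (5, 2) lies in the lower-left block.
    print(block_owner_2d(5, 2, 8, 2))
```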
Load-Balance for Block Distribution
Cyclic and Block Cyclic Distributions
Doolittle’s method of LU factorization
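By matrix-matrix multiplication (with $l_{ii} = 1$):
For $i = 1, \dots, n-1$
  $u_{ij} = a_{ij} - \sum_{t=1}^{i-1} l_{it} u_{tj}$ for $j = i, \dots, n$ ($i$th row of $U$)
  $l_{ji} = \dfrac{a_{ji} - \sum_{t=1}^{i-1} l_{jt} u_{ti}}{u_{ii}}$ for $j = i+1, \dots, n$ ($i$th column of $L$)
End
$u_{nn} = a_{nn} - \sum_{t=1}^{n-1} l_{nt} u_{tn}$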
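The recurrences above can be sketched in plain Python as follows. This is an illustration only: it uses 0-based indices, computes the diagonal entry of U together with the rest of row i, performs no pivoting, and assumes every pivot `U[i][i]` is nonzero. The function name `doolittle_lu` is hypothetical.

```python
def doolittle_lu(a):
    """Return (L, U) with a = L * U, L unit lower triangular, U upper triangular.

    Assumptions: `a` is a square list of lists, and no pivot U[i][i] is
    zero (no pivoting is performed).
    """
    n = len(a)
    L = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # i-th row of U, diagonal entry included.
        for j in range(i, n):
            U[i][j] = a[i][j] - sum(L[i][t] * U[t][j] for t in range(i))
        # i-th column of L, below the diagonal.
        for j in range(i + 1, n):
            L[j][i] = (a[j][i] - sum(L[j][t] * U[t][i] for t in range(i))) / U[i][i]
    return L, U


if __name__ == "__main__":
    L, U = doolittle_lu([[4.0, 3.0], [6.0, 3.0]])
    print(L)
    print(U)
```

In the last pass of the outer loop the column loop is empty, so that pass computes only the final diagonal entry of U.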
Serial Column-Based LU
• Block distribution of LU factorization tasks
leads to load imbalance.
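A rough way to see the imbalance, under a simplified cost model that is an assumption of this sketch and not taken from the slides: elimination step k updates every column to the right of column k, and each column update counts as one unit of work. The helper `column_updates` is hypothetical.

```python
def column_updates(n, p, owner):
    """Count column updates per process for a column-oriented LU sweep.

    Assumed cost model: at step k, each column j > k is updated once by
    the process owner(j). Returns a list of length p.
    """
    work = [0] * p
    for k in range(n - 1):
        for j in range(k + 1, n):
            work[owner(j)] += 1
    return work


if __name__ == "__main__":
    n, p = 8, 4
    block = column_updates(n, p, lambda j: j // (n // p))  # contiguous columns
    cyclic = column_updates(n, p, lambda j: j % p)         # round-robin columns
    print("block :", block)
    print("cyclic:", cyclic)
```

With contiguous blocks, the process holding the leftmost columns finishes almost immediately while the process holding the rightmost columns stays busy to the end. Spreading columns round-robin narrows that gap in this model.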
Block-Cyclic Distribution
• Steps
1. Partition an array into many more blocks than
the number of available processes
2. Assign blocks to processes in a round-robin
manner so that each process gets several
non-adjacent blocks.
(a) The rows of the array are grouped into blocks each consisting of two rows,
resulting in eight blocks of rows. These blocks are distributed to four processes
in a wraparound fashion.
(b) The matrix is blocked into 16 blocks each of size 4×4, and it is mapped onto a
2×2 grid of processes in a wraparound fashion.
• Cyclic distribution: the special case in which the block size is 1
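The two mappings described in (a) and (b) can be sketched with a pair of owner functions. The function names are hypothetical, and the 16×16 matrix size in case (b) is inferred from 16 blocks of size 4×4.

```python
def block_cyclic_owner(index, block_size, p):
    """Process that owns row (or column) `index` in a 1D block-cyclic layout."""
    return (index // block_size) % p


def block_cyclic_owner_2d(i, j, block_size, q):
    """Grid coordinates of the owner of element (i, j) on a q x q process grid."""
    return (i // block_size) % q, (j // block_size) % q


if __name__ == "__main__":
    # Case (a): 16 rows, blocks of two rows, four processes.
    print([block_cyclic_owner(r, 2, 4) for r in range(16)])
    # Case (b): 4 x 4 blocks mapped onto a 2 x 2 process grid.
    print(block_cyclic_owner_2d(5, 13, 4, 2))
    # Cyclic distribution: block size 1.
    print([block_cyclic_owner(r, 1, 4) for r in range(8)])
```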
Graph Partitioning
• Assign an equal number of nodes (or cells) to each process
• Minimize the number of edges cut by the partition, i.e. edges whose
endpoints are assigned to different processes
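The two criteria can be measured directly for a candidate partition. The sketch below only evaluates a given partition; it does not compute a good one, which in practice is left to a graph partitioner. The helper names and the small 2×4 grid graph are made up for illustration.

```python
def edge_cut(edges, part):
    """Number of edges whose endpoints are assigned to different processes."""
    return sum(1 for u, v in edges if part[u] != part[v])


def part_sizes(part, p):
    """Number of nodes assigned to each of the p processes."""
    sizes = [0] * p
    for owner in part.values():
        sizes[owner] += 1
    return sizes


# Hypothetical example: a 2 x 4 grid graph, nodes 0..3 on top, 4..7 below.
EDGES = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6), (6, 7),
         (0, 4), (1, 5), (2, 6), (3, 7)]
LEFT_RIGHT = {0: 0, 1: 0, 4: 0, 5: 0, 2: 1, 3: 1, 6: 1, 7: 1}
TOP_BOTTOM = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}

if __name__ == "__main__":
    for name, part in [("left/right", LEFT_RIGHT), ("top/bottom", TOP_BOTTOM)]:
        print(name, part_sizes(part, 2), edge_cut(EDGES, part))
```

Both partitions are perfectly balanced, but the left/right split cuts fewer edges and so implies less interaction between the two processes.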
Mappings Based on Task Partitioning
Mapping a Binary Tree Task-Dependency Graph
• Example: finding the minimum
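One possible mapping for a minimum-finding tree, sketched under assumptions that are not stated in the slides: there are p leaf tasks on p processes, p is a power of two, and each combine task is mapped to the lower-ranked process of the pair it merges, so one of its two inputs is already local. The function `tree_min` simulates the schedule sequentially.

```python
def tree_min(values):
    """Simulate a binary-tree minimum over per-process values.

    values[r] is the local minimum held by process r; len(values) must be
    a power of two. Returns (minimum, schedule), where schedule[level] is
    the list of (receiver, sender) pairs active at that tree level.
    """
    p = len(values)
    vals = list(values)
    schedule = []
    stride = 1
    while stride < p:
        level = []
        for r in range(0, p, 2 * stride):
            # Process r combines its value with the one from r + stride.
            vals[r] = min(vals[r], vals[r + stride])
            level.append((r, r + stride))
        schedule.append(level)
        stride *= 2
    return vals[0], schedule


if __name__ == "__main__":
    result, schedule = tree_min([7, 3, 9, 1, 8, 2, 6, 4])
    print(result)
    for level in schedule:
        print(level)
```

The number of busy processes halves at every level, which is the idling cost this kind of task-dependency graph imposes regardless of the mapping.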
• Partition the task-interaction graph to reduce
interaction overhead
Schemes for Dynamic Mapping
Centralized Dynamic Mapping
• Processes
– Master: manages a pool of available tasks
– Slave: depends on the master to obtain work
• Idea
– When a slave process has no work, it takes a portion of the available
work from the master
– When a new task is generated, it is added to the pool of tasks in
the master process
• Potential problem
– When many processes are used, the master process may become a
bottleneck
• Solution
– Chunk scheduling: every time a process runs out of work it gets
a group of tasks.
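A minimal sketch of chunk scheduling, with Python threads standing in for processes (an assumption of this example, as are the names `Master`, `get_chunk`, and `run`). Worker threads play the role of the slave processes. Tasks generated at run time, which the master would add to its pool, are left out to keep the sketch short.

```python
import threading


class Master:
    """Holds the pool of available tasks and hands them out in chunks."""

    def __init__(self, tasks, chunk_size):
        self._tasks = list(tasks)
        self._next = 0
        self._chunk_size = chunk_size
        self._lock = threading.Lock()

    def get_chunk(self):
        """Return up to chunk_size tasks, or an empty list when none remain."""
        with self._lock:
            chunk = self._tasks[self._next:self._next + self._chunk_size]
            self._next += len(chunk)
            return chunk


def run(tasks, num_workers, chunk_size, work):
    """Apply `work` to every task using workers that pull chunks on demand."""
    master = Master(tasks, chunk_size)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            chunk = master.get_chunk()   # one interaction per chunk
            if not chunk:
                return                   # pool is empty: worker stops
            local = [work(t) for t in chunk]
            with results_lock:
                results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(sorted(run(range(10), 3, 4, lambda x: x * x)))
```

A larger `chunk_size` means fewer requests to the master, at the price of coarser load balancing near the end of the run.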
Distributed Dynamic Mapping
Techniques to Minimize Interaction Overheads
• Minimize contention and hot spots
– Contention occurs when multiple tasks try to access the same resource
concurrently, for example multiple processes sending messages to the same
process, or multiple simultaneous accesses to the same memory block
• Using $C_{i,j} = \sum_{k=0}^{p-1} A_{i,k} B_{k,j}$ causes contention. For example, $C_{0,0}$,
$C_{0,1}$, …, $C_{0,p-1}$ all attempt to read $A_{0,0}$ at once.
• A contention-free ordering is:
$C_{i,j} = \sum_{k=0}^{p-1} A_{i,(i+j+k)\%p} \, B_{(i+j+k)\%p,\,j}$
At step $k$, the tasks $P_{i,*}$ that work on the same row of $C$ access block
$A_{i,(i+j+k)\%p}$, which is different for each task.
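The effect of the skewed index can be checked by counting, for every step k, how many tasks in the same row of C read the same block of A. In this sketch `p` is taken to be the number of blocks along each dimension, as in the sums above; the function names are hypothetical.

```python
def a_block_column(i, j, k, p, skewed):
    """Column index of the A block that task (i, j) reads at step k."""
    return (i + j + k) % p if skewed else k


def max_readers_per_block(p, skewed):
    """Largest number of tasks in one row of C that read the same A block
    at the same step."""
    worst = 0
    for k in range(p):
        for i in range(p):
            counts = {}
            for j in range(p):
                col = a_block_column(i, j, k, p, skewed)
                counts[col] = counts.get(col, 0) + 1
            worst = max(worst, max(counts.values()))
    return worst


if __name__ == "__main__":
    print("straightforward order:", max_readers_per_block(4, skewed=False))
    print("skewed order         :", max_readers_per_block(4, skewed=True))
```

The skewed order only permutes the terms of each sum, so every task still visits all p block pairs and the result C is unchanged.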
• Overlap computations with interactions
– Use non-blocking communication
• Replicate data or computations
– Replicate a copy of shared data on each process if
possible, so that there is only initial interaction during
replication.
• Use collective interaction operations
• Overlap interactions with other interactions
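Overlapping computation with interaction can be illustrated by prefetching: while block k is being processed, the transfer of block k+1 is already in flight. The sketch below uses a background thread as a stand-in for non-blocking communication; `fetch_block` is a hypothetical placeholder for a receive from another process, and its delay and data are invented. In an MPI program the same pattern would typically be built on non-blocking send and receive calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_block(k):
    """Hypothetical stand-in for receiving block k from another process."""
    time.sleep(0.01)  # simulated transfer latency
    return list(range(4 * k, 4 * k + 4))


def process_blocks(num_blocks):
    """Sum of squares over all blocks, prefetching the next block
    while the current one is being processed."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_block, 0)              # start first transfer
        for k in range(num_blocks):
            block = pending.result()                       # wait for block k
            if k + 1 < num_blocks:
                pending = pool.submit(fetch_block, k + 1)  # start next transfer
            total += sum(x * x for x in block)             # compute meanwhile
    return total


if __name__ == "__main__":
    print(process_blocks(3))
```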
Parallel Algorithm Models
• Data parallel
– Each task performs similar operations on different data
– Tasks are typically mapped to processes statically
• Task graph
– Use the task-dependency graph to promote locality or reduce
interactions
• Master-slave
– One or more master processes generate tasks
– Tasks are allocated to slave processes
– Allocation may be static or dynamic
• Pipeline/producer-consumer
– Pass a stream of data through a sequence of processes
– Each performs some operation on it
• Hybrid
– Apply multiple models hierarchically, or apply multiple models
in sequence to different phases
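As one illustration of the models listed above, the pipeline/producer-consumer model can be sketched with one thread per stage and a queue between neighbouring stages. Threads stand in for processes here, and the names `stage`, `run_pipeline`, and the `DONE` sentinel are assumptions of this example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Stream items through the stages in funcs, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[n], queues[n + 1]))
        for n, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:           # the producer feeds the first queue
        queues[0].put(item)
    queues[0].put(DONE)
    out = []
    while True:                  # the consumer drains the last queue
        item = queues[-1].get()
        if item is DONE:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2]))
```

Because each stage is a single thread reading a FIFO queue, items leave the pipeline in the order they entered, while different stages work on different items at the same time.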