HPC - Unit-2 Insem Notes
Syllabus:
Principles of Parallel Algorithm Design:
Preliminaries, Decomposition Techniques, Characteristics
of Tasks and Interactions, Mapping Techniques for Load
Balancing, Methods for Containing Interaction overheads
Course Objectives:
To analyze the performance and modeling of parallel
programs
Course Outcomes:
CO2: Design and develop an efficient parallel algorithm to solve a given problem
Parallel Algorithm
Chapter Overview: Algorithms and Concurrency
• Decomposition Techniques
– Recursive Decomposition
– Data Decomposition
– Exploratory Decomposition
– Speculative Decomposition
– Hybrid Decomposition
Example: Dense Matrix-Vector Multiplication
[Figure: computation of y = A·b partitioned into n tasks, Task 1 through Task n; Task i computes y[i] from row i of A and the vector b.]
Observations: While tasks share data (namely, the vector b), they do not have any control dependencies, i.e., no task needs to wait for the (partial) completion of any other. All tasks are of the same size in terms of number of operations. Is this the maximum number of tasks we could decompose this problem into?
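A minimal C++ sketch of this decomposition (an illustration, not the notes' own code): one thread per task, each computing a single entry of y. The tasks read b concurrently but write disjoint entries of y, so no synchronization is needed.

```cpp
#include <iostream>
#include <thread>
#include <vector>

// Task i: compute y[i] as the dot product of row i of A with b.
// Tasks share b read-only, so they have no control dependencies.
void row_task(const std::vector<std::vector<double>>& A,
              const std::vector<double>& b,
              std::vector<double>& y, std::size_t i) {
    double sum = 0.0;
    for (std::size_t j = 0; j < b.size(); ++j) sum += A[i][j] * b[j];
    y[i] = sum;
}

int main() {
    const std::size_t n = 4;
    std::vector<std::vector<double>> A(n, std::vector<double>(n, 1.0));
    std::vector<double> b(n, 2.0), y(n, 0.0);

    std::vector<std::thread> tasks;  // one task per output element
    for (std::size_t i = 0; i < n; ++i)
        tasks.emplace_back(row_task, std::cref(A), std::cref(b), std::ref(y), i);
    for (auto& t : tasks) t.join();

    for (double v : y) std::cout << v << ' ';  // prints "8 8 8 8"
    std::cout << '\n';
}
```

As to the closing question: no — each dot product can itself be split across its n multiplications, giving up to n² finer-grained tasks at the cost of a reduction step to combine partial sums.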
Example: Database Query Processing
Consider the execution of the query Color = "White" OR Color = "Green" on the following table:

ID#    Color
3476   White
7623   Green
9834   Green
6734   White
5342   Green
8354   Green

[Figure: the query decomposed into four tasks: tasks that select the White and the Green records run concurrently, and further tasks combine their partial results with OR.]
• The longest directed path in the task-dependency graph (the critical path) determines the shortest time in which the program can be executed in parallel.
[Figure: two possible task-dependency graphs for the query. In (a), Tasks 1–4 (10 time units each) feed Task 5 (6 units) and Task 6 (9 units), which both feed Task 7 (8 units). In (b), Tasks 1–4 (10 units each) feed a chain Task 5 (6 units) → Task 6 (11 units) → Task 7 (7 units).]
What are the critical path lengths for the two task dependency graphs?
If each task takes 10 time units, what is the shortest parallel execution time
for each decomposition? How many processors are needed in each case to
achieve this minimum parallel execution time? What is the maximum degree
of concurrency?
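Working these out from the task times in the figure: in (a) the critical path runs through Task 6, with length 10 + 9 + 8 = 27; in (b) it is 10 + 6 + 11 + 7 = 34. So the shortest parallel execution times are 27 and 34 units. Since Tasks 1–4 are all ready at the start, (a) needs 4 processors to finish in 27 units, while in (b) three processors suffice (Task 4 can be deferred without delaying the chain). The maximum degree of concurrency is 4 in both decompositions.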
Limits on Parallel Performance
Amdahl's Law: if a fraction p of a program can be parallelized and the remaining 1 − p is inherently serial, the speedup on N processors is bounded by
S(N) = 1 / ((1 − p) + p/N)
https://en.wikipedia.org/wiki/Amdahl's_law
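For example, with p = 0.9 on N = 8 processors, S = 1 / (0.1 + 0.9/8) ≈ 4.7; and even as N → ∞, the speedup can never exceed 1/(1 − p) = 10.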
Task Interaction Graphs
Interaction Graphs, Granularity, & Communication
• Finer task granularity increases communication overhead
• Example: sparse matrix-vector product interaction graph
• Assumptions:
— each node takes unit time to process
— each interaction (edge) causes an overhead of a unit time
• If node 0 is a task: communication = 3; computation = 4
• If nodes 0, 4, and 5 are a task: communication = 5; computation = 15
— a coarser-grain decomposition gives a smaller communication-to-computation ratio (5/15 versus 3/4)
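A small C++ sketch of the arithmetic above under the same unit-cost assumptions: computation is the total work of the nodes in a task, and communication is one unit per interaction edge crossing the task boundary. The node weights and edge list below are illustrative, not the slide's exact graph.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Cost model from the slide: computation = total work of the task's nodes,
// communication = one unit per interaction edge crossing the boundary.
void report(const std::set<int>& task,
            const std::map<int, int>& work,                 // per-node work
            const std::vector<std::pair<int, int>>& edges)  // interactions
{
    int comp = 0, comm = 0;
    for (int n : task) comp += work.at(n);
    for (auto [u, v] : edges)
        if (task.count(u) != task.count(v)) ++comm;  // boundary edge
    std::cout << "computation=" << comp << " communication=" << comm << '\n';
}

int main() {
    // Illustrative weights and interaction edges.
    std::map<int, int> work = {{0, 4}, {4, 5}, {5, 6}, {1, 3}, {6, 2}};
    std::vector<std::pair<int, int>> edges = {
        {0, 1}, {0, 4}, {0, 5}, {4, 5}, {5, 6}, {1, 6}};

    report({0}, work, edges);        // fine-grained task: ratio 3/4
    report({0, 4, 5}, work, edges);  // coarser task: much better ratio
}
```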
Processes and Mapping
Note: These criteria often conflict with each other. For example,
a decomposition into one task (or no decomposition at all)
minimizes interaction but does not result in a speedup at all! Can
you think of other such conflicting cases?
Processes and Mapping: Example
[Figure: the two task-dependency graphs from before (Tasks 1–4 at 10 units each) with tasks mapped onto four processes. In (a), Tasks 1–4 go to P0–P3, Task 5 (6 units) to P0, Task 6 (9 units) to P2, and Task 7 (8 units) to P0. In (b), Tasks 1–4 go to P0–P3, and Tasks 5 (6 units), 6 (11 units), and 7 (7 units) all go to P0.]
Decomposition Techniques
• recursive decomposition
• data decomposition
• exploratory decomposition
• speculative decomposition
Recursive Decomposition
[Figure: recursive decomposition of quicksort on the list 1 3 4 2 5 12 11 10 6 8 7 9: each level partitions its sublist around a pivot, and the resulting sublists form independent subtasks that are sorted concurrently.]
In this example, once the list has been partitioned around the pivot,
each sublist can be processed concurrently (i.e., each sublist represents an
independent subtask). This can be repeated recursively.
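A C++ sketch of this recursive decomposition, using std::async to run the two sublist subtasks concurrently after each partition (the cutoff and launch policy are illustrative choices, not part of the notes):

```cpp
#include <future>
#include <iostream>
#include <utility>
#include <vector>

// Recursive decomposition: partition around a pivot, then sort the two
// independent sublists as concurrent subtasks.
void quicksort(std::vector<int>& a, int lo, int hi) {
    if (hi - lo < 2) return;
    int pivot = a[hi - 1], mid = lo;           // Lomuto partition
    for (int i = lo; i < hi - 1; ++i)
        if (a[i] < pivot) std::swap(a[i], a[mid++]);
    std::swap(a[mid], a[hi - 1]);

    if (hi - lo > 1000) {  // illustrative cutoff: spawn only for large sublists
        auto left = std::async(std::launch::async, quicksort,
                               std::ref(a), lo, mid);
        quicksort(a, mid + 1, hi);
        left.get();
    } else {
        quicksort(a, lo, mid);
        quicksort(a, mid + 1, hi);
    }
}

int main() {
    std::vector<int> a = {3, 7, 2, 9, 11, 4, 5, 8, 1, 6};
    quicksort(a, 0, static_cast<int>(a.size()));
    for (int x : a) std::cout << x << ' ';
    std::cout << '\n';
}
```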
Recursive Decomposition: Example
[Figure: recursive decomposition for finding the minimum of a list: leaf tasks compute pairwise minima such as min(4,1) and min(8,2), whose results are combined by min(1,2) at the root.]
Output Data Decomposition: Example
Consider the multiplication of two n × n matrices A and B to yield C, with each matrix partitioned into 2 × 2 blocks:

[ A1,1  A1,2 ]   [ B1,1  B1,2 ]     [ C1,1  C1,2 ]
[ A2,1  A2,2 ] . [ B2,1  B2,2 ]  →  [ C2,1  C2,2 ]

Partitioning the output matrix C induces four tasks:
Task 1: C1,1 = A1,1 B1,1 + A1,2 B2,1
Task 2: C1,2 = A1,1 B1,2 + A1,2 B2,2
Task 3: C2,1 = A2,1 B1,1 + A2,2 B2,1
Task 4: C2,2 = A2,1 B1,2 + A2,2 B2,2
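A C++ sketch of these four tasks (illustrative, with n fixed at 4 and one thread per output block). The tasks write disjoint quadrants of C, so they need no synchronization:

```cpp
#include <iostream>
#include <thread>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Task (bi, bj): compute the (bi, bj) quadrant of C = A * B.
void block_task(const Mat& A, const Mat& B, Mat& C, int bi, int bj) {
    int h = static_cast<int>(C.size()) / 2;
    for (int i = bi * h; i < (bi + 1) * h; ++i)
        for (int j = bj * h; j < (bj + 1) * h; ++j)
            for (std::size_t k = 0; k < A.size(); ++k)
                C[i][j] += A[i][k] * B[k][j];
}

int main() {
    const int n = 4;  // n must be even for this 2x2 block decomposition
    Mat A(n, std::vector<double>(n, 1.0)), B = A,
        C(n, std::vector<double>(n, 0.0));

    std::vector<std::thread> tasks;
    for (int bi = 0; bi < 2; ++bi)      // Tasks 1..4: one per output block
        for (int bj = 0; bj < 2; ++bj)
            tasks.emplace_back(block_task, std::cref(A), std::cref(B),
                               std::ref(C), bi, bj);
    for (auto& t : tasks) t.join();

    std::cout << "C[0][0] = " << C[0][0] << '\n';  // 4 for all-ones A, B
}
```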
Output Data Decomposition: Example
Consider counting the frequency of itemsets in a transaction database:

Database Transactions    Itemset → Frequency
A, B, C, E, G, H         A, B, C → 1
B, D, E, F, K, L         D, E → 3
A, B, F, H, L            C, F, G → 0
D, E, F, H               A, E → 2
F, G, H, K               C, D → 1
A, E, F, K, L            D, K → 2
B, C, D, G, H, L         B, C, F → 0
G, H, L                  C, D, K → 0
D, E, F, K, L
F, G, H, L

Partitioning the output (the itemset counts) gives two tasks, each scanning the entire transaction database: task 1 counts the itemsets A,B,C; D,E; C,F,G; and A,E, while task 2 counts C,D; D,K; B,C,F; and C,D,K.
Input Data Partitioning: Example
In the database counting example, the input (i.e., the transaction set) can be
partitioned. This induces a task decomposition in which each task generates
partial counts for all itemsets. These are combined subsequently for aggregate
counts.
task 1 (transactions 1–5)     task 2 (transactions 6–10)
Itemset → Partial count       Itemset → Partial count
A, B, C → 1                   A, B, C → 0
D, E → 2                      D, E → 1
C, F, G → 0                   C, F, G → 0
A, E → 1                      A, E → 1
C, D → 0                      C, D → 1
D, K → 1                      D, K → 1
B, C, F → 0                   B, C, F → 0
C, D, K → 0                   C, D, K → 0
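A C++ sketch of this input partitioning, assuming itemsets are counted with a simple subset test: each task scans only its half of the transactions and produces partial counts that are summed at the end. The data is abbreviated from the slide.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using Itemset = std::vector<std::string>;

// True if every item of `set` occurs in transaction `t`.
bool contains(const Itemset& t, const Itemset& set) {
    return std::all_of(set.begin(), set.end(), [&](const std::string& x) {
        return std::find(t.begin(), t.end(), x) != t.end();
    });
}

// One task: partial counts of all itemsets over a slice of the transactions.
std::vector<int> count_task(const std::vector<Itemset>& db, std::size_t lo,
                            std::size_t hi, const std::vector<Itemset>& sets) {
    std::vector<int> counts(sets.size(), 0);
    for (std::size_t i = lo; i < hi; ++i)
        for (std::size_t s = 0; s < sets.size(); ++s)
            if (contains(db[i], sets[s])) ++counts[s];
    return counts;
}

int main() {
    std::vector<Itemset> db = {{"A","B","C","E","G","H"},
                               {"B","D","E","F","K","L"},
                               {"A","B","F","H","L"}, {"D","E","F","H"}};
    std::vector<Itemset> sets = {{"A","B","C"}, {"D","E"}};

    // Tasks 1 and 2 each scan half the database (independent; could be
    // two threads, run serially here for brevity).
    auto c1 = count_task(db, 0, db.size() / 2, sets);
    auto c2 = count_task(db, db.size() / 2, db.size(), sets);

    for (std::size_t s = 0; s < sets.size(); ++s)  // aggregate partial counts
        std::cout << "itemset " << s << ": " << c1[s] + c2[s] << '\n';
}
```

Splitting the itemsets as well, as on the next slide, turns each of these tasks into two, for four tasks in total.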
Partitioning Input and Output Data
Often input and output data decomposition can be combined for a higher
degree of concurrency. For the itemset counting example, the transaction set
(input) and itemset counts (output) can both be decomposed as follows:
[Figure: the four tasks obtained by partitioning both inputs and outputs. Tasks 1 and 2 scan transactions 1–5, tasks 3 and 4 scan transactions 6–10; tasks 1 and 3 count the itemsets A,B,C; D,E; C,F,G; A,E, while tasks 2 and 4 count C,D; D,K; B,C,F; C,D,K. Partial counts from tasks counting the same itemsets are added afterwards.]
Intermediate Data Partitioning
Let us revisit the example of dense matrix multiplication. We first show how we
can visualize this computation in terms of intermediate matrices D, where
Dk,i,j = Ai,k Bk,j and Ci,j = D1,i,j + D2,i,j.
[Figure: the first block column of A times the first block row of B produces the blocks D1,i,j; the second block column of A times the second block row of B produces the blocks D2,i,j; adding the two yields C.]
Intermediate Data Partitioning: Example
Stage I: eight independent multiplication tasks compute the intermediate blocks Dk,i,j = Ai,k Bk,j:

[ A1,1  A1,2 ]   [ B1,1  B1,2 ]     [ D1,1,1  D1,1,2 ]   [ D2,1,1  D2,1,2 ]
[ A2,1  A2,2 ] . [ B2,1  B2,2 ]  →  [ D1,2,1  D1,2,2 ] , [ D2,2,1  D2,2,2 ]

Stage II: four independent addition tasks combine them:

[ D1,1,1  D1,1,2 ]   [ D2,1,1  D2,1,2 ]     [ C1,1  C1,2 ]
[ D1,2,1  D1,2,2 ] + [ D2,2,1  D2,2,2 ]  →  [ C2,1  C2,2 ]
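A C++ sketch of the two stages, assuming 2×2 block matrices with one scalar per "block" for brevity. The eight Stage I multiplications are independent of each other, as are the four Stage II additions:

```cpp
#include <iostream>

int main() {
    // 2x2 "block" matrices, one scalar per block for brevity.
    double A[2][2] = {{1, 2}, {3, 4}};
    double B[2][2] = {{5, 6}, {7, 8}};
    double D[2][2][2];  // D[k][i][j] = A[i][k] * B[k][j]
    double C[2][2];

    // Stage I: 8 independent multiplication tasks.
    for (int k = 0; k < 2; ++k)
        for (int i = 0; i < 2; ++i)
            for (int j = 0; j < 2; ++j)
                D[k][i][j] = A[i][k] * B[k][j];

    // Stage II: 4 independent addition tasks, C[i][j] = D[0][i][j] + D[1][i][j].
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            C[i][j] = D[0][i][j] + D[1][i][j];

    std::cout << C[0][0] << ' ' << C[0][1] << '\n'
              << C[1][0] << ' ' << C[1][1] << '\n';  // 19 22 / 43 50
}
```

Note the higher concurrency this buys: Stage I exposes 8 concurrent tasks, versus 4 for the plain output decomposition.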
The decomposition induced by partitioning D thus has 12 tasks: tasks 1–8 perform the Stage I multiplications and tasks 9–12 the Stage II additions.
The Owner Computes Rule
Decompositions based on partitioning data often follow the owner-computes rule: the process assigned a piece of data performs all computations associated with it. Under an input-based decomposition, each task performs all computations that use its input data; under an output-based decomposition, each task computes all the outputs it owns.
Exploratory Decomposition: Example
A simple application of exploratory decomposition is solving the 15-puzzle.
[Figure: a sequence of moves transforming an initial 15-puzzle configuration into the goal configuration with tiles 1–15 in order.]
[Figure: the state space generated from an initial configuration: the first move of the blank tile yields up to four successor configurations, and the search tree rooted at each successor is explored concurrently as task 1 through task 4 until a solution is found.]
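A minimal C++ sketch of exploratory decomposition, assuming an abstract search tree rather than real 15-puzzle moves: four tasks explore disjoint subtrees, and an atomic flag stops the others once any task finds a solution.

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<bool> found{false};

// Task t: explore one subtree of the state space. "States" here are just
// integers and the goal test is illustrative, standing in for puzzle moves.
void explore(int subtree, int num_states) {
    for (int state = 0; state < num_states && !found.load(); ++state) {
        if (subtree == 2 && state == 40) {  // pretend task 3 finds the goal
            found.store(true);
            std::cout << "task " << subtree + 1 << " found a solution\n";
        }
    }
}

int main() {
    std::vector<std::thread> tasks;
    for (int t = 0; t < 4; ++t)  // one task per successor configuration
        tasks.emplace_back(explore, t, 1000000);
    for (auto& th : tasks) th.join();
}
```

The total work done depends on where the solution happens to sit, which is exactly why exploratory decomposition can perform more or less work than serial search, as the next slide shows.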
Exploratory Decomposition: Anomalous Computations
[Figure: two search trees, each with four branches of m nodes; the position of the solution node differs between (a) and (b).]
(a) Total serial work: 2m + 1; total parallel work: 1.
(b) Total serial work: m; total parallel work: 4m.
Because the work performed depends on where the solution lies, the parallel formulation may do far less work than the serial search (a) or far more (b), leading to super-linear or sub-linear speedup anomalies.
Speculative Decomposition
In some applications, dependencies between tasks are not known a priori. Speculative decomposition starts tasks whose results may or may not be needed, and discards the work of those that turn out not to be, much like speculative branch execution in processors.
[Figure: a discrete event simulation network: the system inputs feed components A and B, whose outputs propagate through components C–H to component I, which produces the system output. While one stage is being simulated, downstream components can be simulated speculatively on its likely outputs.]
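A C++ sketch of speculative decomposition with illustrative stage functions: while the branch condition is being computed, both possible successor stages run speculatively, and the result of the branch not taken is discarded.

```cpp
#include <future>
#include <iostream>

// Illustrative stages of a pipeline / simulation.
bool compute_condition(int input) { return input % 2 == 0; }
int stage_if_true(int input) { return input * 10; }
int stage_if_false(int input) { return input + 10; }

int main() {
    int input = 42;

    // Speculation: start both possible successor stages before the
    // condition is known.
    auto t = std::async(std::launch::async, stage_if_true, input);
    auto f = std::async(std::launch::async, stage_if_false, input);

    bool cond = compute_condition(input);   // meanwhile, resolve the branch
    int result = cond ? t.get() : f.get();  // keep one result ...
    (cond ? f : t).get();                   // ... and discard the other

    std::cout << "result = " << result << '\n';  // 420 for even input
}
```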
Hybrid Decompositions
[Figure: finding the minimum of the 16 numbers 3 7 2 9 11 4 5 8 7 10 6 13 1 19 3 9. A data decomposition first splits the list into four blocks of four numbers; the four block minima (2, 4, 6, 1) are then combined by a recursive decomposition: min(2,4) = 2, min(6,1) = 1, and finally min(2,1) = 1.]
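A C++ sketch of this hybrid scheme on the slide's 16 numbers: data decomposition yields four block-minimum tasks, and a recursive (tree) step combines the four partial minima.

```cpp
#include <algorithm>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data = {3, 7, 2, 9, 11, 4, 5, 8,
                             7, 10, 6, 13, 1, 19, 3, 9};
    std::vector<int> partial(4);

    // Data decomposition: four tasks, each finds the minimum of one block.
    std::vector<std::thread> tasks;
    for (int t = 0; t < 4; ++t)
        tasks.emplace_back([&, t] {
            auto first = data.begin() + 4 * t;
            partial[t] = *std::min_element(first, first + 4);
        });
    for (auto& th : tasks) th.join();  // partial = {2, 4, 6, 1}

    // Recursive decomposition: combine pairwise, as in the tree on the slide.
    int left = std::min(partial[0], partial[1]);   // min(2, 4) = 2
    int right = std::min(partial[2], partial[3]);  // min(6, 1) = 1
    std::cout << "minimum = " << std::min(left, right) << '\n';  // 1
}
```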
Characteristics of Tasks
• Task generation: static (all tasks are known before the algorithm starts executing) or dynamic (tasks are generated as the computation proceeds).
• Task sizes: uniform (all tasks are the same size) or non-uniform.
• Size of data associated with tasks.
Characteristics of Task Interactions: Example
[Figure: (a) a 12×12 sparse matrix A multiplied by a vector b, with one task per row (Task 0 through Task 11); (b) the resulting task-interaction graph: task i interacts with task j whenever row i of A has a nonzero in column j, since task i then needs the element of b owned by task j.]
Characteristics of Task Interactions
(a) cyclic mapping: P1 ← tasks 1, 5, 9; P2 ← tasks 2, 6, 10; P3 ← tasks 3, 7, 11; P4 ← tasks 4, 8, 12
(b) block mapping: P1 ← tasks 1, 2, 3; P2 ← tasks 4, 5, 6; P3 ← tasks 7, 8, 9; P4 ← tasks 10, 11, 12
Mapping Techniques for Minimum Idling
• Hybrid mappings.
Mappings Based on Data Partitioning
[Figure: 1-D block distributions of an array onto eight processes: row-wise blocks P0–P7 stacked top to bottom, and column-wise blocks P0–P7 arranged left to right.]
Block Array Distribution Schemes
[Figure: 2-D block distributions of an array onto 16 processes: (a) a 4×4 process grid with rows P0–P3, P4–P7, P8–P11, P12–P15; (b) a 2×8 process grid with rows P0–P7 and P8–P15.]
Block Array Distribution Schemes: Examples
[Figure: matrix multiplication A × B = C with the output matrix block-distributed among 12 processes P0–P11 in (a) 1-D and (b) 2-D fashion; the block of C a process owns determines which parts of A and B it must access.]
Cyclic and Block Cyclic Distributions
Consider the block LU factorization of a 3×3 block matrix. The computation consists of 14 tasks:
1: A1,1 → L1,1 U1,1
2: L2,1 = A2,1 U1,1^-1
3: L3,1 = A3,1 U1,1^-1
4: U1,2 = L1,1^-1 A1,2
5: U1,3 = L1,1^-1 A1,3
6: A2,2 = A2,2 − L2,1 U1,2
7: A3,2 = A3,2 − L3,1 U1,2
8: A2,3 = A2,3 − L2,1 U1,3
9: A3,3 = A3,3 − L3,1 U1,3
10: A2,2 → L2,2 U2,2
11: L3,2 = A3,2 U2,2^-1
12: U2,3 = L2,2^-1 A2,3
13: A3,3 = A3,3 − L3,2 U2,3
14: A3,3 → L3,3 U3,3
Block Cyclic Distributions
[Figure: LU factorization in progress: once column k has been factored, the corresponding part of the matrix becomes inactive, and only the trailing submatrix (the active part) is still updated. The figure also shows the 14 tasks T1–T14 mapped onto processes P0–P8 arranged as a 3×3 grid; with a plain block mapping, processes that own only inactive blocks sit idle for the rest of the computation.]
Block-Cyclic Distribution
[Figure: block-cyclic distributions onto four processes: (a) 1-D, with row blocks dealt out round-robin as P0, P1, P2, P3, P0, P1, …; (b) 2-D, with 2×2 blocks dealt out round-robin over a 2×2 process grid, so the ownership pattern P0 P1 / P2 P3 repeats across the matrix.]
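A small C++ sketch of the indexing behind a 2-D block-cyclic distribution: given a block size and a process grid (both illustrative), it computes which process owns matrix element (i, j).

```cpp
#include <iostream>

// Owner of element (i, j) under a 2-D block-cyclic distribution:
// b x b blocks are dealt out round-robin over a Pr x Pc process grid.
int owner(int i, int j, int b, int Pr, int Pc) {
    int block_row = (i / b) % Pr;  // cyclic over block rows
    int block_col = (j / b) % Pc;  // cyclic over block columns
    return block_row * Pc + block_col;
}

int main() {
    const int b = 2, Pr = 2, Pc = 2;  // 2x2 blocks on a 2x2 process grid
    for (int i = 0; i < 8; ++i) {     // prints the repeating P0 P1 / P2 P3
        for (int j = 0; j < 8; ++j)   // pattern of figure (b)
            std::cout << 'P' << owner(i, j, b, Pr, Pc) << ' ';
        std::cout << '\n';
    }
}
```

Setting b to the whole band width recovers a plain block distribution, while b = 1 gives a purely cyclic one.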
Graph Partitioning Based Data Decomposition
For algorithms with irregular, data-dependent interaction patterns, the underlying graph can be partitioned so that each process receives roughly equal work while the number of edges crossing partition boundaries, and hence the interaction volume, is minimized.
Random Partitioning
[Figure: partitionings of a one-dimensional array of eight elements (indices 0–7) among processes.]
Task Partitioning: Mapping a Sparse Graph
[Figure: a sparse graph with nodes 0–11 mapped onto three processes in two ways. With a random partitioning, the sets of non-local nodes each process must communicate are C0 = (4,5,6,7,8), C1 = (0,1,2,3,8,9,10,11), C2 = (0,4,5,6), a total of 17 items. With a graph-partitioning-based mapping, they shrink to C0 = (1,2,6,9), C1 = (0,5,6), C2 = (1,2,4,5,7,8), a total of 13 items.]
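A C++ sketch quantifying why graph partitioning helps: given a partition vector, it counts the edge cut (the number of interactions between processes), which is what the Ci sets above measure. The small graph and both partitions are illustrative.

```cpp
#include <iostream>
#include <utility>
#include <vector>

// Edge cut of a partition: number of edges whose endpoints are owned by
// different processes -- a proxy for total interprocess interaction.
int edge_cut(const std::vector<std::pair<int, int>>& edges,
             const std::vector<int>& part) {
    int cut = 0;
    for (auto [u, v] : edges)
        if (part[u] != part[v]) ++cut;
    return cut;
}

int main() {
    // Illustrative 8-node mesh-like graph (a 2x4 grid).
    std::vector<std::pair<int, int>> edges = {
        {0, 1}, {1, 2}, {2, 3}, {4, 5}, {5, 6}, {6, 7},
        {0, 4}, {1, 5}, {2, 6}, {3, 7}};

    std::vector<int> random_part = {0, 1, 0, 1, 1, 0, 1, 0};  // scattered
    std::vector<int> graph_part  = {0, 0, 1, 1, 0, 0, 1, 1};  // contiguous

    std::cout << "random partitioning cut: "
              << edge_cut(edges, random_part) << '\n';   // 10
    std::cout << "graph partitioning cut:  "
              << edge_cut(edges, graph_part) << '\n';    // 2
}
```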
Hierarchical Mappings
• A single mapping technique is often inadequate; for example, the task mapping of a binary task tree cannot keep many processes busy near the root. For this reason, task mapping can be used at the top level and data partitioning within each level.
Hierarchical Mapping: Example
[Figure: hierarchical mapping of a binary task tree onto eight processes: the root task is data-partitioned across P0–P7, its two children across P0–P3 and P4–P7, and the tasks at the next level across pairs of processes.]
Schemes for Dynamic Mapping
• When a process runs out of work, it requests more work from the master.
• There are four critical questions: how are sending and receiving processes paired together, who initiates work transfer, how much work is transferred, and when is a transfer triggered? A minimal centralized sketch follows below.
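A C++ sketch of the simplest centralized answers to these questions: the requesting worker initiates the transfer, one fixed-size chunk at a time, whenever it runs out of work. A shared chunk counter stands in for the master process; all names and sizes are illustrative.

```cpp
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Centralized dynamic mapping: a shared counter plays the master. Each
// worker that runs out of work grabs the next chunk (self-scheduling).
std::mutex mtx;
int next_chunk = 0;
const int num_chunks = 16, chunk_size = 1000;

long long partial[4] = {0, 0, 0, 0};

void worker(int id) {
    for (;;) {
        int chunk;
        {
            std::lock_guard<std::mutex> lock(mtx);  // ask the "master"
            if (next_chunk == num_chunks) return;   // no work left
            chunk = next_chunk++;                   // transfer one chunk
        }
        for (int i = 0; i < chunk_size; ++i)        // process the chunk
            partial[id] += chunk * chunk_size + i;
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int id = 0; id < 4; ++id) workers.emplace_back(worker, id);
    for (auto& w : workers) w.join();

    long long total = partial[0] + partial[1] + partial[2] + partial[3];
    std::cout << "sum = " << total << '\n';  // sum of 0 .. 15999
}
```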
Complexities:
5.4 Complexities
5.4.1 Sequential Computation Complexity
5.4.2 Parallel Computation Complexity
5.5 Anomalies in Parallel Algorithms