5 - ParallelAlgorithms
5 - ParallelAlgorithms
1
Problem
• Assume we have a 100-inch sandwich to feed 8 people.
Y[0] = X[0];
• Implementation: for (i = 1; i < N; i++)
Y[i] = Y[i-1] + X[i];
• It is easy to see that prefix sums can be computed sequentially in O(N) time
Naïve Inclusive Parallel Prefix Sum
• Assign one thread to calculate each element that adds up all X elements
needed for that Y element.
• Implementation:
6
Naïve Inclusive Parallel Prefix Sum
Best-Case Parallel Time:
The time is still O(N) — same as the sequential version! You used lots of processors, but got no
speedup. Parallel programming is easy as long as you don’t care about performance.
Parallel Prefix Sum
• How to improve efficiency?
• Build a balanced binary tree on the input data and sweep it to and from the root
• Traverse up from leaves to root building partial sums at internal nodes in the tree
• Root holds sum of all leaves
• Traverse back down the tree building the scan from the partial sums
8
Parallel Prefix Sum: Upward Sweep (Bottom-up)
76
for (i = 0; i < n - 1; i++)
Y[i] = X[i]; 36 40
for (d = 1; d < n; d *= 2) {
#pragma omp parallel for 10 26 30 10
for (int i = 0; i < n; i += 2 * d) {
Y[i + 2 * d - 1] += Y[i + d - 1];
} 6 4 16 10 16 14 2 8
}
Step 6 4 16 10 16 14 2 8
d =1 6 10 16 26 16 30 2 10
d=2 6 10 16 36 16 30 2 40
9
d=4 6 10 16 36 16 30 2 76
Parallel Prefix Sum: Downward Sweep (Top-down)
0
Y[n-1] = 0;
for (d = n / 2; d >= 1; d /= 2) {
#pragma omp parallel for 0 36
for (int i = 0; i < n; i += 2 * d) {
int left = i + d - 1;
int right = i + 2 * d - 1; 0 10 36 66
temp = Y[left];
Y[left] Y[right];
Y[right] temp; 0 6 10 26 36 52 66 68
}
}
Step 6 10 16 36 16 30 2 0
d=4 6 10 16 0 16 30 2 36
d=2 6 0 16 10 16 36 2 66 10
d=1 0 6 10 26 36 52 66 68
Parallel Prefix Sum: Inclusive
for (i = 0; i < n; i++) { X 6 4 16 10 16 14 2 8
Y[i] += X[i];
}
Y 0 6 10 26 36 52 66 68
Updated Y 6 10 26 36 52 66 68 76
11
Parallel Prefix Sum: Inclusive
Observations:
12
Applications of Prefix Sum
• Parallel Prefix Sum has several applications that go way beyond computing the
• sum of array elements. A key primitive in many parallel algorithms to convert
serial computation into parallel computation
• Radix Sort
• String Comparison
• Lexical Analysis
• Stream Compaction
• Polynomial Evaluation
• Solving Recurrences
• Tree Operations
• Histograms
• Assigning space in farmers market
13
• Allocating memory to parallel threads
• Allocating memory buffer for communication channels
• Frequently used for parallel work assignment and resource allocation