Parallel Algorithms: Theory and Practice
Last class
• The O(W/P + D) bound
• Work-efficiency is important
• Make the work (asymptotically) no more than the best (optimal) sequential
algorithm
• W is usually at least the problem size n, since we need to load all input
• D is usually poly-logarithmic – as long as it is polylog(n), it is much smaller than W
• D is usually small compared to W/P
• The total cost is dominated by the W/P term
• Polylog depth indicates good scalability
• Larger depth means that when P is getting larger, D may dominate the cost
• But whether D is O(log n) or O(log² n) usually does not make a huge difference in practice – both
are much smaller than W/P
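• A concrete example (hypothetical numbers, my own): with W = n = 10⁹, D = log₂ n ≈ 30, and P = 100 processors, the bound gives W/P + D ≈ 10⁷ + 30 – the W/P term completely dominates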
Last class
• Two reduce algorithms
• Looking at the dependence graph bottom-up or top-down
• O(n) work and O(log n) depth
• Your feedback
• Write down what you think is the hardest/most unclear thing in the
last class
• Any other thoughts are also welcome
• It’s anonymous
Prefix Sum (Scan)
Prefix sum
• The prefix sum (scan) of an array A is the array B with B[i] = A[0] + A[1] + … + A[i]
A = 1 2 3 4 5 6 7 8
B = 1 3 6 10 15 21 28 36
[Figure: the reduce tree over A – leaves 1 2 3 4 5 6 7 8, internal sums 3 7 11 15, then 10 26, then 36 – and how prefixes are assembled from tree-node sums, e.g., 3+3+4 = 10, 10+5 = 15, 10+5+6 = 21, 10+11+7 = 28.]
Two algorithms to implement a reduce
// Divide-and-conquer:
reduce(A, n) {
  if (n == 1) return A[0];
  In parallel:
    L = reduce(A, n/2);
    R = reduce(A + n/2, n-n/2);
  return L+R;
}

// Contraction (bottom-up):
reduce(A, n) {
  if (n == 1) return A[0];
  if (n is odd) { n = n+1; A[n-1] = 0; }   // pad with the identity
  parallel_for i = 0 to n/2 - 1
    B[i] = A[2i] + A[2i+1];
  return reduce(B, n/2); }
Divide-and-conquer (the first algorithm):
deal with the left and right halves recursively
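For concreteness, here is a minimal runnable C++ sketch of the second (contraction-based) reduce; the name reduce_contract and the use of std::vector are my own choices, not from the slides, and the loop is written sequentially where the pseudocode forks in parallel:

#include <vector>

// Contraction-based reduce: repeatedly pair up adjacent elements
// until one value remains. Assumes A is nonempty.
long reduce_contract(std::vector<long> A) {
  if (A.size() == 1) return A[0];
  if (A.size() % 2 == 1) A.push_back(0);   // pad with the identity ("n = n+1")
  std::vector<long> B(A.size() / 2);
  for (size_t i = 0; i < B.size(); i++)    // the parallel_for: iterations are independent
    B[i] = A[2*i] + A[2*i + 1];
  return reduce_contract(std::move(B));
}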
Prefix sum: divide-and-conquer
• Split the original problem into several parts with smaller sizes (e.g., evenly in two)
• Solve the same problem on each part in parallel
• Combine the results
[Figure: the reduce tree over 1 2 3 4 5 6 7 8, with internal sums 3 7 11 15, then 10 26, then 36.]
Function scan_r(A, B, s, t, offset) {
  If s = t-1 then {
    B[s] = offset + A[s]; return; }
  mid = (s+t)/2;
  In Parallel:
    scan_r(A, B, s, mid, offset);
    scan_r(A, B, mid, t, offset+leftSum); }   // leftSum: sum of A[s..mid-1], read from the saved reduce tree

Function scan(A, B) {
  Call reduce(A, n) and save the reduce tree;
  scan_r(A, B, 0, n, 0); }

[Figure: on 1 2 3 4 5 6 7 8, the left prefix sum is 1 3 6 10 and the right prefix sum is 5 11 18 26; adding 10 to each right entry gives 15 21 28 36.]
What is 10? It is the sum of the left half (1+2+3+4), available from the saved reduce tree.
Prefix sum: O(n) work, O(log n) depth

Function scan_r(A, B, s, t, offset) {
  If s = t then {
    B[s] = offset + A[s]; return; }
  mid = (s+t)/2;
  In Parallel:
    scan_r(A, B, s, mid, offset);
    scan_r(A, B, mid+1, t, offset+leftSum); }

[Figure: the reduce tree again; on 1 2 3 4 5 6 7 8, the left recursive call uses offset 0 and the right call uses offset 10, the left-half sum.]
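A runnable C++ sketch of the whole two-pass scan, assuming the reduce tree is stored segment-tree style in an array T; the names build, scan_down, and T are my own choices, and the recursive calls marked as parallel are written sequentially here:

#include <vector>

// Pass 1 (reduce): compute and save every internal sum. T[node] = sum of A[s, t).
long build(const std::vector<long>& A, std::vector<long>& T,
           int node, int s, int t) {
  if (t - s == 1) return T[node] = A[s];
  int mid = (s + t) / 2;
  long L = build(A, T, 2*node, s, mid);       // these two calls are independent
  long R = build(A, T, 2*node + 1, mid, t);   // and can be forked in parallel
  return T[node] = L + R;
}

// Pass 2: pass offsets down; leftSum is read from the saved tree in O(1).
void scan_down(const std::vector<long>& A, std::vector<long>& B,
               const std::vector<long>& T, int node, int s, int t, long offset) {
  if (t - s == 1) { B[s] = offset + A[s]; return; }
  int mid = (s + t) / 2;
  long leftSum = T[2*node];                       // the "what is 10?" value
  scan_down(A, B, T, 2*node, s, mid, offset);     // also parallelizable
  scan_down(A, B, T, 2*node + 1, mid, t, offset + leftSum);
}

void scan(const std::vector<long>& A, std::vector<long>& B) {
  int n = (int)A.size();
  std::vector<long> T(4 * n);
  B.assign(n, 0);
  build(A, T, 1, 0, n);
  scan_down(A, B, T, 1, 0, n, 0);
}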
Prefix sum – another algorithm
• Reduce the problem size into a smaller size (e.g., a half), possibly in parallel
• Solve the same problem on the small input
• Convert the result of the small problem to the final answer, possibly in parallel

[Figure: the reduce tree 1 2 3 4 5 6 7 8 → 3 7 11 15 → 10 26 → 36; one level of it contracts A to A'.]

A                = 1 2 3 4 5 6 7 8
A'               = 3 7 11 15
Prefix sum of A' = 3 10 21 36
Prefix sum of A  = 1 3 6 10 15 21 28 36
Prefix sum – another algorithm

[Figure: contract A = 1 2 3 4 5 6 7 8 to A' = 3 7 11 15 (and recursively to 10 26, then 36); the prefix sums 3 10 21 36 of A' are then expanded into the prefix sums 1 3 6 10 15 21 28 36 of A.]

Function PrefixSum(In, n, Out) {
  if (n == 1) { Out[0] = In[0]; return; }
  para_for (i = 0 to n/2 - 1)
    B[i] = In[2i] + In[2i+1];
  PrefixSum(B, n/2, C);
  Out[0] = In[0];
  para_for (i = 1 to n-1) {
    if (i % 2) Out[i] = C[i/2];             // odd i: a prefix of the contracted array
    else Out[i] = C[i/2 - 1] + In[i]; } }   // even i: previous contracted prefix plus In[i]

O(n) work and O(log n) depth: W(n) = W(n/2) + O(n) and D(n) = D(n/2) + O(1).
How did we solve the prefix sum problem?
• Divide-and-conquer
• Split the problem in half, solve each of the same subproblems in
parallel
• i.e., solve the prefix sum of the left and the right halves of the array in parallel
• Convert the results from the subproblems to the final answers
• i.e., prefixes in the right half need to add the “left sum” computed by the reduce algorithm
Computational Models
Cost model
• Work-depth model
• Evaluate the cost of an algorithm
• Does not specify what operations can be used, how much they
cost, etc.
• How much does a parallel for cost?
• How do processors synchronize?
• What happens if two threads access the same memory location at the same
time?
• Reduce (on a PRAM):
• Use n processors, need O(log n) time and O(n log n) work
• Use n/log n processors, need O(log n) time and O(n) work
• Use the topological order of the computational DAG
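A sketch in the slides' pseudocode style (my own reconstruction, not from the lecture) of how a PRAM computes a reduce in O(log n) synchronized rounds:

for (d = 1; d < n; d = 2*d)              // O(log n) rounds, processors in lockstep
  parallel_for (i = 0; i + d < n; i = i + 2*d)
    A[i] = A[i] + A[i + d];              // after the last round, A[0] holds the total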
PRAM: pros and cons
• Simple – we can get very good bounds
• But...
• Do we know the number of processors ahead of time?
• The number of available processors even varies during the execution
• Your OS and other applications may use some of the processors
PRAM: pros and cons
• Simple – we can get very good bounds
• But...
• Do we know the number of processors ahead of time?
• Are processors really highly-synchronized?
• Accessing memory is usually more expensive than computation
• Synchronization is expensive
P1: P2:
A = 5; C = 2;
sync
B = 3; A = 3;
sync
A = A+7; B = B+C;
sync
PRAM: pros and cons
• Simple, easy to analyze
• But...
• Do we know the number of processors ahead of time?
• Are processors really highly-synchronized?
• Accessing memory is usually more expensive than computation
• Is it realistic for concurrent writes to take unit time?
Fork-join Parallelism
• The computation starts with one thread
• A thread can fork threads to execute pieces of code. After
they all finish, they join back and continue the rest of the
computation
• We can use work and depth to analyze the cost
Fork-join Parallelism
• The computation starts with one
thread
• A thread can fork threads to
execute pieces of code. After they
all finish, they join back and
continue the rest of the computation
• Fork-join is nested parallelism,
meaning that a forked thread can
further fork new threads
Fork-join parallelism
Function PrefixSum(In, n, Out) {
  if (n == 1) { Out[0] = In[0]; return; }
  para_for (i = 0 to n/2 - 1)        // Fork
    B[i] = In[2i] + In[2i+1];
                                     // Join
  PrefixSum(B, n/2, C);
  Out[0] = In[0];
  para_for (i = 1 to n-1) {          // Fork
    if (i % 2) Out[i] = C[i/2];
    else Out[i] = C[i/2 - 1] + In[i];
  }                                  // Join
}
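One possible rendering of this function in Cilk (my own sketch, not the course's sample code; the name prefix_sum and the use of std::vector for scratch space are my choices). cilk_for forks the loop iterations and implicitly joins at the end of the loop:

#include <cilk/cilk.h>
#include <vector>

void prefix_sum(int* In, int n, int* Out) {
  if (n == 1) { Out[0] = In[0]; return; }
  std::vector<int> B(n / 2), C(n / 2);
  cilk_for (int i = 0; i < n / 2; i++)      // fork; implicit join at loop end
    B[i] = In[2*i] + In[2*i + 1];
  prefix_sum(B.data(), n / 2, C.data());
  Out[0] = In[0];
  cilk_for (int i = 1; i < n; i++) {        // fork; implicit join at loop end
    if (i % 2) Out[i] = C[i / 2];
    else Out[i] = C[i / 2 - 1] + In[i];
  }
}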
Execute a fork-join algorithm
• Need a scheduler to map each thread to a processor
N-ary forking vs. binary forking
• N-ary forking: a thread can fork any number of new tasks; binary forking: a thread can fork only two new tasks (see the sketch below)
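A small sketch (my own illustration; task is a hypothetical per-index function) of simulating an n-ary fork with binary forking by divide-and-conquer, which adds only O(log n) depth for the forking itself:

#include <cilk/cilk.h>

void task(int i);                    // hypothetical per-task work

// Run task(s), ..., task(t-1) using only binary forks.
void fork_range(int s, int t) {
  if (t - s == 1) { task(s); return; }
  int mid = (s + t) / 2;
  cilk_spawn fork_range(s, mid);     // binary fork
  fork_range(mid, t);
  cilk_sync;                         // join
}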
DAG for work-depth vs. fork-join
[Figure: the same computation drawn as a general work-depth DAG and as a fork-join (series-parallel) DAG, both beginning from a single Start node.]
What else can we do?
• Sometimes, concurrent writes are inevitable. Then we need to
specify some atomic primitives for the model
Use Atomic Primitives
• Compare-and-swap:
• Multiple threads want to add to the head of a linked list

struct node {
  value_type value;
  node* next; };
shared variable node* head;

A racy version – two threads can both read the same old head, and one insertion is lost:

void insert(node* x) {
  x->next = head;
  head = x;
}

A correct version – retry with CAS until the swap succeeds:

void insert(node* x) {
  node* old_head = head;
  x->next = old_head;
  while (!CAS(&head, old_head, x)) {
    old_head = head;
    x->next = old_head; }
}

[Figure: threads inserting X1 and X2 concurrently; without CAS, both can link their next pointer to the same old head.]
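For reference, a runnable C++ rendering of the CAS loop (my own sketch, using std::atomic rather than a raw CAS primitive):

#include <atomic>

struct node { int value; node* next; };
std::atomic<node*> head{nullptr};

void insert(node* x) {
  node* old_head = head.load();
  do {
    x->next = old_head;
    // On failure, compare_exchange_weak reloads the current value of
    // head into old_head, so the next iteration re-links x->next.
  } while (!head.compare_exchange_weak(old_head, x));
}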
Computational model
• When talking about an algorithm or a bound:
• Specify the model
• Specify any parallel primitives you need
• e.g., EREW PRAM, binary-forking with CAS, etc.
• When talking about the execution time:
• Also need to specify the scheduling algorithm
• Usually, the more (or the stronger) primitives you use, the better
your bound looks, but the less interesting/practical the result is
• E.g., assume constant time parallel reduce – we can get a constant time
sorting algorithm
Fibonacci Numbers
Fibonacci Numbers
• The n-th Fibonacci number can be computed as:

int F(int n) {
  if (n <= 1) return n;
  else {
    In parallel:
      int A = F(n-1);
      int B = F(n-2);
    return A + B;
  }
}

[Figure: the dependency chain 0, 1, …, n-2, n-1, n.]

• This is not an efficient parallel algorithm. Why? Because the dependency
chain is still long (O(n) depth) and there is much redundant work (the
total work is exponential)
• In the homework we'll see a more efficient parallel algorithm
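As a fork-join illustration only (not the efficient algorithm), here is how the naive version might look in Cilk; the name fib is mine:

#include <cilk/cilk.h>

// Naive fork-join Fibonacci: exponential work, O(n) depth.
long long fib(int n) {
  if (n <= 1) return n;
  long long a = cilk_spawn fib(n - 1);   // fork
  long long b = fib(n - 2);
  cilk_sync;                             // join
  return a + b;
}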
Parallel Programming Tools
Parallel Tools and Schedulers
In this course the following two schedulers are recommended
for your homework and course project.
• Cilk
• PBBS
You can also use other languages/schedulers that you are more
familiar with, e.g., OpenMP, Intel TBB, etc.
Cilk
• Fork-join parallelism
• Headers: #include <cilk/cilk.h> and #include <cilk/cilk_api.h>
• LLVM: https://cilkplus.github.io/
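A minimal sketch (my own, mirroring the PBBS example below) of the divide-and-conquer reduce written with cilk_spawn/cilk_sync:

#include <cilk/cilk.h>

int reduce(int* A, int n) {
  if (n == 1) return A[0];
  int L = cilk_spawn reduce(A, n/2);   // fork the left half
  int R = reduce(A + n/2, n - n/2);    // right half runs in this thread
  cilk_sync;                           // join before combining
  return L + R;
}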
PBBS (Problem-based benchmark suite)
• Code available at: https://github.com/cmuparlay/pbbslib
#include "pbbslib/utilities.h"
// You can also use cilk or openmp to compile your code

void reduce(int* A, int n, int& ret) {
  if (n == 1) ret = A[0];
  else {
    int L, R;
    par_do([&] () { reduce(A, n/2, L); },           // the two arguments are lambda
           [&] () { reduce(A+n/2, n-n/2, R); });    // expressions (must be function calls)
    ret = L + R;
  }
}
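A hypothetical usage sketch (the main function is mine, not from the slides):

#include <vector>

int main() {
  std::vector<int> A(1 << 20, 1);         // 2^20 ones
  int sum = 0;
  reduce(A.data(), (int)A.size(), sum);   // sum becomes 2^20
  return 0;
}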
About homework
• Sample code available using PBBS and Cilk in homework 1
• You will implement your own version of a scan algorithm –
add any optimizations that you think could help, and see if
they really help
• Use figures and tables to show the numbers you get
• Analyze the numbers to explain any interesting/abnormal
phenomena
• There is an assignment entry on ilearn now; you can submit
your code there
About homework
• The goal of the programming part is to let you learn, through
practice, some tricks and optimizations for implementing
parallel algorithms – the process of learning matters
About paper review
• What problem is solved in the paper? What is the motivation?
• Why is the problem challenging? How did previous work solve
the problem, and why didn't those solutions work?
• What are the key technical ideas to solve the challenges?
• What are the new theoretical results (if any)?
• Why did they design the experiments (if any) that way?
• What do the experimental results (if any) tell us?
• What do you think is the strength/novelty of the work?
• What do you think is the weakness of the work? Do you have
ideas to improve that?
• What are the possible directions for future work?
• Do you have any questions about the work?
About paper review
• A useful document of some paper review tips:
https://people.inf.ethz.ch/troscoe/pubs/review-writing.pdf
• Your paper review is slightly different since you are reviewing papers
that have already been published