
Parallel Algorithms: Theory and Practice


Last class
• Work-depth model: evaluate the time cost of a parallel algorithm
  • Work W: the total number of operations, i.e., the sequential time complexity
  • Depth D: the longest dependence chain in the computation, i.e., the time required when an infinite number of processors are available
• Scheduler: maps each thread to a processor
  • A helpful tool that avoids low-level design for parallel algorithms
  • For a parallel computation with work W and depth D, using a greedy scheduler, the time needed with P processors is T_P ≤ W/P + D
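• Example (a worked instance of the bound): with W = 10^9 operations, D = 10^3, and P = 100 processors, a greedy scheduler needs at most W/P + D = 10^7 + 10^3 ≈ 10^7 steps – the W/P term dominates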

Last class
• The bound T_P ≤ W/P + D
• Work-efficiency is important
  • Make the work (asymptotically) no more than that of the best (optimal) sequential algorithm
  • W is usually at least the problem size n, since we need to load all the input
  • D is usually poly-logarithmic – as long as it is polylog(n), it is much smaller than W
  • D is usually small compared to W/P
  • T_P is dominated by the W/P term
• Polylog depth indicates good scalability
  • Larger depth means that when P gets larger, D may dominate the cost
  • But whether D is O(log n) or O(log² n) usually does not make a huge difference in practice – both are much smaller than W/P
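• Example: for n = 10^9 and P = 100, W/P ≈ 10^7, while log n ≈ 30 and log² n ≈ 900 – either depth bound is orders of magnitude below the W/P term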

Last class
• Two reduce algorithms
  • Looking at the dependence graph bottom-up or top-down
  • O(n) work and O(log n) depth

• Your feedback
  • Write down what you think is the hardest/most unclear thing in the last class
  • Any other thoughts are also welcome
  • It's anonymous

Prefix Sum (Scan)

Prefix sum
A = 1 2 3  4  5  6  7  8
B = 1 3 6 10 15 21 28 36

• The most widely-used building block in parallel algorithm design
• A similar idea applies to any associative binary operation
Prefix sum: the goal is O(n) work and O(log n) depth
• Recall the reduce tree on A = 1 2 3 4 5 6 7 8: the pairwise sums 3 7 11 15, then 10 26, then the total 36
• The partial sums in the reduce tree are enough to assemble every prefix sum, e.g.:
  • prefix(4) = 3 + 3 + 4 = 10 (the node 3 covers elements 1–2)
  • prefix(5) = 10 + 5 = 15
  • prefix(6) = 10 + 5 + 6 = 21
  • prefix(7) = 10 + 11 + 7 = 28 (node 10 covers elements 1–4; node 11 covers 5–6)
Two algorithms to implement a reduce

// Algorithm 1
reduce(A, n) {
  if (n == 1) return A[0];
  In parallel:
    L = reduce(A, n/2);
    R = reduce(A + n/2, n - n/2);
  return L + R;
}

// Algorithm 2
reduce(A, n) {
  if (n == 1) return A[0];
  if (n is odd) pad A with a 0 and set n = n + 1;
  parallel_for i = 1 to n/2:
    B[i] = A[2i-1] + A[2i];
  return reduce(B, n/2);
}

Divide-and-conquer (Algorithm 1): deal with the left and right halves recursively
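A minimal C++ sketch of the two algorithms may make them concrete. This is a sequential simulation: the steps that a real implementation would fork are only marked in comments, and odd sizes are padded with 0, the identity for +.

#include <vector>

// 1) Divide-and-conquer reduce.
int reduce_dc(const int* A, int n) {
  if (n == 1) return A[0];
  int L = reduce_dc(A, n / 2);             // runs in parallel with the next call
  int R = reduce_dc(A + n / 2, n - n / 2);
  return L + R;
}

// 2) Contraction reduce: halve the problem size each round.
int reduce_contract(std::vector<int> A) {
  while (A.size() > 1) {
    if (A.size() % 2) A.push_back(0);      // pad an odd size with the identity
    std::vector<int> B(A.size() / 2);
    for (size_t i = 0; i < B.size(); i++)  // parallel_for in a real implementation
      B[i] = A[2 * i] + A[2 * i + 1];
    A = std::move(B);
  }
  return A[0];
}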
Prefix sum: divide-and-conquer
• Split the original problem into several parts of smaller size (e.g., evenly in two)
• Solve the same problem on each part in parallel
• Combine the results

Function scan_r(A, B, s, t, offset) {
  If s = t-1 then { B[s] = offset + A[s]; return; }
  mid = (s+t)/2;
  In Parallel:
    scan_r(A, B, s, mid, offset);
    scan_r(A, B, mid, t, offset + leftSum);
}
Function scan(A, B) {
  Call reduce(A, n) and save the reduce tree;
  scan_r(A, B, 0, n, 0);
}

Example: A = 1 2 3 4 5 6 7 8
  Left prefix sums:  1 3 6 10        Right prefix sums: 5 11 18 26
  Add 10 to each right result:       15 21 28 36
  What is 10? The sum of the left half – exactly the leftSum value stored in the saved reduce tree.
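For concreteness, here is a minimal runnable C++ sketch of this scheme, assuming n is a power of two. The tree array and node numbering (children of node v are 2v and 2v+1, as in a segment tree) are implementation choices, not part of the slide; the recursive call pairs marked "in parallel" are where a real implementation would fork.

#include <cstdio>
#include <vector>

std::vector<int> tree;  // tree[v] = sum of the range owned by node v

int build(const std::vector<int>& A, int v, int s, int t) {  // reduce phase
  if (t - s == 1) return tree[v] = A[s];
  int mid = (s + t) / 2;
  int L = build(A, 2 * v, s, mid);          // in parallel
  int R = build(A, 2 * v + 1, mid, t);
  return tree[v] = L + R;
}

void scan_r(const std::vector<int>& A, std::vector<int>& B,
            int v, int s, int t, int offset) {
  if (t - s == 1) { B[s] = offset + A[s]; return; }
  int mid = (s + t) / 2;
  scan_r(A, B, 2 * v, s, mid, offset);      // in parallel
  scan_r(A, B, 2 * v + 1, mid, t, offset + tree[2 * v]);  // leftSum = tree[2v]
}

int main() {
  std::vector<int> A = {1, 2, 3, 4, 5, 6, 7, 8}, B(A.size());
  int n = (int)A.size();
  tree.assign(2 * n, 0);
  build(A, 1, 0, n);
  scan_r(A, B, 1, 0, n, 0);
  for (int x : B) printf("%d ", x);         // prints: 1 3 6 10 15 21 28 36
  printf("\n");
}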
Prefix sum: O(n) work, O(log n) depth
Function scan_r(A, B, s, t, offset) {
  If s = t then { B[s] = offset + A[s]; return; }
  mid = (s+t)/2;
  In Parallel:
    scan_r(A, B, s, mid, offset);
    scan_r(A, B, mid+1, t, offset + leftSum);
}

How the offsets propagate down the reduce tree for A = 1 2 3 4 5 6 7 8:
• Root's children: offset 0 (left half), offset 10 (right half)
• Next level: offsets 0, 0+3 = 3, 10, 10+11 = 21
• Leaves: offsets 0, 1, 3, 3+3 = 6, 10, 10+5 = 15, 21, 21+7 = 28
• Results: 0+1 = 1, 1+2 = 3, 3+3 = 6, 6+4 = 10, 10+5 = 15, 15+6 = 21, 21+7 = 28, 28+8 = 36
Two algorithms to implement a reduce (the same two as above)

reduce(A, n) {
  if (n == 1) return A[0];
  if (n is odd) pad A with a 0 and set n = n + 1;
  parallel_for i = 1 to n/2:
    B[i] = A[2i-1] + A[2i];
  return reduce(B, n/2);
}

Reduce the problem size (Algorithm 2): shrink the original size to half each round
Prefix sum – another algorithm
• Reduce the problem to a smaller size (e.g., half), possibly in parallel
• Solve the same problem on the smaller input
• Convert the result of the smaller problem into the final answer, possibly in parallel

A                = 1 2 3  4  5  6  7  8
Prefix sum of A  = 1 3 6 10 15 21 28 36
A'               = 3 7 11 15            (pairwise sums of A)
Prefix sum of A' = 3 10 21 36           (every other entry of the prefix sums of A)
Prefix sum – another algorithm
[Figure: the recursion contracts 1 2 3 4 5 6 7 8 → 3 7 11 15 → 10 26 → 36, and the prefix sums come back up as 36 → 10 36 → 3 10 21 36 → 1 3 6 10 15 21 28 36]

Function PrefixSum(In, n, Out) {
  if (n == 1) { Out[0] = In[0]; return; }
  para_for (i = 0 to n/2 - 1)
    B[i] = In[2i] + In[2i+1];
  PrefixSum(B, n/2, C);
  Out[0] = In[0];
  para_for (i = 1 to n-1) {
    if (i % 2) Out[i] = C[i/2];         // odd positions: copy from C
    else Out[i] = C[i/2 - 1] + In[i];   // even positions: add the original input
  }
}
O(n) work, O(log n) depth
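A runnable C++ sketch of this contraction algorithm (a sequential simulation: the two loops marked parallel_for are the parallel steps, and odd n is handled by carrying the last element into B):

#include <cstdio>
#include <vector>

void prefix_sum(const std::vector<long long>& in, std::vector<long long>& out) {
  size_t n = in.size();
  out.resize(n);
  if (n == 1) { out[0] = in[0]; return; }
  // Contract: pair up adjacent elements.
  std::vector<long long> b((n + 1) / 2);
  for (size_t i = 0; i < n / 2; i++)          // parallel_for
    b[i] = in[2 * i] + in[2 * i + 1];
  if (n % 2) b[n / 2] = in[n - 1];            // odd n: keep the leftover element
  // Recurse on the half-sized problem.
  std::vector<long long> c;
  prefix_sum(b, c);
  // Expand back to the full answer.
  out[0] = in[0];
  for (size_t i = 1; i < n; i++)              // parallel_for
    out[i] = (i % 2) ? c[i / 2]               // odd positions: copy from c
                     : c[i / 2 - 1] + in[i];  // even positions: add the input
}

int main() {
  std::vector<long long> a = {1, 2, 3, 4, 5, 6, 7, 8}, b;
  prefix_sum(a, b);
  for (long long x : b) printf("%lld ", x);   // prints: 1 3 6 10 15 21 28 36
  printf("\n");
}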
How did we solve the prefix sum problem?
• Divide-and-conquer
  • Split the problem in half, and solve the two subproblems of the same kind in parallel
    • i.e., solve the prefix sums of the left and right halves of the array in parallel
  • Convert the results of the subproblems into the final answers
    • i.e., every result in the right half needs to add the "left sum" taken from the reduce tree

• Reduce to a smaller size
  • Convert the problem to a smaller instance of the same problem
    • i.e., add every two adjacent elements to halve the problem size
  • Convert the result of the smaller problem into the final answers
    • i.e., copy the results to the odd positions, and get the results at the even positions by adding the original input values
Computational Models

Cost model
• Work-depth model
  • Evaluates the cost of an algorithm
  • Does not specify what operations can be used, how much they cost, etc.
    • How much does a parallel for cost?
    • How do processors synchronize?
    • What happens if two threads access the same memory location at the same time?

• Sequentially, we have the Random Access Machine (RAM) model
  • Unbounded memory, and you can access any location by its address
  • Every operation (computation, memory access, etc.) costs unit time
  • Simple and effective for analyzing sequential algorithms
The PRAM (Parallel RAM) Model
• P processors share the memory
• Every operation takes unit time
• All processors are highly synchronized
  • After each unit of time, all processors have finished one operation or memory access
• Some commonly-used settings
  • Exclusive read exclusive write (EREW) – every memory cell can be read or written by only one processor at a time
  • Concurrent read exclusive write (CREW) – multiple processors can read a memory cell, but only one can write at a time
  • Exclusive read concurrent write (ERCW) – never considered
  • Concurrent read concurrent write (CRCW) – multiple processors can read and write concurrently
The PRAM (Parallel RAM) Model
• Evaluating an algorithm in the PRAM model
  • The total number of processors P
  • The total required (parallel) time T
  • Sometimes the processor-time product P·T is used as an indicator of work

• Reduce (on the tree from the earlier slides):
  • Using n processors: O(log n) time and O(n log n) work
  • Using n/log n processors: O(log n) time and O(n) work
  • Execute in the topological order of the computational DAG
PRAM: pros and cons
• Simple – we can get very good bounds
• But...
  • Do we know the number of processors ahead of time?
    • The number of available processors can even vary during the execution
    • Your OS and other applications may use some of the processors
PRAM: pros and cons
• Simple – we can get very good bounds
• But...
  • Do we know the number of processors ahead of time?
  • Are processors really highly synchronized?
    • Accessing memory is usually more expensive than computation
    • Synchronization is expensive

P1:          P2:
A = 5;       C = 2;
   --- sync ---
B = 3;       A = 3;
   --- sync ---
A = A+7;     B = B+C;
   --- sync ---
PRAM: pros and cons
• Simple, easy to analyze
• But...
  • Do we know the number of processors ahead of time?
  • Are processors really highly synchronized?
    • Accessing memory is usually more expensive than computation
  • Are concurrent writes really ideal enough to take unit time?

• Although we do not use PRAM in this course, many useful and good algorithms (and ideas) were proposed based on PRAM
  • PRAM was proposed in 1979, but parallelism only moved into the mainstream of processor design around 2005
Fork-join parallelism

Fork-join Parallelism
• The computation starts with one thread
• A thread can fork several threads to execute pieces of code; after they all finish, they join back, and the computation continues in the original thread
• We can use work and depth to analyze the cost
Fork-join Parallelism
• The computation starts with one thread
• A thread can fork several threads to execute pieces of code; after they all finish, they join back, and the computation continues in the original thread
• Fork-join is nested parallelism, meaning that a forked thread can further fork new threads
Fork-join Parallelism
• The computation starts with one thread
• A thread can fork several threads to execute pieces of code; after they all finish, they join back, and the computation continues in the original thread
• We can use work and depth to analyze the cost

Function Scan(A, B, s, t, ps) {
  If s = t then { B[s] = ps + A[s]; return; }
  mid = (s+t)/2;
  In Parallel (Fork):
    Scan(A, B, s, mid, ps);
    Scan(A, B, mid+1, t, ps + leftSum);
  Join
}
Fork-join parallelism
Function PrefixSum(In, n, Out) {
  if (n == 1) { Out[0] = In[0]; return; }
  para_for (i = 0 to n/2 - 1)      // Fork
    B[i] = In[2i] + In[2i+1];
  // Join
  PrefixSum(B, n/2, C);
  Out[0] = In[0];
  para_for (i = 1 to n-1) {        // Fork
    if (i % 2) Out[i] = C[i/2];
    else Out[i] = C[i/2 - 1] + In[i];
  }
  // Join
}
Execute a fork-join algorithm
• Need a scheduler to map each thread to a processor

• For an algorithm with work W and depth D, a good scheduler can make it run in O(W/P + D) time using P processors
N-ary forking vs. binary forking
• A thread can fork any number of new tasks at once, or only two new tasks
• This can affect the depth by a factor of O(log n)
  • Why? The sketch below shows one direction: simulating an n-way fork with binary forks takes O(log n) levels
• Homework: analyze the two prefix sum algorithms (one using divide-and-conquer, the other using parallel-for). Do they have the same cost under the n-ary fork-join model? Do they have the same cost under the binary fork-join model?

• We will assume binary forking unless otherwise specified
  • Many state-of-the-art schedulers only use binary forking
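A sketch of where the O(log n) factor comes from: a parallel_for over n iterations, expressed with only binary forking, becomes a balanced fork tree of depth O(log n). The function name is illustrative; in a real runtime the two recursive halves would be forked in parallel.

// parallel_for(lo, hi, f) via binary forking: split the range in half
// until single iterations remain. The fork tree has O(log(hi-lo)) levels,
// which multiplies the O(1) depth of an n-ary parallel_for.
template <typename F>
void parallel_for_binary(int lo, int hi, const F& f) {
  if (hi - lo <= 1) {
    if (hi > lo) f(lo);
    return;
  }
  int mid = lo + (hi - lo) / 2;
  parallel_for_binary(lo, mid, f);   // fork: left half
  parallel_for_binary(mid, hi, f);   // fork: right half (then join)
}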
DAG for work-depth vs. fork-join
[Figure: the computation DAGs of the two models, each starting from a single Start node]
• Fork-join: a fork always corresponds to a join
What else can we do
• Sometimes, concurrent writes are inevitable. Then we need to specify some atomic primitives for a model

• Some commonly used ones:
  • Compare-and-swap (CAS): bool CAS(value* p, value vold, value vnew): compare the value stored at pointer p with vold; if they are equal, change p's value to vnew and return true; otherwise do nothing and return false
  • Test-and-set (TAS): bool TAS(bool* p): if the Boolean value stored at p is false, set it to true and return true; otherwise return false
  • Fetch-and-add (FAA): integer FAA(integer* p, integer x): add x to the integer stored at p, and return the old value
  • ...
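In C++, these primitives map directly onto std::atomic operations; a minimal sketch (the wrapper functions are illustrations of the abstract signatures above, not a library API):

#include <atomic>

bool CAS(std::atomic<int>* p, int vold, int vnew) {
  // compare_exchange_strong performs the compare-and-swap atomically.
  return p->compare_exchange_strong(vold, vnew);
}

bool TAS(std::atomic_flag* p) {
  // test_and_set returns the previous value: true here iff the flag was
  // false and this call is the one that set it.
  return !p->test_and_set();
}

int FAA(std::atomic<int>* p, int x) {
  return p->fetch_add(x);  // returns the old value
}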
Use Atomic Primitives
• Fetch-and-add:
  • Multiple threads try to add values to a shared variable sum

Without FAA (a data race; with sum = 5, P1 calls Add(3), P2 calls Add(4)):
  void Add(x) {
    temp = sum;        // both threads may read 5
    sum = temp + x;    // one writes 8, the other writes 9
  }
  // sum ends up 8 or 9, but should be 12

With FAA:
  void Add(x) { FAA(&sum, x); }   // sum is always 12

• Multiple threads want to get a globally sequentialized order:
  shared variable count;
  int get_id() { return FAA(&count, 1); }
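A runnable version of the sum example, assuming C++ threads and std::atomic (compile with -pthread):

#include <atomic>
#include <cstdio>
#include <thread>

// Two threads add 3 and 4 to a shared counter starting at 5. With
// fetch_add the result is always 12; with a plain read-modify-write it
// could end up 8 or 9 as in the slide.
std::atomic<int> sum{5};

int main() {
  std::thread t1([] { sum.fetch_add(3); });
  std::thread t2([] { sum.fetch_add(4); });
  t1.join(); t2.join();
  printf("%d\n", sum.load());  // always prints 12
}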
Use Atomic Primitives
• Compare-and-swap:
  • Multiple threads want to add to the head of a linked list

struct node {
  value_type value;
  node* next;
};
shared variable node* head;

Without CAS (racy: two threads inserting x1 and x2 can both set their next pointers to the same old head, and one insertion is lost):
  void insert(node* x) {
    x->next = head;
    head = x;
  }

With CAS (retry until we swing head from the value we read to x):
  void insert(node* x) {
    node* old_head = head;
    x->next = old_head;
    while (!CAS(&head, old_head, x)) {
      old_head = head;
      x->next = old_head;
    }
  }
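The CAS loop above maps onto std::atomic's compare_exchange; a minimal sketch (the names mirror the slide's pseudocode):

#include <atomic>

struct Node {
  int value;
  Node* next;
};

std::atomic<Node*> head{nullptr};

void insert(Node* x) {
  Node* old_head = head.load();
  do {
    x->next = old_head;
    // On failure, compare_exchange_weak reloads old_head with the
    // current head, so the loop simply retries with the new value.
  } while (!head.compare_exchange_weak(old_head, x));
}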
Computational model
• When talking about an algorithm or a bound:
  • Specify the model
  • Specify any parallel primitives you need
    • e.g., EREW PRAM, binary forking with CAS, etc.
• When talking about the execution time:
  • Also specify the scheduling algorithm

• Usually, the more primitives or the stronger the primitives you use, the better your bounds look, but the less interesting/practical the result is
  • E.g., if we assume a constant-time parallel reduce, we can get a constant-time sorting algorithm
Fibonacci Numbers

Fibonacci Numbers
• The n-th Fibonacci number can be computed as:

int F(int n) {
  if (n <= 1) return n;
  else {
    int A, B;
    In parallel:
      A = F(n-1);
      B = F(n-2);
    return A + B;
  }
}

• Is this a good parallel algorithm? No. Why? Because the dependence chain F(n) → F(n-1) → F(n-2) → ... is still long (length Θ(n)) and there is much redundant work
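• As a worked check: the work satisfies W(n) = W(n-1) + W(n-2) + O(1), which solves to Θ(φⁿ) with φ ≈ 1.618, while the depth satisfies D(n) = max(D(n-1), D(n-2)) + O(1) = Θ(n) – so parallelizing the two calls leaves the span linear and the work exponential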
• In the homework we’ll see a more efficient parallel algorithm
Parallel Programming Tools

Parallel Tools and Schedulers
In this course, the following two schedulers are recommended for your homework and course project:
• Cilk
• PBBS

You can also use other languages/schedulers that you are more familiar with, e.g., OpenMP, Intel TBB, etc.
Cilk
• Fork-join parallelism
  #include <cilk/cilk.h>
  #include <cilk/cilk_api.h>

• cilk_spawn and cilk_sync:
  cilk_spawn S1;   // S1 runs in a new thread...
  S2;              // ...while S2 runs in the current one
  cilk_sync;       // wait for S1 to finish

  int reduce(int* A, int n) {
    if (n == 1) return A[0];
    int L, R;
    L = cilk_spawn reduce(A, n/2);
    R = reduce(A + n/2, n - n/2);
    cilk_sync;
    return L + R;
  }

• Parallel for: cilk_for
  cilk_for (int i = 0; i < n; i++) { ... }
Cilk
• cilk_spawn creates a new thread that executes the spawned call while the original thread proceeds to the next instruction

int reduce(int* A, int n) {
  if (n == 1) return A[0];
  int L, R;
  L = cilk_spawn reduce(A, n/2);
  R = reduce(A + n/2, n - n/2);
  cilk_sync;
  return L + R;
}

• Supported by gcc 5 to 7. Available on the course server.
• LLVM: https://cilkplus.github.io/
PBBS (Problem-Based Benchmark Suite)
• Code available at: https://github.com/cmuparlay/pbbslib
• You can use either Cilk or OpenMP to compile your code

#include "pbbslib/utilities.h"

void reduce(int* A, int n, int& ret) {
  if (n == 1) ret = A[0];
  else {
    int L, R;
    par_do([&] () { reduce(A, n/2, L); },           // lambda expressions
           [&] () { reduce(A + n/2, n - n/2, R); }); // (must be function calls)
    ret = L + R;
  }
}

parallel_for(0, 100, [&] (int i) { A[i] = i; });
About homework
• Sample code using PBBS and Cilk is available in homework 1
• You will implement your own version of a scan algorithm – add any optimizations that you think could help, and see if they really help
• Use figures and tables to show the numbers you get
• Analyze the numbers to explain any interesting/abnormal phenomena
• There is an assignment entry on ilearn now; you can submit your code there
About homework
• The goal of the programming part is to let you learn, from practice, some tricks and optimizations for implementing parallel algorithms – the process of learning matters

• This is a graduate-level course, which means:
  • As long as you finish all required tasks, you can pass
  • If you do a really good job, you get a good score
  • But if you cheat, you fail
About paper review
• What problem is solved in the paper? What is the motivation?
• Why is the problem challenging? How did previous work approach the problem, and why didn't it suffice?
• What are the key technical ideas for solving the challenges?
• What are the new theoretical results (if any)?
• Why did they design the experiments (if any) the way they did?
• What do the experimental results (if any) tell us?
• What do you think are the strengths/novelty of the work?
• What do you think are the weaknesses of the work? Do you have ideas to improve them?
• What are the possible directions for future work?
• Do you have any questions about the work?
About paper review
• A useful document with paper-review tips: https://people.inf.ethz.ch/troscoe/pubs/review-writing.pdf
• Your paper review is slightly different, since you are reviewing papers that have already been published

You might also like