PCP 2022 6 ParallelAlgorithms PartI

Parallel and concurrent programming
5. Parallel algorithms

Michelle Kuttel
High Performance Computing
August 2022

[Title-slide word cloud of course topics: parallel, concurrent, Amdahl's Law, thread of control, processors versus processes, non-deterministic, fork-join parallelism, protection, data race, synchronization, thread safety, correctness, mutual exclusion, locks, liveness, deadlock, starvation, readers-writers problem, producer-consumer problem, dining philosophers problem, executors, thread pools, timing, divide-and-conquer algorithms]
The DAG, or "cost graph"

Program execution can be seen as a DAG (directed acyclic graph):
• Nodes: pieces of work
• Edges: source must finish before destination starts

• A fork "ends a node" and makes two outgoing edges:
  • New thread
  • Continuation of current thread
• A join "ends a node" and makes a node with two incoming edges:
  • Node just ended
  • Last node of the thread joined on

slide from: Sophomoric Parallelism and Concurrency, Lecture 2


The DAG, or "cost graph"

• work: the number of nodes, T1
• span: the length of the longest path (the critical path), T∞

Checkpoint:
What is the span of this DAG?
What is the work?

slide from: Sophomoric Parallelism and Concurrency, Lecture 2


Checkpoint
a×b + c×d

• Write a DAG to show the work and span of this expression.

The set of instructions forms the vertices of the DAG; the graph edges indicate dependences between instructions. We say that an instruction x precedes an instruction y if x must complete before y can begin.

Here a×b and c×d are independent vertices, and each has an edge into the final + vertex: the two multiplications can run in parallel, but the addition must wait for both. So the work is 3 nodes and the span is 2 (one multiplication followed by the addition).
Parallelism

Parallelism is the maximum possible speed-up: T1 / T∞
• i.e. work divided by span

At some point, adding processors won't help
• What that point is depends on the span
• e.g. for parallel sum using divide-and-conquer, parallelism = n / log n

Parallel algorithms are about decreasing span without increasing work too much

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 2


Embarrassingly parallel algorithms

Ideal computation: a computation that can be divided into a number of completely separate tasks, each of which can be executed by a single processor

No special algorithms or techniques are required to get a workable solution

Embarrassingly parallel examples:
• element-wise linear algebra: addition, scalar multiplication, etc.
• image processing: shift, rotate, clip, scale
• Monte Carlo simulations
• encryption, compression
DAG for an embarrassingly parallel algorithm

y_i = f_i(x_i)

[Figure: the DAG is n independent task nodes, one per output y_i, with no dependence edges between them]
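As a minimal sketch (mine, not from the slides), such an element-wise map is nearly a one-liner with Java's parallel streams; the per-element function f is an arbitrary illustrative choice:

import java.util.stream.IntStream;

public class EmbarrassinglyParallel {
    // Hypothetical per-element function f_i(x_i); any pure function works.
    static double f(int i, double x) {
        return Math.sin(x) + i;
    }

    public static void main(String[] args) {
        double[] x = {0.1, 0.2, 0.3, 0.4};
        double[] y = new double[x.length];
        // Each index is processed independently: no edges in the DAG.
        IntStream.range(0, x.length).parallel()
                 .forEach(i -> y[i] = f(i, x[i]));
        System.out.println(java.util.Arrays.toString(y));
    }
}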
Image Processing

Low-level image processing uses the individual pixel values to modify the image in some way.

Image processing operations can be divided into:
• point processing: output produced based on the value of a single pixel
  • e.g. the well-known Mandelbrot set
• local operations: produce output based on a group of neighbouring pixels
• global operations: produce output based on all the pixels of the image

Point processing operations are embarrassingly parallel (local operations are often highly parallelizable). Processing separate images is always embarrassingly parallel.
Monte Carlo Methods

The basis of Monte Carlo methods is the use of random selections in calculations that lead to the solution of numerical and physical problems, e.g.
• Brownian motion
• molecular modelling
• forecasting the stock market

Each calculation is independent of the others and hence embarrassingly parallel.
Aside: Parallel Random Number Generation

For successful Monte Carlo simulations, the random numbers must be independent of each other.
• Developing random number generator algorithms and implementations that are fast, easy to use, and give good-quality pseudo-random numbers is a challenging problem.
• Developing parallel implementations is even more difficult.
Good quality pseudo-random numbers

The Mersenne Twister (MT) is a pseudorandom number generator algorithm developed by Matsumoto and Nishimura. It has:
• good distribution properties
• a long period
• efficient use of memory
• high performance

It is the default in Python (and Julia, Matlab, …), but not in C or Java.

Makoto Matsumoto (Keio University / Max-Planck-Institut für Mathematik) and Takuji Nishimura (Keio University).
Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/mt.pdf
Requirements for a Parallel Generator

For random number generators on parallel computers, it is vital that there are no correlations between the random number streams on different processors.
• e.g. we don't want one processor repeating part of another processor's sequence
• this could occur if we just use the naive method of running an RNG on each processor and giving each a randomly chosen seed
• In many applications we also need to ensure that we get the same results for any number of processors.
Parallel Mersenne Twister

The simplest solution is to have many simultaneous Mersenne Twisters processed in parallel. But even "very different" initial state values do not prevent correlated sequences from generators sharing identical parameters.
• dcmt, a special offline library for the dynamic creation of Mersenne Twister parameters, was developed by Matsumoto and Nishimura.
• The library accepts a 16-bit "thread id" as one of the inputs and encodes this value into the Mersenne Twister parameters on a per-"thread" basis, so that every thread can update its twister independently while still retaining good randomness of the final output.

Makoto Matsumoto and Takuji Nishimura.
Dynamic Creation of Pseudorandom Number Generators.
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dgene.pdf
Java and parallel random numbers

java.util.Random is thread-safe but can have poor performance in a multi-threaded environment due to contention when multiple threads share the same Random instance.
• prefer ThreadLocalRandom

public class ThreadLocalRandom extends Random

From the Javadoc: a random number generator isolated to the current thread. Like the global Random generator used by the Math class, a ThreadLocalRandom is initialized with an internally generated seed that may not otherwise be modified. When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention. Use of ThreadLocalRandom is particularly appropriate when multiple tasks (for example, each a ForkJoinTask) use random numbers in parallel in thread pools.
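As a sketch of the pattern the Javadoc describes (my example, not from the slides): a parallel Monte Carlo estimate of π in which every worker thread draws from its own ThreadLocalRandom, so no Random instance is shared:

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

public class ParallelPi {
    public static void main(String[] args) {
        long trials = 10_000_000L;
        // Each parallel task calls ThreadLocalRandom.current(), which
        // returns the generator bound to the executing thread: no
        // contention on a shared Random instance.
        long hits = LongStream.range(0, trials).parallel()
            .filter(i -> {
                ThreadLocalRandom rng = ThreadLocalRandom.current();
                double x = rng.nextDouble();
                double y = rng.nextDouble();
                return x * x + y * y <= 1.0; // inside the quarter circle?
            })
            .count();
        System.out.println("pi ~ " + 4.0 * hits / trials);
    }
}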
Quality of ThreadLocalRandom?

The numbers "appear to be adequate for 'everyday' use, such as in Monte Carlo algorithms and randomized data structures where speed is important."

Guy L. Steele, Doug Lea, and Christine H. Flood. 2014. Fast splittable pseudorandom number generators. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA '14). Association for Computing Machinery, New York, NY, USA, 453–472. https://doi.org/10.1145/2660193.2660195
Divide-and-conquer parallel algorithms

Divide problems into subproblems that are of the same form as the larger problem:
1. Divide the instance of the problem into two or more smaller instances
2. Solve the smaller instances recursively
3. Obtain the solution to the original (larger) instance by combining these solutions

Recursive subdivision continues until the grain size of the problem is small enough to be solved sequentially.
Examples of divide and conquer algorithms
• Finding the maximum or minimum element
• Is there an element satisfying some property (e.g., is there a 17)?
• Left-most element satisfying some property (e.g., the first 17)
  • What should the recursive tasks return?
  • How should we merge the results?
  (see the sketch after this list)
• Corners of a rectangle containing all points (a "bounding box")
• Counts, e.g. the number of strings that start with a vowel
  • This is just summing with a different base case
  • Many problems are!

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 2
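A minimal sketch (my code, not the lecture's) answering those two questions for the "first 17" example: each task returns the index of the left-most 17 in its half (or -1 if none), and the merge prefers the left half's answer:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Finds the left-most index i with arr[i] == 17, or -1 if none.
class FirstSeventeen extends RecursiveTask<Integer> {
    static final int CUTOFF = 1000; // illustrative sequential cutoff
    final int[] arr; final int lo, hi;
    FirstSeventeen(int[] arr, int lo, int hi) {
        this.arr = arr; this.lo = lo; this.hi = hi;
    }
    protected Integer compute() {
        if (hi - lo < CUTOFF) {
            for (int i = lo; i < hi; i++)
                if (arr[i] == 17) return i;
            return -1;
        }
        int mid = (lo + hi) / 2;
        FirstSeventeen left  = new FirstSeventeen(arr, lo, mid);
        FirstSeventeen right = new FirstSeventeen(arr, mid, hi);
        left.fork();               // search left half asynchronously
        int r = right.compute();   // search right half in this thread
        int l = left.join();
        // Merge: the left half wins if it found anything.
        return (l != -1) ? l : r;
    }
}

Invoke it with ForkJoinPool.commonPool().invoke(new FirstSeventeen(arr, 0, arr.length)).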


Divide-and-conquer parallel algorithms
• fork and join are very flexible, but divide-and-conquer maps and reductions use them in a very basic way:
• the DAG is a tree on top of an upside-down tree

[Figure: divide tree at the top, base cases in the middle, combine-results tree at the bottom]

slide from: Sophomoric Parallelism and Concurrency, Lecture 2
Divide-and-conquer parallel algorithms

• What is the work and span of this DAG?

[Figure: the same divide / base cases / combine-results DAG]

slide from: Sophomoric Parallelism and Concurrency, Lecture 2
Divide-and-conquer parallel algorithms
Work is O(n)
Span is O(log n)
Maximum speedup is n / log n (grows nearly linearly with n)

[Figure: a balanced binary tree of + nodes combining 8 leaves in 3 levels]

• Anything that can use results from two halves and merge them in O(1) time has the same property…
Basic Divide-and-Conquer algorithms: Reductions

• Reduction operations produce a single answer from a collection via an associative operator
  • Examples: max, count, leftmost, rightmost, sum, …
  • Non-example: median

• Note: the (recursive) results don't have to be single numbers or strings. They can be arrays or objects with multiple fields.
  • Example: a histogram of test results is a variant of sum

• But some things are inherently sequential
  • How we process arr[i] may depend entirely on the result of processing arr[i-1]

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 2
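Not from the slides: a minimal ForkJoin sketch of the simplest reduction, sum, as a companion to the VecAdd map below. Each task returns its half's sum and the combine step is a single addition (the O(1) merge):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sum reduction: combine results from two halves with one + (O(1) merge).
class SumTask extends RecursiveTask<Long> {
    static final int SEQUENTIAL_CUTOFF = 1000; // illustrative value
    final int[] arr; final int lo, hi;
    SumTask(int[] arr, int lo, int hi) { this.arr = arr; this.lo = lo; this.hi = hi; }
    protected Long compute() {
        if (hi - lo < SEQUENTIAL_CUTOFF) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        SumTask left  = new SumTask(arr, lo, mid);
        SumTask right = new SumTask(arr, mid, hi);
        left.fork();              // run left half asynchronously
        long r = right.compute(); // compute right half in this thread
        return left.join() + r;   // associative combine
    }
}

// usage: long total = ForkJoinPool.commonPool().invoke(new SumTask(arr, 0, arr.length));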


Basic divide and conquer algorithms: Maps

A map operates on each element of a collection independently to create a new collection of the same size
• No combining of results
• For arrays, this is so trivial that some hardware has direct support

Canonical example: vector addition (pseudo-code)

int[] vector_add(int[] arr1, int[] arr2){
  assert (arr1.length == arr2.length);
  int[] result = new int[arr1.length];
  FORALL(i=0; i < arr1.length; i++) {
    result[i] = arr1[i] + arr2[i];
  }
  return result;
}

Adapted from: Sophomoric Parallelism and Concurrency, Lecture 2


Maps in ForkJoin Framework

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class VecAdd extends RecursiveAction {
  static final int SEQUENTIAL_CUTOFF = 1000; // below this, just loop
  int lo; int hi; int[] res; int[] arr1; int[] arr2;
  VecAdd(int l, int h, int[] r, int[] a1, int[] a2){ … }
  protected void compute(){
    if(hi - lo < SEQUENTIAL_CUTOFF) {
      for(int i = lo; i < hi; i++)
        res[i] = arr1[i] + arr2[i];
    } else {
      int mid = (hi + lo) / 2;
      VecAdd left  = new VecAdd(lo, mid, res, arr1, arr2);
      VecAdd right = new VecAdd(mid, hi, res, arr1, arr2);
      left.fork();      // run left half asynchronously
      right.compute();  // do right half in this thread
      left.join();      // wait for the left half
    }
  }
}

static final ForkJoinPool fjPool = new ForkJoinPool();

int[] add(int[] arr1, int[] arr2){
  assert (arr1.length == arr2.length);
  int[] ans = new int[arr1.length];
  fjPool.invoke(new VecAdd(0, arr1.length, ans, arr1, arr2));
  return ans;
}

from: Sophomoric Parallelism and Concurrency, Lecture 2
Maps in ForkJoin Framework

Even though there is no result-combining, it still helps with load balancing to create many small tasks
• Maybe not for vector-add, but for more compute-intensive maps
• The forking is O(log n), whereas other theoretical approaches to vector-add are O(1)

from: Sophomoric Parallelism and Concurrency, Lecture 2


Maps and reductions

Maps and reductions: the "work horses" of parallel programming

• By far the two most important and common patterns

• Learn to recognize when an algorithm can be written in terms of maps and reductions

• Use maps and reductions to describe (parallel) algorithms

• Programming them becomes "trivial" with a little practice
  • Exactly like sequential for-loops seem second-nature

from: Sophomoric Parallelism and Concurrency, Lecture 2


More interesting example: the parallel prefix-sum algorithm

This "key trick" typically underlies surprising parallelization
• Enables other things like packs

slide from: Sophomoric Parallelism and Concurrency, Lecture 3
The prefix-sum problem

Given int[] input, produce int[] output where output[i] is the sum of input[0]+input[1]+…+input[i]

The sequential version can be a CS1 exam problem:

int[] prefix_sum(int[] input){
  int[] output = new int[input.length];
  output[0] = input[0];
  for(int i=1; i < input.length; i++)
    output[i] = output[i-1] + input[i];
  return output;
}

This does not seem parallelizable
• Work: O(n), Span: O(n)
• This algorithm is sequential, but a different algorithm has Work: O(n), Span: O(log n)

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 3
Parallel prefix-sum
• The parallel-prefix algorithm does two passes
  • Each pass has O(n) work and O(log n) span
  • So in total there is O(n) work and O(log n) span
  • So, just as with array summing, the parallelism is n / log n
• The first pass builds a tree bottom-up: the "up" pass
• The second pass traverses the tree top-down: the "down" pass

Historical note:
• Original algorithm due to R. Ladner and M. Fischer at the University of Washington in 1977

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 3


Example

input: 6 4 16 10 16 14 2 8

After the up pass, each node holds the sum of its range:

range [0,8): sum 76
range [0,4): sum 36                                    range [4,8): sum 40
range [0,2): sum 10   range [2,4): sum 26   range [4,6): sum 30   range [6,8): sum 10
leaf sums: 6  4  16  10  16  14  2  8

The down pass then fills in each node's fromleft value:

range [0,8): fromleft 0
range [0,4): fromleft 0                                range [4,8): fromleft 36
range [0,2): fromleft 0   range [2,4): fromleft 10   range [4,6): fromleft 36   range [6,8): fromleft 66
leaf fromleft: 0  6  10  26  36  52  66  68

Each leaf writes output[i] = fromleft + input[i]:

output: 6 10 26 36 52 66 68 76

slides from: Sophomoric Parallelism and Concurrency, Lecture 3
The algorithm, part 1

1. Up: build a binary tree where
   • the root has the sum of the range [x,y)
   • if a node has the sum of [lo,hi) and hi > lo,
     • its left child has the sum of [lo,middle)
     • its right child has the sum of [middle,hi)
   • a leaf has the sum of [i,i+1), i.e., input[i]

This is an easy fork-join computation: combine results by actually building a binary tree with all the range-sums
• Tree built bottom-up in parallel
• Could be more clever with an array, as with heaps

Analysis: O(n) work, O(log n) span

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 3


The algorithm, part 2

2. Down: pass down a value fromLeft
   • the root is given a fromLeft of 0
   • a node takes its fromLeft value and
     • passes its left child the same fromLeft
     • passes its right child its fromLeft plus its left child's sum (as stored in part 1)
   • at the leaf for array position i,
     output[i] = fromLeft + input[i]

This is an easy fork-join computation: traverse the tree built in step 1 and produce no result
• Leaves assign to output
• Invariant: fromLeft is the sum of elements left of the node's range

Analysis: O(n) work, O(log n) span

slide from: Sophomoric Parallelism and Concurrency, Lecture 3
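Putting the two passes together, here is a minimal fork-join sketch of the algorithm (my code, following the slides' description; class and field names are my own, and it assumes a non-empty input):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.RecursiveTask;

public class PrefixSum {
    // Tree node holding the sum of input[lo, hi).
    static class Node {
        final int lo, hi;
        int sum;
        Node left, right;
        Node(int lo, int hi) { this.lo = lo; this.hi = hi; }
    }

    // Up pass: build the tree of range-sums bottom-up.
    static class Up extends RecursiveTask<Node> {
        final int[] input; final int lo, hi;
        Up(int[] input, int lo, int hi) { this.input = input; this.lo = lo; this.hi = hi; }
        protected Node compute() {
            Node n = new Node(lo, hi);
            if (hi - lo == 1) {            // leaf: sum of [i, i+1)
                n.sum = input[lo];
                return n;
            }
            int mid = (lo + hi) / 2;
            Up l = new Up(input, lo, mid);
            Up r = new Up(input, mid, hi);
            l.fork();
            n.right = r.compute();
            n.left = l.join();
            n.sum = n.left.sum + n.right.sum;
            return n;
        }
    }

    // Down pass: propagate fromLeft and write the output at the leaves.
    static class Down extends RecursiveAction {
        final Node node; final int fromLeft;
        final int[] input, output;
        Down(Node node, int fromLeft, int[] input, int[] output) {
            this.node = node; this.fromLeft = fromLeft;
            this.input = input; this.output = output;
        }
        protected void compute() {
            if (node.left == null) {       // leaf
                output[node.lo] = fromLeft + input[node.lo];
                return;
            }
            Down l = new Down(node.left, fromLeft, input, output);
            Down r = new Down(node.right, fromLeft + node.left.sum, input, output);
            l.fork();
            r.compute();
            l.join();
        }
    }

    public static int[] prefixSum(int[] input) {
        int[] output = new int[input.length];
        ForkJoinPool pool = ForkJoinPool.commonPool();
        Node root = pool.invoke(new Up(input, 0, input.length));
        pool.invoke(new Down(root, 0, input, output));
        return output;
    }

    public static void main(String[] args) {
        int[] input = {6, 4, 16, 10, 16, 14, 2, 8};
        System.out.println(java.util.Arrays.toString(prefixSum(input)));
        // expected (from the example slide): [6, 10, 26, 36, 52, 66, 68, 76]
    }
}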


Sequential cut-off

Adding a sequential cut-off is easy, as always:

• Up: just a sum; have a leaf node hold the sum of a range

• Down:
  output[lo] = fromLeft + input[lo];
  for(i=lo+1; i < hi; i++)
    output[i] = output[i-1] + input[i];

slide from: Sophomoric Parallelism and Concurrency, Lecture 3


Parallel prefix, generalized

Just as sum-array was the simplest example of a pattern that matches many, many problems, so is prefix-sum

• Minimum or maximum of all elements to the left of i

• Is there an element to the left of i satisfying some property?

• Count of elements to the left of i satisfying some property
  • This last one is perfect for an efficient parallel pack… (see the sketch below)
  • Perfect for building on top of the "parallel prefix trick"

• We did an inclusive sum, but exclusive is just as easy

slide from: Sophomoric Parallelism and Concurrency, Lecture 3
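As a brief illustration of that last point (my sketch, not from these slides; the predicate is a hypothetical example): a parallel pack keeps exactly the elements satisfying a predicate, using a prefix sum of 0/1 flags to compute each survivor's output index. The three steps are written as plain loops here for clarity; the two maps and the prefix sum are each parallelizable as above:

public class Pack {
    // Hypothetical example predicate: keep elements greater than 10.
    static boolean f(int x) { return x > 10; }

    static int[] pack(int[] input) {
        int n = input.length;
        int[] bits = new int[n];
        for (int i = 0; i < n; i++)          // step 1: map (parallelizable)
            bits[i] = f(input[i]) ? 1 : 0;
        int[] bitsum = new int[n];           // step 2: inclusive prefix sum
        for (int i = 0; i < n; i++)          //         (parallelizable as above)
            bitsum[i] = (i == 0 ? 0 : bitsum[i - 1]) + bits[i];
        int[] output = new int[n == 0 ? 0 : bitsum[n - 1]];
        for (int i = 0; i < n; i++)          // step 3: map survivors to their slots
            if (bits[i] == 1)
                output[bitsum[i] - 1] = input[i];
        return output;
    }

    public static void main(String[] args) {
        int[] input = {6, 4, 16, 10, 16, 14, 2, 8};
        // keeps 16, 16, 14 in their original order
        System.out.println(java.util.Arrays.toString(pack(input)));
    }
}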
To be continued… maybe
