PCP 2022 6 ParallelAlgorithms PartI

Parallel and concurrent programming
5. Parallel algorithms

Michelle Kuttel
High Performance Computing
August 2022

[Title-slide word cloud of course topics: parallel, concurrent, Amdahl's Law, thread of control, processors versus processes, non-deterministic, fork-join parallelism, protection, data race, synchronization, thread safety, correctness, mutual exclusion, locks, liveness, deadlock, starvation, readers-writers problem, producer-consumer problem, dining philosophers problem, executors, thread pools, timing, divide-and-conquer algorithms]
The DAG, or "cost graph"

Program execution can be seen as a DAG (directed acyclic graph):
• Nodes: pieces of work
• Edges: source must finish before destination starts

• A fork "ends a node" and makes two outgoing edges:
  • New thread
  • Continuation of current thread
• A join "ends a node" and makes a node with two incoming edges:
  • Node just ended
  • Last node of the thread joined on

slide from: Sophomoric Parallelism and Concurrency, Lecture 2


The DAG, or "cost graph"

• work: the number of nodes, T1
• span: the length of the longest path (the critical path), T∞

Checkpoint:
What is the span of this DAG?
What is the work?

slide from: Sophomoric Parallelism and Concurrency, Lecture 2


Checkpoint
a×b + c×d

• Write a DAG to show the work and span of this expression.

The set of instructions forms the vertices of the DAG; the graph edges indicate dependences between instructions. We say that an instruction x precedes an instruction y if x must complete before y can begin.

Here a×b and c×d are independent vertices, and each has an edge into the final + vertex: the two multiplications can run in parallel, but the addition must wait for both. So the work is 3 nodes and the span is 2 (one multiplication followed by the addition).
Parallelism

Parallelism is the maximum possible speed-up: T1 / T∞
• i.e. work divided by span

At some point, adding processors won't help
• What that point is depends on the span
• e.g. for parallel sum using divide-and-conquer, parallelism = n / log n

Parallel algorithms are about decreasing span without increasing work too much

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 2


Embarrassingly parallel algorithms

Ideal computation: a computation that can be divided into a number of completely separate tasks, each of which can be executed by a single processor

No special algorithms or techniques are required to get a workable solution

Embarrassingly parallel examples:
• element-wise linear algebra: addition, scalar multiplication, etc.
• image processing: shift, rotate, clip, scale
• Monte Carlo simulations
• encryption, compression
DAG for an embarrassingly parallel algorithm

y_i = f_i(x_i)

[Figure: the DAG is n independent task nodes, one per output y_i, with no dependence edges between them]
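As a minimal sketch (mine, not from the slides), such an element-wise map is nearly a one-liner with Java's parallel streams; the per-element function f is an arbitrary illustrative choice:

import java.util.stream.IntStream;

public class EmbarrassinglyParallel {
    // Hypothetical per-element function f_i(x_i); any pure function works.
    static double f(int i, double x) {
        return Math.sin(x) + i;
    }

    public static void main(String[] args) {
        double[] x = {0.1, 0.2, 0.3, 0.4};
        double[] y = new double[x.length];
        // Each index is processed independently: no edges in the DAG.
        IntStream.range(0, x.length).parallel()
                 .forEach(i -> y[i] = f(i, x[i]));
        System.out.println(java.util.Arrays.toString(y));
    }
}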
Image Processing

Low-level image processing uses the individual pixel values to modify the image in some way.

Image processing operations can be divided into:
• point processing: output produced based on the value of a single pixel
  • e.g. the well-known Mandelbrot set
• local operations: produce output based on a group of neighbouring pixels
• global operations: produce output based on all the pixels of the image

Point processing operations are embarrassingly parallel (local operations are often highly parallelizable). Processing separate images is always embarrassingly parallel.
Monte Carlo Methods

The basis of Monte Carlo methods is the use of random selections in calculations that lead to the solution of numerical and physical problems, e.g.
• Brownian motion
• molecular modelling
• forecasting the stock market

Each calculation is independent of the others and hence embarrassingly parallel.
Aside: Parallel Random Number Generation

For successful Monte Carlo simulations, the random numbers must be independent of each other.
• Developing random number generator algorithms and implementations that are fast, easy to use, and give good-quality pseudo-random numbers is a challenging problem.
• Developing parallel implementations is even more difficult.
Good quality pseudo-random numbers

The Mersenne Twister (MT) is a pseudorandom number generator algorithm developed by Matsumoto and Nishimura. It has:
• good distribution properties
• a long period
• efficient use of memory
• high performance

It is the default in Python (and Julia, Matlab, …), but not in C or Java.

Makoto Matsumoto (Keio University / Max-Planck-Institut für Mathematik) and Takuji Nishimura (Keio University).
Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator.
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/ARTICLES/mt.pdf
Requirements for a Parallel Generator

For random number generators on parallel computers, it is vital that there are no correlations between the random number streams on different processors.
• e.g. we don't want one processor repeating part of another processor's sequence
• this could occur if we just use the naive method of running an RNG on each processor and giving each a randomly chosen seed
• In many applications we also need to ensure that we get the same results for any number of processors.
Parallel Mersenne Twister

The simplest solution is to have many simultaneous Mersenne Twisters processed in parallel. But even "very different" initial state values do not prevent correlated sequences from generators sharing identical parameters.
• dcmt, a special offline library for the dynamic creation of Mersenne Twister parameters, was developed by Matsumoto and Nishimura.
• The library accepts a 16-bit "thread id" as one of the inputs and encodes this value into the Mersenne Twister parameters on a per-"thread" basis, so that every thread can update its twister independently while still retaining good randomness of the final output.

Makoto Matsumoto and Takuji Nishimura.
Dynamic Creation of Pseudorandom Number Generators.
http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/DC/dgene.pdf
Java and parallel random numbers

java.util.Random is thread-safe but can have poor performance in a multi-threaded environment due to contention when multiple threads share the same Random instance.
• prefer ThreadLocalRandom

public class ThreadLocalRandom extends Random

From the Javadoc: a random number generator isolated to the current thread. Like the global Random generator used by the Math class, a ThreadLocalRandom is initialized with an internally generated seed that may not otherwise be modified. When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention. Use of ThreadLocalRandom is particularly appropriate when multiple tasks (for example, each a ForkJoinTask) use random numbers in parallel in thread pools.
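As a sketch of the pattern the Javadoc describes (my example, not from the slides): a parallel Monte Carlo estimate of π in which every worker thread draws from its own ThreadLocalRandom, so no Random instance is shared:

import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.LongStream;

public class ParallelPi {
    public static void main(String[] args) {
        long trials = 10_000_000L;
        // Each parallel task calls ThreadLocalRandom.current(), which
        // returns the generator bound to the executing thread: no
        // contention on a shared Random instance.
        long hits = LongStream.range(0, trials).parallel()
            .filter(i -> {
                ThreadLocalRandom rng = ThreadLocalRandom.current();
                double x = rng.nextDouble();
                double y = rng.nextDouble();
                return x * x + y * y <= 1.0; // inside the quarter circle?
            })
            .count();
        System.out.println("pi ~ " + 4.0 * hits / trials);
    }
}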
Quality of ThreadLocalRandom?

The numbers "appear to be adequate for 'everyday' use, such as in Monte Carlo algorithms and randomized data structures where speed is important."

Guy L. Steele, Doug Lea, and Christine H. Flood. 2014. Fast splittable pseudorandom number generators. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications (OOPSLA '14). Association for Computing Machinery, New York, NY, USA, 453–472. https://doi.org/10.1145/2660193.2660195
Divide-and-conquer parallel algorithms

Divide problems into subproblems that are of the same form as the larger problem:
1. Divide the instance of the problem into two or more smaller instances
2. Solve the smaller instances recursively
3. Obtain the solution to the original (larger) instance by combining these solutions

Recursive subdivision continues until the grain size of the problem is small enough to be solved sequentially.
Examples of divide and conquer algorithms
• Finding the maximum or minimum element
• Is there an element satisfying some property (e.g., is there a 17)?
• Left-most element satisfying some property (e.g., the first 17)
  • What should the recursive tasks return?
  • How should we merge the results?
  (see the sketch after this list)
• Corners of a rectangle containing all points (a "bounding box")
• Counts, e.g. the number of strings that start with a vowel
  • This is just summing with a different base case
  • Many problems are!

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 2
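A minimal sketch (my code, not the lecture's) answering those two questions for the "first 17" example: each task returns the index of the left-most 17 in its half (or -1 if none), and the merge prefers the left half's answer:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Finds the left-most index i with arr[i] == 17, or -1 if none.
class FirstSeventeen extends RecursiveTask<Integer> {
    static final int CUTOFF = 1000; // illustrative sequential cutoff
    final int[] arr; final int lo, hi;
    FirstSeventeen(int[] arr, int lo, int hi) {
        this.arr = arr; this.lo = lo; this.hi = hi;
    }
    protected Integer compute() {
        if (hi - lo < CUTOFF) {
            for (int i = lo; i < hi; i++)
                if (arr[i] == 17) return i;
            return -1;
        }
        int mid = (lo + hi) / 2;
        FirstSeventeen left  = new FirstSeventeen(arr, lo, mid);
        FirstSeventeen right = new FirstSeventeen(arr, mid, hi);
        left.fork();               // search left half asynchronously
        int r = right.compute();   // search right half in this thread
        int l = left.join();
        // Merge: the left half wins if it found anything.
        return (l != -1) ? l : r;
    }
}

Invoke it with ForkJoinPool.commonPool().invoke(new FirstSeventeen(arr, 0, arr.length)).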


Divide-and-conquer parallel algorithms
• fork and join are very flexible, but divide-and-conquer maps and reductions use them in a very basic way:
• the DAG is a tree on top of an upside-down tree

[Figure: divide tree at the top, base cases in the middle, combine-results tree at the bottom]

slide from: Sophomoric Parallelism and Concurrency, Lecture 2
Divide-and-conquer parallel algorithms

• What is the work and span of this DAG?

[Figure: the same divide / base cases / combine-results DAG]

slide from: Sophomoric Parallelism and Concurrency, Lecture 2
Divide-and-conquer parallel algorithms
Work is O(n)
Span is O(log n)
Maximum speedup is n / log n (grows nearly linearly with n)

[Figure: a balanced binary tree of + nodes combining 8 leaves in 3 levels]

• Anything that can use results from two halves and merge them in O(1) time has the same property…
Basic Divide-and-Conquer algorithms: Reductions

• Reduction operations produce a single answer from a collection via an associative operator
  • Examples: max, count, leftmost, rightmost, sum, …
  • Non-example: median

• Note: the (recursive) results don't have to be single numbers or strings. They can be arrays or objects with multiple fields.
  • Example: a histogram of test results is a variant of sum

• But some things are inherently sequential
  • How we process arr[i] may depend entirely on the result of processing arr[i-1]

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 2
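Not from the slides: a minimal ForkJoin sketch of the simplest reduction, sum, as a companion to the VecAdd map below. Each task returns its half's sum and the combine step is a single addition (the O(1) merge):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sum reduction: combine results from two halves with one + (O(1) merge).
class SumTask extends RecursiveTask<Long> {
    static final int SEQUENTIAL_CUTOFF = 1000; // illustrative value
    final int[] arr; final int lo, hi;
    SumTask(int[] arr, int lo, int hi) { this.arr = arr; this.lo = lo; this.hi = hi; }
    protected Long compute() {
        if (hi - lo < SEQUENTIAL_CUTOFF) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        SumTask left  = new SumTask(arr, lo, mid);
        SumTask right = new SumTask(arr, mid, hi);
        left.fork();              // run left half asynchronously
        long r = right.compute(); // compute right half in this thread
        return left.join() + r;   // associative combine
    }
}

// usage: long total = ForkJoinPool.commonPool().invoke(new SumTask(arr, 0, arr.length));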


Basic divide and conquer algorithms: Maps

A map operates on each element of a collection independently to create a new collection of the same size
• No combining of results
• For arrays, this is so trivial that some hardware has direct support

Canonical example: vector addition (pseudo-code)

int[] vector_add(int[] arr1, int[] arr2){
  assert (arr1.length == arr2.length);
  int[] result = new int[arr1.length];
  FORALL(i=0; i < arr1.length; i++) {
    result[i] = arr1[i] + arr2[i];
  }
  return result;
}

Adapted from: Sophomoric Parallelism and Concurrency, Lecture 2


Maps in ForkJoin Framework

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

class VecAdd extends RecursiveAction {
  static final int SEQUENTIAL_CUTOFF = 1000; // below this, just loop
  int lo; int hi; int[] res; int[] arr1; int[] arr2;
  VecAdd(int l, int h, int[] r, int[] a1, int[] a2){ … }
  protected void compute(){
    if(hi - lo < SEQUENTIAL_CUTOFF) {
      for(int i = lo; i < hi; i++)
        res[i] = arr1[i] + arr2[i];
    } else {
      int mid = (hi + lo) / 2;
      VecAdd left  = new VecAdd(lo, mid, res, arr1, arr2);
      VecAdd right = new VecAdd(mid, hi, res, arr1, arr2);
      left.fork();      // run left half asynchronously
      right.compute();  // do right half in this thread
      left.join();      // wait for the left half
    }
  }
}

static final ForkJoinPool fjPool = new ForkJoinPool();

int[] add(int[] arr1, int[] arr2){
  assert (arr1.length == arr2.length);
  int[] ans = new int[arr1.length];
  fjPool.invoke(new VecAdd(0, arr1.length, ans, arr1, arr2));
  return ans;
}

from: Sophomoric Parallelism and Concurrency, Lecture 2
Maps in ForkJoin Framework

Even though there is no result-combining, it still helps with load balancing to create many small tasks
• Maybe not for vector-add, but for more compute-intensive maps
• The forking is O(log n), whereas other theoretical approaches to vector-add are O(1)

from: Sophomoric Parallelism and Concurrency, Lecture 2


Maps and reductions

Maps and reductions: the "work horses" of parallel programming

• By far the two most important and common patterns

• Learn to recognize when an algorithm can be written in terms of maps and reductions

• Use maps and reductions to describe (parallel) algorithms

• Programming them becomes "trivial" with a little practice
  • Exactly like sequential for-loops seem second-nature

from: Sophomoric Parallelism and Concurrency, Lecture 2


More interesting example: the parallel prefix-sum algorithm

This "key trick" typically underlies surprising parallelization
• Enables other things like packs

slide from: Sophomoric Parallelism and Concurrency, Lecture 3
The prefix-sum problem

Given int[] input, produce int[] output where output[i] is the sum of input[0]+input[1]+…+input[i]

The sequential version can be a CS1 exam problem:

int[] prefix_sum(int[] input){
  int[] output = new int[input.length];
  output[0] = input[0];
  for(int i=1; i < input.length; i++)
    output[i] = output[i-1] + input[i];
  return output;
}

This does not seem parallelizable
• Work: O(n), Span: O(n)
• This algorithm is sequential, but a different algorithm has Work: O(n), Span: O(log n)

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 3
Parallel prefix-sum
• The parallel-prefix algorithm does two passes
  • Each pass has O(n) work and O(log n) span
  • So in total there is O(n) work and O(log n) span
  • So, just as with array summing, the parallelism is n / log n
• The first pass builds a tree bottom-up: the "up" pass
• The second pass traverses the tree top-down: the "down" pass

Historical note:
• Original algorithm due to R. Ladner and M. Fischer at the University of Washington in 1977

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 3


Example

input: 6 4 16 10 16 14 2 8

After the up pass, each node holds the sum of its range:

range [0,8): sum 76
range [0,4): sum 36                                    range [4,8): sum 40
range [0,2): sum 10   range [2,4): sum 26   range [4,6): sum 30   range [6,8): sum 10
leaf sums: 6  4  16  10  16  14  2  8

The down pass then fills in each node's fromleft value:

range [0,8): fromleft 0
range [0,4): fromleft 0                                range [4,8): fromleft 36
range [0,2): fromleft 0   range [2,4): fromleft 10   range [4,6): fromleft 36   range [6,8): fromleft 66
leaf fromleft: 0  6  10  26  36  52  66  68

Each leaf writes output[i] = fromleft + input[i]:

output: 6 10 26 36 52 66 68 76

slides from: Sophomoric Parallelism and Concurrency, Lecture 3
The algorithm, part 1

1. Up: build a binary tree where
   • the root has the sum of the range [x,y)
   • if a node has the sum of [lo,hi) and hi > lo,
     • its left child has the sum of [lo,middle)
     • its right child has the sum of [middle,hi)
   • a leaf has the sum of [i,i+1), i.e., input[i]

This is an easy fork-join computation: combine results by actually building a binary tree with all the range-sums
• Tree built bottom-up in parallel
• Could be more clever with an array, as with heaps

Analysis: O(n) work, O(log n) span

slide adapted from: Sophomoric Parallelism and Concurrency, Lecture 3


The algorithm, part 2

2. Down: pass down a value fromLeft
   • the root is given a fromLeft of 0
   • a node takes its fromLeft value and
     • passes its left child the same fromLeft
     • passes its right child its fromLeft plus its left child's sum (as stored in part 1)
   • at the leaf for array position i,
     output[i] = fromLeft + input[i]

This is an easy fork-join computation: traverse the tree built in step 1 and produce no result
• Leaves assign to output
• Invariant: fromLeft is the sum of elements left of the node's range

Analysis: O(n) work, O(log n) span

slide from: Sophomoric Parallelism and Concurrency, Lecture 3
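Putting the two passes together, here is a minimal fork-join sketch of the algorithm (my code, following the slides' description; class and field names are my own, and it assumes a non-empty input):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.RecursiveTask;

public class PrefixSum {
    // Tree node holding the sum of input[lo, hi).
    static class Node {
        final int lo, hi;
        int sum;
        Node left, right;
        Node(int lo, int hi) { this.lo = lo; this.hi = hi; }
    }

    // Up pass: build the tree of range-sums bottom-up.
    static class Up extends RecursiveTask<Node> {
        final int[] input; final int lo, hi;
        Up(int[] input, int lo, int hi) { this.input = input; this.lo = lo; this.hi = hi; }
        protected Node compute() {
            Node n = new Node(lo, hi);
            if (hi - lo == 1) {            // leaf: sum of [i, i+1)
                n.sum = input[lo];
                return n;
            }
            int mid = (lo + hi) / 2;
            Up l = new Up(input, lo, mid);
            Up r = new Up(input, mid, hi);
            l.fork();
            n.right = r.compute();
            n.left = l.join();
            n.sum = n.left.sum + n.right.sum;
            return n;
        }
    }

    // Down pass: propagate fromLeft and write the output at the leaves.
    static class Down extends RecursiveAction {
        final Node node; final int fromLeft;
        final int[] input, output;
        Down(Node node, int fromLeft, int[] input, int[] output) {
            this.node = node; this.fromLeft = fromLeft;
            this.input = input; this.output = output;
        }
        protected void compute() {
            if (node.left == null) {       // leaf
                output[node.lo] = fromLeft + input[node.lo];
                return;
            }
            Down l = new Down(node.left, fromLeft, input, output);
            Down r = new Down(node.right, fromLeft + node.left.sum, input, output);
            l.fork();
            r.compute();
            l.join();
        }
    }

    public static int[] prefixSum(int[] input) {
        int[] output = new int[input.length];
        ForkJoinPool pool = ForkJoinPool.commonPool();
        Node root = pool.invoke(new Up(input, 0, input.length));
        pool.invoke(new Down(root, 0, input, output));
        return output;
    }

    public static void main(String[] args) {
        int[] input = {6, 4, 16, 10, 16, 14, 2, 8};
        System.out.println(java.util.Arrays.toString(prefixSum(input)));
        // expected (from the example slide): [6, 10, 26, 36, 52, 66, 68, 76]
    }
}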


Sequential cut-off

Adding a sequential cut-off is easy, as always:

• Up: just a sum; have a leaf node hold the sum of a range

• Down:
  output[lo] = fromLeft + input[lo];
  for(i=lo+1; i < hi; i++)
    output[i] = output[i-1] + input[i];

slide from: Sophomoric Parallelism and Concurrency, Lecture 3


Parallel prefix, generalized

Just as sum-array was the simplest example of a pattern that matches many, many problems, so is prefix-sum

• Minimum or maximum of all elements to the left of i

• Is there an element to the left of i satisfying some property?

• Count of elements to the left of i satisfying some property
  • This last one is perfect for an efficient parallel pack… (see the sketch below)
  • Perfect for building on top of the "parallel prefix trick"

• We did an inclusive sum, but exclusive is just as easy

slide from: Sophomoric Parallelism and Concurrency, Lecture 3
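As a brief illustration of that last point (my sketch, not from these slides; the predicate is a hypothetical example): a parallel pack keeps exactly the elements satisfying a predicate, using a prefix sum of 0/1 flags to compute each survivor's output index. The three steps are written as plain loops here for clarity; the two maps and the prefix sum are each parallelizable as above:

public class Pack {
    // Hypothetical example predicate: keep elements greater than 10.
    static boolean f(int x) { return x > 10; }

    static int[] pack(int[] input) {
        int n = input.length;
        int[] bits = new int[n];
        for (int i = 0; i < n; i++)          // step 1: map (parallelizable)
            bits[i] = f(input[i]) ? 1 : 0;
        int[] bitsum = new int[n];           // step 2: inclusive prefix sum
        for (int i = 0; i < n; i++)          //         (parallelizable as above)
            bitsum[i] = (i == 0 ? 0 : bitsum[i - 1]) + bits[i];
        int[] output = new int[n == 0 ? 0 : bitsum[n - 1]];
        for (int i = 0; i < n; i++)          // step 3: map survivors to their slots
            if (bits[i] == 1)
                output[bitsum[i] - 1] = input[i];
        return output;
    }

    public static void main(String[] args) {
        int[] input = {6, 4, 16, 10, 16, 14, 2, 8};
        // keeps 16, 16, 14 in their original order
        System.out.println(java.util.Arrays.toString(pack(input)));
    }
}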
To be continued… maybe
