Parallel Algorithms: Theory and Practice
Last class
• The O(W/P + D) bound
• Work-efficiency is important
• Make the work (asymptotically) no more than the best (optimal) sequential
algorithm
• W is usually at least the problem size n, since we need to load all input
• D is usually poly-logarithmic – as long as it is polylog(n), it is much smaller than W
• D is usually small compared to W/P
• The total cost is dominated by the W/P term
• Polylog depth indicates good scalability
• Larger depth means that when P is getting larger, D may dominate the cost
• But whether D is O(log n) or O(log² n) usually does not make a huge difference in practice – both
are much smaller than W/P
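• A concrete example (hypothetical numbers, my own): with W = n = 10⁹, D = log₂ n ≈ 30, and P = 100 processors, the bound gives W/P + D ≈ 10⁷ + 30 – the W/P term completely dominates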
Last class
• Two reduce algorithms
• Looking at the dependence graph bottom-up or top-down
• O(n) work and O(log n) depth
• Your feedback
• Write down what you think is the hardest/most unclear thing in the
last class
• Any other thoughts are also welcome
• It’s anonymous
Prefix Sum (Scan)
Prefix sum
• The prefix sum (scan) of an array A is the array B with B[i] = A[0] + A[1] + … + A[i]
A = 1 2 3 4 5 6 7 8
B = 1 3 6 10 15 21 28 36
[Figure: the reduce tree over A – leaves 1 2 3 4 5 6 7 8, internal sums 3 7 11 15, then 10 26, then 36 – and how prefixes are assembled from tree-node sums, e.g., 3+3+4 = 10, 10+5 = 15, 10+5+6 = 21, 10+11+7 = 28.]
Two algorithms to implement a reduce
// Divide-and-conquer:
reduce(A, n) {
  if (n == 1) return A[0];
  In parallel:
    L = reduce(A, n/2);
    R = reduce(A + n/2, n-n/2);
  return L+R;
}

// Contraction (bottom-up):
reduce(A, n) {
  if (n == 1) return A[0];
  if (n is odd) { n = n+1; A[n-1] = 0; }   // pad with the identity
  parallel_for i = 0 to n/2 - 1
    B[i] = A[2i] + A[2i+1];
  return reduce(B, n/2); }
Divide-and-conquer (the first algorithm):
deal with the left and right halves recursively
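For concreteness, here is a minimal runnable C++ sketch of the second (contraction-based) reduce; the name reduce_contract and the use of std::vector are my own choices, not from the slides, and the loop is written sequentially where the pseudocode forks in parallel:

#include <vector>

// Contraction-based reduce: repeatedly pair up adjacent elements
// until one value remains. Assumes A is nonempty.
long reduce_contract(std::vector<long> A) {
  if (A.size() == 1) return A[0];
  if (A.size() % 2 == 1) A.push_back(0);   // pad with the identity ("n = n+1")
  std::vector<long> B(A.size() / 2);
  for (size_t i = 0; i < B.size(); i++)    // the parallel_for: iterations are independent
    B[i] = A[2*i] + A[2*i + 1];
  return reduce_contract(std::move(B));
}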
Prefix sum: divide-and-conquer
• Split the original problem into several parts with smaller sizes (e.g., evenly in two)
• Solve the same problem on each part in parallel
• Combine the results
[Figure: the reduce tree over 1 2 3 4 5 6 7 8, with internal sums 3 7 11 15, then 10 26, then 36.]
Function scan_r(A, B, s, t, offset) {
  If s = t-1 then {
    B[s] = offset + A[s]; return; }
  mid = (s+t)/2;
  In Parallel:
    scan_r(A, B, s, mid, offset);
    scan_r(A, B, mid, t, offset+leftSum); }   // leftSum: sum of A[s..mid-1], read from the saved reduce tree

Function scan(A, B) {
  Call reduce(A, n) and save the reduce tree;
  scan_r(A, B, 0, n, 0); }

[Figure: on 1 2 3 4 5 6 7 8, the left prefix sum is 1 3 6 10 and the right prefix sum is 5 11 18 26; adding 10 to each right entry gives 15 21 28 36.]
What is 10? It is the sum of the left half (1+2+3+4), available from the saved reduce tree.
Prefix sum: O(n) work, O(log n) depth

Function scan_r(A, B, s, t, offset) {
  If s = t then {
    B[s] = offset + A[s]; return; }
  mid = (s+t)/2;
  In Parallel:
    scan_r(A, B, s, mid, offset);
    scan_r(A, B, mid+1, t, offset+leftSum); }

[Figure: the reduce tree again; on 1 2 3 4 5 6 7 8, the left recursive call uses offset 0 and the right call uses offset 10, the left-half sum.]
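A runnable C++ sketch of the whole two-pass scan, assuming the reduce tree is stored segment-tree style in an array T; the names build, scan_down, and T are my own choices, and the recursive calls marked as parallel are written sequentially here:

#include <vector>

// Pass 1 (reduce): compute and save every internal sum. T[node] = sum of A[s, t).
long build(const std::vector<long>& A, std::vector<long>& T,
           int node, int s, int t) {
  if (t - s == 1) return T[node] = A[s];
  int mid = (s + t) / 2;
  long L = build(A, T, 2*node, s, mid);       // these two calls are independent
  long R = build(A, T, 2*node + 1, mid, t);   // and can be forked in parallel
  return T[node] = L + R;
}

// Pass 2: pass offsets down; leftSum is read from the saved tree in O(1).
void scan_down(const std::vector<long>& A, std::vector<long>& B,
               const std::vector<long>& T, int node, int s, int t, long offset) {
  if (t - s == 1) { B[s] = offset + A[s]; return; }
  int mid = (s + t) / 2;
  long leftSum = T[2*node];                       // the "what is 10?" value
  scan_down(A, B, T, 2*node, s, mid, offset);     // also parallelizable
  scan_down(A, B, T, 2*node + 1, mid, t, offset + leftSum);
}

void scan(const std::vector<long>& A, std::vector<long>& B) {
  int n = (int)A.size();
  std::vector<long> T(4 * n);
  B.assign(n, 0);
  build(A, T, 1, 0, n);
  scan_down(A, B, T, 1, 0, n, 0);
}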
Prefix sum – another algorithm
• Reduce the problem size into a smaller size (e.g., a half), possibly in parallel
• Solve the same problem on the small input
• Convert the result of the small problem to the final answer, possibly in parallel

[Figure: the reduce tree 1 2 3 4 5 6 7 8 → 3 7 11 15 → 10 26 → 36; one level of it contracts A to A'.]

A                = 1 2 3 4 5 6 7 8
A'               = 3 7 11 15
Prefix sum of A' = 3 10 21 36
Prefix sum of A  = 1 3 6 10 15 21 28 36
Prefix sum – another algorithm

[Figure: contract A = 1 2 3 4 5 6 7 8 to A' = 3 7 11 15 (and recursively to 10 26, then 36); the prefix sums 3 10 21 36 of A' are then expanded into the prefix sums 1 3 6 10 15 21 28 36 of A.]

Function PrefixSum(In, n, Out) {
  if (n == 1) { Out[0] = In[0]; return; }
  para_for (i = 0 to n/2 - 1)
    B[i] = In[2i] + In[2i+1];
  PrefixSum(B, n/2, C);
  Out[0] = In[0];
  para_for (i = 1 to n-1) {
    if (i % 2) Out[i] = C[i/2];             // odd i: a prefix of the contracted array
    else Out[i] = C[i/2 - 1] + In[i]; } }   // even i: previous contracted prefix plus In[i]

O(n) work and O(log n) depth: W(n) = W(n/2) + O(n) and D(n) = D(n/2) + O(1).
How did we solve the prefix sum problem?
• Divide-and-conquer
• Split the problem in half, solve each of the same subproblems in
parallel
• i.e., solve the prefix sum of the left and the right halves of the array in parallel
• Convert the results from the subproblems to the final answers
• i.e., prefixes in the right half need to add the “left sum” computed by the reduce algorithm
Computational Models
Cost model
• Work-depth model
• Evaluate the cost of an algorithm
• Does not specify what operations can be used, how much they
cost, etc.
• How much does a parallel for cost?
• How do processors synchronize?
• What happens if two threads access the same memory location at the same
time?
• Reduce (on a PRAM):
• Use n processors, need O(log n) time and O(n log n) work
• Use n/log n processors, need O(log n) time and O(n) work
• Use the topological order of the computational DAG
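A sketch in the slides' pseudocode style (my own reconstruction, not from the lecture) of how a PRAM computes a reduce in O(log n) synchronized rounds:

for (d = 1; d < n; d = 2*d)              // O(log n) rounds, processors in lockstep
  parallel_for (i = 0; i + d < n; i = i + 2*d)
    A[i] = A[i] + A[i + d];              // after the last round, A[0] holds the total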
PRAM: pros and cons
• Simple – we can get very good bounds
• But...
• Do we know the number of processors ahead of time?
• The number of available processors even varies during the execution
• Your OS and other applications may use some of the processors
PRAM: pros and cons
• Simple – we can get very good bounds
• But...
• Do we know the number of processors ahead of time?
• Are processors really highly-synchronized?
• Accessing memory is usually more expensive than computation
• Synchronization is expensive
P1: P2:
A = 5; C = 2;
sync
B = 3; A = 3;
sync
A = A+7; B = B+C;
sync
PRAM: pros and cons
• Simple, easy to analyze
• But...
• Do we know the number of processors ahead of time?
• Are processors really highly-synchronized?
• Accessing memory is usually more expensive than computation
• Is it realistic for concurrent writes to take unit time?
Fork-join Parallelism
• The computation starts with one thread
• A thread can fork threads to execute pieces of code. After
they all finish, they join back and continue the rest of the
computation
• We can use work and depth to analyze the cost
Fork-join Parallelism
• The computation starts with one
thread
• A thread can fork threads to
execute pieces of code. After they
all finish, they join back and
continue the rest of the computation
• Fork-join is nested parallelism,
meaning that a forked thread can
further fork new threads
Fork-join parallelism
Function PrefixSum(In, n, Out) {
  if (n == 1) { Out[0] = In[0]; return; }
  para_for (i = 0 to n/2 - 1)        // Fork
    B[i] = In[2i] + In[2i+1];
                                     // Join
  PrefixSum(B, n/2, C);
  Out[0] = In[0];
  para_for (i = 1 to n-1) {          // Fork
    if (i % 2) Out[i] = C[i/2];
    else Out[i] = C[i/2 - 1] + In[i];
  }                                  // Join
}
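One possible rendering of this function in Cilk (my own sketch, not the course's sample code; the name prefix_sum and the use of std::vector for scratch space are my choices). cilk_for forks the loop iterations and implicitly joins at the end of the loop:

#include <cilk/cilk.h>
#include <vector>

void prefix_sum(int* In, int n, int* Out) {
  if (n == 1) { Out[0] = In[0]; return; }
  std::vector<int> B(n / 2), C(n / 2);
  cilk_for (int i = 0; i < n / 2; i++)      // fork; implicit join at loop end
    B[i] = In[2*i] + In[2*i + 1];
  prefix_sum(B.data(), n / 2, C.data());
  Out[0] = In[0];
  cilk_for (int i = 1; i < n; i++) {        // fork; implicit join at loop end
    if (i % 2) Out[i] = C[i / 2];
    else Out[i] = C[i / 2 - 1] + In[i];
  }
}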
Execute a fork-join algorithm
• Need a scheduler to map each thread to a processor
N-ary forking vs. binary forking
• N-ary forking: a thread can fork any number of new tasks; binary forking: a thread can fork only two new tasks (see the sketch below)
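A small sketch (my own illustration; task is a hypothetical per-index function) of simulating an n-ary fork with binary forking by divide-and-conquer, which adds only O(log n) depth for the forking itself:

#include <cilk/cilk.h>

void task(int i);                    // hypothetical per-task work

// Run task(s), ..., task(t-1) using only binary forks.
void fork_range(int s, int t) {
  if (t - s == 1) { task(s); return; }
  int mid = (s + t) / 2;
  cilk_spawn fork_range(s, mid);     // binary fork
  fork_range(mid, t);
  cilk_sync;                         // join
}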
DAG for work-depth vs. fork-join
[Figure: the same computation drawn as a general work-depth DAG and as a fork-join (series-parallel) DAG, both beginning from a single Start node.]
What else can we do?
• Sometimes, concurrent writes are inevitable. Then we need to
specify some atomic primitives for the model
Use Atomic Primitives
• Compare-and-swap:
• Multiple threads want to add to the head of a linked list

struct node {
  value_type value;
  node* next; };
shared variable node* head;

A racy version – two threads can both read the same old head, and one insertion is lost:

void insert(node* x) {
  x->next = head;
  head = x;
}

A correct version – retry with CAS until the swap succeeds:

void insert(node* x) {
  node* old_head = head;
  x->next = old_head;
  while (!CAS(&head, old_head, x)) {
    old_head = head;
    x->next = old_head; }
}

[Figure: threads inserting X1 and X2 concurrently; without CAS, both can link their next pointer to the same old head.]
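For reference, a runnable C++ rendering of the CAS loop (my own sketch, using std::atomic rather than a raw CAS primitive):

#include <atomic>

struct node { int value; node* next; };
std::atomic<node*> head{nullptr};

void insert(node* x) {
  node* old_head = head.load();
  do {
    x->next = old_head;
    // On failure, compare_exchange_weak reloads the current value of
    // head into old_head, so the next iteration re-links x->next.
  } while (!head.compare_exchange_weak(old_head, x));
}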
Computational model
• When talking about an algorithm or a bound:
• Specify the model
• Specify any parallel primitives you need
• e.g., EREW PRAM, binary-forking with CAS, etc.
• When talking about the execution time:
• Also need to specify the scheduling algorithm
• Usually, the more (or the stronger) primitives you use, the better
your bound looks, but the less interesting/practical the result is
• E.g., assume constant time parallel reduce – we can get a constant time
sorting algorithm
Fibonacci Numbers
Fibonacci Numbers
• The n-th Fibonacci number can be computed as:

int F(int n) {
  if (n <= 1) return n;
  else {
    In parallel:
      int A = F(n-1);
      int B = F(n-2);
    return A + B;
  }
}

[Figure: the dependency chain 0, 1, …, n-2, n-1, n.]

• This is not an efficient parallel algorithm. Why? Because the dependency
chain is still long (O(n) depth) and there is much redundant work (the
total work is exponential)
• In the homework we'll see a more efficient parallel algorithm
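As a fork-join illustration only (not the efficient algorithm), here is how the naive version might look in Cilk; the name fib is mine:

#include <cilk/cilk.h>

// Naive fork-join Fibonacci: exponential work, O(n) depth.
long long fib(int n) {
  if (n <= 1) return n;
  long long a = cilk_spawn fib(n - 1);   // fork
  long long b = fib(n - 2);
  cilk_sync;                             // join
  return a + b;
}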
Parallel Programming Tools
Parallel Tools and Schedulers
In this course the following two schedulers are recommended
for your homework and course project.
• Cilk
• PBBS
You can also use other languages/schedulers that you are more
familiar with, e.g., OpenMP, Intel TBB, etc.
Cilk
• Fork-join parallelism
• Headers: #include <cilk/cilk.h> and #include <cilk/cilk_api.h>
• LLVM: https://cilkplus.github.io/
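A minimal sketch (my own, mirroring the PBBS example below) of the divide-and-conquer reduce written with cilk_spawn/cilk_sync:

#include <cilk/cilk.h>

int reduce(int* A, int n) {
  if (n == 1) return A[0];
  int L = cilk_spawn reduce(A, n/2);   // fork the left half
  int R = reduce(A + n/2, n - n/2);    // right half runs in this thread
  cilk_sync;                           // join before combining
  return L + R;
}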
PBBS (Problem-based benchmark suite)
• Code available at: https://github.com/cmuparlay/pbbslib
#include "pbbslib/utilities.h"
// You can also use cilk or openmp to compile your code

void reduce(int* A, int n, int& ret) {
  if (n == 1) ret = A[0];
  else {
    int L, R;
    par_do([&] () { reduce(A, n/2, L); },           // the two arguments are lambda
           [&] () { reduce(A+n/2, n-n/2, R); });    // expressions (must be function calls)
    ret = L + R;
  }
}
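A hypothetical usage sketch (the main function is mine, not from the slides):

#include <vector>

int main() {
  std::vector<int> A(1 << 20, 1);         // 2^20 ones
  int sum = 0;
  reduce(A.data(), (int)A.size(), sum);   // sum becomes 2^20
  return 0;
}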
About homework
• Sample code available using PBBS and Cilk in homework 1
• You will implement your own version of a scan algorithm –
add any optimizations that you think could help, and see if
they really help
• Use figures and tables to show the numbers you get
• Analyze the numbers to explain any interesting/abnormal
phenomena
• There is an assignment entry on ilearn now; you can submit
your code there
About homework
• The goal of the programming part is to let you learn, through
practice, some tricks and optimizations for implementing
parallel algorithms – the process of learning matters
About paper review
• What problem is solved in the paper? What is the motivation?
• Why is the problem challenging? How did previous work solve
the problem, and why didn't those solutions work?
• What are the key technical ideas to solve the challenges?
• What are the new theoretical results (if any)?
• Why did they design the experiments (if any) that way?
• What do the experimental results (if any) tell us?
• What do you think is the strength/novelty of the work?
• What do you think is the weakness of the work? Do you have
ideas to improve that?
• What are the possible directions for future work?
• Do you have any questions about the work?
About paper review
• A useful document of some paper review tips:
https://people.inf.ethz.ch/troscoe/pubs/review-writing.pdf
• Your paper review is slightly different since you are reviewing papers
that have already been published