04 Progbasics
Review: 3 parallel programming models
▪ Shared address space
- Communication is unstructured, implicit in loads and stores
- Natural way of programming, but can shoot yourself in the foot easily
- Program might be correct, but not perform well
▪ Message passing
- Structure all communication as messages
- Often harder to get first correct program than shared address space
- Structure often helpful in getting to first correct, scalable program
▪ Data parallel
- Structure computation as a big “map” over a collection
- Assumes a shared address space from which to load inputs/store results, but
model severely limits communication between iterations of the map
(goal: preserve independent processing of iterations)
- Modern embodiments encourage, but don’t enforce, this structure
Figure credit: Culler, Singh, and Gupta
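To make the data-parallel bullet above concrete, here is a minimal sketch of a “map” over a collection (C with OpenMP; an illustrative addition, not from the slides; scale and its arguments are hypothetical names):

#include <omp.h>

// Data-parallel “map”: every iteration is independent, so iterations may run
// in parallel in any order. Inputs and outputs live in a shared address space,
// but iterations do not communicate with one another.
void scale(float* out, const float* in, int n, float s) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = s * in[i];    // independent per-element work
}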
Where are the dependencies?
Dependencies in one time step of the ocean simulation
[Figure: dependency graph; boxes correspond to computations on grids]
There is parallelism both within a grid (data parallelism) and across operations on the different grids.
The implementation only leverages data parallelism (for simplicity).
Figure credit: Culler, Singh, and Gupta
Galaxy evolution
Barnes-Hut algorithm
[Figure: Barnes-Hut spatial subdivision; a cell of size L at distance D from a particle can be approximated when L/D is small]

Speedup(P processors) = Time(1 processor) / Time(P processors)
Creating a parallel program
Decomposition → Subproblems (a.k.a. “tasks”, “work to do”)
Assignment → Parallel threads ** (“workers”)
Orchestration → Parallel program (communicating threads)
Mapping → Execution on parallel machine
** I had to pick a term

Example: a two-phase program; each phase performs N² total work
[Plot: sequential program. Phase 1: time N², parallelism N. Phase 2: time N², parallelism 1. Axes: parallelism vs. execution time]
First attempt at parallelism (P processors)
Strategy:
- Step 1: execute in parallel
  - time for phase 1: N²/P
- Step 2: execute serially
  - time for phase 2: N²
▪ Overall performance:
  Speedup = (N² + N²) / (N²/P + N²), so Speedup ≤ 2
[Plot: sequential program (time N² + N²) vs. parallel program (phase 1: parallelism P, time N²/P; phase 2: parallelism 1, time N²). Axes: parallelism vs. execution time]
Parallelizing step 2
Strategy:
- Step 1: execute in parallel
  - time for phase 1: N²/P
- Step 2: compute partial sums in parallel, combine results serially
  - time for phase 2: N²/P + P
Overall performance:
- Speedup = 2N² / (2N²/P + P)
- Speedup overhead: combining the partial sums
[Plot: parallel program. Phase 1: parallelism P, time N²/P. Phase 2: parallelism P for the partial sums (time N²/P), then parallelism 1 to combine them (time P)]
Note: speedup → P when N >> P
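The partial-sum strategy can be sketched in code. A minimal sketch (an addition, not from the slides), assuming step 1 is some independent per-element operation f on an N×N array A, P worker threads, and that all workers are joined before the final combine; partial and worker are illustrative names:

float partial[P];                  // one slot per worker thread

void worker(int tid) {
    int lo = tid * (N * N / P);    // this worker's block of the N*N elements
    int hi = lo + (N * N / P);
    float sum = 0.f;
    for (int k = lo; k < hi; k++) {
        A[k] = f(A[k]);            // phase 1: independent work, N²/P per worker
        sum += A[k];               // phase 2a: partial sum, also N²/P per worker
    }
    partial[tid] = sum;            // no lock needed: one writer per slot
}

// after joining all workers:
// float total = 0.f;
// for (int p = 0; p < P; p++)
//     total += partial[p];        // phase 2b: serial combine, O(P)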
Amdahl’s law
Let S = the fraction of total work that is inherently sequential
Max speedup on P processors given by:
  speedup ≤ 1 / (S + (1 - S)/P)
[Plot: max speedup vs. processor count P for S = 0.01, 0.05, and 0.1; each curve flattens toward its asymptote 1/S]
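Worked example (not from the slides): with S = 0.05 and P = 64 processors, speedup ≤ 1 / (0.05 + 0.95/64) ≈ 15.4; even as P → ∞, the speedup can never exceed 1/S = 20.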
Dynamic assignment of tasks to worker threads (sketch below)
List of tasks: task 0, task 1, task 2, task 3, task 4, ... task 99
Assignment policy: after completing its current task, a worker thread inspects the list and assigns itself the next uncompleted task.
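A minimal sketch of this self-assignment policy, assuming a shared counter stands in for the task list (pthreads; do_task and NUM_TASKS are illustrative names, not from the slides):

#include <pthread.h>

#define NUM_TASKS 100

void do_task(int task);                  // assumed application-specific work

int next_task = 0;                       // index of next uncompleted task
pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

void* worker(void* arg) {
    for (;;) {
        pthread_mutex_lock(&list_lock);  // inspect the list atomically
        int task = next_task++;          // assign self the next uncompleted task
        pthread_mutex_unlock(&list_lock);
        if (task >= NUM_TASKS)
            return NULL;                 // all tasks completed
        do_task(task);
    }
}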
Often, the reason a problem requires lots of computation (and needs to be parallelized) is that it involves manipulating a lot of data.
I’ve described the process of parallelizing programs as an act of partitioning computation.
Often, it’s equally valid to think of partitioning data (the computations go with the data).
But there are many computations where the correspondence between work-to-do (“tasks”) and data is less clear. In these cases it’s natural to think of partitioning computation.
A parallel programming example
Grid solver example from: Culler, Singh, and Gupta
Grid solver algorithm
C-like pseudocode for the sequential algorithm is provided below:

const int n;     // grid size
float* A;        // assume allocated as an (n+2) x (n+2) grid

void solve(float* A) {
  ...
}
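Since the body is elided above, the following is a sketch of the sequential sweep, reconstructed from the parallel versions later in this section (their red-black update ordering is omitted; the A[i,j] indexing is slide pseudocode, not valid C):

void solve(float* A) {
  bool done = false;
  while (!done) {
    float diff = 0.f;                          // total change this sweep
    for (int j = 1; j <= n; j++) {
      for (int i = 1; i <= n; i++) {
        float prev = A[i,j];
        // update cell to the average of itself and its four neighbors
        A[i,j] = 0.2f * (A[i,j] + A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]);
        diff += abs(A[i,j] - prev);
      }
    }
    if (diff / (n*n) < TOLERANCE)              // converged: average change is small
      done = true;
  }
}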
const int n;
float* A = allocate(n+2, n+2);   // allocate grid
void solve(float* A) {
  ...
}

Assignment: ???
Grid solver example from: Culler, Singh, and Gupta
Shared address space (with SPMD threads) expression of solver
[Figure: grid rows partitioned into contiguous blocks, one block per processor P1..P4]

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
  int threadId = getThreadId();                      // value differs per SPMD instance:
  int myMin = 1 + (threadId * n / NUM_PROCESSORS);   // use it to compute the region of
  int myMax = myMin + (n / NUM_PROCESSORS);          // the grid this thread works on
  while (!done) {
    diff = 0.f;
    barrier(myBarrier, NUM_PROCESSORS);
    for (j=myMin to myMax) {                         // each thread computes only the
      for (i = red cells in this row) {              // rows it is responsible for updating
        float prev = A[i,j];
        A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                         A[i+1,j] + A[i,j+1]);
        lock(myLock);
        diff += abs(A[i,j] - prev);
        unlock(myLock);
      }
    }
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff/(n*n) < TOLERANCE)   // check convergence; all threads get same answer
      done = true;
    barrier(myBarrier, NUM_PROCESSORS);
  }
}

Grid solver example from: Culler, Singh, and Gupta
Review: need for mutual exclusion
Each thread executes:
- Load the value of diff into register r1
- Add the register r2 to register r1
- Store the value of register r1 into diff
One possible interleaving (let starting value of diff = 0, r2 = 1):

T0                  T1
r1 ← diff                          (T0 reads value 0)
                    r1 ← diff      (T1 reads value 0)
r1 ← r1 + r2                       (T0 sets value of its r1 to 1)
                    r1 ← r1 + r2   (T1 sets value of its r1 to 1)
diff ← r1                          (T0 stores 1 to diff)
                    diff ← r1      (T1 stores 1 to diff)

The final value of diff is 1, not the expected 2: one update is lost. The load-add-store sequence must execute atomically.
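One way to make the load-add-store sequence indivisible is an atomic read-modify-write; a minimal sketch using C11 atomics (an illustrative addition; add_to_diff is a hypothetical name, and the solver code below uses a lock instead):

#include <stdatomic.h>

atomic_int diff = 0;             // shared accumulator from the example above

void add_to_diff(int r2) {
    // the load, add, and store happen as one indivisible operation,
    // so concurrent updates can no longer be lost
    atomic_fetch_add(&diff, r2);
}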
Solver with per-thread partial sums: each thread accumulates into a private myDiff, so the lock is taken once per iteration instead of once per cell

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
  float myDiff;
  int threadId = getThreadId();
  int myMin = 1 + (threadId * n / NUM_PROCESSORS);
  int myMax = myMin + (n / NUM_PROCESSORS);
  while (!done) {
    myDiff = 0.f;
    diff = 0.f;
    barrier(myBarrier, NUM_PROCESSORS);
    for (j=myMin to myMax) {
      for (i = red cells in this row) {
        float prev = A[i,j];
        A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                         A[i+1,j] + A[i,j+1]);
        myDiff += abs(A[i,j] - prev);
      }
    }
    lock(myLock);
    diff += myDiff;               // one locked update per thread per iteration
    unlock(myLock);
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff/(n*n) < TOLERANCE)   // check convergence; all threads get same answer
      done = true;
    barrier(myBarrier, NUM_PROCESSORS);
  }
}

Grid solver example from: Culler, Singh, and Gupta
Shared address space solver: one barrier

Idea: remove dependencies by using different diff variables in successive loop iterations. (Three copies are needed: a thread that races ahead resets the diff for its next iteration, and with only two copies that reset could clobber a value another thread is still checking for convergence.)

int n;               // grid size
bool done = false;
LOCK myLock;
BARRIER myBarrier;
float diff[3];       // global diff, but now 3 copies
float *A = allocate(n+2, n+2);

void solve(float* A) {
  float myDiff;
  int index = 0;     // thread-local
  diff[0] = 0.0f;
  barrier(myBarrier, NUM_PROCESSORS);   // one-time only: for initialization
  while (!done) {
    myDiff = 0.0f;
    //
    // perform computation (accumulate locally into myDiff)
    //
    lock(myLock);
    diff[index] += myDiff;              // atomically update global diff
    unlock(myLock);
    diff[(index+1) % 3] = 0.0f;         // reset the next iteration's diff
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff[index]/(n*n) < TOLERANCE)
      break;
    index = (index + 1) % 3;
  }
}

Grid solver example from: Culler, Singh, and Gupta
More on specifying dependencies
Barriers: simple, but conservative (coarse-granularity dependencies)
- All work in program up until this point (for all threads) must finish before any
thread begins next phase

Specifying a specific dependency with a flag:

T0:
  x = 1;               // produce x, then let T1 know
  flag = 1;
  // do more work here...

T1:
  // do stuff independent of x here
  while (flag == 0);   // wait until T0 sets the flag
  print x;
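A plain shared flag like this is not reliable under modern compilers and memory models; a sketch of the same pattern with C11 atomics (an addition, not from the slides; t0 and t1 are illustrative names):

#include <stdatomic.h>
#include <stdio.h>

int x = 0;
atomic_int flag = 0;

void t0(void) {
    x = 1;                                       // produce x
    atomic_store_explicit(&flag, 1,              // then let T1 know; release ensures
                          memory_order_release); // the write to x is visible first
}

void t1(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                        // spin until flag is set
    printf("%d\n", x);                           // acquire pairs with release: sees x == 1
}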
[Figure: message passing model: each processor has its own local cache and memory, connected by a network; each thread operates within its own private address space]
Message passing expression of solver
Similar structure to the shared address space solver, but now communication is explicit in sends and receives.

// per-thread logic: each thread owns rows_per_thread rows plus two ghost rows
float* localA = allocate(rows_per_thread+2, N+2);
// assume localA is initialized with starting values
// assume MSG_ID_ROW, MSG_ID_DONE, MSG_ID_DIFF are constants used as msg ids

void solve() {
  bool done = false;
  while (!done) {
    float my_diff = 0.0f;

    // send and receive ghost rows to/from “neighbor threads”
    if (tid != 0)
      send(&localA[1,0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
    if (tid != get_num_threads()-1)
      send(&localA[rows_per_thread,0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);
    if (tid != 0)
      recv(&localA[0,0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
    if (tid != get_num_threads()-1)
      recv(&localA[rows_per_thread+1,0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);

    // perform computation on local rows, accumulating into my_diff (elided)

    if (tid != 0) {
      // all threads send local my_diff to thread 0
      send(&my_diff, sizeof(float), 0, MSG_ID_DIFF);
      recv(&done, sizeof(bool), 0, MSG_ID_DONE);
    } else {
      // thread 0 computes global diff, evaluates the termination
      // predicate, and sends the result back to all other threads
      float remote_diff;
      for (int i=1; i<get_num_threads(); i++) {
        recv(&remote_diff, sizeof(float), i, MSG_ID_DIFF);
        my_diff += remote_diff;
      }
      if (my_diff/(N*N) < TOLERANCE)
        done = true;
      for (int i=1; i<get_num_threads(); i++)
        send(&done, sizeof(bool), i, MSG_ID_DONE);
    }
  }
}

Example pseudocode from: Culler, Singh, and Gupta
Notes on message passing example
▪ Computation
- Array indexing is relative to local address space (not global grid coordinates)
▪ Communication:
- Performed by sending and receiving messages
- Bulk transfer: communicate entire rows at a time (not individual elements)
▪ Synchronization:
- Performed by sending and receiving messages
- Think of how to implement mutual exclusion, barriers, flags using messages
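As one instance of the last point, a barrier can be built from messages alone; a minimal centralized sketch using the send/recv primitives above (an addition, not from the slides; MSG_ID_BARRIER is an assumed message id):

void msg_barrier(int tid, int num_threads) {
    int token = 0;
    if (tid != 0) {
        send(&token, sizeof(int), 0, MSG_ID_BARRIER);      // signal arrival to thread 0
        recv(&token, sizeof(int), 0, MSG_ID_BARRIER);      // block until released
    } else {
        for (int i = 1; i < num_threads; i++)
            recv(&token, sizeof(int), i, MSG_ID_BARRIER);  // wait for all arrivals
        for (int i = 1; i < num_threads; i++)
            send(&token, sizeof(int), i, MSG_ID_BARRIER);  // release everyone
    }
}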
Synchronous (blocking) send and receive
▪ send(): call returns when the sender receives acknowledgement that the message
data now resides in the address space of the receiver
▪ recv(): call returns when data from received message is copied into address
space of receiver and acknowledgement sent back to sender
[Diagram: sender and receiver timelines: sender blocks until the ack arrives; receiver blocks until the data is copied in]

Avoiding deadlock with blocking sends: send and receive ghost rows to “neighbor threads”
- Even-numbered threads send, then receive
- Odd-numbered threads receive, then send
Why? (If every thread issued a blocking send first, each send would wait for a matching receive that no thread had reached: deadlock.)

void solve() {
  bool done = false;
  while (!done) {
    float my_diff = 0.0f;
    if (tid % 2 == 0) {
      sendDown(); recvDown();
      sendUp(); recvUp();
    } else {
      recvUp(); sendUp();
      recvDown(); sendDown();
    }
    ...
  }
}
Example pseudocode from: Culler, Singh, and Gupta
Non-blocking asynchronous send/recv
▪ send(): call returns immediately
- Buffer provided to send() cannot be modified by calling thread, since message processing
occurs concurrently with thread execution
- Calling thread can perform other work while waiting for message to be sent
▪ recv(): call also returns immediately; use CHECKSEND()/CHECKRECV() (below) to
determine the actual status of the send/receipt
Sender:                                      Receiver:
Call SEND(foo)                               Call RECV(bar)
SEND returns handle h1                       RECV returns handle h2
Copy data from ‘foo’ into network buffer
Send message                                 Receive message
                                             Messaging library copies data into ‘bar’
Call CHECKSEND(h1)                           Call CHECKRECV(h2)
// if message sent, now safe                 // if received, now safe for thread
// for thread to modify ‘foo’                // to access ‘bar’
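For reference, this interface closely mirrors MPI’s nonblocking operations; a sketch (an addition, not from the slides; exchange and peer are illustrative names, and MPI_Test plays the role of CHECKSEND/CHECKRECV):

#include <mpi.h>

// Nonblocking exchange in MPI: Isend/Irecv return immediately with request
// handles; MPI_Test reports whether each operation has completed.
void exchange(float* foo, float* bar, int n, int peer) {
    MPI_Request h1, h2;
    int sent = 0, received = 0;

    MPI_Isend(foo, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &h1);
    MPI_Irecv(bar, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &h2);

    // ... perform other work while the messages are in flight ...

    while (!sent)
        MPI_Test(&h1, &sent, MPI_STATUS_IGNORE);      // safe to modify foo once sent
    while (!received)
        MPI_Test(&h2, &received, MPI_STATUS_IGNORE);  // safe to access bar once received
}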