04_progbasics
Speedup(P processors) = Time(1 processor) / Time(P processors)
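For example (with made-up numbers): a program that takes 10 seconds on one processor and 2.5 seconds on four processors achieves Speedup(4 processors) = 10 / 2.5 = 4.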
* Other goals include achieving high efficiency (cost, area, power, etc.) or working on bigger problems than can fit on one machine
Creating a parallel program
Problem to solve
→ Decomposition →
Subproblems (a.k.a. “tasks”, “work to do”)
→ Assignment →
Parallel threads ** (“workers”)
→ Orchestration →
Parallel program (communicating threads)
→ Mapping →
Execution on parallel machine

** I had to pick a term
Adopted from: Culler, Singh, and Gupta
Problem decomposition
▪ Break up problem into tasks that can be carried out in parallel
▪ In general: create at least enough tasks to keep all execution units on a machine busy
[Figure: execution-time diagram of the example sequential program: two steps, each doing N² work, running at parallelism 1]
First attempt at parallelism (P processors)
▪ Strategy:
- Step 1: execute in parallel
- time for phase 1: N²/P
- Step 2: execute serially
- time for phase 2: N²
▪ Speedup = (N² + N²) / (N²/P + N²) ≤ 2, no matter how large P is, because step 2 still runs at parallelism 1
[Figure: execution-time bars comparing the sequential program (two N² phases at parallelism 1) with the parallel program (an N²/P phase at parallelism P followed by an N² phase at parallelism 1)]
Parallelizing step 2
▪ Strategy:
- Step 1: execute in parallel
- time for phase 1: N²/P
- Step 2: compute partial sums in parallel, combine results serially
- time for phase 2: N²/P + P
▪ Overall performance:
- Speedup = 2N² / (2N²/P + P)
- Overhead of the parallel algorithm: combining the partial sums (the extra P term)
[Figure: execution-time bar for the parallel program: two N²/P phases at parallelism P, followed by a serial combine phase of length P]
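For example (with made-up numbers): with N = 1024 and P = 64, parallel time is 2N²/P + P = 32,768 + 64 = 32,832 versus a sequential time of 2N² = 2,097,152, a speedup of roughly 64.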
Amdahl’s law
▪ Let S = the fraction of total work that is inherently sequential
▪ Max speedup on P processors given by:
speedup ≤ 1 / (S + (1 − S)/P)
[Plot: max speedup vs. number of processors, for S = 0.05 and S = 0.1]
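As a quick check of the curves above, here is a small stand-alone C++ sketch (mine, not from the course materials) that evaluates the bound 1 / (S + (1 − S)/P) for the two serial fractions shown:

#include <cstdio>

// Amdahl's law: max speedup on P processors with serial fraction S
double amdahl_speedup(double S, double P) {
    return 1.0 / (S + (1.0 - S) / P);
}

int main() {
    for (double S : {0.05, 0.1}) {
        std::printf("S = %.2f:\n", S);
        for (double P : {1.0, 16.0, 256.0, 4096.0, 65536.0}) {
            std::printf("  P = %7.0f -> max speedup = %.1f\n", P, amdahl_speedup(S, P));
        }
    }
}

Note the asymptote: as P grows, speedup approaches 1/S (20x for S = 0.05, 10x for S = 0.1).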
A small serial region can limit speedup on a large parallel machine
Summit supercomputer: 27,648 GPUs x (5,376 ALUs/GPU) = 148,635,648 ALUs
Machine can perform 148 million single precision operations in parallel
What is max speedup if 0.1% of application is serial?
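Applying Amdahl's law with S = 0.001: speedup ≤ 1 / (0.001 + 0.999 / 148,635,648) ≈ 1,000. Even with 148 million ALUs, the 0.1% serial portion caps the speedup at about 1,000x.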
Grid solver example from: Culler, Singh, and Gupta
Grid solver algorithm: find the dependencies
Pseudocode for the sequential algorithm is provided below
const int n;
float* A; // assume allocated for grid of N+2 x N+2 elements
void solve(float* A) {
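  // Completion sketch (not verbatim from the slides): the sequential sweep used
  // throughout this example — update each interior cell from itself and its four
  // neighbors, and iterate until the average change falls below TOLERANCE.
  bool done = false;
  while (!done) {
    float diff = 0.f;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= n; j++) {
        float prev = A[i,j];
        A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j] + A[i,j+1]);
        diff += abs(A[i,j] - prev);
      }
    }
    if (diff/(n*n) < TOLERANCE)   // check convergence
      done = true;
  }
}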
Step 1: identify dependencies
(problem decomposition phase)
[Figure: N x N grid of cells showing the dependency structure]
Each row element depends on the element to its left.
Update all red cells in parallel
Assignment: ???
const int n;
float* A = allocate(n+2, n+2);  // allocate grid
void solve(float* A) {
Shared address space (with SPMD threads) expression of solver
while (!done) {
  float myDiff = 0.f;
  diff = 0.f;
  barrier(myBarrier, NUM_PROCESSORS);
  for (j = myMin to myMax) {           // each thread computes the rows it is responsible for updating
    for (i = red cells in this row) {
      float prev = A[i,j];
      A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j] + A[i,j+1]);
      myDiff += abs(A[i,j] - prev);
    }
  }
  barrier(myBarrier, NUM_PROCESSORS);  // and these barriers?
}
Synchronization in a shared address space
(Pseudocode provided in a fake C-like language for brevity.)
A common metaphor: a shared address space is like a bulletin board.
Coordinating access to shared variables with synchronization
Shared (among all threads) variables:
int x = 0;
Lock my_lock;
Thread 1:
my_lock.lock();
x++;
my_lock.unlock();
print(x);

Thread 2:
my_lock.lock();
x++;
my_lock.unlock();
print(x);
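A minimal runnable C++ version of this example (my own sketch, not the course's code), using std::thread and std::mutex in place of the pseudocode Lock; the lock makes each x++ a proper critical section, so the final value of x is always 2:

#include <cstdio>
#include <mutex>
#include <thread>

int x = 0;           // shared among all threads
std::mutex my_lock;  // protects x

void worker() {
    my_lock.lock();
    x++;                 // critical section: read-modify-write of x
    int seen = x;        // read x while still holding the lock
    my_lock.unlock();
    std::printf("x = %d\n", seen);
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();           // printed values are 1 and 2, or 2 and 2, depending on interleaving
}

Without the lock, both threads could read the same stale value of x and the final result could be 1.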
// allocate grid
float* A = allocate(n+2, n+2);

while (!done) {
  float myDiff = 0.f;
  diff = 0.f;
  barrier(myBarrier, NUM_PROCESSORS);
  for (j = myMin to myMax) {           // each thread computes the rows it is responsible for updating
    for (i = red cells in this row) {
      float prev = A[i,j];
      A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j] + A[i,j+1]);
      myDiff += abs(A[i,j] - prev);
    }
  }
  lock(myLock);
  diff += myDiff;                      // lock for mutual exclusion
  unlock(myLock);
  barrier(myBarrier, NUM_PROCESSORS);  // hint
  if (diff/(n*n) < TOLERANCE)          // check convergence, all threads get same answer
    done = true;
}

Why are the lock and the barriers needed in this implementation?
Shared address space solver (pseudocode in SPMD execution model)
int n;                 // grid size
bool done = false;
float diff;            // global diff (shared by all threads)
LOCK myLock;
BARRIER myBarrier;

// allocate grid
float* A = allocate(n+2, n+2);

// Each thread computes a partial sum locally, then completes the global
// reduction at the end of the iteration.
void solve(float* A) {
  float myDiff;
  int threadId = getThreadId();
  int myMin = 1 + (threadId * n / NUM_PROCESSORS);
  int myMax = myMin + (n / NUM_PROCESSORS);

  while (!done) {
    myDiff = 0.f;
    diff = 0.f;
    barrier(myBarrier, NUM_PROCESSORS);
    for (j = myMin to myMax) {
      for (i = red cells in this row) {
        float prev = A[i,j];
        A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] + A[i+1,j] + A[i,j+1]);
        myDiff += abs(A[i,j] - prev);   // compute partial sum per worker
      }
    }
    lock(myLock);
    diff += myDiff;                     // now we only lock once per thread, not once per (i,j) loop iteration!
    unlock(myLock);
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff/(n*n) < TOLERANCE)         // check convergence, all threads get same answer
      done = true;
    barrier(myBarrier, NUM_PROCESSORS);
  }
}
Barrier synchronization primitive
▪ barrier(num_threads): wait until num_threads threads have arrived at this point in the program before any thread continues
▪ Barriers are a conservative way to express dependencies
[Figure: timeline of threads P1–P4, each computing red cells and then waiting at a barrier before continuing]
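A small self-contained C++20 sketch of the same primitive (my own example, not from the slides): four threads each finish a "compute red cells" phase, and no thread continues until all have arrived at the barrier.

#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int NUM_THREADS = 4;
    std::barrier sync_point(NUM_THREADS);   // plays the role of barrier(myBarrier, NUM_PROCESSORS)

    std::vector<std::thread> workers;
    for (int id = 0; id < NUM_THREADS; id++) {
        workers.emplace_back([&, id] {
            std::printf("thread %d: compute red cells\n", id);
            sync_point.arrive_and_wait();   // wait until all NUM_THREADS threads reach this point
            std::printf("thread %d: continue past barrier\n", id);
        });
    }
    for (auto& t : workers) t.join();
}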
(The three-barrier solver code from the previous slide is repeated here.)
Shared address space solver: one barrier
Idea: remove dependencies by using different diff variables in successive loop iterations.
Trade off footprint for removing dependencies! (a common parallel programming technique)

int n;                // grid size
bool done = false;
LOCK myLock;
BARRIER myBarrier;
float diff[3];        // global diff, but now 3 copies
float *A = allocate(n+2, n+2);

void solve(float* A) {
  float myDiff;       // thread local variable
  int index = 0;      // thread local variable

  diff[0] = 0.0f;
  barrier(myBarrier, NUM_PROCESSORS);   // one-time only: just for init
  while (!done) {
    myDiff = 0.0f;
    //
    // perform computation (accumulate locally into myDiff)
    //
    lock(myLock);
    diff[index] += myDiff;              // atomically update global diff
    unlock(myLock);
    diff[(index+1) % 3] = 0.0f;
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff[index]/(n*n) < TOLERANCE)
      break;
    index = (index + 1) % 3;
  }
}
Grid solver implementation in two programming models
▪ Data-parallel programming model
- Synchronization:
- Single logical thread of control, but iterations of forall loop may be parallelized by the system (implicit barrier at end of forall loop body)
- Communication:
- Implicit in loads and stores (like shared address space)
- Special built-in primitives for more complex communication patterns: e.g., reduce
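To make these two points concrete, here is a small stand-alone C++17 sketch (my own, using the standard parallel algorithms rather than any course-specific primitives): a forall-style parallel loop whose call only returns once every iteration has finished (the implicit barrier), followed by a built-in reduction.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    const int n = 1024;
    std::vector<float> cur(n), next(n), change(n);
    std::iota(cur.begin(), cur.end(), 0.0f);       // some initial values

    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);

    // forall-style loop: iterations may be parallelized by the library; the call
    // does not return until all iterations are done (implicit barrier at the end).
    std::for_each(std::execution::par, idx.begin(), idx.end(), [&](int i) {
        int left  = (i == 0)     ? i : i - 1;
        int right = (i == n - 1) ? i : i + 1;
        next[i]   = 0.5f * (cur[left] + cur[right]);   // simple 1D averaging stand-in for the grid update
        change[i] = std::fabs(next[i] - cur[i]);
    });

    // built-in reduction primitive: combine all per-element changes into one value
    float diff = std::reduce(std::execution::par, change.begin(), change.end(), 0.0f);
    std::printf("total change = %f\n", diff);
}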