04 Progbasics
Review: 3 parallel programming models
▪ Shared address space
- Communication is unstructured, implicit in loads and stores
- Natural way of programming, but can shoot yourself in the foot easily
- Program might be correct, but not perform well
▪ Message passing
- Structure all communication as messages
- Often harder to get first correct program than shared address space
- Structure often helpful in getting to first correct, scalable program
▪ Data parallel
- Structure computation as a big “map” over a collection
- Assumes a shared address space from which to load inputs/store results, but
model severely limits communication between iterations of the map
(goal: preserve independent processing of iterations)
- Modern embodiments encourage, but don’t enforce, this structure
Figure credit: Culler, Singh, and Gupta
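To make the data-parallel bullet above concrete, here is a minimal sketch of a “map” over a collection (C with OpenMP; an illustrative addition, not from the slides; scale and its arguments are hypothetical names):

#include <omp.h>

// Data-parallel “map”: every iteration is independent, so iterations may run
// in parallel in any order. Inputs and outputs live in a shared address space,
// but iterations do not communicate with one another.
void scale(float* out, const float* in, int n, float s) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = s * in[i];    // independent per-element work
}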
Where are the dependencies?
Dependencies in one time step of the ocean simulation
[Figure: dependency graph; boxes correspond to computations on grids]
There is parallelism both within a grid (data parallelism) and across operations on the different grids.
The implementation only leverages data parallelism (for simplicity).
Figure credit: Culler, Singh, and Gupta
Galaxy evolution
Barnes-Hut algorithm
[Figure: Barnes-Hut spatial subdivision; a cell of size L at distance D from a particle can be approximated when L/D is small]

Speedup(P processors) = Time(1 processor) / Time(P processors)
Creating a parallel program
Decomposition → Subproblems (a.k.a. “tasks”, “work to do”)
Assignment → Parallel threads ** (“workers”)
Orchestration → Parallel program (communicating threads)
Mapping → Execution on parallel machine
** I had to pick a term

Example: a two-phase program; each phase performs N² total work
[Plot: sequential program. Phase 1: time N², parallelism N. Phase 2: time N², parallelism 1. Axes: parallelism vs. execution time]
First attempt at parallelism (P processors)
Strategy:
- Step 1: execute in parallel
  - time for phase 1: N²/P
- Step 2: execute serially
  - time for phase 2: N²
▪ Overall performance:
  Speedup = (N² + N²) / (N²/P + N²), so Speedup ≤ 2
[Plot: sequential program (time N² + N²) vs. parallel program (phase 1: parallelism P, time N²/P; phase 2: parallelism 1, time N²). Axes: parallelism vs. execution time]
Parallelizing step 2
Strategy:
- Step 1: execute in parallel
  - time for phase 1: N²/P
- Step 2: compute partial sums in parallel, combine results serially
  - time for phase 2: N²/P + P
Overall performance:
- Speedup = 2N² / (2N²/P + P)
- Speedup overhead: combining the partial sums
[Plot: parallel program. Phase 1: parallelism P, time N²/P. Phase 2: parallelism P for the partial sums (time N²/P), then parallelism 1 to combine them (time P)]
Note: speedup → P when N >> P
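The partial-sum strategy can be sketched in code. A minimal sketch (an addition, not from the slides), assuming step 1 is some independent per-element operation f on an N×N array A, P worker threads, and that all workers are joined before the final combine; partial and worker are illustrative names:

float partial[P];                  // one slot per worker thread

void worker(int tid) {
    int lo = tid * (N * N / P);    // this worker's block of the N*N elements
    int hi = lo + (N * N / P);
    float sum = 0.f;
    for (int k = lo; k < hi; k++) {
        A[k] = f(A[k]);            // phase 1: independent work, N²/P per worker
        sum += A[k];               // phase 2a: partial sum, also N²/P per worker
    }
    partial[tid] = sum;            // no lock needed: one writer per slot
}

// after joining all workers:
// float total = 0.f;
// for (int p = 0; p < P; p++)
//     total += partial[p];        // phase 2b: serial combine, O(P)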
Amdahl’s law
Let S = the fraction of total work that is inherently sequential
Max speedup on P processors given by:
  speedup ≤ 1 / (S + (1 - S)/P)
[Plot: max speedup vs. processor count P for S = 0.01, 0.05, and 0.1; each curve flattens toward its asymptote 1/S]
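Worked example (not from the slides): with S = 0.05 and P = 64 processors, speedup ≤ 1 / (0.05 + 0.95/64) ≈ 15.4; even as P → ∞, the speedup can never exceed 1/S = 20.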
Dynamic assignment of tasks to worker threads (sketch below)
List of tasks: task 0, task 1, task 2, task 3, task 4, ... task 99
Assignment policy: after completing its current task, a worker thread inspects the list and assigns itself the next uncompleted task.
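A minimal sketch of this self-assignment policy, assuming a shared counter stands in for the task list (pthreads; do_task and NUM_TASKS are illustrative names, not from the slides):

#include <pthread.h>

#define NUM_TASKS 100

void do_task(int task);                  // assumed application-specific work

int next_task = 0;                       // index of next uncompleted task
pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

void* worker(void* arg) {
    for (;;) {
        pthread_mutex_lock(&list_lock);  // inspect the list atomically
        int task = next_task++;          // assign self the next uncompleted task
        pthread_mutex_unlock(&list_lock);
        if (task >= NUM_TASKS)
            return NULL;                 // all tasks completed
        do_task(task);
    }
}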
Often, the reason a problem requires lots of computation (and needs to be parallelized) is that it involves manipulating a lot of data.
I’ve described the process of parallelizing programs as an act of partitioning computation.
Often, it’s equally valid to think of partitioning data (the computations go with the data).
But there are many computations where the correspondence between work-to-do (“tasks”) and data is less clear. In these cases it’s natural to think of partitioning computation.
A parallel programming example
Grid solver example from: Culler, Singh, and Gupta
Grid solver algorithm
C-like pseudocode for the sequential algorithm is provided below:

const int n;     // grid size
float* A;        // assume allocated as an (n+2) x (n+2) grid

void solve(float* A) {
  ...
}
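Since the body is elided above, the following is a sketch of the sequential sweep, reconstructed from the parallel versions later in this section (their red-black update ordering is omitted; the A[i,j] indexing is slide pseudocode, not valid C):

void solve(float* A) {
  bool done = false;
  while (!done) {
    float diff = 0.f;                          // total change this sweep
    for (int j = 1; j <= n; j++) {
      for (int i = 1; i <= n; i++) {
        float prev = A[i,j];
        // update cell to the average of itself and its four neighbors
        A[i,j] = 0.2f * (A[i,j] + A[i-1,j] + A[i+1,j] + A[i,j-1] + A[i,j+1]);
        diff += abs(A[i,j] - prev);
      }
    }
    if (diff / (n*n) < TOLERANCE)              // converged: average change is small
      done = true;
  }
}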
const int n;
float* A = allocate(n+2, n+2);   // allocate grid
void solve(float* A) {
  ...
}

Assignment: ???
Grid solver example from: Culler, Singh, and Gupta
Shared address space (with SPMD threads) expression of solver
[Figure: grid rows partitioned into contiguous blocks, one block per processor P1..P4]

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
  int threadId = getThreadId();                      // value differs per SPMD instance:
  int myMin = 1 + (threadId * n / NUM_PROCESSORS);   // use it to compute the region of
  int myMax = myMin + (n / NUM_PROCESSORS);          // the grid this thread works on
  while (!done) {
    diff = 0.f;
    barrier(myBarrier, NUM_PROCESSORS);
    for (j=myMin to myMax) {                         // each thread computes only the
      for (i = red cells in this row) {              // rows it is responsible for updating
        float prev = A[i,j];
        A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                         A[i+1,j] + A[i,j+1]);
        lock(myLock);
        diff += abs(A[i,j] - prev);
        unlock(myLock);
      }
    }
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff/(n*n) < TOLERANCE)   // check convergence; all threads get same answer
      done = true;
    barrier(myBarrier, NUM_PROCESSORS);
  }
}

Grid solver example from: Culler, Singh, and Gupta
Review: need for mutual exclusion
Each thread executes:
- Load the value of diff into register r1
- Add the register r2 to register r1
- Store the value of register r1 into diff
One possible interleaving (let starting value of diff = 0, r2 = 1):

T0                  T1
r1 ← diff                          (T0 reads value 0)
                    r1 ← diff      (T1 reads value 0)
r1 ← r1 + r2                       (T0 sets value of its r1 to 1)
                    r1 ← r1 + r2   (T1 sets value of its r1 to 1)
diff ← r1                          (T0 stores 1 to diff)
                    diff ← r1      (T1 stores 1 to diff)

The final value of diff is 1, not the expected 2: one update is lost. The load-add-store sequence must execute atomically.
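One way to make the load-add-store sequence indivisible is an atomic read-modify-write; a minimal sketch using C11 atomics (an illustrative addition; add_to_diff is a hypothetical name, and the solver code below uses a lock instead):

#include <stdatomic.h>

atomic_int diff = 0;             // shared accumulator from the example above

void add_to_diff(int r2) {
    // the load, add, and store happen as one indivisible operation,
    // so concurrent updates can no longer be lost
    atomic_fetch_add(&diff, r2);
}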
Solver with per-thread partial sums: each thread accumulates into a private myDiff, so the lock is taken once per iteration instead of once per cell

// allocate grid
float* A = allocate(n+2, n+2);

void solve(float* A) {
  float myDiff;
  int threadId = getThreadId();
  int myMin = 1 + (threadId * n / NUM_PROCESSORS);
  int myMax = myMin + (n / NUM_PROCESSORS);
  while (!done) {
    myDiff = 0.f;
    diff = 0.f;
    barrier(myBarrier, NUM_PROCESSORS);
    for (j=myMin to myMax) {
      for (i = red cells in this row) {
        float prev = A[i,j];
        A[i,j] = 0.2f * (A[i-1,j] + A[i,j-1] + A[i,j] +
                         A[i+1,j] + A[i,j+1]);
        myDiff += abs(A[i,j] - prev);
      }
    }
    lock(myLock);
    diff += myDiff;               // one locked update per thread per iteration
    unlock(myLock);
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff/(n*n) < TOLERANCE)   // check convergence; all threads get same answer
      done = true;
    barrier(myBarrier, NUM_PROCESSORS);
  }
}

Grid solver example from: Culler, Singh, and Gupta
Shared address space solver: one barrier

Idea: remove dependencies by using different diff variables in successive loop iterations. (Three copies are needed: a thread that races ahead resets the diff for its next iteration, and with only two copies that reset could clobber a value another thread is still checking for convergence.)

int n;               // grid size
bool done = false;
LOCK myLock;
BARRIER myBarrier;
float diff[3];       // global diff, but now 3 copies
float *A = allocate(n+2, n+2);

void solve(float* A) {
  float myDiff;
  int index = 0;     // thread-local
  diff[0] = 0.0f;
  barrier(myBarrier, NUM_PROCESSORS);   // one-time only: for initialization
  while (!done) {
    myDiff = 0.0f;
    //
    // perform computation (accumulate locally into myDiff)
    //
    lock(myLock);
    diff[index] += myDiff;              // atomically update global diff
    unlock(myLock);
    diff[(index+1) % 3] = 0.0f;         // reset the next iteration's diff
    barrier(myBarrier, NUM_PROCESSORS);
    if (diff[index]/(n*n) < TOLERANCE)
      break;
    index = (index + 1) % 3;
  }
}

Grid solver example from: Culler, Singh, and Gupta
More on specifying dependencies
Barriers: simple, but conservative (coarse-granularity dependencies)
- All work in program up until this point (for all threads) must finish before any
thread begins next phase

Specifying a specific dependency with a flag:

T0:
  x = 1;               // produce x, then let T1 know
  flag = 1;
  // do more work here...

T1:
  // do stuff independent of x here
  while (flag == 0);   // wait until T0 sets the flag
  print x;
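A plain shared flag like this is not reliable under modern compilers and memory models; a sketch of the same pattern with C11 atomics (an addition, not from the slides; t0 and t1 are illustrative names):

#include <stdatomic.h>
#include <stdio.h>

int x = 0;
atomic_int flag = 0;

void t0(void) {
    x = 1;                                       // produce x
    atomic_store_explicit(&flag, 1,              // then let T1 know; release ensures
                          memory_order_release); // the write to x is visible first
}

void t1(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                        // spin until flag is set
    printf("%d\n", x);                           // acquire pairs with release: sees x == 1
}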
[Figure: message passing model: each processor has its own local cache and memory, connected by a network; each thread operates within its own private address space]
Message passing expression of solver
Similar structure to the shared address space solver, but now communication is explicit in sends and receives.

// per-thread logic: each thread owns rows_per_thread rows plus two ghost rows
float* localA = allocate(rows_per_thread+2, N+2);
// assume localA is initialized with starting values
// assume MSG_ID_ROW, MSG_ID_DONE, MSG_ID_DIFF are constants used as msg ids

void solve() {
  bool done = false;
  while (!done) {
    float my_diff = 0.0f;

    // send and receive ghost rows to/from “neighbor threads”
    if (tid != 0)
      send(&localA[1,0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
    if (tid != get_num_threads()-1)
      send(&localA[rows_per_thread,0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);
    if (tid != 0)
      recv(&localA[0,0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
    if (tid != get_num_threads()-1)
      recv(&localA[rows_per_thread+1,0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);

    // perform computation on local rows, accumulating into my_diff (elided)

    if (tid != 0) {
      // all threads send local my_diff to thread 0
      send(&my_diff, sizeof(float), 0, MSG_ID_DIFF);
      recv(&done, sizeof(bool), 0, MSG_ID_DONE);
    } else {
      // thread 0 computes global diff, evaluates the termination
      // predicate, and sends the result back to all other threads
      float remote_diff;
      for (int i=1; i<get_num_threads(); i++) {
        recv(&remote_diff, sizeof(float), i, MSG_ID_DIFF);
        my_diff += remote_diff;
      }
      if (my_diff/(N*N) < TOLERANCE)
        done = true;
      for (int i=1; i<get_num_threads(); i++)
        send(&done, sizeof(bool), i, MSG_ID_DONE);
    }
  }
}

Example pseudocode from: Culler, Singh, and Gupta
Notes on message passing example
▪ Computation
- Array indexing is relative to local address space (not global grid coordinates)
▪ Communication:
- Performed by sending and receiving messages
- Bulk transfer: communicate entire rows at a time (not individual elements)
▪ Synchronization:
- Performed by sending and receiving messages
- Think of how to implement mutual exclusion, barriers, flags using messages
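As one instance of the last point, a barrier can be built from messages alone; a minimal centralized sketch using the send/recv primitives above (an addition, not from the slides; MSG_ID_BARRIER is an assumed message id):

void msg_barrier(int tid, int num_threads) {
    int token = 0;
    if (tid != 0) {
        send(&token, sizeof(int), 0, MSG_ID_BARRIER);      // signal arrival to thread 0
        recv(&token, sizeof(int), 0, MSG_ID_BARRIER);      // block until released
    } else {
        for (int i = 1; i < num_threads; i++)
            recv(&token, sizeof(int), i, MSG_ID_BARRIER);  // wait for all arrivals
        for (int i = 1; i < num_threads; i++)
            send(&token, sizeof(int), i, MSG_ID_BARRIER);  // release everyone
    }
}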
Synchronous (blocking) send and receive
▪ send(): call returns when the sender receives acknowledgement that the message
data now resides in the address space of the receiver
▪ recv(): call returns when data from received message is copied into address
space of receiver and acknowledgement sent back to sender
[Diagram: sender and receiver timelines: sender blocks until the ack arrives; receiver blocks until the data is copied in]

Avoiding deadlock with blocking sends: send and receive ghost rows to “neighbor threads”
- Even-numbered threads send, then receive
- Odd-numbered threads receive, then send
Why? (If every thread issued a blocking send first, each send would wait for a matching receive that no thread had reached: deadlock.)

void solve() {
  bool done = false;
  while (!done) {
    float my_diff = 0.0f;
    if (tid % 2 == 0) {
      sendDown(); recvDown();
      sendUp(); recvUp();
    } else {
      recvUp(); sendUp();
      recvDown(); sendDown();
    }
    ...
  }
}
Example pseudocode from: Culler, Singh, and Gupta
Non-blocking asynchronous send/recv
▪ send(): call returns immediately
- Buffer provided to send() cannot be modified by calling thread, since message processing
occurs concurrently with thread execution
- Calling thread can perform other work while waiting for message to be sent
▪ recv(): call also returns immediately; use CHECKSEND()/CHECKRECV() (below) to
determine the actual status of the send/receipt
Sender:                                      Receiver:
Call SEND(foo)                               Call RECV(bar)
SEND returns handle h1                       RECV returns handle h2
Copy data from ‘foo’ into network buffer
Send message                                 Receive message
                                             Messaging library copies data into ‘bar’
Call CHECKSEND(h1)                           Call CHECKRECV(h2)
// if message sent, now safe                 // if received, now safe for thread
// for thread to modify ‘foo’                // to access ‘bar’
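For reference, this interface closely mirrors MPI’s nonblocking operations; a sketch (an addition, not from the slides; exchange and peer are illustrative names, and MPI_Test plays the role of CHECKSEND/CHECKRECV):

#include <mpi.h>

// Nonblocking exchange in MPI: Isend/Irecv return immediately with request
// handles; MPI_Test reports whether each operation has completed.
void exchange(float* foo, float* bar, int n, int peer) {
    MPI_Request h1, h2;
    int sent = 0, received = 0;

    MPI_Isend(foo, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &h1);
    MPI_Irecv(bar, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &h2);

    // ... perform other work while the messages are in flight ...

    while (!sent)
        MPI_Test(&h1, &sent, MPI_STATUS_IGNORE);      // safe to modify foo once sent
    while (!received)
        MPI_Test(&h2, &received, MPI_STATUS_IGNORE);  // safe to access bar once received
}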