06_progperf2
Parallel Computing
Stanford CS149, Fall 2023
Today’s topic
▪ Techniques for reducing the costs of communication
- Between processors
- Between processor(s) and memory
[Figure: memory hierarchy and chip layout — each core (Core 1 ... Core 8) has a private L1 cache (32 KB) and L2 cache (256 KB); all cores share an L3 cache (20 MB) and DRAM (32 GB) reached through the memory controller; die photo shows the cores, integrated GPU, and graphics/memory-controller blocks]
SUN Niagara 2 (UltraSPARC T2): crossbar interconnect
Note: the area of the crossbar (CCX) is about the same as the area of one core on the chip.
[Figure: cores connected through a crossbar switch to L2 cache banks and memory]
* In practice, you’ll find NUMA behavior on a single-socket system as well (recall: different cache slices are a different distance from each core)
Summary: shared address space model
▪ Communication abstraction
- Threads read/write variables in shared address space
- Threads manipulate synchronization primitives: locks, atomic ops, etc.
- Logical extension of uniprocessor programming *
* But NUMA implementations require reasoning about locality for performance optimization
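As a concrete illustration of the model summarized above, here is a minimal sketch using standard C++ threads (the variable and function names are illustrative, not from the lecture):

#include <thread>
#include <mutex>
#include <vector>

// Shared address space: both threads read and write the same variables.
static std::mutex data_lock;           // synchronization primitive
static std::vector<int> shared_data;   // shared variable

void producer() {
    for (int i = 0; i < 10; i++) {
        std::lock_guard<std::mutex> guard(data_lock);  // acquire lock
        shared_data.push_back(i);                      // write to shared memory
    }
}

void consumer() {
    long sum = 0;
    std::lock_guard<std::mutex> guard(data_lock);      // acquire lock
    for (int v : shared_data)                          // read from shared memory
        sum += v;
    (void)sum;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}

The only "communication" here is through the shared vector and the lock that guards it; nothing in the code names which thread the data came from.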
In the shared address space model, threads communicated by
reading and writing to variables in the shared address space.
Message passing
Message passing model (abstraction)
▪ Threads operate within their own private address spaces
▪ Threads communicate by sending/receiving messages
- send: specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)
- receive: specifies sender, buffer to store data, and optional message identifier
- Sending messages is the only way to exchange data between threads 1 and 2
[Figure: thread 1 sends the value of variable X to thread 2 via send(X, 2, my_msg_id); thread 2 receives it with a matching recv]
Illustration adapted from Culler, Singh, and Gupta
A common metaphor: snail mail
[Figure: a cluster of workstations connected by an InfiniBand network — each computer has its own processor and memory, and data moves between machines only as explicit messages over the network, e.g., send(X, 2, my_msg_id)]
Illustration adapted from Culler, Singh, and Gupta
Message passing model: each thread operates in its own address space
[Figure: the N x N grid is partitioned into blocks of rows; each thread stores its block (plus two ghost rows) in its own private address space (Thread 1, Thread 2, Thread 3, Thread 4, ...); arrows labeled "Send row" show rows being communicated between neighboring threads]

Thread 2 logic:

float* local_data = allocate(N+2, rows_per_thread+2);

int tid = get_thread_id();
int bytes = sizeof(float) * (N+2);

// receive ghost row cells (white dots): one row from the thread above,
// one row from the thread below
recv(&local_data[0], bytes, tid-1);
recv(&local_data[rows_per_thread+1], bytes, tid+1);

// Thread 2 now has data necessary to perform its future computation
(Same structure as before, but now communication is explicit in message sends and receives.)

// assume MSG_ID_ROW, MSG_ID_DONE, MSG_ID_DIFF are constants used as msg ids
//////////////////////////////////////
void solve() {
  bool done = false;
  while (!done) {

    // send and receive ghost rows to/from "neighbor threads"
    if (tid != 0)
      send(&localA[1][0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
    if (tid != get_num_threads()-1)
      send(&localA[rows_per_thread][0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);
    if (tid != 0)
      recv(&localA[0][0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
    if (tid != get_num_threads()-1)
      recv(&localA[rows_per_thread+1][0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);

    // ... update cells of localA, accumulating per-thread my_diff ...

    if (tid != 0) {
      // all other threads send their local my_diff to thread 0
      send(&my_diff, sizeof(float), 0, MSG_ID_DIFF);
      recv(&done, sizeof(bool), 0, MSG_ID_DONE);
    } else {
      // thread 0 computes the global diff, evaluates the termination
      // predicate, and sends the result back to all other threads
      float remote_diff;
      for (int i=1; i<get_num_threads(); i++) {
        recv(&remote_diff, sizeof(float), i, MSG_ID_DIFF);
        my_diff += remote_diff;
      }
      if (my_diff/(N*N) < TOLERANCE)
        done = true;
      for (int i=1; i<get_num_threads(); i++)
        send(&done, sizeof(bool), i, MSG_ID_DONE);
    }
  }
}

Example pseudocode from: Culler, Singh, and Gupta
Notes on the message passing example
▪ Computation
- Array indexing is relative to local address space
▪ Communication:
- Performed by sending and receiving messages
- Bulk transfer: communicate entire rows at a time
▪ Synchronization:
- Performed by sending and receiving messages
- Consider how to implement mutual exclusion, barriers, flags using messages (a message-based barrier sketch follows after this list)
▪ recv(): call returns when data from received message is copied into address space of receiver and
acknowledgement sent back to sender
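As one answer to the barrier part of that question, here is a minimal sketch written in the style of the solver pseudocode above (msg_barrier and BARRIER_MSG are illustrative names, not from the lecture): a designated thread collects one message from every other thread, then replies to release them all.

// Hypothetical message-based barrier, using blocking send()/recv()
// with the signature (buffer, bytes, peer, tag) as in the solver code.
void msg_barrier(int tid, int num_threads) {
    char token = 0;
    if (tid == 0) {
        // thread 0: wait for an "arrived" message from every other thread...
        for (int i = 1; i < num_threads; i++)
            recv(&token, sizeof(char), i, BARRIER_MSG);
        // ...then release them all
        for (int i = 1; i < num_threads; i++)
            send(&token, sizeof(char), i, BARRIER_MSG);
    } else {
        send(&token, sizeof(char), 0, BARRIER_MSG);   // announce arrival
        recv(&token, sizeof(char), 0, BARRIER_MSG);   // wait for release
    }
}

Mutual exclusion and flags can be built in the same spirit by funneling requests through a single owner thread.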
//////////////////////////////////////
void solve() {
  bool done = false;
  while (!done) {
    // Send and receive ghost rows to/from "neighbor threads".
    // With blocking sends and receives the ordering matters:
    // even-numbered threads send, then receive;
    // odd-numbered threads receive, then send.  Why?
    if (tid % 2 == 0) {
      sendDown(); sendUp();
      recvDown(); recvUp();
    } else {
      recvUp();   sendUp();
      recvDown(); sendDown();
    }
    // ... (remainder of the iteration as in the earlier version)
  }
}
With non-blocking (asynchronous) send and receive, the calls return immediately and completion is checked later:

Sender:
- Call SEND(foo); SEND returns handle h1
- Data is copied from 'foo' into a network buffer, then the message is sent
- Call CHECKSEND(h1): if the message has been sent, it is now safe for the thread to modify 'foo'

Receiver:
- Call RECV(bar); RECV returns handle h2
- The message is received and the messaging library copies the data into 'bar'
- Call CHECKRECV(h2): if the message has been received, it is now safe for the thread to access 'bar'
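A sketch of how these non-blocking calls might be used to overlap communication with independent computation (same pseudocode style as the timeline above; the exact signatures are assumptions):

// Pseudocode using the SEND/RECV/CHECKSEND/CHECKRECV calls above.
handle h1 = SEND(foo, msg_bytes, dst, tag);   // returns immediately
handle h2 = RECV(bar, msg_bytes, src, tag);   // returns immediately

do_independent_work();                        // work that touches neither foo nor bar

while (!CHECKSEND(h1)) { }                    // once true: safe to modify foo
while (!CHECKRECV(h2)) { }                    // once true: safe to read bar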
More examples:
- Communication between cores on a chip
- Communication between a core and its cache
- Communication between a core and memory
[Figure: execution timeline of a core issuing math and load instructions — the total latency of a memory access spans sending the load command to memory and transferring the data back; red regions mark times the core is stalled waiting on data for the next instruction]
▪ How would the figure change if memory bus bandwidth was increased?
▪ Would there still be processor stalls if the ratio of math instructions to load instructions
was significantly increased? Why?
▪ If the denominator of the communication-to-computation ratio is the execution time of the computation, the ratio gives the average bandwidth requirement of the code (see the worked example below)
▪ High arithmetic intensity (low communication-to-computation ratio) is required to efficiently utilize modern
parallel processors since the ratio of compute capability to available bandwidth is high (recall element-wise
vector multiply example from lecture 3)
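For instance, a rough worked version of that statement for element-wise vector multiply (the issue rate and clock below are illustrative assumptions, not figures from the lecture):

% Assume one multiply issued per clock at 1 GHz, operating on 4-byte floats.
\begin{align*}
\text{bytes communicated per multiply} &= \underbrace{2 \times 4\,\text{B}}_{\text{loads}} + \underbrace{4\,\text{B}}_{\text{store}} = 12\,\text{B} \\
\text{average bandwidth requirement} &= 12\,\text{B} \times 10^{9}\ \text{multiplies/s} = 12\ \text{GB/s per core}
\end{align*}

Multiply that by the number of cores and the demand quickly exceeds what a typical memory system can supply, which is why high arithmetic intensity is needed.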
[Figure: assignment of the N x N grid to P processors — blocks P1 ... P9 shown; "Send row" arrows mark boundary rows communicated between neighboring processors]
- elements computed (per processor): N²/P
- elements communicated (per processor): ∝ N/√P
- arithmetic intensity: ∝ N/√P
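Spelling out the arithmetic behind these expressions (each processor owns an (N/√P) × (N/√P) block of the grid):

\begin{align*}
\text{elements computed per processor} &= \frac{N}{\sqrt{P}} \cdot \frac{N}{\sqrt{P}} = \frac{N^2}{P} \\
\text{elements communicated per processor} &\propto 4 \cdot \frac{N}{\sqrt{P}} \;\propto\; \frac{N}{\sqrt{P}} \quad \text{(ghost cells along the block perimeter)} \\
\text{arithmetic intensity} &= \frac{N^2/P}{N/\sqrt{P}} = \frac{N}{\sqrt{P}}
\end{align*}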
▪ Finite replication capacity: the same data communicated to processor multiple times because
cache is too small to retain it between accesses (capacity misses)
[Figure: traversal of the N x N grid — assume row-major grid layout, a cache line of 4 grid elements, and a cache capacity of 24 grid elements (6 lines)]
void add(int n, float* A, float* B, float* C) {
  for (int i=0; i<n; i++)
    C[i] = A[i] + B[i];
}
// Two loads, one store per math op (arithmetic intensity = 1/3)

void mul(int n, float* A, float* B, float* C) {
  for (int i=0; i<n; i++)
    C[i] = A[i] * B[i];
}
// Two loads, one store per math op (arithmetic intensity = 1/3)

float* A, *B, *C, *D, *E, *tmp1, *tmp2;

// compute E = D + ((A + B) * C)
add(n, A, B, tmp1);
mul(n, tmp1, C, tmp2);
add(n, tmp2, D, E);
// Overall arithmetic intensity = 1/3
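The "code on bottom" referred to below fuses the three passes into a single loop. A sketch of such a fused version (reconstructed to match the arithmetic-intensity argument, not copied verbatim from the slide):

// Fused version: compute E = D + ((A + B) * C) in one pass over the data.
void fused(int n, float* A, float* B, float* C, float* D, float* E) {
  for (int i=0; i<n; i++)
    E[i] = D[i] + (A[i] + B[i]) * C[i];
}
// Four loads, one store per three math ops (arithmetic intensity = 3/5),
// and no trips to memory for the tmp1/tmp2 intermediate arrays.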
Code on top is more modular (e.g., like an array-based math library such as NumPy in Python).
Code on bottom performs much better. Why?
Optimization: improve arithmetic intensity by sharing data
▪ Exploit sharing: co-locate tasks that operate on the same data
- Schedule threads working on the same data structure at the same time on the same processor
- Reduces inherent communication
▪ Steps in operation:
1. Student walks from Bytes Cafe to Kayvon’s office (5 minutes)
2. Student waits in line (if necessary)
3. Student gets question answered with insightful answer (5 minutes)
[Figure: timeline when all five students walk over at once — each walks 5 minutes, then waits in line behind the students ahead; time cost to a student who must wait behind others: 23 minutes. Legend: walk to Kayvon's office (5 minutes); wait in line; get question answered]
[Figure: timeline with staggered appointments (e.g., Student 2 has an appointment at 4pm) — no waiting in line; time cost to the student: 10 minutes]
▪ Contention occurs when many requests to a resource are made within a small window of time
(the resource is a “hot spot”)
[Figure: a set of subproblems (a.k.a. "tasks", "work to do") distributed across per-thread work queues for worker threads T1 ... T4; an idle thread steals from another thread's queue ("Steal!")]
▪ Worker threads:
- Pull data from their OWN work queue
- Push new work to their OWN work queue
- (no contention when all processors have work to do)
▪ When a thread's local work queue is empty...
- STEAL work from a random work queue (synchronization is okay at this point, since the thread would have sat idle anyway)
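A minimal C++ sketch of this scheme (a simplified illustration using per-queue locks rather than the lock-free deques of production schedulers; all names are illustrative):

#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

struct WorkQueue {
    std::deque<int> tasks;   // a "task" here is just an int payload
    std::mutex m;
};

// Each worker pulls from its OWN queue; when that queue is empty,
// it steals from a randomly chosen victim queue.
void worker(int tid, std::vector<WorkQueue>& queues, std::atomic<long>& processed) {
    std::mt19937 rng(tid);
    int idle_attempts = 0;
    while (idle_attempts < 1000) {                 // crude termination condition
        int task = -1;
        {   // try own queue first
            std::lock_guard<std::mutex> g(queues[tid].m);
            if (!queues[tid].tasks.empty()) {
                task = queues[tid].tasks.back();
                queues[tid].tasks.pop_back();
            }
        }
        if (task < 0) {                            // own queue empty: steal
            int victim = (int)(rng() % queues.size());
            std::lock_guard<std::mutex> g(queues[victim].m);
            if (!queues[victim].tasks.empty()) {
                task = queues[victim].tasks.front();   // steal from the other end
                queues[victim].tasks.pop_front();
            }
        }
        if (task < 0) { idle_attempts++; continue; }
        idle_attempts = 0;
        processed += task;                         // "process" the task
    }
}

int main() {
    const int num_threads = 4;
    std::vector<WorkQueue> queues(num_threads);
    // Deliberately imbalanced initial assignment: all work starts on queue 0.
    for (int i = 0; i < 10000; i++) queues[0].tasks.push_back(1);

    std::atomic<long> processed(0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; t++)
        threads.emplace_back(worker, t, std::ref(queues), std::ref(processed));
    for (auto& t : threads) t.join();
    printf("processed %ld tasks\n", processed.load());
    return 0;
}

Because each thread works from its own queue in the common case, a queue's lock is contended only when some thread runs dry and steals.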
[Figure: plot annotations — diagonal region: memory-bandwidth-limited execution; horizontal region: compute-limited execution]
* Computation, memory access, and synchronization are almost never perfectly overlapped. As a result, overall performance will rarely be dictated entirely by compute, by bandwidth, or by synchronization. Even so, the sensitivity of performance to the program modifications above can be a good indication of the dominant costs.
Use profilers/performance monitoring tools
▪ Image at left is “CPU usage” from activity monitor in OS X while browsing the web in
Chrome (from a laptop with a quad-core Core i7 CPU)
- Graph plots percentage of time OS has scheduled a process thread onto a processor
execution context
- Not very helpful for optimizing performance
▪ All modern processors have low-level event “performance counters”
- Registers that count important details such as: instructions completed, clock ticks,
L2/L3 cache hits/misses, bytes read from memory controller, etc.
▪ Example: Intel's Performance Counter Monitor Tool provides a C++ API for accessing these registers:

PCM *m = PCM::getInstance();
SystemCounterState begin = getSystemCounterState();
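A sketch of how such a measurement might continue (the accessor names below reflect my understanding of the PCM API, and the profiled region is hypothetical; treat this as an approximation rather than an excerpt from the slide):

#include <cstdio>
#include "cpucounters.h"   // Intel PCM header

void measure() {
    PCM *m = PCM::getInstance();
    m->program();                                         // configure counters (assumed PCM call)
    SystemCounterState before = getSystemCounterState();

    run_code_to_analyze();                                // hypothetical: the region being profiled

    SystemCounterState after = getSystemCounterState();
    double ipc      = getIPC(before, after);              // instructions per clock (assumed accessor)
    double l3_ratio = getL3CacheHitRatio(before, after);  // L3 cache hit ratio (assumed accessor)
    auto bytes_read = getBytesReadFromMC(before, after);  // bytes read from memory controller (assumed accessor)
    printf("IPC: %f  L3 hit ratio: %f  bytes read: %llu\n",
           ipc, l3_ratio, (unsigned long long)bytes_read);
}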
▪ Absolute performance?
- Often measured as wall clock time
- Another example: operations per second
▪ Efficiency?
- Performance per unit resource
- e.g., operations per second per chip area, per dollar, per watt
[Figure: plot of speedup vs. number of processors (1, 2, 4, 8, 16, 32)]
Figure credit: Culler, Singh, and Gupta
Remember: work assignment in solver
▪ 2D blocked assignment: N x N grid (N² elements), P processors
[Figure: the grid partitioned into blocks, one per processor (P1 ... P9 shown)]
- elements computed (per processor): N²/P
- elements communicated (per processor): ∝ N/√P
- arithmetic intensity: ∝ N/√P (grows with N for fixed P)
[Figure: speedup vs. number of processors (1 to 32), compared against ideal linear speedup]
Figure credit: Culler, Singh, and Gupta
Pitfalls of fixed problem size speedup analysis
Execution on a 32-processor SGI Origin 2000
[Figure: measured speedup vs. number of processors (1 to 32)]
Figure credit: Culler, Singh, and Gupta
Understanding scaling
▪ There can be complex interactions between the size of the problem to solve and the size of the parallel
computer
- Can impact load balance, overhead, arithmetic intensity, locality of data access
- Effects can be dramatic and application dependent
- Too large a problem (problem size chosen to be appropriate for a large machine):
- Key working set may not "fit" in a small machine (causing thrashing to disk, or the key working set exceeds cache capacity, or the program can't run at all)
- When the problem working set "fits" in a large machine but not a small one, super-linear speedups can occur
▪ Be aware of scaling issues. Is the problem well matched for the machine?