06_progperf2
Parallel Computing
Stanford CS149, Fall 2023
Today’s topic
▪ Techniques for reducing the costs of communication
- Between processors
- Between processor(s) and memory
[Figure: memory hierarchy and chip layout — each core (Core 1 ... Core 8) has a private L1 cache (32 KB) and L2 cache (256 KB); all cores share an L3 cache (20 MB) and DRAM (32 GB) reached through the memory controller; die photo shows the cores, integrated GPU, and graphics/memory-controller blocks]
SUN Niagara 2 (UltraSPARC T2): crossbar interconnect
Note: the area of the crossbar (CCX) is about the same as the area of one core on the chip.
[Figure: cores connected through a crossbar switch to L2 cache banks and memory]
* In practice, you’ll find NUMA behavior on a single-socket system as well (recall: different cache slices are a different distance from each core)
Summary: shared address space model
▪ Communication abstraction
- Threads read/write variables in shared address space
- Threads manipulate synchronization primitives: locks, atomic ops, etc.
- Logical extension of uniprocessor programming *
* But NUMA implementations require reasoning about locality for performance optimization
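As a concrete illustration of the model summarized above, here is a minimal sketch using standard C++ threads (the variable and function names are illustrative, not from the lecture):

#include <thread>
#include <mutex>
#include <vector>

// Shared address space: both threads read and write the same variables.
static std::mutex data_lock;           // synchronization primitive
static std::vector<int> shared_data;   // shared variable

void producer() {
    for (int i = 0; i < 10; i++) {
        std::lock_guard<std::mutex> guard(data_lock);  // acquire lock
        shared_data.push_back(i);                      // write to shared memory
    }
}

void consumer() {
    long sum = 0;
    std::lock_guard<std::mutex> guard(data_lock);      // acquire lock
    for (int v : shared_data)                          // read from shared memory
        sum += v;
    (void)sum;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}

The only "communication" here is through the shared vector and the lock that guards it; nothing in the code names which thread the data came from.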
In the shared address space model, threads communicated by
reading and writing to variables in the shared address space.
Message passing
Message passing model (abstraction)
▪ Threads operate within their own private address spaces
▪ Threads communicate by sending/receiving messages
- send: specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)
- receive: specifies sender, buffer to store data, and optional message identifier
- Sending messages is the only way to exchange data between threads 1 and 2
[Figure: thread 1 sends the value of variable X to thread 2 via send(X, 2, my_msg_id); thread 2 receives it with a matching recv]
Illustration adapted from Culler, Singh, and Gupta
A common metaphor: snail mail
[Figure: a cluster of workstations connected by an InfiniBand network — each computer has its own processor and memory, and data moves between machines only as explicit messages over the network, e.g., send(X, 2, my_msg_id)]
Illustration adapted from Culler, Singh, and Gupta
Message passing model: each thread operates in its own address space
[Figure: the N x N grid is partitioned into blocks of rows; each thread stores its block (plus two ghost rows) in its own private address space (Thread 1, Thread 2, Thread 3, Thread 4, ...); arrows labeled "Send row" show rows being communicated between neighboring threads]

Thread 2 logic:

float* local_data = allocate(N+2, rows_per_thread+2);

int tid = get_thread_id();
int bytes = sizeof(float) * (N+2);

// receive ghost row cells (white dots): one row from the thread above,
// one row from the thread below
recv(&local_data[0], bytes, tid-1);
recv(&local_data[rows_per_thread+1], bytes, tid+1);

// Thread 2 now has data necessary to perform its future computation
(Same structure as before, but now communication is explicit in message sends and receives.)

// assume MSG_ID_ROW, MSG_ID_DONE, MSG_ID_DIFF are constants used as msg ids
//////////////////////////////////////
void solve() {
  bool done = false;
  while (!done) {

    // send and receive ghost rows to/from "neighbor threads"
    if (tid != 0)
      send(&localA[1][0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
    if (tid != get_num_threads()-1)
      send(&localA[rows_per_thread][0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);
    if (tid != 0)
      recv(&localA[0][0], sizeof(float)*(N+2), tid-1, MSG_ID_ROW);
    if (tid != get_num_threads()-1)
      recv(&localA[rows_per_thread+1][0], sizeof(float)*(N+2), tid+1, MSG_ID_ROW);

    // ... update cells of localA, accumulating per-thread my_diff ...

    if (tid != 0) {
      // all other threads send their local my_diff to thread 0
      send(&my_diff, sizeof(float), 0, MSG_ID_DIFF);
      recv(&done, sizeof(bool), 0, MSG_ID_DONE);
    } else {
      // thread 0 computes the global diff, evaluates the termination
      // predicate, and sends the result back to all other threads
      float remote_diff;
      for (int i=1; i<get_num_threads(); i++) {
        recv(&remote_diff, sizeof(float), i, MSG_ID_DIFF);
        my_diff += remote_diff;
      }
      if (my_diff/(N*N) < TOLERANCE)
        done = true;
      for (int i=1; i<get_num_threads(); i++)
        send(&done, sizeof(bool), i, MSG_ID_DONE);
    }
  }
}

Example pseudocode from: Culler, Singh, and Gupta
Notes on the message passing example
▪ Computation
- Array indexing is relative to local address space
▪ Communication:
- Performed by sending and receiving messages
- Bulk transfer: communicate entire rows at a time
▪ Synchronization:
- Performed by sending and receiving messages
- Consider how to implement mutual exclusion, barriers, flags using messages (a message-based barrier sketch follows after this list)
▪ recv(): call returns when data from received message is copied into address space of receiver and
acknowledgement sent back to sender
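As one answer to the barrier part of that question, here is a minimal sketch written in the style of the solver pseudocode above (msg_barrier and BARRIER_MSG are illustrative names, not from the lecture): a designated thread collects one message from every other thread, then replies to release them all.

// Hypothetical message-based barrier, using blocking send()/recv()
// with the signature (buffer, bytes, peer, tag) as in the solver code.
void msg_barrier(int tid, int num_threads) {
    char token = 0;
    if (tid == 0) {
        // thread 0: wait for an "arrived" message from every other thread...
        for (int i = 1; i < num_threads; i++)
            recv(&token, sizeof(char), i, BARRIER_MSG);
        // ...then release them all
        for (int i = 1; i < num_threads; i++)
            send(&token, sizeof(char), i, BARRIER_MSG);
    } else {
        send(&token, sizeof(char), 0, BARRIER_MSG);   // announce arrival
        recv(&token, sizeof(char), 0, BARRIER_MSG);   // wait for release
    }
}

Mutual exclusion and flags can be built in the same spirit by funneling requests through a single owner thread.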
//////////////////////////////////////
void solve() {
  bool done = false;
  while (!done) {
    // Send and receive ghost rows to/from "neighbor threads".
    // With blocking sends and receives the ordering matters:
    // even-numbered threads send, then receive;
    // odd-numbered threads receive, then send.  Why?
    if (tid % 2 == 0) {
      sendDown(); sendUp();
      recvDown(); recvUp();
    } else {
      recvUp();   sendUp();
      recvDown(); sendDown();
    }
    // ... (remainder of the iteration as in the earlier version)
  }
}
With non-blocking (asynchronous) send and receive, the calls return immediately and completion is checked later:

Sender:
- Call SEND(foo); SEND returns handle h1
- Data is copied from 'foo' into a network buffer, then the message is sent
- Call CHECKSEND(h1): if the message has been sent, it is now safe for the thread to modify 'foo'

Receiver:
- Call RECV(bar); RECV returns handle h2
- The message is received and the messaging library copies the data into 'bar'
- Call CHECKRECV(h2): if the message has been received, it is now safe for the thread to access 'bar'
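A sketch of how these non-blocking calls might be used to overlap communication with independent computation (same pseudocode style as the timeline above; the exact signatures are assumptions):

// Pseudocode using the SEND/RECV/CHECKSEND/CHECKRECV calls above.
handle h1 = SEND(foo, msg_bytes, dst, tag);   // returns immediately
handle h2 = RECV(bar, msg_bytes, src, tag);   // returns immediately

do_independent_work();                        // work that touches neither foo nor bar

while (!CHECKSEND(h1)) { }                    // once true: safe to modify foo
while (!CHECKRECV(h2)) { }                    // once true: safe to read bar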
More examples:
- Communication between cores on a chip
- Communication between a core and its cache
- Communication between a core and memory
[Figure: execution timeline of a core issuing math and load instructions — the total latency of a memory access spans sending the load command to memory and transferring the data back; red regions mark times the core is stalled waiting on data for the next instruction]
▪ How would the figure change if memory bus bandwidth was increased?
▪ Would there still be processor stalls if the ratio of math instructions to load instructions
was significantly increased? Why?
▪ If the denominator of the communication-to-computation ratio is the execution time of the computation, the ratio gives the average bandwidth requirement of the code (see the worked example below)
▪ High arithmetic intensity (low communication-to-computation ratio) is required to efficiently utilize modern
parallel processors since the ratio of compute capability to available bandwidth is high (recall element-wise
vector multiply example from lecture 3)
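For instance, a rough worked version of that statement for element-wise vector multiply (the issue rate and clock below are illustrative assumptions, not figures from the lecture):

% Assume one multiply issued per clock at 1 GHz, operating on 4-byte floats.
\begin{align*}
\text{bytes communicated per multiply} &= \underbrace{2 \times 4\,\text{B}}_{\text{loads}} + \underbrace{4\,\text{B}}_{\text{store}} = 12\,\text{B} \\
\text{average bandwidth requirement} &= 12\,\text{B} \times 10^{9}\ \text{multiplies/s} = 12\ \text{GB/s per core}
\end{align*}

Multiply that by the number of cores and the demand quickly exceeds what a typical memory system can supply, which is why high arithmetic intensity is needed.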
[Figure: assignment of the N x N grid to P processors — blocks P1 ... P9 shown; "Send row" arrows mark boundary rows communicated between neighboring processors]
- elements computed (per processor): N²/P
- elements communicated (per processor): ∝ N/√P
- arithmetic intensity: ∝ N/√P
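Spelling out the arithmetic behind these expressions (each processor owns an (N/√P) × (N/√P) block of the grid):

\begin{align*}
\text{elements computed per processor} &= \frac{N}{\sqrt{P}} \cdot \frac{N}{\sqrt{P}} = \frac{N^2}{P} \\
\text{elements communicated per processor} &\propto 4 \cdot \frac{N}{\sqrt{P}} \;\propto\; \frac{N}{\sqrt{P}} \quad \text{(ghost cells along the block perimeter)} \\
\text{arithmetic intensity} &= \frac{N^2/P}{N/\sqrt{P}} = \frac{N}{\sqrt{P}}
\end{align*}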
▪ Finite replication capacity: the same data communicated to processor multiple times because
cache is too small to retain it between accesses (capacity misses)
[Figure: traversal of the N x N grid — assume row-major grid layout, a cache line of 4 grid elements, and a cache capacity of 24 grid elements (6 lines)]
void add(int n, float* A, float* B, float* C) {
  for (int i=0; i<n; i++)
    C[i] = A[i] + B[i];
}
// Two loads, one store per math op (arithmetic intensity = 1/3)

void mul(int n, float* A, float* B, float* C) {
  for (int i=0; i<n; i++)
    C[i] = A[i] * B[i];
}
// Two loads, one store per math op (arithmetic intensity = 1/3)

float* A, *B, *C, *D, *E, *tmp1, *tmp2;

// compute E = D + ((A + B) * C)
add(n, A, B, tmp1);
mul(n, tmp1, C, tmp2);
add(n, tmp2, D, E);
// Overall arithmetic intensity = 1/3
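The "code on bottom" referred to below fuses the three passes into a single loop. A sketch of such a fused version (reconstructed to match the arithmetic-intensity argument, not copied verbatim from the slide):

// Fused version: compute E = D + ((A + B) * C) in one pass over the data.
void fused(int n, float* A, float* B, float* C, float* D, float* E) {
  for (int i=0; i<n; i++)
    E[i] = D[i] + (A[i] + B[i]) * C[i];
}
// Four loads, one store per three math ops (arithmetic intensity = 3/5),
// and no trips to memory for the tmp1/tmp2 intermediate arrays.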
Code on top is more modular (e.g., like an array-based math library such as NumPy in Python).
Code on bottom performs much better. Why?
Optimization: improve arithmetic intensity by sharing data
▪ Exploit sharing: co-locate tasks that operate on the same data
- Schedule threads working on the same data structure at the same time on the same processor
- Reduces inherent communication
▪ Steps in operation:
1. Student walks from Bytes Cafe to Kayvon’s office (5 minutes)
2. Student waits in line (if necessary)
3. Student gets question answered with insightful answer (5 minutes)
[Figure: timeline when all five students walk over at once — each walks 5 minutes, then waits in line behind the students ahead; time cost to a student who must wait behind others: 23 minutes. Legend: walk to Kayvon's office (5 minutes); wait in line; get question answered]
[Figure: timeline with staggered appointments (e.g., Student 2 has an appointment at 4pm) — no waiting in line; time cost to the student: 10 minutes]
▪ Contention occurs when many requests to a resource are made within a small window of time
(the resource is a “hot spot”)
[Figure: a set of subproblems (a.k.a. "tasks", "work to do") distributed across per-thread work queues for worker threads T1 ... T4; an idle thread steals from another thread's queue ("Steal!")]
▪ Worker threads:
- Pull data from their OWN work queue
- Push new work to their OWN work queue
- (no contention when all processors have work to do)
▪ When a thread's local work queue is empty...
- STEAL work from a random work queue (synchronization is okay at this point, since the thread would have sat idle anyway)
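A minimal C++ sketch of this scheme (a simplified illustration using per-queue locks rather than the lock-free deques of production schedulers; all names are illustrative):

#include <atomic>
#include <cstdio>
#include <deque>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

struct WorkQueue {
    std::deque<int> tasks;   // a "task" here is just an int payload
    std::mutex m;
};

// Each worker pulls from its OWN queue; when that queue is empty,
// it steals from a randomly chosen victim queue.
void worker(int tid, std::vector<WorkQueue>& queues, std::atomic<long>& processed) {
    std::mt19937 rng(tid);
    int idle_attempts = 0;
    while (idle_attempts < 1000) {                 // crude termination condition
        int task = -1;
        {   // try own queue first
            std::lock_guard<std::mutex> g(queues[tid].m);
            if (!queues[tid].tasks.empty()) {
                task = queues[tid].tasks.back();
                queues[tid].tasks.pop_back();
            }
        }
        if (task < 0) {                            // own queue empty: steal
            int victim = (int)(rng() % queues.size());
            std::lock_guard<std::mutex> g(queues[victim].m);
            if (!queues[victim].tasks.empty()) {
                task = queues[victim].tasks.front();   // steal from the other end
                queues[victim].tasks.pop_front();
            }
        }
        if (task < 0) { idle_attempts++; continue; }
        idle_attempts = 0;
        processed += task;                         // "process" the task
    }
}

int main() {
    const int num_threads = 4;
    std::vector<WorkQueue> queues(num_threads);
    // Deliberately imbalanced initial assignment: all work starts on queue 0.
    for (int i = 0; i < 10000; i++) queues[0].tasks.push_back(1);

    std::atomic<long> processed(0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; t++)
        threads.emplace_back(worker, t, std::ref(queues), std::ref(processed));
    for (auto& t : threads) t.join();
    printf("processed %ld tasks\n", processed.load());
    return 0;
}

Because each thread works from its own queue in the common case, a queue's lock is contended only when some thread runs dry and steals.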
[Figure: plot annotations — diagonal region: memory-bandwidth-limited execution; horizontal region: compute-limited execution]
* Computation, memory access, and synchronization are almost never perfectly overlapped. As a result, overall performance will rarely be dictated entirely by compute, by bandwidth, or by synchronization. Even so, the sensitivity of performance to the program modifications above can be a good indication of the dominant costs.
Use profilers/performance monitoring tools
▪ Image at left is “CPU usage” from activity monitor in OS X while browsing the web in
Chrome (from a laptop with a quad-core Core i7 CPU)
- Graph plots percentage of time OS has scheduled a process thread onto a processor
execution context
- Not very helpful for optimizing performance
▪ All modern processors have low-level event “performance counters”
- Registers that count important details such as: instructions completed, clock ticks,
L2/L3 cache hits/misses, bytes read from memory controller, etc.
▪ Example: Intel's Performance Counter Monitor Tool provides a C++ API for accessing these registers:

PCM *m = PCM::getInstance();
SystemCounterState begin = getSystemCounterState();
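A sketch of how such a measurement might continue (the accessor names below reflect my understanding of the PCM API, and the profiled region is hypothetical; treat this as an approximation rather than an excerpt from the slide):

#include <cstdio>
#include "cpucounters.h"   // Intel PCM header

void measure() {
    PCM *m = PCM::getInstance();
    m->program();                                         // configure counters (assumed PCM call)
    SystemCounterState before = getSystemCounterState();

    run_code_to_analyze();                                // hypothetical: the region being profiled

    SystemCounterState after = getSystemCounterState();
    double ipc      = getIPC(before, after);              // instructions per clock (assumed accessor)
    double l3_ratio = getL3CacheHitRatio(before, after);  // L3 cache hit ratio (assumed accessor)
    auto bytes_read = getBytesReadFromMC(before, after);  // bytes read from memory controller (assumed accessor)
    printf("IPC: %f  L3 hit ratio: %f  bytes read: %llu\n",
           ipc, l3_ratio, (unsigned long long)bytes_read);
}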
▪ Absolute performance?
- Often measured as wall clock time
- Another example: operations per second
▪ Efficiency?
- Performance per unit resource
- e.g., operations per second per chip area, per dollar, per watt
[Figure: plot of speedup vs. number of processors (1, 2, 4, 8, 16, 32)]
Figure credit: Culler, Singh, and Gupta
Remember: work assignment in solver
▪ 2D blocked assignment: N x N grid (N² elements), P processors
[Figure: the grid partitioned into blocks, one per processor (P1 ... P9 shown)]
- elements computed (per processor): N²/P
- elements communicated (per processor): ∝ N/√P
- arithmetic intensity: ∝ N/√P (grows with N for fixed P)
[Figure: speedup vs. number of processors (1 to 32), compared against ideal linear speedup]
Figure credit: Culler, Singh, and Gupta
Pitfalls of fixed problem size speedup analysis
Execution on a 32-processor SGI Origin 2000
[Figure: measured speedup vs. number of processors (1 to 32)]
Figure credit: Culler, Singh, and Gupta
Understanding scaling
▪ There can be complex interactions between the size of the problem to solve and the size of the parallel
computer
- Can impact load balance, overhead, arithmetic intensity, locality of data access
- Effects can be dramatic and application dependent
- Too large a problem (problem size chosen to be appropriate for a large machine):
- Key working set may not "fit" in a small machine (causing thrashing to disk, or the key working set exceeds cache capacity, or the program can't run at all)
- When the problem working set "fits" in a large machine but not a small one, super-linear speedups can occur
▪ Be aware of scaling issues. Is the problem well matched for the machine?