03_progmodels_slides
Parallel Programming Abstractions
(and their corresponding HW/SW implementations)
Parallel Computing
Stanford CS149, Winter 2019
Today's theme is a critical idea in this course: abstraction vs. implementation.
▪ http://ispc.github.com/
sin(x) in ISPC
Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
C++ code (main.cpp):

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

ISPC code (sinx.ispc):

export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    // assume N % programCount = 0
    for (uniform int i=0; i<N; i+=programCount)
    {
        int idx = i + programIndex;
        float value = x[idx];
        float numer = x[idx] * x[idx] * x[idx];
        uniform int denom = 6; // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[idx] * x[idx];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[idx] = value;
    }
}
SPMD programming abstraction: the call to the ISPC function spawns a "gang" of ISPC program instances, all of which run the ISPC code concurrently; when the function returns, all instances have completed.
Schedule: interleaved assignment

Elements of the output array computed by each of the four program instances (programIndex 0 to 3), per loop iteration:

i=0:   0   1   2   3
i=1:   4   5   6   7
i=2:   8   9  10  11
i=3:  12  13  14  15

// assumes N % programCount = 0
for (uniform int i=0; i<N; i+=programCount)
{
    int idx = i + programIndex;
    float value = x[idx];
    ...

A single "packed load" SSE instruction (_mm_load_ps) efficiently implements float value = x[idx]; for all program instances, since the four values are contiguous in memory.
Schedule: blocked assignment
“Gang” of ISPC program instances
Gang contains four instances: programCount = 4
Elements of the output array computed by each of the four program instances (programIndex 0 to 3), per loop iteration:

i=0:   0   4   8  12
i=1:   1   5   9  13
i=2:   2   6  10  14
i=3:   3   7  11  15

uniform int count = N / programCount;
int start = programIndex * count;
for (uniform int i=0; i<count; i++) {
    int idx = start + i;
    float value = x[idx];
    ...

With this schedule, float value = x[idx]; now touches four non-contiguous values in memory. A "gather" instruction is needed to implement it (gather is a more complex, and more costly, SIMD instruction: only available since 2013 as part of AVX2).
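To make that cost concrete, here is a hedged sketch (illustrative, not actual ISPC compiler output) of the gathered load with AVX2 intrinsics, assuming an 8-wide gang; the helper name and its arguments are made up for this example:

#include <immintrin.h>

// Load x[programIndex*count + i] for the 8 program instances of an 8-wide
// gang under the blocked schedule. The 8 addresses are 'count' elements
// apart, so a packed load cannot be used; an AVX2 gather can.
__m256 load_blocked_lane_values(const float* x, int count, int i) {
    const __m256i lane   = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i starts = _mm256_mullo_epi32(lane, _mm256_set1_epi32(count));
    const __m256i idx    = _mm256_add_epi32(starts, _mm256_set1_epi32(i));
    return _mm256_i32gather_ps(x, idx, 4);   // scale = 4 bytes per float
}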
Raising level of abstraction with foreach
Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
C++ code (main.cpp):

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

ISPC code (sinx.ispc):

export void sinx(
    uniform int N,
    uniform int terms,
    uniform float* x,
    uniform float* result)
{
    foreach (i = 0 ... N)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        uniform int denom = 6; // 3!
        uniform int sign = -1;

        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}

foreach: key ISPC language construct

▪ foreach declares parallel loop iterations
- Programmer says: these are the iterations the instances in a gang must cooperatively perform

▪ ISPC implementation assigns iterations to program instances in the gang
- The current ISPC implementation performs a static interleaved assignment (but the abstraction permits a different assignment)
ISPC: abstraction vs. implementation
▪ Single program, multiple data (SPMD) programming model
- Programmer “thinks”: running a gang is spawning programCount logical
instruction streams (each with a different value of programIndex)
- This is the programming abstraction
- Program is written in terms of this abstraction
sum is of type uniform float (one copy of the variable for all program instances)
x[i] is not a uniform expression (a different value for each program instance)
Result: compile-time type error
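These annotations refer to an attempt that is not shown in full here; a reconstruction sketch (assumed to mirror the sumall2 listing on the next slide, except that it accumulates directly into the uniform variable) would look like this:

export uniform float sumall1(
    uniform int N,
    uniform float* x)
{
    uniform float sum = 0.0f;
    foreach (i = 0 ... N)
    {
        // x[i] is varying, sum is uniform: this assignment is the type error
        sum += x[i];
    }
    return sum;
}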
ISPC discussion: sum “reduction”
Compute the sum of all array elements in parallel:

export uniform float sumall2(
    uniform int N,
    uniform float* x)
{
    uniform float sum;
    float partial = 0.0f;
    foreach (i = 0 ... N)
    {
        partial += x[i];
    }

    // from ISPC math library
    sum = reduce_add(partial);
    return sum;
}

Each instance accumulates a private partial sum (no communication).

Partial sums are then added together using the reduce_add() cross-instance communication primitive. The result is the same total sum for all program instances (reduce_add() returns a uniform float).

The ISPC code above will execute in a manner similar to a handwritten C + AVX intrinsics implementation.
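A minimal sketch of such a handwritten C + AVX version (illustrative rather than the slide's exact listing; it assumes 8-wide AVX and that N is a multiple of 8):

#include <immintrin.h>

float sumall2(int N, float* x) {
    // eight lane-wise partial sums, one per SIMD lane
    __m256 partial = _mm256_setzero_ps();
    for (int i = 0; i < N; i += 8)
        partial = _mm256_add_ps(partial, _mm256_loadu_ps(&x[i]));

    // horizontal reduction of the partial sums (the reduce_add step)
    float tmp[8];
    _mm256_storeu_ps(tmp, partial);
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += tmp[i];
    return sum;
}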
▪ So... all the code I've shown you in the previous slides would have executed on only one of the four cores of the GHC machines.
(Figure: layers of abstraction and implementation for this example: the ISPC compiler; the operating system and pthread_create(); the x86-64 hardware architecture at the HW/SW boundary; and the micro-architecture, the hardware implementation, of a modern multi-core CPU.)
Note: this diagram is specific to the ISPC gang abstraction. ISPC also has the "task" language primitive for multi-core execution. I don't describe it here, but it is interesting to think about how the diagram would change; see the sketch below.
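As a hedged sketch of that multi-core path (the names sinx_chunk and sinx_multicore and the choice of 16 tasks are illustrative, not from the slides), ISPC tasks look roughly like this:

// Each launched task runs one chunk of the array with its own gang.
task void sinx_chunk(
    uniform int N, uniform int terms,
    uniform float* x, uniform float* result,
    uniform int span)
{
    uniform int start = taskIndex * span;
    uniform int end = min(start + span, N);
    foreach (i = start ... end) {
        // ... same Taylor-series body as sinx ...
        result[i] = x[i];   // placeholder for the real computation
    }
}

export void sinx_multicore(
    uniform int N, uniform int terms,
    uniform float* x, uniform float* result)
{
    uniform int span = (N + 15) / 16;   // ceiling division: cover all elements
    launch[16] sinx_chunk(N, terms, x, result, span);
    // launched tasks are synchronized before this function returns
}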
Three models of communication (abstractions)
Thread 1:

int x = 0;
spawn_thread(foo, &x);
x = 1;

Thread 2:

void foo(int* x) {
  while (x == 0) {}
  print x;
}

(Figure: Thread 1's store to x is visible to Thread 2's loads because x is a variable in an address space shared by both threads.)
(Pseudocode provided in a fake C-like language for brevity.)
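For concreteness, a sketch of the same pattern in real C (POSIX threads plus a C11 atomic; making x atomic is an addition this sketch needs for a well-defined spin-wait, not something the pseudocode states):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x = 0;   // shared variable: both threads load/store the same x

void* foo(void* arg) {
    (void)arg;
    while (atomic_load(&x) == 0) { /* spin until thread 1 stores to x */ }
    printf("%d\n", atomic_load(&x));
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, foo, NULL);   // "spawn_thread(foo, &x)"
    atomic_store(&x, 1);                   // the store thread 2 is waiting for
    pthread_join(t, NULL);
    return 0;
}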
Shared address space model (abstraction)
Synchronization primitives are also shared variables: e.g., locks
Thread 1:

int x = 0;
Lock my_lock;
...

Thread 2:

...
print x;
}
(Pseudocode provided in a fake C-like language for brevity.)
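An illustrative sketch (not the slide's listing) of that idea in real C: the lock my_lock is itself just a variable in the shared address space that both threads use to coordinate their updates to x:

#include <pthread.h>

int x = 0;
pthread_mutex_t my_lock = PTHREAD_MUTEX_INITIALIZER;   // shared, like x

void* foo(void* arg) {
    (void)arg;
    pthread_mutex_lock(&my_lock);
    x++;                          // critical section: one thread at a time
    pthread_mutex_unlock(&my_lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, foo, NULL);
    pthread_create(&t2, NULL, foo, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;   // x is now 2
}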
Review: why do we need mutual exclusion?
▪ Each thread executes
- Load the value of diff from shared memory into register r1
- Add the register r2 to register r1
- Store the value of register r1 into diff
▪ One possible interleaving (starting value of diff = 0, r2 = 1 in both threads):

T0                 T1
r1 ← diff                            T0 reads value 0
                   r1 ← diff         T1 reads value 0
r1 ← r1 + r2                         T0 sets the value of its r1 to 1
                   r1 ← r1 + r2      T1 sets the value of its r1 to 1
diff ← r1                            T0 stores 1 to diff
                   diff ← r1         T1 stores 1 to diff

The final value of diff is 1, even though diff was "incremented" twice.
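The fix is to make the load/add/store sequence indivisible, either by wrapping it in a lock (as above) or, sketched here with a C11 atomic as one illustrative alternative, by using an atomic read-modify-write:

#include <stdatomic.h>

atomic_int diff = 0;

void accumulate(int r2) {
    // load + add + store execute as one indivisible operation,
    // so two concurrent calls with r2 = 1 always leave diff == 2
    atomic_fetch_add(&diff, r2);
}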
(Figures: possible interconnects for a shared address space machine: processors, each with a local cache, attached to memory over a shared bus; processors connected to memories and I/O through a crossbar; and a multi-stage network.)

* Caching introduces non-uniform access times, but we'll talk about that later.
Shared address space HW architectures
(Figures: example shared address space machines: a four-core processor with an integrated GPU and an on-chip memory controller attached to memory; a design in which processors, each with an L2 cache, reach multiple memories through a crossbar switch; and an eight-core system connected by AMD HyperTransport / Intel QuickPath (QPI), where the latency to access address x is higher from cores 5-8 than from cores 1-4.)
Non-uniform memory access (NUMA)
All processors can access any memory location, but... the cost of memory access
(latency and/or bandwidth) is different for different processors
(Figure: processors, each with memory near it, connected by an interconnect.)
* But NUMA implementation requires reasoning about locality for performance
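As one illustration of that locality reasoning on Linux (an assumption of this sketch, not something covered in the slides), libnuma can place an allocation on a chosen node so the cores nearest that memory access it cheaply:

#include <stddef.h>
#include <numa.h>   // link with -lnuma

// Allocate a buffer on a specific NUMA node (node chosen by the caller).
float* alloc_on_node(size_t n_floats, int node) {
    return (float*)numa_alloc_onnode(n_floats * sizeof(float), node);
}

// Pair with: numa_free(ptr, n_floats * sizeof(float));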
Message passing model of communication

(Figure: Thread 1 sends the contents of its variable X to thread 2 by calling send(X, 2, my_msg_id); thread 2 receives the message into a variable in its own private address space.)
Example hardware implementation: a cluster of workstations (Infiniband network).
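A hedged sketch of the same send in a real message passing library (MPI is an assumption here; the slide's send(X, 2, my_msg_id) notation is abstract):

#include <mpi.h>

// Rank 1 sends the contents of buffer X to rank 2 under tag my_msg_id;
// rank 2 receives it into its own, private address space.
void exchange(float* X, int count, int my_msg_id, int my_rank) {
    if (my_rank == 1) {
        MPI_Send(X, count, MPI_FLOAT, /*dest=*/2, my_msg_id, MPI_COMM_WORLD);
    } else if (my_rank == 2) {
        MPI_Recv(X, count, MPI_FLOAT, /*source=*/1, my_msg_id,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}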
Data-parallel model

// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[i] = -x[i];
        else
            y[i] = x[i];
    }
}

But if we want to be more precise: the collection is not a first-class ISPC concept. It is implicitly defined by how the program has implemented its array indexing logic. (There is no operation in ISPC with the semantics: "map this code over all elements of this array.")
Main program:

const int N = 1024;
stream<float> input(N);     // sequence (a "stream")
stream<float> tmp(2*N);
stream<float> output(2*N);  // sequence (a "stream")

// initialize N elements of input here...

// double the length of the stream by replicating all elements 2x
stream_repeat(2, input, tmp);

// map absolute_value onto the replicated stream
absolute_value(tmp, output);

Data-parallelism expressed in this functional form is sometimes referred to as the stream programming model.

My experience: cross your fingers and hope the compiler is intelligent enough to generate the code below from the program above.

Kayvon's experience: this is the Achilles heel of all "proper" data-parallel/stream programming systems. "If I just had one more operator..."

// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        float result;
        if (x[i] < 0)
            result = -x[i];
        else
            result = x[i];
        y[2*i+1] = y[2*i] = result;
    }
}
Gather/scatter: two key data-parallel communication primitives
Map absolute_value onto a stream produced by a gather.
Map absolute_value onto a stream, then scatter the results.

(Figure: mem_base points at a 16-element array (elements 0 through 15); the index vector R0 holds the indices 3, 12, 4, 9, 9, 15, 13, 0; a gather loads the elements at those non-contiguous locations into the result vector R1, and a scatter writes the values in R1 back to those locations.)
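In ISPC, gather and scatter are not called explicitly; they are implied whenever a varying index is used, as in this illustrative sketch (the function and array names are made up):

// indices[i] plays the role of the index vector R0 above
export void absolute_value_indexed(
    uniform int N,
    uniform float* input,
    uniform int* indices,
    uniform float* output)
{
    foreach (i = 0 ... N)
    {
        float v = input[indices[i]];             // gather: non-contiguous loads
        output[indices[i]] = (v < 0) ? -v : v;   // scatter: non-contiguous stores
    }
}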
▪ Data parallel
- Structure computation as a big “map” over a collection
- Assumes a shared address space from which to load inputs/store results, but
severely limits communication between iterations of the map
(goal: preserve independent processing of iterations)
- Modern embodiments encourage, but don’t enforce, this structure
Modern practice: mixed programming models
▪ Use shared address space programming within a multi-core node
of a cluster, use message passing between nodes
- Very, very common in practice
- Use convenience of shared address space where it can be implemented
efficiently (within a node), require explicit communication elsewhere
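A minimal sketch of this mixed model, assuming MPI for the message passing between nodes and OpenMP threads for the shared address space within each node (neither library is prescribed by the slide):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char** argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // cores within this node share one address space (OpenMP threads)
    #pragma omp parallel
    {
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
    }

    // explicit messages (e.g., MPI_Send / MPI_Recv) move data between nodes

    MPI_Finalize();
    return 0;
}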