03_multicore2-ispc
[Figure: San Francisco to Stanford. Distance: ~50 km]
Memory: bandwidth ~4 items/sec; latency of transferring any one item: ~2 sec
Memory: bandwidth ~8 items/sec; latency of transferring any one item: ~2 sec
If you connect the pipes, what is the maximum flow you can push through the system?
50 liters/sec
Stanford CS149, Fall 2023
Applying this concept to a computer…
Consider a program that runs threads that repeat the following sequence of three dependent instructions:
1. x = load 64 bytes
2. y = add x + x
3. z = add x + y

[Figure: processor core. Labels: data cache; instruction selection; Fetch/Decode (x2); ALU/EXEC (executes scalar math); LD/ST (executes mem loads/stores); Execution Context (x2)]

Let’s say we’re running this sequence on many threads of …
[Figure: execution timeline of Add and load instructions. Legend: math instruction; load instruction; load command sent to memory (part of mem latency); transferring data from memory. Red regions: core is stalled waiting on data for next instruction.]
<1% GPU efficiency… but still much faster than an eight-core CPU!
(3.2 GHz Xeon E5v4 eight-core CPU connected to 76 GB/sec memory bus: ~3% efficiency on this computation)
▪ Main point: programs must access memory infrequently to utilize modern processors efficiently
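To make the main point concrete, here is a minimal C++ sketch of a steady-state utilization estimate. The function name `math_utilization` and all of its parameters are hypothetical, and the model assumes enough threads are running to hide memory latency, so that only throughput limits matter.

```cpp
#include <algorithm>
#include <cassert>

// Fraction of peak math throughput sustained when each loop iteration moves
// `bytes_per_iter` bytes from memory and executes `math_ops_per_iter` math
// instructions. All four parameters are illustrative assumptions.
double math_utilization(double math_ops_per_iter, double bytes_per_iter,
                        double math_ops_per_clock, double bytes_per_clock) {
    assert(math_ops_per_iter > 0 && math_ops_per_clock > 0 && bytes_per_clock > 0);
    double compute_clocks = math_ops_per_iter / math_ops_per_clock;
    double memory_clocks = bytes_per_iter / bytes_per_clock;
    // With latency hidden, time per iteration is set by the busier resource.
    double clocks_per_iter = std::max(compute_clocks, memory_clocks);
    return compute_clocks / clocks_per_iter;
}
```

For a sequence of one 64-byte load and two adds, on an assumed core that executes one math instruction per clock and receives 8 bytes per clock, `math_utilization(2, 64, 1, 8)` evaluates to 0.25: the math units sit idle three quarters of the time no matter how many threads are available.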
Four-stage instruction pipeline (steps required to complete an instruction):
IF = instruction fetch
D = instruction decode + register read
EX = execute operation
WB = “write back” results to registers

instr 0   IF  D   EX  WB
instr 1       IF  D   EX  WB
instr 2           IF  D   EX  WB
instr 3               IF  D   EX  WB
instr 4                   IF  D   EX  WB
instr 5                       IF  D   EX  WB

Actual instruction pipelines can be variable length (depending on the instruction) and as deep as ~20 stages in modern CPUs.
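The benefit of pipelining can be estimated with a one-line formula. The sketch below assumes an idealized pipeline that starts one instruction per clock and never stalls; `pipeline_clocks` is a hypothetical helper name.

```cpp
#include <cassert>

// Clocks needed to finish `num_instrs` instructions on an ideal pipeline with
// `num_stages` stages: the first instruction takes `num_stages` clocks, and
// each later instruction completes one clock after the previous one.
long pipeline_clocks(long num_instrs, long num_stages) {
    assert(num_instrs >= 0 && num_stages >= 1);
    if (num_instrs == 0) return 0;
    return num_stages + (num_instrs - 1);
}
```

Under these assumptions, six instructions on a four-stage pipeline take `pipeline_clocks(6, 4)` = 9 clocks, compared with 24 clocks if each instruction had to finish all four stages before the next one started.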
Given a program, and given the semantics (meaning) of the operations used, what is the answer that the program will compute?

In what (potentially parallel) order will a program’s operations be executed? Which operations will be computed by each thread? Each execution unit? Each lane of a vector instruction?
▪ http://ispc.github.com/
Invoking sinx()

C++ code: main.cpp

    #include "sinx.h"

    int main(int argc, char** argv) {
      int N = 1024;
      int terms = 5;
      float* x = new float[N];
      float* result = new float[N];

      // initialize x here

      sinx(N, terms, x, result);
      return 0;
    }

C++ code: sinx.cpp

    // Approximates sin(x[i]) for each of the N inputs with a Taylor expansion
    void sinx(int N, int terms, float* x, float* result)
    {
      for (int i=0; i<N; i++)
      {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6; // 3!
        int sign = -1;
        for (int j=1; j<=terms; j++)
        {
          value += sign * numer / denom;
          numer *= x[i] * x[i];
          denom *= (2*j+2) * (2*j+3);
          sign *= -1;
        }
        result[i] = value;
      }
    }

[Figure: timeline of main() and sinx(). Call to sinx(): control transferred to sinx() func. Return from sinx(): control transferred back to main().]
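As a sanity check on what sinx() computes, the following sketch evaluates the same Taylor expansion for a single input and compares it against `std::sin`. The names `sinx_one` and `max_abs_error` are hypothetical, and the sketch deliberately uses `double` accumulators (an assumption of this example, not the slide's choice) so that the factorial denominator cannot overflow.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Same Taylor expansion as the sinx() loop body, for one input value.
// Accumulates in double to sidestep integer overflow in the denominator.
float sinx_one(float x, int terms) {
    double value = x;
    double numer = double(x) * x * x;
    double denom = 6.0;  // 3!
    int sign = -1;
    for (int j = 1; j <= terms; j++) {
        value += sign * numer / denom;
        numer *= double(x) * x;
        denom *= double(2 * j + 2) * (2 * j + 3);
        sign *= -1;
    }
    return float(value);
}

// Largest absolute difference from std::sin over n evenly spaced points in [lo, hi].
float max_abs_error(float lo, float hi, int n, int terms) {
    assert(n >= 2);
    float worst = 0.0f;
    for (int i = 0; i < n; i++) {
        float x = lo + (hi - lo) * i / (n - 1);
        worst = std::max(worst, std::fabs(sinx_one(x, terms) - std::sin(x)));
    }
    return worst;
}
```

With `terms = 5` the approximation should be very close to `std::sin` for inputs near zero; the error grows as |x| increases because the series is truncated.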
[Figure: array element processed by each of the eight program instances (columns) at each step in time (rows)]

time
i=0    0   1   2   3   4   5   6   7
i=1    8   9  10  11  12  13  14  15
i=2   16  17  18  19  20  21  22  23
i=3   24  25  26  27  28  29  30  31

A single “packed vector load” instruction (vmovaps *) efficiently implements:

    float value = x[idx];

for all program instances, since the eight values are contiguous in memory.

    // assumes N % programCount = 0
    for (uniform int i=0; i<N; i+=programCount)
    {
      int idx = i + programIndex;
      float value = x[idx];
      ...

* see _mm256_load_ps() intrinsic function
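The index arithmetic of the loop above (idx = i + programIndex, with i advancing by programCount) can be checked with a small C++ sketch. `interleaved_index` is a hypothetical helper; `iter` counts loop iterations starting from zero.

```cpp
#include <cassert>

// Array index touched by program instance `program_index` in loop iteration
// `iter` when iterations are interleaved across the gang.
int interleaved_index(int iter, int program_index, int program_count) {
    assert(iter >= 0);
    assert(program_index >= 0 && program_index < program_count);
    return iter * program_count + program_index;
}
```

For a fixed `iter`, the indices for instances 0 through 7 are consecutive, which is why one packed load can serve the whole gang.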
Schedule: blocked assignment
“Gang” of ISPC program instances
Gang contains eight instances: programCount = 8

time
i=0    0   8  16  24  32  40  48  56
i=1    1   9  17  25  33  41  49  57
i=2    2  10  18  26  34  42  50  58
i=3    3  11  19  27  35  43  51  59
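A blocked assignment gives each program instance one contiguous chunk of the array. The sketch below is a hypothetical helper (`blocked_index`) that assumes the array length `n` divides evenly by the gang size; with `n = 64` and eight instances it reproduces the index pattern in the table above.

```cpp
#include <cassert>

// Array index touched by program instance `program_index` in its `iter`-th
// step when each instance owns a contiguous block of n / program_count elements.
int blocked_index(int iter, int program_index, int program_count, int n) {
    assert(program_count > 0 && n % program_count == 0);
    assert(program_index >= 0 && program_index < program_count);
    int block = n / program_count;  // elements per instance
    assert(iter >= 0 && iter < block);
    return program_index * block + iter;
}
```

For a fixed `iter`, neighboring instances touch indices that are `block` elements apart, so the gang's eight accesses are not contiguous in memory and cannot be served by a single packed load.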
foreach: key ISPC language construct

▪ foreach declares parallel loop iterations
  - Programmer says: these are the iterations the entire gang (not each instance) must perform
▪ ISPC implementation takes responsibility for assigning iterations to program instances in the gang

        ...
        for (uniform int j=1; j<=terms; j++)
        {
          value += sign * numer / denom;
          numer *= x[i] * x[i];
          denom *= (2*j+2) * (2*j+3);
          sign *= -1;
        }
        result[i] = value;
      }
    }
This ISPC program computes the absolute value of elements of x, then repeats it twice in the output array y.

    // ISPC code:
    export void absolute_repeat(
       uniform int N,
       uniform float* x,
       uniform float* y)
    {
       foreach (i = 0 ... N)
       {
          if (x[i] < 0)
             y[2*i] = -x[i];
          else
             y[2*i] = x[i];
          y[2*i+1] = y[2*i];
       }
    }
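For comparison, here is a serial C++ sketch of what absolute_repeat computes. The function name `absolute_repeat_serial` and the use of `std::vector` are choices made for this example. Note that the output needs room for 2N elements, since every input produces two outputs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial equivalent of absolute_repeat: y[2*i] and y[2*i+1] both receive |x[i]|.
std::vector<float> absolute_repeat_serial(const std::vector<float>& x) {
    std::vector<float> y(2 * x.size());
    for (std::size_t i = 0; i < x.size(); i++) {
        float a = (x[i] < 0) ? -x[i] : x[i];
        y[2 * i] = a;
        y[2 * i + 1] = a;
    }
    assert(y.size() == 2 * x.size());
    return y;
}
```

Each iteration writes only to y[2*i] and y[2*i+1], so no two iterations touch the same output element and the result does not depend on the order in which iterations run.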
What does this program do?

    // main C++ code:
    const int N = 1024;
    float* x = new float[N];
    float* y = new float[N];
    // initialize N elements of x

    // ISPC code:
    export void shift_negative(
       uniform int N,
       uniform float* x,
       uniform float* y)
    {
       foreach (i = 0 ... N)
       {
          if (i >= 1 && x[i] < 0)
             y[i-1] = x[i];
          else
             y[i] = x[i];
       }
    }

The output of this program is undefined! It is possible for multiple iterations of the loop body to write to the same memory location.
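The conflict in shift_negative can be made visible by counting, for a given input, how many loop iterations write to each output element. This is a minimal C++ sketch; `writes_per_output` and `has_write_conflict` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each output index, count how many iterations of the shift_negative
// loop body would write to it.
std::vector<int> writes_per_output(const std::vector<float>& x) {
    std::vector<int> writes(x.size(), 0);
    for (std::size_t i = 0; i < x.size(); i++) {
        if (i >= 1 && x[i] < 0)
            writes[i - 1]++;  // iteration i writes y[i-1]
        else
            writes[i]++;      // iteration i writes y[i]
    }
    return writes;
}

// True if any output element is written by more than one iteration.
bool has_write_conflict(const std::vector<float>& x) {
    for (int w : writes_per_output(x))
        if (w > 1) return true;
    return false;
}
```

For the input {1, -2}, iteration 0 writes y[0] (because x[0] is not negative) and iteration 1 also writes y[0] (because x[1] is negative). Which value ends up in y[0] depends on the order in which the iterations execute, and y[1] is never written.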
Cannot return many copies of a variable to the calling C code, which expects one return value of type float. Result: compile-time type error.
x[i] has a different value for each program instance. So what gets copied into sum? Result: compile-time type error.
Computing the sum of all elements in an array (correctly)

    export uniform float sum_array(
       uniform int N,
       uniform float* x)
    {
       uniform float sum;
       float partial = 0.0f;
       foreach (i = 0 ... N)
       {
          partial += x[i];
       }
       // reduce_add() is part of ISPC’s cross
       // program instance standard library
       sum = reduce_add(partial);
       return sum;
    }

Each instance accumulates a private partial sum (no communication).

Partial sums are added together using the reduce_add() cross-instance communication primitive. The result is the same total sum for all program instances (reduce_add() returns a uniform float).

The ISPC code above will execute in a manner similar to the C code with AVX intrinsics implemented below.

    #include <immintrin.h>

    float sum_summary_AVX(int N, float* x) {
      alignas(32) float tmp[8];  // _mm256_store_ps requires 32-byte alignment
      __m256 partial = _mm256_setzero_ps();  // eight partial sums, all 0.0f
      for (int i=0; i<N; i+=8)   // assumes N % 8 == 0 and x is 32-byte aligned
        partial = _mm256_add_ps(partial, _mm256_load_ps(&x[i]));
      _mm256_store_ps(tmp, partial);
      float sum = 0.f;
      for (int i=0; i<8; i++)    // add the eight partial sums together
        sum += tmp[i];
      return sum;
    }
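The same two-phase pattern (private partial sums, then one cross-instance reduction) can be emulated in portable scalar C++. This is a sketch only: `sum_array_emulated` is a hypothetical name, the gang is modeled as a vector of partial sums, and the interleaved mapping of elements to instances is an assumption of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar emulation of the sum_array pattern for a gang of `gang_size` instances.
float sum_array_emulated(const std::vector<float>& x, int gang_size) {
    assert(gang_size > 0);
    // Phase 1: one private partial sum per program instance, no communication.
    std::vector<float> partial(gang_size, 0.0f);
    for (std::size_t i = 0; i < x.size(); i++)
        partial[i % gang_size] += x[i];  // interleaved assignment (assumed)
    // Phase 2: combine the partial sums, playing the role of reduce_add().
    float sum = 0.0f;
    for (float p : partial)
        sum += p;
    return sum;
}
```

Because floating-point addition is not associative, a result computed this way can differ slightly from a strictly left-to-right serial sum for general inputs; for small integer-valued inputs the two agree exactly.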
For all i, pass value from instance i to the instance (i+offset) % programCount:
int32 rotate(int32 value, uniform int offset);
▪ So all the code I’ve shown you in the previous slides would have executed on only one
of the four cores of the myth machines.
▪ Exceptions:
float result;
if (programIndex % 2 == 0)
val1 = val1 * val2;
▪ Expect programmer to just use foreach
▪ Now there’s very little need to think about program values and uniform logic. Why?

    {
       int twoN = 2 * N;
       float val = x[i];
       float result;
       y[i] = result;
       }
    }
data indexing
import numpy as np
▪ I want you to always be thinking about abstraction vs. implementation for the
remainder of this course.