03_multicore2-ispc

The lecture focuses on multi-core architecture, latency, and bandwidth issues in parallel computing, emphasizing the importance of memory bandwidth in achieving high performance. It discusses examples such as vector multiplication and laundry operations to illustrate throughput optimization techniques. The latter part of the lecture contrasts abstraction and implementation in parallel programming, using ISPC as an example to demonstrate how to effectively utilize parallel resources.

Lecture 3:

Multi-Core Architecture, Part II


(latency/bandwidth issues)
+
Parallel Programming Abstractions
Parallel Computing
Stanford CS149, Fall 2023
Reviewing last time…
▪ Three ideas in throughput computing hardware
- Multi-core execution
- SIMD execution
- Hardware multi-threading

▪ [Will review by going over slides from the end of lecture 2]



Thought experiment
Task: element-wise multiplication of two vectors A and B (A × B = C)
Assume vectors contain millions of elements
- Load input A[i]
- Load input B[i]
- Compute A[i] × B[i]
- Store result into C[i]

Is this a good application to run on a modern throughput-oriented parallel processor?
NVIDIA V100
There are 80 SM cores on the V100:
80 SMs × 64 fp32 ALUs per SM = 5120 ALUs

Think about supplying all those ALUs with data each clock.

(Diagram: the SMs sit behind a 6 MB L2 cache, connected to 16 GB of GPU memory (HBM) at 900 GB/sec over a 4096-bit interface.)


Understanding latency and bandwidth


The school year is starting… gotta get back to Stanford



San Francisco fog vs. South Bay sun
(Photos: when it looks foggy in SF, it looks sunny at Stanford.)


Everyone wants to get back to the South Bay!
Assume only one car is in a lane of the highway at once.
When the car on the highway reaches Stanford, the next car leaves San Francisco.

Car’s velocity: 100 km/hr
Distance from San Francisco to Stanford: ~50 km

Latency of driving from San Francisco to Stanford: 0.5 hours
Throughput: 2 cars per hour


Improving throughput

Approach 1: drive faster!
Car’s velocity: 200 km/hr
Throughput = 4 cars per hour

Approach 2: build more lanes!
Car’s velocity: 100 km/hr (four lanes)
Throughput = 8 cars per hour (2 cars per hour per lane)
Using the highway more efficiently
Car’s velocity: 100 km/hr
Cars spaced out by 1 km
Throughput: 100 cars/hr (1 car every 1/100th of an hour)

With four lanes at the same velocity and spacing:
Throughput: 400 cars/hr (4 cars every 1/100th of an hour)
Terminology
▪ Memory bandwidth
- The rate at which the memory system can provide data to a processor
- Example: 20 GB/s

(Illustration: items streaming from memory to a processor.
First frame: bandwidth ~4 items/sec; latency of transferring any one item: ~2 sec.
Second frame, with a wider link: bandwidth ~8 items/sec; latency of transferring any one item: still ~2 sec.)
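The two frames above are tied together by Little’s law: sustained throughput equals the number of items in flight divided by the per-item latency. At ~2 sec latency, ~4 items/sec implies ~8 items in flight at once; the wider link doubles the in-flight items to ~16 and yields ~8 items/sec, without changing the latency of any individual item at all.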


Example: doing your laundry
Operation: do your laundry
1. Wash clothes (washer: 45 min)
2. Dry clothes (dryer: 60 min)
3. Fold clothes (college student: 15 min)

Latency of completing 1 load of laundry = 2 hours


Increasing laundry throughput
Goal: maximize throughput of many loads of laundry

One approach: duplicate execution resources:
use two washers, two dryers, and call a friend

Latency of completing 2 loads of laundry = 2 hours
Throughput increases by 2x: 1 load/hour
Number of resources increased by 2x: two washers, two dryers


Pipelining laundry
Goal: maximize throughput of doing many loads of laundry
(Timeline over 5 hours: a new load starts each hour; washing, drying, and folding of successive loads overlap.)

Latency: 1 load takes 2 hours
Throughput: 1 load/hour
Resources: one washer, one dryer
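Checking the pipelined numbers: the first load still takes its full 2-hour latency, but after that a load finishes every hour, since the 60-minute dryer is the slowest stage. N loads therefore take 2 + (N - 1) hours, and throughput approaches 1 load/hour as N grows.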
Another example: two connected pipes
Pipe 1: max flow 100 liters/sec
Pipe 2: max flow 50 liters/sec

If you connect the pipes, what is the maximum flow you can push through the system?
50 liters/sec: the slower pipe is the bottleneck.
Applying this concept to a computer…
Consider a program that runs threads that repeat the following sequence of three dependent instructions:

1. X = load 64 bytes
2. Y = add X + X
3. Z = add X + Y

(Diagram: a multi-threaded core with fetch/decode units that perform instruction selection, an ALU/EXEC unit that executes scalar math, a LD/ST unit that executes memory loads/stores, a data cache, and many execution contexts, i.e., HW threads.)

Let’s say we’re running this sequence on many threads of a multi-threaded* core that:
▪ Executes one math operation per clock
▪ Can issue load instructions in parallel with math
▪ Receives 8 bytes/clock from memory

(* Assume there are plenty of hardware threads to hide memory latency)
Processor that can do one add per clock (+ co-issues LDs)

(Timeline diagram. Legend: math instruction; load instruction; load command sent to memory (part of memory latency); transferring data from memory at 8 bytes/clock.
Assumptions: 8 clocks to transfer the data for a load; up to 3 outstanding load requests.

The core issues: add, add, load 64 bytes (loads in progress: 1); add, add, load 64 bytes (loads in progress: 2); add, add, load 64 bytes (loads in progress: 3); then add, add, load 64 bytes: stall! With three loads already in progress, each subsequent load must wait for an earlier transfer to complete.)
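Working out the steady state under these assumptions: each pass through the three-instruction sequence consumes 64 bytes and performs 2 math instructions. Memory delivers 8 bytes/clock, so one 64-byte transfer completes every 8 clocks, which means the core can sustain only 2 math instructions per 8 clocks. The ALU is busy at most 25% of the time, regardless of how many hardware threads or outstanding loads are available.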




Rate of completing math instructions is limited by memory bandwidth
Memory bandwidth-bound execution!

The rate of instructions is determined by the rate at which memory can provide data.

(In the timeline, red regions mark where the core is stalled waiting on data for the next instruction.)

Note that memory is transferring data 100% of the time; it cannot transfer data any faster.

Convince yourself that in steady state, core underutilization is only a function of instruction and memory throughput, not a function of memory latency or the number of outstanding memory requests.


High bandwidth memories
▪ Modern GPUs leverage high-bandwidth memories located near the processor
▪ Example:
- V100 uses HBM2
- 900 GB/s


Back to our thought experiment
Task: element-wise multiplication of two vectors A and B (A × B = C)
Assume vectors contain millions of elements
- Load input A[i]
- Load input B[i]
- Compute A[i] × B[i]
- Store result into C[i]

Three memory operations (12 bytes) for every MUL
NVIDIA V100 GPU can do 5120 fp32 MULs per clock (@ 1.6 GHz)
Need ~98 TB/sec of bandwidth to keep the functional units busy

<1% GPU efficiency… but still much faster than an eight-core CPU!
(3.2 GHz Xeon E5v4 eight-core CPU connected to a 76 GB/sec memory bus: ~3% efficiency on this computation)
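Checking the arithmetic: 5120 MULs/clock × 1.6 GHz ≈ 8.2 × 10^12 MULs/sec, and each MUL requires 12 bytes of memory traffic, so keeping the ALUs fed needs 8.2 × 10^12 × 12 bytes ≈ 98 TB/sec. The V100’s memory delivers 900 GB/sec, roughly 0.9% of that, which is where the <1% efficiency figure comes from.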


This computation is bandwidth limited!
If processors request data at too high a rate, the memory system cannot keep up.

Overcoming bandwidth limits is often the most important challenge facing software developers targeting modern throughput-optimized systems.


In modern computing, bandwidth is the critical resource
Performant parallel programs will:

▪ Organize computation to fetch data from memory less often
- Reuse data previously loaded by the same thread (temporal locality optimizations)
- Share data across threads (inter-thread cooperation)

▪ Favor performing additional arithmetic over storing and reloading values (the math is “free”)

▪ Main point: programs must access memory infrequently to utilize modern processors efficiently
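A standard way to quantify this balance (a back-of-the-envelope measure, not from the slides) is arithmetic intensity: math operations per byte of memory traffic. The vector-multiply example performs 1 MUL per 12 bytes, about 0.08 ops/byte, while a V100 doing ~8.2 × 10^12 MULs/sec on 900 GB/sec needs roughly 9 ops/byte to keep its ALUs busy, a gap of about 100×. The reuse and sharing optimizations above are what close that gap.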


Another pipelining example: an instruction pipeline
Many students ask how a processor can complete a multiply operation in a single clock.
When we say a core does one operation per clock, we are referring to INSTRUCTION THROUGHPUT, NOT LATENCY.

Four-stage instruction pipeline (steps required to complete an instruction):
IF = instruction fetch
D = instruction decode + register read
EX = execute operation
WB = “write back” results to registers

(Diagram, time in clocks: instructions 0 through 5 each pass through IF, D, EX, WB, with each instruction starting one clock after the previous one, so one instruction completes every clock.)

Latency: 1 instruction takes 4 cycles
Throughput: 1 instruction per cycle
(Yes, care must be taken to ensure program correctness when back-to-back instructions are dependent.)

Actual instruction pipelines can be variable length (depending on the instruction) and as deep as ~20 stages in modern CPUs.
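Putting numbers on the diagram: with a 4-stage pipeline, N independent instructions complete in N + 3 cycles (3 cycles to fill the pipeline, then one completion per cycle). Throughput therefore approaches 1 instruction/cycle for large N, even though each individual instruction’s latency remains 4 cycles.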


Part 2:
The theme of the second half of today’s lecture is:

Abstraction vs. implementation

Conflating the semantics (meaning) of an abstraction with the details of its implementation is a common cause of confusion in this course.


Abstraction vs. implementation
Semantics: what do the operations provided by a programming model mean?
Given a program, and given the semantics (meaning) of the operations used, what is the answer that the program will compute?

Implementation (aka scheduling): how will the answer be computed on a parallel machine?
In what (potentially parallel) order will a program’s operations be executed? Which operations will be computed by each thread? Each execution unit? Each lane of a vector instruction?

Your goal as a student:
Given a program and knowledge of how a parallel programming model is implemented, can you “trace” in your head what each part of the parallel computer is doing during each step of the program?


An example: Programming with ISPC


ISPC
▪ Intel SPMD Program Compiler (ISPC)
▪ SPMD: single program multiple data

▪ http://ispc.github.com/

▪ A great read: “The Story of ISPC” (by Matt Pharr)


- https://pharr.org/matt/blog/2018/04/30/ispc-all.html
- Go read it!



Recall: example program from last class
Compute sin(x) using a Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
for each element of an array of N floating-point numbers

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}
Invoking sinx()

C++ code: main.cpp

#include "sinx.h"

int main(int argc, void** argv) {
  int N = 1024;
  int terms = 5;
  float* x = new float[N];
  float* result = new float[N];

  // initialize x here

  sinx(N, terms, x, result);
  return 0;
}

C++ code: sinx.cpp contains the sinx() function from the previous slide.

(Diagram: main() runs until the call to sinx(); control is transferred to the sinx() function; on return from sinx(), control is transferred back to main().)


sinx() in ISPC

C++ code: main.cpp

#include "sinx_ispc.h"

int main(int argc, void** argv) {
  int N = 1024;
  int terms = 5;
  float* x = new float[N];
  float* result = new float[N];

  // initialize x here

  // execute ISPC code
  ispc_sinx(N, terms, x, result);
  return 0;
}

ISPC code: sinx.ispc

export void ispc_sinx(
  uniform int N,
  uniform int terms,
  uniform float* x,
  uniform float* result)
{
  // assume N % programCount = 0
  for (uniform int i=0; i<N; i+=programCount)
  {
    int idx = i + programIndex;
    float value = x[idx];
    float numer = x[idx] * x[idx] * x[idx];
    uniform int denom = 6; // 3!
    uniform int sign = -1;

    for (uniform int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[idx] * x[idx];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[idx] = value;
  }
}

SPMD programming abstraction:
A call to an ISPC function spawns a “gang” of ISPC “program instances”.
All instances run ISPC code concurrently.
Each instance has its own copy of local variables (the blue variables in the code; we’ll talk about “uniform” later).
Upon return, all instances have completed.
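A practical note, not on the slide: the sinx_ispc.h header included by main.cpp is generated by the ISPC compiler when it compiles sinx.ispc. A typical invocation looks something like
  ispc sinx.ispc -o sinx_ispc.o -h sinx_ispc.h --target=avx2-i32x8
which produces an object file containing the compiled gang code plus a C/C++ header declaring the exported functions; exact flags and target names depend on your ISPC version and hardware.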
Invoking sinx() in ISPC

C++ code: main.cpp is the same as on the previous slide; main() calls ispc_sinx(N, terms, x, result).

(Diagram: sequential execution in C code; the call to ispc_sinx() begins executing programCount instances of ispc_sinx(), numbered 0 through 7, in ISPC code; ispc_sinx() returns upon completion of all ISPC program instances, and sequential execution in C code resumes.)

SPMD programming abstraction:
A call to an ISPC function spawns a “gang” of ISPC “program instances”.
All instances run ISPC code concurrently.
Each instance has its own copy of local variables.
Upon return, all instances have completed.

In this illustration programCount = 8.
sinx() in ISPC: “interleaved” assignment of array elements to program instances

(C++ code: main.cpp and ISPC code: sinx.ispc are the same as two slides back.)

ISPC language keywords:

programCount: number of simultaneously executing instances in the gang (a uniform value)

programIndex: id of the current instance in the gang (a non-uniform value: “varying”)

uniform: a type modifier; all instances have the same value for this variable. Its use is purely an optimization and is not needed for correctness.
Interleaved assignment of program instances to loop iterations

Elements of output array (result):
elements 0 through 7 go to instances 0 through 7 on the first pass, elements 8 through 15 on the second pass, elements 16 through 23 on the third, and so on.

“Gang” of ISPC program instances: Instance 0 (programIndex = 0) through Instance 7 (programIndex = 7)

In this illustration the gang contains eight instances: programCount = 8


ISPC implements the gang abstraction using SIMD instructions

(C++ code: main.cpp and the execution diagram are the same as before: sequential C code, then programCount instances of ispc_sinx() running in ISPC code, then sequential C code after ispc_sinx() returns.)

SPMD programming abstraction:
A call to an ISPC function spawns a “gang” of ISPC “program instances”.
All instances run ISPC code simultaneously.
Upon return, all instances have completed.

ISPC compiler generates a SIMD implementation:
The number of instances in a gang is the SIMD width of the hardware (or a small multiple of the SIMD width).
The ISPC compiler generates a C++ function binary (.o) whose body contains SIMD instructions.
The C++ code links against the generated object file as usual.
sinx() in ISPC: version 2
“Blocked” assignment of array elements to program instances

(C++ code: main.cpp is unchanged except that it calls ispc_sinx_v2(N, terms, x, result).)

ISPC code: sinx.ispc

export void ispc_sinx_v2(
  uniform int N,
  uniform int terms,
  uniform float* x,
  uniform float* result)
{
  // assume N % programCount = 0
  uniform int count = N / programCount;
  int start = programIndex * count;
  for (uniform int i=0; i<count; i++)
  {
    int idx = start + i;
    float value = x[idx];
    float numer = x[idx] * x[idx] * x[idx];
    uniform int denom = 6; // 3!
    uniform int sign = -1;

    for (uniform int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[idx] * x[idx];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[idx] = value;
  }
}


Blocked assignment of program instances to loop iterations

Elements of output array (result):
each instance processes one contiguous block; with eight instances and the 24 elements shown, instance 0 gets elements 0 through 2, instance 1 gets 3 through 5, and so on.

“Gang” of ISPC program instances: Instance 0 (programIndex = 0) through Instance 7 (programIndex = 7)

In this illustration the gang contains eight instances: programCount = 8


Schedule: interleaved assignment
“Gang” of ISPC program instances
Gang contains eight instances: programCount = 8

time  Instance 0  Instance 1  Instance 2  Instance 3  Instance 4  Instance 5  Instance 6  Instance 7
i=0        0           1           2           3           4           5           6           7
i=1        8           9          10          11          12          13          14          15
i=2       16          17          18          19          20          21          22          23
i=3       24          25          26          27          28          29          30          31

// assumes N % programCount = 0
for (uniform int i=0; i<N; i+=programCount)
{
  int idx = i + programIndex;
  float value = x[idx];
  ...

A single “packed vector load” instruction (vmovaps *) efficiently implements
  float value = x[idx];
for all program instances, since the eight values are contiguous in memory.

* see the _mm256_load_ps() intrinsic function
Schedule: blocked assignment
“Gang” of ISPC program instances
Gang contains eight instances: programCount = 8

time  Instance 0  Instance 1  Instance 2  Instance 3  Instance 4  Instance 5  Instance 6  Instance 7
i=0        0           8          16          24          32          40          48          56
i=1        1           9          17          25          33          41          49          57
i=2        2          10          18          26          34          42          50          58
i=3        3          11          19          27          35          43          51          59

uniform int count = N / programCount;
int start = programIndex * count;
for (uniform int i=0; i<count; i++) {
  int idx = start + i;
  float value = x[idx];
  ...

Now
  float value = x[idx];
touches eight non-contiguous values in memory for the gang. This requires a “gather” instruction (vgatherdps *), a more complex and more costly SIMD instruction.

* see the _mm256_i32gather_ps() intrinsic function
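To make the cost difference concrete, here is a small C++ sketch using AVX2 intrinsics (illustrative, not from the slides; the function names are hypothetical) showing the two address patterns that the gang’s load of x[idx] compiles to:

#include <immintrin.h>

// Interleaved assignment: the gang's eight addresses are consecutive,
// so one packed vector load fetches data for all eight instances.
__m256 load_interleaved(const float* x, int i) {
  return _mm256_loadu_ps(&x[i]);
}

// Blocked assignment: the gang's eight addresses are strided by count,
// so a gather must fetch eight separate memory locations.
__m256 load_blocked(const float* x, int i, int count) {
  __m256i lane  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);  // programIndex
  __m256i index = _mm256_add_epi32(
      _mm256_mullo_epi32(lane, _mm256_set1_epi32(count)),     // programIndex * count
      _mm256_set1_epi32(i));                                  // + i
  return _mm256_i32gather_ps(x, index, 4);                    // scale = 4 bytes per float
}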


Raising the level of abstraction with foreach

(C++ code: main.cpp is as before, calling the ISPC function.)

ISPC code: sinx.ispc

export void ispc_sinx(
  uniform int N,
  uniform int terms,
  uniform float* x,
  uniform float* result)
{
  foreach (i = 0 ... N)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    uniform int denom = 6; // 3!
    uniform int sign = -1;

    for (uniform int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[i] = value;
  }
}

foreach: key ISPC language construct

▪ foreach declares parallel loop iterations
- Programmer says: these are the iterations the entire gang (not each instance) must perform

▪ The ISPC implementation takes responsibility for assigning iterations to program instances in the gang


How might foreach be implemented?

Code written using the foreach abstraction:

foreach (i = 0 ... N)
{
  // do work for iteration i here...
}

Implementation 1: program instance 0 executes all iterations

if (programIndex == 0) {
  for (int i=0; i<N; i++) {
    // do work for iteration i here...
  }
}

Implementation 2: interleave iterations onto program instances

// assume N % programCount = 0
for (uniform int loop_i=0; loop_i<N; loop_i+=programCount)
{
  int i = loop_i + programIndex;
  // do work for iteration i here...
}

Implementation 3: block iterations onto program instances

// assume N % programCount = 0
uniform int count = N / programCount;
int start = programIndex * count;
for (uniform int loop_i=0; loop_i<count; loop_i++)
{
  int i = start + loop_i;
  // do work for iteration i here...
}

Implementation 4: dynamic assignment of iterations to instances

uniform int nextIter;
if (programIndex == 0)
  nextIter = 0;

int i = atomic_add_local(&nextIter, 1);
while (i < N) {
  // do work for iteration i here...
  i = atomic_add_local(&nextIter, 1);
}


Thinking about iterations, not parallel execution

In many simple cases, using foreach allows the programmer to express their program almost as if it were a sequential program:

export void ispc_function(
  uniform int N,
  uniform float* x,
  uniform float* y)
{
  foreach (i = 0 ... N)
  {
    float val = x[i];
    float result;

    // do work here to compute
    // result from val

    y[i] = result;
  }
}


What does this program do?

// main C++ code:
const int N = 1024;
float* x = new float[N/2];
float* y = new float[N];

// initialize N/2 elements of x here

// call ISPC function
absolute_repeat(N/2, x, y);

// ISPC code:
export void absolute_repeat(
  uniform int N,
  uniform float* x,
  uniform float* y)
{
  foreach (i = 0 ... N)
  {
    if (x[i] < 0)
      y[2*i] = -x[i];
    else
      y[2*i] = x[i];
    y[2*i+1] = y[2*i];
  }
}

This ISPC program computes the absolute value of each element of x, storing each result twice, in consecutive elements of the output array y.
What does this program do?

// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x

// call ISPC function
shift_negative(N, x, y);

// ISPC code:
export void shift_negative(
  uniform int N,
  uniform float* x,
  uniform float* y)
{
  foreach (i = 0 ... N)
  {
    if (i >= 1 && x[i] < 0)
      y[i-1] = x[i];
    else
      y[i] = x[i];
  }
}

The output of this program is undefined!
It is possible for multiple iterations of the loop body to write to the same memory location.


Computing the sum of all elements in an array (incorrectly)

What’s the error in this program?

export uniform float sum_incorrect_1(
  uniform int N,
  uniform float* x)
{
  float sum = 0.0f;
  foreach (i = 0 ... N)
  {
    sum += x[i];
  }
  return sum;
}

sum is of type float (a different variable for each program instance).
Cannot return many copies of a variable to the calling C code, which expects one return value of type float.
Result: compile-time type error.

What’s the error in this program?

export uniform float sum_incorrect_2(
  uniform int N,
  uniform float* x)
{
  uniform float sum = 0.0f;
  foreach (i = 0 ... N)
  {
    sum += x[i];
  }
  return sum;
}

sum is of type uniform float (one copy of the variable for all program instances).
But x[i] has a different value for each program instance, so what gets copied into sum?
Result: compile-time type error.
Computing the sum of all elements in an array (correctly)

export uniform float sum_array(
  uniform int N,
  uniform float* x)
{
  uniform float sum;
  float partial = 0.0f;
  foreach (i = 0 ... N)
  {
    partial += x[i];
  }

  // reduce_add() is part of ISPC's cross-
  // program-instance standard library
  sum = reduce_add(partial);
  return sum;
}

Each instance accumulates a private partial sum (no communication).
Partial sums are added together using the reduce_add() cross-instance communication primitive. The result is the same total sum for all program instances (reduce_add() returns a uniform float).

The ISPC code above will execute in a manner similar to this C code with AVX intrinsics: *

float sum_array_AVX(int N, float* x) {
  alignas(32) float tmp[8];    // aligned AVX stores need 32-byte alignment
  __m256 partial = _mm256_setzero_ps();

  for (int i=0; i<N; i+=8)     // assume N % 8 == 0 and x is 32-byte aligned
    partial = _mm256_add_ps(partial, _mm256_load_ps(&x[i]));

  _mm256_store_ps(tmp, partial);

  float sum = 0.f;
  for (int i=0; i<8; i++)
    sum += tmp[i];

  return sum;
}

* Self-test: if you understand why this implementation correctly implements the semantics of the ISPC gang abstraction, then you’ve got a good command of ISPC.
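For completeness, a sketch (not from the slides) of how the exported function is called from C++; the generated header name is assumed, and ISPC-generated headers declare exported functions inside namespace ispc:

#include <cstdio>
#include "sum_array_ispc.h"  // assumed name of the ISPC-generated header

int main() {
  const int N = 1024;
  float* x = new float[N];
  for (int i=0; i<N; i++)
    x[i] = 1.0f;                        // total should come out to 1024.0

  float total = ispc::sum_array(N, x);  // one call runs the whole gang
  printf("sum = %f\n", total);

  delete[] x;
  return 0;
}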


ISPC’s cross-program-instance operations

Compute the sum of a variable’s value across all program instances in a gang:
uniform int64 reduce_add(int32 x);

Compute the min of all values in a gang:
uniform int32 reduce_min(int32 a);

Broadcast a value from one instance to all instances in a gang:
int32 broadcast(int32 value, uniform int index);

For all i, pass the value from instance i to instance (i + offset) % programCount:
int32 rotate(int32 value, uniform int offset);


ISPC: abstraction vs. implementation
▪ Single program, multiple data (SPMD) programming model
- Programmer “thinks”: running a gang is spawning programCount logical instruction streams (each with a different value of programIndex)
- This is the programming abstraction
- Program is written in terms of this abstraction

▪ Single instruction, multiple data (SIMD) implementation
- ISPC compiler emits vector instructions (e.g., AVX2, ARM NEON) that carry out the logic performed by an ISPC gang
- ISPC compiler handles mapping of conditional control flow to vector instructions (by masking vector lanes, etc., like you do manually in assignment 1)

▪ Semantics of ISPC can be tricky
- SPMD abstraction + uniform values (allows implementation details to peek through the abstraction a bit)


SPMD programming model summary
▪ SPMD = “single program, multiple data”
▪ Define one function, run multiple instances of that function in parallel on different input arguments

(Diagram: a single thread of control calls the SPMD function; during SPMD execution, multiple instances of the function run in parallel (multiple logical threads of control); when the SPMD function returns, the single thread of control resumes.)


ISPC tasks
▪ The ISPC gang abstraction is implemented by SIMD instructions that execute within one thread running on one x86 core of a CPU.

▪ So all the code I’ve shown you on the previous slides would have executed on only one of the four cores of the myth machines.

▪ ISPC contains another abstraction: a “task”, which is used to achieve multi-core execution. I’ll let you read up about that as you do assignment 1.


Thinking about operating on data in parallel?
▪ In many simple cases, using ISPC foreach allows the programmer to express their program almost as if it were a sequential program
- You almost want to explain the code as: “independently, for each element in the input array… do this…”

export void ispc_function(
  uniform int N,
  uniform float* x,
  uniform float* y)
{
  foreach (i = 0 ... N)
  {
    float val = x[i];
    float result;

    // do work here to compute
    // result from val

    y[i] = result;
  }
}

▪ Exceptions:
- Uniform variables
- Cross-instance operations (in the standard library, like reduce_add)

▪ But ISPC is a low-level programming language: by exposing programIndex and programCount, it allows the programmer to define what work each program instance does and what data each instance accesses
- Can implement programs with undefined output
- Can implement programs that are correct only for a specific programCount
But can express very advanced cooperation
Here’s a program that computes the product of all elements of an array in lg(8) = 3 steps:

// compute the product of all eight elements in the
// input array. Assumes the gang size is 8.
export void vec8product(
  uniform float* x,
  uniform float* result)
{
  float val1 = x[programIndex];
  float val2 = shift(val1, -1);

  if (programIndex % 2 == 0)
    val1 = val1 * val2;

  val2 = shift(val1, -2);

  if (programIndex % 4 == 0)
    val1 = val1 * val2;

  val2 = shift(val1, -4);

  if (programIndex % 8 == 0) {
    *result = val1 * val2;
  }
}
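To see why this takes lg(8) = 3 steps, trace the gang on inputs [a0, a1, …, a7], using the convention (as the code above assumes) that shift(v, -1) delivers to each instance the value held by the instance one position above it, so instance 0 receives instance 1’s value:
- After step 1, the even instances hold pair products: instance 0 has a0·a1, instance 2 has a2·a3, instance 4 has a4·a5, instance 6 has a6·a7.
- After step 2, instances 0 and 4 hold quad products: a0·a1·a2·a3 and a4·a5·a6·a7.
- In step 3, instance 0 multiplies its quad product by instance 4’s and writes the full product a0·a1·…·a7 to *result.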


But what if ISPC was not trying to be a low-level language?

▪ Example: change the language so there is no access to programIndex or programCount.
  Expect the programmer to just use foreach.

▪ Now there’s very little need to think about program instances at all.
- Everything outside a foreach must be uniform values and uniform logic. Why?

export void ispc_function(
  int N,
  float* x,
  float* y)
{
  int twoN = 2 * N;
  foreach (i = 0 ... twoN)
  {
    float val = x[i];
    float result;

    // do work here to compute
    // result from val

    y[i] = result;
  }
}


Another alternative

▪ Don’t even allow array indexing!
▪ Invoke the computation once per element of a “collection” data structure
▪ The programmer writes no loops and performs no data indexing

float doWork(float x) {
  // do work here to compute
  // result from x
}

Collection x;  // data structure of N elements

// invoke doWork for all elements of x,
// placing results in collection y
Collection y = map(doWork, x);

▪ This model should be very familiar to NumPy, PyTorch, etc. programmers, right?

import numpy as np

def addOne(i):
    return i + 1

mapAddOne = np.vectorize(addOne)

X = np.arange(15)  # create NumPy array [0, 1, 2, 3, ...]
Y = np.arange(15)  # create NumPy array [0, 1, 2, 3, ...]

Z = X + Y
Zplus1 = mapAddOne(Z)
# Z = [0, 2, 4, 6, ...]
# Zplus1 = [1, 3, 5, 7, ...]

▪ Much more on this to come


Summary
▪ Programming models provide a way to think about the organization of parallel
programs.

▪ They provide abstractions that permit multiple valid implementations.

▪ I want you to always be thinking about abstraction vs. implementation for the
remainder of this course.
