03_multicore2-ispc

The lecture focuses on multi-core architecture, latency, and bandwidth issues in parallel computing, emphasizing the importance of memory bandwidth in achieving high performance. It discusses examples such as vector multiplication and laundry operations to illustrate throughput optimization techniques. The latter part of the lecture contrasts abstraction and implementation in parallel programming, using ISPC as an example to demonstrate how to effectively utilize parallel resources.

Lecture 3:

Multi-Core Architecture, Part II


(latency/bandwidth issues)
+
Parallel Programming Abstractions
Parallel Computing
Stanford CS149, Fall 2023
Reviewing last time…
▪ Three ideas in throughput computing hardware
- Multi-core execution
- SIMD execution
- Hardware multi-threading

▪ [Will review by going over slides from the end of lecture 2]



Thought experiment
Task: element-wise multiplication of two vectors A and B (A × B = C)
Assume vectors contain millions of elements
- Load input A[i]
- Load input B[i]
- Compute A[i] × B[i]
- Store result into C[i]

Is this a good application to run on a modern throughput-oriented parallel processor?
NVIDIA V100
There are 80 SM cores on the V100:
80 SMs × 64 fp32 ALUs per SM = 5120 ALUs

Think about supplying all those ALUs with data each clock.

(Diagram: the SMs sit behind a 6 MB L2 cache, connected to 16 GB of GPU memory (HBM) at 900 GB/sec over a 4096-bit interface.)


Understanding latency and bandwidth


The school year is starting… gotta get back to Stanford



San Francisco fog vs. South Bay sun
(Photos: when it looks foggy in SF, it looks sunny at Stanford.)


Everyone wants to get back to the South Bay!
Assume only one car is in a lane of the highway at once.
When the car on the highway reaches Stanford, the next car leaves San Francisco.

Car’s velocity: 100 km/hr
Distance from San Francisco to Stanford: ~50 km

Latency of driving from San Francisco to Stanford: 0.5 hours
Throughput: 2 cars per hour


Improving throughput

Approach 1: drive faster!
Car’s velocity: 200 km/hr
Throughput = 4 cars per hour

Approach 2: build more lanes!
Car’s velocity: 100 km/hr (four lanes)
Throughput = 8 cars per hour (2 cars per hour per lane)
Using the highway more efficiently
Car’s velocity: 100 km/hr
Cars spaced out by 1 km
Throughput: 100 cars/hr (1 car every 1/100th of an hour)

With four lanes at the same velocity and spacing:
Throughput: 400 cars/hr (4 cars every 1/100th of an hour)
Terminology
▪ Memory bandwidth
- The rate at which the memory system can provide data to a processor
- Example: 20 GB/s

(Illustration: items streaming from memory to a processor.
First frame: bandwidth ~4 items/sec; latency of transferring any one item: ~2 sec.
Second frame, with a wider link: bandwidth ~8 items/sec; latency of transferring any one item: still ~2 sec.)
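The two frames above are tied together by Little’s law: sustained throughput equals the number of items in flight divided by the per-item latency. At ~2 sec latency, ~4 items/sec implies ~8 items in flight at once; the wider link doubles the in-flight items to ~16 and yields ~8 items/sec, without changing the latency of any individual item at all.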


Example: doing your laundry
Operation: do your laundry
1. Wash clothes (washer: 45 min)
2. Dry clothes (dryer: 60 min)
3. Fold clothes (college student: 15 min)

Latency of completing 1 load of laundry = 2 hours


Increasing laundry throughput
Goal: maximize throughput of many loads of laundry

One approach: duplicate execution resources:
use two washers, two dryers, and call a friend

Latency of completing 2 loads of laundry = 2 hours
Throughput increases by 2x: 1 load/hour
Number of resources increased by 2x: two washers, two dryers


Pipelining laundry
Goal: maximize throughput of doing many loads of laundry
(Timeline over 5 hours: a new load starts each hour; washing, drying, and folding of successive loads overlap.)

Latency: 1 load takes 2 hours
Throughput: 1 load/hour
Resources: one washer, one dryer
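Checking the pipelined numbers: the first load still takes its full 2-hour latency, but after that a load finishes every hour, since the 60-minute dryer is the slowest stage. N loads therefore take 2 + (N - 1) hours, and throughput approaches 1 load/hour as N grows.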
Another example: two connected pipes
Pipe 1: max flow 100 liters/sec
Pipe 2: max flow 50 liters/sec

If you connect the pipes, what is the maximum flow you can push through the system?
50 liters/sec: the slower pipe is the bottleneck.
Applying this concept to a computer…
Consider a program that runs threads that repeat the following sequence of three dependent instructions:

1. X = load 64 bytes
2. Y = add X + X
3. Z = add X + Y

(Diagram: a multi-threaded core with fetch/decode units that perform instruction selection, an ALU/EXEC unit that executes scalar math, a LD/ST unit that executes memory loads/stores, a data cache, and many execution contexts, i.e., HW threads.)

Let’s say we’re running this sequence on many threads of a multi-threaded* core that:
▪ Executes one math operation per clock
▪ Can issue load instructions in parallel with math
▪ Receives 8 bytes/clock from memory

(* Assume there are plenty of hardware threads to hide memory latency)
Processor that can do one add per clock (+ co-issues LDs)

(Timeline diagram. Legend: math instruction; load instruction; load command sent to memory (part of memory latency); transferring data from memory at 8 bytes/clock.
Assumptions: 8 clocks to transfer the data for a load; up to 3 outstanding load requests.

The core issues: add, add, load 64 bytes (loads in progress: 1); add, add, load 64 bytes (loads in progress: 2); add, add, load 64 bytes (loads in progress: 3); then add, add, load 64 bytes: stall! With three loads already in progress, each subsequent load must wait for an earlier transfer to complete.)
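Working out the steady state under these assumptions: each pass through the three-instruction sequence consumes 64 bytes and performs 2 math instructions. Memory delivers 8 bytes/clock, so one 64-byte transfer completes every 8 clocks, which means the core can sustain only 2 math instructions per 8 clocks. The ALU is busy at most 25% of the time, regardless of how many hardware threads or outstanding loads are available.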




Rate of completing math instructions is limited by memory bandwidth
Memory bandwidth-bound execution!

The rate of instructions is determined by the rate at which memory can provide data.

(In the timeline, red regions mark where the core is stalled waiting on data for the next instruction.)

Note that memory is transferring data 100% of the time; it cannot transfer data any faster.

Convince yourself that in steady state, core underutilization is only a function of instruction and memory throughput, not a function of memory latency or the number of outstanding memory requests.


High bandwidth memories
▪ Modern GPUs leverage high-bandwidth memories located near the processor
▪ Example:
- V100 uses HBM2
- 900 GB/s


Back to our thought experiment
Task: element-wise multiplication of two vectors A and B (A × B = C)
Assume vectors contain millions of elements
- Load input A[i]
- Load input B[i]
- Compute A[i] × B[i]
- Store result into C[i]

Three memory operations (12 bytes) for every MUL
NVIDIA V100 GPU can do 5120 fp32 MULs per clock (@ 1.6 GHz)
Need ~98 TB/sec of bandwidth to keep the functional units busy

<1% GPU efficiency… but still much faster than an eight-core CPU!
(3.2 GHz Xeon E5v4 eight-core CPU connected to a 76 GB/sec memory bus: ~3% efficiency on this computation)
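Checking the arithmetic: 5120 MULs/clock × 1.6 GHz ≈ 8.2 × 10^12 MULs/sec, and each MUL requires 12 bytes of memory traffic, so keeping the ALUs fed needs 8.2 × 10^12 × 12 bytes ≈ 98 TB/sec. The V100’s memory delivers 900 GB/sec, roughly 0.9% of that, which is where the <1% efficiency figure comes from.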


This computation is bandwidth limited!
If processors request data at too high a rate, the memory system cannot keep up.

Overcoming bandwidth limits is often the most important challenge facing software developers targeting modern throughput-optimized systems.


In modern computing, bandwidth is the critical resource
Performant parallel programs will:

▪ Organize computation to fetch data from memory less often
- Reuse data previously loaded by the same thread (temporal locality optimizations)
- Share data across threads (inter-thread cooperation)

▪ Favor performing additional arithmetic over storing and reloading values (the math is “free”)

▪ Main point: programs must access memory infrequently to utilize modern processors efficiently
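A standard way to quantify this balance (a back-of-the-envelope measure, not from the slides) is arithmetic intensity: math operations per byte of memory traffic. The vector-multiply example performs 1 MUL per 12 bytes, about 0.08 ops/byte, while a V100 doing ~8.2 × 10^12 MULs/sec on 900 GB/sec needs roughly 9 ops/byte to keep its ALUs busy, a gap of about 100×. The reuse and sharing optimizations above are what close that gap.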


Another pipelining example: an instruction pipeline
Many students ask how a processor can complete a multiply operation in a single clock.
When we say a core does one operation per clock, we are referring to INSTRUCTION THROUGHPUT, NOT LATENCY.

Four-stage instruction pipeline (steps required to complete an instruction):
IF = instruction fetch
D = instruction decode + register read
EX = execute operation
WB = “write back” results to registers

(Diagram, time in clocks: instructions 0 through 5 each pass through IF, D, EX, WB, with each instruction starting one clock after the previous one, so one instruction completes every clock.)

Latency: 1 instruction takes 4 cycles
Throughput: 1 instruction per cycle
(Yes, care must be taken to ensure program correctness when back-to-back instructions are dependent.)

Actual instruction pipelines can be variable length (depending on the instruction) and as deep as ~20 stages in modern CPUs.
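Putting numbers on the diagram: with a 4-stage pipeline, N independent instructions complete in N + 3 cycles (3 cycles to fill the pipeline, then one completion per cycle). Throughput therefore approaches 1 instruction/cycle for large N, even though each individual instruction’s latency remains 4 cycles.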


Part 2:
The theme of the second half of today’s lecture is:

Abstraction vs. implementation

Conflating the semantics (meaning) of an abstraction with the details of its implementation is a common cause of confusion in this course.


Abstraction vs. implementation
Semantics: what do the operations provided by a programming model mean?
Given a program, and given the semantics (meaning) of the operations used, what is the answer that the program will compute?

Implementation (aka scheduling): how will the answer be computed on a parallel machine?
In what (potentially parallel) order will a program’s operations be executed? Which operations will be computed by each thread? Each execution unit? Each lane of a vector instruction?

Your goal as a student:
Given a program and knowledge of how a parallel programming model is implemented, can you “trace” in your head what each part of the parallel computer is doing during each step of the program?


An example: Programming with ISPC


ISPC
▪ Intel SPMD Program Compiler (ISPC)
▪ SPMD: single program multiple data

▪ http://ispc.github.com/

▪ A great read: “The Story of ISPC” (by Matt Pharr)


- https://pharr.org/matt/blog/2018/04/30/ispc-all.html
- Go read it!



Recall: example program from last class
Compute sin(x) using a Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
for each element of an array of N floating-point numbers

void sinx(int N, int terms, float* x, float* result)
{
  for (int i=0; i<N; i++)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    int denom = 6; // 3!
    int sign = -1;

    for (int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }

    result[i] = value;
  }
}
Invoking sinx()

C++ code: main.cpp

#include "sinx.h"

int main(int argc, void** argv) {
  int N = 1024;
  int terms = 5;
  float* x = new float[N];
  float* result = new float[N];

  // initialize x here

  sinx(N, terms, x, result);
  return 0;
}

C++ code: sinx.cpp contains the sinx() function from the previous slide.

(Diagram: main() runs until the call to sinx(); control is transferred to the sinx() function; on return from sinx(), control is transferred back to main().)


sinx() in ISPC

C++ code: main.cpp

#include "sinx_ispc.h"

int main(int argc, void** argv) {
  int N = 1024;
  int terms = 5;
  float* x = new float[N];
  float* result = new float[N];

  // initialize x here

  // execute ISPC code
  ispc_sinx(N, terms, x, result);
  return 0;
}

ISPC code: sinx.ispc

export void ispc_sinx(
  uniform int N,
  uniform int terms,
  uniform float* x,
  uniform float* result)
{
  // assume N % programCount = 0
  for (uniform int i=0; i<N; i+=programCount)
  {
    int idx = i + programIndex;
    float value = x[idx];
    float numer = x[idx] * x[idx] * x[idx];
    uniform int denom = 6; // 3!
    uniform int sign = -1;

    for (uniform int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[idx] * x[idx];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[idx] = value;
  }
}

SPMD programming abstraction:
A call to an ISPC function spawns a “gang” of ISPC “program instances”.
All instances run ISPC code concurrently.
Each instance has its own copy of local variables (the blue variables in the code; we’ll talk about “uniform” later).
Upon return, all instances have completed.
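A practical note, not on the slide: the sinx_ispc.h header included by main.cpp is generated by the ISPC compiler when it compiles sinx.ispc. A typical invocation looks something like
  ispc sinx.ispc -o sinx_ispc.o -h sinx_ispc.h --target=avx2-i32x8
which produces an object file containing the compiled gang code plus a C/C++ header declaring the exported functions; exact flags and target names depend on your ISPC version and hardware.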
Invoking sinx() in ISPC

C++ code: main.cpp is the same as on the previous slide; main() calls ispc_sinx(N, terms, x, result).

(Diagram: sequential execution in C code; the call to ispc_sinx() begins executing programCount instances of ispc_sinx(), numbered 0 through 7, in ISPC code; ispc_sinx() returns upon completion of all ISPC program instances, and sequential execution in C code resumes.)

SPMD programming abstraction:
A call to an ISPC function spawns a “gang” of ISPC “program instances”.
All instances run ISPC code concurrently.
Each instance has its own copy of local variables.
Upon return, all instances have completed.

In this illustration programCount = 8.
sinx() in ISPC: “interleaved” assignment of array elements to program instances

(C++ code: main.cpp and ISPC code: sinx.ispc are the same as two slides back.)

ISPC language keywords:

programCount: number of simultaneously executing instances in the gang (a uniform value)

programIndex: id of the current instance in the gang (a non-uniform value: “varying”)

uniform: a type modifier; all instances have the same value for this variable. Its use is purely an optimization and is not needed for correctness.
Interleaved assignment of program instances to loop iterations

Elements of output array (result):
elements 0 through 7 go to instances 0 through 7 on the first pass, elements 8 through 15 on the second pass, elements 16 through 23 on the third, and so on.

“Gang” of ISPC program instances: Instance 0 (programIndex = 0) through Instance 7 (programIndex = 7)

In this illustration the gang contains eight instances: programCount = 8


ISPC implements the gang abstraction using SIMD instructions

(C++ code: main.cpp and the execution diagram are the same as before: sequential C code, then programCount instances of ispc_sinx() running in ISPC code, then sequential C code after ispc_sinx() returns.)

SPMD programming abstraction:
A call to an ISPC function spawns a “gang” of ISPC “program instances”.
All instances run ISPC code simultaneously.
Upon return, all instances have completed.

ISPC compiler generates a SIMD implementation:
The number of instances in a gang is the SIMD width of the hardware (or a small multiple of the SIMD width).
The ISPC compiler generates a C++ function binary (.o) whose body contains SIMD instructions.
The C++ code links against the generated object file as usual.
sinx() in ISPC: version 2
“Blocked” assignment of array elements to program instances

(C++ code: main.cpp is unchanged except that it calls ispc_sinx_v2(N, terms, x, result).)

ISPC code: sinx.ispc

export void ispc_sinx_v2(
  uniform int N,
  uniform int terms,
  uniform float* x,
  uniform float* result)
{
  // assume N % programCount = 0
  uniform int count = N / programCount;
  int start = programIndex * count;
  for (uniform int i=0; i<count; i++)
  {
    int idx = start + i;
    float value = x[idx];
    float numer = x[idx] * x[idx] * x[idx];
    uniform int denom = 6; // 3!
    uniform int sign = -1;

    for (uniform int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[idx] * x[idx];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[idx] = value;
  }
}


Blocked assignment of program instances to loop iterations

Elements of output array (result):
each instance processes one contiguous block; with eight instances and the 24 elements shown, instance 0 gets elements 0 through 2, instance 1 gets 3 through 5, and so on.

“Gang” of ISPC program instances: Instance 0 (programIndex = 0) through Instance 7 (programIndex = 7)

In this illustration the gang contains eight instances: programCount = 8


Schedule: interleaved assignment
“Gang” of ISPC program instances
Gang contains eight instances: programCount = 8

time  Instance 0  Instance 1  Instance 2  Instance 3  Instance 4  Instance 5  Instance 6  Instance 7
i=0        0           1           2           3           4           5           6           7
i=1        8           9          10          11          12          13          14          15
i=2       16          17          18          19          20          21          22          23
i=3       24          25          26          27          28          29          30          31

// assumes N % programCount = 0
for (uniform int i=0; i<N; i+=programCount)
{
  int idx = i + programIndex;
  float value = x[idx];
  ...

A single “packed vector load” instruction (vmovaps *) efficiently implements
  float value = x[idx];
for all program instances, since the eight values are contiguous in memory.

* see the _mm256_load_ps() intrinsic function
Schedule: blocked assignment
“Gang” of ISPC program instances
Gang contains eight instances: programCount = 8

time  Instance 0  Instance 1  Instance 2  Instance 3  Instance 4  Instance 5  Instance 6  Instance 7
i=0        0           8          16          24          32          40          48          56
i=1        1           9          17          25          33          41          49          57
i=2        2          10          18          26          34          42          50          58
i=3        3          11          19          27          35          43          51          59

uniform int count = N / programCount;
int start = programIndex * count;
for (uniform int i=0; i<count; i++) {
  int idx = start + i;
  float value = x[idx];
  ...

Now
  float value = x[idx];
touches eight non-contiguous values in memory for the gang. This requires a “gather” instruction (vgatherdps *), a more complex and more costly SIMD instruction.

* see the _mm256_i32gather_ps() intrinsic function
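To make the cost difference concrete, here is a small C++ sketch using AVX2 intrinsics (illustrative, not from the slides; the function names are hypothetical) showing the two address patterns that the gang’s load of x[idx] compiles to:

#include <immintrin.h>

// Interleaved assignment: the gang's eight addresses are consecutive,
// so one packed vector load fetches data for all eight instances.
__m256 load_interleaved(const float* x, int i) {
  return _mm256_loadu_ps(&x[i]);
}

// Blocked assignment: the gang's eight addresses are strided by count,
// so a gather must fetch eight separate memory locations.
__m256 load_blocked(const float* x, int i, int count) {
  __m256i lane  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);  // programIndex
  __m256i index = _mm256_add_epi32(
      _mm256_mullo_epi32(lane, _mm256_set1_epi32(count)),     // programIndex * count
      _mm256_set1_epi32(i));                                  // + i
  return _mm256_i32gather_ps(x, index, 4);                    // scale = 4 bytes per float
}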


Raising the level of abstraction with foreach

(C++ code: main.cpp is as before, calling the ISPC function.)

ISPC code: sinx.ispc

export void ispc_sinx(
  uniform int N,
  uniform int terms,
  uniform float* x,
  uniform float* result)
{
  foreach (i = 0 ... N)
  {
    float value = x[i];
    float numer = x[i] * x[i] * x[i];
    uniform int denom = 6; // 3!
    uniform int sign = -1;

    for (uniform int j=1; j<=terms; j++)
    {
      value += sign * numer / denom;
      numer *= x[i] * x[i];
      denom *= (2*j+2) * (2*j+3);
      sign *= -1;
    }
    result[i] = value;
  }
}

foreach: key ISPC language construct

▪ foreach declares parallel loop iterations
- Programmer says: these are the iterations the entire gang (not each instance) must perform

▪ The ISPC implementation takes responsibility for assigning iterations to program instances in the gang


How might foreach be implemented?

Code written using the foreach abstraction:

foreach (i = 0 ... N)
{
  // do work for iteration i here...
}

Implementation 1: program instance 0 executes all iterations

if (programIndex == 0) {
  for (int i=0; i<N; i++) {
    // do work for iteration i here...
  }
}

Implementation 2: interleave iterations onto program instances

// assume N % programCount = 0
for (uniform int loop_i=0; loop_i<N; loop_i+=programCount)
{
  int i = loop_i + programIndex;
  // do work for iteration i here...
}

Implementation 3: block iterations onto program instances

// assume N % programCount = 0
uniform int count = N / programCount;
int start = programIndex * count;
for (uniform int loop_i=0; loop_i<count; loop_i++)
{
  int i = start + loop_i;
  // do work for iteration i here...
}

Implementation 4: dynamic assignment of iterations to instances

uniform int nextIter;
if (programIndex == 0)
  nextIter = 0;

int i = atomic_add_local(&nextIter, 1);
while (i < N) {
  // do work for iteration i here...
  i = atomic_add_local(&nextIter, 1);
}


Thinking about iterations, not parallel execution

In many simple cases, using foreach allows the programmer to express their program almost as if it were a sequential program:

export void ispc_function(
  uniform int N,
  uniform float* x,
  uniform float* y)
{
  foreach (i = 0 ... N)
  {
    float val = x[i];
    float result;

    // do work here to compute
    // result from val

    y[i] = result;
  }
}


What does this program do?

// main C++ code:
const int N = 1024;
float* x = new float[N/2];
float* y = new float[N];

// initialize N/2 elements of x here

// call ISPC function
absolute_repeat(N/2, x, y);

// ISPC code:
export void absolute_repeat(
  uniform int N,
  uniform float* x,
  uniform float* y)
{
  foreach (i = 0 ... N)
  {
    if (x[i] < 0)
      y[2*i] = -x[i];
    else
      y[2*i] = x[i];
    y[2*i+1] = y[2*i];
  }
}

This ISPC program computes the absolute value of each element of x, storing each result twice, in consecutive elements of the output array y.
What does this program do?

// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x

// call ISPC function
shift_negative(N, x, y);

// ISPC code:
export void shift_negative(
  uniform int N,
  uniform float* x,
  uniform float* y)
{
  foreach (i = 0 ... N)
  {
    if (i >= 1 && x[i] < 0)
      y[i-1] = x[i];
    else
      y[i] = x[i];
  }
}

The output of this program is undefined!
It is possible for multiple iterations of the loop body to write to the same memory location.


Computing the sum of all elements in an array (incorrectly)

What’s the error in this program?

export uniform float sum_incorrect_1(
  uniform int N,
  uniform float* x)
{
  float sum = 0.0f;
  foreach (i = 0 ... N)
  {
    sum += x[i];
  }
  return sum;
}

sum is of type float (a different variable for each program instance).
Cannot return many copies of a variable to the calling C code, which expects one return value of type float.
Result: compile-time type error.

What’s the error in this program?

export uniform float sum_incorrect_2(
  uniform int N,
  uniform float* x)
{
  uniform float sum = 0.0f;
  foreach (i = 0 ... N)
  {
    sum += x[i];
  }
  return sum;
}

sum is of type uniform float (one copy of the variable for all program instances).
But x[i] has a different value for each program instance, so what gets copied into sum?
Result: compile-time type error.
Computing the sum of all elements in an array (correctly)

export uniform float sum_array(
  uniform int N,
  uniform float* x)
{
  uniform float sum;
  float partial = 0.0f;
  foreach (i = 0 ... N)
  {
    partial += x[i];
  }

  // reduce_add() is part of ISPC's cross-
  // program-instance standard library
  sum = reduce_add(partial);
  return sum;
}

Each instance accumulates a private partial sum (no communication).
Partial sums are added together using the reduce_add() cross-instance communication primitive. The result is the same total sum for all program instances (reduce_add() returns a uniform float).

The ISPC code above will execute in a manner similar to this C code with AVX intrinsics: *

float sum_array_AVX(int N, float* x) {
  alignas(32) float tmp[8];    // aligned AVX stores need 32-byte alignment
  __m256 partial = _mm256_setzero_ps();

  for (int i=0; i<N; i+=8)     // assume N % 8 == 0 and x is 32-byte aligned
    partial = _mm256_add_ps(partial, _mm256_load_ps(&x[i]));

  _mm256_store_ps(tmp, partial);

  float sum = 0.f;
  for (int i=0; i<8; i++)
    sum += tmp[i];

  return sum;
}

* Self-test: if you understand why this implementation correctly implements the semantics of the ISPC gang abstraction, then you’ve got a good command of ISPC.
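For completeness, a sketch (not from the slides) of how the exported function is called from C++; the generated header name is assumed, and ISPC-generated headers declare exported functions inside namespace ispc:

#include <cstdio>
#include "sum_array_ispc.h"  // assumed name of the ISPC-generated header

int main() {
  const int N = 1024;
  float* x = new float[N];
  for (int i=0; i<N; i++)
    x[i] = 1.0f;                        // total should come out to 1024.0

  float total = ispc::sum_array(N, x);  // one call runs the whole gang
  printf("sum = %f\n", total);

  delete[] x;
  return 0;
}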


ISPC’s cross-program-instance operations

Compute the sum of a variable’s value across all program instances in a gang:
uniform int64 reduce_add(int32 x);

Compute the min of all values in a gang:
uniform int32 reduce_min(int32 a);

Broadcast a value from one instance to all instances in a gang:
int32 broadcast(int32 value, uniform int index);

For all i, pass the value from instance i to instance (i + offset) % programCount:
int32 rotate(int32 value, uniform int offset);


ISPC: abstraction vs. implementation
▪ Single program, multiple data (SPMD) programming model
- Programmer “thinks”: running a gang is spawning programCount logical instruction streams (each with a different value of programIndex)
- This is the programming abstraction
- Program is written in terms of this abstraction

▪ Single instruction, multiple data (SIMD) implementation
- ISPC compiler emits vector instructions (e.g., AVX2, ARM NEON) that carry out the logic performed by an ISPC gang
- ISPC compiler handles mapping of conditional control flow to vector instructions (by masking vector lanes, etc., like you do manually in assignment 1)

▪ Semantics of ISPC can be tricky
- SPMD abstraction + uniform values (allows implementation details to peek through the abstraction a bit)


SPMD programming model summary
▪ SPMD = “single program, multiple data”
▪ Define one function, run multiple instances of that function in parallel on different input arguments

(Diagram: a single thread of control calls the SPMD function; during SPMD execution, multiple instances of the function run in parallel (multiple logical threads of control); when the SPMD function returns, the single thread of control resumes.)


ISPC tasks
▪ The ISPC gang abstraction is implemented by SIMD instructions that execute within one thread running on one x86 core of a CPU.

▪ So all the code I’ve shown you on the previous slides would have executed on only one of the four cores of the myth machines.

▪ ISPC contains another abstraction: a “task”, which is used to achieve multi-core execution. I’ll let you read up about that as you do assignment 1.


Thinking about operating on data in parallel?
▪ In many simple cases, using ISPC foreach allows the programmer to express their program almost as if it were a sequential program
- You almost want to explain the code as: “independently, for each element in the input array… do this…”

export void ispc_function(
  uniform int N,
  uniform float* x,
  uniform float* y)
{
  foreach (i = 0 ... N)
  {
    float val = x[i];
    float result;

    // do work here to compute
    // result from val

    y[i] = result;
  }
}

▪ Exceptions:
- Uniform variables
- Cross-instance operations (in the standard library, like reduce_add)

▪ But ISPC is a low-level programming language: by exposing programIndex and programCount, it allows the programmer to define what work each program instance does and what data each instance accesses
- Can implement programs with undefined output
- Can implement programs that are correct only for a specific programCount
But can express very advanced cooperation
Here’s a program that computes the product of all elements of an array in lg(8) = 3 steps:

// compute the product of all eight elements in the
// input array. Assumes the gang size is 8.
export void vec8product(
  uniform float* x,
  uniform float* result)
{
  float val1 = x[programIndex];
  float val2 = shift(val1, -1);

  if (programIndex % 2 == 0)
    val1 = val1 * val2;

  val2 = shift(val1, -2);

  if (programIndex % 4 == 0)
    val1 = val1 * val2;

  val2 = shift(val1, -4);

  if (programIndex % 8 == 0) {
    *result = val1 * val2;
  }
}
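To see why this takes lg(8) = 3 steps, trace the gang on inputs [a0, a1, …, a7], using the convention (as the code above assumes) that shift(v, -1) delivers to each instance the value held by the instance one position above it, so instance 0 receives instance 1’s value:
- After step 1, the even instances hold pair products: instance 0 has a0·a1, instance 2 has a2·a3, instance 4 has a4·a5, instance 6 has a6·a7.
- After step 2, instances 0 and 4 hold quad products: a0·a1·a2·a3 and a4·a5·a6·a7.
- In step 3, instance 0 multiplies its quad product by instance 4’s and writes the full product a0·a1·…·a7 to *result.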


But what if ISPC was not trying to be a low-level language?

▪ Example: change the language so there is no access to programIndex or programCount.
  Expect the programmer to just use foreach.

▪ Now there’s very little need to think about program instances at all.
- Everything outside a foreach must be uniform values and uniform logic. Why?

export void ispc_function(
  int N,
  float* x,
  float* y)
{
  int twoN = 2 * N;
  foreach (i = 0 ... twoN)
  {
    float val = x[i];
    float result;

    // do work here to compute
    // result from val

    y[i] = result;
  }
}


Another alternative

▪ Don’t even allow array indexing!
▪ Invoke the computation once per element of a “collection” data structure
▪ The programmer writes no loops and performs no data indexing

float doWork(float x) {
  // do work here to compute
  // result from x
}

Collection x;  // data structure of N elements

// invoke doWork for all elements of x,
// placing results in collection y
Collection y = map(doWork, x);

▪ This model should be very familiar to NumPy, PyTorch, etc. programmers, right?

import numpy as np

def addOne(i):
    return i + 1

mapAddOne = np.vectorize(addOne)

X = np.arange(15)  # create NumPy array [0, 1, 2, 3, ...]
Y = np.arange(15)  # create NumPy array [0, 1, 2, 3, ...]

Z = X + Y
Zplus1 = mapAddOne(Z)
# Z = [0, 2, 4, 6, ...]
# Zplus1 = [1, 3, 5, 7, ...]

▪ Much more on this to come


Summary
▪ Programming models provide a way to think about the organization of parallel
programs.

▪ They provide abstractions that permit multiple valid implementations.

▪ I want you to always be thinking about abstraction vs. implementation for the
remainder of this course.
