03_progmodels_slides

The lecture focuses on the distinction between abstraction and implementation in parallel programming, specifically using the Intel SPMD Program Compiler (ISPC). It illustrates how ISPC allows for concurrent execution of programs through a Single Program Multiple Data (SPMD) model, while the underlying implementation utilizes Single Instruction Multiple Data (SIMD) techniques. The document also discusses examples of computing the sine function and summing array elements in parallel, highlighting the importance of understanding both the programming abstraction and its hardware/software implementation.


Lecture 3:

Parallel Programming
Abstractions
(and their corresponding HW/SW implementations)

Parallel Computing
Stanford CS149, Winter 2019
Today’s theme is a critical idea in this course:

Abstraction vs. implementation

Conflating abstraction with implementation is a common
cause of confusion in this course.

Stanford CS149, Winter 2019


An example:
Programming with ISPC

Stanford CS149, Winter 2019


ISPC
▪ Intel SPMD Program Compiler (ISPC)
▪ SPMD: single program multiple data

▪ http://ispc.github.com/

Stanford CS149, Winter 2019


Recall: example program from last class
Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
for each element of an array of N floating-point numbers
void sinx(int N, int terms, float* x, float* result)
{
   for (int i=0; i<N; i++)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      int denom = 6; // 3!
      int sign = -1;

      for (int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }

      result[i] = value;
   }
}
Stanford CS149, Winter 2019
sin(x) in ISPC
Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

ISPC code: sinx.ispc

export void sinx(
   uniform int N,
   uniform int terms,
   uniform float* x,
   uniform float* result)
{
   // assume N % programCount = 0
   for (uniform int i=0; i<N; i+=programCount)
   {
      int idx = i + programIndex;
      float value = x[idx];
      float numer = x[idx] * x[idx] * x[idx];
      uniform int denom = 6; // 3!
      uniform int sign = -1;

      for (uniform int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[idx] * x[idx];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }
      result[idx] = value;
   }
}

SPMD programming abstraction:
- Call to ISPC function spawns “gang” of ISPC “program instances”
- All instances run ISPC code concurrently
- Upon return, all instances have completed
Stanford CS149, Winter 2019
sin(x) in ISPC
Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

Execution timeline (figure): sequential execution (C code); the call to sinx() begins executing programCount instances of sinx() (ISPC code); sinx() returns upon completion of the ISPC program instances; sequential execution (C code) resumes.

SPMD programming abstraction:
- Call to ISPC function spawns “gang” of ISPC “program instances”
- All instances run ISPC code concurrently
- Upon return, all instances have completed

In this illustration programCount = 8
Stanford CS149, Winter 2019
sin(x) in ISPC
“Interleaved” assignment of array elements to program instances

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

ISPC code: sinx.ispc

export void sinx(
   uniform int N,
   uniform int terms,
   uniform float* x,
   uniform float* result)
{
   // assumes N % programCount = 0
   for (uniform int i=0; i<N; i+=programCount)
   {
      int idx = i + programIndex;
      float value = x[idx];
      float numer = x[idx] * x[idx] * x[idx];
      uniform int denom = 6; // 3!
      uniform int sign = -1;

      for (uniform int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[idx] * x[idx];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }
      result[idx] = value;
   }
}

ISPC Keywords:
programCount: number of simultaneously executing instances in the gang (a uniform value)
programIndex: id of the current instance in the gang (a non-uniform value: “varying”)
uniform: a type modifier. All instances have the same value for this variable. Its use is purely an optimization; it is not needed for correctness.
Stanford CS149, Winter 2019
Interleaved assignment of program instances
to loop iterations
Elements of output array (results): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

“Gang” of ISPC program instances:
Instance 0 (programIndex = 0), Instance 1 (programIndex = 1), Instance 2 (programIndex = 2), Instance 3 (programIndex = 3)
(Interleaved: instance 0 is assigned elements 0, 4, 8, 12; instance 1 elements 1, 5, 9, 13; and so on.)

In this illustration: gang contains four instances: programCount = 4

Stanford CS149, Winter 2019


ISPC implements the gang abstraction using
SIMD instructions
C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

Execution timeline (figure): sequential execution (C code); the call to sinx() begins executing programCount instances of sinx() (ISPC code, eight instances shown); sinx() returns upon completion of the ISPC program instances; sequential execution (C code) resumes.

SPMD programming abstraction:
- Call to ISPC function spawns “gang” of ISPC “program instances”
- All instances run ISPC code concurrently
- Upon return, all instances have completed

ISPC compiler generates SIMD implementation:
- Number of instances in a gang is the SIMD width of the hardware (or a small multiple of SIMD width)
- ISPC compiler generates binary (.o) with SIMD instructions
- C++ code links against object file as usual
Stanford CS149, Winter 2019
sin(x) in ISPC: version 2
“Blocked” assignment of elements to instances

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

ISPC code: sinx.ispc

export void sinx(
   uniform int N,
   uniform int terms,
   uniform float* x,
   uniform float* result)
{
   // assume N % programCount = 0
   uniform int count = N / programCount;
   int start = programIndex * count;
   for (uniform int i=0; i<count; i++)
   {
      int idx = start + i;
      float value = x[idx];
      float numer = x[idx] * x[idx] * x[idx];
      uniform int denom = 6; // 3!
      uniform int sign = -1;

      for (uniform int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[idx] * x[idx];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }
      result[idx] = value;
   }
}
Stanford CS149, Winter 2019


Blocked assignment of program instances to loop
iterations
Elements of output array (results): 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

“Gang” of ISPC program instances:
Instance 0 (programIndex = 0), Instance 1 (programIndex = 1), Instance 2 (programIndex = 2), Instance 3 (programIndex = 3)
(Blocked: instance 0 is assigned elements 0-3, instance 1 elements 4-7, instance 2 elements 8-11, instance 3 elements 12-15.)

In this illustration: gang contains four instances: programCount = 4

Stanford CS149, Winter 2019


Schedule: interleaved assignment
“Gang” of ISPC program instances
Gang contains four instances: programCount = 4
Instance 0 (programIndex = 0), Instance 1 (programIndex = 1), Instance 2 (programIndex = 2), Instance 3 (programIndex = 3)

Figure (time runs downward; each row is one iteration of the outer loop, each column one instance):
i=0:  0   1   2   3      _mm_load_ps1
i=1:  4   5   6   7
i=2:  8   9  10  11
i=3: 12  13  14  15

// assumes N % programCount = 0
for (uniform int i=0; i<N; i+=programCount)
{
   int idx = i + programIndex;
   float value = x[idx];
   ...

A single “packed load” SSE instruction (_mm_load_ps1) efficiently implements
   float value = x[idx];
for all program instances, since the four values are contiguous in memory.
Stanford CS149, Winter 2019
Schedule: blocked assignment
“Gang” of ISPC program instances
Gang contains four instances: programCount = 4
Instance 0 (programIndex = 0), Instance 1 (programIndex = 1), Instance 2 (programIndex = 2), Instance 3 (programIndex = 3)

Figure (time runs downward; each row is one iteration of the outer loop, each column one instance):
i=0:  0   4   8  12      _mm_i32gather
i=1:  1   5   9  13
i=2:  2   6  10  14
i=3:  3   7  11  15

uniform int count = N / programCount;
int start = programIndex * count;
for (uniform int i=0; i<count; i++) {
   int idx = start + i;
   float value = x[idx];
   ...

Now
   float value = x[idx];
touches four non-contiguous values in memory. A “gather” instruction (_mm_i32gather) is needed to implement it.
(Gather is a more complex, and more costly, SIMD instruction: only available since 2013 as part of AVX2.)
Stanford CS149, Winter 2019
Raising level of abstraction with foreach
Compute sin(x) using Taylor expansion: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...

C++ code: main.cpp

#include "sinx_ispc.h"

int N = 1024;
int terms = 5;
float* x = new float[N];
float* result = new float[N];

// initialize x here

// execute ISPC code
sinx(N, terms, x, result);

ISPC code: sinx.ispc

export void sinx(
   uniform int N,
   uniform int terms,
   uniform float* x,
   uniform float* result)
{
   foreach (i = 0 ... N)
   {
      float value = x[i];
      float numer = x[i] * x[i] * x[i];
      uniform int denom = 6; // 3!
      uniform int sign = -1;

      for (uniform int j=1; j<=terms; j++)
      {
         value += sign * numer / denom;
         numer *= x[i] * x[i];
         denom *= (2*j+2) * (2*j+3);
         sign *= -1;
      }
      result[i] = value;
   }
}

foreach: key ISPC language construct
▪ foreach declares parallel loop iterations
- Programmer says: these are the iterations the instances in a gang cooperatively must perform
▪ ISPC implementation assigns iterations to program instances in the gang
- Current ISPC implementation will perform a static interleaved assignment (but the abstraction permits a different assignment)
Stanford CS149, Winter 2019
ISPC: abstraction vs. implementation
▪ Single program, multiple data (SPMD) programming model
- Programmer “thinks”: running a gang is spawning programCount logical
instruction streams (each with a different value of programIndex)
- This is the programming abstraction
- Program is written in terms of this abstraction

▪ Single instruction, multiple data (SIMD) implementation


- ISPC compiler emits vector instructions (e.g., AVX2) that carry out the logic
performed by an ISPC gang
- ISPC compiler handles mapping of conditional control flow to vector instructions
(by masking vector lanes, etc.; see the sketch below)

▪ Semantics of ISPC can be tricky


- SPMD abstraction + uniform values
(allows implementation details to peek through the abstraction a bit)
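
The masking mentioned above can be illustrated by hand with AVX intrinsics. The sketch below is only indicative of the kind of code the ISPC compiler might emit for an if/else over varying data; it is not actual compiler output.

#include <immintrin.h>

// Vectorized equivalent of:  if (x[i] < 0) y[i] = -x[i]; else y[i] = x[i];
// Both "branches" are evaluated for all eight lanes; a per-lane mask selects each result.
void absolute_value_avx(int N, const float* x, float* y) {
    for (int i = 0; i < N; i += 8) {                  // assume N % 8 == 0 and 32-byte alignment
        __m256 v    = _mm256_load_ps(&x[i]);
        __m256 neg  = _mm256_sub_ps(_mm256_setzero_ps(), v);               // the "then" result: -x[i]
        __m256 mask = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_LT_OS);   // lanes where x[i] < 0
        __m256 r    = _mm256_blendv_ps(v, neg, mask);                      // per-lane select
        _mm256_store_ps(&y[i], r);
    }
}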
Stanford CS149, Winter 2019
ISPC discussion: sum “reduction”
Compute the sum of all array elements in parallel

export uniform float sumall1(
   uniform int N,
   uniform float* x)
{
   uniform float sum = 0.0f;
   foreach (i = 0 ... N)
   {
      sum += x[i];
   }
   return sum;
}

sum is of type uniform float (one copy of the variable for all program instances)
x[i] is not a uniform expression (different value for each program instance)
Result: compile-time type error

Correct ISPC solution:

export uniform float sumall2(
   uniform int N,
   uniform float* x)
{
   uniform float sum;
   float partial = 0.0f;
   foreach (i = 0 ... N)
   {
      partial += x[i];
   }

   // from ISPC math library
   sum = reduce_add(partial);

   return sum;
}
Stanford CS149, Winter 2019
ISPC discussion: sum “reduction”
Compute the sum of all array elements in parallel

export uniform float sumall2(
   uniform int N,
   uniform float* x)
{
   uniform float sum;
   float partial = 0.0f;
   foreach (i = 0 ... N)
   {
      partial += x[i];
   }

   // from ISPC math library
   sum = reduce_add(partial);

   return sum;
}

Each instance accumulates a private partial sum (no communication).

Partial sums are added together using the reduce_add() cross-instance communication primitive. The result is the same total sum for all program instances (reduce_add() returns a uniform float).

The ISPC code above will execute in a manner similar to the handwritten C + AVX intrinsics implementation below. *

float sumall2(int N, float* x) {
   float tmp[8];   // assume 32-byte alignment
   __m256 partial = _mm256_set1_ps(0.0f);   // broadcast 0.0f to all eight lanes

   for (int i=0; i<N; i+=8)
      partial = _mm256_add_ps(partial, _mm256_load_ps(&x[i]));

   _mm256_store_ps(tmp, partial);

   float sum = 0.f;
   for (int i=0; i<8; i++)
      sum += tmp[i];
   return sum;
}

* Self-test: If you understand why this implementation complies with the semantics of the ISPC gang abstraction, then you’ve got a good command of ISPC.
Stanford CS149, Winter 2019
SPMD programming model summary
▪ SPMD = “single program, multiple data”
▪ Define one function, run multiple instances of that function
in parallel on different input arguments

Single thread of control

Call SPMD function

SPMD execution: multiple instances of the function run in parallel (multiple logical threads of control)

SPMD function returns

Resume single thread of control

Stanford CS149, Winter 2019


ISPC tasks
▪ The ISPC gang abstraction is implemented by SIMD
instructions on one core.

▪ So... all the code I’ve shown you in the previous slides would
have executed on only one of the four cores of the GHC
machines.

▪ ISPC contains another abstraction: a “task” that is used to
achieve multi-core execution. I’ll let you read up about that
(a rough sketch of the syntax follows below).
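
For reference, here is a rough sketch of what a task-based sinx could look like, using ISPC’s documented task, launch, and sync constructs. The function names, the choice of 64 tasks, and the divisibility assumption are illustrative, not code from this course:

// sketch only: decompose the array into 64 chunks, one ISPC task per chunk
task void sinx_chunk(uniform int span, uniform int terms,
                     uniform float* x, uniform float* result)
{
    uniform int start = taskIndex * span;   // taskIndex identifies this launched task
    foreach (i = start ... start + span)
    {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        uniform int denom = 6;  // 3!
        uniform int sign = -1;
        for (uniform int j=1; j<=terms; j++)
        {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2*j+2) * (2*j+3);
            sign *= -1;
        }
        result[i] = value;
    }
}

export void sinx_withtasks(uniform int N, uniform int terms,
                           uniform float* x, uniform float* result)
{
    uniform int span = N / 64;                       // assume N is divisible by 64
    launch[64] sinx_chunk(span, terms, x, result);   // tasks may run on different cores
    sync;                                            // wait for all launched tasks to complete
}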

Stanford CS149, Winter 2019


Part 2 of today’s lecture

▪ Three parallel programming models


- That differ in what communication abstractions they present to the programmer
- Programming models are important because they (1) influence how programmers
think when writing programs and (2) influence the design of parallel hardware
platforms designed to execute them

▪ Corresponding machine architectures


- Abstraction presented by the hardware to low-level software

▪ We’ll focus on differences in communication and synchronization

Stanford CS149, Winter 2019


System layers: interface, implementation, interface, ...
Parallel Applications

Abstractions for describing concurrent, parallel, or independent computation
Abstractions for describing communication
(“Programming model”: provides a way of thinking about the structure of programs)

Language or library primitives/mechanisms

Compiler and/or parallel runtime

OS system call API

Operating system

Hardware Architecture (HW/SW boundary)

Micro-architecture (hardware implementation)

Blue italic text: abstraction/concept
Red italic text: system interface
Black text: system implementation
Stanford CS149, Winter 2019
Example: expressing parallelism with pthreads
Parallel Application

Abstraction for concurrent computation: a thread (programming model)

pthread_create()

pthread library implementation

System call API

OS support: kernel thread management

x86-64 modern multi-core CPU

Blue italic text: abstraction/concept
Red italic text: system interface
Black text: system implementation
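
For concreteness, a minimal sketch of the pthread_create() interface named in the diagram (standard POSIX calls; the worker function and its argument are made up for illustration):

#include <pthread.h>
#include <stdio.h>

// Worker function executed by the spawned thread (illustrative).
void* my_thread_start(void* arg) {
    int* value = (int*)arg;
    printf("hello from spawned thread, arg = %d\n", *value);
    return NULL;
}

int main() {
    pthread_t thread_id;
    int arg = 42;

    // pthread_create() is the system interface from the diagram: the pthread library
    // asks the OS (via the system call API) to create and schedule a kernel thread.
    pthread_create(&thread_id, NULL, my_thread_start, &arg);

    // ... main thread runs concurrently with the spawned thread here ...

    pthread_join(thread_id, NULL);   // wait for the spawned thread to finish
    return 0;
}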
Stanford CS149, Winter 2019
Example: expressing parallelism with ISPC
Parallel Applications

Abstractions for describing parallel computation (programming model):
1. For specifying simultaneous execution (true parallelism)
2. For specifying independent work (potentially parallel)

ISPC language (call ISPC function, foreach construct)

ISPC compiler

System call API

OS support

x86-64 (including AVX vector instructions), single core of CPU

Note: This diagram is specific to the ISPC gang abstraction. ISPC also has the “task” language primitive for multi-core execution. I don’t describe it here, but it would be interesting to think about how that diagram would look.
Stanford CS149, Winter 2019
Three models of communication
(abstractions)

1. Shared address space


2. Message passing
3. Data parallel

Stanford CS149, Winter 2019


Shared address space model
of communication

Stanford CS149, Winter 2019


Shared address space model (abstraction)
▪ Threads communicate by reading/writing to shared variables
▪ Shared variables are like a big bulletin board
- Any thread can read or write to shared variables

Thread 1:
int x = 0;
spawn_thread(foo, &x);
x = 1;                  // store to x

Thread 2:
void foo(int* x) {
   while (x == 0) {}    // load from x
   print x;
}

Figure: both threads share an address space containing x; thread 1 stores to x, thread 2 loads from x.
(Communication operations shown in red)

(Pseudocode provided in a fake C-like language for brevity.) Stanford CS149, Winter 2019
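
For concreteness, the same pattern written in real C++ rather than the fake C-like pseudocode above: a sketch using std::thread and std::atomic (the atomic type ensures the spinning thread eventually observes the store).

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0};   // shared variable: both threads see the same address space

void foo() {
    while (x.load() == 0) {}          // load from x: spin until thread 1's store is visible
    std::printf("%d\n", x.load());
}

int main() {
    std::thread t(foo);   // spawn_thread(foo) from the slide
    x.store(1);           // store to x: communication is implicit in the memory operation
    t.join();
    return 0;
}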
Shared address space model (abstraction)
Synchronization primitives are also shared variables: e.g., locks

Thread 1:
int x = 0;
Lock my_lock;

spawn_thread(foo, &x, &my_lock);

my_lock.lock();
x++;
my_lock.unlock();

Thread 2:
void foo(int* x, lock* my_lock)
{
   my_lock->lock();
   x++;
   my_lock->unlock();

   print x;
}

(Pseudocode provided in a fake C-like language for brevity.) Stanford CS149, Winter 2019
Review: why do we need mutual exclusion?
▪ Each thread executes
- Load the value of diff from shared memory into register r1
- Add the register r2 to register r1
- Store the value of register r1 into diff
▪ One possible interleaving: (let starting value of diff=0, r2=1)

  T0                 T1
  r1 ← diff                            T0 reads value 0
                     r1 ← diff         T1 reads value 0
  r1 ← r1 + r2                         T0 sets value of its r1 to 1
                     r1 ← r1 + r2      T1 sets value of its r1 to 1
  diff ← r1                            T0 stores 1 to diff
                     diff ← r1         T1 stores 1 to diff

▪ This set of three instructions must be “atomic”


Stanford CS149, Winter 2019
Mechanisms for preserving atomicity
▪ Lock/unlock mutex around a critical section
LOCK(mylock);
// critical section
UNLOCK(mylock);

▪ Some languages have first-class support for atomicity of code blocks


atomic {
// critical section
}

▪ Intrinsics for hardware-supported atomic read-modify-write operations


atomicAdd(x, 10);
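
A minimal C++ sketch of the first and third mechanisms above (a mutex-guarded critical section and a hardware-supported atomic add via std::atomic; the counter variables are made up for illustration):

#include <atomic>
#include <mutex>

std::mutex my_lock;
int counter = 0;                     // protected by my_lock
std::atomic<int> x{0};               // updated with atomic read-modify-write operations

void update() {
    // Mechanism 1: lock/unlock a mutex around the critical section
    my_lock.lock();
    counter++;                       // load-add-store is atomic with respect to other lock holders
    my_lock.unlock();

    // Mechanism 3: intrinsic-style hardware-supported atomic read-modify-write
    x.fetch_add(10);                 // analogous to the slide's atomicAdd(x, 10)
}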

Stanford CS149, Winter 2019


Shared address space model (abstraction)
▪ Threads communicate by:
- Reading/writing to shared variables
- Inter-thread communication is implicit in memory operations
- Thread 1 stores to X
- Later, thread 2 reads X (and observes update of value by thread 1)
- Manipulating synchronization primitives
- e.g., ensuring mutual exclusion via use of locks

▪ This is a natural extension of sequential programming


- In fact, all our discussions in class have assumed a shared address space so far!

▪ Helpful analogy: shared variables are like a big bulletin board


- Any thread can read or write to shared variables

Stanford CS149, Winter 2019


HW implementation of a shared address space
Key idea: any processor can directly reference any memory location

“Dance-hall” organization (figure): processors, each with a local cache, connected by an interconnect to the shared memories.

Interconnect examples (figure): shared bus, crossbar, multi-stage network.

Symmetric (shared-memory) multi-processor (SMP):
- Uniform memory access time: cost of accessing an uncached* memory address is the same for all processors
* caching introduces non-uniform access times, but we’ll talk about that later Stanford CS149, Winter 2019
Shared address space HW architectures

Figure: Intel Core i7 (quad core): four cores plus an integrated GPU and a memory controller attached to memory.

Example: Intel Core i7 processor (Kaby Lake) (interconnect is a ring)

Stanford CS149, Winter 2019


Intel’s ring interconnect
Introduced in Sandy Bridge microarchitecture

▪ Four rings
- request
- snoop
- ack
- data (32 bytes)

▪ Six interconnect nodes: four “slices” of L3 cache (2 MB each) + system agent + graphics

▪ Each bank of L3 connected to the ring bus twice

▪ Theoretical peak BW from cores to L3 at 3.4 GHz is approx. 435 GB/sec
- When each core is accessing its local slice

Figure: four cores, each paired with a 2 MB L3 cache slice, sit on the ring between the system agent and graphics.
Stanford CS149, Winter 2019
SUN Niagara 2 (UltraSPARC T2)
Note area of crossbar (CCX): about the same area as one core on the chip.

Figure: eight processor cores connected through a crossbar switch to four L2 cache banks, each with its own memory.

Eight cores

Stanford CS149, Winter 2019


Intel Xeon Phi (Knights Landing)
▪ 72 cores, arranged as 6 x 6 mesh of tiles (2 cores/tile)
▪ YX routing of messages:
- Move in Y
- “Turn”
- Move in X

Figure: KNL mesh interconnect. A grid of tiles is surrounded by MCDRAM/OPIO memory controllers (EDC) at the corners, PCIe/IIO at the top, and two DDR memory controllers (iMC) at the sides.
Stanford CS149, Winter 2019


Non-uniform memory access (NUMA)
All processors can access any memory location, but... the cost of memory access
(latency and/or bandwidth) is different for different processors

Example: latency to access address X is higher from cores 5-8 than cores 1-4

Example: modern dual-socket configuration (figure): two chips, each with four cores (cores 1-4 and cores 5-8), an on-chip network, a memory controller, and attached memory; address X resides in the memory attached to the first chip. The two chips are connected by AMD HyperTransport / Intel QuickPath (QPI).
Stanford CS149, Winter 2019
Non-uniform memory access (NUMA)
All processors can access any memory location, but... the cost of memory access
(latency and/or bandwidth) is different for different processors

Figure: four processors, each with a local cache and a local memory, connected by an interconnect.

▪ Problem with preserving uniform access time in a system: scalability


- GOOD: costs are uniform, BAD: they are uniformly bad (memory is uniformly far away)
▪ NUMA designs are more scalable
- Low latency access to local memory
- Provide high bandwidth to local memory
▪ Cost is increased programmer effort for performance tuning
- Finding, exploiting locality is important to performance
(want most memory accesses to be to local memories)
Stanford CS149, Winter 2019
Summary: shared address space model
▪ Communication abstraction
- Threads read/write shared variables
- Threads manipulate synchronization primitives: locks, atomic ops, etc.
- Logical extension of uniprocessor programming *

▪ Requires hardware support to implement efficiently


- Any processor can load and store from any address (its shared address space!)
- Even with NUMA, costly to scale
(one of the reasons why high core count processors are expensive)

* But NUMA implementation requires reasoning about locality for performance Stanford CS149, Winter 2019
Message passing model of
communication

Stanford CS149, Winter 2019


Message passing model (abstraction)
▪ Threads operate within their own private address spaces
▪ Threads communicate by sending/receiving messages
- send: specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)
- receive: sender, specifies buffer to store data, and optional message identifier
- Sending messages is the only way to exchange data between threads 1 and 2

Thread 1 address space: variable X
   send(X, 2, my_msg_id)
   semantics: send contents of local variable X as a message to thread 2 and tag the message with the id “my_msg_id”

Thread 2 address space: variable Y
   recv(Y, 1, my_msg_id)
   semantics: receive message with id “my_msg_id” from thread 1 and store its contents in local variable Y

(Communication operations shown in red)


Illustration adopted from Culler, Singh, Gupta Stanford CS149, Winter 2019
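
The send/recv above are abstract pseudocode. As a concrete point of reference, roughly the same exchange in MPI, a widely used message-passing library (the tag value and the variables are made up for this sketch):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?

    const int MY_MSG_ID = 99;   // message tag (illustrative)
    float X = 3.14f, Y = 0.0f;

    if (rank == 0) {
        // send contents of local variable X to process 1, tagged MY_MSG_ID
        MPI_Send(&X, 1, MPI_FLOAT, 1, MY_MSG_ID, MPI_COMM_WORLD);
    } else if (rank == 1) {
        // receive the message tagged MY_MSG_ID from process 0 into local variable Y
        MPI_Recv(&Y, 1, MPI_FLOAT, 0, MY_MSG_ID, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %f\n", Y);
    }

    MPI_Finalize();
    return 0;
}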
Message passing (implementation)
▪ Hardware need not implement system-wide loads and stores to execute
message passing programs (only be able to communicate messages
between nodes)
- Can connect commodity systems together to form large parallel machine
(message passing is a programming model for clusters)

Cluster of workstations
(Infiniband network)

IBM Blue Gene/P Supercomputer

Image credit: IBM Stanford CS149, Winter 2019


Caveat: the correspondence between
programming models and machine types is fuzzy
▪ Common to implement message passing abstractions on machines
that implement a shared address space in hardware
- “Sending a message” = copying data into message library buffers
- “Receiving a message” = copying data out of message library buffers

▪ Can implement shared address space abstraction on machines that


do not support it in HW (via less efficient SW solutions)
- Mark all pages with shared variables as invalid
- Page-fault handler issues appropriate network requests

▪ Keep clear in your mind: what is the programming model


(abstractions used to specify program)? And what is the HW
implementation?
Stanford CS149, Winter 2019
The data-parallel model

Stanford CS149, Winter 2019


What are programming models for?
Programming models serve to impose structure on programs!

▪ Shared address space: very little structure to communication


- All threads can read and write to all shared variables
- Pitfall (due to the implementation): not all reads and writes have the same cost
(and that cost is often not apparent in program code)

▪ Message passing: highly structured communication


- All communication occurs in the form of messages (programmer can read
program and see where the communication is—the sends and receives)

▪ Data-parallel: very rigid computation structure


- Programs perform same function on different data elements in a collection

Stanford CS149, Winter 2019


Data-parallel model
▪ Historically: same operation on each element of an array
- Matched the capabilities of SIMD supercomputers of the 80’s
- Connection Machine (CM-1, CM-2): thousands of processors, one instruction decode unit
- Cray supercomputers: vector processors
- add(A, B, n) ← this was one instruction on vectors A, B of length n

▪ NumPy is another good example: C = A + B


(A, B, and C are vectors of same length)

▪ Today: often takes the form of SPMD programming
- map(function, collection) (see the sketch after this list)
- Where function is applied to each element of collection independently
- function may be a complicated sequence of logic (e.g., a loop body)
- Synchronization is implicit at the end of the map (map returns when function has been
applied to all elements of collection)
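
As a small illustration of map(function, collection) outside of ISPC, C++17’s parallel algorithms express the same structure (a sketch; the lambda stands in for an arbitrary side-effect-free per-element function):

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> x(1024, 1.0f), y(1024);

    // map: apply the same side-effect-free function to every element of the collection;
    // the library may run the per-element invocations in parallel and in any order.
    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                   [](float v) { return v < 0 ? -v : v; });
    return 0;
}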

Stanford CS149, Winter 2019


Data parallelism in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x here

absolute_value(N, x, y);

// ISPC code:
export void absolute_value(
   uniform int N,
   uniform float* x,
   uniform float* y)
{
   foreach (i = 0 ... N)
   {
      if (x[i] < 0)
         y[i] = -x[i];
      else
         y[i] = x[i];
   }
}

Think of the loop body as the function (from the previous slide); the foreach construct is a map.

Given this program, it is reasonable to think of the program as mapping the loop body onto each element of the arrays X and Y.

But if we want to be more precise: the collection is not a first-class ISPC concept. It is implicitly defined by how the program has implemented its array indexing logic.

(There is no operation in ISPC with the semantics: “map this code over all elements of this array.”)

Stanford CS149, Winter 2019


Data parallelism in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N/2];
float* y = new float[N];

// initialize N/2 elements of x here

absolute_repeat(N/2, x, y);

// ISPC code:
export void absolute_repeat(
   uniform int N,
   uniform float* x,
   uniform float* y)
{
   foreach (i = 0 ... N)
   {
      if (x[i] < 0)
         y[2*i] = -x[i];
      else
         y[2*i] = x[i];
      y[2*i+1] = y[2*i];
   }
}

Think of the loop body as the function; the foreach construct is a map; the collection is implicitly defined by the array indexing logic.

This is also a valid ISPC program!

It takes the absolute value of the elements of x, then repeats each result twice in the output array y.

(Less obvious how to think of this code as mapping the loop body onto existing collections.)
Stanford CS149, Winter 2019
Data parallelism in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x

shift_negative(N, x, y);

// ISPC code:
export void shift_negative(
   uniform int N,
   uniform float* x,
   uniform float* y)
{
   foreach (i = 0 ... N)
   {
      if (i >= 1 && x[i] < 0)
         y[i-1] = x[i];
      else
         y[i] = x[i];
   }
}

Think of the loop body as the function; the foreach construct is a map; the collection is implicitly defined by the array indexing logic.

The output of this program is undefined!

It is possible for multiple iterations of the loop body to write to the same memory location.

The data-parallel model (foreach) provides no specification of the order in which iterations occur.

The model provides no primitives for fine-grained mutual exclusion/synchronization. It is not intended to help programmers write programs with that structure.
Stanford CS149, Winter 2019
Data parallelism: a more “pure” approach
Note: this is not ISPC syntax (more of Kayvon’s made-up syntax)

Main program:
const int N = 1024;
stream<float> x(N);   // sequence (a “stream”)
stream<float> y(N);   // sequence (a “stream”)

// initialize N elements of x here...

// map function absolute_value onto streams
absolute_value(x, y);

“Kernel” definition:
void absolute_value(float x, float y)
{
   if (x < 0)
      y = -x;
   else
      y = x;
}

Data-parallelism expressed in this functional form is sometimes referred to as the stream programming model.

Streams: sequences of elements. Elements in a stream can be processed independently.

Kernels: side-effect-free functions. Operate element-wise on collections.

Think of the inputs, outputs, and temporaries for each kernel invocation as forming a private per-invocation address space.

Stanford CS149, Winter 2019


Stream programming benefits
const int N = 1024;
stream<float> input(N);
stream<float> output(N);
stream<float> tmp(N);

foo(input, tmp);
bar(tmp, output);

Figure: input → foo → tmp → bar → output

Global-scale program dependencies are known by the compiler (enables the compiler to perform aggressive optimizations that require global program analysis):

Independent processing on elements, kernel functions are side-effect free:
- Optimization: parallelize kernel execution
- Application cannot write a program that is non-deterministic under parallel execution

Inputs/outputs of each invocation are known in advance: prefetching can be employed to hide latency.

Producer-consumer dependencies are known in advance: the implementation can be structured so outputs of the first kernel are immediately processed by the second kernel. (The values are stored in on-chip buffers/caches and never written to memory! Saves bandwidth!)

parallel_for(int i=0; i<N; i++)
{
   output[i] = bar(foo(input[i]));
}

Stanford CS149, Winter 2019


Stream programming drawbacks
const int N = 1024;
stream<float> input(N/2);
stream<float> tmp(N);
stream<float> output(N);

// double length of stream by replicating
// all elements 2x
stream_repeat(2, input, tmp);

absolute_value(tmp, output);

Need a library of operators to describe complex data flows (see the use of the repeat operator above to obtain the same behavior as the indexing code below).

My experience: cross fingers and hope the compiler is intelligent enough to generate the code below from the program above.

Kayvon’s experience:
This is the achilles heel of all “proper” data-parallel/stream programming systems.
“If I just had one more operator”...

// ISPC code:
export void absolute_value(
   uniform int N,
   uniform float* x,
   uniform float* y)
{
   foreach (i = 0 ... N)
   {
      float result;
      if (x[i] < 0)
         result = -x[i];
      else
         result = x[i];
      y[2*i+1] = y[2*i] = result;
   }
}
Stanford CS149, Winter 2019
Gather/scatter: two key data-parallel
communication primitives
Map absolute_value onto stream produced by gather:

const int N = 1024;
stream<float> input(N);
stream<int> indices;
stream<float> tmp_input(N);
stream<float> output(N);

stream_gather(input, indices, tmp_input);
absolute_value(tmp_input, output);

ISPC equivalent:

export void absolute_value(
   uniform int N,
   uniform float* input,
   uniform float* output,
   uniform int* indices)
{
   foreach (i = 0 ... N)
   {
      float tmp = input[indices[i]];
      if (tmp < 0)
         output[i] = -tmp;
      else
         output[i] = tmp;
   }
}

Map absolute_value onto stream, scatter results:

const int N = 1024;
stream<float> input(N);
stream<int> indices;
stream<float> tmp_output(N);
stream<float> output(N);

absolute_value(input, tmp_output);
stream_scatter(tmp_output, indices, output);

ISPC equivalent:

export void absolute_value(
   uniform int N,
   uniform float* input,
   uniform float* output,
   uniform int* indices)
{
   foreach (i = 0 ... N)
   {
      if (input[i] < 0)
         output[indices[i]] = -input[i];
      else
         output[indices[i]] = input[i];
   }
}
Stanford CS149, Winter 2019
Gather instruction
gather(R1, R0, mem_base);   “Gather from buffer mem_base into R1 according to indices specified by R0.”

Figure: an array in memory (base address = mem_base) with elements 0 through 15; the index vector R0 = [3, 12, 4, 9, 9, 15, 13, 0]; the result vector R1 holds the corresponding eight elements.

▪ Gather supported with AVX2 in 2013 (see the sketch after this list)
- But AVX2 does not support SIMD scatter (must implement as a scalar loop)
- A scatter instruction exists in AVX512

▪ Hardware-supported gather/scatter does exist on GPUs
- (still an expensive operation compared to a load/store of a contiguous vector)
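
A minimal sketch of issuing the gather pictured above from C++ with the AVX2 gather intrinsic (the index values mirror the figure; the scale argument of 4 is sizeof(float)):

#include <immintrin.h>

// Gather eight floats from mem_base at the positions given by the index vector.
// For each lane k: result[k] = mem_base[indices[k]].  mem_base must hold at least 16 floats here.
__m256 gather_example(const float* mem_base) {
    __m256i indices = _mm256_setr_epi32(3, 12, 4, 9, 9, 15, 13, 0);   // "R0" from the figure
    return _mm256_i32gather_ps(mem_base, indices, 4);                  // "R1"; scale = sizeof(float)
}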
Stanford CS149, Winter 2019
Summary: data-parallel model
▪ Data-parallelism is about imposing rigid program structure to
facilitate simple programming and advanced optimizations
▪ Basic structure: map a function onto a large collection of data
- Functional: side-effect free execution
- No communication among distinct function invocations
(allow invocations to be scheduled in any order, including in parallel)

▪ In practice that’s how many simple programs work


▪ But... many modern performance-oriented data-parallel languages
do not strictly enforce this structure
- ISPC, OpenCL, CUDA, etc.
- They choose flexibility/familiarity of imperative C-style syntax over the safety of a more
functional form: it has been key to their adoption
- Opinion: sure, functional thinking is great, but programming systems sure should impose
structure to facilitate achieving high-performance implementations, not hinder them
Stanford CS149, Winter 2019
Summary

Stanford CS149, Winter 2019


Summary
▪ Programming models provide a way to think about the
organization of parallel programs.

▪ They provide abstractions that permit multiple valid


implementations.

▪ I want you to always be thinking about abstraction vs.


implementation for the remainder of this course.

Stanford CS149, Winter 2019


Summary
Restrictions imposed by these abstractions are designed to:

1. Reflect realities of parallelization and communication costs to


programmer (help a programmer write efficient programs)
- Shared address space machines: hardware supports any processor accessing any address
- Message passing machines: hardware may accelerate message send/receive/buffering
- Desirable to keep “abstraction distance” low so programs have predictable performance, but
want abstractions to be high enough for code flexibility/portability

2. Provide useful information to implementors of optimizing


compilers/runtimes/hardware to help them efficiently implement
programs using these abstractions

Stanford CS149, Winter 2019


We discussed three parallel programming models
▪ Shared address space
- Communication is unstructured, implicit in loads and stores
- Natural way of programming (extension of single-threaded programming), but
programmer can shoot themselves in the foot easily
- Program might be correct, but not perform well
▪ Message passing
- Structure all communication as messages
- Often harder/more tedious to get first correct program than shared address space
- Structure often helpful in getting to first correct, scalable program

▪ Data parallel
- Structure computation as a big “map” over a collection
- Assumes a shared address space from which to load inputs/store results, but
severely limits communication between iterations of the map
(goal: preserve independent processing of iterations)
- Modern embodiments encourage, but don’t enforce, this structure
Stanford CS149, Winter 2019
Modern practice: mixed programming models
▪ Use shared address space programming within a multi-core node
of a cluster, use message passing between nodes
- Very, very common in practice
- Use convenience of shared address space where it can be implemented
efficiently (within a node), require explicit communication elsewhere

▪ Data-parallel-ish programming models support shared-memory


style synchronization primitives in kernels
- Permit limited forms of inter-iteration communication (e.g., CUDA, OpenCL)

▪ In a future lecture… CUDA/OpenCL use data-parallel model to


scale to many cores, but adopt shared-address space model
allowing threads running on the same core to communicate.

Stanford CS149, Winter 2019


Questions to consider
▪ Programming models enforce different forms of structure on
programs. What are the benefits of data-parallel structure?

▪ With respect to the goals of efficiency/performance… what do


you think are problems of adopting a very high level of abstraction
in a programming system?
- What about potential benefits?

▪ Choose a popular parallel programming system (for example
Hadoop, Spark, or Cilk) and try to describe its programming
model (how are communication and execution expressed?)

Stanford CS149, Winter 2019
