03: Programming Models
Figure: a multi-core CPU with an 8 MB L3 cache and DDR3 DRAM (gigabytes of capacity, 21 GB/sec) connected to a discrete GPU (execution contexts with L1/L2 caches) over a PCIe x16 bus at 8 GB/sec in each direction.
2011 and future: co-locating CPU+GPU on the same chip avoids the PCIe bus bottleneck (and also reduces communication latency).
▪ ISPC (Intel SPMD Program Compiler): http://ispc.github.com/
System layers, from abstraction to implementation:
- Language or API primitives/mechanisms (e.g., pthread_create(); the ISPC language)
- Compiler and/or runtime (e.g., the pthread library; the ISPC compiler)
- Hardware Architecture, the HW/SW boundary (e.g., x86-64)
- Micro-architecture, the HW implementation (e.g., a modern multi-core CPU)
Note: ISPC has additional language primitives for multi-core execution (not discussed here)
Three models of communication
1. Shared address space
2. Message passing
3. Data parallel
▪ Option 2: each thread has its own virtual address space; the shared portion of the address spaces maps to the same physical location (like processes, described this way in the book)
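A minimal sketch of this option using POSIX shared memory (not from the slides; the object name is illustrative, and error handling is omitted): two processes map the same named shared-memory object, so a region of each process's otherwise-private virtual address space refers to the same physical pages.

// C code (POSIX shared memory sketch):
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int* map_shared_int(void) {
    // Each process opens the same named shared-memory object...
    int fd = shm_open("/demo_shared", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(int));
    // ...and maps it into its own virtual address space. The returned
    // virtual addresses may differ across processes, but they name the
    // same physical memory.
    return (int*)mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
}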
Figure: the shared portions of the threads' virtual address spaces map to the same physical memory (physical mapping).
Figures: interconnect organizations for shared address space machines: processors (each with a local cache) sharing memory over a bus, a crossbar, or a multi-stage network; Intel Core i7 (quad core), whose on-chip network is a ring connecting the cores to the memory controller; a crossbar switch connecting processors to banked L2 cache and memory; and designs in which each processor has its own local memory, with processors linked by a point-to-point interconnect (AMD Hyper-Transport / Intel QuickPath).
▪ Hardware support
- Any processor can load and store from any address
- NUMA designs more scalable than uniform memory access
- Even so, costly to scale (see cost of Blacklight)
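As a concrete illustration of the abstraction (a minimal sketch, not from the slides; the counter and lock names are illustrative), threads created with pthread_create() communicate simply by loading and storing shared variables, using a lock for mutual exclusion:

// C code (pthreads sketch):
#include <pthread.h>
#include <stdio.h>

static int counter = 0;                       // shared: visible to all threads
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void* worker(void* arg) {
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);            // synchronize access to shared data
        counter++;                            // load/store to a shared address
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);        // 2000: both threads updated the same memory
    return 0;
}

Without the lock, the two threads' increments could interleave and updates would be lost.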
Figure: message passing. Thread 1 calls send(X, 2, tag) to send the buffer at address X to thread 2; thread 2 calls recv(Y, 1, tag) to receive the message from thread 1 into address Y (the tags must match!). Common hardware implementation: a cluster of workstations connected by an Infiniband network.
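A sketch of the same exchange written with MPI, a message-passing API commonly used on such clusters (buffer contents and the tag value are illustrative; run with at least 3 ranks so processes 1 and 2 exist):

// C code (MPI sketch; mirrors send(X, 2, tag) / recv(Y, 1, tag) above):
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float X[4] = {1, 2, 3, 4};   // sender's local buffer (address X)
    float Y[4];                  // receiver's local buffer (address Y)
    const int tag = 99;          // tags on send and recv must match

    if (rank == 1) {
        // process 1: send the contents of X to process 2
        MPI_Send(X, 4, MPI_FLOAT, /*dest=*/2, tag, MPI_COMM_WORLD);
    } else if (rank == 2) {
        // process 2: receive from process 1 into Y
        MPI_Recv(Y, 4, MPI_FLOAT, /*source=*/1, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}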
// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[i] = -x[i];
        else
            y[i] = x[i];
    }
}
// ISPC code:
export void absolute_repeat(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[2*i] = -x[i];
        else
            y[2*i] = x[i];
        y[2*i+1] = y[2*i];
    }
}

Also a valid program! Takes the absolute value of each element of x and repeats it twice in the output vector y.
Data parallelism example in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x

shift_negative(N, x, y);

Think of the loop body as a function: the foreach construct is a map, and the collection is implicitly defined by the array indexing logic.

// ISPC code:
export void shift_negative(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (i > 1 && x[i] < 0)
            y[i-1] = x[i];
        else
            y[i] = x[i];
    }
}

This program is non-deterministic! Multiple iterations of the loop body may write to the same memory location. (As described, the data-parallel model provides no primitives for fine-grained mutual exclusion/synchronization.)
Figure: gather. Index vector R0 = {3, 12, 4, 9, 9, 15, 13, 0}; result vector R1 collects the elements at those indices from a 16-element array (memory locations 0-15).
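In scalar terms, the gather pictured above is equivalent to the following loop (a sketch; on SIMD hardware or in ISPC the whole loop is a single gather operation):

// C code (scalar equivalent of the gather in the figure):
float mem[16];                              // memory locations 0..15 (assume already filled)
int   R0[8] = {3, 12, 4, 9, 9, 15, 13, 0};  // index vector
float R1[8];                                // result vector
for (int i = 0; i < 8; i++)
    R1[i] = mem[R0[i]];                     // R1[i] gets the element of mem at index R0[i]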
Figure: a heterogeneous cluster: in each node, four IBM Cell CPUs (8 cores each) and AMD CPUs (2 cores each) share a 16 GB memory (address space), and the nodes are connected by a network.