03: Programming Models
Figure: a multi-core CPU with an 8 MB L3 cache and DDR3 DRAM (gigabytes of capacity, 21 GB/sec) connected to a discrete GPU (execution contexts with L1/L2 caches) over a PCIe x16 bus at 8 GB/sec in each direction.
2011 and future: co-locating CPU+GPU on the same chip avoids the PCIe bus bottleneck (and also reduces communication latency).
▪ ISPC (Intel SPMD Program Compiler): http://ispc.github.com/
System layers, from abstraction to implementation:
- Language or API primitives/mechanisms (e.g., pthread_create(); the ISPC language)
- Compiler and/or runtime (e.g., the pthread library; the ISPC compiler)
- Hardware Architecture, the HW/SW boundary (e.g., x86-64)
- Micro-architecture, the HW implementation (e.g., a modern multi-core CPU)
Note: ISPC has additional language primitives for multi-core execution (not discussed here)
Three models of communication
1. Shared address space
2. Message passing
3. Data parallel
▪ Option 2: each thread has its own virtual address space; the shared portion of the address spaces maps to the same physical location (like processes, described this way in the book)
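A minimal sketch of this option using POSIX shared memory (not from the slides; the object name is illustrative, and error handling is omitted): two processes map the same named shared-memory object, so a region of each process's otherwise-private virtual address space refers to the same physical pages.

// C code (POSIX shared memory sketch):
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int* map_shared_int(void) {
    // Each process opens the same named shared-memory object...
    int fd = shm_open("/demo_shared", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(int));
    // ...and maps it into its own virtual address space. The returned
    // virtual addresses may differ across processes, but they name the
    // same physical memory.
    return (int*)mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
}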
Figure: the shared portions of the threads' virtual address spaces map to the same physical memory (physical mapping).
Figures: interconnect organizations for shared address space machines: processors (each with a local cache) sharing memory over a bus, a crossbar, or a multi-stage network; Intel Core i7 (quad core), whose on-chip network is a ring connecting the cores to the memory controller; a crossbar switch connecting processors to banked L2 cache and memory; and designs in which each processor has its own local memory, with processors linked by a point-to-point interconnect (AMD Hyper-Transport / Intel QuickPath).
▪ Hardware support
- Any processor can load and store from any address
- NUMA designs more scalable than uniform memory access
- Even so, costly to scale (see cost of Blacklight)
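As a concrete illustration of the abstraction (a minimal sketch, not from the slides; the counter and lock names are illustrative), threads created with pthread_create() communicate simply by loading and storing shared variables, using a lock for mutual exclusion:

// C code (pthreads sketch):
#include <pthread.h>
#include <stdio.h>

static int counter = 0;                       // shared: visible to all threads
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void* worker(void* arg) {
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);            // synchronize access to shared data
        counter++;                            // load/store to a shared address
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);        // 2000: both threads updated the same memory
    return 0;
}

Without the lock, the two threads' increments could interleave and updates would be lost.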
Figure: message passing. Thread 1 calls send(X, 2, tag) to send the buffer at address X to thread 2; thread 2 calls recv(Y, 1, tag) to receive the message from thread 1 into address Y (the tags must match!). Common hardware implementation: a cluster of workstations connected by an Infiniband network.
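A sketch of the same exchange written with MPI, a message-passing API commonly used on such clusters (buffer contents and the tag value are illustrative; run with at least 3 ranks so processes 1 and 2 exist):

// C code (MPI sketch; mirrors send(X, 2, tag) / recv(Y, 1, tag) above):
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float X[4] = {1, 2, 3, 4};   // sender's local buffer (address X)
    float Y[4];                  // receiver's local buffer (address Y)
    const int tag = 99;          // tags on send and recv must match

    if (rank == 1) {
        // process 1: send the contents of X to process 2
        MPI_Send(X, 4, MPI_FLOAT, /*dest=*/2, tag, MPI_COMM_WORLD);
    } else if (rank == 2) {
        // process 2: receive from process 1 into Y
        MPI_Recv(Y, 4, MPI_FLOAT, /*source=*/1, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}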
// ISPC code:
export void absolute_value(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[i] = -x[i];
        else
            y[i] = x[i];
    }
}
// ISPC code:
export void absolute_repeat(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (x[i] < 0)
            y[2*i] = -x[i];
        else
            y[2*i] = x[i];
        y[2*i+1] = y[2*i];
    }
}

Also a valid program! Takes the absolute value of each element of x and repeats it twice in the output vector y.
Data parallelism example in ISPC
// main C++ code:
const int N = 1024;
float* x = new float[N];
float* y = new float[N];

// initialize N elements of x

shift_negative(N, x, y);

Think of the loop body as a function: the foreach construct is a map, and the collection is implicitly defined by the array indexing logic.

// ISPC code:
export void shift_negative(
    uniform int N,
    uniform float* x,
    uniform float* y)
{
    foreach (i = 0 ... N)
    {
        if (i > 1 && x[i] < 0)
            y[i-1] = x[i];
        else
            y[i] = x[i];
    }
}

This program is non-deterministic! Multiple iterations of the loop body may write to the same memory location. (As described, the data-parallel model provides no primitives for fine-grained mutual exclusion/synchronization.)
Figure: gather. Index vector R0 = {3, 12, 4, 9, 9, 15, 13, 0}; result vector R1 collects the elements at those indices from a 16-element array (memory locations 0-15).
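In scalar terms, the gather pictured above is equivalent to the following loop (a sketch; on SIMD hardware or in ISPC the whole loop is a single gather operation):

// C code (scalar equivalent of the gather in the figure):
float mem[16];                              // memory locations 0..15 (assume already filled)
int   R0[8] = {3, 12, 4, 9, 9, 15, 13, 0};  // index vector
float R1[8];                                // result vector
for (int i = 0; i < 8; i++)
    R1[i] = mem[R0[i]];                     // R1[i] gets the element of mem at index R0[i]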
Figure: a heterogeneous cluster: in each node, four IBM Cell CPUs (8 cores each) and AMD CPUs (2 cores each) share a 16 GB memory (address space), and the nodes are connected by a network.