Computer Architecture
Lecture 8
CA-Lec8 [email protected] 2
Introduction
SIMD Variations
• Vector architectures
• SIMD extensions
– MMX: multimedia extensions (1996)
– SSE: streaming SIMD extensions
– AVX: advanced vector extensions
• Graphics Processor Units (GPUs)
– Considered as SIMD accelerators
CA-Lec8 [email protected] 3
SIMD vs. MIMD
• For x86 processors:
– Expect two additional cores per chip per year
– SIMD width to double every four years
– Potential speedup from SIMD to be twice that from MIMD!!
CA-Lec8 [email protected] 4
Vector Architectures
Vector Architectures
• Basic idea:
– Read sets of data elements into “vector registers”
– Operate on those registers
– Disperse the results back into memory
• Registers are controlled by compiler
– Register files act as compiler controlled buffers
– Used to hide memory latency
– Leverage memory bandwidth
CA-Lec8 [email protected] 5
Vector Supercomputers
• In the 1970s–80s, supercomputer meant vector machine
• Definition of supercomputer
– Fastest machine in the world at given task
– A device to turn a compute‐bound problem into an I/O‐bound
problem
– CDC6600 (Cray, 1964) is regarded as the first supercomputer
• Vector supercomputers (epitomized by Cray‐1, 1976)
– Scalar unit + vector extensions
• Vector registers, vector instructions
• Vector loads/stores
• Highly pipelined functional units
CA-Lec8 [email protected] 6
Cray‐1 (1976)
[Cray-1 block diagram, summarized:]
• Eight 64-element vector registers (V0–V7), plus a vector length register and a vector mask register
• Eight scalar registers (S0–S7) backed by 64 T registers; eight address registers (A0–A7) backed by 64 B registers
• Functional units: FP add, FP multiply, FP reciprocal; integer add, integer logic, shift, population count; address add and address multiply
• Single-port memory: 16 banks of 64-bit words, 8-bit SECDED
• 80 MW/sec data load/store; 320 MW/sec instruction-buffer refill; 4 instruction buffers of 64 bits × 16
• Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz)
CA-Lec8 [email protected] 7
Vector Programming Model
[Figure, summarized:]
• Scalar registers r0–r15; vector registers v0–v15, each holding elements [0] … [VLRMAX-1]
• The Vector Length Register (VLR) gives the number of active elements
• Vector arithmetic (e.g., ADDV v3, v1, v2) operates element-wise on elements [0] … [VLR-1]
• Vector load/store moves elements between memory and a vector register using a base address (e.g., in r1) and a stride (e.g., in r2)
CA-Lec8 [email protected] 8
Example: VMIPS
• Loosely based on Cray‐1
• Vector registers
– Each register holds a 64‐element, 64
bits/element vector
– Register file has 16 read ports and 8 write
ports
• Vector functional units
– Fully pipelined
– Data and control hazards are detected
• Vector load‐store unit
– Fully pipelined
– Words move between vector registers and memory
– One word per clock cycle after the initial latency
• Scalar registers
– 32 general‐purpose registers
– 32 floating‐point registers
CA-Lec8 [email protected] 9
VMIPS Instructions
• Operate on many elements concurrently
• Allows use of slow but wide execution units
• High performance, lower power
• Flexible
• A 64-element, 64-bit vector register can instead be treated as 128 32-bit, 256 16-bit, or 512 8-bit elements
• Matches the needs of multimedia applications (8-bit data) as well as scientific applications that require high precision
CA-Lec8 [email protected] 10
Vector Architectures
Vector Instructions
• ADDVV.D: add two vectors
• ADDVS.D: add vector to a scalar
• LV/SV: vector load and vector store from address
• Vector code example:
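A minimal sketch of how these instructions are used (register names Rx, Ry and the scalar in F0 are illustrative, following the VMIPS style used later in these slides):
L.D F0,s ;load a scalar s
LV V1,Rx ;load vector X from the address in Rx
LV V2,Ry ;load vector Y from the address in Ry
ADDVV.D V3,V1,V2 ;element-wise add: V3 = X + Y
ADDVS.D V4,V3,F0 ;add the scalar in F0 to every element
SV Rx,V4 ;store the result vector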
CA-Lec8 [email protected] 11
Vector Memory‐Memory vs.
Vector Register Machines
• Vector memory‐memory instructions hold all vector operands in main
memory
• The first vector machines, CDC Star‐100 (‘73) and TI ASC (‘71), were
memory‐memory machines
• Cray‐1 (’76) was first vector register machine
Vector Memory-Memory Code
Example Source Code ADDV C, A, B
SUBV D, A, B
for (i=0; i<N; i++)
{
C[i] = A[i] + B[i]; Vector Register Code
D[i] = A[i] - B[i];
} LV V1, A
LV V2, B
ADDV V3, V1, V2
SV V3, C
SUBV V4, V1, V2
SV V4, D
CA-Lec8 [email protected] 12
Vector Memory‐Memory vs.
Vector Register Machines
• Vector memory‐memory architectures (VMMA) require greater
main memory bandwidth, why?
– All operands must be read in and out of memory
• VMMAs make it difficult to overlap execution of multiple vector
operations, why?
– Must check dependencies on memory addresses
• VMMAs incur greater startup latency
– Scalar code was faster on CDC Star‐100 for vectors < 100
elements
– For Cray‐1, vector/scalar breakeven point was around 2
elements
Apart from CDC follow‐ons (Cyber‐205, ETA‐10) all major vector
machines since Cray‐1 have had vector register architectures
CA-Lec8 [email protected] 14
Vector Instructions Example
• Example: DAXPY adds a scalar multiple of a double precision vector to
another double precision vector
L.D F0,a ;load scalar a
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar mult
LV V3,Ry ;load vector Y
ADDVV V4,V2,V3 ;add
SV Ry,V4 ;store result
Requires only 6 instructions
• In MIPS code (a scalar loop sketch is shown after this list)
• ADD waits for MUL, S.D waits for ADD
• In VMIPS
• Stall once for the first vector element, subsequent elements will flow
smoothly down the pipeline.
• Pipeline stall required once per vector instruction!
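For comparison, a sketch of the scalar MIPS loop for DAXPY (this follows the textbook form; register assignments are illustrative):
L.D F0,a ;load scalar a
DADDIU R4,Rx,#512 ;last address to load
Loop: L.D F2,0(Rx) ;load X[i]
MUL.D F2,F2,F0 ;a × X[i]
L.D F4,0(Ry) ;load Y[i]
ADD.D F4,F4,F2 ;a × X[i] + Y[i]
S.D F4,0(Ry) ;store into Y[i]
DADDIU Rx,Rx,#8 ;increment index to X
DADDIU Ry,Ry,#8 ;increment index to Y
DSUBU R20,R4,Rx ;compute bound
BNEZ R20,Loop ;check if done
The loop body executes once per element, so a 64-element DAXPY takes on the order of 600 dynamic instructions instead of 6.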
CA-Lec8 [email protected] 15
Vector Architectures
Challenges of Vector Instructions
• Start up time
– Application and architecture must support long vectors. Otherwise,
they will run out of instructions requiring ILP
– Latency of vector functional unit
– Assume the same as Cray‐1
• Floating‐point add => 6 clock cycles
• Floating‐point multiply => 7 clock cycles
• Floating‐point divide => 20 clock cycles
• Vector load => 12 clock cycles
CA-Lec8 [email protected] 16
Vector Arithmetic Execution
• Use a deep pipeline (=> fast clock) to execute element operations
• Simplifies control of the deep pipeline because elements in a vector are independent (=> no hazards!)
[Figure: a six-stage multiply pipeline computing V3 <- V1 * V2.]
CA-Lec8 [email protected] 17
Vector Instruction Execution
ADDV C,A,B
[Figure, summarized: execution using one pipelined functional unit versus four-lane execution using four pipelined functional units. With four lanes, elements 0, 4, 8, … go to lane 0, elements 1, 5, 9, … to lane 1, elements 2, 6, 10, … to lane 2, and elements 3, 7, 11, … to lane 3; each lane has its own slice of the vector registers and its own port to the memory subsystem.]
CA-Lec8 [email protected] 19
Multiple Lanes Architecture
• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
– Allows for multiple hardware lanes
– No communication between lanes
– Little increase in control overhead
– No need to change machine code
Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance!
CA-Lec8 [email protected] 20
Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] + B[i];
Scalar Sequential Code Vectorized Code
[Figure, summarized: in the scalar sequential code, each iteration issues load, load, add, store in turn; in the vectorized code, all the loads happen together, then all the adds, then all the stores, collapsing the loop into a few vector instructions.]
• Convoy
– Set of vector instructions that could potentially execute
together
CA-Lec8 [email protected] 22
Vector Instruction Parallelism
Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes
Complete 24 operations/cycle while issuing 1 short instruction/cycle
CA-Lec8 [email protected] 23
Convoy
• Convoy: set of vector instructions that could
potentially execute together
– Must not contain structural hazards
– Sequences with RAW hazards should be in
different convoys
CA-Lec8 [email protected] 24
Vector Chaining
• Vector version of register bypassing
– Allows a vector operation to start as soon as the individual elements of its
vector source operand become available
LV V1
MULV V3, V1, V2
ADDV V5, V3, V4
[Figure, summarized: the load unit fills V1; its results are chained into the multiplier (V3 = V1 × V2), and the multiply results are chained into the adder (V5 = V3 + V4), so all three units operate on successive elements at the same time.]
CA-Lec8 [email protected] 25
Vector Chaining Advantage
• Without chaining, must wait for last element of result to be
written before starting dependent instruction
[Figure, summarized: without chaining, Load, Mul, and Add each wait for the previous instruction's last element; with chaining, Mul starts as soon as the first loaded element arrives and Add starts as soon as the first multiply result is ready, overlapping the three instructions.]
CA-Lec8 [email protected] 26
Chimes
• Chime: the unit of time taken to execute one convoy
– m convoys execute in m chimes
– For a vector length of n, this requires approximately m × n clock cycles
CA-Lec8 [email protected] 27
Vector Architectures
Example
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector‐scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV Ry,V4 ;store the sum
Convoys:
1 LV MULVS.D
2 LV ADDVV.D
3 SV
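Three convoys means three chimes; for 64-element vectors this sequence therefore takes about 3 × 64 = 192 clock cycles, ignoring start-up overhead.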
CA-Lec8 [email protected] 28
Vector Strip‐mining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers, “Strip‐mining”
for (i=0; i<N; i++)
    C[i] = A[i]+B[i];
[Figure, summarized: arrays A, B, and C are processed as an odd-sized remainder piece followed by full groups of 64 elements.]
    ANDI  R1, N, 63    # N mod 64
    MTC1  VLR, R1      # Do remainder
loop:
    LV    V1, RA
    DSLL  R2, R1, 3    # Multiply by 8
    DADDU RA, RA, R2   # Bump pointer
    LV    V2, RB
    DADDU RB, RB, R2
    ADDV.D V3, V1, V2
    SV    V3, RC
    DADDU RC, RC, R2
    DSUBU N, N, R1     # Subtract elements
    LI    R1, 64
    MTC1  VLR, R1      # Reset full length
    BGTZ  N, loop      # Any more to do?
CA-Lec8 [email protected] 29
Vector Length Register
• Handling loops not equal to 64
• Vector length not known at compile time? Use Vector Length
Register (VLR) for vectors over the maximum length, strip
mining:
low = 0;
VL = (n % MVL); /*find odd‐size piece using modulo op % */
for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/
for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/
Y[i] = a * X[i] + Y[i] ; /*main operation*/
low = low + VL; /*start of next vector*/
VL = MVL; /*reset the length to maximum vector length*/
}
CA-Lec8 [email protected] 30
Maximum Vector Length (MVL)
• Determine the maximum number of elements in a vector
for a given architecture
CA-Lec8 [email protected] 31
Vector‐Mask Control
• Simple implementation: execute all N operations, and use the mask to disable result writeback for masked-off elements
• Density-time implementation: scan the mask vector and execute only the elements with non-zero mask bits
CA-Lec8 [email protected] 32
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector
instruction
• VMR is part of the architectural state
• For vector processor, it relies on compilers to
manipulate VMR explicitly
• For GPU, it gets the same effect using hardware
– Invisible to SW
• Both GPU and vector processor spend time on
masking
CA-Lec8 [email protected] 33
Vector Mask Registers Example
• Loops that contain IF statements cannot be vectorized without special support
• Consider:
for (i = 0; i < 64; i=i+1)
if (X[i] != 0)
X[i] = X[i] – Y[i];
• Use vector mask register to “disable” elements:
LV V1,Rx ;load vector X into V1
LV V2,Ry ;load vector Y
L.D F0,#0 ;load FP zero into F0
SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV Rx,V1 ;store the result in X
CA-Lec8 [email protected] 34
Compress/Expand Operations
• Compress packs non‐masked elements from one vector register
contiguously at start of destination vector register
– population count of mask vector gives packed vector length
• Expand performs inverse operation
Compress Expand
Used for density-time conditionals and also for general
selection operations CA-Lec8 [email protected] 35
Memory Banks
• The start‐up time for a load/store vector unit is the time to get the first
word from memory into a register. How about the rest of the vector?
– Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start‐ups on load/store units are higher than those for
arithmetic unit
• Memory system must be designed to support high bandwidth for vector
loads and stores
– Spread accesses across multiple banks
1. Support multiple loads or stores per clock. Be able to control the addresses
to the banks independently
2. Support (multiple) non‐sequential loads or stores
3. Support multiple processors sharing the same memory system, so each
processor will be generating its own independent stream of addresses
CA-Lec8 [email protected] 36
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per
cycle
• Processor cycle time is 2.167ns
• SRAM cycle time is 15ns
• How many memory banks needed?
• Solution:
1. The maximum number of memory references each cycle is 32 × 6 = 192
2. An SRAM access takes 15 / 2.167 ≈ 6.93, i.e. about 7 processor cycles
3. Sustaining full memory bandwidth therefore requires 192 × 7 = 1344 memory banks!!
The Cray T932 actually has 1024 memory banks, so it cannot sustain full bandwidth
CA-Lec8 [email protected] 37
Memory Addressing
• Load/store operations move groups of data between
registers and memory
• Three types of addressing
– Unit stride
• Contiguous block of information in memory
• Fastest: always possible to optimize this
– Non‐unit (constant) stride
• Harder to optimize memory system for all possible strides
• Prime number of data banks makes it easier to support different
strides at full bandwidth
– Indexed (gather‐scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases number of programs that vectorize
CA-Lec8 [email protected] 38
Interleaved Memory Layout
[Figure, summarized: a vector processor connected through eight address/data ports to eight banks of unpipelined DRAM; bank i holds the words whose address mod 8 equals i (i = 0 … 7).]
CA-Lec8 [email protected] 40
(Unit/Non‐Unit) Stride Addressing
• The distance separating elements to be gathered into a single
register is called stride.
• In the example
– Matrix D has a stride of 100 double words
– Matrix B has a stride of 1 double word
• Use non‐unit stride for D
– To access non‐sequential memory location and to reshape them into a
dense structure
• The size of the matrix may not be known at compile time
– Use LVWS/SVWS: load/store vector with stride
• The vector stride, like the vector starting address, can be put in a general‐
purpose register (dynamic)
• Cache inherently deals with unit stride data
– Blocking techniques helps non‐unit stride data
CA-Lec8 [email protected] 41
How to get full bandwidth for unit
stride?
• Memory system must sustain (# lanes x word) /clock
• # memory banks > memory latency to avoid stalls
• If desired throughput greater than one word per cycle
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMS. Only good for unit stride or large data
types
• A larger number of banks is also good for supporting more strides at full bandwidth
– can read paper on how to do prime number of banks
efficiently
CA-Lec8 [email protected] 42
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j+1)
for (i = 0; i < 256; i = i+1)
x[i][j] = 2 * x[i][j];
• Even with 128 banks, since 512 is a multiple of 128, word accesses conflict
• SW: loop interchange (sketched after this list) or declaring the array not a power of 2 (“array padding”)
• HW: Prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with prime no. banks?
– address within bank = address mod number words in bank
– bank number? easy if 2N words per bank
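As a sketch of the software fix, interchanging the loops makes the inner loop walk x[i][j] with unit stride, so consecutive accesses fall in different banks:
int x[256][512];
for (i = 0; i < 256; i = i+1)
    for (j = 0; j < 512; j = j+1)
        x[i][j] = 2 * x[i][j];   /* inner loop is now unit stride */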
CA-Lec8 [email protected] 43
Problems of Stride Addressing
• Once we introduce non‐unit strides, it becomes
possible to request accesses from the same bank
frequently
– Memory bank conflict !!
• Stall the other request to solve bank conflict
CA-Lec8 [email protected] 44
Example
• 8 memory banks, bank busy time 6 cycles, total memory latency 12 cycles for the initial access
• What is the difference between a 64-element vector load with a stride of 1 and one with a stride of 32?
• Solution:
1. Since 8 > 6, for a stride of 1 the load takes 12 + 64 = 76 cycles, i.e. 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride
– every access to memory (after the first one) collides with the previous access and must wait for the 6-cycle bank busy time
– the total time is 12 + 1 + 6 × 63 = 391 cycles, i.e. 6.1 cycles per element
CA-Lec8 [email protected] 45
Handling Sparse Matrices in Vector
architecture
• Handling sparse matrices in vector mode is a necessity
• Example: Consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors designating the nonzero elements of A and C
CA-Lec8 [email protected] 46
Scatter‐Gather
• Consider:
for (i = 0; i < n; i=i+1)
A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors
• A and C must have the same number of non-zero elements (the sizes of K and M); the gather/scatter sequence is sketched below
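A sketch of the gather/scatter code in VMIPS style (LVI/SVI are indexed vector load/store; this follows the textbook form, using the registers listed above):
LV Vk, Rk ;load K
LVI Va, (Ra+Vk) ;gather A[K[]]
LV Vm, Rm ;load M
LVI Vc, (Rc+Vm) ;gather C[M[]]
ADDVV.D Va, Va, Vc ;add them
SVI (Ra+Vk), Va ;scatter the sums back to A[K[]]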
CA-Lec8 [email protected] 47
Vector Architecture Summary
• Vector is alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, energy efficient, and better
real‐time model than out‐of‐order
• More lanes, slower clock rate!
– Scalable if elements are independent
– If there is dependency
• One stall per vector instruction rather than one stall per vector element
• Programmer in charge of giving hints to the compiler!
• Design issues: number of lanes, functional units and registers, length of
vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching
CA-Lec8 [email protected] 48
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to compiler
• Cray Y‐MP Benchmarks
CA-Lec8 [email protected] 49
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64‐bit registers split into 2x32b or 4x16b or 8x8b
• Newer designs have 128‐bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit‐stride loads must be aligned to 64/128‐bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units
busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors
CA-Lec8 [email protected] 50
“Vector” for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
– similar to Intel i860, Motorola 88110, HP PA-7100LC, UltraSPARC
• 3 data types: 8 8‐bit, 4 16‐bit, 2 32‐bit in 64bits
– reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8‐bit operands
CA-Lec8 [email protected] 51
MMX Instructions
• Move 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll,srl, sra), And, And Not, Or, Xor
in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply‐Add in parallel: 4 16b
• Compare = , > in parallel: 8 8b, 4 16b, 2 32b
– sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b<–> 16b, 16b <–> 8b
– Pack saturates (set to max) if number is too large
CA-Lec8 [email protected] 52
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurpose 64‐bit floating point registers
– Eight 8‐bit integer ops or four 16‐bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128‐bit registers
– Eight 16‐bit integer ops, Four 32‐bit integer/fp ops, or two 64‐bit
integer/fp ops
– Single‐precision floating‐point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4(2007)
– Double‐precision floating‐point arithmetic
• Advanced Vector Extensions (2010)
– 256‐bits registers
– Four 64‐bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations
CA-Lec8 [email protected] 53
SIMD Implementations: IBM
• VMX (1996‐1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrate FPU and VMX into unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP
CA-Lec8 [email protected] 54
Why SIMD Extensions?
• Media applications operate on data types narrower than the native
word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfer aligned in memory
– Vector: single instruction, 64 memory accesses, page fault in the
middle of the vector likely !!
• Use much smaller register space
• Fewer operands
• No need for sophisticated mechanisms of vector architecture
CA-Lec8 [email protected] 55
Example SIMD Code
• Example DAXPY:
L.D F0,a ;load scalar a
MOV F1, F0 ;copy a into F1 for SIMD MUL
MOV F2, F0 ;copy a into F2 for SIMD MUL
MOV F3, F0 ;copy a into F3 for SIMD MUL
DADDIU R4,Rx,#512 ;last address to load
Loop: L.4D F4,0[Rx] ;load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D F4,F4,F0 ;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3]
L.4D F8,0[Ry] ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D F8,F8,F4 ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D 0[Ry],F8 ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU Rx,Rx,#32 ;increment index to X
DADDIU Ry,Ry,#32 ;increment index to Y
DSUBU R20,R4,Rx ;compute bound
BNEZ R20,Loop ;check if done
CA-Lec8 [email protected] 56
Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128‐256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers rather than register depth to hide
latency
CA-Lec8 [email protected] 57
Roofline Performance Model
• Basic idea:
– Plot peak floating‐point throughput as a function of arithmetic
intensity
– Ties together floating‐point performance and memory
performance for a target machine
• Arithmetic intensity
– Floating‐point operations per byte read
CA-Lec8 [email protected] 58
Examples
• Attainable GFLOPs/sec
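– Attainable GFLOPs/sec = min(Peak Memory Bandwidth × Arithmetic Intensity, Peak Floating-Point Performance)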
CA-Lec8 [email protected] 59
History of GPUs
• Early video cards
– Frame buffer memory with address generation for video
output
• 3D graphics processing
– Originally high‐end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray
tracing
CA-Lec8 [email protected] 60
Graphical Processing Units
Graphical Processing Units
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C‐like programming language for GPU
– Unify all forms of GPU parallelism as CUDA thread
– Programming model is “Single Instruction Multiple
Thread”
CA-Lec8 [email protected] 61
Programming Model
• CUDA’s design goals
– extend a standard sequential programming language,
specifically C/C++,
• focus on the important issues of parallelism—how to craft efficient
parallel algorithms—rather than grappling with the mechanics of
an unfamiliar and complicated language.
CA-Lec8 [email protected] 62
Graphical Processing Units
NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data‐level parallel problems
– Scatter‐gather transfers from memory into local store
– Mask registers
– Large register files
• Differences:
– No scalar processor, scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply
pipelined units like a vector processor
CA-Lec8 [email protected] 63
Graphical Processing Units
Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid
CA-Lec8 [email protected] 64
Graphical Processing Units
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– Block is analogous to a strip‐mined vector loop with
vector length of 32
– Block is assigned to a multithreaded SIMD processor
by the thread block scheduler
– Current‐generation GPUs (Fermi) have 7‐15
multithreaded SIMD processors
CA-Lec8 [email protected] 65
Graphical Processing Units
Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses scoreboard to dispatch
– No data dependencies between threads!
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD
processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors
CA-Lec8 [email protected] 66
Graphical Processing Units
Example
• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– SIMD thread has up to:
• 64 vector registers of 32 32‐bit elements
• 32 vector registers of 32 64‐bit elements
– Fermi has 16 physical SIMD lanes, each containing
2048 registers
CA-Lec8 [email protected] 67
GTX570 GPU
[Figure, summarized:]
• Global memory: 1,280 MB; L2 cache: 640 KB
• Per SM (SM 0 … SM 14): texture cache 8 KB, L1 cache 16 KB, constant cache 8 KB, shared memory 48 KB, 32,768 registers, 32 cores, up to 1,536 threads
CA-Lec8 [email protected] 68
GPU Threads in SM (GTX570)
CA-Lec8 [email protected] 70
Programming the GPU
CA-Lec8 [email protected] 71
Matrix Multiplication
• For a 4096x4096 matrix multiplication
‐ Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means
we can have this many matrix cells calculated in parallel.
• On a general purpose processor we can only calculate one cell
at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C (see the kernel sketch after this list)
• Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU
bandwidth.
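A minimal, untiled CUDA kernel sketch of this idea, with one thread per element of C (the kernel name and types are illustrative, and the shared-memory tiling described above is omitted for brevity):
__global__ void matmul(const float *A, const float *B, float *C, int N) {
    /* each thread computes one element C[row][col] */
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}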
CA-Lec8 [email protected] 72
Programming the GPU
• Distinguishing where functions execute:
__device__ or __global__ => GPU device
Variables declared in device code are allocated in GPU memory
__host__ => system processor (HOST)
• Kernel (function) call:
Name<<<dimGrid, dimBlock>>>(..parameter list..)
blockIdx: block identifier
threadIdx: thread identifier within a block
blockDim: number of threads per block
CA-Lec8 [email protected] 73
CUDA Program Example
//Invoke DAXPY
daxpy(n,2.0,x,y);
//DAXPY in C
void daxpy(int n, double a, double* x, double* y){
for (int i=0;i<n;i++)
y[i] = a*x[i] + y[i];
}
//DAXPY in CUDA (kernel; __global__ so it can be launched from the host)
__global__
void daxpy(int n, double a, double* x, double* y){
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n)
y[i] = a*x[i] + y[i];
}
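A host-side launch for this kernel might look as follows (256 threads per block is an illustrative choice, not something specified on the slide):
//Invoke the CUDA DAXPY kernel with 256 threads per thread block
__host__
void invoke_daxpy(int n, double* x, double* y){
int nblocks = (n + 255) / 256;   /* enough blocks to cover n elements */
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
}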
CA-Lec8 [email protected] 74
NVIDIA Instruction Set Arch.
• “Parallel Thread Execution (PTX)”
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32 R8, blockIdx, 9 ; Thread Block ID * Block size (512 or 29)
add.s32 R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4 ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2 ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i])
CA-Lec8 [email protected] 75
Graphical Processing Units
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• I.e. which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple
execution paths
• Push on divergent branch
– …and when paths converge
• Act as barriers
• Pops stack
• Per‐thread‐lane 1‐bit predicate register, specified by programmer
CA-Lec8 [email protected] 76
Graphical Processing Units
Example
if (X[i] != 0)
X[i] = X[i] – Y[i];
else X[i] = Z[i];
CA-Lec8 [email protected] 77
Graphical Processing Units
NVIDIA GPU Memory Structures
• Each SIMD Lane has private section of off‐chip DRAM
– “Private memory”
– Contains stack frame, spilling registers, and private
variables
• Each multithreaded SIMD processor also has local
memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory
CA-Lec8 [email protected] 78
Graphical Processing Units
Fermi Architecture Innovations
• Each SIMD processor has
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load‐store units, 4
special function units
– Thus, two threads of SIMD instructions are scheduled every two clock
cycles
• Fast double precision
• Caches for GPU memory
• 64‐bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions
CA-Lec8 [email protected] 79
Graphical Processing Units
Fermi Multithreaded SIMD Proc.
CA-Lec8 [email protected] 80
Detecting and Enhancing Loop-Level Parallelism
Compiler Technology for Loop‐Level
Parallelism
• Loop‐carried dependence
– Focuses on determining whether data accesses in later
iterations are dependent on data values produced in
earlier iterations
• Loop‐level parallelism has no loop‐carried dependence
• Example 1:
for (i=999; i>=0; i=i‐1)
x[i] = x[i] + s;
• No loop‐carried dependence
CA-Lec8 [email protected] 81
Detecting and Enhancing Loop-Level Parallelism
Example 2 for Loop‐Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
CA-Lec8 [email protected] 82
Remarks
• Intra‐loop dependence is not loop‐carried dependence
– A sequence of vector instructions that uses chaining
exhibits exactly intra‐loop dependence
• Two types of S1‐S2 intra‐loop dependence
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and
although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in
the dependences
– The absence of a cycle means that the dependences give a
partial ordering on the statements
CA-Lec8 [email protected] 83
Example 3 for Loop‐Level Parallelism (1)
• Example 3
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
• S1 uses value computed by S2 in previous iteration, but
this dependence is not circular.
• There is no dependence from S1 to S2, interchanging
the two statements will not affect the execution of S2
CA-Lec8 [email protected] 84
Example 3 for Loop‐Level Parallelism (2)
• Transform to
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
B[i+1] = C[i] + D[i]; /*S2*/
A[i+1] = A[i+1] + B[i+1]; /*S1*/
}
B[100] = C[99] + D[99];
CA-Lec8 [email protected] 86
Recurrence
• Recurrence is a special form of loop‐carried dependence.
• Recurrence Example:
for (i=1;i<100;i=i+1) {
Y[i] = Y[i‐1] + Y[i];
}
CA-Lec8 [email protected] 87
Compiler Technology for Finding
Dependences
• To determine which loop might contain parallelism (“inexact”) and to
eliminate name dependences.
• Nearly all dependence analysis algorithms work on the assumption that
array indices are affine.
– A one-dimensional array index is affine if it can be written in the form a × i + b (where i is the loop index)
– The index of a multi‐dimensional array is affine if the index in each dimension
is affine.
– Non‐affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to
the same array in a loop is equivalent to determining whether two affine
functions can have the same value for different indices between the
bounds of the loop.
CA-Lec8 [email protected] 88
Finding dependencies Example
• Assume:
– Load an array element with index c × i + d and store to the element with index a × i + b
– i runs from m to n
• Dependence exists if the following two conditions hold
1. Given j, k such that m ≤ j ≤ n, m ≤ k ≤ n
2. a × j + b = c × k + d
• In general, the values of a, b, c, and d are not known at
compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test
• If a loop-carried dependence exists, then GCD(c,a) must divide (d − b)
CA-Lec8 [email protected] 89
Example
for (i=0; i<100; i=i+1) {
X[2*i+3] = X[2*i] * 5.0;
}
• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a, c) = 2, and |d − b| = 3
3. Since 2 does not divide 3, no dependence is possible
CA-Lec8 [email protected] 90
Remarks
• The GCD test is sufficient but not necessary
– GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but
no dependence exists
• Determining whether a dependence actually
exists is NP‐complete
CA-Lec8 [email protected] 91
Detecting and Enhancing Loop-Level Parallelism
Finding dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
Y[i] = X[i] / c; /* S1 */
X[i] = X[i] + c; /* S2 */
Z[i] = Y[i] + c; /* S3 */
Y[i] = c ‐ Y[i]; /* S4 */
}
CA-Lec8 [email protected] 92
Renaming to Eliminate False (Pseudo)
Dependences
• Before: • After:
for (i=0; i<100; i=i+1) { for (i=0; i<100; i=i+1) {
Y[i] = X[i] / c; T[i] = X[i] / c;
X[i] = X[i] + c; X1[i] = X[i] + c;
Z[i] = Y[i] + c; Z[i] = T[i] + c;
Y[i] = c ‐ Y[i]; Y[i] = c ‐ T[i];
} }
CA-Lec8 [email protected] 93
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
for (i=9999; i>=0; i=i‐1)
sum = sum + x[i] * y[i];
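A sketch of the standard transformation (scalar expansion followed by a reduction; the temporary array dot[] is introduced here for illustration): the first loop has no loop-carried dependence and vectorizes, and only the final summation remains a recurrence:
/* vectorizable: independent element-wise multiplies */
for (i=9999; i>=0; i=i-1)
    dot[i] = x[i] * y[i];
/* reduction: still a recurrence, but over a single vector */
for (i=9999; i>=0; i=i-1)
    sum = sum + dot[i];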
CA-Lec8 [email protected] 94
Reduction
• Reductions are common in linear algebra algorithm
• Reductions can be handled by special hardware in a vector
and SIMD architecture
– Similar to what can be done in multiprocessor
environment
CA-Lec8 [email protected] 95
Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is next step to performance
• Coarse‐grained vs. Fine‐grained multithreading
– Switch only on big stall vs. switch every clock cycle
• Simultaneous multithreading, if fine grained multithreading
based on OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy
efficient, and better real‐time model than OOO machines
– Design issues include number of lanes, number of FUs, number
of vector registers, length of vector registers, exception handling,
conditional operations, and so on.
• Fundamental design issue is memory bandwidth
CA-Lec8 [email protected] 96
Computer Architecture
Lecture 9: Multiprocessors and
Thread‐Level Parallelism (Chapter 5)
[Figure, summarized: uniprocessor performance growth from 1978 to 2006, rising at 25%/year through 1986 and at 52%/year from 1986 to 2002, then flattening.]
• VAX : 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
CA-Lec9 [email protected] 2
• RISC + x86: ??%/year 2002 to present
Multiprocessors
• Growth in data‐intensive applications
– Data bases, file servers, …
• Growing interest in servers, server performance.
• Increasing desktop performance less important
• Improved understanding in how to use multiprocessors
effectively
– Especially server where significant natural TLP
• Advantage of leveraging design investment by replication
– Rather than unique design
CA-Lec9 [email protected] 3
Flynn’s Taxonomy
M.J. Flynn, "Very High-Speed Computing Systems",
Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.
CA-Lec9 [email protected] 4
What is Parallel Architecture?
• A parallel computer is a collection of processing elements
that cooperate to solve large problems
– Most important new element: it is all about communication !!
• What does the programmer (or OS or compiler writer) think
about?
– Models of computation
• Sequential consistency?
– Resource allocation
• What mechanisms must be in hardware
– A high performance processor (SIMD, or Vector Processor)
– Data access, Communication, and Synchronization
CA-Lec9 [email protected] 5
Multiprocessor Basics
• “A parallel computer is a collection of processing elements that
cooperate and communicate to solve large problems fast.”
• Parallel Architecture = Computer Architecture + Communication
Architecture
• 2 classes of multiprocessors WRT memory:
1. Centralized Memory Multiprocessor
• < few dozen cores
• Small enough to share single, centralized memory with uniform
memory access latency (UMA)
2. Physically Distributed‐Memory Multiprocessor
• Larger number chips and cores than 1.
• BW demands => memory is distributed among the processors, with non-uniform memory access latency (NUMA)
CA-Lec9 [email protected] 6
Shared‐Memory Multiprocessor
• SMP, Symmetric multiprocessors
CA-Lec9 [email protected] 7
Distributed‐Memory Multiprocessor
• Distributed shared‐memory multiprocessors
(DSM)
CA-Lec9 [email protected] 8
Centralized vs. Distributed Memory
[Figure, summarized: on the left, processors P1 … Pn, each with a cache, share one memory through an interconnection network (centralized); on the right, each processor has its own local memory and the processors are connected by an interconnection network (distributed). The distributed organization scales further.]
CA-Lec9 [email protected] 9
Centralized Memory Multiprocessor
• Also called symmetric multiprocessors (SMPs) because
single main memory has a symmetric relationship to all
processors
• Large caches => a single memory can satisfy the memory demands of a small number of processors
• Can scale to a few dozen processors by using a switch and
by using many memory banks
• Although scaling beyond that is technically conceivable, it
becomes less attractive as the number of processors
sharing centralized memory increases
CA-Lec9 [email protected] 10
Distributed Memory Multiprocessor
• Processors connected via direct (switched) and non‐
direct (multi‐hop) interconnection networks
• Pro: Cost‐effective way to scale memory bandwidth
• If most accesses are to local memory
• Pro: Reduces latency of local memory accesses
CA-Lec9 [email protected] 11
2 Models for Communication and
Memory Architecture
1. Communication occurs by explicitly passing messages among the
processors:
message‐passing multiprocessors
2. Communication occurs through a shared address space (via loads and
stores):
shared memory multiprocessors either
• UMA (Uniform Memory Access time) for shared address, centralized
memory MP
• NUMA (Non Uniform Memory Access time) for shared address,
distributed memory MP
• In past, confusion whether “sharing” means sharing physical memory
(Symmetric MP) or sharing address space
CA-Lec9 [email protected] 12
Challenges of Parallel Processing
• First challenge is % of program inherently sequential
CA-Lec9 [email protected] 13
Amdahl’s Law Answers
Speedup_overall = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
To obtain an overall speedup of 80 with 100 processors:
80 = 1 / ((1 − Fraction_parallel) + Fraction_parallel / 100)
80 − 80 × Fraction_parallel + 0.8 × Fraction_parallel = 1
79 = 79.2 × Fraction_parallel
Fraction_parallel = 79 / 79.2 = 99.75%
That is, at most 0.25% of the original computation can be sequential.
CA-Lec9 [email protected] 14
Challenges of Parallel Processing
• Second challenge is long latency to remote memory
CA-Lec9 [email protected] 15
CPI Equation
• CPI = Base CPI +
Remote request rate x Remote request cost
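For instance (an illustrative calculation in the spirit of the textbook example): with a 2 GHz clock, a 200 ns remote access costs 200 / 0.5 = 400 cycles. If the base CPI is 0.5 and 0.2% of instructions reference remote memory, then CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3, i.e. the processor runs 2.6 times slower than if all references were local.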
CA-Lec9 [email protected] 16
Challenges of Parallel Processing
1. Insufficient application parallelism => addressed primarily via new algorithms that have better parallel performance
2. Long remote latency => its impact is reduced both by the architect and by the programmer
• For example, reduce frequency of remote accesses either by
– Caching shared data (HW)
– Restructuring the data layout to make more accesses local (SW)
CA-Lec9 [email protected] 17
Shared‐Memory Architectures
CA-Lec9 [email protected] 18
Cache Coherence Problem
[Figure, summarized: processors P1, P2, and P3 each have a cache; memory and I/O devices hold u = 5. (1) P1 reads u and caches 5; (2) P3 reads u and caches 5; (3) P3 writes u = 7; (4) P1 reads u again and may still see the stale value 5 in its cache; (5) P2 reads u and, with a write-back cache, gets the stale 5 from memory.]
CA-Lec9 [email protected] 20
Intuitive Memory Model
[Figure, summarized: a processor with L1, L2, memory, and disk; location 100 holds 67 in L1, 35 in L2, and 34 in memory/disk.]
• Reading an address should return the last value written to that address
– Easy in uniprocessors, except for I/O
CA-Lec9 [email protected] 22
Write Consistency
• For now assume
1. A write does not complete (and allow the next write to occur)
until all processors have seen the effect of that write
2. The processor does not change the order of any write with
respect to any other memory access
if a processor writes location A followed by location B, any
processor that sees the new value of B must also see the new
value of A
• These restrictions allow the processor to reorder reads, but
forces the processor to finish writes in program order
CA-Lec9 [email protected] 23
Basic Schemes for Enforcing
Coherence
• Program on multiple processors will normally have copies of the same
data in several caches
• Rather than trying to avoid sharing in SW, SMPs use a HW protocol to
maintain coherent caches
• Coherent caches provide migration and replication of shared data
• Migration ‐ data can be moved to a local cache and used there in a
transparent fashion
– Reduces both latency to access shared data that is allocated remotely and
bandwidth demand on the shared memory
• Replication – for shared data being simultaneously read, since caches
make a copy of data in local cache
– Reduces both latency of access and contention for read shared data
CA-Lec9 [email protected] 24
2 Classes of Cache Coherence
Protocols
• Directory based: the sharing status of a block of physical memory is kept in one location, the directory
• Snooping: every cache with a copy of a block tracks its sharing status; the caches monitor (snoop on) a broadcast medium to track the sharing status
CA-Lec9 [email protected] 25
Snoopy Cache‐Coherence Protocols
[Figure, summarized: processors P1 … Pn each have a cache holding per-block state, address, and data; the caches, memory, and I/O devices share a bus, and every cache snoops on the bus's cache-memory transactions.]
CA-Lec9 [email protected] 26
Example: Write‐thru Invalidate
[Figure, summarized: the same u = 5 example with a write-through invalidate protocol. (1) P1 reads u; (2) P3 reads u; (3) P3 writes u = 7, which updates memory and invalidates the other cached copies; (4), (5) subsequent reads by P1 and P2 therefore obtain the new value 7.]
Exclusive access ensures that no other readable or writable copies of the data exist when the write occurs
CA-Lec9 [email protected] 27
Architectural Building Blocks
• Cache block state transition diagram
– FSM specifying how disposition of block changes
• invalid, valid, dirty
• Broadcast Medium Transactions (e.g., bus)
– Fundamental system design abstraction
– Logically single set of wires connect several devices
– Protocol: arbitration, command/address, data
=> Every device observes every transaction
• Broadcast medium enforces serialization of read or write accesses => write serialization
– 1st processor to get medium invalidates others copies
– Implies cannot complete write until it obtains bus
– All coherence schemes require serializing accesses to same cache block
• Also need to find up‐to‐date copy of cache block
CA-Lec9 [email protected] 28
Locate Up‐to‐date Copy of Data
• Write‐through: get up‐to‐date copy from memory
– Write through simpler if enough memory BW
• Write‐back harder
– Most recent copy can be in a cache
• Can use same snooping mechanism
1. Snoop every address placed on the bus
2. If a processor has dirty copy of requested cache block, it provides it in
response to a read request and aborts the memory access
– Complexity from retrieving cache block from a processor cache, which can
take longer than retrieving it from memory
CA-Lec9 [email protected] 29
Cache Resources for WB Snooping
• Normal cache tags can be used for snooping
• Valid bit per block makes invalidation easy
• Read misses easy since rely on snooping
• Writes: need to know whether any other copies of the block are cached
– No other copies => no need to place the write on the bus (for write-back)
– Other copies => need to place an invalidate on the bus
CA-Lec9 [email protected] 30
Cache Resources for WB Snooping
• To track whether a cache block is shared, add extra state bit
associated with each cache block, like valid bit and dirty bit
– Write to a Shared block => need to place an invalidate on the bus and mark the cache block as private (exclusive)
– No further invalidations will be sent for that block
– This processor called owner of cache block
– Owner then changes state from shared to unshared (or exclusive)
CA-Lec9 [email protected] 31
Cache Behavior in Response to Bus
• Every bus transaction must check the cache‐address tags
– could potentially interfere with processor cache accesses
• A way to reduce interference is to duplicate tags
– One set for caches access, one set for bus accesses
• Another way to reduce interference is to use L2 tags
– Since L2 less heavily used than L1
Every entry in L1 cache must be present in the L2 cache, called the inclusion
property
– If Snoop gets a hit in L2 cache, then it must arbitrate for the L1 cache to
update the state and possibly retrieve the data, which usually requires a stall
of the processor
CA-Lec9 [email protected] 32
Example Protocol
• Snooping coherence protocol is usually implemented by incorporating a
finite‐state controller in each node
• Logically, think of a separate controller associated with each cache block
– That is, snooping operations or cache requests for different blocks can
proceed independently
• In implementations, a single controller allows multiple operations to
distinct blocks to proceed in interleaved fashion
– that is, one operation may be initiated before another is completed, even though only one cache access or one bus access is allowed at a time
CA-Lec9 [email protected] 33
Cache Coherence Protocol Example
• Processor only observes state of memory system by issuing memory operations
• Assume bus transactions and memory operations are atomic and a one‐level cache
– all phases of one bus transaction complete before next one starts
– processor waits for memory operation to complete before issuing next
– with one‐level cache, assume invalidations applied during bus transaction
• All writes go to bus + atomicity
– Writes serialized by order in which they appear on bus (bus order)
=> invalidations applied to caches in bus order
• How to insert reads in this order?
– Important since processors see writes through reads, so determines whether
write serialization is satisfied
– But read hits may happen independently and do not appear on bus or enter
directly in bus order
CA-Lec9 [email protected] 34
Ordering
write propagation + write serialization
P0: R R R W R R
P1: R R R R R W
P2: R R R R R R
CA-Lec9 [email protected] 35
Example: Write Back Snoopy Protocol
• Invalidation protocol, write‐back cache
– Snoops every address on bus
– If it has a dirty copy of requested block, provides that block in response to
the read request and aborts the memory access
• Each memory block is in one state:
– Clean in all caches and up‐to‐date in memory (Shared)
– OR Dirty in exactly one cache (Exclusive)
– OR Not in any caches (Invalid)
• Each cache block is in one state (track these):
– Shared : block can be read
– OR Exclusive : cache has only copy, its writeable, and dirty
– OR Invalid : block contains no data (in uniprocessor cache too)
• Read misses: cause all caches to snoop the bus
• Writes to clean blocks are treated as misses (see the C sketch after this list)
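A compact sketch of the per-block, CPU-side transitions just described (plain C; the function and type names are invented for illustration, and the bus requests that accompany each transition appear only as comments):
typedef enum { INVALID, SHARED, EXCLUSIVE } BlockState;
/* Next state of one cache block for a CPU access in the write-back invalidate protocol. */
BlockState cpu_access(BlockState s, int is_write) {
    switch (s) {
    case INVALID:
        if (is_write) return EXCLUSIVE;  /* write miss: place write miss on bus */
        return SHARED;                   /* read miss: place read miss on bus   */
    case SHARED:
        if (is_write) return EXCLUSIVE;  /* write to clean block: place write miss on bus */
        return SHARED;                   /* read hit */
    default:                             /* EXCLUSIVE */
        return EXCLUSIVE;                /* read hit or write hit */
    }
}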
CA-Lec9 [email protected] 36
Write‐Back State Machine ‐ CPU
• State machine for CPU requests, for each cache block
• Non-resident blocks are in the Invalid state
[Diagram, summarized as transitions:]
– Invalid -> Shared on CPU read: place read miss on bus
– Invalid -> Exclusive on CPU write: place write miss on bus
– Shared: CPU read hit stays in Shared
– Shared -> Exclusive on CPU write: place write miss on bus
– Exclusive: CPU read hit and CPU write hit stay in Exclusive
– Exclusive -> Exclusive on CPU write miss: write back the cache block, then place write miss on bus
CA-Lec9 [email protected] 37
Write‐Back State Machine‐ Bus Request
• State machine for bus requests, for each cache block
[Diagram, summarized as transitions:]
– Shared -> Invalid on a write miss for this block
– Exclusive -> Invalid on a write miss for this block: write back the block (aborting the memory access)
– Exclusive -> Shared on a read miss for this block: write back the block (aborting the memory access)
CA-Lec9 [email protected] 38
Block‐replacement
• State machine for CPU requests, for each cache block, now including block replacement
[Diagram, summarized: the CPU transitions of the previous slide, plus]
– Shared -> Shared on a CPU read miss: place read miss on bus (the victim block is simply replaced)
– Exclusive -> Shared on a CPU read miss: write back the victim block, then place read miss on bus
– Exclusive -> Exclusive on a CPU write miss: write back the victim block, then place write miss on bus
CA-Lec9 [email protected] 39
Write‐back State Machine‐III
• Combined state machine for CPU requests and bus requests, for each cache block
[Diagram, summarized: the union of the CPU-request transitions and the bus-request transitions shown on the previous two slides.]
CA-Lec9 [email protected] 40
Example
Access sequence: P1 writes 10 to A1; P1 reads A1; P2 reads A1; P2 writes 20 to A1; P2 writes 40 to A2. The complete trace of cache states, bus actions, and memory contents is shown on the next slide.
CA-Lec9 [email protected] 41
Example
(A1 and A2 map to the same cache block, which is why the final write to A2 forces a write-back of A1.)
step                 P1               P2               Bus                    Memory
P1: Write 10 to A1   Excl. A1 10                       WrMs P1 A1
P1: Read A1          Excl. A1 10                       (read hit, no bus action)
P2: Read A1          Excl. A1 10      Shar. A1         RdMs P2 A1
                     Shar. A1 10                       WrBk P1 A1 10          A1 10
                                      Shar. A1 10      RdDa P2 A1 10          A1 10
P2: Write 20 to A1   Inv.             Excl. A1 20      WrMs P2 A1             A1 10
P2: Write 40 to A2                                     WrMs P2 A2             A1 10
                                      Excl. A2 40      WrBk P2 A1 20          A1 20
CA-Lec9 [email protected] 46
Concluding Remark (1/2)
• 1 instruction operates on vectors of data
• Vector loads get data from memory into big register files,
operate, and then vector store
• E.g., Indexed load, store for sparse matrix
• Easy to add vector to commodity instruction set
– E.g., Morph SIMD into vector
• Vector is very efficient architecture for vectorizable codes,
including multimedia and many scientific codes
CA-Lec9 [email protected] 47
Concluding Remark (2/2)
• “End” of uniprocessors speedup => Multiprocessors
• Parallelism challenges: % parallelizable, long latency to remote memory
• Centralized vs. distributed memory
– Centralized works for small MPs; distributed gives lower latency and larger BW for larger MPs
• Message Passing vs. Shared Address
– Uniform access time vs. Non‐uniform access time
• Snooping cache over shared medium for smaller MP by invalidating other
cached copies on write
• Sharing cached data => coherence (which values are returned by a read) and consistency (when a written value will be returned by a read)
• Shared medium serializes writes => write consistency
CA-Lec9 [email protected] 48
Implementation Complications
• Write Races:
– Cannot update cache until bus is obtained
• Otherwise, another processor may get bus first,
and then write the same cache block!
– Two step process:
• Arbitrate for bus
• Place miss on bus and complete operation
– If miss occurs to block while waiting for bus,
handle miss (invalidate may be needed) and then restart.
– Split transaction bus:
• Bus transaction is not atomic:
can have multiple outstanding transactions for a block
• Multiple misses can interleave,
allowing two caches to grab block in the Exclusive state
• Must track and prevent multiple misses for one block
• Must support interventions and invalidations
CA-Lec9 [email protected] 49
Implementing Snooping Caches
• Multiple processors must be on bus, access to both addresses and data
• Add a few new commands to perform coherency,
in addition to read and write
• Processors continuously snoop on address bus
– If address matches tag, either invalidate or update
• Since every bus transaction checks cache tags,
could interfere with CPU just to check:
– solution 1: duplicate set of tags for L1 caches just to allow checks in parallel
with CPU
– solution 2: L2 cache already duplicate,
provided L2 obeys inclusion with L1 cache
• block size, associativity of L2 affects L1
CA-Lec9 [email protected] 50
Limitations in Symmetric Shared‐Memory
Multiprocessors and Snooping Protocols
CA-Lec9 [email protected] 51
Performance of Symmetric Shared‐Memory
Multiprocessors
• Cache performance is combination of
1. Uniprocessor cache miss traffic
2. Traffic caused by communication
– Results in invalidations and subsequent cache misses
• 4th C: coherence miss
– Joins Compulsory, Capacity, Conflict
CA-Lec9 [email protected] 52
Coherency Misses
1. True sharing misses arise from the communication of
data through the cache coherence mechanism
• Invalidates due to 1st write to shared block
• Reads by another CPU of modified block in different cache
• Miss would still occur if block size were 1 word
2. False sharing misses when a block is invalidated
because some word in the block, other than the one
being read, is written into
• Invalidation does not cause a new value to be communicated,
but only causes an extra cache miss
• Block is shared, but no word in the block is actually shared => the miss would not occur if the block size were 1 word
CA-Lec9 [email protected] 53
Example: True vs. False Sharing vs. Hit?
• Assume x1 and x2 in same cache block.
P1 and P2 both read x1 and x2 before.
CA-Lec9 [email protected] 55
MP Performance 2MB Cache
Commercial Workload: OLTP, Decision Support (Database),
Search Engine
[Chart, summarized: memory cycles per instruction for 1 to 8 processors, broken into instruction, conflict/capacity, cold, false-sharing, and true-sharing components; the true-sharing and false-sharing components grow as the processor count increases.]
CA-Lec9 [email protected] 56
Computer Architecture
Lecture 10: Thread‐Level
Parallelism‐‐II (Chapter 5)
Chih‐Wei Liu 劉志尉
National Chiao Tung University
[email protected]
Review
• Caches contain all information on state of cached memory
blocks
• Snooping cache over shared medium for smaller MP by
invalidating other cached copies on write
• Sharing cached data
Coherence (values returned by a read),
Consistency (when a written value will be returned by a
read)
CA-Lec10 [email protected] 2
Coherency Misses
1. True sharing misses arise from the communication of data
through the cache coherence mechanism
• Invalidates due to 1st write to shared block
• Reads by another CPU of modified block in different cache
• Miss would still occur if block size were 1 word
2. False sharing misses when a block is invalidated because
some word in the block, other than the one being read, is
written into
• Invalidation does not cause a new value to be communicated, but
only causes an extra cache miss
• Block is shared, but no word in the block is actually shared => the miss would not occur if the block size were 1 word
CA-Lec10 [email protected] 3
A Cache Coherent System Must:
• Provide set of states, state transition diagram, and actions
• Manage coherence protocol
– (0) Determine when to invoke coherence protocol
– (a) Find info about state of block in other caches to determine action
• whether need to communicate with other cached copies
– (b) Locate the other copies
– (c) Communicate with those copies (invalidate/update)
• (0) is done the same way on all systems
– state of the line is maintained in the cache
– protocol is invoked if an “access fault” occurs on the line
• Different approaches distinguished by (a) to (c)
CA-Lec10 [email protected] 4
Bus‐based Coherence
• All of (a), (b), (c) done through broadcast on bus
– faulting processor sends out a “search”
– others respond to the search probe and take necessary action
• Could do it in scalable network too
– broadcast to all processors, and let them respond
• Conceptually simple, but broadcast doesn’t scale with p
– on bus, bus bandwidth doesn’t scale
– on scalable network, every fault leads to at least p network transactions
• Scalable coherence:
– can have same cache states and state transition diagram
– different mechanisms to manage protocol
CA-Lec10 [email protected] 5
Scalable Approach: Directories
CA-Lec10 [email protected] 6
Basic Operation of Directory
• In addition to cache state, must track which processors have data when in
the shared state (usually bit vector, 1 if processor has copy)
• Keep it simple(r):
– Writes to non‐exclusive data
=> write miss
– Processor blocks until access completes
– Assume messages received and acted upon in order sent
CA-Lec10 [email protected] 8
Directory Protocol
• No bus and don’t want to broadcast:
– interconnect no longer single arbitration point
– all messages have explicit responses
• Terms: typically 3 processors involved
– Local node where a request originates
– Home node where the memory location of an address resides
– Remote node has a copy of a cache block, whether exclusive or shared
• Example messages on next slide:
P = processor number, A = address
CA-Lec10 [email protected] 9
Directory Protocol Messages (Fig 4.22)
Message type Source Destination Msg Content
Read miss Local cache Home directory P, A
– Processor P reads data at address A;
make P a read sharer and request data
Write miss Local cache Home directory P, A
– Processor P has a write miss at address A;
make P the exclusive owner and request data
Invalidate Home directory Remote caches A
– Invalidate a shared copy at address A
Fetch Home directory Remote cache A
– Fetch the block at address A and send it to its home directory;
change the state of A in the remote cache to shared
Fetch/Invalidate Home directory Remote cache A
– Fetch the block at address A and send it to its home directory;
invalidate the block in the cache
Data value reply Home directory Local cache Data
– Return a data value from the home memory (read miss response)
Data write back Remote cache Home directory A, Data
– Write back a data value for address A (invalidate response)
CA-Lec10 [email protected] 10
State Transition Diagram for One Cache Block in
Directory Based System
CA-Lec10 [email protected] 11
CPU‐Cache State Machine
• State machine for CPU requests, for each memory block
• Invalid state if the block is only in memory (not cached)
• Invalid:
– CPU read miss: send Read Miss message to home directory; go to Shared
– CPU write: send Write Miss message to home directory; go to Exclusive
• Shared (read only):
– CPU read hit: no action
– CPU read miss (block replaced): send Read Miss message to home directory
– CPU write: send Write Miss message to home directory; go to Exclusive
– Invalidate (from home directory): go to Invalid
• Exclusive (read/write):
– CPU read hit, CPU write hit: no action
– Fetch (from home directory): send Data Write Back message to home directory; go to Shared
– Fetch/Invalidate (from home directory): send Data Write Back message to home directory; go to Invalid
– CPU read miss (block replaced): send Data Write Back message and Read Miss to home directory
– CPU write miss (block replaced): send Data Write Back message and Write Miss to home directory
CA-Lec10 [email protected] 12
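The same transitions, written as a C sketch of the cache-side controller for one memory block (send_msg is a hypothetical stand-in for the network interface; only the protocol-visible actions are modeled, not the datapath):

#include <stdio.h>

enum cache_state { INVALID, SHARED_RO, EXCLUSIVE_RW };
enum cache_event { CPU_READ_MISS, CPU_WRITE, CPU_WRITE_MISS,
                   DIR_INVALIDATE, DIR_FETCH, DIR_FETCH_INVALIDATE };

static void send_msg(const char *m) { printf("to home directory: %s\n", m); }  /* stub */

enum cache_state cache_fsm(enum cache_state s, enum cache_event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ_MISS) { send_msg("Read Miss");  return SHARED_RO; }
        if (e == CPU_WRITE)     { send_msg("Write Miss"); return EXCLUSIVE_RW; }
        break;
    case SHARED_RO:                          /* CPU read hit: no message, no change */
        if (e == CPU_WRITE)      { send_msg("Write Miss"); return EXCLUSIVE_RW; }
        if (e == DIR_INVALIDATE) { return INVALID; }
        if (e == CPU_READ_MISS)  { send_msg("Read Miss"); return SHARED_RO; }    /* block replaced */
        break;
    case EXCLUSIVE_RW:                       /* CPU read/write hit: no message, no change */
        if (e == DIR_FETCH)            { send_msg("Data Write Back"); return SHARED_RO; }
        if (e == DIR_FETCH_INVALIDATE) { send_msg("Data Write Back"); return INVALID; }
        if (e == CPU_READ_MISS)  { send_msg("Data Write Back + Read Miss");  return SHARED_RO; }
        if (e == CPU_WRITE_MISS) { send_msg("Data Write Back + Write Miss"); return EXCLUSIVE_RW; }
        break;
    }
    return s;                                /* any other event: stay in the same state */
}

int main(void) {                             /* tiny demo: Invalid -> Shared -> Exclusive */
    enum cache_state s = INVALID;
    s = cache_fsm(s, CPU_READ_MISS);
    s = cache_fsm(s, CPU_WRITE);
    return s == EXCLUSIVE_RW ? 0 : 1;
}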
State Transition Diagram for Directory
CA-Lec10 [email protected] 13
Directory State Machine
• State machine for directory requests, for each memory block
• Uncached state if the block is only in memory
• Uncached:
– Read miss: Sharers = {P}; send Data Value Reply msg; go to Shared
– Write Miss: Sharers = {P}; send Data Value Reply msg; go to Exclusive
• Shared (read only):
– Read miss: Sharers += {P}; send Data Value Reply msg
– Write Miss: send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply msg; go to Exclusive
• Exclusive (read/write):
– Read miss: send Fetch to owner (owner writes back block); Sharers += {P}; send Data Value Reply msg to requestor; go to Shared
– Write Miss: send Fetch/Invalidate to owner (owner writes back block); Sharers = {P}; send Data Value Reply msg to the new owner
– Data Write Back: Sharers = {}; write back block; go to Uncached
CA-Lec10 [email protected] 14
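And the matching directory-side controller as a C sketch (dir_send is a hypothetical stand-in for sending a message over the interconnect; a bit-vector sharer set is assumed):

#include <stdio.h>

enum dir_state { UNCACHED, SHARED, EXCLUSIVE };
enum dir_event { READ_MISS, WRITE_MISS, DATA_WRITE_BACK };

struct dir_entry { enum dir_state state; unsigned long sharers; };   /* bit i => processor i has a copy */

static void dir_send(const char *msg, int p) { printf("%s (processor %d)\n", msg, p); }  /* stub */

void directory_fsm(struct dir_entry *e, enum dir_event ev, int p) {
    switch (e->state) {
    case UNCACHED:
        if (ev == READ_MISS)  { e->sharers = 1UL << p; dir_send("Data Value Reply", p); e->state = SHARED; }
        if (ev == WRITE_MISS) { e->sharers = 1UL << p; dir_send("Data Value Reply", p); e->state = EXCLUSIVE; }
        break;
    case SHARED:
        if (ev == READ_MISS)  { e->sharers |= 1UL << p; dir_send("Data Value Reply", p); }
        if (ev == WRITE_MISS) { dir_send("Invalidate to all current sharers", p);
                                e->sharers = 1UL << p; dir_send("Data Value Reply", p); e->state = EXCLUSIVE; }
        break;
    case EXCLUSIVE:
        if (ev == READ_MISS)  { dir_send("Fetch to owner; Data Value Reply to requestor", p);
                                e->sharers |= 1UL << p; e->state = SHARED; }
        if (ev == WRITE_MISS) { dir_send("Fetch/Invalidate to owner; Data Value Reply to requestor", p);
                                e->sharers = 1UL << p; }              /* stays Exclusive with the new owner */
        if (ev == DATA_WRITE_BACK) { e->sharers = 0; e->state = UNCACHED; }   /* owner replaced the block */
        break;
    }
}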
Example Directory Protocol
• Message sent to directory causes two actions:
– Update the directory
– More messages to satisfy request
• Block is in Uncached state: the copy in memory is the current value; only possible
requests for that block are:
– Read miss: requesting processor is sent data from memory & the requestor is made the only
sharing node; state of block made Shared.
– Write miss: requesting processor is sent the value & becomes the Sharing node. The
block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates
the identity of the owner.
• Block is Shared => the memory value is up‐to‐date:
– Read miss: requesting processor is sent back the data from memory & requesting
processor is added to the sharing set.
– Write miss: requesting processor is sent the value. All processors in the set Sharers are
sent invalidate messages, & Sharers is set to identity of requesting processor. The state
of the block is made Exclusive.
CA-Lec10 [email protected] 15
Example Directory Protocol
• Block is Exclusive: current value of the block is held in the cache of the
processor identified by the set Sharers (the owner) => three possible
directory requests:
– Read miss: owner processor is sent a data fetch message, causing state of block in
owner’s cache to transition to Shared and causes owner to send data to directory,
where it is written to memory & sent back to requesting processor.
Identity of requesting processor is added to set Sharers, which still contains the
identity of the processor that was the owner (since it still has a readable copy).
State is shared.
– Data write‐back: owner processor is replacing the block and hence must write it
back, making memory copy up‐to‐date
(the home directory essentially becomes the owner), the block is now Uncached,
and the Sharer set is empty.
– Write miss: block has a new owner. A message is sent to old owner causing the
cache to send the value of the block to the directory from which it is sent to the
requesting processor, which becomes the new owner. Sharers is set to identity of
new owner, and state of block is made Exclusive.
CA-Lec10 [email protected] 16
Example
Processor 1 Processor 2 Interconnect Directory Memory
P1 P2 Bus Directory Memory
step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
CA-Lec10 [email protected] 17
Example
Processor 1 Processor 2 Interconnect Directory Memory
P1 P2 Bus Directory Memory
step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value
P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
CA-Lec10 [email protected] 18
Example
Processor 1 Processor 2 Interconnect Directory Memory
P1 P2 Bus Directory Memory
step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value
P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1 Excl. A1 10
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
CA-Lec10 [email protected] 19
Example
Processor 1 Processor 2 Interconnect Directory Memory
P1 P2 Bus Directory Memory
step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value
P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1 Excl. A1 10
P2: Read A1 Shar. A1 RdMs P2 A1
Shar. A1 10 Ftch P1 A1 10 A1 A1 10
Shar. A1 10 DaRp P2 A1 10 A1 Shar. {P1,P2} 10
P2: Write 20 to A1 10
10
P2: Write 40 to A2 10
Write Back
CA-Lec10 [email protected] 20
Example
Processor 1 Processor 2 Interconnect Directory Memory
P1 P2 Bus Directory Memory
step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value
P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1 Excl. A1 10
P2: Read A1 Shar. A1 RdMs P2 A1
Shar. A1 10 Ftch P1 A1 10 A1 A1 10
Shar. A1 10 DaRp P2 A1 10 A1 Shar. {P1,P2} 10
P2: Write 20 to A1 Excl. A1 20 WrMs P2 A1 10
Inv. Inval. P1 A1 A1 Excl. {P2} 10
P2: Write 40 to A2 10
CA-Lec10 [email protected] 21
Example
Processor 1 Processor 2 Interconnect Directory Memory
P1 P2 Bus Directory Memory
step State Addr Value State Addr Value Action Proc. Addr Value Addr State {Procs} Value
P1: Write 10 to A1 WrMs P1 A1 A1 Ex {P1}
Excl. A1 10 DaRp P1 A1 0
P1: Read A1 Excl. A1 10
P2: Read A1 Shar. A1 RdMs P2 A1
Shar. A1 10 Ftch P1 A1 10 A1 A1 10
Shar. A1 10 DaRp P2 A1 10 A1 Shar. {P1,P2} 10
P2: Write 20 to A1 Excl. A1 20 WrMs P2 A1 10
Inv. Inval. P1 A1 A1 Excl. {P2} 10
P2: Write 40 to A2 WrMs P2 A2 A2 Excl. {P2} 0
WrBk P2 A1 20 A1 Unca. {} 20
Excl. A2 40 DaRp P2 A2 0 A2 Excl. {P2} 0
CA-Lec10 [email protected] 22
Implementing a Directory
• We assume operations atomic, but they are not; reality is
much harder; must avoid deadlock when run out of buffers in
network (see Appendix E)
• Optimizations:
– read miss or write miss in Exclusive: send data directly to requestor
from owner vs. 1st to memory and then from memory to requestor
CA-Lec10 [email protected] 23
Basic Directory Transactions
[Figure: two basic transaction flows.
(a) Read miss to a block in the Exclusive state: 1. requestor sends a read request to the directory
node for the block; 2. the directory replies with the owner's identity; 3. requestor sends a read
request to the owner; 4a. the owner replies with the data; 4b. the owner sends a revision message
to the directory.
(b) Write (RdEx) miss to a block with sharers: 1. requestor sends an RdEx request to the directory;
2. the directory replies with the sharers' identities; 3a/3b. requestor sends invalidation requests to
the sharers; 4a/4b. the sharers return invalidation acknowledgments.]
CA-Lec10 [email protected] 24
Example Directory Protocol (1st Read)
[Figure: P1 executes ld vA, causing a read of block pA. P1's cache sends a read request for pA to the
directory; the directory (Uncached) replies with the data, records P1 as a sharer, and moves pA to
Shared; P1's cache block goes from Invalid to Shared.]
CA-Lec10 [email protected] 25
Example Directory Protocol (Read Share)
[Figure: P2 also executes ld vA (read of pA). The directory, already in Shared for pA, replies with the
data and adds P2 to the sharer set {P1, P2}; both P1 and P2 now hold pA in Shared.]
CA-Lec10 [email protected] 26
Example Directory Protocol (Wr to shared)
[Figure: P1 executes st vA (write of pA) while pA is Shared. P1 issues a read-to-update request; the
directory invalidates P2's copy (P2 returns an invalidation ack), replies to P1, and marks pA Exclusive
with P1 as owner; P1's block moves to Exclusive, P2's to Invalid.]
CA-Lec10 [email protected] 27
Example Directory Protocol (Wr to Ex)
[Figure: a write (st vA -> wr pA) to a block that the other processor holds Exclusive. The writer
issues a read-to-update request; the directory sends a fetch/invalidate to the current owner, the
owner writes the block back, and the directory replies to the writer, which becomes the new
exclusive owner.]
CA-Lec10 [email protected] 28
A Popular Middle Ground
• Two‐level “hierarchy”
• Individual nodes are multiprocessors, connected non‐
hierarchically
– e.g. mesh of SMPs
• Coherence across nodes is directory‐based
– directory keeps track of nodes, not individual processors
• Coherence within nodes is snooping or directory
– orthogonal, but needs a good interface of functionality
• SMP on a chip => directory + snoop?
CA-Lec10 [email protected] 29
And in Conclusion …
• Caches contain all information on state of cached memory blocks
• Snooping cache over shared medium for smaller MP by invalidating other
cached copies on write
• Sharing cached data => Coherence (values returned by a read),
Consistency (when a written value will be returned by a read)
• Snooping and Directory Protocols similar; bus makes snooping easier
because of broadcast (snooping => uniform memory access)
• Directory has extra data structure to keep track of state of all cache blocks
• Distributing directory => scalable shared address multiprocessor
=> Cache coherent, Non uniform memory access
CA-Lec10 [email protected] 30
Outline
• Review
• Directory‐based protocols and examples
• Synchronization
• Relaxed Consistency Models
• Conclusion
CA-Lec10 [email protected] 31
Synchronization
• Why Synchronize? Need to know when it is safe for different
processes to use shared data
CA-Lec10 [email protected] 32
Uninterruptable Instruction to Fetch and
Update Memory
• Atomic exchange: interchange a value in a register for a value in memory
– 0 => synchronization variable is free
– 1 => synchronization variable is locked and unavailable
– Set register to 1 & swap
– New value in register determines success in getting lock:
0 if you succeeded in setting the lock (you were first)
1 if other processor had already claimed access
– Key is that exchange operation is indivisible
• Test‐and‐set: tests a value and sets it if the value passes the test
• Fetch‐and‐increment: it returns the value of a memory location and
atomically increments it
– 0 => synchronization variable is free
CA-Lec10 [email protected] 33
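These primitives correspond directly to C11 atomics (atomic_flag_test_and_set is the C11 analogue of test-and-set); a minimal sketch (variable names hypothetical) of atomic exchange used as a lock, plus fetch-and-increment:

#include <stdatomic.h>

atomic_int lock_var = 0;                    /* 0 => free, 1 => locked */
atomic_int counter  = 0;

int try_lock(void) {
    /* atomic exchange: write 1 and get the old value back in one indivisible step */
    return atomic_exchange(&lock_var, 1) == 0;   /* old value 0 => we were first and hold the lock */
}

void unlock(void) {
    atomic_store(&lock_var, 0);             /* mark the synchronization variable free again */
}

int fetch_and_increment(void) {
    /* returns the previous value of counter and atomically adds 1 */
    return atomic_fetch_add(&counter, 1);
}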
Uninterruptable Instruction to Fetch and
Update Memory
• Hard to have read & write in 1 instruction: use 2 instead
• Load linked (or load locked) + store conditional
– Load linked returns the initial value
– Store conditional returns 1 if it succeeds (no other store to same memory
location since preceding load) and 0 otherwise
• Example doing atomic swap with LL & SC:
try: mov R3,R4 ; mov exchange value
ll R2,0(R1) ; load linked
sc R3,0(R1) ; store conditional
beqz R3,try ; branch store fails (R3 = 0)
mov R4,R2 ; put load value in R4
• Example doing fetch & increment with LL & SC:
try: ll R2,0(R1) ; load linked
addi R2,R2,#1 ; increment (OK if reg–reg)
sc R2,0(R1) ; store conditional
beqz R2,try ; branch store fails (R2 = 0)
CA-Lec10 [email protected] 34
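C has no portable ll/sc intrinsic, but a compare-and-swap retry loop expresses the same idea and is typically compiled down to ll/sc on machines that provide those instructions; a sketch (function names hypothetical) of the two examples above:

#include <stdatomic.h>

int atomic_swap(atomic_int *p, int newval) {         /* like the exchange built from ll/sc */
    int old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, newval))
        ;                                             /* CAS failed (another store intervened): old is refreshed, retry */
    return old;
}

int fetch_and_inc(atomic_int *p) {                    /* like the ll / addi / sc loop */
    int old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old + 1))
        ;
    return old;
}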
User Level Synchronization—Operation
Using this Primitive
• Spin locks: processor continuously tries to acquire, spinning around a loop
trying to get the lock
daddui R2,R0,#1
lockit: exch R2,0(R1) ;atomic exchange
bnez R2,lockit ;already locked?
• What about MP with cache coherency?
– Want to spin on cache copy to avoid full memory latency
– Likely to get cache hits for such variables
• Problem: exchange includes a write, which invalidates all other copies; this
generates considerable bus traffic
• Solution: start by simply repeatedly reading the variable; when it changes,
then try exchange (“test and test&set”):
try: li R2,#1
lockit: lw R3,0(R1) ;load var
bnez R3,lockit ;≠ 0 => not free => spin
exch R2,0(R1) ;atomic exchange
bnez R2,try ;already locked?
CA-Lec10 [email protected] 35
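The same "test and test&set" idea in C11 atomics (a sketch; function names hypothetical): spin on an ordinary load, which hits in the local cache and generates no coherence traffic, and only attempt the atomic exchange once the lock looks free.

#include <stdatomic.h>

void acquire(atomic_int *lock) {                 /* *lock: 0 => free, 1 => held */
    for (;;) {
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;                                    /* spin on the cached copy: read-only, no invalidations */
        if (atomic_exchange(lock, 1) == 0)       /* lock looked free: now do the real atomic exchange */
            return;                              /* old value 0 => we got it; otherwise someone beat us, spin again */
    }
}

void release(atomic_int *lock) {
    atomic_store(lock, 0);                       /* the write invalidates spinners' copies; they re-read and retry */
}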
Another MP Issue:
Memory Consistency Models
• What is consistency? When must a processor see the new value? e.g.,
seems that
P1: A = 0; P2: B = 0;
..... .....
A = 1; B = 1;
L1: if (B == 0) ... L2: if (A == 0) ...
• Impossible for both if statements L1 & L2 to be true?
– What if write invalidate is delayed & processor continues?
• Memory consistency models:
what are the rules for such cases?
• Sequential consistency: result of any execution is the same as if the
accesses of each processor were kept in order and the accesses
among different processors were interleaved
=> assignments must be completed before the if statements are
initiated
– SC: delay all memory accesses until all invalidates done
CA-Lec10 [email protected] 36
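Written out as two threads in C (a sketch; names hypothetical), the example reads as follows. With sequentially consistent atomics the outcome r1 == 0 && r2 == 0 is impossible; with plain non-atomic variables the program has a data race and, on hardware or compilers that reorder the store past the load, both flags can be observed as 0.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int A = 0, B = 0;
int r1, r2;

static void *p1(void *arg) {
    (void)arg;
    atomic_store(&A, 1);        /* A = 1;              (seq_cst by default) */
    r1 = atomic_load(&B);       /* L1: if (B == 0) ... */
    return 0;
}

static void *p2(void *arg) {
    (void)arg;
    atomic_store(&B, 1);        /* B = 1;              */
    r2 = atomic_load(&A);       /* L2: if (A == 0) ... */
    return 0;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, 0, p1, 0);
    pthread_create(&t2, 0, p2, 0);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    printf("r1=%d r2=%d\n", r1, r2);   /* under sequential consistency, at least one of r1, r2 is 1 */
    return 0;
}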
Memory Consistency Model
• Schemes offer faster execution than sequential consistency
• Not an issue for most programs; they are synchronized
– A program is synchronized if all access to shared data are ordered by synchronization
operations
write (x)
...
release (s) {unlock}
...
acquire (s) {lock}
...
read(x)
• Only those programs willing to be nondeterministic are not synchronized: “data
race”: outcome = f(processor speed)
• Several Relaxed Models for Memory Consistency since most programs are
synchronized; characterized by their attitude towards: RAR, WAR, RAW, WAW
to different addresses
CA-Lec10 [email protected] 37
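The write(x) … release(s) … acquire(s) … read(x) pattern above is exactly what a lock provides; a minimal pthread sketch (names hypothetical) in which the reader is guaranteed to see the writer's update as long as its acquire follows the writer's release:

#include <pthread.h>

static pthread_mutex_t s = PTHREAD_MUTEX_INITIALIZER;
static int x;                          /* shared data */

void writer(void) {
    pthread_mutex_lock(&s);
    x = 42;                            /* write(x) */
    pthread_mutex_unlock(&s);          /* release(s): the write completes before the lock is released */
}

int reader(void) {
    int v;
    pthread_mutex_lock(&s);            /* acquire(s): ordered after the matching release */
    v = x;                             /* read(x): sees the value written before that release */
    pthread_mutex_unlock(&s);
    return v;
}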
Relaxed Consistency Models: The
Basics
• Key idea: allow reads and writes to complete out of order, but to use synchronization
operations to enforce ordering, so that a synchronized program behaves as if the processor
were sequentially consistent
– By relaxing orderings, may obtain performance advantages
– Also specifies range of legal compiler optimizations on shared data
– Unless synchronization points are clearly defined and programs are synchronized, compiler could not
interchange read and write of 2 shared data items because might affect the semantics of the
program
• 3 major sets of relaxed orderings:
1. W→R ordering (all writes completed before next read)
• Because retains ordering among writes, many programs that operate under sequential
consistency operate under this model, without additional synchronization. Called
processor consistency
2. W → W ordering (all writes completed before next write)
3. R → W and R → R orderings, a variety of models depending on ordering restrictions and how
synchronization operations enforce ordering
• Many complexities in relaxed consistency models; defining precisely what it means for a write
to complete; deciding when processors can see values that it has written
CA-Lec10 [email protected] 38
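In C11 terms these relaxations correspond to the weaker memory_order arguments; a sketch (names hypothetical) of a flag/data hand-off where release/acquire supplies just enough ordering for a synchronized program, while memory_order_relaxed on both sides would supply none:

#include <stdatomic.h>

int data;                                   /* ordinary, non-atomic payload */
atomic_int ready = 0;

void producer(void) {
    data = 123;
    /* release: the earlier write to data must complete before the flag is observed set (a W->W constraint) */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void) {
    /* acquire: reads after this load cannot be moved before it (an R->R / R->W constraint) */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    /* guaranteed to see data == 123; with memory_order_relaxed there is no such guarantee */
    return data;
}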
Outline
• Review
• Directory‐based protocols and examples
• Synchronization
• Relaxed Consistency Models
• Conclusion
• T1 (“Niagara”) Multiprocessor
CA-Lec10 [email protected] 39
T1 (“Niagara”)
• Target: Commercial server applications
– High thread level parallelism (TLP)
• Large numbers of parallel client requests
– Low instruction level parallelism (ILP)
• High cache miss rates
• Many unpredictable branches
• Frequent load‐load dependencies
• Power, cooling, and space are major concerns for data
centers
• Metric: Performance/Watt/Sq. Ft.
• Approach: Multicore, Fine‐grain multithreading, Simple
pipeline, Small L1 caches, Shared L2
CA-Lec10 [email protected] 40
T1 Architecture
• Also ships with 6 or 4 processors
CA-Lec10 [email protected] 41
T1 Fine‐Grained Multithreading
• Each core supports four threads and has its own level one caches (16KB
for instructions and 8 KB for data)
• Switching to a new thread on each clock cycle
• Idle threads are bypassed in the scheduling
– Waiting due to a pipeline delay or cache miss
– Processor is idle only when all 4 threads are idle or stalled
• Both loads and branches incur a 3 cycle delay that can only be hidden by
other threads
• A single set of floating point functional units is shared by all 8 cores
– floating point performance was not a focus for T1
CA-Lec10 [email protected] 42
Memory, Clock, Power
• 16 KB 4 way set assoc. I$/ core
• 8 KB 4 way set assoc. D$/ core
• 3MB 12 way set assoc. L2 $ shared
– 4 x 750KB independent banks
– crossbar switch to connect
– 2 cycle throughput, 8 cycle latency
– Direct link to DRAM & Jbus
– Manages cache coherence for the 8 cores
– CAM based directory
• Coherency is enforced among the L1 caches by a directory associated with each L2 cache
block
• Used to track which L1 caches have copies of an L2 block
• By associating each L2 with a particular memory bank and enforcing the subset property, T1
can place the directory at L2 rather than at the memory, which reduces the directory
overhead
• L1 data cache is write‐through, only invalidation messages are required; the data can always
be retrieved from the L2 cache
• 1.2 GHz at 72W typical, 79W peak power consumption
CA-Lec10 [email protected] 43
Miss Rates: L2 Cache Size, Block Size
[Chart: L2 miss rate (0%–2.5%) for TPC-C and SPECJBB as a function of L2 cache size (1.5 MB, 3 MB,
6 MB) and block size (32 B, 64 B); the T1 design point (3 MB L2) is marked.]
CA-Lec10 [email protected] 44
Miss Latency: L2 Cache Size, Block Size
[Chart: L2 miss latency (0–200 cycles) for TPC-C and SPECJBB across L2 sizes of 1.5 MB, 3 MB, and
6 MB with 32 B and 64 B blocks; the T1 design point is marked.]
CA-Lec10 [email protected] 45
CPI Breakdown of Performance
[Chart: CPI breakdown (fraction of cycles) for TPC-C, SPECJBB, and SPECWeb99, divided into L1 I
miss, L1 D miss, L2 miss, pipeline delay, and other.]
Performance Relative to Pentium D
[Chart: performance of +Power5, Opteron, and Sun T1 relative to the Pentium D on SPECIntRate,
SPECFPRate, SPECJBB05, SPECWeb05, and a TPC-like workload.]
CA-Lec10 [email protected] 48
Efficiency Normalized to Pentium D
[Chart: performance per mm^2 and performance per Watt, normalized to the Pentium D, for
+Power5, Opteron, and Sun T1 on SPECIntRate, SPECFPRate, SPECJBB05, and TPC-C.]
CA-Lec10 [email protected] 49
Niagara 2
• Improve performance by increasing threads supported per chip from 32 to
64
– 8 cores * 8 threads per core
• Floating‐point unit for each core, not for each chip
• Hardware support for encryption standards AES, 3DES, and elliptic‐curve
cryptography
• Niagara 2 will add a number of 8x PCI Express interfaces directly into the
chip, in addition to integrated 10 Gigabit Ethernet XAUI interfaces and
Gigabit Ethernet ports.
• Integrated memory controllers will shift support from DDR2 to FB‐DIMMs
and double the maximum amount of system memory.
CA-Lec10 [email protected] 50