
Computer Architecture

Lecture 8: Vector Processing


(Chapter 4)
Chih‐Wei Liu 劉志尉
National Chiao Tung University
[email protected]
Introduction
• SIMD architectures can exploit significant data‐level
parallelism for:
– matrix‐oriented scientific computing
– media‐oriented image and sound processors
• SIMD is more energy efficient than MIMD
– Only needs to fetch one instruction per data operation
– Makes SIMD attractive for personal mobile devices

• SIMD allows programmer to continue to think sequentially

CA-Lec8 [email protected] 2
SIMD Variations
• Vector architectures
• SIMD extensions
– MMX: multimedia extensions (1996)
– SSE: streaming SIMD extensions
– AVX: advanced vector extensions
• Graphics Processor Units (GPUs)
– Considered as SIMD accelerators

CA-Lec8 [email protected] 3
SIMD vs. MIMD
• For x86 processors:
– Expect two additional cores per chip per year
– SIMD width to double every four years
– Potential speedup from SIMD to be twice that from MIMD!!

CA-Lec8 [email protected] 4
Vector Architectures
• Basic idea:
– Read sets of data elements into “vector registers”
– Operate on those registers
– Disperse the results back into memory
• Registers are controlled by compiler
– Register files act as compiler controlled buffers
– Used to hide memory latency
– Leverage memory bandwidth

• Vector loads/stores deeply pipelined


– Pay for memory latency once per vector load/store
• Regular loads/stores
– Pay for memory latency for each vector element

CA-Lec8 [email protected] 5
Vector Supercomputers
• In the 70s and 80s, supercomputer meant vector machine
• Definition of supercomputer
– Fastest machine in the world at given task
– A device to turn a compute‐bound problem into an I/O‐bound
problem
– CDC6600 (Cray, 1964) is regarded as the first supercomputer
• Vector supercomputers (epitomized by Cray‐1, 1976)
– Scalar unit + vector extensions
• Vector registers, vector instructions
• Vector loads/stores
• Highly pipelined functional units

CA-Lec8 [email protected] 6
Cray‐1 (1976)
[Figure: Cray‐1 block diagram. Eight 64‐element vector registers (V0–V7) with vector length and vector mask registers; eight scalar registers (S0–S7) backed by 64 T registers; eight address registers (A0–A7) backed by 64 B registers; pipelined functional units for FP add, FP multiply, FP reciprocal, integer add, integer logic, integer shift, population count, address add, and address multiply; four instruction buffers of 16 x 64 bits; single‐port memory of 16 banks of 64‐bit words with 8‐bit SECDED; 80 MW/sec data load/store and 320 MW/sec instruction buffer refill; memory bank cycle 50 ns, processor cycle 12.5 ns (80 MHz).]
CA-Lec8 [email protected] 7
Vector Programming Model
[Figure: the programming model has 16 scalar registers (r0–r15) and 16 vector registers (v0–v15); each vector register holds elements [0] … [VLRMAX‐1], and the Vector Length Register (VLR) determines how many are operated on. A vector arithmetic instruction such as ADDV v3, v1, v2 adds elements [0] … [VLR‐1] of v1 and v2 element‐wise into v3. A vector load or store such as LV v1, r1, r2 moves a vector between memory and a vector register using a base address in r1 and a stride in r2.]
CA-Lec8 [email protected] 8
Example: VMIPS
• Loosely based on Cray‐1
• Vector registers
– Each register holds a 64‐element, 64
bits/element vector
– Register file has 16 read ports and 8 write
ports
• Vector functional units
– Fully pipelined
– Data and control hazards are detected
• Vector load‐store unit
– Fully pipelined
– Words move between memory and the vector registers
– One word per clock cycle after initial latency
• Scalar registers
– 32 general‐purpose registers
– 32 floating‐point registers

CA-Lec8 [email protected] 9
VMIPS Instructions
• Operate on many elements concurrently
• Allows use of slow but wide execution units
• High performance, lower power

• Independence of elements within a vector instruction


• Allows scaling of functional units without costly
dependence checks

• Flexible
• 64 64‐bit / 128 32‐bit / 256 16‐bit / 512 8‐bit elements
• Matches the needs of multimedia (8‐bit) and of scientific applications that require high precision

CA-Lec8 [email protected] 10
Vector Instructions
• ADDVV.D: add two vectors
• ADDVS.D: add vector to a scalar
• LV/SV: vector load and vector store from address
• Vector code example:

CA-Lec8 [email protected] 11
Vector Memory‐Memory vs.
Vector Register Machines
• Vector memory‐memory instructions hold all vector operands in main
memory
• The first vector machines, CDC Star‐100 (‘73) and TI ASC (‘71), were
memory‐memory machines
• Cray‐1 (’76) was first vector register machine
Example Source Code:
for (i=0; i<N; i++) {
    C[i] = A[i] + B[i];
    D[i] = A[i] - B[i];
}

Vector Memory-Memory Code:
ADDV C, A, B
SUBV D, A, B

Vector Register Code:
LV   V1, A
LV   V2, B
ADDV V3, V1, V2
SV   V3, C
SUBV V4, V1, V2
SV   V4, D
CA-Lec8 [email protected] 12
Vector Memory‐Memory vs.
Vector Register Machines
• Vector memory‐memory architectures (VMMA) require greater
main memory bandwidth, why?
– All operands must be read in and out of memory
• VMMAs make it difficult to overlap execution of multiple vector
operations, why?
– Must check dependencies on memory addresses
• VMMAs incur greater startup latency
– Scalar code was faster on CDC Star‐100 for vectors < 100
elements
– For Cray‐1, vector/scalar breakeven point was around 2
elements
 Apart from CDC follow‐ons (Cyber‐205, ETA‐10) all major vector
machines since Cray‐1 have had vector register architectures

(we ignore vector memory‐memory from now on)


CA-Lec8 [email protected] 13
Vector Instruction Set Advantages
• Compact
– One short instruction encodes N operations
• Expressive
– tells hardware that these N operations are independent
– N operations use the same functional unit
– N operations access disjoint registers
– N operations access registers in the same pattern as previous
instruction
– N operations access a contiguous block of memory (unit‐stride
load/store)
– N operations access memory in a known pattern (strided load/store)
• Scalable
– Can run same object code on more parallel pipelines or lanes

CA-Lec8 [email protected] 14
Vector Instructions Example
• Example: DAXPY adds a scalar multiple of a double precision vector to
another double precision vector
L.D F0,a ;load scalar a
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar mult
LV V3,Ry ;load vector Y
ADDVV V4,V2,V3 ;add
SV Ry,V4 ;store result
Requires only 6 instructions
• In MIPS code (scalar loop)
– ADD waits for MUL, S.D waits for ADD, for every element
• In VMIPS
– Stall once for the first vector element; subsequent elements flow smoothly down the pipeline
– Pipeline stall required only once per vector instruction!

CA-Lec8 [email protected] 15
Challenges of Vector Instructions
• Start up time
– Application and architecture must support long vectors. Otherwise,
they will run out of instructions requiring ILP
– Latency of vector functional unit
– Assume the same as Cray‐1
• Floating‐point add => 6 clock cycles
• Floating‐point multiply => 7 clock cycles
• Floating‐point divide => 20 clock cycles
• Vector load => 12 clock cycles

CA-Lec8 [email protected] 16
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: a six‐stage multiply pipeline computing V3 <- V1 * V2 element by element.]
CA-Lec8 [email protected] 17
Vector Instruction Execution
ADDV C, A, B
[Figure: execution using one pipelined functional unit, producing one element of C per cycle, versus four‐lane execution using four pipelined functional units, producing four elements of C per cycle.]
CA-Lec8 [email protected] 18
Vector Unit Structure
[Figure: a four‐lane vector unit. The vector register file is partitioned across the lanes (lane 0 holds elements 0, 4, 8, …; lane 1 holds elements 1, 5, 9, …; and so on), each lane contains one pipeline of each functional unit, and all lanes share the memory subsystem.]
CA-Lec8 [email protected] 19
Multiple Lanes Architecture
• Beyond one element per clock cycle
• Element n of vector register A is hardwired to element n of vector register B
– Allows for multiple hardware lanes
– No communication between lanes
– Little increase in control overhead
– No need to change machine code
• Adding more lanes allows designers to trade off clock rate and energy without sacrificing performance!
CA-Lec8 [email protected] 20
Code Vectorization
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: the scalar sequential code performs load, load, add, store once per iteration; the vectorized code issues one vector load, vector load, vector add, vector store sequence that covers many iterations at once.]
Vectorization is a massive compile-time reordering of operation sequencing
 requires extensive loop dependence analysis
CA-Lec8 [email protected] 21
Vector Execution Time
• Execution time depends on three factors:
– Length of operand vectors
– Structural hazards
– Data dependencies

• VMIPS functional units consume one element per clock


cycle
– Execution time is approximately the vector length

• Convoy
– Set of vector instructions that could potentially execute
together

CA-Lec8 [email protected] 22
Vector Instruction Parallelism
• Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes
• Complete 24 operations/cycle while issuing 1 short instruction/cycle
[Figure: timeline of the load, multiply, and add units, each 8 lanes wide; successive vector loads, multiplies, and adds are overlapped so that all three units are busy at once.]
CA-Lec8 [email protected] 23
Convoy
• Convoy: set of vector instructions that could
potentially execute together
– Must not contain structural hazards
– Sequences with RAW hazards should be in
different convoys

• However, sequences with RAW hazards can be


in the same convey via chaining

CA-Lec8 [email protected] 24
Vector Chaining
• Vector version of register bypassing
– Allows a vector operation to start as soon as the individual elements of its vector source operand become available

LV    V1
MULV  V3, V1, V2
ADDV  V5, V3, V4
[Figure: elements produced by the load unit are chained into the multiplier, and multiplier results are chained into the adder, so the three instructions overlap.]
CA-Lec8 [email protected] 25
Vector Chaining Advantage
• Without chaining, must wait for the last element of a result to be written before starting the dependent instruction
[Figure: Load, Mul, and Add execute strictly one after another.]
• With chaining, can start the dependent instruction as soon as the first result appears
[Figure: Mul begins as soon as the first Load result appears, and Add as soon as the first Mul result appears.]
CA-Lec8 [email protected] 26
Chimes
• Chime: unit of time to execute one convoy
– A sequence of m convoys executes in m chimes
– For a vector length of n, this requires approximately m x n clock cycles

CA-Lec8 [email protected] 27
Example
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector‐scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV Ry,V4 ;store the sum

Convoys:
1 LV MULVS.D
2 LV ADDVV.D
3 SV

3 chimes, 2 FP ops per result, cycles per FLOP = 1.5


For 64 element vectors, requires 64 x 3 = 192 clock cycles
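As a quick check of the arithmetic above, here is a minimal C sketch (my own illustration, not part of the lecture) of the chime approximation: execution time is roughly the number of convoys times the vector length, ignoring start-up overhead.

#include <stdio.h>

/* Chime approximation: a sequence of m convoys over n-element vectors
   takes about m * n clock cycles (start-up latency ignored). */
static long chime_cycles(int convoys, int vector_length) {
    return (long)convoys * vector_length;
}

int main(void) {
    /* The example above: 3 convoys, 64-element vectors. */
    printf("%ld cycles\n", chime_cycles(3, 64));   /* prints 192 */
    return 0;
}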

CA-Lec8 [email protected] 28
Vector Strip‐mining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit into vector registers, "Strip‐mining"

for (i=0; i<N; i++)
    C[i] = A[i]+B[i];

      ANDI   R1, N, 63      # N mod 64
      MTC1   VLR, R1        # Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3      # Multiply by 8
      DADDU  RA, RA, R2     # Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1       # Subtract elements
      LI     R1, 64
      MTC1   VLR, R1        # Reset full length
      BGTZ   N, loop        # Any more to do?
[Figure: arrays A, B, and C are each processed as a short remainder piece followed by full 64‐element pieces.]
CA-Lec8 [email protected] 29
Vector Length Register
• Handling loops not equal to 64
• Vector length not known at compile time? Use Vector Length
Register (VLR) for vectors over the maximum length, strip
mining:
low = 0;
VL = (n % MVL); /*find odd‐size piece using modulo op % */
for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/
for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/
Y[i] = a * X[i] + Y[i] ; /*main operation*/
low = low + VL; /*start of next vector*/
VL = MVL; /*reset the length to maximum vector length*/
}

CA-Lec8 [email protected] 30
Maximum Vector Length (MVL)
• Determine the maximum number of elements in a vector
for a given architecture

• All blocks but the first are of length MVL


– Utilize the full power of the vector processor
• Later generations may grow the MVL
– No need to change the ISA

CA-Lec8 [email protected] 31
Vector‐Mask Control
• Simple implementation: execute all N operations, turn off result writeback according to mask
• Density-time implementation: scan mask vector and only execute elements with non-zero masks
[Figure: with the simple implementation every pair A[i], B[i] flows through the pipeline and the write of C[i] is disabled where M[i]=0; with the density-time implementation only the elements whose mask bit is 1 (here elements 1, 4, 5, and 7) are sent down the pipeline.]
CA-Lec8 [email protected] 32
Vector Mask Register (VMR)
• A Boolean vector to control the execution of a vector
instruction
• VMR is part of the architectural state
• For vector processor, it relies on compilers to
manipulate VMR explicitly
• For GPU, it gets the same effect using hardware
– Invisible to SW
• Both GPU and vector processor spend time on
masking

CA-Lec8 [email protected] 33
Vector Mask Registers Example
• Programs that contain IF statements in loops cannot be run in vector mode using the techniques discussed so far
• Consider:
for (i = 0; i < 64; i=i+1)
if (X[i] != 0)
X[i] = X[i] – Y[i];
• Use vector mask register to “disable” elements:
LV V1,Rx ;load vector X into V1
LV V2,Ry ;load vector Y
L.D F0,#0 ;load FP zero into F0
SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV Rx,V1 ;store the result in X

• GFLOPS rate decreases!

CA-Lec8 [email protected] 34
Compress/Expand Operations
• Compress packs non‐masked elements from one vector register contiguously at start of destination vector register
– population count of mask vector gives packed vector length
• Expand performs inverse operation
[Figure: with mask bits set for elements 1, 4, 5, and 7, compress packs A[1], A[4], A[5], A[7] into the low elements of the destination; expand scatters them back to positions 1, 4, 5, and 7, leaving the other positions (B values) unchanged.]
Used for density-time conditionals and also for general selection operations
CA-Lec8 [email protected] 35
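A minimal C sketch of the compress and expand operations just described (my own scalar illustration of their semantics, not VMIPS code; a real vector machine does this in hardware under the vector mask):

#include <stddef.h>

/* Compress: pack src[i] for which mask[i] != 0 contiguously into dst.
   Returns the packed length, i.e. the population count of the mask. */
size_t vcompress(const double *src, const unsigned char *mask,
                 double *dst, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i]) dst[k++] = src[i];
    return k;
}

/* Expand: inverse operation; dst elements whose mask bit is 0 are left unchanged. */
void vexpand(const double *src, const unsigned char *mask,
             double *dst, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i]) dst[i] = src[k++];
}

The population-count relationship mentioned on the slide shows up here as the return value of vcompress.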
Memory Banks
• The start‐up time for a load/store vector unit is the time to get the first
word from memory into a register. How about the rest of the vector?
– Memory stalls can reduce effective throughput for the rest of the vector
• Penalties for start‐ups on load/store units are higher than those for
arithmetic unit
• Memory system must be designed to support high bandwidth for vector
loads and stores
– Spread accesses across multiple banks
1. Support multiple loads or stores per clock. Be able to control the addresses
to the banks independently
2. Support (multiple) non‐sequential loads or stores
3. Support multiple processors sharing the same memory system, so each
processor will be generating its own independent stream of addresses

CA-Lec8 [email protected] 36
Example (Cray T90)
• 32 processors, each generating 4 loads and 2 stores per
cycle
• Processor cycle time is 2.167ns
• SRAM cycle time is 15ns
• How many memory banks needed?

• Solution:
1. The maximum number of memory references each cycle is 32 x 6 = 192
2. SRAM takes 15/2.167 ≈ 6.92 processor cycles, i.e. 7 cycles rounded up
3. It requires 192 x 7 = 1344 memory banks to sustain full memory bandwidth!!
The Cray T932 actually has 1024 memory banks (cannot sustain full bandwidth)
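The bank count above can be reproduced with a short C sketch (assumed model: references issued per cycle times the bank busy time expressed in processor cycles, rounding up):

#include <math.h>
#include <stdio.h>

int main(void) {
    int processors = 32;
    int refs_per_proc = 4 + 2;                    /* 4 loads + 2 stores per cycle */
    double proc_cycle = 2.167, sram_cycle = 15.0; /* ns */

    int refs_per_cycle = processors * refs_per_proc;       /* 32 x 6 = 192   */
    int busy_cycles = (int)ceil(sram_cycle / proc_cycle);  /* 15/2.167 -> 7  */
    printf("banks for full bandwidth: %d\n",
           refs_per_cycle * busy_cycles);                  /* 192 x 7 = 1344 */
    return 0;
}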
CA-Lec8 [email protected] 37
Memory Addressing
• Load/store operations move groups of data between
registers and memory
• Three types of addressing
– Unit stride
• Contiguous block of information in memory
• Fastest: always possible to optimize this
– Non‐unit (constant) stride
• Harder to optimize memory system for all possible strides
• Prime number of data banks makes it easier to support different
strides at full bandwidth
– Indexed (gather‐scatter)
• Vector equivalent of register indirect
• Good for sparse arrays of data
• Increases number of programs that vectorize

CA-Lec8 [email protected] 38
Interleaved Memory Layout
[Figure: a vector processor connected to eight memory banks, each built from unpipelined DRAM; bank i holds the addresses with address mod 8 = i.]
• Great for unit stride:
– Contiguous elements in different DRAMs
– Startup time for vector operation is latency of single read
• What about non‐unit stride?
– Above good for strides that are relatively prime to 8
– Bad for: 2, 4
CA-Lec8 [email protected] 39
Handling Multidimensional Arrays in
Vector Architectures
• Consider:
for (i = 0; i < 100; i=i+1)
for (j = 0; j < 100; j=j+1) {
A[i][j] = 0.0;
for (k = 0; k < 100; k=k+1)
A[i][j] = A[i][j] + B[i][k] * D[k][j];
}
• Must vectorize multiplications of rows of B with columns of D
– Need to access adjacent elements of B and adjacent elements of D
– Elements of B accessed in row‐major order but elements of D in
column‐major order!
• Once vector is loaded into the register, it acts as if it has logically
adjacent elements

CA-Lec8 [email protected] 40
(Unit/Non‐Unit) Stride Addressing
• The distance separating elements to be gathered into a single
register is called stride.
• In the example
– Matrix D has a stride of 100 double words
– Matrix B has a stride of 1 double word
• Use non‐unit stride for D
– To access non‐sequential memory location and to reshape them into a
dense structure
• The size of the matrix may not be known at compile time
– Use LVWS/SVWS: load/store vector with stride
• The vector stride, like the vector starting address, can be put in a general‐
purpose register (dynamic)
• Cache inherently deals with unit stride data
– Blocking techniques help with non‐unit stride data
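A small C illustration of the stride point above (hypothetical 100x100 arrays): in row-major C, walking a row of B is unit stride, while walking a column of D touches memory every 100 double words, which is what a strided load (LVWS) would use.

#define N 100
double B[N][N], D[N][N];

/* B[i][k] and B[i][k+1] are adjacent in memory: unit stride (1 double word). */
/* D[k][j] and D[k+1][j] are N elements apart: stride of 100 double words.    */
double row_times_column(int i, int j) {
    double sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += B[i][k] * D[k][j];   /* B: stride 1, D: stride N */
    return sum;
}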

CA-Lec8 [email protected] 41
How to get full bandwidth for unit
stride?
• Memory system must sustain (# lanes x word) /clock
• # memory banks > memory latency to avoid stalls
• If desired throughput greater than one word per cycle
– Either more banks (start multiple requests simultaneously)
– Or wider DRAMS. Only good for unit stride or large data
types
• More banks are also good for supporting more strides at full bandwidth
– can read papers on how to implement a prime number of banks efficiently

CA-Lec8 [email protected] 42
Memory Bank Conflicts
int x[256][512];
for (j = 0; j < 512; j = j+1)
for (i = 0; i < 256; i = i+1)
x[i][j] = 2 * x[i][j];

• Even with 128 banks, since 512 is multiple of 128, conflict on word accesses
• SW: loop interchange or declaring array not power of 2 ("array padding"); see the C sketch after this list
• HW: Prime number of banks
– bank number = address mod number of banks
– address within bank = address / number of words in bank
– modulo & divide per memory access with prime no. banks?
– address within bank = address mod number words in bank
– bank number? easy if 2^N words per bank
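A hedged C sketch of the two software fixes referenced above (illustrative only): loop interchange makes the inner loop unit stride, and padding the row length to a non-multiple of the bank count spreads column accesses across banks.

/* Original problem: the inner loop walks a column of x[256][512]; with 128
   banks and a 512-word row stride, successive accesses hit the same bank.  */

/* Fix 1: loop interchange -- the inner loop becomes unit stride.           */
void scale_interchanged(int x[256][512]) {
    for (int i = 0; i < 256; i++)
        for (int j = 0; j < 512; j++)
            x[i][j] = 2 * x[i][j];
}

/* Fix 2: array padding -- a row length of 513 is not a multiple of 128,
   so successive column accesses land in different banks.                   */
int x_padded[256][513];
void scale_padded(void) {
    for (int j = 0; j < 512; j++)            /* original loop order kept    */
        for (int i = 0; i < 256; i++)
            x_padded[i][j] = 2 * x_padded[i][j];
}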

CA-Lec8 [email protected] 43
Problems of Stride Addressing
• Once we introduce non‐unit strides, it becomes
possible to request accesses from the same bank
frequently
– Memory bank conflict !!
• Stall the other request to solve bank conflict

• Bank conflict (stall) occurs when the same bank is hit again before its bank busy time has elapsed:
– LCM(stride, #banks) / stride < bank busy time
(equivalently, #banks / GCD(stride, #banks) < bank busy time)

CA-Lec8 [email protected] 44
Example
• 8 memory banks, bank busy time of 6 cycles, total memory latency of 12 cycles for the initial access
• What is the difference between a 64‐element vector load with a stride of 1 and one with a stride of 32?

• Solution:
1. Since 8 > 6, for a stride of 1 the load will take 12 + 64 = 76 cycles, i.e. 1.2 cycles per element
2. Since 32 is a multiple of 8, it is the worst possible stride
– every access to memory (after the first one) will collide with the previous access and will have to wait for 6 cycles
– the total time will be 12 + 1 + 6 x 63 = 391 cycles, i.e. 6.1 cycles per element
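A minimal C model of this calculation (my own helper, with the simplifying assumption that once the same bank is revisited too soon, every element after the first pays the full bank busy time; this is exact for the worst-case stride used here):

#include <stdio.h>

static long gcd(long a, long b) { while (b) { long t = a % b; a = b; b = t; } return a; }

/* Estimated cycles for an n-element vector load: if the rotation returns to a
   bank before its busy time has elapsed, serialize on the bank busy time.   */
long load_cycles(long n, long stride, long banks, long busy, long latency) {
    long banks_visited = banks / gcd(stride, banks);  /* = LCM(stride,banks)/stride */
    if (banks_visited >= busy)
        return latency + n;               /* no stalls: one element per cycle */
    return latency + 1 + busy * (n - 1);  /* each later element waits out the busy time */
}

int main(void) {
    printf("stride 1 : %ld cycles\n", load_cycles(64, 1, 8, 6, 12));   /* 76  */
    printf("stride 32: %ld cycles\n", load_cycles(64, 32, 8, 6, 12));  /* 391 */
    return 0;
}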

CA-Lec8 [email protected] 45
Handling Sparse Matrices in Vector
architecture
• Handling sparse matrices in vector mode is a necessity
• Example: Consider a sparse vector sum on arrays A and C
for (i = 0; i < n; i=i+1)
A[K[i]] = A[K[i]] + C[M[i]];
– where K and M are index vectors designating the nonzero
elements of A and C

• Sparse matrix elements stored in a compact form and accessed


indirectly
• Gather‐scatter operations are used
– Using LVI/SVI: load/store vector indexed or gathered

CA-Lec8 [email protected] 46
Scatter‐Gather
• Consider:
for (i = 0; i < n; i=i+1)
A[K[i]] = A[K[i]] + C[M[i]];
• Ra, Rc, Rk, and Rm contain the starting addresses of the vectors

• Use index vector:


LV Vk, Rk ;load K
LVI Va, (Ra+Vk) ;load A[K[]]
LV Vm, Rm ;load M
LVI Vc, (Rc+Vm) ;load C[M[]]
ADDVV.D Va, Va, Vc ;add them
SVI (Ra+Vk), Va ;store A[K[]]

• A and C must have the same number of non‐zero elements (the sizes of K and M)

CA-Lec8 [email protected] 47
Vector Architecture Summary
• Vector is alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, energy efficient, and better
real‐time model than out‐of‐order
• More lanes, slower clock rate!
– Scalable if elements are independent
– If there is dependency
• One stall per vector instruction rather than one stall per vector element
• Programmer in charge of giving hints to the compiler!
• Design issues: number of lanes, functional units and registers, length of
vector registers, exception handling, conditional operations
• Fundamental design issue is memory bandwidth
– Especially with virtual address translation and caching

CA-Lec8 [email protected] 48
Programming Vector Architectures
• Compilers can provide feedback to programmers
• Programmers can provide hints to compiler
• Cray Y‐MP Benchmarks

CA-Lec8 [email protected] 49
Multimedia Extensions
• Very short vectors added to existing ISAs
• Usually 64‐bit registers split into 2x32b or 4x16b or 8x8b
• Newer designs have 128‐bit registers (Altivec, SSE2)
• Limited instruction set:
– no vector length control
– no load/store stride or scatter/gather
– unit‐stride loads must be aligned to 64/128‐bit boundary
• Limited vector register length:
– requires superscalar dispatch to keep multiply/add/load units
busy
– loop unrolling to hide latencies increases register pressure
• Trend towards fuller vector support in microprocessors

CA-Lec8 [email protected] 50
“Vector” for Multimedia?
• Intel MMX: 57 additional 80x86 instructions (1st since 386)
– similar to Intel 860, Mot. 88110, HP PA‐7100LC, UltraSPARC
• 3 data types: 8 8‐bit, 4 16‐bit, 2 32‐bit in 64bits
– reuse 8 FP registers (FP and MMX cannot mix)
• short vector: load, add, store 8 8‐bit operands

• Claim: overall speedup 1.5 to 2X for 2D/3D graphics, audio,


video, speech, comm., ...
– use in drivers or added to library routines; no compiler

CA-Lec8 [email protected] 51
MMX Instructions
• Move 32b, 64b
• Add, Subtract in parallel: 8 8b, 4 16b, 2 32b
– opt. signed/unsigned saturate (set to max) if overflow
• Shifts (sll,srl, sra), And, And Not, Or, Xor
in parallel: 8 8b, 4 16b, 2 32b
• Multiply, Multiply‐Add in parallel: 4 16b
• Compare = , > in parallel: 8 8b, 4 16b, 2 32b
– sets field to 0s (false) or 1s (true); removes branches
• Pack/Unpack
– Convert 32b<–> 16b, 16b <–> 8b
– Pack saturates (set to max) if number is too large

CA-Lec8 [email protected] 52
SIMD Implementations: IA32/AMD64
• Intel MMX (1996)
– Repurpose 64‐bit floating point registers
– Eight 8‐bit integer ops or four 16‐bit integer ops
• Streaming SIMD Extensions (SSE) (1999)
– Separate 128‐bit registers
– Eight 16‐bit integer ops, Four 32‐bit integer/fp ops, or two 64‐bit
integer/fp ops
– Single‐precision floating‐point arithmetic
• SSE2 (2001), SSE3 (2004), SSE4(2007)
– Double‐precision floating‐point arithmetic
• Advanced Vector Extensions (2010)
– 256‐bits registers
– Four 64‐bit integer/fp ops
– Extensible to 512 and 1024 bits for future generations

CA-Lec8 [email protected] 53
SIMD Implementations: IBM
• VMX (1996‐1998)
– 32 4b, 16 8b, 8 16b, 4 32b integer ops and 4 32b FP ops
– Data rearrangement
• Cell SPE (PS3)
– 16 8b, 8 16b, 4 32b integer ops, and 4 32b and 8 64b FP ops
– Unified vector/scalar execution with 128 registers
• VMX 128 (Xbox360)
– Extension to 128 registers
• VSX (2009)
– 1 or 2 64b FPU, 4 32b FPU
– Integrate FPU and VMX into unit with 64 registers
• QPX (2010, Blue Gene)
– Four 64b SP or DP FP

CA-Lec8 [email protected] 54
Why SIMD Extensions?
• Media applications operate on data types narrower than the native
word size
• Costs little to add to the standard arithmetic unit
• Easy to implement
• Need smaller memory bandwidth than vector
• Separate data transfer aligned in memory
– Vector: single instruction, 64 memory accesses, page fault in the
middle of the vector likely !!
• Use much smaller register space
• Fewer operands
• No need for sophisticated mechanisms of vector architecture

CA-Lec8 [email protected] 55
Example SIMD Code
• Example DAXPY:
L.D F0,a ;load scalar a
MOV F1, F0 ;copy a into F1 for SIMD MUL
MOV F2, F0 ;copy a into F2 for SIMD MUL
MOV F3, F0 ;copy a into F3 for SIMD MUL
DADDIU R4,Rx,#512 ;last address to load
Loop: L.4D F4,0[Rx] ;load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D F4,F4,F0 ;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3]
L.4D F8,0[Ry] ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D F8,F8,F4 ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D 0[Ry],F8 ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU Rx,Rx,#32 ;increment index to X
DADDIU Ry,Ry,#32 ;increment index to Y
DSUBU R20,R4,Rx ;compute bound
BNEZ R20,Loop ;check if done

CA-Lec8 [email protected] 56
Challenges of SIMD Architectures
• Scalar processor memory architecture
– Only access to contiguous data
• No efficient scatter/gather accesses
– Significant penalty for unaligned memory access
– May need to write entire vector register
• Limitations on data access patterns
– Limited by cache line, up to 128‐256b
• Conditional execution
– Register renaming does not work well with masked execution
– Always need to write whole register
– Difficult to know when to indicate exceptions
• Register pressure
– Need to use multiple registers rather than register depth to hide
latency

CA-Lec8 [email protected] 57
Roofline Performance Model
• Basic idea:
– Plot peak floating‐point throughput as a function of arithmetic
intensity
– Ties together floating‐point performance and memory
performance for a target machine
• Arithmetic intensity
– Floating‐point operations per byte read
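A minimal C sketch of the roofline bound (the machine numbers below are made up for illustration): attainable GFLOP/s is the smaller of the peak floating-point throughput and peak memory bandwidth times arithmetic intensity.

#include <stdio.h>

/* Roofline: attainable GFLOP/s = min(peak GFLOP/s, peak GB/s * FLOPs per byte). */
double attainable_gflops(double peak_gflops, double peak_gbs, double intensity) {
    double bandwidth_bound = peak_gbs * intensity;
    return bandwidth_bound < peak_gflops ? bandwidth_bound : peak_gflops;
}

int main(void) {
    /* Hypothetical machine: 25.6 GB/s memory bandwidth, 16 GFLOP/s peak. */
    for (double ai = 0.25; ai <= 4.0; ai *= 2)
        printf("AI = %.2f -> %.1f GFLOP/s attainable\n",
               ai, attainable_gflops(16.0, 25.6, ai));
    return 0;
}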

CA-Lec8 [email protected] 58
Examples
• Attainable GFLOPs/sec

CA-Lec8 [email protected] 59
History of GPUs
• Early video cards
– Frame buffer memory with address generation for video
output
• 3D graphics processing
– Originally high‐end computers (e.g., SGI)
– 3D graphics cards for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, ray
tracing

CA-Lec8 [email protected] 60
Graphical Processing Units
• Basic idea:
– Heterogeneous execution model
• CPU is the host, GPU is the device
– Develop a C‐like programming language for GPU
– Unify all forms of GPU parallelism as CUDA thread
– Programming model is “Single Instruction Multiple
Thread”

CA-Lec8 [email protected] 61
Programming Model
• CUDA’s design goals
– extend a standard sequential programming language,
specifically C/C++,
• focus on the important issues of parallelism—how to craft efficient
parallel algorithms—rather than grappling with the mechanics of
an unfamiliar and complicated language.

– minimalist set of abstractions for expressing parallelism


• highly scalable parallel code that can run across tens of thousands
of concurrent threads and hundreds of processor cores.

CA-Lec8 [email protected] 62
NVIDIA GPU Architecture
• Similarities to vector machines:
– Works well with data‐level parallel problems
– Scatter‐gather transfers from memory into local store
– Mask registers
– Large register files

• Differences:
– No scalar processor, scalar integration
– Uses multithreading to hide memory latency
– Has many functional units, as opposed to a few deeply
pipelined units like a vector processor

CA-Lec8 [email protected] 63
Programming the GPU
• CUDA Programming Model
– Single Instruction Multiple Thread (SIMT)
• A thread is associated with each data element
• Threads are organized into blocks
• Blocks are organized into a grid

• GPU hardware handles thread management, not


applications or OS
– Given the hardware invested to do graphics well, how can
we supplement it to improve performance of a wider
range of applications?

CA-Lec8 [email protected] 64
Example
• Multiply two vectors of length 8192
– Code that works over all elements is the grid
– Thread blocks break this down into manageable sizes
• 512 threads per block
– SIMD instruction executes 32 elements at a time
– Thus grid size = 16 blocks
– Block is analogous to a strip‐mined vector loop with
vector length of 32
– Block is assigned to a multithreaded SIMD processor
by the thread block scheduler
– Current‐generation GPUs (Fermi) have 7‐15
multithreaded SIMD processors
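The grid-size arithmetic above as a short C check (numbers taken from the example):

#include <stdio.h>

int main(void) {
    int n = 8192;                 /* elements in each vector                  */
    int threads_per_block = 512;  /* one CUDA thread per element              */
    int simd_width = 32;          /* elements handled by one SIMD instruction */

    int blocks = (n + threads_per_block - 1) / threads_per_block;    /* 16    */
    int simd_insts_per_block = threads_per_block / simd_width;       /* 16    */
    printf("grid = %d blocks, %d SIMD instructions per block\n",
           blocks, simd_insts_per_block);
    return 0;
}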
CA-Lec8 [email protected] 65
Terminology
• Threads of SIMD instructions
– Each has its own PC
– Thread scheduler uses scoreboard to dispatch
– No data dependencies between threads!
– Keeps track of up to 48 threads of SIMD instructions
• Hides memory latency
• Thread block scheduler schedules blocks to SIMD
processors
• Within each SIMD processor:
– 32 SIMD lanes
– Wide and shallow compared to vector processors

CA-Lec8 [email protected] 66
Example
• NVIDIA GPU has 32,768 registers
– Divided into lanes
– Each SIMD thread is limited to 64 registers
– SIMD thread has up to:
• 64 vector registers of 32 32‐bit elements
• 32 vector registers of 32 64‐bit elements
– Fermi has 16 physical SIMD lanes, each containing
2048 registers

CA-Lec8 [email protected] 67
GTX570 GPU
[Figure: GTX570 memory hierarchy. 1,280 MB of global memory and a 640 KB L2 cache are shared by 15 streaming multiprocessors (SM 0 … SM 14). Each SM has an 8 KB texture cache, a 16 KB L1 cache, an 8 KB constant cache, 48 KB of shared memory, 32,768 registers, 32 cores, and supports up to 1,536 threads.]
CA-Lec8 [email protected] 68
GPU Threads in SM (GTX570)

• 32 threads within a block work collectively


 Memory access optimization, latency hiding
CA-Lec8 [email protected] 69
Matrix Multiplication

CA-Lec8 [email protected] 70
Programming the GPU

CA-Lec8 [email protected] 71
Matrix Multiplication
• For a 4096x4096 matrix multiplication
‐ Matrix C will require calculation of 16,777,216 matrix cells.
• On the GPU each cell is calculated by its own thread.
• We can have 23,040 active threads (GTX570), which means
we can have this many matrix cells calculated in parallel.
• On a general purpose processor we can only calculate one cell
at a time.
• Each thread exploits the GPU's fine granularity by computing one element of Matrix C.
• Sub‐matrices are read into shared memory from global
memory to act as a buffer and take advantage of GPU
bandwidth.

CA-Lec8 [email protected] 72
Programming the GPU
• Distinguishing execution place of functions:
 __device__ or __global__ => GPU Device
 Variables declared are allocated to the GPU memory
 __host__ => System processor (HOST)
• Function call
 Name<<<dimGrid, dimBlock>>>(..parameter list..)
 blockIdx: block identifier
 threadIdx: thread identifier within a block
 blockDim: threads per block

CA-Lec8 [email protected] 73
CUDA Program Example
//Invoke DAXPY
daxpy(n, 2.0, x, y);

//DAXPY in C
void daxpy(int n, double a, double* x, double* y) {
    for (int i = 0; i < n; i++)
        y[i] = a*x[i] + y[i];
}

//Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255) / 256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

//DAXPY in CUDA
__global__
void daxpy(int n, double a, double* x, double* y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
CA-Lec8 [email protected] 74
NVIDIA Instruction Set Arch.
• “Parallel Thread Execution (PTX)”
• Uses virtual registers
• Translation to machine code is performed in software
• Example:
shl.s32 R8, blockIdx, 9 ; Thread Block ID * Block size (512 or 29)
add.s32 R8, R8, threadIdx ; R8 = i = my CUDA thread ID
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4 ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2 ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i])

CA-Lec8 [email protected] 75
Conditional Branching
• Like vector architectures, GPU branch hardware uses internal masks
• Also uses
– Branch synchronization stack
• Entries consist of masks for each SIMD lane
• I.e. which threads commit their results (all threads execute)
– Instruction markers to manage when a branch diverges into multiple
execution paths
• Push on divergent branch
– …and when paths converge
• Act as barriers
• Pops stack
• Per‐thread‐lane 1‐bit predicate register, specified by programmer

CA-Lec8 [email protected] 76
Example
if (X[i] != 0)
X[i] = X[i] – Y[i];
else X[i] = Z[i];

ld.global.f64 RD0, [X+R8] ; RD0 = X[i]


setp.neq.s32 P1, RD0, #0 ; P1 is predicate register 1
@!P1, bra ELSE1, *Push ; Push old mask, set new mask bits
; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2 ; Difference in RD0
st.global.f64 [X+R8], RD0 ; X[i] = RD0
@P1, bra ENDIF1, *Comp ; complement mask bits
; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask

CA-Lec8 [email protected] 77
NVIDIA GPU Memory Structures
• Each SIMD Lane has private section of off‐chip DRAM
– “Private memory”
– Contains stack frame, spilling registers, and private
variables
• Each multithreaded SIMD processor also has local
memory
– Shared by SIMD lanes / threads within a block
• Memory shared by SIMD processors is GPU Memory
– Host can read and write GPU memory

CA-Lec8 [email protected] 78
Fermi Architecture Innovations
• Each SIMD processor has
– Two SIMD thread schedulers, two instruction dispatch units
– 16 SIMD lanes (SIMD width=32, chime=2 cycles), 16 load‐store units, 4
special function units
– Thus, two threads of SIMD instructions are scheduled every two clock
cycles
• Fast double precision
• Caches for GPU memory
• 64‐bit addressing and unified address space
• Error correcting codes
• Faster context switching
• Faster atomic instructions

CA-Lec8 [email protected] 79
Fermi Multithreaded SIMD Proc.

CA-Lec8 [email protected] 80
Compiler Technology for Loop‐Level
Parallelism
• Loop‐carried dependence
– Focuses on determining whether data accesses in later
iterations are dependent on data values produced in
earlier iterations
• Loop‐level parallelism has no loop‐carried dependence

• Example 1:
for (i=999; i>=0; i=i‐1)
x[i] = x[i] + s;

• No loop‐carried dependence

CA-Lec8 [email protected] 81
Example 2 for Loop‐Level Parallelism
• Example 2:
for (i=0; i<100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}

• S1 and S2 each use a value computed by themselves in an earlier iteration (A[i] by S1, B[i] by S2)
– Loop‐carried dependence
• S2 uses value computed by S1 in same iteration
– No loop‐carried dependence

CA-Lec8 [email protected] 82
Remarks
• Intra‐loop dependence is not loop‐carried dependence
– A sequence of vector instructions that uses chaining
exhibits exactly intra‐loop dependence
• Two types of S1‐S2 intra‐loop dependence
– Circular: S1 depends on S2 and S2 depends on S1
– Not circular: neither statement depends on itself, and
although S1 depends on S2, S2 does not depend on S1
• A loop is parallel if it can be written without a cycle in
the dependences
– The absence of a cycle means that the dependences give a
partial ordering on the statements

CA-Lec8 [email protected] 83
Example 3 for Loop‐Level Parallelism (1)
• Example 3
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
• S1 uses value computed by S2 in previous iteration, but
this dependence is not circular.
• There is no dependence from S1 to S2, interchanging
the two statements will not affect the execution of S2

CA-Lec8 [email protected] 84
Example 3 for Loop‐Level Parallelism (2)
• Transform to
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
B[i+1] = C[i] + D[i]; /*S2*/
A[i+1] = A[i+1] + B[i+1]; /*S1*/
}
B[100] = C[99] + D[99];

• The dependence between the two statements is no longer


loop carried.
CA-Lec8 [email protected] 85
More on Loop‐Level Parallelism
for (i=0;i<100;i=i+1) {
A[i] = B[i] + C[i];
D[i] = A[i] * E[i];
}
• The second reference to A needs not be translated to a load
instruction.
– The two reference are the same. There is no intervening memory
access to the same location
• A more complex analysis, i.e. Loop‐carried dependence
analysis + data dependence analysis, can be applied in the
same basic block to optimize

CA-Lec8 [email protected] 86
Recurrence
• Recurrence is a special form of loop‐carried dependence.
• Recurrence Example:
for (i=1;i<100;i=i+1) {
Y[i] = Y[i‐1] + Y[i];
}

• Detecting a recurrence is important


– Some vector computers have special support for
executing recurrence
– It may still be possible to exploit a fair amount of ILP

CA-Lec8 [email protected] 87
Compiler Technology for Finding
Dependences
• To determine which loop might contain parallelism (“inexact”) and to
eliminate name dependences.
• Nearly all dependence analysis algorithms work on the assumption that
array indices are affine.
– A one‐dimensional array index is affine if it can be written in the form a x i + b (i is the loop index)
– The index of a multi‐dimensional array is affine if the index in each dimension
is affine.
– Non‐affine access example: x[ y[i] ]
• Determining whether there is a dependence between two references to
the same array in a loop is equivalent to determining whether two affine
functions can have the same value for different indices between the
bounds of the loop.

CA-Lec8 [email protected] 88
Finding dependencies Example
• Assume:
– Load an array element with index c x i + d and store to a x i + b
– i runs from m to n
• Dependence exists if the following two conditions hold
1. Given j, k such that m ≤ j ≤ n, m ≤ k ≤ n
2. a x j + b = c x k + d
• In general, the values of a, b, c, and d are not known at
compile time
– Dependence testing is expensive but decidable
– GCD (greatest common divisor) test
• If a loop‐carried dependence exists, then GCD(c,a) must divide |d-b|

CA-Lec8 [email protected] 89
Example
for (i=0; i<100; i=i+1) {
X[2*i+3] = X[2*i] * 5.0;
}

• Solution:
1. a=2, b=3, c=2, and d=0
2. GCD(a, c)=2, |b‐d|=3
3. Since 2 does not divide 3, no dependence is possible
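A small C sketch of the GCD test applied to this example, using the affine forms a x i + b for the store and c x i + d for the load defined on the previous slide (my own helper code):

#include <stdio.h>
#include <stdlib.h>

static int gcd(int a, int b) { while (b) { int t = a % b; a = b; b = t; } return a; }

/* GCD test: a loop-carried dependence between a store to a*i+b and a load from
   c*i+d is only possible if GCD(c, a) divides |d - b|.  The test can rule a
   dependence out, but passing it does not prove one exists.                  */
int gcd_test_may_depend(int a, int b, int c, int d) {
    return abs(d - b) % gcd(c, a) == 0;
}

int main(void) {
    /* X[2*i+3] = X[2*i] * 5.0  ->  a=2, b=3, c=2, d=0 */
    puts(gcd_test_may_depend(2, 3, 2, 0)
             ? "dependence possible"
             : "no dependence possible");   /* 2 does not divide 3 */
    return 0;
}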

CA-Lec8 [email protected] 90
Remarks
• The GCD test is sufficient but not necessary
– GCD test does not consider the loop bounds
– There are cases where the GCD test succeeds but
no dependence exists
• Determining whether a dependence actually
exists is NP‐complete

CA-Lec8 [email protected] 91
Finding dependencies
• Example 2:
for (i=0; i<100; i=i+1) {
Y[i] = X[i] / c; /* S1 */
X[i] = X[i] + c; /* S2 */
Z[i] = Y[i] + c; /* S3 */
Y[i] = c ‐ Y[i]; /* S4 */
}

• True dependence: S1‐>S3 (Y[i]), S1‐>S4 (Y[i]), but not loop‐


carried.
• Antidependence: S1‐>S2 (X[i]), S3‐>S4 (Y[i])
• Output dependence: S1‐>S4 (Y[i])

CA-Lec8 [email protected] 92
Renaming to Eliminate False (Pseudo)
Dependences
• Before:
for (i=0; i<100; i=i+1) {
    Y[i] = X[i] / c;
    X[i] = X[i] + c;
    Z[i] = Y[i] + c;
    Y[i] = c ‐ Y[i];
}
• After:
for (i=0; i<100; i=i+1) {
    T[i] = X[i] / c;
    X1[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c ‐ T[i];
}

CA-Lec8 [email protected] 93
Eliminating Recurrence Dependence
• Recurrence example, a dot product:
for (i=9999; i>=0; i=i‐1)
    sum = sum + x[i] * y[i];
• The loop is not parallel because it has a loop‐carried dependence.
• Transform to…
for (i=9999; i>=0; i=i‐1)
    sum[i] = x[i] * y[i];
This is called scalar expansion (scalar ===> vector). Parallel !!
for (i=9999; i>=0; i=i‐1)
    finalsum = finalsum + sum[i];
This is called a reduction; it sums up the elements of the vector. Not parallel !!

CA-Lec8 [email protected] 94
Reduction
• Reductions are common in linear algebra algorithms
• Reductions can be handled by special hardware in a vector
and SIMD architecture
– Similar to what can be done in multiprocessor
environment

• Example: To sum up 1000 elements on each of ten processors


for (i=999; i>=0; i=i‐1)
finalsum[p] = finalsum[p] + sum[i+1000*p];

– Assume p ranges from 0 to 9

CA-Lec8 [email protected] 95
Multithreading and Vector Summary
• Explicitly parallel (DLP or TLP) is next step to performance
• Coarse‐grained vs. Fine‐grained multithreading
– Switch only on big stall vs. switch every clock cycle
• Simultaneous multithreading, if fine grained multithreading
based on OOO superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
• Vector is alternative model for exploiting ILP
– If code is vectorizable, then simpler hardware, more energy
efficient, and better real‐time model than OOO machines
– Design issues include number of lanes, number of FUs, number
of vector registers, length of vector registers, exception handling,
conditional operations, and so on.
• Fundamental design issue is memory bandwidth

CA-Lec8 [email protected] 96
Computer Architecture
Lecture 9: Multiprocessors and
Thread‐Level Parallelism (Chapter 5)

Chih‐Wei Liu 劉志尉


National Chiao Tung University
[email protected]
Uniprocessor Performance (SPECint)
[Figure: SPECint performance relative to the VAX-11/780, 1978–2006, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; since 2002 performance growth has slowed, leaving roughly a 3X gap versus a continuation of the 52%/year trend.]
• VAX : 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
CA-Lec9 [email protected] 2
Multiprocessors
• Growth in data‐intensive applications
– Data bases, file servers, …
• Growing interest in servers, server performance.
• Increasing desktop performance less important
• Improved understanding in how to use multiprocessors
effectively
– Especially servers, where there is significant natural TLP
• Advantage of leveraging design investment by replication
– Rather than unique design

CA-Lec9 [email protected] 3
Flynn’s Taxonomy
M.J. Flynn, "Very High-Speed Computing Systems,"
Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.

• Flynn classified by data and control streams in 1966

– Single Instruction, Single Data (SISD): uniprocessor
– Single Instruction, Multiple Data (SIMD): single PC; Vector, CM-2
– Multiple Instruction, Single Data (MISD): ASIP
– Multiple Instruction, Multiple Data (MIMD): Clusters, SMP servers
• SIMD  Data Level Parallelism
• MIMD  Thread Level Parallelism
• MIMD popular because
– Flexible: can run N independent programs or 1 multithreaded program
– Cost‐effective: same MPU in desktop & MIMD

CA-Lec9 [email protected] 4
What is Parallel Architecture?
• A parallel computer is a collection of processing elements
that cooperate to solve large problems
– Most important new element: it is all about communication !!
• What does the programmer (or OS or compiler writer) think
about?
– Models of computation
• Sequential consistency?
– Resource allocation
• What mechanisms must be in hardware
– A high performance processor (SIMD, or Vector Processor)
– Data access, Communication, and Synchronization

CA-Lec9 [email protected] 5
Multiprocessor Basics
• “A parallel computer is a collection of processing elements that
cooperate and communicate to solve large problems fast.”
• Parallel Architecture = Computer Architecture + Communication
Architecture
• 2 classes of multiprocessors WRT memory:
1. Centralized Memory Multiprocessor
• < few dozen cores
• Small enough to share single, centralized memory with uniform
memory access latency (UMA)
2. Physically Distributed‐Memory Multiprocessor
• Larger number chips and cores than 1.
• BW demands  Memory distributed among processors with
non‐uniform memory access/latency (NUMA)

CA-Lec9 [email protected] 6
Shared‐Memory Multiprocessor
• SMP, Symmetric multiprocessors

• Uniform access time to all of the memory from all of the processors

CA-Lec9 [email protected] 7
Distributed‐Memory Multiprocessor
• Distributed shared‐memory multiprocessors
(DSM)

CA-Lec9 [email protected] 8
Centralized vs. Distributed Memory
[Figure: two organizations. Centralized memory: processors P1 … Pn, each with a cache ($), share a single memory through an interconnection network. Distributed memory: each processor has its own cache and its own local memory, and the nodes communicate through an interconnection network. Scale increases from the centralized toward the distributed organization.]
CA-Lec9 [email protected] 9
Centralized Memory Multiprocessor
• Also called symmetric multiprocessors (SMPs) because
single main memory has a symmetric relationship to all
processors
• Large caches  single memory can satisfy memory
demands of small number of processors
• Can scale to a few dozen processors by using a switch and
by using many memory banks
• Although scaling beyond that is technically conceivable, it
becomes less attractive as the number of processors
sharing centralized memory increases

CA-Lec9 [email protected] 10
Distributed Memory Multiprocessor
• Processors connected via direct (switched) and non‐
direct (multi‐hop) interconnection networks
• Pro: Cost‐effective way to scale memory bandwidth
• If most accesses are to local memory
• Pro: Reduces latency of local memory accesses

• Con: Communicating data between processors more


complex
• Con: Must change software to take advantage of increased
memory BW

CA-Lec9 [email protected] 11
2 Models for Communication and
Memory Architecture
1. Communication occurs by explicitly passing messages among the
processors:
message‐passing multiprocessors
2. Communication occurs through a shared address space (via loads and
stores):
shared memory multiprocessors either
• UMA (Uniform Memory Access time) for shared address, centralized
memory MP
• NUMA (Non Uniform Memory Access time) for shared address,
distributed memory MP
• In past, confusion whether “sharing” means sharing physical memory
(Symmetric MP) or sharing address space

CA-Lec9 [email protected] 12
Challenges of Parallel Processing
• First challenge is % of program inherently sequential

• Suppose 80X speedup from 100 processors. What fraction of


original program can be sequential?
a. 10%
b. 5%
c. 1%
d. <1%

CA-Lec9 [email protected] 13
Amdahl’s Law Answers
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

80 = 1 / ((1 - Fraction_parallel) + Fraction_parallel / 100)

80 x ((1 - Fraction_parallel) + Fraction_parallel / 100) = 1
80 - 80 x Fraction_parallel + 0.8 x Fraction_parallel = 1
79 = 79.2 x Fraction_parallel
Fraction_parallel = 79 / 79.2 = 99.75%
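The same algebra as a short C check (a quick sketch solving Amdahl's Law for the parallel fraction):

#include <stdio.h>

/* Solve speedup = 1 / ((1 - f) + f/n) for the parallel fraction f. */
double required_parallel_fraction(double speedup, double n) {
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n);
}

int main(void) {
    printf("f = %.4f\n", required_parallel_fraction(80.0, 100.0));  /* 0.9975 */
    return 0;
}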

CA-Lec9 [email protected] 14
Challenges of Parallel Processing
• Second challenge is long latency to remote memory

• Suppose 32 CPU MP, 2GHz, 200 ns remote memory, all local


accesses hit memory hierarchy and base CPI is 0.5. (Remote
access = 200/0.5 = 400 clock cycles.)
• What is performance impact if 0.2% instructions involve
remote access?
a. 1.5X
b. 2.0X
c. 2.5X

CA-Lec9 [email protected] 15
CPI Equation
• CPI = Base CPI +
Remote request rate x Remote request cost

• CPI = 0.5 + 0.2% x 400 = 0.5 + 0.8 = 1.3


• The multiprocessor with all local references (no communication) is 1.3/0.5 = 2.6 times faster than when 0.2% of instructions involve a remote access
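And the CPI calculation as a one-screen C check (values taken from the problem statement):

#include <stdio.h>

int main(void) {
    double base_cpi = 0.5;
    double remote_rate = 0.002;        /* 0.2% of instructions               */
    double remote_cost = 200.0 / 0.5;  /* 200 ns / 0.5 ns cycle = 400 cycles */

    double cpi = base_cpi + remote_rate * remote_cost;              /* 1.3   */
    printf("CPI = %.2f, slowdown = %.1fx\n", cpi, cpi / base_cpi);  /* 2.6x  */
    return 0;
}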

CA-Lec9 [email protected] 16
Challenges of Parallel Processing
1. Application parallelism  primarily via new algorithms that
have better parallel performance
2. Long remote latency impact  both by architect and by the
programmer
• For example, reduce frequency of remote accesses either by
– Caching shared data (HW)
– Restructuring the data layout to make more accesses local (SW)

Much of this lecture focuses on techniques for reducing the


impact of long remote latency.

CA-Lec9 [email protected] 17
Shared‐Memory Architectures

• From multiple boards on a shared bus to multiple processors


inside a single chip
• Caches both
– Private data are used by a single processor
– Shared data are used by multiple processors
• Caching shared data
 reduces latency to shared data, memory bandwidth for shared data,
and interconnect bandwidth
 cache coherence problem

CA-Lec9 [email protected] 18
Cache Coherence Problem
[Figure: processors P1, P2, and P3, each with a private cache, share a memory (u:5) and I/O devices over a bus. (1) P1 reads u=5 into its cache; (2) P3 reads u=5 into its cache; (3) P3 writes u=7 into its write‐back cache; (4), (5) P1 and P2 then read u and can still see the stale value 5.]
– Processors see different values for u after event 3
– With write back caches, value written back to memory depends on happenstance of which cache flushes or writes back value when
• Processes accessing main memory may see very stale value
– Unacceptable for programming, and it's frequent!
CA-Lec9 [email protected] 19
Example
/* Assume initial value of A and flag is 0 */
P1:                     P2:
A = 1;                  while (flag == 0); /* spin idly */
flag = 1;               print A;

• Intuition not guaranteed by coherence
• Expect memory to respect order between accesses to different locations issued by a given process
– to preserve orders among accesses to same location by different processes
• Coherence is not enough!
– pertains only to single location
[Figure: conceptual picture of processors P1 … Pn sharing a single memory.]
CA-Lec9 [email protected] 20
Intuitive Memory Model
[Figure: a memory hierarchy in which location 100 holds 67 in L1, 35 in L2, and 34 on disk.]
• Reading an address should return the last value written to that address
– Easy in uniprocessors, except for I/O
• Too vague and simplistic; 2 issues
1. Coherence defines values returned by a read
– Writes to the same location by any two processors are seen in the same order by all processors
2. Consistency determines when a written value will be returned by a read
– If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A
• Coherence defines behavior to the same location
• Consistency defines behavior to other locations
CA-Lec9 [email protected] 21
Defining Coherent Memory System
1. Preserve Program Order: A read by processor P to location X that follows
a write by P to X, with no writes of X by another processor occurring
between the write and the read by P, always returns the value written by
P
2. Coherent view of memory: Read by a processor to location X that follows
a write by another processor to X returns the written value if the read
and write are sufficiently separated in time and no other writes to X
occur between the two accesses
3. Write serialization: 2 writes to same location by any 2 processors are
seen in the same order by all processors
– If not, a processor could keep value 1, since it saw it as the last write
– For example, if the values 1 and then 2 are written to a location, processors
can never read the value of the location as 2 and then later read it as 1

CA-Lec9 [email protected] 22
Write Consistency
• For now assume
1. A write does not complete (and allow the next write to occur)
until all processors have seen the effect of that write
2. The processor does not change the order of any write with
respect to any other memory access
 if a processor writes location A followed by location B, any
processor that sees the new value of B must also see the new
value of A
• These restrictions allow the processor to reorder reads, but
forces the processor to finish writes in program order

CA-Lec9 [email protected] 23
Basic Schemes for Enforcing
Coherence
• Program on multiple processors will normally have copies of the same
data in several caches
• Rather than trying to avoid sharing in SW, SMPs use a HW protocol to
maintain coherent caches
• Coherent caches provide migration and replication of shared data
• Migration ‐ data can be moved to a local cache and used there in a
transparent fashion
– Reduces both latency to access shared data that is allocated remotely and
bandwidth demand on the shared memory
• Replication – for shared data being simultaneously read, since caches
make a copy of data in local cache
– Reduces both latency of access and contention for read shared data

CA-Lec9 [email protected] 24
2 Classes of Cache Coherence
Protocols
• HW cache coherence protocol to track the sharing status
– Use hardware to track the status of the shared data
1. Directory based — Sharing status of a block of physical
memory is kept in just one location, the directory
– Centralized control protocol
2. Snooping — Every cache with a copy of data also has a
copy of sharing status of block, but no centralized state is
kept
– Distributed control protocol

CA-Lec9 [email protected] 25
Snoopy Cache‐Coherence Protocols
[Figure: processors P1 … Pn, each with a cache holding block state, address, and data, connected by a shared bus to memory and I/O devices; every cache controller snoops the bus and observes all cache‐memory transactions.]
• Cache Controller "snoops" all transactions on the shared medium (bus or switch)
– It works because bus is a broadcast medium
– relevant transaction if for a block it contains
– take action to ensure coherence
• invalidate, update, or supply value
– depends on state of the block and the protocol
• Either get exclusive access before write via write invalidate or update all copies on write
CA-Lec9 [email protected] 26
Example: Write‐thru Invalidate
[Figure: the same three‐processor example. With a write‐through invalidate protocol, P3's write of u=7 (step 3) updates memory and invalidates the other cached copies, so the later reads by P1 and P2 (steps 4 and 5) obtain u=7. Exclusive access ensures that no other readable or writable copies of a datum exist when the write occurs.]
• Must invalidate shared data before step 3
• Write‐thru invalidate uses more broadcast medium BW
CA-Lec9 [email protected] 27
Architectural Building Blocks
• Cache block state transition diagram
– FSM specifying how disposition of block changes
• invalid, valid, dirty
• Broadcast Medium Transactions (e.g., bus)
– Fundamental system design abstraction
– Logically single set of wires connect several devices
– Protocol: arbitration, command/addr, data
 Every device observes every transaction
• Broadcast medium enforces serialization of read or write accesses 
Write serialization
– 1st processor to get medium invalidates others copies
– Implies cannot complete write until it obtains bus
– All coherence schemes require serializing accesses to same cache block
• Also need to find up‐to‐date copy of cache block

CA-Lec9 [email protected] 28
Locate Up‐to‐date Copy of Data
• Write‐through: get up‐to‐date copy from memory
– Write through simpler if enough memory BW
• Write‐back harder
– Most recent copy can be in a cache
• Can use same snooping mechanism
1. Snoop every address placed on the bus
2. If a processor has dirty copy of requested cache block, it provides it in
response to a read request and aborts the memory access
– Complexity from retrieving cache block from a processor cache, which can
take longer than retrieving it from memory

• Write‐back needs lower memory bandwidth


 Support larger numbers of faster processors
 Most multiprocessors use write‐back

CA-Lec9 [email protected] 29
Cache Resources for WB Snooping
• Normal cache tags can be used for snooping
• Valid bit per block makes invalidation easy
• Read misses easy since rely on snooping
• Writes  Need to know whether any other copies of the block are cached
– No other copies  No need to place write on bus for WB
– Other copies  Need to place invalidate on bus

CA-Lec9 [email protected] 30
Cache Resources for WB Snooping
• To track whether a cache block is shared, add extra state bit
associated with each cache block, like valid bit and dirty bit
– Write to Shared block => need to place invalidate on bus and mark
cache block as private (if an option)
– No further invalidations will be sent for that block
– This processor called owner of cache block
– Owner then changes state from shared to unshared (or exclusive)
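As a rough C sketch (my own illustration, not from the lecture), the per‐block bookkeeping
described above might look like this; the field names are hypothetical, and real designs pack
these bits into the tag array:

  #include <stdint.h>

  /* Hypothetical per-block metadata for a write-back snooping cache. */
  typedef struct {
      uint64_t tag;        /* address tag, checked by both the CPU and the bus snooper */
      unsigned valid  : 1; /* block holds data for the tagged address */
      unsigned dirty  : 1; /* modified locally; must be written back on replacement */
      unsigned shared : 1; /* other caches may hold a copy: a write must first place an
                              invalidate on the bus, then clear this bit (become owner) */
  } cache_block_meta_t;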

CA-Lec9 [email protected] 31
Cache Behavior in Response to Bus
• Every bus transaction must check the cache‐address tags
– could potentially interfere with processor cache accesses
• A way to reduce interference is to duplicate tags
– One set for processor cache accesses, one set for bus accesses
• Another way to reduce interference is to use L2 tags
– Since L2 less heavily used than L1
 Every entry in L1 cache must be present in the L2 cache, called the inclusion
property
– If Snoop gets a hit in L2 cache, then it must arbitrate for the L1 cache to
update the state and possibly retrieve the data, which usually requires a stall
of the processor

CA-Lec9 [email protected] 32
Example Protocol
• Snooping coherence protocol is usually implemented by incorporating a
finite‐state controller in each node
• Logically, think of a separate controller associated with each cache block
– That is, snooping operations or cache requests for different blocks can
proceed independently
• In implementations, a single controller allows multiple operations to
distinct blocks to proceed in interleaved fashion
– that is, one operation may be initiated before another is completed, even
though only one cache access or one bus access is allowed at a time

CA-Lec9 [email protected] 33
Cache Coherence Protocol Example
• Processor only observes state of memory system by issuing memory operations
• Assume bus transactions and memory operations are atomic and a one‐level cache
– all phases of one bus transaction complete before next one starts
– processor waits for memory operation to complete before issuing next
– with one‐level cache, assume invalidations applied during bus transaction
• All writes go to bus + atomicity
– Writes serialized by order in which they appear on bus (bus order)
=> invalidations applied to caches in bus order
• How to insert reads in this order?
– Important since processors see writes through reads, so determines whether
write serialization is satisfied
– But read hits may happen independently and do not appear on bus or enter
directly in bus order

• Let’s understand other ordering issues

CA-Lec9 [email protected] 34
Ordering
write propagation + write serialization

P0: R R R W R R

P1: R R R R R W

P2: R R R R R R

• Writes establish a partial order


• Doesn’t constrain ordering of reads, though shared‐medium (bus) will
order read misses too
– any order among reads between writes is fine, as long as in program
order

CA-Lec9 [email protected] 35
Example: Write Back Snoopy Protocol
• Invalidation protocol, write‐back cache
– Snoops every address on bus
– If it has a dirty copy of requested block, provides that block in response to
the read request and aborts the memory access
• Each memory block is in one state:
– Clean in all caches and up‐to‐date in memory (Shared)
– OR Dirty in exactly one cache (Exclusive)
– OR Not in any caches (Invalid)
• Each cache block is in one state (track these):
– Shared : block can be read
– OR Exclusive : cache has only copy, its writeable, and dirty
– OR Invalid : block contains no data (in uniprocessor cache too)
• Read misses: cause all caches to snoop bus
• Writes to clean blocks are treated as misses

CA-Lec9 [email protected] 36
Write‐Back State Machine ‐ CPU
• State machine for CPU requests, for each cache block
• Non‐resident blocks are Invalid
• Transitions (state → state: CPU event / bus action):
– Invalid → Shared (read/only): CPU read / place read miss on bus
– Invalid → Exclusive (read/write): CPU write / place write miss on bus
– Shared → Shared: CPU read hit
– Shared → Exclusive: CPU write / place write miss on bus
– Exclusive → Exclusive: CPU read hit, CPU write hit
– Exclusive → Exclusive: CPU write miss / write back cache block, place write miss on bus
CA-Lec9 [email protected] 37
Write‐Back State Machine‐ Bus Request
• State machine for bus requests, for each cache block
• Transitions (state → state: observed bus event / action):
– Shared → Invalid: write miss for this block
– Exclusive → Invalid: write miss for this block / write back block (abort memory access)
– Exclusive → Shared: read miss for this block / write back block (abort memory access)
CA-Lec9 [email protected] 38
Block‐replacement
• State machine for CPU requests, for each cache block, including block replacement
– Invalid → Shared (read/only): CPU read / place read miss on bus
– Invalid → Exclusive (read/write): CPU write / place write miss on bus
– Shared → Shared: CPU read hit
– Shared → Shared: CPU read miss (replacement) / place read miss on bus
– Shared → Exclusive: CPU write / place write miss on bus
– Exclusive → Exclusive: CPU read hit, CPU write hit
– Exclusive → Shared: CPU read miss / write back block, place read miss on bus
– Exclusive → Exclusive: CPU write miss / write back cache block, place write miss on bus
CA-Lec9 [email protected] 39
Write‐back State Machine‐III
• State machine for CPU requests and for bus requests, for each cache block
• CPU‐induced transitions:
– Invalid → Shared (read/only): CPU read / place read miss on bus
– Invalid → Exclusive (read/write): CPU write / place write miss on bus
– Shared → Shared: CPU read hit; CPU read miss (replacement) / place read miss on bus
– Shared → Exclusive: CPU write / place write miss on bus
– Exclusive → Exclusive: CPU read hit, CPU write hit
– Exclusive → Shared: CPU read miss / write back block, place read miss on bus
– Exclusive → Exclusive: CPU write miss / write back cache block, place write miss on bus
• Bus‐induced transitions:
– Shared → Invalid: write miss for this block
– Exclusive → Invalid: write miss for this block / write back block (abort memory access)
– Exclusive → Shared: read miss for this block / write back block (abort memory access)
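The CPU‐induced transitions above can also be read as a transition function. The C sketch below
is my own illustration, not the lecture's code; place_read_miss, place_write_miss, and
write_back_block are assumed stubs standing in for the bus actions named in the diagram, and the
bus‐induced transitions would be handled by a similar snooper routine:

  typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;
  typedef enum { CPU_READ_HIT, CPU_READ_MISS, CPU_WRITE_HIT, CPU_WRITE_MISS } cpu_event_t;

  /* Assumed stand-ins for the bus actions in the diagram (not real library calls). */
  static void place_read_miss(void)  { /* put a read miss on the bus */ }
  static void place_write_miss(void) { /* put a write miss on the bus */ }
  static void write_back_block(void) { /* write the dirty block back to memory */ }

  /* CPU-side next-state function for one cache block (write-back invalidate protocol). */
  block_state_t cpu_transition(block_state_t s, cpu_event_t e) {
      switch (s) {
      case INVALID:
          if (e == CPU_READ_MISS)  { place_read_miss();  return SHARED;    }
          if (e == CPU_WRITE_MISS) { place_write_miss(); return EXCLUSIVE; }
          break;
      case SHARED:
          if (e == CPU_READ_HIT)   return SHARED;
          if (e == CPU_READ_MISS)  { place_read_miss();  return SHARED;    } /* replacement */
          if (e == CPU_WRITE_HIT || e == CPU_WRITE_MISS)
                                   { place_write_miss(); return EXCLUSIVE; }
          break;
      case EXCLUSIVE:
          if (e == CPU_READ_HIT || e == CPU_WRITE_HIT) return EXCLUSIVE;
          if (e == CPU_READ_MISS)  { write_back_block(); place_read_miss();  return SHARED;    }
          if (e == CPU_WRITE_MISS) { write_back_block(); place_write_miss(); return EXCLUSIVE; }
          break;
      }
      return s;
  }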
CA-Lec9 [email protected] 40
Example

Columns tracked per step: P1 state/addr/value, P2 state/addr/value, bus action, memory value.
Operation sequence:
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes A1 and A2 map to same cache block;
initial cache state is Invalid

CA-Lec9 [email protected] 41
Example

After step 1 (P1: Write 10 to A1):
1. P1: Write 10 to A1 : write miss; Bus: WrMs P1 A1; P1 block → Excl. A1 = 10
(remaining steps follow on the next slides)
Assumes A1 and A2 map to same cache block

CA-Lec9 [email protected] 42
Example

After step 2 (P1: Read A1):
1. P1: Write 10 to A1 : Bus: WrMs P1 A1; P1 → Excl. A1 = 10
2. P1: Read A1        : read hit in P1 (Excl. A1 = 10); no bus traffic
(remaining steps follow on the next slides)
Assumes A1 and A2 map to same cache block

CA-Lec9 [email protected] 43
Example

After step 3 (P2: Read A1):
1. P1: Write 10 to A1 : Bus: WrMs P1 A1; P1 → Excl. A1 = 10
2. P1: Read A1        : read hit in P1
3. P2: Read A1        : Bus: RdMs P2 A1; P1 → Shar. A1 = 10 and writes back
                        (WrBk P1 A1 10, memory A1 = 10); P2 → Shar. A1 = 10 (RdDa P2 A1 10)
(remaining steps follow on the next slides)
Assumes A1 and A2 map to same cache block

CA-Lec9 [email protected] 44
Example

After step 4 (P2: Write 20 to A1):
1. P1: Write 10 to A1 : Bus: WrMs P1 A1; P1 → Excl. A1 = 10
2. P1: Read A1        : read hit in P1
3. P2: Read A1        : Bus: RdMs P2 A1; WrBk P1 A1 10 (memory A1 = 10); RdDa P2 A1 10;
                        P1, P2 → Shar. A1 = 10
4. P2: Write 20 to A1 : Bus: WrMs P2 A1; P1 → Inv.; P2 → Excl. A1 = 20 (memory still A1 = 10)
(remaining step follows on the next slide)
Assumes A1 and A2 map to same cache block

CA-Lec9 [email protected] 45
Example

After step 5 (P2: Write 40 to A2):
1. P1: Write 10 to A1 : Bus: WrMs P1 A1; P1 → Excl. A1 = 10
2. P1: Read A1        : read hit in P1
3. P2: Read A1        : Bus: RdMs P2 A1; WrBk P1 A1 10 (memory A1 = 10); RdDa P2 A1 10;
                        P1, P2 → Shar. A1 = 10
4. P2: Write 20 to A1 : Bus: WrMs P2 A1; P1 → Inv.; P2 → Excl. A1 = 20 (memory still A1 = 10)
5. P2: Write 40 to A2 : Bus: WrMs P2 A2; replacement writes A1 back (WrBk P2 A1 20,
                        memory A1 = 20); P2 → Excl. A2 = 40
Assumes A1 and A2 map to same cache block,
but A1 != A2

CA-Lec9 [email protected] 46
Concluding Remark (1/2)
• 1 instruction operates on vectors of data
• Vector loads get data from memory into big register files,
operate, and then vector store
• E.g., Indexed load, store for sparse matrix
• Easy to add vector to commodity instruction set
– E.g., Morph SIMD into vector
• Vector is very efficient architecture for vectorizable codes,
including multimedia and many scientific codes

CA-Lec9 [email protected] 47
Concluding Remark (2/2)
• “End” of uniprocessors speedup => Multiprocessors
• Parallelism challenges: % parallelizable, long latency to remote memory
• Centralized vs. distributed memory
– Small MP vs. lower latency, larger BW for Larger MP
• Message Passing vs. Shared Address
– Uniform access time vs. Non‐uniform access time
• Snooping cache over shared medium for smaller MP by invalidating other
cached copies on write
• Sharing cached data  Coherence (values returned by a read),
Consistency (when a written value will be returned by a read)
• Shared medium serializes writes
 Write consistency

CA-Lec9 [email protected] 48
Implementation Complications
• Write Races:
– Cannot update cache until bus is obtained
• Otherwise, another processor may get bus first,
and then write the same cache block!
– Two step process:
• Arbitrate for bus
• Place miss on bus and complete operation
– If miss occurs to block while waiting for bus,
handle miss (invalidate may be needed) and then restart.
– Split transaction bus:
• Bus transaction is not atomic:
can have multiple outstanding transactions for a block
• Multiple misses can interleave,
allowing two caches to grab block in the Exclusive state
• Must track and prevent multiple misses for one block
• Must support interventions and invalidations

CA-Lec9 [email protected] 49
Implementing Snooping Caches
• Multiple processors must be on bus, access to both addresses and data
• Add a few new commands to perform coherency,
in addition to read and write
• Processors continuously snoop on address bus
– If address matches tag, either invalidate or update
• Since every bus transaction checks cache tags,
could interfere with CPU just to check:
– solution 1: duplicate set of tags for L1 caches just to allow checks in parallel
with CPU
– solution 2: L2 cache already duplicate,
provided L2 obeys inclusion with L1 cache
• block size, associativity of L2 affects L1

CA-Lec9 [email protected] 50
Limitations in Symmetric Shared‐Memory
Multiprocessors and Snooping Protocols

• A single memory must accommodate all CPUs
=> Multiple memory banks
• Bus‐based multiprocessor, bus must support both coherence
traffic & normal memory traffic
 Multiple buses or interconnection networks (cross bar or
small point‐to‐point)
• Opteron
– Memory connected directly to each dual‐core chip
– Point‐to‐point connections for up to 4 chips
– Remote memory and local memory latency are similar, allowing the OS
to treat the Opteron as a UMA computer

CA-Lec9 [email protected] 51
Performance of Symmetric Shared‐Memory
Multiprocessors
• Cache performance is combination of
1. Uniprocessor cache miss traffic
2. Traffic caused by communication
– Results in invalidations and subsequent cache misses
• 4th C: coherence miss
– Joins Compulsory, Capacity, Conflict

CA-Lec9 [email protected] 52
Coherency Misses
1. True sharing misses arise from the communication of
data through the cache coherence mechanism
• Invalidates due to 1st write to shared block
• Reads by another CPU of modified block in different cache
• Miss would still occur if block size were 1 word
2. False sharing misses when a block is invalidated
because some word in the block, other than the one
being read, is written into
• Invalidation does not cause a new value to be communicated,
but only causes an extra cache miss
• Block is shared, but no word in block is actually shared
 miss would not occur if block size were 1 word

CA-Lec9 [email protected] 53
Example: True vs. False Sharing vs. Hit?
• Assume x1 and x2 in same cache block.
P1 and P2 both read x1 and x2 before.

Time  P1         P2         True, False, or Hit? Why?
1     Write x1              True miss; invalidate x1 in P2
2                Read x2    False miss; x1 irrelevant to P2
3     Write x1              False miss; x1 irrelevant to P2
4                Write x2   False miss; x1 irrelevant to P2
5     Read x2               True miss; invalidate x2 in P1
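To make the false‐sharing rows above concrete, here is a small C sketch (my own illustration):
two threads write different words that happen to share one cache block, so each write
invalidates the other core's copy even though no value is actually shared; padding the second
word onto its own block (the 64‐byte line size is an assumption) removes these coherence misses.

  #include <pthread.h>

  #define LINE 64                        /* assumed cache-block size in bytes */

  /* x1 and x2 fall in the same cache block: writes by different threads ping-pong
     the block between caches (false sharing), although no value is shared. */
  struct { long x1; long x2; } same_block;

  /* Padding puts y1 and y2 in different blocks, eliminating the false sharing. */
  struct { long y1; char pad[LINE - sizeof(long)]; long y2; } padded;

  static void *writer1(void *arg) {
      for (long i = 0; i < 100000000; i++) same_block.x1 = i;   /* invalidates P2's copy */
      return arg;
  }
  static void *writer2(void *arg) {
      for (long i = 0; i < 100000000; i++) same_block.x2 = i;   /* invalidates P1's copy */
      return arg;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, 0, writer1, 0);
      pthread_create(&t2, 0, writer2, 0);
      pthread_join(t1, 0);
      pthread_join(t2, 0);
      return 0;
  }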

CA-Lec9 [email protected] 08-54


MP Performance 4 Processor
Commercial Workload: OLTP, Decision Support
(Database), Search Engine
(Figure: memory cycles per instruction broken into Instruction, Capacity/Conflict, Cold,
False Sharing, and True Sharing components, for L3 cache sizes of 1, 2, 4, and 8 MB)
• True sharing and false sharing cycles are essentially unchanged going from a 1 MB to an
8 MB L3 cache
• Uniprocessor cache misses (Instruction, Capacity/Conflict, Compulsory) improve as the
cache size increases

CA-Lec9 [email protected] 55
MP Performance 2MB Cache
Commercial Workload: OLTP, Decision Support (Database),
Search Engine
(Figure: memory cycles per instruction broken into Instruction, Conflict/Capacity, Cold,
False Sharing, and True Sharing components, for processor counts of 1, 2, 4, 6, and 8,
with a 2 MB L3 cache)
• True sharing and false sharing cycles increase going from 1 to 8 CPUs
CA-Lec9 [email protected] 56
Computer Architecture
Lecture 10: Thread‐Level
Parallelism‐‐II (Chapter 5)
Chih‐Wei Liu 劉志尉
National Chiao Tung University
[email protected]
Review
• Caches contain all information on state of cached memory
blocks
• Snooping cache over shared medium for smaller MP by
invalidating other cached copies on write
• Sharing cached data
 Coherence (values returned by a read),
 Consistency (when a written value will be returned by a
read)

CA-Lec10 [email protected] 2
Coherency Misses
1. True sharing misses arise from the communication of data
through the cache coherence mechanism
• Invalidates due to 1st write to shared block
• Reads by another CPU of modified block in different cache
• Miss would still occur if block size were 1 word
2. False sharing misses when a block is invalidated because
some word in the block, other than the one being read, is
written into
• Invalidation does not cause a new value to be communicated, but
only causes an extra cache miss
• Block is shared, but no word in block is actually shared
 miss would not occur if block size were 1 word

CA-Lec10 [email protected] 3
A Cache Coherent System Must:
• Provide set of states, state transition diagram, and actions
• Manage coherence protocol
– (0) Determine when to invoke coherence protocol
– (a) Find info about state of block in other caches to determine action
• whether need to communicate with other cached copies
– (b) Locate the other copies
– (c) Communicate with those copies (invalidate/update)
• (0) is done the same way on all systems
– state of the line is maintained in the cache
– protocol is invoked if an “access fault” occurs on the line
• Different approaches distinguished by (a) to (c)

CA-Lec10 [email protected] 4
Bus‐based Coherence
• All of (a), (b), (c) done through broadcast on bus
– faulting processor sends out a “search”
– others respond to the search probe and take necessary action
• Could do it in scalable network too
– broadcast to all processors, and let them respond
• Conceptually simple, but broadcast doesn’t scale with p
– on bus, bus bandwidth doesn’t scale
– on scalable network, every fault leads to at least p network transactions
• Scalable coherence:
– can have same cache states and state transition diagram
– different mechanisms to manage protocol

CA-Lec10 [email protected] 5
Scalable Approach: Directories

• Every memory block has associated directory information


(may be cached)
– keeps track of copies of cached blocks and their states
– on a miss, find directory entry, look it up, and communicate only with
the nodes that have copies if necessary
– in scalable networks, communication with directory and copies is
through network transactions
• Many alternatives for organizing directory information

CA-Lec10 [email protected] 6
Basic Operation of Directory
(Figure: k processors, each with a cache, connected through an interconnection network to
memory and its directory; each directory entry holds k presence bits and a dirty bit)
• k processors
• With each cache‐block in memory: k presence‐bits, 1 dirty‐bit
• With each cache‐block in cache: 1 valid bit, and 1 dirty (owner) bit
• Read from main memory by processor i:
– If dirty‐bit OFF then { read from main memory; turn p[i] ON; }
– If dirty‐bit ON then { recall line from the dirty processor (its cache state goes to shared);
update memory; turn dirty‐bit OFF; turn p[i] ON; supply recalled data to i; }
• Write to main memory by processor i:
– If dirty‐bit OFF then { supply data to i; send invalidations to all caches that
have the block; turn dirty‐bit ON; turn p[i] ON; ... }
– ...

CA-Lec10 [email protected] 7
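Translated directly into code, the read/write handling above looks roughly like the C sketch
below (my own illustration); recall_from, invalidate, and supply_data are assumed stand‐ins for
the network transactions, and the dirty‐bit‐ON case of the write, elided with “...” on the
slide, is filled in with the usual recall‐then‐invalidate behavior:

  #include <stdbool.h>

  #define K 8                       /* assumed number of processors */

  typedef struct {
      bool present[K];              /* one presence bit per processor */
      bool dirty;                   /* set when exactly one cache holds a modified copy */
  } dir_entry_t;

  /* Assumed helpers standing in for network transactions / memory access. */
  static int  owner_of(dir_entry_t *d) { for (int j = 0; j < K; j++) if (d->present[j]) return j; return -1; }
  static void recall_from(int p, long blk)  { (void)p; (void)blk; /* fetch dirty block; owner goes to shared */ }
  static void invalidate(int p, long blk)   { (void)p; (void)blk; /* invalidate p's copy */ }
  static void supply_data(int p, long blk)  { (void)p; (void)blk; /* send up-to-date block to p */ }

  void dir_read(dir_entry_t *d, long blk, int i) {
      if (d->dirty) {                               /* recall from owner, update memory */
          recall_from(owner_of(d), blk);
          d->dirty = false;
      }
      d->present[i] = true;                         /* turn p[i] ON */
      supply_data(i, blk);                          /* read from (now up-to-date) memory */
  }

  void dir_write(dir_entry_t *d, long blk, int i) {
      if (d->dirty) recall_from(owner_of(d), blk);  /* get the latest copy first */
      for (int j = 0; j < K; j++)                   /* invalidate all other cached copies */
          if (d->present[j] && j != i) { invalidate(j, blk); d->present[j] = false; }
      d->present[i] = true;                         /* requester becomes the owner */
      d->dirty = true;
      supply_data(i, blk);
  }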
Directory Protocol
• Similar to Snoopy Protocol: Three states
– Shared: ≥ 1 processors have data, memory up‐to‐date
– Uncached (no processor has it; not valid in any cache)
– Exclusive: 1 processor (owner) has data; memory out‐of‐date

• In addition to cache state, must track which processors have data when in
the shared state (usually bit vector, 1 if processor has copy)
• Keep it simple(r):
– Writes to non‐exclusive data
=> write miss
– Processor blocks until access completes
– Assume messages received and acted upon in order sent

CA-Lec10 [email protected] 8
Directory Protocol
• No bus and don’t want to broadcast:
– interconnect no longer single arbitration point
– all messages have explicit responses
• Terms: typically 3 processors involved
– Local node where a request originates
– Home node where the memory location of an address resides
– Remote node has a copy of a cache block, whether exclusive or shared
• Example messages on next slide:
P = processor number, A = address

CA-Lec10 [email protected] 9
Directory Protocol Messages (Fig 4.22)
Message type Source Destination Msg Content
Read miss Local cache Home directory P, A
– Processor P reads data at address A;
make P a read sharer and request data
Write miss Local cache Home directory P, A
– Processor P has a write miss at address A;
make P the exclusive owner and request data
Invalidate Home directory Remote caches A
– Invalidate a shared copy at address A
Fetch Home directory Remote cache A
– Fetch the block at address A and send it to its home directory;
change the state of A in the remote cache to shared
Fetch/Invalidate Home directory Remote cache A
– Fetch the block at address A and send it to its home directory;
invalidate the block in the cache
Data value reply Home directory Local cache Data
– Return a data value from the home memory (read miss response)
Data write back Remote cache Home directory A, Data
– Write back a data value for address A (invalidate response)

CA-Lec10 [email protected] 10
State Transition Diagram for One Cache Block in
Directory Based System

• States identical to snoopy case; transactions very similar.


• Transitions caused by read misses, write misses, invalidates,
data fetch requests
• Generates read miss & write miss msg to home directory.
• Write misses that were broadcast on the bus for snooping =>
explicit invalidate & data fetch requests.
• Note: on a write, the cache block is bigger than the data being written,
so the full cache block must be read

CA-Lec10 [email protected] 11
CPU‐Cache State Machine
• State machine for CPU requests, for each memory block
• Invalid state if the block is only in memory (not cached)
• Transitions (state → state: event / message):
– Invalid → Shared (read/only): CPU read / send Read Miss message to home directory
– Invalid → Exclusive (read/write): CPU write / send Write Miss message to home directory
– Shared → Shared: CPU read hit; CPU read miss / send Read Miss message
– Shared → Invalid: Invalidate (from home directory)
– Shared → Exclusive: CPU write / send Write Miss message to home directory
– Exclusive → Exclusive: CPU read hit, CPU write hit
– Exclusive → Invalid: Fetch/Invalidate / send Data Write Back message to home directory
– Exclusive → Shared: Fetch / send Data Write Back message to home directory
– Exclusive → Shared: CPU read miss / send Data Write Back message and Read Miss to home directory
– Exclusive → Exclusive: CPU write miss / send Data Write Back message and Write Miss to home directory
CA-Lec10 [email protected] 12
State Transition Diagram for Directory

• Same states & structure as the transition diagram for an


individual cache
• 2 actions: update of directory state & send messages to
satisfy requests
• Tracks all copies of memory block
• Also indicates an action that updates the sharing set,
Sharers, as well as sending a message

CA-Lec10 [email protected] 13
Directory State Machine
• State machine for directory requests, for each memory block
• Uncached state if the block is only in memory
• Transitions (state → state: message / actions):
– Uncached → Shared (read only): Read miss / Sharers = {P}; send Data Value Reply
– Uncached → Exclusive (read/write): Write miss / Sharers = {P}; send Data Value Reply msg
– Shared → Shared: Read miss / Sharers += {P}; send Data Value Reply
– Shared → Exclusive: Write miss / send Invalidate to Sharers; then Sharers = {P};
send Data Value Reply msg
– Exclusive → Shared: Read miss / Sharers += {P}; send Fetch to owner; send Data Value Reply
msg to remote cache (write back block)
– Exclusive → Uncached: Data Write Back / Sharers = {} (write back block)
– Exclusive → Exclusive: Write miss / Sharers = {P}; send Fetch/Invalidate to owner;
send Data Value Reply msg to remote cache
CA-Lec10 [email protected] 14
Example Directory Protocol
• Message sent to directory causes two actions:
– Update the directory
– More messages to satisfy request
• Block is in Uncached state: the copy in memory is the current value; only possible
requests for that block are:
– Read miss: requesting processor sent data from memory &requestor made only sharing
node; state of block made Shared.
– Write miss: requesting processor is sent the value & becomes the Sharing node. The
block is made Exclusive to indicate that the only valid copy is cached. Sharers indicates
the identity of the owner.
• Block is Shared => the memory value is up‐to‐date:
– Read miss: requesting processor is sent back the data from memory & requesting
processor is added to the sharing set.
– Write miss: requesting processor is sent the value. All processors in the set Sharers are
sent invalidate messages, & Sharers is set to identity of requesting processor. The state
of the block is made Exclusive.

CA-Lec10 [email protected] 15
Example Directory Protocol
• Block is Exclusive: current value of the block is held in the cache of the
processor identified by the set Sharers (the owner) => three possible
directory requests:
– Read miss: owner processor sent data fetch message, causing state of block in
owner’s cache to transition to Shared and causes owner to send data to directory,
where it is written to memory & sent back to requesting processor.
Identity of requesting processor is added to set Sharers, which still contains the
identity of the processor that was the owner (since it still has a readable copy).
State is shared.
– Data write‐back: owner processor is replacing the block and hence must write it
back, making memory copy up‐to‐date
(the home directory essentially becomes the owner), the block is now Uncached,
and the Sharer set is empty.
– Write miss: block has a new owner. A message is sent to old owner causing the
cache to send the value of the block to the directory from which it is sent to the
requesting processor, which becomes the new owner. Sharers is set to identity of
new owner, and state of block is made Exclusive.

CA-Lec10 [email protected] 16
Example
Columns tracked per step: P1 state/addr/value, P2 state/addr/value, interconnect action,
directory state/{sharers}, memory value.
Operation sequence:
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
A1 and A2 map to the same cache block
(but different memory block addresses A1 ≠ A2)

CA-Lec10 [email protected] 17
Example
After step 1 (P1: Write 10 to A1):
1. P1: Write 10 to A1 : WrMs P1 A1; directory: A1 Excl. {P1}; DaRp P1 A1 0;
                        P1 → Excl. A1 = 10
(remaining steps follow on the next slides)
A1 and A2 map to the same cache block

CA-Lec10 [email protected] 18
Example
After step 2 (P1: Read A1):
1. P1: Write 10 to A1 : WrMs P1 A1; directory: A1 Excl. {P1}; DaRp P1 A1 0; P1 → Excl. A1 = 10
2. P1: Read A1        : read hit in P1 (Excl. A1 = 10)
(remaining steps follow on the next slides)
A1 and A2 map to the same cache block

CA-Lec10 [email protected] 19
Example
After step 3 (P2: Read A1):
1. P1: Write 10 to A1 : WrMs P1 A1; directory: A1 Excl. {P1}; DaRp P1 A1 0; P1 → Excl. A1 = 10
2. P1: Read A1        : read hit in P1
3. P2: Read A1        : RdMs P2 A1; directory sends Ftch P1 A1; P1 → Shar. A1 = 10 and
                        writes the block back (memory A1 = 10); DaRp P2 A1 10;
                        P2 → Shar. A1 = 10; directory: A1 Shar. {P1,P2}, value 10
(remaining steps follow on the next slides)
A1 and A2 map to the same cache block

CA-Lec10 [email protected] 20
Example
After step 4 (P2: Write 20 to A1):
1. P1: Write 10 to A1 : WrMs P1 A1; directory: A1 Excl. {P1}; DaRp P1 A1 0; P1 → Excl. A1 = 10
2. P1: Read A1        : read hit in P1
3. P2: Read A1        : RdMs P2 A1; Ftch P1 A1 (memory A1 = 10); DaRp P2 A1 10;
                        P1, P2 → Shar. A1 = 10; directory: A1 Shar. {P1,P2}
4. P2: Write 20 to A1 : WrMs P2 A1; directory sends Inval. P1 A1; P1 → Inv.;
                        P2 → Excl. A1 = 20; directory: A1 Excl. {P2} (memory still 10)
(remaining step follows on the next slide)
A1 and A2 map to the same cache block

CA-Lec10 [email protected] 21
Example
After step 5 (P2: Write 40 to A2):
1. P1: Write 10 to A1 : WrMs P1 A1; directory: A1 Excl. {P1}; DaRp P1 A1 0; P1 → Excl. A1 = 10
2. P1: Read A1        : read hit in P1
3. P2: Read A1        : RdMs P2 A1; Ftch P1 A1 (memory A1 = 10); DaRp P2 A1 10;
                        P1, P2 → Shar. A1 = 10; directory: A1 Shar. {P1,P2}
4. P2: Write 20 to A1 : WrMs P2 A1; Inval. P1 A1; P1 → Inv.; P2 → Excl. A1 = 20;
                        directory: A1 Excl. {P2}
5. P2: Write 40 to A2 : WrMs P2 A2; directory: A2 Excl. {P2}; replacement writes A1 back
                        (WrBk P2 A1 20; directory: A1 Unca. {}, memory A1 = 20);
                        DaRp P2 A2 0; P2 → Excl. A2 = 40
A1 and A2 map to the same cache block
(but different memory block addresses A1 ≠ A2)

CA-Lec10 [email protected] 22
Implementing a Directory
• We assume operations atomic, but they are not; reality is
much harder; must avoid deadlock when run out of buffers in
network (see Appendix E)
• Optimizations:
– read miss or write miss in Exclusive: send data directly to requestor
from owner vs. 1st to memory and then from memory to requestor

CA-Lec10 [email protected] 23
Basic Directory Transactions
(Figure: two directory transaction flows)
(a) Read miss to a block in dirty state:
1. Requestor sends a read request to the directory node for the block
2. Directory replies with the owner's identity
3. Requestor sends a read request to the owner
4a. Owner sends a data reply to the requestor
4b. Owner sends a revision message back to the directory
(b) Write miss to a block with two sharers:
1. Requestor sends a RdEx request to the directory node for the block
2. Directory replies with the sharers' identity
3a/3b. Requestor sends invalidation requests to the sharers
4a/4b. Sharers send invalidation acks back to the requestor

CA-Lec10 [email protected] 24
Example Directory Protocol (1st Read)
(Figure: P1 executes ld vA, a read of memory block pA; P1's block is Invalid, so a read
request goes to the directory at the home node; the directory supplies the data, records
P1 as a sharer of pA in state S, and P1's cache block enters the Shared state)
CA-Lec10 [email protected] 25
Example Directory Protocol (Read Share)
(Figure: P2 also executes ld vA and misses on pA; the directory, already recording pA as
shared, replies with the data and adds P2 to the sharer list; P1 and P2 both hold pA in
the Shared state)
CA-Lec10 [email protected] 26
Example Directory Protocol (Wr to shared)
(Figure: P1 executes st vA, a write to the shared block pA; a read‐to‐update request goes
to the directory, which sends an invalidate to P2 and collects the invalidate ack; P2's
copy becomes Invalid, the directory marks pA Exclusive with P1 as owner, and P1's block
enters the Exclusive (writable) state)
CA-Lec10 [email protected] 27
Example Directory Protocol (Wr to Ex)
(Figure: a write to pA while the other cache holds it Exclusive; the requestor sends a
read‐to‐update request to the directory, the directory invalidates and recalls the block
from the current owner, the owner writes the block back, and the directory replies with
the data so the requestor becomes the new Exclusive owner)
CA-Lec10 [email protected] 28
A Popular Middle Ground
• Two‐level “hierarchy”
• Individual nodes are multiprocessors, connected non‐hierarchically
– e.g. mesh of SMPs
• Coherence across nodes is directory‐based
– directory keeps track of nodes, not individual processors
• Coherence within nodes is snooping or directory
– orthogonal, but needs a good interface of functionality
• SMP on a chip: directory + snoop?

CA-Lec10 [email protected] 29
And in Conclusion …
• Caches contain all information on state of cached memory blocks
• Snooping cache over shared medium for smaller MP by invalidating other
cached copies on write
• Sharing cached data  Coherence (values returned by a read),
Consistency (when a written value will be returned by a read)
• Snooping and Directory Protocols similar; bus makes snooping easier
because of broadcast (snooping => uniform memory access)
• Directory has extra data structure to keep track of state of all cache blocks
• Distributing directory => scalable shared address multiprocessor
=> Cache coherent, Non uniform memory access

CA-Lec10 [email protected] 30
Outline
• Review
• Directory‐based protocols and examples
• Synchronization
• Relaxed Consistency Models
• Conclusion

CA-Lec10 [email protected] 31
Synchronization
• Why Synchronize? Need to know when it is safe for different
processes to use shared data

• Issues for Synchronization:


– Uninterruptable instruction to fetch and update memory (atomic
operation);
– User level synchronization operation using this primitive;
– For large scale MPs, synchronization can be a bottleneck;

CA-Lec10 [email protected] 32
Uninterruptable Instruction to Fetch and
Update Memory
• Atomic exchange: interchange a value in a register for a value in memory
– 0 => synchronization variable is free
– 1 => synchronization variable is locked and unavailable
– Set register to 1 & swap
– New value in register determines success in getting lock:
0 if you succeeded in setting the lock (you were first)
1 if another processor had already claimed access
– Key is that the exchange operation is indivisible
• Test‐and‐set: tests a value and sets it if the value passes the test
• Fetch‐and‐increment: returns the value of a memory location and
atomically increments it
– 0 => synchronization variable is free
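A portable rendering of the exchange‐based lock in C11 atomics (my own sketch, not the
lecture's code); atomic_exchange plays the role of the indivisible swap described above:

  #include <stdatomic.h>

  atomic_int lock = 0;                 /* 0 = free, 1 = locked */

  void acquire(atomic_int *l) {
      /* atomically swap in 1: old value 0 means we got the lock, 1 means keep trying */
      while (atomic_exchange(l, 1) == 1)
          ;                            /* spin: another processor already holds it */
  }

  void release(atomic_int *l) {
      atomic_store(l, 0);              /* mark the synchronization variable free again */
  }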

CA-Lec10 [email protected] 33
Uninterruptable Instruction to Fetch and
Update Memory
• Hard to have read & write in 1 instruction: use 2 instead
• Load linked (or load locked) + store conditional
– Load linked returns the initial value
– Store conditional returns 1 if it succeeds (no other store to same memory
location since preceding load) and 0 otherwise
• Example doing atomic swap with LL & SC:
try: mov R3,R4 ; mov exchange value
ll R2,0(R1) ; load linked
sc R3,0(R1) ; store conditional
beqz R3,try ; branch store fails (R3 = 0)
mov R4,R2 ; put load value in R4
• Example doing fetch & increment with LL & SC:
try: ll R2,0(R1) ; load linked
addi R2,R2,#1 ; increment (OK if reg–reg)
sc R2,0(R1) ; store conditional
beqz R2,try ; branch store fails (R2 = 0)
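In C11, the portable analogue of such an LL/SC retry loop is a compare‐and‐swap loop; a
fetch‐and‐increment sketch of my own (not from the lecture):

  #include <stdatomic.h>

  /* Returns the old value, like the LL/SC fetch-and-increment above: read the current
     value, try to publish value+1, and retry if another store slipped in between. */
  int fetch_and_increment(atomic_int *addr) {
      int old = atomic_load(addr);
      while (!atomic_compare_exchange_weak(addr, &old, old + 1))
          ;   /* on failure 'old' is reloaded with the current value, so just retry */
      return old;
  }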

CA-Lec10 [email protected] 34
User Level Synchronization—Operation
Using this Primitive
• Spin locks: processor continuously tries to acquire, spinning around a loop
trying to get the lock
daddui R2,R0,#1
lockit: exch R2,0(R1) ;atomic exchange
bnez R2,lockit ;already locked?
• What about MP with cache coherency?
– Want to spin on cache copy to avoid full memory latency
– Likely to get cache hits for such variables
• Problem: exchange includes a write, which invalidates all other copies; this
generates considerable bus traffic
• Solution: start by simply repeatedly reading the variable; when it changes,
then try exchange (“test and test&set”):
try: li R2,#1
lockit: lw R3,0(R1) ;load var
bnez R3,lockit ;≠ 0 => not free => spin
exch R2,0(R1) ;atomic exchange
bnez R2,try ;already locked?
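The same test‐and‐test&set idea expressed with C11 atomics (my own sketch): spin on an
ordinary load, which hits on the locally cached copy and generates no bus traffic, and
attempt the invalidating exchange only when the lock looks free:

  #include <stdatomic.h>

  void lock_ttas(atomic_int *l) {           /* 0 = free, 1 = locked */
      for (;;) {
          while (atomic_load(l) != 0)
              ;                             /* spin on the cached copy */
          if (atomic_exchange(l, 1) == 0)   /* looked free: now try the atomic exchange */
              return;                       /* old value 0: we own the lock */
          /* another processor grabbed it first: go back to local spinning */
      }
  }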

CA-Lec10 [email protected] 35
Another MP Issue:
Memory Consistency Models
• What is consistency? When must a processor see the new value? e.g.,
seems that
P1: A = 0; P2: B = 0;
..... .....
A = 1; B = 1;
L1: if (B == 0) ... L2: if (A == 0) ...
• Impossible for both if statements L1 & L2 to be true?
– What if write invalidate is delayed & processor continues?
• Memory consistency models:
what are the rules for such cases?
• Sequential consistency: result of any execution is the same as if the
accesses of each processor were kept in order and the accesses
among different processors were interleaved
 assignments must be completed before the if statements are
initiated
– SC: delay all memory accesses until all invalidates done
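The two‐processor example above, written with C11 atomics (my own sketch): with the default
sequentially consistent atomic operations at most one of the two messages can print, which is
exactly the guarantee discussed here; with relaxed memory orders both could appear.

  #include <stdatomic.h>
  #include <stdio.h>
  #include <pthread.h>

  atomic_int A = 0, B = 0;

  static void *p1(void *arg) {
      atomic_store(&A, 1);                            /* sequentially consistent store */
      if (atomic_load(&B) == 0) puts("P1 saw B == 0");
      return arg;
  }
  static void *p2(void *arg) {
      atomic_store(&B, 1);
      if (atomic_load(&A) == 0) puts("P2 saw A == 0");
      return arg;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, 0, p1, 0);
      pthread_create(&t2, 0, p2, 0);
      pthread_join(t1, 0);
      pthread_join(t2, 0);
      return 0;     /* under sequential consistency, at most one message appears */
  }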

CA-Lec10 [email protected] 36
Memory Consistency Model
• Schemes faster execution to sequential consistency
• Not an issue for most programs; they are synchronized
– A program is synchronized if all access to shared data are ordered by synchronization
operations
write (x)
...
release (s) {unlock}
...
acquire (s) {lock}
...
read(x)
• Only those programs willing to be nondeterministic are not synchronized: “data
race”: outcome f(proc. speed)
• Several Relaxed Models for Memory Consistency since most programs are
synchronized; characterized by their attitude towards: RAR, WAR, RAW, WAW
to different addresses
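The write(x) ... release(s) ... acquire(s) ... read(x) pattern above maps directly onto C11
release/acquire atomics; a minimal sketch of my own:

  #include <stdatomic.h>

  int x;                          /* ordinary shared data */
  atomic_int s = 0;               /* synchronization flag */

  void producer(void) {
      x = 42;                                              /* write (x)   */
      atomic_store_explicit(&s, 1, memory_order_release);  /* release (s) */
  }

  int consumer(void) {
      while (atomic_load_explicit(&s, memory_order_acquire) == 0)
          ;                                                /* acquire (s) */
      return x;                                            /* read (x): guaranteed to see 42 */
  }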

CA-Lec10 [email protected] 37
Relaxed Consistency Models: The
Basics
• Key idea: allow reads and writes to complete out of order, but to use synchronization
operations to enforce ordering, so that a synchronized program behaves as if the processor
were sequentially consistent
– By relaxing orderings, may obtain performance advantages
– Also specifies range of legal compiler optimizations on shared data
– Unless synchronization points are clearly defined and programs are synchronized, compiler could not
interchange read and write of 2 shared data items because might affect the semantics of the
program
• 3 major sets of relaxed orderings:
1. W→R ordering (all writes completed before next read)
• Because retains ordering among writes, many programs that operate under sequential
consistency operate under this model, without additional synchronization. Called
processor consistency
2. W → W ordering (all writes completed before next write)
3. R → W and R → R orderings: a variety of models depending on ordering restrictions and how
synchronization operations enforce ordering
• Many complexities in relaxed consistency models; defining precisely what it means for a write
to complete; deciding when processors can see values that it has written

CA-Lec10 [email protected] 38
Outline
• Review
• Directory‐based protocols and examples
• Synchronization
• Relaxed Consistency Models
• Conclusion
• T1 (“Niagara”) Multiprocessor

CA-Lec10 [email protected] 39
T1 (“Niagara”)
• Target: Commercial server applications
– High thread level parallelism (TLP)
• Large numbers of parallel client requests
– Low instruction level parallelism (ILP)
• High cache miss rates
• Many unpredictable branches
• Frequent load‐load dependencies
• Power, cooling, and space are major concerns for data
centers
• Metric: Performance/Watt/Sq. Ft.
• Approach: Multicore, Fine‐grain multithreading, Simple
pipeline, Small L1 caches, Shared L2

CA-Lec10 [email protected] 40
T1 Architecture
• Also ships with 6 or 4 processors

CA-Lec10 [email protected] 41
T1 Fine‐Grained Multithreading
• Each core supports four threads and has its own level one caches (16KB
for instructions and 8 KB for data)
• Switching to a new thread on each clock cycle
• Idle threads are bypassed in the scheduling
– Waiting due to a pipeline delay or cache miss
– Processor is idle only when all 4 threads are idle or stalled
• Both loads and branches incur a 3 cycle delay that can only be hidden by
other threads
• A single set of floating point functional units is shared by all 8 cores
– floating point performance was not a focus for T1

CA-Lec10 [email protected] 42
Memory, Clock, Power
• 16 KB 4 way set assoc. I$/ core
• 8 KB 4 way set assoc. D$/ core
• 3MB 12 way set assoc. L2 $ shared
– 4 x 750KB independent banks
– crossbar switch to connect
– 2 cycle throughput, 8 cycle latency
– Direct link to DRAM & Jbus
– Manages cache coherence for the 8 cores
– CAM based directory
• Coherency is enforced among the L1 caches by a directory associated with each L2 cache
block
• Used to track which L1 caches have copies of an L2 block
• By associating each L2 with a particular memory bank and enforcing the subset property, T1
can place the directory at L2 rather than at the memory, which reduces the directory
overhead
• L1 data cache is write‐through, only invalidation messages are required; the data can always
be retrieved from the L2 cache
• 1.2 GHz at 72W typical, 79W peak power consumption

CA-Lec10 [email protected] 43
Miss Rates: L2 Cache Size, Block Size

(Figure: T1 L2 miss rate for TPC‐C and SPECJBB, for L2 cache sizes of 1.5, 3, and 6 MB
and block sizes of 32 B and 64 B)
CA-Lec10 [email protected] 44
Miss Latency: L2 Cache Size, Block Size
(Figure: T1 L2 miss latency in cycles for TPC‐C and SPECJBB, for L2 cache sizes of 1.5, 3,
and 6 MB and block sizes of 32 B and 64 B)

CA-Lec10 [email protected] 45
CPI Breakdown of Performance

Benchmark    Per-thread CPI   Per-core CPI   Effective CPI for 8 cores   Effective IPC for 8 cores
TPC-C        7.20             1.80           0.23                        4.4
SPECJBB      5.60             1.40           0.18                        5.7
SPECWeb99    6.60             1.65           0.21                        4.8

CA-Lec10 [email protected] 09-46


Not Ready Breakdown
(Figure: fraction of cycles a thread is not ready, broken into L1 I miss, L1 D miss, L2 miss,
pipeline delay, and other, for TPC‐C, SPECJBB, and SPECWeb99)

• TPC‐C ‐ store buffer full is largest contributor


• SPEC‐JBB ‐ atomic instructions are largest contributor
• SPECWeb99 ‐ both factors contribute

CA-Lec10 [email protected] 09-47


Performance Relative to Pentium D
(Figure: performance of the Power5+, Opteron, and Sun T1 relative to the Pentium D on
SPECIntRate, SPECFPRate, SPECJBB05, SPECWeb05, and TPC‐like workloads)
CA-Lec10 [email protected] 48
Efficiency normalized to Pentium D
(Figure: efficiency of the Power5+, Opteron, and Sun T1 normalized to the Pentium D,
measured as performance/mm^2 and performance/Watt for SPECIntRate, SPECFPRate,
SPECJBB05, and TPC‐C)

CA-Lec10 [email protected] 49
Niagara 2
• Improve performance by increasing threads supported per chip from 32 to
64
– 8 cores * 8 threads per core
• Floating‐point unit for each core, not for each chip
• Hardware support for the encryption standards AES, 3DES, and elliptic‐curve
cryptography
• Niagara 2 will add a number of 8x PCI Express interfaces directly into the
chip in addition to integrated 10Gigabit Ethernet XAU interfaces and
Gigabit Ethernet ports.
• Integrated memory controllers will shift support from DDR2 to FB‐DIMMs
and double the maximum amount of system memory.

CA-Lec10 [email protected] 50
