CS6461 Computer Architecture
Fall 2016
Adapted from Professor Stephen H. Kaisler's Slides
Lecture 9 Vector Operations
(Partially based on notes from David Patterson, UC Berkeley)
Anyone can build a fast CPU. The trick is
to build a fast computer.
- Seymour Cray -
Improving Performance
Many scientific programs compute on collections of
like numbers, either integer or floating point (e.g.,
vectors)
Performance can be improved if we structure hardware
to efficiently deal with such collections
Vector processors have high-level operations that work
on linear arrays of numbers, i.e., vectors
Vector instructions access memory with a known pattern
No data caches required
Single vector instruction implies a lot of work
CSCI 6461 Computer Architecture 2
Conventional Computer
Initialize I = 1
20 Read B(I)
Read C(I)
Store A(I) = B(I) + C(I)
Increment I = I + 1
If I <= 100 Go to 20
(1) B(I) is fetched from memory.
(2) C(I) is fetched from memory.
(3) A scalar add instruction operates on B(I) and C(I).
(4) A(I) is stored back to memory.
Steps (1) to (4) are repeated 100 times.
General Purpose Computer
General purpose computer: A(i) = B(i) * C(i), i = 1, ..., N
Cycle:                1          2          3          4      5      6          ...  N*5
Separate mant./exp.   B(1),C(1)                                      B(2),C(2)  ...
Multiply mantissa                B(1),C(1)                                      ...
Add exponents                               B(1),C(1)                           ...
Normalize result                                       A(1)                     ...
Put sign                                                      A(1)              ...  A(N)
Vector Computer
A(1:100) = B(1:100) + C(1:100)
Fetch vectors of values B(I) and C(I) from memory into vector registers
Use a vector integer add instruction to operate on the B(I), C(I) pairs
A stream of A(I) values is stored back to memory, one value
every clock cycle
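The contrast between the two machines can be sketched in Python. This is only an illustration: the function names `scalar_add` and `vector_add` are made up here, and Python lists stand in for memory and vector registers.

```python
def scalar_add(b, c):
    """Conventional machine: one fetch-add-store sequence per element."""
    a = [0] * len(b)
    i = 0
    while i < len(b):        # loop control runs on every iteration
        a[i] = b[i] + c[i]   # read B(I), read C(I), add, store A(I)
        i += 1
    return a


def vector_add(b, c):
    """Vector machine: the whole array operation is stated as one step."""
    return [x + y for x, y in zip(b, c)]


b = list(range(1, 101))
c = [10 * x for x in b]
print(scalar_add(b, c) == vector_add(b, c))
```

Both functions compute the same A; the difference the slides emphasize is how many instructions are fetched and how regularly memory is accessed.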
Vector Computer
Vector pipeline (5 subunits / segments): A = B * C
Cycle:                1          2          3          4          5          6          ...  N+4
Separate mant./exp.   B(1),C(1)  B(2),C(2)  B(3),C(3)  B(4),C(4)  B(5),C(5)  B(6),C(6)  ...
Multiply mantissa                B(1),C(1)  B(2),C(2)  B(3),C(3)  B(4),C(4)  B(5),C(5)  ...
Add exponents                               B(1),C(1)  B(2),C(2)  B(3),C(3)  B(4),C(4)  ...
Normalize result                                       A(1)       A(2)       A(3)       ...
Put sign                                                          A(1)       A(2)       ...  A(N)
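The pipeline timing can be reproduced with a few lines of Python. This is a sketch under the assumption that every stage takes exactly one cycle and the pipe never stalls; `pipeline_schedule` and `STAGES` are names invented for this illustration.

```python
STAGES = ["separate mant./exp.", "multiply mantissa", "add exponents",
          "normalize result", "put sign"]


def pipeline_schedule(n, stages=STAGES):
    """Return {cycle: {stage_name: element}} for n elements.

    Element e (1-based) occupies stage s (0-based) during cycle e + s.
    """
    schedule = {}
    for element in range(1, n + 1):
        for s, name in enumerate(stages):
            schedule.setdefault(element + s, {})[name] = element
    return schedule


sched = pipeline_schedule(6)
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

The last cycle in the schedule is N + 4 for five stages, which matches the cycle count in the diagram.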
Basic Ideas
Vector registers: Each vector register is a fixed-
length bank holding a single vector.
Scalar registers: the normal general-purpose registers and
floating-point registers.
They can provide data as input to the vector functional
units, as well as compute addresses.
Vector functional units: Fully pipelined and can start
a new operation on every clock cycle.
Vector load-store unit: loads or stores a vector to or
from memory.
Vector Length Control: A vector processor has a natural vector
length determined by the number of elements in each vector register.
Two Types of Vector Processors
Vector-Register Processors:
All vector operations (except load and store) occur in the
vector registers.
Vector counterpart of a load-store architecture
All major vector computers (Cray machines, NEC SX/2 through
SX/5, Fujitsu VP200, etc.)
Memory-Memory Processors:
All vector operations are memory to memory.
Examples: CDC Cyber 203, CDC Cyber 205, TI ASC
All are obsolete!
Properties of Vector Processors
Vector instructions access memory with known pattern
Highly interleaved memory
Amortize memory latency over multiple elements
No (data) caches required! (Do use instruction cache)
Single vector instruction implies lots of work (equivalent to a loop)
=> fewer instruction fetches
[Figure: vector processor block diagram. Memory and I/O connect to a control unit (CU), a scalar unit (SU, a RISC processor), and a vector unit with mask registers, vector registers, LOAD and STORE units, and the vector pipelines MASK, ADD, MULT, and DIV.]
Basic Vector-Register Processor Architecture
[Figure: main memory connects through the vector load-store unit to the vector registers and scalar registers, which feed the functional units: FP add/subtract, FP multiply, FP divide, integer, logical]
8 64-element vector registers
5 functional units; each unit is fully pipelined and can start a new operation on every clock cycle
Load/store unit: fully pipelined
Scalar registers
What's in a Vector Processor
A scalar processor
Scalar register file
Scalar functional units (arithmetic, load/store, etc)
A vector register file (a 2D register array)
Each register is an array of elements, e.g. 32 registers with 32 64-bit
elements per register
MVL = maximum vector length = max # of elements per register
A set of pipelined vector functional units: Integer, FP, load/store, etc
Sometimes vector and scalar units are combined (share ALUs)
Three types of addressing
Unit stride
Contiguous block of information in memory
Fastest: always possible to optimize this
Non-unit (constant) stride
Harder to optimize memory system for all possible strides
Prime number of data banks makes it easier to support different strides at full
bandwidth
Indexed (gather-scatter)
Vector equivalent of register indirect
Good for sparse arrays of data
Increases number of programs that vectorize
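The three addressing types can be sketched against a flat memory array. The helper names below are hypothetical, and addresses and strides are counted in elements rather than bytes to keep the example short.

```python
def load_unit_stride(mem, base, vl):
    """Unit stride: vl consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]


def load_strided(mem, base, stride, vl):
    """Constant stride: every stride-th element starting at base."""
    return [mem[base + i * stride] for i in range(vl)]


def load_indexed(mem, base, index):
    """Indexed load (gather): element offsets come from an index vector."""
    return [mem[base + off] for off in index]


def store_indexed(mem, base, index, values):
    """Indexed store (scatter): write values back through the index vector."""
    for off, v in zip(index, values):
        mem[base + off] = v


mem = list(range(100, 132))
print(load_unit_stride(mem, 4, 4))
print(load_strided(mem, 0, 8, 4))
print(load_indexed(mem, 2, [0, 5, 1]))
```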
How a Vector Pipeline Works
Consider the steps involved in a floating-point addition on a
vector machine with IEEE Arithmetic hardware
The exponents of the two floating-point numbers to be added are
compared to find the number with the smaller magnitude.
The significand of the number with the smaller magnitude is
shifted so that the exponents of the two numbers agree.
The significands are added.
The result of the addition is normalized.
Checks are made to see if any floating-point exceptions occurred
during the addition, such as overflow.
Rounding occurs.
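The steps above can be traced on a toy floating-point format. The sketch below is not IEEE arithmetic: it assumes positive operands only, an 8-bit significand, 3 guard bits, a maximum exponent of 127, and round-half-up. All names and constants are made up for illustration.

```python
P = 8        # significand bits in this toy format (assumption)
GUARD = 3    # extra low-order bits kept during alignment (assumption)
EMAX = 127   # largest allowed exponent (assumption)


def fp_add(a, b):
    """Add two positive toy floats given as (significand, exponent) pairs.

    A pair stands for significand * 2**exponent, with the significand
    normalized to exactly P bits.
    """
    (sa, ea), (sb, eb) = a, b
    # 1. Compare exponents; make (sa, ea) the operand with the larger one.
    if ea < eb:
        (sa, ea), (sb, eb) = (sb, eb), (sa, ea)
    # 2. Shift the smaller operand's significand right so exponents agree.
    sa <<= GUARD
    sb = (sb << GUARD) >> (ea - eb)
    # 3. Add the significands.
    total, exp = sa + sb, ea
    # 4. Normalize: the sum may have grown by one bit.
    if total >= 1 << (P + GUARD):
        total >>= 1
        exp += 1
    # 5. Check for exceptions (only overflow in this sketch).
    if exp > EMAX:
        raise OverflowError("exponent out of range")
    # 6. Round to nearest using the guard bits (ties round up here).
    sig = (total + (1 << (GUARD - 1))) >> GUARD
    if sig == 1 << P:        # rounding carried out of the top bit
        sig >>= 1
        exp += 1
    return sig, exp


def value(x):
    """Convert a (significand, exponent) pair to a Python float."""
    return x[0] * 2.0 ** x[1]


print(value(fp_add((128, -7), (192, -7))))   # 1.0 + 1.5
```

In a vector pipeline each of these steps is a separate stage, so a new pair of operands can enter step 1 while earlier pairs are still in steps 2 through 6.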
Cray-1 Vector Computer
Cray Processors
[Photos, from bottom left: Cray-1, Cray X-MP, Cray-2, Cray T916]
Cray Research built aesthetically pleasing supercomputers.
For over two decades they were the fastest machines on earth.
Vector Instructions
Instruction  Operands   Operation               Comment
VADD.VV      V1,V2,V3   V1=V2+V3                vector + vector
VADD.SV      V1,R0,V2   V1=R0+V2                scalar + vector
VMUL.VV      V1,V2,V3   V1=V2*V3                vector x vector
VMUL.SV      V1,R0,V2   V1=R0*V2                scalar x vector
VLD          V1,R1      V1=M[R1...R1+63]        load, stride=1
VLDS         V1,R1,R2   V1=M[R1...R1+63*R2]     load, stride=R2
VLDX         V1,R1,V2   V1=M[R1+V2i,i=0..63]    indexed ("gather")
VST          V1,R1      M[R1...R1+63]=V1        store, stride=1
VSTS         V1,R1,R2   M[R1...R1+63*R2]=V1     store, stride=R2
VSTX         V1,R1,V2   M[R1+V2i,i=0..63]=V1    indexed ("scatter")
SAXPY: A Common Equation
SAXPY: S = aX + Y
X, Y are vectors (of same length); a is a scalar.
One of the most common vector operations found in all arithmetic systems.
All transformations in linear algebra can be expressed in this basic triad.
32 element SAXPY: scalar
LD F0, a
ADDI R4, Rx, #256 #end address: 32 elements x 8 bytes
Loop:
LD F2, 0(Rx)
MUL.D F2, F0, F2
LD F4, 0(Ry)
ADD.D F4, F2, F4
SD F4, 0(Ry)
ADDI Rx, Rx, 8
ADDI Ry, Ry, 8
SUB R20, R4, Rx
BNZ R20, Loop
Now, 32 element SAXPY: vector
LD F0,a #load a
VLD V1,Rx #load X[0:31]
VMULD.SV V2,F0,V1 #vector mult
VLD V3,Ry #load Y[0:31]
VADDD.VV V4,V2,V3 #vector add
VST V4,Ry #store Y[0:31]
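The two instruction sequences can be mirrored in Python to check that they compute the same result. This is a sketch only: lists stand in for vector registers, the function names are invented, and the `mvl=64` default is an assumed register length.

```python
def saxpy_scalar(a, x, y):
    """Scalar version: one multiply-add per loop iteration."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y


def saxpy_vector(a, x, y, mvl=64):
    """Vector version: each statement models one vector instruction."""
    assert len(x) == len(y) <= mvl          # must fit in one vector register
    v1 = list(x)                            # VLD: load X
    v2 = [a * e for e in v1]                # VMULD.SV: scalar times vector
    v3 = list(y)                            # VLD: load Y
    v4 = [p + q for p, q in zip(v2, v3)]    # VADDD.VV: vector add
    y[:] = v4                               # VST: store result into Y
    return y


x = [float(i) for i in range(32)]
print(saxpy_scalar(2.0, x, [1.0] * 32) == saxpy_vector(2.0, x, [1.0] * 32))
```

The scalar loop executes its body 32 times, while the vector version is six instructions regardless of the vector length (up to the register length).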
Terminology
Vector Start-up Time: A measure of the latency in starting up
the vector pipeline.
The number of clock cycles required prior to the generation of the
first result.
The start-up time adds considerable overhead for small
values of N.
The effect of start-up time is negligible for large values of N.
To maintain an initiation rate of one word fetched or stored per
clock, the memory must be able to meet this rate.
Usually done by interleaving memory in banks.
Issues
What to do when the application vector length is not exactly
maximum vector length (MVL)?
Vector-length (VL) register controls the length of any vector
operation, including a vector load or store
Set it before performing any vector operation
VADD.VV with VL=10 is equivalent to
for (i=0; i<10; i++)
V1[i] = V2[i]+V3[i]
VL can be anything from 0 to MVL
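A small model of the VL register, assuming MVL = 64 and three registers named V1 to V3. The class and method names are hypothetical and only illustrate that elements at or beyond VL are left untouched.

```python
MVL = 64  # maximum vector length (assumed for this sketch)


class VectorUnit:
    """Toy vector register file with a vector-length (VL) register."""

    def __init__(self):
        self.vl = MVL
        self.regs = {name: [0] * MVL for name in ("V1", "V2", "V3")}

    def set_vl(self, n):
        """Set VL; it may be anything from 0 to MVL."""
        if not 0 <= n <= MVL:
            raise ValueError("VL must be between 0 and MVL")
        self.vl = n

    def vadd_vv(self, dst, src1, src2):
        """VADD.VV: only the first VL elements are processed."""
        for i in range(self.vl):
            self.regs[dst][i] = self.regs[src1][i] + self.regs[src2][i]


vu = VectorUnit()
vu.regs["V2"] = list(range(MVL))
vu.regs["V3"] = [1] * MVL
vu.set_vl(10)
vu.vadd_vv("V1", "V2", "V3")
print(vu.regs["V1"][:12])
```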
Issues
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in the registers
(strip mining)
The vector length modulo MVL is generally not 0
So, do the short piece first, then do the rest in pieces of length MVL
Ex: Suppose MVL = 64 and the vector has 264 elements:
264 mod 64 = 8.
So, process one piece of length 8, then four pieces of length
64.
Problem: All computations have some scalar
components, i.e., non-vectorizable parts
Solution: Separate scalar from vector computations
(by hand, but maybe automatically)
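Strip mining can be sketched as follows, using the 264-element example with 64-element registers. The function names are invented for this illustration, and the list slices stand in for vector loads and stores with VL set to the piece length.

```python
def strip_pieces(n, mvl=64):
    """Return (start, length) pieces: the odd-sized piece first, then full ones."""
    pieces = []
    start = 0
    first = n % mvl
    if first:
        pieces.append((0, first))
        start = first
    while start < n:
        pieces.append((start, mvl))
        start += mvl
    return pieces


def stripmined_add(b, c, mvl=64):
    """A = B + C for vectors of any length, one register-sized piece at a time."""
    a = [0] * len(b)
    for start, vl in strip_pieces(len(b), mvl):
        end = start + vl   # VL is set to vl for this piece
        a[start:end] = [x + y for x, y in zip(b[start:end], c[start:end])]
    return a


print(strip_pieces(264))
```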
Ex: Vector Code
Note: Fast processing rates do not always translate directly into
fast processing of loops.
Assessing Performance
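Pipeline length p: number of stages (segments) in the pipeline
Vector length N: number of elements processed
One result per cycle (if pipe is full)
Speed-up:
Serial computation: N * p cycles
Vector computation: N + p - 1 cycles
Speed-up: S = (N * p) / (N + p - 1)
For N >> p: S ~ p
Problems:
N ~ p: short vectors give little speed-up (for N = p, S = p^2 / (2p - 1), roughly p / 2)
Recursive references cannot be pipelined this way, e.g., A(i) = A(i-1) + C(i)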
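The speed-up formula is easy to evaluate for a few vector lengths. The helper names are made up, and the model assumes one cycle per stage with no stalls.

```python
def serial_cycles(n, p):
    """Each of n elements passes through all p stages before the next starts."""
    return n * p


def vector_cycles(n, p):
    """p cycles to fill the pipe, then one result per cycle."""
    return n + p - 1


def speedup(n, p):
    return serial_cycles(n, p) / vector_cycles(n, p)


for n in (1, 5, 64, 1000):
    print(n, round(speedup(n, 5), 2))
```

With p = 5 the speed-up is exactly 1 for a single element and approaches 5 only as N grows, which is the start-up overhead seen from another angle.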
Characteristics of Vectorizable Code - I
Vectorization can only be done within a DO/FOR
loop; it must be the innermost loop.
It is crucial to ensure that there are sufficient
iterations in the DO loop to offset the start-up time
overhead.
Put as much work as possible into a vectorizable
statement to provide more opportunities for
concurrent operations.
There is a limit to vectorization because a compiler
may not vectorize the code if it is too complicated.
Exercise: How do you vectorize a WHILE loop??
Characteristics of Vectorizable Code - II
The existence of certain operations in the DO loop may
prevent the compiler from converting all or part of
the DO loop for vector processing:
vectorization inhibitors include subroutine calls, recursion,
references to external functions, and any input/output statements
(which are actually system calls)
These types of vector inhibitors can be removed by:
expanding the function
in-lining subroutines at the point of reference.
Vector Code Example
Vector Processing Example:
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i = 0; i < m; i++)
{
    for (j = 0; j < n; j++)
    {
        sum = 0;
        for (t = 0; t < k; t++)
        {
            /* Loop-carried dependency: each iteration needs the previous sum. */
            sum = sum + a[i][t] * b[t][j];
        }
        c[i][j] = sum;
    }
}
Optimized Vector Code
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i = 0; i < m; i++)
{
    for (j = 0; j < n; j += 32) /* Step j by 32 at a time. */
    {
        sum[0:31] = 0; /* Initialize a vector register to zeros. */
        for (t = 0; t < k; t++)
        {
            a_scalar = a[i][t];
            b_vector[0:31] = b[t][j:j+31];
            /* Do a vector-scalar multiply. */
            prod[0:31] = b_vector[0:31] * a_scalar;
            /* Vector-vector add into results. */
            sum[0:31] += prod[0:31];
        }
        /* Unit-stride store of vector of results. */
        c[i][j:j+31] = sum[0:31];
    }
}
Note: It's actually better to interchange the i and j loops, so that you
only change vector length once during the whole matrix multiply.
Vector Stride
Suppose adjacent elements of the vector are not sequential in
memory
do 10 i = 1,100
do 10 j = 1,100
A(i,j) = 0.0
do 10 k = 1,100
10 A(i,j) = A(i,j)+B(i,k)*C(k,j)
Depending on the storage order, accesses to either B or C are not
adjacent in memory (800 bytes between consecutive elements)
Stride: distance separating elements that are to be merged into
a single vector (caches do unit stride)
=> LVWS (load vector with stride) instruction
Strides can cause bank conflicts
(e.g., stride = 32 and 16 banks)
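The bank-conflict example can be checked with a short calculation. The sketch assumes word-interleaved memory (word address modulo the number of banks selects the bank) and a stride measured in words; the function names are invented.

```python
from math import gcd


def bank_sequence(stride, nbanks, count):
    """Banks hit by the first count elements of a strided access."""
    return [(i * stride) % nbanks for i in range(count)]


def banks_touched(stride, nbanks):
    """Number of distinct banks a long strided access cycles through."""
    return nbanks // gcd(stride, nbanks)


print(bank_sequence(32, 16, 8), banks_touched(32, 16))
print(bank_sequence(32, 17, 8), banks_touched(32, 17))
```

With stride 32 and 16 banks every access lands in the same bank, so accesses are serialized. With 17 banks (a prime number) the same stride spreads across all banks.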
Vector Chaining
Suppose:
MULV V1,V2,V3
ADDV V4,V1,V5
Chaining: the vector register (V1) is treated not as a single entity
but as a group of individual registers, so pipeline
forwarding can work on individual elements of a
vector
Flexible chaining: allow a vector to chain to any other
active vector operation, i.e., pass the result from one
vector operation to another vector operation
=> more read/write ports
Given enough hardware, chaining increases convoy size
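The benefit of chaining the MULV/ADDV pair can be estimated with a simplified timing model. It assumes each vector instruction takes its start-up latency plus one cycle per element, and that without chaining the dependent add cannot start until the multiply has finished. The start-up values 7 and 6 are illustrative assumptions, not figures from these slides.

```python
def unchained_cycles(n, startup_mul, startup_add):
    """ADDV waits for the whole MULV result before it starts."""
    return (startup_mul + n) + (startup_add + n)


def chained_cycles(n, startup_mul, startup_add):
    """ADDV consumes each element of V1 as soon as MULV produces it."""
    return startup_mul + startup_add + n


n = 64
print(unchained_cycles(n, 7, 6), chained_cycles(n, 7, 6))
```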
Vector Register Bypassing
Vector Conditional Execution
Two Approaches
Vectors w/ Sparse Matrices
Suppose:
do 100 i = 1,n
100 A(K(i)) = A(K(i)) + C(M(i))
gather (LVI) operation takes an index vector and fetches data
from each address in the index vector
This produces a dense vector in the vector registers
After these elements are operated on in dense form, the sparse
vector can be stored in expanded form by a scatter store
(SVI), using the same index vector
A compiler can't figure this out automatically, since it can't know
that the elements of K are distinct (no dependencies)
Use CVI (create vector index) to create the index vector 0, 1xm, 2xm, ..., 63xm
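The sparse update can be written both as the original loop and as gather, dense add, scatter. The sketch uses 0-based Python indices and invented function names. It also shows why distinct index values matter: with a repeated index the two versions give different results.

```python
def sparse_update_loop(a, c, k, m):
    """Sequential semantics of: A(K(i)) = A(K(i)) + C(M(i))."""
    for ki, mi in zip(k, m):
        a[ki] = a[ki] + c[mi]
    return a


def sparse_update_vector(a, c, k, m):
    """Gather, operate in dense form, then scatter with the same index vector."""
    va = [a[i] for i in k]                   # gather A(K(i))
    vc = [c[i] for i in m]                   # gather C(M(i))
    vs = [x + y for x, y in zip(va, vc)]     # dense vector add
    for i, v in zip(k, vs):                  # scatter back through K
        a[i] = v
    return a


# Distinct indices: both versions agree.
print(sparse_update_loop([10, 20, 30, 40], [1, 2, 3], [3, 0], [2, 1]))
print(sparse_update_vector([10, 20, 30, 40], [1, 2, 3], [3, 0], [2, 1]))
# Repeated index in K: the vector version loses one of the updates.
print(sparse_update_loop([0, 0, 0], [1, 1], [1, 1], [0, 1]))
print(sparse_update_vector([0, 0, 0], [1, 1], [1, 1], [0, 1]))
```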
Gather Example
Vector Issues
Pitfall: Concentrating on peak performance and ignoring
start-up overhead:
Nv (vector length at which vector mode becomes faster than scalar mode) can exceed 100!
Pitfall: Increasing vector performance, without
comparable increases in scalar performance (Amdahl's
Law)
e.g., the problems of Cray competitor ETA
Pitfall: Good processor vector performance without
providing good memory bandwidth
MMX?
Some Previous Vector Processors
Vector Memory-Memory vs Register Machines
Vector memory-memory instructions hold all vector operands
in main memory
The first vector machines, CDC Star-100 ('73) and TI ASC
('71), were memory-memory machines
Cray-1 ('76) was the first vector register machine
Vector Memory-Memory vs Register Machines
Vector memory-memory architectures (VMMAs) require greater
main memory bandwidth. Why?
All operands must be read from and written back to memory
VMMAs make it difficult to overlap execution of multiple vector
operations. Why?
Must check dependencies on memory addresses
VMMAs incur greater startup latency
Scalar code was faster on CDC Star-100 for vectors < 100
elements
For Cray-1, vector/scalar breakeven point was around 2
elements
Apart from CDC follow-ons (Cyber-205, ETA-10) all major
vector machines since Cray-1 have had vector register
architectures
The Cell Processor
Observed clock speed: > 4 GHz
Peak performance (single precision): > 256 GFlops
Peak performance (double precision): >26 GFlops
Local storage size per SPU: 256KB
Total number of transistors: 234M
The Cell Processor
Sony PlayStation 3
Partnership between Sony,
Toshiba, and IBM
PowerPC-based main core (PPE)
Multiple SPEs
On-die memory controller
Inter-core transport bus
High-speed I/O
Clocked at 3-4 GHz
256 GFLOPS single precision at
4 GHz
Offloads a large amount of work
onto the compiler / software.
Cell Processor Die Layout
Power Processing Element (PPE)
PowerPC instruction set with AltiVec VMX instructions
Slow, but power-efficient
Used for general purpose computing and controlling
SPEs
Simultaneous Multithreading
Separate 32 KB L1 Caches for instructions and data
Unified 512 KB L2 Cache
Two issue in-order instruction fetch
Conspicuous lack of instruction window
PPEs and SPEs use different instruction sets.
Synergistic Processing Element (SPE)
SPEs are vector processors:
Not efficient for general-purpose
computation.
Meant to be used in parallel
(7 on PS3 implementation)
Instructions based on VMX
In-order execution w/ dual issue
Modified for 128 registers
Instructions assumed to be 4x 32 bits
128 registers (each 128 bits wide)
Vector logic
8 single precision operations per cycle
Significant performance hit for double
precision
SPE Local Storage
On chip local storage (256KB)
NOT a cache
Completely private to each SPE
Directly addressable by software
Software controlled DMA to and from main memory
Request queue handles 16 simultaneous requests
Up to 16 KB transfer each
Priority: DMA, L/S, Fetch
Fetch / execute parallelism
SPE Control Logic/Pipeline
Little ILP, and thus little control
logic => faster execution
No hardware branch prediction
Software branch prediction
Loop unrolling
18 cycle penalty
Simple commit unit
no reorder buffer or other
complexities
Same execution unit for FP/int
Instruction Scheduling a HUGE
problem
Done primarily in software
IBM predicted 80-90% usage
ideally
Modern Vector Supercomputer
65nm CMOS technology
Vector unit (3.2 GHz)
8 foreground VRegs + 64 background
VRegs (256x64-bit elements/VReg)
64-bit functional units: 2 multiply, 2 add, 1
divide/sqrt, 1 logical, 1 mask unit
8 lanes (32+ FLOPS/cycle, 100+
GFLOPS peak per CPU)
1 load or store unit (8 x 8-byte
accesses/cycle)
Scalar unit (1.6 GHz)
4-way superscalar with out-of-order and
speculative execution
64KB I-cache and 64KB data cache
Memory system provides 256GB/s DRAM bandwidth per CPU
Up to 16 CPUs and up to 1TB DRAM form shared-memory node
total of 4TB/s bandwidth to shared DRAM memory
Up to 512 nodes connected via 128GB/s network links (message passing
between nodes)
Vector Advantages
Easy to get high performance: N operations
are independent
use same functional unit
access disjoint registers
access registers in same order as previous instructions
access contiguous memory words or known pattern
can exploit large memory bandwidth
hide memory latency (and any other latency)
Scalable: (get higher performance by adding HW resources)
Compact: Describe N operations with 1 short instruction
Predictable: performance vs. statistical performance (cache)
Multimedia ready: N * 64b, 2N * 32b, 4N * 16b, 8N * 8b
Mature, developed compiler technology
Vector Disadvantages
Vector Disadvantage: Out of Fashion?
Hard to say. Many irregular loop structures seem to still
be hard to vectorize automatically.
Not as fast with scalar instructions
Complexity of the multi-ported Vector Register File
Difficulties implementing precise exceptions
High price of on-chip vector memory systems
Increased code complexity
The Last (Vector) Samurais