Onur 447 Spring15 Lecture14 Simd Afterlecture
Computer Architecture
Lecture 14: SIMD Processing
(Vector and Array Processors)
Pipelining
Out-of-Order Execution
Reminder: Announcements
Lab 3 due this Friday (Feb 20)
Pipelined MIPS
Competition for high performance
You can optimize both cycle time and CPI
Document and clearly describe what you do during check-off
Readings for Today
Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008.
Recap of Last Lecture
OoO Execution as Restricted Data Flow
Memory Disambiguation or Unknown Address Problem
Memory Dependence Handling
Conservative, Aggressive, Intelligent Approaches
Load Store Queues
Design Choices in an OoO Processor
Combining OoO+Superscalar+Branch Prediction
Example OoO Processor Designs
Reminder: Intel Pentium 4 Simplified
Mutlu+, "Runahead Execution," HPCA 2003.
Reminder: Alpha 21264
Data Flow: Disadvantages
Debugging is difficult (no precise state)
Interrupt/exception handling is difficult (what is the precise-state semantics?)
Implementing dynamic data structures is difficult in pure data flow models
Too much parallelism? (Parallelism control needed)
High bookkeeping overhead (tag matching, data storage)
Instruction cycle is inefficient (delay between dependent instructions); memory locality is not exploited
Review: Combining Data Flow and Control Flow
Can we get the best of both worlds?
Two possibilities
Review: Data Flow Summary
Data Flow at the ISA level has not been (as) successful
Approaches to (Instruction-Level) Concurrency
Pipelining
Out-of-order execution
Dataflow (at the ISA level)
SIMD Processing (Vector and array processors, GPUs)
VLIW
Decoupled Access Execute
Systolic Arrays
SIMD Processing:
Exploiting Regular (Data) Parallelism
Flynn’s Taxonomy of Computers
Mike Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, 1966
Time-space duality
Array vs. Vector Processors
[Figure: an array processor performs the same operation on all elements at the same time, across space; a vector processor performs it on one element per cycle, across time]
SIMD Array Processing vs. VLIW
VLIW: Multiple independent operations packed together by the compiler
SIMD Array Processing vs. VLIW
Array processor: Single operation on multiple (different) data elements
Vector Processors
A vector is a one-dimensional array of numbers
Many scientific/commercial programs use vectors
for (i = 0; i<=49; i++)
C[i] = (A[i] + B[i]) / 2
Vector Processors (II)
A vector instruction performs an operation on each element in consecutive cycles
Vector functional units are pipelined
Each pipeline stage operates on a different data element
Vector Processor Advantages
+ No dependencies within a vector
Pipelining, parallelization work well
Can have very deep pipelines, no dependencies!
Fisher, "Very Long Instruction Word architectures and the ELI-512," ISCA 1983.
Vector Processor Limitations
-- Memory (bandwidth) can easily become a bottleneck, especially if
1. compute/memory operation balance is not maintained
2. data is not mapped appropriately to memory banks
Vector Processing in More Depth
Vector Registers
Each vector data register holds N M-bit values
Vector control registers: VLEN, VSTR, VMASK
VLEN: number of elements to operate on; maximum VLEN can be N, the maximum number of elements stored in a vector register
VSTR (Vector Stride Register): distance between two consecutive elements of a vector in memory
VMASK (Vector Mask Register): indicates which elements of the vector to operate on
Vector Functional Units
Use a deep pipeline to execute element operations: fast clock cycle
[Figure: a pipelined functional unit computing V3 = V1 * V2; pairs of elements from V1 and V2 stream through the pipeline, one pair entering per cycle]
Loading/Storing Vectors from/to Memory
Requires loading/storing multiple elements
[Figure: CPU connected to the memory banks through an address bus and a data bus]
Picture credit: Derek Chiou
Vector Memory System
Next address = Previous address + Stride
If stride = 1, consecutive elements are interleaved across banks, and the number of banks >= bank latency, then we can sustain 1 element/cycle throughput
[Figure: base and stride registers feed an address generator that indexes 16 interleaved memory banks, 0 through F]
Picture credit: Krste Asanovic
Scalar Code Example
for i = 0 to 49
    C[i] = (A[i] + B[i]) / 2
Why 16 banks?
11-cycle memory access latency
Having 16 (> 11) banks ensures there are enough banks to overlap enough memory operations to cover the memory latency
Vectorizable Loops
A loop is vectorizable if each iteration is independent of any other
for i = 0 to 49
    C[i] = (A[i] + B[i]) / 2
Vectorized loop (7 dynamic instructions; each instruction followed by its latency):
MOVI VLEN = 50       1
MOVI VSTR = 1        1
VLD V0 = A           11 + VLEN - 1
VLD V1 = B           11 + VLEN - 1
VADD V2 = V0 + V1    4 + VLEN - 1
VSHFR V3 = V2 >> 1   1 + VLEN - 1
VST C = V3           11 + VLEN - 1
Basic Vector Code Performance
Assume no chaining (no vector data forwarding)
i.e., the output of a vector functional unit cannot be used as the direct input of another
The entire vector register needs to be ready before any element of it can be used as part of another operation
One memory port (one address generator)
16 memory banks (word-interleaved)
Execution timeline: 1 + 1 + (11 + 49) + (11 + 49) + (4 + 49) + (1 + 49) + (11 + 49) = 285 cycles
Vector Chaining
Vector chaining: data forwarding from one vector functional unit to another
LV v1
MULV v3, v1, v2
ADDV v5, v3, v4
[Figure: the load unit chains into the multiplier, which chains into the adder; each element is forwarded to the next functional unit as soon as it is produced, instead of waiting for the whole vector register V1-V5 to fill]
Vector Code Performance with Chaining
Strict assumption: each memory bank has a single port (memory bandwidth bottleneck)
The two VLDs cannot be pipelined. WHY? There is one memory port (one address generator), so the second load must wait for the first to complete.
With chaining and enough memory ports to overlap the two loads and the store:
79 cycles
19X perf. improvement!
Questions (I)
What if # data elements > # elements in a vector register?
Idea: Break loops so that each iteration operates on # elements in a vector register
E.g., 527 data elements, 64-element VREGs
8 iterations where VLEN = 64
1 iteration where VLEN = 15 (need to change value of VLEN)
Called vector stripmining
Gather/Scatter Operations
Gather/scatter operations are often implemented in hardware to handle sparse matrices
Vector loads and stores use an index vector which is added to the base register to generate the addresses
[Example table: an index vector, a data vector to store, and the resulting stored vector; each data element goes to memory address base + index]
Conditional Operations in a Loop
What if some operations should not be executed on a vector
(based on a dynamically-determined condition)?
loop: if (a[i] != 0) then b[i] = a[i] * b[i]
      goto loop
Masked Vector Instructions
Simple implementation: execute all N operations, and turn off result writeback according to the mask (a per-element write enable on the write data port)
Density-time implementation: scan the mask vector and execute only the elements with non-zero mask bits
Which one is better? Tradeoffs?
Slide credit: Krste Asanovic
Some Issues
Stride and banking
As long as the stride and the number of banks are relatively prime to each other and there are enough banks to cover the bank access latency, we can sustain 1 element/cycle throughput
Storage of a matrix
Row major: consecutive elements in a row are laid out consecutively in memory
Column major: consecutive elements in a column are laid out consecutively in memory
You need to change the stride when accessing a row versus a column
Minimizing Bank Conflicts
More banks
Array vs. Vector Processors, Revisited
The array vs. vector processor distinction is a "purist's" distinction
Remember: Array vs. Vector Processors
[Figure, repeated from earlier: the array processor performs the same operation on all elements at the same time, across space; the vector processor performs it on one element per cycle, across time]
Vector Instruction Execution
VADD A, B -> C
[Figure: with four execution pipelines (lanes), elements A[0..3] + B[0..3] issue together, then A[4..7] + B[4..7], and so on; the vector registers are partitioned across the lanes, so lane 0 holds elements 0, 4, 8, ..., lane 1 holds elements 1, 5, 9, ..., and each lane has its own connection to the memory subsystem]
Vectorization is a compile-time reordering of operation sequencing
Requires extensive loop dependence analysis
[Figure: the per-iteration scalar operations (load, load, add, store) are reordered into vector instructions, each performing one operation across all iterations]
Slide credit: Krste Asanovic
Vector/SIMD Processing Summary
Vector/SIMD machines are good at exploiting regular data-level parallelism
Same operation performed on many data elements
Improves performance and simplifies design (no intra-vector dependencies)
MMX Example: Image Overlaying (I)
Goal: Overlay the human in image 1 on top of the background in image 2
MMX Example: Image Overlaying (II)