Digital Design & Computer Arch.
Lecture 19: SIMD Processors
Prof. Onur Mutlu
ETH Zürich
Spring 2020
7 May 2020
We Are Almost Done With This…
◼ Single-cycle Microarchitectures
◼ Multi-cycle Microarchitectures
◼ Pipelining
◼ Out-of-Order Execution
3
Readings for this Week
◼ Required
❑ Lindholm et al., "NVIDIA Tesla: A Unified Graphics and
Computing Architecture," IEEE Micro 2008.
◼ Recommended
❑ Peleg and Weiser, “MMX Technology Extension to the Intel
Architecture,” IEEE Micro 1996.
4
Announcement
◼ Late submission of lab reports in Moodle
❑ Open until June 20, 2020, 11:59pm (cutoff date -- hard
deadline)
❑ You can submit any past lab report that you did not
submit before its deadline
❑ You are NOT allowed to re-submit anything (lab reports, extra
assignments, etc.) that you have already submitted via other
Moodle assignments
❑ We will grade your reports, but late submission carries a
penalty of 1 point; that is, the highest possible score per
lab report will be 2 points
5
Exploiting Data Parallelism:
SIMD Processors and GPUs
SIMD Processing:
Exploiting Regular (Data) Parallelism
Flynn’s Taxonomy of Computers
◼ Mike Flynn, "Very High-Speed Computing Systems," Proc.
of IEEE, 1966
❑ SISD: Single instruction operates on a single data element
❑ SIMD: Single instruction operates on multiple data elements
❑ MISD: Multiple instructions operate on a single data element
❑ MIMD: Multiple instructions operate on multiple data elements
◼ Time-space duality: the same data-parallel computation can be
spread across parallel hardware (space) or across consecutive
cycles (time)
10
Array vs. Vector Processors
[Figure: ARRAY PROCESSOR vs. VECTOR PROCESSOR. The array processor
executes the same instruction on all elements at the same time,
across space; the vector processor executes the same instruction on
one element per cycle, across time]
11
SIMD Array Processing vs. VLIW
◼ VLIW: Multiple independent operations packed together by the compiler
12
SIMD Array Processing vs. VLIW
◼ Array processor: Single operation on multiple (different) data elements
13
Vector Processors (I)
◼ A vector is a one-dimensional array of numbers
◼ Many scientific/commercial programs use vectors
for (i = 0; i<=49; i++)
C[i] = (A[i] + B[i]) / 2
14
Vector Processors (II)
◼ A vector instruction performs an operation on each element
in consecutive cycles
❑ Vector functional units are pipelined
❑ Each pipeline stage operates on a different data element
15
Vector Processor Advantages
+ No dependencies within a vector
❑ Pipelining & parallelization work really well
❑ Can have very deep pipelines, no dependencies!
16
Vector Processor Disadvantages
-- Works (only) if parallelism is regular (data/SIMD parallelism)
   ++ Vector operations
   -- Very inefficient if parallelism is irregular
      -- How about searching for a key in a linked list?
Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983. 17
Vector Processor Limitations
-- Memory (bandwidth) can easily become a bottleneck,
especially if
1. compute/memory operation balance is not maintained
2. data is not mapped appropriately to memory banks
18
Vector Processing in More Depth
Vector Registers
◼ Each vector data register holds N M-bit values
◼ Vector control registers: VLEN, VSTR, VMASK
◼ Maximum VLEN can be N
❑ Maximum number of elements stored in a vector register
◼ Vector Mask Register (VMASK)
❑ Indicates which elements of the vector to operate on (see the
sketch below)
[Figure: vector registers V0 and V1, each holding N M-bit elements,
indexed 0 through N-1]
20
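To make the control registers concrete, here is a minimal C sketch of
how VLEN and VMASK govern a vector add (illustrative types and names,
not from the lecture; VSTR matters only for memory accesses, covered
later):

#include <stdint.h>

enum { N = 64 };                /* elements per vector register */

typedef struct {
    int64_t elem[N];            /* N M-bit values (M = 64 here) */
} vreg_t;

/* V2 = V0 + V1: VLEN selects how many elements are processed,
   VMASK selects which results are actually written back. */
void vadd(vreg_t *v2, const vreg_t *v0, const vreg_t *v1,
          int vlen, const uint8_t *vmask) {
    for (int i = 0; i < vlen; i++)
        if (vmask[i])
            v2->elem[i] = v0->elem[i] + v1->elem[i];
}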
Vector Functional Units
◼ Use a deep pipeline to execute element operations
→ fast clock cycle
◼ Example: V1 * V2 → V3
[Figure: elements of V1 and V2 stream through a deeply pipelined
multiplier, one new element pair per cycle, producing V3]
22
CRAY X-MP-28 @ ETH (CAB, E Floor)
23
CRAY X-MP System Organization
28
Vector Machine Organization (CRAY-1)
◼ CRAY-1
◼ Russell, "The CRAY-1 computer system," CACM 1978.
29
Loading/Storing Vectors from/to Memory
◼ Requires loading/storing multiple elements
[Figure: CPU connected to multiple memory banks through a shared
address bus and data bus]
Picture credit: Derek Chiou 31
Vector Memory System
◼ Next address = Previous address + Stride
◼ If (stride == 1) && (consecutive elements interleaved
across banks) && (number of banks >= bank latency), then
❑ we can sustain 1 element/cycle throughput (see the
address-generation sketch below)
[Figure: Base and Stride registers feed an address generator, which
streams addresses to 16 interleaved memory banks (0-F) that fill the
vector registers]
Picture credit: Krste Asanovic 32
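A rough C sketch of what the address generator in the figure does
(assumed word addressing and 16 banks; names illustrative):

#include <stdio.h>

enum { NUM_BANKS = 16 };

/* For each element, the next address is the previous address plus
   the stride; word interleaving maps consecutive words to
   consecutive banks. */
void gen_addresses(long base, long stride, int vlen) {
    for (int i = 0; i < vlen; i++) {
        long addr = base + (long)i * stride;
        printf("elem %2d -> addr %ld, bank %ld\n",
               i, addr, addr % NUM_BANKS);
    }
}

With stride 1 the accesses walk through banks 0, 1, 2, ... round-robin,
which is what allows one element per cycle once the number of banks
covers the bank latency.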
Scalar Code Example: Element-Wise Avg.
◼ For i = 0 to 49
❑ C[i] = (A[i] + B[i]) / 2
◼ Why 16 banks?
❑ 11-cycle memory access latency
❑ Having 16 (> 11) banks ensures that enough memory accesses can
be in flight concurrently to hide the memory latency (see the
simulation sketch below)
34
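To see the "why 16 banks" argument in action, here is a small
simulation sketch (parameters taken from the slide; the model is
simplified, assuming one new access may issue per cycle and a bank
stays busy for 11 cycles per access):

#include <stdio.h>

#define NUM_BANKS    16   /* try 8 (< latency) to see stalls appear */
#define BANK_LATENCY 11   /* cycles a bank stays busy per access */

int main(void) {
    long busy_until[NUM_BANKS] = {0};
    long cycle = 0;
    for (int i = 0; i < 50; i++) {        /* 50 elements, stride 1 */
        int bank = i % NUM_BANKS;         /* word-interleaved */
        if (busy_until[bank] > cycle)     /* stall until bank frees */
            cycle = busy_until[bank];
        busy_until[bank] = cycle + BANK_LATENCY;
        cycle++;                          /* next issue slot */
    }
    printf("50 accesses issued in %ld cycles\n", cycle);
    return 0;  /* 50 cycles with 16 banks: one element per cycle */
}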
Vectorizable Loops
◼ A loop is vectorizable if each iteration is independent of any
other
◼ For i = 0 to 49
❑ C[i] = (A[i] + B[i]) / 2
◼ Vectorized loop (each instruction and its latency):
MOVI VLEN = 50        1
MOVI VSTR = 1         1
VLD V0 = A            11 + VLEN – 1
VLD V1 = B            11 + VLEN – 1
VADD V2 = V0 + V1     4 + VLEN – 1
VSHFR V3 = V2 >> 1    1 + VLEN – 1
VST C = V3            11 + VLEN – 1
❑ 7 dynamic instructions
35
Basic Vector Code Performance
◼ Assume no chaining (no vector data forwarding)
❑ i.e., output of a vector functional unit cannot be used as the
direct input of another
❑ The entire vector register needs to be ready before any
element of it can be used as part of another operation
◼ One memory port (one address generator)
◼ 16 memory banks (word-interleaved)
◼ Execution timeline, with each instruction fully serialized:
1 + 1 + (11 + 49) + (11 + 49) + (4 + 49) + (1 + 49) + (11 + 49)
◼ 285 cycles
36
Vector Chaining
◼ Vector chaining: Data forwarding from one vector
functional unit to another
◼ Example: LV V1; MULV V3, V1, V2; ADDV V5, V3, V4
[Figure: the load unit chains into the multiplier, and the multiplier
chains into the adder, so each dependent instruction starts as soon
as the first element of its source is available]
◼ Strict assumption: Each memory bank has a single port
(memory bandwidth bottleneck)
❑ Under this assumption, the two VLDs cannot be pipelined. WHY?
❑ They contend for the single memory port (one address
generator), so the second load must wait for the first
◼ With chaining plus separate memory ports for the two loads and
the store, the instructions overlap:
1 + 1 (MOVIs), 11 | 49 (VLD V0), 1 | 11 | 49 (VLD V1, one cycle
behind), 4 | 49 (chained VADD), 1 | 49 (VSHFR), 11 | 49 (VST)
◼ Critical path: 1 + 1 + 1 + 11 + 4 + 1 + 11 + 49 = 79 cycles
◼ 19X perf. improvement (over the scalar version of this loop)!
39
Questions (I)
◼ What if # data elements > # elements in a vector register?
❑ Idea: Break loops so that each iteration operates on #
elements in a vector register
◼ E.g., 527 data elements, 64-element VREGs
◼ 8 iterations where VLEN = 64
◼ 1 iteration where VLEN = 15 (need to change value of VLEN)
❑ Called vector stripmining (see the sketch below)
40
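A minimal C sketch of stripmining, assuming 64-element vector
registers (MVL and the inner loop stand in for real vector
instructions; names illustrative):

#define MVL 64    /* maximum vector length (elements per register) */

/* C[i] = (A[i] + B[i]) / 2 for n elements, processed strip by strip. */
void vec_avg(int *C, const int *A, const int *B, int n) {
    for (int start = 0; start < n; start += MVL) {
        /* the last strip may be shorter: this is where VLEN changes */
        int vlen = (n - start < MVL) ? (n - start) : MVL;
        for (int i = 0; i < vlen; i++)   /* one "vector instruction" */
            C[start + i] = (A[start + i] + B[start + i]) / 2;
    }
}

For n = 527 this produces 8 strips with vlen = 64 and one final strip
with vlen = 15, matching the example above.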
(Vector) Stripmining
Source: https://en.wikipedia.org/wiki/Surface_mining 41
Questions (II)
◼ What if vector data is not stored in a strided fashion in
memory? (irregular memory access to a vector)
❑ Idea: Use indirection to combine/pack elements into vector
registers
❑ Called scatter/gather operations
42
Gather/Scatter Operations
43
Gather/Scatter Operations
◼ Gather/scatter operations often implemented in hardware
to handle sparse vectors (matrices)
◼ Vector loads and stores use an index vector which is added
to the base register to generate the addresses
◼ Scatter example (see the sketch below)
[Table: Index Vector | Data Vector (to Store) | Stored Vector (in
Memory); each data element is written to address base + corresponding
index, and locations not named by an index are left untouched]
46
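A plain-C sketch of the two operations (the loops model what the
vector memory unit does in hardware; names illustrative):

/* Gather: pack sparse elements into a dense vector register. */
void gather(double *dst, const double *base, const int *idx, int vlen) {
    for (int i = 0; i < vlen; i++)
        dst[i] = base[idx[i]];      /* address = base + index */
}

/* Scatter: write a dense vector back out to sparse locations. */
void scatter(double *base, const double *src, const int *idx, int vlen) {
    for (int i = 0; i < vlen; i++)
        base[idx[i]] = src[i];
}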
Masked Vector Instructions
◼ Simple implementation
❑ Execute all N operations; turn off result writeback according
to the mask (the mask bit drives each element's write enable)
◼ Density-time implementation
❑ Scan the mask vector and only execute elements with non-zero
mask bits
◼ Which one is better? Tradeoffs? (see the sketch below)
Slide credit: Krste Asanovic 47
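A C sketch contrasting the two implementations (structure only; names
illustrative):

/* Simple: compute every element, let the mask gate only the
   writeback. Execution time is fixed regardless of the mask. */
void masked_simple(int *c, const int *a, const int *b,
                   const unsigned char *vmask, int vlen) {
    for (int i = 0; i < vlen; i++) {
        int result = a[i] + b[i];
        if (vmask[i]) c[i] = result;   /* mask drives write enable */
    }
}

/* Density-time: skip masked-off elements entirely.
   Execution time depends on how many mask bits are set. */
void masked_density_time(int *c, const int *a, const int *b,
                         const unsigned char *vmask, int vlen) {
    for (int i = 0; i < vlen; i++)
        if (vmask[i])
            c[i] = a[i] + b[i];
}

The simple version has predictable latency and pipelines trivially;
the density-time version finishes sooner when the mask is sparse but
needs more complex control.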
Some Issues
◼ Stride and banking
❑ As long as the stride and the number of banks are relatively
prime to each other and there are enough banks to cover the
bank access latency, we can sustain 1 element/cycle throughput
◼ Storage of a matrix
❑ Row major: Consecutive elements in a row are laid out
consecutively in memory
❑ Column major: Consecutive elements in a column are laid out
consecutively in memory
❑ You need to change the stride when accessing a row versus a
column (see the sketch below)
48
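A C sketch of the stride difference for a row-major matrix
(dimensions illustrative):

#define M 10
#define N 10

double mat[M][N];   /* row-major: mat[i][j] lives at base + i*N + j */

/* Traversing a row touches consecutive addresses: stride 1. */
double sum_row(int r) {
    double s = 0.0;
    for (int j = 0; j < N; j++) s += mat[r][j];
    return s;
}

/* Traversing a column jumps N elements per access: stride N. */
double sum_col(int c) {
    double s = 0.0;
    for (int i = 0; i < M; i++) s += mat[i][c];
    return s;
}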
Matrix Multiplication
◼ A and B, both in row-major order (a sketch of the access
pattern follows below)
50
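The worked example on this slide is not fully preserved here; as a
hedged reconstruction of the idea, with both A and B row-major, each
element of C is a dot product of a stride-1 vector (a row of A) and a
stride-N vector (a column of B), so a vector machine would load them
with VSTR = 1 and VSTR = N respectively:

#define N 10

void matmul(double C[N][N], double A[N][N], double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double dot = 0.0;
            for (int k = 0; k < N; k++)
                dot += A[i][k] * B[k][j];  /* stride 1 vs. stride N */
            C[i][j] = dot;
        }
}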
Array vs. Vector Processors, Revisited
◼ Array vs. vector processor distinction is a "purist's"
distinction
◼ Most "modern" SIMD processors combine both: they exploit data
parallelism in both time and space
51
Recall: Array vs. Vector Processors
[Figure: ARRAY PROCESSOR vs. VECTOR PROCESSOR. The array processor
executes the same instruction on all elements at the same time,
across space; the vector processor executes the same instruction on
one element per cycle, across time]
52
Vector Instruction Execution
VADD A,B → C
[Figure: with a single pipelined functional unit, element pairs
A[i], B[i] enter one per cycle and results C[0], C[1], ... emerge one
per cycle over time; with four parallel pipelines, four element pairs
(e.g., A[24..27], B[24..27]) enter per cycle and four results emerge
per cycle, elements striped across the pipelines in space]
Slide credit: Krste Asanovic 53
Vector Unit Structure
[Figure: a vector unit built from four lanes; each lane contains a
partition of the vector registers (lane 0 holds elements 0, 4, 8, ...;
lane 1 holds elements 1, 5, 9, ...; and so on), its own
functional-unit pipelines, and its own port to the memory subsystem;
a single instruction issue drives all lanes in lockstep]
56
Automatic Code Vectorization
for (i=0; i < N; i++)
C[i] = A[i] + B[i];
[Figure: scalar sequential code executes load, load, add, store for
iteration 1, then repeats for iteration 2, and so on; vectorized code
replaces each group of scalar operations with one vector instruction
(vector load, vector load, vector add, vector store) covering all
iterations]
◼ Vectorization is a compile-time reordering of
operation sequencing
❑ Requires extensive loop dependence analysis
Slide credit: Krste Asanovic 57
Vector/SIMD Processing Summary
◼ Vector/SIMD machines are good at exploiting regular data-
level parallelism
❑ Same operation performed on many data elements
❑ Improve performance, simplify design (no intra-vector
dependencies)
❑ Example: packed add of four elements per register
  $s0:  a3      | a2      | a1      | a0
+ $s1:  b3      | b2      | b1      | b0
  $s2:  a3 + b3 | a2 + b2 | a1 + b1 | a0 + b0
60
Intel Pentium MMX Operations
◼ Idea: One instruction operates on multiple data elements
simultaneously
❑ À la array processing (yet much more limited); see the sketch
below
Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 1996. 62
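To show the idea, here is a plain-C emulation sketch of a PADDB-style
packed add, eight 8-bit elements in one 64-bit word (the real MMX
instruction performs all eight adds in a single operation; this
emulation is illustrative and uses wraparound rather than saturating
arithmetic):

#include <stdint.h>
#include <stdio.h>

/* Add eight packed 8-bit elements lane by lane, no carries between
   lanes: the defining property of a packed SIMD add. */
static uint64_t padd8(uint64_t x, uint64_t y) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t a = (x >> (8 * i)) & 0xFF;
        uint8_t b = (y >> (8 * i)) & 0xFF;
        r |= (uint64_t)(uint8_t)(a + b) << (8 * i);
    }
    return r;
}

int main(void) {
    printf("%016llx\n",
           (unsigned long long)padd8(0x0102030405060708ULL,
                                     0x1010101010101010ULL));
    return 0;  /* prints 1112131415161718 */
}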
MMX Example: Image Overlaying (II)
[Figure: Y = blossom image, X = woman's image; the two input images
to be overlaid (sketch below)]
Peleg and Weiser, “MMX Technology Extension to the Intel Architecture,” IEEE Micro, 1996. 63
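The slide's figure is not preserved here; as a hedged per-pixel C
sketch of the chroma-key overlay idea (the MMX version performs these
steps on eight pixels at a time using PCMPEQB, PAND, PANDN, and POR):

#include <stdint.h>

/* Where a pixel of X matches the chroma key (background), take the
   corresponding pixel of Y; otherwise keep the pixel of X. */
void overlay(uint8_t *out, const uint8_t *x, const uint8_t *y,
             uint8_t key, int n) {
    for (int i = 0; i < n; i++) {
        uint8_t mask = (x[i] == key) ? 0xFF : 0x00;  /* compare-equal */
        out[i] = (uint8_t)((y[i] & mask) | (x[i] & (uint8_t)~mask));
    }
}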
Digital Design & Computer Arch.
Lecture 19: SIMD Processors
ETH Zürich
Spring 2020
7 May 2020