Lecture8 Simd
Computing
Fall 2018
Lecture: SIMD vector extensions
Flynn’s Taxonomy
© Markus Püschel
Computer Science
SIMD Extensions and SSE
SSE intrinsics
Compiler vectorization
What is it?
Extension of the ISA
Data types and instructions for the parallel computation on short
(length 2, 4, 8, …) vectors of integers or floats
Names: MMX, SSE, SSE2, …
Why do they exist?
Useful: Many applications have the necessary fine-grain parallelism
Then: speedup by a factor close to vector length
Doable: Relatively easy to design by replicating functional units
MMX: Multimedia extension
SSE: Streaming SIMD extension
AVX: Advanced vector extensions

Intel x86 Processors (in order of time)
x86-16: 8086, 286
x86-32: 386, 486, Pentium
  MMX (64-bit registers, only int): Pentium MMX
  SSE (128-bit registers): Pentium III
  SSE2: Pentium 4
  SSE3: Pentium 4E
x86-64: Pentium 4F, Core 2 Duo
  SSE4: Penryn, Core i7 (Nehalem)
  AVX (256-bit registers): Sandy Bridge
  AVX2: Haswell
SSE:
4-way single
Core 2
Has SSE3
16 SSE registers
%xmm0 %xmm8
%xmm1 %xmm9
%xmm2 %xmm10
%xmm3 %xmm11
%xmm4 %xmm12
%xmm5 %xmm13
%xmm6 %xmm14
%xmm7 %xmm15
SSE3 Registers
Different data types and associated instructions 128 bit LSB
Integer vectors:
16-way byte
8-way 2 bytes
4-way 4 bytes
2-way 8 bytes
SSE3 Instructions: Examples
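Single precision 4-way vector add: addps %xmm0, %xmm1
(element-wise: %xmm1 = %xmm1 + %xmm0)
Single precision scalar add: addss %xmm0, %xmm1
(lowest element only; the upper three elements of %xmm1 are unchanged)

Instruction names:
                   packed   scalar (single slot)
single precision   addps    addss
double precision   addpd    addsd

The compiler will use the scalar versions (addss, addsd) for floating point
• on x86-64
• with proper flags if SSE/SSE2 is available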
x86-64 FP Code Example
Inner product of two vectors
Single precision arithmetic
Compiled: not vectorized, uses SSE instructions

float ipf (float x[],
           float y[],
           int n) {
  int i;
  float result = 0.0;

  for (i = 0; i < n; i++)
    result += x[i]*y[i];
  return result;
}
ipf:
xorps %xmm1, %xmm1 # result = 0.0
xorl %ecx, %ecx # i = 0
jmp .L8 # goto middle
.L10: # loop:
movslq %ecx,%rax # icpy = i
incl %ecx # i++
movss (%rsi,%rax,4), %xmm0 # t = y[icpy]
mulss (%rdi,%rax,4), %xmm0 # t *= x[icpy]
addss %xmm0, %xmm1 # result += t
.L8: # middle:
cmpl %edx, %ecx # i:n
jl .L10 # if < goto loop
movaps %xmm1, %xmm0 # return result
ret
SIMD Extensions and SSE
Overview: SSE family
SSE intrinsics
Compiler vectorization
References:
Intel Intrinsics Guide
(easy access to all instructions, also contains latency and throughput
information!)
Intrinsics
Assembly coded C functions
Expanded inline upon compilation: no overhead
Like writing assembly inside C
Floating point:
Intrinsics for basic operations (add, mult, …)
Intrinsics for math functions: log, sin, …
Our introduction is based on icc
Most intrinsics work with gcc and Visual Studio (VS)
Some language extensions are icc (or even VS) specific
memory
Registers
Commonly:
LSB
R3 R2 R1 R0
We will use
LSB
R0 R1 R2 R3
SSE Intrinsics (Focus Floating Point)
Data types
__m128 f; // = {float f0, f1, f2, f3}
__m128d d; // = {double d0, d1}
__m128i i; // 16 8-bit, 8 16-bit, 4 32-bit, or 2 64-bit ints
Same result as
__m128 t = _mm_set_ps(4.0, 3.0, 2.0, 1.0); // t = {1.0, 2.0, 3.0, 4.0}, with 1.0 in the LSB position
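The argument order of _mm_set_ps is easy to get wrong: it lists the elements from the highest position down to the LSB. The following is a minimal sketch of three ways to build the same vector, assuming a compiler that provides <xmmintrin.h>; the helper name set_orders_agree is hypothetical and not part of the lecture.

```c
#include <xmmintrin.h>

/* Hypothetical helper: returns 1 if _mm_set_ps, _mm_setr_ps, and a load
   from memory all produce the same vector {1.0, 2.0, 3.0, 4.0}. */
int set_orders_agree(void) {
    float mem[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float o1[4], o2[4], o3[4];

    /* _mm_set_ps lists elements from the highest position down to the LSB */
    _mm_storeu_ps(o1, _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f));
    /* _mm_setr_ps ("reversed") lists them in memory order, LSB first */
    _mm_storeu_ps(o2, _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f));
    /* loading from an array puts mem[0] into the LSB position */
    _mm_storeu_ps(o3, _mm_loadu_ps(mem));

    for (int k = 0; k < 4; k++)
        if (o1[k] != mem[k] || o2[k] != mem[k] || o3[k] != mem[k])
            return 0;
    return 1;
}
```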
SSE Intrinsics
Native instructions (one-to-one with assembly)
_mm_load_ps() ↔ movaps
_mm_add_ps() ↔ addps
_mm_mul_pd() ↔ mulpd
…
Multi instructions (map to several assembly instructions)
_mm_set_ps()
_mm_set1_ps()
…
Macros and helpers
_MM_TRANSPOSE4_PS()
_MM_SHUFFLE()
…
SSE vs. AVX
SSE AVX
SSE Intrinsics
Load and store
Constants
Arithmetic
Comparison
Conversion
Shuffles
Loads and Stores
p
kept
Loads and Stores
p
1.0 memory
LSB 1.0 0 0 0 a
set to zero
→ blackboard
Constants
LSB 0 0 0 0 d d = _mm_setzero_ps();
→ blackboard
Arithmetic
SSE
Intrinsic Name    Operation                 Corresponding SSE Instruction
_mm_add_ss        Addition                  ADDSS
_mm_add_ps        Addition                  ADDPS
_mm_sub_ss        Subtraction               SUBSS
_mm_sub_ps        Subtraction               SUBPS
_mm_mul_ss        Multiplication            MULSS
_mm_mul_ps        Multiplication            MULPS
_mm_div_ss        Division                  DIVSS
_mm_div_ps        Division                  DIVPS
_mm_sqrt_ss       Square Root               SQRTSS
_mm_sqrt_ps       Square Root               SQRTPS
_mm_rcp_ss        Reciprocal                RCPSS
_mm_rcp_ps        Reciprocal                RCPPS
_mm_rsqrt_ss      Reciprocal Square Root    RSQRTSS
_mm_rsqrt_ps      Reciprocal Square Root    RSQRTPS
_mm_min_ss        Computes Minimum          MINSS
_mm_min_ps        Computes Minimum          MINPS
_mm_max_ss        Computes Maximum          MAXSS
_mm_max_ps        Computes Maximum          MAXPS

SSE3
Intrinsic Name    Operation                 Corresponding SSE3 Instruction
_mm_addsub_ps     Subtract and add          ADDSUBPS
_mm_hadd_ps       Horizontal add            HADDPS
_mm_hsub_ps       Horizontal subtract       HSUBPS

SSE4
Intrinsic Name    Operation                      Corresponding SSE4 Instruction
_mm_dp_ps         Single precision dot product   DPPS
Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b
c = _mm_add_ps(a, b);
analogous:
c = _mm_sub_ps(a, b);
c = _mm_mul_ps(a, b);
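As a small self-contained sketch of the element-wise add above: the function name add4 is hypothetical, and unaligned loads and stores are used so that the arrays need no particular alignment.

```c
#include <xmmintrin.h>

/* c[k] = a[k] + b[k] for k = 0..3, computed with a single addps */
void add4(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);   /* {a0, a1, a2, a3}, a0 in the LSB */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));
}
```

With a = {1.0, 2.0, 3.0, 4.0} and b = {0.5, 1.5, 2.5, 3.5} as on the slide, c becomes {1.5, 3.5, 5.5, 7.5}.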
→ blackboard
Example
void addindex(float *x, int n) {
for (int i = 0; i < n; i++)
x[i] = x[i] + i;
}
#include <ia32intrin.h>
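One possible vectorization of addindex, given as a sketch rather than the version developed in the lecture. It assumes that n is a multiple of 4 and that n < 2^24, so every index is exactly representable as a float. The name addindex_vec and the use of <xmmintrin.h> with unaligned loads are choices made here.

```c
#include <xmmintrin.h>

/* Sketch: assumes n is a multiple of 4 and n < 2^24.
   Unaligned loads/stores are used so x needs no particular alignment. */
void addindex_vec(float *x, int n) {
    __m128 index = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f); /* {0, 1, 2, 3} */
    __m128 incr  = _mm_set1_ps(4.0f);                  /* {4, 4, 4, 4} */
    for (int i = 0; i < n; i += 4) {
        __m128 x_vec = _mm_loadu_ps(x + i);   /* x[i..i+3] */
        x_vec = _mm_add_ps(x_vec, index);     /* add {i, i+1, i+2, i+3} */
        _mm_storeu_ps(x + i, x_vec);
        index = _mm_add_ps(index, incr);      /* advance indices by 4 */
    }
}
```

Keeping the index vector in a register and incrementing it avoids converting integers to floats inside the loop.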
Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 b
c = _mm_add_ss(a, b);
Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b
c = _mm_max_ps(a, b);
Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b
c = _mm_addsub_ps(a, b);
Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b
c = _mm_hadd_ps(a, b);
analogous:
c = _mm_hsub_ps(a, b);
→ blackboard
Example
// n is even
void lp(float *x, float *y, int n) {
for (int i = 0; i < n/2; i++)
y[i] = (x[2*i] + x[2*i+1])/2;
}
#include <ia32intrin.h>
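A sketch of how lp could be vectorized with _mm_hadd_ps; this is one possibility, not necessarily the lecture's version. Assumptions: n is a multiple of 8, the name lp_vec is hypothetical, and the target attribute is a gcc/clang extension that enables SSE3 for this one function (with -msse3 or another compiler it can be dropped).

```c
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Sketch: assumes n is a multiple of 8. Each iteration reads 8 inputs
   and writes 4 outputs. */
__attribute__((target("sse3")))
void lp_vec(const float *x, float *y, int n) {
    __m128 half = _mm_set1_ps(0.5f);
    for (int i = 0; i < n / 8; i++) {
        __m128 x1 = _mm_loadu_ps(x + 8 * i);      /* x[8i .. 8i+3]   */
        __m128 x2 = _mm_loadu_ps(x + 8 * i + 4);  /* x[8i+4 .. 8i+7] */
        /* hadd gives {x1[0]+x1[1], x1[2]+x1[3], x2[0]+x2[1], x2[2]+x2[3]} */
        __m128 sum = _mm_hadd_ps(x1, x2);
        _mm_storeu_ps(y + 4 * i, _mm_mul_ps(sum, half));
    }
}
```

Multiplying by 0.5 gives the same result as dividing by 2 and avoids the slower division.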
Arithmetic
__m128 _mm_dp_ps(__m128 a, __m128 b, const int mask)
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b
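A sketch of how the mask argument of _mm_dp_ps works: the upper four bits select which element products enter the sum, the lower four bits select which result elements receive the sum (all others are set to zero). The name dot4 is hypothetical; the target attribute is a gcc/clang extension used here to enable SSE4.1 for this function only.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_dp_ps */

/* Dot product of two 4-element float vectors.
   Mask 0xF1: use all four products (0xF.), write the sum to element 0 (0x.1). */
__attribute__((target("sse4.1")))
float dot4(const float *a, const float *b) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 r  = _mm_dp_ps(va, vb, 0xF1);
    return _mm_cvtss_f32(r);   /* extract the LSB element */
}
```

For a = {1.0, 2.0, 3.0, 4.0} and b = {0.5, 1.5, 2.5, 3.5} the result is 0.5 + 3.0 + 7.5 + 14.0 = 25.0.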
Comparisons
Comparisons producing a mask (cmpgt, cmpge, cmpngt, cmpnge use the lt/le instructions with swapped operands)
Intrinsic Name     Operation                       Corresponding SSE Instruction
_mm_cmpeq_ss       Equal                           CMPEQSS
_mm_cmpeq_ps       Equal                           CMPEQPS
_mm_cmplt_ss       Less Than                       CMPLTSS
_mm_cmplt_ps       Less Than                       CMPLTPS
_mm_cmple_ss       Less Than or Equal              CMPLESS
_mm_cmple_ps       Less Than or Equal              CMPLEPS
_mm_cmpgt_ss       Greater Than                    CMPLTSS
_mm_cmpgt_ps       Greater Than                    CMPLTPS
_mm_cmpge_ss       Greater Than or Equal           CMPLESS
_mm_cmpge_ps       Greater Than or Equal           CMPLEPS
_mm_cmpneq_ss      Not Equal                       CMPNEQSS
_mm_cmpneq_ps      Not Equal                       CMPNEQPS
_mm_cmpnlt_ss      Not Less Than                   CMPNLTSS
_mm_cmpnlt_ps      Not Less Than                   CMPNLTPS
_mm_cmpnle_ss      Not Less Than or Equal          CMPNLESS
_mm_cmpnle_ps      Not Less Than or Equal          CMPNLEPS
_mm_cmpngt_ss      Not Greater Than                CMPNLTSS
_mm_cmpngt_ps      Not Greater Than                CMPNLTPS
_mm_cmpnge_ss      Not Greater Than or Equal       CMPNLESS
_mm_cmpnge_ps      Not Greater Than or Equal       CMPNLEPS
_mm_cmpord_ss      Ordered                         CMPORDSS
_mm_cmpord_ps      Ordered                         CMPORDPS
_mm_cmpunord_ss    Unordered                       CMPUNORDSS
_mm_cmpunord_ps    Unordered                       CMPUNORDPS

Scalar comparisons (COMISS / UCOMISS)
Intrinsic Name     Operation                       Corresponding SSE Instruction
_mm_comieq_ss      Equal                           COMISS
_mm_comilt_ss      Less Than                       COMISS
_mm_comile_ss      Less Than or Equal              COMISS
_mm_comigt_ss      Greater Than                    COMISS
_mm_comige_ss      Greater Than or Equal           COMISS
_mm_comineq_ss     Not Equal                       COMISS
_mm_ucomieq_ss     Equal                           UCOMISS
_mm_ucomilt_ss     Less Than                       UCOMISS
_mm_ucomile_ss     Less Than or Equal              UCOMISS
_mm_ucomigt_ss     Greater Than                    UCOMISS
_mm_ucomige_ss     Greater Than or Equal           UCOMISS
_mm_ucomineq_ss    Not Equal                       UCOMISS
Comparisons
LSB 1.0 2.0 3.0 4.0 a LSB 1.0 1.5 3.0 3.5 b
c = _mm_cmpeq_ps(a, b);
Each field:
0xffffffff if true
0x0 if false
Return type: __m128
analogous:
c = _mm_cmple_ps(a, b);
c = _mm_cmplt_ps(a, b);
c = _mm_cmpge_ps(a, b);
etc.
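The comparison result is a vector of all-ones/all-zeros fields, not a C boolean. A minimal sketch of how to inspect it, using _mm_movemask_ps to collect the sign bit of each field into an integer; the function name equal_lanes is hypothetical.

```c
#include <xmmintrin.h>

/* Returns a 4-bit integer; bit k is set if a[k] == b[k]. */
int equal_lanes(const float *a, const float *b) {
    __m128 mask = _mm_cmpeq_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    return _mm_movemask_ps(mask);  /* sign bit of each field -> bits 0..3 */
}
```

For a = {1.0, 2.0, 3.0, 4.0} and b = {1.0, 1.5, 3.0, 3.5}, elements 0 and 2 are equal, so the result is binary 0101 = 5.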
Example
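#include <stddef.h>
#include <xmmintrin.h>

// assumes x is 16-byte aligned and n is a multiple of 4
void fcond(float *x, size_t n) {
  size_t i;
  __m128 vt, vmask, vp, vm, vr, ones, mones, thresholds;

  ones = _mm_set1_ps(1.);
  mones = _mm_set1_ps(-1.);
  thresholds = _mm_set1_ps(0.5);
  for(i = 0; i < n; i+=4) {
    vt = _mm_load_ps(x+i);
    vmask = _mm_cmpgt_ps(vt, thresholds);   // all ones where x[i] > 0.5
    vp = _mm_and_ps(vmask, ones);           // +1 where x[i] > 0.5, else 0
    vm = _mm_andnot_ps(vmask, mones);       // -1 elsewhere, else 0
    vr = _mm_add_ps(vt, _mm_or_ps(vp, vm)); // x[i] + 1 or x[i] - 1
    _mm_store_ps(x+i, vr);
  }
}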
Conversion
Intrinsic Name      Operation                           Corresponding SSE Instruction
_mm_cvtss_si32      Convert to 32-bit integer           CVTSS2SI
_mm_cvtss_si64*     Convert to 64-bit integer           CVTSS2SI
_mm_cvtps_pi32      Convert to two 32-bit integers      CVTPS2PI
_mm_cvttss_si32     Convert to 32-bit integer           CVTTSS2SI
_mm_cvttss_si64*    Convert to 64-bit integer           CVTTSS2SI
_mm_cvttps_pi32     Convert to two 32-bit integers      CVTTPS2PI
_mm_cvtsi32_ss      Convert from 32-bit integer         CVTSI2SS
_mm_cvtsi64_ss*     Convert from 64-bit integer         CVTSI2SS
_mm_cvtpi32_ps      Convert from two 32-bit integers    CVTPI2PS
_mm_cvtpi16_ps      Convert from four 16-bit integers   composite
_mm_cvtpu16_ps      Convert from four 16-bit integers   composite
_mm_cvtpi8_ps       Convert from four 8-bit integers    composite
_mm_cvtpu8_ps       Convert from four 8-bit integers    composite
_mm_cvtpi32x2_ps    Convert from four 32-bit integers   composite
_mm_cvtps_pi16      Convert to four 16-bit integers     composite
_mm_cvtps_pi8       Convert to four 8-bit integers      composite
_mm_cvtss_f32       Extract                             composite
Conversion
float _mm_cvtss_f32(__m128 a)
1.0 f
float f;
f = _mm_cvtss_f32(a);
Cast interpreted as
floats ints
__m128i _mm_castps_si128(__m128 a)
__m128 _mm_castsi128_ps(__m128i a)
Reinterprets the four single precision floating point values in a as four 32-bit
integers, and vice versa.
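A cast changes only the type, not the bits; a conversion changes the bits so that the numeric value is preserved. The sketch below contrasts the two on the value 1.0. For the conversion it uses the SSE2 intrinsic _mm_cvtps_epi32 (a choice made here for illustration); the function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2: __m128i, casts, _mm_cvtps_epi32 */
#include <stdint.h>

/* Cast: the bit pattern of f, reinterpreted as a 32-bit integer */
int32_t low_bits_of(float f) {
    __m128i bits = _mm_castps_si128(_mm_set1_ps(f)); /* no instruction needed */
    return _mm_cvtsi128_si32(bits);                  /* extract LSB element */
}

/* Conversion: the numeric value of f, rounded to a 32-bit integer */
int32_t low_value_of(float f) {
    __m128i vals = _mm_cvtps_epi32(_mm_set1_ps(f));  /* cvtps2dq */
    return _mm_cvtsi128_si32(vals);
}
```

For f = 1.0 the cast yields 0x3f800000 (the IEEE 754 encoding of 1.0), while the conversion yields the integer 1.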
→ blackboard
Actual Conversion
__m128 _mm_cvt_pi2ps (__m128 a, __m64 b)
convert
ints floats
Shuffles
SSE
Intrinsic Name     Operation                                   Corresponding SSE Instruction
_mm_shuffle_ps     Shuffle                                     SHUFPS
_mm_unpackhi_ps    Unpack High                                 UNPCKHPS
_mm_unpacklo_ps    Unpack Low                                  UNPCKLPS
_mm_move_ss        Set low word, pass in three high values     MOVSS
_mm_movehl_ps      Move High to Low                            MOVHLPS
_mm_movelh_ps      Move Low to High                            MOVLHPS
_mm_movemask_ps    Create four-bit mask                        MOVMSKPS

SSE3
Intrinsic Name     Operation     Corresponding SSE3 Instruction
_mm_movehdup_ps    Duplicates    MOVSHDUP
_mm_moveldup_ps    Duplicates    MOVSLDUP

SSSE3
Intrinsic Name      Operation    Corresponding SSSE3 Instruction
_mm_shuffle_epi8    Shuffle      PSHUFB
_mm_alignr_epi8     Shift        PALIGNR

SSE4
Intrinsic Syntax, Operation, Corresponding SSE4 Instruction
__m128 _mm_blend_ps(__m128 v1, __m128 v2, const int mask)
  Selects float single precision data from 2 sources using constant mask. BLENDPS
__m128 _mm_blendv_ps(__m128 v1, __m128 v2, __m128 v3)
  Selects float single precision data from 2 sources using variable mask. BLENDVPS
__m128 _mm_insert_ps(__m128 dst, __m128 src, const int ndx)
  Insert single precision float into packed single precision array element selected by index. INSERTPS
int _mm_extract_ps(__m128 src, const int ndx)
  Extract single precision float from packed single precision array selected by index. EXTRACTPS
Shuffles
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b
c = _mm_unpacklo_ps(a, b);
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b
c = _mm_unpackhi_ps(a, b);
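A small sketch of what the two unpack intrinsics produce: they interleave the lower (or upper) halves of a and b. The function name interleave is hypothetical.

```c
#include <xmmintrin.h>

/* lo = {a0, b0, a1, b1}, hi = {a2, b2, a3, b3} */
void interleave(const float *a, const float *b, float *lo, float *hi) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(lo, _mm_unpacklo_ps(va, vb)); /* lower halves interleaved */
    _mm_storeu_ps(hi, _mm_unpackhi_ps(va, vb)); /* upper halves interleaved */
}
```

For a = {1.0, 2.0, 3.0, 4.0} and b = {0.5, 1.5, 2.5, 3.5} this gives lo = {1.0, 0.5, 2.0, 1.5} and hi = {3.0, 2.5, 4.0, 3.5}.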
→ blackboard
Shuffles
c = _mm_shuffle_ps(a, b, _MM_SHUFFLE(l, k, j, i));
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b
LSB c0 c1 c2 c3 c
c0 = ai (any element of a)
c1 = aj (any element of a)
c2 = bk (any element of b)
c3 = bl (any element of b)
i,j,k,l in {0,1,2,3}
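A sketch of one concrete shuffle to make the argument order of _MM_SHUFFLE(l, k, j, i) tangible: the arguments are listed from the highest result element down to the LSB. The function name pick is hypothetical.

```c
#include <xmmintrin.h>

/* out = {a[0], a[2], b[1], b[3]} */
void pick(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    /* l = 3, k = 1, j = 2, i = 0:
       c0 = a[i] = a[0], c1 = a[j] = a[2], c2 = b[k] = b[1], c3 = b[l] = b[3] */
    __m128 c = _mm_shuffle_ps(va, vb, _MM_SHUFFLE(3, 1, 2, 0));
    _mm_storeu_ps(out, c);
}
```

For a = {1.0, 2.0, 3.0, 4.0} and b = {0.5, 1.5, 2.5, 3.5} the result is {1.0, 3.0, 1.5, 3.5}.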
→ blackboard
Example: Loading 4 Real Numbers from
Arbitrary Memory Locations
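Memory locations: p0 p1 p2 p3
4x _mm_load_ss: load each number into the LSB position of its own vector (e.g., LSB 1.0 0 0 0)
2x _mm_shuffle_ps: combine the vectors pairwise
1x _mm_shuffle_ps: combine the two results into the final vector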
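#include <xmmintrin.h>

// function name is illustrative
__m128 LoadArbitrary(float *p0, float *p1, float *p2, float *p3) {
  __m128 a, b, c, d, e, f;
  a = _mm_load_ss(p0); // a = {*p0, 0, 0, 0}
  b = _mm_load_ss(p1);
  c = _mm_load_ss(p2);
  d = _mm_load_ss(p3);
  e = _mm_shuffle_ps(a, b, _MM_SHUFFLE(1,0,2,0)); // e = {*p0, 0, *p1, 0}; only zeros are important
  f = _mm_shuffle_ps(c, d, _MM_SHUFFLE(1,0,2,0)); // f = {*p2, 0, *p3, 0}; only zeros are important
  return _mm_shuffle_ps(e, f, _MM_SHUFFLE(2,0,2,0)); // {*p0, *p1, *p2, *p3}
}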
Example: Loading 4 Real Numbers from
Arbitrary Memory Locations (cont’d)
Whenever possible avoid the previous situation
Restructure algorithm and use the aligned
_mm_load_ps()
Other possibility (but likely also yields 7 instructions)
__m128 vf;
_Alignas(16) float g[4]; // must be 16-byte aligned for _mm_load_ps (C11 syntax, an assumption here)
g[0] = *p0;
g[1] = *p1;
g[2] = *p2;
g[3] = *p3;
vf = _mm_load_ps(g);
Example: Storing 4 Real Numbers to
Arbitrary Memory Locations
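3x _mm_shuffle_ps: move each of the three upper elements into the LSB position
(e.g., LSB 2.0 0 0 0, LSB 3.0 0 0 0, LSB 4.0 0 0 0; the other positions do not matter)
4x _mm_store_ss: store the LSB element of each vector to its memory location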
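The scheme above (3 shuffles, 4 scalar stores) can be sketched as follows. The function name store_arbitrary is hypothetical; each shuffle broadcasts one of the upper elements so that it ends up in the LSB position.

```c
#include <xmmintrin.h>

/* Stores the four elements of v to four unrelated addresses. */
void store_arbitrary(__m128 v, float *p0, float *p1, float *p2, float *p3) {
    _mm_store_ss(p0, v);                                             /* v0 */
    _mm_store_ss(p1, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1))); /* v1 */
    _mm_store_ss(p2, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2))); /* v2 */
    _mm_store_ss(p3, _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3))); /* v3 */
}
```

As with the arbitrary loads, this costs 7 instructions for one vector, so it is better to restructure the algorithm to use _mm_store_ps where possible.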
Shuffle
__m128i _mm_alignr_epi8(__m128i a, __m128i b, const int n)
Concatenate a and b and extract byte-aligned result shifted to the right by n bytes
LSB 1 2 3 4 b LSB 5 6 7 8 a
n = 12 bytes
LSB 4 5 6 7 c
Example
void shift(float *x, float *y, int n) {
for (int i = 0; i < n-1; i++)
y[i] = x[i+1];
y[n-1] = 0;
}
#include <ia32intrin.h>
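A sketch of how shift could be vectorized with _mm_alignr_epi8, offered as one possibility rather than the lecture's solution. Assumptions: n is a multiple of 4 and n >= 4; the name shift_vec is hypothetical; the target attribute is a gcc/clang extension that enables SSSE3 for this function. Since _mm_alignr_epi8 works on __m128i, the float vectors are cast (no instruction is generated for the casts).

```c
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 */

/* Sketch: assumes n is a multiple of 4 and n >= 4. */
__attribute__((target("ssse3")))
void shift_vec(const float *x, float *y, int n) {
    __m128i cur = _mm_castps_si128(_mm_loadu_ps(x));       /* x[0..3] */
    for (int i = 0; i < n - 4; i += 4) {
        __m128i next = _mm_castps_si128(_mm_loadu_ps(x + i + 4));
        /* concatenate next:cur, shift right by 4 bytes (one float):
           {cur1, cur2, cur3, next0} */
        __m128i r = _mm_alignr_epi8(next, cur, 4);
        _mm_storeu_ps(y + i, _mm_castsi128_ps(r));
        cur = next;
    }
    /* last block: shift in a zero, which also sets y[n-1] = 0 */
    __m128i r = _mm_alignr_epi8(_mm_setzero_si128(), cur, 4);
    _mm_storeu_ps(y + n - 4, _mm_castsi128_ps(r));
}
```

Each element of x is loaded only once; the alternative of loading from x + i + 1 directly would make every load unaligned even when x is aligned.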
Shuffle
__m128i _mm_shuffle_epi8(__m128i a, __m128i mask)
LSB 1 2 3 4 a
LSB 4 1 0 2 c
Shuffle
__m128 _mm_blendv_ps(__m128 a, __m128 b, __m128 mask)
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b
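#include <stddef.h>
#include <smmintrin.h> // SSE4.1, needed for _mm_blendv_ps

// assumes a is 16-byte aligned and n is a multiple of 4
void fcond(float *a, size_t n) {
  size_t i;
  __m128 vt, vmask, vb, vr, ones, mones, thresholds;

  ones = _mm_set1_ps(1.);
  mones = _mm_set1_ps(-1.);
  thresholds = _mm_set1_ps(0.5);
  for(i = 0; i < n; i+=4) {
    vt = _mm_load_ps(a+i);
    vmask = _mm_cmpgt_ps(vt, thresholds);
    vb = _mm_blendv_ps(mones, ones, vmask); // +1 where the mask is set, else -1
    vr = _mm_add_ps(vt, vb);
    _mm_store_ps(a+i, vr);
  }
}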
Shuffle
_MM_TRANSPOSE4_PS(row0, row1, row2, row3)
Macro for 4 x 4 matrix transposition: The arguments row0, …, row3 are __m128
values each containing a row of a 4 x 4 matrix. After execution, row0, …, row3
contain the columns of that matrix.
Before:                          After:
LSB 1.0 2.0 3.0 4.0 row0         LSB 1.0 5.0 9.0 13.0 row0
LSB 5.0 6.0 7.0 8.0 row1         LSB 2.0 6.0 10.0 14.0 row1
LSB 9.0 10.0 11.0 12.0 row2      LSB 3.0 7.0 11.0 15.0 row2
LSB 13.0 14.0 15.0 16.0 row3     LSB 4.0 8.0 12.0 16.0 row3
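A minimal sketch of using the macro on a 4 x 4 matrix stored row by row in memory. The function name transpose4x4 is hypothetical. Note that the macro modifies its four arguments in place, so they must be variables.

```c
#include <xmmintrin.h>

/* In-place transpose of a 4 x 4 float matrix stored in row-major order */
void transpose4x4(float *m) {
    __m128 row0 = _mm_loadu_ps(m + 0);
    __m128 row1 = _mm_loadu_ps(m + 4);
    __m128 row2 = _mm_loadu_ps(m + 8);
    __m128 row3 = _mm_loadu_ps(m + 12);
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3); /* rows now hold the columns */
    _mm_storeu_ps(m + 0, row0);
    _mm_storeu_ps(m + 4, row1);
    _mm_storeu_ps(m + 8, row2);
    _mm_storeu_ps(m + 12, row3);
}
```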
SIMD Extensions and SSE
SSE intrinsics
Compiler vectorization
References:
Intel icc manual (look for auto vectorization)
Compiler Vectorization
Compiler flags
Aliasing
Proper code style
Alignment
How Do I Know the Compiler Vectorized?
vec-report
Look at assembly: mulps, addps, xxxps
Generate assembly with source code annotation:
Visual Studio + icc: /Fas
icc on Linux/Mac: -S
unvectorized: /Qvec-
<more>
;;; a[i] = a[i] + b[i];
movss xmm0, DWORD PTR [rcx+rax*4]
addss xmm0, DWORD PTR [rdx+rax*4]
movss DWORD PTR [rcx+rax*4], xmm0
<more>
vectorized:
<more>
;;; a[i] = a[i] + b[i];
movss xmm0, DWORD PTR [rcx+r11*4]          ; why this?
addss xmm0, DWORD PTR [rdx+r11*4]
movss DWORD PTR [rcx+r11*4], xmm0
…
movups xmm0, XMMWORD PTR [rdx+r10*4]       ; unaligned
movups xmm1, XMMWORD PTR [16+rdx+r10*4]
addps xmm0, XMMWORD PTR [rcx+r10*4]        ; why everything twice?
addps xmm1, XMMWORD PTR [16+rcx+r10*4]     ; why movups and movaps?
movaps XMMWORD PTR [rcx+r10*4], xmm0       ; aligned
movaps XMMWORD PTR [16+rcx+r10*4], xmm1
<more>
Aliasing
for (i = 0; i < n; i++)
a[i] = a[i] + b[i];
Removing Aliasing
Globally with compiler flag:
-fno-alias, /Oa
-fargument-noalias, /Qalias-args- (function arguments only)
For one loop: pragma
void add(float *a, float *b, int n) {
  int i;
  #pragma ivdep
  for (i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}
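In addition to the flags and the pragma above, standard C99 offers the restrict qualifier for individual pointers. The sketch below is an alternative not taken from the slides; the function name add_restrict is hypothetical. With restrict the programmer promises that a and b do not overlap; if they do, the behavior is undefined.

```c
/* Sketch: restrict (C99) tells the compiler that a and b do not alias,
   so the loop can be vectorized without a runtime overlap check. */
void add_restrict(float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
```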
Proper Code Style
Use countable loops = number of iterations known at runtime before the loop is entered
Number of iterations is a:
constant
loop invariant term
linear function of outermost loop indices
Countable or not?
Alignment
float *x = (float *) malloc(1024*sizeof(float));
int i;
However, the compiler can peel the loop to extract aligned part:
float *x = (float *) malloc(1024*sizeof(float));
int i;
Ensuring Alignment
Align arrays to 16-byte boundaries (see earlier discussion)
If compiler cannot analyze:
Use pragma for loops
float *x = (float *) malloc(1024*sizeof(float));
int i;
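A sketch of one way to obtain a 16-byte aligned array, assuming the _mm_malloc/_mm_free pair that icc, gcc, and clang make available through <xmmintrin.h>. The function names are hypothetical. Plain malloc gives no 16-byte guarantee on every platform, which is why the compiler has to peel or use unaligned accesses.

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* also provides _mm_malloc and _mm_free */

/* Allocates n floats on a 16-byte boundary; release with _mm_free. */
float *alloc_floats_aligned16(size_t n) {
    return (float *)_mm_malloc(n * sizeof(float), 16);
}

/* Sketch: assumes x is 16-byte aligned and n is a multiple of 4,
   so the aligned load/store intrinsics are safe. */
void scale_aligned(float *x, size_t n, float s) {
    __m128 vs = _mm_set1_ps(s);
    for (size_t i = 0; i < n; i += 4)
        _mm_store_ps(x + i, _mm_mul_ps(_mm_load_ps(x + i), vs));
}
```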
More Tips (icc 14.0) https://software.intel.com/en-us/node/512631
Use simple for loops. Avoid complex loop termination conditions – the upper iteration limit must be
invariant within the loop. For the innermost loop in a nest of loops, you could set the upper limit
iteration to be a function of the outer loop indices.
Write straight-line code. Avoid branches such as switch, goto, or return statements, most function
calls, or if constructs that cannot be treated as masked assignments.
Avoid dependencies between loop iterations or at the least, avoid read-after-write dependencies.
Try to use array notations instead of the use of pointers. C programs in particular impose very few
restrictions on the use of pointers; aliased pointers may lead to unexpected dependencies. Without
help, the compiler often cannot tell whether it is safe to vectorize code containing pointers.
Wherever possible, use the loop index directly in array subscripts instead of incrementing a separate
counter for use as an array address.
…
runtime check
a, b potentially aliased?
no yes
Compiler Vectorization
Read manual