SIMD v1
CS 240A, 2017
Flynn* Taxonomy, 1966
SIMD: Single Instruction, Multiple Data
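(Figure: a single + is applied to every lane at once.)
X = [ x3 | x2 | x1 | x0 ]
Y = [ y3 | y2 | y1 | y0 ]
X + Y = [ x3+y3 | x2+y2 | x1+y1 | x0+y0 ]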
Intel SIMD Extensions
• MMX: 64-bit registers, reusing the floating-point registers [1997]
• SSE and SSE2/3/4: 8 new 128-bit registers (XMM), introduced with SSE [1999]
  – 4x floats: four 32-bit lanes per 128 bits (bits 127-96, 95-64, 63-32, 31-0)
  – 2x doubles: two 64-bit lanes per 128 bits (bits 127-64, 63-0)
  – 16x bytes
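Packed and Scalar Double-Precision Floating-Point Operations
• Packed: the operation is applied to both doubles in the 128-bit register
• Scalar: the operation is applied only to the low double; the high double is copied from the first operand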
SSE/SSE2 Floating Point Instructions
• Move does both load and store
• Could be rewritten
for(i=1000; i>0; i=i-4) {
    x[i] = x[i] + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
    x[i-3] = x[i-3] + s;
}
Generalizing Loop Unrolling
• A loop of n iterations
• k copies of the body of the loop
• Assuming (n mod k) ≠ 0
– Then we will run the loop with 1 copy of the body (n mod k) times
– and then with k copies of the body floor(n/k) times
General Loop Unrolling with a Head
• Handling loop iterations indivisible by step size.
for(i=1003; i>0; i=i-1)
    x[i] = x[i] + s;
• Could be rewritten
for(i=1003; i>1000; i--) // Handle the head (1003 mod 4)
    x[i] = x[i] + s;
// the remaining 1000 iterations (i=1000 down to 1) then run in the 4-way unrolled loop
Tail method for general loop unrolling
• Handling loop iterations indivisible by step size.
for(i=1003; i>0; i=i-1)
    x[i] = x[i] + s;
• Could be rewritten
for(i=1003; i>0 && i>1003%4; i=i-4) {
    x[i] = x[i] + s;
    x[i-1] = x[i-1] + s;
    x[i-2] = x[i-2] + s;
    x[i-3] = x[i-3] + s;
}
for(i=1003%4; i>0; i--) // handle the leftover iterations in the tail
    x[i] = x[i] + s;
Another loop unrolling example
• Original loop
int x;
for (x = 0; x < 103; x++) {
    delete(x);
}
• Could be rewritten
int x;
for (x = 0; x < 103/5*5; x += 5) {
    delete(x);
    delete(x + 1);
    delete(x + 2);
    delete(x + 3);
    delete(x + 4);
}
/*Tail*/
for (x = 103/5*5; x < 103; x++) {
    delete(x);
}
Intel SSE Intrinsics
Intrinsics are C functions and procedures for inserting assembly language into C code, including SSE instructions.
Tail: copy the 4 integers out of temp, add them together into sum, then add the leftover elements:
for(i=n/4*4; i<n; i++) sum += a[i];
Related SSE SIMD instructions
__m128i _mm_setzero_si128( )                      returns a 128-bit zero vector
void _mm_storeu_si128( __m128i *p, __m128i a )    stores the content of 128-bit vector "a" to memory starting at pointer p
Related SSE SIMD instructions
• Add 4 integers (128 bits) starting at &a[i] to the temp vector; the loop body is temp = temp + a[i]
• Add 128 bits, then the next 128 bits …
__m128i temp=_mm_setzero_si128();                 // once, before the loop
__m128i temp1=_mm_loadu_si128((__m128i *)(a+i));  // in the loop body
temp=_mm_add_epi32(temp, temp1);
Example 2: 2 x 2 Matrix Multiply
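Definition of Matrix Multiply:
$$C_{i,j} = (A \times B)_{i,j} = \sum_{k=1}^{2} A_{i,k} \times B_{k,j}$$
$$\begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix} \times \begin{bmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{bmatrix} = \begin{bmatrix} C_{1,1} = A_{1,1}B_{1,1} + A_{1,2}B_{2,1} & C_{1,2} = A_{1,1}B_{1,2} + A_{1,2}B_{2,2} \\ C_{2,1} = A_{2,1}B_{1,1} + A_{2,2}B_{2,1} & C_{2,2} = A_{2,1}B_{1,2} + A_{2,2}B_{2,2} \end{bmatrix}$$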
Example: 2 x 2 Matrix Multiply
• Using the XMM registers
– 64-bit/double precision/two doubles per XMM reg
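– C is stored in memory in column order; its columns are C1 and C2:
      C = | C1,1 C1,2 |
          | C2,1 C2,2 |
– Register contents:
      C1 = [ C1,1 | C2,1 ]
      C2 = [ C1,2 | C2,2 ]
      A  = [ A1,i | A2,i ]
      B1 = [ Bi,1 | Bi,1 ]
      B2 = [ Bi,2 | Bi,2 ]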
Example: 2 x 2 Matrix Multiply
• Initialization
  C1 = [ 0 | 0 ]
  C2 = [ 0 | 0 ]
• I=1
  A = [ A1,1 | A2,1 ]   _mm_load_pd: loads 2 doubles into an XMM reg; A is stored in memory in column order
Example: 2 x 2 Matrix Multiply
• Second iteration intermediate result
  C1 = [ C1,1 | C2,1 ] = [ A1,1B1,1+A1,2B2,1 | A2,1B1,1+A2,2B2,1 ]   c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
  C2 = [ C1,2 | C2,2 ] = [ A1,1B1,2+A1,2B2,2 | A2,1B1,2+A2,2B2,2 ]   c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));
  SSE instructions first do parallel multiplies and then parallel adds in XMM registers
• I=2
  A = [ A1,2 | A2,2 ]   _mm_load_pd: stored in memory in column order
Example: 2 x 2 Matrix Multiply
(Part 1 of 2)
#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in
// comments as v1 = [ a | b]
// where v1 is a variable of type __m128d and
// a, b are doubles

int main(void) {
    // allocate A,B,C aligned on 16-byte boundaries
    double A[4] __attribute__ ((aligned (16)));
    double B[4] __attribute__ ((aligned (16)));
    double C[4] __attribute__ ((aligned (16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1,c2,a,b1,b2;

    // Initialize A, B, C for example
    /* A = (note column order!)
       1 0
       0 1
    */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;
    /* B = (note column order!)
       1 3
       2 4
    */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;
    /* C = (note column order!)
       0 0
       0 0
    */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Example: 2 x 2 Matrix Multiply
(Part 2 of 2)
    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C+0*lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C+1*lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22]
        */
        a = _mm_load_pd(A+i*lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21]
        */
        b1 = _mm_load1_pd(B+i+0*lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22]
        */
        b2 = _mm_load1_pd(B+i+1*lda);
        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
        */
        c1 = _mm_add_pd(c1,_mm_mul_pd(a,b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
        */
        c2 = _mm_add_pd(c2,_mm_mul_pd(a,b2));
    }

    // store c1,c2 back into C for completion
    _mm_store_pd(C+0*lda,c1);
    _mm_store_pd(C+1*lda,c2);

    // print C
    printf("%g,%g\n%g,%g\n",C[0],C[2],C[1],C[3]);
    return 0;
}
Conclusion
• Flynn Taxonomy
• Intel SSE SIMD Instructions
– Exploit data-level parallelism in loops
– One instruction fetch that operates on multiple operands simultaneously
– 128-bit XMM registers
• SSE Instructions in C
– Embed the SSE machine instructions directly into C programs through use of intrinsics
– Can achieve efficiency beyond that of an optimizing compiler