SIMD Programming

CS 240A, 2017

Flynn* Taxonomy, 1966

• As of 2013, SIMD and MIMD are the most common kinds of parallelism in
  architectures – usually both in the same system!
• Most common parallel processing programming style: Single Program
  Multiple Data ("SPMD")
  – Single program that runs on all processors of a MIMD
  – Cross-processor execution coordination using synchronization
    primitives
• SIMD (aka hw-level data parallelism): specialized function units for
  handling lock-step calculations involving arrays
  – Scientific computing, signal processing, multimedia (audio/video
    processing)

*Prof. Michael Flynn, Stanford
Single-Instruction/Multiple-Data Stream (SIMD or "sim-dee")

• A SIMD computer exploits multiple data streams against a single
  instruction stream to perform operations that may be naturally
  parallelized, e.g., the Intel SIMD instruction extensions or the
  NVIDIA Graphics Processing Unit (GPU)
SIMD: Single Instruction, Multiple Data

• Scalar processing (the traditional mode): one operation produces one
  result
      X + Y = X+Y
• SIMD processing (with Intel SSE / SSE2; SSE = streaming SIMD
  extensions): one operation produces multiple results
      [x3 x2 x1 x0] + [y3 y2 y1 y0] = [x3+y3 x2+y2 x1+y1 x0+y0]

Slide source: Alex Klimovitski & Dean Macri, Intel Corporation
What does this mean to you?

• In addition to SIMD extensions, the processor may have other special
  instructions
  – Fused Multiply-Add (FMA) instructions:
        x = y + c * z
    is so common that some processors execute the multiply/add as a
    single instruction, at the same rate (bandwidth) as + or * alone
• In theory, the compiler understands all of this
  – When compiling, it will rearrange instructions to get a good
    "schedule" that maximizes pipelining and uses FMAs and SIMD
  – It works with the mix of instructions inside an inner loop or other
    block of code
• But in practice the compiler may need your help
  – Choose a different compiler, optimization flags, etc.
  – Rearrange your code to make things more obvious
  – Use special functions ("intrinsics") or write in assembly
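As a sketch of the FMA idea (my addition, not from the slides): C99's
fma() in <math.h> expresses a fused multiply-add directly, and compilers
can map it to a hardware FMA instruction when one is available:

    #include <math.h>

    /* A minimal sketch: y + c*z computed as a fused multiply-add.
       With hardware FMA (e.g., compiling with -mfma on x86), the
       compiler can emit a single instruction for this. */
    double axpy_term(double y, double c, double z) {
        return fma(c, z, y);  /* c*z + y with a single rounding */
    }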
Intel SIMD Extensions

• MMX: 64-bit registers, reusing the floating-point registers [1996]
• SSE/SSE2/3/4: eight new 128-bit registers [1999]
• AVX: new 256-bit registers [2011]
  – Space for expansion to 1024-bit registers
SSE / SSE2 SIMD on Intel

• SSE2 data types: anything that fits into 16 bytes, e.g.,
  – 4x floats
  – 2x doubles
  – 16x bytes
• Instructions perform add, multiply, etc. on all the data in parallel
• Similar on GPUs and vector processors (but with many more
  simultaneous operations)
Intel Architecture SSE2+ 128-Bit SIMD Data Types

• Note: in Intel Architecture (unlike MIPS) a word is 16 bits
  – Single-precision FP: double word (32 bits)
  – Double-precision FP: quad word (64 bits)
• A 128-bit register can hold:
  – 16 packed bytes (8 bits each)
  – 8 packed words (16 bits each)
  – 4 packed double words (32 bits each)
  – 2 packed quad words (64 bits each)
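As an illustrative sketch (my addition, not from the slides), the SSE
intrinsic header types expose these views of the same 128 bits in C:

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* The same 128-bit XMM register, viewed three ways: */
    __m128  f;  /* 4 packed single-precision floats */
    __m128d d;  /* 2 packed double-precision floats */
    __m128i i;  /* packed integers: 16 bytes, 8 words,
                   4 double words, or 2 quad words   */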
Packed and Scalar Double-Precision Floating-Point Operations

[Figure: a packed operation applies to all elements of the 128-bit
operands in parallel; a scalar operation applies only to the low-order
element.]
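A small sketch (my addition) contrasting packed and scalar adds with
SSE2 intrinsics; the function name is illustrative:

    #include <emmintrin.h>

    void packed_vs_scalar(double out_p[2], double out_s[2]) {
        /* x = [2.0 | 1.0], y = [20.0 | 10.0]
           (_mm_set_pd takes the high element first) */
        __m128d x = _mm_set_pd(2.0, 1.0);
        __m128d y = _mm_set_pd(20.0, 10.0);

        __m128d p = _mm_add_pd(x, y);  /* packed: [22.0 | 11.0] */
        __m128d s = _mm_add_sd(x, y);  /* scalar: [ 2.0 | 11.0] -- only
                                          the low elements are added; the
                                          high element is copied from x */
        _mm_storeu_pd(out_p, p);
        _mm_storeu_pd(out_s, s);
    }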
SSE/SSE2 Floating Point Instructions

• Move instructions do both load and store
  – xmm: one operand is a 128-bit SSE2 register
  – mem/xmm: the other operand is in memory or an SSE2 register
• Instruction suffixes:
  – {SS} Scalar Single-precision FP: one 32-bit operand in a 128-bit register
  – {PS} Packed Single-precision FP: four 32-bit operands in a 128-bit register
  – {SD} Scalar Double-precision FP: one 64-bit operand in a 128-bit register
  – {PD} Packed Double-precision FP: two 64-bit operands in a 128-bit register
  – {A} means the 128-bit operand is aligned in memory
  – {U} means the 128-bit operand is unaligned in memory
  – {H} means move the high half of the 128-bit operand
  – {L} means move the low half of the 128-bit operand
Example: SIMD Array Processing

    for each f in array
        f = sqrt(f)

Scalar style:

    for each f in array
    {
        load f to the floating-point register
        calculate the square root
        write the result from the register to memory
    }

SIMD style:

    for each 4 members in array
    {
        load 4 members to the SSE register
        calculate 4 square roots in one operation
        store the 4 results from the register to memory
    }
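A concrete sketch of the SIMD version with SSE intrinsics (my
illustration; assumes n is a multiple of 4 and a is 16-byte aligned):

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Square-root each element of a, 4 floats at a time.
       Assumes n is a multiple of 4 and a is 16-byte aligned. */
    void sqrt_array(float *a, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(a + i);  /* load 4 members         */
            v = _mm_sqrt_ps(v);             /* 4 square roots at once */
            _mm_store_ps(a + i, v);         /* store the 4 results    */
        }
    }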
Data-Level Parallelism and SIMD

• SIMD wants adjacent values in memory that can be operated on in
  parallel
• Usually specified in programs as loops
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
• How can we reveal more data-level parallelism than is available in a
  single iteration of a loop?
• Unroll the loop and adjust the iteration rate
Loop Unrolling in C

• Instead of the compiler doing loop unrolling, you could do it
  yourself in C
    for(i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

• Could be rewritten
    for(i=1000; i>0; i=i-4) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
Generalizing Loop Unrolling

• A loop of n iterations, unrolled with k copies of the body of the loop
• Assuming (n mod k) ≠ 0:
  – run the loop with 1 copy of the body (n mod k) times
  – then run it with k copies of the body floor(n/k) times
General Loop Unrolling with a Head

• Handling loop iteration counts not divisible by the step size.
    for(i=1003; i>0; i=i-1)
        x[i] = x[i] + s;

• Could be rewritten
    for(i=1003; i>1000; i--)  // handle the head (1003 mod 4 = 3 iterations)
        x[i] = x[i] + s;

    for(i=1000; i>0; i=i-4) { // handle the other iterations
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
Tail Method for General Loop Unrolling

• Handling loop iteration counts not divisible by the step size.
    for(i=1003; i>0; i=i-1)
        x[i] = x[i] + s;

• Could be rewritten (1003 % 4 == 3)
    for(i=1003; i > 1003 % 4; i=i-4) {
        x[i]   = x[i]   + s;
        x[i-1] = x[i-1] + s;
        x[i-2] = x[i-2] + s;
        x[i-3] = x[i-3] + s;
    }
    for(i = 1003 % 4; i>0; i--)  // special handling in the tail
        x[i] = x[i] + s;
Another Loop Unrolling Example

Normal loop:

    int x;
    for (x = 0; x < 103; x++) {
        delete(x);
    }

After loop unrolling:

    int x;
    for (x = 0; x < 103/5*5; x += 5) {
        delete(x);
        delete(x + 1);
        delete(x + 2);
        delete(x + 3);
        delete(x + 4);
    }
    /* Tail */
    for (x = 103/5*5; x < 103; x++) {
        delete(x);
    }
Intel SSE Intrinsics

• Intrinsics are C functions and procedures for inserting assembly
  language into C code, including SSE instructions

  Intrinsics:                Corresponding SSE instructions:

• Vector data type:
    __m128d
• Load and store operations:
    _mm_load_pd              MOVAPD / aligned, packed double
    _mm_store_pd             MOVAPD / aligned, packed double
    _mm_loadu_pd             MOVUPD / unaligned, packed double
    _mm_storeu_pd            MOVUPD / unaligned, packed double
• Load and broadcast across vector:
    _mm_load1_pd             MOVSD + shuffling/duplicating
• Arithmetic:
    _mm_add_pd               ADDPD / add, packed double
    _mm_mul_pd               MULPD / multiply, packed double
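As a sketch tying these intrinsics together (my addition; the function
name and the assumptions that n is even and the arrays are 16-byte
aligned are illustrative):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* c[i] = a[i] + b[i], two doubles at a time.
       Assumes n is even and all arrays are 16-byte aligned. */
    void add_arrays(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(a + i);         /* MOVAPD load   */
            __m128d vb = _mm_load_pd(b + i);
            _mm_store_pd(c + i, _mm_add_pd(va, vb)); /* ADDPD + store */
        }
    }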
Example 1: Use of SSE SIMD Instructions

• Scalar version:
    for (i = 0; i < n; i++) sum = sum + a[i];

• SIMD version:
    Set 128-bit temp = 0;
    for (i = 0; i < n/4*4; i = i+4) {
        Add the 4 integers in the 128 bits at &a[i] to temp;
    }
    Tail: copy out the 4 integers of temp and add them together into sum.
    for (i = n/4*4; i < n; i++) sum += a[i];
Related SSE SIMD Instructions

__m128i _mm_setzero_si128()                  returns a 128-bit zero vector

__m128i _mm_loadu_si128(__m128i *p)          loads the 128 bits of data at
                                             pointer p into a vector and
                                             returns it

__m128i _mm_add_epi32(__m128i a, __m128i b)  returns the vector
                                             (a0+b0, a1+b1, a2+b2, a3+b3)

void _mm_storeu_si128(__m128i *p, __m128i a) stores the content of the
                                             128-bit vector a to memory
                                             starting at pointer p
Related SSE SIMD Instructions (continued)

• Add the 4 integers in the 128 bits at &a[i] to the temp vector, with
  loop body temp = temp + a[i..i+3]
• Add 128 bits, then the next 128 bits, ...

    __m128i temp = _mm_setzero_si128();
    __m128i temp1 = _mm_loadu_si128((__m128i *)(a + i));
    temp = _mm_add_epi32(temp, temp1);
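Putting the pieces of Example 1 together, a complete sketch (my
addition; the function name is illustrative, and unaligned loads avoid
any alignment assumption):

    #include <emmintrin.h>  /* SSE2 intrinsics */

    int sum_array(const int *a, int n) {
        __m128i temp = _mm_setzero_si128();   /* 4 partial sums = 0 */
        int i;
        for (i = 0; i < n/4*4; i += 4) {
            __m128i t1 = _mm_loadu_si128((const __m128i *)(a + i));
            temp = _mm_add_epi32(temp, t1);   /* 4 adds in parallel */
        }
        int partial[4];
        _mm_storeu_si128((__m128i *)partial, temp); /* copy out 4 sums */
        int sum = partial[0] + partial[1] + partial[2] + partial[3];
        for (; i < n; i++)                    /* tail */
            sum += a[i];
        return sum;
    }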
Example 2: 2 x 2 Matrix Multiply

Definition of matrix multiply:

    Ci,j = (A×B)i,j = Σ (k = 1 to 2) Ai,k × Bk,j

    A1,1 A1,2     B1,1 B1,2     C1,1 = A1,1B1,1 + A1,2B2,1   C1,2 = A1,1B1,2 + A1,2B2,2
    A2,1 A2,2  x  B2,1 B2,2  =  C2,1 = A2,1B1,1 + A2,2B2,1   C2,2 = A2,1B1,2 + A2,2B2,2

Numeric example:

    1 0     1 3     C1,1 = 1*1 + 0*2 = 1   C1,2 = 1*3 + 0*4 = 3
    0 1  x  2 4  =  C2,1 = 0*1 + 1*2 = 2   C2,2 = 0*3 + 1*4 = 4
Example: 2 x 2 Matrix Multiply

• Using the XMM registers: 64-bit double precision, two doubles per XMM
  register
  – C1 = [C1,1 | C2,1], C2 = [C1,2 | C2,2]  (C stored in memory in
    column order)
  – A  = [A1,i | A2,i]  (column i of A)
  – B1 = [Bi,1 | Bi,1], B2 = [Bi,2 | Bi,2]  (elements of row i of B,
    duplicated)
Example: 2 x 2 Matrix Multiply

• Initialization
  – C1 = [0 | 0], C2 = [0 | 0]

• i = 1
  – A  = [A1,1 | A2,1]   _mm_load_pd: loads 2 doubles into an XMM
    register (A is stored in memory in column order)
  – B1 = [B1,1 | B1,1], B2 = [B1,2 | B1,2]   _mm_load1_pd: SSE
    instruction that loads a double word and stores it in the high and
    low double words of the XMM register (duplicates the value in both
    halves of the XMM register)
Example: 2 x 2 Matrix Multiply

• First iteration intermediate result (i = 1)
  – C1 = [0 + A1,1B1,1 | 0 + A2,1B1,1]   c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
  – C2 = [0 + A1,1B1,2 | 0 + A2,1B1,2]   c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
  – The SSE instructions first do the parallel multiplies and then the
    parallel adds in XMM registers
Example: 2 x 2 Matrix Multiply

• i = 2
  – A  = [A1,2 | A2,2]   _mm_load_pd (stored in memory in column order)
  – B1 = [B2,1 | B2,1], B2 = [B2,2 | B2,2]   _mm_load1_pd (duplicates
    the value in both halves of the XMM register)
Example: 2 x 2 Matrix Multiply

• Second iteration intermediate result (i = 2)
  – C1 = [A1,1B1,1 + A1,2B2,1 | A2,1B1,1 + A2,2B2,1] = [C1,1 | C2,1]
         c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
  – C2 = [A1,1B1,2 + A1,2B2,2 | A2,1B1,2 + A2,2B2,2] = [C1,2 | C2,2]
         c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
Example: 2 x 2 Matrix Multiply (Part 1 of 2)

#include <stdio.h>
// header file for SSE compiler intrinsics
#include <emmintrin.h>

// NOTE: vector registers will be represented in comments as v1 = [a | b]
// where v1 is a variable of type __m128d and a, b are doubles

int main(void) {
    // allocate A, B, C aligned on 16-byte boundaries
    double A[4] __attribute__ ((aligned (16)));
    double B[4] __attribute__ ((aligned (16)));
    double C[4] __attribute__ ((aligned (16)));
    int lda = 2;
    int i = 0;
    // declare several 128-bit vector variables
    __m128d c1, c2, a, b1, b2;

    // Initialize A, B, C for example
    /* A = (note column order!)
       1 0
       0 1
    */
    A[0] = 1.0; A[1] = 0.0; A[2] = 0.0; A[3] = 1.0;

    /* B = (note column order!)
       1 3
       2 4
    */
    B[0] = 1.0; B[1] = 2.0; B[2] = 3.0; B[3] = 4.0;

    /* C = (note column order!)
       0 0
       0 0
    */
    C[0] = 0.0; C[1] = 0.0; C[2] = 0.0; C[3] = 0.0;
Example: 2 x 2 Matrix Multiply (Part 2 of 2)

    // use aligned loads to set
    // c1 = [c_11 | c_21]
    c1 = _mm_load_pd(C + 0*lda);
    // c2 = [c_12 | c_22]
    c2 = _mm_load_pd(C + 1*lda);

    for (i = 0; i < 2; i++) {
        /* a =
           i = 0: [a_11 | a_21]
           i = 1: [a_12 | a_22]
        */
        a = _mm_load_pd(A + i*lda);
        /* b1 =
           i = 0: [b_11 | b_11]
           i = 1: [b_21 | b_21]
        */
        b1 = _mm_load1_pd(B + i + 0*lda);
        /* b2 =
           i = 0: [b_12 | b_12]
           i = 1: [b_22 | b_22]
        */
        b2 = _mm_load1_pd(B + i + 1*lda);

        /* c1 =
           i = 0: [c_11 + a_11*b_11 | c_21 + a_21*b_11]
           i = 1: [c_11 + a_12*b_21 | c_21 + a_22*b_21]
        */
        c1 = _mm_add_pd(c1, _mm_mul_pd(a, b1));
        /* c2 =
           i = 0: [c_12 + a_11*b_12 | c_22 + a_21*b_12]
           i = 1: [c_12 + a_12*b_22 | c_22 + a_22*b_22]
        */
        c2 = _mm_add_pd(c2, _mm_mul_pd(a, b2));
    }

    // store c1, c2 back into C for completion
    _mm_store_pd(C + 0*lda, c1);
    _mm_store_pd(C + 1*lda, c2);

    // print C
    printf("%g,%g\n%g,%g\n", C[0], C[2], C[1], C[3]);
    return 0;
}
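For reference (my addition, not from the slides): SSE2 is part of the
base x86-64 ISA, so something like gcc -O2 example.c should build this
on a 64-bit x86 system; 32-bit builds may need -msse2. With A set to
the identity matrix, C = B, and the program prints 1,3 then 2,4.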
Conclusion

• Flynn Taxonomy
• Intel SSE SIMD instructions
  – Exploit data-level parallelism in loops
  – One instruction fetch operates on multiple operands simultaneously
  – 128-bit XMM registers
• SSE instructions in C
  – Embed the SSE machine instructions directly into C programs through
    the use of intrinsics
  – Achieve efficiency beyond that of an optimizing compiler
