Programming With SIMD-instructions
Extensions to the Intel and AMD x86 instruction set for parallel operations on packed integer or floating-point data
Data parallelism: parallel vector operations
applies the same operation in parallel on a number of data items packed into a 64-, 128- or 256-bit vector
also supports scalar operations on integer or floating-point values
Applications can benefit from SIMD processing if they have the following characteristics:
small integer or floating-point data types (8-bit pixel values or characters, 16-bit audio samples, 32-bit floating-point values)
small, highly repetitive loops
frequent additions, multiplications or other simple operations
compute-intensive algorithms
data parallelism: they can operate on independent values in parallel
SIMD operation
SIMD execution
Performs an operation in parallel on an array of 2, 4, 8 or 16 values (a data-parallel operation)
Source 1:     X3        X2        X1        X0
Source 2:     Y3        Y2        Y1        Y0
Destination:  X3 op Y3  X2 op Y2  X1 op Y1  X0 op Y0
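The element-wise behaviour described above can be modelled in plain C. This is only a sketch of the semantics, not of how the hardware is programmed: the loop body is what a single SIMD instruction would do for all elements at once. The name simd_add_model and the choice of four float elements with addition as the operation are assumptions made for illustration.

```c
enum { LANES = 4 };   /* assumed number of packed values (4 floats = 128 bits) */

/* Scalar model of one packed SIMD addition (hypothetical helper).
   Each destination element depends only on the two source elements
   at the same position, so all iterations are independent. */
void simd_add_model(const float src1[LANES], const float src2[LANES],
                    float dst[LANES])
{
    for (int i = 0; i < LANES; i++) {
        dst[i] = src1[i] + src2[i];   /* dst[i] = X_i op Y_i, with op = + */
    }
}
```

Because no iteration reads a value written by another iteration, the four additions can be carried out at the same time.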
MMX registers
Eight 64-bit MMX registers
aliased to the x87 floating-point registers
no stack organization
can store 1, 2, 4 or 8 packed integer values
[Figure: the MMX registers mapped onto the x87 floating-point registers]
Cannot use the x87 floating-point unit and the MMX unit at the same time,
because they share the same set of registers
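[Figure: packed integer data types in a 64-bit MMX register (bits 63..0): eight bytes b7..b0, four words w3..w0, two doublewords dw1 and dw0, or one quadword qw]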
MMX operations are limited to integer values
the SSE extensions also provide operations on floating-point data
MMX instructions
MMX introduced 47 new instructions for operation on packed integer data
arithmetic
addition, subtraction, multiplication, multiply-and-add; also with signed and unsigned saturation
comparison
compare for equal, compare for greater than
conversion
packing and unpacking of data
logical
and, or, xor, and-not
SSE
Streaming SIMD Extensions
introduced with the Pentium III processor
designed to speed up performance of advanced 2D and 3D graphics, motion video, videoconferencing, image processing, speech recognition, ...
supports only single-precision floating-point operations
[Figure: a 128-bit register holding four packed single-precision floating-point values s3..s0]
XMM registers
SSE adds a set of new 128-bit XMM registers
8 XMM registers in 32-bit mode (XMM0 to XMM7)
16 XMM registers in 64-bit mode (XMM0 to XMM15)
each register is 128 bits wide (bits 127..0)
SSE instructions
The SSE extension added 70 new instructions to the instruction set
50 for SIMD floating-point operations
12 for SIMD integer operations
8 for cache control
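Packed operation: operates on all four values
Source 1:     X3        X2        X1        X0
Source 2:     Y3        Y2        Y1        Y0
Destination:  X3 op Y3  X2 op Y2  X1 op Y1  X0 op Y0
Scalar operation: operates only on the lowest value, the other values are copied from source 1
Source 1:     X3  X2  X1  X0
Source 2:     Y3  Y2  Y1  Y0
Destination:  X3  X2  X1  X0 op Y0
The compiler uses the scalar operations for floating-point arithmetic instead of the x87 floating-point unit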
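The difference between a packed and a scalar SSE operation can be sketched with the C intrinsics _mm_add_ps (packed add) and _mm_add_ss (scalar add). The wrapper names packed_add and scalar_add are hypothetical, and unaligned loads and stores are used so that the sketch makes no assumption about how the arrays are aligned.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Packed add: all four elements are added (hypothetical wrapper). */
void packed_add(const float *x, const float *y, float *dst)
{
    __m128 vx = _mm_loadu_ps(x);            /* x[0] goes to the lowest element */
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(dst, _mm_add_ps(vx, vy));
}

/* Scalar add: only the lowest element is added, the three upper
   elements are copied unchanged from the first operand. */
void scalar_add(const float *x, const float *y, float *dst)
{
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    _mm_storeu_ps(dst, _mm_add_ss(vx, vy));
}
```

With x = {1, 2, 3, 4} and y = {10, 20, 30, 40}, packed_add stores {11, 22, 33, 44} while scalar_add stores {11, 2, 3, 4}.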
SSE2
Streaming SIMD Extensions 2
introduced in the Pentium 4 processor
designed to speed up performance of advanced 3D graphics, video encoding/decoding, speech recognition, E-commerce and Internet, scientific and engineering applications
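[Figure: packed data types in a 128-bit XMM register (bits 127..0) with SSE2: two double-precision floating-point values d1 and d0, sixteen bytes 15..0, eight words w7..w0, four doublewords d3..d0, or two quadwords q1 and q0]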
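As a small illustration of two of the SSE2 data types, the following sketch adds two packed double-precision values and eight packed 16-bit words. It uses the SSE2 intrinsics _mm_add_pd and _mm_add_epi16 from emmintrin.h; the function names are hypothetical and unaligned loads and stores are an assumption made to keep the example free of alignment requirements.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Adds two packed doubles: dst[i] = x[i] + y[i] for i = 0, 1. */
void add_2_doubles(const double *x, const double *y, double *dst)
{
    __m128d vx = _mm_loadu_pd(x);
    __m128d vy = _mm_loadu_pd(y);
    _mm_storeu_pd(dst, _mm_add_pd(vx, vy));
}

/* Adds eight packed 16-bit words: dst[i] = x[i] + y[i] for i = 0..7
   (wraps around on overflow). */
void add_8_words(const int16_t *x, const int16_t *y, int16_t *dst)
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi16(vx, vy));
}
```

Both functions work on the same 128-bit register width; only the interpretation of the bits as 2 x 64-bit or 8 x 16-bit elements differs.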
SSE instructions
MMX instructions have names composed of different fields
a prefix P, which stands for Packed
the operation, for example ADD, SUB or MUL for arithmetic operations
US (Unsigned Saturation) or S (Signed Saturation)
a suffix describing the data type
B  Packed Byte, eight 8-bit bytes
W  Packed Word, four 16-bit words
D  Packed Doubleword, two 32-bit doublewords
Q  Quadword, one single 64-bit quadword
Example:
PADDB   Add Packed Byte
PADDSB  Add Packed Signed Byte Integers with Signed Saturation
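To show what the saturation field in the instruction name changes, here is a minimal sketch that compares a wrap-around byte addition (PADDB) with an unsigned saturating byte addition (PADDUSB). As an assumption for portability it uses the 128-bit SSE2 forms of these instructions through the intrinsics _mm_add_epi8 and _mm_adds_epu8, so it works on 16 bytes rather than the 8 bytes of an MMX register. The function names are hypothetical.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* PADDB: adds 16 packed bytes, results wrap around modulo 256. */
void add_bytes_wrap(const uint8_t *x, const uint8_t *y, uint8_t *dst)
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi8(vx, vy));
}

/* PADDUSB: adds 16 packed unsigned bytes, results are clamped to 255. */
void add_bytes_usat(const uint8_t *x, const uint8_t *y, uint8_t *dst)
{
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(vx, vy));
}
```

For pixel values 200 and 100, the wrap-around version yields 44 (300 modulo 256), while the saturating version yields 255, which is usually the desired result when brightening an image.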
Automatic vectorization
The compiler automatically recognizes loops that can be implemented with vectorized code
very easy to use, no changes to the program code are needed
Only loops that can be analyzed and that are found to be suitable for SIMD execution are vectorized
does not guarantee any performance improvement
has no effect if the compiler cannot analyze the code and find the opportunities for SIMD operation
Requires a compiler with vectorizing capabilities
in gcc, vectorization is enabled by -O3 (use -ftree-vectorizer-verbose=1 to print reports about which loops are vectorized)
the Intel compiler, icc, also does advanced vectorization
gcc -O3 -ftree-vectorizer-verbose=1 saxpy.c -o saxpy

saxpy.c:9: note: created 1 versioning for alias checks.

saxpy.c:9: note: LOOP VECTORIZED.
saxpy.c:6: note: vectorized 1 loops in function.
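The source file saxpy.c is not shown in this material. A loop of the following kind is what an auto-vectorizer typically handles well; this is a minimal sketch that assumes the usual definition of saxpy (y = a*x + y on single-precision arrays), and its line numbers are not meant to match the compiler report.

```c
#include <stddef.h>

/* Single-precision a*x plus y: y[i] = a * x[i] + y[i].
   Assumed form of the saxpy routine; every iteration is independent,
   so the compiler can process several elements per instruction. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Since x and y are plain pointers, the compiler cannot know whether the arrays overlap, which is the kind of situation in which it reports versioning for alias checks. Declaring the pointers with the C99 restrict qualifier is one way to tell the compiler that they do not overlap. Newer gcc releases report vectorization through the -fopt-info-vec option.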
C intrinsic functions
Example:
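multiply two arrays A and B of 400 single-precision floating-point values
#include <xmmintrin.h>   /* SSE intrinsics */

#define SIZE 400

/* _mm_load_ps and _mm_store_ps require 16-byte aligned addresses */
_Alignas(16) float A[SIZE], B[SIZE], C[SIZE];
__m128 m1, m2, m3;

/* process four values per iteration (SIZE is a multiple of 4) */
for (int i = 0; i < SIZE; i += 4) {
    m1 = _mm_load_ps(A + i);      /* load A[i..i+3] */
    m2 = _mm_load_ps(B + i);      /* load B[i..i+3] */
    m3 = _mm_mul_ps(m1, m2);      /* four multiplications at once */
    _mm_store_ps(C + i, m3);      /* store C[i..i+3] */
}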
Variables of vector data types have to be aligned to 16-byte boundaries
May also need to access the individual values in the packed data
can be done by using a union structure
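Accessing individual values through a union can be sketched as follows. The union overlays a __m128 vector with an array of four floats; the type name vec4f and the helper functions are hypothetical names chosen for this example.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Overlay of a packed vector and its four elements (hypothetical type).
   The union inherits the 16-byte alignment of __m128. */
typedef union {
    __m128 v;      /* the packed 128-bit value */
    float  f[4];   /* f[0] is the lowest element */
} vec4f;

/* Returns element i (0..3) of the packed vector. */
float get_element(__m128 v, int i)
{
    vec4f u;
    u.v = v;
    return u.f[i];
}

/* Sums the four elements of the packed vector. */
float horizontal_sum(__m128 v)
{
    vec4f u;
    u.v = v;
    return u.f[0] + u.f[1] + u.f[2] + u.f[3];
}
```

For example, get_element(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f), 0) returns 1.0f, because _mm_set_ps takes its arguments from the highest element down to the lowest.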