
Programming With SIMD-instructions

The document discusses programming with SIMD (Single Instruction Multiple Data) instructions. It describes several SIMD extensions including MMX, SSE, SSE2 and AVX, which allow parallel operations on packed integer or floating-point data. It provides details on the characteristics of SIMD operations, different data types, instructions, and programming techniques for utilizing SIMD, including using compiler intrinsics or inline assembly. Automatic vectorization by advanced compilers is also discussed, allowing SIMD parallelism without requiring explicit programming.

Uploaded by

Mahipal Yadav
Copyright
© Attribution Non-Commercial (BY-NC)

Programming with SIMD-instructions

SIMD = Single Instruction stream, Multiple Data stream


From Flynn's taxonomy: SISD, SIMD, (MISD), MIMD

Extensions to the Intel and AMD x86 instruction set for parallel operations on packed integer or floating-point data
  - data parallelism, parallel vector operations
  - applies the same operation in parallel on a number of data items packed into a 64-, 128- or 256-bit vector
  - also supports scalar operations on integer or floating-point values

Originally designed to speed up media processing applications


can also be very useful in other types of applications

There are many different versions of SIMD extensions


MMX, SSE, SSE2, SSE3, SSE4, 3DNow!, Altivec, VIS, AVX, ...

MMX, SSE and AVX


Extensions to the IA-32 and x86-64 instruction sets for parallel SIMD operations on packed data

MMX: Multimedia Extensions
  - introduced in the Pentium MMX processor, 1997
  - supports only integer operations

SSE: Streaming SIMD Extensions
  - introduced in Pentium III, 1999
  - support for single-precision floating-point operations
  - SSE2 (Streaming SIMD Extensions 2) was introduced in Pentium 4, 2000
  - SSE2 also supports double-precision floating-point operations
  - later extensions: SSE3, SSSE3, SSE4

AVX: Advanced Vector Extensions
  - announced in 2008, supported in the Intel Sandy Bridge processors
  - extends the vector registers to 256 bits

Characteristics of SIMD operations


The SIMD extensions were designed to speed up multimedia and communication applications
  - graphics and image processing
  - video and audio processing
  - speech compression and recognition
  - can also be used for data-intensive scientific computations

Applications can benefit from SIMD processing if they have the following characteristics
  - small integer or floating-point data types (8-bit pixel values or characters, 16-bit audio samples, 32-bit floating-point values)
  - small, highly repetitive loops
  - frequent additions, multiplications or other simple operations
  - compute-intensive algorithms
  - data parallelism, can operate on independent values in parallel

SIMD operation
SIMD execution
  - performs an operation in parallel on an array of 2, 4, 8 or 16 values
  - data parallel operation

The operation can be a
  - data movement instruction
  - arithmetic instruction
  - logical instruction
  - comparison instruction
  - conversion instruction
  - shuffle instruction

    Source 1:     X3        X2        X1        X0
    Source 2:     Y3        Y2        Y1        Y0
    Destination:  X3 op Y3  X2 op Y2  X1 op Y1  X0 op Y0
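The element-wise picture above can be sketched in C. This is a minimal illustration, not code from the slides: it assumes an x86 compiler that provides the SSE intrinsics header xmmintrin.h (intrinsics are covered later in this document), and the function names scalar_add4 and simd_add4 are hypothetical. Addition stands in for the generic "op".

```c
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

/* Scalar reference: four separate additions, one element at a time.
   (Hypothetical helper, used only for comparison.) */
void scalar_add4(const float *x, const float *y, float *dst)
{
    for (int i = 0; i < 4; i++)
        dst[i] = x[i] + y[i];
}

/* SIMD version: the same four additions issued as one packed operation.
   (Hypothetical helper name.) */
void simd_add4(const float *x, const float *y, float *dst)
{
    __m128 vx = _mm_loadu_ps(x);     /* load X0..X3, no alignment required */
    __m128 vy = _mm_loadu_ps(y);     /* load Y0..Y3 */
    __m128 vr = _mm_add_ps(vx, vy);  /* Xi + Yi for all four lanes at once */
    _mm_storeu_ps(dst, vr);          /* write the four results back */
}
```

Both functions should produce identical results; the SIMD version simply replaces the four-iteration loop with one packed instruction.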

MMX registers
8 64-bit MMX registers (MM0 to MM7, bits 63..0)
  - aliased to the x87 floating-point registers
  - no stack organization
  - can store 1, 2, 4 or 8 packed integer values

MMX registers can only hold data
  - not memory addresses
  - the general-purpose registers are used for addresses

Cannot use the x87 floating-point unit and the MMX unit at the same time
  - they share the same set of registers

MMX data types


MMX instructions operate on 8-, 16-, 32- or 64-bit integer values, packed into a 64-bit field

4 MMX data types
  - packed byte: 8 bytes packed into a 64-bit quantity
  - packed word: 4 16-bit words packed into a 64-bit quantity
  - packed doubleword: 2 32-bit doublewords packed into a 64-bit quantity
  - quadword: one 64-bit quantity

Layout (bit 63 on the left, bit 0 on the right)

    packed byte:        b7 | b6 | b5 | b4 | b3 | b2 | b1 | b0
    packed word:        w3 | w2 | w1 | w0
    packed doubleword:  dw1 | dw0
    quadword:           qw

MMX operations are limited to integer values
  - the SSE extensions also provide operations on floating-point data

MMX instructions
MMX introduced 47 new instructions for operations on packed integer data
  - arithmetic: addition, subtraction, multiplication, multiply and add; also with signed and unsigned saturation
  - comparison: compare for equal, compare for greater than
  - conversion: packing and unpacking of data
  - logical: and, or, xor, and not

The MMX instructions start with the prefix P (for Packed)
  - Ex: paddb, paddw, paddd add packed bytes/words/doublewords (8/16/32-bit integers)

SSE
Streaming SIMD Extensions
  - introduced with the Pentium III processor
  - designed to speed up performance of advanced 2D and 3D graphics, motion video, videoconferencing, image processing, speech recognition, ...
  - supports only single-precision floating-point operations

Parallel operations on packed single-precision floating-point values
  - 128-bit packed single-precision floating-point data type
  - four IEEE 32-bit floating-point values packed into a 128-bit field
  - data must be aligned in memory on 16-byte boundaries (for the aligned load/store instructions and for memory operands)

Layout (bit 127 on the left, bit 0 on the right)

    s3 | s2 | s1 | s0

XMM registers
SSE adds a set of new 128-bit XMM registers
  - 8 XMM registers (XMM0 to XMM7) in 32-bit mode
  - 16 XMM registers (XMM0 to XMM15) in 64-bit mode

The XMM registers are new physical registers
  - not aliased to any other registers
  - independent of the general-purpose and FPU/MMX registers
  - can mix MMX and SSE instructions

XMM registers can be accessed in 32-bit, 64-bit or 128-bit mode
  - only for operations on data, not addresses

There is also a 32-bit control and status register, MXCSR
  - flag and mask bits for floating-point exceptions
  - rounding control bits
  - flush-to-zero bit
  - denormals-are-zero bit
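As a rough illustration of how MXCSR can be inspected and changed from C, the sketch below uses the intrinsics _mm_getcsr and the _MM_SET_FLUSH_ZERO_MODE / _MM_GET_FLUSH_ZERO_MODE macros from xmmintrin.h. The wrapper names read_mxcsr, set_flush_to_zero and flush_to_zero_enabled are hypothetical and not part of the slides.

```c
#include <xmmintrin.h>   /* _mm_getcsr, _MM_SET_FLUSH_ZERO_MODE, _MM_GET_FLUSH_ZERO_MODE */

/* Returns the current 32-bit contents of MXCSR. (Hypothetical wrapper.) */
unsigned int read_mxcsr(void)
{
    return _mm_getcsr();
}

/* Turns the flush-to-zero bit on (enable != 0) or off (enable == 0),
   leaving the other MXCSR bits unchanged. (Hypothetical wrapper.) */
void set_flush_to_zero(int enable)
{
    _MM_SET_FLUSH_ZERO_MODE(enable ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF);
}

/* Returns 1 if flush-to-zero is currently enabled, otherwise 0. */
int flush_to_zero_enabled(void)
{
    return _MM_GET_FLUSH_ZERO_MODE() == _MM_FLUSH_ZERO_ON;
}
```

MXCSR is per-thread state, so a program that changes it would normally save the old value with _mm_getcsr and restore it with _mm_setcsr afterwards.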

SSE instructions
The SSE extension added 70 new instructions to the instruction set
  - 50 for SIMD floating-point operations
  - 12 for SIMD integer operations
  - 8 for cache control

Supports both packed and scalar single-precision floating-point instructions
  - operations on packed 32-bit floating-point values
      - packed instructions have the suffix PS
  - operations on a scalar 32-bit floating-point value (the 32 LSB)
      - scalar instructions have the suffix SS

Also included some 64-bit SIMD integer instructions
  - extension to MMX
  - operations on packed integer values stored in MMX registers


Packed and scalar operations


Packed SSE operations apply an operation in parallel on 2 or 4 floating-point values

    Source 1:     X3        X2        X1        X0
    Source 2:     Y3        Y2        Y1        Y0
    Destination:  X3 op Y3  X2 op Y2  X1 op Y1  X0 op Y0

Scalar SSE operations apply an operation on a single (scalar) floating-point value
  - the compiler uses this for floating-point operations instead of the x87 FP unit

    Source 1:     X3        X2        X1        X0
    Source 2:     Y3        Y2        Y1        Y0
    Destination:  X3        X2        X1        X0 op Y0
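The difference between the two forms can be seen by running the same inputs through a packed and a scalar add. The sketch below assumes the SSE intrinsics _mm_add_ps (ADDPS) and _mm_add_ss (ADDSS) from xmmintrin.h; the function names add_packed and add_scalar are hypothetical.

```c
#include <xmmintrin.h>   /* _mm_add_ps (ADDPS), _mm_add_ss (ADDSS) */

/* Packed add: all four lanes are added, dst[i] = x[i] + y[i]. */
void add_packed(const float *x, const float *y, float *dst)
{
    __m128 r = _mm_add_ps(_mm_loadu_ps(x), _mm_loadu_ps(y));
    _mm_storeu_ps(dst, r);
}

/* Scalar add: only the lowest lane is added, dst[0] = x[0] + y[0].
   The upper three lanes are copied unchanged from the first source. */
void add_scalar(const float *x, const float *y, float *dst)
{
    __m128 r = _mm_add_ss(_mm_loadu_ps(x), _mm_loadu_ps(y));
    _mm_storeu_ps(dst, r);
}
```

With x = {1, 2, 3, 4} and y = {10, 20, 30, 40}, the packed version yields {11, 22, 33, 44} while the scalar version yields {11, 2, 3, 4}. Array index 0 corresponds to X0, the rightmost lane in the diagrams.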


SSE2
Streaming SIMD Extensions 2
  - introduced in the Pentium 4 processor
  - designed to speed up performance of advanced 3D graphics, video encoding/decoding, speech recognition, e-commerce and Internet, scientific and engineering applications

Extends MMX and SSE with support for
  - packed double-precision floating-point values
  - packed integer values
  - adds over 70 new instructions to the instruction set

Operates on 128-bit entities in the XMM registers
  - must be aligned on 16-byte boundaries when stored in memory


SSE2 data types


All layouts below are shown with bit 127 on the left and bit 0 on the right.

128-bit packed double-precision floating point
  - 2 IEEE double-precision floating-point values

    d1 | d0

128-bit packed byte integer
  - 16 byte integers (8 bits)

    15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0

128-bit packed word integer
  - 8 word integers (16 bits)

    w7 | w6 | w5 | w4 | w3 | w2 | w1 | w0

128-bit packed doubleword integer
  - 4 doubleword integers (32 bits)

    d3 | d2 | d1 | d0

128-bit packed quadword integer
  - 2 quadword integers (64 bits)

    q1 | q0
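The integer data types differ only in how the same 128 bits are divided into elements. The following sketch, which is not from the slides, adds 1 to every element of one register twice: once treating it as 16 bytes and once as 4 doublewords. It assumes the SSE2 intrinsics in emmintrin.h; add_as_bytes and add_as_dwords are hypothetical names.

```c
#include <emmintrin.h>   /* SSE2: __m128i, _mm_add_epi8, _mm_add_epi32 */
#include <stdint.h>

/* Treats the 128 bits as 16 packed bytes and adds 1 to each byte (PADDB).
   A carry out of one byte is discarded, it never reaches the next byte. */
void add_as_bytes(const uint32_t *src, uint32_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    __m128i r = _mm_add_epi8(v, _mm_set1_epi8(1));
    _mm_storeu_si128((__m128i *)dst, r);
}

/* Treats the same 128 bits as 4 packed doublewords and adds 1 to each (PADDD).
   Carries propagate within each 32-bit element. */
void add_as_dwords(const uint32_t *src, uint32_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    __m128i r = _mm_add_epi32(v, _mm_set1_epi32(1));
    _mm_storeu_si128((__m128i *)dst, r);
}
```

For an input doubleword 0x000000FF, the byte-wise add gives 0x01010100 (the low byte wraps to 0 and each of the other three bytes becomes 1), while the doubleword add gives 0x00000100.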


SSE instructions
MMX instructions have names composed of different fields
  - a prefix P, stands for Packed
  - the operation, for example ADD, SUB or MUL
  - for arithmetic operations, US (Unsigned Saturation) or S (Signed Saturation)
  - a suffix describing the data type
      - B: Packed Byte, 8 bytes
      - W: Packed Word, four 16-bit words
      - D: Packed Doubleword, two 32-bit doublewords
      - Q: Quadword, one single 64-bit quadword

Example:
  - PADDB: Add Packed Byte
  - PADDSB: Add Packed Signed Byte Integers with Signed Saturation

SSE2 operations on packed double-precision data have the suffix PD
  - examples: ADDPD, MULPD, MAXPD, ANDPD

SSE2 operations on scalar double-precision data have the suffix SD
  - examples: MOVSD, ADDSD, MULSD, MINSD
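To make the plain, US and S variants concrete, the sketch below compares PADDB, PADDUSB and PADDSB on single byte values. It is an illustration under assumptions: it uses the 128-bit SSE2 forms of these instructions through the intrinsics _mm_add_epi8, _mm_adds_epu8 and _mm_adds_epi8 rather than the 64-bit MMX forms, and the helper names are hypothetical.

```c
#include <emmintrin.h>   /* SSE2: _mm_add_epi8, _mm_adds_epu8, _mm_adds_epi8 */
#include <stdint.h>

/* PADDB: wraparound add, the result is taken modulo 256. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    __m128i r = _mm_add_epi8(_mm_set1_epi8((char)a), _mm_set1_epi8((char)b));
    return (uint8_t)_mm_cvtsi128_si32(r);   /* keep the lowest byte */
}

/* PADDUSB: unsigned saturation, results above 255 are clamped to 255. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    __m128i r = _mm_adds_epu8(_mm_set1_epi8((char)a), _mm_set1_epi8((char)b));
    return (uint8_t)_mm_cvtsi128_si32(r);
}

/* PADDSB: signed saturation, results are clamped to the range -128..127. */
int8_t add_sat_i8(int8_t a, int8_t b)
{
    __m128i r = _mm_adds_epi8(_mm_set1_epi8(a), _mm_set1_epi8(b));
    return (int8_t)_mm_cvtsi128_si32(r);
}
```

For example, 200 + 100 wraps to 44 with PADDB but saturates to 255 with PADDUSB. Saturation is usually what pixel and audio code wants, since an overflowing sum stays at the brightest or loudest value instead of wrapping around.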

Programming with MMX and SSE


There are different ways a programmer can use SSE in a program
  - automatic vectorization by the compiler
      - no explicit SSE programming needed, but requires a vectorizing compiler
  - arithmetic operations on vector data types
      - declare variables of vector type
      - express computations as normal arithmetic expressions
  - compiler intrinsic functions for SSE operations
      - functions that provide access to the MMX/SSE instructions from a high-level language
      - also requires a detailed knowledge of MMX/SSE operation
  - program with inline assembly language
      - very good possibilities to arrange instructions for efficient execution
      - difficult to program, error prone
      - requires detailed knowledge of MMX/SSE operation and assembly language programming

Automatic vectorization
The compiler automatically recognizes loops that can be implemented with vectorized code
  - very easy to use, no changes to the program code are needed

Only loops that can be analyzed and that are found to be suitable for SIMD execution are vectorized
  - does not guarantee any performance improvement
  - has no effect if the compiler cannot analyze the code and find the opportunities for SIMD operation

Requires a compiler with vectorizing capabilities
  - in gcc, vectorization is enabled by -O3 (use -ftree-vectorizer-verbose=1 to print reports about which loops are vectorized; newer gcc releases replace this option with -fopt-info-vec)
  - the Intel compiler, icc, also does advanced vectorization

    gcc -O3 -ftree-vectorizer-verbose=1 saxpy.c -o saxpy

    saxpy.c:9: note: created 1 versioning for alias checks.

    saxpy.c:9: note: LOOP VECTORIZED.
    saxpy.c:6: note: vectorized 1 loops in function.
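The slides do not show the saxpy.c source behind this report. The following is only a guess at the kind of loop involved: a plain SAXPY kernel (y = alpha*x + y) written without any SIMD-specific code. The function signature is an assumption.

```c
#include <stddef.h>

/* SAXPY: y[i] = alpha * x[i] + y[i] for i = 0..n-1.
   Plain C with no intrinsics; a vectorizing compiler may turn this loop
   into packed SSE instructions on its own. (Assumed example, the original
   saxpy.c is not shown in the slides.) */
void saxpy(size_t n, float alpha, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Because x and y are ordinary pointers that might overlap, a compiler would typically emit a run-time overlap check and keep a scalar fallback, which would be consistent with the "versioning for alias checks" note in the report. Declaring the pointers with the C99 restrict qualifier is one common way to tell the compiler they do not overlap.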

Arithmetic operations on vector data types


Declare variables of vector data types and express computations with normal arithmetic expressions
  - SSE2 data types are defined in emmintrin.h

Vector data types
  - four 32-bit floating-point values: __m128
  - two 64-bit floating-point values: __m128d
  - integer data types: __m128i

    __m128 a;   /* 4 packed float values */
    __m128 b;
    __m128 c;

    c = a + b;
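A slightly fuller version of the c = a+b fragment is sketched below. It relies on the vector extension in gcc and clang that allows ordinary arithmetic operators on __m128 values; other compilers may reject these expressions. The function name vector_expr is hypothetical.

```c
#include <xmmintrin.h>   /* __m128, _mm_loadu_ps, _mm_storeu_ps */

/* Computes pc[i] = (pa[i] + pb[i]) * pb[i] - pa[i] for i = 0..3 using
   plain arithmetic operators on vector variables.
   Assumes gcc or clang, where +, - and * on __m128 work element-wise. */
void vector_expr(const float *pa, const float *pb, float *pc)
{
    __m128 a = _mm_loadu_ps(pa);
    __m128 b = _mm_loadu_ps(pb);
    __m128 c;

    c = a + b;        /* element-wise addition of 4 floats */
    c = c * b - a;    /* element-wise multiply and subtract */

    _mm_storeu_ps(pc, c);
}
```

The computation reads like scalar code, and the compiler selects the packed instructions.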


Compiler intrinsic functions

Functions for performing MMX and SSE operations on packed data
  - implemented with inline assembly code
  - allows the programmer to use C function calls and variables

Defines a C function for each MMX/SSE instruction
  - there are also intrinsic functions composed of several MMX/SSE instructions

New data types to represent packed integer and floating-point values
  - __m64 represents the contents of a 64-bit MMX register (8-, 16- or 32-bit packed integers)
  - __m128 represents 4 packed single-precision floating-point values
  - __m128d represents 2 packed double-precision floating-point values
  - __m128i represents packed integer values (8-, 16-, 32- or 64-bit)


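C intrinsic functions
Example:
multiply two arrays A and B of 400 single-precision floating-point values

    #include <xmmintrin.h>

    #define SIZE 400

    /* _mm_load_ps and _mm_store_ps require 16-byte aligned addresses (gcc syntax) */
    float A[SIZE] __attribute__((aligned(16)));
    float B[SIZE] __attribute__((aligned(16)));
    float C[SIZE] __attribute__((aligned(16)));
    __m128 m1, m2, m3;

    for (int i = 0; i < SIZE; i += 4) {
        m1 = _mm_load_ps(A + i);      /* load A[i..i+3] */
        m2 = _mm_load_ps(B + i);      /* load B[i..i+3] */
        m3 = _mm_mul_ps(m1, m2);      /* 4 multiplications in parallel */
        _mm_store_ps(C + i, m3);      /* store C[i..i+3] */
    }

Register allocation and instruction scheduling are left to the compiler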
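The example above works because SIZE is a multiple of 4 and the arrays are suitably aligned. When neither can be assumed, one possible variant (a sketch, not the slides' code) uses the unaligned load/store intrinsics and finishes the leftover elements with a scalar loop. The function name mul_arrays is hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps */

/* Computes c[i] = a[i] * b[i] for i = 0..n-1, for any n and any alignment. */
void mul_arrays(const float *a, const float *b, float *c, size_t n)
{
    size_t i = 0;

    /* Main loop: 4 elements per iteration with unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(va, vb));
    }

    /* Remainder loop: the last 0 to 3 elements, one at a time. */
    for (; i < n; i++)
        c[i] = a[i] * b[i];
}
```

The unaligned forms may be slower than _mm_load_ps on some processors, so aligned data remains preferable when the program controls the allocation.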

Variables of vector data types have to be aligned to 16-byte boundaries

May also need to access the individual values in the packed data
  - can be done by using a union structure

    union mmdata {
        __m128 m;
        float f[4];
    };
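A short sketch of how such a union might be used to read individual lanes is given below. The helper functions lane and horizontal_sum are hypothetical names added for illustration.

```c
#include <xmmintrin.h>   /* __m128 */

/* Overlays a 128-bit vector with an array of its four float elements. */
union mmdata {
    __m128 m;
    float  f[4];
};

/* Returns element i (0..3) of the vector; f[0] is the lowest lane. */
float lane(__m128 v, int i)
{
    union mmdata u;
    u.m = v;
    return u.f[i];
}

/* Adds the four packed values into a single scalar. */
float horizontal_sum(__m128 v)
{
    union mmdata u;
    u.m = v;
    return u.f[0] + u.f[1] + u.f[2] + u.f[3];
}
```

Note that _mm_set_ps lists its arguments from the highest lane to the lowest, so _mm_set_ps(4, 3, 2, 1) places 1 in f[0] and 4 in f[3].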
