Lec15 x86 SIMD
Overview
• SIMD
• MMX architectures
• MMX instructions
• examples
• SSE/SSE2
Performance boost
• Increasing the clock rate alone is not enough to boost performance.
• Architecture improvements (such as pipelining, caches, and SIMD) are more significant.
• Intel analyzed multimedia applications and found they share the following characteristics:
– Small native data types (8-bit pixel, 16-bit audio)
– Recurring operations
– Inherent parallelism
SIMD
• SIMD (single instruction, multiple data) architecture performs the same operation on multiple data elements in parallel.
• Example: PADDW MM0, MM1 adds the four packed 16-bit words in MM1 to those in MM0 with a single instruction.
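Not in the original slides: a portable C model of what PADDW computes, assuming the 64-bit MMX register is represented as a `uint64_t` (the helper name `paddw` is ours):

```c
#include <stdint.h>

/* Model of MMX PADDW: one 64-bit value holds four packed 16-bit lanes,
   and the "single instruction" adds all four lanes at once.
   Each lane wraps around independently (modulo 2^16). */
uint64_t paddw(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (lane * 16));
        uint16_t y = (uint16_t)(b >> (lane * 16));
        r |= (uint64_t)(uint16_t)(x + y) << (lane * 16);
    }
    return r;
}
```

The real instruction performs the four lane additions simultaneously in hardware; the loop here is only a specification of the result.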
SISD/SIMD/Streaming (figure)
IA-32 SIMD development
• MMX (Multimedia Extension) was introduced in 1996 (Pentium with MMX and Pentium II).
• SSE (Streaming SIMD Extension) was introduced with the Pentium III.
• SSE2 was introduced with the Pentium 4.
• SSE3 was introduced with the Pentium 4 supporting hyper-threading technology. SSE3 adds 13 more instructions.
MMX
• After analyzing many existing applications such as graphics, MPEG, music, speech recognition, games, and image processing, Intel found that many multimedia algorithms execute the same instructions on many pieces of data in a large data set.
• Typical elements are small: 8 bits for pixels, 16 bits for audio, 32 bits for graphics and general computing.
• New data type: 64-bit packed data type. Why 64 bits?
– Good enough
– Practical
MMX data types (figure)
Compatibility
• To be fully compatible with the existing IA, no new mode or state was created. Hence, for context switching, no extra state needs to be saved.
• To reach this goal, MMX is hidden behind the FPU: when floating-point state is saved or restored, MMX state is saved or restored with it.
• This allows an existing OS to perform context switching on processes executing MMX instructions without being aware of MMX.
• However, it means MMX and the FPU cannot be used at the same time, and switching between them carries a big overhead.
Compatibility
• Although Intel defended its decision to alias MMX onto the FPU for compatibility, it was arguably a bad decision: an OS could simply have been updated with a service pack.
• That is why Intel later introduced SSE without any aliasing.
MMX instructions
• 57 MMX instructions are defined to perform parallel operations on multiple data elements packed into 64-bit data types.
• These include add, subtract, multiply, compare, shift, data conversion, 64-bit data moves, 64-bit logical operations, and multiply-add for multiply-accumulate operations.
• All instructions except data moves use MMX registers as operands.
• 16-bit operations have the most complete support.
Saturation arithmetic
• Useful in graphics applications.
• When an operation overflows or underflows, the result becomes the largest or smallest representable number instead of wrapping around.
• Two types: signed and unsigned saturation.
(figure: wrap-around vs. saturating behavior)
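A sketch of the two saturation modes for a single byte lane in portable C (the helper names `addusb`/`addsb` are ours; the MMX instructions PADDUSB/PADDSB apply the same rule to eight bytes at once):

```c
#include <stdint.h>

/* Unsigned saturating byte add: clamps to 255 instead of wrapping. */
uint8_t addusb(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return s > 255 ? 255 : (uint8_t)s;
}

/* Signed saturating byte add: clamps to the range [-128, 127]. */
int8_t addsb(int8_t a, int8_t b) {
    int s = (int)a + b;
    if (s > 127)  return 127;
    if (s < -128) return -128;
    return (int8_t)s;
}
```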
Arithmetic
• PADDB/PADDW/PADDD: add two packed numbers; no EFLAGS are set, so you must ensure yourself that overflow never occurs.
• Multiplication takes two steps:
• PMULLW: multiplies four words and stores the four low words of the four double-word results.
• PMULHW/PMULHUW: multiplies four words and stores the four high words of the four double-word results. PMULHUW is for unsigned operands.
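One lane of the two-step multiply, modeled in portable C (helper names `pmullw`/`pmulhw` are ours): a 16×16 multiply yields a 32-bit product, of which PMULLW keeps the low word and PMULHW the high word.

```c
#include <stdint.h>

/* Low 16 bits of the signed 16x16 -> 32-bit product (one PMULLW lane). */
int16_t pmullw(int16_t a, int16_t b) {
    return (int16_t)((int32_t)a * b);
}

/* High 16 bits of the signed 16x16 -> 32-bit product (one PMULHW lane). */
int16_t pmulhw(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a * b) >> 16);
}
```

Keeping the high word is effectively a divide by 65536, which is why PMULHW is handy for fixed-point fractions such as the fade factor later in these slides.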
Detect MMX/SSE
mov eax, 1          ; request version info
cpuid               ; supported since Pentium
test edx, 00800000h ; bit 23 = MMX
                    ; 02000000h (bit 25) SSE
                    ; 04000000h (bit 26) SSE2
jnz HasMMX
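The detection code above only tests single bits of the EDX value returned by CPUID. The same bit tests in C (helper names are ours; the masks match the slide):

```c
#include <stdint.h>

/* Feature tests on the EDX value returned by CPUID leaf 1. */
int has_mmx (uint32_t edx) { return (edx & 0x00800000u) != 0; } /* bit 23 */
int has_sse (uint32_t edx) { return (edx & 0x02000000u) != 0; } /* bit 25 */
int has_sse2(uint32_t edx) { return (edx & 0x04000000u) != 0; } /* bit 26 */
```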
Example: add a constant to a vector
char d[]={5, 5, 5, 5, 5, 5, 5, 5};
char clr[]={65,66,68,...,87,88}; // 24 bytes
__asm{
movq mm1, d        ; mm1 = eight copies of the constant 5
mov ecx, 3         ; 24 bytes / 8 bytes per iteration (LOOP uses ECX in 32-bit mode)
mov esi, 0
L1: movq mm0, clr[esi]
    paddb mm0, mm1
    movq clr[esi], mm0
    add esi, 8
    loop L1
    emms
}
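A scalar C equivalent of the loop above (function name `add_const` is ours), useful for checking the MMX version; each PADDB iteration handles 8 of these bytes at once:

```c
#include <stdint.h>
#include <stddef.h>

/* Add a constant to every byte of an array.
   Like PADDB, the addition wraps modulo 256. */
void add_const(uint8_t *clr, size_t n, uint8_t d) {
    for (size_t i = 0; i < n; i++)
        clr[i] = (uint8_t)(clr[i] + d);
}
```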
Comparison
• No EFLAGS are set — how many flags would you need for packed data? Instead, results are stored in the destination register as per-element masks.
• Only EQ/GT comparisons exist; there is no LT.
Change data types
• Pack: converts a larger data type to the next smaller data type.
• Unpack: takes two operands and interleaves them. It can be used to expand a data type for intermediate calculations.
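The saturation applied by pack can be sketched for a single lane in portable C (helper name `pack_sat_byte` is ours; PACKSSWB applies this word-to-byte conversion to eight words at once):

```c
#include <stdint.h>

/* Convert one signed word to a signed byte with saturation,
   as pack-with-signed-saturation does per lane. */
int8_t pack_sat_byte(int16_t w) {
    if (w > 127)  return 127;
    if (w < -128) return -128;
    return (int8_t)w;
}
```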
Pack with signed saturation (figures)
Unpack low portion (figures)
Unpack high portion (figure)
Keys to SIMD programming
• Efficient data layout
• Elimination of branches
Application: frame difference
Goal: compute |A-B| for two frames A and B (figures in the original slides). With unsigned saturating subtraction, (A-B) OR (B-A) equals |A-B|, because whichever difference would go negative saturates to zero.
MOVQ mm1, A      // move 8 pixels of image A
MOVQ mm2, B      // move 8 pixels of image B
MOVQ mm3, mm1    // mm3 = A
PSUBUSB mm1, mm2 // mm1 = A-B (unsigned saturating subtract)
PSUBUSB mm2, mm3 // mm2 = B-A (unsigned saturating subtract)
POR mm1, mm2     // mm1 = |A-B|
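The branch-free absolute-difference trick, modeled per byte in portable C (helper names `subusb`/`absdiff` are ours): one of the two saturating differences is always 0, so OR-ing them yields |a-b|.

```c
#include <stdint.h>

/* Unsigned saturating subtract: negative results clamp to 0. */
uint8_t subusb(uint8_t a, uint8_t b) {
    return a > b ? (uint8_t)(a - b) : 0;
}

/* |a-b| without branches in the SIMD version:
   (a (-sat) b) | (b (-sat) a), since one side is always zero. */
uint8_t absdiff(uint8_t a, uint8_t b) {
    return subusb(a, b) | subusb(b, a);
}
```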
Example: image fade-in-fade-out
Blend two images A and B: A*α + B*(1-α) = B + α(A-B). (Figures show the result for α = 0.75, 0.5, 0.25.)
Example: image fade-in-fade-out
• Two memory formats: planar and chunky
• In chunky format (interleaved R G B A), 16 of every 64 bits are wasted
• So, we use the planar format in the following example (figure: source images A and B)
Example: image fade-in-fade-out
MOVQ mm0, alpha    //4 zero-padded 16-bit copies of α
MOVD mm1, A        //move 4 pixels of image A
MOVD mm2, B        //move 4 pixels of image B
PXOR mm3, mm3      //clear mm3 to all zeroes
//unpack 4 pixels to 4 words
PUNPCKLBW mm1, mm3 // because A-B could be
PUNPCKLBW mm2, mm3 // negative, we need 16 bits
PSUBW mm1, mm2     //(A-B)
PMULHW mm1, mm0    //(A-B)*α, keeping the high word
PADDW mm1, mm2     //(A-B)*α + B
//pack four words back to four bytes
PACKUSWB mm1, mm3
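A scalar sketch of the blend for one pixel (function name `blend` is ours). For clarity it uses 8-bit fixed point (alpha256 = 256 means α = 1.0); the MMX version instead keeps the high 16 bits of a 16×16 product via PMULHW, which is the same idea at higher precision:

```c
#include <stdint.h>

/* One pixel of the fade: out = B + alpha*(A-B),
   with alpha as an 8-bit fixed-point fraction (0..256). */
uint8_t blend(uint8_t a, uint8_t b, int alpha256) {
    int diff = (int)a - (int)b;          /* widened, like PUNPCKLBW */
    return (uint8_t)(b + ((diff * alpha256) >> 8));
}
```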
Data-independent computation
• Each operation can execute without needing to know the results of a previous operation.
• Example: sprite overlay
for i=1 to sprite_Size
    if sprite[i]=clr
        then out_color[i]=bg[i]
        else out_color[i]=sprite[i]
Application: sprite overlay
MOVQ mm0, sprite  ; mm0 = 4 sprite pixels (words)
MOVQ mm2, mm0     ; keep a copy of the sprite
MOVQ mm4, bg      ; mm4 = 4 background pixels
MOVQ mm1, clr     ; mm1 = transparent color, replicated
PCMPEQW mm0, mm1  ; mask: FFFF where sprite = clr
PAND mm4, mm0     ; keep bg where transparent
PANDN mm0, mm2    ; keep sprite elsewhere
POR mm0, mm4      ; combine the two halves
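The branch elimination in the PCMPEQW/PAND/PANDN/POR sequence, modeled for one element in portable C (function name `select_pixel` is ours): the compare builds an all-ones or all-zeros mask, and the mask then selects between background and sprite with pure bitwise operations.

```c
#include <stdint.h>

/* Branchless select: bg where sprite equals the transparent color clr,
   sprite elsewhere. The SIMD version does 4 words at once. */
uint16_t select_pixel(uint16_t sprite, uint16_t bg, uint16_t clr) {
    uint16_t mask = (sprite == clr) ? 0xFFFF : 0x0000;        /* PCMPEQW */
    return (uint16_t)((bg & mask) | (sprite & (uint16_t)~mask)); /* PAND/PANDN/POR */
}
```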
Application: matrix transpose (figure)
Application: matrix transpose
char M1[4][8]; // matrix to be transposed
char M2[8][4]; // transposed matrix
int n=0;
for (int i=0;i<4;i++)
  for (int j=0;j<8;j++)
    { M1[i][j]=n; n++; }
__asm{
//move the 4 rows of M1 into MMX registers
movq mm1, M1
movq mm2, M1+8
movq mm3, M1+16
movq mm4, M1+24
Application: matrix transpose
//generate rows 1 to 4 of M2
punpcklbw mm1, mm2
punpcklbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 2 & row 1
punpckhwd mm0, mm3 //mm0 has row 4 & row 3
movq M2, mm1
movq M2+8, mm0
Application: matrix transpose
//generate rows 5 to 8 of M2
movq mm1, M1 //get row 1 of M1
movq mm3, M1+16 //get row 3 of M1
punpckhbw mm1, mm2
punpckhbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 6 & row 5
punpckhwd mm0, mm3 //mm0 has row 8 & row 7
//save results to M2
movq M2+16, mm1
movq M2+24, mm0
emms
} //end
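A reference transpose in plain C (function name `transpose4x8` is ours), useful for verifying the punpck-based MMX version against the same test matrix the slides initialize:

```c
/* Transpose a 4x8 byte matrix into an 8x4 one, element by element.
   The MMX version achieves the same result with byte/word unpacks. */
void transpose4x8(char M1[4][8], char M2[8][4]) {
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 8; j++)
            M2[j][i] = M1[i][j];
}
```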
Performance boost (data from 1996)
Benchmark kernels: FFT, FIR, vector dot-product, IDCT, motion compensation. (figure)
Link ASM and HLL programs
• Assembly is rarely used to develop the entire program.
• Use a high-level language for overall project development
– Relieves the programmer from low-level details
• Use assembly language code to
– Speed up critical sections of code
– Access nonstandard hardware devices
– Write platform-specific code
– Extend the HLL's capabilities
General conventions
• Considerations when calling assembly language procedures from high-level languages:
– Both must use the same naming convention (rules regarding the naming of variables and procedures)
– Both must use the same memory model, with compatible segment names
– Both must use the same calling convention
Inline assembly code
• Assembly language source code that is inserted directly into a HLL program.
• Compilers such as Microsoft Visual C++ and Borland C++ have compiler-specific directives that identify inline ASM code.
• Efficient: inline code executes quickly because CALL and RET instructions are not required.
• Simple to code because there are no external names, memory models, or naming conventions involved.
• Decidedly not portable because it is written for a single platform.
__asm directive in Microsoft Visual C++
• Can be placed at the beginning of a single statement
• Or, it can mark the beginning of a block of assembly language statements
• Syntax:
__asm statement
__asm {
    statement-1
    statement-2
    ...
    statement-n
}
Intrinsics
• An intrinsic is a function known by the compiler that directly maps to a sequence of one or more assembly language instructions.
• The compiler manages things that the user would normally have to be concerned with, such as register names, register allocation, and memory locations of data.
• Intrinsic functions are inherently more efficient than called functions because no calling linkage is required. But they are not necessarily as efficient as hand-written assembly.
• Naming: _mm_<opcode>_<suffix>, where the suffix is e.g. ps (packed single-precision) or ss (scalar single-precision).
Intrinsics
#include <xmmintrin.h>
__m128 a, b, c;
c = _mm_add_ps(a, b);

// a = b * c + d / e;
__m128 a = _mm_add_ps( _mm_mul_ps(b, c),
                       _mm_div_ps(d, e) );
SSE
• Adds eight 128-bit registers
• Allows SIMD operations on packed single-precision floating-point numbers
• Most SSE instructions require 16-byte-aligned addresses
SSE features
• Adds eight 128-bit data registers (XMM registers) in non-64-bit modes; sixteen XMM registers are available in 64-bit mode.
• 32-bit MXCSR register (control and status)
• Adds a new data type: 128-bit packed single-precision floating-point (4 FP numbers)
• Instructions to perform SIMD operations on 128-bit packed single-precision FP, plus additional 64-bit SIMD integer operations
• Instructions that explicitly prefetch data, and control data cacheability and ordering of stores
SSE programming environment (figure: XMM0-XMM7 alongside the MMX/FPU registers MM0-MM7)
MXCSR control and status register (figure)
Exception
_MM_ALIGN16 float test1[4] = { 0, 0, 0, 1 };
_MM_ALIGN16 float test2[4] = { 1, 2, 3, 0 };
_MM_ALIGN16 float out[4];
// without the next line, the result is 1.#INF rather than an exception
_MM_SET_EXCEPTION_MASK(0); // enable exceptions
__try {
    __m128 a = _mm_load_ps(test1);
    __m128 b = _mm_load_ps(test2);
    a = _mm_div_ps(a, b);
    _mm_store_ps(out, a);
}
__except(EXCEPTION_EXECUTE_HANDLER) {
    if (_mm_getcsr() & _MM_EXCEPT_DIV_ZERO)
        cout << "Divide by zero" << endl;
    return;
}
SSE packed FP operation (figure: e.g., ADDPS operates on all four elements)
SSE scalar FP operation (figure: e.g., ADDSS operates only on the lowest element)
SSE2 features
• Adds new data types (e.g., 128-bit packed double-precision FP) and instructions for them
SSE shuffle (SHUFPS) (figure)
Example (cross product)
Vector cross(const Vector& a , const Vector& b ) {
return Vector(
( a[1] * b[2] - a[2] * b[1] ) ,
( a[2] * b[0] - a[0] * b[2] ) ,
( a[0] * b[1] - a[1] * b[0] ) );
}
Example (cross product)
/* cross */
__m128 _mm_cross_ps( __m128 a , __m128 b ) {
__m128 ea , eb;
// set to a[1][2][0][3] , b[2][0][1][3]
ea = _mm_shuffle_ps( a, a, _MM_SHUFFLE(3,0,2,1) );
eb = _mm_shuffle_ps( b, b, _MM_SHUFFLE(3,1,0,2) );
// multiply
__m128 xa = _mm_mul_ps( ea , eb );
// set to a[2][0][1][3] , b[1][2][0][3]
a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(3,1,0,2) );
b = _mm_shuffle_ps( b, b, _MM_SHUFFLE(3,0,2,1) );
// multiply
__m128 xb = _mm_mul_ps( a , b );
// subtract
return _mm_sub_ps( xa , xb );
}
Example: dot product
• Given a set of vectors {v1,v2,…,vn} = {(x1,y1,z1), (x2,y2,z2),…, (xn,yn,zn)} and a vector vc=(xc,yc,zc), calculate {vc·vi}
• Two options for memory layout:
• Array of structures (AoS)
typedef struct { float dc, x, y, z; } Vertex;
Vertex v[n];
• Structure of arrays (SoA)
typedef struct { float x[n], y[n], z[n]; } VerticesList;
VerticesList v;
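As a cross-check for the asm that follows, the SoA dot products can be written in portable scalar C (function name `dot_soa` is ours). The unit-stride access to each array is exactly what lets mulps/addps process four vectors per iteration:

```c
/* Dot products of n vectors (SoA layout) against (xc, yc, zc).
   The SSE version computes four out[i] values at a time. */
void dot_soa(const float *x, const float *y, const float *z,
             float xc, float yc, float zc, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = x[i]*xc + y[i]*yc + z[i]*zc;
}
```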
Example: dot product (AoS)
movaps xmm0, v         ; xmm0 = DC, x0, y0, z0
movaps xmm1, vc        ; xmm1 = DC, xc, yc, zc
mulps  xmm0, xmm1      ; xmm0 = DC, x0*xc, y0*yc, z0*zc
movhlps xmm1, xmm0     ; xmm1 = DC, DC, DC, x0*xc
                       ; (movhlps: DEST[63..0] := SRC[127..64])
addps  xmm1, xmm0      ; xmm1 = DC, DC, DC, x0*xc+z0*zc
movaps xmm2, xmm0
shufps xmm2, xmm2, 55h ; xmm2 = DC, DC, DC, y0*yc
addps  xmm1, xmm2      ; xmm1 = DC, DC, DC, x0*xc+y0*yc+z0*zc
Example: dot product (SoA)
; X = x1,x2,x3,x4
; Y = y1,y2,y3,y4
; Z = z1,z2,z3,z4
; A = xc,xc,xc,xc
; B = yc,yc,yc,yc
; C = zc,zc,zc,zc
movaps xmm0, X ; xmm0 = x1,x2,x3,x4
movaps xmm1, Y ; xmm1 = y1,y2,y3,y4
movaps xmm2, Z ; xmm2 = z1,z2,z3,z4
mulps xmm0, A  ; xmm0 = x1*xc,x2*xc,x3*xc,x4*xc
mulps xmm1, B  ; xmm1 = y1*yc,y2*yc,y3*yc,y4*yc
mulps xmm2, C  ; xmm2 = z1*zc,z2*zc,z3*zc,z4*zc
addps xmm0, xmm1
addps xmm0, xmm2 ; xmm0 = the four dot products
Other SIMD architectures
• Graphics Processing Unit (GPU): NVIDIA 7800, 24 pipelines (8 vertex/16 fragment)
NVIDIA GeForce 8800, 2006
• Each GeForce 8800 GPU stream processor is a fully generalized, fully decoupled, scalar processor that supports IEEE 754 floating-point precision.
• Up to 128 stream processors
Cell processor
• Cell processor (IBM/Toshiba/Sony): 1 PPE (Power Processing Element) + 8 SPEs (Synergistic Processing Elements)
• An SPE is a RISC processor with 128-bit SIMD instructions for single/double precision, 128 128-bit registers, and 256KB of local store
• Used in the PS3.
Cell processor (figure)
GPUs track Moore's law better (figure)
Different programming paradigms (figure)
References
• Intel MMX for Multimedia PCs, CACM, Jan. 1997
• Chapter 11, The MMX Instruction Set, The Art of Assembly
• Chapters 9-11, IA-32 Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture
• http://www.csie.ntu.edu.tw/~r89004/hive/sse/page_1.html