
Design of Parallel and High-Performance Computing
Fall 2018
Lecture: SIMD vector extensions

Instructor: Torsten Hoefler & Markus Püschel


TA: Salvatore Di Girolamo

Flynn’s Taxonomy

                Single instruction                Multiple instruction
Single data     SISD (uniprocessor)               MISD
Multiple data   SIMD (vector computer,            MIMD (multiprocessors,
                short vector extensions)          VLIW)

SIMD Extensions and SSE
 SSE intrinsics
 Compiler vectorization

 This lecture and material was created together with


Franz Franchetti (ECE, Carnegie Mellon)

SIMD Vector Extensions

[Figure: 4-way vector addition]

 What is it?
 Extension of the ISA
 Data types and instructions for the parallel computation on short
(length 2, 4, 8, …) vectors of integers or floats
 Names: MMX, SSE, SSE2, …
 Why do they exist?
 Useful: Many applications have the necessary fine-grain parallelism
Then: speedup by a factor close to vector length
 Doable: Relatively easy to design by replicating functional units

Intel x86 Processors

MMX: Multimedia extension
SSE: Streaming SIMD extension
AVX: Advanced vector extensions

ISA      SIMD extension   register width       processors (time →)
x86-16                                         8086, 286
x86-32                                         386, 486, Pentium
         MMX              64 bit (only int)    Pentium MMX
         SSE              128 bit              Pentium III
         SSE2             128 bit              Pentium 4
         SSE3             128 bit              Pentium 4E
x86-64                    128 bit              Pentium 4F, Core 2 Duo
         SSE4             128 bit              Penryn, Core i7 (Nehalem)
         AVX              256 bit              Sandy Bridge
         AVX2             256 bit              Haswell
         AVX-512          512 bit              Skylake-X

Example SSE Family: Floating Point

[Figure: nested boxes SSE ⊂ SSE2 ⊂ SSE3 ⊂ SSSE3 ⊂ SSE4; not drawn to scale]
  SSE:  4-way single
  SSE2: adds 2-way double

 From SSE3: only additional instructions
 Every Core 2 has SSE3
6

Core 2
 Has SSE3
 16 SSE registers

128 bit = 2 doubles = 4 singles

%xmm0 %xmm8

%xmm1 %xmm9

%xmm2 %xmm10

%xmm3 %xmm11

%xmm4 %xmm12

%xmm5 %xmm13

%xmm6 %xmm14

%xmm7 %xmm15
7

SSE3 Registers
 Different data types and associated instructions (128-bit registers, LSB drawn leftmost)
 Integer vectors:
 16-way byte
 8-way 2 bytes
 4-way 4 bytes
 2-way 8 bytes

 Floating point vectors:


 4-way single (since SSE)
 2-way double (since SSE2)

 Floating point scalars:


 single (since SSE)
 double (since SSE2)
8

SSE3 Instructions: Examples
 Single precision 4-way vector add: addps %xmm0 %xmm1

%xmm0

+
%xmm1

 Single precision scalar add: addss %xmm0 %xmm1

%xmm0

+
%xmm1
9

SSE3 Instruction Names

                    packed (vector)   single slot (scalar)
single precision    addps             addss
double precision    addpd             addsd

Compiler will use this for floating point
• on x86-64
• with proper flags if SSE/SSE2 is available
10

x86-64 FP Code Example

 Inner product of two vectors
 Single precision arithmetic
 Compiled: not vectorized, uses SSE instructions

float ipf (float x[],
           float y[],
           int n) {
  int i;
  float result = 0.0;

  for (i = 0; i < n; i++)
    result += x[i]*y[i];
  return result;
}
ipf:
xorps %xmm1, %xmm1 # result = 0.0
xorl %ecx, %ecx # i = 0
jmp .L8 # goto middle
.L10: # loop:
movslq %ecx,%rax # icpy = i
incl %ecx # i++
movss (%rsi,%rax,4), %xmm0 # t = y[icpy]
mulss (%rdi,%rax,4), %xmm0 # t *= x[icpy]
addss %xmm0, %xmm1 # result += t
.L8: # middle:
cmpl %edx, %ecx # i:n
jl .L10 # if < goto loop
movaps %xmm1, %xmm0 # return result
ret
11

SSE: How to Take Advantage?

[Figure: one vector addition instead of several scalar additions]

 Necessary: fine grain parallelism


 Options (ordered by effort):
 Use vectorized libraries (easy, not always available)
 Compiler vectorization (this lecture)
 Use intrinsics (this lecture)
 Write assembly
 We will focus on floating point and single precision (4-way)

12

SIMD Extensions and SSE
 Overview: SSE family
 SSE intrinsics
 Compiler vectorization

References:
Intel Intrinsics Guide
(easy access to all instructions, also contains latency and throughput
information!)

Intel icc compiler manual


Visual Studio manual

13

SSE Family: Floating Point

[Figure: nested boxes SSE ⊂ SSE2 ⊂ SSE3 ⊂ SSSE3 ⊂ SSE4; not drawn to scale]
  SSE:  4-way single
  SSE2: adds 2-way double

 From SSE3: only additional instructions
 Every Core 2 has SSE3
14

Intrinsics
 Assembly coded C functions
 Expanded inline upon compilation: no overhead
 Like writing assembly inside C
 Floating point:
 Intrinsics for basic operations (add, mult, …)
 Intrinsics for math functions: log, sin, …
 Our introduction is based on icc
 Most intrinsics work with gcc and Visual Studio (VS)
 Some language extensions are icc (or even VS) specific

15
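
As a minimal illustration (own example, not from the slides), an intrinsic call really does expand inline to a single vector instruction; the helper name is hypothetical:

#include <xmmintrin.h>   // SSE intrinsics

// Compiles to one addps instruction; no function-call overhead for the intrinsic itself.
__m128 vadd4(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}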

Visual Conventions We Will Use


 Memory increasing address

memory

 Registers
 Commonly:
LSB

R3 R2 R1 R0

 We will use

LSB

R0 R1 R2 R3
16

SSE Intrinsics (Focus Floating Point)
 Data types
__m128 f; // = {float f0, f1, f2, f3}
__m128d d; // = {double d0, d1}
__m128i i; // 16 8-bit, 8 16-bit, 4 32-bit, or 2 64-bit ints

[Register layouts: 16 x 8-bit ints, 8 x 16-bit ints, 4 x 32-bit ints or floats, 2 x 64-bit ints or doubles]

17

SSE Intrinsics (Focus Floating Point)


 Instructions
 Naming convention: _mm_<intrin_op>_<suffix>
 Example:

float a[4] = {1.0, 2.0, 3.0, 4.0};  // a is 16-byte aligned
__m128 t = _mm_load_ps(a);          // suffix _ps: p = packed, s = single precision

LSB 1.0 2.0 3.0 4.0

 Same result as
__m128 t = _mm_set_ps(4.0, 3.0, 2.0, 1.0)

18

SSE Intrinsics
 Native instructions (one-to-one with assembly)
_mm_load_ps() ↔ movaps
_mm_add_ps() ↔ addps
_mm_mul_pd() ↔ mulpd

 Multi instructions (map to several assembly instructions)
_mm_set_ps()
_mm_set1_ps()

 Macros and helpers
_MM_TRANSPOSE4_PS()
_MM_SHUFFLE()

19

What Are the Main Issues?


 Alignment is important (128 bit = 16 byte)
 You need to code explicit loads and stores
 Overhead through shuffles

20

SSE vs. AVX

                         SSE                    AVX
float, double            4-way, 2-way           8-way, 4-way
registers                16 x 128 bits:         16 x 256 bits:
                         %xmm0 - %xmm15         %ymm0 - %ymm15
                                                (the lower halves are the %xmms)
assembly ops             addps, mulpd, …        vaddps, vmulpd, …
intrinsics data types    __m128, __m128d        __m256, __m256d
intrinsics instructions  _mm_load_ps,           _mm256_load_ps,
                         _mm_add_pd, …          _mm256_add_pd, …

Mixing SSE and AVX may incur penalties


21
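
As a rough sketch (own example, not from the slides), the same "add two float arrays" kernel with SSE and with AVX intrinsics; the length and alignment preconditions are assumptions:

#include <immintrin.h>   // AVX; also provides the SSE intrinsics

// Assumes n is a multiple of 8 and x, y are 32-byte aligned
// (16-byte alignment suffices for the SSE variant).
void add_sse(float *x, const float *y, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_load_ps(x + i);
        __m128 b = _mm_load_ps(y + i);
        _mm_store_ps(x + i, _mm_add_ps(a, b));        // addps
    }
}

void add_avx(float *x, const float *y, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 a = _mm256_load_ps(x + i);
        __m256 b = _mm256_load_ps(y + i);
        _mm256_store_ps(x + i, _mm256_add_ps(a, b));  // vaddps
    }
}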

[Figure: register sets: 32 zmm (AVX-512) contain 16 ymm (AVX), which contain 16 xmm (SSE), which contain the scalar registers]

22

SSE Intrinsics
 Load and store
 Constants
 Arithmetic
 Comparison
 Conversion
 Shuffles

23

Loads and Stores


Intrinsic Name Operation Corresponding
SSE Instructions
_mm_loadh_pi Load high MOVHPS reg, mem
_mm_loadl_pi Load low MOVLPS reg, mem
_mm_load_ss Load the low value and clear the three high values MOVSS
_mm_load1_ps Load one value into all four words MOVSS + Shuffling
_mm_load_ps Load four values, address aligned MOVAPS
_mm_loadu_ps Load four values, address unaligned MOVUPS
_mm_loadr_ps Load four values in reverse MOVAPS + Shuffling

Intrinsic Name Operation Corresponding


SSE Instruction
_mm_set_ss Set the low value and clear the three high values Composite
_mm_set1_ps Set all four words with the same value Composite
_mm_set_ps Set four values, address aligned Composite
_mm_setr_ps Set four values, in reverse order Composite
_mm_setzero_ps Clear all four values Composite

24

Loads and Stores
p

1.0 2.0 3.0 4.0 memory

LSB 1.0 2.0 3.0 4.0 a

a = _mm_load_ps(p); // p 16-byte aligned

a = _mm_loadu_ps(p); // p not aligned; avoid (can be expensive),
                     // though on recent Intel possibly no penalty

_mm_load_ps on an unaligned pointer: seg fault


25
→ blackboard

Loads and Stores


p

1.0 2.0 memory

LSB 1.0 2.0 a

kept LSB 1.0 2.0 a

kept

a = _mm_loadl_pi(a, p); // p 8-byte aligned

a = _mm_loadh_pi(a, p); // p 8-byte aligned

26
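
A small sketch (own example, not from the slides) that assembles one 4-way vector from two 8-byte-aligned halves using the two intrinsics above; the helper name is hypothetical:

#include <xmmintrin.h>

// lo and hi each point to two floats and are assumed 8-byte aligned.
__m128 load_two_halves(const float *lo, const float *hi) {
    __m128 v = _mm_setzero_ps();
    v = _mm_loadl_pi(v, (const __m64 *) lo);  // fills elements 0,1; upper half kept
    v = _mm_loadh_pi(v, (const __m64 *) hi);  // fills elements 2,3; lower half kept
    return v;
}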

Loads and Stores
p

1.0 memory

LSB 1.0 0 0 0 a

set to zero

a = _mm_load_ss(p); // p any alignment

27
→ blackboard

Stores Analogous to Loads


Intrinsic Name Operation Corresponding SSE Instruction
_mm_storeh_pi Store high MOVHPS mem, reg
_mm_storel_pi Store low MOVLPS mem, reg
_mm_store_ss Store the low value MOVSS
_mm_store1_ps Store the low value across all four Shuffling + MOVSS
words, address aligned
_mm_store_ps Store four values, address aligned MOVAPS
_mm_storeu_ps Store four values, address unaligned MOVUPS
_mm_storer_ps Store four values, in reverse order MOVAPS + Shuffling

28

Constants

LSB 1.0 2.0 3.0 4.0 a a = _mm_set_ps(4.0, 3.0, 2.0, 1.0);

LSB 1.0 1.0 1.0 1.0 b b = _mm_set1_ps(1.0);

LSB 1.0 0 0 0 c c = _mm_set_ss(1.0);

LSB 0 0 0 0 d d = _mm_setzero_ps();

29
→ blackboard

Arithmetic

SSE
Intrinsic Name   Operation                Corresponding SSE Instruction
_mm_add_ss       Addition                 ADDSS
_mm_add_ps       Addition                 ADDPS
_mm_sub_ss       Subtraction              SUBSS
_mm_sub_ps       Subtraction              SUBPS
_mm_mul_ss       Multiplication           MULSS
_mm_mul_ps       Multiplication           MULPS
_mm_div_ss       Division                 DIVSS
_mm_div_ps       Division                 DIVPS
_mm_sqrt_ss      Square Root              SQRTSS
_mm_sqrt_ps      Square Root              SQRTPS
_mm_rcp_ss       Reciprocal               RCPSS
_mm_rcp_ps       Reciprocal               RCPPS
_mm_rsqrt_ss     Reciprocal Square Root   RSQRTSS
_mm_rsqrt_ps     Reciprocal Square Root   RSQRTPS
_mm_min_ss       Computes Minimum         MINSS
_mm_min_ps       Computes Minimum         MINPS
_mm_max_ss       Computes Maximum         MAXSS
_mm_max_ps       Computes Maximum         MAXPS

SSE3
Intrinsic Name   Operation             Corresponding SSE3 Instruction
_mm_addsub_ps    Subtract and add      ADDSUBPS
_mm_hadd_ps      Horizontal add        HADDPS
_mm_hsub_ps      Horizontal subtract   HSUBPS

SSE4
Intrinsic        Operation                       Corresponding SSE4 Instruction
_mm_dp_ps        Single precision dot product    DPPS
30

Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b

LSB 1.5 3.5 5.5 7.5 c

c = _mm_add_ps(a, b);

analogous:

c = _mm_sub_ps(a, b);

c = _mm_mul_ps(a, b);

31
→ blackboard

Example
void addindex(float *x, int n) {
for (int i = 0; i < n; i++)
x[i] = x[i] + i;
}

#include <ia32intrin.h>

// n a multiple of 4, x is 16-byte aligned


void addindex_vec(float *x, int n) {
__m128 index, x_vec;

for (int i = 0; i < n; i+=4) {


x_vec = _mm_load_ps(x+i); // load 4 floats
index = _mm_set_ps(i+3, i+2, i+1, i); // create vector with indexes
x_vec = _mm_add_ps(x_vec, index); // add the two
_mm_store_ps(x+i, x_vec); // store back
}
}

Is this the best solution?


No! _mm_set_ps may be too expensive
32

Example
void addindex(float *x, int n) {
for (int i = 0; i < n; i++)
x[i] = x[i] + i;
}

#include <ia32intrin.h>

// n a multiple of 4, x is 16-byte aligned


void addindex_vec(float *x, int n) {
__m128 x_vec, ind, incr;

ind = _mm_set_ps(3, 2, 1, 0);


incr = _mm_set1_ps(4);
for (int i = 0; i < n; i+=4) {
x_vec = _mm_load_ps(x+i); // load 4 floats
x_vec = _mm_add_ps(x_vec, ind); // add the two
ind = _mm_add_ps(ind, incr); // update ind
_mm_store_ps(x+i, x_vec); // store back
}
}

Code style helps with performance!


33

Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 b

LSB 1.5 2.0 3.0 4.0 c

c = _mm_add_ss(a, b);

34

Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b

max max max max

LSB 1.0 2.0 3.0 4.0 c

c = _mm_max_ps(a, b);

35

Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b

LSB 0.5 3.5 0.5 7.5 c

c = _mm_addsub_ps(a, b);

36

Arithmetic
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b

LSB 3.0 7.0 2.0 6.0 c

c = _mm_hadd_ps(a, b);

analogous:

c = _mm_hsub_ps(a, b);

37
→ blackboard

Example
// n is even
void lp(float *x, float *y, int n) {
for (int i = 0; i < n/2; i++)
y[i] = (x[2*i] + x[2*i+1])/2;
}

#include <ia32intrin.h>

// n a multiple of 8, x, y are 16-byte aligned


void lp_vec(float *x, float *y, int n) {
__m128 half, v1, v2, avg;

half = _mm_set1_ps(0.5); // set vector to all 0.5


for(int i = 0; i < n/8; i++) {
v1 = _mm_load_ps(x+i*8); // load first 4 floats
v2 = _mm_load_ps(x+4+i*8); // load next 4 floats
avg = _mm_hadd_ps(v1, v2); // add pairs of floats
avg = _mm_mul_ps(avg, half); // multiply with 0.5
_mm_store_ps(y+i*4, avg); // save result
}
}

38

Arithmetic
__m128 _mm_dp_ps(__m128 a, __m128 b, const int mask)

(SSE4) Computes the pointwise product of a and b and writes a


selected sum of the resulting numbers into selected elements of c; the
others are set to zero. The selections are encoded in the mask.
Example: mask = 117 = 01110101

LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b

0.5 3.0 7.5 14.0 01110101

LSB 11.0 0 11.0 0 c


39
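
As an illustration (own example, not from the slides), a 4-element dot product written with _mm_dp_ps; requires SSE4.1, and the helper name is hypothetical:

#include <smmintrin.h>   // SSE4.1

// Mask 0xF1: upper nibble 1111 = multiply and sum all four element pairs,
//            lower nibble 0001 = write the sum into element 0 only.
float dot4(__m128 a, __m128 b) {
    __m128 d = _mm_dp_ps(a, b, 0xF1);
    return _mm_cvtss_f32(d);   // extract the scalar sum
}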

Comparisons

SSE
Intrinsic Name   Operation                  Corresponding SSE Instruction
_mm_cmpeq_ss     Equal                      CMPEQSS
_mm_cmpeq_ps     Equal                      CMPEQPS
_mm_cmplt_ss     Less Than                  CMPLTSS
_mm_cmplt_ps     Less Than                  CMPLTPS
_mm_cmple_ss     Less Than or Equal         CMPLESS
_mm_cmple_ps     Less Than or Equal         CMPLEPS
_mm_cmpgt_ss     Greater Than               CMPLTSS
_mm_cmpgt_ps     Greater Than               CMPLTPS
_mm_cmpge_ss     Greater Than or Equal      CMPLESS
_mm_cmpge_ps     Greater Than or Equal      CMPLEPS
_mm_cmpneq_ss    Not Equal                  CMPNEQSS
_mm_cmpneq_ps    Not Equal                  CMPNEQPS
_mm_cmpnlt_ss    Not Less Than              CMPNLTSS
_mm_cmpnlt_ps    Not Less Than              CMPNLTPS
_mm_cmpnle_ss    Not Less Than or Equal     CMPNLESS
_mm_cmpnle_ps    Not Less Than or Equal     CMPNLEPS
_mm_cmpngt_ss    Not Greater Than           CMPNLTSS
_mm_cmpngt_ps    Not Greater Than           CMPNLTPS
_mm_cmpnge_ss    Not Greater Than or Equal  CMPNLESS
_mm_cmpnge_ps    Not Greater Than or Equal  CMPNLEPS
_mm_cmpord_ss    Ordered                    CMPORDSS
_mm_cmpord_ps    Ordered                    CMPORDPS
_mm_cmpunord_ss  Unordered                  CMPUNORDSS
_mm_cmpunord_ps  Unordered                  CMPUNORDPS

SSE
Intrinsic Name   Operation                  Corresponding SSE Instruction
_mm_comieq_ss    Equal                      COMISS
_mm_comilt_ss    Less Than                  COMISS
_mm_comile_ss    Less Than or Equal         COMISS
_mm_comigt_ss    Greater Than               COMISS
_mm_comige_ss    Greater Than or Equal      COMISS
_mm_comineq_ss   Not Equal                  COMISS
_mm_ucomieq_ss   Equal                      UCOMISS
_mm_ucomilt_ss   Less Than                  UCOMISS
_mm_ucomile_ss   Less Than or Equal         UCOMISS
_mm_ucomigt_ss   Greater Than               UCOMISS
_mm_ucomige_ss   Greater Than or Equal      UCOMISS
_mm_ucomineq_ss  Not Equal                  UCOMISS

40

Comparisons
LSB 1.0 2.0 3.0 4.0 a LSB 1.0 1.5 3.0 3.5 b

=? =? =? =?

LSB 0xffffffff 0x0 0xffffffff 0x0 c

c = _mm_cmpeq_ps(a, b);

Each field: 0xffffffff if true, 0x0 if false
Return type: __m128

analogous:
c = _mm_cmple_ps(a, b);
c = _mm_cmplt_ps(a, b);
c = _mm_cmpge_ps(a, b);
etc.
41

Example
void fcond(float *x, size_t n) {
int i;

for(i = 0; i < n; i++) {


if(x[i] > 0.5)
x[i] += 1.;
else x[i] -= 1.;
}
}

#include <xmmintrin.h>

void fcond(float *a, size_t n) {


int i;
__m128 vt, vmask, vp, vm, vr, ones, mones, thresholds;

ones = _mm_set1_ps(1.);
mones = _mm_set1_ps(-1.);
thresholds = _mm_set1_ps(0.5);
for(i = 0; i < n; i+=4) {
vt = _mm_load_ps(a+i);
vmask = _mm_cmpgt_ps(vt, thresholds);
vp = _mm_and_ps(vmask, ones);
vm = _mm_andnot_ps(vmask, mones);
vr = _mm_add_ps(vt, _mm_or_ps(vp, vm));
_mm_store_ps(a+i, vr);
}
}
42


Vectorization
=

Picture: www.druckundbestell.de

Conversion
Intrinsic Name Operation Corresponding
SSE Instruction
_mm_cvtss_si32 Convert to 32-bit integer CVTSS2SI
_mm_cvtss_si64* Convert to 64-bit integer CVTSS2SI
_mm_cvtps_pi32 Convert to two 32-bit integers CVTPS2PI
_mm_cvttss_si32 Convert to 32-bit integer CVTTSS2SI
_mm_cvttss_si64* Convert to 64-bit integer CVTTSS2SI
_mm_cvttps_pi32 Convert to two 32-bit integers CVTTPS2PI
_mm_cvtsi32_ss Convert from 32-bit integer CVTSI2SS
_mm_cvtsi64_ss* Convert from 64-bit integer CVTSI2SS
_mm_cvtpi32_ps Convert from two 32-bit integers CVTTPI2PS
_mm_cvtpi16_ps Convert from four 16-bit integers composite
_mm_cvtpu16_ps Convert from four 16-bit integers composite
_mm_cvtpi8_ps Convert from four 8-bit integers composite
_mm_cvtpu8_ps Convert from four 8-bit integers composite
_mm_cvtpi32x2_ps Convert from four 32-bit integers composite
_mm_cvtps_pi16 Convert to four 16-bit integers composite
_mm_cvtps_pi8 Convert to four 8-bit integers composite
_mm_cvtss_f32 Extract composite

44

Conversion
float _mm_cvtss_f32(__m128 a)

LSB 1.0 2.0 3.0 4.0 a

1.0 f

float f;

f = _mm_cvtss_f32(a);

45
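
A common use (own sketch, not from the slides): combine horizontal adds with _mm_cvtss_f32 to reduce a 4-way vector to its scalar sum; needs SSE3 for _mm_hadd_ps, and the helper name is hypothetical:

#include <pmmintrin.h>   // SSE3

// Returns v0 + v1 + v2 + v3.
float hsum4(__m128 v) {
    __m128 t = _mm_hadd_ps(v, v);   // (v0+v1, v2+v3, v0+v1, v2+v3)
    t = _mm_hadd_ps(t, t);          // all elements = v0+v1+v2+v3
    return _mm_cvtss_f32(t);
}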

Cast interpreted as

floats ints

__m128i _mm_castps_si128(__m128 a)

__m128 _mm_castsi128_ps(__m128i a)

Reinterprets the four single precision floating point values in a as four 32-bit
integers, and vice versa.

No conversion is performed. Does not map to any assembly instructions.

Makes integer shuffle instructions usable for floating point.

46
→ blackboard
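
A small sketch (own example) of why the casts matter: an integer shuffle (PSHUFD, SSE2) applied to float data to broadcast one element; the helper name is hypothetical:

#include <emmintrin.h>   // SSE2

// Broadcast element 2 of v into all four positions.
// The casts only reinterpret bits; they generate no instructions.
__m128 broadcast2(__m128 v) {
    __m128i vi = _mm_castps_si128(v);
    vi = _mm_shuffle_epi32(vi, _MM_SHUFFLE(2, 2, 2, 2));
    return _mm_castsi128_ps(vi);
}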

Actual Conversion
__m128 _mm_cvt_pi2ps (__m128 a, __m64 b)

convert

ints floats

47

Shuffles

SSE
Intrinsic Name    Operation                                 Corresponding SSE Instruction
_mm_shuffle_ps    Shuffle                                   SHUFPS
_mm_unpackhi_ps   Unpack High                               UNPCKHPS
_mm_unpacklo_ps   Unpack Low                                UNPCKLPS
_mm_move_ss       Set low word, pass in three high values   MOVSS
_mm_movehl_ps     Move High to Low                          MOVHLPS
_mm_movelh_ps     Move Low to High                          MOVLHPS
_mm_movemask_ps   Create four-bit mask                      MOVMSKPS

SSE3
Intrinsic Name    Operation    Corresponding SSE3 Instruction
_mm_movehdup_ps   Duplicates   MOVSHDUP
_mm_moveldup_ps   Duplicates   MOVSLDUP

SSSE3
Intrinsic Name    Operation    Corresponding SSSE3 Instruction
_mm_shuffle_epi8  Shuffle      PSHUFB
_mm_alignr_epi8   Shift        PALIGNR

SSE4
Intrinsic Syntax                                               Operation                                                                                      Corresponding SSE4 Instruction
__m128 _mm_blend_ps(__m128 v1, __m128 v2, const int mask)     Selects float single precision data from 2 sources using constant mask                        BLENDPS
__m128 _mm_blendv_ps(__m128 v1, __m128 v2, __m128 v3)         Selects float single precision data from 2 sources using variable mask                        BLENDVPS
__m128 _mm_insert_ps(__m128 dst, __m128 src, const int ndx)   Insert single precision float into packed single precision array element selected by index    INSERTPS
int _mm_extract_ps(__m128 src, const int ndx)                 Extract single precision float from packed single precision array selected by index           EXTRACTPS
48

Shuffles
LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b

LSB 1.0 0.5 2.0 1.5 c

c = _mm_unpacklo_ps(a, b);

LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b

LSB 3.0 2.5 4.0 3.5 c

c = _mm_unpackhi_ps(a, b);
49
→ blackboard

Shuffles
c = _mm_shuffle_ps(a, b, _MM_SHUFFLE(l, k, j, i));

helper macro to create mask

LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b

LSB c0 c1 c2 c3 c c0 = ai
c1 = aj
any element any element c2 = bk
of a of b c3 = bl
i,j,k,l in {0,1,2,3}

50
→ blackboard
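
As a concrete instance (own sketch, not from the slides): reversing the element order of a vector with a single _mm_shuffle_ps on the same register; the helper name is hypothetical:

#include <xmmintrin.h>

// (a0, a1, a2, a3) -> (a3, a2, a1, a0)
__m128 reverse4(__m128 a) {
    // c0 = a[3], c1 = a[2], c2 = a[1], c3 = a[0]
    return _mm_shuffle_ps(a, a, _MM_SHUFFLE(0, 1, 2, 3));
}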

Example: Loading 4 Real Numbers from
Arbitrary Memory Locations

p0 p1 p2 p3

1.0 2.0 3.0 4.0 memory

4x
LSB 1.0 0 0 0 LSB 3.0 0 0 0 _mm_load_ss

LSB 2.0 0 0 0 LSB 4.0 0 0 0

2x
_mm_shuffle_ps

LSB 1.0 0 2.0 0 LSB 3.0 0 4.0 0

1x
_mm_shuffle_ps

LSB 1.0 2.0 3.0 4.0

7 instructions, this is one good way of doing it 51

Code For Previous Slide


#include <ia32intrin.h>

__m128 LoadArbitrary(float *p0, float *p1, float *p2, float *p3) {


__m128 a, b, c, d, e, f;

a = _mm_load_ss(p0);
b = _mm_load_ss(p1);
c = _mm_load_ss(p2);
d = _mm_load_ss(p3);
e = _mm_shuffle_ps(a, b, _MM_SHUFFLE(1,0,2,0)); //only zeros are important
f = _mm_shuffle_ps(c, d, _MM_SHUFFLE(1,0,2,0)); //only zeros are important
return _mm_shuffle_ps(e, f, _MM_SHUFFLE(2,0,2,0));
}

52

Example: Loading 4 Real Numbers from
Arbitrary Memory Locations (cont’d)
 Whenever possible avoid the previous situation
 Restructure algorithm and use the aligned
_mm_load_ps()
 Other possibility (but likely also yields 7 instructions)

__m128 vf;

vf = _mm_set_ps(*p3, *p2, *p1, *p0);

 SSE4: _mm_insert_epi32 together with _mm_castsi128_ps


 Not clear whether better

53

Example: Loading 4 Real Numbers from


Arbitrary Memory Locations (cont’d)
 Do not do this (why?):

__declspec(align(16)) float g[4];


__m128 vf;

g[0] = *p0;
g[1] = *p1;
g[2] = *p2;
g[3] = *p3;
vf = _mm_load_ps(g);

54

Example: Storing 4 Real Numbers to
Arbitrary Memory Locations

LSB 1.0 2.0 3.0 4.0

LSB 4.0 0 0 0 3x
_mm_shuffle_ps
LSB 3.0 0 0 0

LSB 2.0 0 0 0

4x
_mm_store_ss

1.0 2.0 3.0 4.0 memory

7 instructions, shorter critical path

55
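
One possible realization of this slide in intrinsics (own sketch, mirroring the LoadArbitrary example; the exact shuffle pattern on the slide may differ): 3 shuffles plus 4 scalar stores, i.e. 7 instructions.

#include <xmmintrin.h>

// Scatter the four elements of v to four arbitrary addresses.
void StoreArbitrary(__m128 v, float *p0, float *p1, float *p2, float *p3) {
    _mm_store_ss(p0, v);                                           // element 0
    _mm_store_ss(p1, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1,1,1,1)));  // element 1
    _mm_store_ss(p2, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2,2,2,2)));  // element 2
    _mm_store_ss(p3, _mm_shuffle_ps(v, v, _MM_SHUFFLE(3,3,3,3)));  // element 3
}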

Shuffle
__m128i _mm_alignr_epi8(__m128i a, __m128i b, const int n)

Concatenate a and b and extract byte-aligned result shifted to the right by n bytes

Example: View __m128i as 4 32-bit ints; n = 12

LSB 1 2 3 4 b LSB 5 6 7 8 a

n = 12 bytes

LSB 4 5 6 7 c

How to use this with floating point vectors?


Use with _mm_castsi128_ps !

56

Example
void shift(float *x, float *y, int n) {
for (int i = 0; i < n-1; i++)
y[i] = x[i+1];
y[n-1] = 0;
}

#include <ia32intrin.h>

// n a multiple of 4, x, y are 16-byte aligned


void shift_vec(float *x, float *y, int n) {
__m128 f;
__m128i i1, i2, i3;

i1 = _mm_castps_si128(_mm_load_ps(x)); // load first 4 floats and cast to int

for (int i = 0; i < n-4; i = i + 4) {


i2 = _mm_castps_si128(_mm_load_ps(x+4+i)); // load next 4 floats and cast to int
f = _mm_castsi128_ps(_mm_alignr_epi8(i2,i1,4)); // shift and extract and cast back
_mm_store_ps(y+i,f); // store it
i1 = i2; // make 2nd element 1st
}

// we are at the last 4


i2 = _mm_castps_si128(_mm_setzero_ps()); // set the second vector to 0 and cast to int
f = _mm_castsi128_ps(_mm_alignr_epi8(i2,i1,4)); // shift and extract and cast back
_mm_store_ps(y+n-4,f); // store it
}
57

Shuffle
__m128i _mm_shuffle_epi8(__m128i a, __m128i mask)

Result is filled in each position by any element of a or with 0, as specified by mask

Example: View __m128i as 4 32-bit ints

LSB 1 2 3 4 a

LSB 4 1 0 2 c

Use with _mm_castsi128_ps to do the same for floating point

58
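
A sketch (own example, not from the slides) that realizes the permutation shown above on float data; the byte mask encoding is an assumption about one possible way to write it:

#include <tmmintrin.h>   // SSSE3

// Produce (a3, a0, 0, a1) from (a0, a1, a2, a3), viewed as 32-bit lanes.
// Each mask byte selects a source byte; a byte with the high bit set yields 0.
__m128 permute_floats(__m128 a) {
    __m128i mask = _mm_setr_epi8(12, 13, 14, 15,            // lane 0 <- bytes of a3
                                  0,  1,  2,  3,            // lane 1 <- bytes of a0
                                 -128, -128, -128, -128,    // lane 2 <- zero
                                  4,  5,  6,  7);           // lane 3 <- bytes of a1
    return _mm_castsi128_ps(_mm_shuffle_epi8(_mm_castps_si128(a), mask));
}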

Shuffle
__m128 _mm_blendv_ps(__m128 a, __m128 b, __m128 mask)

(SSE4) Result is filled in each position by an element of a or b in the same position as


specified by mask

Example: LSB 0x0 0xffffffff 0x0 0x0 mask

LSB 1.0 2.0 3.0 4.0 a LSB 0.5 1.5 2.5 3.5 b

LSB 1.0 1.5 3.0 4.0 c

see also _mm_blend_ps


59

Example (Continued From Before)


void fcond(float *x, size_t n) {
int i;

for(i = 0; i < n; i++) {


if(x[i] > 0.5)
x[i] += 1.;
else x[i] -= 1.;
}
}

#include <xmmintrin.h>

void fcond(float *a, size_t n) {


int i;
__m128 vt, vmask, vb, vr, ones, mones, thresholds;

ones = _mm_set1_ps(1.);
mones = _mm_set1_ps(-1.);
thresholds = _mm_set1_ps(0.5);
for(i = 0; i < n; i+=4) {
vt = _mm_load_ps(a+i);
vmask = _mm_cmpgt_ps(vt, thresholds);
vb = _mm_blendv_ps(mones, ones, vmask); // 1.0 where x[i] > 0.5, else -1.0
vr = _mm_add_ps(vt, vb);
_mm_store_ps(a+i, vr);
}
}

60

Shuffle
_MM_TRANSPOSE4_PS(row0, row1, row2, row3)

Macro for 4 x 4 matrix transposition: The arguments row0,…, row3 are __m128
values each containing a row of a 4 x 4 matrix. After execution, row0, .., row 3
contain the columns of that matrix.

LSB 1.0 2.0 3.0 4.0 row0 LSB 1.0 5.0 9.0 13.0 row0

LSB 5.0 6.0 7.0 8.0 row1 LSB 2.0 6.0 10.0 14.0 row1

LSB 9.0 10.0 11.0 12.0 row2 LSB 3.0 7.0 11.0 15.0 row2

LSB 13.0 14.0 15.0 16.0 row3 LSB 4.0 8.0 12.0 16.0 row3

In SSE: 8 shuffles (4 _mm_unpacklo_ps, 4 _mm_unpackhi_ps)

61
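
A usage sketch of the macro (own example, not from the slides); note that it modifies its arguments in place:

#include <xmmintrin.h>

// Transpose a 4 x 4 matrix stored row-major in m (assumed 16-byte aligned).
void transpose4x4(float *m) {
    __m128 row0 = _mm_load_ps(m +  0);
    __m128 row1 = _mm_load_ps(m +  4);
    __m128 row2 = _mm_load_ps(m +  8);
    __m128 row3 = _mm_load_ps(m + 12);
    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);   // rows now hold the columns
    _mm_store_ps(m +  0, row0);
    _mm_store_ps(m +  4, row1);
    _mm_store_ps(m +  8, row2);
    _mm_store_ps(m + 12, row3);
}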

Vectorization With Intrinsics: Key Points


 Use aligned loads and stores as much as possible
 Minimize shuffle instructions
 Minimize use of suboptimal arithmetic instructions.
e.g., add_ps has higher throughput than hadd_ps
 Be aware of available instructions (intrinsics guide!)

62
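
Putting these points together, a sketch (own example, not from the slides) of a vectorized inner product: aligned loads, the reduction kept in a vector accumulator, and the horizontal sum done only once at the end. The preconditions are assumptions.

#include <pmmintrin.h>   // SSE3, for the final horizontal adds

// Assumes n is a multiple of 4 and x, y are 16-byte aligned.
float ipf_vec(const float *x, const float *y, int n) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 a = _mm_load_ps(x + i);
        __m128 b = _mm_load_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b));   // 4 partial sums
    }
    acc = _mm_hadd_ps(acc, acc);                   // reduce only once, outside the loop
    acc = _mm_hadd_ps(acc, acc);
    return _mm_cvtss_f32(acc);
}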

SIMD Extensions and SSE
 SSE intrinsics
 Compiler vectorization

References:
Intel icc manual (look for auto vectorization)

63

Compiler Vectorization
 Compiler flags
 Aliasing
 Proper code style
 Alignment

64

How Do I Know the Compiler Vectorized?
 vec-report
 Look at assembly: mulps, addps, xxxps
 Generate assembly with source code annotation:
 Visual Studio + icc: /Fas
 icc on Linux/Mac: -S

65

Example

void myadd(float *a, float *b, const int n) {
  for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}

unvectorized: /Qvec-
<more>
;;; a[i] = a[i] + b[i];
movss xmm0, DWORD PTR [rcx+rax*4]
addss xmm0, DWORD PTR [rdx+rax*4]
movss DWORD PTR [rcx+rax*4], xmm0
<more>

vectorized:
<more>
;;; a[i] = a[i] + b[i];
movss xmm0, DWORD PTR [rcx+r11*4]        ; why this?
addss xmm0, DWORD PTR [rdx+r11*4]
movss DWORD PTR [rcx+r11*4], xmm0

movups xmm0, XMMWORD PTR [rdx+r10*4]     ; unaligned loads
movups xmm1, XMMWORD PTR [16+rdx+r10*4]
addps xmm0, XMMWORD PTR [rcx+r10*4]      ; why everything twice?
addps xmm1, XMMWORD PTR [16+rcx+r10*4]   ; why movups and movaps?
movaps XMMWORD PTR [rcx+r10*4], xmm0     ; aligned stores
movaps XMMWORD PTR [16+rcx+r10*4], xmm1
<more>

66

Aliasing
for (i = 0; i < n; i++)
a[i] = a[i] + b[i];

Cannot be vectorized in a straightforward way due to potential aliasing.

However, in this case compiler can insert runtime check:


if (a + n < b || b + n < a)
/* vectorized loop */
...
else
/* serial loop */
...

67

Removing Aliasing
 Globally with compiler flag:
 -fno-alias, /Oa
 -fargument-noalias, /Qalias-args- (function arguments only)
 For one loop: pragma
void add(float *a, float *b, int n) {
#pragma ivdep
for (i = 0; i < n; i++)
a[i] = a[i] + b[i];
}

 For specific arrays: restrict (needs compiler flag -restrict, /Qrestrict)


void add(float *restrict a, float *restrict b, int n) {
for (i = 0; i < n; i++)
a[i] = a[i] + b[i];
}

68

Proper Code Style
 Use countable loops = number of iterations known at runtime
 Number of iterations is a:
constant
loop invariant term
linear function of outermost loop indices
 Countable or not?

for (i = 0; i < n; i++)
  a[i] = a[i] + b[i];

void vsum(float *a, float *b, float *c) {
  int i = 0;
  while (a[i] > 0.0) {
    a[i] = b[i] * c[i];
    i++;
  }
}
69

Proper Code Style


 Use arrays, structs of arrays, not arrays of structs
 Ideally: unit stride access in innermost loop
void mmm1(float (*a)[100], float (*b)[100], float (*c)[100]) {
  int N = 100;
  int i, j, k;

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

void mmm2(float (*a)[100], float (*b)[100], float (*c)[100]) {
  int N = 100;
  int i, j, k;

  for (i = 0; i < N; i++)
    for (k = 0; k < N; k++)
      for (j = 0; j < N; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
70

Alignment
float *x = (float *) malloc(1024*sizeof(float));
int i;

for (i = 0; i < 1024; i++)


x[i] = 1;

Cannot be vectorized in a straightforward way since x may not be aligned

However, the compiler can peel the loop to extract aligned part:
float *x = (float *) malloc(1024*sizeof(float));
int i, peel;

peel = (unsigned long) x & 0x0f; /* x mod 16 */
if (peel != 0) {
  peel = 16 - peel;
  /* initial segment */
  for (i = 0; i < peel; i++)
    x[i] = 1;
}
/* 16-byte aligned access */
for (i = peel; i < 1024; i++)
  x[i] = 1;
71

Ensuring Alignment
 Align arrays to 16-byte boundaries (see earlier discussion)
 If compiler cannot analyze:
 Use pragma for loops
float *x = (float *) malloc(1024*sizeof(float));
int i;

#pragma vector aligned


for (i = 0; i < 1024; i++)
x[i] = 1;
 For specific arrays:
__assume_aligned(a, 16);

72
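
One way to obtain such alignment in the first place (own sketch; _mm_malloc/_mm_free come with the SSE headers of common x86 compilers, though the exact header can vary):

#include <xmmintrin.h>

// 16-byte-aligned storage, so _mm_load_ps/_mm_store_ps and #pragma vector aligned are safe.
float *x = (float *) _mm_malloc(1024 * sizeof(float), 16);

for (int i = 0; i < 1024; i++)
    x[i] = 1;

_mm_free(x);   // must be released with _mm_free, not free()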

More Tips (icc 14.0) https://software.intel.com/en-us/node/512631
 Use simple for loops. Avoid complex loop termination conditions – the upper iteration limit must be
invariant within the loop. For the innermost loop in a nest of loops, you could set the upper limit
iteration to be a function of the outer loop indices.

 Write straight-line code. Avoid branches such as switch, goto, or return statements, most function
calls, or if constructs that cannot be treated as masked assignments.

 Avoid dependencies between loop iterations or at the least, avoid read-after-write dependencies.

 Try to use array notations instead of the use of pointers. C programs in particular impose very few
restrictions on the use of pointers; aliased pointers may lead to unexpected dependencies. Without
help, the compiler often cannot tell whether it is safe to vectorize code containing pointers.

 Wherever possible, use the loop index directly in array subscripts instead of incrementing a separate
counter for use as an array address.

 Access memory efficiently:


 Favor inner loops with unit stride.
 Minimize indirect addressing.
 Align your data to 16 byte boundaries (for SSE instructions).
 Choose a suitable data layout with care. Most multimedia extension instruction sets are rather
sensitive to alignment.

 …

73


void myadd(float *a, float *b, const int n) {
  for (int i = 0; i < n; i++)
    a[i] = a[i] + b[i];
}

Assume:
• No aliasing information
• No alignment information

Can the compiler vectorize? Yes: through versioning.

function
  runtime check: a, b potentially aliased?
    yes -> unvectorized loop
    no  -> runtime check: a, b aligned?
             yes, yes           -> vectorized loop, aligned loads
             yes, no / no, yes  -> vectorized loop, aligned and unaligned loads,
                                   or peeling and aligned loads
             no, no             -> vectorized loop, unaligned loads,
                                   or peeling and aligned loads
Compiler Vectorization
 Read manual

75
