SIMD For C++ Developers
© 2019 Konstantin, http://const.me
Contents
Introduction
    Motivation
    Scope
    History Tour
Introduction to SIMD
    Why Care?
    Applications
    Hardware Support
Programming with SIMD
    Data Types
    General Purpose Intrinsics
        Casting Types
        Converting Types
        Memory Access
        Initializing Vector Registers
        Bitwise Instructions
    Floating Point Instructions
        Arithmetic
        Comparisons
        Shuffles
        Fused Multiply-Add
    Integer Instructions
        Arithmetic
        Comparisons
        Shifts
        Pack and Unpack
        Shuffles
        _mm_shuffle_epi8
    Miscellaneous Vector Instructions
Wrappers Library
    Problems with Raw Intrinsics
    Wrappers
Random Tips and Tricks
Scope
Reading this article, or any article at all, won’t make you an expert in the technology. I believe
programming is not a science but an applied skill: one can’t become good at it without practicing a lot.
This article is an overview and a starting point.
Most of this article is focused on the PC target platform. Some assembly knowledge is recommended,
but not required, as the main focus of the article is SIMD intrinsics, supported by all modern C and
C++ compilers. The support for them is cross-platform: the same code will compile for Windows,
Linux, OSX, and the current and next generations of game consoles.
History Tour
At some high level, a processor is a mechanism which executes a sequence of instructions called
“program”. Specifically, it runs an endless loop with the following steps:
1. Load and decode the next instruction.
2. Load the source data for that instruction, from registers or memory.
3. Compute the result, then store it to registers or memory, or do some I/O [1].
4. Move on to the next instruction.
[1] Modern PC programs almost never do I/O directly, because nowadays only the OS kernel and drivers have direct access
to the hardware. Instead, they call some OS kernel API, which calls a device driver, which finally does the I/O.
Another approach is increasing the size of CPU caches, but that improvement has diminishing returns: at
some point program speed is no longer limited by memory, you actually need to compute more
stuff.
Another approach (very complicated, nevertheless widely used in modern processors) is running
multiple instructions per clock, with instruction-level parallelism, deep pipelining, and speculative
execution. Newer CPUs do it progressively better, but just like cache sizes this approach has
diminishing returns. With deep pipelines, when a processor mispredicts a branch, too much time is
wasted, and too much work is required to roll back the results of the false-started instructions.
In addition to the above, CPU designers wanted to compute more stuff with the same number of
instructions. That’s why they invented SIMD.
Modern SIMD processors entered the mainstream market with the release of Pentium III in 1999, and
they never left. Technically MMX and 3DNow! came before that, and they are SIMD too, but they are
too old and no longer relevant for developers.
Even cell phones support SIMD now; the instruction set is called ARM NEON. But this article is
focused on the PC platform, so NEON is out of scope.
Introduction to SIMD
The acronym stands for “single instruction, multiple data”. In short, it’s an extension to the
instruction set which can apply the same operation to multiple values at once. These extensions also define
an extra set of registers, wide ones, able to hold these multiple values in a single register.
For example, imagine you want to compute squares of 4 floating point numbers. A straightforward
implementation looks like this:
void mul4_scalar( float* ptr )
{
for( int i = 0; i < 4; i++ )
{
const float f = ptr[ i ];
ptr[ i ] = f * f;
}
}
[2] In practice, compilers usually do a decent job auto-vectorizing trivially simple examples like the one above. They usually fail to
auto-vectorize anything more complex. They rarely do anything at all for code which operates on integers, as
opposed to floats or doubles.
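With SIMD, the same computation takes a handful of intrinsics. Below is a minimal vectorized sketch of the same function (names and exact code are mine, assuming <xmmintrin.h> is included):
void mul4_vectorized( float* ptr )
{
    // Load 4 floats from memory into a single 128-bit register
    __m128 f = _mm_loadu_ps( ptr );
    // Square all 4 lanes with a single instruction
    f = _mm_mul_ps( f, f );
    // Store the 4 results back to memory
    _mm_storeu_ps( ptr, f );
}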
The above example only requires SSE 1, and will run on your grandmother’s Pentium III. SSE 1 operates
on 16-byte registers. In C and C++, these registers are exposed as the __m128 data type: 128 bits = 16
bytes. Each register contains 4 single-precision floating point numbers, and the instruction set has 8
of these registers, named xmm0 to xmm7.
SSE 2, introduced with Pentium 4 in 2000, added support for double-precision and integer SIMD
instructions. They process the same 8 registers, but for type safety these are exposed in C as the __m128d
and __m128i data types, respectively. Their length stayed the same 16 bytes, so there’re just 2
double lanes there, compared to 4 single-precision float lanes in SSE 1.
When AMD introduced their AMD64 architecture in 2003, they incorporated SSE 2 as a part of
their then-new instruction set. I’ll repeat it in bold: every 64-bit PC processor in the world is required
to support at least SSE 1 and SSE 2. At the same time, AMD added 8 more of these wide SIMD
registers; the new ones are named xmm8 to xmm15.
The mandatory support of SIMD opened a few possibilities. Modern compilers don’t usually emit old-school x87 instructions when compiling 64-bit code which does math on float or double numbers. Instead, the compilers emit SSE 1 and SSE 2 instructions. The floating-point math instructions have single-lane versions. The above snippet uses the _mm_mul_ps intrinsic (compiles into the mulps instruction) to compute squares of 4 numbers. SSE 1 has a matching instruction that only multiplies the lowest lane, _mm_mul_ss; it compiles into the mulss instruction. The upper 3 lanes are copied from the first argument without multiplying them.
On modern PCs, the single-lane version takes exactly the same time to execute as the wide version. That’s the reason single-lane arithmetic instructions exist: it wasn’t always the case. Pentium III didn’t have 128-bit wide arithmetic units; it did 128-bit math by splitting lanes into 64-bit pieces and computing them independently. The same way, the previous generation AMD Ryzen 1000- and 2000-series didn’t have 256-bit wide arithmetic units; to execute AVX and AVX2 instructions they split the operands into 128-bit pieces.
Apparently, CPU vendors viewed SIMD as a success,
because they continued adding new vector instructions.
SSE 3, SSSE 3 and SSE 4.1 all added new instructions operating on these SIMD registers.
By now, these instruction sets are almost universally supported by mainstream CPUs. At the time of
writing, steam hardware survey says 100.00% of users have SSE3, and 96.72% of them have SSE 4.1;
expand “other settings” panel to see these numbers on that web page.
Then in 2011 Intel introduced AVX in their Sandy Bridge processors. That version extended the 16-byte
registers into 32 bytes, so they can now hold 8 single-precision floats, or 4 doubles. Their count
stayed the same, 16 registers. In assembly, the new registers are named ymm0 to ymm15, and their lower
128-bit halves are still accessible as xmm0 to xmm15, the same way the lowest byte of the rax register is
accessible as al. AVX supports operations on 32- and 64-bit floats, but not much for integers.
Then in 2013 Intel introduced AVX 2. No new registers, but they added integer math support
there. Both AVX and AVX 2 are widely supported by now, but not yet universally so on desktops. See
the hardware support section for details.
Why Care?
I’ll first make a list of programmers who I think have good reasons to not care.
• If you don’t write code that’s CPU or RAM bandwidth bound, you don’t care. Majority of web
developers fall into this category, all the CPU-bound heavy lifting happens in external systems
like TCP/IP stack or database engines.
• If you write code that does some math, but the actual math is implemented in external libraries.
Many machine learning developers fall into this category, for the same reason they’re OK using
Python and similar high-level but slow languages for their job. Some C++ folks are happy with
vectorized algorithms implemented in HPC libraries like Eigen and MKL and don’t write
performance-sensitive code of their own.
• If you write code that’s not performance critical. Maybe a desktop app which idles most of the
time waiting for user input. Maybe a cloud app where you have unlimited compute power as
long as you have unlimited budget, and horizontal scaling is cheaper than optimizing code.
• If you use a language that doesn’t have good SIMD support. Languages like JavaScript or PHP
usually don’t, but SIMD is not limited to C++; there’s solid support in modern .NET.
• If you write numerical code which runs on GPGPUs, using CUDA, DirectCompute, or OpenCL.
• If you’re happy with OpenCL. It has its limitations, and it has a very different programming model,
close to that of GPUs, but if you’re OK with these limitations it might be easier to use than
manual SIMD, while delivering comparable performance.
Applications
The obvious answer is “for code that multiplies long float32 vectors and inverts large matrices”, but
it’s more than just that. Only the very first SSE 1 was limited to 32-bit floats. Personally, I’ve used it
with great success in the following areas.
1. For video games, and similar stuff like CAD/CAM/CAE. There’s an awesome library from Microsoft,
DirectXMath, which already implements many useful algorithms on top of SIMD. Despite the
name, it’s a general-purpose library, very useful for OpenGL and others. To use it efficiently, you
need some understanding of what it’s all about. There’s a third-party portable fork on GitHub which
compiles for Linux, OSX, iOS and Android.
2. For code dealing with large volumes of voxels. Regardless of whether you have floats, integers or
bits in these voxels, well-implemented manual vectorization can speed up your code by an order
of magnitude. For bits, a single _mm_and_si128 instruction takes 0.25 to 0.33 CPU cycles to
process 128 of them; you won’t get anywhere close with scalar code.
3. For video processing. I once needed to convert video frames on the CPU from RGB32 into NV12.
I needed the code to run fast, as I had a video stream of these frames, up to 60 FPS at 3840×2160
pixels each. I implemented and benchmarked a few versions, and the fastest one was SSE 2
code doing 16-bit fixed point math. At the time I didn’t test AVX due to a target platform
limitation; I only did SSE 1, 2, 4.1, and FMA.
4. For image processing. Images typically have quite a few pixels.
5. For random stuff, both trivial and not. Here’s an example of trivial one.
Hardware Support
            AMD [3]                               Intel [3]
SSE 1       Athlon 64, 2003                       Pentium III, 1999
SSE 2       Athlon 64, 2003                       Pentium 4, 2000
SSE 3       Athlon 64 “Venice”, 2004              Pentium 4 “Prescott”, 2004
SSSE 3      Bobcat and Bulldozer, 2011            Penryn, 2007; Bonnell, 2008
SSE 4.1     Jaguar, 2013; Bulldozer, 2011         Penryn, 2007; Silvermont, 2013
FMA         Piledriver, 2012; not in Jaguar       Haswell, 2013
AVX 1       Jaguar, 2013; Bulldozer, 2011         Sandy Bridge, 2011
AVX 2       Excavator, 2015                       Haswell, 2013, but only “Core i” models; Pentium and Celeron don’t have it
I think now in 2019 it’s OK to require SSE 3, SSE 4.1 and SSSE 3, and fail at startup if they’re not supported.
I don’t think it’s generally OK to require FMA, AVX 1 or AVX 2 for desktop PC software, at least not
yet. If they bring enough profit, either ship separate versions of your binaries, or do runtime
dispatch. A good way to implement runtime dispatch is to check the support once at application
startup, and cache C function pointers, or maybe a pointer to an abstract C++ class.
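A minimal sketch of that approach, assuming GCC or clang (which provide __builtin_cpu_supports; on VC++ you would query __cpuid instead), with hypothetical addVectors_avx / addVectors_sse implementations living elsewhere:
// Two implementations of the same routine, compiled with different instruction sets.
void addVectors_sse( const float* a, const float* b, float* dst, size_t count );
void addVectors_avx( const float* a, const float* b, float* dst, size_t count );

using pfnAddVectors = void ( * )( const float*, const float*, float*, size_t );

// Check the support once, before main() runs, and cache the function pointer.
static const pfnAddVectors s_addVectors =
    __builtin_cpu_supports( "avx" ) ? &addVectors_avx : &addVectors_sse;

void addVectors( const float* a, const float* b, float* dst, size_t count )
{
    s_addVectors( a, b, dst, count );
}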
Current generation consoles have AMD Jaguar inside. Apparently, next generation consoles arriving
Q4 2020 going to feature AMD Zen 2 which supports all of the above vector instruction sets.
Programming with SIMD
First, it takes time to wrap your head around the concept: that all CPUs are actually vector ones, and
every time you’re writing float a = b + c; you’re wasting 75% to 87.5% of CPU time, because it
could have added 4 or 8 numbers with a slightly different instruction, spending exactly the same time.
Then it takes much more time to get familiar with the actual instructions. When I started learning the
technology, I can remember a few times when I spent hours trying to do something, came up with a
solution that barely outperformed the scalar version, only to find out later there’s that one special
instruction which only takes a couple of them to do what I did in a page of code, improving
performance by an order of magnitude.
Another thing: the documentation is not great. I hope I have fixed that to some extent, but still, the
official guide is only useful when you already know which intrinsics you need. Most developers
don’t know what modern SIMD is capable of. I’m writing this article hoping to fix that.
[3] Before SSE 1, both Intel and AMD had other vector stuff: MMX, 3DNow!, and some extended variants of
both. These have been superseded by the modern SIMD instruction sets discussed in this article.
Technically, MMX instructions still run on modern CPUs from both vendors, despite being 22 years old. Most of the
integer subset of SSE 2 is a direct copy-paste of MMX, operating on registers twice as wide.
Data Types
The compiler assigns variables to registers automagically, but remember there’re only 16 registers
underneath the compiler [5], or just 8 if you’re building a 32-bit binary. If you define too many local
variables, and write code with many dependencies preventing the compiler from reusing registers, the
compiler will evict some variables from registers to RAM on the stack. In some edge cases, this may
ruin the performance. Profile your software, and review what the compiler did to your C++ code on
the performance-critical paths.
Casting Types
The __m128, __m128d and __m128i types (and their 32-byte AVX counterparts) occupy the same registers, but they are distinct types in C++, and there’re intrinsics to re-interpret them as different types. In the list below, each line gives a source type, followed by the intrinsics which reinterpret it as other types.
__m128: _mm_castps_pd (to __m128d), _mm_castps_si128 (to __m128i), _mm256_castps128_ps256 (to __m256)
__m128d: _mm_castpd_ps (to __m128), _mm_castpd_si128 (to __m128i), _mm256_castpd128_pd256 (to __m256d)
__m128i: _mm_castsi128_ps (to __m128), _mm_castsi128_pd (to __m128d), _mm256_castsi128_si256 (to __m256i)
__m256: _mm256_castps256_ps128 (to __m128), _mm256_castps_pd (to __m256d), _mm256_castps_si256 (to __m256i)
__m256d: _mm256_castpd256_pd128 (to __m128d), _mm256_castpd_ps (to __m256), _mm256_castpd_si256 (to __m256i)
__m256i: _mm256_castsi256_si128 (to __m128i), _mm256_castsi256_ps (to __m256), _mm256_castsi256_pd (to __m256d)
All these intrinsics compile into no instructions, so they’re practically free performance-wise. They
don’t change bits in the registers, so a 32-bit 1.0f float becomes 0x3f800000 in the corresponding 32-bit lane of the
destination integer register. When casting 16-byte values into 32 bytes, the upper half is undefined.
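A tiny sketch of what that means in practice (the helper name is mine; assumes <emmintrin.h> and <stdint.h> are included):
uint32_t floatBits( float f )
{
    const __m128 v = _mm_set_ss( f );           // put the float into the lowest lane
    const __m128i iv = _mm_castps_si128( v );   // reinterpret; no instructions emitted
    return (uint32_t)_mm_cvtsi128_si32( iv );   // returns 0x3F800000 for 1.0f
}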
Converting Types
There’re also instructions to do proper type conversion. They only support signed 32-bit integer
lanes. _mm_cvtepi32_ps converts 4 integer 32-bit lanes into 32-bit floats, _mm_cvtepi32_pd converts the
first 2 integers into 64-bit floats, _mm_cvtpd_epi32 converts 2 doubles into 2 integers, setting the
higher 2 lanes of the output to zero, and the AVX equivalent _mm256_cvtpd_epi32 converts 4 doubles into 4 integers.
When converting from floats to integers, these instructions use the
rounding mode specified in the MXCSR control and status register. See the _MM_SET_ROUNDING_MODE macro
for how to change that mode. That value is a part of the thread state, so the OS persists the value across
context switches [4]. There’re also instructions to convert floats between 32- and 64-bit.
[4] That feature may cause hard-to-find numerical errors when SIMD is used in a thread pool environment, such
as OpenMP. For this reason, instead of changing MXCSR I generally prefer the SSE 4.1 rounding intrinsics.
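A small sketch of these conversions, assuming the default round-to-nearest mode in MXCSR:
__m128i roundToInt32( __m128 f )
{
    return _mm_cvtps_epi32( f );    // 4 floats -> 4 int32 lanes, rounded per MXCSR
}

__m128 toFloat32( __m128i i )
{
    return _mm_cvtepi32_ps( i );    // 4 int32 lanes -> 4 floats
}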
Memory Access
• All types support regular loads, like _mm_load_si128 or _mm256_load_ps. They require the source
address to be aligned by 16 or 32 bytes. They may crash otherwise, but it’s not guaranteed; it
depends on compiler settings.
• All types support unaligned loads, like _mm_loadu_si128 or _mm256_loadu_ps. They’re fine with
unaligned pointers but aligned loads are faster.
• __m128 and __m128d types support a few extra ones. They support single-lane loads, to load just the
first lane and set the rest of them to 0.0; these are _mm_load_ss and _mm_load_sd. They also have
intrinsics to load them from memory in reverse order (requires an aligned source address).
• In AVX, __m128, __m256 and __m256d have broadcast load instructions, to load single float or
double value from RAM into all lanes of the destination variable. Also __m256 and __m256d types
have broadcast loads which read 16 bytes from RAM and duplicate the value in 2 halves of the
destination register.
• AVX 1 introduced masked load instructions, they load some lanes and zero out others.
• AVX 2 introduced gather load instructions, like _mm_i32gather_ps; they take a base pointer, an
integer register with offsets, and also an integer value to scale the offsets.
• Finally, in many cases you don’t need to do anything special to load values. If your source data is
aligned, you can just write code like __m128i value = *pointer; and it’ll compile into an
equivalent of regular load.
Similarly, there are many ways to store stuff: aligned, unaligned, single-lane, and masked in AVX 1.
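A quick sketch of the most common load/store cases (the pointer passed to the aligned load must be 16-byte aligned; function name is mine):
void addInPlace( const float* alignedSrc, float* unalignedDst )
{
    const __m128 a = _mm_load_ps( alignedSrc );     // aligned load, 16-byte aligned address required
    const __m128 b = _mm_loadu_ps( unalignedDst );  // unaligned load, any address works
    _mm_storeu_ps( unalignedDst, _mm_add_ps( a, b ) );  // unaligned store
}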
1. On modern PCs, RAM delivers at least 8 bytes per load. If you have dual-channel RAM, it delivers
16 bytes. This makes it much faster to access memory in aligned 16- or 32-bytes blocks. If you
look at the source code of a modern version of memcpy or memmove library routine, you’ll
usually find a manually vectorized version which uses SSE 2 instructions to copy these bytes
around. Similarly, when you need something transposed, it’s often much faster to load
continuously in blocks of 16 or 32 bytes, then transpose in registers with shuffle instructions.
2. Most single-lane loads/stores can only use the lowest lane of the destination/source register. If you’re
implementing a reduction like sum or average, try to arrange the code so the result ends up in the first lane.
3. There’re instructions to load or store values bypassing CPU caches, like
_mm[256]_stream_something. Slightly useful for cases like video processing, when you know
that you’re writing an output value that will only be loaded much later, long after you’ve
processed the complete frame, and probably by another CPU core (L1 and L2 caches are per-
core, only L3 cache is shared).
[5] The hardware has much more than 16 registers, it’s just 16 names in assembly. The hardware has many more of them;
the pipeline is deep and filled with in-flight instructions which need different versions of the same logical registers.
Initializing Vector Registers
For all their _mm[256]_set_something intrinsics, Intel screwed up the argument order. To get a register with
integer values [ 1, 2, 3, 4 ], you have to write either _mm_set_epi32( 4, 3, 2, 1 ) or
_mm_setr_epi32( 1, 2, 3, 4 ).
Bitwise Instructions
Both floats and integers have a fairly complete set of bitwise instructions. All of them have AND, OR,
XOR, and ANDNOT instructions. If you need bitwise NOT, the fastest way is probably XOR with all
ones. Example for 16-byte values:
__m128i bitwiseNot( __m128i x )
{
const __m128i zero = _mm_setzero_si128();
const __m128i one = _mm_cmpeq_epi32( zero, zero );
return _mm_xor_si128( x, one );
}
Whenever you need a register value with all bits set to 1, it’s often a good idea to do what I did here:
setzero, then compare zeroes with zeroes.
Floating Point Instructions
Arithmetic
Traditional
All 4 basic operations are implemented: add, subtract, multiply, divide. The square root instruction is
also there, even for 64-bit floats.
Unorthodox
The CPU has minimum and maximum instructions. Too bad VC++ doesn’t compile std::min<float>
/ std::max<double> into _mm_min_ss / _mm_max_sd, it really should, they’re branchless and very fast.
For 32-bit floats, CPUs implement faster approximate versions of 1/x and 1/√x. They are _mm_rcp_ps and
_mm_rsqrt_ps, respectively. For both of them the documentation says “maximum relative error for
this approximation is less than 1.5*2^-12”, which translates to 3.6621 × 10^-4 maximum relative error.
SSE 3 has horizontal add and subtract instructions, like _mm_hadd_ps, which takes two registers with
[a, b, c, d] and [e, f, g, h] and returns [a+b, c+d, e+f, g+h].
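For instance, here’s a sketch of a horizontal sum of all 4 lanes using _mm_hadd_ps (requires SSE 3; not necessarily the fastest way, but a simple one):
float horizontalSum( __m128 v )
{
    v = _mm_hadd_ps( v, v );    // [ a+b, c+d, a+b, c+d ]
    v = _mm_hadd_ps( v, v );    // [ a+b+c+d, ... ]
    return _mm_cvtss_f32( v );  // extract the lowest lane
}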
SSE 3 also has a couple of instructions which alternately add and subtract values. _mm_addsub_ps
takes two registers with [a, b, c, d] and [e, f, g, h] and returns [a-e, b+f, c-g, d+h]. I have never used
them and have no idea why they were added, except for lulz.
SSE 4.1 has introduced rounding instructions for both sizes of floats. For 32-bit floats, the all-lanes
version is exposed in C++ as _mm_round_ps, _mm_ceil_ps, and _mm_floor_ps intrinsics.
Missing
There’s no unary minus nor absolute value. The fastest way to implement them is probably bitwise
tricks, specifically _mm_xor_ps( x, _mm_set1_ps( -0.0f ) ) for unary minus, _mm_andnot_ps(
_mm_set1_ps( -0.0f ), x ) for absolute value. The reason these tricks work is that the -0.0f float value only
has the sign bit set; the rest of the bits are 0, so _mm_xor_ps flips the sign and _mm_andnot_ps clears
the sign bit, making the value non-negative.
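Spelled out as tiny helper functions (names are mine):
__m128 negate_ps( __m128 x )
{
    // Flip the sign bit of every lane
    return _mm_xor_ps( x, _mm_set1_ps( -0.0f ) );
}

__m128 abs_ps( __m128 x )
{
    // Clear the sign bit of every lane
    return _mm_andnot_ps( _mm_set1_ps( -0.0f ), x );
}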
There’s no logarithm nor exponent, and no trigonometry either. Intel’s documentation says
otherwise, because Intel were writing their documentation for their C++ compiler, which
implements them in its standard library. You can fall back to scalar code, or better, search the web
for an implementation. It’s not terribly complex; e.g. trigonometric functions are usually
implemented as high-degree minimax polynomial approximations.
Comparisons
There’re all-lanes versions of the equality comparison and the rest of them: <, >, ≤, ≥, ≠. These versions
return another float register that’s either all zeros 0.0f, or all ones. A float with all ones is a NaN.
You can send the result from SIMD to a general-purpose CPU register with _mm_movemask_ps,
_mm_movemask_pd or the corresponding AVX 1 equivalents. These instructions gather the most significant
bit of each float/double lane (that bit happens to be the sign bit, by the way), pack these bits into a
scalar, and copy it to a general-purpose CPU register. This code prints 15:
const __m128 zero = _mm_setzero_ps();
const __m128 eq = _mm_cmpeq_ps( zero, zero );
const int mask = _mm_movemask_ps( eq );
printf( "%i\n", mask );
The 0 == 0 comparison is true for all 4 lanes of __m128, the eq variable has all 128 bits set to 1, then
_mm_movemask_ps gathers and returns sign bits of all 4 float lanes. 0b1111 becomes 15 in decimal.
Another thing you can do with comparison results is use them as arguments to bitwise instructions to
combine lanes somehow. Or pass them into blendv_something. Or cast them to integer vectors and do
something else entirely.
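For example, here’s a sketch of a branchless per-lane select built from a comparison result. Real code would just call _mm_max_ps; this only illustrates the pattern, and the name is mine.
// Per-lane maximum of a and b, assembled from a comparison and bitwise ops.
__m128 selectGreater( __m128 a, __m128 b )
{
    const __m128 mask = _mm_cmpgt_ps( a, b );       // all-ones where a > b, zeros elsewhere
    const __m128 fromA = _mm_and_ps( mask, a );     // keep a where the mask is set
    const __m128 fromB = _mm_andnot_ps( mask, b );  // keep b where it’s not
    return _mm_or_ps( fromA, fromB );
}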
Also, both 32- and 64-bit floats have instructions to compare just the lowest lanes of 2 registers, and
return the comparison result as bits in the flags register accessible to general-purpose instructions:
// Compare the first lane for `x > 0.0f`
bool isFirstLanePositive( __m128 x )
{
return (bool)_mm_comigt_ss( x, _mm_setzero_ps() );
}
Shuffles
Here are a few of these shuffle instructions, with their descriptions from the documentation.
_mm_moveldup_ps (SSE 3): duplicate even-indexed single-precision (32-bit) floating-point elements from "a", and store the results in "dst".
_mm_broadcastss_ps (AVX 2): broadcast the low single-precision (32-bit) floating-point element from "a" to all elements of "dst".
_mm_blend_ps (SSE 4.1): blend packed single-precision (32-bit) floating-point elements from "a" and "b" using a control mask. For example, with the control constant 1, which is 0b0001 in binary, only the first lane is taken from the second argument; the rest come from the first one.
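As a small illustration of the compile-time control constants these shuffles take, here’s a sketch (helper names are mine):
// Reverse the 4 lanes of x: [ a, b, c, d ] -> [ d, c, b, a ].
// _MM_SHUFFLE lists source lane indices from the highest destination lane to the lowest.
__m128 reverseLanes( __m128 x )
{
    return _mm_shuffle_ps( x, x, _MM_SHUFFLE( 0, 1, 2, 3 ) );
}

// Take the first lane from b and the rest from a: the control is 0b0001. Requires SSE 4.1.
__m128 blendFirstLane( __m128 a, __m128 b )
{
    return _mm_blend_ps( a, b, 1 );
}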
I’m too lazy to draw similar block diagrams for __m128d, __m256, and __m256d shuffles. Fortunately,
Intel was lazy as well. The 64-bit float versions are very similar to the 32-bit float ones. The shuffle constants
only have half of the bits for 64-bit floats, but the idea is the same.
As for the 32-byte AVX versions, the corresponding AVX instructions only shuffle/permute within the 16-byte
halves of the register. Here’s a quote from the documentation of _mm256_shuffle_ps: “Shuffle single-precision
(32-bit) floating-point elements in "a" within 128-bit lanes.”
One exception is _mm256_broadcastss_ps AVX 2 instruction, it broadcasts the lowest lane into all 8
destination ones. Same applies to _mm256_broadcastsd_pd, it broadcasts into all 4 64-bit lanes.
Another exception is AVX 2 _mm256_permutevar8x32_ps intrinsic. Unlike the rest of them, it can
permute the data across 128-bit lanes of the 256-bit register.
No version of SSE has float shuffling instructions controllable at runtime. The closest one is
_mm_shuffle_epi8 from SSSE 3, see the corresponding section under integer instructions. If you
really need that, you can cast registers to the integer type, use _mm_shuffle_epi8, then cast back [6].
AVX 1 finally introduced one, _mm_permutevar_ps. It takes 2 arguments: a float register with source
values, and an integer vector register with source indices. It treats the integer register as 32-bit lanes, and uses the
lower 2 bits of each integer lane to select the source value from the float input vector.
[6] But beware of bypass delays between integer and floating-point domains. On some old CPUs the latency can
be up to 2 cycles. For more info, see section 13.6 “Using vector instructions with other types of data than they
are intended for” of the “Optimizing subroutines in assembly language” article.
Fused Multiply-Add
FMA instructions take 3 arguments, each being either 32- or 64-bit floating point numbers, a, b and
c, and compute ( a · b ) + c. Due to marketing reasons, that single operation counts as 2 FLOPs.
Modern CPUs implement 8-wide FMA for 32-bit floats, the _mm256_fmadd_ps intrinsic in C++, and 4-wide
FMA for 64-bit floats, the _mm256_fmadd_pd intrinsic. There’re versions which subtract instead
of adding, or flip the sign of the product before adding. There’re even versions which alternately add or
subtract, computing ( a · b ) + c for even lanes and ( a · b ) − c for odd ones, or doing the opposite.
There’re versions twice as narrow operating on 16-byte registers: 4-wide 32-bit float FMA, and 2-wide
64-bit float FMA.
These instructions process the same SIMD registers, __m128, __m128d, __m256 or __m256d C++ types.
Besides performance, FMA is also more precise. These instructions only round the value once, after
they have computed both the product and the sum. The intermediate value has twice as many bits.
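A small sketch of an FMA accumulation step, e.g. for a dot product (the function name is mine; requires compiling with FMA support, such as -mfma or /arch:AVX2):
// acc += a * b, for 8 float lanes at once, rounding only once per lane.
__m256 fmaStep( __m256 acc, __m256 a, __m256 b )
{
    return _mm256_fmadd_ps( a, b, acc );
}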
The support for these instructions is wide but not universal. Both Intel and AMD support the
compatible version of FMA, called FMA 3, in their CPUs released since 2012-2013. See hardware
support section for more info.
Integer Instructions
Arithmetic
Traditional
Additions and subtractions are totally fine and do what you would expect, e.g. _mm_add_epi32 adds
32-bit lanes of two integer registers.
Moreover, for 8 and 16 bits, there’re saturated versions of them, which don’t overflow but instead
stick to minimum or maximum values. For example, _mm_adds_epi8( _mm_set1_epi8( 100 ),
_mm_set1_epi8( 100 ) ) will return a value with all 8-bit lanes set to +127, because the sum is 200
but maximum value for signed bytes is +127. Here’s more in-depth article on saturation arithmetic.
There’s no integer divide instruction. Intel has integer divide implemented in the standard library of
their proprietary compiler. If you’re dividing by a compile-time constant that’s the same for all lanes,
you can write a function which divides a same-sized scalar by that constant, compile it with
https://godbolt.org/, and port the assembly to SIMD. For example, here’s what it compiled when asked to divide a
uint16_t scalar by 11, and here’s how to translate that assembly into SIMD intrinsics:
// Divide uint16_t lanes by 11, all 8 of them in just two SSE 2 instructions.
__m128i div_by_11_epu16( __m128i x )
{
x = _mm_mulhi_epu16( x, _mm_set1_epi16( (short)47663 ) );
return _mm_srli_epi16( x, 3 );
}
Multiplication is tricky as well. _mm_mul_epi32 takes 2 values [a,b,c,d] and [e,f,g,h], ignores half of the
values, and returns a register with 2 64-bit values [ a*e, c*g ]. The normal version is _mm_mullo_epi32.
Oh, and both are SSE 4.1. If you only have SSE 2, you’ll need workarounds, like the one below.
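Here’s a sketch of such a workaround, multiplying 32-bit lanes with SSE 2 only (the function name is mine):
// Lane-wise 32-bit multiply, keeping the low 32 bits of each product, SSE 2 only.
__m128i mullo_epi32_sse2( __m128i a, __m128i b )
{
    // Products of lanes 0 and 2; the low 32 bits don't depend on signedness
    const __m128i even = _mm_mul_epu32( a, b );
    // Shift by 4 bytes so lanes 1 and 3 move into positions 0 and 2, then multiply those
    const __m128i odd = _mm_mul_epu32( _mm_srli_si128( a, 4 ), _mm_srli_si128( b, 4 ) );
    // Collect the low halves of the 64-bit products into lanes 0 and 1 of each vector
    const __m128i evenLow = _mm_shuffle_epi32( even, _MM_SHUFFLE( 0, 0, 2, 0 ) );
    const __m128i oddLow = _mm_shuffle_epi32( odd, _MM_SHUFFLE( 0, 0, 2, 0 ) );
    // Interleave them back into [ a*e, b*f, c*g, d*h ]
    return _mm_unpacklo_epi32( evenLow, oddLow );
}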
Unorthodox
For most lane sizes there’re integer minimum and maximum instructions, e.g. _mm_max_epu8
(SSE 2) is maximum for uint8_t lanes, _mm_min_epi32 (SSE 4.1) is minimum for 32-bit integer lanes.
For uint8_t and uint16_t lanes, SSE 2 has instructions to compute average of two inputs.
The weirdest of them is probably _mm_sad_epu8 (SSE2) / _mm256_sad_epu8 (AVX2). The single
instruction, which can run once per cycle on Intel Skylake, is an equivalent of the following code:
array<uint64_t, 4> avx2_sad_epu8( array<uint8_t, 32> a, array<uint8_t, 32> b )
{
array<uint64_t, 4> result;
for( int i = 0; i < 4; i++ )
{
uint16_t totalAbsDiff = 0;
for( int j = 0; j < 8; j++ )
{
const uint8_t va = a[ i * 8 + j ];
const uint8_t vb = b[ i * 8 + j ];
const int absDiff = abs( (int)va - (int)vb );
totalAbsDiff += (uint16_t)absDiff;
}
result[ i ] = totalAbsDiff;
}
return result;
}
I have never used it in practice, but I have an idea why it’s there: it’s useful for video encoders.
Apparently, they treat an AVX 2 register as a block of 8×4 grayscale 8-bit pixels, and want to estimate
compression errors. That instruction computes the sum of errors over each row of the block.
Comparisons
Only all-lane versions are implemented. The results are similar to float comparisons: they set the
complete lane to all zeros or all ones. For example, _mm_cmpgt_epi8 sets the 8-bit lanes of the output
to either 0 or 0xFF, depending on whether the corresponding signed 8-bit values are greater or not.
There’re no unsigned integer comparison instructions. If you need them, here’s an example of how to
implement one manually, on top of a signed integer comparison.
// Compare uint16_t lanes for a > b
__m128i cmpgt_epu16( __m128i a, __m128i b )
{
const __m128i highBit = _mm_set1_epi16( (short)0x8000 );
a = _mm_xor_si128( a, highBit );
b = _mm_xor_si128( b, highBit );
return _mm_cmpgt_epi16( a, b );
}
The move mask instruction is only implemented for 8-bit lanes. If you need the results of a 32-bit integer
comparison in a general-purpose register, 1 bit per lane, one workaround is to cast the __m128i into __m128
and use _mm_movemask_ps. Similarly for 64-bit lanes: cast into __m128d and use _mm_movemask_pd.
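In code, that workaround is a one-liner (the helper name is mine):
// One bit per 32-bit lane of an integer comparison result.
int movemask_epi32( __m128i cmp )
{
    return _mm_movemask_ps( _mm_castsi128_ps( cmp ) );
}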
Shifts
Whole Register Shift
All of them encode the shift amount in the instruction, and can only shift by a whole number of bytes.
The whole-register shifts are _mm_srli_si128 and _mm_slli_si128 in SSE 2. The AVX 2 equivalents like
_mm256_slli_si256 shift the two 16-byte lanes independently, so the bytes that would have been shifted
across the 16-byte halves of the register become zeros.
Pack and Unpack
Unpack instructions come in 2 versions. unpacklo_something unpacks and interleaves values from
the low halves of the two registers. For example, _mm_unpacklo_epi32 takes 2 values, [a,b,c,d] and
[e,f,g,h], and returns [a,e,b,f]; _mm_unpackhi_epi32 takes the same 2 values and returns [c,g,d,h].
One obvious application: if you supply zero for the second argument, these instructions will
convert unsigned integer lanes into wider ones, e.g. 8-bit lanes into 16-bit ones, with twice as few
lanes per register.
The opposite instructions, for packing lanes, come in 2 versions, for signed and unsigned integers. All
of them use saturation when packing the values. If you don’t want the saturation, bitwise AND with
a constant like _mm_set1_epi16( 0xFF ) or _mm_set1_epi32( 0xFFFF ) [7], and the unsigned saturation
won’t engage.
_mm_packs_epi16 takes 2 registers, assumes they contain 16-bit signed integer lanes, packs each
lane into 8-bit signed integer using saturation (values that are greater than +127 are clipped to +127,
values that are less than -128 are clipped to -128), and returns a value with all 16 values.
_mm_packus_epi16 does the same but it assumes the input data contains 16-bit unsigned integer
lanes, that one packs each lane into 8-bit unsigned integer using saturation (values that are greater
than 255 are clipped to 255), and returns a value with all 16 values.
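For instance, here’s a sketch of the widening trick mentioned above (the function name is mine):
// Widen the lower 8 uint8_t lanes of `bytes` into 8 uint16_t lanes, SSE 2.
__m128i widenLow_epu8( __m128i bytes )
{
    // Interleaving with zero turns [ b0, b1, ... ] into [ b0, 0, b1, 0, ... ],
    // which is exactly the little-endian layout of 16-bit unsigned integers.
    return _mm_unpacklo_epi8( bytes, _mm_setzero_si128() );
}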
Shuffles
Besides pack and unpack, there’re instructions to shuffle integer lanes.
[7] Or you can build the magic numbers in code. It takes 3 cycles: setzero, cmpeq_epi32, unpacklo_epi8 /
unpacklo_epi16. In some cases, this can be measurably faster than a RAM load to fetch the constant.
_mm_shuffle_epi8
This instruction is a part of the SSSE 3 set, and it deserves a dedicated section in this article. It’s the only
SSE instruction that implements runtime-variable shuffles. Unlike the rest of the shuffles, this one
takes shuffle control values from a vector register. Here’s a C++ equivalent of that instruction:
array<uint8_t, 16> shuffle_epi8( array<uint8_t, 16> a, array<uint8_t, 16> b )
{
array<uint8_t, 16> result;
for( int i = 0; i < 16; i++ )
{
const uint8_t mask = b[ i ];
if( 0 != ( mask & 0x80 ) )
result[ i ] = 0;
else
result[ i ] = a[ mask & 0xF ];
}
return result;
}
Only unlike the above C++ code with the loop, _mm_shuffle_epi8 has 1 cycle latency, and the
throughput is even better: the computer I’m using to write this article can run 2 of them per clock.
It can be quite useful in practice. Here’s one example which uses the first argument as a lookup table,
and the second one as indices into that table, to compute the Hamming weight of integers; a sketch follows below.
In some cases, that version is even faster than the dedicated POPCNT instruction implemented in modern CPUs for
counting these bits.
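A sketch of that lookup-table approach, counting set bits in every byte of a vector (combining the per-byte counts into wider sums is left out; the function name is mine):
__m128i popcount_epi8( __m128i v )
{
    // Lookup table: popcount of every possible 4-bit value
    const __m128i table = _mm_setr_epi8( 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 );
    const __m128i lowMask = _mm_set1_epi8( 0x0F );
    const __m128i lo = _mm_and_si128( v, lowMask );                       // low nibble of each byte
    const __m128i hi = _mm_and_si128( _mm_srli_epi16( v, 4 ), lowMask );  // high nibble of each byte
    // Use the nibbles as lookup indices, then add the two per-byte counts
    return _mm_add_epi8( _mm_shuffle_epi8( table, lo ), _mm_shuffle_epi8( table, hi ) );
}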
The AVX 2 equivalent, _mm256_shuffle_epi8, shuffles within 128-bit lanes, applying the same
algorithm onto both 128-bit pieces, using the corresponding piece of the second argument for
shuffle control values.
Similar ones are available for both float types, like _mm_cvtss_f32, but only one way, from a vector register
to normal C++ floats. For the other way, to convert scalar floats into SIMD registers, I usually use
_mm_set_ps or _mm_set1_ps. Scalar floats are often already in SIMD registers, so the compiler needs to do
very little work, maybe nothing at all, maybe a shuffle.
SSSE 3 has a few unusual integer instructions. _mm_mulhrs_epi16 does something like this to signed
int16_t lanes: return (int16_t)( ( (int)a * (int)b + 0x4000 ) >> 15 ). _mm_sign_something
multiplies signed lanes by the sign of the corresponding lane in the second argument, either -1, 0 or +1.
_mm_maddubs_epi16 does an equivalent of the following C++ code:
array<int16_t, 8> mad_bs( array<uint8_t, 16> a, array<int8_t, 16> b )
{
array<int16_t, 8> result;
for( int i = 0; i < 16; i += 2 )
{
const int p1 = (int16_t)a[ i ] * (int16_t)b[ i ];
const int p2 = (int16_t)a[ i + 1 ] * (int16_t)b[ i + 1 ];
int sum = p1 + p2;
sum = std::max( sum, -32768 );
sum = std::min( sum, +32767 );
result[ i / 2 ] = (int16_t)sum;
}
return result;
}
Miscellaneous Vector Instructions
Besides integer and floating-point math, modern CPUs can do other things with these vector
registers. Hardware AES and SHA cryptographic algorithms are implemented on top of them. Intel
implemented some strange string-handling instructions in SSE 4.2. This article covers general-purpose
programming with a focus on numerical computations, so they aren’t covered here.
Wrappers Library
A year ago, I implemented a header-only wrapper library around these intrinsics. Most of the
code is automatically generated from Intel’s XML documentation. The repository, along with the
code generator and a few basic examples, is on GitHub: https://github.com/Const-me/IntelIntrinsics
The generator is written in C#.
Raw intrinsics expose a C API. The wrapper library replaces the _mm_ / _mm256_ prefixes with C++
namespaces, Intrinsics::Sse and Intrinsics::Avx, respectively. This allows some degree of code
reuse across the architectures: the same max_ps C++ function name can be compiled to an SSE 1
instruction processing 4 lanes, or to an AVX 1 instruction processing 8 lanes. The avx.hpp and avx2.hpp
headers have two namespaces: Intrinsics::Avx with AVX intrinsics processing 32-byte registers,
and Intrinsics::Sse with AVX intrinsics processing 16-byte registers.
Wrappers
Most of the wrappers do nothing, just call the original intrinsic. There're a few exceptions.
set_something argument order is flipped so it's now correct. For example, Sse::set_epi32( 1, 2,
3, 4 ) makes a register with [ 1, 2, 3, 4 ] values. setr_xxx is not exposed. Added extra set_something
and set1_something for unsigned integer types, e.g. set1_epu16( uint16_t ) and
Sse::set_epu64x( uint64_t, uint64_t ).
Scalar integer arguments/return values are cast from/to standard types, from the <stdint.h> header.
Immediate arguments which become part of the instruction (such as shuffle constants), which leak C
language abstraction with that C2057: expected constant expression compilation error, are
converted to C++ template arguments. Implemented a few template-based shuffle functions.
Implemented some missing intrinsics, most of them are for AVX: movemask_epi32, movemask_epi64,
and comparison operators like cmpeq_ps, cmpgt_pd, cmplt_epi32, cmpge_epi8, cmpneq_epi8, etc.
The generated code overloads some operators: comparisons, bitwise, some basic arithmetic. Integer
lanes use the same __m128i / __m256i data types, so component-wise integer operators are placed
in separate namespaces, e.g. Intrinsics::Sse::Int16 and Intrinsics::Avx::Int64. This doesn’t
work on either GCC or clang: they fail to compile, saying “overloaded operator must have at least
one parameter of class or enumeration type”. Operators are only enabled when building with VC++.
Random Tips and Tricks
For optimal performance, RAM access needs to be aligned. If you have an std::vector<__m128>, the
standard library should align it automatically, but sometimes you want aligned vectors of floats or of
types like DirectX::SimpleMath::Vector3 which don’t have sufficient natural alignment. For these
cases, you can use a custom allocator, tested on Windows and Linux.
When you’re implementing complex SIMD algorithms, sometimes it’s a good idea to create C++
classes. If you have a class with a couple of __m128 fields, create it on the stack, and never create
references nor pointers to it, VC++ compiler normally puts these fields in SIMD registers. This way
there’s no class anywhere in machine code, no RAM loads or stores, and the performance is good.
There’s still a class in C++, when done right, it makes code easier to work with and reason about.
Same often applies to small std::array-s of SIMD vectors, or C arrays.
Don’t write static const __m128 x = something(); inside functions or methods. In modern C++ that
construction is guaranteed to be thread-safe. To support the language standard, the compiler has to
emit some boilerplate code, which is going to have locks and branches. Instead, you can place that value
in a global variable so it’s initialized before main() starts to run, or for DLLs before LoadLibrary
of your DLL returns. Or you can place that value in a local non-static const variable.
The <xmmintrin.h> library header has a macro to make shuffle constants for _mm_shuffle_ps and
_mm_shuffle_epi32: _MM_SHUFFLE. My intrinsics library exposes template functions with 4 uint8_t
template arguments, validates them with static_assert, and makes the shuffling magic number.
When compiling for 32-bit, all the general-purpose registers are 32-bit as well. SIMD intrinsics which
move 64-bit integers between SIMD registers and general-purpose ones are not available on the
platform. To work around, use load/store, or fetch two 32-bit values with separate instructions.
If you use VC++, spam __forceinline on performance-critical SIMD functions which are called from
hot loops. Code often contains magic numbers, also variables which don’t change across iterations.
Unlike scalar code, SIMD magic numbers often come from memory, not from the instruction stream.
When the compiler is told to __forceinline, it can load these SIMD constants once, and keep them in
vector registers across iterations, unless they are evicted to RAM due to register shortage. Without
inlining, the code will be reloading these constants in consuming functions. I’ve observed 20%
performance improvement after adding __forceinline keywords everywhere. Apparently, VC++
inlining heuristics are tuned for scalar code, and fail for code with SIMD intrinsics. Unfortunately, this
means you probably can’t use C++ lambdas on hot paths, as there’s no way to mark lambda’s
operator() with __forceinline. You’ll have to write custom classes instead, class methods and
operators support __forceinline just fine.
If you’re implementing dynamic dispatch to switch the implementation of some vector routines based
on the supported instruction set, apply the __vectorcall convention to the function pointers or virtual class
methods. Such functions pass arguments and return values in vector registers. It can make a measurable
difference for 32-bit binaries. The default 64-bit convention ain’t that bad; you probably won’t get
much profit for your 64-bit builds.
Agner Fog has resources for SIMD programming on his web site. The “Optimizing subroutines in
assembly language” document is extremely useful, as are the instruction timing tables. He also has a vector class library with
the Apache license. I don’t like the wrapper-classes approach: sometimes compilers emit suboptimal code
due to these classes, many available instructions are missing from the wrappers, and vectorized integer
math often treats vectors as having a different lane count on every line of code, so you’ll have to cast
them a lot. But the parts like floating-point exponent, logarithm and trigonometry are good.