IntelCompilerIntrinsics PDF
Intrinsic Reference
Document Number: 312482-003US
Disclaimer and Legal Information
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL
PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT
AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS,
INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS
INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR
PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE
NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF
THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR
DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time,
without notice. Designers must not rely on the absence or characteristics of any
features or instructions marked "reserved" or "undefined." Intel reserves these for
future definition and shall have no responsibility whatsoever for conflicts or
incompatibilities arising from future changes to them. The information here is subject
to change without notice. Do not finalize a design with this information.
The products described in this document may contain design defects or errors known
as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest
specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this
document, or other Intel literature, may be obtained by calling 1-800-548-4725, or
by visiting Intel's Web Site.
BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino logo, Core Inside, FlashFile,
i960, InstantIP, Intel, Intel logo, Intel386, Intel486, Intel740, IntelDX2, IntelDX4,
IntelSX2, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap
ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver,
Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, IPLink, Itanium,
Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside,
skoool, Sound Mark, The Journey Inside, VTune, Xeon, and Xeon Inside are
trademarks of Intel Corporation in the U.S. and other countries.
Table Of Contents
Registers
Data Types
References
Floating-point Intrinsics
Miscellaneous Intrinsics
Data Types
Intrinsics to Read and Write Registers for Streaming SIMD Extensions
Macro Functions
Floating-point Intrinsics
Integer Intrinsics
Subtraction Intrinsics
Multiplication Intrinsics
Floating Point Dot Product Intrinsics for Streaming SIMD Extensions 4
Application Registers
Multimedia Additions
Miscellaneous Intrinsics
Intrinsics for Dual-Core Intel Itanium 2 processor 9000 series
Examples
Data Alignment, Memory Allocation Intrinsics, and Inline Assembly
Overview: Data Alignment, Memory Allocation Intrinsics, and Inline Assembly
Example
Example
Index
Intel(R) C++ Intrinsic Reference
Intrinsics are expanded inline, eliminating function call overhead. While providing
the same benefit as inline assembly, intrinsics improve code readability, assist
instruction scheduling, and help reduce debugging effort.
Intrinsics provide access to instructions that cannot be generated using the standard
constructs of the C and C++ languages.
The Intel C++ Compiler provides intrinsics that work on specific architectures and
intrinsics that work across IA-32, Intel 64, and IA-64 architectures. Most intrinsics
map directly to a corresponding assembly instruction; some map to several assembly
instructions.
The Intel C++ Compiler also supports Microsoft* Visual Studio 2005 intrinsics (for
x86 and x64 architectures) to generate instructions on Intel processors based on
IA-32 and Intel 64 architectures. For more information on these Microsoft* intrinsics,
visit http://msdn2.microsoft.com/en-us/library/26td21ds.aspx.
Not all Intel processors support all intrinsics. For information on which intrinsics are
supported on Intel processors, visit http://processorfinder.intel.com. The Processor
Spec Finder tool links directly to all processor documentation, and the data sheets list
the features, including intrinsics, supported by each processor.
Registers
The MMX instructions use eight 64-bit registers (mm0 to mm7) which are aliased on the
floating-point stack registers.
The Streaming SIMD Extensions use eight 128-bit registers (xmm0 to xmm7).
Because each of these registers can hold more than one data element, the processor
can process more than one data element simultaneously. This processing capability
is also known as single-instruction multiple data processing (SIMD).
For each computational and data manipulation instruction in the new extension sets,
there is a corresponding C intrinsic that implements that instruction directly. This
frees you from managing registers and assembly programming. Further, the
compiler optimizes the instruction scheduling so that your executable runs faster.
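As a minimal sketch of what this looks like in practice, the following C function adds four pairs of single-precision values with one packed-add intrinsic. The helper name add4 is hypothetical, and the unaligned load and store intrinsics are chosen here only so that the caller's arrays need no particular alignment.

```c
#include <xmmintrin.h>  /* Streaming SIMD Extensions intrinsics */

/* add4 is a hypothetical helper used only for this sketch.
   It adds four floats from a to four floats from b in one SIMD operation. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va  = _mm_loadu_ps(a);     /* load four floats, no alignment required */
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  /* four additions performed together */
    _mm_storeu_ps(out, sum);          /* write the four results back to memory */
}
```

No register names appear in the source; the compiler chooses the xmm registers and schedules the loads, the add, and the store.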
Note
The MM and XMM registers are the SIMD registers used by the IA-32 platforms to
implement MMX technology and SSE or SSE2 intrinsics. On the IA-64 architecture,
the MMX and SSE intrinsics use the 64-bit general registers and the 64-bit
significand of the 80-bit floating-point register.
Data Types
Intrinsic functions use four new C data types as operands, representing the new
registers that are used as the operands to these intrinsic functions.
The following table details for which instructions each of the new data types are
available.
The __m64 data type is used to represent the contents of an MMX register, which is
the register that is used by the MMX technology intrinsics. The __m64 data type can
hold eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.
The __m128 data type is used to represent the contents of a Streaming SIMD
Extension register used by the Streaming SIMD Extension intrinsics. The __m128 data
type can hold four 32-bit floating-point values.
The __m128d data type can hold two 64-bit floating-point values.
The __m128i data type can hold sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit
integer values.
The compiler aligns __m128d and __m128i local and global data to 16-byte
boundaries on the stack. To align integer, float, or double arrays, you can use the
__declspec(align(n)) statement.
These data types are not basic ANSI C data types. You must observe the following
usage restrictions:
_mm_cvtsi128_si32(_mm_srli_si128((x), 4 * (imm)))
_mm_cvtsi128_si64(_mm_srli_si128((x), 8 * (imm)))
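The first expression above can be read as "shift the whole 128-bit value right by 4 * imm bytes, then return the low 32 bits." The sketch below wraps that expression in a macro to show the effect. The names EXTRACT32 and third_element are hypothetical and are not taken from this reference; imm is assumed to be a compile-time constant from 0 to 3.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* EXTRACT32 is a hypothetical name for this sketch.
   imm must be a compile-time constant in the range 0..3. */
#define EXTRACT32(x, imm) \
    _mm_cvtsi128_si32(_mm_srli_si128((x), 4 * (imm)))

/* Returns 32-bit element 2 of v: an 8-byte shift moves that element
   into the low 32 bits, which _mm_cvtsi128_si32 then returns. */
int third_element(__m128i v)
{
    return EXTRACT32(v, 2);
}
```

For example, with v = _mm_set_epi32(40, 30, 20, 10), element 0 is 10 and third_element(v) returns 30.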
_mm_<intrin_op>_<suffix>
<intrin_op> Indicates the basic operation of the intrinsic; for example, add for
addition and sub for subtraction.
<suffix> Denotes the type of data the instruction operates on. The first one or
two letters of each suffix denote whether the data is packed (p),
extended packed (ep), or scalar (s). The remaining letters and
numbers denote the type, with notation as follows:
A number appended to a variable name indicates the element of a packed object. For
example, r0 is the lowest word of r. Some intrinsics are "composites" because they
require more than one instruction to implement them.
The packed values are represented in right-to-left order, with the lowest value being
used for scalar operations. Consider the following example operation:
__m128d t = _mm_load_pd(a);
In other words, the xmm register that holds the value t appears as follows:
The "scalar" element is 1.0. Due to the nature of the instruction, some intrinsics
require their arguments to be immediates (constant integer literals).
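The following sketch illustrates the element ordering described above. It assumes a two-element array holding 1.0 and 2.0 (the array contents and all helper names are assumptions made for this illustration) and uses the C11 _Alignas specifier to meet the 16-byte alignment that _mm_load_pd requires.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helpers for this sketch. */

/* Returns the lowest (scalar) element of t. */
double low_element(__m128d t)
{
    return _mm_cvtsd_f64(t);
}

/* Returns the upper element of t by first moving it into the low position. */
double high_element(__m128d t)
{
    return _mm_cvtsd_f64(_mm_unpackhi_pd(t, t));
}

/* Loads {1.0, 2.0} from memory; a[0] becomes the lowest element. */
__m128d load_pair(void)
{
    _Alignas(16) double a[2] = {1.0, 2.0};  /* assumed example data */
    return _mm_load_pd(a);
}
```

Under these assumptions, load_pair() yields the same value as _mm_set_pd(2.0, 1.0) or _mm_setr_pd(1.0, 2.0): the element at the lowest address is the one used by scalar operations.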
References
See the following publications and internet locations for more information about
intrinsics and the Intel architectures that support them. You can find all publications
on the Intel website.
IA-32 Intel Architecture Software Developer's Manual, Volume 2A: Instruction Set Reference, A-M
    Describes the format of the instruction set of IA-32 Intel Architecture and covers the
    reference pages of instructions from A to M.
IA-32 Intel Architecture Software Developer's Manual, Volume 2B: Instruction Set Reference, N-Z
    Describes the format of the instruction set of IA-32 Intel Architecture and covers the
    reference pages of instructions from N to Z.
Intel Itanium 2 processor website
    Intel website for the Itanium 2 processor; select the "Documentation" tab for
    documentation.
The intrinsics in this section function across all IA-32 and IA-64-based platforms.
They are offered as a convenience to the programmer. They are grouped as follows:
The following table lists and describes integer arithmetic intrinsics that you can use
across all Intel architectures.
Intrinsic                                              Description
int abs(int)                                           Returns the absolute value of an integer.
long labs(long)                                        Returns the absolute value of a long integer.
unsigned long _lrotl(unsigned long value, int shift)   Implements 64-bit left rotate of value by shift positions.
unsigned long _lrotr(unsigned long value, int shift)   Implements 64-bit right rotate of value by shift positions.
unsigned int _rotl(unsigned int value, int shift)      Implements 32-bit left rotate of value by shift positions.
unsigned int _rotr(unsigned int value, int shift)      Implements 32-bit right rotate of value by shift positions.
unsigned short _rotwl(unsigned short val, int shift)   Implements 16-bit left rotate of value by shift positions.
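To make "left rotate by shift positions" concrete, here is a portable C sketch of the operation for 32-bit and 16-bit values. The functions rotl32_sketch and rotl16_sketch are hypothetical stand-ins written for illustration, not the compiler's intrinsics. Reducing the shift count modulo the operand width is an assumption of this sketch, not documented behavior of the intrinsics.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 32-bit left rotate.
   Bits shifted out of the top re-enter at the bottom. */
uint32_t rotl32_sketch(uint32_t value, int shift)
{
    unsigned s = (unsigned)shift & 31u;  /* assumption: count taken modulo 32 */
    return (value << s) | (value >> ((32u - s) & 31u));
}

/* Hypothetical stand-in for a 16-bit left rotate. */
uint16_t rotl16_sketch(uint16_t value, int shift)
{
    unsigned s = (unsigned)shift & 15u;  /* assumption: count taken modulo 16 */
    unsigned v = value;                  /* widen so the shift cannot lose bits early */
    return (uint16_t)((v << s) | (v >> ((16u - s) & 15u)));
}
```

For example, rotating 0x12345678 left by 8 positions moves the top byte to the bottom, giving 0x34567812.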
Note
Floating-point Intrinsics
The following table lists and describes floating point intrinsics that you can use
across all Intel architectures.
Intrinsic                     Description
double fabs(double)           Returns the absolute value of a floating-point value.
double log(double)            Returns the natural logarithm ln(x), x>0, with double precision.
float logf(float)             Returns the natural logarithm ln(x), x>0, with single precision.
double log10(double)          Returns the base 10 logarithm log10(x), x>0, with double precision.
float log10f(float)           Returns the base 10 logarithm log10(x), x>0, with single precision.
double exp(double)            Returns the exponential function with double precision.
float expf(float)             Returns the exponential function with single precision.
double pow(double, double)    Returns the value of x to the power y with double precision.
float powf(float, float)      Returns the value of x to the power y with single precision.
double sin(double)            Returns the sine of x with double precision.
float sinf(float)             Returns the sine of x with single precision.
double cos(double)            Returns the cosine of x with double precision.
float cosf(float)             Returns the cosine of x with single precision.
double tan(double)            Returns the tangent of x with double precision.
float tanf(float)             Returns the tangent of x with single precision.
double acos(double)           Returns the inverse cosine of x with double precision.
float acosf(float)            Returns the inverse cosine of x with single precision.
The following table lists and describes string and block copy intrinsics that you can
use across all Intel architectures.
The string and block copy intrinsics are not implemented as intrinsics on IA-64
architecture.
Intrinsic                                               Description
char *_strset(char *, _int32)                           Sets all characters in a string to a fixed value.
int memcmp(const void *cs, const void *ct, size_t n)    Compares two regions of memory. Returns <0 if cs<ct, 0 if cs=ct, or >0 if cs>ct.
void *memcpy(void *s, const void *ct, size_t n)         Copies from memory. Returns s.
void *memset(void * s, int c, size_t n)                 Sets memory to a fixed value. Returns s.
char *strcat(char * s, const char * ct)                 Appends to a string. Returns s.
int strcmp(const char *, const char *)                  Compares two strings. Returns <0 if cs<ct, 0 if cs=ct, or >0 if cs>ct.
char *strcpy(char * s, const char * ct)                 Copies a string. Returns s.
size_t strlen(const char * cs)                          Returns the length of string cs.
int strncmp(char *, char *, int)                        Compares two strings, but only the specified number of characters.
int strncpy(char *, char *, int)                        Copies a string, but only the specified number of characters.
Miscellaneous Intrinsics
The following table lists and describes intrinsics that you can use across all Intel
architectures, except where noted.
Intrinsic Description
_abnormal_termination(void) Can be invoked only by termination handlers.
Returns TRUE if the termination handler is
MMX technology is an extension to the Intel architecture (IA) instruction set. The
MMX instruction set adds 57 opcodes, a 64-bit quadword data type, and eight 64-bit
registers. Each of the eight registers can be directly addressed using the register
names mm0 to mm7.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Using EMMS is like emptying a container to accommodate new content. The EMMS
instruction clears the MMX registers and sets the value of the floating-point tag
word to empty. Because floating-point convention specifies that the floating-point
stack be cleared after use, you should clear the MMX registers before issuing a
floating-point instruction. You should insert the EMMS instruction at the end of all
MMX code segments to avoid a floating-point overflow exception.
Caution
Failure to empty the multimedia state after using an MMX instruction and before
using a floating-point instruction can result in unexpected execution or poor
performance.
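A minimal sketch of this rule follows, assuming a function that does some MMX integer work and then floating-point work. The function name mmx_then_float and its parameters are hypothetical; _mm_empty is the intrinsic that emits EMMS.

```c
#include <mmintrin.h>  /* MMX technology intrinsics */

/* Hypothetical helper for this sketch: adds a and b with an MMX
   instruction, empties the multimedia state, then does float math. */
int mmx_then_float(int a, int b, float scale, float *scaled)
{
    __m64 sum = _mm_add_pi32(_mm_cvtsi32_si64(a), _mm_cvtsi32_si64(b));
    int low = _mm_cvtsi64_si32(sum);  /* copy the result out of the MMX register */

    _mm_empty();                      /* EMMS: last MMX operation is done */

    *scaled = (float)low * scale;     /* floating-point code runs only after EMMS */
    return low;
}
```

The design point is ordering: every value needed from the MMX registers is moved to an ordinary variable before _mm_empty, and no MMX intrinsic appears after it.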
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
void _mm_empty(void)
__m64 _mm_cvtsi32_si64(int i)
Convert the integer object i to a 64-bit __m64 object. The integer value is zero-
extended to 64 bits.
int _mm_cvtsi64_si32(__m64 m)
__m64 _mm_cvtsi64_m64(__int64 i)
__int64 _mm_cvtm64_si64(__m64 m)
Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with
signed saturation, and pack the four 16-bit values from m2 into the upper four 8-bit
values of the result with signed saturation.
Pack the two 32-bit values from m1 into the lower two 16-bit values of the result with
signed saturation, and pack the two 32-bit values from m2 into the upper two 16-bit
values of the result with signed saturation.
Pack the four 16-bit values from m1 into the lower four 8-bit values of the result with
unsigned saturation, and pack the four 16-bit values from m2 into the upper four
8-bit values of the result with unsigned saturation.
Interleave the four 8-bit values from the high half of m1 with the four values from the
high half of m2. The interleaving begins with the data from m1.
Interleave the two 16-bit values from the high half of m1 with the two values from
the high half of m2. The interleaving begins with the data from m1.
Interleave the 32-bit value from the high half of m1 with the 32-bit value from the
high half of m2. The interleaving begins with the data from m1.
Interleave the four 8-bit values from the low half of m1 with the four values from the
low half of m2. The interleaving begins with the data from m1.
Interleave the two 16-bit values from the low half of m1 with the two values from the
low half of m2. The interleaving begins with the data from m1.
Interleave the 32-bit value from the low half of m1 with the 32-bit value from the low
half of m2. The interleaving begins with the data from m1.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Add the eight 8-bit values in m1 to the eight 8-bit values in m2.
Add the four 16-bit values in m1 to the four 16-bit values in m2.
Add the two 32-bit values in m1 to the two 32-bit values in m2.
Add the eight signed 8-bit values in m1 to the eight signed 8-bit values in m2 using
saturating arithmetic.
Add the four signed 16-bit values in m1 to the four signed 16-bit values in m2 using
saturating arithmetic.
Add the eight unsigned 8-bit values in m1 to the eight unsigned 8-bit values in m2
using saturating arithmetic.
Add the four unsigned 16-bit values in m1 to the four unsigned 16-bit values in m2
using saturating arithmetic.
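The difference between the plain and saturating additions described above shows up only when a result does not fit in its element. The sketch below contrasts the two for unsigned 8-bit elements, using _mm_add_pi8 and _mm_adds_pu8 from mmintrin.h. The helper names are hypothetical, and only the low four bytes of each 64-bit value are used to keep the example small.

```c
#include <mmintrin.h>  /* MMX technology intrinsics */

/* Hypothetical helper: wraparound add of the four bytes packed in a and b. */
unsigned add_wrap_u8x4(unsigned a, unsigned b)
{
    __m64 r = _mm_add_pi8(_mm_cvtsi32_si64((int)a), _mm_cvtsi32_si64((int)b));
    unsigned low = (unsigned)_mm_cvtsi64_si32(r);
    _mm_empty();  /* empty the multimedia state before returning */
    return low;
}

/* Hypothetical helper: unsigned saturating add of the same four bytes.
   A byte sum above 0xFF is clamped to 0xFF instead of wrapping. */
unsigned add_sat_u8x4(unsigned a, unsigned b)
{
    __m64 r = _mm_adds_pu8(_mm_cvtsi32_si64((int)a), _mm_cvtsi32_si64((int)b));
    unsigned low = (unsigned)_mm_cvtsi64_si32(r);
    _mm_empty();
    return low;
}
```

With a = 0x01FF7F10 and b = 0x01020190, the byte 0xFF + 0x02 wraps to 0x01 in the first function and clamps to 0xFF in the second; the other three bytes are identical in both results.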
Subtract the eight 8-bit values in m2 from the eight 8-bit values in m1.
Subtract the four 16-bit values in m2 from the four 16-bit values in m1.
Subtract the two 32-bit values in m2 from the two 32-bit values in m1.
Subtract the eight signed 8-bit values in m2 from the eight signed 8-bit values in m1
using saturating arithmetic.
Subtract the four signed 16-bit values in m2 from the four signed 16-bit values in m1
using saturating arithmetic.
Subtract the eight unsigned 8-bit values in m2 from the eight unsigned 8-bit values in
m1 using saturating arithmetic.
Subtract the four unsigned 16-bit values in m2 from the four unsigned 16-bit values
in m1 using saturating arithmetic.
Multiply four 16-bit values in m1 by four 16-bit values in m2 producing four 32-bit
intermediate results, which are then summed by pairs to produce two 32-bit results.
Multiply four signed 16-bit values in m1 by four signed 16-bit values in m2 and
produce the high 16 bits of the four results.
Multiply four 16-bit values in m1 by four 16-bit values in m2 and produce the low 16
bits of the four results.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Shift four 16-bit values in m left the amount specified by count while shifting in zeros.
Shift four 16-bit values in m left the amount specified by count while shifting in zeros.
For the best performance, count should be a constant.
Shift two 32-bit values in m left the amount specified by count while shifting in zeros.
Shift two 32-bit values in m left the amount specified by count while shifting in zeros.
For the best performance, count should be a constant.
Shift the 64-bit value in m left the amount specified by count while shifting in zeros.
Shift the 64-bit value in m left the amount specified by count while shifting in zeros.
For the best performance, count should be a constant.
Shift four 16-bit values in m right the amount specified by count while shifting in the
sign bit.
Shift four 16-bit values in m right the amount specified by count while shifting in the
sign bit. For the best performance, count should be a constant.
Shift two 32-bit values in m right the amount specified by count while shifting in the
sign bit.
Shift two 32-bit values in m right the amount specified by count while shifting in the
sign bit. For the best performance, count should be a constant.
Shift four 16-bit values in m right the amount specified by count while shifting in
zeros.
Shift four 16-bit values in m right the amount specified by count while shifting in
zeros. For the best performance, count should be a constant.
Shift two 32-bit values in m right the amount specified by count while shifting in
zeros.
Shift two 32-bit values in m right the amount specified by count while shifting in
zeros. For the best performance, count should be a constant.
Shift the 64-bit value in m right the amount specified by count while shifting in zeros.
Shift the 64-bit value in m right the amount specified by count while shifting in zeros.
For the best performance, count should be a constant.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Perform a bitwise AND of the 64-bit value in m1 with the 64-bit value in m2.
Perform a bitwise NOT on the 64-bit value in m1 and use the result in a bitwise AND
with the 64-bit value in m2.
Perform a bitwise OR of the 64-bit value in m1 with the 64-bit value in m2.
Perform a bitwise XOR of the 64-bit value in m1 with the 64-bit value in m2.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
The intrinsics in the following table perform compare operations. Details about each
intrinsic follow the table below.
If the respective 8-bit values in m1 are equal to the respective 8-bit values in m2, set
the respective 8-bit resulting values to all ones; otherwise, set them to all zeros.
If the respective 16-bit values in m1 are equal to the respective 16-bit values in m2,
set the respective 16-bit resulting values to all ones; otherwise, set them to all zeros.
If the respective 32-bit values in m1 are equal to the respective 32-bit values in m2,
set the respective 32-bit resulting values to all ones; otherwise, set them to all zeros.
If the respective 8-bit signed values in m1 are greater than the respective 8-bit
signed values in m2, set the respective 8-bit resulting values to all ones; otherwise,
set them to all zeros.
If the respective 16-bit signed values in m1 are greater than the respective 16-bit
signed values in m2, set the respective 16-bit resulting values to all ones; otherwise,
set them to all zeros.
If the respective 32-bit signed values in m1 are greater than the respective 32-bit
signed values in m2, set the respective 32-bit resulting values to all ones; otherwise,
set them to all zeros.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Note
In the descriptions regarding the bits of the MMX register, bit 0 is the least
significant and bit 63 is the most significant.
__m64 _mm_setzero_si64()
Sets the 64-bit value to zero.
R
0x0
R0 R1
i0 i1
R0 R1 R2 R3
w0 w1 w2 w3
__m64 _mm_set_pi8(char b7, char b6, char b5, char b4, char b3, char b2,
char b1, char b0)
R0 R1 ... R7
b0 b1 ... b7
__m64 _mm_set1_pi32(int i)
R0 R1
i i
__m64 _mm_set1_pi16(short s)
R0 R1 R2 R3
w w w w
__m64 _mm_set1_pi8(char b)
R0 R1 ... R7
b b ... b
R0 R1
i1 i0
R0 R1 R2 R3
w3 w2 w1 w0
__m64 _mm_setr_pi8(char b7, char b6, char b5, char b4, char b3, char b2,
char b1, char b0)
R0 R1 ... R7
b7 b6 ... b0
MMX technology intrinsics provide access to the MMX technology instruction set on
systems based on IA-64 architecture. To provide source compatibility with the IA-32
architecture, these intrinsics are equivalent both in name and functionality to the set
of IA-32-based MMX intrinsics.
The prototypes for MMX technology intrinsics are in the mmintrin.h header file.
Data Types
The C data type __m64 is used when using MMX technology intrinsics. It can hold
eight 8-bit values, four 16-bit values, two 32-bit values, or one 64-bit value.
The __m64 data type is not a basic ANSI C data type. Therefore, observe the
following usage restrictions:
Use the new data type only on the left-hand side of an assignment, as a
return value, or as a parameter. You cannot use it with other arithmetic
expressions (" + ", " - ", and so on).
Use the new data type as objects in aggregates, such as unions, to access the
byte elements and structures; the address of an __m64 object may be taken.
Use new data types only with the respective intrinsics described in this
documentation.
For complete details of the hardware instructions, see the Intel Architecture MMX
Technology Programmer's Reference Manual. For descriptions of data types, see the
Intel Architecture Software Developer's Manual, Volume 2.
This section describes the C++ language-level features supporting the Streaming
SIMD Extensions (SSE) in the Intel C++ Compiler. These topics explain the
following features of the intrinsics:
The prototypes for SSE intrinsics are in the xmmintrin.h header file.
Note
You can also use the single ia32intrin.h header file for any IA-32 intrinsics.
You should be familiar with the hardware features provided by the Streaming SIMD
Extensions (SSE) when writing programs with the intrinsics. The following are four
important issues to keep in mind:
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
The results of each intrinsic operation are placed in a register. This register is
illustrated for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the
four 32-bit pieces of the result register.
Adds the lower single-precision, floating-point (SP FP) values of a and b; the upper 3
SP FP values are passed through from a.
R0 R1 R2 R3
a0 + b0 a1 a2 a3
R0 R1 R2 R3
a0 + b0 a1 + b1 a2 + b2 a3 + b3
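The two result tables above differ only in what happens to the upper three elements. The sketch below runs both forms of the addition on the same inputs, using _mm_add_ss for the scalar form and _mm_add_ps for the packed form. The helper name and its array parameters are hypothetical.

```c
#include <xmmintrin.h>  /* Streaming SIMD Extensions intrinsics */

/* Hypothetical helper: computes the scalar-form and packed-form sums
   of the same two four-element inputs. */
void add_scalar_and_packed(const float a[4], const float b[4],
                           float scalar_out[4], float packed_out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    /* Scalar form: R0 = a0 + b0; R1..R3 are copied from a. */
    _mm_storeu_ps(scalar_out, _mm_add_ss(va, vb));

    /* Packed form: every element Ri = ai + bi. */
    _mm_storeu_ps(packed_out, _mm_add_ps(va, vb));
}
```

With a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the scalar form produces {11, 2, 3, 4} and the packed form produces {11, 22, 33, 44}.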
Subtracts the lower SP FP values of a and b. The upper 3 SP FP values are passed
through from a.
R0 R1 R2 R3
a0 - b0 a1 a2 a3
R0 R1 R2 R3
a0 - b0 a1 - b1 a2 - b2 a3 - b3
Multiplies the lower SP FP values of a and b; the upper 3 SP FP values are passed
through from a.
R0 R1 R2 R3
a0 * b0 a1 a2 a3
R0 R1 R2 R3
a0 * b0 a1 * b1 a2 * b2 a3 * b3
Divides the lower SP FP values of a and b; the upper 3 SP FP values are passed
through from a.
R0 R1 R2 R3
a0 / b0 a1 a2 a3
R0 R1 R2 R3
a0 / b0 a1 / b1 a2 / b2 a3 / b3
__m128 _mm_sqrt_ss(__m128 a)
Computes the square root of the lower SP FP value of a; the upper 3 SP FP values
are passed through.
R0 R1 R2 R3
sqrt(a0) a1 a2 a3
__m128 _mm_sqrt_ps(__m128 a)
R0 R1 R2 R3
sqrt(a0) sqrt(a1) sqrt(a2) sqrt(a3)
__m128 _mm_rcp_ss(__m128 a)
R0 R1 R2 R3
recip(a0) a1 a2 a3
__m128 _mm_rcp_ps(__m128 a)
R0 R1 R2 R3
recip(a0) recip(a1) recip(a2) recip(a3)
__m128 _mm_rsqrt_ss(__m128 a)
Computes the approximation of the reciprocal of the square root of the lower SP FP
value of a; the upper 3 SP FP values are passed through.
R0 R1 R2 R3
recip(sqrt(a0)) a1 a2 a3
__m128 _mm_rsqrt_ps(__m128 a)
Computes the approximations of the reciprocals of the square roots of the four SP FP
values of a.
R0 R1 R2 R3
recip(sqrt(a0)) recip(sqrt(a1)) recip(sqrt(a2)) recip(sqrt(a3))
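Because the reciprocal and reciprocal square root intrinsics return approximations, it can help to see how far a result is from the full-precision value. The sketch below measures the relative error of _mm_rsqrt_ss against a reference built from _mm_sqrt_ss and _mm_div_ss. The helper name is hypothetical, and the sketch assumes a positive, finite input.

```c
#include <xmmintrin.h>  /* Streaming SIMD Extensions intrinsics */

/* Hypothetical helper: relative error of the approximate reciprocal
   square root of x, compared with 1 / sqrt(x) computed at full
   single precision. Assumes x is positive and finite. */
float rsqrt_relative_error(float x)
{
    __m128 vx = _mm_set_ss(x);

    float approx = _mm_cvtss_f32(_mm_rsqrt_ss(vx));
    float exact  = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(1.0f), _mm_sqrt_ss(vx)));

    float diff = approx - exact;
    if (diff < 0.0f) {
        diff = -diff;  /* absolute value without pulling in math.h */
    }
    return diff / exact;
}
```

The error is small but generally not zero, so code that needs full precision should use the square root and divide intrinsics, or refine the approximation, rather than rely on the approximate result directly.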
R0 R1 R2 R3
min(a0, b0) a1 a2 a3
R0 R1 R2 R3
min(a0, b0) min(a1, b1) min(a2, b2) min(a3, b3)
R0 R1 R2 R3
max(a0, b0) a1 a2 a3
R0 R1 R2 R3
max(a0, b0) max(a1, b1) max(a2, b2) max(a3, b3)
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
The results of each intrinsic operation are placed in a register. This register is
illustrated for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the
four 32-bit pieces of the result register.
R0 R1 R2 R3
a0 & b0 a1 & b1 a2 & b2 a3 & b3
R0 R1 R2 R3
~a0 & b0 ~a1 & b1 ~a2 & b2 ~a3 & b3
R0 R1 R2 R3
a0 | b0 a1 | b1 a2 | b2 a3 | b3
R0 R1 R2 R3
a0 ^ b0 a1 ^ b1 a2 ^ b2 a3 ^ b3
Each comparison intrinsic performs a comparison of a and b. For the packed form,
the four SP FP values of a and b are compared, and a 128-bit mask is returned. For
the scalar form, the lower SP FP values of a and b are compared, and a 32-bit mask
is returned; the upper three SP FP values are passed through from a. The mask is
set to 0xffffffff for each element where the comparison is true and 0x0 where the
comparison is false.
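A common use of these all-ones and all-zeros masks is a branch-free select: combine the mask with the logical intrinsics to pick each element from one of two sources. The sketch below picks the smaller of each pair of elements this way. The helper name is hypothetical, and _mm_min_ps would do the same job directly; the mask form is shown only to illustrate how a comparison result is consumed.

```c
#include <xmmintrin.h>  /* Streaming SIMD Extensions intrinsics */

/* Hypothetical helper: out[i] = (a[i] < b[i]) ? a[i] : b[i],
   built from a packed comparison mask. */
void select_smaller(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);

    __m128 mask   = _mm_cmplt_ps(va, vb);     /* 0xffffffff where a < b, else 0x0 */
    __m128 from_a = _mm_and_ps(mask, va);     /* keep a where the mask is set */
    __m128 from_b = _mm_andnot_ps(mask, vb);  /* keep b where the mask is clear */

    _mm_storeu_ps(out, _mm_or_ps(from_a, from_b));
}
```

With a = {1, 5, 3, 8} and b = {2, 4, 3, 7}, the result is {1, 4, 3, 7}; the third element comes from b because 3 < 3 is false.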
The results of each intrinsic operation are placed in a register. This register is
illustrated for each intrinsic with R or R0-R3. R0, R1, R2 and R3 each represent one
of the four 32-bit pieces of the result register.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
R0 R1 R2 R3
(a0 == b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 == b0) ? (a1 == b1) ? (a2 == b2) ? (a3 == b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 < b0) ? 0xffffffff : 0x0 a1 a2 a3
36
Intel(R) C++ Intrinsics Reference
R0 R1 R2 R3
(a0 < b0) ? (a1 < b1) ? (a2 < b2) ? (a3 < b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 <= b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 <= b0) ? (a1 <= b1) ? (a2 <= b2) ? (a3 <= b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 > b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 > b0) ? (a1 > b1) ? (a2 > b2) ? (a3 > b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 >= b0) ? 0xffffffff : 0x0 a1 a2 a3
37
Intel(R) C++ Intrinsics Reference
R0 R1 R2 R3
(a0 >= b0) ? (a1 >= b1) ? (a2 >= b2) ? (a3 >= b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 != b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
(a0 != b0) ? (a1 != b1) ? (a2 != b2) ? (a3 != b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
!(a0 < b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
!(a0 < b0) ? !(a1 < b1) ? !(a2 < b2) ? !(a3 < b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
!(a0 <= b0) ? 0xffffffff : 0x0 a1 a2 a3
38
Intel(R) C++ Intrinsics Reference
R0 R1 R2 R3
!(a0 <= b0) ? !(a1 <= b1) ? !(a2 <= b2) ? !(a3 <= b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
!(a0 > b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
!(a0 > b0) ? !(a1 > b1) ? !(a2 > b2) ? !(a3 > b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
!(a0 >= b0) ? 0xffffffff : 0x0 a1 a2 a3
R0 R1 R2 R3
!(a0 >= b0) ? !(a1 >= b1) ? !(a2 >= b2) ? !(a3 >= b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 ord? b0) ? 0xffffffff : 0x0 a1 a2 a3
39
Intel(R) C++ Intrinsics Reference
R0 R1 R2 R3
(a0 ord? b0) ? (a1 ord? b1) ? (a2 ord? b2) ? (a3 ord? b3) ?
0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0 0xffffffff : 0x0
R0 R1 R2 R3
(a0 unord? b0) ? 0xffffffff : 0x0 a1 a2 a3
R0                                 R1                                 R2                                 R3
(a0 unord? b0) ? 0xffffffff : 0x0  (a1 unord? b1) ? 0xffffffff : 0x0  (a2 unord? b2) ? 0xffffffff : 0x0  (a3 unord? b3) ? 0xffffffff : 0x0
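The all-ones/all-zeros masks produced by the packed comparisons are typically combined with the logical intrinsics to select values without branching. The following is a minimal sketch, assuming a compiler that provides xmmintrin.h; the function names select_smaller, unordered_lanes and lane are hypothetical helpers introduced only for illustration.

```c
#include <xmmintrin.h>

/* Per-element selection driven by a comparison mask:
   r[i] = (a[i] < b[i]) ? a[i] : b[i]. */
static __m128 select_smaller(__m128 a, __m128 b)
{
    __m128 mask = _mm_cmplt_ps(a, b);          /* 0xffffffff where a < b, else 0x0 */
    return _mm_or_ps(_mm_and_ps(mask, a),      /* keep a where the mask is set */
                     _mm_andnot_ps(mask, b));  /* keep b everywhere else */
}

/* Returns a 4-bit value; bit i is set when a[i] or b[i] is a NaN. */
static int unordered_lanes(__m128 a, __m128 b)
{
    return _mm_movemask_ps(_mm_cmpunord_ps(a, b));
}

/* Hypothetical helper for inspecting results: returns element i (0..3). */
static float lane(__m128 v, int i)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}
```

Because the mask is a full 32-bit pattern per element, the AND/ANDNOT/OR sequence passes the selected float through bit for bit.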
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is
returned. Otherwise 0 is returned.
R
(a0 == b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is
returned. Otherwise 0 is returned.
R
(a0 < b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less
than or equal to b, 1 is returned. Otherwise 0 is returned.
R
(a0 <= b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a greater than b. If a is greater than
b, 1 is returned. Otherwise 0 is returned.
R
(a0 > b0) ? 0x1 : 0x0
R
(a0 >= b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not
equal, 1 is returned. Otherwise 0 is returned.
R
(a0 != b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a equal to b. If a and b are equal, 1 is
returned. Otherwise 0 is returned.
R
(a0 == b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a less than b. If a is less than b, 1 is
returned. Otherwise 0 is returned.
R
(a0 < b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a less than or equal to b. If a is less
than or equal to b, 1 is returned. Otherwise 0 is returned.
R
(a0 <= b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a greater than b. If a is greater than
b, 1 is returned. Otherwise 0 is returned.
R
(a0 > b0) ? 0x1 : 0x0
R
(a0 >= b0) ? 0x1 : 0x0
Compares the lower SP FP value of a and b for a not equal to b. If a and b are not
equal, 1 is returned. Otherwise 0 is returned.
R
(a0 != b0) ? 0x1 : 0x0
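Unlike the mask-producing comparisons, the scalar comparisons described above return an int (1 or 0), so the result can drive ordinary control flow. The sketch below assumes that _mm_comilt_ss and _mm_comigt_ss from xmmintrin.h are the intrinsics behind the "less than" and "greater than" descriptions; clamp_low is a hypothetical name.

```c
#include <xmmintrin.h>

/* Clamps the lowest element of x to the range [lo0, hi0] and returns it
   as a float. Only element 0 of each argument is examined. */
static float clamp_low(__m128 x, __m128 lo, __m128 hi)
{
    if (_mm_comilt_ss(x, lo))       /* 1 when x0 < lo0 */
        return _mm_cvtss_f32(lo);
    if (_mm_comigt_ss(x, hi))       /* 1 when x0 > hi0 */
        return _mm_cvtss_f32(hi);
    return _mm_cvtss_f32(x);
}
```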
The results of each intrinsic operation are placed in a register. This register is
illustrated for each intrinsic with R or R0-R3. R0, R1, R2 and R3 each represent one
of the 4 32-bit pieces of the result register.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
int _mm_cvtss_si32(__m128 a)
R
(int)a0
__int64 _mm_cvtss_si64(__m128 a)
Convert the lower SP FP value of a to a 64-bit signed integer according to the current
rounding mode.
R
(__int64)a0
__m64 _mm_cvtps_pi32(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers according to the
current rounding mode, returning the integers in packed form.
R0 R1
(int)a0 (int)a1
int _mm_cvttss_si32(__m128 a)
R
(int)a0
__int64 _mm_cvttss_si64(__m128 a)
Convert the lower SP FP value of a to a 64-bit signed integer with truncation.
R
(__int64)a0
__m64 _mm_cvttps_pi32(__m128 a)
Convert the two lower SP FP values of a to two 32-bit integers with truncation,
returning the integers in packed form.
R0 R1
(int)a0 (int)a1
Convert the 32-bit integer value b to an SP FP value; the upper three SP FP values
are passed through from a.
R0 R1 R2 R3
(float)b a1 a2 a3
Convert the signed 64-bit integer value b to an SP FP value; the upper three SP FP
values are passed through from a.
R0 R1 R2 R3
(float)b a1 a2 a3
Convert the two 32-bit integer values in packed form in b to two SP FP values; the
upper two SP FP values are passed through from a.
R0 R1 R2 R3
(float)b0 (float)b1 a2 a3
__m128 _mm_cvtpi16_ps(__m64 a)
Convert the four 16-bit signed integer values in a to four single precision FP values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)a2 (float)a3
__m128 _mm_cvtpu16_ps(__m64 a)
Convert the four 16-bit unsigned integer values in a to four single precision FP values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)a2 (float)a3
__m128 _mm_cvtpi8_ps(__m64 a)
Convert the lower four 8-bit signed integer values in a to four single precision FP
values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)a2 (float)a3
__m128 _mm_cvtpu8_ps(__m64 a)
Convert the lower four 8-bit unsigned integer values in a to four single precision FP
values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)a2 (float)a3
Convert the two 32-bit signed integer values in a and the two 32-bit signed integer
values in b to four single precision FP values.
R0 R1 R2 R3
(float)a0 (float)a1 (float)b0 (float)b1
__m64 _mm_cvtps_pi16(__m128 a)
Convert the four single precision FP values in a to four signed 16-bit integer values.
R0 R1 R2 R3
(short)a0 (short)a1 (short)a2 (short)a3
__m64 _mm_cvtps_pi8(__m128 a)
Convert the four single precision FP values in a to the lower four signed 8-bit integer
values of the result.
R0 R1 R2 R3
(char)a0 (char)a1 (char)a2 (char)a3
float _mm_cvtss_f32(__m128 a)
This intrinsic extracts a single precision floating point value from the first vector
element of an __m128. It does so in the most efficient manner possible in the context
used.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
The results of each intrinsic operation are placed in a register. This register is
illustrated for each intrinsic with R0-R3. R0, R1, R2 and R3 each represent one of the
4 32-bit pieces of the result register.
_mm_load_ss     Load the low value and clear the three high values     MOVSS
_mm_load1_ps    Load one value into all four words                     MOVSS + Shuffling
Sets the upper two SP FP values with 64 bits of data loaded from the address p.
R0 R1 R2 R3
a0 a1 *p0 *p1
Sets the lower two SP FP values with 64 bits of data loaded from the address p; the
upper two values are passed through from a.
R0 R1 R2 R3
*p0 *p1 a2 a3
__m128 _mm_load_ss(float * p )
Loads an SP FP value into the low word and clears the upper three words.
R0 R1 R2 R3
*p 0.0 0.0 0.0
__m128 _mm_load1_ps(float * p )
R0 R1 R2 R3
*p *p *p *p
__m128 _mm_load_ps(float * p )
R0 R1 R2 R3
p[0] p[1] p[2] p[3]
__m128 _mm_loadu_ps(float * p)
R0 R1 R2 R3
p[0] p[1] p[2] p[3]
__m128 _mm_loadr_ps(float * p)
R0 R1 R2 R3
p[3] p[2] p[1] p[0]
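The load variants differ mainly in their alignment requirements and element order. The following sketch contrasts them; the vec4 union is a hypothetical helper type used here only to obtain a 16-byte aligned buffer, and the function names are illustrative.

```c
#include <xmmintrin.h>

/* Hypothetical buffer type: the __m128 member gives the union the
   16-byte alignment that _mm_load_ps and _mm_loadr_ps assume. */
typedef union { __m128 v; float f[4]; } vec4;

/* Loads four floats starting at p + offset; no alignment is assumed. */
static __m128 load_window(const float *p, int offset)
{
    return _mm_loadu_ps(p + offset);
}

/* Loads an aligned block with the element order reversed. */
static __m128 load_reversed(const vec4 *block)
{
    return _mm_loadr_ps(block->f);
}

/* Returns element 0 of v. */
static float first(__m128 v)
{
    return _mm_cvtss_f32(v);
}
```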
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R0, R1, R2 and R3 represent the registers in which
results are placed.
__m128 _mm_set_ss(float w )
Sets the low word of an SP FP value to w and clears the upper three words.
R0 R1 R2 R3
w 0.0 0.0 0.0
__m128 _mm_set1_ps(float w )
R0 R1 R2 R3
w w w w
R0 R1 R2 R3
w x y z
R0 R1 R2 R3
z y x w
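The two four-argument set operations differ only in argument order, which is an easy source of mistakes. A minimal sketch, assuming the intrinsics behind the last two tables are _mm_set_ps and _mm_setr_ps; store_set and store_setr are hypothetical names.

```c
#include <xmmintrin.h>

/* _mm_set_ps takes its arguments from the highest element down to the
   lowest, so the last argument lands in R0 (out[0]). */
static void store_set(float out[4])
{
    _mm_storeu_ps(out, _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f));
}

/* _mm_setr_ps takes its arguments in memory order, so the first
   argument lands in R0 (out[0]). */
static void store_setr(float out[4])
{
    _mm_storeu_ps(out, _mm_setr_ps(3.0f, 2.0f, 1.0f, 0.0f));
}
```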
R0 R1 R2 R3
0.0 0.0 0.0 0.0
The detailed description of each intrinsic contains a table detailing the returns. In
these tables, p[n] is an access to the nth element of the result.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
_mm_store1_ps   Store the low value across all four words, address aligned   Shuffling + MOVSS
*p0 *p1
a2 a3
*p0 *p1
a0 a1
*p
a0
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
Loads one cache line of data from address a to a location "closer" to the processor.
The value sel specifies the type of prefetch operation: the constants _MM_HINT_T0,
_MM_HINT_T1, _MM_HINT_T2, and _MM_HINT_NTA should be used for IA-32,
corresponding to the type of prefetch instruction. The constants _MM_HINT_T1,
_MM_HINT_NT1, _MM_HINT_NT2, and _MM_HINT_NTA should be used for systems based
on IA-64 architecture.
Stores the data in a to the address p without polluting the caches. This intrinsic
requires you to empty the multimedia state for the MMX(TM) technology register. See
The EMMS Instruction: Why You Need It.
Stores the data in a to the address p without polluting the caches. The address must
be 16-byte-aligned.
void _mm_sfence(void)
Guarantees that every preceding store is globally visible before any subsequent store.
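A non-temporal store is usually followed by a store fence before other code relies on the data being globally visible. The sketch below assumes _mm_stream_ps is the 16-byte-aligned streaming store described above; block16 and stream_fill_and_sum are hypothetical names introduced for illustration.

```c
#include <xmmintrin.h>

/* Hypothetical buffer type; the __m128 member forces 16-byte alignment. */
typedef union { __m128 v[4]; float f[16]; } block16;

/* Fills 16 floats using non-temporal stores, fences, then sums the
   buffer with ordinary loads so the result can be checked. */
static float stream_fill_and_sum(float value)
{
    static block16 dst;
    __m128 x = _mm_set1_ps(value);
    float sum = 0.0f;
    int i;

    for (i = 0; i < 16; i += 4)
        _mm_stream_ps(&dst.f[i], x);   /* each address is 16-byte aligned */
    _mm_sfence();                      /* order the streamed stores before later stores */

    for (i = 0; i < 16; ++i)
        sum += dst.f[i];
    return sum;
}
```

Streaming stores tend to pay off only for large buffers that will not be read again soon; the 16-element buffer here is kept small purely so the example is easy to verify.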
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R, R0, R1...R7 represent the registers in which results
are placed.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
Before using these intrinsics, you must empty the multimedia state for the MMX(TM)
technology register. See The EMMS Instruction: Why You Need It for more details.
R
(n==0) ? a0 : ( (n==1) ? a1 : ( (n==2) ? a2 : a3 ) )
Inserts word d into one of four words of a. The selector n must be an immediate.
R0 R1 R2 R3
(n==0) ? d : a0; (n==1) ? d : a1; (n==2) ? d : a2; (n==3) ? d : a3;
R0 R1 R2 R3
max(a0, b0) max(a1, b1) max(a2, b2) max(a3, b3)
R0 R1 ... R7
max(a0, b0) max(a1, b1) ... max(a7, b7)
R0 R1 R2 R3
min(a0, b0) min(a1, b1) min(a2, b2) min(a3, b3)
R0 R1 ... R7
min(a0, b0) min(a1, b1) ... min(a7, b7)
int _mm_movemask_pi8(__m64 a)
Creates an 8-bit mask from the most significant bits of the bytes in a.
R
sign(a7)<<7 | sign(a6)<<6 |... | sign(a0)
Multiplies the unsigned words in a and b, returning the upper 16 bits of the 32-bit
intermediate results.
R0 R1 R2 R3
hiword(a0 * b0) hiword(a1 * b1) hiword(a2 * b2) hiword(a3 * b3)
R0                   R1                        R2                        R3
word (n&0x3) of a    word ((n>>2)&0x3) of a    word ((n>>4)&0x3) of a    word ((n>>6)&0x3) of a
Conditionally store byte elements of d to address p. The high bit of each byte in the
selector n determines whether the corresponding byte in d will be stored.
R0                         R1                         ...  R7
(t0 >> 1) | (t0 & 0x01)    (t1 >> 1) | (t1 & 0x01)    ...  (t7 >> 1) | (t7 & 0x01)
where ti = (unsigned int)ai + (unsigned int)bi
R0                         R1                         ...  R7
(t0 >> 1) | (t0 & 0x01)    (t1 >> 1) | (t1 & 0x01)    ...  (t7 >> 1) | (t7 & 0x01)
where ti = (unsigned int)ai + (unsigned int)bi
Computes the sum of the absolute differences of the unsigned bytes in a and b,
returning the value in the lower word. The upper three words are cleared.
R0 R1 R2 R3
abs(a0-b0) +... + abs(a7-b7) 0 0 0
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R, R0, R1, R2 and R3 represent the registers in which
results are placed.
Selects four specific SP FP values from a and b, based on the mask imm8. The mask
must be an immediate. See Macro Function for Shuffle Using Streaming SIMD
Extensions for a description of the shuffle semantics.
R0 R1 R2 R3
a2 b2 a3 b3
R0 R1 R2 R3
a0 b0 a1 b1
Sets the low word to the SP FP value of b. The upper 3 SP FP values are
passed through from a.
R0 R1 R2 R3
b0 a1 a2 a3
Moves the upper 2 SP FP values of b to the lower 2 SP FP values of the result. The
upper 2 SP FP values of a are passed through to the result.
R0 R1 R2 R3
b2 b3 a2 a3
Moves the lower 2 SP FP values of b to the upper 2 SP FP values of the result. The
lower 2 SP FP values of a are passed through to the result.
R0 R1 R2 R3
a0 a1 b0 b1
int _mm_movemask_ps(__m128 a)
Creates a 4-bit mask from the most significant bits of the four SP FP values.
R
sign(a3)<<3 | sign(a2)<<2 | sign(a1)<<1 | sign(a0)
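_mm_movemask_ps is the usual bridge from a vector result back to scalar control flow, since it condenses the four sign bits into an int. A minimal sketch; any_negative and count_less are hypothetical helper names.

```c
#include <xmmintrin.h>

/* Returns 1 when any element of v has its sign bit set. */
static int any_negative(__m128 v)
{
    return _mm_movemask_ps(v) != 0;
}

/* Counts the positions i where a[i] < b[i], by reducing the comparison
   mask to four bits and adding them up. */
static int count_less(__m128 a, __m128 b)
{
    int m = _mm_movemask_ps(_mm_cmplt_ps(a, b));
    return (m & 1) + ((m >> 1) & 1) + ((m >> 2) & 1) + ((m >> 3) & 1);
}
```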
The Streaming SIMD Extensions (SSE) intrinsics provide access to IA-64 instructions
for Streaming SIMD Extensions. To provide source compatibility with the IA-32
architecture, these intrinsics are equivalent both in name and functionality to the set
of IA-32-based SSE intrinsics.
To write programs with the intrinsics, you should be familiar with the hardware
features provided by SSE. Keep the following issues in mind:
Data Types
The new data type __m128 is used with the SSE intrinsics. It represents a 128-bit
quantity composed of four single-precision FP values. This corresponds to the 128-bit
IA-32 Streaming SIMD Extensions register.
The compiler aligns __m128 local data to 16-byte boundaries on the stack. Global
data of these types is also 16-byte aligned. To align integer, float, or double
arrays, you can use the __declspec(align(16)) specifier.
Because IA-64 instructions treat the SSE registers in the same way whether you are
using packed or scalar data, there is no __m32 data type to represent scalar data.
For scalar operations, use the __m128 objects and the "scalar" forms of the intrinsics;
the compiler and the processor implement these operations with 32-bit memory
references. For better performance, however, substitute the packed form for the
scalar form whenever possible.
For more information, see Intel Architecture Software Developer's Manual, Volume 2:
Instruction Set Reference Manual, Intel Corporation, doc. number 243191.
SSE intrinsics are defined for the __m128 data type, a 128-bit quantity consisting of
four single-precision FP values. SIMD instructions for systems based on IA-64
architecture operate on 64-bit FP register quantities containing two single-precision
floating-point values. Thus, each __m128 operand is actually a pair of FP registers
and therefore each intrinsic corresponds to at least one pair of IA-64 instructions
operating on the pair of FP register operands.
Many of the SSE intrinsics for systems based on IA-64 architecture were created for
compatibility with existing IA-32 intrinsics and not for performance. In some
situations, intrinsic usage that improved performance on IA-32 architecture will not
do so on systems based on IA-64 architecture. One reason for this is that some
intrinsics map nicely into the IA-32 instruction set but not into the IA-64 instruction
set. Thus, it is important to differentiate between intrinsics which were implemented
for a performance advantage on systems based on IA-64 architecture, and those
implemented simply to provide compatibility with existing IA-32 code.
The following intrinsics are likely to reduce performance and should only be used to
initially port legacy code or in non-critical code sections:
Any SSE scalar intrinsic (_ss variety) - use packed (_ps) version if possible
comi and ucomi SSE comparisons - these correspond to IA-32 COMISS and
UCOMISS instructions only. A sequence of IA-64 instructions is required to
implement these.
If the inaccuracy is acceptable, the SIMD reciprocal and reciprocal square root
approximation intrinsics (rcp and rsqrt) are much faster than the true div and sqrt
intrinsics.
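The trade-off between the approximation and the true division can be sketched as follows. This is an illustration only, written against xmmintrin.h as provided by IA-32 and Intel 64 compilers; the function names are hypothetical, and the Newton-Raphson refinement step is a common technique that is not described in this reference.

```c
#include <xmmintrin.h>

/* Fast approximate 1/x for four values at once. */
static __m128 recip_approx(__m128 x)
{
    return _mm_rcp_ps(x);
}

/* One Newton-Raphson step, r' = r * (2 - x * r), applied to the
   approximation to recover most of the lost accuracy. */
static __m128 recip_refined(__m128 x)
{
    __m128 r = _mm_rcp_ps(x);
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(r, _mm_sub_ps(two, _mm_mul_ps(x, r)));
}

/* Reference result using the true division. */
static __m128 recip_exact(__m128 x)
{
    return _mm_div_ps(_mm_set1_ps(1.0f), x);
}
```

Whether the refined form is still faster than the division depends on the processor, so it is worth measuring before committing to it.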
Macro Functions
The Streaming SIMD Extensions (SSE) provide a macro function to help create
constants that describe shuffle operations. The macro takes four small integers (in
the range of 0 to 3) and combines them into an 8-bit immediate value used by the
SHUFPS instruction.
You can view the four integers as selectors for choosing which two words from the
first input operand and which two words from the second are to be put into the result
word.
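As a sketch of how the selectors are used, the example below assumes the macro in question is _MM_SHUFFLE from xmmintrin.h, used together with _mm_shuffle_ps. The first two selectors written (the high ones) choose from the second operand and the last two choose from the first operand. The function names and the lane helper are hypothetical.

```c
#include <xmmintrin.h>

/* Copies element 2 of v into all four positions of the result. */
static __m128 broadcast2(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
}

/* Result is { a0, a1, b2, b3 }: the two low selectors (1, 0) index
   into a, the two high selectors (3, 2) index into b. */
static __m128 low_a_high_b(__m128 a, __m128 b)
{
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));
}

/* Hypothetical helper for inspecting results: returns element i (0..3). */
static float lane(__m128 v, int i)
{
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}
```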
The following macro functions enable you to read and write bits to and from the
control register. For details, see Intrinsics to Read and Write Registers. For
Itanium-based systems, these macros do not allow you to access all of the bits of
the FPSR. See the descriptions for the getfpsr() and setfpsr() intrinsics in the
Native Intrinsics for IA-64 Instructions topic.
_MM_EXCEPT_UNDERFLOW
_MM_EXCEPT_INEXACT
_MM_MASK_INEXACT
The following example masks the overflow and underflow exceptions and unmasks all
other exceptions.
The following example tests the rounding mode for round toward zero.
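A minimal sketch of both operations, assuming the control-register macros declared in xmmintrin.h (_MM_SET_EXCEPTION_MASK, _MM_GET_EXCEPTION_MASK, _MM_GET_ROUNDING_MODE, _MM_SET_ROUNDING_MODE). The function names are hypothetical, and each function restores the previous state before returning so that it does not disturb the caller.

```c
#include <xmmintrin.h>

/* Masks only the overflow and underflow exceptions (leaving all others
   unmasked), reads the mask back, then restores the previous mask.
   No floating-point work is done while the other exceptions are unmasked. */
static unsigned int mask_only_overflow_underflow(void)
{
    unsigned int saved = _MM_GET_EXCEPTION_MASK();
    unsigned int active;

    _MM_SET_EXCEPTION_MASK(_MM_MASK_OVERFLOW | _MM_MASK_UNDERFLOW);
    active = _MM_GET_EXCEPTION_MASK();
    _MM_SET_EXCEPTION_MASK(saved);
    return active;
}

/* Returns 1 when the current rounding mode is round toward zero. */
static int rounding_is_toward_zero(void)
{
    return _MM_GET_ROUNDING_MODE() == _MM_ROUND_TOWARD_ZERO;
}

/* Converts x to int with the rounding mode temporarily set to round
   toward zero. The volatile temporaries keep the conversion between
   the two mode changes when the compiler optimizes. */
static int convert_toward_zero(float x)
{
    unsigned int saved = _MM_GET_ROUNDING_MODE();
    volatile float in = x;
    volatile int out;

    _MM_SET_ROUNDING_MODE(_MM_ROUND_TOWARD_ZERO);
    out = _mm_cvtss_si32(_mm_set_ss(in));
    _MM_SET_ROUNDING_MODE(saved);
    return out;
}
```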
The Streaming SIMD Extensions (SSE) provide the following macro function to
transpose a 4 by 4 matrix of single precision floating point values.
The arguments row0, row1, row2, and row3 are __m128 values whose elements form
the corresponding rows of a 4 by 4 matrix. The matrix transposition is returned in
arguments row0, row1, row2, and row3 where row0 now holds column 0 of the
original matrix, row1 now holds column 1 of the original matrix, and so on.
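A short usage sketch, assuming the macro described above is _MM_TRANSPOSE4_PS from xmmintrin.h; transpose4x4 is a hypothetical wrapper that works on a row-major array of 16 floats.

```c
#include <xmmintrin.h>

/* Transposes a row-major 4 by 4 matrix in place. The macro modifies
   its four arguments, so they must be named __m128 variables. */
static void transpose4x4(float m[16])
{
    __m128 row0 = _mm_loadu_ps(m + 0);
    __m128 row1 = _mm_loadu_ps(m + 4);
    __m128 row2 = _mm_loadu_ps(m + 8);
    __m128 row3 = _mm_loadu_ps(m + 12);

    _MM_TRANSPOSE4_PS(row0, row1, row2, row3);

    _mm_storeu_ps(m + 0, row0);    /* now column 0 of the original */
    _mm_storeu_ps(m + 4, row1);    /* now column 1 of the original */
    _mm_storeu_ps(m + 8, row2);
    _mm_storeu_ps(m + 12, row3);
}
```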
This section describes the C++ language-level features supporting the Intel
Pentium 4 processor Streaming SIMD Extensions 2 (SSE2) in the Intel C++
Compiler, which are divided into two categories:
Note
There are no intrinsics for floating-point move operations. To move data from
one register to another, a simple assignment, A = B, suffices, where A and B are
the target and source registers for the move operation.
Note
Itanium Processor
Pentium III Processor
Pentium II Processor
Pentium with MMX Technology
You should be familiar with the hardware features provided by the SSE2 when
writing programs with the intrinsics. The following are three important issues to keep
in mind:
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Note
You can also use the single ia32intrin.h header file for any IA-32 intrinsics.
Floating-point Intrinsics
The arithmetic operations for the Streaming SIMD Extensions 2 (SSE2) are listed in
the following table. The prototypes for SSE2 intrinsics are in the emmintrin.h header
file.
The results of each intrinsic operation are placed in a register. This register is
illustrated for each intrinsic with R0 and R1. R0 and R1 each represent one piece of
the result register.
The Double Complex code sample contains examples of how to use several of these
intrinsics.
R0 R1
a0 + b0 a1
R0 R1
a0 + b0 a1 + b1
R0 R1
a0 - b0 a1
R0 R1
a0 - b0 a1 - b1
Multiplies the lower DP FP values of a and b. The upper DP FP is passed through from
a.
R0 R1
a0 * b0 a1
R0 R1
a0 * b0 a1 * b1
Divides the lower DP FP values of a and b. The upper DP FP value is passed through
from a.
R0 R1
a0 / b0 a1
R0 R1
a0 / b0 a1 / b1
Computes the square root of the lower DP FP value of b. The upper DP FP value is
passed through from a.
R0 R1
sqrt(b0) a1
__m128d _mm_sqrt_pd(__m128d a)
R0 R1
sqrt(a0) sqrt(a1)
Computes the minimum of the lower DP FP values of a and b. The upper DP FP value
is passed through from a.
R0 R1
min (a0, b0) a1
R0 R1
min (a0, b0) min(a1, b1)
Computes the maximum of the lower DP FP values of a and b. The upper DP FP value
is passed through from a.
R0 R1
max (a0, b0) a1
R0 R1
max (a0, b0) max (a1, b1)
The prototypes for Streaming SIMD Extensions 2 (SSE2) intrinsics are in the
emmintrin.h header file.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R0 and R1 represent the registers in which results are
placed.
R0 R1
a0 & b0 a1 & b1
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-bit
value in a.
R0 R1
(~a0) & b0 (~a1) & b1
R0 R1
a0 | b0 a1 | b1
R0 R1
a0 ^ b0 a1 ^ b1
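The operand order of the AND-NOT operation is easy to get backwards: the first argument is the one that is complemented. The sketch below uses that to clear sign bits; it assumes the intrinsics behind the tables above are _mm_andnot_pd and _mm_xor_pd from emmintrin.h, and abs_pd and negate_pd are hypothetical names.

```c
#include <emmintrin.h>

/* Absolute value of both elements: (~sign) & x clears each sign bit. */
static __m128d abs_pd(__m128d x)
{
    __m128d sign = _mm_set1_pd(-0.0);   /* only the sign bit set in each element */
    return _mm_andnot_pd(sign, x);      /* first operand is the one inverted */
}

/* Negates both elements by flipping each sign bit. */
static __m128d negate_pd(__m128d x)
{
    return _mm_xor_pd(x, _mm_set1_pd(-0.0));
}
```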
Each comparison intrinsic performs a comparison of a and b. For the packed form,
the two DP FP values of a and b are compared, and a 128-bit mask is returned. For
the scalar form, the lower DP FP values of a and b are compared, and a 64-bit mask
is returned; the upper DP FP value is passed through from a. The mask is set to
0xffffffffffffffff for each element where the comparison is true and 0x0 where
the comparison is false. The r following the instruction name indicates that the
operands to the instruction are reversed in the actual implementation. The
comparison intrinsics for the Streaming SIMD Extensions 2 (SSE2) are listed in the
following table followed by detailed descriptions.
The results of each intrinsic operation are placed in a register. This register is
illustrated for each intrinsic with R, R0 and R1. R, R0 and R1 each represent one
piece of the result register.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
R0                                        R1
(a0 == b0) ? 0xffffffffffffffff : 0x0     (a1 == b1) ? 0xffffffffffffffff : 0x0
R0                                        R1
(a0 < b0) ? 0xffffffffffffffff : 0x0      (a1 < b1) ? 0xffffffffffffffff : 0x0
R0                                        R1
(a0 <= b0) ? 0xffffffffffffffff : 0x0     (a1 <= b1) ? 0xffffffffffffffff : 0x0
R0                                        R1
(a0 > b0) ? 0xffffffffffffffff : 0x0      (a1 > b1) ? 0xffffffffffffffff : 0x0
R0                                        R1
(a0 >= b0) ? 0xffffffffffffffff : 0x0     (a1 >= b1) ? 0xffffffffffffffff : 0x0
R0                                        R1
(a0 ord b0) ? 0xffffffffffffffff : 0x0    (a1 ord b1) ? 0xffffffffffffffff : 0x0
R0                                          R1
(a0 unord b0) ? 0xffffffffffffffff : 0x0    (a1 unord b1) ? 0xffffffffffffffff : 0x0
R0                                          R1
(a0 != b0) ? 0xffffffffffffffff : 0x0       (a1 != b1) ? 0xffffffffffffffff : 0x0
R0                                          R1
!(a0 < b0) ? 0xffffffffffffffff : 0x0       !(a1 < b1) ? 0xffffffffffffffff : 0x0
Compares the two DP FP values of a and b for a not less than or equal to b.
R0                                        R1
!(a0 <= b0) ? 0xffffffffffffffff : 0x0    !(a1 <= b1) ? 0xffffffffffffffff : 0x0
R0                                        R1
!(a0 > b0) ? 0xffffffffffffffff : 0x0     !(a1 > b1) ? 0xffffffffffffffff : 0x0
Compares the two DP FP values of a and b for a not greater than or equal to b.
R0                                        R1
!(a0 >= b0) ? 0xffffffffffffffff : 0x0    !(a1 >= b1) ? 0xffffffffffffffff : 0x0
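One practical difference between the plain and the negated packed comparisons is their behavior with NaN operands: "not less than" is true for an unordered pair, while "greater than or equal" is false. The sketch below assumes the names _mm_cmpge_pd and _mm_cmpnlt_pd from emmintrin.h; zero_below and not_less_bits are hypothetical helpers.

```c
#include <emmintrin.h>

/* r[i] = (a[i] >= t[i]) ? a[i] : 0.0, using the 64-bit mask directly:
   ANDing with all ones keeps the value, ANDing with zero gives +0.0. */
static __m128d zero_below(__m128d a, __m128d t)
{
    __m128d mask = _mm_cmpge_pd(a, t);
    return _mm_and_pd(mask, a);
}

/* 2-bit summary: bit i is set when a[i] is not less than b[i]. The bit
   is also set when either operand is a NaN. */
static int not_less_bits(__m128d a, __m128d b)
{
    return _mm_movemask_pd(_mm_cmpnlt_pd(a, b));
}
```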
Compares the lower DP FP value of a and b for equality. The upper DP FP value is
passed through from a.
R0 R1
(a0 == b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a less than b. The upper DP FP value
is passed through from a.
R0 R1
(a0 < b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a less than or equal to b. The upper
DP FP value is passed through from a.
R0 R1
(a0 <= b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a greater than b. The upper DP FP
value is passed through from a.
R0 R1
(a0 > b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a greater than or equal to b. The
upper DP FP value is passed through from a.
R0 R1
(a0 >= b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for ordered. The upper DP FP value is
passed through from a.
R0 R1
(a0 ord b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for unordered. The upper DP FP value is
passed through from a.
R0 R1
(a0 unord b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for inequality. The upper DP FP value is
passed through from a.
R0 R1
(a0 != b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a not less than b. The upper DP FP
value is passed through from a.
R0 R1
!(a0 < b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a not less than or equal to b. The
upper DP FP value is passed through from a.
R0 R1
!(a0 <= b0) ? 0xffffffffffffffff : 0x0 a1
Compares the lower DP FP value of a and b for a not greater than b. The upper DP FP
value is passed through from a.
R0 R1
!(a0 > b0) ? 0xffffffffffffffff : 0x0 a1
R0 R1
!(a0 >= b0) ? 0xffffffffffffffff : 0x0 a1
R
(a0 == b0) ? 0x1 : 0x0
R
(a0 < b0) ? 0x1 : 0x0
R
(a0 <= b0) ? 0x1 : 0x0
R
(a0 > b0) ? 0x1 : 0x0
R
(a0 >= b0) ? 0x1 : 0x0
R
(a0 != b0) ? 0x1 : 0x0
R
(a0 == b0) ? 0x1 : 0x0
R
(a0 < b0) ? 0x1 : 0x0
R
(a0 <= b0) ? 0x1 : 0x0
R
(a0 > b0) ? 0x1 : 0x0
R
(a0 >= b0) ? 0x1 : 0x0
R
(a0 != b0) ? 0x1 : 0x0
Each conversion intrinsic takes one data type and performs a conversion to a
different type. Some conversions such as _mm_cvtpd_ps result in a loss of precision.
The rounding mode used in such cases is determined by the value in the MXCSR
register. The default rounding mode is round-to-nearest. Note that the rounding
mode used by the C and C++ languages when performing a type conversion is to
truncate. The _mm_cvttpd_epi32 and _mm_cvttsd_si32 intrinsics use the truncate
rounding mode regardless of the mode specified by the MXCSR register.
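The difference between the MXCSR-controlled conversion and the truncating one can be seen with a few sample values. This sketch assumes the MXCSR register still holds its default round-to-nearest mode; to_int_mxcsr and to_int_trunc are hypothetical names.

```c
#include <emmintrin.h>

/* Converts using the rounding mode currently set in MXCSR
   (round-to-nearest-even unless the program has changed it). */
static int to_int_mxcsr(double x)
{
    return _mm_cvtsd_si32(_mm_set_sd(x));
}

/* Converts with truncation, matching a C cast such as (int)x,
   whatever rounding mode MXCSR specifies. */
static int to_int_trunc(double x)
{
    return _mm_cvttsd_si32(_mm_set_sd(x));
}
```

With the default mode, halfway cases go to the nearest even integer, so 2.5 converts to 2 and 3.5 converts to 4, while the truncating form drops the fraction in both cases.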
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R, R0, R1, R2 and R3 represent the registers in which
results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
__m128 _mm_cvtpd_ps(__m128d a)
R0 R1 R2 R3
(float) a0 (float) a1 0.0 0.0
__m128d _mm_cvtps_pd(__m128 a)
R0 R1
(double) a0 (double) a1
__m128d _mm_cvtepi32_pd(__m128i a)
R0 R1
(double) a0 (double) a1
__m128i _mm_cvtpd_epi32(__m128d a)
R0 R1 R2 R3
(int) a0 (int) a1 0x0 0x0
int _mm_cvtsd_si32(__m128d a)
R
(int) a0
R0 R1 R2 R3
(float) b0 a1 a2 a3
R0 R1
(double) b a1
R0 R1
(double) b0 a1
__m128i _mm_cvttpd_epi32(__m128d a)
R0 R1 R2 R3
(int) a0 (int) a1 0x0 0x0
int _mm_cvttsd_si32(__m128d a)
R
(int) a0
__m64 _mm_cvtpd_pi32(__m128d a)
R0 R1
(int)a0 (int) a1
__m64 _mm_cvttpd_pi32(__m128d a)
Converts the two DP FP values of a to 32-bit signed integer values using truncate.
R0 R1
(int)a0 (int) a1
__m128d _mm_cvtpi32_pd(__m64 a)
R0 R1
(double)a0 (double)a1
double _mm_cvtsd_f64(__m128d a)
This intrinsic extracts a double precision floating point value from the first vector
element of an __m128d. It does so in the most efficient manner possible in the
context used. This intrinsic does not map to any specific SSE2 instruction.
The following load operation intrinsics and their respective instructions are functional
in the Streaming SIMD Extensions 2 (SSE2).
The load and set operations are similar in that both initialize __m128d data. However,
the set operations take a double argument and are intended for initialization with
constants, while the load operations take a double pointer argument and are
intended to mimic the instructions for loading data from memory.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R0 and R1 represent the registers in which results are
placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
The Double Complex code sample contains examples of how to use several of these
intrinsics.
R0 R1
p[0] p[1]
Loads a single DP FP value, copying to both elements. The address p need not be
16-byte aligned.
R0 R1
*p *p
Loads two DP FP values in reverse order. The address p must be 16-byte aligned.
R0 R1
p[1] p[0]
R0 R1
p[0] p[1]
Loads a DP FP value. The upper DP FP is set to zero. The address p need not be 16-
byte aligned.
R0 R1
*p 0.0
Loads a DP FP value as the upper DP FP value of the result. The lower DP FP value is
passed through from a. The address p need not be 16-byte aligned.
R0 R1
a0 *p
Loads a DP FP value as the lower DP FP value of the result. The upper DP FP value is
passed through from a. The address p need not be 16-byte aligned.
R0 R1
*p a1
The following set operation intrinsics and their respective instructions are functional
in the Streaming SIMD Extensions 2 (SSE2).
The load and set operations are similar in that both initialize __m128d data. However,
the set operations take a double argument and are intended for initialization with
constants, while the load operations take a double pointer argument and are
intended to mimic the instructions for loading data from memory.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R0 and R1 represent the registers in which results are
placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
__m128d _mm_set_sd(double w)
Sets the lower DP FP value to w and sets the upper DP FP value to zero.
R0 R1
w 0.0
__m128d _mm_set1_pd(double w)
R0 R1
w w
R0 R1
x w
R0 R1
w x
__m128d _mm_setzero_pd(void)
R0 R1
0.0 0.0
Sets the lower DP FP value to the lower DP FP value of b. The upper DP FP value is
passed through from a.
R0 R1
b0 a1
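The element order of the two-argument set operations and the pass-through behavior of the scalar move can be sketched as follows, assuming the intrinsics behind the tables above are _mm_set_pd, _mm_setr_pd and _mm_move_sd from emmintrin.h. The helper names are hypothetical.

```c
#include <emmintrin.h>

/* Builds { lo, hi }: _mm_set_pd takes the upper element first. */
static __m128d make_pair(double lo, double hi)
{
    return _mm_set_pd(hi, lo);
}

/* Replaces the lower element of a with the lower element of b;
   the upper element of a passes through unchanged. */
static __m128d replace_low(__m128d a, __m128d b)
{
    return _mm_move_sd(a, b);
}

/* Hypothetical accessors for inspecting results. */
static double low_of(__m128d v)  { return _mm_cvtsd_f64(v); }
static double high_of(__m128d v) { return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); }
```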
The following store operation intrinsics and their respective instructions are
functional in the Streaming SIMD Extensions 2 (SSE2).
The detailed description of each intrinsic contains a table detailing the returns. In
these tables, dp[n] is an access to the nth element of the result.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
The Double Complex code sample contains an example of how to use the
_mm_store_pd intrinsic.
Stores the lower DP FP value of a. The address dp need not be 16-byte aligned.
*dp
a0
Stores the lower DP FP value of a twice. The address dp must be 16-byte aligned.
dp[0] dp[1]
a0 a0
dp[0] dp[1]
a0 a1
dp[0] dp[1]
a0 a1
Stores two DP FP values in reverse order. The address dp must be 16-byte aligned.
dp[0] dp[1]
a1 a0
*dp
a1
*dp
a0
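A minimal sketch of the half stores and the reversed store, assuming the names _mm_storel_pd, _mm_storeh_pd and _mm_storer_pd from emmintrin.h. The pair2 union is a hypothetical helper type that provides the 16-byte alignment the reversed store requires.

```c
#include <emmintrin.h>

/* Hypothetical buffer type; the __m128d member forces 16-byte alignment. */
typedef union { __m128d v; double d[2]; } pair2;

/* Writes the lower element to *lo and the upper element to *hi.
   Neither address needs to be 16-byte aligned. */
static void split_pair(__m128d v, double *lo, double *hi)
{
    _mm_storel_pd(lo, v);
    _mm_storeh_pd(hi, v);
}

/* Writes both elements in reverse order to an aligned buffer. */
static void store_reversed(pair2 *out, __m128d v)
{
    _mm_storer_pd(out->d, v);
}
```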
Integer Intrinsics
The integer arithmetic operations for Streaming SIMD Extensions 2 (SSE2) are listed
in the following table followed by their descriptions. The floating point packed
arithmetic intrinsics for SSE2 are listed in the Floating-point Arithmetic Operations
topic.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R, R0, R1...R15 represent the registers in which results
are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Adds the 16 signed or unsigned 8-bit integers in a to the 16 signed or unsigned 8-bit
integers in b.
R0 R1 ... R15
a0 + b0 a1 + b1 ... a15 + b15
Adds the 8 signed or unsigned 16-bit integers in a to the 8 signed or unsigned 16-bit
integers in b.
R0 R1 ... R7
a0 + b0 a1 + b1 ... a7 + b7
Adds the 4 signed or unsigned 32-bit integers in a to the 4 signed or unsigned 32-bit
integers in b.
R0 R1 R2 R3
a0 + b0 a1 + b1 a2 + b2 a3 + b3
Adds the signed or unsigned 64-bit integer a to the signed or unsigned 64-bit integer
b.
R0
a + b
Adds the 2 signed or unsigned 64-bit integers in a to the 2 signed or unsigned 64-bit
integers in b.
R0 R1
a0 + b0 a1 + b1
Adds the 16 signed 8-bit integers in a to the 16 signed 8-bit integers in b using
saturating arithmetic.
R0                         R1                         ...  R15
SignedSaturate(a0 + b0)    SignedSaturate(a1 + b1)    ...  SignedSaturate(a15 + b15)
Adds the 8 signed 16-bit integers in a to the 8 signed 16-bit integers in b using
saturating arithmetic.
R0                         R1                         ...  R7
SignedSaturate(a0 + b0)    SignedSaturate(a1 + b1)    ...  SignedSaturate(a7 + b7)
Adds the 16 unsigned 8-bit integers in a to the 16 unsigned 8-bit integers in b using
saturating arithmetic.
R0                           R1                           ...  R15
UnsignedSaturate(a0 + b0)    UnsignedSaturate(a1 + b1)    ...  UnsignedSaturate(a15 + b15)
Adds the 8 unsigned 16-bit integers in a to the 8 unsigned 16-bit integers in b using
saturating arithmetic.
R0                           R1                           ...  R7
UnsignedSaturate(a0 + b0)    UnsignedSaturate(a1 + b1)    ...  UnsignedSaturate(a7 + b7)
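The difference between the wrapping and the saturating additions shows up as soon as a sum leaves the range of the element type. The sketch below assumes the names _mm_add_epi8, _mm_adds_epu8 and _mm_adds_epi8 from emmintrin.h; the wrapper functions are hypothetical and inspect only the lowest byte of the result.

```c
#include <emmintrin.h>

/* Ordinary 8-bit addition: the result wraps around modulo 256. */
static unsigned char add_wrap_u8(unsigned char x, unsigned char y)
{
    __m128i r = _mm_add_epi8(_mm_set1_epi8((char)x), _mm_set1_epi8((char)y));
    return (unsigned char)_mm_cvtsi128_si32(r);   /* lowest byte of the result */
}

/* Unsigned saturating addition: the result is clamped to 255. */
static unsigned char add_sat_u8(unsigned char x, unsigned char y)
{
    __m128i r = _mm_adds_epu8(_mm_set1_epi8((char)x), _mm_set1_epi8((char)y));
    return (unsigned char)_mm_cvtsi128_si32(r);
}

/* Signed saturating addition: the result is clamped to [-128, 127]. */
static signed char add_sat_i8(signed char x, signed char y)
{
    __m128i r = _mm_adds_epi8(_mm_set1_epi8(x), _mm_set1_epi8(y));
    return (signed char)_mm_cvtsi128_si32(r);
}
```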
Computes the average of the 16 unsigned 8-bit integers in a and the 16 unsigned 8-
bit integers in b and rounds.
R0 R1 ... R15
(a0 + b0) / 2 (a1 + b1) / 2 ... (a15 + b15) / 2
Computes the average of the 8 unsigned 16-bit integers in a and the 8 unsigned 16-
bit integers in b and rounds.
R0 R1 ... R7
(a0 + b0) / 2 (a1 + b1) / 2 ... (a7 + b7) / 2
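The tables show (a + b) / 2, while the descriptions say the result is rounded. As a hedged illustration: the underlying PAVGB instruction computes (a + b + 1) >> 1 in a wider intermediate, so halves round up and the sum cannot overflow. The sketch below assumes the emmintrin.h name _mm_avg_epu8; avg_u8 is a hypothetical helper.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Rounded average of 16 unsigned bytes: out[i] = (a[i] + b[i] + 1) >> 1. */
void avg_u8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
}
```

For example, averaging 1 and 2 gives 2 rather than 1, and averaging 255 and 255 gives 255.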
Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b.
Adds the signed 32-bit integer results pairwise and packs the 4 signed 32-bit integer
results.
R0 R1 R2 R3
(a0 * b0) + (a1 * b1) (a2 * b2) + (a3 * b3) (a4 * b4) + (a5 * b5) (a6 * b6) + (a7 * b7)
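The multiply-and-add-pairs operation is a natural building block for a 16-bit dot product. The sketch below assumes the emmintrin.h name _mm_madd_epi16; dot8_i16 is a hypothetical helper, and the final horizontal sum is done in scalar code for clarity rather than speed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Dot product of two vectors of 8 signed 16-bit integers. */
int32_t dot8_i16(const int16_t a[8], const int16_t b[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* pairs holds R0..R3: four 32-bit sums of adjacent products */
    __m128i pairs = _mm_madd_epi16(va, vb);
    int32_t r[4];
    _mm_storeu_si128((__m128i *)r, pairs);
    return r[0] + r[1] + r[2] + r[3];
}
```

With a = {1, 2, ..., 8} and every element of b equal to 2, the result is 72.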
Computes the pairwise maxima of the 8 signed 16-bit integers from a and the 8
signed 16-bit integers from b.
R0 R1 ... R7
max(a0, b0) max(a1, b1) ... max(a7, b7)
Computes the pairwise maxima of the 16 unsigned 8-bit integers from a and the 16
unsigned 8-bit integers from b.
R0 R1 ... R15
max(a0, b0) max(a1, b1) ... max(a15, b15)
Computes the pairwise minima of the 8 signed 16-bit integers from a and the 8
signed 16-bit integers from b.
R0 R1 ... R7
min(a0, b0) min(a1, b1) ... min(a7, b7)
Computes the pairwise minima of the 16 unsigned 8-bit integers from a and the 16
unsigned 8-bit integers from b.
R0 R1 ... R15
min(a0, b0) min(a1, b1) ... min(a15, b15)
Multiplies the 8 signed 16-bit integers from a by the 8 signed 16-bit integers from b.
Packs the upper 16-bits of the 8 signed 32-bit results.
R0 R1 ... R7
(a0 * b0)[31:16] (a1 * b1)[31:16] ... (a7 * b7)[31:16]
Multiplies the 8 unsigned 16-bit integers from a by the 8 unsigned 16-bit integers
from b. Packs the upper 16-bits of the 8 unsigned 32-bit results.
R0 R1 ... R7
(a0 * b0)[31:16] (a1 * b1)[31:16] ... (a7 * b7)[31:16]
__m128i _mm_mullo_epi16(__m128i a, __m128i b)
R0 R1 ... R7
(a0 * b0)[15:0] (a1 * b1)[15:0] ... (a7 * b7)[15:0]
Multiplies the lower 32-bit integer from a by the lower 32-bit integer from b, and
returns the 64-bit integer result.
R0
a0 * b0
R0 R1
a0 * b0 a2 * b2
Computes the absolute difference of the 16 unsigned 8-bit integers from a and the
16 unsigned 8-bit integers from b. Sums the upper 8 differences and lower 8
differences, and packs the resulting 2 unsigned 16-bit integers into the upper and
lower 64-bit elements.
R0 R1 R2 R3 R4 R5 R6 R7
abs(a0 - b0) + abs(a1 - b1) + ... + abs(a7 - b7) 0x0 0x0 0x0 abs(a8 - b8) + abs(a9 - b9) + ... + abs(a15 - b15) 0x0 0x0 0x0
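Because the two partial sums land in R0 and R4, a full 16-byte sum of absolute differences needs one more addition. The following sketch assumes the emmintrin.h names _mm_sad_epu8, _mm_cvtsi128_si32, and _mm_extract_epi16; sad16_u8 is a hypothetical helper.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum of |a[i] - b[i]| over 16 unsigned bytes. */
uint32_t sad16_u8(const uint8_t a[16], const uint8_t b[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);
    /* R0 (bits 15:0) holds the sum for bytes 0..7,
       R4 (bits 79:64) holds the sum for bytes 8..15. */
    uint32_t lo = (uint32_t)_mm_cvtsi128_si32(s);
    uint32_t hi = (uint32_t)_mm_extract_epi16(s, 4);
    return lo + hi;
}
```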
R0 R1 ... R15
a0 - b0 a1 - b1 ... a15 - b15
__m128i _mm_sub_epi16(__m128i a, __m128i b)
R0 R1 ... R7
a0 - b0 a1 - b1 ... a7 - b7
R0 R1 R2 R3
a0 - b0 a1 - b1 a2 - b2 a3 - b3
Subtracts the signed or unsigned 64-bit integer b from the signed or unsigned 64-bit
integer a.
R
a - b
R0 R1
a0 - b0 a1 - b1
Subtracts the 16 signed 8-bit integers of b from the 16 signed 8-bit integers of a
using saturating arithmetic.
R0 R1 ... R15
SignedSaturate(a0 - b0) SignedSaturate(a1 - b1) ... SignedSaturate(a15 - b15)
Subtracts the 8 signed 16-bit integers of b from the 8 signed 16-bit integers of a
using saturating arithmetic.
R0 R1 ... R7
SignedSaturate(a0 - b0) SignedSaturate(a1 - b1) ... SignedSaturate(a7 - b7)
Subtracts the 16 unsigned 8-bit integers of b from the 16 unsigned 8-bit integers of
a using saturating arithmetic.
R0 R1 ... R15
UnsignedSaturate(a0 - b0) UnsignedSaturate(a1 - b1) ... UnsignedSaturate(a15 - b15)
Subtracts the 8 unsigned 16-bit integers of b from the 8 unsigned 16-bit integers of
a using saturating arithmetic.
R0 R1 ... R7
UnsignedSaturate(a0 - b0) UnsignedSaturate(a1 - b1) ... UnsignedSaturate(a7 - b7)
The following four logical-operation intrinsics and their respective instructions are
functional as part of Streaming SIMD Extensions 2 (SSE2).
The results of each intrinsic operation are placed in register R. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Computes the bitwise AND of the 128-bit value in a and the 128-bit value in b.
R0
a & b
Computes the bitwise AND of the 128-bit value in b and the bitwise NOT of the 128-
bit value in a.
R0
(~a) & b
Computes the bitwise OR of the 128-bit value in a and the 128-bit value in b.
R0
a | b
Computes the bitwise XOR of the 128-bit value in a and the 128-bit value in b.
R0
a ^ b
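Note that the AND-NOT operation inverts its first operand, not its second. A common use of the three operations together is a bitwise select driven by a comparison mask. The sketch below assumes the emmintrin.h names _mm_and_si128, _mm_andnot_si128, _mm_or_si128, and _mm_cmpgt_epi32; max_i32x4 is a hypothetical helper.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Element-wise maximum of 4 signed 32-bit integers, built from
   a compare mask and the select idiom (mask & a) | (~mask & b). */
void max_i32x4(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va   = _mm_loadu_si128((const __m128i *)a);
    __m128i vb   = _mm_loadu_si128((const __m128i *)b);
    __m128i mask = _mm_cmpgt_epi32(va, vb);        /* all ones where a > b */
    __m128i take_a = _mm_and_si128(mask, va);      /* mask & a    */
    __m128i take_b = _mm_andnot_si128(mask, vb);   /* (~mask) & b */
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(take_a, take_b));
}
```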
The shift-operation intrinsics for Streaming SIMD Extensions 2 (SSE2) and the
description for each are listed in the following table.
The results of each intrinsic operation are placed in a register. This register is
illustrated for each intrinsic with R and R0-R7. R and R0-R7 each represent one of
the pieces of the result register.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Note
The count argument is one shift count that applies to all elements of the operand
being shifted. It is not a vector shift count that shifts each element by a different
amount.
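The note above can be illustrated with a short sketch: one scalar count is placed in the low bits of a vector and applied to every element. It assumes the emmintrin.h names _mm_sll_epi16 and _mm_cvtsi32_si128; shl_all_u16 is a hypothetical helper.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Shifts all 8 unsigned 16-bit elements left by the same count. */
void shl_all_u16(const uint16_t a[8], int count, uint16_t out[8])
{
    __m128i va     = _mm_loadu_si128((const __m128i *)a);
    /* The count lives in the low 64 bits of the second operand;
       it is a single value, not one count per element. */
    __m128i vcount = _mm_cvtsi32_si128(count);
    _mm_storeu_si128((__m128i *)out, _mm_sll_epi16(va, vcount));
}
```

Bits shifted out of an element are discarded rather than carried into its neighbor, and a count larger than 15 clears every 16-bit element.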
Shifts the 128-bit value in a left by imm bytes while shifting in zeros. imm must be an
immediate.
R
a << (imm * 8)
R0 R1 ... R7
a0 << count a1 << count ... a7 << count
R0 R1 ... R7
a0 << count a1 << count ... a7 << count
R0 R1 R2 R3
a0 << count a1 << count a2 << count a3 << count
R0 R1 R2 R3
a0 << count a1 << count a2 << count a3 << count
R0 R1
a0 << count a1 << count
R0 R1
a0 << count a1 << count
R0 R1 ... R7
a0 >> count a1 >> count ... a7 >> count
R0 R1 ... R7
a0 >> count a1 >> count ... a7 >> count
R0 R1 R2 R3
a0 >> count a1 >> count a2 >> count a3 >> count
R0 R1 R2 R3
a0 >> count a1 >> count a2 >> count a3 >> count
Shifts the 128-bit value in a right by imm bytes while shifting in zeros.
imm must be an immediate.
R
srl(a, imm*8)
R0 R1 ... R7
srl(a0, count) srl(a1, count) ... srl(a7, count)
R0 R1 ... R7
srl(a0, count) srl(a1, count) ... srl(a7, count)
R0 R1 R2 R3
srl(a0, count) srl(a1, count) srl(a2, count) srl(a3, count)
R0 R1 R2 R3
srl(a0, count) srl(a1, count) srl(a2, count) srl(a3, count)
R0 R1
srl(a0, count) srl(a1, count)
R0 R1
srl(a0, count) srl(a1, count)
The comparison intrinsics for Streaming SIMD Extensions 2 (SSE2) and descriptions
for each are listed in the following table.
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R, R0, R1...R15 represent the registers in which results
are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
R0 R1 ... R15
(a0 == b0) ? 0xff : 0x0 (a1 == b1) ? 0xff : 0x0 ... (a15 == b15) ? 0xff : 0x0
Compares the 8 signed or unsigned 16-bit integers in a and the 8 signed or unsigned
16-bit integers in b for equality.
R0 R1 ... R7
(a0 == b0) ? 0xffff : 0x0 (a1 == b1) ? 0xffff : 0x0 ... (a7 == b7) ? 0xffff : 0x0
Compares the 4 signed or unsigned 32-bit integers in a and the 4 signed or unsigned
32-bit integers in b for equality.
R0 R1 R2 R3
(a0 == b0) ? 0xffffffff : 0x0 (a1 == b1) ? 0xffffffff : 0x0 (a2 == b2) ? 0xffffffff : 0x0 (a3 == b3) ? 0xffffffff : 0x0
Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for
greater than.
R0 R1 ... R15
(a0 > b0) ? 0xff : 0x0 (a1 > b1) ? 0xff : 0x0 ... (a15 > b15) ? 0xff : 0x0
Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for
greater than.
R0 R1 ... R7
(a0 > b0) ? 0xffff : 0x0 (a1 > b1) ? 0xffff : 0x0 ... (a7 > b7) ? 0xffff : 0x0
Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for
greater than.
R0 R1 R2 R3
(a0 > b0) ? 0xffffffff : 0x0 (a1 > b1) ? 0xffffffff : 0x0 (a2 > b2) ? 0xffffffff : 0x0 (a3 > b3) ? 0xffffffff : 0x0
Compares the 16 signed 8-bit integers in a and the 16 signed 8-bit integers in b for
less than.
R0 R1 ... R15
(a0 < b0) ? 0xff : 0x0 (a1 < b1) ? 0xff : 0x0 ... (a15 < b15) ? 0xff : 0x0
Compares the 8 signed 16-bit integers in a and the 8 signed 16-bit integers in b for
less than.
R0 R1 ... R7
(a0 < b0) ? 0xffff : 0x0 (a1 < b1) ? 0xffff : 0x0 ... (a7 < b7) ? 0xffff : 0x0
Compares the 4 signed 32-bit integers in a and the 4 signed 32-bit integers in b for
less than.
R0 R1 R2 R3
(a0 < b0) ? 0xffffffff : 0x0 (a1 < b1) ? 0xffffffff : 0x0 (a2 < b2) ? 0xffffffff : 0x0 (a3 < b3) ? 0xffffffff : 0x0
The following conversion intrinsics and their respective instructions are functional in
the Streaming SIMD Extensions 2 (SSE2).
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R, R0, R1, R2 and R3 represent the registers in which
results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
R0 R1
(double)b a1
__int64 _mm_cvtsd_si64(__m128d a)
Converts the lower DP FP value of a to a 64-bit signed integer value according to the
current rounding mode.
R
(__int64) a0
__int64 _mm_cvttsd_si64(__m128d a)
Converts the lower DP FP value of a to a 64-bit signed integer value using truncation.
R
(__int64) a0
__m128 _mm_cvtepi32_ps(__m128i a)
R0 R1 R2 R3
(float) a0 (float) a1 (float) a2 (float) a3
__m128i _mm_cvtps_epi32(__m128 a)
R0 R1 R2 R3
(int) a0 (int) a1 (int) a2 (int) a3
__m128i _mm_cvttps_epi32(__m128 a)
R0 R1 R2 R3
(int) a0 (int) a1 (int) a2 (int) a3
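The result tables for _mm_cvtps_epi32 and _mm_cvttps_epi32 look identical, but the two differ in rounding: the first follows the current rounding mode, the second always truncates toward zero. The sketch below assumes the default MXCSR rounding mode (round to nearest, ties to even); round_vs_trunc is a hypothetical helper.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE float loads) */
#include <stdint.h>

/* Converts 4 floats to 32-bit integers two ways. */
void round_vs_trunc(const float a[4], int32_t rounded[4], int32_t truncated[4])
{
    __m128 va = _mm_loadu_ps(a);
    /* Uses the current rounding mode (assumed here: nearest, ties to even). */
    _mm_storeu_si128((__m128i *)rounded, _mm_cvtps_epi32(va));
    /* Always truncates toward zero. */
    _mm_storeu_si128((__m128i *)truncated, _mm_cvttps_epi32(va));
}
```

Under that assumption, the input {2.7, -2.7, 2.5, 3.5} converts to {3, -3, 2, 4} with rounding and to {2, -2, 2, 3} with truncation.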
The following conversion intrinsics and their respective instructions are functional in
the Streaming SIMD Extensions 2 (SSE2).
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R, R0, R1, R2 and R3 represent the registers in which
results are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
__m128i _mm_cvtsi32_si128(int a)
Moves 32-bit integer a to the least significant 32 bits of an __m128i object. Zeroes
the upper 96 bits of the __m128i object.
R0 R1 R2 R3
a 0x0 0x0 0x0
__m128i _mm_cvtsi64_si128(__int64 a)
Moves 64-bit integer a to the lower 64 bits of an __m128i object, zeroing the upper
bits.
R0 R1
a 0x0
int _mm_cvtsi128_si32(__m128i a)
R
a0
__int64 _mm_cvtsi128_si64(__m128i a)
R
a0
The following load operation intrinsics and their respective instructions are functional
in the Streaming SIMD Extensions 2 (SSE2).
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R, R0 and R1 represent the registers in which results
are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
R
*p
R
*p
Load the lower 64 bits of the value pointed to by p into the lower 64 bits of the result,
zeroing the upper 64 bits of the result.
R0 R1
*p[63:0] 0x0
The following set operation intrinsics and their respective instructions are functional
in the Streaming SIMD Extensions 2 (SSE2).
The results of each intrinsic operation are placed in registers. The information about
what is placed in each register appears in the tables below, in the detailed
explanation of each intrinsic. R, R0, R1...R15 represent the registers in which results
are placed.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
R0 R1
q0 q1
R0 R1 R2 R3
i0 i1 i2 i3
__m128i _mm_set_epi16(short w7, short w6, short w5, short w4, short w3,
short w2, short w1, short w0)
R0 R1 ... R7
w0 w1 ... w7
__m128i _mm_set_epi8(char b15, char b14, char b13, char b12, char b11,
char b10, char b9, char b8, char b7, char b6, char b5, char b4, char b3,
char b2, char b1, char b0)
R0 R1 ... R15
b0 b1 ... b15
__m128i _mm_set1_epi64(__m64 q)
R0 R1
q q
__m128i _mm_set1_epi32(int i)
R0 R1 R2 R3
i i i i
__m128i _mm_set1_epi16(short w)
R0 R1 ... R7
w w ... w
__m128i _mm_set1_epi8(char b)
R0 R1 ... R15
b b ... b
R0 R1
q0 q1
R0 R1 R2 R3
i0 i1 i2 i3
__m128i _mm_setr_epi16(short w0, short w1, short w2, short w3, short w4,
short w5, short w6, short w7)
R0 R1 ... R7
w0 w1 ... w7
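The set and setr forms produce the same register contents when their arguments are written in opposite orders: _mm_set_epi16 takes the highest element first, _mm_setr_epi16 the lowest. A minimal sketch follows; set_order_demo is a hypothetical helper.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Builds the same vector with both argument orders and stores each copy. */
void set_order_demo(int16_t from_set[8], int16_t from_setr[8])
{
    /* Highest element (w7) is the first argument. */
    __m128i s = _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0);
    /* Lowest element (w0) is the first argument. */
    __m128i r = _mm_setr_epi16(0, 1, 2, 3, 4, 5, 6, 7);
    _mm_storeu_si128((__m128i *)from_set, s);
    _mm_storeu_si128((__m128i *)from_setr, r);
}
```

After the call, both arrays hold 0 through 7 in ascending memory order, since R0 is stored at the lowest address.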
__m128i _mm_setr_epi8(char b0, char b1, char b2, char b3, char b4,
char b5, char b6, char b7, char b8, char b9, char b10, char b11, char b12,
char b13, char b14, char b15)
R0 R1 ... R15
b0 b1 ... b15
__m128i _mm_setzero_si128()
R
0x0
The following store operation intrinsics and their respective instructions are
functional in the Streaming SIMD Extensions 2 (SSE2).
The detailed description of each intrinsic contains a table detailing the returns. In
these tables, p is an access to the result.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Stores the data in a to the address p without polluting the caches. If the cache line
containing address p is already in the cache, the cache will be updated. Address p
must be 16 byte aligned.
*p
a
Stores the data in a to the address p without polluting the caches. If the cache line
containing address p is already in the cache, the cache will be updated.
*p
a
*p
a
*p
a
Conditionally store byte elements of d to address p. The high bit of each byte in the
selector n determines whether the corresponding byte in d will be stored. Address p
need not be 16-byte aligned.
*p[63:0]
a0
The prototypes for Streaming SIMD Extensions 2 (SSE2) intrinsics are in the
emmintrin.h header file.
Stores the data in a to the address p without polluting caches. The address p must
be 16-byte aligned. If the cache line containing address p is already in the cache, the
cache will be updated.
p[0] := a0
p[1] := a1
p[0] p[1]
a0 a1
Stores the data in a to the address p without polluting the caches. If the cache line
containing address p is already in the cache, the cache will be updated. Address p
must be 16-byte aligned.
*p
a
Stores the data in a to the address p without polluting the caches. If the cache line
containing address p is already in the cache, the cache will be updated.
*p
Cache line containing p is flushed and invalidated from all caches in the coherency
domain.
void _mm_lfence(void)
Guarantees that every load instruction that precedes, in program order, the load
fence instruction is globally visible before any load instruction which follows the fence
in program order.
void _mm_mfence(void)
Guarantees that every memory access that precedes, in program order, the memory
fence instruction is globally visible before any memory instruction which follows the
fence in program order.
The miscellaneous intrinsics for Streaming SIMD Extensions 2 (SSE2) are listed in
the following table followed by their descriptions.
The prototypes for SSE2 intrinsics are in the emmintrin.h header file.
Packs the 16 signed 16-bit integers from a and b into 8-bit integers and saturates.
Packs the 8 signed 32-bit integers from a and b into signed 16-bit integers and
saturates.
R0 ... R3 R4 ... R7
SignedSaturate(a0) ... SignedSaturate(a3) SignedSaturate(b0) ... SignedSaturate(b3)
Packs the 16 signed 16-bit integers from a and b into 8-bit unsigned integers and
saturates.
Extracts the selected signed or unsigned 16-bit integer from a and zero extends. The
selector imm must be an immediate.
R0
(imm == 0) ? a0: ( (imm == 1) ? a1: ... (imm==7) ? a7)
Inserts the least significant 16 bits of b into the selected 16-bit integer of a. The
selector imm must be an immediate.
R0 R1 ... R7
(imm == 0) ? b : a0; (imm == 1) ? b : a1; ... (imm == 7) ? b : a7;
int _mm_movemask_epi8(__m128i a)
Creates a 16-bit mask from the most significant bits of the 16 signed or unsigned 8-
bit integers in a and zero extends the upper bits.
R0
a15[7] << 15 | a14[7] << 14 | ... a1[7] << 1 | a0[7]
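A byte mask pairs naturally with the 8-bit equality comparison: each matching byte becomes 0xff, and the mask collects one bit per byte. The following sketch searches a 16-byte block for a value. It assumes the emmintrin.h names _mm_cmpeq_epi8 and _mm_set1_epi8; find_byte16 is a hypothetical helper, and the scalar scan of the mask is kept simple on purpose.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Returns the index of the first byte equal to c, or -1 if none. */
int find_byte16(const uint8_t block[16], uint8_t c)
{
    __m128i v    = _mm_loadu_si128((const __m128i *)block);
    __m128i eq   = _mm_cmpeq_epi8(v, _mm_set1_epi8((char)c));
    int     mask = _mm_movemask_epi8(eq);  /* bit i is set if block[i] == c */
    for (int i = 0; i < 16; i++) {
        if (mask & (1 << i))
            return i;
    }
    return -1;
}
```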
Shuffles the 4 signed or unsigned 32-bit integers in a as specified by imm. The shuffle
value, imm, must be an immediate. See Macro Function for Shuffle for a description
of shuffle semantics.
Shuffles the upper 4 signed or unsigned 16-bit integers in a as specified by imm. The
shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a
description of shuffle semantics.
Shuffles the lower 4 signed or unsigned 16-bit integers in a as specified by imm. The
shuffle value, imm, must be an immediate. See Macro Function for Shuffle for a
description of shuffle semantics.
Interleaves the upper 8 signed or unsigned 8-bit integers in a with the upper 8
signed or unsigned 8-bit integers in b.
Interleaves the upper 4 signed or unsigned 16-bit integers in a with the upper 4
signed or unsigned 16-bit integers in b.
R0 R1 R2 R3 R4 R5 R6 R7
a4 b4 a5 b5 a6 b6 a7 b7
Interleaves the upper 2 signed or unsigned 32-bit integers in a with the upper 2
signed or unsigned 32-bit integers in b.
R0 R1 R2 R3
a2 b2 a3 b3
Interleaves the upper signed or unsigned 64-bit integer in a with the upper signed or
unsigned 64-bit integer in b.
R0 R1
a1 b1
Interleaves the lower 8 signed or unsigned 8-bit integers in a with the lower 8 signed
or unsigned 8-bit integers in b.
Interleaves the lower 4 signed or unsigned 16-bit integers in a with the lower 4
signed or unsigned 16-bit integers in b.
R0 R1 R2 R3 R4 R5 R6 R7
a0 b0 a1 b1 a2 b2 a3 b3
Interleaves the lower 2 signed or unsigned 32-bit integers in a with the lower 2
signed or unsigned 32-bit integers in b.
R0 R1 R2 R3
a0 b0 a1 b1
Interleaves the lower signed or unsigned 64-bit integer in a with the lower signed or
unsigned 64-bit integer in b.
R0 R1
a0 b0
__m64 _mm_movepi64_pi64(__m128i a)
R0
a0
__m128i _mm_movpi64_epi64(__m64 a)
Moves the 64 bits of a to the lower 64 bits of the result, zeroing the upper bits.
R0 R1
a0 0x0
__m128i _mm_move_epi64(__m128i a)
Moves the lower 64 bits of a to the lower 64 bits of the result, zeroing the upper bits.
R0 R1
a0 0x0
R0 R1
a1 b1
R0 R1
a0 b0
int _mm_movemask_pd(__m128d a)
Creates a two-bit mask from the sign bits of the two DP FP values of a.
R
sign(a1) << 1 | sign(a0)
Selects two specific DP FP values from a and b, based on the mask i. The mask must
be an immediate. See Macro Function for Shuffle for a description of the shuffle
semantics.
This version of the Intel C++ Compiler supports casting between various SP, DP,
and INT vector types. These intrinsics do not convert values; they change one data
type to another without changing the value.
The intrinsics for casting support do not correspond to any Streaming SIMD
Extensions 2 (SSE2) instructions.
The prototypes for Streaming SIMD Extensions (SSE) intrinsics are in the
xmmintrin.h header file.
void _mm_pause(void)
PAUSE Intrinsic
The PAUSE intrinsic is used in spin-wait loops with the processors implementing
dynamic execution (especially out-of-order execution). In the spin-wait loop, PAUSE
improves the speed at which the code detects the release of the lock. For dynamic
scheduling, the PAUSE instruction reduces the penalty of exiting from the spin-loop.
spin_loop: pause
cmp eax, A
jne spin_loop
In this example, the program spins until memory location A matches the value in
register eax. The code sequence that follows shows a test-and-test-and-set. In this
example, the spin occurs only after the attempt to get a lock has failed.
Critical Section
// critical_section code
mov A, 0 ; Release lock
jmp continue
spin_loop: pause ; spin-loop hint
cmp 0, A ; check lock availability
jne spin_loop
jmp get_lock
continue: // other code
Note that the first branch is predicted to fall-through to the critical section in
anticipation of successfully gaining access to the lock. It is highly recommended that
all spin-wait loops include the PAUSE instruction. Since PAUSE is backwards
compatible to all existing IA-32 processor generations, a test for processor type (a
CPUID test) is not needed. All legacy processors will execute PAUSE as a NOP, but in
processors which use the PAUSE as a hint there can be significant performance
benefit.
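The same test-and-test-and-set pattern can be sketched in C with the _mm_pause intrinsic. This is an illustration under stated assumptions, not the manual's own code: it uses C11 atomics for the lock word, the demo_spinlock type and demo_spin_* names are hypothetical, and emmintrin.h is included because it reaches the _mm_pause declaration on common compilers.

```c
#include <emmintrin.h>   /* brings in _mm_pause */
#include <stdatomic.h>

typedef struct { atomic_int locked; } demo_spinlock;  /* hypothetical type */

/* Tries once to take the lock; returns 1 on success, 0 otherwise. */
int demo_spin_trylock(demo_spinlock *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Takes the lock, spinning with PAUSE only after an attempt has failed. */
void demo_spin_lock(demo_spinlock *l)
{
    while (!demo_spin_trylock(l)) {
        /* Spin on plain loads until the lock looks free, then retry. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            _mm_pause();  /* spin-loop hint */
    }
}

/* Releases the lock. */
void demo_spin_unlock(demo_spinlock *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The inner loop only reads the lock word, so waiting threads do not issue repeated exchange operations while the lock is held.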
The Streaming SIMD Extensions 2 (SSE2) provide a macro function to help create
constants that describe shuffle operations. The macro takes two small integers (in
the range of 0 to 1) and combines them into a 2-bit immediate value used by the
SHUFPD instruction. See the following example.
You can view the two integers as selectors for choosing which two words from the
first input operand and which two words from the second are to be put into the result
word.
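A short sketch of the macro in use, assuming the emmintrin.h names _MM_SHUFFLE2 and _mm_shuffle_pd. In _MM_SHUFFLE2(x, y), y picks the element of the first operand that becomes R0 and x picks the element of the second operand that becomes R1. The helper names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics and _MM_SHUFFLE2 */

/* out = { a[0], b[1] } */
void pick_a0_b1(const double a[2], const double b[2], double out[2])
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_shuffle_pd(va, vb, _MM_SHUFFLE2(1, 0)));
}

/* out = { a[1], b[0] } */
void pick_a1_b0(const double a[2], const double b[2], double out[2])
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_shuffle_pd(va, vb, _MM_SHUFFLE2(0, 1)));
}
```

The selector is a compile-time constant, which is why each selection gets its own function here instead of a runtime parameter.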
The Intel C++ intrinsics listed in this section are designed for the Intel Pentium 4
processor with Streaming SIMD Extensions 3 (SSE3). They will not function correctly
on other IA-32 processors. New SSE3 intrinsics include:
The prototypes for these intrinsics are in the pmmintrin.h header file.
Note
You can also use the single ia32intrin.h header file for any IA-32 intrinsics.
The integer vector intrinsic listed here is designed for the Intel Pentium 4
processor with Streaming SIMD Extensions 3 (SSE3).
Loads an unaligned 128-bit value. This differs from movdqu in that it can provide
higher performance in some cases. However, it also may provide lower performance
than movdqu if the memory value being read was just previously written.
R
*p;
The single-precision floating-point vector intrinsics listed here are designed for the
Intel Pentium 4 processor with Streaming SIMD Extensions 3 (SSE3).
The results of each intrinsic operation are placed in the registers R0, R1, R2, and R3.
The prototypes for these intrinsics are in the pmmintrin.h header file.
R0 R1 R2 R3
a0 - b0; a1 + b1; a2 - b2; a3 + b3;
R0 R1 R2 R3
a0 + a1; a2 + a3; b0 + b1; b2 + b3;
R0 R1 R2 R3
a0 - a1; a2 - a3; b0 - b1; b2 - b3;
R0 R1 R2 R3
a1; a1; a3; a3;
R0 R1 R2 R3
a0; a0; a2; a2;
The floating-point intrinsics listed here are designed for the Intel Pentium 4 processor with
Streaming SIMD Extensions 3 (SSE3).
The results of each intrinsic operation are placed in the registers R0 and R1.
The prototypes for these intrinsics are in the pmmintrin.h header file.
R0 R1
a0 - b0; a1 + b1;
R0 R1
a0 + a1; b0 + b1;
R0 R1
a0 - a1; b0 - b1;
R0 R1
*dp; *dp;
R0 R1
a0; a0;
The macro function intrinsics listed here are designed for the Intel Pentium 4
processor with Streaming SIMD Extensions 3 (SSE3).
The prototypes for these intrinsics are in the pmmintrin.h header file.
_MM_SET_DENORMALS_ZERO_MODE(x)
_MM_GET_DENORMALS_ZERO_MODE()
No arguments. This returns the current value of the denormals are zero mode bit of
the control register.
The miscellaneous intrinsics listed here are designed for the Intel Pentium 4
processor with Streaming SIMD Extensions 3 (SSE3).
The prototypes for these intrinsics are in the pmmintrin.h header file.
Generates the MONITOR instruction. This sets up an address range for the monitor
hardware using p to provide the logical address, and will be passed to the monitor
instruction in register eax. The extensions parameter contains optional extensions to
the monitor hardware which will be passed in ecx. The hints parameter will contain
hints to the monitor hardware, which will be passed in edx. A non-zero value for
extensions will cause a general protection fault.
Generates the MWAIT instruction. This instruction is a hint that allows the processor
to stop execution and enter an implementation-dependent optimized state until
occurrence of a class of events. In future processor designs extensions and hints
parameters may be used to convey additional information to the processor. All non-
zero values of extensions and hints are reserved. A non-zero value for extensions will
cause a general protection fault.
The Intel C++ intrinsics listed in this section are supported in the Supplemental
Streaming SIMD Extensions 3. The prototypes for these intrinsics are in tmmintrin.h.
You can also use the ia32intrin.h header file for these intrinsics.
Addition Intrinsics
Subtraction Intrinsics
Multiplication Intrinsics
Absolute Value Intrinsics
Shuffle Intrinsics
Concatenate Intrinsics
Negation Intrinsics
Addition Intrinsics
Subtraction Intrinsics
Multiplication Intrinsics
Multiply signed and unsigned bytes, add horizontal pair of signed words, pack
saturated signed words.
Multiply signed and unsigned bytes, add horizontal pair of signed words, pack
saturated signed words.
Multiply signed words, scale and round signed dwords, pack high 16-bits.
Multiply signed words, scale and round signed dwords, pack high 16-bits.
r[i] = abs(a[i]);
r[i] = abs(a[i]);
r[i] = abs(a[i]);
r[i] = abs(a[i]);
r[i] = abs(a[i]);
r[i] = abs(a[i]);
r[i] = 0;
else {
r[i] = 0;
else {
Concatenate Intrinsics
t1[255:128] = a;
t1[127:0] = b;
r[127:0] = t1[127:0];
t1[127:64] = a;
t1[63:0] = b;
r[63:0] = t1[63:0];
Negation Intrinsics
if (b[i] < 0) {
    r[i] = -a[i];
} else if (b[i] == 0) {
    r[i] = 0;
} else {
    r[i] = a[i];
}
if (b[i] < 0) {
    r[i] = -a[i];
} else if (b[i] == 0) {
    r[i] = 0;
} else {
    r[i] = a[i];
}
if (b[i] < 0) {
    r[i] = -a[i];
} else if (b[i] == 0) {
    r[i] = 0;
} else {
    r[i] = a[i];
}
if (b[i] < 0) {
    r[i] = -a[i];
} else if (b[i] == 0) {
    r[i] = 0;
} else {
    r[i] = a[i];
}
if (b[i] < 0) {
    r[i] = -a[i];
} else if (b[i] == 0) {
    r[i] = 0;
} else {
    r[i] = a[i];
}
if (b[i] < 0) {
    r[i] = -a[i];
} else if (b[i] == 0) {
    r[i] = 0;
} else {
    r[i] = a[i];
}
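The per-element pseudocode above can be wrapped in a loop to give a portable scalar reference, useful for checking vectorized code. This is a sketch for 16-bit elements only; the name sign_i16_model is hypothetical, and it does not call any SSSE3 intrinsic.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the negation pseudocode for signed 16-bit elements:
   r[i] is -a[i], 0, or a[i] depending on the sign of b[i]. */
void sign_i16_model(const int16_t *a, const int16_t *b, int16_t *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (b[i] < 0) {
            /* Negating -32768 does not fit in 16 bits; the cast narrows it
               back (implementation-defined in C, wraps on common compilers). */
            r[i] = (int16_t)-a[i];
        } else if (b[i] == 0) {
            r[i] = 0;
        } else {
            r[i] = a[i];
        }
    }
}
```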
These intrinsics enable floating point single precision and double precision dot
products.
This intrinsic calculates the dot product of double precision packed values with mask-
defined summing and zeroing of the parts of the result.
This intrinsic calculates the dot product of single precision packed values with mask-
defined summing and zeroing of the parts of the result.
These intrinsics compare packed integers in the destination operand and the source
operand, and return the minimum or maximum for each packed operand in the
destination operand.
These rounding intrinsics cover scalar and packed single-precision and double
precision floating-point operands.
The floor and ceil intrinsics correspond to the definitions of floor and ceil in the
ISO 9899:1999 standard for the C programming language.
_mm_floor_pd
_mm_ceil_pd
_mm_round_ps Packed float single precision rounding ROUNDPS
_mm_floor_ps
_mm_ceil_ps
_mm_round_sd Single float double precision rounding ROUNDSD
_mm_floor_sd
_mm_ceil_sd
_mm_round_ss Single float single precision rounding ROUNDSS
_mm_floor_ss
_mm_ceil_ss
These DWORD multiply intrinsics are designed to aid vectorization. They enable four
simultaneous 32 bit by 32 bit multiplies.
These intrinsics enable data insertion and extraction between general purpose
registers and XMM registers.
Converts 8 packed signed DWORDs into 8 packed unsigned WORDs, using unsigned
saturation to handle overflow condition.
Performs a packed integer 64-bit comparison for equality. The intrinsic zeroes or fills
with ones the corresponding parts of the result.
Loads _m128 data from a 16-byte aligned address (v1) to the destination operand
(m128i) without polluting the caches.
int _mm_cmpestri(__m128i src1, int len1, __m128i src2, int len2, const
int mode)
This intrinsic performs a packed comparison of string data with explicit lengths,
generating an index and storing the result in ECX.
This intrinsic performs a packed comparison of string data with explicit lengths,
generating a mask and storing the result in XMM0.
This intrinsic performs a packed comparison of string data with implicit lengths,
generating an index and storing the result in ECX.
This intrinsic performs a packed comparison of string data with implicit lengths,
generating a mask and storing the result in XMM0.
int _mm_cmpestrz(__m128i src1, int len1, __m128i src2, int len2, const
int mode);
This intrinsic performs a packed comparison of string data with explicit lengths.
Returns 1 if ZFlag == 1, otherwise 0.
int _mm_cmpestrc(__m128i src1, int len1, __m128i src2, int len2, const
int mode);
This intrinsic performs a packed comparison of string data with explicit lengths.
Returns 1 if CFlag == 1, otherwise 0.
int _mm_cmpestrs(__m128i src1, int len1, __m128i src2, int len2, const
int mode);
This intrinsic performs a packed comparison of string data with explicit lengths.
Returns 1 if SFlag == 1, otherwise 0.
int _mm_cmpestro(__m128i src1, int len1, __m128i src2, int len2, const
int mode);
This intrinsic performs a packed comparison of string data with explicit lengths.
Returns 1 if OFlag == 1, otherwise 0.
int _mm_cmpestra(__m128i src1, int len1, __m128i src2, int len2, const
int mode);
This intrinsic performs a packed comparison of string data with explicit lengths.
Returns 1 if CFlag == 0 and ZFlag == 0, otherwise 0.
This intrinsic performs a packed comparison of string data with implicit lengths.
Returns 1 if (ZFlag == 1), otherwise 0.
This intrinsic performs a packed comparison of string data with implicit lengths.
Returns 1 if (CFlag == 1), otherwise 0.
This intrinsic performs a packed comparison of string data with implicit lengths.
Returns 1 if (SFlag == 1), otherwise 0.
This intrinsic performs a packed comparison of string data with implicit lengths.
Returns 1 if (OFlag == 1), otherwise 0.
This intrinsic performs a packed comparison of string data with implicit lengths.
Returns 1 if (ZFlag == 0 and CFlag == 0), otherwise 0.
Starting with an initial value in the first operand, accumulates a CRC32 value for the
second operand and stores the result in the destination operand. Accumulates CRC32
on r/m8.
Starting with an initial value in the first operand, accumulates a CRC32 value for the
second operand and stores the result in the destination operand. Accumulates CRC32
on r/m16.
Starting with an initial value in the first operand, accumulates a CRC32 value for the
second operand and stores the result in the destination operand. Accumulates CRC32
on r/m32.
Starting with an initial value in the first operand, accumulates a CRC32 value for the
second operand and stores the result in the destination operand. Accumulates CRC32
on r/m64.
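For readers who want to check results without SSE4.2 hardware, the accumulation can be modeled in portable C. The sketch below relies on one fact not stated above: the CRC32 instruction uses the CRC-32C (Castagnoli) polynomial, whose bit-reflected form is 0x82F63B78. The function names are hypothetical, and the initial value and final inversion in the buffer helper follow the usual CRC-32C convention rather than anything in this section.

```c
#include <stddef.h>
#include <stdint.h>

/* Models one 8-bit accumulation step: returns the updated CRC value. */
uint32_t crc32c_u8_model(uint32_t crc, uint8_t v)
{
    crc ^= v;
    for (int k = 0; k < 8; k++) {
        /* Shift right one bit; fold in the polynomial if a 1 fell out. */
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Conventional CRC-32C of a buffer: start at all ones, invert at the end. */
uint32_t crc32c_buf_model(const void *buf, size_t len)
{
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--)
        crc = crc32c_u8_model(crc, *p++);
    return crc ^ 0xFFFFFFFFu;
}
```

The wider forms (16, 32, and 64 bits) are equivalent to feeding the operand's bytes to the 8-bit step in little-endian order.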
This section lists and describes the native intrinsics for IA-64 instructions. These
intrinsics cannot be used on the IA-32 architecture. These intrinsics give
programmers access to IA-64 instructions that cannot be generated using the
standard constructs of the C and C++ languages.
The prototypes for these intrinsics are in the ia64intrin.h header file.
The Intel Itanium processor does not support SSE2 intrinsics. However, you can
use the sse2mmx.h emulation pack to enable support for SSE2 instructions on IA-64
architecture.
For information on how to use SSE intrinsics on IA-64 architecture, see Using
Streaming SIMD Extensions on IA-64 Architecture.
For information on how to use MMX(TM) technology intrinsics on IA-64 architecture,
see MMX(TM) Technology Intrinsics on IA-64 Architecture.
The prototypes for these intrinsics are in the ia64intrin.h header file.
Integer Operations
Intrinsic      Operation    Corresponding IA-64 Instruction
_m64_dep_mr    Deposit      dep
FSR Operations
Intrinsic Description
void _fsetc(unsigned int amask, unsigned int omask)
Sets the control bits of FPSR.sf0. Maps to the fsetc.sf0 r, r instruction. There is no
corresponding instruction to read the control bits. Use _mm_getfpsr().
void _fclrf(void)
Clears the floating point status flags (the 6-bit flags of FPSR.sf0). Maps to the
fclrf.sf0 instruction.
The right-justified 64-bit value r is deposited into the value in s at an arbitrary bit
position and the result is returned. The deposited bit field begins at bit position pos
and extends to the left (toward the most significant bit) the number of bits specified
by len.
The sign-extended value v (either all 1s or all 0s) is deposited into the value in s at
an arbitrary bit position and the result is returned. The deposited bit field begins at
bit position pos and extends to the left (toward the most significant bit) the number
of bits specified by len.
The right-justified 64-bit value s is deposited into a 64-bit field of all zeros at an
arbitrary bit position and the result is returned. The deposited bit field begins at bit
position pos and extends to the left (toward the most significant bit) the number of
bits specified by len.
The sign-extended value v (either all 1s or all 0s) is deposited into a 64-bit field of all
zeros at an arbitrary bit position and the result is returned. The deposited bit field
begins at bit position pos and extends to the left (toward the most significant bit) the
number of bits specified by len.
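The deposit operations described above can be modelled in portable C. This is a minimal sketch under stated assumptions: the helper names field_mask, deposit_merge_model, and deposit_zero_model are hypothetical, and the sketch assumes that the field fits inside the 64-bit value (pos + len <= 64). The hardware instruction may restrict len further.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with len one-bits starting at bit position pos. */
static uint64_t field_mask(unsigned pos, unsigned len)
{
    assert(len >= 1 && len <= 64 && pos < 64 && pos + len <= 64);
    uint64_t ones = (len == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << len) - 1);
    return ones << pos;
}

/* Merge form: deposit the low len bits of r into s at bit position pos. */
static uint64_t deposit_merge_model(uint64_t r, uint64_t s,
                                    unsigned pos, unsigned len)
{
    uint64_t mask = field_mask(pos, len);
    return (s & ~mask) | ((r << pos) & mask);
}

/* Zero form: deposit the low len bits of s into a field of all zeros. */
static uint64_t deposit_zero_model(uint64_t s, unsigned pos, unsigned len)
{
    return deposit_merge_model(s, 0, pos, len);
}
```

The immediate forms behave the same way with r replaced by a value that is either all 1s or all 0s.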
A field is extracted from the 64-bit value r and is returned right-justified and sign
extended. The extracted field begins at position pos and extends len bits to the left.
The sign is taken from the most significant bit of the extracted field.
A field is extracted from the 64-bit value r and is returned right-justified and zero
extended. The extracted field begins at position pos and extends len bits to the left.
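The two extract operations can be sketched in portable C as follows. The helper names extract_unsigned_model and extract_signed_model are hypothetical, and the sketch assumes the field lies entirely inside the 64-bit value (pos + len <= 64).

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extended extract: field of len bits starting at bit pos. */
static uint64_t extract_unsigned_model(uint64_t r, unsigned pos, unsigned len)
{
    assert(len >= 1 && len <= 64 && pos < 64 && pos + len <= 64);
    uint64_t field = r >> pos;
    if (len < 64)
        field &= (UINT64_C(1) << len) - 1;
    return field;
}

/* Sign-extended extract: the sign is the top bit of the extracted field. */
static int64_t extract_signed_model(uint64_t r, unsigned pos, unsigned len)
{
    uint64_t field = extract_unsigned_model(r, pos, len);
    uint64_t sign = UINT64_C(1) << (len - 1);
    /* (field ^ sign) - sign copies the field's top bit into all upper bits.
       The final conversion assumes a two's complement int64_t. */
    return (int64_t)((field ^ sign) - sign);
}
```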
The 64-bit values a and b are treated as signed integers and multiplied to produce a
full 128-bit signed result. The 64-bit value c is zero-extended and added to the
product. The least significant 64 bits of the sum are then returned.
The 64-bit values a and b are treated as unsigned integers and multiplied to produce
a full 128-bit unsigned result. The 64-bit value c is zero-extended and added to the
product. The least significant 64 bits of the sum are then returned.
The 64-bit values a and b are treated as signed integers and multiplied to produce a
full 128-bit signed result. The 64-bit value c is zero-extended and added to the
product. The most significant 64 bits of the sum are then returned.
The 64-bit values a and b are treated as unsigned integers and multiplied to produce
a full 128-bit unsigned result. The 64-bit value c is zero-extended and added to the
product. The most significant 64 bits of the sum are then returned.
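The multiply-add operations above can be modelled with 128-bit arithmetic. This is a minimal sketch: the helper names are hypothetical, and the code assumes a compiler that provides the __int128 extension (for example gcc or clang on 64-bit targets), which is not part of standard C.

```c
#include <stdint.h>

/* Low 64 bits of a*b + c. The low half is identical for signed and
   unsigned operands, so one model covers both low forms. */
static uint64_t xma_low_model(uint64_t a, uint64_t b, uint64_t c)
{
    return a * b + c;   /* unsigned arithmetic wraps modulo 2^64 */
}

/* High 64 bits of the signed 128-bit product plus zero-extended c. */
static uint64_t xma_high_signed_model(int64_t a, int64_t b, uint64_t c)
{
    __int128 sum = (__int128)a * (__int128)b + (__int128)c;
    return (uint64_t)((unsigned __int128)sum >> 64);
}

/* High 64 bits of the unsigned 128-bit product plus zero-extended c. */
static uint64_t xma_high_unsigned_model(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 sum = (unsigned __int128)a * b + c;
    return (uint64_t)(sum >> 64);
}
```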
__int64 _m64_popcnt(__int64 a)
The bits in the 64-bit integer a that have the value 1 are counted, and the
resulting count is returned.
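For reference, the population count can be expressed in portable C. The helper name popcount64_model is hypothetical; the sketch only illustrates the result the intrinsic is described as returning.

```c
#include <stdint.h>

/* Counts the bits of a that are set to 1. */
static int popcount64_model(uint64_t a)
{
    int count = 0;
    while (a != 0) {
        a &= a - 1;   /* clears the lowest set bit */
        ++count;
    }
    return count;
}
```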
a is shifted to the left by count bits and then added to b. The result is returned.
a and b are concatenated to form a 128-bit value and shifted to the right count bits.
The least significant 64 bits of the result are returned.
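The shift-and-add and shift-right-pair operations can be sketched in portable C. The helper names are hypothetical. The sketch assumes that count is less than 64 and that a supplies the upper 64 bits of the concatenated 128-bit value; the hardware instructions may accept a narrower range of counts.

```c
#include <assert.h>
#include <stdint.h>

/* a shifted left by count bits, then added to b (modulo 2^64). */
static uint64_t shladd_model(uint64_t a, unsigned count, uint64_t b)
{
    assert(count < 64);
    return (a << count) + b;
}

/* Low 64 bits of the 128-bit value a:b shifted right by count bits. */
static uint64_t shrp_model(uint64_t a, uint64_t b, unsigned count)
{
    assert(count < 64);
    if (count == 0)
        return b;   /* avoids an undefined 64-bit shift of a */
    return (a << (64 - count)) | (b >> count);
}
```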
The prototypes for these intrinsics are in the ia64intrin.h header file.
Intrinsic Description
unsigned __int64 _InterlockedExchange8(volatile unsigned char *Target, unsigned __int64 value)
Maps to the xchg1 instruction. Atomically writes the least significant byte of its 2nd
argument to the address specified by its 1st argument.
unsigned __int64 _InterlockedCompareExchange8_rel(volatile unsigned char *Destination, unsigned __int64 Exchange, unsigned __int64 Comparand)
Atomically compares and exchanges the least significant byte at the address specified
by its 1st argument. Maps to the cmpxchg1.rel instruction with appropriate setup.
Note
Uses cmpxchg to atomically subtract the incr value from the target. Maps to a
loop with the cmpxchg instruction to guarantee atomicity.
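The compare-and-exchange loop that the note describes can be sketched with standard C11 atomics. This is an illustration of the technique only: the helper name atomic_sub_cas_loop is hypothetical, it does not use the ia64intrin.h intrinsics, and returning the new value is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically subtracts decr from *target and returns the new value. */
static int64_t atomic_sub_cas_loop(_Atomic int64_t *target, int64_t decr)
{
    assert(target != NULL);
    int64_t expected = atomic_load(target);
    int64_t desired;
    do {
        desired = expected - decr;
        /* If another thread changed *target, the exchange fails, expected
           is refreshed with the current value, and the loop retries. */
    } while (!atomic_compare_exchange_weak(target, &expected, desired));
    return desired;
}
```

The loop only commits the subtraction when the value it read is still the value in memory, which is what makes the read-modify-write sequence atomic.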
You can use the load and store intrinsics to force strict memory access ordering for
specific data objects. The intended use is for cases where the user suppresses
strict memory access ordering by using the -serialize-volatile- option.
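The kind of ordering that acquire loads and release stores provide can be illustrated with standard C11 atomics. This is a portable analogue, not the ia64intrin.h interface: the names payload, ready, publish, and try_consume are hypothetical, and the sketch assumes one writer and one reader.

```c
#include <stdatomic.h>
#include <stdint.h>

static uint64_t payload;         /* ordinary data, written before the flag */
static _Atomic uint64_t ready;   /* flag accessed with explicit ordering */

/* Writer: the release store keeps the payload write ahead of the flag. */
static void publish(uint64_t value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: the acquire load keeps the payload read behind the flag.
   Returns 1 and stores the payload in *out once the flag is set. */
static int try_consume(uint64_t *out)
{
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return 0;
    *out = payload;
    return 1;
}
```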
instruction.
__ld8_acq
unsigned __int64 __ld8_acq(void *src);
Generates an ld8.acq instruction.
The prototypes for these intrinsics are in the ia64intrin.h header file.
Intrinsic Description
unsigned __int64 __getReg(const int whichReg)
Gets the value from a hardware register based on the index passed in. Produces a
corresponding mov = r instruction. For the registers this intrinsic provides access to,
see Register Names for getReg() and setReg().
void __setReg(const int whichReg, unsigned __int64 value)
Sets the value for a hardware register based on the index passed in. Produces a
corresponding mov = r instruction. See Register Names for getReg() and setReg().
unsigned __int64 __getIndReg(const int whichIndReg, __int64 index)
Returns the value of an indexed register. The index is the 2nd argument; the register
file is the first argument.
void __setIndReg(const int whichIndReg, __int64 index, unsigned __int64 value)
Copies a value into an indexed register. The index is the 2nd argument; the register
file is the first argument.
void *__ptr64 _rdteb(void)
Gets the TEB address. The TEB address is kept in r13 and maps to the move r=tp
instruction.
void __isrlz(void)
Executes the serialize instruction. Maps to the srlz.i instruction.
void __dsrlz(void)
Serializes the data. Maps to the srlz.d instruction.
unsigned __int64 __fetchadd4_acq(unsigned int *addend, const int increment)
Maps to the fetchadd4.acq instruction.
unsigned __int64 __fetchadd4_rel(unsigned int *addend, const int increment)
Maps to the fetchadd4.rel instruction.
unsigned __int64 __fetchadd8_acq(unsigned __int64 *addend, const int increment)
Maps to the fetchadd8.acq instruction.
unsigned __int64 __fetchadd8_rel(unsigned __int64 *addend, const int increment)
Maps to the fetchadd8.rel instruction.
void __ptcl(__int64 va, __int64 pagesz)
Purges the local translation cache. Maps to the ptc.l r, r instruction.
void __ptcg(__int64 va, __int64 pagesz)
Purges the global translation cache. Maps to the ptc.g r, r instruction.
void __ptcga(__int64 va, __int64 pagesz)
Purges the global translation cache and ALAT. Maps to the ptc.ga r, r instruction.
void __ptri(__int64 va, __int64 pagesz)
Purges the translation register. Maps to the ptr.i r, r instruction.
void __ptrd(__int64 va, __int64 pagesz)
Purges the translation register. Maps to the ptr.d r, r instruction.
__int64 __tpa(__int64 va)
Maps to the tpa instruction.
void __invalat(void)
Invalidates ALAT. Maps to the invala instruction.
void __invala(void)
Same as void __invalat(void).
void __invala_gr(const int whichGeneralReg)
whichGeneralReg = 0-127
void __invala_fr(const int whichFloatReg)
whichFloatReg = 0-127
void __break(const int)
Generates a break instruction with an immediate.
void __nop(const int)
Generates a nop instruction.
void __debugbreak(void)
Generates a Debug Break Instruction fault.
void __fc(void*)
Flushes a cache line associated with the address given by the argument. Maps to the
fc instruction.
void __sum(int mask)
Sets the user mask bits of PSR. Maps to the sum imm24 instruction.
void __rum(int mask)
Resets the user mask.
__int64 _ReturnAddress(void)
Gets the caller's address.
void __lfetch(int lfhint, void const *y)
Generates the lfetch.lfhint instruction. The value of the first argument specifies the
hint type.
void __lfetch_fault(int lfhint, void const *y)
Generates the lfetch.fault.lfhint instruction. The value of the first argument specifies
the hint type.
void __lfetch_excl(int lfhint, void const *y)
Generates the lfetch.excl.lfhint instruction. The value {0|1|2|3} of the first argument
specifies the hint type.
void __lfetch_fault_excl(int lfhint, void const *y)
Generates the lfetch.fault.excl.lfhint instruction. The value of the first argument
specifies the hint type.
Conversion Intrinsics
The prototypes for these intrinsics are in the ia64intrin.h header file.
Intrinsic Description
__int64 _m_to_int64(__m64 a)
Converts a of type __m64 to type __int64. Translates to nop since both types reside in
the same register for systems based on IA-64 architecture.
__m64 _m_from_int64(__int64 a)
Converts a of type __int64 to type __m64. Translates to nop since both types reside in
the same register for systems based on IA-64 architecture.
__int64 __round_double_to_int64(double d)
Converts its double precision argument to a signed integer.
unsigned __int64 __getf_exp(double d)
Maps to the getf.exp instruction and returns the 16-bit exponent and the sign of its
operand.
The prototypes for getReg() and setReg() intrinsics are in the ia64regs.h header
file.
Name whichReg
_IA64_REG_IP 1016
_IA64_REG_PSR 1019
_IA64_REG_PSR_L 1019
_IA64_REG_SP 1036
_IA64_REG_TP 1037
Application Registers
Name whichReg
_IA64_REG_AR_KR0 3072
_IA64_REG_AR_KR1 3073
_IA64_REG_AR_KR2 3074
_IA64_REG_AR_KR3 3075
_IA64_REG_AR_KR4 3076
_IA64_REG_AR_KR5 3077
_IA64_REG_AR_KR6 3078
_IA64_REG_AR_KR7 3079
_IA64_REG_AR_RSC 3088
_IA64_REG_AR_BSP 3089
_IA64_REG_AR_BSPSTORE 3090
_IA64_REG_AR_RNAT 3091
_IA64_REG_AR_FCR 3093
_IA64_REG_AR_EFLAG 3096
_IA64_REG_AR_CSD 3097
_IA64_REG_AR_SSD 3098
_IA64_REG_AR_CFLAG 3099
_IA64_REG_AR_FSR 3100
_IA64_REG_AR_FIR 3101
_IA64_REG_AR_FDR 3102
_IA64_REG_AR_CCV 3104
_IA64_REG_AR_UNAT 3108
_IA64_REG_AR_FPSR 3112
_IA64_REG_AR_ITC 3116
_IA64_REG_AR_PFS 3136
_IA64_REG_AR_LC 3137
_IA64_REG_AR_EC 3138
Control Registers
Name whichReg
_IA64_REG_CR_DCR 4096
_IA64_REG_CR_ITM 4097
_IA64_REG_CR_IVA 4098
_IA64_REG_CR_PTA 4104
_IA64_REG_CR_IPSR 4112
_IA64_REG_CR_ISR 4113
_IA64_REG_CR_IIP 4115
_IA64_REG_CR_IFA 4116
_IA64_REG_CR_ITIR 4117
_IA64_REG_CR_IIPA 4118
_IA64_REG_CR_IFS 4119
_IA64_REG_CR_IIM 4120
_IA64_REG_CR_IHA 4121
_IA64_REG_CR_LID 4160
_IA64_REG_CR_IVR 4161 ^
_IA64_REG_CR_TPR 4162
_IA64_REG_CR_EOI 4163
_IA64_REG_CR_IRR0 4164 ^
_IA64_REG_CR_IRR1 4165 ^
_IA64_REG_CR_IRR2 4166 ^
_IA64_REG_CR_IRR3 4167 ^
_IA64_REG_CR_ITV 4168
_IA64_REG_CR_PMV 4169
_IA64_REG_CR_CMCV 4170
_IA64_REG_CR_LRR0 4176
_IA64_REG_CR_LRR1 4177
Multimedia Additions
The prototypes for these intrinsics are in the ia64intrin.h header file.
__int64 _m64_czx1l(__m64 a)
The 64-bit value a is scanned for a zero element from the most significant element to
the least significant element, and the index of the first zero element is returned. The
element width is 8 bits, so the range of the result is from 0 - 7. If no zero element is
found, the default result is 8.
__int64 _m64_czx1r(__m64 a)
The 64-bit value a is scanned for a zero element from the least significant element to
the most significant element, and the index of the first zero element is returned. The
element width is 8 bits, so the range of the result is from 0 - 7. If no zero element is
found, the default result is 8.
__int64 _m64_czx2l(__m64 a)
The 64-bit value a is scanned for a zero element from the most significant element to
the least significant element, and the index of the first zero element is returned. The
element width is 16 bits, so the range of the result is from 0 - 3. If no zero element
is found, the default result is 4.
__int64 _m64_czx2r(__m64 a)
The 64-bit value a is scanned for a zero element from the least significant element to
the most significant element, and the index of the first zero element is returned. The
element width is 16 bits, so the range of the result is from 0 - 3. If no zero element
is found, the default result is 4.
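The four zero-scan operations above differ only in element width and scan direction, so one portable C model can cover them. This is a minimal sketch: the helper name czx_model is hypothetical, and it assumes that the returned index counts elements in scan order (index 0 is the first element examined).

```c
#include <assert.h>
#include <stdint.h>

/* width is the element width in bits (8 or 16). from_left is nonzero to
   scan from the most significant element, zero to scan from the least
   significant element. Returns the index of the first zero element in
   scan order, or the element count (8 or 4) if no element is zero. */
static int czx_model(uint64_t a, unsigned width, int from_left)
{
    assert(width == 8 || width == 16);
    unsigned count = 64 / width;
    uint64_t elem_mask = (UINT64_C(1) << width) - 1;
    for (unsigned i = 0; i < count; ++i) {
        unsigned elem = from_left ? (count - 1 - i) : i;
        if (((a >> (elem * width)) & elem_mask) == 0)
            return (int)i;
    }
    return (int)count;
}
```

With 8-bit elements and a left scan, this kind of operation is typically used to locate a terminating zero byte in eight characters of string data at once.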
Interleave 64-bit quantities a and b in 1-byte groups, starting from the left, as
shown in Figure 1, and return the result.
Interleave 64-bit quantities a and b in 1-byte groups, starting from the right, as
shown in Figure 2, and return the result.
Interleave 64-bit quantities a and b in 2-byte groups, starting from the left, as
shown in Figure 3, and return the result.
Interleave 64-bit quantities a and b in 2-byte groups, starting from the right, as
shown in Figure 4, and return the result.
Interleave 64-bit quantities a and b in 4-byte groups, starting from the left, as
shown in Figure 5, and return the result.
Interleave 64-bit quantities a and b in 4-byte groups, starting from the right, as
shown in Figure 6, and return the result.
The unsigned data elements (bytes) of b are subtracted from the unsigned data
elements (bytes) of a and the results of the subtraction are then each independently
shifted to the right by one position. The high-order bits of each element are filled
with the borrow bits of the subtraction.
The unsigned data elements (double bytes) of b are subtracted from the unsigned
data elements (double bytes) of a and the results of the subtraction are then each
independently shifted to the right by one position. The high-order bits of each
element are filled with the borrow bits of the subtraction.
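The byte-wide subtract-and-shift operation can be sketched in portable C as follows. The helper name pavgsub1_model is hypothetical. The sketch follows the description above literally and does not attempt to model any rounding that the hardware may apply to the low-order bit of each result.

```c
#include <stdint.h>

/* For each of the eight bytes: (a - b) kept as a 9-bit value whose top
   bit is the borrow, then shifted right by one position. */
static uint64_t pavgsub1_model(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (unsigned i = 0; i < 8; ++i) {
        unsigned x = (unsigned)(a >> (8 * i)) & 0xFFu;
        unsigned y = (unsigned)(b >> (8 * i)) & 0xFFu;
        unsigned diff = (x - y) & 0x1FFu;   /* bit 8 holds the borrow */
        /* The shift moves the borrow into bit 7 of the element. */
        result |= (uint64_t)(diff >> 1) << (8 * i);
    }
    return result;
}
```

The 16-bit form follows the same pattern with four elements, a 17-bit intermediate difference, and a 16-bit element mask.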
Two signed 16-bit data elements of a, starting with the most significant data element,
are multiplied by the corresponding two signed 16-bit data elements of b, and the
two 32-bit results are returned as shown in Figure 9.
Two signed 16-bit data elements of a, starting with the least significant data element,
are multiplied by the corresponding two signed 16-bit data elements of b, and the
two 32-bit results are returned as shown in Figure 10.
The four signed 16-bit data elements of a are multiplied by the corresponding signed
16-bit data elements of b, yielding four 32-bit products. Each product is then shifted
to the right count bits and the least significant 16 bits of each shifted product form
four 16-bit results, which are returned as one 64-bit word.
The four unsigned 16-bit data elements of a are multiplied by the corresponding
unsigned 16-bit data elements of b, yielding four 32-bit products. Each product is
then shifted to the right count bits and the least significant 16 bits of each shifted
product form four 16-bit results, which are returned as one 64-bit word.
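The multiply-and-shift operations above can be modelled in portable C. This is a minimal sketch: the helper names sign_extend16 and pmpyshr2_model are hypothetical, and the sketch accepts any count from 0 to 16, whereas the hardware instruction may allow only specific counts.

```c
#include <assert.h>
#include <stdint.h>

/* Interprets a 16-bit pattern as a signed value. */
static int32_t sign_extend16(uint16_t v)
{
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* is_signed selects the signed or the unsigned form. */
static uint64_t pmpyshr2_model(uint64_t a, uint64_t b,
                               unsigned count, int is_signed)
{
    assert(count <= 16);
    uint64_t result = 0;
    for (unsigned i = 0; i < 4; ++i) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        /* The 32-bit product is held sign- or zero-extended in 64 bits,
           so a logical shift yields the same low 16 bits as the shift
           of the 32-bit product. */
        uint64_t product = is_signed
            ? (uint64_t)((int64_t)sign_extend16(x) * sign_extend16(y))
            : (uint64_t)x * y;
        uint64_t lane = (product >> count) & 0xFFFFu;
        result |= lane << (16 * i);
    }
    return result;
}
```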
a is shifted to the left by count bits and then is added to b. The upper 32 bits of the
result are forced to 0, and then bits [31:30] of b are copied to bits [62:61] of the
result. The result is returned.
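The pointer-oriented shift-and-add described above can be sketched in portable C. The helper name shladdp4_model is hypothetical, and the sketch assumes count is less than 64; the hardware instruction may accept a narrower range.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t shladdp4_model(uint64_t a, unsigned count, uint64_t b)
{
    assert(count < 64);
    /* Shift, add, and force the upper 32 bits of the sum to zero. */
    uint64_t sum = ((a << count) + b) & UINT64_C(0xFFFFFFFF);
    /* Copy bits [31:30] of b to bits [62:61] of the result. */
    uint64_t top_bits = (b >> 30) & 0x3u;
    return sum | (top_bits << 61);
}
```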
The four signed 16-bit data elements of a are each independently shifted to the right
by count bits (the high order bits of each element are filled with the initial value of
the sign bits of the data elements in a); they are then added to the four signed
16-bit data elements of b. The result is returned.
a is added to b as four separate 16-bit wide elements. The elements of a are treated
as unsigned, while the elements of b are treated as signed. The results are treated
as unsigned and are returned as one 64-bit word.
a is subtracted from b as four separate 16-bit wide elements. The elements of a are
treated as unsigned, while the elements of b are treated as signed. The results are
treated as unsigned and are returned as one 64-bit word.
The unsigned byte-wide data elements of a are added to the unsigned byte-wide
data elements of b and the results of each add are then independently shifted to the
right by one position. The high-order bits of each element are filled with the carry
bits of the sums.
The unsigned 16-bit wide data elements of a are added to the unsigned 16-bit wide
data elements of b and the results of each add are then independently shifted to the
right by one position. The high-order bits of each element are filled with the carry
bits of the sums.
Synchronization Primitives
Miscellaneous Intrinsics
void* __get_return_address(unsigned int level);
This intrinsic yields the return address of the current function. The level argument
must be a constant value. A value of 0 yields the return address of the current
function. Any other value yields a zero return address. On Linux systems, this
intrinsic is synonymous with __builtin_return_address. The name and the
argument are provided for compatibility with gcc*.
This intrinsic overwrites the default return address of the current function with the
address indicated by its argument. On return from the current invocation, program
execution continues at the address provided.
This intrinsic returns the frame address of the current function. The level argument
must be a constant value. A value of 0 yields the frame address of the current
function. Any other value yields a zero return value. On Linux systems, this intrinsic
is synonymous with __builtin_frame_address. The name and the argument are
provided for compatibility with gcc.
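Because the text above names __builtin_return_address and __builtin_frame_address as the gcc equivalents, the behavior for level 0 can be illustrated with those built-ins. This is a minimal sketch that assumes a gcc-compatible compiler; the wrapper names are hypothetical, and the noinline attribute is used so that each wrapper has a frame and a return address of its own.

```c
#include <stddef.h>

/* Returns the address that this wrapper will return to. */
__attribute__((noinline))
static void *wrapper_return_address(void)
{
    return __builtin_return_address(0);
}

/* Returns the frame address of this wrapper. */
__attribute__((noinline))
static void *wrapper_frame_address(void)
{
    return __builtin_frame_address(0);
}
```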
The Dual-Core Intel Itanium 2 processor 9000 series supports the intrinsics listed
in the table below.
These intrinsics each generate IA-64 instructions. The first alpha-numerical chain in
the intrinsic name represents the return type, and the second alpha-numerical chain
in the intrinsic name represents the instruction the intrinsic generates. For example,
the intrinsic _int64_cmp8xchg generates the _int64 return type and the cmp8xchg
IA-64 instruction.
For more information about the instructions these intrinsics generate, please see the
documentation area of the Itanium 2 processor website at
http://developer.intel.com/products/processor/itanium2/index.htm
Note
Generates the 16-byte form of the IA-64 compare and exchange instruction.
Returns the original 64-bit value read from memory at the specified address.
[...] and 1 that specifies the semaphore completer (0==.acq, 1==.rel)
[...] and 2 that specifies the load hint completer (0==.none, 1==.nt1, 2==.nta).
[...] address of the value to read.
[...] significant 8 bytes of the exchange value.
The following table describes each implicit argument for this intrinsic.
xchg_hi
Highest 8 bytes of the exchange value. Use the __setReg intrinsic to set the
<xchg_hi> value in the register AR[CSD]. [__setReg(_IA64_REG_AR_CSD, <xchg_hi>);]
cmpnd
The 64-bit compare value. Use the __setReg intrinsic to set the <cmpnd> value in the
register AR[CCV]. [__setReg(_IA64_REG_AR_CCV, <cmpnd>);]
Generates the IA-64 instruction that loads 16 bytes from the given address.
Returns the lower 8 bytes of the quantity loaded from <addr>. The higher 8 bytes
are loaded in register AR[CSD].
Generates implicit return of the higher 8 bytes to the register AR[CSD]. You can use
the __getReg intrinsic to copy the value into a user variable. [foo =
__getReg(_IA64_REG_AR_CSD);]
Generates the IA-64 instruction that flushes the cache line associated with the
specified address and ensures coherency between instruction cache and data cache.
cache_line
An address associated with the cache line you want to flush
Generates the IA-64 instruction that provides performance hints about the program
being executed.
hint_value
A literal value that specifies the hint. Currently, zero is the only legal value.
__hint(0) generates the IA-64 hint@pause instruction.
The following table describes the implicit argument for this intrinsic.
src_hi
The highest 8 bytes of the 16-byte value to store. Use the __setReg intrinsic to set
the <src_hi> value in the register AR[CSD]. [__setReg(_IA64_REG_AR_CSD,
<src_hi>);]
Examples
The following examples show how to use the intrinsics listed above to generate the
corresponding instructions. In all cases, use the __setReg (resp. __getReg) intrinsic
to set up implicit arguments (resp. retrieve implicit return values).
// file foo.c
//
#include <ia64intrin.h>
/**/
// The following two calls load the 16-byte value at the given address
// The call to __getReg moves that value into a user variable (hi).
// ld16 Ra,ar.csd=[Rb]
*hi = __getReg(_IA64_REG_AR_CSD);
/**/
/**/
// This is the same as the previous example, except that it uses the
// ld16.acq Ra,ar.csd=[Rb]
//
*hi = __getReg(_IA64_REG_AR_CSD);
/**/
/**/
// first set the highest 64-bits into CSD register. Then call
//
__setReg(_IA64_REG_AR_CSD, hi);
/**/
__int64 old_value;
/**/
// set the highest bits of the exchange value and the comparand value
//
__setReg(_IA64_REG_AR_CSD, xchg_hi);
__setReg(_IA64_REG_AR_CCV, cmpnd);
/**/
return old_value;
// end foo.c
The Dual-Core Intel Itanium 2 processor 9000 series supports the intrinsics listed
in the table below. These intrinsics are also compatible with the Microsoft compiler.
These intrinsics each generate IA-64 instructions. The second alpha-numerical chain
in the intrinsic name represents the IA-64 instruction the intrinsic generates. For
example, the intrinsic _int64_cmp8xchg generates the cmp8xchg IA-64 instruction.
For more information about the instructions these intrinsics generate, please see the
documentation area of the Itanium 2 processor website at
http://developer.intel.com/products/processor/itanium2/index.htm.
Generates the IA-64 instruction that atomically reads 128 bits from the memory
location.
Source
Pointer to the 128-bit Source value
DestinationHigh
Pointer to the location in memory that stores the highest 64 bits of the 128-bit
loaded value
Generates the IA-64 instruction that atomically reads 128 bits from the memory
location. Same as __load128, but this intrinsic uses acquire semantics.
Source
Pointer to the 128-bit Source value
DestinationHigh
Pointer to the location in memory that stores the highest 64 bits of the 128-bit
loaded value
Generates the IA-64 instruction that atomically stores 128 bits at the destination
memory location.
No returns.
Generates the IA-64 instruction that atomically stores 128 bits at the destination
memory location. Same as __store128, but this intrinsic uses release semantics.
No returns.
This section describes features that support usage of the intrinsics. The following
topics are described:
Alignment Support
Allocating and Freeing Aligned Memory Blocks
Inline Assembly
Alignment Support
Aligning data improves the performance of intrinsics. When using the Streaming
SIMD Extensions, you should align data to 16 bytes in memory operations.
Specifically, any address passed to the _mm_load and _mm_store intrinsics must be
aligned as an __m128 object, that is, on a 16-byte boundary. If you want to declare
arrays of floats and treat them as __m128 objects by casting, you need to ensure
that the float arrays are properly aligned.
Use __declspec(align) to direct the compiler to align data more strictly than it
otherwise would. For example, a data object of type int is allocated at a byte address
which is a multiple of 4 by default. However, by using __declspec(align), you can
direct the compiler to instead use an address which is a multiple of 8, 16, or 32 with
the following restriction on IA-32:
You can use this data alignment support as an advantage in optimizing cache line
usage. By clustering small objects that are commonly used together into a struct,
and forcing the struct to be allocated at the beginning of a cache line, you can
effectively guarantee that each object is loaded into the cache as soon as any one is
accessed, resulting in a significant performance benefit.
align(n)
Caution
Note
If a value is specified that is less than the alignment of the affected data type, it
has no effect. In other words, data is aligned to the maximum of its own
alignment or the alignment specified with __declspec(align).
You can request alignments for individual variables, whether of static or automatic
storage duration. (Global and static variables have static storage duration; local
variables have automatic storage duration by default.) You cannot adjust the
alignment of a parameter, nor a field of a struct or class. You can, however,
increase the alignment of a struct (or union or class), in which case every object
of that type is affected.
int i, j;
These variables are commonly used together. But they can fall in different cache
lines, which could be detrimental to performance. You can instead declare them as
follows:
The compiler now ensures that they are allocated in the same cache line. In C++,
you can omit the struct variable name (written as sub in the previous example). In
C, however, it is required, and you must write references to i and j as sub.i and
sub.j.
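The effect of grouping i and j into one aligned struct can be illustrated with the standard C11 alignas specifier. This is a portable analogue of the __declspec(align) usage discussed above, not the syntax of that extension: the 16-byte alignment value and the helper name same_16_byte_block are assumptions of this sketch.

```c
#include <stdalign.h>
#include <stdint.h>

/* Raising the alignment of the first member raises the alignment of the
   whole struct, so sub starts on a 16-byte boundary. */
static struct {
    alignas(16) int i;
    int j;
} sub;

/* Returns 1 if both addresses fall in the same aligned 16-byte block. */
static int same_16_byte_block(const void *p, const void *q)
{
    return ((uintptr_t)p / 16) == ((uintptr_t)q / 16);
}
```

Because the struct begins on a 16-byte boundary and both members fit in 16 bytes, sub.i and sub.j cannot straddle that boundary.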
If you use many functions with such subscript pairs, it is more convenient to declare
and use a struct type for them, as in the following example:
By placing the __declspec(align) after the keyword struct, you are requesting the
appropriate alignment for all objects of that type. Note that allocation of parameters
is unaffected by __declspec(align). (If necessary, you can assign the value of a
parameter to a local variable with the appropriate alignment.)
Use the _mm_malloc and _mm_free intrinsics to allocate and free aligned blocks of
memory. These intrinsics are based on malloc and free, which are in the libirc.a
library. You need to include malloc.h. The syntax for these intrinsics is as follows:
The _mm_malloc routine takes an extra parameter, which is the alignment constraint.
This constraint must be a power of two. The pointer that is returned from
_mm_malloc is guaranteed to be aligned on the specified boundary.
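One common way to build an aligned allocator on top of malloc and free is to over-allocate, round the address up, and remember the original pointer just below the aligned block. The sketch below illustrates that general technique only. It is not the implementation of _mm_malloc, and the names aligned_malloc_sketch and aligned_free_sketch are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns size bytes aligned to align (a power of two), or NULL. */
static void *aligned_malloc_sketch(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    if (align < sizeof(void *))
        align = sizeof(void *);   /* keeps the bookkeeping slot aligned */
    if (size > SIZE_MAX - align - sizeof(void *))
        return NULL;              /* the padded size would overflow */
    void *raw = malloc(size + align + sizeof(void *));
    if (raw == NULL)
        return NULL;
    /* Leave room for one pointer, then round up to the boundary. */
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + align - 1) & ~(uintptr_t)(align - 1);
    ((void **)aligned)[-1] = raw;  /* remember what to pass to free */
    return (void *)aligned;
}

/* Frees a block obtained from aligned_malloc_sketch. */
static void aligned_free_sketch(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

As with _mm_malloc and _mm_free, a block obtained from the aligned allocator must be released with the matching free routine, never with free directly.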
Note
Inline Assembly
The Intel C++ Compiler supports Microsoft-style inline assembly with the
-use-msasm compiler option. See your Microsoft documentation for the proper syntax.
The Intel C++ Compiler supports GNU-style inline assembly. The syntax is as
follows:
The Intel C++ Compiler also supports mixing UNIX and Microsoft style asms. Use the
__asm__ keyword for GNU-style ASM when using the -use-msasm switch.
Note
The Intel C++ Compiler supports gcc-style inline ASM if the assembler code uses
AT&T* System V/386 syntax.
Syntax Element    Description
asm-keyword
asm statements begin with the keyword asm. Alternatively, either __asm or __asm__
may be used for compatibility. When mixing UNIX and Microsoft style asm, use the
__asm__ keyword. In that mode the compiler accepts only the __asm__ keyword for
GNU-style statements; the asm and __asm keywords are reserved for Microsoft style
assembly statements.
volatile-keyword
If the optional keyword volatile is given, the asm is volatile. Two volatile asm
statements will never be moved past each other, and a reference to a volatile variable
will not be moved relative to a volatile asm. Alternate keywords __volatile and
__volatile__ may be used for compatibility.
asm-template
The asm-template is a C language ASCII string which specifies how to output the
assembly code for an instruction. Most of the template is a
When compiling an assembly statement on Linux, the compiler simply emits the
asm-template to the assembly file after making any necessary operand substitutions.
The compiler then calls the GNU assembler to generate machine code. In contrast,
on Windows the compiler itself must assemble the text contained in the asm-
template string into machine code. In essence, the compiler contains a built-in
assembler.
The compiler's built-in assembler does not support the full functionality of the GNU
assembler, so there are limitations in the contents of the asm-template. In particular,
the following assembler features are not currently supported:
Directives
Labels
Symbols*
Note
Example
#ifdef _WIN64
#define INT64_PRINTF_FORMAT "I64"
#else
#define __int64 long long
/* "ll" is the standard printf length modifier for long long. */
#define INT64_PRINTF_FORMAT "ll"
#endif
#include <stdio.h>
typedef struct {
__int64 lo64;
__int64 hi64;
} my_i128;
#define ADD128(out, in1, in2) \
__asm__("addq %2, %0; adcq %3, %1" : \
"=r"(out.lo64), "=r"(out.hi64) : \
"emr" (in2.lo64), "emr"(in2.hi64), \
"0" (in1.lo64), "1" (in1.hi64));
This example, written for Intel 64 architecture, shows how to use a GNU-style
inline assembly statement to add two 128-bit integers. In this example, a 128-bit
integer is represented as two __int64 objects in the my_i128 structure. The inline
assembly statement used to implement the addition is contained in the ADD128
macro, which takes 3 my_i128 arguments representing 3 128-bit integers. The first
argument is the output. The next two arguments are the inputs. The example
compiles and runs using the Intel Compiler on Linux or Windows, producing the
following output.
0x0000000000000000ffffffffffffffff
+ 0x00000000000000410000000000000001
------------------------------------
+ 0x00000000000000420000000000000000
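The arithmetic behind this output can be checked without inline assembly. The following portable C sketch is not the example program itself: the type u128_model and the function add128_model are hypothetical names, and the carry is computed in C rather than with the adcq instruction.

```c
#include <stdint.h>

typedef struct {
    uint64_t lo64;
    uint64_t hi64;
} u128_model;

/* Adds two 128-bit integers held as pairs of 64-bit halves. */
static u128_model add128_model(u128_model in1, u128_model in2)
{
    u128_model out;
    out.lo64 = in1.lo64 + in2.lo64;
    /* The low sum wrapped around exactly when it is below an input. */
    uint64_t carry = (out.lo64 < in1.lo64) ? 1u : 0u;
    out.hi64 = in1.hi64 + in2.hi64 + carry;
    return out;
}
```

With the inputs shown above, the low halves sum to zero with a carry, and the high halves give 0x41 plus the carry, which is 0x42.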
In the GNU-style inline assembly implementation, the asm interface specifies all the
inputs, outputs, and side effects of the asm statement, enabling the compiler to
generate very efficient code.
add r13, 1
adc r12, 65
It is worth noting that when the compiler generates an assembly file on Windows, it
uses Intel syntax even though the assembly statement was written using Linux
assembly syntax.
The compiler moves in1.lo64 into a register to match the constraint of operand 4.
Operand 4's constraint of "0" indicates that it must be assigned the same location as
output operand 0. And operand 0's constraint is "=r", indicating that it must be
assigned an integer register. In this case, the compiler chooses r13. In the same way,
the compiler moves in1.hi64 into register r12.
The constraints for input operands 2 and 3 allow the operands to be assigned a
register location ("r"), a memory location ("m"), or a constant signed 32-bit integer
value ("e"). In this case, the compiler chooses to match operands 2 and 3 with the
constant values 1 and 65, enabling the add and adc instructions to utilize the
"register-immediate" forms.
The same operation is much more expensive using a Microsoft-style inline assembly
statement, because the interface between the assembly statement and the
surrounding C++ code is entirely through memory. Using Microsoft assembly, the
ADD128 macro might be written as follows.
{ \
The compiler must add code before the assembly statement to move the inputs into
memory, and it must add code after the assembly statement to retrieve the outputs
from memory. This prevents the compiler from exploiting some optimization
opportunities. Thus, the following assembly code is produced.
; Begin ASM
; End ASM
The operation that took only 4 instructions and 0 memory references using GNU-
style inline assembly takes 12 instructions with 12 memory references using
Microsoft-style inline assembly.
This section provides a series of tables that compare intrinsics performance across
architectures. Before implementing intrinsics across architectures, please note the
following.
Intrinsics may generate code that does not run on all IA processors. You
should therefore use CPUID to detect the processor and generate the
appropriate code.
Implement intrinsics by processor family, not by specific processor. The
guiding principle for which family -- IA-32 or Itanium processors -- the
intrinsic is implemented on is performance, not compatibility. Where there is
added performance on both families, the intrinsic will be identical.
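One possible way to perform the CPUID check mentioned above is sketched here. It assumes a GNU-compatible compiler that provides the cpuid.h header and its __get_cpuid helper; the simd_features_t type and the detect_simd_features name are hypothetical and exist only for illustration. On targets without x86 CPUID support the sketch reports no features.

```c
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>   /* assumption: GNU-style header providing __get_cpuid */
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Hypothetical feature record; each field is 0 or 1. */
typedef struct {
    int mmx;
    int sse;
    int sse2;
    int sse3;
} simd_features_t;

/* Query CPUID leaf 1 and decode the SIMD feature bits. */
simd_features_t detect_simd_features(void)
{
    simd_features_t f = {0, 0, 0, 0};
#if HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.mmx  = (int)((edx >> 23) & 1u);  /* EDX bit 23: MMX  */
        f.sse  = (int)((edx >> 25) & 1u);  /* EDX bit 25: SSE  */
        f.sse2 = (int)((edx >> 26) & 1u);  /* EDX bit 26: SSE2 */
        f.sse3 = (int)(ecx & 1u);          /* ECX bit 0:  SSE3 */
    }
#endif
    return f;
}
```

A program could call detect_simd_features once at startup and then dispatch to an intrinsic-based routine only when the corresponding field is set, falling back to plain C otherwise.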
int abs(int)
long labs(long)
unsigned long _lrotl(unsigned long value, int shift)
unsigned long _lrotr(unsigned long value, int shift)
unsigned int _rotl(unsigned int value, int shift)
unsigned int _rotr(unsigned int value, int shift)
__int64 __i64_rotl(__int64 value, int shift)
__int64 __i64_rotr(__int64 value, int shift)
double fabs(double)
double log(double)
float logf(float)
double log10(double)
float log10f(float)
double exp(double)
float expf(float)
double pow(double, double)
float powf(float, float)
double sin(double)
float sinf(float)
double cos(double)
float cosf(float)
double tan(double)
float tanf(float)
double acos(double)
float acosf(float)
double acosh(double)
float acoshf(float)
double asin(double)
float asinf(float)
double asinh(double)
float asinhf(float)
double atan(double)
float atanf(float)
double atanh(double)
float atanhf(float)
float cabs(double)*
double ceil(double)
float ceilf(float)
double cosh(double)
float coshf(float)
float fabsf(float)
double floor(double)
float floorf(float)
double fmod(double, double)
float fmodf(float, float)
double hypot(double, double)
float hypotf(float, float)
double rint(double)
float rintf(float)
double sinh(double)
float sinhf(float)
float sqrtf(float)
double tanh(double)
float tanhf(float)
char *_strset(char *, _int32)
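As an illustration of what the rotate entries in the list above compute, the following is a portable sketch of 32-bit rotation. The names rotl32 and rotr32 are hypothetical stand-ins, not the compiler's _rotl and _rotr, and reducing the shift count modulo 32 is an assumption made so that the C shifts stay well defined.

```c
#include <stdint.h>

/* Rotate a 32-bit value left; the count is reduced modulo 32 (assumption). */
uint32_t rotl32(uint32_t value, int shift)
{
    unsigned int s = (unsigned int)shift & 31u;
    /* The "& 31u" on the right shift avoids an undefined 32-bit shift when s is 0. */
    return (value << s) | (value >> ((32u - s) & 31u));
}

/* Rotate a 32-bit value right; the count is reduced modulo 32 (assumption). */
uint32_t rotr32(uint32_t value, int shift)
{
    unsigned int s = (unsigned int)shift & 31u;
    return (value >> s) | (value << ((32u - s) & 31u));
}
```

For example, rotl32(0x80000001, 1) moves the top bit around to bit 0 and yields 0x00000003, and rotr32(0x00000003, 1) reverses it.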
SSE
SSE2
_mm_empty A B
_mm_cvtsi32_si64 A A
_mm_cvtsi64_si32 A A
_mm_packs_pi16 A A
_mm_packs_pi32 A A
_mm_packs_pu16 A A
_mm_unpackhi_pi8 A A
_mm_unpackhi_pi16 A A
_mm_unpackhi_pi32 A A
_mm_unpacklo_pi8 A A
_mm_unpacklo_pi16 A A
_mm_unpacklo_pi32 A A
_mm_add_pi8 A A
_mm_add_pi16 A A
_mm_add_pi32 A A
_mm_adds_pi8 A A
_mm_adds_pi16 A A
_mm_adds_pu8 A A
_mm_adds_pu16 A A
_mm_sub_pi8 A A
_mm_sub_pi16 A A
_mm_sub_pi32 A A
_mm_subs_pi8 A A
_mm_subs_pi16 A A
_mm_subs_pu8 A A
_mm_subs_pu16 A A
_mm_madd_pi16 A C
_mm_mulhi_pi16 A A
_mm_mullo_pi16 A A
_mm_sll_pi16 A A
_mm_slli_pi16 A A
_mm_sll_pi32 A A
_mm_slli_pi32 A A
_mm_sll_pi64 A A
_mm_slli_pi64 A A
_mm_sra_pi16 A A
_mm_srai_pi16 A A
_mm_sra_pi32 A A
_mm_srai_pi32 A A
_mm_srl_pi16 A A
_mm_srli_pi16 A A
_mm_srl_pi32 A A
_mm_srli_pi32 A A
_mm_srl_si64 A A
_mm_srli_si64 A A
_mm_and_si64 A A
_mm_andnot_si64 A A
_mm_or_si64 A A
_mm_xor_si64 A A
_mm_cmpeq_pi8 A A
_mm_cmpeq_pi16 A A
_mm_cmpeq_pi32 A A
_mm_cmpgt_pi8 A A
_mm_cmpgt_pi16 A A
_mm_cmpgt_pi32 A A
_mm_setzero_si64 A A
_mm_set_pi32 A A
_mm_set_pi16 A C
_mm_set_pi8 A C
_mm_set1_pi32 A A
_mm_set1_pi16 A A
_mm_set1_pi8 A A
_mm_setr_pi32 A A
_mm_setr_pi16 A C
_mm_setr_pi8 A C
Regular Streaming SIMD Extensions (SSE) intrinsics operate on four 32-bit
single-precision values. On IA-64 architecture-based systems, basic operations such
as add and compare require two SIMD instructions. Both instructions can execute in
the same cycle, so the throughput is one basic SSE operation per cycle, or four
32-bit single-precision operations per cycle.
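The four-wide behavior described above can be sketched with a single packed add. The function name add4_ps is hypothetical; the intrinsics _mm_loadu_ps, _mm_add_ps, and _mm_storeu_ps come from xmmintrin.h. The sketch assumes the compiler defines __SSE__ when SSE code generation is enabled, and it falls back to a scalar loop otherwise.

```c
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics */
#endif

/* Add four single-precision values element by element: out[i] = a[i] + b[i]. */
void add4_ps(const float *a, const float *b, float *out)
{
#if defined(__SSE__)
    /* Unaligned loads and stores, so the arrays need no special alignment. */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* one packed add covers all four lanes */
#else
    /* Scalar fallback for targets without SSE. */
    for (int i = 0; i < 4; ++i) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

When the data is known to be 16-byte aligned, _mm_load_ps and _mm_store_ps could be used in place of the unaligned variants.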
_mm_sub_ps N/A A A
_mm_mul_ss N/A B B
_mm_mul_ps N/A A A
_mm_div_ss N/A B B
_mm_div_ps N/A A A
_mm_sqrt_ss N/A B B
_mm_sqrt_ps N/A A A
_mm_rcp_ss N/A B B
_mm_rcp_ps N/A A A
_mm_rsqrt_ss N/A B B
_mm_rsqrt_ps N/A A A
_mm_min_ss N/A B B
_mm_min_ps N/A A A
_mm_max_ss N/A B B
_mm_max_ps N/A A A
_mm_and_ps N/A A A
_mm_andnot_ps N/A A A
_mm_or_ps N/A A A
_mm_xor_ps N/A A A
_mm_cmpeq_ss N/A B B
_mm_cmpeq_ps N/A A A
_mm_cmplt_ss N/A B B
_mm_cmplt_ps N/A A A
_mm_cmple_ss N/A B B
_mm_cmple_ps N/A A A
_mm_cmpgt_ss N/A B B
_mm_cmpgt_ps N/A A A
_mm_cmpge_ss N/A B B
_mm_cmpge_ps N/A A A
_mm_cmpneq_ss N/A B B
_mm_cmpneq_ps N/A A A
_mm_cmpnlt_ss N/A B B
_mm_cmpnlt_ps N/A A A
_mm_cmpnle_ss N/A B B
_mm_cmpnle_ps N/A A A
_mm_cmpngt_ss N/A B B
_mm_cmpngt_ps N/A A A
_mm_cmpnge_ss N/A B B
_mm_cmpnge_ps N/A A A
_mm_cmpord_ss N/A B B
_mm_cmpord_ps N/A A A
_mm_cmpunord_ss N/A B B
_mm_cmpunord_ps N/A A A
_mm_comieq_ss N/A B B
_mm_comilt_ss N/A B B
_mm_comile_ss N/A B B
_mm_comigt_ss N/A B B
_mm_comige_ss N/A B B
_mm_comineq_ss N/A B B
_mm_ucomieq_ss N/A B B
_mm_ucomilt_ss N/A B B
_mm_ucomile_ss N/A B B
_mm_ucomigt_ss N/A B B
_mm_ucomige_ss N/A B B
_mm_ucomineq_ss N/A B B
_mm_cvtss_si32 N/A A B
_mm_cvtps_pi32 N/A A A
_mm_cvttss_si32 N/A A B
_mm_cvttps_pi32 N/A A A
_mm_cvtsi32_ss N/A A B
_mm_cvtpi32_ps N/A A C
_mm_cvtpi16_ps N/A A C
_mm_cvtpu16_ps N/A A C
_mm_cvtpi8_ps N/A A C
_mm_cvtpu8_ps N/A A C
_mm_cvtpi32x2_ps N/A A C
_mm_cvtps_pi16 N/A A C
_mm_cvtps_pi8 N/A A C
_mm_move_ss N/A A A
_mm_shuffle_ps N/A A A
_mm_unpackhi_ps N/A A A
_mm_unpacklo_ps N/A A A
_mm_movehl_ps N/A A A
_mm_movelh_ps N/A A A
_mm_movemask_ps N/A A C
_mm_getcsr N/A A A
_mm_setcsr N/A A A
_mm_loadh_pi N/A A A
_mm_loadl_pi N/A A A
_mm_load_ss N/A A B
_mm_load1_ps N/A A A
_mm_load_ps N/A A A
_mm_loadu_ps N/A A A
_mm_loadr_ps N/A A A
_mm_storeh_pi N/A A A
_mm_storel_pi N/A A A
_mm_store_ss N/A A A
_mm_store_ps N/A A A
_mm_store1_ps N/A A A
_mm_storeu_ps N/A A A
_mm_storer_ps N/A A A
_mm_set_ss N/A A A
_mm_set1_ps N/A A A
_mm_set_ps N/A A A
_mm_setr_ps N/A A A
_mm_setzero_ps N/A A A
_mm_prefetch N/A A A
_mm_stream_pi N/A A A
_mm_stream_ps N/A A A
_mm_sfence N/A A A
_mm_extract_pi16 N/A A A
_mm_insert_pi16 N/A A A
_mm_max_pi16 N/A A A
_mm_max_pu8 N/A A A
_mm_min_pi16 N/A A A
_mm_min_pu8 N/A A A
_mm_movemask_pi8 N/A A C
_mm_mulhi_pu16 N/A A A
_mm_shuffle_pi16 N/A A A
_mm_maskmove_si64 N/A A C
_mm_avg_pu8 N/A A A
_mm_avg_pu16 N/A A A
_mm_sad_pu8 N/A A A
On processors that do not support SSE2 instructions but do support MMX Technology,
you can use the sse2mmx.h emulation pack to enable support for SSE2 instructions.
You can use the sse2mmx.h header file for the following processors: