32/64-Bit 80x86 Assembly Language Architecture
James C. Leiterman
No part of this book may be reproduced in any form or by any means without permission in writing from
Wordware Publishing, Inc.
ISBN-13: 978-1-59822-002-5
ISBN-10: 1-59822-002-0
10 9 8 7 6 5 4 3 2 1
0506
Brand names and product names mentioned in this book are trademarks or service marks of their respective
companies. Any omission or misuse (of any kind) of service marks or trademarks should not be regarded as
intent to infringe on the property of others. The publisher recognizes and respects all marks used by
companies, manufacturers, and developers as a means to distinguish their products.
This book is sold as is, without warranty of any kind, either express or implied, respecting the contents of this
book and any disks or programs that may accompany it, including but not limited to implied warranties for the
book's quality, performance, merchantability, or fitness for any particular purpose. Neither Wordware
Publishing, Inc. nor its dealers or distributors shall be liable to the purchaser or any other person or entity with
respect to any liability, loss, or damage caused or alleged to have been caused directly or indirectly by this
book.
All inquiries for volume purchases of this book should be addressed to Wordware Publishing, Inc., at the above
address. Telephone inquiries may be made by calling:
(972) 423-0090
Contents
32/64-Bit 80x86 Assembly Language Architecture ......................................................................... 1
Preface — (or, So Why Did He Write Yet Another Book?) ................................................. 14
Chapter 1: Introduction ...................................................................................................................... 16
Overview ........................................................................................................................................... 16
Chapter 2: Coding Standards ........................................................................................................... 19
Overview ........................................................................................................................................... 19
Constants ........................................................................................................................................... 19
Data Alignment ................................................................................................................................. 20
Stacks and Vectors ............................................................................................................................ 22
3D Vector (Floating-Point) ........................................................................................................... 22
3D Quad Vector (Floating-Point) ................................................................................................. 23
Compiler Data ................................................................................................................................... 25
Assertions .......................................................................................................................................... 26
Memory Systems .............................................................................................................................. 29
RamTest Memory Alignment Test ............................................................................................... 29
Memory Header ............................................................................................................................ 31
Allocate Memory (Malloc Wrapper) ............................................................................................ 31
Release Memory (Free Wrapper).................................................................................................. 32
Allocate Memory .......................................................................................................................... 33
Allocate (Cleared) Memory .......................................................................................................... 33
Free Memory — Pointer Is Set to NULL ..................................................................................... 33
Chapter 3: Processor Differential Insight........................................................................................ 34
Overview ........................................................................................................................................... 34
Processor Overview .......................................................................................................................... 34
History............................................................................................................................................... 36
The 64-Bit Processor......................................................................................................................... 37
80x86 Registers................................................................................................................................. 38
General-Purpose Registers ............................................................................................................ 38
REX............................................................................................................................................... 40
Segment/Selector Registers .......................................................................................................... 42
MMX Registers ............................................................................................................................. 43
XMM Registers ............................................................................................................................. 43
CPU Status Registers (EFLAGS/64-Bit RFLAGS).......................................................................... 43
LAHF — Load AH Flags ............................................................................................................. 44
SAHF — Save AH Flags .............................................................................................................. 44
PUSHF/PUSHFD — Push EFLAGS onto Stack .......................................................................... 45
PUSHFQ — Push RFLAGS onto Stack ....................................................................................... 45
POPF/POPFD — Pop EFLAGS from Stack ................................................................................ 45
POPFQ — Pop RFLAGS from Stack ........................................................................................... 45
CLC — Clear (Reset) Carry Flag ................................................................................................. 46
STC — Set Carry Flag .................................................................................................................. 46
CMC — Complement Carry Flag ................................................................................................. 46
NOP — No Operation ....................................................................................................................... 47
Floating-Point 101 ............................................................................................................................ 47
Processor Data Type Encoding ......................................................................................................... 47
EMMS — Enter/Leave MMX State ................................................................................................. 48
FEMMS — Enter/Leave MMX State ............................................................................................... 48
Destination/Source Orientations ....................................................................................................... 49
Big/Little-Endian .............................................................................................................................. 49
Alignment Quickie ............................................................................................................................ 50
(Un)aligned Memory Access ............................................................................................................ 51
MOV/MOVQ — Move Data ........................................................................................................ 51
Move (Unaligned) ......................................................................................................................... 51
Move (Aligned) ............................................................................................................................. 52
XCHG — Exchange (Swap) Data ................................................................................................ 54
System Level Functionality .............................................................................................................. 54
Indirect Memory Addressing ............................................................................................................ 55
uint32 OddTable[ ]........................................................................................................................ 56
LEA — Load Effective Address ................................................................................................... 57
Translation Table .............................................................................................................................. 57
XLAT/XLATB — Translation Table Lookup .............................................................................. 57
String Instructions ............................................................................................................................. 59
LODSB/LODSW/LODSD/LODSQ — Load String .................................................................... 59
REP LODSx .................................................................................................................................. 59
STOSB/STOSW/STOSD/STOSQ — Save String ....................................................................... 60
REP/REPE/REPZ/REPNE/REPNZ — Repeat String .................................................................. 61
REP STOSx .................................................................................................................................. 61
MOVSB/MOVSW/MOVSD/MOVSQ — Move String............................................................... 61
REP MOVSx ................................................................................................................................. 62
CLD/STD — Clear/Set Direction Flag ......................................................................................... 62
Special (Non-Temporal) Memory Instructions ................................................................................. 63
MOVNTx — Copy Using Non-Temporal Hint ............................................................................ 63
MOVNTPS — Copy 4×SPFP Using Non-Temporal Hint ........................................... 63
MOVNTPD — Copy 2×DPFP Using Non-Temporal Hint .......................................... 63
MASKMOVQ/MASKMOVDQU — Copy Selected Bytes ......................................................... 64
Exercises ........................................................................................................................................... 64
Chapter 4: Bit Mangling ..................................................................................................................... 65
Overview ........................................................................................................................................... 65
Boolean Logical AND ...................................................................................................................... 66
Pseudo Vec.................................................................................................................................... 67
Boolean Logical OR ......................................................................................................................... 70
Pseudo Vec.................................................................................................................................... 71
Boolean Logical XOR (Exclusive OR)............................................................................................. 72
Pseudo Vec.................................................................................................................................... 73
NOT — One's Complement Negation .......................................................................................... 74
NEG — Two's Complement Negation ......................................................................................... 74
ToolBox Snippet — The Butterfly Switch ................................................................................... 75
I-VU-Q .......................................................................................................................................... 76
Boolean Logical ANDC .................................................................................................................... 78
Pseudo Vec.................................................................................................................................... 79
Exercises ........................................................................................................................................... 80
Chapter 5: Bit Wrangling ................................................................................................................... 81
Overview ........................................................................................................................................... 81
Logical Left Shifting ......................................................................................................................... 81
SHL/SAL – Shift (Logical/Arithmetic) Left................................................................................. 81
SHLD — Shift (Logical) Left (Double) ....................................................................................... 82
PSLLx — Parallel Shift Left (Logical) ......................................................................................... 83
Pseudo Vec.................................................................................................................................... 83
Pseudo Vec (x86) .......................................................................................................................... 85
Logical Right Shifting....................................................................................................................... 86
SHR — Shift (Logical) Right ....................................................................................................... 86
SHRD — Shift (Logical) Right (Double) ..................................................................................... 87
PSRLx — Parallel Shift Right (Logical) ...................................................................................... 88
Pseudo Vec.................................................................................................................................... 88
Rotate Left (or n-Right) .................................................................................................................... 92
ROL — Rotate Left ...................................................................................................................... 92
RCL — Rotate Carry Left............................................................................................................. 93
Rotate Right ...................................................................................................................................... 94
ROR — Rotate Right .................................................................................................................... 94
RCR — Rotate Carry Right .......................................................................................................... 95
Bit Scanning ...................................................................................................................................... 95
BSF — Bit Scan Forward ............................................................................................................. 95
BSR — Bit Scan Reverse ............................................................................................................. 96
ToolBox Snippet — Get Bit Count............................................................................................... 96
Graphics 101 — Palette Bits ......................................................................................................... 97
Exercises ......................................................................................................................................... 100
Chapter 6: Data Conversion ........................................................................................................... 102
Data Interlacing, Exchanging, Unpacking, and Merging ............................................................... 102
Byte Swapping ................................................................................................................................ 103
Little-Endian ............................................................................................................................... 103
(Big/Little)-Endian to (Big/Little)-Endian Data Relationship Macros ....................................... 104
BSWAP — Byte Swap ............................................................................................................... 104
PSWAPD — Packed Swap Double Word .................................................................................. 106
Data Interlacing ............................................................................................................................... 106
PUNPCKLBW — Parallel Extend Lower from Byte................................................................. 106
PUNPCKHBW — Parallel Extend Upper from Byte................................................................. 107
PUNPCKLWD — Parallel Extend Lower from 16-Bit .............................................................. 107
PUNPCKHWD — Parallel Extend Upper from 16-Bit .............................................................. 108
PUNPCKLDQ — Parallel Extend Lower from 32-Bit............................................................... 108
PUNPCKHDQ — Parallel Extend Upper from 32-Bit .............................................................. 109
MOVSS — Move Scalar (SPFP) ................................................................................................ 110
MOVQ2DQ — Move Scalar (1×64-Bit) MMX to XMM .......................................... 110
MOVDQ2Q — Move Scalar (1×64-Bit) XMM to MMX........................................... 110
MOVLPS — Move Low Packed (2×SPFP) ............................................................................... 111
MOVHPS — Move High Packed (2×SPFP) .............................................................................. 111
MOVLHPS — Move Low to High Packed (2×SPFP) ............................................................... 111
MOVHLPS — Move High to Low Packed (2×SPFP) ............................................................... 112
MOVSD — Move Scalar (1×DPFP) .......................................................................................... 112
MOVLPD — Move Low Packed (1×DPFP) .............................................................................. 113
MOVHPD — Move High Packed (1×DPFP) ............................................................................. 113
PUNPCKLQDQ — Parallel Copy Lower (2×64-Bit) ................................................................ 114
PUNPCKHQDQ — Parallel Copy Upper (2×64-Bit) ................................................................ 114
Swizzle, Shuffle, and Splat ............................................................................................................. 115
PINSRW — Shuffle (1×16-Bit) to (4×16-Bit) ........................................................................... 115
PSHUFW — Shuffle Packed Words (4×16-Bit) ........................................................................ 115
PSHUFLW — Shuffle Packed Low Words (4×16-Bit) ............................................................. 116
PSHUFHW — Shuffle Packed High Words (4×16-Bit) ............................................................ 116
PSHUFD — Shuffle Packed Double Words (4×32-Bit) ............................................................ 116
SHUFPS — Shuffle Packed SPFP Values (4×SPFP) ................................................................. 117
MOVSLDUP — Splat Packed Even SPFP to (4×SPFP) ............................................................ 117
MOVSHDUP — Splat Packed Odd SPFP to (4×SPFP)............................................................. 117
MOVDDUP — Splat Lower DPFP to Packed (2×DPFP) .......................................................... 118
SHUFPD — Shuffle Packed DPFP (2×64-Bit) .......................................................................... 118
Data Bit Expansion ......................................................................................................................... 118
CBW Convert Signed AL (Byte) to AX (Word) ........................................................................ 119
CWDE Convert Signed AX (Word) to EAX (DWord) .............................................................. 119
CDQE Convert Signed EAX (DWord) to RAX (QWord).......................................................... 119
MOVSX/MOVSXD — Move with Sign Extension ................................................................... 119
MOVZX — Move with Zero Extension ..................................................................................... 121
CWD — Convert Signed AX (Word) to DX:AX ....................................................................... 122
CDQ — Convert Signed EAX (DWord) to EDX:EAX.............................................................. 122
CQO — Convert Signed RAX (QWord) to RDX:RAX ............................................................. 122
PEXTRW — Extract (1×16-Bit) from Packed (4×16-Bit) into Integer ............................. 123
Data Bit Reduction (with Saturation) ............................................................................................. 123
PACKSSWB — Packed Signed int16 to int8 with Saturation ................................................... 123
PACKUSWB — Packed uint16 to uint8 with Saturation ........................................................... 124
PACKSSDW — Packed int32 to int16 with Saturation ............................................................. 124
Data Conversion (Integer : Float, Float : Integer, Float : Float) ..................................................... 125
PI2FW — Convert Packed Even int16 to SPFP ......................................................................... 125
CVTDQ2PS — Convert Packed int32 to SPFP .......................................................................... 125
CVTPS2DQ — Convert Packed SPFP to int32.......................................................................... 125
CVTPI2PS — Convert Lo Packed int32 to SPFP ...................................................................... 126
CVTPS2PI — Convert Lo Packed SPFP to int32 ...................................................................... 126
CVTSI2SS — Convert Scalar int32 to SPFP ............................................................................. 127
CVTDQ2PD — Convert Even Packed int32 to DPFP ............................................................... 127
CVTPD2DQ — Convert Packed DPFP to Even int32 ............................................................... 127
CVTPD2PS — Convert Packed DPFP to Lo SPFP ................................................................... 127
CVTPS2PD — Convert Lo Packed SPFP to DPFP ................................................................... 128
CVTPD2PI — Convert Packed DPFP to int32 .......................................................................... 128
CVTPI2PD — Convert Packed int32 to DPFP .......................................................................... 128
CVTSS2SI — Convert Scalar SPFP to int32/64 ........................................................................ 129
CVTSD2SI — Convert Scalar DPFP to Int ................................................................................ 129
CVTSI2SD — Convert Scalar Int to DPFP ................................................................................ 129
CVTSD2SS — Convert Scalar DPFP to SPFP........................................................................... 130
CVTSS2SD — Convert Scalar SPFP to DPFP........................................................................... 130
Exercises ......................................................................................................................................... 130
Chapter 7: Integer Math .................................................................................................................. 131
Overview ......................................................................................................................................... 131
General Integer Math ...................................................................................................................... 131
ADD — Add ............................................................................................................................... 131
ADC — Add with Carry ............................................................................................................. 131
INC — Increment by 1 ............................................................................................................... 133
XADD — Exchange and Add..................................................................................................... 134
SUB — Subtract ......................................................................................................................... 135
SBB — Subtract with Borrow .................................................................................................... 135
DEC — Decrement by 1 ............................................................................................................. 136
Packed Addition and Subtraction.................................................................................................... 137
PADDB/PADDW/PADDD/PADDQ Integer Addition .............................................................. 138
Vector {8/16/32/64}-Bit Int Addition with Saturation ............................................................... 139
PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction ............................................................. 140
Vector {8/16/32/64}-Bit Integer Subtraction with Saturation .................................................... 140
Vector Addition and Subtraction (Fixed Point) .............................................................................. 141
Pseudo Vec.................................................................................................................................. 141
Pseudo Vec (x86) ........................................................................................................................ 141
Averages ......................................................................................................................................... 142
PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average .................................................. 142
PAVGW — N×16-Bit [Un]signed Integer Average ................................................................... 143
Sum of Absolute Differences .......................................................................................................... 143
PSADBW — N×8-Bit Sum of Absolute Differences ................................................................. 143
8×8-Bit Sum of Absolute Differences ........................................................................................ 144
16×8-Bit Sum of Absolute Differences ...................................................................................... 144
Integer Multiplication ..................................................................................................................... 144
MUL — Unsigned Multiplication (Scalar) .................................................................. 144
IMUL — Signed Multiplication (Scalar) .................................................................................... 145
Packed Integer Multiplication ......................................................................................................... 145
PMULLW — N×16-Bit Parallel Multiplication (Lower) ........................................................... 146
PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper) ..................................... 147
PMULHRW — Signed 4×16-Bit Multiplication with Rounding (Upper) ................................. 147
Pseudo Vec (x86) ........................................................................................................................ 148
PMULUDQ — Unsigned N×32-Bit Multiply Even ................................................................... 149
PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD .......................................... 150
Exercises ......................................................................................................................................... 150
Chapter 8: Floating-Point Anyone? ............................................................................................... 151
Overview ......................................................................................................................................... 151
The Floating-Point Number ............................................................................................................ 151
FPU Registers ............................................................................................................................. 154
Loading/Storing Numbers and the FPU Stack ................................................................................ 155
FLD — Floating-Point Load ....................................................................................................... 155
FST/FSTP — FPU Floating-Point Save ..................................................................................... 158
FILD — FPU Integer Load ......................................................................................................... 158
FIST/FISTP/FISTTP — FPU Integer Save ................................................................................ 158
FPU Constants ............................................................................................................................ 159
FXCH .......................................................................................................................................... 159
FINCSTP — FPU Increment Stack Pointer................................................................................ 160
FDECSTP — FPU Decrement Stack Pointer ............................................................................. 160
FWAIT/WAIT ............................................................................................................................ 160
EMMS/FEMMS.......................................................................................................................... 160
FNOP .......................................................................................................................................... 160
General Math Instructions............................................................................................................... 161
FCHS — FPU Two's Complement ST(0) = – ST(0) .................................................................. 161
FABS — FPU Absolute Value ST(0) = |ST(0)| .......................................................................... 161
FADD/FADDP/FIADD — FPU Addition D = ST(0) + A ......................................................... 161
FSUB/FSUBP/FISUB — FPU Subtraction D = ST(0) – A........................................................ 162
FSUBR/FSUBRP/FISUBR — FPU Reverse Subtraction D = A – ST(0) ................................... 162
FMUL/FMULP/FIMUL — FPU Multiplication D = ST(0) × A................................................ 163
FDIV/FDIVP/FIDIV — FPU Division D = Dst ÷ Src ............................................................... 163
FDIVR/FDIVRP/FIDIVR — FPU Reverse Division D = Src ÷ Dst .......................... 163
FPREM — FPU Partial Remainder ............................................................................................ 164
FPREM1 — FPU Partial Remainder .......................................................................................... 164
FRNDINT — FPU Round to Integer .......................................................................................... 164
Advanced Math Instructions ........................................................................................................... 168
FSQRT — FPU ST(0) Square Root ........................................................................................... 168
FSCALE — FPU Scale ST(0) = ST(0) << ST(1) ....................................................................... 168
F2XM1 — FPU ST(0) = 2^ST(0) – 1 .............................................................. 169
FYL2X — FPU ST(0) = y × log₂(x) ................................................................. 169
FYL2XP1 — FPU ST(0) = y × log₂(x+1)...................................................... 169
FXTRACT — FPU Extract Exponent and Significand .............................................................. 169
Floating-Point Comparison ............................................................................................................. 169
FTST — FPU Test If Zero .......................................................................................................... 172
FCOM/FCOMP/FCOMPP — FPU Unordered CMP FP ........................................................... 172
FUCOM/FUCOMP/FUCOMPP — FPU Unordered CMP FP................................................... 173
FCOMI/FCOMIP/FUCOMI/FUCOMIP — FPU A ? B and EFLAGS ....................................... 173
FICOM/FICOMP — FPU A ? B ................................................................................................ 174
FCMOVcc — FPU Conditional Move ....................................................................................... 174
FXAM — FPU Examine ............................................................................................................ 175
FPU BCD (Binary-Coded Decimal) ............................................................................................... 176
FBLD — FPU (BCD Load) ........................................................................................................ 176
FBSTP — FPU (BCD Save and Pop ST(0)) .............................................................................. 176
FPU Trigonometry .......................................................................................................................... 176
FPTAN — FPU Partial Tangent ................................................................................................. 176
FPATAN — FPU Partial Arctangent ......................................................................................... 177
FSINCOS — Sine and Cosine .................................................................................................... 177
Pseudo Vec.................................................................................................................................. 179
Pseudo Vec (x86) ........................................................................................................................ 179
FSIN — FPU Sine ...................................................................................................................... 180
FCOS — FPU Cosine ................................................................................................................. 180
FSINCOS — FPU Sine and Cosine ............................................................................................ 180
FPU System Instructions................................................................................................................. 182
FINIT/FNINIT — FPU Init ........................................................................................................ 182
FCLEX/FNCLEX — FPU Clear Exceptions ............................................................................. 182
FFREE — FPU Free FP Register ............................................................................................... 183
FSAVE/FNSAVE — FPU Save X87 FPU, MMX, SSE, SSE2 ................................................. 183
FRSTOR — FPU Restore x87 State ........................................................................................... 183
FXSAVE — FPU Save x87 FPU, MMX, SSE, SSE2, SSE3 ..................................................... 183
FXRSTOR — FPU Restore x87 FPU, MMX, SSE, SSE2, SSE3 .............................................. 184
FSTENV/FNSTENV — FPU Store x87 Environment............................................................... 184
FLDENV — FPU Load x87 Environment ................................................................................. 184
FSTCW/FNSTCW — FPU Store x87 Control Word ................................................................. 184
FLDCW — FPU Load x87 Control Word.................................................................................. 185
FSTSW/FNSTSW — FPU Store x87 Status Word .................................................................... 185
Validating (Invalid) Floating-Point................................................................................................. 185
Exercises ......................................................................................................................................... 186
Chapter 9: Comparison ................................................................................................................... 187
Overview ......................................................................................................................................... 187
TEST — Logical Compare A ∧ B .................................................................. 187
Indexed Bit Testing ......................................................................................................................... 188
BT — Bit Test ............................................................................................................................. 188
BTC — Bit Test and Complement.............................................................................................. 189
BTR — Bit Test and Reset (Clear) ............................................................................................. 189
BTS — Bit Test and Set.............................................................................................................. 190
SETcc — Set Byte on Condition .................................................................................................... 191
Comparing Operands and Setting EFLAGS ................................................................................... 192
CMP — Compare Two Operands ............................................................................................... 192
COMISS — Compare Scalar SPFP, Set EFLAGS ..................................................................... 193
COMISD — Compare Scalar DPFP, Set EFLAGS.................................................................... 193
UCOMISS — Unordered Cmp Scalar SPFP, Set EFLAGS ....................................................... 194
UCOMISD — Unordered Cmp Scalar DPFP, Set EFLAGS ..................................................... 194
CMPSB/CMPSW/CMPSD/CMPSQ — Compare String Operands .......................................... 195
CMP — Packed Comparison .......................................................................................................... 196
CMPPS/CMPSS/CMPPD/CMPSD — Floating-Point ............................................................... 196
Packed Compare if Equal to (=) ................................................................................................. 196
Packed Compare if Greater Than or Equal (≥) ........................................... 197
Packed Compare if Greater Than (>) .......................................................................................... 197
Extract Packed Sign Masks ............................................................................................................. 199
PMOVMSKB — Extract Packed Byte (Sign) Mask .................................................................. 199
MOVMSKPS — Extract Packed SPFP Sign Mask .................................................................... 199
MOVMSKPD — Extract Packed DPFP Sign Mask .................................................................. 199
SCAS/SCASB/SCASW/SCASD/SCASQ — Scan String ............................................................. 200
REP SCASx ................................................................................................................................ 200
CMOVcc — Conditional Move ...................................................................................................... 200
CMPXCHG — Compare and Exchange......................................................................................... 202
CMPXCHG8B — Compare and Exchange 64 Bits ................................................................... 202
CMPXCHG16B — Compare and Exchange 128 Bits ............................................................... 203
Boolean Operations upon Floating-Point Numbers ........................................................................ 203
ANDPS — Logical AND of Packed SPFP D = A ∧ B............................................... 204
ANDPD — Logical AND of Packed DPFP ............................................................................... 204
Pseudo Vec — (XMM) FABS — FP Absolute A = | A | ........................................................... 204
Pseudo Vec — (3DNow!) FABS — FP Absolute A = | A | ....................................................... 204
ORPS — Logical OR of Packed SPFP D = A ∨ B ..................................................... 205
ORPD — Logical OR of Packed DPFP ...................................................................................... 205
XORPS — Logical XOR of Packed SPFP D = A ⊕ B .............................................. 205
XORPD — Logical XOR of Packed DPFP ................................................................................ 205
Pseudo Vec — FCHS — FP Change Sign A = – A ................................................................... 205
ANDNPS — Logical ANDC of Packed SPFP D = A ∧ (¬B).................................... 206
ANDNPD — Logical ANDC of Packed DPFP .......................................................................... 206
Min — Minimum ............................................................................................................................ 206
Pseudo Vec.................................................................................................................................. 207
8-Bit Integer Minimum ......................................................................................................... 207
16-Bit Integer Minimum ....................................................................................................... 208
SPFP Minimum...................................................................................................................... 208
SPFP Scalar Minimum ........................................................................................................... 209
DPFP Minimum ...................................................................................................................... 209
DPFP Scalar Minimum ........................................................................................................... 209
Max — Maximum........................................................................................................................... 210
8-Bit Integer Maximum ........................................................................................................ 210
16-Bit Integer Maximum ...................................................................................................... 210
SPFP Maximum ..................................................................................................................... 211
SPFP Scalar Maximum ........................................................................................................... 211
DPFP Maximum ..................................................................................................................... 212
DPFP Scalar Maximum .......................................................................................................... 212
Chapter 10: Branching .................................................................................................................... 213
Overview ......................................................................................................................................... 213
Jump Unconditionally ..................................................................................................................... 213
JMP — Jump............................................................................................................................... 213
Delta JMP.................................................................................................................................... 214
Protected Mode JMP (NEAR) .................................................................................................... 214
Protected Mode JMP (FAR) ....................................................................................................... 214
Jump Conditionally ......................................................................................................................... 216
Jcc — Branching ......................................................................................................................... 216
Branch Prediction............................................................................................................................ 217
Intel Branch Prediction ............................................................................................................... 218
Static Branch Prediction ............................................................................................................. 218
AMD Branch Prediction ............................................................................................................. 219
Branch Optimization ................................................................................................................... 219
PAUSE — (Spin Loop Hint) .......................................................................................................... 221
I-VU-Q ........................................................................................................................................ 222
JECXZ/JCXZ — Jump if ECX/CX Is Zero ................................................................................ 225
LOOPcc....................................................................................................................................... 226
LOOP .......................................................................................................................................... 227
Pancake Memory LIFO Queue ....................................................................................................... 227
Stack ................................................................................................................................................ 227
PUSH — Push Value onto Stack ................................................................................................ 227
POP — Pop Value off Stack ....................................................................................................... 228
PUSHA/PUSHAD — Push All General-Purpose Registers ....................................................... 228
POPA/POPAD — Pop All General-Purpose Registers .............................................................. 229
PUSHFD/PUSHFQ and POPFD/POPFQ ................................................................................... 230
ENTER — Allocate Stack Frame for Procedure ARGS ............................................................ 230
LEAVE — Deallocate Stack Frame of Procedure ARGS .......................................................... 230
CALL Procedure (Function) ........................................................................................................... 231
CALL .......................................................................................................................................... 231
Protected Mode CALL (NEAR) ................................................................................................. 231
Protected Mode CALL (FAR) .................................................................................................... 232
RET/RETF — Return ................................................................................................................. 233
Calling Conventions (Stack Argument Methods) ........................................................................... 234
C Declaration (__CDECL) ......................................................................................................... 234
Standard Declaration (__STDCALL) ......................................................................................... 235
Fast Call Declaration (__FASTCALL) ....................................................................................... 236
Interrupt Handling ........................................................................................................................... 236
INT/INTO — Call Interrupt Procedure ...................................................................................... 236
IRET/IRETD/IRETQ — Interrupt Return .................................................................................. 237
CLI/STI — Clear (Reset)/Set Interrupt Flag .............................................................................. 238
Chapter 11: Branchless ................................................................................................................... 240
Overview ......................................................................................................................................... 240
Function y=ABS(x) 'Absolute' D = | A | ......................................................................................... 240
Function y=MIN(p, q) 'Minimum' .................................................................................................. 241
Function y=MAX(p, q) 'Maximum' ................................................................................................ 242
Graphics 101 — Quick 2D Distance .......................................................................................... 244
Chapter 12: Floating-Point Vector Addition and Subtraction .................................................... 247
Overview ......................................................................................................................................... 247
Floating-Point Vector Addition and Subtraction ............................................................................ 247
Vector Floating-Point Addition .................................................................................................. 247
Vector Floating-Point Addition with Scalar ............................................................................... 248
Vector Floating-Point Subtraction .............................................................................................. 248
Vector Floating-Point Subtraction with Scalar ........................................................................... 249
Vector Floating-Point Reverse Subtraction ................................................................................ 250
Pseudo Vec.................................................................................................................................. 251
Pseudo Vec (x86) ........................................................................................................................ 252
Vector Scalar Addition and Subtraction ......................................................................................... 255
Single-Precision Quad Vector Float Scalar Addition ................................................................. 255
Single-Precision Quad Vector Float Scalar Subtraction ............................................................. 255
Special — FP Vector Addition and Subtraction ............................................................................. 256
Vector Floating-Point Addition and Subtraction ........................................................................ 256
HADDPS/HADDPD/PFACC — Vector Floating-Point Horizontal Addition ........................... 256
HSUBPS/HSUBPD/PFNACC — Vector Floating-Point Horizontal Subtraction ..................... 257
PFPNACC — Vector Floating-Point Horizontal Add/Sub ........................................................ 258
Exercises ......................................................................................................................................... 258
Chapter 13: FP Vector Multiplication and Division...................................................................... 259
Overview ......................................................................................................................................... 259
Floating-Point Multiplication ......................................................................................................... 259
Vector Floating-Point Multiplication.......................................................................................... 259
(Semi-Vector) DPFP Multiplication ........................................................................................... 260
SPFP Scalar Multiplication ......................................................................................................... 260
DPFP Scalar Multiplication ........................................................................................................ 260
Vector (Float) Multiplication — ADD ....................................................................................... 260
Pseudo Vec.................................................................................................................................. 261
Pseudo Vec (x86) ........................................................................................................................ 262
Vector Floating-Point Division ....................................................................................................... 264
(Vector) SPFP Division .............................................................................................................. 264
(Semi-Vector) DPFP Division .................................................................................................... 264
SPFP Scalar Division .................................................................................................................. 265
DPFP Scalar Division ................................................................................................................. 265
N×SPFP Reciprocal .................................................................................................................... 265
1×SPFP Reciprocal (14-Bit) ....................................................................................................... 266
SPFP Reciprocal (2 Stage) (24-Bit) ............................................................................................ 267
Pseudo Vec.................................................................................................................................. 267
Pseudo Vec (x86) ........................................................................................................................ 268
Exercises ......................................................................................................................................... 269
Chapter 14: Floating-Point Deux ................................................................................................... 270
Overview ......................................................................................................................................... 270
SQRT — Square Root .................................................................................................................... 270
1×SPFP Scalar Square Root ....................................................................................................... 271
4×SPFP Square Root .................................................................................................................. 271
1×DPFP Scalar Square Root ....................................................................................................... 272
2×DPFP Square Root .................................................................................................................. 272
1×SPFP Scalar Reciprocal Square Root (15-Bit) ....................................................................... 272
Pseudo Vec.................................................................................................................................. 273
Pseudo Vec (x86) ........................................................................................................................ 273
SPFP Square Root (2 Stage) (24-Bit) ......................................................................................... 274
Vector Square Root ..................................................................................................................... 275
Pseudo Vec.................................................................................................................................. 275
Pseudo Vec (x86) ........................................................................................................................ 275
Graphics 101 — Vector Magnitude (aka 3D Pythagorean Theorem) ........................................ 277
Pseudo Vec.................................................................................................................................. 279
Pseudo Vec (x86) ........................................................................................................................ 279
Chapter 15: Binary-Coded Decimal (BCD) .................................................................................. 281
Overview ......................................................................................................................................... 281
BCD ................................................................................................................................................ 281
DAA — Decimal Adjust AL (After) Addition ........................................................................... 282
DAS — Decimal Adjust AL (After) Subtraction ....................................................................... 282
AAA — ASCII Adjust (After) Addition .................................................................................... 283
AAS — ASCII Adjust AL (After) Subtraction........................................................................... 283
AAM — ASCII Adjust AX (After) Multiplication .................................................................... 283
AAD — ASCII Adjust AX (Before) Division ............................................................................ 284
FBLD — FPU (BCD Load) ........................................................................................................ 284
Graphics 101 ................................................................................................................................... 285
ASCII String to Double-Precision Float ..................................................................................... 286
ASCII to Double ......................................................................................................................... 286
Chapter 16: What CPUID? ............................................................................................................. 289
Overview ......................................................................................................................................... 289
CPUID............................................................................................................................................. 289
Standard CPUID EDX-Feature Flags ......................................................................................... 292
Intel — Standard CPUID ECX-Feature Flags ............................................................................ 293
Intel — Extended #1 CPUID EDX-Feature Flags ...................................................................... 294
AMD — Extended #1 CPUID EDX-Feature Flags .................................................................... 294
PIII Serial License........................................................................................................................... 295
Sample CPU Detection Code .......................................................................................................... 296
x86 CPU Detect — Bit Flags ...................................................................................................... 296
x86 CPU Detect — Vendors ....................................................................................................... 297
Cpu Detect — Information ......................................................................................................... 297
Chapter 17: PC I/O........................................................................................................................... 305
Overview ......................................................................................................................................... 305
IN — Input from Port ..................................................................................................................... 305
Vertical Sync ............................................................................................................................... 305
OUT — Output to Port ................................................................................................................... 306
INSx — Input from Port to String .................................................................................................. 307
OUTSx — Output String to Port .................................................................................................... 307
Serial/Parallel Port for IBM PC .................................................................................................. 308
Parallel Port ................................................................................................................................. 308
Parallel Port Dip Switches .......................................................................................................... 309
Serial Port.................................................................................................................................... 311
Chapter 18: System ......................................................................................................................... 316
System "Lite" .................................................................................................................................. 316
System Timing Instructions ............................................................................................................ 316
RDPMC — Read Performance — Monitoring Counters ........................................................... 316
RDTSC — Read Time-Stamp Counter....................................................................................... 316
Calculating Processor Speed ....................................................................................................... 317
80x86 Architecture...................................................................................................................... 319
CPU Status Registers (32-Bit EFLAGS/64-Bit RFLAGS) ........................................................ 320
Protection Rings .......................................................................................................................... 321
Control Registers ........................................................................................................................ 321
Debug Registers .......................................................................................................................... 323
Cache Manipulation ........................................................................................................................ 323
Cache Sizes ................................................................................................................................. 323
Cache Line Sizes ......................................................................................................................... 323
PREFETCHx — Prefetch Data into Caches................................................................................ 324
LFENCE — Load Fence ............................................................................................................. 325
SFENCE — Store Fence ............................................................................................................. 325
MFENCE — Memory Fence ...................................................................................................... 325
CLFLUSH — Flush Cache Line................................................................................................. 325
INVD — Invalidate Cache (WO/Writeback) ............................................................................. 326
WBINVD — Write Back and Invalidate Cache ......................................................................... 326
System Instructions ......................................................................................................................... 326
ARPL — Adjust Requested Privilege Level .............................................................................. 326
BOUND — Check Array Index For Bounding Error ................................................................. 326
CLTS — Clear Task Switch Flag ............................................................................................... 327
HLT — Halt Processor ............................................................................................................... 327
UD2 — Undefined Instruction.................................................................................................... 327
INVLPG — Invalidate TLB ....................................................................................................... 327
LAR — Load Access Rights....................................................................................................... 327
LOCK — Assert Lock # Signal Prefix ....................................................................................... 328
LSL — Load Segment Limit ...................................................................................................... 328
MOV — Move To/From Control Registers................................................................................ 328
MOV — Move To/From Debug Registers ................................................................................. 328
STMXCSR — Save MXCSR Register State .............................................................................. 329
LDMXCSR — Load MXCSR Register State ............................................................................. 329
SGDT/SIDT — Save Global/Interrupt Descriptor Table ........................................................... 329
LGDT/LIDT — Load Global/Interrupt Descriptor Table .......................................................... 329
SLDT — Save Local Descriptor Table ....................................................................................... 329
LLDT — Load Local Descriptor Table ...................................................................................... 330
SMSW — Save Machine Status Word ....................................................................................... 330
LMSW — Load Machine Status Word ...................................................................................... 330
STR — Save Task Register ........................................................................................................ 330
LTR — Load Task Register ........................................................................................................ 330
RDMSR — Read from Model Specific Register ........................................................................ 331
WRMSR — Write to Model Specific Register........................................................................... 331
SWAPGS — Swap GS Base Register ........................................................................................ 331
SYSCALL — 64-Bit Fast System Call ...................................................................................... 331
SYSRET — Fast Return from 64-Bit Fast System Call ............................................................. 331
SYSENTER — Fast System Call ............................................................................................... 332
SYSEXIT — Fast Return from Fast System Call ....................................................................... 332
RSM — Resume from System Management Mode ................................................................... 332
VERR/VERW — Verify Segment for Reading.......................................................................... 332
LDS/LES/LFS/LGS/LSS — Load Far Pointer ........................................................................... 332
Hyperthreading Instructions............................................................................................................ 335
MONITOR — Monitor ............................................................................................................... 335
MWAIT — Wait ......................................................................................................................... 335
Chapter 19: Gfx 'R' Asm .................................................................................................................. 336
Overview ......................................................................................................................................... 336
Setting Memory .............................................................................................................................. 336
Copying Memory ............................................................................................................................ 341
Speed Freak ..................................................................................................................................... 346
Graphics 101 — Frame Buffer ....................................................................................................... 347
Graphics 101 — Blit ....................................................................................................................... 348
Copy Blit ..................................................................................................................................... 348
Transparent Blit .......................................................................................................................... 349
Graphics 101 — Blit (MMX) ......................................................................................................... 349
Graphics Engine — Sprite Layered ............................................................................................ 349
Graphics Engine — Sprite Overlay ............................................................................................ 350
Graphics 101 — Clipping Blit ........................................................................................................ 352
Chapter 20: MASM vs. NASM vs. TASM vs. WASM ................................................................. 353
Overview ......................................................................................................................................... 353
MASM — Microsoft Macro Assembler ......................................................................................... 354
REPEAT ..................................................................................................................................... 359
WHILE ........................................................................................................................................ 359
FOR ............................................................................................................................................. 359
Compiler Intrinsics.......................................................................................................................... 360
Chapter 21: Debugging Functions ................................................................................................. 362
Overview ......................................................................................................................................... 362
Guidelines of Assembly Development ........................................................................................... 362
Visual C++ ...................................................................................................................................... 362
Tuning and Optimization ................................................................................................................ 365
Exception Handling — AKA: Dang that 1.#QNAN ...................................................................... 365
Exceptions ................................................................................................................................... 365
FPU Versus MMX ...................................................................................................................... 366
Print Output..................................................................................................................................... 366
Float Array Print ......................................................................................................................... 367
Vector Print ................................................................................................................................. 368
Quad Vector Print ....................................................................................................................... 368
Quaternion Print .......................................................................................................................... 368
Matrix Print ................................................................................................................................. 368
Memory Dump ............................................................................................................................ 369
Test Jigs .......................................................................................................................................... 371
Matrix Test Fill ........................................................................................................................... 371
Matrix Splat ................................................................................................................................ 371
Chapter 22: Epilogue ....................................................................................................................... 373
Overview ......................................................................................................................................... 373
Appendix A: Data Structure Definitions ........................................................................................ 374
Appendix B: Mnemonics ................................................................................................................. 376
Mnemonics Part 1 ........................................................................................................................... 376
Appendix C: Reg/Mem Mapping .................................................................................................... 389
Glossary ................................................................................................................................... 391
Alignment Macros .................................................................................................................. 396
Algebraic Laws Used in This Book ................................................................................... 396
References ........................................................................................................................................ 397
Index .............................................................................................................................................. 398
List of Figures ................................................................................................................................... 403
List of Tables ..................................................................................................................................... 406
List of Listings ................................................................................................................................... 408
Preface —(or, So Why Did He Write Yet Another Book?)
Yet another book? Well, actually I am just one of the other ex-Atari employees who was a member of
Alan Kay's group, Atari Corporate Research, and has published books.
For those of you who have been following my eventful career, you already know that this is actually my third
published book.
Just to bring you up to date, my first book was titled The Death and Rebirth of the x86 Assembly Language
Programmer and was originally written between July 1997 and June 1998. It also went under the title of x86
Assembly Language Optimization in Computer Games. Its timing was perfect as SSE code named Katmai by
Intel was just about to come out. I could not find a publisher willing to sign as they "could not sell their last
80x86 assembly language book," and so I put it aside and went back to school to work on my next degree at
Oregon State University. (Go Beavers!)
Two years later I came up with a SIMD (vector) assembly language book idea. Having learned my lesson, I
sold the idea to a publisher first (Wordware Publishing) and Vector Game Math Processors became my first
published book. It probably should have included 80x86, Mips, and PowerPC in the title, as that would have
helped to increase its sales. It seems most people stumble across it.
Everybody who read it and contacted me seemed to like it. It was more of a niche book as it primarily taught
how to vectorize code to take advantage of the vector instruction sets of processors only recently introduced
into offices and homes. These contain functionality that was once confined to the domain of the super
computer. Vector Game Math Processors discusses pseudo vector code that actually emulated vector
processors using generic optimized C, as well as heavy utilization of the following instruction sets:
80x86 (MMX, SSE(2), 3DNow! Professional)
PowerPC and AltiVec
MIPS (MIPS-3D, MIPS V, etc.)
While writing it I wrote a chapter on vertex shaders and sold the idea to my publisher. Thus my second book
published was Learn Vertex and Pixel Shader Programming with DirectX 9.
And now we have come full circle, as this book is actually the completion of my original 80x86 book, The
Death… But I digress. The title was too long and not really to the point. So now that book has been completed,
updated, and designed to be the prequel to my vector book. And thank you for buying both (you did, did you
not?) as you are reading one of them now!
The problem with writing "yet another book" is that one tends to use the best jokes or favorite stories in the first
one. Well, I have so many from over the years, especially since I have eight kids (with the same wife), that I
have saved some for this book. But relax, the worst stories are lying on the editing room floor in order to
make this book more academic friendly.
One (almost) last item. Keep an eye out for my not-so-technical book, Programming Pyrite — The Fool's Gold
of Programming Video Games. It is meant not to be a programming book, but a book about the trials and
tribulations of becoming a video game programmer. It is meant to be a satirical peek into this world, and is
based on my experiences as well as the good and bad stories that I have encountered over the years.
I wish to thank those who have contributed information, hints, testing time, etc., for this book. Paul Stapley for
some recommendations and technical overview recommendations; Peter Jefferies and John Swiderski for
some code sample testing; Ken Mayfield for some 3D computer art donations; John Romero for his technical
check and tip recommendations; and some of those others that I have neglected to mention here for their
contributions.
And most of all to my wife for not balking too much after she thought I had finished writing for good after
finishing my second published book and then noticed within a couple weeks that I was working on this book.
"Revising your vector book for a second edition?" Of course she did not seem to notice (or mention) the
purchases of new motherboards and processors as they were hidden inside my nice super quiet Sonata
computer equipment.
I finally wish to thank Wes Beckwith from Wordware Publishing, Inc., for seeing the niche that this book would
fill and convincing his company to publish yet another of my books.
So get up from that floor or chair in the bookstore in which you are currently reading this book; you know you
will need this book for work or school. Besides, I filled it with so much stuff you might as well stop copying the
stuff into that little notebook, grab a second copy for use at home, walk over to that check stand, and buy them
both. Tell your friends how great the book is so they will buy a copy too! Insist to your employer that the
technical book library needs a few copies as well. This book is a tutorial and a reference manual.
My eight children and my outnumbered domestic engineering wife will be thankful that we will be able to afford
school clothes as well as Christmas presents this year! Despite what the title of Steve Martin's remake of the
movie Cheaper by the Dozen suggests, kids are not cheaper by the dozen! They eat us out of house and home! (Fortunately for me, my
kids are typically not quite as rambunctious as his!)
To check for any updates or code supplements to any of my books, check out my web site at
http://www.leiterman.com/books.html.
My brother Ranger Robert Leiterman is a writer of mystery-related nature books that cover diverse topics such
as natural resources, as well as his Big Foot Mystery series. Buy his books too! Especially buy them if you are
a game designer and interested in cryptozoology or natural resources or just have kids. If it were not for his
sending me his books to help proofread, I probably would not have started writing my own books as well.
Watch out for books from Thomas, yet another of my brothers who has joined the Leiterman brothers' book
writing club. That leaves my three remaining brothers yet to join!
(Now if we can only get our kids to read our books! Something about needing more pictures or not being so
technical….)
Chapter 1: Introduction
Overview
When the processor manufacturer Intel is mentioned, two 64-bit processors come to mind: EM64T and the
Itanium. For AMD: the AMD64. Non-80x86 manufacturers discovered years ago that competing against an
established desktop market is difficult to impossible. The key to successful market injection is being able to run
a large quantity of pre-existing applications. Intel and AMD have built their business upon this by periodically
creating superset instruction sets for their 80x86 processors so that pre-existing software still runs on the new
equipment and new software can be written for the new equipment.
The technology has been forked into two 64-bit paths. One uses the Itanium-based platform with a new 64-bit
primary instruction set that belongs to the IA-64 family. The other is a superset to the IA-32, referred to as the
Extended Memory 64 Technology (EM64T). Newer P4 and Xeon processors are of this alternate type.
This book targets the AMD32/64, IA-32, and EM64T processor technology. It is not written for the Itanium
series. (Look for a future book to cover the Itanium processor.) The EM64T supports a new superset
instruction set, SSE-3, and 64-bit extensions to the IA-32 general-purpose instruction set. It also allows 64-bit
operating systems such as Windows XP Professional x64 and Windows Server 2003 x64 editions to run both
64-bit and 32-bit software on the same machine.
This book can be used for both 32-bit and 64-bit instruction sets, but there is an operating system dependency
on which kinds of applications can be run, summarized in the following table.
Operating System          32-bit applications    64-bit applications
Win9X (32-bit)            X
WinXP (32-bit)            X
Win2K (32-bit)            X
Win2003 (32-bit)          X
XP – X64 (64-bit)         X                      X
Win Server 2003 X64       X                      X
The 80x86 processor has joined the domain of the supercomputer since the introduction of SIMD (single
instruction, multiple data) extensions: SSE in Intel's Pentium III (used in the Xbox) and in later x86 processors
such as the Pentium 4, and AMD's 3DNow! extension instructions used in PCs. And now they are available in
64-bit form. Both fixed-point (inclusive of integer) and floating-point math are being used by the computer,
video gaming, and embedded worlds in assembly and vector-based operations.
3D graphic rendering hardware has been going through major increases in the numbers of polygons that can
be handled by using geometry engines as part of their rendering hardware to accelerate the speed of
mathematical calculations. There is also the recent introduction of the programmable vertex and pixel shaders
built into newer video cards that use this same vector functionality. (This is another type of assembly language
programming. For more information on shaders read my book Learn Vertex and Pixel Shader Programming
with DirectX 9.) These work well for rendering polygons with textures, depth ordering Z-buffers or W-buffers,
and translucency controlled alpha channels with lighting, perspective correction, etc., at relatively high rates of
speed. The problem is that the burden of all the other 3D processing, culling, transformations, rotations, etc.,
are put on the computer's central processing unit (CPU), which is needed for artificial intelligence (AI), terrain
following, landscape management, property management, sound, etc. Fortunately for most programmers, a
continuous growth market of middle-ware providers is developing key building blocks such as the Unreal 3D
rendering libraries and physics packages such as Havok. Whether you are looking to become employed by
these companies and generate this technology or merely one who wishes to use these libraries, you should
keep in mind that the introduction of new hardware technology has created a surplus of CPU processor power
that can now be used to fulfill aspects of your programming projects as well as develop new technologies. All
of this creates openings for programmers needing to write assembly language, whether using a scalar or
parallel architecture.
There are perhaps only two reasons for writing code in assembly language: writing low-level kernels in
operating systems and writing high-speed optimized critical code. A vector processor can be given sequences
and arrays of calculations to perform to enhance the performance above that of scalar operations that high-
level compilers typically generate during a compile.
Hint Check out the following web site for additional information, code, links, etc., related to this
book: http://www.leiterman.com/books.html.
There are exceptions to this, as some vectorizing compilers do exist but have not yet been adopted into the
mainstream marketplace. These are well worth investigating if you are in need of high-level C code that takes
advantage of SIMD instruction sets.
One other item to keep in mind is that if you understand this information, it may be easier for you to get a job in
the game or embedded software development industry. This is because you will have enhanced your
programming foundation and possibly have a leg up on your competition. Even if you rarely program in 80x86
assembly language, peeking at the disassembly output of your high-level compiler while debugging your
application can give you insight into code bloat due to your coding methodology and you will better be able to
resolve some of the weird bugs you encounter in your applications.
Goal A better understanding of 80x86 assembly.
I know how a number of you like technical books to be like a resource bible, but I hate for assembly books (no
matter how detailed) to be arranged in that fashion, because:
1. It takes me too long to find what I am looking for!
2. They almost always put me to sleep!
This book is not arranged like a bible, but it contains the same information. By using the instruction mnemonic
lookup in Appendix B, it becomes an abstracted bible. It is instead arranged in chapters of functionality. If you
want that bible-like alpha-sorted organization, just look at the index or Appendix B of this book, scan for the
instruction you are looking for, and turn to the page.
Info Appendix B is the master instruction index listing and what processors support it.
I program multiple processors in assembly and occasionally have to reach for a book to look up the correct
mnemonic. Quite often my own books! Manufacturers almost always seem to camouflage those needed
instructions. As an example, mnemonics for shifting versus rotating can be located all over the place in a book.
For example, in the 80x86, {psllw, pslld, psllq, …, shld, shr, shrd} are mild cases due to the closeness of their
spellings, but for Boolean bit logic, {and…or, pand…xor} are all over the place in an alphabetical arrangement.
When grouped in chapters of functionality, however, one merely turns to the chapter related to what
functionality is required and then leafs through the pages. For these examples, merely turn to Chapter 4, "Bit
Mangling" or Chapter 5, "Bit Wrangling." Okay, okay, so I had a little fun with the chapter titles, but there is no
having to wade through pages of extra information trying to find what you are looking for. In addition (not
meant to be a pun), there are practical examples near the descriptions as well as in Chapter 19, which are
even more helpful in jogging your memory as to an instruction's usage. Even the companion code for this book
uses this same orientation.
The examples are for the 80x86. I tried to minimize printed computer code as much as possible so that the
pages of the book do not turn into a mere source code listing! Hopefully I did not overtrim and make it seem
confusing. If that occurs, merely open your source code editor or integrated development environment (IDE) to
the chapter and project in the accompanying code related to that point in the book you are trying to
understand. By the way, if you find a discrepancy between the code and the book, you should favor the code
as the code in the book was cut and pasted from elements of code that could be lost during the editing
process.
The book is also written in a friendly style so as to occasionally be amusing and thus help you in remembering
the information over a longer period of time. What good is a technical book that is purely mathematical in
nature, difficult to extract any information from, and just puts you (I mean me) to sleep? You would most likely
have to reread the information again once you woke up! The idea is that you should be able to sit down in a
comfortable setting and read the book cover to cover to get a global overview. (My favorite place to read is in a
lawn chair on the back patio with a wireless laptop.) Then go back to your computer and, using the book as a
tool, implement what you need or cut and paste into your code. But use at your own risk! You should use this
book as an appendix to more in-depth technical information to gain an understanding of that information.
An attempt was made to layer the information so you would be able to master the information at your skill
level. In regard to cutting and pasting: You will find portions of this book also occur inside one of my other
published books: Vector Game Math Processors. There is a degree of overlap, but this book is to be
considered the prequel and a foundation for that book. Any duplication of information between the two has
been enhanced in this book as it is now almost three years later and the technology has been extended.
The code is broken down by platform, chapter, and project, but most of the code has not been optimized. This
is explained later but briefly, optimized code is difficult to read and understand. For that reason, I tried to keep
this book as clear and as readable as possible. Code optimizers such as Intel's VTune program are available
for purposes of optimization.
This book, as mentioned, is divided into chapters of functionality. It is related to the use of learning to write
80x86 assembly language for games, or embedded and scientific applications. (Except for writing a binary-
coded decimal (BCD) package, there is not a big need for assembly language in pure business applications.)
Now graphically or statistically oriented, that is another matter. With that in mind, you will learn from this book:
Adapted coding standards that this book recommends
Bit manipulations and movement
Converting data from one form into another
Addition/subtraction (integer/floating-point/BCD)
Multiplication/division (integer/floating-point)
Special functions
(Some) trigonometric functionality
Branching and branchless coding
Some vector foundations
Debugging
It is very important to write functions in a high-level language such as C code before rewriting in assembly. Do
not write code destined for assembly code using the C++ programming language because you will have to
untangle it later. Assembly language is designed for low-level development and C++ for high-level object-
oriented development using inheritance, name mangling, and other levels of abstraction, which makes the
code harder to simplify. There is of course no reason why you would not wrap your assembly code with C++
functions or libraries. But I strongly recommend you debug your assembly language function before locking it
away in a static or dynamic library, as debugging it will only get harder there. Writing the C version first allows
the algorithm to be debugged and mathematical patterns to be identified before the assembly version is written.
In addition, the results of both implementations can be compared to verify that they are identical and thus that
the assembly code is functioning as expected.
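As a minimal sketch of that comparison step (the function names vmp_DotProductC and vmp_DotProductASM are hypothetical stand-ins for a C reference routine and its assembly rewrite):

#include <assert.h>
#include <math.h>

// C reference version -- written and debugged first
static float vmp_DotProductC(const float * const pA, const float * const pB)
{
    return (pA[0]*pB[0]) + (pA[1]*pB[1]) + (pA[2]*pB[2]);
}

// Assembly rewrite, linked in from a .asm module (hypothetical name)
extern float vmp_DotProductASM(const float * const pA, const float * const pB);

void TestDotProduct(void)
{
    const float vA[3] = {1.0f, 2.0f, 3.0f};
    const float vB[3] = {4.0f, 5.0f, 6.0f};

    // Both versions must agree (within floating-point tolerance)
    assert( fabsf(vmp_DotProductC(vA, vB) - vmp_DotProductASM(vA, vB)) < 0.000001f );
}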
Tip Sometimes examining compiler output can give insight into writing optimized code. (That
means peeking at the disassembly view while debugging your application.)
Chapter 2: Coding Standards
Overview
I am not going to bore you to death with coding standards, as there are a multitude of books on the subject
already and that is mostly not what this book is about. To become a better programmer, however, you should
adopt and rigidly follow a set of coding standards.
I just said I was not going to bore you with coding standards, as this book is about assembly language
programming, and this chapter is titled "Coding Standards"? What does one have to do with the other? One
actually supports the other. You should have a proper development environment before attempting to write any
assembly code in an attempt to make that development project better. If you have read this chapter in my
vector book, you can skip to the next chapter, but I would recommend reading it anyway as a refresher. It also
contains a bit more information than what was in the vector book.
The source code for this book uses terms such as const and assert, as well as naming conventions, which
should be a standard that you adopt for your own style of programming, if you have not already. The following
coding methods cost very little, only a little extra time to drop them into your code, but they will save you time
in the long run, especially when dealing with general-purpose or SIMD assembly code. You can think of them
as a couple extra tools for your software development toolbox. They primarily entail the use of const and
assert as well as memory alignment. It should be noted that due to page count limitations, they are mostly only
used in the first few functions in print and then infrequently. This is not an indicator that they are not needed,
only that I did not want the pages in the book to appear to be too verbose in printed computer code. You will
find that those equivalent functions will be fully loaded with the use of const and the appropriate assertions.
Constants
To put it simply, the const can essentially be thought of as a write protection of sorts.
void voodoo(const Foo * const pInfo);
In this example the contents of the data structure (first const) and the pointer to the structure (second const)
cannot be altered. This function merely reads information contained within that structure and does not alter it
or its pointer in any fashion. The const guarantees it! Of course that does not stop the programmer from
casting the pointer to a const-less pointer and then modifying the data. But that would be bad!
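To illustrate the kind of cast being warned about (Foo here is just a placeholder structure), this is what defeats the guarantee and should be avoided:

typedef struct { int count; } Foo;      // placeholder structure

void voodoo(const Foo * const pInfo)
{
    Foo *pEvil = (Foo *)pInfo;          // casts the const away
    pEvil->count = 0;                   // ...and modifies the "read-only" data. Bad!
}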
The placement of the second const protecting the pointer is very specific, but the placement of the first const is
a little more liberal. It can occur before or after the data type declaration such as in the following:
void voodoo(const Foo * const pInfo);
void voodoo( Foo const * const pInfo);
Tip When working with C++ all data members — private, protected or public — can be protected
with the inclusion of an appended const to the prototype and declaration.
class foo
{
private:
    int a;

public:
    int b;

    void voodoo(const byte * const pFoo) const;
};
Data Alignment
Processors work most efficiently with data that is properly aligned. In the case of the SSE or later instruction
sets, a misaligned 128-bit load is not one load but two. Processors are designed for efficient operation, so
internally data is never loaded misaligned; it is loaded 128-bit aligned, and in the case of SSE the misalignment
is corrected by shifting and merging the halves of two 128-bit loads to produce the requested 128 bits of data.
This misalignment is very inefficient and time consuming! It means that instead of loading only the needed 16
bytes, the processor loads 32 bytes.
The first item on the agenda is the alignment of data values. Pointers are typically 4 bytes on a 32-bit
processor; 64-bit data requires 8-byte alignment; 128-bit data requires 16-byte alignment.
#define ALIGN2(len) (((len) + 1) & ~1) // round up to 2 items
#define ALIGN4(len) (((len) + 3) & ~3) // round up to 4 items
#define ALIGN8(len) (((len) + 7) & ~7) // round up to 8 items
#define ALIGN16(len) (((len) + 15) & ~15) // round up to 16 items
These can easily be used to align bytes (or implied ×8 bits). So to align to 16 bytes:
nWidth = ALIGN16(nWidth); // 128-bit alignment!
Some of you may note that the basic equation of these macros:
(A, X) (((A) + (X)) & ~(X))
relies on the block size being 2^N (so that X = 2^N - 1 forms a clean bit mask) and a logical AND can be taken
advantage of, and that it could possibly be replaced with a modulus form:
(A, X) (((A) + (X)) - (((A) + (X)) % ((X) + 1)))
which reads more like the arithmetic it performs. But that would be giving too much credit to the C compiler, as
some will do a division for the modulus while others will see the binary mask and take advantage with a mere
logical AND operation. Even though the modulus form is clearer, it may not be compiled into fast code. If you
are not sure what your compiler does, merely set a breakpoint at a macro, then either expand the macro or
display mixed C and assembly code. The division or logical AND will be right near where your instruction
pointer (IP) is pointing within the code.
This is a little obscure and typically unknown by non-console developers, but CD sector size alignment is
needed for all files destined to be loaded directly from a CD or DVD as they are typically loaded by number of
sectors rather than number of bytes and this is typically 2048 or 2336 bytes in size. All these require some sort
of alignment correction jig.
// round up 1 CD sector
#define ALIGN2048(len) (((len) + 2047) & ~2047)
Sometimes CD sectors are not 2048 bytes but 2336. Since 2336 is not a power of two (2^N), simple bit
masking will not work and a modulus (%) must be used instead.
#define ALIGN2336(len) (((len) + 2335) - (((len) + 2335) % 2336))
A quick test for whether the masking trick applies: subtract 1 from the block size and AND it with the original
size. If only one bit is set in the size (a power of two), the result of the AND is 0; if more than one bit is set,
the result is non-zero and the simple bit mask cannot be used.
The requested size is stretched to the appropriate sized block. This really comes in handy when building
dedicated relational databases in tools for use in games.
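For example (the numbers and variable names are only illustrative), rounding a raw file length up to whole 2048-byte sectors before issuing a sector-based read:

#define ALIGN2048(len) (((len) + 2047) & ~2047)    // round up 1 CD sector (as above)

void PlanSectorRead(void)
{
    unsigned long nFileLen = 123456;               // actual byte count of the file
    unsigned long nReadLen = ALIGN2048(nFileLen);  // 124928 bytes
    unsigned long nSectors = nReadLen / 2048;      // 61 whole sectors to request
    (void)nSectors;                                // keep the compiler quiet
}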
Goal Ensure properly aligned data.
I have had several incidents over the years with compilers and misaligned data. Calls to the new operator or
the C function malloc() returned memory on a 4-byte alignment, but when working with 64-bit MMX or some
128-bit SSE instructions there would be unaligned memory stall problems. Some instructions cause a
misalignment exception, while others just cause memory stalls. The 80x86 is more forgiving than other
processors as its general-purpose memory accesses can be misaligned without causing a fault (though often at
a cost in speed), but there are SIMD instructions that require proper memory alignment or an exception will
occur. Thus it is always best to ensure memory is
always properly aligned. The PowerPC and MIPS processors require that memory be properly aligned. For
cross-platform portability, it is very important to ensure that your data is properly aligned at all times whether
you know your application will be ported or not.
The 80x86 has an alignment check flag in the CR0 register that can be enabled to verify all memory is aligned
properly. (Use with caution unless you are writing your own board support package.)
The first half of the remedy is easy. Just make sure your vectors are a data type with a block size set to 16
bytes and that your compiler is set to 16-byte alignment and not 1-byte aligned or the default, even if using 64-
bit MMX-based instructions. The smart move is to ensure that all data structures are padded so they will still
be aligned properly even if the compiler alignment gets set to 1 byte. This will make your life easier, especially
if code is ever ported to other platforms, especially UNIX. This is a safety factor. Normally one would manually
pack the data elements by their size to ensure proper alignment and insert (future) pad bytes where
appropriate, but by adjusting the alignment in the compiler you can ensure that the ported applications using
different compilers will export proper data files and network messages.
Notice the "Struct member alignment" field in the following property page for Project Settings in Visual Studio
version 6 and Visual C++.NET. The default is 8 bytes, which is denoted by the asterisk, but 16 bytes is
required for vector programming.
Figure 2-1: Visual C++ (version 6) memory alignment property page
Figure 2-2: Visual C++ .NET (version 7) memory alignment property page
You should get into the habit of always setting your memory alignment to 16 for any new project. It will help
your application even if it uses scalar and no SIMD-based instructions.
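If you prefer to control this in source rather than in the project settings, the same struct member alignment can be requested with the compiler's pack pragma; a sketch for Visual C++ (not taken from the book's companion code):

#pragma pack(push, 16)        // struct member alignment = 16 bytes (same as /Zp16)

typedef struct
{
    float x, y, z, pad;       // padded out to a full 16 bytes
} AlignedVec;

#pragma pack(pop)             // restore the previous packing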
This is only one example of aligning memory. Padding the buffer with fixed amounts so that sizes and
addresses come out to a multiple of 16 will correct the alignment and improve processing speed as well.
I realize that padding or aligning memory to 16 bytes initially appears to be crude, but it delivers the
functionality you require through the use of pointers, and it is cross-platform compatible.
3D Vector (Floating-Point)
typedef struct
{
    float x;
    float y;
    float z;
} vmp3DVector;
// Three 96-bit vectors aligned to 128 bits each, thus four floats each
// so 3 × 4 = 12 bytes, but add three extra floats (+3) to handle a
// possible misaligned offset of {4, 8, 12}. Once the first is
// aligned, all other 4-byte blocks will be aligned as well!
Of course if you are dealing with quad vectors, then align the first one. All the others, which are the same data
type and are already 16 bytes in size, will automatically be aligned.
The same applies for 4×4 matrices. The following is just a quick and dirty demonstration of aligned three-
element vector data structures.
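The aligned allocation itself is not printed here; a minimal sketch of one way to obtain aligned vector pointers (the static buffer and helper function are illustrative, not the book's actual code):

#include <stdint.h>                      // uintptr_t

// Room for three padded vectors (4 floats each) plus 3 floats of slack
// so the first slot can be snapped up to a 16-byte boundary.
static float rawBuf[(3 * 4) + 3];

static vmp3DVector *AlignedVecSlot(unsigned n)
{
    uintptr_t base = ((uintptr_t)rawBuf + 15) & ~(uintptr_t)15;  // round up to 16
    return (vmp3DVector *)(base + ((uintptr_t)n * 16));          // 16 bytes per slot
}

The pointers pvA and pvB used below would then come from AlignedVecSlot(0) and AlignedVecSlot(1).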
// Copy vectors to aligned memory
pvA->x=vA.x; pvA->y=vA.y; pvA->z=vA.z;
pvB->x=vB.x; pvB->y=vB.y; pvB->z=vB.z;
vmp_CrossProduct0(&vD, &vA, &vB);
Note the convention of the appended zero used by vmp_CrossProduct0 versus vmp_CrossProduct. The zero
denotes that the data passed to the function is not guaranteed to be aligned such that (address mod 16) has a
zero remainder.
Another item to keep in mind is that a vector is 12 bytes in length (as it is made up of three floats, and a float is
four bytes in size), but it is being read/written as 16 bytes on a processor with a 128-bit data width. The extra
4-byte float must be preserved. If the trick of 128-bit memory allocation is utilized, then an out of bounds error
will not occur since the data is being advanced in 16-byte blocks. This fourth float is scratch data and as such
is not harmful. (No past end of buffer access!)
There are always exceptions to the rule, and that occurs here as well. The compiler for Motorola's AltiVec
instruction set, typically found in Macintosh PowerPC computers, uses the following local stack declaration:
void Foo(void)
{
    vector float vD, vA, vB;
}
Trivia The PowerPC's AltiVec SIMD instructions never have an alignment exception as the four
lower address bits, A0…3, are always forced to zero. So if your memory is misaligned, your
data will not be. But it definitely will not be where you expected it!
The following vector declaration automatically aligns the data to a 16-byte boundary. The GNU C compiler
(GCC) accepts the following definition:
typedef float FVECTOR[4] __attribute__((aligned (16)));

void Foo(void)
{
    FVECTOR vD, vA, vB;
}
I am sorry to say that there is only one method for getting a 16-byte aligned stack frame of data within Visual
C++ for the 80x86-based Win32 environment, and unfortunately it only works with Visual C++ .NET (version
7.0) or higher, or with version 6 and the processor pack. The following is a snippet from a DirectX
header file d3dx8math.h:
#if _MSC_VER >= 1300 // Visual C++ ver. 7
#define _ALIGN_16 __declspec(align(16))
#else
#define _ALIGN_16 // Earlier compiler may not understand
#endif // this; do nothing.
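Applied to local declarations, the macro is used along these lines (a sketch with plain float arrays standing in for whatever vector type you use):

void Foo(void)
{
    float vA[4];                    // alignment is whatever the stack happens to give
    _ALIGN_16 float vB[4], vC[4];   // forced onto a 16-byte boundary
}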
The alignment of vA cannot be guaranteed, but vB and vC are aligned on a 16-byte boundary. Codeplay's
Vector C and Intel's C++ compilers also support this declaration.
There is, however, the Macro Assembler (MASM), which has the following:
align 16
Another item that can easily be implemented from within assembly language is a stack correction for 16-byte
memory alignment. The stack pointer works by moving down through memory while adding items to the stack,
so by using a secondary stack frame pointer the stack can be corrected.
push ebx
mov ebx,esp           ; ebx preserves the incoming stack frame (passed arguments)
and esp,0fffffff0h    ; snap esp down to a 16-byte boundary
; ... function body: locals placed below esp are now 16-byte aligned ...
mov esp,ebx           ; restore the original stack pointer
pop ebx
The line of assembly {and esp} actually snaps the stack pointer down the stack to a 16-byte boundary; thus the
next local stack argument will be 16-byte aligned, e.g., [esp-16], [esp-32], etc.
Compiler Data
Recent Visual C++ and Intel compilers have incorporated SIMD functionality, such as the allocation of SIMD
registers and 16-byte-aligned stack memory:
push ebx
mov ebx,esp
and esp,0FFFFFFF0h                       ; 16-byte alignment
movaps xmm1,xmmword ptr [esp-10h]        ; ta
Using this same implementation, a bit of magic using unions can be utilized:
#define MY_FLT_ARY 100

union
{
    __m128 t[ MY_FLT_ARY/4 ];    // __m128 is declared in <xmmintrin.h>
    float  ary[ MY_FLT_ARY ];
};
The __m128 member aligns the stack memory; the float array, being the same total size, takes advantage of
that alignment and maps to the same location compliments of the nameless (anonymous) union declaration.
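As a usage sketch (a named union instance is used here for portability, the SSE intrinsics come from <xmmintrin.h>, and the function is illustrative rather than from the companion code):

#include <xmmintrin.h>

#define MY_FLT_ARY 100

void ScaleByTwo(void)
{
    union
    {
        __m128 t[ MY_FLT_ARY/4 ];
        float  ary[ MY_FLT_ARY ];
    } u;

    for (int i = 0; i < MY_FLT_ARY; i++)       // fill through the scalar view
        u.ary[i] = (float)i;

    const __m128 two = _mm_set1_ps(2.0f);
    for (int i = 0; i < MY_FLT_ARY/4; i++)     // process through the aligned SIMD view
        u.t[i] = _mm_mul_ps(u.t[i], two);
}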
There are other methodologies, but these should get you rolling for now.
Tip Use assertions in your code to trap errors early in the development process.
Assertions
Assembly code can get confusing enough without having to debug it just to find that misaligned memory was
the problem. Thus, a code development tool used within the source code for this book is the use of an
assertion. Typically, the top of your file should include the following header file declaration:
#include <assert.h>
Sorry to bore those of you who religiously use these! An assertion is essentially a test for a "not true thus false"
condition. That is, if the condition is false, then an assertion will occur whereby the instruction pointer is halted
at the offending line of code. This is very similar to the following:
if (!(3==a))       // assert(3==a)
{
    // System Break
    DebugBreak();  // Win32, Xbox
}
As mentioned, the condition must fail to halt the code. Here we wanted the variable a to be assigned the value
of 3 to be successful; otherwise it would halt the system. There are many philosophies on the subject about
assertions. In my case, I believe in four types of assertions:
Fatal – Debug
Fatal – Release
Non-fatal – Debug
Non-fatal – Release
A Fatal – Debug is a programmer error. This is something that should never, ever, ever occur, such as passing
a NULL, the address of a pointer, an obviously bad pointer, an out-of-range argument to a function, or a
misaligned vector or matrix. It halts the debug version of the code and forces the programmer to fix it right
away. This is the one assertion that should occur most of the time in your debug code while it is being
developed, but never in the shipped code.
A Fatal – Release is an unrecoverable error in a shipped version of the application. The computer is totally out
of memory so there is not even enough memory to display an error message. Catastrophic error. This should
preferably never be used. There are even ways in an application to stop the application logic, jettison memory,
and put up your "Sorry, I am dead! Call your Customer Support Person" screen. You do not want your
customers to ever get the Blue Screen of Death! (Windows users know what I am talking about!)
These last two — Non-fatal – Debug and Non-fatal – Release — are recoverable errors. That is, they are
errors that can be worked around without killing the application. These errors typically involve being unable to load a resource from the media, a missing or corrupted file, missing sound resources, etc. Your application can be
crippled in such a way as to allow it to progress. Resources such as art can use placeholders when they can't be
loaded and allocated. Sounds that aren't loaded don't get played. The idea is to try to keep the application
running. Even during development of the application you may want to alert the developer that table sizes do
not match the number of defined indexes for that table. That sort of thing. The idea is to not kill the running
application but to annoy the programmer enough to get him to fix the problem. I like that so much I am going to
state that again!
Tip The idea of non-fatal assertions is to not kill the running application but to annoy the
programmer enough to get him to fix the problem.
On a personal note: Please do not get annoyed and then turn off the assertions at the source instead of fixing
the real problem. (I once worked with a programmer who did just that! It caused lots of Homer Simpson type
errors to creep into the code that would have been caught immediately otherwise. DOH!) On the other hand, a
non-technical supervisor once threatened to chop off my fingers because the assertions in the code to catch
errors were doing just that and thus breaking the build! A better method is to spend a couple minutes now
catching the errors early in the development cycle rather than a long time during crunch mode at the tail end of
the project trying to track them down, or never knowing about them until after the product has shipped.
For purposes of this book, only a debug assertion will be utilized. In essence this means if an application were
to be defined as a debug model, then the assertion would exist in the code, making the code run a little slower
but considerably safer. In a release model, it is stubbed out, thus does not exist, and your code will magically
run a bit faster. In the code for this book you will find the following assertions in use for the generic assertion
ASSERT():
Listing 2-1: Assertion type definitions
You would merely insert your custom assertion.
uint nActor = 5;
ASSERT(nActor < MAX_ACTOR);
The pointer assertion ASSERT_PTR4() does two things: The first is to guarantee that the pointer is not NULL. The second is to check pointer alignment: memory pointers must (or at least should, depending on the processor and its exception settings) reference memory that is at least 4-byte aligned.
Even if only referencing strings, it should be recognized that if they are at least 4-byte aligned, algorithms can
be made more efficient when dealing with those string components in parallel.
void *pApe;
ASSERT_PTR4(pApe);
This is a good way to ensure that integers and single-precision floating-point values are properly aligned.
The pointer assertion ASSERT_PTR16() is virtually the same, except that it guarantees that the pointer is not
NULL and is referencing data on a 16-byte alignment, which is necessary for vector math processing. This is
what this part of the book is all about!
ASSERT_PTR16(pApe);
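At its core, such a pointer check is nothing more than a test of the low address bits. Here is a minimal sketch of what an ASSERT_PTR16-style check boils down to (the register usage and labels are assumptions):
test eax,eax ; eax holds the pointer being validated
jz bad_ptr ; a NULL pointer is rejected
test eax,0fh ; are any of the four lowest address bits set?
jz ptr_ok ; no: the pointer is 16-byte aligned
bad_ptr:
int 3 ; break into the debugger, assertion style
ptr_ok: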
Different processors have different behaviors, which can camouflage such problems. Asserting on a data reference that is not 16-byte aligned finds these problems fast. The last two assertions deal with values of zero or negative single-precision floating-point numbers.
float f;
ASSERT_FNEG(f); // Assert if a negative
ASSERT_FZERO(f); // Assert if absolute zero
One last assertion topic would be the utilization of strings with variable arguments within your assertions as an
additional type of assertion. The trick here though is that assertion macros do not like variable arguments, so
you would need something in the assertion name to represent the number of string arguments, such as the
following:
ASSERT1(a == b, "So BOOM because %u", nDucks);
Memory Systems
Another code development tool is to do a memory allocation check for any compiled code to verify that the
memory manager is indeed allocating properly aligned memory for the superset single instruction multiple data
(SIMD) instruction sets. You can test this by executing a simple algorithm such as the following one.
bSet >>= 1; // reduce by a bit
bTst = bSet - 1;
bMsk = ~0 ^ bTst;
} while(1);
}
delete [] pRaw;
//free(pTbl[n]);
}
Please note that it loops up to 4096 times, slowly increasing the size of the allocation just in case the memory
manager issues properly aligned memory for a short period of time before allocating any that might be skewed.
Also, you will most likely get a memory low message but that is okay; you are only allocating about 8 MB or so.
If everything is fine, there will be a match between your processor and the following table.
Table 2-1: SIMD instruction set with data width in bits and bytes
Instruction Set              Bits     Bytes
AMD 3DNow!                   64       8
AMD 3DNow! Extensions        64       8
AMD 3DNow! MMX Extensions    64       8
AMD 3DNow! Professional      64/128   8/16
MMX                          64       8
SSE                          128      16
SSE2                         128      16
SSE3                         128      16
If there is a mismatch, then you have an alignment error problem. This can be rectified by using memory
allocation code similar to that in Listing 2-4. This function is designed to wrap the standard function call to
malloc() or new[]. Do not forget to add the assertion as a good programming practice.
Memory Header
The following header is hidden at the true base of memory allocated by our function. Basically, memory is
slightly overallocated. The malloc function is in essence routed to the correct core allocation function.
Listing 2-3: …\chap02\ram\ram.cpp
ASSERT_ZERO(nReqSize);
//pMem = (byte*)malloc(nSize);
pMem = new byte[ nSize ];
pHead = (RamHead *)pMem;
if (NULL==pMem)
{ // Allocation error
}
else
{ // Save Req Size
pHead->nReqSize = nReqSize + sizeof(RamHead);
pHead->extra[0] = 1;
pHead->extra[1] = 2;
pHead->extra[2] = 3;
// Align by adj header +4 to +16 bytes
How this works is that it rounds the amount of memory requested up to the nearest 16-byte boundary, which helps maintain memory in 16-byte block sizes. An additional 16 bytes are allocated for the header. This is useful for two reasons:
The memory passed to the calling function can be forced to the proper alignment.
A side benefit of storing the requested size is that size adjustments similar to a realloc() can be issued
and the calling function does not have to know what the current size is when releasing that memory back
to the pool.
Hidden in the beginning of the allocated memory is a header where the requested size is stored in the first 32-
bit word and the other three words are set to the values of {1, 2, 3}. The pointer is then advanced to a 16-byte
alignment and passed to the calling function.
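The pointer adjustment itself is only a couple of instructions. Here is a sketch of one common way to compute it (not necessarily the exact arithmetic of the listing, and the register usage is assumed):
; eax = raw base address returned by new or malloc
add eax,16+15 ; reserve at least 16 bytes for the header, then round up
and eax,0fffffff0h ; snap down to the next 16-byte boundary
; eax is now the aligned pointer handed to the caller; the header lives just below it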
When releasing memory back to the system, the returned pointer needs to be unadjusted back to the true
base address; otherwise a memory exception will occur. The following function wraps the release function
free().
This may seem wasteful, but the base address of the memory being allocated by new or malloc is unknown.
With current malloc libraries it tends to be 4- or 8-byte aligned, so there is a need to allocate for a worst case.
ASSERT_PTR4(pRaw);
ASSERT_PTR(*pRaw);
The memory release occurs by decrementing the word pointer by one 4-byte word. If that location contains a
value between one and three, the pointer is decremented by that value so that it then points at the size
information when cast to a RamHead pointer. This is the true memory base position and the pointer that gets
returned to the system function free().
For C++ fans, the new and delete operators can be overloaded to this insulating memory module. I also
recommend one final item: The memory allocation and release functions should require a pointer to be
passed. This will allow the release function to nullify the pointer, and in future enhancements each pointer
could be considered the "owner" of the memory and thus adjusted for any garbage collection algorithms
instituted for a heap compaction in a flat memory environment.
Allocate Memory
A pointer is passed as ppMem and set.
Listing 2-6: ...\chap02\ram\ram.cpp
ASSERT_PTR4(ppMem);
ret = false;
*ppMem = (byte *)ramAlloc(nReqSize);
if (NULL!=*ppMem)
{
ramZero(*ppMem, nReqSize);
ret = true;
}
return ret;
}
ramFree(*ppMem);
*ppMem = NULL;
}
Chapter 3: Processor Differential Insight
Overview
This chapter discusses the similarities and differences between the various 80x86 processors from the
manufacturers AMD and Intel.
Processor Overview
There is a large variety of computers with different processors and different word sizes, but there is one constant: the byte. Memory in a computer is represented as a series of bytes, and each of these bytes is made up of eight bits. This allows an unsigned value ranging from 0…255 or a signed value ranging from –128…0…127 to be stored in each byte. Eight bits can also store one ASCII character such as A…Z. These bytes can
be used together to form larger data structures such as a 16-bit short, 32-bit int, 64-bit long, etc.
In a higher level language such as C this is typically represented by a hex value. For example, 123 decimal is:
64+32+16+8+2+1.
Bit      7    6    5    4    3    2    1    0
Power   2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
Weight  128   64   32   16    8    4    2    1
Value     0    1    1    1    1    0    1    1
So binary 01111011 broken into nibbles (4-bit chunks) 0111 1011 is 7B hex. I did it for you here, but you really
should already know how to do decimal-to-hex and hex-to-decimal conversions. In the C programming
language this is represented by 0x7B. In an assembler such as MASM this can be represented in a variety of
ways:
mov eax, 123 ; Decimal
mov eax, 7bh ; Hex
mov eax, 01111011b ; Binary
Let's try that again but with a slightly bigger number in which the most significant bit (MSB) gets set.
Bit      7    6    5    4    3    2    1    0
Power   2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
Weight  128   64   32   16    8    4    2    1
Value     1    0    1    0    0    1    0    1
It should be pointed out that a number represented in hex in C only needs a leading 0x to indicate that the
trailing digits are hex code. In assembly language the suffix of h indicates the value is hex. But if the first digit
is not a digit but an alpha value of A…F, then a leading zero is required. Therefore, a hex value in assembly
language must always begin with a digit even if it is zero. Letters indicate the word about to be processed by
the assembler is a label and not a value! Hex letters A…F can be mixed and matched upper- and/or
lowercase; capitalization does not matter.
We are using the value 0a5h = 10100101B. The B suffix represents binary, and the MSB here is a 1. If this were an unsigned value ranging from 0…255, then 0a5h would resolve to 128+32+4+1 = 165 decimal. Numbers without prefixes or suffixes are in decimal. But what if this were a negative number? As a signed byte, 0a5h is a decimal value of –91. How did we do that? Well, we need something called a two's complement. This is a one's complement followed by an addition of +1.
Since the MSB is set and this is a signed number ranging from –128 to 127, then first NOT (meaning flip) all
the bits in the number.
Bit      7    6    5    4    3    2    1    0
Power   2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0
Weight  128   64   32   16    8    4    2    1
Value     1    0    1    0    0    1    0    1
NOT       0    1    0    1    1    0    1    0
The bit sequence of 0101 1010 gives us 5Ah. (Not a coincidence; I chose the 5 and A on purpose since they are complements of each other!) Now add 1 to that: 5Ah+1 = 5Bh = 01011011B = 64+16+8+2+1 = 91. Since we performed the two's complement we also stick the negative sign (–) back on it: –(91), thus –91. Again, this should be review for you, but I want to make sure you understand signed versus unsigned values and how to handle one or the other.
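In assembly the same two-step conversion, and its one-instruction shortcut, looks like this:
mov al,0a5h ; 10100101b, i.e., -91 when treated as a signed byte
not al ; one's complement: 5ah
inc al ; add one: 5bh = 91, the magnitude
; or in a single instruction:
mov al,0a5h
neg al ; two's complement: al = 5bh (91)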
Note To help alleviate any confusion between this book and my vector book, this one was written
for multiple processors. Both books share a generic calling convention and a standard
naming convention for data types: (b)yte 8-bit, (h)alf-word 16-bit, (w)ord 32-bit,(d)word 64-
bit, and (q)word 128-bit.
They are used for function declarations to maintain compatibility between books as well as
processor types.
Even though the 80x86 does not use a half-word declaration, I forced it to do so for easier
understanding of the vector book. This book is strictly about 80x86 assembly language, and
its letter encoding is directly connected to data types and instructions, so the specific 80x86
convention will be used here: (b)yte 8-bit, (w)ord 16-bit,(d)word 32-bit, (q)word 64-bit, and
(dq)word 128-bit.
Please keep this in mind if you are switching back and forth between the two books!
History The declaration of "word" has a bit of a history. When the 8086 processor was first
produced it had 16-bit registers. At that time a word was considered the width of the data,
so a word became the definition for a 16-bit value. Other processors had a data width of
32 bits, so a word was used to represent 32 bits. With the release of the 80386, the word
was already embedded in the 80x86 assembly code to represent 16 bits, and all code
would have to be modified accordingly, so 32 bits came to be known as double words.
And so a schism of bit widths related to the definition of a word came to be. In high- level
languages such as C an integer (int) was used to represent a word. Since it was not
directly tied to an absolute data width, it expanded with time. With the 8086 an int and a
short were 16 bits, but with 32-bit processors the int came to represent 32-bit data, while
the short still represents 16-bit data.
Table 3-1: 80x86 data types
C Type       Letter   80x86 Type   Bytes   Bits
char         b        byte         1       8
short        w        word         2       16
int          d        dword        4       32
long         q        qword        8       64
long long    dq       dqword       16      128   (SSE, SSE2)
Figure 3-1: Unsigned/signed data types
History
A quick history lesson. The 80x86 family started with the 8088 and 8086 (16-bit) processors. These were 16-bit processors, as the general-purpose registers were 16 bits in size. The 8088, however, had an 8-bit data bus. The general-purpose registers had two addressable components; AX, for example, has its upper 8 bits addressable as AH and its lower 8 bits as AL. These processors used segment:offset addressing: a 16-bit segment and a 16-bit offset were combined to generate a 20-bit address, so there were 12 bits of overlap between the two registers. The segment was used to address a 16-byte block (paragraph) within 1 MB of memory. Along came the 80186 and 80286, which added instructions and, with the 80286, the first foundations of Protected Mode.
The 80386 was the first 32-bit processor to break the 1 MB memory barrier. Additional instructions were
added, and the segment:offset registers became selector:offsets. There then became a differentiation between
Real Mode using 16-bit addressing and Protected Mode, which used the new 32-bit technology. Then the
processor clone wars started between manufacturers. The 80486 came along with additional instructions. The
Pentium series came next with the introduction of multimedia instructions known as SIMD with the MMX
registers. There were different flavors, different instructions — it was very convoluted and confusing. AMD
introduced 3DNow! This is about when the book you are now reading was started. Then came Katmai by Intel,
now known as the SSE instruction set. AMD countered, Intel countered, back and forth. Other manufacturers either sank in battle or pushed onto the shores of an embedded island. Then the technology
was forked by Intel (but we are ignoring the Itanium series in this book). Somewhere along the way AMD and
Intel each came out with 64-bit processors and seem to have reached a tentative truce.
Incidentally, the FPU (floating-point unit) was an optional second coprocessor originally known as the 8087
and subsequently the 80187, 80287, 80387, and then the 80487. Back then, floating-point was emulated in
software as an alternative if no hardware was present. The FPU grew up with the CPU and eventually became an internal component of the 80486DX CPU. The 80486SX still had to rely upon a coprocessor.
History Rumor or fact? There are lots of stories of processors, faulty FPU chips, math bugs, etc.
One story has to do with the 80486SX. Rumor has it that if the Intel 80486DX's FPU failed
testing or cheaper SX processors were in short supply, then connections to the FPU were
zapped within the 80486DX to transform it into an 80486SX processor. (Rumor or fact?)
This is as deep into Real Mode as this book is going to get. If you need additional information, please
investigate specific documentation from the 80x86 processor manufacturer for your embedded application as
Real Mode is fading into obscurity. The same applies to the need for selector manipulation as it can get pretty
complicated, but this is only needed by operating system programmers for pre-64 bit processors, and that is
not what this book is about. It is written to help those programmers but primarily written for application
developers for 32- and 64-bit operating systems. System-level programming pretty much requires its own book
about this size.
Many interesting things occurred during the introduction of the Protected and Private Modes. C compilers had
to be selected for various code vs. data models.
Model     Code   Data
Tiny      near   near (code and data share one segment)
Small     near   near
Medium    far    near
Compact   near   far
Large     far    far
Huge      far    far (individual data items may exceed 64K)
These various operating modes give a nice upgrade methodology as well as extend the life of a large portion
of pre-existing software as it has the ability to run under the new hardware and operating systems. Additional
instructions and hardware have been added to the 32-bit hardware base to enhance its functionality while
moving forward into the 64-bit world. I am not going to hype the technology here, as the manufacturers are
doing a pretty good job themselves. My job with this book is to help you learn to program in assembly code for
32/64-bit processors. If you are already up to speed with 32-bit, it will help with your transition to the 64-bit
machine. At the time of writing this book the new 64-bit operating systems were XP 64 and Linux 64.
At this point you are probably wondering to yourself, "My, this is a small section on the topic of 64-bit processors. Wasn't 64-bit 80x86 in the title?" From this book's point of view its only interest is the hardware capabilities of the instruction set, and that means software, not the infrastructure of the hardware. But keep in mind that in 64-bit mode the data size becomes 2^64.
Tip 2^64 is a pretty big number. Add your computer memory and video memory together to get your computer's maximum amount of memory; a 64-bit range is far beyond that and should be enough for a couple of years of expansion.
For example, in a short C for-loop that only counts to a handful, three bits are enough to represent the loop counter, thus 61 bits of a 64-bit register are wasted! (Well, not really, but you may begin to see my point.) The advantage of registers that are 64 bits in size is not only that you can store values with twice as many bits as a 32-bit register, but that you can address a far larger memory space. That drastically increases the amount of memory your application can access. Applications can be much bigger! Which means that you can put larger memory sticks into your computers so your computer can run even hotter. (That meant heat, not necessarily speed!) For network applications that is good. But what the chip manufacturers realized is that sometimes those extra bits were needed for the size of the data, but mostly not. So this is where things get cool: The default is 32-bit data! The processor stack, however, is still 64 bits.
The topic of 64-bit mode is spread throughout this book. Almost every instruction has a difference when used
while in 64-bit mode. Some have been invalidated and thus do not exist anymore. Some have been renamed
to help differentiate between 32-bit and 64-bit modes. The one thing to keep in mind is that most use a new
64-bit data addressing scheme involving a REX prefix, which will be discussed later in this chapter. The neat
thing is some scalar operations that used to be done with MMX registers can now be done with the general-
purpose registers. Instructions that used MMX registers such as PAND, POR, or PXOR can now be done
alternatively, using 64-bit general-purpose registers in 64-bit mode, thus minimizing the use of MMX registers
and alleviating their associated dependency issues.
The goal is to preferably use XMM (SSE) registers or general-purpose registers instead.
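For example, a 64-bit mask operation that once called for MMX can be written with nothing but general-purpose registers in 64-bit mode (the pointer registers here are assumptions):
mov rax,[rsi] ; load eight bytes of data
and rax,[rdx] ; apply a 64-bit mask, much as PAND would have
mov [rdi],rax ; store the result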
80x86 Registers
The Intel 80x86 processor family has internal storage that is referred to as registers. These have been
organized into seven groups: general-purpose data registers, segment registers, status registers, control
registers, FPU registers used by the FPU (floating-point unit), MMX registers, and XMM registers. Please note
that the FPU and MMX registers share storage area and thus are mutually exclusive (only one of them can be
used at a time). Each has its own set of functionality.
The following registers and their ranges are for dealing with the SIMD instruction sets directly. They do not
include system registers.
Table 3-2: SIMD instruction with register names and bit widths
General-Purpose Registers
The general-purpose registers are organized into two groups of eight registers. The first group contains the general registers RAX, RBX, RCX, and RDX, the index registers RSI and RDI, and the base and stack pointers RBP and RSP; each has an 8-, 16-, 32-, and 64-bit form. The second set of eight consists of the new registers R8…R15. The instruction
pointer RIP has a 16-, 32-, and 64-bit form depending on which mode the processor is running in.
No 64-bit: There is a special case in which AH, BH, CH, and DH are not accessible when a REX prefix is used. REX is used for all extended 64-bit registers as well as for SIL, DIL, BPL, and SPL register access.
With a REX prefix (64-bit mode; R8…R15 have matching dword, word, and byte forms):
64 RAX RBX RCX RDX RSI RDI RBP RSP
32 EAX EBX ECX EDX ESI EDI EBP ESP
16 AX BX CX DX SI DI BP SP
8 AL BL CL DL SIL DIL BPL SPL

Without a REX prefix:
32 EAX EBX ECX EDX ESI EDI EBP ESP EIP
16 AX BX CX DX SI DI BP SP IP
8 AL BL CL DL
8 AH BH CH DH
The general-purpose registers are used as memory pointers and for address calculations (displacement with scaling), logical bit operations, and mathematical operations. There is an exception: the RSP register is the stack pointer and has limitations, such as not being usable as a scaled index register. In relation to the segment registers, the 64-bit stack pointer is used in conjunction with the stack segment-selector (SS) register as a default.
Table 3-5: 64-, 32-, and 16-bit general-purpose registers
RAX, EAX, AX The accumulator. If used as a pointer in Protected Mode uses the DS segment
register as a default. DS:[EAX]
RBX, EBX, BX Used as a data pointer using the DS segment register as a default. DS:[EBX]
RCX, ECX, CX Used as a counter in string (rep) and loop operations. If used as a pointer in
Protected Mode uses the DS segment register as a default. DS:[ECX]
RDX, EDX, DX Input/output port address. If used as a pointer in Protected Mode uses the DS
segment register as a default. DS:[EDX]
RSI, ESI, SI Source index using DS segment register as a default. DS:[ESI]
RDI, EDI, DI Destination index using ES segment register as a default. ES:[EDI]
RBP, EBP, BP Pointer to data on the stack (very similar to ESP); uses the SS segment register
as a default. SS:[EBP]
RSP, ESP, SP Stack pointer; uses the SS segment register as a default. SS:[ESP]
RIP, EIP, IP Instruction pointer. CS:[EIP]
Note: The "R" prefix of these registers only became available with the 64-bit versions of the processor. The "E"
prefix of these registers became available with the introduction of the 32-bit 386 processor. Prior to that, only
16-bit registers were supported.
Regardless of which mode you were in, you could access either 32-bit or 16-bit registers, thanks to the introduction of an operand-size prefix (a pre-opcode byte). In 32-bit mode you access 32-bit registers by default; in 16-bit mode, 16-bit registers. If you need to access the alternate size, a hidden leading prefix is embedded in the binary output:
…from 16-bit code…
66h mov eax,3
mov ax,3
This covers all the 32-bit processors. So now that I have complicated things for you with some history, let us
examine the new REX prefix.
REX
The prefix REX is not an instruction; it is an invisible prefix. It is similar to the operand-size and address-size prefixes that the assembler and compilers inject into the code when switching between 16-bit and 32-bit access methods. With the new 64-bit instructions this scheme has been extended again.
Note When the processor is running in 64-bit mode the default data size is 32 bits. A REX prefix of 48h…4Fh (REX.W set) is embedded when using 64-bit data access. After all, a 64-bit number is a very big number and thus not needed that often. Sign-extending a 32-bit number when needed is more code efficient.
With the introduction of 64-bit processors a new invisible prefix is used: REX (40h…4Fh). So if this new
processor is running in 64-bit mode the previous rules still apply, but to access the 64-bit data a REX opcode
is injected:
66h mov ax,3
mov eax,3
REX mov rax,3
No 64-bit: You cannot use 64-bit data while running in 32-bit mode, or the single-byte inc/dec register instructions in 64-bit mode, and here is why: The opcodes 40h…4Fh are mapped to register increment and decrement instructions in a 32-bit mode environment! Thus, in 32-bit mode only the 32-bit data and instruction sets can be accessed; REX does not exist. In 64-bit mode, 32-bit and 64-bit data can be accessed, but those single-byte inc/dec encodings are no longer available for direct use by a register.
Table 3-6: Mappings of the inc/dec instructions that use opcodes 40h-4Fh in compatibility or legacy 32-bit mode.
40h inc EAX    48h dec EAX
41h inc ECX    49h dec ECX
42h inc EDX    4Ah dec EDX
43h inc EBX    4Bh dec EBX
44h inc ESP    4Ch dec ESP
45h inc EBP    4Dh dec EBP
46h inc ESI    4Eh dec ESI
47h inc EDI    4Fh dec EDI
Bit  7 6 5 4 3 2 1 0
     0 1 0 0 W R X B
The instruction format is a grouping of a prefix that is optional, opcode, mod r/m, sib, displacement, and data.
This book does not get into the nitty-gritty of how an instruction, registers, and/or memory references map into
an actual opcode. But the bit mapping for the mod r/m is as follows:
Table 3-8: Mappings of mod r/m code. 32-bit is the default. Substitute 16/64-bit for 32-bit form where
needed, such as 00-001 DS:[ECX], DS:[CX], [RCX].
MOD R/M
00 000 DS:[EAX]
00 001 DS:[ECX]
00 010 DS:[EDX]
00 011 DS:[EBX]
00 100 s-i-b
00 101 DS:d32
00 110 DS:[ESI]
00 111 DS:[EDI]
MOD R/M
01 000 DS:[EAX+d8]
01 001 DS:[ECX+d8]
01 010 DS:[EDX+d8]
01 011 DS:[EBX+d8]
01 100 s-i-b
01 101 SS:[EBP+d8]
01 110 DS:[ESI+d8]
01 111 DS:[EDI+d8]
MOD R/M
10 000 DS:[EAX+d32]
10 001 DS:[ECX+d32]
10 010 DS:[EDX+d32]
10 011 DS:[EBX+d32]
10 100 s-i-b
10 101 SS:[EBP+d32]
10 110 DS:[ESI+d32]
10 111 DS:[EDI+d32]
There are other mappings but this is sufficient. The reason this book does not get too deep into details is that
you are probably not writing assemblers or compilers. If you were, then you mostly would not need this book
except as a reference. It is just one of those interesting tidbits but unnecessary for assembly language
programming or debugging. "s-i-b" represents (scale-index-base) byte.
Segment/Selector Registers
In Protected Mode these registers are referred to as "selectors" and in Real Mode "segment registers." In Real
Mode they are used in conjunction with an index register to calculate a memory address. As they are
functionally the same, in this section "segment" will mean both. They are sometimes referred to as segment-
selectors.
15…0 Description
CS Code segment
DS Data segment
ES Extra (data) segment
FS Data segment
GS Data segment
SS Stack segment
Note: The FS and GS were not available prior to the 386 processor.
When modifying any segment-selector register you must first save a copy of its stored value and restore it
before exiting your function or your program will go "BOOM!" (That is a technical term!) Well, it will not explode
as it will just cause the process to crash, but it will sure seem like it exploded. (Ask any assembly language
programmer!)
If you are writing a Win32 type application, then typically all the segment-selectors are used in the execution of
your code but are usually not denoted in your code as the defaults will be used. The FS and GS are used in
your assembly code typically only in device drivers. This is the case of a flat memory model and the DS and
ES are set to the same base address in memory. This section essentially becomes a no-brainer! You can
completely ignore the segment registers since the DS, ES, and SS are set to the same segment and the
indexing registers have used one or the other segment register as a default.
If you are writing an Extended DOS or other OS-based application, then you will typically use most or all of the
segment-selector registers, especially in your low-level operating system drivers.
MMX Registers
There are eight 64-bit MMX registers (MM0, MM1, MM2, MM3, MM4, MM5, MM6, MM7). These are 64-bit
registers that can deal with a single 64-bit number, or two 32-bit, four 16-bit, or eight 8-bit packed values. In
the 3DNow! instruction set they are used for both integers and floating-point value pairs. These registers were introduced with the MMX-enabled Pentium series processors. There are no flags to set or read, but based upon the
instruction the individual packed data values are treated individually to effectively replicate a desired
instruction.
XMM Registers
There are eight 128-bit SSE registers (XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6, XMM7) for pre
64-bit and eight additional registers (XMM8, XMM9, XMM10, XMM11, XMM12, XMM13, XMM14, XMM15) for
64-bit or larger processors. These are 128-bit registers that can deal with two 64-bit, four 32-bit,
eight 16-bit, or sixteen 8-bit packed values, whether they be integer or single/double-precision floating-point.
These registers were introduced with the PIII series processors. There are no flags to set or read, but based
upon the instruction the individual packed data values are treated individually to effectively replicate a desired
instruction. The functionality of the 64-bit MMX registers was migrated to the 128-bit SSE registers, thus
doubling the size but without the burden of the FPU vs. MMX data type switching. Whenever possible, these
should be used instead of MMX.
Each 80x86 processor has standard status flags stored in an EFLAGS/ RFLAGS register, which indicate
various conditions of the processor. In 64-bit mode the EFLAGS register is referred to as the 64-bit register
RFLAGS, with the upper bits set to zero. As this book is written mainly for the application programmer there
are only a few EFLAGS that you need to know initially. All the EFLAGS are defined in Chapter 18 but for now:
Table 3-9: RFLAG/EFLAG(s) and bit encoding
In this book you will find the following table with each general-purpose instruction description indicating the
flags that it sets. Those flags are in reality the individual bits found in the EFLAGS register.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
LAHF 32 32
lahf
The LAHF instruction loads the lower 8 bits of the EFLAGS register into the AH register. This only loads the
standard conditional flags used by the Jcc and SETcc instructions. The complement to this instruction is
SAHF.
Bit 7 6 5 4 3 2 1 0
Flag SF ZF 0 AF 0 PF 1 CF
This instruction can get you a copy of all the conditional flags in one step without having to use the stack. It
should be remembered that only bits {0, 2, 4, 6, 7} are usable and {1, 3, 5} are reserved and not usable.
No 64-bit: This instruction is not supported in 64-bit mode!
To get the flags into register AL use instead:
PUSHFQ ; Push RFLAGS
POP RAX ; Pop into RAX
SAHF 32 32
sahf
The SAHF instruction saves the AH register to the lower 8 bits of the EFLAGS register. This only sets the
standard conditional flags used by the Jcc, SETcc, and other instructions. The complement to this instruction is
LAHF.
For more detailed information about EFLAGS, see the instruction LAHF and Chapter 18.
Flags O.flow Sign Zero Aux Parity Carry
- X X X X X
This can set all the conditional flags in one step without having to use the stack. It should be remembered that
only bits {0, 2, 4, 6, 7} are usable and {1, 3, 5} are reserved and not usable.
No 64-bit: This instruction is not supported in 64-bit mode!
PUSHF/PUSHFD — Push EFLAGS onto Stack
PUSHF
PUSHFD 32 32
PUSHFQ 64 64
pushf 16
pushfd 32
pushfq 64
The PUSHF instruction pushes the lower 16 bits of the EFLAGS register and the PUSHFD instruction pushes
the entire 32 bits of the EFLAGS register. PUSHFQ replaces PUSHFD, but it pushes all 64 bits of RFLAGS
(the extended EFLAGS). They are the WORD, DWORD, and QWORD forms of the same instruction. (They all
map to the same opcode.)
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
To get the EFLAGS, one only needs to push it onto the stack and then pop it off into a general-purpose
register!
pushfd ; push EFLAGS register
pop eax ; pop those flags into EAX
POPF
POPFD 32 32
POPFQ 64 64
popf 16
popfd 32
popfq 64
The POPF instruction pops the stack value into the lower 16 bits of the EFLAGS register, and the POPFD
instruction pops the entire 32 bits off the stack into the EFLAGS register. In 64-bit mode POPFQ replaces the
POPFD instruction and pops all 64 bits of RFLAGS. They are the WORD, DWORD, and QWORD forms of the
same instruction. This instruction is the complement to PUSHF and PUSHFD.
Flags O.flow Sign Zero Aux Parity Carry
X X X X X X
No 64-bit: The PUSHFD and POPFD forms are not available in 64-bit mode; use PUSHFQ and POPFQ instead.
Some of the EFLAGS bits can be individually modified by using the PUSHF/PUSHFD and POPF/POPFD CPU instructions. They push the flags onto the stack, at which point you pop them into a general-purpose register, manipulate the bits, push the register value back onto the stack, and then pop it back into the EFLAGS. This
same trick is used to detect CPU capability (see Chapter 16). The same applies to RFLAGS in conjunction
with PUSHFQ and POPFQ.
pushfd ; push EFLAGS register
pop eax ; pop those flags into EAX
xor eax,EFLAGS_ID ; flip ID bit#21 in EFLAGS
push eax ; push modified flags on stack
popfd ; pop flags back into EFLAGS
During an interrupt call, the EFLAGS are automatically stored on the stack with the return address so that
when the interrupt returns, the register status can be restored to exactly the way it was when the interrupt
occurred. The same thing happens in a multithreaded environment. When a thread is preempted so that
another thread can be executed for a while, the processor status related to that thread is saved and restored
accordingly. This is hidden from the application programmer and so does not need to be worried about. Just
keep in the back of your mind that it is taken care of for you.
CLC
clc
Clears the carry flag: CF = 0.
STC
stc
Sets the carry flag: CF = 1.
CMC
cmc
Complements (toggles) the carry flag:
Carry = (Carry == 0) ? 1 : 0;
Flags O.flow Sign Zero Aux Parity Carry
- - - - - X
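A trivial sequence showing all three; the carry flag is the only flag touched:
stc ; CF = 1
clc ; CF = 0
cmc ; CF = 1 (complemented)
cmc ; CF = 0 again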
NOP — No Operation
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
NOP
nop
The NOP is a no-operation instruction. It uses up one instruction byte and changes no flags. It is typically used
for code alignment by one byte or a time delay.
Floating-Point 101
Remember that this is not rocket science, and thus minor deviations will occur in the formulas since, for
example, a single-precision float is only 32 bits in size. For higher precision, 64-bit double-precision or 80-bit
double extended-precision floating-point should be used instead. These floating-point numbers closely follow the IEEE 754 standard specification. Unfortunately, the 80-bit version is only available in a
scalar form on the 80x86's FPU and the 64-bit packed double-precision is only available on the SSE2
processor.
Most programmers only know a floating-point value from using a declaration such as float, double, real4, or
real8, etc. They know that there is a sign bit that if set indicates the value is negative and if clear indicates the
value is positive. That is typically about the limit of the programmer's knowledge, as floating-point is typically
treated as a black box and they typically do not need to dig into it further.
For this book you will be required to understand a little bit more and that will be discussed in Chapter 8.
For a processor to survive it must either be enhanced to meet the demands of technology, find a second life in
an embedded marketplace, or die. Intel and AMD have done just that (not the die part), but unfortunately in the
process, the technology has forked and so there are now a multitude of flavors of the original 80x86 processor
core in existence. In addition, AMD has merged the technologies of the 3DNow! extensions and SSE(2) and
formed the 3DNow! Professional instruction sets.
Figure 3-8: Flavor tree of x86 processors
The point is that now there are several 80x86 SIMD feature sets and not all processors have them all. So the
first step is to resolve this. Intel initially did so by developing an instruction with the mnemonic CPUID along
with a set of sample processor detection code. AMD adopted the same instruction with their set of sample
code. As the technology forked further, each company's sample CPUID code emphasized its own processors
so programmers have had to merge both companies' code a bit, although AMD's was more diverse. To make it
even more complicated, AMD put out 3DNow! Professional. This is a hybrid of all the 80x86 instruction sets of the AMD and Intel technologies, except the SSE3 extensions (at least at the time this book was written).
Because of the confusion factor, this book's goal is to try to make it easier to understand.
In Chapter 16 the CPUID instruction is explained. This is a very complicated instruction, but it is wrapped with
a function call used by this book that fills in a structure and builds an ASCII string describing the capabilities of
a computer in code. It or something similar to it should be used to decide whether a certain set of instructions
is usable on a particular computer. Since you are most likely learning this subject material, then you are most
likely using one or two computers to test this code and thus know the processor type already. Just to be sure
you are running the correct instructions on the correct machine there is a CPUID testing logic included with
most of the test applications to run the appropriate set of code. If you wish to learn more about this, please
skip ahead to Chapter 16.
void CpuDetect(CpuInfo * const pInfo);
Briefly, the CPU detection code checks for the processor type and its capability and sets flags accordingly. The
initialization function attaches function pointers to appropriate code compatible with that processor type and
then it is just a matter of the application calling a generic function pointer, which gets routed to the appropriate
code.
When you write your code, try to use SSE instructions whenever possible for scalar as well as vector
processing. When possible use the instructions that perform quick estimations as they are designed for higher
speed calculations despite their lower accuracy. In that way you will always have the best performance for
your code, even on newer machines released after you ship your code.
The 80x86 processor has a dual mode in relationship to its MMX and FPU registers. In these particular cases
whenever there is a need to switch back and forth, the appropriate instruction needs to be called. In addition,
there is a difference between the AMD instruction FEMMS and the Intel instruction EMMS. (These will be
discussed in Chapter 8.) When writing code, use instructions that favor using the SSE instructions as the
(F)EMMS instructions are only needed if switching between MMX and FPU.
EMMS
Emms
FEMMS
3DNow! femms
Tip Note that if your floating-point code gets erratic and appears to have unexpected QNAN or
infinity or illegal values, then look for a usage of an FPU or MMX instruction while the other
mode was thought to be in effect. It is probably because you are missing an (F)EMMS
instruction to flip modes.
The point is that whatever mechanism you put into place — switch-case statements, lookup tables, etc. — you
want to have the best (fastest) set of code available for that processor. The trick, however, is not to use up
valuable memory supporting all those combinations. Fortunately, consoles such as Xbox are fixed targets,
which can assist you in being more selective. In fact, you can get away with absolute function calls rather than
function pointers, but that would really be up to you! It all depends upon how you implement your cross-
platform capabilities. The code samples use a platform-specific file to connect the dots so to speak, so it can
be abstracted out easily to a platform-specific implementation. Of course, if using direct calls you would want
to have two flavors of function prototypes in the header files: those set up to be function pointer based and
those set up as standard function prototypes.
Destination/Source Orientations
Another difference between platforms has to do with the format of the assembly instructions. Depending on the
processor there are typically two orientations.
Some non-80x86 processors allow the destination register to not be a source register. Thus, Register D =
Register A + Register B. Or D = D + A. In C programming this is a form similar to:
D=A+B
D=D+A
The 80x86 processor family requires the destination to be one of the sources. In C programming this is similar
to:
D += A
D=D+A
Big/Little-Endian
One very important processor specification to be aware of is the endian orientation. This drastically affects how
byte ordering affects data orientation. Depending on the processor and its manufacturer, data structures larger
than a byte are typically arranged in one of two orientations:
Little-endian
Big-endian
One interesting aspect is that for either little- or big-endian, an 8-bit byte has an identical bit ordering of
bits {0…7}. The MIPS processors (as a default) and the 80x86 are little-endian, but the Motorola 68000 and
the PowerPC RISC microprocessor are big-endian.
Little-endian is linear just like memory, so every more significant byte would be the next (incremental)
addressed one in memory. For the size of a data word in big-endian, every more significant byte would be the
previous (decremental) addressed one in memory.
In big-endian the most significant byte is first in memory and it progresses down to the least significant byte;
the cycle repeats for the next block. In the following diagram, the data in memory is blocked into groups of 128
bits (16 bytes).
The Intel 80x86 processor is a little-endian based processor. That is, the memory is laid out in a linear fashion
so the first byte contains the LSB (least significant bit). For example, as Figure 3-11 shows, a dword has the
lower bits (7…0) in the first byte (#0) and bits (31…24) in the fourth byte (#3). This pattern is repeated over
and over.
Figure 3-11: Visible connections between individual bytes and the left shift of a 32-bit data element
In the C programming language, use the following shift to the left by one for a 32-bit data word (int).
a = a << 1;
Dealing with endian orientation can sometimes be confusing, especially if you primarily work in little-endian
and then need to convert data to the form of big-endian. This makes perfect visual sense for big-endian
because the fourth byte contains the least significant bit (LSB) and data is shifted to the left toward the most
significant bit (MSB). For little-endian the same shift in C magnifies the value by a factor of two for each bit but
visually it makes no sense because the LSB is on the left. By using a mirrored reflection it then becomes clear.
At this point this should be enough conversation about endian orientation of memory until you get to Chapter 6,
"Data Conversions," where this is discussed more thoroughly.
Alignment Quickie
In the following little-endian ordered data snippet, you will find an example of the most common data types.
The actual hex value is on the right and the byte arrangement in memory is on the left.
0000 12 Ape db 12h ; (byte)
0001 34 12 Bat dw 1234h ; (word)
0003 78 56 34 12 Cat dd 12345678h ; (dword)
0007 F0 DE BC 9A
78 56 34 12 Dog dq 123456789ABCDEF0h ; (qword)
Note that the data should be properly aligned to their data type, so a slight rearrangement of the previous data
snippet is in order. You should note the previous odd addressing versus the following aligned addressing:
0000 f0 DE BC 9A
78 56 34 12 Dog dq 123456789ABCDEF0h ; (qword)
0008 78 56 34 12 Cat dd 12345678h ; (dword)
000C 34 12 Bat dw 1234h ; (word)
000E 12 Ape db 12h ; (byte)
You will note by the base address on the far left that the data has been reordered to ensure that all elements
of a data type are properly aligned for their data type. Eight-byte values are aligned on 8-byte boundaries, 4-
byte on 4-byte boundaries, etc.
LDDQU
MOV
MOVAPD
MOVAPS
MOVD
MOVDQA
MOVDQU
MOVQ
MOVUPD
MOVUPS
Move (Unaligned)
mov rmDst(8/16/32/64), #(8/16/32) [Un]signed
mov rmDst, rSrc(8/16/32/64)
mov rDst, rmSrc(8/16/32/64)
System Level Functionality
mov srDst, rmSrc(16/32)
mov rmDst, srSrc(16/32)
MMX movd mmxDst, rmSrc/m32 [Un]signed 32
" movq mmxDst, mmxSrc/m64 "
SSE movups xmmDst, xmmSrc/m128 Single-precision 128
" movups xmmDst/m128, xmmSrc "
SSE2 movdqu xmmDst, xmmSrc/m128 [Un]signed 128
" movdqu xmmDst/m128, xmmSrc "
" movupd xmmDst, xmmSrc/m128 Double-precision
" movupd xmmDst/m128, xmmSrc "
3DNow! movq mmxDst, mmxSrc/m64 Single-precision 64
lddqu xmmDst, mSrc(128) [Un]signed 128
Move (Aligned)
SSE movaps xmmDst, xmmSrc/m128 Single-precision 128
" movaps xmmDst/m128, xmmSrc "
SSE2 movdqa xmmDst, xmmSrc/m128 [Un]signed 128
" movdqa xmmDst/m128, xmmSrc "
" movapd xmmDst, xmmSrc/m128 Double-precision
The instructions do not really move data; rather they "copy" 8-, 16-, 32-, 64-, or 128-bit values from register to
register, register to memory, memory to register, and immediate to register or memory, but almost never
memory to memory.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
You will note that these are either a specified floating-point value or a signless "[Un]signed" integer. As the
data is being copied, whether or not it is signed has no meaning!
ASM Sorry to do this to those of you who are more experienced, but I need to embellish further for
101 those new to the subject matter. I of course do not need to remind you to skip ahead at any
time!
MOV — Copy a (32-bit) double-word value from one location to another. From a 32-bit (4 bytes) general-
purpose register to a general-purpose register:
mov eax,ebx // Copy INT32 from register to register
MOVQ — Copy a (64-bit) quad-word value from one location to another. From 64-bit (8 bytes) memory to an
MMX register, or from a MMX register back to memory:
movq mm0,mem64 // Read INT64 from memory
movq mem64,mm0 // Write INT64 to memory
For the MMX and SSE the MMX registers can only manipulate integer values. Any kind of data can be loaded
but it is treated as an integer whether in reality it is a floating-point value or not. The 3DNow! instruction set,
however, does not differentiate.
movq mm0,mem64 // Read SPFP from memory
movq mem64,mm0 // Write SPFP to memory
MOVDQU — Copy an unaligned double-quad word (128-bit) value between memory and an XMM register:
movdqu xmm0,mem128 // Read INT from unaligned memory
movdqu mem128,xmm0 // Write INT to unaligned memory
MOVDQA — Copy an aligned double-quad word (128-bit) value between memory and an XMM register:
movdqa xmm0,mem128 // Read INT from aligned memory
movdqa mem128,xmm0 // Write INT to aligned memory
MOVUPS — Copy unaligned packed single-precision floating-point values between memory and a 128-bit XMM register:
movups xmm0,mem128 // Read SPFP from unaligned memory
movups mem128,xmm0 // Write SPFP to unaligned memory
MOVAPS — Copy aligned packed single-precision floating-point values between memory and a 128-bit XMM register:
movaps xmm0,mem128 // Read SPFP from aligned memory
movaps mem128,xmm0 // Write SPFP to aligned memory
MOVUPD — Copy unaligned packed double-precision floating-point values between memory and a 128-bit XMM register:
movupd xmm0,mem128 // Read DPFP from unaligned memory
movupd mem128,xmm0 // Write DPFP to unaligned memory
MOVAPD — Copy aligned packed double-precision floating-point values between memory and a 128-bit XMM register:
movapd xmm0,mem128 // Read DPFP from aligned memory
movapd mem128,xmm0 // Write DPFP to aligned memory
Keep in mind that the 3DNow! floating-point uses the same MMX registers as the integer instructions and thus
this same MOVQ instruction.
Data can be moved around using all sorts of methods: immediate to register, immediate to memory, register to
memory, memory to register, and register to register, but data cannot be copied from memory to memory, from
register to immediate, from memory to immediate, or from immediate to segment register.
Let's begin with immediate values. These are values in C such as A=5. The (5) is an immediate value.
foo dd 0 ; a 4-byte memory location
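; (illustrative, assumed lines showing the destination,source order)
mov foo,71077345 ; immediate value written directly to memory
mov eax,71077345 ; immediate value loaded into a register
mov foo,eax ; register copied to memory
mov ebx,offset foo ; the address of foo placed in a pointer register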
mov [ebx],eax ; Save the value at the address
In all of these examples foo now contains the specified value. All instructions use this sort of destination,
source methodology, and I hope you understand that as I am now going to reduce verbosity a tad.
Out of curiosity, how many of you recognized 71077345 as the old calculator trick of "Shell Oil" upside down?
:) (Okay, okay, a little calculator humor!)
XCHG
xchg rmDst, rSrc(8/16/32/64) [Un]signed
xchg rDst, rmSrc(8/16/32/64)
8-, 16-, 32-, or 64-bit data can be exchanged between registers of the same size or between memory and a
register of the same size. Note that 64-bit exchange is only in 64-bit mode.
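A couple of typical forms (the memory operand here is illustrative):
xchg eax,ebx ; swap the contents of two 32-bit registers
xchg al,byte ptr [esi] ; swap AL with a byte in memory (the bus is implicitly locked)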
If you are developing 32-bit applications the system registers are known as selectors. In Win32 the CS, DS,
and ES selectors are typically set to the same block of memory and need not be modified.
If you are developing in Real Mode, aka 16-bit applications, then these system registers are known as
segment registers. In the Real Mode environment sreg represents the segment register supporting a 16-bit
offset addressing up to a maximum of 65,536 x 16 bytes. In a Protected Mode environment the selector
register indexes a 32-bit address using virtual memory. Application programmers do not need to worry about
this. In the DOS and Extended DOS Protected Mode environments the ES is typically set to the same value as
the DS, and the FS and GS are used to address extraneous memory blocks such as a video graphics adapter
(VGA).
xor edi,edi
mov es,monosel
mov es:[di],ax
add di,4
mov edi,monoadr
mov fs,monosel
mov fs:[edi],eax
add edi,4
A memory variable can also be addressed absolutely, or with some displacement; the assembler resolves the address with the offset into a new address:
mov eax,MonkeyBrainSoup+8
This is still an absolute address. The address is simply adjusted and the new address is encoded into the
code. If MonkeyBrainSoup were in data memory at location 1000h, then adding the displacement merely
would encode the address as 10008h. A structure is referenced in the same way.
vmp3DVector STRUCT
x REAL4 ? ;float ?
y REAL4 ? ;float ?
z REAL4 ? ;float ?
vmp3DVector ends
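Accessing a member then uses the structure's field offsets. A small sketch, where vVec is an assumed vmp3DVector variable:
mov ebx,offset vVec ; point ebx at the structure
mov eax,[ebx].vmp3DVector.y ; read the y component (offset +4)
mov ecx,[ebx].vmp3DVector.z ; read the z component (offset +8)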
Or how about as an array of floats as shown below? Note that 0*4, 1*4, and 2*4 are not addressing multipliers.
They are base address multipliers. To truly be a multiplier a register has to be the prefix argument, such as
EDX*8 or EAX*2, etc. What the following really says is: the constant offset of 0×4=0, 1×4=4, or 2×4=8, plus the value in ebx, equals the adjusted base address.
mov eax,[ebx+0*4]
mov ecx,[ebx+1*4]
mov edx,[ebx+2*4]
Those were just some examples, as almost any register can be used alone, in a pair addition, with an optional
base address and/or scale factor of {2, 4, or 8}, but note that there are some limitations in regard to the ESP
register. For addressing memory the equivalent scaling factor is needed: int16 = ×2, int32 = ×4, int64 = ×8. During code reviews of other programmers' assembly code I have seen single registers with and without scaling but
rarely multiple register addition; instead there is usually some discrete logic to calculate a base address. (That
is a waste of CPU time when hardware can resolve some of the addressing for you!)
In regard to the following mapping mechanisms I have read a lot of books and they usually have some minimal
reference or simple tables made to show multiple register referencing.
I have never seen them in a verbose table such as that in Appendix C. Even the data books direct from the
chip manufacturers seem to be lacking in this information, and so here it is. Seeing it should help you
remember and then entice you to use them.
uint32 OddTable[16] = { 1, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53 };
(element index:        0  1  2  3   4   5   6   7   8   9  10  11  12  13  14  15)
ebx = 0 edx = 1
mov eax, OddTable ; = 1
mov eax, OddTable[ebx] ; = 1
mov ecx, OddTable[eax * 4] ; = 11
mov ebx, OddTable[ebx + ecx] ; = 37
mov ebx, OddTable[ecx + edx * 4] ; = 53
mov esi, offset OddTable
mov eax, [esi + ebx * 4]
Since the table is 32-bit (4 byte) a multiplier of ×4 can be used to reference the correct array cell. If the table
were 16-bit, the multiplier would be ×2, 64-bit then ×8.
The same kind of memory reference used to access an element in a memory table or array can also be used
to access a jump or call vector.
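For example, a scaled index can select a routine address out of a table of dwords; the table, labels, and selector below are hypothetical:
JmpTbl dd Case0, Case1, Case2, Case3 ; table of routine addresses

mov eax,nSelector ; index 0...3
jmp dword ptr JmpTbl[eax*4] ; dispatch; a call through the table works the same way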
LEA — Load Effective Address
LEA destination, source
LEA
lea rDst16, mSrc(16/32) Unsigned
lea rDst32, mSrc(16/32)
lea rDst64, mSrc16
This instruction calculates the effective address of the source argument and stores it in the destination. It can
be used as multi-register math:
mov ebx,2
mov ecx,3
lea eax, [10+ebx+ecx] ; eax = 10+2+3 = 15
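LEA is also a favorite trick for cheap multiplies by small constants, since a base register can be combined with a scale factor of 2, 4, or 8:
lea eax,[ebx+ebx*4] ; eax = ebx * 5
lea eax,[eax+eax*2] ; eax = (ebx * 5) * 3 = ebx * 15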
Translation Table
An alternate method of lookup simply uses the 8-bit AL register in conjunction with an indirect memory
reference for a 256-byte table lookup.
XLAT
XLATB
xlat mSrc8 [Un]signed
xlatb
This general-purpose instruction does a table lookup by adding the AL register to the DS:[eBX] address and
copies the selected memory location to the AL register. It is equivalent to the following:
mov al,[rbx+al] ; 64-bit Mode
mov al,ds:[ebx+al] ; Protected Mode (32-bit)
mov al,ds:[bx+al] ; Real Mode (16-bit)
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
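A minimal use with the hex digit table defined just below (the value placed in AL is arbitrary):
lea ebx,HexTbl ; ebx -> "0123456789ABCDEF"
mov al,0bh ; a value in the range 0...15
xlatb ; al now holds the ASCII character 'B'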
If you are building your own logging mechanism for your application and you are not using C where you could
use a function similar to:
printf("0x%x", dw);
then you will probably want to use the following binary to hex text print algorithm. Or at least some form of it!
Okay, I am already breaking rules of precedence where I am using instructions before they are discussed. But
bear with me!
HexTbl: db "0123456789ABCDEF"
; void HexDmp32(uint32 dw);
push ebx
dec ecx
jnz $L1 ; Loop for 8 letters
pop ebx
pop ebp
ret
HexDmp32 endp
State: db 3, 4, 5, 1, 2, 0ffh
lea ebx,State
mov al,0
String Instructions
The following instruction groups LODS, STOS, and MOVS are string functions, which are discussed shortly.
An extra instruction prefix, REP, is used to repeatedly execute a string instruction for a count indicated by the looping count register (CX/ECX/RCX). Which register is used depends upon the addressing mode (16/32/64-bit).
There are additional string instructions not discussed in this chapter. CMPS and SCAS are discussed in
Chapter 9, "Comparison." The INS and OUTS instructions are discussed in Chapter 17, "PC I/O." These
should be used very carefully. There are alternatives that use register pairing and other methods of
optimization. With older processors these were the optimal method of data manipulation but with the newer
processors that is not so much the case any more. Loop counters need to be large values to be effective.
Table 3-13: String function and associated index register(s)
CMPSx
INSx
LODSx
MOVSx
OUTSx
SCASx
STOSx
LODSx
LODSQ 64 64
lods{b/w/d/q} [Un]signed
lods mDst{8/16/32/64}
This instruction reads the contents of memory and loads it into the AL, AX, EAX, or RAX register. It belongs to
the group of string functions but it really has nothing to do with text strings in the C sense. It has to do with
loading strings of memory, which are contiguous bytes of memory. These functions are equivalent to:
32-bit mode
lodsb             lodsw             lodsd
mov al,ds:[esi]   mov ax,ds:[esi]   mov eax,ds:[esi]
inc esi           add esi,2         add esi,4

64-bit mode
lodsb             lodsw             lodsd             lodsq
mov al,[rsi]      mov ax,[rsi]      mov eax,[rsi]     mov rax,[rsi]
inc rsi           add rsi,2         add rsi,4         add rsi,8
REP LODSx
A single read operation can be performed or a repeat sequence specified by a REP prefix word and a count
specified in the RCX register in 64-bit mode, the ECX register in Protected Mode, or the CX register in Real
Mode.
rep lodsd

L3: mov eax,ds:[esi]
add esi,4
dec ecx
jne L3

rep lodsw

L3: mov ax,ds:[si]
add si,2
dec cx
jne L3
The LOOP instruction could be used instead of DEC and JNE but do not do so, as it is considered complex
and actually costs CPU cycles on more advanced model processors. It takes a very special application indeed
that does nothing but read a series of bytes into the destination register without processing them. The only
result is that only the last data read is remembered! And the source index ESI would point to the next available
memory location.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
STOSx
STOSQ 64 64
stos{b/w/d/q} [Un]signed
stos mDst{8/16/32/64}

This instruction writes the contents of the AL, AX, EAX, or RAX register to memory. It belongs to the group of string functions, but it really has nothing to do with text strings in the C sense; it has to do with strings of memory, which are contiguous bytes of memory. A single write operation can be performed, or a repeat sequence can be specified by a REP prefix word and a count specified in the RCX register in 64-bit mode, the ECX register in Protected Mode, or the CX register in Real Mode. These functions are equivalent to:
32-bit mode
stosb                 stosw                 stosd
mov es:[edi],al       mov es:[edi],ax       mov es:[edi],eax
inc edi               add edi,2             add edi,4

64-bit mode
stosb                 stosw                 stosd                 stosq
mov [rdi],al          mov [rdi],ax          mov [rdi],eax         mov [rdi],rax
inc rdi               add rdi,2             add rdi,4             add rdi,8
REP
REPE
REPNE
REPNZ
REPZ
rep ins m(8/16/32), dx
rep outs dx, m(8/16/32)
rep movs mDst, mSrc(8/16/32/64)
rep stos mDst(8/16/32/64)
rep lods mDst, al
rep lods mDst, ax
rep lods mDst, eax
rep lods mDst, rax
REP STOSx
rep stosd
L3: mov es:[edi],eax
add edi,4
dec ecx
jne L3
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
This function is great for setting a block of memory with a particular value such as for clearing large blocks of
memory.
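Where the Microsoft compiler is the target, the __stosd intrinsic from <intrin.h> is one way to get a rep stosd fill without hand-written assembly. A minimal sketch, assuming MSVC; the buffer and fill pattern are arbitrary:

#include <intrin.h>

#define COUNT 1024

int main(void)
{
    unsigned long buffer[COUNT];

    /* Equivalent of: mov eax,0xA5A5A5A5 / mov ecx,COUNT / rep stosd */
    __stosd(buffer, 0xA5A5A5A5UL, COUNT);
    return (int)buffer[0];
}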
MOVSx
MOVSQ 64 64
movs{b/w/d/q} [Un]signed
movs mDst, mSrc{8/16/32/64}
This instruction copies the contents of memory addressed by the DS:[SI], DS:[ESI], or [RSI] source register
pair to the ES:[DI], ES:[EDI], or [RDI] destination register pair. It is similar to using both the LODSB/W/D/Q and
STOSB/W/D/Q functions except that the AL/AX/EAX/RAX registers are not used. It belongs to the group of
string functions, but it really has nothing to do with text strings in the C sense; it has to do with moving strings
of memory. This is the only memory-to-memory move. The strings are contiguous bytes of memory. These
functions are equivalent to:
32-bit mode (movsd shown; movsb and movsw follow the same pattern)
mov eax,ds:[esi]
mov es:[edi],eax
add esi,4
add edi,4
REP MOVSx
A single move operation can be performed, or a repeat sequence can be specified by a REP prefix word and a count specified in the RCX register in 64-bit mode, the ECX register in Protected Mode, or the CX register in Real Mode.
rep movsd
L3: mov eax,ds:[esi]
add esi,4
mov es:[edi],eax
add edi,4
dec ecx
jne L3
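The matching block copy can likewise be expressed with the MSVC __movsd intrinsic (also declared in <intrin.h>), which compiles down to a rep movsd. A minimal sketch, assuming MSVC:

#include <intrin.h>

#define COUNT 256

int main(void)
{
    unsigned long src[COUNT];
    unsigned long dst[COUNT];

    __stosd(src, 0x12345678UL, COUNT);  /* fill the source first */

    /* Equivalent of: mov esi,src / mov edi,dst / mov ecx,COUNT / rep movsd */
    __movsd(dst, src, COUNT);
    return (int)dst[COUNT - 1];
}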
CLD
STD
cld
std
The direction flag is only used by the string instructions to indicate whether the RSI/RDI, ESI/EDI, and SI/DI
registers are advanced in a forward or reverse direction. CLD is the norm, where the advance is in a forward
(n+1) direction. STD indicates a reverse (n–1) direction.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
but should be avoided whenever possible.
Temporal data — Memory that requires multiple accesses and therefore is loaded into a cache (as a default)
for normal access.
Non-temporal hint — An indicator to the processor that memory only requires a single access (one shot).
This would be similar to copying a block of memory or performing a calculation, but the result is not going to be
needed for a while so there is no need to write it into the cache; thus the memory access has no need to read
and load cache, and therefore the code can be faster!
For more information related to the cache, see Chapter 18, "System."
MOVNTDQ
MOVNTI
MOVNTQ
SSE2  movntdq mDst128, xmmSrc [Un]signed 128
SSE   movntq  mDst64,  mmxSrc [Un]signed 64
SSE2  movnti  mDst32,  rSrc32 [Un]signed 32
EM64T movnti  mDst64,  rSrc64 [Un]signed 64
The MOVNTDQ instruction copies 128 bits of (non-floating-point) data from xmmSrc to mDst128 using a non-
temporal hint.
The MOVNTQ instruction copies 64 bits of (non-floating-point) data from mmxSrc to mDst64 using a non-
temporal hint.
The MOVNTI instruction copies 32 bits of (non-floating-point) data from the general-purpose register rSrc32 to mDst32 using a non-temporal hint. In 64-bit mode, 64 bits of (non-floating-point) data are copied from a 64-bit general-purpose register rSrc64 to memory destination mDst64 using a non-temporal hint.
Hint These instructions generate a non-temporal hint to the processor (indicating one-shot data)
that it can bypass the cache, which can make them faster than a standard copy to memory
instruction. Standard compilers do not support these instructions, which is yet another reason
to write some of your code in assembly.
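Most compilers will not generate these instructions on their own, but many do expose them through intrinsics. A minimal sketch using the SSE2 intrinsic _mm_stream_si128 from <emmintrin.h> (which emits MOVNTDQ); the buffer size and fill pattern are arbitrary, and the destination must be 16-byte aligned:

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence */

#define QWORDS 256       /* 256 * 16 bytes */

int main(void)
{
    /* movntdq requires a 16-byte aligned destination; a static __m128i array is. */
    static __m128i dst[QWORDS];
    __m128i pattern = _mm_set1_epi32(0x5A5A5A5A);
    int i;

    for (i = 0; i < QWORDS; i++)
        _mm_stream_si128(&dst[i], pattern);   /* non-temporal (one-shot) store */

    _mm_sfence();        /* make the streaming stores globally visible */
    return 0;
}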
MOVNTPS
SSE movntps mDst128, xmmSrc Single-precision 128
The MOVNTPS instruction copies 128 bits of four (packed single-precision floating-point) elements from
xmmSrc to mDst128 using a non-temporal hint.
MOVNTPD
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
SSE2 movntpd mDst128, xmmSrc Double-precision 128
The MOVNTPD instruction copies 128 bits of two (packed double-precision floating-point) elements from
xmmSrc to mDst128 using a non-temporal hint.
MASKMOVQ
MASKMOVDQU
maskmovq mmxSrcA, mmxSrcB 64
maskmovdqu xmmSrcA, xmmSrcB 128
This instruction copies up to 128/64 bits of memory from xmmSrcA (mmxSrcA) to DS:[EDI] depending on the
MSB of each correlating byte in the mask indicated in xmmSrcB (mmxSrcB). If an 8-bit mask element of SrcB
has its bit set, then that same correlated 8-bit element of SrcA is copied to the destination memory.
This instruction generates a non-temporal hint to the processor (indicating one-shot data) that it can bypass
the cache!
Note that for the MMX instruction MASKMOVQ, the CPU forces a transition from x87 to MMX mode.
Exercises
0x12, 0x56, 0x89, 0x23, 0xEF, 0x89, 0x28, 0xC3
0xE2, 0xFF, 0x04, 0x23, 0x49, 0x41, 0x74, 0x3F
0x56, 0x89, 0xAA, 0xB2, 0xC7, 0x38, 0x28, 0x2A
0x28, 0x28, 0x42, 0x73, 0x82, 0xDE, 0xF3, 0x28
1. Show 128-bit data for SSE in proper endian order, for 8-bit, 16-bit, 32-bit, and 64-bit block sizes.
2. 128 bits = four single-precision floats. How wide would the next generation processor have to be for
quad vector double-precision? Write a memory handler for this new hypothetical type processor.
3. Future super matrix processors will be able to number-crunch entire matrices at a time. How wide
would the data path be? Write an alignment macro for that width. Hint: There are two primary solutions!
4. Earlier in this chapter the function HexDmp32() demonstrated a 32-bit (8 ASCII byte) hex dump. Write a
slightly different function that prints a string of 8-bit bytes in hex of a specified count.
void HexDmp(byte *pData, uint nCount);
Chapter 4: Bit Mangling
Overview
For about seven years my family and I lived in the California Sierras. During that time I developed a passion
for rustic mountain living as well as most environmental associations related to the mountains and the Old
West: snow, historic mining towns, coaches and wagons, treasure hunting, narrow gauge railroads, wildland
fire fighting, and other miscellaneous rural benefits. Now that I have milled that Old West imagery into your
mind "ore" bored you to death, you can continue reading what I fondly refer to as the mangling of bits. This is
one of my favorite sections, because with their use, I typically devise speedier methods for use in the
manipulation of data. I label this "thinking out of the box," sometimes referred to as "Thar's gold in them thar
bits!"
Bit mangling relates to the individual twiddling of bits using Boolean logic such as NOT, AND, OR, XOR, or
some combination thereof. Each bit is individually isolated so no bit affects any adjacent bit encapsulated by
the register. In vector mathematical operations, groups of bits are isolated so that an operation on one vector
group does not affect another. Boolean operations are similar but on an individual bit basis. Each group in this
case is a group of one bit; thus, an operation to a single bit does not affect an adjacent bit. This is why there
are no vector-type Boolean operations. There are, however, operations that use 32/64-bit general-purpose registers, 64-bit MMX registers, and 128-bit SSE2 registers for Boolean operations, so as to manipulate more bits simultaneously in parallel.
Hint In a manner of speaking, all processors that support Boolean operations on pairs of bits have
a degree of SIMD support.
These are heavily used by algorithms utilizing vectors, which is why they are in this book. Included in this
chapter are the electronic symbols for each logical operation. Typically, I use my own books for reference, and
from time to time I have found that drawing logic circuits using digital logic symbols actually makes more
complex Boolean logic algorithms easier for me to simplify. Maybe it will work the same for you.
Any processor professing to contain a multimedia, SIMD, packed, parallel, or vector instruction set will contain
almost all of the following instructions in one form or another. Parallel instructions typically do not have a Carry
flag as found in some of the older scalar based instruction sets using general-purpose registers, such as the
80x86. They tend to lose overflows through the shifting out of bits, wraparound of data, or saturation. Another
item to note is that not all the displayed diagrams are used by all the various 80x86 processors defined. Over
the years the functionality has been enhanced, so older processors will not have the same abilities as the
newer processors.
It must be reiterated that you watch the alignment of your data objects in memory very closely. It takes extra
overhead to adjust the memory into an aligned state and it is a lot more efficient to ensure that they are aligned
in the first place. Your code will be smaller and faster! This will be made obvious by the sample code included
in this chapter.
You will find in this chapter and throughout this book sections titled "Pseudo Vec." As processors are
enhanced, new superset functionality is given to them such as SIMD operations. Some of you, however, are
still programming for older processors and need the newer functionality, while some of you require a more in-
depth understanding of vector operations. The "Pseudo Vec" sections are for you!
Boolean Logical AND
AND
ANDPD
ANDPS
PAND
and rmDst(8/16/32/64), #(8/16/32) [Un]signed
and rmDst, rSrc(8/16/32/64)
and rDst, rmSrc(8/16/32/64)
MMX  pand  mmxDst, mmxSrc/m64   [Un]signed        64
SSE  andps xmmDst, xmmSrc/m128  Single-precision  128
SSE2 pand  xmmDst, xmmSrc/m128  [Un]signed        128
SSE2 andpd xmmDst, xmmSrc/m128  Double-precision  128
An AND operation means that one would need both A and B to be true to have a true result. The AND general-
purpose instruction logically AND's the 8-, 16-, 32-, or 64-bit source with the destination and stores the result in
the destination. An 8-bit source value can be sign extended to a 16-, 32-, or 64-bit value. A 32-bit source value
can be sign extended to a 64-bit destination. The source can be an immediate value, a register, or memory.
The destination is a register or memory. A memory-to-memory operation is invalid!
Flags O.flow Sign Zero Aux Parity Carry
0 X X - X 0
Flags: With the AND general-purpose instruction, the Carry and Overflow flags are cleared. The Zero flag is
set if all destination bits are 0, thus destination = 0. The Sign flag reflects the state of the MSB.
B 00000000000000001010010110100101b 00000a5a5h
A 00000000000000000000111111110000b 000000ff0h
AB 00000000000000000000010110100000b 0000005a0h
The multimedia extension instruction is a parallel operation that uses a Boolean AND operation upon each of
the corresponding 64 or 128 bits.
The source A and B are aSrc (xmmSrc) and bSrc (xmmDst), and the result is stored in the destination Dst
(xmmDst). The instruction may be labeled as packed, parallel, or vector but each bit is in reality isolated from
each other so there is no need for a data bit block arrangement. No flags are set.
msk dq 00000ffffffff0000h
dat dq 00000a5a50000ff11h
For the SSE instruction, the ANDPS is a bitwise AND of four packed single-precision floating-point values with
a bit mask.
For the SSE2 instruction, the ANDPD is a bitwise AND of two packed double-precision floating-point values
with a bit mask.
Pseudo Vec
The following C code demonstrates the simulated vector functionality of a logical AND upon a 128-bit vector.
The code sample logical AND's the four packed 32-bit values from vector A and B, to effectively AND all 128
bits and then stores the result in vector D. Note that the function stores the result pointed to by the first function
argument.
ASSERT_PTR16(pvD);
ASSERT_PTR16(pvA);
ASSERT_PTR16(pvB);
pD=(uint32*) pvD;
pA=(uint32*) pvA;
pB=(uint32*) pvB;
*(pD+0) = *(pA+0) & *(pB+0);
*(pD+1) = *(pA+1) & *(pB+1);
*(pD+2) = *(pA+2) & *(pB+2);
*(pD+3) = *(pA+3) & *(pB+3);
}
The assertion macro ASSERT_PTR4 could have been used instead, since memory alignment for generic code only needs to be four bytes to support a 32-bit value, but ASSERT_PTR16 was used to ensure 16-byte vector data alignment.
The previous MOV instructions are common to the following code samples and thus not replicated in those
samples, but they need to be recognized as loading general-purpose registers in preparation for those
samples.
mov ebp,[eax+0] ; Read A lower 64 bits
mov esi,[eax+4]
mov ecx,[ebx+0] ; " B " "
mov edi,[ebx+4]
and ebp,ecx ; AND bits (31...0)
and esi,edi ; AND bits (63...32)
mov [edx+0],ebp ; Write lower 64 bits
mov [edx+4],esi
An optimization worth noting is the interlacing of the general-purpose registers into pairs to minimize
dependencies.
mov r8,[rax+0] ; Read A lower 64 bits
mov r9,[rax+8] ; Read A upper 64 bits
mov r10,[rbx+0] ; Read B lower 64 bits
mov r11,[rbx+8] ; Read B upper 64 bits
and r8,r10 ; AND bits (63...0)
and r9,r11 ; AND bits (127...64)
mov [rdx+0],r8 ; Write lower 64 bits
mov [rdx+8],r9 ; Write upper 64 bits
vmp_pand (MMX)
In the following examples the burden is placed upon 64- or 128-bit registers, so the 32-bit general-purpose
registers are only used for memory access. With the MMX instructions, only 64 bits can be manipulated at a
time so the data is handled as upper and lower 64-bit pairs. It also helps minimize processor stalls due to
register dependencies.
Listing 4-4: …\chap04\pbool\PBoolx86M.asm
Note the use of MOVDQA (aligned) used previously and MOVDQU (unaligned) used in the following
examples. The code is virtually identical for both examples, except that if the data is misaligned and if the
MOVDQA instruction is utilized, an exception will occur. If memory alignment cannot be guaranteed, the
following (slightly slower version) should be used instead.
movdqu xmm0,[ebx] ; Read B non-aligned 128 bits
movdqu xmm1,[eax] ; Read A
I was not trying to pull the wool over your eyes or anything. The really nice feature for the SSE and SSE2
instructions is that for both aligned and unaligned data, the code is virtually identical except for the method of
access. The only trick is to make sure it is properly aligned before using MOVDQA. If in doubt, use the
instruction MOVDQU; otherwise an exception will occur upon that misaligned access.
You may now be thinking, but why bother using MOVDQA? Why not just use MOVDQU all the time?
The answer? Your code will run slower, and that is contrary to the reason for writing your code in assembly or
using vector instructions!
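The same aligned-versus-unaligned choice exists with compiler intrinsics, where _mm_load_si128 corresponds to MOVDQA and _mm_loadu_si128 to MOVDQU. A minimal C sketch, assuming SSE2 and <emmintrin.h>; the test values are arbitrary:

#include <emmintrin.h>
#include <stdio.h>

int main(void)
{
    /* __m128i objects are 16-byte aligned, so MOVDQA-style loads are safe here. */
    static __m128i a, b, d;
    unsigned int out[4];

    a = _mm_set_epi32(0x0000FFFF, 0x7FFF0000, 0x00000A5A, 0x50000FF1);
    b = _mm_set1_epi32(0x0F0F0F0F);

    d = _mm_and_si128(_mm_load_si128(&a), _mm_load_si128(&b));   /* pand */

    _mm_storeu_si128((__m128i *)out, d);   /* unaligned store is always safe */
    printf("%08X %08X %08X %08X\n", out[3], out[2], out[1], out[0]);
    return 0;
}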
Boolean Logical OR
OR
ORPD
ORPS
POR
or rmDst(8/16/32/64), #(8/16/32) [Un]signed
or rmDst, rSrc(8/16/32/64)
or rDst, rmSrc(8/16/32/64)
MMX  por  mmxDst, mmxSrc/m64   [Un]signed           64
SSE  orps xmmDst, xmmSrc/m128  Single-precision FP  128
SSE2 por  xmmDst, xmmSrc/m128  [Un]signed           128
SSE2 orpd xmmDst, xmmSrc/m128  Double-precision FP  128
An OR operation means that one would need either A or B to be true to have a true result. The OR general-
purpose instruction logically OR's the 8-, 16-, 32-, or 64-bit source with the destination and stores the result in
the destination. An 8-bit source value can be sign extended to a 16-, 32-, or 64-bit value. A 32-bit source value can be sign extended to a 64-bit destination.
The source can be an immediate value, register, or memory. The destination is a register or memory. A
memory-to-memory operation is invalid!
Flags O.flow Sign Zero Aux Parity Carry
0 X X - X 0
Flags: With the OR general-purpose instruction, the Carry and Overflow flags are cleared. The Zero flag is set
if all destination bits are 0, thus destination =0. The Sign flag reflects the state of the MSB.
B 00000000000000001010010110100101b 0a5a5h
A 00000000000000000000111111110000b 00ff0h
AB 00000000000000001010111111110101b 0aff5h
This multimedia extension instruction is a parallel operation that uses a Boolean OR operation upon each of
the corresponding 64 or 128 bits. The source A and B are aSrc (xmmSrc) OR bSrc (xmmDst), and the result is
stored in the destination Dst (xmmDst). The instruction may be labeled as packed, parallel, or vector, but each
bit is in reality isolated from each other so there is no need for a data bit block arrangement. The following 32-
bit example demonstrates that.
For the SSE instruction, the ORPS is a bitwise OR of four packed single-precision floating-point values with a
bit mask.
For the SSE2 instruction, the ORPD is a bitwise OR of two packed double-precision floating-point values with
a bit mask.
Pseudo Vec
The following C code demonstrates the functionality of a logical OR upon a 128-bit vector. The code sample
logical OR's the four blocks of 32-bit values from vector A and B 32 bits at a time four times to effectively OR
all 128 bits, and then stores the result in vector D. Note that the function stores the result pointed to by the first
function argument.
pD=(uint32*) pvD;
pA=(uint32*) pvA;
pB=(uint32*) pvB;
*(pD+0) = *(pA+0) | *(pB+0); // {31...0}
*(pD+1) = *(pA+1) | *(pB+1); // {63...32}
*(pD+2) = *(pA+2) | *(pB+2); // {95...64}
*(pD+3) = *(pA+3) | *(pB+3); // {127...96}
}
See the code snippets from the previously discussed instruction AND, then substitute the instruction {|, or, por}
for the {&, and, pand} accordingly.
PXOR
XOR
XORPD
XORPS
xor rmDst(8/16/32/64), #(8/16/32) [Un]signed
xor rmDst, rSrc(8/16/32/64)
xor rDst, rmSrc(8/16/32/64)
MMX  pxor  mmxDst, mmxSrc/m64   [Un]signed           64
SSE  xorps xmmDst, xmmSrc/m128  Single-precision FP  128
SSE2 pxor  xmmDst, xmmSrc/m128  [Un]signed           128
SSE2 xorpd xmmDst, xmmSrc/m128  Double-precision FP  128
An XOR operation means that one would need either A or B to be true, but not both, to have a true result. The XOR general-purpose instruction logically XOR's the 8-, 16-, 32-, or 64-bit source with the destination and stores the result in the destination. An 8-bit source value can be sign extended to a 16-, 32-, or 64-bit value. A 32-bit source value can be sign extended to a 64-bit destination. The source can be an immediate value, a register, or memory. The destination is a register or memory. A memory-to-memory operation is invalid.
Flags O.flow Sign Zero Aux Parity Carry
0 X X - X 0
Flags: With the XOR general-purpose instruction, the Carry and Overflow flags are cleared. The Zero flag is
set if all destination bits are 0, thus destination =0. The Sign flag reflects the state of the MSB.
The best method to clear a register is to exclusive OR (XOR) it with itself. Newer processors incorporate optimization logic that recognizes this encoding: when the source and the destination are the same register there is no dependency on a prior read (the end result is cleared bits), so the processor merely writes all bits with zero.
B 00000000000000001010010110100101b 0a5a5h
A 00000000000000000000111111110000b 00ff0h
AB 00000000000000001010101001010101b 0aa55h
This multimedia extension instruction is a parallel operation that uses a Boolean XOR operation upon each of
the corresponding 64 or 128 bits. The source A and B are aSrc (xmmSrc) XOR bSrc (xmmDst), and the result
is stored in the destination Dst (xmmDst). The instruction may be labeled as packed, parallel, or vector, but
each bit is in reality isolated from each other so there is no need for a data bit block arrangement.
For the SSE instruction, XORPS is a bitwise XOR of four packed single-precision floating-point values with a
bit mask.
For the SSE2 instruction, XORPD is a bitwise XOR of two packed double-precision floating-point values with a
bit mask.
Pseudo Vec
The following C code demonstrates the functionality of a logical XOR upon a 128-bit vector. The code sample
logical XOR's the four blocks of 32-bit elements from vector A and B to effectively XOR all 128 bits and then
stores the result in vector D. Note that the function stores the result referenced by the first function parameter
pointer.
pD=(uint32*) pvD;
pA=(uint32*) pvA;
pB=(uint32*) pvB;
*(pD+0) = *(pA+0) ^ *(pB+0); // {31...0}
*(pD+1) = *(pA+1) ^ *(pB+1); // {63...32}
*(pD+2) = *(pA+2) ^ *(pB+2); // {95...64}
*(pD+3) = *(pA+3) ^ *(pB+3); // {127...96}
}
See the code snippets from the previously discussed instruction AND, then substitute the instruction {^, xor,
pxor} for the {&, and, pand} accordingly.
Another use for this operation is as a Boolean NOT (one's complement) operator. By using an input A and
setting B permanently to a logical one, an inverse bit is achieved. Note the following table where zero becomes
one and one becomes zero.
A B Y
0 1 1
1 1 0
Figure 4-1: Example of using a logical XOR with a logical true bit to achieve the equivalent of a logical NOT
condition.
As can be seen, input A and output Y are the exact opposite. Keep in mind that Boolean logic is bit based and
bit isolated so that adjacent bits do not affect each other.
NOT
not rmDst(8/16/32/64) [Un]signed
This instruction logically inverts the bits of the destination. All ones become zeros, and all zeros become ones.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
NEG
neg rmDst(8/16/32/64) [Un]signed
This instruction negates the destination (two's complement); that is, the value is subtracted from zero.
Flags: Carry is cleared if the destination is 0; else it is set. The other flags are set accordingly from the result.
mov eax,00000a5a5h
neg eax
; 00000000000000001010010110100101b 00000a5a5h (42405)
; 11111111111111110101101001011011b 0ffff5a5bh (-42405)
Hint For a NEG (two's complement negation), use a NOT (or an XOR with all ones) followed by an increment.
00000000000000001010010110100101b 00000a5a5h (42405)
NOT 11111111111111110101101001011010b 0ffff5a5ah (-42406)
INC 11111111111111110101101001011011b 0ffff5a5bh (-42405)
Another use for an XOR is in an implementation for a negation (two's complement). As a refresher, a
subtraction is a two's complement followed by an addition. By inverting all bits, a one's complement is
achieved. The next step for the two's complement is by an increment (addition of one) of this value. This is
followed by the addition, which effectively results in the subtraction.
This operation is the result of a Boolean NOT (one's complement) operator that was just discussed, followed
by an increment or an addition by one. Of course, this is not as slick as a reverse subtraction where…
–A = A ISUB 0 thus –A = 0 – A
but not all processors support that. Keep in mind though, that when dealing with pipelining your code, a two-
step operation may be more helpful than a single step.
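A quick C check of that relationship; the values are arbitrary and nothing here is processor specific:

#include <assert.h>

int main(void)
{
    int a = 42405;                 /* 0x0000A5A5 */
    int b = 100000;

    assert(~a + 1 == -a);          /* NOT followed by INC is a negation   */
    assert(b + (~a + 1) == b - a); /* ...and adding that is a subtraction */
    return 0;
}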
if (FLIP == nf)
{
nf = FLOP;
}
else
{
nf = FLIP;
}
…or…
nf = (FLIP == nf) ? FLOP : FLIP;
No matter which way you coded it, it is the same, identical code. This is fine and dandy, but the branching and
especially a possible misprediction of a branch takes time and there are two branches to contend with. If in a
flip, it flops, and if in a flop, it flips. Of course, instead of two branches such as the previous code snippet, it
could always be coded for one branch such as follows:
nf = FLIP;
if (FLIP == nf)
{
nf = FLOP;
}
The code, if not a FLIP as in the previous code snippet, branches around and continues on, but again, there
could be a misprediction.
A misprediction is just what it sounds like. At an if-then conditional, a more advanced CPU will predict whether it will take the branch and do the conditional or branch around, thus the if-then-else. The CPU gains efficiency by predicting that it is correct, because it is preloading memory and, in some cases, executing instructions that are further down in the code. The punishment is that if it predicted wrong, that memory has to be thrown away and the results of the calculations it processed ahead of time disposed of. Then it
needs to continue processing down the correct path. Either way, this is very time consuming, and so
alternative (branchless) methods need to be devised if possible.
My favorite solution is a branchless result so there is no misprediction and the appropriate value can be
selected with a butterfly switch. Let's examine these two values more closely:
FLOP = 47 = 0x002F = 0000000000101111b
FLIP = –30 = 0xFFE2 = 1111111111100010b
…and in place of the branching code the following snippet is used instead:
nf ^= FLIPPY; // FLIP/FLOP butterfly, where FLIPPY = FLIP ^ FLOP
…then each time the code is executed it will flip to the alternate value.
Pretty cool, huh! And the best part is the actual code is a single instruction, not a group of instructions to
process a branch and decide whether to branch or not! The code runs a lot faster, and it is smaller. This also works with values that are not compile-time definitions. Initialize with the following:
nf = valueA; // First value
iFlipVal = nf ^ valueB; // Butterfly key
… and it works great in a parallel configuration. Different butterflies to control different elements, all in parallel.
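A minimal C sketch of the butterfly switch just described, using the FLIP/FLOP values from the text; the variable names are mine:

#include <stdio.h>

#define FLOP  47        /* 0x002F                    */
#define FLIP (-30)      /* 0xFFE2 as a 16-bit pattern */

int main(void)
{
    int nf      = FLIP;           /* first value       */
    int flipKey = FLIP ^ FLOP;    /* the butterfly key */
    int i;

    for (i = 0; i < 6; i++)
    {
        printf("%d\n", nf);       /* alternates -30, 47, -30, ... */
        nf ^= flipKey;            /* branchless flip/flop         */
    }
    return 0;
}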
I-VU-Q
I know this is not x86 related, but what instruction is thought to be missing from the VU coprocessor on the
PS2?
The XOR!!!
This seems to be a popular interview question, as I have encountered it numerous times. Yes, I know, I
digress. The 80x86 supports the XOR instruction as was discussed, but this is good practice for the other
Boolean instructions that have been touched on in this chapter.
After interviewers ask this question, they then sometimes want to know how to write equivalent code. Funny
thing is that they do not seem to remember the answer themselves. I will hastily draw out the following truth
table from left to right, and then draw the circuit for good measure. (That is, if I am not too tired and frustrated
from having to answer programming questions all day long!)
Figure 4-2: A four-gate (OR, NOT-AND, AND) solution to achieve an equivalent result of a logical XOR
I will then hesitate for a second and then announce, "But wait!" This takes four logical operations — an OR,
AND, NOT, then AND. So instead let's make this a bit smaller. If ANDC type functionality is used (which has
not been discussed yet)…
Figure 4-3: A two-gate (ANDC, OR) solution to achieve an equivalent result of a logical XOR
…notice the swapped inputs and the gate functionality similar to a logical AND, then the outputs are logical OR
gated together. So now it is down to two!
But wait. What if there were no NOT, NOR, XOR, or ANDC instruction? At this point, if you replicate what is
being presented here at your interview as a response to the question, you will either come off sounding like a
really smart guy or a prima donna, depending on how you do it. So you then announce, "But wait! There is no
NOT or ANDC, so how do we do it!" (Keep in mind there is a NOT and an ANDC on the 80x86.) "Ever hear of
a half-adder?" It has the same result as that of an XOR except that it also contains a carry, and that is where
the problem resides. That carry bit is contaminating an adjacent bit.
Table 4-1: A half-adder solution. By ignoring the carry, a logical XOR will result.
A B Carry Y
0 0 0 0
0 1 0 1
1 0 0 1
1 1 1 0
So if the bits are stripped into an odd-even arrangement, the A and B (odd bits) are summed, then masked
with the 1010 bit mask pattern. The A and B (even bits) are summed, then masked with the 0101 even bit
mask pattern. The results of the odd and even are logical OR'd with each other; effectively a logical XOR is
simulated. Let's examine some 16-bit data:
oeoeoeoeoeoeoeoe
B 1010010110100101b 0a5a5h
A 0000111111110000b 00ff0h
AB 1010101001010101b 0aa55h
So effectively all the odd bits are separated from the even bits. Odd bits are stored in a {o0o0…o0o0} bit
pattern while the even bits are stored in a {0e0e…0e0e} bit pattern. So by a logical AND with the odd {10} and
even {01} binary masks, an un-interlaced form is generated. This allows the odd and even bits to be handled
separately, and the overflow from any operation is stored in the adjacent (next higher) bit. This means any
carry will be ignored when the odd and even results are blended back together. The carry results will be
removed using the same odd or even mask. Notice the bits indicated in bold are the usable result from the
AND.
B 1010010110100101b 0a5a5h
Even B 0101010101010101b 05555h (Mask)
Even's B 0000010100000101b 00505h
B 1010010110100101b 0a5a5h
Odd B 1010101010101010b 0aaaah (Mask)
Odd's B 1010000010100000b 0a0a0h
A 0000111111110000b 00ff0h
Even A 0101010101010101b 05555h (Mask)
Even's A 0000010101010000b 00550h
A 0000111111110000b 00ff0h
Odd A 1010101010101010b 0aaaah (Mask)
Odd's A 0000101010100000b 00aa0h
Now the even and odd values of A and B are summed up separately. Note that we only care about the
resulting bits in bold and not the others, as those are the individual carries, which are stripped by the logical
AND of the original mask.
Even's B 0000010100000101b 00505h
Even's A + 0000010101010000b 00550h
0000101001010101b 00a55h
Even 0101010101010101b 05555h (Mask)
Even AB 0000000001010101b 00055h
Now logical OR the even bits and odd bits back together for the interlaced XOR result.
Even AB 0000000001010101b 00055h
Odd AB 1010101000000000b 0aa00h
AB 1010101001010101b 0aa55h
…exactly the same. Okay, okay, a lot more operations, but just another technique for your repertoire. But now
that you have special insight into the problem, here is the start of solving this equation for yourself!
You did not think that I would be giving you all the answers, did you?
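As a starting point only, here is the odd/even add-and-mask walkthrough above expressed as a C sketch over 16-bit values; generalizing it is still left to you:

#include <stdio.h>

/* Simulate A XOR B using only AND, OR, and addition (half-adder trick). */
static unsigned short FakeXor16(unsigned short a, unsigned short b)
{
    unsigned short evenMask = 0x5555, oddMask = 0xAAAA;
    unsigned short evenSum  = (unsigned short)((a & evenMask) + (b & evenMask));
    unsigned short oddSum   = (unsigned short)((a & oddMask)  + (b & oddMask));

    /* The carries land in the adjacent (masked-off) bit, so strip them. */
    return (unsigned short)((evenSum & evenMask) | (oddSum & oddMask));
}

int main(void)
{
    unsigned short a = 0x0FF0, b = 0xA5A5;
    printf("%04X %04X\n", (unsigned)FakeXor16(a, b), (unsigned)(a ^ b));  /* both print AA55 */
    return 0;
}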
Hint There is one more method available to do a logical NOT. A subtraction is a two's complement
as it is a bit flip (one's complement) followed by an increment (two's complement), and then
an addition. If the result of a subtraction from zero is decremented, you effectively get a NOT
(one's complement)! A' = (0 - A) - 1
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
ANDNPD
ANDNPS
PANDN
MMX pandn mmxDst, mmxSrc/m64 [Un]signed 64
SSE  andnps xmmDst, xmmSrc/m128  Single-precision FP  128
SSE2 pandn  xmmDst, xmmSrc/m128  [Un]signed           128
SSE2 andnpd xmmDst, xmmSrc/m128  Double-precision FP  128
This multimedia extension instruction is a parallel operation that uses a Boolean NOT AND operation upon
each of the corresponding 64 or 128 bits. The source A and B are aSrc (xmmSrc) and a one's complement of
bSrc (xmmDst), and the result is stored in the destination Dst (xmmDst). The instruction may be labeled as
packed, parallel, or vector, but each bit is in reality isolated from each other so there is no need for a data bit
block arrangement. The following 32-bit example demonstrates this.
B 00000000000000001010010110100101b 00000a5a5h
B NOT 11111111111111110101101001011010b 0ffff5a5ah
A 00000000000000000000111111110000b 000000ff0h
A(B) 00000000000000000000101001010000b 000000a50h
For the SSE instruction, ANDNPS is a bitwise NOT AND of four packed single-precision floating-point values
with a bit mask.
For the SSE2 instruction, ANDNPD is a bitwise NOT AND of two packed double-precision floating-point values
with a bit mask.
Pseudo Vec
The following C code demonstrates the functionality of a logical ANDC upon a 128-bit vector. The code sample
logical NOT's the bits from vector B, then AND's these bits with vector A, 32 bits at a time, four times, to
effectively ANDC all 128 bits, and then stores the result in vector D. Note that the function stores the result
referenced by the first function parameter pointer.
pD=(uint32*) pvD;
pA=(uint32*) pvA;
pB=(uint32*) pvB;
*(pD+0) = (0xffffffff ^ *(pB+0)) & *(pA+0);
*(pD+1) = (0xffffffff ^ *(pB+1)) & *(pA+1);
*(pD+2) = (0xffffffff ^ *(pB+2)) & *(pA+2);
*(pD+3) = (0xffffffff ^ *(pB+3)) & *(pA+3);
}
See the code snippets from the instruction AND, then substitute the instruction {pandn} for the {pand}
accordingly.
Exercises
1. What is a half-adder?
2. Make a NOT gate using
a. an XOR gate
b. a NAND gate
c. a NOR gate.
3. Write a software algorithm for the following equation. D=(-1-(AB))(AB)
State 0 1 2 0…
Extra credit:
This cross-coupled SR flip-flop will be more familiar to those of you with electronics experience. Using A=0 and
B=0 as the starting state, finish the logic table.
How does this circuit function? What are the behaviors of p and q if the data inputs of A and B are changed to
a new state? Try all 12 possibilities. How would you code it?
Chapter 5: Bit Wrangling
Overview
Think of this chapter as a continuation of the previous chapter, "Bit Mangling."
This chapter deals with bit wrangling and thus the movement of bits. They can be shifted into oblivion, rotated,
and watched going around and around. As Bugs Bunny would say, "Dizzing, ain't it!" Bit wrangling actually
occurs with the rotation and logical and arithmetic shifting of bits within each parallel bit range of packed bits.
Just as in the scalar point of view of a similar general-purpose CPU instruction, the bits can be used for
masking, base two multiplication and division, and other functionality.
It must be reiterated that you watch the alignment of your data objects in memory very closely. It takes extra
overhead to adjust the memory into an aligned state, and it is a lot more efficient to ensure that they are
aligned in the first place. Your code will be smaller and faster! This will be made obvious by the sample code
included in this chapter.
SAL
SHL
shl rmDst(8/16/32/64), 1 [Un]signed
shl rmDst(8/16/32/64), #(2^n-1)
shl rmDst(8/16/32/64), cl

This instruction is a multiplier (2^n), shifting the destination to the left and shifting the most significant bit (MSB) into the carry. The previous content of the carry is shifted out and lost, and a 0 is shifted into the least significant bit (LSB). The count indicating the number of bits to shift is invalid if greater than or equal to the width of the data (2^n). If shifted enough times, the data will be emptied, being set to all zeroes. If count = 0, then there is no shift and thus the destination and flags are not changed. Shift Logical Left and Shift Arithmetic Left are the same identical instruction.
Flags O.flow Sign Zero Aux Parity Carry
X X X - X X
Flags: Carry = last bit shifted out, Zero = 1 if all destination bits are 0, thus destination = 0. If using the shift by
1, then Overflow = 0 if Carry = MSB.
datd dd 32
datw dw 16
datb db 8
With Pentium processors a multiply is pretty quick, but for a 486 processor or older it was sometimes
necessary and faster to do shifting and adding to compensate. For example, a video display with a resolution
of 640x480x8-bit would need to calculate an actual screen (memory pointer) related to an X and Y coordinate.
mov ecx,X
mov edx,Y
mov edi,edx
shl edi,2 ; 4x
add edi,edx ; 5x = (4x+x)
shl edi,7 ; 640x = (5x)×(128)
add edi,ecx ; edi = memory index
mov es:[edi],al ; Write a byte pixel
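The same shift-and-add decomposition in C, with the 640-pixel pitch from the example and coordinates chosen arbitrarily:

#include <stdio.h>

int main(void)
{
    unsigned x = 37, y = 121;            /* example coordinates      */
    unsigned offset;

    offset = (y << 2) + y;               /* 5y   = 4y + y            */
    offset = (offset << 7) + x;          /* 640y = 5y * 128, plus x  */

    printf("%u %u\n", offset, y * 640 + x);   /* both print 77477    */
    return 0;
}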
SHLD
shld rmDst(16/32/64), rSrc(16/32/64), #(2^n-1) [Un]signed
shld rmDst(16/32/64), rSrc(16/32/64), cl

This instruction shifts the upper half of an unsigned 64-bit integer by 2^n: the destination is shifted to the left, the vacated bits are filled from the most significant bits of the source, and the most significant bit (MSB) of the destination is shifted into the carry. The previous content of the carry is shifted out and lost. (It should be noted that in reality the source is not altered, only the destination.) The count indicating the number of bits to shift is invalid if it is greater than or equal to the width of the data (2^n). If you exceed the operand size, the result in the destination is undefined. If count = 0, then there is no shift and the destination and flags are not changed.
Flags O.flow Sign Zero Aux Parity Carry
X X X - X X
Flags: Carry = last bit shifted out, Zero = 1 if all destination bits are 0, thus destination = 0. If using the shift by
1, then Overflow = 1 if a sign change occurred; else it is 0.
datd dd 32
datw dw 16
datb db 8
mov ax,0001001000110100b ; 01234h
mov dx,1010010110100101b ; 0a5a5h
mov cl,4
shld dx,ax,4
shl ax,4
; dx = 0101101001010001b 05a51h carry=0
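The SHLD/SHL pairing is exactly how a wide left shift is built from narrower halves. A C sketch of the same operation on a 64-bit value held as two 32-bit halves; the function and variable names are mine, and the count is assumed to be 1 through 31:

#include <stdio.h>

typedef unsigned int uint32;

/* Shift a 64-bit value held as hi:lo left by 1..31 bits, SHLD-style. */
static void Shl64(uint32 *hi, uint32 *lo, unsigned count)
{
    *hi = (*hi << count) | (*lo >> (32 - count));   /* shld hi,lo,count */
    *lo <<= count;                                  /* shl  lo,count    */
}

int main(void)
{
    uint32 hi = 0x0000A5A5, lo = 0x12345678;
    Shl64(&hi, &lo, 4);
    printf("%08X:%08X\n", hi, lo);   /* prints 000A5A51:23456780 */
    return 0;
}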
Figure 5-1: Miscellaneous examples of data types being shifted to the left by one bit
PSLLx
PSLLDQ
MMX psll(w/d/q) mmxDst, count(#/mmx/m64) [Un]signed 64
SSE2 psll(w/d/q) xmmDst, count(#/xmm/m128) [Un]signed 128
SSE2 pslldq xmmDst, count(#) [Un]signed 128
These multimedia and SIMD extension instructions are parallel operations that logically left shift each of the
data bit blocks, by a count of bits. Depending upon the processor and the instruction, block sizes of 16, 32, 64,
or 128 bits can be shifted by the specified count.
This is a multiplier (2^n): a 0 is shifted into the LSB of each packed value of the source aSrc (xmmSrc), causing all bits to be shifted to the left, and the MSB of each packed value is lost. The result is stored in the destination Dst (xmmDst). There is no Carry flag to save the bit. If the count indicating the number of bits to shift is more than the packed data size minus 1 (15 for words, 31 for double words, 63 for quad words, or 127 for double quad words), then the destination will typically be set to a value of zero. This is most effective when used in an integer math function where multiple numbers need to be multiplied by 2^n concurrently.
Pseudo Vec
Although logical left, right, or arithmetic shifting is supported by almost all processors, parallel shifting is not.
The same parallel effect can be simulated in those particular cases of non-SIMD supporting processors. The
following C code demonstrates it by concurrently shifting four packed 8-bit values to the left. All 32 bits are
shifted to the left in unison but as the remaining bits become corrupted by the adjacent byte, a logical AND
using a mask in the table lookup forces those vacated bits to zero.
#define eMASKFE 0xfefefefe // Mask FE
uint32 val;
As you saw in the previous example, an absolute shift value is the most simplistic way to go, as it is merely a
matter of shifting the n number of bits and masking with the appropriate value.
11111111 11111111
11111110 01111111
11111100 00111111
11111000 00011111
11110000 00001111
11100000 00000111
11000000 00000011
10000000 00000001
Table 5-1: An 8-bit mask for stripping bits in conjunction with shifting data (0…7) bits to the left or to
the right
8-bit Shift 0 1 2 3 4 5 6 7
Left FF FE FC F8 F0 E0 C0 80
Right FF 7F 3F 1F 0F 07 03 01
If the shift factor is unknown, then the algorithm required, such as the following, becomes more complicated as
a table lookup is needed for the masking value.
uint32 llMaskBD[] = { // Left Shift 32-bit 4×8-bit mask
0xffffffff, 0xfefefefe, 0xfcfcfcfc, 0xf8f8f8f8,
0xf0f0f0f0, 0xe0e0e0e0, 0xc0c0c0c0, 0x80808080 };
Since packed 8-bit logical shift is not supported, let us first examine a block of four 8-bit bytes being shifted
simultaneously. Similar to the processors, the shift count needs to be truncated to a value of n–1, thus 0…7 in
this particular case. Different processors have different results (especially with the various C compilers due to
being a feature of the C language). For example, on some compilers a shift of an 8-bit byte by a value of nine
would result in a zero, on some a one, and some, if directly transposing to assembly instructions, would tend to
ignore the extraneous bits. For the 8-bit value only the three least significant bits would be used, for a shift
value of {0…7}. A shift by nine would be effectively a shift by one since a value of nine in Boolean (1001b)
would have its upper bits ignored (1001b) and so only the lower three bits (001b) would be used as a shift
count; thus it becomes a value of one.
This also helps to prevent an out of bounds memory reference in regard to the mask lookup. The value is then
shifted by the adjusted count, and then logical AND'd with the mask. This effectively processes all four values
simultaneously as mentioned!
pD=(uint32*) pvD;
pA=(uint32*) pvA;
count &= (8-1);         // Clip count to data bit size -1
msk = llMaskBD[count];  // Mask lookup, now guaranteed in bounds
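Putting those pieces together, a C sketch of the complete simulated packed shift over one 32-bit block, reusing the llMaskBD table from the text; the function name and test value are mine:

#include <stdio.h>

typedef unsigned int uint32;

static const uint32 llMaskBD[] = {   /* Left Shift 32-bit 4x8-bit mask */
    0xffffffff, 0xfefefefe, 0xfcfcfcfc, 0xf8f8f8f8,
    0xf0f0f0f0, 0xe0e0e0e0, 0xc0c0c0c0, 0x80808080 };

/* Shift four packed 8-bit values left by 'count' in one 32-bit operation. */
uint32 PackedShl8x4(uint32 val, unsigned count)
{
    count &= (8 - 1);                          /* clip count to data bit size - 1 */
    return (val << count) & llMaskBD[count];   /* shift, then strip the spill     */
}

int main(void)
{
    printf("%08X\n", PackedShl8x4(0x80FF017Fu, 1));   /* prints 00FE02FE */
    return 0;
}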
As an MMX register is only 64 bits, the 128-bit value is handled using register pairs. This is actually faster and
easier than these 128-bit C pseudocode samples.
Of course with this instruction, if only a shift by one is needed, the AND and PSLLQ can be replaced with only
a PADDB. The MMX and SSE2 instructions support a byte add, and (A<<1) is equivalent to (A+A)!
So replace this:
psllq mm0,1 ; val << 1
psllq mm1,1
pand mm0,llMaskBQ[1*8] ; Strip lower bit
pand mm1,llMaskBQ[1*8]
…with this:
paddb mm0,mm0
paddb mm1,mm1
One last thing to note about the actual operation: there is no difference between a pre-mask and a post-mask; only the mask value differs!
((A&7F)<<1) is equivalent to ((A<<1)&FE)
If you compare the left and right trace, you should see that they have the same result.
SHR
shr rmDst(8/16/32/64), 1 [Un]signed
shr rmDst(8/16/32/64), #(2^n-1)
shr rmDst(8/16/32/64), cl

This instruction is an unsigned divide by 2^n that shifts the destination to the right and shifts the least significant bit (LSB) into the carry. The previous content of the carry is shifted out and lost, and a 0 is shifted into the most significant bit (MSB). The count indicating the number of bits to shift is invalid if greater than or equal to the width of the data (2^n). If you shift enough times, you will empty the data, setting it to all 0's. If count = 0, then there is no shift and the destination and flags are not changed.
Flags O.flow Sign Zero Aux Parity Carry
X X X - X X
Flags: Carry = last bit shifted out, Zero = 1 if all destination bits are 0, thus destination = 0. If using the shift by 1, then Overflow is set to the starting (most significant) bit of the destination; for larger counts, Overflow is undefined.
datd dd 32
datw dw 16
datb db 8
mov dx,1010010110100101b ; 0a5a5h carry=0
shr dx,1                 ; dx = dx ÷ 2
; dx = 0101001011010010b   052d2h carry=1
mov cl,2                 ; ÷4 = 2^2
shr eax,cl               ; eax = eax ÷ 4

This is typically used as a divisor for use in masking for table lookups such as:

mov eax,nWidth ; # of bytes to alter
shr eax,3      ; divide by (2^3) = (n÷8)
; At this point eax is # of 64-bit #'s
SHRD
shrd rmDst(16/32), rSrc(16/32), #(2^n-1) [Un]signed
shrd rmDst(16/32), rSrc(16/32), cl

This instruction is the lower half of an unsigned 64-bit integer divide by 2^n: the destination is shifted to the right, the vacated bits are filled from the least significant bits of the source, and the least significant bit (LSB) of the destination is shifted into the carry. The previous content of the carry is shifted out and lost. (It should be noted that in reality the source is not altered, only the destination.) The count indicating the number of bits to shift is invalid if it is greater than or equal to the width of the data (2^n). If count = 0, then there is no shift, and the destination and flags are not changed.
Flags O.flow Sign Zero Aux Parity Carry
X X X - X X
Flags: Carry = last bit shifted out, Zero = 1 if all destination bits are 0, thus destination = 0. If using the shift by
1, then Overflow = 1 if a sign change occurred; else it is 0.
datd dd 32
datw dw 16
datb db 8
PSRLx — Parallel Shift Right (Logical)
Figure 5-2: Miscellaneous examples of data types being logical shifted to the right by one bit
PSRLx
PSRLDQ
MMX psrl(w/d/q) mmxDst, count(#/mmx/m64) [Un]signed 64
SSE2 psrl(w/d/q) xmmDst, count(#/xmm/m128) [Un]signed 128
SSE2 psrldq xmmDst, count(#) [Un]signed 128
These multimedia and SIMD extension instructions are parallel operations that logically right shift each of the
data bit blocks by a count of bits. Depending on the processor and the instruction, block sizes of 16, 32, 64, or
128 bits can be shifted by the specified count.
This is a divisor (2^n) of unsigned values: a 0 is shifted into the MSB of each packed value of the source aSrc (xmmSrc), which causes all bits to be shifted to the right, and the LSB of each packed value is lost; the result is stored in the destination Dst (xmmDst). There is no Carry flag to save the bit. If the count indicating the number of bits to shift is more than the packed data size minus 1 (15 for words, 31 for double words, 63 for quad words, or 127 for double quad words), then the destination will typically be set to a value of zero. This is most effective when used in an integer math function where multiple unsigned numbers need to be divided by 2^n concurrently.
Pseudo Vec
The C code simulating the functionality is almost identical to the Parallel Shift (Logical) Left instruction,
previously discussed in this chapter, except in this case a different mask and a shift to the right are used
instead. This should look very similar to you as previously reviewed in SLL; only the bold areas should be
different!
pD=(uint32*) pvD;
pA=(uint32*) pvA;
msk = llMaskBD[count];
There are similarities between left and right logical shifting for the various processor instructions. The function shell code is very similar as lower versus upper bits are masked. The bits are shifted to the left versus to the right. In assembly code, one merely needs to substitute the correct instruction for the same functionality. In simulated C code, the correct mask needs to be chosen. Those instructions, along with their complement, are reflected in the following table.

Table 5-2: Instruction substitution table to convert a previous SLL (Shift Left Logical) instruction into a SRL (Shift Right Logical), as well as masks and their complement

Instructions Masks
ROL
rol rmDst(8/16/32/64), 1 [Un]signed
rol rmDst(8/16/32/64), #(2^n-1)
rol rmDst(8/16/32/64), cl

This instruction shifts the destination to the left, and the most significant bit (MSB) is rotated into the carry and the least significant bit (LSB). The previous content of the carry is lost. This is like a wagon wheel with the bits going counterclockwise around and around, shooting off bits like sparks. The count indicating the number of bits to rotate is invalid if it is greater than or equal to the width of the data (2^n). If count = 0, then there is no shift and the destination and flags are not changed.
Flags O.flow Sign Zero Aux Parity Carry
X X X - X X
Flags: Carry = last bit rotated out of MSB, Zero = 1 if all destination bits are 0, thus destination = 0. If using a
count of 1, then Overflow = the exclusive OR of the two MSBs before the rotate. (This is the same as the Carry
XOR MSB after the rotate!) If the rotation is greater than 1, Overflow is undefined.
datd dd 32
datw dw 16
datb db 8
mov dx,1010010110100101b ; 0a5a5h
rol dx,1
; dx = 0100101101001011b 04b4bh carry=1 OF=1
RCL
rcl rmDst(8/16/32/64), 1 [Un]signed
rcl rmDst(8/16/32/64), #(2^n-1)
rcl rmDst(8/16/32/64), cl

This instruction shifts the destination to the left, rotates the most significant bit (MSB) into the carry, and rotates the previous contents of the carry into the least significant bit (LSB). This is like a wagon wheel with an extra spoke. The count indicating the number of bits to rotate is invalid if it is greater than or equal to the width of the data (2^n). If count = 0, then there is no shift and the destination and flags are not changed.
Flags O.flow Sign Zero Aux Parity Carry
X X X - X X
Flags: Carry = last bit rotated out, Zero = 1 if all destination bits are 0, thus destination = 0. If using a count of
1, then Overflow = the exclusive OR of the two MSBs before the rotate. (This is the same as the Carry XOR
MSB after the rotate!) If the rotation is greater than 1, Overflow is undefined.
datd dd 32
datw dw 16
datb db 8
mov dx,1010010110100101b ; 0a5a5h carry=0
rcl dx,1
; dx = 0100101101001010b 04b4ah carry=1 Z=0 OF=1
This is effective for the shifting of monochrome data or masking planes, effectively making an image appear to
move to the right.
; actual nibble mask f00000000000ffff00e300080ff00ef7
Msk: dd 00000000fh,0ffff0000h,080003e00h,07fe00ff0h
mov eax,[esi]
shl eax,1
mov [edi],eax
Rotate Right
ROR
ror rmDst(8/16/32/64), 1 [Un]signed
ror rmDst(8/16/32/64), #(2^n-1)
ror rmDst(8/16/32/64), cl

This instruction shifts the destination to the right and rotates the least significant bit (LSB) into the carry and the most significant bit (MSB). The previous content of the carry is lost. This is similar to a wagon wheel with the bits going clockwise around and around, shooting off bits like sparks. The count indicating the number of bits to rotate is invalid if it is greater than or equal to the width of the data (2^n). If count = 0, then there is no shift and the destination and flags are not changed.
Flags O.flow Sign Zero Aux Parity Carry
X X X - X X
Flags: Carry = last bit rotated out, Zero = 1 if all destination bits are 0; thus destination = 0. If using a count of
1, then Overflow = the exclusive OR of the two MSBs after the rotate. If the rotation is greater than 1, Overflow
is undefined.
datd dd 32
datw dw 16
datb db 8
mov dx,1010010110100101b ; 0a5a5h
ror dx,1
; dx = 1101001011010010b 0d2d2h carry=1 Z=0 OF=0
RCR
rcr rmDst(8/16/32/64), 1 [Un]signed
rcr rmDst(8/16/32/64), #(2^n-1)
rcr rmDst(8/16/32/64), cl

This instruction shifts the destination to the right and rotates the least significant bit (LSB) into the carry. The previous content of the carry is rotated into the most significant bit (MSB). This is like a wagon wheel with an extra spoke. The count indicating the number of bits to rotate is invalid if it is greater than or equal to the width of the data (2^n). If count = 0, then there is no shift and the destination and flags are not changed.
Flags O.flow Sign Zero Aux Parity Carry
X X X - X X
Flags: Carry = last bit rotated out, Zero = 1 if all destination bits are 0; thus destination = 0. If using a count of
1, then Overflow = the exclusive OR of the two MSBs after the rotate. If rotation is greater than 1, Overflow is
undefined.
datd dd 32
datw dw 16
datb db 8
mov dx,1010010110100101b ; 0a5a5h carry=0
rcr dx,1
; dx = 0101001011010010b 052d2h carry=1 Z=0 OF=1
Bit Scanning
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
BSF
bsf rDst, rmSrc(16/32/64) [Un]signed
This instruction scans the source for the least significant bit (LSB); if found, it sets the destination to the index
(0…n). If the source is set to 0, the destination is undefined.
Flags O.flow Sign Zero Aux Parity Carry
- - X - - -
Flags: If the source is zero, then the Zero flag is set to 1; else it is set to 0.
; ebx = some value
mov ecx,0 ; Preset to zero in case ebx is 0.
mov ebx,eax
bsf ecx,ebx ; ecx contains the bit setting
shr ebx,cl ; Down shift the data
Note Preset the BSF destination to zero if the source is not guaranteed to be nonzero (unless you plan to branch if zero).
BSR
bsr rDst, [Un]signed
rmSrc(16/32/64)
This instruction scans the source for the most significant bit (MSB); if found, it sets the destination to the index
(0..n). If the source is set to 0, the destination is undefined.
Flags O.flow Sign Zero Aux Parity Carry
- - X - - -
Flags: If the source is zero, then the Zero flag is set to 1; else it is set to 0.
The following calculates the number of bits needed to store an unsigned value.
uint CountBits(uint val)
{
    uint nCnt;

#if 01
    // Use this! VERY MUCH FASTER than a bit scanner!
    // Reverse scan for MSB
    __asm {
        mov eax,0       // Set to zero, in case val is zero.
        bsr eax,val     // N...0 Bit index
        inc eax         // N+1...1 # of bits
        mov nCnt,eax
    };
#else
    // Slowpoke bit-scanning loop (up to 31 iterations for a 32-bit value)
    uint nBit = 0x80000000;

    nCnt = 32;
    while ((nCnt > 1) && (0 == (val & nBit)))
    {
        nBit >>= 1;
        nCnt--;
    }
#endif

    return nCnt;
}
So if you had numerous numbers to crunch, which of the code snippet algorithms just shown would you rather
utilize? A couple instructions of assembly in the upper conditional code or looping up to 31 times maximum for
each 32-bit integer value as shown in the lower conditional code?
Sorry if I am offending any of you speed freaks. I personally happen to like to squeeze every
last drop of clock cycle out of my CPU, but development projects are about balance: code size
vs. code speed. Project budget vs. time constraints. There are also portability issues, but
these are usually resolved through abstraction where function pointers are used to vector to
generic C or custom assembler based upon processor requirements.
// Color Masking information
typedef struct
{
    uint RedMask;       // Red color mask
    uint GreenMask;     // Green color mask
    uint BlueMask;      // Blue color mask
    uint RedShift;      // Red shift bits up
    uint GreenShift;    // Green shift bits up
    uint BlueShift;     // Blue shift bits up
    uint zRedShift;     // Red shift bits down
    uint zGreenShift;   // Green shift bits down
    uint zBlueShift;    // Blue shift bits down
} RGBInfo;
// If a fixed palette, then get 16/24/32-bit color masking information.
HBITMAP hBitmap;
byte *pBMap;
RGBInfo info;
rect.left = rect.top = 0;
rect.right = rect.bottom = 2;
// Now analyze
gfxGetColorMasks(nDepth, &info);
}
Now we come to the fun part. Normally we would only call these function pairs once during program
initialization, so it does not need to be very fast. However, the BSF and BSR instructions are two that do not
translate well to C code. They would need a slowpoke bit-shifting loop, so I thought it would be a good
example and a function that you would be able to use in your own graphic applications.
; Color masking information
RGBInfo STRUCT 4
RedMask dd 0 ; Red color mask
GreenMask dd 0 ; Green color mask
BlueMask dd 0 ; Blue color mask
; These bits are used to shift the bits into position
RedShft dd 0 ; Red shift bits up
GreenShft dd 0 ; Green shift bits up
BlueShft dd 0 ; Blue shift bits up
; These bits are used to reduce the bits needed
zRedShft dd 0 ; Red shift bits down
zGreenShft dd 0 ; Green shift bits down
zBlueShft dd 0 ; Blue shift bits down
RGBInfo ENDS
public gfxGetColorMasks
gfxGetColorMasks proc near
push ebp
mov ebp,esp
; Green bits
mov eax,(RGBInfo PTR [esi]).GreenMask
and eax,ebx ; AND with mask
mov (RGBInfo PTR [esi]).GreenMask,eax
bsf ecx,eax ; get LSB set to 1
bsr edx,eax ; get MSB set to 1
mov (RGBInfo PTR [esi]).GreenShft,ecx
sub edx,ecx
add edx,-7
neg edx
mov (RGBInfo PTR [esi]).zGreenShft,edx
;Blue bits
mov eax,(RGBInfo PTR [esi]).BlueMask
and eax,ebx
mov (RGBInfo PTR [esi]).BlueMask,eax
bsf ecx,eax ; get LSB set to 1
bsr edx,eax ; get MSB set to 1
mov (RGBInfo PTR [esi]).BlueShft,ecx
sub edx,ecx
add edx,-7
neg edx
mov (RGBInfo PTR [esi]).zBlueShft,edx
;Red bits
mov eax,(RGBInfo PTR [esi]).RedMask
and eax,ebx
mov (RGBInfo PTR [esi]).RedMask,eax
bsf ecx,eax ; get LSB set to 1
bsr edx,eax ; get MSB set to 1
mov (RGBInfo PTR [esi]).RedShft,ecx
sub edx,ecx
add edx,-7
neg edx
mov (RGBInfo PTR [esi]).zRedShft,edx
Exercises
1. With a 32-bit data element size, what is the result of a logical right shift of the following data by 34 bits? Of an arithmetic right shift? Of a logical left shift?
0xB83DE7820
2. Write a function to count the number of bits to contain a signed value.
3. Write a packed 16× 8-bit SAR, using pseudo vector code.
4. Write a packed 16× 8-bit SAR, using SIMD instructions.
5. Write a packed ROL using SIMD.
a. 8-bit
b. 16-bit
c. 32-bit
6. Write a packed ROR using SIMD.
a. 8-bit
b. 16-bit
c. 32-bit
Chapter 6: Data Conversion
Data Interlacing, Exchanging, Unpacking, and Merging
Data must sometimes be interlaced to get it into a form that can be easily handled. By understanding how to
interlace and de-interlace data, a most productive solution can be found for solving an expression.
The instructions in this chapter are easier to understand through visualization, and each processor has its own set of instructions that it handles. But here is where data swizzling can easily become confusing: converting the output of one instruction into the form needed as the input of another.
As a guide to assist you in remembering big- versus little-endian orientations, the following are the placement
of bytes within the 64-bit data diagrams used in this chapter. Keep in mind that each 64-bit block is a repeat of
the previous block.
Little-Endian
0x88,0x99,0xaa,0xbb,0xcc,0xdd,0xee,0xff (8-bit)
0x9988, 0xbbaa, 0xddcc, 0xffee (16-bit)
0xbbaa9988, 0xffeeddcc (32-bit)
Big-Endian
0x88,0x99,0xaa,0xbb,0xcc,0xdd,0xee,0xff (8-bit)
0x8899, 0xaabb, 0xccdd, 0xeeff (16-bit)
0x8899aabb, 0xccddeeff (32-bit)
The one thing to remember here is that the data elements are isolated from each other. The An placement of each element is related to its position. For example, when related to a quad vector:
So that means that Aw Az Ay Ax are visually on the far right just like A3 A2 A1 A0 for little-endian, and Ax,Ay,Az,Aw
are on the far left just like A0, A1, A2, A3 for big-endian.
As long as you get the element positions correct for your processor, then the data flow represented by the
arrows in the diagrams will be correct.
Note The bit indicators on the diagrams in this section are in little-endian byte order.
Quite often, data needs to be migrated from one form to another, and a single instruction may not be sufficient.
For instance, a matrix is made up of four vectors: Axyzw, Bxyzw, Cxyzw, Dxyzw. This is known as an Array of Structures (AoS). But mathematical operations are typically between like terms such as AxBxCxDx, AyByCyDy,
etc. This is known as a Structure of Arrays (SoA), which is more matrix friendly (and efficient) due to the
simultaneous operation upon the same elements. To get the data from one form to another requires the data
to be manipulated.
Figure 6-2: AoS to SoA SIMD conversion
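For reference, here is one way the conversion in the diagram can be written with SSE. This is a minimal sketch, not a drop-in routine: the source pointer esi, destination pointer edi, and 16-byte alignment of both buffers are assumptions made purely for illustration.
movaps xmm0,[esi+ 0]     ; {Aw Az Ay Ax}
movaps xmm1,[esi+16]     ; {Bw Bz By Bx}
movaps xmm2,[esi+32]     ; {Cw Cz Cy Cx}
movaps xmm3,[esi+48]     ; {Dw Dz Dy Dx}
movaps xmm4,xmm0         ; Keep copies before the unpacks overwrite them
movaps xmm5,xmm2
unpcklps xmm0,xmm1       ; {By Ay Bx Ax}
unpckhps xmm4,xmm1       ; {Bw Aw Bz Az}
unpcklps xmm2,xmm3       ; {Dy Cy Dx Cx}
unpckhps xmm5,xmm3       ; {Dw Cw Dz Cz}
movaps xmm1,xmm0
movlhps xmm0,xmm2        ; {Dx Cx Bx Ax}  X block
movhlps xmm2,xmm1        ; {Dy Cy By Ay}  Y block
movaps xmm3,xmm4
movlhps xmm4,xmm5        ; {Dz Cz Bz Az}  Z block
movhlps xmm5,xmm3        ; {Dw Cw Bw Aw}  W block
movaps [edi+ 0],xmm0
movaps [edi+16],xmm2
movaps [edi+32],xmm4
movaps [edi+48],xmm5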
Byte Swapping
Before exploring various methods of converting data, let us first examine the method most often needed: The
endian conversion! The handling of big- and little-endian was initially discussed in Chapter 3 but let us re-
examine.
If your application is designed to be multiplatform, then having multiple endian declarations might make your life easier. Having Endian16(), Endian32(), Endian64(), and Endian128() conversion functions is one thing, but having extra declarations such as BigEndian64() versus LittleEndian64() that map to either a stub macro or an Endian64 converter will save you some time. The data file being read will be in a known endian orientation, and the target platform knows what it needs; so if the big-endian label is used when the source data is known to be big-endian (and the little-endian label when it is known to be little-endian), then the use of that declaration will resolve any confusion. This will work for any platform!
Table 6-1: Correlation between little- and big-endian orientation and whether a byte swap or a stub
function is implemented
These endian wrappers are shown only for little-endian as this book is mostly meant for little-endian. If you are
truly interested, my Vector Game Math Processors book contains both sets.
Hint Use descriptive big- versus little-endian macros to simplify endian conversion.
Little-Endian
#define VMP_LITTLE_ENDIAN
This is oversimplifying it and there are better methods such as the BSWAP instruction on the 80x86, but this is
a generic method for cross-platform portability.
Listing 6-1: Generic 32-bit endian conversion
int32 Endian32(int32 val)
{
    uint8 buf[4];
    buf[ 0 ]=*(((uint8*)&val)+3); // = [3]
    buf[ 1 ]=*(((uint8*)&val)+2); // = [2]
    buf[ 2 ]=*(((uint8*)&val)+1); // = [1]
    buf[ 3 ]=*(((uint8*)&val)+0); // = [0]
    return *(int32*)buf;
}
Listing 6-2: Generic 16-bit endian conversion
The typecasting camouflages it a bit, but it is merely a byte read-write with inverse offsets. I will leave the
actual endian implementation up to you! Just remember that it is preferable to have the tools handle your
endian conversion so that a game application does not have to. And since tools exercise the same data over
and over for the length of a project, you might as well make them as efficient as possible.
For cross-platform compatibility I refer to the following as a little pretzel logic. It looks a little twisted, but if you
dig a little deeper it becomes what it is — slicker 'n snail snot!
#endif
Note that same endian to same endian assignment merely stubs out the macro, so no conversion is needed or
implemented. One only needs to know what byte order the data is in and what order is needed, and use the
appropriate macro. It will then be cross-platform compatible to all other platforms as long as the endian flag is
set properly for that platform.
BSWAP
bswap rDst(32/64) [Un]signed
This general-purpose instruction does a big/little-endian conversion. It reverses the byte order of a 32-bit or 64-
bit register.
BEFORE D C B A
AFTER A B C D
bswap eax
This is used in the conversion of communications messages from big-endian platforms such as Unix or
Macintosh, or file formats such as TIFF, MIDI, etc.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
The C code equivalent is slow, especially when compared to the speed of a BSWAP instruction. I normally do
not believe in in-line assembly as it makes code less portable to other platforms, but here is one of my rare
exceptions. Note that compiling C with optimization set for speed should truly embed the Endian32 function
into your code like a macro.
Listing 6-4: BSWAP-based 32-bit endian conversion
int32 Endian32(int32 val)
{
    __asm { mov eax,val
            bswap eax
            mov val,eax };
    return val;
}
For those of you working with an embedded 8086…80386 processor, a 16-bit endian conversion can be accomplished with a ROR by 8, which has the same effect as an XCHG of the two byte halves; which one is more efficient depends on the processor manufacturer and model.
When using that same technique for 32-bit endian conversion, it should be noted that the ROR will cause a partial-register stall when a later operation uses the EAX register after the write to AX. So use the BSWAP on the Pentiums!
            Best        Worst
            bswap       ror ax,8
                        ror eax,16
                        ror ax,8
P bytes     2           11
R bytes     3           10
Listing 6-5: 32-bit mode : 64-bit endian conversion
bswap edx ; Lower bits
ret ; edx:eax
Listing 6-6: 64-bit mode : 64-bit endian conversion
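In 64-bit mode the whole conversion collapses to a single BSWAP of a 64-bit register; a minimal sketch (assuming the value arrives in RCX and is returned in RAX, per the Win64 calling convention):
mov rax,rcx     ; Incoming 64-bit value
bswap rax       ; Reverse all eight bytes
ret             ; Result returned in RAX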
PSWAPD
3DNow!+ pswapd mmxDst, mmxSrc/m64    [Un]signed / Single-precision    64
This instruction swaps the upper and lower 32-bit elements of the source and stores the result in the destination mmxDst.
pswapd mm0,mm1
Data Interlacing
PUNPCKLBW
MMX  punpcklbw mmxDst, mmxSrc/m64     [Un]signed    64
SSE2 punpcklbw xmmDst, xmmSrc/m128    [Un]signed    128
punpcklbw mm0,mm1
punpcklbw xmm0,xmm1
This is one of the more popular instructions as it is extremely useful in the expansion of an unsigned data
value. By interlacing a value of zero with an actual value, an 8-bit value is expanded to 16 bits.
A = 0x00000000 B = 0x44332211
D = 00 44 00 33 00 22 00 11
0044 0033 0022 0011
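A concrete sketch of that zero expansion (the register choices are illustrative; mm1 is assumed to already hold B in its lower 32 bits):
pxor mm0,mm0         ; A = 0x00000000, the zero to interlace
punpcklbw mm1,mm0    ; mm1 = 00 44 00 33 00 22 00 11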
punpcklbw mm0,mm0 ; {w w z z y y x x} {u t s r w z y x}
PUNPCKHBW
MMX  punpckhbw mmxDst, mmxSrc/m64     [Un]signed    64
SSE2 punpckhbw xmmDst, xmmSrc/m128    [Un]signed    128
punpckhbw mm0,mm1
punpckhbw xmm0,xmm1
movq mm7,fooa
movq mm6,foob
punpckhbw mm7,mm6
; 80 00 00 3f 00 7f 00 ff ff ff a5 a5 5a 5a 00 00
; became
; 80 ff 00 ff 00 a5 3f a5
punpckhbw mm0,mm0 ; {u u t t s s r r} {u t s r w z y x}
PUNPCKLWD
MMX  punpcklwd mmxDst, mmxSrc/m64     [Un]signed    64
SSE2 punpcklwd xmmDst, xmmSrc/m128    [Un]signed    128
punpcklwd mm0,mm1
punpcklwd xmm0,xmm1
movq mm7,fooa
movq mm6,foob
punpcklwd mm7,mm6
punpcklwd xmm0,xmm0 ; {w w z z y y x x} {u t s r w z y x}
PUNPCKHWD
MMX  punpckhwd mmxDst, mmxSrc/m64     [Un]signed    64
SSE2 punpckhwd xmmDst, xmmSrc/m128    [Un]signed    128
punpckhwd mm0,mm1
punpckhwd xmm0,xmm1
punpckhwd xmm0,xmm0 ; {u u t t s s r r} {u t s r w z y x}
PUNPCKLDQ
UNPCKLPS
MMX punpckldq mmxDst, mmxSrc/m64 [Un]signed 64
SSE unpcklps xmmDst, xmmSrc/m128 Single-precision 128
SSE2 punpckldq xmmDst, xmmSrc/m128 [Un]signed 128
punpckldq mm0,mm1
unpcklps xmm0,xmm1
punpckldq xmm0,xmm1
PUNPCKHDQ
UNPCKHPS
MMX punpckhdq mmxDst, mmxSrc/m64 [Un]signed 64
SSE unpckhps xmmDst, xmmSrc/m128 Single-precision 128
SSE2 punpckhdq xmmDst, xmmSrc/m128 [Un]signed 128
punpckhdq mm0,mm1
unpckhps xmm0,xmm1
punpckhdq xmm0,xmm1
movq mm7,fooa
movq mm6,foob
punpckhdq mm7,mm6
If a "splat" functionality is needed, it can be emulated with the following:
punpckhdq mm0,mm0 ; 64-bit {y y} {y x}
punpckhdq xmm0,xmm0 ; 128-bit {w w z z} {w z y x}
MOVSS
SSE movss xmmDst, xmmSrc/m32    Single-precision    128
"   movss xmmDst/m32, xmmSrc
This SSE instruction copies the least significant single-precision floating-point scalar value from the source aSrc to the destination Dst. The scalar copy can be XMM register to XMM register, XMM register to 32-bit memory, or 32-bit memory to XMM register.
MOVQ2DQ
SSE2 movq2dq xmmDst, mmxSrc    [Un]signed    128
This SSE2 instruction copies the 64-bit value from the MMX source register mmxSrc to the least significant 64 bits of the XMM destination Dst and zeroes the upper 64 bits of Dst. There is no memory form of this instruction.
MOVDQ2Q
SSE2 movdq2q mmxDst, xmmSrc    [Un]signed    64
This SSE2 instruction copies the least significant 64 bits of the XMM source register xmmSrc to the MMX destination register Dst. There is no memory form of this instruction.
MOVLPS
SSE movlps xmmDst, mSrc64 Single-precision 128
" movlps mDst64, xmmSrc
This SSE instruction copies two packed single-precision floating-point values from 64-bit memory aSrc to the two least significant single-precision floating-point elements of the XMM destination Dst; the other elements of Dst remain unchanged. The reverse form copies the two least significant elements of the XMM source to 64-bit memory.
MOVHPS
SSE movhps xmmDst, m64 Single-precision 128
" movhps m64, xmmSrc
When the source is memory, this SSE instruction copies the two single- precision floating-point values from
64-bit memory aSrc and copies them to the two most significant single-precision floating-point elements within
an XMM register specified by Dst. When aSrc is an XMM register, the two most significant single-precision
floating-point values are copied to 64-bit memory Dst.
MOVLHPS
SSE movlhps xmmDst, xmmSrc Single-precision 128
This SSE instruction copies the two least significant single-precision floating-point values from XMM source
register aSrc to the two most significant single-precision floating-point elements of the XMM register
destination Dst. The other elements of Dst remain unchanged.
movlhps xmm0,xmm0 ; {y x y x} {w z y x}
MOVHLPS
SSE movhlps xmmDst, xmmSrc Single-precision 128
This SSE instruction copies the two most significant single-precision floating-point values from the XMM
register aSrc to the two least significant single-precision floating-point elements of destination XMM register
Dst. The other elements of Dst remain unchanged.
movhlps xmm0,xmm0 ; {w z w z} {w z y x}
MOVSD
SSE2 movsd xmmDst, xmmSrc/m64 Double-precision 128
" movsd xmmDst/m64, xmmSrc
When the source is memory, this SSE2 instruction copies the double- precision floating-point value from 64-bit
memory aSrc and copies it to the least significant double-precision floating-point element of the XMM
destination register specified by Dst. The upper double-precision floating-point value is unchanged. When aSrc
is an XMM register, the lower double-precision floating-point value is copied to 64-bit memory.
Note This MOVSD instruction should not be confused with the string instruction MOVSD. This
instruction uses XMM registers, not general-purpose registers, and does not work with the
REP prefix.
MOVLPD
MOVSD
SSE2 movsd xmmDst, mSrc64     Double-precision    128
"    movsd mDst64, xmmSrc
"    movlpd xmmDst, mSrc64
"    movlpd mDst64, xmmSrc
The MOVLPD instruction copies the double-precision floating-point value from 64-bit memory aSrc to the lower
64 bits of the XMM register or from the lower 64 bits of the XMM register to 64-bit memory. The upper double-
precision floating-point value in the XMM register is unchanged when the destination is the XMM register.
Note The MOVLPD and MOVSD instructions appear to be functionally identical. However, the
MOVSD can optionally move XMM to XMM, while the MOVLPD cannot!
MOVHPD
SSE2 movhpd xmmDst, m64 Double-precision 128
" movhpd m64, xmmSrc
The MOVHPD instruction copies the double-precision floating-point value from 64-bit memory aSrc to the
upper 64 bits of the XMM register or from the upper 64 bits of the XMM register to 64-bit memory. The lower double-precision floating-point value in the XMM register is unchanged when the destination is the XMM register.
PUNPCKLQDQ — Parallel Copy Lower (2×64-Bit)
PUNPCKLQDQ
UNPCKLPD
SSE2 punpcklqdq xmmDst, xmmSrc/m128    [Un]signed          128
"    unpcklpd xmmDst, xmmSrc/m128      Double-precision    128
punpcklqdq xmm0,xmm1
unpcklpd xmm0,xmm1
punpcklqdq xmm0,xmm0 ; {x x} {y x}
PUNPCKHQDQ
UNPCKHPD
SSE2 punpckhqdq xmmDst, xmmSrc/m128    [Un]signed          128
"    unpckhpd xmmDst, xmmSrc/m128      Double-precision    128
punpckhqdq xmm0,xmm1
unpckhpd xmm0,xmm1
punpckhqdq xmm0,xmm0 ; {y y} {y x}
Swizzle, Shuffle, and Splat
The various manufacturers refer to the swapping of data values by different terms: swizzle, shuffle, splat, etc.
Some of these replicate a data value into two or more destination locations. In a few hybrid cases the functions
use a defined distribution or a custom-defined interlacing of source arguments such as was discussed in the
previous section.
The splat functionality is similar to a bug hitting the windshield of an automobile at 70 mph.
A bit sequence is used to indicate which source elements are mapped to which destination elements.
PINSRW
MMX+ pinsrw mmxDst, r32, # [Un]signed 64
" pinsrw mmxDst, m16, # " "
SSE2 pinsrw xmmDst, r64, # " 128
" pinsrw xmmDst, m32, # " "
For 64-bit data there exist four output elements and so an immediate value of 0…3; thus two bits are needed
to identify which element is the destination. The two least significant bits of the index are masked to only allow
a selectable value of 0…3. With 128-bit data there exist eight output elements and therefore a value of 0…7;
thus three bits are used to select the destination.
pinsrw mm0,eax,01b ; 1 {3...0}
The lower 16 bits of the general-purpose register are assigned to one of the four destination 16-bit values
selected by the index.
PSHUFW
MMX+ pshufw mmxDst, (mmxSrc/m64), #    [Un]signed    64
SSE  pshufw mmxDst, (mmxSrc/m64), #    [Un]signed    64
The immediate value indicates which source index is mapped to each of the destination elements. The immediate value is a single 8-bit byte; with four possible source elements needing two bits each, that leaves a maximum of four remappable elements. There are 4×4×4×4 = 4^4 = 256 possible patterns.
pshufw mm0,mm1,10000111b ; 2 0 1 3
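Reading that immediate two bits at a time (a quick decode of the example above):
; 10 00 01 11b
;  |  |  |  +---- Dst[0] = Src[3]
;  |  |  +------- Dst[1] = Src[1]
;  |  +---------- Dst[2] = Src[0]
;  +------------- Dst[3] = Src[2]    -> elements end up as {2 0 1 3}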
PSHUFLW
SSE2 pshuflw xmmDst, (xmmSrc/m128), # [Un]signed 128
The immediate value indicates which source index is mapped to each of the destination elements. The
immediate value is a single 8-bit byte; with four possible source elements needing two bits each, that leaves a
maximum of four remappable elements. This is similar in functionality to PSHUFW; the lower four 16-bit
elements are remappable but the upper four elements are straight mappings and thus a direct copy.
pshuflw xmm0,xmm1,01001110b ; 1 0 3 2
PSHUFHW
SSE2 pshufhw xmmDst, (xmmSrc/m128), #    [Un]signed    128
The immediate value indicates which source index is mapped to each of the destination elements. The
immediate value is a single 8-bit byte, and with four possible source elements needing two bits each, that
leaves a maximum of four remappable elements. This is similar in functionality to PSHUFW; the upper four 16-
bit elements are remappable but the lower four elements are straight mappings and thus a direct copy.
pshufhw xmm0,xmm1,11000110b ; 3 0 1 2
PSHUFD
SSE2 pshufd xmmDst, (xmmSrc/m128), [Un]signed 128
#
pshufd xmm0,xmm1,01001110b ; 1 0 3 2
The immediate value indicates which source index is mapped to each of the destination elements. The
immediate value is a single 8-bit byte, and with four possible source elements needing two bits each, that
leaves a maximum of four remappable elements.
SHUFPS
SSE shufps xmmDst, (xmmSrc/m128), # Single-precision 128
The immediate value is split: the two lowest elements of the destination are selected from the destination's own elements, and the two highest elements of the destination are selected from the source. The immediate value is a single 8-bit byte; with four possible source elements needing two bits each, that leaves a maximum of four remappable elements.
shufps xmm0,xmm1,11100100b ; 3 2 1 0 {3...0}
MOVSLDUP
SSE3 movsldup xmmDst, (xmmSrc/m32) Single-precision 128
The even single-precision floating-point elements from the source are replicated so element #0 is copied to the
two lower destination elements and the source element #2 is copied to the upper two destination elements.
MOVSHDUP
SSE3 movshdup xmmDst, (xmmSrc/m32) Single-precision 128
The odd single-precision floating-point elements from the source are replicated so element #1 is copied to the
two lower destination elements and the source element #3 is copied to the upper two destination elements.
MOVDDUP
SSE3 movddup xmmDst, (xmmSrc/m64)    Double-precision    128
The lower double-precision floating-point element from the source is replicated and copied to the lower and
upper double-precision floating-point elements.
SHUFPD
SSE2 shufpd xmmDst, (xmmSrc/m128), #    Double-precision    128
shufpd xmm0,xmm1,01b ; 0 1 {1...0}
Four possibilities: {x x} {x y} {y x} {y y}
CBW Convert Signed AL (Byte) to AX (Word)
CBW
CDQE
CWDE
cbw Signed 16
cwde Signed 32
cdqe Signed 64
The CBW general-purpose instruction converts the (7+1)-bit signed value in the AL register to a (15+1)-bit
signed value in the AX register.
The CWDE general-purpose instruction converts the (15+1)-bit signed value in the AX register to the (31+1)-bit
signed value in the EAX register.
The CDQE general-purpose instruction converts the (31+1)-bit signed value in the EAX register to the (63+1)-
bit signed value in the RAX register.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
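For example (a small sketch using an arbitrary value):
mov al,-5     ; AL  = 0FBh
cbw           ; AX  = 0FFFBh (still -5)
cwde          ; EAX = 0FFFFFFFBh (still -5)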
The following converts a signed value of [–128…0…127] to a 16-bit value. This only works from the AL to the AX register, and is most efficient if the data value originated in AL; if not, then MOVSX is best.
            Best                                       Worst
            cbw    movsx eax,al    movsx ax,al    ror eax,8     ror ax,8
                                                  sar eax,24    sar ax,8
P bytes     2      3               4              6             8
R bytes     1      4               3              6             6
MOVSX
MOVSXD 64 64
movsx rDst16, rmSrc8          Signed
movsx rDst32, rmSrc(8/16)
movsx rDst64, rmSrc(8/16)
movsxd rDst64, rmSrc32
These general-purpose instructions are very similar to CBW and CWDE except that they are a lot more
versatile in which other registers can be sign extended to the same or a different register instead of just the AL
or AX. A (7+1)-bit signed value can be converted to a (15+1)-bit or (31+1)-bit signed value. A (15+1)-bit signed
value is converted into a (31+1)-bit signed value.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
The following converts a signed value of [–128…0…127] to a 16-bit value. If working with AX,AL, then use the CBW instruction, as it is more efficient. I recommend using the 32-bit form, as it is the best.
            Best           (+Best+)        Worst
            movsx ax,bl    movsx eax,bl    mov eax,ebx    mov ax,bx
                                           shl eax,24     shl ax,8
                                           sar eax,24     sar ax,8
P bytes     4              3               8              11
R bytes     3              4               11             8
MOVZX
movzx rDst16, rmSrc8          Unsigned
movzx rDst32, rmSrc(8/16)
movzx rDst64, rmSrc(8/16)
This instruction converts an unsigned value into a larger unsigned value. An 8-bit unsigned value can be
converted to a 16-bit or 32-bit unsigned value. A 16-bit unsigned value is converted into a 32-bit unsigned
value.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
The following converts an unsigned value of (0…255) to a 16-bit value. I recommend using the 8- to 32-bit form.
            Best           Worst
            movzx ax,bl    xor ax,ax    sub ax,ax    mov ax,bx
                           mov al,bl    mov al,bl    and ax,0ffh
P bytes     4              5            5            7
R bytes     3              4            4            5
            Best            Worst
            movzx eax,bl    xor eax,eax    sub eax,eax    mov eax,ebx
                            mov al,bl      mov al,bl      and eax,0ffh
P bytes     3               4              4              7
R bytes     4               5              5              9
CWD
CDQ
CQO 64 64
cwd Signed
cdq Signed
cqo Signed
The general-purpose CWD, CDQ, and CQO instructions are typically used to prepare a number before a division. The integer division requires a dividend twice the width of the divisor, so AX is sign extended into DX:AX, EAX into EDX:EAX, and RAX into RDX:RAX. (The same register pairing receives the double-width result when multiplying two numbers together.)
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
sar edx,31
sar edx,1
P bytes 1 7
R bytes 2 10
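A typical use before a signed division (a minimal sketch; the variable name and the divisor register are illustrative choices, not fixed requirements):
mov eax,dividend   ; Signed 32-bit dividend (hypothetical variable)
cdq                ; Sign extend EAX into EDX:EAX
idiv ecx           ; EAX = quotient, EDX = remainder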
PEXTRW
MMX pextrw r(32/64), mmxSrc, # [Un]signed 64
SSE pextrw r(32/64), xmmSrc, #
pextrw eax,mm1,00b ; {3...0}
One of the four 16-bit values is assigned to the lower 16 bits of the general-purpose register and zero
extended into the upper 16 bits for the 32-bit register, or 48 bits for the 64-bit register.
One of the eight 16-bit values is assigned to the lower 16 bits of the general-purpose register and zero
extended into the upper 16 bits for the 32-bit register, or 48 bits for the 64-bit register.
PACKSSWB
MMX  packsswb mmxDst, mmxSrc/m64     Signed    64
SSE2 packsswb xmmDst, xmmSrc/m128    Signed    128
packsswb mm0,mm1
packsswb xmm0,xmm1
This instruction takes a word value in the range {–32768 … 32767} and saturates it to a signed 8-bit range of
{–128…127}.
fooa qword 0ffffa5a55a5a0000h
foob qword 08000003f007f00ffh
movq mm7,fooa
movq mm6,foob
packsswb mm7,mm6
PACKUSWB
MMX  packuswb mmxDst, mmxSrc/m64     [Un]signed    64
SSE2 packuswb xmmDst, xmmSrc/m128    [Un]signed    128
packuswb mm0,mm1
This instruction uses the same diagram as the 64-bit form of the PACKSSWB instruction but saturates a signed word with a range of {–32768…32767} to an unsigned 8-bit range of {0…255}.
packuswb xmm0,xmm1
The following uses the same diagram as the 128-bit form of the PACKSSWB instruction but saturates a signed word with a range of {–32768…32767} to an unsigned 8-bit range of {0…255}.
fooa qword 0ffffa5a55a5a0000h
foob qword 08000003f007f00ffh
movq mm7,fooa
movq mm6,foob
packuswb mm7,mm6
PACKSSDW
MMX  packssdw mmxDst, mmxSrc/m64     Signed    64
SSE2 packssdw xmmDst, xmmSrc/m128    Signed    128
packssdw mm0,mm1
packssdw xmm0,xmm1
This instruction takes a 32-bit signed value in the range {–2147483648… 2147483647} and saturates it to a
signed 16-bit range of {–32768… 32767}.
fooa qword 0ffffa5a55a5a0000h
foob qword 08000003f007f00ffh
movq mm7,fooa
movq mm6,foob
packssdw mm7,mm6
PI2FW
3DMx+ pi2fw mmxDst, mmxSrc/m64    SPFP INT16    64
This instruction converts even packed signed 16-bit values into packed single-precision floating-point values
and stores the result in the destination mmxDst.
CVTDQ2PS
SSE cvtdq2ps xmmDst, xmmSrc/m128    SPFP INT32    128
This instruction converts a packed 32-bit signed integer from source xmm to xmm single-precision floating-
point destination.
CVTPS2DQ
CVTTPS2DQ
SSE cvtps2dq xmmDst, xmmSrc/m128    INT32 SPFP    128
This converts a packed single-precision floating-point source xmm to xmm 32-bit signed destination.
SSE cvttps2dq xmmDst, xmmSrc/m128 INT32 SPFP 128
This converts a packed single-precision floating-point with truncation source xmm to xmm 32-bit signed
destination.
CVTPI2PS
PI2FD
SSE    cvtpi2ps xmmDst, mmxSrc/m64    SPFP INT32    128
3DNow! pi2fd mmxDst, mmxSrc/m64       SPFP INT32    64
This instruction converts packed signed 32-bit values into packed single-precision floating-point values and
stores the result in the destination MMX register mmxDst.
CVTPS2PI
CVTTPS2PI
PF2ID
PF2IW
SSE cvtps2pi mmxDst, xmmSrc/m64     INT32 SPFP    64
SSE cvttps2pi mmxDst, xmmSrc/m64    INT32 SPFP    64
(the same but with truncation)
3DNow! pf2id mmxDst, mmxSrc/m64    INT32 SPFP    64
This instruction converts packed single-precision values in source mmxSrc to signed (saturated) 32-bit values
and stores the result in the destination mmxDst.
This instruction converts packed single-precision values in source mmxSrc to signed (saturated) 16-bit values
[–32768, 32767], then sign extends to 32 bits and stores the result in the destination mmxDst.
CVTSI2SS
SSE cvtsi2ss xmmDst, r/m32    SPFP INT32    128
CVTDQ2PD
SSE2 cvtdq2pd xmmDst, xmmSrc/m64    DPFP INT32    128
This instruction converts a packed even signed 32-bit integer value from source xmm to destination xmm
double-precision floating-point.
CVTPD2DQ
CVTTPD2DQ
SSE2 cvtpd2dq xmmDst, xmmSrc/m128    INT32 DPFP    128
This converts a packed double-precision floating-point from source xmm to destination even packed 32-bit
signed integer.
SSE2 cvttpd2dq xmmDst, xmmSrc/m128 INT32 DPFP 128
CVTPD2PS
SSE2 cvtpd2ps xmmDst, xmmSrc/m128    SPFP DPFP    128
This converts a packed double-precision floating-point from source xmm to destination xmm packed single-
precision floating-point.
CVTPS2PD
SSE2 cvtps2pd xmmDst, xmmSrc/m64 DPFP SPFP 128
This converts a lower packed single-precision floating-point xmm source to xmm double-precision destination.
CVTPD2PI
CVTTPD2PI
SSE2 cvtpd2pi mmxDst, xmmSrc/m128    INT32 DPFP    64
This converts a packed double-precision floating-point from source xmm to destination mmx register.
SSE2 cvttpd2pi mmxDst, xmmSrc/m128 INT32 DPFP 64
This converts a packed double-precision floating-point with truncation from source xmm to destination mmx
register.
CVTPI2PD
SSE2 cvtpi2pd xmmDst, mmxSrc/m64    DPFP INT32    128
This converts a packed 32-bit signed integer source mmx to double-precision floating-point xmm destination.
CVTSS2SI — Convert Scalar SPFP to int32/64
CVTSS2SI
CVTTSS2SI
SSE cvtss2si r32, xmmSrc/m32     INT32 SPFP    32
"   cvtss2si r64, xmmSrc/m32     INT64 SPFP    64
"   cvttss2si r32, xmmSrc/m32    INT32 SPFP    32
"   cvttss2si r64, xmmSrc/m32    INT64 SPFP    64
(the same but with truncation)
CVTSD2SI
CVTTSD2SI
SSE2 cvtsd2si r64, xmmSrc/m64     INT64 DPFP    64
"    cvtsd2si r32, xmmSrc/m64     INT32 DPFP    32
"    cvttsd2si r32, xmmSrc/m64    INT32 DPFP    32
"    cvttsd2si r64, xmmSrc/m64    INT64 DPFP    64
(the same but with truncation)
CVTSI2SD
SSE2 cvtsi2sd xmmDst, rmSrc64    DPFP INT64    128
"    cvtsi2sd xmmDst, rmSrc32    DPFP INT32    128
CVTSD2SS
SSE2 cvtsd2ss xmmDst, xmmSrc/m64 SPFP DPFP 128
CVTSS2SD
SSE2 cvtss2sd xmmDst, xmmSrc/m32 DPFP SPFP 128
Exercises
1. Write code for your favorite processor to convert an image 256 pixels wide by 256 pixels high from 32-
bit RGB data to 24-bit RGB.
2. Same as problem #1 but convert to 5:6:5-bit RGB (5 bits red, 6 bits green, 5 bits blue).
3. Convert from packed 5:5:5 RGB to 32-bit RGB.
4. For {SSE2} what instruction is needed to convert from a 16-bit signed number to 32-bit? Unsigned?
5. For {SSE2} write code snippets to pack four 128-bit vectors from packed 32-bit to packed 8-bit.
6. Write a function to convert a 128-bit vector of 32-bit packed integers from big-endian to little-endian and
vice versa.
7. Same as problem #6 but 32-bit single-precision floating-point values.
8. Same as problem #6 but 64-bit double-precision floating-point values.
9. The beginning of this chapter shows a diagram for an AoS to SoA. Write it using
a. 3DNow!
b. SSE
c. SSE2
10. The beginning of this chapter shows a diagram for an AoS to SoA. Write the reverse from SoA to AoS
using
a. 3DNow!
b. SSE
c. SSE2
Chapter 7: Integer Math
Overview
This chapter involves math related to whole numbers that you learned in grade school: the processes of
addition and subtraction. There you learned about the number line in which numbers just go on forever in
positive and negative directions. In computers, as the numbers increase in a positive or negative amount they
actually approach a limit (the end of their finite space). Upon reaching the end of the world (Flat Earth Society),
they overflow and wrap to the opposite end of the number line (quantum physics, or was that string theory?).
Well, anyway, your integer range of numbers is limited by the size of the data used to store them.
We can add 8-, 16-, 32-, and 64-bit numbers, and 4-bit if you include BCD (binary-coded decimal) discussed
later in Chapter 15.
Note The general-purpose BCD instructions are not available in 64-bit mode!
The EFLAGS Overflow, Sign, Zero, Auxiliary Carry, Parity, and Carry are all altered when using the general-
purpose instructions.
Overflow Whether the numbers overflow the binary limit of the destination operand
Sign Set to the resulting value of the MSB (most significant bit)
Zero Set if the result is a value of zero
Aux. Carry Set as a result of a carry of the low-order nibble
Parity Set to 1 if the bits add up to an even number; 0 if not
Carry Any mathematical carry results of the mathematical operation
ADD — Add
ADC
ADD
add rmDst(8/16/32/64), #(8/16/32)    Signed
add rmDst, rSrc(8/16/32/64)
add rDst, rmSrc(8/16/32/64)
The ADD general-purpose instruction logically sums the 8-, 16-, 32-, or 64-bit source to the destination, saves
the result in destination, and sets the flags accordingly. An 8-bit source immediate value can be sign extended
to 16-, 32-, or 64-bit value. A 32-bit source immediate value can be sign extended to a 64-bit destination.
d = a + b     (with d = b, this is d = a + d = d + a, i.e., d += a)
Algebraic Law:
Commutative Law of Addition    a + b = b + a
adc rmDst(8/16/32/64), #(8/16/32)    Signed
adc rmDst, rSrc(8/16/32/64)
adc rDst, rmSrc(8/16/32/64)
The ADC general-purpose instruction does exactly the same operation but with one difference. The carry value
of 0 or 1 is added to the ending result: d = a + b + (carry)
Flags O.flow Sign Zero Aux Parity Carry
X X X X X X
mov al,0efh
add al,01ah
mov bx,1234h
add bx,5678h
mov eax,00000a5a5h
add eax,000000ff0h
When adding a series of numbers one typically uses the non-carry ADD first, followed by ADC for each
additional calculation, taking into account the resulting Carry flag.
a = a + b + c + d + e
a = a ADD b ADC c ADC d ADC e
mov eax,0000a5a5h
add eax,00000ff0h
adc eax,0000ff00h
adc eax,000ff000h
adc eax,00ff0000h
Sums for really big numbers can be calculated in this way. Need a number 8000 bits long? You can do it in
assembly by continuously adding the carry of previous arithmetic operations onto a current operation.
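For example, a 64-bit sum in 32-bit code takes just two instructions (a minimal sketch; the operand registers are an arbitrary choice):
add eax,ebx     ; Sum the lower 32 bits; any carry lands in CF
adc edx,ecx     ; Sum the upper 32 bits plus that carry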
INC — Increment by 1
INC 32 32
The INC general-purpose instruction is very similar to the ADD instruction, adding a value of 1 to the 8-, 16-,
32-, or 64-bit destination. The result is saved in the destination: d = a + 1.
Flags O.flow Sign Zero Aux Parity Carry
X X X X X -
Flags: The flags are set as a result of the addition operation. Note that the Carry flag is unaffected.
It is recommended that an ADD instruction be used instead of an INC since flag dependencies can cause
stalling due to previous writes to the EFLAG registers.
Note Due to the remapping of its one-byte opcodes as the REX prefix in 64-bit mode, this instruction has lost its short single-byte register encoding, for example:
inc eax
The instruction still exists through its longer ModRM encoding, so register forms and indirect memory references such as the following remain allowed:
inc [eax]
inc [eax+ebx*4]
mov al,0efh
inc al
mov bx,1234h
inc bx
mov eax,00000a5a5h
inc eax
Note There is no Carry flag set for this instruction. If a carry is needed, use ADD reg/mem, 1.
Tip The 80x86 is one of the few processors that can perform an operation upon a memory
location, not necessarily only in a register. An instruction cannot be preempted by another
thread or process in the middle of its execution. It must complete its execution first!
Thread-Safe Increment
#ifdef _M_IX86
#define QUICK_INCREMENT(v) __asm inc v;
#else
#define QUICK_INCREMENT(v) \
EnterMutex() \
v++; \
ExitMutex()
#endif
XADD
xadd rmDst, rSrc(8/16/32/64)    Signed
The XADD general-purpose instruction first exchanges the values in the source general-purpose register and
destination just like the XCHG instruction and then logically sums the 8-, 16-, 32-, or 64-bit source to the
destination, saves the result in destination, and sets the flags accordingly.
This instruction is rarely used except in the case where a register exchange needs to take place so as to
preserve the old value. This is equivalent to four separate instructions:
                    edx        eax        ebx
(start)             OLD        NEW
mov ebx,eax         OLD        NEW        NEW
mov eax,edx         OLD        OLD        NEW
mov edx,ebx         NEW        OLD        NEW
add edx,eax         NEW+OLD    OLD        NEW

                    edx        eax
(start)             OLD        NEW
xchg edx,eax        NEW        OLD
add edx,eax         NEW+OLD    OLD

xadd edx,eax        NEW+OLD    OLD
As you can see, the old value is preserved in the source register, and the destination memory gets the sum of
the new and old values.
Flags O.flow Sign Zero Aux Parity Carry
X X X X X X
SUB — Subtract
SUB
SBB
sub rmDst(8/16/32/64), #(8/16/32)    Signed
sub rmDst, rSrc(8/16/32/64)
sub rDst, rmSrc(8/16/32/64)
The SUB general-purpose instruction logically subtracts the 8-, 16-, 32-, or 64-bit source from the destination,
saves the result in destination, and sets the flags accordingly. An 8-bit immediate source value can be sign
extended to 16-, 32-, or 64-bit value. A 32-bit immediate source value can be sign extended to a 64-bit
destination: d = b – a.
sbb rmDst(8/16/32/64), #(8/16/32) Signed
sbb rmDst, rSrc(8/16/32/64)
sbb rDst, rmSrc(8/16/32/64)
The SBB general-purpose instruction does exactly the same thing but with one difference: The Carry flag is
referred to as the borrow flag with a value of 0 or 1, which is subtracted from the ending result.
d = b – a – (Carry)
Flags O.flow Sign Zero Aux Parity Carry
X X X X X X
Flags: The flags are set as a result of the subtraction operation. When carry is set (1) it indicates a borrow is
needed.
mov al,0efh
sub al,01ah
When subtracting a series of numbers one typically uses SUB for the first subtraction, followed by SBB for
each additional subtraction, taking into account the Carry flag, which indicates a borrow is required from the
next calculation.
a = a – b – c – d – e
DEC — Decrement by 1
DEC 32 32
The DEC general-purpose instruction is very similar to the SUB instruction, subtracting a value of 1 from the 8-
, 16-, 32-, or 64-bit destination. The result is saved in the destination: d = a – 1.
Flags O.flow Sign Zero Aux Parity Carry
X X X X X -
Flags: The flags are set as a result of the subtraction operation. Note that the carry is unaffected.
It is recommended that a SUB instruction be used instead of a DEC since flag dependencies can cause
stalling due to previous writes to the EFLAG registers.
Note Due to the remapping of its one-byte opcodes as the REX prefix in 64-bit mode, this instruction has lost its short single-byte register encoding, for example:
dec eax
The instruction still exists through its longer ModRM encoding, so register forms and indirect memory references such as the following remain allowed:
dec [eax]
dec tbl[ebx+ecx*2]
mov al,0efh
dec al
Thread-Safe Decrement
#ifdef _M_IX86
#define QUICK_DECREMENT(v) __asm dec v;
#else
#define QUICK_DECREMENT(v) \
EnterMutex() \
v--; \
ExitMutex()
#endif
Hint A problem sometimes appears in programming, especially in video games with really high score rollovers. A value is incremented and needs to be saturated at 2^N – 1, such as 2^8 – 1 is 255. Some programmers assume the maximum value will never be reached and thus allow it to be incremented too much, risking that the value will roll over to zero.
inc al       ; Advance the count
jnz $L1      ; Skip unless it just wrapped to zero
dec al       ; It wrapped, so back it off to 255
$L1:
But that introduces some branching and misprediction. (Branch misprediction will be discussed in a later
chapter.) A branchless code methodology can be alternatively utilized where the state of the Carry flag is taken
to advantage. Remember, the INC does not affect the Carry flag but the ADD instruction does.
add al,1
sbb al,0
Instead of using an increment, a non-carry addition is used. A normal advancement of one to a value ranging from [0, 255) has a carry of zero, but upon 255+1, a carry is generated. Immediately subtracting a zero with no carry has no effect. But subtracting a zero with a carry takes that one increment back off. Normally d = a + 1 – 0c, then upon rollover d = a + 1 – 1c. In essence, a saturation is created. This same logic can be applied for 2^16, 2^32, or 2^64, whatever your data size is. The code is smaller and much faster.
Contrarily, a down count to the floor value of zero can be implemented utilizing this same rollback method:
sub al,1
adc al,0
The number is decremented until the floor is reached, thus initiating a borrow, and then the borrow is summed
back in. Normally, d = a – 1 + 0c, then upon rollover, d = a – 1 + 1c.
If, for example, four pairs of 32-bit words are being calculated simultaneously such as in the following addition:
47 53 38 87
+ 23 + 74 + 39 + 16
70 127 77 103
…or subtraction:
47 53 38 87
– 23 – 74 – 39 – 16
24 – 21 –1 71
…the point is that the calculations all need to use the same operator. There is an exception, but this is too
early to discuss it. There are workarounds, such as if only a couple of expressions need a calculated
adjustment while others do not, then adding or subtracting a zero would keep their result neutral.
47 53 38 87
+ 23 +0 + 39 +0
70 53 77 87
Algebraic Law:
Additive Identity n+0=0+n=n
It is in essence a wasted calculation, but its use as a placeholder helps make SIMD instructions easier to use.
Algebraic Law:
Additive Inverse a – b = a + (–b)
The other little item to remember is that subtraction is merely the addition of a value's additive inverse:
a–b = a+(–b)
PADDx
PADDQ
MMX padd(b/w/d/q) mmxDst, mmxSrc/m64 [Un]signed 64
SSE2 padd(b/w/d/q) xmmDst, xmmSrc/m128    [Un]signed    128
The PADDx instruction is a parallel operation that uses an adder on each of the source bit blocks aSrc
(xmmSrc) and bSrc (xmmDst) and stores the result in the destination Dst (xmmDst).
The instructions may be labeled as packed, parallel, or vector, but each block of bits is in reality isolated from
each other. The following is a 32-bit example consisting of four 8-bit values, shown as signed decimal on the left and as hex on the right:
95 – 25 – 125 – 69 0x5F E7 83 BB
+ 85 + 37 + 103 + 11 + 0x55 25 67 0B
(0xB4) 180 (0x0C) 12 (0xEA)– 22 (0xC6) – 58 0xB4 0C EA C6
Regardless of whether the decimal representation is unsigned or signed, the hex values remain the same, which is the reason for these instructions being labeled [Un]signed, thus sign neutral.
Notice in the following additions of 7-bit signed values that with the limit range of –64…63 the worst case of
negative and positive limit values results with no overflow.
  11000000b C0 (–64)       00111111b 3F (63)
+ 11000000b C0 (–64)     + 00111111b 3F (63)
  10000000b 80 (–128)      01111110b 7E (126)
Of course, positive and negative signed values could also be intermixed without an overflow. For a 7-bit
unsigned value, 0…127, there would be no overflow.
11000000b C0 (–64) 01111111b 7F (127)
+ 00111111b 3F (63) + 01111111b 7F (127)
11111111b FF (–1) 11111110b FE (254)
The eighth unused bit is in reality used as a buffer to prevent any overflow to a ninth bit.
PADDSx
PADDUSx
MMX paddus(b/w) mmxDst, mmxSrc/m64 Unsigned 64
" padds(b/w) mmxDst, mmxSrc/m64 Signed
SSE2 paddus(b/w) xmmDst, xmmSrc/m128    Unsigned    128
"    padds(b/w) xmmDst, xmmSrc/m128     Signed
The PADDSx (signed) and PADDUSx (unsigned) instructions are a parallel operation that uses an adder on
each of the source bit block registers aSrc (xmmSrc) and bSrc (xmmDst) and stores the result in the
destination Dst (xmmDst) using saturation logic to prevent any possible wraparound.
Each calculation limits the value to the extents of the related data type so that if the limit is exceeded, it is
clipped inclusively to that limit. This is handled differently whether it is signed or unsigned, as they both use
different limit values. Effectively, the result of the summation is compared to the upper limit with a Min
expression and compared to the lower limit with a Max expression. Notice in the previous section that when
two signed 8-bit values of 0x7F (127) are summed, a value of 0xFE (254) results but is clipped to the
maximum value of 0x7f (127).
The same applies if, for example, two values of 0x80 (–128) are summed, resulting in –256 but clipped to the
minimum value of 0x80 (–128).
The instructions may be labeled as packed, parallel, or vector but each block of bits is in reality isolated from
each other.
A sample use of this instruction would be for sound mixing where two sound waves are mixed into a single
wave for output. The saturation point keeps the amplitude of the wave from wrapping from a positive or high
level into a negative or low one, thus creating a pulse encoded harmonic, or distortion.
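A sketch of that mixing idea for eight signed 16-bit samples (the pointer registers esi, edi, and edx are illustrative assumptions, not a fixed calling convention):
movq mm0,[esi]      ; Four 16-bit samples of wave A
movq mm1,[esi+8]    ; Four more
paddsw mm0,[edi]    ; Mix with wave B, clipping at –32768…32767
paddsw mm1,[edi+8]
movq [edx],mm0      ; Mixed output
movq [edx+8],mm1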
For saturation, the limits are different for the data size as well as for signed and unsigned.
            8-bit        16-bit
signed      –128…127     –32768…32767
unsigned    0…255        0…65535
PSUBx
PSUBQ
MMX  psub(b/w/d/q) mmxDst, mmxSrc/m64     [Un]signed    64
SSE2 psub(b/w/d/q) xmmDst, xmmSrc/m128    [Un]signed    128
This vector instruction is a parallel operation that subtracts each of the source bit blocks aSrc (xmmSrc) from
bSrc (xmmDst) and stores the result in the destination Dst (xmmDst).
Note Be careful here! The register and operator ordering is as follows:
–126 91 – 56 – 96 0x82 5B C8 A0
– 12 – 122 –57 – 114 – 0x0C 7A C7 72
(0x76) 118 (0xE1) –31 (0x01) 1 (0x2E) 46 0x76 E1 01 2E
PSUBSx
PSUBUSx
MMX  psubus(b/w) mmxDst, mmxSrc/m64     Unsigned    64
"    psubs(b/w) mmxDst, mmxSrc/m64      Signed
SSE2 psubus(b/w) xmmDst, xmmSrc/m128    Unsigned    128
"    psubs(b/w) xmmDst, xmmSrc/m128     Signed
This vector instruction is a parallel operation that subtracts each of the source bit blocks aSrc (xmmSrc) from bSrc (xmmDst) and stores the result in the destination Dst (xmmDst), using saturation logic to clip each result to the limits of its data type.
–126 91 – 56 – 96 0x82 5B C8 A0
– 12 – 122 – 57 – 114 – 0x0C 7A C7 72
(0x80) –128 (0xE1) –31 (0x01) 1 (0x80) –128 0x80 E1 01 80
An alternative would be to use integer calculations in a fixed-point format of zero or more places. If the data
size is large enough to contain the number, then there is no precision loss!
Pseudo Vec
These can get pretty verbose, as for fixed-point (integer) addition there would be support for 8-, 16-, and 32-bit
data elements within a 128-bit vector and these would be signed and unsigned, with and without saturation.
The interesting thing about adding signed and unsigned numbers, other than the carry or borrow, is that the
resulting value will be exactly the same and thus the same equation can be used. This can be viewed in the
following 8-bit example:
    95            05Fh            95
 + 240          + 0F0h         + –16
   335  C=1       04Fh  C=1       79  C=0
    79            (79)            79
Notice that the resulting bits from the 8-bit calculation are all the same. Only the carry is different and the
resulting bits are only interpreted as being signed or unsigned.
The following is a 16×8-bit addition but substituting a PSUBB for the PADDB will transform it into a subtraction.
movq mm0,[ebx+0] ; Read B Data {B7...B0}
movq mm1,[ebx+8] ; {BF...B8}
movq mm2,[eax+0] ; Read A Data {A7...A0}
movq mm3,[eax+8] ; {AF...A8}
paddb mm0,mm2 ; lower 64 bits {A7+B7 ... A0+B0}
paddb mm1,mm3 ; upper 64 bits {AF+BF ... A8+B8}
movq [edx+0],mm0
movq [edx+8],mm1
For SSE, it is essentially the same function wrapper, keeping in mind aligned memory MOVDQA versus non-
aligned memory MOVDQU.
The following is a master substitution table for change of functionality, addition versus subtraction
(inclusive/exclusive of saturation).
Averages
vD[] = (vA[] + vB[] + 1) ÷ 2;
PAVGB
PAVGUSB
3DNow! pavgusb mmxDst, mmxSrc/m64    Unsigned    64
MMX+   pavgb mmxDst, mmxSrc/m64      Unsigned    64
SSE    pavgb mmxDst, mmxSrc/m64      Unsigned    64
SSE2   pavgb xmmDst, xmmSrc/m128     Unsigned    128
This SIMD instruction is a 64 (128)-bit parallel operation that sums the eight (16) individual 8-bit source integer
bit blocks aSrc (xmmSrc) and bSrc (xmmDst), adds one, then divides by two and returns the lower 8 bits with
the result being stored in the destination Dst (xmmDst).
Tip These two instructions are remnants from the processor wars. They have two different
mnemonics with two different opcodes, but they have the same functionality. PAVGB was
added to the AMD instruction set, which matched the SSE instruction PAVGB but had the
same functionality as AMD's PAVGUSB. If the target is 3DNow! and MMX extensions or
better, use the instruction PAVGB.
PAVGW
MMX+ pavgw mmxDst, mmxSrc/m64     Unsigned    64
SSE  pavgw mmxDst, mmxSrc/m64     Unsigned    64
SSE2 pavgw xmmDst, xmmSrc/m128    Unsigned    128
This SIMD instruction is a 64 (128)-bit parallel operation that sums the four (eight) individual 16-bit source
integer bit blocks aSrc (xmmSrc) and bSrc (xmmDst), adds one, then divides by two and returns the lower 16
bits with the result being stored in the destination Dst (xmmDst).
PSADBW
MMX+ psadbw mmxDst, mmxSrc/m64     Unsigned    64
SSE  psadbw mmxDst, mmxSrc/m64     Unsigned    64
SSE2 psadbw xmmDst, xmmSrc/m128    Unsigned    128
This SIMD instruction computes the absolute difference of each pair of unsigned 8-bit elements in aSrc (xmmSrc) and bSrc (xmmDst) and sums those differences, storing the unsigned 16-bit total in the low word of each 64-bit half of the destination Dst (xmmDst), with the remaining bits zeroed.
Integer Multiplication
When two integers are multiplied, the product can require up to twice as many bits as the source operands, and thus two operands are needed to store the result. If the source values are small enough, the operand receiving the upper bits of the result may be ignored, as it then contains a predictable value.
MUL
The MUL general-purpose instruction multiplies the unsigned 8-, 16-, 32-, or 64-bit operand by the unsigned
AL/AX/EAX/RAX register, saves the result based upon the bit size of the operand as indicated in the table
below, and sets the flags accordingly.
          8-bit              16-bit                32-bit                   64-bit
Result    AX = AL × r/m8     DX:AX = AX × r/m16    EDX:EAX = EAX × r/m32    RDX:RAX = RAX × r/m64
Flags: If the upper half of the resulting bits are 0, then the Overflow and Carry bits are set to 0; else they are
set to 1. The other flags are undefined.
A square operation (x^2) is actually pretty simple!
mov rax,7
mul rax ; rdx:rax = rax × rax
IMUL
imul rmDst(8/16/32/64) Signed
imul rDst(16/32/64), rmSrc(16/32/64)
imul rDst(8/16/32/64), #(8/16/32)
imul rDst(16/32/64), rmSrc(16/32/64), #(8/16/32)
The IMUL general-purpose instruction has three forms based upon the number of operands.
One Operand — Similar to that of the MUL instruction but deals with signed numbers. That is, it multiplies the 8-, 16-, 32-, or 64-bit signed operand by the signed AL/AX/EAX/RAX register, saves the result based upon the bit size of the operand as indicated in the table below, and sets the flags accordingly.
Two Operands — Used for multiplication of 16-, 32-, or 64-bit signed numbers in operand1 with the value
in operand2 and stores the result in operand1. Unlike MUL, AX/EAX/RAX is not used unless it is one of
the two operands. If operand2 is an immediate value, then it is sign extended to match operand1.
Three Operands — Used for multiplication of 16-, 32-, or 64-bit signed numbers in operand2 with the sign
extended immediate value in operand3. The result is stored in operand1. Unlike MUL, AX/EAX/RAX is not
used unless it is one of the three operands.
Flags: If the upper half of the resulting bits are 0 then the Overflow and Carry bits are set to 0, else they are set
to 1. The other flags are undefined.
Hint This instruction is designed for signed values, but the register receiving the lower set of bits
of the calculated result will contain an identical value whether the source numbers were
signed or unsigned.
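For instance (a quick sketch of the two- and three-operand forms; the register choices are arbitrary):
imul eax,ebx        ; Two operands:   eax = eax × ebx (lower 32 bits kept)
imul ecx,edx,16     ; Three operands: ecx = edx × 16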
Fixed-point multiplication is not like floating-point multiplication. In floating-point, there is a precision loss with
each calculation since a numerical value is stored in an exponential form. With fixed-point, there is no
precision loss, which is great but leads to another problem. When two integers are used in a summation,
the most significant bits are carried into an additional (n+1) bit. With a multiplication of two integers, the
resulting storage required is (n+n=2n) bits. This poses a problem of how to deal with the resulting
solution. Since the data size increases, there are multiple solutions to contain the result of the calculation.
PMULLW
These vector instructions use a 64 (128)-bit data path and so four (eight) operations occur in parallel. The
product is calculated using each of the 16-bit half-words of the multiplicand mmxSrc (xmmSrc) and the 16-bit
half-words of the multiplier mmxDst (xmmDst) for each 16-bit block, and stores the lower 16 bits of each of the
results in the original 16-bit half-words of the destination mmxDst (xmmDst).
PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper)
PMULHW destination, source
PMULHUW
PMULHW
These vector instructions use a 64 (128)-bit data path and so four (eight) operations occur in parallel. The
product is calculated using each of the 16-bit half-word of the multiplicand mmxSrc (xmmSrc) and the 16-bit
half-word of the multiplier mmxDst (xmmDst) for each 16-bit block, and stores the upper 16 bits of each of the
results in the original 16-bit half-words of the destination mmxDst (xmmDst).
PMULHRW
This vector instruction uses a 64-bit data path and so four operations occur in parallel. The product is
calculated using each of the signed 16-bit half-words of the multiplicand mmxDst and the 16-bit half-words of the multiplier mmxSrc for each 16-bit block, sums 00008000 hex to each 32-bit product, and stores the resulting upper 16 bits in the destination mmxDst.
Dst(15...0) = UPPER16((Dst(15...0) × Src(15...0)) + 0x8000)
Dst(31...16) = UPPER16((Dst(31...16) × Src(31...16)) + 0x8000)
Dst(47...32) = UPPER16((Dst(47...32) × Src(47...32)) + 0x8000)
Dst(63...48) = UPPER16((Dst(63...48) × Src(63...48)) + 0x8000)
movq mm0,[ebx+0] ; Read B Data {3...0}
movq mm1,[ebx+8] ; {7...4}
movq mm2,[eax+0] ; Read A Data {3...0}
movq mm3,[eax+8] ; {7...4}
pmullw mm0,mm2
pmullw mm1,mm3
movq [edx+0],mm0 ; Write D ??? 16 bits {3...0}
movq [edx+8],mm1 ; (upper or lower) {7...4}
PMULUDQ
This instruction multiplies the unsigned 32-bit element in the lower half of each 64-bit block of the source and destination and stores the full 64-bit unsigned product in that block of the destination.
31…0 63…0
3287565 0000000000322A0D
× 593 × 0000000000000251
1949526045 000000007433681D
95…64 31…0 127…0
PMADDWD
This instruction multiplies the corresponding signed 16-bit elements of the source and destination, then sums each adjacent pair of 32-bit products, storing the signed 32-bit sums in the destination.
Exercises
1. Write a code snippet to saturate an unsigned 8-bit incrementing count at a value of 99.
2. The result of the summation of two integers increases the size of the source by one bit (the carry). How
many bits are needed from the product of two 32-bit values?
3. What are different output methods to store the various results of a product of two integers?
Chapter 8: Floating-Point Anyone?
Overview
If you are interested in programming the FPU (floating-point unit) in assembly, you are either bored or deeply
disturbed. My apologies to those few of you writing compilers or low-level operating system kernels who
actually need to know how to do this! Then again, you may just be curious. Other books tend to stick this topic
in the section beginning with "F," or somewhere buried in the back in an appendix, and maybe 1% of the
readers actually even glance at it. Some people have been known to tape those pages together or just rip
them out of their book (not my book) altogether. If C/C++ is good enough for you to handle your floating-point
math, then just skip this section.
Wait! There are a couple items you REALLY NEED TO KNOW! Stuff not in the standard runtime
C libraries. Something to make your code just a bit faster and better!
My gosh! You're still reading this? You must be indeed interested in floating-point math. So we will chat about
a couple items but we are not going to go into very much detail — only enough to take advantage of some
enhanced functionality.
These are the three methods available to use for floating-point calculation: 3DNow!, SSE, and SSE2 packed
instructions; run-time C libraries; and of course, ye old floating-point coprocessor known as the FPU (floating-
point unit), for which most whom program it directly have a love/hate relationship. This book should help you
get over that, at least in part.
It is strongly recommended that if you only have generic floating-point programming in mind (whether it be single-precision floating-point or double-precision floating-point), then just use your higher level language such as C/C++ instead. It will be much easier and any optimization will be merely that of intermixing general-purpose instructions with floating-point instructions.
Working with the SIMD instructions for floating-point support is much easier than working with the FPU, but the
FPU has some helpful features. Fortunately, most of the hard work of directly programming the FPU has been
done for you already and is available in libraries such as the runtime library for the C programming language.
The FPU has so many nuances that it could take a several hundred-page book to understand all the feature
sets, controls, exception handling, etc. This chapter is going to describe how floating-point works in general
and paint some broad strokes as to how to use the FPU, especially the few instructions that can be of use but
are not utilized by off-the-shelf libraries. If you indeed want to do in-depth programming I suggest you
download the various technical manuals and application notes from the manufacturer.
Data Size C reference Assembler Bytes
You are probably familiar with the single- and double-precision but not the double extended-precision. Did you know that when you do a floating-point calculation, the data is actually expanded into the 10-byte (80-bit) form
form — double extended-precision floating-point — as it is pushed on the FPU stack?
The larger the number of bits used to store the number, the higher the precision of that number.
              Single    Double    Double Extended
Sign          1         1         1
Exponent      8         11        15
Integer       0         0         1
Significand   23        52        63
Total         32        64        80
The exponent is a base-2 power representation stored as a binary integer. The significand (mantissa) really
consists of two components: a J-bit and a binary fraction.
For the single-precision value, there is a hidden integer bit (1.) leading the 23 bits of the mantissa, thus making it a 24-bit significand. The exponent is 8 bits, thus having a bias value of 127. The magnitude of the supported range of numbers is 2×10^–38 to 2×10^38.
For double-precision values, there is a hidden integer bit (1.) leading the 52 bits of the mantissa, thus making it a 53-bit significand. The exponent is 11 bits, thus having a bias value of 1023. The magnitude of the supported range of numbers is 2.23×10^–308 to 1.8×10^308.
For the 80-bit version, the extra bits are primarily for protection against precision loss from rounding and over/underflows. The leading integer bit (1.) is the 64th bit of the significand. The exponent is 15 bits, thus having a bias value of 32767. The magnitude of the supported range of numbers is 3.3×10^–4932 to 1.21×10^4932.
The floating-point value is the product of the significand and two raised to the power of the (unbiased) exponent.
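As a quick worked example of that relationship, using the well-known single-precision encoding of pi (the label name is just for illustration):
pi dd 40490FDBh   ; sign = 0, exponent = 80h (128; 128 – 127 bias = 1)
                  ; significand = 1 + 490FDBh/2^23 = 1.5707964
                  ; value = 1.5707964 × 2^1 = 3.1415927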
A zero exists in two forms (±0): positive zero (+0) and negative zero (–0). Both of these are valid indications of
zero. (The sign is ignored!)
For single-precision and double-precision floating-point numbers, the integer bit is always an implied one; it is simply not part of the 32 or 64 bits used to encode the number. For double extended-precision the integer bit is encoded as part of the number, and so denormalized numbers apply. These are very small non-zero numbers represented with an exponent of zero; they are very close to the value of zero and are considered tiny. Keep in mind that the FPU expands single-precision and double-precision numbers into double extended-precision, where the integer bit is one of the 80 bits, and thus denormalized numbers exist for the calculations. Upon saving the single- or double-precision floating-point number back to memory, the integer bit is stripped out again and becomes the implied (hidden) bit, which is set.
Programmers are also usually aware that floats cannot be divided by zero, nor can the square root of a negative number be taken, because an exception error would occur.
Table 8-1: Single-precision floating-point number representations. ± sign bit. x^e exponent. Note: The integer bit (1) of 1.### is implied for single-precision and double-precision numbers.
± x^e Significand NaN (Not a Number)
There are two types of NaNs (non-numbers): The quiet NaNs known as QNaNs and the signalling NaNs
known as SNaNs.
QNaN
The QNaN has the most significant fraction bit set and is a valid value to use in most floating-point
instructions even though it is not a number. A QNaN is an unordered number due to not being a real
floating-point value.
SNaN
The SNaN has the most significant fraction bit reset (clear) and typically signals an invalid exception when
used with floating-point instructions. SNaN values are never generated by the result of a floating-point
operation. They are only operands supplied by software algorithms. A SNaN is an unordered number due
to not being a real floating-point value.
NaN
An ordered number is a valid number that is not NaN (neither QNaN nor SNaN).
Table 8-2: Single-precision floating-point to hex equivalent
Value Hex
FPU Registers
The floating-point unit has eight data registers, {ST(0), ST(1), ST(2), ST(3), ST(4), ST(5), ST(6), ST(7)}, and
Status, Control Word, Tag Word, IP, Data Pointer, and Op Code Registers.
Table 8-5: (16-bit) FPU status register
The FPU has condition code bits contained within the status register. These bits match 1:1 with the EFLAGS
of the CPU. They can be copied to the AX register using the FSTSW AX instruction followed by a SAHF
instruction to place them into the EFLAGS register.
Unordered x x x 1
Table 8-6: (16-bit) FPU control word
Now would be a good time to talk about FPU exceptions. The FPU uses exceptions for invalid operations in a
manner similar to how the CPU uses exceptions.
Table 8-7: FPU exceptions
Most of the single- and double-precision floating-point functionality is covered by the C runtime math library,
which can be accessed in the file: #include <math.h>.
FLD
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FPU fld source Single-precision 32
Double-precision 64
Extended-precision 80
The floating-point load instruction loads the value in memory, expands it to an 80-bit double extended-
precision floating-point form, and pushes it on the stack, thereby making it addressable by the FPU as register
ST(0).
float a;
double b;
__asm {
fld a ; Load SPFP 32-bit to 80-bit
fld b ; Load DPFP 64-bit to 80-bit
fadd
};
With the general-purpose CPU, an operation such as D = D + B maps directly onto named registers:
add edx,ebx ; D = D + B
The FPU, however, has its registers as entries on a stack. Thus, values are pushed on a stack and an
operation is performed upon the values stored at the top of the stack.
The FPU behaves like an RPN (Reverse Polish Notation) calculator. Arguments are "pushed" on the stack and
then the operation is performed.
1 2 +
There are eight floating-point data registers, which are treated like a circular stack. The stack pointer ST(0)
references the register number that is at the top of the stack.
ST(0) 2.0
ST(1) 1.0
ST(2)
ST(3)
ST(4)
ST(5)
ST(6)
ST(7)
As values are loaded they are pushed onto the top of the stack, and the other items on the stack are pushed down, leaving the new item addressed as ST(0) and the other items addressed as ST(1)…ST(7). A maximum of eight values can be pushed on the FPU stack, as a ninth item generates an FPU error: #IS (aka 1.#IND), "Stack overflow or underflow."
If the stack had only one or no arguments pushed, the operation would have generated a stack underflow due
to the missing argument.
As mentioned, ST(0) references the top of the stack but the stack is really a rotating queue. Let us examine a
simple addition operation again but step by step: 1 + 2 + 5 thus 1 2 + 5+.
a = 1; b = 2; c = 5;
fld a
fld b
fadd
fld c
fadd
fst d ; Save float to memory.
First, the value of 1 was loaded and thus pushed onto the stack. Then, the value of 2 was loaded and thus
pushed onto the stack.
The FADD operation actually caused the stack to unwind. The 2.0 was retrieved from ST(0), and the stack was
rolled back one slot, then summed with the new ST(0) value 1.0 and the result written back to ST(0). The
effect is the result was stored for the next operation, but the old ST(0) was rolled into ST(7) and the other slots
rolled down by one. If a new value is loaded and processed, the same kind of operation occurs. Notice the
activity of ST(0) and ST(7).
[Figure: stack contents ST(0)…ST(7) after each step: load 1.0, load 2.0, load 5.0, FADD, FADD, FST]
All three values are pushed on the stack and they are popped as the stack gets rolled and processed. Upon
completion of the floating-point operation, the value is exported in one of the forms while being popped off the
stack and put back into system memory.
A + B × C is A + (B × C), thus A B C × +
Warning! There seems to be some confusion or encoding errors in some of the macro
assemblers in regard to the "P" suffix. If you try to assemble your code and get an error
on one of the FPU instructions, try the alternative (with or without P).
FST
FSTP
FPU fst source Single-precision 32
Double-precision 64
Extended-precision 80
The FST instruction copies the value in ST(0), rounded to the destination precision, to 32/64-bit memory (or to another stack register) and leaves the FPU stack intact.
The FSTP instruction is similar to FST, but it pops the value off the FPU stack and also supports an 80-bit destination.
FILD
FPU fild mSrc(16/32/64) Signed 80
The integer load instruction loads the 16/32/64-bit signed integer from memory, converts it to an 80-bit double extended-precision floating-point value, and pushes it onto the FPU stack.
FIST
FISTP
FISTTP
FPU fist mSrc(16/32/64) Signed 80
fistp mSrc(16/32/64)
The FIST instruction rounds the value in ST(0) to an integer (using the current rounding mode) and copies it to 16/32-bit memory. The FISTP instruction does the same for 16/32/64-bit memory, only it pops the value off the FPU stack.
SSE3-FPU fisttp mDst{16/32/64} Signed 80
The FISTTP instruction is similar to FISTP, but it copies the truncated integer from ST(0) to 16/32/64-bit
memory and it pops the value off the FPU stack. This instruction was introduced with SSE3.
FPU Constants
Some FPU values are predefined constants that get loaded into stack register ST(0) and as such do not need
to be loaded from memory.
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FLD1
FLDL2E
FLDL2T
FLDLG2
FLDLN2
FLDPI
FLDZ
Table 8-8: FPU constants
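For example (a minimal sketch; the comments list the standard x87 constant each mnemonic pushes onto the stack):
fldz ; ST(0) = +0.0
fld1 ; ST(0) = +1.0, the +0.0 is pushed down to ST(1)
fldpi ; ST(0) = pi (3.14159...)
fldl2e ; ST(0) = log2(e)
fldl2t ; ST(0) = log2(10)
fldlg2 ; ST(0) = log10(2)
fldln2 ; ST(0) = ln(2)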
FXCH
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FXCH
FPU fxch Extended- 80
precision
fxch st(i)
The FXCH instruction can be used to exchange any of the stack registers {ST(1) … ST(7)} with stack register
ST(0). The default (with no parameter) is ST(1) and ST(0). A stack entry must contain a valid value before the exchange for it to be valid after the swap.
Since floats and doubles are pretty much handled for you, there is no real need to pick them apart; scalar libraries have already been written for you by compiler manufacturers. If you want to use double extended-precision, however, you have a lot of work ahead of you, since standard C compilers do not tend to support it and thus you need to do the math yourself using FPU instructions.
FINCSTP — FPU Increment Stack Pointer
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FINCSTP
FPU fincstp
The FINCSTP instruction effectively rolls the stack. The value indexed by ST(0) becomes ST(7), but the data does not become invalidated. A valid stack argument in ST(0) gets rolled to ST(7). As the data is still valid, it is still eligible for exchange with the FXCH instruction.
FDECSTP
FPU fdecstp
The FDECSTP instruction effectively rolls the stack in the other direction. The value indexed by ST(0) becomes ST(1) and ST(7) becomes ST(0), but the data retains its valid/invalid state. A valid stack argument in ST(0) gets rolled to ST(1). As the data is still valid, it is still eligible for exchange with the FXCH instruction.
FWAIT/WAIT
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FWAIT
WAIT
FPU fwait
" wait
The FWAIT instruction is used to wait for the FPU (floating-point unit) operation to complete. Without it, an
operation that is in effect can be preempted by a new floating-point calculation pushed on the FPU stack.
EMMS/FEMMS
Whenever MMX and FPU instructions are both used within the same thread, the CPU must switch between the two states by executing the EMMS instruction. The 3DNow! instruction set has the faster FEMMS instruction. Having to switch back and forth is time consuming and a burden on the CPU, so whenever possible the instructions should be grouped by type or kept in separate threads.
EMMS
FEMMS
FNOP
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FNOP
FPU fnop
General Math Instructions
FCHS
FPU fchs Extended-precision 80
This instruction complements (flips) the sign bit in ST(0), effectively negating the value in the ST(0) register.
A = –A
fldpi ; ST(0) = (PI) 3.1415926535897932
fchs ; ST(0) = –3.1415926535897932 (–PI)
FABS
FPU fabs Extended-precision 80
A = |A|
fld1 ; ST(0) = 1.0
fchs ; ST(0) = –1.0
fabs ; ST(0) = 1.0
FADD
FADDP
FIADD
FPU faddp Extended-precision 80
" fadd mSrc(32/64)
" fadd st(0)Dst, st(i)Src
" fadd st(i)Dst, st(0)Src
" faddp st(i)Dst, st(0)Src
" fiadd mSrc(32/64) Signed
The FADD instruction performs a summation of the source to the destination and stores the result in the
destination. If no destination is specified, the default is ST(0).
FADDP performs the same operation, but it pops ST(0) off the stack and pushes the result back onto the stack
at ST(0).
The FIADD instruction performs the same operation as FADD except the source value is a 16/32-bit integer.
fs REAL4 2.0
fldpi ; ST(0) = PI (3.1415926535897932)
fldpi ; ST(0) = PI, ST(1) = PI
faddp ; ST(0) = 2*PI = PI + PI
fadd fs ; ST(0) = 2*PI + 2.0
FSUB
FSUBP
FISUB
FPU fsubp Extended-precision 80
" fsub mSrc(32/64)
" fsub st(0)Dst, st(i)Src
" fsub st(i)Dst, st(0)Src
" fsubp st(i)Dst, st(0)Src
" fisub mSrc(32/64) Signed
The FSUB instruction subtracts the source from the destination and stores the result in the destination. If no
destination is specified, the default is ST(0).
FSUBP performs the same operation, but it pops ST(0) off the stack and stores the result in ST(0).
The FISUB instruction performs the same operation as FSUB except the source value is a 16/32-bit integer.
FSUBR
FSUBRP
FISUBR
FPU fsubrp Extended-precision 80
" fsubr mSrc(32/64)
" fsubr st(0)Dst, st(i)Src
" fsubr st(i)Dst, st(0)Src
" fsubrp st(i)Dst, st(0)Src
" fisubr mSrc(32/64) Signed
The FSUBR instruction is a reverse subtraction where the destination is subtracted from the source and the
result is stored in the destination. If no destination is specified, the default is ST(0).
Instead of D = B + (–A), the expression D = (–B) + A is used. FSUBRP performs the same operation, but it pops ST(0) off the stack and stores the result in ST(0).
The FISUBR instruction performs the same operation as FSUBR except the source value is a 16/32-bit
integer.
FMUL/FMULP/FIMUL — FPU Multiplication D = ST(0) × A
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FMUL
FMULP
FIMUL
FPU fmulp Extended-precision 80
" fmul mSrc(32/64)
" fmul st(0)Dst, st(i)Src
" fmul st(i)Dst, st(0)Src
" fmulp st(i)Dst, st(0)Src
" fimul mSrc(32/64) Signed
The FMUL instruction calculates the product of the source and the destination and stores the result in the
destination. If no destination is specified, the default is ST(0).
FMULP performs the same operation, but it pops ST(0) off the stack and stores the result in ST(0).
The FIMUL instruction performs the same operation as FMUL except the source value is a 16/32-bit integer.
FDIV
FDIVP
FIDIV
FPU fdivp Extended-precision 80
" fdiv mSrc(32/64)
" fdiv st(0)Dst, st(i)Src
" fdiv st(i)Dst, st(0)Src
" fdivp st(i)Dst, st(0)Src
" fidiv mSrc(32/64) Signed
The FDIV instruction calculates the quotient of the destination (dividend) divided by the source (divisor) and stores the result in the destination. If no destination is specified, the default is ST(0).
FDIVP performs the same operation, but it pops ST(0) off the stack and stores the result in ST(0).
The FIDIV instruction performs the same operation as FDIV except the source value is a 16/32-bit integer.
FDIVR
FDIVRP
FIDIVR
FPU fdivrp Extended-precision 80
" fdivr mSrc(32/64)
" fdivr st(0)Dst, st(i)Src
" fdivr st(i)Dst, st(0)Src
" fdivrp st(i)Dst, st(0)Src
" fidivr mSrc(32/64) Signed
The FDIVR instruction calculates the quotient of the source (dividend) divided by the destination (divisor) and stores the result in the destination. If no destination is specified, the default is ST(0).
FDIVRP performs the same operation, but it pops ST(0) off the stack and stores the result in ST(0).
The FIDIVR instruction performs the same operation as FDIVR except the source value is a 16/32-bit integer.
FPREM
FPU fprem Extended- 80
precision
This instruction returns the remainder from dividing the dividend ST(0) by the divisor ST(1) and stores the result in ST(0). This instruction is the equivalent of a modulus: ST(0) = ST(0) mod ST(1).
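For example, a remainder along the lines of the C run-time fmod() can be formed this way (a minimal sketch; dividend, divisor, and result are illustrative memory operands, and for very large quotients FPREM must be repeated until condition flag C2 is clear):
fld divisor ; ST(0) = divisor
fld dividend ; ST(0) = dividend, ST(1) = divisor
fprem ; ST(0) = remainder of ST(0) divided by ST(1)
fstp result ; store the remainder and pop
fstp st(0) ; pop the divisor to clean up the stack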
FPREM1
FPU fprem1 Extended- 80
precision
This instruction is similar to the FPREM instruction but returns the IEEE 754 remainder from dividing the dividend ST(0) by the divisor ST(1) and stores the result in ST(0). It, too, is the equivalent of a modulus, but with the quotient rounded to nearest rather than truncated.
FRNDINT
FPU frndint Extended-precision 80
The FRNDINT instruction rounds the value in ST(0) to an integer, using the rounding mode currently set in the FPU control word.
FSQRT
FPU fsqrt Extended- 80
precision
The FSQRT instruction replaces the value in ST(0) with its square root.
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FSCALE
FPU fscale Extended- 80
precision
The FSCALE instruction effectively shifts (scales) ST(0) by the amount set in ST(1). This is equivalent to D = D × 2^A, where A is the value in ST(1), truncated to an integer.
F2XM1
FPU f2xm1 Extended- 80
precision
The F2XM1 instruction calculates 2^x – 1, where the source x is ST(0) and the result is stored in ST(0).
FYL2X
FPU fyl2x Extended- 80
precision
The FYL2X instruction calculates y × log2(x), where the source y is ST(1), the source x is ST(0), and the result is stored in ST(0). The operand x must be > 0.
FYL2XP1
FPU fyl2xp1 Extended- 80
precision
The FYL2XP1 instruction calculates y × log2(x + 1), where the source y is ST(1), the source x is ST(0), and the result is stored in ST(0). The operand x must be within the range –(1 – √2/2) … (1 – √2/2).
FXTRACT
FPU fxtract Extended- 80
precision
This instruction pops the ST(0) value off the stack, separates the exponent and significand, and pushes the
exponent on the stack. The significand is then set with the 3FFFh exponent and pushed on the stack.
Floating-Point Comparison
Do not expect the resulting values from different calculations to be identical. For example, 2.0 × 9.0 is about 18.0, and 180.0 ÷ 10.0 is about 18.0, but the two 18.0 values are not guaranteed to be identical:
2.0 × 9.0 versus 180.0 ÷ 10.0
169
n
Let us examine a range of values 10 and compare a displacement of ±0.001 versus ±0.0000001.
Table 8-9: Note the single-precision loss between the ±0.001 displacement as the number of digits
goes up in the base number. As the base number gets larger, fewer decimal places of precision can be
supported. The hexadecimal numbers in bold are where the precision was totally lost.
What this means is that smaller numbers such as those that are normalized and have a numerical range from
–1.0 to 1.0 allow for higher precision values, but those with larger values are inaccurate and thus not very
precise. For example, the spacing between 1.001 and 1.002, or 1.002 and 1.003, etc., is about 0x20c4 (8,388), which means that about 8,387 representable numbers exist between those two samples. A number with a higher digit count such as 1000.001 or 1000.002 supports about 0x11 (17), so only about 16 numbers exist between those two values. And a number around 1000000 identifies 1000000.001 and 1000000.002 as the same number. This
makes for comparisons of floating-point numbers with nearly the same value very tricky. This is one of the
reasons why floating-point numbers are not used for currency as they tend to lose pennies. Binary-coded
decimal (BCD) and fixed-point (integer) are used instead.
So when working with normalized numbers {–1.0 … 1.0}, a comparison algorithm with a precision slop factor (accuracy) of around 0.0000001 should be utilized. When working with estimated results, a much coarser value should be used. The following function returns a Boolean true/false value to indicate whether the two values are close enough to be considered the same value. Normally you would not compare two floating-point values except to see if one is greater than the other for purposes of clipping. You almost never use a same-value comparison as shown here; it is only used in this book for comparing the results of C code to assembly code, to see if your algorithms are producing results in the range you expected.
Listing 8-1: vmp_IsFEqual() — Compares two single-precision floating-point values and determines if
they are equivalent based upon the precision factor or if one is less than or greater than the other.
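A minimal sketch of such a function, assuming a Boolean return and a caller-supplied precision (the actual listing may differ in detail):
bool vmp_IsFEqual(float fA, float fB, float fPrecision)
{
    float fDif = fA - fB;          // signed difference
    if (fDif < 0.0f)               // take the absolute value
        fDif = -fDif;
    return (fDif <= fPrecision);   // true if within the slop factor
}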
For a fast algorithm that uses estimation for division or square roots, merely reduce the precision to something less accurate:
#define FAST_PRECISION 0.001f
This book will discuss these fast estimate algorithms in later chapters. For vector comparisons, this book uses the following code. When dealing with quad vectors (vmp3DQVector), an alternative function is called; see Listing 8-3.
Listing 8-2: Compare two {XYZ} vectors using a specified precision factor.
return true;
}
Listing 8-3: Compare two {XYZW} vectors using a specified precision factor.
FTST
FPU ftst Extended- 80
precision
This instruction compares 0.0 with the value in FPU register ST(0) and sets the FPU code flags {C0, C2, and
C3} with the results. (The state of the sign bit is ignored, allowing 0.0 = ±0.0.) If register ST(0) contains NaN an
unordered condition is set.
Condition C3 C2 C1 C0
Unordered 1 1 0 1
0.0 = ST(0) 1 0 0 0
0.0 > ST(0) 0 0 0 1
0.0 < ST(0) 0 0 0 0
FCOM
FCOMP
FCOMPP
FPU fcom mSrc(32/64) Extended-precision 80
fcom st(i)
fcom
fcomp mSrc(32/64)
fcomp st(i)
fcomp
fcompp
This ordered instruction compares the value of ST(0) with ST(i) or 32/64-bit memory and sets the condition flags. If no source operand is specified, ST(1) is used as a default. This instruction has the same functionality as FUCOM, but exceptions occur for operands with NaN: QNaN or SNaN values both set an unordered condition and generate an exception. The FCOMP instruction pops ST(0) off the stack, while FCOMPP pops ST(0) and ST(1) off the stack. (The state of the sign bit is ignored if zero, allowing 0.0 = ±0.0.)
Condition C3 C2 C1 C0
Unordered 1 1 – 1
ST(0) = SRC 1 0 – 0
ST(0) < SRC 0 0 – 1
ST(0) > SRC 0 0 – 0
FUCOM
FUCOMP
FUCOMPP
FPU fucom st(i) Extended-precision 80
fucom
fucomp st(i)
fucomp
fucompp
This unordered instruction compares the value of ST(0) with ST(i) and sets the condition flags. If no source
operand is specified, ST(1) is used as a default. This instruction has the same functionality as FCOM, but
exceptions only occur for operands with SNaN. A QNaN or SNaN sets the condition to unordered but the
QNaN value is valid in comparisons. The FUCOMP instruction pops ST(0) off the stack, while FUCOMPP pops
ST(0) and ST(1) off the stack. (The state of the sign bit is ignored if zero, allowing 0.0 = ±0.0.)
Condition C3 C2 C1 C0
Unordered 1 1 – 1
ST(0) = SRC 1 0 – 0
ST(0) < SRC 0 0 – 1
ST(0) > SRC 0 0 – 0
FCOMI
FCOMIP
FUCOMI
FUCOMIP
FPU fcomi st, Extended-precision
st(i)
fcomip st,
st(i)
fucomi st,
st(i)
fucomip st,
st(i)
The FCOMI and FCOMIP instructions have identical functionality to the FCOM and FCOMP instructions,
except that the EFLAGS are set instead of the FPU conditional flags.
The FUCOMI and FUCOMIP instructions have identical functionality to the FUCOM and FUCOMP instructions,
except that the EFLAGS are set instead of the FPU conditional flags.
Flags O.flow Sign Zero Aux Parity Carry
0 - X - X X
FICOM/FICOMP — FPU A ? B
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FICOM
FICOMP
FPU ficom mSrc(16/32) Extended-precision 80
ficomp mSrc(16/32)
This unordered instruction loads the 16/32-bit integer from memory, converts it to double extended-precision
floating-point, compares it to the value in the ST(0) register and sets the condition flags. A QNaN or SNaN
value sets the condition to unordered, but the QNaN value is valid in comparisons and a SNaN generates an
exception. The FICOMP instruction pops ST(0) off the stack. (The state of the sign bit is ignored if zero,
allowing 0.0 = ±0.0.)
Condition C3 C2 C1 C0
Unordered 1 1 – 1
ST(0) = SRC 1 0 – 0
ST(0) < SRC 0 0 – 1
ST(0) > SRC 0 0 – 0
FCMOVx
FPU fcmovb st(0), st(i) Extended-precision 80
" fcmove st(0), st(i) " "
" fcmovbe st(0), st(i) " "
" fcmovu st(0), st(i) " "
" fcmovnb st(0), st(i) " "
" fcmovne st(0), st(i) " "
" fcmovnbe st(0), st(i) " "
" fcmovnu st(0), st(i) " "
The EFLAGS are examined and the FPU moves the value from ST(i) to ST(0) if the condition is met.
Unordered – – 0 –
Flags: The condition flags C0, C2, and C3 remain unchanged. C1 is cleared if a stack underflow occurred.
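For example, a branchless minimum of two values can be built from FCOMI and FCMOVNB on a P6-class or later processor (a minimal sketch; a, b, and minval are illustrative memory operands):
fld a ; ST(0) = a
fld b ; ST(0) = b, ST(1) = a
fcomi st(0), st(1) ; compare b with a, results go straight to EFLAGS
fcmovnb st(0), st(1) ; if b is not below a, ST(0) = a
fstp minval ; store min(a, b) and pop
fstp st(0) ; pop the leftover value to clean up the stack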
FXAM
FPU fxam
This instruction sets the conditional flags to reflect the value in ST(0).
ST(0) State C3 C2 C1(±) C0
Denormal 1 1 x 0
Empty 1 0 x 1
0 1 0 x 0
∞ 0 1 x 1
Normal finite 0 1 x 0
NaN 0 0 x 1
Unsupported 0 0 x 0
Flags: The condition flag C1 reflects the sign bit of ST(0).
Figure 8-4: Ten-byte BCD data storage. The MSB in the far left byte (byte #9) is the sign bit and the rightmost eight bytes (#8…0) contain the BCD value pairs. The 18th BCD digit resides in the upper nibble of byte #8 and the first BCD digit resides in the lower nibble of byte #0.
FBLD
FPU fbld source BCD 80
The binary-coded decimal (BCD) load instruction loads the 80-bit value in memory, expands it to an 80-bit
double extended-precision floating-point form, and pushes it on the stack, thereby making it addressable by
the FPU as register ST(0).
bcd TBYTE ?
f REAL8 ?
fbld tbyte ptr bcd ; Load (10-byte) BCD and convert to 80-bit FP
fstp f ; Write 64-bit double-precision
FBSTP
FPU fbstp destination BCD 80
The binary-coded decimal (BCD) store instruction converts the value in register ST(0) to an 80-bit (10-byte) packed BCD integer, saves it to memory, and pops the value off the stack.
FPU Trigonometry
FPTAN
FPU fptan Extended- 80
precision
The partial tangent of the source radians in ST(0) is calculated, then stored in ST(0): ST(0) = tan(ST(0)).
FPATAN
FPU fpatan Extended- 80
precision
The arctangent of the quotient ST(1) ÷ ST(0) is calculated, the stack is popped, and the resulting angle (in radians) is stored in the new ST(0).
You should already be familiar with the fact that the angle is not passed into those functions in degrees but
instead in radians. If you recall, π (PI) is equivalent to 180° and 2π to 360°. By using the following macro, an
angle in degrees can be converted to radians:
#define PI 3.141592f
#define DEG2RAD(x) ((x) * (PI/180.0F))
…and used in the calculations. It can then be converted from radians back to degrees:
#define RAD2DEG(x) ((x) * (180.0f/PI))
…if needed for printing or other purposes.
There is one thing that has always bothered me about these two functions: When a cosine is needed, a sine is
typically needed as well, and that one is in reality 90 degrees out of phase of the other, which means that they
share similar algorithms. As the following equation shows, subtracting an angle from 90 degrees results in the
alternate angle. So, for example, sin(30°) is equal to cos(60°). As the angles of a triangle always add up to 180 degrees, that would mean that the two angles are really not that far from each other, and mathematical
operations to find one of the angles can be slightly adapted to find the other angle without having to do twice
the work.
sin(θ) = cos(90° – θ)
Equation 8-2: Angular relationship of sine to cosine.
But I find something really interesting. In Intel's wisdom they not only support the sine and cosine on their FPU
(floating-point unit) but they also support the combination sine-cosine, which returns both results. What I find
amazing is that very few programmers actually take advantage of it! In addition, I am on occasion called in to review other programmers' code, and I quite consistently have to recommend calling the sine and cosine functions only once in their matrix algorithms.
A 3D rotational matrix for an axis will use two identical cosines, a sine, and a negative sine, but why do I keep
seeing these functions each called twice? I see code such as follows for a left-handed rotate X:
Mx[1][2] = sin(fRadian);
Mx[2][2] = cos(fRadian);
Mx[1][1] = cos(fRadian);
Mx[2][1] = -sin(fRadian);
…and then you wonder why your code runs so slow! Instead, your code should be more like:
Mx[1][2] = sin(fRadian);
Mx[2][2] = Mx[1][1] = cos(fRadian);
Mx[2][1] = -Mx[1][2]; // -sin
…or better yet, if you are using an x86 processor use its combination sine/cosine instruction! When I work on
legacy projects involving trigonometry I quite often find that I need to drop the following code snippet into the
application:
sincos(&fSin, &fCos, fRadian);
Mx[1][2] = fSin;
Mx[2][2] = Mx[1][1] = fCos;
Mx[2][1] = -fSin;
Pseudo Vec
Similar to the other functions in this book, the sine and cosine functions have been wrapped for extra
portability and for alternate specialty function replacements. One for single-precision floating-point and another
for double-precision floating-point:
Listing 8-4: ...\chap8\ vtrig3d\VTrig3D.cpp
eventually by Microsoft. I believe the absence of a library function for a combination sine-cosine was merely an
oversight. It has recently been introduced in third-party libraries and in different forms.
AMD-SDK
void _sincos(float, float *);
void a_sincos(void); // mm0 –> mm0 (cos|sin)
__m64 _m_sincos(__m64); // mm0 –> mm0 (cos|sin)
3DSMax-SDK
inline void SinCos(float angle, float *sine, float *cosine)
I did a wildcard search on my computer for header files within various SDKs and this is all that turned up with
all the compilers I work with. Apparently, most programmers up until now either did not understand the issue or
did not have the ability to write FPU code! Let's fix that right now.
First, stub your cross-platform C function for sincos(). No matter the platform, you should always use
something similar.
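A minimal sketch of such a stub (the function name is illustrative; a later FPU or 3DNow! version would simply replace the body):
#include <math.h>

void sincosf_stub(float *pfSin, float *pfCos, float fRad)
{
    *pfSin = sinf(fRad);   // generic C run-time fallback
    *pfCos = cosf(fRad);   // replaced later by an FSINCOS-based version
}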
We will soon build an equivalent function using the FPU found in every 80x86 processor since the 80486, and
available as an optional coprocessor before that.
FSIN
FPU fsin Extended- 80
precision
The sine of the source radians in ST(0) is calculated and the result is stored in ST(0). See the table for the
FSINCOS instruction for source versus destination details.
float sinf(float rad);
double sin(double rad);
FCOS
FPU fcos Extended- 80
precision
The cosine of the source radians in ST(0), is calculated and the result is stored in ST(0). See the table for the
FSINCOS instruction for source versus destination details.
float cosf(float rad);
double cos(double rad);
FSINCOS
FPU fsincos Extended- 80
precision
The FSINCOS instruction calculates the sine and cosine of the radian value in ST(0) simultaneously; the sine replaces ST(0) and the cosine is then pushed, leaving the cosine in ST(0) and the sine in ST(1). It is slower than calling just sine or cosine, but faster than calling them both consecutively.
Note The value used for radians must be a proper, normalized floating-point value. If valid but out of range, a precision loss will occur. If the source radians are outside the allowed range of [–2^63, +2^63], the value can be reduced into range by N mod (2π).
Listing 8-5: …\chap8\vtrig3d\VTrig3DX86.asm
mov eax,pfSin
mov ecx,pfCos
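Continuing from the two pointer loads above, the core of such a routine might look like this (a minimal sketch; fRad is assumed to be the radian argument, and MASM syntax is assumed):
fld fRad ; ST(0) = angle in radians
fsincos ; ST(0) = cosine, ST(1) = sine
fstp dword ptr [ecx] ; *pfCos = cosine, pop
fstp dword ptr [eax] ; *pfSin = sine, pop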
vmp_SinCos (3DNow!)
AMD has a really nice solution for their 3DNow! functionality. It resolves the sine and cosine simultaneously using the MMX register set. Their algorithm is actually pretty cool: what they do is ratio π (3.14159…) to a value of 4. The value of {–3.14159 … 3.14159} is effectively scaled to a value of {–4.0 … 4.0}, which in essence makes π radians a base-2 number, and thus integer math is used in conjunction with parallel floating-point math and MMX registers. Signs are tracked to encode the quadrant of the circle in which the angle resides. Since a circle is actually made up of eight octants (investigate the Bresenham DDA algorithm for circle plotting), only the modulus 45° angle needs to be resolved. Using this methodology, several cos/sin functions could be processed simultaneously.
Figure 8-9: The relationship of a circle with its quadrants and associated sizes
The code can be downloaded from the AMD web site (www.amd.com) and is contained within their Software
Development Kit (SDK) – AMD Library Reference.
Cut and paste it into your own code. If using Direct3D, it is automatically called if you are using a processor
supporting the 3DNow! instruction set; there is also an SSE version. It is a lot more efficient than having to do
an FPU/MMX switch, especially as sine and cosine are processed simultaneously in parallel.
I am not telling you to reverse-engineer the DirectX library, as that would be wrong and illegal, but by setting a
break point on the DirectX library function call and upon getting the break, set your view to assembly code and
step into the code. You can cut and paste a copy of that code to a text editor, allowing you to edit and analyze
how it works. Remember, there is more than one algorithm residing in the library as well.
FINIT
FNINIT
FPU finit
fninit
The FINIT instruction checks for pending unmasked floating-point exceptions, then initializes the FPU.
The FNINIT instruction initializes the FPU without any pending checks.
FCLEX
FNCLEX
FPU fclex
fnclex
The FCLEX instruction checks for pending unmasked floating-point exceptions, then clears the exception
flags.
The FNCLEX instruction clears the exception flags without any pending checks.
– – – –
Flags: The condition flags {C0, C1, C2, C3} remain unchanged; however, the other FPU flags (PE, UE, OE, ZE, DE, IE, ES, SF, and B) are cleared.
FFREE
FPU ffree st(i)
This instruction sets the tag associated with ST(i) to an empty state.
FSAVE
FNSAVE
FPU fsave mSrc(94/108)
fnsave mSrc(94/108)
The FSAVE instruction checks for pending unmasked floating-point exceptions, saves the FPU state to
94/108-byte memory, and then reinitializes the FPU.
The FNSAVE instruction saves the FPU state to 94/108-byte memory, and then reinitializes the FPU.
The assembler typically inserts the FWAIT instruction before this instruction to ensure the FPU is ready (no
instruction in process) to save.
(16-bit) Real Mode saves 94 bytes, and (32-bit) Protected Mode saves 108 bytes. The complement of this
instruction is FRSTOR.
FRSTOR
FPU frstor mSrc(94/108)
This instruction loads the FPU state from 94/108-byte memory that was saved with the instruction FSAVE or
FNSAVE.
FXSAVE
FPU fxsave mSrc(512)
This instruction saves the registers {FPU, MMX, XMM, and MXCSR} to a 512-byte memory block. The
complement of this instruction is FXRSTOR.
FXRSTOR — FPU Restore x87 FPU, MMX, SSE, SSE2, SSE3
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
FXRSTOR
FPU fxrstor mSrc(512)
This instruction loads registers from 512-byte memory saved with the FXSAVE instruction.
FSTENV
FNSTENV
FPU fstenv mSrc(14/28)
fnstenv mSrc(14/28)
The FSTENV instruction checks for pending unmasked floating-point exceptions, then masks all exceptions
and saves the FPU environment to 14/28-byte memory.
The FNSTENV instruction saves the FPU environment to 14/28-byte memory, then masks all floating-point exceptions.
The assembler typically inserts the FWAIT instruction before this instruction to ensure the FPU is ready (no
instruction in process) to save.
The data saved is the FPU control word, status word, tag word, instruction pointer, data pointer, and last
opcode. (16-bit) Real Mode saves 14 bytes, and (32-bit) Protected Mode saves 28 bytes. The complement of
this instruction is FLDENV.
FLDENV
FPU fldenv mSrc(14/28)
This instruction loads the environment from 14/28-byte memory that was saved with the instruction FSTENV or
FNSTENV.
FSTCW
FNSTCW
FPU fstcw mSrc(2)
fnstcw mSrc(2)
The FSTCW instruction checks for pending unmasked floating-point exceptions, then saves the 16-bit FPU
control word to memory.
The FNSTCW instruction saves the 16-bit FPU control word to memory.
The assembler typically inserts the FWAIT instruction before this instruction to ensure the FPU is ready (no
instruction in process) to save.
The complement of this instruction is FLDCW.
FLDCW
FPU fldcw mSrc(2)
This instruction loads the 16-bit FPU control word from memory that was saved with the instruction FSTCW or
FNSTCW.
FSTSW
FNSTSW
FPU fstsw mSrc(2)
fstsw AX
fnstsw mSrc(2)
fnstsw AX
The FPU has condition code bits contained within the status register. These bits match 1:1 with the EFLAGS
of the CPU. They can be copied to memory or the AX register using this instruction. They can then be copied
to the EFLAGS register using the SAHF instruction.
fstsw ax ; Copy FPU condition bits to AX register.
sahf ; Copy AH to EFLAGS.
The following is a code snippet that detects single-precision floating-point overflows. In this case, the value is
radians.
OFLOW_FIXUP fRads ; Overflow fixup
Listing 8-6: incx86\vmp3DX86M.inc
To fix the overflow of a single-precision floating-point value, it can be expanded to double extended-precision
floating-point or double-precision floating-point, such as in the following code snippet.
Listing 8-7: FixupReal4
; Fixup Real4 (float)
; eax=bad float
; NOTE: ST(0) already contains the float!!!
; Build a Real10
mov eax,[esp-8-10+4] ; Hi
or eax,7FFF0000h ; Set Real8 (Double)
mov dword ptr [esp-10+6],eax ; 7FFFXXXX????????????
ret ; ST(0)=value
FixupReal4 endp
Exercises
1. Write a sincos function in assembly language similar to sincosf but using double-precision floating-
point.
2. Write a floating-point equation A+B×C+D using RPN. Write it using FPU instructions.
3. Write a factorial program to solve for 15! using FPU instructions and taking advantage of the stack.
4. How much accuracy does the vmp_IsFEqual() function allow with a single-precision definition? Fast
precision?
5. Does vmp_IsFEqual() accept a negative value for the third argument? Should there be an assertion?
Why or why not?
6. In this chapter, vmp_IsVEqual() uses an ASSERT_PTR4(). What assertion would be used instead to
force a 16-byte alignment?
7. Write C functions to support double-precision for:
a. vmp_IsDEqual() scalar double-precision
b. vmp_IsDVEqual() vector double-precision
c. vmp_IsDQVEqual() quad vector double-precision
Chapter 9: Comparison
Overview
Comparisons are typically related to decision making. A C programmer is typically familiar with one of the
following:
An index into a table is calculated by deciding between multiple factors:
index = (a < b ) ? i : j;
A comparison is used to decide which branch to take:
if (a < b )
{
dothis();
}
else
{
dothat();
}
The general-purpose instructions set the EFLAGS/RFLAGS conditional flags to indicate their scalar result.
Additional instructions would then utilize individual flags that are reset (cleared) or set, such as branching and
bit shifting/rotation. The instructions utilizing packed data store the result of the comparison as packed Boolean
values, where all bits clear (0) = false and all bits set (–1) = true.
Since the logical AND operation has already been discussed, let us start with the TEST instruction, which has
a similar functionality.
TEST
test rmDst(8/16/32/64), #(8/16/32) [Un]signed
test rmDst, rSrc(8/16/32/64)
The TEST general-purpose instruction logically AND's the 8-, 16-, 32-, or 64-bit operand1 with the operand2
and sets the flags accordingly, but there is no destination as the result is not stored. An AND operation means
that you would need both A and B to be true to have a true result. The behavior of this instruction is identical to
the AND instruction except that there is no saved result. A 32-bit source value can be sign extended to a 64-bit
destination.
Flags O.flow Sign Zero Aux Parity Carry
0 X X - X 0
Flags: The Carry and Overflow flags are cleared. The Sign flag reflects the state of the MSB.
mov eax,00000a5a5h
mov edx,000000ff0h
test eax,edx
; 00000000000000001010010110100101b 00000a5a5h
; AND 00000000000000000000111111110000b 000000ff0h
Tip The best way to test for zero is to AND a value with itself using the TEST instruction; nothing is written, only data is read, and EFLAGS are set with no write cycles.
test ebx,ebx ; cmp ebx,ebx
jne $L1
BT — Bit Test
BT operand, index
BT
bt rmDst(16/32/64), #(8) [Un]signed
bt rmDst, rSrc(16/32/64)
This general-purpose instruction tests the bit in the operand, using the index {0…2^N–1}, and returns the state
of the bit in the Carry flag.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - X
Flags: The Carry flag contains the value of the indexed bit.
foo: dd 01011011101111011111011111101111b
mov ebx,0fedcba98h
;11111110110111001011101010011000b
mov eax,31
bt ebx,eax ; carry will equal 1 (bit #31 is set)
BTC — Bit Test and Complement
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
BTC
btc rmDst(16/32/64), #(8) [Un]signed
btc rmDst, rSrc(16/32/64)
This general-purpose instruction tests the bit in the destination, using the index {0…2^N–1}, returns the state of
the bit in the Carry flag, and then performs a one's complement, that is, it flips the indexed bit.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - X
Flags: The Carry flag contains the value of the indexed bit before the complement.
mov ebx,0fedcba98h
;11111110110111001011101010011000b
mov eax,31
btc ebx,eax ; carry will equal 1
; ebx is now 01111110110111001011111000011001b 07edcbe19h
BTR
btr rmDst(16/32/64), #(8) [Un]signed
btr rmDst, rSrc(16/32/64)
This general-purpose instruction tests the bit in the destination, using the index {0…2^N–1}, returns the state of
the bit in the Carry flag, and then resets (clears) the indexed bit to 0.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - X
Flags: The Carry flag contains the value of the indexed bit before it is altered.
mov ebx,0fedcba98h
;11111110110111001011101010011000b
btr ebx,10 ; carry will equal 0
; ebx is now 11111110110111001011101010011000b 0fedcba98h
; note that bit#10 is cleared.
mov eax,31
btr ebx,eax ; carry will equal 1
; ebx is now 01111110110111001011101000011000b 07edcba18h
BTS
bts rmDst(16/32/64), #(8) [Un]signed
bts rmDst, rSrc(16/32/64)
This general-purpose instruction tests the bit in the destination, using the index {0…2^N–1}, returns the state of
the bit in the Carry flag, and then sets the bit to 1.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - X
Flags: The Carry flag contains the value of the indexed bit before it is altered.
mov ebx,0fedcba98h
;11111110110111001011101010011000b
bts ebx,10 ; carry will equal 0
; ebx is now 11111110110111001011111010011000b 0fedcbe98h
; note that bit #10 is now set.
SETcc — Set Byte on Condition
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
SETcc
setcc rmDst(8) [Un]signed
This instruction sets the 8-bit destination to 1 (true) if the EFLAGS condition is met or 0 (false) if not. These flags are typically set by the comparison of two values. The values can be equal (=) or not equal (≠). When the two values are not equal, the relational operations {<, ≤, ≥, >} have different meanings for signed versus unsigned data. When the values are signed, a value is less than (<), less than or equal to (≤), greater than or equal to (≥), or greater than (>) another; these conditions use the combination of the Sign, Overflow, and Zero flag states. When the values are unsigned, a value is below (<), below or equal to (≤), above or equal to (≥), or above (>) another; these conditions use the combination of the Carry and Zero flag states.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
Flags: None are affected by this opcode. They are read-only for their flag states.
Left and right columns are complemented instructions. Conditional Set if…
SETA+ Set if above. ZF=0, CF=0
SETNBE Set if not below or equal.
SETAE + Set if above or equal. CF=0
SETNB Set if not below.
SETC Set if carry. CF=1
SETE Set if equal. ZF=1
SETZ Set if zero.
SETG ± Set if greater. SF=OF. ZF=0.
SETNLE Set if not less or equal.
SETGE ± Set if greater or equal. SF=OF
SETNL Set if not less.
SETO Set if overflow. OF=1
SETP Set if parity. PF=1
SETPE Set if parity even.
SETS Set if sign. SF=1
SETBE Set if below or equal. CF=1 or ZF=1
+ Set if not above.
SETNA
SETB + Set if below. CF=1
SETNAE Set if not above or equal.
SETNC Set if no carry. CF=0
SETNE Set if not equal. ZF=0
SETNZ Set if not zero.
SETLE ± Set if less or equal. SF<>OF or ZF=1
SETNG Set if not greater.
SETL ± Set if less. SF<>OF
SETNGE Set if not greater or equal.
SETNO Set if not overflow. OF=0
SETNP Set if not parity. PF=0
SETPO Set if parity odd.
SETNS Set if no sign. SF=0
One of the reasons for this instruction is to remove branches from our code such as follows:
test eax,ebx
mov eax,0
jne $L1 ; Jump if not equal
inc eax
$L1: ; eax =0 if false, 1 if true
It takes time to actually process the branch — prediction failure, prefetch load, etc. — but the same result can
be achieved with the following:
test eax,ebx
mov eax,0
sete al
Some computer languages use the following to indicate a true or false condition:
typedef enum {
FALSE = 0,
TRUE = 1
} BOOL;
We can duplicate it by using an opposite SETcc instruction and then decrement. Let's modify the sample from
above:
test eax,ebx
mov eax,0
setne al ; Note: the sete vs setne
dec eax ; 0 becomes 0ffffffffh (1)
; 1 becomes 000000000h (0)
CMP
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
cmp rmDst(8/16/32/64), #(8/16/32) [Un]signed
cmp rmDst, rSrc(8/16/32/64)
cmp rDst, rmSrc(8/16/32/64)
This general-purpose instruction compares operand1 with operand2 by subtracting operand2 from operand1.
The flags are set as per a SUB (subtraction) instruction, but the end result is thrown away and not saved. An
8-bit source immediate value can be sign extended to 16-, 32-, or 64-bit value. A 32-bit immediate source
value can be sign extended to a 64-bit destination.
Flags O.flow Sign Zero Aux Parity Carry
X X X X X X
Flags: Based upon whether the operands were signed or unsigned the following table should be used to reflect
the results.
Comparison (±) Signed (+)Unsigned ± +
Comparisons are all about signage. Is the comparison signed or unsigned? The conditional response is based
upon terminology: greater-less is signed, and above-below is unsigned.
foodd: dd 0
cmp al,12h
cmp bx,1234h
cmp ecx,12345678h
cmp rdx,rax
cmp foodd,12345678h
COMISD
COMISS
SSE comiss xmmA, xmmB/m32 Single-precision 128
SSE2 comisd xmmA, xmmB/m64 Double-precision 128
These SSE-based instructions compare the ordered scalar operands of source xmmA and source xmmB to
each other and set the general-purpose EFLAGS as to the result. The upper elements are ignored. Unlike
UCOMISS or UCOMISD, QNaN or SNaN generate an exception. The COMISS instruction is used for single-
precision floating-point, and COMISD is used for double-precision floating-point comparisons.
A?B Zero Parity Carry
Unordered 1 1 1
= 1 0 0
> 0 0 1
< 0 0 0
Unordered results if either source operand is QNaN or SNaN. This is also referred to as NaN.
Flags O.flow Sign Zero Aux Parity Carry
0 0 X 0 X X
Flags: The ZF, PF, and CF are set based upon the comparison. All other flags are set to zero.
As usual, try to keep your source floating-point values valid, as an invalid source value of SNaN or QNaN
generates a floating-point exception error.
UCOMISD
UCOMISS
SSE ucomiss xmmDst, xmmSrc/m32 Single-precision 128
SSE2 ucomisd xmmDst, xmmSrc/m64 Double-precision 128
These SSE-based instructions compare the unordered scalar operands of source xmmA and source xmmB to each other and set the general-purpose EFLAGS as to the result. The upper elements are ignored.
A?B Zero Parity Carry
Unordered 1 1 1
= 1 0 0
> 0 0 1
< 0 0 0
The results are unordered if either source operand is QNaN or SNaN. This is also referred to as NaN. Only an
SNaN generates an exception, unlike the COMISS and COMISD instructions, which generate an exception on
either.
Flags O.flow Sign Zero Aux Parity Carry
0 0 X 0 X X
Flags: The ZF, PF, and CF are set based upon the comparison. All other flags are set to zero.
As usual, try to keep your source floating-point values valid, as an invalid source value of SNaN generates a
floating-point exception error.
CMPSB/CMPSW/CMPSD/CMPSQ — Compare String Operands
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
CMPSx
CMPSQ
cmps{b/w/d/q} [Un]signed
cmps mDst, mSrc(8/16/32/64)
This general-purpose instruction compares the memory referenced by operand1 with that referenced by
operand2 by subtraction similar to that of the CMP instruction. The end result is thrown away and not saved,
but the flags are set. The CMPSB, CMPSW, and CMPSD instructions compare the memory referenced by
ES:[EDI] with DS:[ESI] in Protected Mode or ES:[DI] with DS:[SI] in Real Mode. The (e)SI and (e)DI index
registers are incremented automatically depending on how the direction flag is set. For 64-bit mode the (R)SI
and (R)DI registers are used instead. If the direction flag is clear, they are incremented, or if set, they are
decremented; this is the same as the STOSB and MOVSB instructions. Also, in 64-bit mode AH, BH, CH, and
DH cannot be used with REX encoding.
When used in a loop, typically one of the repeat instructions will be used: REP, REPE, REPZ, REPNE, or
REPNZ.
For optimized code this instruction should be considered to be a "complex" instruction and an alternative
should be used instead!
Flags O.flow Sign Zero Aux Parity Carry
X X X X X X
Flags: Based upon whether the operands were signed or unsigned the following table should be used to reflect
the results.
test ecx,ecx
je $L1 ; Jump if done
; Found a mismatch
$L1:
CMP — Packed Comparison
The two types of comparison are integers and floating-point values. Integers are compared to integers and
floating-point values are compared to floating-point values.
Hint SSE and SSE2 use an immediate value to determine the type of condition for single-
precision and double-precision values.
CMPPS/CMPSS/CMPPD/CMPSD — Floating-Point
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
CMPPD
CMPPS
CMPSD
CMPSS
SSE cmpps xmmDst, xmmSrc/m128, # Single-precision 128
" cmpss xmmDst, r32, # Single-precision scalar
" cmpss xmmDst, xmmSrc/m32, #
SSE2 cmppd xmmDst, xmmSrc/m128, # Double-precision 128
" cmpsd xmmDst, r64, # Double-precision scalar 128
" cmpsd xmmDst, xmmSrc/m64, #
The SSE-based CMPSS, CMPSD, CMPPS, and CMPPD instructions use an immediate value that specifies
the method of compare.
Table 9-1: SSE SPFP and DPFP immediate compare codes
0 = equal, 1 = less than, 2 = less than or equal, 3 = unordered, 4 = not equal, 5 = not less than (≥), 6 = not less than or equal (>), 7 = ordered
The resulting bits set in the XMM register, 0 or –1, map to zero or QNaN respectively, which are usable floating-point numbers. These can be used in conjunction with the logical Boolean instructions to affect the result. These are discussed later in this chapter.
PCMPEQx
PFCMPEQ
MMX pcmpeq{b/w/d} mmxDst, mmxSrc [Un]signed 64
3DNow! pfcmpeq mmxDst, mmxSrc Single-precision 64
SSE cmpps xmmDst, xmmSrc/m128, 0 Single-precision 128
" cmpss xmmDst, r32, 0 Single-precision scalar
" cmpss xmmDst, xmmSrc/m32, #
SSE2 cmppd xmmDst, xmmSrc/m128, 0 Double-precision 128
" cmpsd xmmDst, r64, 0 Double-precision scalar
" pcmpeq{b/w/d} xmmDst, xmmSrc [Un]signed
PFCMPGE
3DNow! pfcmpge mmxDst, mmxSrc Single-precision 64
SSE cmpps xmmDst, xmmSrc, 5 Single-precision 128
" cmpss xmmDst, r32, 5 Single-precision scalar
SSE2 cmppd xmmDst, xmmSrc, 5 Double-precision 128
" cmpsd xmmDst, r64, 5 Double-precision scalar
PCMPGTx
PFCMPGT
MMX pcmpgt{b/w/d} mmxDst, mmxSrc Signed 64
3DNow! pfcmpgt mmxDst, mmxSrc Single-precision 64
SSE cmpps xmmDst, xmmSrc, 6 Single-precision 128
" cmpss xmmDst, r32, 6 Single-precision scalar
SSE2 cmppd xmmDst, xmmSrc, 6 Double-precision 128
" cmpsd xmmDst, r64, 6 Double-precision scalar 128
" pcmpgt{b/w/d} xmmDst, xmmSrc Signed
This SIMD instruction is a 64 (128)-bit parallel operation that compares the individual {eight 8-bit, four 16-bit, or
two 32-bit} ({sixteen 8-bit, eight 16-bit, or four 32-bit}) source elements. If aSrc (xmmDst) meets the condition versus bSrc (xmmSrc), then all bits will be set to ones in the destination Dst (xmmDst) for that element; if not, all bits will be clear (reset) for that element. This is equivalent to –1 or 0.
vD[] = (vA[] ? vB[]) ? –1 : 0 // per element
movq mm7,[edi]
pcmpeqd mm7,[esi]
; dst 0a5a41122h 00405f604h
; src 0a5a41122h 0050503f7h
; 0ffffffffh 000000000h
; mm7 0ffffffff00000000h
This can also be used in a not-equal (≠) form by using a one's complement on the result. The complement form of the less-than or equal to (≤) comparison can be produced the same way.
PMOVMSKB
SSE pmovmskb r(32/64), mmxSrc [Un]signed 64
" pmovmskb r(32/64), xmmSrc 128
This instruction copies the MSB sign bit from each byte of the source xmmA and copies them to the associated
bit of the least significant bits of the destination general-purpose register. For MMX, an 8-bit mask is generated
with the upper bits set to zero. For SSE, a 16-bit mask is generated with upper bits set to zero.
MOVMSKPS
MOVMSKPD
movmskps r32, xmmSrc 32
movmskps r64, xmmSrc 64
movmskpd r32, xmmSrc 32
movmskpd r64, xmmSrc 64
The MOVMSKPS instruction copies the four sign bits of the single-precision floating-point elements {0…3} into
the corresponding bit positions 0…3 and sets the upper bits to zero with the result stored in the 32/64-bit
general-purpose register. The four bits result in a mapping of 16 possibilities of sign states.
The MOVMSKPD instruction copies the two sign bits of the double-precision floating-point elements {0…1} into
the corresponding bit positions 0…1 and sets the upper bits to zero with the result stored in the 32/64-bit
general-purpose register. The two bits result in a mapping of four possibilities.
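For example, the mask lets a whole packed comparison be tested with a single branch (a minimal sketch; all_less is an illustrative label):
cmpps xmm0, xmm1, 1 ; per element: (xmm0 < xmm1) ? all ones : all zeros
movmskps eax, xmm0 ; gather the four element sign bits into EAX bits 0..3
cmp eax, 0Fh ; were all four comparisons true?
je all_less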
SCAS/SCASB/SCASW/SCASD/SCASQ — Scan String
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
SCASx
SCASQ 64 64
scas{b/w/d/q} [Un]signed
scas mDst{8/16/32/64}
This instruction scans the contents of memory and compares it (using a read-only subtraction) to the {AL, AX,
EAX, RAX} register dependent upon {SCASB, SCASW, SCASD, SCASQ} and sets the flags accordingly. It
belongs to the group of string functions, but it really has nothing to do with text strings in the C sense; it operates on strings of memory, that is, contiguous bytes of memory.
Flags O.flow Sign Zero Aux Parity Carry
X X X X X X
Flags: All the flags reflect the result of the scan. When used in conjunction with a REP prefix the flags reflect
the last SCAS operation.
32-bit mode
scasb scasw scasd scasq
64-bit mode
scasb scasw scasd scasq
REP SCASx
A single scan operation can be performed or a repeat sequence specified by a REP prefix word and a count
specified in the RCX register in 64-bit mode, ECX register in Protected Mode, or CX in Real Mode.
rep scasd
L3: mov eax,es:[edi]
add edi,4
dec ecx
jne L3
CMOVcc
cmovcc rDst, rmSrc(16/32/64) [Un]signed
This general-purpose instruction moves the data from source to destination if the flags condition is met. See
the following chart.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
Flags: None are affected by this opcode. They are read-only for their flag states.
mov eax,edx
$L1:
Left and right columns are complemented instructions; mnemonics grouped together test the same condition. Conditional Move if…
CMOVA   +  if above. ZF=0, CF=0             CMOVBE  +  if below or equal. CF=1 or ZF=1
CMOVNBE    if not below or equal.           CMOVNA     if not above.
CMOVAE  +  if above or equal. CF=0          CMOVB   +  if below. CF=1
CMOVNB     if not below.                    CMOVNAE    if not above or equal.
CMOVC      if carry. CF=1                   CMOVNC     if no carry. CF=0
CMOVE      if equal. ZF=1                   CMOVNE     if not equal. ZF=0
CMOVZ      if zero.                         CMOVNZ     if not zero.
CMOVG   ±  if greater. SF=OF, ZF=0          CMOVLE  ±  if less or equal. SF<>OF or ZF=1
CMOVNLE    if not less or equal.            CMOVNG     if not greater.
CMOVGE  ±  if greater or equal. SF=OF       CMOVL   ±  if less. SF<>OF
CMOVNL     if not less.                     CMOVNGE    if not greater or equal.
CMOVO      if overflow. OF=1                CMOVNO     if not overflow. OF=0
CMOVP      if parity. PF=1                  CMOVNP     if not parity. PF=0
CMOVPE     if parity even.                  CMOVPO     if parity odd.
CMOVS      if sign. SF=1                    CMOVNS     if no sign. SF=0
CMPXCHG
cmpxchg rmDst8, rSrc8      [Un]signed
cmpxchg rmDst16, rSrc16
cmpxchg rmDst32, rSrc32
cmpxchg rmDst64, rSrc64
A secondary source {AL/AX/EAX/RAX} is used depending on the referenced bit size {8/16/32/64}. This
instruction compares {AL/AX/ EAX/RAX} with the value in the destination, and if equal the Zero flag is set and
the source is copied into the destination. If not equal, the {AL/AX/EAX/RAX} register is loaded with the value in
the destination.
Flags O.flow Sign Zero Aux Parity Carry
- - X - - -
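A minimal sketch of the classic use, an atomic compare-and-swap to claim a lock (the LockVar variable and $Busy label are hypothetical):
mov  eax,0                 ; expected value: 0 = free
mov  ecx,1                 ; new value: 1 = taken
lock cmpxchg LockVar,ecx   ; if (LockVar == eax) LockVar = ecx (ZF=1), else eax = LockVar (ZF=0)
jnz  $Busy                 ; another thread already owns the lock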
CMPXCHG8B
cmpxchg8b rmDst64   [Un]signed   64
This instruction compares the 64-bit value in EDX:EAX with the value in the destination mDst. If equal, the
Zero flag is set and ECX:EBX is copied into the destination mDst. If not equal, EDX:EAX is loaded with the
value in the destination mDst.
Flags O.flow Sign Zero Aux Parity Carry
- - X - - -
Flags: Only the Zero flag is affected by this opcode.
if (EDX:EAX != rmDst)
EDX:EAX = rmDst ; zero flag = 0
else
rmDst = ECX:EBX ; zero flag = 1
CMPXCHG16B   64
cmpxchg16b rmDst128   [Un]signed   128
This instruction compares the 128-bit value in RDX:RAX with the value in the destination mDst. If equal, the
Zero flag is set and RCX:RBX is copied into the destination mDst. If not equal, the RDX:RAX is loaded with the
value in the destination mDst.
Flags O.flow Sign Zero Aux Parity Carry
- - X - - -
There is an exception to this, and that is the use of 3DNow!. With that instruction set, integers and floating-point values can be mixed and matched.
A quick preview of floating-point numbers illuminates the individual bits and their associations with their
particular components.
By manipulating key bits the number can be affected. For example, for a 32-bit single-precision floating-point number the sign bit is bit #31 (080000000h). Setting it (1) indicates the number is negative, and resetting it (0) indicates the number is positive. Other bit manipulations can be conceived by re-examining the bits
described at the beginning of this chapter.
This coincidentally happens to be an example of branchless code. It uses bit blending with masks generated by the results of the floating-point comparison (0 or –1).
a = (a >= b) ? c : d;
Predicate:   0    1    2     3      4     5     6     7
Condition:   =    <    <=    ~ORD   !=    !<    !<=   ORD
cmpps  xmm0, xmm1, 5   ; mask = (a >= b) ? –1 : 0
movaps xmm7, xmm0      ; Copy the bit mask
andps  xmm0, xmm2      ; a = mask & c
andnps xmm7, xmm3      ; t = ~mask & d
orps   xmm0, xmm7      ; a = a | t
The register xmm0 contains either c or d, depending on the result of the comparison.
ANDPD
ANDPS
SSE  andps xmmDst, xmmSrc/m128   Single-precision   128
SSE2 andpd xmmDst, xmmSrc/m128   Double-precision   128
This instruction ANDs packed bits in the form of a floating-point number with the single- or double-precision
floating-point value.
By ANDing with a mask that keeps every bit except the sign bit, the positive (absolute) form of that number is retained:
Axy = | Axy |
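A minimal sketch (the mask constant is hypothetical and assumed to be 16-byte aligned):
FAbsMask REAL4 07fffffffh,07fffffffh,07fffffffh,07fffffffh
      andps xmm0,xmmword ptr FAbsMask   ; clear only the sign bits: each element becomes |element|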
ORPS — Logical OR of Packed SPFP   D = A | B
ORPD
ORPS
SSE  orps xmmDst, xmmSrc/m128   Single-precision   128
SSE2 orpd xmmDst, xmmSrc/m128   Double-precision   128
This instruction ORs packed bits in the form of a floating-point number with the single- or double-precision
floating-point value. Typical masking values are 0 or –1.
XORPD
XORPS
SSE  xorps xmmDst, xmmSrc/m128   Single-precision   128
SSE2 xorpd xmmDst, xmmSrc/m128   Double-precision   128
This instruction XORs packed bits in the form of a floating-point number with the single- or double-precision
floating-point value. Typical masking values are 0 or –1.
As we've learned, the sign bit is merely the MSB of the occupied bits. The sign mask 080000000h itself maps to the legitimate value –0.0, so XORing it with a value flips that value's sign (for example, +0.0 becomes –0.0 and back again).
Pseudo Vec (XMM)
FChsMask REAL4 080000000h,080000000h,080000000h,080000000h
By XORing only the sign bit of a floating-point number (a one's complement of just that bit), the sign is flipped while the rest of the number is retained.
Axyzw = – Axyzw
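Using the FChsMask constant above (assumed 16-byte aligned), the sign flip is a single sketch-level instruction:
xorps xmm0,xmmword ptr FChsMask   ; flip only the sign bit of each packed element: A = -A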
The SSE-based instructions differentiate between integers and floating-point values, and thus cannot be used
for bit blending unless the source values are legitimate floating-point numbers. The 3DNow!-based
instructions do not differentiate between integers or floating-point values so the two can be indiscriminately
blended.
ANDNPD
ANDNPS
SSE  andnps xmmDst, xmmSrc/m128   Single-precision   128
SSE2 andnpd xmmDst, xmmSrc/m128   Double-precision   128
This instruction ANDCs (AND-complements) packed bits in the form of a floating-point number with the single- or double-precision floating-point value. The mask (the destination) is inverted (bits flipped) and then logically ANDed with the other operand, and the result is stored.
Min — Minimum
MMX      SSE        SSE2       3DNow!    3DMx+
         8×8-bit    16×8-bit   2×SPFP    8×8-bit
         4×16-bit   8×16-bit             4×16-bit
         4×SPFP     2×DPFP
         1×SPFP     1×DPFP
The simplified form of this parallel instruction individually compares the integer or floating-point source
arguments and returns the minimum value result in the destination.
vD[] = (vA[] < vB[] ) ? vA[] : vB[] // an element
Pseudo Vec
The previous C equation is a branching equation, which can cause a processor misprediction whether the
branch is taken or not. A scalar operation could be done with branchless code such as follows:
// r = (p < q) ? p : q;
__inline int MIN(int p, int q)
{
    int r = (p - q) >> INT_MAX_BITS;   // (–)=0xFFFFFFFF  (+)=0x00000000
    return (p & r) | (q & (r ^ -1));   // keep the lower of p or q
}
The two values p and q are being compared so that the retained value is the smaller one. If p is less than q,
subtraction (p–q) generates a negative value. The sign bit is then arithmetically shifted to the right the size of
the data word, which would be a 31-bit shift, thus latching the MSB of 1 into all the bits. If p >= q, then p–q is zero or positive, the sign bit of zero is latched into all the bits, and a mask of all zeros is generated. By bit blending with the mask and its inverse, the desired value is retained. For legacy processors that do not
support this instruction it can be replicated in parallel using a packed arithmetic shift right or with a packed
compare, if they are supported.
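For example, a minimal sketch of a signed 16-bit packed minimum on plain MMX, built from a packed compare and bit blending (mm2 as scratch is an assumption):
movq    mm2,mm0     ; copy a
pcmpgtw mm2,mm1     ; mask = (a > b) ? -1 : 0 per element
pand    mm1,mm2     ; keep b where a > b
pandn   mm2,mm0     ; keep a where a <= b
por     mm2,mm1     ; mm2 = per-element minimum
movq    mm0,mm2     ; result back into mm0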
PMINUB
MMX+ pminub mmxDst, mmxSrc/m64    Unsigned   64
SSE  pminub mmxDst, mmxSrc/m64    Unsigned   64
SSE2 pminub xmmDst, xmmSrc/m128   Unsigned   128
This SIMD instruction is a 64 (128)-bit parallel operation that compares the eight (16) individual 8-bit source
integer bit blocks aSrc (xmmSrc) and bSrc (xmmDst), with the minimum value result being stored in the
destination Dst (xmmDst).
16-Bit Integer Minimum
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
PMINSW
MMX+ pminsw mmxDst, mmxSrc/m64    Signed   64
SSE  pminsw mmxDst, mmxSrc/m64    Signed   64
SSE2 pminsw xmmDst, xmmSrc/m128   Signed   128
This SIMD instruction is a 64 (128)-bit parallel operation that compares the four (eight) individual 16-bit source
integer bit blocks aSrc (xmmSrc) and bSrc (xmmDst), with the minimum value result being stored in the
destination Dst (xmmDst).
SPFP Minimum
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
PFMIN
MINPS
3DNow! pfmin mmxDst, mmxSrc/m64    Single-precision   64
SSE    minps xmmDst, xmmSrc/m128   Single-precision   128
This SIMD instruction is a 64 (128)-bit parallel operation that compares the two (four) individual single-
precision floating-point source bit blocks aSrc (xmmSrc) and bSrc (xmmDst), with the minimum value result
being stored in the destination Dst (xmmDst).
MINSS
SSE minss xmmDst, xmmSrc/m32   Single-precision   128
This SIMD instruction is a 128-bit scalar operation that compares only the lowest bit block containing the scalar
single-precision floating-point aSrc (xmmSrc) and bSrc (xmmDst), and stores the minimum value result in the
lowest bit block at destination Dst (xmmDst). The remaining floating-point bit blocks are left intact.
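A minimal sketch of a common use, clamping a scalar value to a range (register assignments assumed):
maxss xmm0,xmm1   ; raise xmm0 to at least the lower bound in xmm1
minss xmm0,xmm2   ; then lower it to at most the upper bound in xmm2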
DPFP Minimum
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
MINPD
SSE2 minpd xmmDst, xmmSrc/m128   Double-precision   128
This SIMD instruction is a 128-bit parallel operation that compares the two individual double-precision floating-
point source bit blocks aSrc (xmmSrc) and bSrc (xmmDst), with the minimum value result being stored in the
destination Dst (xmmDst).
MINSD
SSE2 minsd xmmDst, xmmSrc/m64   Double-precision   128
This SIMD instruction is a 128-bit scalar operation that compares only the lower 64-bit block containing the
scalar double-precision floating-point aSrc (xmmSrc) and bSrc (xmmDst), and stores the minimum value
result in the lower bit block of the destination Dst (xmmDst). The remaining upper 64-bit floating-point bit block
is left intact.
Max — Maximum
MMX      SSE        SSE2       3DNow!    3DMx+
         8×8-bit    16×8-bit   2×SPFP    8×8-bit
         4×16-bit   8×16-bit             4×16-bit
         4×SPFP     2×DPFP
         1×SPFP     1×DPFP
The simplified form of this parallel instruction individually compares the integer or floating-point source
arguments and returns the maximum value result in the destination.
vD[] = (vA[] > vB[] ) ? vA[] : vB[] // an element
PMAXUB
MMX+ pmaxub mmxDst, mmxSrc/m64    Unsigned   64
SSE  pmaxub mmxDst, mmxSrc/m64    Unsigned   64
SSE2 pmaxub xmmDst, xmmSrc/m128   Unsigned   128
This SIMD instruction is a 64 (128)-bit parallel operation that compares the eight (16) individual 8-bit source
integer bit blocks aSrc (xmmSrc) and bSrc (xmmDst), with the maximum value result being stored in the
destination Dst (xmmDst).
PMAXSW
MMX+ pmaxsw mmxDst, mmxSrc/m64    Signed   64
SSE  pmaxsw mmxDst, mmxSrc/m64    Signed   64
SSE2 pmaxsw xmmDst, xmmSrc/m128   Signed   128
This SIMD instruction is a 64 (128)-bit parallel operation that compares the four (eight) individual 16-bit source
integer bit blocks aSrc (xmmSrc) and bSrc (xmmDst), with the maximum value result being stored in the
destination Dst (xmmDst).
SPFP Maximum
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
MAXPS
PFMAX
3DNow! pfmax mmxDst, mmxSrc/m64    Single-precision   64
SSE    maxps xmmDst, xmmSrc/m128   Single-precision   128
This SIMD instruction is a 64 (128)-bit parallel operation that compares the two (four) individual single-
precision floating-point source bit blocks aSrc (xmmSrc) and bSrc (xmmDst), with the maximum value result
being stored in the destination Dst (xmmDst).
MAXSS
SSE maxss xmmDst, xmmSrc/m32   Single-precision   128
This SIMD instruction is a 128-bit scalar operation that compares only the lowest bit block containing the scalar single-precision floating-point aSrc (xmmSrc) and bSrc (xmmDst) and stores the maximum value result in the lowest bit block of the destination Dst (xmmDst). The remaining floating-point bit blocks are left intact.
DPFP Maximum
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
MAXPD
SSE2 maxpd xmmDst, xmmSrc/m128   Double-precision   128
This SIMD instruction is a 128-bit parallel operation that compares the two individual double-precision floating-
point source bit blocks aSrc (xmmSrc) and bSrc (xmmDst), with the maximum value result being stored in the
destination Dst (xmmDst).
MAXSD
SSE2 maxsd xmmDst, xmmSrc/m64   Double-precision   128
This SIMD instruction is a 128-bit scalar operation that compares only the lower 64-bit block containing the scalar double-precision floating-point aSrc (xmmSrc) and bSrc (xmmDst), and stores the maximum value result in the lower bit block of the destination Dst (xmmDst). The remaining upper 64-bit floating-point bit block
is left intact.
Chapter 10: Branching
Overview
The processor's instruction pointer is just that — a pointer to the instruction that is about to be executed. This
register is the RIP in 64-bit, EIP in Protected Mode, and IP in Real Mode. It behaves very similarly to that of a
CD player. You can only read one data stream at a time. To read elsewhere, you have to move the pointer to
the new location to read. (A better visualization would be a record player with its needle that cannot skip
around.)
The only way to read the value of the instruction pointer is to call a function with the CALL instruction and then
read the value on the stack where you had been. There is no MOV EAX,EIP instruction.
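A minimal sketch of that technique:
call $L1     ; pushes the address of $L1 as a return address
$L1: pop eax ; eax now holds the value EIP had at $L1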
There are four primary methods that can be used to change the position of the processor's instruction pointer: jump, call, interrupt, and return. You can jump a delta, near, or far distance; call and return near or far; interrupt; and return. These instructions tend to be among the most confusing to the assembly programmer, and the exact instruction that you think you are using sometimes is not the one the assembler actually encodes.
Note (64-bit): The concept of Real Mode memory does not exist in 64-bit mode, nor does a 16-bit displacement, as displacements are sign extended to 64 bits for the 64-bit instruction pointer (RIP).
Jump Unconditionally
JMP — Jump
JMP destination
JMP
Delta JMP
jmp ± 8/16/32 Relative Jump
This is a general-purpose instruction used to jump to another location in memory. It can be a delta-based
jump, or a near or far JMP. In a Real Mode environment the addresses are 16-bit based, supporting a segment
size up to 65,536 bytes, and thus the segment and/or offset need to be 16-bit each. In a Protected Mode
environment the addresses are 32-bit based with a 16-bit segment-selector register.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
Delta JMP
The delta jump is in reality a hop in a signed direction. You will notice that the following JMP examples use an
8-bit signed destination value.
00000000 33 C0 FuncA: xor eax,eax
00000002 EB FC jmp FuncA ; (–4 bytes)
00000004 EB 02 jmp FuncB ; (+2 bytes)
00000006 33 DB xor ebx,ebx
00000008 EB FE FuncB: jmp FuncB ; (–2 bytes) infinite loop
0000000A EB 00 jmp FuncC ; (+0 bytes)
0000000C FuncC:
An EB FF opcode pair obviously is not used as it jumps into the middle of the jump instruction. If the address
being jumped to is out of range in a (–128) reverse direction or a (127) forward direction, then the assembler
will automatically switch to the E9 opcode, which supports 32 bits of signed direction in Protected Mode and 16
bits of signed direction in Real Mode.
jmp NearJmp
xor ebx,ebx
jmp NearAdrPtr[ebx]
jmp NearAdrPtr[ebx*4]
FarAdrPtr FWORD offset FarJmp ; Far pointer
FWORD offset OtherFarJmp ; Far function
FarJmp proc far
ret
FarJmp endp
This will be discussed further in the section on the RET instruction, but if you jump to code with an RET in the
logic flow you need to make sure that the RET is a near return type if the previous call instruction was near, or
the RET is a far return type if the previous call instruction was far. It needs to match the call instruction. For
example, if you are executing a code fragment in a NEAR type function such as NearJmp and you jump to a
different set of code, you should never jump into a FAR type procedure as the RET instruction when executed
will be of the wrong type. Instead of a 32-bit value being popped off the stack in a Protected Mode
environment, a 48-bit value would be popped, thus disorienting the pointers and eventually causing the
program to crash! Even though the assembly code uses the same spelling of RET in the NEAR and FAR
procs, the NEAR is translated to 0C3h and the FAR is translated to 0CBh. These have different meanings! If
you specifically need a far return, try using RETF.
A simple rule to remember is for each NEAR or FAR call, have the appropriate matching RET; this is typically
automatically done for you by the assembler unless you start jumping around in the code.
NearAdrPtr DWORD offset NearJmp ; Near pointer
DWORD offset OtherNearJmp
vs
FarJmp proc far
ret
FarJmp endp
The same kind of memory reference that is used to access a memory table or array can also be used to
access a jump vector. Almost any register can be used alone, in a pair addition, or with an optional base
address and/or scale factor of {2, 4, or 8}, but you will note that there are some limitations in regard to the ESP
register.
jmp fword ptr [...]
jmp NearAdrPtr[...]
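A minimal sketch of such a jump vector (the table, targets, and index variable are hypothetical):
NearVecTbl DWORD offset Case0, offset Case1, offset Case2
      mov  ebx,Selector        ; index 0, 1, or 2
      jmp  NearVecTbl[ebx*4]   ; vector through the table to the selected case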
Jump Conditionally
Jcc — Branching
Jcc destination
Jcc
Delta JMP
jcc disp{8/16/32} Near
All of the instructions in the following table are conditional jumps, sometimes referred to as branch instructions.
The instruction pointer (RIP/EIP/IP) is redirected to the relative address if the associated conditions are met as
a logical TRUE. If they fail, the pointer merely executes the next line of code. For a properly optimized
program, these instructions need to be minimally used and well positioned within the code.
The 8086 and 286 processors only support 8-bit displacement, not 16- or 32-bit displacement. Protected Mode
uses an 8-bit or 32-bit displacement. The displacement gets sign extended and the address stored in the
instruction pointer gets adjusted. The default is an 8-bit displacement [–128, 127] unless the jump is out of
range; in that case, the larger displacement will be used. The goal is to organize your code so that a minimal
number of bytes are required for the conditional branching logic.
Note (64-bit): The displacement of 16-bit is not supported in 64-bit mode!
Table 10-1: Comparison types. The same value types are contained within an individual cell. Complement types (opposites) are across from each other.
JA   +  Jump if above. ZF=0, CF=0            JBE  +  Jump if below or equal. CF=1 or ZF=1
JNBE    Jump if not below or equal.          JNA     Jump if not above.
JAE  +  Jump if above or equal. CF=0         JB   +  Jump if below. CF=1
JNB     Jump if not below.                   JNAE    Jump if not above or equal.
JC      Jump if carry. CF=1                  JNC     Jump if no carry. CF=0
JE      Jump if equal. ZF=1                  JNE     Jump if not equal. ZF=0
JZ      Jump if zero.                        JNZ     Jump if not zero.
JG   ±  Jump if greater. SF=OF, ZF=0         JLE  ±  Jump if less or equal. SF<>OF or ZF=1
JNLE    Jump if not less or equal.           JNG     Jump if not greater.
JGE  ±  Jump if greater or equal. SF=OF      JL   ±  Jump if less. SF<>OF
JNL     Jump if not less.                    JNGE    Jump if not greater or equal.
JO      Jump if overflow. OF=1               JNO     Jump if not overflow. OF=0
JP      Jump if parity. PF=1                 JNP     Jump if not parity. PF=0
JPE     Jump if parity even.                 JPO     Jump if parity odd.
JS      Jump if sign. SF=1                   JNS     Jump if no sign. SF=0
Back on the early processors, using these instructions made it very easy to calculate loop timing, etc., but the newer Pentium-class processors use prediction mechanisms to help keep your code flowing at a pretty good rate, typically based upon the decisions made the last time through and a touch of black box magic. When designing your code you should try to architect it to take advantage of the predictor. Better yet, use something called branchless code: try to use logic to circumvent the need for branching logic.
Current Pentium type processors use different prediction mechanisms to help make the code run at its fastest
rate. A bad prediction can cost you cycles, making optimization using the Jcc instructions quite tricky. We are
not going to discuss the older processors as most are not made anymore and discussing them here would be
pretty much a waste of time, print, paper, and trees! Alas, those of you working with embedded processors are
typically using exact models and manufacturers and will therefore be, for the most part, using their related data
books.
Warning Different processors have different methods of branch prediction for different
manufacturers. The prefetch and other cache buffers increase in size, and processors
get faster as newer models of the processor become available on the market. With this
said, it should be pointed out that the material being discussed here is probably already
dated.
Branch Prediction
The most important optimization method for the 80x86 processors is using the branch prediction algorithm.
These processors use what both Intel and AMD call a BTB (branch target buffer). This is essentially a history buffer of the behavior of the last n Jcc instructions. In their need for speed, processors prefetch (preload) instruction code bytes before they are needed, translate them, and arrange them for processing within their multiple pipelines. When a relative or absolute jump or call occurs, that code is prefetched and
prepared for processing. However, a problem comes up when a branch (Jcc) is encountered. Which way to
go? Take the branch or flow through to the next instruction? Different solutions have been taken by different
manufacturers. They have and use different sized BTBs, different prediction methodologies, different prefetch
sizes, etc. This particular book is not about optimization but we will talk about the mechanism. The particulars
depend on which manufacturer and which processor.
The instruction prefetch has four 32-byte buffers loaded sequentially (one at a time) until a branch is
encountered, and then the BTB is used to predict a branch or not. If no predicted branch, the contiguous
memory is loaded, but if a branch is predicted, the alternate prefetch buffer is loaded with the memory
referenced by the branch. If the prediction was wrong, all the instruction pipelines are flushed and the prefetch
mechanism begins again. So you should see the need to design your code to minimize the number of
mispredictions. There is one other thing to be careful of and that has to do with two back-to-back Jcc
instructions. If two Jcc instructions both have their last byte in the same 4-byte block, a misprediction can
occur. This would only occur if the second branch has a displacement of 8 bits. Using a larger bit
displacement, rearranging the code, or inserting a NOP instruction would solve the problem.
This method is bad as 14h and 16h are in the same 4-byte block {14h…17h}:
00000013 75 F8 jne $Z1
00000015 74 07 jz $Z2
The following is the best method to solve the problem, but only if your assembler lets you override an 8-bit displacement with a 32-bit one for Protected Mode, setting the last bytes at 14h and 1Ah. Note that in the following, the second branch uses a 4-byte displacement and not one byte; normally the assembler only generates the larger form when the target is outside the [–128, 127] range, so the near ptr override forces it here.
00000013 75 F8 jne $Z1
00000015 0f 84 00000007 jz near ptr $Z2
Here is a 16-bit displacement for Real Mode setting the last byte at 14h and 18h:
00000013 75 F8 jne $Z1
00000015 0f 84 0007 jz near ptr $Z2
Not that I am urging you to use the NOP instructions, but this one is an alternative as 14h and 18h are in
different 4-byte blocks. The NOP pushes the second conditional branch address further down.
00000013 75 F8 jne $Z1
00000015 90 nop
00000016 90 nop
00000017 74 07 jz $Z2
Branches that are not already in the BTB use the static prediction logic as follows.
Back-Branch-Taken
The branch is predicted to be taken if a negative displacement, such as at the bottom of a loop. A flow through
(branch not taken), would be a misprediction!
Forward-Branch-Not-Taken
The branch is predicted not to be taken if a positive displacement such as a jump further down the code. The
instruction pointer is expected to just flow through the branch instruction. A jump would be a misprediction.
$L1: nop
nop
jz $L2 ; Forward-Branch-Not-Taken
nop
$L2: nop
Branching Hints
A prefix of 3Eh (HT) is a hint to take the branch. A prefix of 2Eh (HNT) is a hint not to take the branch (flow through). Set a hint only when it is contrary to the static branch prediction. Sometimes there are no elements to process, so at the top of a function one might have an if conditional (size = 0) empty test.
test ecx,ecx
db 3eh ; Hint to take the branch
jz $L9
$L9:
The default static prediction is to not branch as the jz is a forward-branch and the prediction logic does a flow
through, but the 3eh says to override and take the branch as the length is typically expected to be zero most of
the time.
Branch Optimization
Removing branches from your code, such as by unrolling loops, makes it more efficient by removing the possibility of misprediction. This is discussed in more detail in the next chapter. One method is to use the SETcc
instruction to set Boolean flags. Another method is to use CMOV or FCMOV instructions to copy data. These
methods can sometimes be manipulated to duplicate the same effect you were trying to achieve with the Jcc
instruction without any possible prediction failure that would cost cycles.
For example, the following is the signed integer absolute number function n = ABS(n), which uses a Jcc
instruction.
test eax,eax   ; Test if negative
jns  $Abs1     ; Jump if positive
neg  eax       ; Negate (two's complement)
$Abs1:         ; eax is positive
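One branchless possibility, using the CMOV instruction mentioned above, is sketched below; the mask-based version is developed fully in Chapter 11:
mov    ecx,eax   ; copy n
neg    ecx       ; ecx = -n (sets the sign flag)
cmovns eax,ecx   ; if -n is not negative, n was negative: take -n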
So you see, we did an ABS() function without any Jcc instructions; just a sleight of hand using general-purpose
instructions. Admittedly this technique will not work on everything, but it will help in your optimizations. In
Chapter 11 we will go into more detail.
Destination addresses of a jump should be code aligned to take advantage of the instruction prefetch.
align 16
Of course the Align statement must not be in the code flow, as unknown bytes are added to align the code, so it must always occur outside a function, thus after a JMP or RET statement. Another alignment type would be
for fine-tuning on a byte-by-byte basis such as:
org $+3
…which aligns by moving the origin pointer from the $ current location by three bytes, effectively adding three
unknown bytes. Typically you will find it paired with an alignment to create a fixed alignment point. This
prevents any alignment of a previous function affecting the alignment of a following function.
align 16
org $+3
Inside the code flow you can add nondestructive instructions such as the following to help align your code.
Your flags may be altered but not the registers.
nop ; 1 byte
mov eax,eax ; 2 bytes
66h mov ax,ax ; 3 bytes
Let's start by examining a C type strlen() function designed to find the number of bytes in a zero-terminated
ASCII string:
uint strlen(char *p)
{
uint cnt = 0;
while (*p++)
cnt++;
return cnt;
}
Let's try that in assembly and align the code to a 16-byte boundary:
align 16
0000000C 40 $L1: inc eax
0000000D 8A 0A mov cl,[edx] ; Get a character
0000000F 42 inc edx
00000010 84 C9 test cl,cl
00000012 75 F8 jnz $L1 ; 0 terminator?
As you have probably noted, the entire function occupies 16h (22) bytes (000h..016h). It also has an efficiency
problem. The 16-byte prefetch is first loaded with address 0000h…000fh, executes up to and including the "inc
edx" and then reloads the prefetch with the next 16 bytes, address 0010h…001fh. The code then executes up
to address 0013h and then the prefetch has to be reloaded with 0000h again. This continues over and over
again until the zero terminator is encountered. Now if we tweak the alignment a tad we can contain this $L1
loop within one prefetch:
align 16
org $+4
You will now note that the beginning of the function actually starts on address 0004h, but the beginning of the
loop is now aligned perfectly on a 16-byte boundary, allowing the entire loop to be contained with a single
instruction prefetch load. This is a very old and simple alignment trick that is still usable on the newer
processors.
PAUSE
pause
This was introduced with the P4. It indicates to the processor that this is a tight wait loop in one thread of a multithreaded application that is waiting for another thread. This is typically referred to as a spin loop. In essence, the processor is constantly testing and looping until a signal flag gets set.
$L1: cmp eax,bSignal
jne $L1 ; I am *** BAD Code ***
Tip Tight loops are a burden on the processor in a single or multithreaded environment. Inserting
the PAUSE instruction indicates to the processor to let the thread snooze a micro-bit so as to
allow the other threads more time to run. This is also effective in helping to reduce current
drawn by a processor and so can help it run a bit cooler.
cmp eax,bSignal
je $L2 ; Already set so continue
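A sketch of how the remainder of such a polling loop might look (bSignal and the register usage follow the fragment above):
$L1:  pause              ; snooze a micro-bit; be kind to the other hardware thread
      cmp  eax,bSignal   ; poll the flag again
      jne  $L1
$L2:                     ; signal seen; continue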
The code byte for the PAUSE instruction maps to a NOP instruction on previous processors, so it is invisible to them!
I-VU-Q
"I would like you to write an insertion sort algorithm." Or "I would like you to write a function to convert a zero-
terminated ASCII formatted string containing an upper/lowercase mix into just uppercase."
That is what the standard C runtime library function strupr() is for! The next time you are interviewing and they
ask you to code the function strupr() but give you a minimal amount of information, smile, appear deep in
thought, and then draw on that white board the following code in assembly. That will impress them! They will
probably hire you.
The following is a sample string to uppercase conversion algorithm. It is only partially optimized as a do loop is
used instead of a while loop so only one jump is utilized instead of two! It is also ASCII only as it has not been
modified for SJIS, WCHAR, or Unicode strings. Using char c instead of char *p may or may not save you CPU
cycles depending on the optimization ability of your compiler.
char *strupr(char *pStr)
{
    char *p;

    p = pStr;
    do {
        if ((*p >= 'a') && (*p <= 'z')) {
            *p -= 0x20;            // fold lowercase to uppercase
        }
        ++p;
    } while (*p);                  // Loop while characters

    return pStr;
}
Now try this in assembly. This can be done in one of two ways. If this function is called a lot and needs to be
very fast, a table lookup could be used, as it would only cost a 256-byte enumeration table with indexes 61h to
7ah ("a" to "z") set to their uppercase equivalents.
struprtbl db 000h,001h,002h,003h,004h,005h,006h,007h,...
...etc
xor eax,eax
That code snippet is fine and dandy, but because there is a processor stall on line $L1, a technique learned
earlier needs to be used. The following does just that and has no stall.
mov ebx,offset struprtbl
$L1: xlatb
Obviously, using the memory alignment tricks learned in Chapter 2, "Coding Standards," in regard to the
setting of memory, one could make this function fairly quick but a lot larger. But I leave that up to you.
The alternate method is using two comparisons similar to that used in the C code. The branch prediction within
the CPU rewards you for a correct prediction and penalizes you for an incorrect one. If the English text string is
examined it would be noted that it is mostly lowercase with some symbols and some uppercase. So if that is taken advantage of, one can make this function pretty efficient.
Contrarily, writing the function strupr(), String Upper, the logic would want to skip around for symbols,
uppercase, and extended ASCII and predict a flow through conversion for lowercase. In other words, skip
below the conversion so the predictor will tend to be correct on a flow through.
; strupr snippet
xyzzy db "Quick brown fox jumped!",0
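The head of the branching loop is sketched below (register usage is an assumption, and the string is taken to be non-empty as above); it falls through to the $L2 tail that follows, skipping the conversion for anything outside 'a' to 'z':
      mov   edx,offset xyzzy      ; edx = string
      mov   al,[edx]              ; get the first character
$L1:  cmp   al,'a'
      jb    $L2                   ; symbol or uppercase: skip the conversion
      cmp   al,'z'
      ja    $L2                   ; extended ASCII: skip the conversion
      sub   byte ptr [edx],20h    ; lowercase: fold to uppercase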
$L2: inc edx ; Nothing to do, next!
mov al,[edx] ; Get a character
test al,al
jnz $L1 ; 0 terminator?
That was pretty simple because I picked the simple one. The predictions for the JMP will succeed most of the
time. Let's make things a little more interesting and try the complement function strlwr(), String Lower.
; strlwr snippet
It is practically the same but definitely not very efficient as the branch predictor will fail more often. The
following code is larger but more efficient.
; strlwr snippet
;
mov al,[edx]
; symbols or uppercase
In these examples it is basically known what the data would look like, and this was taken advantage of to allow for the best branch prediction, which could help the code run faster. Some types of data are hard to predict, and those will require a little trial and error experimentation to get a handle on.
Tip Branch predicting is not fortune-telling or soothsaying; it is preplanning, data analysis, statistics, and a little dumb luck.
JCXZ destination
JRCXZ destination
JCXZ 32
JECXZ 32
JRCXZ 64 ?
jecxz disp8
jcxz disp8
jrcxz disp8 ± 8-bit relative hop
This instruction jumps to the relative destination address if RCX/ECX/ CX has a value of zero.
JCXZ Jump if CX Zero
JECXZ Jump if ECX zero
JRCXZ Jump if RCX zero
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
R bytes 2 4
This is pretty useful at the top of a function to detect a zero condition loaded from the stack. Note that the
SETcc instruction does something very similar but this sample is to make JECXZ easier to understand.
mov ecx,[ebp+arg1] ; Get # of bytes
jecxz $xit ; Jump if a value of 0
:
:
mov ecx,1 ; true
LOOPcc
LOOPZ destination
LOOPNZ destination
LOOPcc
loope disp8 ± 8-bit relative hop
loopz disp8
loopne disp8
loopnz disp8
The LOOP instruction decrements the ECX/CX register. If not a value of zero, the instruction pointer jumps to
the destination address. If it is zero, the instruction pointer merely advances to the next instruction.
The LOOPZ and LOOPE instructions decrement the ECX/CX register. If not a value of zero and the zero flag
is set from a previous instruction, then the instruction pointer jumps to the destination address. If it is zero, the
instruction pointer merely advances to the next instruction.
The LOOPNZ and LOOPNE instructions decrement the ECX/CX register. If not a value of zero and the zero
flag is not set from a previous instruction, then the instruction pointer jumps to the destination address. If it is
zero, the instruction pointer merely advances to the next instruction.
LOOP
LOOP destination
LOOP
loop disp8 ± 8-bit relative hop
The following table is based upon using the LOOP instruction. You will note that this instruction was only
effective for the original 80x86 processor. Since that time, its use is limited to the Cyrix processor. If writing
generic code, do not use it! Write a macro to replace it; better yet, forget it exists! But if writing Cyrix-specific
code, then use it by all means; it will save you a clock cycle.
          loop $L1      dec ecx     dec cx
                        jnz $L1     jnz $L1
P bytes      2             3           4
R bytes      2             4           3
On the other hand, the LOOPZ and LOOPNZ instructions have an efficient CPU time. Even with the ability of
processors to handle multiple instructions in multiple pipes, the LOOPZ and LOOPNZ instructions are the most
efficient.
Obviously, since there is no need for reallocations, or freeing of memory, then there is no need for a header
either.
There are other schemes. Just make sure your memory is 16-byte aligned. Now that any possible memory
allocation alignment problems have been taken care of up front, it is time to move on to the good stuff.
When one orders "all you can eat" pancakes, either a short stack or a tall stack is delivered to your table. If you
have not finished eating the stack of pancakes and the server brings you more, you do not pick them all up
and place them under your older pancakes; you have them placed on top of those already on your plate. So
this would be considered a LIFO (last in, first out) system. Those new pancakes will be the first to be eaten,
will they not?
Well, a computer stack is like that. Memory is typically allocated from the bottom up, and data in that memory
is low address to high address oriented. The computer stack starts from the top of memory and works its way
down, and hopefully the two ends do not meet or boom! That is why you have to watch recursive functions; if
they "curse" too much, they run out of memory. We will go into a little more depth as we discuss the PUSH and
POP instructions.
Stack
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
PUSH
push #{8/16/32/64}
push rm{16/32/64}
push sreg16
This instruction pushes an 8-, 16-, 32-, or 64-bit immediate value on the stack depending on the processor mode. A 16-, 32-, or 64-bit general-purpose register or memory value, or a 16-bit segment register, can also be pushed onto the stack. When an operand is a different size than the CPU mode, the data is size extended so that the stack remains aligned. POP is the complement of this instruction.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
POP
pop rm{16/32/64}
pop sreg16
This instruction pops a 16-, 32-, or 64-bit register or memory value, or a 16-bit segment register, from the stack. PUSH is the complement of this instruction.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
PUSHA 32
PUSHAD 32
pusha
pushad
The PUSHA and PUSHAD instructions use the same opcode, which pushes in order the following list of registers: EAX, ECX, EDX, EBX, ESP, EBP, ESI, and EDI if in Protected Mode, or AX, CX, DX, BX, SP, BP, SI, and DI if in Real Mode. In Protected Mode PUSHAD should be used and in Real Mode PUSHA should be used. POPAD/POPA are the complement of this instruction. There is no 64-bit push!
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
Flags: None are affected by this opcode.
Note (64-bit): The general-purpose instructions PUSHA and PUSHAD are not available in 64-bit mode.
The following are the equivalent functions and push order of the general-purpose registers.
pushad push eax
push ecx
push edx
push ebx
push esp
push ebp
push esi
push edi
P bytes 1 8
R bytes 2 16
pusha push ax
push cx
push dx
push bx
push sp
push bp
push si
push di
P bytes 1 16
R bytes 1 8
Intel recommends that you not use "complex" instructions and encourages you to use simple instructions
instead.
POPA 32
POPAD 32
popa
popad
The POPA and POPAD instructions use the same opcode, which pops in reverse order the results of the
complement instruction PUSHA or PUSHAD. The following registers are popped from the stack in this order:
EDI, ESI, EBP, ESP, EBX, EDX, ECX, and EAX if in Protected Mode, or DI, SI, BP, SP, BX, DX, CX, and AX if in Real Mode.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
The following are the equivalent functions and pop order of the general-purpose registers.
popa pop di
pop si
pop bp
pop sp
pop bx
pop dx
pop cx
pop ax
popad pop edi
pop esi
pop ebp
pop esp
pop ebx
pop edx
pop ecx
pop eax
ENTER
enter #(16), 0
enter #(16), 1
enter #(16), #(8)
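ENTER builds the stack frame that LEAVE tears down. As a sketch, enter LOCALS,0 (the usual case; the second operand is the lexical nesting level) behaves like:
push ebp          ; save the caller's frame pointer
mov  ebp,esp      ; establish the new frame base
sub  esp,LOCALS   ; reserve LOCALS bytes of local storage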
LEAVE
leave
This instruction deallocates a stack frame for a procedure. The register pairings of SP and BP are dependent
upon the mode running.
64-bit:   mov rsp,rbp
          pop rbp
32-bit:   mov esp,ebp
          pop ebp
16-bit:   mov sp,bp
          pop bp
CALL Procedure (Function)
Now to discuss a totally different but related topic. These functions fall into one of two categories: the function
and the procedure. Now as we all learned in school, a function returns a value and a procedure does not, but
their code is typically written the same. The only real difference is that the calling code makes use of the EAX and/or EDX:EAX register(s) when the function returns.
Function: y = f( x )          Procedure: f( x )
CALL
CALL destination
CALL
Delta CALL
call ±16/32 Relative Jump
This is a general-purpose instruction used to call near or far code in another location in memory followed up by
a RET or RETF instruction.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
This instruction is very similar to the JMP instruction except that it puts a return address on the stack so when
a matching RET or RETF instruction is encountered, it will return to the next instruction following the CALL.
NearJmp endp
call NearJmp
xor eax,eax
xor ebx,ebx
call NearAdrPtr[ebx]
call NearAdrPtr[ebx*4]
call NearAdrPtr[eax+ebx*4]
The same kind of memory reference that is used to access a memory table or array can also be used to
access a call vector. Almost any register or register pair can be used alone or in an addition equation with an
optional base address and scale factor of {2, 4, or 8}, but you will note that there are some limitations in regard
to the ESP register.
call dword ptr [...]
call fword ptr [...]
call NearAdrPtr[...]
RET/RETF — Return
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
RET
RETF
ret Near
ret #
retf Far
This is a general-purpose instruction used to return from a CALL instruction to a previous location in memory.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
The various calls given in the following table push onto the stack the listed number of bytes as a return
address. When this instruction is encountered, that value is popped off the stack. The 0C3h opcode pops a
NEAR call and the 0CBh opcode pops a FAR call. The actual number of bytes also depends on whether the
processor is in Protected or Real Mode. The number of bytes of stack displacement can also be specified by
0C2h versus 0CAh.
CALL                  bytes on stack   RET opcode   Std/Fast Call Ret w/Stack Adj
Protected Mode Near 4 0C3h 0C2h, #{16}
Protected Mode Far 6 0CBh 0CAh, #{16}
Real Mode Near 2 0C3h 0C2h, #{16}
Real Mode Far 4 0CBh 0CAh, #{16}
vs
FarJmp proc far
retf
FarJmp endp
Note that RETF is not really an instruction. With a procedural block marked as far, a RET instruction should
automatically be encoded as a RET FAR. If your code has trouble at run time, peek at the assembly listing and
verify the RET code byte is the correct one to match the call. Some macro assemblers allow the RETF
reference to force a RET FAR.
Calling Conventions (Stack Argument Methods)
Before we get too far along we should discuss the methods of passing arguments on a stack. In essence, a
function call has to push arguments (if not a void function) onto the stack, push the current processor's
instruction pointer (EIP or RIP) (the pointer to where the instruction being executed is) onto the stack, and
perform a subroutine call. Use the stack yet again for any local data and then return to where it left off while
unwinding the stack. There are three basic methods to this. From a high-level language such as C/C++ this is
taken for granted, but from the low level of assembly language this has to be done carefully or the stack and
program counter will be corrupted.
We are going to examine function calls using a 32-bit processor, as that is what most of you are currently
using. Thus, each argument that gets pushed onto the stack is 4 bytes in size. An item such as a double-
precision floating-point, which uses 8 bytes, is actually pushed as two halves — lower 4 bytes, upper 4 bytes.
When the processor is in 64-bit mode, 8 bytes are pushed on the stack.
int hello(int a, int b)
{
int c = a + b;
return c;
}
C Declaration (__CDECL)
The function call to hello is straightforward:
00401118 push 2
0040111A push 1
0040111C call hello
00401121 add esp,8
Once the instruction pointer (EIP) arrives at the first byte of the function hello, the stack will look similar to this:
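From the calling sequence above, a sketch of that stack (ESP = N):
; [esp]     00401121h   return address (the instruction following the CALL)
; [esp+4]   00000001h   argument a
; [esp+8]   00000002h   argument b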
The function hello looks similar to the following. I have left the addresses for each line of assembly for
reference but they are not needed.
; Set up stack frame
004010D0 push ebp ; Save old ebp
004010D1 mov ebp,esp ; Set local frame base
004010D3 sub esp,4
Let us peek at the stack one more time and note the changes:
Register   Address (N…N+3)   HexValue   Description
The EBP register is used to remember where the ESP was last, and the ESP is moved lower in memory,
leaving room for the local stack arguments and positioned for the next needed push.
; Do the calculation a+b
004010E8 mov eax,dword ptr [ebp+8]
004010EB add eax,dword ptr [ebp+0Ch]
; Restore stack frame
004010F1 mov esp,ebp ; Restore esp
004010F3 pop ebp ; Restore ebp
004010F4 ret ; Restore eip
So upon returning, anything lower than ESP in stack memory is essentially garbage, but the instruction pointer
(EIP) is back to where it can continue in the code. But the stack pointer still needs to be corrected for the two
arguments that were pushed.
00401118 push 2
0040111A push 1
0040111C call hello
00401121 add esp,8 ;2*sizeof(int)
…or, more simply, just adjust the stack pointer for two arguments, four bytes each:
add esp,8
So in a C declaration (CDECL) type function call, the calling function corrects the stack pointer for the
arguments it pushed. One other item to note is that immediate values {1, 2} were pushed on the stack. So the
stack was used for the arguments and for the instruction pointer.
Standard Call (__STDCALL)
You will note that there is no stack correction upon returning. This means that the function must handle the stack frame correction upon returning.
; Restore stack frame
004010F1 mov esp,ebp ; Restore esp
004010F3 pop ebp ; Restore ebp
004010F4 ret 8 ; Restore eip
In reality, the return instruction RET handles the stack correction: after the return address is popped, the stack pointer is adjusted by the number of bytes specified by the immediate value. In the previous snippet, it was adjusted by 8 bytes.
Fast Call (__FASTCALL)
You will notice that the arguments were actually assigned to registers and the stack was only used to retain the program counter (EIP) for the function return. Since the values are already in registers, there is no need for the function to access them from the stack or copy them to a register.
…the arguments that were pushed on the stack are stack corrected upon return by the function; this is the
same as the fast call mechanism!
mov esp,ebp
pop ebp
ret 4 ; One 4-byte arg to be popped.
It is very important to realize that both the calling routine and the function itself must be written using the same
calling convention. These can all be used within a single application but can get very confusing as to which
was used where, and so consistency is important or your code will fail.
Interrupt Handling
INT
INT n
INTO 32 32
int #
into
This is an operating system instruction typically used by an application to access a BIOS function. This is also
referred to as a software interrupt. A hardware interrupt calls an interrupt procedure in response to servicing an IRQ (interrupt request). The base of the computer's memory is at memory location 0000:00000000h. At that base is a vector jump table. Multiplying the interrupt number by 4 gives you the offset of the entry that contains the address that will be vectored to.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
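A minimal Real Mode sketch of that lookup, fetching the far pointer for INT 10h:
xor ax,ax
mov es,ax          ; ES = segment 0000h (base of the vector table)
mov bx,10h*4       ; entry offset = interrupt number * 4
mov cx,es:[bx]     ; offset  of the INT 10h handler
mov dx,es:[bx+2]   ; segment of the INT 10h handler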
The interrupt function is typically written in one of two ways. If written to service a hardware interrupt, it
services some predefined single task. If written to service a software interrupt, it is written as a function library
where it takes values in registers and services them based upon the specialized functionality. On the 80x86
type personal computer the older DOS operating system typically required the application programmer to call
the BIOS using the INT instruction to access all the peripherals such as keyboard, display card, mouse,
communications port, printer port, timer, etc. Sometimes a peripheral would use more than one interrupt, one
to support the BIOS (basic input/output system) library interface and one to handle the IRQs.
Table 10-2: Device, interrupt, address, and IRQ mappings for PC
There are many more interrupts, which are too numerous to list. No matter how the interrupt is written, they all
end exactly the same way, with the IRET instruction.
Win32 developers will find that calling the function DebugBreak() actually calls INT 3. This effectively stops the
debugger at the position of the instruction pointer.
IRET
IRETD
IRETQ 64 64
iret [Un]Signed 16
iretd [Un]Signed 32
iretq [Un]Signed 64
This is a general-purpose instruction used to return from an interrupt to a previous location in memory.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
CLI
STI
cli
sti
The STI instruction is used to set the interrupt flag, thus enabling (allowing) interrupts, and the CLI instruction
is used to clear the interrupt flag, thus disabling (preventing) interrupts. It should be noted that NMI (non-
maskable interrupts) and exceptions are not prevented.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
When an interrupt is being serviced due to an elapsed timer, keyboard key pressed, communications received
or sent, etc., the interrupt flag bit is automatically cleared to 0, thus preventing any other interrupts from
interrupting (disturbing) the interrupt code already being run. The
interrupt flag is automatically set to 1 upon the IRET instruction being executed.
IService proc far
push eax
pop eax
iret
IService endp
If an interrupt is going to take some time to process and is not a quick in and out, a programmer will typically
insert an STI instruction at the top of the interrupt to allow interrupts to occur. If at some point interrupt-critical
hardware is being accessed, the CLI instruction will be called first to temporarily disable interrupts, the
hardware will be accessed, and the STI instruction will be used to immediately allow interrupts again.
IService proc far
sti ; Enable Interrupts
push eax
pop eax
iret
IService endp
When interrupts are re-enabled inside an interrupt, there is a possibility that the event that instigated the
interrupt can cause a new event requesting a new interrupt before the interrupt was done servicing the first
interrupt. This interrupt "nesting" needs to be accounted for in your code through the use of a flag, etc.,
typically not allowing the body of the interrupt code to be executed by the second (nested) interrupt. It should
return immediately. The body of the code, however, should take into account that interrupt nesting may have
taken place and therefore should compensate for it. A simple directly manipulated flag can solve this problem.
Something to keep in mind is that the only absolute is the code segment where the interrupt is! Data is
unknown, so we actually store the data segment value in the code segment so we can access the application
data.
IServCS dw 0
IServCnt dd 0
$Nest: iret
IService endp
You should note that the flag test occurred before the interrupt was re-enabled. This was to ensure that
another interrupt did not occur while the possibility of nested interrupt was tested for. If application code wants
to very temporarily stop interrupts so it can set up interrupt-sensitive hardware, all it needs to do is call the CLI
instruction followed by a STI instruction.
Chapter 11: Branchless
Overview
Okay, we learned how to branch in the last chapter. We also learned about some branchless coding methods
such as the butterfly switch and value setting using bit blending. Now let's learn how to make decisions without
branching. You have already learned instructions such as the bit test and set. Here you will be shown how to
use masking logic to get the same result as one would with the use of branching. The first item to learn is that
the MSB (negative bit #31) and Carry bits are your friends! Two values can be subtracted to obtain a positive
or negative threshold, and then the resulting state of the MSB can be arithmetically shifted to create a mask of
all ones or zeros. That mask can be used to mask or blend bits.
There is only one trick, however. The values need to be one bit less than the data size so as to take advantage
of the MSB. Data could be shifted by the bit width of the data, but on some older embedded processors there
are penalties for each shifted bit.
#define UINT_MAX_BITS (sizeof(uint) <<3) // 32-bit value = 32
#define INT_MAX_BITS ((sizeof(int) <<3)–1) // 32-bit value = 31
#define UCHAR_MAX_BITS (sizeof(uchar) <<3) // 8-bit value = 8
#define CHAR_MAX_BITS ((sizeof(char) <<3)–1) // 8-bit value = 7
In the previous chapter on branching, you found the following example of the signed integer absolute number function y = |x|, which uses a Jcc instruction.
test eax,eax   ; Test if negative
jns  $Abs      ; Jump if positive
neg  eax       ; Negate (two's complement)
$Abs:          ; eax is positive
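The branchless version replaces the jump with a sign mask; as a sketch mirroring the 8-bit walkthrough below, the lines leading up to the subtraction are:
mov ecx,eax            ; copy n
sar ecx,INT_MAX_BITS   ; (-) = 0FFFFFFFFh, (+) = 000000000h
xor eax,ecx            ; one's complement only if negative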
sub eax,ecx ; n–(–1)=n+1 = two's complement
Voilà, an ABS() function without any Jcc instructions. Just some sleight of hand using general-purpose
instructions.
Let's look a little closer at how this works. Note that it will work on 8-, 16-, or 32-bit numbers in the same way.
Below you will find an 8-bit negative number on the left and a positive 8-bit number on the right for a side-by-
side comparison.
mov cl,al 10001010b 8Ah (–118) mov dl,bl 01111011b 7Bh (123)
sar cl,7 11111111b FFh sar dl,7 00000000b 00h
The SAR instruction shifts the MSB arithmetically into all the other bit positions, so a negative number
becomes all FFs, and a positive number becomes all 00s. Now XOR those bits with their original value. You learned in a previous chapter that 1 XOR 1 = 0 and 1 XOR 0 = 1.
; xor al,cl 01110101b 75h xor bl,dl 01111011b 7Bh
A negative value would actually flip the original value's bits with a one's complement, and a positive value
would keep the value intact. We then subtract that mask from the new value, which effectively adds a +1 or +0; thus a two's complement is performed, since n–(–1) = n+1.
; cl= 11111111b FFh dl= 00000000b 00h
; sub al,cl 01110110b 76h (118) sub bl,dl 01111011b 7Bh (123)
Normally there would be a single branch test, but as you will note in the C example above and in the assembly
below, there is no branch.
MIN PROC C PUBLIC p:DWORD, q:DWORD
mov edx,p ; edx = p
mov ecx,q ; ecx = q
; r=(p < q) ? p : q
; r = (p–q) >> INT_MAX_BITS;
; (–)=0xFFFFFFFF (+)=0x00000000
mov eax,edx
sub eax,ecx
sar eax,INT_MAX_BITS ; (int >> 31)
; return (p & r) | (q & (~r))
and edx,eax ; (p & r)
xor eax,-1  ; (~r)
and eax,ecx ; (q & ~r)
or eax,edx
ret ; result is in eax
MIN endp
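A hypothetical call site, relying on the C calling convention declared by the PROC header above:
invoke MIN, 7, 3   ; eax = 3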
This function is similar to the ABS() function above. The argument q is subtracted from p to force a negative
value if p is less than q. Upon that negative result, a bit mask of all one's is created. By ANDing that mask with
p, ANDing the inverse of that mask with q, and ORing the results, the minimum value is retrieved.
; r=(q < p) ? q : p
; r = (q–p) >> INT_MAX_BITS;
; (–)=0xFFFFFFFF (+)=0x00000000
mov eax,edx
sub eax,ecx
sar eax,INT_MAX_BITS ; (int >> 31)
There is also something to note in this code. This is a single MIN() function test. As such there is no branch,
but there are dependency stalls. The EAX register is constantly being reassigned or operated upon and thus
stalling any pipelining. The good news is that it is still faster than the branch. If a particular pair of arrays
needed to have a similar operation performed, the code could be intermixed in such a way that pipelining could
be used to advantage.
Then again, remember that job interview question in the previous chapter? It was related to writing a
STRUPR() function. Let's try that again!
; strupr snippet
xyzzy db "Quick brown fox jumped!",0
if 0 ; *** Old Branching Code ***
In the old branching code section you will notice the two branches, which can be costly in mispredictions.
Statistically, there will be more lowercase than uppercase and symbols, and an even higher misprediction rate
if Single-Byte Character System (SBCS) international characters are involved. The branchless code is not very
optimized, but you will notice some pairing between the AL and AH registers processing the lower and upper
limits of the inclusive character range of a to z. There are still some stalls near the bottom of the loop, but there
are no prediction errors or CPU cycle hits from having to take a branch; thus, the new code is faster. By
spending some optimization time intermixing the looping logic with the actual masking logic, the register
dependencies can be reduced even further, allowing the code to run even faster.
$L1: mov ah,al
sub al,'z'+1
sub ah,'a'
shr al,2 ; xx?xxxxxb  <='z' = 0x20 else 0x00
shr ah,2 ; xx?xxxxxb <'a'=0x20 else 0x00
inc edx ; Nothing to do; next!
not ah ; Flip the bit
;ah stall
and ah,al ; Blend the ~'a' and 'z' masks
mov al,[edx] ; Get the next character
and ah,020h ; 00X00000b strip masking bit
;ah stall
sub [edx–1],ah ; –00h or –20h adjustment to character
test al,al
jnz $L1 ; 0 terminator?
8-bit values can be mirrored, loaded into a 32-bit register, and then processed simultaneously, but there would
be a significant stall in doing so. Experiment and give it a try anyway!
So now you are probably thinking, "That was silly, since it will probably never be needed!" Just remember that
it just might be your next job interview test question. When writing code in optimized assembly, one has to
initially decide whether it is worthwhile and an efficient use of time. I am sorry to say that string converters are
typically not; however, the principles you learned here can be applied to both the C and assembly world of
programming, making your code faster and more efficient, and earning the admiration of your fellow
programming team, the recognition (or lack thereof) by your manager, and hopefully a better performance
review and hefty pay hike! Or not!
The equation returns a summation of the longest axis and half the shortest axis. It is fairly efficient but contains
three conditional branches that can be removed and instead use masking logic to make it even faster. The
ABS() function was discussed in detail earlier in this chapter. The new algorithm is a comparison with a
substitution. So applying what we have learned about branchless coding we get this new function:
int Quick2DDist(int x1, int y1, int x2, int y2)
{
int x, y, dx, dy;
// dx=ABS(x2–x1); dy=ABS(y2–y1);
x = (x2–x1);
y = (y2–y1);
dx = x >> INT_MAX_BITS;
dy = y >> INT_MAX_BITS;
dx = (x ^ dx) – dx;
dy = (y ^ dy) – dy;
Note that this function has a length of 133 bytes and is only slightly larger than the previous, but runs much
faster. There is no branching, and it's been set up for pipelining although the compiler will optimize it in its own
way. This would normally be good enough but if there is a need for more speed, then check out the following
conversion of this optimized C code to assembly.
Quick2DDist PROC C PUBLIC x1:DWORD, y1:DWORD, x2:DWORD, y2:DWORD
        push ebx
        ; dx = |x2-x1|   dy = |y2-y1|
        mov  ecx,y2          ; y = (y2-y1)
        mov  ebx,x2          ; x = (x2-x1)
        sub  ecx,y1
        sub  ebx,x1
        ; dx = x >> INT_MAX_BITS;  dy = y >> INT_MAX_BITS
        mov  edx,ecx
        mov  eax,ebx
        sar  edx,INT_MAX_BITS
        sar  eax,INT_MAX_BITS
        sar  ebx,INT_MAX_BITS ; r=(+)=00's or (-)=FF's
        ; ebx=r (dx mask)  eax=x  edx=y
        ; ebx stall
        ; return (x & r) | (y & ~r)
        mov  ecx,ebx         ; r
        and  eax,ebx         ; (x & r)
        not  ecx             ; Flip mask
        ; ecx stall
        and  edx,ecx         ; (y & ~r)
        pop  ebx
        or   eax,edx         ; eax = distance
        ret
Quick2DDist endp
Note that this is only half the length of either version of the C code, as it is only 66 bytes long. Since it is
hand-optimized assembly and not C, it is that much faster due to reduced code cache fetches and fewer stalls.
Keep in mind that code written in assembly is not always smaller; in fact, it is typically larger, but this just
demonstrates some hidden benefits. Again, you will note the blended instructions to support two pipelines, but
unfortunately there are two remaining register stalls. By not using the proc macro, the POP EBP and POP EBX
instructions can be moved up, removing one of the stalls. If you look carefully, the code is visually paired until
about halfway down, at which point the logic gets a little foggy. That is because lower code was moved up to
remove additional register stalls that would have occurred, thus making the code less readable.
Again, please remember that you should rewrite your C code in an attempt to break it into simpler
components, keeping in mind that it should be readable. Once this is proven to run, then work on the assembly
code using this fresher C version as a template to follow. Use the Carry flag or one of the data bits, such as the
MSB, to smear the bit pattern across the other bits as required. The alternative method would be to use a
SETcc instruction to obtain a 0 or 1, then a decrement to get a –1 or 0 for a mask.
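A quick sketch of that second method (register choices are assumptions):

xor   ecx,ecx
cmp   eax,ebx
setge cl            ; cl = 1 if eax >= ebx, else 0
dec   ecx           ; ecx = 0 if eax >= ebx, else -1 (all-ones mask)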
Chapter 12: Floating-Point Vector Addition and
Subtraction
Overview
The topic of floating-point was discussed back in Chapter 8, "Floating-Point Anyone?" As the same
methodologies of SIMD processing learned in Chapter 7, "Integer Math," apply for packed floating-point, it
does not matter whether one is calculating the sum or the product. However, there is one exception. With
integer addition, the data width increases by one bit. With integer multiplication, the data width doubles from N
to 2N bits. With floating-point, the result occupies the same number of bits. So with that said, let's jump right into
packed floating-point addition.
The samples are actually three different types of examples: a standard single data element solution; a 3D
value, typically an {XYZ} value; or a 4D value, {XYZW}. Integer or fixed point is important, but in terms of fast
3D processing, single-precision floating-point is of more interest.
ADDPS
ADDPD
PFADD

3DNow!  pfadd mmxDst, mmxSrc/m64     Single-precision  64
SSE     addps xmmDst, xmmSrc/m128    Single-precision  128
SSE2    addpd xmmDst, xmmSrc/m128    Double-precision  128
This vector instruction is a parallel operation that uses an adder on each of the source floating-point blocks
aSrc (xmmSrc) and bSrc (xmmDst) and stores the result in the destination Dst (xmmDst).
The instructions may be labeled as packed, parallel, or vector, but each block of floating-point bits is in reality
isolated from one another.
           63…32       31…0
          120.885    –86.479
       + –120.622   + 30.239
         --------   --------
            0.263    –56.240

          127…64        63…0
          –75.499     57.480
       +  124.073   + –50.753
         --------   --------
           48.574      6.727
ADDSS
ADDSD

SSE     addss xmmDst, xmmSrc/m32     Single-precision  128
SSE2    addsd xmmDst, xmmSrc/m64     Double-precision  128
This vector instruction is a scalar operation that uses an adder with the source scalar xmmSrc and the source
floating-point value in the least significant block within xmmDst and stores the result in the destination
xmmDst. The upper float elements are unaffected.
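A minimal scalar sketch of ADDSS in use (fA, fB, and fD are assumed to be 32-bit float memory locations):

movss xmm0,fA       ; {# # # A}
addss xmm0,fB       ; {# # # A+B}
movss fD,xmm0       ; store the scalar sum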
PFSUB
SUBPS
SUBPD

Mnemonic  P  PII  K6  3D!  3Mx+  SSE  SSE2  A64  SSE3  E64T

3DNow!  pfsub mmxDst, mmxSrc/m64     Single-precision  64
SSE     subps xmmDst, xmmSrc/m128    Single-precision  128
SSE2    subpd xmmDst, xmmSrc/m128    Double-precision  128
This vector instruction is a parallel operation that subtracts each of the source floating-point blocks aSrc
(xmmSrc) from bSrc (xmmDst) with the result stored in the destination Dst (xmmDst).
Note Be careful here, as A – B ≠ B – A.
The instructions may be labeled as packed, parallel, or vector, but each block of floating-point bits is in reality
isolated from one another.
           63…32       31…0
          –98.854    124.264
       – –50.315   – –33.952
         --------   --------
          –48.539    158.216

          127…64        63…0
          –48.043    127.277
       – –106.051  – –77.288
         --------   --------
           58.008    204.565
SUBSS
SUBSD

Mnemonic  P  PII  K6  3D!  3Mx+  SSE  SSE2  A64  SSE3  E64T

SSE     subss xmmDst, xmmSrc/m32     Single-precision  128
SSE2    subsd xmmDst, xmmSrc/m64     Double-precision  128
This vector instruction is a scalar operation that subtracts the least significant source floating-point block of
xmmSrc from the same block in xmmDst and stores the result in the destination xmmDst. The upper float
elements are unaffected.
PFSUBR

3DNow!  pfsubr mmxDst, mmxSrc/m64    Single-precision  64
This vector instruction is a parallel operation that subtracts each of the source floating-point blocks bSrc
(mmxDst) from aSrc (mmxSrc) with the result stored in the destination Dst (mmxDst).
The instructions may be labeled as packed, parallel, or vector, but each block of floating-point bits is in reality
isolated from one another.
A typical subtraction uses an equation similar to {a=a–b}, but what happens if the equation {a=b–a} is needed
instead? This instruction solves that situation by eliminating the special handling otherwise needed to exchange
values between registers, such as the following:
c[0]=a[0]; c[1]=a[1];
a[0]=b[0]; a[1]=b[1];
a[0]=a[0] – c[0]; a[1]=a[1] – c[1];
b[0]=c[0]; b[1]=c[1];
or

                       a holds        b holds
    exchange(a, b)     b              a
    a = a – b          (b – a)        a
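With PFSUBR the exchange disappears entirely; a minimal sketch (register and memory operands are assumptions):

movq   mm0,[eax]    ; a {a1 a0}
pfsubr mm0,[ebx]    ; b - a {b1-a1 b0-a0}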
Pseudo Vec
By now you should be very aware that you should be using assertions in your code such as the
ASSERT_PTR4 for normal pointers and ASSERT_PTR16 for pointers to vectors to ensure they are properly
aligned in memory, so I will try not to bore you with it much anymore in print. You should also by now be aware
of the penalties for dealing with out of alignment memory. Keep these in mind when writing your own code.
The use of the const keyword is also kept to a minimum to make the printed code less wordy and easier to read.
You will find that, for purposes of cross-platform compatibility, these functions return no value. They are
instead written as procedures where the first argument points to a buffer that the result is stored in. This is not
written to make your life confusing. It is written this way because of one particular processor: the 80x86. Because
MMX aliases the FPU registers, an EMMS instruction must be called to restore the FPU to a clean slate, so
only one of the two can be in use at a time. By not returning a value such as a float or array of floats, it minimizes
the risk that the programmer might accidentally try to use the returned value while in the wrong mode. The
vmp_SIMDEntry() and vmp_SIMDExit() procedure calls are there to assist in switching between the FPU
and MMX modes of operation. Since most of you will be focused upon float and not integer or fixed-point vector
math, that will be the focus, but the principles are the same!
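As a sketch of the calling pattern just described (the vector type name is an assumption; the vmp_SIMDEntry(), vmp_SIMDExit(), and vmp_VecAdd() names are from this book):

void vmp_VecAdd(vmp3DVector * const pvD,
                const vmp3DVector * const pvA,
                const vmp3DVector * const pvB);

// vD, vA, vB are vmp3DVector variables declared elsewhere
vmp_SIMDEntry();            // switching to MMX/3DNow! usage
vmp_VecAdd(&vD, &vA, &vB);  // the result comes back through the first argument
vmp_SIMDExit();             // EMMS is issued here; the FPU is safe to use again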
The simple addition and subtraction of a single (scalar) float has been included here as a reference.
The above are simple scalar addition and subtraction using single-precision floats. Now view the addition of
two vectors containing a three-cell {XYZ} float.
pvD->x = pvA->x + pvB->x;
pvD->y = pvA->y + pvB->y;
pvD->z = pvA->z + pvB->z;
}
mov edx,vD ; Vector destination
vmp_VecAdd (3DNow!)
vmp_VecSub (3DNow!)
Listing 12-7: …\chap12\vas3d\Vas3DX86M.asm
For subtraction, the functions are virtually identical to the addition functions, except for the exchanging of a
PFSUB for the PFADD.
Listing 12-8: …\chap12\vas3d\Vas3DX86M.asm
vmp_QVecAdd (3DNow!)
A quad vector access is not much different. Instead of loading a single float for each vector, a pair of floats is
loaded at a time by using a MOVQ instead of a MOVD.
Listing 12-9: …\chap12\vas3d\Vas3DX86M.asm
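A minimal sketch of that quad load (register usage is an assumption):

movq mm0,[eax+0]    ; vA.xy {Ay Ax}
movq mm1,[eax+8]    ; vA.zw {Aw Az}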
Keep in mind efficient memory organization and memory tail padding, as previously discussed in Chapter 4, "Bit
Mangling." Data can be misaligned or aligned, but 128-bit alignment would be preferable.
You now need to review two SSE instructions: MOVAPS and MOVUPS. These were introduced in Chapter 3,
"Processor Differential Insight."
MOVAPS — is for use in aligned memory access of single-precision floating-point values.
MOVUPS — is for use in unaligned memory access of single-precision floating-point values.
One other item that should be brought to light is the special handling required by vectors versus quad vectors.
As previously discussed in Chapter 4, the vector is three single-precision floats 96 bits in size, but when
accessed as a vector, 128 bits are accessed simultaneously. This means that those extra 32 bits must be
preserved and not destroyed. Also, the data contained within those extra bits must not be assumed to be a float;
it may be garbage to this particular expression but valid data to another, and so it must be treated as such.
Therefore, the easiest method is to clear and then restore those bits. The following declarations work
nicely as masks for bit blending just for that purpose:
himsk32 DWORD 000000000h, 000000000h, 000000000h, 0FFFFFFFFh ; Save upper 32 bits
lomsk96 DWORD 0FFFFFFFFh, 0FFFFFFFFh, 0FFFFFFFFh, 000000000h ; Save lower 96 bits
Also note that if bits are being preserved with a mask, then others are being cleared to zero. Of course it
depends upon the endian type byte ordering of the platform but for x86 it is as listed!
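A minimal sketch of that clear-and-restore blend, assuming the masks above are 16-byte aligned and xmm0 already holds the new {# Z Y X} result:

movaps xmm7,[edx]                 ; current destination {Dw Dz Dy Dx}
andps  xmm7,xmmword ptr himsk32   ; keep only the original upper 32 bits {Dw 0 0 0}
andps  xmm0,xmmword ptr lomsk96   ; keep only the new lower 96 bits {0 Z Y X}
orps   xmm0,xmm7                  ; blend {Dw Z Y X}
movaps [edx],xmm0                 ; write all 128 bits back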
Listing 12-10: …\chap12\vas3d\Vas3DX86M.asm
The code looks almost identical, so from this point forward, the book will only show the aligned code using
MOVAPS.
vmp_QVecAdd (SSE) Aligned
And for quad vectors, it is even easier as there is no masking of the fourth float {W}; just read, evaluate, and
then write! Of course the function should have the instructions arranged for purposes of optimization but here
they are left in a readable form.
Listing 12-12: …\chap12\qvas3d\QVas3DX86M.asm
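A minimal sketch of the aligned quad add (register usage is an assumption):

mov    eax,vA       ; vector source A
mov    ebx,vB       ; vector source B
mov    edx,vD       ; vector destination
movaps xmm0,[eax]   ; {Aw Az Ay Ax}
addps  xmm0,[ebx]   ; {Aw+Bw Az+Bz Ay+By Ax+Bx}
movaps [edx],xmm0   ; write vD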
Did that look strangely familiar? The big question now is, "How do we replicate a scalar to look like a vector
since there tends not to be mirrored scalar math on processors?" Typically a processor will interpret a scalar
calculation as the lowest (first) float being evaluated with a single scalar float. This is fine and dandy, but there
are frequent times when a scalar needs to be replicated and summed to each element of a vector. So the next
question is how do we do that?
With the 3DNow! instruction set it is easy. Since the processor is really a 64-bit half vector, the data is merely
unpacked into the upper and lower 32 bits.
movd mm2,fScalar ; fScalar {0 s}
punpckldq mm2,mm2 ; fScalar {s s}
Then it is just used twice, once with the upper 64 bits and then once with the lower 64 bits.
pfadd mm0,mm2 ; {Ay+s Ax+s}
pfadd mm1,mm2 ; {Aw+s Az+s}
With the SSE instruction set it is almost as easy. The data is shuffled into all 32-bit floats.
movss xmm1,fScalar ; {0 0 0 s}
shufps xmm1,xmm1,00000000b ; {s s s s}
Any questions?
d(3) = b(3) + a(3)   d(2) = b(2) – a(2)   d(1) = b(1) + a(1)   d(0) = b(0) – a(0)

ADDSUBPS
ADDSUBPD

Mnemonic  P  PII  K6  3D!  3Mx+  SSE  SSE2  A64  SSE3  E64T

SSE3    addsubps xmmDst, xmmSrc/m128    Single-precision  128
SSE3    addsubpd xmmDst, xmmSrc/m128    Double-precision  128
This vector instruction is a parallel operation that has an even subtraction and an odd addition of the source
floating-point blocks. For the even elements, subtract aSrc (xmmSrc) from bSrc (xmmDst) with the result
stored in the destination Dst (xmmDst). For the odd elements, sum aSrc (xmmSrc) and bSrc (xmmDst) with
the result stored in the destination Dst (xmmDst).
d(3) = a(2) + a(3)   d(2) = a(0) + a(1)   d(1) = b(2) + b(3)   d(0) = b(0) + b(1)

HADDPS
HADDPD
PFACC

Mnemonic  P  PII  K6  3D!  3Mx+  SSE  SSE2  A64  SSE3  E64T

3DNow!  pfacc mmxDst, mmxSrc/m64      Single-precision  64
SSE3    haddps xmmDst, xmmSrc/m128    Single-precision  128
SSE3    haddpd xmmDst, xmmSrc/m128    Double-precision  128
This vector instruction is a parallel operation that separately sums the odd/even pairs of the source and
destination and stores the result of the bSrc (xmmDst) in the lower destination elements of Dst (xmmDst) and
the result of the aSrc (xmmSrc) in the upper elements of Dst (xmmDst).
d(3) = a(2) – a(3)   d(2) = a(0) – a(1)   d(1) = b(2) – b(3)   d(0) = b(0) – b(1)

PFNACC
HSUBPS
HSUBPD

3Mx+    pfnacc mmxDst, mmxSrc/m64     Single-precision  64
SSE3    hsubps xmmDst, xmmSrc/m128    Single-precision  128
SSE3    hsubpd xmmDst, xmmSrc/m128    Double-precision  128
This vector instruction is a parallel operation that separately subtracts the (odd) element from the (even)
element and stores the result of the bSrc (xmmDst) in the lower destination elements of Dst (xmmDst) and the
result of the aSrc (xmmSrc) in the upper addresses of Dst (xmmDst).
PFPNACC — Vector Floating-Point Horizontal Add/Sub
PFPNACC
3Mx+    pfpnacc mmxDst, mmxSrc/m64    Single-precision  64
This half-vector instruction is a parallel operation that separately subtracts the upper element from the lower
element of bSrc (mmxDst) and stores the result in the lower element of Dst (mmxDst). The sum of the upper
and lower elements of aSrc (mmxSrc) is stored in the upper element of Dst (mmxDst).
Exercises
1. Using only Boolean logic, how could two numbers be summed?
2. If your processor had no instructions for parallel subtraction, how would you find the difference of two
numbers?
3. Invert the sign of the even-numbered elements of signed 8-bit byte, 16-bit half-word, and 32-bit word of
a 128-bit data value using:
a. pseudo vector C code
b. MMX
c. SSE2
4. Same as problem 3 but use odd-numbered elements.
5. Invert the sign of all the elements of four packed single-precision floating-point values.
6. You have been given a 4096-byte audio sample consisting of left and right channel components with a
PCM (pulse coded modulation) of unsigned 16-bit with 0x8000 as the baseline.
7. unsigned short leftStereo[1024], rightStereo[1024];
signed char Mono[???];
8. How many bytes is the mixed sample?
9. Write a mixer function to sum the two channels from stereo into mono and convert to a signed 8-bit
sample.
Project:
You now have enough information to write an SHA-1 algorithm discussed in Chapter 5, "Bit Wrangling," for
your favorite processor. Write one! HINT: Write the function code in C first.
Chapter 13: FP Vector Multiplication and Division
Overview
There are multitudes of variations of multiplicative mnemonic manipulations. It seems almost every processor
supports a slightly different scheme involving different integer word sizes, floating-point precision types,
methods of rounding, saturation, etc. Fundamentally, despite the variations, multiplication is very similar to and
uses the same methodologies as the addition detailed in the last chapter.
Notice the integer multiplication examples below. The example on the left requires more bits to contain the
results of the operation and thus different methods have been implemented to reduce that value to its
component size. The results of the floating-point multiplication in the other two examples follow the rules of the
IEEE-754 standard for binary floating-point arithmetic. The result of a multiplication is stored with no increase
in data containment size, but there is a penalty of a loss of precision.
Floating-Point Multiplication
PFMUL
MULPS

3DNow!  pfmul mmxDst, mmxSrc          Single-precision  64
SSE     mulps xmmDst, xmmSrc/m128     Single-precision  128
This SIMD instruction uses a 64 (128)-bit data path and so two (four) operations occur in parallel. The product
is calculated for each of the Real32 single-precision floating-point elements of the multiplicand xmmDst and
the multiplier xmmSrc for each block, and the result is stored in each of the original Real32 elements of the
destination xmmDst.
(Semi-Vector) DPFP Multiplication
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
MULPD
SSE2    mulpd xmmDst, xmmSrc/m128     Double-precision  128
This vector instruction uses a 128-bit data path and so two operations occur in parallel. The product is
calculated for each of the Real64 (double-precision floating-point) pairs of the multiplicand xmmDst and the
multiplier xmmSrc for each block, and the result is stored in each of the original Real64 elements of the
destination xmmDst.
MULSS
SSE mulss xmmDst, xmmSrc/m32 Single-precision 32
This scalar instruction uses a 128-bit data path but multiplies only the first Real32 (single-precision floating-point)
source scalar multiplier with the multiplicand xmmDst; the result is stored in the lowest Real32 element of the
destination xmmDst, leaving the upper elements unaffected.
MULSD
SSE2    mulsd xmmDst, xmmSrc/m64      Double-precision  64
This scalar instruction uses a 128-bit data path but multiplies only the first Real64 (double-precision floating-point)
source scalar multiplier with the multiplicand xmmDst; the result is stored in the lowest Real64 element of the
destination xmmDst, leaving the upper element unaffected.
Pseudo Vec
The multiplication of vectors is similar to that of the addition of vectors.
pvD->x = pvA->x * pvB->x + pvC->x;
pvD->y = pvA->y * pvB->y + pvC->y;
pvD->z = pvA->z * pvB->z + pvC->z;
pvD->w = pvA->w * pvB->w + pvC->w;
}
As you may have noticed, the vector only loaded one float instead of two, set the second to zero, calculated
the product, and then wrote the three values back to memory.
vmp_VecMul (3DNow!)
Listing 13-5: …\chap13\vmd3d\Vmd3DX86M.asm
vmp_QVecMul (3DNow!)
Listing 13-6: …\chap13\qvmd3d\QVmd3DX86M.asm
vmp_QVecMAdd (3DNow!)
For MADD, the addition needs to be handled separately.
Listing 13-7: …\chap13\qvmd3d\QVmd3DX86M.asm
movq mm2,[ebx+8] ; vB.zw {Bw Bz}
movq mm3,[eax+8] ; vA.zw {Aw Az}
movq mm5,[ecx+8] ; vC.zw {Cw Cz}
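; A minimal sketch of the remaining multiply-accumulate on those upper
; halves (the destination register is an assumption, not the book's listing):
pfmul mm2,mm3       ; {AwBw AzBz}
pfadd mm2,mm5       ; {AwBw+Cw AzBz+Cz}
movq  [edx+8],mm2   ; write vD.zw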
vmp_VecMul (SSE)
The SSE processor in the following code snippet can load 128 bits at a time, so the entire 96-bit vector can be
loaded at once, including an extra 32 bits. This introduces a problem of contamination when the 96-bit value is
written to memory as 128 bits. The solution is to read those destination bits, preserve the upper 32 bits through
bit masking and blending, and write the newly merged 128 bits. Keep in mind efficient memory organization
and memory tail padding as discussed earlier. Data can be misaligned or aligned, but 128-bit alignment is
preferable. Only aligned memory will be discussed, but for SSE keep in mind the use of MOVUPS instead of
MOVAPS when memory alignment cannot be guaranteed.
movaps xmm2,[ecx] ; vC.xyzw {Cw Cz Cy Cx}
mulps xmm0,xmm1 ; {AwBw AzBz AyBy AxBx}
addps xmm0,xmm2 ; {AwBw+Cw ... AxBx+Cx}
movaps [edx],xmm0 ; {AwBw+Cw ... AxBx+Cx}
It was discussed in a previous chapter that a difference is the summation of a term and the inverse of a second
term using the additive inverse algebraic law. A division is also a play on an equation transformation: a
multiplication of the dividend by the reciprocal of the divisor.
Some instruction sets, such as 3DNow!, do not directly support floating-point division but do support the
product of a reciprocal.
DIVPS
SSE     divps xmmDst, xmmSrc/m128     Single-precision  128
This vector instruction uses a 128-bit data path and so four operations occur in parallel. The result is
calculated for each of the source Real32 (single-precision floating-point) quads of the quotient xmmDst and the
divisor xmmSrc of each block, and the result is stored in each of the original Real32 elements of the
destination xmmDst.
DIVPD
SSE2    divpd xmmDst, xmmSrc/m128     Double-precision  128
This vector instruction uses a 128-bit data path and so two operations occur in parallel. The result is calculated
for each of the source Real64 (double-precision floating-point) pairs of the quotient xmmDst and the divisor
xmmSrc of each block, and the result is stored in each of the original Real64 elements of the destination
xmmDst.
Dst(63…0) = Dst(63…0) / Src(63…0)
DIVSS
SSE     divss xmmDst, xmmSrc/m32      Single-precision  32
This scalar instruction uses a 128-bit data path but only the least significant Real32 (single-precision floating-
point) elements are used — the xmmSrc source scalar divisor and the xmmDst quotient. The result is stored in
the lower 32 bits of the destination xmmDst, leaving the upper 96 bits unaffected.
DIVSD
SSE2    divsd xmmDst, xmmSrc/m64      Double-precision  64
This scalar instruction uses a 128-bit data path but only the least significant Real64 (double-precision
floating-point) elements are used: the xmmSrc source scalar divisor and the xmmDst dividend. The result is
stored in the original lower 64 bits of the destination xmmDst, leaving the upper 64 bits unaffected.
N×SPFP Reciprocal
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
RCPPS
RCPSS
SSE     rcpps xmmDst, xmmSrc/m128     Single-precision  128
SSE     rcpss xmmDst, xmmSrc/m32      Single-precision  128
The RCPPS instruction uses a packed 128-bit data path. It computes an approximate reciprocal (1/x) of each
Real32 element of the source xmmSrc and stores the results in the destination xmmDst.
The RCPSS scalar instruction uses a 128-bit data path. It computes an approximate reciprocal of the least
significant Real32 in the source xmmSrc and stores the result in the least significant element of the destination
xmmDst, leaving the upper elements unaffected.
PFRCP
3DNow!  pfrcp mmxDst, mmxSrc/m32      Single-precision  32/64
This 3DNow! scalar instruction uses a 64-bit data path. It takes only the first Real32 (single-precision floating-
point) source scalar divisor, produces the 14-bit reciprocal, and stores the result in both the lower 32 bits and
upper 32 bits of the destination mmxDst.
To convert the result to a division it only needs to be followed up by the multiplication instruction PFMUL. This
instruction would be considered a low precision division.
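A minimal sketch of such a low-precision division, A/B (register usage is an assumption):

movd  mm1,fB        ; {0 B}
pfrcp mm1,mm1       ; {1/B 1/B} ~14-bit estimate
movd  mm0,fA        ; {0 A}
pfmul mm0,mm1       ; {# A*(1/B)}
movd  [edx],mm0     ; store the quotient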
Hint FAST or SLOW algorithm?
Of course fast! Why would anyone even bother calling a slow algorithm or keep a slow
algorithm in memory unused? In reality, these titles are misleading. They really mean:
FAST — A quicker estimate that trades away some precision for speed.
SLOW — Not slow, just the standard algorithm with the best precision possible for the
supported data size.
Picking and choosing an appropriate algorithm is just another level of code optimization. In a
video game the standard algorithm would be used for rendering the display or another
precision required event, but the fast (quick) algorithm would be used for quick processing
such as pre-culling of polygons or quick distances between sprites, etc.
movd mm3,fB ; {0 B}
movd mm0,fA ; {0 A}
mov edx,pfD ; float destination
Note that the code has the fast precision set to 0.001f to accommodate SSE, but 0.0001f works for 3DNow!
estimation.
Listing 13-22: …\chap13\fmd\FmdX86M.asm
SPFP Reciprocal (2 Stage) (24-Bit)
A fast version of the previous instruction would entail taking advantage of the two-stage vector instructions
PFRCPIT1 and PFRCPIT2, in conjunction with the result of the reciprocal instruction PFRCP, to achieve a
higher 24-bit precision. It uses a variation of the Newton-Raphson reciprocal approximation.
This is an error-correcting scheme that could keep reducing the error indefinitely, but typically only a single pass
is used. Roughly, it involves multiplying the estimate by the original number, finding how far that product is from
one, and then adjusting the estimate by that amount.
PFRCPIT1
PFRCPIT2
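The fragment below multiplies by the refined reciprocal; a minimal sketch of the refinement itself (register choices are assumptions) would look something like this:

movd     mm3,fB     ; {0 B}
pfrcp    mm1,mm3    ; ~14-bit estimate of 1/B in both halves
pfrcpit1 mm3,mm1    ; first Newton-Raphson step
pfrcpit2 mm3,mm1    ; mm3 = ~24-bit 1/B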
movd mm0,fA ; {0 A}
pfmul mm0,mm3 ; {# A×(1/B)}
movd [edx],mm0 ; A/B
movss xmm1,fB ; B {0 0 0 B}
movss xmm0,fA ; A {0 0 0 A}
mov edx,pfD ; Float destination
divss xmm0,xmm1 ; {0 0 0 A/B}
movss [edx],xmm0 ; A/B
Pseudo Vec
The vector and quad vector operations are not much different. The scalar in essence becomes replicated into
all the denominator fields and then the product of the reciprocals (division) takes place.
Single-Precision Vector Float Scalar Division
Listing 13-25: ...\chap13\fmd\Fmd.cpp
movq [edx+0],mm0 ; {Ay/s Ax/s}
movq [edx+8],mm1 ; {Aw/s Az/s}
It is fairly simple. Similar to a scalar multiplication, the scalar is merely distributed to each of the elements of
the denominator and then the division takes place. (Have you read this enough times yet?)
Exercises
1. What is an "inner product"?
2. A cross product is known by another name. What is it?
3. What happens to a vector if a negative scalar is applied as a product?
4. What is the solution for:
a. A×B + C×D, if A = 2, B = 5, C = 3, and D = 4?
b. A= B = C = D = 0x80000000?
c. With saturation?
d. Without saturation?
5. What is the equation for a dot product?
6. Given the two vertices v:{–8, 4, –6, 4} and w:{8, 2, –6, 8}, resolve
a. v + w
b. v – w
c. v · w
d. v × w
Chapter 14: Floating-Point Deux
Overview
Since the floating-point values have been discussed, it is now time to discuss some of the operations that can
be performed with them, such as bit masking and comparisons.
Why would someone wish to generate a bit mask for a floating-point number? Due to the nature of the
mantissa and exponential bits, a floating-point value can be manipulated.
The simplified form of this parallel instruction individually calculates the square root of each of the packed
floating-point values, and returns the result in the destination. Some processors support the square root
instruction directly, but some processors, such as the 3DNow! instruction set, actually support it indirectly
through instructional stages. And some processors support it as a reciprocal square root.
So now I pose a little problem. We hopefully all know that a negative number should never be passed into a
square root because computers go BOOM, as they have no idea how to deal with an imaginary number (i).
With that in mind, what is wrong with a reciprocal square root? Remember your calculus and limits?
Hint As x approaches zero from the right.
Do you see it now? You cannot divide by zero, as it results in infinity and is mathematically problematic. So
what has to be done is to trap for the x being too close to zero (as x approaches zero) and then substitute the
value of one as the solution for the reciprocal square root.
y = (x < 0.0000001) ? 1.0 : (1 / sqrt(x)); // Too close to zero
It is not perfect, but it is a solution. With x that close to zero, substituting one means that multiplying the result
into another number simply leaves that number unchanged; thus the multiplicative identity comes to mind:
1 × n = n. But how to deal with this in vectors? Well, you just learned the trick in this chapter! Remember
the packed comparison? It is just a matter of using masking and bit blending. So in the case of a reciprocal
square root, the square root can be easily achieved by merely multiplying the result by the original x value,
thus achieving the desired square root. Recall that the square of a square root is the original value.
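In scalar C the whole guard-and-multiply idea boils down to a few lines; a minimal sketch (not this book's listing):

#include <math.h>

float GuardedSqrt(float x)
{
    float r = (x < 0.0000001f) ? 1.0f : 1.0f / sqrtf(x); // reciprocal square root with the near-zero trap
    return x * r;                                        // x * 1/sqrt(x) = sqrt(x); near zero it returns ~x
}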
vD[] = sqrt(vA[]);
SQRTSS
SSE     sqrtss xmmDst, xmmSrc/m32     Single-precision  128
This SIMD instruction is a 128-bit scalar operation that calculates the square root of only the lowest single-
precision floating-point element containing the scalar xmmSrc. The result is stored in the lowest single-
precision floating-point block at destination xmmDst, and the remaining bit blocks are left intact.
Dst(31…0) = sqrt(Src(31…0))
sqrtss xmm0,xmm0 ; {0 0 0 sqrt(A)}
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
SQRTPS
SSE     sqrtps xmmDst, xmmSrc/m128    Single-precision  128
This SIMD instruction is a 128-bit parallel operation that calculates the square root of the four single-precision
floating-point blocks contained within xmmSrc, and stores the result in the single-precision floating- point
blocks at destination xmmDst.
Dst(31…0) = sqrt(Src(31…0))
Dst(63…32) = sqrt(Src(63…32))
Dst(95…64) = sqrt(Src(95…64))
Dst(127…96) = sqrt(Src(127…96))
SQRTSD
SSE2    sqrtsd xmmDst, xmmSrc/m64     Double-precision  128
This SIMD instruction is a 128-bit scalar operation that calculates the square root of only the lowest double-
precision floating-point block containing the scalar xmmSrc, and stores the result in the lowest double-
precision floating-point block at destination xmmDst. The remaining bit blocks are left intact.
Dst(63…0) = sqrt(Src(63…0))
SQRTPD
SSE2    sqrtpd xmmDst, xmmSrc/m128    Double-precision  128
This SIMD instruction is a 128-bit parallel operation that calculates the square root of the two double-precision
floating-point blocks contained within xmmSrc, and stores the result in the double-precision floating- point
blocks at destination xmmDst.
Dst(63…0) = sqrt(Src(63…0))
Dst(127…64) = sqrt(Src(127…64))
Mnemonic  P  PII  K6  3D!  3Mx+  SSE  SSE2  A64  SSE3  E64T

PFRSQRT
RSQRTSS
RSQRTPS

3DNow!  pfrsqrt mmxDst, mmxSrc/m32    Single-precision  32/64
SSE     rsqrtss xmmDst, xmmSrc/m32    Single-precision  32
SSE     rsqrtps xmmDst, xmmSrc/m128   Single-precision  128
This SIMD instruction is a 32-bit scalar operation that calculates the reciprocal square root (1/sqrt(x)) of only the
lowest single-precision floating-point block containing the scalar mmxSrc, and stores the duplicated result in
both the low and high single-precision floating-point blocks of the destination mmxDst.
Pseudo Vec
*pfD = sqrtf(fA); // = sqrt(A)
}
; {0 sqrt(A)}
movd [edx],mm0 ; Save the square root
SPFP Square Root (2 Stage) (24-Bit)
A fast version of the previous instruction would entail taking advantage of the two-stage vector instructions
PFRSQIT1 and PFRCPIT2, in conjunction with the result of the reciprocal square root instruction PFRSQRT,
to achieve a higher 24-bit precision. It uses a variation of the Newton-Raphson reciprocal square root approximation.
PFRCPIT2
PFRSQIT1
Second stage for 24-bit reciprocal and/or square root (see reciprocals in Chapter 13):
3DNow! pfrcpit2 mmxDst, scalar(mmx/m32) Single-precision 64
; {# sqrt(A)}
movd [edx],mm0 ; Save the square root
movss xmm0,fA ; {# # # A}
sqrtss xmm0,xmm0 ; SqRoot {# # # sqrt(A)}
movss [edx],xmm0 ; Save square root
Unless you have a top-of-the-line supercomputer, I would recommend you stay away from vector square roots.
Instead, you will typically only need a single square root. If you really need vector-based square roots,
remember that your processor can only do one at a time and your code will have to wait for it to complete
before issuing a request to begin the next one. That could take almost forever! Well, not quite. But it is still not
a great idea. Also, do not forget about preventing negative numbers from being processed by a square root.
That causes exception faults!
Pseudo Vec
Similar to an estimated reciprocal for a division, a square root sometimes is available as an estimate as well.
Be warned that the estimated square root is faster but has a lower precision. But if the lower precision is viable
for your application, then investigate using the estimated square root instead.
vmp_QVecSqrt (3DNow!) Fast Quad Float SQRT 15-Bit Precision
Listing 14-7: …\chap14\qvsf3d\QVsf3DX86M.asm
movq [edx+0],mm0 ; {sqrt(Ay) sqrt(Ax)}
movq [edx+8],mm1 ; {sqrt(Aw) sqrt(Az)}
pfrcpit2 mm3,mm6 ; xy {2nd step}
pfrcpit2 mm2,mm5 ; zw {2nd step}
sqrtps xmm0,xmm0 ; {sqrt(Aw) sqrt(Az) sqrt(Ay) sqrt(Ax)}
movaps [edx],xmm0 ; Save square roots
; vD = 1 / sqrt(vA)
rsqrtps xmm0,xmm0
The first rule of a square root operation is to not use it unless you really have to, as it is a time-intensive
mathematical operation. One method typically used for calculating the length of a line between two points,
whether it exists in 2D or 3D space, is the Pythagorean equation.
277
2D Distance
3D Distance
Figure 14-2: Right triangle within 3D Cartesian coordinate system representing a 3D distance and thus its
magnitude
Equation 14-4: 3D distance (magnitude)
Mathematical Formula:
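In plain text form they are the familiar Pythagorean sums:

2D: d = sqrt((x2 – x1)² + (y2 – y1)²)
3D: d = sqrt((x2 – x1)² + (y2 – y1)² + (z2 – z1)²)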
When a value too close to zero is fed into the reciprocal square root, the accuracy becomes lost. So instead of
being forced to divide by zero (1/0) to represent infinity, it is instead set to a value of one (y × 1 = y); thus the
original value is preserved.
The Pythagorean equation is the distance between two points, in essence, the magnitude of their differences.
In a terrain-following algorithm for creature AI, the distance between each of the creatures and the main
character would be compared to make an idle, run, or flee determination. The coordinates of each object are
known but their distances would have to be calculated and then compared to each other as part of a solution.
Let's examine a simplistic equation utilizing r to represent the distance between the player and four monsters
{mA through mD}:
If you remember the algebraic law of multiplicative identity, the square root factors out of the equation, as it
can be removed from both sides of the equal sign and the equation will remain in balance.
sqrt(ΔAx² + ΔAy² + ΔAz²) =? sqrt(ΔBx² + ΔBy² + ΔBz²)
(ΔAx² + ΔAy² + ΔAz²) =? (ΔBx² + ΔBy² + ΔBz²)
Does this look a little similar to the sum of absolute differences operation discussed in Chapter 7? They are
different by the sum of absolutes versus the sum of the squares, but they nevertheless have a similarity. The
point is that there is no need to use the square root operation each time in this kind of problem. Neat, huh! It is
an old trick but still an effective one.
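A minimal C sketch of that comparison (the names are illustrative, not from the book's listings):

float dAx = mA.x - player.x, dAy = mA.y - player.y, dAz = mA.z - player.z;
float dBx = mB.x - player.x, dBy = mB.y - player.y, dBz = mB.z - player.z;
float rA2 = dAx*dAx + dAy*dAy + dAz*dAz;   // squared distance to monster A
float rB2 = dBx*dBx + dBy*dBy + dBz*dBz;   // squared distance to monster B
int bAIsCloser = (rA2 <= rB2);             // no square root needed for the comparison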
Now suppose it has been discovered that Monster C is the closest monster. Take the square root to calculate its
actual distance, not forgetting to use the estimated square root version if high accuracy is unnecessary.
Pseudo Vec
Listing 14-11: ...\chap14\vsf3d\Vsf3D.cpp
vmp_VecMagnitude (3DNow!)
Listing 14-12: …\chap14\vsf3d\Vsf3DX86M.asm
pfrsqrt mm1,mm0 ; {# 1/sqrt(r)} 15-bit estimate
; Calculate 1/sqrt() accurate to 24 bits
movq mm2,mm1 ; {# 1/sqrt(r)} keep the estimate
pfmul mm1,mm1 ; {# 1/r}
pfrsqit1 mm1,mm0 ; {1st step}
pfrcpit2 mm1,mm2 ; {2nd step} mm1 = 1/sqrt(r) to 24 bits
pfmul mm0,mm1 ; r * 1/sqrt(r) = sqrt(r)
movd [edx],mm0 ; Save distance
sqrtss xmm0,xmm2 ; sqrt(r)
movss [edx],xmm0 ; Save scalar distance
Chapter 15: Binary-Coded Decimal (BCD)
Overview
Converting an ASCII string to binary-coded decimal is as easy as pie (or is it a piece of cake?). In BCD, for
every byte, the lower 4-bit nibble and upper 4-bit nibble each store a value from 0 to 9 (think double-digit hex,
where the upper six values A through F are simply not used).
BCD
Table 15-1: ASCII numerical digit to hex and decimal values
ASCII 0 1 2 3 4 5 6 7 8 9
Hex 0x30 0x31 0x32 0x33 0x34 0x35 0x36 0x37 0x38 0x39
Decimal 48 49 50 51 52 53 54 55 56 57
BCD 0 1 2 3 4 5 6 7 8 9
Binary 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001
Converting a BCD value from ASCII to a nibble is as easy as subtracting the hex value of 0x30, '0', or 48
decimal from the ASCII numerical value to get a resulting value in the range of {0…9}.
byte ASCIItoBCD(char c)
{
    ASSERT(('0' <= c) && (c <= '9'));
    return (byte)(c - '0');
}
When the 8086 processor was first manufactured the FPU was a separate optional chip (8087). There was a
need for some BCD operations similar to other processors and so it was incorporated into the CPU. The 8087
had some BCD support as well. When the 64-bit processor was developed, it was decided that BCD support
was not required anymore as the FPU was an alternative method.
The FPU uses the first nine bytes to support 18 BCD digits. The uppermost bit of the 10th byte indicates the
value is negative if set or positive if the bit is clear.
Figure 15-1: Ten-byte BCD data storage. MSB in the far left byte (byte #9) is the sign bit and the rightmost nine
bytes (#8...0) contain the BCD value pairs. The 18th BCD digit resides in the upper nibble of byte #8 and the 1st
BCD digit resides in the lower nibble of byte #0.
Setting the upper nibble of a byte is merely the shifting left of a BCD digit by four bits, then logically ORing (or
summing) it with the lower nibble.
byte BCDtoByte(byte lo, byte hi)
{
    return (hi << 4) | lo;
}
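Packing a whole ASCII digit string into the ten-byte form used by the FPU is just a loop around those helpers; a minimal sketch (the function name is hypothetical, not from the book's listings):

void AsciiToTenByteBCD(byte bcd[10], const char *pDigits, int nDigits)
{
    int i;

    for (i = 0; i < 10; i++)            // clear all ten bytes (sign bit left clear = positive)
        bcd[i] = 0;

    for (i = 0; (i < nDigits) && (i < 18); i++)
    {
        byte d = ASCIItoBCD(pDigits[nDigits - 1 - i]);  // least significant digit first
        if (i & 1)
            bcd[i >> 1] |= (byte)(d << 4);              // odd digit goes to the upper nibble
        else
            bcd[i >> 1] |= d;                           // even digit goes to the lower nibble
    }
}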
DAA 32 32
daa Signed
The DAA general-purpose instruction adjusts the EFLAGS for a decimal carry after an addition.
Flags O.flow Sign Zero Aux Parity Carry
- - - X - X
Flags: The Aux and Carry flags are set to 1 if an addition resulted in a decimal carry in their associated 4-bit
nibble; otherwise they are cleared to 0.
No 64-bit   The general-purpose BCD instructions are not available in 64-bit mode.

xor eax,eax ; Reset Carry(s)
$L1: mov al,[edi] ; D = D + A
adc al,[esi]
daa
mov [edi],al ; Store result
dec esi
dec edi
dec ecx
jne $L1 ; Loop for n BCD bytes
Note that this function steps through memory in reverse byte order, which is not processor efficient. High digits
are in low offset bytes, and low digits are in high offset bytes: {N…0}. So the operation must go to the end of
the buffer and traverse memory backward from low-digit pairs to high-digit pairs. If not working with the FPU to
handle BCD, then each nibble pair could be stored in reverse order: {0…N}. Only when they need to be
displayed or printed would there be a reverse increment through memory. Note this is backward to the
ordering of the FPU! The sample code uses this method.
DAS 32 32
das Signed
The DAS general-purpose instruction adjusts the EFLAGS for a decimal borrow after a subtraction.
Flags O.flow Sign Zero Aux Parity Carry
- - - X - X
Flags: The Aux and Carry flags are set to 1 if a subtraction resulted in a decimal carry set due to a borrow in
their associated 4-bit nibble; otherwise they are cleared to 0.
No 64-bit   The general-purpose BCD instructions are not available in 64-bit mode.

xor eax,eax ; Reset Carry(s)
$L1: mov al,[edi] ; D = D - A
sbb al,[esi]
das
mov [edi],al ; Store result
282
dec esi
dec edi
dec ecx
jne $L1 ; Loop for n BCD bytes
AAA 32 32
aaa Signed
The AAA general-purpose instruction adjusts the EFLAGS for a decimal carry. If a resulting calculation is
greater than 9, then AL is set to the remainder between (0…9) and AH is incremented.
Flags O.flow Sign Zero Aux Parity Carry
- - - X - X
Flags: The Aux and Carry flags are set to 1 if a decimal carry resulted; otherwise they are cleared to 0.
No 64-bit   The general-purpose BCD instructions are not available in 64-bit mode.

add al,ah
aaa
or al,'0' ; '0' + {0...9} = ASCII '0...9'
AAS 32 32
aas Signed
The AAS general-purpose instruction adjusts the EFLAGS depending on the results of the AL register after a
subtraction operation. If a resulting calculation sets the carry, indicating a borrow has occurred, then AL is set
to the remainder between (0…9) and AH is decremented.
Flags O.flow Sign Zero Aux Parity Carry
- - - X - X
Flags: The Aux and Carry flags are set to 1 if a decimal borrow resulted; otherwise they are cleared to 0.
No 64-bit   The general-purpose BCD instructions are not available in 64-bit mode.

sub al,'7'
aas
or al,'0' ; '0' + {0...9} = ASCII '0...9'
AAM 32 32
aam Signed
The AAM general-purpose instruction adjusts the result in the AL register after a multiplication of two unpacked
BCD digits: AL is divided by 10, the quotient is stored in AH, and the remainder is stored in AL. The EFLAGS
are set according to the resulting AL.
Flags O.flow Sign Zero Aux Parity Carry
283
- X X - X -
Flags: The Sign, Zero, and Parity flags are set to the resulting value in the AL register.
No 64-bit   The general-purpose BCD instructions are not available in 64-bit mode.

mul bh ; AX = AL * BH
aam
AAD 32 32
aad Signed
The AAD general-purpose instruction adjusts two unpacked BCD digits in preparation for a division operation:
AL is set to (AH * 10) + AL and AH is cleared to 0. The EFLAGS are set according to the resulting AL.
Flags O.flow Sign Zero Aux Parity Carry
- X X - X -
Flags: The Sign, Zero, and Parity flags are set to the resulting value in the AL register.
No 64-bit   The general-purpose BCD instructions are not available in 64-bit mode.

and eax,0000111100001111b
aad
FBLD
FPU fbld source BCD 80
How does this all work? Well, the FPU has a single instruction that loads a BCD value and converts it to an 80-
bit (10-byte) double extended precision floating-point value that it stores on the FPU stack. This can then be
written back to computer memory as double-precision floating-point. Simple, fast, and minimal excess code
and nothing time intensive.
Listing 15-1: …\chap15\ase2vmp\util.cpp
__asm {
fbld tbyte ptr bcd ; Load (80-bit) BCD
fstp f ; Write 64-bit double-precision
}
The returned floating-point value contains the BCD number as an integer with no fractional component. For
example:
byte bcd[10] = {0x68, 0x23, 0x45, 0x67, 0x89, 0x98, 0x87, 0x76, 0x65, 0x80};
double f;
__asm {
    fbld tbyte ptr bcd ; Load (80-bit) BCD
    fstp f             ; Write 64-bit double-precision
}
At this point the decimal place needs to be adjusted to its correct position using the product of an exponential
10^-n. This can be done with either a simple table lookup or a call to the function pow(10,-e), but the table
lookup is faster. And speed is what it is all about.
Graphics 101
All of you who start a processing tool to convert art resources or game resources into a game database and
then leave to have lunch, get a soda, have a snack, go to the bathroom, pick up your kids from school, or go
home, all yell, "ME!"
WOW! That was loud! It could be heard reverberating across the planet.
Those of you who have worked on games in the past, did you meet your timelines? Did you find yourself
working lots of extra (crunch) time to meet a milestone? (We will ignore E3 and the final milestones!) How often
do you have to wait for a tool to complete a data conversion? Add up all that "waiting" time. What did your tally
come to?
You don't really know? Here is a thought: Add a wee bit of code to your program and write the results to an
accumulative log file. Then check it from time to time to see where some of that time is going.
Some people believe in optimizing the game only if there is time somewhere in the schedule. Management
quite often counts the time beans and decides that getting the milestone met is much more important than
early ongoing debugging or optimization. But just think of that time savings if your tools are written with
optimization. Just do not tell management about it or they will think they can ship the product early.
3D rendering tools are expensive and so programmers typically do not have ready access to a live tool. They
sometimes write plug-ins, but quite often they will merely write an ASCII scene exporter (ASE) file parser to
import the 3D data into their tools that generate the game databases. With this method, programmers do not
have to have a licensed copy of a very expensive tool sitting on their desks.
Hint Prevent artist versus programmer wars by working out who gets the burden of the day-to-day
conversion of art resources for the game application before the project coding begins. Use
and document it in the game design, technical design, and especially the art bible
documents, and treat these as living entities and not written in stone!
This little item brings up a trivial item of artist versus programmer wars. It all comes down to who will have the
task of running the tools to export and convert data into a form loaded and used by a game application. Neither
typically wants the task and both consider it mundane, but it is nevertheless required. Artists need to run the
tools occasionally so as to check results of their changes to art resources. Programmers occasionally need to
run the tools to test changes to database designs, etc. But nobody wants to do it all the time. So my
suggestion is to automate the tools and incorporate the who and what into the game design, technical design,
and art bibles for the project. In that way there will be no misperception.
Let's talk about something else but related to assembly. In this particular case, an ASE file is an ASCII export
from 3D Studio MAX. How many of you have actually written a parser and have wondered where all your
processing time had gone? Did you use streaming file reads to load a line at a time, or a block read to read the
entire file into memory?
I personally write ASE parsers by loading the entire file into memory even when they are 20MB or larger in
size. The core ASE parser code included with this book can actually parse an entire 20MB file and convert
about 1.15 million floating-point values from ASCII to doubles in a few seconds. But here is where it really gets
interesting!
ASCII String to Double-Precision Float
Calling the standard C language function atof() to convert an ASCII floating-point value to single or double-
precision will add significant time onto your processing time for those large ASE files.
Hint Do not use the runtime C library atof()! Use the following functionality instead. By using this
single function, this book will pay for itself quickly in time savings — and that is what it is all
about!
But I have good news for you. The following function will carve those hours back to something a lot more
reasonable. What it does is take advantage of a little-known functionality within the floating-point unit of the
80x86 processor.
As discussed in Chapter 8, the FPU loads and handles the following data types:
(4-byte) single-precision floating-point
(8-byte) double-precision floating-point
(10-byte) double extended-precision floating-point
(10-byte) binary-coded decimal (BCD)
ASCII to Double
Note that the following code sample expects a normal floating-point number with no exponent notation. The
ASE files do not contain exponents, just really long ASCII floating-point numbers; thus the reason this code
traps for more than 18 digits.
Listing 15-2: …\chap15\ase2vmp\util.cpp
double exptbl[] = // -e
{
1.0, 0.1,
0.01, 0.001,
0.0001, 0.00001,
0.000001, 0.0000001,
0.00000001, 0.000000001,
0.0000000001, 0.00000000001,
0.000000000001, 0.0000000000001,
0.00000000000001, 0.000000000000001,
0.0000000000000001, 0.00000000000000001,
0.000000000000000001
}; // Limit 18 places
ASSERT_PTR(pStr);
*(((uint32*)bcd)+2) = 0; // 2 + 2 spare bytes
p = pStr;
if ('-' == *p)
{
*(bcd+9) = 0x80; // Set the negative bit into the BCD
p++;
}
// last to first digit
If you do not believe me about the speed, then replace all the atof() calls in your current tool with a macro that
assigns 0.0 and measure the difference in speed. Or better yet, embed the atof() function within this function
and then do a float comparison with a precision slop factor, since by now you should be very aware that you
never compare two floating-point numbers to each other to test for equivalence unless a precision slop factor
(accuracy) is utilized.
Tip One should always test optimized code (vector based or not) in conjunction with slow scalar
code written in C to ensure that the code is functioning as required.
One more thing: If you insist on using atof() or sscanf(), copy the ASCII number to a scratch buffer before
processing it with either of these two functions because processing them within a 20MB file dramatically
increases the processing time by hours. Apparently these conversion functions scan the string until they reach
the terminator, which in the case of an ASE file can be a few megabytes away instead of a few bytes.
Chapter 16: What CPUID?
Overview
There are multiple manufacturers all making different models of the 80x86 type microprocessors. Some are
variations of the Intel processors and are highly specialized, but most are not. They are clones of the Intel
processor family but with their own designs that require alternate optimization methods. Most of these
manufacturers have technical manuals usually available in a PDF format that can be downloaded from the
Internet and used for all your custom optimization needs. If the project you're coding for uses custom
hardware, then you are probably using a custom processor such as National Semiconductor's NS486SXF
under an operating system such as pSOS. When you are designing code for a specific processor, your code
can be highly optimized and tuned accordingly.
When the hardware you are writing code for is a little more generic, the programmer needs a method to
identify the exact model of processor that the code is running on. Each manufacturer has written a sample
CPU code detection algorithm that uses the CPUID instruction. This is great, but these code samples are not
exactly compatible with each other. Since it is ridiculous to write code that encapsulates all of these samples I
have written this chapter to help you. You can find all sorts of variations of the following program on the
Internet, but the following is designed to be expandable and versatile.
Most of these Intel processors are deviations of each other but if we take a closer look at their "family type" we
will note a pattern of 80(x)86, where the x represents a family number. A 3 would be the 80386, etc. So using
this family type number we can actually group the processor into a category of functionality, as each "group"
actually has its individual subset of instructions that it could execute.
Other manufacturers have second sourced various models of the 80x86 processor line. Intel and AMD are the
primary manufacturers, but other manufacturers have brought to market their modified or less expensive
versions of these same processors.
CPUID
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
CPUID
cpuid
This instruction uses the value stored in the EAX register as a function identifier and returns the related
requested information in the various associated registers.
With the release of the Pentium chip, Intel instituted the CPUID instruction, which gives detailed information of
the capabilities of the individual processor. This was also introduced into the re-release of the Intel 80486
processor. AMD has implemented it in all models since the Am486. This makes it easier to identify the
capabilities of the CPU being tested.
Before trying to use this instruction, bit #21 of the EFLAGS/RFLAGS must be tested to see if it is writable. If it
is, the CPUID instruction exists and therefore can be called. The application code uses mainly the
PUSHFD/PUSHFQ and POPFD/POPFQ instructions to manipulate the EFLAGS/RFLAGS register.
pushfd ; push EFLAGS register
pop eax ; pop those flags into EAX
xor eax,EFLAGS_ID ; flip ID bit#21 in EFLAGS
push eax ; push modified flags on stack
popfd ; pop flags back into EFLAGS
pushfd ; Push resulting EFLAGS on stack
289
pop ecx ; pop those flags into ECX
xor eax,ecx ; See if bit stayed flipped
jz $nope ; Jump if bit not flipped
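Once the ID bit test passes, fetching the vendor string is just a call to function #0; a minimal sketch (szVendor is a hypothetical 12-byte buffer):

xor  eax,eax                      ; CPUID function #0
cpuid                             ; EAX = highest standard function
mov  dword ptr [szVendor+0],ebx   ; e.g., "Genu"
mov  dword ptr [szVendor+4],edx   ; "ineI"
mov  dword ptr [szVendor+8],ecx   ; "ntel"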
At a very minimum, all CPUs that support the CPUID instruction support both functions #0 and #1.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
EAX=0 EAX = The highest CPUID function number this CPU can handle. The Intel Pentium
and 486 return a 1 in EAX. The Pentium Pro returns a 2 in EAX. The EBX, EDX,
ECX registers contain a text identifier.
ebx edx ecx
Amd = Auth, enti, cAMD
Centaur = Cent, aurH, auls
Cyrix = Cyri, xIns, tead
Intel = Genu, ineI, ntel
EAX=1 EAX = Version Information.
Bits 0...3 – Stepping ID
Bits 4...7 – Model
Bits 8...11 – Generation / family
Bits 12...15 – Reserved
Bits 16…19 – Extended model
Bits 20…27 – Extended family
Bits 28…31 – Reserved.
EBX =
Bits 0…7 – Brand Index
Bits 8…15 – CLFLUSH line size
Bits 16…23 – (Intel) # of logical processors
(AMD) Reserved
Bits 24…31 – Processor's initial local APIC ID
ECX = (Intel) Feature info. (AMD) Reserved
EDX= Feature info
Intel EAX=2 EAX, EBX, ECX, EDX = Cache and TLB information
Intel EAX=3 EAX, EBX, ECX, EDX = Reserved
Intel EAX=4 EAX =
Bits 0…4 – Cache type
Bits 5…7 – Cache level
Bit 8 – Self-initializing cache
Bit 9 – Fully associative cache
Bits 10…13 – Reserved
Function                             Returned Data

AMD, Cyrix, and WinChip              See the Intel – Standard CPUID ECX-Feature Flags section.
EAX=80000001h                        EAX = Processor signature
                                     EBX, ECX = Reserved
                                     See the AMD – Extended #1 CPUID EDX-Feature Flags section.

Intel                                Extended processor signature and extended feature bits.
EAX=80000001h                        See the Intel – Extended #1 CPUID EDX-Feature Flags section.

AMD, Cyrix, WinChip, and Intel       EAX, EBX, ECX, EDX = 4 * 4 * 3 = 48 byte text string
EAX=80000002h, 80000003h, 80000004h

AMD                                  TLB and L1 cache information
EAX=80000005h

Intel                                EAX, EBX, ECX, EDX = Reserved
EAX=80000005h
EAX=80000006h                        Bits 8…11 – Lines per tag
                                     Bits 12…15 – L2 Associativity
                                     Bits 16…31 – Number of 1K cache blocks
                                     EAX, EBX, EDX = Reserved

AMD                                  EDX = Advanced power management
EAX=80000007h                        EAX, EBX, ECX = Reserved

Intel                                EAX, EBX, ECX, EDX = Reserved
EAX=80000007h
CPUIDFLG_ Code Bit Flag Descriptions

AMD_EFLG Code Bit Flag Descriptions
test edx,CPUIDFLG_PSN ; Processor serial number supported?
jz $xit
push eax
mov eax,3 ; CPUID function #3 (processor serial number)
cpuid
pop eax
Sample CPU Detection Code
There are a lot of features in the CPUID, but most of them are not needed for what we are doing here. I have
documented some of what this instruction does (a lot more than what I normally need), but I strongly
recommend that if you are truly interested in this instruction that you download the manufacturer's technical
manuals.
Most programs being written these days are primarily written for a Protected Mode environment and so we
only need to deal with, at a minimum, the first processor capable of truly running in Protected Mode — the 386
processor. (The 80286 does not count!) This CPU detection algorithm detects the model, manufacturer, and
capabilities, and sets flags as such. As we really only deal with 32-bit modes in this book, we do not bother
detecting for an 8086, 80186, or an 80286. We do, however, detect for a 386 or above. In our algorithm we
use the following CPU IDs.
This instruction has been enhanced since I wrote Vector Game Math Processors as newer instructions have
been added to the processor. It has been used throughout the book, but let us examine it a bit closer.
; CPU Detect - definition IDs
CPU_386 = 3 ; 80386
CPU_486 = 4 ; 80486
CPU_PENTIUM = 5 ; P5 (Pentium)
CPU_PENTIUM_PRO = 6 ; Pentium Pro
CPU_PII = 6 ; PII
Prior to the Pentium processor, a computer system would optionally have a floating-point chip, which
contained a FPU. In the case of CPUs, no functionality is lost as one upgrades to a more advanced processor;
they are all downward compatible. This is not the case with the FPU. Some functionality was lost; so if writing
any floating-point instructions, you should know which FPU you are coding for. Some external FP chips did not
exactly match the processor but were compatible.
; Legacy CPUs and compatible FPU coprocessors
; CPU_086 NONE, FPU_087
; CPU_186 NONE, FPU_087
; CPU_286 NONE, FPU_287
; CPU_386 NONE, FPU_287, FPU_387
; CPU_486 NONE, FPU_387, FPU_487
The various manufacturers originally implemented the same functionality as Intel, but recently they have begun
to add their own extensions. Due to this, unions and intersections can be drawn, and so we use individual flags
to indicate CPU capability.
typedef enum
{
CPUBITS_FPU = 0x0001, // FPU flag
CPUBITS_MMX = 0x0002, // MMX flag
CPUBITS_3DNOW = 0x0004, // 3DNow! flag
CPUBITS_FXSR = 0x0008, // Fast FP Store
CPUBITS_SSE = 0x0010, // SSE
CPUBITS_SSE2 = 0x0020, // SSE (Ext 2)
CPUBITS_3DNOW_MMX = 0x0040, // 3DNow! (MMX Ext)
CPUBITS_3DNOW_EXT = 0x0080, // 3DNow! (Ext)
CPUBITS_3DNOW_SSE = 0x0100, // 3DNow! Professional
CPUBITS_HTT = 0x0200, // Hyperthreading Tech
CPUBITS_SSE3 = 0x0400, // Prescott NI
CPUBITS_EM64T = 0x0800, // EM64T supported
CPUBITS_AMD64 = 0x1000, // AMD Long Mode
} CPUBITS;
Each manufacturer has its own unique optimization methods and so we get a vendor name.
typedef enum
{
CPUVEN_UNKNOWN = 0, // Unknown
CPUVEN_INTEL = 1, // Intel
CPUVEN_AMD = 2, // AMD
CPUVEN_CYRIX = 3, // Cyrix
CPUVEN_CENTAUR = 4, // IDT Centaur (WinChip)
CPUVEN_NATIONAL = 5, // National Semiconductor
CPUVEN_UMC = 6, // UMC
CPUVEN_NEXGEN = 7, // NexGen
CPUVEN_RISE = 8, // Rise
CPUVEN_TRANSMETA = 9 // Transmeta
} CPUVEN;
We use the following data structure to reference the extracted CPU information.
nFpuId dd 0 ; Floating-point unit identifier
nBits dd 0 ; Feature bits
nMfg dd 0 ; Manufacturer
nProcCnt db 0 ; # of logical processors
pad db 0,0,0
CpuInfo ends
This book's CPU detection uses the following data structure for finding matching vendor information. Each
microprocessor that supports the CPUID instruction has encoded a 12-byte text string identifying the
manufacturer.
; Vendor Data Structure
VENDOR STRUCT 4
vname BYTE '------------'
Id DWORD CPUVEN_UNKNOWN
VENDOR ENDS
CpuInfo cinfo;
char szBuf[ CPU_SZBUF_MAX ];
This is an example of what gets filled into the ASCII buffer with a call to the function CpuInfoStr().
"CpuId:15 'INTEL' FPU MMX FXSR SSE SSE2 SSE3 HTT"
That took care of the initial detection code. Now comes the fun part — function mapping. Every function you
write should have a set of slower default code written in a high-level language such as C. This is really very
simple. First there are the private definitions:
void FmdSetup(const CpuInfo * const pcinfo);
void vmp_FMulGeneric(float * const pfD, float fA, float fB);
void vmp_FMulAsm3DNow(float * const pfD, float fA, float fB);
void vmp_FMulAsmSSE(float * const pfD, float fA, float fB);
// Division
typedef void (*vmp_FDivProc)(float * const pfD, float fA, float fB);
extern vmp_FDivProc vmp_FDiv;
extern vmp_FDivProc vmp_FDivFast;
void vmp_FMulGeneric(float * const pfD, float fA, float fB)
{
*pfD = fA * fB;
}
The initialization code assigns the appropriate processor-based function to the public function pointer:
void CpuSetup(const CpuInfo * const pcinfo)
{
ASSERT_PTR4(pcinfo);
vmp_FDivFast = vmp_FDivFastAsm3DNow; //***FAST***
}
else
{
vmp_FMul = vmp_FMulGeneric;
vmp_FDiv = vmp_FDivGeneric;
vmp_FDivFast = vmp_FDivGeneric;
}
}
You will probably need to play with the mapping until you get used to it. You could use case statements, function table lookups, or other methods, but due to the similarity of processor types I find that conditional branching with Boolean logic works best.
What is supplied should be thought of as a starting point. It should be included with most applications, even
those that do not use any custom assembly code, as it will compile a breakdown of the computer that ran the
application. With custom assembly code, it is the building block of writing cross processor code. There is one
more bit of "diagnostic" information that you can use — the processor speed. It can give you an idea of why
your application is not running well. (Sometimes processors do not run at their marked speed either through
misconfiguration or overheating.) This is discussed in Chapter 18, "System."
The listed information can be obtained by using the included function CpuDetect(); however, from your point of view, who manufactured the CPU is not nearly as important as which of the CPUBITS flags listed above are set! Each of those bits being set indicates the existence of the associated functionality. Your program would merely check the bit and select the correct set of code. If the processor sets the CPUBITS_3DNOW bit, then it would need to vector to the 3DNow!-based algorithm. If the CPUBITS_SSE bit is set, then it would vector to that set of code. Keep in mind that when I first started writing this book neither extension existed on the same CPU, but while I was writing it, AMD came out with 3DNow! Professional. This is a union of the two superset families (excluding SSE3), for which there is also a CPU bit definition. However, that can easily change in the future. My recommendation would be to rate their priority from highest to lowest performance in the initialization logic of your program, based upon your application's criteria.
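As a rough illustration of that priority ordering, here is a hedged C sketch; it reuses the vmp_FMul* names and the nBits field from the fragments above, but the setup function name and the exact C-side declarations are assumptions of mine:
typedef void (*vmp_FMulProc)(float * const pfD, float fA, float fB);
extern vmp_FMulProc vmp_FMul;                       /* public function pointer */
extern void vmp_FMulGeneric(float * const, float, float);
extern void vmp_FMulAsm3DNow(float * const, float, float);
extern void vmp_FMulAsmSSE(float * const, float, float);

/* Assign the fastest implementation the detected CPU supports.
   The highest-performance path is tested first, the C fallback last. */
void FMulSetup(const CpuInfo * const pcinfo)
{
    if (pcinfo->nBits & CPUBITS_SSE)
        vmp_FMul = vmp_FMulAsmSSE;        /* SSE path        */
    else if (pcinfo->nBits & CPUBITS_3DNOW)
        vmp_FMul = vmp_FMulAsm3DNow;      /* 3DNow! path     */
    else
        vmp_FMul = vmp_FMulGeneric;       /* plain C default */
}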
Chapter 17: PC I/O
Overview
All computers have to have some form of input or output in their execution. This will be one of the following:
Neither input nor output — A box that does not do anything!
Input only — Those status reports for your manager that never get read.
Output only — Always talking but never thinking!
Input and output — A (hopefully) useful computer
This chapter discusses one method of getting data into and out of the computer. There is usually a BIOS or a software library, known as the BSP (board support package), that encapsulates the hardware, but if you are programming an application that requires direct port access, read on! For example,
the communications API in Win32 does not give you time stamps or error status on each character received
and some communication protocols require it.
Cameras typically input to a computer using memory in graphic frame grabbers just as a graphic display
outputs using memory in graphic frame buffers, but the I/O (input/output) port is another method one can use
to communicate with a computer. Network cards, serial communications, computer mouse, display controller
on a graphics card, keyboard, etc., all use an input port, a "doorway," to get information into or out of a
computer.
For example, pressing a key to type this document caused the little computer chip in my keyboard to decode
the row and column of the keyboard matrix it was scanning, convert it to the corresponding value representing
that scan code, and send it out that little wire or infrared port from my keyboard to my computer. There it
arrived in an input port, causing an interrupt and letting my computer know there was something waiting to be
read. Interested? Well, read on!
IN
in AL/AX/EAX, #8 (Un)signed
in AL/AX/EAX, DX
This instruction inputs an 8-, 16-, or 32-bit value from the specified 8-bit port (0…255), or an 8-, 16-, or 32-bit value from the 16-bit port (0…65535) specified by the DX register. (There is no 64-bit form; port I/O is limited to 32 bits even in 64-bit mode.)
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
Vertical Sync
When working a legacy VGA graphics display there is sometimes a need to synchronize your code to the
VBlank (vertical blank). An example would be using DirectX under Win95. If we have animation running at a
frame rate in a dual frame environment, there is no clean way to wait for the VBlank on frames that do not
need to flip because it is too easy to miss them as DirectX only waits for the leading edge of the VBlank. Note
that in a multithreaded environment there is a chance that we could get preempted by another thread, causing
us to miss the VBlank. We could also be preempted just after detecting it, causing us to miss it entirely.
VGA_STATUS = 03dah ; VGA Status address
VSYNC_MASK = 00001000b ; VSync bit
mov edx,VGA_STATUS
mov ah,VSYNC_MASK
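For comparison, here is a minimal C sketch of the polling idea on Linux, assuming glibc's <sys/io.h> and a successful ioperm() call on port 3DAh; the function name is mine and the bit usage follows the constants above:
#include <sys/io.h>            /* inb(), ioperm(); x86 Linux, needs root */

#define VGA_STATUS  0x3DA      /* VGA Input Status #1 register */
#define VSYNC_MASK  0x08       /* bit 3 = vertical retrace     */

/* Wait for the leading edge of the next vertical blank.
   Assumes ioperm(VGA_STATUS, 1, 1) has already succeeded. */
static void WaitVBlank(void)
{
    while (inb(VGA_STATUS) & VSYNC_MASK)       /* if already in retrace,   */
        ;                                      /* wait for it to end first */
    while (!(inb(VGA_STATUS) & VSYNC_MASK))    /* then wait for the next   */
        ;                                      /* retrace to begin         */
}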
OUT
out #8, AL/AX/EAX (Un)signed
out DX, AL/AX/EAX
This instruction outputs an 8-, 16-, or 32-bit value to the specified 8-bit port (0…255), or an 8-, 16-, or 32-bit value to the 16-bit port (0…65535) specified by the DX register. (As with IN, there is no 64-bit form.)
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
The following two functions are snippets from the examples later in this chapter.
; Initialize the parallel port
;
; void PPortInit(void)
jz $S1 ; Jump if still has character
INSB
insb (Un)signed 8
insw 16
insd 32
ins byte ptr es:[edi], dx 8
ins word ptr es:[edi], dx 16
ins dword ptr es:[edi], dx 32
This instruction inputs an 8-, 16-, or 32-bit value from the specified 16-bit port (0…65535), copies it to the
destination string referenced by ES:[EDI], and advances the EDI register based upon the setting of the
Direction flag and the data size. It is similar to a STOSx instruction except that the data written (stored) to the string is read from the port instead of from the AL/AX/EAX register. This is effectively a block transfer from an input port, but it is really only efficient if trying to move more than 64 bytes.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
OUTSB
outsb (Un)signed 8
outsw 16
outsd 32
outs dx, byte ptr ds:[esi] 8
outs dx, word ptr ds:[esi] 16
outs dx, dword ptr ds:[esi] 32
This instruction outputs an 8-, 16-, or 32-bit value from the source string referenced by DS:[ESI], copies it to
the 16-bit port (0…65535), and advances the ESI register based upon the settings of the Direction flag and the
data size. It is similar to a LODSx instruction in that data is read (loaded) from a string, except that the data is written to the port instead of to the AL/AX/EAX register. This is effectively a block transfer to an output port, but it is really only efficient if trying to move more than 64 bytes.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
A UART (universal asynchronous receiver/transmitter) these days has a FIFO (first in, first out) buffer typically
16 bytes in length. You could send this in a variety of ways. If you know the FIFO is empty, you can have some
sort of unrolled loop:
i=0
REPEAT 16
mov al,[esi+i]
out dx,al
i=i+1
ENDM
add esi,16
…keeping in mind that a REP INS/OUTS needs a repeat count of 64 or more to be efficient. Because the FIFO is only 16 bytes, the unrolled loop is better here.
Both of these are excellent references for all the I/O information necessary for communicating directly with the
serial and parallel ports on your PC.
Parallel Port
Normally you do not need to access this port, but what if you have a specialized application that is not communicating with a printer but with something else, like a parallel communications device or a dongle (hasp)? What if your PC is really an embedded system that drives a display in a gazebo and as such has no mouse or standard keyboard, merely a couple of buttons for advancing through a slide presentation? Perhaps it just powers up and runs an advertisement, or it is a slot machine built on an NLX motherboard or some other embedded device, not necessarily a PC. What if it can be signaled through communications to play something else? What if this same device needs a watchdog to know it failed, or dip switches for some kind of addressing or feature selection? Maybe it drives an LCD (liquid crystal display). (Remember, there is no keyboard, or maybe it drives one directly in parallel.) Or maybe you are just itching to build some hardware to plug into your computer.
Datel used to manufacture a special parallel interface card for the PC. This was designed to connect via a
parallel cable to a Game Shark made by InterAct and allowed a PC to be able to connect to various consoles.
Hackers built software drivers and tools that allowed them to develop games for some of those consoles such
as the old Sony Playstation.
Parallel Port Dip Switches
Our little experiment lets us tinker with reading dip switches plugged into the parallel port. You can buy an
eight-bank dip switch at your local Radio Shack store (part #275-1301).
DB25P Description
2 D0
3 D1
4 D2
5 D3
6 D4
7 D5
8 D6
9 D7
18-25 GND
DB25P Description
1 ~Strobe
10 ~ACK
11 ~BUSY
12 PE
13 SLCT
14 ~AUTOFEED XT
15 ~ERROR
16 ~INIT
17 ~SLCT IN
First we need to initialize the port that we will consider LPT1, which uses a base address of 378h. We should
note that the typical parallel port is set up to be unidirectional and not bidirectional, which means that we
normally can only write to the data lines, not read from them. In our initialization function we set the bit, which
causes our eight data bits to tri-state, thus making them bidirectional.
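Since the book's PPortInit() listing is split across pages here, the following is a minimal Linux C sketch of the same idea, assuming glibc's <sys/io.h>; the use of control-register bit 5 (at base+2) to tri-state the data lines is my assumption about what the initialization sets:
#include <sys/io.h>        /* inb(), outb(), ioperm(); x86 Linux, root only */

#define PPORT_BASE  0x378                 /* LPT1 data register    */
#define PPORT_CTRL  (PPORT_BASE + 2)      /* LPT1 control register */

/* Tri-state the eight data lines so they can be read (bidirectional mode). */
static int PPortInitC(void)
{
    if (ioperm(PPORT_BASE, 3, 1) != 0)
        return -1;                                /* no port permission      */
    outb(inb(PPORT_CTRL) | 0x20, PPORT_CTRL);     /* set the direction bit 5 */
    return 0;
}

/* Read the dip switches; the XOR flips the inverted logic so on = 1. */
static unsigned int PPortReadC(void)
{
    return (unsigned int)inb(PPORT_BASE) ^ 0xFF;
}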
PPortInit();
n = PPortRead();
You will note that the logic is inverted: when a switch is off, leaving the circuit open, a logical 1 is produced, but if the switch is on, closing the circuit and shorting the input to ground, a logical 0 occurs. You can think of this as upside-down binary. To solve this, just insert the following:
PPortRead proc near
xor eax,eax ; eax=0
mov edx,PPort ; base address is data port
in al,dx ; get 8 bits from port
xor eax,000000ffh ; Flip switch bits
ret ; return value in EAX
PPortRead endp
The second XOR just before the RET will flip the switch bits so off=0 and on=1. This will help make life
simpler.
I need to mention that there is one little problem with the circuit. Although the specifications of the data port
consider the tri-state (float) condition to be a logical 1 until pulled to ground, some computers float with voltage
a little low so they are not considered a logical 1. And as such the switches are thought of as always closed.
The parallel port has no (+5v) source, so one of two things needs to be done: an external supply can be provided (battery, etc.) or one can be leeched off the other control signals, creating a pseudo 5-volt supply. Those signals
would obviously have to have their output signals in a high state. That voltage would then need to be
connected in series to a pull-up resistor attached to one side of each of the switch terminals, thus giving them
a logical high signal. The other switch terminal is connected to ground, so when the switch is closed it pulls the
signal low.
Serial Port
DOS had a BIOS function that allowed you to print a TTY message to the output device. But what if you are
running an embedded operating system like pSOS or running under Win32? The printf() and TTY functions do
not really go anywhere, especially if you are running a full-screen DirectX application. Well, you could always
do a kind of printf() out the serial port to a smart terminal. Did you know you could pick up old PS/2 computers
from places like Goodwill for $9-$50 still in running condition and looking in pretty good shape? I call them
throwaway computers. When running an old DOS version of ProComm, they work great. When they break,
you throw them away! The lap-link cable to connect your two computers will probably cost you more than the
computer. To reduce my desk footprint I use a very small 9" VGA monochrome monitor and a very small 83-
key keyboard. Unfortunately, they are not as cheap. I use a logging library that I recreate for every company I
work for. In it I do a single log write and it can optionally go to a disk file, a Notepad window, and/or a particular
serial port at a specified baud rate. In Win32 I use the CommAPI so it'll work in all the Win32 platforms, but the
following will only work in Win95, DOS, or Linux.
Of course, other operating systems, such as embedded systems that allow Real Mode or privileged access to the I/O ports and interrupts, will work as well. To simplify it I have hard-coded the base address and IRQ
information to COM1, but it is really a snap to do a table lookup during the initialization function to set things up
and save the actual information needed for the send function.
DB9S DB25S Description
1 8 CD (Carrier Detect)
5 7 GND
9 22 RI (Ring Indicator)
ComAdr dd 3f8h
ComIO db 0ch
ComMask db 0efh
; Disable all handshaking
mov ecx,[ebp+8]
mov edx,0
mov eax,1843200/16
div ecx
; Serial Out ah=char
Since the port number is hard-coded to simplify this example, only the baud rate can be set when the code is initialized. Thereafter, zero-delimited (null-terminated) character strings are written to that serial port through the call to the ComStr() function. In this elementary form, it is very simple.
Chapter 18: System
System "Lite"
There are other instructions available in your processor, but they have very little to no relationship to your
application code. As mentioned at the beginning of this book, there are basically three types of instructions.
(Note that I am oversimplifying here!) They are general-purpose, floating-point, and system instructions. The
existence of these latter instructions has to do with writing system-level (thus operating system) code. They are
not typically accessible or needed by those programmers writing non-operating system code. As this book is
not targeted for that market, there is no need to make the standard application programmer wade through it.
But as some of you may just cry foul, I have included a very light overview of these instructions. Besides, there
are some tidbits in here for all of you!
Chapter 3, "Processor Differential Insight," as well as Chapter 16, "What CPUID?" gave some background on
the processor. We shall now continue with that information. Some of what is included here is not necessarily
just for system programmers as some features of the 80x86 are system related but are accessible from the
application level. Note the System "Lite" part? Keep in mind that this is a superficial overview. If you need an
in-depth explanation, please refer to documentation direct from the manufacturer.
RDPMC
rdpmc
This instruction loads the 40-bit performance monitoring counter indexed by ECX into the EDX:EAX register
pair. For 64-bit mode, RDX[0…31]:RAX[0…31]=[RCX]. This instruction is accessible from any layer inclusive
of the application layer only if the PCE flag in CR4 is set. When the flag is clear, this instruction can only be run
from privilege level 0.
RDTSC
rdtsc
This system instruction reads the 64-bit time-stamp counter and loads the value into the EDX:EAX registers.
The counter is incremented every clock cycle and is cleared (reset) to zero upon the processor being reset.
This instruction is accessible from any layer inclusive of the application layer unless the TSD flag in CR4 is set.
So far while running under Win32 the flag has been clear as a default, thus allowing an application to access
this instruction.
; void CpuDelaySet(void)
public CpuDelaySet
CpuDelaySet proc near
rdtsc ; Read time-stamp counter
CpuDelaySet endp
; long int CpuDelayCalc(void)
;
; This function is called after CpuDelaySet() to get the
; elapsed interval in clock cycles.
;
; Note: On a 400MHz computer, only reading the lower 32 bits
; gives a maximum 10 second sampling before rollover.
public CpuDelayCalc
CpuDelayCalc proc near
rdtsc ; Read time-stamp counter
sub eax,tclkl
sbb edx,tclkh ; edx:eax = total elapsed interval
These two functions can be used for time trials while optimizing code. Due to multithreaded environments,
another thread or interrupt can steal your time slice while you are trying to do time analysis on a bit of code.
You could divide the number of loops into the total delay to get an average loop delay count. What I like to do
is run a benchmark of executing the same code a few thousand times, ignoring the effects the prefetch has on
these times or the fact the Nth time around the data is already sitting in memory. One time I took the governor
off an MPEG decoder so it would run full speed, allowing code to be optimized so that it would run faster and
faster.
SpeedData *sp = (SpeedData *)dwUser;
switch(sp->tSpeedState)
{
case 2: // 2nd tick (avg of the two intervals)
sp->tSpeed = (CpuDelayCalc() + sp->tSpeed) >> 1;
sp->nCnt++;
CpuDelaySet();
break;
default:
break;
}
}
// Be VERY careful when this is called, as your OS may not like it!
uint SpeedCalc(void)
{
TIMECAPS tc;
uint wTimerRes, nCnt;
SpeedData sd;
wTimerRes = 1;
wTimerRes = min(max(tc.wPeriodMin, 1), tc.wPeriodMax);
timeBeginPeriod(wTimerRes); // 1ms resolution
}
timeKillEvent(sd.wTimerID);
}
// Didn't work? Try it the really not-so-accurate way!
CpuDelaySet();
Sleep(10);
return CpuDelayCalc()/10000;
}
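The same delta measurement can be done from C with the compiler's __rdtsc() intrinsic (GCC/Clang provide it in <x86intrin.h>, MSVC in <intrin.h>); a small hedged sketch with names of my own choosing:
#include <x86intrin.h>        /* __rdtsc() */

static unsigned long long g_tclk;       /* saved time-stamp, like tclkh:tclkl */

/* Record the starting time-stamp. */
static void CpuDelaySetC(void)
{
    g_tclk = __rdtsc();
}

/* Return the elapsed interval in clock cycles. */
static unsigned long long CpuDelayCalcC(void)
{
    return __rdtsc() - g_tclk;
}
Dividing the elapsed count by the number of loop iterations gives the average cycles per iteration, as described above.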
80x86 Architecture
The Intel and AMD processors have similar functional architectures. Different processors have different numbers of caches (on-chip and off-chip), different speeds, different instruction sets, and different methods of pipelining instructions. All this book is interested in is helping you, the application programmer, make your
code go fast by writing it in assembly. You have no control over what flavor of processor the user of your
application chooses to run their applications on. (Of course you could program your application to check these
parameters and refuse to run on a system you do not like! But that would be evil!)
test ebx,ebx
mov ecx,ebx
mov esi,ebx
mov edi,ebx
test ebx,ebx
The use of full registers (such as in the above 32-bit code snippet in Protected Mode) allows instructions to be executed on the same clock.
Partial stalls occur if a short version of a register is written to and then immediately followed by a larger
version. For example:
mov al,9
add bx,ax ; clock stall
mov al,9
add ebx,eax ; clock stall
mov ax,9
add ebx,ax ; clock stall
A write to the AL register will cause a partial stall in the next instruction if that instruction uses a larger form of the register, such as AX, EAX, or RAX. This is like being at a red signal light in your car: when the light turns green you slam down on the accelerator; your car will sputter, spit a little, hesitate (stall), and then finally accelerate.
And in 64-bit mode the upper 32 bits of the RFLAGS register (0:EFLAGS):
32…63 RFLAG (extra) bits
Protection Rings
The 386 and above have layers of protection referred to as protection rings.
The inner ring #0 contains the operating system kernel. The two middle rings (#1 and #2) contain the operating
system services (device drivers), and the outer ring #3 is where the application (user code) resides. The ring
numbers are also referred to as privilege levels with 0 being the highest and 3 being the lowest.
An application can access functions in the other rings by means of a gate. The SYSCALL and SYSENTER
functions are two methods. This is a protection system to protect the inner rings from the outer. You know, to
keep the riffraff out! Any attempt to access an inner ring without going through a gate will cause a general
protection fault.
Control Registers
There are four control registers {CR0, CR2, CR3, CR4} that control system level operations. Note that CR1 is
reserved.
Table 18-1: Control register 0 (CR0) extensions
And in 64-bit mode the upper 32 bits of the CR0 register (0:CR0):
32…63 CR0 (extra) bits
Control register 3 (CR3) extensions:
Reserved 0…2
CR3_PWT 000000008h 3 Page Writes Transparent
CR3_PCD 000000010h 4 Page Cache Disable
Page Dir. Base 12…31
And in 64-bit mode the upper 32 bits of the CR3 register (0:CR3):
32…63 CR3 (extra) bits
Table 18-3: Control register 4 (CR4) extensions
And in 64-bit mode the upper 32 bits of the CR4 register (0:CR4):
32…63 CR4 (extra) bits
Table 18-4: Control register 8 (CR8) extensions. This is new for EM64T.
Debug Registers
There are eight debug registers: {DR0, DR1, DR2, DR3, DR4, DR5, DR6, DR7}. Knowing them is unimportant
as you are most likely using a debugger to develop your application, not building a debugger. These are
privileged resources and only accessible at the system level to set up and monitor the breakpoints {0…3}.
Cache Manipulation
Several mechanisms have been put into place to squeeze optimal throughput from the processors. One
method of cache manipulation discussed in Chapter 10, "Branching," is Intel's hint as to the prediction of logic
flow through branches counter to the static prediction logic. Another mechanism is a hint to the processor
about cache behavior so as to give the processor insight into how a particular piece of code is utilizing memory
access. Here is a brief review of some terms that have already been discussed:
Temporal data — Memory that requires multiple accesses and therefore needs to be loaded into a cache
for better throughput.
Non-temporal hint — A hint (an indicator) to the processor that memory only requires a single access (one
shot). This would be similar to copying a block of memory or performing a calculation, but the result is not
going to be needed for a while so there is no need to write it into the cache. Thus, the memory access has
no need to read and load cache, and therefore the code can be faster.
For speed and efficiency, when memory is accessed for read or write a cache line containing that data (whose
length is dependent upon manufacturer and version) is copied from system memory to high-speed cache
memory. The processor then performs its read/write operations on the cache memory. When a modified (dirty) cache line is invalidated or evicted, it is written back to system memory; this second stage is called a "write back." In a multiprocessor system this occurs frequently because the processors do not share their internal caches.
Cache Sizes
Different processors have different cache sizes for data and for code. These are dependent upon processor
model, manufacturer, etc., as shown below:
CPU L1 Cache (Data/Code) L2 Cache
Depending on your code and level of optimization, the size of the cache may be of importance. For the
purposes of this book, however, it is being ignored, as that topic is more suitable for a book very specifically
targeting heavy-duty optimization. This book, however, is interested in the cache line size as that is more along
the lightweight optimization that has been touched on from time to time. It should be noted that AMD uses a
minimum size of 32 bytes.
Intel Cache Line Size
PIII 32
Pentium M 64
P4 64
Xeon 64
Athlon 64
Opteron 64
The cache line size can be obtained by using the CPUID instruction with EAX set to 1. The following
calculation will give you the actual cache line size.
mov eax,1
cpuid
and ebx,00000FF00h
shr ebx,8-3 ; ebx = size of cache line
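The same calculation expressed in C, using GCC's <cpuid.h> helper (bits 15:8 of EBX give the line size in 8-byte units); the function name is mine:
#include <cpuid.h>

/* Returns the cache line (CLFLUSH) size in bytes, or 0 if CPUID leaf 1 fails. */
static unsigned int CacheLineSize(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return ((ebx >> 8) & 0xFF) * 8;     /* same as the AND/SHR sequence above */
}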
PREFETCH
PREFETCHNTA
PREFETCHT0
PREFETCHT1
PREFETCHT2
PREFETCHW
3DNow! prefetch mSrc8
prefetchw mSrc8
SSE prefetcht0 mSrc8
prefetcht1 mSrc8
prefetcht2 mSrc8
prefetchnta mSrc8
The PREFETCHNTA instruction performs a non-temporal hint to the processor with respect to all the caches,
to load from system memory mSrc8 into the first-level cache for a PIII or a second-level cache for a P4 or
Xeon processor.
The PREFETCHT0 instruction performs a temporal hint to the processor to load from system memory mSrc8
into the first- or second-level cache for a PIII, or a second-level cache for a P4 or Xeon processor.
The PREFETCHT1 instruction performs a temporal hint to the processor with respect to the first-level cache to
load from system memory mSrc8 into the second-level cache for PIII, P4, or Xeon processor.
The PREFETCHT2 instruction performs a temporal hint to the processor with respect to the second-level
cache to load from system memory mSrc8 into the first-level cache for PIII or the second-level cache for P4 or
Xeon processor.
If data is already loaded at the same or higher cache, then no operation is performed.
AMD processors alias PREFETCHT1 and PREFETCHT2 instructions to the PREFETCHT0 instructions, so
they all have the PREFETCHT0 functionality.
The 3DNow! PREFETCH instruction loads a cache line into the L1 data cache from the mSrc8.
The 3DNow! PREFETCHW instruction loads a cache line into the L1 data cache from the mSrc8 but sets a
hint indicating that it is for write operations.
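For C programmers, the SSE prefetch forms are exposed as the _mm_prefetch() intrinsic in <xmmintrin.h>. The loop below is only an illustrative sketch; the distance of 256 bytes ahead is a guess that would need tuning for a real workload:
#include <xmmintrin.h>     /* _mm_prefetch(), _MM_HINT_* */

/* Sum an array while hinting the processor to start fetching data
   a few cache lines ahead of the current read position. */
static float SumPrefetch(const float *pf, unsigned int nCnt)
{
    float fSum = 0.0f;
    unsigned int i;

    for (i = 0; i < nCnt; i++)
    {
        if ((i & 15) == 0)   /* once per 64 bytes of floats */
            _mm_prefetch((const char *)(pf + i) + 256, _MM_HINT_NTA);
        fSum += pf[i];
    }
    return fSum;             /* prefetching past the end is harmless; it is only a hint */
}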
LFENCE
lfence
This instruction is similar to the MFENCE instruction, but it acts as a barrier only between memory load instructions issued before and after the LFENCE instruction.
SFENCE
sfence
This instruction is similar to the MFENCE instruction, but it acts as a barrier only between memory store instructions issued before and after the SFENCE instruction.
MFENCE
mfence
This instruction is a barrier (fence) to isolate system memory to and from cache memory operations that occur
before and after this instruction.
CLFLUSH
clflush mSrc8
This instruction invalidates the cache line (code or data) containing the linear address specified by mSrc8. If the line is dirty (that is, it has been modified and differs from system memory), it is written back to system memory before being invalidated. This instruction is ordered by the MFENCE instruction. Check CPUID bit #19 (CLFSH) to see if this instruction is available.
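From C, CLFLUSH and MFENCE are available as the _mm_clflush() and _mm_mfence() intrinsics in <emmintrin.h>. A minimal sketch, assuming a 64-byte line size and a line-aligned buffer:
#include <emmintrin.h>     /* _mm_clflush(), _mm_mfence() */

/* Flush a buffer out of the caches, for example before handing it to a
   device that reads system memory directly. */
static void FlushBuffer(const void *p, unsigned int nBytes)
{
    const char *pc = (const char *)p;
    unsigned int i;

    for (i = 0; i < nBytes; i += 64)    /* assumes 64-byte lines, aligned start  */
        _mm_clflush(pc + i);
    _mm_mfence();                       /* order the flushes with later accesses */
}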
INVD — Invalidate Cache (WO/Writeback)
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
INVD
invd
This instruction invalidates the internal caches without waiting for write back of modified cache lines and
initiates bus cycles for external caches to flush. This is similar to WBINVD but without a write back.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
WBINVD
wbinvd
This instruction writes back all the modified cache lines, then invalidates the internal caches, and initiates bus cycles for external caches to flush. This is similar to INVD but with a write back.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
System Instructions
The full scope of the system instructions is not covered in this book; refer to the Intel- and AMD-specific documentation for complete specifications. They are considered OS/system instructions and as such are not discussed in depth here. Some are accessible from the application layer at a low privilege level but are not part of the general application development process. They are referenced only for informational purposes and to ensure this book lists all instructions available at the time of its publication.
ARPL 32 32
arpl rmDst16, rSrc16
This system instruction adjusts the RPL (Requested Privilege Level) field of the destination selector rmDst by comparing it with that of rSrc. If the RPL of rmDst is less (more privileged) than the RPL of rSrc, the RPL of rmDst is raised to match rSrc and the Zero flag is set; otherwise the Zero flag is cleared (reset). This instruction can be accessed by an application.
No The ARPL system instruction is not supported in 64-bit mode!
64-bit
BOUND 32 32
bound rSrcA16, mSrcB16 16
bound rSrcA32, mSrcB32 32
This system instruction checks if the array index rSrcA is within the bounds of the array specified by mSrcB. A #BR (Bound Range) exception is triggered if the index falls outside those bounds.
No The BOUND system instruction is not available in 64-bit mode.
64-bit
CLTS — Clear Task Switch Flag
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
CLTS
clts
This system instruction clears the task switch flag, bit #3 of CR0 (CR0_TS). This flag is set every time a task switch occurs, and this instruction is used to clear it. It is used in conjunction with synchronizing task switches with the FPU.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
HLT
hlt
This is a system instruction that stops the processor and puts it into a halt state.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
UD2
ud2
UD2 is an undefined instruction and is guaranteed to throw an invalid-opcode exception (#UD) in all modes.
INVLPG
invlpg mSrc
This instruction invalidates the TLB (Translation Lookaside Buffer) page referenced by mSrc.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
LAR
lar rDst16, rmSrc16
lar rDst32, rmSrc32
lar rDst64, rmSrc64
This system instruction copies the access rights from the segment descriptor referenced by the source rmSrc, stores them in the destination rDst, and sets the Zero flag if the access rights were loaded successfully (otherwise the Zero flag is cleared). This instruction can only be called from Protected Mode.
LOCK — Assert Lock # Signal Prefix
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
LOCK
lock
This system instruction is a code prefix that turns the instruction that follows it into an atomic instruction. In a multiprocessor environment it ensures that the processor using the lock has exclusive access to memory shared with the other processors.
This prefix can only be used with the following instructions, and only when they are performing a write operation to memory: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XADD, XCHG.
This instruction works best with a read-modify-write operation such as the BTS instruction.
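In C you rarely write the LOCK prefix by hand; compiler builtins generate it for you. A brief sketch using GCC's __sync builtins, which compile down to a LOCK-prefixed read-modify-write:
/* Atomically set bit nBit in a flags word shared between processors.
   GCC emits a LOCK-prefixed OR (or BTS) for this builtin. */
static unsigned int g_nFlags;

static void SetSharedFlag(unsigned int nBit)
{
    __sync_fetch_and_or(&g_nFlags, 1u << nBit);
}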
LSL
lsl rDst16, rmSrc16
lsl rDst32, rmSrc32
lsl rDst64, rmSrc64
This system instruction loads the segment limit from the segment descriptor referenced by the source rmSrc into the destination rDst and sets the Zero flag if the load succeeded.
MOV CR
mov cr{0…4}, 32
r32
mov r32,
cr{0…4}
This system instruction copies data from a control register to a general-purpose register or from a general-purpose register to a control register.
No App (Sys Only): This is only accessible from Privilege Level 0.
MOV DR
mov r32, dr{0…7} 32
mov dr{0…7}, r32
This system instruction copies data from a debug register to a general-purpose register or from a general-purpose register to a debug register.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
STMXCSR — Save MXCSR Register State
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
STMXCSR
stmxcsr mDst32
This system instruction saves the MXCSR control and status register to the destination mDst32. The
complement to this instruction is LDMXCSR.
LDMXCSR
ldmxcsr mSrc32
This system instruction loads the MXCSR control and status register from the source mSrc32. The
complement of this instruction is STMXCSR.
SGDT
SIDT
sgdt m
sidt m
The SGDT system instruction copies the Global Descriptor Table Register (GDTR) to the destination. The
complement of this instruction is LGDT.
The SIDT system instruction copies the Interrupt Descriptor Table Register (IDTR) to the destination. The
complement of this instruction is LIDT.
LGDT
LIDT
lgdt mSrc(16:32/64)
lidt mSrc(16:32/64)
The LGDT system instruction loads the source mSrc16 into the Global Descriptor Table Register (GDTR).
The LIDT system instruction loads the source mSrc16 into the Interrupt Descriptor Table Register (IDTR).
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
SLDT
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
sldt rmDst16
This system instruction copies the segment selector from the Local Descriptor Table Register (LDTR) to the
destination rmDst16.
LLDT
lldt rmSrc16
This system instruction loads the source rmSrc16 into the segment selector element of the Local Descriptor
Table Register (LDTR). This instruction is only available in Protected Mode.
SMSW
smsw rmDst16
This system instruction copies the lower 16 bits of control register CR0 into the destination rmDst16.
LMSW
lmsw rmSrc16
This system instruction loads the lower four bits of the source rmSrc16 and overwrites the lower four bits of the
control register CR0.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
STR
str rmDst16
This system instruction reads the task register and saves the segment selector value into the 16-bit destination
rmDst16. When the 32-bit form of the register is used, its upper 16 bits are cleared to zero.
str ax ; actually stores 0000:AX into EAX
LTR
ltr rmSrc16
This system instruction sets the task register with the segment selector stored in the 16-bit source rmSrc16.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
RDMSR
rdmsr
This is a system instruction that may only be run in Privilege Level 0. The Model Specific Register (MSR)
indexed by ECX is loaded into the EDX:EAX register pair.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
WRMSR
wrmsr
This system instruction writes the 64-bit value in EDX:EAX to the Model Specific Register specified by the ECX register. In 64-bit mode the lower 32 bits of each 64-bit register, RDX[0…31]:RAX[0…31], form the 64-bit value that is written to the MSR specified by the RCX register.
MSR[ecx] = edx:eax
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
SWAPGS 64 64
swapgs
This system instruction swaps the GS register value with the value in the MSR address C0000102H.
No App (Sys Only): This is only accessible from Real Mode or Privilege Level 0.
SYSCALL 64
syscall
This instruction is a fast 64-bit system call to privilege level 0. It allows code at the lower privilege levels to call
code within Privilege Level 0.
SYSRET 64
sysret
This instruction is a return from a fast 64-bit system call. It is a complement to SYSCALL.
SYSENTER
sysenter
This instruction is a fast system call to Privilege Level 0. It allows code at the lower privilege levels to call code
within Privilege Level 0.
SYSEXIT
sysexit
This instruction is a return from a fast system call; it is the complement to SYSENTER.
RSM
rsm
This system instruction returns control from the System Management Mode (SMM) back to the operating
system or the application that was interrupted by the SMM interrupt.
VERR
VERW
verr rm16
verw rm16
mov ax,cs
verr ax
verw ax
These instructions verify whether the segment referenced by the specified selector (such as CS, DS, ES, FS, or GS) is readable (VERR) or writeable (VERW), setting the Zero flag to 1 if it is and resetting (clearing) the Zero flag if it is not. Code segments are never verified as writeable. The stack segment selector (SS) is not an allowed register. These instructions are not available in Real Mode.
LDS
LES
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
LFS
LGS
LSS
lds r32Dst, mSrc(16:32)   Protected Mode   48
lds r16Dst, mSrc(16:16)   Real Mode   32
les r32Dst, mSrc(16:32)   Protected Mode   48
les r16Dst, mSrc(16:16)   Real Mode   32
lfs r64Dst, mSrc(16:64)   64-bit Mode   80
lfs r32Dst, mSrc(16:32)   64-bit, Protected Mode   48
lfs r16Dst, mSrc(16:16)   64-bit, Real Mode   32
lgs r64Dst, mSrc(16:64)   64-bit Mode   80
lgs r32Dst, mSrc(16:32)   64-bit, Protected Mode   48
lgs r16Dst, mSrc(16:16)   64-bit, Real Mode   32
lss r64Dst, mSrc(16:64)   64-bit Mode   80
lss r32Dst, mSrc(16:32)   64-bit, Protected Mode   48
lss r16Dst, mSrc(16:16)   64-bit, Real Mode   32
This is a special memory pointer instruction that loads a full far pointer (segment selector and offset) from memory into the specified segment register and general-purpose register. The form you use is determined by the (64-bit/Protected/Real) mode your code is for.
Flags O.flow Sign Zero Aux Parity Carry
- - - - - -
Protected Mode Win95 programmers do not normally need to get at the VGA, but if you have an old monochrome adapter plugged into your system, this comes in handy using Microsoft's secret (unpublished) selector {013fh}, which gets you access to every linear address on your machine {013fh:00000000…0ffffffffh}. This became the data selector for Win95B, but it is a bounds error for Win32 and Win64 developers.
monoadr dd 0b0000h
monosel dw 013fh
mov edi,monoadr
mov es,monosel
or
les edi,FWORD PTR monoadr
That was fine and dandy, but the following is a quicker method even though it takes a little organization and is
very easy to make a mistake due to its length.
monobase FWORD 013f000b0000h
les edi,monobase
The declaration has too many zeros and is a lil' too darn long, don't you think! It almost looks like binary.
Loading that address into the pointer is very quick, but trying to save the pointer back to the address isn't so
slick and it seems a little murky to me.
mov DWORD PTR monobase,edi
mov WORD PTR monobase+4,es
Now, doesn't that look much cleaner? Assembly coding can get convoluted enough without creating one's own
confusion. Now for those Real Mode programmers, a touch of VGA nostalgia:
vgaseg dw 0a000h
mov di,0
mov es,vgaseg
The following code snippet is similar to the previous 32-bit version but scaled down for 16-bit. Using the same
techniques:
; Real Mode address (Far)
RMADR STRUC
off dw ? ; Real Mode Offset
rseg dw ? ; Real Mode Segment
RMADR ends
les di,vgabase
Hyperthreading Instructions
The scope of hyperthreading instructions is not covered in this book. Refer to the Intel-specific documentation
for full specifications. They are considered OS/System instructions and as such will not be discussed in this
book. They are accessible by the application layer at the low privilege level but are not part of the general
application development process. They are only referenced here for informational purposes.
MONITOR — Monitor
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
MONITOR
monitor
This system instruction sets up a hardware monitor using an address stored in the EAX register and arms the monitor. Registers ECX and EDX contain information to be sent to the monitor. This is accessible at any privilege level, provided the MONITOR flag returned by CPUID is set, indicating that the processor supports this instruction. This instruction is used in conjunction with the instruction MWAIT.
MWAIT — Wait
Mnemonic P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
MWAIT
mwait
This system instruction is similar to a NOP but works in conjunction with the MONITOR instruction for signaling
to a hardware monitor. It is a hint to the processor that it is okay to stop instruction execution until a monitor
related event. MONITOR can be used by itself, but if MWAIT is used, only one MWAIT instruction follows a
MONITOR instruction (especially in a loop).
Chapter 19: Gfx 'R' Asm
Overview
This is one of two chapters that are probably the reason you bought this book. You are probably working with a
graphics library and then ran into a little bind in which there did not seem to be a software library function
available with that special needed functionality. Your schedules are sliding, nerves are wrecked, and paranoia
is beginning to set in as you believe that your project lead or manager is beginning to doubt your abilities and
that pink slip is only about a week away. Out of desperation you have escaped to your refuge, that favorite
technical bookstore that has rescued you so many times in the past. You find lots of graphics books but not
what you are looking for. There are a few assembly books that you have seen over the years, but they have
always been targeted for beginners, but you have bought them anyway for your personal library as another
resource book. (I have always been impressed with someone who has multiple large bookcases with
multitudes of dog-eared books in their office. There is a fine art to making brand new books look well used.)
And then you see this book. You flip through it and this chapter catches your gleaming eye and then you
whisper in euphoria to yourself, "This book will save my butt!" At that point you look around and see everyone
in the bookstore staring at you as you skulk toward the sales clerk.
Of course, there are other code samples in my Vector Game Math Processors book so do not forget to buy
that book as well.
For those of you C programmers out there, this chapter is very similar to those heavily used functions
memset() and memcpy(). They are used in almost every application for a large variety of purposes, but their
behavior typically is not that useful or fast enough in the clearing or blitting of graphic images. Some of you are
probably thinking "Why isn't this guy using the hardware blitter on a graphics card?" Well, in some cases, the
blitter is hidden from you in the bowels of drivers such as Direct Draw but that's all it is, a blitter — a hardware
device designed to move video card memory to video card memory and you only have 64 to 256 MB to play
with. Okay, okay, that is much better than a couple years ago when you only had 2 to 8 MB. What we are
trying to do here is learn the optimal method of moving memory around the computer system and from system
memory to video memory. Also, just where did those images come from? Whose file format, and what
compression type? How did they get loaded? Where do you get the driver, etc? There is also more to life than
displaying video games! What about streaming media such as MPEG-4 and DivX, video analysis, scientific
research, speech recognition, stereoscopic vision, etc.? The list is endless and new reasons are being
invented all the time. Now that I am off my soapbox we can continue.
Setting Memory
If you happen to be constructing 8-bit images, then the memset() function can work pretty well for you as the
transparency value can be anywhere from 0 to 255. If working in 16-, 24-, or 32-bit colors this function is only
useful if the transparency color that you are trying to set just happens to be 0; if it is any other value, you have
a serious problem. You have put together a C function to do the task but even though this function is not called
that often, its speed is not up to your needs.
Your older 32-bit C libraries typically have the following library function for clearing a block of memory:
; void *memset(void *pMem, int val, uint nCnt)
;
; Note: if nCnt is set to 0 then no bytes will be set!
public memset
memset proc near
push ebp
mov ebp,esp
push edi
mov eax,[ebp+arg2] ; val
$L0: stosb
loop $L0
That code is relatively small but pretty inefficient as it is using the repeating string function to write a series of
8-bit bytes. The payoff on Pentium processors only comes with a repeat of 64 or more.
rep stosb
With a repeat factor of less than 64, use the following. Note that when using ES:[EDI], the ES: is the default and so we do not really need to put it in the code.
test ecx,ecx
jz $xit ; jump if len of 0
An alternate method that is a lot more efficient than those listed above is to divide our total number of bytes
into the number of 4-byte blocks, then loop on that, not forgetting to handle the remainders.
test ecx,ecx
jz $xit ; jump if len of 0
mov ah,al
mov edx,eax
shl eax,16
mov ax,dx ; eax=replicated byte ×4
; edx = # of 32-bit writes
; 1 to 3 bytes to set
$L4:
There are more sophisticated methods that you can employ but this is a good start.
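Expressed in C, the divide-into-dwords idea looks roughly like the sketch below (this is not the book's listing; the unaligned 32-bit stores are tolerated by the 80x86, as discussed next):
#include <stddef.h>

/* Fill nCnt bytes at pMem with val, writing 32 bits at a time and
   mopping up the 1..3 leftover bytes afterward. */
static void *MemSet32(void *pMem, int val, size_t nCnt)
{
    unsigned char *p  = (unsigned char *)pMem;
    unsigned int   v  = (unsigned char)val;
    unsigned int   v4 = v | (v << 8) | (v << 16) | (v << 24);   /* byte replicated x4 */
    size_t         nDwords = nCnt >> 2;

    while (nDwords--)
    {
        *(unsigned int *)p = v4;      /* one 32-bit write */
        p += 4;
    }
    nCnt &= 3;                        /* 0..3 remainder bytes */
    while (nCnt--)
        *p++ = (unsigned char)v;
    return pMem;
}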
For optimal performance all data reads and writes must be on a 32-bit boundary. In a copy situation, if the
source and destination are misaligned, there is not much that can be done about it. But in the case of setting
memory that is misaligned, it is a snap to fix it.
Figure 19-1: Imagine these four differently aligned memory strands as eels. We pull out our sushi knife and
finely chop off their heads into little 8-bit (1-byte) chunks, chop off the tails into 8-bit (1-byte) chunks, and then
coarsely chop the bodies into larger 32-bit (4-byte) chunks, and serve raw.
The first three memory strands had their heads misaligned, but the fourth was aligned properly. On the other
hand, the tails of the last three were misaligned and the first one was aligned properly. Now that they're sliced
and diced, their midsections are all properly aligned for best data handling.
The latest C runtime libraries use something a lot more elaborate such as the following function:
; void *memset(void *pMem, int val, uint nCnt)
;
; Note: if nCnt is set to 0 then no bytes will be set!
public memset
memset proc near
$BSHFT = 2 ; Shift count
$BCNT = 4 ; Byte count
push ebp
mov ebp,esp
; Unlike the code above, flow does not have to fall through
; if a size of 0 was passed, and so we need to test for it.
; The lines are adjusted to help prevent a stall.
; Older programmers will say hey, why didn't you 'OR ecx,ecx'
; but this is a read/write function that will cost you
; time for the write. The 'TEST ecx,ecx' is a read only!
test ecx,ecx
push ebx
jz $Xit ; jump if size is 0
mov edx,edi
and edx,($BCNT-1)
jz $SetD
inc edi
inc edx
jne $Lead ; Loop for those {1..3} bytes
mov edx,ecx
shr edx,$BSHFT ; (n÷4) = # of 32-bit writes
and ecx,($BCNT-1)
jz $Xit ; jump if no remainder bytes
As you can see, that simple memory set function became a lot bigger, but its execution speed became a lot
quicker. With very short lengths of bytes to set, such as sizes of fewer than four bytes, this code is actually
slower but it quickly gains in speed as the memory lengths increase in size, especially if aligned on 4-byte
boundaries. For extra efficiency on a size of 256 bytes or more, using the STOSD instruction would be best.
Note You should use the string functions such as STOSD only if the repeat factor is 64 or more.
These numbers aren't exactly right as this function has not been tuned for its optimal timing yet, but I leave
that to you. Besides, what would be the fun in it if I gave you all the answers? As versatile as the MMX
instruction set is, the linear setting or copying of memory is no more efficient than the integer instructions. In
fact, a STOSD/MOVSD string set/copy with a repeat of 64 or more is actually faster than the equivalent MMX
instructions on legacy processors. This would also leave the XMM register for math related solutions. It turns
out that we are actually pumping data very close to or at the bus speed. For experimental purposes and to
have some MMX practice, one alternative would be the use of the MMX instruction MOVQ in the $SetD
section of the code so eight bytes would be written at one time.
mov edx,ecx
shr edx,$BSHFT ; (n÷8) = # of 64-bit writes
And call EMMS at the appropriate time, but only if your CPU thread has floating-point operations to handle:
emms
Note I recommend the use of the ZeroMemory() function instead. It saves passing an extra
argument value of 0, or the time to replicate the single byte to four bytes.
Copying Memory
A few years ago I was working on a project that was required to run on a 386 processor but typically ran on a
486 and had this little squirrely problem. One of the in-house computer systems that I tested the application on
ran the code extremely slowly. I spent quite a while on it and when doing some benchmark testing to isolate
the problem I found that the memory copy algorithm, which was used to blit graphical sprites onto the screen,
was the culprit. Sprites could appear on screen with any kind of data alignment as they moved horizontally
across the screen. Upon deeper investigation I found that this computer system was running DOS like all the
others but in this particular case, it was running on an AMD 386SX processor. AMD usually has pretty good
processors but I was intrigued and so I ordered and received their AM386 data book unique to that model
processor. Upon reading the book I found out to my horror that this processor had a little zinger. As it is a 32-
bit processor with a 16-bit bus, if your source and destination pointers are not properly aligned, then a single
32-bit memory access has an additional eight clock penalty for that misaligned access. And so we come to my
next rule.
Hint Write your assembly to be CPU model and manufacturer specific!
That little problem required the need to detect not only the exact manufacturer but also the model of processor
and must route function calls to special code to handle each. In most cases the code could be shared, but
some isolated instances required the special code. The following is an older style of the C function memcpy().
; void *memcpy(void *pDst, const void *pSrc, uint nSize)
;
; Note: if nSize is set to 0 then no bytes will be copied!
public memcpy
memcpy proc near
push ebp
mov ebp,esp
push esi
push edi
The following code is relatively small but pretty inefficient as it is using the repeating string function to write a
series of 8-bit bytes. The payoff on a Pentium only comes with a repeat of 64 or more.
rep movsb
With a repeat factor of less than 64, use the following. Note that we do not need to put in the DS: or the ES:, as the default for the ESI source register is DS, and the default for the EDI destination register is ES.
$L1: mov al,[esi] ; al,ds:[esi]
mov [edi],al ; es:[edi],al
inc esi
inc edi
dec ecx
jne $L1
In the above example we actually get a dependency penalty: we set the AL register but have to wait before we can execute the next instruction. If we adjust the function as follows, we no longer have that problem. You will note that the "inc esi" line was moved up to separate the write to AL from the instruction that reads AL.
$L1: mov al,ds:[esi]
inc esi ; removes dependency penalty
mov es:[edi],al
inc edi
dec ecx
jne $L1
Another method that is a lot more efficient than those listed above uses the same techniques we learned for
setting memory. We divide our total number of bytes into the number of 4-byte blocks, then loop on that, not
forgetting to handle the remainders. We handle the dependency penalty at $L1 in the same way.
mov edx,ecx ; Get # of bytes to set
shr edx,2 ; n = n ÷ 4
jz $L2 ; Jump if 1..3 bytes
; DWORDS (uint32)
; Remainders
; 1 to 3 bytes to set
$L4:
The following method is significantly faster as it moves eight bytes at a time instead of four. There is no
dependency penalty since the register being set is not being used immediately.
mov ecx,[ebp+arg3] ; nSize
shr ecx,3 ; n = n ÷ 8
jz $L2 ; Jump if 1..7 bytes
; QWORDS (uint64)
; Remainders
; 1 to 7 bytes to set
$L4:
This code is just about as fast as a copy using MMX. An example would be to replace $L1 with the following
code:
$L1: movq mm7,[esi] ; read 64 bits
add esi,8
movq [edi],mm7 ; write 64 bits
add edi,8
dec ecx
jne $L1 ; Loop for QWORDS
There are more sophisticated methods that you can employ, but this is a good start.
It is important for memory to be aligned, as a penalty occurs when the source and/or destination are misaligned. Memory movement (copy) functions should try to reorient the source and destination pointers. You are lucky if the source and destination are either both properly aligned or misaligned by exactly the same amount:
If ((pSrc AND 00000000111b) == (pDst AND 00000000111b))
…then you can adjust them together. If their AND'ed values are both 0, no adjustment is needed. If the misalignment is the same, adjust by single bytes until the pointers reach an aligned position. If both are out of alignment by different amounts, you can still obtain a speed increase by putting at least one of them into alignment (preferably the destination):
mov edx,edi ; At least align destination!
and edx,0000111b
jz $Mid ; Jump if properly aligned
add edx,0fffffffch ; -3
$Mid:
For the actual memory movement operation there are various techniques that can be used, each with its own
benefit or drawback.
The best method is a preventative one. If the memory you're dealing with is for video images, then not only
should the (width mod 8) equal a remainder of zero but the source and destination pointers should also be
properly aligned. In this way, there is no problem of clock penalties for each memory access and no extra and
possibly futile effort trying to align them.
In 8-bit images, moving (blitting) sprite memory can be difficult as sprites will always be misaligned. In 32-bit
images where one pixel is 32 bits, alignment is a snap, as every pixel is properly aligned.
#ifdef __cplusplus
extern "C" void gfxCopyBlit8x8Asm(byte *pDst, byte *pSrc,
uint nStride, uint nWidth, uint nHeight);
#endif
#define USE_ASM_COPYBLIT_8X8
#ifdef USE_ASM_COPYBLIT_8X8
#define CopyBlit8x8 CopyBlit8x8Asm
#else
void CopyBlit8x8(byte *pDst, byte *pSrc, uint nStride,
uint nWidth, uint nHeight)
{
// If width is the stride then copy entire image
if (nWidth == nStride)
{
memcpy(pDst, pSrc, nStride * nHeight);
}
else
{ // Copy image 1 scanline at a time.
do {
memcpy(pDst, pSrc, nWidth);
As you probably noted, there is extra logic checking if width and stride are the same. If so, then unroll the loop
to make the code even more efficient.
Goal Try to write the listed function in assembly optimized for your processor. Or multiple
processors.
Speed Freak
The code size would increase but using a vector table such as follows would allow you to unroll your
(remainder) loops. With normal code, four states would be required but for MMX all eight would be best.
mov eax,ecx ; Get Width
and eax,0000111b
jmp $SetTbl[eax*4]
Align 16
$SetTbl:dd $SetQ ; (n mod 8) = 0
dd $Set1 ; (n mod 8) = 1
dd $Set2 ; (n mod 8) = 2
dd $Set3 ; (n mod 8) = 3
dd $SetD ; (n mod 8) = 4
dd $Set1 ; (n mod 8) = 5
dd $Set2 ; (n mod 8) = 6
dd $Set3 ; (n mod 8) = 7
Graphics 101 — Frame Buffer
When dealing with graphic images there are various parameters defining its geometry.
memptr — The base pointer to a coordinate within the image related to its physical memory address.
bits per pixel — The number of bits per pixel used to represent the image. Typically 1/4/8/16/24/32-bit
but pretty much only 8- to 32-bit are used these days.
width — The width of the image in pixels.
height — The height of the image in pixels.
stride — The number of bytes from the start of one row of pixels to the start of the next. It should be noted that there may be extra bytes between the last visible pixel of a row and the start of the next row. For example, in Figure 19-2 the 640-pixel scanline has an overage of 384 bytes. That means when you write that 640th pixel you need to add 384 to get to the start of the next scanline (640+384=1024). (See the addressing sketch below.)
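A small addressing helper follows directly from those definitions; this is a hedged sketch with names of my own, not a function from the book:
#include <stddef.h>

/* Address of pixel (x, y) inside a frame buffer described by the
   parameters above (base pointer, stride in bytes, bits per pixel). */
static unsigned char *PixelAddr(unsigned char *pMem, unsigned int nStride,
                                unsigned int x, unsigned int y,
                                unsigned int nBitsPerPixel)
{
    return pMem + (size_t)y * nStride + (size_t)x * (nBitsPerPixel / 8);
}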
// Graphics Clear
//
// This is a pre-clipped function used to clear a bitmap
// pointed to by the destination pointer.
// Note: This can be used to clear 8/16/24/32-bit pixels.
#ifdef USE_ASM_GFXCLR
#define gfxClr gfxClrAsm
#else
pDst += nStride;
} while (--nHeight);
}
#endif
Project:
Using what you've learned, try to write the C function above in assembly optimized for your processor.
void gfxClrAsm(byte *pDst, uint nStride, uint nWidth, uint nHeight);
A transparent pixel is referred to by a variety of names, including transparent, color key, skip color, invisible
color, and non-displayed pixel. This is a pixel containing no image color data that allows the color of the pixel
directly underneath it to be displayed. It is typically set to an unusual color that helps the artists and
programmers easily identify it in relation to the rest of the colors.
If you watch the news you see this process every day compliments of the weatherman. He is shot on a green
screen, being careful not to wear a color similar to the color key, so the electronics will make him appear in
front of an image such as a map and that composite image is transmitted to your television. If he wore the
same shade of color as the color key, in the middle of his chest he would appear to have a big hole where you
would be able to see through his body.
When using film, moviemakers shoot models or actors on a blue screen, as the color of blue is actually clear
on the film negative. Oversimplifying this explanation, the non-clear areas would be converted into a mask and
the images would be cut into a composite typically using a matte backdrop.
When using digitized graphics in a computer, movie/game makers shoot actors on a green screen and digitally
map the images into a single image using some sort of backdrop.
Your transparency color can be any color. I typically pick a dark shade of blue. For instance, in an RGB range
of (0 to 255) {red:0, green:0, blue:108}. This allows me to differentiate between the color of black and
transparency and still have the transparent color dark enough so as not to detract from the art. When I am
nearly done with the image and almost ready to test it for any stray transparent pixels, I set them to a bright
purple {red:255, green:0, blue:255} as that particular color of bright purple is not usually found in my art images
and it really stands out. It does not matter what color you use as long as the image does not contain that
particular color.
In a 2D graphics application, there is typically a need to composite images and so this leads to how to handle
a transparent blit.
A few years ago, I taught a College for Kids program during the summer titled "The Art of Computer/Video
Game Design." For that class, I had put together a small program that reinforced the need for computer games
to have foreign language support. This particular game was called "Monster Punch." A language would be
selected and then various living fruit with their eyes moving around would drop down from the top of the screen
and pass through the opening out of view at the bottom of the screen. After all the fruit had fallen, the display
would snap to a view of a blender, at which point all the fruit would be blended, while screaming, into monster
punch where the blender comes alive, à la "Monster Punch!" (Okay, maybe I am a little warped, but you should
have been able to figure that out by now!)
Copy Blit
The following sprite imagery is that of a copy blit, where a rectangular image is copied to the destination and
overwrites any overlapped pixel.
Figure 19-3: Monster Punch — Copy blit of strawberry image on the right into the blender on the left.
Using efficiently optimized code, up to eight bytes at a time can be copied with 64-bit access, which
corresponds to simultaneously writing eight 8-bit pixels, or four 16-bit pixels, or almost three 24-bit pixels, or
only two 32-bit pixels. With 128-bit, up to 16 bytes can be accessed, thus 16 8-bit pixels, or eight 16-bit pixels,
or slightly over five 24-bit pixels, or only four 32-bit pixels.
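As a frame of reference, a plain (unoptimized) copy blit is nothing more than a row-by-row memory copy. The following C sketch is my own minimal version rather than a listing from this book, and the function name and parameters are assumptions:

#include <string.h>

typedef unsigned char byte;
typedef unsigned int  uint;

// Minimal copy blit: copy nWidth bytes of pixel data per row for nHeight rows.
// nDstStride and nSrcStride are the byte pitches of the destination and source surfaces.
void gfxCopyBlit(byte *pDst, uint nDstStride,
                 const byte *pSrc, uint nSrcStride,
                 uint nWidth, uint nHeight)
{
    do {
        memcpy(pDst, pSrc, nWidth);   // one row of pixels
        pDst += nDstStride;           // step to the next destination row
        pSrc += nSrcStride;           // step to the next source row
    } while (--nHeight);
}

A good compiler, or hand-written MMX/SSE code, widens that inner copy into the 64-bit or 128-bit accesses described above.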
Transparent Blit
As the following sprite image portrays, all pixels from the source that match the transparent color are not
copied, thus causing the sprite to be seamlessly pasted into the background.
Figure 19-4: Monster Punch — Transparent blit of strawberry image on the right into the blender on the left.
Normally when dealing with transparencies, only one pixel at a time can be tested to detect whether or not it is
transparent, so you wind up introducing inefficiencies such as branch mispredictions; that is where the sample
in the following section comes in handy.
add esi,8 ; Adjust source pointer
pand mm4,mm5
; Using the same mask, flip it, then AND it with the
; source pixels, keeping the non-transparent pixels.
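For reference, the mask-and-blend idea behind this MMX fragment can also be written in portable C. The sketch below is my own version for 8-bit pixels with assumed names, not the book's listing; in the MMX code, PCMPEQB builds the mask for eight pixels at once instead of one byte at a time.

typedef unsigned char byte;
typedef unsigned int  uint;

// Transparent blit of 8-bit pixels using the mask-and-blend trick:
// build an all-ones mask where the source matches the color key, keep
// the destination under the mask, keep the source under the inverted
// mask, and OR the two together (the pand/pandn/por pattern).
void gfxTransBlit8(byte *pDst, const byte *pSrc, uint nCount, byte key)
{
    do {
        byte mask = (byte)((*pSrc == key) ? 0xFF : 0x00);
        *pDst = (byte)((*pDst & mask) | (*pSrc & (byte)~mask));
        pDst++;
        pSrc++;
    } while (--nCount);
}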
There is no transparency testing or branching, only the masking and blending of data, which makes the
process of a transparent blit much faster. These two different blits (copy, transparent) are typically designed for
a graphic environment such as in Figure 19-5 where the background seen on the right is kept in a separate
buffer like wallpaper.
Figure 19-5: Transparent copy blit of strawberry sprite and blender image background to achieve composite
result of both.
The background is CopyBlit to the working surface as seen on the left, and the sprite image is Transparent Blit
in front of it. When the sprite image is animated, the area being changed is "erased" from the working surface
by a rectangular CopyBlit of that area from the background to the working surface, and then the updated sprite
image is Transparent Blit back in front. This is a layered approach typically used in a video
game that has a number of animated objects moving around the display.
This is typically called an "overlay" method used by Windows and some sprite engines. The drawback to this
method is that overlapping of sprites needs to be minimized because erasing one requires all the other
intersecting sprites visible above that sprite to be erased. The list of sprites needs to be traversed to find out
which sprites intersect the area and need to be erased and repainted by replacing the image under each
intersecting sprite in the image buffer with the corresponding original background image. The list of sprites
then needs to be traversed again, this time drawing the sprites back into the scene.
Figure 19-6: The blit of a rectangular blender image to a storage buffer, then the transparent blit of a
strawberry into the blender. A blit of the saved blender image back into the blender effectively erases the strawberry.
tcolor qword 03f3f3f3f3f3f3f3fh ; 03fh = transparent pixel
pand mm4,mm5
Graphics 101 — Clipping Blit
The same trick of using inverse logic can be used for expanding image clipping planes.
In the image on the left, no matter how it's encoded (8/16/24/32 bits), only a single bit in the clipping plane
image on the right would be needed to represent a single pixel. If black=0 and white=1, then a sprite object
could appear to pass in front of the fence as well as behind it but in front of the distant background. This could
be done in a variety of ways. One would be to use masks where both the sprite pixel and the background pixel
are masked so only one has a non-zero value. The resulting color is written to the destination buffer.
; esi = sprite image pointer
; ebx = clipping plane
; ebp = background image pointer
; edi = destination image buffer pointer
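The assembly itself aside, the idea can be sketched in C as follows. This is my own version with assumed names, assuming the clip-plane bits are packed MSB first and leaving out the sprite's own color-key test for brevity:

typedef unsigned char byte;
typedef unsigned int  uint;

// Clip-plane blend: where the 1-bit clip plane is set (the fence is in
// front), the background pixel wins; where it is clear, the sprite pixel
// wins.  Both pixels are masked so only one has a non-zero value, then
// the two are OR'd into the destination.
void gfxClipBlit8(byte *pDst, const byte *pSprite, const byte *pBkgd,
                  const byte *pClip, uint nCount)
{
    uint i;

    for (i = 0; i < nCount; i++) {
        uint bit  = (pClip[i >> 3] >> (7 - (i & 7))) & 1;  // one bit per pixel
        byte mask = (byte)(bit ? 0xFF : 0x00);
        pDst[i] = (byte)((pBkgd[i] & mask) | (pSprite[i] & (byte)~mask));
    }
}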
Chapter 20: MASM vs. NASM vs. TASM vs. WASM
Overview
This chapter reminds me of the old Batman and Robin show where in a fight scene we see sound effect words
flash on screen such as "OOF," "KABONG," "ZING," "MASM," "ZOT," "TASM," "ZANG," "WASM," and "POW."
These, by an amazing coincidence, are the war cry of the assembly language programmer. Wars have started
for far less than trying to get one of these programmers to use a different C/C++ compiler or assembler. At one
company I spent almost two years writing a good portion of the application core libraries and all the
documentation for their SDK that allowed internal and external programmers to write online computer games
using dedicated game servers. For the last year and a half there I worked on their Win32 Network API SDK
(not to be confused with Microsoft's GameSDK). For the first four months there I wrote the DOS SDK, which
uses 16-bit and various forms of 32-bit Extended DOS. It was a mix and match of C/C++ compilers,
assemblers, linkers, and DOS extenders. It seemed every company had their own flavor. They had gotten
used to their favorite combination and nothing was better! So every now and then I had to create libraries for
that new flavor to entice new clients. Some of the code I currently write also uses the High C/C++ compiler
with MASM or Pharlap's 386ASM. I do not use it these days, but there is also the Watcom C/C++ with their
WASM Assembler. Occasionally on software I write today, I get inquiries if my libraries are compatible with the
Borland TASM Assembler. I have used all of these and a few others, and to date my favorite is MASM by
Microsoft.
There is a form of assembly that we should not forget: in-line assembly. Some people swear by it. I, on the
other hand, swear at it! I rarely use it and only for some specific type of data conversion that I need to be fast
without the penalty of a stack call to call a pure assembly function. It is akin to programming with one arm tied
behind one's back. A lot of macro assembler functions are not available.
I have read book reviews in which advocates of non-MASM assemblers indicate a book could have been a lot
better if the author had used TASM instead of MASM. Again, a personal bias! Although I have a few
apprehensions about MASM, I have a personal bias for it. In writing this book I have tried to appease the critics
by keeping the examples as generic as possible, and if this was not good enough for you,
"RASPBERRIES! " MASM is only available separately by download but it's built into the Visual C++ 6
and VC .NET compilers.
You should always use the latest and greatest version of your favorite assembler because if you do not, your
version could have bugs (I find them all the time) or be too old to support some of the newer instructions. Back
when MMX first came out I had to use the IAMMX.INC by Intel with the MASM as a workaround just to support
MMX instructions. Since then it has been built into MASM. Now for SSE3 support you need to either get the
latest VC .NET or download the ia_pni.inc file to get assembly instruction macro emulation. With one other
company's assembler I had to hand-code the opcodes to make sure I had the appropriate JMP instruction.
There was a bug and the jump instruction that I had coded in assembly code was not the jump instruction
being encoded into machine code. A bug was being introduced into compiled code because of a bug in the
assembler itself!
With the latest instruction sets there seem to be two assemblers at the forefront with recently introduced
assembly instructions: MASM and NASM. No matter whose assembler you're using, I use the following as
placeholders for the arguments being passed into the example code used in this book:
arg1 equ 8 ; Argument #1
arg2 equ (arg1+4) ; Argument #2
arg3 equ (arg2+4) ; Argument #3
arg4 equ (arg3+4) ; Argument #4
arg5 equ (arg4+4) ; Argument #5
arg6 equ (arg5+4) ; Argument #6
arg7 equ (arg6+4) ; Argument #7
arg8 equ (arg7+4) ; Argument #8
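With the standard prologue shown in the skeleton below (push ebp followed by mov ebp,esp), the saved EBP ends up at [ebp+0] and the return address at [ebp+4], so the first 32-bit stack argument lives at [ebp+8] and each subsequent argument 4 bytes higher, which is where the arg1 equ 8 value comes from.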
public unzip
unzip proc near
push ebp
mov ebp,esp
push ebx
push esi
push edi
;
;
;
pop edi
pop esi
pop ebx
pop ebp
ret
unzip endp
As an alternative to arg1, you could use a define to give the argument a more meaningful name.
pRaw = arg1
mov eax,[ebp+pRaw]
The following information is a brief overview and you should refer to your assembler's documentation for
specific information.
When using MASM, the first thing you need to do is activate the appropriate CPU target by using one
of the following assembler directives depending on what processor will be executing that section of code.
There are several of these directives such as .386, .486, .586, etc. If the target is for an embedded 486, then
obviously the .586 directive would not be used, as instructions would be allowed that the 486 would not
understand. When you write your code for a single processor you can merely set the appropriate directive(s) at
the top of the file, but quite often a single file will contain sets of code unique to individual processors.
.686 — This allows Model 6 type x86 code to be assembled. (It is the .MMX directive on the next line that
actually allows MMX instructions to be assembled.) You can pretty much always use this directive, as most
processors being released these days support MMX; you do need to make sure that the code is only going to
be executed by one of those processors, however.
The directives not only target processors but certain instruction sets, and so care must be used when
setting the appropriate directives.
.MMX — An alternate method is to set the supported instruction sets such as this directive for enabling
MMX code.
.K3D — This is the directive for the 3DNow! instruction set. As you do not want an Intel processor trying to
execute one of these instructions, only insert this above 3DNow! instruction code. These directives are also
order dependent, so this one must occur after the .MMX directive.
.XMM — Use this if you are using any SSE-based instructions requiring XMM registers.
There are other legacy declarations such as .387, .286, .386, .386P, .486, .486P, etc. The suffix "P" indicates
an enabling of privileged instructions.
Here is a sample file that you should be able to drop into a Win32 application in conjunction with the Visual
C++ compiler:
TITLE zipX86M.asm – My x86 (MASM) Assembly
PAGE 53,160
; This module is designed to Blah Blah Blah!
;
; Created - 20 May 98 - J.Leiterman
; Tabs = 8
.686
.MMX
.K3D
.model flat, C
.data
ALIGN 4
foo dd 0 ; Data value
;
; void unzip(byte *pRaw, byte *pZip, uint nWidth);
;
zipX86M SEGMENT USE32 PUBLIC 'CODE'
align 16
unzip PROC C PUBLIC USES ebx esi edi, pRaw:PTR, pZip:PTR, nWidth:DWORD
mov esi,pRaw
mov edi,pZip
mov ecx,nWidth
;
;
;
ret
unzip endp
zipX86M ends
end
The function is declared PUBLIC, meaning it's global in definition and can therefore be accessed by functions
in other files.
unzip PROC C PUBLIC USES ebx esi edi, pRaw:PTR, pZip:PTR, nWidth:DWORD
For convenience, you can specify the registers to push onto the stack and in what order. The RET instruction
is actually a macro when used within this PROC, and therefore the registers are popped automatically in a
reverse order wherever a RET instruction is encountered. The coup de grâce? No more pesky code like:
mov esi,[ebp+arg1] ; pRaw
The assembler expands the PROC macro and takes care of everything for you, making your code a little more
readable.
You will notice that I used the default data segment (.data) as this is a flat memory model, but I declared a 32-
bit Protected Mode code segment. The reasoning is that I tend to group my assembly files using an object-
oriented approach and as such all my decompression functions/procedures would reside within this segment.
Other assembly code related to other functionality would be contained in a different file with a different
segment name. They can occur with the same segment name but they wouldn't appear very organized,
especially in the application address/data map.
zipX86M SEGMENT USE32 PUBLIC 'CODE'
:
:
zipX86M ends
Since segments are being mentioned I am going to give you a snapshot of segments back in the days of DOS
and DOS extenders. Code and data were differentiated by 16-bit versus 32-bit code/data addressing.
The following is a snippet of code from those days.
; Segment Ordering Sequence
We also must not forget the (end) signal to the assembler that it has reached the end of the file:
End
I personally think this is just a carryover from the good old days of much simpler assemblers. With the advent
of macros such as the following, you can turn on or off various sections of code and not just the bottom portion
of your file:
if 0
else
endif
Visual C++ has never really had a peaceful coexistence with its own MASM Assembler. In the early days of
around version 3.x you had to assemble your files using batch files or external make files and only link the
*.obj files into your project files. Microsoft has fortunately made this a little simpler, but in my opinion it still
seems shortsighted. My assumption is that they would prefer you to use either inline assembly or none at all.
(But I've been known to be wrong before!)
The first thing you need to do is add the MASM hooks into your version 5.0 or above Visual C++ environment.
Select the Tools|Options menu item, and then select the Directories tab. Set the following to the path of your
MASM installation:
Executable Files: c:\masm\bin
c:\masm\binr
Include Files: c:\masm\include
With your project loaded in your FileView tab, just right-click on the project files folder, and select the pop-up
menu item Add Files to Project. The Insert Files into Project dialog box will be displayed. That dialog seems to
support almost every file type known except for assembly! What you need to do is select the All Files (*.*)
option, select the assembly file you desire, and then press the OK button.
Now that the file occurs in your list of files in your project, right-click on that file and select the Settings item
from the pop-up menu. In the Commands edit box insert the following:
ml @MyGame.amk ..\util\unzipx86.asm
This will execute the assembler using the option switches defined in the MyGame.amk file. In the Outputs edit
box insert the following:
unzipx86.obj
To make my life simpler I use a file, such as the following, that I refer to as my assembly make file. I clone it
from project to project, as you'll never know when you'll need to tweak it.
File: MYGAME.AMK
/L../util
/c
/coff
/Cp
/Fl
/Fm
/FR
/Sg
/Zd
/Zi
For those of you who would prefer to use in-line assembly or just plain don't have an assembler, you can do
the same thing with the following from within your C/C++ code.
Figure 20-1: VC6 assembler configuration display
void unzip(byte *pRaw, byte *pZip, uint nWidth)
{
__asm {
mov esi,pRaw
mov edi,pZip
mov ecx,nWidth
};
}
You should be very careful if you mix C/C++ and in-line assembly code unless you push the registers to save
them. Setting a breakpoint at the beginning of your function and then examining the source code during run
time can help point out any register use conflicts.
MASM is my favorite macro assembler, as it has excellent macro expansion abilities. Not only can new
instructions be incorporated by use of macros, but the predefined macro expansions can be taken advantage
of, as they are C-like. In some cases, I find it better than C. (Note: I only said in some cases!) And in fact,
in-line assembly sucks! (Another technical term!) The following are some of the highlights; for details, read the
technical manuals. For example, the MASM toolset has the following manuals:
Environment and Tools
Programmers Guide
Reference
In the following charts, notice the C method on the left and the MASM method on the right.
Defines are pretty similar; however, enums do not exist and so must be emulated with a standard equate.
C MASM
#define FOO 3 FOO = 3
typedef enum
{
CPUVEN_UNKNOWN = 0, CPUVEN_UNKNOWN = 0
} CPUVEN;
C MASM
typedef struct CpuInfoType { CpuInfo struct 4
uint nCpuId; // CPU Id nCpuId dd 0 ; CPU Id
uint nFpuId; // FPU Id nFpuId dd 0 ; FPU Id
uint nBits; // Feature nBits dd 0 ; Feature
uint nMfg; // Mfg nMfg dd 0 ; Mfg.
uint16 SerialNum[6]; SerialNum dw 0,0,0,0,0,0
uint nSpeed; // Speed nSpeed dd 0 ; Speed
} CpuInfo; CpuInfo ends
In C there is no looping macro expansion; there is only one-shot (a definition gets expanded). However, some
special macro functionality is available when using a MACRO assembler.
REPEAT
MASM supports the repeat declaration when used in conjunction with a local temporary argument.
i = 0
REPEAT 5
mov [i + ebx],eax
i = i + 4
ENDM
The symbol i is a temporary assembly-time variable used while expanding the code. For this example, the
REPEAT block is replicated five times, adding 4 (the size of the write) to i on each iteration, so the code is unrolled:
mov [0 + ebx],eax
mov [4 + ebx],eax
mov [8 + ebx],eax
mov [12 + ebx],eax
mov [16 + ebx],eax
WHILE
MASM also supports a while loop.
i = 0
WHILE i LE 20 ; while i <= 20
mov [i + ebx],eax
i = i + 4
ENDM
This expands to essentially the same kind of code. The example is a simple loop, but WHILE is typically used
for assemble-time loops of more complexity.
FOR
MASM also supports a for loop.
FOR arg, <1,3,5,7,9,11,13,17,19,23>
out dx,arg
ENDM
As mentioned, these are examples of MASM related code. Those assemble-time loops are something not
available to a C compiler. Other items are available including access to data/code segment specification and
all assembly instructions, while inline assembly has only a limited set of instructions available. The macro
assembler allows code/data intermixed, while a C compiler does not. The IF-ELSE-ENDIF conditionals are
also available, along with other features available in a standard C compiler.
Compiler Intrinsics
The more recent Visual C++ and Intel compilers support a method of programming in assembly language
referred to as intrinsics. This is where the functionality of SIMD instructions has been wrapped within C
wrappers and compiled into code as inline code. Let us examine the following example:
void test(float *c, float a, float b)
{
*c = a + b;
}
Not to oversimplify the power of using intrinsics to get code up and running quickly, the following code uses
intrinsics in conjunction with (__m128) XMM registers and SSE single-precision floating-point instructions.
Note that it looks more complicated, but I chose a simple scalar expression to resolve.
#include <xmmintrin.h>
void test(float *c, float a, float b)
{
    __m128 ta, tb;
    ta = _mm_load_ps(&a);
    tb = _mm_load_ps(&b);
    ta = _mm_add_ps(ta, tb);
    _mm_store_ps(c, ta);
}
But underneath in the pure assembly code generated by the compiler this breaks down to something similar to
the following:
push ebx
mov ebx,esp
sub esp,8
and esp,0FFFFFFF0h ; 16-byte align stack
add esp,4
push ebp
mov ebp,dword ptr [ebx+4]
mov dword ptr [esp+4],ebp
mov ebp,esp
sub esp,98h
push esi
push edi
; __m128 ta, tb
; ta = _mm_load_ps(&a);
lea eax,[ebx+0Ch]
movaps xmm0,xmmword ptr [eax]
movaps xmmword ptr [ebp-30h],xmm0
movaps xmm0,xmmword ptr [ebp-30h]
movaps xmmword ptr [ebp-10h],xmm0
; tb = _mm_load_ps(&b);
lea eax,[ebx+10h]
movaps xmm0,xmmword ptr [eax]
movaps xmmword ptr [ebp-40h],xmm0
movaps xmm0,xmmword ptr [ebp-40h]
movaps xmmword ptr [ebp-20h],xmm0
; ta = _mm_add_ps(ta, tb);
movaps xmm0,xmmword ptr [ebp-20h]
movaps xmm1,xmmword ptr [ebp-10h]
addps xmm1,xmm0
movaps xmmword ptr [ebp-50h],xmm1
movaps xmm0,xmmword ptr [ebp-50h]
movaps xmmword ptr [ebp-10h],xmm0
; _mm_store_ps(c, ta);
movaps xmm0,xmmword ptr [ebp-10h]
mov eax,dword ptr [ebx+8]
movaps xmmword ptr [eax],xmm0
pop edi
pop esi
mov esp,ebp
pop ebp
mov esp,ebx
pop ebx
ret
Chapter 21: Debugging Functions
Overview
Debugging code written for vector processors can be either complicated or very simple, depending on your
toolset. The first item to remember is to have a good IDE (integrated development environment). With this and
the proper processor package, the packed data vector registers can be immediately dumped and examined to
verify the data is as expected.
It never fails. You may be in the middle of optimizing some assembly code and management comes by for a
demo. If you still have the C code, you can switch over to it; it runs slower but you are still able to run the
demo.
If your code deals with arrays of numbers or a series of like numbers, then orient your C code in groups of four
for single-precision floating-point numbers, etc. The idea is 128-bit data. Read Vector Game Math Processors
for more information on this topic.
Debug your C code! Every loop, every variable. Make sure it works exactly as you think it does. This is
important when you go to benchmark its results with the assembly code.
Locking your function means that it is done. There is (hopefully) absolutely no reason ever to make any more
changes to it.
Maybe your optimized C code is fast enough for your needs. It takes time to write assembly code and get it
working correctly. Also, you will want to phase in your assembly code. From within your C code, call the
assembly function (at least initially during development). This allows you to generate two sets of results and
compare them. This also allows you to handle the aligned memory algorithm first before moving on to the
misaligned version, etc. One by one, you phase them in. Eventually you actually vector to the assembly code
instead of the C code. You keep the C code as a fallback position, plus it gives you a starting point if the
function is changed in a major way.
Visual C++
If using Visual C++ version 7 or higher, then your compiler has already been updated with the latest and
greatest processor information, but if you're using Visual C++ version 6 you need to download and install the
service packs from http://msdn.microsoft.com/vstudio/downloads/updates/sp/vs6/sp5/default.asp.
You will also need the processor packs if you are running service pack 4, available at
http://msdn.microsoft.com/vstudio/downloads/ppack/Beta/download.asp.
If you are using a version older than Visual C++ version 6 you will need to rely heavily on functions that you
can call to dump the contents of your registers.
You can find out what version of compiler you are using by choosing the About option from the Help menu. In
your particular case, your Product ID number will be displayed. For this example, I have erased my numbers
from the figures for security.
As you will note in the following Visual C++ dump for an x86 type processor under VC6, only MMX — not XMM
— registers are supported in this example. Since this is from a run using a 3DNow! processor, there is a dump
of the MMX registers in hex (MM#) as well as floating-point, denoted by (MM##) in upper and lower 32-bit
single-precision floating-point values. MM00 in the bottom portion of Figure 21-3 represents the lower single-
precision value of the MM0 register.
Figure 21-3: Register dump from Visual C++ 6.0(3DNow! extensions)
In some cases, such as an older version of the compiler, the extended registers are not immediately available.
In the newer version, such as the VC .NET toolset, that is no longer an issue. As you may notice in
Figure 21-4, selective dumps of the other registers have been made available. Using at least version 7 of the
compiler allows the active register sets to be enabled and thus viewed in the window.
Figure 21-4: Register dump from Visual C++ .NET version 7.0 (3DNow! professional extensions). Note the
3DNow! and SSE supported registers as well as the extra pop-up menu of register type selections that you can
choose from. The enabled selections are based upon the processor type running on the computer that is being
debugged.
The one thing that you will really need while developing assembly code is a second monitor. This is because
you will need lots of screen real estate for all the various windows that you will need for debugging your
assembly code.
Since we are talking here about Win32 code, whether it is targeted for the Xbox or PC platforms, you are most
likely dealing with Direct3D or, less likely, OpenGL. If it is Direct3D, then it is in your best interest to
install the debug versions of the DirectX dynamic-link libraries (DLLs). They will run slower but if something
weird happens they will give you more information as to the type of failure. And remember that your application
will run that much faster on a retail version of the DLLs.
Tip Another handy debugging tool for vectors is the use of multi-colored pens. Using multiple
colors on graph paper can produce layered results, giving depth and helping make complex
problems easier to understand.
The biggest help in developing vector-based applications above all is the symbolic debugger. The program
counter can be followed throughout the code and the general-purpose and SIMD registers monitored. The
changes to memory as the results are written can be observed.
Tuning and Optimization
There are multiple programs on the market for tuning and optimizing your application. The two most popular
performance monitoring applications are the VTune program by Intel for use with Intel's chips, available at
http://developer.intel.com/software/products/eval, and the AMD Code Analyst for use with AMD's chips,
available at http://www.amd.com.
These programs allow you to balance your pipelining and monitor the overall performance of your application,
and they make recommendations as to how to balance your code to make it as tight as possible. By setting up
monitoring zones indicated by start and stop markers, the application can be benchmarked between those zones.
Exceptions
There are two types of exceptions in the 80x86 processor: hardware and software. The hardware exception
has a maskable and non-maskable type. The software exception has three subtypes: fault, trap, and abort.
Hardware Exception
Nonmaskable — Signaled by an NMI (nonmaskable interrupt) and cannot be ignored.
Maskable — Signaled by an INTR but can be ignored by clearing the Interrupt flag in EFLAGS with the
CLI instruction.
Software Exception
Fault — An error detected and reported before the instruction completes; the program can be restarted at the faulting instruction.
Trap — An error reported immediately after the trapping instruction executes; the program can continue from the following instruction.
Abort — A severe error reported without a precise location; the program usually cannot be restarted and may need to be terminated.
Almost every instruction that the 80x86 uses has an association with an exception. Some have been
mentioned, but if they had been thoroughly documented this book would be at least twice its size. So general
rules apply, such as the divide by zero, using privileged instructions from a non-privileged protection layer,
trying to use bad floating-point values, protection violations, and internal machine faults. Even a breakpoint
that is set and triggered during debugging is an exception. Each exception is indicated by a "#" followed by a
two-letter ID, with an optional error code in parentheses: #GP(fault#). For example, #GP(0) is a
general-protection fault with an error code of zero. The following is a list of exceptions.
When an exception occurs, an interrupt is generated and the instruction pointer is immediately redirected to
the special handler associated with the interrupt slot in the Interrupt Descriptor Table (IDT). Each slot in the
table is a gate to layer #0. As shown in Table 21-1, there are up to 256 interrupts, of which 32 are reserved for
exceptions and interrupts. Note that some are actually unused and reserved for the future.
Table 21-1: 80x86 exception types
Throughout this book I have been oversimplifying, using merely vmp_SIMDEntry() and vmp_SIMDExit() to
block off sections of code, but it is actually more complicated than that, as you truly need to differentiate
between MMX and FPU use. Remember that the MMX registers can hold 3DNow! floating-point data on an
AMD processor, as well as integer or logical bit data on any x86 processor. In this book, I did not use these
calls on every function call like most books tell you to do; instead, I used them between blocks of code, which
in essence means that for some instruction sets, such as SSE, I overused them. Having multiple flavors of
SIMD that can be stubbed out helps to lighten your code even more.
Print Output
"To printf or not to printf, that is the question!" Well, not to butcher Shakespeare, but sometimes the old ASCII
text output is an excellent method of tracking down obscure bugs. This is not meant to displace the use of full-
blown symbolic debuggers, but sometimes the old fashioned printf is better.
This is where things get interesting! (I keep saying that in this book, do I not!) With the C++ cout IOStream
class it would have to be overloaded and the output rerouted to your own device. The printf function is a little
harder to capture although the standard output could be reassigned. Some platforms such as Win32 do not
use printf. Instead they use a string output function such as OutputDebugString(). What I like to do is create a
thin abstraction layer (whether it be written in C++ or C) of log functions that route to one or more specified
output devices. This is set up for two types of output: Debug, which is not shipped with the game, and
Release, which is. So, for example:
LogWrite()
LogRDWrite()
They are both similar to the functionality of printf() except they do not return a count of bytes written. The
LogWrite() function is a debug version that gets stubbed to nothing in a release build. LogRDWrite() is
compiled for both release and debug builds and gets shipped with the code. This helps to simplify the job of
customer support when something really bad happens. There are other functions in the log module, but these
two are at the core and they in essence would call printf() and/or OutputDebugString(), etc. The idea is that
your cross-platform application would call the log functions and be abstracted for the individual platform.
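A minimal sketch of such a thin log layer follows. This is not the book's module, just one way to wire it up; the variadic plumbing, the buffer size, and the Win32 OutputDebugString() routing are my own choices, and the variadic stub macro requires a C99-capable compiler.

#include <stdarg.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

// Shipped in both release and debug: format the message and route it to
// one or more output devices (debugger, console, file, serial, TCP/IP).
void LogRDWrite(const char *pFmt, ...)
{
    char    buf[1024];
    va_list args;

    va_start(args, pFmt);
    vsnprintf(buf, sizeof(buf), pFmt, args);
    va_end(args);

#ifdef _WIN32
    OutputDebugStringA(buf);    // debugger output window
#endif
    fputs(buf, stdout);         // console, or reroute to a file/port here
}

// Debug only: same sink in a debug build, stubbed to nothing in release.
#ifdef _DEBUG
#define LogWrite LogRDWrite
#else
#define LogWrite(...) ((void)0)
#endif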
When I develop Win32 applications I frequently route this output to one or all of the following: the function
OutputDebugString(), a Notepad window, a text file, a high-speed serial port with hardware handshake, and a
TCP/IP communications setup for data collection on another computer. I do something similar for other
embedded platforms. With a smart terminal there, I can back scroll through live memory dumps to track
behaviors.
One of the methods I frequently use with this type of diagnostics is to set threshold values to enable the
trigger, which in effect allows the capture and printing of select data. This helps to minimize being inundated
with massive amounts of data that hinders rather than helps detect problems.
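As a trivial illustration of that kind of threshold trigger, using the LogWrite() macro and uint typedef from the sketch above and otherwise made-up names:

#include <math.h>

// Only log when the value of interest leaves its expected range, so the
// output is not flooded with thousands of healthy samples.
static void CheckNormalLength(uint nFrame, float fLength)
{
    if (fabsf(fLength - 1.0f) > 0.001f)
        LogWrite("frame %u: non-unit normal, length=%f\r\n",
                 nFrame, (double)fLength);
}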
As a superset of this functionality, the following functions are generated for even more detailed information on
a generic basis.
For platforms that do not offer you the advantage of an IDE or if you need to track problems and history dumps
of memory or registers, then some of the following functions should come in handy.
void vmp_FloatPrint(const float *pf, uint nCnt)
{
    uint i;

    vmp_SIMDEntry();
    do {
        i = (--nCnt >= 4) ? 3 : (nCnt & 3);
        nCnt -= i;
        do {
            printf("%lf ", (double)(*pf));
            pf++;
        } while (i-- > 0);
        printf("\n");
    } while (nCnt);
    vmp_SIMDExit();
}
Once a float array function is in place, creating other dependent functions such as the following is a snap.
Also, by creating a debug versus release version, the debug would be the function type definitions and the
release would be empty macros, thus stubbing the code to non-existence similar to the following:
#ifdef USE_LOG
void vmp_FloatPrint(const float * pf, uint nCnt);
#else
#define vmp_FloatPrint(pf, nCnt) // stub
#endif
The same would need to be implemented in the logging header file for any other logging function that would
need to be stubbed in a release build.
Vector Print
Listing 21-2: …\chap20\vecBug\vecBug.cpp
Quaternion Print
Listing 21-4: ...\chap20\vecBug\vecBug.cpp
Matrix Print
This function dumps a single-precision floating-point matrix. It expands single-precision to double-precision to
help minimize any problems of ASCII exponential representations.
Listing 21-5: ...\chap20\vecBug\vecBug.cpp
vmp_SIMDEntry();
printf("%lf %lf %lf %lf\n", (double)Mx[0][0],
(double)Mx[0][1], (double)Mx[0][2], (double)Mx[0][3]);
printf(" %lf %lf %lf %lf\n", (double)Mx[1][0],
(double)Mx[1][1], (double)Mx[1][2], (double)Mx[1][3]);
printf(" %lf %lf %lf %lf\n", (double)Mx[2][0],
(double)Mx[2][1], (double)Mx[2][2], (double)Mx[2][3]);
printf(" %lf %lf %lf %lf\n\n", (double)Mx[3][0],
(double)Mx[3][1], (double)Mx[3][2], (double)Mx[3][3]);
vmp_SIMDExit();
}
Memory Dump
There is nothing sweeter than a memory dump similar to the old DOS debug.exe program. A combination of
address, hex data, and ASCII equivalents is dumped, up to 16 bytes per line. Even when you have a
memory dump window within your IDE it can typically only handle one dump at a time. If you are trying to track
a nasty little problem or to verify proper functionality, sometimes a print trail is far superior. So with this in mind,
the following should be a handy little tool for your development cycle.
Listing 21-6: ...\chap20\vecBug\vecBug.cpp
ASSERT_PTR(vp);
ASSERT_ZERO(size);
mp = (byte *) vp;
run = 16;
col = tail = 0;
p = buf + sprintf(buf, "%8.8x ", mp);
// Hex values
if ((++col % 4) == 0)
{
*p++ = ' ';
}
}
while (tail--)
{
*p++ = ' ';
}
// ASCII chars
*p++ = '\r';
*p++ = '\n';
*p = 0;
printf(buf);
y += run;
run = 16;
tail = col = 0;
}
printf("\r\n");
}
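The listing above is only a fragment. A complete, self-contained version of the same idea might look like the following; this is my own sketch rather than the book's code, it prints the offset into the block instead of the absolute address, and it substitutes '.' for non-printable characters.

#include <ctype.h>
#include <stdio.h>

typedef unsigned char byte;
typedef unsigned int  uint;

// Dump memory 16 bytes per line: offset, hex values (grouped by four),
// then the printable ASCII equivalents.
void MemDump(const void *vp, uint size)
{
    const byte *mp = (const byte *)vp;
    uint y, x, run;

    for (y = 0; y < size; y += run) {
        run = (size - y > 16) ? 16 : (size - y);

        printf("%8.8x  ", y);                        // offset into the block

        for (x = 0; x < 16; x++) {                   // hex values
            if (x < run)
                printf("%2.2x ", mp[y + x]);
            else
                printf("   ");                       // pad a short last line
            if ((x & 3) == 3)
                printf(" ");                         // extra gap every 4 bytes
        }

        for (x = 0; x < run; x++)                    // ASCII characters
            putchar(isprint(mp[y + x]) ? mp[y + x] : '.');

        printf("\r\n");
    }
}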
Test Jigs
One sometimes needs data to load into vector algorithms during testing. These functions may seem downright silly
but they do come in handy from time to time.
do {
*pF = f;
f += 1.0f; // Advance value by 1.0f
pF++;
} while (--n);
}
Matrix Splat
This function splats (replicates) a single-precision floating-point value into all sixteen elements of a matrix.
Listing 21-8: ...\chap20\vecBug\vecBug.cpp
pF = (float *)Mx;
n = 16;
do {
*pF = f;
pF++;
} while (--n);
}
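For completeness, a full version of such a splat jig might look like this; the function name and the Matrix4x4 type are my own assumptions rather than the book's definitions.

typedef unsigned int uint;
typedef float Matrix4x4[4][4];

// Splat (replicate) one single-precision value into all 16 matrix elements.
void MatrixSplat(Matrix4x4 Mx, float f)
{
    float *pF = (float *)Mx;
    uint   n  = 16;

    do {
        *pF++ = f;
    } while (--n);
}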
There are other functions and test jigs for use in debugging vector code that are available for purchase from
middleware providers; they are also something you can develop from your own imagination.
Chapter 22: Epilogue
Overview
If you actually read this entire book and haven't skipped around here and there, you should have found a
diverse spectrum of topics related to computer processing using all the flavors of the x86 instruction set, from
general-purpose instructions up through SSE3, AMD64, and EM64T. Among them: the terrible penalty (and
hidden problems) of not having your data properly aligned, how to detect problems early in your development
cycle, and the importance of using assertions and constant declarations within your code. Okay, I did not go
overboard on the concept of const, but you can find that in any good C++ programming book.
Also contained within were answers to some of the most often asked questions that are typically encountered
in a job interview and my insight into that and other topics of game or embedded development. The companion
files, available at www.wordware.com/files/8086 and www.leiterman.com/books.html, contain practical
problems and solutions despite the fact that they are not heavily optimized and algorithmically tuned.
The principles found throughout this book can be applied to your project to make it that much better. There
have been programming tricks as well as other foundational elements for you to use to build upon and
enhance your knowledge. I hope you have found this book to be informative, that it has taught you at least one
new item, and has been a value for your dollars. Please feel free to contact me and let me know your opinion
about what you would like to see in any future revision.
I am sorry the code included with this book is not heavily optimized but as mentioned at the beginning,
"optimized code is difficult to understand!" An unmentioned reason for not doing so is greed! There is gold in
them thar hills! Companies make money writing heavily optimized—thus very, very fast—algorithms. You too
can possibly make lots of money from the comfort of your home writing optimized libraries targeted for a
specific processor. In fact, consider any algorithms you find in this book that you need in your code a
homework assignment of optimization.
I hope this book was as pleasant for you to read as it was fun for me to research and write. In the process of
writing this book I too learned a few new things, and that is what it is all about! I also hope you were not too
irritated with seeing similarities (okay 80x86 targeted and refined copies) of some of the chapters found in my
vector book. But that shouldn't stop you from buying a copy of my vector book! After all, this book is actually
the prequel to that book!
Please send an email to [email protected] to register your email address as a purchaser of this book.
Insert the word "Register" into the Subject field and in the message section, list your processors of interest and
any other comment. I can add you to an update notification list to let you know when additional code samples
or chapter supplements are available for download. Or you can just make a comment about this book. For
additional information or related links, check out the associated web site at
http://www.leiterman.com/books.html.
Oh, and one last item. Buy my brother's books as well. He has a large family too and can use some extra
income! What books you ask? Rewind the book and read the preface!
Appendix A: Data Structure Definitions
When dealing with three-dimensional transformations, data structures are needed to contain the data. The
data structures listed here are the ones used throughout this book.
Integer 2D Point
This is mostly used within a 2D game but has uses within a 3D game. It specifies a starting point within a
bitmap image, with the origin {0,0} located in the upper-left corner, primarily the starting point of a texture
within that image. It is also used in relation to the anchor point of a font baseline, indicating where to position
it within an image or a two-triangle quad polygon tile.
typedef struct iPosType
{
int x;
int y;
} iPos;
Integer 2D Size
Normally bitmaps range from 0 to w–1 for width and 0 to h–1 for height and are never negative. With that in
mind, only unsigned arguments are used.
typedef struct iSizeType
{
uint w; // Width
uint h; // Height
} iSize;
Integer Rectangle
Some rectangle structures are designed with a field definition such as {x1,y1,x2,y2}, but any movement has to
be translated with the modification of four coordinates. With the following structure {x,y,w,h}, only the two fields
{x,y} need to be modified.
typedef struct iRectType
{
int x; // X coord.
int y; // Y coord.
uint w; // Width
uint h; // Height
} iRect;
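For example, translating a rectangle by (dx, dy) then touches only the origin fields; the helper name below is just an illustration.

void iRectMove(iRect *pR, int dx, int dy)
{
    pR->x += dx;   // only the origin moves;
    pR->y += dy;   // width and height stay untouched
}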
3D Vector (Integer)
typedef struct iVector3DType
{
int x;
int y;
int z;
} iVector3D;
3D Quad Vector (Integer)
typedef struct iQVector3DType
{
int x;
int y;
int z;
int w;
} iQVector3D;
Appendix B: Mnemonics
Mnemonics Part 1
The following is the supported mnemonic lookup table per processor. They have been divided into revisions of
Intel and AMD. Sub-instruction types are indented and the root mnemonic has a shaded background.
Intel
P Pentium
PII Pentium II
SSE SSE (Katmai NI)
SSE2 SSE2
SSE3 SSE3 (Prescott NI)
EM64T 64-bit memory
AMD
K6 K6
3D! 3DNow!
3Mx+ 3DNow! & MMX Ext.
A64 AMD64
Legend:
Mnemonic supported
32 32-bit support only
64 64-bit support only
Mnemonic Page# P PII K6 3D! 3Mx+ SSE SSE2 A64 SSE3 E64T
AAA 394 32
AAD 396 32
AAM 395 32
AAS 394 32
ADC 176
ADD 176
ADDPD 83
ADDPS 83
ADDSD 335
ADDSS 335
ADDSUBPD 346
ADDSUBPS 346
AND 83
ANDNPD 99
ANDNPS 99
ANDPD 83
ANDPS 83
ARPL 445 32
BOUND 445 32
BSF 122
BSR 122
BSWAP 133
BT 255
BTC 255
BTR 256
BTS 257
CALL 311
CBW 156
CDQ 160
CDQE 156
CLC 53
CLD 77
CLFLUSH 443
CLI 320
CLTS 445
CMC 54
CMOVcc *
CMOVA 271
CMOVAE 271
CMOVB 271
CMOVBE 271
CMOVC 271
CMOVE 271
CMOVG 271
CMOVGE 271
CMOVL 271
CMOVLE 271
CMOVNA 271
CMOVNAE 271
CMOVNB 271
CMOVNBE 271
CMOVNC 271
CMOVNE 271
CMOVNG 271
CMOVNGE 271
CMOVNL 271
CMOVNLE 271
CMOVNO 271
CMOVNP 271
CMOVNS 271
CMOVNZ 271
CMOVO 271
CMOVP 271
CMOVPE 271
CMOVPO 271
CMOVS 271
CMOVZ 271
CMP 260
CMPPD 265
CMPPS 265
CMPS 263
CMPSB 263
CMPSD 263
CMPSQ 263
CMPSW 263
CMPSD 265
CMPSS 265
CMPXCHG 272
CMPXCHG8B 272
CMPXCHG16B 273 64
COMISD 261
COMISS 261
CPUID 404
CQO 160 64
CVTDQ2PD 167
CVTDQ2PS 165
CVTPD2DQ 168
CVTPD2PI 169
CVTPD2PS 168
CVTPI2PD 170
CVTPI2PS 166
CVTPS2DQ 165
CVTPS2PD 169
CVTPS2PI 166
CVTSD2SI 171
CVTSD2SS 172
CVTSI2SD 171
CVTSI2SS 167
CVTSS2SD 172
CVTSS2SI 170
CVTTPD2DQ 168
CVTTPD2PI 169
CVTTPS2DQ 165
CVTTPS2PI 166
CVTTSD2SI 171
CVTTSS2SI 170
CWD 160
CWDE 156
DAA 392 32
DAS 373 32
DEC 181
DIV 204
DIVPD 367
DIVPS 367
DIVSD 368
DIVSS 367
EMMS 58
ENTER 310
F2XM1 226
FABS 221
FADD 221
FADDP 221
FBLD 236
FBSTP 236
FCHS 221
FCLEX 245
FCMOVcc *
FCMOVB 234
FCMOVBE 234
FCMOVE 234
FCMOVNB 234
FCMOVNBE 234
FCMOVNE 234
FCMOVNU 234
FCMOVU 234
FCOM 231
FCOMI 233
FCOMIP 233
FCOMP 231
FCOMPP 231
FCOS 242
FDECSTP 219
FDIV 224
FDIVP 224
FDIVR 224
FDIVRP 224
FEMMS 58
FFREE 246
FIADD 231
FICOM 233
FICOMP 233
FIDIV 224
FIDIVR 224
FILD 217
FIMUL 223
FINCSTP 219
FINIT 245
FIST 217
FISTP 217
FISTTP 217
FISUB 222
FISUBR 223
FLD 214
FLD1 218
FLDCW 248
FLDENV 248
FLDL2E 218
FLDL2T 218
FLDLG2 218
FLDLN2 218
FLDPI 218
FLDZ 218
FMUL 223
FMULP 223
FNCLEX 245
FNINIT 245
FNOP 220
FNSAVE 246
FNSTCW 248
FNSTENV 247
FNSTSW 249
FPATAN 237
FPREM 225
FPREM1 225
FPTAN 237
FRNDINT 226
FRSTOR 246
FSAVE 246
FSCALE 226
FSIN 242
FSINCOS 242
FSQRT 226
FST 217
FSTCW 248
FSTENV 247
FSTP 217
FSTSW 249
FSUB 222
FSUBP 222
FSUBR 223
FSUBRP 223
FTST 231
FUCOM 232
FUCOMI 233
FUCOMIP 233
FUCOMP 232
FUCOMPP 232
FWAIT 220
FXAM 235
FXCH 218
FXRSTOR 247
FXSAVE 247
FXTRACT 227
FYL2X 227
FYL2XP1 227
HADDPD 346
HADDPS 346
HLT 446
HSUBPD 347
HSUBPS 347
IDIV 205
IMUL 195
IN 418
INC 177
INS 420
INSB 420
INSD 420
INSW 420
INT 318
INT n 318
INTO 318
INVD 444
INVLPG 446
IRET 320
IRETD 320
IRETQ 320
Jcc *
JA 291
JAE 291
JB 291
JBE 291
JC 291
JE 291
JG 291
JGE 291
JL 291
JLE 291
JNA 291
JNAE 291
JNB 291
JNBE 291
JNC 291
JNE 291
JNG 291
JNGE 291
JNL 291
JNLE 291
JNO 291
JNP 291
JNS 291
JNZ 291
JO 291
JP 291
JPE 291
JPO 291
JS 291
JZ 291
JCXZ 303
JECXZ 303
JMP 287
JRCXZ 303
LAHF 50
LAR 447
LDDQU 62
LDMXCSR 449
LDS 455
LEA 70
LEAVE 310
LES 455
LFENCE 442
LFS 455
LGDT 449
LGS 455
LIDT 449
LLDT 450
LMSW 451
LOCK 447
LODS 73
LODSB 73
LODSD 73
LODSQ 73
LODSW 73
LOOP 305
LOOPcc *
LOOPE 304
LOOPNE 304
LOOPNZ 304
LOOPZ 304
LSL 447
LSS 455
LTR 451
MASKMOVDQU 80
MASKMOVQ 80
MAXPD 286
MAXPS 285
MAXSD 286
MAXSS 286
MFENCE 443
MINPD 282
MINPS 281
MINSD 282
MINSS 281
MONITOR 457
MOV 62
MOV CR 448
MOV DR 448
MOVAPD 62
MOVAPS 62
MOVD 62
MOVDDUP 154
MOVDQ2Q 141
MOVDQA 62
MOVDQU 62
MOVHLPS 144
MOVHPD 147
MOVHPS 143
MOVLHPS 143
MOVLPD 146
MOVLPS 142
MOVMSKPD 269
MOVMSKPS 269
MOVNTDQ 78
MOVNTI 78
MOVNTPD 79
MOVNTPS 79
MOVNTQ 78
MOVQ 61
MOVQ2DQ 141
MOVS 76
MOVSB 76
MOVSD 76
MOVSQ 76
MOVSW 76
MOVSD 145
MOVSHDUP 154
MOVSLDUP 153
MOVSS 140
MOVSX 157
MOVSXD 157
MOVUPD 62
MOVUPS 62
MOVZX 158
MUL 194
MULPD 352
MULPS 352
MULSD 353
MULSS 353
MWAIT 458
NEG 93
NOP 54
NOT 92
OR 88
ORPD 88
ORPS 88
OUT 419
OUTS 420
OUTSB 420
OUTSD 420
OUTSW 420
PACKSSDW 163
PACKSSWB 162
PACKUSWB 162
PADDB 184
PADDD 184
PADDQ 184
PADDSB 186
PADDSW 186
PADDUSB 186
PADDUSW 186
PADDW 184
PAND 83
PANDN 99
PAUSE 299
PAVGB 191
PAVGUSB 191
PAVGW 192
PCMPEQB 265
PCMPEQD 265
PCMPEQW 265
PCMPGTB 266
PCMPGTD 266
PCMPGTW 266
PEXTRW 161
PF2ID 166
PF2IW 166
PFACC 346
PFADD 334
PFCMPEQ 265
PFCMPGE 266
PFCMPGT 266
PFMAX 285
PFMIN 281
PFMUL 352
PFNACC 347
PFPNACC 348
PFRCP 369
PFRCPIT1 370
PFRCPIT2 370
PFRSQIT1 380
PFRSQRT 379
PFSUB 336
PFSUBR 338
PI2FD 166
PI2FW 164
PINSRW 149
PMADDWD 202
PMAXSW 284
PMAXUB 283
PMINSW 280
PMINUB 279
PMOVMSKB 268
PMULHRW 199
PMULHUW 198
PMULHW 198
PMULLW 197
PMULUDQ 201
POP 307
POPA 309
POPAD 309
POPF 52
POPFD 52
POPFQ 52
POR 88
PREFETCH 441
PREFETCHNTA 441
PREFETCHT0 441
PREFETCHT1 441
PREFETCHT2 441
PREFETCHW 441
PSADBW 193
PSHUFD 152
PSHUFHW 151
PSHUFLW 151
PSHUFW 150
PSLLD 106
PSLLDQ 106
PSLLQ 106
PSLLW 106
PSRAD 116
PSRAW 116
PSRLD 113
PSRLDQ 113
PSRLQ 113
PSRLW 113
PSUBB 187
PSUBD 187
PSUBQ 187
PSUBSB 188
PSUBSW 188
PSUBUSB 188
PSUBUSW 188
PSUBW 187
PSWAPD 135
PUNPCKHBW 137
PUNPCKHDQ 139
PUNPCKHQDQ 148
PUNPCKHWD 138
PUNPCKLBW 136
PUNPCKLDQ 139
PUNPCKLQDQ 147
PUNPCKLWD 137
PUSH 307
PUSHA 309
PUSHAD 309
PUSHF 51
PUSHFD 51
PUSHFQ 51
PXOR 90
RCL 118
RCPPS 368
RCPSS 368
RCR 121
RDMSR 452
RDPMC 432
RDTSC 432
REP 75
REPE 75
REPNE 75
REPNZ 75
REPZ 75
RET 313
RETF 313
ROL 117
ROR 120
RSM 454
RSQRTPS 379
RSQRTSS 379
SAHF 51
SAL 104
SAR 114
SBB 180
SCAS 269
SCASB 269
SCASD 269
SCASQ 269
SCASW 269
SETcc *
SETA 259
SETAE 259
SETB 259
SETBE 259
SETC 259
SETE 259
SETG 259
SETGE 259
SETL 259
SETLE 259
SETNA 259
SETNAE 259
SETNB 259
SETNBE 259
SETNC 259
SETNE 259
SETNG 259
SETNGE 259
SETNL 259
SETNLE 259
SETNO 259
SETNP 259
SETNS 259
SETNZ 259
SETO 259
SETP 259
SETPE 259
SETPO 259
SETS 259
SETZ 259
SFENCE 443
SGDT 443
SHL 104
SHLD 105
SHR 110
SHRD 111
SHUFPD 155
SHUFPS 153
SIDT 449
SLDT 450
SMSW 450
SQRTPD 378
SQRTPS 377
SQRTSD 378
SQRTSS 377
STC 54
STD 77
STI 320
STMXCSR 448
STOS 75
STOSB 75
STOSD 75
STOSQ 75
STOSW 75
STR 451
SUB 180
SUBPD 336
SUBPS 336
SUBSD 337
SUBSS 337
SWAPGS 452
SYSCALL 453
SYSENTER 453
SYSEXIT 453
SYSRET 453
TEST 254
UCOMISD 262
UCOMISS 262
UD2 446
UNPCKHPD 148
UNPCKHPS 139
UNPCKLPD 148
UNPCKLPS 139
VERR 454
VERW 454
WAIT 220
WBINVD 444
WRMSR 452
XADD 179
XCHG 66
XLAT 71
XLATB 71
XOR 90
XORPD 90
XORPS 90
Appendix C: Reg/Mem Mapping
This appendix maps the addressing methods for memory load/save, jump table lookup, and call table lookup.
The same kind of memory reference that is used to access a memory table or array can also be used to
access a jump vector. Almost any register can be used alone, in a base-plus-index pair, or with an optional base
address and/or scale factor of {2, 4, or 8}, but you will note that there are some limitations in regard to the ESP
register.
(Real Mode — Near or far is same opcodes) (Protected Mode — Near or far is same opcodes)
jmp word ptr [...]
jmp dword ptr [...]
jmp fword ptr [...]
jmp NearAdrPtr[...]
eax ebx ecx edx esp ebp esi edi
eax+eax eax+ebx eax+ecx eax+edx eax+esp eax+ebp eax+esi eax+edi
ebx+eax ebx+ebx ebx+ecx ebx+edx ebx+esp ebx+ebp ebx+esi ebx+edi
ecx+eax ecx+ebx ecx+ecx ecx+edx ecx+esp ecx+ebp ecx+esi ecx+edi
edx+eax edx+ebx edx+ecx edx+edx edx+esp edx+ebp edx+esi edx+edi
esp+eax esp+ebx esp+ecx esp+edx - esp+ebp esp+esi esp+edi
ebp+eax ebp+ebx ebp+ecx ebp+edx ebp+esp ebp+ebp ebp+esi ebp+edi
esi+eax esi+ebx esi+ecx esi+edx esi+esp esi+ebp esi+esi esi+edi
edi+eax edi+ebx edi+ecx edi+edx edi+esp edi+ebp edi+esi edi+edi
ebx+eax*4 ebx+ebx*4 ebx+ecx*4 ebx+edx*4 - ebx+ebp*4 ebx+esi*4 ebx+edi*4
ecx+eax*4 ecx+ebx*4 ecx+ecx*4 ecx+edx*4 - ecx+ebp*4 ecx+esi*4 ecx+edi*4
edx+eax*4 edx+ebx*4 edx+ecx*4 edx+edx*4 - edx+ebp*4 edx+esi*4 edx+edi*4
esp+eax*4 esp+ebx*4 esp+ecx*4 esp+edx*4 - esp+ebp*4 esp+esi*4 esp+edi*4
ebp+eax*4 ebp+ebx*4 ebp+ecx*4 ebp+edx*4 - ebp+ebp*4 ebp+esi*4 ebp+edi*4
esi+eax*4 esi+ebx*4 esi+ecx*4 esi+edx*4 - esi+ebp*4 esi+esi*4 esi+edi*4
edi+eax*4 edi+ebx*4 edi+ecx*4 edi+edx*4 - edi+ebp*4 edi+esi*4 edi+edi*4
Glossary
Symbol
#
a number
A
absolute address
A fixed address within an application's code space.
AI
Artificial intelligence. Computer simulation of human behavior and intelligence.
alpha channel
A field within an RGBA (red, green, blue, alpha) color value representing the level of opacity and/or
transparency.
ALU
Arithmetic Logic Unit
AMD64
AMD's 64-bit based processor family. Superset of their 32-bit processors.
AoS
Array of Structures {XYZW}[4]
ASCII
American Standard Code for Information Interchange. A 7-bit numerical representation used to
represent control characters and the standard English language character set.
ASE
ASCII Scene Exporter (3D Studio MAX). Application Specific Extension (MIPS-3D)
B
BCD
Binary-coded decimal notation
bi-endian
A byte ordering of either big-endian or little-endian supported by a processor.
(Byte) 0 1 2 3
0x1A2B3C4D 1A 2B 3C 4D Big
4D 3C 2B 1A Little
Note that the byte is endianless; that is, whether it is big-endian or little-endian, the MSB (most
significant bit) is bit #7 and the LSB (least significant bit) is bit #0.
big-endian
The byte ordering typically used by large mainframes. For purposes of this book, that would include
the EE, VU, PowerPC-AltiVec, and PowerPC-Gekko processors.
0 1 2 3
0x1A2B3C4D 1A 2B 3C 4D
blit
The process of transferring one or more blocks of data. The etymology of the word is Bacon, Lettuce,
and Interactive Tomato.
branch prediction
A methodology used by a processor to predict at a conditional jump whether an instruction branch will
be taken or not.
BSP
Board Support Package
BTB
Branch Target Buffer
C
CD
Compact Disc (storage capacity 540MB to 700MB)
CODEC
Compression/decompression
compiler
A software tool that converts symbolic source code into object code
coprocessor
A secondary processor that adds enhanced functionality to a primary processor
CPU
Central Processor Unit
culling
A process of reducing the number of polygons needed to be passed to the rendering engine
D
DOS
Disk operating system
double extended-precision
An extra long format (80-bit) storage of floating-point data
double-precision
A long format (64-bit) storage of floating-point data
DSP
Digital Signal Processing
DV
Digital Video
DVD
Digital Versatile Disk (storage capacity 17 GB)
E
EM64T
Extended Memory 64 Technology. A superset of the IA32 (Intel architecture) used by Xeon and
advanced P4 processors.
EULA
End user license agreement
F
FIFO
First in, first out
fixed-point
A number in which the decimal point is fixed to a number of places
flat memory
An address space that appears to be one contiguous block of memory. This can be made up of RAM
or virtual memory.
floating-point
A number in which the decimal point is floating and thus can be in any position. It is typically stored
in sign, exponent, and mantissa components.
FPU
Floating-point unit
G
GCC
GNU C Compiler
GDTR
Global Descriptor Table Register
GNU
Gnu is not UNIX
GPU
Graphics processor unit
GRDB
Game Relational Data Base
I
IA32
Intel architecture 32-bit
IA64
Intel architecture 64-bit
IDE
Integrated development environment
IDTR
Interrupt Descriptor Table Register
IEEE
Institute of Electrical and Electronics Engineers
indexed address
An indexed element of an array of items. This is made up of the result of a base address summed with
the product of the index and the size of the element (D = Base + index × element_size).
IP
Instruction pointer
J
JPEG
Joint Photographic Experts Group
L
LCD
Liquid crystal display
little-endian
The byte ordering used by most modern computers. For purposes of this book that would include the
x86 and MIPS processor. A MIPS processor can be configured for big-endian but for game consoles it
is used in a little-endian configuration.
0 1 2 3
0x1A2B3C4D 4D 3C 2B 1A
LSB
Least significant bit. The related bit depends upon the endian orientation.
M
MBCS
Multi-Byte Character System
MIPS
Million Instructions Per Second. See the References section for MIPS organization.
MMX
Multimedia Extensions
MPEG
Motion Picture Experts Group
MSB
Most significant bit. The related bit depends upon the endian orientation.
MSR
Model Specific Register
N
non-temporal hint
A hint (an indicator) to the processor that memory only requires a single access (one shot). This would
be similar to copying a block of memory or performing a calculation but the result is not going to be
needed for a while so there is no need to write it into the cache. Thus the memory access has no need
to read and load cache, and therefore the code can be faster.
O
open source
A standard of the Open Source Initiative (OSI) that makes available the source code of a computer
program free of charge to the general public
P
PCM
Pulse Coded Modulation
polygon
In the context of 3D rendering, a graphical primitive within a closed plane consisting of a three-sided
(triangle) or four-sided (quadrilateral) representing a face typically covered by a texture.
Q
QNaN
Quiet Not a Number
R
relative address
A delta address from a fixed address within an application
RGB
Red Green Blue
RPL
Request Privilege Level
RPN
Reverse Polish Notation
S
SBCS
Single Byte Character System
scalar
A single operation performed on a single data element: D = A + B
scalar processor
A processor that can perform only one instruction on one data element at a time
SIMD
Single instruction multiple data. A computer instruction that performs the same instruction in parallel
for a series of isolated packed data blocks.
single-precision floating-point
A standard format (32-bit) storage of floating-point data
SMM
System Management Mode
SNaN
Signaled Not a Number
SoA
Structure of Arrays
squirrely
A term I use in an attempt to explain the behavior of a piece of code
superscalar processor
A processor that performs similarly to a scalar processor but can handle multiple data operations
simultaneously
T
temporal data
Memory that requires multiple accesses and therefore needs to be loaded into a cache for better
throughput
texture
A 2D image that is mapped upon a 3D wireframe polygon to represent its surface
TLB
Translation Lookaside Buffer
V
VBlank
Vertical Blank
vector
(1) A pointer to code or data typically used in a table (vector table); (2) A one-dimensional array; (3) A
line defined by a starting point and an ending point.
vector processor
A processor that performs an instruction on an entire array of data in a single step. See: scalar
processor and superscalar processor.
vertex
The intersection of two vectors used to define a corner of a polygon, for example, three corners of a
triangle or eight corners of a cube.
vertex normal
A direction vector perpendicular to the plane intersecting the three vertices of a triangle
VSync
Vertical Sync
VU
Vector Units
W
w-buffer
A rectangular representation of the image buffer used to store the distance of each pixel in the image
from the camera. The range of possible depth values is linearly distributed between the camera and a
point in 3D space depicted as infinity, giving a roughly uniform depth resolution across the view.
X
x86
An abbreviation for the 80x86 processor family inclusive of all its processors: 8086, 80186, 80286,
80386, 80486, Pentium, PII, PIII, P4, Xeon, K5, K6, Athlon, Duron, Opteron, etc.
Z
z-buffer
A rectangular representation of the image buffer used to store the distance of each pixel in the image
from the camera. The range of possible Z values is non-uniformly distributed between the camera and a
point in 3D space depicted as infinity, with finer resolution close to the camera than near infinity.
Alignment Macros
#define ALIGN2(len) ((len+1) & ~1) // round up to 16 bits
#define ALIGN4(len) ((len+3) & ~3) // round up to 32 bits
#define ALIGN8(len) ((len+7) & ~7) // round up to 64 bits
#define ALIGN16(len) ((len+15) & ~15) // round up to 128 bits
#define ALIGN32(len) ((len+31) & ~31) // round up to 256 bits
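For example, ALIGN16(20) evaluates to (20 + 15) & ~15 = 32, rounding a 20-byte length up to the next 16-byte boundary, while an already aligned length such as 32 is left unchanged.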
References
AMD (Advanced Micro Devices)
http://www.amd.com
Intel
http://www.intel.com
MPEG
MPEG Video Compression Standard. Joan L. Mitchell, Didier J. Legall, William B. Pennebaker, and Chad E.
Fogg, editors. Chapman & Hall Publications, 1996. ISBN: 0412087715.
SHA-1
Personal Computer
Index
board support package, see _declspec, 23
BSP denormalized, 210-213, 235
Numbers Boolean disk operating system, see
2Eh, 295
AND, 82-87, 274 DOS
3DNow!, 42, 48, 58, 484
ANDC, 96, 99-100, 278 division, 204-205, 366-368
3Eh, 295
NEG, 93 DOS, 12
66h, 44-45
NOT, 92-93 Double-Extended Precision, 9,
67h, 44
OR, 88-89, 276 55
TEST, 254 Double-Precision, 9, 55
A XOR, 90-82, 276 DPFP, see Double-Precision
abs, 324-325 branch,
accuracy, 189, 196, 209, 213, conditional, 291-293
hint, 299-302
E
228, 236, 243
EFLAGS, 49-50, 176, 436-437
addition, 175-180, 183-191, non-temporal hint, 78-80, 440,
EM64-T, 1
334-335 442
endian, 59, 131-135
AI, 2 optimization, 296-298
big, 59, 133
algebraic laws, 536 prediction, 293-301
little, 59, 132
AltiVec, xix, 22, 355 AMD, 296
exceptions, 18-19, 22, 210-
AMD32, 1 Intel, 293-294
212, 232, 245,408-410, 439,
AMD64, 1, 8, 411 static, 295
498-499
AND, 82-87 unconditional, 287-291
hardware, 498
AoS, 130-131, 340 Branch Target Buffer, see BTB
software, 498
arctangent, 237 branching, 287-305
exponent, 209-210, 227
arithmetic shift left, 103-110 BSP, 417
Extended Memory 64
arithmetic shift right, 114-117 BTB, 293-296
Technology, 1
Array of Structures, see AoS Butterfly Switch, 94-96
Extended-Precision, 9, 55
artificial intelligence, see AI
ASE, 397-401 C
_asm, 123, 134, 179, 182, 214,
cache, 78-80, 293, 296, 405-
F
243, 396, 401,488 FASTCALL, 314, 317-318
407, 438-444
ASSERT, 27 fence, 442-443
calling convention,
assertions, 25 floating-point
C, 315-316
attribute, 23 comparison, 228-235
fast call, 314, 317-318
averages, 191-192 Double-Precision, 9, 55, 208
standard, 317
Double-Extended Precision, 9,
CDECL, 315-316
55, 208
B coding standards, 15
Single-Precision, 9, 55, 208
BCD, see binary-coded comparison, 253
validating, 249-250
decimal bit, 254-257
FOR, 490
big-endian, 59-61, 131-133 floating-point, 228-235
FPU, 39, 55, 58, 207-249
binary, 35-38 compiler intrinsics, 490-491
errors, 211
binary-coded decimal, 229, const, 16
math bugs, 39
236-237, 391-401 constants, 16
math, 392-396 cosine, 237-244, 362
values, 391-392 CPU, 2 G
bit CPU detection, 411-416 GDTR, 449-450
mangling, 81 CPUID, 35, 404-416 Global Descriptor Table
scanning Register, see GDTR
forward, 122 D GNU, 23
reverse, 122 Graphics 101
data alignment, 17-24, 28-32,
shift 2D distance, 328-331
61-62, 536
arithmetic left, 103-110 ASCII to DPFP, 398-401
data conversion, 60, 129
arithmetic right, 114-117 blit, 473-480
exchanging, 131-135
logical left, 103-110 clipping, 479-480
expansion, 155-161
logical right, 110-114 copy, 475
interlacing, 136-148
rotate left, 117-119 sprite layered, 476-477
reduction, 162-164
rotate right, 120-121 sprite overlay, 477-478
shuffle, 149-153, 155
wrangling, 103 transparent, 475
splat, 153-154
testing, 255-257, 323 cross product, 363-366
unpacking, 136-140, 147-148
bits, 35 dot product, 359-363
DebugBreak function, 25, 319
blit, 466, 473-478 frame buffer, 472-473
debugging, 493-506
palette bits, 124-127 btr, 256-257 faddp, 221-222
vector magnitude, 384-388 bts, 257 fbld, 236, 396-397
vector normalize, 388-390 call, 311-313 fbstp, 236
cbw, 156-157 fchs, 221
cdq, 160-161 fclex, 245
G cdqe, 156-157 fcmovcc, 234-235
GDTR, 449-450
clc, 53, 77-78 fcom, 231-232
Global Descriptor Table
cld, 77 fcomi, 233
Register, see GDTR
clflush, 443 fcomip, 233
GNU, 23
cli, 320 fcomp, 231-232
Graphics 101
clts, 445 fcompp, 231-232
2D distance, 328-331
cmc, 54 fcos, 242
ASCII to DPFP, 398-401
cmovcc, 271 fdecstp, 219
blit, 473-480
cmp, 260-261 fdiv, 224
clipping, 479-480
cmppd, 265 fdivp, 224
copy, 475
cmpps, 265 fdivr, 224-225
sprite layered, 476-477
cmps_, 73, 263-264 fdivrp, 224-225
sprite overlay, 477-478
cmpsd, 265 femms, 58, 220
transparent, 475
cmpss, 265 ffree, 246
cross product, 363-366
cmpxchg, 272 fiadd, 221-222
dot product, 359-363
cmpxchg16b, 273 ficom, 233-234
frame buffer, 472-473
cmpxchg8b, 272-273 ficomp, 233-234
palette bits, 124-127
comisd, 261-262 fidiv, 224
vector magnitude, 384-388
comiss, 261-262 fidivr, 224-225
vector normalize, 388-390
cpuid, 35, 404-416 fild, 217
cqo, 160-161 fimul, 223-224
H cvtdq2pd, 167 fincstp, 219
half-adder, 97 cvtdq2ps, 165 finit, 245
hint, 78-80, 295, 440, 442, 458 cvtpd2dq, 168 fist, 217-218
cvtpd2pi, 169 fistp, 217-218
cvtpd2ps, 168 fisttp, 217-218
I cvtpi2pd, 170 fisub, 222
IA-32, 1 cvtpi2ps, 166 fisubr, 223
IDTR, 449-450 cvtps2dq, 165 fld, 214-217
image patterning, 7 cvtps2pd, 169 fld1, 218
inline, 241, 279, 324-325, 486, cvtps2pi, 166 fldcw, 248-249
490 cvtsd2si, 171 fldenv, 248
instruction sets cvtsd2ss, 172 fldl2e, 218
aaa, 394 cvtsi2sd, 171 fldl2t, 218
aad, 396 cvtsi2ss, 167 fldlg2, 218
aam, 395 cvtss2sd, 172 fldln2, 218
aas, 394-395 cvtss2si, 170 fldpi, 218
adc, 176-177 cvttpd2dq, 168 fldz, 218
add, 176-177 cvttpd2pi, 169 fmul, 223-224
addpd, 334-335 cvttps2dq, 165 fmulp, 223-224
addps, 334-335 cvttps2pi, 166 fnclex, 245
addsd, 335 cvttsd2si, 171 fninit, 245
addss, 335 cvttss2si, 170 fnop, 220
addsubpd, 346 cwd, 160-161 fnsave, 246
addsubps, 346 cwde, 156-157 fnstcw, 248
and, 83 daa, 392-393 fnstenv, 247
andnpd, 99-100, 278 das, 393-394 fnstsw, 249
andnps, 99-100, 278 dec, 181-183 fpatan, 237
andpd, 83, 274-275 div, 204 fprem, 225
andps, 83, 274-275 divpd, 367 fprem1, 225
arpl, 445 divps, 367 fptan, 237
bound, 445 divsd, 368 frndint, 226
bsf, 122 divss, 367-368 frstor, 246
bsr, 122 emms, 58, 220 fsave, 246
bswap, 133-135 enter, 310 fscale, 226
bswapd, 135 f2xm1, 226 fsin, 242
bt, 255 fabs, 221 fsincos, 242-243
btc, 255-256 fadd, 221-222 fsqrt, 226
fst, 217 loop, 305 out, 419
fstcw, 248 loopcc, 304-305 outs_, 420
fstenv, 247 lsl, 447 packssdw, 163-164
fstp, 217 lss, 455 packsswb, 162
fstsw, 249 ltr, 451 packuswb, 162-163
fsub, 222 maskmovdqu, 80 paddb, 184-185
fsubp, 222 maskmovq, 80 paddd, 184-185
fsubr, 223 maxpd, 286 paddq, 184-185
fsubrp, 223 maxps, 285 paddsb, 186-187
ftst, 231 maxsd, 286 paddsw, 186-187
fucom, 232-233 maxss, 285 paddusb, 186-187
fucomi, 233 mfence, 443 paddusw, 186-187
fucomip, 233 minpd, 282 paddw, 184-185
fucomp, 232-233 minps, 281 pand, 83
fucompp, 232-233 minsd, 282 pandn, 99-100
fwait, 220 minss, 281 pause, 299-303
fxam, 235 monitor, 457 pavgb, 191-192
fxch, 218-219 mov, 62 pavgusb, 191-192
fxrstor, 247 cr, 448 pavgw, 192
fxsave, 247 dr, 448 pcmpeqb, 265-266
fxtract, 227 movapd, 62 pcmpeqd, 265-266
fyl2x, 227 movaps, 62 pcmpeqw, 265-266
fyl2xp1, 227 movd, 62 pcmpgtb, 266-268
haddpd, 346-347 movddup, 154 pcmpgtd, 266-268
haddps, 346-347 movdq2q, 141-142 pcmpgtw, 266-268
hlt, 446 movdqa, 62 pextrw, 161
hsubpd, 347 movdqu, 62 pf2id, 166
hsubps, 347 movhlps, 144 pf2iw, 166
idiv, 205 movhpd, 147 pfacc, 346-347
imul, 194-195 movhps, 143 pfadd, 334-335
in, 418 movlhps, 143-144 pfcmpeq, 265-268
inc, 177-179 movlpd, 146 pfcmpge, 266-268
ins_, 73, 420 movlps, 142 pfcmpgt, 266-268
int, 318-322 movmskpd, 269 pfmax, 285
int#, 318-322 movmskps, 269 pfmin, 281
into, 318-322 movntdq, 78-79 pfmul, 352
invd, 444 movnti, 78-79 pfnacc, 347
invlpg, 446 movntpd, 79 pfpnacc, 348
iret, 320 movntps, 79 pfrcp, 369-370
iretd, 320 movntq, 78-79 pfrcpit1, 370-373
iretq, 320 movq, 62 pfrcpit2, 370-373, 380
jcc, 291-293 movq2dq, 141 pfrsqit1, 380
jcxz, 303-304 movs_, 73, 76-77, 157-158 pfrsqrt, 379
jecxz, 303-304 movsd, 145-146 pfsub, 336-337
jmp, 287-291 movshdup, 154 pfsubr, 338
jrcxz, 303-304 movsldup, 153 pi2fd, 166
lahf, 50-51 movss, 140-141 pi2fw, 164
lar, 447 movsxd, 157-158 pinsrw, 149
lddqu, 62 movupd, 62 pmaddwd, 202-203
ldmxcsr, 449 movups, 62 pmaxsw, 284
lds, 455 movzx, 158-160 pmaxub, 283-284
lea, 70 mul, 194 pminsw, 280-281
leave, 310 mulpd, 352-353 pminub, 279-280
les, 455 mulps, 352 pmovmskb, 268
lfence, 442 mulsd, 353 pmulhrw, 199-201
lfs, 455 mulss, 353 pmulhuw, 198-199
lgdt, 449-450 mwait, 458 pmulhw, 198-199
lgs, 455 neg, 93 pmullw, 196-197
lidt, 449-450 nop, 54-55 pmuludq, 201-202
lldt, 450 not, 92-93 pop, 307
lmsw, 451 or, 88 popa, 308
lock, 447 orpd, 88, 276 popad, 308
lods_, 73-74 orps, 88, 276 popf, 52-53
popfd, 52-53 rsm, 454 interrupt, 12, 28, 53, 296, 318-
popfq, 52-53 rsqrtps, 379 322, 408-410,426-427, 436-
por, 88 rsqrtss, 379 437, 449-450, 498-499
prefetch_, 441-442 sahf, 51 I/O, 417-429
psadbw, 193 sal, 104-105 Itanium, 1
pshufd, 152 sar, 114-115
pshufhw, 151 sbb, 180-181
pshuflw, 151 scas_, 73, 269-270
K
K3D, 484
pshufw, 150 set_, 258
pslld, 106-110 sfence, 443
pslldq, 106-110 sgdt, 449 L
psllq, 106-110 shl, 104-105 least significant bit, 60-61
psllw, 106-110 shld, 105-106 Linux-64, 40
psrad, 116-117 shr, 110-111 little-endian, 59-61, 129-133
psraw, 116-117 shrd, 111-112 log, 218, 227
psrld, 112-114 shufpd, 155 logical shift left, 103-110
psrldq, 112-114 shufps, 153 logical shift right, 110-114
psrlq, 112-114 sidt, 449 loop, 304-305
psrlw, 112-114 sldt, 450 LSB, see least significant bit
psubb, 187-188 smsw, 450
psubd, 187-188 sqrtpd, 378
psubq, 187-188 sqrtps, 377 M
psubsb, 188-189 sqrtsd, 378 _m128, 24
psubsw, 188-189 sqrtss, 377 _m128d, 24
psubusb, 188-189 stc, 54 _m128i, 24
psubusw, 188-189 std, 77-78 _m64, 24
psubw, 187-188 sti, 320 mantissa, 209, 375
pswapd, 135 stmxcsr, 448 masks, 323-331, 343, 356,
punpckhbw, 137 stos_, 73, 75 377, 426, 438-439,498
punpckhdq, 139-140 str, 451 MASM, 23, 481-490
punpckhqdq, 148 sub, 180-181 matrices, 22, 240
punpckhwd, 138 subpd, 336-337 matrix, 130, 239-240, 417, 503-
punpcklbw, 136 subps, 336-337 506
punpckldq, 139 subsd, 337 max, 283-286, 326-328
punpcklqdq, 147-148 subss, 337 memory,
punpcklwd, 137-138 swapgs, 452 allocation, 31, 33-34
push, 307 syscall, 453 copying, 466-471
pusha, 308 sysenter, 453 indirect addressing, 68
pushad, 308 sysexit, 453 release, 32, 34
pushf, 51-52 sysret, 453 setting, 460-466
pushfd, 51-52 test, 254 unaligned, 62
pushfq, 51-52 ucomisd, 262-263 min, 278-282, 325
pxor, 90 ucomiss, 262-263 MIPS, xix, 59, 317
rcl, 118-119 ud2, 446 MMX, 7-8, 24, 42, 48, 57-59,
rcpps, 368 unpckhpd, 148-149 409-410, 500
rcpss, 368 unpckhps, 139-140 mnemonics, 513-528
rcr, 121 unpcklpd, 148 Model Specific Register, 408,
rdmsr, 452 unpcklps, 139 410, 452
rdpmc, 432 verr, 454 modes,
rdtsc, 432-433 verw, 454 64-bit, 40
rep, 74-75 wait, 220 compatibility, 40
rep lodsx, 74 wbinvd, 444 legacy, 40
rep movsx, 77 wrmsr, 452 protected, 47, 67-69, 288-291,
rep scasx, 270 xadd, 179-180 311-314,455
rep stosx, 76 xchg, 66 real, 47, 67, 70, 288-291, 311-
repe, 75-76 xlat, 71 314, 455
repne, 75-76 xlatb, 71 most significant bit, 36, 61,
repnz, 75-76 xor, 90 273, 279, 323,392
repz, 75-76 xorpd, 90, 276 move
ret, 313 xorps, 90, 276 aligned, 63
retf, 313 INT32, 9 unaligned, 62
rol, 117-118 MSB, see most significant bit
ror, 120
MSR, see Model Specific R square root, 226, 375-384
Register radians, 237-238, 242-244, 249 SSE,8, 38, 41-42, 49, 484, 490
multiplication, 194-203 REAL4, 23, 68, 208, 222, 250, stack, 20, 307
275, 277 STDCALL,317
strings, 73, 75-77, 269-270,
N REAL8, 55, 208, 236
429
NaN, 211 Real Mode, 47
reciprocal, 366-383, 388 Structure of Arrays, see SoA
NASM, 481-483
registers, subtraction, 180-191,336-338
negate, 93, 277
Newton-Raphson, 370, 380, control, 438-439
383, 389 CPU status T
nibbles, 36 EFLAGS, 49-50, 176, 436-437 TASM,481-483
non-temporal hint, 78-80, 440, RFLAGS, 49-50, 176, 436-437 TBYTE, 236, 250, 396, 401
442 debug, 439 temporal data, 78-80,440-442
normalized, 210, 229, 243, FPU, 212-213 test jigs, 505-506
388-390 general-purpose, 42-44 TLB, 405, 407,446
NOT, 92-93 MMX, 48 ToolboxSnippet
segment, 47 Butterfly Switch, 94-96
selector, 47 Get Bit Count, 123
O task priority, 439 Translation Lookaside Buffer,
one's complement, 36-37, 54, XMM, 49 see TLB
92-94 remainder, 204-205, 225, 366, trigonometry,237-244
OR, 88-89 394-395, 471 two's complement, 36-37, 93-
ordered, 211 REPEAT, 489 94, 221
Request Privilege Level, see
P RPL
return, 310-321 U
Pentium, 8
REX, 41-42, 45, 263 unordered, 211, 213, 231-
PIII serial license, 411
rotate left, 117-119 235,262-263
pop, 52-53, 216-227, 307-310
rotate right, 120-121 unsigned,9
PowerPC, xix, 22
precision loss, 189, 196, 209, rotation
213, 228, 236,243 angle, 239-240, 253 V
prefetch, 441-442 bit, 103, 117-121 vectors,20
print, RPL, 445 3D, 21
float array, 501-502 quad, 21
matrix, 503 S Vertical Sync, 418
memory, 504-505 s-i-b, 47
quad vector, 503 signed, 9
quaternion, 503
W
significand, 209-210, 227 WHILE, 489
vector, 502 signless, 9 Windows
processor speed, 433-435 SIMD, 2 Server 2003 x64, 1
Protected Mode, 67 sine, 237-244, 362 XPProfessional 64, 1
protection rings, 437 single nstruction multiple data,
push, 51-53, 214-216, 307- see SIMD
308, 314 Single-Precision, 9,55 X
SNaN,211 XBox, 6
Q SoA, 130-131, 340 XMM, 8, 41-42, 49, 484, 490
SPFP, 9 XOR, 90-92, 94,96
QNaN, 211, 498
spin loop hint, 299-303
List of Figures
Chapter 2: Coding Standards
Figure 2-1: Visual C++ (version 6) memory alignment property page
Figure 2-2: Visual C++ .NET (version 7) memory alignment property page
Figure 3-11: Visible connections between individual bytes and the left shift of a 32-bit data element
Figure 4-2: A four-gate (OR, NOT-AND, AND) solution to achieve an equivalent result of a logical XOR
Figure 4-3: A two-gate (ANDC, OR) solution to achieve an equivalent result of a logical XOR
Figure 5-2: Miscellaneous examples of data types being logical shifted to the right by one bit
Figure 5-3: Miscellaneous examples of data types being arithmetically shifted to the right by one bit
Chapter 8: Floating-Point Anyone?
Figure 8-1: Floating-point formats
Figure 8-4: Ten-byte BCD data storage. The MSB in the far left byte (byte #9) is the sign bit and the rightmost
nine bytes (#8...0) contain the BCD value pairs. The 18th BCD digit resides in the upper nibble of byte #8 and
the first BCD digit resides in the lower nibble of byte #0. (A short packing sketch in C follows this figure list.)
Figure 8-9: The relationship of a circle with its quadrants and associated sizes
Figure 13-2: Face culling mechanism where if the angle between the camera and the perpendicular to the face
plane is obtuse, then the face is pointed away from the camera and can be culled.
Figure 13-4: Cross product (outer product). The perpendicular to the two vectors v and w.
Figure 14-2: Right triangle within 3D Cartesian coordinate system representing a 3D distance and thus its
magnitude
Chapter 19: Gfx 'R' Asm
Figure 19-1: Imagine these four differently aligned memory strands as eels. We pull out our sushi knife and
finely chop off their heads into little 8-bit (1-byte) chunks, chop off the tails into 8-bit (1-byte) chunks, and then
coarsely chop the bodies into larger 32-bit (4-byte) chunks, and serve raw.
Figure 19-3: Monster Punch — Copy blit of strawberry image on the right into the blender on the left.
Figure 19-4: Monster Punch — Transparent blit of strawberry image on the right into the blender on the left.
Figure 19-5: Transparent copy blit of strawberry sprite and blender image background to achieve composite
result of both.
Figure 19-6: The blit of a rectangular blender image to a storage buffer, then the transparent blit of a
strawberry into blender. A blit of the saved blender image back into blender effectively erases the strawberry.
Figure 21-4: Register dump from Visual C++ .NET version 7.0 (3DNow! professional extensions). Note the
3DNow! and SSE supported registers as well as the extra pop-up menu of register type selections that you can
choose from. The enabled selections are based upon the processor type running on the computer that is being
debugged.
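The ten-byte BCD layout described for Figure 8-4 above can be reproduced directly in C. The sketch below is illustrative only (the pack_bcd10() helper is hypothetical, not code from the book); it packs two decimal digits per byte into bytes #0 through #8 and leaves byte #9 for the sign:

#include <stdio.h>
#include <string.h>

// Pack an unsigned value into a 10-byte (TBYTE) BCD buffer: two decimal
// digits per byte in bytes 0..8, sign bit in the MSB of byte 9.
static void pack_bcd10(unsigned long long value, unsigned char bcd[10])
{
    memset(bcd, 0, 10);
    for (int i = 0; i < 9 && value != 0; i++)
    {
        unsigned char lo = (unsigned char)(value % 10); value /= 10;
        unsigned char hi = (unsigned char)(value % 10); value /= 10;
        bcd[i] = (unsigned char)((hi << 4) | lo);
    }
    // bcd[9] stays 0 for a positive number; 0x80 would mark it negative.
}

int main(void)
{
    unsigned char b[10];
    pack_bcd10(1234ULL, b);
    printf("%02X %02X\n", b[1], b[0]); // prints "12 34"
    return 0;
}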
List of Tables
Chapter 1: Introduction
Table 1-1: Float data type declaration
Table 3-2: SIMD instruction with register names and bit widths
Table 3-6: Mappings of inc/dec instructions that use the opcode 40h-4Fh in compatibility or legacy 32-bit
mode.
Table 3-8: Mappings of mod r/m code. 32-bit is the default. Substitute 16/64-bit for 32-bit form where needed,
such as 00-001 DS:[ECX], DS:[CX], [RCX].
Table 5-2: Instruction substitution table to convert a previous SLL (Shift Left Logical) instruction into a SRL
(Shift Right Logical), as well as masks and their complement
Table 8-3: Double-precision floating-point to hex equivalent
Table 8-9: Note the single-precision loss for a ±0.001 displacement as the number of digits goes up in
the base number. As the base number gets larger, fewer decimal places of precision can be supported. The
hexadecimal numbers in bold are where the precision was totally lost. (A short C sketch follows this table list.)
Table 8-10: This is a similar single-precision table except the displacement is between ±0.0000001. Note the
larger number of hexadecimal numbers in bold indicating a loss of precision.
Table 8-11: Note that the precision of the numbers diminishes as the number of digits increases.
Chapter 9: Comparison
Table 9-1: SSE SPFP and DPFP immediate compare codes
Table 18-4: Control register 8 (CR8) extensions. This is new for EM64T.
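The single-precision behavior summarized in Tables 8-9 through 8-11 above can be reproduced with a few lines of C (an illustrative sketch, not code from the book): as the base number gains digits, a 0.001 displacement is first rounded and then absorbed entirely.

#include <stdio.h>

int main(void)
{
    // Single precision keeps only about 7 significant decimal digits, so a
    // 0.001 displacement is gradually rounded away as the base number grows.
    float bases[] = { 1.2345f, 1234.5f, 123456.0f, 12345678.0f };
    for (int i = 0; i < 4; i++)
    {
        float displaced = bases[i] + 0.001f;
        printf("%.4f + 0.001 -> %.4f\n", bases[i], displaced);
    }
    return 0;
}

By the six-digit base the displacement is already lost, matching the pattern the tables highlight in bold.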
List of Listings
Chapter 2: Coding Standards
Listing 2-1: Assertion type definitions
Chapter 7: Integer Math
Listing 7-1: …\chap07\pas\PAddX86M.asm
Listing 8-2: Compare two {XYZ} vectors using a specified precision factor.
Listing 8-3: Compare two {XYZW} vectors using a specified precision factor.
Chapter 13: FP Vector Multiplication and Division
Listing 13-1: ...\chap13\fmd\Fmd.cpp
Listing 14-1: ...\chap14\fsf\Fsf.cpp
Listing 21-8: ...\chap20\vecBug\vecBug.cpp
32/64-Bit 80x86 Assembly Language Architecture ......................................................................... 1
Preface —(or, So Why Did He Write Yet Another Book?) ................................................. 14
Chapter 1: Introduction ...................................................................................................................... 16
Overview ........................................................................................................................................... 16
Chapter 2: Coding Standards ........................................................................................................... 19
Overview ........................................................................................................................................... 19
Constants ........................................................................................................................................... 19
Data Alignment ................................................................................................................................. 20
Stacks and Vectors ............................................................................................................................ 22
3D Vector (Floating-Point) ........................................................................................................... 22
3D Quad Vector (Floating-Point) ................................................................................................. 23
Compiler Data ................................................................................................................................... 25
Assertions .......................................................................................................................................... 26
Memory Systems .............................................................................................................................. 29
RamTest Memory Alignment Test ............................................................................................... 29
Memory Header ............................................................................................................................ 31
Allocate Memory (Malloc Wrapper) ............................................................................................ 31
Release Memory (Free Wrapper).................................................................................................. 32
Allocate Memory .......................................................................................................................... 33
Allocate (Cleared) Memory .......................................................................................................... 33
Free Memory — Pointer Is Set to NULL ..................................................................................... 33
Chapter 3: Processor Differential Insight........................................................................................ 34
Overview ........................................................................................................................................... 34
Processor Overview .......................................................................................................................... 34
History............................................................................................................................................... 36
The 64-Bit Processor......................................................................................................................... 37
80x86 Registers................................................................................................................................. 38
General-Purpose Registers ............................................................................................................ 38
REX............................................................................................................................................... 40
Segment/Selector Registers .......................................................................................................... 42
MMX Registers ............................................................................................................................. 43
XMM Registers ............................................................................................................................. 43
CPU Status Registers (EFLAGS/64-Bit RFLAGS).......................................................................... 43
LAHF — Load AH Flags ............................................................................................................. 44
SAHF — Save AH Flags .............................................................................................................. 44
PUSHF/PUSHFD — Push EFLAGS onto Stack .......................................................................... 45
PUSHFQ — Push RFLAGS onto Stack ....................................................................................... 45
POPF/POPFD — Pop EFLAGS from Stack ................................................................................ 45
POPFQ — Pop RFLAGS from Stack ........................................................................................... 45
CLC — Clear (Reset) Carry Flag ................................................................................................. 46
STC — Set Carry Flag .................................................................................................................. 46
CMC — Complement Carry Flag ................................................................................................. 46
NOP — No Operation ....................................................................................................................... 47
Floating-Point 101 ............................................................................................................................ 47
Processor Data Type Encoding ......................................................................................................... 47
EMMS — Enter/Leave MMX State ................................................................................................. 48
FEMMS — Enter/Leave MMX State ............................................................................................... 48
Destination/Source Orientations ....................................................................................................... 49
Big/Little-Endian .............................................................................................................................. 49
Alignment Quickie ............................................................................................................................ 50
(Un)aligned Memory Access ............................................................................................................ 51
MOV/MOVQ — Move Data ........................................................................................................ 51
Move (Unaligned) ......................................................................................................................... 51
Move (Aligned) ............................................................................................................................. 52
XCHG — Exchange (Swap) Data ................................................................................................ 54
System Level Functionality .............................................................................................................. 54
Indirect Memory Addressing ............................................................................................................ 55
uint32 OddTable[ ]........................................................................................................................ 56
LEA — Load Effective Address ................................................................................................... 57
Translation Table .............................................................................................................................. 57
XLAT/XLATB — Translation Table Lookup .............................................................................. 57
String Instructions ............................................................................................................................. 59
LODSB/LODSW/LODSD/LODSQ — Load String .................................................................... 59
REP LODSx .................................................................................................................................. 59
STOSB/STOSW/STOSD/STOSQ — Save String ....................................................................... 60
REP/REPE/REPZ/REPNE/REPNZ — Repeat String .................................................................. 61
REP STOSx .................................................................................................................................. 61
MOVSB/MOVSW/MOVSD/MOVSQ — Move String............................................................... 61
REP MOVSx ................................................................................................................................. 62
CLD/STD — Clear/Set Direction Flag ......................................................................................... 62
Special (Non-Temporal) Memory Instructions ................................................................................. 63
MOVNTx — Copy Using Non-Temporal Hint ............................................................................ 63
MOVNTPS — Copy 4SPFP Using Non-Temporal Hint ........................................................... 63
MOVNTPD — Copy 2DPFP Using Non-Temporal Hint .......................................................... 63
MASKMOVQ/MASKMOVDQU — Copy Selected Bytes ......................................................... 64
Exercises ........................................................................................................................................... 64
Chapter 4: Bit Mangling ..................................................................................................................... 65
Overview ........................................................................................................................................... 65
Boolean Logical AND ...................................................................................................................... 66
Pseudo Vec.................................................................................................................................... 67
Boolean Logical OR ......................................................................................................................... 70
Pseudo Vec.................................................................................................................................... 71
Boolean Logical XOR (Exclusive OR)............................................................................................. 72
Pseudo Vec.................................................................................................................................... 73
NOT — One's Complement Negation .......................................................................................... 74
NEG — Two's Complement Negation ......................................................................................... 74
ToolBox Snippet — The Butterfly Switch ................................................................................... 75
I-VU-Q .......................................................................................................................................... 76
Boolean Logical ANDC .................................................................................................................... 78
Pseudo Vec.................................................................................................................................... 79
Exercises ........................................................................................................................................... 80
Chapter 5: Bit Wrangling ................................................................................................................... 81
Overview ........................................................................................................................................... 81
Logical Left Shifting ......................................................................................................................... 81
SHL/SAL – Shift (Logical/Arithmetic) Left................................................................................. 81
SHLD — Shift (Logical) Left (Double) ....................................................................................... 82
PSLLx — Parallel Shift Left (Logical) ......................................................................................... 83
Pseudo Vec.................................................................................................................................... 83
Pseudo Vec (x86) .......................................................................................................................... 85
Logical Right Shifting....................................................................................................................... 86
SHR — Shift (Logical) Right ....................................................................................................... 86
SHRD — Shift (Logical) Right (Double) ..................................................................................... 87
PSRLx — Parallel Shift Right (Logical) ...................................................................................... 88
Pseudo Vec.................................................................................................................................... 88
Logical Right Shifting....................................................................................................................... 89
SHR — Shift (Logical) Right ....................................................................................................... 89
SHRD — Shift (Logical) Right (Double) ..................................................................................... 90
PSRLx — Parallel Shift Right (Logical) ...................................................................................... 91
Pseudo Vec.................................................................................................................................... 91
Rotate Left (or n-Right) .................................................................................................................... 92
ROL — Rotate Left ...................................................................................................................... 92
RCL — Rotate Carry Left............................................................................................................. 93
Rotate Right ...................................................................................................................................... 94
ROR — Rotate Right .................................................................................................................... 94
RCR — Rotate Carry Right .......................................................................................................... 95
Bit Scanning ...................................................................................................................................... 95
BSF — Bit Scan Forward ............................................................................................................. 95
BSR — Bit Scan Reverse ............................................................................................................. 96
ToolBox Snippet — Get Bit Count............................................................................................... 96
Graphics 101 — Palette Bits ......................................................................................................... 97
Exercises ......................................................................................................................................... 100
Chapter 6: Data Conversion ........................................................................................................... 102
Data Interlacing, Exchanging, Unpacking, and Merging ............................................................... 102
Byte Swapping ................................................................................................................................ 103
Little-Endian ............................................................................................................................... 103
(Big/Little)-Endian to (Big/Little)-Endian Data Relationship Macros ....................................... 104
BSWAP — Byte Swap ............................................................................................................... 104
PSWAPD — Packed Swap Double Word .................................................................................. 106
Data Interlacing ............................................................................................................................... 106
PUNPCKLBW — Parallel Extend Lower from Byte................................................................. 106
PUNPCKHBW — Parallel Extend Upper from Byte................................................................. 107
PUNPCKLWD — Parallel Extend Lower from 16-Bit .............................................................. 107
PUNPCKHWD — Parallel Extend Upper from 16-Bit .............................................................. 108
PUNPCKLDQ — Parallel Extend Lower from 32-Bit............................................................... 108
PUNPCKHDQ — Parallel Extend Upper from 32-Bit .............................................................. 109
MOVSS — Move Scalar (SPFP) ................................................................................................ 110
MOVQ2DQ — Move Scalar (1×32-Bit) MMX to XMM .......................................................... 110
MOVDQ2Q — Move Scalar (1×32-bit) XMM to MMX........................................................... 110
MOVLPS — Move Low Packed (2×SPFP) ............................................................................... 111
MOVHPS — Move High Packed (2×SPFP) .............................................................................. 111
MOVLHPS — Move Low to High Packed (2×SPFP) ............................................................... 111
MOVHLPS — Move High to Low Packed (2×SPFP) ............................................................... 112
MOVSD — Move Scalar (1×DPFP) .......................................................................................... 112
MOVLPD — Move Low Packed (1×DPFP) .............................................................................. 113
MOVHPD — Move High Packed (1×DPFP) ............................................................................. 113
PUNPCKLQDQ — Parallel Copy Lower (2×64-Bit) ................................................................ 114
PUNPCKHQDQ — Parallel Copy Upper (2×64-Bit) ................................................................ 114
Swizzle, Shuffle, and Splat ............................................................................................................. 115
PINSRW — Shuffle (1×16-Bit) to (4×16-Bit) ........................................................................... 115
PSHUFW — Shuffle Packed Words (4×16-Bit) ........................................................................ 115
PSHUFLW — Shuffle Packed Low Words (4×16-Bit) ............................................................. 116
PSHUFHW — Shuffle Packed High Words (4×16-Bit) ............................................................ 116
PSHUFD — Shuffle Packed Double Words (4×32-Bit) ............................................................ 116
SHUFPS — Shuffle Packed SPFP Values (4×SPFP) ................................................................. 117
MOVSLDUP — Splat Packed Even SPFP to (4×SPFP) ............................................................ 117
MOVSHDUP — Splat Packed Odd SPFP to (4×SPFP)............................................................. 117
MOVDDUP — Splat Lower DPFP to Packed (2×DPFP) .......................................................... 118
SHUFPD — Shuffle Packed DPFP (2×64-Bit) .......................................................................... 118
Data Bit Expansion ......................................................................................................................... 118
CBW Convert Signed AL (Byte) to AX (Word) ........................................................................ 119
CWDE Convert Signed AX (Word) to EAX (DWord) .............................................................. 119
CDQE Convert Signed EAX (DWord) to RAX (QWord).......................................................... 119
MOVSX/MOVSXD — Move with Sign Extension ................................................................... 119
MOVZX — Move with Zero Extension ..................................................................................... 121
CWD — Convert Signed AX (Word) to DX:AX ....................................................................... 122
CDQ — Convert Signed EAX (DWord) to EDX:EAX.............................................................. 122
CQO — Convert Signed RAX (QWord) to RDX:RAX ............................................................. 122
PEXTRW — Extract (4×16-bit) into Integer to (1×16) ............................................................. 123
Data Bit Reduction (with Saturation) ............................................................................................. 123
PACKSSWB — Packed Signed int16 to int8 with Saturation ................................................... 123
PACKUSWB — Packed uint16 to uint8 with Saturation........................................................... 124
PACKSSDW — Packed int32 to int16 with Saturation ............................................................. 124
Data Conversion (Integer : Float, Float : Integer, Float : Float) ..................................................... 125
PI2FW — Convert Packed Even int16 to SPFP ......................................................................... 125
CVTDQ2PS — Convert Packed int32 to SPFP .......................................................................... 125
CVTPS2DQ — Convert Packed SPFP to int32.......................................................................... 125
CVTPI2PS — Convert Lo Packed int32 to SPFP ...................................................................... 126
CVTPS2PI — Convert Lo Packed SPFP to int32 ...................................................................... 126
CVTSI2SS — Convert Scalar int32 to SPFP ............................................................................. 127
CVTDQ2PD — Convert Even Packed int32 to DPFP ............................................................... 127
CVTPD2DQ — Convert Packed DPFP to Even int32 ............................................................... 127
CVTPD2PS — Convert Packed DPFP to Lo SPFP ................................................................... 127
CVTPS2PD — Convert Lo Packed SPFP to DPFP ................................................................... 128
CVTPD2PI — Convert Packed DPFP to int32 .......................................................................... 128
CVTPI2PD — Convert Packed int32 to DPFP .......................................................................... 128
CVTSS2SI — Convert Scalar SPFP to int32/64 ........................................................................ 129
CVTSD2SI — Convert Scalar DPFP to Int ................................................................................ 129
CVTSI2SD — Convert Scalar Int to DPFP ................................................................................ 129
CVTSD2SS — Convert Scalar DPFP to SPFP........................................................................... 130
CVTSS2SD — Convert Scalar SPFP to DPFP........................................................................... 130
Exercises ......................................................................................................................................... 130
Chapter 7: Integer Math .................................................................................................................. 131
Overview ......................................................................................................................................... 131
General Integer Math ...................................................................................................................... 131
ADD — Add ............................................................................................................................... 131
ADC — Add with Carry ............................................................................................................. 131
INC — Increment by 1 ............................................................................................................... 133
XADD — Exchange and Add..................................................................................................... 134
SUB — Subtract ......................................................................................................................... 135
SBB — Subtract with Borrow .................................................................................................... 135
DEC — Decrement by 1 ............................................................................................................. 136
Packed Addition and Subtraction.................................................................................................... 137
PADDB/PADDW/PADDD/PADDQ Integer Addition .............................................................. 138
Vector {8/16/32/64}-Bit Int Addition with Saturation ............................................................... 139
PSUBB/PSUBW/PSUBD/PSUBQ Integer Subtraction ............................................................. 140
Vector {8/16/32/64}-Bit Integer Subtraction with Saturation .................................................... 140
Vector Addition and Subtraction (Fixed Point) .............................................................................. 141
Pseudo Vec.................................................................................................................................. 141
Pseudo Vec (x86) ........................................................................................................................ 141
Averages ......................................................................................................................................... 142
PAVGB/PAVGUSB — N×8-Bit [Un]signed Integer Average .................................................. 142
PAVGW — N×16-Bit [Un]signed Integer Average ................................................................... 143
Sum of Absolute Differences .......................................................................................................... 143
PSADBW — N×8-Bit Sum of Absolute Differences ................................................................. 143
8×8-Bit Sum of Absolute Differences ........................................................................................ 144
16×8-Bit Sum of Absolute Differences ...................................................................................... 144
Integer Multiplication ..................................................................................................................... 144
MUL — Unsigned Multiplication (Scalar) .................................................. 144
IMUL — Signed Multiplication (Scalar) .................................................................................... 145
Packed Integer Multiplication ......................................................................................................... 145
PMULLW — N×16-Bit Parallel Multiplication (Lower) ........................................................... 146
PMULHW/PMULHUW — N×16-Bit Parallel Multiplication (Upper) ..................................... 147
PMULHRW — Signed 4×16-Bit Multiplication with Rounding (Upper) ................................. 147
Pseudo Vec (x86) ........................................................................................................................ 148
PMULUDQ — Unsigned N×32-Bit Multiply Even ................................................................... 149
PMADDWD — Signed N×16-Bit Parallel Multiplication — ADD .......................................... 150
Exercises ......................................................................................................................................... 150
Chapter 8: Floating-Point Anyone? ............................................................................................... 151
Overview ......................................................................................................................................... 151
The Floating-Point Number ............................................................................................................ 151
FPU Registers ............................................................................................................................. 154
Loading/Storing Numbers and the FPU Stack ................................................................................ 155
FLD — Floating-Point Load ....................................................................................................... 155
FST/FSTP — FPU Floating-Point Save ..................................................................................... 158
FILD — FPU Integer Load ......................................................................................................... 158
FIST/FISTP/FISTTP — FPU Integer Save ................................................................................ 158
FPU Constants ............................................................................................................................ 159
FXCH .......................................................................................................................................... 159
FINCSTP — FPU Increment Stack Pointer................................................................................ 160
FDECSTP — FPU Decrement Stack Pointer ............................................................................. 160
FWAIT/WAIT ............................................................................................................................ 160
EMMS/FEMMS.......................................................................................................................... 160
FNOP .......................................................................................................................................... 160
General Math Instructions............................................................................................................... 161
FCHS — FPU Two's Complement ST(0) = – ST(0) .................................................................. 161
FABS — FPU Absolute Value ST(0) = |ST(0)| .......................................................................... 161
FADD/FADDP/FIADD — FPU Addition D = ST(0) + A ......................................................... 161
FSUB/FSUBP/FISUB — FPU Subtraction D = ST(0) – A........................................................ 162
FSUBR/FSUBRP/FISUBR — FPU Reverse Subtraction D = A – ST(0) ................................... 162
FMUL/FMULP/FIMUL — FPU Multiplication D = ST(0) × A................................................ 163
FDIV/FDIVP/FIDIV — FPU Division D = Dst ÷ Src ............................................................... 163
FDIVR/FDIVRP/FIDIVR — FPU Reverse Division D = Src ÷ Dst .......................................... 163
FPREM — FPU Partial Remainder ............................................................................................ 164
FPREM1 — FPU Partial Remainder .......................................................................................... 164
FRNDINT — FPU Round to Integer .......................................................................................... 164
General Math Instructions............................................................................................................... 164
FCHS — FPU Two's Complement ST(0) = – ST(0) .................................................................. 164
FABS — FPU Absolute Value ST(0) = |ST(0)| .......................................................................... 165
FADD/FADDP/FIADD — FPU Addition D = ST(0) + A ......................................................... 165
FSUB/FSUBP/FISUB — FPU Subtraction D = ST(0) – A........................................................ 165
FSUBR/FSUBRP/FISUBR — FPU Reverse Subtraction D = A – ST(0) ................................... 166
FMUL/FMULP/FIMUL — FPU Multiplication D = ST(0) × A................................................ 166
FDIV/FDIVP/FIDIV — FPU Division D = Dst ÷ Src ............................................................... 167
FDIVR/FDIVRP/FIDIVR — FPU Reverse Division D = Src ÷ Dst .......................................... 167
FPREM — FPU Partial Remainder ............................................................................................ 168
FPREM1 — FPU Partial Remainder .......................................................................................... 168
FRNDINT — FPU Round to Integer .......................................................................................... 168
Advanced Math Instructions ........................................................................................................... 168
FSQRT — FPU ST(0) Square Root ........................................................................................... 168
FSCALE — FPU Scale ST(0) = ST(0) << ST(1) ....................................................................... 168
F2XM1 — FPU ST(0) = 2^ST(0) – 1 .............................................................................. 169
FYL2X — FPU ST(0) = y log2x ................................................................................................. 169
FYL2XP1 — FPU ST(0) = y log2(x+1)...................................................................................... 169
FXTRACT — FPU Extract Exponent and Significand .............................................................. 169
Floating-Point Comparison ............................................................................................................. 169
FTST — FPU Test If Zero .......................................................................................................... 172
FCOM/FCOMP/FCOMPP — FPU Unordered CMP FP ........................................................... 172
FUCOM/FUCOMP/FUCOMPP — FPU Unordered CMP FP................................................... 173
FCOMI/FCOMIP/FUCOMI/FUCOMIP — FPU A ? B and EFLAGS ....................................... 173
FICOM/FICOMP — FPU A ? B ................................................................................................ 174
FCMOVcc — FPU Conditional Move ....................................................................................... 174
FXAM — FPU Examine ............................................................................................................ 175
FPU BCD (Binary-Coded Decimal) ............................................................................................... 176
FBLD — FPU (BCD Load) ........................................................................................................ 176
FBSTP — FPU (BCD Save and Pop ST(0)) .............................................................................. 176
FPU Trigonometry .......................................................................................................................... 176
FPTAN — FPU Partial Tangent ................................................................................................. 176
FPATAN — FPU Partial Arctangent ......................................................................................... 177
FSINCOS — Sine and Cosine .................................................................................................... 177
Pseudo Vec.................................................................................................................................. 179
Pseudo Vec (x86) ........................................................................................................................ 179
FSIN — FPU Sine ...................................................................................................................... 180
FCOS — FPU Cosine ................................................................................................................. 180
FSINCOS — FPU Sine and Cosine ............................................................................................ 180
FPU System Instructions................................................................................................................. 182
FINIT/FNINIT — FPU Init ........................................................................................................ 182
FCLEX/FNCLEX — FPU Clear Exceptions ............................................................................. 182
FFREE — FPU Free FP Register ............................................................................................... 183
FSAVE/FNSAVE — FPU Save X87 FPU, MMX, SSE, SSE2 ................................................. 183
FRSTOR — FPU Restore x87 State ........................................................................................... 183
FXSAVE — FPU Save x87 FPU, MMX, SSE, SSE2, SSE3 ..................................................... 183
FXRSTOR — FPU Restore x87 FPU, MMX, SSE, SSE2, SSE3 .............................................. 184
FSTENV/FNSTENV — FPU Store x87 Environment............................................................... 184
FLDENV — FPU Load x87 Environment ................................................................................. 184
FSTCW/FNSTCW — FPU Store x87 Control Word ................................................................. 184
FLDCW — FPU Load x87 Control Word.................................................................................. 185
FSTSW/FNSTSW — FPU Store x87 Status Word .................................................................... 185
Validating (Invalid) Floating-Point................................................................................................. 185
Exercises ......................................................................................................................................... 186
Chapter 9: Comparison ................................................................................................................... 187
Overview ......................................................................................................................................... 187
TEST — Logical Compare A ∧ B .................................................................................. 187
Indexed Bit Testing ......................................................................................................................... 188
BT — Bit Test ............................................................................................................................. 188
BTC — Bit Test and Complement.............................................................................................. 189
BTR — Bit Test and Reset (Clear) ............................................................................................. 189
BTS — Bit Test and Set.............................................................................................................. 190
SETcc — Set Byte on Condition .................................................................................................... 191
Comparing Operands and Setting EFLAGS ................................................................................... 192
CMP — Compare Two Operands ............................................................................................... 192
COMISS — Compare Scalar SPFP, Set EFLAGS ..................................................................... 193
COMISD — Compare Scalar DPFP, Set EFLAGS.................................................................... 193
UCOMISS — Unordered Cmp Scalar SPFP, Set EFLAGS ....................................................... 194
UCOMISD — Unordered Cmp Scalar DPFP, Set EFLAGS ..................................................... 194
CMPSB/CMPSW/CMPSD/CMPSQ — Compare String Operands .......................................... 195
CMP — Packed Comparison .......................................................................................................... 196
CMPPS/CMPSS/CMPPD/CMPSD — Floating-Point ............................................................... 196
Packed Compare if Equal to (=) ................................................................................................. 196
Packed Compare if Greater Than or Equal (≥) ........................................................... 197
Packed Compare if Greater Than (>) .......................................................................................... 197
Extract Packed Sign Masks ............................................................................................................. 199
PMOVMSKB — Extract Packed Byte (Sign) Mask .................................................................. 199
MOVMSKPS — Extract Packed SPFP Sign Mask .................................................................... 199
MOVMSKPD — Extract Packed DPFP Sign Mask .................................................................. 199
SCAS/SCASB/SCASW/SCASD/SCASQ — Scan String ............................................................. 200
REP SCASx ................................................................................................................................ 200
CMOVcc — Conditional Move ...................................................................................................... 200
CMPXCHG — Compare and Exchange......................................................................................... 202
CMPXCHG8B — Compare and Exchange 64 Bits ................................................................... 202
CMPXCHG16B — Compare and Exchange 128 Bits ............................................................... 203
Boolean Operations upon Floating-Point Numbers ........................................................................ 203
ANDPS — Logical AND of Packed SPFP D = A ∧ B ............................................... 204
ANDPD — Logical AND of Packed DPFP ............................................................................... 204
Pseudo Vec — (XMM) FABS — FP Absolute A = | A | ........................................................... 204
Pseudo Vec — (3DNow!) FABS — FP Absolute A = | A | ....................................................... 204
ORPS — Logical OR of Packed SPFP D = A ∨ B ..................................................... 205
ORPD — Logical OR of Packed DPFP ...................................................................................... 205
XORPS — Logical XOR of Packed SPFP D = A ⊕ B .............................................. 205
XORPD — Logical XOR of Packed DPFP ................................................................................ 205
Pseudo Vec — FCHS — FP Change Sign A = – A ................................................................... 205
ANDNPS — Logical ANDC of Packed SPFP D = A ∧ (¬B) .................................... 206
ANDNPD — Logical ANDC of Packed DPFP .......................................................................... 206
Min — Minimum ............................................................................................................................ 206
Pseudo Vec.................................................................................................................................. 207
8-Bit Integer Minimum ......................................................................................................... 207
16-Bit Integer Minimum ....................................................................................................... 208
SPFP Minimum...................................................................................................................... 208
SPFP Scalar Minimum ........................................................................................................... 209
DPFP Minimum ...................................................................................................................... 209
DPFP Scalar Minimum ........................................................................................................... 209
Max — Maximum........................................................................................................................... 210
8-Bit Integer Maximum ........................................................................................................ 210
16-Bit Integer Maximum ...................................................................................................... 210
SPFP Maximum ..................................................................................................................... 211
SPFP Scalar Maximum ........................................................................................................... 211
DPFP Maximum ..................................................................................................................... 212
DPFP Scalar Maximum .......................................................................................................... 212
Chapter 10: Branching .................................................................................................................... 213
Overview ......................................................................................................................................... 213
Jump Unconditionally ..................................................................................................................... 213
JMP — Jump............................................................................................................................... 213
Delta JMP.................................................................................................................................... 214
Protected Mode JMP (NEAR) .................................................................................................... 214
Protected Mode JMP (FAR) ....................................................................................................... 214
Jump Conditionally ......................................................................................................................... 216
Jcc — Branching ......................................................................................................................... 216
Branch Prediction............................................................................................................................ 217
Intel Branch Prediction ............................................................................................................... 218
Static Branch Prediction ............................................................................................................. 218
AMD Branch Prediction ............................................................................................................. 219
Branch Optimization ................................................................................................................... 219
PAUSE — (Spin Loop Hint) .......................................................................................................... 221
I-VU-Q ........................................................................................................................................ 222
JECXZ/JCXZ — Jump if ECX/CX Is Zero ................................................................................ 225
LOOPcc....................................................................................................................................... 226
LOOP .......................................................................................................................................... 227
Pancake Memory LIFO Queue ....................................................................................................... 227
Stack ................................................................................................................................................ 227
PUSH — Push Value onto Stack ................................................................................................ 227
POP — Pop Value off Stack ....................................................................................................... 228
PUSHA/PUSHAD — Push All General-Purpose Registers ....................................................... 228
POPA/POPAD — Pop All General-Purpose Registers .............................................................. 229
PUSHFD/PUSHFQ and POPFD/POPFQ ................................................................................... 230
ENTER — Allocate Stack Frame for Procedure ARGS ............................................................ 230
LEAVE — Deallocate Stack Frame of Procedure ARGS .......................................................... 230
CALL Procedure (Function) ........................................................................................................... 231
CALL .......................................................................................................................................... 231
Protected Mode CALL (NEAR) ................................................................................................. 231
Protected Mode CALL (FAR) .................................................................................................... 232
RET/RETF — Return ................................................................................................................. 233
Calling Conventions (Stack Argument Methods) ........................................................................... 234
C Declaration (__CDECL) ......................................................................................................... 234
Standard Declaration (__STDCALL) ......................................................................................... 235
Fast Call Declaration (__FASTCALL) ....................................................................................... 236
Interrupt Handling ........................................................................................................................... 236
INT/INTO — Call Interrupt Procedure ...................................................................................... 236
IRET/IRETD/IRETQ — Interrupt Return .................................................................................. 237
CLI/STI — Clear (Reset)/Set Interrupt Flag .............................................................................. 238
Chapter 11: Branchless ................................................................................................................... 240
Overview ......................................................................................................................................... 240
Function y=ABS(x) 'Absolute' D = | A | ......................................................................................... 240
Function y=MIN(p, q) 'Minimum' .................................................................................................. 241
Function y=MAX(p, q) 'Maximum' ................................................................................................ 242
Graphics 101 — Quick 2D Distance .......................................................................................... 244
Chapter 12: Floating-Point Vector Addition and Subtraction .................................................... 247
Overview ......................................................................................................................................... 247
Floating-Point Vector Addition and Subtraction ............................................................................ 247
Vector Floating-Point Addition .................................................................................................. 247
Vector Floating-Point Addition with Scalar ............................................................................... 248
Vector Floating-Point Subtraction .............................................................................................. 248
Vector Floating-Point Subtraction with Scalar ........................................................................... 249
Vector Floating-Point Reverse Subtraction ................................................................................ 250
Pseudo Vec.................................................................................................................................. 251
Pseudo Vec (x86) ........................................................................................................................ 252
Vector Scalar Addition and Subtraction ......................................................................................... 255
Single-Precision Quad Vector Float Scalar Addition ................................................................. 255
Single-Precision Quad Vector Float Scalar Subtraction ............................................................. 255
Special — FP Vector Addition and Subtraction ............................................................................. 256
Vector Floating-Point Addition and Subtraction ........................................................................ 256
HADDPS/HADDPD/PFACC — Vector Floating-Point Horizontal Addition ........................... 256
HSUBPS/HSUBPD/PFNACC — Vector Floating-Point Horizontal Subtraction ..................... 257
PFPNACC — Vector Floating-Point Horizontal Add/Sub ........................................................ 258
Exercises ......................................................................................................................................... 258
Chapter 13: FP Vector Multiplication and Division...................................................................... 259
Overview ......................................................................................................................................... 259
Floating-Point Multiplication ......................................................................................................... 259
Vector Floating-Point Multiplication .......................................................................... 259
(Semi-Vector) DPFP Multiplication ........................................................................................... 260
SPFP Scalar Multiplication ......................................................................................................... 260
DPFP Scalar Multiplication ........................................................................................................ 260
Vector (Float) Multiplication — ADD ....................................................................................... 260
Pseudo Vec.................................................................................................................................. 261
Pseudo Vec (x86) ........................................................................................................................ 262
Vector Floating-Point Division ....................................................................................................... 264
(Vector) SPFP Division .............................................................................................................. 264
(Semi-Vector) DPFP Division .................................................................................................... 264
SPFP Scalar Division .................................................................................................................. 265
DPFP Scalar Division ................................................................................................................. 265
N×SPFP Reciprocal .................................................................................................................... 265
1×SPFP Reciprocal (14-Bit) ....................................................................................................... 266
SPFP Reciprocal (2 Stage) (24-Bit) ............................................................................................ 267
Pseudo Vec.................................................................................................................................. 267
Pseudo Vec (x86) ........................................................................................................................ 268
Exercises ......................................................................................................................................... 269
Chapter 14: Floating-Point Deux ................................................................................................... 270
Overview ......................................................................................................................................... 270
SQRT — Square Root .................................................................................................................... 270
1×SPFP Scalar Square Root ....................................................................................................... 271
4×SPFP Square Root .................................................................................................................. 271
1×DPFP Scalar Square Root ....................................................................................................... 272
2×DPFP Square Root .................................................................................................................. 272
1×SPFP Scalar Reciprocal Square Root (15-Bit) ....................................................................... 272
Pseudo Vec.................................................................................................................................. 273
Pseudo Vec (x86) ........................................................................................................................ 273
SPFP Square Root (2 Stage) (24-Bit) ......................................................................................... 274
Vector Square Root ..................................................................................................................... 275
Pseudo Vec.................................................................................................................................. 275
Pseudo Vec (x86) ........................................................................................................................ 275
Graphics 101 — Vector Magnitude (aka 3D Pythagorean Theorem) ........................................ 277
Pseudo Vec.................................................................................................................................. 279
Pseudo Vec (x86) ........................................................................................................................ 279
Chapter 15: Binary-Coded Decimal (BCD) .................................................................................. 281
Overview ......................................................................................................................................... 281
BCD ................................................................................................................................................ 281
DAA — Decimal Adjust AL (After) Addition ........................................................................... 282
DAS — Decimal Adjust AL (After) Subtraction ....................................................................... 282
AAA — ASCII Adjust (After) Addition .................................................................................... 283
AAS — ASCII Adjust AL (After) Subtraction........................................................................... 283
AAM — ASCII Adjust AX (After) Multiplication .................................................................... 283
AAD — ASCII Adjust AX (Before) Division ............................................................................ 284
FBLD — FPU (BCD Load) ........................................................................................................ 284
Graphics 101 ................................................................................................................................... 285
ASCII String to Double-Precision Float ..................................................................................... 286
ASCII to Double ......................................................................................................................... 286
Chapter 16: What CPUID? ............................................................................................................. 289
Overview ......................................................................................................................................... 289
CPUID............................................................................................................................................. 289
Standard CPUID EDX-Feature Flags ......................................................................................... 292
Intel — Standard CPUID ECX-Feature Flags ............................................................................ 293
Intel — Extended #1 CPUID EDX-Feature Flags ...................................................................... 294
AMD — Extended #1 CPUID EDX-Feature Flags .................................................................... 294
PIII Serial License........................................................................................................................... 295
Sample CPU Detection Code .......................................................................................................... 296
x86 CPU Detect — Bit Flags ...................................................................................................... 296
x86 CPU Detect — Vendors ....................................................................................................... 297
CPU Detect — Information ......................................................................................... 297
Chapter 17: PC I/O........................................................................................................................... 305
Overview ......................................................................................................................................... 305
IN — Input from Port ..................................................................................................................... 305
Vertical Sync ............................................................................................................................... 305
OUT — Output to Port ................................................................................................................... 306
INSx — Input from Port to String .................................................................................................. 307
OUTSx — Output String to Port .................................................................................................... 307
Serial/Parallel Port for IBM PC .................................................................................................. 308
Parallel Port ................................................................................................................................. 308
Parallel Port Dip Switches .......................................................................................................... 309
Serial Port.................................................................................................................................... 311
Chapter 18: System ......................................................................................................................... 316
System "Lite" .................................................................................................................................. 316
System Timing Instructions ............................................................................................................ 316
RDPMC — Read Performance-Monitoring Counters ................................................ 316
RDTSC — Read Time-Stamp Counter....................................................................................... 316
Calculating Processor Speed ....................................................................................................... 317
80x86 Architecture...................................................................................................................... 319
CPU Status Registers (32-Bit EFLAGS/64-Bit RFLAGS) ........................................................ 320
Protection Rings .......................................................................................................................... 321
Control Registers ........................................................................................................................ 321
Debug Registers .......................................................................................................................... 323
Cache Manipulation ........................................................................................................................ 323
Cache Sizes ................................................................................................................................. 323
Cache Line Sizes ......................................................................................................................... 323
PREFETCHx — Prefetch Data into Caches................................................................................ 324
LFENCE — Load Fence ............................................................................................................. 325
SFENCE — Store Fence ............................................................................................................. 325
MFENCE — Memory Fence ...................................................................................................... 325
CLFLUSH — Flush Cache Line................................................................................................. 325
INVD — Invalidate Cache (w/o Writeback) .............................................................. 326
WBINVD — Write Back and Invalidate Cache ......................................................................... 326
System Instructions ......................................................................................................................... 326
ARPL — Adjust Requested Privilege Level .............................................................................. 326
BOUND — Check Array Index For Bounding Error ................................................................. 326
CLTS — Clear Task Switch Flag ............................................................................................... 327
HLT — Halt Processor ............................................................................................................... 327
UD2 — Undefined Instruction.................................................................................................... 327
INVLPG — Invalidate TLB ....................................................................................................... 327
LAR — Load Access Rights....................................................................................................... 327
LOCK — Assert LOCK# Signal Prefix ....................................................................... 328
LSL — Load Segment Limit ...................................................................................................... 328
MOV — Move To/From Control Registers................................................................................ 328
MOV — Move To/From Debug Registers ................................................................................. 328
STMXCSR — Save MXCSR Register State .............................................................................. 329
LDMXCSR — Load MXCSR Register State ............................................................................. 329
SGDT/SIDT — Save Global/Interrupt Descriptor Table ........................................................... 329
LGDT/LIDT — Load Global/Interrupt Descriptor Table .......................................................... 329
SLDT — Save Local Descriptor Table ....................................................................................... 329
LLDT — Load Local Descriptor Table ...................................................................................... 330
SMSW — Save Machine Status Word ....................................................................................... 330
LMSW — Load Machine Status Word ...................................................................................... 330
STR — Save Task Register ........................................................................................................ 330
LTR — Load Task Register ........................................................................................................ 330
RDMSR — Read from Model Specific Register ........................................................................ 331
WRMSR — Write to Model Specific Register........................................................................... 331
SWAPGS — Swap GS Base Register ........................................................................................ 331
SYSCALL — 64-Bit Fast System Call ...................................................................................... 331
SYSRET — Fast Return from 64-Bit Fast System Call ............................................................. 331
SYSENTER — Fast System Call ............................................................................................... 332
SYSEXIT — Fast Return from Fast System Call ....................................................................... 332
RSM — Resume from System Management Mode ................................................................... 332
VERR/VERW — Verify Segment for Reading/Writing ............................................ 332
LDS/LES/LFS/LGS/LSS — Load Far Pointer ........................................................................... 332
Hyperthreading Instructions............................................................................................................ 335
MONITOR — Monitor ............................................................................................................... 335
MWAIT — Wait ......................................................................................................................... 335
Chapter 19: Gfx 'R' Asm .................................................................................................................. 336
Overview ......................................................................................................................................... 336
Setting Memory .............................................................................................................................. 336
Copying Memory ............................................................................................................................ 341
Speed Freak ..................................................................................................................................... 346
Graphics 101 — Frame Buffer ....................................................................................................... 347
Graphics 101 — Blit ....................................................................................................................... 348
Copy Blit ..................................................................................................................................... 348
Transparent Blit .......................................................................................................................... 349
Graphics 101 — Blit (MMX) ......................................................................................................... 349
Graphics Engine — Sprite Layered ............................................................................................ 349
Graphics Engine — Sprite Overlay ............................................................................................ 350
Graphics 101 — Clipping Blit ........................................................................................................ 352
Chapter 20: MASM vs. NASM vs. TASM vs. WASM ................................................................. 353
Overview ......................................................................................................................................... 353
MASM — Microsoft Macro Assembler ......................................................................................... 354
REPEAT ..................................................................................................................................... 359
WHILE ........................................................................................................................................ 359
FOR ............................................................................................................................................. 359
Compiler Intrinsics.......................................................................................................................... 360
Chapter 21: Debugging Functions ................................................................................................. 362
Overview ......................................................................................................................................... 362
Guidelines of Assembly Development ........................................................................................... 362
Visual C++ ...................................................................................................................................... 362
Tuning and Optimization ................................................................................................................ 365
Exception Handling — AKA: Dang that 1.#QNAN ...................................................................... 365
Exceptions ................................................................................................................................... 365
FPU Versus MMX ...................................................................................................................... 366
Print Output..................................................................................................................................... 366
Float Array Print ......................................................................................................................... 367
Vector Print ................................................................................................................................. 368
Quad Vector Print ....................................................................................................................... 368
Quaternion Print .......................................................................................................................... 368
Matrix Print ................................................................................................................................. 368
Memory Dump ............................................................................................................................ 369
Test Jigs .......................................................................................................................................... 371
Matrix Test Fill ........................................................................................................................... 371
Matrix Splat ................................................................................................................................ 371
Chapter 22: Epilogue ....................................................................................................................... 373
Overview ......................................................................................................................................... 373
Appendix A: Data Structure Definitions ........................................................................................ 374
Appendix B: Mnemonics ................................................................................................................. 376
Mnemonics Part 1 ........................................................................................................................... 376
Appendix C: Reg/Mem Mapping .................................................................................................... 389
Glossary ................................................................................................................................... 391
Symbol ........................................................................................................................................ 391
A.................................................................................................................................................. 391
B.................................................................................................................................................. 391
C.................................................................................................................................................. 392
D.................................................................................................................................................. 392
E .................................................................................................................................................. 392
F .................................................................................................................................................. 392
G ................................................................................................................................................. 393
I ................................................................................................................................................... 393
J .................................................................................................................................................. 393
L .................................................................................................................................................. 393
M ................................................................................................................................................. 394
N.................................................................................................................................................. 394
O ................................................................................................................................................. 394
P .................................................................................................................................................. 394
Q ................................................................................................................................................. 394
R.................................................................................................................................................. 394
S .................................................................................................................................................. 395
T .................................................................................................................................................. 395
V .................................................................................................................................................. 395
W ................................................................................................................................................. 396
X .................................................................................................................................................. 396
Z .................................................................................................................................................. 396
Alignment Macros .................................................................................................................. 396
Algebraic Laws Used in This Book ................................................................................... 396
References ........................................................................................................................................ 397
Index .............................................................................................................................................. 398
Numbers ...................................................................................................................................... 398
A.................................................................................................................................................. 398
B .................................................................................................................................................. 398
C .................................................................................................................................................. 398
D.................................................................................................................................................. 398
E .................................................................................................................................................. 398
F .................................................................................................................................................. 398
G.................................................................................................................................................. 398
H.................................................................................................................................................. 399
I ................................................................................................................................................... 399
K.................................................................................................................................................. 401
L .................................................................................................................................................. 401
M ................................................................................................................................................. 401
N.................................................................................................................................................. 402
O.................................................................................................................................................. 402
P .................................................................................................................................................. 402
Q.................................................................................................................................................. 402
R .................................................................................................................................................. 402
S .................................................................................................................................................. 402
T .................................................................................................................................................. 402
U.................................................................................................................................................. 402
V.................................................................................................................................................. 402
W ................................................................................................................................................. 402
X.................................................................................................................................................. 402
List of Figures ................................................................................................................................... 403
Chapter 2: Coding Standards .......................................................................................................... 403
Chapter 3: Processor Differential Insight ....................................................................................... 403
Chapter 4: Bit Mangling ................................................................................................................. 403
Chapter 5: Bit Wrangling ................................................................................................................ 403
Chapter 6: Data Conversion ............................................................................................................ 403
Chapter 8: Floating-Point Anyone? ................................................................................................ 404
Chapter 13: FP Vector Multiplication and Division ....................................................................... 404
Chapter 14: Floating-Point Deux .................................................................................................... 404
Chapter 15: Binary-Coded Decimal (BCD) .................................................................................... 404
Chapter 17: PC I/O.......................................................................................................................... 404
Chapter 18: System ......................................................................................................................... 404
Chapter 19: Gfx 'R' Asm ................................................................................................................. 405
Chapter 20: MASM vs. NASM vs. TASM vs. WASM .................................................................. 405
Chapter 21: Debugging Functions .................................................................................................. 405
List of Tables ..................................................................................................................................... 406
Chapter 1: Introduction ................................................................................................................... 406
Chapter 2: Coding Standards .......................................................................................................... 406
Chapter 3: Processor Differential Insight ....................................................................................... 406
Chapter 4: Bit Mangling ................................................................................................................. 406
Chapter 5: Bit Wrangling ................................................................................................................ 406
Chapter 6: Data Conversion ............................................................................................................ 406
Chapter 8: Floating-Point Anyone? ................................................................................................ 406
Chapter 9: Comparison ................................................................................................................... 407
Chapter 10: Branching .................................................................................................................... 407
Chapter 15: Binary-Coded Decimal (BCD) .................................................................................... 407
Chapter 18: System ......................................................................................................................... 407
Chapter 21: Debugging Functions .................................................................................................. 407
List of Listings ................................................................................................................................... 408
Chapter 2: Coding Standards .......................................................................................................... 408
Chapter 4: Bit Mangling ................................................................................................................. 408
Chapter 5: Bit Wrangling ................................................................................................................ 408
Chapter 6: Data Conversion ............................................................................................................ 408
Chapter 7: Integer Math .................................................................................................................. 409
Chapter 8: Floating-Point Anyone? ................................................................................................ 409
Chapter 12: Floating-Point Vector Addition and Subtraction ........................................................ 409
Chapter 13: FP Vector Multiplication and Division ....................................................................... 410
Chapter 14: Floating-Point Deux .................................................................................................... 410
Chapter 15: Binary-Coded Decimal (BCD) .................................................................................... 411
Chapter 16: What CPUID? ............................................................................................................. 411
Chapter 21: Debugging Functions .................................................................................................. 411