Intel Architecture Software Developer's Manuals, Volume 1: Basic Architecture
Software Developer’s
Manual
Volume 1:
Basic Architecture
1999
Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel
or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel’s Terms and
Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied
warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular
purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are
not intended for use in medical, life saving, or life sustaining applications.
Intel may make changes to specifications and product descriptions at any time, without notice.
Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or
“undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or
incompatibilities arising from future changes to them.
Intel’s Intel Architecture processors (e.g., Pentium®, Pentium® II, Pentium® III, and Pentium® Pro processors) may
contain design defects or errors known as errata which may cause the product to deviate from published
specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your
product order.
Copies of documents which have an ordering number and are referenced in this document, or other Intel literature,
may be obtained by calling 1-800-548-4725, or by visiting Intel's literature center at http://www.intel.com.
CHAPTER 1
ABOUT THIS MANUAL
1.1. OVERVIEW OF THE INTEL ARCHITECTURE SOFTWARE DEVELOPER’S MANUAL,
VOLUME 1: BASIC ARCHITECTURE 1-1
1.2. OVERVIEW OF THE INTEL ARCHITECTURE SOFTWARE DEVELOPER’S MANUAL,
VOLUME 2: INSTRUCTION SET REFERENCE 1-3
1.3. OVERVIEW OF THE INTEL ARCHITECTURE SOFTWARE DEVELOPER’S MANUAL,
VOLUME 3: SYSTEM PROGRAMMING GUIDE 1-3
1.4. NOTATIONAL CONVENTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
1.4.1. Bit and Byte Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-5
1.4.2. Reserved Bits and Software Compatibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-6
1.4.3. Instruction Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-7
1.4.4. Hexadecimal and Binary Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-7
1.4.5. Segmented Addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-7
1.4.6. Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1-8
1.5. RELATED LITERATURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-9
CHAPTER 2
INTRODUCTION TO THE INTEL ARCHITECTURE
2.1. BRIEF HISTORY OF THE INTEL ARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . 2-1
2.2. INCREASING INTEL ARCHITECTURE PERFORMANCE AND MOORE’S LAW . 2-4
2.3. BRIEF HISTORY OF THE INTEL ARCHITECTURE FLOATING-POINT UNIT. . . . 2-6
2.4. INTRODUCTION TO THE P6 FAMILY PROCESSOR’S
ADVANCED MICROARCHITECTURE 2-6
2.5. DETAILED DESCRIPTION OF THE P6 FAMILY PROCESSOR
MICROARCHITECTURE 2-9
2.5.1. Memory Subsystem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-9
2.5.2. Fetch/Decode Unit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-11
2.5.3. Instruction Pool (Reorder Buffer). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-11
2.5.4. Dispatch/Execute Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-12
2.5.5. Retirement Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2-13
CHAPTER 3
BASIC EXECUTION ENVIRONMENT
3.1. MODES OF OPERATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-1
3.2. OVERVIEW OF THE BASIC EXECUTION ENVIRONMENT . . . . . . . . . . . . . . . . . 3-2
3.3. MEMORY ORGANIZATION. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.4. MODES OF OPERATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-4
3.5. 32-BIT VS. 16-BIT ADDRESS AND OPERAND SIZES. . . . . . . . . . . . . . . . . . . . . . 3-4
3.6. REGISTERS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-5
3.6.1. General-Purpose Data Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-6
3.6.2. Segment Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7
3.6.3. EFLAGS Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-10
3.6.3.1. Status Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-12
3.6.3.2. DF Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-13
3.6.4. System Flags and IOPL Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-13
3.7. INSTRUCTION POINTER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-14
3.8. OPERAND-SIZE AND ADDRESS-SIZE ATTRIBUTES. . . . . . . . . . . . . . . . . . . . . 3-14
TABLE OF CONTENTS
CHAPTER 4
PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS
4.1. PROCEDURE CALL TYPES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
4.2. STACK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
4.2.1. Setting Up a Stack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-3
4.2.2. Stack Alignment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-3
4.2.3. Address-Size Attributes for Stack Accesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-3
4.2.4. Procedure Linking Information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-4
4.2.4.1. Stack-Frame Base Pointer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-4
4.2.4.2. Return Instruction Pointer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-4
4.3. CALLING PROCEDURES USING CALL AND RET . . . . . . . . . . . . . . . . . . . . . . . . 4-5
4.3.1. Near CALL and RET Operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-5
4.3.2. Far CALL and RET Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-6
4.3.3. Parameter Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-7
4.3.3.1. Passing Parameters Through the General-Purpose Registers . . . . . . . . . . . .4-7
4.3.3.2. Passing Parameters on the Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-7
4.3.3.3. Passing Parameters in an Argument List . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-7
4.3.4. Saving Procedure State Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-7
4.3.5. Calls to Other Privilege Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-8
4.3.6. CALL and RET Operation Between Privilege Levels . . . . . . . . . . . . . . . . . . . . .4-10
4.4. INTERRUPTS AND EXCEPTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-11
4.4.1. Call and Return Operation for Interrupt or Exception Handling Procedures . . . .4-13
4.4.2. Calls to Interrupt or Exception Handler Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . .4-17
4.4.3. Interrupt and Exception Handling in Real-Address Mode . . . . . . . . . . . . . . . . . .4-17
4.4.4. INT n, INTO, INT 3, and BOUND Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . .4-17
4.5. PROCEDURE CALLS FOR BLOCK-STRUCTURED LANGUAGES. . . . . . . . . . . 4-18
4.5.1. ENTER Instruction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-18
4.5.2. LEAVE Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-24
CHAPTER 5
DATA TYPES AND ADDRESSING MODES
5.1. FUNDAMENTAL DATA TYPES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
5.1.1. Alignment of Words, Doublewords, and Quadwords. . . . . . . . . . . . . . . . . . . . . . .5-2
5.2. NUMERIC, POINTER, BIT FIELD, AND STRING DATA TYPES . . . . . . . . . . . . . . 5-3
5.2.1. Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-3
5.2.2. Unsigned Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
5.2.3. BCD Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
5.2.4. Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
5.2.5. Bit Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
5.2.6. Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-5
5.2.7. Floating-Point Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-6
5.2.8. MMX™ Technology Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-6
5.2.9. Streaming SIMD Extensions Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-6
5.3. OPERAND ADDRESSING. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6
5.3.1. Immediate Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-6
5.3.2. Register Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-7
5.3.3. Memory Operands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-7
5.3.3.1. Specifying a Segment Selector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-8
5.3.3.2. Specifying an Offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-9
5.3.3.3. Assembler and Compiler Addressing Modes . . . . . . . . . . . . . . . . . . . . . . . . .5-10
5.3.4. I/O Port Addressing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-11
CHAPTER 6
INSTRUCTION SET SUMMARY
6.1. NEW INTEL ARCHITECTURE INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-1
6.1.1. New Instructions Introduced with the Streaming SIMD Extensions . . . . . . . . . . . 6-1
6.1.2. New Instructions Introduced with the MMX™ Technology . . . . . . . . . . . . . . . . . 6-1
6.1.3. New Instructions in the Pentium® Pro Processor . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.1.4. New Instructions in the Pentium® Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-2
6.1.5. New Instructions in the Intel486™ Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.2. INSTRUCTION SET LIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.2.1. Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.2.1.1. Data Transfer Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-3
6.2.1.2. Binary Arithmetic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.2.1.3. Decimal Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.2.1.4. Logic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.2.1.5. Shift and Rotate Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-5
6.2.1.6. Bit and Byte Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-6
6.2.1.7. Control Transfer Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-7
6.2.1.8. String Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-8
6.2.1.9. Flag Control Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
6.2.1.10. Segment Register Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
6.2.1.11. Miscellaneous Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-9
6.2.2. MMX™ Technology Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
6.2.2.1. MMX™ Data Transfer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
6.2.2.2. MMX™ Conversion Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
6.2.2.3. MMX™ Packed Arithmetic Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-10
6.2.2.4. MMX™ Comparison Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
6.2.2.5. MMX™ Logic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
6.2.2.6. MMX™ Shift and Rotate Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-11
6.2.2.7. MMX™ State Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
6.2.3. Floating-Point Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
6.2.3.1. Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-12
6.2.3.2. Basic Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-13
6.2.3.3. Comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
6.2.3.4. Transcendental . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-14
6.2.3.5. Load Constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
6.2.3.6. FPU Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-15
6.2.4. System Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-16
6.2.5. Streaming SIMD Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-17
6.2.5.1. Streaming SIMD Extensions Data Transfer Instructions. . . . . . . . . . . . . . . . 6-17
6.2.5.2. Streaming SIMD Extensions Conversion Instructions. . . . . . . . . . . . . . . . . . 6-17
6.2.5.3. Streaming SIMD Extensions Packed Arithmetic Instructions . . . . . . . . . . . . 6-18
6.2.5.4. Streaming SIMD Extensions Comparison Instructions . . . . . . . . . . . . . . . . . 6-18
6.2.5.5. Streaming SIMD Extensions Logical Instructions . . . . . . . . . . . . . . . . . . . . . 6-18
6.2.5.6. Streaming SIMD Extensions Data Shuffle Instructions . . . . . . . . . . . . . . . . . 6-19
6.2.5.7. Streaming SIMD Extensions Additional SIMD-Integer Instructions. . . . . . . . 6-19
6.2.5.8. Streaming SIMD Extensions Cacheability Control Instructions. . . . . . . . . . . 6-19
6.2.5.9. Streaming SIMD Extensions State Management Instructions . . . . . . . . . . . 6-19
6.3. DATA MOVEMENT INSTRUCTIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
6.3.1. General-Purpose Data Movement Instructions . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
6.3.1.1. Move Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
6.3.1.2. Conditional Move Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-20
6.3.1.3. Exchange Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6-21
CHAPTER 7
FLOATING-POINT UNIT
7.1. COMPATIBILITY AND EASE OF USE OF THE INTEL ARCHITECTURE FPU . . . 7-1
7.2. REAL NUMBERS AND FLOATING-POINT FORMATS . . . . . . . . . . . . . . . . . . . . . . 7-2
7.2.1. Real Number System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-3
7.2.2. Floating-Point Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
7.2.2.1. Normalized Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-4
7.2.2.2. Biased Exponent. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.2.3. Real Number and Non-number Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-5
7.2.3.1. Signed Zeros. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.2.3.2. Normalized and Denormalized Finite Numbers . . . . . . . . . . . . . . . . . . . . . . . 7-6
7.2.3.3. Signed Infinities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
7.2.3.4. NaNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
7.2.4. Indefinite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
7.3. FPU ARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
7.3.1. FPU Data Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-9
7.3.1.1. Parameter Passing with the FPU Register Stack . . . . . . . . . . . . . . . . . . . . . 7-11
7.3.2. FPU Status Register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.3.2.1. Top of Stack (TOP) Pointer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.3.2.2. Condition Code Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
7.3.2.3. Exception Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
7.3.2.4. Stack Fault Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15
7.3.3. Branching and Conditional Moves on FPU Condition Codes . . . . . . . . . . . . . . 7-15
7.3.4. FPU Control Word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-16
7.3.4.1. Exception-Flag Masks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
7.3.4.2. Precision Control Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-17
7.3.4.3. Rounding Control Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-18
7.3.5. Infinity Control Flag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
7.3.6. FPU Tag Word. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20
7.3.7. FPU Instruction and Operand (Data) Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
7.3.8. Last Instruction Opcode. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
7.3.9. Saving the FPU’s State . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-21
7.4. FLOATING-POINT DATA TYPES AND FORMATS . . . . . . . . . . . . . . . . . . . . . . . . 7-24
7.4.1. Real Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-25
7.4.2. Binary Integers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-27
7.4.3. Decimal Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-29
7.4.4. Unsupported Extended-Real Encodings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-30
7.5. FPU INSTRUCTION SET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-31
7.5.1. Escape (ESC) Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-32
7.5.2. FPU Instruction Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-32
7.5.3. Data Transfer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-32
7.5.4. Load Constant Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-34
7.5.5. Basic Arithmetic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-35
7.5.6. Comparison and Classification Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-36
7.5.6.1. Branching on the FPU Condition Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-38
7.5.7. Trigonometric Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-38
7.5.8. Pi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-39
7.5.9. Logarithmic, Exponential, and Scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-40
7.5.10. Transcendental Instruction Accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-40
7.5.11. FPU Control Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-41
7.5.12. Waiting Vs. Non-waiting Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-42
7.5.13. Unsupported FPU Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-43
CHAPTER 8
PROGRAMMING WITH THE INTEL
MMX™ TECHNOLOGY
8.1. OVERVIEW OF THE MMX™ TECHNOLOGY PROGRAMMING ENVIRONMENT 8-1
8.1.1. MMX™ Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-2
8.1.2. MMX™ Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-3
8.1.3. Single Instruction, Multiple Data (SIMD) Execution Model . . . . . . . . . . . . . . . . . .8-4
8.1.4. Memory Data Formats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-4
8.1.5. Data Formats for MMX™ Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-5
8.2. MMX™ INSTRUCTION SET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-5
8.2.1. Saturation Arithmetic and Wraparound Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-6
8.2.2. Instruction Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-7
8.3. OVERVIEW OF THE MMX™ INSTRUCTION SET . . . . . . . . . . . . . . . . . . . . . . . . . 8-7
8.3.1. Data Transfer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-7
8.3.2. Arithmetic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-9
8.3.2.1. Packed Addition and Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-9
8.3.2.2. Packed Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-9
8.3.2.3. Packed Multiply Add . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-9
8.3.3. Comparison Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-9
8.3.4. Conversion Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-10
8.3.5. Logical Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-10
8.3.6. Shift Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-10
8.3.7. EMMS (Empty MMX™ State) Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-10
8.4. COMPATIBILITY WITH FPU ARCHITECTURE . . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
8.4.1. MMX™ Instructions and the Floating-Point Tag Word . . . . . . . . . . . . . . . . . . . .8-11
8.4.2. Effect of Instruction Prefixes on MMX™ Instructions . . . . . . . . . . . . . . . . . . . . .8-11
8.5. WRITING APPLICATIONS WITH MMX™ CODE . . . . . . . . . . . . . . . . . . . . . . . . . 8-11
8.5.1. Detecting Support for MMX™ Technology Using the CPUID Instruction . . . . . .8-12
8.5.2. Using the EMMS Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-12
CHAPTER 9
PROGRAMMING WITH THE STREAMING SIMD EXTENSIONS
9.1. OVERVIEW OF THE STREAMING SIMD EXTENSIONS . . . . . . . . . . . . . . . . . . . . 9-2
9.1.1. SIMD Floating-Point Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-2
9.1.2. SIMD Floating-Point Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-3
9.1.3. Single Instruction, Multiple Data (SIMD) Execution Model . . . . . . . . . . . . . . . . . 9-4
9.1.4. Pentium® III Processor Single Precision Floating-Point Format . . . . . . . . . . . . . 9-4
9.1.5. Memory Data Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
9.1.6. SIMD Floating-Point Register Data Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-5
9.1.7. SIMD Floating-Point Control/Status Register. . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-7
9.1.8. Rounding Control Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.1.9. Flush-To-Zero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-8
9.2. STREAMING SIMD EXTENSIONS SET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-9
9.2.1. Instruction Operands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-9
9.3. OVERVIEW OF THE STREAMING SIMD EXTENSIONS SET . . . . . . . . . . . . . . . 9-10
9.3.1. Data Movement Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-10
9.3.2. Arithmetic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
9.3.2.1. Packed/Scalar Addition and Subtraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
9.3.2.2. Packed/Scalar Multiplication and Division . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
9.3.2.3. Packed/Scalar Square Root . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-11
9.3.2.4. Packed Maximum/Minimum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
9.3.3. Comparison Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-12
9.3.4. Conversion Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-13
9.3.5. Logical Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-13
9.3.6. Additional SIMD Integer Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-14
9.3.7. Shuffle Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-15
9.3.8. State Management Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-16
9.3.9. Cacheability Control Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-17
9.4. COMPATIBILITY WITH FPU ARCHITECTURE. . . . . . . . . . . . . . . . . . . . . . . . . . . 9-19
9.4.1. Effect of Instruction Prefixes on Streaming SIMD Extensions . . . . . . . . . . . . . . 9-19
9.5. WRITING APPLICATIONS WITH STREAMING SIMD EXTENSIONS CODE. . . . 9-21
9.5.1. Detecting Support for Streaming SIMD Extensions Using the
CPUID Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-21
9.5.2. Interfacing with Streaming SIMD Extensions Procedures and Functions . . . . . 9-22
9.5.3. Writing Code with MMX™, Floating-Point, and Streaming SIMD Extensions . . 9-22
9.5.3.1. Cacheability Hint Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-23
9.5.3.2. Recommendations and Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-25
9.5.4. Using Streaming SIMD Extensions Code in a Multitasking Operating
System Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9-25
9.5.4.1. Cooperative Multitasking Operating System. . . . . . . . . . . . . . . . . . . . . . . . . 9-26
9.5.4.2. Preemptive Multitasking Operating System . . . . . . . . . . . . . . . . . . . . . . . . . 9-26
9.5.5. Exception Handling in Streaming SIMD Extensions . . . . . . . . . . . . . . . . . . . . . 9-26
CHAPTER 10
INPUT/OUTPUT
10.1. I/O PORT ADDRESSING
10.2. I/O PORT HARDWARE
10.3. I/O ADDRESS SPACE
10.3.1. Memory-Mapped I/O
10.4. I/O INSTRUCTIONS
10.5. PROTECTED-MODE I/O
10.5.1. I/O Privilege Level
10.5.2. I/O Permission Bit Map
10.6. ORDERING I/O
CHAPTER 11
PROCESSOR IDENTIFICATION AND FEATURE DETERMINATION
11.1. PROCESSOR IDENTIFICATION
11.2. IDENTIFICATION OF EARLIER INTEL ARCHITECTURE PROCESSORS
11.3. CPUID INSTRUCTION EXTENSIONS
11.3.1. Version Information
11.3.2. Control Register Extensions
APPENDIX A
EFLAGS CROSS-REFERENCE
APPENDIX B
EFLAGS CONDITION CODES
APPENDIX C
FLOATING-POINT EXCEPTIONS SUMMARY
APPENDIX D
SIMD FLOATING-POINT EXCEPTIONS SUMMARY
APPENDIX E
GUIDELINES FOR WRITING FPU EXCEPTION HANDLERS
E.1. ORIGIN OF THE MS-DOS* COMPATIBILITY MODE FOR HANDLING FPU EXCEPTIONS
E.2. IMPLEMENTATION OF THE MS-DOS* COMPATIBILITY MODE IN THE INTEL486™, PENTIUM®, AND P6 FAMILY PROCESSORS
E.2.1. MS-DOS* Compatibility Mode in the Intel486™ and Pentium® Processors
E.2.1.1. Basic Rules: When FERR# Is Generated
E.2.1.2. Recommended External Hardware to Support the MS-DOS* Compatibility Mode
E.2.1.3. No-Wait FPU Instructions Can Get FPU Interrupt in Window
E.2.2. MS-DOS* Compatibility Mode in the P6 Family Processors
E.3. RECOMMENDED PROTOCOL FOR MS-DOS* COMPATIBILITY HANDLERS
E.3.1. Floating-Point Exceptions and Their Defaults
E.3.2. Two Options for Handling Numeric Exceptions
E.3.2.1. Automatic Exception Handling: Using Masked Exceptions
E.3.2.2. Software Exception Handling
E.3.3. Synchronization Required for Use of FPU Exception Handlers
APPENDIX F
GUIDELINES FOR WRITING SIMD FLOATING-POINT EXCEPTION HANDLERS
F.1. TWO OPTIONS FOR HANDLING NUMERIC EXCEPTIONS
F.2. SOFTWARE EXCEPTION HANDLING
F.3. EXCEPTION SYNCHRONIZATION
F.4. SIMD FLOATING-POINT EXCEPTIONS AND THE IEEE-754 STANDARD FOR BINARY FLOATING-POINT COMPUTATIONS
F.4.1. Floating-Point Emulation
F.4.2. Streaming SIMD Extensions Response To Floating-Point Exceptions
F.4.2.1. Numeric Exceptions
F.4.2.2. Results of Operations with NaN Operands or a NaN Result for Streaming SIMD Extensions Numeric Instructions
F.4.2.3. Condition Codes, Exception Flags, and Response for Masked and Unmasked Numeric Exceptions
F.4.3. SIMD Floating-Point Emulation Implementation Example
TABLE OF TABLES
Table 2-1. Processor Performance Over Time and Other Intel Architecture Key Features
Table 3-1. Effective Operand- and Address-Size Attributes
Table 4-1. Exceptions and Interrupts
Table 5-1. Default Segment Selection Rules
Table 6-1. Move Instruction Operations
Table 6-2. Conditional Move Instructions
Table 6-3. Bit Test and Modify Instructions
Table 6-4. Conditional Jump Instructions
Table 6-5. Information Provided by the CPUID Instruction
Table 7-1. Real Number Notation
Table 7-2. Denormalization Process
Table 7-3. FPU Condition Code Interpretation
Table 7-4. Precision Control Field (PC)
Table 7-5. Rounding Control Field (RC)
Table 7-6. Rounding of Positive Numbers with Masked Overflow
Table 7-7. Rounding of Negative Numbers with Masked Overflow
Table 7-8. Length, Precision, and Range of FPU Data Types
Table 7-9. Real Number and NaN Encodings
Table 7-10. Binary Integer Encodings
Table 7-11. Packed Decimal Integer Encodings
Table 7-12. Unsupported Extended-Real Encodings
Table 7-13. Data Transfer Instructions
Table 7-14. Floating-Point Conditional Move Instructions
Table 7-15. Setting of FPU Condition Code Flags for Real Number Comparisons
Table 7-16. Setting of EFLAGS Status Flags for Real Number Comparisons
Table 7-17. TEST Instruction Constants for Conditional Branching
Table 7-18. Rules for Generating QNaNs
Table 7-19. Results of Operations with NaN Operands
Table 7-20. Arithmetic and Non-arithmetic Instructions
Table 7-21. Invalid Arithmetic Operations and the Masked Responses to Them
Table 7-22. Divide-By-Zero Conditions and the Masked Responses to Them
Table 7-23. Masked Responses to Numeric Overflow
Table 8-1. Data Range Limits for Saturation
Table 8-2. MMX™ Instruction Set Summary
Table 8-3. Effect of Prefixes on MMX™ Instructions
Table 9-1. Precision and Range of SIMD Floating-point Datatype
Table 9-2. Real Number and NaN Encodings
Table 9-3. Rounding Control Field (RC)
Table 9-4. Streaming SIMD Extensions Behavior with Prefixes
Table 9-5. SIMD Integer Instructions Behavior with Prefixes
Table 9-6. Cacheability Control Instruction Behavior with Prefixes
Table 9-7. Cache Hints
Table 10-1. I/O Instruction Serialization
Table 11-1. EAX Input Value and CPUID Return Values
Table 11-2. New P6-Family Processor Feature Information Returned by CPUID in EDX
Table A-1. EFLAGS Cross-Reference
Table B-1. EFLAGS Condition Codes
CHAPTER 1
ABOUT THIS MANUAL
The Intel Architecture Software Developer’s Manual, Volume 1: Basic Architecture (Order
Number 243190) is part of a three-volume set that describes the architecture and programming
environment of all Intel Architecture (IA) processors. The other two volumes in this set are:
• The Intel Architecture Software Developer’s Manual, Volume 2: Instruction Set Reference
(Order Number 243191).
• The Intel Architecture Software Developer’s Manual, Volume 3: System Programming
Guide (Order Number 243192).
The Intel Architecture Software Developer’s Manual, Volume 1, describes the basic architecture
and programming environment of an IA processor; the Intel Architecture Software Developer’s
Manual, Volume 2, describes the instruction set of the processor and the opcode structure. These
two volumes are aimed at application programmers who are writing programs to run under
existing operating systems or executives. The Intel Architecture Software Developer’s Manual,
Volume 3 describes the operating-system support environment of an IA processor, including
memory management, protection, task management, interrupt and exception handling, and
system management mode. It also provides IA processor compatibility information. This
volume is aimed at operating-system and BIOS designers and programmers.
Chapter 5 — Data Types and Addressing Modes. Describes the data types and addressing
modes recognized by the processor.
Chapter 6 — Instruction Set Summary. Gives an overview of all the IA instructions except
those executed by the processor’s floating-point unit. The instructions are presented in function-
ally related groups.
Chapter 7 — Floating-Point Unit. Describes the IA floating-point unit, including the floating-
point registers and data types; gives an overview of the floating-point instruction set; and
describes the processor’s floating-point exception conditions.
Chapter 8 — Programming with Intel MMX™ Technology. Describes the Intel MMX™
technology, including MMX™ registers and data types, and gives an overview of the MMX™
instruction set.
Chapter 9 — Programming with the Streaming SIMD Extensions. Describes the Intel
Streaming SIMD Extensions, including the registers and data types.
Chapter 10 — Input/Output. Describes the processor’s I/O architecture, including I/O port
addressing, the I/O instructions, and the I/O protection mechanism.
Chapter 11 — Processor Identification and Feature Determination. Describes how to deter-
mine the CPU type and the features that are available in the processor.
Appendix A — EFLAGS Cross-Reference. Summarizes how the IA instructions affect the
flags in the EFLAGS register.
Appendix B — EFLAGS Condition Codes. Summarizes how the conditional jump, move, and
byte set on condition code instructions use the condition code flags (OF, CF, ZF, SF, and PF) in
the EFLAGS register.
Appendix C — Floating-Point Exceptions Summary. Summarizes the exceptions that can be
raised by floating-point instructions.
Appendix D — SIMD Floating-Point Exceptions Summary. Provides the Streaming SIMD
Extensions mnemonics, and the exceptions that each instruction can cause.
Appendix E — Guidelines for Writing FPU Exception Handlers. Describes how to design
and write MS-DOS* compatible exception handling facilities for FPU and SIMD floating-point
exceptions, including both software and hardware requirements and assembly-language code
examples. This appendix also describes general techniques for writing robust FPU exception
handlers.
Appendix F — Guidelines for Writing SIMD-FP Exception Handlers. Provides guidelines
for the Streaming SIMD Extensions instructions that can generate numeric (floating-point)
exceptions, and gives an overview of the necessary support for handling such exceptions.
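[Figure: bit and byte order of a data structure. Bit offsets run from 31 at the left to 0 at the
right, and each 4-byte row holds Byte 3, Byte 2, Byte 1, and Byte 0 from left to right. Byte
offsets 0, 4, 8, ..., 28 run from the lowest address at the bottom to the highest address at the top.]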
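As a hedged illustration of the layout in the figure above (Byte 0 at the lowest address, bit 0 as the least significant bit), the following Python sketch lays out a 32-bit value in little-endian order, which is the byte order IA processors use. The example value and the helper names byte_at_offset and bit_at_offset are assumptions made for illustration only.

```python
import struct

def byte_at_offset(value: int, offset: int) -> int:
    """Return the byte found at `offset` when a 32-bit value is stored
    little-endian (least significant byte at the lowest address)."""
    return struct.pack("<I", value)[offset]

def bit_at_offset(value: int, bit: int) -> int:
    """Return bit `bit` of the value, where bit 0 is the least significant."""
    return (value >> bit) & 1

doubleword = 0x12345678  # hypothetical example value
# Byte 0 (lowest address) holds the low-order byte, Byte 3 the high-order byte.
layout = [byte_at_offset(doubleword, offset) for offset in range(4)]
```

Here layout lists the bytes in order of increasing address, so the low-order byte 78H comes first and the high-order byte 12H comes last.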
NOTE
Avoid any software dependence upon the state of reserved bits in IA registers.
Depending upon the values of reserved register bits will make software
dependent upon the unspecified manner in which the processor handles these
bits. Programs that depend upon reserved values risk incompatibility with
future processors.
refer to the code space, and stack addresses would always refer to the stack space. The following
notation is used to specify a byte address within a segment:
Segment-register:Byte-address
For example, the following segment address identifies the byte at address FF79H in the segment
pointed to by the DS register:
DS:FF79H
The following segment address identifies an instruction address in the code segment. The CS
register points to the code segment and the EIP register contains the address of the instruction.
CS:EIP
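The Segment-register:Byte-address notation can be taken apart mechanically. The sketch below is only an illustration of the notation, not of any processor mechanism; the function name parse_segmented_address and the small set of register names it recognizes are assumptions.

```python
# Register names accepted as the byte-address part (an illustrative subset).
OFFSET_REGISTERS = {"EIP", "ESP", "EBP", "ESI", "EDI", "EAX", "EBX", "ECX", "EDX"}

def parse_segmented_address(text: str):
    """Split 'Segment-register:Byte-address' into (segment, address).

    A register name such as EIP is returned as a string; a hexadecimal
    literal with a trailing H (for example FF79H) is returned as an int.
    """
    segment, separator, address = text.partition(":")
    if not separator or not segment or not address:
        raise ValueError(f"not a segmented address: {text!r}")
    if address.upper() in OFFSET_REGISTERS:
        return segment, address
    if address.upper().endswith("H"):
        return segment, int(address[:-1], 16)
    raise ValueError(f"unrecognized byte address: {address!r}")
```

For the two examples above, DS:FF79H yields the segment register DS with byte address FF79H, and CS:EIP yields CS with the register name EIP.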
1.4.6. Exceptions
An exception is an event that typically occurs when an instruction causes an error. For example,
an attempt to divide by zero generates an exception. However, some exceptions, such as break-
points, occur under other conditions. Some types of exceptions may provide error codes. An
error code reports additional information about the error. An example of the notation used to
show an exception and error code is shown below.
#PF(fault code)
This example refers to a page-fault exception under conditions where an error code naming a
type of fault is reported. Under some conditions, exceptions which produce error codes may not
be able to report an accurate code. In this case, the error code is zero, as shown below for a
general-protection exception.
#GP(0)
Refer to Chapter 5, Interrupt and Exception Handling, in the Intel Architecture Software Devel-
oper’s Manual, Volume 3, for a list of exception mnemonics and their descriptions.
CHAPTER 2
INTRODUCTION TO THE INTEL ARCHITECTURE
A strong case can be made that the exponential growth in both the power and the breadth of use
of the computer has made it the most important force reshaping human technology, business,
and society in the second half of the twentieth century. Further, the computer promises to
continue to dominate technological growth well into the twenty-first century, in part because
other powerful emerging technologies (such as the Internet and developments in genetics like
recombinant DNA research) depend strongly on the growth of computing power for their own
existence and growth. The Intel Architecture (IA) is clearly today’s preferred computer
architecture, as measured by the number of computers in use and the total computing power
available in the world. Thus it is hard to overestimate the importance of the IA.
executing programs created for the 8086 and 8088 processors on the new 32-bit machine. The
32-bit addressing was supported with an external 32-bit address bus, giving a 4-GByte address
space, and also allowed each segment to be as large as 4 GBytes. The original instructions were
enhanced with new 32-bit operand and addressing forms, and completely new instructions were
provided, including those for bit manipulation. The Intel386™ processor also introduced paging
into the IA, with the fixed 4-KByte page size providing a method for virtual memory management
that was significantly superior to using segments for the purpose (it was much more efficient
for operating systems, and completely transparent to the applications, without significant
sacrifice of execution speed). Furthermore, the ability to define segments as large as the
4-GByte physical address space, together with paging, allowed the creation of protected
“flat model” addressing systems (footnote 1) in the architecture, including complete
implementations of the widely used mainframe operating system UNIX.
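As a worked illustration of the fixed 4-KByte page size, the sketch below splits a 32-bit linear address into a page number and an offset within the page. It is a minimal arithmetic example and does not model the translation structures the processor actually uses; the name split_linear_address and the sample address are assumptions.

```python
PAGE_SIZE = 4 * 1024                        # fixed 4-KByte page size
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # 12 bits select a byte within a page
ADDRESS_BITS = 32                           # 32-bit linear address

def split_linear_address(address: int):
    """Return (page_number, offset_in_page) for a 32-bit linear address."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 32 bits")
    return address >> OFFSET_BITS, address & (PAGE_SIZE - 1)

# Number of 4-KByte pages needed to cover the 4-GByte address space.
TOTAL_PAGES = (1 << ADDRESS_BITS) // PAGE_SIZE

page, offset = split_linear_address(0x12345678)  # hypothetical address
```

With these sizes, the low 12 bits of an address select a byte within a page and the remaining 20 bits select one of 1,048,576 pages.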
The IA has been and is committed to the task of maintaining backward compatibility at the
object code level to preserve our customers’ very large investment in software, but at the same
time, in each generation of the architecture, the latest most effective microprocessor architecture
and silicon fabrication technologies have been used to produce the fastest, most powerful
processors possible. Intel has worked over the generations to adapt and incorporate increasingly
sophisticated techniques from mainframe architecture into microprocessor architecture. Various
forms of parallel processing have been the most performance enhancing of these techniques, and
the Intel386™ processor was the first IA processor to include multiple parallel stages, six in all.
These are the Bus Interface Unit (accesses memory and I/O for the other units), the Code
Prefetch Unit (receives object code from the Bus Interface Unit and puts it into a 16-byte queue),
the Instruction Decode Unit (decodes object code from the Code Prefetch Unit into microcode),
the Execution Unit (executes the microcode instructions), the Segment Unit (translates logical
addresses to linear addresses and does protection checks), and the Paging Unit (translates linear
addresses to physical addresses, does page-based protection checks, and contains a cache with
information for up to 32 of the most recently accessed pages).
The Intel486™ processor added more parallel execution capability by (basically) expanding the
Intel386™ processor’s Instruction Decode and Execution Units into five pipelined stages, where
each stage (when needed) operates in parallel with the others on up to five instructions in
different stages of execution. Each stage can do its work on one instruction in one clock, and so
the Intel486™ processor can execute as rapidly as one instruction per CPU clock. An 8-KByte
on-chip L1 cache was added to the Intel486™ processor to greatly increase the percentage of
instructions that could execute at the scalar rate of one per clock: memory access instructions
were now included if the operand was in the L1 cache. The Intel486™ processor was also the
first to integrate the floating-point math unit onto the same chip as the CPU (refer to Section
2.3., “Brief History of the Intel Architecture Floating-Point Unit”) and added new pins, bits, and
instructions to support more complex and powerful systems (L2 cache support and
multiprocessor support).
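The claim that a five-stage pipeline can approach one instruction per clock can be checked with simple arithmetic. The sketch below assumes an ideal pipeline in which every stage takes exactly one clock and nothing ever stalls; real code will do worse. The function names are illustrative.

```python
def pipeline_clocks(instructions: int, stages: int = 5) -> int:
    """Clocks needed to complete `instructions` in an ideal pipeline.

    The first instruction needs `stages` clocks to pass through every stage;
    each later instruction completes one clock after the one before it.
    """
    if instructions <= 0:
        return 0
    return stages + (instructions - 1)

def clocks_per_instruction(instructions: int, stages: int = 5) -> float:
    """Average clocks per instruction for a run of `instructions`."""
    return pipeline_clocks(instructions, stages) / instructions
```

Under these assumptions a single instruction takes 5 clocks, but 100 instructions take 104 clocks, so the average cost approaches one clock per instruction as the run gets longer.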
Late in the Intel486™ processor generation, Intel incorporated features designed to support
energy savings and other system management capabilities into the IA mainstream with the
Intel486™ SL Enhanced processors. These features were developed in the Intel386™ SL and
Intel486™ SL processors, which were specialized for the rapidly growing battery-operated
1. Requires only one 32-bit address component to access anywhere in the address space.
notebook PC market. The features include the new System Management Mode, triggered by its
own dedicated interrupt pin, which allows complex system management features (such as power
management of various subsystems within the PC) to be added to a system transparently to the
main operating system and all applications. The Stop Clock and Auto Halt Powerdown features
allow the CPU itself to execute at a reduced clock rate to save power, or to be shut down (with
state preserved) to save even more power.
The Intel Pentium® processor added a second execution pipeline to achieve superscalar perfor-
mance (two pipelines, known as u and v, together can execute two instructions per clock). The
on-chip L1 cache has also been doubled, with 8 KBytes devoted to code, and another 8 KBytes
devoted to data. The data cache uses the MESI protocol to support the more efficient write-back
mode, as well as the write-through mode that is used by the Intel486™ processor. Branch predic-
tion with an on-chip branch table has been added to increase performance in looping constructs.
Extensions have been added to make the virtual-8086 mode more efficient, and to allow for 4-
MByte as well as 4-KByte pages. The main registers are still 32 bits, but internal data paths of
128 and 256 bits have been added to speed internal data transfers, and the burstable external data
bus has been increased to 64 bits. The Advanced Programmable Interrupt Controller (APIC) has
been added to support systems with multiple Pentium® processors, and new pins and a special
mode (dual processing) have been designed in to support glueless two-processor systems.
The Intel Pentium® Pro processor introduced “Dynamic Execution.” It has a three-way super-
scalar architecture, which means that it can execute three instructions per CPU clock. It does this
by incorporating even more parallelism than the Pentium® processor. The Pentium® Pro
processor provides Dynamic Execution (micro-data flow analysis, out-of-order execution, supe-
rior branch prediction, and speculative execution) in a superscalar implementation. Three
instruction decode units work in parallel to decode object code into smaller operations called
“micro-ops.” These go into an instruction pool and, when interdependencies do not prevent it,
can be executed out of order by the five parallel execution units (two integer, two FPU, and one
memory interface unit). The Retirement Unit retires completed micro-ops in their original
program order, taking account of any branches. The power of the Pentium® Pro processor is
further enhanced by its caches: it has the same two on-chip 8-KByte L1 caches as does the
Pentium® processor, and also has a 256-KByte L2 cache that is in the same package as, and
closely coupled to, the CPU, using a dedicated 64-bit (“backside”) full clock speed bus. The L1
cache is dual-ported, the L2 cache supports up to 4 concurrent accesses, and the 64-bit external
data bus is transaction-oriented, meaning that each access is handled as a separate request and
response, with numerous requests allowed while awaiting a response. These parallel features for
data access work with the parallel execution capabilities to provide a “non-blocking” architec-
ture in which the processor is more fully utilized and performance is enhanced. The Pentium®
Pro processor also has an expanded 36-bit address bus, giving a maximum physical address
space of 64 GBytes.
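The relationship between address bus width and maximum physical address space is a power of two, which the following sketch works through. The 32-bit and 36-bit widths come from the text; the 20-bit and 24-bit widths for the 8086 and Intel 286 are inferred here from the 1-MB and 16-MB address spaces listed in Table 2-1. The function names are illustrative.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with `address_bits` lines."""
    return 1 << address_bits

def format_size(size_bytes: int) -> str:
    """Render a power-of-two size using the largest whole binary unit."""
    for unit, shift in (("GB", 30), ("MB", 20), ("KB", 10)):
        if size_bytes >= (1 << shift) and size_bytes % (1 << shift) == 0:
            return f"{size_bytes >> shift} {unit}"
    return f"{size_bytes} bytes"

# Address widths in bits; the 20 and 24 entries are inferred, see above.
ADDRESS_BITS = {"8086": 20, "Intel 286": 24, "Intel386 DX": 32, "Pentium Pro": 36}
sizes = {name: format_size(address_space_bytes(bits))
         for name, bits in ADDRESS_BITS.items()}
```

Each additional address line doubles the reachable space, so the four extra lines of the 36-bit bus multiply the 4-GByte space by 16.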
The Pentium® II processor added MMX™ instructions to the Pentium® Pro processor architecture
and incorporated the new slot 1 and slot 2 packaging techniques. These new packaging
techniques moved the L2 cache “off-chip” or “off-die”. The slot 1 and slot 2 packages use a
single-edge connector instead of a socket. The Pentium® II processor expanded the L1 data cache
and L1 instruction cache to 16 KBytes each. The Pentium® II processor has L2 cache sizes of 256
KBytes, 512 KBytes, and 1 MByte or 2 MBytes (slot 2 only). The slot 1 processor uses a “half
clock speed” backside bus while the slot 2 processor uses a “full clock speed” backside bus. The
Pentium® II processors utilize multiple low-power states such as AutoHALT, Stop-Grant, Sleep,
and Deep Sleep to conserve power during idle times.
The newest processor in the IA is the Pentium® III processor. It is based on the Pentium® Pro
and Pentium® II processor architectures. The Pentium® III processor introduces 70 new
instructions to the IA instruction set. These instructions target existing functional units within
the architecture as well as the new SIMD floating-point unit. More detailed discussion of the new
features in the Pentium® Pro, Pentium® II, and Pentium® III processors is provided in Section
2.4., “Introduction to the P6 Family Processor’s Advanced Microarchitecture” and Section 2.5.,
“Detailed Description of the P6 Family Processor Microarchitecture”. More detailed hardware
and architectural information on each of the generations of the IA family is available in the
separate data books for the processor generations (refer to Section 1.5., “Related Literature” in
Chapter 1, About This Manual).
The table below shows the dramatic increases in performance and transistor count of the IA
processors over their history, as predicted by Moore’s Law, and also summarizes the evolution
of other key features of the architecture.
Table 2-1. Processor Performance Over Time and Other Intel Architecture Key Features

Intel         Date of       Performance  Max. CPU      No. of       Main CPU       Extern.        Max. Extern.  Caches in
Processor     Product       in MIPs      Frequency at  Transistors  Register       Data Bus       Addr. Space   CPU Package
              Introduction  (Note 1)     Introduction  on the Die   Size (Note 2)  Size (Note 2)                (Note 3)

8086          1978          0.8          8 MHz         29 K         16             16             1 MB          None
Intel 286     1982          2.7          12.5 MHz      134 K        16             16             16 MB         Note 3
Intel386™ DX  1985          6.0          20 MHz        275 K        32             32             4 GB          Note 3
Intel486™ DX  1989          20           25 MHz        1.2 M        32             32             4 GB          8KB L1
Pentium®      1993          100          60 MHz        3.1 M        32             64             4 GB          16KB L1
Pentium® Pro  1995          440          200 MHz       5.5 M        32             64             64 GB         16KB L1; 256KB or 512KB L2
Pentium® II   1997          466          266 MHz       7 M          32             64             64 GB         32KB L1; 256KB or 512KB L2
Pentium® III  1999          1000         500 MHz       8.2 M        32 GP,         64             64 GB         32KB L1; 512KB L2
                                                                    128 SIMD-FP
NOTES:
1. Performance here is indicated by Dhrystone MIPs (Millions of Instructions per Second) because even
though MIPs are no longer considered a preferred measure of CPU performance, they are the only
benchmarks that span all six generations of the IA. The MIPs and frequency values given here corre-
spond to the maximum CPU frequency available at product introduction.
2. Main CPU register size and external data bus size are given in bits. Note also that there are 8 and 16-bit
data registers in all of the CPUs, there are eight 80-bit registers in the FPUs integrated into the Intel386™
chip and beyond, and there are internal data paths that are 2 to 4 times wider than the external data bus
for each processor.
3. In addition to the large general-purpose caches listed in the table for the Intel486™ processor (8 KBytes
of combined code and data) and the Intel Pentium® and Pentium® Pro processors (8 KBytes each for
separate code cache and data cache), there are smaller special purpose caches. The Intel 286 has 6
byte descriptor caches for each segment register. The Intel386™ has 8 byte descriptor caches for each
segment register, and also a 32-entry, 4-way set associative Translation Lookaside Buffer (cache) to
store access information for recently used pages on the chip. The Intel486™ has the same caches
described for the Intel386™, as well as its 8K L1 general-purpose cache. The Intel Pentium® and Pen-
tium® Pro processors have their general-purpose caches, descriptor caches, and two Translation Looka-
side Buffers each (one for each 8K L1 cache). The Pentium® II and Pentium® III processors have the
same cache structure as the Pentium® Pro processor except that the size of each cache is 16K.
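As a rough check on the transistor counts in Table 2-1, the sketch below estimates how many years each doubling of the transistor count took between the 8086 and the Pentium III. This is back-of-the-envelope arithmetic on the table's own figures, not a statement of Moore's Law itself; the function name doubling_period_years is illustrative.

```python
import math

# (processor, year of introduction, transistors on the die), from Table 2-1.
TRANSISTORS = [
    ("8086", 1978, 29_000),
    ("Intel 286", 1982, 134_000),
    ("Intel386 DX", 1985, 275_000),
    ("Intel486 DX", 1989, 1_200_000),
    ("Pentium", 1993, 3_100_000),
    ("Pentium Pro", 1995, 5_500_000),
    ("Pentium II", 1997, 7_000_000),
    ("Pentium III", 1999, 8_200_000),
]

def doubling_period_years(first, last) -> float:
    """Average number of years per doubling of transistor count."""
    _, first_year, first_count = first
    _, last_year, last_count = last
    doublings = math.log2(last_count / first_count)
    return (last_year - first_year) / doublings

overall_period = doubling_period_years(TRANSISTORS[0], TRANSISTORS[-1])
```

Over the 21 years covered by the table the count grows by a factor of about 283, a little more than eight doublings, which works out to roughly 2.6 years per doubling for this particular data set.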
by Intel in 1993), the Pentium® Pro processor, with its advanced superscalar microarchitecture,
sets an impressive performance standard. In designing the P6 Family processors, one of the
primary goals of the Intel chip architects was to exceed the performance of the Pentium®
processor significantly while still using the same 0.6-micrometer, four-layer metal BiCMOS
manufacturing process. Using the same manufacturing process as the Pentium® processor meant
that performance gains could only be achieved through substantial advances in the microarchi-
tecture.
The resulting P6 Family processor microarchitecture is a three-way superscalar, pipelined archi-
tecture. The term “three-way superscalar” means that using parallel processing techniques, the
processor is able on average to decode, dispatch, and complete execution of (retire) three
instructions per clock cycle. To handle this level of instruction throughput, the P6 Family
processors use a decoupled, 12-stage superpipeline that supports out-of-order instruction execu-
tion. Figure 2-1 shows a conceptual view of this pipeline, with the pipeline divided into four
processing units (the fetch/decode unit, the dispatch/execute unit, the retire unit, and the instruc-
tion pool). Instructions and data are supplied to these units through the bus interface unit.
[Figure 2-1: conceptual view of the pipeline. Labeled blocks: System Bus, L2 Cache, Cache Bus,
Fetch/Decode Unit, Dispatch/Execute Unit, Retire Unit, Instruction Pool, and Intel Architecture
Registers.]
To ensure a steady supply of instructions and data to the instruction execution pipeline, the P6
Family processor microarchitecture incorporates two cache levels. The L1 cache provides an 8-
KByte instruction cache and an 8-KByte data cache, both closely coupled to the pipeline. The
L2 cache is a 256-KByte, 512-KByte, or 1-MByte static RAM that is coupled to the core
processor through a full clock-speed 64-bit cache bus.
The centerpiece of the P6 Family processor microarchitecture is an innovative out-of-order
execution mechanism called “dynamic execution.” Dynamic execution incorporates three data-
processing concepts:
• Deep branch prediction.
• Dynamic data flow analysis.
• Speculative execution.
Branch prediction is a concept found in most mainframe and high-speed microprocessor archi-
tectures. It allows the processor to decode instructions beyond branches to keep the instruction
pipeline full. In the P6 Family processors, the instruction fetch/decode unit uses a highly opti-
mized branch prediction algorithm to predict the direction of the instruction stream through
multiple levels of branches, procedure calls, and returns.
Dynamic data flow analysis involves real-time analysis of the flow of data through the processor
to determine data and register dependencies and to detect opportunities for out-of-order
instruction execution. The P6 Family processors’ dispatch/execute unit can simultaneously monitor
many instructions and execute these instructions in the order that optimizes the use of the
processor’s multiple execution units, while maintaining data integrity. This out-of-order execu-
tion keeps the execution units busy even when cache misses and data dependencies among
instructions occur.
Speculative execution refers to the processor’s ability to execute instructions ahead of the
program counter but ultimately to commit the results in the order of the original instruction
stream. To make speculative execution possible, the P6 Family processor microarchitecture
decouples the dispatching and executing of instructions from the commitment of results. The
processor’s dispatch/execute unit uses data-flow analysis to execute all available instructions in
the instruction pool and store the results in temporary registers. The retirement unit then linearly
searches the instruction pool for completed instructions that no longer have data dependencies
with other instructions or unresolved branch predictions. When completed instructions are
found, the retirement unit commits the results of these instructions to memory and/or the IA
registers (the processor’s eight general-purpose registers and eight floating-point unit data regis-
ters) in the order they were originally issued and retires the instructions from the instruction
pool.
Through deep branch prediction, dynamic data-flow analysis, and speculative execution,
dynamic execution removes the constraint of linear instruction sequencing between the tradi-
tional fetch and execute phases of instruction execution. It allows instructions to be decoded
deep into multi-level branches to keep the instruction pipeline full. It promotes out-of-order
instruction execution to keep the processor’s six instruction execution units running at full
capacity. And finally, it commits the results of executed instructions in original program order
to maintain data integrity and program coherency.
The following section describes the P6 Family processor microarchitecture in greater detail. The
Pentium® Pro processor architecture is the base architecture for the processors that followed it.
The Pentium® II processor and now the Pentium® III processor are based on the Pentium® Pro
processor architecture. Changes or enhancements to the Pentium® Pro processor architecture are
noted where appropriate.
blocking. The L1 data cache automatically forwards a cache miss on to the L2 cache, and then,
if necessary, the bus interface unit forwards an L2 cache miss to system memory.
[Figure: block diagram of the processor microarchitecture. Labeled units: cache bus, Next IP
unit, instruction fetch unit, instruction cache (L1), memory reorder buffer, branch target
buffer, instruction decoder (two simple instruction decoders and one complex instruction
decoder), microcode instruction sequencer, integer unit, retirement unit, retirement register
file (Intel Architecture registers), data cache unit (L1), reorder buffer (instruction pool),
reservation station, and execution unit.]
Memory requests to the L2 cache or system memory go through the memory reorder buffer,
which functions as a scheduling and dispatch station. This unit keeps track of all memory
requests and is able to reorder some requests to prevent blocks and improve throughput. For
example, the memory reorder buffer allows loads to pass stores. It also issues speculative loads.
(Stores are always dispatched in order, and speculative stores are never issued.)
executed but not yet committed to machine state. The dispatch/execute unit can execute instruc-
tions from the reorder buffer in any order.
CHAPTER 3
BASIC EXECUTION ENVIRONMENT
This chapter describes the basic execution environment of an Intel Architecture (IA) processor
as seen by assembly-language programmers. It describes how the processor executes instruc-
tions and how it stores and manipulates data. The parts of the execution environment described
here include memory (the address space), the general-purpose data registers, the segment regis-
ters, the EFLAGS register, and the instruction pointer register.
The execution environment for the floating-point unit (FPU) is described in Chapter 7, Floating-
Point Unit.
and the procedure stack are all contained in this address space. The linear address space is byte
addressable, with addresses running contiguously from 0 to 2^32 − 1. An address for any byte in
the linear address space is called a linear address.
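[Figure: memory models. Flat model: a linear address selects a byte in the linear address
space. Segmented model: a logical address, made up of a segment selector and an offset, selects
a byte within one of the segments, which are mapped into the linear address space.]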
With the segmented memory model, memory appears to a program as a group of independent
address spaces called segments. When using this model, code, data, and stacks are typically
contained in separate segments. To address a byte in a segment, a program must issue a logical
address, which consists of a segment selector and an offset. (A logical address is often referred
to as a far pointer.) The segment selector identifies the segment to be accessed and the offset
identifies a byte in the address space of the segment. The programs running on an IA processor
can address up to 16,383 segments of different sizes and types, and each segment can be as large
as 2^32 bytes.
Internally, all the segments that are defined for a system are mapped into the processor’s linear
address space. The processor translates each logical address into a linear address to access a
memory location. This translation is transparent to the application program.
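The translation can be pictured with a short Python sketch. It is illustrative only: the segment table, its selector values, and the names SEGMENTS and logical_to_linear are hypothetical, and a real processor takes the base address and limit of a segment from system data structures rather than from a dictionary.

```python
# Hypothetical segment table: selector -> (base linear address, limit).
# Selector values, bases, and limits are made up for illustration.
SEGMENTS = {
    0x08: (0x00100000, 0x0000FFFF),   # a code segment
    0x10: (0x00200000, 0x00007FFF),   # a data segment
}


def logical_to_linear(selector, offset):
    """Translate a logical address (selector:offset) to a linear address."""
    base, limit = SEGMENTS[selector]
    if offset > limit:
        raise ValueError("offset lies outside the segment")
    return (base + offset) & 0xFFFFFFFF   # keep the result to 32 bits


print(hex(logical_to_linear(0x10, 0x1234)))
```

The application program supplies only the selector and the offset; the base address never appears in its code, which is the sense in which the translation is transparent.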
The primary reason for using segmented memory is to increase the reliability of programs and
systems. For example, placing a program’s stack in a separate segment prevents the stack from
growing into the code or data space and overwriting instructions or data, respectively. Placing
the operating system’s or executive’s code, data, and stack in separate segments also protects
them from the application program and vice versa.
With either the flat or segmented model, the IA provides facilities for dividing the linear address
space into pages and mapping the pages into virtual memory. If an operating system/executive
uses the IA’s paging mechanism, the existence of the pages is transparent to an application
program.
The real-address mode model uses the memory model for the Intel 8086 processor, the first IA
processor. It was provided in all the subsequent IA processors for compatibility with existing
programs written to run on the Intel 8086 processor. The real-address mode uses a specific
implementation of segmented memory in which the linear address space for the program and the
operating system/executive consists of an array of segments of up to 64 Kbytes in size each. The
maximum size of the linear address space in real-address mode is 2^20 bytes. (Refer to Chapter
16, 8086 Emulation, in the Intel Architecture Software Developer’s Manual, Volume 3, for more
information on this memory model.)
maximum linear address or segment offset is FFFFH (2^16 − 1), and operand sizes are typically 8 bits
or 16 bits.
When using 32-bit addressing, a logical address (or far pointer) consists of a 16-bit segment
selector and a 32-bit offset; when using 16-bit addressing, it consists of a 16-bit segment selector
and a 16-bit offset.
Instruction prefixes allow temporary overrides of the default address and/or operand sizes from
within a program.
When operating in protected mode, the segment descriptor for the currently executing code
segment defines the default address and operand size. A segment descriptor is a system data
structure not normally visible to application code. Assembler directives allow the default
addressing and operand size to be chosen for a program. The assembler and other tools then set
up the segment descriptor for the code segment appropriately.
When operating in real-address mode, the default addressing and operand size is 16 bits. An
address-size override can be used in real-address mode to enable 32-bit addressing; however, the
maximum allowable 32-bit address is still 0000FFFFH (2^16 − 1).
3.6. REGISTERS
The processor provides 16 registers for use in general system and application programming. As
shown in Figure 3-3, these registers can be grouped as follows:
• General-purpose data registers. These eight registers are available for storing operands
and pointers.
• Segment registers. These registers hold up to six segment selectors.
• Status and control registers. These registers report and allow modification of the state of
the processor and of the program being executed.
[Figure: register set]
General-purpose registers (32 bits wide, bits 31 through 0): EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP
Segment registers (16 bits wide, bits 15 through 0): CS, DS, SS, ES, FS, GS
Many instructions assign specific registers to hold operands. For example, string instructions
use the contents of the ECX, ESI, and EDI registers as operands. When using a segmented
memory model, some instructions assume that pointers in certain registers are relative to
specific segments. For instance, some instructions assume that a pointer in the EBX register
points to a memory location in the DS segment.
The special uses of general-purpose registers by instructions are described in Chapter 6, Instruc-
tion Set Summary, in this volume, and Chapter 3, Instruction Set Reference in the Intel Architec-
ture Software Developer’s Manual, Volume 2. The following is a summary of these special uses:
• EAX—Accumulator for operands and results data.
• EBX—Pointer to data in the DS segment.
• ECX—Counter for string and loop operations.
• EDX—I/O pointer.
• ESI—Pointer to data in the segment pointed to by the DS register; source pointer for string
operations.
• EDI—Pointer to data (or destination) in the segment pointed to by the ES register;
destination pointer for string operations.
• ESP—Stack pointer (in the SS segment).
• EBP—Pointer to data on the stack (in the SS segment).
As shown in Figure 3-4, the lower 16 bits of the general-purpose registers map directly to the
register set found in the 8086 and Intel 286 processors and can be referenced with the names
AX, BX, CX, DX, BP, SP, SI, and DI. Each of the lower two bytes of the EAX, EBX, ECX, and
EDX registers can be referenced by the names AH, BH, CH, and DH (high bytes) and AL, BL,
CL, and DL (low bytes).
General-Purpose Registers

    Bits 15-8   Bits 7-0   16-bit   32-bit
    AH          AL         AX       EAX
    BH          BL         BX       EBX
    CH          CL         CX       ECX
    DH          DL         DX       EDX
    (none)      (none)     BP       EBP
    (none)      (none)     SI       ESI
    (none)      (none)     DI       EDI
    (none)      (none)     SP       ESP
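The overlap between the alternate register names can be modeled in a few lines of Python. This is a sketch for illustration, not processor code: the class name Register32 and its method names are hypothetical, and a single object stands for one of EAX, EBX, ECX, or EDX.

```python
class Register32:
    """Toy model of EAX, EBX, ECX, or EDX and its 16-bit and 8-bit names."""

    def __init__(self, value=0):
        self.value = value & 0xFFFFFFFF      # full register, e.g. EAX

    def read16(self):                        # e.g. AX: bits 15 through 0
        return self.value & 0xFFFF

    def write16(self, v):                    # upper 16 bits are preserved
        self.value = (self.value & 0xFFFF0000) | (v & 0xFFFF)

    def read_high8(self):                    # e.g. AH: bits 15 through 8
        return (self.value >> 8) & 0xFF

    def write_high8(self, v):
        self.value = (self.value & 0xFFFF00FF) | ((v & 0xFF) << 8)

    def read_low8(self):                     # e.g. AL: bits 7 through 0
        return self.value & 0xFF

    def write_low8(self, v):
        self.value = (self.value & 0xFFFFFF00) | (v & 0xFF)


eax = Register32(0x12345678)
print(hex(eax.read16()), hex(eax.read_high8()), hex(eax.read_low8()))
```

Writing through one name changes only the bits that name covers; the rest of the 32-bit register is left as it was.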
in memory, the segment selector for that segment must be present in the appropriate segment
register.
When writing application code, programmers generally create segment selectors with assembler
directives and symbols. The assembler and other tools then create the actual segment selector
values associated with these directives and symbols. If writing system code, programmers may
need to create segment selectors directly. (A detailed description of the segment-selector data
structure is given in Chapter 3, Protected-Mode Memory Management, of the Intel Architecture
Software Developer’s Manual, Volume 3.)
How segment registers are used depends on the type of memory management model that the
operating system or executive is using. When using the flat (unsegmented) memory model, the
segment registers are loaded with segment selectors that point to overlapping segments, each of
which begins at address 0 of the linear address space (as shown in Figure 3-5). These overlap-
ping segments then comprise the linear address space for the program. (Typically, two overlap-
ping segments are defined: one for code and another for data and stacks. The CS segment
register points to the code segment and all the other segment registers point to the data and stack
segment.)
When using the segmented memory model, each segment register is ordinarily loaded with a
different segment selector so that each segment register points to a different segment within the
linear address space (as shown in Figure 3-6). At any time, a program can thus access up to six
segments in the linear-address space. To access a segment not pointed to by one of the segment
registers, a program must first load the segment selector for the segment to be accessed into a
segment register.
[Figure 3-5: flat memory model. The segment registers select overlapping segments that together
form the linear address space for the program.]

[Figure 3-6: segmented memory model. The segment registers CS, DS, SS, ES, FS, and GS select
separate code, data, and stack segments; all segments are mapped to the same linear-address
space.]
Each of the segment registers is associated with one of three types of storage: code, data, or
stack. For example, the CS register contains the segment selector for the code segment, where
the instructions being executed are stored. The processor fetches instructions from the code
segment, using a logical address that consists of the segment selector in the CS register and the
contents of the EIP register. The EIP register contains the offset within the code segment
of the next instruction to be executed. The CS register cannot be loaded explicitly by an
application program. Instead, it is loaded implicitly by instructions or internal processor
operations that change program control (such as procedure calls, interrupt handling, or task
switching).
The DS, ES, FS, and GS registers point to four data segments. The availability of four data
segments permits efficient and secure access to different types of data structures. For example,
four separate data segments might be created: one for the data structures of the current module,
another for the data exported from a higher-level module, a third for a dynamically created data
structure, and a fourth for data shared with another program. To access additional data segments,
the application program must load segment selectors for these segments into the DS, ES, FS, and
GS registers, as needed.
The SS register contains the segment selector for a stack segment, where the procedure stack is
stored for the program, task, or handler currently being executed. All stack operations use the
SS register to find the stack segment. Unlike the CS register, the SS register can be loaded
explicitly, which permits application programs to set up multiple stacks and switch among them.
Refer to Section 3.3., “Memory Organization” for an overview of how the segment registers are
used in real-address mode.
The four segment registers CS, DS, SS, and ES are the same as the segment registers found in
the Intel 8086 and Intel 286 processors and the FS and GS registers were introduced into the IA
with the Intel386™ family of processors.
EFLAGS register (bits 31 through 0)

    Bit(s)   Type   Flag
    21       X      ID Flag (ID)
    20       X      Virtual Interrupt Pending (VIP)
    19       X      Virtual Interrupt Flag (VIF)
    18       X      Alignment Check (AC)
    17       X      Virtual-8086 Mode (VM)
    16       X      Resume Flag (RF)
    14       X      Nested Task (NT)
    13-12    X      I/O Privilege Level (IOPL)
    11       S      Overflow Flag (OF)
    10       C      Direction Flag (DF)
    9        X      Interrupt Enable Flag (IF)
    8        X      Trap Flag (TF)
    7        S      Sign Flag (SF)
    6        S      Zero Flag (ZF)
    4        S      Auxiliary Carry Flag (AF)
    2        S      Parity Flag (PF)
    0        S      Carry Flag (CF)

S indicates a status flag, C a control flag, and X a system flag.
Bits 31 through 22, 15, 5, and 3 are reserved and shown as 0; bit 1 is reserved and shown as 1.
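The bit layout can be expressed as a small Python decoder. It is a sketch for illustration only: the names EFLAGS_BITS and decode_eflags are hypothetical, and the sample value passed to the function is made up.

```python
# Bit position of each single-bit flag in the EFLAGS register.
EFLAGS_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8, "IF": 9,
    "DF": 10, "OF": 11, "NT": 14, "RF": 16, "VM": 17, "AC": 18,
    "VIF": 19, "VIP": 20, "ID": 21,
}


def decode_eflags(value):
    """Return a dict mapping each flag name to its value in `value`."""
    flags = {name: (value >> bit) & 1 for name, bit in EFLAGS_BITS.items()}
    flags["IOPL"] = (value >> 12) & 0x3      # two-bit field, bits 13-12
    return flags


sample = decode_eflags(0x00000246)           # made-up register image
print({name: v for name, v in sample.items() if v})
```

Because the positions of existing flags have not moved between processor families, a table like this only ever grows as flags are added.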
As the IA has evolved, flags have been added to the EFLAGS register, but the function and
placement of existing flags have remained the same from one family of the IA processors to the
next. As a result, code that accesses or modifies these flags for one family of IA processors
works as expected when run on later families of processors.
3.6.3.2. DF FLAG
The direction flag (DF, located in bit 10 of the EFLAGS register) controls the string instructions
(MOVS, CMPS, SCAS, LODS, and STOS). Setting the DF flag causes the string instructions to
auto-decrement (that is, to process strings from high addresses to low addresses). Clearing the
DF flag causes the string instructions to auto-increment (process strings from low addresses
to high addresses).
The STD and CLD instructions set and clear the DF flag, respectively.
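The effect of the DF flag on a repeated byte move can be sketched in Python. The function name movs_bytes and the buffer contents are hypothetical; the sketch models only the pointer stepping, with src, dst, and count standing in for the ESI, EDI, and ECX registers.

```python
def movs_bytes(mem, src, dst, count, df):
    """Copy `count` bytes within `mem`, stepping as the DF flag dictates.

    Returns the final (src, dst) values, as ESI and EDI would hold them.
    """
    step = -1 if df else 1        # DF set: auto-decrement; clear: auto-increment
    for _ in range(count):
        mem[dst] = mem[src]
        src += step
        dst += step
    return src, dst


forward = bytearray(b"-abcd-----")
print(movs_bytes(forward, 1, 5, 4, df=0), bytes(forward))

backward = bytearray(b"-abcd-----")
print(movs_bytes(backward, 4, 8, 4, df=1), bytes(backward))
```

With DF clear the pointers start at the lowest address of each string; with DF set they start at the highest address. Both calls above produce the same copy.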
flag; the processor only reads it.) Used in conjunction with the VIF
flag.
ID (bit 21) Identification flag. The ability of a program to set or clear this flag
indicates support for the CPUID instruction.
Refer to Chapter 3, Protected-Mode Memory Management, in the Intel Architecture Software
Developer’s Manual, Volume 3, for a detailed description of these flags.
32-bit address-size attribute is in force, segment offsets and displacements are 32 bits, allowing
segments of up to 4 GBytes to be addressed.
The default operand-size attribute and/or address-size attribute can be overridden for a particular
instruction by adding an operand-size and/or address-size prefix to an instruction (refer to
Chapter 17, Mixing 16-Bit and 32-Bit Code of the Intel Architecture Software Developer’s
Manual, Volume 3). The effect of this prefix applies only to the instruction it is attached to.
Table 3-1 shows effective operand size and address size (when executing in protected mode)
depending on the settings of the D/B flag and the operand-size and address-size prefixes.
NOTES:
Y Yes, this instruction prefix is present.
N No, this instruction prefix is not present.
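The way the prefixes interact with the default sizes can be sketched as a small Python function. The function name and parameter names are hypothetical. The sketch assumes that a set D flag selects 32-bit defaults, that a clear D flag selects 16-bit defaults, and that each prefix switches its attribute to the other size for one instruction.

```python
def effective_sizes(d_flag, operand_prefix=False, address_prefix=False):
    """Return (operand size, address size) in bits for one instruction."""
    default = 32 if d_flag else 16
    other = 16 if d_flag else 32
    operand_size = other if operand_prefix else default
    address_size = other if address_prefix else default
    return operand_size, address_size


for d in (0, 1):
    for op in (False, True):
        for ad in (False, True):
            print(d, op, ad, effective_sizes(d, op, ad))
```

The two prefixes act independently, so an instruction can carry either one, both, or neither.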
CHAPTER 4
PROCEDURE CALLS, INTERRUPTS, AND
EXCEPTIONS
This chapter describes the facilities in the Intel Architecture (IA) for executing calls to proce-
dures or subroutines. It also describes how interrupts and exceptions are handled from the
perspective of an application programmer.
4.2. STACK
The stack (refer to Figure 4-1) is a contiguous array of memory locations. It is contained in a
segment and identified by the segment selector in the SS register. (When using the flat memory
model, the stack can be located anywhere in the linear address space for the program.) A stack
can be up to 4 gigabytes long, the maximum size of a segment.
The next available memory location on the stack is called the top of stack. At any given time,
the stack pointer (contained in the ESP register) gives the address (that is, the offset from the base
of the SS segment) of the top of the stack.
Items are placed on the stack using the PUSH instruction and removed from the stack using the
POP instruction. When an item is pushed onto the stack, the processor decrements the ESP
register, then writes the item at the new top of stack. When an item is popped off the stack, the
processor reads the item from the top of stack, then increments the ESP register. In this manner,
the stack grows down in memory (towards lesser addresses) when items are pushed on the stack
and shrinks up (towards greater addresses) when the items are popped from the stack.
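The push and pop sequences described above can be sketched in Python. The class name Stack32, the 64-byte stack size, and the pushed values are hypothetical; the sketch assumes doubleword (4-byte) items stored in little-endian byte order.

```python
class Stack32:
    """Toy downward-growing stack of doublewords."""

    def __init__(self, size=64):
        self.mem = bytearray(size)
        self.esp = size               # empty stack: ESP just above the stack area

    def push(self, value):
        self.esp -= 4                 # decrement ESP first ...
        self.mem[self.esp:self.esp + 4] = (value & 0xFFFFFFFF).to_bytes(4, "little")

    def pop(self):
        value = int.from_bytes(self.mem[self.esp:self.esp + 4], "little")
        self.esp += 4                 # ... and increment it after reading
        return value


s = Stack32()
s.push(0x11111111)
s.push(0x22222222)
print(s.esp, hex(s.pop()), s.esp)
```

Each push lowers ESP by the item size and each pop raises it by the same amount, so matched pushes and pops leave ESP where it started.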
A program or operating system/executive can set up many stacks. For example, in multitasking
systems, each task can be given its own stack. The number of stacks in a system is limited by
the maximum number of segments and the available physical memory. When a system sets up
many stacks, only one stack—the current stack—is available at a time. The current stack is the
one contained in the segment referenced by the SS register.
[Figure 4-1: stack structure. Within the stack segment, the bottom of stack (initial ESP value)
is followed by the local variables for the calling procedure, the parameters passed to the
called procedure, the frame boundary, and the return instruction pointer. The ESP register
points to the top of stack. The stack can be 16 or 32 bits wide. The EBP register is typically
set to point to the return instruction pointer.]
The processor references the SS register automatically for all stack operations. For example,
when the ESP register is used as a memory address, it automatically points to an address in the
current stack. Also, the CALL, RET, PUSH, POP, ENTER, and LEAVE instructions all perform
operations on the current stack.
the segment’s descriptor. When this flag is clear, the default address-size attribute is 16; when
the flag is set, the address-size attribute is 32.
in any of the general-purpose registers that it will need when it resumes execution after a return.
These values can be saved on the stack or in memory in one of the data segments.
The PUSHA and POPA instructions facilitate saving and restoring the contents of the
general-purpose registers. PUSHA pushes the values in all the general-purpose registers on the
stack in the following order: EAX, ECX, EDX, EBX, ESP (the value prior to executing the PUSHA
instruction), EBP, ESI, and EDI. The POPA instruction pops all the register values saved with a
PUSHA instruction (except the ESP value) from the stack to their respective registers.
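The save and restore sequence can be modeled in Python. This is a sketch under stated assumptions: the function names pusha and popa, the dictionary used as stack memory, and the register values are all hypothetical. The sketch assumes that POPA discards the saved ESP image rather than loading it.

```python
PUSHA_ORDER = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


def pusha(regs, stack):
    """Push all general-purpose registers; `stack` maps address -> doubleword."""
    original_esp = regs["ESP"]        # the value prior to executing PUSHA
    for name in PUSHA_ORDER:
        value = original_esp if name == "ESP" else regs[name]
        regs["ESP"] -= 4
        stack[regs["ESP"]] = value


def popa(regs, stack):
    """Pop the registers in reverse order, skipping the saved ESP image."""
    for name in reversed(PUSHA_ORDER):
        value = stack[regs["ESP"]]
        regs["ESP"] += 4
        if name != "ESP":
            regs[name] = value


regs = {"EAX": 1, "ECX": 2, "EDX": 3, "EBX": 4,
        "ESP": 0x1000, "EBP": 0x2000, "ESI": 5, "EDI": 6}
saved = dict(regs)
stack = {}
pusha(regs, stack)
regs.update(EAX=0, ECX=0, EDX=0, EBX=0, EBP=0, ESI=0, EDI=0)   # clobber
popa(regs, stack)
print(regs == saved)
```

ESP returns to its original value because eight doublewords are pushed and eight are popped, not because the saved copy is reloaded.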
If a called procedure changes the state of any of the segment registers explicitly, it should restore
them to their former value before executing a return to the calling procedure.
If a calling procedure needs to maintain the state of the EFLAGS register, it can save and restore
all or part of the register using the PUSHF/PUSHFD and POPF/POPFD instructions. The
PUSHF instruction pushes the lower word of the EFLAGS register on the stack, while the
PUSHFD instruction pushes the entire register. The POPF instruction pops a word from the
stack into the lower word of the EFLAGS register, while the POPFD instruction pops a double
word from the stack into the register.
[Figure: protection rings. Privilege level 0 (highest privilege) holds the operating system
kernel, levels 1 and 2 hold operating system services (device drivers, etc.), and level 3
(lowest privilege) holds applications.]
If an operating system or executive uses this multilevel protection mechanism, a call to a proce-
dure that is in a more privileged protection level than the calling procedure is handled in a
similar manner as a far call (refer to Section 4.3.2., “Far CALL and RET Operation”). The
differences are as follows:
• The segment selector provided in the CALL instruction references a special data structure
called a call gate descriptor. Among other things, the call gate descriptor provides the
following:
— Access rights information.
— The segment selector for the code segment of the called procedure.
— An offset into the code segment (that is, the instruction pointer for the called
procedure).
• The processor switches to a new stack to execute the called procedure. Each privilege level
has its own stack. The segment selector and stack pointer for the privilege level 3 stack are
stored in the SS and ESP registers, respectively, and are automatically saved when a call to
a more privileged level occurs. The segment selectors and stack pointers for the privilege
level 2, 1, and 0 stacks are stored in a system segment called the task state segment (TSS).
The use of a call gate and the TSS during a stack switch are transparent to the calling procedure,
except when a general-protection exception is raised.
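[Figure: stack switch on a call to a different privilege level. Before the call, the stack
frame of the calling procedure holds Param 1, Param 2, and Param 3, with ESP pointing at
Param 3. After the call, the stack frame of the called procedure holds the calling SS, the
calling ESP, Param 1, Param 2, Param 3, the calling CS, and the calling EIP, with ESP pointing
at the calling EIP. Before the return, ESP points at the calling EIP on the called procedure's
stack; after the return, ESP points back into the calling procedure's stack.]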
Refer to Chapter 4, Protection of the Intel Architecture Software Developer’s Manual, Volume
3, for detailed information on calls to privileged levels and the call gate descriptor.
rupt descriptor table (IDT). When the handler has completed handling the interrupt or exception,
program control is returned to the interrupted program or task.
The operating system, executive, and/or device drivers normally handle interrupts and excep-
tions independently from application programs or tasks. Application programs can, however,
access the interrupt and exception handlers incorporated in an operating system or executive
through assembly-language calls. The remainder of this section gives a brief overview of the
processor’s interrupt and exception handling mechanism. Refer to Chapter 5, Interrupt and
Exception Handling of the Intel Architecture Software Developer’s Manual, Volume 3, for a
detailed description of this mechanism.
The IA defines 18 predefined interrupts and exceptions and 224 user-defined interrupts, which
are associated with entries in the IDT. Each interrupt and exception in the IDT is identified with
a number, called a vector. Table 4-1 lists the interrupts and exceptions with entries in the IDT
and their respective vector numbers. Vectors 0 through 8, 10 through 14, and 16 through 19 are
the predefined interrupts and exceptions, and vectors 32 through 255 are the user-defined inter-
rupts, called maskable interrupts.
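The vector ranges quoted above can be written down as a small Python classifier. The names PREDEFINED, USER_DEFINED, and classify_vector are hypothetical. Vectors that fall outside the quoted ranges are reported as "other" because the text above does not describe them.

```python
# Vector ranges as quoted in the text (Python ranges exclude the end value).
PREDEFINED = set(range(0, 9)) | set(range(10, 15)) | set(range(16, 20))
USER_DEFINED = set(range(32, 256))


def classify_vector(vector):
    """Classify an IDT vector number using the ranges quoted in the text."""
    if not 0 <= vector <= 255:
        raise ValueError("vector numbers run from 0 through 255")
    if vector in PREDEFINED:
        return "predefined"
    if vector in USER_DEFINED:
        return "user-defined"
    return "other"


print(len(PREDEFINED), len(USER_DEFINED), classify_vector(19), classify_vector(32))
```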
Note that the processor defines several additional interrupts that do not point to entries in the
IDT; the most notable of these interrupts is the SMI interrupt. Refer to Chapter 5, Interrupt and
Exception Handling of the Intel Architecture Software Developer’s Manual, Volume 3, for more
information about the interrupts and exceptions that the IA supports.
When the processor detects an interrupt or exception, it does one of the following things:
• Executes an implicit call to a handler procedure.
• Executes an implicit call to a handler task.
The Pentium® III processor can generate two types of exceptions:
• Numeric exceptions
• Non-numeric exceptions
When numeric exceptions occur, a processor supporting Streaming SIMD Extensions takes one
of two possible courses of action:
• The processor can handle the exception by itself, producing the most reasonable result and
allowing numeric program execution to continue undisturbed (i.e., masked exception
response).
• A software exception handler can be invoked to handle the exception (i.e., unmasked
exception response).
Each of the numeric exception conditions has corresponding flag and mask bits in the MXCSR
(Streaming SIMD Extensions control status register). If an exception is masked (the corre-
sponding mask bit in MXCSR = 1), the processor takes an appropriate default action and
continues with the computation. If the exception is unmasked (mask bit = 0) and the OS supports
SIMD floating-point exceptions (i.e. CR4.OSXMMEXCPT = 1), a software exception handler
is invoked immediately through SIMD floating-point exception interrupt vector 19. If the excep-
tion is unmasked (mask bit = 0) and the OS does not support SIMD floating-point exceptions
(i.e. CR4.OSXMMEXCPT = 0), an invalid opcode exception is signaled instead of a SIMD
floating-point exception.
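The three outcomes described in this paragraph can be summarized as a decision function. The Python sketch below is illustrative only: the function name and the returned strings are hypothetical. The two arguments stand for the exception's mask bit in MXCSR and the CR4.OSXMMEXCPT bit.

```python
def simd_numeric_exception_response(mask_bit, osxmmexcpt):
    """Return the response to a SIMD numeric exception condition."""
    if mask_bit == 1:
        # Masked: the processor supplies a default result and continues.
        return "masked response: default action, computation continues"
    if osxmmexcpt == 1:
        # Unmasked, and the OS supports SIMD floating-point exceptions.
        return "software handler invoked through vector 19"
    # Unmasked, but the OS does not support SIMD floating-point exceptions.
    return "invalid opcode exception"


for mask in (1, 0):
    for os_support in (1, 0):
        print(mask, os_support, simd_numeric_exception_response(mask, os_support))
```

When the exception is masked, the CR4.OSXMMEXCPT setting has no bearing on the outcome.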
If no stack switch occurs, the processor does the following when calling an interrupt or excep-
tion handler (refer to Figure 4-5):
1. Pushes the current contents of the EFLAGS, CS, and EIP registers (in that order) on the
stack.
2. Pushes an error code (if appropriate) on the stack.
3. Loads the segment selector for the new code segment and the new instruction pointer
(from the interrupt gate or trap gate) into the CS and EIP registers, respectively.
4. If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.
5. Begins execution of the handler procedure at the same privilege level.
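The steps above can be traced with a short Python model. It is a sketch, not a description of processor internals: the function name transfer_to_handler, the dictionary layouts, and all register and gate values are hypothetical. The stack is a Python list whose last element is the top of stack.

```python
IF_BIT = 1 << 9                        # interrupt enable flag, bit 9 of EFLAGS


def transfer_to_handler(cpu, stack, gate, error_code=None):
    """Model a transfer to a handler when no stack switch occurs."""
    stack.append(cpu["EFLAGS"])        # step 1: EFLAGS, CS, EIP in that order
    stack.append(cpu["CS"])
    stack.append(cpu["EIP"])
    if error_code is not None:         # step 2: error code, if appropriate
        stack.append(error_code)
    cpu["CS"] = gate["selector"]       # step 3: new CS and EIP from the gate
    cpu["EIP"] = gate["offset"]
    if gate["kind"] == "interrupt":    # step 4: interrupt gates clear IF
        cpu["EFLAGS"] &= ~IF_BIT
    # Step 5: execution continues at the new CS:EIP.


cpu = {"EFLAGS": 0x202, "CS": 0x08, "EIP": 0x1000}
stack = []
gate = {"kind": "interrupt", "selector": 0x18, "offset": 0x4000}
transfer_to_handler(cpu, stack, gate, error_code=0)
print([hex(v) for v in stack], hex(cpu["EFLAGS"]))
```

The EFLAGS image saved on the stack still has IF set, so the interrupted program gets its original flag state back when that image is restored on return.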
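Stack usage with no stack switch
(interrupted procedure's and handler's stack):

    (ESP before transfer to handler)
    EFLAGS
    CS
    EIP
    Error Code    <- ESP after transfer to handler

Stack usage with a stack switch:

    Interrupted procedure's stack:
    (ESP before transfer to handler)

    Handler's stack:
    SS
    ESP
    EFLAGS
    CS
    EIP
    Error Code    <- ESP after transfer to handler
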
Figure 4-5. Stack Usage on Transfers to Interrupt and Exception Handling Routines
The BOUND instruction explicitly calls the BOUND-range exceeded exception (#BR) handler
if an operand is found to be not within predefined boundaries in memory. This instruction is
provided for checking references to arrays and other data structures. Like the overflow
exception, the BOUND-range exceeded exception can only be raised explicitly with the
BOUND instruction or the INT n instruction with an argument of 5 (the vector number of the
bounds-check exception). The processor does not implicitly perform bounds checks and raise
the BOUND-range exceeded exception.
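A software analogue of the check is sketched below in Python. The exception class BoundRangeExceeded and the function name bound are hypothetical. The sketch assumes that both limits are inclusive and that the index is compared as a signed value.

```python
class BoundRangeExceeded(Exception):
    """Stands in for the BOUND-range exceeded exception (#BR)."""
    vector = 5


def bound(index, lower, upper):
    """Raise BoundRangeExceeded unless lower <= index <= upper."""
    if index < lower or index > upper:
        raise BoundRangeExceeded(f"index {index} outside [{lower}, {upper}]")


bound(3, 0, 9)                         # within bounds: no exception
try:
    bound(10, 0, 9)
except BoundRangeExceeded as exc:
    print("vector", exc.vector, "-", exc)
```

As with the instruction, the check happens only where the program asks for it; ordinary array accesses are not checked.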
ENTER 2048,3
The lexical nesting level determines the number of stack frame pointers to copy into the new
stack frame from the preceding frame. A stack frame pointer is a doubleword used to access the
variables of a procedure. The set of stack frame pointers used by a procedure to access the
variables of other procedures is called the display. The first doubleword in the display is a
pointer to the previous stack frame. This pointer is used by a LEAVE instruction to undo the
effect of an ENTER instruction by discarding the current stack frame.
After the ENTER instruction creates the display for a procedure, it allocates the dynamic local
variables for the procedure by decrementing the contents of the ESP register by the number of
bytes specified in the first parameter. This new value in the ESP register serves as the initial top-
of-stack for all PUSH and POP operations within the procedure.
To allow a procedure to address its display, the ENTER instruction leaves the EBP register
pointing to the first doubleword in the display. Because stacks grow down, this is actually the
doubleword with the highest address in the display. Data manipulation instructions that specify
the EBP register as a base register automatically address locations within the stack segment
instead of the data segment.
The ENTER instruction can be used in two ways: nested and non-nested. If the lexical level is
0, the non-nested form is used. The non-nested form pushes the contents of the EBP register on
the stack, copies the contents of the ESP register into the EBP register, and subtracts the first
operand from the contents of the ESP register to allocate dynamic storage. The non-nested form
differs from the nested form in that no stack frame pointers are copied. The nested form of the
ENTER instruction occurs when the second parameter (lexical level) is not zero.
The following pseudo code shows the formal definition of the ENTER instruction. STORAGE
is the number of bytes of dynamic storage to allocate for local variables, and LEVEL is the
lexical nesting level.
PUSH EBP;
FRAME_PTR ← ESP;
IF LEVEL > 0
    THEN
        DO (LEVEL − 1) times
            EBP ← EBP − 4;
            PUSH Pointer(EBP); (* doubleword pointed to by EBP *)
        OD;
        PUSH FRAME_PTR;
FI;
EBP ← FRAME_PTR;
ESP ← ESP − STORAGE;
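The pseudo code can be traced with a small Python model. This is a sketch under stated assumptions: the function name enter, the dictionaries used for the registers and for stack memory, the starting ESP and EBP values, and the storage sizes are all hypothetical. The 4-byte adjustment between the two calls stands in for the return address that a CALL instruction would push.

```python
def enter(cpu, mem, storage, level):
    """Model ENTER storage, level; `mem` maps a stack address to a doubleword."""
    def push(value):
        cpu["ESP"] -= 4
        mem[cpu["ESP"]] = value

    push(cpu["EBP"])                   # save the previous frame pointer
    frame_ptr = cpu["ESP"]
    if level > 0:
        for _ in range(level - 1):     # copy the caller's display entries
            cpu["EBP"] -= 4
            push(mem[cpu["EBP"]])
        push(frame_ptr)                # add a pointer to the new frame
    cpu["EBP"] = frame_ptr
    cpu["ESP"] -= storage              # allocate dynamic storage


cpu = {"EBP": 0x2000, "ESP": 0x1000}   # made-up starting state
mem = {}

enter(cpu, mem, 12, 1)                 # MAIN: level 1, three doublewords
main_ebp = cpu["EBP"]

cpu["ESP"] -= 4                        # stand-in for the return address
mem[cpu["ESP"]] = 0
enter(cpu, mem, 8, 2)                  # procedure A: level 2
a_ebp = cpu["EBP"]

print(hex(main_ebp), hex(a_ebp), hex(cpu["ESP"]))
print(hex(mem[a_ebp]), hex(mem[a_ebp - 4]), hex(mem[a_ebp - 8]))
```

In this trace the first two doublewords of procedure A's display both hold MAIN's EBP value, and the third holds procedure A's own frame pointer.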
The main procedure (in which all other procedures are nested) operates at the highest lexical
level, level 1. The first procedure it calls operates at the next deeper lexical level, level 2. A level
2 procedure can access the variables of the main program, which are at fixed locations specified
by the compiler. In the case of level 1, the ENTER instruction allocates only the requested
dynamic storage on the stack because there is no previous display to copy.
A procedure which calls another procedure at a lower lexical level gives the called procedure
access to the variables of the caller. The ENTER instruction provides this access by placing a
pointer to the calling procedure’s stack frame in the display.
A procedure which calls another procedure at the same lexical level should not give access to
its variables. In this case, the ENTER instruction copies only that part of the display from the
calling procedure which refers to previously nested procedures operating at higher lexical levels.
The new stack frame does not include the pointer for addressing the calling procedure’s stack
frame.
The ENTER instruction treats a re-entrant procedure as a call to a procedure at the same lexical
level. In this case, each succeeding iteration of the re-entrant procedure can address only its own
variables and the variables of the procedures within which it is nested. A re-entrant procedure
always can address its own variables; it does not require pointers to the stack frames of previous
iterations.
By copying only the stack frame pointers of procedures at higher lexical levels, the ENTER
instruction makes certain that procedures access only those variables of higher lexical levels, not
those at parallel lexical levels (refer to Figure 4-6).
Block-structured languages can use the lexical levels defined by ENTER to control access to the
variables of nested procedures. In Figure 4-6, for example, if procedure A calls procedure B
which, in turn, calls procedure C, then procedure C will have access to the variables of the
MAIN procedure and procedure A, but not those of procedure B because they are at the same
lexical level. The following definition describes the access to variables for the nested procedures
in Figure 4-6.
1. MAIN has variables at fixed locations.
2. Procedure A can access only the variables of MAIN.
3. Procedure B can access only the variables of procedure A and MAIN. Procedure B cannot
access the variables of procedure C or procedure D.
4. Procedure C can access only the variables of procedure A and MAIN. Procedure C cannot
access the variables of procedure B or procedure D.
5. Procedure D can access the variables of procedure C, procedure A, and MAIN. Procedure
D cannot access the variables of procedure B.
In Figure 4-7, an ENTER instruction at the beginning of the MAIN procedure creates three
doublewords of dynamic storage for MAIN, but copies no pointers from other stack frames. The
first doubleword in the display holds a copy of the last value in the EBP register before the
ENTER instruction was executed. The second doubleword holds a copy of the contents of the
EBP register following the ENTER instruction. After the instruction is executed, the EBP
register points to the first doubleword pushed on the stack, and the ESP register points to the last
doubleword in the stack frame.
When MAIN calls procedure A, the ENTER instruction creates a new display (refer to Figure
4-8). The first doubleword is the last value held in MAIN’s EBP register. The second double-
word is a pointer to MAIN’s stack frame which is copied from the second doubleword in MAIN’s
display. This happens to be another copy of the last value held in MAIN’s EBP register. Proce-
dure A can access variables in MAIN because MAIN is at level 1. Therefore the base address
for the dynamic storage used in MAIN is the current address in the EBP register, plus four bytes
to account for the saved contents of MAIN’s EBP register. All dynamic variables for MAIN are
at fixed, positive offsets from this value.
[Figure residue: stack frame diagram. Labels, from the top of the frame: Old EBP; Main's EBP; Dynamic Storage; ESP.]
When procedure A calls procedure B, the ENTER instruction creates a new display (refer to
Figure 4-9). The first doubleword holds a copy of the last value in procedure A’s EBP register.
The second and third doublewords are copies of the two stack frame pointers in procedure A’s
display. Procedure B can access variables in procedure A and MAIN by using the stack frame
pointers in its display.
[Figure residue: stack frame diagram. Labels, from the top of the stack: Old EBP; Main's EBP; Main's EBP; Main's EBP; Procedure A's EBP; Dynamic Storage; ESP.]
When procedure B calls procedure C, the ENTER instruction creates a new display for proce-
dure C (refer to Figure 4-10). The first doubleword holds a copy of the last value in procedure
B’s EBP register. This is used by the LEAVE instruction to restore procedure B’s stack frame.
The second and third doublewords are copies of the two stack frame pointers in procedure A’s
display. If procedure C were at the next deeper lexical level from procedure B, a fourth double-
word would be copied, which would be the stack frame pointer to procedure B’s local variables.
Note that procedure B and procedure C are at the same level, so procedure C is not intended to
access procedure B’s variables. This does not mean that procedure C is completely isolated from
procedure B; procedure C is called by procedure B, so the pointer to the returning stack frame
is a pointer to procedure B's stack frame. In addition, procedure B can pass parameters to proce-
dure C either on the stack or through variables global to both procedures (that is, variables in the
scope of both procedures).
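The walk through MAIN, A, B, and C above can be sketched as a small C model of how ENTER builds each display. This is only an illustration under stated assumptions: the stack is an array of doublewords, "addresses" are array indices, the frame size is given in doublewords rather than bytes, and return addresses pushed by CALL are ignored. The names `Machine`, `push`, and `enter` are hypothetical.

```c
#include <stdint.h>

#define STACK_WORDS 64

/* Hypothetical model: indices into mem[] stand in for stack addresses. */
typedef struct {
    uint32_t mem[STACK_WORDS];
    uint32_t esp;   /* index of the current top of stack */
    uint32_t ebp;   /* index of the current frame base */
} Machine;

static void push(Machine *m, uint32_t v) { m->mem[--m->esp] = v; }

/* Model of ENTER: size is in doublewords here, level is the lexical level. */
void enter(Machine *m, uint32_t size_dwords, unsigned level)
{
    push(m, m->ebp);                 /* first display entry: caller's EBP */
    uint32_t frame = m->esp;         /* base of the new stack frame */
    for (unsigned i = 1; i < level; i++) {
        m->ebp -= 1;                 /* step through the caller's display */
        push(m, m->mem[m->ebp]);     /* copy one higher-level frame pointer */
    }
    if (level > 0)
        push(m, frame);              /* last entry: pointer to the new frame */
    m->ebp = frame;
    m->esp -= size_dwords;           /* reserve dynamic storage */
}
```

Calling `enter` with levels 1, 2, 3, and 3 in turn reproduces the sequence described above: the display built for procedure C holds the frame pointers of MAIN and procedure A, and procedure B's frame appears only as the saved EBP value.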
[Figure residue: stack frame diagram. Labels, from the top of the stack: Old EBP; Main's EBP; Main's EBP; Main's EBP; Procedure A's EBP; Dynamic Storage; ESP.]
CHAPTER 5
DATA TYPES AND ADDRESSING MODES
This chapter describes data types and addressing modes available to programmers of the Intel
Architecture (IA) processors.
Data type    Bits    Layout                                               Address
Byte         7-0     single byte                                          N
Word         15-0    high byte at N+1, low byte at N                      N
Doubleword   31-0    high word at N+2, low word at N                      N
Quadword     63-0    high doubleword at N+4, low doubleword at N          N
Figure 5-1. Fundamental Data Types
The Pentium® III processor introduced a new 128-bit packed data type, which holds four packed
single-precision (32-bit) floating-point numbers. These values are the operands for the SIMD
floating-point operations and for the scalar equivalents of these instructions.
Refer to Chapter 5-2, SIMD Floating-Point Data Type for a description of this data type.
[Figure residue: 128-bit packed data type, with fields at bits 127-96, 95-64, 63-32, and 31-0.]
Figure 5-2 shows the byte order of each of the fundamental data types when referenced as oper-
ands in memory. The low byte (bits 0 through 7) of each data type occupies the lowest address
in memory and that address is also the address of the operand.
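[Figure 5-2 residue, partially recovered: memory addresses 8H through EH are shown with the
bytes A4H at 8H, 1FH at 9H, 36H at AH, and 7AH at DH. The byte at address 9H contains 1FH.
The quadword at address 6H contains 7AFE06361FA4230BH.]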
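The byte ordering described above can be checked with a short C sketch. It is a minimal model, assuming memory is a plain byte array; the helper names `store_le` and `load_le` are hypothetical and are not part of any Intel interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Store the low `size` bytes of value, low byte at the lowest address. */
void store_le(uint8_t *mem, size_t addr, uint64_t value, size_t size)
{
    for (size_t i = 0; i < size; i++)
        mem[addr + i] = (uint8_t)(value >> (8 * i));
}

/* Load a `size`-byte operand whose address is the address of its low byte. */
uint64_t load_le(const uint8_t *mem, size_t addr, size_t size)
{
    uint64_t v = 0;
    for (size_t i = 0; i < size; i++)
        v |= (uint64_t)mem[addr + i] << (8 * i);
    return v;
}
```

Storing the quadword 7AFE06361FA4230BH at address 6H with `store_le` places 1FH at address 9H and 7AH at address DH, which agrees with the values recovered from the figure.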
When accessing 128-bit data for the Pentium® III processor, data must be aligned on 16-byte
boundaries. There are instructions that allow for unaligned access, but additional time is
required to receive the data into the cache. If an instruction that expects aligned data is used to
access unaligned data, a general protection fault will occur.
5.2.1. Integers
Integers are signed binary numbers held in a byte, word, or doubleword. All operations assume
a two’s complement representation. The sign bit is located in bit 7 in a byte integer, bit 15 in a
word integer, and bit 31 in a doubleword integer. The sign bit is set for negative integers and
cleared for positive integers and zero. Integer values range from –128 to +127 for a byte integer,
from –32,768 to +32,767 for a word integer, and from –2^31 to +2^31 – 1 for a doubleword integer.
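As a small illustration of the sign-bit positions given above, the following C sketch extracts the sign bit of each integer size. The function names are hypothetical; the fixed-width types come from the standard `<stdint.h>` header.

```c
#include <stdint.h>

/* Each function returns 1 when the sign bit is set (negative value). */
int sign_bit8(int8_t v)   { return ((uint8_t)v >> 7) & 1; }          /* bit 7  */
int sign_bit16(int16_t v) { return ((uint16_t)v >> 15) & 1; }        /* bit 15 */
int sign_bit32(int32_t v) { return (int)(((uint32_t)v >> 31) & 1); } /* bit 31 */
```

The limits `INT8_MIN`, `INT16_MIN`, and `INT32_MIN` in `<stdint.h>` correspond to the lower bounds quoted in the text (–128, –32,768, and –2^31).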
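[Figure residue: numeric, pointer, and bit-field data types]
(label not recovered): bits 7-0
Word Unsigned Integer: bits 15-0
Doubleword Unsigned Integer: bits 31-0
BCD Integers: one BCD digit in bits 3-0 of each byte; bits 7-4 are marked X
Packed BCD Integers: two BCD digits per byte, in bits 7-4 and bits 3-0
Near Pointer: offset or linear address in bits 31-0
Far Pointer or Logical Address: segment selector in bits 47-32, offset in bits 31-0
Bit Field: located by its least significant bit and its field length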
5.2.4. Pointers
Pointers are addresses of locations in memory. The Pentium® Pro processor recognizes two types
of pointers: a near pointer (32 bits) and a far pointer (48 bits). A near pointer is a 32-bit offset
(also called an effective address) within a segment. Near pointers are used for all memory refer-
ences in a flat memory model or for references in a segmented model where the identity of the
segment being accessed is implied. A far pointer is a 48-bit logical address, consisting of a 16-bit
segment selector and a 32-bit offset. Far pointers are used for memory references in a segmented
memory model where the identity of a segment being accessed must be specified explicitly.
5.2.6. Strings
Strings are continuous sequences of bits, bytes, words, or doublewords. A bit string can begin
at any bit position of any byte and can contain up to 2^32 – 1 bits. A byte string can contain bytes,
words, or doublewords and can range from zero to 2^32 – 1 bytes (4 gigabytes).
[Figure residue: logical address format. Segment selector in bits 15-0; offset (or linear address) in bits 31-0.]
When storing data in or loading data from memory, the DS segment default can be overridden
to allow other segments to be accessed. Within an assembler, the segment override is generally
handled with a colon “:” operator. For example, the following MOV instruction moves a value
from register EAX into the segment pointed to by the ES register. The offset into the segment is
contained in the EBX register:
MOV ES:[EBX], EAX;
(At the machine level, a segment override is specified with a segment-override prefix, which is
a byte placed at the beginning of an instruction.) The following default segment selections
cannot be overridden:
• Instruction fetches must be made from the code segment.
• Destination strings in string instructions must be stored in the data segment pointed to by
the ES register.
• Push and pop operations must always reference the SS segment.
Some instructions require a segment selector to be specified explicitly. In these cases, the 16-bit
segment selector can be located in a memory location or in a 16-bit register. For example, the
following MOV instruction moves a segment selector located in register BX into segment
register DS:
MOV DS, BX
Segment selectors can also be specified explicitly as part of a 48-bit far pointer in memory. Here,
the first doubleword in memory contains the offset and the next word contains the segment
selector.
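The in-memory layout of a 48-bit far pointer described above (offset doubleword first, then the selector word) can be sketched in C as follows. This is an illustrative model only: the `FarPointer` type and both helper functions are hypothetical names, and the byte order within each field follows the low-byte-first rule given earlier in this chapter.

```c
#include <stdint.h>

/* Hypothetical in-register view of a far pointer. */
typedef struct {
    uint16_t selector;  /* 16-bit segment selector */
    uint32_t offset;    /* 32-bit offset within the segment */
} FarPointer;

/* Write the 6-byte memory image: offset in bytes 0-3, selector in bytes 4-5. */
void far_pointer_store(uint8_t out[6], FarPointer p)
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(p.offset >> (8 * i));
    out[4] = (uint8_t)(p.selector & 0xFF);
    out[5] = (uint8_t)(p.selector >> 8);
}

/* Read the 6-byte memory image back into selector and offset. */
FarPointer far_pointer_load(const uint8_t in[6])
{
    FarPointer p;
    p.offset = (uint32_t)in[0] | ((uint32_t)in[1] << 8) |
               ((uint32_t)in[2] << 16) | ((uint32_t)in[3] << 24);
    p.selector = (uint16_t)(in[4] | (in[5] << 8));
    return p;
}
```

A byte array is used instead of a packed structure so that the sketch does not depend on compiler-specific packing directives.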
Offset = Base + (Index * Scale) + Displacement

Base:          EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI
Index:         EAX, EBX, ECX, EDX, EBP, ESI, EDI
Scale:         1, 2, 4, 8
Displacement:  None, 8-bit, 16-bit, 32-bit
The uses of general-purpose registers as base or index components are restricted in the following
manner:
• The ESP register cannot be used as an index register.
• When the ESP or EBP register is used as the base, the SS segment is the default segment.
In all other cases, the DS segment is the default segment.
The base, index, and displacement components can be used in any combination, and any of these
components can be null. A scale factor may be used only when an index also is used. Each
possible combination is useful for data structures commonly used by programmers in high-level
languages and assembly language. The following addressing modes suggest uses for common
combinations of address components.
Displacement
A displacement alone represents a direct (uncomputed) offset to the operand. Because the
displacement is encoded in the instruction, this form of an address is sometimes called an abso-
lute or static address. It is commonly used to access a statically allocated scalar operand.
Base
A base alone represents an indirect offset to the operand. Since the value in the base register can
change, it can be used for dynamic storage of variables and data structures.
Base + Displacement
A base register and a displacement can be used together for two distinct purposes:
• As an index into an array when the element size is not 2, 4, or 8 bytes—The displacement
component encodes the static offset to the beginning of the array. The base register holds
the results of a calculation to determine the offset to a specific element within the array.
• To access a field of a record—The base register holds the address of the beginning of the
record, while the displacement is a static offset to the field.
An important special case of this combination is access to parameters in a procedure activation
record. A procedure activation record is the stack frame created when a procedure is entered.
Here, the EBP register is the best choice for the base register, because it automatically selects
the stack segment. This is a compact encoding for this common function.
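The way the base, index, scale, and displacement components combine into an offset can be sketched in C. This is a minimal model, assuming 32-bit wraparound arithmetic and a scale factor of 1, 2, 4, or 8; the function name `effective_address` is hypothetical.

```c
#include <stdint.h>

/* Offset = base + (index * scale) + displacement, computed modulo 2^32.
   A component that is not used is passed as 0 (scale as 1). */
uint32_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale, int32_t displacement)
{
    return base + index * scale + (uint32_t)displacement;
}
```

For example, a record field at static offset 8 from a record whose address is in the base register is `effective_address(base, 0, 1, 8)`, and a local variable at a negative offset from EBP uses a negative displacement.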
CHAPTER 6
INSTRUCTION SET SUMMARY
This chapter lists all the instructions in the Intel Architecture (IA) instruction set, divided into
three functional groups: integer, floating-point, and system. It also briefly describes each of the
integer instructions.
Brief descriptions of the floating-point instructions are given in Chapter 7, Floating-Point Unit;
brief descriptions of the system instructions are given in the Intel Architecture Software Devel-
oper’s Manual, Volume 3.
Detailed descriptions of all the IA instructions are given in the Intel Architecture Software
Developer’s Manual, Volume 2. Included in this volume are a description of each instruction’s
encoding and operation, the effect of an instruction on the EFLAGS flags, and the exceptions an
instruction may generate.
6.2.3.3. COMPARISON
FCOM Compare real
FCOMP Compare real and pop
FCOMPP Compare real and pop twice
FUCOM Unordered compare real
FUCOMP Unordered compare real and pop
FUCOMPP Unordered compare real and pop twice
FICOM Compare integer
FICOMP Compare integer and pop
FCOMI Compare real and set EFLAGS
FUCOMI Unordered compare real and set EFLAGS
FCOMIP Compare real, set EFLAGS, and pop
FUCOMIP Unordered compare real, set EFLAGS, and pop
FTST Test real
FXAM Examine real
6.2.3.4. TRANSCENDENTAL
FSIN Sine
FCOS Cosine
FSINCOS Sine and cosine
FPTAN Partial tangent
FPATAN Partial arctangent
F2XM1 2^x − 1
FYL2X y ∗ log2(x)
FYL2XP1 y ∗ log2(x + 1)
Table 6-4 shows the mnemonics for the CMOVcc instructions and the conditions being tested
for each instruction. The condition code mnemonics are appended to the letters “CMOV” to
form the mnemonics for the CMOVcc instructions. The instructions listed in Table 6-4 as pairs
(for example, CMOVA/CMOVNBE) are alternate names for the same instruction. The assem-
bler provides these alternate names to make it easier to read program listings.
The CMOVcc instructions are useful for optimizing small IF constructions. They also help elim-
inate branching overhead for IF statements and the possibility of branch mispredictions by the
processor.
These instructions may not be supported on some processors in the Pentium® Pro processor
family. Software can check if the CMOVcc instructions are supported by checking the
processor’s feature information with the CPUID instruction (refer to “CPUID—CPU Identifica-
tion” in Chapter 3, Instruction Set Reference of the Intel Architecture Software Developer’s
Manual, Volume 2).
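The effect of a conditional move on a small IF construction can be modeled in C as below. This is a sketch of the semantics only: the names `cmov` and `max_u32` are hypothetical, and whether a compiler actually emits a CMOVcc instruction for such code depends on the compiler and its settings.

```c
#include <stdint.h>

/* Model of CMOVcc: the destination receives the source only when the
   tested condition holds; otherwise the destination is unchanged. */
uint32_t cmov(int condition, uint32_t dest, uint32_t src)
{
    return condition ? src : dest;
}

/* "if (b > a) result = b;" written without a branch in the model.
   The unsigned "above" test corresponds to the CMOVA condition. */
uint32_t max_u32(uint32_t a, uint32_t b)
{
    return cmov(b > a, a, b);
}
```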
exchanged with 16 through 23. Executing this instruction twice in a row leaves the register with
the same value as before. The BSWAP instruction is useful for converting between “big-endian”
and “little-endian” data formats. This instruction also speeds execution of decimal arithmetic.
(The XCHG instruction can be used to swap the bytes in a word.)
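The byte reversal performed by BSWAP, and the byte swap within a word mentioned for XCHG, can be modeled in C as follows. The function names are hypothetical; the sketch shows the result of the operation, not how the processor implements it.

```c
#include <stdint.h>

/* Model of BSWAP: bits 0-7 trade places with bits 24-31,
   and bits 8-15 trade places with bits 16-23. */
uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Model of swapping the two bytes of a word (for example, XCHG AL, AH). */
uint16_t xchg_bytes16(uint16_t x)
{
    return (uint16_t)((x >> 8) | (x << 8));
}
```

Applying `bswap32` twice returns the original value, which matches the statement that executing the instruction twice in a row leaves the register unchanged.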
The XADD (exchange and add) instruction swaps two operands and then stores the sum of the
two operands in the destination operand. The status flags in the EFLAGS register indicate the
result of the addition. This instruction can be combined with the LOCK prefix (refer to
“LOCK—Assert LOCK# Signal Prefix” in Chapter 3, Instruction Set Reference of the Intel
Architecture Software Developer’s Manual, Volume 2) in a multiprocessing system to allow
multiple processors to execute one DO loop.
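The use of an exchange-and-add operation to share the iterations of one loop can be sketched with the C11 `atomic_fetch_add` function, which returns the value held before the addition. The function name `claim_iteration` is hypothetical, and whether a given compiler translates this call into a locked XADD instruction is an assumption that depends on the compiler and target.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Each caller receives a distinct loop index: the shared counter is
   incremented and its previous value is returned in one atomic step. */
uint32_t claim_iteration(_Atomic uint32_t *next_index)
{
    return atomic_fetch_add(next_index, 1);
}
```

Each processor would call `claim_iteration` repeatedly and stop once the returned index reaches the loop bound.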
The CMPXCHG (compare and exchange) and CMPXCHG8B (compare and exchange 8 bytes)
instructions are used to synchronize operations in systems that use multiple processors. The
CMPXCHG instruction requires three operands: a source operand in a register, another source
operand in the EAX register, and a destination operand. If the values contained in the destination
operand and the EAX register are equal, the destination operand is replaced with the value of
the other source operand (the value not in the EAX register). Otherwise, the original value of the
destination operand is loaded in the EAX register. The status flags in the EFLAGS register
reflect the result that would have been obtained by subtracting the destination operand from the
value in the EAX register.
The CMPXCHG instruction is commonly used for testing and modifying semaphores. It checks
to see if a semaphore is free. If the semaphore is free it is marked allocated, otherwise it gets the
ID of the current owner. This is all done in one uninterruptible operation. In a single-processor
system, the CMPXCHG instruction eliminates the need to switch to protection level 0 (to disable
interrupts) before executing multiple instructions to test and modify a semaphore. For multiple
processor systems, CMPXCHG can be combined with the LOCK prefix to perform the compare
and exchange operation atomically. (Refer to Section 7.1., “Locked Atomic Operations” of the
Intel Architecture Software Developer’s Manual, Volume 3, for more information on atomic
operations.)
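The semaphore test-and-modify sequence described above can be sketched with the C11 `atomic_compare_exchange_strong` function, which has the same compare, conditionally store, otherwise report the current value behavior. The names `SEM_FREE` and `semaphore_try_acquire` are hypothetical, as is the convention that 0 means "free"; the mapping of this call onto a locked CMPXCHG instruction is left to the compiler.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SEM_FREE 0u   /* assumed encoding of an unowned semaphore */

/* Returns SEM_FREE if the semaphore was free and is now owned by owner_id;
   otherwise returns the ID of the current owner and changes nothing. */
uint32_t semaphore_try_acquire(_Atomic uint32_t *sem, uint32_t owner_id)
{
    uint32_t expected = SEM_FREE;   /* plays the role of the EAX register */
    if (atomic_compare_exchange_strong(sem, &expected, owner_id))
        return SEM_FREE;
    return expected;                /* updated to the value found in *sem */
}
```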
The CMPXCHG8B instruction also requires three operands: a 64-bit value in EDX:EAX, a
64-bit value in ECX:EBX, and a destination operand in memory. The instruction compares the
64-bit value in the EDX:EAX registers with the destination operand. If they are equal, the 64-bit
value in the ECX:EBX register is stored in the destination operand. If the EDX:EAX register
and the destination are not equal, the destination is loaded in the EDX:EAX register. The
CMPXCHG8B instruction can be combined with the LOCK prefix to perform the operation
atomically.
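[Figure residue: stack before and after pushing a doubleword; the stack grows toward lower
addresses. Before pushing, ESP points to address n. After pushing, the doubleword value is at
address n−4 and ESP points to n−4.]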
The PUSHA instruction saves the contents of the eight general-purpose registers on the stack
(refer to Figure 6-2). This instruction simplifies procedure calls by reducing the number of
instructions required to save the contents of the general-purpose registers. The registers are
pushed on the stack in the following order: EAX, ECX, EDX, EBX, the initial value of ESP
before EAX was pushed, EBP, ESI, and EDI.
Stack before and after pushing registers (the stack grows toward lower addresses)
Address   Before pushing   After pushing
n
n - 4     <- ESP
n - 8                      EAX
n - 12                     ECX
n - 16                     EDX
n - 20                     EBX
n - 24                     Old ESP
n - 28                     EBP
n - 32                     ESI
n - 36                     EDI  <- ESP
The POP instruction copies the word or doubleword at the current top of stack (indicated by the
ESP register) to the location specified with the destination operand, and then increments the ESP
register to point to the new top of stack (refer to Figure 6-3). The destination operand may
specify a general-purpose register, a segment register, or a memory location.
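[Figure residue: stack before and after popping a doubleword; the stack grows toward lower
addresses. Before popping, the doubleword value is at address n-8 and ESP points to n-8.
After popping, ESP points to n-4.]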
The POPA instruction reverses the effect of the PUSHA instruction. It pops the top eight words
or doublewords from the top of the stack into the general-purpose registers, except for the ESP
register (refer to Figure 6-4). If the operand-size attribute is 32, the doublewords on the stack are
transferred to the registers in the following order: EDI, ESI, EBP, ignore doubleword, EBX,
EDX, ECX, and EAX. The ESP register is restored by the action of popping the stack. If the
operand-size attribute is 16, the words on the stack are transferred to the registers in the
following order: DI, SI, BP, ignore word, BX, DX, CX, and AX.
Stack before and after popping registers (the stack grows toward lower addresses)
Address   Contents   ESP position
n
n - 4                <- ESP after popping
n - 8     EAX
n - 12    ECX
n - 16    EDX
n - 20    EBX
n - 24    Ignored
n - 28    EBP
n - 32    ESI
n - 36    EDI        <- ESP before popping
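The register order used by PUSHA and POPA can be sketched as a C model. It assumes a 32-bit operand size, a stack held in an array of doublewords, and an ESP that is an array index rather than a byte address; the names `Regs`, `push32`, `pop32`, `pusha`, and `popa` are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
} Regs;

/* The stack grows toward lower indices; r->esp indexes the top of stack. */
static void push32(uint32_t *stack, Regs *r, uint32_t v) { stack[--r->esp] = v; }
static uint32_t pop32(const uint32_t *stack, Regs *r) { return stack[r->esp++]; }

void pusha(uint32_t *stack, Regs *r)
{
    uint32_t old_esp = r->esp;        /* ESP value before EAX is pushed */
    push32(stack, r, r->eax);
    push32(stack, r, r->ecx);
    push32(stack, r, r->edx);
    push32(stack, r, r->ebx);
    push32(stack, r, old_esp);
    push32(stack, r, r->ebp);
    push32(stack, r, r->esi);
    push32(stack, r, r->edi);
}

void popa(const uint32_t *stack, Regs *r)
{
    r->edi = pop32(stack, r);
    r->esi = pop32(stack, r);
    r->ebp = pop32(stack, r);
    (void)pop32(stack, r);            /* saved ESP image is skipped */
    r->ebx = pop32(stack, r);
    r->edx = pop32(stack, r);
    r->ecx = pop32(stack, r);
    r->eax = pop32(stack, r);
}
```

In the model, ESP returns to its original value after `popa` purely through the eight pops, as the text describes.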
Before sign extension (bits 15-0):  S NNNNNNNNNNNNNNN
After sign extension (bits 31-0):   SSSSSSSSSSSSSSSS S NNNNNNNNNNNNNNN
The CBW instruction copies the sign (bit 7) of the byte in the AL register into every bit position
of the upper byte of the AX register. The CWDE instruction copies the sign (bit 15) of the word
in the AX register into every bit position of the high word of the EAX register.
The CWD instruction copies the sign (bit 15) of the word in the AX register into every bit posi-
tion in the DX register. The CDQ instruction copies the sign (bit 31) of the doubleword in the
EAX register into every bit position in the EDX register. The CWD instruction can be used to
produce a doubleword dividend from a word before a word division, and the CDQ instruction
can be used to produce a quadword dividend from a doubleword before doubleword division.
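The four sign-extension instructions can be modeled with explicit bit operations in C. The function names are hypothetical; each returns the register value the text describes (AX for CBW, EAX for CWDE, DX for CWD, EDX for CDQ).

```c
#include <stdint.h>

/* CBW: copy bit 7 of AL into every bit of the upper byte of AX. */
uint16_t cbw(uint8_t al)
{
    return (al & 0x80u) ? (uint16_t)(0xFF00u | al) : al;
}

/* CWDE: copy bit 15 of AX into every bit of the high word of EAX. */
uint32_t cwde(uint16_t ax)
{
    return (ax & 0x8000u) ? (0xFFFF0000u | ax) : ax;
}

/* CWD: DX receives a copy of bit 15 of AX in every bit position. */
uint16_t cwd_dx(uint16_t ax)
{
    return (ax & 0x8000u) ? 0xFFFFu : 0x0000u;
}

/* CDQ: EDX receives a copy of bit 31 of EAX in every bit position. */
uint32_t cdq_edx(uint32_t eax)
{
    return (eax & 0x80000000u) ? 0xFFFFFFFFu : 0x00000000u;
}
```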
Initial state:         CF = X   Operand = 1000 1000 1000 1000 1000 1000 1000 1111
After 1-bit SHL/SAL:   CF = 1   Operand = 0001 0001 0001 0001 0001 0001 0001 1110
After 10-bit SHL/SAL:  CF = 0   Operand = 0010 0010 0010 0010 0011 1100 0000 0000
The SHR instruction shifts the source operand right by from 1 to 31 bit positions (refer to Figure
6-7). As with the SHL/SAL instruction, the empty bit positions are cleared and the CF flag is
loaded with the last bit shifted out of the operand.
The SAR instruction shifts the source operand right by from 1 to 31 bit positions (refer to Figure
6-8). This instruction differs from the SHR instruction in that it preserves the sign of the source
operand by clearing empty bit positions if the operand is positive or setting the empty bits if the
operand is negative. Again, the CF flag is loaded with the last bit shifted out of the operand.
The SAR and SHR instructions can also be used to perform division by powers of 2 (refer to
Chapter 3, Instruction Set Reference of the Intel Architecture Software Developer’s Manual,
Volume 2).
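The difference between SHR and SAR can be sketched in C on 32-bit operands. The function names and the `cf` output parameter are hypothetical; the model masks the count to 5 bits and leaves the operand and CF untouched for a count of 0.

```c
#include <stdint.h>

/* Model of SHR: empty bit positions are cleared. */
uint32_t shr32(uint32_t x, unsigned count, int *cf)
{
    count &= 31;
    if (count == 0)
        return x;
    *cf = (int)((x >> (count - 1)) & 1);   /* last bit shifted out */
    return x >> count;
}

/* Model of SAR: empty bit positions are filled with the sign bit. */
uint32_t sar32(uint32_t x, unsigned count, int *cf)
{
    count &= 31;
    if (count == 0)
        return x;
    *cf = (int)((x >> (count - 1)) & 1);   /* last bit shifted out */
    uint32_t result = x >> count;
    if (x & 0x80000000u)
        result |= ~(0xFFFFFFFFu >> count); /* replicate the sign bit */
    return result;
}
```

When SAR is used for division by a power of 2, note that the result for a negative operand is rounded toward negative infinity: shifting –7 right by one position gives –4 in this model, whereas a signed divide instruction would give –3.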
[Figure residue: right-shift results, shown as the 32-bit operand followed by CF]
0010 0010 0010 0010 0010 0010 0010 0011   CF = 1
1110 0010 0010 0010 0010 0010 0010 0011   CF = 1
[Figure residue: SHLD and SHRD instructions. SHLD: CF, destination (memory or register,
bits 31-0), source (register, bits 31-0). SHRD: source (register, bits 31-0), destination
(memory or register, bits 31-0), CF.]
The SHLD instruction shifts the bits in the destination operand to the left and fills the empty bit
positions (in the destination operand) with bits shifted out of the source operand. The destination
and source operands must be the same length (either words or doublewords). The shift count can
range from 0 to 31 bits. The result of this shift operation is stored in the destination operand, and
the source operand is not modified. The CF flag is loaded with the last bit shifted out of the desti-
nation operand.
The SHRD instruction operates the same as the SHLD instruction except bits are shifted to the
right in the destination operand, with the empty bit positions filled with bits shifted out of the
source operand.
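The double-shift operations can be modeled in C for doubleword operands. The function names are hypothetical; the CF result is omitted for brevity, and a count of 0 is treated as leaving the destination unchanged.

```c
#include <stdint.h>

/* Model of SHLD: the destination shifts left and its low bits are
   filled from the high bits of the source. */
uint32_t shld32(uint32_t dest, uint32_t src, unsigned count)
{
    count &= 31;
    if (count == 0)
        return dest;
    return (dest << count) | (src >> (32 - count));
}

/* Model of SHRD: the destination shifts right and its high bits are
   filled from the low bits of the source. */
uint32_t shrd32(uint32_t dest, uint32_t src, unsigned count)
{
    count &= 31;
    if (count == 0)
        return dest;
    return (dest >> count) | (src << (32 - count));
}
```

In both cases only the destination value changes; the source is read but not modified.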
[Figure residue: ROL, ROR, RCL, and RCR instructions, each acting on a destination (memory or
register, bits 31-0) together with the CF flag.]
The ROL instruction rotates the bits in the operand to the left (toward more significant bit loca-
tions). The ROR instruction rotates the operand right (toward less significant bit locations).
The RCL instruction rotates the bits in the operand to the left, through the CF flag. This
instruction treats the CF flag as a one-bit extension on the upper end of the operand. Each bit which
exits from the most significant bit location of the operand moves into the CF flag. At the same
time, the bit in the CF flag enters the least significant bit location of the operand.
The RCR instruction rotates the bits in the operand to the right through the CF flag.
For all the rotate instructions, the CF flag always contains the value of the last bit rotated out of
the operand, even if the instruction does not use the CF flag as an extension of the operand. The
value of this flag can then be tested by a conditional jump instruction (JC or JNC).
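The difference between a plain rotate and a rotate through carry can be sketched in C for doubleword operands. The function names and the `cf` parameter are hypothetical; the count is masked to 5 bits in both models.

```c
#include <stdint.h>

/* Model of ROL: bits leaving bit 31 re-enter at bit 0; CF receives a
   copy of the last bit rotated out. */
uint32_t rol32(uint32_t x, unsigned count, int *cf)
{
    count &= 31;
    if (count != 0) {
        x = (x << count) | (x >> (32 - count));
        *cf = (int)(x & 1);
    }
    return x;
}

/* Model of RCL: CF acts as a 33rd bit above bit 31, so each step moves
   bit 31 into CF and the old CF into bit 0. */
uint32_t rcl32(uint32_t x, unsigned count, int *cf)
{
    for (count &= 31; count > 0; count--) {
        int out = (int)((x >> 31) & 1);
        x = (x << 1) | (uint32_t)(*cf & 1);
        *cf = out;
    }
    return x;
}
```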
The destination operand specifies a relative address (a signed offset with respect to the address
in the EIP register) that points to an instruction in the current code segment. The Jcc instructions
do not support far transfers; however, far transfers can be accomplished with a combination of
a Jcc and a JMP instruction (refer to “Jcc—Jump if Condition Is Met” in Chapter 3, Instruction
Set Reference of the Intel Architecture Software Developer’s Manual, Volume 2).
Table 6-4 shows the mnemonics for the Jcc instructions and the conditions being tested for each
instruction. The condition code mnemonics are appended to the letter “J” to form the mnemonic
for a Jcc instruction. The instructions are divided into two groups: unsigned and signed condi-
tional jumps. These groups correspond to the results of operations performed on unsigned and
signed integers, respectively. Those instructions listed as pairs (for example, JA/JNBE) are alter-
nate names for the same instruction. The assembler provides these alternate names to make it
easier to read program listings.
The JCXZ and JECXZ instructions test the CX and ECX registers, respectively, instead of one
or more status flags. Refer to Section 6.9.2.3., “Jump If Zero Instructions” for more informa-
tion about these instructions.
instructions decrement the contents of the ECX register before testing for zero. If the value in
the ECX register is zero initially, it will be decremented to FFFFFFFFH on the first loop
instruction, causing the loop to be executed 2^32 times. To prevent this problem, a JECXZ instruction
can be inserted at the beginning of the code block for the loop, causing a jump out of the loop if
the ECX register count is initially zero. When used with repeated string scan and compare
instructions, the JECXZ instruction can determine whether the loop terminated because the
count reached zero or because the scan or compare conditions were satisfied.
The JCXZ (jump if CX is zero) instruction operates the same as the JECXZ instruction when the
16-bit address-size attribute is used. Here, the CX register is tested for zero.
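The effect of a zero initial count, with and without a JECXZ guard, can be illustrated with a short C model. The function names are hypothetical, and the iteration count is computed arithmetically rather than by running the loop; a 64-bit return type is used because 2^32 does not fit in 32 bits.

```c
#include <stdint.h>

/* The decrement a loop instruction applies to ECX, with 32-bit wraparound. */
uint32_t loop_decrement(uint32_t ecx)
{
    return ecx - 1;
}

/* Number of times the loop body runs when the block ends with a loop
   instruction. With guarded != 0, a JECXZ at the top skips the block
   when ECX is initially zero. */
uint64_t loop_iterations(uint32_t ecx, int guarded)
{
    if (guarded && ecx == 0)
        return 0;
    return (ecx == 0) ? (UINT64_C(1) << 32) : ecx;
}
```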
source and destination strings can be located in the same segment. (This latter condition can also
be achieved by loading the DS and ES segment registers with the same segment selector and
allowing the ESI register to default to the DS register.)
The MOVS instruction moves the string element addressed by the ESI register to the location
addressed by the EDI register. The assembler recognizes three “short forms” of this instruction,
which specify the size of the string to be moved: MOVSB (move byte string), MOVSW (move
word string), and MOVSD (move doubleword string).
The CMPS instruction subtracts the destination string element from the source string element
and updates the status flags (CF, ZF, OF, SF, PF, and AF) in the EFLAGS register according to
the results. Neither string element is written back to memory. The assembler recognizes three
“short forms” of the CMPS instruction: CMPSB (compare byte strings), CMPSW (compare
word strings), and CMPSD (compare doubleword strings).
The SCAS instruction subtracts the destination string element from the contents of the EAX,
AX, or AL register (depending on operand length) and updates the status flags according to the
results. The string element and register contents are not modified. The following “short forms”
of the SCAS instruction specify the operand length: SCASB (scan byte string), SCASW (scan
word string), and SCASD (scan doubleword string).
The LODS instruction loads the source string element identified by the ESI register into the
EAX register (for a doubleword string), the AX register (for a word string), or the AL register
(for a byte string). The “short forms” for this instruction are LODSB (load byte string), LODSW
(load word string), and LODSD (load doubleword string). This instruction is usually used in a
loop, where other instructions process each element of the string after they are loaded into the
target register.
The STOS instruction stores the source string element from the EAX (doubleword string), AX
(word string), or AL (byte string) register into the memory location identified with the EDI
register. The “short forms” for this instruction are STOSB (store byte string), STOSW (store
word string), and STOSD (store doubleword string). This instruction is also normally used in a
loop. Here a string is commonly loaded into the register with a LODS instruction, operated
on by other instructions, and then stored again in memory with a STOS instruction.
The I/O instructions (refer to Section 6.11., “I/O Instructions”) also perform operations on
strings in memory.
the EFLAGS register controls whether the registers are incremented (DF=0) or decremented
(DF=1). The STD and CLD instructions set and clear this flag, respectively.
The following repeat prefixes can be used in conjunction with a count in the ECX register to
cause a string instruction to repeat:
• REP—Repeat while the ECX register not zero.
• REPE/REPZ—Repeat while the ECX register not zero and the ZF flag is set.
• REPNE/REPNZ—Repeat while the ECX register not zero and the ZF flag is clear.
When a string instruction has a repeat prefix, the operation executes until one of the termination
conditions specified by the prefix is satisfied. The REPE/REPZ and REPNE/REPNZ prefixes
are used only with the CMPS and SCAS instructions. Also, note that a REP STOS instruction
is the fastest way to initialize a large block of memory.
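The interaction of the repeat prefixes with ESI, EDI, ECX, and the DF and ZF flags can be sketched as a C model for byte strings. It assumes a flat byte array for memory and ignores segments; the names `StringState`, `rep_movsb`, and `repne_scasb` are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t esi, edi, ecx;
    int df;   /* 0: registers are incremented, 1: decremented */
    int zf;   /* result of the most recent comparison */
} StringState;

/* Model of REP MOVSB: copy bytes until ECX reaches zero. */
void rep_movsb(uint8_t *mem, StringState *s)
{
    uint32_t step = s->df ? 0xFFFFFFFFu : 1u;   /* -1 or +1 modulo 2^32 */
    while (s->ecx != 0) {
        mem[s->edi] = mem[s->esi];
        s->esi += step;
        s->edi += step;
        s->ecx--;
    }
}

/* Model of REPNE SCASB: scan until ECX reaches zero or a byte equal
   to AL is found (ZF set). */
void repne_scasb(const uint8_t *mem, StringState *s, uint8_t al)
{
    uint32_t step = s->df ? 0xFFFFFFFFu : 1u;
    while (s->ecx != 0) {
        s->zf = (al == mem[s->edi]);
        s->edi += step;
        s->ecx--;
        if (s->zf)
            break;
    }
}
```

After `repne_scasb` returns, the ZF value distinguishes a match from an exhausted count, and EDI points one element past the byte that was last compared.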
PUSHFD/POPFD: bits 31-0.  PUSHF/POPF: bits 15-0.
Bit(s)   Flag          Bit(s)   Flag
31-22    0             10       DF
21       ID            9        IF
20       VIP           8        TF
19       VIF           7        SF
18       AC            6        ZF
17       VM            5        0
16       RF            4        AF
15       0             3        0
14       NT            2        PF
13-12    IOPL          1        1
11       OF            0        CF
Figure 6-11. Flags Affected by the PUSHF, POPF, PUSHFD, and POPFD instructions
The POPF instruction pops a word from the stack into the EFLAGS register. Only bits 11, 10,
8, 7, 6, 4, 2, and 0 of the EFLAGS register are affected with all uses of this instruction. If the
current privilege level (CPL) of the current code segment is 0 (most privileged), the IOPL bits
(bits 13 and 12) also are affected. If the I/O privilege level (IOPL) is greater than or equal to the
CPL, numerically, the IF flag (bit 9) also is affected.
The POPFD instruction pops a doubleword into the EFLAGS register. This instruction can
change the state of the AC bit (bit 18) and the ID bit (bit 21), as well as the bits affected by a
POPF instruction. The restrictions for changing the IOPL bits and the IF flag that were given for
the POPF instruction also apply to the POPFD instruction.
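The privilege rules for POPF stated above can be expressed as a bit mask in C. This sketch models only what the preceding paragraphs say; the names `POPF_ALWAYS_MASK`, `popf_writable_mask`, and `popf` are hypothetical, and special cases such as virtual-8086 mode are not modeled.

```c
#include <stdint.h>

#define POPF_ALWAYS_MASK 0x0DD5u      /* bits 11, 10, 8, 7, 6, 4, 2, and 0 */
#define EFLAGS_IF        (1u << 9)    /* interrupt enable flag */
#define EFLAGS_IOPL      (3u << 12)   /* I/O privilege level field */

/* Which EFLAGS bits a POPF may change at the given CPL and IOPL. */
uint32_t popf_writable_mask(unsigned cpl, unsigned iopl)
{
    uint32_t mask = POPF_ALWAYS_MASK;
    if (cpl == 0)
        mask |= EFLAGS_IOPL;          /* only at the most privileged level */
    if (iopl >= cpl)
        mask |= EFLAGS_IF;            /* IOPL numerically >= CPL */
    return mask;
}

/* Model of POPF: merge the popped word into EFLAGS under the mask. */
uint32_t popf(uint32_t eflags, uint16_t popped, unsigned cpl)
{
    unsigned iopl = (eflags >> 12) & 3u;
    uint32_t mask = popf_writable_mask(cpl, iopl);
    return (eflags & ~mask) | (popped & mask);
}
```

In the model, code at CPL 3 with IOPL 0 that pops a word with bit 9 set leaves the IF flag unchanged.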
The POP and MOV instructions cannot place a value in the CS register. Only the far control-
transfer versions of the JMP, CALL, and RET instructions (refer to Section 6.14.2., “Far Control
Transfer Instructions”) affect the CS register directly.
registers before the execution of string instructions or for initializing the EBX register before an
XLAT instruction.
CHAPTER 7
FLOATING-POINT UNIT
The Intel Architecture (IA) Floating-Point Unit (FPU) provides high-performance floating-
point processing capabilities. It supports the real, integer, and BCD-integer data types and the
floating-point processing algorithms and exception handling architecture defined in the IEEE
754 and 854 Standards for Floating-Point Arithmetic. The FPU executes instructions from the
processor’s normal instruction stream and greatly improves the efficiency of IA processors in
handling the types of high-precision floating-point processing operations commonly found in
scientific, engineering, and business applications.
This chapter describes the data types that the FPU operates on, the FPU’s execution environ-
ment, and the FPU-specific instruction set. Detailed descriptions of the FPU instructions are
given in Chapter 3, Instruction Set Reference, in the Intel Architecture Software Developer’s
Manual, Volume 2.
$$\frac{-b \pm \sqrt{b^{2} - 4ac}}{2a}$$
If a does not equal 0, the formula is numerically unstable when the roots are nearly coincident
or when their magnitudes are wildly different. The formula is also vulnerable to spurious
over/underflows when the coefficients a, b, and c are all very big or all very tiny. When single-
precision (4-byte) floating-point coefficients are given as data and the formula is evaluated in
the FPU’s normal way, keeping all intermediate results in its stack, the FPU produces impec-
cable single-precision roots. This happens because, by default and with no effort on the
programmer’s part, the FPU evaluates all those sub-expressions with so much extra precision
and range as to overwhelm almost any threat to numerical integrity.
If double-precision data and results were at issue, a better formula would have to be used, and
once again the FPU’s default evaluation of that formula would provide substantially enhanced
numerical integrity over mere double-precision evaluation.
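The effect described above can be sketched in Python. This is only an illustration under stated assumptions: Python has no extended-real type, so ordinary Python floats (double precision) stand in for the FPU's wider intermediate format, and the helper `f32` (a hypothetical name) emulates rounding to single precision. The coefficients are chosen so that the roots are nearly coincident.

```python
import math
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def roots_single(a, b, c):
    """Quadratic formula with every intermediate result rounded to single."""
    disc = f32(f32(b * b) - f32(f32(4.0 * a) * c))
    s = f32(math.sqrt(disc))
    two_a = f32(2.0 * a)
    return f32(f32(-b + s) / two_a), f32(f32(-b - s) / two_a), disc

def roots_wide(a, b, c):
    """Intermediates kept in the wider format; only final roots are rounded."""
    disc = b * b - 4.0 * a * c
    s = math.sqrt(disc)
    return f32((-b + s) / (2.0 * a)), f32((-b - s) / (2.0 * a)), disc

# Single-precision coefficients whose roots are nearly coincident:
# b*b is only slightly larger than 4*a*c.
a, b, c = 1.0, 1.0 + 2.0 ** -12, 0.25
narrow = roots_single(a, b, c)
wide = roots_wide(a, b, c)
print("discriminant, single intermediates:", narrow[2])
print("discriminant, wide intermediates:  ", wide[2])
print("first root differs:", narrow[0] != wide[0])
```

With single-precision intermediates the low-order part of b*b is rounded away before the subtraction, so the discriminant and the resulting roots lose accuracy; the wider evaluation retains it.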
On most machines, straightforward algorithms will not deliver consistently correct results (and
will not indicate when they are incorrect). To obtain correct results on traditional machines
under all conditions usually requires sophisticated numerical techniques that go beyond typical
programming practice. General application programmers using straightforward algorithms will
produce much more reliable programs using IA processors. This simple fact greatly reduces the
software investment required to develop safe, accurate computation-based products.
Beyond traditional numeric support for scientific applications, the IA processors have built-in
facilities for commercial computing. They can process decimal numbers of up to 18 digits
without round-off errors, performing exact arithmetic on integers as large as 2^64 (or 10^18).
Exact arithmetic is vital in accounting applications where rounding errors may introduce mone-
tary losses that cannot be reconciled.
The Intel FPUs contain a number of optional numerical facilities that can be invoked by
sophisticated users. These advanced features include directed rounding, gradual underflow, and
programmed exception-handling facilities.
These automatic exception-handling facilities permit a high degree of flexibility in numeric
processing software, without burdening the programmer. While performing numeric calcula-
tions, the processor automatically detects exception conditions that can potentially damage a
calculation (for example, X ÷ 0 or √X when X < 0). By default, on-chip exception logic handles
these exceptions so that a reasonable result is produced and execution may proceed without
program interruption. Alternatively, the processor can invoke a software exception handler to
provide special results whenever various types of exceptions are detected.
[Figure 7-1: the real number continuum and the subset of it that an FPU can represent; the example
subset shown has a precision of 24 binary digits.]
Because the size and number of registers that any computer can have is limited, only a subset of
the real-number continuum can be used in real-number calculations. As shown at the bottom of
Figure 7-1, the subset of real numbers that a particular FPU supports represents an approxima-
tion of the real number system. The range and precision of this real-number subset is determined
by the format that the FPU uses to represent real numbers.
[Figure: binary floating-point format. Fields: Sign, Exponent, and Significand; the significand
consists of the Integer (or J-Bit) and the Fraction.]
Table 7-1 shows how the real number 178.125 (in ordinary decimal format) is stored in floating-
point format. The table lists a progression of real number notations that leads to the single-real,
32-bit floating-point format (which is one of the floating-point formats that the FPU supports).
In this format, the significand is normalized (refer to Section 7.2.2.1., “Normalized Numbers”)
and the exponent is biased (refer to Section 7.2.2.2., “Biased Exponent”). For the single-real
format, the biasing constant is +127.
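As a rough check of the single-real encoding described above, the following Python sketch splits the 32-bit pattern of 178.125 into its sign, biased exponent, and fraction fields. The function name `single_real_fields` is hypothetical; the `struct` module is used only to obtain the bit pattern.

```python
import struct

def single_real_fields(value):
    """Return (sign, biased exponent, 23-bit fraction) of a single-real value."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, biased_exponent, fraction

sign, biased_exp, fraction = single_real_fields(178.125)
# 178.125 = 1.0110010001B x 2^7, so the biased exponent is 7 + 127 = 134.
print(sign, biased_exp - 127, format(fraction, "023b"))
```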
Representing numbers in normalized form maximizes the number of significant digits that can
be accommodated in a significand of a given width. To summarize, a normalized real number
consists of a normalized significand that represents a real number between 1 and 2 and an expo-
nent that specifies the number’s binary point.
Figure 7-3 shows how the encodings for these numbers and non-numbers fit into the real number
continuum. The encodings shown here are for the IEEE single-precision (32-bit) format, where
the term “S” indicates the sign bit, “E” the biased exponent, and “F” the fraction. (The exponent
values are given in decimal.)
The FPU can operate on and/or return any of these values, depending on the type of computation
being performed. The following sections describe these number and non-number classes.
[Figure 7-3: real number line and encodings.]
Real number line, left to right: −∞, −Normalized Finite, −Denormalized Finite, −0, +0,
+Denormalized Finite, +Normalized Finite, +∞. The NaN encodings lie beyond both ends of the line.

Class                    S    E          F
−Denormalized Finite     1    0          0.XXX (note 2)
−Normalized Finite       1    1...254    Any Value
−∞                       1    255        0
+Denormalized Finite     0    0          0.XXX (note 2)
+Normalized Finite       0    1...254    Any Value
+∞                       0    255        0
NOTES:
1. Sign bit ignored.
2. Fractions must be non-zero.
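The single-precision encodings listed above can be turned into a small classifier. This is a minimal sketch, not the FPU's own logic: the function name is hypothetical, and it relies on the standard IEEE encoding of zero (exponent and fraction both 0) in addition to the classes in the figure.

```python
def classify_single(bits):
    """Classify a 32-bit single-precision encoding from its S, E, and F fields."""
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF      # biased exponent, 0...255
    fraction = bits & 0x7FFFFF          # 23-bit fraction
    if exponent == 255:
        # Maximum exponent: infinity if the fraction is 0, otherwise a NaN
        # (the sign bit is ignored for NaNs).
        return "NaN" if fraction else sign + "infinity"
    if exponent == 0:
        return sign + ("denormalized finite" if fraction else "0")
    return sign + "normalized finite"

for pattern in (0x3F800000, 0x00000001, 0x80000000, 0xFF800000, 0x7FC00000):
    print(format(pattern, "08X"), classify_single(pattern))
```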
When real numbers become very close to zero, the normalized-number format can no longer be
used to represent the numbers. This is because the range of the exponent is not large enough to
compensate for shifting the binary point to the right to eliminate leading zeros.
When the biased exponent is zero, smaller numbers can only be represented by making the
integer bit (and perhaps other leading bits) of the significand zero. The numbers in this range are
called denormalized (or tiny) numbers. The use of leading zeros with denormalized numbers
allows smaller numbers to be represented. However, this denormalization causes a loss of preci-
sion (the number of significant bits in the fraction is reduced by the leading zeros).
When performing normalized floating-point computations, an FPU normally operates on
normalized numbers and produces normalized numbers as results. Denormalized numbers
represent an underflow condition.
A denormalized number is computed through a technique called gradual underflow. Table 7-2
gives an example of gradual underflow in the denormalization process. Here the single-real
format is being used, so the minimum exponent (unbiased) is −126 (decimal). The true result in
this example requires an exponent of −129 (decimal) in order to have a normalized number. Since
−129 is beyond the allowable exponent range, the result is denormalized by inserting leading
zeros until the minimum exponent of −126 is reached.
NOTE:
* Expressed as an unbiased, decimal number.
In the extreme case, all the significant bits are shifted out to the right by leading zeros, creating
a zero result.
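Gradual underflow can be observed with any IEEE single-precision implementation. The sketch below, which assumes Python's `struct` module as a stand-in for an FPU store to single-real format, encodes 1.0 x 2^−129: the result has a biased exponent of 0 and a fraction with leading zeros, that is, 0.001B x 2^−126. A much smaller value is shifted out entirely and becomes zero.

```python
import struct

def single_bits(value):
    """Return the 32-bit single-precision encoding of a Python float."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

bits = single_bits(2.0 ** -129)
biased_exponent = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF
print(biased_exponent, format(fraction, "023b"))

# Far below the denormal range every significant bit is shifted out.
print(single_bits(2.0 ** -160))
```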
The FPU deals with denormal values in the following ways:
• It avoids creating denormals by normalizing numbers whenever possible.
• It provides the floating-point underflow exception to permit programmers to detect cases
when denormals are created.
• It provides the floating-point denormal operand exception to permit procedures or
programs to detect when denormals are being used as source operands for computations.
When a denormal number in single- or double-real format is used as a source operand and the
denormal exception is masked, the FPU automatically normalizes the number when it is
converted to extended-real format.
7.2.3.4. NANS
Since NaNs are non-numbers, they are not part of the real number line. In Figure 7-3, the
encoding space for NaNs in the FPU floating-point formats is shown above the ends of the real
number line. This space includes any value with the maximum allowable biased exponent and
a non-zero fraction. (The sign bit is ignored for NaNs.)
The IEEE standard defines two classes of NaN: quiet NaNs (QNaNs) and signaling NaNs
(SNaNs). A QNaN is a NaN with the most significant fraction bit set; an SNaN is a NaN with
the most significant fraction bit clear. QNaNs are allowed to propagate through most arithmetic
operations without signaling an exception. SNaNs generally signal an invalid operation excep-
tion whenever they appear as operands in arithmetic operations. Exceptions are discussed in
Section 7.7., “Floating-Point Exception Handling”.
Refer to Section 7.6., “Operating on NaNs”, for detailed information on how the FPU handles
NaNs.
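The QNaN/SNaN distinction comes down to one bit test. The following sketch applies it to single-precision encodings; the function name `nan_kind` is hypothetical.

```python
def nan_kind(bits):
    """Return "QNaN", "SNaN", or None for a 32-bit single-precision encoding."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent != 255 or fraction == 0:
        return None                      # not a NaN (finite value or infinity)
    # The most significant fraction bit distinguishes quiet from signaling.
    return "QNaN" if fraction & 0x400000 else "SNaN"

print(nan_kind(0x7FC00000), nan_kind(0x7F800001), nan_kind(0x7F800000))
```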
7.2.4. Indefinite
For each FPU data type, one unique encoding is reserved for representing the special value
indefinite. For example, when operating on real values, the real indefinite value is a QNaN
(refer to Section 7.4.1., “Real Numbers”). The FPU produces indefinite values as responses
to masked floating-point exceptions.
whereas, the Pentium® processor has two integer units and one FPU, and the Intel486™
processor has one integer unit and one FPU.)
[Diagram blocks: Instruction Decoder and Sequencer; Integer Unit; FPU; Data Bus.]
Figure 7-4. Relationship Between the Integer Unit and the FPU
The instruction execution environment of the FPU (refer to Figure 7-5) consists of 8 data regis-
ters (called the FPU data registers) and the following special-purpose registers:
• The status register.
• The control register.
• The tag word register.
• The instruction pointer register.
• The last operand (data pointer) register.
• The opcode register.
These registers are described in the following sections.
increment TOP by one. (For the FPU, a load operation is equivalent to a push and a store oper-
ation is equivalent to a pop.)
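[Figure 7-5: FPU execution environment. Recoverable register labels: Control Register (bits 15
through 0), Tag Register, FPU Instruction Pointer (bits 47 through 0), Opcode (bits 10 through 0).]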
If a load operation is performed when TOP is at 0, register wraparound occurs and the new value
of TOP is set to 7. The floating-point stack-overflow exception indicates when wraparound
might cause an unsaved value to be overwritten (refer to Section 7.8.1.1., “Stack Overflow or
Underflow Exception (#IS)”).
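The TOP arithmetic amounts to counting modulo 8. The sketch below is an illustration only; the function names are hypothetical. It shows the wraparound on a load and how a stack-relative name ST(i) maps to a physical register number.

```python
def push_top(top):
    """A load (push) decrements TOP; from 0 it wraps around to 7."""
    return (top - 1) % 8

def pop_top(top):
    """A store-and-pop increments TOP; from 7 it wraps around to 0."""
    return (top + 1) % 8

def physical_register(top, i):
    """Physical register number addressed by ST(i) for a given TOP."""
    return (top + i) % 8

print(push_top(0), pop_top(7), physical_register(3, 2))
```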
Many floating-point instructions have several addressing modes that permit the programmer to
implicitly operate on the top of the stack, or to explicitly operate on specific registers relative to
the TOP. Assemblers support these register addressing modes, using the expression ST(0), or
simply ST, to represent the current stack top and ST(i) to specify the ith register from TOP in
the stack (0 ≤ i ≤ 7). For example, if TOP contains 011B (register 3 is the top of the stack), the
following instruction would add the contents of two registers in the stack (registers 3 and 5):
FADD ST, ST(2);
Computation:
Dot Product = (5.6 x 2.4) + (3.8 x 10.3)
Code:
FLD  value1    ;(a) value1=5.6
FMUL value2    ;(b) value2=2.4
FLD  value3    ;    value3=3.8
FMUL value4    ;(c) value4=10.3
FADD ST(1)     ;(d)
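The stack behavior of this sequence can be followed with a toy model. The class below is a hypothetical sketch, not a model of the real FPU: it tracks only TOP and eight value slots, uses Python floats, and ignores tags and exceptions.

```python
class FpuStack:
    """Toy model of the FPU register stack: eight slots and a TOP pointer."""

    def __init__(self):
        self.regs = [None] * 8
        self.top = 0

    def st(self, i):
        """Value of ST(i), the i-th register from TOP."""
        return self.regs[(self.top + i) % 8]

    def fld(self, value):
        """Load: decrement TOP, then write the new ST(0)."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def fmul(self, value):
        """Multiply ST(0) by a memory operand."""
        self.regs[self.top] *= value

    def fadd_st(self, i):
        """Add ST(i) to ST(0), leaving the result in ST(0)."""
        self.regs[self.top] += self.st(i)

fpu = FpuStack()
fpu.fld(5.6)        # (a) ST(0) = 5.6
fpu.fmul(2.4)       # (b) ST(0) = 5.6 x 2.4
fpu.fld(3.8)        #     ST(0) = 3.8, first product moves to ST(1)
fpu.fmul(10.3)      # (c) ST(0) = 3.8 x 10.3
fpu.fadd_st(1)      # (d) ST(0) = dot product
print(round(fpu.st(0), 2), round(fpu.st(1), 2))
```

After step (d) the first product is still in ST(1); the sequence never pops the stack.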
point instructions set the condition code flags. These condition code bits are used principally for
conditional branching and for storage of information used in exception handling (refer to
Section 7.3.3., “Branching and Conditional Moves on FPU Condition Codes”).
FPU status word (bit assignments):
Bit 15        B     FPU Busy
Bit 14        C3    Condition Code
Bits 13-11    TOP   Top of Stack Pointer
Bit 10        C2    Condition Code
Bit 9         C1    Condition Code
Bit 8         C0    Condition Code
Bit 7         ES    Error Summary Status
Bit 6         SF    Stack Fault
Exception Flags:
Bit 5         PE    Precision
Bit 4         UE    Underflow
Bit 3         OE    Overflow
Bit 2         ZE    Zero Divide
Bit 1         DE    Denormalized Operand
Bit 0         IE    Invalid Operation
As shown in Table 7-3, the C1 condition code flag is used for a variety of functions. When both
the IE and SF flags in the FPU status word are set, indicating a stack overflow or underflow
exception (#IS), the C1 flag distinguishes between overflow (C1=1) and underflow (C1=0).
When the PE flag in the status word is set, indicating an inexact (rounded) result, the C1 flag is
set to 1 if the last rounding by the instruction was upward. The FXAM instruction sets C1 to the
sign of the value being examined.
The C2 condition code flag is used by the FPREM and FPREM1 instructions to indicate an
incomplete reduction (or partial remainder). When a successful reduction has been completed,
the C0, C3, and C1 condition code flags are set to the three least-significant bits of the quotient
(Q2, Q1, and Q0, respectively). Refer to “FPREM1—Partial Remainder” in Chapter 3, Instruc-
tion Set Reference, of the Intel Architecture Software Developer’s Manual, Volume 2, for more
information on how these instructions use the condition code flags.
The FPTAN, FSIN, FCOS, and FSINCOS instructions set the C2 flag to 1 to indicate that the
source operand is beyond the allowable range of ±2^63.
Where the state of the condition code flags is listed as undefined in Table 7-3, do not rely on
any specific value in these flags.
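A status word image, for example one stored to memory, can be decoded with shifts and masks. The sketch below assumes the bit positions of the status word layout (TOP in bits 13 through 11, C3 in bit 14, C2 through C0 in bits 10 through 8); the function names are hypothetical.

```python
FLAG_BITS = {"IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5,
             "SF": 6, "ES": 7, "C0": 8, "C1": 9, "C2": 10, "C3": 14, "B": 15}

def decode_status_word(sw):
    """Split a 16-bit FPU status word image into named fields."""
    fields = {name: (sw >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["TOP"] = (sw >> 11) & 0b111
    return fields

def stack_fault_kind(sw):
    """For a stack exception (IE and SF both set), C1 tells overflow from underflow."""
    f = decode_status_word(sw)
    if f["IE"] and f["SF"]:
        return "overflow" if f["C1"] else "underflow"
    return None

print(decode_status_word(0x3800)["TOP"], stack_fault_kind(0x0241))
```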
flag is set, the FPU exception handler is invoked, using one of the techniques described in
Section 7.7.3., “Software Exception Handling”. (Note that if an exception flag is masked, the
FPU will still set the flag if its associated exception occurs, but it will not set the ES flag.)
The exception flags are “sticky” bits, meaning that once set, they remain set until explicitly
cleared. They can be cleared by executing the FCLEX/FNCLEX (clear exceptions) instructions,
by reinitializing the FPU with the FINIT/FNINIT or FSAVE/FNSAVE instructions, or by over-
writing the flags with an FRSTOR or FLDENV instruction.
The B-bit (bit 15) is included for 8087 compatibility only. It reflects the contents of the ES flag.
[Diagram labels: SAHF Instruction; EFLAGS Register (bits 31 through 0, low byte in bits 7 through
0); flags ZF, PF, and CF.]
Figure 7-9. Moving the FPU Condition Codes to the EFLAGS Register
The new mechanism is available only in the Pentium® Pro processor. Using this mechanism, the
new floating-point compare and set EFLAGS instructions (FCOMI, FCOMIP, FUCOMI, and
FUCOMIP) compare two floating-point values and set the ZF, PF, and CF flags in the EFLAGS
register directly. A single instruction thus replaces the three instructions required by the old
mechanism.
Note also that the FCMOVcc instructions (also new in the Pentium® Pro processor) allow condi-
tional moves of floating-point values (values in the FPU data registers) based on the setting of
the status flags (ZF, PF, and CF) in the EFLAGS register. These instructions eliminate the need
for an IF statement to perform conditional moves of floating-point values.
FPU control word (bit assignments):
Bits 15-13          Reserved
Bit 12        X     Infinity Control
Bits 11-10    RC    Rounding Control
Bits 9-8      PC    Precision Control
Bits 7-6            Reserved
Exception Masks:
Bit 5         PM    Precision
Bit 4         UM    Underflow
Bit 3         OM    Overflow
Bit 2         ZM    Zero Divide
Bit 1         DM    Denormalized Operand
Bit 0         IM    Invalid Operation
The precision-control (PC) field (bits 8 and 9 of the FPU control word) determines the precision
(64, 53, or 24 bits) of floating-point calculations made by the FPU (refer to Table 7-4). The
default precision is extended precision, which uses the full 64-bit significand available with the
extended-real format of the FPU data registers; the user, compiler, or operating system can
select a different precision. The extended-precision setting is best suited for most
applications, because it allows applications to take full advantage of the precision of the
extended-real format.
NOTE:
* Includes the implied integer bit.
The double precision and single precision settings reduce the size of the significand to 53 bits
and 24 bits, respectively. These settings are provided to support the IEEE standard and to allow
exact replication of calculations which were done using the lower precision data types. Using
these settings nullifies the advantages of the extended-real format’s 64-bit significand length.
When reduced precision is specified, the rounding of the significand value clears the unused bits
on the right to zeros.
The precision-control bits only affect the results of the following floating-point instructions:
FADD, FADDP, FSUB, FSUBP, FSUBR, FSUBRP, FMUL, FMULP, FDIV, FDIVP, FDIVR,
FDIVRP, and FSQRT.
The round up and round down modes are termed directed rounding and can be used to imple-
ment interval arithmetic. Interval arithmetic is used to determine upper and lower bounds for the
true result of a multistep computation, when the intermediate results of the computation are
subject to rounding.
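The idea behind interval arithmetic can be sketched without touching the FPU's RC field. The example below is an assumption-laden illustration: it uses Python's `decimal` module, whose ROUND_FLOOR and ROUND_CEILING contexts play the role of the round down and round up modes, at a deliberately small precision of 6 digits. Evaluating the same multistep computation in both modes brackets the true result.

```python
from decimal import Context, Decimal, ROUND_CEILING, ROUND_FLOOR

down = Context(prec=6, rounding=ROUND_FLOOR)     # stands in for round down
up = Context(prec=6, rounding=ROUND_CEILING)     # stands in for round up

def third_times_three(ctx):
    """Compute 1/3 + 1/3 + 1/3, rounding every step in the given mode."""
    third = ctx.divide(Decimal(1), Decimal(3))
    return ctx.add(ctx.add(third, third), third)

lower = third_times_three(down)
upper = third_times_three(up)
print(lower, "<= 1 <=", upper)
```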
The round toward zero mode (sometimes called the “chop” mode) is commonly used when
performing integer arithmetic with the FPU.
Whenever possible, the FPU produces an infinitely precise result in the destination format
(single, double, or extended real). However, it is often the case that the infinitely precise result
of an arithmetic or store operation cannot be encoded exactly in the format of the destination
operand.
For example, the following value (a) has a 24-bit fraction. The least-significant bit of this
fraction (the rightmost bit shown) cannot be encoded exactly in the single-real format (which has
only a 23-bit fraction):
(a) 1.0001 0000 1000 0011 1001 0111 E₂ 101
To round this result (a), the FPU first selects two representable fractions b and c that most
closely bracket a in value (b < a < c).
(b) 1.0001 0000 1000 0011 1001 011 E₂ 101
(c) 1.0001 0000 1000 0011 1001 100 E₂ 101
The FPU then sets the result to b or to c according to the rounding mode selected in the RC field.
Rounding introduces an error in a result that is less than one unit in the last place to which the
result is rounded.
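The choice between b and c can be sketched for the example above. The function and mode names below are hypothetical, and the round-to-nearest rule is assumed to be the IEEE round-to-nearest-even rule; with only one extra bit, an inexact value is always exactly halfway between b and c. A carry out of the 23-bit fraction is not handled.

```python
def round_fraction(frac24, mode, negative=False):
    """Round a 24-bit fraction to 23 bits under the named rounding mode."""
    b = frac24 >> 1            # magnitude truncated to 23 bits
    c = b + 1                  # next larger representable magnitude
    if frac24 & 1 == 0:
        return b               # exactly representable, no rounding needed
    if mode == "nearest":      # tie: choose the candidate with an even last bit
        return b if b % 2 == 0 else c
    if mode == "down":         # toward minus infinity
        return c if negative else b
    if mode == "up":           # toward plus infinity
        return b if negative else c
    if mode == "zero":         # chop
        return b
    raise ValueError(mode)

A = int("0001 0000 1000 0011 1001 0111".replace(" ", ""), 2)
B = int("0001 0000 1000 0011 1001 011".replace(" ", ""), 2)
C = int("0001 0000 1000 0011 1001 100".replace(" ", ""), 2)
for mode in ("nearest", "down", "up", "zero"):
    print(mode, "c" if round_fraction(A, mode) == C else "b")
```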
The rounded result is called the inexact result. When the FPU produces an inexact result, the
floating-point precision (inexact) flag (PE) is set in the FPU status word.
When the overflow exception is masked and the infinitely precise result is between the largest
positive finite value allowed in a particular format and +∞, the FPU rounds the result as shown
in Table 7-6.
When the overflow exception is masked and the infinitely precise result is between the largest
negative finite value allowed in a particular format and −∞, the FPU rounds the result as shown
in Table 7-7.
The rounding modes have no effect on comparison operations, operations that produce exact
results, or operations that produce NaN results.
FPU tag word (bits 15 through 0), TAG values:
00 — Valid
01 — Zero
10 — Special: invalid (NaN, unsupported), infinity, or denormal
11 — Empty
Each tag in the FPU tag word corresponds to a physical register (numbers 0 through 7). The
current top-of-stack (TOP) pointer stored in the FPU status word can be used to associate tags
with registers relative to ST(0).
The FPU uses the tag values to detect stack overflow and underflow conditions. Stack overflow
occurs when the TOP pointer is decremented (due to a register load or push operation) to point
to a non-empty register. Stack underflow occurs when the TOP pointer is incremented (due to a
save or pop operation) to point to an empty register or when an empty register is also referenced
as a source operand. A non-empty register is defined as a register containing a zero (01), a valid
value (00), or a special (10) value.
Application programs and exception handlers can use this tag information to check the contents
of an FPU data register without performing complex decoding of the actual data in the register.
To read the tag register, it must be stored in memory using either the FSTENV/FNSTENV or
FSAVE/FNSAVE instructions. The location of the tag word in memory after being saved with
one of these instructions is shown in Figures 7-13 through 7-16.
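Once a tag word image is in memory, relating its tags to stack positions takes only the TOP value. The sketch below assumes that the tag for physical register n occupies bits 2n+1 and 2n of the tag word; the function name is hypothetical.

```python
TAG_NAMES = {0b00: "valid", 0b01: "zero", 0b10: "special", 0b11: "empty"}

def tags_by_stack_position(tag_word, top):
    """Return the tag names for ST(0) through ST(7)."""
    tags = []
    for i in range(8):
        physical = (top + i) % 8                      # register behind ST(i)
        tags.append(TAG_NAMES[(tag_word >> (2 * physical)) & 0b11])
    return tags

# Physical register 7 valid, register 6 zero, all others empty; TOP = 6.
print(tags_by_stack_position(0x1FFF, 6))
```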
Software cannot directly load or modify the tags in the tag register. The FLDENV and FRSTOR
instructions load an image of the tag register into the FPU; however, the FPU uses those tag
values only to determine if the data registers are empty (11B) or non-empty (00B, 01B, or 10B).
If the tag register image indicates that a data register is empty, the tag in the tag register for that
data register is marked empty (11B); if the tag register image indicates that the data register is
non-empty, the FPU reads the actual value in the data register and sets the tag for the register
accordingly. This action prevents a program from setting the values in the tag register to incor-
rectly represent the actual contents of non-empty data registers.
FSTENV/FNSTENV instruction saves the contents of the status, control, tag, FPU instruction
pointer, FPU operand pointer, and opcode registers. The FSAVE/FNSAVE instruction stores that
information plus the contents of the FPU data registers. Note that the FSAVE/FNSAVE instruc-
tion also initializes the FPU to default values (just as the FINIT/FNINIT instruction does) after
it has saved the original state of the FPU.
The manner in which this information is stored in memory depends on the operating mode of
the processor (protected mode or real-address mode) and on the operand-size attribute in effect
(32-bit or 16-bit). Refer to Figures 7-13 through 7-16. In virtual-8086 mode or SMM, the
real-address mode format shown in Figure 7-16 is used. Refer to Chapter 12, System Management
Mode (SMM) of the Intel Architecture Software Developer’s Manual, Volume 3, for special
considerations for using the FPU while in SMM.
Figure 7-13. Protected Mode FPU State Image in Memory, 32-Bit Format
Figure 7-14. Real Mode FPU State Image in Memory, 32-Bit Format
Figure 7-15. Protected Mode FPU State Image in Memory, 16-Bit Format
Figure 7-16. Real Mode FPU State Image in Memory, 16-Bit Format
The FLDENV and FRSTOR instructions allow FPU state information to be loaded from
memory into the FPU. Here, the FLDENV instruction loads only the status, control, tag, FPU
instruction pointer, FPU operand pointer, and opcode registers, and the FRSTOR instruction
loads all the FPU registers, including the FPU stack registers.
FPU data type formats (bit positions):
Single Real          Sign: bit 31; Exponent: bits 30-23; Fraction: bits 22-0 (implied integer)
Double Real          Sign: bit 63; Exponent: bits 62-52; Fraction: bits 51-0 (implied integer)
Extended Real        Sign: bit 79; Exponent: bits 78-64; Integer: bit 63; Fraction: bits 62-0
Word Integer         Sign: bit 15; value: bits 14-0
Short Integer        Sign: bit 31; value: bits 30-0
Long Integer         Sign: bit 63; value: bits 62-0
Packed BCD Integers  Sign: bit 79; X: bits 78-72; digits D17 through D0: bits 71-0
                     (4 bits = 1 BCD digit)
When stored in memory, the least significant byte of an FPU data-type value is stored at the
initial address specified for the value. Successive bytes from the value are then stored in succes-
sively higher addresses in memory. The floating-point instructions load and store memory oper-
ands using only the initial address of the operand.
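The byte ordering can be seen by packing a value in little-endian order. This sketch uses Python's `struct` module with the `<` (little-endian) prefix; the first byte of the result corresponds to the initial (lowest) address.

```python
import struct

# Single-real 178.125 has the bit pattern 43322000H.
stored = struct.pack("<f", 178.125)
for offset, byte in enumerate(stored):
    print(f"address + {offset}: {byte:02X}")
```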
is bit 62. Here, the integer is explicitly set to 1 for normalized numbers, infinities, and NaNs,
and to 0 for zero and denormalized numbers.
The exponent of each real data type is encoded in biased format. The biasing constant is 127 for
the single-real format, 1023 for the double-real format, and 16,383 for the extended-real format.
Table 7-9 shows the encodings for all the classes of real numbers (that is, zero, denormalized-
finite, normalized-finite, and ∞) and NaNs for each of the three real data-types. It also gives the
format for the real indefinite value.
When storing real values in memory, single-real values are stored in 4 consecutive bytes in
memory; double-real values are stored in 8 consecutive bytes; and extended-real values are
stored in 10 consecutive bytes.
As a general rule, values should be stored in memory in double-real format. This format
provides sufficient range and precision to return correct results with a minimum of programmer
attention. The single-real format is appropriate for applications that are constrained by memory;
however, it provides less precision and a greater chance of overflow. The single-real format is
also useful for debugging algorithms, because rounding problems will manifest themselves
more quickly in this format. The extended-real format is normally reserved for holding interme-
diate results in the FPU registers and constants. Its extra length is designed to shield final results
from the effects of rounding and overflow/underflow in intermediate calculations. However,
when an application requires the maximum range and precision of the FPU (for data storage,
computations, and results), values can be stored in memory in extended-real format.
The real indefinite value is a QNaN encoding that is stored by several floating-point instructions
in response to a masked floating-point invalid operation exception (refer to Table 7-21).
The most significant bit of each format is the sign bit (0 for positive and 1 for negative). Nega-
tive values are represented in standard two’s complement notation. The quantity zero is repre-
sented with all bits (including the sign bit) set to zero. Note that the FPU’s word-integer data
type is identical to the word-integer data type used by the processor’s integer unit and the short-
integer format is identical to the integer unit’s doubleword-integer data type.
Word-integer values are stored in 2 consecutive bytes in memory; short-integer values are stored
in 4 consecutive bytes; and long-integer values are stored in 8 consecutive bytes. When loaded
into the FPU’s data registers, all the binary integers are exactly representable in the extended-
real format.
The binary integer encoding 100..00B represents either of two things, depending on the circum-
stances of its use:
• The largest negative number supported by the format (–2^15, –2^31, or –2^63).
• The integer indefinite value.
If this encoding is used as a source operand (as in an integer load or integer arithmetic instruc-
tion), the FPU interprets it as the largest negative number representable in the format being used.
If the FPU detects an invalid operation when storing an integer value in memory with an
FIST/FISTP instruction and the invalid operation exception is masked, the FPU stores the
integer indefinite encoding in the destination operand as a masked response to the exception. In
situations where the origin of a value with this encoding may be ambiguous, the invalid opera-
tion exception flag can be examined to see if the value was produced as a response to an
exception.
If the integer indefinite is stored in memory and is later loaded back into an FPU data register,
it is interpreted as the largest negative number supported by the format.
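The dual reading of the 100..00B encoding can be checked for each binary integer width. In this sketch the function name is hypothetical; it builds the encoding as little-endian bytes (only the top bit set) and interprets it as a two's complement integer.

```python
import struct

FORMATS = {2: "<h", 4: "<i", 8: "<q"}   # word, short, and long integer

def indefinite_as_integer(width):
    """Interpret the encoding 100..00B of the given byte width as an integer."""
    encoding = bytes(width - 1) + b"\x80"    # most significant byte is last
    return struct.unpack(FORMATS[width], encoding)[0]

for width in (2, 4, 8):
    print(width, indefinite_as_integer(width))
```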
The decimal integer format exists in memory only. When a decimal integer is loaded in a data
register in the FPU, it is automatically converted to the extended-real format. All decimal inte-
gers are exactly representable in extended-real format.
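A packed BCD image can be built by hand to see the layout. The function below is a hypothetical sketch: it assumes the format described for this data type (18 digits D17 through D0 at 4 bits each in bits 71 through 0, sign in bit 79, remaining bits of the top byte written as 0) and little-endian byte order in memory. Because 10^18 is smaller than 2^64, every such value fits in the 64-bit significand of the extended-real format.

```python
def encode_packed_bcd(value):
    """Encode an integer of up to 18 decimal digits as a 10-byte packed BCD image."""
    if abs(value) >= 10 ** 18:
        raise ValueError("packed BCD holds at most 18 decimal digits")
    digits = f"{abs(value):018d}"            # digits[0] is D17, digits[17] is D0
    image = bytearray(10)
    for k in range(9):
        low = int(digits[17 - 2 * k])        # digit D(2k)
        high = int(digits[16 - 2 * k])       # digit D(2k+1)
        image[k] = (high << 4) | low
    image[9] = 0x80 if value < 0 else 0x00   # sign bit is bit 79
    return bytes(image)

print(encode_packed_bcd(-1234).hex())
```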
The packed decimal indefinite encoding is stored by the FBSTP instruction in response to a
masked floating-point invalid operation exception. Attempting to load this value with the FBLD
instruction produces an undefined result.
Refer to Section 6.2.3., “Floating-Point Instructions” in Chapter 6, Instruction Set Summary, for
a list of the floating-point instructions by category.
The following section briefly describes the instructions in each category. Detailed descriptions
of the floating-point instructions are given in Chapter 3, Instruction Set Reference, in the Intel
Architecture Software Developer’s Manual, Volume 2.
Operands are normally stored in the FPU data registers in extended-real format (refer to Section
7.3.4.2., “Precision Control Field”). The FLD (load real) instruction pushes a real operand from
memory onto the top of the FPU data-register stack. If the operand is in single- or double-real
format, it is automatically converted to extended-real format. This instruction can also be used
to push the value in a selected FPU data register onto the top of the register stack.
The FILD (load integer) instruction converts an integer operand in memory into extended-real
format and pushes the value onto the top of the register stack. The FBLD (load packed decimal)
instruction performs the same load operation for a packed BCD operand in memory.
The FST (store real) and FIST (store integer) instructions store the value in register ST(0) in
memory in the destination format (real or integer, respectively). Again, the format conversion is
carried out automatically.
The FSTP (store real and pop), FISTP (store integer and pop), and FBSTP (store packed decimal
and pop) instructions store the value in the ST(0) register into memory in the destination format
(real, integer, or packed BCD), then perform a pop operation on the register stack. A pop
operation causes the ST(0) register to be marked empty and the stack pointer (TOP) in the FPU
status word to be incremented by 1. The FSTP instruction can also be used to copy the value
in the ST(0) register to another FPU register [ST(i)].
The FXCH (exchange register contents) instruction exchanges the value in a selected register in
the stack [ST(i)] with the value in ST(0).
The FCMOVcc (conditional move) instructions move the value in a selected register in the stack
[ST(i)] to register ST(0). These instructions move the value only if the conditions specified with
a condition code (cc) are satisfied (refer to Table 7-14). The conditions being tested with the
FCMOVcc instructions are represented by the status flags in the EFLAGS register. The condi-
tion code mnemonics are appended to the letters “FCMOV” to form the mnemonic for a
FCMOVcc instruction.
Like the CMOVcc instructions, the FCMOVcc instructions are useful for optimizing small IF
constructions. They also help eliminate branching overhead for IF operations and the possibility
of branch mispredictions by the processor.
NOTE
The FCMOVcc instructions may not be supported on some processors in the
Pentium® Pro processor family. Software can check if the FCMOVcc instruc-
tions are supported by checking the processor’s feature information with the
CPUID instruction (refer to “CPUID—CPU Identification” in Chapter 3,
Instruction Set Reference, of the Intel Architecture Software Developer’s
Manual, Volume 2).
performs a function similar to the FIST/FISTP instructions, except that the result is saved in a
real format.
The FABS, FCHS, and FXTRACT instructions perform convenient arithmetic operations. The
FABS instruction produces the absolute value of the source operand. The FCHS instruction
changes the sign of the source operand. The FXTRACT instruction separates the source operand
into its exponent and fraction and stores each value in a register in real format.
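The separation performed by FXTRACT resembles what `math.frexp` does in Python, apart from the scaling convention. The sketch below is an analogy only, valid for finite non-zero values: `frexp` returns a mantissa in [0.5, 1), so it is rescaled to the range [1, 2) used for a normalized significand. The function name is hypothetical.

```python
import math

def fxtract_like(x):
    """Split a finite non-zero x into (unbiased exponent, significand in [1, 2))."""
    mantissa, exponent = math.frexp(x)   # x == mantissa * 2**exponent
    return exponent - 1, mantissa * 2.0

exponent, significand = fxtract_like(178.125)
print(exponent, significand)
```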
Table 7-15. Setting of FPU Condition Code Flags for Real Number Comparisons
Condition C3 C2 C0
ST(0) > Source Operand 0 0 0
ST(0) < Source Operand 0 0 1
ST(0) = Source Operand 1 0 0
Unordered 1 1 1
The FICOM and FICOMP instructions also operate the same as the FCOM and FCOMP instruc-
tions, except that the source operand is an integer value in memory. The integer value is auto-
matically converted into an extended real value prior to making the comparison. The FICOMP
instruction pops the FPU register stack following the comparison operation.
The FTST instruction performs the same operation as the FCOM instruction, except that the
value in register ST(0) is always compared with the value 0.0.
The FCOMI and FCOMIP instructions are new in the Intel Pentium® Pro processor. They
perform the same comparison as the FCOM and FCOMP instructions, except that they set the
status flags (ZF, PF, and CF) in the EFLAGS register to indicate the results of the comparison
(refer to Table 7-16) instead of the FPU condition code flags. The FCOMI and FCOMIP
instructions allow conditional branch instructions (Jcc) to be executed directly from the results
of their comparison.
Table 7-16. Setting of EFLAGS Status Flags for Real Number Comparisons
Comparison Results ZF PF CF
ST(0) > ST(i) 0 0 0
ST(0) < ST(i) 0 0 1
ST(0) = ST(i) 1 0 0
Unordered 1 1 1
The FUCOMI and FUCOMIP instructions operate the same as the FCOMI and FCOMIP
instructions, except that they do not generate a floating-point invalid operation exception if the
unordered condition is the result of one or both of the operands being a QNaN. The FCOMIP
and FUCOMIP instructions pop the FPU register stack following the comparison operation.
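The flag settings of Table 7-16 can be expressed as a small function. This is a sketch with a hypothetical name; it models only the resulting flags, not the exception behavior that distinguishes FCOMI from FUCOMI, and it treats any NaN operand as unordered.

```python
import math

def compare_flags(st0, sti):
    """Return (ZF, PF, CF) for comparing ST(0) with ST(i), as in Table 7-16."""
    if math.isnan(st0) or math.isnan(sti):
        return (1, 1, 1)          # unordered
    if st0 > sti:
        return (0, 0, 0)
    if st0 < sti:
        return (0, 0, 1)
    return (1, 0, 0)              # equal

print(compare_flags(2.0, 1.0), compare_flags(1.0, float("nan")))
```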
The FXAM instruction determines the classification of the real value in the ST(0) register (that
is, whether the value is zero, a denormal number, a normal finite number, ∞, a NaN, or an unsup-
ported format) or that the register is empty. It sets the FPU condition code flags to indicate the
classification (refer to “FXAM—Examine” in Chapter 3, Instruction Set Reference, of the Intel
Architecture Software Developer’s Manual, Volume 2). It also sets the C1 flag to indicate the sign
of the value.
2. Check ordered comparison result. Use the constants given in Table 7-17 in the TEST
instruction to test for a less than, equal to, or greater than result, then use the corresponding
conditional branch instruction to transfer program control to the appropriate procedure or
section of code.
If a program or procedure has been thoroughly tested and it incorporates periodic checks for
QNaN results, then it is not necessary to check for the unordered result every time a comparison
is made.
Refer to Section 7.3.3., “Branching and Conditional Moves on FPU Condition Codes”, for
another technique for branching on FPU condition codes.
Some non-comparison FPU instructions update the condition code flags in the FPU status word.
To ensure that the status word is not altered inadvertently, store it immediately following a
comparison operation.
These instructions operate on the top one or two registers of the FPU register stack and they
return their results to the stack. The source operands must be given in radians.
The FSINCOS instruction returns both the sine and the cosine of a source operand value. It oper-
ates faster than executing the FSIN and FCOS instructions in succession.
The FPATAN instruction computes the arctangent of ST(1) divided by ST(0). It is useful for
converting rectangular coordinates to polar coordinates.
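The rectangular-to-polar use of FPATAN can be modeled in Python, assuming that the instruction behaves like the two-argument arctangent (the quadrant is taken from the signs of both operands). The function names are illustrative, and math.atan2 stands in for the instruction itself.

```python
import math

def fpatan(st1: float, st0: float) -> float:
    """Model of FPATAN: arctangent of ST(1)/ST(0), in radians."""
    return math.atan2(st1, st0)

def to_polar(x: float, y: float):
    """Convert rectangular (x, y) to polar (r, theta), with y in ST(1) and x in ST(0)."""
    return math.hypot(x, y), fpatan(y, x)
```

For example, to_polar(0.0, 1.0) yields a radius of 1.0 and an angle of approximately π/2.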
7.5.8. Pi
When the argument (source operand) of a trigonometric function is within the range of the func-
tion, the argument is automatically reduced by the appropriate multiple of 2π through the same
reduction mechanism used by the FPREM and FPREM1 instructions. The internal value of π
that the IA FPU uses for argument reduction and other computations is as follows:
π = 0.f ∗ 2^2
where:
f = C90FDAA2 2168C234 C
(The spaces in the fraction above indicate 32-bit boundaries.)
This internal π value has a 66-bit mantissa, which is 2 bits more than is allowed in the
significand of an extended-real value. (Since 66 bits is not a whole number of hexadecimal
digits, two additional zero bits have been appended to the value so that it can be represented
in hexadecimal format. The least-significant hexadecimal digit (C) is thus 1100B, where the
two least-significant bits represent bits 67 and 68 of the mantissa.)
This value of π has been chosen to guarantee no loss of significance in a source operand,
provided the operand is within the specified range for the instruction.
If the results of computations that explicitly use π are to be used in the FSIN, FCOS, FSINCOS,
or FPTAN instructions, the full 66-bit fraction of π should be used. This ensures that the results
are consistent with the argument-reduction algorithms that these instructions use. Using a
rounded version of π can cause inaccuracies in result values, which, if propagated through
several calculations, might produce meaningless results.
A common method of representing the full 66-bit fraction of π is to separate the value into two
numbers (highπ and lowπ) that when added together give the value for π shown earlier in this
section with the full 66-bit fraction:
π = highπ + lowπ
For example, the following two values (given in scientific notation with the fraction in hexadec-
imal and the exponent in decimal) represent the 33 most-significant and the 33 least-significant
bits of the fraction:
highπ (unnormalized) = 0.C90FDAA20 ∗ 2^+2
lowπ (unnormalized) = 0.42D184698 ∗ 2^−31
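The relationship between the 66-bit fraction and the highπ/lowπ pair can be checked with exact rational arithmetic. The following Python sketch uses only the hexadecimal values given in this section; the constant and function names are illustrative.

```python
from fractions import Fraction
import math

# f = C90FDAA2 2168C234 C: 17 hex digits = 68 bits (66 significant bits + 2 pad bits).
F_HEX = 0xC90FDAA22168C234C
PI_66 = Fraction(F_HEX, 2**68) * 2**2             # pi = 0.f * 2^2

# Each 9-digit hex fraction spans 36 bits.
HIGH_PI = Fraction(0xC90FDAA20, 2**36) * 2**2     # 0.C90FDAA20 * 2^+2
LOW_PI = Fraction(0x42D184698, 2**36) / 2**31     # 0.42D184698 * 2^-31

def pi_split_is_consistent() -> bool:
    """True if high-pi + low-pi reproduces the full 66-bit value exactly."""
    return HIGH_PI + LOW_PI == PI_66
```

Converting PI_66 to a double (float(PI_66)) gives a value that agrees with math.pi to within double-precision rounding, which is a quick sanity check on the hexadecimal digits.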
given argument x, let f(x) and F(x) be the correct and computed (approximate) function values,
respectively. The error in ulps is defined to be:
error = (f(x) − F(x)) / 2^(k − 63)
where k is an integer such that 1 ≤ 2^(−k) ∗ f(x) < 2.
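The ulp definition above can be evaluated exactly with rational arithmetic. In the Python sketch below, the function name is hypothetical, and the absolute value of f(x) is used to locate k so that negative function values can also be handled; that extension is an assumption, not part of the definition.

```python
from fractions import Fraction

def ulp_error(correct, computed) -> Fraction:
    """Error in ulps: (f(x) - F(x)) / 2^(k - 63), where 1 <= 2^-k * |f(x)| < 2."""
    f = Fraction(correct)
    F = Fraction(computed)
    if f == 0:
        raise ValueError("k is undefined when f(x) is zero")
    a = abs(f)
    # Estimate floor(log2(a)) from bit lengths, then correct by at most one.
    k = a.numerator.bit_length() - a.denominator.bit_length()
    if Fraction(2) ** k > a:
        k -= 1
    return (f - F) / Fraction(2) ** (k - 63)
```

With this scaling, one ulp corresponds to the least-significant bit of a 64-bit (extended-real) significand, so a double-precision rounding error of 2^−52 on a value of 1.0 amounts to 2048 ulps.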
With the Pentium® and Pentium® Pro processors, the worst-case error in the transcendental
instructions is less than 1 ulp when rounding to nearest and less than 1.5 ulps when rounding
in other modes. (The FYL2X and FYL2XP1 instructions are two-operand instructions and are
guaranteed to be within 1 ulp only when y = 1. When y ≠ 1, the maximum error is always within
1.35 ulps in round-to-nearest mode. The trigonometric instructions may use a 66-bit
approximation to the true value of π to reduce the magnitude of the input argument. In this
case, the final computed result can vary considerably from the true, mathematically precise
result.) The instructions are guaranteed to be monotonic with respect to the input operands
throughout the domain supported by the instruction. (For the two-operand functions,
monotonicity was proved by holding one of the operands constant.)
With the Intel486™ processor and Intel 387 math coprocessor, the worst-case, transcendental-
function error is typically 3 or 3.5 ulps, but is sometimes as large as 4.5 ulps.
The FLDCW instruction loads the FPU control word register with a value from memory. The
FSTCW/FNSTCW and FSTSW/FNSTSW instructions store the FPU control and status words,
respectively, in memory (or for an FSTSW/FNSTSW instruction in a general-purpose register).
The FSTENV/FNSTENV and FSAVE/FNSAVE instructions save the FPU environment and
state, respectively, in memory. The FPU environment includes all the FPU’s control and status
registers; the FPU state includes the FPU environment and the data registers in the FPU register
stack. (The FSAVE/FNSAVE instruction also initializes the FPU to default values, like the
FINIT/FNINIT instruction, after it saves the original state of the FPU.)
The FLDENV and FRSTOR instructions load the FPU environment and state, respectively,
from memory into the FPU. These instructions are commonly used when switching tasks or
contexts.
The WAIT/FWAIT instructions are synchronization instructions. (They are actually mnemonics
for the same opcode.) These instructions check the FPU status word for pending unmasked FPU
exceptions. If any pending unmasked FPU exceptions are found, they are handled before the
processor resumes execution of the instructions (integer, floating-point, or system instruction)
in the instruction stream. The WAIT/FWAIT instructions are provided to allow synchronization
of instruction execution between the FPU and the processor’s integer unit. Refer to Section 7.9.,
“Floating-Point Exception Synchronization” for more information on the use of the
WAIT/FWAIT instructions.
NOTE
When operating a Pentium® or Intel486™ processor in MS-DOS compati-
bility mode, it is possible (under unusual circumstances) for a non-waiting
instruction to be interrupted prior to being executed to handle a pending FPU
exception. The circumstances where this can happen and the resulting action
of the processor are described in Section E.2.1.3., “No-Wait FPU Instructions
Can Get FPU Interrupt in Window” in Appendix E, Guidelines for Writing
FPU Exceptions Handlers. When operating a Pentium® Pro processor in MS-DOS
compatibility mode, non-waiting instructions cannot be interrupted in
this way (refer to Section E.2.2., “MS-DOS* Compatibility Mode in the P6
Family Processors” in Appendix E, Guidelines for Writing FPU Exceptions
Handlers).
Note that when exceptions are masked, the FPU may detect multiple exceptions in a single
instruction, because it continues executing the instruction after performing its masked response.
For example, the FPU can detect a denormalized operand, perform its masked response to this
exception, and then detect numeric underflow.
The MS-DOS compatibility mode is typically used as follows to invoke the floating-point
exception handler:
1. If the FPU detects an unmasked floating-point exception, it sets the flag for the exception
and the ES flag in the FPU status word.
2. If the IGNNE# pin is deasserted, the FPU then asserts the FERR# pin either immediately,
or else delayed (deferred) until just before the execution of the next waiting floating-point
instruction or MMX™ instruction. Whether the FERR# pin is asserted immediately or
delayed depends on the type of processor, the instruction, and the type of exception.
3. If a preceding floating-point instruction has set the exception flag for an unmasked FPU
exception, the processor freezes just before executing the next WAIT instruction, waiting
floating-point instruction, or MMX™ instruction. Whether the FERR# pin was asserted at
the preceding floating-point instruction or is just now being asserted, the freezing of the
processor assures that the FPU exception handler will be invoked before the new floating-
point (or MMX™) instruction gets executed.
4. The FERR# pin is connected through external hardware to IRQ13 of a cascaded, program-
mable interrupt controller (PIC). When the FERR# pin is asserted, the PIC is programmed
to generate an interrupt 75H.
5. The PIC asserts the INTR pin on the processor to signal the interrupt 75H.
6. The BIOS for the PC system handles the interrupt 75H by branching to the interrupt 2
(NMI) interrupt handler.
7. The interrupt 2 handler determines if the interrupt is the result of an NMI interrupt or a
floating-point exception.
8. If a floating-point exception is detected, the interrupt 2 handler branches to the floating-
point exception handler.
If the IGNNE# pin is asserted, the processor ignores floating-point error conditions. This pin is
provided to inhibit floating-point exceptions from being generated while the floating-point
exception handler is servicing a previously signaled floating-point exception.
Appendix E, Guidelines for Writing FPU Exceptions Handlers, describes the MS-DOS compat-
ibility mode in much greater detail. This mode is somewhat more complicated in the Intel486™
and Pentium® processor implementations, as described in Appendix E, Guidelines for Writing
FPU Exceptions Handlers.
The flag for this exception (IE) is bit 0 of the FPU status word, and the mask bit (IM) is bit 0 of
the FPU control word. The stack fault flag (SF) of the FPU status word indicates the type of
operation that caused the exception. When the SF flag is set to 1, a stack operation has resulted in
stack overflow or underflow; when the flag is cleared to 0, an arithmetic instruction has encoun-
tered an invalid operand. Note that the FPU explicitly sets the SF flag when it detects a stack
overflow or underflow condition, but it does not explicitly clear the flag when it detects an
invalid-arithmetic-operand condition. As a result, the state of the SF flag can be 1 following an
invalid-arithmetic-operation exception, if it was not cleared from the last time a stack overflow
or underflow condition occurred. Refer to Section 7.3.2.4., “Stack Fault Flag”, for more infor-
mation about the SF flag.
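A handler's first-level test of the IE and SF flags might look like the following Python sketch. The IE position (bit 0) is given above; the SF position (bit 6) is taken from the status word layout and is an assumption as far as this section is concerned. The function name and result strings are illustrative.

```python
IE_BIT = 0   # invalid-operation exception flag (bit 0, as stated above)
SF_BIT = 6   # stack fault flag (assumed position in the FPU status word)

def classify_invalid_operation(status_word: int) -> str:
    """Distinguish stack faults from invalid arithmetic operands using IE and SF."""
    ie = (status_word >> IE_BIT) & 1
    sf = (status_word >> SF_BIT) & 1
    if not ie:
        return "no invalid-operation exception"
    if sf:
        # SF is not cleared automatically, so this may reflect an earlier stack fault.
        return "stack overflow or underflow (or a stale SF flag)"
    return "invalid arithmetic operand"
```

Because the SF flag is sticky, a test of this kind is reliable only if the flag is cleared after each stack fault is serviced.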
Section 7.7.3., “Software Exception Handling”) and the top-of-stack pointer (TOP) and source
operands remain unchanged.
Table 7-21. Invalid Arithmetic Operations and the Masked Responses to Them
Condition | Masked Response
Any arithmetic operation on an operand that is in an unsupported format. | Return the real indefinite value to the destination operand.
Any arithmetic operation on a SNaN. | Return a QNaN to the destination operand (refer to Section 7.6., “Operating on NaNs”).
Compare and test operations: one or both operands are NaNs. | Set the condition code flags (C0, C2, and C3) in the FPU status word to 111B (not comparable).
Addition: operands are opposite-signed infinities. Subtraction: operands are like-signed infinities. | Return the real indefinite value to the destination operand.
Multiplication: ∞ by 0; 0 by ∞. | Return the real indefinite value to the destination operand.
Division: ∞ by ∞; 0 by 0. | Return the real indefinite value to the destination operand.
Remainder instructions FPREM, FPREM1: modulus (divisor) is 0 or dividend is ∞. | Return the real indefinite; clear condition code flag C2 to 0.
Trigonometric instructions FCOS, FPTAN, FSIN, FSINCOS: source operand is ∞. | Return the real indefinite; clear condition code flag C2 to 0.
FIST/FISTP instruction when input operand <> MAXINT for destination operand size. | Return MAXNEG to destination operand.
FSQRT: negative operand (except FSQRT (–0) = –0); FYL2X: negative operand (except FYL2X (–0) = –∞); FYL2XP1: operand more negative than –1. | Return the real indefinite value to the destination operand.
FBSTP: source register is empty or it contains a NaN, ∞, or a value that cannot be represented in 18 decimal digits. | Store BCD integer indefinite value in the destination operand.
FXCH: one or both registers are tagged empty. | Load empty registers with the real indefinite value, then perform the exchange.
FSTP instructions), where a within-range value in a data register is stored in memory in a single-
or double-real format. The overflow threshold range for the single-real format is −1.0 ∗ 2^128 to
1.0 ∗ 2^128; the range for the double-real format is −1.0 ∗ 2^1024 to 1.0 ∗ 2^1024.
A numeric overflow exception is not generated when overflow occurs while storing values in an
integer or BCD integer format. Instead, the invalid-arithmetic-operand exception is signaled.
The flag (OE) for the numeric overflow exception is bit 3 of the FPU status word, and the mask
bit (OM) is bit 3 of the FPU control word.
When a numeric overflow exception occurs and the exception is masked, the FPU sets the OE
flag and returns one of the values shown in Table 7-23. The value returned depends on the
current rounding mode of the FPU (refer to Section 7.3.4.3., “Rounding Control Field”).
Table 7-23. Masked Responses to Numeric Overflow
Rounding Mode | Sign of True Result | Result
To nearest | + | +∞
To nearest | – | –∞
Toward –∞ | + | Largest finite positive number
Toward –∞ | – | –∞
Toward +∞ | + | +∞
Toward +∞ | – | Largest finite negative number
Toward zero | + | Largest finite positive number
Toward zero | – | Largest finite negative number
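Table 7-23 amounts to a small lookup keyed on the rounding mode and the sign of the true result. The Python sketch below encodes it directly; the mode names ("nearest", "down", "up", "zero") and result strings are illustrative labels, not architectural encodings.

```python
def masked_overflow_result(rounding_mode: str, sign: str) -> str:
    """Masked response to numeric overflow per Table 7-23."""
    table = {
        ("nearest", "+"): "+inf",
        ("nearest", "-"): "-inf",
        ("down", "+"): "largest finite positive",   # toward -infinity
        ("down", "-"): "-inf",
        ("up", "+"): "+inf",                        # toward +infinity
        ("up", "-"): "largest finite negative",
        ("zero", "+"): "largest finite positive",
        ("zero", "-"): "largest finite negative",
    }
    return table[(rounding_mode, sign)]
```

In each directed mode, the result is an infinity only when rounding moves away from zero in the direction of the true result; otherwise the largest finite value of the appropriate sign is returned.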
The action that the FPU takes when numeric overflow occurs and the numeric overflow
exception is not masked depends on whether the instruction is supposed to store the result
in memory or on the register stack.
If the destination is a memory location, the OE flag is set and a software exception handler is
invoked (refer to Section 7.7.3., “Software Exception Handling”). The top-of-stack pointer
(TOP) and source and destination operands remain unchanged.
If the destination is the register stack, the exponent of the rounded result is divided by 2^24576 and
the result is stored along with the significand in the destination operand. Condition code bit C1
in the FPU status word (called in this situation the “round-up bit”) is set if the significand was
rounded upward and cleared if the result was rounded toward 0. After the result is stored, the
OE flag is set and a software exception handler is invoked.
The scaling bias value 24,576 is equal to 3 ∗ 2^13. Biasing the exponent by 24,576 normally translates
the number as nearly as possible to the middle of the extended-real exponent range so that,
if desired, it can be used in subsequent scaled operations with less risk of causing further
exceptions.
When using the FSCALE instruction, massive overflow can occur, where the result is too large
to be represented, even with a bias-adjusted exponent. Here, if overflow occurs again, after the
result has been biased, a properly signed ∞ is stored in the destination operand.
7.8.5. Numeric Underflow Exception (#U)
The FPU reports a floating-point numeric underflow exception (#U) whenever the rounded
result of an arithmetic instruction is “tiny” (that is, less than the smallest possible normalized,
finite value that will fit into the real format of the destination operand). For example, if the desti-
nation format is extended-real (80 bits), underflow occurs when the rounded result falls in the
unbiased range of −1.0 ∗ 2^−16382 to 1.0 ∗ 2^−16382 (exclusive). Like numeric overflow, numeric
underflow can occur on arithmetic operations where the result is stored in an FPU data register.
It can also occur on store-real operations (with the FST and FSTP instructions), where a within-
range value in a data register is stored in memory in a single- or double-real format. The
underflow threshold range for the single-real format is −1.0 ∗ 2^−126 to 1.0 ∗ 2^−126; the range for the
double-real format is −1.0 ∗ 2^−1022 to 1.0 ∗ 2^−1022. (The numeric underflow exception cannot
occur when storing values in an integer or BCD integer format.)
The flag (UE) for the numeric-underflow exception is bit 4 of the FPU status word, and the mask
bit (UM) is bit 4 of the FPU control word.
When a numeric-underflow exception occurs and the exception is masked, the FPU denormal-
izes the result (refer to Section 7.2.3.2., “Normalized and Denormalized Finite Numbers”). If
the denormalized result is exact, the FPU stores the result in the destination operand, without
setting the UE flag. If the denormal result is inexact, the FPU sets the UE flag, then goes on to
handle the inexact result exception condition (refer to Section 7.8.6., “Inexact Result (Precision)
Exception (#P)”). It is important to note that if numeric-underflow is masked, a numeric-under-
flow exception is signaled only if the denormalized result is inexact. If the denormalized result
is exact, no flags are set and no exceptions are signaled.
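The masked response can be illustrated with a simplified model of a store to the double-real format. The Python sketch below assumes round-to-nearest, takes the underflow threshold 2^−1022 from the text above, and assumes a denormal spacing of 2^−1074 (52 fraction bits); it tests tininess on the unrounded value for brevity. The names are hypothetical.

```python
from fractions import Fraction

MIN_NORMAL = Fraction(1, 2**1022)      # double-real underflow threshold (from the text)
DENORMAL_STEP = Fraction(1, 2**1074)   # assumed spacing of double-real denormals

def masked_underflow(true_result: Fraction):
    """Return (stored_value, UE, PE) for a tiny result when #U is masked."""
    if abs(true_result) >= MIN_NORMAL:
        raise ValueError("result is not tiny")
    steps = true_result / DENORMAL_STEP
    rounded = round(steps)             # Fraction rounds half to even
    stored = rounded * DENORMAL_STEP
    inexact = stored != true_result
    # UE is set only when the denormalized result is inexact; PE is then set as well.
    return stored, inexact, inexact
```

A value that is an exact multiple of the denormal spacing is stored with no flags set, whereas a value such as 2^−1075 cannot be represented and sets both UE and PE in this model.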
The action that the FPU takes when numeric underflow occurs and the numeric-underflow
exception is not masked depends on whether the instruction is supposed to store the result in
memory or on the register stack.
If the destination is a memory location, the UE flag is set and a software exception handler is
invoked (refer to Section 7.7.3., “Software Exception Handling”). The top-of-stack pointer
(TOP) and source and destination operands remain unchanged.
If the destination is the register stack, the exponent of the rounded result is multiplied by
2^24576 and the product is stored along with the significand in the destination operand. Condition
code bit C1 in the FPU status register (acting here as a “round-up bit”) is set if the significand
was rounded upward and cleared if the result was rounded toward 0. After the result is stored,
the UE flag is set and a software exception handler is invoked.
The scaling bias value 24,576 is the same as is used for the overflow exception and has the same
effect, which is to translate the result as nearly as possible to the middle of the extended-real
exponent range.
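The effect of the scaling bias on the exponent can be sketched as follows. The Python model below treats division or multiplication by 2^24576 as subtracting or adding 24,576 to the unbiased exponent; the lower exponent limit (−16382) comes from the text above, the upper limit (16383) is an assumption about the extended-real format, and the function name is illustrative.

```python
SCALING_BIAS = 3 * 2**13                      # 24,576
EXT_MIN_EXP, EXT_MAX_EXP = -16382, 16383      # extended-real normalized exponent range

def bias_adjusted_exponent(true_exponent: int) -> int:
    """Exponent delivered to an unmasked #O/#U handler for a register destination."""
    if true_exponent > EXT_MAX_EXP:           # overflow: result divided by 2^24576
        return true_exponent - SCALING_BIAS
    if true_exponent < EXT_MIN_EXP:           # underflow: result multiplied by 2^24576
        return true_exponent + SCALING_BIAS
    return true_exponent                      # in range: no adjustment
```

A result that just overflows (exponent 16384) is delivered with exponent −8192, and one that just underflows (exponent −16383) with exponent 8193, both near the middle of the range.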
When using the FSCALE instruction, massive underflow can occur, where the result is too tiny
to be represented, even with a bias-adjusted exponent. Here, if underflow occurs again, after the
result has been biased, a properly signed 0 is stored in the destination operand.
The inexact result exception flag (PE) is bit 5 of the FPU status word, and the mask bit (PM) is
bit 5 of the FPU control word.
If the inexact result exception is masked when an inexact result condition occurs and a numeric
overflow or underflow condition has not occurred, the FPU sets the PE flag and stores the
rounded result in the destination operand. The current rounding mode determines the method
used to round the result (refer to Section 7.3.4.3., “Rounding Control Field”). The C1 (round-
up) bit in the FPU status word indicates whether the inexact result was rounded up (C1 is set) or
“not rounded up” (C1 is cleared). In the “not rounded up” case, the least-significant bits of the
inexact result are truncated so that the result fits in the destination format.
If the inexact result exception is not masked when an inexact result occurs and numeric overflow
or underflow has not occurred, the FPU performs the same operation described in the previous
paragraph and, in addition, invokes a software exception handler (refer to Section 7.7.3., “Soft-
ware Exception Handling”).
If an inexact result occurs in conjunction with numeric overflow or underflow, one of the
following operations is carried out:
• If an inexact result occurs along with masked overflow or underflow, the OE or UE flag
and the PE flag are set and the result is stored as described for the overflow or underflow
exceptions (refer to Section 7.8.4., “Numeric Overflow Exception (#O)” or Section 7.8.5.,
“Numeric Underflow Exception (#U)”). If the inexact result exception is unmasked, the
FPU also invokes the software exception handler.
• If an inexact result occurs along with unmasked overflow or underflow and the destination
operand is a register, the OE or UE flag and the PE flag are set, the result is stored as
described for the overflow or underflow exceptions, and the software exception handler is
invoked.
• If an inexact result occurs along with unmasked overflow or underflow and the destination
operand is a memory location, the inexact result condition is ignored.
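The three cases above can be summarized as a decision function. The Python sketch below is a reading of the list, not a description of the hardware; the function name, the "OE/UE" label (meaning whichever of the two flags applies), and the return shape are hypothetical. The memory-destination case follows the unmasked overflow and underflow descriptions, in which the OE or UE flag is set and the handler is invoked.

```python
def inexact_with_overflow_or_underflow(ou_masked: bool, pe_masked: bool, destination: str):
    """Return (flags_set, handler_invoked) for an inexact result that coincides
    with numeric overflow or underflow. destination is "register" or "memory"."""
    if ou_masked:
        # Masked #O/#U: both flags are set; the handler runs only if #P is unmasked.
        return ({"OE/UE", "PE"}, not pe_masked)
    if destination == "register":
        # Unmasked #O/#U with a register destination: both flags set, handler invoked.
        return ({"OE/UE", "PE"}, True)
    # Unmasked #O/#U with a memory destination: the inexact condition is ignored.
    return ({"OE/UE"}, True)
```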
This problem is related to the way the FPU signals the existence of unmasked floating-point
exceptions. (Special exception synchronization is not required for masked floating-point excep-
tions, because the FPU always returns a masked result to the destination operand.)
When a floating-point exception is unmasked and the exception condition occurs, the FPU stops
further execution of the floating-point instruction and signals the exception event. On the next
occurrence of a floating-point instruction or a WAIT/FWAIT instruction in the instruction
stream, the processor checks the ES flag in the FPU status word for pending floating-point
exceptions. If floating-point exceptions are pending, the FPU makes an implicit call (traps) to
the floating-point software exception handler. The exception handler can then execute recovery
procedures for selected or all floating-point exceptions.
Synchronization problems occur in the time frame between when the exception is signaled and
when it is actually handled. Because of concurrent execution, integer or system instructions can
be executed during this time frame. It is thus possible for the source or destination operands for
a floating-point instruction that faulted to be overwritten in memory, making it impossible for
the exception handler to analyze or recover from the exception.
To solve this problem, an exception synchronizing instruction (either a floating-point instruction
or a WAIT/FWAIT instruction) can be placed immediately after any floating-point instruction
that might present a situation where state information pertaining to a floating-point exception
might be lost or corrupted. Floating-point instructions that store data in memory are prime candi-
dates for synchronization. For example, the following three lines of code have the potential for
exception synchronization problems:
FNINIT, FNSTENV, FNSAVE, or FNCLEX instruction is executed, all pending exceptions are
essentially lost (either the FPU status register is cleared or all exceptions are masked). The
FNSTSW and FNSTCW instructions do not check for pending interrupts, but they do not
modify the FPU status and control registers. A subsequent “waiting” floating-point instruction
can then handle any pending exceptions.
CHAPTER 8
PROGRAMMING WITH THE INTEL
MMX™ TECHNOLOGY
The Intel MMX™ technology comprises a set of extensions to the Intel Architecture (IA) that
are designed to greatly enhance the performance of advanced media and communications appli-
cations. These extensions (which include new registers, data types, and instructions) are
combined with a single-instruction, multiple-data (SIMD) execution model to accelerate the
performance of applications such as motion video, combined graphics with video, image
processing, audio synthesis, speech synthesis and compression, telephony, video conferencing,
and 2D and 3D graphics, which typically use compute-intensive algorithms to perform repeti-
tive operations on large arrays of simple, native data elements.
The MMX™ technology defines a simple and flexible software model, with no new mode or
operating-system visible state. All existing software will continue to run correctly, without
modification, on IA processors that incorporate the MMX™ technology, even in the presence
of existing and new applications that incorporate this technology.
The following sections of this chapter describe the MMX™ technology’s basic programming
environment, including the MMX™ register set, data types, and instruction set. Detailed
descriptions of the MMX™ instructions are provided in Chapter 3, Instruction Set Reference, of
the Intel Architecture Software Developer’s Manual, Volume 2. The manner in which the
MMX™ technology is integrated into the IA system programming model is described in
Chapter 10, MMX™ Technology System Programming, in the Intel Architecture Software Devel-
oper’s Manual, Volume 3.
[Figure: MMX™ register set, eight 64-bit registers MM7 through MM0 (bits 63 to 0)]
The MMX™ instructions move the packed data types (packed bytes, packed words, or packed
doublewords) and the quadword data type to-and-from memory or to-and-from the IA general-
purpose registers in 64-bit blocks. However, when performing arithmetic or logical operations
on the packed data types, the MMX™ instructions operate in parallel on the individual bytes,
[Figure: packed-byte layout of a 64-bit quadword, Byte 7 (bits 63–56) through Byte 0 (bits 7–0)]
For example, when the result exceeds the data range limit for signed bytes, it is saturated to 7FH
(FFH for unsigned bytes). If a value is less than the data range limit, it is saturated to 80H for
signed bytes (00H for unsigned bytes).
Saturation provides the useful feature of avoiding wraparound artifacts. In the example of color
calculations, saturation causes a color to remain pure black or pure white without allowing for
an inversion.
MMX™ instructions do not indicate overflow or underflow occurrence by generating excep-
tions or setting flags.
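The difference between saturation and wraparound for byte elements can be sketched in Python as follows. The limits (7FH and 80H for signed bytes, FFH and 00H for unsigned bytes) are those given above; the function names are illustrative and model a single element, not a full packed operation.

```python
def saturate_signed_byte(value: int) -> int:
    """Clamp to the signed-byte range and return the 8-bit encoding (7FH max, 80H min)."""
    clamped = max(-128, min(127, value))
    return clamped & 0xFF

def saturate_unsigned_byte(value: int) -> int:
    """Clamp to the unsigned-byte range (FFH max, 00H min)."""
    return max(0x00, min(0xFF, value))

def wraparound_byte(value: int) -> int:
    """Keep only the low 8 bits, as wraparound arithmetic does."""
    return value & 0xFF
```

For example, adding the unsigned bytes 200 and 100 gives FFH with saturation but 2CH with wraparound, which in a color calculation would turn a near-white value into a dark one.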
implement a packed conditional move operation without a branch or a set of branch instructions.
No flags are set.
These instructions support packed byte, packed word and packed doubleword data types.
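A packed conditional move of this kind is commonly built from a comparison mask and logical operations. The Python sketch below models the idea on 64-bit integers, assuming that the comparison produces all ones (FFH) in each byte where the condition holds and all zeros elsewhere; the function names only loosely mirror instruction mnemonics and are not the instructions themselves.

```python
MASK64 = (1 << 64) - 1

def split_bytes(q: int):
    """Split a 64-bit value into eight bytes, Byte 0 first."""
    return [(q >> (8 * i)) & 0xFF for i in range(8)]

def join_bytes(bs) -> int:
    """Inverse of split_bytes."""
    return sum(b << (8 * i) for i, b in enumerate(bs))

def pcmpgtb(a: int, b: int) -> int:
    """Signed byte-wise greater-than: FFH where a > b, else 00H."""
    def signed(x):
        return x - 256 if x & 0x80 else x
    return join_bytes([0xFF if signed(x) > signed(y) else 0x00
                       for x, y in zip(split_bytes(a), split_bytes(b))])

def packed_select(mask: int, a: int, b: int) -> int:
    """Take bytes of a where mask is FFH and bytes of b elsewhere, with no branches."""
    return (mask & a) | (~mask & b & MASK64)

def packed_max_signed_bytes(a: int, b: int) -> int:
    """Byte-wise signed maximum built from a compare mask and AND/ANDN/OR."""
    return packed_select(pcmpgtb(a, b), a, b)
```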
Refer to Section 2.2., “Instruction Prefixes” in Chapter 2, Instruction Format of the Intel Archi-
tecture Software Developer’s Manual, Volume 2, for detailed information on prefixes.
NOTE
The CPUID instruction will continue to report the existence of the MMX™
technology if the CR0.EM bit is set (which signifies that the CPU is
configured to generate exception interrupt 7 that can be used to emulate
floating-point instructions). In this case, executing an MMX™ instruction
results in an invalid opcode exception.
Example 8-1 illustrates how to use the CPUID instruction. This example does not represent the
entire CPUID sequence, but shows the portion used for detection of MMX™ technology.
Example 8-1. Partial Routine for Detecting MMX™ Technology with the CPUID Instruction
; identify existence of CPUID instruction
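Separately from the assembly listing of Example 8-1, the final test of the detection sequence can be sketched in Python. This assumes that the feature flags have already been obtained by executing CPUID with EAX = 1 and that MMX™ technology is reported in bit 23 of EDX; the function name and parameter are hypothetical.

```python
MMX_FEATURE_BIT = 23   # assumed: EDX bit reported by CPUID with EAX = 1

def has_mmx(cpuid_1_edx: int) -> bool:
    """True if the feature-flags value in EDX reports MMX technology."""
    return bool((cpuid_1_edx >> MMX_FEATURE_BIT) & 1)
```

As the preceding note points out, a positive result does not guarantee that MMX™ instructions can execute, because the CR0.EM bit may be set.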
point tag word as empty. Therefore, it is imperative to use the EMMS instruction at the end of
every MMX™ routine, if the next routine may contain FPU code.
The EMMS instruction must be used in each of the following cases:
• When an application using the floating-point instructions calls an MMX™ technology
library/DLL. (Use the EMMS instruction at the end of the MMX™ code.)
• When an application using MMX™ instructions calls a floating-point library/DLL. (Use
the EMMS instruction before calling the floating-point code.)
• When a switch is made between MMX™ code in a task/thread and other tasks/threads in
cooperative operating systems, unless it is certain that more MMX™ instructions will be
executed before any FPU code.
If the EMMS instruction is not used when trying to execute a floating-point instruction, the
following may occur:
• Depending on the exception mask bits of the floating-point control word, a floating-point
exception event may be generated.
• A “soft exception” may occur. In this case floating-point code continues to execute, but
generates incorrect results. This happens when the floating-point exceptions are masked
and no visible exceptions occur. The internal exception handler (microcode, not user
visible) loads a NaN (Not a Number) with an exponent of 11..11B onto the floating-point
stack. The NaN is used for further calculations, yielding incorrect results.
• A potential error may occur only if the operating system does NOT manage floating-point
context across task switches. These operating systems are usually cooperative operating
systems. It is imperative that the EMMS instruction execute at the end of all the MMX™
routines that may enable a task switch immediately after they end execution (explicit yield
API or implicit yield API).
• The EMMS instruction is not required when mixing MMX™ technology instructions and
Streaming SIMD Extensions. Refer to Section 9.4., “Compatibility with FPU Archi-
tecture” in Chapter 9.4., Compatibility with FPU Architecture, of the Intel Architecture
Software Developer’s Manual, Volume 3, for more detailed information.
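The reason a missing EMMS instruction disturbs floating-point code can be modeled with the FPU tag word. The Python sketch below assumes that each register has a 2-bit tag (11B = empty), that MMX™ instructions other than EMMS mark all eight tags valid, and that EMMS marks them all empty; these details are not restated in this section, and the names are illustrative.

```python
TAG_ALL_VALID = 0x0000   # assumed state after an MMX instruction: every tag = 00B
TAG_ALL_EMPTY = 0xFFFF   # assumed state after EMMS: every tag = 11B (empty)

def tag(tag_word: int, reg: int) -> int:
    """Return the 2-bit tag of physical register reg (0 through 7)."""
    return (tag_word >> (2 * reg)) & 0b11

def fld_causes_stack_overflow(tag_word: int, top: int) -> bool:
    """A load decrements TOP and faults if the new top register is not empty."""
    new_top = (top - 1) & 7
    return tag(tag_word, new_top) != 0b11
```

In this model, a floating-point load issued immediately after MMX™ code without EMMS finds its destination register marked valid, which corresponds to the stack overflow condition described above.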
If a high-level language, such as C, is used, the data types could be defined as a 64-bit structure
with packed data types.
When using MMX™ instructions from high-level languages, other approaches
can be taken, such as:
• Passing parameters to an MMX™ routine by passing a pointer to a structure via the integer
stack.
• Returning a value from a function by returning the pointer to a structure.
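The idea of a 64-bit structure holding packed data can be sketched with Python's struct module, used here in place of a C structure. The example packs four signed 16-bit words into one 8-byte value in little-endian order; the function names are hypothetical, and a bytes object stands in for the memory that a pointer would reference.

```python
import struct

def pack_words(words) -> bytes:
    """Pack four signed 16-bit words into one 8-byte (64-bit) little-endian value."""
    if len(words) != 4:
        raise ValueError("a packed-word quadword holds exactly four words")
    return struct.pack("<4h", *words)

def unpack_words(quad: bytes):
    """Recover the four signed 16-bit words from an 8-byte value."""
    return list(struct.unpack("<4h", quad))
```

Passing such a buffer by reference, rather than by value, corresponds to the pointer-based parameter passing suggested above.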
If the application contains floating-point and MMX™ instructions, follow these guidelines:
• Partition the MMX™ technology module and the floating-point module into separate
instruction streams (separate loops or subroutines) so that they contain only instructions of
one type.
• Do not rely on register contents across transitions.
• When the MMX™ state is not required, empty the MMX™ state using the EMMS
instruction.
• Exit the floating-point code section with an empty stack.
FP_code 1:
..
.. (*leave the FPU stack empty*)
CHAPTER 9
PROGRAMMING WITH THE STREAMING SIMD
EXTENSIONS
The Intel Streaming SIMD Extensions comprise a set of extensions to the Intel Architecture (IA)
that is designed to greatly enhance the performance of advanced media and communications
applications. These extensions (which include new registers, data types, and instructions) are
combined with a single-instruction, multiple-data (SIMD) execution model to accelerate the
performance of applications. Applications that typically use compute-intensive algorithms to
perform repetitive operations on large arrays of simple, native data elements benefit the most.
Applications that require regular access to large amounts of data also benefit from the Streaming
SIMD Extensions’ prefetching and streaming store capabilities.
Examples of these types of applications include:
• motion video
• combined graphics with video
• image processing
• audio synthesis
• speech recognition, synthesis, and compression
• telephony
• video conferencing
• 2D and 3D graphics.
The Streaming SIMD Extensions define a simple and flexible software model. This new model
introduces a new operating-system visible state. To enhance performance and yield more
concurrent execution, a new set of registers has been added. All existing software will continue
to run correctly without modification on IA processors that incorporate the Streaming SIMD
Extensions, even in the presence of existing and new applications that incorporate this tech-
nology.
The following sections of this chapter describe the Streaming SIMD Extensions’ basic program-
ming environment, including the SIMD floating-point register set, data types, and instruction
set. Detailed descriptions of the Streaming SIMD Extensions are provided in Chapter 3, Instruc-
tion Set Reference, of the Intel Architecture Software Developer’s Manual, Volume 2. The
manner in which the Streaming SIMD Extensions are integrated into the IA system program-
ming model is described in Chapter 10, MMX™ Technology System Programming, in the Intel
Architecture Software Developer’s Manual, Volume 3.
[Figure: SIMD floating-point registers XMM7 through XMM0]
MMX™ registers are mapped onto the floating-point registers. Transitioning from MMX™
operations to floating-point operations requires executing the EMMS instruction. Since SIMD
floating-point registers are a separate register file, MMX™ instructions and floating-point
instructions can be mixed with Streaming SIMD Extensions without execution of a special
instruction such as EMMS.
[Figure: Packed Single-FP data type. Four 32-bit fields occupy bits 127-96, 95-64, 63-32, and 31-0.]
processor operates on these values, refer to Section 7.2., “Real Numbers and Floating-Point
Formats” in Chapter 7, Floating-Point Unit.
[Figure: 128-bit data layout, Byte 15 (leftmost) through Byte 0 (rightmost)]
Table 9-2 shows the encodings for all the classes of real numbers (that is, zero, denormalized-
finite, normalized-finite, and ∞) and NaNs for the single-real data-type. It also gives the format
for the real indefinite value, which is a QNaN encoding that is generated by several Streaming
SIMD Extensions in response to a masked, floating-point, invalid operation exception.
When storing real values in memory, single-real values are stored in 4 consecutive bytes in
memory. The 128-bit access mode is used for 128-bit memory accesses, 128-bit transfers
between SIMD floating-point registers, and all logical, unpack, and packed arithmetic
instructions. The 32-bit access mode is used for 32-bit memory accesses, 32-bit transfers
between SIMD floating-point registers, and all scalar arithmetic instructions.
NOTES:
1. Integer bit is implied and not stored for single-real and double-real formats.
2. The fraction for SNaN encodings must be non-zero.
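Bit(s)   Field      Description
31-16    Reserved
15       FZ         Flush-To-Zero
14-13    RC         Rounding control
12       PM         Precision mask
11       UM         Underflow mask
10       OM         Overflow mask
9        ZM         Divide-by-zero mask
8        DM         Denormal operand mask
7        IM         Invalid operation mask
6        Rsvd       Reserved
5        PE         Precision flag
4        UE         Underflow flag
3        OE         Overflow flag
2        ZE         Divide-by-zero flag
1        DE         Denormal operand flag
0        IE         Invalid operation flag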
Bits 5-0 indicate whether a SIMD floating-point numerical exception has been detected. They
are “sticky” flags, and can be cleared by using the LDMXCSR instruction to write zeroes to
these fields. If an LDMXCSR instruction clears a mask bit and sets the corresponding exception
flag bit, an exception will not be immediately generated. The exception will occur only upon the
next Streaming SIMD Extensions instruction that causes this type of exception. Streaming SIMD Extensions
use only one exception flag for each exception. There is no provision for individual exception
reporting within a packed data type. In situations where multiple identical exceptions occur
within the same instruction, the associated exception flag is updated and indicates that at least
one of these conditions happened. These flags are cleared upon reset.
Bits 12-7 configure numerical exception masking; an exception type is masked if the
corresponding bit is set, and it is unmasked if the bit is clear. These mask bits are set upon reset,
meaning that all numerical exceptions are masked.
Bits 14-13 encode the rounding control, which provides for the common round to nearest mode,
as well as directed rounding and true chop (refer to Section 9.1.8., “Rounding Control Field”).
The rounding control is set to round to nearest upon reset.
Bit 15 (FZ) is used to turn on the Flush-To-Zero mode (refer to Section 9.1.9., “Flush-To-Zero”).
This bit is cleared upon reset, disabling the Flush-To-Zero mode.
The other bits of MXCSR (bits 31-16 and bit 6) are defined as reserved and cleared; attempting
to write a non-zero value to these bits, using either the FXRSTOR or LDMXCSR instructions,
will result in a general protection exception.
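The field positions described above can be captured in a few C helpers that operate on a 32-bit image of MXCSR. This is only a sketch: all macro and function names are hypothetical, the value `MXCSR_RESET` is derived from the reset behavior stated in the text (all masks set, everything else clear), and actually loading or storing the register requires LDMXCSR/STMXCSR, which is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_FLAGS_MASK  0x0000003Fu  /* bits 5-0: sticky exception flags */
#define MXCSR_RESERVED_6  0x00000040u  /* bit 6: reserved                  */
#define MXCSR_MASKS_MASK  0x00001F80u  /* bits 12-7: exception masks       */
#define MXCSR_RC_SHIFT    13           /* bits 14-13: rounding control     */
#define MXCSR_FZ          0x00008000u  /* bit 15: Flush-To-Zero            */
#define MXCSR_RESERVED_HI 0xFFFF0000u  /* bits 31-16: reserved             */

/* Value implied by the reset description: all exceptions masked,
   flags clear, rounding control and FZ clear. */
#define MXCSR_RESET MXCSR_MASKS_MASK

unsigned mxcsr_flags(uint32_t v) { return v & MXCSR_FLAGS_MASK; }
unsigned mxcsr_masks(uint32_t v) { return (v & MXCSR_MASKS_MASK) >> 7; }
unsigned mxcsr_rc(uint32_t v)    { return (v >> MXCSR_RC_SHIFT) & 3u; }
int      mxcsr_fz(uint32_t v)    { return (v & MXCSR_FZ) != 0; }

/* Nonzero if writing this image would set a reserved bit, which the
   text says results in a general protection exception. */
int mxcsr_write_faults(uint32_t v)
{
    return (v & (MXCSR_RESERVED_6 | MXCSR_RESERVED_HI)) != 0;
}

/* Image with the sticky flags written back as zeroes. */
uint32_t mxcsr_clear_flags(uint32_t v) { return v & ~MXCSR_FLAGS_MASK; }

/* Image with the 2-bit rounding control field replaced. */
uint32_t mxcsr_set_rc(uint32_t v, unsigned rc)
{
    assert(rc <= 3u);
    return (v & ~(3u << MXCSR_RC_SHIFT)) | (rc << MXCSR_RC_SHIFT);
}
```

Checking an image with `mxcsr_write_faults` before handing it to LDMXCSR is one way to avoid the general protection exception described above.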
The round up and round down modes are termed directed rounding and can be used to imple-
ment interval arithmetic. Interval arithmetic is used to determine upper and lower bounds for the
true result of a multistep computation, when the intermediate results of the computation are
subject to rounding.
The round toward zero mode (sometimes called the “chop” mode) is commonly used when
performing integer arithmetic with the processor.
9.1.9. Flush-To-Zero
Turning on the Flush-To-Zero mode has the following effects during underflow situations:
• Zero results are returned with the sign of the true result
• Precision and underflow exception flags are set
The IEEE mandated masked response to underflow is to deliver the denormalized result (i.e.,
gradual underflow); consequently, the Flush-To-Zero mode is not compatible with IEEE Stan-
dard 754. It is provided primarily for performance reasons. At the cost of a slight precision loss,
faster execution can be achieved for applications where underflows are common. Underflow for
Flush-To-Zero is defined to occur when the exponent for a computed result, prior to denormal-
ization scaling, falls in the denormal range; this is regardless of whether a loss of accuracy has
occurred. Unmasking the underflow exception takes precedence over Flush-To-Zero mode; this
means that an exception handler will be invoked for a Streaming SIMD Extensions instruction
that generates an underflow condition while this exception is unmasked, regardless of whether
Flush-To-Zero is enabled.
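The masked Flush-To-Zero response can be illustrated on raw single-real bit patterns. The sketch below is a simplified model, not the processor's algorithm: it assumes the standard single-real layout (sign in bit 31, exponent in bits 30-23, fraction in bits 22-0), assumes the underflow exception is masked, and detects the denormal range on the final encoding rather than on the result prior to denormalization scaling. The function name and the flag macros (UE at bit 4 and PE at bit 5 of MXCSR) are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SR_SIGN     0x80000000u  /* bit 31      */
#define SR_EXPONENT 0x7F800000u  /* bits 30-23  */
#define SR_FRACTION 0x007FFFFFu  /* bits 22-0   */

#define FLAG_UE 0x10u            /* underflow flag, MXCSR bit 4 */
#define FLAG_PE 0x20u            /* precision flag, MXCSR bit 5 */

/* Returns the value delivered with Flush-To-Zero on: a denormal
   result becomes a zero carrying the sign of the true result, and
   the precision and underflow flags are set in *flags. */
uint32_t flush_to_zero_model(uint32_t result_bits, uint32_t *flags)
{
    assert(flags != NULL);
    int is_denormal = (result_bits & SR_EXPONENT) == 0 &&
                      (result_bits & SR_FRACTION) != 0;
    if (!is_denormal)
        return result_bits;          /* normal values pass through */
    *flags |= FLAG_UE | FLAG_PE;
    return result_bits & SR_SIGN;    /* +0 or -0 */
}
```

For example, the smallest negative denormal (80000001H) is delivered as negative zero (80000000H) with both flags set, whereas 1.0 (3F800000H) passes through unchanged.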
The MOVLPS (Move unaligned, low packed, single-precision, floating-point) instruction trans-
fers 64 bits of packed data from memory to the lower two fields of a SIMD floating-point
register and vice versa. The upper two fields are left unchanged.
The MOVMSKPS (Move mask packed, single-precision, floating-point) instruction transfers
the most significant bit of each of the four, packed, single-precision, floating-point numbers to
an IA integer register. This 4-bit value can then be used as a condition to perform branching.
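The effect of MOVMSKPS can be modeled in portable C as follows. This is a sketch under stated assumptions: `movmskps_model` is a hypothetical name, `float` is assumed to be the 32-bit single-real format, and element 0 of the array stands for the least significant field of the register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Collects the most significant (sign) bit of each of four
   single-precision values into bits 3-0 of the result. */
unsigned movmskps_model(const float v[4])
{
    unsigned mask = 0;
    assert(v != NULL);
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits);  /* reinterpret, no conversion */
        mask |= (unsigned)(bits >> 31) << i;
    }
    return mask;
}
```

A caller can then branch on the mask, for instance testing `mask != 0` to detect whether any of the four fields is negative (including negative zero).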
The MOVSS (Move scalar single-precision, floating-point) instruction transfers the least
significant 32 bits between memory and a SIMD floating-point register, or between two
registers.
The SQRTSS (Square root scalar single-precision, floating-point) instruction returns the square
root of the least significant component of the packed, single-precision, floating-point numbers
from source to a destination register; the upper three fields are passed through from the source
operand.
point numbers, and sets the ZF, PF, and CF bits in the EFLAGS register as described above (the
OF, SF, and AF bits are cleared).
The ANDNPS (Bit-wise packed logical AND NOT for single-precision, floating-point) instruc-
tion returns a bitwise AND NOT between the two operands.
The ORPS (Bit-wise packed logical OR for single-precision, floating-point) instruction returns
a bitwise OR between the two operands.
The XORPS (Bit-wise packed logical XOR for single-precision, floating-point) instruction
returns a bitwise XOR between the two operands.
The PSHUFW (Shuffle packed integer word in MMX™ register) instruction performs a full
shuffle of any source word field to any result word field, using an 8-bit immediate operand.
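A small C model may make the role of the immediate operand concrete. It assumes the encoding given in the instruction reference, in which each 2-bit field of the immediate selects the source word for one result word (bits 1-0 for word 0, bits 3-2 for word 1, and so on); the function name `pshufw_model` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst word i receives the source word selected by bits (2i+1):(2i)
   of imm8. A temporary copy makes the model safe when dst == src. */
void pshufw_model(uint16_t dst[4], const uint16_t src[4], uint8_t imm8)
{
    uint16_t tmp[4];
    assert(dst != NULL && src != NULL);
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3];
    for (int i = 0; i < 4; i++)
        dst[i] = tmp[i];
}
```

Under this encoding, an immediate of E4H leaves the word order unchanged, 1BH reverses the four words, and 00H broadcasts word 0 to every field.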
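[Figure: shuffle operation. Source operands X4 X3 X2 X1 and Y4 Y3 Y2 Y1; the result fields, from left to right, are selected from {Y4 ... Y1}, {Y4 ... Y1}, {X4 ... X1}, and {X4 ... X1}.]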
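[Figure: unpack high operation. Source operands X4 X3 X2 X1 and Y4 Y3 Y2 Y1; result Y4 X4 Y3 X3.]
[Figure: unpack low operation. Source operands X4 X3 X2 X1 and Y4 Y3 Y2 Y1; result Y2 X2 Y1 X1.]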
Floating-Point Control and Status Register) instruction stores the Streaming SIMD Extensions
control and status word to memory.
The FXSAVE instruction saves FP and MMX™ state and SIMD floating-point state to memory.
Unlike FSAVE, FXSAVE does not clear the x87-FP state. FXRSTOR loads FP and MMX™
state and SIMD floating-point state from memory.
For more information on prefetch hints, refer to Section 9.5.3.1., “Cacheability Hint Instruc-
tions”. For even more detailed information, refer to Chapter 6, “Optimizing Cache Utilization
for Pentium® III Processors”, in the Intel Architecture Optimization Reference Manual (Order
Number 245127-001).
The SFENCE (Store Fence) instruction guarantees that every store instruction that precedes the
store fence instruction in program order is globally visible before any store instruction that
follows the fence. The SFENCE instruction provides an efficient way of ensuring ordering
between routines that produce weakly-ordered results and routines that consume this data.
The use of weakly-ordered memory types can be important under certain data sharing relation-
ships, such as a producer-consumer relationship. The use of weakly-ordered memory can make
the assembling of data more efficient, but care must be taken to ensure that the consumer obtains
the data that the producer intended it to see.
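The producer-consumer relationship can be sketched in portable C11. This is an analogy rather than an implementation of SFENCE: the release fence from `<stdatomic.h>` stands where the store fence would go, and for the weakly-ordered stores the text describes, an actual SFENCE instruction would be needed at that point. The buffer, the flag, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

static uint32_t   shared_buffer[4];  /* data assembled by the producer */
static atomic_int data_ready;        /* 0 until the data is published  */

void produce(const uint32_t values[4])
{
    assert(values != NULL);
    for (int i = 0; i < 4; i++)
        shared_buffer[i] = values[i];
    /* Stand-in for SFENCE: the data stores above must be visible
       before the flag store below. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&data_ready, 1, memory_order_relaxed);
}

/* Returns 1 and copies the data if it has been published, else 0. */
int consume(uint32_t out[4])
{
    assert(out != NULL);
    if (!atomic_load_explicit(&data_ready, memory_order_acquire))
        return 0;
    for (int i = 0; i < 4; i++)
        out[i] = shared_buffer[i];
    return 1;
}
```

The essential point is the ordering: the consumer looks at the buffer only after it has observed the flag, and the producer sets the flag only after the fence.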
For full details on how to determine what support is present for the Streaming SIMD Extensions,
please refer to the Intel Processor Identification and the CPUID Instruction Application Note
(AP-485), order number 241618-008.
tion is not necessary when integrating a Streaming SIMD Extensions module with existing
MMX™ technology modules or existing x87-FP modules. Streaming SIMD Extensions also do
not affect the floating-point tag word (FTW), floating-point control word (FCW), floating-point
status word (FSW) or floating-point exception state (FIP, FOP, FCS, FDS and FDP).
The SIMD integer instructions that are included in Streaming SIMD Extensions behave identi-
cally to original MMX™ instructions, in the presence of x87-FP instructions; this includes:
• Transition from x87-FP to MMX™ technology (TOS=0, FP valid bits set to all valid).
• MMX™ instructions write ones (1s) to the exponent part of the corresponding x87-FP
register.
• Use of EMMS for transition from MMX™ technology to x87-FP.
The Streaming SIMD Extensions that follow this behavior are: CVTPI2PS, CVTPS2PI,
CVTTPS2PI, MASKMOVQ, MOVNTQ, PEXTRW, PINSRW, PMOVMSKB, PMULHUW,
PSHUFW.
The PREFETCH instruction does not change the user-visible semantics of a program, although
it may affect the performance of a program. The operation of this instruction is implementation-
dependent and can be overloaded to a subset of the hints (for example, T0, T1, and T2 may have
the same behavior) or altogether ignored by an implementation. Programmers will have to
tune their applications for each implementation to take advantage of these instructions. These
instructions do not generate exceptions or faults. Excessive usage of prefetch instructions may
be throttled by the processor. For more detailed information on prefetch hints, refer to Chapter
6, “Optimizing Cache Utilization for Pentium® III Processors”, in the Intel Architecture Opti-
mization Reference Manual (Order Number 245127-001).
Some common usage models that may be affected in this way by weakly-ordered stores are:
• library functions, which use weakly-ordered memory to write results
• compiler-generated code, which also benefits from writing weakly-ordered results
• hand-crafted code
The degree to which a consumer of data knows that the data is weakly-ordered can vary for these
cases. As a result, the SFENCE instruction should be used to ensure ordering between routines
that produce weakly-ordered data and routines that consume this data. The SFENCE instruction
provides a performance-efficient way to ensure ordering, by guaranteeing that every store
instruction that precedes the store fence instruction in program order is globally visible before
any store instruction that follows the fence.
done by checking the CPUID.FXSR bit; for a processor that does implement Streaming SIMD
Extensions, use the approach described in Section 9.5.1., “Detecting Support for Streaming
SIMD Extensions Using the CPUID Instruction”. For even more detailed information, refer to
the Intel Processor Identification and the CPUID Instruction Application Note (AP-485), order
number 241618-008 and Identifying Support for Streaming SIMD Extensions in the Processor
and Operating System (AP-900).
The operating systems can be classified into two types:
• Cooperative multitasking operating systems
• Preemptive multitasking operating systems
in the Processor and Operating System (AP-900).) If an application unmasks exceptions using
either FXRSTOR or LDMXCSR without the required OS support being enabled, an invalid
opcode fault, instead of a SIMD floating-point exception, will be generated on the first faulting
Streaming SIMD Extensions.
SIMD floating-point numeric exceptions are precise and occur as soon as the instruction
completes execution. They will not catch pending x87 floating-point exceptions and will not
cause assertion of FERR# (independent of the value of CR0.NE). In addition, they ignore the
assertion/de-assertion of IGNNE#.
For more details on SIMD floating-point exceptions and exception handlers, refer to Section
4.4., “Interrupts and Exceptions”, in Chapter 4, Procedure Calls, Interrupts, and Exceptions,
Appendix D, SIMD Floating-Point Exceptions Summary, and Appendix E, Guidelines for
Writing FPU Exceptions Handlers.
CHAPTER 10
INPUT/OUTPUT
In addition to transferring data to and from external memory, Intel Architecture (IA) processors
can also transfer data to and from input/output ports (I/O ports). I/O ports are created in system
hardware by circuitry that decodes the control, data, and address pins on the processor. These I/O
ports are then configured to communicate with peripheral devices. An I/O port can be an input
port, an output port, or a bidirectional port. Some I/O ports are used for transmitting data, such
as to and from the transmit and receive registers, respectively, of a serial interface device. Other
I/O ports are used to control peripheral devices, such as the control registers of a disk controller.
This chapter describes the processor’s I/O architecture. The topics discussed include:
• I/O port addressing.
• I/O instructions.
• I/O protection mechanism.
[Figure: memory-mapped I/O. Physical memory up to FFFF FFFFH, containing EPROM, I/O ports, and RAM.]
All the IA processors that have on-chip caches also provide the PCD (page-level cache disable)
flag in page table and page directory entries. This flag allows caching to be disabled on a page-
by-page basis. Refer to Section 3.6.4., “Page-Directory and Page-Table Entries” in Chapter 3,
Protected-Mode Memory Management, in the Intel Architecture Software Developer’s Manual,
Volume 3.
When used with one of the repeat prefixes (such as REP), the INS and OUTS instructions
perform string (or block) input or output operations. The repeat prefix REP modifies the INS
and OUTS instructions to transfer blocks of data between an I/O port and memory. Here, the ESI
or EDI register is incremented or decremented (according to the setting of the DF flag in the
EFLAGS register) after each byte, word, or doubleword is transferred between the selected I/O
port and memory.
Refer to the individual references for the IN, INS, OUT, and OUTS instructions in Chapter 3,
Instruction Set Reference, of the Intel Architecture Software Developer’s Manual, Volume 2, for
more information on these instructions.
exception (#GP) being signaled. Because each task has its own copy of the EFLAGS register,
each task can have a different IOPL.
The I/O permission bit map in the TSS can be used to modify the effect of the IOPL on I/O sensi-
tive instructions, allowing access to some I/O ports by less privileged programs or tasks (refer
to Section 10.5.2.).
A program or task can change its IOPL only with the POPF and IRET instructions; however,
such changes are privileged. No procedure may change the current IOPL unless it is running at
privilege level 0. An attempt by a less privileged procedure to change the IOPL does not result
in an exception; the IOPL simply remains unchanged.
The POPF instruction also may be used to change the state of the IF flag (as can the CLI and
STI instructions); however, the POPF instruction in this case is also I/O sensitive. A procedure
may use the POPF instruction to change the setting of the IF flag only if the CPL is less than or
equal to the current IOPL. An attempt by a less privileged procedure to change the IF flag does
not result in an exception; the IF flag simply remains unchanged.
Because each task has its own TSS, each task has its own I/O permission bit map. Access to indi-
vidual I/O ports can thus be granted to individual tasks.
If in protected mode and the CPL is less than or equal to the current IOPL, the processor allows
all I/O operations to proceed. If the CPL is greater than the IOPL or if the processor is operating
in virtual-8086 mode, the processor checks the I/O permission bit map to determine if access to
a particular I/O port is allowed. Each bit in the map corresponds to an I/O port byte address. For
example, the control bit for I/O port address 29H in the I/O address space is found at bit position
1 of the sixth byte in the bit map. Before granting I/O access, the processor tests all the bits
corresponding to the I/O port being addressed. For a doubleword access, for example, the processor
tests the four bits corresponding to the four adjacent 8-bit port addresses. If any tested bit is set,
a general-protection exception (#GP) is signaled. If all tested bits are clear, the I/O operation is
allowed to proceed.
Because I/O port addresses are not necessarily aligned to word and doubleword boundaries, the
processor reads two bytes from the I/O permission bit map for every access to an I/O port. To
prevent exceptions from being generated when the ports with the highest addresses are accessed,
an extra byte needs to be included in the TSS immediately after the table. This byte must have all
of its bits set, and it must be within the segment limit.
It is not necessary for the I/O permission bit map to represent all the I/O addresses. I/O addresses
not spanned by the map are treated as if they had set bits in the map. For example, if the TSS
segment limit is 10 bytes past the bit-map base address, the map has 11 bytes and the first 80 I/O
ports are mapped. Higher addresses in the I/O address space generate exceptions.
If the I/O bit map base address is greater than or equal to the TSS segment limit, there is no I/O
permission map, and all I/O instructions generate exceptions when the CPL is greater than the
current IOPL. The I/O bit map base address must be less than or equal to DFFFH.
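The bit-map lookup described in this section can be sketched in C. This is a software model only, with hypothetical names: `map` points at the first byte of the I/O permission bit map, and `map_bytes` counts every byte within the TSS segment limit, including the trailing byte whose bits are all set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the access would be allowed, 0 if it would raise #GP.
   width is the access size in bytes (1, 2, or 4). */
int io_access_allowed(const uint8_t *map, size_t map_bytes,
                      uint16_t port, unsigned width)
{
    assert(map != NULL);
    assert(width == 1 || width == 2 || width == 4);
    for (unsigned i = 0; i < width; i++) {
        uint32_t p = (uint32_t)port + i;  /* each adjacent 8-bit port */
        size_t byte = p / 8;              /* port 29H -> sixth byte   */
        unsigned bit = p % 8;             /* port 29H -> bit 1        */
        if (byte >= map_bytes)
            return 0;   /* not spanned by the map: treated as set */
        if (map[byte] & (1u << bit))
            return 0;   /* a set bit denies the access */
    }
    return 1;           /* all tested bits are clear */
}
```

With an 11-byte map whose last byte is FFH, ports 0 through 79 are governed by the first ten bytes and every higher port is denied, which matches the 80-port example above.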
order. Refer to Chapter 9, Memory Cache Control, in the Intel Architecture Software Devel-
oper’s Manual, Volume 3, for more information on using MTRRs.
Another method of enforcing program order is to insert one of the serializing instructions, such
as the CPUID instruction, between operations. Refer to Chapter 7, Multiple-Processor Manage-
ment, in the Intel Architecture Software Developer’s Manual, Volume 3, for more information on
serialization of instructions.
It should be noted that the chip set being used to support the processor (bus controller, memory
controller, and/or I/O controller) may post writes to uncacheable memory which can lead to out-
of-order execution of memory accesses. In situations where out-of-order processing of memory
accesses by the chip set can potentially cause faulty memory-mapped I/O processing, code must
be written to force synchronization and ordering of I/O operations. Serializing instructions can
often be used for this purpose.
When the I/O address space is used instead of memory-mapped I/O, the situation is different in
two respects:
• The processor never buffers I/O writes. Therefore, strict ordering of I/O operations is
enforced by the processor. (As with memory-mapped I/O, it is possible for a chip set to
post writes in certain I/O ranges.)
• The processor synchronizes I/O instruction execution with external bus activity (refer to
Table 10-1).
IN Yes Yes
INS Yes Yes
REP INS Yes Yes
OUT Yes Yes Yes
OUTS Yes Yes Yes
REP OUTS Yes Yes Yes
CHAPTER 11
PROCESSOR IDENTIFICATION AND FEATURE
DETERMINATION
When writing software intended to run on several different types of Intel Architecture (IA)
processors, it is generally necessary to identify the type of processor present in a system and the
processor features that are available to an application. This chapter describes how to identify the
processor that is executing the code and determine the features the processor supports. It also
shows how to determine if an FPU or NPX is present. For more information about processor
identification and supported features, refer to the following documents:
• AP-485, Intel Processor Identification and the CPUID Instruction
• For a complete list of the features that are available for the different IA processors, refer to
Chapter 18, Intel Architecture Compatibility of the Intel Architecture Software Developer’s
Manual, Volume 3: System Programming Guide.
AP-485, Intel Processor Identification and the CPUID Instruction (Order Number 241618),
provides additional information and example source code for use in identifying IA processors.
It also contains guidelines for using the CPUID instruction to help maintain the widest range of
software compatibility. The following guidelines are among the most important, and should
always be followed when using the CPUID instruction to determine available features:
• Always begin by testing for the “GenuineIntel” message in the EBX, EDX, and ECX
registers when the CPUID instruction is executed with EAX equal to 0. If the processor is
not genuine Intel, the feature identification flags may have different meanings than are
described in “CPUID—CPU Identification” in Chapter 3, Instruction Set Reference of the
Intel Architecture Software Developer’s Manual, Volume 2.
• Do not assume a value of 1 in a feature identification flag indicates that a given feature is
present. For future feature identification flags, a value of 1 may indicate that the specific
feature is not present.
• Test feature identification flags individually and do not make assumptions about undefined
bits.
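The first guideline can be illustrated with a short C sketch that assembles the vendor message from the three register values. It assumes the values have already been obtained by executing CPUID with EAX equal to 0 (for example through inline assembly or a compiler intrinsic, not shown), and that each register holds four characters with the first character in its least significant byte. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Builds the 12-character vendor message in EBX, EDX, ECX order. */
void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx,
                         char out[13])
{
    const uint32_t regs[3] = { ebx, edx, ecx };
    assert(out != NULL);
    for (int r = 0; r < 3; r++)
        for (int b = 0; b < 4; b++)
            out[4 * r + b] = (char)((regs[r] >> (8 * b)) & 0xFFu);
    out[12] = '\0';
}

/* Nonzero only if the registers spell "GenuineIntel". */
int is_genuine_intel(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    char vendor[13];
    cpuid_vendor_string(ebx, edx, ecx, vendor);
    return strcmp(vendor, "GenuineIntel") == 0;
}
```

Feature identification flags would be interpreted only after this check succeeds, and then tested one bit at a time as the remaining guidelines recommend.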
Note that the CPUID instruction will cause the invalid opcode exception (#UD) if executed on
a processor that does not support it. The CPUID instruction application note provides a code
sequence to test the validity of the CPUID instruction. Also, this test code (for CPUID valid) is
not reliable when executed in virtual-8086 mode. To avoid this, if the test code is written to run
in real-address mode, the SMSW instruction must be used to read the PE bit from the MSW
(lower half of CR0). If the PE flag is set to 1, the Real Mode code is actually being executed in
virtual-8086 mode, and the test sequence cannot be guaranteed to return reliable information.
(Note that the new version of the CPUID application note (AP-485, Intel Processor Identifica-
tion and the CPUID Instruction (Order Number 241618-005)), explains this virtual-8086
problem, but the older versions of the application note do not.)
11.2. IDENTIFICATION OF EARLIER INTEL ARCHITECTURE
PROCESSORS
The CPUID instruction is only available in the Pentium® Pro, Pentium®, and recent Intel486™
processors. For the earlier IA processors (including the earlier Intel486™ processors), several
other architectural features can be exploited to identify the processor.
The settings of bits 12 and 13 (IOPL), 14 (NT), and 15 (reserved) in the EFLAGS register (refer
to Figure 3-7, Section 3.6.3., “EFLAGS Register”, in Chapter 3, Basic Execution Environment)
are different for Intel’s 32-bit processors than for the Intel 8086 and Intel 286 processors. By
examining the settings of these bits (with the PUSHF/PUSHFD and POPF/POPFD instructions),
an application program can determine whether the processor is an 8086, Intel 286, or one of the
Intel 32-bit processors:
• 8086 processor — Bits 12 through 15 of the EFLAGS register are always set.
• Intel 286 processor — Bits 12 through 15 are always clear in real-address mode.
• 32-bit processors — In real-address mode, bit 15 is always clear and bits 12 through 14
have the last value loaded into them. In protected mode, bit 15 is always clear, bit 14 has
the last value loaded into it, and the IOPL bits depend on the current privilege level
(CPL). The IOPL field can be changed only if the CPL is 0.
Other EFLAGS register bits can be used to differentiate between the 32-bit processors:
• Bit 18 (AC) — Implemented only on the Pentium® Pro, Pentium®, and Intel486™
processors. The inability to set or clear this bit distinguishes an Intel386™ processor from
the other Intel 32-bit processors.
• Bit 21 (ID) — Determines if the processor is able to execute the CPUID instruction. The
ability to set and clear this bit indicates that the processor is a Pentium® Pro, Pentium®, or
later version Intel486™ processor.
To determine whether an FPU or NPX is present in a system, applications can write to the
FPU/NPX status and control registers using the FNINIT instruction and then verify the correct
values are read back using the FNSTENV instruction.
After determining that an FPU or NPX is present, its type can then be determined. In most cases,
the processor type will determine the type of FPU or NPX; however, an Intel386™ processor is
compatible with either an Intel 287 or Intel 387 math coprocessor. The method the coprocessor
uses to represent ∞ (after the execution of the FINIT, FNINIT, or RESET instruction) indicates
which coprocessor is present. The Intel 287 math coprocessor uses the same bit representation
for +∞ and −∞; whereas, the Intel 387 math coprocessor uses different representations for +∞
and −∞.
EBX Reserved
ECX Reserved
Refer to the CPUID application note, AP-485, for details on cache information. AP-485 is
available from the following web site: http://developer.intel.com/design/pro/applnots/ap485.htm.
In addition, the following two new cache descriptors are defined for P6-family processors with
Model > 3:
Descriptor   Cache Description
44h          1M L2 cache, 4-way set associative, 32-byte line size
45h          2M L2 cache, 4-way set associative, 32-byte line size
“3” in the Model ID field. Future P6-family processors are indicated by a “6” in the Family ID
and a value greater than “3” in the Model ID field.
Bits     Field
31-12    Reserved (0)
11-08    Family ID
07-04    Model ID
03-00    Stepping ID
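Extracting the Family ID, Model ID, and Stepping ID fields from the value returned in EAX is a matter of shifts and masks, as the following C sketch shows. The structure and function names are hypothetical, and the input is assumed to come from a CPUID execution that is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned family;    /* bits 11-08 */
    unsigned model;     /* bits 07-04 */
    unsigned stepping;  /* bits 03-00 */
} cpu_signature;

void decode_signature(uint32_t eax, cpu_signature *out)
{
    assert(out != NULL);
    out->family   = (eax >> 8) & 0xFu;
    out->model    = (eax >> 4) & 0xFu;
    out->stepping = eax & 0xFu;
}

/* Family ID of 6 and a Model ID greater than 3, the combination the
   text associates with later P6-family processors. */
int is_p6_model_above_3(uint32_t eax)
{
    cpu_signature sig;
    decode_signature(eax, &sig);
    return sig.family == 6 && sig.model > 3;
}
```

For a hypothetical signature value of 0672H, the fields decode to family 6, model 7, and stepping 2.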
Bit(s)   Feature flag
31-26    Reserved (0)
25       XMM
24       FXSR
23       MMX
22-19    rsvd
18       PN
17       PSE-36
16       PAT
15       CMOV
14       MCA
13       PGE
12       MTRR
11       SEP
10       rsvd
09       APIC
08       CX8
07       MCE
06       PAE
05       MSR
04       TSC
03       PSE
02       DE
01       VME
00       FPU
Table 11-2. New P6-Family Processor Feature Information Returned by CPUID in EDX
Bit     Feature   Value   Description                 Notes
11      SEP       1       Fast System Call            Indicates whether the processor supports the Fast
                                                      System Call instructions SYSENTER and SYSEXIT.
16      PAT       1       Page Attribute Table        Indicates whether the processor supports the Page
                                                      Attribute Table. This feature augments the Memory
                                                      Type Range Registers (MTRRs), allowing an
                                                      operating system to specify attributes of memory on a
                                                      page granularity through a linear address.
17      PSE-36    1       36-bit Page Size            Indicates whether the processor supports 4-MB pages
                          Extension                   that are capable of addressing physical memory
                                                      beyond 4 GB. This feature indicates that the upper
                                                      four bits of the physical address of the 4-MB page are
                                                      encoded by bits 13-16 of the page directory entry.
18      PN        1       Processor Number            Indicates whether the processor supports the 96-bit
                                                      Processor Number feature.
19-22   rsvd      0       Reserved                    These bits are reserved for future use. The contents of
                                                      these fields are not defined and should not be relied
                                                      upon or altered.
23      MMX       1       MMX technology              Indicates whether the processor supports the MMX
                                                      technology instruction set and architecture.
24      FXSR      1       Fast floating-point         Indicates whether the processor supports the
                          save and restore            FXSAVE and FXRSTOR instructions for fast save and
                                                      restore of the floating-point context. Presence of this
                                                      bit also indicates that CR4.OSFXSR is available,
                                                      allowing an operating system to indicate that it uses
                                                      the fast save/restore instructions.
25      XMM       1       Streaming SIMD              Indicates whether the processor supports the
                          Extensions                  Streaming SIMD Extensions instruction set.
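The feature bits in this table can be tested individually with simple masks. The C sketch below uses hypothetical names and assumes that the EDX value comes from CPUID and that a CR4 image is supplied by system-level code (applications cannot read CR4 directly); the OSFXSR position, bit 9, is taken from the CR4 layout given in this chapter.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID_EDX_MMX   (1u << 23)
#define CPUID_EDX_FXSR  (1u << 24)
#define CPUID_EDX_XMM   (1u << 25)
#define CR4_OSFXSR      (1u << 9)

/* Tests a single feature identification flag by bit number. */
int cpuid_feature(uint32_t edx, unsigned bit)
{
    assert(bit < 32);
    return (int)((edx >> bit) & 1u);
}

/* Processor reports FXSR and XMM, and the operating system has
   indicated through CR4.OSFXSR that it uses FXSAVE/FXRSTOR. */
int sse_usable(uint32_t cpuid_edx, uint32_t cr4)
{
    return (cpuid_edx & CPUID_EDX_FXSR) != 0 &&
           (cpuid_edx & CPUID_EDX_XMM)  != 0 &&
           (cr4 & CR4_OSFXSR)           != 0;
}
```

Each flag is examined on its own, with no assumption about reserved or undefined bits, in line with the guidelines for using the CPUID instruction.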
Bit(s)   CR4 flag
31-10    Reserved (set to 0)
09       OSFXSR
08       PCE
07       PGE
06       MCE
05       PAE
04       PSE
03       DE
02       TSD
01       PVI
00       VME
APPENDIX A
EFLAGS CROSS-REFERENCE
The cross-reference in Table A-1 summarizes how the flags in the processor’s EFLAGS register
are affected by each instruction. For detailed information on how flags are affected, refer to
Chapter 3, Instruction Set Reference of the Intel Architecture Software Developer’s Manual,
Volume 2. The following codes describe how the flags are affected:
APPENDIX B
EFLAGS CONDITION CODES
Table B-1 gives all the condition codes that can be tested for by the CMOVcc, FCMOVcc, Jcc
and SETcc instructions. The condition codes refer to the setting of one or more status flags (CF,
OF, SF, ZF, and PF) in the EFLAGS register. The “Mnemonic” column gives the suffix (cc)
added to the instruction to specify the test condition. The “Condition Tested For” column describes
the condition specified in the “Status Flags Setting” column. The “Instruction Subcode” column
gives the opcode suffix added to the main opcode to specify a test condition.
Mnemonic (cc)   Condition Tested For         Instruction Subcode   Status Flags Setting
O               Overflow                     0000                  OF = 1
NO              No overflow                  0001                  OF = 0
B               Below                        0010                  CF = 1
NAE             Neither above nor equal
NB              Not below                    0011                  CF = 0
AE              Above or equal
E               Equal                        0100                  ZF = 1
Z               Zero
NE              Not equal                    0101                  ZF = 0
NZ              Not zero
BE              Below or equal               0110                  (CF OR ZF) = 1
NA              Not above
NBE             Neither below nor equal      0111                  (CF OR ZF) = 0
A               Above
S               Sign                         1000                  SF = 1
NS              No sign                      1001                  SF = 0
P               Parity                       1010                  PF = 1
PE              Parity even
NP              No parity                    1011                  PF = 0
PO              Parity odd
L               Less                         1100                  (SF XOR OF) = 1
NGE             Neither greater nor equal
NL              Not less                     1101                  (SF XOR OF) = 0
GE              Greater or equal
Many of the test conditions are described in two different ways. For example, LE (less or equal)
and NG (not greater) describe the same test condition. Alternate mnemonics are provided to
make code more intelligible.
The terms “above” and “below” are associated with the CF flag and refer to the relation between
two unsigned integer values. The terms “greater” and “less” are associated with the SF and OF
flags and refer to the relation between two signed integer values.
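The distinction between the unsigned (“above”/“below”) and signed (“greater”/“less”) conditions can be illustrated with a small model. The sketch below is only an approximation for study purposes: `cmp_flags` and `condition_met` are hypothetical helper names, the flag computation models a 32-bit CMP (a subtraction that discards its result), and the entries for subcodes 1110 and 1111 (LE/NG and NLE/G) are filled in from the standard encoding because those rows do not appear in the table rows above.

```python
def cmp_flags(a, b, bits=32):
    """Model the status flags left by CMP a, b (computes a - b, discards the result)."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    result = (a - b) & mask
    sign_a, sign_b = a >> (bits - 1), b >> (bits - 1)
    sf = result >> (bits - 1)
    return {
        "cf": int(a < b),                               # unsigned borrow
        "zf": int(result == 0),
        "sf": sf,
        "of": int(sign_a != sign_b and sf != sign_a),   # signed overflow on subtract
        "pf": int(bin(result & 0xFF).count("1") % 2 == 0),
    }


def condition_met(subcode, cf=0, zf=0, sf=0, of=0, pf=0):
    """Evaluate a 4-bit instruction subcode against the status flags."""
    tests = [
        of == 1,                 # 000x: O  / NO
        cf == 1,                 # 001x: B  / NB
        zf == 1,                 # 010x: E  / NE
        (cf | zf) == 1,          # 011x: BE / NBE
        sf == 1,                 # 100x: S  / NS
        pf == 1,                 # 101x: P  / NP
        (sf ^ of) == 1,          # 110x: L  / NL
        ((sf ^ of) | zf) == 1,   # 111x: LE / NLE (standard encoding, not listed above)
    ]
    result = tests[subcode >> 1]
    # The low bit of the subcode selects the negated form of each test.
    return (not result) if subcode & 1 else result
```

Comparing 0FFFFFFFFH with 1 shows the difference: as unsigned values the first operand is “above” the second (subcode 0111 holds), while as signed values it is -1 and therefore “less” (subcode 1100 also holds).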
APPENDIX C
FLOATING-POINT EXCEPTIONS SUMMARY
Table C-1 lists the floating-point instruction mnemonics in alphabetical order. For each
mnemonic, it summarizes the exceptions that the instruction may cause. Refer to Section 7.8.,
“Floating-Point Exception Conditions” in Chapter 7, Floating-Point Unit for a detailed
discussion of the floating-point exceptions. The following codes indicate the floating-point
exceptions:
APPENDIX D
SIMD FLOATING-POINT EXCEPTIONS SUMMARY
Table D-1 lists the Streaming SIMD Extensions mnemonics in alphabetical order. For each
mnemonic, it summarizes the exceptions that the instruction may cause. Refer to Section 9.5.5.,
“Exception Handling in Streaming SIMD Extensions” in Chapter 9, Programming with the
Streaming SIMD Extensions for a detailed discussion of the various exceptions that can occur
when executing Streaming SIMD Extensions.
The following codes indicate the exceptions associated with execution of an instruction that
uses the 128-bit Streaming SIMD Extensions registers.
Mnemonic   Instruction   #I   #D   #Z   #O   #U   #P
APPENDIX E
GUIDELINES FOR WRITING FPU
EXCEPTIONS HANDLERS
As described in Chapter 7, Floating-Point Unit, the Intel Architecture (IA) supports two
mechanisms for accessing exception handlers to handle unmasked FPU exceptions: native mode and
MS-DOS compatibility mode. The primary purpose of this appendix is to provide detailed
information to help software engineers design and write FPU exception-handling facilities to run
on PC systems that use the MS-DOS compatibility mode (refer to Note I) for handling FPU
exceptions. Some of the information in this appendix will also be of interest to engineers who
are writing native-mode FPU exception handlers. The information provided is as follows:
• Discussion of the origin of the MS-DOS* FPU exception handling mechanism and its
relationship to the FPU’s native exception handling mechanism.
• Description of the IA flags and processor pins that control the MS-DOS FPU exception
handling mechanism.
• Description of the FPU’s exception handling mechanism and the typical protocol for FPU
exception handlers.
NOTES
I Microsoft Windows* 95 and Windows* 3.1 (and earlier versions) operating systems use almost the same
FPU exception handling interface as the MS-DOS* operating system. The recommendations in this appendix
for an MS-DOS* compatible exception handler thus apply to all three operating systems.
the same MS-DOS compatibility floating-point exception handling mechanism that was used in
the PC AT was used in PCs based on the Intel386™.
compatibility mode. The actions of these mechanisms are slightly different and more
straightforward for the P6 family processors, as described in Section E.2.2., “MS-DOS*
Compatibility Mode in the P6 Family Processors”.
For Pentium® and P6 family processors, it is important to note that the special DP (Dual
Processing) mode for Pentium® processors and also the more general Intel MultiProcessor
Specification for systems with multiple Pentium® or P6 family processors support FPU exception
handling only in the native mode. Intel does not recommend using the MS-DOS compatibility
FPU mode for systems using more than one processor.
If NE=0 but the IGNNE# input is active while an unmasked FPU exception is in effect, the
processor disregards the exception, does not assert FERR#, and continues. If IGNNE# is then
de-asserted and the FPU exception has not been cleared, the processor will respond as described
above. (That is, an immediate exception case will assert FERR# immediately. A deferred
exception case will assert FERR# and freeze just before the next FPU or WAIT instruction.) The
assertion of IGNNE# is intended for use only inside the FPU exception handler, where it is needed
to execute non-control FPU instructions for diagnosis before clearing the exception
condition. When IGNNE# is asserted inside the exception handler, a preceding FPU exception
has already caused FERR# to be asserted, and the external interrupt hardware has responded,
but IGNNE# assertion still prevents the freeze at FPU instructions. Note that if IGNNE# is left
active outside of the FPU exception handler, additional FPU instructions may be executed after
a given instruction has caused an FPU exception. In this case, if the FPU exception handler ever
did get invoked, it could not determine which instruction caused the exception.
To properly manage the interface between the processor’s FERR# output, its IGNNE# input, and
the IRQ13 input of the PIC, additional external hardware is needed. A recommended
configuration is described in the following section.
[Figure E-1: circuit diagram, not reproduced. Recoverable labels: Intel486, Pentium®, or
Pentium Pro processor; FF #1; FF #2; FP_IRQ. Legend: FF #n = Flip Flop #n; CLR = Clear or Reset.]
Figure E-1. Recommended Circuit for MS-DOS* Compatibility FPU Exception Handling
In the circuit in Figure E-1, when the FPU exception handler accesses I/O port 0F0H it clears
the IRQ13 interrupt request output from Flip Flop #1 and also clocks out the IGNNE# signal
(active) from Flip Flop #2. So the handler can activate IGNNE#, if needed, by doing this 0F0H
access before clearing the FPU exception condition (which de-asserts FERR#). However, the
circuit does not depend on the order of actions by the FPU exception handler to guarantee the
correct hardware state upon exit from the handler. Flip Flop #2, which drives IGNNE# to the
processor, has its CLEAR input attached to the inverted FERR#. This ensures that IGNNE# can
never be active when FERR# is inactive. So if the handler clears the FPU exception condition
before the 0F0H access, IGNNE# does not get activated and left on after exit from the handler.
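The order independence described above can be checked with a small behavioral model of the two flip flops. This is a sketch under stated assumptions, not a description of the actual hardware: the class `FpuIrqCircuit` and its method names are hypothetical, signals are modeled as booleans where True means “asserted”, and it is assumed that Flip Flop #1 latches the IRQ13 request when FERR# becomes active.

```python
class FpuIrqCircuit:
    """Boolean model of the flip-flop circuit; True means the signal is asserted."""

    def __init__(self):
        self.ferr = False    # FERR# output of the processor
        self.irq13 = False   # interrupt request output of Flip Flop #1
        self.ignne = False   # IGNNE# output of Flip Flop #2

    def raise_fpu_exception(self):
        # Assumption: FERR# assertion latches the IRQ13 request in Flip Flop #1.
        self.ferr = True
        self.irq13 = True

    def access_port_f0(self):
        # The 0F0H access clears IRQ13 and clocks IGNNE# active, but Flip Flop #2
        # is held in clear whenever FERR# is inactive.
        self.irq13 = False
        self.ignne = self.ferr

    def clear_fpu_exception(self):
        # Clearing the exception de-asserts FERR#, which clears Flip Flop #2.
        self.ferr = False
        self.ignne = False
```

Running the handler actions in either order (port access first, or exception clear first) leaves IGNNE# inactive on exit, which is the property the text attributes to the CLEAR connection.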
exception, which is not an explicitly documented behavior of a no-wait instruction. This process
is illustrated in Figure E-3.
[Figure E-3: timing diagram, not reproduced. Recoverable labels: Exception Generating
Floating-Point Instruction; Assertion of FERR# by the Processor; System Dependent Delay;
Assertion of INTR Pin by the System; Start of the “No-Wait” Floating-Point Instruction;
External Interrupt Sampling Window; Case 1; Case 2; Window Closed.]
asserting INTR is long enough, relative to the time elapsed before the no-wait floating-point
instruction, INTR can be asserted inside the interrupt window for the latter. Second, consider two
no-wait FPU instructions in close sequence, and assume that a previous FPU instruction has
caused an unmasked numeric exception. Then if the INTR timing is too long for an FERR#
signal triggered by the first no-wait instruction to hit the first instruction’s interrupt window, it
could catch the interrupt window of the second.
The possible malfunction of a no-wait FPU instruction explained above cannot happen if the
instruction is being used in the manner for which Intel originally designed it. The no-wait
instructions were intended to be used inside the FPU exception handler, to allow manipulation of
the FPU before the error condition is cleared, without hanging the processor because of the FPU
error condition, and without the need to assert IGNNE#. They will perform this function correctly,
since before the error condition is cleared, the assertion of FERR# that caused the FPU error
handler to be invoked is still active. Thus the logic that would assert FERR# briefly at a no-wait
instruction causes no change since FERR# is already asserted. The no-wait instructions may also
be used without problem in the handler after the error condition is cleared, since now they will
not cause FERR# to be asserted at all.
If a no-wait instruction is used outside of the FPU exception handler, it may malfunction as
explained above, depending on the details of the hardware interface implementation and which
particular processor is involved. The actual interrupt inside the window in the no-wait
instruction may be blocked by surrounding it with the instructions: PUSHFD, CLI, no-wait, then
POPFD. (CLI blocks interrupts, and the push and pop of flags preserves and restores the original
value of the interrupt flag.) However, if FERR# was triggered by the no-wait, its latched value
and the PIC response will still be in effect. Further code can be used to check for and correct
such a condition, if needed. Section E.3.5., “Considerations When FPU Shared Between Tasks”
discusses an important example of this type of problem and gives a solution.
response to INTR itself is blocked if the operating system has cleared the IF bit in EFLAGS. Note
that Streaming SIMD Extensions numeric exceptions will not cause assertion of FERR#
(independent of the value of CR0.NE). In addition, they ignore the assertion/de-assertion of IGNNE#.
However, just as with the Intel486™ and Pentium® processors, if the IGNNE# input is inactive,
a floating-point exception which occurred in the previous FPU instruction and is unmasked
causes the processor to freeze immediately when encountering the next WAIT or FPU
instruction (except for no-wait instructions). This means that if the FPU exception handler has not
already been invoked due to the earlier exception (and therefore, the handler has not cleared that
exception state from the FPU), the processor is forced to wait for the handler to be invoked and
handle the exception, before the processor can execute another WAIT or FPU instruction.
As explained in Section E.2.1.3., “No-Wait FPU Instructions Can Get FPU Interrupt in
Window”, if a no-wait instruction is used outside of the FPU exception handler, in the Intel486™
and Pentium® processors, it may accept an unmasked exception from a previous FPU instruction
which happens to fall within the external interrupt sampling window that is opened near the
beginning of execution of all FPU instructions. This will not happen in the P6 family processors,
because this sampling window has been removed from the no-wait group of FPU instructions.
For complete details on these exceptions and their defaults, refer to Section 7.7., “Floating-Point
Exception Handling” and Section 7.8., “Floating-Point Exception Conditions” in Chapter 7,
Floating-Point Unit.
• The FPU can handle selected exceptions itself, producing a default fix-up that is
reasonable in most situations. This allows the numeric program execution to continue
undisturbed. Programs can mask individual exception types to indicate that the FPU should
generate this safe, reasonable result whenever the exception occurs. The default exception
fix-up activity is treated by the FPU as part of the instruction causing the exception; no
external indication of the exception is given (except that the instruction takes longer to
execute when it handles a masked exception.) When masked exceptions are detected, a
flag is set in the numeric status register, but no information is preserved regarding where or
when it was set.
• Alternatively, a software exception handler can be invoked to handle the exception. When
a numeric exception is unmasked and the exception occurs, the FPU stops further
execution of the numeric instruction and causes a branch to a software exception handler.
The exception handler can then implement any sort of recovery procedures desired for any
numeric exception detectable by the FPU.
becomes zero. With the divide-by-zero and precision exceptions masked, the processor will
produce the correct result. FDIV of R1 into 1 gives infinity, and then FDIV of
(infinity + 1/R2 + 1/R3) into 1 gives zero.
\[ \text{Equivalent Resistance} = \frac{1}{\dfrac{1}{R_1} + \dfrac{1}{R_2} + \dfrac{1}{R_3}} \]
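The parallel-resistance calculation can be traced numerically with the sketch below. Python raises an error on floating-point division by zero rather than returning infinity, so the hypothetical helper `masked_fdiv` is introduced here to imitate the FPU's masked divide-by-zero response (a correctly signed infinity); it is an illustration of the default fix-up, not of the FDIV instruction itself.

```python
import math


def masked_fdiv(dividend, divisor):
    """Divide as the FPU would with the divide-by-zero exception masked."""
    if divisor == 0.0:
        if dividend == 0.0 or math.isnan(dividend):
            return math.nan  # 0/0 is an invalid operation, not a divide-by-zero
        # Masked response: infinity whose sign is the product of the operand signs.
        return math.copysign(math.inf, dividend) * math.copysign(1.0, divisor)
    return dividend / divisor


def equivalent_resistance(r1, r2, r3):
    """Equivalent resistance of three resistors connected in parallel."""
    total = masked_fdiv(1.0, r1) + masked_fdiv(1.0, r2) + masked_fdiv(1.0, r3)
    return masked_fdiv(1.0, total)
```

With R1 equal to zero, the first reciprocal is infinity, the sum is infinity, and the final reciprocal is zero, so the calculation yields the physically correct result without any exception handler being involved.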
discussed earlier in Section E.1., “Origin of the MS-DOS* Compatibility Mode for Handling FPU
Exceptions” and Section E.2., “Implementation of the MS-DOS* Compatibility Mode in the
Intel486™, Pentium®, and P6 family processors”. The elapsed time between the initial error
signal and the invocation of the FPU exception handler depends on the external hardware
interface, and also on whether the external interrupt for FPU errors is enabled. But the
architecture ensures that the handler will be invoked before execution of the next WAIT or
floating-point instruction, since an unmasked floating-point exception causes the processor to
freeze just before executing such an instruction (unless the IGNNE# input is active, or it is a
no-wait FPU instruction).
The frozen processor waits for an external interrupt, which must be supplied by external
hardware in response to the FERR# (or ERROR#) output of the processor (or coprocessor), usually
through IRQ13 on the “slave” PIC, and then through INTR. Then the external interrupt invokes
the exception handling routine. Note that if the external interrupt for FPU errors is disabled
when the processor executes an FPU instruction, the processor will freeze until some other
(enabled) interrupt occurs if an unmasked FPU exception condition is in effect. If NE = 0 but the
IGNNE# input is active, the processor disregards the exception and continues. Error reporting
via an external interrupt is supported for MS-DOS compatibility. Chapter 18, Intel Architecture
Compatibility of the Intel Architecture Software Developer’s Manual, Volume 3, contains further
discussion of compatibility issues.
The references above to the ERROR# output from the FPU apply to the Intel 387 and Intel 287
math coprocessors (NPX chips). If one of these coprocessors encounters an unmasked exception
condition, it signals the exception to the Intel 286 or Intel386™ processor using the ERROR#
status line between the processor and the coprocessor. Refer to Section E.1., “Origin of the MS-
DOS* Compatibility Mode for Handling FPU Exceptions”, in this appendix, and Chapter 18,
Intel Architecture Compatibility, in the Intel Architecture Software Developer’s Manual, Volume
3 for differences in FPU exception handling.
The exception-handling routine is normally a part of the systems software. The routine must
clear (or disable) the active exception flags in the FPU status word before executing any
floating-point instructions that cannot complete execution when there is a pending floating-point
exception. Otherwise, the floating-point instruction will trigger the FPU interrupt again, and the
system will be caught in an endless loop of nested floating-point exceptions, and hang. In any
event, the routine must clear (or disable) the active exception flags in the FPU status word after
handling them, and before IRET(D). Typical exception responses may include:
• Printing or displaying diagnostic information (e.g., the FPU environment and registers).
• Aborting further execution, or using the exception pointers to build an instruction that will
run without exception and executing it.
Applications programmers should consult their operating system’s reference manuals for the
appropriate system response to numerical exceptions. For systems programmers, some details on
writing software exception handlers are provided in Chapter 5, Interrupt and Exception
Handling, in the Intel Architecture Software Developer’s Manual, Volume 3, as well as in Section
E.3.3.4., “FPU Exception Handling Examples” in this appendix.
• The FPU can provide a default fix-up for selected numeric exceptions. If the FPU performs
its default action for all exceptions, then the need for exception synchronization is not
manifest. However, code is often ported to contexts and operating systems for which it was
not originally designed. Example E-1 and Example E-2, below, illustrate that it is safest to
always consider exception synchronization when designing code that uses the FPU.
• Alternatively, a software exception handler can be invoked to handle the exception. When
a numeric exception is unmasked and the exception occurs, the FPU stops further
execution of the numeric instruction and causes a branch to a software exception handler.
When an FPU exception handler will be invoked, synchronization must always be
considered to assure reliable performance.
Example E-1 and Example E-2, below, illustrate the need to always consider exception
synchronization when writing numeric code, even when the code is initially intended for execution
with exceptions masked.
invoked, so that the recovery routine will load an incorrect value of COUNT, causing the
program to fail or behave unreliably.
displaying a message, to attempting to repair the problem and proceed with normal execution.
The epilogue essentially reverses the actions of the prologue, restoring the processor so that
normal execution can be resumed. The epilogue must not load an unmasked exception flag into the
FPU or another exception will be requested immediately.
The following code examples show the ASM386/486 coding of three skeleton exception
handlers, with the save spaces given as correct for 32-bit protected mode. They show how prologues
and epilogues can be written for various situations, but the application dependent exception
handling body is just indicated by comments showing where it should be placed.
The first two are very similar; their only substantial difference is their choice of instructions to
save and restore the FPU. The trade-off here is between the increased diagnostic information
provided by FNSAVE and the faster execution of FNSTENV. (Also, after saving the original
contents, FNSAVE re-initializes the FPU, while FNSTENV only masks all FPU exceptions.)
For applications that are sensitive to interrupt latency or that do not need to examine register
contents, FNSTENV reduces the duration of the “critical region,” during which the processor
does not recognize another interrupt request. (Refer to Chapter 7, Floating-Point Unit, for a
complete description of the FPU save image.) If the processor supports Streaming SIMD
Extensions and the operating system supports them, the FXSAVE instruction should be used
instead of FNSAVE. If the FXSAVE instruction is used, the save area should be increased to
512 bytes and aligned to 16 bytes to save the entire state. These steps will ensure that the
complete context is saved.
After the exception handler body, the epilogues prepare the processor to resume execution from
the point of interruption (i.e., the instruction following the one that generated the unmasked
exception). Notice that the exception flags in the memory image that is loaded into the FPU are
cleared to zero prior to reloading (in fact, in these examples, the entire status word image is
cleared).
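The clearing step described above can be pictured as a one-byte store into the saved image before it is reloaded. The sketch below models a 108-byte FNSAVE image as a byte array and assumes the 32-bit protected-mode layout, in which the status word sits 4 bytes from the start of the image (the same relationship as [EBP-104] to [EBP-108] in the example epilogue); `STATUS_WORD_OFFSET` and `clear_exception_flags` are hypothetical names. Like the byte-wide MOV in the epilogue, it zeroes only the low byte of the status word, which is where the exception flags reside.

```python
FNSAVE_IMAGE_SIZE = 108   # save space for 32-bit protected mode, as in the examples
STATUS_WORD_OFFSET = 4    # assumed offset of the status word within the image


def clear_exception_flags(image):
    """Zero the low byte of the status word in a saved FPU image, in place."""
    if len(image) < STATUS_WORD_OFFSET + 2:
        raise ValueError("image too small to hold a status word")
    image[STATUS_WORD_OFFSET] = 0
    return image
```

After this store the image no longer carries a flagged exception, so reloading it does not immediately request another exception.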
Example E-3 and Example E-4 assume that the exception handler itself will not cause an
unmasked exception. Where this is a possibility, the general approach shown in Example E-5 can
be employed. The basic technique is to save the full FPU state and then to load a new control
word in the prologue. Note that considerable care should be taken when designing an exception
handler of this type to prevent the handler from being reentered endlessly.
;
; APPLICATION-DEPENDENT EXCEPTION HANDLING CODE GOES HERE
;
; CLEAR EXCEPTION FLAGS IN STATUS WORD (WHICH IS IN MEMORY)
; RESTORE MODIFIED STATE IMAGE
MOV BYTE PTR [EBP-104], 0H
FRSTOR [EBP-108]
; DE-ALLOCATE STACK SPACE, RESTORE REGISTERS
MOV ESP, EBP
.
.
POP EBP
;
; RETURN TO INTERRUPTED CALCULATION
IRETD
SAVE_ALL ENDP
;
; APPLICATION-DEPENDENT EXCEPTION HANDLING CODE GOES HERE
;
; CLEAR EXCEPTION FLAGS IN STATUS WORD (WHICH IS IN MEMORY)
; RESTORE MODIFIED ENVIRONMENT IMAGE
MOV BYTE PTR [EBP-24], 0H
FLDENV [EBP-28]
; DE-ALLOCATE STACK SPACE, RESTORE REGISTERS
MOV ESP, EBP
.
.
POP EBP
;
; RETURN TO INTERRUPTED CALCULATION
IRETD
SAVE_ENVIRONMENT ENDP
; SAVE STATE, LOAD NEW CONTROL WORD, RESTORE INTERRUPT ENABLE FLAG (IF)
FNSAVE [EBP-108]
FLDCW LOCAL_CONTROL
PUSH [EBP + OFFSET_TO_EFLAGS] ; COPY OLD EFLAGS TO STACK TOP
POPFD ; RESTORE IF TO VALUE BEFORE FPU EXCEPTION
.
.
;
; APPLICATION-DEPENDENT EXCEPTION HANDLING CODE GOES HERE. AN
; UNMASKED EXCEPTION
E.3.4. Need for Storing State of IGNNE# Circuit If Using FPU and SMM
The recommended circuit (refer to Figure E-1) for MS-DOS compatibility FPU exception
handling for Intel486™ processors and beyond contains two flip flops. When the FPU exception
handler accesses I/O port 0F0H it clears the IRQ13 interrupt request output from Flip Flop #1
and also clocks out the IGNNE# signal (active) from Flip Flop #2. The assertion of IGNNE#
may be used by the handler if needed to execute any FPU instruction while ignoring the pending
FPU errors. The problem here is that the state of Flip Flop #2 is effectively an additional (but
hidden) status bit that can affect processor behavior, and so ideally should be saved upon
entering SMM, and restored before resuming to normal operation. If this is not done, and also the
SMM code saves the FPU state, AND an FPU error handler is being used which relies on
IGNNE# assertion, then (very rarely) the FPU handler will nest inside itself and malfunction.
The following example shows how this can happen.
Suppose that the FPU exception handler includes the following sequence:
FNSTSW save_sw   ; save the FPU status word
                 ; using a no-wait FPU instruction
OUT 0F0H, AL     ; clears IRQ13 & activates IGNNE#
....
FLDCW new_cw     ; loads new CW ignoring FPU errors,
                 ; since IGNNE# is assumed active; or any
                 ; other FPU instruction that is not a no-wait
                 ; type will cause the same problem
....
FCLEX            ; clear the FPU error conditions & thus turn off FERR# & reset the IGNNE# FF
The problem will only occur if the processor enters SMM between the OUT and the FLDCW
instructions. But if that happens, AND the SMM code saves the FPU state using FNSAVE, then
the IGNNE# Flip Flop will be cleared (because FNSAVE clears the FPU errors and thus
de-asserts FERR#). When the processor returns from SMM it will restore the FPU state with
FRSTOR, which will re-assert FERR#, but the IGNNE# Flip Flop will not get set. Then when the
FPU error handler executes the FLDCW instruction, the active error condition will cause the
processor to re-enter the FPU error handler from the beginning. This may cause the handler to
malfunction.
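The sequence of events in this failure can be followed with a small state model. The sketch below is illustrative only: the class `HandlerScenario`, its methods, and `run_handler` are hypothetical names, signals are booleans where True means “asserted”, and re-entry is modeled simply as “FERR# active while IGNNE# is inactive when FLDCW executes”.

```python
class HandlerScenario:
    """Tracks FERR# and the IGNNE# flip flop through the handler sequence."""

    def __init__(self):
        self.ferr = False
        self.ignne = False

    def fpu_error(self):
        self.ferr = True

    def out_f0(self):
        # OUT 0F0H, AL clocks IGNNE# active (only possible while FERR# is active).
        self.ignne = self.ferr

    def smm_fnsave(self):
        # FNSAVE clears the FPU errors, de-asserting FERR# and clearing the flip flop.
        saved_error = self.ferr
        self.ferr = False
        self.ignne = False
        return saved_error

    def smm_frstor(self, saved_error):
        # FRSTOR re-asserts FERR#, but nothing sets the IGNNE# flip flop again.
        self.ferr = saved_error

    def fldcw_reenters_handler(self):
        return self.ferr and not self.ignne


def run_handler(smm_between_out_and_fldcw):
    """Return True if FLDCW would re-enter the FPU error handler."""
    s = HandlerScenario()
    s.fpu_error()
    s.out_f0()
    if smm_between_out_and_fldcw:
        s.smm_frstor(s.smm_fnsave())
    return s.fldcw_reenters_handler()
```

Without the SMM entry the handler proceeds normally; with an SMM entry between OUT and FLDCW that saves and restores the FPU state, the model reports the nested re-entry described above.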
To avoid this problem, Intel recommends two measures:
1. Do not use the FPU for calculations inside SMM code. (The normal power management,
and sometimes security, functions provided by SMM have no need for FPU calculations; if
they are needed for some special case, use scaling or emulation instead.) This eliminates
the need to do FNSAVE/FRSTOR inside SMM code, except when going into a 0 V
suspend state (in which, in order to save power, the CPU is turned off completely, requiring
its complete state to be saved.)
2. The system should not call upon SMM code to put the processor into 0 V suspend while
the processor is running FPU calculations, or just after an interrupt has occurred. Normal
power management protocol avoids this by going into power down states only after timed
intervals in which no system activity occurs.
NOTES
I In a software task switch, the operating system uses a sequence of instructions to save the suspending
thread’s state and restore the resuming thread’s state, instead of the single long non-interruptible task
switch operation provided by the IA.
clear the TS bit before exit using the CLTS instruction. On return from the handler the faulting
thread will proceed with its floating-point computation.
Some operating systems save the FPU context on every task switch, typically because they also
change the linear address space between tasks. The problem and solution discussed in the
following sections apply to these operating systems also.
NOTES
II Although CR0, bit 2, the emulation flag (EM), also causes a DNA exception, do not use the EM bit as a
surrogate for TS. EM means that no floating-point unit is available and that floating-point instructions
must be emulated. Using EM to trap on task switches is not compatible with IA MMX™ technology. If the
EM flag is set, MMX™ instructions raise the invalid opcode exception.
A simple way to handle the case of exceptions arriving during FPU state swaps is to allow the
kernel to be one of the FPU owning threads. A reserved thread identifier is used to indicate
kernel ownership of the FPU. During a floating-point state swap, the “FPU owner” variable should
be set to indicate the kernel as the current owner. At the completion of the state swap, the
variable should be set to indicate the new owning thread. The numeric exception handler needs to
check the FPU owner and discard any numeric exceptions that occur while the kernel is the FPU
owner. A more general flow for a DNA exception handler that handles this case is shown in
Figure E-5.
Numeric exceptions received while the kernel owns the FPU for a state swap must be discarded
in the kernel without being dispatched to a handler. A flow for a numeric exception dispatch
routine is shown in Figure E-6.
It may at first glance seem that there is a possibility of floating-point exceptions being lost
because of exceptions that are discarded during state swaps. This is not the case, as the exception
will be re-issued when the floating-point state is reloaded. Walking through state swaps both
with and without pending numeric exceptions will clarify the operation of these two handlers.
Case #1: FPU State Swap Without Numeric Exception
Assume two threads A and B, both using the floating-point unit. Let A be the thread to have most
recently executed a floating-point instruction, with no pending numeric exceptions. Let B be the
currently executing thread. CR0.TS was set when thread A was suspended. When B starts to
execute a floating-point instruction the instruction will fault with the DNA exception because TS
is set.
At this point the handler is entered, and eventually it finds that the current FPU Owner is not the
currently executing thread. To guard the FPU state swap from extraneous numeric exceptions,
the FPU Owner is set to be the kernel. The old owner’s FPU state is saved with FNSAVE, and
the current thread’s FPU state is restored with FRSTOR. Before exiting, the FPU owner is set to
thread B, and the TS bit is cleared.
On exit, thread B resumes execution of the faulting floating-point instruction and continues.
Case #2: FPU State Swap with Discarded Numeric Exception
Again, assume two threads A and B, both using the floating-point unit. Let A be the thread to
have most recently executed a floating-point instruction, but this time let there be a pending
numeric exception. Let B be the currently executing thread. When B starts to execute a
floating-point instruction the instruction will fault with the DNA exception and enter the DNA handler.
(If both numeric and DNA exceptions are pending, the DNA exception takes precedence, in
order to support handling the numeric exception in its own context.)
[Flowchart, not reproduced. Recoverable labels: “Current Thread same as FPU Owner?”
(Yes/No); “Is Kernel FPU Owner?” (Yes/No); “Normal Dispatch to Numeric Exception Handler”;
“Exit”.]
When the FNSAVE starts, it will trigger an interrupt via FERR# because of the pending numeric
exception. After some system dependent delay, the numeric exception handler is entered. It may
be entered before the FNSAVE starts to execute, or it may be entered shortly after execution of
the FNSAVE. Since the FPU Owner is the kernel, the numeric exception handler simply exits,
discarding the exception. The DNA handler resumes execution, completing the FNSAVE of the
old floating-point context of thread A and the FRSTOR of the floating-point context for thread
B.
Thread A eventually gets an opportunity to handle the exception that was discarded during the
task switch. After some time, thread B is suspended, and thread A resumes execution. When
thread A starts to execute a floating-point instruction, once again the DNA exception handler is
entered. B’s FPU state is stored, and A’s FPU state is restored. Note that in restoring the FPU
state from A’s save area, the pending numeric exception flags are reloaded into the
floating-point status word. Now when the DNA exception handler returns, thread A resumes execution
of the faulting floating-point instruction just long enough to immediately generate a numeric
exception, which now gets handled in the normal way. The net result is that the task switch and
resulting FPU state swap via the DNA exception handler causes an extra numeric exception
which can be safely discarded.
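The two cases walked through above can be replayed with a simplified model of the “FPU owner” protocol. This is a sketch for following the logic, not operating system code: the class `FpuModel`, its fields, and the `KERNEL` identifier are hypothetical, a pending unmasked exception is reduced to a single boolean, and the FERR# timing is collapsed so that the exception triggered by FNSAVE is delivered at once.

```python
KERNEL = "kernel"  # reserved thread identifier for kernel ownership (hypothetical value)


class FpuModel:
    def __init__(self, owner, pending=False):
        self.owner = owner       # the "FPU owner" variable
        self.ts = False          # CR0.TS
        self.pending = pending   # unmasked numeric exception pending in the live FPU
        self.save_area = {}      # thread -> pending flag held in its saved FPU state
        self.log = []

    def task_switch(self):
        self.ts = True

    def numeric_exception(self, current):
        if self.owner == KERNEL:
            self.log.append("discarded")   # state swap in progress: drop it
            return
        self.log.append("handled for " + current)
        self.pending = False               # the handler clears the exception

    def dna_handler(self, current):
        if self.owner != current:
            old_owner, self.owner = self.owner, KERNEL
            if self.pending:
                self.numeric_exception(current)      # triggered by FNSAVE, discarded
            self.save_area[old_owner] = self.pending            # FNSAVE old context
            self.pending = self.save_area.get(current, False)   # FRSTOR new context
            self.owner = current
        self.ts = False                                          # CLTS

    def fp_instruction(self, current):
        if self.ts:
            self.dna_handler(current)
        if self.pending:
            self.numeric_exception(current)
```

Starting with thread A as owner and a pending exception, a floating-point instruction in thread B logs one discarded exception; when thread A later runs a floating-point instruction, its saved flags are reloaded and the exception is handled in A's own context, matching Case #2.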
Software Developer’s Manual, Volume 2, for information about exceptions generated if the
memory region is not aligned).
3. Maintaining compatibility with legacy applications/libraries: The operating system
changes to support Streaming SIMD Extensions must be invisible to legacy applications or
libraries that deal only with floating-point instructions. The layout of the memory region
operated on by the FXSAVE/FXRSTOR instructions is different from the layout for the
FNSAVE/FRSTOR instructions. Specifically, the format of the FPU tag word and the
length of the various fields in the memory region is different. Care must be taken to return
the FPU state to a legacy application (e.g., when reporting FP exceptions) in the format it
expects.
4. Instruction semantic differences: There are some semantic differences between the way
the FXSAVE and FSAVE/FNSAVE instructions operate. The FSAVE/FNSAVE
instructions clear the FPU after they save the state while the FXSAVE instruction saves the
FPU/Streaming SIMD Extensions state but does not clear it. Operating systems that use
FXSAVE to save the FPU state before making it available for another thread (e.g., during
thread switch time) should take precautions not to pass a “dirty” FPU to another
application.
E.4.1. Origin with the Intel 286 and Intel 287, and Intel386™ and
Intel 387 Processors
The Intel 286 and Intel 287, and Intel386™ and Intel 387 processor/coprocessor pairs are each
provided with ERROR# pins that are recommended to be connected between the processor and
FPU. If this is done, when an unmasked FPU exception occurs, the FPU records the exception,
and asserts its ERROR# pin. The processor recognizes this active condition of the ERROR#
status line immediately before execution of the next WAIT or FPU instruction (except for the
no-wait type) in its instruction stream, and branches to the routine at interrupt vector 16. Thus an
FPU exception will be handled before any other FPU instruction (after the one causing the error)
is executed (except for no-wait instructions, which will be executed without triggering the FPU
exception interrupt, although the exception will remain pending).
Using the dedicated interrupt 16 for FPU exception handling is referred to as the native mode.
It is the simplest approach, and the one recommended most highly by Intel.
save little code, it would be a reasonable and conservative habit (as long as the MS-DOS com-
patibility mode is widely used) to include these steps in all systems.
Note that the special DP (Dual Processing) mode for Pentium® Processors, and also the more
general Intel MultiProcessor Specification for systems with multiple Pentium® or P6 family pro-
cessors, support FPU exception handling only in the native mode. Intel does not recommend us-
ing the MS-DOS compatibility mode for systems using more than one processor.
APPENDIX F
GUIDELINES FOR WRITING SIMD FLOATING-
POINT EXCEPTION HANDLERS
Most of the information on Streaming SIMD Extensions instructions can be found in Chapter 9,
Programming with the Streaming SIMD Extensions. Exceptions in Streaming SIMD Extensions
are specifically presented in Section 9.5.5., “Exception Handling in Streaming SIMD
Extensions”.
This appendix considers only the Streaming SIMD Extensions instructions that can generate nu-
meric (floating-point) exceptions, and gives an overview of the necessary support for handling
such exceptions. This appendix does not address RSQRTSS, RSQRTPS, RCPSS, RCPPS, or
any unlisted instruction. For detailed information on which instructions generate numeric excep-
tions, and a listing of those exceptions, refer to Appendix D, SIMD Floating-Point Exceptions
Summary. Non-numeric exceptions are handled in a way similar to that for the standard IA-32
instructions.
• If the exception being raised is masked (by setting the corresponding mask bit in the
MXCSR to 1), then a default result is produced, which is acceptable in most situations. No
external indication of the exception is given, but the corresponding exception flags in the
MXCSR are set, and may be examined later. Note though that for packed operations, an
exception flag that is set in the MXCSR will not tell which of the four sets of sub-operands
caused the event to occur.
• If the exception being raised is not masked (by setting the corresponding mask bit in the
MXCSR to 0), a software exception handler previously registered by the user will be
invoked through the SIMD floating-point exception vector 19. This case is discussed
below in Section F.2., “Software Exception Handling”.
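The masked/unmasked distinction above can be sketched in C. The sketch assumes the MXCSR layout in which bits 0 through 5 hold the exception flags (invalid, denormal, divide-by-zero, overflow, underflow, precision) and bits 7 through 12 hold the corresponding mask bits in the same order. All function names are hypothetical; the code only manipulates a 32-bit value and does not read the real register.

```c
#include <stdint.h>

/* Bit order within each 6-bit group: bit 0 = invalid (I), bit 1 = denormal (D),
   bit 2 = divide-by-zero (Z), bit 3 = overflow (O), bit 4 = underflow (U),
   bit 5 = precision (P). */
#define EXC_BITS 0x3Fu

/* Exception flags currently recorded in an MXCSR image. */
unsigned mxcsr_flags(uint32_t mxcsr)
{
    return mxcsr & EXC_BITS;
}

/* Mask bits of an MXCSR image (1 = exception masked). */
unsigned mxcsr_masks(uint32_t mxcsr)
{
    return (mxcsr >> 7) & EXC_BITS;
}

/* Of the exceptions an operation would raise, those that are unmasked
   and therefore go to the software handler (vector 19). */
unsigned exceptions_for_handler(uint32_t mxcsr, unsigned raised)
{
    return raised & ~mxcsr_masks(mxcsr) & EXC_BITS;
}

/* Of the exceptions an operation would raise, those that are masked
   and therefore only produce a default result and set a flag. */
unsigned exceptions_with_default_result(uint32_t mxcsr, unsigned raised)
{
    return raised & mxcsr_masks(mxcsr);
}
```

With an MXCSR image of 0x1F80 (all six exceptions masked), no raised exception reaches the handler; clearing bit 9 (value 0x1D80) unmasks divide-by-zero only.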
control word with FLDCW), or it must be implemented as re-entrant (for the case of FPU ex-
ceptions, refer to Example E-5 in Appendix E, Guidelines for Writing FPU Exceptions Han-
dlers). If this is not the case, the routine has to clear the status flags for FPU exceptions, or to
mask all FPU floating-point exceptions. For Streaming SIMD Extensions floating-point excep-
tions though, the exception flags in MXCSR do not have to be cleared, even if they remain un-
masked (they may still be cleared). Exceptions are in this case precise and occur immediately,
and a Streaming SIMD Extensions exception status flag that is set when the corresponding ex-
ception is unmasked will not generate an exception.
Typical actions performed by this low-level exception handling routine are:
• printing or displaying diagnostic information (e.g. the MXCSR and XMM registers)
• aborting further execution, or using the exception pointers to build an instruction that will
run without exception and executing it
• storing information about the exception in a data structure that will be passed to a higher
level user exception handler
In most cases (and this applies also to the Streaming SIMD Extensions instructions), there will
be three main components of a low-level floating-point exception handler: a “prologue”, a
“body”, and an “epilogue”.
The prologue performs functions that must be protected from possible interruption by higher-
priority sources - typically saving registers and transferring diagnostic information from the pro-
cessor to memory. When the critical processing has been completed, the prologue may re-enable
interrupts to allow higher-priority interrupt handlers to preempt the exception handler (assuming
that the interrupt handler was called through an interrupt gate, meaning that the processor
cleared the interrupt enable (IF) flag in the EFLAGS register - refer to Section 4.4.1., “Call and
Return Operation for Interrupt or Exception Handling Procedures” in Chapter 4, Procedure
Calls, Interrupts, and Exceptions).
The body of the exception handler examines the diagnostic information and makes a response
that is application-dependent. It may range from halting execution, to displaying a message, to
attempting to fix the problem and then proceeding with normal execution, to setting up a data
structure, calling a higher-level user exception handler and continuing execution upon return
from it. This latter case will be assumed in Section F.4., “SIMD Floating-Point Exceptions and
the IEEE-754 Standard for Binary Floating-Point Computations” below.
Finally, the epilogue essentially reverses the actions of the prologue, restoring the processor
state so that normal execution can be resumed.
The following example represents a typical exception handler. To link it with Example F-2 that
will follow in Section F.4.3., “SIMD Floating-Point Emulation Implementation Example”, as-
sume that the body of the handler (not shown here in detail) passes the saved state to a routine
that will examine in turn all the sub-operands of the excepting instruction, invoking a user float-
ing-point exception handler if a particular set of sub-operands raises an unmasked (enabled) ex-
ception, or emulating the instruction otherwise.
• If an unmasked (enabled) exception occurs in this process, the emulation function will
return to its caller (the filter function) with the appropriate information. The filter will
invoke a (previously registered) user floating-point exception handler for this set of sub-
operands, and will record the result upon return from the user handler (provided the user
handler allows continuation of the execution).
• If no unmasked (enabled) exception occurs, the emulation function will determine and will
return to its caller the result of the operation for the current set of sub-operands (it has to
be IEEE compliant). The filter function will record the result (plus any new flag settings).
The user level filter function will then call the emulation function for the next set of sub-oper-
ands (if any). When done, the partial results will be packed (if the excepting instruction has a
packed floating-point result, which is true for most Streaming SIMD Extensions numeric in-
structions) and the filter will return to the low-level exception handler, which in turn will return
from the interruption, allowing execution to continue. Note that the instruction pointer (EIP) has
to be altered to point to the instruction following the excepting instruction, in order to continue
execution correctly.
If a user mode floating-point exception filter is not provided, then all the work for decoding the
excepting instruction, reading its operands, emulating the instruction for the components of the
result that do not correspond to unmasked floating-point exceptions, and providing the com-
pounded result will have to be performed by the user provided floating-point exception handler.
Actual emulation will have to take place for one operand or pair of operands for scalar opera-
tions, and for all four operands or pairs of operands for packed operations. The steps to perform
are the following:
• the excepting instruction has to be decoded and the operands have to be read from the
saved context
• the instruction has to be emulated for each (pair of) sub-operand(s); if no floating-point
exception occurs, the partial result has to be saved; if a masked floating-point exception
occurs, the masked result has to be produced through emulation and saved, and the
appropriate status flags have to be set; if an unmasked floating-point exception occurs, the
result has to be generated by the user provided floating-point exception handler, and the
appropriate status flags have to be set
• the four partial results have to be combined and written to the context that will be restored
upon application program resumption
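The three steps above can be sketched as a small C loop. This is a toy illustration under stated assumptions, not the manual's implementation: SUB_OP is a much reduced, hypothetical stand-in for the per-operation environment, toy_emulate_div treats a zero divisor as the only unmasked exception, and toy_user_handler simply substitutes a fixed value. Decoding the instruction and reading the saved context are assumed to have already produced the two source arrays.

```c
#define DO_NOT_RAISE_EXCEPTION 0
#define RAISE_EXCEPTION        1

/* Hypothetical per-sub-operand record (reduced stand-in for a full environment). */
typedef struct {
    float operand1;
    float operand2;
    float result;
} SUB_OP;

typedef int   (*EMULATE_FN)(SUB_OP *op);
typedef float (*USER_HANDLER_FN)(const SUB_OP *op);

/* Toy emulation of a divide with only the divide-by-zero exception unmasked. */
int toy_emulate_div(SUB_OP *op)
{
    if (op->operand2 == 0.0f)
        return RAISE_EXCEPTION;        /* result must come from the user handler */
    op->result = op->operand1 / op->operand2;
    return DO_NOT_RAISE_EXCEPTION;
}

/* Toy user handler: substitutes a fixed value for the excepting sub-operation. */
float toy_user_handler(const SUB_OP *op)
{
    (void)op;
    return -1.0f;
}

/* count is 4 for a packed instruction and 1 for a scalar instruction. */
void filter_sub_operands(const float src1[4], const float src2[4], float dst[4],
                         int count, EMULATE_FN emulate, USER_HANDLER_FN handler)
{
    int i;

    for (i = 0; i < count; i++) {
        SUB_OP op;
        op.operand1 = src1[i];
        op.operand2 = src2[i];
        op.result = 0.0f;
        if (emulate(&op) == RAISE_EXCEPTION)
            op.result = handler(&op);  /* unmasked exception: user handler decides */
        dst[i] = op.result;            /* save the partial result */
    }
    /* Scalar case: the upper elements pass through from the first operand. */
    for (; i < 4; i++)
        dst[i] = src1[i];
}
```

The combined dst array corresponds to the value that would be written back to the saved context before the application resumes.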
Section 5.7., “Priority Among Simultaneous Exceptions and Interrupts” in Chapter 5, Interrupt
and Exception Handling.
Note that some floating-point instructions (non-waiting instructions) do not check for pending
unmasked exceptions (refer to Section 7.5.11., “FPU Control Instructions”, in Chapter 7,
Floating-Point Unit). They include the FNINIT, FNSTENV, FNSAVE, FNSTSW, FNSTCW,
and FNCLEX instructions. When an FNINIT, FNSTENV, FNSAVE, or FNCLEX instruction is
executed, all pending exceptions are essentially lost (either the FPU status register is cleared or
all exceptions are masked). The FNSTSW and FNSTCW instructions also do not check for
pending unmasked exceptions, but they do not modify the FPU status and control registers, so
a subsequent “waiting” floating-point instruction can still handle any pending exceptions.
Table F-1. ADDPS, ADDSS, SUBPS, SUBSS, MULPS, MULSS, DIVPS, DIVSS

Source Operands                        Masked Result                      Unmasked Result
SNaN1 op SNaN2                         SNaN1 | 0x00400000                 None
SNaN1 op QNaN2                         SNaN1 | 0x00400000                 None
QNaN1 op SNaN2                         QNaN1                              None
QNaN1 op QNaN2                         QNaN1                              QNaN1 (not an exception)
SNaN op real value                     SNaN | 0x00400000                  None
Real value op SNaN                     SNaN | 0x00400000                  None
QNaN op real value                     QNaN                               QNaN (not an exception)
Real value op QNaN                     QNaN                               QNaN (not an exception)
Neither source operand is SNaN,        Single-Precision QNaN Indefinite   None
but #I is signaled (e.g. for
Inf - Inf, Inf * 0, Inf / Inf, 0/0)

Note 1. SNaN | 0x00400000 is a quiet NaN obtained from the signaling NaN given as input
Note 2. Operations involving only quiet NaNs do not raise a floating-point exception
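The masked-result column of Table F-1 for NaN inputs can be sketched in C as follows. The sketch works on raw 32-bit single-precision encodings so that no NaN is passed through a floating-point register; it assumes the usual encoding in which a NaN has an all-ones exponent and a nonzero fraction, and the most significant fraction bit (0x00400000) distinguishes quiet from signaling. The function names are hypothetical.

```c
#include <stdint.h>

/* 1 if the encoding is a NaN (exponent all ones, fraction nonzero). */
int is_nan_bits(uint32_t u)
{
    return (u & 0x7F800000u) == 0x7F800000u && (u & 0x007FFFFFu) != 0;
}

/* 1 if the encoding is a signaling NaN (quiet bit clear). */
int is_snan_bits(uint32_t u)
{
    return is_nan_bits(u) && (u & 0x00400000u) == 0;
}

/* Quiet a NaN as in Note 1; a quiet NaN is returned unchanged. */
uint32_t quiet_bits(uint32_t u)
{
    return u | 0x00400000u;
}

/* 1 if the invalid operation exception (#I) is signaled because of a NaN input. */
int nan_signals_invalid(uint32_t op1, uint32_t op2)
{
    return is_snan_bits(op1) || is_snan_bits(op2);
}

/* Masked result of "op1 op op2" when at least one operand is a NaN:
   the first NaN operand, quieted, is delivered. Returns 1 and stores the
   result in *res, or returns 0 if neither operand is a NaN. */
int masked_nan_result(uint32_t op1, uint32_t op2, uint32_t *res)
{
    if (is_nan_bits(op1)) {
        *res = quiet_bits(op1);
        return 1;
    }
    if (is_nan_bits(op2)) {
        *res = quiet_bits(op2);
        return 1;
    }
    return 0;
}
```

For example, with op1 = 0x7F800001 (a signaling NaN) and op2 = 0x7FC00002 (a quiet NaN), the masked result is 0x7FC00001 and #I is signaled; with the operands swapped, the masked result is the quiet NaN 0x7FC00002.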
Note: SNaN | 0x00400000 is a quiet NaN obtained from the signaling NaN given as input
Note 1. rnd signifies the user rounding mode from MXCSR, and rz signifies the rounding mode toward zero
(truncate), when rounding a floating-point value to an integer. For more information, refer to Table 9-3 in
Section 9.1.8., “Rounding Control Field”, of Chapter 9, Programming with the Streaming SIMD
Extensions.
Note 2. For NaN encodings, see Table 9-2, Chapter 9, Programming with the Streaming SIMD Extensions.
Note: For denormal encodings, see Table 9-2, Chapter 9, Programming with the Streaming SIMD
Extensions.
typedef struct {
    unsigned int operation;                      // Streaming SIMD Extensions operation: ADDPS, ADDSS, ...
    float operand1_fval;                         // first operand value
    float operand2_fval;                         // second operand value (if any)
    float result_fval;                           // result value (if any)
    unsigned int rounding_mode;                  // rounding mode
    unsigned int exc_masks;                      // exception masks, in the order P, U, O, Z, D, I
    unsigned int exception_cause;                // exception cause
    unsigned int status_flag_inexact;            // inexact status flag
    unsigned int status_flag_underflow;          // underflow status flag
    unsigned int status_flag_overflow;           // overflow status flag
    unsigned int status_flag_divide_by_zero;     // divide by zero status flag
    unsigned int status_flag_denormal_operand;   // denormal operand status flag
    unsigned int status_flag_invalid_operation;  // invalid operation status flag
    unsigned int ftz;                            // flush-to-zero flag
} EXC_ENV;
The arithmetic operations exemplified are emulated as follows:
1. Perform the operation using IA-32 FPU instructions, with exceptions disabled, the original
user rounding mode, and single precision; this will reveal invalid, denormal, or divide-by-
zero exceptions (if there are any); store the result in memory as a double precision value
(whose exponent range is large enough to look like “unbounded” to the result of the single
precision computation).
2. If no unmasked exceptions were detected, determine if the result is tiny (less than the
smallest normal number that can be represented in single precision format), or huge
(greater than the largest normal number that can be represented in single precision format);
if an unmasked overflow or underflow occurs, calculate the scaled result that will be handed
to the user exception handler, as specified by the IEEE-754 Standard for Binary Floating-
Point Computations.
3. If no exception was raised above, calculate the result with “bounded” exponent; if the
result was tiny, it will require denormalization (shifting the significand right while
incrementing the exponent to bring it into the admissible range of [-126,+127] for single
precision floating-point numbers); the result obtained in step 1 above cannot be used
because it might incur a double rounding error (it was rounded to 24 bits in step 1, and
might have to be rounded again in the denormalization process); the way to overcome this
is to calculate the result as a double precision value, and then to store it to memory in
single precision format. Rounding first to 53 bits in the significand, and then to 24, will
never cause a double rounding error (exact properties exist that state when double-
rounding error does not occur, but for the elementary arithmetic operations, the rule of
thumb is that if we round an infinitely precise result to 2p+1 bits and then again to p bits,
the result is the same as when rounding directly to p bits, which means that no double
rounding error occurs).
4. If the result is inexact and the inexact exceptions are unmasked, the result calculated in
step 3 will be delivered to the user floating-point exception handler.
5. Finally, the flush-to-zero case is dealt with if the result is tiny.
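Parts of steps 2, 3, and 5 can be sketched in portable C. This is a minimal sketch under stated assumptions: the caller supplies a finite double precision value (for step 2, the result already rounded to 24 bits with unbounded exponent), the default round-to-nearest mode is in effect, and masked overflow is not modelled. The function names are hypothetical; FLT_MIN and FLT_MAX from float.h stand in for the smallest and largest normal single precision numbers, and the scale factors 2^192 and 2^-192 match the constants used in the listing that follows. Status flag updates are omitted.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

typedef enum { RESULT_IN_RANGE, RESULT_TINY, RESULT_HUGE } RESULT_CLASS;

/* Step 2: compare the magnitude with the single precision normal range.
   Intended for finite inputs only. */
RESULT_CLASS classify_result(double r)
{
    double mag = (r < 0.0) ? -r : r;

    if (mag != 0.0 && mag < (double)FLT_MIN)
        return RESULT_TINY;
    if (mag > (double)FLT_MAX)
        return RESULT_HUGE;
    return RESULT_IN_RANGE;
}

/* Step 2: scaled results handed to the user handler for unmasked exceptions. */
double scale_for_underflow_trap(double r) { return r * 0x1p192; }
double scale_for_overflow_trap(double r)  { return r * 0x1p-192; }

/* Steps 3 and 5: convert the double precision result to single precision
   (a single rounding that also denormalizes a tiny result), then replace a
   tiny result with a zero of the same sign if flush-to-zero is enabled. */
float masked_result(double r, int ftz)
{
    float res = (float)r;

    if (ftz && classify_result(r) == RESULT_TINY)
        res = (r < 0.0) ? -0.0f : 0.0f;
    return res;
}

/* Helper for inspecting the sign of a zero result. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

For instance, 2^-130 is classified as tiny; scaled for an underflow trap it becomes 2^62, while the masked result is the denormal 2^-130, or a signed zero when flush-to-zero is enabled.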
The emulation function returns RAISE_EXCEPTION to the filter function if an exception has
to be raised (the exception_cause field will indicate the cause); otherwise, the emulation
function returns DO_NOT_RAISE_EXCEPTION. In the first case, the result will be provided by
the user exception handler called by the filter function. In the second case, it is provided by the
emulation function. The filter function has to collect all the partial results, and to assemble the
scalar or packed result that will be used if execution is to be continued.
// 32-bit constants
static unsigned ZEROF_ARRAY[] = {0x00000000};
#define ZEROF *(float *) ZEROF_ARRAY                            // +0.0
static unsigned NZEROF_ARRAY[] = {0x80000000};
#define NZEROF *(float *) NZEROF_ARRAY                          // -0.0
static unsigned POSINFF_ARRAY[] = {0x7f800000};
#define POSINFF *(float *)POSINFF_ARRAY                         // +Inf
static unsigned NEGINFF_ARRAY[] = {0xff800000};
#define NEGINFF *(float *)NEGINFF_ARRAY                         // -Inf
// 64-bit constants (low doubleword first)
static unsigned MIN_SINGLE_NORMAL_ARRAY [] = {0x00000000, 0x38100000};
#define MIN_SINGLE_NORMAL *(double *)MIN_SINGLE_NORMAL_ARRAY    // +1.0 * 2^-126
static unsigned MAX_SINGLE_NORMAL_ARRAY [] = {0xe0000000, 0x47efffff};
#define MAX_SINGLE_NORMAL *(double *)MAX_SINGLE_NORMAL_ARRAY    // +1.1...1*2^127
static unsigned TWO_TO_192_ARRAY[] = {0x00000000, 0x4bf00000};
#define TWO_TO_192 *(double *)TWO_TO_192_ARRAY                  // +1.0 * 2^192
static unsigned TWO_TO_M192_ARRAY[] = {0x00000000, 0x33f00000};
#define TWO_TO_M192 *(double *)TWO_TO_M192_ARRAY                // +1.0 * 2^-192
// auxiliary functions
static int isnanf (float f); // returns 1 if f is a NaN, and 0 otherwise
static float quietf (float f); // converts a signaling NaN to a quiet NaN, and
// leaves a quiet NaN unchanged
unsigned int
simd_fp_emulate (EXC_ENV *exc_env)
// have to check first for faults (V, D, Z), and then for traps (O, U, I)
result_tiny = 0;
result_huge = 0;
switch (exc_env->operation) {
case ADDPS:
case ADDSS:
case SUBPS:
case SUBSS:
case MULPS:
case MULSS:
case DIVPS:
case DIVSS:
opd1 = exc_env->operand1_fval;
opd2 = exc_env->operand2_fval;
case ADDPS:
case ADDSS:
// perform the addition
__asm {
fnclex;
// load input operands
fld DWORD PTR opd1; // may set the denormal or invalid status flags
fld DWORD PTR opd2; // may set the denormal or invalid status flags
faddp st(1), st(0); // may set the inexact or invalid status flags
// store result
fstp QWORD PTR dbl_res24; // exact
}
break;
case SUBPS:
case SUBSS:
// perform the subtraction
__asm {
fnclex;
// load input operands
fld DWORD PTR opd1; // may set the denormal or invalid status flags
fld DWORD PTR opd2; // may set the denormal or invalid status flags
fsubp st(1), st(0); // may set the inexact or invalid status flags
// store result
fstp QWORD PTR dbl_res24; // exact
}
break;
case MULPS:
case MULSS:
// perform the multiplication
__asm {
fnclex;
// load input operands
fld DWORD PTR opd1; // may set the denormal or invalid status flags
fld DWORD PTR opd2; // may set the denormal or invalid status flags
fmulp st(1), st(0); // may set the inexact or invalid status flags
// store result
fstp QWORD PTR dbl_res24; // exact
}
break;
case DIVPS:
case DIVSS:
// perform the division
__asm {
fnclex;
// load input operands
fld DWORD PTR opd1; // may set the denormal or invalid status flags
fld DWORD PTR opd2; // may set the denormal or invalid status flags
fdivp st(1), st(0); // may set the inexact, divide by zero, or
// invalid status flags
// store result
fstp QWORD PTR dbl_res24; // exact
}
break;
default:
; // will never occur
// if invalid flag is set, and invalid exceptions are enabled, take trap
if (!(exc_env->exc_masks & INVALID_MASK) && (sw & INVALID_MASK)) {
exc_env->status_flag_invalid_operation = 1;
exc_env->exception_cause = INVALID_OPERATION;
return (RAISE_EXCEPTION);
}
// checking for NaN operands has priority over denormal exceptions; also fix for the
// differences in treating two NaN inputs between the Streaming SIMD Extensions
// instructions and other IA-32 instructions
if (isnanf (opd1) || isnanf (opd2)) {
// if denormal flag is set, and denormal exceptions are enabled, take trap
if (!(exc_env->exc_masks & DENORMAL_MASK) && (sw & DENORMAL_MASK)) {
exc_env->status_flag_denormal_operand = 1;
exc_env->exception_cause = DENORMAL_OPERAND;
return (RAISE_EXCEPTION);
}
// if the underflow traps are enabled and the result is tiny, take
// underflow trap
if (!(exc_env->exc_masks & UNDERFLOW_MASK) && result_tiny) {
dbl_res24 = TWO_TO_192 * dbl_res24; // exact
exc_env->status_flag_underflow = 1;
exc_env->exception_cause = UNDERFLOW;
exc_env->result_fval = (float)dbl_res24; // exact
if (sw & PRECISION_MASK) exc_env->status_flag_inexact = 1;
return (RAISE_EXCEPTION);
}
// overflow trap
if (!(exc_env->exc_masks & OVERFLOW_MASK) && result_huge) {
dbl_res24 = TWO_TO_M192 * dbl_res24; // exact
exc_env->status_flag_overflow = 1;
exc_env->exception_cause = OVERFLOW;
exc_env->result_fval = (float)dbl_res24; // exact
if (sw & PRECISION_MASK) exc_env->status_flag_inexact = 1;
return (RAISE_EXCEPTION);
}
switch (exc_env->operation) {
case ADDPS:
case ADDSS:
// perform the addition
__asm {
// load input operands
fld DWORD PTR opd1; // may set the denormal status flag
fld DWORD PTR opd2; // may set the denormal status flag
faddp st(1), st(0); // rounded to 53 bits, may set the inexact
// status flag
// store result
fstp QWORD PTR dbl_res; // exact, will not set any flag
}
break;
case SUBPS:
case SUBSS:
// perform the subtraction
__asm {
// load input operands
fld DWORD PTR opd1; // may set the denormal status flag
fld DWORD PTR opd2; // may set the denormal status flag
fsubp st(1), st(0); // rounded to 53 bits, may set the inexact
// status flag
// store result
fstp QWORD PTR dbl_res; // exact, will not set any flag
}
break;
case MULPS:
case MULSS:
// perform the multiplication
__asm {
// load input operands
fld DWORD PTR opd1; // may set the denormal status flag
fld DWORD PTR opd2; // may set the denormal status flag
fmulp st(1), st(0); // rounded to 53 bits, exact
// store result
fstp QWORD PTR dbl_res; // exact, will not set any flag
}
break;
case DIVPS:
case DIVSS:
// perform the division
__asm {
// load input operands
fld DWORD PTR opd1; // may set the denormal status flag
fld DWORD PTR opd2; // may set the denormal status flag
fdivp st(1), st(0); // rounded to 53 bits, may set the inexact
// status flag
// store result
fstp QWORD PTR dbl_res; // exact, will not set any flag
}
break;
default:
; // will never occur
// if inexact traps are enabled and result is inexact, take inexact trap
if (!(exc_env->exc_masks & PRECISION_MASK) &&
((sw & PRECISION_MASK) || (exc_env->ftz && result_tiny))) {
exc_env->status_flag_inexact = 1;
exc_env->exception_cause = INEXACT;
if (result_tiny) {
exc_env->status_flag_underflow = 1;
res = ZEROF;
else if (res < 0.0)
res = NZEROF;
// else leave res unchanged
exc_env->status_flag_inexact = 1;
exc_env->status_flag_underflow = 1;
}
exc_env->result_fval = res;
if (sw & ZERODIVIDE_MASK) exc_env->status_flag_divide_by_zero = 1;
if (sw & DENORMAL_MASK) exc_env->status_flag_denormal_operand = 1;
if (sw & INVALID_MASK) exc_env->status_flag_invalid_operation = 1;
return (DO_NOT_RAISE_EXCEPTION);
break;
case CMPPS:
case CMPSS:
...
break;
case COMISS:
case UCOMISS:
...
break;
case CVTPI2PS:
case CVTSI2SS:
...
break;
case CVTPS2PI:
case CVTSS2SI:
case CVTTPS2PI:
case CVTTSS2SI:
...
break;
case MAXPS:
case MAXSS:
case MINPS:
case MINSS:
...
break;
case SQRTPS:
case SQRTSS:
...
break;
case UNSPEC:
...
break;
default:
...
INDEX
Numerics B
16-bit B (default size) flag, segment descriptor .3-14, 4-3
address size . . . . . . . . . . . . . . . . . . . . . . . . .3-4 Base (operand addressing) . . . . . . . . . . .5-9, 5-10
operand size . . . . . . . . . . . . . . . . . . . . . . . . .3-4 Basic execution environment . . . . . . . . . . . . . . 3-2
32-bit B-bit, FPU status word . . . . . . . . . . . . . . . . . . 7-15
address size . . . . . . . . . . . . . . . . . . . . . . . . .3-4 BCD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
operand size . . . . . . . . . . . . . . . . . . . . . . . . .3-4 BCD integers . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
FPU encoding . . . . . . . . . . . . . . . . . . . . . . 7-29
packed. . . . . . . . . . . . . . . . . . . . . . . . .5-5, 6-28
A relationship to status flags. . . . . . . . . . . . . 3-12
AAA instruction. . . . . . . . . . . . . . . . . . . . . . . . .6-28 unpacked. . . . . . . . . . . . . . . . . . . . . . .5-5, 6-28
AAD instruction . . . . . . . . . . . . . . . . . . . . . . . .6-28 BH register . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
AAM instruction . . . . . . . . . . . . . . . . . . . . . . . .6-28 Bias value
AAS instruction. . . . . . . . . . . . . . . . . . . . . . . . .6-28 numeric overflow . . . . . . . . . . . . . . . . . . . . 7-55
AC (alignment check) flag, EFLAGS register. .3-13 numeric underflow. . . . . . . . . . . . . . . . . . . 7-56
Access rights, segment descriptor . . . . . . 4-9, 4-13 Biased exponent. . . . . . . . . . . . . . . . . . . . . . . . 7-5
ADC instruction . . . . . . . . . . . . . . . . . . . . . . . .6-26 Binary numbers . . . . . . . . . . . . . . . . . . . . . . . . 1-7
ADD instruction . . . . . . . . . . . . . . . . . . . . . . . .6-26 Binary-coded decimal (see BCD)
Address size attribute Bit fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
code segment . . . . . . . . . . . . . . . . . . . . . . .3-14 Bit order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
description of . . . . . . . . . . . . . . . . . . . . . . .3-14 BOUND instruction . . . . . . . . . . . . 4-17, 6-39, 6-44
of stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-3 BOUND range exceeded exception (#BR) . . . 4-18
Address sizes. . . . . . . . . . . . . . . . . . . . . . . . . . .3-4 BP register . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Addressing modes Branch prediction . . . . . . . . . . . . . . . . . . . . . . . 2-8
assembler . . . . . . . . . . . . . . . . . . . . . . . . . .5-10 Branching, on FPU condition codes . . . .7-15, 7-38
base . . . . . . . . . . . . . . . . . . . . . . . . . . 5-9, 5-10 BSF instruction . . . . . . . . . . . . . . . . . . . . . . . . 6-34
base plus displacement . . . . . . . . . . . . . . .5-10 BSR instruction. . . . . . . . . . . . . . . . . . . . . . . . 6-34
base plus index plus displacement . . . . . . .5-10 BSWAP instruction . . . . . . . . . . . . . . . . . .6-3, 6-21
base plus index times scale plus BT instruction . . . . . . . . . . . . . . . . 3-10, 3-12, 6-34
displacement . . . . . . . . . . . . . . . . . . . . .5-10 BTC instruction . . . . . . . . . . . . . . . 3-10, 3-12, 6-34
displacement. . . . . . . . . . . . . . . . . . . . . . . . .5-9 BTR instruction . . . . . . . . . . . . . . . 3-10, 3-12, 6-34
effective address. . . . . . . . . . . . . . . . . . . . . .5-9 BTS instruction . . . . . . . . . . . . . . . 3-10, 3-12, 6-34
immediate operands . . . . . . . . . . . . . . . . . . .5-6 Bus interface unit . . . . . . . . . . . . . . . . . . . . . . . 2-9
index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-9 BX register . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
index times scale plus displacement . . . . .5-10 Byte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
memory operands. . . . . . . . . . . . . . . . . . . . .5-7 Byte order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
register operands . . . . . . . . . . . . . . . . . . . . .5-7
scale factor . . . . . . . . . . . . . . . . . . . . . . . . . .5-9
specifying a segment selector . . . . . . . . . . .5-8 C
specifying an offset . . . . . . . . . . . . . . . . . . . .5-9 C1 flag, FPU status word . . 7-13, 7-52, 7-55, 7-57
Addressing, segments . . . . . . . . . . . . . . . . . . . .1-7 C2 flag, FPU status word . . . . . . . . . . . . . . . . 7-13
Advanced programmable interrupt controller Call gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8
(see APIC) CALL instruction . . . 3-14, 4-4, 4-5, 4-9, 6-36, 6-44
AF (adjust) flag, EFLAGS register . . . . . . . . . .3-12 Calls (see Procedure calls)
AH register . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7 CBW instruction . . . . . . . . . . . . . . . . . . . . . . . 6-26
Alignment CDQ instruction . . . . . . . . . . . . . . . . . . . . . . . 6-26
of words, doublewords, and quadwords . . . .5-2 CF (carry) flag, EFLAGS register . . . . . . . . . . 3-12
AND instruction . . . . . . . . . . . . . . . . . . . . . . . .6-29 CH register . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
APIC, presence of . . . . . . . . . . . . . . . . . . . . . .11-2 CLC instruction . . . . . . . . . . . . . . . . . . . .3-12, 6-42
Arctangent, FPU operation. . . . . . . . . . . . . . . .7-38 CLD instruction . . . . . . . . . . . . . . . . . . . .3-13, 6-42
Arithmetic instructions, FPU. . . . . . . . . . . . . . .7-46 CLI instruction. . . . . . . . . . . . . . . . . . . . .6-43, 10-4
Assembler, addressing modes. . . . . . . . . . . . .5-10 CMC instruction . . . . . . . . . . . . . . . . . . .3-12, 6-42
AX register . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7 CMOVcc instructions . . . . . . . . . . . . . . . .6-2, 6-20
N P
NaN P6 family processors
description of . . . . . . . . . . . . . . . . . . . . 7-5, 7-8 microarchitecture. . . . . . . . . . . . . . . . . .2-6, 2-9
encoding of . . . . . . . . . . . . . . . . . . . . . 7-6, 7-27 Packed BCD integers . . . . . . . . . . . . . . . . . . . . 5-5
operating on . . . . . . . . . . . . . . . . . . . . . . . .7-43
RESET pin . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-10 SIMD (single-instruction, multiple-data)
RET instruction. . . . . . . . 3-14, 4-4, 4-5, 6-36, 6-44 execution model . . . . . . . . . . . . . . . . 8-4
Retirement unit. . . . . . . . . . . . . . . . . . . . . . . . .2-13 Sine, FPU operation . . . . . . . . . . . . . . . . . . . . 7-38
Return instruction pointer . . . . . . . . . . . . . . . . . .4-4 Single-precision, IEEE
floating-point format . . . . . . . . .7-25, 9-5
Returns, from procedure calls
exception handler, return from . . . . . . . . . .4-13
far return . . . . . . . . . . . . . . . . . . . . . . . . . . . .4-6
interrupt handler, return from . . . . . . . . . . .4-13
inter-privilege level return . . . . . . . . . . . . . .4-10
near return . . . . . . . . . . . . . . . . . . . . . . . . . .4-5
RF (resume) flag, EFLAGS register . . . . . . . . .3-13
ROL instruction . . . . . . . . . . . . . . . . . . . . . . . .6-33
ROR instruction . . . . . . . . . . . . . . . . . . . . . . . .6-33
Rounding
control, RC field of FPU control word . 7-18, 9-8
modes, FPU . . . . . . . . . . . . . . . . . . . . 7-18, 9-8
results, FPU . . . . . . . . . . . . . . . . . . . . . . . .7-19
RSM instruction . . . . . . . . . . . . . . . . . . . . . . . . .6-2
S
SAHF instruction . . . . . . . . . . . . . . . . . . 3-10, 6-42
SAL instruction . . . . . . . . . . . . . . . . . . . . . . . . .6-29
SAR instruction . . . . . . . . . . . . . . . . . . . . . . . .6-30
Saturation arithmetic (MMX instructions) . . . . . .8-6
Saving the FPU state . . . . . . . . . . . . . . . . . . . .7-21
SBB instruction. . . . . . . . . . . . . . . . . . . . . . . . .6-26
Scale (operand addressing) . . . . . . . . . . . 5-9, 5-10
Scale, FPU operation . . . . . . . . . . . . . . . . . . . .7-40
Scaling bias value . . . . . . . . . . . . . . . . . 7-55, 7-56
SCAS instruction . . . . . . . . . . . . . . . . . . 3-13, 6-40
Segment registers
description of . . . . . . . . . . . . . . . . . . . . 3-5, 3-7
Segment selector
description of . . . . . . . . . . . . . . . . . . . . 3-3, 3-7
specifying . . . . . . . . . . . . . . . . . . . . . . . . . . .5-8
Segmented addressing . . . . . . . . . . . . . . . . . . .1-7
Segmented memory model . . . . . . . . . . . . 3-3, 3-8
Segments
defined . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-3
maximum number . . . . . . . . . . . . . . . . . . . . .3-3
Serialization of I/O instructions. . . . . . . . . . . . .10-6
SETcc instructions . . . . . . . . . . . . . . . . . 3-12, 6-34
SF (sign) flag, EFLAGS register. . . . . . . . . . . .3-12
SF (stack fault) flag, FPU status word . . 7-15, 7-52
SHL instruction. . . . . . . . . . . . . . . . . . . . . . . . .6-29
SHLD instruction . . . . . . . . . . . . . . . . . . . . . . .6-32
SHR instruction . . . . . . . . . . . . . . . . . . . . . . . .6-29
SHRD instruction . . . . . . . . . . . . . . . . . . . . . . .6-32
SI register. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3-7
Signaling NaN (see SNaN)
Signed infinity. . . . . . . . . . . . . . . . . . . . . . . . . . .7-8
Signed zero . . . . . . . . . . . . . . . . . . . . . . . . . . . .7-6
Significand
of floating-point number . . . . . . . . . . . . . . . .7-4
Sign, floating-point number . . . . . . . . . . . . . . . .7-4
Single-real floating-point format . . . . . . . .7-25, 9-5
SNaN
description of. . . . . . . . . . . . . . . . . . . . . . . . 7-8
operating on . . . . . . . . . . . . . . . . . . . . . . . 7-43
typical uses of . . . . . . . . . . . . . . . . . . . . . . 7-43
SP register . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
Speculative execution. . . . . . . . . . . . . . . . .2-7, 2-8
SS register . . . . . . . . . . . . . . . . . . . . . 3-7, 3-9, 4-1
Stack alignment . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Stack fault, FPU . . . . . . . . . . . . . . . . . . . . . . . 7-15
Stack overflow and underflow
exceptions (#IS), FPU . . . . . . . . . . 7-52
Stack overflow exception, FPU. . . . . . . .7-13, 7-51
Stack pointer (ESP register) . . . . . . . . . . . . . . . 4-1
Stack segment . . . . . . . . . . . . . . . . . . . . . . . . . 3-9
Stack switching
on calls to interrupt and
exception handlers. . . . . . . . . . . . . . . . 4-13
on inter-privilege level calls . . . . . . . .4-10, 4-16
Stack underflow exception, FPU . . . . . .7-13, 7-51
Stack (see Procedure stack)
Stack-frame base pointer, EBP register . . . . . . 4-4
Status flags, EFLAGS register . . 3-12, 7-15, 7-16,
7-37
STC instruction . . . . . . . . . . . . . . . . . . . .3-12, 6-42
STD instruction . . . . . . . . . . . . . . . . . . . .3-13, 6-42
STI instruction. . . . . . . . . . . . . . . . 6-42, 6-43, 10-4
STOS instruction . . . . . . . . . . . . . . . . . .3-13, 6-40
Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-5
ST(0), top-of-stack register. . . . . . . . . . . . . . . 7-11
SUB instruction. . . . . . . . . . . . . . . . . . . . . . . . 6-26
Superscalar . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-7
Synchronization, of floating-point exceptions . 7-58
System flags, EFLAGS register . . . . . . . . . . . 3-13
System management mode (SMM) . . . . . . . . . 3-4
T
Tangent, FPU operation . . . . . . . . . . . . . . . . . 7-38
Task gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-17
Task state segment (see TSS)
Tasks
exception handler . . . . . . . . . . . . . . . . . . . 4-17
interrupt handler . . . . . . . . . . . . . . . . . . . . 4-17
TEST instruction . . . . . . . . . . . . . . . . . . . . . . . 6-35
TF (trap) flag, EFLAGS register . . . . . . . . . . . 3-13
Tiny number . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-7
TOP (stack TOP) field, FPU status word . . . . . 7-9
Transcendental instruction accuracy . . . . . . . 7-40
Trap gate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-13
TSS
I/O map base. . . . . . . . . . . . . . . . . . . . . . . 10-5
I/O permission bit map . . . . . . . . . . . . . . . 10-5
U
UD2 instruction. . . . . . . . . . . . . . . . . . . . . 6-2, 6-45
UE (numeric underflow exception) flag,
FPU status word . . . . . . . . . . 7-14, 7-56
Underflow, FPU exception
(see Numeric underflow exception)
Underflow, FPU stack . . . . . . . . . . . . . . 7-51, 7-52
Underflow, numeric . . . . . . . . . . . . . . . . . . . . . .7-7
Un-normal number . . . . . . . . . . . . . . . . . . . . . .7-30
Unsigned integers . . . . . . . . . . . . . 5-5, 6-26, 6-27
Unsupported floating-point formats . . . . . . . . .7-30
Unsupported FPU instructions . . . . . . . . . . . . .7-43
V
Vector (see Interrupt vector)
VIF (virtual interrupt) flag, EFLAGS register . .3-13
VIP (virtual interrupt pending) flag,
EFLAGS register . . . . . . . . . . . . . . .3-13
Virtual 8086 mode
description of . . . . . . . . . . . . . . . . . . . . . . .3-13
memory model . . . . . . . . . . . . . . . . . . . . . . .3-4
VM (virtual 8086 mode) flag, EFLAGS register 3-13
W
Waiting instructions . . . . . . . . . . . . . . . . . . . . .7-42
WAIT/FWAIT instructions. . . . . . . . . . . . 7-42, 7-59
WBINVD instruction . . . . . . . . . . . . . . . . . . . . . .6-3
Word. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5-1
Wraparound mode (MMX instructions) . . . . . . .8-6
WRMSR instruction . . . . . . . . . . . . . . . . . 6-2, 11-2
X
XADD instruction . . . . . . . . . . . . . . . . . . . 6-3, 6-22
XCHG instruction . . . . . . . . . . . . . . . . . . . . . . .6-21
XLAT/XLATB instruction . . . . . . . . . . . . . . . . .6-45
XOR instruction . . . . . . . . . . . . . . . . . . . . . . . .6-29
Z
ZE (division-by-zero exception) flag,
FPU status word . . . . . . . . . . . . . . .7-14
Zero, floating-point format . . . . . . . . . . . . . . . . .7-6
ZF (zero) flag, EFLAGS register . . . . . . . . . . .3-12