The IEEE Standard For Floating Point Arithmetic
The IEEE (Institute of Electrical and Electronics Engineers) has produced a standard for
floating point arithmetic. This standard specifies how single precision (32 bit) and double
precision (64 bit) floating point numbers are to be represented, as well as how arithmetic
should be carried out on them.
Because many of our users may have occasion to transfer unformatted or "binary" data
between an IEEE machine and the Cray or the VAX/VMS, it is worth noting the details of
this format for comparison with the Cray and VAX representations. The differences in the
formats also affect the accuracy of floating point computations.
Summary:
Single Precision
The IEEE single precision floating point standard representation requires a 32 bit word,
which may be represented as numbered from 0 to 31, left to right. The first bit is the sign
bit, S, the next eight bits are the exponent bits, 'E', and the final 23 bits are the fraction 'F':
S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                    31
In particular,
0 00000000 00000000000000000000000 = 0
1 00000000 00000000000000000000000 = -0
0 11111111 00000000000000000000000 = Infinity
1 11111111 00000000000000000000000 = -Infinity
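These patterns are easy to verify with a few lines of C. The sketch below (assuming float maps to IEEE single precision on your machine, and using C99's INFINITY macro from <math.h>) dumps the raw bits of each value as hex:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Dump the raw 32-bit pattern of a float as hex. */
    static void show(const char *label, float x) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* well-defined way to view the bits */
        printf("%-4s = 0x%08X\n", label, (unsigned)bits);
    }

    int main(void) {
        show("+0",  0.0f);       /* 0x00000000 */
        show("-0", -0.0f);       /* 0x80000000 */
        show("+Inf",  INFINITY); /* 0x7F800000 */
        show("-Inf", -INFINITY); /* 0xFF800000 */
        return 0;
    }

The sign bit flips the leading hex digit between 0 and 8, and the all-1s exponent of the infinities shows up as the 7F8/FF8 prefix.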
Double Precision
The IEEE double precision floating point standard representation requires a 64 bit word,
which may be represented as numbered from 0 to 63, left to right. The first bit is the sign
bit, S, the next eleven bits are the exponent bits, 'E', and the final 52 bits are the fraction
'F':
S EEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
0 1        11 12                                                63
IEEE Standard 754 floating point is the most common representation today for real
numbers on computers, including Intel-based PC's, Macintoshes, and most Unix
platforms. This article gives a brief overview of IEEE floating point and its
representation. Discussion of arithmetic implementation may be found in the book
mentioned at the bottom of this article.
Storage Layout
IEEE floating point numbers have three basic components: the sign, the exponent, and
the mantissa. The mantissa is composed of the fraction and an implicit leading digit
(explained below). The exponent base (2) is implicit and need not be stored.
The following figure shows the layout for single (32-bit) and double (64-bit) precision
floating-point values. The number of bits for each field is shown (bit ranges are in
square brackets):
                    Sign     Exponent     Fraction     Bias
Single Precision    1 [31]   8 [30-23]    23 [22-00]   127
Double Precision    1 [63]   11 [62-52]   52 [51-00]   1023
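As a concrete check of this layout, here's a small C sketch (assuming float maps to the IEEE single format, as it does on essentially all current hardware) that pulls the three fields out of a value with shifts and masks:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float x = -6.25f;               /* -6.25 = -1.5625 * 2^2 */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits); /* view the float's raw bits */

        unsigned sign     = bits >> 31;          /* bit  [31]    */
        unsigned exponent = (bits >> 23) & 0xFF; /* bits [30-23] */
        unsigned fraction = bits & 0x7FFFFF;     /* bits [22-00] */

        /* expect: sign=1, exponent=129 (2 + bias of 127), fraction=0x480000 */
        printf("sign=%u exponent=%u fraction=0x%06X\n",
               sign, exponent, fraction);
        return 0;
    }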
The Exponent
The exponent field needs to represent both positive and negative exponents. To do this, a
bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, an exponent of zero means that 127 is stored in
the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73. For
reasons discussed later, exponents of -127 (all 0s) and +128 (all 1s) are reserved for
special numbers.
For double precision, the exponent field is 11 bits, and has a bias of 1023.
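To see the bias in action (a minimal check, under the same IEEE assumptions as above), read back the stored exponent field for 1.0, whose actual exponent is zero:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        float  f = 1.0f;   /* 1.0 * 2^0, so the actual exponent is 0 */
        double d = 1.0;
        uint32_t fbits;
        uint64_t dbits;
        memcpy(&fbits, &f, sizeof fbits);
        memcpy(&dbits, &d, sizeof dbits);
        /* stored = actual + bias: expect 127 and 1023 */
        printf("float  exponent field: %u\n", (unsigned)((fbits >> 23) & 0xFF));
        printf("double exponent field: %u\n", (unsigned)((dbits >> 52) & 0x7FF));
        return 0;
    }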
The Mantissa
The mantissa, also known as the significand, represents the precision bits of the number.
It is composed of an implicit leading bit and the fraction bits.
To find out the value of the implicit leading bit, consider that any number can be
expressed in scientific notation in many different ways. For example, the number five can
be represented as any of these:
5.00 × 10^0
0.05 × 10^2
5000 × 10^-3
In normalized form, the radix point is placed just after the first non-zero digit, so five is written 5.0 × 10^0. In base two, the only possible non-zero digit is 1, which means a normalized binary mantissa always starts with a leading 1. That bit can therefore be assumed rather than stored, and the mantissa gets an effective 24 bits of precision out of 23 stored fraction bits.
Even so, a floating-point value usually only approximates the real number it stands for; 24 bits of precision cannot represent every value exactly. On the other hand, besides the ability to represent fractional components (which integers lack completely), a floating-point value can represent numbers around 2^127, compared to a 32-bit integer's maximum value of around 2^32.
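Here's a short demonstration of that trade-off between range and precision (a sketch, assuming IEEE single precision): 2^24 + 1 = 16777217 has no exact 24-bit representation, so it rounds back to 2^24:

    #include <stdio.h>

    int main(void) {
        /* 2^24 = 16777216 is exactly representable; 2^24 + 1 is not,
           because the mantissa carries only 24 significant bits. */
        float a = 16777216.0f;   /* 2^24                          */
        float b = 16777217.0f;   /* 2^24 + 1 rounds back to 2^24  */
        printf("%.1f %.1f equal=%d\n", a, b, a == b);   /* equal=1 */
        return 0;
    }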
The range of positive floating point numbers can be split into normalized numbers (which
preserve the full precision of the mantissa), and denormalized numbers (discussed later)
which use only a portion of the fraction's precision.
                    Denormalized                      Normalized                       Approximate Decimal
Single Precision    2^-149 to (1-2^-23) × 2^-126      2^-126 to (2-2^-23) × 2^127      ~10^-44.85 to ~10^38.53
Double Precision    2^-1074 to (1-2^-52) × 2^-1022    2^-1022 to (2-2^-52) × 2^1023    ~10^-323.3 to ~10^308.3
Since the sign of floating point numbers is given by a special leading bit, the range for
negative numbers is given by the negation of the above values.
There are five distinct numerical ranges that single-precision floating-point numbers are
not able to represent:
1. Negative numbers less than -(2-2^-23) × 2^127 (negative overflow)
2. Negative numbers greater than -2^-149 (negative underflow)
3. Zero
4. Positive numbers less than 2^-149 (positive underflow)
5. Positive numbers greater than (2-2^-23) × 2^127 (positive overflow)
Overflow means that values have grown too large for the representation, much in the
same way that you can overflow integers. Underflow is a less serious problem because it
just denotes a loss of precision, which is guaranteed to be closely approximated by zero.
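Both effects are easy to provoke. This sketch uses FLT_MAX and FLT_TRUE_MIN (the smallest denormal, standardized in C11) from <float.h>, and assumes IEEE semantics with the default round-to-nearest mode:

    #include <float.h>
    #include <stdio.h>

    int main(void) {
        float big  = FLT_MAX;       /* ~3.4e38, largest finite single   */
        float tiny = FLT_TRUE_MIN;  /* 2^-149, smallest denormal (C11)  */

        printf("overflow : %g * 2 -> %g\n", big,  big * 2.0f);  /* -> inf */
        printf("underflow: %g / 2 -> %g\n", tiny, tiny / 2.0f); /* -> 0   */
        return 0;
    }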
Here's a table of the effective range (excluding infinite values) of IEEE floating-point
numbers:
         Binary                   Decimal
Single   ±(2-2^-23) × 2^127       ~ ±10^38.53
Double   ±(2-2^-52) × 2^1023      ~ ±10^308.25
Note that the extreme values occur (regardless of sign) when the exponent is at the
maximum value for finite numbers (2^127 for single-precision, 2^1023 for double), and the
mantissa is filled with 1s (including the normalizing 1 bit).
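You can confirm that the formula matches the library constant; the arithmetic below is carried out in double, where (2-2^-23) × 2^127 is exactly representable:

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        /* maximum exponent (127) with the mantissa all 1s: (2 - 2^-23) * 2^127 */
        double max = (2.0 - ldexp(1.0, -23)) * ldexp(1.0, 127);
        printf("formula: %.8e\n", max);
        printf("FLT_MAX: %.8e\n", (double)FLT_MAX);
        printf("equal  : %d\n", max == (double)FLT_MAX);  /* 1 */
        return 0;
    }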
Special Values
IEEE reserves exponent field values of all 0s and all 1s to denote special values in the
floating-point scheme.
Zero
As mentioned above, zero is not directly representable in the straight format, due to the
assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of
zero). Zero is a special value denoted with an exponent field of zero and a fraction field
of zero. Note that -0 and +0 are distinct values, though they both compare as equal.
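This sketch shows both behaviors, using C99's signbit() macro to tell the two zeros apart (the divisions assume IEEE semantics, where a finite number divided by zero yields a signed infinity):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float pz = 0.0f, nz = -0.0f;
        printf("+0 == -0   : %d\n", pz == nz);                   /* 1 */
        printf("signbit(+0): %d\n", signbit(pz) != 0);           /* 0 */
        printf("signbit(-0): %d\n", signbit(nz) != 0);           /* 1 */
        printf("1/+0 = %g, 1/-0 = %g\n", 1.0f / pz, 1.0f / nz);  /* inf, -inf */
        return 0;
    }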
Denormalized
If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero),
then the value is a denormalized number, which does not have an assumed leading 1
before the binary point. Thus, this represents a number (-1)^s × 0.f × 2^-126, where s is the
sign bit and f is the fraction. For double precision, denormalized numbers are of the form
(-1)^s × 0.f × 2^-1022. From this you can interpret zero as a special type of denormalized
number.
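Here's a sketch that builds the smallest single-precision denormal straight from its bit pattern (exponent field all 0s, fraction field 1); it should print about 1.4e-45, which is 2^-149:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        uint32_t bits = 0x00000001;  /* exponent field all 0s, fraction = 1 */
        float smallest;
        memcpy(&smallest, &bits, sizeof smallest);
        printf("%g\n", smallest);    /* ~1.4013e-45, i.e. 2^-149 */
        return 0;
    }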
Infinity
The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of
all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being
able to denote infinity as a specific value is useful because it allows operations to
continue past overflow situations. Operations with infinite values are well defined in
IEEE floating point.
Not A Number
The value NaN (Not a Number) is used to represent a value that does not represent a real
number. NaN's are represented by a bit pattern with an exponent of all 1s and a non-zero
fraction. There are two categories of NaN: QNaN (Quiet NaN) and SNaN (Signalling
NaN).
A QNaN is a NaN with the most significant fraction bit set. QNaN's propagate freely
through most arithmetic operations. These values pop out of an operation when the result
is not mathematically defined.
An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an
exception when used in operations. SNaN's can be handy to assign to uninitialized
variables to trap premature usage.
Semantically, QNaN's denote indeterminate operations, while SNaN's denote invalid
operations.
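The sketch below produces a QNaN from 0/0 (an invalid operation) and shows the two properties worth remembering: a NaN compares unequal to everything, including itself, and it propagates through arithmetic. Strictly speaking, C only guarantees this behavior under IEEE (Annex F) semantics:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        float zero = 0.0f;
        float qnan = zero / zero;   /* invalid operation -> quiet NaN */

        printf("isnan(qnan)  : %d\n", isnan(qnan) != 0);  /* 1 */
        printf("qnan == qnan : %d\n", qnan == qnan);      /* 0: NaN is unordered */
        printf("qnan + 1.0   : %f\n", qnan + 1.0f);       /* nan: QNaNs propagate */
        return 0;
    }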
Special Operations
Operations on special numbers are well-defined by IEEE. In the simplest case, any
operation with a NaN yields a NaN result. Other operations are as follows:
Operation               Result
n × Infinity            Infinity
Infinity + Infinity     Infinity
0 ÷ 0                   NaN
Infinity - Infinity     NaN
Infinity ÷ Infinity     NaN
Infinity × 0            NaN
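These rows are straightforward to reproduce in C (assuming IEEE semantics; INFINITY is C99's <math.h> macro):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double inf = INFINITY, zero = 0.0;
        printf("3 * Infinity        = %g\n", 3.0 * inf);   /* inf */
        printf("Infinity + Infinity = %g\n", inf + inf);   /* inf */
        printf("0 / 0               = %g\n", zero / zero); /* nan */
        printf("Infinity - Infinity = %g\n", inf - inf);   /* nan */
        printf("Infinity / Infinity = %g\n", inf / inf);   /* nan */
        printf("Infinity * 0        = %g\n", inf * zero);  /* nan */
        return 0;
    }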
Summary
To sum up, the following are the corresponding values for a given representation:
Float Values (b = bias)
Sign   Exponent (e)      Fraction (f)       Value
0      00..00            00..00             +0
0      00..00            00..01 : 11..11    Positive Denormalized Real (0.f × 2^(-b+1))
0      00..01 : 11..10   XX..XX             Positive Normalized Real (1.f × 2^(e-b))
0      11..11            00..00             +Infinity
0      11..11            00..01 : 01..11    SNaN
0      11..11            10..00 : 11..11    QNaN
1      00..00            00..00             -0
1      00..00            00..01 : 11..11    Negative Denormalized Real (-0.f × 2^(-b+1))
1      00..01 : 11..10   XX..XX             Negative Normalized Real (-1.f × 2^(e-b))
1      11..11            00..00             -Infinity
1      11..11            00..01 : 01..11    SNaN
1      11..11            10..00 : 11..11    QNaN
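The table translates almost line-for-line into code. Here's a sketch of a classifier driven by the exponent and fraction fields of a single-precision value; the standard library's fpclassify() macro does the same job portably:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Classify a single-precision value from its exponent and fraction
       fields, following the table above. */
    static const char *classify(float x) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        uint32_t e = (bits >> 23) & 0xFFu;  /* exponent field */
        uint32_t f = bits & 0x7FFFFFu;      /* fraction field */

        if (e == 0)    return f == 0 ? "zero" : "denormalized";
        if (e == 0xFF) return f == 0 ? "infinity" : "NaN";
        return "normalized";
    }

    int main(void) {
        float zero = 0.0f;
        printf("%s %s %s %s\n",
               classify(zero), classify(1.0f),
               classify(1e-45f), classify(zero / zero));
        /* prints: zero normalized denormalized NaN */
        return 0;
    }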
References
A lot of this stuff was observed from small programs I wrote to go back and forth
between hex and floating point (printf-style), and to examine the results of various
operations. The bulk of this material, however, was lifted from Stallings' book.
1. Computer Organization and Architecture, William Stallings, pp. 222-234,
Macmillan Publishing Company, ISBN 0-02-415480-6.
2. IEEE Computer Society (1985), IEEE Standard for Binary Floating-Point
Arithmetic, IEEE Std 754-1985.
3. Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture (a
PDF document downloaded from intel.com).