Tutorial - Floating-Point Binary
Tutorial - Floating-Point Binary
htm
1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. Also called single
IEEE Short Real: 32 bits
precision.
1 bit for the sign, 11 bits for the exponent, and 52 bits for the mantissa. Also called
IEEE Long Real: 64 bits
double precision.
Both formats use essentially the same method for storing floating-point binary numbers, so we will use
the Short Real as an example in this tutorial. The bits in an IEEE Short Real are arranged as follows, with
the most significant bit (MSB) on the left:
Fig.
1
The Sign
The sign of a binary floating-point number is represented by a single bit. A 1 bit indicates a negative
number, and a 0 bit indicates a positive number.
The Mantissa
It is useful to consider the way decimal floating-point numbers represent their mantissa. Using -3.154 x
105 as an example, the sign is negative, the mantissa is 3.154, and the exponent is 5. The fractional
portion of the mantissa is the sum of each digit multiplied by a power of 10:
A binary floating-point number is similar. For example, in the number +11.1011 x 23, the sign is positive,
the mantissa is 11.1011, and the exponent is 3. The fractional portion of the mantissa is the sum of
successive powers of 2. In our example, it is expressed as:
Or, you can calculate this value as 1011 divided by 24. In decimal terms, this is eleven divided by
sixteen, or 0.6875. Combined with the left-hand side of 11.1011, the decimal value of the number is
3.6875. Here are additional examples:
The last entry in this table shows the smallest fraction that can be stored in a 23-bit mantissa. The
following table shows a few simple examples of binary floating-point numbers alongside their equivalent
decimal fractions and decimal values:
Decimal
Binary Decimal Fraction
Value
.1 1/2 .5
.01 1/4 .25
.001 1/8 .125
1 of 4 2/19/2013 12:07 AM
Tutorial: Floating-Point Binary http://kipirvine.com/asm/workbook/floating_tut.htm
The Exponent
IEEE Short Real exponents are stored as 8-bit unsigned integers with a bias of 127. Let's use the number
1.101 x 25 as an example. The exponent (5) is added to 127 and the sum (132) is binary 10000100. Here
are some examples of exponents, first shown in decimal, then adjusted, and finally in unsigned binary:
Adjusted
Exponent (E) Binary
(E + 127)
+5 132 10000100
0 127 01111111
-10 117 01110101
+128 255 11111111
-127 0 00000000
-1 126 01111110
The binary exponent is unsigned, and therefore cannot be negative. The largest possible exponent is
128-- when added to 127, it produces 255, the largest unsigned value represented by 8 bits. The
approximate range is from 1.0 x 2-127 to 1.0 x 2+128 .
Similarly, the floating-point binary value 1101.101 is normalized as 1.101101 x 23 by moving the decimal
point 3 positions to the left, and multiplying by 23. Here are some examples of normalizations:
You may have noticed that in a normalized mantissa, the digit 1 always appears to the left of the decimal
point. In fact, the leading 1 is omitted from the mantissa in the IEEE storage format because it is
redundant.
Biased Exponent
Binary Value Sign, Exponent, Mantissa
-1.11 127 1 01111111 11000000000000000000000
2 of 4 2/19/2013 12:07 AM
Tutorial: Floating-Point Binary http://kipirvine.com/asm/workbook/floating_tut.htm
Of course, the real world is never so simple. A fraction such as 1/5 (0.2) must be represented by a sum
of fractions whose denominators are powers of 2. Here is the output from a program that subtracts each
succesive fraction from 0.2 and shows each remainder. In fact, an exact value is not found after creating
the 23 mantissa bits. The result, however, is accurate to 7 digits. The blank lines are for fractions that
were too large to be subtracted from the remaining value of the number. Bit 1, for example, was equal to
.5 (1/2), which could not be subtracted from 0.2.
starting: 0.200000000000
1
2
3 subtracting 0.125000000000
remainder = 0.075000000000
4 subtracting 0.062500000000
remainder = 0.012500000000
5
6
7 subtracting 0.007812500000
remainder = 0.004687500000
8 subtracting 0.003906250000
remainder = 0.000781250000
9
10
11 subtracting 0.000488281250
remainder = 0.000292968750
12 subtracting 0.000244140625
remainder = 0.000048828125
13
14
15 subtracting 0.000030517578
remainder = 0.000018310547
3 of 4 2/19/2013 12:07 AM
Tutorial: Floating-Point Binary http://kipirvine.com/asm/workbook/floating_tut.htm
16 subtracting 0.000015258789
remainder = 0.000003051758
17
18
19 subtracting 0.000001907349
remainder = 0.000001144409
20 subtracting 0.000000953674
remainder = 0.000000190735
21
22
23 subtracting 0.000000119209
remainder = 0.000000071526
Mantissa: .00110011001100110011001
4 of 4 2/19/2013 12:07 AM