IEEE Arithmetic
1.1 Definitions
Bit = 0 or 1
Byte = 8 bits
Word = Reals:    4 bytes (single precision)
                 8 bytes (double precision)
     = Integers: 1, 2, 4, or 8 byte signed
                 1, 2, 4, or 8 byte unsigned
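Decimal: $10^{3} \quad 10^{2} \quad 10^{1} \quad 10^{0} \quad 10^{-1} \quad 10^{-2} \quad 10^{-3} \quad 10^{-4}$
Binary: $2^{3} \quad 2^{2} \quad 2^{1} \quad 2^{0} \quad 2^{-1} \quad 2^{-2} \quad 2^{-3} \quad 2^{-4}$
Hexadecimal: $\{0, 1, 2, 3, \ldots, 9, 10, 11, 12, 13, 14, 15\} = \{0, 1, 2, 3, \ldots, 9, a, b, c, d, e, f\}$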
 s |    e    |     f
 0 | 1 ... 8 | 9 ... 31     (bit positions)
where s = sign
e = biased exponent
p = e − 127 = exponent
1.f = significand (use binary point)
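The field layout above can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the helper name `decode_single` is hypothetical, and the reconstruction formula in the last line assumes a normal number (0 < e < 255).

```python
import struct

def decode_single(x):
    """Round x to single precision and split the 32-bit word into (s, e, f)."""
    # Pack as a big-endian float32, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", float(x)))
    s = bits >> 31             # 1 sign bit
    e = (bits >> 23) & 0xFF    # 8-bit biased exponent
    f = bits & 0x7FFFFF        # 23-bit fraction
    return s, e, f

s, e, f = decode_single(-6.5)
p = e - 127                                        # unbiased exponent
value = (-1) ** s * 2.0 ** p * (1 + f / 2 ** 23)   # (-1)^s * 2^p * 1.f
print(s, e, p, hex(f), value)
```

Here −6.5 = −1.101 (binary) × 2², so the sketch should report s = 1, p = 2, and a fraction whose leading bits are 101.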
1.7 Special numbers
Subnormal numbers
Allow 1.f → 0.f (in software)
Smallest positive number $= 0.0000\,0000 \cdots 0001 \times 2^{-126} = 2^{-23} \times 2^{-126} \approx 1.4 \times 10^{-45}$
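As a quick check of this value, the sketch below builds the single-precision word whose only nonzero bit is the last fraction bit and reads it back as a number. It is an illustration only; the variable name `smallest` is an assumption, not something from the notes.

```python
import struct

# Bit pattern with s = 0, e = 0000 0000, and f = 00...01 (only the last bit set).
(smallest,) = struct.unpack(">f", struct.pack(">I", 0x00000001))

print(smallest)                  # roughly 1.4e-45
print(smallest == 2.0 ** -149)   # 2^-23 * 2^-126 = 2^-149
```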
Repeatedly doubling the fractional part of 1/3 and recording the integer part generates its binary digits:

0.333...    0.
0.666...    0.0
1.333...    0.01
0.666...    0.010
1.333...    0.0101
etc.
With $1/3 = 2^{-2} \times 1.0101\ldots$, the 23-bit fraction is
$$f = 01010101010101010101011,$$
where we have rounded to the nearest binary number (here, rounded up). The
machine number 1/3 is then represented as
00111110101010101010101010101011
or in hex
3eaaaaab.
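Both the digit-by-digit conversion and the final machine word can be reproduced with a short Python sketch. It uses only the standard library, and the variable names are illustrative assumptions. Exact rational arithmetic (`Fraction`) is used for the doubling loop so that no rounding enters before the final conversion to single precision.

```python
import struct
from fractions import Fraction

# Doubling method: the integer part of 2x is the next binary digit of x.
x = Fraction(1, 3)
digits = []
for _ in range(8):
    x *= 2
    digit = int(x >= 1)
    digits.append(str(digit))
    x -= digit
binary_digits = "".join(digits)
print("0." + binary_digits + "...")

# Machine representation: round 1/3 to single precision and show the word.
raw = struct.pack(">f", 1 / 3)
hex_word = raw.hex()
bit_string = format(int.from_bytes(raw, "big"), "032b")
print(bit_string, hex_word)
```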
$$N - 2 = 2^{23} \times 1.11\ldots1,$$
and
$$N - 1 = 2^{24} \times 1.00\ldots0.$$
The integer $N$ would then require a one-bit in the $2^{-24}$ position, which is not
available. Therefore, the smallest positive integer that is not exact is $2^{24} + 1 = 16\,777\,217$.
In MATLAB, single(2^24) has the same value as single(2^24 + 1). Since
$2^{24} + 1$ is exactly halfway between the two consecutive machine numbers
$2^{24}$ and $2^{24} + 2$, MATLAB rounds to the number with a final zero-bit in f, which
is $2^{24}$.
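The same rounding behavior can be observed outside MATLAB. The sketch below is a Python stand-in for `single(...)`, under the assumption that `struct` rounds to nearest with ties to the even fraction when packing a float32. The helper name `to_single` is hypothetical.

```python
import struct

def to_single(x):
    """Round x to the nearest single-precision number, returned as a Python float."""
    return struct.unpack(">f", struct.pack(">f", float(x)))[0]

n = 2 ** 24
print(to_single(n - 1) == n - 1)   # 2^24 - 1 is represented exactly
print(to_single(n + 1) == n)       # the tie at 2^24 + 1 rounds to 2^24
print(to_single(n + 2) == n + 2)   # 2^24 + 2 is the next machine number
```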
1.10 Machine epsilon
Find $\epsilon_{\text{mach}}$
The number 1 in the IEEE format is written as
$$1 = 2^{0} \times 1.000\ldots0,$$
with 23 0's following the binary point. The number just larger than 1 has a 1
in the 23rd position after the binary point. Therefore,
$$\epsilon_{\text{mach}} = 2^{-23} \approx 1.192 \times 10^{-7}.$$
What is the distance between 1 and the number just smaller than 1? Here,
the number just smaller than one can be written as
$$2^{-1} \times 1.111\ldots1 = 2^{-1}\left(1 + \left(1 - 2^{-23}\right)\right) = 1 - 2^{-24}.$$
Therefore, this distance is $2^{-24} = \epsilon_{\text{mach}}/2$.
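Both distances can be measured directly by stepping the single-precision bit pattern of 1 up and down by one unit. This is a minimal sketch; the helper names are hypothetical, and the differences are taken in Python's double precision, which represents these values exactly.

```python
import struct

def single_to_bits(x):
    """Return the 32-bit word of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", float(x)))[0]

def bits_to_single(bits):
    """Return the single-precision number stored in a 32-bit word."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

one = single_to_bits(1.0)          # 0x3f800000
above = bits_to_single(one + 1)    # machine number just larger than 1
below = bits_to_single(one - 1)    # machine number just smaller than 1

eps_mach = above - 1.0
gap_below = 1.0 - below
print(eps_mach, gap_below)
```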
The spacing between numbers is uniform between powers of 2, with logarithmic
spacing of the powers of 2. That is, the spacing of numbers between 1 and
2 is $2^{-23}$, between 2 and 4 is $2^{-22}$, between 4 and 8 is $2^{-21}$, etc. This spacing
changes for denormal numbers, where the spacing is uniform all the way down
to zero.
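The spacing pattern can be tabulated with a small sketch that steps to the next single-precision number. The function names are assumptions introduced for illustration, and the inputs are restricted to nonnegative values that are exactly representable in single precision.

```python
import struct

def next_up_single(x):
    """Next single-precision number above x, for finite x >= 0."""
    (bits,) = struct.unpack(">I", struct.pack(">f", float(x)))
    return struct.unpack(">f", struct.pack(">I", bits + 1))[0]

def spacing_single(x):
    """Gap between x and the next larger single-precision number."""
    return next_up_single(x) - x

# Normal numbers: the gap doubles at each power of 2.
for k in range(4):
    print(2.0 ** k, spacing_single(2.0 ** k))

# Denormal range: the gap stays at 2^-149 all the way down to zero.
for x in (0.0, 2.0 ** -140, 2.0 ** -130):
    print(x, spacing_single(x))
```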
 s |     e    |      f
 0 | 1 ... 11 | 12 ... 63     (bit positions)
where s = sign
e = biased exponent
p = e − 1023 = exponent
1.f = significand (use binary point)
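The 64-bit layout can be inspected the same way as the 32-bit one. The following sketch is illustrative only: `decode_double` is a hypothetical helper, and the reconstruction formula assumes a normal number (0 < e < 2047).

```python
import struct

def decode_double(x):
    """Split the 64-bit word of a double into (s, e, f)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", float(x)))
    s = bits >> 63                 # 1 sign bit
    e = (bits >> 52) & 0x7FF       # 11-bit biased exponent
    f = bits & ((1 << 52) - 1)     # 52-bit fraction
    return s, e, f

s, e, f = decode_double(-6.5)
p = e - 1023                                       # unbiased exponent
value = (-1) ** s * 2.0 ** p * (1 + f / 2 ** 52)   # (-1)^s * 2^p * 1.f
print(s, e, p, hex(f), value)
```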
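$$x^2 + 2bx - 1 = 0,$$
Consider the solution with $b > 0$ and $x > 0$ (the $x_+$ solution) given by
$$x = -b + \sqrt{b^2 + 1}. \tag{1.1}$$
As $b \to \infty$,
$$
\begin{aligned}
x &= -b + \sqrt{b^2 + 1} \\
  &= -b + b\sqrt{1 + 1/b^2} \\
  &= b\left(\sqrt{1 + 1/b^2} - 1\right) \\
  &\approx b\left(1 + \frac{1}{2b^2} - 1\right) \\
  &= \frac{1}{2b}.
\end{aligned}
$$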
Now in double precision, realmin $\approx 2.2 \times 10^{-308}$ and we would like $x$ to be
accurate to this value before it goes to 0 via denormal numbers. Therefore,
$x$ should be computed accurately to $b \approx 1/(2 \times \text{realmin}) \approx 2 \times 10^{307}$. What
happens if we compute (1.1) directly? Then $x = 0$ when $b^2 + 1 = b^2$, or
$1 + 1/b^2 = 1$. That is, $1/b^2 = \epsilon_{\text{mach}}/2$, or $b = \sqrt{2}/\sqrt{\epsilon_{\text{mach}}} \approx 10^{8}$.
For a subroutine written to compute the solution of a quadratic for a general
user, this is not good enough. The way for a software designer to solve this
problem is to compute the solution for 𝑥 as
$$x = \frac{1}{b + \sqrt{b^2 + 1}}.$$
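The difference between the two formulas is easy to see numerically. The sketch below is a minimal illustration in Python double precision; the function names `x_naive` and `x_stable` are assumptions chosen for this example, and the asymptotic value 1/(2b) is printed for comparison.

```python
import math

def x_naive(b):
    """Evaluate x = -b + sqrt(b^2 + 1) directly, as in (1.1)."""
    return -b + math.sqrt(b * b + 1.0)

def x_stable(b):
    """Evaluate the algebraically equivalent form x = 1 / (b + sqrt(b^2 + 1))."""
    return 1.0 / (b + math.sqrt(b * b + 1.0))

for b in (1.0, 1.0e4, 1.0e9):
    print(b, x_naive(b), x_stable(b), 1.0 / (2.0 * b))
```

For b = 1e9 the direct formula returns exactly 0, while the rewritten form stays close to 1/(2b). Note that in this sketch `b * b` itself overflows to infinity once b exceeds roughly 1.3e154, so covering the full range of b discussed above would need additional rescaling that is not shown here.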