Floating Point Arithmetic
Dmitriy Leykekhman
Spring 2012
Goals
- Basic understanding of computer representation of numbers
- Basic understanding of floating point arithmetic
- Consequences of floating point arithmetic for numerical computation
1234.567
for us means
\[ 1\cdot 10^{3} + 2\cdot 10^{2} + 3\cdot 10^{1} + 4\cdot 10^{0} + 5\cdot 10^{-1} + 6\cdot 10^{-2} + 7\cdot 10^{-3}. \]
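More generally,
\[ \dots d_j \dots d_1 d_0 \,.\, d_{-1} \dots d_{-i} \dots \]
represents
\[ \dots + d_j\cdot 10^{j} + \dots + d_1\cdot 10^{1} + d_0\cdot 10^{0} + d_{-1}\cdot 10^{-1} + \dots + d_{-i}\cdot 10^{-i} + \dots \]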
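Example
\[ \frac{11}{2} = 5\cdot 10^{0} + 5\cdot 10^{-1} = (5.5)_{10}, \]
\[ \frac{11}{2} = 1\cdot 2^{2} + 0\cdot 2^{1} + 1\cdot 2^{0} + 1\cdot 2^{-1}
 = \left(1\cdot 2^{0} + 0\cdot 2^{-1} + 1\cdot 2^{-2} + 1\cdot 2^{-3}\right)\cdot 2^{2} = (1.011)_{2}\cdot 2^{2}. \]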
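The conversion in the example can be sketched in a few lines of Python. The helper name `normalized_binary` and its `digits` cutoff are illustrative choices, not part of the slides; the sketch works on exact fractions and truncates (does not round) the expansion.

```python
from fractions import Fraction

def normalized_binary(x, digits=8):
    """Return (mantissa_string, exponent) with x = (1.b1 b2 ...)_2 * 2**exponent.

    x must be a positive Fraction. At most `digits` fractional bits are
    produced; longer expansions are truncated, not rounded.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    e = 0
    # Scale x into [1, 2) and record the exponent.
    while x >= 2:
        x /= 2
        e += 1
    while x < 1:
        x *= 2
        e -= 1
    bits = []
    frac = x - 1                      # fractional part of the mantissa
    for _ in range(digits):
        if frac == 0:
            break
        frac *= 2
        bit = int(frac >= 1)          # next binary digit
        bits.append(str(bit))
        frac -= bit
    return "1." + "".join(bits or ["0"]), e

mantissa, exponent = normalized_binary(Fraction(11, 2))
print(mantissa, exponent)             # 1.011 2
```

Using `Fraction` keeps every step exact, so the digits printed are those of 11/2 itself rather than of a rounded binary approximation.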
\[ \bar{x} = \pm 1.d_1 d_2 \times 2^{e} \]
[Figure: number line of the numbers representable in this format, built up over successive slides. Tick labels: 0; 1/2, 5/8, 3/4, 7/8; 1, 5/4, 3/2, 7/4; 2, 5/2, 3, 7/2; 4, 5, 6, 7.]
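To see how unevenly such numbers are spaced, the positive values of the toy format ±1.d1 d2 × 2^e can be enumerated directly. This is a minimal sketch; the exponent range −1 ≤ e ≤ 2 is an assumption read off the tick labels of the number line, and `toy_floats` is a hypothetical helper name.

```python
from fractions import Fraction

def toy_floats(e_min=-1, e_max=2):
    """Positive numbers (1.d1 d2)_2 * 2**e for e_min <= e <= e_max (assumed range)."""
    values = []
    for e in range(e_min, e_max + 1):
        for d1 in (0, 1):
            for d2 in (0, 1):
                mantissa = 1 + Fraction(d1, 2) + Fraction(d2, 4)
                values.append(mantissa * Fraction(2) ** e)
    return sorted(values)

positive = toy_floats()
print(", ".join(str(v) for v in positive))

# Spacing between neighbours: constant within one exponent, doubling with e.
gaps = [b - a for a, b in zip(positive, positive[1:])]
print(", ".join(str(g) for g in gaps))
```

With two mantissa bits there are four numbers per exponent, and the gap between neighbours doubles each time the exponent increases by one, so the absolute spacing grows while the relative spacing stays roughly constant.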
If $x = \operatorname{sign}(x)\left(\sum_{i=0}^{\infty} d_i \beta^{-i}\right)\beta^{e}$ with $d_0 > 0$, then
\[
\mathrm{fl}(x) =
\begin{cases}
\operatorname{sign}(x)\left(\sum_{i=0}^{m-1} d_i \beta^{-i}\right)\beta^{e}, & \text{if } d_m < \tfrac{1}{2}\beta,\\[4pt]
\operatorname{sign}(x)\left(\sum_{i=0}^{m-1} d_i \beta^{-i} + \beta^{-(m-1)}\right)\beta^{e}, & \text{if } d_m \ge \tfrac{1}{2}\beta.
\end{cases}
\]
Note that there may be two floating point numbers closest to $x$; $\mathrm{fl}(x)$ picks one of
them. For example, let $\beta = 10$, $m = 3$. Then $|1.235 - 1.24| = 0.005$, but also
$|1.235 - 1.23| = 0.005$. See [G91, O01, SUN] for more details on breaking ties.
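The two-case rounding rule can be written out as a short Python sketch. The function name `fl` and its default parameters are chosen here for illustration; the tie rule implemented is the one in the formula (round up in magnitude when the first dropped digit is at least β/2), whereas IEEE 754 arithmetic by default breaks ties to even.

```python
from fractions import Fraction

def fl(x, beta=10, m=3):
    """Round x to m significant base-beta digits, ties away from zero.

    Truncate after digit d_{m-1}, then add one unit in the last place
    when the first discarded digit d_m is at least beta/2.
    """
    x = Fraction(x)
    if x == 0:
        return x
    sign = 1 if x > 0 else -1
    a = abs(x)
    # Find e with beta**e <= a < beta**(e+1).
    e = 0
    while a >= Fraction(beta) ** (e + 1):
        e += 1
    while a < Fraction(beta) ** e:
        e -= 1
    ulp = Fraction(beta) ** (e - (m - 1))            # weight of digit d_{m-1}
    scaled = a / ulp                                 # d_0 ... d_{m-1} sit before the point
    kept = scaled.numerator // scaled.denominator    # truncation to m digits
    d_m = int((scaled - kept) * beta)                # first discarded digit
    if 2 * d_m >= beta:
        kept += 1                                    # round up in magnitude
    return sign * kept * ulp

print(fl(Fraction("1.235")))   # 31/25, i.e. 1.24
print(fl(Fraction("1.234")))   # 123/100, i.e. 1.23
```

For the tie 1.235 with β = 10 and m = 3, this rule picks 1.24; a round-to-even rule would also pick 1.24 here, but the two rules differ on, for example, 1.225.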
D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 8
Rounding Error
Theorem
If $x$ is a number within the range of floating point numbers and
$|x| \in [\beta^{e}, \beta^{e+1})$, then the absolute error between $x$ and the floating point
number $\mathrm{fl}(x)$ closest to $x$ satisfies
\[ |\mathrm{fl}(x) - x| \le \frac{1}{2}\beta^{e+1-m} \]
and, provided $x \ne 0$, the relative error satisfies
\[ \frac{|\mathrm{fl}(x) - x|}{|x|} \le \frac{1}{2}\beta^{1-m}. \tag{2} \]
The number
\[ \epsilon_{\mathrm{mach}} \overset{\mathrm{def}}{=} \frac{1}{2}\beta^{1-m} \]
is called machine precision or unit roundoff.
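The relative error bound can be checked numerically for Python's own floats. The sketch below assumes the platform uses IEEE double precision (β = 2, m = 53, as reported by `sys.float_info`); the sample values and the helper name `relative_rounding_error` are illustrative.

```python
import sys
from fractions import Fraction

# Base and number of mantissa digits of the platform float type
# (2 and 53 for IEEE double precision).
beta, m = sys.float_info.radix, sys.float_info.mant_dig
unit_roundoff = Fraction(1, 2) * Fraction(beta) ** (1 - m)

def relative_rounding_error(x):
    """Exact relative error |fl(x) - x| / |x| for a nonzero Fraction x."""
    return abs(Fraction(float(x)) - x) / abs(x)

samples = [Fraction(1, 10), Fraction(1, 3), Fraction(2, 3), Fraction(22, 7)]
for x in samples:
    err = relative_rounding_error(x)
    print(x, float(err), err <= unit_roundoff)
```

Note that `sys.float_info.epsilon` is β^(1−m), the gap between 1 and the next larger float, so the unit roundoff defined above is half of that value.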
\[ \frac{|\bar{x} - x|}{|x|} \le \frac{1}{2}\beta^{-(m-1)}. \]
$\mathrm{fl}(x)$ is a floating point number closest to $x = \left(\sum_{i=0}^{\infty} d_i \beta^{-i}\right)\beta^{e}$, $d_0 > 0$?
Examples
Consider the floating point system $\beta = 10$ and $m = 4$.
i. $\bar{x} = 2.552 \cdot 10^{3}$ and $\bar{y} = 2.551 \cdot 10^{3}$.
$\bar{x} - \bar{y} = 0.001 \cdot 10^{3} = 1.000 \cdot 10^{0}$. In this case $\bar{x} - \bar{y}$ is a floating point
number and nothing needs to be done; no error occurs in the subtraction of
$\bar{x}$ and $\bar{y}$.
ii. $\bar{x} = 2.552 \cdot 10^{3}$ and $\bar{y} = 2.551 \cdot 10^{2}$.
$\bar{x} - \bar{y} = 2.2969 \cdot 10^{3}$. This is not a floating point number. The floating
point number nearest to $\bar{x} - \bar{y}$ is $\mathrm{fl}(\bar{x} - \bar{y}) = 2.297 \cdot 10^{3}$.
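Both examples can be reproduced with Python's `decimal` module, using a context with four significant digits as a stand-in for the system β = 10, m = 4. This is a sketch under that assumption; the exponent range of `decimal` is far larger than any real machine format, which does not matter for these values.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

# Four significant decimal digits: beta = 10, m = 4.
ctx = Context(prec=4, rounding=ROUND_HALF_UP)

x = Decimal("2.552E3")
y1 = Decimal("2.551E3")
y2 = Decimal("2.551E2")

d1 = ctx.subtract(x, y1)     # example i: result is representable, no rounding
d2 = ctx.subtract(x, y2)     # example ii: exact result has 5 digits, gets rounded

exact2 = x - y2              # default context (28 digits) keeps 2296.9 exactly
rel_err = abs(d2 - exact2) / exact2

print(d1, d2, exact2)
print(rel_err <= Decimal("0.5E-3"))   # bound (1/2) * 10**(1 - 4)
```

In example ii the rounding error of the subtraction is 0.1 in absolute terms, which stays below the relative bound (1/2)·10^(1−4) from the rounding error theorem.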
\[ \frac{|\mathrm{fl}(x) - x|}{|x|} \le \epsilon_{\mathrm{mach}} \overset{\mathrm{def}}{=} \frac{1}{2}\beta^{1-m}. \]
- Introduced basic properties of floating point arithmetic.
- Catastrophic cancellation can occur if one subtracts [adds] two
  numbers which are not both in floating point format and which have
  the same [opposite] sign and [their absolute values] are of
  approximately the same size.
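A small experiment illustrates the cancellation effect. The sketch again uses a four-digit `decimal` context as a stand-in for β = 10, m = 4; the input values 1.23456 and 1.23444 are made up for illustration.

```python
from decimal import Decimal, Context, ROUND_HALF_UP

ctx = Context(prec=4, rounding=ROUND_HALF_UP)    # beta = 10, m = 4

# Two nearby numbers that are NOT representable with four digits.
x, y = Decimal("1.23456"), Decimal("1.23444")
exact = x - y                                    # 0.00012

# Rounding the inputs first: fl(x) and fl(y).
x_bar, y_bar = ctx.plus(x), ctx.plus(y)
computed = ctx.subtract(x_bar, y_bar)            # this subtraction is itself exact

rel_err_inputs = max(abs(x_bar - x) / x, abs(y_bar - y) / y)
rel_err_result = abs(computed - exact) / exact

print(x_bar, y_bar, computed)
print(float(rel_err_inputs), float(rel_err_result))
```

Each input carries a relative error below the unit roundoff 0.5·10^(−3), yet the computed difference 0.001 is off from the true value 0.00012 by a factor of more than eight. The subtraction introduces no new error; it exposes the rounding errors already present in the inputs.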
[G91] David Goldberg. What every computer scientist should know about
floating-point arithmetic. ACM Comput. Surv., Vol. 23 (1), 1991,
pp. 5-48.
http://docs.sun.com/source/806-3568/ncg_goldberg.html
[O01] Michael L. Overton. Numerical Computing with IEEE Floating Point
Arithmetic. SIAM, Philadelphia, 2001.
[SUN] Sun Microsystems. Numerical Computation Guide.
http://docs.sun.com/source/806-3568/