0% found this document useful (0 votes)
2 views

floating_point

The document discusses floating point arithmetic, focusing on the representation of real numbers in computers. It explains the structure of floating point numbers, including the base, mantissa, and exponent, and presents a toy floating point number system as an example. The goals include understanding computer number representation and the implications of floating point arithmetic in numerical computation.

Uploaded by

thamsanqadube008
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

floating_point

The document discusses floating point arithmetic, focusing on the representation of real numbers in computers. It explains the structure of floating point numbers, including the base, mantissa, and exponent, and presents a toy floating point number system as an example. The goals include understanding computer number representation and the implications of floating point arithmetic in numerical computation.

Uploaded by

thamsanqadube008
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 55

MATH 3511

Lecture 4. Floating Point Arithmetic

Dmitriy Leykekhman

Spring 2012

Goals
I Basic understanding of computer representation of numbers
I Basic understanding of floating point arithmetic
I Consequences of floating point arithmetic for numerical computation

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 1


Representation of Real Numbers

In everyday life we use decimal representation of numbers. For example

1234.567

for us means

1 ∗ 104 + 2 ∗ 103 + 3 ∗ 102 + 4 ∗ 100 + 5 ∗ 10−1 + 6 ∗ 10−2 + 7 ∗ 10−3 .

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 2


Representation of Real Numbers

In everyday life we use decimal representation of numbers. For example

1234.567

for us means

1 ∗ 104 + 2 ∗ 103 + 3 ∗ 102 + 4 ∗ 100 + 5 ∗ 10−1 + 6 ∗ 10−2 + 7 ∗ 10−3 .

More generally
. . . dj . . . d1 d0 .d−1 . . . d−i . . .
represents

· · · dj ∗ 10j + · · · + d1 ∗ 101 + d0 ∗ 100 + d−1 ∗ 10−1 + · · · + d−i ∗ 10−i + · · · .

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 2


Representation of Real Numbers
Let β ≥ 2 be an integer. For every x ∈ IR there exist integers e and
di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , such that

!
X
x = sign(x) di β −i β e . (1)
i=0

The representation is unique if one requires that d0 > 0 when x 6= 0.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 3


Representation of Real Numbers
Let β ≥ 2 be an integer. For every x ∈ IR there exist integers e and
di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , such that

!
X
x = sign(x) di β −i β e . (1)
i=0

The representation is unique if one requires that d0 > 0 when x 6= 0.

Example

11
= 5 ∗ 100 + 5 ∗ 10−1 = (5.5)10 ,
2
11
= 1 ∗ 22 + 0 ∗ 21 + 1 ∗ 20 + 1 ∗ 2−1
2
= (1 ∗ 20 + 0 ∗ 2−1 + 1 ∗ 2−2 + 1 ∗ 2−3 ) ∗ 22 = (1.011)2 ∗ 22 .

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 3


Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented.
These are the so–called floating point numbers and they are of the form
m−1
!
X
s −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 4


Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented.
These are the so–called floating point numbers and they are of the form
m−1
!
X
s −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I β is called the base,

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 4


Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented.
These are the so–called floating point numbers and they are of the form
m−1
!
X
s −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I β is called the base,
Pm−1 −i
i=0 di β is the significant or mantissa, m is the mantissa length,
I

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 4


Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented.
These are the so–called floating point numbers and they are of the form
m−1
!
X
s −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I β is called the base,
Pm−1 −i
i=0 di β is the significant or mantissa, m is the mantissa length,
I

I e is the exponent, and {emin , . . . , emax } is the exponent range.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 4


Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented.
These are the so–called floating point numbers and they are of the form
m−1
!
X
s −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I β is called the base,
Pm−1 −i
i=0 di β is the significant or mantissa, m is the mantissa length,
I

I e is the exponent, and {emin , . . . , emax } is the exponent range.


I If β = 2, then we say the floating point number system is a binary
system. In this case the di ’s are called bits.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 4


Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented.
These are the so–called floating point numbers and they are of the form
m−1
!
X
s −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I β is called the base,
Pm−1 −i
i=0 di β is the significant or mantissa, m is the mantissa length,
I

I e is the exponent, and {emin , . . . , emax } is the exponent range.


I If β = 2, then we say the floating point number system is a binary
system. In this case the di ’s are called bits.
I If β = 10, then we say the floating point number system is a decimal
system. In this case the di ’s are called digits.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 4


Floating Point Numbers
In a computer only a finite subset of all real numbers can be represented.
These are the so–called floating point numbers and they are of the form
m−1
!
X
s −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I β is called the base,
Pm−1 −i
i=0 di β is the significant or mantissa, m is the mantissa length,
I

I e is the exponent, and {emin , . . . , emax } is the exponent range.


I If β = 2, then we say the floating point number system is a binary
system. In this case the di ’s are called bits.
I If β = 10, then we say the floating point number system is a decimal
system. In this case the di ’s are called digits.
I A floating point number x̄ 6= 0 is said to be normalized if d0 > 0.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 4


A Toy Floating Point Number System
Consider the floating point number system
β = 2, m = 3, emin = −1, emax = 2.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 5


A Toy Floating Point Number System
Consider the floating point number system
β = 2, m = 3, emin = −1, emax = 2.
The normalized floating point numbers x̄ 6= 0 are of the form

x̄ = ±1.d1 d2 × 2e

since the normalization condition implies that d0 ∈ {1, . . . , β − 1} = {1}.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 5


A Toy Floating Point Number System
Consider the floating point number system
β = 2, m = 3, emin = −1, emax = 2.
The normalized floating point numbers x̄ 6= 0 are of the form

x̄ = ±1.d1 d2 × 2e

since the normalization condition implies that d0 ∈ {1, . . . , β − 1} = {1}.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 5


A Toy Floating Point Number System
Consider the floating point number system
β = 2, m = 3, emin = −1, emax = 2.
The normalized floating point numbers x̄ 6= 0 are of the form

x̄ = ±1.d1 d2 × 2e

since the normalization condition implies that d0 ∈ {1, . . . , β − 1} = {1}.

-
5 3 7
1 4 2 4
0

Positive numbers with exponent e = 0

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 5


A Toy Floating Point Number System
Consider the floating point number system
β = 2, m = 3, emin = −1, emax = 2.
The normalized floating point numbers x̄ 6= 0 are of the form

x̄ = ±1.d1 d2 × 2e

since the normalization condition implies that d0 ∈ {1, . . . , β − 1} = {1}.

-
5 3 7 5 7
1 4 2 4
2 2
3 2
0

Positive numbers with exponent e = 0 , 1

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 5


A Toy Floating Point Number System
Consider the floating point number system
β = 2, m = 3, emin = −1, emax = 2.
The normalized floating point numbers x̄ 6= 0 are of the form

x̄ = ±1.d1 d2 × 2e

since the normalization condition implies that d0 ∈ {1, . . . , β − 1} = {1}.

-
5 3 7 5 7
1 4 2 4
2 2
3 2
4 5 6 7
0

Positive numbers with exponent e = 0 , 1 , 2

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 5


A Toy Floating Point Number System
Consider the floating point number system
β = 2, m = 3, emin = −1, emax = 2.
The normalized floating point numbers x̄ 6= 0 are of the form

x̄ = ±1.d1 d2 × 2e

since the normalization condition implies that d0 ∈ {1, . . . , β − 1} = {1}.

-
1537 5 3 7 5 7
2848
1 4 2 4
2 2
3 2
4 5 6 7
0

Positive numbers with exponent e = 0 , 1 , 2 ,−1

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 5


Consider the floating point number system
m−1
!
s
X −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 6


Consider the floating point number system
m−1
!
s
X −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I The mantissa satisfies
m−1
X m−1
X
di β −i ≤ (β − 1)β −i = β(1 − β −m ) < β.
i=0 i=0

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 6


Consider the floating point number system
m−1
!
s
X −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I The mantissa satisfies
m−1
X m−1
X
di β −i ≤ (β − 1)β −i = β(1 − β −m ) < β.
i=0 i=0

I The mantissa of a normalized floating point number is always ≥ 1.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 6


Consider the floating point number system
m−1
!
s
X −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I The mantissa satisfies
m−1
X m−1
X
di β −i ≤ (β − 1)β −i = β(1 − β −m ) < β.
i=0 i=0

I The mantissa of a normalized floating point number is always ≥ 1.


I The largest floating point number is
m−1
!
X −i
x̄max = (β − 1)β β emax = (1 − β −m )β emax +1 .
i=0

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 6


Consider the floating point number system
m−1
!
s
X −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I The mantissa satisfies
m−1
X m−1
X
di β −i ≤ (β − 1)β −i = β(1 − β −m ) < β.
i=0 i=0

I The mantissa of a normalized floating point number is always ≥ 1.


I The largest floating point number is
m−1
!
X −i
x̄max = (β − 1)β β emax = (1 − β −m )β emax +1 .
i=0

I The smallest positive normalized floating pt. number is x̄min = β emin .

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 6


Consider the floating point number system
m−1
!
s
X −i
x̄ = (−1) di β βe
i=0

with di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {emin , . . . , emax }.


I The mantissa satisfies
m−1
X m−1
X
di β −i ≤ (β − 1)β −i = β(1 − β −m ) < β.
i=0 i=0

I The mantissa of a normalized floating point number is always ≥ 1.


I The largest floating point number is
m−1
!
X −i
x̄max = (β − 1)β β emax = (1 − β −m )β emax +1 .
i=0

I The smallest positive normalized floating pt. number is x̄min = β emin .


I The distance between 1 and the next largest floating pt. number is β 1−m .
Half this number, mach = 12 β 1−m , is called machine precision or unit
roundoff. (We will see later why).
The spacing between the floating pt. numbers in [1, β] is β −(m−1) .
The spacing between the floating pt. numbers in [β e , ββ e ] is β −(m−1) β e .
D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 6
IEEE Floating Point Numbers
I Almost all every modern computer implements the IEEE binary (β = 2)
floating point standard.
I IEEE single precision floating point numbers are stored in 32 bits.
I IEEE double precision floating point numbers are stored in 64 bits.
I How these numbers are stored is quite interesting (clever), but a little too
involved to get into here. See the references [G91,O01,SUN] given at the
end of this lecture.
I Here are some important numbers.

Common Name (Approximate) Equivalent Value


Single Precision Double Precision

Unit roundoff 2−24 ≈ 6.e − 8 2−53 ≈ 1.1e − 16


Maximum normal number 3.4e + 38 1.7e + 308
Minimum positive normal number 1.2e − 38 2.3e − 308
Maximum subnormal number 1.1e − 38 2.2e − 308
Minimum positive subnormal number 1.5e − 45 5.0e − 324

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 7


Rounding
Given a real number x we define
fl(x) = normalized floating point number closest to x.
A floating point number x̄ closest to x is obtained by rounding. If

!
X −i
x = sign(x) di β βe,
i=0

then
 Pm−1 −i
 e
 sign(x) i=0 di β β , if dm < 21 β,
fl(x) = P
m−1 −i

 sign(x)
i=0 di β + β −(m−1) β e , if dm ≥ 21 β.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 8


Rounding
Given a real number x we define
fl(x) = normalized floating point number closest to x.
A floating point number x̄ closest to x is obtained by rounding. If

!
X −i
x = sign(x) di β βe,
i=0

then
 Pm−1 −i
 e
 sign(x) i=0 di β β , if dm < 21 β,
fl(x) = P
m−1 −i

 sign(x)
i=0 di β + β −(m−1) β e , if dm ≥ 21 β.

Example Let β = 10, m = 3. Then


fl(1.234 ∗ 10−1 ) = 1.23 ∗ 10−1 ,
−1
fl(1.235 ∗ 10 ) = 1.24 ∗ 10−1 ,
fl(1.295 ∗ 10−1 ) = 1.30 ∗ 10−1 .

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 8


Rounding
Given a real number x we define
fl(x) = normalized floating point number closest to x.
A floating point number x̄ closest to x is obtained by rounding. If

!
X −i
x = sign(x) di β βe,
i=0

then
 Pm−1 −i
 e
 sign(x) i=0 di β β , if dm < 21 β,
fl(x) = P
m−1 −i

 sign(x)
i=0 di β + β −(m−1) β e , if dm ≥ 21 β.

Example Let β = 10, m = 3. Then


fl(1.234 ∗ 10−1 ) = 1.23 ∗ 10−1 ,
−1
fl(1.235 ∗ 10 ) = 1.24 ∗ 10−1 ,
fl(1.295 ∗ 10−1 ) = 1.30 ∗ 10−1 .

Note, there may be two floating point numbers closest to x. fl(x) picks one of
them. For example, let β = 10, m = 3. Then 1.235 − 1.24 = 0.005, but also
1.235 − 1.23 = 0.005. See [G91,O01,SUN] for more details on ’breaking’ ties.
D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 8
Rounding Error

Theorem
If x is a number within the range of floating point numbers and
|x| ∈ [β e , β e+1 ), then the absolute error between x and the floating point
number fl(x) closest to x is given by
1 e(1−m)
|fl(x) − x| ≤ β
2
and, provided x 6= 0, the relative error is given by

|fl(x) − x| 1
≤ β 1−m . (2)
|x| 2

The number
1 1−m def
βmach =
2
is called machine precision or unit roundoff.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 9


Proof of the theorem:
If x = 0, then fl(x) = x and the assertion follows immediately.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 10


Proof of the theorem:
If x = 0, then fl(x) = x and the assertion follows immediately.
Consider x > 0. (The case x < 0 can be treated in the same manner.)
Recall that the spacing between the floating point numbers
m−1
!
X
x̄ = di β −i β e ∈ [β e , β e+1 )
i=0

is β −(m−1) β e . Hence if x ∈ [β e , β e+1 ), then the floating point number x̄


closest to x satisfies |x̄ − x| ≤ 12 β −(m−1) β e . Since x ≥ β e ,

|x̄ − x| 1
≤ β −(m−1) .
|x| 2

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 10


Proof of the theorem:
If x = 0, then fl(x) = x and the assertion follows immediately.
Consider x > 0. (The case x < 0 can be treated in the same manner.)
Recall that the spacing between the floating point numbers
m−1
!
X
x̄ = di β −i β e ∈ [β e , β e+1 )
i=0

is β −(m−1) β e . Hence if x ∈ [β e , β e+1 ), then the floating point number x̄


closest to x satisfies |x̄ − x| ≤ 12 β −(m−1) β e . Since x ≥ β e ,

|x̄ − x| 1
≤ β −(m−1) .
|x| 2
P 

fl(x) is a floating point number closest to x = i=0 di β −i β e , d0 > 0?

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 10


Examples
Let β = 10, m = 3, thus mach = 5 ∗ 10−3 .

|fl(1.234 ∗ 10−1 ) − 1.234 ∗ 10−1 | = 0.0004,


|fl(1.234 ∗ 10−1 ) − 1.234 ∗ 10−1 | 0.0004
= ≈ 3.2 ∗ 10−3 ,
1.234 ∗ 10−1 1.234 ∗ 10−1
|fl(1.295 ∗ 10−1 ) − 1.295 ∗ 10−1 | = 0.0005,
|fl(1.295 ∗ 10−1 ) − 1.295 ∗ 10−1 | 0.0005
= ≈ 3.9 ∗ 10−3 .
1.295 ∗ 10−1 1.295 ∗ 10−1

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 11


Floating Point Arithmetic
I Let  represent one of the elementary operations +, −, ∗, /. If x̄ and
ȳ are floating point numbers, then x̄ȳ may not be a floating point
number.
Example: β = 10, m = 4: 1.234 + 2.751 ∗ 10−1 = 1.5091.
What is the computed value for x̄ȳ?

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 12


Floating Point Arithmetic
I Let  represent one of the elementary operations +, −, ∗, /. If x̄ and
ȳ are floating point numbers, then x̄ȳ may not be a floating point
number.
Example: β = 10, m = 4: 1.234 + 2.751 ∗ 10−1 = 1.5091.
What is the computed value for x̄ȳ?
I In IEEE floating point arithmetic the result of the computation x̄ȳ
is equal to the floating point number that is nearest to the exact
result x̄ȳ. Therefore we use fl(x̄ȳ) to denote the result of the
computation x̄ȳ

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 12


Floating Point Arithmetic
I Let  represent one of the elementary operations +, −, ∗, /. If x̄ and
ȳ are floating point numbers, then x̄ȳ may not be a floating point
number.
Example: β = 10, m = 4: 1.234 + 2.751 ∗ 10−1 = 1.5091.
What is the computed value for x̄ȳ?
I In IEEE floating point arithmetic the result of the computation x̄ȳ
is equal to the floating point number that is nearest to the exact
result x̄ȳ. Therefore we use fl(x̄ȳ) to denote the result of the
computation x̄ȳ
I Model for the computation of x̄ȳ, where  is one of the
elementary operations +, −, ∗, /.
1. Given floating point numbers x̄ and ȳ.
2. Compute x̄ȳ exactly.
3. Round the exact result x̄ȳ to the nearest floating point number
and normalize the result.
Example cont.: 1.234 + 2.751 ∗ 10−1 = 1.5091. Comp. result: 1.509
The actual implementation of the elementary operations is more
sophisticated. For more details see [DG91,O01].
D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 12
Floating Point Arithmetic (Cont.)
Given two numbers x̄, ȳ in floating point format, the computed result satisfies
|fl(x̄ȳ) − (x̄ȳ)|
≤ mach .
x̄ȳ

Examples
Consider the floating point system β = 10 and m = 4.
i. x̄ = 2.552 ∗ 103 and ȳ = 2.551 ∗ 103 .
x̄ − ȳ = 0.001 ∗ 103 = 1.000 ∗ 100 . In this case x̄ − ȳ is a floating point
number and nothing needs to done; no error occurs in the subtraction of
x̄, ȳ.
ii. x̄ = 2.552 ∗ 103 and ȳ = 2.551 ∗ 102 .
x̄ − ȳ = 2.2969 ∗ 103 . This is not a floating point number. The floating
point number nearest to x̄ − ȳ is fl(x̄ − ȳ) = 2.297 ∗ 103 .

|fl(x̄ − ȳ) − (x̄ − ȳ)| |2.297 ∗ 103 − 2.2969 ∗ 103 |


= ≈ 4.4 ∗ 10−5
|x̄ − ȳ| 2.2969 ∗ 103
< mach = 5 ∗ 10−4 .

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 13


Floating Point Arithmetic: Cancellation
For the previous result on the error between x̄ȳ and the computed fl(x̄ȳ)
only holds if x̄, ȳ in floating point format. What happens when we operate with
numbers that are not in floating point format?
Example
Consider the floating point system β = 10 and m = 4.
Subtract the numbers x = 2.5515052 ∗ 103 and y = 2.5514911 ∗ 103 .
1. Compute the floating point numbers x̄ and ȳ nearest to x and y,
respectively: x̄ = 2.552 ∗ 103 and ȳ = 2.551 ∗ 103 .
2. Compute x̄ − ȳ exactly: x̄ − ȳ = 0.001 ∗ 103 .
3. Round the exact result x̄ − ȳ to the nearest floating point number:
fl(0.001 ∗ 103 ) = 0.001 ∗ 103 . Normalize the number:
fl(0.001 ∗ 103 ) = 1.000. The last digits are filled with (spurious) zeros.
The exact result is 2.5515052 ∗ 103 − 2.5514911 ∗ 103 = 1.410 ∗ 10−2 . The
relative error between exact and computed solution is

|1.000 − 1.410 ∗ 10−2 |


≈ 70  mach = 5 ∗ 10−4 .
1.410 ∗ 10−2
Note that this large error is not due the computation of fl(x̄ − ȳ). The large
error is caused by the rounding of x and y at the beginning.
D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 14
Floating Point Arithmetic: Cancellation (cont.)
I To analyze the analyze the error incurred by the subtraction of two
numbers, the following representation is useful:
For every x ∈ IR, there exists  with || ≤ mach such that
fl(x) = x(1 + ).
Note that if x 6= 0, then the previous identity is satisfied for
def
 = (fl(x) − x)/x. The bound || ≤ mach follows from (2).

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 15


Floating Point Arithmetic: Cancellation (cont.)
I To analyze the analyze the error incurred by the subtraction of two
numbers, the following representation is useful:
For every x ∈ IR, there exists  with || ≤ mach such that
fl(x) = x(1 + ).
Note that if x 6= 0, then the previous identity is satisfied for
def
 = (fl(x) − x)/x. The bound || ≤ mach follows from (2).
I For x, y ∈ IR we have 1 , 2 with |1 |, |2 | ≤ mach such that
fl(x) = x(1 + 1 ), fl(y) = y(1 + 2 ).
Moreover fl(fl(x) − fl(y)) = (fl(x) − fl(y))(1 + 3 ), with |3 | ≤ mach .

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 15


Floating Point Arithmetic: Cancellation (cont.)
I To analyze the analyze the error incurred by the subtraction of two
numbers, the following representation is useful:
For every x ∈ IR, there exists  with || ≤ mach such that
fl(x) = x(1 + ).
Note that if x 6= 0, then the previous identity is satisfied for
def
 = (fl(x) − x)/x. The bound || ≤ mach follows from (2).
I For x, y ∈ IR we have 1 , 2 with |1 |, |2 | ≤ mach such that
fl(x) = x(1 + 1 ), fl(y) = y(1 + 2 ).
Moreover fl(fl(x) − fl(y)) = (fl(x) − fl(y))(1 + 3 ), with |3 | ≤ mach .
I Thus,
fl(fl(x) − fl(y)) = (fl(x) − fl(y))(1 + 3 ) = [x(1 + 1 ) − y(1 + 2 )](1 + 3 )
= (x − y)(1 + 3 ) + (x1 − y2 )(1 + 3 )
and, if x − y 6= 0, then the relative error is given by
|fl(fl(x) − fl(y)) − (x − y)| x1 − y2
= 3 + (1 + 3 ) (3)
|x − y| x−y
If 1 2 6= 0 and x − y is small, the quantity on the rhs could be  mach .
D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 15
Floating Point Arithmetic: Cancelation (cont.)

I Similar analysis can be carried out for +, −, ∗, /.


I Catastrophic cancelation can only occur with +, −.
I Catastrophic cancelation can only occur if one subtracts two
numbers which are not both in floating point format and which have
the same sign and are of approximately the same size, see (3), or if
one adds two numbers which are not both in floating point format,
which have opposite sign and their absolute values of approximately
the same size.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 16


Floating Point Arithmetic: Cancelation Example 1
x 1 − cos
0.500000 0.122417E + 00
0.125000 0.780231E − 02
0.312500E − 01 0.488222E − 03
Evaluation of 1 − cos(x) near x = 0.
0.781250E − 02 0.305176E − 04
(All computations were done using
0.195312E − 02 0.190735E − 05
single precision Fortran.)
0.488281E − 03 0.119209E − 06
0.122070E − 03 0.
0.305176E − 04 0.
0.762939E − 05 0.
0.190735E − 05 0.

Since cos(0) = 1 we expect catastrophic cancelation. If x = 0.122070E − 03,


then
1 − cos(x) ≈ 1.0000000000 − 0.99999999254......
= 0.00000000745..... = 7.45054.....e − 09
−1
1 − fl(cos(x)) ≈ 1.000000 − fl(9.999999
| {z } 9254...... ∗ 10 )
7 digits
= 1.000000 − 1.000000 = 0.
D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 17
Floating Point Arithmetic: Cancelation Example 1 (cont.)
Two alternatives for small |x|.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 18


Floating Point Arithmetic: Cancelation Example 1 (cont.)
Two alternatives for small |x|.
I Since cos2 (x) + sin2 (x) = 1 it holds that
1 − cos(x) = sin2 (x)/(1 + cos(x)).
The formula sin2 (x)/(1 + cos(x)) avoids subtraction of two number that
are not in floating point format and are almost the same (recall that we
consider the case |x| small).

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 18


Floating Point Arithmetic: Cancelation Example 1 (cont.)
Two alternatives for small |x|.
I Since cos2 (x) + sin2 (x) = 1 it holds that
1 − cos(x) = sin2 (x)/(1 + cos(x)).
The formula sin2 (x)/(1 + cos(x)) avoids subtraction of two number that
are not in floating point format and are almost the same (recall that we
consider the case |x| small).
says that if the series S = ∞ i
P
The Leibnitz criterionP i=1 (−1) ci , ci ≥ 0,
I
n i
converges, then S − i=1 (−1) ci < cn+1 .
If we apply this to the Taylor expansion of cos(x),
x2 x4 x6 x8
cos(x) = 1 − + − + ± ...,
2 4! 6! 8!
then
x2 x4 x6 x8
 
cos(x) − 1 − + − < .
2 4! 6! 8!
After some rearrangements we can use the approximation
x2 x2 x4
 
1 − cos(x) ≈ 1− +
2 12 360
and we know that the difference is less than x8 /(8!) which allows us to
control the error of the approximation.
D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 18
Floating Point Arithmetic: Cancelation Example 1 (cont.)

x 1 − cos sin2 /(1 + cos) Taylor


0.500000 0.122417 0.122417 0.122418
0.125000 0.780231E − 02 0.780233E − 02 0.780233E − 02
0.312500E − 01 0.488222E − 03 0.488241E − 03 0.488242E − 03
0.781250E − 02 0.305176E − 04 0.305174E − 04 0.305174E − 04
0.195312E − 02 0.190735E − 05 0.190735E − 05 0.190735E − 05
0.488281E − 03 0.119209E − 06 0.119209E − 06 0.119209E − 06
0.122070E − 03 0. 0.745058E − 08 0.745058E − 08
0.305176E − 04 0. 0.465661E − 09 0.465661E − 09
0.762939E − 05 0. 0.291038E − 10 0.291038E − 10
0.190735E − 05 0. 0.181899E − 11 0.181899E − 11
0.476837E − 06 0. 0.113687E − 12 0.113687E − 12
0.119209E − 06 0. 0.710543E − 14 0.710543E − 14
0.298023E − 07 0. 0.444089E − 15 0.444089E − 15

Computations were performed using single precision Fortran.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 19


Floating Point Arithmetic: Cancelation Example 2
I The roots of the quadratic equation ax2 + bx + c = 0 are given by
 p 
x± = −b ± b2 − 4ac /(2a).

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 20


Floating Point Arithmetic: Cancelation Example 2
I The roots of the quadratic equation ax2 + bx + c = 0 are given by
 p 
x± = −b ± b2 − 4ac /(2a).

I When a = 5 ∗ 10−4 , b = 100, and c = 5 ∗ 10−3 the computed (using single


precision Fortran) first root is
x+ = 0.
Cannot be exact, since x = 0 is a solution of the quadratic equation if and
only if c = 0.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 20


Floating Point Arithmetic: Cancelation Example 2
I The roots of the quadratic equation ax2 + bx + c = 0 are given by
 p 
x± = −b ± b2 − 4ac /(2a).

I When a = 5 ∗ 10−4 , b = 100, and c = 5 ∗ 10−3 the computed (using single


precision Fortran) first root is
x+ = 0.
Cannot be exact, since x = 0 is a solution of the quadratic equation if and
only if c = 0.
I Since fl(b2 − 4ac) = fl(b2 ) for the data given above, we suffer from
catastrophic cancellation.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 20


Floating Point Arithmetic: Cancelation Example 2
I The roots of the quadratic equation ax2 + bx + c = 0 are given by
 p 
x± = −b ± b2 − 4ac /(2a).

I When a = 5 ∗ 10−4 , b = 100, and c = 5 ∗ 10−3 the computed (using single


precision Fortran) first root is
x+ = 0.
Cannot be exact, since x = 0 is a solution of the quadratic equation if and
only if c = 0.
I Since fl(b2 − 4ac) = fl(b2 ) for the data given above, we suffer from
catastrophic cancellation.
I A remedy is the following reformulation of the formula for x+ :
√ √  √ 
−b + b2 − 4ac 1 −b + b2 − 4ac −b − b2 − 4ac 2c
= √ = √
2a 2a −b − b2 − 4ac −b − b2 − 4ac
Here the subtraction of two almost equal numbers is avoided and the
computation using this formula gives x+ = −0.5E − 04.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 20


Floating Point Arithmetic: Cancelation Example 2
I The roots of the quadratic equation ax2 + bx + c = 0 are given by
 p 
x± = −b ± b2 − 4ac /(2a).

I When a = 5 ∗ 10−4 , b = 100, and c = 5 ∗ 10−3 the computed (using single


precision Fortran) first root is
x+ = 0.
Cannot be exact, since x = 0 is a solution of the quadratic equation if and
only if c = 0.
I Since fl(b2 − 4ac) = fl(b2 ) for the data given above, we suffer from
catastrophic cancellation.
I A remedy is the following reformulation of the formula for x+ :
√ √  √ 
−b + b2 − 4ac 1 −b + b2 − 4ac −b − b2 − 4ac 2c
= √ = √
2a 2a −b − b2 − 4ac −b − b2 − 4ac
Here the subtraction of two almost equal numbers is avoided and the
computation using this formula gives x+ = −0.5E − 04.
I A ‘stable’ (see later for a description of stability) formula for both roots
 p 
x1 = −b − sign(b) b2 − 4ac /(2a), x2 = c/(ax1 ).

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 20


Summary

I Introduced how numbers are represented on a computer.


I Only a small set of numbers can be represented on the computer.
I The relative error between x 6= 0 and its nearest floating point
number fl(x) is

|fl(x) − x| def 1
≤ mach = β 1−m .
|x| 2
I Introduced basic properties of floating point arithmetic.
I Catastrophic cancellation can occur if one subtracts [adds] two
numbers which are not both in floating point format and which have
the same [opposite] sign and [their absolute values] are of
approximately the same size.

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 21


Additional Reading

G91 David Goldberg. What every computer scientist should know about
floating-point arithmetic, ACM Comput. Surv., Vol. 23 (1), 1991,
pp. 5 - 48.
http://docs.sun.com/source/806-3568/ncg goldberg.html
O01 Michael L. Overton. Numerical Computing with IEEE Floating Point
Arithmetic, SIAM, Philadelphia, 2001.
SUN SUN Microsystems Numerical Computation Guide
http://docs.sun.com/source/806-3568/

D. Leykekhman - MATH 3511 Numerical Analysis 2 Floating Point Arithmetic – 22

You might also like