
An Introduction to Floating Point

Arithmetic by Example
Pat Quillen

21 January 2010

Floating Point Arithmetic by Example – p.1/15


Example
What is the value of

1 − 3 ∗ (4/3 − 1)

according to MATLAB?

2.220446049250313e-016
Why?? Essentially because 4/3 cannot be represented
exactly by a binary number with finitely many terms.
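The slides use MATLAB, but Python's float is also an IEEE 754 double, so the same residue can be reproduced there (a sketch, not part of the original deck):

```python
# fl(4/3) is slightly smaller than 4/3, so the algebraic identity
# 1 - 3*(4/3 - 1) = 0 fails by exactly one unit in the last place of 1.
residue = 1 - 3 * (4/3 - 1)
print(residue)             # 2.220446049250313e-16
print(residue == 2**-52)   # True
```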


Example (continued)
Notice that

4/3 = 1/(3/4) = 1/(1 − 1/4) = Σ_{k=0}^{∞} (1/4)^k

That is,

4/3 = 1 + 1/2^2 + 1/2^4 + 1/2^6 + · · ·

or, in binary,

4/3 = 1.010101010101 · · ·

which, again, is not exactly representable by finitely many
terms.


Floating Point Representation
In binary computers, most floating point numbers are
represented as

(−1)^s 2^e (1 + f)

where
s is represented by one bit (called the sign bit).
e is the exponent.
f is the mantissa.

For double precision numbers, e is an eleven-bit number
and f is a fifty-two-bit number.
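The three fields can be inspected directly in any IEEE 754 environment. A Python sketch (the `fields` helper is ours, not from the slides), using the standard `struct` module:

```python
import struct

def fields(x):
    """Split a double into (sign bit, 11-bit biased exponent, 52-bit mantissa)."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    s  = bits >> 63
    eb = (bits >> 52) & 0x7FF
    f  = bits & ((1 << 52) - 1)
    return s, eb, f

print(fields(1.0))    # (0, 1023, 0): s = 0, e = 0 stored with bias 1023, f = 0
print(fields(-2.0))   # (1, 1024, 0): s = 1, e = 1, f = 0
```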


Floating Point Exponent
As e is represented by 11 bits, it can range in value from
0 to 2^11 − 1 = 2047.
Negative exponents are represented by biasing e when
stored.
The double precision bias is 2^10 − 1 = 1023. Thus,
−1023 ≤ e ≤ 1024.
The extreme values e = −1023 (stored as eb = 0) and
e = 1024 (stored as eb = 2047) are special, so
−1022 ≤ e ≤ 1023 is the valid range of the exponent.
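The reserved extremes can be observed directly. A Python sketch (`biased_exp` is a hypothetical helper name):

```python
import struct

def biased_exp(x):
    """Extract the stored 11-bit exponent field eb of a double."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    return (bits >> 52) & 0x7FF

print(biased_exp(1.0))           # 1023: e = 0 plus the bias
print(biased_exp(float('inf')))  # 2047: reserved for infinities and NaNs
print(biased_exp(5e-324))        # 0:    reserved for zero and subnormals
```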


Floating Point Mantissa
f limits the precision of the floating point number.
0 ≤ f < 1
The format 2^e (1 + f) provides an implicitly stored 1, so
doubles actually have 53 bits of precision.
2^52 f is an integer ⇒ gaps between successive doubles.

For example, all integers up to 2^53 are exactly representable
as floating point numbers, but 2^53 + 1 is not.
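The gap at 2^53 is easy to see in any IEEE 754 double environment (Python shown here):

```python
big = 2.0**53

# 2**53 is exactly representable, but 2**53 + 1 must round,
# and it rounds back down to 2**53.
print(big + 1 == big)   # True
print(big - 1 == big)   # False: 2**53 - 1 still fits in 53 bits
```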


Examples
The number 1 is represented as

(−1)^0 2^0 (1 + 0).

That is, s = 0, e = 0, f = 0. Adding the bias (1023), the
biased value of e is eb = 1023.

You can use format hex in MATLAB to see the bit pattern
of the floating point number in hexadecimal. The first three
hex digits (12 bits) represent the sign bit and the biased
exponent, and the remaining 13 hex digits (52 bits) represent
the mantissa.
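A comparable view in Python (`hexbits` is our stand-in for MATLAB's format hex):

```python
import struct

def hexbits(x):
    """The 16 hex digits of a double's bit pattern, like MATLAB's format hex."""
    return struct.pack('>d', x).hex()

print(hexbits(1.0))   # 3ff0000000000000
```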


Examples
In the case of the number 1, s = 0 and eb = 01111111111,
so the first three hex digits are 001111111111 = 3ff, and 1
is represented by
3ff0000000000000
For 4/3, f = 0.01010101 · · · 0101, or 55 · · · 5 in hex. As with
1, 4/3 has e = 0, and so it has representation
3ff5555555555555
which is just slightly smaller than 4/3.
The real number 0.1 has e = −4, and
f = 0.10011001 · · · 10011010, and thus has representation
3fb999999999999a
which is just slightly larger than 0.1.
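The same bit patterns, reproduced in Python (the helper mimics MATLAB's format hex and is not from the slides):

```python
import struct

def hexbits(x):
    # hex bit pattern of a double, like MATLAB's format hex
    return struct.pack('>d', x).hex()

print(hexbits(1.0))   # 3ff0000000000000
print(hexbits(4/3))   # 3ff5555555555555: slightly smaller than 4/3
print(hexbits(0.1))   # 3fb999999999999a: slightly larger than 0.1
```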
Round-off
Since fl(4/3) ≠ 4/3 (where fl(x) stands for "the floating point
representation of x"), we see the behavior

1 − 3 ∗ (4/3 − 1) ≠ 0.

All of the operations except the division are performed
without error, and the special value

ε = 2^−52

is the result.
ε is referred to as machine epsilon, or the unit roundoff, and it
is the distance between 1 and the next larger floating point
number.
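Python (version 3.9 or later) exposes this distance directly through math.ulp and math.nextafter:

```python
import math

eps = math.ulp(1.0)   # distance from 1.0 to the next larger double
print(eps == 2**-52)                           # True
print(math.nextafter(1.0, 2.0) - 1.0 == eps)   # True
```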
Example
A very common example of propagation of round-off comes
in the form of
0.1 + 0.1 + 0.1
Specifically, is the above expression equal to 0.3?

No! As a matter of fact, MATLAB will tell you that 0.3 is
represented by
3fd3333333333333
while 0.1 + 0.1 + 0.1 is represented by
3fd3333333333334
The difference in the last place is due to accumulation of the
difference between 0.1 and fl(0.1).
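The same one-bit discrepancy shows up in Python, since its floats are the same doubles:

```python
import struct

print(0.1 + 0.1 + 0.1 == 0.3)                    # False
print(struct.pack('>d', 0.3).hex())              # 3fd3333333333333
print(struct.pack('>d', 0.1 + 0.1 + 0.1).hex())  # 3fd3333333333334
```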


Deadly Consequences

Numerical Disasters
1991: Patriot Missile misses Scud!
1996: Ariane Rocket explodes!



Swamping
Due to finiteness of precision, floating point addition can
suffer swamping. Suppose we have two floating point
numbers a = 10^5 and b = 10^−12. The quantity c = a + b is
equal to a, since a and b differ by many orders of
magnitude.

To rectify the effects of swamping, one may compute in
increasing order of magnitude. For example, try these in
MATLAB:

eps/2 + 1 − eps/2
eps/2 − eps/2 + 1

Note: It is frequently infeasible to do this!
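The two suggested expressions, evaluated in Python (sys.float_info.epsilon plays the role of MATLAB's eps):

```python
import sys

eps = sys.float_info.epsilon   # 2**-52

# Left to right, eps/2 is swamped by the 1 before it can be subtracted back:
print(eps/2 + 1 - eps/2)   # 0.9999999999999999

# Summing the small terms first keeps the result exact:
print(eps/2 - eps/2 + 1)   # 1.0
```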


Cancellation
A phenomenon not dissimilar from swamping is
cancellation, which occurs when a number is subtracted
from another number of roughly the same magnitude.

For example, for values of x very near 0, the expression

√(x + 1) − 1

suffers cancellation, as 1 swamps x in the computation of
x + 1, and the subsequent subtraction results in 0.
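A quick Python demonstration of the total loss:

```python
import math

x = 1e-20
# x + 1 rounds to exactly 1.0, so the subtraction returns 0
# even though sqrt(x + 1) - 1 is really about 5e-21.
print(math.sqrt(x + 1) - 1)   # 0.0
```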


Cancellation
To get around the effects of cancellation, one may rewrite
their computation in an equivalent form that avoids the
cancellation altogether. For example, computing with

√(x + 1) − 1 = x / (√(x + 1) + 1)

avoids the cancellation for values of x near zero. Now, the
only value of x that results in a zero output is 0 itself.

Note: Not all cancellation can be avoided, and not all
cancellation is bad!
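Comparing the two forms in Python (the helper name `sqrt1pm1` is ours):

```python
import math

def sqrt1pm1(x):
    """sqrt(x + 1) - 1 via the cancellation-free form."""
    return x / (math.sqrt(x + 1) + 1)

x = 1e-20
print(math.sqrt(x + 1) - 1)   # 0.0: the naive form loses every digit
print(sqrt1pm1(x))            # 5e-21: the rewritten form is accurate
```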


Resources
What Every Computer Scientist Should Know About
Floating-Point Arithmetic by David Goldberg.
Numerical Analysis, 8th ed. by Richard L. Burden and
J. Douglas Faires.
Numerical Computing with MATLAB by Cleve Moler.
Accuracy and Stability of Numerical Algorithms by
Nicholas J. Higham.
Technical Note regarding Floating Point Arithmetic.
