Machine Arithmetic
$$a_n \cdots a_1 a_0 = 10^n a_n + \cdots + 10 a_1 + a_0,$$
$$f = .d_1 d_2 \ldots d_t \ldots = 10^{-1} d_1 + 10^{-2} d_2 + \cdots + 10^{-t} d_t + \cdots$$
$$\pm .d_1 d_2 \ldots d_t \ldots \times 10^e$$
11.3 Normalized floating point representation
To avoid unnecessary multiple representations of the same number, such
as in (11.1), we can require that $d_1 \neq 0$. We say the floating point
representation is normalized if $d_1 \neq 0$. Then $0.582 \times 10^2$ is the only
normalized representation of the number 58.2.
For every real $r > 0$ there is a unique integer $e \in \mathbb{Z}$ such that
$f = 10^{-e} r \in [0.1, 1)$. Then $r = f \times 10^e$ is the normalized representation of $r$.
For most real numbers the normalized representation is unique; however,
there are exceptions, such as
$$.9999\ldots \times 10^0 = .1000\ldots \times 10^1,$$
both representing the number r = 1. In such cases one of the two representa-
tions has a finite fractional part and the other has an infinite fractional part
with trailing nines.
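
The normalized representation is easy to compute. Here is a minimal Python sketch (normalize is our own helper, not a library routine); it leans on floating-point log10, so we guard against its off-by-one artifacts near exact powers of ten:

```python
import math

def normalize(r):
    """Return (f, e) with r = f * 10**e and f in [0.1, 1), for r > 0."""
    e = math.floor(math.log10(r)) + 1   # candidate exponent
    f = r / 10**e
    if f >= 1.0:                        # guard against log10 rounding artifacts
        f, e = f / 10, e + 1
    elif f < 0.1:
        f, e = f * 10, e - 1
    return f, e

print(normalize(58.2))   # (0.582, 2), i.e., 58.2 = .582 x 10^2
print(normalize(1.0))    # (0.1, 1),  matching .1000... x 10^1 above
```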
11.5 Other number systems
We can use a number system with any fixed base $\beta \geq 2$. By analogy
with Sections 11.1 and 11.4, any natural number N can be written, in that
system, as a sequence of digits:
$$N = (a_n \cdots a_1 a_0)_\beta = \beta^n a_n + \cdots + \beta a_1 + a_0$$
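
In code, the digits $a_i$ can be peeled off by repeated division with remainder. A short Python sketch (to_base is our own name):

```python
def to_base(N, beta):
    """Return the digits (a_n, ..., a_1, a_0) of a natural number N in base beta."""
    assert N >= 0 and beta >= 2
    digits = []
    while True:
        N, a = divmod(N, beta)   # a is the current last digit
        digits.append(a)
        if N == 0:
            break
    return digits[::-1]          # most significant digit first

print(to_base(13, 2))   # [1, 1, 0, 1], since 13 = 2^3 + 2^2 + 2^0
```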
11.7 Basic properties of machine systems
Suppose we use a machine system with parameters $(\beta, t, L, U)$. There are finitely many
real numbers that can be represented by (11.2). More precisely, there are
$$2(\beta - 1)\beta^{t-1}(U - L + 1)$$
such numbers. The largest machine number is
$$M = .(\beta-1)(\beta-1)\ldots(\beta-1) \times \beta^U = \beta^U (1 - \beta^{-t}).$$
The smallest positive machine number is
$$m = .10\ldots0 \times \beta^L = \beta^{L-1}.$$
Note that zero cannot be represented in the above format, since we require $d_1 \neq 0$. For
this reason, every real machine system includes a few special numbers, like zero, that
have to be represented differently. Other "special numbers" are $+\infty$ and $-\infty$.
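
These formulas are easy to check for a toy system. A small Python sketch (machine_stats is our own helper; the sample parameters are arbitrary):

```python
def machine_stats(beta, t, L, U):
    """Count of machine numbers plus the largest (M) and smallest positive (m)."""
    count = 2 * (beta - 1) * beta**(t - 1) * (U - L + 1)
    M = beta**U * (1 - beta**(-t))   # mantissa .(beta-1)...(beta-1), exponent U
    m = beta**(L - 1)                # mantissa .10...0, exponent L
    return count, M, m

# Toy system: beta = 10, t = 2, exponents from L = -1 to U = 2
print(machine_stats(10, 2, -1, 2))   # (720, 99.0, 0.01)
```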
11.10 Relative errors
Let $x$ be a real number and $x_c$ its computer representation in a machine
system with parameters $(\beta, t, L, U)$, as described above. We will always assume
that $x$ is neither too big nor too small, i.e., its exponent $e$ is within the proper range
$L \leq e \leq U$. How accurately does $x_c$ represent $x$? (How close is $x_c$ to $x$?)
The absolute error of the computer representation, i.e., $x_c - x$, may be
quite large when the exponent $e$ is big. It is more customary to describe the
accuracy in terms of the relative error $(x_c - x)/x$, as follows:
$$\frac{x_c - x}{x} = \varepsilon \qquad\text{or}\qquad x_c = x(1 + \varepsilon).$$
It is easy to see that the maximal possible value of $|\varepsilon|$ is
$$|\varepsilon| \leq u = \begin{cases} \beta^{1-t} & \text{for chopped arithmetic} & (a) \\ \tfrac{1}{2}\,\beta^{1-t} & \text{for rounded arithmetic} & (b) \end{cases}$$
The number $u$ is called the unit roundoff or machine epsilon.
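
Chopping and rounding to $t$ significant digits are easy to model. Below is a minimal Python sketch for base $\beta = 10$ and $x > 0$ (chop and round_to are our own helpers); keep in mind that Python itself computes in binary floating point, which can perturb the last chopped digit in borderline cases:

```python
import math

def chop(x, t):
    """Chop x > 0 to t significant decimal digits."""
    e = math.floor(math.log10(x)) + 1     # x = f * 10**e with f in [0.1, 1)
    scale = 10**(t - e)
    return math.floor(x * scale) / scale  # drop every digit past the t-th

def round_to(x, t):
    """Round x > 0 to t significant decimal digits."""
    e = math.floor(math.log10(x)) + 1
    scale = 10**(t - e)
    return math.floor(x * scale + 0.5) / scale

x = 196 / 197                 # 0.994923...
print(chop(x, 3))             # 0.994, relative error < u = 10**(1-3) = 0.01
print(round_to(x, 3))         # 0.995, relative error < u/2 = 0.005
```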
In the next two examples, we will solve systems of linear equations by
using the rules of a machine system. In other words, we will pretend that we
are computers. This will help us understand how real computers work.
11.13 Example
Let us solve the system of equations
$$\begin{pmatrix} 0.01 & 2 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 2 \\ 4 \end{pmatrix}$$
Now let us solve this system by using chopped arithmetic with base $\beta = 10$ and $t = 2$ (i.e.,
our mantissa will always be limited to two decimal digits).
First we use Gaussian elimination (without pivoting). Multiplying the first equation
by 100 and subtracting it from the second gives $-197y = -196$. Both numbers $-197$ and
$-196$ are three digits long, so we must chop the third digit off. This gives $-190y = -190$,
hence $y = 1$. Substituting $y = 1$ into the first equation gives $0.01x = 2 - 2 = 0$, hence
$x = 0$. Thus our computed solution is
$$x_c = 0 \qquad\text{and}\qquad y_c = 1.$$
The relative error in $x$ is $(x_c - x)/x = -1$, i.e., the computed $x_c$ is 100% off the mark!
Let us increase the length of the mantissa to $t = 3$ and repeat the calculations. This gives
$$x_c = 2 \qquad\text{and}\qquad y_c = 0.994.$$
The relative error in $x$ is now $(x_c - x)/x = 0.97$, i.e., the computed $x_c$ is 97% off the mark.
Not much of an improvement... We postpone the explanation until Chapter 13.
Let us now apply partial pivoting (Section 7.15). First we interchange the rows:
$$\begin{pmatrix} 1 & 3 \\ 0.01 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 4 \\ 2 \end{pmatrix}.$$
Now we multiply the first equation by 0.01 and subtract it from the second to get $1.97y =
1.96$. With $t = 2$, we must chop off the third digit in both numbers: $1.9y = 1.9$, hence
$y = 1$. Substituting $y = 1$ into the first equation gives $x + 3 = 4$, hence $x = 1$, i.e.,
$$x_c = 1 \qquad\text{and}\qquad y_c = 1.$$
This is a great improvement over the first two solutions (without pivoting). The relative
error in $x$ is now $(x_c - x)/x = -0.015$, so the computed $x_c$ is only 1.5% off. The relative
error in $y$ is even smaller (about 0.005).
Here are the three computations of Example 11.13 side by side; the lines marked with double exclamation signs (!!) are where the damage occurs:

Machine arithmetic          Exact arithmetic:              Machine arithmetic
with t = 2:                                                with t = 3:

                            0.01x + 2y = 2
                               x + 3y = 4
                                   ↓
0.01x + 2y = 2  <-chop--    0.01x + 2y = 2                 0.01x + 2y = 2
 -190y = -190               -197y = -196                   -197y = -196
      ↓                            ↓                             ↓
0.01x + 2y = 2              0.01x + 2y = 2    --chop->     0.01x + 2y = 2
     y = 1                  y = 196/197 ≈ 0.9949           y = 0.994
      ↓                            ↓                             ↓
0.01x = 2 - 2 = 0   !!      0.01x = 2 - 392/197 = 2/197    0.01x = 2 - 1.98 = 0.02   !!
     y = 1                  y = 196/197                    (1.988 chopped to 1.98)
      ↓                            ↓                       y = 0.994
   x = 0                    x = 200/197 ≈ 1.0152                 ↓
   y = 1                    y = 196/197 ≈ 0.9949           x = 2
                                                           y = 0.994
Now let us continue the partial pivoting with $t = 3$. We can keep three digits, so
$1.97y = 1.96$ gives $y = 1.96/1.97 \approx 0.9949$, which we have to reduce to $y = 0.994$.
Substituting this value of $y$ into the first equation gives $x + 2.982 = 4$, which we have to
reduce to $x + 2.98 = 4$, hence $x = 1.02$. So now
$$x_c = 1.02 \qquad\text{and}\qquad y_c = 0.994.$$
Conclusions: Gaussian elimination without pivoting may lead to catastrophic errors, which
will remain unexplained until Chapter 13. Pivoting is more reliable: it seems to provide
nearly the maximum possible accuracy here. But see the next example...
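
The hand computations above are easy to mechanize. Here is a Python sketch (chop and solve_2x2 are our own helpers) that repeats Example 11.13 by chopping every intermediate result to $t$ significant decimal digits. Since Python works in binary floating point, the simulated chopping can be perturbed in borderline cases, but on these inputs it reproduces the results above:

```python
import math

def chop(x, t):
    """Chop x to t significant decimal digits (0 stays 0)."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1
    scale = 10**(t - e)
    return math.trunc(x * scale) / scale

def solve_2x2(A, b, t, pivot=False):
    """Gaussian elimination on a 2x2 system, chopping after every operation."""
    (a11, a12), (a21, a22) = A
    b1, b2 = b
    if pivot and abs(a21) > abs(a11):       # partial pivoting: swap the rows
        a11, a12, b1, a21, a22, b2 = a21, a22, b2, a11, a12, b1
    m = chop(a21 / a11, t)                  # elimination multiplier
    a22 = chop(a22 - chop(m * a12, t), t)   # eliminate x from the second equation
    b2 = chop(b2 - chop(m * b1, t), t)
    y = chop(b2 / a22, t)                   # back substitution
    x = chop((b1 - chop(a12 * y, t)) / a11, t)
    return x, y

A, b = [[0.01, 2], [1, 3]], [2, 4]
print(solve_2x2(A, b, t=2))                 # (0.0, 1.0)   -- catastrophic
print(solve_2x2(A, b, t=3))                 # (2.0, 0.994) -- still bad
print(solve_2x2(A, b, t=2, pivot=True))     # (1.0, 1.0)
print(solve_2x2(A, b, t=3, pivot=True))     # (1.02, 0.994)
```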
11.14 Example
Let us solve another system of equations:
$$\begin{pmatrix} 3 & 1 \\ 1 & 0.35 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 5 \\ 1.7 \end{pmatrix}$$
We postpone a complete analysis of the above two examples until Chapter 13.
11.15 Computational errors
Let $x$ and $y$ be two real numbers represented in a machine system by $x_c$
and $y_c$, respectively. An arithmetic operation $x * y$ (where $*$ stands for one
of the four basic operations $+, -, \times, \div$) is performed by a computer in the
following way. The computer first finds $x_c * y_c$ exactly and then represents
that number in its machine system. The result is $z = (x_c * y_c)_c$.
Note that, generally, $z$ is different from $(x * y)_c$, which is the machine representation
of the exact result $x * y$. Hence, $z$ is not necessarily the best representation for $x * y$. In
other words, the computer makes additional round-off errors at each arithmetic operation.
Assuming that $x_c = x(1 + \varepsilon_1)$ and $y_c = y(1 + \varepsilon_2)$, for multiplication we have
$$z = xy(1 + \varepsilon_1)(1 + \varepsilon_2)(1 + \varepsilon_3) \approx xy(1 + \varepsilon_1 + \varepsilon_2 + \varepsilon_3)$$
(here we ignore higher order terms like $\varepsilon_1 \varepsilon_2$), so the relative error is (approx-
imately) bounded by $3u$. A similar estimate can be made for division:
$$z = \frac{x}{y} \cdot \frac{(1 + \varepsilon_1)(1 + \varepsilon_3)}{1 + \varepsilon_2} \approx \frac{x}{y}\,(1 + \varepsilon_1 - \varepsilon_2 + \varepsilon_3),$$
so its relative error is also (approximately) bounded by $3u$.
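
A quick Python illustration (chop, our own helper, models the machine representation with $\beta = 10$): even when $x_c$ and $y_c$ are the best $t$-digit versions of $x$ and $y$, the machine product $z = (x_c \times y_c)_c$ can differ from the best representation $(x \times y)_c$:

```python
import math

def chop(x, t):
    """Chop x to t significant decimal digits: our model of x -> x_c."""
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x))) + 1
    scale = 10**(t - e)
    return math.trunc(x * scale) / scale

t = 3
x, y = 1/3, 2/3
xc, yc = chop(x, t), chop(y, t)   # x_c = 0.333, y_c = 0.666
z = chop(xc * yc, t)              # what the machine returns: (x_c * y_c)_c
best = chop(x * y, t)             # best possible answer:     (x * y)_c
print(z, best)                    # 0.221 0.222 -- they differ
```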
11.17 Addition and subtraction
For addition, we have
$$z = (x + y + x\varepsilon_1 + y\varepsilon_2)(1 + \varepsilon_3) = (x + y)\Bigl(1 + \frac{x\varepsilon_1 + y\varepsilon_2}{x + y}\Bigr)(1 + \varepsilon_3).$$
Again ignoring higher order terms, we can bound the relative error of $z$ by
$$\frac{|x| + |y|}{|x + y|}\, u + u.$$
Thus, the operation of addition magnifies relative errors by a factor of
$$\frac{|x| + |y|}{|x + y|} + 1.$$
Similar estimates can be made for subtraction $x - y$: it magnifies relative
errors by a factor of
$$\frac{|x| + |y|}{|x - y|} + 1.$$
Hence addition and subtraction magnify relative errors by a variable
factor which depends on $x$ and $y$. This factor may be arbitrarily large if
$x + y \approx 0$ for addition or $x - y \approx 0$ for subtraction. This phenomenon is
known as catastrophic cancellation. It occurred in Example 11.13 when we solved
the system without pivoting (see the lines marked with double exclamation signs in the
diagram above).
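
Catastrophic cancellation shows up even in ordinary double precision, where $u = 2^{-53} \approx 1.1 \times 10^{-16}$. In this small Python demonstration (the numbers are our own toy choices), $x$ and $y$ agree in their first nine significant digits, so the subtraction destroys most of the accurate digits:

```python
x = 1.2345678912345678
y = 1.2345678900000000
print(x - y)                           # about 1.2345679e-09, but only its
                                       # first 7 or so digits are reliable
print((abs(x) + abs(y)) / abs(x - y))  # magnification factor, about 2e9
```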