
Chapter 11

Machine Arithmetic

11.1 Decimal number system


In our decimal system, natural numbers are represented by a sequence of
digits. For example, 582 = 5 · 10^2 + 8 · 10 + 2. Generally,

    a_n ··· a_1 a_0 = 10^n a_n + ··· + 10 a_1 + a_0,

where a_i ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} are digits. Fractional numbers
require an additional fractional part:

    f = .d_1 d_2 … d_t … = 10^(−1) d_1 + 10^(−2) d_2 + ··· + 10^(−t) d_t + ···

which may be finite or infinite.

11.2 Floating point representation


Alternatively, any real number can be written as a product of a fractional
part with a sign and a power of ten:

    ±.d_1 d_2 … d_t … × 10^e

where the d_i are decimal digits and e ∈ Z is an integer. For example,

    58.2 = 0.582 × 10^2 = 0.0582 × 10^3,  etc.                        (11.1)

This is called the floating point representation of decimal numbers. The
part .d_1 d_2 … d_t … is called the mantissa and e is called the exponent.
By changing the exponent e with a fixed mantissa .d_1 d_2 … d_t … we can
move ("float") the decimal point, for example 0.582 × 10^2 = 58.2 and
0.582 × 10^1 = 5.82.

11.3 Normalized floating point representation
To avoid unnecessary multiple representations of the same number, such as
in (11.1), we can require that d_1 ≠ 0. We say the floating point
representation is normalized if d_1 ≠ 0. Then 0.582 × 10^2 is the only
normalized representation of the number 58.2.
For every positive real r > 0 there is a unique integer e ∈ Z such that
f = 10^(−e) r ∈ [0.1, 1). Then r = f × 10^e is the normalized representation
of r. For most real numbers the normalized representation is unique;
however, there are exceptions, such as

    .9999… × 10^0 = .1000… × 10^1,

both representing the number r = 1. In such cases one of the two
representations has a finite fractional part and the other has an infinite
fractional part with trailing nines.
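The normalization step is easy to mimic in code. Below is a minimal Python
sketch (the helper normalize is ours, purely for illustration) that finds
the unique exponent e with 10^(−e) r ∈ [0.1, 1):

```python
import math

def normalize(r):
    """Return (f, e) with r = f * 10**e and f in [0.1, 1); assumes r > 0.

    Caution: math.log10 is itself rounded, so the computed e can be off
    by one when r is extremely close to an exact power of ten.
    """
    e = math.floor(math.log10(r)) + 1   # unique e with 10**-e * r in [0.1, 1)
    return r / 10**e, e

print(normalize(58.2))    # (0.582, 2)   since 58.2   = 0.582 x 10^2
print(normalize(0.0075))  # (0.75, -2)   since 0.0075 = 0.75  x 10^(-2)
```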

11.4 Binary number system


In the binary number system, the base is 2 (instead of 10), and there
are only two digits: 0 and 1. Any natural number N can be written, in the
binary system, as a sequence of binary digits:

    N = (a_n ··· a_1 a_0)_2 = 2^n a_n + ··· + 2 a_1 + a_0

where a_i ∈ {0, 1}. For example, 5 = 101_2, 11 = 1011_2, 64 = 1000000_2,
etc. The binary system, due to its simplicity, is used by all computers. In
the modern computer world, the word bit means binary digit.
Fractional numbers require an additional fractional part:

    f = (.d_1 d_2 … d_t …)_2 = 2^(−1) d_1 + 2^(−2) d_2 + ··· + 2^(−t) d_t + ···

which may be finite or infinite. For example, 0.5 = 0.1_2, 0.625 = 0.101_2, and

    0.6 = (0.10011001100110011…)_2

(the blocks 00 and 11 alternate indefinitely). The floating point
representation of real numbers in the binary system is given by

    r = ±(.d_1 d_2 … d_t …)_2 × 2^e

where .d_1 d_2 … d_t … is called the mantissa and e ∈ Z is called the
exponent. Again, we say that the above representation is normalized if
d_1 ≠ 0; this ensures uniqueness for almost all real numbers. Note that
d_1 ≠ 0 implies d_1 = 1, i.e., every normalized binary representation begins
with a one.
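The binary digits d_1, d_2, … of a fraction can be generated by repeated
doubling: at each step, the integer part of 2f is the next digit. A short
Python sketch (our own helper, for illustration only) that reproduces the
repeating expansion of 0.6:

```python
def binary_fraction_digits(f, n):
    """First n binary digits d_1 d_2 ... of a fraction 0 <= f < 1.

    Repeated doubling: the integer part of 2*f is the next digit. Since
    f is stored as an IEEE double, the digits are only reliable for
    roughly the first 52 positions.
    """
    digits = []
    for _ in range(n):
        f *= 2
        d = int(f)        # next binary digit, 0 or 1
        digits.append(d)
        f -= d
    return digits

print(binary_fraction_digits(0.625, 4))  # [1, 0, 1, 0]: 0.625 = (0.101)_2
print(binary_fraction_digits(0.6, 13))   # [1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
```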

11.5 Other number systems
We can use a number system with any fixed base β ≥ 2. By analogy
with Sections 11.1 and 11.4, any natural number N can be written, in that
system, as a sequence of digits:

    N = (a_n ··· a_1 a_0)_β = β^n a_n + ··· + β a_1 + a_0

where a_i ∈ {0, 1, …, β − 1} are digits. Fractional numbers require an
additional fractional part:

    f = (.d_1 d_2 … d_t …)_β = β^(−1) d_1 + β^(−2) d_2 + ··· + β^(−t) d_t + ···

which may be finite or infinite. The floating point representation of real
numbers in the system with base β is given by

    r = ±.d_1 d_2 … d_t … × β^e

where .d_1 d_2 … d_t … is called the mantissa and e ∈ Z is called the
exponent. Again, we say that the above representation is normalized if
d_1 ≠ 0; this ensures uniqueness for almost all real numbers.
In the real world, computers can only handle a certain fixed number of
digits in their electronic memory. For the same reason, possible values of
the exponent e are always limited to a certain fixed interval. This
motivates our next definition.

11.6 Machine number systems (an abstract version)


A machine number system is specified by four integers, (β, t, L, U), where

    β ≥ 2 is the base,
    t ≥ 1 is the length of the mantissa,
    L ∈ Z is the minimal value for the exponent e,
    U ∈ Z is the maximal value for the exponent e (of course, L ≤ U).

Real numbers in a machine system are represented by

    r = ±.d_1 d_2 … d_t × β^e,    L ≤ e ≤ U                          (11.2)

The representation must be normalized, i.e., d_1 ≠ 0.

88
11.7 Basic properties of machine systems
Suppose we use a machine system with parameters (β, t, L, U). There are
finitely many real numbers that can be represented by (11.2). More
precisely, there are

    2 (β − 1) β^(t−1) (U − L + 1)

such numbers. The largest machine number is

    M = .(β−1)(β−1) … (β−1) × β^U = (1 − β^(−t)) β^U.

The smallest positive machine number is

    m = .10 … 0 × β^L = β^(L−1).

Note that zero cannot be represented in the above format, since we require
d_1 ≠ 0. For this reason, every real machine system includes a few special
numbers, like zero, that have to be represented differently. Other "special
numbers" are +∞ and −∞.
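These formulas are easy to check numerically. A small Python sketch, using
a toy system whose parameters (β, t, L, U) = (10, 2, −1, 2) are chosen here
purely for illustration:

```python
# Toy machine system (beta, t, L, U) = (10, 2, -1, 2), for illustration only.
beta, t, L, U = 10, 2, -1, 2

count = 2 * (beta - 1) * beta**(t - 1) * (U - L + 1)  # 2(beta-1) beta^(t-1) (U-L+1)
M = (1 - beta**(-t)) * beta**U                        # largest machine number
m = beta**(L - 1)                                     # smallest positive machine number

print(count, M, m)   # 720 99.0 0.01
```

Indeed, the largest number in this toy system is .99 × 10^2 = 99 and the
smallest positive one is .10 × 10^(−1) = 0.01.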

11.8 Two standard machine systems


Most modern computers conform to the IEEE floating-point standard (ANSI/IEEE
Standard 754-1985), which specifies two machine systems:
I. Single precision is defined by β = 2, t = 24, L = −125, and U = 128.
II. Double precision is defined by β = 2, t = 53, L = −1021, and U = 1024.
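In Python, whose floats are IEEE 754 double precision on essentially all
platforms, the built-in sys module reports exactly these parameters:

```python
import sys

# Python floats are IEEE 754 double precision, so these values should
# match (beta, t, L, U) = (2, 53, -1021, 1024) from system II above.
print(sys.float_info.radix)     # 2      -- the base beta
print(sys.float_info.mant_dig)  # 53     -- mantissa length t
print(sys.float_info.min_exp)   # -1021  -- minimal exponent L
print(sys.float_info.max_exp)   # 1024   -- maximal exponent U
print(sys.float_info.max)       # (1 - 2^-53) * 2^1024, the number M of Sect. 11.7
print(sys.float_info.min)       # 2^-1022, the number m of Sect. 11.7
```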

11.9 Rounding rules


A machine system with parameters (β, t, L, U) provides exact
representation for finitely many real numbers. Other real numbers have to
be approximated by machine numbers. Suppose x ≠ 0 is a real number with
normalized floating point representation

    x = ±.d_1 d_2 … × β^e

where the number of digits may be finite or infinite.
If e > U or e < L, then x cannot be properly represented in the machine
system (it is either "too large" or "too small"). If e < L, then x is
usually converted to the special number zero. If e > U, then x is usually
converted to the special number +∞ or −∞.
If e ∈ [L, U] is within the proper range, then the mantissa of x has to
be reduced to t digits (if it is longer than that or infinite). There are
two standard versions of such reduction:
(a) keep the first t digits and chop off the rest;
(b) round off to the nearest available value, i.e. use the rules

    .d_1 … d_t               if d_(t+1) < β/2
    .d_1 … d_t + .0 … 01     if d_(t+1) ≥ β/2
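Both reduction rules can be sketched in Python for the base β = 10 case
using the standard decimal module. The helper below is our own illustration
(not a standard API), with ROUND_DOWN playing the role of chopping (a) and
ROUND_HALF_UP matching rule (b) for β = 10:

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def reduce_mantissa(x, t, rounding=ROUND_DOWN):
    """Reduce the decimal mantissa of x to t digits (base beta = 10).

    ROUND_DOWN implements rule (a), chopping; ROUND_HALF_UP matches
    rule (b) when beta = 10.
    """
    d = Decimal(str(x))
    # d.adjusted() is the position of the leading significant digit,
    # so digit d_t sits at decimal position d.adjusted() - t + 1.
    q = Decimal(1).scaleb(d.adjusted() - t + 1)
    return d.quantize(q, rounding=rounding)

print(reduce_mantissa(1.96, 2))                           # 1.9 (chopped)
print(reduce_mantissa(1.96, 2, rounding=ROUND_HALF_UP))   # 2.0 (rounded)
print(reduce_mantissa(-197, 2))                           # -1.9E+2, i.e. -190
```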

11.10 Relative errors
Let x be a real number and x_c its computer representation in a machine
system with parameters (β, t, L, U), as described above. We will always
assume that x is neither too big nor too small, i.e., its exponent e is
within the proper range L ≤ e ≤ U. How accurately does x_c represent x?
(How close is x_c to x?)
The absolute error of the computer representation, i.e., x_c − x, may be
quite large when the exponent e is big. It is more customary to describe the
accuracy in terms of the relative error (x_c − x)/x, as follows:

    (x_c − x)/x = ε    or    x_c = x(1 + ε).

It is easy to see that the maximal possible value of |ε| is

    |ε| ≤ u = β^(1−t)          for chopped arithmetic (a)
    |ε| ≤ u = (1/2) β^(1−t)    for rounded arithmetic (b)

The number u is called the unit roundoff or machine epsilon.

11.11 Machine epsilon


Note that the machine epsilon u is not the smallest positive number m
represented by the given machine system (cf. Sect. 11.7).
One can describe u as the smallest positive number ε > 0 such that
(1 + ε)_c ≠ 1. In other words, u is the smallest positive value that, when
added to one, yields a result different from one.
In more practical terms, u tells us how many accurate digits machine
numbers can carry. If u ~ 10^(−p), then any machine number x_c carries at
most p accurate decimal digits.
For example, suppose for a given machine system u ~ 10^(−7) and a machine
number x_c representing some real number x has value 35.41879236 (when
printed on paper or displayed on a computer screen). Then we can say that
x ≈ 35.41879, and the digits of x beyond 9 cannot be determined. In
particular, the digits 236 in the printed value of x_c are meaningless (they
are "trash" to be discarded).
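The characterization of u as the smallest positive ε with (1 + ε)_c ≠ 1
suggests a direct experiment. In Python (IEEE double precision) a simple
halving loop recovers u = 2^(−52):

```python
# Halve eps until 1 + eps is no longer distinguishable from 1;
# the last eps that still makes a difference is the machine epsilon.
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
print(eps)                     # 2.220446049250313e-16 = 2^-52

import sys
print(sys.float_info.epsilon)  # the same value, as reported by the system
```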

11.12 Machine epsilon for the two standard machine systems


I. For the IEEE floating-point single precision standard with chopped
arithmetic u = 2^(−23) ≈ 1.2 × 10^(−7). In other words, approximately 7
decimal digits are accurate.
II. For the IEEE floating-point double precision standard with chopped
arithmetic u = 2^(−52) ≈ 2.2 × 10^(−16). In other words, approximately 16
decimal digits are accurate.

In the next two examples, we will solve systems of linear equations by
using the rules of a machine system. In other words, we will pretend that we
are computers. This will help us understand how real computers work.

11.13 Example
Let us solve the system of equations

    | 0.01  2 | |x|   | 2 |
    |  1    3 | |y| = | 4 |

The exact solution was found in Section 7.13:

    x = 200/197 ≈ 1.015    and    y = 196/197 ≈ 0.995.

Now let us solve this system by using chopped arithmetic with base β = 10
and t = 2 (i.e., our mantissa will always be limited to two decimal digits).
First we use Gaussian elimination (without pivoting). Multiplying the first
equation by 100 and subtracting it from the second gives −197y = −196. Both
numbers −197 and −196 are three digits long, so we must chop the third digit
off. This gives −190y = −190, hence y = 1. Substituting y = 1 into the first
equation gives 0.01x = 2 − 2 = 0, hence x = 0. Thus our computed solution is

    x_c = 0    and    y_c = 1.

The relative error in x is (x_c − x)/x = −1, i.e., the computed x_c is 100%
off the mark!
Let us increase the length of the mantissa to t = 3 and repeat the
calculations. This gives

    x_c = 2    and    y_c = 0.994.

The relative error in x is now (x_c − x)/x = 0.97, i.e., the computed x_c is
97% off the mark. Not much of an improvement... We postpone the explanation
until Chapter 13.
Let us now apply partial pivoting (Section 7.15). First we interchange the rows:

    |  1    3 | |x|   | 4 |
    | 0.01  2 | |y| = | 2 |

Now we multiply the first equation by 0.01 and subtract it from the second
to get 1.97y = 1.96. With t = 2, we must chop off the third digit in both
numbers: 1.9y = 1.9, hence y = 1. Substituting y = 1 into the first equation
gives x + 3 = 4, hence x = 1, i.e.,

    x_c = 1    and    y_c = 1.

This is a great improvement over the first two solutions (without pivoting).
The relative error in x is now (x_c − x)/x = −0.015, so the computed x_c is
only 1.5% off. The relative error in y is even smaller (about 0.005).

Machine arithmetic           Exact arithmetic:              Machine arithmetic
with t = 2:                                                 with t = 3:

                             0.01x + 2y = 2
                                 x + 3y = 4
                                    ↓
0.01x + 2y = 2   ← chop      0.01x + 2y = 2                 0.01x + 2y = 2
 −190y = −190      off        −197y = −196                   −197y = −196
      ↓                             ↓                              ↓
0.01x + 2y = 2               0.01x + 2y = 2      chop →     0.01x + 2y = 2
     y = 1                   y = 196/197 ≈ 0.9949  off          y = 0.994
      ↓                             ↓                              ↓
0.01x = 2 − 2 = 0  (!!)      0.01x = 2 − 392/197 = 2/197    0.01x = 2 − 1.98 = 0.02  (!!)
     y = 1                   y = 196/197                        y = 0.994
      ↓                             ↓                              ↓
     x = 0                   x = 200/197 ≈ 1.0152               x = 2
     y = 1                   y = 196/197 ≈ 0.9949               y = 0.994

                     Example 11.13 without pivoting.

Machine arithmetic           Exact arithmetic:              Machine arithmetic
with t = 2:                                                 with t = 3:

                                 x + 3y = 4
                             0.01x + 2y = 2
                                    ↓
x + 3y = 4       ← chop          x + 3y = 4                     x + 3y = 4
 1.9y = 1.9        off           1.97y = 1.96                   1.97y = 1.96
      ↓                             ↓                              ↓
x + 3y = 4                       x + 3y = 4      chop →         x + 3y = 4
     y = 1                   y = 196/197 ≈ 0.9949  off          y = 0.994
      ↓                             ↓                              ↓
x = 4 − 3 = 1                x = 4 − 588/197 = 200/197      x = 4 − 2.98 = 1.02
     y = 1                   y = 196/197                        y = 0.994

                  Example 11.13 with partial pivoting.

Now let us continue the partial pivoting with t = 3. We can keep three
digits, so 1.97y = 1.96 gives y = 1.96/1.97 ≈ 0.9949, which we have to
reduce to y = 0.994. Substituting this value of y into the first equation
gives x + 2.982 = 4, which we have to reduce to x + 2.98 = 4, hence
x = 1.02. So now

    x_c = 1.02    and    y_c = 0.994.

The relative error in x is (x_c − x)/x ≈ 0.0047, less than 0.5%.

The table below shows the relative error of the numerical solution x_c by
Gaussian elimination with pivoting and different lengths of mantissa. We see
that the relative error is roughly proportional to the "typical" round-off
error 10^(−t), with a factor of about 2 to 5. We can hardly expect better
accuracy.

             relative error      typical error      factor

    t = 2     1.5 × 10^(−2)         10^(−2)           1.5
    t = 3     4.7 × 10^(−3)         10^(−3)           4.7
    t = 4     2.2 × 10^(−4)         10^(−4)           2.2

Conclusions: Gaussian elimination without pivoting may lead to catastrophic
errors, which will remain unexplained until Chapter 13. Pivoting is more
reliable – it seems to provide nearly the maximum possible accuracy here.
But see the next example...
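The hand computations of Example 11.13 can also be replayed mechanically.
Here is a short Python sketch (the helpers chop and solve_2x2 are ours, not
a standard API) that performs 2 × 2 Gaussian elimination while chopping
every intermediate result to a t-digit decimal mantissa; it reproduces the
computed solutions obtained above:

```python
from decimal import Decimal, ROUND_DOWN

def chop(x, t):
    """Chop the decimal mantissa of x to t digits (rule (a) of Sect. 11.9)."""
    d = Decimal(str(x))
    q = Decimal(1).scaleb(d.adjusted() - t + 1)
    return d.quantize(q, rounding=ROUND_DOWN)

def solve_2x2(a, b, c, d, p, q, t):
    """Solve ax + by = p, cx + dy = q by Gaussian elimination,
    chopping every intermediate result to a t-digit mantissa."""
    a, b, c, d, p, q = (Decimal(str(v)) for v in (a, b, c, d, p, q))
    m = chop(c / a, t)                       # elimination multiplier
    d2 = chop(d - chop(m * b, t), t)         # second equation after elimination
    q2 = chop(q - chop(m * p, t), t)
    y = chop(q2 / d2, t)                     # back substitution, chopped
    x = chop((p - chop(b * y, t)) / a, t)
    return x, y

print(solve_2x2(0.01, 2, 1, 3, 2, 4, t=2))   # without pivoting: x = 0,    y = 1
print(solve_2x2(0.01, 2, 1, 3, 2, 4, t=3))   # without pivoting: x = 2,    y = 0.994
print(solve_2x2(1, 3, 0.01, 2, 4, 2, t=2))   # with pivoting:    x = 1,    y = 1
print(solve_2x2(1, 3, 0.01, 2, 4, 2, t=3))   # with pivoting:    x = 1.02, y = 0.994
```

Swapping the two rows in the argument list is exactly the partial pivoting
step, which is why the last two calls give the accurate answers.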

11.14 Example
Let us solve another system of equations:

    | 3   1    | |x|   |  5  |
    | 1   0.35 | |y| = | 1.7 |

The exact solution here is

    x = 1    and    y = 2.

The largest coefficient, 3, is at the top left corner already, so pivoting
(partial or complete) would not change anything.
Solving this system in chopped arithmetic with β = 10 and t = 2 gives
x_c = 0 and y_c = 5, which is 150% off. Increasing the length of the
mantissa to t = 3 gives x_c = 0.883 and y_c = 2.35, so the relative error is
17%. With t = 4, we obtain x_c = 0.987 and y_c = 2.039; now the relative
error is 2%. The table below shows that the relative error of the numerical
solutions is roughly proportional to the typical round-off error 10^(−t),
but with a big factor fluctuating around 150 or 200.

             relative error      typical error      factor

    t = 2     1.5 × 10^0            10^(−2)           150
    t = 3     1.7 × 10^(−1)         10^(−3)           170
    t = 4     2.0 × 10^(−2)         10^(−4)           200

We postpone a complete analysis of the above two examples until Chapter 13.

11.15 Computational errors
Let x and y be two real numbers represented in a machine system by x_c
and y_c, respectively. An arithmetic operation x ∗ y (where ∗ stands for one
of the four basic operations: +, −, ×, ÷) is performed by a computer in the
following way. The computer first finds x_c ∗ y_c exactly and then
represents that number in its machine system. The result is
z = (x_c ∗ y_c)_c.
Note that, generally, z is different from (x ∗ y)_c, which is the machine
representation of the exact result x ∗ y. Hence, z is not necessarily the
best representation for x ∗ y. In other words, the computer makes additional
round-off errors at each arithmetic operation.
Assuming that x_c = x(1 + ε_1) and y_c = y(1 + ε_2) we have

    (x_c ∗ y_c)_c = (x_c ∗ y_c)(1 + ε_3) = ([x(1 + ε_1)] ∗ [y(1 + ε_2)])(1 + ε_3)

where |ε_1|, |ε_2|, |ε_3| ≤ u.

11.16 Multiplication and division


For multiplication, we have

    z = xy(1 + ε_1)(1 + ε_2)(1 + ε_3) ≈ xy(1 + ε_1 + ε_2 + ε_3)

(here we ignore higher order terms like ε_1 ε_2), so the relative error is
(approximately) bounded by 3u. A similar estimate can be made for division:

    z = x(1 + ε_1)(1 + ε_3) / [y(1 + ε_2)] ≈ (x/y)(1 + ε_1 − ε_2 + ε_3).

Note: we used the Taylor expansion

    1/(1 + ε_2) = 1 − ε_2 + ε_2^2 − ε_2^3 + ···

and again ignored higher order terms.
Thus again the relative error is (approximately) bounded by 3u.
Conclusion: machine multiplication and machine division magnify relative
errors by a factor of three, at most.
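For one isolated multiplication (exact inputs, so ε_1 = ε_2 = 0) the bound
|ε_3| ≤ u can be checked directly. The sketch below assumes IEEE double
precision with rounded arithmetic, so u = 2^(−53), and uses Python's exact
rational arithmetic as the reference:

```python
from fractions import Fraction

# Fraction(float) is exact, so `exact` is the true product of the two
# stored machine numbers, and the comparison isolates the error eps_3
# introduced by the multiplication itself.
x, y = 1.1, 2.2
exact = Fraction(x) * Fraction(y)
rel_err = abs((Fraction(x * y) - exact) / exact)
print(float(rel_err) <= 2.0**-53)   # True: one multiplication costs at most u
```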

11.17 Addition and subtraction
For addition, we have

    z = (x + y + xε_1 + yε_2)(1 + ε_3) = (x + y) (1 + (xε_1 + yε_2)/(x + y)) (1 + ε_3).

Again ignoring higher order terms, we can bound the relative error of z by

    [(|x| + |y|) / |x + y|] u + u.

Thus, the operation of addition magnifies relative errors by a factor of

    (|x| + |y|) / |x + y| + 1.

Similar estimates can be made for subtraction x − y: it magnifies relative
errors by a factor of

    (|x| + |y|) / |x − y| + 1.

Hence addition and subtraction magnify relative errors by a variable factor
which depends on x and y. This factor may be arbitrarily large if x + y ≈ 0
for addition or x − y ≈ 0 for subtraction. This phenomenon is known as
catastrophic cancellation. It occurred in our Example 11.13 when we solved
it without pivoting (see the lines marked with double exclamation signs in
the first table following Example 11.13).
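Catastrophic cancellation is easy to provoke in IEEE double precision. In
the sketch below the two operands agree in their first 15 digits, so the
magnification factor (|x| + |y|)/|x − y| is about 2 × 10^15:

```python
# Subtracting two nearly equal numbers wipes out the leading accurate
# digits: a relative input error of order u ~ 1.1e-16 is magnified by
# (|a| + |b|) / |a - b| ~ 2e15, giving a result off by order 0.1.
a = 1.0 + 1e-15
b = 1.0
print(a - b)   # 1.1102230246251565e-15, not 1e-15: about 11% relative error
```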

Exercise 11.1. (JPE, September 1993). Solve the system

    | 0.001  1.00 | |x|   | 1.00 |
    | 1.00   2.00 | |y| = | 3.00 |

using the LU decomposition with and without partial pivoting and chopped
arithmetic with base β = 10 and t = 3 (i.e., work with a three-digit
mantissa). Obtain computed solutions (x_c, y_c) in both cases. Find the
exact solution, compare, make comments.

Exercise 11.2. (JPE, May 2003). Consider the system

    | ε  1 | |x|   | 1 |
    | 2  1 | |y| = | 0 |

Assume that |ε| ≪ 1. Solve the system by using the LU decomposition with
and without partial pivoting and adopting the following rounding-off models
(at all stages of the computation!):

    a + bε = a        (for a ≠ 0),
    a + b/ε = b/ε     (for b ≠ 0).

Find the exact solution, compare, make comments.
