
CHAPTER
ONE
NUMBER SYSTEMS AND ERRORS

In this chapter we consider methods for representing numbers on computers and the errors introduced by these representations. In addition, we examine the sources of various types of computational errors and their subsequent propagation. We also discuss some mathematical preliminaries.

1.1 THE REPRESENTATION OF INTEGERS

In everyday life we use numbers based on the decimal system. Thus the
number 257, for example, is expressible as
257 = 2·100 + 5·10 + 7·1
    = 2·10^2 + 5·10^1 + 7·10^0
We call 10 the base of this system. Any integer is expressible as a
polynomial in the base 10 with integral coefficients between 0 and 9. We
use the notation
N = (a_n a_{n-1} ··· a_0)_{10}
  = a_n 10^n + a_{n-1} 10^{n-1} + ··· + a_0 10^0    (1.1)
to denote any positive integer in the base 10. There is no intrinsic reason to
use 10 as a base. Other civilizations have used other bases such as 12, 20,
or 60. Modern computers read pulses sent by electrical components. The
state of an electrical impulse is either on or off. It is therefore convenient to
represent numbers in computers in the binary system. Here the base is 2,
and the integer coefficients may take the values 0 or 1.


A nonnegative integer N will be represented in the binary system as

N = (a_n a_{n-1} ··· a_0)_2 = a_n 2^n + a_{n-1} 2^{n-1} + ··· + a_0 2^0    (1.2)

where the coefficients a_k are either 0 or 1. Note that N is again represented
as a polynomial, but now in the base 2. Many computers used in scientific
work operate internally in the binary system. Users of computers, however,
prefer to work in the more familiar decimal system. It is therefore neces-
sary to have some means of converting from decimal to binary when
information is submitted to the computer, and from binary to decimal for
output purposes.
Conversion of a binary number to decimal form may be accomplished directly from the definition (1.2). As examples we have

(1101)_2 = 1·2^3 + 1·2^2 + 0·2^1 + 1·2^0 = 13
(10000)_2 = 1·2^4 + 0·2^3 + 0·2^2 + 0·2^1 + 0·2^0 = 16
The conversion of integers from a base β to the base 10 can also be accomplished by the following algorithm, which is derived in Chap. 2.

Algorithm 1.1 Given the coefficients a_n, . . . , a_0 of the polynomial

p(x) = a_n x^n + a_{n-1} x^{n-1} + ··· + a_1 x + a_0    (1.3)

and a number z. Compute recursively the numbers b_n, b_{n-1}, . . . , b_0:

b_n = a_n
b_k = a_k + b_{k+1} z    k = n - 1, n - 2, . . . , 0

Then

b_0 = p(z)
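To make the algorithm concrete, here is a short sketch in Python (the book itself poses its programming exercises in FORTRAN; the function name and calling convention below are our own):

    def nested_multiplication(coeffs, z):
        # coeffs lists a_n, a_(n-1), ..., a_0, as in Algorithm 1.1.
        b = coeffs[0]                # b_n = a_n
        for a_k in coeffs[1:]:
            b = a_k + b * z          # b_k = a_k + b_(k+1) * z
        return b                     # b_0 = p(z)

    print(nested_multiplication([1, 1, 0, 1], 2))    # (1101)_2 -> 13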

Since, by the definition (1.2), the binary integer

N = (a_n a_{n-1} ··· a_0)_2

represents the value of the polynomial (1.3) at x = 2, we can use Algorithm 1.1, with z = 2, to find the decimal equivalents of binary integers. Thus the decimal equivalent of (1101)_2 computed using Algorithm 1.1 is

b_3 = 1
b_2 = 1 + 1·2 = 3
b_1 = 0 + 3·2 = 6
b_0 = 1 + 6·2 = 13

and the decimal equivalent of (10000)_2 is

b_4 = 1,  b_3 = 2,  b_2 = 4,  b_1 = 8,  b_0 = 16
Converting a decimal integer N into its binary equivalent can also be accomplished by Algorithm 1.1 if one is willing to use binary arithmetic. For if N = (a_n a_{n-1} ··· a_0)_{10}, then by the definition (1.1), N = p(10), where p(x) is the polynomial (1.3). Hence we can calculate the binary representation for N by translating the coefficients a_n, . . . , a_0 into binary integers and then using Algorithm 1.1 to evaluate p(x) at x = 10 = (1010)_2 in binary arithmetic. If, for example, N = 187, then

p(x) = 1·x^2 + 8·x + 7    with 1 = (1)_2, 8 = (1000)_2, 7 = (111)_2

and using Algorithm 1.1 and binary arithmetic,

b_2 = (1)_2
b_1 = (1000)_2 + (1)_2·(1010)_2 = (10010)_2
b_0 = (111)_2 + (10010)_2·(1010)_2 = (10111011)_2

Therefore 187 = (10111011)_2.


Binary numbers and binary arithmetic, though ideally suited for
today’s computers, are somewhat tiresome for people because of the
number of digits necessary to represent even moderately sized numbers.
Thus eight binary digits are necessary to represent the three-decimal-digit
number 187. The octal number system, using the base 8, presents a kind of
compromise between the computer-preferred binary and the people-pre-
ferred decimal system. It is easy to convert from octal to binary and back
since three binary digits make one octal digit. To convert from octal to
binary, one merely replaces all octal digits by their binary equivalents; thus

(273)_8 = (010 111 011)_2 = (10111011)_2

Conversely, to convert from binary to octal, one partitions the binary digits in groups of three (starting from the right) and then replaces each three-group by its octal digit; thus

(10111011)_2 = (10 111 011)_2 = (273)_8
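Since the correspondence is digit by digit, the conversion is easily mechanized. A minimal Python sketch (ours, for illustration only):

    def octal_to_binary(octal_digits):
        # Replace each octal digit by its three binary digits.
        bits = ''.join(format(int(d), '03b') for d in octal_digits)
        return bits.lstrip('0') or '0'

    def binary_to_octal(bits):
        # Pad on the left to a multiple of three, then group by threes.
        bits = bits.zfill(-(-len(bits) // 3) * 3)
        return ''.join(str(int(bits[i:i+3], 2)) for i in range(0, len(bits), 3))

    print(octal_to_binary('273'))       # 10111011
    print(binary_to_octal('10111011'))  # 273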

If a decimal integer has to be converted to binary by hand, it is usually fastest to convert it first to octal using Algorithm 1.1, and then from octal to binary. To take an earlier example, if N = 187, then

p(x) = 1·x^2 + 8·x + 7    with 1 = (1)_8, 8 = (10)_8, 7 = (7)_8

Hence, using Algorithm 1.1 [with 2 replaced by 10 = (12)_8, and with octal arithmetic],

b_2 = (1)_8
b_1 = (10)_8 + (1)_8·(12)_8 = (22)_8
b_0 = (7)_8 + (22)_8·(12)_8 = (273)_8

Therefore, finally,

187 = (273)_8 = (010 111 011)_2 = (10111011)_2

EXERCISES

1.1-1 Convert the following binary numbers to decimal form:

1.1-2 Convert the following decimal numbers to binary form:


82, 109, 3433
1.1-3 Carry out the conversions in Exercises 1.1-1 and 1.1-2 by converting first to octal form.
1.1-4 Write a FORTRAN subroutine which accepts a number to the base BETIN with the
NIN digits contained in the one-dimensional array NUMIN, and returns the NOUT digits of
the equivalent in base BETOUT in the one-dimensional array NUMOUT. For simplicity,
restrict both BETIN and BETOUT to 2, 4, 8, and 10.

1.2 THE REPRESENTATION OF FRACTIONS


If x is a positive real number, then its integral part x_I is the largest integer less than or equal to x, while

x_F = x - x_I

is its fractional part. The fractional part can always be written as a decimal
fraction:

x_F = (.b_1 b_2 b_3 ···)_{10} = b_1 10^{-1} + b_2 10^{-2} + b_3 10^{-3} + ···    (1.4)

where each b_k is a nonnegative integer less than 10. If b_k = 0 for all k greater than a certain integer, then the fraction is said to terminate. Thus

1/4 = 0.25000 ···

is a terminating decimal fraction, while

1/3 = 0.33333 ···

is not.
If the integral part of x is given as a decimal integer by

x_I = (a_n a_{n-1} ··· a_0)_{10}

while the fractional part is given by (1.4), it is customary to write the two representations one after the other, separated by a point, the “decimal point”:

x = (a_n a_{n-1} ··· a_0 . b_1 b_2 b_3 ···)_{10}
Completely analogously, one can write the fractional part of x as a


binary fraction:

x_F = (.b_1 b_2 b_3 ···)_2 = b_1 2^{-1} + b_2 2^{-2} + b_3 2^{-3} + ···

where each b_k is a nonnegative integer less than 2, i.e., either zero or one. If the integral part of x is given by the binary integer

x_I = (a_n a_{n-1} ··· a_0)_2

then we write

x = (a_n ··· a_0 . b_1 b_2 ···)_2

using a “binary point.”


The binary fraction (.b_1 b_2 b_3 ···)_2 for a given number x_F between zero and one can be calculated as follows: If

x_F = (.b_1 b_2 b_3 ···)_2

then

2x_F = (b_1 . b_2 b_3 ···)_2

Hence b_1 is the integral part of 2x_F, while

(2x_F)_F = (.b_2 b_3 b_4 ···)_2

is its fractional part. Therefore, repeating this procedure, we find that b_2 is the integral part of 2(2x_F)_F, b_3 is the integral part of 2(2(2x_F)_F)_F, etc. If, for example, x = 0.625 = x_F, then

2·(0.625) = 1.25    hence b_1 = 1
2·(0.25) = 0.5      hence b_2 = 0
2·(0.5) = 1.0       hence b_3 = 1

and all further b_k’s are zero. Hence

0.625 = (.101)_2

This example was rigged to give a terminating binary fraction. Unhappily, not every terminating decimal fraction gives rise to a terminating binary fraction. This is due to the fact that the binary fraction for

0.1 = 1/10

is not terminating. We have

2·(0.1) = 0.2    hence b_1 = 0
2·(0.2) = 0.4    hence b_2 = 0
2·(0.4) = 0.8    hence b_3 = 0
2·(0.8) = 1.6    hence b_4 = 1
2·(0.6) = 1.2    hence b_5 = 1

and now we are back to a fractional part of 0.2, so that the digits cycle. It follows that

0.1 = (.000110011001100 ···)_2
The procedure just outlined is formalized in the following algorithm.

Algorithm 1.2 Given x between 0 and 1 and an integer β greater than 1. Generate recursively b_1, b_2, b_3, . . . by

x_0 = x
b_k = integral part of βx_{k-1};  x_k = fractional part of βx_{k-1},   k = 1, 2, 3, . . .

Then

x = (.b_1 b_2 b_3 ···)_β
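A Python sketch of Algorithm 1.2 (again ours, not the book's; exact rationals are used in the second call so that the digits of 1/10 can be generated without floating-point error):

    from fractions import Fraction

    def fraction_digits(x, beta, how_many):
        # Digits b_1, b_2, ... of x in base beta, for 0 < x < 1.
        digits = []
        for _ in range(how_many):
            x = x * beta
            b = int(x)          # b_k = integral part of beta * x_(k-1)
            digits.append(b)
            x = x - b           # x_k = fractional part of beta * x_(k-1)
        return digits

    print(fraction_digits(Fraction(5, 8), 2, 5))    # [1, 0, 1, 0, 0]: (.101)_2
    print(fraction_digits(Fraction(1, 10), 2, 12))  # [0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1]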

We have stated this algorithm for a general base β rather than for the specific binary base β = 2 for two reasons. If this conversion to binary is carried out with pencil and paper, it is usually faster to convert first to octal, i.e., use β = 8, and then to convert from octal to binary. Also, the algorithm can be used to convert a binary (or octal) fraction to decimal, by choosing β = 10 and using binary (or octal) arithmetic.
To give an example, if x = (.101)_2, then, with β = 10 = (1010)_2 and binary arithmetic, we get from Algorithm 1.2

(1010)_2·(.101)_2 = (110.01)_2    hence b_1 = (110)_2 = 6
(1010)_2·(.01)_2 = (10.1)_2       hence b_2 = (10)_2 = 2
(1010)_2·(.1)_2 = (101.)_2        hence b_3 = (101)_2 = 5

Hence subsequent b_k’s are zero. This shows that

(.101)_2 = .625

confirming our earlier calculation. Note that if x_F is a terminating binary fraction with n digits, then it is also a terminating decimal fraction with n digits, since

(.b_1 b_2 ··· b_n)_2 = B/2^n = B·5^n/10^n    with B = (b_1 b_2 ··· b_n)_2 < 2^n

EXERCISES
1.2-1 Convert the following binary fractions to decimal fractions:
(.1100011)_2    (.11111111)_2
1.2-2 Find the first 5 digits of .1 written as an octal fraction, then compute from it the first 15
digits of .1 as a binary fraction.
1.2-3 Convert the following octal fractions to decimal:
(.614)_8    (.776)_8
Compare with your answer in Exercise 1.2-1.
1.2-4 Find a binary number which approximates to within 10^{-3}.
1.2-5 If we want to convert a decimal integer N to binary using Algorithm 1.1, we have to use
binary arithmetic. Show how to carry out this conversion using Algorithm 1.2 and decimal
arithmetic. (Hint: Divide N by the appropriate power of 2, convert the result to binary, then
shift the “binary point” appropriately.)
1.2-6 If we want to convert a terminating binary fraction x to a decimal fraction using
Algorithm 1.2, we have to use binary arithmetic. Show how to carry out this conversion using
Algorithm 1.1 and decimal arithmetic.

1.3 FLOATING-POINT ARITHMETIC

Scientific calculations are usually carried out in floating-point arithmetic.


An n-digit floating-point number in base β has the form

x = ±(.d_1 d_2 ··· d_n)_β · β^e    (1.5)

where (.d_1 d_2 ··· d_n)_β is a β-fraction called the mantissa, and e is an
integer called the exponent. Such a floating-point number is said to be
normalized in case d_1 ≠ 0, or else d_1 = d_2 = ··· = d_n = 0.
For most computers β = 2, although on some β = 16, and in hand
calculations and on most desk and pocket calculators β = 10.
The precision or length n of floating-point numbers on any particular
computer is usually determined by the word length of the computer and
may therefore vary widely (see Fig. 1.1). Computing systems which accept
FORTRAN programs are expected to provide floating-point numbers of
two different lengths, one roughly double the other. The shorter one, called
single precision, is ordinarily used unless the other, called double precision,
is specifically asked for. Calculation in double precision usually doubles
the storage requirements and more than doubles running time as compared
with single precision.
8 NUMBER SYSTEMS AND ERRORS

[Figure 1.1, a table of floating-point characteristics for various computers, is omitted here.]

The exponent e is limited to a range

m ≤ e ≤ M    (1.6)

for certain integers m and M. Usually, m = -M, but the limits may vary
widely; see Fig. 1.1.
There are two commonly used ways of translating a given real number
x into an n-digit floating-point number fl(x), rounding and chopping. In
rounding, fl(x) is chosen as the normalized floating-point number nearest
x; some special rule, such as symmetric rounding (rounding to an even
digit), is used in case of a tie. In chopping, fl(x) is chosen as the nearest
normalized floating-point number between x and 0. If, for example, two-
decimal-digit floating-point numbers are used, then

fl(2/3) = .67 · 10^0    when rounding

and

fl(2/3) = .66 · 10^0    when chopping
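The following Python sketch (our illustration; Exercise 1.3-5 below asks for a FORTRAN version) mimics fl(x) for n-decimal-digit arithmetic. It is itself carried out in double precision, so it only models the behavior described here:

    import math

    def fl(x, n=2, mode='round'):
        # n-decimal-digit floating-point value of x, by rounding or chopping.
        if x == 0.0:
            return 0.0
        e = math.floor(math.log10(abs(x))) + 1   # 10**(e-1) <= |x| < 10**e
        scale = 10.0 ** (n - e)                  # shift the n mantissa digits left of the point
        m = x * scale
        m = round(m) if mode == 'round' else math.trunc(m)
        return m / scale

    print(fl(2/3, 2, 'round'))   # 0.67
    print(fl(2/3, 2, 'chop'))    # 0.66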

On some computers, this definition of fl(x) is modified in case
|x| ≥ β^M (overflow) or 0 < |x| < β^{m-1} (underflow), where m and M are the
bounds on the exponents; either fl(x) is not defined in this case, causing a
stop, or else fl(x) is represented by a special number which is not subject to
the usual rules of arithmetic when combined with ordinary floating-point
numbers.
The difference between x and fl(x) is called the round-off error. The
round-off error depends on the size of x and is therefore best measured
relative to x. For if we write

fl(x) = x(1 + δ)    (1.7)

where δ = δ(x) is some number depending on x, then it is possible to
bound δ independently of x, at least as long as x causes no overflow or
underflow. For such an x, it is not difficult to show that

|δ| ≤ (1/2)β^{1-n}    in rounding    (1.8)

while

|δ| ≤ β^{1-n}    in chopping    (1.9)



See Exercise 1.3-3. The maximum possible value for |δ| is often called the
unit roundoff and is denoted by u.
When an arithmetic operation is applied to two floating-point num-
bers, the result usually fails to be a floating-point number of the same
length. If, for example, we deal with two two-decimal-digit numbers, their
exact sum or product will in general carry more than two digits (the
product of two two-digit mantissas can have four digits) and so will fail to
be a two-decimal-digit floating-point number.
Hence, if ∘ denotes one of the arithmetic operations (addition, subtraction,
multiplication, or division) and ⊛ denotes the floating-point operation of
the same name provided by the computer, then, however the computer
may arrive at the result x ⊛ y for two given floating-point numbers x and
y, we can be sure that usually

x ⊛ y ≠ x ∘ y
Although the floating-point operation ⊛ corresponding to ∘ may vary in
some details from machine to machine, ⊛ is usually constructed so that

x ⊛ y = fl(x ∘ y)    (1.10)
In words, the floating-point sum (difference, product, or quotient) of two
floating-point numbers usually equals the floating-point number which
represents the exact sum (difference, product, or quotient) of the two
numbers. Hence (unless overflow or underflow occurs) we have
x ⊛ y = (x ∘ y)(1 + δ)    |δ| ≤ u    (1.11a)

where u is the unit roundoff. In certain situations, it is more convenient to
use the equivalent formula

x ⊛ y = (x ∘ y)/(1 + δ)    |δ| ≤ u    (1.11b)
Equation (1.11) expresses the basic idea of backward error analysis (see J.
H. Wilkinson [24]†). Explicitly, Eq. (1.11) allows one to interpret a float-
ing-point result as the result of the corresponding ordinary arithmetic, but
performed on slightly perturbed data. In this way, the analysis of the effect
of floating-point arithmetic can be carried out in terms of ordinary
arithmetic.
For example, the value of the function f(x) = x^{2^n} at a point x_0 can be
calculated by n squarings, i.e., by carrying out the sequence of steps

x_{i+1} = x_i · x_i    i = 0, . . . , n - 1

with x_0 the given point, so that x_n = f(x_0). In floating-point arithmetic, we compute instead, according to Eq. (1.11a), the sequence of numbers

y_{i+1} = fl(y_i · y_i) = y_i^2 (1 + ε_{i+1})    y_0 = x_0
†Numbers in brackets refer to items in the references at the end of the book.

with |ε_i| ≤ u, all i. The computed answer is, therefore,

y_n = x_0^{2^n} (1 + ε_1)^{2^{n-1}} (1 + ε_2)^{2^{n-2}} ··· (1 + ε_n)

To simplify this expression, we observe that, if |ε_i| ≤ u, all i, then

(1 + ε_1)(1 + ε_2) ··· (1 + ε_m) = (1 + ε)^m

for some |ε| ≤ u (see Exercise 1.3-6). Also then

(1 + ε)^m = 1 + mε'

for some ε' of roughly the size of ε. Consequently,

y_n = x_0^{2^n} (1 + ε)^{2^n - 1}

for some |ε| ≤ u. In words, the computed value y_n is the
exact value of f(x) at the perturbed argument

x_0 (1 + ε)^{(2^n - 1)/2^n}
We can now gauge the effect which the use of floating-point arithmetic
has had on the accuracy of the computed value for f(x0) by studying how
the value of the (exactly computed) function f(x) changes when the
argument x is perturbed, as is done in the next section. Further, we note
that this error is, in our example, comparable to the error due to the fact
that we had to convert the initial datum x0 to a floating-point number to
begin with.
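The effect is easy to observe numerically. The sketch below (ours) simulates six-decimal-digit chopped arithmetic and checks that the computed power is the exact power of a slightly perturbed argument, as just derived; the particular values of n and x_0 are arbitrary:

    import math

    def chop6(x):
        # A crude model of six-decimal-digit chopped arithmetic.
        if x == 0.0:
            return 0.0
        e = math.floor(math.log10(abs(x))) + 1
        s = 10.0 ** (6 - e)
        return math.trunc(x * s) / s

    n, x0 = 5, 0.9876
    z = chop6(x0)
    for _ in range(n):               # n squarings, each chopped to 6 digits
        z = chop6(z * z)

    perturbed = z ** (1.0 / 2 ** n)  # argument whose exact 2^n-th power is z
    print((perturbed - x0) / x0)     # relative perturbation, roughly the size of u = 1e-5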
As a second example, of particular interest in Chap. 4, consider
calculation of the number s from the equation

a_1 x_1 + a_2 x_2 + ··· + a_r x_r + a_{r+1} s = b    (1.12)

by the formula

s = (b - a_1 x_1 - a_2 x_2 - ··· - a_r x_r)/a_{r+1}

If we obtain s through the steps

s_0 = b
s_i = s_{i-1} - a_i x_i    i = 1, . . . , r
s = s_r / a_{r+1}

then the corresponding numbers computed in floating-point arithmetic satisfy

s*_0 = b
s*_i = (s*_{i-1} - a_i x_i (1 + ε))(1 + ε)    i = 1, . . . , r
s* = (s*_r / a_{r+1})(1 + ε)

Here, we have used Eqs. (1.11a) and (1.11b), and have not bothered to
distinguish the various ε by subscripts. Consequently,

s*_r = b(1 + ε)^r - a_1 x_1 (1 + ε)^{r+1} - a_2 x_2 (1 + ε)^r - ··· - a_r x_r (1 + ε)^2

This shows that the computed value s* for s satisfies the perturbed equation

a_1 x_1 (1 + ε)^{r+2} + a_2 x_2 (1 + ε)^{r+1} + ··· + a_r x_r (1 + ε)^3 + a_{r+1} s* = b(1 + ε)^{r+1}    (1.13)

Note that we can reduce all exponents by 1 in case a_{r+1} = 1, that is, in
case the last division need not be carried out.
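A quick numerical illustration (ours; the data are arbitrary): compute s by the steps above in ordinary double precision and then evaluate the residual of (1.12) carefully. The residual is tiny relative to the terms of the equation, i.e., the computed s solves a nearby equation, which is the content of (1.13):

    import math

    a = [3.1, -2.4, 5.9, 7.3]        # a_1, ..., a_(r+1), hypothetical data
    x = [1.7, 0.3, -4.4]             # x_1, ..., x_r
    b = 2.6

    s = b
    for ai, xi in zip(a[:-1], x):
        s -= ai * xi                 # s_i = s_(i-1) - a_i * x_i
    s /= a[-1]                       # s = s_r / a_(r+1)

    terms = [ai * xi for ai, xi in zip(a[:-1], x)] + [a[-1] * s, -b]
    print(math.fsum(terms))          # residual of (1.12): comparable to roundoff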

EXERCISES

1.3-1 The following numbers are given in a decimal computer with a four-digit normalized
mantissa:

Perform the following operations, and indicate the error in the result, assuming symmetric
rounding:

1.3-2 Let fl(x) be given by chopping. Show that fl(-x) = -fl(x) and that fl(10^k x) = 10^k fl(x)
(unless overflow or underflow occurs).
1.3-3 Let fl(x) be given by chopping and let δ be such that fl(x) = x(1 + δ). (If x = 0, take δ = 0.)
Show that δ is then bounded as in (1.9).
1.3-4 Give examples to show that most of the laws of arithmetic fail to hold for floating-point
arithmetic. (Hint: Try laws involving three operands.)
1.3-5 Write a FORTRAN FUNCTION FL(X) which returns the value of the n-decimal-digit
floating-point number derived from X by rounding. Take n to be 4 and check your
calculations in Exercise 1.3-1. [Use ALOG10(ABS(X)) to determine e such that
10^{e-1} ≤ |X| < 10^e.]
1.3-6 Let |δ_i| ≤ u for i = 1, . . . , m. Show that there exists δ with |δ| ≤ u so
that (1 + δ_1)(1 + δ_2) ··· (1 + δ_m) = (1 + δ)^m. Show also that the product equals 1 + mδ'
for some δ', and estimate how big |δ'| can be, provided all the δ_i have the same sign.
1.3-7 Carry out a backward error analysis for the calculation of the scalar product
x_1y_1 + x_2y_2 + ··· + x_ny_n. Redo the analysis under the assumption that double-precision ac-
cumulation is used. This means that the double-precision results of each multiplication are
retained and added to the sum in double precision, with the resulting sum rounded only at the
end to single precision.

1.4 LOSS OF SIGNIFICANCE AND ERROR PROPAGATION;
CONDITION AND INSTABILITY

If the number x* is an approximation to the exact answer x, then we call
the difference x - x* the error in x*; thus

Exact = approximation + error    (1.14)

The relative error in x*, as an approximation to x, is defined to be the
number (x - x*)/x. Note that this number is close to the number
(x - x*)/x* if it is at all small. [Precisely, if (x - x*)/x = δ, then
(x - x*)/x* = δ/(1 - δ).]
Every floating-point operation in a computational process may give
rise to an error which, once generated, may then be amplified or reduced
in subsequent operations.
One of the most common (and often avoidable) ways of increasing the
importance of an error is commonly called loss of significant digits. If x* is
an approximation to x, then we say that x* approximates x to r significant
digits provided the absolute error |x - x*| is at most one-half unit in the rth
significant digit of x. This can be expressed in a formula as

|x - x*| ≤ (1/2)·10^{s-r+1}    (1.15)

with s the largest integer such that 10^s ≤ |x|. For instance, x* = 3 agrees
with x = π = 3.14159 ··· to one significant (decimal) digit, while x* = 3.14
is correct to three significant digits (as an approximation to π). Suppose
now that we are to calculate the number

z = x - y
and that we have approximations x* and y* for x and y, respectively,


available, each of which is good to r digits. Then

z* = x* - y*

is an approximation for z, which is also good to r digits unless x* and y*
agree to one or more digits. In this latter case, there will be cancellation of
digits during the subtraction, and consequently z* will be accurate to fewer
than r digits.
Consider, for example, two numbers x* and y* which agree in their first
four significant digits, and assume each to be an approximation to x and y,
respectively, correct to seven significant digits. Then, in eight-digit floating-point
arithmetic,

z* = x* - y*

is the exact difference between x* and y*. But as an approximation to
z = x - y, z* is good only to three digits, since the fourth significant digit
of z* is derived from the eighth digits of x* and y*, both possibly in error.

Hence, while the error in z* (as an approximation to z = x - y) is at most
the sum of the errors in x* and y*, the relative error in z* is possibly 10,000
times the relative error in x* or y*. Loss of significant digits is therefore
dangerous only if we wish to keep the relative error small.
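The loss is easy to reproduce. In the Python sketch below (ours; the two eight-digit values are hypothetical), perturbing x* in its seventh significant digit, which is consistent with its being correct to only seven digits, changes the difference in its third or fourth significant digit:

    x_star = 0.76545421
    y_star = 0.76541234                 # agrees with x_star in four leading digits
    z_star = x_star - y_star            # 4.187e-05, computed exactly

    x_pert = x_star + 5.0e-8            # perturb the seventh significant digit
    z_pert = x_pert - y_star
    print((z_pert - z_star) / z_star)   # relative change of about 1.2e-3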
Such loss can often be avoided by anticipating its occurrence. Consider,
for example, the evaluation of the function

f(x) = 1 - cos x

in six-decimal-digit arithmetic. Since cos x ≈ 1 for x near zero, there will
be loss of significant digits for x near zero if we calculate f(x) by first
finding cos x and then subtracting the calculated value from 1. For we
cannot calculate cos x to more than six digits, so that the error in the
calculated value may be as large as 5·10^{-7}, hence as large as, or larger
than, f(x) for x near zero. If one wishes to compute the value of f(x) near
zero to about six significant digits using six-digit arithmetic, one would
have to use an alternative formula for f(x), such as

f(x) = sin^2 x / (1 + cos x)

which can be evaluated quite accurately for small x; else, one could make
use of the Taylor expansion (see Sec. 1.7) for f(x),

f(x) = x^2/2 - x^4/24 + x^6/720 - ···

which shows, for example, that for |x| ≤ 10^{-3}, x^2/2 agrees with f(x) to at
least six significant digits.
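The two formulas are easy to compare in Python (our illustration; double precision plays the role of the six-digit arithmetic, with the cancellation correspondingly further out):

    import math

    def f_naive(x):
        return 1.0 - math.cos(x)             # cancellation near zero

    def f_stable(x):
        s = math.sin(x)
        return s * s / (1.0 + math.cos(x))   # algebraically the same, no cancellation

    x = 1.0e-9
    print(f_naive(x))    # 0.0: all significance lost
    print(f_stable(x))   # about 5e-19, essentially x*x/2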
Another example is provided by the problem of finding the roots of
the quadratic equation

ax^2 + bx + c = 0    (1.16)

We know from algebra that the roots are given by the quadratic formula

x = (-b ± √(b^2 - 4ac)) / (2a)    (1.17)

Let us assume that b^2 - 4ac > 0, that b > 0, and that we wish to find the
root of smaller absolute value using (1.17); i.e.,

x = (-b + √(b^2 - 4ac)) / (2a)    (1.18)

If 4ac is small compared with b^2, then √(b^2 - 4ac) will agree with b to
several places. Hence, given that √(b^2 - 4ac) will be calculated correctly
only to as many places as are used in the calculations, it follows that the
numerator of (1.18), and therefore the calculated root, will be accurate to
fewer places than were used during the calculation. To be specific, take the
equation

x^2 + 111.11x + 1.2121 = 0    (1.19)
Using (1.18) and five-decimal-digit floating-point chopped arithmetic, we
calculate

x* = -0.015000

while in fact,

x = -0.010910 ···

is the correct root to the number of digits shown. Here too, the loss of
significant digits can be avoided by using an alternative formula for the
calculation of the absolutely smaller root, viz.,

x = -2c / (b + √(b^2 - 4ac))    (1.20)

Using this formula, and five-decimal-digit arithmetic, we calculate

x* = -0.010910

which is accurate to five digits.
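In Python (our illustration, with a made-up coefficient set in place of (1.19); double precision here plays the role of the five-digit arithmetic):

    import math

    def small_root_unstable(a, b, c):
        # Formula (1.18): cancellation when 4ac is small compared with b*b.
        return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

    def small_root_stable(a, b, c):
        # Formula (1.20): algebraically equivalent, no cancellation.
        return -2.0 * c / (b + math.sqrt(b * b - 4.0 * a * c))

    a, b, c = 1.0, 1.0e8, 1.0            # true small root is very nearly -1.0e-8
    print(small_root_unstable(a, b, c))  # only the order of magnitude survives
    print(small_root_stable(a, b, c))    # -1e-08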


Once an error is committed, it contaminates subsequent results. This
error propagation through subsequent calculations is conveniently studied
in terms of the two related concepts of condition and instability.
The word condition is used to describe the sensitivity of the function
value f(x) to changes in the argument x. The condition is usually measured
by the maximum relative change in the function value f(x) caused by a
unit relative change in the argument. In a somewhat informal formula,
condition of f at x = max { |(f(x) - f(x*))/f(x)| ÷ |(x - x*)/x| : x* near x }
                    ≈ |x f'(x)/f(x)|    (1.21)

The larger the condition, the more ill-conditioned the function is said to
be. Here we have made use of the fact (see Sec. 1.7) that

f(x) - f(x*) ≈ f'(x)(x - x*)

i.e., the change in argument from x to x* changes the function value by
approximately f'(x)(x* - x).
If, for example, f(x) = √x, then f'(x) = 1/(2√x); hence the condition of f is, approximately,

|x f'(x)/f(x)| = |x (1/(2√x)) / √x| = 1/2
This says that taking square roots is a well-conditioned process since it


actually reduces the relative error. By contrast, if

f(x) = 1/(1 - x^2)

then f'(x) = 2x/(1 - x^2)^2, so that

|x f'(x)/f(x)| = |2x^2/(1 - x^2)|

and this number can be quite large for |x| near 1. Thus, for x near 1 or
- 1, this function is quite ill-conditioned. It very much magnifies relative
errors in the argument there.
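Formula (1.21) is easily evaluated numerically; a sketch (ours), applied to the two examples of this section:

    import math

    def condition(f, fprime, x):
        # Condition of f at x, per (1.21): |x f'(x) / f(x)|.
        return abs(x * fprime(x) / f(x))

    print(condition(math.sqrt, lambda x: 0.5 / math.sqrt(x), 2.0))   # 0.5

    g = lambda x: 1.0 / (1.0 - x * x)
    gp = lambda x: 2.0 * x / (1.0 - x * x) ** 2
    print(condition(g, gp, 0.999))   # roughly 1000: ill-conditioned near 1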
The related notion of instability describes the sensitivity of a numerical
process for the calculation of f(x) from x to the inevitable rounding errors
committed during its execution in finite precision arithmetic. The precise
effect of these errors on the accuracy of the computed value for f(x) is
hard to determine except by actually carrying out the computations for
particular finite precision arithmetics and comparing the computed answer
with the exact answer. But it is possible to estimate these effects roughly by
considering the rounding errors one at a time. This means we look at the
individual computational steps which make up the process. Suppose there
are n such steps. Denote by x_i the output from the ith such step, and take
x_0 = x. Such an x_i then serves as input to one or more of the later steps
and, in this way, influences the final answer x_n = f(x). Denote by f_i the
function which describes the dependence of the final answer on the
intermediate result x_i. In particular, f_0 is just f. Then the total process is
unstable to the extent that one or more of these functions f_i is ill-condi-
tioned. More precisely, the process is unstable to the extent that one or
more of the f_i’s has a much larger condition than f = f_0 has. For it is the
condition of f_i which gauges the relative effect of the inevitable rounding
error incurred at the ith step on the final answer.
To give a simple example, consider the function

f(x) = √(x + 1) - √x

for “large” x, say for x = 12345. Its condition there is

|x f'(x)/f(x)| = x / (2√(x + 1) √x) ≈ 1/2

which is quite good. But, if we calculate f(12345) in six-decimal arithmetic,
