UNIT 1 FLOATING POINT ARITHMETIC
AND ERRORS
Structure
1.0 Introduction
1.1 Objectives
1.2 Floating Point Representations
1.2.1 Floating Point Arithmetic
1.2.2 Properties of Floating Point Arithmetic
1.2.3 Significant Digits
1.3 Error - Basics
1.3.1 Rounding-off Error
1.3.2 Absolute and Relative Errors
1.3.3 Truncation Error
1.4 Summary
1.5 Solutions/Answers
1.6 Exercises
1.7 Solutions to Exercises
1.0 INTRODUCTION
Numerical Analysis is the study of computational methods for solving scientific and
engineering problems using basic arithmetic operations such as addition,
subtraction, multiplication and division. The results obtained by such methods
are usually approximations to the true solutions. These approximations introduce
errors, but they can be made more accurate up to a point. There can be several
reasons for the approximation. For instance, the formula or method used to solve a
problem may not be exact: sin x can be evaluated by expressing it as an infinite
power series, but the series has to be truncated to a finite number of terms, and
this truncation introduces an error in the computed result. As a student of computer
science, you should also consider the computer-oriented aspect of approximation and
errors. The machine involved in the computation may not have the capacity to
accommodate the data or the result produced by a numerical calculation, so the data
has to be approximated to fit within the limitations of the machine. When this
approximated data is used in successive calculations, the error propagates, and if
the error grows abnormally, serious disasters may happen. Let me cite some
well-known disasters caused by approximations and errors.
Instance 1: On February 25, 1991, during the Gulf War, an American Patriot Missile
battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud Missile.
The Scud struck an American Army barracks and killed 28 soldiers. A report of the
General Accounting Office, GAO/IMTEC-92-26, entitled Patriot Missile Defense:
Software Problem Led to System Failure at Dhahran, Saudi Arabia reported on the
cause of the failure. It turns out that the cause was an inaccurate calculation of the
time since boot due to computer arithmetic errors.
Instance 2: In another well-known failure, a 64-bit floating point number relating
to the horizontal velocity of a rocket with respect to its platform was converted to
a 16-bit signed integer. The number was larger than 32,767, the largest integer
storable in a 16-bit signed integer, and thus the conversion failed.
In this Unit, we will describe number approximation, significant digits, the way
numbers are represented and arithmetic operations are performed on them, the types
of errors and their sources, the propagation of errors in successive operations, etc.
Figure 1 describes the stages of Numerical Computing.
[Figure 1: Stages of Numerical Computation. Flowchart box labels: Mathematical
concepts; Numerical method; Validity (Correct / Wrong); Application; Improve
solution; Change algorithm; Modify model; Modification to reduce error.]
1.1 OBJECTIVES
After going through this unit, you should be able to:
• describe the concepts of fixed point and floating point number representations;
• discuss rounding-off errors and the rules associated with them;
• implement floating-point arithmetic on available data;
• describe the concept of significant digits; and
• analyse different types of errors: absolute error, relative error and truncation
error.
In scientific calculations, very large numbers such as velocity of light or very small
numbers such as size of an electron occur frequently. These numbers cannot be
satisfactorily represented in the usual manner. Therefore, scientific calculations are
usually done by floating point arithmetic.
This means that we need two formats to represent a number: fixed point
representation and floating point representation. We can transform data from one
format into the other and vice versa. The process of transforming fixed point data
into floating point data is known as normalisation, and it is done to preserve the
maximum number of useful, information-carrying digits of a number. This
transformation ultimately leads to calculation errors. You may then ask what the
benefit of normalisation is when it contributes to erroneous results. The answer is
simply that it lets us proceed with the calculations while respecting the data and
processing limitations of the machine.
In the number $1.6236 \times 10^{3}$, 1.6236 is the mantissa, 10 is the base and 3 is
the exponent.
Let us first discuss what a floating-point number is. Consider the number 123. It can
be written using exponential notation as:
$1.23 \times 10^{2}$, $12.3 \times 10^{1}$, $123 \times 10^{0}$, $0.123 \times 10^{3}$, $1230 \times 10^{-1}$, etc.
Notice how the decimal point “floats” within the number as the exponent is changed.
This phenomenon gives floating point numbers their name. Each representation of the
number 123 above is a kind of standard form. The first representation,
$1.23 \times 10^{2}$, is in a form called “scientific notation”.
–M<m<M (2)
In scientific notation, such as $1.23 \times 10^{2}$ in the above example, the
significand is always a number greater than or equal to 1 and less than 10. We may
also write 1.23E2.
Standard computer normalisation for floating point numbers follows the fourth form,
namely $0.123 \times 10^{3}$, in the list above.
In floating point notation (1), if fl(x) ≠ 0 and m ≥ M (that is, the number is too
large to be accommodated), then x is called an overflow number, and if m ≤ –M (that
is, the number is too small but not zero), it is called an underflow number. The
number n in the floating-point notation is called its precision.
When arithmetic operations are applied to floating-point numbers, the results are
usually not floating-point numbers of the same length. For example, consider an
operation on floating-point numbers with 2 digit precision (i.e., numbers whose
mantissas carry two digits), and suppose the result also has to be in 2 digit
floating point precision. Consider the following example,
$x = 0.30 \times 10^{1}$, $y = 0.66 \times 10^{-6}$, $z = 0.10 \times 10^{1}$
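The effect of a fixed precision on these three numbers can be sketched with Python's standard `decimal` module. This is only an illustration under assumptions that the text does not state: the context name `two_digit`, the round-half-up mode, and the choice of operations (x + y, x · y and z / x) are all hypothetical.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Hypothetical 2-digit decimal machine: every result is rounded
# to 2 significant digits (round-half-up is an assumption).
two_digit = Context(prec=2, rounding=ROUND_HALF_UP)

x = Decimal("0.30E1")
y = Decimal("0.66E-6")
z = Decimal("0.10E1")

sum_xy = two_digit.add(x, y)        # exact 3.00000066 is rounded back to 3.0
prod_xy = two_digit.multiply(x, y)  # exact 1.98E-6 is rounded to 2.0E-6
quot_zx = two_digit.divide(z, x)    # exact 0.3333... is rounded to 0.33

print(sum_xy, prod_xy, quot_zx)
```

In each case the exact result needs more than two digits, so the stored result differs from the true one; in the sum, the contribution of y is lost entirely.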
Arithmetic using the floating-point number system has two important properties that
differ from those of arithmetic using real numbers.
Floating point arithmetic is not associative. This means that in general, for floating
point numbers x, y, and z:
• ( x + y) + z ≠ x + ( y + z )
• ( x . y) . z ≠ x . ( y . z)
Floating point arithmetic is also not distributive. This means that in general,
• x . ( y + z) ≠ ( x . y) + ( x . z)
Therefore, the order in which operations are carried out can change the output of a
floating-point calculation. This is important in numerical analysis since two
mathematically equivalent formulas may not produce the same numerical output, and
one may be substantially more accurate than the other.
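These properties can be observed directly in ordinary binary floating-point arithmetic. The sketch below uses Python's built-in `float`, assumed here to be an IEEE 754 double as on common platforms; the operands 0.1, 0.2, 0.3 and 100 are chosen only for illustration.

```python
# Associativity of addition fails for these operands.
left_assoc = (0.1 + 0.2) + 0.3
right_assoc = 0.1 + (0.2 + 0.3)
print(left_assoc == right_assoc)   # False
print(left_assoc, right_assoc)

# Distributivity fails as well.
factored = 100 * (0.1 + 0.2)
expanded = 100 * 0.1 + 100 * 0.2
print(factored == expanded)        # False
print(factored, expanded)
```

The two groupings differ only in the last binary digit, but an equality test between them still fails, which is why floating-point results are usually compared with a tolerance.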
Example 1: Let $a = 0.345 \times 10^{0}$, $b = 0.245 \times 10^{-3}$ and $c = 0.432 \times 10^{-3}$. Using
3-digit decimal arithmetic with rounding, we have
b + c = 0.000245 + 0.000432 = 0.000677 (in accumulator)
= $0.677 \times 10^{-3}$
a + (b + c) = 0.345 + 0.000677 = 0.345677 (in accumulator)
= $0.346 \times 10^{0}$ (in memory) with rounding
a + b = $0.345 \times 10^{0} + 0.245 \times 10^{-3}$ = 0.345245 (in accumulator)
= $0.345 \times 10^{0}$ (in memory)
(a + b) + c = 0.345 + 0.000432 = 0.345432 (in accumulator)
= $0.345 \times 10^{0}$ (in memory)
Hence, (a + b) + c ≠ a + (b + c).
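Example 1 can be replayed with Python's `decimal` module. This is a sketch rather than a model of any particular machine: the context name `three_digit` is hypothetical, and round-half-up is assumed as the rounding rule.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Hypothetical 3-digit decimal machine with rounding (round-half-up assumed).
three_digit = Context(prec=3, rounding=ROUND_HALF_UP)

a = Decimal("0.345")
b = Decimal("0.245E-3")
c = Decimal("0.432E-3")

a_plus_bc = three_digit.add(a, three_digit.add(b, c))   # a + (b + c)
ab_plus_c = three_digit.add(three_digit.add(a, b), c)   # (a + b) + c

print(a_plus_bc, ab_plus_c)
```

Adding the two small numbers first lets their combined contribution survive the rounding; adding them one at a time to the large number discards each of them.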
From the above examples, we note that in a computational process, every floating-
point operation gives rise to some error, which may then get amplified or reduced in
subsequent operations.
The concept of significant digits has been introduced primarily to indicate the
accuracy of a numerical value. For example, if, in the number y = 23.40657, only the
digits 23406 are correct, then we may say that y has five significant digits and is
correct to only three decimal places.
The number of significant digits in an answer in a calculation depends on the number
of significant digits in the given data, as discussed in the rules below.
Non-zero digits are always significant. Thus, 22 has two significant digits, and 22.3
has three significant digits. The following rules are applied when zeros are
encountered in the numbers,
a) Zeros placed before other digits are not significant; 0.046 has two significant
digits.
b) Zeros placed between other digits are always significant; 4009 kg has four
significant digits.
c) Zeros placed after other digits but behind a decimal point are significant;
7.90 has three significant digits.
d) Zeros at the end of a number are significant only if they are behind a decimal
point as in (c). For example, in the number 8200, it is not clear if the zeros are
significant or not. The number of significant digits in 8200 is at least two, but
could be three or four. To avoid uncertainty, we use scientific notation to place
significant zeros behind a decimal point.
$8.200 \times 10^{3}$ has four significant digits,
$8.20 \times 10^{3}$ has three significant digits,
$8.2 \times 10^{3}$ has two significant digits.
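Rules (a) to (d) can be applied mechanically to a numeral written as text. The helper below is a minimal sketch; the function name `significant_digits` is hypothetical, and for the ambiguous case in rule (d) it is assumed that trailing zeros without a decimal point are not counted, so it returns the minimum possible count.

```python
def significant_digits(text: str) -> int:
    """Count significant digits of a decimal numeral given as a string."""
    s = text.strip().lstrip("+-")
    # An exponent part such as "E3" does not affect the count.
    mantissa = s.lower().split("e")[0]
    if "." in mantissa:
        digits = mantissa.replace(".", "")
        # Rule (a): leading zeros are not significant.
        # Rules (b), (c): all remaining zeros are significant.
        return len(digits.lstrip("0"))
    # Rule (d): without a decimal point, trailing zeros are ambiguous;
    # they are NOT counted here (assumption: report the minimum).
    return len(mantissa.lstrip("0").rstrip("0"))

for numeral in ("22.3", "0.046", "4009", "7.90", "8200", "8.200E3"):
    print(numeral, significant_digits(numeral))
```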
Note: Accuracy and precision are closely related to significant digits. They are related
as follows:
1) Accuracy refers to the number of significant digits in a value. For example, the
number 57.396 is accurate to five significant digits.
2) Precision refers to the number of decimal positions, i.e. the order of magnitude of
the last digit in a value. The number 57.396 has a precision of 0.001 or $10^{-3}$.
Solution:
a) 4.3201 has a precision of $10^{-4}$
b) 4.32 has a precision of $10^{-2}$
c) 4.320106 has a precision of $10^{-6}$
The last number has the greatest precision.
Solution:
a) This has five significant digits.
b) This has four significant digits. The leading or higher order zeros are only place
holders.
c) This has six significant digits.
d) This has two significant digits.
e) This has six significant digits. Note that the zeros were made significant by
writing .00 after 3600.
Significant digits in Addition and Subtraction
When quantities are being added or subtracted, the number of decimal places (not
significant digits) in the answer should be the same as the least number of decimal
places in any of the numbers being added or subtracted.
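The rule for addition and subtraction can be sketched as follows. The function name `add_with_decimal_rule` and the sample terms 12.11, 18.0 and 1.013 are hypothetical; terms are passed as strings so that the number of decimal places written by the user is preserved, and round-half-up is assumed for the final rounding.

```python
from decimal import Decimal, ROUND_HALF_UP

def add_with_decimal_rule(*terms: str) -> Decimal:
    """Add the terms, then round to the fewest decimal places among them."""
    values = [Decimal(t) for t in terms]
    # A Decimal written with k decimal places has exponent -k.
    places = min(max(0, -v.as_tuple().exponent) for v in values)
    total = sum(values, Decimal(0))
    return total.quantize(Decimal(1).scaleb(-places), rounding=ROUND_HALF_UP)

# 12.11 + 18.0 + 1.013 = 31.123, reported to one decimal place.
print(add_with_decimal_rule("12.11", "18.0", "1.013"))
# Subtraction is handled by passing a negative term.
print(add_with_decimal_rule("28.483", "-27.98"))
```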
When doing multi-step calculations, keep at least one more significant digit in
intermediate results than is needed in your final answer.
For instance, if a final answer requires two significant digits, then carry at least three
significant digits in calculations. If you round-off all your intermediate answers to
only two digits, you are discarding the information contained in the third digit, and as
a result the second digit in your final answer might be incorrect. (This phenomenon is
known as “round-off error.”)
This truncation process is done either through rounding off or chopping, leading
to round off error.
2) Rounding-off, say, to two digits in an intermediate answer, and then writing three
digits in the final answer.
Example 4: Expressions for significant digits and scientific notation associated with a
floating point number.
Loss of significant digits in subtraction of two nearly equal numbers:
Subtraction of two nearly equal numbers gives the relative error
$$r_{x-y} = \frac{x}{x-y}\, r_x - \frac{y}{x-y}\, r_y$$
which becomes very large. It has its largest value when $r_x$ and $r_y$ are of opposite
signs. But, if x* and y* agree in one or more of their leftmost digits, then those
digits will cancel and there will be a loss of significant digits.
The more digits on the left agree, the greater the loss of significant digits. A
similar loss of significant digits occurs when a number is divided by a small number
(or multiplied by a very large number).
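The formula for the relative error of a difference can be checked with exact rational arithmetic. In the sketch below, the values of x and y and their 5-significant-digit approximations are hypothetical, and the relative error is taken as r = (true − approximate) / true, as assumed for the formula above.

```python
from fractions import Fraction

# Hypothetical true values and their 5-significant-digit approximations.
x, y = Fraction("1.23456789"), Fraction("1.23411111")
x_star, y_star = Fraction("1.2346"), Fraction("1.2341")

r_x = (x - x_star) / x
r_y = (y - y_star) / y
r_diff = ((x - y) - (x_star - y_star)) / (x - y)

# Relative error predicted by the formula for the difference.
predicted = x / (x - y) * r_x - y / (x - y) * r_y

print(float(r_x), float(r_y), float(r_diff))
print(predicted == r_diff)
```

Here the inputs are accurate to a few parts in 100,000, yet the computed difference is off by roughly 9 percent, because the four leading digits that x* and y* share cancel in the subtraction.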
Example 5: Solve the quadratic equation $x^{2} + 9.9x - 1 = 0$ using two decimal digit
arithmetic with rounding.
Solution:
while the true solutions are – 10 and 0.1. Now, if we rationalize the expression, we
obtain
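$$\frac{-b + \sqrt{b^{2} - 4ac}}{2a} = \frac{-4ac}{2a\left(b + \sqrt{b^{2} - 4ac}\right)} = \frac{-2c}{b + \sqrt{b^{2} - 4ac}}$$
$$= \frac{2}{9.9 + \sqrt{102}} = \frac{2}{9.9 + 10} = \frac{2}{19.9} = \frac{2}{20} \cong 0.1 \quad (0.1000024)$$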
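The difference between the direct and the rationalized formula in Example 5 can be reproduced in simulated two-digit arithmetic. This is a sketch under assumptions: the context `two_digit` is hypothetical, every intermediate result is rounded to two significant digits with round-half-up, and the order of the intermediate operations is one plausible choice.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

two_digit = Context(prec=2, rounding=ROUND_HALF_UP)  # assumed rounding mode
a, b, c = Decimal("1"), Decimal("9.9"), Decimal("-1")

b_squared = two_digit.multiply(b, b)                             # 98.01 -> 98
four_ac = two_digit.multiply(two_digit.multiply(Decimal(4), a), c)
discriminant = two_digit.subtract(b_squared, four_ac)            # 102 -> 100
root = two_digit.sqrt(discriminant)                              # 10

# Direct formula: (-b + sqrt(b^2 - 4ac)) / (2a), two nearly equal numbers cancel.
direct = two_digit.divide(
    two_digit.add(two_digit.minus(b), root),
    two_digit.multiply(Decimal(2), a),
)

# Rationalized formula: -2c / (b + sqrt(b^2 - 4ac)), no cancellation.
rationalized = two_digit.divide(
    two_digit.multiply(Decimal(-2), c),
    two_digit.add(b, root),
)

print(direct, rationalized)
```

Under these assumptions the direct formula returns 0.05, half the true root 0.1, while the rationalized form returns 0.1.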
1.3 ERROR - BASICS
What is Error?
An error is defined as the difference between the actual value and the approximate
value obtained from the experimental observation or from numerical computation.
Consider that x represents some quantity and $x_a$ is an approximation to x, then
error = x − $x_a$.
Every calculation has two parts: the operands and the operator. Hence, an
approximation in either of the two contributes to error. Approximations to operands
cause propagated errors and approximations to operators cause generated errors. Let
us discuss how the philosophy behind these errors is related to computers.
Operand Point of View: Computers need fixed numbers to do processing, which is mostly
not available. Hence, we need to transform the output of an operation to a fixed
number by performing truncation of series, rounding, chopping etc. This contributes
to difference between exact value and approximated value. These errors get further
amplified in subsequent calculations as these values and the results produced are
further utilized in subsequent calculations. Hence, this error contribution is
referred to as propagated error.
Operator Point of View: Computers need some operation to be performed on the operands
available. Now, the operations that occur in computers are at bit level and complex
operations are simplified. There are, hence, small changes in actual operations and
operations performed by computer. This difference in operations produces errors in
calculations, which get further amplified in subsequent calculations. This error
contribution is referred to as generated error.
Sources of Error?
The sources of error can be classified as (i) data input errors, (ii) errors in
algorithms and (iii) errors during computations.
Types of Errors?
We list below the types of errors that are encountered while carrying out numerical
calculations to solve a problem.
1) Round off errors arise due to floating point representation of initial data in the
machine. Subsequent errors in the solution due to this are called propagated
errors.
2) Due to finite digit arithmetic operations, the computer produces generated errors
or rounding errors.
3) Error due to finite representation of an inherently infinite process. For example,
consider the use of a finite number of terms in the infinite series expansions of
sin x, cos x or f(x) by Maclaurin’s or Taylor series. Such errors are called
truncation errors.
The two terms “error” and “accuracy” are inter-related: one measures the other, in
the sense that the smaller the error, the greater the accuracy, and vice versa. In
general, the errors which are used for determination of accuracy are categorized as:
a) Absolute Error: Absolute error is the magnitude of the difference between the
true value x and the approximate value $x_a$. Therefore, absolute error = | x – $x_a$ |.
b) Relative Error: Relative error is the ratio of the absolute error to the magnitude
of the actual value. Therefore, relative error = | x – $x_a$ | / | x |, for x ≠ 0.
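The two definitions translate directly into code. The function names below are hypothetical, and the sample values (3.14159265 as the true value and 22/7 as its approximation) are used only for illustration.

```python
def absolute_error(x: float, xa: float) -> float:
    """Magnitude of the difference between true value x and approximation xa."""
    return abs(x - xa)

def relative_error(x: float, xa: float) -> float:
    """Absolute error divided by the magnitude of the true value."""
    if x == 0:
        raise ValueError("relative error is undefined for a true value of zero")
    return abs(x - xa) / abs(x)

x, xa = 3.14159265, 22 / 7
print(absolute_error(x, xa))
print(relative_error(x, xa))
```

The relative error is the more informative of the two when quantities of very different sizes are compared, since it does not depend on the scale of x.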
Now, we discuss each of the errors defined above, and its propagation in detail.
There are two ways of translating a given real number x into a floating-point number
fl(x): rounding and chopping. For example, suppose we want to represent the number
5562 in the normalized floating point representation. The representations for
different values of n are as follows:
n = 1, fl(5562) = $.5 \times 10^{4}$ chopped
= $.6 \times 10^{4}$ rounded. (5)
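Chopping and rounding of 5562 for several values of n can be sketched with Python's `decimal` module. The helper name `fl` mirrors the notation of the text but is otherwise hypothetical; `ROUND_DOWN` is used for chopping and `ROUND_HALF_UP` is assumed for rounding.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

def fl(x: Decimal, n: int, rounding: str) -> Decimal:
    """Return x stored with n significant decimal digits."""
    return Context(prec=n, rounding=rounding).plus(x)

x = Decimal(5562)
for n in (1, 2, 3, 4):
    chopped = fl(x, n, ROUND_DOWN)
    rounded = fl(x, n, ROUND_HALF_UP)
    print(n, chopped, rounded)
```

The values are printed in Python's own exponent form (for example 5E+3 for $.5 \times 10^{4}$), which denotes the same number as the normalized form used in the text.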
Rules for rounding-off: Whenever we want to use only a certain number of digits after
the decimal point, the number is rounded-off to that many digits. A number is
rounded-off to n places after the decimal point by examining the (n+1)th place digit
$d_{n+1}$, as follows:
The difference between a number x and fl(x) is called the round-off error. It is clear
that the round-off error decreases when precision increases. The round-off error also
depends on the size of x and is therefore represented relative to x as
$$r_x = \frac{x - fl(x)}{x}$$
Example 6: If π = 3.14159265, then find out to how many decimal places the
approximate value 22/7 is accurate.
3) The numbers 28.483 and 27.984 are both approximate and are correct up to the
last digit shown. Compute their difference. Indicate how many significant digits
are present in the result and comment.
4) Consider the number 2/3. Its floating point representation rounded to 5 decimal
places is 0.66667. Find out to how many decimal places the approximate value of
2/3 is accurate?
5) Find out to how many decimal places the value 355/113 is accurate as an
approximation to π.
We shall now discuss two types of errors that are commonly encountered in numerical
computations. You are already familiar with the rounding off error. These rounded-off
numbers are approximations of the actual values. In any computational procedure, we
make use of these approximate values instead of the true values. How do we measure
the goodness of an approximation fl(x) to x ? The simplest measure which naturally
comes to our mind is the difference between x and fl(x). This measure is called the
error. Formally, we define error as a quantity which satisfies the identity
x = fl(x) + e, (10)
e = x − fl(x) (11)
Sometimes, when the true value x is very large or very small, we prefer to study the
error by comparing it with the true value. This is known as relative error and we
define this error as
$$\text{relative error} = r_x = \frac{x - fl(x)}{x} = \frac{e}{x} \quad (12)$$
Note that in certain computations, the true value may not be available. In that case,
we replace the true value by the computed approximate value in the definition of
relative error.
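i) $|r_x| < \frac{1}{2}\beta^{1-n}$ if rounding is used.
ii) $0 \le r_x \le \beta^{1-n}$ if chopping is used.
Case 1. $d_{n+1} < \frac{1}{2}\beta$, then $fl(x) = \pm(.d_1 d_2 \ldots d_n)\beta^{e}$ and
$$|x - fl(x)| = (d_{n+1}.d_{n+2}\ldots)\,\beta^{e-n-1} \le \frac{1}{2}\beta \cdot \beta^{e-n-1} = \frac{1}{2}\beta^{e-n}$$
Case 2. $d_{n+1} \ge \frac{1}{2}\beta$, then $fl(x) = \pm\{(.d_1 d_2 \ldots d_n)\beta^{e} + \beta^{e-n}\}$ and
$$|x - fl(x)| = \left|-(d_{n+1}.d_{n+2}\ldots)\,\beta^{e-n-1} + \beta^{e-n}\right| = \beta^{e-n-1}\left|d_{n+1}.d_{n+2}\ldots - \beta\right| \le \beta^{e-n-1} \times \frac{1}{2}\beta = \frac{1}{2}\beta^{e-n}$$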
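The two bounds can be checked numerically for base β = 10. The sketch below is an empirical check on a few hypothetical sample values, not a proof; the helper name `relative_roundoff` is an assumption, and round-half-up is assumed for rounding.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_UP

def relative_roundoff(x: Decimal, n: int, rounding: str) -> Decimal:
    """Return |x - fl(x)| / |x| when x is stored with n decimal digits."""
    fl_x = Context(prec=n, rounding=rounding).plus(x)
    return abs((x - fl_x) / x)

n = 3
samples = [Decimal("5562"), Decimal("0.066649"),
           Decimal("3.14159265"), Decimal("999.5")]
round_bound = Decimal("0.5") * Decimal(10) ** (1 - n)   # (1/2) * 10^(1-n)
chop_bound = Decimal(10) ** (1 - n)                     # 10^(1-n)

for x in samples:
    r_round = relative_roundoff(x, n, ROUND_HALF_UP)
    r_chop = relative_roundoff(x, n, ROUND_DOWN)
    print(x, r_round <= round_bound, r_chop <= chop_bound)
```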
Now, we convert 22/7 to decimal form, so that we can find the difference between the
approximate value and true value. Then, the approximate value of π is
$$\frac{22}{7} = 3.14285714$$
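The number of accurate decimal places can be found by comparing the absolute error with $\frac{1}{2} \times 10^{-k}$ for increasing k. The sketch below assumes this criterion, which is the one used in the solutions of this unit; the function name `accurate_decimal_places` and the cut-off `max_k` are hypothetical.

```python
def accurate_decimal_places(x: float, xa: float, max_k: int = 15) -> int:
    """Largest k (up to max_k) with |x - xa| < 0.5 * 10**(-k)."""
    err = abs(x - xa)
    k = 0
    while k < max_k and err < 0.5 * 10 ** -(k + 1):
        k += 1
    return k

pi_value = 3.14159265
print(accurate_decimal_places(pi_value, 22 / 7))
print(accurate_decimal_places(2 / 3, 0.66667))
```

For 22/7 the error is about 0.00126, which is below 0.005 but not below 0.0005, so the approximation is accurate to two decimal places.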
Check Your Progress 3
1) Let x* = .3454 and y* = .3443 be approximations to x and y respectively correct to
3 significant digits. Further, let z* = x* – y* be the approximation to x – y. Then
show that the relative error in z* as an approximation to x – y can be as large as
100 times the relative error in x or y.
2) Round the number x = 2.2554 to three significant figures. Find the absolute error
and the relative error.
3) If π is taken as 3.14 instead of 22/7, find the relative error and percentage error.
5) Round-off the number 4.5126 to four significant figures and find the relative
percentage error.
Numerical integration is another example of an operation that is affected by
truncation error. A quadrature formula works by evaluating the integrand at a finite
number of points and using smooth functions to approximate the integrand between
those points. The difference between those smooth functions and the actual integrand
leads to truncation error.
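Truncation error in quadrature can be illustrated with the composite trapezoidal rule, which approximates the integrand by straight lines between sample points. The rule is chosen here only as a simple example and is not named in the text; the function names and the test integrand x² on [0, 1] are hypothetical.

```python
def trapezoid(f, a: float, b: float, n: int) -> float:
    """Composite trapezoidal rule with n equal subintervals on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total

def square(x: float) -> float:
    return x * x

exact = 1.0 / 3.0   # integral of x^2 over [0, 1]
err_4 = abs(trapezoid(square, 0.0, 1.0, 4) - exact)
err_8 = abs(trapezoid(square, 0.0, 1.0, 8) - exact)
print(err_4, err_8, err_4 / err_8)
```

Doubling the number of points reduces the error by a factor of about four, which shows that the error comes from the approximation itself and not from rounding.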
Taylor series represents the local behaviour of a function near a given point. If one
replaces the series by its n-th order polynomial, the truncation error is said to be
of order n+1, or $O(h^{n+1})$, where h is the distance to the given point. Consider the
irrational number e
e = 2.71828182845905…
and compare it with the Taylor series of the function exp(x) near the given point x = 0.
$$\exp(x) = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots$$
Solution: Recall that
$$e = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots$$
The series is to be truncated such that the finite sum equals e to three decimal places.
This means the tail must be less than 0.0005. Suppose that the tail starts at n = k+1. Then,
$$\sum_{n=k+1}^{\infty} \frac{1}{n!} = \frac{1}{(k+1)!} + \frac{1}{(k+2)!} + \cdots < \frac{1}{(k+1)!}\left[1 + \frac{1}{k+1} + \frac{1}{(k+1)^{2}} + \cdots\right]$$
$$= \frac{1}{(k+1)!} \cdot \frac{1}{1 - \frac{1}{k+1}} = \frac{1}{k!\,k} < 0.0005$$
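The smallest k that satisfies the bound can be found by direct search. The sketch below assumes the bound $1/(k!\,k)$ derived above; the function name `tail_bound` is hypothetical.

```python
import math

def tail_bound(k: int) -> float:
    """Upper bound 1 / (k! * k) on the tail of the series for e after the 1/k! term."""
    return 1.0 / (math.factorial(k) * k)

k = 1
while tail_bound(k) >= 0.0005:
    k += 1

# Partial sum 1 + 1/1! + ... + 1/k!
partial_sum = sum(1.0 / math.factorial(n) for n in range(k + 1))
print(k, partial_sum, abs(math.e - partial_sum))
```

The search stops at k = 6, since 1/(5!·5) = 1/600 is still too large while 1/(6!·6) = 1/4320 is below 0.0005.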
1.4 SUMMARY
In this unit, we have defined floating point numbers and their representation for
use in computers. We have defined accuracy and the number of significant digits in a
given number. We have also discussed the sources of errors in computation. We have
defined round-off and truncation errors and their propagation in later computations
that use values containing errors. Therefore, care must be taken to analyse the
computations, so that we are sure that the output of computations is meaningful.
[Figure: Total error, with labels Missing information and Human imperfection.]
1.5 SOLUTIONS/ANSWERS
1) (i) 50.9 (ii) 48.37 (iii) 9.326 (iv) 8.416 (v) 0.8001 (vi) 0.04251 (vii) 0.004912
(viii) 0.0002022
2) (i) $1000 \times 10^{2}$ or $0.1000 \times 10^{6}$ (ii) $-0.2214 \times 10^{-2}$ (iii) $-0.3567 \times 10^{2}$
3) We have 28.483 – 27.984 = 0.499. The result has only three significant digits.
This is due to the loss of significant digits during subtraction of nearly equal
numbers.
4) We find that $\left|\frac{2}{3} - 0.66667\right| = 0.0000033\ldots < \frac{1}{2} \times 10^{-5}$.
We find, k = 5. Therefore, the approximation is accurate to 5 decimal places.
5) Left as an exercise.
1) Given $r_x, r_y \le \frac{1}{2} \times 10^{1-3}$,
z* = x* – y* = 0.3454 – 0.3443 = 0.0011 = $0.11 \times 10^{-2}$.
This is correct to one significant digit, since the last digits 4 in x* and 3 in y*
are not reliable and the second significant digit of z* is derived from the fourth
digits of x* and y*.
$$\text{Max. } r_z = \frac{1}{2} \times 10^{1-1} = \frac{1}{2} = (100) \cdot \frac{1}{2} \cdot 10^{-2} \ge 100\, r_x,\ 100\, r_y$$
3) Relative error = $\left|\frac{22}{7} - 3.14\right| \Big/ \frac{22}{7} = 0.00091$. Percentage error = 0.091 %.
4) Absolute error = 0.2 * 10 −1 * 0.2217 = 0.04493. Hence x has only one correct digit
x ≈ 0.2 .
1.6 EXERCISES
E1) Give the floating-point representation of the following numbers in 2 decimal
digit and 4 decimal digit floating point number using (i) rounding and (ii)
chopping.
(a) 37.21829
(b) 0.022718
(c) 3000527.11059
E2) Show that a(b – c) ≠ ab – ac, where $a = .5555 \times 10^{1}$, $b = .4545 \times 10^{1}$,
$c = .4535 \times 10^{1}$.
E3) How many bits of significance will be lost in the following subtraction?
37.593621 – 37.584216. Assume each number is correct to seven significant
digits.
E4) What is the relative error in the computation of x – y, where x = 0.3721448693
and y = 0.3720214371, with five decimal digits of accuracy?
E5) Find the root of smaller magnitude of the quadratic equation
$x^{2} + 111.11x + 1.2121 = 0$, using five-decimal digit floating point chopped
arithmetic.
E3) The numbers are correct to seven significant digits. Then, in eight digit
floating-point arithmetic, the difference can be written as
$z^{*} = x^{*} - y^{*} = (0.94050000) \times 10^{-2}$. But as an approximation to z = x – y, z* is
good only to three digits, since the fourth significant digit of z* is derived from
the eighth digits of x* and y*, and both possibly contain errors. Here, while
the error in z* as an approximation to z = x – y is at most the sum of the errors in
x* and y*, the relative error in z* is possibly 10,000 times the relative error in x*
or y*. Loss of significant digits is, therefore, dangerous only if we wish to
keep the relative error small.
Given $r_x, r_y < \frac{1}{2} \times 10^{1-7}$, $z^{*} = (0.9405) \times 10^{-2}$ is correct to three significant digits.
$$\text{Max } r_z = \frac{1}{2} \times 10^{1-3} = 10000 \cdot \frac{1}{2} \times 10^{-6} \ge (10000)\, r_x,\ (10000)\, r_y$$
E4) With five decimal digit accuracy $x^{*} = 0.37214 \times 10^{0}$, $y^{*} = 0.37202 \times 10^{0}$,
x* – y* = 0.00012 while x – y = 0.0001234322.
$$\frac{(x - y) - (x^{*} - y^{*})}{x - y} = \frac{0.0000034322}{0.0001234322} \approx 3 \times 10^{-2}$$
The magnitude of this relative error is quite large when compared with the
relative errors of x* and y* (which cannot exceed $5 \times 10^{-5}$ and in this case is
approximately $1.3 \times 10^{-5}$ for x*).