CH 1
1. Numerical analysis
Numerical analysis is the branch of mathematics that studies and develops algorithms using numerical approximation for the problems of mathematical analysis (continuous mathematics). Numerical techniques are widely used by scientists and engineers to solve their problems. A major advantage of numerical techniques is that a numerical answer can be obtained even when a problem has no analytical solution. The result of a numerical computation is, in general, an approximation, but it can be made as accurate as desired; for example, we can approximate \(\sqrt{2}\) to any required accuracy.
In this chapter, we introduce and discuss some basic concepts of scientific computing. We begin with a discussion of floating-point representation, and then discuss the most fundamental source of imperfection in numerical computing, namely roundoff errors. We also discuss sources of error and the stability of numerical algorithms.
2. Numerical analysis and the art of scientific computing
Scientific computing is a discipline concerned with the development and study of numerical algorithms for solving mathematical problems that arise in various disciplines in science and engineering.
Typically, the starting point is a given mathematical model which has been formulated in an attempt
to explain and understand an observed phenomenon in biology, chemistry, physics, economics, or any
engineering or scientific discipline. We will concentrate on those mathematical models which are continuous (or piecewise continuous) and are difficult or impossible to solve analytically: this is usually
the case in practice. Relevant application areas within computer science include graphics, vision and
motion analysis, image and signal processing, search engines and data mining, machine learning, hybrid
and embedded systems, and many more. In order to solve such a model approximately on a computer,
the (continuous, or piecewise continuous) problem is approximated by a discrete one. Continuous functions are approximated by finite arrays of values. Algorithms are then sought which approximately
solve the mathematical problem efficiently, accurately and reliably.
3. Floating-point representation of numbers
Any real number is represented by an infinite sequence of digits. For example,
\[ \frac{8}{3} = 2.66666\ldots = \left( \frac{2}{10^1} + \frac{6}{10^2} + \frac{6}{10^3} + \cdots \right) \times 10^1 . \]
This is an infinite series, but a computer uses a finite amount of memory to represent numbers. Thus only a finite number of digits may be used to represent any number, no matter what representation method is chosen.
For example, we can chop the infinite decimal representation of 8/3 after 4 digits:
\[ \frac{8}{3} \approx \left( \frac{2}{10^1} + \frac{6}{10^2} + \frac{6}{10^3} + \frac{6}{10^4} \right) \times 10^1 = 0.2666 \times 10^1 . \]
Generalizing this, we say that such a number has n decimal digits, and we call n the precision.
For each real number x, we associate a floating-point representation, denoted by fl(x), given by
\[ fl(x) = \pm (0.a_1 a_2 \ldots a_n)_\beta \times \beta^e, \]
where the base-\(\beta\) fraction \((0.a_1 a_2 \ldots a_n)_\beta\) is called the mantissa, with all \(a_i\) integers satisfying \(0 \le a_i \le \beta - 1\), and e is known as the exponent. This is called the base-\(\beta\) floating-point representation of x; it is normalized when \(a_1 \neq 0\). For example,
\[ 42.965 = 4 \times 10^1 + 2 \times 10^0 + 9 \times 10^{-1} + 6 \times 10^{-2} + 5 \times 10^{-3} = 0.42965 \times 10^2 . \]
There are two ways to reduce a given real number x to an n-digit mantissa:
(1) Chopping: we ignore the digits after \(a_n\) and write
\[ fl(x) = (0.a_1 a_2 \ldots a_n)_\beta \times \beta^e . \]
(2) Rounding: rounding is defined by
\[ fl(x) = \begin{cases} (0.a_1 a_2 \ldots a_n)_\beta \times \beta^e, & 0 \le a_{n+1} < \beta/2 \quad \text{(rounding down)}, \\ \left[ (0.a_1 a_2 \ldots a_n)_\beta + (0.0\ldots 01)_\beta \right] \times \beta^e, & \beta/2 \le a_{n+1} < \beta \quad \text{(rounding up)}. \end{cases} \]
Example 1. With \(\beta = 10\) and \(n = 2\),
\[ fl\!\left(\frac{6}{7}\right) = \begin{cases} 0.86 \times 10^0 & \text{(rounding)}, \\ 0.85 \times 10^0 & \text{(chopping)}. \end{cases} \]
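These two reductions are easy to experiment with. The following Python sketch (the helper names `chop` and `round_fl` are ours, not part of the text) builds fl(x) for \(\beta = 10\) using the standard decimal module:

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def chop(x, n):
    """Keep n significant decimal digits of x, dropping the rest."""
    d = Decimal(repr(x))
    # Quantum whose exponent places the n-th significant digit last.
    q = Decimal(1).scaleb(d.adjusted() - n + 1)
    return float(d.quantize(q, rounding=ROUND_DOWN))

def round_fl(x, n):
    """Keep n significant decimal digits of x with symmetric rounding."""
    d = Decimal(repr(x))
    q = Decimal(1).scaleb(d.adjusted() - n + 1)
    return float(d.quantize(q, rounding=ROUND_HALF_UP))

print(chop(6/7, 2), round_fl(6/7, 2))   # 0.85  0.86, as in Example 1
print(chop(8/3, 4), round_fl(8/3, 4))   # 2.666 2.667
```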
Rules for rounding off numbers:
(1) If the digit to be dropped is greater than 5, the last retained digit is increased by one. For example,
12.6 is rounded to 13.
(2) If the digit to be dropped is less than 5, the last remaining digit is left as it is. For example,
12.4 is rounded to 12.
(3) If the digit to be dropped is 5, and if any digit following it is not zero, the last remaining digit is
increased by one. For example,
12.51 is rounded to 13.
(4) If the digit to be dropped is 5 and is followed only by zeros, the last remaining digit is increased
by one if it is odd, but left as it is if even. For example,
11.5 is rounded to 12, and 12.5 is rounded to 12.
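Rule (4) is the round-half-to-even convention, which is also what Python's built-in round() uses for exact halves, so the examples above can be checked directly:

```python
# Round-half-to-even ("banker's rounding"), as in rule (4):
print(round(11.5))   # 12: last retained digit 1 is odd, so it is increased
print(round(12.5))   # 12: last retained digit 2 is even, so it is left as is
print(round(12.51))  # 13: the digit after the 5 is nonzero, so rule (3) applies
```

(For decimal values that are not exactly representable in binary, the half-way rule can appear to misfire; that is a base-2 representation effect, not a violation of the rule.)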
Definition 3.2 (Absolute and relative error). If fl(x) is the approximation to the exact value x, then the absolute error is \(|x - fl(x)|\), and the relative error is \(\dfrac{|x - fl(x)|}{|x|}\).
Remark: As a measure of accuracy, the absolute error may be misleading and the relative error is more
meaningful.
Definition 3.3 (Overflow and underflow). An overflow is obtained when a number is too large to fit into the floating-point system in use, i.e., e > M. An underflow is obtained when a number is too small, i.e., e < m. When overflow occurs in the course of a calculation, it is generally fatal. Underflow is non-fatal: the system usually sets the number to 0 and continues. (Matlab does this, quietly.)
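Both behaviours are easy to observe in IEEE double precision (a small sketch; Python is used here merely as a convenient calculator):

```python
import math

tiny = 1.0e-300
underflowed = tiny * tiny    # exponent far too small: quietly becomes 0.0
print(underflowed)           # 0.0

huge = 1.0e300
print(huge * huge)           # inf: float multiplication overflows to infinity

exp_overflowed = False
try:
    math.exp(1000.0)         # math-library functions raise an error instead
except OverflowError:
    exp_overflowed = True
print("math.exp overflowed:", exp_overflowed)
```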
Error bound for chopping. Write
\[ x = (0.a_1 a_2 \ldots a_n a_{n+1} \ldots)_\beta \times \beta^e = \left( \sum_{i=1}^{\infty} \frac{a_i}{\beta^i} \right) \beta^e, \qquad a_1 \neq 0, \]
\[ fl(x) = (0.a_1 a_2 \ldots a_n)_\beta \times \beta^e = \left( \sum_{i=1}^{n} \frac{a_i}{\beta^i} \right) \beta^e . \]
Therefore
\[ |x - fl(x)| = \left( \sum_{i=n+1}^{\infty} \frac{a_i}{\beta^i} \right) \beta^e, \qquad \beta^{-e} |x - fl(x)| = \sum_{i=n+1}^{\infty} \frac{a_i}{\beta^i} . \]
Since each \(a_i \le \beta - 1\),
\[ \beta^{-e} |x - fl(x)| \le (\beta - 1) \sum_{i=n+1}^{\infty} \frac{1}{\beta^i} = (\beta - 1) \left( \frac{1}{\beta^{n+1}} + \frac{1}{\beta^{n+2}} + \cdots \right) = (\beta - 1)\, \frac{\beta^{-(n+1)}}{1 - \beta^{-1}} = \frac{1}{\beta^n}, \]
so that
\[ |x - fl(x)| \le \beta^{-n} \beta^e . \]
Moreover, since \(a_1 \ge 1\) implies \(|x| \ge \beta^{-1} \beta^e\), the relative error satisfies
\[ \frac{|x - fl(x)|}{|x|} \le \frac{\beta^{-n} \beta^e}{\beta^{-1} \beta^e} = \beta^{1-n} . \]
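The chopping bound \(|x - fl(x)|/|x| \le \beta^{1-n}\) can be spot-checked numerically. The sketch below (the helper name `chop_rel_err` is ours) chops random numbers to n = 4 decimal digits and records the worst relative error seen:

```python
import random
from decimal import Decimal, ROUND_DOWN

def chop_rel_err(x, n):
    """Relative error from chopping x > 0 to n significant decimal digits."""
    d = Decimal(repr(x))
    q = Decimal(1).scaleb(d.adjusted() - n + 1)
    return abs(x - float(d.quantize(q, rounding=ROUND_DOWN))) / abs(x)

random.seed(1)
n = 4
worst = max(chop_rel_err(random.uniform(0.1, 100.0), n) for _ in range(10_000))
print(worst, "<=", 10.0 ** (1 - n))   # bound beta^(1-n) with beta = 10
```

The observed worst case comes close to, but never exceeds, \(10^{-3}\), confirming that the bound is sharp.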
Error bound for rounding. Here
\[ fl(x) = \begin{cases} \left( \sum_{i=1}^{n} \dfrac{a_i}{\beta^i} \right) \beta^e = (0.a_1 a_2 \ldots a_n)_\beta \, \beta^e, & 0 \le a_{n+1} < \beta/2, \\[6pt] \left( \sum_{i=1}^{n} \dfrac{a_i}{\beta^i} + \dfrac{1}{\beta^n} \right) \beta^e = (0.a_1 a_2 \ldots a_{n-1} [a_n + 1])_\beta \, \beta^e, & \beta/2 \le a_{n+1} < \beta . \end{cases} \]
For \(0 \le a_{n+1} < \beta/2\),
\[ \beta^{-e} |x - fl(x)| = \sum_{i=n+1}^{\infty} \frac{a_i}{\beta^i} = \frac{a_{n+1}}{\beta^{n+1}} + \sum_{i=n+2}^{\infty} \frac{a_i}{\beta^i} \le \frac{\beta/2 - 1}{\beta^{n+1}} + (\beta - 1) \sum_{i=n+2}^{\infty} \frac{1}{\beta^i} = \frac{\beta/2 - 1}{\beta^{n+1}} + \frac{1}{\beta^{n+1}} = \frac{1}{2} \frac{1}{\beta^n} . \]
For \(\beta/2 \le a_{n+1} < \beta\),
\[ \beta^{-e} |x - fl(x)| = \left| \sum_{i=n+1}^{\infty} \frac{a_i}{\beta^i} - \frac{1}{\beta^n} \right| = \frac{1}{\beta^n} - \frac{a_{n+1}}{\beta^{n+1}} - \sum_{i=n+2}^{\infty} \frac{a_i}{\beta^i} \le \frac{1}{\beta^n} - \frac{a_{n+1}}{\beta^{n+1}} \le \frac{1}{\beta^n} - \frac{\beta/2}{\beta^{n+1}} = \frac{1}{2} \frac{1}{\beta^n} . \]
Therefore, in both cases the absolute error bound is
\[ E_a = |x - fl(x)| \le \frac{1}{2} \beta^{e-n} . \]
Also, since \(|x| \ge \beta^{-1} \beta^e\), the relative error bound is
\[ E_r = \frac{|x - fl(x)|}{|x|} \le \frac{\frac{1}{2} \beta^{-n} \beta^e}{\beta^{-1} \beta^e} = \frac{1}{2} \beta^{1-n} . \]
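The rounding bound \(\frac{1}{2}\beta^{1-n}\), half the chopping bound, can be checked the same way (the helper name `round_rel_err` is ours):

```python
import random
from decimal import Decimal, ROUND_HALF_UP

def round_rel_err(x, n):
    """Relative error from rounding x > 0 to n significant decimal digits."""
    d = Decimal(repr(x))
    q = Decimal(1).scaleb(d.adjusted() - n + 1)
    return abs(x - float(d.quantize(q, rounding=ROUND_HALF_UP))) / abs(x)

random.seed(2)
n = 4
worst = max(round_rel_err(random.uniform(0.1, 100.0), n) for _ in range(10_000))
print(worst, "<=", 0.5 * 10.0 ** (1 - n))   # bound (1/2) * beta^(1-n)
```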
5. Significant Figures
All measurements are approximations. No measuring device can give perfect measurements without
experimental uncertainty. By convention, a mass measured to 13.2 g is said to have an absolute
uncertainty of plus or minus 0.1 g and is said to have been measured to the nearest 0.1 g. In other
words, we are somewhat uncertain about that last digit: it could be a 2; then again, it could be a
1 or a 3. A mass of 13.20 g indicates an absolute uncertainty of plus or minus 0.01 g.
The number of significant figures in a result is simply the number of figures that are known with some
degree of reliability.
The number 25.4 is said to have 3 significant figures. The number 25.40 is said to have 4 significant
figures
Rules for deciding the number of significant figures in a measured quantity:
(1) All nonzero digits are significant:
1.234 has 4 significant figures, 1.2 has 2 significant figures.
(2) Zeros between nonzero digits are significant: 1002 has 4 significant figures.
(3) Leading zeros to the left of the first nonzero digits are not significant; such zeros merely indicate
the position of the decimal point: 0.001 has only 1 significant figure.
(4) Trailing zeros that are also to the right of a decimal point in a number are significant: 0.0230 has
3 significant figures.
(5) When a number ends in zeros that are not to the right of a decimal point, the zeros are not necessarily significant: 190 may be 2 or 3 significant figures, 50600 may be 3, 4, or 5 significant figures.
The potential ambiguity in the last rule can be avoided by the use of standard exponential, or scientific, notation. For example, depending on whether the number of significant figures is 3, 4, or 5, we
would write 50600 calories as:
0.506 × 10^5 (3 significant figures),
0.5060 × 10^5 (4 significant figures), or
0.50600 × 10^5 (5 significant figures).
What is an exact number? Some numbers are exact because they are known with complete certainty.
Most exact numbers are integers: exactly 12 inches are in a foot, there might be exactly 23 students in
a class. Exact numbers are often found as conversion factors or as counts of objects. Exact numbers
can be considered to have an infinite number of significant figures. Thus, the number of apparent
significant figures in any exact number can be ignored as a limiting factor in determining the number
of significant figures in the result of a calculation.
6. Rules for mathematical operations
In carrying out calculations, the general rule is that the accuracy of a calculated result is limited by the least accurate measurement involved in the calculation. In addition and subtraction, the result is rounded off so that it has the same number of decimal places as the measurement having the fewest decimal places. For example,
100 (assume 3 significant figures) + 23.643 (5 significant figures) = 123.643,
which should be rounded to 124 (3 significant figures). Note, however, that it is possible for two numbers to have no common digits (significant figures in the same digit columns).
In multiplication and division, the result should be rounded off so as to have the same number of
significant figures as in the component with the least number of significant figures. For example,
3.0 (2 significant figures) × 12.60 (4 significant figures) = 37.800,
which should be rounded to 38 (2 significant figures).
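Both rules can be mimicked with a few lines of Python (`round_sig` is an illustrative helper, not a standard function):

```python
from decimal import Decimal

def round_sig(x, sig):
    """Round x to `sig` significant figures (illustrative helper)."""
    return float(round(x, sig - Decimal(repr(x)).adjusted() - 1))

# Addition: keep the decimal places of the least precise term;
# 100 is assumed good to the ones place only.
total = 100 + 23.643
print(round(total, 0))         # 124.0

# Multiplication: keep the significant figures of the least precise
# factor; 3.0 has just 2.
product = 3.0 * 12.60
print(round_sig(product, 2))   # 38.0
```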
Let X = f(x_1, x_2, \ldots, x_n) be a function of n variables, and suppose we wish to determine the error \(\Delta X\) in X due to the errors \(\Delta x_1, \Delta x_2, \ldots, \Delta x_n\) in \(x_1, x_2, \ldots, x_n\). We have
\[ X + \Delta X = f(x_1 + \Delta x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n) . \]
Expanding the right-hand side in a Taylor series and neglecting higher powers of the \(\Delta x_i\) gives the general error formula
\[ \Delta X \approx \Delta x_1 \frac{\partial X}{\partial x_1} + \Delta x_2 \frac{\partial X}{\partial x_2} + \cdots + \Delta x_n \frac{\partial X}{\partial x_n} . \]
Error in addition of numbers. Let \(X = x_1 + x_2 + \cdots + x_n\). Then
\[ X + \Delta X = (x_1 + \Delta x_1) + (x_2 + \Delta x_2) + \cdots + (x_n + \Delta x_n) = (x_1 + x_2 + \cdots + x_n) + (\Delta x_1 + \Delta x_2 + \cdots + \Delta x_n) . \]
Therefore the absolute error satisfies
\[ |\Delta X| \le |\Delta x_1| + |\Delta x_2| + \cdots + |\Delta x_n| . \]
Dividing by X we get
\[ \left| \frac{\Delta X}{X} \right| \le \left| \frac{\Delta x_1}{X} \right| + \left| \frac{\Delta x_2}{X} \right| + \cdots + \left| \frac{\Delta x_n}{X} \right|, \]
which is the maximum relative error. This shows that when the given numbers are added, the magnitude of the absolute error in the result is at most the sum of the magnitudes of the absolute errors of the components.
Error in subtraction of numbers. As in the case of addition, we can obtain the maximum absolute error for the subtraction of two numbers. Let \(X = x_1 - x_2\). Then
\[ |\Delta X| \le |\Delta x_1| + |\Delta x_2| . \]
Also
\[ \left| \frac{\Delta X}{X} \right| \le \left| \frac{\Delta x_1}{X} \right| + \left| \frac{\Delta x_2}{X} \right|, \]
which is the maximum relative error in the subtraction of numbers.
Error in product of numbers. Let \(X = x_1 x_2 \ldots x_n\). Using the general formula for the error,
\[ \Delta X \approx \Delta x_1 \frac{\partial X}{\partial x_1} + \Delta x_2 \frac{\partial X}{\partial x_2} + \cdots + \Delta x_n \frac{\partial X}{\partial x_n} . \]
We have
\[ \frac{\Delta X}{X} = \frac{\Delta x_1}{X} \frac{\partial X}{\partial x_1} + \frac{\Delta x_2}{X} \frac{\partial X}{\partial x_2} + \cdots + \frac{\Delta x_n}{X} \frac{\partial X}{\partial x_n} . \]
Now
\[ \frac{1}{X} \frac{\partial X}{\partial x_1} = \frac{x_2 x_3 \ldots x_n}{x_1 x_2 x_3 \ldots x_n} = \frac{1}{x_1}, \qquad \frac{1}{X} \frac{\partial X}{\partial x_2} = \frac{x_1 x_3 \ldots x_n}{x_1 x_2 x_3 \ldots x_n} = \frac{1}{x_2}, \qquad \ldots, \qquad \frac{1}{X} \frac{\partial X}{\partial x_n} = \frac{x_1 x_2 \ldots x_{n-1}}{x_1 x_2 x_3 \ldots x_n} = \frac{1}{x_n} . \]
Therefore
\[ \frac{\Delta X}{X} = \frac{\Delta x_1}{x_1} + \frac{\Delta x_2}{x_2} + \cdots + \frac{\Delta x_n}{x_n} . \]
Therefore the maximum relative and absolute errors are given by
\[ E_r = \left| \frac{\Delta x_1}{x_1} \right| + \left| \frac{\Delta x_2}{x_2} \right| + \cdots + \left| \frac{\Delta x_n}{x_n} \right|, \qquad E_a = E_r \, |X| . \]
Error in division of numbers. Let \(X = x_1 / x_2\). Then
\[ \Delta X \approx \Delta x_1 \frac{\partial X}{\partial x_1} + \Delta x_2 \frac{\partial X}{\partial x_2} . \]
We have
\[ \frac{\Delta X}{X} = \frac{\Delta x_1}{X} \frac{\partial X}{\partial x_1} + \frac{\Delta x_2}{X} \frac{\partial X}{\partial x_2} = \frac{\Delta x_1}{x_1} - \frac{\Delta x_2}{x_2} . \]
Therefore
\[ E_r = \left| \frac{\Delta x_1}{x_1} \right| + \left| \frac{\Delta x_2}{x_2} \right|, \qquad E_a = E_r \, |X| . \]
Thus, for instance, the relative error in a computed difference can be as large as
\[ \frac{(x - y) - \left( fl(x) - fl(y) \right)}{x - y} = 0.04 = 4\% . \]
Example 9. The quadratic formula is used for computing the roots of the equation \(ax^2 + bx + c = 0\), \(a \neq 0\), and the roots are given by
\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} . \]
Consider the equation \(x^2 + 62.10x + 1 = 0\) and discuss the numerical results.
Sol. Using the quadratic formula and 8-digit rounding arithmetic, we obtain the two roots
\[ x_1 = -0.01610723, \qquad x_2 = -62.08390 . \]
We use these values as exact values. Now we perform the calculations with 4-digit rounding arithmetic. We have
\[ \sqrt{b^2 - 4ac} = \sqrt{62.10^2 - 4.000} = \sqrt{3856 - 4.000} = 62.06 \]
and
\[ fl(x_1) = \frac{-62.10 + 62.06}{2.000} = -0.02000 . \]
The relative error in computing \(x_1\) is
\[ \frac{|fl(x_1) - x_1|}{|x_1|} = \frac{|-0.02000 + 0.01610723|}{|-0.01610723|} = 0.2417 . \]
In calculating \(x_2\),
\[ fl(x_2) = \frac{-62.10 - 62.06}{2.000} = -62.10 . \]
In this equation \(b^2 = 62.10^2\) is much larger than \(4ac = 4\), hence \(b\) and \(\sqrt{b^2 - 4ac}\) become two nearly equal numbers. The calculation of \(x_1\) therefore involves the subtraction of two nearly equal numbers, whereas \(x_2\) involves the addition of two nearly equal numbers, which does not cause a serious loss of significant figures.
To obtain a more accurate 4-digit rounding approximation for \(x_1\), we change the formulation by rationalizing the numerator, that is,
\[ x_1 = \frac{-2c}{b + \sqrt{b^2 - 4ac}} . \]
Then
\[ fl(x_1) = \frac{-2.000}{62.10 + 62.06} = -2.000/124.2 = -0.01610 . \]
The relative error in computing \(x_1\) is now reduced to \(0.62 \times 10^{-3}\). However, if we rationalize the numerator in \(x_2\) we get
\[ x_2 = \frac{-2c}{b - \sqrt{b^2 - 4ac}} . \]
The use of this formula involves not only the subtraction of two nearly equal numbers but also division by their small difference, which degrades the accuracy:
\[ fl(x_2) = \frac{-2.000}{62.10 - 62.06} = -2.000/0.04000 = -50.00 . \]
The relative error in \(x_2\) becomes 0.19.
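The whole computation can be replayed by rounding every intermediate result to 4 significant digits. The helper `rnd4` below is our own device for mimicking a 4-digit machine, so the figures it produces can differ in the last place from the text's:

```python
import math

def rnd4(v):
    """Simulate rounding a result to 4 significant decimal digits."""
    return float(f"{v:.4g}")

a, b, c = 1.0, 62.10, 1.0
s = rnd4(math.sqrt(rnd4(rnd4(b * b) - rnd4(4 * a * c))))   # 62.06
x1_naive = rnd4((-b + s) / (2 * a))    # -0.02: catastrophic cancellation
x1_rat = rnd4(-2 * c / rnd4(b + s))    # -0.0161: rationalized formula
x1_exact = -0.01610723
print(x1_naive, abs(x1_naive - x1_exact) / abs(x1_exact))  # rel. error ~ 0.24
print(x1_rat, abs(x1_rat - x1_exact) / abs(x1_exact))      # rel. error well below 1e-3
```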
Example 10. Consider the stability of \(\sqrt{x+1} - 1\) when x is near 0. Rewrite the expression to rid it of subtractive cancellation.
Sol. Suppose that \(x = 1.2345678 \times 10^{-5}\). Then \(\sqrt{x+1} \approx 1.000006173\). If our computer (or calculator) can only keep 8 significant digits, this will be rounded to 1.0000062. When 1 is subtracted, the result is \(6.2 \times 10^{-6}\).
Thus 6 significant digits have been lost from the original. To fix this, we rationalize the expression:
\[ \sqrt{x+1} - 1 = \left( \sqrt{x+1} - 1 \right) \frac{\sqrt{x+1} + 1}{\sqrt{x+1} + 1} = \frac{x}{\sqrt{x+1} + 1} . \]
This expression has no subtraction, and so is not subject to subtractive cancellation. When \(x = 1.2345678 \times 10^{-5}\), it evaluates approximately as
\[ \frac{1.2345678 \times 10^{-5}}{2.0000062} \approx 6.17281995 \times 10^{-6} \]
on a machine with 8 digits, and there is no loss of precision.
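The same experiment in Python, simulating an 8-digit machine with an assumed helper `rnd8`:

```python
import math

def rnd8(v):
    """Simulate keeping 8 significant decimal digits."""
    return float(f"{v:.8g}")

x = 1.2345678e-5
naive = rnd8(rnd8(math.sqrt(1 + x)) - 1)        # 6.2e-06: two digits survive
stable = rnd8(x / rnd8(math.sqrt(1 + x) + 1))   # about 6.1728199e-06: full precision
print(naive, stable)
```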
Example 11. Find the solution of the following equation using floating-point arithmetic with a 4-digit mantissa:
\[ x^2 - 1000x + 25 = 0 . \]
Sol. Here \(b^2 - 4ac = 0.1000e7 - 0.1000e3\); aligning the exponents with a 4-digit mantissa turns 0.1000e3 into 0.0000e7, so \(b^2 - 4ac = 0.1000e7\) and \(\sqrt{b^2 - 4ac} = 0.1000e4\). Then
\[ x_1 = \frac{0.1000e4 + 0.1000e4}{2} = 0.1000e4, \qquad x_2 = \frac{0.1000e4 - 0.1000e4}{2} = 0.0000e4 . \]
One of the roots becomes zero due to the limited precision allowed in the computation. In this equation \(b^2\) is much larger than \(4ac\), hence \(b\) and \(\sqrt{b^2 - 4ac}\) become two nearly equal numbers, and the calculation of \(x_2\) involves the subtraction of two nearly equal numbers, which causes a serious loss of significant figures.
To obtain a more accurate 4-digit rounding approximation for \(x_2\), we can change the formulation by rationalizing the numerator, or use the fact that in the quadratic equation \(ax^2 + bx + c = 0\) the product of the roots is \(c/a\): the smaller root may be obtained by dividing \(c/a\) by the larger root. Therefore the first root is 0.1000e4 and the second root is
\[ x_2 = \frac{0.2500e2}{0.1000e4} = 0.2500e{-1} . \]
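The same failure, and the product-of-roots cure, can be seen in ordinary double precision by making b even larger (the coefficients below are ours, chosen so that \(b^2\) swamps \(4ac\)):

```python
import math

# x^2 - 1e9*x + 1 = 0: b*b - 4ac rounds to b*b in double precision.
a, b, c = 1.0, -1.0e9, 1.0
s = math.sqrt(b * b - 4 * a * c)   # exactly 1e9 after rounding
x1 = (-b + s) / (2 * a)            # large root: fine, 1e9
x2_naive = (-b - s) / (2 * a)      # 0.0 -- all significance lost
x2_prod = (c / a) / x1             # product of roots c/a divided by x1: 1e-9
print(x1, x2_naive, x2_prod)
```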
Example 12. How do we evaluate \(y = x - \sin x\) when x is small?
Sol. Since \(\sin x \approx x\) when x is small, direct evaluation causes a loss of significant figures. Alternatively, if we use the Taylor series for \(\sin x\), we obtain
\[ y = x - \left( x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots \right) = \frac{x^3}{6} - \frac{x^5}{6 \cdot 20} + \frac{x^7}{6 \cdot 20 \cdot 42} - \cdots = \frac{x^3}{6} \left( 1 - \frac{x^2}{20} \left( 1 - \frac{x^2}{42} \left( 1 - \frac{x^2}{72} (\cdots) \right) \right) \right) . \]
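A quick comparison of the two evaluations (the nesting below follows the series above, truncated after the \(x^9\) term):

```python
import math

def y_naive(x):
    return x - math.sin(x)

def y_series(x):
    # Nested form of x^3/6 - x^5/(6*20) + x^7/(6*20*42) - x^9/(6*20*42*72)
    x2 = x * x
    return (x ** 3 / 6.0) * (1.0 - (x2 / 20.0) * (1.0 - (x2 / 42.0) * (1.0 - x2 / 72.0)))

x = 1.0e-8
print(y_naive(x))    # 0.0: complete cancellation in double precision
print(y_series(x))   # about 1.6666666666666667e-25, the correct value
```

For moderate x (say x = 0.1) the two agree to machine precision; for tiny x only the series form retains any significant figures.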
7.2. Conditioning. The words condition and conditioning are used to indicate how sensitive the solution of a problem may be to small changes in the input data. A problem is ill-conditioned if small changes in the data can produce large changes in the results. For certain types of problems, a condition number can be defined. If that number is large, it indicates an ill-conditioned problem. In contrast, if the number is modest, the problem is recognized as a well-conditioned problem.
The condition number can be calculated in the following manner:
\[ \kappa(x) = \left| \frac{x f'(x)}{f(x)} \right| . \]
For example, if
\[ f(x) = \frac{10}{1 - x^2}, \]
then the condition number is
\[ \kappa(x) = \left| \frac{x f'(x)}{f(x)} \right| = \frac{2x^2}{|1 - x^2|} . \]
The condition number can be quite large for \(|x| \approx 1\); therefore, the function is ill-conditioned there.
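A small sketch of this calculation (the function `condition` is an illustrative helper):

```python
def condition(f, dfdx, x):
    """kappa(x) = |x f'(x) / f(x)|, the relative condition number."""
    return abs(x * dfdx(x) / f(x))

f = lambda x: 10.0 / (1.0 - x * x)
dfdx = lambda x: 20.0 * x / (1.0 - x * x) ** 2

# kappa(x) = 2x^2/|1 - x^2|: modest away from |x| = 1, huge near it.
print(condition(f, dfdx, 0.1))    # about 0.02
print(condition(f, dfdx, 0.5))    # about 0.67
print(condition(f, dfdx, 0.999))  # nearly 1000: ill-conditioned
```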
7.3. Stability of an algorithm. Another theme that occurs repeatedly in numerical analysis is the distinction between numerical algorithms that are stable and those that are not. Informally speaking, a numerical process is unstable if small errors made at one stage of the process are magnified and propagated in subsequent stages and seriously degrade the accuracy of the overall calculation.
An algorithm can be thought of as a sequence of problems, i.e., a sequence of function evaluations. In this case we consider the algorithm for evaluating f(x) to consist of the evaluation of the sequence \(x_1, x_2, \ldots, x_n\). We are concerned with the condition of each of the functions \(f_1(x_1), f_2(x_2), \ldots, f_{n-1}(x_{n-1})\), where \(f(x) = f_i(x_i)\) for all i. An algorithm is unstable if any \(f_i\) is ill-conditioned, i.e., if any \(f_i(x_i)\) has condition much worse than f(x). Consider the example
\[ f(x) = \sqrt{x+1} - \sqrt{x}, \]
so that there is potential loss of significance when x is large. Taking x = 12345 as an example, one possible algorithm is
x_0 := x = 12345,
x_1 := x_0 + 1,
x_2 := \sqrt{x_1},
x_3 := \sqrt{x_0},
f(x) := x_4 := x_2 - x_3 .
The loss of significance occurs with the final subtraction. We can rewrite the last step in the form \(f_3(x_3) = x_2 - x_3\) to show how the final answer depends on \(x_3\). As \(f_3'(x_3) = -1\), we have the condition
\[ \kappa(x_3) = \left| \frac{x_3 f_3'(x_3)}{f_3(x_3)} \right| = \left| \frac{x_3}{x_2 - x_3} \right|, \]
from which we find \(\kappa(x_3) \approx 2.2 \times 10^4\) when x = 12345. Note that this is the condition of a subproblem arrived at during the algorithm. To find an alternative algorithm we write
\[ f(x) = \left( \sqrt{x+1} - \sqrt{x} \right) \frac{\sqrt{x+1} + \sqrt{x}}{\sqrt{x+1} + \sqrt{x}} = \frac{1}{\sqrt{x+1} + \sqrt{x}} . \]
This suggests the algorithm
x_0 := x = 12345,
x_1 := x_0 + 1,
x_2 := \sqrt{x_1},
x_3 := \sqrt{x_0},
x_4 := x_2 + x_3,
f(x) := x_5 := 1/x_4 .
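Both algorithms can be simulated by rounding each step to 8 significant digits (`rnd8` is our own device for mimicking limited precision):

```python
import math

def rnd8(v):
    """Simulate 8-significant-digit arithmetic."""
    return float(f"{v:.8g}")

def f_unstable(x):
    x1 = rnd8(x + 1)
    x2 = rnd8(math.sqrt(x1))
    x3 = rnd8(math.sqrt(x))
    return rnd8(x2 - x3)             # final subtraction: ill-conditioned step

def f_stable(x):
    x1 = rnd8(x + 1)
    x2 = rnd8(math.sqrt(x1))
    x3 = rnd8(math.sqrt(x))
    return rnd8(1.0 / rnd8(x2 + x3))  # addition instead of subtraction

x = 12345.0
ref = 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))   # full double precision
print(f_unstable(x), f_stable(x), ref)
```

The unstable version retains only a few correct digits, while the stable version agrees with the reference to nearly the full 8 digits carried.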
Exercises
(1) Determine the number of significant digits in the following numbers: 123, 0.124, 0.0045,
0.004300, 20.0045, 17001, 170.00, and 1800.
(2) Find the absolute, percentage, and relative errors if x = 0.005998 is rounded-off to three decimal
digits.
(3) Round-off the following numbers correct to four significant figures: 58.3643, 979.267, 7.7265,
56.395, 0.065738 and 7326853000.
(4) The following numbers are given in a decimal computer with a four-digit normalized mantissa:
A = 0.4523e-4, B = 0.2115e-3, and C = 0.2583e1.
Perform the following operations, and indicate the error in the result, assuming symmetric
rounding:
(i) A + B + C, (ii) A - B, (iii) A/C, (iv) AB/C.
(5) Assume a 3-digit mantissa with rounding.
(i) Evaluate y = x^3 - 3x^2 + 4x + 0.21 for x = 2.73.
(ii) Evaluate y = [(x - 3)x + 4]x + 0.21 for x = 2.73.
Compare and discuss the errors obtained in parts (i) and (ii).
(6) Associativity does not necessarily hold for floating-point addition (or multiplication).
Let a = 0.8567 × 10^0, b = 0.1325 × 10^4, c = -0.1325 × 10^4. Then a + (b + c) = 0.8567 × 10^0,
while (a + b) + c = 0.1000 × 10^1. The two answers are NOT the same! Show the calculations.
(7) Calculate the sum of \(\sqrt{3}\), \(\sqrt{5}\), and \(\sqrt{7}\) to four significant digits and find its absolute and relative
errors.
(8) Find the root of smallest magnitude of the equation x^2 - 400x + 1 = 0 using the quadratic formula.
Work in floating-point arithmetic using a four-decimal place mantissa.
(9) Calculate the values of x^2 + 2x - 2 and (2x - 2) + x^2, where x = 0.7320e0, using normalized floating-point
arithmetic, and prove that they are not the same. Compare with the value of (x^2 - 2) + 2x.
(10) Suppose that the function f(x) = ln(x + 1) - ln(x) is computed by the following algorithm for
large values of x, using six-digit rounding arithmetic:
x_0 := x = 12345,
x_1 := x_0 + 1,
x_2 := ln(x_1),
x_3 := ln(x_0),
f(x) := x_4 := x_2 - x_3 .
By considering the condition \(\kappa(x_3)\) of the subproblem of evaluating \(f_3(x_3) = x_2 - x_3\), show that
such a function evaluation is not stable. Also propose a modification of the function evaluation
so that the algorithm becomes stable.
(11) Discuss the condition number of the polynomial function f(x) = 2x^2 + x - 1.
(12) Suppose that a function ln is available to compute the natural logarithm of its argument.
Consider the calculation of ln(1 + x), for small x, by the following algorithm:
x_0 := x,
x_1 := 1 + x_0,
f(x) := x_2 := ln(x_1) .
By considering the condition \(\kappa(x_1)\) of the subproblem of evaluating ln(x_1), show that such a
function ln is inadequate for calculating ln(1 + x) accurately.