1 Mathematical Review and Computer Arithmetic
1.1 Mathematical Review
The tools of scientific, engineering, and operations research computing are firmly based in the calculus. In particular, formulating and solving mathematical models in these areas involves approximation of quantities, such as integrals, derivatives, solutions to differential equations, and solutions to systems of equations, first seen in a calculus course. Indeed, techniques from such a course are the basis of much of scientific computation. We review these techniques here, with particular emphasis on how we will use them. In addition to basic calculus techniques, scientific computing involves approximation of the real number system by decimal numbers with a fixed number of digits in their representation. Except for certain research-oriented systems, computer number systems today for this purpose are floating point systems, and almost all such floating point systems in use today adhere to the IEEE 754-2008 floating point standard. We describe floating point numbers and the floating point standard in this chapter, paying particular attention to consequences and pitfalls of its use. Third, programming and software tools are used in scientific computing. Considering how commonly it is used, its ease of programming and debugging, its documentation, and the packages accessible from it, we have elected to use matlab throughout this book. We introduce the basics of matlab in this chapter.
1.1.1 Useful Theorems from Calculus
Throughout, $C^n[a,b]$ will denote the set of real-valued functions $f$ defined on the interval $[a,b]$ such that $f$ and its derivatives, up to and including its $n$-th derivative $f^{(n)}$, are continuous on $[a,b]$.

THEOREM 1.1 (Intermediate value theorem) If $f \in C[a,b]$ and $k$ is any number between $m = \min_{a \le x \le b} f(x)$ and $M = \max_{a \le x \le b} f(x)$, then there exists a number $c$ in $[a,b]$ such that $f(c) = k$.
Example 1.1 Consider $f(x) = e^x - x - 2$. Using a computational device (such as a calculator) on which we trust the approximation of $e^x$ to be accurate, we compute $f(0) = -1$ and $f(2) \approx 3.3891$. We know $f$ is continuous, since it is a sum of continuous functions. Since 0 is between $f(0)$ and $f(2)$, the Intermediate Value Theorem tells us there is a point $c \in [0,2]$ such that $f(c) = 0$. At such a $c$, $e^c = c + 2$.

THEOREM 1.2 (Mean value theorem for integrals) Let $f$ be continuous and $w$ be Riemann integrable¹ on $[a,b]$, and suppose that $w(x) \ge 0$ for $x \in [a,b]$. Then there exists a point $c$ in $[a,b]$ such that
\[ \int_a^b f(x)\,w(x)\,dx = f(c) \int_a^b w(x)\,dx. \]

¹ "Riemann integrable" means that the limit of the Riemann sums exists. For example, $w$ may be continuous, or $w$ may have a finite number of breaks.

Example 1.2 Consider $\int_0^1 x^2 e^x\,dx$. With $w(x) = x^2$ and $f(x) = e^x$, the Mean Value Theorem for integrals tells us that
\[ \int_0^1 x^2 e^x\,dx = e^c \int_0^1 x^2\,dx = \frac{e^c}{3} \]
for some $c \in [0,1]$, so
\[ \frac{1}{3} \le \int_0^1 x^2 e^x\,dx \le \frac{e}{3}. \]
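As a quick sanity check (our sketch, not part of the original example), matlab's numerical integrator confirms that the value of this integral, which is $e - 2 \approx 0.7183$, does lie in $[1/3, e/3] \approx [0.3333, 0.9061]$:

% Numerically evaluate the integral from Example 1.2; the exact value
% is e - 2, which indeed lies between 1/3 and e/3.
q = integral(@(x) x.^2.*exp(x), 0, 1)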
THEOREM 1.3 (Taylor's theorem) Suppose that $f \in C^{n+1}[a,b]$. Let $x_0 \in [a,b]$. Then for any $x \in [a,b]$,
\[ f(x) = P_n(x) + R_n(x), \]
where
\[ P_n(x) = f(x_0) + f'(x_0)(x - x_0) + \cdots + \frac{f^{(n)}(x_0)(x - x_0)^n}{n!} = \sum_{k=0}^{n} \frac{1}{k!} f^{(k)}(x_0)(x - x_0)^k, \]
and
\[ R_n(x) = \frac{1}{n!} \int_{x_0}^{x} f^{(n+1)}(t)(x - t)^n\,dt \quad \text{(integral form of remainder).} \]
Furthermore, there is a $\xi = \xi(x)$ between $x_0$ and $x$ with
\[ R_n(x) = \frac{f^{(n+1)}(\xi(x))(x - x_0)^{n+1}}{(n+1)!} \quad \text{(Lagrange form of remainder).} \]
PROOF By the Fundamental Theorem of Calculus,
\[ f(x) = f(x_0) + \int_{x_0}^{x} f'(t)\,dt. \]
Integrating by parts repeatedly, using $\int u\,dv = uv - \int v\,du$ with $dv = dt$ and $v = -(x - t)$, gives
\[ \int_{x_0}^{x} f'(t)\,dt = f'(x_0)(x - x_0) + \int_{x_0}^{x} (x - t) f''(t)\,dt = f'(x_0)(x - x_0) + f''(x_0)\frac{(x - x_0)^2}{2} + \int_{x_0}^{x} \frac{(x - t)^2}{2} f'''(t)\,dt. \]
Continuing this procedure,
\[ f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{(x - x_0)^2}{2} f''(x_0) + \cdots + \frac{(x - x_0)^n}{n!} f^{(n)}(x_0) + \int_{x_0}^{x} \frac{(x - t)^n}{n!} f^{(n+1)}(t)\,dt = P_n(x) + R_n(x). \]
Now consider $R_n(x) = \frac{1}{n!} \int_{x_0}^{x} (x - t)^n f^{(n+1)}(t)\,dt$, and assume that $x_0 < x$ (the same argument works if $x_0 > x$). Then, by Theorem 1.2,
\[ R_n(x) = f^{(n+1)}(\xi(x)) \frac{1}{n!} \int_{x_0}^{x} (x - t)^n\,dt = \frac{f^{(n+1)}(\xi(x))(x - x_0)^{n+1}}{(n+1)!}, \]
where $\xi$ is between $x_0$ and $x$, and thus $\xi = \xi(x)$.

Example 1.3 Approximate $\sin(x)$ by a polynomial $p(x)$ such that $|\sin(x) - p(x)| \le 10^{-16}$ for $-0.1 \le x \le 0.1$.

For Example 1.3, Taylor polynomials about $x_0 = 0$ are appropriate, since that is the center of the interval over which we wish to approximate. We observe that the terms of even degree in such a polynomial are absent, so, for
$n$ even, Taylor's theorem gives the following polynomials $P_n$ and remainders $R_n$:

  n    P_n(x)                                     R_n(x)
  2    x                                          -(x^3/3!) cos(c_2)
  4    x - x^3/3!                                 (x^5/5!) cos(c_4)
  6    x - x^3/3! + x^5/5!                        -(x^7/7!) cos(c_6)
  ...  ...                                        ...
  n    x - x^3/3! + ... ± x^{n-1}/(n-1)!          (-1)^{n/2} (x^{n+1}/(n+1)!) cos(c_n)
Observing that $|\cos(c_n)| \le 1$, we see that
\[ |R_n(x)| \le \frac{|x|^{n+1}}{(n+1)!}. \]
We may thus form the following table of error bounds for $|x| \le 0.1$:

  n     bound on error |R_n|
  2     1.67 × 10^{-4}
  4     8.33 × 10^{-8}
  6     1.98 × 10^{-11}
  8     2.76 × 10^{-15}
  10    2.51 × 10^{-19}

Thus, a polynomial with the required accuracy for $x \in [-0.1, 0.1]$ is
\[ p(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!}. \]
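As a quick check (ours, not the text's), one can compare this polynomial with matlab's built-in sine over the interval; the maximum difference observed in double precision is dominated by roundoff rather than by the truncation error bounded above:

% Compare the degree-9 Taylor polynomial with sin(x) on [-0.1, 0.1].
x = linspace(-0.1, 0.1, 1001);
p = x - x.^3/6 + x.^5/120 - x.^7/5040 + x.^9/362880;
max_error = max(abs(sin(x) - p))   % at roundoff level in double precision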
An important special case of Taylor's theorem is obtained with $n = 0$ (that is, directly from the Fundamental Theorem of Calculus).

THEOREM 1.4 (Mean value theorem) Suppose $f \in C^1[a,b]$, $x \in [a,b]$, and $y \in [a,b]$ (and, without loss of generality, $x \le y$). Then there is a $c \in [x,y] \subseteq [a,b]$ such that
\[ f(y) - f(x) = f'(c)(y - x). \]

Example 1.4 Suppose $f(1) = 1$ and $|f'(x)| \le 2$ for $x \in [1,2]$. What are an upper bound and a lower bound on $f(2)$?
The mean value theorem tells us that
\[ f(2) = f(1) + f'(c)(2 - 1) = f(1) + f'(c) \]
for some $c \in (1,2)$. Furthermore, the fact $|f'(x)| \le 2$ is equivalent to $-2 \le f'(x) \le 2$. Combining these facts gives
\[ -1 = 1 - 2 \le f(2) \le 1 + 2 = 3. \]
1.1.2 Big O Notation
We study rates of growth and rates of decrease of errors. For example, if we approximate $e^h$ by a first-degree Taylor polynomial about $x = 0$, we get
\[ e^h - (1 + h) = \frac{1}{2} h^2 e^{\xi}, \]
where $\xi$ is some unknown quantity between 0 and $h$. Although we don't know exactly what $e^{\xi}$ is, we know that it is nearly constant (in this case, approximately 1) for $h$ near 0, so the error $e^h - (1 + h)$ is roughly proportional to $h^2$ for $h$ small. This approximate proportionality is often more important to know than the slowly varying constant $e^{\xi}$. The "big O" and "little o" notations are used to describe and keep track of this approximate proportionality.

DEFINITION 1.1 Let $E(h)$ be an expression that depends on a small quantity $h$. We say that $E(h) = O(h^k)$ if there are an $\epsilon$ and a $C$ such that
\[ |E(h)| \le C h^k \quad \text{for all } |h| \le \epsilon. \]

The "O" denotes "order." For example, if $f(h) = O(h^2)$, we say that "$f$ exhibits order 2 convergence to 0 as $h$ tends to 0."

Example 1.5 $E(h) = e^h - h - 1$. Then $E(h) = O(h^2)$.

PROOF By Taylor's Theorem,
\[ e^h = e^0 + e^0 (h - 0) + \frac{h^2}{2} e^c \]
for some $c$ between 0 and $h$. Thus,
\[ E(h) = e^h - h - 1 = \frac{h^2}{2} e^c \le \frac{e^1}{2} h^2 \]
for $|h| \le 1$; that is, $\epsilon = 1$ and $C = e/2$ work.
Example 1.6 Show that
\[ \frac{f(x+h) - f(x)}{h} - f'(x) = O(h) \]
for $x, x+h \in [a,b]$, assuming that $f$ has two continuous derivatives at each point in $[a,b]$.

PROOF Using Taylor's theorem with the integral form of the remainder,
\[ \left| \frac{f(x+h) - f(x)}{h} - f'(x) \right| = \left| \frac{f(x) + f'(x)h + \int_x^{x+h} (x + h - t) f''(t)\,dt - f(x)}{h} - f'(x) \right| = \left| \frac{1}{h} \int_x^{x+h} (x + h - t) f''(t)\,dt \right| \le \frac{h}{2} \max_{a \le t \le b} |f''(t)| = ch. \]
1.1.3 Convergence Rates
DEFINITION 1.2 Let $\{x_k\}$ be a sequence with limit $x^*$. If there are constants $C$ and $\alpha$ and an integer $N$ such that $|x_{k+1} - x^*| \le C |x_k - x^*|^{\alpha}$ for $k \ge N$, we say that the rate of convergence is of order at least $\alpha$. If $\alpha = 1$ (with $C < 1$), the rate is said to be linear. If $\alpha = 2$, the rate is said to be quadratic.
Example 1.7 A sequence sometimes learned in elementary classes for computing the square root of a number $a$ is
\[ x_{k+1} = \frac{x_k}{2} + \frac{a}{2 x_k}. \]
We have
\[ x_{k+1} - \sqrt{a} = \frac{x_k}{2} + \frac{a}{2 x_k} - \sqrt{a} = \frac{x_k^2 - 2\sqrt{a}\,x_k + a}{2 x_k} = \frac{(x_k - \sqrt{a})^2}{2 x_k} \approx \frac{1}{2\sqrt{a}} (x_k - \sqrt{a})^2 \]
for $x_k$ near $\sqrt{a}$, so the convergence is quadratic.
Quadratic convergence is very fast. We can think of quadratic convergence, with $C \approx 1$, as doubling the number of significant figures on each iteration. (In contrast, linear convergence with $C = 0.1$ adds one decimal digit of accuracy to the approximation on each iteration.) For example, if we use the square root computation from Example 1.7 with $a = 2$, starting with $x_0 = 2$, we obtain the following table.

  k    x_k                    |x_k - √2|           |x_k - √2| / |x_{k-1} - √2|^2
  0    2                      0.5858 × 10^0
  1    1.5                    0.8579 × 10^{-1}     0.2500
  2    1.416666666666667      0.2453 × 10^{-2}     0.3333
  3    1.414215686274510      0.2123 × 10^{-5}     0.3529
  4    1.414213562374690      0.1594 × 10^{-11}    0.3535
  5    1.414213562373095      0.2204 × 10^{-17}

This table illustrates that the total number of correct digits more than doubles on each iteration. In fact, the multiplying factor $C$ for the quadratic convergence appears to be approaching 0.3535. (The last error ratio is not meaningful in this sense, because only roughly 16 digits were carried in the computation.) Based on our analysis, the limiting value of $C$ should be about $1/(2\sqrt{2}) \approx 0.353553390593274$. (We explain how we computed the table at the end of this chapter.)
Example 1.8 As an example of linear convergence, consider the iteration
\[ x_{k+1} = x_k - \frac{x_k^2}{3.5} + \frac{2}{3.5}, \]
which converges to $\sqrt{2}$. Starting with $x_0 = 2$, we obtain the following table; after 19 iterations, $x_{19} = 1.414213562373097$.

  k    |x_k - √2|           |x_k - √2| / |x_{k-1} - √2|
  0    0.5858 × 10^0
  1    0.1436 × 10^{-1}     0.2451 × 10^{-1}
  2    0.2696 × 10^{-2}     0.1878
  3    0.5152 × 10^{-3}     0.1911
  4    0.9879 × 10^{-4}     0.1917
  5    0.1895 × 10^{-4}     0.1918
  6    0.3636 × 10^{-5}     0.1919
  7    0.6955 × 10^{-6}     0.1919
  8    0.1339 × 10^{-6}     0.1919
  ⋮    ⋮                    ⋮
  19   0.1554 × 10^{-14}

Here, the constant $C$ in the linear convergence, to four significant digits, appears to be $0.1919 \approx 1/5$. That is, the error is reduced by approximately a factor of 5 on each iteration. We can think of this as obtaining one more correct base-5 digit on each iteration.
1.2 Computer Arithmetic
In numerical solution of mathematical problems, two common types of error are:

1. Method (algorithm or truncation) error. This is the error due to approximations made in the numerical method.

2. Rounding error. This is the error made due to the finite number of digits available on a computer.
For example, by Taylor's theorem, $f(x+h) = f(x) + f'(x)h + \frac{h^2}{2} f''(\xi)$ for some $\xi$ between $x$ and $x+h$. Thus, $f'(x) \approx (f(x+h) - f(x))/h$, and the error is $O(h)$. We will call this the method error or truncation error, as opposed to roundoff errors due to using machine approximations.
Example 1.9 By the mean value theorem for integrals (Theorem 1.2, as in Example 1.6), if $f \in C^2[a,b]$, then
\[ \frac{f(x+h) - f(x)}{h} = f'(x) + \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt, \quad \text{and} \quad \left| \frac{1}{h} \int_x^{x+h} f''(t)(x + h - t)\,dt \right| \le ch. \]
Now consider $f(x) = \ln x$, and approximate $f'(3) \approx \frac{\ln(3+h) - \ln 3}{h}$ for $h$ small, using a calculator having 11 digits. The following results were obtained.
  h          (ln(3+h) − ln(3))/h    Error = 1/3 − (ln(3+h) − ln(3))/h
  10^{-1}    0.3278982              5.44 × 10^{-3}
  10^{-2}    0.332779               5.54 × 10^{-4}
  10^{-3}    0.3332778              5.55 × 10^{-5}
  10^{-4}    0.333328               5.33 × 10^{-6}
  10^{-5}    0.333330               3.33 × 10^{-6}
  10^{-6}    0.333300               3.33 × 10^{-5}
  10^{-7}    0.333                  3.33 × 10^{-4}
  10^{-8}    0.33                   3.33 × 10^{-3}
  10^{-9}    0.3                    3.33 × 10^{-2}
  10^{-10}   0.0                    3.33 × 10^{-1}

In the absence of roundoff, the error $\frac{1}{3} - \frac{\ln(3+h) - \ln(3)}{h}$ is $O(h)$.
One sees that, in the first four steps, the error decreases by a factor of 10 as $h$ is decreased by a factor of 10 (that is, the method error dominates). However, starting with $h = 10^{-5}$, the error increases (the error due to a finite number of digits, i.e., roundoff error, dominates). There are two possible ways to reduce rounding error:

1. The method error can be reduced by using a more accurate method. This allows a larger $h$ to be used, thus avoiding roundoff error. Consider
\[ f'(x) = \frac{f(x+h) - f(x-h)}{2h} + \{\text{error}\}, \quad \text{where } \{\text{error}\} \text{ is } O(h^2). \]

  h      (ln(3+h) − ln(3−h))/(2h)    error
  0.1    0.3334568                   1.24 × 10^{-4}
  0.01   0.3333345                   1.23 × 10^{-6}
  0.001  0.3333333                   1.91 × 10^{-8}
The error now decreases by a factor of 100 as $h$ is decreased by a factor of 10.

2. Rounding error can be reduced by using more digits of accuracy, such as using double precision (or multiple precision) arithmetic.

To fully understand and avoid roundoff error, we should study some details of how computers and calculators represent and work with approximate numbers.
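The following matlab fragment (our sketch, run in double precision rather than on the 11-digit calculator used above) reproduces both phenomena: each difference quotient improves as $h$ shrinks until roundoff takes over, and roundoff intrudes at a larger $h$ for the less accurate forward difference.

% Forward and central difference approximations to (ln(x))' = 1/3 at x = 3.
for h = 10.^(-(1:10))
    fwd = (log(3+h) - log(3))/h;        % O(h) truncation error
    ctr = (log(3+h) - log(3-h))/(2*h);  % O(h^2) truncation error
    fprintf('h = %.0e   fwd error = %.2e   ctr error = %.2e\n', ...
            h, abs(fwd - 1/3), abs(ctr - 1/3));
end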
1.2.1 Floating Point Arithmetic and Rounding Error
Let $\beta$ be a positive integer, the base of the computer system. (Usually $\beta = 2$ (binary) or $\beta = 16$ (hexadecimal).) Suppose a number $x$ has the exact base-$\beta$ representation
\[ x = \pm (0.d_1 d_2 d_3 \cdots d_t d_{t+1} \cdots)_\beta \times \beta^m = \pm q \times \beta^m, \]
where $q$ is the mantissa, $\beta$ is the base, $m$ is the exponent, $1 \le d_1 \le \beta - 1$, and $0 \le d_i \le \beta - 1$ for $i > 1$. On a computer, we are restricted to a finite set of floating-point numbers $F = F(\beta, t, L, U)$ of the form
\[ x^* = \pm (0.a_1 a_2 \cdots a_t)_\beta \times \beta^m, \]
where $1 \le a_1 \le \beta - 1$, $0 \le a_i \le \beta - 1$ for $2 \le i \le t$, $L \le m \le U$, and $t$ is the number of digits. (In most floating point systems, $L$ is about $-64$ to $-1000$ and $U$ is about 64 to 1000.)

Example 1.10 (binary) $\beta = 2$:
\[ x = (0.1011)_2 \times 2^3 = \left( 1 \cdot \frac{1}{2} + 0 \cdot \frac{1}{4} + 1 \cdot \frac{1}{8} + 1 \cdot \frac{1}{16} \right) \times 8 = \frac{11}{2} = 5.5 \text{ (decimal).} \]
REMARK 1.1 Most numbers cannot be exactly represented on a computer. Consider $x = 10.1 = (1010.0001\,1001\,1001\,\ldots)_2$, with the pattern 1001 repeating. If $L = -127$, $U = 127$, $t = 24$, and $\beta = 2$, then
\[ x \approx fl(x) = (0.1010\,0001\,1001\,1001\,1001\,1001)_2 \times 2^4. \]

Question: Given a real number $x$, how do we define a floating point number $fl(x)$ in $F$, such that $fl(x)$ is close to $x$? On modern machines, one of the following four ways is used to approximate a real number $x$ by a machine-representable number $fl(x)$.

round down: $fl(x)$ is the nearest machine representable number to the real number $x$ that is less than or equal to $x$.
round up: $fl(x)$ is the nearest machine number to the real number $x$ that is greater than or equal to $x$.

round to nearest: $fl(x)$ is the nearest machine number to the real number $x$.

round to zero, or chopping: $fl(x)$ is the nearest machine number to the real number $x$ that is closer to 0 than $x$. The term "chopping" is because we simply chop the expansion of the real number; that is, we simply ignore the digits in the expansion of $x$ beyond the $t$-th one.

The default on modern systems is usually round to nearest, although chopping is faster or requires less circuitry. Round down and round up may be used, but with care, to produce results from a string of computations that are guaranteed to be less than or greater than the exact result.

Example 1.11 $\beta = 10$, $t = 5$, $x = 0.12345666 \times 10^7$. Then $fl(x) = 0.12346 \times 10^7$ (round to nearest) and $fl(x) = 0.12345 \times 10^7$ (chopping). (In this case, round down corresponds to chopping and round up corresponds to round to nearest.)

See Figure 1.2 for an example with $\beta = 10$ and $t = 1$. In that figure, the exhibited floating point numbers are $0.1 \times 10^1, 0.2 \times 10^1, \ldots, 0.9 \times 10^1, 0.1 \times 10^2$; successive floating point numbers are separated by a distance $\beta^{m-t} = 10^0 = 1$.

[FIGURE 1.2: An example floating point system: $\beta = 10$, $t = 1$, and $m = 1$.]
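Such a toy decimal system is easy to emulate; the following matlab lines (a sketch of ours, not from the text) reproduce the two roundings of Example 1.11.

% Simulate fl(x) in a base-10, t-digit system (round to nearest and chopping).
t = 5;
x = 0.12345666e7;
m = floor(log10(abs(x))) + 1;          % exponent so the mantissa is in [0.1, 1)
scale = 10^(m - t);
fl_nearest = round(x/scale)*scale      % 0.12346e7, round to nearest
fl_chopped = fix(x/scale)*scale        % 0.12345e7, chopping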
Example 1.12 Let $a = 0.410$, $b = 0.000135$, and $c = 0.000431$. Assuming 3-digit decimal computer arithmetic with rounding to nearest, does $a + (b + c) = (a + b) + c$ when using this arithmetic?
Following the round-to-nearest definition of $fl$, we emulate the operations a machine would do, as follows: $a \leftrightarrow 0.410 \times 10^0$, $b \leftrightarrow 0.135 \times 10^{-3}$, $c \leftrightarrow 0.431 \times 10^{-3}$, and
\[ fl(b + c) = fl(0.135 \times 10^{-3} + 0.431 \times 10^{-3}) = fl(0.566 \times 10^{-3}) = 0.566 \times 10^{-3}, \]
so
\[ fl(a + 0.566 \times 10^{-3}) = fl(0.410 \times 10^0 + 0.000566 \times 10^0) = fl(0.410566 \times 10^0) = 0.411 \times 10^0. \]
On the other hand,
\[ fl(a + b) = fl(0.410 \times 10^0 + 0.000135 \times 10^0) = fl(0.410135 \times 10^0) = 0.410 \times 10^0, \]
so
\[ fl(0.410 \times 10^0 + c) = fl(0.410 \times 10^0 + 0.000431 \times 10^0) = fl(0.410431 \times 10^0) = 0.410 \times 10^0 \ne 0.411 \times 10^0. \]

Thus, the associative law does not hold for floating point arithmetic with round to nearest. Furthermore, this illustrates that accuracy is improved if numbers of like magnitude are added first in a sum.

The following error bound is useful in some analyses.

THEOREM 1.5
\[ \frac{|x - fl(x)|}{|x|} \le \frac{1}{2} \beta^{1-t} p, \]
where $p = 1$ for rounding and $p = 2$ for chopping.
DEFINITION 1.3 Let $\delta = \frac{fl(x) - x}{x}$. Then $fl(x) = (1 + \delta)x$, where $|\delta| \le \frac{p}{2}\beta^{1-t} = \epsilon$; the quantity $\epsilon$ is the unit roundoff. With this, we have the following.

THEOREM 1.6 Let $\odot$ denote the operation $+$, $-$, $\times$, or $\div$, and let $x$ and $y$ be machine numbers. Then
\[ fl(x \odot y) = (x \odot y)(1 + \delta), \quad \text{where } |\delta| \le \epsilon = \frac{p}{2} \beta^{1-t}. \]
Roundoff error that accumulates as a result of a sequence of arithmetic operations can be analyzed using this theorem. Such an analysis is called forward error analysis.

Because of the properties of floating point arithmetic, it is unreasonable to demand strict absolute tolerances when the exact result is large.

Example 1.13 Suppose $\beta = 10$ and $t = 3$ (3-digit decimal arithmetic), and suppose we wish to compute $10^4 \pi$ with a computed value $x$ such that $|10^4 \pi - x| < 10^{-2}$. The closest floating point number in our system to $10^4 \pi$ is $x = 0.314 \times 10^5 = 31400$. However, $|10^4 \pi - x| = 15.926\ldots$. Hence, it is impossible to find a number $x$ in the system with $|10^4 \pi - x| < 10^{-2}$.

The error $|10^4 \pi - x|$ in this example is called the absolute error in approximating $10^4 \pi$. We see that absolute error is not an appropriate measure of error when using floating point arithmetic. For this reason, we use relative error:
DEFINITION 1.4 Let $x^*$ be an approximation to $x$. Then $|x - x^*|$ is called the absolute error, and $\left| \frac{x - x^*}{x} \right|$ is called the relative error.
1.2.1.1 Common Sources of Large Roundoff Error
We now examine some common situations in which roundoff error can become large, and explain how to avoid many of these situations.
Example 1.14 $\beta = 10$, $t = 4$, $p = 1$. (Thus, $\epsilon = \frac{1}{2} \times 10^{-3} = 0.0005$.) Let $x = 0.5795 \times 10^5$ and $y = 0.6399 \times 10^5$. Then
\[ fl(x + y) = 0.1219 \times 10^6 = (x + y)(1 + \delta_1), \quad \delta_1 \approx -3.28 \times 10^{-4}, \]
and
\[ fl(xy) = 0.3708 \times 10^{10} = (xy)(1 + \delta_2), \quad \delta_2 \approx -5.95 \times 10^{-5}, \]
with $|\delta_1| < \epsilon$ and $|\delta_2| < \epsilon$. (Note: $x + y = 0.12194 \times 10^6$ and $xy = 0.37082205 \times 10^{10}$.)

Example 1.15 Suppose $\beta = 10$ and $t = 4$ (4-digit arithmetic), and suppose $x_1 = 10000$ and $x_2 = x_3 = \cdots = x_{1001} = 1$. Then
\[ fl(x_1 + x_2) = 10000, \quad fl(x_1 + x_2 + x_3) = 10000, \quad \ldots, \quad fl\left( \sum_{i=1}^{1001} x_i \right) = 10000 \]
when we sum forward from $x_1$. But going backwards,
\[ fl(x_{1001} + x_{1000}) = 2, \quad fl(x_{1001} + x_{1000} + x_{999}) = 3, \quad \ldots, \quad fl\left( \sum_{i=1001}^{1} x_i \right) = 11000, \]
which is the correct sum.

This example illustrates the point that large relative errors can occur when many small numbers are added to a large number, or when a very large number of small, almost-equal numbers is added. To avoid such large relative errors, one can sum from the smallest number to the largest number. However, this will not work if the numbers are all approximately equal. In such cases, one possibility is to group the numbers into sets of two adjacent numbers, summing two almost equal numbers together. One then groups those results into sets of two and sums these together, continuing until the total sum is reached. In this scheme, two almost equal numbers are always being summed, and the large relative error from repeatedly summing a small number to a large number is avoided.
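The same effect is easy to see in IEEE double precision; the following matlab lines (our illustration, with magnitudes chosen so that the ones fall below the spacing of the doubles near $2^{53}$) mimic Example 1.15:

% Forward vs. backward summation of [2^53, 1, 1, ..., 1] (1000 ones).
x = [2^53, ones(1, 1000)];
forward = 0;
for k = 1:length(x), forward = forward + x(k); end
backward = 0;
for k = length(x):-1:1, backward = backward + x(k); end
forward - 2^53    % 0: every added 1 was rounded away
backward - 2^53   % 1000: the correct sum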
Example 1.16 $x_1 = 15.314768$, $x_2 = 15.314899$, $\beta = 10$, $t = 6$ (6-digit decimal accuracy). Then
\[ x_2 - x_1 \approx fl(x_2) - fl(x_1) = 15.3149 - 15.3148 = 0.0001. \]
Thus,
\[ \frac{|(x_2 - x_1) - (fl(x_2) - fl(x_1))|}{|x_2 - x_1|} = \frac{|0.000131 - 0.0001|}{0.000131} \approx 0.237 = 23.7\% \text{ relative error.} \]
This example illustrates that large relative errors can occur when two nearly equal numbers are subtracted on a computer. Sometimes, an algorithm can be modified to reduce rounding error occurring from this source, as the following example illustrates.

Example 1.17 Consider finding the roots of $ax^2 + bx + c = 0$, where $b^2$ is large compared with $|4ac|$. The most common formula for the roots is
\[ x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}. \]
Consider $x^2 + 100x + 1 = 0$, $\beta = 10$, $t = 4$, $p = 2$, and 4-digit chopped arithmetic. Then
\[ x_1 = \frac{-100 + \sqrt{9996}}{2}, \quad x_2 = \frac{-100 - \sqrt{9996}}{2}, \]
but $\sqrt{9996} \approx 99.97$ (4-digit arithmetic, chopped). Thus,
\[ x_1 \approx \frac{-100 + 99.97}{2}, \quad x_2 \approx \frac{-100 - 99.97}{2}. \]
Hence, $x_1 \approx -0.015$ and $x_2 \approx -99.98$, but $x_1 = -0.010001\ldots$ and $x_2 = -99.989999\ldots$, so the relative errors in $x_1$ and $x_2$ are 50% and 0.01%, respectively.

Let's change the algorithm. Assume $b \ge 0$ (we can always arrange $b \ge 0$). Then
\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \cdot \frac{b + \sqrt{b^2 - 4ac}}{b + \sqrt{b^2 - 4ac}} = \frac{-4ac}{2a(b + \sqrt{b^2 - 4ac})} = \frac{-2c}{b + \sqrt{b^2 - 4ac}}, \]
and
\[ x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \quad \text{(the same as before).} \]
Then, for the above values,
\[ x_1 = \frac{-2(1)}{100 + \sqrt{9996}} \approx \frac{-2}{100 + 99.97} = -0.0100. \]
Now, the relative error in $x_1$ is also 0.01%.
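The same cancellation shows up in double precision once $b^2/|4ac|$ is large enough; here is a small matlab comparison (our sketch, with illustrative coefficients, not values from the text):

% Naive vs. rewritten quadratic formula for x^2 + 1e8*x + 1 = 0.
a = 1; b = 1e8; c = 1;
d = sqrt(b^2 - 4*a*c);
x1_naive = (-b + d)/(2*a)    % loses most digits to cancellation
x1_better = -2*c/(b + d)     % accurate: approximately -1e-08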
Let us now consider error in function evaluation. Consider a single-valued function $f(x)$, and let $x^* = fl(x)$ be the floating point approximation of $x$. The machine therefore evaluates $f(x^*) = f(fl(x))$, which is an approximate value of $f(x)$ at $x = x^*$. The perturbation in $f(x)$ for small perturbations in $x$ can be computed via Taylor's formula. This is illustrated in the next theorem.

THEOREM 1.7 The relative error in functional evaluation satisfies
\[ \frac{f(x) - f(x^*)}{f(x)} \approx \frac{x f'(x)}{f(x)} \cdot \frac{x - x^*}{x}. \]

PROOF The linear Taylor approximation of $f(x^*)$ about $f(x)$ for small values of $|x - x^*|$ is given by
\[ f(x^*) \approx f(x) + f'(x)(x^* - x). \]
Rearranging the terms immediately yields the result.

This leads us to the following definition.

DEFINITION 1.5 The condition number of a function $f(x)$ is
\[ \kappa_f(x) := \left| \frac{x f'(x)}{f(x)} \right|. \]
The condition number describes how large the relative error in function evaluation is with respect to the relative error in the machine representation of $x$. In other words, $\kappa_f(x)$ is a measure of the degree of sensitivity of the function at $x$.

Example 1.18 Let $f(x) = \sqrt{x}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x \cdot \frac{1}{2\sqrt{x}}}{\sqrt{x}} \right| = \frac{1}{2}. \]

Example 1.19 Let $f(x) = \sqrt{x - 2}$. The condition number of $f(x)$ about $x$ is
\[ \kappa_f(x) = \left| \frac{x}{2(x - 2)} \right|. \]
This is not defined at $x = 2$. Hence the function $f(x)$ is numerically unstable and ill-conditioned for values of $x$ close to 2.

REMARK 1.2 If $x = f(x) = 0$, then the condition number is simply $|f'(x)|$. If $x = 0$ and $f(x) \ne 0$ (or $f(x) = 0$ and $x \ne 0$), then it is more useful to consider the relation between absolute errors than relative errors. The condition number then becomes $|f'(x)/f(x)|$.

REMARK 1.3 Generally, if a numerical approximation $z^*$ to a quantity $z$ is computed, the relative error is related to the number of significant digits that are correct. For example, if $z = 0.0000123453$ and $z^* = 0.00001234543$, we say that $z^*$ is correct to 5 significant digits. Expressing $z$ as $0.123453 \times 10^{-4}$ and $z^*$ as $0.1234543 \times 10^{-4}$, we see that if we round $z^*$ to the nearest number with five digits in its mantissa, all of those digits are correct, whereas, if we do the same with six digits, the sixth digit is not correct. "Significant digits" is the more logical way to talk about accuracy in a floating point computation, where we are interested in relative error, rather than "number of digits after the decimal point," which can have a different meaning. (Here, one might say that $z^*$ is correct to 9 digits after the decimal point.)
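Returning to Example 1.19, a one-line matlab check (our illustration) shows the condition number growing without bound as $x$ approaches 2:

% kappa(x) = |x/(2*(x-2))| for f(x) = sqrt(x-2), evaluated near x = 2.
x = 2 + 10.^(-(1:4));
kappa = abs(x./(2*(x - 2)))   % roughly 1e1, 1e2, 1e3, 1e4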
1.2.2 The IEEE Floating Point Standard
Prior to 1985, different machines used different word lengths and different bases, and different machines rounded, chopped, or did something else to form the internal representation $fl(x)$ for real numbers $x$. For example, IBM mainframes generally used hexadecimal arithmetic ($\beta = 16$), with 8 hexadecimal digits total (for the base, sign, and exponent) in single precision numbers and 16 hexadecimal digits total in double precision numbers. Machines such as the Univac 1108 and Honeywell Multics systems used base $\beta = 2$ and 36 binary digits (or bits) total in single precision numbers and 72 binary digits total in double precision numbers. An unusual machine designed at Moscow State University from 1955-1965, the Setun, even used base-3 ($\beta = 3$, or ternary) numbers. Some computers had 32 bits total in single precision numbers and 64 bits total in double precision numbers, while some supercomputers (such as the Cray-1) had 64 bits total in single precision numbers and 128 bits total in double precision numbers. Some hand-held calculators in existence today (such as some Texas Instruments calculators) can be viewed as implementing decimal (base 10, $\beta = 10$) arithmetic, say, with $L = -999$ and $U = 999$, and $t = 14$ digits in the mantissa.

Except for the Setun (the value of whose ternary digits corresponded to "positive," "negative," and "neutral" in circuit elements or switches), digital computers are mostly based on binary switches or circuit elements (that is, "on" or "off"), so the base is usually 2 or a power of 2. For example, the IBM hexadecimal digit could be viewed as a group of 4 binary digits².

Older floating point implementations did not even always fit exactly into the model we have previously described. For example, if $x$ is a number in the system, then $-x$ may not have been a number in the system, or, if $x$ were a number in the system, then $1/x$ may have been too large to be representable in the system.

To promote predictability, portability, reliability, and rigorous error bounding in floating point computations, the Institute of Electrical and Electronics Engineers (IEEE) and American National Standards Institute (ANSI) published a standard for binary floating point arithmetic in 1985: IEEE/ANSI 754-1985: Standard for Binary Floating Point Arithmetic, often referenced as IEEE-754, or simply "the IEEE standard"³. Almost all computers in existence today, including personal computers and workstations based on Intel, AMD, Motorola, etc., chips, implement most of the IEEE standard.

In this standard, $\beta = 2$; 32 bits total are used in a single precision number (an "IEEE single"), and 64 bits total are used for a double precision number ("IEEE double"). In a single precision number, 1 bit is used for the sign, 8 bits are used for the exponent, and $t = 23$ bits are used for the mantissa. In double precision numbers, 1 bit is used for the sign, 11 bits are used for the exponent, and 52 bits are used for the mantissa. Thus, for single precision numbers, the stored exponent is between 0 and $(11111111)_2 = 255$, and 127 is subtracted from this, to get an exponent between $-127$ and 128. In IEEE numbers, the minimum and maximum exponents are used to denote special symbols (such as infinity and unnormalized numbers), so the exponent in single precision represents magnitudes between $2^{-126} \approx 10^{-38}$ and $2^{127} \approx 10^{38}$. The mantissa for single precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{23} 2^{-i} = 2(1 - 2^{-24}) \approx 2$. Similarly, the exponent for double precision numbers is, effectively, between $2^{-1022} \approx 10^{-308}$ and $2^{1023} \approx 10^{308}$, while the mantissa for double precision numbers represents numbers between $2^0 = 1$ and $\sum_{i=0}^{52} 2^{-i} \approx 2$.

Summarizing, the parameters for IEEE arithmetic appear in Table 1.1.
² An exception is in some systems for business calculations, where base 10 is implemented.
³ An update to the 1985 standard was made in 2008. This update gives clarifications of certain ambiguous points, provides certain extensions, and specifies a standard for decimal arithmetic.
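In matlab, one can peek at the raw bit patterns just described; for instance (our illustration), format hex displays the sign, biased exponent, and mantissa fields of an IEEE double:

% Hexadecimal view of IEEE double representations.
format hex
one = 1      % 3ff0000000000000: sign 0, biased exponent 1023, fraction 0
tenth = 0.1  % 3fb999999999999a: rounded infinite binary expansion of 1/10
format short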
TABLE 1.1: Parameters for IEEE arithmetic (here $t$ counts the implicit leading bit of the mantissa).

  precision   β   t    L       U
  single      2   24   -126    127
  double      2   53   -1022   1023
In many numerical computations, such as solving the large linear systems arising from partial differential equation models, more digits or a larger exponent range is required than is available with IEEE single precision. For this reason, many numerical analysts at present have adopted IEEE double precision as the default precision. For example, underlying computations in the popular computational environment matlab are done in IEEE double precision.

IEEE arithmetic provides four ways of defining $fl(x)$, that is, four rounding modes: round down, round up, round to nearest, and round to zero. The four elementary operations $+$, $-$, $\times$, and $\div$ must be such that $fl(x \odot y)$ is implemented for all four rounding modes, for $\odot \in \{+, -, \times, \div\}$. The default mode (if the rounding mode is not explicitly set) is normally round to nearest, to give an approximation after a long string of computations that is hopefully near the exact value. If the mode is set to round down and a string of computations is done, then the result is less than or equal to the exact result. Similarly, if the mode is set to round up, then the result of a string of computations is greater than or equal to the exact result. In this way, mathematically rigorous bounds on an exact result can be obtained. (This technique must be used astutely, since naive use could result in bounds that are too large to be meaningful.)

Several parameters more directly related to numerical computations than $L$, $U$, and $t$ are associated with any floating point number system. These are:

HUGE: the largest representable number in the floating point system;

TINY: the smallest positive representable number in the floating point system;

$\epsilon_m$: the machine epsilon, the smallest positive number which, when added to 1, gives something other than 1 when using the rounding mode "round to nearest."

These so-called "machine constants" appear in Table 1.2 for the IEEE single and IEEE double precision number systems.
TABLE 1.2: Machine constants for IEEE arithmetic.

  Precision   HUGE                          TINY                            ε_m
  single      2^{127} ≈ 3.40 × 10^{38}      2^{-126} ≈ 1.18 × 10^{-38}      2^{-24} + 2^{-45} ≈ 5.96 × 10^{-8}
  double      2^{1023} ≈ 1.79 × 10^{308}    2^{-1022} ≈ 2.23 × 10^{-308}    2^{-53} + 2^{-105} ≈ 1.11 × 10^{-16}
For IEEE arithmetic, $1/\text{TINY} < \text{HUGE}$, but $1/\text{HUGE} < \text{TINY}$. This brings up the question of what happens when the result of a computation has absolute value less than the smallest number representable in the system, or has absolute value greater than the largest number representable in the system. In the first case, an underflow occurs, while, in the second case, an overflow occurs. In floating point computations, it is usually (but not always) reasonable to replace the result of an underflow by 0, but it is usually more problematical when an overflow occurs. Many systems prior to the IEEE standard replaced an underflow by 0 but stopped when an overflow occurred.

The IEEE standard specifies representations for the special numbers $\infty$, $-\infty$, $+0$, $-0$, and NaN, where the latter represents "not a number." The standard specifies that computations do not stop when an overflow or underflow occurs, or when quantities such as $\sqrt{-1}$, $1/0$, or $-1/0$ are encountered (although many programming languages by default or optionally do stop). For example, the result of an overflow is set to $\infty$, whereas the result of $\sqrt{-1}$ is set to NaN, and computation continues. The standard also specifies "gradual underflow," that is, setting the result to a denormalized number, or a number in the floating point format whose first digit in the mantissa is equal to 0. Computation rules for these special numbers, such as NaN $\odot$ (any number) $=$ NaN and $\infty +$ (any positive normalized number) $= \infty$, allow such "nonstop" arithmetic.

Although the IEEE nonstop arithmetic is useful in many contexts, the numerical analyst should be aware of it and be cautious in interpreting results. In particular, algorithms may not behave as expected if many intermediate results contain $\infty$ or NaN, and the accuracy is less than expected when denormalized numbers are used. In fact, many programming languages, by default or with a controllable option, stop if $\infty$ or NaN occurs, but implement IEEE nonstop arithmetic with an option.
Example 1.20 IEEE double precision floating point arithmetic underlies most computations in matlab. (This is true even if only four decimal digits are displayed.) One obtains the machine epsilon with the function eps, one obtains TINY with the function realmin, and one obtains HUGE with the function realmax. Observe the following matlab dialog:

>> epsm = eps(1d0)
epsm = 2.2204e-016
>> TINY = realmin
TINY = 2.2251e-308
>> HUGE = realmax
HUGE = 1.7977e+308
>> 1/TINY
ans = 4.4942e+307
>> 1/HUGE
ans = 5.5627e-309
>> HUGE^2
ans = Inf
>> TINY^2
ans = 0
>> new_val = 1+epsm
new_val = 1.0000
>> new_val - 1
ans = 2.2204e-016
>> too_small = epsm/2
too_small = 1.1102e-016
>> not_new = 1+too_small
not_new = 1
>> not_new - 1
ans = 0
>>
Example 1.21 (Illustration of underflow and overflow) Suppose, for the purposes of illustration, we have a system with $\beta = 10$, $t = 2$, and one digit in the exponent, so that the positive numbers in the system range from $0.10 \times 10^{-9}$ to $0.99 \times 10^{9}$, and suppose we wish to compute $N = \sqrt{x_1^2 + x_2^2}$, where $x_1 = x_2 = 10^6$. Then both $x_1$ and $x_2$ are exactly represented in the system, and the nearest floating point number in the system to $N$ is $0.14 \times 10^7$, well within range. However, $x_1^2 = 10^{12}$, larger than the maximum floating point number in the system. In older systems, an overflow usually would result in stopping the computation, while in IEEE arithmetic, the result would be assigned the symbol "Infinity." The result of adding Infinity to Infinity, then taking the square root, would be Infinity, so that $N$ would be assigned Infinity. Similarly, if $x_1 = x_2 = 10^{-6}$, then $x_1^2 = 10^{-12}$, smaller than the smallest representable machine number, causing an underflow. On older systems, the result is usually set to 0. On IEEE systems, if gradual underflow is switched on, the result either becomes a denormalized number, with less than full accuracy, or is set to 0; without gradual underflow on IEEE systems, the result is set to 0. When the result is set to 0, a value of 0 is stored in $N$, whereas the closest floating point number in the system is $0.14 \times 10^{-5}$, well within range. To avoid this type of catastrophic underflow and overflow in the computation of $N$, we may use the following scheme:

1. $s \leftarrow \max\{|x_1|, |x_2|\}$.
2. $\eta_1 \leftarrow x_1 / s$; $\quad \eta_2 \leftarrow x_2 / s$.
3. $N \leftarrow s \sqrt{\eta_1^2 + \eta_2^2}$.
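A matlab version of this scaling scheme might look as follows (our sketch; matlab's built-in hypot performs essentially this computation):

function N = scaled_norm(x1, x2)
% Compute sqrt(x1^2 + x2^2) while avoiding intermediate overflow and
% underflow, following the scaling scheme of Example 1.21.
s = max(abs(x1), abs(x2));
if s == 0
    N = 0;                      % both arguments are zero
    return
end
eta1 = x1/s;  eta2 = x2/s;      % now |eta1|, |eta2| <= 1
N = s*sqrt(eta1^2 + eta2^2);
end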
For examining the output of large numerical computations arising from mathematical models, plots, graphs, and movies comprised of such plots and graphs are often preferred over tables of values. However, to develop such models and study numerical algorithms, it is necessary to examine individual numbers. Because humans are trained to comprehend decimal numbers more easily than binary numbers, the binary format used in the machine is usually converted to a decimal format for display or printing. In many programming languages and environments (such as all versions of Fortran, C, C++, and in matlab), the format is of a form similar to
\[ \pm d_1.d_2 d_3 \ldots d_m \mathrm{e} {\pm} \delta_1 \delta_2 \delta_3 \quad \text{or} \quad \pm d_1.d_2 d_3 \ldots d_m \mathrm{E} {\pm} \delta_1 \delta_2 \delta_3, \]
where the "e" or "E" denotes the exponent of 10. For example, -1.00e+003 denotes $-1 \times 10^3 = -1000$. Numbers are usually also input either in a standard decimal form (such as 0.001) or in this exponential format (such as 1.0e-3). (This notation originates from the earliest computers, where the only output was a printer, and the printer could only print numerical digits and the 26 upper case letters in the Roman alphabet.)

Thus, for input, a decimal fraction needs to be converted to a binary floating point number, while, for output, a binary floating point number needs to be converted to a decimal fraction. This conversion necessarily is inexact. For example, the exact decimal fraction 0.1 converts to the infinitely repeating binary expansion $(0.0\overline{0011})_2$, which needs to be rounded into the binary floating point system.

The IEEE 754 standard specifies that the result of a decimal-to-binary conversion, within a specified range of input formats, be the nearest floating point number to the exact result, over a specified range, and that, within a specified range of formats, a binary-to-decimal conversion be the nearest number in the specified format (which depends on the number $m$ of decimal digits requested to be printed). Thus, the number that one sees as output is usually not exactly the number that is represented in the computer. Furthermore, while the floating point operations on binary numbers are usually implemented in hardware or "firmware" independently of the software system, the decimal-to-binary and binary-to-decimal conversions are usually implemented separately as part of the programming language (such as Fortran, C, C++, Java, etc.) or software system (such as matlab). The individual standards for these languages, if there are any, may not specify accuracy for such conversions, and the languages sometimes do not conform to the IEEE standard. That is, the number that one sees printed may not even be the closest number in that format to the actual number. This inexactness in conversion usually does not cause a problem, but may cause much confusion in certain instances. In those instances (such as in debugging, or finding programming blunders), one may need to examine the binary numbers directly. One way of doing this is in an octal, or base-8, format, in which each digit (between 0 and 7) is interpreted as a group of
three binary digits, or in a hexadecimal format (where the digits are 0-9, A, B, C, D, E, F), in which each digit corresponds to a group of four binary digits.

1.2.2.2 Standard Functions
To enable accurate computation of elementary functions such as sin, cos, and exp, IEEE 754 specifies that a long (80-bit) register (with "guard digits") be available for intermediate computations. Furthermore, IEEE 754-2008, an official update to IEEE 754-1985, provides a list of functions it recommends be implemented, and specifies accuracy requirements (in terms of correct rounding) for those functions a programming language elects to implement.

REMARK 1.4 Alternative number systems, such as variable precision arithmetic, multiple precision arithmetic, rational arithmetic, and combinations of approximate and symbolic arithmetic, have been investigated and implemented. These have various advantages over the traditional floating point arithmetic we have been discussing, but also have disadvantages, and usually require more time, more circuitry, or both. Eventually, with the advance of computer hardware and better understanding of these alternative systems, their use may become more ubiquitous. However, for the foreseeable future, traditional floating point number systems will be the primary tool in numerical computations.
1.3 Interval Computations
Interval computations are useful for two main purposes:

- to use floating point computations to compute mathematically rigorous bounds on an exact result (and hence to rigorously bound roundoff error);

- to use floating point computations to compute mathematically rigorous bounds on the ranges of functions over boxes.

In complicated traditional floating point algorithms, naive arrangement of interval computations usually gives bounds that are too wide to be of practical use. For this reason, interval computations have been ignored by many. However, used cleverly and where appropriate, interval computations are powerful, and provide rigor and validation when other techniques cannot.

Interval computations are based on interval arithmetic.
1.3.1 Interval Arithmetic
In interval arithmetic, we define operations on intervals, which can be considered as ordered pairs of real numbers. We can think of each interval as representing the range of possible values of a quantity. The result of an operation is then an interval that represents the range of possible results of the operation, as the first argument ranges over all points in the first interval and the second argument ranges over all values in the second interval. To state this symbolically, let $\boldsymbol{x} = [\underline{x}, \overline{x}]$ and $\boldsymbol{y} = [\underline{y}, \overline{y}]$, and define the four elementary operations by
\[ \boldsymbol{x} \odot \boldsymbol{y} = \{ x \odot y \mid x \in \boldsymbol{x} \text{ and } y \in \boldsymbol{y} \} \quad \text{for } \odot \in \{+, -, \times, \div\}. \tag{1.1} \]
Interval arithmetic's usefulness derives from the fact that the mathematical characterization in Equation (1.1) is equivalent to the following operational definitions:
\[
\begin{aligned}
\boldsymbol{x} + \boldsymbol{y} &= [\underline{x} + \underline{y},\; \overline{x} + \overline{y}], \\
\boldsymbol{x} - \boldsymbol{y} &= [\underline{x} - \overline{y},\; \overline{x} - \underline{y}], \\
\boldsymbol{x} \times \boldsymbol{y} &= [\min\{\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y}\},\; \max\{\underline{x}\underline{y}, \underline{x}\overline{y}, \overline{x}\underline{y}, \overline{x}\overline{y}\}], \\
\frac{1}{\boldsymbol{x}} &= \left[ \frac{1}{\overline{x}},\; \frac{1}{\underline{x}} \right] \quad \text{if } \underline{x} > 0 \text{ or } \overline{x} < 0, \\
\boldsymbol{x} \div \boldsymbol{y} &= \boldsymbol{x} \times \frac{1}{\boldsymbol{y}}.
\end{aligned} \tag{1.2}
\]
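Ignoring outward rounding for the moment, the multiplication rule in (1.2) translates directly into matlab (our sketch; the intlab toolbox mentioned later does this with correct rounding):

% Interval multiplication per (1.2), for x = [-2, 2] and y = [-3, 1].
xl = -2; xu = 2;  yl = -3; yu = 1;
p = [xl*yl, xl*yu, xu*yl, xu*yu];   % all endpoint products
z = [min(p), max(p)]                % encloses {x*y : x in x, y in y} = [-6, 6]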
The ranges of the four elementary interval arithmetic operations are exactly the ranges of the corresponding real operations, but, if such operations are composed, bounds on the ranges of real functions can be obtained. For example, if
\[ f(x) = (x + 1)(x - 1), \tag{1.3} \]
then
\[ f([-2, 2]) = ([-2, 2] + 1)([-2, 2] - 1) = [-1, 3] \times [-3, 1] = [-9, 3], \]
which contains, but is wider than, the exact range $[-1, 3]$ of $f$ over $[-2, 2]$.
REMARK 1.5 In some definitions of interval arithmetic, division by intervals containing 0 is defined, consistent with (1.1). For example,
\[ \frac{[1, 2]}{[-3, 4]} = \left[ -\infty, -\frac{1}{3} \right] \cup \left[ \frac{1}{4}, \infty \right] \subseteq \mathbb{R}^*, \]
where $\mathbb{R}^*$ is the extended real number system, consisting of the real numbers with the two additional numbers $-\infty$ and $\infty$. This "extended" interval arithmetic⁵ was originally invented by William Kahan⁶ for computations with continued fractions, but has wider use than that. Although a closed system can be defined for the sets arising from this extended arithmetic, typically the complements of intervals (i.e., the unions of two semi-infinite intervals) are immediately intersected with intervals, to obtain zero, one, or two intervals. Interval arithmetic can then proceed using (1.2).

The power of interval arithmetic lies in its implementation on computers. In particular, outwardly rounded interval arithmetic allows rigorous enclosures for the ranges of operations and functions. This makes a qualitative difference in scientific computations, since the results are now intervals in which the exact result must lie. It also enables the use of floating point computations for automated theorem proving.

Outward rounding can be implemented on any machine that has downward rounding and upward rounding, such as any machine that complies with the IEEE 754 standard. For example, take $\boldsymbol{x} + \boldsymbol{y} = [\underline{x} + \underline{y}, \overline{x} + \overline{y}]$. If $\underline{x} + \underline{y}$ is computed with downward rounding, and $\overline{x} + \overline{y}$ is computed with upward rounding, then the resulting interval $\boldsymbol{z} = [\underline{z}, \overline{z}]$ that is represented in the machine must contain the exact range of $x + y$ for $x \in \boldsymbol{x}$ and $y \in \boldsymbol{y}$. We call the expansion of the interval from rounding the lower end point down and the upper end point up roundout error.

Interval arithmetic is only subdistributive. That is, if $\boldsymbol{x}$, $\boldsymbol{y}$, and $\boldsymbol{z}$ are intervals, then
\[ \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \subseteq \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z}, \quad \text{but } \boldsymbol{x}(\boldsymbol{y} + \boldsymbol{z}) \ne \boldsymbol{x}\boldsymbol{y} + \boldsymbol{x}\boldsymbol{z} \text{ in general.} \tag{1.4} \]
As a result, algebraic expressions that would be equivalent if real values were substituted for the variables are not equivalent if interval values are used. For example, instead of writing $(x - 1)(x + 1)$ for $f(x)$ in (1.3), suppose we write
\[ f(x) = x^2 - 1, \tag{1.5} \]
and suppose we provide a routine that computes an enclosure for the range of $x^2$ that is the exact range to within roundoff error. Such a routine could be as follows:

ALGORITHM 1.1 (Computing an interval whose end points are machine numbers and which encloses the range of $x^2$.)
⁵ There are small differences in current definitions of extended interval arithmetic. For example, in some systems, $-\infty$ and $\infty$ are not considered numbers, but just descriptive symbols. In those systems, $[1,2]/[-3,4] = (-\infty, -1/3] \cup [1/4, \infty) = \mathbb{R} \setminus (-1/3, 1/4)$. See [31] for a theoretical analysis of extended arithmetic.
⁶ who also was a major contributor to the IEEE 754 standard
INPUT: $\boldsymbol{x} = [\underline{x}, \overline{x}]$.
OUTPUT: a machine-representable interval that contains the range of $x^2$ over $\boldsymbol{x}$.

IF $\underline{x} \ge 0$ THEN
  RETURN $[\underline{x}^2, \overline{x}^2]$, where $\underline{x}^2$ is computed with downward rounding and $\overline{x}^2$ is computed with upward rounding.
ELSE IF $\overline{x} \le 0$ THEN
  RETURN $[\overline{x}^2, \underline{x}^2]$, where $\overline{x}^2$ is computed with downward rounding and $\underline{x}^2$ is computed with upward rounding.
ELSE
  1. Compute $\underline{x}^2$ and $\overline{x}^2$ with both downward and upward rounding; that is, compute machine representable numbers $(\underline{x}^2)_l$ and $(\underline{x}^2)_u$ such that $\underline{x}^2 \in [(\underline{x}^2)_l, (\underline{x}^2)_u]$, and machine representable numbers $(\overline{x}^2)_l$ and $(\overline{x}^2)_u$ such that $\overline{x}^2 \in [(\overline{x}^2)_l, (\overline{x}^2)_u]$.
  2. RETURN $[0, \max\{(\underline{x}^2)_u, (\overline{x}^2)_u\}]$.
END IF
END ALGORITHM 1.1.

With Algorithm 1.1 and rewriting $f(x)$ from (1.3) as in (1.5), we obtain
\[ f([-2, 2]) = [-2, 2]^2 - 1 = [0, 4] - 1 = [-1, 3], \]
which, in this case, is equal to the exact range of $f$ over $[-2, 2]$. In fact, this illustrates a general principle: If each variable in the expression occurs only once, then interval arithmetic gives the exact range, to within roundout error. We state this formally as

THEOREM 1.8 (Fundamental theorem of interval arithmetic) Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic. The algebraic expression or computer program can contain the four elementary operations and operations such as $x^n$, $\sin(x)$, $\exp(x)$, and $\log(x)$, etc., as long as the interval values of these functions contain their ranges over the input intervals. Then:

1. The interval value $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ contains the range of $f$ over the interval vector (or "box") $(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$.
2. If the single functions (the elementary operations and functions $x^n$, etc.) have interval values that represent their exact ranges, and if each variable $x_i$, $1 \le i \le n$, occurs only once in the expression for $f$, then the values of $f$ obtained by interval arithmetic represent the exact ranges of $f$ over the input intervals.
If the expression for $f$ contains one or more variables more than once, then overestimation of the range can occur due to interval dependency. For example, when we evaluate our example function $f([-2, 2])$ according to (1.3), the first factor, $[-1, 3]$, is the exact range of $x + 1$ for $x \in [-2, 2]$, while the second factor, $[-3, 1]$, is the exact range of $x - 1$ for $x \in [-2, 2]$. Thus, $[-9, 3]$ is the exact range of $\tilde{f}(x_1, x_2) = (x_1 + 1)(x_2 - 1)$ for $x_1$ and $x_2$ independent, $x_1 \in [-2, 2]$, $x_2 \in [-2, 2]$.

We now present some definitions and theorems to clarify the practical consequences of interval dependency.

DEFINITION 1.6 An expression for $f(x_1, \ldots, x_n)$ which is written so that each variable occurs only once is called a single use expression, or SUE.

Fortunately, we do not need to transform every expression into a single use expression for interval computations to be of value. In particular, the interval dependency becomes less as the widths of the input intervals become smaller. The following formal definition will help us to describe this precisely.

DEFINITION 1.7 Suppose an interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ gives $[a, b]$ as a result interval, but the exact range $\{ f(x_1, \ldots, x_n) \mid x_i \in \boldsymbol{x}_i,\ 1 \le i \le n \}$ is $[c, d] \subseteq [a, b]$. We define the excess width $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ in the interval evaluation $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ by
\[ E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = (c - a) + (b - d). \]

For example, the excess width in evaluating $f(x)$ represented as $(x + 1)(x - 1)$ over $\boldsymbol{x} = [-2, 2]$ is $(-1 - (-9)) + (3 - 3) = 8$. In general, we have:

THEOREM 1.9 Suppose $f(x_1, x_2, \ldots, x_n)$ is an algebraic expression in the variables $x_1$ through $x_n$ (or a computer program with inputs $x_1$ through $x_n$), and suppose that this expression is evaluated with interval arithmetic, as in Theorem 1.8, to obtain an interval enclosure $f(\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ of the range of $f$ for $x_i \in \boldsymbol{x}_i$, $1 \le i \le n$. Then, if $E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n)$ is as in Definition 1.7, we have
\[ E(f; \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = O\left( \max_{1 \le i \le n} w(\boldsymbol{x}_i) \right), \]
where $w(\boldsymbol{x}) = \overline{x} - \underline{x}$ denotes the width of the interval $\boldsymbol{x}$.
That is, the overestimation becomes less as the uncertainty in the arguments to the function becomes smaller. Interval evaluations as in Theorem 1.9 are termed first-order interval extensions. It is not difficult to obtain second-order extensions, where required. (See the exercises below.)
1.3.2 Application of Interval Arithmetic
We give one such example here.

Example 1.22 Using 4-digit decimal floating point arithmetic, compute an interval enclosure for the first two digits of $e$, and prove that these two digits are correct.

Solution: The fifth-degree Taylor polynomial representation for $e$ is
\[ e = 1 + 1 + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \frac{1}{5!} + \frac{1}{6!} e^{\xi}, \]
for some $\xi \in [0, 1]$. If we assume we know $e < 3$ and we assume we know $e^x$ is an increasing function of $x$, then the error term is bounded by
\[ \frac{1}{6!} e^{\xi} \le \frac{3}{6!} < 0.005, \]
so this fifth-degree polynomial representation should be adequate. We will evaluate each term with interval arithmetic, and we will replace $e^{\xi}$ with $[1, 3]$. We obtain the following computation:

[1.000, 1.000] + [1.000, 1.000] ⊆ [2.000, 2.000]
[1.000, 1.000] / [2.000, 2.000] ⊆ [0.5000, 0.5000]
[2.000, 2.000] + [0.5000, 0.5000] ⊆ [2.500, 2.500]
[1.000, 1.000] / [6.000, 6.000] ⊆ [0.1666, 0.1667]
[2.500, 2.500] + [0.1666, 0.1667] ⊆ [2.666, 2.667]
[1.000, 1.000] / [24.00, 24.00] ⊆ [0.04166, 0.04167]
[2.666, 2.667] + [0.04166, 0.04167] ⊆ [2.707, 2.709]
[1.000, 1.000] / [120.0, 120.0] ⊆ [0.008333, 0.008334]
[2.707, 2.709] + [0.008333, 0.008334] ⊆ [2.715, 2.718]
[1.000, 1.000] / [720.0, 720.0] ⊆ [0.001388, 0.001389]
[0.001388, 0.001389] × [1, 3] ⊆ [0.001388, 0.004167]
[2.715, 2.718] + [0.001388, 0.004167] ⊆ [2.716, 2.723]
Since we used outward rounding in these computations, this constitutes a mathematical proof that $e \in [2.716, 2.723]$. Notes:
1. These computations can be done automatically on a computer, as simply as evaluating the function in floating point arithmetic. We will explain some programming techniques for this in Chapter 6, Section 6.2.

2. The solution is illustrative. More sophisticated methods, such as argument reduction, would be used in practice to bound values of $e^x$ more accurately and with fewer operations.
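For instance, with the intlab toolbox discussed below, the same enclosure can be produced automatically; a minimal sketch, assuming intlab has been installed and started:

% Enclose e = 1 + 1 + 1/2! + ... + 1/5! + e^xi/6!, with e^xi in [1,3],
% using INTLAB's outwardly rounded interval arithmetic.
s = intval(1) + 1;
f = 1;                        % running factorial k!
for k = 2:5
    f = f*k;
    s = s + 1/intval(f);
end
s = s + infsup(1, 3)/720      % add the remainder term; the result encloses e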
Proofs of the theorems, as well as greater detail, appear in various texts on interval arithmetic. A good book on interval arithmetic is R. E. Moore's classic text [27], although numerous more recent monographs and reviews are available. A World Wide Web search on the term "interval computations" will lead to some of these. A general introduction to interval computations is [26]. That work gives not only a complete introduction, with numerous examples and explanations of pitfalls, but also provides examples with intlab, a free matlab toolbox for interval computations, and reference material for intlab. If you have matlab available, we recommend intlab for the exercises in this book involving interval computations.
1.4 Programming Environments
Modern scientific computing (with floating point numbers) is usually done with high-level "imperative" (as opposed to "functional") programming languages. Common programming environments in use for general scientific computing today are Fortran (or FORmula TRANslation), C/C++, and matlab. Fortran is the original such language, with its origins in the late 1950s. There is a large body of high-quality publicly available software in Fortran for common computations in numerical analysis and scientific computing. Such software can be found, for example, on NETLIB, at http://www.netlib.org/

Fortran has evolved over the years, becoming a modern, multi-faceted language with the Fortran 2003 standard. Throughout, the emphasis by both the standardization committee and suppliers of compilers for Fortran has been on features that simplify programming of solutions to large problems in numerical analysis and scientific computing, and on features that enable high performance, especially on computers that can process vectors and matrices efficiently.

The C language, originally developed in conjunction with the Unix operating system, was originally meant to be a higher-level language for designing and accessing the operating system, but has become more ubiquitous since then. C++, appearing in the late 1980s, was the first widely available
language to allow the object-oriented programming paradigm. In recent years, computer science departments have favored teaching C++ over teaching Fortran, and Fortran has fallen out of favor in relative terms. However, Fortran is still favored in certain large-scale applications such as fluid dynamics (e.g., in weather prediction and similar simulations), and some courses are still offered in it in engineering schools. However, some people still think of Fortran as the now somewhat rudimentary language known as FORTRAN 77.

Reasonably high-quality compilers for both Fortran and C/C++ are available free of charge with Linux operating systems. Fortran 2003, largely implemented in these compilers, has a standardized interface to C, so functions written in C can be called from Fortran programs, and vice versa. These compilers include interactive graphical-user-interface-oriented debuggers, such as insight, available with the Linux operating system. Commercially available compilation and debugging systems are also available under Windows.

The matlab system has become increasingly popular over the last two decades or so. matlab (or "MATrix LABoratory") began in the early 1980s as a National Science Foundation project, written by Cleve Moler in FORTRAN 66, to provide an interactive environment for computing with matrices and vectors; it has since evolved to be both an interactive environment and a full-featured programming language. matlab is highly favored in courses such as this because of the ease of programming, debugging, and general use (such as graphing), and because of the numerous "toolboxes," supplied by both Mathworks (Cleve Moler's company) and others, for many basic computing tasks and applications. The main drawback to use of matlab in all scientific computations is that the language is interpretive; that is, matlab translates each line of a program to machine language each time that it executes the line. This makes complicated programs that involve nested iterations much slower (a factor of 60 or more) than comparable programs written in a compiled language such as C or Fortran. However, functions compiled from Fortran or C/C++ can be called from matlab. A common strategy has been to initially develop algorithms in matlab, then translate all or part of the program to a compilable language, as necessary for efficiency.

One perceived disadvantage of matlab is that it is proprietary. Undesirable possible consequences are that it is not free, and there is no official guarantee that it will be available forever, unchanged. However, its use has become so widespread in recent years that these concerns do not seem to be major. Several projects, including Octave and Scilab, have produced free products that partially support the matlab programming language. The most widely distributed of these, Octave, is integrated into Linux systems. However, the object-oriented features of Octave are rudimentary compared to those of matlab, and some toolboxes, such as intlab (which we will mention later), will not function with Octave.
Alternative systems sometimes used for scientific computing are computer algebra systems. Perhaps the most common of these are Mathematica and Maple, while a free such system under development is SAGE. These systems admit a different way of thinking about programming, termed functional programming, in which rules are defined and available at all times to automatically simplify expressions that are presented to the system. (In contrast, in imperative programming, a sequence of commands is executed one after the other.) Although these systems have become comprehensive, they are based in computations of a different character, rather than in the floating point computations and linear algebra involved in numerical analysis and scientific computing.

We will use matlab in this book to illustrate the concepts, techniques, and applications. With newer versions of matlab, a student can study how to use the system and make programs largely by using the matlab help system. The first place to turn will be the "Getting started" demos, which in newer versions are presented as videos. There are also many books devoted to the use of matlab. Furthermore, we will be giving examples throughout this book.

matlab programs can be written as matlab scripts and matlab functions.

Example 1.23 The matlab script we used to produce the table following Example 1.7 is:
a = 2; x = 2; xold = x; err_old = 1;
for k=0:10
    k
    x
    err = x - sqrt(2);
    err
    ratio = err/err_old^2
    err_old = err;
    x = x/2 + 1/x;
end
Example 1.24 The matlab script we used to produce the table in Example 1.8 is:

format long
a = 2; x = 2; err_old = 1;
for k=0:19
    k, x
    err = x - sqrt(2)
    ratio = err/err_old
    err_old = err;
    x = x - x^2/3.5 + 2/3.5;
end
An excellent alternative textbook that focuses on matlab functions is Cleve Moler's Numerical Computing with Matlab [25]. An on-line version, along with m-files, etc., is currently available at http://www.mathworks.com/moler/chapters.html.
1.5 Applications
The purpose of the methods and techniques in this book ultimately is to provide both accurate predictions and insight into practical problems. This includes understanding and predicting, and managing or controlling, the evolution of ecological systems and epidemics; designing and constructing durable but inexpensive bridges, buildings, roads, and water control structures; understanding chemical and physical processes; designing chemical plants and electronic components and systems; and minimizing costs or maximizing delivery of products or services within companies and governments. To achieve these goals, the numerical methods are a small part of the overall modeling process, which can be viewed as consisting of the following steps.

Identify the problem: This is the first step in translating an often vague situation into a mathematical problem to be solved. What questions must be answered, and how can they be quantified?

Assumptions: Which factors are to be ignored and which are important? The real world is usually significantly more complicated than mathematical models of it, and simplifications must be made, because some factors are poorly understood, because there isn't enough data to determine some minor factors, or because it is not practical to accurately solve the resulting equations unless the model is simplified. For example, the theory of relativity and variations in the acceleration of gravity
due to the fact that the earth is not exactly round, and due to the fact that the density differs from point to point on the surface of the earth, in principle will affect the trajectory of a baseball as it leaves the bat. However, such effects can be ignored when we write down a model of the trajectory of the baseball. On the other hand, we need to include such effects if we are measuring the change in distance between two satellites in a tandem orbit to detect the location of mineral deposits on the surface of the earth.
Construction: In this step, we actually translate the problem into mathematical language.

Analysis: We solve the mathematical problem. Here is where the numerical techniques in this book come into play. With more complicated models, there is an interplay between the previous three steps and this solution process: we may need to simplify the model to enable practical solution. Also, presentation of the result is important here, to maximize the usefulness of the results. In the early days of scientific computing, printouts of numbers were used, but increasingly, results are presented as two- and three-dimensional graphs and movies.

Interpretation: The numerical solution is compared to the original problem. If it does not make sense, go back and reformulate the assumptions.

Validation: Compare the model to real data. For example, in climate models, the model might be used to "predict" climate changes in past years, before it is used to predict future climate changes.

Note that there is an intrinsic error introduced in the modeling process (such as when certain phenomena are ignored), which is outside the scope of our study of numerical methods. Such error can only be measured indirectly through the interpretation and validation steps. In the model solution process (the analysis step), errors are also introduced due to roundoff error and the approximation process. We have seen that such error consists of approximation error and roundoff error. In a study of numerical methods and numerical analysis, we quantify and find bounds on such errors. Although this may not be the major source of error, it is important to know. Consequences of this type of error might be that a good model is rejected, that incorrect conclusions are deduced about the process being modeled, etc. The authors of this book have personal experience with these events.

Errors in the modeling process can sometimes be quantified in the solution process. If the model depends on parameters that are not known precisely, but bounds on those parameters are known, knowledge of these bounds can sometimes be incorporated into the mathematical equations, and the set of possible solutions can sometimes be computed or bounded. One tool that sometimes works is interval arithmetic. Other tools, less mathematically definite but
applicable in different situations, are statistical methods and computing solutions to the model for many different values of the parameters.

Throughout this book, we introduce applications from many areas.

Example 1.25 The formula for the net capacitance when two capacitors of values $x$ and $y$ are connected in series is
\[ z = \frac{xy}{x + y}. \]
Suppose the measured values of $x$ and $y$ are $x = 1$ and $y = 2$, respectively. Estimate the range of possible values of $z$, given that the true values of $x$ and $y$ are known to be within 10% of the measured values.

In this example, the identification, assumptions, and construction have already been done. (It is well known how capacitances in a linear electrical circuit behave.) We are asked to analyze the error in the output of the computation, due to errors in the data. We may proceed using interval arithmetic, relying on the accuracy assumptions for the measured values. In particular, these assumptions imply that $x \in [0.9, 1.1]$ and $y \in [1.8, 2.2]$. We will plug these intervals into the expression for $z$, but we first use Theorem 1.8, part (2), as a guide to rewrite the expression for $z$ so $x$ and $y$ occur only once. (We do this so we obtain sharp bounds on the range, without overestimation.) Dividing the numerator and denominator for $z$ by $xy$, we obtain
\[ z = \frac{1}{\frac{1}{x} + \frac{1}{y}}. \]
We use the intlab toolbox⁸ for matlab to evaluate $z$. We have the following dialog in matlab's command window:
>> intvalinit('DisplayInfsup')
===> Default display of intervals by infimum/supremum
>> x = intval('[0.9,1.1]')
intval x =
[    0.8999,    1.1001]
>> y = intval('[1.8,2.2]')
intval y =
[    1.7999,    2.2001]
>> z = 1/(1/x + 1/y)
intval z =
[    0.5999,    0.7334]
>> format long
>> z
intval z =
[   0.59999999999999,   0.73333333333334]
>>

⁸ If one has matlab, intlab is available free of charge for non-commercial use from http://www.ti3.tu-harburg.de/~rump/intlab/
Thus, the capacitance must lie between 0.5999 and 0.7334. Note that $x$ and $y$ are input as strings. This is to assure that roundoff errors in converting the decimal expressions 0.9, 1.1, 1.8, and 2.2 into internal binary format are taken into account. See [26] for more examples of the use of intlab.
1.6 Exercises
1. Write down a polynomial $p(x)$ such that $|S(x) - p(x)| \le 10^{-10}$ for $-0.2 \le x \le 0.2$, where
\[ S(x) = \begin{cases} \dfrac{\sin(x)}{x} & \text{if } x \ne 0, \\ 1 & \text{if } x = 0. \end{cases} \]
Note: $\operatorname{sinc}(x) = S(\pi x) = \sin(\pi x)/(\pi x)$ is the "sinc" function (well known in signal processing, etc.).

(a) Show that your polynomial $p$ satisfies the condition $|S(x) - p(x)| \le 10^{-10}$ for $x \in [-0.2, 0.2]$. Hint: You can obtain polynomial approximations with error terms for $S(x)$ by writing down Taylor polynomials and corresponding error terms for $\sin(x)$, then dividing these by $x$. This can be easier than trying to differentiate $S(x)$. For the proof part, you can use, for example, the Taylor polynomial remainder formula or the alternating series test, and you can use interval arithmetic to obtain bounds.

(b) Plot your polynomial approximation and $S(x)$ on the same graph, (i) over the interval $[-0.2, 0.2]$, (ii) over the interval $[-3, 3]$, (iii) over the interval $[-10, 10]$.

2. Suppose $f$ has a continuous third derivative. Show that
\[ \frac{f(x+h) - f(x-h)}{2h} - f'(x) = O(h^2). \]
4. Let $a = 0.41$, $b = 0.36$, and $c = 0.7$. Assuming 2-digit decimal computer arithmetic with rounding, show that
\[ \frac{a - b}{c} \ne \frac{a}{c} - \frac{b}{c} \]
when using this arithmetic.

5. Write down a formula relating the unit roundoff $\epsilon$ of Definition 1.3 and the machine epsilon $\epsilon_m$ defined in Section 1.2.2.

6. Store and run the following matlab script. What are your results? What does the script compute? Can you say anything about the computer arithmetic underlying matlab?

eps = 1;
x = 1+eps;
while(x~=1)
    eps = eps/2;
    x = 1+eps;
end
eps = eps+(2*eps)^2
y = 1+eps;
y-1

7. Suppose, for illustration, we have a system with base $\beta = 10$, $t = 3$ decimal digits in the mantissa, and $L = -9$, $U = 9$ for the exponent. For example, $0.123 \times 10^4$, that is, 1230, is a machine number in this system. Suppose also that "round to nearest" is used in this system.

(a) What is HUGE for this system?
(b) What is TINY for this system?
(c) What is the machine epsilon $\epsilon_m$ for this system?
(d) Let $f(x) = \sin(x) + 1$.
  i. Write down $fl(f(0))$ and $fl(f(0.0008))$ in normalized format for this toy system.
  ii. Compute $fl(fl(f(0.0008)) - fl(f(0)))$. On the other hand, what is the nearest machine number to the exact value of $f(0.0008) - f(0)$?
  iii. Compute $fl(fl(f(0.0008)) - fl(f(0))) / fl(0.0008)$. Compare this to the nearest machine number to the exact value of $(f(0.0008) - f(0))/0.0008$ and to $f'(0)$.

8. Let
\[ f(x) = \frac{\ln(x + 1) - \ln(x)}{2}. \]
(a) Use four-digit decimal arithmetic with rounding to evaluate $f(100{,}000)$.

(b) Use the Mean Value Theorem to approximate $f(x)$ in a form that avoids the loss of significant digits. Use this form to evaluate $f(x)$ for $x = 100{,}000$ once again.

(c) Compare the relative errors for the answers obtained in (a) and (b).

9. Compute the condition number of $f(x) = e^{x^2 - 1}$ and discuss any possible ill-conditioning.
10. Let $f(x) = (\sin(x))^2 + x/2$. Use interval arithmetic to prove that there are no solutions to $f(x) = 0$ for $x \in [-1, -0.8]$.