
Chapter 1

Preliminaries

To understand numerical algorithms well, we should know how arithmetic operations on numbers (addition, subtraction, multiplication and division) are performed on computers.
In this chapter we first introduce computer arithmetic: how numbers are stored and operated on in a computer, and how large the machine error is. We then point out some pitfalls of machine operations and show how to avoid loss of significance.

1.1 Binary numbers

Decimal numbers, for example 1234.56, can be written as

1234.56 = 1×10^3 + 2×10^2 + 3×10^1 + 4×10^0 + 5×10^-1 + 6×10^-2,

where the ten digits 0, 1, 2, ..., 9 are used to express decimal numbers.

Definition 1.1.1 (Binary number) Binary numbers are expressed as (··· b2 b1 b0 . b−1 b−2 ···)2,
where each binary digit bi is 0 or 1. This binary number is equivalent to the base-10 number

(··· b2 b1 b0 . b−1 b−2 ···)2 = ··· + b2×2^2 + b1×2^1 + b0×2^0 + b−1×2^-1 + b−2×2^-2 + ··· .

Example 1.1.1 Convert the binary numbers (a) (101.0)2; (b) −(0.11)2; (c) (101101.1011)2
into decimal numbers.

Solution. (a) Adding up the digits times powers of 2 leads to

(101.0)2 = 1×2^2 + 0×2^1 + 1×2^0 + 0×2^-1 = 5;


(b) Adding up the digits after the point times the negative powers of 2 leads to

−(0.11)2 = −(1×2^-1 + 1×2^-2) = −3/4;

(c) Noting that

(101101)2 = 1×2^0 + 0×2^1 + 1×2^2 + 1×2^3 + 0×2^4 + 1×2^5 = 45,
(0.1011)2 = 1×2^-1 + 0×2^-2 + 1×2^-3 + 1×2^-4 = 11/16,

we obtain

(101101.1011)2 = (101101)2 + (0.1011)2 = 45 + 11/16.

Theorem 1.1.1 Conversion of a decimal integer to binary is carried out by dividing the decimal number by 2 successively and recording the remainders from bottom to top.

Example 1.1.2 Convert the decimal number 57 into a binary number.

Solution. Dividing 57 by 2 successively and recording the remainders gives:

57 ÷ 2 = 28 R 1
28 ÷ 2 = 14 R 0
14 ÷ 2 =  7 R 0
 7 ÷ 2 =  3 R 1
 3 ÷ 2 =  1 R 1
 1 ÷ 2 =  0 R 1

When the quotient becomes 0 the process stops, since the additional equation 0 ÷ 2 = 0 R 0 is trivial. The binary number is then obtained by writing the remainders (the binary digits in the rightmost column) from bottom to top: 57 = (111001)2.
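The repeated-division procedure of Theorem 1.1.1 translates directly into code. Here is a short Python sketch (the function name dec_to_bin is ours):

```python
def dec_to_bin(n):
    """Convert a nonnegative decimal integer to a binary digit string
    by dividing by 2 repeatedly and reading the remainders bottom to top."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)   # quotient and remainder of division by 2
        remainders.append(str(r))
    return "".join(reversed(remainders))

print(dec_to_bin(57))  # 111001, as in Example 1.1.2
```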

Theorem 1.1.2 A decimal fraction is converted to binary by multiplying the fraction (and each resulting fractional part) by 2 successively and recording the integer parts from top to bottom.

Example 1.1.3 Convert the decimal fractions 0.6 and −0.75 to binary numbers.

Solution. Multiplying 0.6 by 2 gives the integer part 1 and the fraction 0.2; record the 1, then multiply the resulting fraction 0.2 by 2, which gives the integer part 0 and the fraction 0.4. Repeating this process produces a sequence of binary digits, as

follows:

0.6 × 2 = 0.2 + 1
0.2 × 2 = 0.4 + 0
0.4 × 2 = 0.8 + 0
0.8 × 2 = 0.6 + 1
0.6 × 2 = 0.2 + 1
0.2 × 2 = 0.4 + 0
0.4 × 2 = 0.8 + 0
0.8 × 2 = 0.6 + 1
...

When the new fractional part becomes 0 the process stops; otherwise it goes on. The digits, written from top to bottom, give the binary fraction 0.6 = (0.1001 1001 ···)2, in which the block of four digits 1001 repeats forever.
Similarly, we consider the binary representation of 0.75:

0.75 × 2 = 0.5 + 1
0.5 × 2 = 0 + 1

Then, adding the negative sign, we have −0.75 = −(0.11)2.
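The repeated-doubling procedure can be sketched the same way in Python. Note that the input 0.6 is itself stored as a rounded double, so only the leading digits of the output are meaningful; the function name frac_to_bin is ours:

```python
def frac_to_bin(x, max_digits=12):
    """Binary digits of a fraction 0 <= x < 1, obtained by doubling
    repeatedly and recording the integer parts top to bottom."""
    digits = []
    for _ in range(max_digits):
        x *= 2
        d = int(x)        # the integer part is the next binary digit
        digits.append(str(d))
        x -= d
        if x == 0:        # fractional part exhausted: terminating fraction
            break
    return "0." + "".join(digits)

print(frac_to_bin(0.75))  # 0.11
print(frac_to_bin(0.6))   # the block 1001 repeating
```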

Example 1.1.4 Convert the numbers 57.6 and −57.75 to the binary numbers.

Solution. Combining Examples 1.1.2 and 1.1.3 yields

57.6 = 57 + 0.6 = (111001)2 + (0.1001 1001 ···)2 = (111001.1001 1001 ···)2,

which is a repeating binary number: the block of four digits 1001 is infinitely repeated.
Similarly, for −57.75 we have

−57.75 = −57 + (−0.75) = −(111001)2 + (−0.11)2 = −(111001.11)2 .

Example 1.1.5 Suppose x = (0.1101 1101 ···)2, with the block 1101 repeating forever. Convert it to decimal.

Solution. Note that x = (0.1101 1101 ···)2 and 2^4 x = (1101.1101 1101 ···)2. Subtracting x from 2^4 x yields

(2^4 − 1)x = (1101)2 = 1×2^0 + 0×2^1 + 1×2^2 + 1×2^3 = 13.

Thus, x = 13/15.


Examples 1.1.3–1.1.5 show that every decimal number can be converted into a binary number and vice versa.
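The shift-and-subtract trick of Example 1.1.5 works for any repeating block. A small Python helper using exact rational arithmetic (the function name is ours):

```python
from fractions import Fraction

def repeating_bin_to_fraction(block):
    """Value of the repeating binary fraction (0.block block ...)_2.
    Multiplying by 2^k shifts the point k places, so (2^k - 1) x = (block)_2."""
    k = len(block)
    return Fraction(int(block, 2), 2**k - 1)

print(repeating_bin_to_fraction("1101"))  # 13/15, as in Example 1.1.5
print(repeating_bin_to_fraction("1001"))  # 3/5, i.e. 0.6 from Example 1.1.3
```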

Exercise 1 (Section 1.1)

1. Find the binary representation of the decimal numbers:


(a) 8; (b) 64; (c) 17; (d) 1/8; (e) 35/16.

2. Convert the following binary numbers to decimal numbers.


(a) (10111)2 ; (b) (0.1001)2 ; (c) (1101.101)2 ; (d) (101.11)2 .

3. Convert the following base 10 numbers to binary.


(a) 37; (b) 1/4; (c) 1/3; (d) 9.5; (e) 25.75.

4. (a) Suppose x = (0.1001)2 . Convert this binary number to decimal; (b) use the result in
(a) to convert (1.11001)2 to decimal number.

5. "dec2bin" and "bin2dec" are two functions in the Matlab library. Use them to check your results in Exercises 1–3.

1.2 Floating-point numbers and round-off errors

In scientific notation every number is written in the form ±a × 10^b (a times ten raised to the power b), where the coefficient a, called the significand or mantissa, is a real number with 1 ≤ |a| < 10, and the integer b is the exponent. For example,

Decimal notation        Scientific notation
4,321.768               4.321768 × 10^3
−53,000                 −5.3 × 10^4
6,720,000,000           6.72 × 10^9
0.2                     2 × 10^-1
0.00000000751           7.51 × 10^-9

Similarly to scientific notation for decimal numbers, the binary numbers in Examples 1.1.1–1.1.4 can be rewritten as

(101.0)2 = (1.01)2 × 2^2;
−(0.11)2 = −(1.1)2 × 2^-1;
(101101.1011)2 = (1.011011011)2 × 2^5;
(111001.1001 1001 ···)2 = (1.110011001 1001 ···)2 × 2^5.

Shifting-point rule: the power of 2 is increased by m if the binary point is moved m places to the left, and decreased by m if the binary point is moved m places to the right.
Binary numbers with infinitely many binary digits are stored on computers as binary numbers with a fixed number of digits; the redundant digits are removed according to the IEEE rounding rules. Binary numbers stored on computers are called floating-point numbers.

Definition 1.2.1 (IEEE rounding rule) The IEEE rounding-to-nearest rule (IEEE 754 floating-point standard) states that a binary number r is stored in double precision as a number r0 determined as follows:
(a) if the 53rd bit to the right of the binary point is 0, then round down (truncate after the 52nd bit);
(b) if the 53rd bit is 1 and the bits after the 53rd bit are not all zero, then round up (add 1 to the 52nd bit);
(c) if the 53rd bit is 1 and all the bits after it are zero, then round to even: if the 52nd bit is 1, add 1 to it; if the 52nd bit is 0, round down.
(d) The rules for storing binary numbers in single and long double precision are given similarly; single precision keeps 23 bits after the binary point and long double precision keeps 64.

Example 1.2.1 By the IEEE rounding rules, give the binary number for r = (1.110011001 1001 ···)2 × 2^5 stored on a computer with 32-bit accuracy (single precision).

Solution. Single precision keeps 23 bits after the binary point. The 24th bit of r to the right of the binary point is 0, so we round down (truncate after the 23rd bit); had the 24th bit been 1, we would round up (add 1 to the 23rd bit). The number stored in single precision is therefore

r0 = (1.11001 1001 1001 1001 1001 10)2 × 2^5,

which keeps 23 bits after the point. The binary number r0 is called the binary floating-point representation of r, where

1.11001100110011001100110

is the mantissa and 5 is the exponent. r0 approximates r to 23 significant binary digits.

Definition 1.2.2 (Floating-point number) A floating-point number has three parts: the sign (+ or −), a mantissa, which contains the string of significant bits, and an exponent. The form of a normalized floating-point number is ±1.bbb··· × 2^p, where each b is 0 or 1 and p is an M-bit binary number representing the exponent.

Remark 1.2.1 (a) Normalization means that the leading (leftmost) bit of the mantissa must be the digit 1.
(b) A floating-point number is a rational number, because it has a finite number of digits and can be represented as one integer divided by another. For example, (1.1011)2 × 2^2 is (110.11)2, which is equal to

1×2^2 + 1×2^1 + 1×2^-1 + 1×2^-2 = 6 + 3/4 = 27/4.

(c) In a normalized floating-point number, the sign, exponent and mantissa are stored together in a computer word: | sign | exponent | mantissa |.

For example, the IEEE floating-point representation of the real number r = 57.6 of Example 1.1.4 in single precision is

r0 = (1.11001 1001 1001 1001 1001 10)2 × 2^5.

The word storing r0 is

| 0 | 10000100 | 11001100110011001100110 |,

where the exponent field holds the biased exponent 5 + 127 = 132 = (10000100)2.

The lengths of the significand and exponent determine the precision to which numbers can be represented. There are three commonly used levels of precision for floating-point numbers: single precision (32 bits), double precision (64 bits) and long double precision (80 bits). The bit layouts of the three levels are:

precision     sign   exponent   mantissa
single          1        8         23
double          1       11         52
long double     1       15         64
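The three fields can be inspected directly. A small Python sketch using the standard struct module; note that in the stored word the exponent field holds a biased value (p + 1023 for doubles):

```python
import struct

def double_fields(x):
    """Split an IEEE 754 double into its sign, exponent and mantissa bit fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # the 64 raw bits
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 bits, leading 1 implicit
    return sign, exponent, mantissa

print(double_fields(1.0))   # (0, 1023, 0)
print(double_fields(-2.0))  # (1, 1024, 0)
```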

Definition 1.2.3 (Round-off error) Most real numbers have to be rounded off in order to be
represented as t-digit floating point numbers. The difference between the floating point number
x0 and the original number x is called the round-off error.

Definition 1.2.4 (Absolute and relative error) If x is a real number and x0 is its floating-point approximation, then the absolute value of the difference, |x0 − x|, is called the absolute error, and the quotient |x0 − x|/|x| is called the relative error.

Example 1.2.2 Find the round-off error, absolute error and relative error of the floating-point number r0 stored in single precision in Example 1.2.1.

Solution: The correct value is r = 57.6 = (1.11001 1001 1001 ···)2 × 2^5, and the number stored in single precision is r0 = (1.11001 1001 1001 1001 1001 10)2 × 2^5. Their difference, the truncation error, is

r − r0 = (0.00000 0000 0000 0000 0000 000 0110 0110 ···)2 × 2^5 = (1.1001 1001 ···)2 × 2^-20 = 1.6 × 2^-20.

The absolute error of r0 is

ErrA = |r − r0| = 1.6 × 2^-20 ≈ 1.526 × 10^-6.

The relative error of r0 is

ErrR = |r − r0| / |r| = (1.6 × 2^-20) / 57.6 = (1/36) × 2^-20 ≈ 2.65 × 10^-8.
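The size of this round-off error can be checked numerically. A short Python sketch: Python floats are doubles, so the struct module is used to round 57.6 to single precision (the helper name to_single is ours):

```python
import struct

def to_single(x):
    """Round a double to the nearest IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

r = 57.6
r0 = to_single(r)
abs_err = abs(r - r0)       # about 1.6 * 2**-20 ≈ 1.526e-06
rel_err = abs_err / abs(r)  # about 2.65e-08
print(abs_err, rel_err)
```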

Next we consider how to describe significant digits in terms of error. For example, let a = √3 = 0.1732050807568877··· × 10^1 be the correct value, and let a∗ = 1.732051 be an approximate value obtained by rounding the decimal number. Counting the decimal digits, we see that a∗ has 7 significant digits. The absolute error of a∗ is

|a − a∗| = 1.924311··· × 10^-7 = 0.1924311··· × 10^(1−7) ≤ 0.5 × 10^(1−7).

Similarly, if b = 10a = 0.1732050807568877··· × 10^2 and b∗ = 0.1732051 × 10^2, then

|b − b∗| = 1.924311··· × 10^-6 = 0.1924311··· × 10^(2−7) ≤ 0.5 × 10^(2−7).

Both a∗ and b∗ have 7 significant digits and satisfy a bound of the general form |a − a∗| ≤ 0.5 × 10^(m−n).
We now state the relation between significant digits and error.

Theorem 1.2.1 (Absolute error determines the number of significant digits) Let a∗ be an approximate value of the correct value a, of the form

a∗ = ±0.a1 a2 ··· an × 10^m,

where a1 ∈ {1, 2, ..., 9}, ai ∈ {0, 1, ..., 9} for i = 2, 3, ..., n, and m is an integer. If the absolute error of a∗ satisfies

|a − a∗| ≤ 0.5 × 10^(m−n),

then a∗ has n significant digits.

By the above theorem the following can be proved.

Theorem 1.2.2 (Relation between significant digits and relative error) Let a∗ be an approximate value of the correct value a with n significant digits, that is,

a∗ = ±0.a1 a2 ··· an × 10^m,

where a1 ∈ {1, 2, ..., 9}, ai ∈ {0, 1, ..., 9} for i = 2, 3, ..., n, and m is an integer. Then the relative error er of a∗ satisfies

er = |a − a∗| / |a∗| ≤ (1/(2a1)) × 10^(1−n).

Conversely, if the relative error of a∗ satisfies

er = |a − a∗| / |a∗| ≤ (1/(2(a1 + 1))) × 10^(1−n),

then a∗ has at least n significant digits.

Proof. Let the approximate value a∗ have n significant digits. By Theorem 1.2.1, we see that

|a − a∗| ≤ (1/2) × 10^(m−n).

Note that

a1 × 10^(m−1) ≤ |a∗| = a1.a2···an × 10^(m−1) ≤ (a1 + 1) × 10^(m−1);

then we obtain the bound on the relative error:

er = |a − a∗| / |a∗| ≤ ((1/2) × 10^(m−n)) / (a1 × 10^(m−1)) = (1/(2a1)) × 10^(1−n).

Conversely, if

er = |a − a∗| / |a∗| ≤ (1/(2(a1 + 1))) × 10^(1−n),

then, by the two relations

|a − a∗| = |a∗| · er and |a∗| ≤ (a1 + 1) × 10^(m−1),

we obtain

|a − a∗| = |a∗| · er ≤ (a1 + 1) × 10^(m−1) · (1/(2(a1 + 1))) × 10^(1−n) = 0.5 × 10^(m−n).

Thus, a∗ has n significant digits.

Example 1.2.3 Determine the absolute error, relative error and number of significant digits of the following approximate values of √2 = 1.41421356237... and 10 + √2 = 11.41421356237....
(a) x1 = 1.414213; (b) x2 = 1.414214; (c) x3 = 1.414213321;
(d) y1 = 11.414213; (e) y2 = 11.414214; (f) y3 = 11.414213321.

Solution: The absolute errors are:

e(x1) = 5.623730952031281e−07; e(x2) = 4.376269049366499e−07;
e(x3) = 2.413730950667770e−07; e(y1) = 5.623730956472173e−07;
e(y2) = 4.376269036043823e−07; e(y3) = 2.413730957329108e−07.

By the definition of significant digits we see that x1, x2 and x3 have 6, 7 and 7 significant digits, and that y1, y2 and y3 have 7, 8 and 8 significant digits, since

e(x1) ≈ 0.56237310e−6 ≤ 0.5 × 10^(1−6); e(x2) ≈ 0.43762690e−6 ≤ 0.5 × 10^(1−7);
e(x3) ≈ 0.24137310e−6 ≤ 0.5 × 10^(1−7); e(y1) ≈ 0.56237310e−6 ≤ 0.5 × 10^(2−7);
e(y2) ≈ 0.43762690e−6 ≤ 0.5 × 10^(2−8); e(y3) ≈ 0.24137310e−6 ≤ 0.5 × 10^(2−8).

Their relative errors are:

re(x1) = 3.976578291749997e−07; re(x2) = 3.094489521103857e−07;
re(x3) = 1.706765523177032e−07; re(y1) = 4.926954385198098e−08;
re(y2) = 3.834052177252206e−08; re(y3) = 2.114671277297598e−08,

which are in agreement with the results in Theorem 1.2.2.
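Theorem 1.2.1 suggests a direct way to count significant digits. A Python sketch (the function name is ours; it assumes a∗ ≠ 0):

```python
import math

def significant_digits(a, a_star):
    """Largest n with |a - a_star| <= 0.5 * 10**(m - n), where
    a_star = ±0.a1 a2 ... * 10**m with a1 != 0 (Theorem 1.2.1)."""
    m = math.floor(math.log10(abs(a_star))) + 1
    err = abs(a - a_star)
    n = 0
    while n < 16 and err <= 0.5 * 10.0**(m - (n + 1)):
        n += 1
    return n

print(significant_digits(math.sqrt(2), 1.414213))        # 6
print(significant_digits(math.sqrt(2), 1.414214))        # 7
print(significant_digits(10 + math.sqrt(2), 11.414213))  # 7
```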

The two theorems above can be used to set the tolerance in a stopping criterion when a number of significant digits is prescribed.

Apart from the round-off error of representation, additional round-off errors may occur when arithmetic operations are applied to floating-point numbers on a computer. To see these errors and the pitfalls of computer operations, in the next section we introduce arithmetic operations for binary numbers.

Exercise 2 (Section 1.2)

1. Convert the following decimal numbers to binary and express them as a floating point
number with single precision by using the Rounding to Nearest Rule.

(a) 1/4; (b) 2/3; (c) 3/4; (d) 0.29.

2. Convert the following decimal numbers to binary and express them as a floating point
number fl(x) with single precision by using the Rounding to Nearest Rule.

(a) 7.5; (b) 8.6; (c) 98.2; (d) 26/7.

3. Do the following sums by hand in IEEE double-precision computer arithmetic, using the rounding rules.

(a) (1 + (2^-51 + 2^-52)) − 1; (b) (1 + (2^-51 + 2^-52 + 2^-53)) − 1.

4. Decide whether 1 + x > 1 in double-precision floating-point arithmetic with the rounding rules.

(a) x = 2^-53; (b) x = 2^-53 + 2^-58; (c) x = 2^-52.



5. The following numbers are approximate values for π and √3 respectively.

(a) x1 = 3.142; x2 = 3.141593; x3 = 3.1415927;
(b) y1 = 1.732; y2 = 1.73205; y3 = 1.732050808.

Give the numbers of significant digits of the approximate values xi and yi with i = 1, 2, 3,
and their absolute error and relative error. Then test the Theorem about the relation
between significant digits and error.

1.3 Operations of floating-point numbers and error analysis


1.3.1 Operations of floating point numbers

The four operations on binary numbers are addition, subtraction, multiplication and division. The simplest arithmetic operation for binaries is addition. For example,

    0 1 1 0 1
+   1 0 1 1 1
-------------
= 1 0 0 1 0 0

that is, 13 + 23 = 36.

Definition 1.3.1 (Addition) Adding two single-digit binary numbers is relatively simple, using
a form of carrying:

0 + 0 → 0, 0 + 1 → 1, 1 + 0 → 1, 1 + 1 → 0, carry 1,

since 1 + 1 = 2 = (10)2 = 0 + 1 × 21 . Adding two "1" digits produces a digit "0", while 1 will
have to be added to the next column.

If the sum of two binary digits equals or exceeds the value of the radix (2), the digit to the left is incremented. This is known as carrying.

Definition 1.3.2 (Subtraction) Subtraction of binary numbers works in much the same way as addition:

0 − 0 → 0, 0 − 1 → 1 (borrow 1), 1 − 0 → 1, 1 − 1 → 0.

Subtracting a "1" digit from a "0" digit produces the digit "1", while 1 has to be subtracted from the next column. This is known as borrowing; the principle is the same as for carrying. When 0 − 1 occurs, a 1 is borrowed from the next position to the left, reducing that position's value by 1.

For example,

    1 1 0 1 1 1 0
−       1 0 1 1 1
-----------------
=   1 0 1 0 1 1 1

Definition 1.3.3 (Multiplication) Multiplication in binary is similar to its decimal counterpart. Two numbers a and b can be multiplied by partial products: for each digit in b, the product of that digit and a is calculated and written on a new line, shifted leftward so that its rightmost digit lines up with the digit in b that was used. The sum of all these partial products gives the final result. Binary numbers with bits after a binary point can also be multiplied: the factors are multiplied as integers, and the binary point of the product is then moved left by the total number of fractional bits in the two factors.

Since there are only two digits in binary, there are only two possible outcomes of each partial
multiplication: (i) If the digit in b is 0, the partial product is also 0; (ii) If the digit in b is 1, the
partial product of 1 and a is equal to a.

Example 1.3.1 Evaluate the products of the binary numbers.

(a) (1011)2 × (1010)2 ; (b) (101.101)2 × (110.01)2 .

Solution. The binary numbers (1011)2 and (1010)2 are multiplied as follows:

1 0 1 1 (= a)
× 1 0 1 0 (= b)
− − − − − − − − − −
0 0 0 0
1 0 1 1
0 0 0 0
+ 1 0 1 1
− − − − − − − − − −
= 1 1 0 1 1 1 0

The multiplication (101.101)2 × (110.01)2 is carried out by computing (101101)2 × (11001)2 and then moving the binary point 3 + 2 = 5 places to the left:

            1 0 1 1 0 1
          ×   1 1 0 0 1
  ---------------------
            1 0 1 1 0 1
          0 0 0 0 0 0
        0 0 0 0 0 0
      1 0 1 1 0 1
  + 1 0 1 1 0 1
  ---------------------
  = 1 0 0 0 1 1 0 0 1 0 1

Thus (101.101)2 × (110.01)2 = (100011.00101)2.

Definition 1.3.4 (Division) Binary division is again similar to its decimal counterpart. The procedure is illustrated by an example: compute (11011)2 ÷ (101)2.

              1 0 1
        -----------
1 0 1 ) 1 1 0 1 1
      − 1 0 1
      -------
          0 1 1
        − 0 0 0
        -------
          1 1 1
        − 1 0 1
        -------
            1 0

Thus, (11011)2 = (101)2 · (101)2 + (10)2; the remainder is (10)2. In decimal form this reads 27 = 5 · 5 + 2.
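The worked division can be checked with Python's built-in divmod on the corresponding integers (0b... literals are binary):

```python
# (11011)_2 ÷ (101)_2: quotient and remainder
q, r = divmod(0b11011, 0b101)
print(bin(q), bin(r))  # 0b101 0b10, i.e. 27 = 5 * 5 + 2
```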

1.3.2 Analysis of errors in computer arithmetic

Most real numbers have to be rounded off in order to be represented as t-digit floating-point numbers on computers. When arithmetic operations are applied to floating-point numbers, additional round-off errors may occur. In this section we discuss errors in numerical computation on computers.

Theorem 1.3.1 (Absolute error under basic operations) Let a∗ and b∗ be approximate values for a and b respectively, with e(a∗) = |a − a∗| and e(b∗) = |b − b∗|. Then, under the basic operations of addition, subtraction, multiplication and division, the errors of the results satisfy

e(a∗ ± b∗) ≤ e(a∗) + e(b∗);
e(a∗ · b∗) ≤ |b∗| · e(a∗) + |a| · e(b∗), or e(a∗ · b∗) ≤ |b| · e(a∗) + |a∗| · e(b∗);
e(a∗ / b∗) ≤ (|b| · e(a∗) + |a| · e(b∗)) / |b · b∗|, where b · b∗ ≠ 0.
14 CHAPTER 1. PRELIMINARIES

Proof. By the definition of the error e(·), we see that

e(a∗ ± b∗) = |a ± b − (a∗ ± b∗)| = |a − a∗ ± (b − b∗)| ≤ e(a∗) + e(b∗);
e(a∗ b∗) = |ab − a∗ b∗| = |ab − ab∗ + ab∗ − a∗ b∗| ≤ |a| · e(b∗) + |b∗| · e(a∗), or
e(a∗ b∗) = |ab − a∗ b∗| = |ab − a∗ b + a∗ b − a∗ b∗| ≤ |b| · e(a∗) + |a∗| · e(b∗);
e(a∗ / b∗) = |a/b − a∗/b∗| = |(ab∗ − a∗ b)/(b b∗)| = |(ab∗ − ab + ab − a∗ b)/(b b∗)|
           ≤ (|a| · |b − b∗| + |b| · |a − a∗|) / |b · b∗| = (|a| · e(b∗) + |b| · e(a∗)) / |b · b∗|.

This completes the proof.

Note 1: Small denominators should be avoided.

Example 1.3.2 (Error in computation of rectangular area) The area of a rectangle is: A =
h · l, where h is the height and l is the width of the rectangle. Suppose that h∗ = 80 m and
l∗ = 110 m are the measured values of h and l, and e(h∗ ) = |h−h∗ | ≤ 0.1 m, e(l∗ ) = |l−l∗ | ≤ 0.2 m.
Give the error of the computed area A∗ = h∗ · l∗ .

Solution. By the formulas A = h · l and A∗ = h∗ · l∗ , we have

e(A∗) = |A − A∗| = |h · l − h∗ · l∗| = |h · l − h · l∗ + h · l∗ − h∗ · l∗|
≤ |h| · |l − l∗| + |l∗| · |h − h∗| ≈ |h∗| · |l − l∗| + |l∗| · |h − h∗| ≤ 80 · 0.2 + 110 · 0.1 = 27 (m^2).

Thus the absolute error of A∗ is approximately at most 27 m^2, and the relative error of A∗ is approximately e(A∗)/A∗ = 27/(80 · 110) ≈ 0.0031.

Error estimate for evaluation of functions. Taylor's theorem states that

f(x) = f(x∗) + f′(x∗)(x − x∗) + (f″(ξ)/2)(x − x∗)^2,

where ξ is between x and x∗. By this theorem we can estimate the error of f(x∗):

e(f(x∗)) = |f(x) − f(x∗)| ≤ |f′(x∗)| · e(x∗) + (|f″(ξ)|/2) · e(x∗)^2.

If we know bounds for f′(x) and f″(x), then we can estimate the error of f(x∗) by the above inequality.

Definition 1.3.5 (Machine addition of floating point numbers) Machine addition consists of
lining up the binary points of the two numbers to be added so that both of the numbers have the
same exponent, adding them, and then storing the result again as a floating point number.

Example 1.3.3 Carry out the addition of 1 and 2^-53 on a computer with double precision, using the rules for machine addition.

Solution. In terms of machine addition of floating-point numbers, the addition of the two numbers appears as follows:

1.00···0 × 2^0 + 1.00···0 × 2^-53
=   1.00···00 × 2^0
  + 0.00···01 × 2^0   (the 1 is the 53rd bit after the binary point)
=   1.00···01 × 2^0.

This sum is stored as 1.0 × 2^0 = 1 by the rounding-to-nearest rule: the 53rd bit after the point is 1, all later bits are 0, and the 52nd bit is 0, so we round down. From this example we see that adding a very small number to a much larger one can leave the larger number unchanged. This is a pitfall of machine addition.
From Example 1.3.3 we also see that on a computer with double precision the smallest representable number greater than 1 is 1 + 2^-52. The distance between 1 and 1 + 2^-52, denoted ε_mach = 2^-52, is the machine error (machine epsilon).
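This behaviour is easy to reproduce; a short Python check (Python floats are IEEE doubles):

```python
import sys

# 2**-53 is exactly half the gap between 1 and the next double;
# the tie is rounded to the even mantissa, i.e. back to 1.
print(1.0 + 2**-53 == 1.0)               # True
# 2**-52 is the full gap, so the sum is the next representable number.
print(1.0 + 2**-52 > 1.0)                # True
# the standard library exposes this gap as the machine epsilon
print(sys.float_info.epsilon == 2**-52)  # True
```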

Note 2: Adding a very large number to a very small number should be avoided.

Definition 1.3.6 (Loss of significance) When two nearly equal numbers are subtracted, signif-
icant digits are lost. This phenomenon is called loss of significance.

For example, we use seven significant digits to do the subtraction: 113.4567 − 113.4566:

1 1 3 . 4 5 6 7
− 1 1 3 . 4 5 6 6
− − − − − − − − −
= 0 0 0 . 0 0 0 1
Two input numbers have seven-digit accuracy, but after subtraction the result has only one-digit
accuracy. This operation loses many significant digits. In programming and computation by a
computer, loss of significance should be avoided by restructuring the calculation and reducing
operation counts.
Note 3: In programming, equivalent mathematical expressions are selected to avoid subtrac-
tion of two nearly equal numbers.

Example 1.3.4 To avoid loss of significance, rewrite the following expressions.

(a) √9.01 − 3; (b) 1 − cos(0.001); (c) (1 − cos x)/sin^2 x.

Solution. In order to avoid subtraction of two nearly equal numbers, we rewrite the expressions in (a), (b) and (c) as follows.

(a) √9.01 − 3 = (√9.01 − 3)(√9.01 + 3)/(√9.01 + 3) = 0.01/(√9.01 + 3);
(b) 1 − cos(0.001) = 1 − (1 − 2 sin^2(0.0005)) = 2 sin^2(0.0005);
(c) (1 − cos x)/sin^2 x = (1 − cos^2 x)/((1 + cos x) sin^2 x) = 1/(1 + cos x).
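The effect of the rewriting in (b) is dramatic for small arguments. A brief Python comparison (the sample argument 1e-8 is our choice for illustration):

```python
import math

x = 1e-8
naive = 1 - math.cos(x)          # cos(x) rounds to 1: almost all digits cancel
stable = 2 * math.sin(x / 2)**2  # equivalent form with no subtraction
print(naive, stable)             # stable is close to the true value x**2/2
```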

Example 1.3.5 Give the solutions of the equation x^2 + 10^12 x − 2 = 0, paying attention to loss of significance.

Solution. By the quadratic formula for the solutions of quadratic equations, the solutions x1 and x2 are

x1 = (−10^12 − √(10^24 + 8))/2, and x2 = (−10^12 + √(10^24 + 8))/2.

Note that 10^12 and √(10^24 + 8) are nearly equal, so the calculation of −10^12 + √(10^24 + 8) on a computer will lead to loss of significance. Multiplying the numerator and denominator of x2 by −10^12 − √(10^24 + 8), we have

x2 = 4/(10^12 + √(10^24 + 8)).

Thus the formulas to be used on computers or in programs are

x1 = (−10^12 − √(10^24 + 8))/2, and x2 = 4/(10^12 + √(10^24 + 8)).
In general we obtain:

Theorem 1.3.2 (Quadratic formula for machine operation) For the quadratic equation ax^2 + bx + c = 0, where a and/or c are very small compared with b, so that b^2 − 4ac is nearly equal to b^2:
(a) if b is positive, the roots should be computed by

x1 = (−b − √(b^2 − 4ac))/(2a), and x2 = −2c/(b + √(b^2 − 4ac));

(b) if b is negative, the roots are best computed by

x1 = (−b + √(b^2 − 4ac))/(2a), and x2 = 2c/(−b + √(b^2 − 4ac)).

If we use the quadratic formula in a program that solves quadratic equations, the above expressions should be considered and used.
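Theorem 1.3.2 can be packaged as a small routine. The following Python sketch (the function name and sample coefficients are ours) computes the large-magnitude root by the non-cancelling formula and recovers the other root from Vieta's relation x1·x2 = c/a, which is equivalent to the rationalized expressions above:

```python
import math

def stable_quadratic(a, b, c):
    """Real roots of a x^2 + b x + c = 0, avoiding subtraction of
    nearly equal numbers (assumes real roots, a != 0, b != 0)."""
    d = math.sqrt(b * b - 4 * a * c)
    # choose the sign that adds magnitudes instead of cancelling
    x1 = (-b - d) / (2 * a) if b > 0 else (-b + d) / (2 * a)
    x2 = c / (a * x1)   # Vieta: x1 * x2 = c / a
    return x1, x2

x1, x2 = stable_quadratic(1.0, 1e12, -2.0)
print(x1, x2)  # approximately -1e12 and 2e-12, as in Example 1.3.5
```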

Note 4: Simplify expressions to reduce the number of operations.

The workload, or number of operations, is one of the central issues in numerical analysis. Evaluating a polynomial uses repeated multiplication and addition of floating-point numbers.
polynomials multiplication and addition of floating-point numbers are often used.

Example 1.3.6 What is the best way to evaluate

p(x) = 6x^4 + 4x^3 − 3x^2 + 5x − 1

at a point, for example x = 1/2?

Solution. First we consider the traditional way:

p(1/2) = 6·(1/2)·(1/2)·(1/2)·(1/2) + 4·(1/2)·(1/2)·(1/2) − 3·(1/2)·(1/2) + 5·(1/2) − 1 = 13/8.

This requires 10 multiplications and 4 additions.

Next we rewrite the polynomial by using parentheses as

p(x) = −1 + x(5 + x(−3 + x(4 + x·6))).

This requires only 4 multiplications and 4 additions, so the second way is the better one.
In general, we obtain the following.

Theorem 1.3.3 (Evaluation of polynomials on computers) A general polynomial of degree d, c0 + c1 x + ··· + cd x^d, can be evaluated in only d multiplications and d additions.

For example, a polynomial of degree 4 in standard form,

c1 + c2 x + c3 x^2 + c4 x^3 + c5 x^4,

is evaluated in the form

c1 + x(c2 + x(c3 + x(c4 + x·c5)));

a polynomial of degree 4 of the form

c1 + c2(x − r1) + c3(x − r1)(x − r2) + c4(x − r1)(x − r2)(x − r3) + c5(x − r1)(x − r2)(x − r3)(x − r4),

where r1, r2, r3 and r4 are distinct numbers called the base points (this form arises from Newton's interpolation), is evaluated in the form

c1 + (x − r1)(c2 + (x − r2)(c3 + (x − r3)(c4 + (x − r4)·c5))).

This method is called nested multiplication, Horner's method, or the Qin Jiushao algorithm. An implementation of it for the degree-4 polynomial above is as follows:

v1 = (x − r4)·c5;        v2 = (x − r3)(c4 + v1);
v3 = (x − r2)(c3 + v2);  v4 = (x − r1)(c2 + v3);
v5 = c1 + v4.

Next we give a Matlab function implementing nested multiplication for polynomials.

Remark 1.3.1 (Matlab command for nested multiplication)

% Program for nested multiplication
% Evaluates a polynomial in nested form using Horner's method
% Input:  degree d of the polynomial,
%         array of d+1 coefficients c (constant term first),
%         x-coordinate x at which to evaluate, and
%         array of d base points r, if needed
% Output: value y of the polynomial at x

function y=nest(d,c,x,r)
if nargin < 4, r=zeros(d,1); end
y=c(d+1);
for i=d:-1:1
    y=y.*(x-r(i))+c(i);
end
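For readers following along in another language, here is an equivalent Python version of nest, with the same conventions (constant term first, optional base points):

```python
def nest(c, x, r=None):
    """Evaluate c[0] + c[1](x-r[0]) + c[2](x-r[0])(x-r[1]) + ... by
    nested multiplication (Horner / Qin Jiushao). With r omitted the
    base points are all zero, giving an ordinary power-form polynomial."""
    d = len(c) - 1
    if r is None:
        r = [0.0] * d
    y = c[d]
    for i in range(d - 1, -1, -1):
        y = y * (x - r[i]) + c[i]
    return y

# p(x) = 6x^4 + 4x^3 - 3x^2 + 5x - 1 at x = 1/2 (Example 1.3.6)
print(nest([-1, 5, -3, 4, 6], 0.5))  # 1.625 = 13/8
```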

Example 1.3.7 Evaluate x^253 and count the number of multiplications.

Solution. Note that 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 = 255, so 253 = 255 − 2 = 1 + 4 + 8 + 16 + 32 + 64 + 128. Then

x^253 = x^255 / x^2 = x · x^4 · x^8 · x^16 · x^32 · x^64 · x^128.

Computing x^2, x^4, x^8, x^16, x^32, x^64, x^128 by repeated squaring takes 7 multiplications, and multiplying the seven factors x, x^4, x^8, x^16, x^32, x^64, x^128 together takes 6 more, so the number of multiplications is 7 + 6 = 13.
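The count can be reproduced programmatically. A Python sketch of square-and-multiply (binary exponentiation), which for n = 253 performs the same 13 multiplications:

```python
def power_count(x, n):
    """Compute x**n by square-and-multiply; return (value, multiplications)."""
    result, square, count, first = 1, x, 0, True
    while n:
        if n & 1:                  # this power of two occurs in n
            if first:
                result, first = square, False  # no multiplication needed yet
            else:
                result *= square
                count += 1
        n >>= 1
        if n:                      # square only while more bits remain
            square *= square
            count += 1
    return result, count

value, count = power_count(2, 253)
print(count)  # 13
```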



When we study sequences and series, some operations are repeated or iterated. In such cases error magnification, or stability, must be considered.
Note 5: Stable algorithms should be used.

Example 1.3.8 Let {an}, n ≥ 0, be the sequence of numbers defined by an = ∫_0^1 x^n e^(x−1) dx. Derive the iterative formula an = 1 − n·an−1 and then try to work out the values an (n = 0, 1, 2, ..., 16).

Solution. By integration by parts, we see that

an = ∫_0^1 x^n e^(x−1) dx = [x^n e^(x−1)]_0^1 − n ∫_0^1 x^(n−1) e^(x−1) dx = 1 − n·an−1, n = 1, 2, ···.

By using the following Matlab commands:

% To compute a sequence of numbers to test stable algorithms
clear; clc; syms x; a(1)=double(int(x*exp(x-1),0,1));
for i=1:10
    a(i+1)=1-(i+1)*a(i);
end
disp('The sequence of the numbers is a=:'), a

By running this program, we have

a = 0.63212056, −0.26424112, 1.79272336, −6.17089344,


31.85446720, −190.12680320, 1331.8876224, −10654.1009792,
95887.9088128, −958878.08812800, 10547659.969408.

However, if we use the Matlab built-in symbolic integration int directly, as follows,

% To compute a sequence of numbers by the command int
clear; clc; syms t; b(1)=double(int(t*exp(t-1),0,1));
for i=1:10
    b(i+1)=double(int(t^(i+1)*exp(t-1),0,1));
end
b

we obtain that

b = 0.367879441171, 0.264241117657, 0.207276647029, 0.170893411885,


0.145532940573, 0.126802356562, 0.112383504069, 0.100931967446,
0.091612292990, 0.083877070103, 0.077352228863.

It appears that the computation using the formula an = 1 − n·an−1 (the formula in theory) is incorrect. Next we explain the reason.
Let a∗n (n = 0, 1, ...) be the values provided by the computer, which satisfy the formula a∗n = 1 − n·a∗n−1 (the practical formula on the computer). Subtracting the two formulas leads to

an − a∗n = −n(an−1 − a∗n−1) = (−1)^2 n(n−1)(an−2 − a∗n−2) = ··· = (−1)^n n!(a0 − a∗0).

This error equation means that the error in the initial step is magnified n! times (10! = 3628800) by the n-th step. This phenomenon is called instability.
On the other hand, from the mathematical formula an = 1 − n·an−1 we see that

an−1 = (1 − an)/n.

Note that

0 < an = ∫_0^1 x^n e^(x−1) dx < ∫_0^1 x^n dx = 1/(n + 1),

so we take a20 ≈ 1/21. By using the formula an−1 = (1 − an)/n and the following commands:
n
% To compute the sequence by a stable formula
clear; clc; a(20)=1/21;
for i=20:(-1):2
a(i-1)=1/(i)*(1-a(i));
end
a(1:11)

we get the results:

a = 0.367879441171, 0.264241117657, 0.207276647029, 0.170893411885,


0.145532940573, 0.126802356561, 0.112383504069, 0.100931967446,
0.091612292990, 0.083877070103, 0.077352228863.
Thus the formula an−1 = (1 − an)/n is good and stable.
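The two recursions can be compared in a few lines of Python. The perturbation 1e-10 injected below stands in for the initial error, so the factorial blow-up of the forward recursion is already visible at n = 20, while the backward recursion divides its starting error by n at every step:

```python
import math

# forward: a_n = 1 - n * a_{n-1}; an initial error is multiplied by n!
a = 1 - 1/math.e + 1e-10     # a_0 with a deliberately injected error
for n in range(1, 21):
    a = 1 - n * a            # after 20 steps the error is about 1e-10 * 20!

# backward: a_{n-1} = (1 - a_n)/n, started from the crude guess a_20 ≈ 1/21
b = 1/21
for n in range(20, 1, -1):
    b = (1 - b) / n          # the starting error shrinks by a factor n

print(abs(a))          # huge: the forward recursion is unstable
print(b, 1/math.e)     # b approximates a_1 = 1/e to near machine accuracy
```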

For a system of linear equations Ax = b, the system stored on the computer is A∗x = b∗. How to estimate the error x − x∗, and how the errors in the entries of A∗ and b∗ affect the error of x∗, are complicated questions that will be discussed later, in Chapter 3.

Exercise 3 (Section 1.3)

1. Identify for which values of x there is subtraction of nearly equal numbers, and find an
alternate form that avoids the problem.

(a) (1 − sec x)/tan^2 x;   (b) (1 − (1 − x)^3)/x;   (c) 1/(1 + x) − 1/(1 − x).

2. (1) Explain how to most accurately compute the two roots of the equation x^2 + bx − 10^{−12} = 0, where b is a number greater than 100.
(2) Find the roots of the equation x^2 + 3x − 8^{−14} = 0 with three-digit accuracy.
3. Assume that the sequence {y_n} satisfies y_n = 15 y_{n−1} − 2, n = 1, 2, . . . , with initial value y_0 = 1.425. (a) Compute the values y_2, y_3, . . . , y_9; (b) Derive the error equation for the formula; (c) Is the algorithm (the formula) stable? If not, give a stable one to compute y_i (i = 2, 3, . . . , 9).
4. The volume of a sphere is V = (4/3)πr^3, where r is the radius of the sphere. Let r∗ denote an approximate value of r and e(r∗) = |r − r∗| the absolute error. If e(V∗) = |V − V∗| with V∗ = (4/3)π(r∗)^3, and we require e(V∗) ≤ 0.01, how small must e(r∗) be?

5. Let f (x) = ln(x − √(x^2 − 1)). (a) Compute f (50) with six significant digits and estimate its error; (b) Alternatively, if the equivalent form f (x) = − ln(x + √(x^2 − 1)) is used, answer the questions in (a); (c) Use Matlab and the two expressions of f (x) to compute f (10^10), and observe the difference.
6. Use the Matlab function nest.m to evaluate the values of the following polynomials at the points x_1 = 0, x_2 = 1, x_3 = 1/2, x_4 = 2.546.
(a) p(x) = 1 + x − 3x^2 + 6x^3; (b) q(x) = 2.5 + 1.2x + 4x^2 − 5.6x^3 + 6.8x^4 − 1.2x^5;
(c) r(x) = 1 + x + x^2 + · · · + x^25.
7. Write Matlab code to solve the equation x^2 + 10^8 x + 1 = 0 and compare with the results given by the Matlab command roots.

8. Assume that a∗ is an approximate value for the correct value a = 20 and that the
relative error of a∗ is less than 10−3 . How many significant digits does a∗ have?
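Exercises 2 and 7 above both involve the cancellation b − √(b^2 − 4ac) when b^2 ≫ |4ac|. A hedged Python sketch of the standard remedy — compute the large-magnitude root without cancellation, then recover the small one from the product of roots c/a (the function name is my own):

```python
import math

def stable_roots(a, b, c):
    """Roots of a*x^2 + b*x + c = 0, avoiding cancellation for large |b|."""
    d = math.sqrt(b * b - 4 * a * c)
    # choose the sign so that b and the discriminant ADD in magnitude
    q = -(b + math.copysign(d, b)) / 2
    return q / a, c / q            # x1 = q/a; x2 = c/q since x1*x2 = c/a

x1, x2 = stable_roots(1.0, 1e8, 1.0)
print(x1, x2)                      # approximately -1e8 and -1e-8
```

The naive formula (−b + √(b^2 − 4ac))/(2a) would return 0 for the small root here, a 100% relative error.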

1.4 Review of some results in linear algebra and calculus


In this section we list some concepts and results from linear algebra and calculus, since they will be used throughout this course.

Definition 1.4.1 (Vector space Pn[x]) For any integer n ≥ 1, let Pn[x] be the set of polynomials of the form a_{n−1}x^{n−1} + · · · + a_1 x + a_0, where a_{n−1}, a_{n−2}, · · · , a_0 are arbitrary real numbers. Then Pn[x], under the addition "+" and scalar multiplication, forms a linear space or vector space of dimension n, with standard basis 1, x, x^2, · · · , x^{n−1}.

Theorem 1.4.1 (Two bases of Pn[x]) For distinct real numbers x_0, x_1, · · · , x_{n−1}, it can be proved that (a) 1, x − x_0, (x − x_0)(x − x_1), . . . , (x − x_0)(x − x_1) · · · (x − x_{n−2}), and (b) L_0(x), L_1(x), · · · , L_{n−1}(x) form two bases of Pn[x], where (with ∏ denoting the product symbol)

L_0(x) = [(x − x_1)(x − x_2) · · · (x − x_{n−1})] / [(x_0 − x_1)(x_0 − x_2) · · · (x_0 − x_{n−1})] = ∏_{k=1}^{n−1} (x − x_k)/(x_0 − x_k),

and, for j = 1, · · · , n − 1,

L_j(x) = [(x − x_0) · · · (x − x_{j−1})(x − x_{j+1}) · · · (x − x_{n−1})] / [(x_j − x_0) · · · (x_j − x_{j−1})(x_j − x_{j+1}) · · · (x_j − x_{n−1})] = ∏_{k=0, k≠j}^{n−1} (x − x_k)/(x_j − x_k).
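The key property behind part (b) is that L_j(x_k) = 1 when k = j and 0 otherwise, which makes the L_j linearly independent. A small Python check (the nodes are chosen arbitrarily for illustration):

```python
def lagrange_basis(nodes, j, x):
    """Evaluate the Lagrange basis polynomial L_j at the point x."""
    val = 1.0
    for k, xk in enumerate(nodes):
        if k != j:
            val *= (x - xk) / (nodes[j] - xk)
    return val

nodes = [0.0, 1.0, 2.5, 4.0]
# Evaluating L_j at every node x_k should give the Kronecker delta:
table = [[lagrange_basis(nodes, j, xk) for xk in nodes]
         for j in range(len(nodes))]
print(table)   # 1.0 on the diagonal, 0.0 elsewhere
```

The diagonal entries are exactly 1 because every factor cancels, and the off-diagonal entries are exactly 0 because one factor vanishes.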

Definition 1.4.2 (Root) Let f (x) = a_n x^n + a_{n−1} x^{n−1} + · · · + a_1 x + a_0 be a polynomial of degree n, where a_n ≠ 0. If a number c satisfies f (c) = 0, then we say c is a root of f (x).

In other words, the zero points of the polynomial function f (x) or solutions of the equation
f (x) = 0 are the roots of the polynomial f (x).

Theorem 1.4.2 (Fundamental theorem of algebra) Every polynomial p(x) of degree n ≥ 1 has exactly n complex roots, counted with multiplicity.

Remark 1.4.1 For polynomials of degree one, two, three and four, the roots are given by explicit formulas (omitted here). For higher degrees, the Abel-Ruffini theorem asserts that no general formula involving arithmetic operations and radicals can express the roots of a polynomial of degree 5 or greater in terms of its coefficients.

Theorem 1.4.3 (Intermediate value theorem) Let f (x) be a continuous function on the inter-
val [a, b], and y be a value between f (a) and f (b). Then, there exists a number c in [a, b] such
that f (c) = y.

For example, f (x) = x^2 − 3 is continuous on [1, 3] with f (1) = −2 and f (3) = 6, so it must take on every value in between, in particular the values 0 and 1.
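This theorem underlies the bisection method for root finding, studied later; a minimal Python sketch applied to f (x) = x^2 − 3 on [1, 3]:

```python
def bisect(f, a, b, tol=1e-12):
    """Shrink [a, b] around a sign change of f; assumes f(a)*f(b) < 0."""
    while b - a > tol:
        m = (a + b) / 2
        if f(a) * f(m) <= 0:   # the sign change lies in [a, m]
            b = m
        else:                  # otherwise it lies in [m, b]
            a = m
    return (a + b) / 2

r = bisect(lambda x: x * x - 3, 1.0, 3.0)
print(r)   # approximately 1.7320508, i.e. sqrt(3)
```

Each step halves the interval while preserving the sign change, so the Intermediate Value Theorem guarantees a root stays trapped inside.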



Theorem 1.4.4 (Continuous limits) Let f be a continuous function in a neighborhood of x0 ,


and assume that lim_{n→∞} x_n = x_0. Then

lim_{n→∞} f (x_n) = f ( lim_{n→∞} x_n ) = f (x_0).

Theorem 1.4.5 (Mean value theorem) Let f be a continuously differentiable function on [a, b].
Then there exists a number c in (a, b) such that

f'(c) = (f (b) − f (a))/(b − a).

Theorem 1.4.6 (Rolle's theorem) Let the function f (x) be continuously differentiable on the closed interval [a, b] with f (a) = f (b). Then there exists at least one number c in (a, b) such that f'(c) = 0.

Theorem 1.4.7 (Taylor’s theorem with remainder) Let x0 and x be real numbers and let f be
k + 1 times continuously differentiable on the closed interval between x0 and x. Then there exists
a number c between x and x0 such that
f (x) = f (x_0) + f'(x_0)(x − x_0) + (f''(x_0)/2!)(x − x_0)^2 + · · · + (f^(k)(x_0)/k!)(x − x_0)^k + (f^(k+1)(c)/(k + 1)!)(x − x_0)^{k+1}.

The resulting polynomial in (x − x_0), i.e. everything except the final term, is called the Taylor polynomial of degree k for f at x_0. The final term is called the Taylor remainder.
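Since every derivative of cos is bounded by 1, the theorem bounds the error of the degree-4 Taylor polynomial of cos at x_0 = 0 by |x|^5/5!. A quick Python check (the sample point π/4 is my choice):

```python
import math

# Degree-4 Taylor polynomial of cos about x0 = 0:
#   p(x) = 1 - x^2/2! + x^4/4!
def p(x):
    return 1 - x**2 / 2 + x**4 / 24

x = math.pi / 4
err = abs(math.cos(x) - p(x))
bound = abs(x)**5 / math.factorial(5)   # remainder bound |x|^5 / 5!
print(err, bound)   # the actual error sits safely below the bound
```

Here err ≈ 3 × 10^{-4} while the bound is about 2.5 × 10^{-3}, consistent with the remainder formula.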

Theorem 1.4.8 (Zero point theorem) Let f be a continuous function on [a, b], satisfying f (a) f (b) <
0. Then f has a root between a and b, that is, there exists a number r satisfying a < r < b and
f (r) = 0.

Theorem 1.4.9 (Mean value theorem for integrals) Let f be a continuous function on the
closed interval [a, b], and let g be an integrable function that does not change sign on [a, b].
Then there exists a number c between a and b such that

∫_a^b f (x)g(x) dx = f (c) ∫_a^b g(x) dx.

Theorem 1.4.10 (Derivatives of some fundamental functions) Derivatives of some ordinary


functions are displayed as follows.

f (x)        f'(x)                 f (x)        f'(x)
x^μ          μ x^{μ−1}             a^x          a^x ln a
e^x          e^x                   log_a x      1/(x ln a)
cos(x)       − sin(x)              tan(x)       sec^2(x)
cot(x)       − csc^2(x)            arcsin(x)    1/√(1 − x^2)
arccos(x)    −1/√(1 − x^2)         arctan(x)    1/(1 + x^2)
                                   arccot(x)    −1/(1 + x^2)

Theorem 1.4.11 (Identities for trigonometric functions) Some identities for trigonometric func-
tions are shown below.
sin(α) + sin(β) = 2 sin((α + β)/2) cos((α − β)/2),   sin(α) − sin(β) = 2 cos((α + β)/2) sin((α − β)/2),
cos(α) + cos(β) = 2 cos((α + β)/2) cos((α − β)/2),   cos(α) − cos(β) = −2 sin((α + β)/2) sin((α − β)/2),
tan(α) + tan(β) = sin(α + β)/(cos(α) cos(β)),        tan(α) − tan(β) = sin(α − β)/(cos(α) cos(β)).
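Such identities are easy to spot-check numerically; a throwaway Python check of the first one (the angles are chosen arbitrarily):

```python
import math

# Numeric spot-check of sin(a) + sin(b) = 2 sin((a+b)/2) cos((a-b)/2)
alpha, beta = 0.7, 2.3
lhs = math.sin(alpha) + math.sin(beta)
rhs = 2 * math.sin((alpha + beta) / 2) * math.cos((alpha - beta) / 2)
print(abs(lhs - rhs))   # agreement up to rounding error
```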

Exercise 4 (Section 1.4)

1. For distinct numbers b_1, b_2, . . . , b_5, prove that the two sets of polynomials {1, x, · · · , x^4} and {1, x − b_1, (x − b_1)(x − b_2), · · · , (x − b_1)(x − b_2)(x − b_3)(x − b_4)} form two bases of the vector space P_5[x] = {a_0 + a_1 x + · · · + a_4 x^4}.
2. Use the Intermediate Value Theorem to prove that f (c) = 0 for some c in [0, 1], where
(a) f (x) = x^3 − 4x + 1; (b) f (x) = 4 cos(πx) − 3; (c) f (x) = 8x^4 − 8x^2 + 1.
3. Find c in [a, b] = [0, 1] satisfying the Mean Value Theorem for f (x) on [0, 1]:
(a) f (x) = (1/2)x^2; (b) f (x) = 1/(x + 1).

4. Find c satisfying the Mean Value Theorem for Integrals with f (x), g(x) on [0, 1]:
(a) f (x) = x, g(x) = 2x; (b) f (x) = x^2, g(x) = x; (c) f (x) = x, g(x) = e^x.
5. Find the Taylor polynomial of degree 2 about the point x = 0 for the following:
(a) f (x) = e^{x^2}; (b) f (x) = cos(4x); (c) f (x) = ln(x + 1).
6. Find the degree 5 Taylor polynomial p(x) centered at x = 0 for f (x) = cos(x), and find an upper bound for the error in approximating f (x) = cos(x) by p(x) for x ∈ [−π/4, π/4].
7. Review orthogonal polynomials and their properties.
8. Review some examples of differential equations with analytical solutions.
