Chapter
Mathematical models are an integral part of solving engineering problems. Many times, these
mathematical models are derived from engineering and science principles, while at other times
the models may be obtained from experimental data.
Mathematical models generally require mathematical procedures that
include but are not limited to
(A) differentiation,
(B) nonlinear equations,
(C) simultaneous linear equations,
(D) curve fitting by interpolation or regression,
(E) integration, and
(F) differential equations.
These mathematical procedures can sometimes be carried out exactly, as you must have
experienced in the series of calculus courses you have taken, but in most cases they
need to be carried out approximately using numerical methods. Let us see an example of such a
need from a real-life physical problem.
To make the fulcrum (Figure 1) of a bascule bridge, a long hollow steel shaft called the
trunnion is shrink fit into a steel hub. The resulting steel trunnion-hub assembly is then shrink
fit into the girder of the bridge.
[Figure 1: Trunnion, hub, and girder of the bascule bridge assembly.]
Introduction to Numerical Methods 01.01.2
This is done by first immersing the trunnion in a cold medium such as a dry-ice/alcohol
mixture. After the trunnion reaches the steady state temperature of the cold medium, the
trunnion outer diameter contracts. The trunnion is taken out of the medium and slid through
the hole of the hub (Figure 2).
When the trunnion heats up, it expands and creates an interference fit with the hub. In
1995, on one of the bridges in Florida, this assembly procedure did not work as designed.
Before the trunnion could be inserted fully into the hub, the trunnion got stuck. Luckily, the
trunnion was taken out before it got stuck permanently. Otherwise, a new trunnion and hub
would have needed to be ordered at a cost of $50,000. Coupled with construction delays, the total
loss could have been more than a hundred thousand dollars.
Why did the trunnion get stuck? This was because the trunnion had not contracted
enough to slide through the hole. Can you find out why?
A hollow trunnion of outside diameter 12.363" is to be fitted in a hub of inner
diameter 12.358". The trunnion was put in a dry-ice/alcohol mixture (the temperature of the
dry-ice/alcohol mixture is -108 °F) to contract the trunnion so that it can be slid
through the hole of the hub. To slide the trunnion without sticking, a diametrical clearance of
at least 0.01" is required between the trunnion and the hub. Assuming the room
temperature is 80 °F, is immersing the trunnion in the dry-ice/alcohol mixture a correct decision?
To calculate the contraction in the diameter of the trunnion, the thermal expansion
coefficient at room temperature is used. In that case the reduction $\Delta D$ in the outer diameter
of the trunnion is
$$\Delta D = D \alpha \Delta T \qquad (1)$$
where
$D$ = outer diameter of the trunnion,
$\alpha$ = coefficient of thermal expansion at room temperature, and
$\Delta T$ = change in temperature.
Given
$D = 12.363''$
$\alpha = 6.47 \times 10^{-6}$ in/in/°F at 80 °F
$$\Delta T = T_{fluid} - T_{room} = -108 - 80 = -188\ ^\circ\text{F}$$
where
$T_{fluid}$ = temperature of dry-ice/alcohol mixture
$T_{room}$ = room temperature
the reduction in the outer diameter of the trunnion is given by
$$\Delta D = (12.363)\left(6.47 \times 10^{-6}\right)(-188) = -0.01504''$$
So the trunnion is predicted to reduce in diameter by 0.01504" . But, is this enough
reduction in diameter? As per specifications, the trunnion needs to contract by
= trunnion outside diameter - hub inner diameter + diametric clearance
= 12.363 – 12.358 + 0.01
= 0.015"
So according to these calculations, immersing the steel trunnion in the dry-ice/alcohol
mixture gives the desired contraction of more than 0.015", as the predicted contraction
is 0.01504". But, when the steel trunnion was put in the hub, it got stuck. Why did this
happen? Was our mathematical model adequate for this problem, or did we make a
mathematical error?
As shown in Figure 3 and Table 1, the thermal expansion coefficient of steel decreases
as the temperature decreases and is not constant over the range of temperature the trunnion goes through.
Hence, Equation (1) would overestimate the thermal contraction.
[Figure 3: Coefficient of thermal expansion (in/in/°F) plotted against temperature (°F).]
The contraction in the diameter of the trunnion for which the thermal expansion coefficient
varies as a function of temperature is given by
$$\Delta D = D \int_{T_{room}}^{T_{fluid}} \alpha \, dT \qquad (2)$$
So one needs to curve fit the data to find the coefficient of thermal expansion as a function of
temperature. This is done by regression where we best fit a curve through the data given in
Table 1. In this case, we may fit a second order polynomial
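$$\alpha = a_0 + a_1 T + a_2 T^2 \qquad (3)$$
with the regression coefficients
$$a_0 = 6.0150 \times 10^{-6}, \quad a_1 = 6.1946 \times 10^{-9}, \quad a_2 = -1.2278 \times 10^{-11}.$$
Substituting Equation (3) into Equation (2) gives
$$\Delta D = D \int_{T_{room}}^{T_{fluid}} \left(a_0 + a_1 T + a_2 T^2\right) dT$$
$$= 12.363 \left[ 6.0150 \times 10^{-6}\,(-108 - 80) + 6.1946 \times 10^{-9}\, \frac{(-108)^2 - (80)^2}{2} - 1.2278 \times 10^{-11}\, \frac{(-108)^3 - (80)^3}{3} \right]$$
$$= -0.013689''$$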
Figure 4 Second order polynomial regression model for coefficient of thermal expansion
(μin/in/°F) as a function of temperature (°F).
What do we find here? The contraction in the trunnion is not enough to meet the required
specification of 0.015" .
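The two estimates of the contraction can be checked with a few lines of Python. This is only a sketch under stated assumptions: the variable and function names are illustrative, and the sign of `a2` is taken as negative because that is the sign that reproduces the 0.013689" figure quoted above.

```python
# Contraction of the trunnion diameter, estimated two ways.
D = 12.363                       # outer diameter, in
T_room, T_fluid = 80.0, -108.0   # temperatures, degF

# Equation (1): constant coefficient taken at room temperature.
alpha_room = 6.47e-6             # in/in/degF at 80 degF
dD_const = D * alpha_room * (T_fluid - T_room)

# Equation (2) with the quadratic fit of Equation (3), integrated term by term.
# Assumption: a2 is negative (sign inferred from the quoted result).
a0, a1, a2 = 6.0150e-6, 6.1946e-9, -1.2278e-11

def antiderivative(T):
    """Antiderivative of a0 + a1*T + a2*T**2 with respect to T."""
    return a0 * T + a1 * T**2 / 2 + a2 * T**3 / 3

dD_poly = D * (antiderivative(T_fluid) - antiderivative(T_room))

required = 12.363 - 12.358 + 0.01   # contraction needed, in

print(f"constant alpha : {dD_const:.5f} in")
print(f"quadratic alpha: {dD_poly:.6f} in")
print(f"required       : {required:.3f} in")
```

The constant-coefficient estimate just clears the required contraction in magnitude, while the temperature-dependent estimate falls short of it.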
So here are some questions that you may want to ask yourself.
1. What if the trunnion were immersed in liquid nitrogen (boiling temperature
-321 °F)? Will that cause enough contraction in the trunnion?
2. Rather than regressing the thermal expansion coefficient data to a second order
polynomial so that one can find the contraction in the trunnion OD, how would you use
the trapezoidal rule of integration for unequal segments? What is the relative difference
between the two results?
3. We chose a second order polynomial for regression. Would a different order
polynomial be a better choice for regression? Is there an optimum order of polynomial
you can find?
As mentioned at the beginning of this chapter, we generally see mathematical
procedures that require the solution of nonlinear equations, differentiation, solution of
simultaneous linear equations, interpolation, regression, integration, and differential equations.
A physical example to illustrate the need for each of these mathematical procedures is given
in the beginning of each chapter. You may want to look at them now to understand better why
we need numerical methods in everyday life.
Measuring Errors
In any numerical analysis, errors will arise during the calculations. To be able to deal with the
issue of errors, we need to
(A) identify where the error is coming from, followed by
(B) quantifying the error, and lastly
(C) minimizing the error as per our needs.
In this chapter, we will concentrate on item (B), that is, how to quantify errors.
When a problem is solved numerically, we will only have access to approximate values. We need to know how to quantify
error for such cases.
Approximate error is denoted by $E_a$ and is defined as the difference between the present
approximation and previous approximation.
$$E_a = \text{Present Approximation} - \text{Previous Approximation}$$
Example 3
The derivative of a function $f(x)$ at a particular value of $x$ can be approximately calculated
by
$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$
For $f(x) = 7e^{0.5x}$ and at $x = 2$, find the following
a) $f'(2)$ using $h = 0.3$
b) $f'(2)$ using $h = 0.15$
c) approximate error for the value of $f'(2)$ for part (b)
Solution
a) The approximate expression for the derivative of a function is
$$f'(x) \approx \frac{f(x+h) - f(x)}{h}.$$
For $x = 2$ and $h = 0.3$,
$$f'(2) \approx \frac{f(2+0.3) - f(2)}{0.3} = \frac{f(2.3) - f(2)}{0.3} = \frac{7e^{0.5(2.3)} - 7e^{0.5(2)}}{0.3}$$
$$= \frac{22.107 - 19.028}{0.3} = 10.265$$
b) Repeat the procedure of part (a) with $h = 0.15$,
$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$
For $x = 2$ and $h = 0.15$,
$$f'(2) \approx \frac{f(2+0.15) - f(2)}{0.15} = \frac{f(2.15) - f(2)}{0.15} = \frac{7e^{0.5(2.15)} - 7e^{0.5(2)}}{0.15}$$
$$= \frac{20.510 - 19.028}{0.15} = 9.8799$$
c) So the approximate error, $E_a$, is
$$E_a = \text{Present Approximation} - \text{Previous Approximation} = 9.8799 - 10.265 = -0.38474$$
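The three parts of this example can be reproduced with a short script. It is a minimal sketch; the helper name `forward_difference` is an illustrative choice, not something defined in the text.

```python
import math

def f(x):
    """The example function f(x) = 7 e^(0.5 x)."""
    return 7 * math.exp(0.5 * x)

def forward_difference(func, x, h):
    """Approximate func'(x) as (func(x + h) - func(x)) / h."""
    return (func(x + h) - func(x)) / h

previous = forward_difference(f, 2, 0.3)    # part (a)
present = forward_difference(f, 2, 0.15)    # part (b)
approx_error = present - previous           # part (c)

print(f"h = 0.30: {previous:.5g}")
print(f"h = 0.15: {present:.5g}")
print(f"E_a     : {approx_error:.5g}")
```

Because the script carries full precision through every step, the last digit or two can differ from hand calculations that round intermediate values.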
The magnitude of approximate error does not show how bad the error is. An approximate
error of $E_a = -0.38474$ may seem to be small; but for $f(x) = 7 \times 10^{-6} e^{0.5x}$, the approximate
error in calculating $f'(2)$ with $h = 0.15$ would be $E_a = -0.38474 \times 10^{-6}$. This value of
approximate error is smaller, even though the two problems are similar in that they use the same
value of the function argument, $x = 2$, and the same step sizes, $h = 0.15$ and $h = 0.3$. This brings us to the
definition of relative approximate error.
Q: While solving a mathematical model using numerical methods, how can we use relative
approximate errors to minimize the error?
A: In a numerical method that uses iterative methods, a user can calculate the relative approximate
error $\epsilon_a$ at the end of each iteration. The user may pre-specify a minimum acceptable
tolerance called the pre-specified tolerance, $\epsilon_s$. If the absolute relative approximate error
$|\epsilon_a|$ is less than or equal to the pre-specified tolerance $\epsilon_s$, that is, $|\epsilon_a| \le \epsilon_s$, then the
acceptable error has been reached and no more iterations would be required.
Alternatively, one may pre-specify how many significant digits one would like to be
correct in the answer. In that case, if one wants at least $m$ significant digits to be correct in
the answer, then one would need to have the absolute relative approximate error,
$|\epsilon_a| \le 0.5 \times 10^{2-m}\,\%$.
Example 5
If one chooses 6 terms of the Maclaurin series for $e^x$ to calculate $e^{0.7}$, how many significant
digits can you trust in the solution? Find your answer without knowing or using the exact
answer.
Solution
$$e^x = 1 + x + \frac{x^2}{2!} + \cdots$$
Using 6 terms, we get the current approximation as
$$e^{0.7} \approx 1 + 0.7 + \frac{0.7^2}{2!} + \frac{0.7^3}{3!} + \frac{0.7^4}{4!} + \frac{0.7^5}{5!} = 2.0136$$
Using 5 terms, we get the previous approximation as
$$e^{0.7} \approx 1 + 0.7 + \frac{0.7^2}{2!} + \frac{0.7^3}{3!} + \frac{0.7^4}{4!} = 2.0122$$
The percentage absolute relative approximate error is
$$|\epsilon_a| = \left|\frac{2.0136 - 2.0122}{2.0136}\right| \times 100 = 0.069527\%$$
Since $|\epsilon_a| \le 0.5 \times 10^{2-2}\,\% = 0.5\%$, at least 2 significant digits are correct in the answer of
$e^{0.7} \approx 2.0136$.
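The same check can be automated. The sketch below is illustrative only: `maclaurin_exp` and `significant_digits` are hypothetical helper names, and the digit count simply searches for the largest $m$ that satisfies $|\epsilon_a| \le 0.5 \times 10^{2-m}\,\%$.

```python
import math

def maclaurin_exp(x, n_terms):
    """Sum of the first n_terms terms of the Maclaurin series for e^x."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))

def significant_digits(rel_err_percent, max_digits=15):
    """Largest m with rel_err_percent <= 0.5 * 10**(2 - m), capped at max_digits."""
    m = 0
    while m < max_digits and rel_err_percent <= 0.5 * 10 ** (2 - (m + 1)):
        m += 1
    return m

present = maclaurin_exp(0.7, 6)    # 6 terms
previous = maclaurin_exp(0.7, 5)   # 5 terms
eps_a = abs((present - previous) / present) * 100

print(f"present = {present:.5g}, previous = {previous:.5g}")
print(f"|eps_a| = {eps_a:.5g} %")
print(f"significant digits trusted: {significant_digits(eps_a)}")
```

The printed relative error differs slightly from the hand calculation in its trailing digits because the script does not round the two approximations before subtracting.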
Sources of Error
After reading this chapter, you should be able to:
1. know that there are two inherent sources of error in numerical methods: round-off
and truncation error,
2. recognize the sources of round-off and truncation error, and
3. know the difference between round-off and truncation error.
Error in solving an engineering or science problem can arise due to several factors.
First, the error may be in the modeling technique. A mathematical model may be based on
using assumptions that are not acceptable. For example, one may assume that the drag force
on a car is proportional to the velocity of the car, but actually it is proportional to the square of
the velocity of the car. This itself can create huge errors in determining the performance of the
car, no matter how accurate the numerical methods you may use are. Second, errors may arise
from mistakes in programs themselves or in the measurement of physical quantities. But, in
applications of numerical methods themselves, the two errors we need to focus on are
(A) Round-off error
(B) Truncation error.
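$$\frac{1}{10} - \left(0 \times 2^{-1} + 0 \times 2^{-2} + 0 \times 2^{-3} + 1 \times 2^{-4} + \cdots + 1 \times 2^{-22} + 0 \times 2^{-23} + 0 \times 2^{-24}\right) = 9.537 \times 10^{-8}$$
The battery was on for 100 consecutive hours, hence causing an inaccuracy of
$$9.537 \times 10^{-8}\ \frac{\text{s}}{0.1\ \text{s}} \times 100\ \text{hr} \times \frac{3600\ \text{s}}{1\ \text{hr}} = 0.3433\ \text{s}$$
The shift in the range gate due to this 0.3433 s error was calculated as 687 m. For
the Patriot missile defense system, the target is considered out of range if the shift is
more than 137 m.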
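The size of the clock error can be reproduced with exact rational arithmetic. This is a sketch under an explicit assumption: the value 0.1 is chopped after `BITS = 23` binary places, a choice made here only because it reproduces the $9.537 \times 10^{-8}$ figure quoted above; the register layout itself is not described in this section.

```python
from fractions import Fraction

BITS = 23  # assumed number of binary places kept after the binary point

# 0.1 chopped (not rounded) to BITS binary places.
chopped = Fraction((2**BITS) // 10, 2**BITS)

# Time lost on every 0.1 s clock tick, in seconds.
error_per_tick = Fraction(1, 10) - chopped

# Number of 0.1 s ticks in 100 hours.
ticks = 100 * 3600 * 10

drift = float(error_per_tick * ticks)

print(f"error per tick : {float(error_per_tick):.4g} s")
print(f"drift in 100 hr: {drift:.4f} s")
```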
n    e^1.2     E_a         |ε_a| %
1    1         -           -
2    2.2       1.2         54.546
3    2.92      0.72        24.658
4    3.208     0.288       8.9776
5    3.2944    0.0864      2.6226
6    3.3151    0.020736    0.62550
follows.
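$$f(x) = x^2$$
From the definition of the derivative of a function,
$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} = \lim_{\Delta x \to 0} \frac{(x + \Delta x)^2 - (x)^2}{\Delta x}$$
$$= \lim_{\Delta x \to 0} \frac{x^2 + 2x\Delta x + (\Delta x)^2 - x^2}{\Delta x} = \lim_{\Delta x \to 0} (2x + \Delta x) = 2x$$
This is the same expression you would have obtained by directly using the formula from your
differential calculus class
$$\frac{d}{dx}(x^n) = nx^{n-1}$$
By this formula for
$$f(x) = x^2$$
$$f'(x) = 2x$$
The exact value of $f'(3)$ is
$$f'(3) = 2 \times 3 = 6$$
If we now choose $\Delta x = 0.2$, we get
$$f'(3) \approx \frac{f(3 + 0.2) - f(3)}{0.2} = \frac{f(3.2) - f(3)}{0.2} = \frac{3.2^2 - 3^2}{0.2} = \frac{10.24 - 9}{0.2} = \frac{1.24}{0.2} = 6.2$$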
because we wanted to have no round-off error in our calculations so that the truncation error
can be isolated. The truncation error in this example is
$$6 - 6.2 = -0.2.$$
Can you reduce the truncation error by choosing a smaller $\Delta x$?
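One way to explore that question is to repeat the calculation for a sequence of shrinking step sizes. The sketch below uses illustrative names and halves $\Delta x$ three times; for $f(x) = x^2$ the truncation error of this formula works out to $-\Delta x$, so it should shrink in proportion to the step.

```python
def f(x):
    return x**2

def forward_difference(func, x, dx):
    """Approximate func'(x) as (func(x + dx) - func(x)) / dx."""
    return (func(x + dx) - func(x)) / dx

exact = 6.0  # f'(3) = 2 * 3

for dx in (0.2, 0.1, 0.05, 0.025):
    approx = forward_difference(f, 3.0, dx)
    print(f"dx = {dx:<6} approx = {approx:.4f}  truncation error = {exact - approx:+.4f}")
```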
Another example of truncation error is the numerical integration of a function,
$$I = \int_a^b f(x)\,dx$$
Exact calculations require us to calculate the area under the curve by adding the area
of the rectangles as shown in Figure 2. However, exact calculations require an infinite number
of such rectangles. Since we cannot choose an infinite number of rectangles, we will have
truncation error.
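For example, to find
$$\int_3^9 x^2\,dx,$$
we have, using two rectangles of equal width,
$$\int_3^9 x^2\,dx \approx \left(x^2\right)\Big|_{x=3}(6 - 3) + \left(x^2\right)\Big|_{x=6}(9 - 6) = (3^2)3 + (6^2)3 = 27 + 108 = 135$$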
[Figure 2: Plot of $y = x^2$ showing the approximate area under the curve.]
[Figure 3: Plot of $y = x^2$ showing the approximate area under the curve.]
References
“Patriot Missile Defense – Software Problem Led to System Failure at Dhahran,
Saudi Arabia”, GAO Report, General Accounting Office, Washington DC,
February 4, 1992.
In everyday life, we use a number system with a base of 10. For example, look at the
number 257.76. Each digit in 257.76 has a value of 0 through 9 and has a place value. It can
be written as
$$257.76 = 2 \times 10^2 + 5 \times 10^1 + 7 \times 10^0 + 7 \times 10^{-1} + 6 \times 10^{-2}$$
In a binary system, we have a similar system where the base is made of only two digits, 0 and
1. So it is a base-2 system. A number like (1011.0011) in base-2 represents the decimal number
$$(1011.0011)_2 = \left((1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0) + (0 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} + 1 \times 2^{-4})\right)_{10}$$
$$= 11.1875$$
in the decimal system.
To understand the binary system, we need to be able to convert binary numbers to
decimal numbers and vice-versa.
We have already seen an example of how binary numbers are converted to decimal
numbers. Let us see how we can convert a decimal number to a binary number. For example
take the decimal number 11.1875. First, look at the integer part: 11.
1. Divide 11 by 2. This gives a quotient of 5 and a remainder of 1. Since the remainder
is 1, $a_0 = 1$.
2. Divide the quotient 5 by 2. This gives a quotient of 2 and a remainder of 1. Since
the remainder is 1, $a_1 = 1$.
3. Divide the quotient 2 by 2. This gives a quotient of 1 and a remainder of 0. Since
the remainder is 0, $a_2 = 0$.
4. Divide the quotient 1 by 2. This gives a quotient of 0 and a remainder of 1. Since
the remainder is 1, $a_3 = 1$.
Since the quotient now is 0, the process is stopped. The above steps are summarized in Table
1.
         Quotient    Remainder
11/2     5           1 = a_0
5/2      2           1 = a_1
2/2      1           0 = a_2
1/2      0           1 = a_3
Hence
$$(11)_{10} = (a_3 a_2 a_1 a_0)_2 = (1011)_2$$
For any integer, the algorithm for finding the binary equivalent is given in the flow chart on
the next page.
Now let us look at the decimal part, that is, 0.1875.
1. Multiply 0.1875 by 2. This gives 0.375. The number before the decimal is 0 and the
number after the decimal is 0.375. Since the number before the decimal is 0, $a_{-1} = 0$.
2. Multiply the number after the decimal, that is, 0.375 by 2. This gives 0.75. The number
before the decimal is 0 and the number after the decimal is 0.75. Since the number
before the decimal is 0, $a_{-2} = 0$.
3. Multiply the number after the decimal, that is, 0.75 by 2. This gives 1.5. The number
before the decimal is 1 and the number after the decimal is 0.5. Since the number
before the decimal is 1, $a_{-3} = 1$.
4. Multiply the number after the decimal, that is, 0.5 by 2. This gives 1.0. The number
before the decimal is 1 and the number after the decimal is 0. Since the number before
the decimal is 1, $a_{-4} = 1$.
Since the number after the decimal is 0, the conversion is complete. The above steps are
summarized in Table 2.
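Flowchart: converting a base-10 integer to binary format
1. Start. Input the integer $(N)_{10}$ to be converted to binary format.
2. Set $i = 0$.
3. Divide $N$ by 2 to get the quotient $Q$ and the remainder $R$. Set $a_i = R$.
4. Is $Q = 0$? If no, set $i = i + 1$, replace $N$ by $Q$, and go back to step 3.
5. If yes, set $n = i$; then $(N)_{10} = (a_n \ldots a_0)_2$. Stop.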
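The repeated-division procedure for integers can be sketched in Python as follows. The function name `integer_to_binary` is an illustrative choice, and the sketch assumes a non-negative integer input.

```python
def integer_to_binary(n):
    """Binary digits of a non-negative integer, found by repeated division by 2."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)   # quotient becomes the next number to divide
        bits.append(str(remainder))   # remainders arrive as a_0, a_1, a_2, ...
    return "".join(reversed(bits))    # written out as a_n ... a_1 a_0

print(integer_to_binary(11))
print(integer_to_binary(13))
```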
Hence
$$(0.1875)_{10} = (a_{-1} a_{-2} a_{-3} a_{-4})_2 = (0.0011)_2$$
The algorithm for any fraction is given in a flowchart on the next page.
Having calculated
$$(11)_{10} = (1011)_2$$
and
$$(0.1875)_{10} = (0.0011)_2,$$
we have
$$(11.1875)_{10} = (1011.0011)_2.$$
In the above example, when we were converting the fractional part of the number, we were left
with 0 after the decimal point and used that as a place to stop. In many cases, we are never
left with a 0 after the decimal point. For example, finding the binary equivalent of 0.3 is
summarized in Table 3.
As you can see, the process will never end. In this case, the number can only be approximated
in binary format, that is,
$$(0.3)_{10} \approx (a_{-1} a_{-2} a_{-3} a_{-4} a_{-5})_2 = (0.01001)_2$$
Q: But what is the mathematics behind this process of converting a decimal number to binary
format?
A: Let $z$ be the decimal number written as
$$z = x.y$$
where
$x$ is the integer part and $y$ is the fractional part.
We want to find the binary equivalent of $x$. So we can write
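Flowchart: converting a base-10 fraction to binary format
1. Start. Input the fraction $(F)_{10}$ to be converted to binary format.
2. Set $i = -1$.
3. Multiply $F$ by 2 to get the number before the decimal, $S$, and the number after the decimal, $T$. Set $a_i = S$.
4. Is $T = 0$? If no, set $i = i - 1$, replace $F$ by $T$, and go back to step 3.
5. If yes, set $n = -i$; then $(F)_{10} = (a_{-1} \ldots a_{-n})_2$. Stop.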
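The repeated-multiplication procedure for fractions can be sketched the same way. The name `fraction_to_binary` and the `max_bits` cutoff are assumptions added here; the cutoff is needed because, as with 0.3, the process may never terminate on its own.

```python
def fraction_to_binary(frac, max_bits=8):
    """Binary digits of a fraction 0 <= frac < 1, found by repeated multiplication by 2.
    Stops when nothing is left after the point or after max_bits digits."""
    bits = []
    while frac != 0 and len(bits) < max_bits:
        frac *= 2
        before_point = int(frac)        # S in the flowchart: 0 or 1
        bits.append(str(before_point))  # digits arrive as a_-1, a_-2, ...
        frac -= before_point            # T in the flowchart
    return "0." + "".join(bits)

print(fraction_to_binary(0.1875))
print(fraction_to_binary(0.875))
print(fraction_to_binary(0.3, max_bits=5))
```

For 0.3 the output is only an approximation of the true value, cut off after five binary places.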
Example 1
Convert $(11.1875)_{10}$ to base 2.
Solution
To convert $(11)_{10}$ to base 2, what is the highest power of 2 that is part of 11? That power is
3, as $2^3 = 8$, to give
$$11 = 2^3 + 3$$
What is the highest power of 2 that is part of 3? That power is 1, as $2^1 = 2$, to give
$$3 = 2^1 + 1$$
So
$$11 = 2^3 + 3 = 2^3 + 2^1 + 1$$
What is the highest power of 2 that is part of 1? That power is 0, as $2^0 = 1$, to give
$$1 = 2^0$$
Hence
$$(11)_{10} = 2^3 + 2^1 + 1 = 2^3 + 2^1 + 2^0 = 1 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 = (1011)_2$$
To convert $(0.1875)_{10}$ to base 2, we proceed as follows. What is the largest power of 2 with a
negative exponent that is less than or equal to 0.1875? That power is $-3$, as $2^{-3} = 0.125$.
So
$$0.1875 = 2^{-3} + 0.0625$$
What is the next largest power of 2 with a negative exponent that is less than or equal to 0.0625? That power
is $-4$, as $2^{-4} = 0.0625$.
So
$$0.1875 = 2^{-3} + 2^{-4}$$
Hence
$$(0.1875)_{10} = 2^{-3} + 0.0625 = 2^{-3} + 2^{-4} = 0 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} + 1 \times 2^{-4} = (0.0011)_2$$
Since
$$(11)_{10} = (1011)_2$$
and
$$(0.1875)_{10} = (0.0011)_2$$
we get
$$(11.1875)_{10} = (1011.0011)_2$$
Can you show this algebraically for any general number?
Example 2
Convert $(13.875)_{10}$ to base 2.
Solution
For $(13)_{10}$, conversion to binary format is shown in Table 4.
So
$$(13)_{10} = (1101)_2.$$
Conversion of $(0.875)_{10}$ to binary format is shown in Table 5.
So
$$(0.875)_{10} = (0.111)_2$$
Hence
$$(13.875)_{10} = (1101.111)_2$$
Consider an old-time cash register that would ring any purchase between 0 and 999.99 units of
money. Note that there are five (not six) working spaces in the cash register (the decimal
point is shown just for clarification).
Q: How will the smallest number 0 be represented?
A: The number 0 will be represented as
0 0 0 . 0 0
Q: Now look at any typical number between 0 and 999.99, such as 256.78. How would it be
represented?
A: The number 256.78 will be represented as
2 5 6 . 7 8
For another number, 3.546, rounding it off to 3.55 accounts for the same round-off
error of $3.546 - 3.55 = -0.004$. The relative error in this case is
$$\epsilon_t = \frac{-0.004}{3.546} \times 100 = -0.11280\%.$$
Q: If I am interested in keeping relative errors of similar magnitude for the range of numbers,
what alternatives do I have?
A: To keep the relative error of similar order for all numbers, one may use a floating-point
representation of the number. For example, in floating-point representation, a number
256.78 is written as $2.5678 \times 10^2$,
0.003678 is written as $3.678 \times 10^{-3}$, and
256.789 is written as $2.56789 \times 10^2$.
The general representation of a number in base-10 format is given as
$$\text{sign} \times \text{mantissa} \times 10^{\text{exponent}}$$
or for a number $y$,
$$y = \sigma \times m \times 10^e$$
where
$\sigma$ = sign of the number, $+1$ or $-1$
$m$ = mantissa (also called significand), $1 \le m < 10$
$e$ = integer exponent
Let us go back to the example where we have five spaces available for a number. Let us also
limit ourselves to positive numbers with positive exponents for this example. If we use the
same five spaces, then let us use four for the mantissa and the last one for the exponent. So
the smallest number that can be represented is 1 but the largest number would be $9.999 \times 10^9$.
By using the floating-point representation, what we lose in accuracy, we gain in the range of
numbers that can be represented. For our example, the maximum number represented changed
from 999.99 to $9.999 \times 10^9$.
What is the error in representing numbers in the scientific format? Take the previous
example of 256.78. It would be represented as $2.568 \times 10^2$ and in the five spaces as
2 5 6 8 2
Another example, the number 576329.78 would be represented as $5.763 \times 10^5$ and in five
spaces as
5 7 6 3 5
So, how much error is caused by such a representation? In representing 256.78, the round-off
error created is $256.78 - 256.8 = -0.02$, and the relative error is
$$\epsilon_t = \frac{-0.02}{256.78} \times 100 = -0.0077888\%.$$
In representing 576329.78, the round-off error created is $576329.78 - 5.763 \times 10^5 = 29.78$,
and the relative error is
$$\epsilon_t = \frac{29.78}{576329.78} \times 100 = 0.0051672\%.$$
What you are seeing now is that although the errors are large for large numbers, the relative
errors are of the same order for both large and small numbers.
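The five-space format with a four-digit mantissa can be imitated in a few lines. This is a rough sketch for positive numbers only; `four_digit_float` is a hypothetical helper, and it rounds the mantissa to three places after the point as in the examples above.

```python
import math

def four_digit_float(x):
    """Return (mantissa, exponent) with a 4-digit mantissa, 1 <= mantissa < 10.
    Sketch for positive x only."""
    exponent = math.floor(math.log10(x))
    mantissa = round(x / 10**exponent, 3)
    if mantissa >= 10:            # rounding pushed the mantissa up to 10.000
        mantissa /= 10
        exponent += 1
    return mantissa, exponent

for value in (256.78, 576329.78):
    m, e = four_digit_float(value)
    stored = m * 10**e
    round_off = value - stored
    relative = round_off / value * 100
    print(f"{value}: stored as {m} x 10^{e}, "
          f"round-off error = {round_off:.2f}, relative error = {relative:.5g} %")
```

The absolute errors differ by three orders of magnitude, while the relative errors stay within the same order.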
Example 1
Represent $(54.75)_{10}$ in floating-point binary format. Assume that the number is written to a
hypothetical word that is 9 bits long, where the first bit is used for the sign of the number, the
second bit for the sign of the exponent, the next four bits for the mantissa, and the next three
bits for the exponent.
Solution
$$(54.75)_{10} = (110110.11)_2 = (1.1011011)_2 \times 2^{(5)_{10}}$$
The sign of the number is positive, so the bit for the sign of the number will have zero in it.
The sign of the exponent is positive. So the bit for the sign of the exponent will have zero in
it.
The mantissa is
$$m = 1011$$
(there are only 4 places for the mantissa, and the leading 1 is not stored as it is always expected
to be there), and the exponent is
$$e = 101.$$
We have the representation as
0 0 1 0 1 1 1 0 1
Example 2
What number does the floating-point format given below
0 1 1 0 1 1 1 1 0
represent in base-10 format? Assume a hypothetical 9-bit word, where the first bit is used for
the sign of the number, the second bit for the sign of the exponent, the next four bits for the mantissa,
and the next three for the exponent.
Solution
Given
Bit Representation    Part of Floating-Point Number
0                     Sign of number
1                     Sign of exponent
1011                  Magnitude of mantissa
110                   Magnitude of exponent
Example 3
A machine stores floating-point numbers in a hypothetical 10-bit binary word. It employs the
first bit for the sign of the number, the second one for the sign of the exponent, the next four
for the exponent, and the last four for the magnitude of the mantissa.
a) Find how 0.02832 will be represented in the floating-point 10-bit word.
b) What is the decimal equivalent of the 10-bit word representation of part (a)?
Solution
a) For the number, we have the integer part as 0 and the fractional part as 0.02832.
Let us first find the binary equivalent of the integer part
Integer part: $(0)_{10} = (0)_2$
Now we find the binary equivalent of the fractional part
Fractional part:
0.02832 × 2 = 0.05664
0.05664 × 2 = 0.11328
0.11328 × 2 = 0.22656
0.22656 × 2 = 0.45312
0.45312 × 2 = 0.90624
0.90624 × 2 = 1.81248
0.81248 × 2 = 1.62496
0.62496 × 2 = 1.24992
0.24992 × 2 = 0.49984
0.49984 × 2 = 0.99968
0.99968 × 2 = 1.99936
Hence
$$(0.02832)_{10} \approx (0.00000111001)_2 = (1.11001)_2 \times 2^{-6} \approx (1.1100)_2 \times 2^{-6}$$
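The binary equivalent of the exponent is found as follows
         Quotient    Remainder
6/2      3           0 = a_0
3/2      1           1 = a_1
1/2      0           1 = a_2
So
$$(6)_{10} = (110)_2$$
So
$$(0.02832)_{10} \approx (1.1100)_2 \times 2^{-(110)_2} = (1.1100)_2 \times 2^{-(0110)_2}$$
and the 10-bit word is
0 1 0 1 1 0 1 1 0 0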
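b) Converting the above floating-point representation from part (a) to base 10 by following
Example 2 gives
$$(1.1100)_2 \times 2^{-(0110)_2} = \left(1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} + 0 \times 2^{-3} + 0 \times 2^{-4}\right) \times 2^{-\left(0 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 0 \times 2^0\right)}$$
$$= (1.75)_{10} \times 2^{-(6)_{10}} = 0.02734375$$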
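Both parts of this example can be mirrored in code. The sketch below assumes the 10-bit layout described in the problem statement (sign of number, sign of exponent, four exponent bits, four mantissa bits), chops rather than rounds the mantissa, and only handles nonzero numbers whose exponent fits in four bits; `encode` and `decode` are illustrative names.

```python
import math

def encode(x, exp_bits=4, mant_bits=4):
    """Bit string: sign of number, sign of exponent, exponent magnitude, mantissa."""
    sign_bit = "1" if x < 0 else "0"
    frac, exp = math.frexp(abs(x))           # abs(x) = frac * 2**exp, 0.5 <= frac < 1
    mantissa, exponent = frac * 2, exp - 1   # rescale so that 1 <= mantissa < 2
    exp_sign_bit = "1" if exponent < 0 else "0"
    exp_field = format(abs(exponent), f"0{exp_bits}b")
    # Drop the leading 1 and chop the rest to mant_bits binary places.
    mant_field = format(int((mantissa - 1) * 2**mant_bits), f"0{mant_bits}b")
    return sign_bit + exp_sign_bit + exp_field + mant_field

def decode(word, exp_bits=4, mant_bits=4):
    """Inverse of encode: rebuild the base-10 value stored in the word."""
    sign = -1 if word[0] == "1" else 1
    exp_sign = -1 if word[1] == "1" else 1
    exponent = exp_sign * int(word[2:2 + exp_bits], 2)
    mantissa = 1 + int(word[2 + exp_bits:], 2) / 2**mant_bits
    return sign * mantissa * 2.0**exponent

word = encode(0.02832)
print(word)
print(decode(word))
```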
Q: How do you determine the accuracy of a floating-point representation of a number?
A: The machine epsilon, $\epsilon_{mach}$, is a measure of the accuracy of a floating-point representation
and is found by calculating the difference between 1 and the next number that can be
represented. For example, assume a 10-bit hypothetical computer where the first bit is used
for the sign of the number, the second bit for the sign of the exponent, the next four bits for the
exponent and the next four for the mantissa.
We represent 1 as
0 0 0 0 0 0 0 0 0 0
and the next higher number that can be represented is
0 0 0 0 0 0 0 0 0 1
The difference between the two numbers is
$$(1.0001)_2 \times 2^{(0000)_2} - (1.0000)_2 \times 2^{(0000)_2} = (0.0001)_2 = (1 \times 2^{-4})_{10} = (0.0625)_{10}.$$
The machine epsilon is
$$\epsilon_{mach} = 0.0625.$$
The machine epsilon, $\epsilon_{mach}$, is also simply calculated as two to the negative power of the
number of bits used for the mantissa. As far as determining accuracy, machine epsilon, $\epsilon_{mach}$, is
an upper bound of the magnitude of relative error that is created by the approximate
representation of a number (see Example 4).
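The same definition can be tried on a real machine. The sketch below assumes Python floats are IEEE-754 double precision numbers with round-to-nearest arithmetic, which holds on common platforms; it halves a candidate value until adding half of it to 1 no longer changes the result.

```python
import sys

def machine_epsilon():
    """Gap between 1.0 and the next larger representable float, found by halving."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

print(machine_epsilon())        # found experimentally
print(sys.float_info.epsilon)   # value reported by the Python runtime
print(2.0 ** -4)                # hypothetical 10-bit word with a 4-bit mantissa
```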
Example 4
A machine stores floating-point numbers in a hypothetical 10-bit binary word. It employs the
first bit for the sign of the number, the second one for the sign of the exponent, the next four
for the exponent, and the last four for the magnitude of the mantissa. Confirm that the
magnitude of the relative true error that results from approximate representation of 0.02832 in
the 10-bit format (as found in previous example) is less than the machine epsilon.
Solution
From Example 3, the ten-bit representation of 0.02832 bit-by-bit is
0 1 0 1 1 0 1 1 0 0
Again from Example 3, converting the above floating-point representation to base-10 gives
$$(1.1100)_2 \times 2^{-(0110)_2} = (1.75)_{10} \times 2^{-(6)_{10}} = (0.02734375)_{10}$$
The absolute relative true error between the number 0.02832 and its approximate
representation 0.02734375 is
$$|\epsilon_t| = \left|\frac{0.02832 - 0.02734375}{0.02832}\right| = 0.034472$$
which is less than the machine epsilon for a computer that uses 4 bits for mantissa, that is,
$$\epsilon_{mach} = 2^{-4} = 0.0625.$$
Q: How are numbers actually represented in floating point in a real computer?
A: In an actual typical computer, a real number is stored as per the IEEE-754 (Institute of
Electrical and Electronics Engineers) floating-point arithmetic format. To keep the discussion
short and simple, let us point out the salient features of the single precision format.
A single precision number uses 32 bits.
A number $y$ is represented as
$$y = \sigma \times (1.a_1 a_2 \cdots a_{23})_2 \times 2^e$$
where
$\sigma$ = sign of the number (positive or negative)
$a_i$ = entries of the mantissa, can be only 0 or 1, $i = 1, \ldots, 23$
$e$ = the exponent
Note the 1 before the radix point.
1. The first bit represents the sign of the number (0 for a positive number and 1 for a
negative number).
2. The next eight bits represent the exponent. Note that there is no separate bit for the
sign of the exponent. The sign of the exponent is taken care of by normalizing by
adding 127 to the actual exponent. For example, in the previous example, the exponent
was 6. It would be stored as the binary equivalent of $127 + 6 = 133$. Why is 127 and
not some other number added to the actual exponent? Because in eight bits the largest
integer that can be represented is $(11111111)_2 = 255$, and halfway of 255 is 127. This
allows negative and positive exponents to be represented equally. The normalized (also
called biased) exponent has the range from 0 to 255, and hence the exponent $e$ has the
range of $-127 \le e \le 128$.
3. Suppose that, instead of using the biased exponent, we still used eight bits for the
exponent but used one bit for the sign of the exponent and seven bits for the exponent
magnitude. In seven bits, the largest integer that can be represented is
$(1111111)_2 = 127$, in which case the exponent $e$ range would have been smaller, that
is, $-127 \le e \le 127$. By biasing the exponent, the unnecessary representation of a
negative zero and positive zero exponent (which are the same) is also avoided.
4. Actually, the biased exponent range used in the IEEE-754 format is not 0 to 255, but 1
to 254. Hence, exponent $e$ has the range of $-126 \le e \le 127$. So what are $e = -127$
and $e = 128$ used for? If $e = 128$ and all the mantissa entries are zeros, the number is
$\pm\infty$ (the sign of infinity is governed by the sign bit); if $e = 128$ and the mantissa
entries are not zero, the number being represented is Not a Number (NaN). Because of
the leading 1 in the floating-point representation, the number zero cannot be represented
exactly. That is why the number zero (0) is represented by $e = -127$ and all the
mantissa entries being zero.
5. The next twenty-three bits are used for the mantissa.
6. The largest number by magnitude that is represented by this format is
$$\left(1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-2} + \cdots + 1 \times 2^{-22} + 1 \times 2^{-23}\right) \times 2^{127} = 3.40 \times 10^{38}$$
The smallest number by magnitude that is represented, other than zero, is
$$\left(1 \times 2^0 + 0 \times 2^{-1} + 0 \times 2^{-2} + \cdots + 0 \times 2^{-22} + 0 \times 2^{-23}\right) \times 2^{-126} = 1.18 \times 10^{-38}$$
Since 23 bits are used for the mantissa, the machine epsilon is
$$\epsilon_{mach} = 2^{-23} = 1.19 \times 10^{-7}.$$
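The three fields of a single precision number can be inspected with Python's standard `struct` module. This is a small sketch; the function name `single_precision_fields` is illustrative, and the bit pattern `0x7F7FFFFF` used for the largest number is the biased exponent 254 followed by 23 mantissa ones.

```python
import struct

def single_precision_fields(x):
    """Split the 32-bit IEEE-754 pattern of x into (sign, biased exponent, mantissa)."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))   # reinterpret float bits as an integer
    sign = raw >> 31                    # 1 bit
    biased_exponent = (raw >> 23) & 0xFF   # 8 bits, actual exponent + 127
    mantissa = raw & 0x7FFFFF           # 23 bits, leading 1 not stored
    return sign, biased_exponent, mantissa

print(single_precision_fields(1.0))
print(single_precision_fields(-2.0))

# Largest finite single precision number: biased exponent 254, all mantissa bits 1.
(largest,) = struct.unpack(">f", struct.pack(">I", 0x7F7FFFFF))
print(largest)
print(2.0 ** -23)   # machine epsilon for a 23-bit mantissa
```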
Propagation of Errors
If a calculation is made with numbers that are not exact, then the calculation itself will have an
error. How do the errors in each individual number propagate through the calculations? Let us
look at the concept via some examples.
Example 1
Find the bounds for the propagation error in adding two numbers. For example, if one is
calculating $X + Y$ where
$$X = 1.5 \pm 0.05,$$
$$Y = 3.4 \pm 0.04.$$
Solution
By looking at the numbers, the maximum possible values of $X$ and $Y$ are
$$X = 1.55 \text{ and } Y = 3.44$$
Hence
$$X + Y = 1.55 + 3.44 = 4.99$$
is the maximum value of $X + Y$.
The minimum possible values of $X$ and $Y$ are
$$X = 1.45 \text{ and } Y = 3.36.$$
Hence
$$X + Y = 1.45 + 3.36 = 4.81$$
is the minimum value of $X + Y$.
Hence
$$4.81 \le X + Y \le 4.99.$$
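One can find similar intervals of the bound for the other arithmetic operations of
$X - Y$, $X * Y$, and $X / Y$. What if the evaluations we are making are function evaluations
instead? How do we find the value of the propagation error in such cases?
If $f$ is a function of several variables $X_1, X_2, X_3, \ldots, X_{n-1}, X_n$, then the
maximum possible value of the error in $f$ is
$$\Delta f \approx \left|\frac{\partial f}{\partial X_1}\Delta X_1\right| + \left|\frac{\partial f}{\partial X_2}\Delta X_2\right| + \cdots + \left|\frac{\partial f}{\partial X_{n-1}}\Delta X_{n-1}\right| + \left|\frac{\partial f}{\partial X_n}\Delta X_n\right|$$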
Example 2
The strain in an axial member of a square cross-section is given by
$$\epsilon = \frac{F}{h^2 E}$$
where
$F$ = axial force in the member, N
$h$ = length or width of the cross-section, m
$E$ = Young's modulus, Pa
Given
$$F = 72 \pm 0.9\ \text{N}$$
$$h = 4 \pm 0.1\ \text{mm}$$
$$E = 70 \pm 1.5\ \text{GPa}$$
Find the maximum possible error in the measured strain.
Solution
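$$\epsilon = \frac{72}{(4 \times 10^{-3})^2 (70 \times 10^9)} = 64.286 \times 10^{-6} = 64.286\,\mu$$
$$\Delta\epsilon = \left|\frac{\partial \epsilon}{\partial F}\Delta F\right| + \left|\frac{\partial \epsilon}{\partial h}\Delta h\right| + \left|\frac{\partial \epsilon}{\partial E}\Delta E\right|$$
$$\frac{\partial \epsilon}{\partial F} = \frac{1}{h^2 E}$$
$$\frac{\partial \epsilon}{\partial h} = -\frac{2F}{h^3 E}$$
$$\frac{\partial \epsilon}{\partial E} = -\frac{F}{h^2 E^2}$$
$$\Delta\epsilon = \left|\frac{1}{h^2 E}\Delta F\right| + \left|\frac{2F}{h^3 E}\Delta h\right| + \left|\frac{F}{h^2 E^2}\Delta E\right|$$
$$= \left|\frac{1}{(4 \times 10^{-3})^2 (70 \times 10^9)} \times 0.9\right| + \left|\frac{2 \times 72}{(4 \times 10^{-3})^3 (70 \times 10^9)} \times 0.0001\right| + \left|\frac{72}{(4 \times 10^{-3})^2 (70 \times 10^9)^2} \times 1.5 \times 10^9\right|$$
$$= 8.0357 \times 10^{-7} + 3.2143 \times 10^{-6} + 1.3776 \times 10^{-6}$$
$$= 5.3955 \times 10^{-6} = 5.3955\,\mu$$
Hence
$$\epsilon = (64.286\,\mu \pm 5.3955\,\mu)$$
implying that the axial strain, $\epsilon$, is between $58.8905\,\mu$ and $69.6815\,\mu$.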
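The arithmetic in this example can be checked numerically. The sketch below hard-codes the three partial derivatives of $\epsilon = F/(h^2 E)$ and works in SI units; the variable names are illustrative.

```python
# Nominal values and their uncertainties, in SI units.
F, dF = 72.0, 0.9        # axial force, N
h, dh = 4e-3, 0.1e-3     # side of the square cross-section, m
E, dE = 70e9, 1.5e9      # Young's modulus, Pa

strain = F / (h**2 * E)

# Maximum possible error: sum of |partial derivative * uncertainty| for each variable.
term_F = abs(1 / (h**2 * E) * dF)
term_h = abs(-2 * F / (h**3 * E) * dh)
term_E = abs(-F / (h**2 * E**2) * dE)
d_strain = term_F + term_h + term_E

print(f"strain = {strain * 1e6:.3f} micro")
print(f"terms  = {term_F:.4e}, {term_h:.4e}, {term_E:.4e}")
print(f"error  = {d_strain * 1e6:.4f} micro")
```

The middle term dominates, so the uncertainty in the cross-section dimension contributes most to the uncertainty in the strain.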
Example 3
Subtraction of numbers that are nearly equal can create unwanted inaccuracies. Using the
formula for error propagation, show that this is true.
Solution
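Let
$$z = x - y$$
Then
$$|\Delta z| = \left|\frac{\partial z}{\partial x}\Delta x\right| + \left|\frac{\partial z}{\partial y}\Delta y\right| = |(1)\Delta x| + |(-1)\Delta y| = |\Delta x| + |\Delta y|$$
So the absolute relative change is
$$\left|\frac{\Delta z}{z}\right| = \frac{|\Delta x| + |\Delta y|}{|x - y|}$$
As $x$ and $y$ become close to each other, the denominator becomes small and hence creates
large relative errors.
For example, if
$$x = 2 \pm 0.001$$
$$y = 2.003 \pm 0.001$$
$$\left|\frac{\Delta z}{z}\right| = \frac{|0.001| + |0.001|}{|2 - 2.003|} = 0.6667 = 66.67\%$$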
The use of Taylor series exists in so many aspects of numerical methods that it is imperative
to devote a separate chapter to its review and applications. For example, you must have come
across expressions such as
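$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots \qquad (1)$$
$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots \qquad (2)$$
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \qquad (3)$$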
All the above expressions are actually special cases of Taylor series called the Maclaurin
series. Why are these applications of Taylor’s theorem important for numerical methods?
Expressions such as given in Equations (1), (2) and (3) give you a way to find the approximate
values of these functions by using the basic arithmetic operations of addition, subtraction,
division, and multiplication.
Example 1
Find the value of $e^{0.25}$ using the first five terms of the Maclaurin series.
Solution
The first five terms of the Maclaurin series for $e^x$ are
$$e^x \approx 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!}$$
$$e^{0.25} \approx 1 + 0.25 + \frac{0.25^2}{2!} + \frac{0.25^3}{3!} + \frac{0.25^4}{4!}$$
$$= 1.2840$$
The exact value of $e^{0.25}$ up to 5 significant digits is also 1.2840.
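The partial sum in this example uses only addition, multiplication, and division, which is easy to see in code. The sketch below builds each term from the previous one; the function name `exp_maclaurin` is an illustrative choice, and `math.exp` is used only for comparison.

```python
import math


def exp_maclaurin(x, n_terms):
    """Sum the first n_terms of the Maclaurin series of e^x."""
    total = 0.0
    term = 1.0                # x^0 / 0!
    for k in range(n_terms):
        total += term
        term *= x / (k + 1)   # next term: x^(k+1) / (k+1)!
    return total


approx = exp_maclaurin(0.25, 5)
print(f"five-term approximation: {approx:.5f}")
print(f"math.exp(0.25):          {math.exp(0.25):.5f}")
```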
But the above discussion and example do not answer our question of what a Taylor series is.
Here it is, for a function $f(x)$:
$$f(x+h) = f(x) + f'(x)h + \frac{f''(x)}{2!}h^2 + \frac{f'''(x)}{3!}h^3 + \cdots \qquad (4)$$
provided all derivatives of $f(x)$ exist and are continuous between $x$ and $x+h$.
Example 2
Take $f(x) = \sin(x)$. We all know the value of $\sin(\pi/2) = 1$. We also know that $f'(x) = \cos(x)$
and $\cos(\pi/2) = 0$. Similarly, $f''(x) = -\sin(x)$ and $\sin(\pi/2) = 1$. In a way, we know the value
of $\sin(x)$ and all its derivatives at $x = \pi/2$. We do not need to use any calculators; just plain
differential calculus and trigonometry would do. Can you use Taylor series and this
information to find the value of $\sin(2)$?
Solution
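$$x = \frac{\pi}{2}$$
$$x + h = 2$$
$$h = 2 - x = 2 - \frac{\pi}{2} = 0.42920$$
So
$$f(x+h) = f(x) + f'(x)h + f''(x)\frac{h^2}{2!} + f'''(x)\frac{h^3}{3!} + f''''(x)\frac{h^4}{4!} + \cdots$$
$$x = \frac{\pi}{2}$$
$$h = 0.42920$$
$$f(x) = \sin(x), \quad f\left(\frac{\pi}{2}\right) = \sin\left(\frac{\pi}{2}\right) = 1$$
$$f'(x) = \cos(x), \quad f'\left(\frac{\pi}{2}\right) = 0$$
$$f''(x) = -\sin(x), \quad f''\left(\frac{\pi}{2}\right) = -1$$
$$f'''(x) = -\cos(x), \quad f'''\left(\frac{\pi}{2}\right) = 0$$
$$f''''(x) = \sin(x), \quad f''''\left(\frac{\pi}{2}\right) = 1$$
Hence
$$f\left(\frac{\pi}{2} + h\right) = f\left(\frac{\pi}{2}\right) + f'\left(\frac{\pi}{2}\right)h + f''\left(\frac{\pi}{2}\right)\frac{h^2}{2!} + f'''\left(\frac{\pi}{2}\right)\frac{h^3}{3!} + f''''\left(\frac{\pi}{2}\right)\frac{h^4}{4!} + \cdots$$
$$f\left(\frac{\pi}{2} + 0.42920\right) = 1 + 0(0.42920) - 1\frac{(0.42920)^2}{2!} + 0\frac{(0.42920)^3}{3!} + 1\frac{(0.42920)^4}{4!} + \cdots$$
$$= 1 + 0 - 0.092106 + 0 + 0.00141393 + \cdots$$
$$\approx 0.90931$$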
The value of $\sin(2)$ I get from my calculator is 0.90930, which is very close to the value I
just obtained. Now you can get a better value by using more terms of the series. In addition,
you can now use the value calculated for $\sin(2)$, coupled with the value of $\cos(2)$ (which can
be calculated by Taylor series just like this example or by using the identity $\sin^2 x + \cos^2 x = 1$),
to find the value of $\sin(x)$ at some other point. In this way, we can find the value of $\sin(x)$ for
any value from $x = 0$ to $2\pi$ and then use the periodicity of $\sin(x)$, that is,
$\sin(x) = \sin(x + 2n\pi)$, $n = 1, 2, \ldots$, to calculate the value of $\sin(x)$ at any other point.
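The expansion of $\sin(x)$ about $\pi/2$ can be sketched in a few lines, using the fact that the derivatives of sine at $\pi/2$ cycle through 1, 0, -1, 0. The function name `sin_taylor_about_half_pi` and the default of five terms are assumptions for illustration; `math.sin` appears only as a reference value.

```python
import math


def sin_taylor_about_half_pi(x, n_terms=5):
    """Approximate sin(x) with n_terms terms of its Taylor series about pi/2."""
    h = x - math.pi / 2
    # Values of sin, cos, -sin, -cos at pi/2; the pattern repeats every 4 derivatives
    derivs_at_half_pi = [1.0, 0.0, -1.0, 0.0]
    total = 0.0
    for k in range(n_terms):
        total += derivs_at_half_pi[k % 4] * h**k / math.factorial(k)
    return total


approx_5 = sin_taylor_about_half_pi(2.0, 5)
approx_9 = sin_taylor_about_half_pi(2.0, 9)
print(f"5 terms: {approx_5:.5f}")
print(f"9 terms: {approx_9:.8f}")
print(f"math.sin(2): {math.sin(2.0):.8f}")
```

Raising `n_terms` shows the improvement from using more terms of the series.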
Example 3
Derive the Maclaurin series of
$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots$$
Solution
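In the previous example, we wrote the Taylor series for $\sin(x)$ around the point $x = \pi/2$.
The Maclaurin series is simply a Taylor series for the point $x = 0$.
$$f(x) = \sin(x), \quad f(0) = 0$$
$$f'(x) = \cos(x), \quad f'(0) = 1$$
$$f''(x) = -\sin(x), \quad f''(0) = 0$$
$$f'''(x) = -\cos(x), \quad f'''(0) = -1$$
$$f''''(x) = \sin(x), \quad f''''(0) = 0$$
$$f'''''(x) = \cos(x), \quad f'''''(0) = 1$$
Using the Taylor series with $x = 0$,
$$f(0+h) = f(0) + f'(0)h + f''(0)\frac{h^2}{2!} + f'''(0)\frac{h^3}{3!} + f''''(0)\frac{h^4}{4!} + f'''''(0)\frac{h^5}{5!} + \cdots$$
$$f(h) = 0 + 1(h) - 0\frac{h^2}{2!} - 1\frac{h^3}{3!} + 0\frac{h^4}{4!} + 1\frac{h^5}{5!} + \cdots$$
$$= h - \frac{h^3}{3!} + \frac{h^5}{5!} - \cdots$$
So
$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$$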
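Because the derivatives of $\sin(x)$ at $x = 0$ repeat in the cycle 0, 1, 0, -1, the Maclaurin coefficients can be generated mechanically. The sketch below does so with exact fractions; the function name `sin_maclaurin_coefficients` is an illustrative assumption.

```python
import math
from fractions import Fraction


def sin_maclaurin_coefficients(n_terms):
    """Coefficients of x^k, k = 0..n_terms-1, in the Maclaurin series of sin(x)."""
    derivs_at_zero = [0, 1, 0, -1]   # sin, cos, -sin, -cos evaluated at 0
    return [Fraction(derivs_at_zero[k % 4], math.factorial(k))
            for k in range(n_terms)]


coeffs = sin_maclaurin_coefficients(8)
for k, c in enumerate(coeffs):
    if c != 0:
        print(f"x^{k}: {c}")
```

Only odd powers survive, with alternating signs and factorial denominators.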
Example 4
Find the value of $f(6)$ given that $f(4) = 125$, $f'(4) = 74$, $f''(4) = 30$, $f'''(4) = 6$ and all other
higher derivatives of $f(x)$ at $x = 4$ are zero.
Solution
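$$f(x+h) = f(x) + f'(x)h + f''(x)\frac{h^2}{2!} + f'''(x)\frac{h^3}{3!} + \cdots$$
$$x = 4$$
$$h = 6 - 4 = 2$$
Since the fourth and higher derivatives of $f(x)$ are zero at $x = 4$,
$$f(4+2) = f(4) + f'(4)(2) + f''(4)\frac{2^2}{2!} + f'''(4)\frac{2^3}{3!}$$
$$f(6) = 125 + 74(2) + 30\left(\frac{2^2}{2!}\right) + 6\left(\frac{2^3}{3!}\right)$$
$$= 125 + 148 + 60 + 8$$
$$= 341$$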
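When only finitely many derivatives are nonzero, the Taylor series terminates and can be summed exactly. A minimal sketch, assuming a hypothetical helper `taylor_eval` that takes the list of derivative values at the expansion point:

```python
import math


def taylor_eval(derivs_at_x, h):
    """Evaluate sum over k of f^(k)(x) * h^k / k! for the given derivative values."""
    return sum(d * h**k / math.factorial(k) for k, d in enumerate(derivs_at_x))


# f(4), f'(4), f''(4), f'''(4) from the example; higher derivatives are zero
f6 = taylor_eval([125, 74, 30, 6], 6 - 4)
print(f6)  # 125 + 148 + 60 + 8 = 341.0
```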
Example 5
The Taylor series for $e^x$ at point $x = 0$ is given by
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \cdots$$
a) What is the truncation (true) error in the representation of $e^1$ if only four terms of the series
are used?
b) Use the remainder theorem to find the bounds of the truncation error.
Solution
a) If only four terms of the series are used, then
$$e^x \approx 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!}$$
$$e^1 \approx 1 + 1 + \frac{1^2}{2!} + \frac{1^3}{3!}$$
$$= 2.66667$$
The truncation (true) error would be the unused terms of the Taylor series, which then are
$$E_t = \frac{x^4}{4!} + \frac{x^5}{5!} + \cdots$$
$$= \frac{1^4}{4!} + \frac{1^5}{5!} + \cdots$$
$$\approx 0.0516152$$
b) But is there any way to know the bounds of this error other than calculating it directly?
Yes,
$$f(x+h) = f(x) + f'(x)h + \cdots + f^{(n)}(x)\frac{h^n}{n!} + R_n(x+h)$$
where
$$R_n(x+h) = \frac{h^{n+1}}{(n+1)!} f^{(n+1)}(c), \quad x < c < x+h,$$
and $c$ is some point in the domain $(x, x+h)$. So in this case, if we are using four terms of the
Taylor series, the remainder is given by ($x = 0$, $n = 3$)
$$R_3(0+1) = \frac{1^{3+1}}{(3+1)!} f^{(3+1)}(c)$$
$$= \frac{1}{4!} f^{(4)}(c)$$
$$= \frac{e^c}{24}$$
Since
$$x < c < x+h$$
$$0 < c < 0+1$$
$$0 < c < 1$$
the error is bounded between
$$\frac{e^0}{24} < R_3(1) < \frac{e^1}{24}$$
$$\frac{1}{24} < R_3(1) < \frac{e}{24}$$
$$0.041667 < R_3(1) < 0.113261$$
So the bound of the error is less than 0.113261, which does concur with the calculated error
of 0.0516152.
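The true error and the remainder bounds for the four-term approximation of $e^1$ can be compared numerically. This is a sketch under the assumptions of the example ($x = 0$, $h = 1$, $n = 3$); the helper name `exp_partial_sum` is illustrative, and `math.exp` stands in for the exact value.

```python
import math


def exp_partial_sum(x, n_terms):
    """Sum the first n_terms of the Maclaurin series of e^x."""
    return sum(x**k / math.factorial(k) for k in range(n_terms))


x, h, n = 0.0, 1.0, 3                      # expansion point, step, highest power used
approx = exp_partial_sum(h, n + 1)         # four terms
true_error = math.exp(h) - approx

# Remainder R_n = h^(n+1) / (n+1)! * e^c with x < c < x + h;
# e^c is increasing, so the end points of the interval give the bounds.
lower = h**(n + 1) * math.exp(x) / math.factorial(n + 1)
upper = h**(n + 1) * math.exp(x + h) / math.factorial(n + 1)

print(f"approximation = {approx:.5f}")
print(f"true error    = {true_error:.7f}")
print(f"bounds        = ({lower:.6f}, {upper:.6f})")
```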
Example 6
The Taylor series for $e^x$ at point $x = 0$ is given by
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \frac{x^5}{5!} + \cdots$$
As you can see in the previous example, by taking more terms, the error bounds decrease
and hence you have a better estimate of $e^1$. How many terms would it require to get an
approximation of $e^1$ within a magnitude of true error of less than $10^{-6}$?
Solution
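Using $n+1$ terms of the Taylor series gives an error bound of
$$R_n(x+h) = \frac{h^{n+1}}{(n+1)!} f^{(n+1)}(c)$$
With $x = 0$, $h = 1$, $f(x) = e^x$,
$$R_n(1) = \frac{1^{n+1}}{(n+1)!} f^{(n+1)}(c)$$
$$= \frac{1^{n+1}}{(n+1)!} e^c$$
Since
$$x < c < x+h$$
$$0 < c < 0+1$$
$$0 < c < 1$$
it follows that
$$\frac{1}{(n+1)!} < |R_n(1)| < \frac{e}{(n+1)!}$$
So if we want to find out how many terms it would require to get an approximation of $e^1$
within a magnitude of true error of less than $10^{-6}$,
$$\frac{e}{(n+1)!} < 10^{-6}$$
$$(n+1)! > 10^{6}\, e$$
$$(n+1)! > 10^{6} \times 3$$
(as we do not know the value of $e$ but it is less than 3).
$$n \geq 9$$
since $10! = 3{,}628{,}800$ exceeds $3 \times 10^{6}$ while $9! = 362{,}880$ does not. Because $R_n$ is the
remainder after $n+1$ terms, $n = 9$ means that 10 terms or more will get $e^1$ within an error
of $10^{-6}$ in its value.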
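The search for the smallest $n$ that satisfies the remainder bound can be automated, and the result can be checked against the true error. This is a minimal sketch: the function name `smallest_n_for_tolerance` is hypothetical, the value 3 is used as the upper bound on $e$ exactly as in the example, and `math.e` is used only to verify the outcome. Under the convention above, $R_n$ is the remainder after $n + 1$ terms.

```python
import math


def smallest_n_for_tolerance(tol, e_upper=3.0):
    """Smallest n with e_upper / (n + 1)! < tol, which bounds |R_n(1)| for e^1."""
    n = 0
    while e_upper / math.factorial(n + 1) >= tol:
        n += 1
    return n


n = smallest_n_for_tolerance(1e-6)
terms = n + 1                               # R_n is the remainder after n + 1 terms
approx = sum(1.0 / math.factorial(k) for k in range(terms))
true_error = abs(math.e - approx)

# Same sum with one term fewer, to show the tolerance is not yet met
error_one_fewer = abs(math.e - sum(1.0 / math.factorial(k) for k in range(terms - 1)))

print(f"n = {n}, terms = {terms}")
print(f"true error with {terms} terms:     {true_error:.3e}")
print(f"true error with {terms - 1} terms: {error_one_fewer:.3e}")
```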
We can do calculations such as the ones given above only for simple functions. To do
a similar analysis of how many terms of the series are needed for a specified accuracy for any
general function, we can use the concept of absolute relative approximate error (see
Chapter 01.02 for details), which is calculated after each term in the series is added. The
maximum value of $m$ for which the absolute relative approximate error is less than
$0.5 \times 10^{2-m}\,\%$ is the least number of significant digits correct in the answer. It
establishes the accuracy of the approximate value of a function without the knowledge of the
remainder of the Taylor series or the true error.
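The stopping rule based on the absolute relative approximate error can be sketched as follows for the Maclaurin series of $e^x$. The function name `exp_series_to_digits` and the `max_terms` safeguard are assumptions for illustration; the sketch also assumes the running sum never becomes zero, which holds for the value used here.

```python
import math


def exp_series_to_digits(x, m, max_terms=100):
    """Sum the Maclaurin series of e^x until |eps_a| < 0.5 * 10**(2 - m) percent.

    Returns (approximation, number of terms used, last |eps_a| in percent).
    """
    tolerance = 0.5 * 10.0 ** (2 - m)    # percent
    total = 1.0                           # first term, x^0 / 0!
    term = 1.0
    for k in range(1, max_terms):
        term *= x / k                     # x^k / k!
        previous, total = total, total + term
        eps_a = abs((total - previous) / total) * 100.0
        if eps_a < tolerance:
            return total, k + 1, eps_a
    raise RuntimeError("tolerance not reached within max_terms")


value, terms_used, eps_a = exp_series_to_digits(1.0, 5)
print(f"e^1 ~ {value:.8f} after {terms_used} terms, |eps_a| = {eps_a:.2e} %")
```

No knowledge of the exact value or of the remainder is used inside the loop; the decision to stop depends only on successive partial sums.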