0% found this document useful (0 votes)
28 views38 pages

Chapter

1) The document discusses a problem where a steel trunnion got stuck when being inserted into a steel hub to form a bridge assembly. 2) To solve this, the document examines using numerical methods to calculate the thermal contraction of the trunnion when cooled in dry ice/alcohol. 3) It finds that assuming a constant thermal expansion coefficient leads to an overestimation of contraction. Accounting for how the coefficient varies with temperature using a polynomial regression model provides a more accurate contraction calculation.

Uploaded by

tatek
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
28 views38 pages

Chapter

1) The document discusses a problem where a steel trunnion got stuck when being inserted into a steel hub to form a bridge assembly. 2) To solve this, the document examines using numerical methods to calculate the thermal contraction of the trunnion when cooled in dry ice/alcohol. 3) It finds that assuming a constant thermal expansion coefficient leads to an overestimation of contraction. Accounting for how the coefficient varies with temperature using a polynomial regression model provides a more accurate contraction calculation.

Uploaded by

tatek
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 38

Chapter 01

Introduction to Numerical Methods

After reading this chapter, you should be able to:


1. understand the need for numerical methods, and
2. go through the stages (mathematical modeling, solving and implementation) of
solving a particular physical problem.

Mathematical models are an integral part in solving engineering problems. Many times, these
mathematical models are derived from engineering and science principles, while at other times
the models may be obtained from experimental data.
Mathematical models generally result in need of using mathematical procedures that
include but are not limited to
(A) differentiation,
(B) nonlinear equations,
(C) simultaneous linear equations,
(D) curve fitting by interpolation or regression,
(E) integration, and
(F) differential equations.
These mathematical procedures may be suitable to be solved exactly as you must have
experienced in the series of calculus courses you have taken, but in most cases, the procedures
need to be solved approximately using numerical methods. Let us see an example of such a
need from a real-life physical problem.
To make the fulcrum (Figure 1) of a bascule bridge, a long hollow steel shaft called the
trunnion is shrink fit into a steel hub. The resulting steel trunnion-hub assembly is then shrink
fit into the girder of the bridge.

Trunnion

Hub

Girder

Figure 1 Trunnion-Hub-Girder (THG) assembly.

01.01.1
Introduction to Numerical Methods 01.01.2

This is done by first immersing the trunnion in a cold medium such as a dry-ice/alcohol
mixture. After the trunnion reaches the steady state temperature of the cold medium, the
trunnion outer diameter contracts. The trunnion is taken out of the medium and slid through
the hole of the hub (Figure 2).

Figure 2 Trunnion slided through the hub after contracting

When the trunnion heats up, it expands and creates an interference fit with the hub. In
1995, on one of the bridges in Florida, this assembly procedure did not work as designed.
Before the trunnion could be inserted fully into the hub, the trunnion got stuck. Luckily, the
trunnion was taken out before it got stuck permanently. Otherwise, a new trunnion and hub
would needed to be ordered at a cost of $50,000. Coupled with construction delays, the total
loss could have been more than a hundred thousand dollars.
Why did the trunnion get stuck? This was because the trunnion had not contracted
enough to slide through the hole. Can you find out why?
A hollow trunnion of outside diameter 12.363" is to be fitted in a hub of inner
diameter 12.358" . The trunnion was put in dry ice/alcohol mixture (temperature of the
fluid - dry ice/alcohol mixture is  108F ) to contract the trunnion so that it can be slid
through the hole of the hub. To slide the trunnion without sticking, a diametrical clearance of
at least 0.01" is required between the trunnion and the hub. Assuming the room
temperature is 80F , is immersing the trunnion in dry-ice/alcohol mixture a correct decision?
To calculate the contraction in the diameter of the trunnion, the thermal expansion
coefficient at room temperature is used. In that case the reduction D in the outer diameter
of the trunnion is
D  DT (1)
where
D = outer diameter of the trunnion,
  coefficient of thermal expansion coefficient at room temperature, and
T  change in temperature,
Given
D = 12.363"
  6.47  106 in/in/ F at 80F
T  T fluid  Troom
=  108  80
 188F
Introduction to Numerical Methods 01.01.3

where
T fluid = temperature of dry-ice/alcohol mixture
Troom = room temperature
the reduction in the outer diameter of the trunnion is given by
D  (12.363) 6.47 106  188  
=  0.01504"
So the trunnion is predicted to reduce in diameter by 0.01504" . But, is this enough
reduction in diameter? As per specifications, the trunnion needs to contract by
= trunnion outside diameter - hub inner diameter + diametric clearance
= 12.363 – 12.358 + 0.01
= 0.015"
So according to his calculations, immersing the steel trunnion in dry-ice/alcohol
mixture gives the desired contraction of greater than 0.015" as the predicted contraction
is 0.01504" . But, when the steel trunnion was put in the hub, it got stuck. Why did this
happen? Was our mathematical model adequate for this problem or did we create a
mathematical error?
As shown in Figure 3 and Table 1, the thermal expansion coefficient of steel decreases
with temperature and is not constant over the range of temperature the trunnion goes through.
Hence, Equation (1) would overestimate the thermal contraction.

7.00E-06
6.00E-06
Coefficient of Thermal
Expancion (in/in/oF)

5.00E-06
4.00E-06
3.00E-06
2.00E-06

1.00E-06
0.00E+00
-400 -350 -300 -250 -200 -150 -100 -50 0 50 100 150
o
Tem perature ( F)

Figure 3 Varying thermal expansion coefficient as a function of temperature for cast


steel.

The contraction in the diameter of the trunnion for which the thermal expansion coefficient
varies as a function of temperature is given by
T fluid

D  D  dT
Troom
(2)

So one needs to curve fit the data to find the coefficient of thermal expansion as a function of
temperature. This is done by regression where we best fit a curve through the data given in
Table 1. In this case, we may fit a second order polynomial
  a0  a1  T  a2  T 2 (3)
Introduction to Numerical Methods 01.01.4

Table 1 Instantaneous thermal expansion coefficient as a function of temperature.


Instantaneous
Temperature
Thermal Expansion
F μin/in/F
80 6.47
60 6.36
40 6.24
20 6.12
0 6.00
-20 5.86
-40 5.72
-60 5.58
-80 5.43
-100 5.28
-120 5.09
-140 4.91
-160 4.72
-180 4.52
-200 4.30
-220 4.08
-240 3.83
-260 3.58
-280 3.33
-300 3.07
-320 2.76
-340 2.45
The values of the coefficients in the above Equation (3) will be found by polynomial regression
(we will learn how to do this later in Chapter 06.04). At this point we are just going to give
you these values and they are
a0   6.0150  10 
6

 a    6.1946  10 9 
 1  
a 2    1.2278  10 
 11

to give the polynomial regression model (Figure 4) as


  a0  a1T  a2T 2
 6.0150 106  6.1946 109 T  1.2278 1011 T 2
Knowing the values of a 0 , a1 and a 2 , we can then find the contraction in the trunnion
diameter as
T flu id

D  D  (a
Tro o m
0  a1T  a2T 2 )dT

(Tfluid  Troom ) (T fluid  Troom )


2 2 3 3

 D[a0 (T fluid  Troom )  a1  a2 ] (4)


2 3
which gives
Introduction to Numerical Methods 01.01.5

 6
 ( 108  80)  6.1946  10 9
( 108) 2  (80) 2 
6.0150  10 2 
D  12.363 
 12 (( 108)  (80) ) 
3 3

  1.2278  10 
 3 
=  0.013689"

Coefficient of thermal expansion 7.00

6.00

5.00
(m in/in/oF)

4.00

3.00

2.00

1.00

0.00
-400 -300 -200 -100 0 100 200
Temperature (oF)

Figure 4 Second order polynomial regression model for coefficient of thermal expansion as
a function of temperature.

What do we find here? The contraction in the trunnion is not enough to meet the required
specification of 0.015" .
So here are some questions that you may want to ask yourself?
1. What if the trunnion were immersed in liquid nitrogen (boiling temperature
 321F )? Will that cause enough contraction in the trunnion?
2. Rather than regressing the thermal expansion coefficient data to a second order
polynomial so that one can find the contraction in the trunnion OD, how would you use
Trapezoidal rule of integration for unequal segments? What is the relative difference
between the two results?
3. We chose a second order polynomial for regression. Would a different order
polynomial be a better choice for regression? Is there an optimum order of polynomial
you can find?
As mentioned at the beginning of this chapter, we generally see mathematical
procedures that require the solution of nonlinear equations, differentiation, solution of
simultaneous linear equations, interpolation, regression, integration, and differential equations.
A physical example to illustrate the need for each of these mathematical procedures is given
in the beginning of each chapter. You may want to look at them now to understand better why
we need numerical methods in everyday life.
Introduction to Numerical Methods 01.01.6

Measuring Errors

After reading this chapter, you should be able to:


3. find the true and relative true error,
4. find the approximate and relative approximate error,
5. relate the absolute relative approximate error to the number of significant digits at
least correct in your answers, and
6. know the concept of significant digits.

In any numerical analysis, errors will arise during the calculations. To be able to deal with the
issue of errors, we need to
(G) identify where the error is coming from, followed by
(H) quantifying the error, and lastly
(I) minimize the error as per our needs.
In this chapter, we will concentrate on item (B), that is, how to quantify errors.

Q: What is true error?


A: True error denoted by Et is the difference between the true value (also called the exact
value) and the approximate value.
True Error  True value – Approximate value

Example 1: The derivative of a function f(x) at a particular value of x


f(x + h)  f(x)
can be approximately calculated by f ' (x) 
h
of f ' ( 2 ) For f(x)= 7e 0.5x and h = 0.3 , find
a) the approximate value of f ' ( 2 )
b) the true value of f ' ( 2 )
c) the true error for part (a)
Solution
f(x + h)  f(x)
a) f ' (x) 
h
For x = 2 and h = 0.3 ,
f (2  0.3)  f (2) f (2.3)  f (2) 7e 0.5( 2.3)  7e 0.5( 2)
f (2)   
0.3 0.3 0.3
22.107  19.028

0.3
 10.265
b) The exact value of f ' ( 2 ) can be calculated by using our knowledge of differential calculus.
f ( x)  7e 0.5 x
Introduction to Numerical Methods 01.01.7

f ' ( x)  7  0.5  e0.5 x  3.5e 0.5 x


So the true value of f' ( 2 ) is
f ' (2)  3.5e 0.5( 2 )  9.5140
c) True error is calculated as
Et = True value – Approximate value  9.5140  10.265  0.75061
The magnitude of true error does not show how bad the error is. A true error of Et = 0.722
may seem to be small, but if the function given in the Example 1 were f(x)= 7 10 6 e 0.5x , the
true error in calculating f ' ( 2 ) with h = 0.3 , would be Et = 0.75061 10 6. This value of true
error is smaller, even when the two problems are similar in that they use the same value of the
function argument, x = 2 and the step size, h = 0.3 . This brings us to the definition of relative
true error.

Q: What is relative true error?


A: Relative true error is denoted by ∈ t and is defined as the ratio between the true error and
the true value.
True Error
Relative True Error  True Value
Example 2
The derivative of a function f(x) at a particular value of x can be approximately calculated
by
f ( x  h)  f ( x )
f ' ( x) 
h
0.5 x
For f(x)= 7e and h = 0.3 , find the relative true error at x = 2 .
Solution
From Example 1,
Et = True value – Approximate value  9.5140  10.265  0.75061
Relative true error is calculated as
True Error  0.75061
t    0.078895
True Value 9.5140
Relative true errors are also presented as percentages. For this example,
t  0.0758895 100%  7.58895%
Absolute relative true errors may also need to be calculated. In such cases,
t | 0.075888 | = 0.0758895 = 7.58895%

Q: What is approximate error?


A: In the previous section, we discussed how to calculate true errors. Such errors are calculated
only if true values are known. An example where this would be useful is when one is checking
if a program is in working order and you know some examples where the true error is known.
But mostly we will not have the luxury of knowing true values as why would you want to find
the approximate values if you know the true values. So when we are solving a problem
Introduction to Numerical Methods 01.01.8

numerically, we will only have access to approximate values. We need to know how to quantify
error for such cases.
Approximate error is denoted by E a and is defined as the difference between the present
approximation and previous approximation.
Approximate Error  Present Approximation – Previous Approximation

Example 3
The derivative of a function f (x) at a particular value of x can be approximately calculated
by
f ( x  h)  f ( x )
f ' ( x) 
h
 and at x  2 , find the following
0.5 x
For f ( x ) 7e
a) f (2) using h  0.3
b) f (2) using h  0.15
c) approximate error for the value of f (2) for part (b)
Solution
a) The approximate expression for the derivative of a function is
f ( x  h)  f ( x )
f ' ( x)  .
h
For x  2 and h  0.3 ,
f (2  0.3)  f (2) f (2.3)  f (2) 7e 0.5( 2.3)  7e 0.5( 2)
f ' (2)   
0.3 0.3 0.3
22.107  19.028

0.3
 10.265
b) Repeat the procedure of part (a) with h  0.15,
f ( x  h)  f ( x )
f ( x) 
h
For x  2 and h  0 .15 ,
f (2  0.15)  f (2) f (2.15)  f (2) 7e 0.5( 2.15)  7e 0.5( 2 )
f ' (2)   
0.15 0.15 0.15
20.50  19.028
  9.8799
0.15
c) So the approximate error, E a is
E a  Present Approximation – Previous Approximation
 9.8799  10.265
 0.38474
The magnitude of approximate error does not show how bad the error is . An approximate
error of Ea  0.38300 may seem to be small; but for f ( x)  7  10 e , the approximate
6 0.5 x

6
error in calculating f (2) with h  0.15 would be Ea  0.38474  10 . This value of
'

approximate error is smaller, even when the two problems are similar in that they use the same
Introduction to Numerical Methods 01.01.9

value of the function argument, x  2 , and h  0.15 and h  0.3 . This brings us to the
definition of relative approximate error.

Q: What is relative approximate error?


A: Relative approximate error is denoted by a and is defined as the ratio between the
approximate error and the present approximation.
Approximat e Error
Relative Approximate Error  Present Approximat ion
Example 4
The derivative of a function f (x) at a particular value of x can be approximately calculated
by
f ( x  h)  f ( x )
f ' ( x) 
h
 , find the relative approximate error in calculating f (2) using values from
0.5 x
For f ( x ) 7 e
h  0.3 and h  0.15 .
Solution
From Example 3, the approximate value of f (2)  10.263 using h  0.3 and
f ' (2)  9.8800 using h  0.15 .
E a  Present Approximation – Previous Approximation
 9.8799  10.265  0.38474
The relative approximate error is calculated as
Approximat e Error  0.38474
a    0.038942
Present Approximat ion 9.8799
Relative approximate errors are also presented as percentages. For this example,
a  0.038942 100% =  3.8942%
Absolute relative approximate errors may also need to be calculated. In this example
a  | 0.038942 |  0.038942 or 3.8942%

Q: While solving a mathematical model using numerical methods, how can we use relative
approximate errors to minimize the error?
A: In a numerical method that uses iterative methods, a user can calculate relative approximate
error a at the end of each iteration. The user may pre-specify a minimum acceptable
tolerance called the pre-specified tolerance, s . If the absolute relative approximate error
a is less than or equal to the pre-specified tolerance s , that is, |a | s , then the
acceptable error has been reached and no more iterations would be required.
Alternatively, one may pre-specify how many significant digits they would like to be
correct in their answer. In that case, if one wants at least m significant digits to be correct in
the answer, then you would need to have the absolute relative approximate error,
|a | 0.5  10 2m %.
Introduction to Numerical Methods 01.01.10

Example 5
x 0.7
If one chooses 6 terms of the Maclaurin series for e to calculate e , how many significant
digits can you trust in the solution? Find your answer without knowing or using the exact
answer.
x2
Solution ex  1 x   .................
2!
Using 6 terms, we get the current approximation as
0.7 2 0.7 3 0.7 4 0.7 5
e 0.7  1  0.7      2.0136
2! 3! 4! 5!
Using 5 terms, we get the previous approximation as
0.7 2 0.7 3 0.7 4
e 0.7  1  0.7     2.0122
2! 3! 4!
The percentage absolute relative approximate error is
2.0136  2.0122
a   100  0.069527%
2.0136
Since a  0.5  10 % , at least 2 significant digits are correct in the answer of
2 2

e 0.7  2.0136

Q: But what do you mean by significant digits?


A: Significant digits are important in showing the truth one has in a reported number. For
example, if someone asked me what the population of my county is, I would respond, “The
population of the Hillsborough county area is 1 million”. But if someone was going to give
me a $100 for every citizen of the county, I would have to get an exact count. That count
would have been 1,079,587 in year 2003. So you can see that in my statement that the
population is 1 million, that there is only one significant digit, that is, 1, and in the statement
that the population is 1,079,587, there are seven significant digits. So, how do we differentiate
the number of digits correct in 1,000,000 and 1,079,587? Well for that, one may use scientific
notation. For our data we show
1,000,000  1  10 6
1,079,587  1.079587  10 6
to signify the correct number of significant digits.
Example 5
Give some examples of showing the number of significant digits.
Solution
4. 0.0459 has three significant digits
5. 4.590 has four significant digits
6. 4008 has four significant digits
7. 4008.0 has five significant digits
8. 1.079  10 has four significant digits
3

9. 1.0790 10 has five significant digits


3

10. 1.07900 10 has six significant digits


3
Introduction to Numerical Methods 01.01.11

Sources of Error
After reading this chapter, you should be able to:
7. know that there are two inherent sources of error in numerical methods – round-
off and truncation error,
8. recognize the sources of round-off and truncation error, and
9. know the difference between round-off and truncation error.

Error in solving an engineering or science problem can arise due to several factors.
First, the error may be in the modeling technique. A mathematical model may be based on
using assumptions that are not acceptable. For example, one may assume that the drag force
on a car is proportional to the velocity of the car, but actually it is proportional to the square of
the velocity of the car. This itself can create huge errors in determining the performance of the
car, no matter how accurate the numerical methods you may use are. Second, errors may arise
from mistakes in programs themselves or in the measurement of physical quantities. But, in
applications of numerical methods itself, the two errors we need to focus on are
(J) Round off error
(K) Truncation error.

Q: What is round off error?


1
A: A computer can only represent a number approximately. For example, a number like 3
may be represented as 0.333333 on a PC. Then the round off error in this case is
1
 0.333333  0.00000033 . Then there are other numbers that cannot be represented
3
exactly. For example,  and 2 are numbers that need to be approximated in computer
calculations.

Q: What problems can be created by round off errors?


A: Twenty-eight Americans were killed on February 25, 1991. An Iraqi Scud hit the Army
barracks in Dhahran, Saudi Arabia. The patriot defense system had failed to track and intercept
the Scud. What was the cause for this failure?
The Patriot defense system consists of an electronic detection device called the range gate. It
calculates the area in the air space where it should look for a Scud. To find out where it should
aim next, it calculates the velocity of the Scud and the last time the radar detected the Scud.
Time is saved in a register that has 24 bits length. Since the internal clock of the system is
measured for every one-tenth of a second, 1/10 is expressed in a 24 bit-register as
0.00011001100110011001100. However, this is not an exact representation. In fact, it would
need infinite numbers of bits to represent 1/10 exactly. So, the error in the representation in
decimal format is
Introduction to Numerical Methods 01.01.12

Figure 1 Patriot missile (Courtesy of the US Armed Forces,


http://www.redstone.army.mil/history/archives/patriot/patriot.html)

1
 (0  21  0  22  0  23  1 24  ...  1 222  0  223  0  224 )
10
 9.537 108
The battery was on for 100 consecutive hours, hence causing an inaccuracy of
s 3600s
 9.537  108  100 hr 
0.1s 1hr
 0.3433s
The shift calculated in the range gate due to 0.3433s was calculated as 687 m . For
the Patriot missile defense system, the target is considered out of range if the shift was going
to more than 137 m .

Q: What is truncation error?


A: Truncation error is defined as the error caused by truncating a mathematical procedure. For
x
example, the Maclaurin series for e is given as
x2 x3
ex  1 x    ....................
2! 3!
x
This series has an infinite number of terms but when using this series to calculate e , only a
x
finite number of terms can be used. For example, if one uses three terms to calculate e , then
x2
ex  1 x  .
2!
the truncation error for such an approximation is
 x2  x3 x4
e x
 
 1  x  ,    .......... .......... ...
Truncation error =
 2!  3! 4!
But, how can truncation error be controlled in this example? We can use the concept of relative
approximate error to see how many terms need to be considered. Assume that one is
1.2
calculating e using the Maclaurin series, then
1.2 2 1.2 3
e1.2  1  1.2    .......... .........
2! 3!
Let us assume one wants the absolute relative approximate error to be less than 1 %. In Table
1.2
1, we show the value of e , approximate error and absolute relative approximate error as a
function of the number of terms, n .
Introduction to Numerical Methods 01.01.13

n e1.2 Ea a %
1 1 - -
2 2.2 1.2 54.546
3 2.92 0.72 24.658
4 3.208 0.288 8.9776
5 3.2944 0.0864 2.6226
6 3.3151 0.020736 0.62550

Using 6 terms of the series yields a a < 1%.


Q: Can you give me other examples of truncation error?
A: In many textbooks, the Maclaurin series is used as an example to illustrate truncation error.
This may lead you to believe that truncation errors are just chopping a part of the series.
However, truncation error can take place in other mathematical procedures as well. For
example to find the derivative of a function, we define
f x  x   f x 
f x   lim
x 0 x
But since we cannot use x  0, we have to use a finite value of x , to give
f ( x  x)  f ( x)
f ( x) 
x
So the truncation error is caused by choosing a finite value of x as opposed to a x  0.
For example, in finding f (3) for f ( x)  x , we have the exact value calculated as
2

follows.
f ( x)  x 2
From the definition of the derivative of a function,
f ( x  x )  f ( x ) ( x  x) 2  ( x) 2
f ( x )  lim  lim
x 0 x x 0 x
x  2 xx  (x)  x
2 2 2
 lim (2 x  x )
 lim x 0
 2x
x 0 x
This is the same expression you would have obtained by directly using the formula from your
differential calculus class
d
( x n )  nx n 1
dx
By this formula for
f ( x)  x 2
f ( x )  2 x
The exact value of f (3) is
f (3)  2  3
6
If we now choose x  0.2 , we get
f (3  0.2)  f (3) f (3.2)  f (3) 3.2 2  3 2 10.24  9 1.24
f (3)   =  
0.2 0.2 0 .2 0.2 0.2
 6.2
Introduction to Numerical Methods 01.01.14

We purposefully chose a simple function f ( x)  x with value of x  2 and x  0.2


2

because we wanted to have no round-off error in our calculations so that the truncation error
can be isolated. The truncation error in this example is
6  6.2  0.2.
Can you reduce the truncate error by choosing a smaller x ?
Another example of truncation error is the numerical integration of a function,
b
I   f ( x)dx
a

Exact calculations require us to calculate the area under the curve by adding the area
of the rectangles as shown in Figure 2. However, exact calculations requires an infinite number
of such rectangles. Since we cannot choose an infinite number of rectangles, we will have
truncation error.
For example, to find
9

x
2
dx ,
3

we have the exact value as


9
 9 3  33 
9
 x3 
3 x dx   3    234
2

3  3 
If we now choose to use two rectangles of equal width to approximate the area (see Figure 2)
under the curve, the approximate value of the integral
9

x dx  ( x 2 ) (6  3)  ( x 2 ) (9  6)
2
x 3 x 6
3

 (32 )3  (6 2 )3
 27  108
 135

90

y = x2
60

30

0 x
0 3 6 9 12

Figure 2 Plot of y  x showing the approximate area under the curve from x  3 to
2

x  9 using two rectangles.


Introduction to Numerical Methods 01.01.15

Again, we purposefully chose a simple example because we wanted to have no round


off error in our calculations. This makes the obtained error purely truncation. The truncation
error is
234  135  99
Can you reduce the truncation error by choosing more rectangles as given in Figure 3? What
is the truncation error?

90

y = x2
60

30

0 x
0 1.5 3 4.5 6 7.5 9 10.5 12

Figure 3 Plot of y  x showing the approximate area under the curve from
2

x  3 to x  9 using four rectangles.

References
“Patriot Missile Defense – Software Problem Led to System Failure at Dhahran,
Saudi Arabia”, GAO Report, General Accounting Office, Washington DC,
February 4, 1992.
Introduction to Numerical Methods 01.01.16

Binary Representation of Numbers


After reading this chapter, you should be able to:

1. convert a base-10 real number to its binary representation,


2. convert a binary number to an equivalent base-10 number.

In everyday life, we use a number system with a base of 10. For example, look at the
number 257.56. Each digit in 257.56 has a value of 0 through 9 and has a place value. It can
be written as
257 .76  2  10 2  5  101  7  10 0  7  10 1  6  10 2
In a binary system, we have a similar system where the base is made of only two digits 0 and
1. So it is a base 2 system. A number like (1011.0011) in base-2 represents the decimal number
as
 
(1011 .0011 ) 2  (1  23  0  22  1  21  1  20 )  (0  21  0  22  1  23  1  24 ) 10
 11 .1875
in the decimal system.
To understand the binary system, we need to be able to convert binary numbers to
decimal numbers and vice-versa.
We have already seen an example of how binary numbers are converted to decimal
numbers. Let us see how we can convert a decimal number to a binary number. For example
take the decimal number 11.1875. First, look at the integer part: 11.
1. Divide 11 by 2. This gives a quotient of 5 and a remainder of 1. Since the remainder
is 1, a0  1 .
2. Divide the quotient 5 by 2. This gives a quotient of 2 and a remainder of 1. Since
the remainder is 1, a1  1 .
3. Divide the quotient 2 by 2. This gives a quotient of 1 and a remainder of 0. Since
the remainder is 0, a 2  0 .
4. Divide the quotient 1 by 2. This gives a quotient of 0 and a remainder of 1. Since
the remainder is , a3  1 .
Since the quotient now is 0, the process is stopped. The above steps are summarized in Table
1.
Introduction to Numerical Methods 01.01.17

Table 1 Converting a base-10 integer to binary representation.

Quotient Remainder
11/2 5 1  a0
5/2 2 1  a1
2/2 1 0  a2
1/2 0 1  a3
Hence
(11)10  (a3 a2 a1a0 ) 2
 (1011) 2
For any integer, the algorithm for finding the binary equivalent is given in the flow chart on
the next page.
Now let us look at the decimal part, that is, 0.1875.
1. Multiply 0.1875 by 2. This gives 0.375. The number before the decimal is 0 and the
number after the decimal is 0.375. Since the number before the decimal is 0, a 1  0 .
2. Multiply the number after the decimal, that is, 0.375 by 2. This gives 0.75. The number
before the decimal is 0 and the number after the decimal is 0.75. Since the number
before the decimal is 0, a  2  0 .
3. Multiply the number after the decimal, that is, 0.75 by 2. This gives 1.5. The number
before the decimal is 1 and the number after the decimal is 0.5. Since the number
before the decimal is 1, a 3  1 .
4. Multiply the number after the decimal, that is, 0.5 by 2. This gives 1.0. The number
before the decimal is 1 and the number after the decimal is 0. Since the number before
the decimal is 1, a  4  1 .
Since the number after the decimal is 0, the conversion is complete. The above steps are
summarized in Table 2.

Table 2. Converting a base-10 fraction to binary representation.

Number after Number before


Number
decimal decimal
0.1875  2 0.375 0.375 0  a 1
0.375  2 0.75 0.75 0  a 2
0.75  2 1.5 0.5 1  a 3
0.5  2 1.0 0.0 1  a 4
Introduction to Numerical Methods 01.01.18

Start

Integer N to be converted
Input (N)10
to binary format

i=0

Divide N by 2 to get
quotient Q & remainder R

i = i+1 ai = R

No

Is Q = 0?

Yes

n=i
(N)10 = (an. . .a0)2

STO
P
Introduction to Numerical Methods 01.01.19

Hence
(0.1875)10  (a1a 2 a 3a 4 ) 2
 (0.0011) 2
The algorithm for any fraction is given in a flowchart on the next page.
Having calculated
(11)10  (1011) 2
and
(0.1875)10  (0.0011) 2 ,
we have
(11.1875)10  (1011.0011) 2 .
In the above example, when we were converting the fractional part of the number, we were left
with 0 after the decimal number and used that as a place to stop. In many cases, we are never
left with a 0 after the decimal number. For example, finding the binary equivalent of 0.3 is
summarized in Table 3.

Table 3. Converting a base-10 fraction to approximate binary representation.

Number after Number before


Number
decimal decimal
0.3  0  a 1
0.6 0.6
2
0.6  1  a 2
1.2 0.2
2
0.2  0  a 3
0.4 0.4
2
0.4  0  a 4
0.8 0.8
2
0.8  1  a 5
1.6 0.6
2

As you can see the process will never end. In this case, the number can only be approximated
in binary format, that is,
(0.3)10  (a 1a  2 a 3a  4 a 5 ) 2  (0.01001) 2
Q: But what is the mathematics behinds this process of converting a decimal number to binary
format?
A: Let z be the decimal number written as
z  x. y
where
x is the integer part and y is the fractional part.
We want to find the binary equivalent of x . So we can write
Introduction to Numerical Methods 01.01.20

Start

Fraction F to be converted
Input (F)10
to binary format

i  1

Multiply F by 2 to get
number before decimal, S
and after decimal, T

i  i 1 ai = S

No

Is T = 0?

Yes

n=i
(F)10 = (a-1. . .a-n)2

STOP
Introduction to Numerical Methods 01.01.21

x  an 2n  an1 2 n1  ...  a0 20


If we can now find a 0 ,. . ., a n in the above equation then
( x)10  (an an1 . . .a0 ) 2
We now want to find the binary equivalent of y . So we can write
y  b1 2 1  b2 2 2  ...  bm 2 m
If we can now find b1 ,. . ., b m in the above equation then
( y)10  (b1b2 . . .bm ) 2
Let us look at this using the same example as before.

Example 1
Convert (11.1875)10 to base 2.
Solution
To convert (11)10 to base 2, what is the highest power of 2 that is part of 11. That power is
3, as 2  8 to give
3

11  2 3  3
What is the highest power of 2 that is part of 3. That power is 1, as 21  2 to give
3  21  1
So
11  23  3  23  21  1
What is the highest power of 2 that is part of 1. That power is 0, as 2 0  1 to give
1  20
Hence
(11)10  23  21  1  23  21  20  1 23  0  2 2  1 21  1 20  (1011) 2
To convert (0.1875)10 to the base 2, we proceed as follows. What is the smallest negative
power of 2 that is less than or equal to 0.1875. That power is  3 as 2  0.125.
3

So
0.1875  2 3  0.0625
What is the next smallest negative power of 2 that is less than or equal to 0.0625. That power
is  4 as 2  0.0625 .
4

So
0.1875  23  24
Hence
(0.1875 )10  2 3  0.0625  2 3  2 4  0  2 1  0  2 2  1  2 3  1  2 4  (0.0011) 2
Since
(11)10  (1011) 2
and
(0.1875)10  (0.0011) 2
we get
Introduction to Numerical Methods 01.01.22

(11.1875)10  (1011.0011) 2
Can you show this algebraically for any general number?

Example 2
Convert (13.875)10 to base 2.
Solution
For (13)10 , conversion to binary format is shown in Table 4.

Table 4. Conversion of base-10 integer to binary format.


Quotient Remainder
13/2 6 1  a0
6/2 3 0  a1
3/2 1 1  a2
1/2 0 1  a3

So
(13)10  (1101)2 .
Conversion of (0.875)10 to binary format is shown in Table 5.

Table 5. Converting a base-10 fraction to binary representation.

Number after Number before


Number
decimal decimal
0.875  1  a1
1.75 0.75
2
0.75  2 1.5 0.5 1  a 2
0.5  2 1.0 0.0 1  a 3

So
(0.875)10  (0.111) 2
Hence
(13.875)10  (1101.111)2
Introduction to Numerical Methods 01.01.23

Floating Point Representation

After reading this chapter, you should be able to:


1. convert a base-10 number to a binary floating point representation,
2. convert a binary floating point number to its equivalent base-10 number,
3. understand the IEEE-754 specifications of a floating point representation in a
typical computer,
4. calculate the machine epsilon of a representation.

Consider an old time cash register that would ring any purchase between 0 and 999.99 units of
money. Note that there are five (not six) working spaces in the cash register (the decimal
number is shown just for clarification).
Q: How will the smallest number 0 be represented?
A: The number 0 will be represented as
0 0 0 . 0 0

Q: How will the largest number 999.99 be represented?


A: The number 999.99 will be represented as
9 9 9 . 9 9

Q: Now look at any typical number between 0 and 999.99, such as 256.78. How would it be
represented?
A: The number 256.78 will be represented as
2 5 6 . 7 8

Q: What is the smallest change between consecutive numbers?


A: It is 0.01, like between the numbers 256.78 and 256.79.

Q: What amount would one pay for an item, if it costs 256.789?


A: The amount one would pay would be rounded off to 256.79 or chopped to 256.78. In either
case, the maximum error in the payment would be less than 0.01.

Q: What magnitude of relative errors would occur in a transaction?


A: Relative error for representing small numbers is going to be high, while for large numbers
the relative error is going to be small.
For example, for 256.786, rounding it off to 256.79 accounts for a round-off error of
256.786  256.79  0.004 . The relative error in this case is
 0.004
t   100
256 .786
 0.001558% .
Introduction to Numerical Methods 01.01.24

For another number, 3.546, rounding it off to 3.55 accounts for the same round-off
error of 3.546  3.55  0.004 . The relative error in this case is
 0.004
t   100
3.546
 0.11280% .

Q: If I am interested in keeping relative errors of similar magnitude for the range of numbers,
what alternatives do I have?
A: To keep the relative error of similar order for all numbers, one may use a floating-point
representation of the number. For example, in floating-point representation, a number
256.78 is written as  2.5678  10 ,
2

3
0.003678 is written as  3.67810 , and
 256.789 is written as  2.56789 102 .
The general representation of a number in base-10 format is given as
sign  mantissa  10 exponent
or for a number y ,
y    m 10 e
Where
  sign of the number,  1 or - 1
m  mantissa, 1  m  10
e  integer exponent (also called ficand)
Let us go back to the example where we have five spaces available for a number. Let us also
limit ourselves to positive numbers with positive exponents for this example. If we use the
same five spaces, then let us use four for the mantissa and the last one for the exponent. So
the smallest number that can be represented is 1 but the largest number would be 9.99910 .
9

By using the floating-point representation, what we lose in accuracy, we gain in the range of
numbers that can be represented. For our example, the maximum number represented changed
from 999.99 to 9.99910 .
9

What is the error in representing numbers in the scientific format? Take the previous
example of 256.78. It would be represented as 2.568 10 and in the five spaces as
2

2 5 6 8 2
Another example, the number 576329.78 would be represented as 5.763 10 and in five
5

spaces as
5 7 6 3 5
So, how much error is caused by such representation. In representing 256.78, the round
off error created is 256.78  256.8  0.02 , and the relative error is
 0.02
t   100  0.0077888 % ,
256 .78
In representing 576329.78 , the round off error created is 576329.78  5.763  10  29.78 ,
5

and the relative error is


29 .78
t   100  0.0051672 % .
576329 .78
Introduction to Numerical Methods 01.01.25

What you are seeing now is that although the errors are large for large numbers, but the relative
errors are of the same order for both large and small numbers.

Q: How does this floating-point format relate to binary format?


A: A number y would be written as
y    m  2e
Where
 = sign of number (negative or positive – use 0 for positive and 1 for negative),
m = mantissa, 12  m  10 2 , that is, 110  m  210 , and
e = integer exponent.

Example 1
Represent 54.7510 in floating point binary format. Assuming that the number is written to a
hypothetical word that is 9 bits long where the first bit is used for the sign of the number, the
second bit for the sign of the exponent, the next four bits for the mantissa, and the next three
bits for the exponent,

Solution
54 .75 10  (110110 .11) 2  1.1011011 2  2 ( 5 )10

The exponent 5 is equivalent in binary format as


510  1012
Hence
54.7510  1.10110112  2 (101) 2

The sign of the number is positive, so the bit for the sign of the number will have zero in it.
 0
The sign of the exponent is positive. So the bit for the sign of the exponent will have zero in
it.
The mantissa
m  1011
(There are only 4 places for the mantissa, and the leading 1 is not stored as it is always expected
to be there), and
the exponent
e  101 .
we have the representation as

0 0 1 0 1 1 1 0 1

Example 2
What number does the below given floating point format
0 1 1 0 1 1 1 1 0
Introduction to Numerical Methods 01.01.26

represent in base-10 format. Assume a hypothetical 9-bit word, where the first bit is used for
the sign of the number, second bit for the sign of the exponent, next four bits for the mantissa
and next three for the exponent.
Solution
Given
Bit Representation Part of Floating point number
0 Sign of number
1 Sign of exponent
1011 Magnitude of mantissa
110 Magnitude of exponent

The first bit is 0, so the number is positive.


The second bit is 1, so the exponent is negative.
The next four bits, 1011, are the magnitude of the mantissa, so
m  1.10112  1  20  1  2 1  0  2 2  1  2 3  1  2 4 10  1.6875 10
The last three bits, 110, are the magnitude of the exponent, so
e  110 2  1 22  1 21  0  20 10  610
The number in binary format then is
1.1011 2  2  110  2
The number in base-10 format is
= 1.6875 2
6
 0.026367

Example 3
A machine stores floating-point numbers in a hypothetical 10-bit binary word. It employs the
first bit for the sign of the number, the second one for the sign of the exponent, the next four
for the exponent, and the last four for the magnitude of the mantissa.
a) Find how 0.02832 will be represented in the floating-point 10-bit word.
b) What is the decimal equivalent of the 10-bit word representation of part (a)?
Solution
a) For the number, we have the integer part as 0 and the fractional part as 0.02832
Let us first find the binary equivalent of the integer part
Integer part 010  02
Now we find the binary equivalent of the fractional part
Fractional part: .02832  2
0.05664  2
0.11328  2
0.22656  2
0.45312  2
0.90624  2
1.81248  2
1.62496  2
1.24992  2
Introduction to Numerical Methods 01.01.27

0.49984  2
0.99968  2
1.99936
Hence
0.02832 10  0.0000011100 12
 1.110012  2 6
 1.11002  2 6
The binary equivalent of exponent is found as follows
Quotient Remainder
6/2 3 0  a0
3/2 1 1  a1
1/2 0 1  a2
So
610  110 2
So
0.0283210  1.11002  2 110 2

 1.11002  2 0110 2

Part of Floating point number Bit Representation


Sign of number is positive 0
Sign of exponent is negative 1
Magnitude of the exponent 0110
Magnitude of mantissa 1100

The ten-bit representation bit by bit is


0 1 0 1 1 0 1 1 0 0

b) Converting the above floating point representation from part (a) to base 10 by following
Example 2 gives
1.11002  2  01102
 
 1 20  1 21  1 22  0  23  0  24  2  0  2
3
 1 2 2  1 2 1  0  2 0 
 1.7510  2 6 10
 0.02734375
Q: How do you determine the accuracy of a floating-point representation of a number?
A: The machine epsilon, mach is a measure of the accuracy of a floating point representation
and is found by calculating the difference between 1 and the next number that can be
represented. For example, assume a 10-bit hypothetical computer where the first bit is used
for the sign of the number, the second bit for the sign of the exponent, the next four bits for the
exponent and the next four for the mantissa.
We represent 1 as
0 0 0 0 0 0 0 0 0 0
and the next higher number that can be represented is
Introduction to Numerical Methods 01.01.28

0 0 0 0 0 0 0 0 0 1
The difference between the two numbers is
1.00012  2(0000)  1.00002  2(0000)
2 2

 0.00012
 (1  2 4 ) 10
 (0.0625)10 .
The machine epsilon is
mach  0.0625.
The machine epsilon, mach is also simply calculated as two to the negative power of the
number of bits used for mantissa. As far as determining accuracy, machine epsilon, mach is
an upper bound of the magnitude of relative error that is created by the approximate
representation of a number (See Example 4).

Example 4
A machine stores floating-point numbers in a hypothetical 10-bit binary word. It employs the
first bit for the sign of the number, the second one for the sign of the exponent, the next four
for the exponent, and the last four for the magnitude of the mantissa. Confirm that the
magnitude of the relative true error that results from approximate representation of 0.02832 in
the 10-bit format (as found in previous example) is less than the machine epsilon.
Solution
From Example 2, the ten-bit representation of 0.02832 bit-by-bit is
0 1 0 1 1 0 1 1 0 0
Again from Example 2, converting the above floating point representation to base-10 gives
1.11002  2  01102
 1.7510  26 10
 0.02734375 10
The absolute relative true error between the number 0.02832 and its approximate
representation 0.02734375 is
0.02832  0.02734375
t 
0.02832
 0.034472
which is less than the machine epsilon for a computer that uses 4 bits for mantissa, that is,
 mach  2 4
 0.0625 .
Q: How are numbers actually represented in floating point in a real computer?
A: In an actual typical computer, a real number is stored as per the IEEE-754 (Institute of
Electrical and Electronics Engineers) floating-point arithmetic format. To keep the discussion
short and simple, let us point out the salient features of the single precision format.
 A single precision number uses 32 bits.
 A number y is represented as
y    1.a1 a2  a23   2 e
Introduction to Numerical Methods 01.01.29

where
 = sign of the number (positive or negative)
ai  entries of the mantissa, can be only 0 or 1, i  1,..,23
e =the exponent
Note the 1 before the radix point.
11. The first bit represents the sign of the number (0 for positive number and 1 for a
negative number).
12. The next eight bits represent the exponent. Note that there is no separate bit for the
sign of the exponent. The sign of the exponent is taken care of by normalizing by
adding 127 to the actual exponent. For example in the previous example, the exponent
was 6. It would be stored as the binary equivalent of 127  6  133 . Why is 127 and
not some other number added to the actual exponent? Because in eight bits the largest
integer that can be represented is 11111111 2  255 , and halfway of 255 is 127. This
allows negative and positive exponents to be represented equally. The normalized (also
called biased) exponent has the range from 0 to 255, and hence the exponent e has the
range of  127  e  128 .
13. If instead of using the biased exponent, let us suppose we still used eight bits for the
exponent but used one bit for the sign of the exponent and seven bits for the exponent
magnitude. In seven bits, the largest integer that can be represented is
1111111 2  127 in which case the exponent e range would have been smaller, that
is,  127  e  127 . By biasing the exponent, the unnecessary representation of a
negative zero and positive zero exponent (which are the same) is also avoided.
14. Actually, the biased exponent range used in the IEEE-754 format is not 0 to 255, but 1
to 254. Hence, exponent e has the range of  126  e  127 . So what are e  127
and e  128 used for? If e  128 and all the mantissa entries are zeros, the number is
  ( the sign of infinity is governed by the sign bit), if e  128 and the mantissa
entries are not zero, the number being represented is Not a Number (NaN). Because of
the leading 1 in the floating point representation, the number zero cannot be represented
exactly. That is why the number zero (0) is represented by e  127 and all the
mantissa entries being zero.
15. The next twenty-three bits are used for the mantissa.
16. The largest number by magnitude that is represented by this format is
1 2 0

 1 2 1  1 2 2    1 2 22  1 2 23  2127  3.40  10 38
The smallest number by magnitude that is represented, other than zero, is
1 2 0

 0  21  0  22    0  222  0  223  2126  1.18  10 38
 Since 23 bits are used for the mantissa, the machine epsilon,
mach  2 23
 1.19  10 7
.

Q: How are numbers represented in floating point in double precision in a computer?


A: In double precision IEEE-754 format, a real number is stored in 64 bits.
 The first bit is used for the sign,
 the next 11 bits are used for the exponent, and
 the rest of the bits, that is 52, are used for mantissa.
Introduction to Numerical Methods 01.01.30

Can you find in double precision the


(L) range of the biased exponent,
(M) smallest number that can be represented,
(N) largest number that can be represented, and
(O) machine epsilon?

Propagation of Errors
If a calculation is made with numbers that are not exact, then the calculation itself will have an
error. How do the errors in each individual number propagate through the calculations. Let’s
look at the concept via some examples.

Example 1
Find the bounds for the propagation error in adding two numbers. For example if one is
calculating X  Y where
X  1.5  0.05 ,
Y  3.4  0.04 .
Solution
By looking at the numbers, the maximum possible value of X and Y are
X  1.55 and Y  3.44
Hence
X  Y  1.55  3.44  4.99
is the maximum value of X  Y .
The minimum possible value of X and Y are
X  1.45 and Y  3.36 .
Hence
X  Y  1.45  3.36
 4.81
is the minimum value of X  Y .
Hence
4.81  X  Y  4.99 .

One can find similar intervals of the bound for the other arithmetic operations of
X  Y , X * Y , and X / Y . What if the evaluations we are making are function evaluations
instead? How do we find the value of the propagation error in such cases.
If f is a function of several variables X 1 , X 2 , X 3 ,......., X n 1 , X n , then the
maximum possible value of the error in f is
f f f f
f  X 1  X 2  .......  X n 1  X n
X 1 X 2 X n 1 X n
Introduction to Numerical Methods 01.01.31

Example 2
The strain in an axial member of a square cross-section is given by
F
 2
h E
where
F =axial force in the member, N
h = length or width of the cross-section, m
E =Young’s modulus, Pa
Given
F  72  0.9 N
h  4  0.1 mm
E  70  1.5 GPa
Find the maximum possible error in the measured strain.
Solution
72
 3
(4  10 ) 2 (70  10 9 )
 64.286 106
 64.286 m
  
  F  h  E
F h E
 1
 2
F h E
 2F
 3
h h E
 F
 2 2
E h E
1 2F F
  2
F  3 h  2 2 E
h E h E h E
1 2  72
 3
 0.9  3 3
 0.0001
(4  10 ) (70  10 )
2 9
( 4  10 ) (70  10 9 )
72
  1.5  10 9
( 4  10 3 ) 2 (70  10 9 ) 2
 8.0357107  3.2143106  1.3776106
 5.3955 10 6
 5.3955m
Hence
 (64.286m  5.3955m )
implying that the axial strain,  is between 58.8905m and 69.6815m

Example 3
Subtraction of numbers that are nearly equal can create unwanted inaccuracies. Using the
formula for error propagation, show that this is true.
Introduction to Numerical Methods 01.01.32

Solution
Let
z  x y
Then
z z
z  x  y
x y
 (1)x  (1)y
 x  y
So the absolute relative change is
z  x  y

z x y
As x and y become close to each other, the denominator becomes small and hence create
large relative errors.
For example if
x  2  0.001
y  2.003  0.001
z 0.001  0.001

z | 2  2.003 |
= 0.6667
= 66.67%

Taylor Theorem Revisited


After reading this chapter, you should be able to

(P) understand the basics of Taylor’s theorem,


(Q) write transcendental and trigonometric functions as Taylor’s polynomial,
(R) use Taylor’s theorem to find the values of a function at any point, given the values of
the function and all its derivatives at a particular point,
(S) calculate errors and error bounds of approximating a function by Taylor series, and
(T) revisit the chapter whenever Taylor’s theorem is used to derive or explain numerical
methods for various mathematical procedures.

The use of Taylor series exists in so many aspects of numerical methods that it is imperative
to devote a separate chapter to its review and applications. For example, you must have come
across expressions such as
x2 x4 x6
cos( x)  1     (1)
2! 4! 6!
x3 x5 x7
sin( x)  x     (2)
3! 5! 7!
x2 x3
ex  1 x    (3)
2! 3!
Introduction to Numerical Methods 01.01.33

All the above expressions are actually a special case of Taylor series called the Maclaurin
series. Why are these applications of Taylor’s theorem important for numerical methods?
Expressions such as given in Equations (1), (2) and (3) give you a way to find the approximate
values of these functions by using the basic arithmetic operations of addition, subtraction,
division, and multiplication.

Example 1
0.25
Find the value of e using the first five terms of the Maclaurin series.
Solution
x
The first five terms of the Maclaurin series for e is
x 2 x3 x 4
ex  1 x   
2! 3! 4!
0.25 2 0.25 3 0.25 4
e 0.25  1  0.25   
2! 3! 4!
 1.2840
0.25
The exact value of e up to 5 significant digits is also 1.2840.
But the above discussion and example do not answer our question of what a Taylor series is.
Here it is, for a function f  x 
f x  2 f x  3
f x  h  f x   f x h  h  h  (4)
2! 3!
provided all derivatives of f  x  exist and are continuous between x and x  h .

What does this mean in plain English?


As Archimedes would have said (without the fine print), “Give me the value of the function at
a single point, and the value of all (first, second, and so on) its derivatives, and I can give you
the value of the function at any other point”.
It is very important to note that the Taylor series is not asking for the expression of the
function and its derivatives, just the value of the function and its derivatives at a single point.
Now the fine print: Yes, all the derivatives have to exist and be continuous between x
(the point where you are) to the point, x  h where you are wanting to calculate the function
th
at. However, if you want to calculate the function approximately by using the n order Taylor
st nd th
polynomial, then 1 ,2 ,...., n derivatives need to exist and be continuous in the closed
interval [ x, x  h] , while the (n  1) derivative needs to exist and be continuous in the open
th

interval ( x, x  h) .

Example 2
 
Take f x   sin x  , we all know the value of sin  2   1 . We also know the f x   cosx 
 
   
and cos 2   0 . Similarly f x    sin( x) and sin  2   1 . In a way, we know the value
   

of sin  x  and all its derivatives at x  2 . We do not need to use any calculators, just plain
Introduction to Numerical Methods 01.01.34

differential calculus and trigonometry would do. Can you use Taylor series and this
information to find the value of sin 2  ?
Solution

x
2
xh2
h  2 x

 2
2
 0.42920
So
h2 h3 h4
f x  h   f x   f x h  f x   f x   f ( x)  
2! 3! 4!

x
2
h  0.42920
   
f x   sin x  , f    sin    1
2 2
 
f x   cosx  , f    0
2
 
f  x    sin  x  , f    1
2
 
f x    cos(x) , f    0
2

 
f  x   sin( x) , f    1
2
Hence
        h   h   h
2 3 4
f   h   f    f  h  f    f    f    
2  2 2  2  2!  2  3!  2  4!
 
f   0.42920   1  00.42920   1
0.42920   0 0.42920 3  1 0.42920 4  
2

2  2! 3! 4!
 1  0  0.092106  0  0.00141393  
 0.90931
The value of sin 2  I get from my calculator is 0.90930 which is very close to the value I
just obtained. Now you can get a better value by using more terms of the series. In addition,
you can now use the value calculated for sin 2  coupled with the value of cos2  (which can
be calculated by Taylor series just like this example or by using the sin x  cos x  1 identity)
2 2

to find value of sin  x  at some other point. In this way, we can find the value of sin  x  for
any value from x  0 to 2 and then can use the periodicity of sin  x  , that is
sin x   sin x  2n , n  1,2, to calculate the value of sin  x  at any other point.
Introduction to Numerical Methods 01.01.35

Example 3
x3 x5 x7
Derive the Maclaurin series of sin x   x    
3! 5! 7!
Solution

In the previous example, we wrote the Taylor series for sin  x  around the point x  2 .
Maclaurin series is simply a Taylor series for the point x  0 .
f x   sin x  , f 0   0
f x   cosx  , f 0  1
f  x    sin  x  , f 0   0
f x   cosx , f 0   1
f x  sin x , f 0  0
f x   cos(x) , f 0   1

Using the Taylor series now,


h2 h3 h4 h5
f x  h   f x   f x h  f x   f x   f x   f x   
2! 3! 4 5
h2 h3 h4 h5
f 0  h   f 0  f 0h  f 0  f 0  f 0  f 0  
2! 3! 4 5
2 3 4 5
f h   f 0  f 0h  f 0  f 0  f 0  f 0  
h h h h
2! 3! 4 5
h2 h3 h4 h5
 0  1h   0 1  0 1 
2! 3! 4 5
h3 h5
 h  
3! 5!
So
x3 x5
f x   x   
3! 5!
x3 x5
sin  x   x   
3! 5!

Example 4
Find the value of f 6 given that f 4  125 , f 4  74 , f 4  30 , f 4  6 and all other
higher derivatives of f  x  at x  4 are zero.
Solution
h2 h3
f x  h   f x   f x h  f x   f x   
2! 3!
x4
h  64
2
Introduction to Numerical Methods 01.01.36

Since fourth and higher derivatives of f  x  are zero at x  4 .


22 23
f 4  2  f 4  f 42  f 4 f 4
2! 3!
 22   23 
f 6  125  742  30   6 
 2!   3! 
 125  148  60  8
 341
Note that to find f 6 exactly, we only needed the value of the function and all its derivatives
at some other point, in this case, x  4 . We did not need the expression for the function and
all its derivatives. Taylor series application would be redundant if we needed to know the
expression for the function, as we could just substitute x  6 in it to get the value of f 6 .
Actually the problem posed above was obtained from a known function
f x   x 3  3x 2  2 x  5 where f 4   125 , f 4   74 , f 4   30 , f 4   6 , and all other
higher derivatives are zero.

Error in Taylor Series


As you have noticed, the Taylor series has infinite terms. Only in special cases such as a finite
polynomial does it have a finite number of terms. So whenever you are using a Taylor series
to calculate the value of a function, it is being calculated approximately.

The Taylor polynomial of order n of a function f (x) with (n  1) continuous derivatives in


the domain [ x, x  h] is given by
h2 hn
f x  h   f x   f x h  f ' ' x     f  n   x   Rn  x  h 
2! n!
where the remainder is given by
Rn  x  h  
h n1 f  n 1
c  .
(n  1)!
where
xc xh
that is, c is some point in the domain  x, x  h  .

Example 5
The Taylor series for e at point x  0 is given by
x

x 2 x3 x 4 x5
ex  1 x     
2! 3! 4! 5!
1
a) What is the truncation (true) error in the representation of e if only four terms of the series
are used?
b) Use the remainder theorem to find the bounds of the truncation error.
Solution
17. If only four terms of the series are used, then
x2 x3
ex  1 x  
2! 3!
Introduction to Numerical Methods 01.01.37

12 13
e1  1  1 
2! 3!
 2.66667
The truncation (true) error would be the unused terms of the Taylor series, which then are
x4 x5
Et   
4! 5!
14 15
  
4! 5!
 0.0516152
18. But is there any way to know the bounds of this error other than calculating it directly?
Yes,
hn
f x  h   f x   f x h    f n  x   Rn  x  h 
n!
where
Rn x  h  
hn1 f n1 c 
x  c  x  h , and
n  1! ,
c is some point in the domain x, x  h  . So in this case, if we are using four terms of the
Taylor series, the remainder is given by x  0, n  3
R3 0  1 
131 f 31 c 
3  1!

1
f 4 
c 
4!
ec

24
Since
xc xh
0  c  0 1
0  c 1
The error is bound between
e0 e1
 R3 1 
24 24
 R3 1 
1 e
24 24
0.041667  R3 1  0.113261
So the bound of the error is less than 0.113261 which does concur with the calculated error
of 0.0516152 .

Example 6
The Taylor series for e at point x  0 is given by
x

x 2 x3 x 4 x5
ex  1 x     
2! 3! 4! 5!
Introduction to Numerical Methods 01.01.38

As you can see in the previous example that by taking more terms, the error bounds decrease
1
and hence you have a better estimate of e . How many terms it would require to get an
1 6
approximation of e within a magnitude of true error of less than 10 ?
Solution
Using n  1 terms of the Taylor series gives an error bound of
Rn x  h  
hn1 f n1 c 
n  1!
x  0, h  1, f ( x)  e x

Rn 1 
1n1 f n1 c 
n  1!

1n 1 e c
n  1!
Since
xc xh
0  c  0 1
0  c 1
 Rn 1 
1 e
(n  1)! (n  1)!
1
So if we want to find out how many terms it would require to get an approximation of e
6
within a magnitude of true error of less than 10 ,
e
 10 6
(n  1)!
(n  1)! 10 6 e
( n  1)!  10 6  3 (as we do not know the value of e but it is less than 3).
n9
1 6
So 9 terms or more will get e within an error of 10 in its value.

We can do calculations such as the ones given above only for simple functions. To do
a similar analysis of how many terms of the series are needed for a specified accuracy for any
general function, we can do that based on the concept of absolute relative approximate errors
discussed in Chapter 01.02 as follows.
We use the concept of absolute relative approximate error (see Chapter 01.02 for
details), which is calculated after each term in the series is added. The maximum value of m ,
2m
for which the absolute relative approximate error is less than 0.5  10 % is the least
number of significant digits correct in the answer. It establishes the accuracy of the
approximate value of a function without the knowledge of remainder of Taylor series or the
true error.

You might also like