Introduction To Numerical Methods: 1.1 A Problem and Its Solution
Contents
1.1 A problem and its solution
1.2 Error in numerical solutions
  1.2.1 Round-off Errors
  1.2.2 Truncation Error
  1.2.3 Total Error
1.3 Number representation on a computer
  1.3.1 Scientific number representation
  1.3.2 Decimal and binary representation
  1.3.3 Floating-point representation
  1.3.4 Single and double precision floating point representation
1.4 Types of problems to be solved
In this lecture, an engineering problem is introduced and its solution by both an
analytical method and a numerical method is discussed. You will observe how a
numerical method can solve a complex engineering problem without knowledge of
its analytical solution. However, numerical methods always carry some error, because
they produce an approximate solution rather than an exact one. The different types of
error that arise when using numerical methods are also discussed. In this regard, number
representation and arithmetic operations on a computer (32-bit and 64-bit) are covered.
At the end of this lecture, the types of problems that can be solved using numerical
methods are briefly introduced; they are covered in more detail in forthcoming lectures.
1.1. A problem and its solution
Consider a body falling near the earth's surface (Figure 1.1). The net force $F$ acting on
it is made up of two opposing forces, the downward pull of gravity $F_D$ and the upward
force of air resistance $F_U$:
$$F = F_D + F_U. \tag{1.1}$$
Figure 1.1: Falling object. Schematic diagram of the forces acting on a falling
parachutist. $F_D$ is the downward force due to gravity. $F_U$ is the upward force due to air
resistance.
If the downward force is assigned a positive sign, the second law can be used to
formulate the force due to gravity as
$$F_D = mg, \tag{1.2}$$
where $g$ is the gravitational constant, or the acceleration due to gravity, which is
approximately equal to 9.81 m/s$^2$, and $m$ is the mass of the object.
Although the air resistance can be formulated in many ways (for example, proportional
to $v^2$), we consider a very simple approach in which air resistance is linearly
proportional to velocity and acts in an upward direction, as in
$$F_U = -cv, \tag{1.3}$$
where $c$ is a proportionality constant called the drag coefficient (kg/s). Thus, the greater
the fall velocity, the greater the upward force due to air resistance.
The net force is the algebraic sum of the downward and upward forces. Therefore, we
can write
$$F = F_D + F_U$$
$$\Rightarrow F = mg - cv$$
$$\Rightarrow ma = mg - cv \qquad \left[a = \frac{dv}{dt}\right]$$
$$\Rightarrow \frac{dv}{dt} = \frac{mg - cv}{m}$$
$$\Rightarrow \frac{dv}{dt} = g - \frac{c}{m}v \tag{1.4}$$
Equation (1.4) is a model that relates the acceleration of a falling object to the forces
acting on it. It is a differential equation because it is written in terms of the differential
rate of change (dv/dt) of the variable that we are interested in predicting. The exact
solution of Eq. (1.4) for the velocity of the falling object cannot be obtained using simple
algebraic manipulation. Rather, more advanced techniques, such as those of calculus,
must be applied to obtain an exact or analytical solution. For example, if the object is
initially at rest (v = 0 at t = 0), calculus can be used to solve Eq. (1.4) for
$$v(t) = \frac{gm}{c}\left(1 - e^{-(c/m)t}\right). \tag{1.5}$$
Equation (1.5) is a simple mathematical model of the falling object problem.
In general, however, a mathematical model can be broadly defined in terms of dependent
variables, independent variables, parameters, and forcing functions:
Dependent variable = f(independent variables, parameters, forcing functions) (1.6)
where the dependent variable (e.g., v(t)) is a characteristic that usually reflects the
behavior or state of the system; the independent variables (e.g., t) are usually dimensions,
such as time and space, along which the system's behavior is being determined. The
parameters (e.g., m, c) are reflective of the system's properties, and the forcing functions
(e.g., g) are external influences acting upon the system.
Now, we can solve a problem using the mathematical model derived in Eq. (1.5).
Example 1.1. A parachutist of mass 68.1 kg jumps out of a stationary hot air
balloon. Use Eq. (1.5) to compute the velocity prior to opening the parachute. The
drag coefficient is equal to 12.5 kg/s.
Solution: Substituting the given values, m = 68.1 kg, g = 9.81 m/s$^2$, and
c = 12.5 kg/s, into Eq. (1.5), we get
$$v(t) = \frac{9.81 \times 68.1}{12.5}\left(1 - e^{-(12.5/68.1)t}\right) = 53.44\left(1 - e^{-0.18355t}\right).$$
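The closed-form result of Eq. (1.5) can be evaluated at any time with a few lines of code. The following is a minimal Python sketch for the data of Example 1.1; the function name `analytical_velocity` and the sample times are illustrative choices and are not part of the original notes.

```python
import math

def analytical_velocity(t, m=68.1, c=12.5, g=9.81):
    """Exact velocity (m/s) from Eq. (1.5) for an object released from rest."""
    return (g * m / c) * (1.0 - math.exp(-(c / m) * t))

if __name__ == "__main__":
    # Sample times in seconds (an assumption made for illustration).
    for t in range(0, 14, 2):
        print(f"t = {t:2d} s   v = {analytical_velocity(t):6.2f} m/s")
    # For large t the exponential vanishes and v approaches g*m/c.
    print(f"terminal velocity = {9.81 * 68.1 / 12.5:.2f} m/s")
```

The velocity starts at zero and rises toward the terminal value gm/c, which is the limit of Eq. (1.5) as t grows.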
Example 1.1 is solved using the analytical approach, which gives an analytical or exact
solution because it uses the exact mathematical model of the system defined earlier. The
mathematical model defined in Eq. (1.4) can be solved exactly, but doing so takes
considerable effort and requires one or more advanced techniques. For many other
models an exact solution is even harder, or impossible, to obtain. In many of these cases,
the only alternative is to develop a numerical solution that approximates the exact solution.
Figure 1.2: Use of a finite difference to approximate the first derivative of v with
respect to t.
Now let us reformulate the problem to find an approximate solution close to the
exact solution. This can be illustrated for Newton's second law by realizing that the time
rate of change of velocity can be approximated by a finite difference (Figure 1.2):
$$\frac{dv}{dt} \cong \frac{\Delta v}{\Delta t} = \frac{v(t_{i+1}) - v(t_i)}{t_{i+1} - t_i} \tag{1.7}$$
where $\Delta v$ and $\Delta t$ are differences in velocity and time, respectively, computed over finite
intervals, $v(t_i)$ is the velocity at an initial time $t_i$, and $v(t_{i+1})$ is the velocity at some later
time $t_{i+1}$. Note that $dv/dt \cong \Delta v/\Delta t$ is approximate because $\Delta t$ is finite. Remember
from calculus that
$$\frac{dv}{dt} = \lim_{\Delta t \to 0} \frac{\Delta v}{\Delta t}.$$
Equation (1.7) represents the reverse process and is called a finite divided difference
approximation of the derivative at time $t_i$. Substituting it into Eq. (1.4), we get
$$\frac{v(t_{i+1}) - v(t_i)}{t_{i+1} - t_i} = g - \frac{c}{m}v(t_i).$$
This equation can then be rearranged to yield
$$v(t_{i+1}) = v(t_i) + \left[g - \frac{c}{m}v(t_i)\right](t_{i+1} - t_i). \tag{1.8}$$
Thus, the differential equation has been transformed into an equation that can be used to
determine the velocity algebraically at $t_{i+1}$ using the slope and previous values of v and
t. If you are given an initial value for velocity at some time $t_i$, you can easily compute
the velocity at a later time $t_{i+1}$. Let us solve the same problem (Example 1.1) using the
numerical approach formulated in Eq. (1.8).
Example 1.2. A parachutist of mass 68.1 kg jumps out of a stationary hot air
balloon. Compute the velocity attained by the parachutist at time t (in s) using the
approximation approach. Employ a step size of 2 s for the calculation.
Solution: Let us consider that at the start of the computation ($t_0 = 0$), the
velocity of the parachutist is zero ($v(t_0) = 0$). The step size is $t_{i+1} - t_i = 2$ s.
So we can compute the velocity $v(t_1)$ at $t_1 = 2$ s as
$$v(t_1) = v(t_0) + \left[9.81 - \frac{12.5}{68.1}v(t_0)\right] \times 2 \tag{1.9}$$
$$v(t_1) = 0 + \left[9.81 - \frac{12.5}{68.1}(0)\right] \times 2 = 19.62 \text{ m/s} \tag{1.10}$$
For the next interval (from t = 2 to 4 s), the computation is repeated, but this time
the updated velocity is used to compute the velocity in the next iteration.
$$v(t_2) = v(t_1) + \left[9.81 - \frac{12.5}{68.1}v(t_1)\right] \times 2 \tag{1.11}$$
$$v(t_2) = 19.62 + \left[9.81 - \frac{12.5}{68.1}(19.62)\right] \times 2 = 32.04 \text{ m/s} \tag{1.12}$$
The computed values of velocity at different instants of time are tabulated
as
t (s)    v (m/s)
0        0.00
2        19.62
4        32.04
6        39.90
8        44.87
10       48.02
12       50.01
∞        53.44
Note that the computed velocities are approximate values.
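The repeated application of Eq. (1.8) is easy to automate. The sketch below is one possible Python implementation for the data of Example 1.2; the function name `euler_velocity` and its argument names are assumptions made for illustration.

```python
def euler_velocity(t_end, h, m=68.1, c=12.5, g=9.81, v0=0.0):
    """Apply Eq. (1.8) repeatedly from t = 0 to t_end with step size h.

    Returns two lists: the times and the approximate velocities.
    """
    n_steps = round(t_end / h)
    times, velocities = [0.0], [v0]
    for i in range(n_steps):
        v = velocities[-1]
        slope = g - (c / m) * v           # right-hand side of Eq. (1.4)
        velocities.append(v + slope * h)  # update of Eq. (1.8)
        times.append((i + 1) * h)
    return times, velocities

if __name__ == "__main__":
    for t, v in zip(*euler_velocity(t_end=12, h=2)):
        print(f"{t:4.0f} s   {v:6.2f} m/s")
```

Run with a step size of 2 s up to t = 12 s, the printed values agree with the tabulated velocities to two decimal places.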
Note that the technique used to solve Example 1.1 is called an analytical
method. The solution obtained using an analytical method is called an analytical solution;
it is an exact answer in the form of a mathematical expression in terms of the
variables associated with the problem being solved. In contrast, the technique
used to solve Example 1.2 is called a numerical method. Formally, numerical methods
are used for calculating approximate solutions to problems that cannot be solved (or are
difficult to solve) analytically. In other words, numerical methods are techniques by
which mathematical problems are formulated so that they can be solved with arithmetic
operations. Although numerical solutions are approximations, they can be very
accurate. In many numerical methods, the calculations are executed in an iterative
manner until a desired accuracy is achieved. As the previous example shows, a computational
price must be paid for more accurate numerical results: each halving of the step size
to attain more accuracy leads to a doubling of the number of computations. Thus, we
see that there is a trade-off between accuracy and computational effort.
Figure 1.3: Comparison of the analytical and numerical solutions for the falling
object problem.
As Figure 1.3 shows, the numerical method captures the essential features of the exact
solution. However, because we have employed straight-line segments in the numerical
method to approximate a continuously curving function, there is some discrepancy
between the two results. One way to minimize such discrepancies is to use a smaller step size.
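The trade-off between step size and computational effort can be checked numerically. The following Python sketch compares the result of Eq. (1.8) with the exact value of Eq. (1.5) at a fixed time for successively halved step sizes; the final time of 12 s, the step sizes, and all function names are assumptions chosen for illustration.

```python
import math

def exact_velocity(t, m=68.1, c=12.5, g=9.81):
    """Analytical solution, Eq. (1.5)."""
    return (g * m / c) * (1.0 - math.exp(-(c / m) * t))

def euler_final(t_end, h, m=68.1, c=12.5, g=9.81):
    """Velocity at t_end from repeated use of Eq. (1.8), plus the step count."""
    n_steps = round(t_end / h)
    v = 0.0
    for _ in range(n_steps):
        v += (g - (c / m) * v) * h
    return v, n_steps

def step_size_study(t_end=12.0, step_sizes=(2.0, 1.0, 0.5)):
    """Return (step size, number of steps, absolute error) for each step size."""
    rows = []
    for h in step_sizes:
        v, n = euler_final(t_end, h)
        rows.append((h, n, abs(exact_velocity(t_end) - v)))
    return rows

if __name__ == "__main__":
    for h, n, err in step_size_study():
        print(f"h = {h:4.1f} s   steps = {n:3d}   |error| = {err:.3f} m/s")
```

Each halving of the step size doubles the number of steps while the error at the final time shrinks, which is the trade-off described above.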
1.2. Error in numerical solutions
Two kinds of errors are introduced when numerical methods are used for solving a
problem.
• The first kind is round-off error. Round-off errors occur because of
the way that we (or digital computers) store numbers and execute numerical
operations (discussed in Section 1.3).
• The second kind of error is introduced by the numerical method that is used for
the solution. These errors are called truncation errors.
Together, the two errors constitute the total error of the numerical solution, which
is the difference (it can be defined in various ways) between the true (exact) solution,
which is usually unknown, and the approximate numerical solution.
1.2.1 Round-off Errors
Numbers are written with respect to a base: base 10 for decimal numbers and base 2 for
binary numbers. A computer's representation of real numbers is limited to the fixed
precision of the mantissa, so true values are sometimes not stored exactly.
Numbers are represented on a computer by a finite number of bits.
Consequently, real numbers that have a mantissa longer than the number of bits that are
available for representing them have to be shortened. A number can be shortened either
by chopping or by rounding.
• Chopping:
– In chopping, the digits in the mantissa beyond the length that can be stored
are simply left out.
– For illustration, consider the number 2/3. In decimal form with four significant
digits, 2/3 can be written as 0.6666.
• Rounding:
– In rounding, the last digit that is stored is rounded. Ex: 2/3 can be written
as 0.6667.
Either way, such chopping and rounding of real numbers lead to errors in numerical
computations, especially when many operations are performed. This is called round-off
error.
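The difference between chopping and rounding can be reproduced with Python's standard `decimal` module. This is a minimal sketch, not part of the original notes: the helper name `shorten` is hypothetical, and it shortens to a fixed number of decimal places, which coincides with four significant digits for a value such as 2/3 that lies between 0.1 and 1.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def shorten(numerator, denominator, places=4, mode="chop"):
    """Shorten numerator/denominator to `places` decimal places.

    mode="chop" drops the extra digits; any other value rounds the last digit.
    """
    value = Decimal(numerator) / Decimal(denominator)
    quantum = Decimal(10) ** -places          # e.g. 0.0001 for places=4
    rounding = ROUND_DOWN if mode == "chop" else ROUND_HALF_UP
    return value.quantize(quantum, rounding=rounding)

if __name__ == "__main__":
    print("chopped:", shorten(2, 3, mode="chop"))
    print("rounded:", shorten(2, 3, mode="round"))
```

For 2/3 the chopped and rounded results differ in the last stored digit, matching the two values quoted above.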
1.2.2 Truncation Error
Truncation error usually refers to the error introduced when a more complicated
mathematical expression is replaced with a more elementary formula, as we did in
Example 1.2. For better understanding, let us consider the infinite Taylor
series expansion of the sine function,
$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \frac{x^{11}}{11!} + \cdots \tag{1.13}$$
which might be replaced with just the first one or two terms. For example, the exact
value of $\sin(\pi/6)$ is 0.5. If only the first term is used,
$$\sin\left(\frac{\pi}{6}\right) \approx \frac{\pi}{6} = 0.5235988$$
$$E_{Trunc} = 0.5 - 0.5235988 = -0.0235988$$
If two terms of the Taylor series are used,
$$\sin\left(\frac{\pi}{6}\right) \approx \frac{\pi}{6} - \frac{(\pi/6)^3}{3!} = 0.4996742$$
$$E_{Trunc} = 0.5 - 0.4996742 = 0.0003258$$
The truncation error is dependent on the specific numerical method or algorithm used to
solve a problem.
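The truncation errors quoted above can be checked by summing the first few terms of Eq. (1.13). The following Python sketch is one way to do this; the function name `sin_taylor` is an assumption made for illustration.

```python
import math

def sin_taylor(x, n_terms):
    """Sum the first n_terms terms of the Taylor series of sin(x), Eq. (1.13)."""
    total = 0.0
    for k in range(n_terms):
        power = 2 * k + 1
        total += (-1) ** k * x ** power / math.factorial(power)
    return total

if __name__ == "__main__":
    x = math.pi / 6
    true_value = 0.5  # exact value of sin(pi/6)
    for n in (1, 2, 3):
        approx = sin_taylor(x, n)
        print(f"{n} term(s): approx = {approx:.7f}, "
              f"truncation error = {true_value - approx:+.7f}")
```

Each additional term reduces the magnitude of the truncation error, and the error alternates in sign because the series is alternating.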
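Example 1.3. The Taylor series expansion of the cosine function is
$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \frac{x^8}{8!} - \frac{x^{10}}{10!} + \cdots \tag{1.14}$$
Use the first three terms to calculate the value of $\cos(\pi/4)$. Use the decimal format
with six significant digits (apply rounding at each step). Calculate the truncation
error.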
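1.2.3 Total Error
Suppose that $\hat{p}$ is an approximation to the true value $p$.
• Absolute error: The absolute error is the magnitude of the difference between the
true value and its approximation.
$$E_p = |p - \hat{p}| \tag{1.15}$$
• Relative error: The relative error expresses the error as a fraction of the true
value (it is often quoted as a percentage).
$$R_p = \frac{|p - \hat{p}|}{|p|} \quad \text{provided that } p \neq 0 \tag{1.16}$$
The number $\hat{p}$ is said to approximate $p$ to $d$ significant digits if $d$ is the largest
non-negative integer for which
$$\frac{|p - \hat{p}|}{|p|} < \frac{10^{1-d}}{2}. \tag{1.17}$$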
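The error measures and the significant-digit criterion of Eq. (1.17) translate directly into code. The sketch below is a hedged Python illustration: the function names are hypothetical, the absolute error is taken in its usual sense as |p − p̂|, and the search for d is capped at an arbitrary `max_digits` so that an exact approximation does not loop forever.

```python
def absolute_error(p, p_hat):
    """Magnitude of the difference between true value p and approximation p_hat."""
    return abs(p - p_hat)

def relative_error(p, p_hat):
    """Relative error of Eq. (1.16); undefined when the true value is zero."""
    if p == 0:
        raise ValueError("relative error is undefined for p = 0")
    return abs(p - p_hat) / abs(p)

def significant_digits(p, p_hat, max_digits=17):
    """Largest d (up to max_digits) with relative error < 10**(1 - d) / 2."""
    rel = relative_error(p, p_hat)
    d = 0
    # Test whether d + 1 still satisfies Eq. (1.17): 10**(1 - (d + 1)) / 2.
    while d < max_digits and rel < 10.0 ** (-d) / 2:
        d += 1
    return d

if __name__ == "__main__":
    y, y_hat = 1_000_000, 999_996
    print("absolute error:", absolute_error(y, y_hat))
    print("relative error:", relative_error(y, y_hat))
    print("significant digits:", significant_digits(y, y_hat))
```

The sample values y = 1,000,000 and ŷ = 999,996 are the ones that appear in case (ii) of Example 1.5.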
Example 1.4. Compute the absolute error and relative error in Example 1.2 at
time instances 2 s, 8 s, 20 s, and 100 s.
Example 1.5. Find the error and relative error in the following three cases.
Solution:
(ii) Given $y = 1{,}000{,}000$ and $\hat{y} = 999{,}996$, the error is $|y - \hat{y}| = 4$ and the
relative error is $4/1{,}000{,}000 = 4 \times 10^{-6}$.
1.3. Number representation on a computer
Most computers store and process numbers in binary (base 2) form. A user may not
notice this, since communication with the computer (input/output) is in base 10
numbers. This transparency does not mean that the computer uses base 10.
1.3.1 Scientific number representation
A standard way to present a real number, called scientific notation, is obtained by shifting
the decimal point and supplying an appropriate power of 10. For example,
$$0.0000747 = 7.47 \times 10^{-5}$$
$$31.4159265 = 3.14159265 \times 10 \tag{1.18}$$
$$9{,}700{,}000{,}000 = 9.7 \times 10^{9}$$
In computer science, 1K = $1.024 \times 10^3$.
1.3.2 Decimal and binary representation
• Decimal representation:
The decimal system (base 10) uses ten digits, 0 to 9. A number is written as a
sequence of digits that correspond to multiples of powers of 10. The first digit to
the left of the decimal point corresponds to $10^0$. The digit next to it on the left
corresponds to $10^1$, the next digit to the left to $10^2$, and so on. In the same way,
the first digit to the right of the decimal point corresponds to $10^{-1}$, the next digit
to the right to $10^{-2}$, and so on.
• Binary representation:
Similarly, the binary system (base 2) uses two digits, 0 and 1. A number is then written
as a sequence of zeros and ones that correspond to multiples of powers of 2. The
first digit to the left of the decimal point corresponds to $2^0$. The digit next to it on
the left corresponds to $2^1$, the next digit to the left to $2^2$, and so on. In the same
way, the first digit to the right of the decimal point corresponds to $2^{-1}$, the next
digit to the right to $2^{-2}$, and so on.
Figure 1-4: Representation of the number 19.625 in the binary system (base 2):
$$(10011.101)_2 = 1 \times 2^4 + 0 \times 2^3 + 0 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} = 19.625$$
Another example is shown in Fig. 1-5, where the number
60,724.3125 is written in binary form.
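The positional rule described above can be turned into a short conversion routine. The following Python sketch converts a non-negative decimal number to a binary string by repeated doubling of the fractional part; the function name `to_binary` and the cap `max_frac_bits` are assumptions made for illustration, and negative inputs are not handled.

```python
def to_binary(x, max_frac_bits=16):
    """Binary string of a non-negative number, with up to max_frac_bits
    digits after the point."""
    whole = int(x)
    frac = x - whole
    int_part = bin(whole)[2:]      # integer part via Python's built-in bin()
    bits = []
    # Doubling the fraction shifts the next binary digit in front of the point.
    while frac > 0 and len(bits) < max_frac_bits:
        frac *= 2
        bit = int(frac)
        bits.append(str(bit))
        frac -= bit
    return int_part + ("." + "".join(bits) if bits else "")

if __name__ == "__main__":
    print(to_binary(19.625))
    print(to_binary(60724.3125))
```

Both sample numbers have fractional parts that are finite sums of powers of 2, so their binary expansions terminate; a value such as 0.1 would instead be cut off at `max_frac_bits` digits.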
1.3.3 Floating-point representation
As mentioned earlier, most computers store and process numbers in binary (base
2) form. For real numbers, they use a normalized floating-point binary representation.
The idea is easiest to introduce in decimal form. The decimal floating-point
representation has the form
$$d.dddddd \times 10^{p}. \tag{1.19}$$
One digit is written to the left of the decimal point, and the rest of the significant digits
are written to the right of the decimal point (normalized form). The decimal
floating-point representation is also known as scientific notation (see Section 1.3.1). The number
0.dddddd is called the mantissa. The power of 10, p, represents the number's order of
magnitude, provided the preceding number is smaller than 5. Otherwise, the number is
said to be of the order of p + 1. Thus, the number $3.91 \times 10^{-6}$ is of the order of $10^{-6}$,
written $O(10^{-6})$, and the number $6.51923 \times 10^{3}$ is of the order of $10^{4}$, written $O(10^{4})$.
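Example 1.6. Floating-point addition: Add the following two decimal
numbers in scientific notation: $8.70 \times 10^{-1}$ and $9.95 \times 10^{1}$.
Solution: Rewrite the smaller number such that its exponent matches the
exponent of the larger number:
$$8.70 \times 10^{-1} = 0.087 \times 10^{1}.$$
Add the mantissas:
$$9.95 \times 10^{1} + 0.087 \times 10^{1} = 10.037 \times 10^{1}.$$
Normalize the result:
$$10.037 \times 10^{1} = 1.0037 \times 10^{2}.$$
Round the result: If the mantissa does not fit in the space reserved for it, it has
to be rounded off. For example, if only 4 digits are allowed for the mantissa,
$1.0037 \times 10^{2}$ becomes $1.004 \times 10^{2}$.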
The binary floating-point representation has the form
$$1.bbbbbb \times 2^{bbb}. \tag{1.20}$$
In this form, the mantissa is .bbbbbb, and the power of 2 is called the exponent. Both
the mantissa and the exponent are written in binary form. The form in Eq. (1.20)
is obtained by normalizing the number (when it is written in the decimal form) with
respect to the largest power of 2 that is smaller than the number itself. If there are eight
choices for the mantissa and eight choices for the exponent, then this produces a set of 64
numbers.
Floating-point operations:
0.5 + (−0.4375)
Solution:
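Example 1.9. Compute $\frac{1}{10} + \frac{1}{5} + \frac{1}{6}$ if a computer had only a 3-bit mantissa and
an exponent $n \in \{-3, -2, -1, 0, 1, 2, 3, 4\}$.
Solution: First we need to compute $\frac{1}{10} + \frac{1}{5}$. To write the binary representation
of $\frac{1}{10}$ with a 3-bit mantissa, we first convert the number to normalized form:
$$\frac{1}{10} = 0.1 = \frac{0.1}{2^{-4}} \times 2^{-4} = 1.6 \times 2^{-4}$$
Now we can write the binary representation of $\frac{1}{10}$ as
$$\left(\frac{1}{10}\right)_{10} = (1.101)_2 \times 2^{-4} \quad \text{(after rounding)}$$
Similarly, for $\frac{1}{5}$,
$$\frac{1}{5} = 0.2 = \frac{0.2}{2^{-3}} \times 2^{-3} = 1.6 \times 2^{-3}$$
$$\left(\frac{1}{5}\right)_{10} = (1.101)_2 \times 2^{-3} \quad \text{(after rounding)}$$
The next step is to rewrite the smaller number such that its exponent
matches the exponent of the larger number. Therefore, we can write
$$\left(\frac{1}{10}\right)_{10} = (0.1101)_2 \times 2^{-3}$$
Adding the mantissas gives
$$(0.1101)_2 + (1.1010)_2 = (10.0111)_2, \quad \text{so} \quad \frac{1}{10} + \frac{1}{5} = (10.0111)_2 \times 2^{-3} = (1.00111)_2 \times 2^{-2}$$
After rounding,
$$(1.00111)_2 \times 2^{-2} = (1.010)_2 \times 2^{-2}$$
Now we need to add $\frac{1}{6}$ to the previously computed value:
$$\frac{1}{6} = (1.3333)_{10} \times 2^{-3} = (1.011)_2 \times 2^{-3}$$
Now we can add the mantissas of these two numbers after equating the exponents:
$$(0.1011)_2 + (1.0100)_2 = (1.1111)_2$$
$$\left(\frac{1}{10} + \frac{1}{5} + \frac{1}{6}\right)_{10} = (1.1111)_2 \times 2^{-2} = (1.000)_2 \times 2^{-1} \quad \text{(after rounding)}$$
After these rounding steps, the calculated approximate value is $(1.000)_2 \times 2^{-1} = 0.5$,
whereas the true value is 0.466667, so
$$\%\text{ of Error} = \frac{0.033333}{0.466667} = 0.0714 = 7.14\%$$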
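The accumulation of round-off in a short mantissa can be simulated in ordinary double precision by rounding every intermediate result to three fractional mantissa bits. The Python sketch below is an illustration under stated assumptions: the helper names are hypothetical, rounding is to the nearest representable value with ties rounded up, and the limited exponent range of the example is ignored.

```python
import math

def round_to_mantissa(x, bits=3):
    """Round x to the form 1.bbb * 2**n with `bits` fractional mantissa bits."""
    if x == 0.0:
        return 0.0
    _, e = math.frexp(x)               # x = f * 2**e with 0.5 <= |f| < 1
    exponent = e - 1                   # so that x = 1.bbb... * 2**exponent
    quantum = math.ldexp(1.0, exponent - bits)  # spacing of representable values
    return math.floor(x / quantum + 0.5) * quantum

def limited_sum(values, bits=3):
    """Add values left to right, rounding each operand and each partial sum."""
    total = round_to_mantissa(values[0], bits)
    for v in values[1:]:
        total = round_to_mantissa(total + round_to_mantissa(v, bits), bits)
    return total

if __name__ == "__main__":
    terms = [1 / 10, 1 / 5, 1 / 6]
    approx = limited_sum(terms)
    true_value = sum(terms)
    print("approximate sum:", approx)
    print("relative error :", abs(true_value - approx) / true_value)
```

With the three fractions of Example 1.9 the simulated sum is 0.5, giving a relative error of about 7.14 %, in agreement with the hand calculation.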
1.3.4 Single and double precision floating point representation
In single precision, the numbers are stored in a string of 32 bits (4 bytes), and in
double precision in a string of 64 bits (8 bytes). In both cases the first
bit stores the sign (0 corresponds to + and 1 corresponds to −) of the
number. The next 8 bits in single precision (11 bits in double precision)
are used for storing the exponent. The following 23 bits in single precision
(52 bits in double precision) are used for storing the mantissa. This
is illustrated for double precision in Fig. 1-6.
In double precision, the smallest exponent is −1023 (which will be stored as 0)
and the largest is 1024 (which will be stored as 2047). However, the smallest and largest
values of the exponent plus bias are reserved for zero and for infinity (Inf) or
not-a-number (NaN), which result from invalid mathematical operations.
Notes:
1. IEEE stands for the Institute of Electrical and Electronics Engineers.
2. Precision refers to the number of significant digits of a real number that can be
stored on a computer.
Figure 1-7: Storing the number 22.5 in double precision according to the IEEE-754 standard.
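The stored bit pattern of a number can be inspected with Python's standard `struct` module. The sketch below splits the IEEE-754 representation into its sign, exponent, and mantissa fields; the function name `ieee754_fields` is an assumption made for illustration, and the bias values 1023 (double) and 127 (single) are those of the IEEE-754 standard.

```python
import struct

def ieee754_fields(x, double=True):
    """Return (sign, exponent, mantissa) bit strings of x in IEEE-754 format."""
    fmt, total_bits, exp_bits = (">d", 64, 11) if double else (">f", 32, 8)
    raw = int.from_bytes(struct.pack(fmt, x), "big")   # big-endian bytes to int
    bits = format(raw, f"0{total_bits}b")
    return bits[0], bits[1:1 + exp_bits], bits[1 + exp_bits:]

if __name__ == "__main__":
    for double, bias in ((True, 1023), (False, 127)):
        sign, exponent, mantissa = ieee754_fields(22.5, double)
        label = "double" if double else "single"
        print(f"{label}: sign={sign} exponent={exponent} "
              f"(unbiased {int(exponent, 2) - bias}) mantissa={mantissa}")
```

Since 22.5 = (1.01101)_2 × 2^4, the mantissa field begins with 01101 in both formats and the unbiased exponent is 4.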
Example 1.11. Show how the number 22.5 can be stored in single precision according
to the IEEE-754 standard.
Additional notes
• The smallest positive number that can be expressed in double precision is:
1.4 Types of problems to be solved
(ii) Linear algebraic equations: Given the a's and c's, solve
$$a_{11}x_1 + a_{12}x_2 = c_1$$
$$a_{21}x_1 + a_{22}x_2 = c_2$$
for the x's.
(vi) Integration: Find the area under the curve,
$$I = \int_a^b f(x)\,dx.$$