Numerical Analysis I-1
CEGN-2073
Eyaya B
Contents

2 Nonlinear Equations
2.1 Locating Roots
2.2 Bisection Method
2.2.1 Number of Iterations Needed in the Bisection Method to Achieve Certain Accuracy
2.3 False Position (Regula-falsi) Method
2.4 Fixed-point Iteration Method
2.5 Newton-Raphson Method
2.6 Secant Method
3 System of Equations
3.1 Introduction
3.2 Direct Methods for System of Linear Equations (SLE)
3.2.1 Gaussian Method
3.2.2 Gaussian method with partial pivoting
3.2.3 Gauss Jordan Method
3.2.4 Gauss Jordan Method for matrix inversion
4 Interpolation
4.1 Finite differences
4.1.1 Shift operators
4.1.2 Averaging Operator
4.1.3 Differential Operator
4.1.4 Forward difference operator
4.1.5 Backward difference operator
4.1.6 Central difference operator
4.1.7 Relations between operators
4.2 Interpolations
4.2.1 Linear interpolation
4.2.2 Quadratic interpolation
4.3 Lagrange's interpolation formula
4.4 Divided difference formula
4.5 The Newton-Gregory Interpolation Formulae (with equidistant data points)
4.5.1 The Newton-Gregory Forward Interpolation Formula
4.5.2 The Newton-Gregory Backward Interpolation Formula
4.6 Error in Polynomial Interpolation
4.6.1 Spline Interpolation
4.6.2 Cubic Spline
1.1 Introduction
Many problems in science and engineering cannot be solved analytically, and as a result numerical solutions, computed on a computer, are often required. Numerical solutions provide only approximate solutions and they are not unique: different numerical algorithms might yield different approximations.
Numerical Analysis deals with the design and analysis of numerical algorithms that work with continuous or discrete quantities, and with the effects of the approximations such algorithms introduce.
The reliability of a numerical result will depend on an error estimate or bound; therefore the analysis of errors, and of the sources of error in numerical methods, is also a critically important part of the study of numerical techniques.
1.2. Errors and approximations in computations
Mathematical Models are simplified representations of some real-world entity that can be expressed in mathematical equations or computer code. A mathematical model can be broadly defined as a formulation or equation that expresses the essential features of a physical system or process in mathematical terms. In a very general sense, it can be represented as a functional relationship between a dependent variable and the independent variables and parameters that describe the system.
When we use numerical methods or algorithms and computing with finite precision, errors
of approximation or rounding and truncation are introduced. It is important to have a
notion of their nature and their order. A newly developed method is worthless without
an error analysis. Neither does it make sense to use methods which introduce errors with
magnitudes larger than the estimated error bound. On the other hand, using a method
with very high accuracy might be computationally too expensive to justify the gain in
accuracy.
Before we discuss these sources of error in detail, let us first define what an error is and how errors are measured in a numerical calculation.
Errors in solving an engineering or science problem can arise due to several factors. A paramount goal in numerical analysis is to assess the accuracy of the results of calculations. Errors contained in the numerical answers to problems generally arise in two areas:
(a) The error incurred when the mathematical statement of a problem is only an
approximation to the physical situation;
(b) The error due to inaccuracies in the physical data.
Errors of type (a) and (b) are beyond the control of the calculation and are usually negligible. It is understood, however, that the worth of a computed solution must be carefully weighed against these errors. Programming blunders, which result in the correct calculation of the wrong result, usually can be detected or verified. It is the computational errors introduced by the numerical method itself, truncation and round-off, that chiefly interest us and that should be controlled by any feasible numerical method.
The absolute error is defined as

ε_abs = ‖x − x_a‖,

where x denotes the exact value and x_a denotes the approximate value. The absolute error is expressed in the same units as the exact and approximate values.
The absolute error by itself doesn't usually signify how serious an error is. For instance, a 0.1 pound absolute error is a very small error when measuring a person's weight, but the same error can be disastrous when measuring the dosage of a medicine. The relative error and percentage error defined below give a better measure of an error:
ε_rel = ‖x − x_a‖ / ‖x‖ = ε_abs / ‖x‖.   (1.3)
Example 1.1
Note that the relative error is unchanged, while the absolute error changed by a factor of 10^5.
• The relative error is a measure of the number of significant digits of x that are
correct, we will discuss this in detail in Section (1.3.2).
• A relative error has meaning even when x is not known. It is given as a percentage
value.
Example 1.2
Three approximate values of the number 1/3 are given as 0.3, 0.33, and 0.34. Which of these 3 approximations is the best approximation?
Solution: The number which has the least absolute error gives the best approximation:
• E_a1 = |1/3 − 0.30| = 1/30
• E_a2 = |1/3 − 0.33| = 1/300
• E_a3 = |1/3 − 0.34| = 1/150
Therefore, since 1/300 is the smallest of all the absolute errors, 0.33 is the best approximation for 1/3.
Approximate Numbers
• Approximate Number: There are numbers which are not exact. For example, e = 2.7182···, √2 = 1.41421···, etc. They contain infinitely many non-recurring digits. Therefore, the numbers obtained by retaining only a few digits are called approximate numbers.
Example: The approximate numbers e ≈ 2.718 and π ≈ 3.142.
• Significant digits (Figures): are the digits used to express the number. The digits 1, 2, 3, · · · , 9 are significant digits, and 0 is also a significant digit except when it is used to fix the decimal point or to fill the places of discarded digits.
Example: 5879 and 0.4762 contain four significant digits, 0.00486 and 0.000382 contain three significant digits, and 2.0682 contains five significant digits.
Example 1.3
Find the absolute, relative and percentage errors if x = 0.005998 is rounded off to three decimal digits.
Solution: If x is rounded off to three decimal places we get x_a = 0.006. Therefore,

Absolute Error  E_a = |Error| = |−0.000002| = 0.000002,

Relative Error  E_r = |Error| / |True value| = |−0.000002| / |0.005998| = 0.00033344,

Percentage Error  E_p = E_r × 100% = 0.00033344 × 100% = 0.033344%.
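These three error measures are easy to compute programmatically. Below is a minimal Python sketch (the function name error_measures is illustrative, not part of these notes); it reproduces the numbers of Example 1.3.

```python
def error_measures(true_value, approx_value):
    """Return the absolute, relative and percentage errors of an approximation."""
    absolute = abs(true_value - approx_value)
    relative = absolute / abs(true_value)      # assumes true_value != 0
    percentage = relative * 100.0
    return absolute, relative, percentage

# Example 1.3: x = 0.005998 rounded to three decimal places gives 0.006
Ea, Er, Ep = error_measures(0.005998, 0.006)
print(Ea, Er, Ep)    # ~2e-06, ~0.00033, ~0.033 (percent)
```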
Oftentimes the true value is unknown to us, which is usually the case in numerical computing. In this case we will have to quantify errors using approximate values only. For example, when an iterative method is used we get an approximate value at the end of each iteration. The approximate error (E_a) is defined as the difference between the current (present) approximate value and the previous approximation. Similarly, we can calculate the relative approximate error (E_r) by dividing the approximate error by the present approximate value:

relative approximate error (E_r) = approximate error / present approximation.
Assume our iterative method yields a better approximation as the iteration carries on. Often we set an acceptable tolerance and stop the iteration when the relative approximate error is small enough. The tolerance is frequently set in terms of the number of significant digits, the digits that carry a meaningful contribution to the number's precision; this corresponds to the number of digits used in scientific notation to represent the number's significand or mantissa.
A practical stopping rule is as follows: if the absolute relative approximate error is less than or equal to a predefined tolerance, then the acceptable error has been reached and no more iterations are required.
Truncation error refers to the error in a method which occurs because some series (finite or infinite) or process is truncated to a smaller number of terms. Examples of this include the computation of a definite integral through approximation by a finite sum, or the numerical integration of an ordinary differential equation by some finite difference method. Such errors are essentially algorithmic errors, and we can predict the extent of the error that will occur in the method.
Example 1.4
The exponential function can be written as the series

e^x = 1 + x + x²/2! + x³/3! + ···

This series has an infinite number of terms, but when using it to calculate e^x only a finite number of terms can be used. For example, if one uses three terms to calculate e^x, then

e^x ≈ 1 + x + x²/2!.

Thus, the truncation error for such an approximation is

Truncation Error (E_T) = e^x − (1 + x + x²/2!) = x³/3! + x⁴/4! + ···
Example 1.5
• The Patriot defense system failed to track and intercept the Scud. Why?
The system's internal clock counted time in tenths of a second, and the value 1/10 was stored in a 24-bit binary register. In binary,

1/10 = 0 × 2⁻¹ + 0 × 2⁻² + 0 × 2⁻³ + 1 × 2⁻⁴ + 1 × 2⁻⁵ + 0 × 2⁻⁶ + 0 × 2⁻⁷ + 1 × 2⁻⁸ + · · · ,

a non-terminating expansion, so chopping it after 24 bits introduces an error of about 9.5 × 10⁻⁸ every 0.1 s. The battery was on for 100 consecutive hours, hence causing an inaccuracy of

(9.5 × 10⁻⁸ s / 0.1 s) × 100 hr × (3600 s / 1 hr) ≈ 0.342 s.

• The shift calculated in the range gate due to the 0.3433 s error was calculated as 687 m. For the Patriot missile defense system, the target is considered out of range if the shift was going to be more than 137 m.
For example, a number like 1/3 may be represented as 0.33333 on a computing device. The round-off error in this case is 1/3 − 0.33333 ≈ 0.0000033. There are also other numbers that cannot be represented exactly on a computing machine; for example, π and √2 are numbers that need to be approximated in computer calculations.
There are two major approaches in approximating the actual number in a computer:
chopping and rounding:
• When chopping a number to a specified number of digits, say n, the first n digits of the mantissa are retained and the remaining digits are simply chopped off.
• When rounding a number, the computer chooses the closest number that is representable by the computer.
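The difference between chopping and rounding can be illustrated with a small Python sketch. It works with decimal digits after the point rather than mantissa digits, and is subject to the usual binary floating-point representation effects; the function names are illustrative only.

```python
def chop(x, n):
    """Keep only the first n digits after the decimal point (chopping), x >= 0."""
    factor = 10 ** n
    return int(x * factor) / factor        # int() truncates toward zero

def round_to(x, n):
    """Round x to n digits after the decimal point (round half up), x >= 0."""
    factor = 10 ** n
    return int(x * factor + 0.5) / factor

x = 2 / 3
print(chop(x, 4), round_to(x, 4))          # 0.6666 0.6667
```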
A natural question one may ask is: what error is committed when a number is chopped or rounded to n digits? Consider the normalized number x = 0.d₁d₂···dₙdₙ₊₁dₙ₊₂···, and let fl(x) = 0.d₁d₂···dₙ be the result of chopping it to n digits, with an error

Error = x − fl(x) = 0.dₙ₊₁dₙ₊₂··· × 10⁻ⁿ ≤ 0.99999··· × 10⁻ⁿ ≤ 10⁻ⁿ.   (1.7)
Does rounding x to n digits increase or decrease this error? We now show that the error will decrease. Once again consider the number x above and round it to n digits as follows:
• When dₙ₊₁ < 5, simply chop to n digits.
• When 5 < dₙ₊₁ ≤ 9, add 0.5 × 10⁻ⁿ and chop the result, i.e., chop x* = x + 0.5 × 10⁻ⁿ to n digits.
• When dₙ₊₁ = 5, increase dₙ by one if it is odd, otherwise leave it unchanged, and then apply chopping.
Example: If the number x = 11.675 is rounded off to two decimal places, we get x_a = 11.68. However, if we round off the number x = 11.685 to two decimal places we also get x_a = 11.68.
The relative round-off error,

(x − fl(x)) / x = −ε_rel,   (1.12)

is usually written as

fl(x) = (1 + ε_rel) x.
Example: If x = 0.51 is correct to two decimal places, then E_a = 0.005 and the relative accuracy is given by

E_p = E_r × 100% = (0.005 / 0.51) × 100% ≈ 0.98%.
Our intuition tells us that the more accurate a number is, the more digits (that follow the
decimal point) will be correct, once the number is expressed in normalized format. To
make this statement more precise, we relate the number of correct digits to the relative
error.
Definition 1.8
The approximation p* is said to approximate p to t significant digits if t is the largest positive integer such that

E_r = |p − p*| / |p| ≤ 5 × 10⁻ᵗ,

so that

t ≤ log₁₀(5 / E_r).
Example 1.6
Suppose the relative error of an approximation is E_r = 10⁻¹. Then

t ≤ log₁₀(5 / 10⁻¹) = log₁₀ 50 ≈ 1.7,

so the approximation is accurate to 1 significant digit.
Example 1.7
Consider x = 3.29 and f l(x) = 3.2. How accurate is the approximation f l(x)?
Solution: The true relative error is

E_r = (3.29 − 3.2) / 3.29 = 9 × 10⁻² / 3.29 ≈ 3 × 10⁻²,

so that

t ≤ log₁₀(5 / (3 × 10⁻²)) = 2 + log₁₀(5/3),

which implies that t lies between 2 and 3. Thus, there are 2 significant digits.
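Definition 1.8 translates directly into a one-line computation. A small Python sketch (assuming p ≠ 0 and p* ≠ p), reproducing Example 1.7:

```python
import math

def significant_digits(p, p_star):
    """Largest t with |p - p_star|/|p| <= 5*10**(-t), as in Definition 1.8."""
    Er = abs(p - p_star) / abs(p)          # assumes p != 0 and p_star != p
    return math.floor(math.log10(5.0 / Er))

print(significant_digits(3.29, 3.2))       # 2, as in Example 1.7
```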
The number of significant digits is only weakly dependent on the value of the relative error mantissa. For example, as long as

E_r ∈ [0.5 × 10⁻², 5.0 × 10⁻²],   (1.13)

the approximation remains accurate to 2 significant digits.
Example 1.8
Another way to pose the question is given x = 3.2, what is the worst possible
approximation to x which is accurate to 2 significant digits?
Solution: Let the approximation be x*. Therefore

2 = t = log₁₀(5 / |(3.2 − x*)/3.2|),

i.e.,

100 = 5 / |(3.2 − x*)/3.2|.

Rearranging terms,

20/3.2 = |3.2 − x*|⁻¹,

or

|3.2 − x*| = 1/6.25 = 0.16.

The worst approximation to x while retaining 2 significant digits is therefore x* = 3.04 or x* = 3.36.
Example 1.9
Given the solution of a problem as xa = 35.25 with the relative error in the solution
at most 0.02. Find, to four decimal digits, the range of values within which the
exact value of the solution must lie.
Solution: If x_t is the exact value of the solution, then according to the given information we have

E_r = E_a/|x_t| = |x_t − x_a|/|x_t| = |(x_t − x_a)/x_t| < 0.02
⟹ |1 − x_a/x_t| < 0.02
⟹ −0.02 < 1 − x_a/x_t < 0.02
⟹ −0.02 < 1 − x_a/x_t  and  1 − x_a/x_t < 0.02
⟹ x_a/x_t < 1.02  and  0.98 < x_a/x_t
⟹ x_a/1.02 < x_t  and  x_t < x_a/0.98
⟹ x_a/1.02 < x_t < x_a/0.98
⟹ 35.25/1.02 < x_t < 35.25/0.98
⟹ 34.55882353 < x_t < 35.96938775.

Hence, correct to 4 decimal digits, the range of values within which the exact value of the solution lies is 34.5588 < x_t < 35.9694.
The last two examples demonstrate that the concept of significant digits is not simply a matter of counting the number of digits which are correct.
Error analysis of algorithms generally assumes perfect precision, i.e., no round-off error. However, round-off error is always present and is worth keeping in mind, especially when doing many sequential calculations where the output from one is the input to another. In this way errors can propagate, and the final answer can be garbage.
If a calculation is made with numbers that are not exact, then the calculation itself
will have an error. How do the errors in each individual number propagate through the
calculations? Let’s look at the concept via some examples.
Example 1.10
Find the bounds for the propagation error in adding two numbers, for example when calculating x + y where x = 1.5 ± 0.05 and y = 3.4 ± 0.04.
Solution: By looking at the numbers, the maximum possible values of x and y are x = 1.55 and y = 3.44, and the minimum possible values are x = 1.45 and y = 3.36. Hence

4.81 ≤ x + y ≤ 4.99.

Here we can see that the exact sum of the two numbers is 4.9, and it has an absolute error of up to 0.09, which is greater than the error in either x or y.
Example 1.11
Suppose x* = x(1 + δ) approximates x with a small relative error δ, and we use it to compute x². The error in the computed square is

E = (x*)² − x² = x²[(1 + δ)² − 1] ≈ x²(2δ),   since δ² ≪ 1,

which can be very big, especially if x is big. If we look at the relative error E/x², we still get a relative error of 2δ. Notice that the relative error doubled. This is an example of an error being propagated.
What if the evaluations we are making are function evaluations instead of arithmetic operations? How do we find the value of the propagation error in such cases? In this section we derive a general formula for the error committed when using a formula for a functional relation. Let

u = f(x₁, x₂, · · · , xₙ)

be a function of several variables x₁, x₂, · · · , xₙ, and let the error in each xᵢ be Δxᵢ. Then the error Δu in u satisfies

u + Δu = f(x₁ + Δx₁, x₂ + Δx₂, · · · , xₙ + Δxₙ).
Using Taylor's series expansion of f about (x₁, x₂, · · · , xₙ) in the above equation, we get

u + Δu = f(x₁, x₂, · · · , xₙ)
        + (∂u/∂x₁)Δx₁ + (∂u/∂x₂)Δx₂ + · · · + (∂u/∂xₙ)Δxₙ
        + (1/2!)[(∂²u/∂x₁²)Δx₁² + 2(∂²u/∂x₁∂x₂)Δx₁Δx₂ + · · · + (∂²u/∂xₙ²)Δxₙ²]
        + · · ·

Assuming the errors Δx₁, Δx₂, · · · , Δxₙ are all small and that Δxᵢ/xᵢ ≪ 1, the terms containing Δx₁², Δx₂², · · · , Δxₙ² and higher powers of the Δxᵢ can be neglected. Therefore,

u + Δu ≈ f(x₁, x₂, · · · , xₙ) + (∂u/∂x₁)Δx₁ + (∂u/∂x₂)Δx₂ + · · · + (∂u/∂xₙ)Δxₙ,

which implies that

Δu ≈ (∂u/∂x₁)Δx₁ + (∂u/∂x₂)Δx₂ + · · · + (∂u/∂xₙ)Δxₙ.   (1.14)
Equation (1.14) represents the general formula for errors. If we divide it by u on both sides we get the relative error

Δu/u ≈ (∂u/∂x₁)(Δx₁/u) + (∂u/∂x₂)(Δx₂/u) + · · · + (∂u/∂xₙ)(Δxₙ/u).   (1.15)

Also, from Equation (1.14), taking moduli gives the maximum absolute error

|Δu| ≤ |∂u/∂x₁||Δx₁| + |∂u/∂x₂||Δx₂| + · · · + |∂u/∂xₙ||Δxₙ|.   (1.16)

In addition, from Equation (1.15), taking moduli gives the maximum relative error

|Δu/u| ≤ |∂u/∂x₁||Δx₁/u| + |∂u/∂x₂||Δx₂/u| + · · · + |∂u/∂xₙ||Δxₙ/u|.   (1.17)
Example 1.12
Let u = 5xy²/z³, with Δx = Δy = Δz = 0.0001 and x = y = z = 1. Find the maximum absolute and relative errors.
Solution:

∂u/∂x = 5y²/z³,   ∂u/∂y = 10xy/z³,   ∂u/∂z = −15xy²/z⁴.

Thus, the absolute error bound is given by

(Δu)_max = |(5y²/z³)Δx| + |(10xy/z³)Δy| + |(−15xy²/z⁴)Δz|
         = |0.0001 × 5| + |0.0001 × 10| + |0.0001 × (−15)|
         = 0.003,

and the relative error bound is

E_r = (Δu)_max / u = 0.003 / 5 = 0.0006.
Definition 1.9
Error Inverse Problem What must the absolute errors of the independent variables
of a function be so that the absolute error of the function does not exceed a given
quantity?
Given the errors of several independent quantities or approximate numbers, the direct problem asks us to find the error of any function of these quantities. The inverse problem, by contrast, asks us to find the allowable errors in the several independent quantities so that

Δu = (∂f/∂x₁)Δx₁ + (∂f/∂x₂)Δx₂ + · · · + (∂f/∂xₙ)Δxₙ

does not exceed a prescribed value. The problem is solved with minimum effort by using what is known as the principle of equal effects. This principle assumes that the partial differentials

(∂f/∂xᵢ)Δxᵢ,   i = 1, 2, · · · , n,

all contribute equally to the error, which gives

Δxᵢ = Δu / (n · ∂f/∂xᵢ),   for i = 1, 2, · · · , n,   (1.18)

where n is the number of independent variables.
Example 1.13
The base of a cylinder has radius r ≈ 2 m and the altitude of the cylinder is h ≈ 3 m. With what absolute errors must we determine r and h so that the volume v may be computed to within 0.1 m³?
Solution: We have v = πr²h and Δv = 0.1 m³. Here we can see that v is a function of three quantities that are known only approximately, namely π, r and h. Putting r = 2 m and h = 3 m,

∂v/∂π = r²h = 2² × 3 m³ = 12 m³,
∂v/∂r = 2πrh = 2 × π × 2 × 3 m³ ≈ 37.7 m³,
∂v/∂h = πr² = π × 2² m³ ≈ 12.6 m³.

By the principle of equal effects,

Δr = Δv / (3 × ∂v/∂r) = 0.1 / (3 × 37.7) = 0.000884173298 < 0.001,
Δh = Δv / (3 × ∂v/∂h) = 0.1 / (3 × 12.6) = 0.0026455026455 < 0.003.
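Equation (1.18) is straightforward to apply in code. A minimal Python sketch of the principle of equal effects, reproducing Example 1.13 (the helper name equal_effect_errors is illustrative):

```python
import math

def equal_effect_errors(partials, du):
    """Allowable error in each variable: Δx_i = Δu / (n·|∂f/∂x_i|), Eq. (1.18)."""
    n = len(partials)
    return [du / (n * abs(p)) for p in partials]

# Example 1.13: v = pi*r**2*h with r = 2 m, h = 3 m, and Δv = 0.1 m**3
r, h = 2.0, 3.0
partials = [r**2 * h,             # ∂v/∂π  = 12
            2 * math.pi * r * h,  # ∂v/∂r ≈ 37.7
            math.pi * r**2]       # ∂v/∂h ≈ 12.6
print(equal_effect_errors(partials, 0.1))
# ≈ [0.00278, 0.00088, 0.00265]  ->  Δπ, Δr, Δh
```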
Solving nonlinear equations is one of the most important and challenging problems in
science and engineering applications. The root finding problem is one of the most relevant
computational problems. It arises in a wide variety of practical applications in Physics,
Chemistry, Biosciences, Engineering, etc.
The problem of nonlinear root finding can be stated in an abstract sense as follows:
Given some function f (x), determine the value(s) of x such that f (x) = 0.
The solution x is called the root of the equation or the zero of the function f and the
problem is called root finding or zero finding.
The central concept to all root finding methods is iteration or successive approxi-
mation. The idea is that we make some guess at the solution, and then we repeatedly
improve upon that guess, using some well-defined operations, until we arrive at an ap-
proximate answer that is sufficiently close to the actual answer. We refer to this process as
an iterative method. We call the sequence of approximations the iterates and denote
them by x0 , x1 , x2 , · · · , xn , · · · .
Iterative methods generally involve an infinite number of steps to obtain the exact so-
lution. However, the beauty and power of these methods is that typically after a finite,
relatively small number of steps the iteration can be terminated with the last iterate
providing a very good approximation to the actual solution. One of the primary concerns
of an iterative method is thus the rate at which it converges to the actual solution, called
order of convergence.
Convergence of the iterates to the root α means

lim_{n→∞} |x_n − α| = 0.

Thus, if there exist a constant c > 0, an integer N₀ ≥ 0 and a p ≥ 0 such that for all n > N₀ we have

|α − x_n| ≤ c |α − x_{n−1}|^p,   i.e.,   |α − x_n| / |α − x_{n−1}|^p → c as n → ∞,   (2.1)

then the iteration is said to converge with order p. If p = 1 and c < 1 the convergence is called linear; if p > 1 the convergence is called superlinear for any c > 0. In particular, the values p = 2 and p = 3 are given the special names quadratic and cubic convergence, respectively.
Notation: We write e_n = |x_n − α| for the error in the nth iteration. The equation

e_{n+1} = c e_n^p + O(e_n^{p+1})   (2.3)

expresses the same rate of convergence in terms of the errors.
In addition to the order of convergence, the factors for deciding whether an iterative root-finding method is good or not are accuracy, stability, efficiency, and robustness. Each of these can be described as follows:
• Accuracy: The error of the computed approximation should be controllable and be able to be made as small as desired.
• Stability: If the input parameters are changed by small amounts, the output of the iterative method should not be wildly different, unless the underlying problem exhibits this type of behavior.
• Efficiency: The number of operations and the time required to obtain an approximate answer should be minimized.
• Robustness: The method should converge for a wide class of problems and starting guesses.
In the iterative methods that we study we will see how each one of these concepts applies.
This is one of the simplest iterative methods and is strongly based on the property
of intervals (bracketing). The bisection method is a bracketing method for finding a
numerical solution of an equation of the form f (x) = 0.
As the name suggests, the method is based on repeated bisections of an interval containing
the root. The basic idea is very simple. Suppose we now want to approximate the solution
to f (x) = 0 for a general continuous function f (x). The key to the bisection method is
to keep the actual solution bracketed between the guesses. Thus, in addition to being
given f (x), we need an interval a ≤ x ≤ b where f (a) and f (b) differ in sign. We can
write this requirement mathematically as

f(a) · f(b) < 0.   (2.4)
It seems reasonable to conclude that since f (x) is continuous and has different signs at
each end of the interval [a, b], there must be at least one point α ∈ [a, b], such that
f (α) = 0. Thus, f (x) has at least one root in the interval. This result is in fact known
as the corollary of Intermediate Value Theorem.
At each step the method divides the interval in two by computing the midpoint c =
(a + b)/2 of the interval and the value of the function f (c) at that point. Unless c is itself
a root (which is very unlikely, but possible) there are now only two possibilities: either
f (a) and f (c) have opposite signs and bracket a root, or f (c) and f (b) have opposite
signs and bracket a root. The method selects the sub-interval that is guaranteed to be
a bracket as the new interval to be used in the next step. In this way an interval that
contains a zero of f is reduced in width by 50% at each step.
The process is continued until the interval is sufficiently small. Explicitly, if f (a) and
f (c) have opposite signs, then the method sets c as the new value for b, and if f (b) and
f (c) have opposite signs then the method sets c as the new a. (If f (c) = 0 then c may be
taken as the solution and the process stops.) In both cases, the new f (a) and f (b) have
opposite signs, so the method is applicable to this smaller interval.
Now that we have an idea for how the bisection method works for a general problem
f (x) = 0, it is time to write down a formal procedure for it using well defined operations.
We call such a procedure an algorithm.
We can think of an algorithm as a recipe for solving some mathematical problem. How-
ever, instead of the basic ingredients of flour, sugar, eggs, and salt, the fundamental
building blocks of an algorithm are the basic mathematical operations of addition, sub-
traction, multiplication, and division, as well as the for, if, and while constructs.
In the Bisection Algorithm 1, the line "while b − a > 2ε" is called the stopping criterion, and ε is called the error tolerance. This line says that we are going to continue bisecting the interval until the length of the interval is ≤ 2ε. This guarantees that the value returned by the algorithm is at most a distance ε away from the actual solution. The value for ε is given as an input to the algorithm.
Note that the smaller the value of ε, the longer it takes the bisection method to converge. Typically, we choose this value to be something small like ε = 10⁻⁶. The stopping criterion that we have chosen is called an absolute criterion. Some other types of criteria are relative and residual; these correspond to b − a < 2ε|x_n| and |f(x_n)| ≤ ε, respectively. There is no single correct choice.
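Algorithm 1 itself is not reproduced in this text, so the following Python sketch of the bisection iteration with the absolute stopping criterion is offered as one possible rendering; it assumes f is continuous and that f(a)·f(b) < 0.

```python
def bisection(f, a, b, eps=1e-6, max_iter=100):
    """Bisection with the absolute stopping criterion: iterate while b - a > 2*eps.

    Assumes f is continuous on [a, b] and f(a)*f(b) < 0.
    """
    if f(a) * f(b) >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    n = 0
    while (b - a) > 2 * eps and n < max_iter:
        c = (a + b) / 2
        if f(c) == 0:                 # c happens to be an exact root
            return c
        if f(a) * f(c) < 0:           # root bracketed in [a, c]
            b = c
        else:                         # root bracketed in [c, b]
            a = c
        n += 1
    return (a + b) / 2                # midpoint lies within eps of a root

# Example 2.2 below: approximate sqrt(2) as the root of x**2 - 2 on [1, 2]
print(bisection(lambda x: x**2 - 2, 1.0, 2.0, eps=1e-3))
```

The absolute criterion b − a > 2ε used here could equally be swapped for the relative or residual criteria mentioned above.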
Let a_n, b_n and x_n denote the nth computed values of a, b and x respectively. Then we have

b_{n+1} − a_{n+1} = (1/2)(b_n − a_n),   n ≥ 1,

and also

b_n − a_n = (1/2^{n−1})(b − a),   n ≥ 1,   (2.5)

where (b − a) is the length of the original interval with which we started. Since the root α lies in either the interval (a_n, x_n) or (x_n, b_n), we know that

|α − x_n| ≤ x_n − a_n = b_n − x_n = (1/2)(b_n − a_n).   (2.6)

Equation (2.6) is the error bound for x_n that is used in Step 2 of Algorithm 1. Combining Equations (2.5) and (2.6) we get the further bound

|α − x_n| ≤ (1/2^n)(b − a).   (2.7)

Let us now find the minimum number of iterations n needed by the bisection method to achieve a certain desired accuracy ε. Suppose we want

|α − x_n| ≤ (1/2^n)(b − a) ≤ ε.

Taking logarithms (with any convenient base) of both sides of this inequality and simplifying the resulting expression, we obtain

n ≥ log((b − a)/ε) / log 2.   (2.8)
Example 2.1
Find the largest root of

f(x) = x⁶ − x − 1 = 0

accurate to within ε = 0.001.
Solution: With the help of a graph of f it is easy to check that 1 < α < 2. We choose a = 1, b = 2; then f(a) = −1, f(b) = 61, and Equation (2.4) is satisfied. Applying the Bisection Method then yields a sequence of shrinking brackets, and the largest root of the given function is found to be 1.134765625, which is correct to 5 significant digits.
Example 2.2
Approximate √2 with an accuracy of ε = 10⁻³ and also compute the minimum number of iterations needed.
Solution: Let α = √2 be a root of a function f. First let us find the function f:

α = √2  ⟹  α² = 2  ⟹  α² − 2 = 0.

Thus, let f(x) = x² − 2 and take the interval [1, 2]. We have f(1) · f(2) < 0 and √2 ∈ [1, 2], so the convergence of the bisection method is guaranteed on this interval. The minimum number of iterations is given by

n ≥ log((2 − 1)/10⁻³) / log 2 = 3 / log₁₀ 2 ≈ 9.9658.

Thus, the minimum number of iterations required to achieve the given accuracy is 10. A table of the first 10 iterates of the bisection algorithm for approximating √2 using f(x) = x² − 2 on [1, 2] with error tolerance ε = 10⁻³ confirms this: it takes a minimum of 10 iterations to have |b − a|/2 < ε, and at every step |α − c| ≤ (1/2)|b − a|.
The most difficult part about using the bisection method is finding an interval [a, b] where
the continuous function f(x) changes sign. Once this is found, the algorithm is guaranteed
to converge. Thus, we would say that the bisection method is very robust. Also, as long
as f (x) has only one root between the interval [a, b], and it does not have another root
very close to a or b, we can make small changes to a or b and the method will converge
to the same solution. Thus, we would say the bisection method is stable. Additionally,
Equation (2.7) tells us that the error |α−xn | can be made as small as we like by increasing
n. Thus, we would say the bisection method is accurate. Finally, the method converges
linearly which is acceptable, but, as we will see in the next two sections, it is by no means
the best we can do. Thus, we would say that the bisection method is not very efficient.
The false position method retains the main feature of the bisection method: the root is trapped in a sequence of intervals of decreasing size. The regula falsi method is therefore also a bracketing method. This method uses the point where the secant line intersects the x-axis. The secant line over the interval [a, b] is the chord between (a, f(a)) and (b, f(b)), as shown in Figure 6.3. The two right triangles in the figure are similar, which means that

(b − c)/f(b) = (c − a)/f(a).
The rate of convergence is still linear but faster than that of the bisection method. Both
the Bisection and Regula-falsi methods will fail if f has a double root.
We can rewrite the above algorithm as follows so that we can use the iteration number. Given an interval [x₀, x₁] such that sign(f(x₀)) × sign(f(x₁)) < 0, there exists a root on this interval and the next approximate root, x₂, is computed as

x₂ = (x₀ f(x₁) − x₁ f(x₀)) / (f(x₁) − f(x₀)).

Now check whether this approximate root is within the given tolerance, i.e., if |f(x₂)| < ε then we take x₂ as the approximate root; otherwise we need to find the next interval depending on the sign of f(x₀) × f(x₂). If the sign is negative then the root lies in [x₀, x₂], otherwise the root lies in [x₂, x₁]. Thus, the next approximate root x₃ is computed as

x₃ = (x₀ f(x₂) − x₂ f(x₀)) / (f(x₂) − f(x₀))

in the first case, or as

x₃ = (x₁ f(x₂) − x₂ f(x₁)) / (f(x₂) − f(x₁))

in the second case.
In general, at the nth iteration we have an interval [x_{n−1}, x_n] such that f(x_{n−1}) × f(x_n) < 0, and the next approximate root is given by

x_{n+1} = (x_{n−1} f(x_n) − x_n f(x_{n−1})) / (f(x_n) − f(x_{n−1})),

or, equivalently,

x_{n+1} = x_n − (x_n − x_{n−1}) / (f(x_n) − f(x_{n−1})) × f(x_n).
Example 2.3
Find a real root of the equation f(x) = x e^x − 3 = 0 using the Regula-falsi method, correct to three decimal places.
• Iteration-1: Since f(x₂) is negative, the next approximate root lies between x₁ and x₂, and none of the decimal digits in x₁ and x₂ agree yet.
• Iteration-2:

x₃ = (x₁ f(x₂) − x₂ f(x₁)) / (f(x₂) − f(x₁)) = 1.0456.

Now f(x₃) is negative, and hence the root again lies between x₁ and x₃.
The error analysis for the false-position method is not as easy as it is for the bisection
method, however, if one of the end points becomes fixed, it can be shown that it is still a
linear order of convergence, that is, it is the same rate as the bisection method, usually
faster, but possibly slower. For differentiable functions, the closer the fixed end point is
to the actual root, the faster the convergence.
Let α be the exact root of f(x) = 0 and let x_n and x_{n+1} be two successive approximations to α. If ε_n and ε_{n+1} are the corresponding errors, we have

x_n = α + ε_n   and   x_{n+1} = α + ε_{n+1}.

Substituting into

x_{n+1} = x_n − (x_n − x_{n−1}) / (f(x_n) − f(x_{n−1})) × f(x_n)

gives

α + ε_{n+1} = α + ε_n − (x_n − x_{n−1}) / (f(α + ε_n) − f(α + ε_{n−1})) × f(α + ε_n).

Expanding f about α by Taylor's series,

ε_{n+1} = ε_n − (ε_n − ε_{n−1}) [f(α) + ε_n f′(α) + (ε_n²/2) f″(α) + · · ·] / ([f(α) + ε_n f′(α) + (ε_n²/2) f″(α) + · · ·] − [f(α) + ε_{n−1} f′(α) + (ε_{n−1}²/2) f″(α) + · · ·]).

Ignoring higher-order terms,

ε_{n+1} = ε_n − (ε_n − ε_{n−1}) [f(α) + ε_n f′(α) + (ε_n²/2) f″(α)] / [(ε_n − ε_{n−1}) f′(α) + ((ε_n² − ε_{n−1}²)/2) f″(α)]
        = ε_n − [f(α) + ε_n f′(α) + (ε_n²/2) f″(α)] / [f′(α) + ((ε_n + ε_{n−1})/2) f″(α)]
        = ε_n − [ε_n f′(α) + (ε_n²/2) f″(α)] / [f′(α) + ((ε_n + ε_{n−1})/2) f″(α)]   (since f(α) = 0)
        = ε_n − [ε_n + (ε_n²/2) f″(α)/f′(α)] [1 + ((ε_n + ε_{n−1})/2) f″(α)/f′(α)]⁻¹   (dividing numerator and denominator by f′(α)).

Now, using the expansion (1 + x)⁻¹ = 1 − x + x² − x³ + · · · and ignoring higher-order powers of x, we obtain

ε_{n+1} = ε_n ε_{n−1} M,   (2.10)

where M = (1/2) f″(α)/f′(α) is a constant.

To find the order of convergence, assume ε_{n+1} = c ε_n^p, so that also

ε_n = c ε_{n−1}^p   ⟹   ε_{n−1} = (ε_n/c)^{1/p}.

Substituting this value of ε_{n−1} into Equation (2.10), we get

ε_{n+1} = ε_n (ε_n/c)^{1/p} M
⟹ c ε_n^p = ε_n (ε_n/c)^{1/p} M
⟹ ε_n^p = ε_n^{1 + 1/p} M c^{−(1 + 1/p)}.

Equating the powers of ε_n gives

p = 1 + 1/p   ⟹   p² − p − 1 = 0   ⟹   p ≈ 1.618 (taking the positive root),

so that

ε_{n+1} = c ε_n^{1.618}.
The formal procedure of the Fixed-point Iteration Method is given below in Algorithm 3. The choice of the rearrangement x = g(x) matters. For example, the equation x³ − 2 = 0 can be rewritten as

x = x³ + x − 2   (2.13)

or as

x = (2 + 5x − x³)/5.   (2.14)

Iterating Equation (2.14) gives the correct root, while iterating Equation (2.13) does not converge.
We begin by asking whether the equation x = g(x) has a solution. For this to occur,
the graphs of y = x and y = g(x) must intersect, as seen on the earlier figure 6.3. The
following lemma gives conditions under which we are guaranteed there is a fixed point α
which is the root of the function f .
Lemma: Let g(x) be a continuous function on the interval [a, b], and suppose
it satisfies the property
a ≤ x ≤ b =⇒ a ≤ g(x) ≤ b (2.15)
Then the equation x = g(x) has at least one solution α in the interval [a, b].
The next question is under what conditions the iteration in Equation (2.12) converges to the fixed point α. Suppose g(x) and g′(x) are continuous. From the Taylor series expansion about a point x_n,

g(x) = g(x_n) + (x − x_n) g′(x_n) + ((x − x_n)²/2!) g″(x_n) + · · ·   (2.16)

Now let x₀ be an initial guess of the fixed point. From Equation (2.12) we have

x₁ = g(x₀)
⟹ α − x₁ = α − g(x₀) = g(α) − g(x₀) = (α − x₀) g′(ξ₀),   x₀ ≤ ξ₀ ≤ α (using Equation (2.16)),

α − x₂ = g′(ξ₁)(α − x₁) = g′(ξ₀) g′(ξ₁)(α − x₀),   x₁ ≤ ξ₁ ≤ α,
  ⋮
α − x_n = g′(ξ₀) g′(ξ₁) · · · g′(ξ_{n−1})(α − x₀).   (2.17)

So, if |g′(ξ_n)| ≤ M for all n, then |ε_n| ≤ M^n |ε₀| and convergence is assured if M < 1, i.e. if |g′(x)| < 1 in a neighbourhood containing both α and x₀; this condition dictates which rearrangement of the equation should be used. Thus, the condition for convergence is |g′(x)| < 1 on an interval containing α and the iterates.
Example 2.4
Find the root of the equation cos x = 3x − 1 correct to three decimal places using
fixed-point iteration method.
Solution: Here we have f(x) = cos x − 3x + 1. Take x₀ = 0 and x₁ = π/2; then f(0) = 2 > 0 and f(π/2) = −3(π/2) + 1 < 0. Thus the root lies between 0 and π/2.
The given equation can be rewritten as x = (cos x + 1)/3 = g(x) (say); then g′(x) = −(sin x)/3 and |g′(x)| < 1 on [0, π/2], hence the Fixed-point iteration method can be applied.
Let x₀ = 0 be the initial guess. Then x₁ = g(x₀) = 0.667, x₂ = g(x₁) = 0.5953, · · · , x₅ = 0.6072, x₆ = 0.6071. Therefore, since x₅ and x₆ agree to three decimal places, the required root is 0.6071.
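A minimal Python sketch of fixed-point iteration, stopping when successive iterates agree within a tolerance, applied to the rearrangement g(x) = (cos x + 1)/3 of Example 2.4:

```python
import math

def fixed_point(g, x0, tol=1e-4, max_iter=100):
    """Iterate x_{n+1} = g(x_n) until successive iterates agree within tol."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example 2.4: cos(x) = 3x - 1 rewritten as x = (cos(x) + 1)/3
g = lambda x: (math.cos(x) + 1) / 3
print(fixed_point(g, 0.0))     # ≈ 0.6071
```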
In general, Fixed-point Iteration converges linearly with asymptotic error constant |g′(α)|, since, by the definition of ξ_n and the continuity of g′,

lim_{n→∞} e_{n+1}/e_n = lim_{n→∞} |g′(ξ_n)| = |g′(α)|.   (2.18)

Recall that the conditions we have stated for linear convergence are nearly identical to the conditions for g to have a unique fixed point in [a, b]; the only difference is that now we also require g′ to be continuous on [a, b].
Now suppose that, in addition to the previous conditions on g, we assume that g′(α) = 0 and that g is twice continuously differentiable on [a, b]. Then, using Taylor's Theorem,

e_{n+1} = g(x_n) − g(α) = g′(α)(x_n − α) + (1/2) g″(ξ_n)(x_n − α)² = (1/2) g″(ξ_n) e_n²,

where ξ_n lies between x_n and α. It follows that for any initial iterate x₀ ∈ [a, b], Fixed-point Iteration converges at least quadratically, with asymptotic error constant |g″(α)/2|. This discussion implies the following general result.
Let g(x) be a function that is n times continuously differentiable on an interval [a, b]. Furthermore, assume that g(x) ∈ [a, b] for x ∈ [a, b] and that |g′(x)| ≤ M on (a, b) for some constant M < 1. If the unique fixed point α in [a, b] satisfies g′(α) = g″(α) = · · · = g^{(n−1)}(α) = 0, then for any x₀ ∈ [a, b], Fixed-point Iteration converges to α with order n and asymptotic error constant |g^{(n)}(α)/n!|.
It is also called Newton's method, and it is a widely used general root-finding method. This method requires only one appropriate starting point x₀ as an initial approximation to the root of f(x) = 0. At (x₀, f(x₀)) a tangent to the curve y = f(x) is drawn; the equation of this tangent is

y − f(x₀) = f′(x₀)(x − x₀).

The point of intersection, say x₁, of this tangent with the x-axis (y = 0) is taken as the next approximation to the root of f(x) = 0. Substituting y = 0 in the tangent equation, we get

x₁ = x₀ − f(x₀)/f′(x₀).   (2.21)

If |f(x₁)| < ε we have an acceptable approximate root of f(x) = 0; otherwise we replace x₀ by x₁, draw the tangent to y = f(x) at (x₁, f(x₁)), and take its intersection with the x-axis, say x₂, as an improved approximation to the root. If |f(x₂)| > ε, we iterate the above process until the convergence criterion is satisfied. This geometric description of the method may be visualized in the figure below.
The various steps involved in calculating the root of f (x) = 0 by Newton Raphson Method
are described compactly in the algorithm below.
Remark (1): This method converges faster than the earlier methods. In fact the method
converges at a quadratic rate. We will prove this later.
Remark (2): This method can also be derived directly from the Taylor expansion of f(x) in the neighbourhood of the root α of f(x) = 0. The starting approximation x₀ to α is to be chosen so that the first-order Taylor series approximation of f(x₀ + h) in the neighbourhood of x₀ leads to x₁, an improved approximation to α, i.e.,

f(x₀ + h) = f(x₀) + h f′(x₀) + (h²/2) f″(x₀) + · · · = 0.

Keeping only the first-order terms,

f(x₀) + h f′(x₀) = 0,   i.e.   h = −f(x₀)/f′(x₀),

and since h = x₁ − x₀,

x₁ = x₀ − f(x₀)/f′(x₀).

The successive approximations x₂, x₃, etc. may then be calculated by the iterative formula

x_{n+1} = x_n − f(x_n)/f′(x_n).
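The iterative formula above is only a few lines of code. A minimal Python sketch of the Newton-Raphson iteration with a residual stopping criterion (the criterion choice is an assumption of this sketch):

```python
import math

def newton(f, fprime, x0, eps=1e-6, max_iter=50):
    """Newton-Raphson iteration x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < eps:           # residual stopping criterion
            return x
        x = x - fx / fprime(x)      # assumes f'(x) != 0
    return x

# Example 2.5: cos(x) = 3x - 1 with starting point x0 = pi/4
root = newton(lambda x: math.cos(x) - 3 * x + 1,
              lambda x: -math.sin(x) - 3,
              math.pi / 4)
print(root)                         # ≈ 0.607102
```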
Example 2.5
Find the root of the equation cos x = 3x − 1 correct to three decimal places using the Newton-Raphson method.
Solution: Given

f(x) = cos x − 3x + 1,   so   f′(x) = −sin x − 3,

and assume x₀ = π/4 = 0.7854.
Iteration 1:

f(x₀) = −0.6491,   f′(x₀) = −3.7071,
x₁ = x₀ − f(x₀)/f′(x₀) = π/4 − (−0.6491)/(−3.7071) = 0.6103.

Since |f(x₁)| = 0.0114 > (1/2)10⁻³, and since x₀ and x₁ do not yet agree to three decimal places, we repeat the Newton-Raphson procedure.
Iteration 2:

f(x₁) = −0.0114,   f′(x₁) = −3.5731,
x₂ = x₁ − f(x₁)/f′(x₁) = 0.6103 − (−0.0114)/(−3.5731) = 0.6071028260.

Now x₁ and x₂ agree only in the first couple of decimal places, so we continue the iteration.
Iteration 3:

x₃ = x₂ − f(x₂)/f′(x₂) = 0.6071016481.

Since x₂ and x₃ agree to three decimal places, the required root is 0.607.
Using the same approach as with Fixed-point Iteration, we can determine the convergence rate of Newton's Method applied to the equation f(x) = 0, where we assume that f is continuously differentiable near the exact solution α and that f″ exists near α. Using Taylor's Theorem, we obtain

e_{n+1} = x_{n+1} − α
        = x_n − f(x_n)/f′(x_n) − α
        = e_n − f(x_n)/f′(x_n)
        = e_n − (1/f′(x_n)) [f(α) − f′(x_n)(α − x_n) − (1/2) f″(ξ_n)(α − x_n)²]   (Taylor's Theorem)
        = e_n + (1/f′(x_n)) [f′(x_n)(α − x_n) + (1/2) f″(ξ_n)(α − x_n)²]   (since f(α) = 0)
        = e_n + (1/f′(x_n)) [−f′(x_n) e_n + (1/2) f″(ξ_n) e_n²].

Thus we have

e_{n+1} = (f″(ξ_n) / (2 f′(x_n))) e_n²,   (2.22)

where ξ_n is between x_n and α. We conclude that if f′(α) ≠ 0, then Newton's Method converges quadratically, with asymptotic error constant |f″(α)/(2f′(α))|. It is easy to see from this constant, however, that if f′(α) is very small, or zero, then convergence can be very slow or may not even occur.
Example 2.6
Solve 2x³ − 2.5x − 5 = 0 for the root in [1, 2] by the Newton-Raphson method with a tolerance ε = 10⁻⁶.
Solution: Given f(x) = 2x³ − 2.5x − 5, we have f′(x) = 6x² − 2.5. Taking x₀ = 2,

x₁ = x₀ − f(x₀)/f′(x₀) = 2 − 6.0/21.5 = 1.72093023.

Since |f(x₁)| = 0.8910913504 > 10⁻⁶, we repeat the process.
Results are tabulated below:
Examining the numbers in the table above, we can see that the number of cor-
rect decimal places approximately doubles with each iteration, which is typical of
quadratic convergence.
• Unfortunately, for bad choices of x₀ (the initial guess) the method can fail to converge! Therefore the choice of x₀ is VERY IMPORTANT!
• Each iteration of Newton’s method requires two function evaluations, while the
bisection method requires only one.
Note: A good strategy for avoiding failure to converge would be to use the bisection
method for a few steps (to give an initial estimate and make sure the sequence of guesses
is going in the right direction) followed by Newton’s method, which should converge very
fast at this point.
The secant method is the most important variant of the Newton-Raphson method. The idea behind the Secant Method is as follows. Assume we need to find a root of the equation f(x) = 0, called
α. Consider the graph of the function f (x) and two initial estimates of the root, x0
and x1 . Unlike the Bisection and Regular-falsi method, the two initial guesses do not
need to bracket the root of the equation. Thus, The secant method is an open method
but a two-point iteration method and may or may not converge. However, when secant
method converges, it will typically converge faster than the Bisection method. However,
since the derivative is approximated as given by Equation (2.23), it converges slower than
the Newton-Raphson method.
The two points (x₀, f(x₀)) and (x₁, f(x₁)) on the graph of f(x) determine a straight line, called a secant line, which can be viewed as an approximation to the graph. The straight line passing through the two points (x₀, f(x₀)) and (x₁, f(x₁)) can be expressed as

y − f(x₁) = ((f(x₁) − f(x₀)) / (x₁ − x₀)) (x − x₁).

Alternatively, we can think of the secant method as Newton's method in which, instead of using f′(x_n), we approximate this derivative by a finite difference, i.e., the slope of the straight line that goes through the two most recent approximations x_n and x_{n−1}. This slope is given by

f′(x_n) ≈ (f(x_n) − f(x_{n−1})) / (x_n − x_{n−1}).   (2.23)

Inserting this expression for f′(x_n) in Newton's method gives us the secant method:

x_{n+1} = x_n − f(x_n) / [(f(x_n) − f(x_{n−1})) / (x_n − x_{n−1})],

or

x_{n+1} = x_n − f(x_n) (x_n − x_{n−1}) / (f(x_n) − f(x_{n−1})).   (2.24)

Comparing Equation (2.24) to the graph in Figure 2.5, we see how the two chosen starting points x₀, x₁ and the corresponding function values are used to compute x₂. Once we have x₂, we similarly use x₁ and x₂ to compute x₃. As with Newton's method, the procedure is repeated until |f(x_n)| or |x_{n+1} − x_n| is below some chosen error tolerance ε.
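A minimal Python sketch of the secant iteration (2.24); the stopping test combines the increment and residual criteria mentioned above, which is a choice made for this sketch.

```python
def secant(f, x0, x1, eps=1e-6, max_iter=50):
    """Secant iteration (Eq. 2.24); the starting points need not bracket a root."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:                                # flat secant line: stop
            break
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        f2 = f(x2)
        if abs(x2 - x1) < eps or abs(f2) < eps:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f2
    return x1

# Example 2.7: f(x) = 2x**3 - 2.5x - 5 with starting points x0 = 1, x1 = 2
print(secant(lambda x: 2 * x**3 - 2.5 * x - 5, 1.0, 2.0))   # ≈ 1.6601
```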
Example 2.7
Apply the secant method to f(x) = 2x³ − 2.5x − 5 = 0 (the function of Example 2.6), taking x₀ = 1 and x₁ = 2 as the two starting points, so that f(x₀) = −5.5 and f(x₁) = 6. Therefore,

x₂ = (x₀ f(x₁) − x₁ f(x₀)) / (f(x₁) − f(x₀))
   = (1 · 6 − 2 · (−5.5)) / (6 − (−5.5))
   = 17 / 11.5
   ≈ 1.4782609.
We now consider the order of convergence of the secant method. Let α be the true root of the equation, f(α) = 0, and let e_{n+1} = x_{n+1} − α. Then

e_{n+1} = x_{n+1} − α
        = x_n − (x_n − x_{n−1}) f(x_n) / (f(x_n) − f(x_{n−1})) − α
        = [(x_{n−1} − α) f(x_n) − (x_n − α) f(x_{n−1})] / (f(x_n) − f(x_{n−1}))
        = [e_{n−1} f(x_n) − e_n f(x_{n−1})] / (f(x_n) − f(x_{n−1})).

Expanding about α and using f(α) = 0,

f(x_n) = f(α + e_n) = f′(α) e_n + (1/2) f″(α) e_n² + O(e_n³),
f(x_{n−1}) = f′(α) e_{n−1} + (1/2) f″(α) e_{n−1}² + O(e_{n−1}³).

When n → ∞, all error terms in both the numerator and denominator of order higher than the lowest-order term approach zero, and we are left with

e_{n+1} = e_n e_{n−1} f″(α) / (2 f′(α)) = C e_n e_{n−1},

where we have defined C = f″(α)/(2 f′(α)). To find the order of convergence, we need to find p in

|e_{n+1}| = µ |e_n|^p.

Since also |e_n| = µ |e_{n−1}|^p, we have |e_{n−1}| = (|e_n|/µ)^{1/p}, and substituting into |e_{n+1}| = |C| |e_n| |e_{n−1}| gives

µ |e_n|^p = |C| µ^{−1/p} |e_n|^{1 + 1/p}.

Equating the powers of |e_n| gives p = 1 + 1/p, so that

p = (1 + √5)/2 ≈ 1.618   and   µ = |C|^{1/p} = |C|^{(√5 − 1)/2} = |C|^{0.618},

i.e.,

|e_{n+1}| = |f″(α)/(2 f′(α))|^{0.618} |e_n|^{1.618}.

We see that the order of convergence p = 1.618 of the secant method is better than linear but worse than quadratic.
• The error decreases slowly at first but then rapidly after a few iterations.
• The secant method is slower than Newton’s method but faster than the bisection
method.
• Each iteration of Newton’s method requires two function evaluations, while the
secant method requires only one
3.1 Introduction
In this chapter we will learn how to solve systems of equations of the form

f₁(x₁, x₂, · · · , xₙ) = 0
f₂(x₁, x₂, · · · , xₙ) = 0
  ⋮
fₙ(x₁, x₂, · · · , xₙ) = 0.   (3.1)
On the other hand, if f satisfies the superposition principle then f is a linear function, and hence we can rewrite the equation f(x₁, x₂, · · · , xₙ) = 0 as

a₁x₁ + a₂x₂ + · · · + aₙxₙ = b.

If all the functions fᵢ in Equation (3.1) are linear, then Equation (3.1) is called a System of Linear Equations (SLE); otherwise, if one of the fᵢ's is nonlinear, then it is
called a System of Nonlinear Equations (SNLE). In the subsequent sections of this chapter we will discuss methods for solving SLEs, and at the end we will give one particular method, Newton's Method, for solving SNLEs.
In general, we can divide the approaches to the solution of linear algebraic equations into
two broad areas. The first of these involve algorithms that lead directly to a solution
of the problem after a finite number of steps while the second class involves an initial
"guess" which then is improved by a succession of finite steps, each set of which we will
call an iteration. If the process is applicable and properly formulated, a finite number of
iterations will lead to a solution.
There are several methods which directly solve systems of linear equations, among them Cramer's rule, Gaussian elimination, Gauss-Jordan elimination, and QR factorization. Cramer's rule is unsuitable for computer implementation and is not discussed here. Among the direct methods, we present only Gaussian elimination and Gauss-Jordan in detail.
Consider the general system of n linear equations in n unknowns,

a₁₁x₁ + a₁₂x₂ + · · · + a₁ₙxₙ = b₁
a₂₁x₁ + a₂₂x₂ + · · · + a₂ₙxₙ = b₂
  ⋮
aₙ₁x₁ + aₙ₂x₂ + · · · + aₙₙxₙ = bₙ,

where x₁, x₂, · · · , xₙ are the unknowns, a₁₁, a₁₂, · · · , aₙₙ are the coefficients of the system, and b₁, b₂, · · · , bₙ are the constant terms. The above system of linear equations can be written in matrix form as

Ax = b,

where A is the coefficient matrix, x the unknown column vector and b the constant column vector.
The Gaussian elimination is arguably the most used method for solving a set of linear
algebraic equations. It makes use of the fact that a solution of a special system of linear
equations, namely the systems involving triangular matrices, can be constructed very
easily.
Consider first a lower triangular system Lx = b, where L is a lower triangular matrix, i.e., lᵢⱼ = 0 for all i < j, and lᵢᵢ ≠ 0 for all i. We can solve this system by forward substitution. The process is so called because we first compute x₁ from the first equation and then substitute it forward into the next equation to solve for x₂, and so on through to xₙ.
Observe that the first equation l₁₁x₁ = b₁ only involves x₁, and hence we can compute x₁ directly. The second equation only involves x₁ and x₂, so we can substitute the computed value of x₁ into it and solve for x₂. Continuing in this way, the jth equation only involves x₁, x₂, · · · , xⱼ, and one can solve for xⱼ by substituting the previously computed values x₁, x₂, · · · , xⱼ₋₁. We can write this procedure formally as below:
way as bellow
b1
x1 =
l11
b2 − l21 x1
x2 =
l22
.. (3.3)
.
i−1
1 X
xi = bi − lij xj for i = 3, 4, · · · n.
lii j=1
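Equation (3.3) translates directly into code. A minimal Python sketch of forward substitution for a lower triangular system (the small test system is made up for illustration):

```python
def forward_substitution(L, b):
    """Solve L x = b for lower triangular L with nonzero diagonal (Eq. 3.3)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x

# a small made-up test: 2x1 = 2, x1 + 3x2 = 5, x1 + x2 + 4x3 = 10
print(forward_substitution([[2, 0, 0], [1, 3, 0], [1, 1, 4]], [2, 5, 10]))
# -> [1.0, 1.333..., 1.9166...]
```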
On the other hand, if we consider a system of linear equation given by the following
matrix equation
Ux = β (3.4)
or, written out,

u₁₁x₁ + u₁₂x₂ + u₁₃x₃ + · · · + u₁ₙxₙ = β₁
        u₂₂x₂ + u₂₃x₃ + · · · + u₂ₙxₙ = β₂
                u₃₃x₃ + · · · + u₃ₙxₙ = β₃
                                  ⋮
                                uₙₙxₙ = βₙ,
where U is an upper triangular matrix, i.e., all elements below the main diagonal are zero and all diagonal elements are non-zero (uᵢᵢ ≠ 0 for all i). For an upper triangular system we work backwards: first compute xₙ, then substitute xₙ back into the previous equation to solve for xₙ₋₁, and so on down to x₁.
To solve the system (3.4) one starts from the last equation:

xₙ = βₙ / uₙₙ,
xₙ₋₁ = (1/uₙ₋₁,ₙ₋₁) [βₙ₋₁ − uₙ₋₁,ₙ xₙ],
xₙ₋₂ = (1/uₙ₋₂,ₙ₋₂) [βₙ₋₂ − uₙ₋₂,ₙ₋₁ xₙ₋₁ − uₙ₋₂,ₙ xₙ],

and so on.
Backward Substitution Formula (BSF):

xₙ = βₙ / uₙₙ,
xᵢ = (1/uᵢᵢ) (βᵢ − Σ_{j=i+1}^{n} uᵢⱼ xⱼ),   for i = n−1, n−2, · · · , 1.   (3.5)
Thus, the solution procedure for solving this system of equations involving a special type
of upper triangular matrix is particularly simple. However, the trouble is that most of
the problems encountered in real applications do not have such special form.
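Likewise, Equation (3.5) is a short loop. A minimal Python sketch of backward substitution; the test system is the triangular system that will be obtained in Example 3.1 below.

```python
def backward_substitution(U, beta):
    """Solve U x = beta for upper triangular U with nonzero diagonal (Eq. 3.5)."""
    n = len(beta)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (beta[i] - s) / U[i][i]
    return x

# the triangular system reached in Example 3.1: x+y+z=6, -3y-z=-9, (1/3)z=1
U = [[1, 1, 1], [0, -3, -1], [0, 0, 1 / 3]]
print(backward_substitution(U, [6, -9, 1]))   # ≈ [1.0, 2.0, 3.0]
```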
The following row operations on the augmented matrix of a system produce the aug-
mented matrix of an equivalent system, i.e., a system with the same solution as the
original one.
• Replace a row by the sum of itself and a constant multiple of another row of the
matrix.
• Ri + αRj means: Replace row i with the sum of row i and α times row j.
Now, we shall assume that the system has a unique solution, i.e., we assume aᵢᵢ ≠ 0, and
proceed to describe the simple Gaussian Elimination method(GEM) for finding the
solution. The method reduces the system to an upper triangular system using elementary
row operations (ERO).
Step -1: Form the augmented matrix

A^(1) =
[ a_11^(1)  a_12^(1)  a_13^(1)  ...  a_1n^(1) | b_1^(1) ]
[ a_21^(1)  a_22^(1)  a_23^(1)  ...  a_2n^(1) | b_2^(1) ]
[ a_31^(1)  a_32^(1)  a_33^(1)  ...  a_3n^(1) | b_3^(1) ]
[    ...       ...       ...    ...    ...    |   ...   ]
[ a_n1^(1)  a_n2^(1)  a_n3^(1)  ...  a_nn^(1) | b_n^(1) ],

where a_ij^(1) = a_ij and b_i^(1) = b_i.
Step -2: Perform elementary row operations to get zeros below the diagonal.

2-1: Assuming the pivot element a_11^(1) ≠ 0, apply ERO on A^(1) to reduce all entries below a_11^(1) to zero. Let the resulting matrix be denoted by A^(2):

A^(1) --(R_i + m_i1 R_1)--> A^(2),   where m_i1 = −a_i1^(1)/a_11^(1), for i > 1.

Here the m_i1's are called the multipliers for each row i. Note that A^(2) is of the form

A^(2) =
[ a_11^(1)  a_12^(1)  a_13^(1)  ...  a_1n^(1) | b_1^(1) ]
[    0      a_22^(2)  a_23^(2)  ...  a_2n^(2) | b_2^(2) ]
[    0      a_32^(2)  a_33^(2)  ...  a_3n^(2) | b_3^(2) ]
[   ...        ...       ...    ...     ...   |   ...   ]
[    0      a_n2^(2)  a_n3^(2)  ...  a_nn^(2) | b_n^(2) ].

2-2: Assuming a_22^(2) ≠ 0, reduce all entries below a_22^(2) to zero:

A^(2) --(R_i + m_i2 R_2)--> A^(3),   where m_i2 = −a_i2^(2)/a_22^(2), for i > 2.

Repeat this procedure until A^(n) is obtained. The general kth step is as follows.

2-k: Assume a_kk^(k) ≠ 0; then reduce all entries below a_kk^(k) to zero by applying the ERO

A^(k) --(R_i + m_ik R_k)--> A^(k+1),   where m_ik = −a_ik^(k)/a_kk^(k), for i > k.

Continue this procedure up to the (n−1)th step, which is given below.

2-(n−1): Assuming a_{n−1,n−1}^(n−1) ≠ 0, reduce the entry below a_{n−1,n−1}^(n−1) to zero by applying

A^(n−1) --(R_n + m_{n,n−1} R_{n−1})--> A^(n),   where m_{n,n−1} = −a_{n,n−1}^(n−1)/a_{n−1,n−1}^(n−1).

Each step can be written as pre-multiplication by a unit lower triangular matrix M^(k); for instance,

M^(n−1) =
[ I_{n−1}                0 ]
[ 0  ...  m_{n,n−1}      1 ].
Step -3: Inspect the resulting matrix and re-interpret it as a system of equations.
(a) If you get a zero diagonal element, then the system cannot be solved using the simple GEM; the matrix may be singular and the system may not have a (unique) solution.
(b) If you get fewer equations than unknowns after the reduction, and a solution exists, then there are infinitely many solutions.
(c) If you get as many equations as unknowns after the reduction, and a solution exists, then there is exactly one solution.
Step -4: Apply the BSF to get the solution of the given system if part (c) in Step -3 is true.
Note, as a "by-product" of GEM: in addition to solving the system Ax = b, the simple GEM can also be used to evaluate det A, provided a_kk^(k) ≠ 0 for each k. Note further that each M^(k) is a lower triangular matrix with all diagonal entries equal to 1, so det M^(k) = 1 for every k. Now, writing A′^(k) for the un-augmented (coefficient) part of A^(k), we have

A′^(n) = M^(n−1) M^(n−2) · · · M^(1) A′^(1),   (3.6)

and therefore

det A′^(n) = det M^(n−1) det M^(n−2) · · · det M^(1) det A′^(1) = det A′^(1) = det A,   since A = A′^(1).

Now A′^(n) is an upper triangular matrix and hence its determinant is the product of its diagonal elements,

det A = a_11^(1) a_22^(2) · · · a_nn^(n).

In addition, note that all the matrices M^(k) are lower triangular and nonsingular, as det M^(k) = 1 ≠ 0 for all k. They are therefore all invertible, and their inverses are also lower triangular. That is, if

N = M^(n−1) M^(n−2) · · · M^(1),

then N is lower triangular and nonsingular, and N⁻¹ is also lower triangular. Since A′^(n) = N A, we have

A = N⁻¹ A′^(n).

Now N⁻¹ is lower triangular, which we denote by L, and A′^(n) is upper triangular, which we denote by U; we thus get the so-called LU decomposition

A = LU.

REMEMBER: IF AT ANY STAGE WE GET a_kk^(k) = 0, WE CANNOT PROCEED FURTHER WITH THE SIMPLE GEM.
Example 3.1:
Solve the following system using the Gauss elimination method:

x + y + z = 6
2x − y + z = 3   (3.7)
x + z = 4

Solution:
Step -1: Form the augmented matrix

A^(1) =
[ 1  1  1 | 6 ]
[ 2 −1  1 | 3 ]
[ 1  0  1 | 4 ].

Step -2: Perform elementary row operations to get zeros below the diagonal.
2-1: First compute the multipliers for the second and third rows:

m_21 = −a_21^(1)/a_11^(1) = −2/1 = −2   and   m_31 = −a_31^(1)/a_11^(1) = −1/1 = −1.

Thus,

M^(1) =
[  1  0  0 ]
[ −2  1  0 ]   (3.9)
[ −1  0  1 ]

and

A^(2) = M^(1) A^(1) =
[ 1  1  1 |  6 ]
[ 0 −3 −1 | −9 ]   (3.10)
[ 0 −1  0 | −2 ].

2-2: Next,

m_32 = −a_32^(2)/a_22^(2) = −(−1)/(−3) = −1/3,

so

M^(2) =
[ 1   0    0 ]
[ 0   1    0 ]   (3.11)
[ 0 −1/3   1 ]

and

A^(3) = M^(2) A^(2) =
[ 1  1   1  |  6 ]
[ 0 −3  −1  | −9 ]
[ 0  0  1/3 |  1 ].

Step -3: Re-interpreting the last matrix as a system of equations gives

x + y + z = 6
−3y − z = −9   (3.12)
(1/3) z = 1.

Note that Equation (3.12) and Equation (3.7) have the same solution, since the applied ERO do not change the solutions. Here we can see that there are no zeros on the diagonal and we have as many equations as unknowns; thus the system has a unique solution and the BSF can be applied:

z = 3,
y = (−9 − (−z)) / (−3) = −6/−3 = 2,
x = 6 − y − z = 6 − 2 − 3 = 1.

Therefore, the solution is x = 1, y = 2 and z = 3.
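The whole procedure of Example 3.1, elimination followed by backward substitution, can be sketched compactly in Python. This is the simple GEM without pivoting, so it assumes every pivot encountered is nonzero:

```python
def gauss_eliminate(A, b):
    """Simple GEM (no pivoting): reduce to upper triangular form, then back-substitute.

    Assumes every pivot a_kk encountered during elimination is nonzero.
    """
    n = len(b)
    A = [row[:] for row in A]              # work on copies
    b = b[:]
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = -A[i][k] / A[k][k]         # multiplier m_ik
            for j in range(k, n):
                A[i][j] += m * A[k][j]
            b[i] += m * b[k]
    x = [0.0] * n                          # backward substitution
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# Example 3.1: x + y + z = 6, 2x - y + z = 3, x + z = 4
print(gauss_eliminate([[1, 1, 1], [2, -1, 1], [1, 0, 1]], [6, 3, 4]))   # ≈ [1, 2, 3]
```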
Exercise 3.1:
Solve the following system of linear equations using the GEM:
The simple GEM rests on the assumption that

a_kk^(k) ≠ 0

at each stage k. This may not always be satisfied, so we have to modify the simple GEM in order to overcome this situation. Further, even if the condition a_kk^(k) ≠ 0 is satisfied at each stage, simple GEM may not be a very accurate method to use. What do we mean by this?
Example 3.2:
Consider, as an example, the system whose augmented matrix is formed below. Solve this system by GEM, performing the computations to 6 significant digits.
Solution:
Step-1: Form the augmented matrix:
0.000003 0.213472 0.332147 | 0.235262
A(1) = 0.215512 0.375623 0.476625 | 0.127653
0.173257 0.663257 0.625675 | 0.285321
Step -2: Perform elementary row operations to get zeros below the diagonal.
2-1: First compute the multipliers for the second and third rows. Since a_11^(1) ≠ 0, we have

m_21 = −a_21^(1)/a_11^(1) = −0.215512/0.000003 = −71837.3   and   m_31 = −a_31^(1)/a_11^(1) = −0.173257/0.000003 = −57752.3.

Thus,

M^(1) =
[     1      0  0 ]
[ −71837.3   1  0 ]
[ −57752.3   0  1 ].

2-2: Compute the multiplier for the third row. Since a_22^(2) = −15334.9 ≠ 0, we have

m_32 = −a_32^(2)/a_22^(2) = −(−12327.8)/(−15334.9) = −0.803905.

Thus,

M^(2) =
[ 1      0       0 ]
[ 0      1       0 ]
[ 0  −0.803905   1 ].
Here we can see that there are no zeros on the diagonal of the reduced matrix and we have as many equations as unknowns; thus the system has a unique solution and the BSF can be applied. Working to 6 significant digits,

z = −0.20000 / −0.50000 = 0.400000,

y = (−16900.5 − (−23860.0 z)) / (−15334.9) = 0.479723,

and finally, from the first equation, with y = 0.479723 and z = 0.400000,

x = (0.235262 − (0.213472 y + 0.332147 z)) / 0.000003 = −1.33333.

Therefore, the computed solution of the above system of equations is

x = −1.33333,
y = 0.479723,
z = 0.400000.
This compares poorly with the correct answers (to 10 digits) given by
x = −0.9912894252
y = 0.0532039339 (3.15)
z = 0.6741214691
Thus we see that the simple Gaussian Elimination method needs modification in order to handle situations that lead to a_kk^(k) = 0 for some k, or situations such as the one arising in the above example. In order to alleviate such problems we introduce the idea of Partial Pivoting.
Pivoting.
When performing Gaussian elimination, the diagonal element that one uses during the
elimination procedure is called the pivot. To obtain the correct multiple, one uses the
pivot as the divisor to the elements below the pivot. Gaussian elimination in this form
will fail if the pivot is zero. In this situation, a row interchange must be performed.
Even if the pivot is not identically zero, a small value can result in big round-off errors.
For very large matrices, one can easily lose all accuracy in the solution. To avoid these
round-off errors arising from small pivots, row interchanges are made, and this technique
is called partial pivoting (partial pivoting is in contrast to complete pivoting, where both
rows and columns are interchanged).
The idea of partial pivoting is as follows. At the $k$th stage we reduce all the entries below the $k$th diagonal position to zero, as in the simple GEM. However, before doing this we look at the entries on and below the $k$th diagonal position, pick the one that has the largest absolute value, and bring it to the $k$th diagonal position by a row interchange; only then do we reduce the entries below the $k$th diagonal to zero. Incorporating this idea at each stage of the Gaussian elimination process gives the Gaussian Elimination Method with Partial Pivoting. We now illustrate this with a few examples:
Example 3.3:
Solve the following system using the Gauss elimination method with partial pivoting.
x + y + 2z = 4
2x − y + z = 2
x + 2y = 3
Solution:
Step-1: Form the augmented matrix:
$$A^{(1)} = \begin{pmatrix} 1 & 1 & 2 & | & 4 \\ 2 & -1 & 1 & | & 2 \\ 1 & 2 & 0 & | & 3 \end{pmatrix}$$
Step -2: Perform elementary row operations to get zeros below the diagonal.
2-1: Select the pivot row by comparing $a_{11}^{(1)}$ and the entries below it, and pick the one with the largest absolute value. Thus the pivot element has to be chosen as $a_{21}^{(1)} = 2$, since this is the entry with the largest absolute value in the first column. Therefore we need to interchange row-1 and row-2, and hence we have
$$M^{(1)} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$
and
$$A^{(2)} = M^{(1)}A^{(1)} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} 1 & 1 & 2 & | & 4 \\ 2 & -1 & 1 & | & 2 \\ 1 & 2 & 0 & | & 3 \end{pmatrix} = \begin{pmatrix} 2 & -1 & 1 & | & 2 \\ 1 & 1 & 2 & | & 4 \\ 1 & 2 & 0 & | & 3 \end{pmatrix}.$$
Note that multiplying the matrix $A^{(1)}$ by the matrix $M^{(1)}$ simply interchanges row-1 and row-2 of $A^{(1)}$.
2-2: Now compute the multipliers for the second and third rows:
$$m_{2,1}^{(2)} = -\frac{a_{21}^{(2)}}{a_{11}^{(2)}} = -\frac{1}{2} \quad\text{and}\quad m_{3,1}^{(2)} = -\frac{a_{31}^{(2)}}{a_{11}^{(2)}} = -\frac{1}{2}.$$
Thus,
$$M^{(2)} = \begin{pmatrix} 1 & 0 & 0 \\ -\tfrac{1}{2} & 1 & 0 \\ -\tfrac{1}{2} & 0 & 1 \end{pmatrix}, \qquad\text{so that}\qquad A^{(3)} = M^{(2)}A^{(2)} = \begin{pmatrix} 2 & -1 & 1 & | & 2 \\ 0 & \tfrac{3}{2} & \tfrac{3}{2} & | & 3 \\ 0 & \tfrac{5}{2} & -\tfrac{1}{2} & | & 2 \end{pmatrix}.$$
2-3: Next, the pivot element $a_{32}^{(3)} = \tfrac{5}{2}$ is selected, since this is the entry with the largest absolute value in the first column of the remaining submatrix. So we have to do another row interchange, i.e., interchange row-2 and row-3. Let
$$M^{(3)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},$$
and we have
$$A^{(4)} = M^{(3)}A^{(3)} = \begin{pmatrix} 2 & -1 & 1 & | & 2 \\ 0 & \tfrac{5}{2} & -\tfrac{1}{2} & | & 2 \\ 0 & \tfrac{3}{2} & \tfrac{3}{2} & | & 3 \end{pmatrix}.$$
2-4: Compute the multiplier for the third row:
$$m_{3,2}^{(4)} = -\frac{a_{32}^{(4)}}{a_{22}^{(4)}} = -\frac{3/2}{5/2} = -\frac{3}{5}.$$
Thus,
$$M^{(4)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -\tfrac{3}{5} & 1 \end{pmatrix}.$$
This completes the reduction, and we have that the given system is equivalent to the system
$$A'x = b',$$
where
$$A' = \begin{pmatrix} 2 & -1 & 1 \\ 0 & \tfrac{5}{2} & -\tfrac{1}{2} \\ 0 & 0 & \tfrac{9}{5} \end{pmatrix} \quad\text{and}\quad b' = \begin{pmatrix} 2 \\ 2 \\ \tfrac{9}{5} \end{pmatrix},$$
that is,
$$\begin{aligned} 2x - y + z &= 2\\ \tfrac{5}{2}y - \tfrac{1}{2}z &= 2\\ \tfrac{9}{5}z &= \tfrac{9}{5}. \end{aligned}$$
Here we can see that there are no zeros on the diagonal of $A'$ and the number of equations equals the number of unknowns. Thus the system has a unique solution, and back substitution gives
$$x = 1, \quad y = 1 \quad\text{and}\quad z = 1.$$
Exercise 3.2:
Solve the system of linear equations of Example 3.2 using GEM with partial pivoting and compare the result with the correct answer given in Equation (3.15).
Determinant Evaluation
Notice that even in the partial pivoting method we obtain matrices $M^{(k)}, M^{(k-1)}, \cdots, M^{(1)}$ such that the product $M^{(k)} M^{(k-1)} \cdots M^{(1)} A = A'$ is an upper triangular matrix, and therefore
$$\det M^{(k)}\det M^{(k-1)}\cdots\det M^{(1)}\det A = \text{product of the diagonal entries of } A'.$$
Now, if $M^{(i)}$ is used to make the entries below a diagonal element zero, then $\det M^{(i)} = 1$; and if it is used for a row interchange required by partial pivoting, then $\det M^{(i)} = -1$. Therefore $\det M^{(k)}\det M^{(k-1)}\cdots\det M^{(1)} = (-1)^s$, where $s$ is the number of row interchanges effected in the reduction. Hence
$$\det A = (-1)^s \times \text{product of the diagonal elements of the final upper triangular matrix } A'.$$
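The determinant rule above is easy to automate. The following Python sketch (an illustration only, not part of the original notes) performs Gaussian elimination with partial pivoting, counts the row interchanges and returns $(-1)^s$ times the product of the final diagonal entries.

```python
import numpy as np

def det_by_elimination(A):
    """Determinant via Gaussian elimination with partial pivoting:
    (-1)**s times the product of the diagonal of the reduced matrix."""
    U = np.array(A, dtype=float)
    n = U.shape[0]
    swaps = 0
    for k in range(n - 1):
        # Partial pivoting: bring the largest |entry| on or below the diagonal up.
        p = k + np.argmax(np.abs(U[k:, k]))
        if p != k:
            U[[k, p]] = U[[p, k]]
            swaps += 1
        if U[k, k] == 0.0:
            return 0.0                      # singular matrix
        for i in range(k + 1, n):
            U[i, k:] -= (U[i, k] / U[k, k]) * U[k, k:]
    return (-1) ** swaps * np.prod(np.diag(U))

# Quick check against the coefficient matrix of Example 3.3 (det = 9).
A = np.array([[1., 1., 2.], [2., -1., 1.], [1., 2., 0.]])
print(det_by_elimination(A), np.linalg.det(A))
```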
A matrix is said to be in row-echelon form if it has the following properties:
1. If a row does not consist entirely of zeros, then the first nonzero number in the row is a 1. (We call this the leading 1.)
2. If there are any rows that consist entirely of zeros, then they are grouped together
at the bottom of the matrix.
3. In any two successive rows that do not consist entirely of zeros, the leading 1 in the
lower row occurs farther to the right than the leading 1 in the higher row.
Consider the following example. Gauss-Jordan elimination is exactly the same as Gaussian elimination until the matrix is in row-echelon form.
Example 3.4:
Solve the following system using Gauss-Jordan method.
x + y + 2z = 9
2x + 4y − 3z = 1
3x + 6y − 5z = 0
Solution:
Step-1: Form the augmented matrix:
$$A^{(1)} = \begin{pmatrix} 1 & 1 & 2 & | & 9 \\ 2 & 4 & -3 & | & 1 \\ 3 & 6 & -5 & | & 0 \end{pmatrix}$$
2-1: Normalize the diagonal element in the first column to obtain a leading 1; here it is already equal to 1.
2-2: Perform elementary row operations to get zeros below the diagonal of column 1, i.e., compute the multipliers for the second and third rows:
$$m_{2,1}^{(1)} = -\frac{a_{21}^{(1)}}{a_{11}^{(1)}} = -\frac{2}{1} = -2 \quad\text{and}\quad m_{3,1}^{(1)} = -\frac{a_{31}^{(1)}}{a_{11}^{(1)}} = -\frac{3}{1} = -3.$$
Thus,
$$M^{(1)} = \begin{pmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ -3 & 0 & 1 \end{pmatrix}.$$
2-3: Normalize the diagonal element in the second column to get the leading 1. Here we need to divide the second row by 2. Thus we have
$$M^{(2)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \tfrac{1}{2} & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad\text{and}\qquad A^{(3)} = M^{(2)}A^{(2)} = \begin{pmatrix} 1 & 1 & 2 & | & 9 \\ 0 & 1 & -\tfrac{7}{2} & | & -\tfrac{17}{2} \\ 0 & 3 & -11 & | & -27 \end{pmatrix}.$$
2-4: Compute the multiplier for the third row:
$$m_{3,2}^{(3)} = -\frac{a_{32}^{(3)}}{a_{22}^{(3)}} = -\frac{3}{1} = -3.$$
Thus,
$$M^{(3)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -3 & 1 \end{pmatrix} \qquad\text{and hence}\qquad A^{(4)} = M^{(3)}A^{(3)} = \begin{pmatrix} 1 & 1 & 2 & | & 9 \\ 0 & 1 & -\tfrac{7}{2} & | & -\tfrac{17}{2} \\ 0 & 0 & -\tfrac{1}{2} & | & -\tfrac{3}{2} \end{pmatrix}.$$
2-5: Normalize the diagonal element of the third column by dividing the third row by $-\tfrac{1}{2}$, i.e.,
$$M^{(4)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & -2 \end{pmatrix} \qquad\text{and hence}\qquad A^{(5)} = M^{(4)}A^{(4)} = \begin{pmatrix} 1 & 1 & 2 & | & 9 \\ 0 & 1 & -\tfrac{7}{2} & | & -\tfrac{17}{2} \\ 0 & 0 & 1 & | & 3 \end{pmatrix}.$$
This matrix is now in row-echelon form (upper triangular). To solve the system with the Gauss-Jordan method we need one more step.
Step-3: Beginning with the last nonzero row and working upwards, add suitable multiples of each row to the rows above it to introduce zeros above the leading 1's.
3-1: Compute multipliers for the first and second rows using the third row as the pivot row:
$$m_{13}^{(5)} = -\frac{a_{13}^{(5)}}{a_{33}^{(5)}} = -2 \quad\text{and}\quad m_{23}^{(5)} = -\frac{a_{23}^{(5)}}{a_{33}^{(5)}} = \frac{7}{2}.$$
Thus,
$$M^{(5)} = \begin{pmatrix} 1 & 0 & -2 \\ 0 & 1 & \tfrac{7}{2} \\ 0 & 0 & 1 \end{pmatrix} \qquad\text{and hence}\qquad A^{(6)} = M^{(5)}A^{(5)} = \begin{pmatrix} 1 & 1 & 0 & | & 3 \\ 0 & 1 & 0 & | & 2 \\ 0 & 0 & 1 & | & 3 \end{pmatrix}.$$
We are now finished with column 3 as there are all zeros above the leading 1. The final
step involves looking at column 2 and getting a zero above the leading 1. To do this we
compute multiplier for the first row using the second row as a pivot, i.e.,
$$m_{12}^{(6)} = -\frac{a_{12}^{(6)}}{a_{22}^{(6)}} = -1,$$
which produces the final reduced matrix $A'\,|\,b'$, where
$$A' = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \quad\text{and}\quad b' = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}.$$
Step-4 Inspect the resulting system.
x + 0y + 0z = 1
0x + y + 0z = 2
0x + 0y + z = 3.
Here we can see that there are no zeros on the diagonal of $A'$ and the number of equations equals the number of unknowns. Thus the system has a unique solution, which is equal to $b'$, i.e.,
$$x = 1, \quad y = 2, \quad\text{and}\quad z = 3.$$
Note that the final coefficient matrix is in reduced row-echelon form and is the identity matrix.
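The Gauss-Jordan procedure just carried out can be sketched in code. The snippet below is a minimal illustration (without pivoting, mirroring the hand computation; the function name and the use of an extra identity block are illustrative assumptions): it reduces $[A\,|\,b\,|\,I]$ to reduced row-echelon form, which yields the solution and, as discussed in the remark that follows, the inverse of $A$.

```python
import numpy as np

def gauss_jordan(A, b):
    """Reduce [A | b | I] to reduced row-echelon form; return x and A^{-1}."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    n = A.shape[0]
    M = np.hstack([A, b, np.eye(n)])        # augmented matrix [A | b | I]
    for k in range(n):
        M[k] /= M[k, k]                     # normalize to get the leading 1
        for i in range(n):
            if i != k:                      # zero out the rest of column k
                M[i] -= M[i, k] * M[k]
    return M[:, n], M[:, n + 1:]

A = [[1, 1, 2], [2, 4, -3], [3, 6, -5]]
b = [9, 1, 0]
x, Ainv = gauss_jordan(A, b)
print(x)        # approximately [1. 2. 3.]
```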
Remark:
The natural question to ask at this point is "why the extra computations in the Gauss-Jordan method?". The Gauss-Jordan method also yields the inverse of the original matrix, and the determinant of the matrix can be obtained simply as the product of the determinants of the $M^{(k)}$'s. The inverse of a matrix can be computed in two ways: one way is to take the product $M^{(k)} M^{(k-1)} \cdots M^{(1)} = A^{-1}$, and the other is to augment the original matrix with an identity matrix. The Gauss-Jordan method then reduces the original matrix to the identity matrix, and the augmented identity matrix becomes the inverse of the original matrix.
As explained above, the Gauss-Jordan method is also used to find the inverse of a given matrix. We formalize the procedure using the following example.
Example 3.5:
Solve the following system using Gauss-Jordan method and also find the inverse of the
coefficient matrix and it’s determinant.
x + 2y + 3z = 12
3x + 2y + z = 24
2x + y + 3z = 36
Solution:
Step-1: In order to solve the system and find the inverse of the coefficient matrix, the first thing we need to do is form the augmented matrix of $A$, $b$ and $I$, where $I$ is an identity matrix of the same order as $A$:
$$A^{(1)} = \begin{pmatrix} 1 & 2 & 3 & | & 12 & | & 1 & 0 & 0 \\ 3 & 2 & 1 & | & 24 & | & 0 & 1 & 0 \\ 2 & 1 & 3 & | & 36 & | & 0 & 0 & 1 \end{pmatrix}$$
Step-2: Reduce the matrix to reduced row-echelon form. Here we will be a little more systematic so that we can reduce the computation: we are not going to compute the $M^{(k)}$ matrices explicitly; rather we record the row operations (and the factors taken out of each row) directly.
2-1: Normalize all rows by factoring out the lead elements of the first column, so that
$$A^{(1)} \;\longrightarrow\; \begin{pmatrix} 1 & 2 & 3 & | & 12 & | & 1 & 0 & 0 \\ 1 & \tfrac{2}{3} & \tfrac{1}{3} & | & 8 & | & 0 & \tfrac{1}{3} & 0 \\ 1 & \tfrac{1}{2} & \tfrac{3}{2} & | & 18 & | & 0 & 0 & \tfrac{1}{2} \end{pmatrix} \quad (1)(3)(2) = A^{(2)}.$$
Note: the product $(1)(3)(2)$ is kept for the computation of the determinant.
- The first row can then be subtracted from the remaining rows (i.e. rows 2 and 3) to yield
$$A^{(2)} \;\longrightarrow\; \begin{pmatrix} 1 & 2 & 3 & | & 12 & | & 1 & 0 & 0 \\ 0 & -\tfrac{4}{3} & -\tfrac{8}{3} & | & -4 & | & -1 & \tfrac{1}{3} & 0 \\ 0 & -\tfrac{3}{2} & -\tfrac{3}{2} & | & 6 & | & -1 & 0 & \tfrac{1}{2} \end{pmatrix} \quad (6) = A^{(3)}.$$
2-2: Now repeat the cycle, normalizing by factoring out the elements of the second column:
$$A^{(3)} \;\longrightarrow\; \begin{pmatrix} \tfrac{1}{2} & 1 & \tfrac{3}{2} & | & 6 & | & \tfrac{1}{2} & 0 & 0 \\ 0 & 1 & 2 & | & 3 & | & \tfrac{3}{4} & -\tfrac{1}{4} & 0 \\ 0 & 1 & 1 & | & -4 & | & \tfrac{2}{3} & 0 & -\tfrac{1}{3} \end{pmatrix} \quad (6)(2)\!\left(-\tfrac{4}{3}\right)\!\left(-\tfrac{3}{2}\right) = A^{(4)}.$$
- Subtracting the second row from the remaining rows (i.e. rows 1 and 3) gives
$$A^{(4)} \;\longrightarrow\; \begin{pmatrix} \tfrac{1}{2} & 0 & -\tfrac{1}{2} & | & 3 & | & -\tfrac{1}{4} & \tfrac{1}{4} & 0 \\ 0 & 1 & 2 & | & 3 & | & \tfrac{3}{4} & -\tfrac{1}{4} & 0 \\ 0 & 0 & -1 & | & -7 & | & -\tfrac{1}{12} & \tfrac{1}{4} & -\tfrac{1}{3} \end{pmatrix} \quad (24) = A^{(5)}.$$
2-3: Again repeat the cycle, normalizing by the elements of the third column, so that
$$A^{(5)} \;\longrightarrow\; \begin{pmatrix} -1 & 0 & 1 & | & -6 & | & \tfrac{1}{2} & -\tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{2} & 1 & | & \tfrac{3}{2} & | & \tfrac{3}{8} & -\tfrac{1}{8} & 0 \\ 0 & 0 & 1 & | & 7 & | & \tfrac{1}{12} & -\tfrac{1}{4} & \tfrac{1}{3} \end{pmatrix} \quad (24)\!\left(-\tfrac{1}{2}\right)\!(2)(-1) = A^{(6)}.$$
- Subtracting the third row from the remaining rows (i.e. rows 1 and 2) gives
$$A^{(6)} \;\longrightarrow\; \begin{pmatrix} -1 & 0 & 0 & | & -13 & | & \tfrac{5}{12} & -\tfrac{1}{4} & -\tfrac{1}{3} \\ 0 & \tfrac{1}{2} & 0 & | & -\tfrac{11}{2} & | & \tfrac{7}{24} & \tfrac{1}{8} & -\tfrac{1}{3} \\ 0 & 0 & 1 & | & 7 & | & \tfrac{1}{12} & -\tfrac{1}{4} & \tfrac{1}{3} \end{pmatrix} \quad (24) = A^{(7)}.$$
2-4: Finally, normalize each row by its diagonal element so as to produce the unit matrix on the left:
$$A^{(7)} \;\longrightarrow\; \begin{pmatrix} 1 & 0 & 0 & | & 13 & | & -\tfrac{5}{12} & \tfrac{1}{4} & \tfrac{1}{3} \\ 0 & 1 & 0 & | & -11 & | & \tfrac{7}{12} & \tfrac{1}{4} & -\tfrac{2}{3} \\ 0 & 0 & 1 & | & 7 & | & \tfrac{1}{12} & -\tfrac{1}{4} & \tfrac{1}{3} \end{pmatrix} \quad (24)(-1)\!\left(\tfrac{1}{2}\right)\!(1) = A^{(8)}.$$
Step-3: The solution to the equations is now contained in the centre vector, while the right-hand matrix contains the inverse of the original coefficient matrix. The scalar quantity accumulating in front of the matrix is the determinant, since it records the factors taken out of the individual rows of the original matrix. Thus the complete solution is
$$x = 13, \quad y = -11, \quad z = 7, \qquad A^{-1} = \begin{pmatrix} -\tfrac{5}{12} & \tfrac{1}{4} & \tfrac{1}{3} \\ \tfrac{7}{12} & \tfrac{1}{4} & -\tfrac{2}{3} \\ \tfrac{1}{12} & -\tfrac{1}{4} & \tfrac{1}{3} \end{pmatrix}, \qquad \det A = -12.$$
Although it certainly represents a sound way to solve such systems, it becomes inefficient
when solving equations with the same coefficients A, but with different right-hand-side
constants (the bs). Recall that Gauss elimination involves two steps: forward elimination
and back-substitution. Of these, the forward-elimination step comprises the bulk of the
computational effort. This is particularly true for large systems of equations. On the other
hand, LU decomposition methods separate the time-consuming elimination of the matrix
A from the manipulations of the right-hand side b. Thus, once the matrix A has been
decomposed, multiple right-hand-side vectors can be evaluated in an efficient manner.
Before showing how this can be done, let us first provide a mathematical overview of the
decomposition strategy.
$$Ax = b, \qquad A = LU,$$
so that $LUx = b$. Writing $y = Ux$, we first solve the lower triangular system $Ly = b$ by forward substitution:
$$y_1 = \frac{b_1}{l_{11}}, \qquad y_i = \frac{1}{l_{ii}}\left(b_i - \sum_{j=1}^{i-1} l_{ij} y_j\right), \quad i = 2, 3, \cdots, n, \qquad (3.16)$$
and then the upper triangular system $Ux = y$ by back substitution:
$$x_n = \frac{y_n}{u_{nn}}, \qquad x_i = \frac{1}{u_{ii}}\left(y_i - \sum_{j=i+1}^{n} u_{ij} x_j\right), \quad i = n-1, n-2, \cdots, 1. \qquad (3.17)$$
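The two substitution sweeps (3.16) and (3.17) translate directly into code. The sketch below is only illustrative (the function name is an assumption); given any factorization $A = LU$, e.g. from the Doolittle or Crout procedures discussed below, new right-hand sides can be handled by calling it again without refactoring $A$.

```python
import numpy as np

def solve_lu(L, U, b):
    """Solve LUx = b: forward substitution (Ly = b), then back substitution (Ux = y)."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):                       # forward substitution, Eq. (3.16)
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):           # back substitution, Eq. (3.17)
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x
```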
When the Gauss elimination procedure is applied to a matrix , the elements of the
matrices L and U are actually calculated. The upper triangular matrix U is the matrix
of coefficients A that is obtained at the end of the Gauss elimination procedure. For the
lower triangular matrix L, the elements on the diagonal are all 1, and the elements below
(k)
the diagonal are the negative of multipliers mij that multiply the pivot equation when
it is used to eliminate the elements below the pivot coefficient. For the case of a system
of three equations, the decomposition has the form:
$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ m_{21} & 1 & 0 \\ m_{31} & m_{32} & 1 \end{pmatrix} \begin{pmatrix} a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} \\ 0 & a_{22}^{(2)} & a_{23}^{(2)} \\ 0 & 0 & a_{33}^{(3)} \end{pmatrix},$$
where
$$m_{21} = \frac{a_{21}^{(1)}}{a_{11}^{(1)}}, \qquad m_{31} = \frac{a_{31}^{(1)}}{a_{11}^{(1)}}, \qquad m_{32} = \frac{a_{32}^{(2)}}{a_{22}^{(2)}}.$$
We determine $L$ and $U$ as follows. The 1st row of $U$ and the 1st column of $L$ are determined first:
$$a_{11} = \sum_{k=1}^{n} l_{1k} u_{k1} = l_{11}u_{11} = u_{11}, \quad\text{since } l_{11} = 1 \text{ and } l_{1k} = 0 \text{ for } k > 1.$$
Now,
$$a_{1j} = \sum_{k=1}^{n} l_{1k} u_{kj} = u_{1j}. \qquad (3.20)$$
Thus the first row of $U$ is the same as the first row of $A$. The first column of $L$ is determined as follows:
$$a_{j1} = \sum_{k=1}^{n} l_{jk} u_{k1} = l_{j1}u_{11} \;\Longrightarrow\; l_{j1} = \frac{a_{j1}}{u_{11}}. \qquad (3.21)$$
Thus Equation (3.20) and Equation (3.21) determine respectively the first row of U and
first column of L. The other rows of U and columns of L are determined recursively as
given below: Suppose we have determined the first i − 1 rows of U and the first i − 1
columns of L. Now we proceed to describe how one then determines the ith row of U
and ith column of L. Since first i − 1 rows of U have been determined, this means, ukj
are all known for 1 ≤ k ≤ i − 1 ; 1 ≤ j ≤ n. Similarly, since first i − 1 columns are known
for L, this means, lik are all known for 1 ≤ i ≤ n ; 1 ≤ k ≤ i − 1.
Now,
$$a_{ij} = \sum_{k=1}^{n} l_{ik} u_{kj} = \sum_{k=1}^{i} l_{ik} u_{kj} \;\;(\text{since } l_{ik} = 0 \text{ for } k > i) = \sum_{k=1}^{i-1} l_{ik} u_{kj} + l_{ii}u_{ij} = \sum_{k=1}^{i-1} l_{ik} u_{kj} + u_{ij} \;\;(\text{since } l_{ii} = 1),$$
$$\Longrightarrow\; u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj}. \qquad (3.22)$$
Note that on the RHS we have aij which is known from the given matrix. Also the sum on
the RHS involves lik for 1 ≤ k ≤ i − 1 which are all known because they involve entries in
the first i − 1 columns of L; and they also involve ukj ; 1 ≤ k ≤ i − 1 which are also known
since they involve only the entries in the first i − 1 rows of U . Thus Equation (3.22)
determines the ith row of U in terms of the known given matrix and quantities determined
upto the previous stage. Now we describe how to get the ith column of L:
$$a_{ji} = \sum_{k=1}^{n} l_{jk} u_{ki} = \sum_{k=1}^{i} l_{jk} u_{ki} \;\;(\text{since } u_{ki} = 0 \text{ for } k > i) = \sum_{k=1}^{i-1} l_{jk} u_{ki} + l_{ji}u_{ii},$$
$$\Longrightarrow\; l_{ji} = \frac{1}{u_{ii}}\left[a_{ji} - \sum_{k=1}^{i-1} l_{jk} u_{ki}\right]. \qquad (3.23)$$
Once again we note the RHS involves uii , which has been determined using Equa-
tion (3.22); aij which is from the given matrix; ljk ; 1 ≤ k ≤ i − 1 and hence only entries
in the first i − 1 columns of L; and uki , 1 ≤ k ≤ i − 1 and hence only entries in the first
i − 1 rows of U . Thus RHS in Equation (3.23) is completely known and hence lji , the
entries in the ith column of L are completely determined by Equation (3.23).
Summary:
The summary of Doolittle's procedure is as follows:
Step-1: Determine the 1st row of $U$ and the 1st column of $L$:
$$l_{ii} = 1, \;\; i = 1, 2, \cdots, n; \qquad u_{1j} = a_{1j}, \;\; j = 1, 2, \cdots, n; \qquad l_{j1} = \frac{a_{j1}}{u_{11}}, \;\; j = 2, 3, \cdots, n.$$
Step-i ($i = 2, 3, \cdots, n$): Determine the $i$th row of $U$ and the $i$th column of $L$:
$$u_{ij} = a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj}, \;\; j = i, i+1, \cdots, n; \qquad l_{ji} = \frac{1}{u_{ii}}\left[a_{ji} - \sum_{k=1}^{i-1} l_{jk} u_{ki}\right], \;\; j = i, i+1, \cdots, n.$$
Example 3.6:
Determine the the Doolittle’s decomposition for the matrix.
2 1 −1 3
−2 2 6 −4
A=
4 14 19 4
6 0 −6 12
Solution:
Step-1 Determine the 1st row of U and 1st column of L :
lii = 1, for i = 1, 2 · · · , n.
Thus, we have,
l11 = 1, l22 = 1, l33 = 1, and l44 = 1
b) The 1st row of $U$ is the same as the 1st row of $A$, i.e.,
$$u_{11} = 2, \quad u_{12} = 1, \quad u_{13} = -1, \quad u_{14} = 3.$$
c) The 1st column of $L$:
$$l_{j1} = \frac{a_{j1}}{u_{11}} \quad\text{for } j = 1, 2, \cdots, n.$$
Thus, $j = 1 \Rightarrow l_{11} = 1$; $j = 2 \Rightarrow l_{21} = \frac{a_{21}}{u_{11}} = \frac{-2}{2} = -1$; $j = 3 \Rightarrow l_{31} = \frac{a_{31}}{u_{11}} = \frac{4}{2} = 2$; and $j = 4 \Rightarrow l_{41} = \frac{a_{41}}{u_{11}} = \frac{6}{2} = 3$.
Step-2 a) The 2nd row of $U$ is given by Equation (3.22) with $i = 2$, i.e.,
$$u_{2j} = a_{2j} - \sum_{k=1}^{1} l_{2k}u_{kj}, \quad\text{for } j = 2, 3, 4.$$
Thus we have
$$j = 2 \Rightarrow u_{22} = a_{22} - l_{21}u_{12} = 2 - (-1)(1) = 3,$$
$$j = 3 \Rightarrow u_{23} = a_{23} - l_{21}u_{13} = 6 - (-1)(-1) = 5,$$
$$j = 4 \Rightarrow u_{24} = a_{24} - l_{21}u_{14} = -4 - (-1)(3) = -1.$$
b) The 2nd column of $L$ is given by Equation (3.23) with $i = 2$, i.e.,
$$l_{j2} = \frac{1}{u_{22}}\left[a_{j2} - \sum_{k=1}^{1} l_{jk}u_{k2}\right], \quad\text{for } j = 2, 3, 4.$$
Thus we have
$$j = 2 \Rightarrow l_{22} = 1, \qquad j = 3 \Rightarrow l_{32} = \frac{1}{u_{22}}\left[a_{32} - l_{31}u_{12}\right] = \frac{1}{3}\left[14 - (2)(1)\right] = 4,$$
$$j = 4 \Rightarrow l_{42} = \frac{1}{u_{22}}\left[a_{42} - l_{41}u_{12}\right] = \frac{1}{3}\left[0 - (3)(1)\right] = -1.$$
Step-3 a) The 3rd row of $U$ is given by Equation (3.22) with $i = 3$, i.e.,
$$u_{3j} = a_{3j} - \sum_{k=1}^{2} l_{3k}u_{kj}, \quad\text{for } j = 3, 4.$$
Thus we have
$$j = 3 \Rightarrow u_{33} = a_{33} - (l_{31}u_{13} + l_{32}u_{23}) = 19 - \big((2)(-1) + (4)(5)\big) = 1,$$
$$j = 4 \Rightarrow u_{34} = a_{34} - (l_{31}u_{14} + l_{32}u_{24}) = 4 - \big((2)(3) + (4)(-1)\big) = 2.$$
b) The 3rd column of $L$ is given by Equation (3.23) with $i = 3$, i.e.,
$$l_{j3} = \frac{1}{u_{33}}\left[a_{j3} - \sum_{k=1}^{2} l_{jk}u_{k3}\right], \quad\text{for } j = 3, 4.$$
Thus we have
$$j = 3 \Rightarrow l_{33} = 1, \qquad j = 4 \Rightarrow l_{43} = \frac{1}{u_{33}}\left[a_{43} - (l_{41}u_{13} + l_{42}u_{23})\right] = \frac{1}{1}\left[-6 - \big((3)(-1) + (-1)(5)\big)\right] = 2.$$
Step-4 a) The 4th row of $U$ is given by Equation (3.22) with $i = 4$, i.e.,
$$u_{4j} = a_{4j} - \sum_{k=1}^{3} l_{4k}u_{kj}, \quad\text{for } j = 4.$$
Thus we have
$$j = 4 \Rightarrow u_{44} = a_{44} - (l_{41}u_{14} + l_{42}u_{24} + l_{43}u_{34}) = 12 - \big((3)(3) + (-1)(-1) + (2)(2)\big) = -2.$$
b) The 4th column of $L$ is given by Equation (3.23) with $i = 4$; the only element is the diagonal one, so $j = 4 \Rightarrow l_{44} = 1$. Hence
$$L = \begin{pmatrix} 1 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 2 & 4 & 1 & 0 \\ 3 & -1 & 2 & 1 \end{pmatrix}, \qquad U = \begin{pmatrix} 2 & 1 & -1 & 3 \\ 0 & 3 & 5 & -1 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & -2 \end{pmatrix}.$$
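The Doolittle recurrences (3.22) and (3.23) are easy to program. The sketch below is an illustration only (the function name is an assumption); it reproduces the factorization of Example 3.6.

```python
import numpy as np

def doolittle(A):
    """Doolittle LU decomposition: A = LU with unit diagonal on L (no pivoting)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):                              # i-th row of U, Eq. (3.22)
            U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
        for j in range(i + 1, n):                          # i-th column of L, Eq. (3.23)
            L[j, i] = (A[j, i] - L[j, :i] @ U[:i, i]) / U[i, i]
    return L, U

A = [[2, 1, -1, 3], [-2, 2, 6, -4], [4, 14, 19, 4], [6, 0, -6, 12]]
L, U = doolittle(A)
print(np.allclose(L @ U, A))   # True
```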
The second method for decomposing a general matrix A into the LU factor is the Crout’s
Method. In this method the matrix is decomposed into the product LU , where the
diagonal elements of the matrix U are all 1s. It turns out that in this case, the elements
of both matrices can be determined using formulas that can be easily programmed just
like Doolittle's method.
A procedure for determining the elements of the matrices L and U using the Crout’s
method can be written as follow.
Step-1: Set
$$u_{ii} = 1 \;\;\text{for } i = 1, 2, \cdots, n, \qquad l_{i1} = a_{i1} \;\;\text{for } i = 1, 2, \cdots, n, \qquad u_{1j} = \frac{a_{1j}}{l_{11}} \;\;\text{for } j = 2, \cdots, n.$$
Step-i: Calculate the elements $l_{ij}$ in the $i$th row of $L$ and the $i$th row of $U$, for $i = 2, 3, \cdots, n$:
$$l_{ij} = a_{ij} - \sum_{k=1}^{j-1} l_{ik}u_{kj} \;\;\text{for } j = 2, 3, \cdots, i, \qquad u_{ij} = \frac{1}{l_{ii}}\left[a_{ij} - \sum_{k=1}^{i-1} l_{ik}u_{kj}\right] \;\;\text{for } j = i+1, i+2, \cdots, n.$$
Example 3.7:
Solve the following system of linear equations by Crout's Method (LU factorization or decomposition method):
$$\begin{aligned} 9x_1 + 3x_2 + 3x_3 + 3x_4 &= 24\\ 3x_1 + 10x_2 - 2x_3 - 2x_4 &= 17\\ 3x_1 - 2x_2 + 18x_3 + 10x_4 &= 45\\ 3x_1 - 2x_2 + 10x_3 + 10x_4 &= 29. \end{aligned}$$
Solution:
Step-1: Set $u_{ii} = 1$ for $i = 1, 2, \cdots, n$, and take the 1st column of $L$ equal to the 1st column of $A$:
$$l_{11} = 9, \quad l_{21} = 3, \quad l_{31} = 3, \quad l_{41} = 3.$$
The 1st row of $U$ is then
$$u_{1j} = \frac{a_{1j}}{l_{11}} \quad\text{for } j = 2, 3, 4,$$
so that
$$u_{12} = \frac{a_{12}}{l_{11}} = \frac{3}{9} = \frac{1}{3}, \qquad u_{13} = \frac{a_{13}}{l_{11}} = \frac{3}{9} = \frac{1}{3}, \qquad u_{14} = \frac{a_{14}}{l_{11}} = \frac{3}{9} = \frac{1}{3}.$$
Step-2: Calculate the element of the 2nd row of $L$ and the 2nd row of $U$.
a)
$$l_{2j} = a_{2j} - \sum_{k=1}^{j-1} l_{2k}u_{kj} \;\;\text{for } j = 2: \qquad l_{22} = a_{22} - l_{21}u_{12} = 10 - (3)\!\left(\tfrac{1}{3}\right) = 9.$$
b) The 2nd row of $U$:
$$u_{2j} = \frac{1}{l_{22}}\left[a_{2j} - \sum_{k=1}^{1} l_{2k}u_{kj}\right] \;\;\text{for } j = 3, 4:$$
$$u_{23} = \frac{1}{l_{22}}\left[a_{23} - l_{21}u_{13}\right] = \frac{1}{9}\left[-2 - (3)\!\left(\tfrac{1}{3}\right)\right] = -\frac{1}{3}, \qquad u_{24} = \frac{1}{l_{22}}\left[a_{24} - l_{21}u_{14}\right] = \frac{1}{9}\left[-2 - (3)\!\left(\tfrac{1}{3}\right)\right] = -\frac{1}{3}.$$
Step-3: Calculate the elements of the 3rd row of $L$ and the 3rd row of $U$.
a)
$$l_{32} = a_{32} - l_{31}u_{12} = -2 - (3)\!\left(\tfrac{1}{3}\right) = -3, \qquad l_{33} = a_{33} - \left[l_{31}u_{13} + l_{32}u_{23}\right] = 18 - \left[(3)\!\left(\tfrac{1}{3}\right) + (-3)\!\left(-\tfrac{1}{3}\right)\right] = 16.$$
b) The 3rd row of $U$:
$$u_{34} = \frac{1}{l_{33}}\left[a_{34} - (l_{31}u_{14} + l_{32}u_{24})\right] = \frac{1}{16}\left[10 - \left((3)\!\left(\tfrac{1}{3}\right) + (-3)\!\left(-\tfrac{1}{3}\right)\right)\right] = \frac{1}{2}.$$
Step-4: Calculate the elements of the 4th row of $L$:
$$l_{42} = a_{42} - l_{41}u_{12} = -2 - (3)\!\left(\tfrac{1}{3}\right) = -3, \qquad l_{43} = a_{43} - \left[l_{41}u_{13} + l_{42}u_{23}\right] = 10 - \left[(3)\!\left(\tfrac{1}{3}\right) + (-3)\!\left(-\tfrac{1}{3}\right)\right] = 8,$$
$$l_{44} = a_{44} - \left[l_{41}u_{14} + l_{42}u_{24} + l_{43}u_{34}\right] = 10 - \left[(3)\!\left(\tfrac{1}{3}\right) + (-3)\!\left(-\tfrac{1}{3}\right) + (8)\!\left(\tfrac{1}{2}\right)\right] = 4.$$
This completes the LU decomposition of the given $A$ by Crout's method. Now let us solve the given system.
We have $Ly = b$, i.e.,
$$\begin{pmatrix} 9 & 0 & 0 & 0 \\ 3 & 9 & 0 & 0 \\ 3 & -3 & 16 & 0 \\ 3 & -3 & 8 & 4 \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 24 \\ 17 \\ 45 \\ 29 \end{pmatrix}.$$
Using the forward substitution formula we get
$$9y_1 = 24 \Rightarrow y_1 = \tfrac{8}{3}, \qquad 3y_1 + 9y_2 = 17 \Rightarrow y_2 = 1, \qquad 3y_1 - 3y_2 + 16y_3 = 45 \Rightarrow y_3 = \tfrac{5}{2}, \qquad 3y_1 - 3y_2 + 8y_3 + 4y_4 = 29 \Rightarrow y_4 = 1.$$
Thus $y = \left(\tfrac{8}{3},\, 1,\, \tfrac{5}{2},\, 1\right)^T$, and $Ux = y$ gives
$$\begin{pmatrix} 1 & \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ 0 & 1 & -\tfrac{1}{3} & -\tfrac{1}{3} \\ 0 & 0 & 1 & \tfrac{1}{2} \\ 0 & 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix} = \begin{pmatrix} \tfrac{8}{3} \\ 1 \\ \tfrac{5}{2} \\ 1 \end{pmatrix}.$$
Back substitution then gives
$$x_4 = 1, \qquad x_3 + \tfrac{1}{2}x_4 = \tfrac{5}{2} \Rightarrow x_3 = 2, \qquad x_2 - \tfrac{1}{3}x_3 - \tfrac{1}{3}x_4 = 1 \Rightarrow x_2 = 2, \qquad x_1 + \tfrac{1}{3}x_2 + \tfrac{1}{3}x_3 + \tfrac{1}{3}x_4 = \tfrac{8}{3} \Rightarrow x_1 = 1.$$
Therefore $x_1 = 1$, $x_2 = 2$, $x_3 = 2$ and $x_4 = 1$.
A symmetric matrix is a square matrix that is equal to its own transpose. Formally, the matrix $A$ is symmetric if
$$A^T = A.$$
If $A$ is symmetric and positive definite, it can be decomposed as
$$A = LL^T,$$
where $L$ is a lower triangular matrix with real positive diagonal elements and $L^T$ is the transpose of $L$. The Cholesky decomposition is unique when $A$ is positive definite: there is only one lower triangular matrix $L$ with strictly positive diagonal entries such that $A = LL^T$. However, the decomposition need not be unique when $A$ is only positive semidefinite. The converse holds trivially: if $A$ can be written as $LL^T$ for some invertible lower triangular matrix $L$, then $A$ is symmetric and positive definite.
$$l_{11} = \sqrt{a_{11}}, \qquad l_{i1} = \frac{a_{i1}}{l_{11}} \quad\text{for } i = 2, 3, \cdots, n,$$
and, for $j = 2, 3, \cdots, n$,
$$l_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} l_{jk}^2}, \qquad l_{ij} = \frac{1}{l_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} l_{ik}l_{jk}\right) \quad\text{for } i = j+1, j+2, \cdots, n. \qquad (3.24)$$
The Cholesky decomposition is evaluated column by column (starting from the first column), and within each column the elements are evaluated from top to bottom. That is, in each column the diagonal element is evaluated first using the first formula in Equation (3.24) (the elements above the diagonal are zero), and then the remaining elements below it in the same column are evaluated using the second formula. This is carried out for each column, starting from the first one.
To solve $Ax = b$ we then:
1. Factorize $A$ into $A = LL^T$.
2. Solve $Ly = b$ by forward substitution and then $L^T x = y$ by back substitution to obtain $x$.
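A minimal sketch of the column-by-column evaluation of (3.24) is given below (illustrative only; the function name is an assumption). It returns the lower triangular factor $L$; the solve step can then use the substitution routine sketched earlier with $U = L^T$.

```python
import numpy as np

def cholesky(A):
    """Cholesky factorization A = L L^T for a symmetric positive definite A (Eq. 3.24)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for j in range(n):
        # diagonal element of column j
        L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
        # elements below the diagonal in column j
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L
```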
Example 3.8:
Solve the following system of equations using the Cholesky decomposition.
Solution: In order to solve the above system we first need the Cholesky decomposition of the coefficient matrix
$$A = \begin{pmatrix} 4 & 2 & 14 \\ 2 & 17 & -5 \\ 14 & -5 & 83 \end{pmatrix}.$$
We use the formulas in Equation (3.24) to determine the $LL^T$ decomposition as follows; the computation of $L$ proceeds column by column.
Step-1: First column, $j = 1$:
$$l_{11} = \sqrt{a_{11}} = \sqrt{4} = 2, \qquad l_{i1} = \frac{a_{i1}}{l_{11}}, \; i = 2, 3, \quad\text{so}\quad l_{21} = \frac{2}{2} = 1, \quad l_{31} = \frac{14}{2} = 7.$$
Step-2: Second column, $j = 2$:
$$l_{22} = \sqrt{a_{22} - l_{21}^2} = \sqrt{17 - 1} = 4, \qquad l_{i2} = \frac{1}{l_{22}}\left(a_{i2} - \sum_{k=1}^{1} l_{ik}l_{2k}\right), \; i = 3.$$
When $i = 3$ we have
$$l_{32} = \frac{1}{l_{22}}\left(a_{32} - l_{31}l_{21}\right) = \frac{1}{4}\left(-5 - (7)(1)\right) = -3.$$
Step-3: Compute the 3rd column, i.e., $j = 3$. Here we only have one element to compute, $l_{33}$, and it is given by
$$l_{33} = \sqrt{a_{33} - \sum_{k=1}^{2} l_{3k}^2} = \sqrt{a_{33} - (l_{31}^2 + l_{32}^2)} = \sqrt{83 - (49 + 9)} = 5.$$
Hence
$$L = \begin{pmatrix} 2 & 0 & 0 \\ 1 & 4 & 0 \\ 7 & -3 & 5 \end{pmatrix}.$$
Exercise: Using the above decomposition, solve the given system.
All the methods described so far generally require on the order of $n^3$ operations to obtain the solution. However, there is one frequently occurring system of equations for which extremely
efficient solution algorithms exist. This system of equations is called tri-diagonal because
there are never more than three unknowns in any equation and they can be arranged so
that the coefficient matrix is composed of non-zero elements on the main diagonal and
the diagonal immediately adjacent to either side. Thus such a system would have the
form
Ax = b
Here we can solve the given system using the following two-stage strategy. Write the sub-diagonal, diagonal and super-diagonal entries of $A$ as $a_k$, $b_k$ and $c_k$ respectively, and factor $A = LU$ with $L$ unit lower bidiagonal (sub-diagonal entries $l_k$) and $U$ upper bidiagonal (diagonal entries $u_k$, super-diagonal entries $v_k$). Comparing the entries of $LU$ with those of $A$ gives
$$b_1 = u_1 \Rightarrow u_1 = b_1, \qquad c_1 = v_1 \Rightarrow v_1 = c_1,$$
and, for $k = 2, 3, \cdots, n$,
$$a_k = l_k u_{k-1} \Rightarrow l_k = \frac{a_k}{u_{k-1}}, \qquad b_k = l_k v_{k-1} + u_k \Rightarrow u_k = b_k - l_k v_{k-1}, \qquad c_k = v_k \Rightarrow v_k = c_k.$$
Writing the right-hand side as $d$, the system $Ly = d$ is solved by forward substitution:
$$y_1 = d_1, \qquad y_k = d_k - l_k y_{k-1}, \quad k = 2, \cdots, n,$$
and then $Ux = y$ by back substitution:
$$x_n = \frac{y_n}{u_n}, \qquad x_k = \frac{y_k - v_k x_{k+1}}{u_k}, \quad k = n-1, n-2, \cdots, 1.$$
Introduction
Direct solvers such as Gaussian Elimination and LU decomposition allow for efficient
solving. In this section we introduce iterative solution methods. The choice of a direct
method or an indirect method is a combination of the efficiency of the method (and in
general iterative methods are more efficient), the particular structure of the matrix sys-
tem, a trade-off between compute time and memory, and the computer architecture being
used. Iterative methods work by refining a guess to the solution and converging as quickly
as possible from that guess to the actual solution. You may have met iterative methods
previously in, for example, the general purpose solution of non-linear equations– such
as bisection or Newton-Raphson techniques (along with their more advanced cousins).
Iterative methods for linear systems have become a widespread and powerful tool for
solving the most complex scientific and engineering problems and can be extremely ef-
fective, especially when starting from a good guess at the final solution – and often effort
is expended in making that initial guess as good as possible and which will start you off
close to the final solution and yield a more rapid convergence to the answer. Their only
drawback is that they may not necessarily converge to a solution for a particular matrix
system.
By this approach, we start with some initial guess, say $x^{(0)}$, for the solution $x$ and generate an improved estimate $x^{(k+1)}$ from the previous approximation $x^{(k)}$. This method is very effective for solving differential equations, integral equations and related problems. Let the residual vector $r$ be defined as
$$r_i^{(k)} = b_i - \sum_{j=1}^{n} a_{ij}\,x_j^{(k)}, \quad\text{for } i = 1, 2, \cdots, n.$$
In this section, to begin with, some well known iterative schemes are presented. Their
convergence analysis is presented next. In the derivations that follow, it is implicitly
assumed that the diagonal elements of matrix $A$ are non-zero, i.e. $a_{ii} \neq 0$. If this is not the case, a simple row exchange is often sufficient to satisfy this condition.
4x1 − x2 + x3 = 7
4x1 − 8x2 + x3 = −21 (3.25)
−2x1 + x2 + 5x3 = 15
Solving the $i$th equation for $x_i$ gives the Jacobi iteration
$$x_1^{(k+1)} = \frac{7 + x_2^{(k)} - x_3^{(k)}}{4}, \qquad x_2^{(k+1)} = \frac{21 + 4x_1^{(k)} + x_3^{(k)}}{8}, \qquad x_3^{(k+1)} = \frac{15 + 2x_1^{(k)} - x_2^{(k)}}{5}.$$
Starting from the initial guess $x^{(0)} = (1, 2, 2)^T$, the first iteration gives
$$x_1^{(1)} = \frac{7 + x_2^{(0)} - x_3^{(0)}}{4} = \frac{7 + 2 - 2}{4} = 1.75, \qquad x_2^{(1)} = \frac{21 + 4x_1^{(0)} + x_3^{(0)}}{8} = \frac{21 + 4 + 2}{8} = 3.375, \qquad x_3^{(1)} = \frac{15 + 2x_1^{(0)} - x_2^{(0)}}{5} = \frac{15 + 2 - 2}{5} = 3.000.$$
Stopping Criteria
The iterations are stopped when the absolute relative error is less than a prespecified tolerance $\epsilon$ for all unknowns, i.e.,
$$\frac{|x_i^{(k+1)} - x_i^{(k)}|}{|x_i^{(k+1)}|} < \epsilon, \quad\text{for all } i = 1, 2, \cdots, n. \qquad (3.27)$$
Convergence
Strict diagonal dominance of the coefficient matrix is a sufficient condition for the Jacobi method to converge from any given starting vector.
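A short sketch of the Jacobi iteration with the stopping rule (3.27) is given below (illustrative only; the function name, tolerance and iteration cap are assumptions). Applied to the system (3.25) it converges to $(2, 4, 3)$.

```python
import numpy as np

def jacobi(A, b, x0, tol=1e-6, max_iter=100):
    """Jacobi iteration; stops when the relative change (3.27) is below tol for every unknown."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.asarray(x0, dtype=float)
    D = np.diag(A)                               # diagonal entries a_ii
    R = A - np.diagflat(D)                       # off-diagonal part
    for _ in range(max_iter):
        x_new = (b - R @ x) / D                  # every component uses the old iterate
        if np.all(np.abs(x_new - x) / np.abs(x_new) < tol):
            return x_new
        x = x_new
    return x

A = [[4, -1, 1], [4, -8, 1], [-2, 1, 5]]
b = [7, -21, 15]
print(jacobi(A, b, x0=[1, 2, 2]))   # converges to approximately [2. 4. 3.]
```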
Exercise 3.3:
Solve the linear equation A2 x = b2 using Jacobi Iteration, where
$$A_2 = \begin{pmatrix} -2 & 1 & 5 \\ 4 & -1 & 1 \\ 4 & -8 & 1 \end{pmatrix}, \quad\text{and}\quad b_2 = \begin{pmatrix} 15 \\ -21 \\ 7 \end{pmatrix}.$$
When the matrix $A$ is large there is a practical difficulty with the Jacobi method: all the components of $x^{(k)}$ must be stored (as separate variables) until the calculation of $x^{(k+1)}$ is complete. The Gauss-Seidel method overcomes this difficulty by using $x_i^{(k+1)}$ immediately in the next equation while computing $x_{i+1}^{(k+1)}$. This modification leads to the following set of equations:
$$x_1^{(k+1)} = \frac{1}{a_{11}}\left(b_1 - a_{12}x_2^{(k)} - a_{13}x_3^{(k)} - \cdots - a_{1n}x_n^{(k)}\right),$$
$$x_2^{(k+1)} = \frac{1}{a_{22}}\left(b_2 - a_{21}x_1^{(k+1)} - a_{23}x_3^{(k)} - \cdots - a_{2n}x_n^{(k)}\right), \quad\text{and so on.}$$
Now we use the new values of $x$ as soon as they become available within each iteration. Thus the equations in Equation (3.25) become
$$x_1^{(k+1)} = \frac{7 + x_2^{(k)} - x_3^{(k)}}{4}, \qquad x_2^{(k+1)} = \frac{21 + 4x_1^{(k+1)} + x_3^{(k)}}{8}, \qquad x_3^{(k+1)} = \frac{15 + 2x_1^{(k+1)} - x_2^{(k+1)}}{5}.$$
Making this change and repeating the computation, the iteration reaches the solution $[2, 4, 3]^T$ in only 10 steps.
The stopping criterion for the Gauss-Seidel iteration is the same as for the Jacobi iteration, as given in Equation (3.27).
Convergence
A sufficient condition for the Gauss-Seidel method to converge is that the coefficient matrix is strictly diagonally dominant or symmetric positive-definite.
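The only change from the Jacobi sketch is that each new component is used as soon as it has been computed. A minimal illustrative sketch (function name and tolerance are assumptions):

```python
import numpy as np

def gauss_seidel(A, b, x0, tol=1e-6, max_iter=100):
    """Gauss-Seidel iteration: new components are used as soon as they are computed."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    x = np.asarray(x0, dtype=float)
    n = len(b)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i+1:] @ x_old[i+1:]
            x[i] = (b[i] - s) / A[i, i]
        if np.all(np.abs(x - x_old) / np.abs(x) < tol):
            break
    return x

A = [[4, -1, 1], [4, -8, 1], [-2, 1, 5]]
b = [7, -21, 15]
print(gauss_seidel(A, b, x0=[1, 2, 2]))   # approximately [2. 4. 3.]
```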
In this section we extend Newton's Method, discussed in Section 2.5 of Chapter 2, to systems of many nonlinear equations. Consider the general system of $n$ nonlinear equations in $n$ unknowns:
$$\begin{aligned} f_1(x_1, x_2, \cdots, x_n) &= 0\\ f_2(x_1, x_2, \cdots, x_n) &= 0\\ &\;\;\vdots \qquad\qquad\qquad (3.29)\\ f_n(x_1, x_2, \cdots, x_n) &= 0, \end{aligned}$$
or briefly $f(x) = 0$, in which $0$ denotes the zero vector $[0, 0, \cdots, 0]^t \in \mathbb{R}^n$. In order to find $x$ such that $f$ goes to $0$, an initial estimate $x_0$ is chosen, and Newton's iterative method for converging to the solution is used:
$$x_1 = x_0 - J^{-1}(x_0)\,f(x_0), \qquad (3.30)$$
where $J(x)$ is the Jacobian matrix of partial derivatives of $f$ with respect to $x$, given by
$$J(x) = \begin{pmatrix} \frac{\partial}{\partial x_1}f_1(x) & \frac{\partial}{\partial x_2}f_1(x) & \cdots & \frac{\partial}{\partial x_n}f_1(x) \\ \frac{\partial}{\partial x_1}f_2(x) & \frac{\partial}{\partial x_2}f_2(x) & \cdots & \frac{\partial}{\partial x_n}f_2(x) \\ \vdots & & & \vdots \\ \frac{\partial}{\partial x_1}f_n(x) & \frac{\partial}{\partial x_2}f_n(x) & \cdots & \frac{\partial}{\partial x_n}f_n(x) \end{pmatrix}.$$
The formula in Equation (3.30) is the vector equivalent of the Newton's method formula we learned before. However, in practice we never use the inverse of a matrix for computations, so we cannot use this formula directly. Rather, we can do the following. First solve the equation
$$J(x_0)\,h = -f(x_0).$$
Since $J(x_0)$ is a known matrix and $f(x_0)$ is a known vector, this equation is just a system of linear equations, which can be solved efficiently and accurately. Once we have the solution vector $h$ we obtain our improved estimate $x_1$ by
$$x_1 = x_0 + h.$$
This, then, fully defines Newton's method for systems of non-linear equations: at each step solve $J(x_k)h_k = -f(x_k)$, set $x_{k+1} = x_k + h_k$, and repeat until
$$\|x_{k+1} - x_k\| < \epsilon,$$
for a prescribed tolerance $\epsilon$.
Example 3.6:
Solve the following systems of nonlinear equations using the Newton’s method
x1 − x2 + 1 = 0
x21 + x22 − 4 = 0
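A minimal illustrative sketch of the method applied to this example is given below (function names, starting point and tolerance are assumptions). For this system $f_1 = x_1 - x_2 + 1$, $f_2 = x_1^2 + x_2^2 - 4$, so the Jacobian is $\begin{pmatrix}1 & -1\\ 2x_1 & 2x_2\end{pmatrix}$.

```python
import numpy as np

def newton_system(f, jac, x0, tol=1e-10, max_iter=50):
    """Newton's method for systems: solve J(x_k) h = -f(x_k), then x_{k+1} = x_k + h."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = np.linalg.solve(jac(x), -f(x))
        x = x + h
        if np.linalg.norm(h) < tol:
            break
    return x

f = lambda x: np.array([x[0] - x[1] + 1.0, x[0]**2 + x[1]**2 - 4.0])
jac = lambda x: np.array([[1.0, -1.0], [2*x[0], 2*x[1]]])
print(newton_system(f, jac, x0=[1.0, 2.0]))   # one root, approximately [0.823, 1.823]
```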
In this section we will discuss, in some detail, some iterative methods for finding single
eigenvalue-eigenvector pairs (eigenpairs is a common term) of a given real matrix A; we
will also give an overview of more powerful and general methods that are commonly used
to find all the eigenpairs of a given real A.
The algebraic eigenvalue problem is as follows: given a matrix $A \in \mathbb{R}^{n\times n}$, find a nonzero vector $x \in \mathbb{R}^n$ and a scalar $\lambda$ such that
$$Ax = \lambda x.$$
Note that this says that the vector $Ax$ is parallel to $x$, with $\lambda$ being an amplification factor, or gain. Note also that the above implies that
$$(A - \lambda I)x = 0,$$
so that $A - \lambda I$ is a singular matrix. Hence $\det(A - \lambda I) = 0$; it is easy to show that this
determinant is a polynomial (of degree n) in λ, known as the characteristic polynomial
of A, p(λ), so that the eigenvalues are the roots of a polynomial. Although this is not a
good way to compute the eigenvalues, it does give us some insight into their properties.
Thus, we know that an n × n matrix has n eigenvalues, that the eigenvalues can be
repeated, and that a real matrix can have complex eigenvalues, but these must occur in
conjugate pairs. We summarize these and a number of other basic eigenvalue properties
in the following theorem, presented without proof.
Theorem 3.1
Basic Eigenvalue Properties Let A ∈ Rn×n be given. Then we have the following:
The power method is an iterative technique to locate the dominant eigenvalue of a given matrix; it also computes the associated eigenvector.
Assume that $A$ has eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_n$ with $|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|$ and corresponding linearly independent eigenvectors $v_1, v_2, \cdots, v_n$. Any starting vector $x_0$ can then be written as
$$x_0 = \alpha_1 v_1 + \alpha_2 v_2 + \alpha_3 v_3 + \cdots + \alpha_n v_n.$$
Next, construct the sequence $\{x_m\}$ according to the rule $x_m = Ax_{m-1}$ for $m \ge 1$. Then
$$\begin{aligned} x_1 &= Ax_0 = \alpha_1(Av_1) + \alpha_2(Av_2) + \alpha_3(Av_3) + \cdots + \alpha_n(Av_n) = \alpha_1(\lambda_1 v_1) + \alpha_2(\lambda_2 v_2) + \alpha_3(\lambda_3 v_3) + \cdots + \alpha_n(\lambda_n v_n), \\ x_2 &= Ax_1 = A^2x_0 = \alpha_1(A^2v_1) + \alpha_2(A^2v_2) + \cdots + \alpha_n(A^2v_n) = \alpha_1(\lambda_1^2 v_1) + \alpha_2(\lambda_2^2 v_2) + \cdots + \alpha_n(\lambda_n^2 v_n), \end{aligned}$$
and, in general,
$$x_m = Ax_{m-1} = \cdots = A^m x_0 = \alpha_1(\lambda_1^m v_1) + \alpha_2(\lambda_2^m v_2) + \alpha_3(\lambda_3^m v_3) + \cdots + \alpha_n(\lambda_n^m v_n).$$
In deriving these equations we have made repeated use of the relation Avj = λj vj , which
flows from the fact that vj is an eigenvector associated with the eigenvalue λj .
Factoring $\lambda_1^m$ from the right-hand side of the equation for $x_m$ gives
$$x_m = \lambda_1^m\left[\alpha_1 v_1 + \alpha_2\left(\frac{\lambda_2}{\lambda_1}\right)^{m} v_2 + \alpha_3\left(\frac{\lambda_3}{\lambda_1}\right)^{m} v_3 + \cdots + \alpha_n\left(\frac{\lambda_n}{\lambda_1}\right)^{m} v_n\right]. \qquad (3.31)$$
Since $|\lambda_i/\lambda_1| < 1$ for $i \ge 2$, every term in the bracket except the first tends to zero as $m \to \infty$, so
$$\lim_{m\to\infty} \frac{x_m}{\lambda_1^m} = \alpha_1 v_1.$$
Since any nonzero constant times an eigenvector is still an eigenvector associated with the same eigenvalue, we see that the scaled sequence $\{x^{(m)}/\lambda_1^m\}$ converges to an eigenvector associated with the dominant eigenvalue, provided $\alpha_1 \neq 0$. Furthermore, convergence towards the eigenvector is linear with asymptotic error constant $|\lambda_2/\lambda_1|$.
An approximation for the dominant eigenvalue of $A$ can be obtained from the sequence $\{x^{(m)}\}$ as follows. Let $i$ be an index for which $x_i^{(m-1)} \neq 0$, and consider the ratio of the $i$th element of $x^{(m)}$ to the $i$th element of $x^{(m-1)}$. By Equation (3.31),
$$\frac{x_i^{(m)}}{x_i^{(m-1)}} = \frac{\lambda_1^m\,\alpha_1 v_{1,i}\left[1 + O\big((\lambda_2/\lambda_1)^m\big)\right]}{\lambda_1^{m-1}\,\alpha_1 v_{1,i}\left[1 + O\big((\lambda_2/\lambda_1)^{m-1}\big)\right]} = \lambda_1\left[1 + O\big((\lambda_2/\lambda_1)^{m-1}\big)\right],$$
provided $v_{1,i} \neq 0$, where $v_{1,i}$ denotes the $i$th element of the vector $v_1$. Hence the ratio $x_i^{(m)}/x_i^{(m-1)}$ converges towards the dominant eigenvalue, and the convergence is linear with asymptotic rate constant $|\lambda_2/\lambda_1|$.
To avoid overflow or underflow problems when calculating the sequence $\{x^{(m)}\}$, it is common practice to scale the vectors $x^{(m)}$ so that they all have unit length. Here we will use the $l_\infty$ norm to measure vector length. Thus, in a practical implementation of the power method, the vectors $x^{(m)}$ are computed in two steps: first multiply the previous vector by the matrix $A$, and then scale the resulting vector to unit length. Once the iteration has converged to a (scaled) eigenvector $x$, the dominant eigenvalue can be estimated with the Rayleigh quotient
$$\lambda = \frac{Ax\cdot x}{x\cdot x}.$$
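The scaled power method with a Rayleigh-quotient estimate can be sketched as follows (illustrative only; function name, tolerance and the simple convergence test are assumptions). Applied to the matrix of Example 3.1 below it returns a value close to 3.

```python
import numpy as np

def power_method(A, x0, tol=1e-6, max_iter=100):
    """Power method with l-infinity scaling; returns the Rayleigh-quotient estimate and the vector."""
    A = np.asarray(A, dtype=float)
    x = np.asarray(x0, dtype=float)
    lam_old = 0.0
    for _ in range(max_iter):
        y = A @ x
        lam = y[np.argmax(np.abs(y))]        # scaling factor (approximates lambda_1)
        x = y / lam                          # scale to unit l-infinity norm
        if abs(lam - lam_old) < tol:
            break
        lam_old = lam
    rayleigh = (A @ x) @ x / (x @ x)         # Rayleigh quotient refinement
    return rayleigh, x

A = [[1, 2, 0], [-2, 1, 2], [1, 3, 1]]       # matrix of Example 3.1
lam, v = power_method(A, x0=[1.0, 1.0, 1.0])
print(lam)                                   # approximately 3.0
```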
Example 3.1
Approximate dominant eigenvalue using the power method for the matrix
1 2 0
A = −2 1 2
1 3 1
Solution: Starting from $x^{(0)} = (1, 1, 1)^T$, the first iteration gives $Ax^{(0)} = (3, 1, 5)^T$, so $\lambda^{(1)} = 5$ and $x^{(1)} = (0.60, 0.20, 1.00)^T$; the second gives $Ax^{(1)} = (1.00, 1.00, 2.20)^T$, so $\lambda^{(2)} = 2.20$ and $x^{(2)} = (0.45, 0.45, 1.00)^T$. Since $|\lambda^{(2)} - \lambda^{(1)}| = |2.20 - 5| = 2.80 \ge \text{tol}$, we need to repeat the power method.
Continuing this process, we obtain the sequence of scaled approximations shown below (only the first two components are listed; after $l_\infty$ scaling the third component equals 1.00 in every iterate):
       x(0)   x(1)   x(2)   x(3)   x(4)   x(5)   x(6)   x(7)
       1.00   0.60   0.45   0.48   0.51   0.50   0.50   0.50
       1.00   0.20   0.45   0.55   0.51   0.49   0.50   0.50
Then the dominant eigenvalue can be obtained using the Rayleigh quotient after the
convergence of the power method. Hence
$$\lambda_1 = \frac{Ax^{(7)}\cdot x^{(7)}}{x^{(7)}\cdot x^{(7)}} = 3.0$$
Exercise 3.1
Determine the largest eigenvalue and the corresponding eigenvector of the matrix
$$A = \begin{pmatrix} 5 & 4 \\ 1 & 2 \end{pmatrix}$$
using the power method, with $x_0 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ as an initial guess.
Answer: after the 6th iteration you will find that $\lambda^{(6)} = 6$ and
$$x^{(6)} = x^{(5)} = \begin{pmatrix} 1 \\ 0.25 \end{pmatrix}.$$
$$Ax = \lambda x \;\Longleftrightarrow\; A^{-1}Ax = \lambda A^{-1}x \;\Longleftrightarrow\; x = \lambda A^{-1}x \;\Longleftrightarrow\; A^{-1}x = \frac{1}{\lambda}\,x \;\Longleftrightarrow\; \frac{1}{\lambda} \text{ is an eigenvalue of } A^{-1}.$$
Thus if $\lambda$ is an eigenvalue of $A$ and $x$ is the corresponding eigenvector, then $\frac{1}{\lambda}$ is an eigenvalue of $A^{-1}$ corresponding to the same eigenvector. Hence the reciprocal of the largest eigenvalue of $A^{-1}$ is the smallest (in magnitude) eigenvalue of $A$. Therefore, to find the smallest eigenvalue of $A$ we apply the power method to $A^{-1}$. This process is called the inverse power method.
Example 3.2
Perform 3 iterations of the inverse power method to obtain the smallest eigenvalue of
$$A = \begin{pmatrix} 1 & 6 & 1 \\ 1 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}, \quad\text{using } x^{(0)} = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} \text{ as an initial guess.}$$
Different types of finite difference operators are used in Numerical Analysis; the shift ($E$), averaging (mean) ($\mu$), forward difference ($\Delta$), backward difference ($\nabla$) and central difference ($\delta$) operators are discussed in this chapter.
When a function is known explicitly, it is easy to calculate the value (or values) of $f(x)$ corresponding to a fixed given value of $x$. However, when the explicit form of the function is not known, it is possible to obtain an approximate value of the function up to a desired
level of accuracy with the help of finite differences. The calculus of finite differences deals
with the changes that take place in the value of a function due to finite changes in the
independent variable.
Different types of finite difference operators are defined, among them forward difference,
backward difference and central difference operators are widely used for equally spaced
data points.
Let the function $y = f(x)$, with $x$ the independent variable and $y$ the dependent variable, be defined on the closed interval $[a, b]$, and let $x_0, x_1, \cdots, x_n$ be the $n+1$ discrete values of $x$ on the given interval. Assume that these values are equidistant, i.e. $x_i = x_0 + ih$, $i = 0, 1, 2, \cdots, n$, where $h$ is a suitable real number called the difference of the interval or step size. When $x = x_i$, the value of $y$ is denoted by $y_i$ and is defined by $y_i = f(x_i)$. The values of the independent variable $x$ are called arguments and those of the dependent variable $y$ are called entries.
If the shift operator $E$ operates on $y_i$, the value is shifted to the next tabulated value, i.e.
$$Ef(x) = f(x + h).$$
Thus, when $E$ operates on $f(x)$, the result is the next value of the function; $E$ is called the shift operator. In general,
$$E^n f(x) = f(x + nh), \quad\text{or}\quad E^n y_x = y(x + nh).$$
For example,
$$Ey_0 = y_1, \quad E^2y_0 = Ey_1 = y_2, \quad E^4y_0 = y_4, \quad \cdots, \quad E^2y_2 = y_4.$$
Similarly,
$$E^{-n} f(x) = f(x - nh).$$
$$\mu f(x) = \frac{1}{2}\left[f\!\left(x + \tfrac{h}{2}\right) + f\!\left(x - \tfrac{h}{2}\right)\right],$$
$$Df(x) = \frac{d}{dx}f(x) = f'(x), \qquad D^2f(x) = \frac{d^2}{dx^2}f(x) = f''(x). \qquad (4.1)$$
The forward difference operator $\Delta$ is defined by
$$\Delta f(x) = f(x + h) - f(x), \quad\text{i.e.}\quad \Delta y_i = y_{i+1} - y_i.$$
In particular,
$$\Delta y_0 = y_1 - y_0, \quad \Delta y_1 = y_2 - y_1, \quad \cdots, \quad \Delta y_{n-1} = y_n - y_{n-1}.$$
These are called first order forward differences. The differences of the first order forward differences are called second order forward differences and are denoted by $\Delta^2 y_0, \Delta^2 y_1, \Delta^2 y_2, \cdots$. In particular, two second order differences are
$$\Delta^2 y_0 = \Delta y_1 - \Delta y_0 = y_2 - 2y_1 + y_0, \qquad \Delta^2 y_1 = \Delta y_2 - \Delta y_1 = y_3 - 2y_2 + y_1.$$
The third order forward differences are defined in a similar manner, i.e.
$$\begin{aligned} \Delta^3 y_0 &= \Delta^2 y_1 - \Delta^2 y_0 = (y_3 - 2y_2 + y_1) - (y_2 - 2y_1 + y_0) = y_3 - 3y_2 + 3y_1 - y_0 \\ &= y_3 + (-1)\binom{3}{2}y_2 + (-1)^2\binom{3}{1}y_1 + (-1)^3\binom{3}{0}y_0, \end{aligned}$$
$$\Delta^3 y_1 = y_4 - 3y_3 + 3y_2 - y_1 = y_4 + (-1)\binom{3}{2}y_3 + (-1)^2\binom{3}{1}y_2 + (-1)^3\binom{3}{0}y_1,$$
where the combination $\binom{n}{k} = \dfrac{n!}{k!(n-k)!}$. In general,
$$\Delta^n y_i = y_{n+i} + \sum_{k=1}^{n}(-1)^k\binom{n}{n-k}\,y_{n+i-k}. \qquad (4.3)$$
It must be remembered that $\Delta^0 \equiv$ the identity operator, i.e. $\Delta^0 f(x) = f(x)$, and $\Delta^1 \equiv \Delta$.
Example 4.1
Find $\Delta^4 y_3$.
Solution: Here we can either use the formula in Equation (4.3) or apply the forward difference operator to $y_3$ four times successively.
Using the formula, we have $n = 4$ and $i = 3$, hence
$$\Delta^4 y_3 = y_7 + \sum_{k=1}^{4}(-1)^k\binom{4}{4-k}\,y_{7-k} = y_7 + (-1)\binom{4}{3}y_6 + (-1)^2\binom{4}{2}y_5 + (-1)^3\binom{4}{1}y_4 + (-1)^4\binom{4}{0}y_3 = y_7 - 4y_6 + 6y_5 - 4y_4 + y_3.$$
Alternatively, by successive application of the operator,
$$\begin{aligned} \Delta^4 y_3 &= \Delta^3 y_4 - \Delta^3 y_3 \\ &= \left(\Delta^2 y_5 - \Delta^2 y_4\right) - \left(\Delta^2 y_4 - \Delta^2 y_3\right) \\ &= \left[(\Delta y_6 - \Delta y_5) - (\Delta y_5 - \Delta y_4)\right] - \left[(\Delta y_5 - \Delta y_4) - (\Delta y_4 - \Delta y_3)\right] \\ &= \left[(y_7 - 2y_6 + y_5) - (y_6 - 2y_5 + y_4)\right] - \left[(y_6 - 2y_5 + y_4) - (y_5 - 2y_4 + y_3)\right] \\ &= \left[y_7 - 3y_6 + 3y_5 - y_4\right] - \left[y_6 - 3y_5 + 3y_4 - y_3\right] \\ &= y_7 - 4y_6 + 6y_5 - 4y_4 + y_3. \end{aligned}$$
All the forward differences can be represented in tabular form, called the forward difference or diagonal difference table. Let $x_0, x_1, \cdots, x_4$ be five arguments. All the forward differences of these arguments are shown in Table 4.1 below.

x     y      ∆        ∆²       ∆³       ∆⁴
x0    y0
             ∆y0
x1    y1              ∆²y0
             ∆y1               ∆³y0
x2    y2              ∆²y1               ∆⁴y0
             ∆y2               ∆³y1
x3    y3              ∆²y2
             ∆y3
x4    y4

The first entry $y_0$ is called the leading term, and $\Delta y_0, \Delta^2 y_0, \Delta^3 y_0, \cdots$ are called the leading differences.
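Building such a table amounts to repeatedly differencing the previous column. The following short Python sketch is illustrative only (the function name is an assumption); the data of Example 4.2 later in this section are used as a check.

```python
import numpy as np

def forward_difference_table(y):
    """Return the columns of the forward (diagonal) difference table: [y, Δy, Δ²y, ...]."""
    cols = [np.asarray(y, dtype=float)]
    while len(cols[-1]) > 1:
        cols.append(np.diff(cols[-1]))     # Δ of the previous column
    return cols

y = [625, 81, 1, 1, 81, 625, 2401, 6561]   # data of Example 4.2
for col in forward_difference_table(y):
    print(col)
```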
If any entry of the difference table has an error, then this error spreads over the table in a convex (fan-shaped) manner. The propagation of error in a difference table is illustrated in Table 4.2. Let us assume that $y_3$ has an error, and let the amount of the error be $\varepsilon$.

x     y         ∆           ∆²            ∆³            ∆⁴            ∆⁵             ∆⁶
x0    y0
                ∆y0
x1    y1                    ∆²y0
                ∆y1                       ∆³y0 + ε
x2    y2                    ∆²y1 + ε                    ∆⁴y0 − 4ε
                ∆y2 + ε                   ∆³y1 − 3ε                   ∆⁵y0 + 10ε
x3    y3 + ε                ∆²y2 − 2ε                   ∆⁴y1 + 6ε                    ∆⁶y0 − 20ε
                ∆y3 − ε                   ∆³y2 + 3ε                   ∆⁵y1 − 10ε
x4    y4                    ∆²y3 + ε                    ∆⁴y2 − 4ε
                ∆y4                       ∆³y3 − ε
x5    y5                    ∆²y4
                ∆y5
x6    y6

(ii) The error is maximum (in magnitude) along the horizontal line through the erroneous tabulated value.
(iii) In the $k$th difference column, the coefficients of the errors are the binomial coefficients in the expansion of $(1 - x)^k$. In particular, the errors in the second difference column are $\varepsilon, -2\varepsilon, \varepsilon$; in the third difference column they are $\varepsilon, -3\varepsilon, 3\varepsilon, -\varepsilon$; and so on.
Example 4.2
Construct a forward (diagonal) difference table for the following set of values:
xi : 0    2    4    6    8    10    12    14
yi : 625  81   1    1    81   625   2401  6561
Solution: The difference table is

xi    yi       ∆       ∆²      ∆³      ∆⁴      ∆⁵
0     625
               -544
2     81               464
               -80              -384
4     1                80                384
               0                0                0
6     1                80                384
               80               384              0
8     81               464               384
               544              768              0
10    625              1232              384
               1776             1152
12    2401             2384
               4160
14    6561
For example, let $f(x) = x^2 + 8x - 5$. Then
$$\Delta f(x) = f(x+h) - f(x) = \left[(x+h)^2 + 8(x+h) - 5\right] - \left[x^2 + 8x - 5\right] = 2xh + h^2 + 8h.$$
Now,
$$\Delta^2 f(x) = \Delta f(x+h) - \Delta f(x) = \left[2h(x+h) + h^2 + 8h\right] - \left[2hx + h^2 + 8h\right] = 2h^2, \quad\text{which is a constant.}$$
Hence,
$$\Delta^3 f(x) = \Delta^2 f(x+h) - \Delta^2 f(x) = 2h^2 - 2h^2 = 0.$$
The backward difference operator $\nabla$ is defined by
$$\nabla f(x) = f(x) - f(x - h).$$
Please note that the backward difference of $f(x+h)$ is the same as the forward difference of $f(x)$, that is,
$$\nabla f(x+h) = \Delta f(x).$$
In terms of the tabulated values,
$$\nabla y_i = y_i - y_{i-1}, \quad i = n, n-1, \cdots, 1.$$
In particular,
$$\nabla y_1 = y_1 - y_0, \quad \nabla y_2 = y_2 - y_1, \quad \cdots, \quad \nabla y_n = y_n - y_{n-1}.$$
These are called the first order backward differences. The second order backward differences are denoted by $\nabla^2 y_2, \nabla^2 y_3, \cdots, \nabla^2 y_n$. The first three second order backward differences are
$$\nabla^2 y_2 = \nabla(\nabla y_2) = \nabla(y_2 - y_1) = \nabla y_2 - \nabla y_1 = (y_2 - y_1) - (y_1 - y_0) = y_2 - 2y_1 + y_0,$$
$$\nabla^2 y_3 = y_3 - 2y_2 + y_1, \quad\text{and}\quad \nabla^2 y_4 = y_4 - 2y_3 + y_2.$$
The other second order differences can be obtained in a similar manner. In general,
$$\nabla^k y_i = \nabla^{k-1} y_i - \nabla^{k-1} y_{i-1}, \quad i = n, n-1, \cdots, k.$$
x     y      ∇        ∇²       ∇³       ∇⁴
x0    y0
             ∇y1
x1    y1              ∇²y2
             ∇y2               ∇³y3
x2    y2              ∇²y3               ∇⁴y4
             ∇y3               ∇³y4
x3    y3              ∇²y4
             ∇y4
x4    y4
It is observed from the forward and backward difference tables that, for a given table of values, both tables contain the same numerical entries. Practically there is no difference between the values in the two tables, but theoretically they have separate significance, which will be seen when the interpolation formulae are developed later.
There is another kind of finite difference operator, known as the central difference operator, defined by
$$\delta f(x) = f\!\left(x + \tfrac{h}{2}\right) - f\!\left(x - \tfrac{h}{2}\right). \qquad (4.5)$$
In terms of the tabulated values,
$$\delta y_{1/2} = y_1 - y_0, \quad \delta y_{3/2} = y_2 - y_1, \quad \cdots, \quad \delta y_{n-1/2} = y_n - y_{n-1},$$
and
$$\delta^2 y_i = \delta y_{i+1/2} - \delta y_{i-1/2}.$$
In general,
$$\delta^n y_i = \delta^{n-1} y_{i+1/2} - \delta^{n-1} y_{i-1/2}.$$
All central differences for the five arguments $x_0, x_1, \cdots, x_4$ are shown in Table 4.4.

x     y      δ          δ²       δ³          δ⁴
x0    y0
             δy1/2
x1    y1                δ²y1
             δy3/2               δ³y3/2
x2    y2                δ²y2                 δ⁴y2
             δy5/2               δ³y5/2
x3    y3                δ²y3
             δy7/2
x4    y4

It may be observed that all odd order differences have fractional suffixes and all even order differences have integral suffixes.
A lot of useful and interesting results can be derived relating the operators discussed above. First of all, we determine the relation between the forward and backward difference operators.
∆ and ∇:
$$\Delta y_i = y_{i+1} - y_i = \nabla y_{i+1}, \qquad \Delta^2 y_i = \Delta y_{i+1} - \Delta y_i = y_{i+2} - 2y_{i+1} + y_i = \nabla^2 y_{i+2}.$$
In general,
$$\Delta^n y_i = \nabla^n y_{i+n}, \quad i = 0, 1, 2, \cdots \qquad (4.6)$$
∆ and E:
$$\Delta f(x) = f(x+h) - f(x) = Ef(x) - f(x) = (E - 1)f(x). \qquad (4.7)$$
From this relation one can conclude that the operators $\Delta$ and $E - 1$ are equivalent, that is,
$$\Delta \equiv E - 1, \qquad (4.8) \qquad\text{or}\qquad E \equiv \Delta + 1. \qquad (4.9)$$
The expression for higher order forward differences in terms of function values can be derived in the following way:
$$\Delta^3 y_i = (E - 1)^3 y_i = (E^3 - 3E^2 + 3E - 1)y_i = y_{i+3} - 3y_{i+2} + 3y_{i+1} - y_i.$$
∇ and E:
$$\nabla f(x) = f(x) - f(x-h) = f(x) - E^{-1}f(x) = (1 - E^{-1})f(x), \qquad\text{that is,}\qquad \nabla \equiv 1 - E^{-1}. \qquad (4.10)$$
δ and E:
$$\delta f(x) = f\!\left(x + \tfrac{h}{2}\right) - f\!\left(x - \tfrac{h}{2}\right) = E^{1/2}f(x) - E^{-1/2}f(x) = \left(E^{1/2} - E^{-1/2}\right)f(x), \qquad\text{that is,}\qquad \delta \equiv E^{1/2} - E^{-1/2}. \qquad (4.11)$$
Every operator defined earlier can be expressed in terms of the other operators. A few more relations among the operators $E$, $\Delta$, $\nabla$, $\delta$ and $hD$ are collected in the following table (each row expresses the operator at the left in terms of the operator heading each column):

        E                     ∆                       ∇                       δ                                  hD
E       E                     ∆ + 1                   (1 − ∇)⁻¹               1 + δ²/2 + δ√(1 + δ²/4)            e^{hD}
∆       E − 1                 ∆                       (1 − ∇)⁻¹ − 1           δ²/2 + δ√(1 + δ²/4)                e^{hD} − 1
∇       1 − E⁻¹               1 − (1 + ∆)⁻¹           ∇                       −δ²/2 + δ√(1 + δ²/4)               1 − e^{−hD}
δ       E^{1/2} − E^{−1/2}    ∆(1 + ∆)^{−1/2}         ∇(1 − ∇)^{−1/2}         δ                                  2 sinh(hD/2)
hD      log E                 log(1 + ∆)              −log(1 − ∇)             2 sinh⁻¹(δ/2)                      hD
4.2 Interpolations
An interpolation task usually involves a given set of data points:
xi    : x0   x1   ···   xn
f(xi) : y0   y1   ···   yn
where the values $y_i$ can, for example, be the result of some physical measurement or they can come from a long
numerical calculation. Thus we know the value of the underlying function f (x) at the
set of points {xi }, and we want to find an analytic expression for f . In practice, often
we can measure a physical process or quantity (e.g., temperature) at a number of points
(e.g., time instants for temperature), but we do not have an analytic expression for the
process that would let us calculate its value at an arbitrary point. Interpolation provides
a simple and good way of estimating the analytic expression, essentially a function, over
the range spanned by the measured points.
In interpolation, the task is to estimate f (x) for arbitrary x that lies between the smallest
and the largest xi . If x is outside the range of the xi ’s, then the task is called extrapo-
lation, which is considerably more hazardous.
A common choice is to approximate the underlying function by a simple class of functions such as polynomials. Polynomial approximations assume the data to be exact at the $(n+1)$ tabular points and generate an $n$th degree polynomial passing through these $(n+1)$ points. However, if the given data contain errors, these errors will also be reflected in the corresponding approximating function. More accurate approximations can be made using splines and Chebyshev, Legendre and Hermite polynomials, but polynomials of degree $n$ or less passing through $(n+1)$ points are easy to develop and are useful in understanding numerical differentiation and numerical integration. Hence the present chapter is devoted to developing and using polynomial interpolation formulae for the required functions.
Definition 4.9
The points x0 , · · · , xn are called the interpolation points. The property of “pass-
ing through these points” is referred to as interpolating the data or called inter-
polation condition. The function that interpolates the data is an interpolant
or an interpolating polynomial (or whatever function is being used).
1. Replace a set of data points {(xi , yi )} with a function given analytically. Here we
have several aspects
• Given a set of data points {(xi , yi )}, find a curve passing thru these points
that is “pleasing to the eye”. In fact, this is what is done continually with
computer graphics. How do we connect a set of points to make a smooth
curve? Connecting them with straight line segments will often give a curve
with many corners, whereas what was intended was a smooth curve.
• We may want to take function values f (x) given in a table for selected values
of x, often equally spaced, and extend the function to values of x not in the
table. For example, given numbers from a table of logarithms, estimate the
logarithm of a number x not in the table.
• The data may be from a known class of functions. Interpolation is then used
to find the member of this class of functions that agrees with the given data.
For example, data may be generated from functions of the form
Then we need to find the coefficients {aj } based on the given data values.
The simplest form of interpolation is probably the straight line, connecting two points
by a straight line. Consider the data (x0 , y0 ), (x1 , y1 ). The problem is to find a function
P1 (x) which passes through these two data points. Since there are only two data points
available, the maximum degree of the unique polynomial which passes through these
points is one. Let us assume that P1 (x) = ax + b is the straight line passing through the
two points then
ax0 + b = y0 ,
ax1 + b = y1 .
Solving for $a$ and $b$ gives
$$a = \frac{y_1 - y_0}{x_1 - x_0}, \qquad b = \frac{x_1 y_0 - x_0 y_1}{x_1 - x_0}.$$
Thus, $P_1(x)$ can be written in more convenient ways as
$$\begin{aligned} P_1(x) &= \frac{x - x_1}{x_0 - x_1}\,y_0 + \frac{x - x_0}{x_1 - x_0}\,y_1 \\ &= \frac{(x_1 - x)\,y_0 + (x - x_0)\,y_1}{x_1 - x_0} \qquad\qquad (4.12)\\ &= y_0 + \frac{x - x_0}{x_1 - x_0}\left[y_1 - y_0\right] \\ &= y_0 + \frac{y_1 - y_0}{x_1 - x_0}\,(x - x_0). \end{aligned}$$
Check each of these by evaluating them at x = x0 and x1 to see if the respective values
are y0 and y1 .
As we will see, the interpolating polynomial can be written in a variety of forms, among
these are the Lagrange form and the Newton form. These forms are equivalent in
the sense that the polynomial in question is the one and the same (in fact, the solution
to the interpolation task is given by a unique polynomial)
Example 4.3
Suppose we have the following velocity versus time data (a car accelerating from a
rest position)
Use linear interpolation to estimate the car’s velocity at time t = 1.5 and at t =
4.25.
Solution: Let us denote the discrete times by $t_i$ and the corresponding velocities by $v_i$.
To estimate the car's velocity at $t = 1.5$ we use the data points $(t_1, v_1) = (1, 10)$ and $(t_2, v_2) = (2, 25)$ in Equation (4.12). Thus we have
$$v(t) = \frac{t - t_2}{t_1 - t_2}\,v_1 + \frac{t - t_1}{t_2 - t_1}\,v_2,$$
$$v(1.5) = \frac{1.5 - 2}{1 - 2}\,10 + \frac{1.5 - 1}{2 - 1}\,25 = 5 + 12.5 = 17.5.$$
Next, to estimate the velocity at $t = 4.25$ we use the last two data points, $(t_4, v_4) = (4, 52)$ and $(t_5, v_5) = (5, 59)$, and we have
$$v(t) = \frac{t - t_5}{t_4 - t_5}\,v_4 + \frac{t - t_4}{t_5 - t_4}\,v_5,$$
$$v(4.25) = \frac{4.25 - 5}{4 - 5}\,52 + \frac{4.25 - 4}{5 - 4}\,59 = 53.75.$$
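Formula (4.12) is a one-liner in code. The sketch below is purely illustrative (the function name is an assumption) and reproduces both estimates of this example.

```python
def linear_interp(t, t0, v0, t1, v1):
    """Evaluate the straight line through (t0, v0) and (t1, v1) at t, Eq. (4.12)."""
    return v0 + (v1 - v0) / (t1 - t0) * (t - t0)

print(linear_interp(1.5, 1, 10, 2, 25))    # 17.5
print(linear_interp(4.25, 4, 52, 5, 59))   # 53.75
```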
Example 4.4
Using the data points $(0.82, 2.2705)$ and $(0.83, 2.29332)$, the linear interpolant is
$$P_1(x) = \frac{0.83 - x}{0.83 - 0.82}\,(2.2705) + \frac{x - 0.82}{0.83 - 0.82}\,(2.29332).$$
Hence
$$P_1(0.826) = 2.284192.$$
Remark: In general, if y0 = f (x0 ) and y1 = f (x1 ) for some function f , then P1 (x) is a
linear approximation of f (x) for all x ∈ [x0 , x1 ].
Given the data points $(x_0, y_0)$, $(x_1, y_1)$, $(x_2, y_2)$, we want to find a quadratic polynomial
$$P_2(x) = a_0 + a_1 x + a_2 x^2$$
which satisfies
$$P_2(x_i) = y_i, \quad\text{for } i = 0, 1, 2,$$
for the given data points. One formula for such a polynomial is
$$P_2(x) = L_0(x)\,y_0 + L_1(x)\,y_1 + L_2(x)\,y_2, \qquad (4.13)$$
with
$$L_0(x) = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}, \quad L_1(x) = \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}, \quad L_2(x) = \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}.$$
Equation (4.13) is called Lagrange's form of the interpolation polynomial, and the functions $L_0$, $L_1$ and $L_2$ are called Lagrange's interpolating basis functions. The Lagrange basis functions have the property that $\deg(L_i) \le 2$ and
$$L_i(x_j) = \delta_{ij} = \begin{cases} 1, & \text{if } i = j, \\ 0, & \text{if } i \neq j. \end{cases}$$
Example 4.5
Construct the quadratic polynomial interpolation P2 (x) that interpolates the points
(1, 4), (2, 1), and (5, 6).
Given $n + 1$ discrete data points $(x_i, y_i)$, $i = 0, 1, 2, \cdots, n$, we can define a polynomial of degree $n$ as
$$P_n(x) = a_0(x - x_1)\cdots(x - x_n) + a_1(x - x_0)(x - x_2)\cdots(x - x_n) + \cdots + a_n(x - x_0)\cdots(x - x_{n-1}),$$
and impose the interpolation conditions
$$y_i = P_n(x_i), \quad i = 0, 1, \cdots, n.$$
For $i = 0$ we have
$$y_0 = P_n(x_0) = a_0(x_0 - x_1)\cdots(x_0 - x_n) \;\therefore\; a_0 = \frac{y_0}{(x_0 - x_1)\cdots(x_0 - x_n)}.$$
For $i = 1$ we get
$$a_1 = \frac{y_1}{(x_1 - x_0)(x_1 - x_2)\cdots(x_1 - x_n)},$$
and in general
$$a_i = \frac{y_i}{(x_i - x_0)(x_i - x_1)\cdots(x_i - x_{i-1})(x_i - x_{i+1})\cdots(x_i - x_n)}, \qquad a_n = \frac{y_n}{(x_n - x_0)\cdots(x_n - x_{n-1})}.$$
Substituting these coefficients gives
$$\begin{aligned} P_n(x) = {} & \frac{(x - x_1)(x - x_2)\cdots(x - x_n)}{(x_0 - x_1)(x_0 - x_2)\cdots(x_0 - x_n)}\,y_0 + \frac{(x - x_0)(x - x_2)\cdots(x - x_n)}{(x_1 - x_0)(x_1 - x_2)\cdots(x_1 - x_n)}\,y_1 + \cdots \\ & + \frac{(x - x_0)\cdots(x - x_{i-1})(x - x_{i+1})\cdots(x - x_n)}{(x_i - x_0)\cdots(x_i - x_{i-1})(x_i - x_{i+1})\cdots(x_i - x_n)}\,y_i + \cdots + \frac{(x - x_0)(x - x_1)\cdots(x - x_{n-1})}{(x_n - x_0)(x_n - x_1)\cdots(x_n - x_{n-1})}\,y_n, \end{aligned}$$
that is,
$$P_n(x) = \sum_{i=0}^{n} L_i(x)\,y_i, \qquad (4.15)$$
where
$$L_i(x) = \frac{\prod_{\substack{k=0 \\ k\neq i}}^{n}(x - x_k)}{\prod_{\substack{k=0 \\ k\neq i}}^{n}(x_i - x_k)}. \qquad (4.16)$$
Example 4.6
Given the following data table, construct the Lagrange interpolation polynomial
P (x), to fit the data and find f (1.25)
xi 0 1 2 3
yi 1 2.25 3.75 4.25
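Formulae (4.15)-(4.16) can be evaluated directly. The following sketch is illustrative only (the function name is an assumption); applied to the data of Example 4.6 it gives $f(1.25) \approx 2.6504$.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolation polynomial (4.15)-(4.16) at x."""
    total = 0.0
    n = len(xs)
    for i in range(n):
        Li = 1.0
        for k in range(n):
            if k != i:
                Li *= (x - xs[k]) / (xs[i] - xs[k])   # basis function L_i(x)
        total += Li * ys[i]
    return total

xs = [0, 1, 2, 3]                 # data of Example 4.6
ys = [1, 2.25, 3.75, 4.25]
print(lagrange_eval(xs, ys, 1.25))   # approximately 2.6504
```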
A major difficulty with the Lagrange Interpolation is that one is not sure about the degree
of interpolating polynomial needed to achieve a certain accuracy. Thus, if the accuracy
is not good enough with polynomial of a certain degree, one needs to increase the degree
of the polynomial, and computations need to be started all over again. Furthermore,
computing various Lagrangian polynomials is an expensive procedure. It is, indeed,
desirable to have a formula which makes use of Pk−1 (x) in computing Pk (x).
Pn (x) = a0 +a1 (x−x0 )+a2 (x−x0 )(x−x1 )+· · ·+an (x−x0 )(x−x1 ) · · · (x−xn−1 ), (4.17)
such that
Pn (xi ) = fi = yi , (Interpolation Condition.)
Hence, the constants a0 through an can be determined as follows using the interpolation
condition.
For i = 0 we get
f0 = Pn (x0 ) = a0
For i = 1 we have
f1 = pn (x1 ) = a0 + a1 (x1 − x0 )
f1 − f0
∴ a1 =
x1 − x0
For $i = 2$, it is convenient to introduce the divided-difference notation
$$f[x_k] = f_k, \qquad f[x_k, x_{k+1}] = \frac{f[x_{k+1}] - f[x_k]}{x_{k+1} - x_k},$$
so that $a_0 = f_0 = f[x_0]$, $a_1 = f[x_0, x_1]$, and
$$a_2 = \frac{\dfrac{f_2 - f_1}{x_2 - x_1} - \dfrac{f_1 - f_0}{x_1 - x_0}}{x_2 - x_0} = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = f[x_0, x_1, x_2].$$
Similarly,
$$a_3 = \frac{f[x_1, x_2, x_3] - f[x_0, x_1, x_2]}{x_3 - x_0} = f[x_0, x_1, x_2, x_3],$$
and, in general,
$$a_i = \frac{f[x_1, x_2, \cdots, x_i] - f[x_0, x_1, \cdots, x_{i-1}]}{x_i - x_0} = f[x_0, x_1, x_2, \cdots, x_i].$$
Note that $a_1$ is called the first divided difference, $a_2$ the second divided difference, and so on. The polynomial in Equation (4.17) can now be rewritten as
$$P_n(x) = \sum_{k=0}^{n} f[x_0, x_1, \cdots, x_k]\prod_{i=0}^{k-1}(x - x_i). \qquad (4.19)$$
It may also be noted that, for calculating the higher order divided differences, we use the lower order divided differences. In fact, starting from the given zeroth order differences, one can systematically arrive at any of the higher order divided differences. For clarity, the entire calculation may be depicted in the form of a table, called the Newton Divided Difference Table.

xi    f[xi]    First order    Second order       Third order          Fourth order
x0    f[x0]
               f[x0,x1]
x1    f[x1]                   f[x0,x1,x2]
               f[x1,x2]                          f[x0,x1,x2,x3]
x2    f[x2]                   f[x1,x2,x3]                             f[x0,x1,x2,x3,x4]
               f[x2,x3]                          f[x1,x2,x3,x4]
x3    f[x3]                   f[x2,x3,x4]
               f[x3,x4]
x4    f[x4]
In the above Newton divided difference table the bold faces are the coefficients of the
polynomial. Again suppose that we are given the data set (xi , fi ), i = 0, 1 · · · , 4 and
that we are interested in finding the 4th order Newton Divided Difference interpolating
polynomial. Let us first construct the Newton Divided Difference Table. Wherein one
can clearly see how the lower order differences are used in calculating the higher order
Divided Differences:
Example 4.7
Construct the Newton Divided Difference Table for generating Newton interpola-
tion polynomial with the following data set:
xi 0 1 2 3 4
f (xi ) = yi 0 1 8 27 64
Solution: Here $n = 4$, so one can find a fourth order Newton Divided Difference interpolation polynomial for the given data. Let us first generate the Newton Divided Difference Table:

xi    f[xi]    1st order    2nd order    3rd order    4th order
0     0
               1
1     1                     3
               7                          1
2     8                     6                          0
               19                         1
3     27                    9
               37
4     64

Note: One may observe that the given data correspond to the cubic polynomial $x^3$; to fit such data a 3rd order polynomial is adequate, and indeed the fourth order difference in the table is zero. The divided differences along the top diagonal of the table can be used directly to construct the Newton Divided Difference interpolation polynomial that fits the data:
$$P(x) = 0 + 1\,(x - 0) + 3\,(x - 0)(x - 1) + 1\,(x - 0)(x - 1)(x - 2) = x^3.$$
The advantage of the above method is that there is no need to start all over again
if additional pairs of data are added. We simply need to compute additional divided
differences. Since nth order polynomial interpolation of a given (n + 1) pairs of data is
unique, thus the above polynomial and Lagrangian polynomial are exactly the same.
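The divided-difference coefficients and the evaluation of the Newton form (4.19) can be sketched as follows (illustrative only; the function names are assumptions). The check uses the data of Example 4.7.

```python
def divided_differences(xs, fs):
    """Return the coefficients f[x0], f[x0,x1], ..., f[x0,...,xn] of the Newton form (4.19)."""
    n = len(xs)
    coef = list(fs)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):      # update in place, highest index first
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - j])
    return coef

def newton_eval(xs, coef, x):
    """Evaluate the Newton form by nested multiplication (Horner-like scheme)."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result

xs = [0, 1, 2, 3, 4]
fs = [0, 1, 8, 27, 64]                         # data of Example 4.7 (y = x^3)
coef = divided_differences(xs, fs)
print(coef)                                    # [0, 1.0, 3.0, 1.0, 0.0]
print(newton_eval(xs, coef, 2.5))              # 15.625 = 2.5**3
```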
Example 4.8
Use Newton’s Divided Difference formula and evaluate f (3.0) by a third and fourth
degree polynomial, given
Solution: Here n = 4. One can find a fourth order Newton Divided Difference
interpolation polynomial to the given data. Let us generate Newton Divided Dif-
ference Table first:
The third degree polynomial fitting all points from x0 = 3.2 to x3 = 4.8 is given by
The fourth degree polynomial fitting all points from x0 = 3.2 to x4 = 5.6 is also
given by
Suppose now that the data points are equally spaced, i.e.
$$x_{i+1} = x_i + h, \quad i = 0, 1, \cdots, n-1,$$
so that
$$x_i = x_0 + ih, \quad i = 1, 2, \cdots, n.$$
Using these we can easily simplify the Newton divided differences using the forward, back-
ward and central difference and we can obtain the corresponding interpolating polynomial
formula.
Recalling the divided difference and the forward difference, we have the following:
$$f[x_0, x_1] = \frac{f(x_1) - f(x_0)}{x_1 - x_0} = \frac{\Delta f(x_0)}{h},$$
$$f[x_0, x_1, x_2] = \frac{f[x_1, x_2] - f[x_0, x_1]}{x_2 - x_0} = \frac{\dfrac{f(x_2) - f(x_1)}{x_2 - x_1} - \dfrac{f(x_1) - f(x_0)}{x_1 - x_0}}{x_2 - x_0} = \frac{f(x_2) - 2f(x_1) + f(x_0)}{2h^2} = \frac{\Delta^2 f(x_0)}{2h^2},$$
$$f[x_0, x_1, x_2, x_3] = \frac{f[x_1, x_2, x_3] - f[x_0, x_1, x_2]}{x_3 - x_0} = \frac{\dfrac{f(x_3) - 2f(x_2) + f(x_1)}{2h^2} - \dfrac{f(x_2) - 2f(x_1) + f(x_0)}{2h^2}}{3h} = \frac{f(x_3) - 3f(x_2) + 3f(x_1) - f(x_0)}{3!\,h^3} = \frac{\Delta^3 f(x_0)}{3!\,h^3}.$$
In general we have
$$f[x_0, x_1, \cdots, x_n] = \frac{\Delta^n f(x_0)}{n!\,h^n}.$$
With this notation the Newton Divided Difference interpolating polynomial (4.18) can be written as
$$P_n(x) = f(x_0) + \frac{\Delta f(x_0)}{h}(x - x_0) + \frac{\Delta^2 f(x_0)}{2!\,h^2}(x - x_0)(x - x_1) + \cdots + \frac{\Delta^n f(x_0)}{n!\,h^n}(x - x_0)(x - x_1)\cdots(x - x_{n-1}). \qquad (4.20)$$
Setting
$$k = \frac{x - x_0}{h},$$
so that $x - x_i = (k - i)h$, this simplifies to
$$P_n(x) = f(x_0) + k\,\Delta f(x_0) + k(k-1)\frac{\Delta^2 f(x_0)}{2!} + \cdots + k(k-1)\cdots(k-n+1)\frac{\Delta^n f(x_0)}{n!}. \qquad (4.21)$$
This is called Newton's forward interpolation formula or the forward Newton-Gregory formula.
By looking at the forward difference table 4.1 we can see that this formula uses the values
along the diagonal of the differences of y - it is a FORWARD DIFFERENCE formula. It
is therefore used for interpolation near the beginning of a table where k is small.
Example 4.9
Given the following table of values, use the Newton-Gregory forward interpolation formula to estimate (a) $f(2.4)$ and (b) $f(8.7)$:
xi     : 2      4      6      8      10
f(xi)  : 9.68   10.96  12.32  13.76  15.28
Solution: Form a difference table and note that all forward differences of order higher than two are zero.

xi    f(xi)    ∆        ∆²
2     9.68
               1.28
4     10.96             0.08
               1.36
6     12.32             0.08
               1.44
8     13.76             0.08
               1.52
10    15.28

(a) Here we have $x = 2.4$, $x_0 = 2$, $h = 2$ and $k = \frac{x - x_0}{h} = \frac{2.4 - 2}{2} = 0.2$. Using Equation (4.20) we have
$$P_2(x) = f(x_0) + \frac{\Delta f(x_0)}{h}(x - x_0) + \frac{\Delta^2 f(x_0)}{2!\,h^2}(x - x_0)(x - x_1) = 9.68 + (x-2)\frac{1.28}{2} + (x-2)(x-4)\frac{0.08}{2!\times 2^2}.$$
Hence
$$f(2.4) \approx P_2(2.4) = 9.68 + (2.4-2)\frac{1.28}{2} + (2.4-2)(2.4-4)\frac{0.08}{2!\times 2^2} = 9.9296.$$
Equivalently, using Equation (4.21),
$$P_2(x) = f(x_0) + k\,\Delta f(x_0) + k(k-1)\frac{\Delta^2 f(x_0)}{2!},$$
hence
$$f(2.4) \approx P_2(2.4) = 9.68 + 0.2\times 1.28 + 0.2(0.2-1)\frac{0.08}{2!} = 9.9296.$$
(b) Here we have $x = 8.7$, $x_0 = 2$, $h = 2$ and $k = \frac{8.7 - 2}{2} = 3.35$. Using Equation (4.21),
$$f(8.7) \approx P_2(8.7) = 9.68 + 3.35(1.28) + 3.35(3.35-1)\frac{0.08}{2!} = 14.2829.$$
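The Newton-Gregory forward formula (4.21) is straightforward to program for equally spaced data. The sketch below is illustrative only (the function name is an assumption); it reproduces the two hand-computed values of Example 4.9.

```python
def newton_gregory_forward(xs, ys, x):
    """Newton-Gregory forward interpolation (4.21) for equally spaced points xs."""
    h = xs[1] - xs[0]
    k = (x - xs[0]) / h
    # leading forward differences Δ^j y0, j = 0, 1, ...
    diffs = list(ys)
    coeffs = [diffs[0]]
    for _ in range(1, len(ys)):
        diffs = [diffs[i + 1] - diffs[i] for i in range(len(diffs) - 1)]
        coeffs.append(diffs[0])
    # accumulate k(k-1)...(k-j+1)/j! * Δ^j y0
    result, factor = coeffs[0], 1.0
    for j in range(1, len(coeffs)):
        factor *= (k - (j - 1)) / j
        result += factor * coeffs[j]
    return result

xs = [2, 4, 6, 8, 10]
ys = [9.68, 10.96, 12.32, 13.76, 15.28]
print(newton_gregory_forward(xs, ys, 2.4))   # 9.9296
print(newton_gregory_forward(xs, ys, 8.7))   # 14.2829
```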
Example 4.10
In the following table of $e^x$, use the Newton-Gregory formula of forward interpolation to calculate $e^{0.12}$.

xi     f(xi)     ∆        ∆²       ∆³       ∆⁴
0.1    1.1052
                 0.7169
0.6    1.8221             0.4652
                 1.1821            0.3015
1.1    3.0042             0.7667            0.1962
                 1.9488            0.4977
1.6    4.9530             1.2644
                 3.2132
2.1    8.1662

Note that in this case there is no difference column that is constant. This is to be expected, since $e^x$ cannot be represented exactly by a polynomial of finite degree. Here $x = 0.12$, $x_0 = 0.1$, $h = 0.5$ and $k = \frac{0.12 - 0.1}{0.5} = 0.04$, so
$$\begin{aligned} e^{0.12} &\approx 1.1052 + 0.04(0.7169) + 0.04(0.04 - 1)\frac{0.4652}{2!} + 0.04(0.04 - 1)(0.04 - 2)\frac{0.3015}{3!} + 0.04(0.04 - 1)(0.04 - 2)(0.04 - 3)\frac{0.1962}{4!} \\ &= 1.1269. \quad\text{(correct value to 5 d.p.: 1.12750)} \end{aligned}$$
In Example 4.9 the interpolation formula is identical with $f(x)$, which is a quadratic function, and the results for $f(2.4)$ and $f(8.7)$ are therefore correct to the number of decimal places retained. In Example 4.10 the function $e^x$ is replaced by a 4th degree polynomial which takes the value of $e^x$ at the five given entries. Because the successive differences decrease, higher differences are relatively small and the value of the estimate converges. From direct calculation it turns out that the error in the estimate for $e^{0.12}$ is about 0.05 percent and for $e^{2.00}$ it is about 0.04 percent.
For interpolating the value of the function y = f (x) near the end of the given data points
and also to extrapolate value of the function a short distance forward from yn , Newton’s
backward interpolation formula is used.
Pn (x) = an +an−1 (x−xn )+an−2 (x−xn )(x−xn−1 )+· · ·+a0 (x−xn )(x−xn−1 ) · · · (x−x1 ).
(4.22)
Now the constants $a_n, a_{n-1}, \cdots, a_0$ are computed such that $P_n(x_i) = y_i$. Evaluating at $x = x_n$,
$$P_n(x_n) = a_n \;\Longrightarrow\; a_n = y_n.$$
Again,
$$P_n(x_{n-1}) = a_n + a_{n-1}(x_{n-1} - x_n) \;\Longrightarrow\; y_{n-1} = y_n + a_{n-1}(x_{n-1} - x_n) \;\Longrightarrow\; a_{n-1} = \frac{y_n - y_{n-1}}{x_n - x_{n-1}} = \frac{\nabla y_n}{h}.$$
Evaluating $P_n(x)$ at $x = x_{n-2}$ gives, in the same way, $a_{n-2} = \dfrac{\nabla^2 y_n}{2!\,h^2}$, and in general
$$a_i = \frac{\nabla^{n-i} y_n}{(n-i)!\,h^{n-i}}, \quad i = n-3, n-4, \cdots, 0.$$
Hence
$$P_n(x) = y_n + \frac{\nabla y_n}{h}(x - x_n) + \frac{\nabla^2 y_n}{2!\,h^2}(x - x_n)(x - x_{n-1}) + \cdots + \frac{\nabla^n y_n}{n!\,h^n}(x - x_n)(x - x_{n-1})\cdots(x - x_0). \qquad (4.23)$$
Now, setting
$$k = \frac{x - x_n}{h},$$
we have $x - x_n = kh$ and $x - x_{n-1} = x - (x_n - h) = x - x_n + h = kh + h = (k+1)h$. Similarly,
$$x - x_{n-2} = (k+2)h, \quad x - x_{n-3} = (k+3)h, \quad \cdots, \quad x - x_1 = (k + (n-1))h,$$
so that
$$P_n(x) = y_n + k\,\nabla y_n + k(k+1)\frac{\nabla^2 y_n}{2!} + \cdots + k(k+1)\cdots(k + (n-1))\frac{\nabla^n y_n}{n!}, \qquad (4.24)$$
where
$$k = \frac{x - x_n}{h}.$$
This is called Newton's backward interpolation formula or the Newton-Gregory backward formula.
Example 4.11
xi 1 2 3 4 5 6 7 8
yi = f (xi ) 1 8 27 64 125 216 343 512
xi f (xi ) 5 52 53 54
1 1
7
2 8 12
19 6
3 27 18 0
37 6
4 64 24 0
61 6
5 125 30 0
91 6
6 216 36 0
127 6
7 343 42
169
8 512
Since the fourth and higher order differences are zero, the required Newton's backward interpolation formula is

yx = yn + k ∇yn + k(k + 1) ∇²yn/2! + k(k + 1)(k + 2) ∇³yn/3!.   (4.25)

In this problem,

k = (x − xn)/h = (7.5 − 8)/1 = −0.5,

hence we have

y7.5 ≈ 512 + (−0.5)(169) + (−0.5)(−0.5 + 1)(42/2) + (−0.5)(−0.5 + 1)(−0.5 + 2)(6/3!)   (4.26)
     = 512 − 84.5 − 5.25 − 0.375
     = 421.875.
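As a check of Example 4.11, here is a small Python sketch (our own helper, newton_backward) that evaluates formula (4.24) using the last entry of each difference column.

```python
def newton_backward(xs, ys, x):
    """Evaluate the Newton-Gregory backward polynomial (4.24) at x."""
    n = len(xs)
    h = xs[1] - xs[0]
    k = (x - xs[-1]) / h
    # Backward differences of the last point are the last entries of the
    # successive difference columns.
    col, result, coeff = list(ys), ys[-1], 1.0
    for j in range(1, n):
        col = [col[i + 1] - col[i] for i in range(len(col) - 1)]
        coeff *= (k + (j - 1)) / j
        result += coeff * col[-1]
    return result

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 8, 27, 64, 125, 216, 343, 512]
print(newton_backward(xs, ys, 7.5))   # 421.875
```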
Our goal in this section is to provide estimates on the “error" we make when interpolating
data that is taken from sampling an underlying function f (x). While the interpolant and
the function agree with each other at the interpolation points, there is, in general, no
reason to expect them to be close to each other elsewhere. Nevertheless, we can estimate
the difference between them, a difference which we refer to as the interpolation error.
The interpolating polynomial Pn is an approximation to f, but unless f itself is a polynomial of degree at most n, there will be a nonzero error e(x) = f(x) − Pn(x). At times it is useful to have an explicit expression for this error; it is given by the following theorem, which we state without proof.
Theorem 4.1
Let f be a given function on [a, b] and let Pn be the polynomial of degree less than or equal to n interpolating f at the n + 1 distinct data points x0, x1, x2, ···, xn in [a, b]. If f has n + 1 continuous derivatives, then for every x in [a, b] there exists a point ξx in (a, b) such that

en(x) = f(x) − Pn(x) = (f^(n+1)(ξx)/(n + 1)!) ∏_{i=0}^{n} (x − xi).   (4.27)

Recall that the first-order divided difference is

f[x0, x1] = (f(x1) − f(x0))/(x1 − x0).
Since the order of the points does not change the value of the divided difference, we can assume, without any loss of generality, that x0 < x1. If we assume, in addition, that f(x) is continuously differentiable in the interval [x0, x1], then this divided difference equals the derivative of f(x) at an intermediate point, i.e.,

f[x0, x1] = f′(ξ)   for some ξ ∈ (x0, x1).

In other words, the first-order divided difference can be viewed as an approximation of the first derivative of f(x) in the interval. It is important to note that while this interpretation requires additional smoothness of f(x) (i.e. its being differentiable), divided differences are well defined also for non-differentiable functions. This notion can be extended to divided differences of higher order as stated below:

f[x, x0, x1, ···, xn−1] = f^(n)(ξ)/n!,   (4.28)

for some intermediate point ξ.
Example 4.12
If P(x) is the polynomial that interpolates the function f(x) = sin(x) at 10 points on the interval [0, 1], what is the greatest possible error?

Solution: In this example we have n + 1 = 10 data points, hence n = 9. Thus, the largest possible error is the maximum of

|e9(x)| = (1/10!) |f^(10)(ξx)| ∏_{i=0}^{9} |x − xi|.

Since all the points xi and x lie in [0, 1], we have |x − xi| ≤ 1 for every i, and

max |f^(10)(ξx)| = max |−sin(x)| ≤ 1.

Hence

|e9(x)| ≤ (1/10!)(1)(1)^10 ≈ 2.8 × 10⁻⁷.
Example 4.13
Determine the spacing h in a table of evenly spaced values of the function f(x) = √x between 1 and 2, so that interpolation with a second-degree polynomial in this table will yield a desired accuracy ε.

Solution: Such a table contains the values f(xi), i = 0, 1, ···, n = 1/h, at the points xi = 1 + ih. If x ∈ [xi−1, xi+1], then we approximate f(x) with P2(x), where P2(x) is the polynomial of degree 2 that interpolates f at xi−1, xi, xi+1. Thus, by Theorem 4.1, the error is

e2(x) = (f‴(ξx)/3!)(x − xi−1)(x − xi)(x − xi+1),   for some ξx ∈ (xi−1, xi+1).

Since f‴(x) = (3/8)x^(−5/2) ≤ 3/8 on [1, 2], and since with y = x − xi the product can be written as (y + h)y(y − h), we obtain

|e2(x)| ≤ (1/3!)(3/8) max_{|y|≤h} |(y − h)y(y + h)|.
In the last step, the maximum absolute value of g(y) = (y − h)y(y + h) over [−h, h] is obtained as follows. Since g(−h) = g(0) = g(h) = 0, g(y) achieves its maximum (and minimum) in the interior of the interval. We have

g′(y) = 3y² − h².

Thus, g′(y) = 0 yields y = ±h/√3, where the maximum of |g(y)| is attained, namely |g(±h/√3)| = 2h³/(3√3). Hence we have derived a bound on the interpolation error:

|e2(x)| ≤ (1/3!)(3/8)(2h³/(3√3)) = h³/(24√3).   (4.29)

Therefore, we require

h³/(24√3) < ε.

In particular, suppose we want an accuracy of at least 7 places after zero. We should choose h such that

h³/(24√3) < 5 × 10⁻⁸.

This gives h ≈ 0.0127619, and the number of entries in the table is about 79.
Exercise 4.1
Construct the Lagrange and Newton forms of the interpolating polynomial P3(x) for the function f(x) = ∛x which passes through the points (0, 0), (1, 1), (8, 2) and (27, 3). Calculate the interpolation error at x = 5 and compare it with the theoretical error bound.
Linear Spline
Definition 4.10
A function S is called a spline of degree one, or a linear spline, on [a, b] if:
i. S is continuous on [a, b]; and
ii. there is a partitioning of the interval a = x0 < x1 < x2 < ··· < xn = b such that S is a linear polynomial on each subinterval [xi−1, xi].
Example 4.14
Solution:
Example 4.15
State whether the following piecewise polynomial is a linear spline or not:

s(x) = x + 1,    −1 ≤ x ≤ 0
       2x + 1,    0 < x < 1
       4 − x,     1 ≤ x ≤ 2

Solution: Each piece is a polynomial of degree at most one, and s is continuous at the interior knots: at x = 0 both x + 1 and 2x + 1 equal 1, and at x = 1 both 2x + 1 and 4 − x equal 3. Hence s(x) is a linear spline.
Example 4.16
Obtain the piecewise linear interpolating polynomial (linear spline) for the function f(x) defined by the data:

x     1   2   4    8
f(x)  3   7   21   73

Solution: For the interval [1, 2] (i.e., for the points (1, 3) and (2, 7)) we have s(x) = 3 + ((7 − 3)/(2 − 1))(x − 1) = 4x − 1.
For the interval [2, 4] (i.e., for the points (2, 7) and (4, 21)) we have s(x) = 7 + ((21 − 7)/(4 − 2))(x − 2) = 7x − 7.
For the interval [4, 8] (i.e., for the points (4, 21) and (8, 73)) we have s(x) = 21 + ((73 − 21)/(8 − 4))(x − 4) = 13x − 31.
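A small Python sketch (ours; the helper name linear_spline is arbitrary) that evaluates this piecewise linear interpolant:

```python
def linear_spline(xs, ys, x):
    """Piecewise linear interpolant: on [x_{j-1}, x_j] use the chord
    through (x_{j-1}, y_{j-1}) and (x_j, y_j)."""
    for j in range(1, len(xs)):
        if x <= xs[j] or j == len(xs) - 1:
            slope = (ys[j] - ys[j - 1]) / (xs[j] - xs[j - 1])
            return ys[j - 1] + slope * (x - xs[j - 1])

xs, ys = [1, 2, 4, 8], [3, 7, 21, 73]
print(linear_spline(xs, ys, 3))    # 14.0, from s(x) = 7x - 7 on [2, 4]
print(linear_spline(xs, ys, 6))    # 47.0, from s(x) = 13x - 31 on [4, 8]
```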
Example 4.17
Approximate the function f(x) = 4^x on [−1, 1] by the linear spline method and find the value of f(0.125).
Definition 4.11
A cubic spline S(x) for the data points (x0, y0), (x1, y1), ···, (xn, yn) is defined by the following conditions:
i. S(xj) = yj for j = 0, 1, ···, n;
ii. on each subinterval [xj−1, xj], S(x) is a polynomial of degree at most three;
iii. S(x), S′(x) and S″(x) are continuous on [a, b].

If the data

x     x0   x1   ···   xn
f(x)  y0   y1   ···   yn

is given, then the cubic spline s(x) on this given data is found by finding the spline on
[xj−1 , xj ], j = 1, 2, · · · , n using the formula:
s(x) = (1/(6h))[(xj − x)³ Mj−1 + (x − xj−1)³ Mj]
     + (1/h)(xj − x)[yj−1 − (h²/6) Mj−1]
     + (1/h)(x − xj−1)[yj − (h²/6) Mj],   (4.30)

and

Mj−1 + 4Mj + Mj+1 = (6/h²)(yj−1 − 2yj + yj+1),   j = 1, 2, ···, n − 1,   (4.31)

where s″(xj) = Mj.
Equation (4.31) gives a system of (n − 1) linear equations in the (n + 1) unknowns M0, M1, M2, ···, Mn. Two more conditions, called the end conditions, have to be prescribed to obtain (n + 1) equations in the (n + 1) unknowns M0, M1, M2, ···, Mn. Different types of cubic splines are obtained when different end conditions are supplied.

1. We may assume S″(x) to vanish at the end points, i.e. M0 = 0 and Mn = 0; the resulting spline is called a natural cubic spline.
2. We may also assume S″(x) to be constant near the end points, i.e. M0 = M1 and Mn = Mn−1.
3. We can impose the first derivative condition at the end points, i.e. prescribe S′(x0) and S′(xn).
4. Similarly, we can impose the second derivative condition at the end points, i.e. prescribe S″(x0) and S″(xn).
Example 4.18
For the data

x  1  2  3  4
y  1  2  5  11

find the natural cubic spline and evaluate y(1.5) and y′(3).

Solution:
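The following Python sketch (our own code, assuming NumPy is available) assembles the system (4.31) with the natural end conditions M0 = Mn = 0 and evaluates (4.30) for this data; it gives y(1.5) = 1.375 and y′(3) ≈ 4.67.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 5.0, 11.0])
n, h = len(x) - 1, x[1] - x[0]

# Interior equations  M_{j-1} + 4 M_j + M_{j+1} = 6/h^2 (y_{j-1} - 2 y_j + y_{j+1})
A = np.zeros((n + 1, n + 1))
b = np.zeros(n + 1)
A[0, 0] = A[n, n] = 1.0                      # natural end conditions M_0 = M_n = 0
for j in range(1, n):
    A[j, j - 1:j + 2] = [1.0, 4.0, 1.0]
    b[j] = 6.0 / h**2 * (y[j - 1] - 2 * y[j] + y[j + 1])
M = np.linalg.solve(A, b)                    # for this data M = [0, 2, 4, 0]

def s(t, j):
    """Spline piece (4.30) on [x_{j-1}, x_j]."""
    return (((x[j] - t) ** 3 * M[j - 1] + (t - x[j - 1]) ** 3 * M[j]) / (6 * h)
            + (x[j] - t) / h * (y[j - 1] - h**2 / 6 * M[j - 1])
            + (t - x[j - 1]) / h * (y[j] - h**2 / 6 * M[j]))

print(s(1.5, 1))                              # y(1.5) = 1.375
eps = 1e-6
print((s(3.0, 3) - s(3.0 - eps, 3)) / eps)    # y'(3) ~ 4.667 (numerical derivative)
```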
5.1 Differentiation
Differentiation is a very important mathematical process, and a great deal of effort has been devoted to the development of analytic techniques for finding the derivatives of various mathematical functions. It often occurs, however, that it is not possible to use these traditional methods. This happens when:
(i) The function is known but is too complicated to differentiate analytically, or
(ii) The function is unknown (when data are collected from some experiment).
In this section, numerical techniques are described which provide an estimate of the derivative of a tabulated function. The main idea of numerical differentiation is the following: construct an appropriate interpolation polynomial from the given set of values of x and y and then differentiate it at the required value of x.

As in interpolation, many formulae are available for differentiation. Based on the given set of values, different types of formulae can be constructed. The common formulae are based on Lagrange's and Newton's interpolation formulae. These formulae are discussed in this section.
We know that the Newton’s forward and backward interpolation formulae are applicable
only when the arguments are in equispaced. So, we assumed that the given arguments
are equispaced.
k(k − 1) 2 k(k − 1) · · · (k − n + 1) n
Pn (x) = y0 + k 4 y0 + 4 y0 + · · · + 4 y0
2! n!
k2 − k 2 k 3 − 3k 2 + 2k 3
= y0 + k 4 y0 + 4 y0 + 4 y0
2! 3!
k 4 − 6k 3 + 11k 2 − 6k 4 k 5 − 10k 4 + 35k 3 − 50k 2 + 24k 5
+ 4 y0 + 4 y0 + · · ·
4! 5!
(5.1)
Differentiating Equation (5.1) with respect to x successively, and using dk/dx = 1/h, we obtain

Pn′(x) = (1/h)[∆y0 + ((2k − 1)/2!) ∆²y0 + ((3k² − 6k + 2)/3!) ∆³y0 + ((4k³ − 18k² + 22k − 6)/4!) ∆⁴y0
        + ((5k⁴ − 40k³ + 105k² − 100k + 24)/5!) ∆⁵y0 + ···]   (5.3)

Pn″(x) = (1/h²)[∆²y0 + ((6k − 6)/3!) ∆³y0 + ((12k² − 36k + 22)/4!) ∆⁴y0
        + ((20k³ − 120k² + 210k − 100)/5!) ∆⁵y0 + ···]   (5.4)

Pn‴(x) = (1/h³)[∆³y0 + ((24k − 36)/4!) ∆⁴y0 + ((60k² − 240k + 210)/5!) ∆⁵y0 + ···]   (5.5)

and so on.
In this way we can find all other derivatives. It may be noted that ∆y0, ∆²y0, ∆³y0, ··· are constants. The above three formulae give the first three (approximate) derivatives of f(x) at any argument x, where x = x0 + kh. The formulae become particularly simple when x = x0, i.e. k = 0. In that case,
Pn′(x0) = (1/h)[∆y0 − (1/2) ∆²y0 + (1/3) ∆³y0 − (1/4) ∆⁴y0 + (1/5) ∆⁵y0 − ···]   (5.6a)

Pn″(x0) = (1/h²)[∆²y0 − ∆³y0 + (11/12) ∆⁴y0 − (5/6) ∆⁵y0 + ···]   (5.6b)

Pn‴(x0) = (1/h³)[∆³y0 − (3/2) ∆⁴y0 + (7/4) ∆⁵y0 − ···]   (5.6c)
The error of the interpolation formula (5.1) is

En(x) = k(k − 1)(k − 2)···(k − n) h^(n+1) f^(n+1)(ξ)/(n + 1)!.

Differentiating En(x) with respect to x gives

En′(x) = (h^n f^(n+1)(ξ)/(n + 1)!) d/dk[k(k − 1)···(k − n)] + (h^(n+1) k(k − 1)···(k − n)/(n + 1)!) f^(n+2)(ξ1),

where ξ and ξ1 are two quantities that depend on x and min{x, x0, ···, xn} < ξ, ξ1 < max{x, x0, ···, xn}. In particular, at x = x0 (i.e. k = 0) the second term vanishes and

En′(x0) = (h^n f^(n+1)(ξ)/(n + 1)!) d/dk[k(k − 1)···(k − n)]_{k=0} + 0
        = h^n (−1)^n n! f^(n+1)(ξ)/(n + 1)!     (since d/dk[k(k − 1)···(k − n)]_{k=0} = (−1)^n n!)
        = (−1)^n h^n f^(n+1)(ξ)/(n + 1),

where ξ lies between min{x, x0, ···, xn} and max{x, x0, ···, xn}.
Example 5.1
Find the values of dy/dx and d²y/dx² at x = 1, and the value of dy/dx at x = 1.2, from the tabulated values of y = f(x) given below.

Solution: The forward difference table is
x     y        ∆y      ∆²y     ∆³y     ∆⁴y     ∆⁵y
1.0   1.234
               3.425
1.5   2.453            3.095
               6.520            0.36
2.0   7.625            3.455            0.71
               9.975            1.07            −0.680
2.5   12.321           4.525            0.03
               14.500           1.10
3.0   18.892           5.625
               20.125
3.5   23.327
Here h = 0.5 and x0 = 1. Using (5.6a),

y′(1) ≈ (1/h)[∆y0 − (1/2) ∆²y0 + (1/3) ∆³y0 − (1/4) ∆⁴y0 + (1/5) ∆⁵y0]
     = (1/0.5)[3.425 − (1/2)(3.095) + (1/3)(0.36) − (1/4)(0.71) + (1/5)(−0.680)]
     = 3.36800.

Using (5.6b),

y″(1) ≈ (1/h²)[∆²y0 − ∆³y0 + (11/12) ∆⁴y0 − (5/6) ∆⁵y0]
     = (1/0.5²)[3.095 − 0.36 + (11/12)(0.71) − (5/6)(−0.680)]
     = 15.8100.
Now, at x = 1.2, h = 0.5 and k = (x − x0)/h = (1.2 − 1)/0.5 = 0.4. Therefore, using Equation (5.3), we have

y′(1.2) ≈ (1/0.5)[∆y0 + ((2k − 1)/2!) ∆²y0 + ((3k² − 6k + 2)/3!) ∆³y0 + ((4k³ − 18k² + 22k − 6)/4!) ∆⁴y0
          + ((5k⁴ − 40k³ + 105k² − 100k + 24)/5!) ∆⁵y0]
        = (1/0.5)[3.425 + ((2 × 0.4 − 1)/2!)(3.095) + ((3(0.4)² − 6(0.4) + 2)/3!)(0.36)
          + ((4(0.4)³ − 18(0.4)² + 22(0.4) − 6)/4!)(0.71)
          + ((5(0.4)⁴ − 40(0.4)³ + 105(0.4)² − 100(0.4) + 24)/5!)(−0.68)]
        = 6.26948.
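The computations of Example 5.1 can be reproduced with a short Python sketch (ours) that applies (5.6a), (5.6b) and (5.3) to the leading forward differences read from the table.

```python
h = 0.5
# Leading forward differences taken from the difference table above.
d = [3.425, 3.095, 0.36, 0.71, -0.680]     # Delta y0, Delta^2 y0, ..., Delta^5 y0

y1_at_x0 = (d[0] - d[1] / 2 + d[2] / 3 - d[3] / 4 + d[4] / 5) / h          # (5.6a)
y2_at_x0 = (d[1] - d[2] + 11 / 12 * d[3] - 5 / 6 * d[4]) / h**2            # (5.6b)

k = (1.2 - 1.0) / h                                                        # k = 0.4
y1_at_12 = (d[0] + (2 * k - 1) / 2 * d[1] + (3 * k**2 - 6 * k + 2) / 6 * d[2]
            + (4 * k**3 - 18 * k**2 + 22 * k - 6) / 24 * d[3]
            + (5 * k**4 - 40 * k**3 + 105 * k**2 - 100 * k + 24) / 120 * d[4]) / h   # (5.3)

print(y1_at_x0, y2_at_x0, y1_at_12)   # ~3.368, ~15.81, ~6.2695
```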
Like Newton’s forward differentiation formula one can derive Newton’s backward dif-
ferentiation formula based on Newton’s backward interpolation formula.
Suppose the function y = f (x) is not know explicitly, but it is known at (n+1) arguments
x0 , x1 , · · · , xn . That is, yi = f (xi ), i = 0, 1, 2, · · · , n are given. Since the Newton’s
backward interpolation formula is applicable only when the arguments are equispaced,
x − xn
therefore, xi = x0 + ih, i = 0, 1, 2, · · · , n and k = .
h
Differentiating the backward interpolation formula with respect to x successively, the formulae for derivatives of different order are obtained as

Pn′(x) = (1/h)[∇yn + ((2k + 1)/2!) ∇²yn + ((3k² + 6k + 2)/3!) ∇³yn + ((4k³ + 18k² + 22k + 6)/4!) ∇⁴yn
        + ((5k⁴ + 40k³ + 105k² + 100k + 24)/5!) ∇⁵yn + ···]   (5.7)

Pn″(x) = (1/h²)[∇²yn + ((6k + 6)/3!) ∇³yn + ((12k² + 36k + 22)/4!) ∇⁴yn
        + ((20k³ + 120k² + 210k + 100)/5!) ∇⁵yn + ···]   (5.8)

Pn‴(x) = (1/h³)[∇³yn + ((24k + 36)/4!) ∇⁴yn + ((60k² + 240k + 210)/5!) ∇⁵yn + ···]   (5.9)

and so on.
The above formulae give approximate values of dy/dx, d²y/dx², d³y/dx³, and so on, at any value of x with min{x0, ···, xn} ≤ x ≤ max{x0, ···, xn}. When x = xn we have k = 0, and in this particular case the above formulae reduce to the following form.
Pn′(xn) = (1/h)[∇yn + (1/2) ∇²yn + (1/3) ∇³yn + (1/4) ∇⁴yn + (1/5) ∇⁵yn + ···]   (5.10a)

Pn″(xn) = (1/h²)[∇²yn + ∇³yn + (11/12) ∇⁴yn + (5/6) ∇⁵yn + ···]   (5.10b)

Pn‴(xn) = (1/h³)[∇³yn + (3/2) ∇⁴yn + (7/4) ∇⁵yn + ···]   (5.10c)
The error can be calculated by differentiating the error in Newton's backward interpolation formula. That error is

En(x) = k(k + 1)(k + 2)···(k + n) h^(n+1) f^(n+1)(ξ)/(n + 1)!,

where k = (x − xn)/h and ξ lies between min{x, x0, ···, xn} and max{x, x0, ···, xn}. Differentiating En(x), we get

En′(x) = (h^n f^(n+1)(ξ)/(n + 1)!) d/dk[k(k + 1)(k + 2)···(k + n)]
        + (h^(n+1) k(k + 1)(k + 2)···(k + n)/(n + 1)!) f^(n+2)(ξ1),   (5.11)
where ξ and ξ1 are two quantities that depend on x and min{x, x0, ···, xn} < ξ, ξ1 < max{x, x0, ···, xn}. This expression gives the error in the derivative at any argument x. In particular, when x = xn, i.e. when k = 0, then

En′(xn) = (h^n f^(n+1)(ξ)/(n + 1)!) d/dk[k(k + 1)···(k + n)]_{k=0} + 0
        = h^n n! f^(n+1)(ξ)/(n + 1)!     (since d/dk[k(k + 1)···(k + n)]_{k=0} = n!)
        = h^n f^(n+1)(ξ)/(n + 1).
Example 5.2
A slider in a machine moves along a fixed straight rod. Its distance x (in cm) along the rod is given in the following table for various values of the time t (in seconds). Find the velocity and the acceleration of the slider at t = 8 seconds.

t (sec)  0    2    4    6     8
x (cm)   20   50   80   120   180

Solution: The backward difference table is

t    x      ∇x     ∇²x    ∇³x   ∇⁴x
0    20
            30
2    50            0
            30            10
4    80            10           0
            40            10
6    120           20
            60
8    180
Here h = 2 and n = 4. Using (5.10a) and (5.10b),

v(8) = dx/dt at t = 8 ≈ (1/h)[∇y4 + (1/2) ∇²y4 + (1/3) ∇³y4 + (1/4) ∇⁴y4 + ···]
     = (1/2)[60 + (1/2)(20) + (1/3)(10) + 0]
     = (1/2)(70 + 10/3)
     = 36.66667 cm/sec,

a(8) = d²x/dt² at t = 8 ≈ (1/h²)[∇²y4 + ∇³y4 + (11/12) ∇⁴y4 + ···]
     = (1/2²)[20 + 10 + 0]
     = 7.50 cm/sec².
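A Python sketch of the same computation (ours, using NumPy's diff to form the backward differences):

```python
import numpy as np

t = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
x = np.array([20.0, 50.0, 80.0, 120.0, 180.0])
h = t[1] - t[0]

# Backward differences of the last entry: last element of each difference column.
nabla = [np.diff(x, k)[-1] for k in range(1, len(x))]       # [60, 20, 10, 0]

v8 = (nabla[0] + nabla[1] / 2 + nabla[2] / 3 + nabla[3] / 4) / h   # (5.10a)
a8 = (nabla[1] + nabla[2] + 11 / 12 * nabla[3]) / h**2             # (5.10b)
print(v8, a8)   # ~36.667, 7.5
```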
The choice of differentiation formula is the same as the choice of interpolation formula. That is, if the given argument is at the beginning of the table, then Newton's forward differentiation formula is used; when the given argument is at the end of the table, Newton's backward differentiation formula is used. Lagrange's differentiation formula can be used for any argument.
Integration is a very common and fundamental tool of integral calculus. However, integration is not easy for all kinds of functions, even when the function is known completely. Moreover, in many real-life problems only a set of values of x and y is available, and we have to integrate such a tabulated function. For these situations, separate methods have been developed, known as numerical integration or quadrature.

Given a set of points (x0, y0), (x1, y1), ···, (xn, yn) of a function y = f(x), the problem is to find the value of the definite integral

I = ∫_{x0}^{xn} f(x) dx.

The function f(x) is first replaced by an interpolating polynomial Pn(x), and the approximate value of the definite integral is then evaluated by the formula

∫_{x0}^{xn} f(x) dx ≈ ∫_{x0}^{xn} Pn(x) dx.
Let f(x) be an unknown function whose numerical values are given at (n + 1) equidistant points xi in the interval [a, b], where xi = x0 + ih, i = 0, 1, ···, n, with a = x0 and b = xn, i.e. b − a = nh. Then

∫_a^b f(x) dx ≈ ∫_{x0}^{xn} Pn(x) dx = h ∫_0^n Pn(p) dp,   where x = x0 + ph and dx = h dp,

which gives

∫_a^b f(x) dx ≈ h[n f0 + (n²/2) ∆f0 + (1/2!)(n³/3 − n²/2) ∆²f0 + ··· + last term].   (5.12)
From this formula one can derive many simple formulae for different values of n =
1, 2, 3, · · · . Some particular cases are discussed below.
One of the simplest quadrature formulas is the trapezoidal formula. To obtain it, we substitute n = 1 into Equation (5.12). With only two points we can fit a straight line, i.e. only first order differences are available, so neglecting second and higher order differences in Equation (5.12) we get

∫_a^b f(x) dx ≈ h[f0 + (1/2) ∆f0] = h[f0 + (1/2)(f1 − f0)] = (h/2)(f0 + f1).

Hence

∫_a^b f(x) dx ≈ (h/2)(f0 + f1).   (5.13)
Note that the formula is very simple, but it gives only a rough approximation of the integral. So, if the interval [a, b] is divided into a number of subintervals and the formula is applied to each of them, a much better approximation is obtained. The resulting formula is known as the composite trapezoidal formula and is derived below. Let [a, b] be divided into n equal subintervals by the points a = x0 < x1 < ··· < xn = b, with xi = x0 + ih and h = (b − a)/n. Applying the trapezoidal formula to each subinterval, we obtain the composite formula as follows:
∫_a^b f(x) dx = ∫_{x0}^{x1} f(x) dx + ∫_{x1}^{x2} f(x) dx + ··· + ∫_{xn−1}^{xn} f(x) dx
             ≈ (h/2)(f0 + f1) + (h/2)(f1 + f2) + ··· + (h/2)(fn−1 + fn)
             = (h/2)[f0 + 2(f1 + f2 + ··· + fn−1) + fn].

That is,

∫_a^b f(x) dx ≈ (h/2)[f0 + 2 ∑_{j=1}^{n−1} fj + fn].   (5.14)
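A minimal Python sketch of the composite trapezoidal rule (5.14); the function name trapezoid is our own choice, and the test integral anticipates Example 5.5 below.

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal rule (5.14) with n equal subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for j in range(1, n):
        total += f(a + j * h)
    return h * total

# Integrate 1/(1+x^2) on [0, 6] with 6 subintervals.
print(trapezoid(lambda x: 1 / (1 + x * x), 0.0, 6.0, 6))   # ~1.4108
```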
Example 5.3
Since the trapezoidal rule is a numerical formula, it has an error. On a single subinterval [xi, xi+1] the error can be shown to be

ELi = −(h³/12) f″(ξi),   xi ≤ ξi ≤ xi+1,

i.e., the local error of the trapezoidal rule is O(h³). The global error of the composite trapezoidal rule is the sum of the n local errors:

∑_{i=0}^{n−1} ELi = −(h³/12)[f″(ξ1) + f″(ξ2) + ··· + f″(ξn)],   xi ≤ ξi ≤ xi+1, i = 0, 1, 2, ···, n − 1.

If we assume that f″(x) is continuous on (a, b), then there exists some value ξ in (a, b) such that ∑_{i=1}^{n} f″(ξi) = n f″(ξ). Hence

ET = −(h³/12) n f″(ξ) = −((b − a)/12) h² f″(ξ) = O(h²),   since nh = b − a.   (5.15)
Note: The error term in the trapezoidal formula indicates that if the second and higher order derivatives of the function f(x) vanish, then the trapezoidal formula gives the exact result. That is, the trapezoidal formula is exact when the integrand is linear.

In the trapezoidal formula the integrand y = f(x) is replaced on [xi, xi+1] by the straight line AB joining the points (xi, yi) and (xi+1, yi+1) (see Figure 5.1). The area bounded by the curve y = f(x), the ordinates x = xi and x = xi+1 and the x-axis is then approximated by the area of the trapezium bounded by the straight line AB, the straight lines x = xi and x = xi+1, and the x-axis. That is, the value of the integral ∫_{xi}^{xi+1} f(x) dx obtained by the trapezoidal formula is nothing but the area of this trapezium.
Substituting n = 2 in the formula (5.12) and, as for n = 1, neglecting third and higher order differences in (5.12), we get

∫_a^b f(x) dx ≈ h[2f0 + (2²/2) ∆f0 + (1/2!)(2³/3 − 2²/2) ∆²f0]
             = h[2f0 + 2(f1 − f0) + (1/3)(f2 − 2f1 + f0)]
             = (h/3)[f0 + 4f1 + f2].

Hence

∫_a^b f(x) dx ≈ (h/3)[f0 + 4f1 + f2].   (5.16)
In the above formula the interval of integration [a, b] is divided into two subintervals. Now we divide [a, b] into n (an even number) equal subintervals by the arguments x0, x1, x2, ···, xn, where xi = x0 + ih, i = 1, 2, ···, n. Then

∫_a^b f(x) dx = ∫_{x0}^{x2} f(x) dx + ∫_{x2}^{x4} f(x) dx + ··· + ∫_{xn−2}^{xn} f(x) dx
             ≈ (h/3)(f0 + 4f1 + f2) + (h/3)(f2 + 4f3 + f4) + ··· + (h/3)(fn−2 + 4fn−1 + fn)
             = (h/3)[f0 + 4(f1 + f3 + ··· + fn−1) + 2(f2 + f4 + ··· + fn−2) + fn]
             = (h/3)[f0 + 4(sum of fi with odd subscripts) + 2(sum of interior fi with even subscripts) + fn].   (5.17)

That is,

∫_a^b f(x) dx ≈ (h/3)[f0 + 4 ∑_{j=1}^{n/2} f2j−1 + 2 ∑_{j=1}^{n/2−1} f2j + fn].   (5.18)
Note: Simpson’s 1/3 -rule requires the division of the whole range into an even number
of subintervals of width h.
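A minimal Python sketch of the composite Simpson's 1/3 rule (5.18) (ours; the test integral ∫_0^1 e^x dx is only an illustration):

```python
import math

def simpson13(f, a, b, n):
    """Composite Simpson's 1/3 rule (5.18); n must be even."""
    if n % 2 != 0:
        raise ValueError("n must be even for Simpson's 1/3 rule")
    h = (b - a) / n
    total = f(a) + f(b)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * f(a + j * h)
    return h / 3 * total

print(simpson13(math.exp, 0.0, 1.0, 4))   # ~1.718319, exact value e - 1 = 1.718282
```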
The local error of Simpson's 1/3 rule on the interval [x0, x2] is given by

EL ≈ −(h⁵/90) f⁽⁴⁾(ξ),   x0 < ξ < x2,

and the global error of the composite Simpson's 1/3 rule is given by

ET = −((b − a)/180) h⁴ f⁽⁴⁾(ξ),   where a < ξ < b.   (5.19)
Example 5.4
Estimate the error in approximating ∫_0^1 e^(−x) dx by the composite Simpson's 1/3 rule with h = 1/4.

Solution: We need an upper bound for the absolute error, using the general error term for the composite Simpson's rule,

E = −((b − a)/180) h⁴ f⁽⁴⁾(ξ),

where a < ξ < b. Taking absolute values, and inserting a = 0, b = 1 and h = 1/4, the absolute error is

|E| = (1/(180 × 4⁴)) |f⁽⁴⁾(ξ)|.

Since f⁽⁴⁾(x) = e^(−x) is a decreasing positive function, we have the bound |f⁽⁴⁾(ξ)| ≤ e⁰ = 1. Therefore the error bound is

|E| ≤ 1/46080 ≈ 2.2 × 10⁻⁵.

To verify that the bound holds here, we compute the exact value of the integral,

∫_0^1 e^(−x) dx = [−e^(−x)]_0^1 = 1 − e^(−1) = 0.6321206,

so the absolute error of the Simpson approximation is within our bound; in fact the bound is quite close to the actual absolute error in this case.
Exercise 5.1
Evaluate ∫_1^3 (x + 1) e^(x²) dx taking 10 intervals, by (i) the Trapezoidal rule and (ii) Simpson's 1/3 rule. Ans. (i) 6149.2217, (ii) 5557.9445.
We put n = 3 in Equation (5.12); with the four points (x0, y0), (x1, y1), (x2, y2), (x3, y3), all forward differences of order higher than three in Equation (5.12) are zero. Hence we obtain

∫_a^b f(x) dx ≈ h[3f0 + (3²/2) ∆f0 + (1/2!)(3³/3 − 3²/2) ∆²f0 + (1/3!)(3⁴/4 − 3³ + 3²) ∆³f0]
             = h[3f0 + (9/2) ∆f0 + (9/4) ∆²f0 + (3/8) ∆³f0]
             = (3h/8)[8f0 + 12 ∆f0 + 6 ∆²f0 + ∆³f0]
             = (3h/8)[8f0 + 12(f1 − f0) + 6(f2 − 2f1 + f0) + (f3 − 3f2 + 3f1 − f0)]
             = (3h/8)[f0 + 3f1 + 3f2 + f3].

Hence

∫_a^b f(x) dx ≈ (3h/8)[f0 + 3f1 + 3f2 + f3].   (5.20)
In the above formula the interval of integration [a, b] is divided into three subintervals. Now we divide [a, b] into n (a multiple of three) equal subintervals by the arguments x0, x1, x2, ···, xn, where xi = x0 + ih, i = 1, 2, ···, n. Then

∫_a^b f(x) dx = ∫_{x0}^{x3} f(x) dx + ∫_{x3}^{x6} f(x) dx + ··· + ∫_{xn−3}^{xn} f(x) dx
             ≈ (3h/8)(f0 + 3f1 + 3f2 + f3) + (3h/8)(f3 + 3f4 + 3f5 + f6) + ···
               + (3h/8)(fn−3 + 3fn−2 + 3fn−1 + fn)
             = (3h/8)[f0 + 3(f1 + f2 + f4 + f5 + ··· + fn−1) + 2(f3 + f6 + f9 + ··· + fn−3) + fn].

That is,

∫_a^b f(x) dx ≈ (3h/8)[f0 + 3(f1 + f2 + f4 + f5 + ··· + fn−1) + 2(f3 + f6 + ··· + fn−3) + fn].   (5.21)
Example 5.5
Evaluate ∫_0^6 1/(1 + x²) dx taking six subintervals, by (i) the Trapezoidal rule.

Solution: Since we have six subintervals, i.e. n = 6, the step size is

h = (6 − 0)/6 = 1.

As a result we obtain the values of the function f(x) = 1/(1 + x²) at the nodal points:

xi      0   1     2     3     4       5       6
f(xi)   1   0.5   0.2   0.1   0.0588  0.0385  0.027

(i) Trapezoidal rule:

∫_0^6 1/(1 + x²) dx ≈ (h/2)[f0 + 2(f1 + f2 + f3 + f4 + f5) + f6]
                    ≈ (1/2)[1 + 2(0.5 + 0.2 + 0.1 + 0.0588 + 0.0385) + 0.027]
                    ≈ 1.4108.
A more illuminating explanation of why Simpson’s rule is “more accurate than it ought
to be” can be had by looking at the extent to which it integrates polynomials exactly.
This leads us to the notion of degree of precision for a quadrature rule.
Example 5.6
Simpson's 1/3 rule on [0, 2] (with h = 1) integrates 1, x, x² and x³ exactly, but

∫_0^2 x⁴ dx = 32/5 ≠ (1/3)(0 + 4 · 1 + 16) = 20/3.

Therefore, the degree of precision of Simpson's 1/3 rule is 3.
Given a set of data points (x1 , y1 ), (x2 , y2 ), · · · , (xm , ym ), a normal and useful practice in
many applications in statistics, engineering and other applied sciences is to construct a
curve that is considered to be the “best fit” for the data, in some sense. So far, we have
discussed two data-fitting techniques, polynomial interpolation and piecewise polynomial
interpolation. Interpolation techniques, of any kind, construct functions that agree ex-
actly with the data. That is, given points (x1 , y1 ), (x2 , y2 ), · · · , (xm , ym ), interpolation
yields a function f (x) such that f (xi ) = yi for i = 1, 2, · · · , m.
However, fitting the data exactly may not be the best approach to describing the data
with a function. We have seen that high-degree polynomial interpolation can yield os-
cillatory functions that behave very differently than a smooth function from which the
data is obtained. Also, it may be pointless to try to fit data exactly, for if it is obtained
by previous measurements or other computations, it may be erroneous. Therefore, we
consider revising our notion of what constitutes a “best fit” of given data by a function.
One alternative approach to data fitting is to solve the minimax problem, which is the problem of finding a function f(x) of a given form for which

max_{1≤i≤m} |f(xi) − yi|

is minimized.
Another approach is to minimize the total absolute deviation of f(x) from the data. That is, we seek a function f(x) of a given form for which

∑_{i=1}^{m} |f(xi) − yi|

is minimized. However, the absolute value is not differentiable, which makes this minimization problem hard to work with analytically. This defect is overcome by considering the problem of finding f(x) of a given form for which

∑_{i=1}^{m} [f(xi) − yi]²

is minimized. This is known as the least squares problem. In summary, the least squares problem is the following: given the data (x1, y1), (x2, y2), ···, (xm, ym), find a polynomial

Pn(x) = a0 + a1x + a2x² + ··· + an xⁿ   (n ≤ m),

such that the error E(a0, a1, a2, ···, an) in the least-squares sense is minimized; that is,

E(a0, a1, a2, ···, an) = ∑_{i=1}^{m} [a0 + a1xi + a2xi² + ··· + an xiⁿ − yi]²

is minimum. Here E(a0, a1, a2, ···, an) is a function of the (n + 1) variables a0, a1, a2, ···, an.
We will first show how this problem is solved for the case where f (x) is a linear function
of the form f (x) = a1 x + a0 , and then generalize this solution to other types of functions.
When f (x) is linear, the least squares problem is the problem of finding constants a0 and
a1 such that the function
E(a0, a1) = ∑_{i=1}^{m} [yi − (a0 + a1xi)]²

is minimum. In order to minimize this function of a0 and a1, we compute its partial derivatives with respect to a0 and a1:

∂E/∂a0 = −2 ∑_{i=1}^{m} [yi − (a0 + a1xi)],    ∂E/∂a1 = −2 ∑_{i=1}^{m} [yi − (a0 + a1xi)] xi.

At a minimum, both of these partial derivatives must be equal to zero. This yields the system of linear equations (the normal equations)

m a0 + (∑_{i=1}^{m} xi) a1 = ∑_{i=1}^{m} yi,
(∑_{i=1}^{m} xi) a0 + (∑_{i=1}^{m} xi²) a1 = ∑_{i=1}^{m} xi yi.
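These two normal equations can be solved directly. The short Python sketch below is ours, and the data in the usage line are made up purely for illustration.

```python
def least_squares_line(xs, ys):
    """Solve the normal equations for the best-fit line y = a0 + a1*x."""
    m = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)   # slope
    a0 = (sy - a1 * sx) / m                          # intercept
    return a0, a1

print(least_squares_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))   # (0.15, 1.94)
```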
Example 6.1
We wish to find the linear function y = a1 x + a0 that best approximates the data
shown in the following table, in the least-squares sense.
i xi yi
1 2.0774 3.3123
2 2.0774 3.8982
3 3.0125 4.6500
4 4.7092 6.5576
5 5.5016 7.5173
6 5.8704 7.0415
7 6.2248 7.7497
8 8.4431 11.0451
9 8.7594 9.8179
10 9.3900 12.2477
We conclude that the linear function that best fits this data in the least-squares
sense is
y = 1.1044 + 1.1667x.
The data, and this function, are shown in Figure 6.1 below.
Figure 6.1: Data points (xi , yi ) (circles) and least-squares line (solid line)
In the last lecture, we learned how to compute the coefficients of a linear function that
best fit given data, in a least-squares sense. We now consider the problem of finding a
polynomial of degree n or exponential function that gives the best least-squares fit.
As before, let (x1 , y1 ), (x2 , y2 ), · · · , (xm , ym ) be given data points that need to be approx-
imated by a polynomial of degree n. We assume that n < m − 1, for otherwise, we can
use polynomial interpolation to fit the points exactly.
Our goal is to minimize the sum of the squares of the deviations of Pn(x) from each y-value,

E(a) = ∑_{i=1}^{m} [Pn(xi) − yi]² = ∑_{i=1}^{m} ( ∑_{j=0}^{n} aj xi^j − yi )².

A necessary condition for a minimum is

∂E/∂aj = 0,   j = 0, 1, ···, n.
Set

sk = ∑_{i=1}^{m} xi^k,   k = 0, 1, ···, 2n,
bk = ∑_{i=1}^{m} xi^k yi,   k = 0, 1, ···, n.

Then the conditions ∂E/∂aj = 0 give the normal equations

s0 a0 + s1 a1 + ··· + sn an = b0
s1 a0 + s2 a1 + ··· + sn+1 an = b1
  ⋮                                                     (6.1)
sn a0 + sn+1 a1 + ··· + s2n an = bn,

or

S a = b,   (6.3)

where

S = [ s0    s1    ···  sn
      s1    s2    ···  sn+1
      ⋮     ⋮     ⋱    ⋮
      sn    sn+1  ···  s2n ],    a = [a0, a1, ···, an]^T,    b = [b0, b1, ···, bn]^T.
Define the m × (n + 1) matrix

V = [ 1  x1  x1²  ···  x1ⁿ
      1  x2  x2²  ···  x2ⁿ
      1  x3  x3²  ···  x3ⁿ
      ⋮  ⋮   ⋮    ⋱    ⋮
      1  xm  xm²  ···  xmⁿ ].

Then S = V^T V and b = V^T y, where y = [y1, ···, ym]^T, so the normal equations (6.3) can be written as

V^T V a = b.   (6.4)

The matrix V is known as the Vandermonde matrix, and it has full rank if the xi's are distinct. In this case the matrix S = V^T V is symmetric and positive definite [Exercise], and is therefore nonsingular. Thus, if the xi's are distinct, equation (6.3) has a unique solution.
Let (x1 , y1 ), (x2 , y2 ), · · · , (xm , ym ) be m distinct points. Then the discrete least-
square approximation problem has a unique solution.
Example 6.2
We wish to find the quadratic polynomial y = a2x² + a1x + a0 that best approximates the same data as in Example 6.1, in the least-squares sense.

i xi yi
1 2.0774 3.3123
2 2.0774 3.8982
3 3.0125 4.6500
4 4.7092 6.5576
5 5.5016 7.5173
6 5.8704 7.0415
7 6.2248 7.7497
8 8.4431 11.0451
9 8.7594 9.8179
10 9.3900 12.2477

Solution: We first compute the sums sk = ∑ xi^k (k = 0, ···, 4) and bk = ∑ xi^k yi (k = 0, 1, 2), solve the normal equations (6.3), and conclude that the quadratic function that best fits this data in the least-squares sense is

y = 0.4251x² − 1.5193x + 4.7681.
Figure 6.2: Data points (xi , yi ) (circles) and least-squares line (solid line)
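For a general degree n, the normal equations (6.4) can be assembled and solved with a few lines of NumPy. The sketch below is ours, and the sample data are invented purely for illustration.

```python
import numpy as np

def poly_least_squares(xs, ys, n):
    """Degree-n least-squares fit via the normal equations (6.4): V^T V a = V^T y."""
    V = np.vander(np.asarray(xs, dtype=float), n + 1, increasing=True)  # columns 1, x, x^2, ...
    S = V.T @ V
    b = V.T @ np.asarray(ys, dtype=float)
    return np.linalg.solve(S, b)             # coefficients a0, a1, ..., an

# Illustrative use on made-up data (not the table above):
a = poly_least_squares([0, 1, 2, 3, 4], [1.1, 1.9, 5.2, 10.1, 16.8], 2)
print(a)   # roughly [1, 0, 1], i.e. y is close to 1 + x^2
```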
Least-squares fitting can also be used to fit data with functions that are not linear combinations of functions such as polynomials. Suppose we believe that given data points can best be matched to an exponential function of the form y = b e^(ax), where the constants a and b are unknown. Taking the natural logarithm of both sides of this equation yields

ln y = ln b + ax.

If we define z = ln y and c = ln b, then the problem of fitting the original data points {(xi, yi)}, i = 1, ···, m, with an exponential function is transformed into the problem of fitting the data points {(xi, zi)}, i = 1, ···, m, with a linear function of the form c + ax, for unknown constants a and c. Similarly, data can be fitted with a power function y = b x^a by taking logarithms of both sides, which gives

ln y = ln b + a ln x.
Example 6.3
We wish to find the exponential function y = b e^(ax) that best approximates the data shown in the following table, in the least-squares sense.

i xi yi
1 2.0774 1.4509
2 2.3049 2.8462
3 3.0125 2.1536
4 4.7092 4.7438
5 5.5016 7.7260

Solution: First we compute the required sums:

i     xi        yi       zi = ln yi   xi²       xi zi
1     2.0774    1.4509   0.3722       4.3156    0.7732
2     2.3049    2.8462   1.0460       5.3126    2.4109
3     3.0125    2.1536   0.7671       9.0752    2.3110
4     4.7092    4.7438   1.5568       22.1766   7.3315
5     5.5016    7.7260   2.0446       30.2676   11.2485
Sum   17.6056   18.9205  5.7867       71.1475   24.0751

Defining

S = [ 5        17.6056
      17.6056  71.1475 ],      b = [ 5.7867
                                     24.0751 ],

and solving the normal equations Sc = b, we obtain

c0 = −0.2653,   c1 = 0.4040,

so that b = e^(c0) = 0.7670 and a = c1 = 0.4040. Hence the exponential function that best fits this data in the least-squares sense is

y = 0.7670 e^(0.4040x).
Figure 6.3: Data points (xi , yi ) (circles) and least-squares line (solid line)
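A Python sketch (ours) of the log-transform fit applied to the data of Example 6.3; it reproduces a ≈ 0.404 and b ≈ 0.767.

```python
import math

# Fit y = b*exp(a*x) via the transform z = ln y = ln b + a*x.
xs = [2.0774, 2.3049, 3.0125, 4.7092, 5.5016]
ys = [1.4509, 2.8462, 2.1536, 4.7438, 7.7260]
zs = [math.log(y) for y in ys]

m = len(xs)
sx, sz = sum(xs), sum(zs)
sxx = sum(x * x for x in xs)
sxz = sum(x * z for x, z in zip(xs, zs))

a = (m * sxz - sx * sz) / (m * sxx - sx * sx)   # slope of the line -> exponent a
c = (sz - a * sx) / m                           # intercept -> ln b
print(a, math.exp(c))    # ~0.404 and ~0.767, i.e. y ~ 0.767 * exp(0.404 x)
```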
The least-squares idea can also be used to approximate a given function f(x) on an interval [a, b] by a polynomial Pn(x) = a0 + a1x + ··· + an xⁿ such that the integral of the square of the error is minimized. That is, we choose the coefficients so that

E(a0, a1, a2, ···, an) = ∫_a^b [f(x) − Pn(x)]² dx

is minimized. Since

E = ∫_a^b [f(x) − (a0 + a1x + a2x² + ··· + an xⁿ)]² dx,

we have

∂E/∂a0 = 0  ⟹  a0 ∫_a^b 1 dx + a1 ∫_a^b x dx + a2 ∫_a^b x² dx + ··· + an ∫_a^b xⁿ dx = ∫_a^b f(x) dx.

Similarly,

∂E/∂ai = 0  ⟹  a0 ∫_a^b x^i dx + a1 ∫_a^b x^(i+1) dx + a2 ∫_a^b x^(i+2) dx + ··· + an ∫_a^b x^(i+n) dx = ∫_a^b x^i f(x) dx,
i = 0, 1, 2, ···, n.
Denoting

si = ∫_a^b x^i dx,   i = 0, 1, 2, ···, 2n,   and   bi = ∫_a^b x^i f(x) dx,   i = 0, 1, 2, ···, n,

we obtain the system of normal equations

s0 a0 + s1 a1 + ··· + sn an = b0
s1 a0 + s2 a1 + ··· + sn+1 an = b1
  ⋮
sn a0 + sn+1 a1 + ··· + s2n an = bn,

or, in matrix notation,

S a = b,   (6.5)

where

S = [ s0    s1    ···  sn
      s1    s2    ···  sn+1
      ⋮     ⋮     ⋱    ⋮
      sn    sn+1  ···  s2n ],    a = [a0, a1, ···, an]^T,    b = [b0, b1, ···, bn]^T.
The solution of Equation (6.5) will yield the coefficients a0 , a1 , · · · , an of the least-squares
polynomial Pn (x).
Step 1: Compute s0, s1, ···, s2n:
for i = 0, 1, 2, ···, 2n do
    si = ∫_a^b x^i dx
end
Step 2: Compute b0, b1, ···, bn:
for i = 0, 1, 2, ···, n do
    bi = ∫_a^b x^i f(x) dx
end
Step 3: Form the matrix S from the numbers s0, s1, ···, s2n and the vector b from the numbers b0, b1, ···, bn, i.e.,

S = [ s0    s1    ···  sn
      s1    s2    ···  sn+1
      ⋮     ⋮     ⋱    ⋮
      sn    sn+1  ···  s2n ],      b = [b0, b1, ···, bn]^T.

Step 4: Solve the linear system S a = b for a = (a0, a1, ···, an)^T.
A Special Case:
Example 6.4
Find the linear and the quadratic least-squares polynomial approximations to f(x) = e^x on [−1, 1].

Solution: For the linear approximation, solving the normal equations gives

a0 = 1.1752,   a1 = 1.1037,

so P1(x) = 1.1752 + 1.1037x.

Relative Error:
For the quadratic approximation we need s0, ···, s4 and b0, b1, b2.

Step 1: Compute the si's:

s0 = 2,   s1 = 0,   s2 = 2/3,
s3 = ∫_{−1}^{1} x³ dx = [x⁴/4]_{−1}^{1} = 1/4 − 1/4 = 0,
s4 = ∫_{−1}^{1} x⁴ dx = [x⁵/5]_{−1}^{1} = 1/5 − (−1/5) = 2/5.

Step 2: Compute the bi's:

b0 = 2.3504,   b1 = 0.7358,
b2 = ∫_{−1}^{1} x² e^x dx = e − 5/e = 0.8789.

Steps 3 and 4: Solving the normal equations

2a0 + (2/3)a2 = 2.3504,   (2/3)a1 = 0.7358,   (2/3)a0 + (2/5)a2 = 0.8789,

gives

a0 = 0.9963,   a1 = 1.1037,   a2 = 0.5368.
Relative Error:
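The quadratic approximation of Example 6.4 can be checked numerically. The sketch below is ours and assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy.integrate import quad

# Continuous least-squares quadratic for f(x) = exp(x) on [-1, 1] via (6.5).
a, b = -1.0, 1.0
n = 2
s = [quad(lambda x, k=k: x**k, a, b)[0] for k in range(2 * n + 1)]
bvec = [quad(lambda x, k=k: x**k * np.exp(x), a, b)[0] for k in range(n + 1)]
S = np.array([[s[i + j] for j in range(n + 1)] for i in range(n + 1)])
coef = np.linalg.solve(S, np.array(bvec))
print(coef)    # approximately [0.9963, 1.1037, 0.5368]
```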
7.1 Introduction
A first order ordinary differential equation has the general form

y′(x) = f(x, y(x)).   (7.1)

A function y(x) is called a solution of (7.1) if:
i) y(x) is differentiable, and
ii) substitution of y(x) and y′(x) in (7.1) satisfies the differential equation identically.
At first we concentrate on the so-called first order Initial Value Problem (IVP). A first order differential equation together with a specified initial condition at x = x0 is written as

y′(x) = f(x, y(x))   with   y(x0) = y0.   (7.2)

There exist several methods for finding solutions of differential equations; however, not all differential equations can be solved analytically. The following well known theorem from the theory of differential equations establishes the existence and uniqueness of the solution of the IVP: if f(x, y) is continuous in a region containing (x0, y0) and satisfies a Lipschitz condition with respect to y there, then the IVP (7.2) has a unique solution in a neighbourhood of x0.

The theorem gives conditions on the function f(x, y) for existence and uniqueness of the solution, but the solution still has to be obtained by some method. It may not be possible to obtain an analytical solution (in closed form) of a given first order differential equation by known methods, even when the above theorem guarantees its existence, and sometimes it is very difficult to obtain the solution. In such cases, an approximate solution of the given differential equation can be obtained using numerical methods.
Discretization
The aim of this chapter is to devise numerical methods to obtain an approximate solution of the initial value problem (7.2) at a discrete set of points only. That is, if we are interested in obtaining the solution of (7.2) on an interval [a, b], then we first discretize the interval as

a = x0 < x1 < ··· < xN = b,   xi = x0 + ih,   i = 0, 1, ···, N,

for a sufficiently small positive real number h, called the stepsize. We use the notation

yi = yh(xi) ≈ y(xi),   i = 0, 1, ···, N,

for the approximate solution.
Basically, there are two different classes of methods in practical use. In both cases, let us assume that we somehow have found approximations yi ≈ y(xi) for i = 0, 1, ···, n, and that we want to find an approximation yn+1 ≈ y(xn+1), where xn+1 = xn + h.

1). One-step methods: Only yn is used to find the approximation yn+1. One-step methods usually require more than one function evaluation per step, and they can all be put in a general abstract form (see equation (7.8) below).
2). Multistep methods: Several of the previously computed values yn, yn−1, ··· are used to find yn+1.

In the next section a very basic one-step method, known as Euler's method, is discussed.
Euler’s method is the natural starting point for any discussion of numerical methods for
IVPs. Although it is not the most accurate of the methods we study, it is by far the
simplest, and much of what we learn from analyzing Euler’s method in detail carries over
to other methods without a lot of difficulty.
Euler’s Method assumes our solution is written in the form of a Taylor’s Series (??). This
gives us a reasonably good approximation if we take plenty of terms, and if the value of
h is reasonably small.
0h2 00
y(x + h) = y(x) + hy (x) + y (η), where x < η < x + h.
2!
Using the fact that y 0 = f (x, y(x)), we obtain a numerical scheme by truncating the
Taylor series after the second term.
Example 7.1
Consider the initial value problem

y′ = y − x,   y(0) = 1/2.

Use Euler's method (a) with h = 0.1 and (b) with h = 0.05 to obtain an approximation to y(1). The exact solution of the initial value problem is

y(x) = x + 1 − (1/2) e^x.
Assume that (xn, yn) is known. The exact solution y(xn+1), with xn+1 = xn + h, of equation (7.1) passing through this point is given by

y(xn + h) = yn + ∫_{xn}^{xn+h} y′(τ) dτ = yn + ∫_{xn}^{xn+h} f(τ, y(τ)) dτ.   (7.4)

The idea is to approximate the last integral. The simplest idea is to use f(τ, y(τ)) ≈ f(xn, yn), in which case we get the Euler method again:

yn+1 = yn + h f(xn, yn).
A better approximation to the integral is the midpoint rule, h f(xn + h/2, y(xn + h/2)), which requires the unknown value y(xn + h/2). By inserting a forward Euler step for this missing value,

y(xn + h/2) ≈ yn + (h/2) f(xn, yn),

we obtain the modified Euler (midpoint) method:

yn+1 = yn + k2,
k1 = hf(xn, yn),                                  (7.5)
k2 = hf(xn + h/2, yn + k1/2).
The numerical method can be made more accurate by using the trapezoidal rule instead of the rectangular rule to approximate the integral in Equation (7.4). This results in

yn+1 = yn + (h/2)(f(xn, yn) + f(xn+1, yn+1)).   (7.6)

Here yn+1 appears on both sides and is in general only available by solving a (usually) nonlinear equation; such methods are called implicit methods. To avoid this extra difficulty, we can replace yn+1 on the right hand side by the approximation from Euler's method. The method that we consider here is an example of what is called a predictor-corrector method: the formula from Euler's method is used to obtain a first approximation (the predictor) to the solution y(xn+1),

y*n+1 = yn + h f(xn, yn),

which is then corrected by the trapezoidal formula:

yn+1 = yn + (h/2)(f(xn, yn) + f(xn + h, yn + h f(xn, yn))).   (7.7)

Equivalently,

yn+1 = yn + (1/2)(k1 + k2),

where

k1 = hf(xn, yn),
k2 = hf(xn + h, yn + k1).

This is known as the improved Euler method.
Example 7.2
Apply (i) Euler's method, (ii) the Modified Euler method and (iii) the Improved Euler method to compute y(x) at x = 0.3 with step-size h = 0.1 for the initial value problem

dy/dx = 2x(1 − y),   y(0) = 2.

Compare the errors en = |y(xn) − yn| at each step with the exact solution y(x) = 1 + e^(−x²).

Solution: Since we are solving the problem on the interval [0, 0.3] with step-size h = 0.1, we have the nodes x0 = 0, x1 = 0.1, x2 = 0.2 and x3 = 0.3, and the initial value y0 = 2. In addition we have f(x, y) = 2x(1 − y).
(i) Euler's method: yn+1 = yn + h f(xn, yn), where f(xn, yn) = 2xn(1 − yn).

y1 = y0 + hf(x0, y0) = y0 + h[2x0(1 − y0)]
   = 2 + 0.1 × [2 × 0 × (1 − 2)]
   = 2,
y2 = y1 + hf(x1, y1) = y1 + h[2x1(1 − y1)]
   = 2 + 0.1 × [2 × 0.1 × (1 − 2)]
   = 1.98,
y3 = y2 + hf(x2, y2) = y2 + h[2x2(1 − y2)]
   = 1.98 + 0.1 × [2 × 0.2 × (1 − 1.98)]
   = 1.9408.
(ii) Modified Euler method: yn+1 = yn + k2, where k1 = hf(xn, yn) and k2 = hf(xn + h/2, yn + k1/2).

Step 1 (n = 0):
k1 = hf(x0, y0) = h(2x0(1 − y0)) = 0.1 × (2 × 0 × (1 − 2)) = 0,
k2 = hf(x0 + h/2, y0 + k1/2) = h(2(x0 + h/2)(1 − (y0 + k1/2)))
   = 0.1 × (2 × (0 + 0.1/2) × (1 − (2 + 0/2))) = −0.01,
y1 = y0 + k2 = 2 + (−0.01) = 1.99.

Step 2 (n = 1):
k1 = hf(x1, y1) = h(2x1(1 − y1)) = 0.1 × [2 × 0.1 × (1 − 1.99)] = −0.0198,
k2 = hf(x1 + h/2, y1 + k1/2) = 0.1 × [2 × (0.1 + 0.1/2) × (1 − (1.99 + (−0.0198)/2))] = −0.0294,
y2 = y1 + k2 = 1.99 + (−0.0294) = 1.9606.

Step 3 (n = 2):
k1 = hf(x2, y2) = h(2x2(1 − y2)) = 0.1 × [2 × 0.2 × (1 − 1.9606)] = −0.0384,
k2 = hf(x2 + h/2, y2 + k1/2) = h(2(x2 + h/2)(1 − (y2 + k1/2)))
   = 0.1 × [2 × (0.2 + 0.1/2) × (1 − (1.9606 + (−0.0384)/2))] = −0.0471,
y3 = y2 + k2 = 1.9606 + (−0.0471) = 1.9135.
(iii) Improved Euler method: yn+1 = yn + (1/2)(k1 + k2), where k1 = hf(xn, yn) and k2 = hf(xn + h, yn + k1).

Step 1 (n = 0):
k1 = hf(x0, y0) = h(2x0(1 − y0)) = 0.1 × (2 × 0 × (1 − 2)) = 0,
k2 = hf(x0 + h, y0 + k1) = h(2(x0 + h)(1 − (y0 + k1)))
   = 0.1 × (2 × (0 + 0.1) × (1 − (2 + 0))) = −0.02,
y1 = y0 + (1/2)(k1 + k2) = 2 + 0.5 × (0 + (−0.02)) = 1.99.

Step 2 (n = 1):
k1 = hf(x1, y1) = h(2x1(1 − y1)) = 0.1 × [2 × 0.1 × (1 − 1.99)] = −0.0198,
k2 = hf(x1 + h, y1 + k1) = h(2(x1 + h)(1 − (y1 + k1)))
   = 0.1 × [2 × (0.1 + 0.1) × (1 − (1.99 + (−0.0198)))] = −0.0388,
y2 = y1 + (1/2)(k1 + k2) = 1.99 + 0.5 × (−0.0198 − 0.0388) = 1.9607.

Step 3 (n = 2):
k1 = hf(x2, y2) = h(2x2(1 − y2)) = 0.1 × [2 × 0.2 × (1 − 1.9607)] = −0.0384,
k2 = hf(x2 + h, y2 + k1) = h(2(x2 + h)(1 − (y2 + k1)))
   = 0.1 × [2 × (0.2 + 0.1) × (1 − (1.9607 + (−0.0384)))] = −0.0553,
y3 = y2 + (1/2)(k1 + k2) = 1.9607 + 0.5 × (−0.0384 − 0.0553) = 1.9138.
Summary

x     Exact value   Euler    Modified   Improved
0.1   1.9900        2        1.99       1.99
0.2   1.9608        1.9800   1.9606     1.9607
0.3   1.9139        1.9408   1.9135     1.9138
Error at x = 0.3:   0        −0.0269    0.0004     0.0001
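The three methods of Example 7.2 can be compared with a short Python sketch (ours):

```python
import math

def f(x, y):
    return 2 * x * (1 - y)

def euler(y, x, h):
    return y + h * f(x, y)

def modified_euler(y, x, h):            # midpoint form (7.5)
    k1 = h * f(x, y)
    return y + h * f(x + h / 2, y + k1 / 2)

def improved_euler(y, x, h):            # trapezoidal/Heun form (7.7)
    k1 = h * f(x, y)
    k2 = h * f(x + h, y + k1)
    return y + (k1 + k2) / 2

h, steps = 0.1, 3
for step_fn in (euler, modified_euler, improved_euler):
    x, y = 0.0, 2.0
    for _ in range(steps):
        y = step_fn(y, x, h)
        x += h
    exact = 1 + math.exp(-x * x)
    print(step_fn.__name__, round(y, 4), "error", round(exact - y, 4))
```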
Although Euler’s method is easy to implement, this method is not so efficient in the sense
that to get a better approximation, one need a very small step size. One way to get a
better accuracy is to include the higher order terms in the Taylor expansion in the formula.
But the higher order terms involve higher derivatives of y. The Runge-Kutta methods
attempt to obtain greater accuracy and at the same time avoid the need for higher
derivatives, by evaluating the function f (x, y) at selected points on each subintervals. A
general Runge Kutta algorithm is given as
The function φ is termed as increment function. The mth order Runge-Kutta method gives
accuracy of order O(hm ). The function φ is chosen in such a way that when expanded the
right hand side of (7.8) matches with the Taylor series up to desired order. This means
that for a second order Runge-Kutta method the right side of (7.8) matches up to second
order terms of Taylor series.
The second order Runge-Kutta methods are known as RK2 methods. For the derivation of second order Runge-Kutta methods it is assumed that the increment is a weighted average of two function evaluations at suitable points in the interval [xn, xn+1]. Thus we have

yn+1 = yn + w1 k1 + w2 k2,   (7.9)

where

k1 = hf(xn, yn),   k2 = hf(xn + αh, yn + β k1).   (7.10)

Here w1, w2, α and β are constants to be determined so that equation (7.9) agrees with the Taylor expansion to as high an order as possible.
Now, let’s write down the Taylor series expansion of y in the neighborhood of xn correct
to the h2 term i.e
h2 0
y(xn+1 ) = y(xn ) + hf (xn , y(xn )) + f (xn , y(xn )) + O(h3 ) (7.11)
2
Then, using chain rule for the derivative f 0 (xn , y(xn )) we get
Thus we have
" #
h2 ∂f (xn , y(xn )) ∂f (xn , y(xn ))
y(xn+1 ) = y(xn )+hf (xn , y(xn ))+ + f (xn , y(xn )) +O(h3 )
2 ∂x ∂y
(7.12)
In addition, equation (7.9) and (7.10) can be rewritten as:
Therefore,
" #
2 ∂f (xn , yn ) ∂f (xn , yn )
yn+1 = yn + h(w2 + w2 )f (xn , yn ) + h w2 α + w2 βf (xn , yn ) + O(h3 ).
∂x ∂y
(7.13)
Assuming y(xn) ≈ yn and comparing equations (7.12) and (7.13) yields

w1 + w2 = 1,   w2 α = 1/2   and   w2 β = 1/2.   (7.14)

Observe that four unknowns are to be determined from three equations, so many solutions of (7.14) are possible. Two examples of second-order Runge-Kutta methods of the form (7.9) and (7.10) are the modified Euler method and the improved Euler method.
(a) The modified Euler method. In this case we take α = β = 1/2, so that w2 = 1 and w1 = 0, and obtain

yn+1 = yn + hf(xn + (1/2)h, yn + (h/2) f(xn, yn)).

(b) The improved Euler method, usually called RK2. This is arrived at by choosing α = β = 1, so that w1 = w2 = 1/2, which gives

k1 = hf(xn, yn),
k2 = hf(xn + h, yn + k1),
yn+1 = yn + (1/2)(k1 + k2).
The classical fourth order Runge-Kutta method (RK4) is given by

yn+1 = yn + (1/6)(k1 + 2k2 + 2k3 + k4),   (7.15)

where

k1 = hf(xn, yn),
k2 = hf(xn + (1/2)h, yn + (1/2)k1),
k3 = hf(xn + (1/2)h, yn + (1/2)k2),
k4 = hf(xn + h, yn + k3).
Example 7.3
Use (i) the second order Runge-Kutta method (improved Euler) and (ii) the fourth order Runge-Kutta method, with step size h = 0.01, to solve the initial value problem

y′ = y,   y(0) = 1,

computing the solution at x = 0.01, 0.02, ···, 0.05.

Solution: (i) Second order Runge-Kutta method:

xi     k1        k2        yi
0.00   –         –         1.000000
0.01   0.010000  0.010100  1.010050
0.02   0.010100  0.010202  1.020201
0.03   0.010202  0.010304  1.030454
0.04   0.010305  0.010408  1.040810
0.05   0.010408  0.010512  1.051270

(ii) Fourth order Runge-Kutta method:

xi     k1        k2        k3        k4        yi
0.00   –         –         –         –         1.000000
0.01   0.010000  0.010050  0.010050  0.010101  1.010050
0.02   0.010101  0.010151  0.010151  0.010202  1.020201
0.03   0.010202  0.010253  0.010253  0.010305  1.030455
0.04   0.010305  0.010356  0.010356  0.010408  1.040811
0.05   0.010408  0.010460  0.010460  0.010513  1.051271

For comparison, the exact solution is y(x) = e^x, and y(0.05) = 1.051271 to six decimal places.
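The tables of Example 7.3 can be reproduced with a short Python sketch (ours):

```python
import math

def f(x, y):
    return y

def rk2_step(x, y, h):                  # improved Euler form of RK2
    k1 = h * f(x, y)
    k2 = h * f(x + h, y + k1)
    return y + (k1 + k2) / 2

def rk4_step(x, y, h):                  # classical RK4, equation (7.15)
    k1 = h * f(x, y)
    k2 = h * f(x + h / 2, y + k1 / 2)
    k3 = h * f(x + h / 2, y + k2 / 2)
    k4 = h * f(x + h, y + k3)
    return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6

h = 0.01
for step in (rk2_step, rk4_step):
    x, y = 0.0, 1.0
    for _ in range(5):
        y = step(x, y, h)
        x += h
    print(step.__name__, y, "exact", math.exp(x))
```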