LEYKEKHMAN 2019 Numerical Analysis Lecture Notes
LEYKEKHMAN 2019 Numerical Analysis Lecture Notes
Lecture notes
Dmitriy Leykekhman
0 Introduction
This lecture notes are designed for the MATH 5510, which is the first graduate course in numerical analysis at Univer-
sity of Connecticut. Here I present the material which I consider important for students to see in their first numerical
analysis course. The material and the style of these lecture notes are strongly influenced by the lecture notes of Prof.
Matthias Heinkenschloss for CAMM 353 at Rice University. There are many other nice lecture notes that one can find
freely online. Let me just mention four volumes Numerical Analysis course (in German) by Rolf Rannacher and the
lecture notes by Doron Levy . There are plenty of other sources on the internet.
There are plenty of misconceptions among mathematicians what really numerical analysis is. In my opinion one of
the best definitions was given by N. Trefethen in his essay ”What is numerical analysis?”
Definition 0.0. Numerical analysis is the study of algorithms for the problem of continuous mathematics.
We strongly encourage to read this essay whoever is interested in the subject, it is only 5 pages long.
This lecture notes start with interpolation, which is not orthodox, but in my opinion it is an interesting topic that
captures students attention and introduces important ideas, challenges and motivations for the other topics in numerical
analysis.
The required background is rather light, good understanding of linear algebra and calculus would be sufficient. The
students also encouraged to code as many algorithms as possible appearing in the notes. From time to time I adopt
Matlab notation and syntax for columns or rows of matrices, loops and etc.
Dmitriy Leykekhman
Department of Mathematics, University of Connecticut, Storrs, CT 06269, USA. e-mail: [email protected]
1
Contents
3
4 Contents
A very good introduction to floating point arithmetics is a book by Michael Overton [1].
In everyday life we use decimal representation of numbers. For example
1234.567
for us means
1 * 104 + 2 * 103 + 3 * 102 + 4 * 100 + 5 * 10−1 + 6 * 10−2 + 7 * 10−3 .
More generally
d j . . . d1 d0 .d−1 . . . d−i . . .
represents
· · · d j * 10 j + · · · + d1 * 101 + d0 * 100 + d−1 * 10−1 + · · · + d−i * 10−i + · · · .
Let β ≥ 2 be an integer. For every x ∈ there exist integers e and di ∈ {0, . . . , β − 1}, i = 0, 1, . . . , such that
!
∞
x = sign(x) ∑ di β −i β e. (1.1)
i=0
Of course every integer has finite representation, but very often even simple rational numbers have infinite representa-
tions
Example 1.2.
1
= (0.33333 . . . )10 = (3.33333 . . . )10 * 101 ,
3
1
= (0.0101 . . . )2 = (1.0101)2 * 22 .
3
1 Floating point arithmetics 5
In a computer only a finite subset of all real numbers can be represented. These are the so–called floating point
numbers and they are of the form !
m−1
x̄ = (−1)s ∑ di β −i βe
i=0
Example 1.3. Consider the floating point number system β = 2, m = 3, emin = −1, emax = 2.
The normalized floating point numbers x̄ ̸= 0 are of the form
x̄ = ±1.d1 d2 × 2e
since the normalization condition implies that d0 ∈ {1, . . . , β − 1} = {1}. For the exponent we have choices e =
−1, 0, 1, 2. Below is the we plot all the numbers in this example. Notice that spacing between numbers is increasing as
we move away from 0.
-
1537 5 3 7 5 7
2 8 4 81 4 2 4 2 2 3 2 4 5 6 7
0
∙ Half this number, εmach = 21 β 1−m , is called machine precision or unit roundoff.
∙ The spacing between the floating pt. numbers in [1, β ] is β −(m−1) .
∙ The spacing between the floating pt. numbers in [β e , β β e ] is β −(m−1) β e .
Almost all modern computer implements the IEEE binary (β = 2) floating point standard. IEEE single precision
floating point numbers are stored in 32 bits. IEEE double precision floating point numbers are stored in 64 bits.
1.1 Rounding
then
sign(x) ∑m−1 −i β e , if dm < 21 β ,
i=0 di β
fl(x) =
sign(x) ∑m−1 di β −i + β −(m−1) β e , if dm ≥ 1 β .
i=0 2
Note, there may be two floating point numbers closest to x. fl(x) picks one of them. For example, let β = 10, m = 3.
Then 1.235 − 1.24 = 0.005, but also 1.235 − 1.23 = 0.005.
Theorem 1.1. If x is a number within the range of floating point numbers and |x| ∈ [β e , β e+1 ), then the absolute error
between x and the floating point number fl(x) closest to x is given by
1
| fl(x) − x| ≤ β e(1−m)
2
1 Floating point arithmetics 7
| fl(x) − x| 1 1−m
≤ β . (1.2)
|x| 2
In other words fl(x) is a floating point number closest to x = ∑∞ −i β e with d > 0.
i=0 di β 0
is β −(m−1) β e . Hence if x ∈ [β e , β e+1 ), then the floating point number x̄ closest to x satisfies |x̄ − x| ≤ 21 β −(m−1) β e .
Since x ≥ β e ,
|x̄ − x| 1 −(m−1)
≤ β .
|x| 2
Let represent one of the elementary operations +, −, *, /. If x̄ and ȳ are floating point numbers, then x̄ȳ may not
be a floating point number, for example: β = 10, m = 4: 1.234 + 2.751 * 10−1 = 1.5091. What is the computed value
for x̄ȳ? In IEEE floating point arithmetic the result of the computation x̄ȳ is equal to the floating point number that
is nearest to the exact result x̄ȳ. Therefore we use fl(x̄ȳ) to denote the result of the computation x̄ȳ Model for the
computation of x̄ȳ, where is one of the elementary operations +, −, *, /.
1. Given floating point numbers x̄ and ȳ.
2. Compute x̄ȳ exactly.
3. Round the exact result x̄ȳ to the nearest floating point number and normalize the result.
In the above example: 1.234 + 2.751 * 10−1 = 1.5091. Comp. result: 1.509 The actual implementation of the elemen-
tary operations is more sophisticated [1].
Given two numbers x̄, ȳ in floating point format, the computed result satisfies
8 Contents
| fl(x̄ȳ) − (x̄ȳ)|
≤ εmach .
x̄ȳ
For the previous result on the error between x̄ȳ and the computed fl(x̄ȳ) only holds if x̄, ȳ in floating point
format. What happens when we operate with numbers that are not in floating point format?
To analyze the analyze the error incurred by the subtraction of two numbers, the following representation is useful:
For every x ∈, there exists ε with |ε| ≤ εmach such that
Note that if x ̸= 0, then the previous identity is satisfied for ε := (fl(x) − x)/x. The bound |ε| ≤ εmach follows from
(1.2).
For x, y ∈ we have ε1 , ε2 with |ε1 |, |ε2 | ≤ εmach such that
ax2 + bx + c = 0
are given by p
x± = −b ± b2 − 4ac /(2a).
When a = 5 * 10−4 , b = 100, and c = 5 * 10−3 the computed (using single precision Fortran) first root is
x+ = 0.
Cannot be exact, since x = 0 is a solution of the quadratic equation if and only if c = 0. Since fl(b2 − 4ac) = fl(b2 ) for
the data given above, we suffer from catastrophic cancellation.
A remedy is the following reformulation of the formula for x+ :
√
√ √
2 −b + b2 − 4ac −b − b 2 − 4ac
−b + b − 4ac 1 2c
= √ = √
2a 2a 2
−b − b − 4ac −b − b2 − 4ac
Here the subtraction of two almost equal numbers is avoided and the computation using this formula gives x+ =
−0.5E − 04.
A ‘stable’ (see later for a description of stability) formula for both roots
p
x1 = −b − sign(b) b2 − 4ac /(2a), x2 = c/(ax1 ).
2 Interpolation
Let Φ(x; a0 , . . . , an ) be a family of functions of variable x, which can be real or complex. Given n + 1 pairs (xi , fi ),
i = 0, 1, . . . , n, find parameters a0 , a1 , . . . , an such that
Φ(xi ; a0 , . . . , an ) = fi , i = 0, 1, . . . , n.
Φ(x; a0 , . . . , an ) = a0 + a1 x + a2 x2 + · · · + an xn .
a0 + a1 x + a2 x2 + · · · + an xn
Φ(x; a0 , . . . , an ; b0 , . . . , bm ) = .
b0 + b1 x + · · · + bm xm
Example 2.3 (Trigonometric interpolation).
There are many others, for example splines, which we will address later.
Definition 2.1. The interpolation problem is linear if Φ(xi ; a0 , . . . , an ) depends linearly on a0 , a1 , . . . , an , i.e.
Polynomial interpolation is historically very important and well investigated problem, due to all nice properties of
polynomials.
p(x) = a0 + a1 x + a2 x2 + · · · + an xn
such that
p(xi ) = fi i = 0, 1, . . . , n.
The main motivation for the approximation is to estimate the unknown values of f (x).
Definition 2.2. If the point x̄ lies inside the interval formed by the points x1 , . . . , xn , we speak about interpolation; if
the point x̄ lies outside the interval formed by the points x1 , . . . , xn , we speak about extrapolation.
First, we establish that the interpolation problem is well-posed.
Theorem 2.1 (Existence and Uniqueness). Given n + 1 distinct points xi , i.e. xi ̸= x j for i ̸= j and arbitrary n + 1
values f0 , f1 , . . . , fn . There exists a unique polynomial pn (x) of degree n or less such that pn (xi ) = fi for i = 0, 1, . . . , n.
Assume there are two such polynomials pn (x) and p̃n (x). Then q(x) = pn (x) − p̃n (x) is a polynomial of degree at
most n that has n + 1 roots, namely xi , i = 0, 1, . . . , n. The only polynomial with such property is zero polynomial. Thus
q(x) ≡ 0 and p̃n (x) = pn (x).
Existence.
For i = 0, 1, . . . , n consider
n x−x
(x − x0 ) · · · (x − xi−1 )(x − xi+1 ) · · · (x − xn ) j
Li (x) = =∏ .
(xi − x0 ) · · · (xi − xi−1 )(xi − xi+1 ) · · · (xi − xn ) j=0 xi − x j
j̸=i
Thus,
pn (x) = f0 L0 (x) + f1 L1 (x) + · · · + fn Ln (x)
is the desired polynomial.
Example 2.5.
xi 0 1 3
fi 1 3 2
Thus,
(x − 1)(x − 3) x(x − 3) x(x − 1)
L0 (x) = , L1 (x) = , L2 (x) =
(0 − 1)(0 − 3) (1 − 0)(1 − 3) (3 − 0)(3 − 1)
and cross-multiplying, we compute
From now on we assume that the points x1 , . . . , xn are distinct. Given a basis ψ1 (x), . . . , ψn (x) of Pn−1 . Problem 1
we now can state as to find coefficients a1 , . . . , an for the polynomial
such that
p(xi ) = a1 ψ1 (xi ) + a2 ψ2 (xi ) + · · · + an ψn (xi ) = fi
for i = 1, 2, . . . , n. Which is equivalent to the linear system
ψ1 (x1 ) ψ2 (x1 ) . . . . . . ψn (x1 ) a1 f1
ψ1 (x2 ) ψ2 (x2 ) . . . . . . ψn (x2 ) a2 f2
.. .. = .. . (2.1)
.. .. .. ..
. . . . . . .
ψ1 (xn ) ψ2 (xn ) . . . . . . ψn (xn ) an fn
In Theorem 2.1 we have established the existence and uniqueness of the solution for the linear system (2.1) with
arbitrary right hand side. As a result the matrix in (2.1) is non-singular for any choice of basis. Thus mathematically
any choice of basis would work, however computationally it makes a huge difference. Here are some natural choices.
12 Contents
. . . x1n−1
M0 (x1 ) M1 (x1 ) . . . . . . Mn−1 (x1 ) 1 x1 ...
M0 (x2 ) M1 (x2 ) . . . . . . Mn−1 (x2 ) 1 x2 ... . . . x2n−1
=. . .. ,
.. .. .. .. .. .. ..
. . . . . .. .. . . .
M0 (xn ) M1 (xn ) . . . . . . Mn−1 (xn ) 1 xn . . . . . . xnn−1
which is known as Vandermonde matrix. We can take several observation. First of all the matrix is full, so potentially
it can be expensive to solve the linear system. Secondly, looking at the plot of the monomial basis we can observe
that they are very similar near 0, meaning that if the interpolating points are near zero, the matrix can be close to a
singular. Later we make it more precise when we study the condition number of a matrix. On the other hand once the
coefficients a0 , . . . , an−1 are found out, the evaluation of the resulting polynomial at any point x̄ is rather simple. We
can make it even very efficient by noticing
Using this nested form, we can write algorithm for evaluation, known as Horner’s Scheme.
1. p = an
2. for i = n − 1 : −1 : 1
3. p = p * x + ai
4. end
Since (
1 if i = j
Li (x j ) = δi j =
0 if i ̸= j.
The resulting matrix is just an identity matrix and as a result a1 = f1 , . . . , an = fn and
which is a lower triangular matrix and can be solved explicitly by forward substituion. Thus,
a1 = f1
f 2 − a1
a2 =
x2 − x1
f3 − a1 − a2 (x2 − x1 )
a3 =
(x3 − x1 )(x3 − x2 )
..
.
fn − ∑n−1 i−1
i=1 ai ∏ j=1 (xn − x j )
an = .
∏n−1
j=1 (xn − x j )
Similarly, to the Monomial basis, once the coefficients a0 , . . . , an−1 are found, the evaluation of the resulting polyno-
mial at any point x̄ can be done via Horner’s Scheme as well, since
14 Contents
n−2 n−1
p(x) = a0 + a1 (x − 1) + a2 (x − x1 )(x − x2 ) + · · · + an−2 ∏ (x − x j ) + an−1 ∏ (xn − x j )
j=1 j=1
" #
n−2 n−1
= a0 + a1 + a2 x + · · · + an−2 ∏ (x − x j ) + an−1 ∏ (x − x j ) (x − x1 )
j=2 j=2 (2.3)
" " # #
n−2 n−2
= a0 + a1 + a2 + · · · + an−2 ∏ (x − x j ) + an−1 ∏ (x − x j ) (x − x2 ) (x − x1 )
j=3 j=3
Using this nested form, we can write algorithm for evaluation, known as Horner’s Scheme.
1. p = an
2. for i = n − 1 : −1 : 1
3. p = p * (x − xi ) + ai
4. end
since P0 ( f | xi )(x) is just a constant function that passes through xi , which is just fi . Surprisingly above expression can
be generalized to polynomials of arbitrary degree.
Theorem 2.4. Given Pn−2 ( f | x1 , . . . , xn−1 )(x) and Pn−2 ( f | x2 , . . . , xn )(x), we can obtain Pn−1 ( f | x1 , . . . , xn )(x) from
x − x1
Pn−1 ( f | x1 , . . . , xn )(x) = Pn−2 ( f | x1 , . . . , xn−1 )(x) + Pn−2 ( f | x2 , . . . , xn )(x) − Pn−2 ( f | x1 , . . . , xn−1 )(x) .
xn − x1
2 Interpolation 15
Proof. First of all we notice that on the right (and as a result on the left) is a polynomial of degree n − 1. Thus, the
only thing we need to check that it indeed interpolates the function f at x1 , . . . , xn .
For x = x1 , since Pn−2 ( f | x1 , . . . , xn−1 )(x1 ) = f1 , we obtain
For x = xi , 1 < i < n, we notice that Pn−2 ( f | x2 , . . . , xn )(xi ) − Pn−2 ( f | x1 , . . . , xn−1 )(xi ) = fi − fi = 0 and as a result
The above result gives us a recursive way to construct the interpolating polynomial
P0 ( f | x1 ) ↘
P0 ( f | x2 ) → P1 ( f | x1 , x2 )
.. .. .. ..
. . . .
.. .. ..
. . . ↘
P0 ( f | xn−1 ) → P1 ( f | xn−2 , xn−1 ) . . . . . . → Pn−2 ( f | x1 , . . . , xn−1 ) ↘
P0 ( f | xn ) → P1 ( f | xn−1 , xn ) . . . . . . → Pn−2 ( f | x2 , . . . , xn ) → Pn−1 ( f | x1 , . . . , xn ).
In this section we will see how efficient the method of divided differences can be for Newton’s basis. For Newton’s
basis the interpolating polynomial has the form
n i−1
Pn−1 ( f | x1 , . . . , xn )(x) = ∑ ai ∏ (x − x j ) = an xn−1 + · · · . (2.4)
i=1 j=1
From the above we can observe that the leading coefficient an in the interpolating polynomial is the same as in the
leading coefficient for the polynomial in Newton’s basis.
Definition 2.4. The leading coefficient ak of the polynomial Pk−1 ( f | x1 , . . . , xk )(x) is called the (k − 1) divided differ-
ence and is denoted by f [x1 , . . . , xk ].
Using this definition we can write the Pn−1 ( f | x1 , . . . , xn )(x) in Newton’s basis as
n i−1
Pn−1 ( f | x1 , . . . , xn )(x) = ∑ f [x1 , . . . , xi ] ∏ (x − x j ). (2.5)
i=1 j=1
k i−1
Pk− j−1 ( f | x j , . . . , xk )(x) = ∑ f [x j , . . . , xi ] ∏ (x − xm )
i= j m= j
k−1 i−1
Pk− j−2 ( f | x j , . . . , xk−1 )(x) = ∑ f [x j , . . . , xi ] ∏ (x − xm )
i= j m= j
k i−1
Pk− j−2 ( f | x j+1 , . . . , xk )(x) = ∑ f [x j , . . . , xi ] ∏ (x − xm ).
i= j+1 m= j
Since the leading coefficients of the polynomial on the left and on the right hand sides must be equal, we obtain the
following formula
f [x j+1, . . . , xk ] − f [x j , . . . , xk−1 ]
f [x j , . . . , xk ] = (2.8)
xk − x j
Using it we obtain a recursive way to compute coefficients
f [x1 ] ↘
f [x2 ] → f [x1 , x2 ]
.. .. .. ..
. . . .
.. .. ..
. . . ↘
f [xn−1 ] → f [xn−2 , xn−1 ] . . . . . . → f [x1 , . . . , xn−1 ] ↘
f [xn ] → f [xn−1 , xn ] . . . . . . → f [x2 , . . . , xn ] → f [x1 , . . . , xn ].
1. for i = 1 : n
2. ai1 = fi
3. end
4. for j = 2 : n
5. for i = j : n
a −ai−1, j−1
6. ai j = i, j−1
xi −xi− j+1
7. end
8. end
2 Interpolation 17
One can modify the algorithm to save some storage by overwriting the entries of the coefficients that are needed
anymore
1. for i = 1 : n
2. ai = f i
3. end
4. for j = 2 : n
5. for i = j : n
ai −ai−1
6. ai = xi −x i− j+1
7. end
8. end
Example 2.6.
xi 0 1 −1 2 −2
fi −5 −3 −15 39 −9
Using (2.8), we obtain
xi fi
0 −5
1 −3 2
−1 −15 6 −4
2 39 18 12 8
−2 −9 12 6 2 3
We can notice that the table actually contains much more information. For example, it contains the coefficients of
the polynomial interpolating (−1, −15), (2, 39), (−2, −9), which is
or the coefficients of the polynomial interpolating (1, −3), (−1, −15), (2, 39), which is
Once the coefficients of the Newton’s polynomial are computed we can use Horner’s Scheme (Algorithm 2.3) to
evaluate it at given points.
18 Contents
The recursive formula in Theorem 2.4 can be used to compute the value of the interpolating polynomial at some point
x̄ without computing coefficients ai . Since
x̄ − x j
Pk− j−1 ( f | x j , . . . , xk )(x̄) = Pk− j−2 ( f | x j , . . . , xk−1 )(x̄)+ Pk− j−2 ( f | x j+1 , . . . , xk )(x̄)−Pk− j−2 ( f | x j , . . . , xk−1 )(x̄)
xk − x j
and naturally
P0 ( f | xi )(x̄) = fi i = 1, 2, . . . , n.
The other values can be computed recursively, we obtain Neville scheme:
P0 ( f | x1 )(x̄) ↘
P0 ( f | x2 )(x̄) → P1 ( f | x1 , x2 )(x̄)
.. .. .. ..
. . . .
.. .. ..
. . . ↘
P0 ( f | xn−1 )(x̄) → P1 ( f | xn−2 , xn−1 )(x̄) . . . . . . → Pn−2 ( f | x1 , . . . , xn−1 )(x̄) ↘
P0 ( f | xn )(x̄) → P1 ( f | xn−1 , xn )(x̄) . . . . . . → Pn−2 ( f | x2 , . . . , xn )(x̄) → Pn−1 ( f | x1 , . . . , xn )(x̄).
1. for i = 1 : n
2. pi = f i
3. end
4. for j = 2 : n
5. for i = n : −1 : j
x̄−x j+1
6. pi = pi−1 + xi −xi−
i− j+1
* (pi − pi−1 )
7. end
8. end
9. p = pn
Example 2.7.
xi fi
0 −5
1 −3 1
−1 −15 9 −23
2 39 57 105 169
−2 −9 51 81 121 241
In the previous section we show how to compute approximating polynomials and some evaluation techniques. In this
section we try to answer a question, how close the approximation polynomial to the function f , namely we want an
estimate for
sup | f (x) − P( f |x1 , . . . , xn )(x)| (2.9)
x∈[a,b]
Proof. If x = xi for some i = 1, . . . , n, then the result naturally holds. Assume x ̸= xi for any i = 1, . . . , n. Consider a
function
ψ(x) = f (x) − P( f | x1 , . . . , xn )(x) − cω(x),
where the constant c is taken to be
f (x̄) − P( f | x1 , . . . , xn )(x̄)
c= .
ω(x̄)
With this choice of the constant c, the function ψ(x) has at least n + 1 roots, namely at x1 , . . . , xn and x̄. Thus, from the
(1)
Rolle’s Theorem there exist n points, call them xi , i = 1, . . . , n, such that
(1)
ψ ′ (xi ) = 0, i = 1, . . . , n.
(2)
Again, from the Rolle’s Theorem there exist n − 1 points, call them xi , i = 1, . . . , n − 1, such that
(2)
ψ ′′ (xi ) = 0, i = 1, . . . , n − 1.
(n)
Continue this process, there exists a point x1 such that
20 Contents
(n)
ψ (n) (x1 ) = 0.
dn dn dn dn
n
ψ(x) = n f (x) − n P( f | x1 , . . . , xn )(x) − c n ω(x).
dx dx dx dx
Since P( f | x1 , . . . , xn )(x) is a polynomial of degree n − 1, we have
dn
P( f | x1 , . . . , xn )(x) = 0
dxn
and since ω(x) is a polynomial of degree n with leading coefficient 1, we have
dn
ω(x) = n!.
dxn
As a result
(n)
(n) (n) (n) (n) f (n) (x1 )
ψ (x1 ) = f (x1 ) − cn! =0 ⇒ c= ,
n!
(n)
which shows the theorem with ξ (x̄) = x1 .
As a corollary, we immediately obtain
Corollary 2.1.
1 n
max | f (x) − P( f | x1 , . . . , xn )(x)| ≤ max | f (n) (ξ (x))| max ∏ (x − x j ) .
x∈[a,b] n! x∈[a,b] x∈[a,b] j=1
1
Thus we observe that the error is bounded by three terms. First term decreases
rather fast, the second term
n!
n
maxx∈[a,b] | f (n) (ξ (x))
depends on the (unknown) function f , the last term maxx∈[a,b] ∏ j=1 (x − x j ) looks rather mys-
terious. Of course we can estimate it roughly as
n
max ∏ (x − x j ) ≤ (b − a)n ,
x∈[a,b] j=1
and if we can control n-th derivative of f by M n , then from Corollary 2.1, we obtain
M n (b − a)n
max | f (x) − P( f | x1 , . . . , xn )(x)| ≤ →0 as n → ∞.
x∈[a,b] n!
Example 2.8. Consider f (x) = sin (3x) on [0, π]. Then | f (n) (x)| ≤ 3n an we obtain
3n π n
max | sin (3x) − P( f | x1 , . . . , xn )(x)| ≤ .
x∈[a,b] n!
n n n n n n
Although, we know that limn→∞ 3 n!π = 0 we need many points before 3 n!π is small. Thus for n = 20, 3 n!π ≈ 12.5689,
n n
for n = 30, 3 n!π ≤ 6.3 * 10−4 . Thus even for this simple example we need to deal with high order polynomials. Of
course we used rather cruel estimate for maxx∈[a,b] |ω(x)|, and indeed if we have more information about the locations
of x1 , . . . , xn we can obtain better estimates.
2 Interpolation 21
b−a
xi = a + (i − 1)h with h = , i = 1, . . . , n.
n−1
For n = 2, we have x1 = a and x2 = b, h = b − a and
(x2 − x1 )2 (b − a)2 h2
|ω(x)| = |(x − x1 )(x − x2 )| = = = . (2.10)
4 4 4
Thus, from Corollary 2.1 we obtain that in this case
(b − a)2 h2
max | f (x) − P1 ( f | x1 , x2 )(x)| ≤ max | f ′′ (x)| ≤ max | f ′′ (x)|. (2.11)
x∈[a,b] 8 x∈[a,b] 8 x∈[a,b]
h2
|(x − x j )(x − x j+1 )| ≤ .
4
We have
n j−1 n
∏(x − xi ) = ∏ (x − xi ) · |(x − x j )(x − x j+1 )| · ∏ (x − xi )
i=1 i=1 i= j+2
j−1 h2 n
≤ ∏ (x − xi ) · · ∏ (x − xi )
i=1 4 i= j+2
h2 j−1 n
≤ ∏ (x j+1 − xi ) · ∏ (x j − xi ) .
4 i=1 i= j+2
Example 2.9. Consider f (x) = sin (3x) on [0, π]. Then | f (n) (x)| ≤ 3n and for equidistant approximation with h =
π/(n − 1) from Theorem 2.10 we obtain
n
3n
π
max | sin (3x) − P( f | x1 , . . . , xn )(x)| ≤ .
x∈[a,b] 4n n − 1
3n π n
The above estimate is much sharper than the one we used in Example 2.8. Thus for n = 20, 4n n−1 ≈ 1.0169e − 08
3n π n
and for n = 30, 4n n−1 ≈ 1.8924e − 17.
A natural question: is there a choice for the interpolation notes that minimizes maxx∈[a,b] |ω(x)|, i.e. what is the solution
to the following min-max problem
n
min max ∏ (x − x j ) . (2.12)
x1 ,...,xn x∈[a,b]
j=1
The solution x1* , . . . , xn* to (2.12) are called the Chebyshev points and are given by the formula
a+b a+b (2i − 1)π
xi* = + cos , i = 1, . . . , n.
2 2 2n
1
Example 2.10. Consider f (x) = 1+x2
TO ADD
2 Interpolation 23
For approximation of a function f on an interval [a, b], instead of choosing high order interpolating polynomial one
can partition the interval into small pieces and on each small piece use small or moderate order polynomials for an
approximation. In addition, one may choose various ways to connect pieces together, resulting in global smoothness
properties. The advantage of such approach is that no high order of smoothness of f is required. We will consider two
popular choices, linear (continuous) splines and cubic (C2 ) splines.
Si (x) = ai + bi (x − xi ), i = 1, . . . , n − 1.
Thus on each subinterval we have 2 unknowns. Since there are n − 1 subintervals we have 2n − 2 unknowns in total.
We want our spline function S(x) to have the following properties:
1. Interpolation at nodes, i.e. S(xi ) = f (xi ) := fi for i = 1, 2, . . . , n
2. Continuity Si+1 (xi+1 ) = Si (xi+1 ) for i = 2, . . . , n − 1.
Thus in total we have n + n − 2 = 2n − 2 conditions, which matches the total number of unknowns. Easy to see that
the conditions above uniquely determine S(x) and on each subinterval [xi , xi+1 ] the coefficients ai and bi in Si (x) =
ai + bi (x − xi ) for i = 1, . . . , n − 1 are given by
fi+1 − fi
ai = fi bi = .
xi+1 − xi
Below we provide algorithms for computing the coefficients and evaluation.
1. for i = 1 : n − 1
2. ai = f i
3. bi = ( fi+1 − fi )/(xi+1 − xi )
4. end
1. for i = 1 : n − 1
2. if x̄ ≤ xi+1
3. S = ai + bi * (x̄ − xi )
4. end
5. end
From (2.11), we obtain the following convergence property of the linear spline
Theorem 2.13. Let S(x) be the linear spline interpolating f at x1 , . . . , xn . If f ∈ C2 [a, b] then
h2
max | f (x) − S(x)| ≤ max | f ′′ (x)|,
x∈[a,b] 8 x∈[a,b]
where
h= max hi = max (xi+1 − xi ).
i=1,...,n−1 i=1,...,n−1
Si (x) = ai + bi (x − xi ) + ci (x − xi )2 + di (x − xi )3 , i = 1, . . . , n − 1.
Thus on each subinterval we have 4 unknowns. Since there are n − 1 subintervals we have 4n − 4 unknowns in total.
We want our spline function S(x) to be smooth, we may asked for the following properties:
1. Interpolation at nodes, i.e. S(xi ) = f (xi ) := fi for i = 1, 2, . . . , n
2. Continuity Si+1 (xi+1 ) = Si (xi+1 ) for i = 2, . . . , n − 1.
′ (x ′
3. Continuity of the first derivatives Si+1 i+1 ) = Si (xi+1 ) for i = 2, . . . , n − 1.
4. Continuity of the second derivatives Si+1 ′′ (x ) = Si′′ (xi+1 ) for i = 2, . . . , n − 1.
i+1
Thus in total we have n + 3(n − 2) = 4n − 6 conditions, which does not match the number of unknowns, we are two
short. The popular choices are:
∙ Natural boundary S1′′ (x1 ) = 0 and Sn−1
′′ (x ) = 0
n
∙ Clamped boundary S1′ (x1 ) = f ′ (x1 ) and Sn−1′ (xn ) = f ′ (xn )
′ ′
∙ Periodic spline S1 (x1 ) = Sn−1 (xn ), S1 (x1 ) = Sn−1 (xn ), and S1′′ (x1 ) = Sn−1
′′ (x ).
n
It is not still obvious how to compute the cubic spline from these conditions. This is what we will address next. First,
we introduce a notation
hi = xi+1 − xi , i = 1, . . . , n − 1.
Differentiating the expression
2 Interpolation 25
Si (x) = ai + bi (x − xi ) + ci (x − xi )2 + di (x − xi )3 , i = 1, . . . , n − 1, (2.13)
we obtain
Thus,
ai+1 = ai + bi hi + ci h2i + di h3i , i = 1, . . . , n − 1. (2.16)
Similarly, from the continuity of the derivatives we obtain
′
bi+1 = Si+1 (xi+1 ) = Si′ (xi+1 ) = bi + 2ci hi + 3di h2i i = 1, . . . , n − 2.
′
Setting bn = Sn−1 (xn ). Thus, we also have
′
bn = Sn−1 (xn ) = bn−1 + 2cn−1 hn−1 + 3dn−1 h2n−1
Summarizing,
bi+1 = bi + 2ci hi + 3di h2i , i = 1, . . . , n − 1. (2.17)
From the continuity of the second derivatives we obtain
′′
2ci+1 = Si+1 (xi+1 ) = Si′′ (xi+1 ) = 2ci + 6di hi , i = 1, . . . , n − 2.
′′ (x ). Thus, we also have
Again, setting cn = 21 Sn−1 n
′′
2cn = Sn−1 (xn ) = bn−1 + 2cn−1 hn−1 + 3dn−1 h2n−1 .
Summarizing,
2ci+1 = bi + 2ci hi + 3di h2i , i = 1, . . . , n − 1. (2.18)
From (2.18), we find
ci+1 − ci
di = , i = 1, . . . , n − 1. (2.19)
3hi
Inserting it into (2.16) and (2.17), we obtain
h2i
ai+1 = ai + bi hi + (2ci+1 + ci ), i = 1, . . . , n − 1 (2.20)
3
and
bi+1 = bi + hi (ci+1 + ci ), i = 1, . . . , n − 1. (2.21)
26 Contents
1 hi
bi = (ai+1 − ai ) − (2ci+1 + ci ), i = 1, . . . , n − 1. (2.22)
hi 3
Replacing i with i − 1 we have
bi = bi−1 + hi−1 (ci + ci−1 ), i = 2, . . . , n. (2.23)
and
1 hi−1
bi−1 = (ai − ai−1 ) − (2ci + ci−1 ), i = 2, . . . , n. (2.24)
hi−1 3
Inserting (2.22) and (2.24) into (2.23), we obtain
1 hi 1 hi−1
(ai+1 − ai ) − (2ci+1 + ci ) = (ai − ai−1 ) − (2ci + ci−1 ) + hi (ci+1 + ci ).
hi 3 hi−1 3
Moving all c′i s to the left and a′i s to the right and rearranging the terms, we obtain
3 3
hi−1 ci−1 + 2(hi−1 + hi )ci + hi ci+1 = (ai+1 − ai ) − (ai − ai−1 ) i = 2, . . . , n − 1. (2.25)
hi hi−1
Which is equivalent to n − 2 equations with n unknowns. We need additional conditions (like natural or periodic
boundary) to close the system.
Alternatively, we could introduce variables Mi = S′′ (xi ), i = 1, . . . , n, often called moments, and set equations for
them. Of course, Mi = 2ci , however the point of view is slightly different. Thus, since Si (x) is a cubic on [xi , xi+1 ],
Si′′ (x) is linear on [xi , xi+1 ] and in terms of moments has the form
x − xi xi+1 − x
Si′′ (x) = Mi+1 + Mi , i = 1, . . . , n − 1.
hi hi
Integrating, we obtain
(x − xi )2 (xi+1 − x)2
Si′ (x) = Mi+1 − Mi + bi , i = 1, . . . , n − 1.
2hi 2hi
and integrating once again
(x − xi )3 (xi+1 − x)3
Si (x) = Mi+1 + Mi + bi (x − xi ) + ai , i = 1, . . . , n − 1.
6hi 6hi
And now one can work with this form to derive a linear system for Mi .
Notice that for equidistant nodes x1 , . . . , xn we have h = hi and the matrix A and the vector g take the form
4 1 a3 − 2a2 + a1 f (x3 ) − 2 f (x2 ) + f (x1 )
1 4 1 a4 − 2a3 + a2 f (x4 ) − 2 f (x3 ) + f (x2 )
.. .. ..
3 .
3
.
A = h . . .
, g =
.
.
=
.
. .
(2.28)
h h
1 4 1 an−1 − 2an−2 + an−3 f (xn−1 ) − 2 f (xn−2 ) + f (xn−3 )
1 4 an − 2an−1 + an−2 f (xn ) − 2 f (xn−1 ) + f (xn−2 )
The key to the analysis will be Lemma 2.2. Matrix A posses very nice properties. Obviously, it is symmetric and
diagonally dominant, so it is non-singular. Moreover, we have
Lemma 2.2. Given a linear system Az = w, where A is the matrix given in (2.26) and w = (w1 , . . . , wn−2 )T is arbitrary.
Then for z = (z1 , . . . , zn−2 )T , we have
|wi |
max |zi | ≤ max .
1≤i≤n−2 1≤i≤n−2 hi + hi+1
Proof. Let max1≤i≤n−2 |zi | = |zr | for some 2 ≤ r ≤ n − 3. Then looking at the r-th row of Az = w, we have
where in the last step we used that |zr | is maximal. The cases r = 1 and r = n − 2 are left to the reader.
Proof. Assume it is. Then there exists a vector z ̸= 0 such that Az = 0. But then we get a contradiction since from the
above lemma maxi |zi | = 0.
Hence from the system of linear equations Ac = g with A, c, and g given by (2.26) and (2.27), we can obtain c, and
then using (2.22), we can obtain bi and from (2.19), we can compute di . We summarize it in the following algorithms.
Output: The coefficients a1 , . . . , an−1 , b1 , . . . , bn−1 , c1 , . . . , cn−1 , and d1 , . . . , dn−1 of the natural cubic spline
1. for i = 1 : n
2. ai = f i
3. end
4. for i = 1 : n − 1
5. hi = xi+1 − xi
6. end
7. Generate matrix A ∈ R(n−2)×(n−2) and the vector g ∈ Rn−2 given in (2.26) and (2.27)
8. Compute c2 , . . . , cn−1 by solving the system Ac = g
9. Set c1 = 0 and cn = 0
10. for i = 1 : n − 1
11. bi = h1i * (ai+1 − ai ) − h3i * (2ci + ci+1 )
12. di = 3h1 i * (ci+1 − ci )
13. end
Once the coefficients of the cubic spline are computed, given a point x̄ we can use the following algorithm to evaluate
S(x̄).
1. for i = 1 : n − 1
2. if x̄ ≤ xi+1
3. S = ai + bi * (x̄ − xi ) + ci * (x̄ − xi )2 + di * (x̄ − xi )3
4. end
5. end
Now we address the question of convergence. The key result we will use is Lemma 2.2. Define M = (M1 , . . . , Mn )T
where Mi = Si′′ (xi ) and F = (F1 , . . . , Fn )T where Fi = f ′′ (xi ), i = 1, . . . , n. Define r = (r1 , . . . , rn )T by
r = A(M − F).
ri = 2gi − (AF)i
6 6
( f (xi ) − f (xi−1 )) − hi−1 f ′′ (xi−1 ) + 2(hi−1 + hi ) f ′′ (xi ) + hi f ′′ (xi+1 ) .
= ( f (xi+1 ) − f (xi )) −
hi hi−1
Since xi+1 = xi + hi and xi−1 = xi − hi−1 , using Taylor expansion (for f sufficiently smooth), we obtain
h2i ′′ h3
f (xi+1 ) = f (xi ) + hi f ′ (xi ) + f (xi ) + i f ′′′ (xi ) + O(h4i )
2 6
and
h2i−1 ′′ h3
f (xi−1 ) = f (xi ) − hi−1 f ′ (xi ) + f (xi ) − i−1 f ′′′ (xi ) + O(h4i−1 ).
2 6
As a result
6 6
( f (xi+1 ) − f (xi )) − ( f (xi ) − f (xi−1 )) = 3hi f ′′ (xi ) + h2i f ′′′ (xi ) + O(h3i ) + 3hi−1 f ′′ (xi ) − h2i−1 f ′′′ (xi ) + O(h3i−1 ).
hi hi−1
(2.30)
Similarly,
f ′′ (xi+1 ) = f ′′ (xi ) + hi f ′′′ (xi ) + O(h2i )
and
f ′′ (xi−1 ) = f ′′ (xi ) − hi−1 f ′′′ (xi ) + O(h2i−1 ).
As a result
hi−1 f ′′ (xi−1 ) + 2(hi−1 + hi ) f ′′ (xi ) + hi f ′′ (xi+1 ) = 3(hi + hi−1 ) f ′′ (xi ) − h2i−1 f ′′′ (xi ) + h2i f ′′′ (xi ) + O(h3i−1 ) + O(h3i )
and thus
|ri | O(h3i + h3i−1 )
max |Mi − Fi | ≤ max ≤ = O(h2i + h2i−1 ) = O(h2 ). (2.31)
i i hi + hi+1 hi + hi+1
Thus we established that at the nodes
max |S′′ (xi ) − f ′′ (xi )| ≤ Ch2 , (2.32)
i
for some constant independent of h provided f ∈ C4 [a, b]. The estimate (2.31) is key result for error estimates.
Theorem 2.16. Let f ∈ C4 [a, b] and maxi (h/hi ) ≤ K for some K > 0. Then there exists a constant C independent of h
such that
max |S(k) (x) − f (k) (x)| ≤ Ch4−k , k = 0, 1, 2, 3.
x∈[a,b]
M j+1 − M j
S′′′
j (x) = ,
hj
M j+1 − f ′′ (x j+1 )
′′ ′′
f (x j+1 ) − f ′′ (x j )
M j+1 − M j f (x j ) − M j
S′′′ ′′′
j (x) − f (x) = − f ′′′ (x) = + + ′′′
− f (x)
hj hj hj hj
M j+1 − f ′′ (x j+1 )
′′
Ch2
f (x j ) − M j
+ ≤ ≤ CKh.
hj hj hj
f ′′ (x j+1 ) − f ′′ (x j )
− f ′′′ (x) = f ′′′ (x j ) − f ′′′ (x) + O(h) = O(h).
hj
Case k=0. Since f (xi ) − S(xi ) = 0 again for x ∈ [x j , x j+1 ] by the Fundamental Theorem of Calculus
Z x
S′j (t) − f ′ (t) dt ≤ |x j − x|Ch3 ≤ Ch4 ,
S j (x) − f (x) = (2.36)
xj
3 Linear systems
Linear algebra is a wonderful subject. One of the wonderful aspects of linear algebra is a variety of ways one can look
at the same problem. Sometimes a hard problem what appears from one perspective can turn out to be trivial from
another. To illustrate my point, let’s look at some m × n matrix A ∈ Rm×n
a11 a12 . . . a1n
a21 a22 . . . a2n
A= . .. .. .. .
.. . . .
am1 am2 . . . amn
Looking at this matrix we can see different things. First of all we see m × n entries, or in other word an element of
Rm×n . We also can see m rows A(i, :) ∈ Rn for i = 1, 2, . . . , m, or n columns A(:, j) ∈ Rm for j = 1, 2, . . . , n. More
sophisticated, one can see a map A : Rn → Rm or that matrix A consists of four submatrices A11 ∈ Rm1 ×n1 , A12 ∈
Rm1 ×n2 , A21 ∈ Rm2 ×n1 , A22 ∈ Rm2 ×n2 ,
A11 A12
A= , with n = m1 + m2 and n = n1 + n2 .
A21 A22
Let A ∈ Rm×n and x ∈ Rn . The i-th component of the matrix-vector product y = Ax is defined by
n
yi = ∑ ai j x j (3.1)
j=1
i.e., yi is the dot product (inner product) of the i-th row of A with the vector x.
x
.. .. .. ..
1
. . . . x2
yi = ai1 ai2 · · · ain ..
.
.. .. .. ..
. . . . xn
i.e. the i j-th element of the product matrix is the dot product between i-th row of A and j-th column of B.
b
1j
ci j = ai1 · · · aip .
.. .
bp j
Example 3.1.
Let
12 3 1 2
A = 4 5 6 and B = 3 4 ,
78 9 5 6
then C = AB can be computed as
12 1 3 12 2 3
45 + 5 + 6
3 6 4 5 4 6 22 28
C= = 49 64 .
1 2 76 100
78 +9·5 78 +9·6
3 4
3 Linear systems 33
This idea is key for the asymptotically faster matrix-matrix multiplication of celebrated Strassen Algorithm [2].
Let A ∈ Rm×n and x ∈ Rn . Then from (3.2), Ax is a linear combination of the columns of A, i.e.
a11 a12 a1n
a21 a22 a2n
Ax = x1 . + x2 . + · · · + xn .
.. .. ..
am1 am2 amn
If A ∈ Rm×n then AT ∈ Rn×m is obtained by reflecting the elements with respect to main diagonal. Thus, if A ∈ Rm×n
and B ∈ Rn×k , then
(AB)T = BT AT .
More generally,
(A1 A2 . . . A j )T = ATj . . . AT2 AT1 .
If A ∈ Rn×n is invertible, then AT is invertible and
Definition 3.1 (Lower triangular). A matrix L ∈ Rn×n is called lower triangular matrix if all matrix entries above
the diagonal are equal to zero, i.e., if
li j = 0 for j > i.
Definition 3.2 (Upper triangular). A matrix U ∈ Rn×n is called upper triangular matrix if all matrix entries below
the diagonal are equal to zero, i.e., if
ui j = 0 for i > j.
34 Contents
A Linear system with lower (upper) triangular matrix can be solved by forward substitution (backward substitution).
Matlab code:
if all(diag(u)) == 0
disp(’the matrix is singular’)
else
b(n) = b(n)/U(n,n);
for i = n-1:-1:1
b(i)= (b(i) - U(i,i+1:n)*b(i+1:n))/U(i,i);
end
end
for j = n:-1:2
b(j) = b(j)/U(j,j) ;
b(1:j-1) = b(1:j-1) - b(j)*U(1:j-1,j);
end
b(1) = b(1)/U(1,1);
end
3 Linear systems 35
Gaussian elimination for the solution of a linear system transforms the system Ax = b into an equivalent system with
upper triangular matrix. This is done by applying three types of transformations to the augmented matrix (A|b).
∙ Type 1: Replace an equation with the sum of the same equation and a multiple of another equation;
∙ Type 2: Interchange two equations; and
∙ Type 3: Multiply an equation by a nonzero number.
Once the augmented matrix (A|b) is transformed into (U|y), where U is an upper triangular matrix, we can use the
techniques discussed previously to solve this transformed system Ux = y.
We need to modify Gaussian elimination for two reasons:
∙ improve numerical stability (change how we perform pivoting)
∙ make it more versatile (leads to LU-decomposition)
Definition 3.3 (Partial pivoting). In step i, find row j with j > i such that |a ji | ≥ |aki | for all k > i and exchange rows
i and j. Such numbers a ji we call pivots.
Partial pivoting is not relevant in exact arithmetic. Without partial pivoting with floating point arithmetic the method
can be unstable.
Using the formula
n
n(n + 1)(2n + 1)
∑ j2 = 6
,
j=1
we can calculate that for large n the number of flops in the Gaussian elimination with partial pivoting approximately
equal to 2n3 /3.
3.7 LU decomposition
The Gaussian elimination operates on the augmented matrix (A | b) and performs two types of operations to transform
(A | b) into (U | y), where U is upper triangular. However, the right hand side does not influence how the augmented
matrix (A | b) is transformed into an upper triangular matrix (U | y). This transformation depends only on the matrix
A. Thus, if we keep a record of how A is transformed into an upper triangular matrix U, then we can apply the same
transformation to any right hand side, without re-applying the same transformations to A.
What operations are performed? In step k of the Gaussian elimination with partial pivoting we perform two opera-
tions:
∙ Interchange a row i0 > k with row k.
∙ Add a multiple −lik times row k to row i for i = k + 1, . . . , n.
How do we record this?
∙ Introduce an integer array ipivt. Set
ipivt(k) = i0
when the k-th step the rows k and i0 are interchanged.
∙ Store the −lik in the (i, k)-th position of the array that originally held A. (Remember that we add a multiple −lik
times row k to row i to eliminate the entry (i, k). Hence this storage can be reused.)
36 Contents
a11 a12 a13 · · · · · · a1,n−1 a1n
a21 a22 a23 · · · · · · a2,n−1 a2n
a31 a32 a33 · · · · · · a3,n−1 a3n
.. .. .. .. ..
. . . . .
an−1,1 an−1,2 an−1,3 · · · · · · an−1,n−1 an−1,n
an1 an2 an3 · · · · · · an,n−1 ann
↓
Step 1
↓
a11 a12 a13 · · · · · · a1,n−1 a1n
−l21 a22 a23 · · · · · · a2,n−1 a2n
−l31 a32 a33 · · · · · · a3,n−1 a3n
.. .. .. .. ..
. . . . .
−ln−1,1 an−1,2 an−1,3 · · · · · · an−1,n−1 an−1,n
−ln1 an2 an3 · · · · · · an,n−1 ann
↓
Step 2
↓
a11 a12 a13 · · · · · · a1,n−1 a1n
−l21 a22 a23 · · · · · · a2,n−1 a2n
−l31 −l32 a33 · · · · · · a3,n−1 a3n
.. .. .. .. ..
. . . . .
−ln−1,1 −ln−1,2 an−1,3 · · · · · · an−1,n−1 an−1,n
−ln1 −ln2 an3 · · · · · · an,n−1 ann
↓
Step n-1
↓
a11 a12 a13 · · · · · · a1,n−1 a1n
−l21 a22 a23 · · · · · · a2,n−1 a2n
−l31 −l32 a33 · · · · · · a3,n−1 a3n
.. .. .. .. .. ..
.
. . . . .
. .. .. ..
..
. . . an−1,n−1 an−1,n
−ln1 −ln2 −ln3 · · · · · · −ln,n−1 ann
Row interchange in step k can be expressed by multiplying with
3 Linear systems 37
1
..
.
1
0 1 ←k
1
Pk =
..
.
1
← ipivt(k)
1 0
1
..
.
1
↑ ↑
k ipivt(k)
from the left. Pk is a permutation matrix.
Easy to see that Pk satisfies Pk = PkT and Pk2 = Id and as a result Pk−1 = Pk . Furthermore, Pk A interchanges rows k
and ipivt(k) of A, but APk interchanges columns k and ipivt(k) of A.
Adding −li,k times row k to row i for i = k + 1, . . . , n is equivalent to multiplying from the left with
1
..
.
1
Mk =
−lk+1,k
.. ..
. .
−ln,k 1
Observe that matrix Mk is lower triangular and is called a Gauss transformation. Easy to see that Mk is invertible and
1
..
.
−1
1
Mk =
lk+1,k
.. . .
. .
ln,k 1
Furthermore, product of two lower triangular matrices results in a lower triangular matrix, i.e. Mk M j is again lower
triangular
The transformation of a matrix A into an upper triangular matrix U can be expressed as a sequence of matrix-matrix
multiplications
Mn−1 Pn−1 . . . M1 P1 A = U.
Above we have observed that Pk and Mk are invertible. Hence, if we have to solve Ax = b and if
Mn−1 Pn−1 . . . M1 P1 A = U,
Mn−1 Pn−1 . . . M1 P1 b = y
and solve
Ux = y.
We observe that for j > k
Pj Mk = M̃k Pj , (3.3)
where M̃k is obtained from Mk by interchanging the entries −l j,k and −lipivt( j),k . M̃k has the same structure as Mk and
we can easily compute M̃k−1 . Using (3.3) we can move all Pj ’s to the right of the M̃k ’s
After step k = 2 the vector ipivt and the matrix A are given by
2 −1
1
ipivt = (2, 3, ) −1 5 2
2 2
1 3 11
2 −5 − 5
Solving
2 1 0 x1 2
0 5 2 x2 = 2
2
11
0 0−5 x3 − 11
5
Then using
100 100 100 100
P2 M1 = 0 0 1 − 12 1 0 = 12 1 0 0 0 1 = M̃1 P2
1
010 2 0 1 − 12 0 1 010
we have
U = M2 M̃1 P2 P1 A.
Calling
1 00 010
L = M̃1−1 M2−1 = − 21 1 0 and P = 0 0 1
1 3 100
2 5 1
we have PA = LU, i.e.
010 1 2 −1 1 00 2 1 0
0 0 1 2 1 0 = −1 1 0 0 5 2 .
2 2
−1 2 2 1 3 11
100 2 5 1 0 0−5
40 Contents
LU-decomposition is very useful one and using it we can compute many quantities rather efficiently.
To solve the linear system Ax = b using LU decomposition PA = LU, we need to solve two triangular systems, Ly = Pb
and Ux = y. It especially beneficial if we need to solve Ax = bn for many right hand sides, as for computing A−1 .
Computing A−1 is rarely requited. Usually we need to compute A−1 b, which means we need to solve a linear system
Ax = b. In the rare occasion when the explicit form of A−1 is needed, we can use LU-decomposition to find it by using
O(n3 ) operations. Recall that A−1 is a unique matrix X such that
AX = I,
where I is the identity matrix. Denote the columns of the matrix X be xi , i = 1, . . . , n. Thus in column notation the
above equation is equivalent to
[Ax1 , Ax2 , . . . , Axn ] = [e1 , e2 , . . . , en ].
In other words we need to solve n equations
Axi = ei , i = 1, . . . , n.
In section 3.8.1 above we explained, how we can solve it using LU decomposition. Notice, LU decomposition takes
3
O( 2n3 ) operations solving triangular systems O(n2 ) operations, and since we need to solve n of them it take O(n3 )
operations to compute A−1 explicitly.
det(C) = det(A)det(B).
det(P)det(A) = det(L)det(U).
Since P is a permutation matrix det(P) = ±1. More, precisely, 1 if the permutation is even, and −1 if it is odd. Since L
and U are triangular, their determinates are just product of diagonal elements. Thus, det(L) = 1 and det(U) = ∏ni=1 uii .
As a result
n
det(A) = ± ∏ uii .
i=1
3 Linear systems 41
Definition 3.5. A ∈ Rn×n is called symmetric positive definite if A = AT and vT Av > 0 for all v ∈ Rn , v ̸= 0.
If A ∈ Rn×n is symmetric positive definite, then the LU decomposition can be computed in a stable way without
permutation, i.e.,
A = LU
and more efficiently.
First we notice that if L is a unit lower triangular matrix and U is an upper triangular matrix such that A = LU,
then L and U are unique, i.e., if there L̃ is a unit lower triangular matrix and Ũ is an upper triangular matrix such
that A = L̃Ũ, then L = L̃ and U = Ũ . Furthermore, if A ∈ Rn×n is symmetric positive definite and A = LU, then the
diagonal entries of U are positive and we can write it as U = DŨ, where Ũ is unit upper triangular and
Thus,
A = LU = LDŨ.
On the other hand
AT = (LDŨ)T = Ũ T DLT .
Using that A = AT and LU decomposition is unique
Thus,
L = Ũ T and U = DLT .
We showed that if A is a symmetric positive definite matrix, then
A = LDLT .
Recall that
D = diag(u11 , . . . , unn )
has positive diagonal entries. So we can define
√ √
D1/2 = diag( u11 , . . . unn ).
Matlab’s function [R] = chol(A) computes the matrix R such that A = RT R. By clever implementation, one can show
that the total number of basic operations is of order n3 /3, comparing to 2n3 /3 for LU decomposition.
Let
d1 e1
c1 d2 e2
A=
.. .. ..
. . .
cn−2 dn−1 en−1
cn−1 dn
Since there are only about 3n entries in matrix A, a natural question, how can we compute the LU-decomposition of A
efficiently? Let’s look at LU decomposition of A
d e
1 1 1
− dc1 1 d˜2 e2
1
. c2 d3 e3
.
0 0 . , .. .. ..
.
. . .
.. 1
cn−2 dn−1 en−1
01 cn−1 dn
| {z } | {z }
=M1 c
=M1 A, where d˜2 =d2 − d1 e1
1
1
d1 e1
0 1 d˜2 e2
˜
− dc2 1 d3 e3
2
c3 d4 e4
. ,
0 0 ..
.. .. ..
..
. . .
.1
cn−2 dn−1 en−1
0 1 cn−1 dn
| {z } | {z }
=M2 c
=M2 M1 A, where d˜3 =d3 − ˜2 e2
d2
˜ = di+1 − ci ei , for i = 1 : n − 1.
where di+1 d˜ i
We remind that a norm ‖·‖ on a vector space V over R is a function ‖·‖ : V → R+ that satisfies the following properties
1. (Positivity) ‖x‖ > 0 for any non-zero x ∈ V .
2. (Scalability) ‖αx‖ = |α|‖x‖, for any α ∈ R.
3. (Triangle inequalit) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for any x, y ∈ V
As a consequence of the triangle inequality, we have the for following inequality.
‖x‖1 = 1 + 2 + 3 + 4 = 10,
√ √
‖x‖2 = 1 + 4 + 9 + 16 = 30 ≈ 5.48,
‖x‖∞ = max {1, 2, 3, 4} = 4.
{x ∈ Rn : ‖x‖ p ≤ 1}.
TO ADD PICTURE
The following inequalities hold:
‖x‖∞ ≤ ‖x‖2 ≤ ‖x‖1 .
Theorem 3.4. All vector norms on Rn are equivalent, i.e. for every two vector norms ‖ · ‖a and ‖ · ‖b on Rn there exist
constants cab , Cab (depending on the vector norms ‖ · ‖a and ‖ · ‖b , but not on x) such that
x = x1 e1 + · · · + xn en .
It follows that
n n
‖x‖ ≤ ‖x‖∞ ∑ ‖e j ‖ ≤ γ‖x‖∞ , with γ = ∑ ‖e j ‖.
j=1 j=1
x
‖y0 ‖ ≤ ≤ ‖y1 ‖,
‖x‖∞
which shows
m‖y‖∞ ≤ ‖y‖∞ ≤ M‖y‖∞ ,
with m = ‖y0 ‖ and M = ‖y1 ‖.
Remark 3.1. Although all vector norm are equivalent, they are not equivalent with respect to the dimensions n. For
example, for 1 = (1, . . . , 1)T we have
√
‖1‖∞ = 1, ‖1‖2 = n, ‖1‖1 = n.
1
√ ‖x‖1 ≤ ‖x‖2 ≤ ‖x‖1
n
√
‖x‖∞ ≤ ‖x‖2 ≤ n‖x‖∞
‖x‖∞ ≤ ‖x‖1 ≤ n‖x‖∞ .
We can view a matrix A ∈ Rm×n as a vector in Rmn , by stacking the columns of the matrix into a long vector. Appling
the 2-vector norm to this vectors of length mn, we obtain the Frobenius norm,
!1/2
n m
‖A‖F = ∑∑ a2i j .
i=1 j=1
In many√situations the Frobenius norm is not convenient. One of the reasons is that for the identity matrix I ∈ Rn×n ,
‖I‖F = n.
Another approach is to view a matrix A ∈ Rm×n as a linear mapping, which maps a vector x ∈ Rn into a vector
Ax ∈ Rm
A : Rn → Rm
x → Ax.
To define the size of this linear mapping, we compare the size of the image Ax ∈ Rm with the size of x. This leads us
to look at
‖Ax‖
sup
x̸=0 ‖x‖
Here Ax ∈ Rm and x ∈ Rn are vectors and ‖ · ‖ are vector norms (in Rm and Rn ).
Definition 3.8 (Matrix p-norm).
‖Ax‖ p
‖A‖ p = max , 1 ≤ p ≤ ∞. (3.4)
x̸=0 ‖x‖ p
The following holds,
‖Ax‖ p ‖Ax‖ p
sup x‖ p = max = max ‖Ax‖ p .
x̸=0 ‖x‖ p x̸ =0 ‖x‖ p ‖x‖ p =1
‖Ix‖ p
‖I‖ p = max = 1.
x̸=0 ‖x‖ p
and
‖AB‖ p ≤ ‖A‖ p ‖B‖ p (submultiplicativity of matrix norms)
Proof. The first statement follows directly from the definition of the p matrix norm. The second statement follows that
for B ̸= 0
46 Contents
and as a result
m
‖A‖1 ≤ max ∑ |aik |.
1≤k≤n i=1
Then,
m m
‖Ae j0 ‖1 = ∑ |ai j0 | = max ∑ |aik |.
i=1 1≤k≤n i=1
‖A‖1 = ‖AT ‖∞
‖A‖2 = ‖AT ‖2 .
3 Linear systems 47
(A + ∆ A)x̃ = b + ∆ b, (3.6)
where ∆ A ∈ Rn×n and ∆ b ∈ Rn represent the perturbations in A and b, respectively. The main question we are faced,
what is the error ∆ x = x̃ − x between the solution x of the exact linear system (3.5) and the solution ex perturbed linear
system (3.10). For the simplicity of the representation, let’s us first consider the case when A is exact.
Theorem 3.7. Consider the perturbed system (3.10) with ∆ A = 0. Then the relative error
‖∆ x‖ ‖∆ b‖
≤ ‖A‖‖A−1 ‖ ,
‖x‖ ‖b‖
Definition 3.10. If κ p (A) is small, we say that the linear system is well conditioned.
Otherwise, we say that the linear system is ill conditioned.
To obtain similar result for the fully perturbed system we need the following auxiliary result.
Lemma 3.1. Let B ∈ Rn×n be arbitrary with ‖B‖ < 1, where ‖ · ‖ denotes any p matrix norm. Then I + B is invertible
and
1
‖(I + B)−1 ‖ ≤ .
1 − ‖B‖
Proof. Since by the triangle inequality and the assumption of the lemma, for any x ̸= 0,
1 = ‖I‖ = ‖(I + B)C‖ = ‖C + BC‖ ≥ ‖C‖ − ‖CB‖ ≥ ‖C‖ − ‖C‖‖B‖ = ‖C‖(1 − ‖B‖),
Thus using the previous lemma, (A + ∆ A) is a product of two invertible matrices A and (I + A−1 ∆ A), hence invertible.
Since
∆ x = x̃ − x,
we have
(A + ∆ A)∆ x = (A + ∆ A)x̃ − (A + ∆ A)x = (b + ∆ b) − (b + ∆ Ax) = ∆ b − ∆ Ax.
Hence
‖∆ x‖ ≤ ‖(A + ∆ A)−1 (∆ b − ∆ Ax)‖ ≤ ‖(A + ∆ A)−1 ‖ (‖∆ b‖ + ‖∆ Ax‖) .
Using the Lemma 3.1 with B = A−1 ∆ A, we have
‖A−1 ‖
‖(A + ∆ A)−1 ‖ = ‖(A(I + A−1 ∆ A))−1 ‖ = ‖((I + A−1 ∆ A)−1 A−1 ‖ ≤ .
1 − ‖A−1 ‖‖∆ A‖
As a result
4 Numerical Integration 49
‖A−1 ‖
‖∆ x‖ ≤ (‖∆ b‖ + ‖∆ A‖‖x‖).
1 − ‖A−1 ‖‖∆ A‖
Using that ‖b‖ ≤ ‖Ax‖ ≤ ‖A‖‖x‖, we finally obtain
‖A−1 ‖
‖∆ x‖ ‖∆ b‖ ‖∆ A‖‖x‖
≤ +
‖x‖ 1 − ‖A−1 ‖‖∆ A‖ ‖x‖ ‖x‖
−1
‖A ‖‖A‖ ‖∆ b‖ ‖∆ A‖
≤ +
1 − ‖A−1 ‖‖∆ A‖ ‖b‖ ‖A‖
κ(A) ‖∆ b‖ ‖∆ A‖
≤ + .
1 − κ(A) ‖∆ A‖
‖A‖
‖b‖ ‖A‖
If we solve the linear system in m-digit floating point arithmetic, then, as rule of thumb, we may approximate the
the input errors due to rounding by
‖∆ A‖ ‖∆ x‖
≈ 0.5 * 10−m+1 , ≈ 0.5 * 10−m+1
‖A‖ ‖x‖
‖∆ x‖ 10α
≤ (0.5 * 10−m + 0.5 * 10−m ) ≈ 10α−m .
‖x‖ 1 − 10α−m+1
4 Numerical Integration
Definition 4.1. ∙ In the above formula xi ∈ [a, b] are called the nodes of the integration formula and wi are called the
weights of the integration formula.
∙ When we approximate ab f (x) dx by ∑ni=1 wi f (xi ) we speak of numerical integration or numerical quadrature.
R
b−a
x = a+ (z − α)
β −α
if we have computed weights w bi and nodes b zi for the numerical integration on an interval [α, β ], then we can use the
above identity to approximate the integral of f over any interval [a, b] (assuming, of course, that this integral exists) by
Z b
b−a β b−a
Z
f (x) dx = f a+ (z − α) dz
a β −α α β −α
b−a n
b−a
≈ ∑ wbi f a + β − α (bzi − α) .
β − α i=1
That is, the weights wi and nodes xi for the numerical integration on the interval [a, b] are
b−a b−a
wi = bi ,
w xi = a + zi − α).
(b
β −α β −α
This means it is sufficient to compute weights and nodes for the numerical integration on a certain interval like [0, 1]
or [−1, 1], often called the reference intervals.
Before we discuss several quadrature methods, we list some properties of the integral which are important for the
development of quadrature rules. First, we note that
Z b
1 dx = b − a.
a
Therefore we require
n
∑ wi = b − a,
i=1
Otherwise, our quadrature formula could not even evaluate the integral of a constant function exactly.
Another useful property of the integral is
Z b
f (x) ≥ 0 =⇒ f (x) dx ≥ 0.
a
If
wi ≥ 0, i = 1, . . . , n,
then
n
∑ wi f (xi ) ≥ 0,
i=1
p(x) ≈ f (x),
4 Numerical Integration 51
then Z b Z b
p(x) dx ≈ f (x) dx
a a
Thus we need a function p(x) which close to f (x) and easy to integrate.
The natural choice is the interpolating polynomials and their properties that we developed in Section 2.
Chose nodes x1 , . . . , xn in the interval [a, b] and compute the polynomial P( f |x1 , . . . , xn ) of degree less or equal to n
interpolating f at x1 , . . . , xn . If we use the approximation
If we substitute this representation of the interpolation polynomial into (4.2), then we obtain
Z b Z b Z b n n n Z b n
x−xj x−xj
f (x) dx ≈ P( f |x1 , . . . , xn )(x) dx = ∑ f (xi ) ∏ xi − x j dx = ∑ f (xi ) ∏ dx.
a a a i=1 j=1 i=1 a j=1 xi − x j
j̸=i j̸=i
where Z b n
x−xj
wi = ∏ dx.
a j=1 xi − x j
j̸=i
a+b
Example 4.1 (Midpoint Rule). The simplest quadrature formula can be constructed using n = 1 and x1 = 2 . Since
1 x−xj
∏ xi − x j = 1
j=1
j̸=i
x − b+a
Z b
2 x−b b−a
b+a
dx = ,
a a− 2
a−b 6
Z b
x−a x−b b−a
b+a b+a
dx = 4 ,
a 2 −a 2 −b 6
x − b+a
Z b
2 x−b b−a
dx = .
a b − b+a
2
b−a 6
Since the interpolation polynomial is uniquely determined, the interpolating polynomial for a polynomial pn−1 of
degree less or equal to n − 1 is the polynomial itself:
4 Numerical Integration 53
for all polynomials pn−1 of degree less or equal to n we say that the integration method is exact of degree n − 1.
In Theorem 2.9, we have established that for any x̄ ∈ (a, b)
1
f (x̄) − P( f | x1 , . . . , xn )(x̄)| = ω(x̄) f (n) (ξ (x̄)),
n!
where ω(x) = ∏nj=1 (x − x j ). Combining the above estimates, and integrating for the Newton-Cotes methods we obtain
Z b n Z b n
1
f (x)dx − ∑ wi f (xi ) = f (n) (ξ (x)) ∏ (x − x j )dx. (4.6)
a i=1 n! a j=1
However, using the weighted mean value theorem the for integrals, namely
Theorem 4.1 (Weighted Mean-Value Theorem for Integrals). Suppose f is continuous on [a, b] and g is integrable
on [a, b] and does not change sign. Then there exists c ∈ (a, b) such that
Z b Z b
f (x)g(x)dx = f (c) g(x)dx.
a a
Example 4.4 (Convergence for Trapezoidal Rule). If f ∈ C2 (a, b), using Theorem 4.1 for the Trapezoidal rule (n = 2,
h = b − a) we obtain
Z b Zb
b−a 1 ′′
f (x)dx − ( f (a) + f (b))= f (x)(x − a)(x − b)dx
a 2 2 a
′′
f (c) b
Z
= (x − a)(x − b)dx
2 a
′′
f (c) (b − a)3
=
2 6
h3
≤ max | f ′′ (x)|.
12 a≤x≤b
Example 4.5 (Convergence for Midpoint Rule).
If f ∈ C1 (a, b), using (4.6) for the Midpoint rule (n = 1, h = b − a) we obtain
54 Contents
Z b Z
a + b 1 b ′
a+b
f (x)dx − (b − a) f = f (ξ (x)) x − dx .
a 2 2 a 2
Since x − a+b
2 changes sign on (a, b) we can not use Theorem 4.1 and it seems the best we can do is just use (4.7) to
obtain
h2
Z b Z b
a + b 1 ′ a + b
max | f ′ (x)|.
a f (x)dx − (b − a) f ≤ max | f (x)| x− dx =
2 2 a≤x≤b a
2 8 a≤x≤b
However, looking at the form of the error, we can notice that
Z b
a+b
x− dx = 0,
a 2
thus we can subtract any constant multiple of it from the function f (x) without changing the value. Thus
Z b Z b
a+b a+b a+b a+b
f (x)dx − (b − a) f = f (x) − f ′ x− dx − (b − a) f .
a 2 a 2 2 2
a+b 2
a+b a+b a+b 1
f (x) − f − f′ x− = f ′′ (ξ (x)) x − .
2 2 2 2 2
2
Now x − a+b
2 does not change the sign on (a, b) and we can use Theorem 4.1 to obtain
2
(b − a)3 h3
Z b Z b
a+b 1 a+b
f (x)dx − (b − a) f = f ′′ (c) x− dx ≤ max | f ′′ (x)| = max | f ′′ (x)|.
a 2 2 a 2 24 a≤x≤b 24 a≤x≤b
It is interesting to observe that the constant in the Trapezoidal method is twice as large as for the Midpoint Rule.
Example 4.6 (Convergence for Simpson’s Rule). If f ∈ C3 (a, b), using (4.6) for the Simpson’s rule (n = 3, x1 = a,
x2 = a+b b−a
2 , x3 = b, h = 2 ) we obtain
Z b Z b
h a+b 1 ′′′ a+b
a f (x)dx − 3 f (a) + 4 f + f (b) = f (ξ (x))(x − a) x − (x − b)dx .
2 6 a 2
However, exactly as in the Midpoint Rule example, looking at the form of the error, we can notice that
4 Numerical Integration 55
Z b
a+b
(x − a) x − (x − b)dx = 0,
a 2
thus we can subtract any constant multiple of it from the function f (x) without changing the value. In other words we
can replace the interpolating polynomial P( f |x1 , x3 , x2 )(x) that gives rise to the Simpson’s Rule with
without changing the value of the numerical quadrature. Recalling the divided difference formula
f [x j+1, . . . , xk ] − f [x j , . . . , xk−1 ]
f [x j , . . . , xk ] = . (4.9)
xk − x j
We can extend the formula 6.4 to have equal nodes, i.e. xi = x j for some i ̸= j, with the convention that
f [xi , xi ] = f ′ (xi ),
p(x) = f (x1 ) + f [x1 , x3 ](x − x1 ) + f [x1 , x3 , x2 ](x − x1 )(x − x3 ) + f [x1 , x3 , x2 , x2 ](x − x1 )(x − x2 )(x − x3 ). (4.10)
1 (4)
f (x̄)− f (x1 )− f [x1 , x3 ](x̄−x1 )− f [x1 , x3 , x2 ](x̄−x1 )(x̄−x3 )− f [x1 , x3 , x2 , x2 ](x̄−x1 )(x̄−x2 )(x̄−x3 ) = f (ξ (x̄))ω(x̄),
4!
where ω(x) = (x − x1 )(x − x2 )2 (x − x3 ).
Proof. The proof is almost identical to the proof of Theorem 2.9. If x = xi for some i = 1, . . . , n, then the result naturally
holds. Assume x ̸= xi for any i = 1, . . . , n. Consider a function
ψ(x) = f (x) − f (x1 ) − f [x1 , x3 ](x̄ − x1 ) − f [x1 , x3 , x2 ](x̄ − x1 )(x̄ − x3 ) − f [x1 , x3 , x2 , x2 ](x̄ − x1 )(x̄ − x2 )(x̄ − x3 ) − cω(x),
With this choice of the constant c, the function ψ(x) has at least 4 roots, namely at x1 , x2 , x3 and x̄. Thus, from the
(1)
Rolle’s Theorem there exist 3 points, call them xi , i = 1, 2, 3, such that
(1)
ψ ′ (xi ) = 0, i = 1, 2, 3.
ψ ′ (x2 ) = 0.
56 Contents
(1)
Since xi ̸= x2 for all i = 1, 2, 3, it follows that ψ ′ (x) has at least 4 roots as well, namely at x1 , x2 , x3 and x̄. Thus, from
(2)
the Rolle’s Theorem there exist 3 points, call them xi , i = 1, 2, 3, such that
(1)
ψ ′′ (xi ) = 0, i = 1, 2, 3.
(4)
The of the proof is identical to the proof of Theorem 2.9. Continue this process, there exists a point x1 such that
(n)
ψ (4) (x1 ) = 0.
d4 dn d4
4
ψ(x) = 4 f (x) − 0 − c 4 ω(x).
dx dx dx
Since ω(x) is a polynomial of degree 4 with leading coefficient 1, we have
dn
ω(x) = n!.
dxn
As a result
(4)
(4) (4) f (4) (x1 )
ψ (4) (x1 ) = f (4) (x1 ) − c4! = 0 ⇒ c= ,
4!
(4)
which shows the proposition with ξ (x̄) = x1 .
Now we can continue with the Example 4.6. Since ω(x) = (x − x1 )(x − x2 )2 (x − x3 ) does not change the sign on the
interval (a, b), if f ∈ C4 (a, b) we can use Theorem 4.1 to obtain
h5
Z b Z b
h a+b 1
f (x)dx − f (a) + 4 f + f (b) = f (4) (c) (x − x1 )(x − x2 )2 (x − x3 )dx ≤ max | f (4) (x)|,
a 3 2 4! a 90 a≤x≤b
b−a
with h = 2 .
Rb
The method discussed in the above example, can be generalized to any method for which a ω(x)dx = 0. In partic-
ular we can obtain the following result.
Theorem 4.2 (Exactness of Newton-Cotes Formulas). Let a ≤ x1 < · · · < xn ≤ b be given and let wi be the nodes
and weights of a Newton-Cotes formula. If n is odd, then the quadrature formula is exact for polynomials of degree n.
If n is even, then the quadrature formula is exact for polynomials of degree n − 1.
The weights and nodes for the most popular closed Newton-Cotes formulas are summarized in the table below.
n h w
bi formula error name
1 1 (b−a) 1 (2)
2 b−a 2, 2 2 ( f (a) + f (b)) h3 12 f (ξ ) Trapezoidal rule
The weights and nodes for the most popular open Newton-Cotes formulas are summarized in the table below.
n h w
bi formula error name
1 b−a
2 1 (b − a) f ( a+b
2 ) h3 13 f (2) (ξ ) Midpoint rule
1 1 (b−a)
3 b−a
3 2, 2 2 ( f (x1 ) + f (x2 )) h3 14 f (2) (ξ ) Trapezoid method
2 1 2 (b−a)
4 b−a
4 3,−3, 3 3 (2 f (x1 ) − f (x2 ) + 2 f (x3 )) h5 28
90 f
(4) (ξ ) Milne’s rule
Rb R x j+1
Now we can approximate a f (x)dx by approximating each integral xj f (x)dx by a (low degree) quadrature formula,
Z x j+1 n
f (x)dx ≈ ∑ w ji f (x ji )
xj i=1
and Z b n Z x j+1 m m
f (x)dx = ∑ f (x)dx ≈ ∑ ∑ w ji f (x ji ).
a i=1 x j j=1 i=0
The function values f (x2 ), . . . , f (xm ) appear twice in the summation. This can be utilized in the implementation of the
composite Trapezoidal rule:
58 Contents
Z b m
x2 − x1 x j − x j−1 x j+1 − x j xm+1 − xm
f (x)dx ≈ f (x1 ) + ∑ + f (x j ) + f (xm+1 ).
a 2 j=1 2 2 2
Notice that the function values f (x2 ), . . . , f (xm ) appear twice in the summation. This has to be utilized in the
implementation of the composite Simpson rule.
The idea of Gauss Quadrature is to choose the nodes x1, . . . , xn and the weights w1, . . . , wn such that the formula

∫_a^b p(x) dx ≈ ∑_{i=1}^n wi p(xi)                                    (4.11)

is exact for polynomials of the maximal possible degree. For example, if we require the formula (4.11) to be exact for the monomials
1, x, . . . , x^N, we obtain a nonlinear system of N + 1 equations with 2n unknowns. Thus we can expect the formula to be exact for
polynomials of degree 2n − 1.
Example 4.10 (Case n = 2). Let us determine the weights w1 and w2 and the nodes x1 and x2 such that

w1 p(x1) + w2 p(x2) = ∫_{−1}^{1} p(x) dx

holds for polynomials p(x) of degree 3 or less. This seems possible since we have 4 parameters w1, w2, x1, x2 to choose
and exactly 4 coefficients are needed in order to define uniquely a polynomial of degree 3. Forcing the formula to be exact
for 1, x, x², and x³ leads to

w1 + w2 = ∫_{−1}^{1} 1 dx = 2,
w1 x1 + w2 x2 = ∫_{−1}^{1} x dx = 0,
w1 x1² + w2 x2² = ∫_{−1}^{1} x² dx = 2/3,
w1 x1³ + w2 x2³ = ∫_{−1}^{1} x³ dx = 0,

a nonlinear system of 4 equations with 4 unknowns. We can easily solve this system analytically to obtain

w1 = w2 = 1,   x1 = −1/√3,   x2 = 1/√3.
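As a quick check (my own illustration), the following Matlab lines confirm that this two-point rule integrates the monomials x^k exactly for k ≤ 3, but not for k = 4:

% Two-point Gauss rule on [-1,1]: nodes -1/sqrt(3), 1/sqrt(3), weights 1, 1.
x = [-1 1]/sqrt(3);  w = [1 1];
for k = 0:4
    Q = w*(x.^k)';                  % quadrature value for p(x) = x^k
    I = (1 - (-1)^(k+1))/(k+1);     % exact value of int_{-1}^{1} x^k dx
    fprintf('k = %d: quadrature = %8.5f, exact = %8.5f\n', k, Q, I);
end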
Actually, we will consider a more general problem. We want to choose the nodes x1, . . . , xn and the weights w1, . . . , wn
such that the formula

∫_a^b ω(x) p(x) dx ≈ ∑_{i=1}^n wi p(xi)                                (4.12)

is exact for polynomials of the maximal possible degree, where ω(x) is some positive weight function on (a, b), for example
1, 1/(1 + x²), or e^{−x²}.
It seems reasonable to expect the quadrature formula (4.12) to be exact for polynomials of degree 2n − 1 or less,
but not more than that, as the following result shows.
Lemma 4.1. There is no choice of nodes x1, . . . , xn and weights w1, . . . , wn such that

∫_a^b ω(x) p(x) dx = ∑_{i=1}^n wi p(xi)

holds for all polynomials p of degree 2n.

Proof. Suppose such nodes and weights exist and consider

g(x) = (x − x1)² . . . (x − xn)².

It is easy to see that g is a polynomial of degree 2n and g ≥ 0 on (a, b), with g > 0 except at the nodes. Thus we would obtain

0 < ∫_a^b ω(x) g(x) dx = ∑_{i=1}^n wi g(xi) = 0,

a contradiction.
Lemma 4.2. If (4.12) is exact for all polynomials of degree 2n − 1 or less, then the polynomials p*_0(x), p*_1(x), . . . , p*_n(x)
given by their roots, i.e.

p*_0(x) = 1,   p*_j(x) = ∏_{i=1}^{j} (x − xi),  j = 1, 2, . . . , n,

satisfy

∫_a^b ω(x) p*_n(x) p*_j(x) dx = 0,   for j = 0, 1, . . . , n − 1.

Proof. If j < n, then p*_n(x) p*_j(x) is a polynomial of degree at most 2n − 1, and since (4.12) is exact for p*_n(x) p*_j(x), we
obtain

∫_a^b ω(x) p*_n(x) p*_j(x) dx = ∑_{k=1}^n wk p*_n(xk) p*_j(xk) = 0,

because p*_n(xk) = 0 at every node xk.
Definition 4.2. For two integrable functions f(x) and g(x), and a given weight function ω(x) ≥ 0, we define a
(weighted) inner-product by

( f , g) = ∫_a^b ω(x) f(x) g(x) dx.

One can easily check that the above definition indeed satisfies all the conditions for an inner-product. Once we have a
notion of an inner-product for functions, we can define orthogonality.
Definition 4.3. We say that two functions f (x) and g(x) are orthogonal if
( f , g) = 0.
Thus Lemma 4.2 says that if (4.12) is exact for all polynomials of degree 2n − 1 or less, then the nodes x1, . . . , xn
are the roots of a polynomial of degree n that is orthogonal to all polynomials of lower degree. In general, the roots of a
polynomial can be repeated, complex, or lie outside of (a, b); this is not the case for orthogonal polynomials.

Lemma 4.3. The roots of an orthogonal polynomial, in the sense of Definition 4.3, are real, simple, and lie inside
the interval (a, b).
Proof. Given an orthogonal polynomial pn(x) of degree n on (a, b), let x1, . . . , xr be the points in (a, b) where pn(x) changes
its sign. Consider the function g(x) = (x − x1) . . . (x − xr). Then pn(x) g(x) does not change sign on (a, b). If r < n, then g is a
polynomial of degree at most n − 1 and, by orthogonality,

0 < ∫_a^b ω(x) pn(x) g(x) dx = (pn, g) = 0,

a contradiction. Thus r = n, which means that pn(x) changes sign n times inside (a, b), and from this fact the conclusion of the lemma
follows.
The next result can be thought of as the Gram-Schmidt orthogonalization process for polynomials.
Lemma 4.4. We can construct polynomials p*_j(x), j = 0, 1, . . . , n, such that p*_j has degree j, leading coefficient 1, and

(p*_i, p*_j) = 0   for i ≠ j.

In addition, the polynomials p*_j(x) satisfy the three-term recursion

p*_{j+1}(x) = (x − δ_{j+1}) p*_j(x) − γ_{j+1}² p*_{j−1}(x),                    (4.13)

with δ_{j+1} = (x p*_j, p*_j)/(p*_j, p*_j) and γ_{j+1}² = (p*_j, p*_j)/(p*_{j−1}, p*_{j−1}).
Proof. The proof is by induction. Suppose we have constructed such polynomials p*_j(x) for j ≤ m and established that they
are unique. Next, we want to construct p*_{m+1}(x) such that (p*_{m+1}, p*_j) = 0 for j ≤ m and such that (4.13) is satisfied.
Any polynomial of degree m + 1 with leading coefficient 1 can be written uniquely as

p_{m+1}(x) = (x − δ_{m+1}) p*_m(x) + c_{m−1} p*_{m−1}(x) + · · · + c_0 p*_0(x).            (4.14)

Requiring orthogonality to p*_m gives

0 = (p_{m+1}, p*_m) = ((x − δ_{m+1}) p*_m, p*_m) + c_{m−1}(p*_{m−1}, p*_m) + · · · + c_0(p*_0, p*_m) = (x p*_m, p*_m) − δ_{m+1}(p*_m, p*_m),

which determines δ_{m+1} = (x p*_m, p*_m)/(p*_m, p*_m). Similarly, for j ≤ m − 1,

0 = (p_{m+1}, p*_j) = ((x − δ_{m+1}) p*_m, p*_j) + c_{m−1}(p*_{m−1}, p*_j) + · · · + c_j(p*_j, p*_j) + · · · + c_0(p*_0, p*_j) = (p*_m, x p*_j) + c_j(p*_j, p*_j).
                                                                                                                            (4.15)
By the induction hypothesis, for j ≤ m − 1,

p*_{j+1}(x) = (x − δ_{j+1}) p*_j(x) − γ_{j+1}² p*_{j−1}(x),

hence

x p*_j(x) = p*_{j+1}(x) + δ_{j+1} p*_j(x) + γ_{j+1}² p*_{j−1}(x)   for j ≤ m − 1.

Plugging this into (4.15) and using that j ≤ m − 1, we obtain c_j = 0 for j ≤ m − 2 and

c_{m−1} = − (p*_m, p*_m)/(p*_{m−1}, p*_{m−1}) =: −γ_{m+1}²,

which establishes (4.13).
The above lemma allows us to generate the orthogonal polynomials recursively. Later, we will see that we can compute
the real roots of any polynomial very efficiently. We also have an alternative way to compute the nodes xi .
Theorem 4.3. The roots xi, i = 1, . . . , n, of p*_n are the eigenvalues of the tridiagonal matrix

        ( δ1  γ2                )
        ( γ2  δ2  γ3            )
 Jn  =  (     γ3   ·    ·       )
        (          ·    ·   γn  )
        (              γn   δn  )
Proof. The proof follows by showing that the characteristic polynomials of the leading principal submatrices of Jn satisfy
the same three-term recursion (4.13) as the orthogonal polynomials p*_j(x).

Lemma 4.5. The matrix

        ( p*_0(x1)      p*_0(x2)      . . .   p*_0(xn)     )
 P  =   ( p*_1(x1)      p*_1(x2)      . . .   p*_1(xn)     )
        (    ...           ...        . . .      ...       )
        ( p*_{n−1}(x1)  p*_{n−1}(x2)  . . .   p*_{n−1}(xn) )

is non-singular.
Proof. We will prove the result by contradiction. Assume the matrix P is singular. Then there exists a nonzero vector z ∈ Rn
such that zT P = 0T. Hence

∑_{j=0}^{n−1} zj p*_j(xi) = 0   for i = 1, . . . , n.

Notice that

q(x) = ∑_{j=0}^{n−1} zj p*_j(x)

is a polynomial of degree at most n − 1 that has n roots, namely at x1, . . . , xn. Hence q(x) ≡ 0. Let k be the largest index such
that zk ≠ 0. Then

p*_k(x) = − (1/zk) ∑_{i=0}^{k−1} zi p*_i(x),

which is a contradiction, since on the left we have a polynomial of degree k and on the right a polynomial of degree less than k.
Theorem 4.4. Let x1, . . . , xn be the roots of p*_n and let w = (w1, . . . , wn)T be the solution of the system

Pw = g,                                                               (4.16)

where P is the matrix defined in Lemma 4.5 and g = (g0, . . . , g_{n−1})T is given by

g_i = (p*_0, p*_0) for i = 0   and   g_i = 0 for i = 1, . . . , n − 1.

Then

∫_a^b ω(x) p(x) dx = ∑_{i=1}^n wi p(xi)                               (4.17)

for all polynomials p of degree 2n − 1 or less. Conversely, if (4.17) holds for all polynomials of degree 2n − 1 or less, then the weights satisfy (4.16).

Proof. Let p be a polynomial of degree 2n − 1 or less. Dividing p by p*_n with remainder, we can write

p(x) = p*_n(x) q(x) + r(x),

where q ∈ Pn−1 and r ∈ Pn−1. Since {p*_0, . . . , p*_{n−1}} form a basis for Pn−1, we have
q(x) = ∑_{k=0}^{n−1} αk p*_k(x)   and   r(x) = ∑_{k=0}^{n−1} βk p*_k(x).

Then,

∫_a^b ω(x) p(x) dx = ∫_a^b ω(x)( p*_n(x) q(x) + r(x) ) dx = (p*_n, q) + (r, 1) = ∑_{k=0}^{n−1} αk (p*_n, p*_k) + ∑_{k=0}^{n−1} βk (p*_k, p*_0) = β0 (p*_0, p*_0).

On the other hand, since p*_n(xi) = 0 for all i,

∑_{i=1}^n wi p(xi) = ∑_{i=1}^n wi r(xi) = ∑_{k=0}^{n−1} βk ∑_{i=1}^n wi p*_k(xi) = ∑_{k=0}^{n−1} βk g_k = β0 (p*_0, p*_0),

where we used Pw = g. Hence (4.17) holds.
To show the converse, notice that the nodes x1, . . . , xn are distinct (otherwise we could write the formula (4.17) with
fewer than n nodes, which would contradict Lemma 4.1), hence the matrix P is non-singular.
Applying formula (4.17) with p(x) = p*_k(x), k = 0, . . . , n − 1, we have

∑_{i=1}^n wi p*_k(xi) = ∫_a^b ω(x) p*_k(x) dx = (p*_0, p*_k) = (p*_0, p*_0) for k = 0   and   0 for k = 1, . . . , n − 1,

that is, the weights satisfy Pw = g.
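The following Matlab sketch illustrates Theorems 4.3 and 4.4 for the weight ω(x) = 1 on (−1, 1). The recursion coefficients δj = 0 and γj² = (j−1)²/(4(j−1)²−1) of the monic Legendre polynomials are quoted here without derivation; everything else follows the construction above.

% Gauss nodes and weights for omega(x) = 1 on (-1,1).
n     = 5;
delta = zeros(n,1);                          % delta_j = 0 for Legendre
gam   = zeros(n,1);
for j = 2:n
    gam(j) = (j-1)/sqrt(4*(j-1)^2 - 1);      % gamma_j for Legendre
end
Jn = diag(delta) + diag(gam(2:n),1) + diag(gam(2:n),-1);   % Jacobi matrix
x  = sort(eig(Jn));                          % nodes = eigenvalues (Theorem 4.3)
% evaluate p*_0, ..., p*_{n-1} at the nodes via the three-term recursion
P      = zeros(n,n);
P(1,:) = ones(1,n);                          % p*_0 = 1
P(2,:) = (x - delta(1))';                    % p*_1(x) = x - delta_1
for j = 2:n-1
    P(j+1,:) = (x' - delta(j)).*P(j,:) - gam(j)^2*P(j-1,:);
end
g = [2; zeros(n-1,1)];                       % (p*_0,p*_0) = int_{-1}^{1} 1 dx = 2
w = P\g;                                     % weights from Pw = g (Theorem 4.4)
check = abs(w'*x.^4 - 2/5)                   % exact for x^4 since 4 <= 2n-1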
The most common and simplest problem in linear least squares is the linear regression problem: given m measurements

(xi, yi),  i = 1, . . . , m,

find coefficients a and b of the line y = ax + b that best fits the data. The residual

ri = a xi + b − yi

is the i-th component of Az − b, where z = [a b]T and the i-th row of A is (xi, 1). Thus we want to minimize ‖r‖²₂ = ‖Az − b‖²₂.
Loosely speaking, the linear least squares problem says: Given A ∈ Rm×n , find x ∈ Rn such that Ax ≈ b.
Of course, if m = n and A is invertible, then we can solve Ax = b. Otherwise, we may not have a solution of Ax = b
or we may have infinitely many of them.
We are interested in vectors x that minimize the sum of squares of the residual Ax − b, i.e., which solve

min_{x∈Rn} ‖Ax − b‖²₂.

Notice that

min_{x∈Rn} ‖Ax − b‖²₂,   min_{x∈Rn} ‖Ax − b‖₂,   min_{x∈Rn} (1/2)‖Ax − b‖²₂

are all equivalent in the sense that if x solves one of them, it also solves the others.
Instead of finding x that minimizes the sum of squares of the residual Ax − b, we could also try to find x that
minimizes the p-norm of the residual,

min_{x∈Rn} ‖Ax − b‖_p.

This can be done, but is more complicated and will not be covered.
There are many examples of such problems.
Example 5.2 (Best Polynomial Fitting). Given m measurements

(xi, yi),  i = 1, . . . , m,

find coefficients cn, . . . , c1, c0 such that for the polynomial p(x) = cn xⁿ + · · · + c1 x + c0 the sum of squares

∑_{i=1}^m ( p(xi) − yi )²

is minimized.
Write this in matrix form. Let

      ( x1ⁿ  . . .  x1  1 )                        ( y1 )
A  =  ( x2ⁿ  . . .  x2  1 )  ∈ R^{m×(n+1)},   b =  ( y2 )  ∈ Rm.
      ( ...         ...   )                        ( .. )
      ( xmⁿ  . . .  xm  1 )                        ( ym )

Then the coefficient vector c = (cn, . . . , c1, c0)T solves the linear least squares problem min_c ‖Ac − b‖²₂.
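In Matlab the coefficients of the best fitting polynomial are obtained by forming A as above and applying the backslash operator, which solves the linear least squares problem. The data below are synthetic and serve only as an illustration.

% Fit a quadratic (n = 2) to noisy samples of 3x^2 - 2x + 1.
m  = 50;
xi = linspace(0,1,m)';
yi = 3*xi.^2 - 2*xi + 1 + 0.01*randn(m,1);   % noisy measurements
A  = [xi.^2, xi, ones(m,1)];                 % columns x^2, x, 1
c  = A\yi                                    % c is close to [3; -2; 1]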
Sometimes one has to modify the problem in order to state it as an LLS problem.

Example 5.3 (Best Circle Fitting). Find a best fit circle through points (x1, y1), (x2, y2), . . . , (xm, ym). The equation of the
circle around (c1, c2) with radius r is

(x − c1)² + (y − c2)² = r².

It is not an LLS problem, due to the quadratic terms. However, we can rewrite the equation for the circle in the form

2 x c1 + 2 y c2 + ( r² − c1² − c2² ) = x² + y².

Introducing the new unknown z3 = r² − c1² − c2², each data point gives a linear equation in c1, c2 and z3, so the best fit circle is obtained from the linear least squares problem

min ∑_{i=1}^m ( 2 xi c1 + 2 yi c2 + z3 − xi² − yi² )²,

and the radius is recovered afterwards from r² = z3 + c1² + c2².
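A minimal Matlab sketch of this reformulation, with synthetic data and variable names of my own choosing:

% Best fit circle through noisy points on the circle with center (1,2), r = 3.
m   = 100;
phi = 2*pi*rand(m,1);
xi  = 1 + 3*cos(phi) + 0.05*randn(m,1);
yi  = 2 + 3*sin(phi) + 0.05*randn(m,1);
A   = [2*xi, 2*yi, ones(m,1)];               % unknowns: c1, c2, r^2 - c1^2 - c2^2
b   = xi.^2 + yi.^2;
z   = A\b;                                   % linear least squares solution
c1  = z(1);  c2 = z(2);  r = sqrt(z(3) + c1^2 + c2^2)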
Suppose x* satisfies

‖Ax* − b‖²₂ = min_{x∈Rn} ‖Ax − b‖²₂.                                   (LLS)

For any z ∈ Rn,

‖A(x* + z) − b‖²₂ = ‖Ax* − b‖²₂ + 2 zT(AT Ax* − AT b) + ‖Az‖²₂.

If AT Ax* ≠ AT b, take z = −α(AT Ax* − AT b) with α > 0. Then

2 zT(AT Ax* − AT b) + ‖Az‖²₂ = −2α‖AT Ax* − AT b‖²₂ + α²‖A(AT Ax* − AT b)‖²₂ < 0

for

0 < α < ‖AT Ax* − AT b‖²₂ / ‖A(AT Ax* − AT b)‖²₂,

so that ‖A(x* + z) − b‖²₂ < ‖Ax* − b‖²₂, contradicting the optimality of x*. Thus, if x* solves (LLS) then x* must satisfy the normal equation

AT Ax = AT b,

which always has a solution. In fact, a vector x* solves (LLS) iff x* solves the normal equation

AT Ax = AT b.
Note: If the matrix A ∈ Rm×n , m ≥ n, has rank n, then AT A is symmetric positive definite and satisfies
vT AT Av = ‖Av‖22 > 0, ∀v ∈ Rn , v ̸= 0.
If A ∈ Rm×n , m ≥ n, has full rank n, then we can use the Cholesky-decomposition to solve the normal equation
(and, hence, the linear least squares problem) as follows
1. Compute AT A and AT b.
2. Compute the Cholesky-decomposition AT A = RT R.
3. Solve RT y = AT b (forward solve),
solve Rx = y (backward solve) .
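In Matlab the three steps can be sketched as follows (assuming A with full rank n and b are given):

B = A'*A;  c = A'*b;     % step 1: form the normal equations
R = chol(B);             % step 2: Cholesky factor, B = R'*R
y = R'\c;                % step 3: forward solve  R'y = A'b
x = R\y;                 %         backward solve R x = y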
The computation of AT A and AT b requires roughly mn² and 2mn flops, respectively. Roughly (1/3)n³ flops are required to compute
the Cholesky-decomposition. The solution of RT y = AT b and of Rx = y requires approximately 2n² flops.
Computing the normal equations requires us to calculate terms of the form ∑_{k=1}^m a_{ki} a_{kj}. Because of floating point arithmetic,
the computed matrix AT A may not be positive definite.
t = 10.^(0:-1:-10)';
A = [ ones(size(t)) t t.^2 t.^3 t.^4 t.^5];
B = A'*A;
[R,iflag] = chol( B );
if( iflag ~= 0 )
   disp([' Cholesky decomposition returned with iflag = ', ...
         int2str(iflag)])
end
In exact arithmetic B = AT A is symmetric positive definite, but the Cholesky-decomposition detects that

a_{jj} − ∑_{k=1}^{j−1} r²_{jk} < 0

in step j = 6.

>> Cholesky decomposition returned with iflag = 6

The use of the Cholesky decomposition is problematic if the condition number of AT A is large. In the example,
κ₂(AT A) ≈ 4.7 · 10^16.
Definition 5.1. A matrix Q ∈ Rm×n is called orthogonal if QT Q = In, i.e., if its columns are pairwise orthogonal and have
2-norm one.
Orthogonal matrices Q ∈ Rn×n satisfy, in particular,

(Qx)T(Qy) = xT y,   x, y ∈ Rn,

i.e. the angle between Qx and Qy is equal to the angle between x and y. As a result,

‖Qx‖₂ = ‖x‖₂,

i.e. orthogonal matrices preserve the 2-norm.
The last property is the key for solving LLS.
5.2.1 QR-decomposition
Let m ≥ n. For each A ∈ Rm×n there exist a permutation matrix P ∈ Rn×n, an orthogonal matrix Q ∈ Rm×m, and an
upper triangular matrix R ∈ Rn×n such that

AP = Q ( R )   } n
       ( 0 )   } m − n          (QR-decomposition).
The QR decomposition of A can be computed using the Matlab command [Q, R, P] = qr(A).
We will not go into the details of how Q, P, R are computed. If you are interested, check Chapter 5 of the book by Gene
Golub and Charles Van Loan, Matrix Computations.
Assume that A ∈ Rm×n has full rank n. (The rank deficient case will be considered later.)
Let

AP = Q ( R )    ⇔    QT AP = ( R )
       ( 0 )                 ( 0 ),

where R ∈ Rn×n is an upper triangular matrix. Since A has full rank n, the matrix R also has rank n and, therefore, is
nonsingular. Moreover, since Q is orthogonal it obeys QQT = I, and the permutation matrix satisfies PPT = I. Using these
properties we get

‖Ax − b‖²₂ = ‖QT(A P PT x − b)‖²₂ = ‖ ( R ) PT x − QT b ‖²₂ .
                                      ( 0 )

Partition QT b = ( c ; d ) with c ∈ Rn, d ∈ R^{m−n}, and set y = PT x. Then

‖Ax − b‖²₂ = ‖Ry − c‖²₂ + ‖d‖²₂,

which is minimized when Ry = c, i.e., y = R^{−1} c. Recall

y = PT x,  PPT = I,   ⇒   x = Py.

Hence the solution is x = Py = P R^{−1} c.
In Summary: To solve a Linear Least Squares Problem using the QR-Decomposition with matrix A ∈ Rm×n of
rank n and b ∈ Rm:

1. Compute an orthogonal matrix Q ∈ Rm×m, an upper triangular matrix R ∈ Rn×n, and a permutation matrix P ∈ Rn×n such that

   QT A P = ( R )
            ( 0 ).

2. Compute

   QT b = ( c )
          ( d ).

3. Solve

   R y = c.

4. Set

   x = P y.
The MATLAB Implementation is very simple.
[m,n] = size(A);
[Q,R,P] = qr(A);
c = Q’*b;
y = R(1:n,1:n) \ c(1:n);
x = P*y;
If you type

x = A\b;

in Matlab, then Matlab computes the solution of the linear least squares problem min_x ‖Ax − b‖₂.
The Rank Deficient Case: Assume that A ∈ Rm×n, m ≥ n, has rank r < n. (The case m < n can be handled analogously.)
Suppose that

AP = QR,

where Q ∈ Rm×m is orthogonal, P ∈ Rn×n is a permutation matrix, and R ∈ Rm×n is an upper triangular matrix of the
form

R = ( R1  R2 )
    ( 0   0  )

with nonsingular upper triangular R1 ∈ Rr×r and R2 ∈ Rr×(n−r).
We can write

‖Ax − b‖²₂ = ‖QT(A P PT x − b)‖²₂ = ‖ R PT x − QT b ‖²₂ .
Partition QT b as

QT b = ( c1 )   } r
       ( c2 )   } n − r
       ( d  )   } m − n

and put y = PT x. Partition

y = ( y1 )   } r
    ( y2 )   } n − r.

This gives us

‖Ax − b‖²₂ = ‖R1 y1 + R2 y2 − c1‖²₂ + ‖c2‖²₂ + ‖d‖²₂,

which is minimized, for any choice of y2, by

y1 = R1^{−1}( c1 − R2 y2 ).
In practice the rank of A is not known exactly and is determined numerically from the decomposition

AP = QR,

where the diagonal entries of R satisfy |R11| ≥ |R22| ≥ . . . . The effective rank r of A ∈ Rm×n is the smallest integer r
such that

|R_{r+1,r+1}| < ε max{m, n} |R11|,

where ε is the machine precision. In Matlab:
tol = max(size(A))*eps*abs(R(1,1));
r = 0;
while ( r < n && abs(R(r+1,r+1)) >= tol )
    r = r+1;
end
All solutions of

min_x ‖Ax − b‖²₂

are given by

x = Py = P ( R1^{−1}( c1 − R2 y2 ) )
           (          y2          ),

where y2 ∈ R^{n−r} is arbitrary.
Minimum norm solution:
Of all solutions, pick the one with the smallest 2-norm. This leads to

min_{y2} ‖ P ( R1^{−1}( c1 − R2 y2 ) ) ‖²₂
             (          y2          )

which is another linear least squares problem with unknown y2. This problem is n × (n − r) and it has full rank. It can
be solved using the techniques discussed earlier.
MATLAB Implementation:

[m,n] = size(A);
[Q,R,P] = qr(A);
c = Q'*b;
% Determine rank of A (as before).
tol = max(size(A))*eps*abs(R(1,1));
r = 1;
while ( r < n && abs(R(r+1,r+1)) >= tol ); r = r+1; end
% Solve least squares problem to get y2
S = [ R(1:r,1:r) \ R(1:r,r+1:n);
      eye(n-r) ];
t = [ R(1:r,1:r) \ c(1:r);
      zeros(n-r,1) ];
y2 = S \ t;   % solve least squares problem using backslash
% Compute x
y1 = R(1:r,1:r) \ ( c(1:r) - R(1:r,r+1:n) * y2 );
x = P*[y1;y2];
For any matrix A ∈ Rm×n there exist orthogonal matrices U ∈ Rm×m, V ∈ Rn×n and a 'diagonal' matrix Σ ∈ Rm×n, i.e., a matrix
whose only nonzero entries are the diagonal entries Σ_{ii} = σi, i = 1, . . . , min{m, n}; schematically,

Σ = ( diag(σ1, . . . , σm)   0 )   for m ≤ n,        Σ = ( diag(σ1, . . . , σn) )   for m ≥ n,
                                                         (          0          )

with diagonal entries

σ1 ≥ σ2 ≥ · · · ≥ σr > σ_{r+1} = · · · = σ_{min{m,n}} = 0,

such that A = U Σ VT.
Since V is orthogonal, the identity A = UΣVT is equivalent to

AV = UΣ,   i.e.,   A vi = σi ui,  i = 1, . . . , min{m, n}.

We can interpret this as follows: there exists a special orthonormal set of vectors (the columns of V) that is mapped
by the matrix A onto an orthogonal set of vectors (the scaled columns σi ui of U).
Given the SVD-decomposition of A,

A = U Σ VT

with

σ1 ≥ · · · ≥ σr > σ_{r+1} = · · · = σ_{min{m,n}} = 0,

one may conclude the following:

∙ rank(A) = r,
∙ A = ∑_{i=1}^r σi ui viT.

The last identity is called the dyadic decomposition of A; it decomposes the matrix A of rank r into a sum of r matrices of rank 1.
The 2-norm and the Frobenius norm of A can be easily computed from the SVD decomposition:

‖A‖₂ = sup_{x≠0} ‖Ax‖₂ / ‖x‖₂ = σ1,

‖A‖_F = ( ∑_{i=1}^m ∑_{j=1}^n a²_{ij} )^{1/2} = ( σ1² + · · · + σ_p² )^{1/2},   p = min{m, n}.
AT A = V Σ T ΣV T and AAT = UΣ Σ T U T .
Thus, σi2 , i = 1, . . . , p are the eigenvalues of symmetric matrices AT A and AAT and vi and ui are the corresponding
eigenvectors.
Moreover, if we set A_k = ∑_{i=1}^k σi ui viT, k < r, then

min_{rank(D)=k} ‖A − D‖₂ = ‖A − A_k‖₂ = σ_{k+1},

and

min_{rank(D)=k} ‖A − D‖_F = ‖A − A_k‖_F = ( ∑_{i=k+1}^p σi² )^{1/2},   p = min{m, n}.
Now consider the linear least squares problem min_x ‖Ax − b‖²₂ and let A = UΣVT be the SVD of A ∈ Rm×n. Using the orthogonality of U and V we have, with z = VT x,

‖Ax − b‖²₂ = ‖Σ z − UT b‖²₂ = ∑_{i=1}^r ( σi zi − ui^T b )² + ∑_{i=r+1}^m ( ui^T b )².

Thus,

min_x ‖Ax − b‖²₂ = ∑_{i=r+1}^m ( ui^T b )².

All solutions of the linear least squares problem are given by z = VT x with

zi = ui^T b / σi,  i = 1, . . . , r,
zi = arbitrary,    i = r + 1, . . . , n.

The minimum norm solution of the linear least squares problem is given by

x† = V z†,   where z†_i = ui^T b / σi for i = 1, . . . , r and z†_i = 0 for i = r + 1, . . . , n.
MATLAB code:

% compute the SVD:
[U,S,V] = svd(A);
s = diag(S);
n = size(A,2);
% determine the effective rank r of A using the singular values
r = 1;
while( r < n && s(r+1) >= max(size(A))*eps*s(1) )
    r = r+1;
end
d = U'*b;
x = V* ( [d(1:r)./s(1:r); zeros(n-r,1) ] );
Suppose that the data b are

b = bex + δb,

where δb represents the measurement error. The minimum norm solution of min ‖Ax − (bex + δb)‖²₂ is

x† = ∑_{i=1}^r ( ui^T b / σi ) vi = ∑_{i=1}^r ( ui^T bex / σi + ui^T δb / σi ) vi.

If a singular value σi is small, then ui^T(δb)/σi could be large, even if ui^T(δb) is small. This shows that errors δb in the data
can be magnified by small singular values σi.
% Compute A
t = 10.^(0:-1:-10)';
A = [ ones(size(t)) t t.^2 t.^3 t.^4 t.^5];
% compute SVD of A
[U,S,V] = svd(A); sigma = diag(S);
% compute exact data
xex = ones(6,1); bex = A*xex;
for i = 1:10
   % data perturbation
   deltab = 10^(-i)*(0.5-rand(size(bex))).*bex;
   b = bex+deltab;
   % solution of perturbed linear least squares problem
   w = U'*b;
   x = V * (w(1:6) ./ sigma);
   errx(i) = norm(x - xex); errb(i) = norm(deltab);
end
loglog(errb,errx,'*');
ylabel('||x^{ex} - x||_2'); xlabel('||\delta b||_2')
The singular values of A in the above Matlab example spread over many orders of magnitude (recall that κ₂(AT A) ≈ 4.7 · 10^16).
We see that small perturbations δb in the measurements can lead to large errors in the solution x of the linear least
squares problem if the singular values of A are small.
If σ1/σr ≫ 1, then it might be useful to consider the regularized linear least squares problem (Tikhonov regularization)

min_{x∈Rn} (1/2)‖Ax − b‖²₂ + (λ/2)‖x‖²₂.                              (5.2)
Here λ > 0 is the regularization parameter.
The regularization parameter λ > 0 is not known a-priori and has to be determined based on the problem data.
Observe that

min_x (1/2)‖Ax − b‖²₂ + (λ/2)‖x‖²₂ = min_x (1/2) ‖ ( A    ) x − ( b ) ‖²₂ .
                                                   ( √λ I )     ( 0 )
For λ > 0, the matrix

( A    )  ∈ R^{(m+n)×n}
( √λ I )

is always of full rank n. Hence, for λ > 0, the regularized linear least squares problem (5.2) has a unique solution. The
normal equations corresponding to (5.2) are given by

( A    )T ( A    ) x = ( A    )T ( b ),   i.e.,   (AT A + λ I) x = AT b.
( √λ I )  ( √λ I )     ( √λ I )  ( 0 )
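For a given λ > 0, the regularized problem (5.2) can therefore be solved in Matlab by applying backslash to the augmented matrix (a sketch; A, b and lambda are assumed to be given):

[m,n] = size(A);
Aaug  = [A; sqrt(lambda)*eye(n)];     % augmented matrix of full rank n
baug  = [b; zeros(n,1)];
xlam  = Aaug\baug;                    % unique solution of (5.2)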
Using the SVD decomposition A = UΣVT, where U ∈ Rm×m and V ∈ Rn×n are orthogonal matrices and Σ ∈ Rm×n
is a 'diagonal' matrix with diagonal entries σ1 ≥ σ2 ≥ · · · ≥ 0, the normal equation transforms, with z = VT x, into

(ΣT Σ + λ I) z = ΣT UT b.
Note that

lim_{λ→0} xλ = lim_{λ→0} ∑_{i=1}^r ( σi (ui^T b) / (σi² + λ) ) vi = ∑_{i=1}^r ( ui^T b / σi ) vi = x†,

i.e., the solution of the regularized least squares problem (5.2) converges to the minimum norm solution of the original
least squares problem as λ goes to zero. The representation

xλ = ∑_{i=1}^r ( σi (ui^T b) / (σi² + λ) ) vi

of the solution of the regularized LLS also reveals the regularizing property of adding the term (λ/2)‖x‖²₂ to the (ordinary)
least squares functional. We have that

σi (ui^T b) / (σi² + λ) ≈ 0              if 0 ≈ σi ≪ λ,
σi (ui^T b) / (σi² + λ) ≈ ui^T b / σi    if σi ≫ λ.
Hence, adding (λ/2)‖x‖²₂ to the original least squares functional acts as a filter. Contributions from singular values which
are large relative to the regularization parameter λ are left (almost) unchanged, whereas contributions from small
singular values are (almost) eliminated. This raises an important question: how to choose λ?
Suppose that the data are b = bex + δb. We want to compute the minimum norm solution of the original least
squares problem with unperturbed data bex,

xex = ∑_{i=1}^r ( ui^T bex / σi ) vi,

but we can only compute with b = bex + δb; we do not know bex. The solution of the regularized least squares problem
is

xλ = ∑_{i=1}^r ( σi (ui^T bex) / (σi² + λ) + σi (ui^T δb) / (σi² + λ) ) vi.

We observed that

∑_{i=1}^r ( σi (ui^T bex) / (σi² + λ) ) vi → xex   as λ → 0.
t = 10.^(0:-1:-10)';
A = [ ones(size(t)) t t.^2 t.^3 t.^4 t.^5];
% compute SVD of A
[U,S,V] = svd(A); sigma = diag(S);
For this example, λ ≈ 10^{−3} seems to be a good choice for the regularization parameter λ. However, we could only
create this figure with the knowledge of the desired solution xex.
So the question is: how can we determine a λ ≥ 0 so that ‖xex − xλ‖₂ is small without knowledge of xex? One
approach is the Morozov discrepancy principle.
Suppose b = bex + δb. We do not know the perturbation δb, but we assume that we know its size ‖δb‖. Suppose
the unknown desired solution xex satisfies Axex = bex. Hence,

‖Axex − b‖ = ‖bex − (bex + δb)‖ = ‖δb‖.

Since the exact solution satisfies ‖Axex − b‖ = ‖δb‖, we want to find a regularization parameter λ ≥ 0 such that the
solution xλ of the regularized least squares problem satisfies
‖Axλ − b‖ = ‖δb‖.
To compute ‖Axλ − b‖ for a given λ ≥ 0 we need to solve the regularized linear least squares problem

min_x (1/2)‖Ax − b‖²₂ + (λ/2)‖x‖²₂ = min_x (1/2) ‖ ( A    ) x − ( b ) ‖²₂ .
                                                   ( √λ I )     ( 0 )

Finding λ ≥ 0 such that

f(λ) = 0

is a root finding problem. We will discuss in the next section how to solve such problems. In this case f maps a scalar
λ into a scalar,

f(λ) = ‖Axλ − b‖ − ‖δb‖,

but the evaluation of f requires the solution of a regularized least squares problem and can be rather expensive, so
one has to keep the number of function evaluations small.
The bisection method is based on the following version of the Intermediate Value Theorem: if f : R → R is a continuous
function and a, b ∈ R, a < b, and f(a) f(b) < 0, then there exists x ∈ [a, b] such that f(x) = 0.
Thus, given [ak, bk] with f(ak) f(bk) < 0 (i.e., f(ak) and f(bk) have opposite signs), we compute the interval midpoint
ck = (ak + bk)/2 and evaluate f(ck). If f(ck) and f(ak) have opposite signs, then [ak, ck] contains a root of f, and we set
ak+1 = ak and bk+1 = ck. Otherwise f(ck) and f(bk) must have opposite signs. In this case, [ck, bk] contains a root of f
and we set ak+1 = ck and bk+1 = bk.
The algorithm is rather straightforward.
Input: Initial values a(0), b(0) such that f(a(0))f(b(0)) < 0
       and a tolerance tolx
Output: approximation of a root of f
for k = 0, 1,... do
  if b(k) - a(k) < tolx,
    return c(k) = (a(k) + b(k))/2 as an approximate root of f and stop
  end
  Compute c(k) = (a(k) + b(k))/2 and f(c(k)).
  If f(a(k))f(c(k)) < 0, then
    a(k+1) = a(k); b(k+1) = c(k),
  else
    a(k+1) = c(k); b(k+1) = b(k).
  End
end
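A corresponding Matlab implementation might look as follows (the function name and interface are my own choice):

function c = bisection(f, a, b, tolx)
% BISECTION Approximate a root of f in [a,b], assuming f(a)*f(b) < 0.
  fa = f(a);
  if fa*f(b) >= 0, error('f(a) and f(b) must have opposite signs'); end
  while b - a >= tolx
      c  = (a + b)/2;
      fc = f(c);
      if fa*fc < 0       % root lies in [a, c]
          b = c;
      else               % root lies in [c, b]
          a = c;  fa = fc;
      end
  end
  c = (a + b)/2;
end

For instance, bisection(@(x) x.^3 - 2*x - 5, 0, 4, 1e-5) returns an approximation of the root x* ≈ 2.0946 of the function used in Example 6.1 below.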
The interval lengths satisfy

|ak+1 − bk+1| = (1/2)|ak − bk| = (1/2²)|ak−1 − bk−1| = · · · = (1/2^{k+1})|a0 − b0|.

There is a root x* of f such that the midpoints satisfy

|x* − ck| ≤ (1/2)|bk − ak| = (1/2^{k+1})|b0 − a0|.

In particular, lim_{k→∞} ck = x*. By ⌊z⌋ we denote the largest integer less than or equal to z. After

k = ⌊ log₂( |b0 − a0| / tolx ) ⌋ − 1

iterations we are guaranteed to have an approximation ck of a root x* of f that satisfies |x* − ck| ≤ tolx.
The following summarizes the "pros" and "cons" of the bisection method.

"pros":
∙ The method is very simple.
∙ The Bisection method requires only function values. In fact, it only requires the sign of the function values (that
  means as long as the sign is correct, the function values can be inaccurate).
∙ The method is very robust. The Bisection method converges for any [a0, b0] that contains a root of f (the Bisection
  method converges globally).

"cons":
∙ Hard to generalize to higher dimensions.
∙ Since the Bisection method only uses the sign of the function values, in general it will not find the root of even the
  simple affine linear function f(x) = ax + b in a finite number of iterations.
∙ The local convergence behavior of the Bisection method is rather slow (the error is only reduced by a factor of 2, no
  matter how close we are to the solution).
One of the problems with the bisection method is that the midpoint may have nothing to do with the actual root. One
of the possible improvements of the bisection method is to use the function values at the end points (if available) more
efficiently.
Thus, given an initial interval [a0, b0] such that f(a0) f(b0) < 0 (i.e., [a0, b0] contains a root of f), Regula Falsi
constructs an affine linear function m(x) = αx + β such that m(a0) = f(a0) and m(b0) = f(b0). In the notation of
Section 2, m(x) = P( f |a0, b0)(x), i.e. m interpolates f at a0 and at b0, and it is given by

m(x) = f(a0) + ( f(b0) − f(a0) ) (x − a0)/(b0 − a0).

Instead of taking c0 as a midpoint, we choose c0 such that m(c0) = 0. Solving for c0 we obtain

c0 = a0 − f(a0) (b0 − a0) / ( f(b0) − f(a0) ).

Thus we use c0 as an approximation of the root of f. This point satisfies c0 ∈ (a0, b0). If f(c0) and f(a0) have opposite
signs, then we set a1 = a0 and b1 = c0. Otherwise f(c0) and f(b0) must have opposite signs and we set a1 = c0 and
b1 = b0. Then we proceed similarly to the bisection method. This gives us the following algorithm.
Input: Initial values a(0); b(0) such that f(a(0))*f(b(0)) < 0,
a maximum number of iterations maxit, a tolerance tolf and a tolerance tolx
1 For k = 0,1,...,maxit do
2 If b(k) - a(k) < tolx, then return x = c(k) and stop
3 Compute c(k) = a(k) - f(a(k))*(b(k) - a(k))/(f(b(k)) - f(a(k)))
and f(c(k)).
4 If |f(c(k))| < tolf , then return x = c(k) and stop
5 If f(a(k))f(c(k)) < 0 , then
6 a(k+1) = a(k); b(k+1) = c(k),
7 else
8 a(k+1) = c(k); b(k+1) = b(k).
9 End
10 End
However, the following simple numerical example shows that computationally we often cannot see any advantage
of Regula Falsi over the bisection method.

Example 6.1. Take f(x) = x³ − 2x − 5 on the interval [0, 4], with tolerance tol = 10^{−5}.
The idea of Newton's method is similar to Regula Falsi: approximate f(x) by a linear function. But this time, for a given
approximation x0 of a root x* of f, we select the tangent model

m(x) = f(x0) + f′(x0)(x − x0).

The root

x1 = x0 − f(x0)/f′(x0)

of the tangent model is used as an approximation of the root of f. Writing the previous identity as x_{k+1} = x_k + s_k with
s_k = −f(x_k)/f′(x_k) leads to the following algorithm.
1 For k = 0:maxit do
2    Compute s(k) = -f(x(k))/f'(x(k)).
3    Compute x(k+1) = x(k) + s(k).
4    Check for termination.
5 End
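A Matlab sketch of this loop for the function f(x) = x³ − 2x − 5 from Example 6.1, with a simple stopping test on the size of the Newton step (my own choice):

f  = @(x) x.^3 - 2*x - 5;     % test function from Example 6.1
fp = @(x) 3*x.^2 - 2;         % its derivative
x  = 2;  maxit = 20;  tol = 1e-12;
for k = 1:maxit
    s = -f(x)/fp(x);
    x = x + s;
    fprintf('k = %2d,  x = %.15f\n', k, x);
    if abs(s) < tol, break; end    % stop when the Newton step is small
end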
1. The sequence is called linearly convergent to x* if lim_{k→∞} xk = x* and if there exist c ∈ (0, 1) and k̂ ∈ N such that

|x_{k+1} − x*| ≤ c |xk − x*|   for all k ≥ k̂.

2. The sequence is called superlinearly convergent if there exists a sequence {ck} with ck > 0 and lim_{k→∞} ck = 0
such that

|x_{k+1} − x*| ≤ ck |xk − x*|

or, equivalently, if

lim_{k→∞} |x_{k+1} − x*| / |xk − x*| = 0.

3. The sequence is called quadratically convergent to x* if lim_{k→∞} xk = x* and if there exist c > 0 and k̂ ∈ N such
that

|x_{k+1} − x*| ≤ c |xk − x*|²   for all k ≥ k̂.
Thus, using the above definition, we can see that the bisection method is linearly convergent with c = 1/2. On the other
hand, as we will see, Newton's method is quadratically convergent under some conditions.
Theorem 6.1. Let D ⊂ R be an open interval and let f : D → R be differentiable on D. Furthermore, let f′(x) be
Lipschitz continuous with Lipschitz constant L. If x* ∈ D is a root and if f′(x*) ≠ 0, then there exists an ε > 0 such
that Newton's method with any starting point x0 with |x0 − x*| < ε generates iterates xk which converge to x*,

lim_{k→∞} xk = x*,

and satisfy

|x_{k+1} − x*| ≤ ( L / |f′(x*)| ) |xk − x*|²   for all k ∈ N.
Before proving the above theorem, we give the "pros" and "cons" of Newton's method.

"pros":
∙ The method is very simple.
∙ Fast convergence.
∙ Can be easily generalized to Rn.
∙ Can be easily modified.

"cons":
∙ Requires derivatives of the function.
∙ Only local convergence; requires a good initial guess.
We recall the following definition.
Definition 6.3 (Lipschitz continuity). The function f : [a, b] → R is said to be Lipschitz continuous if there exists
L > 0 such that
| f (x) − f (y)| ≤ L|x − y|, ∀x, y ∈ [a, b].
The constant L is called the Lipschitz constant.
For given x, y ∈ [a, b], consider the function

φ(t) = f( y + t(x − y) ),   t ∈ [0, 1].

We can check that φ(0) = f(y) and φ(1) = f(x). Thus, by the Fundamental Theorem of Calculus and the chain rule,

f(x) − f(y) = φ(1) − φ(0) = ∫₀¹ φ′(t) dt = ∫₀¹ f′( y + t(x − y) ) dt (x − y).        (6.1)

Taking y = x0, we have

f(x) − f(x0) − f′(x0)(x − x0) = ∫₀¹ f′( x0 + t(x − x0) )(x − x0) dt − f′(x0)(x − x0)
                              = ∫₀¹ [ f′( x0 + t(x − x0) ) − f′(x0) ] (x − x0) dt.
Using the Lipschitz continuity of f′, this yields

| f(x) − f(x0) − f′(x0)(x − x0) | ≤ ∫₀¹ L t |x − x0|² dt = (L/2)|x − x0|².

Therefore, for the first Newton iterate,

| x1 − x* | = | x0 − f(x0)/f′(x0) − x* | = (1/|f′(x0)|) | f(x*) − f(x0) − f′(x0)(x* − x0) | ≤ ( L / (2|f′(x0)|) ) |x* − x0|².

This strongly supports our previous claim that Newton's method converges quadratically; we only need to
establish that f′(xk) stays away from zero, provided f′(x*) ≠ 0 and x0 is sufficiently close to x*.
Recall

f′(x) = lim_{h→0} ( f(x + h) − f(x) ) / h.

Thus, if the derivative of the function is not available, in Newton's method we can replace f′(xk) by the finite
difference

( f(xk + hk) − f(xk) ) / hk

for some hk ≠ 0. It is natural to use hk = x_{k−1} − xk; then we obtain the Secant method

x_{k+1} = xk − f(xk) ( x_{k−1} − xk ) / ( f(x_{k−1}) − f(xk) ).                       (6.3)
1 For k = 1:maxit do
2    Compute s(k) = -f(x(k))*(x(k-1) - x(k))/(f(x(k-1)) - f(x(k))).
3    Compute x(k+1) = x(k) + s(k).
4    Check for termination.
5 End
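A Matlab sketch of the secant iteration for the same test function f(x) = x³ − 2x − 5 (starting points and stopping test are my own choice):

f  = @(x) x.^3 - 2*x - 5;
x0 = 2;  x1 = 3;  maxit = 30;  tol = 1e-12;
for k = 1:maxit
    s  = -f(x1)*(x0 - x1)/(f(x0) - f(x1));   % secant step
    x2 = x1 + s;
    x0 = x1;  x1 = x2;
    fprintf('k = %2d,  x = %.15f\n', k, x1);
    if abs(s) < tol, break; end
end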
We will see that the Secant method has very interesting properties. Recalling the divided difference formula (2.8),

f[xj, . . . , xk] = ( f[x_{j+1}, . . . , xk] − f[xj, . . . , x_{k−1}] ) / ( xk − xj ).          (6.4)
Now we turn back to the secant method (6.3). Using the divided difference notation, we can rewrite it as

x_{k+1} = xk − f(xk) / f[x_{k−1}, xk].                                                (6.6)

Subtracting x* from both sides we obtain

x_{k+1} − x* = xk − x* − f(xk) / f[x_{k−1}, xk]
             = (xk − x*) ( 1 − f(xk) / ( (xk − x*) f[x_{k−1}, xk] ) )
             = ( (xk − x*) / f[x_{k−1}, xk] ) ( f[x_{k−1}, xk] − ( f(xk) − f(x*) )/(xk − x*) )
             = ( (xk − x*) / f[x_{k−1}, xk] ) ( f[x_{k−1}, xk] − f[xk, x*] )
             = ( (xk − x*)(x_{k−1} − x*) / f[x_{k−1}, xk] ) · ( f[x_{k−1}, xk] − f[xk, x*] ) / ( x_{k−1} − x* )
             = (xk − x*)(x_{k−1} − x*) f[x_{k−1}, xk, x*] / f[x_{k−1}, xk].            (6.7)
lim_{k→∞} xk = x*,

λ_{k+1} = λk + λ_{k−1},

i.e. the λk are the Fibonacci numbers,

λk = (1/√5) ( (1+√5)/2 )^k − (1/√5) ( (1−√5)/2 )^k ∼ (1/√5) ( (1+√5)/2 )^k,   as k → ∞.
Proof. Using the Mean Value Theorem and the assumptions of the theorem we obtain, as long as |xk − x*| ≤ ε and
|x_{k−1} − x*| ≤ ε with ε < 2m/L,

|x_{k+1} − x*| ≤ (L/(2m)) ε · ε < ε,

so all iterates remain in the ε-neighborhood of x*. Multiplying the error recursion by L/(2m) and setting
ek = (L/(2m)) |xk − x*|, we obtain

e_{k+1} ≤ ek e_{k−1},   k = 1, 2, . . . .

Let q = max{e0, e1}; for x0 and x1 close enough to x* we have q < 1. Since e0 ≤ q and e1 ≤ q, we have ek ≤ q^{λk} for k = 1, 2, . . . . Since q < 1 and λk → ∞ as
k → ∞, we have ek → 0 as k → ∞. This concludes the proof of the theorem.
we have
f(x_{k−1}) = −(xk − x_{k−1}) ( f(x_{k−1}) − f(x_{k−2}) ) / ( x_{k−1} − x_{k−2} ).

Combining it with the above estimate and using Lemma 6.1, we obtain

|xk − x*| ≤ ( |xk − x_{k−1}| / m ) | f(x_{k−1})/(xk − x_{k−1}) + ( f(xk) − f(x_{k−1}) )/(xk − x_{k−1}) |
          = ( |xk − x_{k−1}| / m ) | ( f(xk) − f(x_{k−1}) )/(xk − x_{k−1}) − ( f(x_{k−1}) − f(x_{k−2}) )/(x_{k−1} − x_{k−2}) |
          = ( |xk − x_{k−1}| |xk − x_{k−2}| / m ) | f[x_{k−2}, x_{k−1}, xk] |
          ≤ ( L/(2m) ) |xk − x_{k−1}| |xk − x_{k−2}|.
References
1. Michael L. Overton. Numerical computing with IEEE floating point arithmetic. Society for Industrial and Applied Mathematics (SIAM),
Philadelphia, PA, 2001. Including one theorem, one rule of thumb, and one hundred and one exercises.
2. Volker Strassen. Gaussian elimination is not optimal. Numer. Math., 13:354–356, 1969.