
Numerical Analysis 1

Lecture notes

Dmitriy Leykekhman

0 Introduction

These lecture notes are designed for MATH 5510, the first graduate course in numerical analysis at the University of Connecticut. Here I present the material which I consider important for students to see in their first numerical analysis course. The material and the style of these lecture notes are strongly influenced by the lecture notes of Prof. Matthias Heinkenschloss for CAMM 353 at Rice University. There are many other nice lecture notes that one can find freely online. Let me just mention the four-volume Numerical Analysis course (in German) by Rolf Rannacher and the lecture notes by Doron Levy. There are plenty of other sources on the internet.
There are plenty of misconceptions, even among mathematicians, about what numerical analysis really is. In my opinion one of the best definitions was given by N. Trefethen in his essay "What is numerical analysis?"

Definition 0.0. Numerical analysis is the study of algorithms for the problems of continuous mathematics.

We strongly encourage whoever is interested in the subject to read this essay; it is only 5 pages long.
These lecture notes start with interpolation, which is not orthodox, but in my opinion it is an interesting topic that captures students' attention and introduces important ideas, challenges and motivations for the other topics in numerical analysis.
The required background is rather light; a good understanding of linear algebra and calculus is sufficient. Students are also encouraged to code as many of the algorithms appearing in the notes as possible. From time to time I adopt Matlab notation and syntax for columns or rows of matrices, loops, etc.

Dmitriy Leykekhman
Department of Mathematics, University of Connecticut, Storrs, CT 06269, USA. e-mail: [email protected]

Contents

Numerical Analysis 1 Lecture notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


Dmitriy Leykekhman
0 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1 Floating point arithmetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Rounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Floating Point Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Interpolation problem: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Polynomial Interpolation in 1D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Divided differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Neville Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Approximation properties of interpolating polynomials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6 Equidistant points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Chebyshev points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.8 Hermite interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.9 Spline interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Matrix-Vector multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Matrix-Matrix multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Existence and Uniqueness of solutions of linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Transpose of a Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Solution of Triangular Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Gaussian Elimination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7 LU decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.8 Applications of LU decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.9 Symmetric Positive definite matrices. Cholesky decomposition. . . . . . . . . . . . . . . . . . . . . . . . . 41
3.10 Tridiagonal matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.11 Error analysis of Linear systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Numerical Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1 Newton Cotes Quadrature Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Composite quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Gauss Quadrature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5 Linear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1 Normal Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


5.2 Solving Linear Least Square problem using QR-decomposition . . . . . . . . . . . . . . . . . . . . . . . . 68


5.3 Solving Linear Least Square problem using SVD-decomposition . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Regularized Linear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6 Numerical Solution of Nonlinear Equations in R1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.1 Bisection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Regula Falsi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.3 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
6.4 Secant method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

1 Floating point arithmetics

A very good introduction to floating point arithmetics is a book by Michael Overton [1].
In everyday life we use decimal representation of numbers. For example

1234.567

for us means
1 * 10^3 + 2 * 10^2 + 3 * 10^1 + 4 * 10^0 + 5 * 10^-1 + 6 * 10^-2 + 7 * 10^-3.
More generally
d_j . . . d_1 d_0 . d_{-1} . . . d_{-i} . . .
represents
d_j * 10^j + · · · + d_1 * 10^1 + d_0 * 10^0 + d_{-1} * 10^-1 + · · · + d_{-i} * 10^-i + · · · .
Let β ≥ 2 be an integer. For every x ∈ R there exist integers e and d_i ∈ {0, . . . , β − 1}, i = 0, 1, . . . , such that

x = sign(x) ( ∑_{i=0}^∞ d_i β^-i ) β^e.    (1.1)

The representation is unique if one requires that d_0 > 0 when x ≠ 0.


Example 1.1.

11/2 = 5 * 10^0 + 5 * 10^-1 = (5.5)_10,
11/2 = 1 * 2^2 + 0 * 2^1 + 1 * 2^0 + 1 * 2^-1
     = (1 * 2^0 + 0 * 2^-1 + 1 * 2^-2 + 1 * 2^-3) * 2^2 = (1.011)_2 * 2^2.

Of course every integer has a finite representation, but very often even simple rational numbers have infinite representations.

Example 1.2.

1/3 = (0.33333 . . . )_10 = (3.33333 . . . )_10 * 10^-1,
1/3 = (0.010101 . . . )_2 = (1.0101 . . . )_2 * 2^-2.

In a computer only a finite subset of all real numbers can be represented. These are the so-called floating point numbers and they are of the form

x̄ = (−1)^s ( ∑_{i=0}^{m−1} d_i β^-i ) β^e

with d_i ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {e_min , . . . , e_max }.


∙ β is called the base,
∙ ∑_{i=0}^{m−1} d_i β^-i is the significand or mantissa, m is the mantissa length,
∙ e is the exponent, and {e_min , . . . , e_max } is the exponent range.
∙ If β = 2, then we say the floating point number system is a binary system. In this case the d_i 's are called bits.
∙ If β = 10, then we say the floating point number system is a decimal system. In this case the d_i 's are called digits.
∙ A floating point number x̄ ≠ 0 is said to be normalized if d_0 > 0.

Example 1.3. Consider the floating point number system β = 2, m = 3, emin = −1, emax = 2.
The normalized floating point numbers x̄ ̸= 0 are of the form

x̄ = ±1.d_1 d_2 × 2^e

since the normalization condition implies that d_0 ∈ {1, . . . , β − 1} = {1}. For the exponent we have the choices e = −1, 0, 1, 2. Below we list all the numbers in this example. Notice that the spacing between the numbers increases as we move away from 0.

[Plot: the positive normalized numbers 1/2, 5/8, 3/4, 7/8, 1, 5/4, 3/2, 7/4, 2, 5/2, 3, 7/2, 4, 5, 6, 7 (and their negatives) together with 0 on the number line.]
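A quick way to convince yourself of this is to enumerate the system. The following short Python sketch (Python is used here instead of the Matlab notation of the notes; the names are not part of the notes) lists the positive normalized numbers of Example 1.3 and their spacing:

import itertools

beta, m, emin, emax = 2, 3, -1, 2
numbers = sorted(
    (1 + d1 / 2 + d2 / 4) * 2.0**e            # mantissa 1.d1d2 times 2^e
    for d1, d2 in itertools.product((0, 1), repeat=2)
    for e in range(emin, emax + 1)
)
print(numbers)                                          # 0.5, 0.625, ..., 6.0, 7.0
print([b - a for a, b in zip(numbers, numbers[1:])])    # the gaps grow away from 0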

Consider the floating point number system

x̄ = (−1)^s ( ∑_{i=0}^{m−1} d_i β^-i ) β^e

with d_i ∈ {0, . . . , β − 1}, i = 0, 1, . . . , m − 1, and e ∈ {e_min , . . . , e_max }.

∙ The mantissa satisfies

∑_{i=0}^{m−1} d_i β^-i ≤ ∑_{i=0}^{m−1} (β − 1) β^-i = β (1 − β^-m) < β.

∙ The mantissa of a normalized floating point number is always ≥ 1.
∙ The largest floating point number is

x̄_max = ( ∑_{i=0}^{m−1} (β − 1) β^-i ) β^{e_max} = (1 − β^-m) β^{e_max + 1}.

∙ The smallest positive normalized floating pt. number is x̄_min = β^{e_min}.
∙ The distance between 1 and the next largest floating pt. number is β^{1−m}.
∙ Half this number, ε_mach = (1/2) β^{1−m}, is called machine precision or unit roundoff.
∙ The spacing between the floating pt. numbers in [1, β] is β^{−(m−1)}.
∙ The spacing between the floating pt. numbers in [β^e, β · β^e] is β^{−(m−1)} β^e.
Almost all modern computers implement the IEEE binary (β = 2) floating point standard. IEEE single precision floating point numbers are stored in 32 bits. IEEE double precision floating point numbers are stored in 64 bits.

Common Name                          (Approximate) Equivalent Value
                                     Single Precision     Double Precision
Unit roundoff                        2^-24 ≈ 6.e-8        2^-53 ≈ 1.1e-16
Maximum normal number                3.4e+38              1.7e+308
Minimum positive normal number       1.2e-38              2.3e-308
Maximum subnormal number             1.1e-38              2.2e-308
Minimum positive subnormal number    1.5e-45              5.0e-324
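The double precision values can be inspected directly; a small Python check (not part of the notes; sys.float_info is the standard library's description of the machine's double format):

import sys

print(sys.float_info.epsilon / 2)   # unit roundoff 2^-53 ≈ 1.1e-16
print(sys.float_info.max)           # largest normal number ≈ 1.7e+308
print(sys.float_info.min)           # smallest positive normal number ≈ 2.2e-308
print(5e-324)                       # smallest positive subnormal number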

1.1 Rounding

Given a real number x we define

fl(x) = normalized floating point number closest to x.

A floating point number x̄ closest to x is obtained by rounding. If


x = sign(x) ( ∑_{i=0}^∞ d_i β^-i ) β^e,

then

fl(x) = sign(x) ( ∑_{i=0}^{m−1} d_i β^-i ) β^e                    if d_m < β/2,
fl(x) = sign(x) ( ∑_{i=0}^{m−1} d_i β^-i + β^{−(m−1)} ) β^e       if d_m ≥ β/2.

Example 1.4. Let β = 10, m = 3. Then

fl(1.234 * 10^-1) = 1.23 * 10^-1,
fl(1.235 * 10^-1) = 1.24 * 10^-1,
fl(1.295 * 10^-1) = 1.30 * 10^-1.

Note, there may be two floating point numbers closest to x; fl(x) picks one of them. For example, let β = 10, m = 3. Then 1.24 − 1.235 = 0.005, but also 1.235 − 1.23 = 0.005.

Theorem 1.1. If x is a number within the range of floating point numbers and |x| ∈ [β^e, β^{e+1}), then the absolute error between x and the floating point number fl(x) closest to x is given by

| fl(x) − x| ≤ (1/2) β^{1−m} β^e

and, provided x ≠ 0, the relative error is given by

| fl(x) − x| / |x| ≤ (1/2) β^{1−m}.    (1.2)

In other words, fl(x) is a floating point number closest to x = sign(x) ( ∑_{i=0}^∞ d_i β^-i ) β^e with d_0 > 0.

Definition 1.1. The number

ε_mach := (1/2) β^{1−m}

is called machine precision or unit roundoff.

Proof. If x = 0, then fl(x) = x and the assertion follows immediately.
Consider x > 0. (The case x < 0 can be treated in the same manner.) Recall that the spacing between the floating point numbers

x̄ = ( ∑_{i=0}^{m−1} d_i β^-i ) β^e ∈ [β^e, β^{e+1})

is β^{−(m−1)} β^e. Hence if x ∈ [β^e, β^{e+1}), then the floating point number x̄ closest to x satisfies |x̄ − x| ≤ (1/2) β^{−(m−1)} β^e. Since x ≥ β^e,

|x̄ − x| / |x| ≤ (1/2) β^{−(m−1)}.

Example 1.5. Let β = 10, m = 3, thus ε_mach = 5 * 10^-3.

| fl(1.234 * 10^-1) − 1.234 * 10^-1| = 0.0004,
| fl(1.234 * 10^-1) − 1.234 * 10^-1| / (1.234 * 10^-1) = 0.0004 / (1.234 * 10^-1) ≈ 3.2 * 10^-3,
| fl(1.295 * 10^-1) − 1.295 * 10^-1| = 0.0005,
| fl(1.295 * 10^-1) − 1.295 * 10^-1| / (1.295 * 10^-1) = 0.0005 / (1.295 * 10^-1) ≈ 3.9 * 10^-3.

1.2 Floating Point Arithmetic

Let ⊙ represent one of the elementary operations +, −, *, /. If x̄ and ȳ are floating point numbers, then x̄ ⊙ ȳ may not be a floating point number, for example: β = 10, m = 4: 1.234 + 2.751 * 10^-1 = 1.5091. What is the computed value for x̄ ⊙ ȳ? In IEEE floating point arithmetic the result of the computation x̄ ⊙ ȳ is equal to the floating point number that is nearest to the exact result x̄ ⊙ ȳ. Therefore we use fl(x̄ ⊙ ȳ) to denote the result of the computation x̄ ⊙ ȳ. Model for the computation of x̄ ⊙ ȳ, where ⊙ is one of the elementary operations +, −, *, /:
1. Given floating point numbers x̄ and ȳ.
2. Compute x̄ ⊙ ȳ exactly.
3. Round the exact result x̄ ⊙ ȳ to the nearest floating point number and normalize the result.
In the above example: 1.234 + 2.751 * 10^-1 = 1.5091. Computed result: 1.509. The actual implementation of the elementary operations is more sophisticated [1].
Given two numbers x̄, ȳ in floating point format, the computed result satisfies

| fl(x̄ ⊙ ȳ) − (x̄ ⊙ ȳ)| / |x̄ ⊙ ȳ| ≤ ε_mach.

Example 1.6. Consider the floating point system β = 10 and m = 4.

i. x̄ = 2.552 * 10^3 and ȳ = 2.551 * 10^3.
   x̄ − ȳ = 0.001 * 10^3 = 1.000 * 10^0. In this case x̄ − ȳ is a floating point number and nothing needs to be done; no error occurs in the subtraction of x̄, ȳ.
ii. x̄ = 2.552 * 10^3 and ȳ = 2.551 * 10^2.
   x̄ − ȳ = 2.2969 * 10^3. This is not a floating point number. The floating point number nearest to x̄ − ȳ is fl(x̄ − ȳ) = 2.297 * 10^3.

| fl(x̄ − ȳ) − (x̄ − ȳ)| / |x̄ − ȳ| = |2.297 * 10^3 − 2.2969 * 10^3| / (2.2969 * 10^3) ≈ 4.4 * 10^-5 < ε_mach = 5 * 10^-4.

The previous result on the error between x̄ ⊙ ȳ and the computed fl(x̄ ⊙ ȳ) only holds if x̄, ȳ are in floating point format. What happens when we operate with numbers that are not in floating point format?

Example 1.7. Consider the floating point system β = 10 and m = 4.

Subtract the numbers x = 2.5515052 * 10^3 and y = 2.5514911 * 10^3.
1. Compute the floating point numbers x̄ and ȳ nearest to x and y, respectively: x̄ = 2.552 * 10^3 and ȳ = 2.551 * 10^3.
2. Compute x̄ − ȳ exactly: x̄ − ȳ = 0.001 * 10^3.
3. Round the exact result x̄ − ȳ to the nearest floating point number: fl(0.001 * 10^3) = 0.001 * 10^3. Normalize the number: fl(0.001 * 10^3) = 1.000 * 10^0. The last digits are filled with (spurious) zeros.
The exact result is 2.5515052 * 10^3 − 2.5514911 * 10^3 = 1.410 * 10^-2. The relative error between the exact and computed solution is

|1.000 − 1.410 * 10^-2| / (1.410 * 10^-2) ≈ 70 ≫ ε_mach = 5 * 10^-4.

Note that this large error is not due to the computation of fl(x̄ − ȳ). The large error is caused by the rounding of x and y at the beginning.

To analyze the error incurred by the subtraction of two numbers, the following representation is useful:
For every x ∈ R, there exists ε with |ε| ≤ ε_mach such that

fl(x) = x(1 + ε).

Note that if x ̸= 0, then the previous identity is satisfied for ε := (fl(x) − x)/x. The bound |ε| ≤ εmach follows from
(1.2).
For x, y ∈ R we have ε1 , ε2 with |ε1 |, |ε2 | ≤ ε_mach such that

fl(x) = x(1 + ε1 ), fl(y) = y(1 + ε2 ).

Moreover fl(fl(x) − fl(y)) = (fl(x) − fl(y))(1 + ε3 ), with |ε3 | ≤ εmach .


Thus,

fl(fl(x) − fl(y)) = (fl(x) − fl(y))(1 + ε3 ) = [x(1 + ε1 ) − y(1 + ε2 )](1 + ε3 )


= (x − y)(1 + ε3 ) + (xε1 − yε2 )(1 + ε3 )

and, if x − y ≠ 0, then the relative error is given by

| fl(fl(x) − fl(y)) − (x − y)| / |x − y| = | ε3 + (x ε1 − y ε2)/(x − y) · (1 + ε3) |.    (1.3)

If ε1 ε2 ≠ 0 and x − y is small, the quantity on the right hand side can be ≫ ε_mach.


A similar analysis can be carried out for +, −, *, /. Catastrophic cancellation can only occur with +, −. Catastrophic cancellation can only occur if one subtracts two numbers which are not both in floating point format, which have the same sign and are of approximately the same size, see (1.3), or if one adds two numbers which are not both in floating point format, which have opposite signs and whose absolute values are of approximately the same size.

Example 1.8. The roots of the quadratic equation

ax^2 + bx + c = 0

are given by

x± = ( −b ± √(b^2 − 4ac) ) / (2a).

When a = 5 * 10^-4, b = 100, and c = 5 * 10^-3 the computed (using single precision Fortran) first root is

x+ = 0.

This cannot be exact, since x = 0 is a solution of the quadratic equation if and only if c = 0. Since fl(b^2 − 4ac) = fl(b^2) for the data given above, we suffer from catastrophic cancellation.
A remedy is the following reformulation of the formula for x+ :

( −b + √(b^2 − 4ac) ) / (2a) = (1/(2a)) * ( −b + √(b^2 − 4ac) )( −b − √(b^2 − 4ac) ) / ( −b − √(b^2 − 4ac) ) = 2c / ( −b − √(b^2 − 4ac) ).

Here the subtraction of two almost equal numbers is avoided and the computation using this formula gives x+ = −0.5E−04.
A 'stable' (see later for a description of stability) formula for both roots is

x1 = ( −b − sign(b) √(b^2 − 4ac) ) / (2a),    x2 = c/(a x1).
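The effect is easy to reproduce in single precision. A small Python/NumPy sketch (the notes used Fortran; NumPy's float32 plays the role of single precision here):

import numpy as np

a = np.float32(5e-4); b = np.float32(100.0); c = np.float32(5e-3)
disc = np.sqrt(b * b - np.float32(4) * a * c)      # fl(b^2 - 4ac) equals fl(b^2) in float32
x_naive = (-b + disc) / (np.float32(2) * a)        # catastrophic cancellation: gives 0.0
x_stable = np.float32(2) * c / (-b - disc)         # reformulated expression: gives about -5e-05
print(x_naive, x_stable)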

2 Interpolation

We start this section with the general interpolation problem.

2.1 Interpolation problem:

Let Φ(x; a0 , . . . , an ) be a family of functions of variable x, which can be real or complex. Given n + 1 pairs (xi , fi ),
i = 0, 1, . . . , n, find parameters a0 , a1 , . . . , an such that

Φ(xi ; a0 , . . . , an ) = fi , i = 0, 1, . . . , n.

Here are some examples.



Example 2.1 (Polynomial interpolation).

Φ(x; a0 , . . . , an ) = a0 + a1 x + a2 x^2 + · · · + an x^n.

Example 2.2 (Rational interpolation).

Φ(x; a0 , . . . , an ; b0 , . . . , bm ) = (a0 + a1 x + a2 x^2 + · · · + an x^n) / (b0 + b1 x + · · · + bm x^m).

Example 2.3 (Trigonometric interpolation).

Φ(x; a0 , . . . , an ) = a0 + a1 e^{ix} + a2 e^{2ix} + · · · + an e^{inx}.

Example 2.4 (Exponential interpolation).

Φ(x; a0 , . . . , an ; λ0 , . . . , λn ) = a0 e^{λ0 x} + a1 e^{λ1 x} + · · · + an e^{λn x}.

There are many others, for example splines, which we will address later.
Definition 2.1. The interpolation problem is linear if Φ(xi ; a0 , . . . , an ) depends linearly on a0 , a1 , . . . , an , i.e.

Φ(x; a0 , . . . , an ) = a0 Φ0 (x) + a1 Φ1 (x) + · · · + an Φn (x),

for some functions Φi (x), i = 0, 1, . . . , n.


It is easy to see that the interpolation problems in Example 2.1 and Example 2.3 are linear, while those in Example 2.2 and Example 2.4 are nonlinear.

2.2 Polynomial Interpolation in 1D

Polynomial interpolation is a historically very important and well investigated problem, due to all the nice properties of polynomials.

Problem 1 (Polynomial Interpolation). Given (xi , fi ), i = 0, 1, . . . , n, find

p(x) = a0 + a1 x + a2 x^2 + · · · + an x^n

such that
p(xi ) = fi ,   i = 0, 1, . . . , n.

The main motivation for the approximation is to estimate the unknown values of f (x).
Definition 2.2. If the point x̄ lies inside the interval formed by the points x1 , . . . , xn , we speak about interpolation; if
the point x̄ lies outside the interval formed by the points x1 , . . . , xn , we speak about extrapolation.
First, we establish that the interpolation problem is well-posed.
Theorem 2.1 (Existence and Uniqueness). Given n + 1 distinct points xi , i.e. xi ≠ xj for i ≠ j, and arbitrary n + 1 values f0 , f1 , . . . , fn , there exists a unique polynomial pn (x) of degree n or less such that pn (xi ) = fi for i = 0, 1, . . . , n.

Proof. First we will establish uniqueness.


Uniqueness.

Assume there are two such polynomials pn (x) and p̃n (x). Then q(x) = pn (x) − p̃n (x) is a polynomial of degree at most n that has n + 1 roots, namely xi , i = 0, 1, . . . , n. The only such polynomial is the zero polynomial. Thus q(x) ≡ 0 and p̃n (x) = pn (x).
Existence.
For i = 0, 1, . . . , n consider

Li (x) = [(x − x0 ) · · · (x − xi−1 )(x − xi+1 ) · · · (x − xn )] / [(xi − x0 ) · · · (xi − xi−1 )(xi − xi+1 ) · · · (xi − xn )] = ∏_{j=0, j≠i}^{n} (x − xj )/(xi − xj ).

Then each Li (x) is a polynomial of degree n with the property

Li (xj ) = δij = { 1 if i = j,  0 if i ≠ j }.

Thus,
pn (x) = f0 L0 (x) + f1 L1 (x) + · · · + fn Ln (x)
is the desired polynomial.

Example 2.5.

xi  0 1 3
fi  1 3 2

Thus,

L0 (x) = (x − 1)(x − 3)/((0 − 1)(0 − 3)),   L1 (x) = x(x − 3)/((1 − 0)(1 − 3)),   L2 (x) = x(x − 1)/((3 − 0)(3 − 1))

and combining the terms, we compute

p(x) = 1 · (x − 1)(x − 3)/3 − 3 · x(x − 3)/2 + 2 · x(x − 1)/6 = (−5x^2 + 17x + 6)/6.

From now on we assume that the points x1 , . . . , xn are distinct. Given a basis ψ1 (x), . . . , ψn (x) of Pn−1 , Problem 1 can now be stated as: find coefficients a1 , . . . , an for the polynomial

p(x) = a1 ψ1 (x) + a2 ψ2 (x) + · · · + an ψn (x) ∈ Pn−1

such that
p(xi ) = a1 ψ1 (xi ) + a2 ψ2 (xi ) + · · · + an ψn (xi ) = fi
for i = 1, 2, . . . , n, which is equivalent to the linear system

[ ψ1 (x1 )  ψ2 (x1 )  . . .  ψn (x1 ) ] [ a1 ]   [ f1 ]
[ ψ1 (x2 )  ψ2 (x2 )  . . .  ψn (x2 ) ] [ a2 ]   [ f2 ]
[   ...       ...     . . .    ...    ] [ .. ] = [ .. ]    (2.1)
[ ψ1 (xn )  ψ2 (xn )  . . .  ψn (xn ) ] [ an ]   [ fn ]

In Theorem 2.1 we have established the existence and uniqueness of the solution of the linear system (2.1) for an arbitrary right hand side. As a result the matrix in (2.1) is non-singular for any choice of basis. Thus mathematically any choice of basis would work; computationally, however, it makes a huge difference. Here are some natural choices.

2.2.1 Monomial Basis

We denote the monomial basis by

M0 = 1,   M1 = x,   M2 = x^2,   · · ·   Mn−1 = x^{n−1}.

The resulting matrix takes the form

[ M0 (x1 )  M1 (x1 )  . . .  Mn−1 (x1 ) ]   [ 1  x1  . . .  x1^{n−1} ]
[ M0 (x2 )  M1 (x2 )  . . .  Mn−1 (x2 ) ]   [ 1  x2  . . .  x2^{n−1} ]
[   ...       ...     . . .     ...     ] = [ ..  ..  . . .    ...   ] ,
[ M0 (xn )  M1 (xn )  . . .  Mn−1 (xn ) ]   [ 1  xn  . . .  xn^{n−1} ]

which is known as the Vandermonde matrix. We can make several observations. First of all, the matrix is full, so potentially it can be expensive to solve the linear system. Secondly, looking at the plot of the monomial basis functions we can observe that they are very similar near 0, meaning that if the interpolation points are near zero, the matrix can be close to singular. Later we make this more precise when we study the condition number of a matrix. On the other hand, once the coefficients a0 , . . . , an−1 are found, the evaluation of the resulting polynomial at any point x̄ is rather simple. We can make it even more efficient by noticing

p(x) = a0 + a1 x + a2 x^2 + · · · + an−2 x^{n−2} + an−1 x^{n−1}
     = a0 + (a1 + a2 x + · · · + an−2 x^{n−3} + an−1 x^{n−2}) x
     = a0 + (a1 + (a2 + · · · + an−2 x^{n−4} + an−1 x^{n−3}) x) x            (2.2)
     = a0 + [a1 + [a2 + · · · + [an−2 + an−1 x] · · · ] x] x.

Using this nested form, we can write an algorithm for evaluation, known as Horner's Scheme.

Algorithm 2.2 (Horner’s Scheme for monomial basis)

Input: The interpolation points x1 , . . . , xn


The coefficients a1 , . . . , an
The point (or points) x̄ at which the polynomial is to be evaluated

Output: The value of the interpolating polynomial p(x̄)

1. p = an
2. for i = n − 1 : −1 : 1
3. p = p * x + ai
4. end
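As an illustration, here is a minimal Python version of Algorithm 2.2 (the notes use Matlab-style pseudocode; the 0-based indexing below is a Python convention, not part of the notes):

def horner_monomial(a, xbar):
    """Evaluate p(x) = a[0] + a[1]*x + ... + a[n-1]*x**(n-1) at xbar by Horner's scheme."""
    p = a[-1]
    for ai in reversed(a[:-1]):
        p = p * xbar + ai
    return p

print(horner_monomial([6/6, 17/6, -5/6], 3.0))   # the polynomial of Example 2.5 at x = 3, gives 2.0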

2.2.2 Lagrange Basis

Another obvious choice is the Lagrange basis

Li (x) = ∏_{j=1, j≠i}^{n} (x − xj )/(xi − xj ).

Since
Li (xj ) = δij = { 1 if i = j,  0 if i ≠ j },
the resulting matrix is just the identity matrix, and as a result a1 = f1 , . . . , an = fn and

pn (x) = f1 L1 (x) + f2 L2 (x) + · · · + fn Ln (x).

However the evaluation of the above expression at some x̄ is not cheap.

2.2.3 Newton’s Basis

We denote the Newton basis by

N0 = 1,   N1 = x − x1 ,   N2 = (x − x1 )(x − x2 ),   · · · ,   Nn−1 = ∏_{j=1}^{n−1} (x − xj ).

The resulting matrix takes the form

[ N0 (x1 )  N1 (x1 )  N2 (x1 )  . . .  Nn−1 (x1 ) ]   [ 1     0          0                  . . .  0                        ]
[ N0 (x2 )  N1 (x2 )  N2 (x2 )  . . .  Nn−1 (x2 ) ]   [ 1  x2 − x1       0                  . . .  0                        ]
[ N0 (x3 )  N1 (x3 )  N2 (x3 )  . . .  Nn−1 (x3 ) ] = [ 1  x3 − x1  (x3 − x1 )(x3 − x2 )    . . .  0                        ] ,
[   ...       ...       ...     . . .     ...     ]   [ ..    ...        ...                . . .  ...                      ]
[ N0 (xn )  N1 (xn )  N2 (xn )  . . .  Nn−1 (xn ) ]   [ 1  xn − x1  (xn − x1 )(xn − x2 )    . . .  ∏_{j=1}^{n−1} (xn − xj ) ]

which is a lower triangular matrix, so the system can be solved explicitly by forward substitution. Thus,

a1 = f1 ,
a2 = (f2 − a1 ) / (x2 − x1 ),
a3 = (f3 − a1 − a2 (x3 − x1 )) / ((x3 − x1 )(x3 − x2 )),
...
an = (fn − ∑_{i=1}^{n−1} ai ∏_{j=1}^{i−1} (xn − xj )) / ∏_{j=1}^{n−1} (xn − xj ).

Similarly to the monomial basis, once the coefficients a0 , . . . , an−1 are found, the evaluation of the resulting polynomial at any point x̄ can be done via Horner's Scheme as well, since

p(x) = a0 + a1 (x − x1 ) + a2 (x − x1 )(x − x2 ) + · · · + an−2 ∏_{j=1}^{n−2} (x − xj ) + an−1 ∏_{j=1}^{n−1} (x − xj )
     = a0 + [ a1 + a2 (x − x2 ) + · · · + an−2 ∏_{j=2}^{n−2} (x − xj ) + an−1 ∏_{j=2}^{n−1} (x − xj ) ] (x − x1 )          (2.3)
     = a0 + [ a1 + [ a2 + · · · + [ an−2 + an−1 (x − xn−1 ) ] · · · ] (x − x2 ) ] (x − x1 ).

Using this nested form, we can write an algorithm for evaluation, known as Horner's Scheme.

Algorithm 2.3 (Horner’s Scheme for Newton’s basis)

Input: The interpolation points x1 , . . . , xn


The coefficients a1 , . . . , an
The point (or points) x̄ at which the polynomial is to be evaluated

Output: The value of the interpolating polynomial p(x̄)

1. p = an
2. for i = n − 1 : −1 : 1
3. p = p * (x − xi ) + ai
4. end
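A Python counterpart of Algorithm 2.3 (again a sketch with 0-based indexing, not the notation of the notes):

def horner_newton(a, x, xbar):
    """Evaluate the Newton-form polynomial a[0] + a[1](x-x[0]) + a[2](x-x[0])(x-x[1]) + ... at xbar."""
    p = a[-1]
    for ai, xi in zip(reversed(a[:-1]), reversed(x[:-1])):
        p = p * (xbar - xi) + ai
    return p

print(horner_newton([-5, 2, -4, 8, 3], [0, 1, -1, 2, -2], 3.0))   # Example 2.6 polynomial at x = 3, gives 241.0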

2.3 Divided differences

For this section we need new notation.


Definition 2.3. We denote by Pn−1 ( f | x1 , . . . , xn )(x) the polynomial of degree n − 1 that interpolates a function f at the points x1 , . . . , xn .
The main idea of divided differences is based on the elementary fact that for any two given points (x1 , f1 ) and (x2 , f2 ) with x1 ≠ x2 there is a unique straight line that passes through them, namely

l(x) = f1 + (f2 − f1 )/(x2 − x1 ) · (x − x1 ).
Using the new notation from Definition 2.3, we can rewrite it as

P1 ( f | x1 , x2 )(x) = P0 ( f | x1 )(x) + [ P0 ( f | x2 )(x) − P0 ( f | x1 )(x) ] · (x − x1 )/(x2 − x1 ),

since P0 ( f | xi )(x) is just the constant function fi . Surprisingly, the above expression can be generalized to polynomials of arbitrary degree.
Theorem 2.4. Given Pn−2 ( f | x1 , . . . , xn−1 )(x) and Pn−2 ( f | x2 , . . . , xn )(x), we can obtain Pn−1 ( f | x1 , . . . , xn )(x) from

Pn−1 ( f | x1 , . . . , xn )(x) = Pn−2 ( f | x1 , . . . , xn−1 )(x) + [ Pn−2 ( f | x2 , . . . , xn )(x) − Pn−2 ( f | x1 , . . . , xn−1 )(x) ] · (x − x1 )/(xn − x1 ).

Proof. First of all we notice that the expression on the right (and as a result on the left) is a polynomial of degree n − 1. Thus, the only thing we need to check is that it indeed interpolates the function f at x1 , . . . , xn .
For x = x1 , since Pn−2 ( f | x1 , . . . , xn−1 )(x1 ) = f1 , we obtain

Pn−1 ( f | x1 , . . . , xn )(x1 ) = Pn−2 ( f | x1 , . . . , xn−1 )(x1 ) + 0 = f1 .

For x = xn , since Pn−2 ( f | x2 , . . . , xn )(xn ) = fn , we obtain

Pn−1 ( f | x1 , . . . , xn )(xn ) = Pn−2 ( f | x1 , . . . , xn−1 )(xn ) + [ fn − Pn−2 ( f | x1 , . . . , xn−1 )(xn ) ] · (xn − x1 )/(xn − x1 ) = fn .

For x = xi , 1 < i < n, we notice that Pn−2 ( f | x2 , . . . , xn )(xi ) − Pn−2 ( f | x1 , . . . , xn−1 )(xi ) = fi − fi = 0 and as a result

Pn−1 ( f | x1 , . . . , xn )(xi ) = Pn−2 ( f | x1 , . . . , xn−1 )(xi ) + 0 = fi .

The above result gives us a recursive way to construct the interpolating polynomial

P0 ( f | x1 ) ↘
P0 ( f | x2 ) → P1 ( f | x1 , x2 )
.. .. .. ..
. . . .
.. .. ..
. . . ↘
P0 ( f | xn−1 ) → P1 ( f | xn−2 , xn−1 ) . . . . . . → Pn−2 ( f | x1 , . . . , xn−1 ) ↘
P0 ( f | xn ) → P1 ( f | xn−1 , xn ) . . . . . . → Pn−2 ( f | x2 , . . . , xn ) → Pn−1 ( f | x1 , . . . , xn ).

2.3.1 Divided differences for Newton’s basis

In this section we will see how efficient the method of divided differences can be for the Newton basis. For the Newton basis the interpolating polynomial has the form

Pn−1 ( f | x1 , . . . , xn )(x) = ∑_{i=1}^{n} ai ∏_{j=1}^{i−1} (x − xj ) = an x^{n−1} + · · · .    (2.4)

From the above we can observe that the leading coefficient an of the interpolating polynomial is the same as the leading coefficient of its representation in the Newton basis.

Definition 2.4. The leading coefficient ak of the polynomial Pk−1 ( f | x1 , . . . , xk )(x) is called the (k − 1)st divided difference and is denoted by f [x1 , . . . , xk ].

Using this definition we can write Pn−1 ( f | x1 , . . . , xn )(x) in the Newton basis as

Pn−1 ( f | x1 , . . . , xn )(x) = ∑_{i=1}^{n} f [x1 , . . . , xi ] ∏_{j=1}^{i−1} (x − xj ).    (2.5)

From Theorem 2.4 it follows that

Pk−j ( f | xj , . . . , xk )(x) = Pk−j−1 ( f | xj , . . . , xk−1 )(x) + [ Pk−j−1 ( f | xj+1 , . . . , xk )(x) − Pk−j−1 ( f | xj , . . . , xk−1 )(x) ] · (x − xj )/(xk − xj ).    (2.6)
Using (2.5) we have

Pk−j ( f | xj , . . . , xk )(x) = ∑_{i=j}^{k} f [xj , . . . , xi ] ∏_{m=j}^{i−1} (x − xm ),
Pk−j−1 ( f | xj , . . . , xk−1 )(x) = ∑_{i=j}^{k−1} f [xj , . . . , xi ] ∏_{m=j}^{i−1} (x − xm ),
Pk−j−1 ( f | xj+1 , . . . , xk )(x) = ∑_{i=j+1}^{k} f [xj+1 , . . . , xi ] ∏_{m=j+1}^{i−1} (x − xm ).

Plugging the above expressions into (2.6), we obtain

∑_{i=j}^{k} f [xj , . . . , xi ] ∏_{m=j}^{i−1} (x − xm ) = ∑_{i=j}^{k−1} f [xj , . . . , xi ] ∏_{m=j}^{i−1} (x − xm )
   + (x − xj )/(xk − xj ) ( ∑_{i=j+1}^{k} f [xj+1 , . . . , xi ] ∏_{m=j+1}^{i−1} (x − xm ) − ∑_{i=j}^{k−1} f [xj , . . . , xi ] ∏_{m=j}^{i−1} (x − xm ) ).    (2.7)

Since the leading coefficients of the polynomials on the left and on the right hand sides must be equal, we obtain the following formula

f [xj , . . . , xk ] = ( f [xj+1 , . . . , xk ] − f [xj , . . . , xk−1 ] ) / (xk − xj ).    (2.8)
Using it we obtain a recursive way to compute coefficients

f [x1 ] ↘
f [x2 ] → f [x1 , x2 ]
.. .. .. ..
. . . .
.. .. ..
. . . ↘
f [xn−1 ] → f [xn−2 , xn−1 ] . . . . . . → f [x1 , . . . , xn−1 ] ↘
f [xn ] → f [xn−1 , xn ] . . . . . . → f [x2 , . . . , xn ] → f [x1 , . . . , xn ].

Algorithm 2.5 (Newton’s Interpolating polynomials)

Input: The interpolation points x1 , . . . , xn


The values f1 , . . . , fn
Output: The coefficients a1 , · · · , an of the Newton’s interpolating polynomial (returned as a11 , . . . , ann )

1. for i = 1 : n
2. ai1 = fi
3. end
4. for j = 2 : n
5. for i = j : n
6. aij = (ai,j−1 − ai−1,j−1 ) / (xi − xi−j+1 )
7. end
8. end

One can modify the algorithm to save some storage by overwriting the entries of the coefficients that are not needed anymore.

Algorithm 2.6 (Modified Newton’s Interpolating polynomials)

Input: The interpolation points x1 , . . . , xn


The values f1 , . . . , fn
Output: The coefficients a1 , · · · , an of the Newton’s interpolating polynomials (returned as a1 , . . . , an )

1. for i = 1 : n
2. ai = f i
3. end
4. for j = 2 : n
5. for i = n : −1 : j
6. ai = (ai − ai−1 ) / (xi − xi−j+1 )
7. end
8. end
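A compact Python version of Algorithm 2.6 (a sketch with 0-based indices; the names are chosen here, not taken from the notes):

def newton_coefficients(x, f):
    """Divided-difference coefficients of the Newton interpolating polynomial."""
    a = list(f)
    n = len(x)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):     # run backwards so entries are overwritten safely
            a[i] = (a[i] - a[i - 1]) / (x[i] - x[i - j])
    return a

print(newton_coefficients([0, 1, -1, 2, -2], [-5, -3, -15, 39, -9]))   # Example 2.6: [-5, 2, -4, 8, 3]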

Example 2.6.

xi 0 1 −1 2 −2
fi −5 −3 −15 39 −9
Using (2.8), we obtain

xi fi
0 −5
1 −3 2
−1 −15 6 −4
2 39 18 12 8
−2 −9 12 6 2 3

Hence the polynomial is

P4 ( f | x1 , . . . , x5 )(x) = −5 + 2x − 4x(x − 1) + 8x(x − 1)(x + 1) + 3x(x − 1)(x + 1)(x − 2).

We can notice that the table actually contains much more information. For example, it contains the coefficients of
the polynomial interpolating (−1, −15), (2, 39), (−2, −9), which is

P2 ( f | x3 , x4 , x5 )(x) = −15 + 18(x + 1) + 6(x + 1)(x − 2)

or the coefficients of the polynomial interpolating (1, −3), (−1, −15), (2, 39), which is

P2 ( f | x2 , x3 , x4 )(x) = −3 + 6(x − 1) + 12(x − 1)(x + 1).

Once the coefficients of the Newton’s polynomial are computed we can use Horner’s Scheme (Algorithm 2.3) to
evaluate it at given points.

2.4 Neville Algorithm

The recursive formula in Theorem 2.4 can be used to compute the value of the interpolating polynomial at some point x̄ without computing the coefficients ai . Since

Pk−j ( f | xj , . . . , xk )(x̄) = Pk−j−1 ( f | xj , . . . , xk−1 )(x̄) + [ Pk−j−1 ( f | xj+1 , . . . , xk )(x̄) − Pk−j−1 ( f | xj , . . . , xk−1 )(x̄) ] · (x̄ − xj )/(xk − xj )

and naturally
P0 ( f | xi )(x̄) = fi ,   i = 1, 2, . . . , n,
the other values can be computed recursively, and we obtain the Neville scheme:

P0 ( f | x1 )(x̄) ↘
P0 ( f | x2 )(x̄) → P1 ( f | x1 , x2 )(x̄)
.. .. .. ..
. . . .
.. .. ..
. . . ↘
P0 ( f | xn−1 )(x̄) → P1 ( f | xn−2 , xn−1 )(x̄) . . . . . . → Pn−2 ( f | x1 , . . . , xn−1 )(x̄) ↘
P0 ( f | xn )(x̄) → P1 ( f | xn−1 , xn )(x̄) . . . . . . → Pn−2 ( f | x2 , . . . , xn )(x̄) → Pn−1 ( f | x1 , . . . , xn )(x̄).

Algorithm 2.7 (Neville Algorithm)

Input: The interpolation points x1 , . . . , xn


The values f1 , . . . , fn
The point x̄ at which the interpolating polynomial is to be evaluated.
Output: The value p of the interpolating polynomial at x̄

1. for i = 1 : n
2. pi = f i
3. end
4. for j = 2 : n
5. for i = n : −1 : j
6. pi = pi−1 + (x̄ − xi−j+1 )/(xi − xi−j+1 ) * (pi − pi−1 )
7. end
8. end
9. p = pn

Example 2.7.

Assume we want to evaluate the polynomial interpolating


xi 0 1 −1 2 −2
fi −5 −3 −15 39 −9
at x̄ = 3. Using Neville algorithm, we obtain

xi fi
0 −5
1 −3 1
−1 −15 9 −23
2 39 57 105 169
−2 −9 51 81 121 241

Thus, P4 ( f | x1 , x2 , x3 , x4 , x5 )(3) = 241.
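The same computation in Python (a sketch of Algorithm 2.7 with 0-based indices):

def neville(x, f, xbar):
    """Value at xbar of the polynomial interpolating (x[i], f[i]), computed by the Neville scheme."""
    p = list(f)
    n = len(x)
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            p[i] = p[i - 1] + (xbar - x[i - j]) / (x[i] - x[i - j]) * (p[i] - p[i - 1])
    return p[-1]

print(neville([0, 1, -1, 2, -2], [-5, -3, -15, 39, -9], 3.0))   # Example 2.7: 241.0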

2.5 Approximation properties of interpolating polynomials

In the previous sections we showed how to compute interpolating polynomials and some evaluation techniques. In this section we try to answer the question of how close the interpolating polynomial is to the function f; namely, we want an estimate for

sup_{x∈[a,b]} | f (x) − P( f | x1 , . . . , xn )(x)|    (2.9)

on some interval [a, b].


The idea behind the main result of this section is the Rolle’s theorem.
Theorem 2.8 (Rolle’s Theorem). If g ∈ C1 [a, b] and g(a) = g(b) = 0, then there exists c ∈ (a, b) such that g′ (c) = 0.
Using it we can show the following result.
Theorem 2.9. Let x1 , . . . , xn be distinct points and f ∈ Cn [a, b]. Then for each x̄ ∈ [a, b] there exists ξ (x̄) in the interval generated by x1 , . . . , xn , x̄ such that

f (x̄) − P( f | x1 , . . . , xn )(x̄) = (1/n!) ω(x̄) f^{(n)} (ξ (x̄)),

where ω(x) = ∏_{j=1}^{n} (x − xj ).

Proof. If x̄ = xi for some i = 1, . . . , n, then the result naturally holds. Assume x̄ ≠ xi for any i = 1, . . . , n. Consider the function

ψ(x) = f (x) − P( f | x1 , . . . , xn )(x) − c ω(x),

where the constant c is taken to be

c = ( f (x̄) − P( f | x1 , . . . , xn )(x̄) ) / ω(x̄).

With this choice of the constant c, the function ψ(x) has at least n + 1 roots, namely x1 , . . . , xn and x̄. Thus, from Rolle's Theorem there exist n points, call them xi^{(1)} , i = 1, . . . , n, such that

ψ′(xi^{(1)} ) = 0,   i = 1, . . . , n.

Again, from Rolle's Theorem there exist n − 1 points, call them xi^{(2)} , i = 1, . . . , n − 1, such that

ψ″(xi^{(2)} ) = 0,   i = 1, . . . , n − 1.

Continuing this process, there exists a point x1^{(n)} such that

ψ^{(n)} (x1^{(n)} ) = 0.

From the definition of ψ(x), we have

d^n/dx^n ψ(x) = d^n/dx^n f (x) − d^n/dx^n P( f | x1 , . . . , xn )(x) − c d^n/dx^n ω(x).

Since P( f | x1 , . . . , xn )(x) is a polynomial of degree n − 1, we have

d^n/dx^n P( f | x1 , . . . , xn )(x) = 0,

and since ω(x) is a polynomial of degree n with leading coefficient 1, we have

d^n/dx^n ω(x) = n!.
As a result
ψ^{(n)} (x1^{(n)} ) = f^{(n)} (x1^{(n)} ) − c n! = 0   ⇒   c = f^{(n)} (x1^{(n)} ) / n!,

which shows the theorem with ξ (x̄) = x1^{(n)} .
As a corollary, we immediately obtain

Corollary 2.1.

max_{x∈[a,b]} | f (x) − P( f | x1 , . . . , xn )(x)| ≤ (1/n!) max_{x∈[a,b]} | f^{(n)} (ξ (x))| · max_{x∈[a,b]} | ∏_{j=1}^{n} (x − xj ) |.

Thus we observe that the error is bounded by a product of three terms. The first term, 1/n!, decreases rather fast, the second term max_{x∈[a,b]} | f^{(n)} (ξ (x))| depends on the (unknown) function f, and the last term max_{x∈[a,b]} | ∏_{j=1}^{n} (x − xj ) | looks rather mysterious. Of course we can estimate it roughly as

max_{x∈[a,b]} | ∏_{j=1}^{n} (x − xj ) | ≤ (b − a)^n,

and if we can control the n-th derivative of f by M^n, then from Corollary 2.1 we obtain

max_{x∈[a,b]} | f (x) − P( f | x1 , . . . , xn )(x)| ≤ M^n (b − a)^n / n! → 0   as n → ∞.

Example 2.8. Consider f (x) = sin (3x) on [0, π]. Then | f^{(n)} (x)| ≤ 3^n and we obtain

max_{x∈[0,π]} | sin (3x) − P( f | x1 , . . . , xn )(x)| ≤ 3^n π^n / n!.

Although we know that lim_{n→∞} 3^n π^n / n! = 0, we need many points before 3^n π^n / n! is small. Thus for n = 20, 3^n π^n / n! ≈ 12.5689, while for n = 30, 3^n π^n / n! ≤ 6.3 * 10^-4. Thus even for this simple example we need to deal with high order polynomials. Of course we used a rather crude estimate for max_{x∈[a,b]} |ω(x)|, and indeed if we have more information about the locations of x1 , . . . , xn we can obtain better estimates.

2.6 Equidistant points

It is very natural to select n equidistant points on an interval (a, b)

b−a
xi = a + (i − 1)h with h = , i = 1, . . . , n.
n−1
For n = 2, we have x1 = a and x2 = b, h = b − a and

ω(x) = (x − a)(x − b) = (x − x1 )(x − x2 ).


Since ω is a parabola, the maximum value of |ω(x)| is attained at (x1 + x2 )/2 and as a result

max_{x∈[a,b]} |ω(x)| = max_{x∈[a,b]} |(x − x1 )(x − x2 )| = (x2 − x1 )^2/4 = (b − a)^2/4 = h^2/4.    (2.10)

Thus, from Corollary 2.1 we obtain that in this case

max_{x∈[a,b]} | f (x) − P1 ( f | x1 , x2 )(x)| ≤ ((b − a)^2/8) max_{x∈[a,b]} | f ″(x)| = (h^2/8) max_{x∈[a,b]} | f ″(x)|.    (2.11)

For n ≥ 3 we can obtain the following result.

Lemma 2.1. For equidistant points

xi = a + (i − 1)h   with   h = (b − a)/(n − 1),   i = 1, . . . , n,

and arbitrary x ∈ [a, b] the following estimate holds

|ω(x)| = | ∏_{i=1}^{n} (x − xi ) | ≤ (h^n/4) (n − 1)!.

Proof. Let x ∈ [xj , xj+1 ] for some 1 ≤ j ≤ n − 1. From (2.10) it follows that

|(x − xj )(x − xj+1 )| ≤ h^2/4.

We have

| ∏_{i=1}^{n} (x − xi ) | = | ∏_{i=1}^{j−1} (x − xi ) | · |(x − xj )(x − xj+1 )| · | ∏_{i=j+2}^{n} (x − xi ) |
   ≤ | ∏_{i=1}^{j−1} (x − xi ) | · (h^2/4) · | ∏_{i=j+2}^{n} (x − xi ) |
   ≤ (h^2/4) | ∏_{i=1}^{j−1} (xj+1 − xi ) | · | ∏_{i=j+2}^{n} (xj − xi ) |.

Since xi = a + (i − 1)h, we have |xj − xi | = | j − i| h and as a result

| ∏_{i=1}^{n} (x − xi ) | ≤ (h^2/4) ∏_{i=1}^{j−1} ( j + 1 − i)h · ∏_{i=j+2}^{n} (i − j)h
   ≤ (h^n/4) ∏_{i=1}^{j−1} ( j + 1 − i) · ∏_{i=j+2}^{n} (i − j)
   ≤ (h^n/4) j! (n − j)!

and since j! (n − j)! ≤ (n − 1)! for 1 ≤ j ≤ n − 1, we obtain the result.
Using the above result we obtain the following theorem.

Theorem 2.10. Let

xi = a + (i − 1)h   with   h = (b − a)/(n − 1),   i = 1, . . . , n.

If f ∈ Cn [a, b], then

max_{x∈[a,b]} | f (x) − P( f | x1 , . . . , xn )(x)| ≤ (h^n/(4n)) max_{x∈[a,b]} | f^{(n)} (ξ (x))|.

We revisit Example 2.8, where the above theorem gives a sharper estimate.

Example 2.9. Consider f (x) = sin (3x) on [0, π]. Then | f^{(n)} (x)| ≤ 3^n and for equidistant points with h = π/(n − 1) Theorem 2.10 gives

max_{x∈[0,π]} | sin (3x) − P( f | x1 , . . . , xn )(x)| ≤ (3^n/(4n)) (π/(n − 1))^n.

The above estimate is much sharper than the one we used in Example 2.8. Thus for n = 20, (3^n/(4n)) (π/(n − 1))^n ≈ 1.0169e−08, and for n = 30, (3^n/(4n)) (π/(n − 1))^n ≈ 1.8924e−17.
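The two bounds are easy to tabulate; a short Python check of the numbers quoted in Examples 2.8 and 2.9 (not from the notes):

import math

def crude_bound(n):          # Example 2.8: 3^n pi^n / n!
    return 3**n * math.pi**n / math.factorial(n)

def equidistant_bound(n):    # Example 2.9: (3^n / (4n)) (pi/(n-1))^n
    return 3**n / (4 * n) * (math.pi / (n - 1))**n

for n in (20, 30):
    print(n, crude_bound(n), equidistant_bound(n))   # 12.5689... and 1.0169e-08 for n = 20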

2.7 Chebyshev points

A natural question: is there a choice of the interpolation nodes that minimizes max_{x∈[a,b]} |ω(x)|, i.e. what is the solution to the following min-max problem

min_{x1 ,...,xn} max_{x∈[a,b]} | ∏_{j=1}^{n} (x − xj ) |.    (2.12)

The solution x1*, . . . , xn* to (2.12) is given by the Chebyshev points

xi* = (a + b)/2 + ((b − a)/2) cos( (2i − 1)π/(2n) ),   i = 1, . . . , n.

In addition, one can show

max_{x∈[a,b]} | ∏_{j=1}^{n} (x − xj*) | ≤ 2^{1−2n} (b − a)^n,

so that Corollary 2.1 gives an error bound of the form (2^{1−2n}/n!) (b − a)^n max_{x∈[a,b]} | f^{(n)} (x)|.
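A one-line Python sketch for generating these points on [a, b] (the function name is ours, not from the notes):

import numpy as np

def chebyshev_points(a, b, n):
    """Chebyshev interpolation points on [a, b] from the formula in Section 2.7."""
    i = np.arange(1, n + 1)
    return (a + b) / 2 + (b - a) / 2 * np.cos((2 * i - 1) * np.pi / (2 * n))

print(chebyshev_points(-1.0, 1.0, 4))   # four Chebyshev points on [-1, 1]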

Example 2.10. Consider f (x) = 1/(1 + x^2).
TO ADD

2.8 Hermite interpolation

2.9 Spline interpolation

For the approximation of a function f on an interval [a, b], instead of choosing a high order interpolating polynomial one can partition the interval into small pieces and on each small piece use a low or moderate order polynomial for the approximation. In addition, one may choose various ways to connect the pieces together, resulting in different global smoothness properties. The advantage of such an approach is that no high order of smoothness of f is required. We will consider two popular choices, linear (continuous) splines and cubic (C2) splines.

2.9.1 Linear Splines

Let x1 , . . . , xn be such that


a = x1 < x2 < · · · < xn−1 < xn = b.
Our goal is to approximate f by piecewise linear polynomials, i.e. on each subinterval [xi , xi+1 ] we seek a linear function

Si (x) = ai + bi (x − xi ),   i = 1, . . . , n − 1.

Thus on each subinterval we have 2 unknowns. Since there are n − 1 subintervals we have 2n − 2 unknowns in total.
We want our spline function S(x) to have the following properties:
1. Interpolation at the nodes, i.e. S(xi ) = f (xi ) := fi for i = 1, 2, . . . , n.
2. Continuity: Si+1 (xi+1 ) = Si (xi+1 ) for i = 1, . . . , n − 2.
Thus in total we have n + n − 2 = 2n − 2 conditions, which matches the total number of unknowns. It is easy to see that the conditions above uniquely determine S(x), and on each subinterval [xi , xi+1 ] the coefficients ai and bi in Si (x) = ai + bi (x − xi ) for i = 1, . . . , n − 1 are given by

ai = fi ,   bi = (fi+1 − fi )/(xi+1 − xi ).
Below we provide algorithms for computing the coefficients and evaluation.

Algorithm 2.11 (Computing coefficients for linear spline)

Input: The interpolation nodes x1 , . . . , xn


The function values f1 , . . . , fn

Output: The coefficients a1 , . . . , an and b1 , . . . , bn of the linear spline

1. for i = 1 : n − 1
2. ai = f i
3. bi = ( fi+1 − fi )/(xi+1 − xi )
4. end

Algorithm 2.12 (Linear spline evaluation)

Input: The interpolation nodes x1 , . . . , xn



The coefficients a1 , . . . , an and b1 , . . . , bn


The point x̄ in [x1 , xn ] at which the linear spline is to be evaluated.

Output: The value S = S(x̄) of the linear spline

1. for i = 1 : n − 1
2. if x̄ ≤ xi+1
3. S = ai + bi * (x̄ − xi )
4. break
5. end
6. end
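In Python the two algorithms above might look as follows (a sketch; the early return plays the role of the break, and x̄ is assumed to lie in [x1 , xn ]):

def linear_spline_coeffs(x, f):
    """Coefficients a_i, b_i of the interpolating linear spline (Algorithm 2.11)."""
    a = [f[i] for i in range(len(x) - 1)]
    b = [(f[i + 1] - f[i]) / (x[i + 1] - x[i]) for i in range(len(x) - 1)]
    return a, b

def linear_spline_eval(x, a, b, xbar):
    """Evaluate the linear spline at xbar in [x[0], x[-1]] (Algorithm 2.12)."""
    for i in range(len(x) - 1):
        if xbar <= x[i + 1]:
            return a[i] + b[i] * (xbar - x[i])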

From (2.11), we obtain the following convergence property of the linear spline
Theorem 2.13. Let S(x) be the linear spline interpolating f at x1 , . . . , xn . If f ∈ C2 [a, b] then

max_{x∈[a,b]} | f (x) − S(x)| ≤ (h^2/8) max_{x∈[a,b]} | f ″(x)|,

where
h = max_{i=1,...,n−1} hi = max_{i=1,...,n−1} (xi+1 − xi ).

2.9.2 Cubic Splines

Again, let x1 , . . . , xn be such that


a = x1 < x2 < · · · < xn−1 < xn = b.
Our goal is to approximate f by piecewise cubic polynomials, i.e. on each subinterval [xi , xi+1 ] we seek a cubic function

Si (x) = ai + bi (x − xi ) + ci (x − xi )^2 + di (x − xi )^3,   i = 1, . . . , n − 1.

Thus on each subinterval we have 4 unknowns. Since there are n − 1 subintervals we have 4n − 4 unknowns in total.
Since we want our spline function S(x) to be smooth, we may ask for the following properties:
1. Interpolation at the nodes, i.e. S(xi ) = f (xi ) := fi for i = 1, 2, . . . , n.
2. Continuity: Si+1 (xi+1 ) = Si (xi+1 ) for i = 1, . . . , n − 2.
3. Continuity of the first derivatives: S′i+1 (xi+1 ) = S′i (xi+1 ) for i = 1, . . . , n − 2.
4. Continuity of the second derivatives: S″i+1 (xi+1 ) = S″i (xi+1 ) for i = 1, . . . , n − 2.

Thus in total we have n + 3(n − 2) = 4n − 6 conditions, which does not match the number of unknowns; we are two short. The popular choices for the two extra conditions are:
∙ Natural boundary: S″1 (x1 ) = 0 and S″n−1 (xn ) = 0.
∙ Clamped boundary: S′1 (x1 ) = f ′(x1 ) and S′n−1 (xn ) = f ′(xn ).
∙ Periodic spline: S1 (x1 ) = Sn−1 (xn ), S′1 (x1 ) = S′n−1 (xn ), and S″1 (x1 ) = S″n−1 (xn ).

It is still not obvious how to compute the cubic spline from these conditions. This is what we will address next. First, we introduce the notation

hi = xi+1 − xi ,   i = 1, . . . , n − 1.
Differentiating the expression

Si (x) = ai + bi (x − xi ) + ci (x − xi )2 + di (x − xi )3 , i = 1, . . . , n − 1, (2.13)

we obtain

Si′ (x) = bi + 2ci (x − xi ) + 3di (x − xi )2 , i = 1, . . . , n − 1 (2.14)


Si′′ (x) = 2ci + 6di (x − xi ), i = 1, . . . , n − 1. (2.15)

From 2.13 we immediately find


ai = Si (xi ) = fi i = 1, . . . , n − 1.
Since all ai are known, the goal is to express the other coefficients, i.e. bi , ci , and di in terms of ai . From the continuity
at nodes we also have

ai+1 = Si+1 (xi+1 ) = Si (xi+1 ) = ai + bi hi + ci h2i + di h3i i = 1, . . . , n − 2.

Put an = f (xn ). Then for i = n − 1 we also have

an = Sn−1 (xn ) = an−1 + bn−1 hn−1 + cn−1 h2n−1 + dn−1 h3n−1 .

Thus,
ai+1 = ai + bi hi + ci h2i + di h3i , i = 1, . . . , n − 1. (2.16)
Similarly, from the continuity of the first derivatives we obtain

bi+1 = S′i+1 (xi+1 ) = S′i (xi+1 ) = bi + 2ci hi + 3di hi^2,   i = 1, . . . , n − 2.

Setting bn = S′n−1 (xn ), we also have

bn = S′n−1 (xn ) = bn−1 + 2cn−1 hn−1 + 3dn−1 hn−1^2.

Summarizing,
bi+1 = bi + 2ci hi + 3di hi^2,   i = 1, . . . , n − 1.    (2.17)
From the continuity of the second derivatives we obtain

2ci+1 = S″i+1 (xi+1 ) = S″i (xi+1 ) = 2ci + 6di hi ,   i = 1, . . . , n − 2.

Again, setting cn = (1/2) S″n−1 (xn ), we also have

2cn = S″n−1 (xn ) = 2cn−1 + 6dn−1 hn−1 .

Summarizing,
2ci+1 = 2ci + 6di hi ,   i = 1, . . . , n − 1.    (2.18)
From (2.18), we find
di = (ci+1 − ci )/(3hi ),   i = 1, . . . , n − 1.    (2.19)
Inserting it into (2.16) and (2.17), we obtain

ai+1 = ai + bi hi + (hi^2/3)(2ci + ci+1 ),   i = 1, . . . , n − 1    (2.20)

and
bi+1 = bi + hi (ci+1 + ci ),   i = 1, . . . , n − 1.    (2.21)

Solving (2.20) for bi we find

bi = (ai+1 − ai )/hi − (hi /3)(2ci + ci+1 ),   i = 1, . . . , n − 1.    (2.22)

Replacing i with i − 1 we have
bi = bi−1 + hi−1 (ci + ci−1 ),   i = 2, . . . , n,    (2.23)
and
bi−1 = (ai − ai−1 )/hi−1 − (hi−1 /3)(2ci−1 + ci ),   i = 2, . . . , n.    (2.24)

Inserting (2.22) and (2.24) into (2.23), we obtain

(ai+1 − ai )/hi − (hi /3)(2ci + ci+1 ) = (ai − ai−1 )/hi−1 − (hi−1 /3)(2ci−1 + ci ) + hi−1 (ci + ci−1 ).

Moving all the ci 's to the left and the ai 's to the right and rearranging the terms, we obtain

hi−1 ci−1 + 2(hi−1 + hi )ci + hi ci+1 = (3/hi )(ai+1 − ai ) − (3/hi−1 )(ai − ai−1 ),   i = 2, . . . , n − 1.    (2.25)

This is a system of n − 2 equations with n unknowns. We need additional conditions (like natural or periodic boundary conditions) to close the system.
Alternatively, we could introduce variables Mi = S′′ (xi ), i = 1, . . . , n, often called moments, and set equations for
them. Of course, Mi = 2ci , however the point of view is slightly different. Thus, since Si (x) is a cubic on [xi , xi+1 ],
Si′′ (x) is linear on [xi , xi+1 ] and in terms of moments has the form
S″i (x) = Mi+1 (x − xi )/hi + Mi (xi+1 − x)/hi ,   i = 1, . . . , n − 1.

Integrating, we obtain

S′i (x) = Mi+1 (x − xi )^2/(2hi ) − Mi (xi+1 − x)^2/(2hi ) + bi ,   i = 1, . . . , n − 1,

and integrating once again

Si (x) = Mi+1 (x − xi )^3/(6hi ) + Mi (xi+1 − x)^3/(6hi ) + bi (x − xi ) + ai ,   i = 1, . . . , n − 1.
And now one can work with this form to derive a linear system for Mi .

2.9.3 Natural Cubic Splines

From the conditions S″1 (x1 ) = 0 and S″n−1 (xn ) = 0 we have c1 = 0 and cn = 0. Thus the equations (2.25) are equivalent to the system

Ac = g,

where A is the (n − 2) × (n − 2) matrix given by

    [ 2(h1 + h2 )    h2                                                     ]
    [ h2             2(h2 + h3 )     h3                                     ]
A = [                    ...             ...              ...              ]    (2.26)
    [                          hn−3     2(hn−3 + hn−2 )   hn−2             ]
    [                                   hn−2              2(hn−2 + hn−1 ) ]

and the (n − 2)-vectors c and g are given by

    [ c2   ]        [ (3/h2 )(a3 − a2 ) − (3/h1 )(a2 − a1 )             ]
    [ c3   ]        [ (3/h3 )(a4 − a3 ) − (3/h2 )(a3 − a2 )             ]
c = [ ...  ] ,  g = [ ...                                               ] .    (2.27)
    [ cn−2 ]        [ (3/hn−2 )(an−1 − an−2 ) − (3/hn−3 )(an−2 − an−3 ) ]
    [ cn−1 ]        [ (3/hn−1 )(an − an−1 ) − (3/hn−2 )(an−1 − an−2 )   ]

Notice that for equidistant nodes x1 , . . . , xn we have hi = h and the matrix A and the vector g take the form

      [ 4 1       ]             [ a3 − 2a2 + a1       ]         [ f (x3 ) − 2 f (x2 ) + f (x1 )       ]
      [ 1 4 1     ]             [ a4 − 2a3 + a2       ]         [ f (x4 ) − 2 f (x3 ) + f (x2 )       ]
A = h [   ...     ] ,  g = (3/h)[ ...                 ] = (3/h) [ ...                                 ] .    (2.28)
      [   1 4 1   ]             [ an−1 − 2an−2 + an−3 ]         [ f (xn−1 ) − 2 f (xn−2 ) + f (xn−3 ) ]
      [     1 4   ]             [ an − 2an−1 + an−2   ]         [ f (xn ) − 2 f (xn−1 ) + f (xn−2 )   ]

Using a Taylor expansion it is easy to see that

( f (xi − h) − 2 f (xi ) + f (xi + h) ) / h^2 ≈ f ″(xi ).

2.9.4 Computation and Convergence of the natural cubic splines

In this section we will derive error estimates for

max_{x∈[a,b]} |S^{(k)} (x) − f^{(k)} (x)|,   for k = 0, 1, 2, 3.

The key to the analysis will be Lemma 2.2. The matrix A possesses very nice properties. Obviously, it is symmetric and diagonally dominant, so it is non-singular. Moreover, we have

Lemma 2.2. Given a linear system Az = w, where A is the matrix given in (2.26) and w = (w1 , . . . , wn−2 )^T is arbitrary. Then for z = (z1 , . . . , zn−2 )^T , we have

max_{1≤i≤n−2} |zi | ≤ max_{1≤i≤n−2} |wi | / (hi + hi+1 ).

Proof. Let max_{1≤i≤n−2} |zi | = |zr | for some 2 ≤ r ≤ n − 3. Then looking at the r-th row of Az = w, we have

hr zr−1 + 2(hr + hr+1 )zr + hr+1 zr+1 = wr .

Using it together with the triangle inequality we obtain

max_i |wi | / (hi + hi+1 ) ≥ |wr | / (hr + hr+1 ) = |hr zr−1 + 2(hr + hr+1 )zr + hr+1 zr+1 | / (hr + hr+1 )
   ≥ 2|zr | − (hr /(hr + hr+1 )) |zr−1 | − (hr+1 /(hr + hr+1 )) |zr+1 |
   ≥ 2|zr | − (hr /(hr + hr+1 )) |zr | − (hr+1 /(hr + hr+1 )) |zr | = |zr |,

where in the last step we used that |zr | is maximal. The cases r = 1 and r = n − 2 are left to the reader.

From the above lemma we immediately obtain


Lemma 2.3. The matrix A given in (2.26) is non-singular.

Proof. Assume it is singular. Then there exists a vector z ≠ 0 such that Az = 0. But then we get a contradiction, since by the above lemma max_i |zi | = 0.

Hence from the system of linear equations Ac = g with A, c, and g given by (2.26) and (2.27), we can obtain c, and
then using (2.22), we can obtain bi and from (2.19), we can compute di . We summarize it in the following algorithms.

Algorithm 2.14 (Computing coefficients for natural cubic spline)

Input: The interpolation nodes x1 , . . . , xn


The function values f1 , . . . , fn

Output: The coefficients a1 , . . . , an−1 , b1 , . . . , bn−1 , c1 , . . . , cn−1 , and d1 , . . . , dn−1 of the natural cubic spline

1. for i = 1 : n
2. ai = f i
3. end
4. for i = 1 : n − 1
5. hi = xi+1 − xi
6. end
7. Generate matrix A ∈ R(n−2)×(n−2) and the vector g ∈ Rn−2 given in (2.26) and (2.27)
8. Compute c2 , . . . , cn−1 by solving the system Ac = g
9. Set c1 = 0 and cn = 0
10. for i = 1 : n − 1
11. bi = (1/hi ) * (ai+1 − ai ) − (hi /3) * (2ci + ci+1 )
12. di = (1/(3hi )) * (ci+1 − ci )
13. end

Once the coefficients of the cubic spline are computed, given a point x̄ we can use the following algorithm to evaluate
S(x̄).

Algorithm 2.15 (Natural cubic spline evaluation)

Input: The interpolation nodes x1 , . . . , xn


The coefficients a1 , . . . , an−1 , b1 , . . . , bn−1 , c1 , . . . , cn−1 , d1 , . . . , dn−1
The point x̄ in [x1 , xn ] at which the cubic spline is to be evaluated.

Output: The value S = S(x̄) of the cubic spline

1. for i = 1 : n − 1
2. if x̄ ≤ xi+1
3. S = ai + bi * (x̄ − xi ) + ci * (x̄ − xi )^2 + di * (x̄ − xi )^3
4. break
5. end
6. end
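A NumPy sketch of Algorithm 2.14 is given below. For simplicity the (n−2)×(n−2) system (2.26)-(2.27) is assembled as a dense matrix and solved with numpy.linalg.solve; a specialized tridiagonal solver (see Section 3.10) would of course be preferable for large n. The function name and 0-based indexing are ours, not from the notes.

import numpy as np

def natural_cubic_spline_coeffs(x, f):
    """Coefficients a, b, c, d of the natural cubic spline (Algorithm 2.14), 0-based arrays."""
    x = np.asarray(x, dtype=float); a = np.asarray(f, dtype=float).copy()
    n = len(x)
    h = np.diff(x)                                    # h_i = x_{i+1} - x_i
    A = np.zeros((n - 2, n - 2)); g = np.zeros(n - 2)
    for k in range(n - 2):                            # row k corresponds to the unknown c_{k+2}
        A[k, k] = 2 * (h[k] + h[k + 1])
        if k > 0:
            A[k, k - 1] = h[k]
        if k < n - 3:
            A[k, k + 1] = h[k + 1]
        g[k] = 3 * (a[k + 2] - a[k + 1]) / h[k + 1] - 3 * (a[k + 1] - a[k]) / h[k]
    c = np.zeros(n)
    c[1:n - 1] = np.linalg.solve(A, g)                # c_1 = c_n = 0 for the natural spline
    b = (a[1:] - a[:-1]) / h - h / 3 * (2 * c[:-1] + c[1:])
    d = (c[1:] - c[:-1]) / (3 * h)
    return a[:-1], b, c[:-1], d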

Now we address the question of convergence. The key result we will use is Lemma 2.2. Define M = (M1 , . . . , Mn )T
where Mi = Si′′ (xi ) and F = (F1 , . . . , Fn )T where Fi = f ′′ (xi ), i = 1, . . . , n. Define r = (r1 , . . . , rn )T by

r = A(M − F).

Since M = 2c and using that Ac = g, we have

r = A(M − F) = AM − AF = 2Ac − AF = 2g − AF. (2.29)

On the other hand by Lemma 2.2


|ri |
max |Mi − Fi | ≤ max .
i i hi + hi+1
From (2.29) and the definition of the matrix A and the vector g we have

ri = 2gi − (AF)i
   = (6/hi )( f (xi+1 ) − f (xi )) − (6/hi−1 )( f (xi ) − f (xi−1 )) − [ hi−1 f ″(xi−1 ) + 2(hi−1 + hi ) f ″(xi ) + hi f ″(xi+1 ) ].

Since xi+1 = xi + hi and xi−1 = xi − hi−1 , using Taylor expansion (for f sufficiently smooth), we obtain

h2i ′′ h3
f (xi+1 ) = f (xi ) + hi f ′ (xi ) + f (xi ) + i f ′′′ (xi ) + O(h4i )
2 6
and
h2i−1 ′′ h3
f (xi−1 ) = f (xi ) − hi−1 f ′ (xi ) + f (xi ) − i−1 f ′′′ (xi ) + O(h4i−1 ).
2 6
As a result
6 6
( f (xi+1 ) − f (xi )) − ( f (xi ) − f (xi−1 )) = 3hi f ′′ (xi ) + h2i f ′′′ (xi ) + O(h3i ) + 3hi−1 f ′′ (xi ) − h2i−1 f ′′′ (xi ) + O(h3i−1 ).
hi hi−1
(2.30)
Similarly,
f ′′ (xi+1 ) = f ′′ (xi ) + hi f ′′′ (xi ) + O(h2i )
and
f ′′ (xi−1 ) = f ′′ (xi ) − hi−1 f ′′′ (xi ) + O(h2i−1 ).
As a result

hi−1 f ′′ (xi−1 ) + 2(hi−1 + hi ) f ′′ (xi ) + hi f ′′ (xi+1 ) = 3(hi + hi−1 ) f ′′ (xi ) − h2i−1 f ′′′ (xi ) + h2i f ′′′ (xi ) + O(h3i−1 ) + O(h3i )

Subtracting it from (2.30), we obtain


ri = O(h3i + h3i−1 ),

and thus

max_i |Mi − Fi | ≤ max_i |ri | / (hi + hi+1 ) ≤ O(hi^3 + hi−1^3) / (hi + hi+1 ) = O(hi^2 + hi−1^2) = O(h^2).    (2.31)

Thus we have established that at the nodes

max_i |S″(xi ) − f ″(xi )| ≤ Ch^2,    (2.32)

for some constant independent of h, provided f ∈ C4 [a, b]. The estimate (2.31) is the key result for the error estimates.
Theorem 2.16. Let f ∈ C4 [a, b] and maxi (h/hi ) ≤ K for some K > 0. Then there exists a constant C independent of h
such that
max |S(k) (x) − f (k) (x)| ≤ Ch4−k , k = 0, 1, 2, 3.
x∈[a,b]

Proof. Case k=3.


First we treat the case k = 3. Let x ∈ [xj , xj+1 ] for some 1 ≤ j ≤ n − 1. Notice that since Sj (x) is cubic, S‴j (x) is constant on [xj , xj+1 ] and can be expressed as

S‴j (x) = (Mj+1 − Mj )/hj ,

where Mj = S″j (xj ). Then

S‴j (x) − f ‴(x) = (Mj+1 − Mj )/hj − f ‴(x) = (Mj+1 − f ″(xj+1 ))/hj + ( f ″(xj ) − Mj )/hj + [ ( f ″(xj+1 ) − f ″(xj ))/hj − f ‴(x) ].

From (2.32) and using that h/hj ≤ K we have

| (Mj+1 − f ″(xj+1 ))/hj + ( f ″(xj ) − Mj )/hj | ≤ Ch^2/hj ≤ CKh.

Using a Taylor expansion,

( f ″(xj+1 ) − f ″(xj ))/hj − f ‴(x) = f ‴(xj ) − f ‴(x) + O(h) = O(h).

Thus we have established

max_{x∈[xj ,xj+1 ]} |S‴j (x) − f ‴(x)| ≤ Ch,    (2.33)

which gives us the theorem for k = 3.


Case k=2. To show the result in this case we use the result for k = 3. For x ∈ [xj , xj+1 ], by the Fundamental Theorem of Calculus

|S″j (x) − f ″(x)| = | S″j (xj ) − f ″(xj ) + ∫_{xj}^{x} ( S‴j (t) − f ‴(t) ) dt | ≤ Ch^2 + |x − xj | Ch ≤ Ch^2,    (2.34)

where we used (2.33) and (2.32).


Case k=1. Since f (xi ) − S(xi ) = 0 for i = 1, . . . , n, by Rolle's Theorem there exists ξj ∈ [xj , xj+1 ] such that S′(ξj ) − f ′(ξj ) = 0. Then for x ∈ [xj , xj+1 ], by the Fundamental Theorem of Calculus

|S′j (x) − f ′(x)| = | ∫_{ξj}^{x} ( S″j (t) − f ″(t) ) dt | ≤ |x − ξj | Ch^2 ≤ Ch^3,    (2.35)

where we used (2.34).



Case k=0. Since f(x_j) − S(x_j) = 0, again for x ∈ [x_j, x_{j+1}], by the Fundamental Theorem of Calculus,

|S_j(x) − f(x)| = | ∫_{x_j}^x ( S_j'(t) − f'(t) ) dt | ≤ |x − x_j| Ch^3 ≤ Ch^4,   (2.36)

where we used (2.35).

3 Linear systems

Linear algebra is a wonderful subject. One of its wonderful aspects is the variety of ways one can look at the same problem. Sometimes a problem that appears hard from one perspective turns out to be trivial from another. To illustrate my point, let's look at an m × n matrix A ∈ R^{m×n}
 
a11 a12 . . . a1n
 a21 a22 . . . a2n 
A= . .. .. ..  .
 
 .. . . . 
am1 am2 . . . amn

Looking at this matrix we can see different things. First of all we see m × n entries, in other words an element of R^{m×n}. We also can see m rows A(i, :) ∈ R^n for i = 1, 2, . . . , m, or n columns A(:, j) ∈ R^m for j = 1, 2, . . . , n. Taking a more sophisticated view, one can see a map A : R^n → R^m, or that the matrix A consists of four submatrices A11 ∈ R^{m1×n1}, A12 ∈ R^{m1×n2}, A21 ∈ R^{m2×n1}, A22 ∈ R^{m2×n2},
 
A = [ A11  A12 ; A21  A22 ],   with m = m1 + m2 and n = n1 + n2.

3.1 Matrix-Vector multiplication

Let A ∈ Rm×n and x ∈ Rn . The i-th component of the matrix-vector product y = Ax is defined by
n
yi = ∑ ai j x j (3.1)
j=1

i.e., yi is the dot product (inner product) of the i-th row of A with the vector x.
y_i = [ a_i1  a_i2  · · ·  a_in ] [ x_1 ; x_2 ; ⋮ ; x_n ].

Another useful point of view is to look at entire vector y = Ax,


          
y1 a11 a12 . . . a1n x1 a11 a12 a1n
 y2   a21 a22 . . . a2n   x2   a21   a22   a2n 
 ..  =  .. .. .. ..   ..  = x1  ..  + x2  ..  + · · · + xn  ..  . (3.2)
          
 .   . . . .   .   .   .   . 
ym am1 am2 . . . amn xn am1 am2 amn

Thus, y is a linear combination of the columns of matrix A.
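The two viewpoints correspond to two ways of organizing the computation. A small Matlab sketch (the data below are assumed for illustration) comparing the row-oriented and the column-oriented loop:

% Row-oriented: y(i) is the dot product of the i-th row of A with x.
% Column-oriented: y is accumulated as a linear combination of columns of A.
A = [1 2 3; 4 5 6; 7 8 9; 10 11 12];
x = [1; -1; 2];
[m,n] = size(A);
y_row = zeros(m,1);
for i = 1:m
    y_row(i) = A(i,:)*x;
end
y_col = zeros(m,1);
for j = 1:n
    y_col = y_col + x(j)*A(:,j);
end
norm(y_row - y_col)               % both agree with A*x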

3.2 Matrix-Matrix multiplication

If A ∈ Rm×p and B ∈ R p×n , then AB = C ∈ Rm×n defined by


p
ci j = ∑ aik bk j ,
k=1

i.e. the i j-th element of the product matrix is the dot product between i-th row of A and j-th column of B.
     b 
1j
 ci j  =  ai1 · · · aip   . 
 ..  .

bp j

Another useful point of view is to look at j-th column of C


       
c1 j a11 a12 a1n
 ..   ..   ..   .. 
 .   .   .   . 
       
 ci j  = b1 j  ai1  + b2 j  ai2  + · · · + bn j  ain  .
       
 .   .   .   . 
 ..   ..   ..   .. 
cm j am1 am2 amn
Thus, j-th column of C is a linear combination of the columns of matrix A. Sometimes it is useful to consider matrices
partitioned into blocks. For example,
   
A11 A12 B11 B12
A= B=
A21 A22 B21 B22

with m1 + m2 = m, p1 + p2 = p, and n1 + n2 = n. This time C = AB can be expressed as


 
C = [ A11 B11 + A12 B21    A11 B12 + A12 B22 ;  A21 B11 + A22 B21    A21 B12 + A22 B22 ].

Example 3.1.
Let A = [1 2 3; 4 5 6; 7 8 9] and B = [1 2; 3 4; 5 6]. Partition A = [A11 A12; A21 A22] and B = [B11; B21] with A11 = [1 2; 4 5], A12 = [3; 6], A21 = [7 8], A22 = [9], B11 = [1 2; 3 4], B21 = [5 6]. Then C = AB can be computed block-wise as

C = [ A11 B11 + A12 B21 ; A21 B11 + A22 B21 ] = [ 22 28 ; 49 64 ; 76 100 ].

This idea is the key to the asymptotically faster matrix-matrix multiplication of the celebrated Strassen algorithm [2].
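A Matlab sketch of plain 2-by-2 block multiplication (not Strassen itself; the block sizes are assumptions for illustration):

% C = A*B computed block by block.
A = magic(4);  B = rand(4,3);
i1 = 1:2; i2 = 3:4;               % row blocks of A (and of C)
k1 = 1:2; k2 = 3:4;               % column blocks of A / row blocks of B
j1 = 1:2; j2 = 3;                 % column blocks of B (and of C)
C = zeros(4,3);
C(i1,j1) = A(i1,k1)*B(k1,j1) + A(i1,k2)*B(k2,j1);
C(i1,j2) = A(i1,k1)*B(k1,j2) + A(i1,k2)*B(k2,j2);
C(i2,j1) = A(i2,k1)*B(k1,j1) + A(i2,k2)*B(k2,j1);
C(i2,j2) = A(i2,k1)*B(k1,j2) + A(i2,k2)*B(k2,j2);
norm(C - A*B)                     % agrees with the built-in product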

3.3 Existence of Uniqueness of solution of linear systems

Let A ∈ Rm×n and x ∈ Rn . Then from (3.2), Ax is a linear combination of the columns of A, i.e.
     
a11 a12 a1n
 a21   a22   a2n 
Ax = x1  .  + x2  .  + · · · + xn  . 
     
 ..   ..   .. 
am1 am2 amn

Hence Ax = b has a solution if and only if b ∈ R^m can be written as a linear combination of the columns of A.


Solvability:
Ax = b is solvable for every b ∈ R^m iff the columns of A span R^m (necessarily n ≥ m).
Uniqueness:
If Ax = b has a solution, then the solution is unique iff the columns of A are linearly independent (necessarily n ≤ m).
Existence and Uniqueness:
For any b ∈ R^m, the system Ax = b has a unique solution iff n = m and the columns of A are linearly independent.

3.4 Transpose of a Matrix

If A ∈ Rm×n then AT ∈ Rn×m is obtained by reflecting the elements with respect to main diagonal. Thus, if A ∈ Rm×n
and B ∈ Rn×k , then
(AB)T = BT AT .
More generally,
(A1 A2 . . . A j )T = ATj . . . AT2 AT1 .
If A ∈ Rn×n is invertible, then AT is invertible and

(AT )−1 = (A−1 )T .

We will write A−T .

3.5 Solution of Triangular Systems

Definition 3.1 (Lower triangular). A matrix L ∈ Rn×n is called lower triangular matrix if all matrix entries above
the diagonal are equal to zero, i.e., if
li j = 0 for j > i.

Definition 3.2 (Upper triangular). A matrix U ∈ Rn×n is called upper triangular matrix if all matrix entries below
the diagonal are equal to zero, i.e., if
ui j = 0 for i > j.

A linear system with a lower (upper) triangular matrix can be solved by forward substitution (backward substitution).

Algorithm 3.1 (Solution of Upper Triangular Systems (Row-Oriented Version))

Input: Upper triangular matrix U ∈ Rn×n


right hand side vector b ∈ Rn
Output: Solution x ∈ Rn of Ux = b
Mathematically,

x_i = ( b_i − ∑_{j=i+1}^{n} u_ij x_j ) / u_ii,   if u_ii ≠ 0.

Matlab code:
% Overwrites b with the solution x of U*x = b (row-oriented version).
if any(diag(U) == 0)
    disp('the matrix is singular')
else
    b(n) = b(n)/U(n,n);
    for i = n-1:-1:1
        b(i) = (b(i) - U(i,i+1:n)*b(i+1:n))/U(i,i);
    end
end

We can also give a column-oriented version.

Algorithm 3.2 (Solution of Upper Triangular Systems (Column-Oriented Version))

Input: Upper triangular matrix U ∈ Rn×n


right hand side vector b ∈ Rn
Output: Solution x ∈ Rn of Ux = b
Mathematically,

x_i = ( b_i − ∑_{j=i+1}^{n} u_ij x_j ) / u_ii,   if u_ii ≠ 0.

MATLAB code that overwrites b with the solution to Ux = b.


% Overwrites b with the solution x of U*x = b (column-oriented version).
if any(diag(U) == 0)
    disp('the matrix is singular')
else
    for j = n:-1:2
        b(j) = b(j)/U(j,j);
        b(1:j-1) = b(1:j-1) - b(j)*U(1:j-1,j);
    end
    b(1) = b(1)/U(1,1);
end

3.6 Gaussian Elimination.

Gaussian elimination for the solution of a linear system transforms the system Ax = b into an equivalent system with
upper triangular matrix. This is done by applying three types of transformations to the augmented matrix (A|b).
∙ Type 1: Replace an equation with the sum of the same equation and a multiple of another equation;
∙ Type 2: Interchange two equations; and
∙ Type 3: Multiply an equation by a nonzero number.
Once the augmented matrix (A|b) is transformed into (U|y), where U is an upper triangular matrix, we can use the
techniques discussed previously to solve this transformed system Ux = y.
We need to modify Gaussian elimination for two reasons:
∙ improve numerical stability (change how we perform pivoting)
∙ make it more versatile (leads to LU-decomposition)

Definition 3.3 (Partial pivoting). In step i, find a row j with j ≥ i such that |a_ji| ≥ |a_ki| for all k ≥ i and exchange rows i and j. Such entries a_ji are called pivots.

In exact arithmetic partial pivoting is not needed as long as the pivots are nonzero. In floating point arithmetic, however, the method can be unstable without partial pivoting.
Using the formula
∑_{j=1}^{n} j^2 = n(n+1)(2n+1)/6,

we can calculate that for large n the number of flops in Gaussian elimination with partial pivoting is approximately equal to 2n^3/3.

3.7 LU decomposition

The Gaussian elimination operates on the augmented matrix (A | b) and performs two types of operations to transform
(A | b) into (U | y), where U is upper triangular. However, the right hand side does not influence how the augmented
matrix (A | b) is transformed into an upper triangular matrix (U | y). This transformation depends only on the matrix
A. Thus, if we keep a record of how A is transformed into an upper triangular matrix U, then we can apply the same
transformation to any right hand side, without re-applying the same transformations to A.
What operations are performed? In step k of the Gaussian elimination with partial pivoting we perform two opera-
tions:
∙ Interchange a row i0 > k with row k.
∙ Add a multiple −lik times row k to row i for i = k + 1, . . . , n.
How do we record this?
∙ Introduce an integer array ipivt. Set
ipivt(k) = i0
when the k-th step the rows k and i0 are interchanged.
∙ Store the −lik in the (i, k)-th position of the array that originally held A. (Remember that we add a multiple −lik
times row k to row i to eliminate the entry (i, k). Hence this storage can be reused.)
 
a11 a12 a13 · · · · · · a1,n−1 a1n

 a21 a22 a23 · · · · · · a2,n−1 a2n 


 a31 a32 a33 · · · · · · a3,n−1 a3n 

 .. .. .. .. .. 

 . . . . . 

 an−1,1 an−1,2 an−1,3 · · · · · · an−1,n−1 an−1,n 
an1 an2 an3 · · · · · · an,n−1 ann

Step 1

 
a11 a12 a13 · · · · · · a1,n−1 a1n

 −l21 a22 a23 · · · · · · a2,n−1 a2n 


 −l31 a32 a33 · · · · · · a3,n−1 a3n 

 .. .. .. .. .. 

 . . . . . 

 −ln−1,1 an−1,2 an−1,3 · · · · · · an−1,n−1 an−1,n 
−ln1 an2 an3 · · · · · · an,n−1 ann

Step 2

 
a11 a12 a13 · · · · · · a1,n−1 a1n

 −l21 a22 a23 · · · · · · a2,n−1 a2n 


 −l31 −l32 a33 · · · · · · a3,n−1 a3n 

 .. .. .. .. .. 

 . . . . . 

 −ln−1,1 −ln−1,2 an−1,3 · · · · · · an−1,n−1 an−1,n 
−ln1 −ln2 an3 · · · · · · an,n−1 ann

Step n-1

 
a11 a12 a13 · · · · · · a1,n−1 a1n
 −l21 a22 a23 · · · · · · a2,n−1 a2n 
 
 −l31 −l32 a33 · · · · · · a3,n−1 a3n 
 
 .. .. .. .. .. .. 
 .
 . . . . .  
 . .. .. ..
 ..

. . . an−1,n−1 an−1,n 
−ln1 −ln2 −ln3 · · · · · · −ln,n−1 ann
Row interchange in step k can be expressed by multiplying with
 
1
 .. 

 . 


 1 


 0 1 ←k


 1 

Pk = 
 .. 
 . 

 1 
 ← ipivt(k)
 

 1 0 

 1 

 .. 
 . 
1

↑ ↑
k ipivt(k)
from the left. Pk is a permutation matrix.
Easy to see that Pk satisfies Pk = PkT and Pk2 = Id and as a result Pk−1 = Pk . Furthermore, Pk A interchanges rows k
and ipivt(k) of A, but APk interchanges columns k and ipivt(k) of A.
Adding −li,k times row k to row i for i = k + 1, . . . , n is equivalent to multiplying from the left with
 
1
 .. 

 . 

 1 
Mk =  

 −lk+1,k 

 .. .. 
 . . 
−ln,k 1

Observe that matrix Mk is lower triangular and is called a Gauss transformation. Easy to see that Mk is invertible and
 
1
 .. 

 . 

−1
 1 
Mk =   
 lk+1,k 

 .. . . 
 . . 
ln,k 1

Furthermore, product of two lower triangular matrices results in a lower triangular matrix, i.e. Mk M j is again lower
triangular
The transformation of a matrix A into an upper triangular matrix U can be expressed as a sequence of matrix-matrix
multiplications
Mn−1 Pn−1 . . . M1 P1 A = U.
Above we have observed that Pk and Mk are invertible. Hence, if we have to solve Ax = b and if

Mn−1 Pn−1 . . . M1 P1 A = U,

then we apply the matrices to the right hand side,



Mn−1 Pn−1 . . . M1 P1 b = y

and solve
Ux = y.
We observe that for j > k
Pj Mk = M̃k Pj , (3.3)
where M̃k is obtained from Mk by interchanging the entries −l j,k and −lipivt( j),k . M̃k has the same structure as Mk and
we can easily compute M̃k−1 . Using (3.3) we can move all Pj ’s to the right of the M̃k ’s

U = Mn−1 Pn−1 . . . M1 P1 A = Mn−1 M̃n−2 . . . M̃1 Pn−1 . . . P1 A.

Thus P = P_{n−1} · · · P_1 is a permutation matrix and L = M̃_1^{−1} · · · M̃_{n−2}^{−1} M_{n−1}^{−1} is a lower triangular matrix with diagonal elements equal to 1. Matlab's function [L, U, P] = lu(A) computes matrices P, L, U such that PA = LU.

Example 3.2. Consider the matrix  


1 2 −1
 2 1 0
−1 2 2
After step k = 1 the vector ipivt and the overwritten matrix A are given by

ipivt = (2, , ),    A = [ 2  1  0 ;  −1/2  3/2  −1 ;  1/2  5/2  2 ].

After step k = 2 the vector ipivt and the overwritten matrix A are given by

ipivt = (2, 3, ),    A = [ 2  1  0 ;  −1/2  5/2  2 ;  1/2  −3/5  −11/5 ].

If we want to solve the linear system Ax = b with


 
0
b = 2
1

We have to apply the same transformation to b


     
0 2 2
 2  step 1 step 2
−→  −1  −→  2
1 2 −11/5

and then solve     


1 2 −1 x1 2
0 5 2   x2  =  2
2
0 0 − 11
5
x 3 −11/5
by back substitution to obtain x1 = 1, x2 = 0, x3 = 1.
Using matrices Pi and Mi , same steps we can write as
Step 1:
   
010 21 0
P1 =  1 0 0  , P1 A =  1 2 −1 
001 −1 2 2
   
100 2 1 0
M1 =  − 12 1 0  , M1 P1 A =  0 23 −1 
1
2 0 1 0 52 2
Step 2:    
100 2 1 0
P2 =  0 0 1  , P2 M1 P1 A =  0 52 2 
010 −1 32 −1
   
1 00 2 1 0
M2 =  0 1 0  , M2 P2 M1 P1 A =  0 25 2
0 − 35 1 0 0 − 11
5

For b = (0, 2, 1)T to solve Ax = b, we compute


      
1 00 100 010 0 2
M2 P2 M1 P1 b =  0 1 0   0 0 1   1 0 0   2  =  2 
0 − 53 1 010 001 1 − 11
5

Solving     
2 1 0 x1 2
0 5 2   x2  =  2 
2
11
0 0−5 x3 − 11
5

we obtain x = (1, 0, 1)T .


To obtain LU decomposition, we have
 
21 0
U =  0 52 2  = M2 P2 M1 P1 A.
0 0 − 11
5

Then using      
100 100 100 100
P2 M1 =  0 0 1   − 12 1 0  =  12 1 0   0 0 1  = M̃1 P2
1
010 2 0 1 − 12 0 1 010
we have
U = M2 M̃1 P2 P1 A.
Calling    
1 00 010
L = M̃1−1 M2−1 =  − 21 1 0  and P =  0 0 1 
1 3 100
2 5 1
we have PA = LU, i.e.      
010 1 2 −1 1 00 2 1 0
 0 0 1  2 1 0  =  −1 1 0  0 5 2 .
2 2
−1 2 2 1 3 11
100 2 5 1 0 0−5

3.8 Applications of LU decomposition

The LU-decomposition is a very useful tool; using it we can compute many quantities rather efficiently.

3.8.1 Solving Linear System

To solve the linear system Ax = b using the LU decomposition PA = LU, we need to solve two triangular systems, Ly = Pb and Ux = y. This is especially beneficial if we need to solve Ax = b for many right-hand sides b, as for computing A^{−1}.
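For instance, a minimal Matlab sketch of the two-step solve, reusing the matrix A and right hand side b from Example 3.2:

% Solve A*x = b via PA = LU: first L*y = P*b, then U*x = y.
A = [1 2 -1; 2 1 0; -1 2 2];  b = [0; 2; 1];
[L,U,P] = lu(A);
y = L \ (P*b);                    % forward substitution
x = U \ y                         % backward substitution, gives x = (1,0,1)^T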

3.8.2 Finding Inverse A−1

Computing A−1 is rarely required. Usually we need to compute A−1 b, which means we need to solve a linear system
Ax = b. In the rare occasion when the explicit form of A−1 is needed, we can use LU-decomposition to find it by using
O(n3 ) operations. Recall that A−1 is a unique matrix X such that

AX = I,

where I is the identity matrix. Denote the columns of the matrix X be xi , i = 1, . . . , n. Thus in column notation the
above equation is equivalent to
[Ax1 , Ax2 , . . . , Axn ] = [e1 , e2 , . . . , en ].
In other words we need to solve n equations

Axi = ei , i = 1, . . . , n.

In Section 3.8.1 above we explained how to solve such systems using the LU decomposition. Notice that the LU decomposition takes O(2n^3/3) operations, solving the two triangular systems takes O(n^2) operations, and since we need to solve n of them it takes O(n^3) operations to compute A−1 explicitly.

3.8.3 Computing determinants

Recall that if A, B,C ∈ Rn×n and C = AB, then

det(C) = det(A)det(B).

Assume we have LU decomposition of A, i.e. PA = LU. Then

det(P)det(A) = det(L)det(U).

Since P is a permutation matrix, det(P) = ±1; more precisely, +1 if the permutation is even and −1 if it is odd. Since L and U are triangular, their determinants are just the products of their diagonal elements. Thus, det(L) = 1 and det(U) = ∏_{i=1}^n u_ii.
As a result
det(A) = ± ∏_{i=1}^{n} u_ii.
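A short Matlab sketch of this computation (for a permutation matrix P, det(P) is just the sign of the permutation):

% Determinant of A from its LU decomposition PA = LU.
A = [1 2 -1; 2 1 0; -1 2 2];
[L,U,P] = lu(A);
dA = det(P)*prod(diag(U))         % compare with det(A)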

3.9 Symmetric Positive definite matrices. Cholesky decomposition.

Definition 3.4. A ∈ Rn×n is called symmetric if A = AT .

Definition 3.5. A ∈ Rn×n is called symmetric positive definite if A = AT and vT Av > 0 for all v ∈ Rn , v ̸= 0.

If A ∈ Rn×n is symmetric positive definite, then the LU decomposition can be computed in a stable way without
permutation, i.e.,
A = LU
and more efficiently.
First we notice that if L is a unit lower triangular matrix and U is an upper triangular matrix such that A = LU, then L and U are unique, i.e., if L̃ is a unit lower triangular matrix and Ũ is an upper triangular matrix such that A = L̃Ũ, then L = L̃ and U = Ũ. Furthermore, if A ∈ R^{n×n} is symmetric positive definite and A = LU, then the diagonal entries of U are positive and we can write U = DŨ, where Ũ is unit upper triangular and

D = diag(u11 , . . . , unn ) with uii > 0, i = 1, . . . , n.

Thus,
A = LU = LDŨ.
On the other hand
AT = (LDŨ)T = Ũ T DLT .
Using that A = AT and LU decomposition is unique

A = LU = Ũ T DLT = Ũ T (DLT ) = (lower unit triangular) × (upper triangular).

Thus,
L = Ũ T and U = DLT .
We showed that if A is a symmetric positive definite matrix, then

A = LDLT .

Recall that
D = diag(u11 , . . . , unn )
has positive diagonal entries. So we can define
D^{1/2} = diag( √u_11, . . . , √u_nn ).

Define R := D1/2 LT , then

A = LDLT = LD1/2 D1/2 LT = RT R : Cholesky-decomposition

Matlab's function [R] = chol(A) computes the matrix R such that A = R^T R. With a careful implementation, one can show that the total number of basic operations is of order n^3/3, compared to 2n^3/3 for the LU decomposition.
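A minimal Matlab sketch of solving a symmetric positive definite system with the Cholesky factor (the matrix below is an assumed example):

% Solve A*x = b for symmetric positive definite A via A = R'*R.
A = [4 1 0; 1 3 1; 0 1 2];  b = [1; 2; 3];
R = chol(A);                      % A = R'*R with R upper triangular
y = R' \ b;                       % forward substitution
x = R \ y                         % backward substitution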

3.10 Tridiagonal matrices

Definition 3.6. A ∈ Rn×n is called m- banded if ai j = 0 for |i − j| > m.



Definition 3.7. A ∈ Rn×n is called tridiagonal if it is 1- banded.

Let  
d1 e1
 c1 d2 e2 
 
A=
 .. .. .. 
 . . . 

 cn−2 dn−1 en−1 
cn−1 dn
Since there are only about 3n nonzero entries in the matrix A, a natural question is how we can compute the LU-decomposition of A efficiently. Let's look at the LU decomposition of A
  d e 
1 1 1
 − dc1 1   d˜2 e2 
 1   
.   c2 d3 e3
 
.

 0 0 . ,  .. .. ..



.
  . . . 
.. 1  
   
cn−2 dn−1 en−1 

01 cn−1 dn
| {z } | {z }
=M1 c
=M1 A, where d˜2 =d2 − d1 e1
1

 

1
 d1 e1
0 1   d˜2 e2 
˜
 
 − dc2 1 d3 e3
   

2
 
c3 d4 e4
 
. ,  
0 0 ..
  
   .. .. .. 

 ..

  . . . 
.1
 
cn−2 dn−1 en−1 
  
0 1 cn−1 dn
| {z } | {z }
=M2 c
=M2 M1 A, where d˜3 =d3 − ˜2 e2
d2

After n − 1 steps we obtain


 
1

d1 e1
c1   d˜ e

 d1 1  2 2 
c2 

A= d2 1  .. .. ,

 . .
.. ..
  

 . .

 ˜
dn−1 en−1 
cn−1
dn−1 1 d˜n

where d̃_{i+1} = d_{i+1} − (c_i / d̃_i) e_i, for i = 1 : n − 1.

Algorithm 3.3 Input: Vectors c, d, e that form tridiagonal matrix A


Output: vectors c, d that form LU-decomposition of A
1. for k = 1 : n − 1
2. If dk = 0, stop
3. ck = ck /dk
4. dk+1 = dk+1 − ck ek
5. end

This algorithm requires about 3n flops.
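A Matlab sketch of Algorithm 3.3 followed by the corresponding forward and backward substitution for Ax = f (a sketch without pivoting; c, d, e are the sub-, main and super-diagonals):

% LU decomposition of a tridiagonal matrix and solve of A*x = f.
n = length(d);
for k = 1:n-1
    c(k)   = c(k)/d(k);           % multiplier l_{k+1,k}
    d(k+1) = d(k+1) - c(k)*e(k);  % updated diagonal entry of U
end
y = f;
for k = 2:n                       % forward substitution, L*y = f
    y(k) = y(k) - c(k-1)*y(k-1);
end
x = zeros(n,1);                   % backward substitution, U*x = y
x(n) = y(n)/d(n);
for k = n-1:-1:1
    x(k) = (y(k) - e(k)*x(k+1))/d(k);
end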

Given three arrays of length n, c = [c1 ; . . . ; cn ], d = [d1 ; . . . ; dn ]; e = [e1 ; . . . ; en ].


The Matlab command
A = spdiags([c, d, e], −1 : 1, n, n);
generates a sparse form of the matrix  
d1 e2
 c1 d2 e3 
 
A=
 .. .. .. 
 . . . 

 cn−2 dn−1 en 
cn−1 dn

3.11 Error analysis of Linear systems

3.11.1 Vector Norms

We remind that a norm ‖·‖ on a vector space V over R is a function ‖·‖ : V → R+ that satisfies the following properties
1. (Positivity) ‖x‖ > 0 for any non-zero x ∈ V .
2. (Scalability) ‖αx‖ = |α|‖x‖, for any α ∈ R.
3. (Triangle inequality) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for any x, y ∈ V.
As a consequence of the triangle inequality, we have the following inequality:

‖x + y‖ ≥ ‖x‖ − ‖y‖ for all x, y ∈ Rn .


The Euclidean or 2-norm on R^n is given by


!1/2
n
‖x‖2 = ∑ xi2 , 2-norm
i=1

More generally for any p ∈ [1, ∞) we can define p-norm


!1/p
n
p
‖x‖ p = ∑ |xi | .
i=1

The infinity norm is also often used:


‖x‖∞ = max |xi |.
i=1,...,n

Example 3.3. Let x = (1, −2, 3, −4)T . Then



‖x‖1 = 1 + 2 + 3 + 4 = 10,
√ √
‖x‖2 = 1 + 4 + 9 + 16 = 30 ≈ 5.48,
‖x‖∞ = max {1, 2, 3, 4} = 4.

The unit ball with respect to p-norm is defined by

{x ∈ Rn : ‖x‖ p ≤ 1}.

TO ADD PICTURE
The following inequalities hold:
‖x‖∞ ≤ ‖x‖2 ≤ ‖x‖1 .

Theorem 3.4. All vector norms on Rn are equivalent, i.e. for every two vector norms ‖ · ‖a and ‖ · ‖b on Rn there exist
constants cab , Cab (depending on the vector norms ‖ · ‖a and ‖ · ‖b , but not on x) such that

cab ‖x‖b ≤ ‖x‖a ≤ Cab ‖x‖b ∀x ∈ Rn .

Proof. It is sufficient to show any norm ‖ · ‖ is equivalent to infinity norm ‖ · ‖∞ .


Let e1 , . . . , en be the standard basis for Rn . Then any x ∈ Rn can be written as

x = x1 e1 + · · · + xn en .

It follows that
n n
‖x‖ ≤ ‖x‖∞ ∑ ‖e j ‖ ≤ γ‖x‖∞ , with γ = ∑ ‖e j ‖.
j=1 j=1

This shows that any norm ‖ · ‖ on Rn is continuous with respect to ‖ · ‖∞ norm.


Consider the set

S = {y ∈ R^n : ‖y‖∞ = 1}.

The set S is a closed and bounded set in R^n and as a result compact. On a compact set the continuous function ‖ · ‖ attains its maximum and minimum values, so there exist y_0, y_1 ∈ S such that

0 < ‖y_0‖ ≤ ‖y‖ ≤ ‖y_1‖ < ∞,   ∀y ∈ S.

For any x ≠ 0 consider y = x/‖x‖∞ ∈ S. Then

‖y_0‖ ≤ ‖ x/‖x‖∞ ‖ ≤ ‖y_1‖,

which shows

m‖x‖∞ ≤ ‖x‖ ≤ M‖x‖∞,

with m = ‖y_0‖ and M = ‖y_1‖.
Remark 3.1. Although all vector norms are equivalent, the equivalence constants may depend on the dimension n. For example, for 1 = (1, . . . , 1)^T we have

‖1‖∞ = 1,   ‖1‖2 = √n,   ‖1‖1 = n.

In particular, for any x ∈ Rn we have the inequalities



(1/√n) ‖x‖1 ≤ ‖x‖2 ≤ ‖x‖1,
‖x‖∞ ≤ ‖x‖2 ≤ √n ‖x‖∞,
‖x‖∞ ≤ ‖x‖1 ≤ n ‖x‖∞.

3.11.2 Matrix norm

We can view a matrix A ∈ R^{m×n} as a vector in R^{mn}, by stacking the columns of the matrix into one long vector. Applying the 2-vector norm to this vector of length mn, we obtain the Frobenius norm,
!1/2
n m
‖A‖F = ∑∑ a2i j .
i=1 j=1

In many situations the Frobenius norm is not convenient. One of the reasons is that for the identity matrix I ∈ R^{n×n}, ‖I‖_F = √n.
Another approach is to view a matrix A ∈ Rm×n as a linear mapping, which maps a vector x ∈ Rn into a vector
Ax ∈ Rm
A : Rn → Rm
x → Ax.
To define the size of this linear mapping, we compare the size of the image Ax ∈ Rm with the size of x. This leads us
to look at
‖Ax‖
sup
x̸=0 ‖x‖

Here Ax ∈ Rm and x ∈ Rn are vectors and ‖ · ‖ are vector norms (in Rm and Rn ).
Definition 3.8 (Matrix p-norm).
‖Ax‖ p
‖A‖ p = max , 1 ≤ p ≤ ∞. (3.4)
x̸=0 ‖x‖ p
The following holds,
sup_{x≠0} ‖Ax‖_p / ‖x‖_p = max_{x≠0} ‖Ax‖_p / ‖x‖_p = max_{‖x‖_p = 1} ‖Ax‖_p.

With the above definition, now for the identity matrix I,

‖Ix‖ p
‖I‖ p = max = 1.
x̸=0 ‖x‖ p

Two important inequalities.


Theorem 3.5. For any A ∈ Rm×n , B ∈ Rn×k and x ∈ Rn , the following inequalities hold.

‖Ax‖ p ≤ ‖A‖ p ‖x‖ p (compatibility of matrix and vector norm)

and
‖AB‖ p ≤ ‖A‖ p ‖B‖ p (submultiplicativity of matrix norms)

Proof. The first statement follows directly from the definition of the matrix p-norm. For the second statement, note that for any x ≠ 0, the first statement gives

‖(AB)x‖_p = ‖A(Bx)‖_p ≤ ‖A‖_p ‖Bx‖_p ≤ ‖A‖_p ‖B‖_p ‖x‖_p.

Taking the maximum over x ≠ 0 yields ‖AB‖_p ≤ ‖A‖_p ‖B‖_p.


For the most commonly used matrix-norms (3.4) with p = 1, p = 2, or p = ∞, there exist rather simple representa-
tions.
Theorem 3.6. Let ‖ · ‖ p be the matrix norm defined in (3.4), then
m
‖A‖1 = max ∑ |ai j | (maximum column norm);
1≤ j≤n i=1
n
‖A‖∞ = max ∑ |ai j | (maximum row norm);
1≤i≤m j=1
q
‖A‖2 = λmax (AT A) (spectral norm),

where λmax (AT A) is the largest eigenvalue of AT A.


Proof. We will show the result for the 1-norm. The rest we will leave as an exercise. We have

‖Ax‖_1 = ∑_{i=1}^{m} | ∑_{j=1}^{n} a_ij x_j | ≤ ∑_{i=1}^{m} ∑_{j=1}^{n} |a_ij| |x_j| = ∑_{j=1}^{n} |x_j| ∑_{i=1}^{m} |a_ij| ≤ ∑_{j=1}^{n} |x_j| max_{1≤k≤n} ∑_{i=1}^{m} |a_ik| ≤ ‖x‖_1 max_{1≤k≤n} ∑_{i=1}^{m} |a_ik|,

and as a result
m
‖A‖1 ≤ max ∑ |aik |.
1≤k≤n i=1

To get equality, it is sufficient to construct x_0 with ‖x_0‖_1 = 1 such that

‖Ax_0‖_1 = max_{1≤k≤n} ∑_{i=1}^{m} |a_ik|.

Let j_0 be such that

∑_{i=1}^{m} |a_ij_0| = max_{1≤k≤n} ∑_{i=1}^{m} |a_ik|.

Then,
m m
‖Ae j0 ‖1 = ∑ |ai j0 | = max ∑ |aik |.
i=1 1≤k≤n i=1

That gives us the result.


From the above theorem, we immediately obtain the following results.
Corollary 3.1. For any matrix A ∈ Rm×n , we have

‖A‖1 = ‖AT ‖∞
‖A‖2 = ‖AT ‖2 .

In the case of symmetric matrices we can show sharper results.


Corollary 3.2. For any symmetric matrix A ∈ Rn×n , we have
n
‖A‖1 = ‖A‖∞ = max ∑ |ai j |
1≤ j≤n i=1
‖A‖2 = max |λi (A)|,
1≤i≤n

where λi (A) denotes the i-th eigenvalue of A.
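These formulas are easy to verify numerically; a short Matlab sketch with an arbitrary example matrix:

A = [1 -2 3; 4 0 -1];
[norm(A,1),   max(sum(abs(A),1))]    % maximum column sum
[norm(A,inf), max(sum(abs(A),2))]    % maximum row sum
[norm(A,2),   sqrt(max(eig(A'*A)))]  % square root of the largest eigenvalue of A'*A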

3.11.3 Error analysis

The linear systems


Ax = b, (3.5)
where A ∈ R^{n×n} and b ∈ R^n, usually come from applications where we do not know the exact values of A and b. Instead we are often faced with the perturbed system

(A + ∆A)x̃ = b + ∆b,   (3.6)

where ∆A ∈ R^{n×n} and ∆b ∈ R^n represent the perturbations in A and b, respectively. The main question we face is: what is the error ∆x = x̃ − x between the solution x of the exact linear system (3.5) and the solution x̃ of the perturbed linear system (3.6)? For simplicity of the presentation, let us first consider the case when A is exact.
Theorem 3.7. Consider the perturbed system (3.6) with ∆A = 0. Then the relative error

‖∆ x‖ ‖∆ b‖
≤ ‖A‖‖A−1 ‖ ,
‖x‖ ‖b‖

where ‖ · ‖ is any p-norm.

Proof. Using a representation


x̃ = x + ∆ x,
from
A(x + ∆ x) = b + ∆ b,
since Ax = b, we get
A∆ x = ∆ b, or ∆ x = A−1 ∆ b.
Taking norms, we obtain
‖∆ x‖ = ‖A−1 ∆ b‖ ≤ ‖A−1 ‖‖∆ b‖. (3.7)
Since Ax = b,
‖b‖ = ‖Ax‖ ≤ ‖A‖‖x‖   ⇒   1/‖x‖ ≤ ‖A‖/‖b‖.   (3.8)
Combining (3.7) and (3.8) we get
‖∆ x‖ ‖∆ b‖
≤ ‖A‖‖A−1 ‖ . (3.9)
‖x‖ ‖b‖
Definition 3.9. The (p-) condition number κ p (A) of a matrix A (with respect to inversion) is defined by

κ p (A) = ‖A‖ p ‖A−1 ‖ p .



Set κ_p(A) = ∞ if A is not invertible.

Notice that κ p (A) ≥ 1, since


1 = ‖I‖ p = ‖AA−1 ‖ p ≤ ‖A‖ p ‖A−1 ‖ p = κ p (A).

Definition 3.10. If κ p (A) is small, we say that the linear system is well conditioned.
Otherwise, we say that the linear system is ill conditioned.
To obtain similar result for the fully perturbed system we need the following auxiliary result.
Lemma 3.1. Let B ∈ Rn×n be arbitrary with ‖B‖ < 1, where ‖ · ‖ denotes any p matrix norm. Then I + B is invertible
and
1
‖(I + B)−1 ‖ ≤ .
1 − ‖B‖
Proof. Since by the triangle inequality and the assumption of the lemma, for any x ̸= 0,

‖(I + B)x‖ = ‖x + Bx‖ ≥ ‖x‖ − ‖Bx‖ = ‖x‖(1 − ‖B‖) > 0,

it shows that (I + B) is invertible. Denote C = (I + B)^{−1}. Then

1 = ‖I‖ = ‖(I + B)C‖ = ‖C + BC‖ ≥ ‖C‖ − ‖BC‖ ≥ ‖C‖ − ‖B‖‖C‖ = ‖C‖(1 − ‖B‖),

which gives us the lemma.

Theorem 3.8. Let


(A + ∆ A)(∆ x + x) = b + ∆ b (3.10)
be the perturbed system, where ∆ A ∈ Rn×n and ∆ b ∈ Rn represent the perturbations in A and b, respectively. If
‖A^{−1}‖_p ‖∆A‖_p < 1, then

‖∆x‖_p / ‖x‖_p ≤ [ κ_p(A) / ( 1 − κ_p(A) ‖∆A‖_p / ‖A‖_p ) ] ( ‖∆A‖_p / ‖A‖_p + ‖∆b‖_p / ‖b‖_p ).   (3.11)

Proof. First of all notice that since ‖A−1 ‖ p ‖∆ A‖ p < 1, we have

(A + ∆ A) = A(I + A−1 ∆ A).

Thus using the previous lemma, (A + ∆ A) is a product of two invertible matrices A and (I + A−1 ∆ A), hence invertible.
Since
∆ x = x̃ − x,
we have
(A + ∆ A)∆ x = (A + ∆ A)x̃ − (A + ∆ A)x = (b + ∆ b) − (b + ∆ Ax) = ∆ b − ∆ Ax.
Hence
‖∆ x‖ ≤ ‖(A + ∆ A)−1 (∆ b − ∆ Ax)‖ ≤ ‖(A + ∆ A)−1 ‖ (‖∆ b‖ + ‖∆ Ax‖) .
Using the Lemma 3.1 with B = A−1 ∆ A, we have

‖A−1 ‖
‖(A + ∆ A)−1 ‖ = ‖(A(I + A−1 ∆ A))−1 ‖ = ‖((I + A−1 ∆ A)−1 A−1 ‖ ≤ .
1 − ‖A−1 ‖‖∆ A‖

As a result

‖∆x‖ ≤ [ ‖A^{−1}‖ / ( 1 − ‖A^{−1}‖ ‖∆A‖ ) ] ( ‖∆b‖ + ‖∆A‖ ‖x‖ ).
Using that ‖b‖ ≤ ‖Ax‖ ≤ ‖A‖‖x‖, we finally obtain

‖∆x‖/‖x‖ ≤ [ ‖A^{−1}‖ / ( 1 − ‖A^{−1}‖ ‖∆A‖ ) ] ( ‖∆b‖/‖x‖ + ‖∆A‖ )
         ≤ [ ‖A^{−1}‖ ‖A‖ / ( 1 − ‖A^{−1}‖ ‖∆A‖ ) ] ( ‖∆b‖/‖b‖ + ‖∆A‖/‖A‖ )
         ≤ [ κ(A) / ( 1 − κ(A) ‖∆A‖/‖A‖ ) ] ( ‖∆b‖/‖b‖ + ‖∆A‖/‖A‖ ).

If we solve the linear system in m-digit floating point arithmetic, then, as a rule of thumb, we may approximate the input errors due to rounding by

‖∆A‖/‖A‖ ≈ 0.5 · 10^{−m+1},   ‖∆b‖/‖b‖ ≈ 0.5 · 10^{−m+1}.

If the condition number of A is κ(A) = 10^α, then

‖∆x‖/‖x‖ ≤ [ 10^α / ( 1 − 10^{α−m+1} ) ] ( 0.5 · 10^{−m+1} + 0.5 · 10^{−m+1} ) ≈ 10^{α−m+1},

provided 10^{α−m+1} < 1.


Rule of thumb: If the linear system is solved in m-digit floating point arithmetic and if the condition number of A
is of the order 10α , then only m − α − 1 digits in the solution can be trusted.
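A small Matlab experiment illustrating this rule of thumb (the Hilbert matrix is a standard ill-conditioned example; the size n = 10 is an assumption):

% Accuracy of a computed solution versus the condition number.
n = 10;
A = hilb(n);                      % notoriously ill-conditioned
x = ones(n,1);
b = A*x;                          % the exact solution is known by construction
xh = A\b;                         % computed in double precision (about 16 digits)
relerr = norm(xh - x)/norm(x)
kappa  = cond(A)                  % about 10^13, so only a few digits can be trusted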

4 Numerical Integration

Our problem in this section is to compute


Z b
f (x) dx.
a
Even if f (x) can be expressed in terms of elementary functions, the antiderivative of f (x) may not have this property.
For example: e^{−x^2}, sin(x^2), sin(x)/x, etc. Hence, all exact techniques of integration taught in Calculus courses are more like exceptions than the rule. As a general rule, one must rely on numerical integration.
Our goal is to approximate the integral of a function f by a weighted sum of function values:
Z b n
f (x) dx ≈ ∑ wi f (xi ). (4.1)
a i=1

Definition 4.1. ∙ In the above formula xi ∈ [a, b] are called the nodes of the integration formula and wi are called the
weights of the integration formula.
∙ When we approximate ab f (x) dx by ∑ni=1 wi f (xi ) we speak of numerical integration or numerical quadrature.
R

∙ ∑ni=1 wi f (xi ) is called a quadrature formula.


Notice, that by introducing a new variable

b−a
x = a+ (z − α)
β −α

and by the change of variable formula


∫_a^b f(x) dx = (b−a)/(β−α) ∫_α^β f( a + (b−a)/(β−α) (z − α) ) dz,

if we have computed weights ŵ_i and nodes ẑ_i for the numerical integration on an interval [α, β], then we can use the above identity to approximate the integral of f over any interval [a, b] (assuming, of course, that this integral exists) by
∫_a^b f(x) dx = (b−a)/(β−α) ∫_α^β f( a + (b−a)/(β−α) (z − α) ) dz ≈ (b−a)/(β−α) ∑_{i=1}^{n} ŵ_i f( a + (b−a)/(β−α) (ẑ_i − α) ).

That is, the weights wi and nodes xi for the numerical integration on the interval [a, b] are

w_i = (b−a)/(β−α) ŵ_i,   x_i = a + (b−a)/(β−α) (ẑ_i − α).

This means it is sufficient to compute weights and nodes for the numerical integration on a certain interval like [0, 1]
or [−1, 1], often called the reference intervals.
Before we discuss several quadrature methods, we list some properties of the integral which are important for the
development of quadrature rules. First, we note that
Z b
1 dx = b − a.
a

Therefore we require
n
∑ wi = b − a,
i=1

Otherwise, our quadrature formula could not even evaluate the integral of a constant function exactly.
Another useful property of the integral is
Z b
f (x) ≥ 0 =⇒ f (x) dx ≥ 0.
a

If
wi ≥ 0, i = 1, . . . , n,
then
n
∑ wi f (xi ) ≥ 0,
i=1

for all functions f (x) ≥ 0.


We also desire our numerical quadrature to be efficient. Efficiency often depends upon the number of function
evaluations.
Basic idea: if p(x) is some function such that

p(x) ≈ f (x),

then Z b Z b
p(x) dx ≈ f (x) dx
a a
Thus we need a function p(x) which close to f (x) and easy to integrate.
The natural choice is the interpolating polynomials and their properties that we developed in Section 2.
Choose nodes x_1, . . . , x_n in the interval [a, b] and compute the polynomial P( f | x_1, . . . , x_n) of degree less or equal to n − 1 interpolating f at x_1, . . . , x_n. If we use the approximation

f (x) ≈ P( f |x1 , . . . , xn )(x),

then we obtain an approximation for the integral:


∫_a^b f(x) dx ≈ ∫_a^b P( f | x_1, . . . , x_n)(x) dx.   (4.2)

These types of quadrature formulas are called interpolatory quadrature formulas.


From the form of the quadrature formula (4.1), it is useful to represent the interpolation polynomial using the
Lagrange basis,
n n x−x
j
P( f |x1 , . . . , xn )(x) = ∑ f (xi ) ∏ .
i=1 j=1 xi − x j
j̸=i

If we substitute this representation of the interpolation polynomial into (4.2), then we obtain
Z b Z b Z b n n n Z b n
x−xj x−xj
f (x) dx ≈ P( f |x1 , . . . , xn )(x) dx = ∑ f (xi ) ∏ xi − x j dx = ∑ f (xi ) ∏ dx.
a a a i=1 j=1 i=1 a j=1 xi − x j
j̸=i j̸=i

This leads to the quadrature formula


Z b n
f (x) dx ≈ ∑ wi f (xi ),
a i=1

where Z b n
x−xj
wi = ∏ dx.
a j=1 xi − x j
j̸=i

a+b
Example 4.1 (Midpoint Rule). The simplest quadrature formula can be constructed using n = 1 and x1 = 2 . Since

1 x−xj
∏ xi − x j = 1
j=1
j̸=i

we obtain the midpoint rule:


Z b  
a+b
f (x) dx ≈ (b − a) f . (4.3)
a 2
Example 4.2 (Trapezoidal Rule). The next quadrature formula is constructed using n = 2 and x1 = a, x2 = b. It holds
Z b Z b
x−a b−a x−b b−a
dx = , dx = .
a b−a 2 a a−b 2
This yields the Trapezoidal rule:
Z b
b−a
f (x) dx ≈ ( f (a) + f (b)). (4.4)
a 2
Example 4.3 (Simpson’s rule).
b+a
The next quadrature formula is constructed using n = 3 and x1 = a, x2 = 2 , x3 = b. Then

∫_a^b [ (x − (a+b)/2)/(a − (a+b)/2) ] · [ (x − b)/(a − b) ] dx = (b−a)/6,

∫_a^b [ (x − a)/((a+b)/2 − a) ] · [ (x − b)/((a+b)/2 − b) ] dx = 4(b−a)/6,

∫_a^b [ (x − a)/(b − a) ] · [ (x − (a+b)/2)/(b − (a+b)/2) ] dx = (b−a)/6.

This yields the Simpson rule:


∫_a^b f(x) dx ≈ (b−a)/6 [ f(a) + 4 f((a+b)/2) + f(b) ].   (4.5)
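The three rules are easy to compare numerically; a Matlab sketch with an assumed test integrand:

% Midpoint, Trapezoidal and Simpson rules on a single interval [a,b].
f = @(x) exp(-x.^2);  a = 0;  b = 1;
Imid  = (b-a)*f((a+b)/2);
Itrap = (b-a)/2*(f(a) + f(b));
Isimp = (b-a)/6*(f(a) + 4*f((a+b)/2) + f(b));
Iref  = integral(f, a, b);          % reference value
[Imid - Iref, Itrap - Iref, Isimp - Iref]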

4.1 Newton Cotes Quadrature Formula

If we have equidistant points


x_i = a + (i − 1)h,   i = 1, . . . , n,   h = (b−a)/(n−1),
then the resulting interpolatory quadrature formula is called a closed Newton-Cotes quadrature formula (a and b
are nodes). In this case we can use the substitution x = a + sh, to compute
Z b n Z n−1 n
x−xj b−a s− j+1
wi = ∏ dx = ∏ ds.
a j=1 xi − x j n−1 0 j=1 i− j
j̸=i j̸=i

If we have equidistant points


xi = a + ih, i = 1, . . . , n,
where
b−a
h= , x1 = a + h, xn = b − h,
n+1
then the resulting interpolatory quadrature formula is called an open Newton-Cotes quadrature formula (a and b are
not nodes). Again, we can use the substitution x = a + sh, to compute
Z b n Z n+1 n
x−xj 1 s− j
wi = ∏ dx = (b − a) ∏ i− j ds.
a j=1 xi − x j n+1 0 j=1
j̸=i j̸=i

4.1.1 Convergence of the numerical quadrature

Since the interpolation polynomial is uniquely determined, the interpolating polynomial for a polynomial pn−1 of
degree less or equal to n − 1 is the polynomial itself:

P(pn−1 |x1 , . . . , xn )(x) = pn−1 (x).

This implies that


Z b Z b n
pn−1 (x)dx = P(pn−1 |x1 , . . . , xn )(x)dx = ∑ wi pn−1 (xi )
a a i=1

for all polynomials pn−1 of degree less or equal to n − 1. If


∫_a^b p_N(x) dx = ∑_{i=1}^{n} w_i p_N(x_i)

for all polynomials p_N of degree less or equal to N, we say that the integration method is exact of degree N.
In Theorem 2.9, we have established that for any x̄ ∈ (a, b)

f(x̄) − P( f | x_1, . . . , x_n)(x̄) = (1/n!) ω(x̄) f^{(n)}(ξ(x̄)),
where ω(x) = ∏nj=1 (x − x j ). Combining the above estimates, and integrating for the Newton-Cotes methods we obtain
Z b n Z b n
1
f (x)dx − ∑ wi f (xi ) = f (n) (ξ (x)) ∏ (x − x j )dx. (4.6)
a i=1 n! a j=1

Taking the absolute values we immediately obtain



Z b n Z b n
1 (n)

f (x)dx − ∑ wi f (xi ) ≤ max | f (x)| ∏ (x − x j ) dx. (4.7)


a i=1
n! a≤x≤b a j=1

However, using the weighted mean value theorem for integrals, namely
Theorem 4.1 (Weighted Mean-Value Theorem for Integrals). Suppose f is continuous on [a, b] and g is integrable
on [a, b] and does not change sign. Then there exists c ∈ (a, b) such that
Z b Z b
f (x)g(x)dx = f (c) g(x)dx.
a a

we often can obtain sharper estimates.

Example 4.4 (Convergence for Trapezoidal Rule). If f ∈ C2 (a, b), using Theorem 4.1 for the Trapezoidal rule (n = 2,
h = b − a) we obtain
| ∫_a^b f(x) dx − (b−a)/2 ( f(a) + f(b) ) | = | (1/2) ∫_a^b f''(ξ(x)) (x − a)(x − b) dx |
    = | f''(c)/2 ∫_a^b (x − a)(x − b) dx |
    = | f''(c) | (b − a)^3 / 12
    ≤ (h^3/12) max_{a≤x≤b} | f''(x) |.
Example 4.5 (Convergence for Midpoint Rule).
If f ∈ C1 (a, b), using (4.6) for the Midpoint rule (n = 1, h = b − a) we obtain
Z b  Z 
a + b 1 b ′
 
a+b
f (x)dx − (b − a) f = f (ξ (x)) x − dx .
a 2 2 a 2

Since x − a+b

2 changes sign on (a, b) we can not use Theorem 4.1 and it seems the best we can do is just use (4.7) to
obtain
h2
Z b   Z b  
a + b 1 ′ a + b
max | f ′ (x)|.

a f (x)dx − (b − a) f ≤ max | f (x)| x− dx =

2 2 a≤x≤b a
2 8 a≤x≤b
However, looking at the form of the error, we can notice that
Z b 
a+b
x− dx = 0,
a 2

thus we can subtract any constant multiple of it from the function f (x) without changing the value. Thus
Z b   Z b     
a+b a+b a+b a+b
f (x)dx − (b − a) f = f (x) − f ′ x− dx − (b − a) f .
a 2 a 2 2 2

Now using that f a+b



2 is a constant we obtain
Z b   Z b     
a+b a+b ′ a+b a+b
f (x)dx − (b − a) f = f (x) − f −f x− dx.
a 2 a 2 2 2
a+b
Using the Taylor series expansion around x1 = 2 provided f ∈ C2 (a, b), we obtain

a+b 2
      
a+b a+b a+b 1
f (x) − f − f′ x− = f ′′ (ξ (x)) x − .
2 2 2 2 2
2
Now x − a+b
2 does not change the sign on (a, b) and we can use Theorem 4.1 to obtain
2
(b − a)3 h3
Z b   Z b
a+b 1 a+b
f (x)dx − (b − a) f = f ′′ (c) x− dx ≤ max | f ′′ (x)| = max | f ′′ (x)|.
a 2 2 a 2 24 a≤x≤b 24 a≤x≤b

It is interesting to observe that the constant in the Trapezoidal method is twice as large as for the Midpoint Rule.

Example 4.6 (Convergence for Simpson’s Rule). If f ∈ C3 (a, b), using (4.6) for the Simpson’s rule (n = 3, x1 = a,
x2 = a+b b−a
2 , x3 = b, h = 2 ) we obtain
Z b     Z b  
h a+b 1 ′′′ a+b
a f (x)dx − 3 f (a) + 4 f + f (b) = f (ξ (x))(x − a) x − (x − b)dx .

2 6 a 2

Similarly to the Midpoint Rule example, since (x − a) x − a+b



2 (x − b) changes sign on (a, b) we can not use Theorem
4.1 and it seems the best we can do is just use (4.7) to obtain
Z b     Z b  
h a+b 1
≤ max | f ′′′ (x)| (x − a) x − a + b (x − b) dx

f (x)dx − f (a) + 4 f + f (b)
a 3 2 6 a≤x≤b a
2
≤ Ch3 max | f ′′′ (x)|.
a≤x≤b

However, exactly as in the Midpoint Rule example, looking at the form of the error, we can notice that
Z b  
a+b
(x − a) x − (x − b)dx = 0,
a 2

thus we can subtract any constant multiple of it from the function f (x) without changing the value. In other words we
can replace the interpolating polynomial P( f |x1 , x3 , x2 )(x) that gives rise to the Simpson’s Rule with

p(x) = P( f |x1 , x3 , x2 )(x) + c(x − x1 )(x − x2 )(x − x3 ) (4.8)

without changing the value of the numerical quadrature. Recalling the divided difference formula

f [x j+1, . . . , xk ] − f [x j , . . . , xk−1 ]
f [x j , . . . , xk ] = . (4.9)
xk − x j

We can extend the formula (4.9) to allow equal nodes, i.e. x_i = x_j for some i ≠ j, with the convention that

f [xi , xi ] = f ′ (xi ),

which is consistent with the definition of derivative, since


f (xi + ε) − f (xi )
lim f [xi , xi + ε] = lim = f ′ (xi ).
ε→0 ε→0 ε
The constant c in (4.8) we take to be f[x_1, x_3, x_2, x_2], and using the Newton basis, the interpolating polynomial takes
the form

p(x) = f (x1 ) + f [x1 , x3 ](x − x1 ) + f [x1 , x3 , x2 ](x − x1 )(x − x3 ) + f [x1 , x3 , x2 , x2 ](x − x1 )(x − x2 )(x − x3 ). (4.10)

Carefully looking at the proof of Theorem 2.9, we can show


Proposition 1. Let f ∈ C4 [a, b]. Then for each x̄ ∈ [a, b] there exists ξ (x̄) ∈ (a, b) such that

1 (4)
f (x̄)− f (x1 )− f [x1 , x3 ](x̄−x1 )− f [x1 , x3 , x2 ](x̄−x1 )(x̄−x3 )− f [x1 , x3 , x2 , x2 ](x̄−x1 )(x̄−x2 )(x̄−x3 ) = f (ξ (x̄))ω(x̄),
4!
where ω(x) = (x − x1 )(x − x2 )2 (x − x3 ).

Proof. The proof is almost identical to the proof of Theorem 2.9. If x = xi for some i = 1, . . . , n, then the result naturally
holds. Assume x ̸= xi for any i = 1, . . . , n. Consider a function

ψ(x) = f (x) − f (x1 ) − f [x1 , x3 ](x̄ − x1 ) − f [x1 , x3 , x2 ](x̄ − x1 )(x̄ − x3 ) − f [x1 , x3 , x2 , x2 ](x̄ − x1 )(x̄ − x2 )(x̄ − x3 ) − cω(x),

where the constant c is taken to be


f (x̄) − f (x1 ) − f [x1 , x3 ](x̄ − x1 ) − f [x1 , x3 , x2 ](x̄ − x1 )(x̄ − x3 ) − f [x1 , x3 , x2 , x2 ](x̄ − x1 )(x̄ − x2 )(x̄ − x3 )
c= .
ω(x̄)

With this choice of the constant c, the function ψ(x) has at least 4 roots, namely at x1 , x2 , x3 and x̄. Thus, from the
(1)
Rolle’s Theorem there exist 3 points, call them xi , i = 1, 2, 3, such that
(1)
ψ ′ (xi ) = 0, i = 1, 2, 3.

In addition, by construction (it is also easy to check directly),

ψ ′ (x2 ) = 0.

(1)
Since xi ̸= x2 for all i = 1, 2, 3, it follows that ψ ′ (x) has at least 4 roots as well, namely at x1 , x2 , x3 and x̄. Thus, from
(2)
the Rolle’s Theorem there exist 3 points, call them xi , i = 1, 2, 3, such that
(1)
ψ ′′ (xi ) = 0, i = 1, 2, 3.
The rest of the proof is identical to the proof of Theorem 2.9. Continuing this process, there exists a point x_1^{(4)} such that ψ^{(4)}(x_1^{(4)}) = 0.

From the definition of ψ(x), we have

d4 dn d4
4
ψ(x) = 4 f (x) − 0 − c 4 ω(x).
dx dx dx
Since ω(x) is a polynomial of degree 4 with leading coefficient 1, we have

d^4/dx^4 ω(x) = 4!.
As a result
(4)
(4) (4) f (4) (x1 )
ψ (4) (x1 ) = f (4) (x1 ) − c4! = 0 ⇒ c= ,
4!
(4)
which shows the proposition with ξ (x̄) = x1 .
Now we can continue with the Example 4.6. Since ω(x) = (x − x1 )(x − x2 )2 (x − x3 ) does not change the sign on the
interval (a, b), if f ∈ C4 (a, b) we can use Theorem 4.1 to obtain

h5
Z b     Z b
h a+b 1
f (x)dx − f (a) + 4 f + f (b) = f (4) (c) (x − x1 )(x − x2 )2 (x − x3 )dx ≤ max | f (4) (x)|,
a 3 2 4! a 90 a≤x≤b
b−a
with h = 2 .
Rb
The method discussed in the above example, can be generalized to any method for which a ω(x)dx = 0. In partic-
ular we can obtain the following result.
Theorem 4.2 (Exactness of Newton-Cotes Formulas). Let a ≤ x1 < · · · < xn ≤ b be given and let wi be the nodes
and weights of a Newton-Cotes formula. If n is odd, then the quadrature formula is exact for polynomials of degree n.
If n is even, then the quadrature formula is exact for polynomials of degree n − 1.

The weights and nodes for the most popular closed Newton-Cotes formulas are summarized in the table below.
n   h         ŵ_i                               formula                                                                error                  name
2   b − a     1/2, 1/2                          (b−a)/2 [ f(a) + f(b) ]                                                (h^3/12) f^(2)(ξ)      Trapezoidal rule
3   (b−a)/2   1/6, 4/6, 1/6                     (b−a)/6 [ f(a) + 4 f((a+b)/2) + f(b) ]                                 (h^5/90) f^(4)(ξ)      Simpson's rule
4   (b−a)/3   1/8, 3/8, 3/8, 1/8                (b−a)/8 [ f(a) + 3 f((2a+b)/3) + 3 f((a+2b)/3) + f(b) ]                (3h^5/80) f^(4)(ξ)     3/8-rule
5   (b−a)/4   7/90, 32/90, 12/90, 32/90, 7/90   (b−a)/90 [ 7 f(x_1) + 32 f(x_2) + 12 f(x_3) + 32 f(x_4) + 7 f(x_5) ]   (8h^7/945) f^(6)(ξ)    Boole's rule

In the above table


w_i = ŵ_i (b − a),   x_i = a + (i − 1)h,   i = 1, . . . , n,   h = (b − a)/(n − 1).

The weights and nodes for the most popular open Newton-Cotes formulas are summarized in the table below.
n   h         ŵ_i                        formula                                                error                  name
1   (b−a)/2   1                          (b−a) f((a+b)/2)                                       (h^3/3) f^(2)(ξ)       Midpoint rule
2   (b−a)/3   1/2, 1/2                   (b−a)/2 [ f(x_1) + f(x_2) ]                            (3h^3/4) f^(2)(ξ)      Trapezoid method
3   (b−a)/4   2/3, −1/3, 2/3             (b−a)/3 [ 2 f(x_1) − f(x_2) + 2 f(x_3) ]               (28h^5/90) f^(4)(ξ)    Milne's rule
4   (b−a)/5   11/24, 1/24, 1/24, 11/24   (b−a)/24 [ 11 f(x_1) + f(x_2) + f(x_3) + 11 f(x_4) ]   (95h^5/144) f^(4)(ξ)

In the above table


w_i = ŵ_i (b − a),   x_i = a + ih,   i = 1, . . . , n,   h = (b − a)/(n + 1).

4.2 Composite quadrature

Let a = x1 < x2 · · · < xm+1 = b be a partition of [a, b] Then


Z b m Z x j+1
f (x)dx = ∑ f (x)dx.
a j=1 x j

Rb R x j+1
Now we can approximate a f (x)dx by approximating each integral xj f (x)dx by a (low degree) quadrature formula,
Z x j+1 n
f (x)dx ≈ ∑ w ji f (x ji )
xj i=1

and

∫_a^b f(x) dx = ∑_{j=1}^{m} ∫_{x_j}^{x_{j+1}} f(x) dx ≈ ∑_{j=1}^{m} ∑_{i=1}^{n} w_ji f(x_ji).

Example 4.7 (Composite Midpoint Rule).


Z b m  
x j+1 + x j
f (x)dx ≈ ∑ (x j+1 − x j ) f .
a j=1 2

Example 4.8 (Composite Trapezoidal Rule).


Z b m
x j+1 − x j 
f (x)dx ≈ ∑ f (x j+1 ) + f (x j ) .
a j=1 2

The function values f (x2 ), . . . , f (xm ) appear twice in the summation. This can be utilized in the implementation of the
composite Trapezoidal rule:
∫_a^b f(x) dx ≈ (x_2 − x_1)/2 · f(x_1) + ∑_{j=2}^{m} ( (x_j − x_{j-1})/2 + (x_{j+1} − x_j)/2 ) f(x_j) + (x_{m+1} − x_m)/2 · f(x_{m+1}).

Example 4.9 (Composite Simpsons Rule).


Z b m  
x j+1 − x j x j+1 + x j 
f (x)dx ≈ ∑ f (x j ) + 4 f + f (x j+1 ) .
a j=1 6 2

Notice that the function values f (x2 ), . . . , f (xm ) appear twice in the summation. This has to be utilized in the
implementation of the composite Simpson rule.
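A Matlab sketch of the composite Trapezoidal rule on a general partition (the integrand and the partition are assumed examples):

% Composite Trapezoidal rule on the partition x(1) < ... < x(m+1).
f = @(t) 1./(1 + t.^2);
x = linspace(0, 1, 11);             % uniform partition with m = 10
fx = f(x);
h = diff(x);                        % subinterval lengths
Itrap = sum( h .* (fx(1:end-1) + fx(2:end))/2 )
% the exact value is atan(1) = pi/4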

4.3 Gauss Quadrature

The idea of the Gauss Quadrature is to choose nodes x1 , . . . , xn and the weights w1 , . . . , wn such that the formula
Z b n
p(x) dx ≈ ∑ wi p(xi ). (4.11)
a i=1

is exact for polynomials of the highest possible degree. If we require the formula (4.11) to be exact for the monomials 1, x, . . . , x^N, we obtain a nonlinear system of N + 1 equations with 2n unknowns. Thus we can expect the formula to be exact for polynomials of degree 2n − 1.

Example 4.10 (Case n = 2). Let’s determine the weights w1 and w2 and the nodes x1 and x2 such that
Z 1
w1 p(x1 ) + w2 p(x2 ) = p(x) dx
−1

holds for polynomials p(x) of degree 3 or less. This seems possible since we have 4 parameters to choose w1 , w2 , x1 , x2
and exactly 4 numbers are needed in order to define uniquely a polynomial of degree 3. Forcing formula to be exact
for 1, x, x2 , and x3 , leads to Z 1
w1 + w2 = 1 dx = 2
−1
Z 1
w1 x1 + w2 x2 = x dx = 0
−1
Z 1
2
w1 x12 + w2 x22 = x2 dx =
−1 3
Z 1
w1 x13 + w2 x23 = x3 dx = 0
−1
a nonlinear system of 4 equations with 4 unknowns. We can easily solve this system analytically to obtain
w_1 = w_2 = 1,   x_1 = −1/√3,   x_2 = 1/√3.
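A quick Matlab check that this 2-point rule is exact for cubics but not for quartics (the test polynomial is an assumed example):

x = [-1 1]/sqrt(3);  w = [1 1];
p3 = @(t) 7*t.^3 - 2*t.^2 + t + 5;                % an arbitrary cubic
E3 = integral(p3, -1, 1) - w*p3(x)'               % zero up to rounding
E4 = integral(@(t) t.^4, -1, 1) - w*(x.^4)'       % nonzero: 2/5 - 2/9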
Actually we will consider a more general problem. We want to choose nodes x_1, . . . , x_n and weights w_1, . . . , w_n such that the formula

∫_a^b ω(x) p(x) dx ≈ ∑_{i=1}^{n} w_i p(x_i)   (4.12)

is exact for polynomials of the highest possible degree, where ω(x) is some positive weight function on (a, b), for example 1, 1/(1+x^2), e^{−x^2}, etc.
It seems reasonable to expect the quadrature formula (4.12) to be exact for polynomials of degree 2n − 1 or less, but not more than that, as the following result shows.
Lemma 4.1. There is no choice of nodes x1 , . . . , xn and weights w1 , . . . , wn such that
Z b n
ω(x)pN (x) dx ≈ ∑ wi pN (xi ).
a i=1

for all polynomials pN of degree less or equal to N if N > 2n − 1.


Proof. Assume the formula is exact for all polynomials of degree 2n. Consider the function

g(x) = (x − x1 )2 . . . (x − xn )2 .

Easy to see that g is a polynomial of degree 2n and g > 0 on (a, b). Thus we obtain
Z b n
0< ω(x)g(x)dx = ∑ wi g(xi ) = 0
a i=1

a contradiction.
Lemma 4.2. If (4.12) is exact for all polynomials of degree 2n − 1 or less, then the polynomials p*0 (x), p*1 (x), . . . , p*n (x)
given by their roots, i.e.
j
p*j (x) = ∏(x − xi ), j = 1, 2, . . . , n
i=1

satisfy
Z b
ω(x)p*j (x)p*i (x)dx = 0, for i ̸= j.
a

Proof. If i ̸= j, then p*j (x)p*i (x) is a polynomial of degree at most 2n − 1, and since (4.12) is exact for p*j (x)p*i (x), we
obtain Z b n
ω(x)p*j (x)p*i (x)dx = ∑ wk p*j (xk )p*j (xk ) = 0.
a k=1

Definition 4.2. For two integrable functions f (x) and g(x), and a given weight function ω(x) ≥ 0, we define a
(weighted) inner-product by
Z b
( f , g) = ω(x) f (x)g(x)dx.
a

One can easily check the above definition indeed satisfies all the conditions for the inner-product. Once we have a
notion of inner-product for functions we can define an orthogonality.
Definition 4.3. We say that two functions f (x) and g(x) are orthogonal if

( f , g) = 0.

Thus the Lemma 4.2 says that if (4.12) is exact for all polynomials of degree 2n − 1 or less, then the nodes x1 , . . . , xn
are the roots of the orthogonal polynomials. Usually the roots of a polynomial can be repeated, complex, or lie outside
of (a, b), this is not the case for the orthogonal polynomials.
Lemma 4.3. The roots of the orthogonal polynomials in the sense of Definition 4.3 are real, simple, and lie inside the interval (a, b).

Proof. Given an orthogonal polynomial p_n(x) of degree n on (a, b), let x_1, . . . , x_r be the points where p_n(x) changes its sign. Consider the function g(x) = (x − x_1) · · · (x − x_r). Thus, p_n(x)g(x) ≥ 0 on (a, b). If r < n, then p_n(x)g(x) is a
polynomial of degree less or equal than 2n − 1 and
Z b n
0< ω(x)pn (x)g(x)dx = ∑ wi pn (xi )g(xi ) = 0
a i=1

a contradiction. Thus r = n and it means that pn (x) changes sign n times and from this fact the conclusion of the lemma
follows.

The next result can be thought of as the Gram-Schmidt orthogonalization process for polynomials.
Lemma 4.4. We can construct orthogonal polynomials p*j (x), j = 1, . . . , n such that

(p*j , p*k ) = 0 for j ̸= k.

In the addition, the polynomials p*j (x), j = 1, . . . , n satisfy the three-term recursion

p*j (x) = (x − δ j )p*j−1 (x) − γ 2j p*j−2 (x), j ≥ 1,

where p*−1 = 0, p*0 = 1 and

(xp*j−1 , p*j−1 ) (p*j−1 , p*j−1 )


δj = , γ 2j = , j ≥ 1, γ1 = 0. (4.13)
(p*j−1 , p*j−1 ) (p*j−2 , p*j−2 )

Proof. The proof is by induction. Suppose we constructed such polynomials p*j (x) for j ≤ m and established that they
are unique. Next, we want to construct p*m+1 (x) and that (p*m+1 , p j ) = 0 for j ≤ m and satisfies (4.13).
Any polynomial of degree m + 1 with leading coefficient 1, we can write uniquely as

pm+1 (x) = (x − δm+1 )p*m (x) + cm−1 p*m−1 (x) + · · · + c1 p*1 (x). (4.14)

Since (p*j , p*k ) = 0 for any k, j ≤ m and j ̸= k, we have an equation

0 = (pm+1 , p*m ) = ((x − δm+1 )p*m , p*m ) + cm−1 (p*m−1 , p*m ) + · · · + c0 (p*0 , p*m ) = (xp*m , p*m ) − δm+1 (p*m , p*m ).

Since (p*m , p*m ) > 0, it has a solution


(xp*m , p*m )
δm+1 = .
(p*m , p*m )
We also require that (pm+1 , p*j ) = 0 for all j ≤ m − 1. Using (4.14) again, we have equations in c j

0 = (pm+1 , p*j ) = ((x − δm+1 )p*m , p*j ) + cm−1 (p*m−1 , p*j ) + · · · + c j (p*j , p*j ) · · · + c0 (p*0 , p*j ) = (p*m , xp*j ) + c j (p*j , p*j ).
(4.15)
By induction for j ≤ m − 1
p*j+1 (x) = (x − δ j )p*j (x) − γ 2j p*j−1 (x),
hence
xp*j (x) = p*j+1 (x) + δ j p*j (x) + γ 2j p*j−1 (x) for j ≤ m − 1.
Plugging it into (4.15) and using that j ≤ m − 1, we obtain

0 = (pm+1 , p*j ) = (p*m , xp*j ) + c j (p*j , p*j )


= (p*m , p*j+1 ) + δ j (p*m , p*j ) + γ 2j (p*m , p*j−1 ) + c j (p*j , p*j )
= (p*m , p*j+1 ) + c j (p*j , p*j ).
The above equations have solutions c_j = 0 for j < m − 1 and c_{m−1} = − (p*_m, p*_m)/(p*_{m−1}, p*_{m−1}). Since (p*_m, p*_m)/(p*_{m−1}, p*_{m−1}) > 0 we set

γ_{m+1}^2 := −c_{m−1} = (p*_m, p*_m)/(p*_{m−1}, p*_{m−1}).

The above lemma allows us to generate the orthogonal polynomials recursively. Later, we will see that we can compute
the real roots of any polynomial very efficiently. We also have an alternative way to compute the nodes xi .
Theorem 4.3. The roots x_i, i = 1, . . . , n are the eigenvalues of the tridiagonal matrix

        [ δ_1  γ_2                       ]
        [ γ_2  δ_2  γ_3                  ]
J_n =   [       ·     ·     ·            ]
        [             ·     ·     γ_n    ]
        [                   γ_n    δ_n   ]

Proof. The proof follows easily be showing that the characteristic polynomials

pn (x) = det(Jn − xI)

satisfy the three-term recursion

p_n(x) = (x − δ_n) p_{n−1}(x) − γ_n^2 p_{n−2}(x),   n ≥ 1,

which can be done by induction, for example.
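This observation is the basis of the Golub-Welsch way of computing Gauss rules: form the (symmetrized) Jacobi matrix and compute its eigenvalues. A Matlab sketch for the Legendre case ω(x) = 1 on [−1, 1]; the recursion coefficients k/sqrt(4k^2 − 1) used below are the standard ones for Legendre polynomials and are quoted here without derivation:

% Gauss-Legendre nodes and weights on [-1,1] from the Jacobi matrix.
n = 5;
k = (1:n-1)';
beta = k ./ sqrt(4*k.^2 - 1);       % off-diagonal entries
J = diag(beta,1) + diag(beta,-1);   % symmetric Jacobi matrix (zero diagonal)
[V,D] = eig(J);
x = diag(D);                        % nodes = eigenvalues
w = 2*V(1,:)'.^2;                   % weights from the first eigenvector components
w'*x.^6 - 2/7                       % the rule is exact up to degree 2n-1 = 9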


We still need to obtain the corresponding weights. That is what we will address now.
Lemma 4.5. Let p*0 (x), . . . , p*n−1 (x) be orthogonal polynomials and x1 , . . . , xn be any distinct points. Then the matrix

p*0 (x1 )
p*0 (x2 ) . . . . . . p*0 (xn )
 
 p*1 (x1 )
p*1 (x2 ) . . . . . . p*1 (xn ) 
P=
 
.. .. .. .. .. 
 . . . . . 
p*n−1 (x1 ) p*n−1 (x2 ) . . . *
. . . pn−1 (xn )

is non-singular.
Proof. We will prove the result by contradiction. Assume the matrix P is singular. Then there exists a vector z ∈ Rn
such that zT P = 0T . Hence
n−1
∑ z j p*j (xi ) = 0 for i = 1, . . . n.
j=0

Notice that
n−1
q(x) = ∑ z j p*j (x)
j=0

is a polynomial of degree n − 1 that has n roots, namely at x1 , . . . , xn . Hence q(x) ≡ 0. Let k be the largest index such
that zk ̸= 0. Then

p*_k(x) = −(1/z_k) ∑_{i=0}^{k−1} z_i p*_i(x),

which is a contradiction, since on the left we have a polynomial of degree k and on the right a polynomial of degree less than k.
Theorem 4.4. Let x1 , . . . , xn be the roots of the p*n and let w = (w1 , . . . , wn )T be the solution of the system

Pw = g, (4.16)

where P is the matrix defined in Lemma 4.5 and the g = (g0 , . . . , gn−1 )T is given by
(
(p*0 , p*0 ) i=0
gi =
0 i = 1, . . . , n − 1.

Then wi > 0, i = 1, . . . , n and


Z b n
ω(x)p(x) dx = ∑ wi p(xi ). (4.17)
a i=1

for all p ∈ P2n−1


Conversely, if w_i and x_i, i = 1, . . . , n, are such that (4.17) holds for all p ∈ P_{2n−1}, then the x_i are the roots of p*_n(x) and the w_i satisfy (4.16).
Proof. Since the roots of the orthogonal polynomials are simple, real, and inside (a, b) by Lemma 4.5 the matrix P is
non-singular. Hence the system (4.16) has a unique solution.
Consider any p(x) ∈ P2n−1 . We can write

p(x) = p*n (x)q(x) + r(x), (4.18)

where q ∈ Pn−1 and r ∈ Pn−1 . Since {p*0 , . . . , p*n−1 } form a basis for Pn−1 , we have

n−1 n−1
q(x) = ∑ αk p*k (x) and r(x) = ∑ βk p*k (x).
k=0 k=0

Then,
Z b Z b n−1 n−1
ω(x)p(x)dx = ω(x)(p*n (x)q(x) + r(x))dx = (p*n , q) + (r, 1) = ∑ αk (p*n , p*k ) + ∑ βk (p*k , p*0 ) = β0 (p*0 , p*0 ).
a a k=0 k=0

On the other hand using that xi are the roots of p*n ,


!
n n n n−1 n
∑ wi p(xi ) = ∑ wi (p*n (xi )q(xi ) + r(xi )) = ∑ wi r(xi ) = ∑ βk ∑ wi p*k (xi ) = β0 (p*0 , p*0 ),
i=1 i=1 i=1 k=0 i=1

by using (4.16). Hence we have (4.17).


To show w j > 0 for j = 1, . . . , n, we consider b j (x) = (x − x1 )2 . . . (x − x j−1 )2 (x − x j+1 )2 . . . (x − xn )2 ∈ P2n−2 . Using
that b j (x) > 0 on (a, b) and (4.17), we have
Z b n
0< ω(x)b j (x)dx = ∑ wi b j (xi ) = w j (x j − x1 )2 . . . (x j − x j−1 )2 (x j − x j+1 )2 . . . (x j − xn )2 .
a i=1

From above, the positivity of w j follows.



To show the converse, notice that the nodes x1 , . . . , xn are distinct (otherwise we could write the formula (4.17) with
less than n nodes which would contradict Lemma 4.1), hence the matrix P is non-singular.
Applying formula (4.17) with p(x) = p*k (x), k = 0, . . . , n − 1, we have
∑_{i=1}^{n} w_i p*_k(x_i) = ∫_a^b ω(x) p*_k(x) dx = (p*_0, p*_k) =  (p*_0, p*_0) for k = 0,  and 0 for k = 1, . . . , n − 1.

Hence the weights wi satisfy (4.16).


Applying formula (4.17) with p(x) = p*k (x)p*n (x), k = 0, . . . , n − 1, we have
n
0 = (p*k , p*n ) = ∑ wi p*k (xi )p*n (xi ).
i=1

In other words the vector


c = (p*n (x1 ), . . . , p*n (xn ))T ,
solves the homogeneous system
Pc = 0.
Since P is non-singular, we have c = 0, i.e. xi are the roots of p*n (x).

4.4 Error analysis

5 Linear Least Squares

The most common and simplest problem in linear least squares is the linear regression problem

Example 5.1 (Linear regression). Given m measurements

(xi , yi ), i = 1, . . . , m,

find a linear function


y(x) = ax + b
that best fits these data, i.e.,
yi ≈ axi + b i = 1, . . . , m.
In other words, we want two numbers a and b such that
m
∑ (axi + b − yi )2
i=1

is minimized. To write the problem in matrix form, we let


   
x1 1 y1
 x2 1   y2 
A =  . .  ∈ Rm×2 , b =  .  ∈ Rm
   
 .. ..   .. 
xm 1 ym

then the ith residual



ri = axi + b − yi
is the i-th component of Az − b, where z = [a b]T . Thus we want to minimize ‖r‖22 which leads to

min_{z∈R^2} ‖Az − b‖_2^2.
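In Matlab this problem is solved directly by the backslash operator, which returns a least squares solution for overdetermined systems; a small sketch with assumed data:

% Fit y = a*x + b to m data points in the least squares sense.
xi = (0:0.5:5)';
yi = 2*xi - 1 + 0.1*randn(size(xi));   % noisy samples of a line
A  = [xi, ones(size(xi))];
z  = A \ yi;                           % least squares solution z = [a; b]
a = z(1), b = z(2)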

Loosely speaking, the linear least squares problem says: Given A ∈ Rm×n , find x ∈ Rn such that Ax ≈ b.
Of course, if m = n and A is invertible, then we can solve Ax = b. Otherwise, we may not have a solution of Ax = b
or we may have infinitely many of them.
We are interested in vectors x that minimize the norm of squares of the residual Ax − b, i.e., which solve

min ‖Ax − b‖22


x∈Rn

The Linear Least Squares (LLS) problem:


min ‖Ax − b‖22 .
x∈Rn

Notice that
min_{x∈R^n} ‖Ax − b‖_2^2,   min_{x∈R^n} ‖Ax − b‖_2,   min_{x∈R^n} (1/2)‖Ax − b‖_2^2
are all equivalent in the sense that if x solves one of them it also solves the others.
Instead of finding x that minimizes the norm of squares of the residual Ax − b, we could also try to find x that
minimizes the p-norm of the residual
minn ‖Ax − b‖ p
x∈R

This can be done, but is more complicated and will not be covered.
There are many examples of such problems.
Example 5.2 (Best Polynomial Fitting). Given m measurements

(xi , yi ), i = 1, . . . , m,

find a polynomial function


y(x) = a_n x^n + · · · + a_1 x + a_0
that best fits these data, i.e.,
y_i ≈ a_n x_i^n + · · · + a_1 x_i + a_0,   i = 1, . . . , m.
We want coefficients a_0, a_1, . . . , a_n such that

∑_{i=1}^{m} ( a_n x_i^n + · · · + a_1 x_i + a_0 − y_i )^2

is minimized.
Write in matrix form. Let
x1n
   
... x1 1 y1
 xn ... x2 1  y2 
 2 m×(n+1)
A= . ..  ∈ R , b =  .  ∈ Rm
  
.. ..
 .. . . .  .. 
n ... x 1
xm ym
m

then the ith residual


ri = an xin + · · · a1 xi + a0 − yi
is the ith component of Az − b, where z = [an , . . . , a1 , a0 ]T . Thus we want to minimize ‖r‖22 which leads again to

min ‖Az − b‖22


z∈Rn

Sometimes one has to modify the problem in order to state it as a LLS problem

Example 5.3 (Best Circle Fitting). Find a best fit circle through points (x1 , y1 ), (x2 , y2 ), . . . , (xm , ym ). Equation for the
circle around (c1 , c2 ) with radius r is
(x − c1 )2 + (y − c2 )2 = r2 .
It is not a LLS problem, due to quadratic terms. However, rewrite the equation for the circle in the form

2xc1 + 2yc2 + (r2 − c21 − c22 ) = x2 + y2


and setting c_3 = r^2 − c_1^2 − c_2^2, then we can compute the center (c_1, c_2) and the radius r = √( c_3 + c_1^2 + c_2^2 ) of the circle that best fits the data points by solving the least squares problem
 2
x1 + y21
  
2x1 2y1 1  
c1  2 2  2
 2x2 2y2 1     x2 + y2 

min  . . . c2 − . 
[c1 ,c2 ,c3 ]T ∈R3  .. .. ..  ..
 
 2
c3

2xm 2ym 1 2 + y2
xm m

5.1 Normal Equation

Suppose x* satisfies
‖Ax* − b‖22 = minn ‖Ax − b‖22 (LLS)
x∈R

For any vector z ∈ Rn


‖Ax* − b‖22 ≤ ‖A(x* + z) − b‖22
= (A(x* + z) − b)T (A(x* + z) − b)
= x*^T A^T A x* − 2 x*^T A^T b + b^T b + 2 z^T A^T A x* − 2 z^T A^T b + z^T A^T A z
= ‖Ax* − b‖22 + 2zT (AT Ax* − AT b) + ‖Az‖22 .
Of course ‖Az‖22 ≥ 0, but
2zT (AT Ax* − AT b)
could be negative for some z if AT Ax* − AT b ̸= 0. In fact setting

z = −α(AT Ax* − AT b)

for some α ∈ R. For such z ∈ Rn , we get

2zT (AT Ax* − AT b) + ‖Az‖22 = −2α‖AT Ax* − AT b‖22 + α 2 ‖A(AT Ax* − AT b)‖22 < 0

for
‖AT Ax* − AT b‖22
0<α < .
‖A(AT Ax* − AT b)‖22
Thus, if x* solves (LLS) then x* must satisfy

AT Ax* − AT b = 0 normal equation. (5.1)

On the other hand if x* satisfies


AT Ax* − AT b = 0,
then for any x
‖Ax − b‖22 = ‖Ax* + A(x − x* ) − b‖22
= ‖Ax* − b‖22 + 2(x − x* )T (AT Ax* − AT b) + ‖A(x − x* )‖22
= ‖Ax* − b‖22 + ‖A(x − x* )‖22
≥ ‖Ax* − b‖22
i.e. x* solves (LLS).
Theorem 5.1. The linear least squares problem

min_{x ∈ Rⁿ} ‖Ax − b‖₂²     (LLS)

always has a solution. A vector x* solves (LLS) if and only if x* solves the normal equation

A^T A x = A^T b.

Note: If the matrix A ∈ R^{m×n}, m ≥ n, has rank n, then A^T A is symmetric positive definite, since it satisfies

v^T A^T A v = ‖Av‖₂² > 0,   ∀ v ∈ Rⁿ, v ≠ 0.

If A ∈ Rm×n , m ≥ n, has full rank n, then we can use the Cholesky-decomposition to solve the normal equation
(and, hence, the linear least squares problem) as follows
1. Compute AT A and AT b.
2. Compute the Cholesky-decomposition AT A = RT R.
3. Solve RT y = AT b (forward solve),
solve Rx = y (backward solve) .
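A minimal MATLAB sketch of these three steps, assuming A has full column rank (chol returns the upper triangular factor):

% Solve min ||Ax - b||_2^2 via the normal equation and Cholesky.
B = A'*A;                 % n-by-n, symmetric positive definite if rank(A) = n
c = A'*b;
R = chol(B);              % B = R'*R with R upper triangular
y = R' \ c;               % forward solve  R'y = A'b
x = R  \ y;               % backward solve Rx  = y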

The computation of A^T A and A^T b requires roughly mn² and 2mn flops, respectively. Roughly (1/3)n³ flops are required to compute the Cholesky-decomposition. The solution of R^T y = A^T b and of Rx = y requires approximately 2n² flops.
Computing the normal equations requires us to calculate terms of the form ∑_{k=1}^{m} a_{ki} a_{kj}. Because of floating point arithmetic, the computed matrix A^T A may not be (numerically) positive definite.

t = 10.^(0:-1:-10)';
A = [ ones(size(t)) t t.^2 t.^3 t.^4 t.^5];
B = A'*A;
[R,iflag] = chol( B );
if( iflag ~= 0 )
  disp([' Cholesky decomposition returned with iflag = ', ...
        int2str(iflag)])
end
In exact arithmetic B = A^T A is symmetric positive definite, but the Cholesky-decomposition detects that a_{jj} − ∑_{k=1}^{j−1} r_{jk}² < 0 in step j = 6.

>> Cholesky decomposition returned with iflag = 6

The use of the Cholesky decomposition is problematic if the condition number of A^T A is large. In the example, κ₂(A^T A) ≈ 4.7 · 10^16.

5.2 Solving the Linear Least Squares Problem Using the QR-Decomposition

Definition 5.1. A matrix Q ∈ R^{m×n} is called orthogonal if Q^T Q = Iₙ, i.e., if its columns are pairwise orthogonal and have 2-norm one.

Orthogonal matrices have the following important properties:

1. If Q ∈ R^{n×n} is orthogonal, then Q^T Q = I implies that Q⁻¹ = Q^T.
2. If Q ∈ R^{n×n} is an orthogonal matrix, then Q^T is an orthogonal matrix.
3. If Q₁, Q₂ ∈ R^{n×n} are orthogonal matrices, then Q₁Q₂ is an orthogonal matrix.
4. If Q ∈ R^{n×n} is an orthogonal matrix, then

   (Qx)^T (Qy) = x^T y   for all x, y ∈ Rⁿ,

   i.e. the angle between Qx and Qy is equal to the angle between x and y.
5. As a result,

   ‖Qx‖₂ = ‖x‖₂,

   i.e. orthogonal matrices preserve the 2-norm.
The last property is the key to solving the LLS problem.

Example 5.4. In two dimensions the rotation matrix

Q = [  cos θ   sin θ
      −sin θ   cos θ ]

is an orthogonal matrix.
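These properties are easy to check numerically in MATLAB, for instance for the rotation matrix above (th is an arbitrary angle):

th = 0.7;                                  % arbitrary angle
Q  = [cos(th) sin(th); -sin(th) cos(th)];
norm(Q'*Q - eye(2))                        % ~1e-16, i.e. Q'Q = I up to rounding
x  = randn(2,1);
[norm(Q*x), norm(x)]                       % the 2-norm is preserved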

5.2.1 QR-decomposition

Let m ≥ n. For each A ∈ R^{m×n} there exist a permutation matrix P ∈ R^{n×n}, an orthogonal matrix Q ∈ R^{m×m}, and an upper triangular matrix R ∈ R^{n×n} such that

AP = Q [ R    } n
         0 ]  } m−n        (QR-decomposition).
The QR decomposition of A can be computed using the Matlab command [Q, R, P] = qr(A).
We will not go into the details of how Q, P, R are computed. If you are interested, check Chapter 5 of the book by Gene Golub and Charles Van Loan, Matrix Computations.
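With the permutation returned as a matrix, the factorization can be verified directly:

A = randn(8,4);              % any m-by-n matrix with m >= n
[Q,R,P] = qr(A);             % column-pivoted QR-decomposition
norm(A*P - Q*R)              % ~1e-15
norm(Q'*Q - eye(8))          % Q is orthogonal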

5.2.2 Case 1: Rank(A) = n

Assume that A ∈ R^{m×n} has full rank n. (The rank deficient case will be considered later.)
Let

AP = Q [ R    } n             ⇔      Q^T A P = [ R    } n
         0 ]  } m−n                              0 ]  } m−n

where R ∈ R^{n×n} is an upper triangular matrix. Since A has full rank n, the matrix R also has rank n and, therefore, is nonsingular. Moreover, since Q is orthogonal it obeys QQ^T = I. Hence

‖Q^T y‖₂ = ‖y‖₂   ∀ y ∈ Rᵐ.

In addition, the permutation matrix satisfies PP^T = I. Using these properties of Q and P we get

‖Ax − b‖₂² = ‖Q^T (Ax − b)‖₂²
           = ‖Q^T (A P P^T x − b)‖₂²
           = ‖(Q^T A P) P^T x − Q^T b‖₂²
           = ‖ [R; 0] P^T x − Q^T b ‖₂².

Partitioning Q^T b as

Q^T b = [ c    } n
          d ]  } m−n

and putting y = P^T x, we get

‖Ax − b‖₂² = ‖ [R; 0] y − [c; d] ‖₂² = ‖ [Ry − c; −d] ‖₂² = ‖Ry − c‖₂² + ‖d‖₂².
Thus,

min_x ‖Ax − b‖₂²   ⇔   min_y ‖Ry − c‖₂² + ‖d‖₂²,

and the solution is y = R⁻¹c.



Recall that y = P^T x and PP^T = I, so x = Py. Hence the solution is x = Py = PR⁻¹c.
In summary: To solve a Linear Least Squares Problem using the QR-decomposition with matrix A ∈ R^{m×n} of rank n and b ∈ Rᵐ:

1. Compute an orthogonal matrix Q ∈ R^{m×m}, an upper triangular matrix R ∈ R^{n×n}, and a permutation matrix P ∈ R^{n×n} such that

   Q^T A P = [ R
               0 ].

2. Compute

   Q^T b = [ c
             d ].

3. Solve Ry = c.
4. Set x = Py.
The MATLAB Implementation is very simple.
[m,n] = size(A);
[Q,R,P] = qr(A);
c = Q’*b;
y = R(1:n,1:n) \ c(1:n);
x = P*y;
If you type

x = A\b;

in Matlab, then Matlab computes the solution of the linear least squares problem

min_x ‖Ax − b‖₂²

using the QR decomposition as described above.

5.2.3 Case 2: Rank(A) < n

The Rank Deficient Case: Assume that A ∈ R^{m×n}, m ≥ n, has rank r < n. (The case m < n can be handled analogously.)
Suppose that

AP = QR,

where Q ∈ R^{m×m} is orthogonal, P ∈ R^{n×n} is a permutation matrix, and R ∈ R^{m×n} is an upper triangular matrix of the form

R = [ R₁  R₂
      0   0 ]

with nonsingular upper triangular R₁ ∈ R^{r×r} and R₂ ∈ R^{r×(n−r)}.
We can write

‖Ax − b‖₂² = ‖Q^T (A P P^T x − b)‖₂² = ‖ [ R₁ R₂; 0 0 ] P^T x − Q^T b ‖₂².

Partition Q^T b as

Q^T b = [ c₁    } r
          c₂    } n−r
          d  ]  } m−n

and put y = P^T x. Partition

y = [ y₁    } r
      y₂ ]  } n−r.

This gives us

‖Ax − b‖₂² = ‖ [ R₁y₁ + R₂y₂ − c₁; −c₂; −d ] ‖₂² = ‖R₁y₁ + R₂y₂ − c₁‖₂² + ‖c₂‖₂² + ‖d‖₂².

The linear least squares problem min_x ‖Ax − b‖₂² is therefore equivalent to

min_y ‖R₁y₁ + R₂y₂ − c₁‖₂² + ‖c₂‖₂² + ‖d‖₂²,

where R₁ ∈ R^{r×r} is nonsingular. The solution is

y₁ = R₁⁻¹ (c₁ − R₂y₂)

for any y₂ ∈ R^{n−r}.


Since y = P^T x and P^T P = I,

x = Py = P [ R₁⁻¹(c₁ − R₂y₂)
             y₂ ].

We have infinitely many solutions since y₂ is arbitrary. Which one should we choose?
If we use Matlab x = A\b, then Matlab computes the one with y₂ = 0,

x = P [ R₁⁻¹c₁
        0 ].

The MATLAB implementation is:


[m,n] = size(A);
[Q,R,P] = qr(A);
c = Q’*b;
% Determine rank of A.
% The diagonal entries of R satisfy
%|R(1,1)| >= |R(2,2)| >= |R(3,3)| >= ..
% Find the smallest integer r such that
%|R(r+1,r+1)| < max(size(A))*eps*|R(1,1)|
tol = max(size(A))*eps*abs(R(1,1));
r = 1;
while ( r < n && abs(R(r+1,r+1)) >= tol ); r = r+1; end
y1 = R(1:r,1:r) \ c(1:r);
y2 = zeros(n-r,1);
x = P*[y1;y2];
where we used the following criterion to determine the effective rank of A ∈ R^{m×n} from the QR decomposition

AP = QR,

whose diagonal entries satisfy |R₁₁| ≥ |R₂₂| ≥ · · · . The effective rank r of A is the smallest integer r such that

|R_{r+1,r+1}| < ε max{m, n} |R₁₁|.
tol = max(size(A))*eps*abs(R(1,1));
r = 0;
while ( r < n && abs(R(r+1,r+1)) >= tol )
    r = r+1;
end

All solutions of

min_x ‖Ax − b‖₂²

are given by

x = Py = P [ R₁⁻¹(c₁ − R₂y₂)
             y₂ ],

where y₂ ∈ R^{n−r} is arbitrary.
Minimum norm solution:
Of all solutions, pick the one with the smallest 2-norm. This leads to

min_{y₂} ‖ P [ R₁⁻¹(c₁ − R₂y₂); y₂ ] ‖₂².

Since the permutation matrix P is orthogonal,

‖ P [ R₁⁻¹(c₁ − R₂y₂); y₂ ] ‖₂² = ‖ [ R₁⁻¹(c₁ − R₂y₂); y₂ ] ‖₂² = ‖ [ R₁⁻¹R₂; −I ] y₂ − [ R₁⁻¹c₁; 0 ] ‖₂²,

which is another linear least squares problem, now with unknown y₂. Its matrix has size n × (n − r) and full rank n − r, so it can be solved using the techniques discussed earlier.
MATLAB Implementation:
[m,n] = size(A);
[Q,R,P] = qr(A);
c = Q’*b;
% Determine rank of A (as before).
tol = max(size(A))*eps*abs(R(1,1));
r = 1;
while ( r < n && abs(R(r+1,r+1)) >= tol ); r = r+1; end
% Solve least squares problem to get y2
S = [ R(1:r,1:r) \ R(1:r,r+1:n);
eye(n-r) ];
t = [ R(1:r,1:r) \ c(1:r);
zeros(n-r,1) ];
y2 = S \ t; % solve least squares problem using backslash
% Compute x
y1 = R(1:r,1:r) \ ( c(1:r) - R(1:r,r+1:n) * y2 );
x = P*[y1;y2];

5.3 Solving the Linear Least Squares Problem Using the SVD-Decomposition

For any matrix A ∈ R^{m×n} there exist orthogonal matrices U ∈ R^{m×m}, V ∈ R^{n×n} and a 'diagonal' matrix Σ ∈ R^{m×n}, i.e., a matrix whose only nonzero entries are the diagonal entries

Σ_{ii} = σᵢ,   i = 1, . . . , min{m, n}

(for m ≤ n the diagonal block is followed by zero columns, for m ≥ n by zero rows), with

σ₁ ≥ · · · ≥ σᵣ > σᵣ₊₁ = · · · = σ_{min{m,n}} = 0,

such that A = UΣV^T.

Definition 5.2. The decomposition

A = UΣV^T

is called the Singular Value Decomposition (SVD). It is a very important decomposition of a matrix and tells us a lot about its structure.

It can be computed using the Matlab command svd.


Definition 5.3. The diagonal entries σi of Σ are called the singular values of A. The columns of U are called left
singular vectors and the columns of V are called right singular vectors.
Using the orthogonality of V we can write it in the form

AV = UΣ.

We can interpret it as follows: there exists a special orthonormal set of vectors (the columns of V) that is mapped by the matrix A onto an orthogonal set of vectors (scalar multiples of the columns of U, since Avᵢ = σᵢuᵢ).
Given the SVD-decomposition of A,

A = UΣV^T

with

σ₁ ≥ · · · ≥ σᵣ > σᵣ₊₁ = · · · = σ_{min{m,n}} = 0,

one may conclude the following:
∙ rank(A) = r,

∙ R(A) = R([u₁, . . . , uᵣ]),
∙ N(A) = R([vᵣ₊₁, . . . , vₙ]),
∙ R(A^T) = R([v₁, . . . , vᵣ]),
∙ N(A^T) = R([uᵣ₊₁, . . . , uₘ]).
Moreover, if we denote

Uᵣ = [u₁, . . . , uᵣ],   Σᵣ = diag(σ₁, . . . , σᵣ),   Vᵣ = [v₁, . . . , vᵣ],

then we have

A = UᵣΣᵣVᵣ^T = ∑_{i=1}^{r} σᵢ uᵢ vᵢ^T.

This is called the dyadic decomposition of A; it decomposes the matrix A of rank r into a sum of r matrices of rank 1.
The 2-norm and the Frobenius norm of A can be easily computed from the SVD decomposition:

‖A‖₂ = sup_{x≠0} ‖Ax‖₂ / ‖x‖₂ = σ₁,

‖A‖_F = ( ∑_{i=1}^{m} ∑_{j=1}^{n} aᵢⱼ² )^{1/2} = ( σ₁² + · · · + σ_p² )^{1/2},   p = min{m, n}.

From the SVD decomposition of A it also follows that

A^T A = V Σ^T Σ V^T   and   A A^T = U Σ Σ^T U^T.

Thus σᵢ², i = 1, . . . , p, are the eigenvalues of the symmetric matrices A^T A and AA^T, and vᵢ and uᵢ are the corresponding eigenvectors.

Theorem 5.2. Let the SVD of A ∈ R^{m×n} be given by

A = UᵣΣᵣVᵣ^T = ∑_{i=1}^{r} σᵢ uᵢ vᵢ^T

with r = rank(A). If k < r and

Aₖ = ∑_{i=1}^{k} σᵢ uᵢ vᵢ^T,

then

min_{rank(D)=k} ‖A − D‖₂ = ‖A − Aₖ‖₂ = σₖ₊₁,

and

min_{rank(D)=k} ‖A − D‖_F = ‖A − Aₖ‖_F = ( ∑_{i=k+1}^{p} σᵢ² )^{1/2},   p = min{m, n}.
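A small MATLAB illustration of this best low-rank approximation property (the matrix and the rank k are arbitrary choices for the example):

A = randn(20,10);                         % some full-rank matrix
k = 3;                                    % target rank
[U,S,V] = svd(A);
Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';       % best rank-k approximation
[norm(A - Ak), S(k+1,k+1)]                % the two numbers agree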

Consider the LLS problem

min_x ‖Ax − b‖₂²

and let A = UΣV^T be the SVD of A ∈ R^{m×n}. Using the orthogonality of U and V we have

‖Ax − b‖₂² = ‖U^T (A V V^T x − b)‖₂² = ‖Σ z − U^T b‖₂²,   where z = V^T x,
           = ∑_{i=1}^{r} (σᵢ zᵢ − uᵢ^T b)² + ∑_{i=r+1}^{m} (uᵢ^T b)².

Thus,

min_x ‖Ax − b‖₂² = min_z [ ∑_{i=1}^{r} (σᵢ zᵢ − uᵢ^T b)² + ∑_{i=r+1}^{m} (uᵢ^T b)² ].

The solution is given by

zᵢ = uᵢ^T b / σᵢ,   i = 1, . . . , r,
zᵢ = arbitrary,     i = r + 1, . . . , n.

As a result,

min_x ‖Ax − b‖₂² = ∑_{i=r+1}^{m} (uᵢ^T b)².

Recall that z = V^T x. Since V is orthogonal, we find that

‖x‖₂ = ‖V V^T x‖₂ = ‖V^T x‖₂ = ‖z‖₂.

All solutions of the linear least squares problem are given by x = Vz with

zᵢ = uᵢ^T b / σᵢ,   i = 1, . . . , r,
zᵢ = arbitrary,     i = r + 1, . . . , n.

The minimum norm solution of the linear least squares problem is given by

x† = V z†,

where z† ∈ Rⁿ is the vector with entries

z†ᵢ = uᵢ^T b / σᵢ,   i = 1, . . . , r,
z†ᵢ = 0,             i = r + 1, . . . , n.

That is, the minimum norm solution is

x† = ∑_{i=1}^{r} (uᵢ^T b / σᵢ) vᵢ.

MATLAB code:
% compute the SVD:
[U,S,V] = svd(A);
s = diag(S);
n = size(A,2);
% determine the effective rank r of A using the singular values
r = 1;
while( r < n && s(r+1) >= max(size(A))*eps*s(1) )
    r = r+1;
end
d = U'*b;
x = V* ( [d(1:r)./s(1:r); zeros(n-r,1) ] );
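As a sanity check, MATLAB's pinv computes the same minimum norm solution via the SVD (it truncates singular values below a default tolerance, so the two answers agree when the same effective rank is used):

x_pinv = pinv(A)*b;       % minimum 2-norm least squares solution
norm(x - x_pinv)          % ~0 if both computations use the same rank cutoff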
Suppose that the data b are

b = b^ex + δb,

where δb represents the measurement error. The minimum norm solution of min_x ‖Ax − (b^ex + δb)‖₂² is

x† = ∑_{i=1}^{r} (uᵢ^T b / σᵢ) vᵢ = ∑_{i=1}^{r} ( uᵢ^T b^ex / σᵢ + uᵢ^T δb / σᵢ ) vᵢ.

If a singular value σᵢ is small, then uᵢ^T δb / σᵢ could be large, even if uᵢ^T δb is small. This shows that errors δb in the data can be magnified by small singular values σᵢ.
% Compute A
t = 10.^(0:-1:-10)';
A = [ ones(size(t)) t t.^2 t.^3 t.^4 t.^5];
% compute SVD of A
[U,S,V] = svd(A); sigma = diag(S);
% compute exact data
xex = ones(6,1); bex = A*xex;
for i = 1:10
    % data perturbation
    deltab = 10^(-i)*(0.5-rand(size(bex))).*bex;
    b = bex+deltab;
    % solution of perturbed linear least squares problem
    w = U'*b;
    x = V * (w(1:6) ./ sigma);
    errx(i) = norm(x - xex); errb(i) = norm(deltab);
end
loglog(errb,errx,'*');
ylabel('||x^{ex} - x||_2'); xlabel('||\delta b||_2')
The singular values of A in the above Matlab example are:

σ₁ ≈ 3.4,          σ₄ ≈ 7.2 · 10⁻⁴,
σ₂ ≈ 2.1,          σ₅ ≈ 6.6 · 10⁻⁷,
σ₃ ≈ 8.2 · 10⁻²,   σ₆ ≈ 5.5 · 10⁻¹¹.

[Figure: the error ‖x^ex − x‖₂ for different values of ‖δb‖₂ (loglog-scale).]

We see that small perturbations δ b in the measurements can lead to large errors in the solution x of the linear least
squares problem if the singular values of A are small.

5.4 Regularized Linear Least Squares

If σ₁/σᵣ ≫ 1, then it might be useful to consider the regularized linear least squares problem (Tikhonov regularization)

min_{x ∈ Rⁿ} ½ ‖Ax − b‖₂² + (λ/2) ‖x‖₂².     (5.2)
Here λ > 0 is the regularization parameter.
The regularization parameter λ > 0 is not known a-priori and has to be determined based on the problem data.
Observe that

min_x ½ ‖Ax − b‖₂² + (λ/2) ‖x‖₂² = min_x ½ ‖ [ A; √λ I ] x − [ b; 0 ] ‖₂².

For λ > 0, the matrix

[ A
  √λ I ] ∈ R^{(m+n)×n}

is always of full rank n. Hence, for λ > 0, the regularized linear least squares problem (5.2) has a unique solution. The normal equations corresponding to (5.2) are given by

[ A; √λ I ]^T [ A; √λ I ] x = (A^T A + λ I) x = A^T b = [ A; √λ I ]^T [ b; 0 ].
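In MATLAB the regularized problem can therefore be solved by applying backslash to the stacked matrix; a sketch, where lambda is a chosen regularization parameter:

n = size(A,2);
lambda = 1e-3;                                   % example value
xlambda = [A; sqrt(lambda)*eye(n)] \ [b; zeros(n,1)];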

Using the SVD decomposition A = UΣV^T, where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal matrices and Σ ∈ R^{m×n} is a 'diagonal' matrix with diagonal entries

σ₁ ≥ · · · ≥ σᵣ > σᵣ₊₁ = · · · = σ_{min{m,n}} = 0,

the normal equations for (5.2),

(A^T A + λ I) x_λ = A^T b,

can be written as

(V Σ^T U^T U Σ V^T + λ V V^T) x_λ = V Σ^T U^T b,

where we used U^T U = I and wrote λ I = λ V V^T.
Rearranging the terms, we obtain

V (Σ^T Σ + λ I) V^T x_λ = V Σ^T U^T b;

multiplying both sides by V^T from the left and setting z = V^T x_λ we get

(Σ^T Σ + λ I) z = Σ^T U^T b.

Thus, the normal equation

(A^T A + λ I) x_λ = A^T b

is equivalent to the diagonal system

(Σ^T Σ + λ I) z = Σ^T U^T b,

where z = V^T x_λ, which has the solution

zᵢ = σᵢ (uᵢ^T b) / (σᵢ² + λ),   i = 1, . . . , r,
zᵢ = 0,                        i = r + 1, . . . , n.

Since x_λ = Vz = ∑_{i=1}^{n} zᵢ vᵢ, the solution of the regularized linear least squares problem (5.2) is given by

x_λ = ∑_{i=1}^{r} ( σᵢ (uᵢ^T b) / (σᵢ² + λ) ) vᵢ.

Note that

lim_{λ→0} x_λ = lim_{λ→0} ∑_{i=1}^{r} ( σᵢ (uᵢ^T b) / (σᵢ² + λ) ) vᵢ = ∑_{i=1}^{r} ( uᵢ^T b / σᵢ ) vᵢ = x†,

i.e., the solution of the regularized least squares problem (5.2) converges to the minimum norm solution of the original least squares problem as λ goes to zero. The representation

x_λ = ∑_{i=1}^{r} ( σᵢ (uᵢ^T b) / (σᵢ² + λ) ) vᵢ

of the solution of the regularized LLS also reveals the regularizing property of adding the term (λ/2)‖x‖₂² to the (ordinary) least squares functional. We have that

σᵢ (uᵢ^T b) / (σᵢ² + λ) ≈ 0,             if 0 ≈ σᵢ ≪ λ,
σᵢ (uᵢ^T b) / (σᵢ² + λ) ≈ uᵢ^T b / σᵢ,   if σᵢ ≫ λ.

Hence, adding (λ/2)‖x‖₂² to the original least squares functional acts as a filter. Contributions from singular values which are large relative to the regularization parameter λ are left (almost) unchanged, whereas contributions from small singular values are (almost) eliminated. This raises an important question:
How to choose λ?
Suppose that the data are b = b^ex + δb. We want to compute the minimum norm solution of the original least squares problem with unperturbed data b^ex,

x^ex = ∑_{i=1}^{r} ( uᵢ^T b^ex / σᵢ ) vᵢ,

but we can only compute with b = b^ex + δb; we do not know b^ex. The solution of the regularized least squares problem is

x_λ = ∑_{i=1}^{r} ( σᵢ (uᵢ^T b^ex) / (σᵢ² + λ) + σᵢ (uᵢ^T δb) / (σᵢ² + λ) ) vᵢ.
We observed that

∑_{i=1}^{r} ( σᵢ (uᵢ^T b^ex) / (σᵢ² + λ) ) vᵢ → x^ex   as λ → 0.

On the other hand,

σᵢ (uᵢ^T δb) / (σᵢ² + λ) ≈ 0,              if 0 ≈ σᵢ ≪ λ,
σᵢ (uᵢ^T δb) / (σᵢ² + λ) ≈ uᵢ^T δb / σᵢ,   if σᵢ ≫ λ,

which suggests choosing λ sufficiently large to ensure that errors δb in the data are not magnified by small singular values.
Example 5.5 (Vandermonde matrix).
% Compute A
t = 10.^(0:-1:-10)';
A = [ ones(size(t)) t t.^2 t.^3 t.^4 t.^5];

% compute exact data
xex = ones(6,1); bex = A*xex;

% data perturbation of 0.1%
deltab = 0.001*(0.5-rand(size(bex))).*bex;
b = bex+deltab;

% compute SVD of A
[U,S,V] = svd(A); sigma = diag(S);
w = U'*b;

for i = 0:7 % solve regularized LLS for different lambda
    lambda(i+1) = 10^(-i);
    xlambda = V * (sigma.*w(1:6) ./ (sigma.^2 + lambda(i+1)));
    err(i+1) = norm(xlambda - xex);
end
loglog(lambda,err,'*'); ylabel('||x^{ex} - x_{\lambda}||_2'); xlabel('\lambda');

[Figure: the error ‖x^ex − x_λ‖₂ for different values of λ (loglog-scale).]

For this example λ ≈ 10⁻³ seems to be a good choice for the regularization parameter λ. However, we could only create this figure with knowledge of the desired solution x^ex.

So the question is, how can we determine a λ ≥ 0 so that ‖xex − xλ ‖2 is small without knowledge of xex . One
approach is the Morozov discrepancy principle.
Suppose b = bex + δ b. We do not know the perturbation δ b, but we assume that we know its size ‖δ b‖. Suppose
the unknown desired solution xex satisfies Axex = bex . Hence,

‖Axex − b‖ = ‖Axex − bex − δ b‖ = ‖δ b‖.

Since the exact solution satisfies ‖Axex − b‖ = ‖δ b‖ we want to find a regularization parameter λ ≥ 0 such that the
solution xλ of the regularized least squares problem satisfies

‖Axλ − b‖ = ‖δ b‖

This is Morozov’s discrepancy principle.


Morozov’s discrepancy principle: Find λ ≥ 0 such that

‖Axλ − b‖ = ‖δ b‖

To compute ‖Ax_λ − b‖ for given λ ≥ 0 we need to solve the regularized linear least squares problem

min_x ½ ‖Ax − b‖₂² + (λ/2) ‖x‖₂² = min_x ½ ‖ [ A; √λ I ] x − [ b; 0 ] ‖₂²

to get x_λ, and then we have to compute ‖Ax_λ − b‖.


Let f (λ ) = ‖Axλ − b‖ − ‖δ b‖. Finding λ ≥ 0 such that

f (λ ) = 0

is a root finding problem. We will discuss in the next section how to solve such problems. In this case f maps a scalar λ to a scalar

f(λ) = ‖Ax_λ − b‖ − ‖δb‖,

but each evaluation of f requires the solution of a regularized least squares problem and can be rather expensive, so one has to keep the number of function evaluations small.
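A minimal MATLAB sketch of this approach, assuming the size normdb = ‖δb‖ of the data error is known; fzero is used here as a black-box scalar root finder, and any of the methods of the next section could be substituted:

n = size(A,2);
normdb = norm(deltab);                        % assumed known size of the data error
res = @(lam) norm(A*([A; sqrt(lam)*eye(n)] \ [b; zeros(n,1)]) - b);
f = @(lam) res(lam) - normdb;                 % Morozov discrepancy function
lambda_star = fzero(f, [1e-14, 1e2]);         % the bracket is problem dependent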

6 Numerical Solution of Nonlinear Equations in R1

Goal: Given a function f : R → R we want to find x* such that f (x* ) = 0


Definition 6.1. A point x* with f (x* ) = 0 is called a root of f or zero of f .

6.1 Bisection Method

The bisection method is based on the following version of the Intermediate Value Theorem: If f : R → R is a continuous function and a, b ∈ R, a < b, with f(a) f(b) < 0, then there exists x ∈ [a, b] such that f(x) = 0.
Thus, given [aₖ, bₖ] with f(aₖ) f(bₖ) < 0 (i.e., f(aₖ) and f(bₖ) have opposite signs), we compute the interval midpoint cₖ = ½(aₖ + bₖ) and evaluate f(cₖ). If f(cₖ) and f(aₖ) have opposite signs, then [aₖ, cₖ] contains a root of f, and we set aₖ₊₁ = aₖ and bₖ₊₁ = cₖ. Otherwise f(cₖ) and f(bₖ) must have opposite signs; in this case [cₖ, bₖ] contains a root of f and we set aₖ₊₁ = cₖ and bₖ₊₁ = bₖ.
The algorithm is rather straightforward.
Input: Initial values a(0); b(0) such that f(a(0))f(b(0)) < 0
and a tolerance tolx
Output: approximation of a root of f

for k = 0, 1,... do
if b(k)-a(k) < tolx,
return c(k) = (a(k) + b(k))/2 as an approximate root of f and stop
end

Compute c(k) = (a(k) + b(k))/2 and f(c(k)).


if f(c(k)) = 0, return x = c(k) and stop, end

if f(a(k))f(c(k)) < 0 , then


a(k+1) = a(k);
b(k+1) = c(k);
else
a(k+1) = c(k);
b(k+1) = b(k);
end
end
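A runnable MATLAB version of this algorithm might look as follows (the function name and interface are our own choices); for instance, bisection(@(x) x.^3-2*x-5, 0, 4, 1e-5) approximates the root of the function used in Example 6.1 below.

function c = bisection(f, a, b, tolx)
% BISECTION  Approximate a root of f in [a,b], assuming f(a)*f(b) < 0.
fa = f(a);
while (b - a) >= tolx
    c  = (a + b)/2;
    fc = f(c);
    if fc == 0
        return                 % hit the root exactly
    elseif fa*fc < 0           % root lies in [a, c]
        b = c;
    else                       % root lies in [c, b]
        a = c; fa = fc;
    end
end
c = (a + b)/2;
end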

6.1.1 Convergence of the Bisection Method

In each step the interval [aₖ, bₖ] is halved. Hence

|aₖ₊₁ − bₖ₊₁| = ½ |aₖ − bₖ| = (½)² |aₖ₋₁ − bₖ₋₁| = · · · = (½)^{k+1} |a₀ − b₀|.

There is a root x* of f such that the midpoints satisfy

|x* − cₖ| ≤ 2^{−(k+1)} |b₀ − a₀|.

In particular, lim_{k→∞} cₖ = x*. By ⌈z⌉ we denote the smallest integer greater than or equal to z. After

k = ⌈ log₂( |b₀ − a₀| / tolx ) ⌉ − 1

iterations we are guaranteed to have an approximation cₖ of a root x* of f that satisfies |x* − cₖ| ≤ tolx.
The following list gives "pros" and "cons" of the bisection method.

Pros:
∙ The method is very simple.
∙ The Bisection method requires only function values. In fact, it only requires the sign of function values (that means as long as the sign is correct, the function values can be inaccurate).
∙ The method is very robust: the Bisection method converges for any [a₀, b₀] that contains a root of f (the Bisection method converges globally).

Cons:
∙ Hard to generalize to higher dimensions.
∙ Because the Bisection method only uses the sign of function values, in general it will not find the root of even the simple affine linear function f(x) = ax + b in a finite number of iterations.
∙ The local convergence behavior of the Bisection method is rather slow (the error is only reduced by a factor of 2 per step, no matter how close we are to the solution).

6.2 Regula Falsi

One of the problems with the bisection method is that the midpoint may have nothing to do with the actual root. One possible improvement of the bisection method is to use the function values at the end points (if available) more efficiently.
Thus, given an initial interval [a₀, b₀] such that f(a₀) f(b₀) < 0 (i.e., [a₀, b₀] contains a root of f), Regula Falsi constructs an affine linear function m(x) = αx + β such that m(a₀) = f(a₀) and m(b₀) = f(b₀). In the notation of section 2, m(x) = P(f | a₀, b₀)(x), i.e. m interpolates f at a₀ and at b₀. It is given by

m(x) = f(a₀) + ( f(b₀) − f(a₀) ) (x − a₀)/(b₀ − a₀).

Instead of taking c₀ as the midpoint, we choose c₀ such that m(c₀) = 0. Solving for c₀ we obtain

c₀ = a₀ − f(a₀) (b₀ − a₀) / ( f(b₀) − f(a₀) ).

We use c₀ as an approximation of the root of f. It satisfies c₀ ∈ (a₀, b₀). If f(c₀) and f(a₀) have opposite signs, then we set a₁ = a₀ and b₁ = c₀. Otherwise f(c₀) and f(b₀) must have opposite signs and we set a₁ = c₀ and b₁ = b₀. Then we proceed similarly to the bisection method. This gives us the following algorithm.
Input: Initial values a(0); b(0) such that f(a(0))*f(b(0)) < 0,
a maximum number of iterations maxit, a tolerance tolf and a tolerance tolx

Output: approximation x of the root

1 For k = 0,1,...,maxit do
2 If b(k) - a(k) < tolx, then return x = c(k) and stop
3 Compute c(k) = a(k) - f(a(k))*(b(k) - a(k))/(f(b(k)) - f(a(k)))
and f(c(k)).
4 If |f(c(k))| < tolf , then return x = c(k) and stop
5 If f(a(k))f(c(k)) < 0 , then
6 a(k+1) = a(k); b(k+1) = c(k),
7 else
8 a(k+1) = c(k); b(k+1) = b(k).
9 End
10 End
However, the following simple numerical example shows that computationally we often cannot see any advantage of Regula Falsi over the bisection method; a comparison sketch is given below.
Example 6.1. Take f(x) = x³ − 2x − 5 on the interval [0, 4], with tolerance tol = 10⁻⁵.
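A MATLAB sketch of Regula Falsi for this example, mirroring the bisection code above (the iteration cap is an arbitrary safeguard):

f = @(x) x.^3 - 2*x - 5;
a = 0; b = 4; tol = 1e-5;
fa = f(a); fb = f(b);
for k = 1:200
    c  = a - fa*(b - a)/(fb - fa);   % zero of the line through (a,f(a)), (b,f(b))
    fc = f(c);
    if abs(fc) < tol, break, end
    if fa*fc < 0
        b = c; fb = fc;
    else
        a = c; fa = fc;
    end
end
c                                    % approximate root (about 2.0946)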

6.3 Newton’s Method

The idea of Newton's method is similar to Regula Falsi: approximate f(x) with a linear function. This time, for a given approximation x₀ of a root x* of f, we select

m(x) = f(x₀) + f′(x₀)(x − x₀).

Comparing with the Taylor approximation of f around x₀,

f(x*) = f(x₀ + (x* − x₀)) ≈ f(x₀) + f′(x₀)(x* − x₀),

we see that we use the tangent of f at x₀ as a model for f. The root

x₁ = x₀ − f(x₀)/f′(x₀)

of the tangent model is used as an approximation of the root of f. We can write the previous identity as

s₀ = − f(x₀)/f′(x₀)    (step, or correction),
x₁ = x₀ + s₀.

Input: Initial values x(0), tolerance tol,


maximum number of iterations maxit

Output: approximation of the root

1 For k = 0:maxit do
2 Compute s(k) = -f(x(k))/f’(x(k)).
3 Compute x(k+1) = x(k) + s(k).
4 Check for termination (e.g., |f(x(k+1))| < tol or |s(k)| < tol)
5 End
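A MATLAB sketch of Newton's method for the same test function f(x) = x³ − 2x − 5, with the derivative supplied by hand:

f  = @(x) x.^3 - 2*x - 5;
fp = @(x) 3*x.^2 - 2;
x = 2;                        % initial guess
for k = 1:20
    s = -f(x)/fp(x);          % Newton step (correction)
    x = x + s;
    if abs(s) < 1e-12, break, end
end
x                             % about 2.0946 after a handful of iterations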

6.3.1 Convergence of Sequences.

Let {xk } be a sequence of real numbers.


Definition 6.2 (Convergence rates).
1. The sequence is called linearly convergent if there exists c ∈ (0, 1) and k̂ ∈ N such that

|xk+1 − x* | ≤ c|xk − x* | for all k ≥ k̂.

2. The sequence is called superlinearly convergent if there exists a sequence {ck } with ck > 0 and limk→∞ ck = 0
such that
|xk+1 − x* | ≤ ck |xk − x* |
or, equivalently, if
|xk+1 − x* |
lim = 0.
k→∞ |xk − x* |

3. The sequence is called quadratically convergent to x* if limk→∞ xk = x* and if there exists c > 0 and k̂ ∈ N such
that
|xk+1 − x* | ≤ c|xk − x* |2 for all k ≥ k̂.
Thus, using the above definition, we can see that the bisection method is linearly convergent with c = 1/2. On the other hand, as we will see, Newton's method is quadratically convergent under some conditions.

6.3.2 Convergence of Newton’s Method.

Theorem 6.1. Let D ⊂ R be an open interval and let f : D → R be differentiable on D. Furthermore, let f′(x) be Lipschitz continuous with Lipschitz constant L. If x* ∈ D is a root and if f′(x*) ≠ 0, then there exists an ε > 0 such that Newton's method with starting point x₀ with |x₀ − x*| < ε generates iterates xₖ which converge to x*,

lim_{k→∞} xₖ = x*,

and which obey

|xₖ₊₁ − x*| ≤ ( L / |f′(x*)| ) |xₖ − x*|²   for all k ∈ N.
Before proving the above theorem, we give "pros" and "cons" of Newton's method.

Pros:
∙ The method is very simple.
∙ Fast convergence.
∙ Can be easily generalized to Rⁿ.
∙ Can be easily modified.

Cons:
∙ Requires derivatives of the function.
∙ Only local convergence; requires a good initial guess.

We recall the following definition.
Definition 6.3 (Lipschitz continuity). The function f : [a, b] → R is said to be Lipschitz continuous if there exists
L > 0 such that
| f (x) − f (y)| ≤ L|x − y|, ∀x, y ∈ [a, b].
The constant L is called the Lipschitz constant.
For given x, y ∈ [a, b], consider the function

φ(t) = f(y + t(x − y)).

We can check that φ(0) = f(y) and φ(1) = f(x). Thus, by the Fundamental Theorem of Calculus and the chain rule,

f(x) − f(y) = φ(1) − φ(0) = ∫₀¹ φ′(t) dt = ∫₀¹ f′(y + t(x − y)) dt (x − y).     (6.1)

Taking y = x₀, we have

f(x) − f(x₀) − f′(x₀)(x − x₀) = ∫₀¹ f′(x₀ + t(x − x₀))(x − x₀) dt − f′(x₀)(x − x₀)
                              = ∫₀¹ [ f′(x₀ + t(x − x₀)) − f′(x₀) ] (x − x₀) dt.

Thus, if f′ is Lipschitz continuous with Lipschitz constant L,

|f(x) − f(x₀) − f′(x₀)(x − x₀)| ≤ ∫₀¹ | f′(x₀ + t(x − x₀)) − f′(x₀) | dt |x − x₀| ≤ (L/2) |x − x₀|².     (6.2)
Using the above inequality we observe for the first Newton step that

|x₁ − x*| = | x₀ − f(x₀)/f′(x₀) − x* | = (1/|f′(x₀)|) | f(x*) − f(x₀) − f′(x₀)(x* − x₀) | ≤ ( L / (2|f′(x₀)|) ) |x* − x₀|².

This strongly supports our previous claim that Newton's method converges quadratically; we only need to establish that f′(xₖ) stays away from zero, provided f′(x*) ≠ 0 and x₀ is sufficiently close to x*.

6.4 Secant method

Recall that

f′(x) = lim_{h→0} ( f(x + h) − f(x) ) / h.

Thus, if the derivatives of the function are not available, in Newton's method we can replace f′(xₖ) by the finite difference

( f(xₖ + hₖ) − f(xₖ) ) / hₖ

for some hₖ ≠ 0. It is natural to use hₖ = xₖ₋₁ − xₖ; then

( f(xₖ + hₖ) − f(xₖ) ) / hₖ = ( f(xₖ₋₁) − f(xₖ) ) / ( xₖ₋₁ − xₖ ).

This gives us the Secant method

xₖ₊₁ = xₖ − f(xₖ) ( xₖ₋₁ − xₖ ) / ( f(xₖ₋₁) − f(xₖ) ).     (6.3)
Input: Initial values x(0) and x(1), tolerance tol,
maximum number of iterations maxit

Output: approximation of the root

1 For k = 1:maxit do
2 Compute s(k) = -f(x(k))*(x(k-1) - x(k))/(f(x(k-1)) - f(x(k))).
3 Compute x(k+1) = x(k) + s(k).
4 Check for termination
5 End
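A MATLAB sketch of the Secant method, again for f(x) = x³ − 2x − 5:

f = @(x) x.^3 - 2*x - 5;
xprev = 2; x = 3;             % two starting points
for k = 1:50
    s = -f(x)*(xprev - x)/(f(xprev) - f(x));
    xprev = x;
    x = x + s;
    if abs(s) < 1e-12, break, end
end
x                             % about 2.0946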

We will see that the Secant method has very interesting properties. Recalling the divided difference formula (2.8),

f[xⱼ, . . . , xₖ] = ( f[xⱼ₊₁, . . . , xₖ] − f[xⱼ, . . . , xₖ₋₁] ) / ( xₖ − xⱼ ),     (6.4)

we can rewrite (6.1) as

f[x, y] = ∫₀¹ f′(y + t(x − y)) dt.     (6.5)
We will also require the following estimate.

Lemma 6.1. Let f : D → R be differentiable on D and let furthermore f′(x) be Lipschitz continuous with Lipschitz constant L. Then for any distinct x, y, z ∈ D,

| f[x, y, z] | ≤ L/2.

Proof. From (6.4) and (6.5), and using that f′ is Lipschitz continuous, we obtain

f[x, y, z] = ( f[y, z] − f[x, y] ) / ( z − x ) = (1/(z − x)) ∫₀¹ [ f′(y + t(z − y)) − f′(y + t(x − y)) ] dt,

so that

| f[x, y, z] | ≤ (1/|z − x|) ∫₀¹ L t |z − x| dt = L/2.

Now we turn back to the Secant method (6.3). Using the divided difference notation, we can rewrite it as

xₖ₊₁ = xₖ − f(xₖ) / f[xₖ₋₁, xₖ].     (6.6)

Using that x* is a root, i.e. f(x*) = 0, we have


xₖ₊₁ − x* = xₖ − x* − f(xₖ)/f[xₖ₋₁, xₖ]
          = (xₖ − x*) ( 1 − f(xₖ) / ( (xₖ − x*) f[xₖ₋₁, xₖ] ) )
          = ( (xₖ − x*) / f[xₖ₋₁, xₖ] ) ( f[xₖ₋₁, xₖ] − ( f(xₖ) − f(x*) ) / (xₖ − x*) )
          = ( (xₖ − x*) / f[xₖ₋₁, xₖ] ) ( f[xₖ₋₁, xₖ] − f[xₖ, x*] )
          = ( (xₖ − x*)(xₖ₋₁ − x*) / f[xₖ₋₁, xₖ] ) ( f[xₖ₋₁, xₖ] − f[xₖ, x*] ) / ( xₖ₋₁ − x* )
          = (xₖ − x*)(xₖ₋₁ − x*) f[xₖ₋₁, xₖ, x*] / f[xₖ₋₁, xₖ].     (6.7)

Similarly to Theorem 6.1, we now show:

Theorem 6.2. Let D ⊂ R be an open interval and let f : D → R be differentiable on D. Furthermore, let f′(x) be Lipschitz continuous with Lipschitz constant L, and let x* ∈ D be a root. In addition, assume

min_{x ∈ D} |f′(x)| = m > 0.

Let ε > 0 be such that

q := (L/(2m)) ε < 1.

Then the Secant method with starting points x₀, x₁ ∈ B_ε(x*) generates iterates xₖ ∈ B_ε(x*) which converge to x*,

lim_{k→∞} xₖ = x*,

and which obey

|xₖ₊₁ − x*| ≤ q^{λₖ}   for all k ∈ N,

where λₖ are the elements of the Fibonacci sequence

λₖ₊₁ = λₖ + λₖ₋₁,

i.e.

λₖ = (1/√5) ( (1+√5)/2 )^k − (1/√5) ( (1−√5)/2 )^k ∼ (1/√5) ( (1+√5)/2 )^k,   as k → ∞.

Proof. Using the Mean Value Theorem and the assumptions of the theorem, we obtain

| f[xₖ₋₁, xₖ] | = | f′(ξₖ) | ≥ m,   for some ξₖ ∈ (xₖ₋₁, xₖ).

Together with Lemma 6.1, from (6.7) we obtain

|xₖ₊₁ − x*| ≤ (L/(2m)) |xₖ − x*| |xₖ₋₁ − x*|.     (6.8)

Thus, if xₖ₋₁, xₖ ∈ B_ε(x*), then from the above estimate, and since (L/(2m)) ε = q < 1, we see that

|xₖ₊₁ − x*| ≤ (L/(2m)) ε ε < ε,

i.e. xₖ₊₁ ∈ B_ε(x*), and as a result xₖ ∈ B_ε(x*) for all k if x₀, x₁ ∈ B_ε(x*).

Set eₖ = (L/(2m)) |xₖ − x*|. Then (6.8) can be rewritten as

eₖ₊₁ ≤ eₖ eₖ₋₁,   k = 1, 2, . . . .

Since e₀ ≤ q and e₁ ≤ q, we have eₖ ≤ q^{λₖ} for k = 1, 2, . . . . Since q < 1 and λₖ → ∞ as k → ∞, we have eₖ → 0 as k → ∞. This concludes the proof of the theorem.

We can also obtain an a posteriori error estimate.

Corollary 6.1. Under the assumptions of Theorem 6.2, we have

|xₖ − x*| ≤ (1/m) | f(xₖ) | ≤ (L/(2m)) |xₖ − xₖ₋₁| |xₖ − xₖ₋₂|.
Proof. Using the Mean Value Theorem and the lower bound for the derivative of the function, we obtain

|xₖ − x*| ≤ (1/m) | f(xₖ) − f(x*) | = (1/m) | f(xₖ) | = (1/m) | f(xₖ₋₁) + (xₖ − xₖ₋₁) ( f(xₖ) − f(xₖ₋₁) ) / ( xₖ − xₖ₋₁ ) |.

From the definition of the Secant method,

xₖ = xₖ₋₁ − f(xₖ₋₁)(xₖ₋₁ − xₖ₋₂) / ( f(xₖ₋₁) − f(xₖ₋₂) ),

we have

f(xₖ₋₁) = −(xₖ − xₖ₋₁) ( f(xₖ₋₁) − f(xₖ₋₂) ) / ( xₖ₋₁ − xₖ₋₂ ).

Combining this with the above estimate and using Lemma 6.1, we obtain

|xₖ − x*| ≤ ( |xₖ − xₖ₋₁| / m ) | ( f(xₖ) − f(xₖ₋₁) ) / ( xₖ − xₖ₋₁ ) − ( f(xₖ₋₁) − f(xₖ₋₂) ) / ( xₖ₋₁ − xₖ₋₂ ) |
          = ( |xₖ − xₖ₋₁| |xₖ − xₖ₋₂| / m ) | f[xₖ₋₂, xₖ₋₁, xₖ] |
          ≤ (L/(2m)) |xₖ − xₖ₋₁| |xₖ − xₖ₋₂|.

