
Numerical Analysis

Jean C. Cortissoz I.

May 10, 2019


Chapter 1

1.1 Bisection Method


This method has the advantage of being simple, and it always works. To start it, we must locate an interval [a0, b0] that contains a root of the given function. This is expressed by the following assumption:

f (a0 ) f (b0 ) < 0.

Once we have this, we can start our procedure. Define

r1 = (a0 + b0)/2,

and then check which of the following inequalities holds:
• If
f (a0) f (r1) < 0,
define a1 = a0 and b1 = r1.
• If the above inequality does not hold, then verify that

f (r1) f (b0) < 0,

and then define a1 = r1 and b1 = b0.


Notice that if we obtain f (a0) f (r1) = 0, it means that a root of the function has been located, and we can stop the procedure. We now proceed inductively. Once we have defined an interval [an, bn] and a sequence {rk}k=1,2,3,...,n, we define

rn+1 = (an + bn)/2,

and then define a new interval in the obvious way, i.e.,
• If
f (an) f (rn+1) < 0,
define an+1 = an and bn+1 = rn+1.
• If the above inequality does not hold, then verify that

f (rn+1) f (bn) < 0,

and then define an+1 = rn+1 and bn+1 = bn.


We have thus produced a sequence of approximations for a root of f , namely


(rk )k=1,2,3,... .
When we study numerical methods, it will always be important to know when the required accuracy has been reached, and hence we must derive estimates of how far the approximate answer is from the right answer. We do this for the bisection method.

Lemma 1. Let [a, b] be an interval where a single root r of the continuous function f lies. Then

|rn − r| ≤ (b − a)/2^n.

Proof. Just notice that in the process, assuming it never stops, the root r is always located inside the interval [an, bn] for all n. Therefore, since rn is either an or bn, we have

|rn − r| ≤ |bn − an|.

On the other hand, by the construction itself, bn+1 − an+1 = (1/2)(bn − an), and this shows the assertion of the lemma.

The previous lemma makes the bisection method a useful tool. Indeed, assume you want to compute a root r ∈ [a, b] of a continuous function f with a precision given by ε > 0, that is, you want to produce an r∗ such that |r∗ − r| < ε. Then choose a positive integer N such that

(b − a)/2^N < ε,

and apply the bisection procedure N times: your desired r∗ is just rN. Notice that we can actually compute N from the initial data: N = ⌈log2 ((b − a)/ε)⌉.
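The whole procedure can be sketched as a short program. A minimal Python sketch (the function name `bisect` and the argument names are our own choices, with the number of steps chosen via Lemma 1):

```python
import math

def bisect(f, a, b, eps):
    """Bisection: halve [a, b] until the last midpoint is within eps of the root."""
    if f(a) * f(b) >= 0:
        raise ValueError("need f(a) f(b) < 0")
    # by Lemma 1, |r_n - r| <= (b - a)/2^n, so N = ceil(log2((b - a)/eps)) steps suffice
    n = math.ceil(math.log2((b - a) / eps))
    r = (a + b) / 2
    for _ in range(n):
        r = (a + b) / 2
        if f(a) * f(r) < 0:
            b = r          # the root lies in [a, r]
        else:
            a = r          # the root lies in [r, b] (or f(r) == 0)
    return r

# the positive root of x^2 - 2, located in [1, 2]
root = bisect(lambda x: x * x - 2, 1.0, 2.0, 1e-6)
```

With eps = 10⁻⁶ this performs N = 20 halvings, exactly the N computed from the formula above.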


1.2 Fixed Point Method


First we shall discuss this method in the abstract, and then we shall apply it to find roots of continuous functions. So let X be a complete metric space (in our case, think of R or a closed interval), and let f : X −→ X be a continuous function. Recall that a fixed point of f is an x∗ ∈ X such that f (x∗) = x∗. We have the following definition.

Definition 1. We say that f is a contraction if there is an α < 1 such that for all x1, x2 ∈ X,

dX (f (x1), f (x2)) ≤ α dX (x1, x2).

The following classical fact is due to Banach.

Theorem 1. Let f : X −→ X be a contraction. Then f has a unique fixed point x∗.

Proof. Construct the following sequence. Let x0 ∈ X be arbitrary, and define

xn+1 = f (xn ) .

We will show that the sequence thus constructed is a Cauchy sequence, and
given the completeness of X, it is convergent. Notice first that

|xn+1 − xn| = |f (xn) − f (xn−1)| ≤ α |xn − xn−1|.

This actually shows that

|xm+1 − xm| ≤ α^m |x1 − x0|.

Therefore we have that


|xm+k − xm| ≤ ∑_{j=m}^{m+k−1} |xj+1 − xj| ≤ |x1 − x0| ∑_{j=m}^{m+k−1} α^j ≤ |x1 − x0| ∑_{j=m}^{∞} α^j = |x1 − x0| α^m/(1 − α).

From this follows that the sequence (xk )k=1,2,3,... is a Cauchy sequence, and
hence convergent. Let x∗ = limn→∞ xn . Using the continuity of f , we have
that
x∗ = lim_{n→∞} xn+1 = lim_{n→∞} f (xn) = f (x∗),

so x∗ is a fixed point of f .
For the uniqueness part, assume that x∗ and x∗∗ are both fixed points of f .
Then we have that

|x∗ − x∗∗ | = |f (x∗ ) − f (x∗∗ )| ≤ α |x∗ − x∗∗ | ,

and this inequality is only possible if |x∗ − x∗∗ | = 0, i.e., x∗ = x∗∗ .


Notice that the method of proof employed in the previous theorem gives us an estimate for |xn − x∗|. Indeed, if we take k → ∞ in the inequality

|xn+k − xn| ≤ (α^n/(1 − α)) |x1 − x0|,

we see that

|xn − x∗| ≤ (α^n/(1 − α)) |x1 − x0|.
There is a beautiful proof of Banach's fixed point theorem due to Richard Palais; it avoids having to sum a geometric series. So if f is a contraction, an application of the triangle inequality twice (here we use the notation for metric spaces) gives

d (x1 , x2 ) ≤ d (x1 , f (x1 )) + d (f (x1 ) , f (x2 )) + d (f (x2 ) , x2 ) ,

and using that d (f (x1 ) , f (x2 )) ≤ αd (x1 , x2 ) yields

(1 − α) d (x1 , x2 ) ≤ d (x1 , f (x1 )) + d (f (x2 ) , x2 ) .



As a consequence of the previous inequality we have

d (f^n(x), f^m(x)) ≤ (1/(1 − α)) (d (f^n(x), f^n(f (x))) + d (f^m(x), f^m(f (x))))
≤ ((α^n + α^m)/(1 − α)) d (x, f (x)).

This shows that the sequence xn = f^n(x) is a Cauchy sequence! The sought fixed point is x∗ = lim_{n→∞} xn (which exists as long as the metric space we are working in is complete).

1.2.1 Application of the Fixed Point Theorem to finding


roots: The Fixed Point Method
To apply the FPT, it will be nice to have a simple way of checking when a map
is a contraction. We have the following result.
Lemma 2. Let f : [a, b] −→ [a, b] be a differentiable function. Assume that there is an α < 1 such that |f′(ξ)| ≤ α for all ξ ∈ [a, b]. Then f is a contraction.
Proof. The result follows from the mean value theorem applied on any subinterval [x, y] of [a, b]: just notice that for some ξ ∈ (x, y),

|f (x) − f (y)| = |f′(ξ)| |x − y| ≤ α |x − y|.

We are now ready for the applications. Say we want to find a root r of a
differentiable function f . Consider the function

g (x) = x − f (x) ,

then a fixed point of g will be a root of f. So we want g to be a contraction. This will happen if, for instance, 0 < β ≤ f′ ≤ α < 2 (indeed, then g′ = 1 − f′ satisfies |g′| ≤ max{1 − β, α − 1} < 1). So let us work with a simple example: let us find the positive root of f (x) = x² − 2. You may know the answer (√2), but this is just a symbol; for us the only real numbers are the rational numbers, so by finding a root of f we mean finding an approximation to √2 by rationals. So let us proceed:
Notice that we have f (1) = −1 and f (2) = 2, and hence a root is located in [1, 2]. But now notice that the Fixed Point Method will not work if we use

g (x) = x − (x² − 2),

since f′(x) = 2x can be as large as 4. But not all is lost, because we can define

g (x) = x − (1/4)(x² − 2),

and a fixed point of this new g will be a root of f. Also, if we compute g′(x) = 1 − x/2, we have the stars aligned in our favor, since it is easy to see that |g′(x)| ≤ 1/2 on [1, 2]. All that is left to do to make sure that the Fixed Point Method can be applied is to check that g [1, 2] ⊆ [1, 2]. This is left as an exercise; as a hint, notice that g is an increasing function.

Using the Fixed Point Method we obtain the following successive approximations for √2, starting with x0 = 2: 2, 3/2, 23/16.
The question you should be asking yourself, dear reader, is how good these approximations are. Say, how close is 23/16 to the right answer? To estimate how far 23/16 is from the right answer, notice that this number corresponds to x2 and that α = 1/4, and hence

|√2 − 23/16| ≤ (1/4²) · (4/3) · (1/2) = 1/24.
Sometimes, when the Fixed Point Method cannot be directly applied, by massaging the original equation to be solved a Fixed Point Iteration scheme can be produced. Let us give an example again. We start from

x² − 2 = 0,

which is the equation that we want to solve. Then we can rewrite this equation successively as follows:

x² − 1 = 1,   (x − 1)(x + 1) = 1,

and finally

x = 1 + 1/(1 + x).

So in this case g (x) = 1 + 1/(1 + x). Again we must verify that g [1, 2] ⊆ [1, 2], and now notice that

|g′(x)| = 1/(1 + x)² ≤ 1/4 on [1, 2].
If we start with x0 = 1 we generate the following sequence: 3/2, 7/5, 17/12, 41/29, 99/70, etc. Of course, we can also estimate how far our approximations are from √2. Employing the formula from the proof of the Fixed Point Theorem, we see that

|√2 − 99/70| ≤ (1/4⁵) · (1/(1 − 1/4)) · (1/2) = 1/1536.
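The iteration just described is a one-line loop. A minimal Python sketch (the function name `fixed_point` is our own):

```python
def fixed_point(g, x0, n):
    """Iterate x_{k+1} = g(x_k) starting from the seed x0; return x_n."""
    x = x0
    for _ in range(n):
        x = g(x)
    return x

# g(x) = 1 + 1/(1 + x) is a contraction on [1, 2] with alpha = 1/4
g = lambda x: 1 + 1 / (1 + x)
x5 = fixed_point(g, 1.0, 5)  # 99/70 in floating point
# a priori bound: |x5 - sqrt(2)| <= alpha^5/(1 - alpha) * |x1 - x0| = 1/1536
```

Running it reproduces the sequence 3/2, 7/5, 17/12, 41/29, 99/70 from the text.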
Problem. If N steps are needed in a fixed point algorithm to get an answer with a precision of m digits, how many steps are required to double such precision?
Problem. Using the second method described above to find √2, find an algorithm to compute the square root of any positive integer. You must give the seed to start the scheme, you must show that it does converge, and you must give an estimate of how good your approximation is after n iterations.
Problem. Start with the equation you want to solve,

f (x) = 0 on [a, b],

and assume a bound 0 < η ≤ f′ < M. Introduce the function

g (x) = x − f (x)/M.

Show that the Fixed Point Method converges to a root of f .


Exercise. Let f : [a, b] −→ R be a nondecreasing function which is also a
contraction. Assume that there exists x∗ ∈ [a, b] such that f (x∗ ) = x∗ . Show
that f [a, b] ⊂ [a, b].

1.3 Newton’s Method


We shall introduce Newton's method directly as the following iteration scheme:

xn+1 = g (xn),

where

g (x) = x − f (x)/f′(x).

It is then easy to see that a fixed point of g is a root of f. Let us find out when g is a contraction. Taking a derivative,

g′(x) = f (x) f″(x)/[f′(x)]².
Assume now that we have bounds |f″(x)| ≤ M and |f′(x)| ≥ η > 0 on [a, b], an interval in whose interior we are sure to find a root of f, which we shall call r. By the continuity of f we can find a δ > 0 (more on how to find this δ > 0 later) such that if x ∈ [r − δ, r + δ] then |f (x)| ≤ ε, where ε is such that

εM/η² =: α < 1;

then we know that g, when restricted to [r − δ, r + δ], is a contraction (please verify that g [r − δ, r + δ] ⊂ [r − δ, r + δ]). Hence, when x0 ∈ [r − δ, r + δ], Newton's method will find a root of f.
The question arises now: how do we estimate δ? Well, assume that we have a bound |f′(x)| ≤ K; then δ = ε/K works, since |f (x)| = |f (x) − f (r)| ≤ K |x − r| ≤ Kδ = ε. So now we are ready to locate a seed to start the method. To do so, just use the bisection method until you have located r within an interval of length at most δ: any point in that interval will be a suitable seed. A new question may arise: why not go all the way with the Bisection Method? For the answer, go to the section just below.

1.3.1 Convergence speed.


Newton's method converges quite fast. We first recall the following result from elementary calculus:

|f (x) − f (y) − f′(x)(x − y)| ≤ (M/2)(x − y)²,   (1.1)

where M is such that |f″(ξ)| ≤ M for all ξ ∈ [x, y]. So if f (r) = 0, we have the following estimates:

|xn+1 − r| = |xn − f (xn)/f′(xn) − r|
= |f′(xn)|⁻¹ |f (r) − f (xn) − f′(xn)(r − xn)|
≤ (M/2) |f′(xn)|⁻¹ (xn − r)²,

where M is a bound on the second derivative of f. Assume we have a lower bound on |f′(x)|, say η > 0. Then we arrive at an estimate

|xn+1 − r| ≤ (M/(2η)) |xn − r|².

Let us call β = M/(2η), and let ρ = |x0 − r|; then we have

|xn − r| ≤ (βρ)^(2^n) β⁻¹,   (1.2)

and the right hand quantity converges to 0 as long as βρ < 1.


Now we are ready to show via an example how fast Newton's method converges. Again, we shall compute the positive root of x² − 2. The iteration scheme we obtain is

xn+1 = xn − (xn² − 2)/(2xn) = xn/2 + 1/xn.

We take as a seed x0 = 1, and then we get the sequence 3/2, 17/12, 577/408, 665857/470832, etc. Notice that in this example β = 1 and we can use ρ = 1/2. Hence from (1.2) we have that

|xn − r| ≤ 1/2^(2^n),

and if we take n = 4, then the approximation x4 = 665857/470832 is quite close to √2, the difference between the approximation and the exact value being at most 1/2^16!!!
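Since the iterates are rational, the sequence above can be reproduced exactly. A minimal Python sketch using exact rational arithmetic (the function name is our own):

```python
from fractions import Fraction

def newton_sqrt2(n):
    """Newton iteration x_{k+1} = x_k/2 + 1/x_k for the root of x^2 - 2,
    carried out in exact rational arithmetic, starting from the seed x0 = 1."""
    x = Fraction(1)
    for _ in range(n):
        x = x / 2 + 1 / x
    return x

x4 = newton_sqrt2(4)   # 665857/470832, within 1/2**16 of sqrt(2) by (1.2)
```

The number of correct digits roughly doubles at every step, as estimate (1.2) predicts.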
Exercise. Prove estimate (1.1).
Exercise. Let f : R −→ R be such that f″ > 0. Call x∗ the largest root of f, and assume also that f′(x∗) > 0. If you take as a seed x0 > x∗, does Newton's method converge?
Exercise. Using Newton's method, write an algorithm to compute the roots of

f (x) = x² + x + 1.

Compute the convergence speed of the method.
Chapter 2

2.1 Lagrange interpolation


Given a set of data (xi, yi), i = 0, 1, 2, 3, . . . , n, with xi ≠ xj if i ≠ j, we want to produce a function f : R −→ R such that yi = f (xi). Since polynomials are somewhat the easiest family of functions to deal with (at least regarding differentiation and integration), we shall look for a polynomial with this property. It turns out that we only need a polynomial of degree at most n to fit all the data.
To construct the interpolating polynomial, we need a family of polynomials
with the following property
Li (xj ) = δij ,

where δij is Kronecker's delta. Then it is easy to construct an interpolant by defining

In f (x) = ∑_{i=0}^{n} yi Li (x).

We call the Li's Lagrange polynomials, and they are quite easy to construct. First define

Mi (x) = ∏_{k≠i} (x − xk),

and we are already more than halfway done. Indeed, notice that

Mi (xj) = 0 whenever j ≠ i,

whereas

Mi (xi) = ∏_{k≠i} (xi − xk) ≠ 0,

and hence the Li's can be defined as

Li (x) := Mi (x)/Mi (xi) = ∏_{k≠i} (x − xk)/(xi − xk).
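The construction translates directly into code. A minimal Python sketch (names are our own; the nested function evaluates Li pointwise rather than building polynomial coefficients):

```python
def lagrange_interpolant(xs, ys):
    """Return the function In f(x) = sum over i of y_i * L_i(x)."""
    def L(i, x):
        # L_i(x) = prod over k != i of (x - x_k)/(x_i - x_k), so L_i(x_j) = delta_ij
        p = 1.0
        for k, xk in enumerate(xs):
            if k != i:
                p *= (x - xk) / (xs[i] - xk)
        return p
    return lambda x: sum(y * L(i, x) for i, y in enumerate(ys))

# three points of the parabola y = x^2 + x + 1; the degree-2 interpolant recovers it
p = lagrange_interpolant([0.0, 1.0, 2.0], [1.0, 3.0, 7.0])
```

Since the interpolating polynomial of degree at most n through n + 1 points is unique, interpolating three points of a parabola returns that same parabola.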

It is now important to know how good the approximation given by the interpolating polynomial is. We have the following theorem.


Theorem 2. Let f : [a, b] −→ R be (n + 1)-times continuously differentiable. Then

f (x) − In f (x) = (f^(n+1)(ξ)/(n + 1)!) ∏_{j=0}^{n} (x − xj),

where ξ ∈ [a, b] depends on x.


Proof. (Following the argument in the book of Kress, page 157.) Set

qn+1 (x) = ∏_{j=0}^{n} (x − xj),

and for x ∉ {x0, . . . , xn} define the function

g (y) = f (y) − In f (y) − ((f (x) − In f (x))/qn+1 (x)) qn+1 (y).

Notice that g (y) = 0 for y = x, x0, x1, . . . , xn, i.e., it has n + 2 roots; therefore, by repeated application of Rolle's theorem, g^(n+1) has at least one root in [a, b]. Call this root ξ. Then, since In f has degree at most n and the (n + 1)-th derivative of qn+1 is (n + 1)!,

0 = g^(n+1)(ξ) = f^(n+1)(ξ) − ((f (x) − In f (x))/qn+1 (x)) (n + 1)!,

so we obtain

f (x) − In f (x) = (f^(n+1)(ξ)/(n + 1)!) qn+1 (x).

Exercise. Assume you are approximating the sine function on the interval [0, h] using (a) a linear polynomial with interpolation points x0 = 0 and x1 = h; (b) a quadratic polynomial with x0 = 0, x1 = h/2 and x2 = h. Find in both cases a bound on the error between the sine function and its interpolating polynomial. In case (b), is there an optimal way of choosing x1? (Think of minimizing the maximum possible error.)

2.2 Applications of Lagrange interpolation to integration
We shall use a Lagrange polynomial of order 2. Say then that we work on an interval [xi, xi+2]. We let

xi+1 = (xi + xi+2)/2,   h = xi+1 − xi.
The Lagrange polynomials are given by:

(x − xi+1)(x − xi+2)/((xi − xi+1)(xi − xi+2)) = (x − xi+1)(x − xi+2)/(2h²),

(x − xi)(x − xi+2)/((xi+1 − xi)(xi+1 − xi+2)) = −(x − xi)(x − xi+2)/h²,

(x − xi+1)(x − xi)/((xi+2 − xi+1)(xi+2 − xi)) = (x − xi+1)(x − xi)/(2h²),

so the interpolating polynomial is

P2 (x) = yi (x − xi+1)(x − xi+2)/(2h²) − yi+1 (x − xi)(x − xi+2)/h² + yi+2 (x − xi+1)(x − xi)/(2h²).

You are invited to use P2 to compute an approximation of ∫_{xi}^{xi+2} f (x) dx. We will try another way. To simplify a little, we will choose xi = 0, xi+1 = 1/2 and xi+2 = 1. The interpolating polynomial ax² + bx + c must satisfy

c = y0,

(1/4) a + (1/2) b = y1 − y0,

and

a + b = y2 − y0.

Solving this system we obtain

a = 2y2 + 2y0 − 4y1,

b = 4y1 − 3y0 − y2.

Hence, the area under the parabola is given by

∫_0^1 (ax² + bx + c) dx = (1/6)(y0 + 4y1 + y2).

So the previous formula gives us the area under the parabola when the interval has length 1. For an interval of arbitrary length h (so that here h = xi+2 − xi, the length of the whole pair of subintervals), using a change of variables we obtain

h ∫_0^1 (ax² + bx + c) dx = (h/6)(y0 + 4y1 + y2).

If we divide the interval [a, b] into 2n subintervals, so that h = (b − a)/n is the length of each pair of subintervals, then Simpson's rule becomes

∫_a^b f (x) dx ∼ (h/6)(y0 + 4y1 + 2y2 + 4y3 + 2y4 + · · · + 4y2n−1 + y2n).

Let us estimate the error given by this formula. We first compute the error on [xi, xi+2]. We use Theorem 2 to obtain

|f (x) − In f (x)| ≤ (M/6) |(x − xi)(x − xi+1)(x − xi+2)| ≤ (M/(72√3)) h³,

where M is a bound on the third derivative of f on [a, b]. Hence

|∫_{xi}^{xi+2} f (x) dx − (h/6)(y0 + 4y1 + y2)| ≤ (M/(72√3)) h⁴,

so the error for Simpson's rule is bounded above by

(M (b − a)/(72√3)) h³.
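The composite rule is a short loop over the parabolic pieces. A minimal Python sketch, following the convention above that h is the length of each pair of subintervals:

```python
def simpson(f, a, b, n):
    """Composite Simpson's rule with n parabolic pieces (2n subintervals);
    h = (b - a)/n is the length of each piece, as in the formula above."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        left = a + i * h
        total += f(left) + 4 * f(left + h / 2) + f(left + h)
    return total * h / 6

# Simpson's rule integrates cubics exactly: the integral of x^3 on [0, 1] is 1/4
approx = simpson(lambda x: x ** 3, 0.0, 1.0, 4)
```

Exactness on cubics is the reason the error estimate can in fact be improved beyond the bound derived from Theorem 2 (see the exercise below on the constant 192).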

Exercise. Find a formula to compute ∫_a^b f (x) dx using interpolation with polynomials of degree 3. Find an estimate for the error.

Exercise. Show that for Simpson's rule the error estimate can be improved to

(M (b − a)/192) h³,

where M is a bound on the third derivative of f on [a, b].

2.3 Using Taylor series to compute integrals


First, recall Taylor's theorem (taken from Rudin, "Principles of Mathematical Analysis"):

Theorem 3. Suppose f is a real function on [a, b], n is a positive integer, f^(n−1) is continuous on [a, b], and f^(n)(t) exists for all t ∈ [a, b]. Let α, β be distinct points of [a, b], and define

P (t) = ∑_{k=0}^{n−1} (f^(k)(α)/k!) (t − α)^k.

Then there exists a point x between α and β such that

f (β) = P (β) + (f^(n)(x)/n!) (β − α)^n.

Proof. ...

2.3.1 The left end-point rule


As before we divide the interval [a, b] as

a = x0 < x1 < x2 < · · · < xn−1 < xn = b,   h = xi − xi−1,

and define the approximation

∫_a^b f (x) dx ∼ ∑_{i=1}^{n} f (xi−1) h.

We must worry now about how good this approximation is. To do this we use Taylor's theorem (actually we only need the mean value theorem).

|∫_{xi−1}^{xi} f (x) dx − f (xi−1) h| = |∫_{xi−1}^{xi} (f (x) − f (xi−1)) dx|
≤ ∫_{xi−1}^{xi} |f (x) − f (xi−1)| dx
= ∫_{xi−1}^{xi} |f′(ξx)| (x − xi−1) dx
≤ (M/2) h²,

where M is a uniform bound on f′ on the interval [a, b]. Hence we have the useful estimate

|∫_a^b f (x) dx − ∑_{i=1}^{n} f (xi−1) h| ≤ (M (b − a)/2) h.

2.3.2 The midpoint rule


A seemingly innocent change in the previous procedure can improve on the
approximation obtained. Here we introduce the midpoint rule. With the same
notation as in the previous section, define
x∗i−1 = (xi−1 + xi)/2.

The approximation we will use now is

∫_a^b f (x) dx ∼ ∑_{i=1}^{n} f (x∗i−1) h.

The approximation given by the midpoint rule is in general better than the one given by the left end-point rule. Indeed, we can estimate the error as follows.

|∫_{xi−1}^{xi} f (x) dx − f (x∗i−1) h| = |∫_{xi−1}^{xi} (f (x) − f (x∗i−1)) dx|
≤ ∫_{xi−1}^{xi} |f (x) − f (x∗i−1)| dx
= ∫_{xi−1}^{x∗i−1} |f′(ξx)| (x∗i−1 − x) dx + ∫_{x∗i−1}^{xi} |f′(ξx)| (x − x∗i−1) dx
≤ (M/2)(h/2)² + (M/2)(h/2)² = (M/4) h²,

where M is a uniform bound on f′ on the interval [a, b]. Hence we have

|∫_a^b f (x) dx − ∑_{i=1}^{n} f (x∗i−1) h| ≤ (M (b − a)/4) h.

2.3.3 An aside: the big O notation


From now on, to make the writing of error terms easier, and perhaps more telling and neat, we will introduce the big O notation, due to Edmund Landau. We say that a function f is O(g), and write f = O(g), if there exists a constant C > 0 such that for all x,

|f (x)| ≤ C |g (x)|.

2.3.4 An improvement on the error estimate for the midpoint rule
To better estimate the error committed when using the midpoint rule we shall employ Taylor's theorem. Let x∗i be the midpoint of the interval Ii = [xi, xi+1]. Then, using Taylor around x∗i we have

∫_{xi}^{x∗i} f (x) dx = f (x∗i) (h/2) − f′(x∗i) (h²/8) + O(h³).

The implicit constant in the error is given by M/48, where M is a bound on |f″(x)|. On the other hand,

∫_{x∗i}^{xi+1} f (x) dx = f (x∗i) (h/2) + f′(x∗i) (h²/8) + O(h³).

After addition, we obtain the estimate

∫_{xi}^{xi+1} f (x) dx = f (x∗i) h + O(h³),

which finally gives

∫_a^b f (x) dx = ∑_{i=0}^{n−1} f (x∗i) h + O(h²),

where the implicit constant is given by (b − a) M/24.
Exercise (a higher order midpoint rule). Given a partition

a = x0 < x1 < · · · < xn−1 < xn = b

with h = xi+1 − xi, let x∗i be the midpoint of the interval Ji = [xi, xi+1]. Use the Taylor expansion of f centered at x∗i, given by

f (x) = f (x∗i) + f′(x∗i)(x − x∗i) + (f″(x∗i)/2)(x − x∗i)² + (f′′′(x∗i)/3!)(x − x∗i)³ + (f^(4)(ξx)/4!)(x − x∗i)⁴,

to show that

|∫_{xi}^{xi+1} f (x) dx − (f (x∗i) h + (f″(x∗i)/24) h³)| ≤ (M4/(80 × 4!)) h⁵,

where M4 is a bound on the fourth derivative of f on the interval [a, b]. Use this to develop a numerical method to compute ∫_a^b f (x) dx.

2.4 The Trapezoidal Rule


Again using Taylor's expansion around x∗i (we are keeping the notation from the previous section), we can compute that

f (x∗i) − (f (xi) + f (xi+1))/2 = O(h²),

where the implicit constant is given by M/8, with M a bound on the second derivative of f on the interval [a, b]. This shows the following:

∫_{xi}^{xi+1} f (x) dx − ((f (xi) + f (xi+1))/2) h
= (∫_{xi}^{xi+1} f (x) dx − f (x∗i) h) + (f (x∗i) h − ((f (xi) + f (xi+1))/2) h)
= O(h³),

where the implicit constant is given by M/24 + M/8 = M/6. From this we deduce that

∫_a^b f (x) dx = (h/2) ∑_{i=0}^{n−1} [f (xi) + f (xi+1)] + O(h²),

with the implicit constant given by (b − a) M/6.
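A minimal Python sketch of the composite rule just derived (the function name is our own; the interior points are counted once and the endpoints with weight 1/2):

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal rule: (h/2) * sum over i of [f(x_i) + f(x_{i+1})]."""
    h = (b - a) / n
    total = sum(f(a + i * h) for i in range(n + 1))
    return h * (total - (f(a) + f(b)) / 2)

# the integral of x^2 on [0, 1] is 1/3; since f'' = 2 is constant,
# the overshoot of the rule is exactly h^2/6 = 1/60000 for n = 100
approx = trapezoid(lambda x: x * x, 0.0, 1.0, 100)
```

The O(h²) behavior matches the estimate above, with the same constant up to the sign of the error.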

Exercise. Complete the details of the derivation of the estimate for the size of the error in the trapezoidal rule.

Exercise. Can you improve on the implicit constant given for the Trapezoidal
Rule?

2.5 Gaussian quadrature


The idea of Gaussian quadrature is as follows: approximate the integral ∫_{−1}^{1} f (x) dx by choosing n points in the interval [−1, 1] and setting

∫_{−1}^{1} f (x) dx ∼ c1 f (x1) + c2 f (x2) + · · · + cn f (xn).

But how do we choose the ci's (called weights) and the xi's? Well, the idea is that the approximation given above becomes an equality when f is a polynomial of degree at most 2n − 1. Notice then that establishing the values of the ci's and the xi's becomes a problem of solving 2n nonlinear equations in 2n unknowns.
To see how this method works, we shall work out the case n = 2. In this case, we must produce a formula

∫_{−1}^{1} f (x) dx ∼ c1 f (x1) + c2 f (x2).

But then, for this formula to be exact for a polynomial of degree at most 2 × 2 − 1 = 3 we must have

∫_{−1}^{1} 1 dx = 2 = c1 + c2,

∫_{−1}^{1} x dx = 0 = c1 x1 + c2 x2,

∫_{−1}^{1} x² dx = 2/3 = c1 x1² + c2 x2²,

∫_{−1}^{1} x³ dx = 0 = c1 x1³ + c2 x2³.

By symmetry considerations, we have that c1 = 1 = c2 and x1 = −x2. Using the third equation then yields

x2 = √(1/3),   x1 = −√(1/3).

Therefore, the formula for the Gaussian quadrature with n = 2 is given by

∫_{−1}^{1} f (x) dx ∼ f (−√(1/3)) + f (√(1/3)).

We are now concerned with estimating the error when using Gaussian quadratures. We can do this via the ever useful Taylor's theorem. Indeed, when working with a Gaussian quadrature of degree n, if we use the Taylor approximation of the function centered at 0 (so we must assume that it is at least 2n-times differentiable), call it T2n−1 (x), we know that

f (x) = T2n−1 (x) + (f^(2n)(ξx)/(2n)!) x^{2n}.

So if we denote by Qj (f) the Gaussian quadrature of order j applied to the function f, we want to estimate

|∫_{−1}^{1} f (x) dx − Qn (f)|,

and we can do it as follows. We assume a bound |f^(2n)(ξ)| ≤ M2n on [−1, 1]. Since the quadrature is exact on T2n−1, i.e., ∫_{−1}^{1} T2n−1 (x) dx = Qn (T2n−1),

|∫_{−1}^{1} f (x) dx − Qn (f)| = |∫_{−1}^{1} f (x) dx − ∫_{−1}^{1} T2n−1 (x) dx + Qn (T2n−1) − Qn (f)|
≤ |∫_{−1}^{1} (f^(2n)(ξx)/(2n)!) x^{2n} dx| + |Qn (f) − Qn (T2n−1)|
≤ 2M2n/(2n + 1)! + ∑_i |ci| |f (xi) − T2n−1 (xi)|
≤ 2M2n/(2n + 1)! + (M2n/(2n)!) ∑_i |ci| xi^{2n}.

If we have that ci ≥ 0, as we always must have

∫_{−1}^{1} 1 dx = 2 = ∑_{i=1}^{n} ci,

then, since |xi| ≤ 1, we arrive at the following estimate:

|∫_{−1}^{1} f (x) dx − Qn (f)| ≤ 4M2n/(2n)!.

Example. For the Gaussian quadrature with n = 3 we have the formula

∫_{−1}^{1} f (x) dx ∼ (5/9) f (−√(3/5)) + (8/9) f (0) + (5/9) f (√(3/5)),

and the error expected from using this formula is at most M6/2520. So if we apply this formula to computing ∫_{−1}^{1} e^x dx, we obtain from our efforts that

∫_{−1}^{1} e^x dx ∼ 2.350336929,

and this value is at most 1/840 away from the actual value, which for your information is approximately 2.35040....
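The three-point formula above is easy to check numerically. A minimal Python sketch (the function name `gauss3` is our own):

```python
import math

def gauss3(f):
    """3-point Gaussian quadrature on [-1, 1], exact for polynomials of degree <= 5."""
    r = math.sqrt(3 / 5)
    return (5 * f(-r) + 8 * f(0.0) + 5 * f(r)) / 9

approx = gauss3(math.exp)          # about 2.350337, as in the example
exact = math.e - 1 / math.e        # about 2.350402
```

The observed error is about 6.5 × 10⁻⁵, comfortably inside the bound 1/840 from the example.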

Exercise. Deduce the points and weights for the Gaussian quadrature formula
with n = 3.
Exercise. Develop a formula using the Taylor polynomial of degree 1 to ap-
proximate integrals numerically. Find an estimate for the error of your approx-
imation.

2.5.1 Gaussian quadrature on a general interval
In general, to apply the Gaussian quadrature technique to a function on a general interval, that is, to compute

∫_a^b f (x) dx,

we just have to notice that

∫_a^b f (x) dx = ((b − a)/2) ∫_{−1}^{1} f ((a + b)/2 + ((b − a)/2) z) dz.

Also, to increase the accuracy of the method, we can divide the interval [−1, 1]
into n pieces of the same size:

−1 = x0 < x1 < x2 < · · · < xn−1 < xn = 1, h = xi − xi−1 ,



and then use Gaussian quadrature in each of the subintervals Ji−1 = [xi−1, xi]. We do this as follows:

∫_{xi−1}^{xi} f (x) dx = ∫_{−h/2}^{h/2} f ((xi−1 + xi)/2 + z) dz.

Let us write g (z) = f ((xi−1 + xi)/2 + z). When we use Gaussian quadratures to compute an integral of the form

∫_{−h/2}^{h/2} g (x) dx,

we do not have to recompute the ci's or the xi's. Using a change of variables, one finds that the corresponding quadrature is given by

(h/2) ∑_i ci g ((h/2) xi).

And then comes the computation of the errors. This time we have

|∫_{−h/2}^{h/2} f (x) dx − Qn (f)| = |∫_{−h/2}^{h/2} f (x) dx − ∫_{−h/2}^{h/2} T2n−1 (x) dx + Qn (T2n−1) − Qn (f)|
≤ |∫_{−h/2}^{h/2} (f^(2n)(ξx)/(2n)!) x^{2n} dx| + |Qn (f) − Qn (T2n−1)|
≤ (2M2n/(2n + 1)!)(h/2)^{2n+1} + (h/2) ∑_i |ci| |f ((h/2) xi) − T2n−1 ((h/2) xi)|
≤ (2M2n/(2n + 1)!)(h/2)^{2n+1} + (M2n/(2n)!)(h/2)^{2n+1} ∑_i |ci| xi^{2n}.

Say we apply Gaussian quadrature of degree n on each Ji−1. Then the error on each piece, assuming the ci are positive, is at most

(2M2n/(2n)!) h (h/2)^{2n},

and hence, summing over all the pieces, the error of the approximation over an interval of length b − a is at most

(2 (b − a) M2n/(2n)!) (h/2)^{2n}.

In the case n = 3, we obtain that the error when using this composite Gaussian quadrature is at most

((b − a) M6/23040) h⁶,

where M6 is a bound on the sixth derivative of f on [a, b].
where M6 is a bound on the sixth derivative of f on [a, b].

General form of the Gaussian quadrature

∫_a^b f (x) dx ∼ ∑_{j=0}^{N−1} (h/2) ∑_{i=1}^{n} ci f ((h/2) xi + x∗j),

where h = (b − a)/N and x∗j is the midpoint of the j-th subinterval, and the error is bounded above by

(2 (b − a)/(2n)!) M2n (h/2)^{2n}.

Miniproject
Given a curve y = y (x), a function of x, joining the points (0, 0) and (1, 1), show that the time of descent of a particle from (1, 1) to (0, 0) is given by the integral

∫_0^1 √((1 + ẏ²)/(2g (1 − y))) dx,

where ẏ indicates differentiation with respect to x. Use a numerical method to decide which of the curves y = x² and y = x³, 0 ≤ x ≤ 1, gives the faster descent. What do you believe is the answer if you compare two curves y = x^m and y = x^n with m < n? Can you prove your conjecture?

2.6 Montecarlo Method for Integration


2.6.1 Review on Probability
A probability space is a triple (Ω, τ, P), where Ω is a set; τ is a σ-algebra over Ω, i.e., a subset of P(Ω) which contains Ω and is closed under taking complements and countable unions; and P is a probability: P (Ω) = 1, and for any countable family (Ak)k=1,2,3,... of disjoint elements of τ,

P (∪_{k=1}^{∞} Ak) = ∑_{k=1}^{∞} P (Ak).

A random variable X is a function between probability spaces Ω1 = (Ω1, τ1, P1) and Ω2 = (Ω2, τ2, P2),

X : Ω1 −→ Ω2,

such that for any B ∈ τ2, X^{−1}(B) ∈ τ1.

2.6.2 Integration with respect to a Probability


Given a real valued random variable X (ω) defined on a probability space, we will define the integral

∫ X (ω) dP (ω).


To do so, we start by defining the integral of simple functions such as the indicator function 1A of an event A (a set in the σ-algebra). It should of course be expected that

∫ 1A (ω) dP (ω) = P (A).

By linearity we can extend this definition to finite linear combinations of indicator functions, which we shall call simple functions:

∫_Ω ∑_{j=1}^{n} cj 1Aj (ω) dP (ω) := ∑_{j=1}^{n} cj P (Aj).

We can now extend this definition to positive functions. Given a function X ≥ 0, we define

∫_Ω X (ω) dP (ω) = sup_{s∈S, 0≤s≤X} ∫_Ω s (ω) dP (ω),

where S is the set of all simple functions. And for a general random variable, first define

X⁺ (ω) = max {X (ω), 0} and X⁻ (ω) = max {−X (ω), 0},

and thus

∫_Ω X (ω) dP (ω) := ∫_Ω X⁺ (ω) dP (ω) − ∫_Ω X⁻ (ω) dP (ω).

Example. Consider Ω = {1, 2, 3, 4, 5, 6}, with τ = P(Ω). This space represents the possible outcomes of throwing a die. Now, assume that whenever you obtain a 2 you earn 20, when you throw a 4 you earn 10, but with any other number you lose 30. This experiment can be represented by the random variable

X (ω) = 20 · 1{2} (ω) + 10 · 1{4} (ω) − 30 · 1{1,3,5,6} (ω).

Now assume that the die is not fair, and the respective probabilities are P ({1}) = P ({6}) = 1/4, and P ({j}) = 1/8 for j = 2, 3, 4, 5. We want to compute ∫ X (ω) dP (ω). In order to do this we can compute from the definition of the integral of a simple variable:

∫ X (ω) dP (ω) = 20P ({2}) + 10P ({4}) − 30P ({1, 3, 5, 6});

using the σ-additivity property of P we can compute

P ({1, 3, 5, 6}) = P ({1}) + P ({3}) + P ({5}) + P ({6}) = 1/4 + 1/8 + 1/8 + 1/4 = 3/4,

and hence

∫ X (ω) dP (ω) = 5/2 + 5/4 − 45/2 = −75/4.

With these definitions at hand, we can define the expectation (or expected value) of a random variable:

E (X) = ⟨X⟩ = ∫_{Ω1} X (ω) dP (ω).

And the variance σ² is given by

σ² = ⟨(X − ⟨X⟩)²⟩.
The main tool we will make use of is the following


Theorem 4 (Chebyshev's inequality). Let X be a random variable of expectation μ and variance σ². Then for any ε > 0, the following inequality holds:

P (|X − μ| ≥ ε) ≤ σ²/ε².

Proof. Let 1A be the indicator function of the set A. We let in this case

A = {ω : |X (ω) − μ| ≥ ε}.

Then we have that

P (A) = ∫ 1A dP (ω)
≤ ∫_Ω 1A (ω) (|X (ω) − μ|²/ε²) dP (ω)
≤ ∫_Ω (|X (ω) − μ|²/ε²) dP (ω)
= σ²/ε².

2.6.3 Monte Carlo Integration


Assume that given f : [0, a] −→ [0, b] we want to numerically compute ∫_0^a f (x) dx. Then we proceed as follows. Define a random variable

X : [0, a] × [0, b] −→ {0, 1}

by X (x, y) = 1 if y ≤ f (x), and 0 otherwise. We endow [0, a] × [0, b] with the uniform distribution, which is the probability measure defined by the density

ρ (x, y) = 1/(ab).
Then we must compute ⟨X⟩:

⟨X⟩ = ∫_{[0,a]×[0,b]} X (x, y) ρ (x, y) dx dy = (1/(ab)) ∫_{0≤y≤f(x)} dx dy,

and notice that the last integral gives the area under the curve y = f (x). X represents the following experiment: choose a point in [0, a] × [0, b] and give yourself a point if the point is under the curve y = f (x) and none if your point is above it.
We take copies of the random variable X, and denote them by Xi, i = 1, 2, 3, . . . , N; these are independent copies in the following sense: what happens in the i-th instance of the experiment has nothing to do with what happened or will happen at another instance of the experiment. Our approximation to the integral we want to compute is then built from Z : ([0, a] × [0, b])^N −→ [0, 1] defined by

Z (ω1, . . . , ωN) = (1/N) ∑_{i=1}^{N} Xi (ωi).

Each ωⱼ ∈ [0, a] × [0, b] represents an instance of the experiment (selecting a
random point from [0, a] × [0, b]). We shall compute ⟨Z⟩, but to do so, notice
that

∫ Xᵢ (ωᵢ) dP (ω₁) dP (ω₂) . . . dP (ωᵢ) . . . dP (ω_N)
  = ∫ Xᵢ (ωᵢ) dP (ωᵢ) · ∫ dP (ω₁) . . . dP (ωᵢ₋₁) dP (ωᵢ₊₁) . . . dP (ω_N)
  = ∫ Xᵢ (ωᵢ) dP (ωᵢ) .

It is then easy to compute

⟨Z⟩ = ∫_{([0,a]×[0,b])^N} Z (ω₁, . . . , ω_N) dP (ω₁) . . . dP (ω_N) = (1/ab) ∫_{0≤y≤f(x)} dx dy.

Now we are interested in knowing how good the approximation given by this
method is. In order to do this, we estimate the variance of Z and use Chebyshev's
inequality. First, notice that, the Xᵢ being independent, the variance of Z can
be computed as

σ_Z² = (1/N²) Σ_{i=1}^N σ_{Xᵢ}² = σ_X²/N.

Therefore for any ε > 0, we have that

P (|Z − ⟨Z⟩| ≥ ε) ≤ σ_X²/(N ε²).

Let us illustrate with an example. Assume we want to compute

∫₀¹ x³ dx.

If we want an accuracy of 10⁻³ with a probability of 0.90, how many instances
of the experiment must be performed? It is easy to show that σ_X² ≤ 4 (so we do
not need to compute the exact variance, we only need an estimate!), and hence
we want

(1/N) · (4/10⁻⁶) ≤ 0.1,

and hence N ≥ 4 × 10⁷.

Exercise. Write an algorithm to compute ∫₀¹ √x dx using a Montecarlo method.
How many instances of the experiment must be performed to obtain an accuracy
of 10⁻³ with a probability higher than 0.85?
Exercise. Write a Montecarlo algorithm to compute ∫∫_D (x² + y⁴) dx dy, where
D is the part of the unit disc centered at the origin in the xy-plane located in
the first quadrant. Estimate the number of iterations in your algorithm to reach
an accuracy of 10⁻³ with probability 0.85.
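The hit-or-miss scheme just described is easy to implement; the sketch below may also serve as a starting point for the exercises. The function name, the seed, and the sample size 100 000 are our own illustrative choices (with the N ≥ 4 × 10⁷ computed above one would pass n = 4 * 10**7 instead):

```python
import random

def mc_integral(f, a, b, n, seed=0):
    """Hit-or-miss Monte Carlo estimate of the integral of f over [0, a],
    assuming 0 <= f(x) <= b there: (area of the box) * (fraction of hits)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.uniform(0, a), rng.uniform(0, b)
        if y <= f(x):  # the random point fell under the curve
            hits += 1
    return a * b * hits / n

# Estimate of the integral of x^3 over [0, 1] (exact value 1/4).
estimate = mc_integral(lambda x: x**3, a=1.0, b=1.0, n=100_000)
```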

2.6.4 An improvement on the estimates


Using Chebyshev’s inequality for estimating how many trials of our experiment
must be performed to obtain certain accuracy with high probability was not
quite sharp. Instead, we shall use the following inequality due to Hoeffding.
Theorem 5. Suppose Xi ∈ [ai , bi ], i = 1, 2, . . . , N is a family of independent
random variables. Let Z be defined as
N
1 X
Z= Xi .
N i=1

Then !
2N 2 2
P (|Z − hZi| ≥ ) ≤ 2 exp − PN 2
.
i=1 (bi − ai )
UsingZ Hoeffding’s inequality to find N so that the approximation of the
1
integral x3 dx is within 10−3 of the exact value with probability at least of
0
0.90 gives N ∼ 1.500.000.
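Hoeffding's bound is easily inverted for N. The helper below is our own; it assumes each Xᵢ takes values in [0, 1], so that every bᵢ − aᵢ = 1 and the sum in the denominator equals N:

```python
import math

def hoeffding_trials(eps, delta):
    """Smallest N with 2 * exp(-2 * N * eps**2) <= delta, valid when each
    X_i takes values in [0, 1]."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

N = hoeffding_trials(eps=1e-3, delta=0.1)  # roughly 1.5 million
```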
Chapter 3

In this chapter we will discuss a simple method that can be used to find the mini-
mum of a given function. This method is based on the fact that for any given
differentiable function, minus its gradient points in the direction of max-
imum decrease. So, quite appropriately, the name of the method that we shall
discuss is the gradient method. As an application of the techniques introduced
in this chapter we will discuss neural networks.

3.1 The Gradient (Steepest descent) Method


Given a differentiable function f , assume that we want to locate a minimum of
the function. Then if we are located at a point xₙ, we will move to the point

xₙ₊₁ = xₙ − t∇f (xₙ) ,

where t is a number called the stepsize, and that in principle is arbitrary. Of
course we expect that for t small enough it holds that f (xₙ) ≥ f (xₙ₊₁), and
that xₙ converges towards a point of local minimum of f .
First of all, let us find conditions on the stepsize such that f (xₙ₊₁) < f (xₙ).
Our main tool again is Taylor's theorem. So in several variables, for a given h,
we have that

f (x₀ + h) = f (x₀) + ∇f (x₀) · h + (1/2) hᵀ Hess (f ) (θ) h,

where θ is a point in the interior of the segment joining x₀ and x₀ + h. In our
case, h = −t∇f (xₙ), and the previous expression becomes

f (xₙ₊₁) = f (xₙ) − t ‖∇f (xₙ)‖² + (t²/2) ∇f (xₙ)ᵀ Hess (f ) (θ) ∇f (xₙ) .
We can bound the right-hand side from above by

f (xₙ) − t ‖∇f (xₙ)‖² + (t²/2) ‖Hess (f ) (θ)‖₂ ‖∇f (xₙ)‖² ,

where for a matrix A = (aᵢⱼ)_{i=1,...,m; j=1,...,n} the expression ‖A‖₂ is given by

‖A‖₂ = ( Σ_{i=1}^m Σ_{j=1}^n aᵢⱼ² )^{1/2} .


Hence, if we want that f (xₙ₊₁) < f (xₙ), assuming a bound ‖Hess (f ) (θ)‖₂ ≤
β, we obtain that if the stepsize t satisfies

t < 2/β,

then f is decreasing along the sequence xₙ. But how can we
ensure that this is a convergent sequence? Well, assume that f is bounded from
below, and let m be a lower bound. Then, observe that

f (xₙ₊₁) ≤ f (xₙ) − tₙ ‖∇f (xₙ)‖²

(after replacing tₙ by the smaller positive quantity tₙ (1 − tₙβ/2), which we keep
denoting by tₙ), and solving the recurrence we obtain

m ≤ f (xₙ₊₁) ≤ f (x₀) − Σ_{j=1}^n tⱼ ‖∇f (xⱼ)‖² .

In conclusion,

Σ_{j=1}^n tⱼ ‖∇f (xⱼ)‖²

is uniformly bounded! Being a nondecreasing sequence, it is convergent. Better
yet, the n-th term of the series satisfies

tₙ ‖∇f (xₙ)‖² → 0,

so if the stepsizes tₙ are uniformly bounded from below, and the sequence xₙ is
converging somewhere, we should expect that at the limit ∇f is zero.
Now, to get convergence towards a minimizer we need more. The assumption
we shall make is that the Hessian is uniformly positive
definite. By this we mean the following. There exists an η > 0 so that at any
point x the following holds: for any unit vector u,

uᵀ Hess f (x) u ≥ η > 0.

Under this assumption we shall show that we get convergence towards a
minimum.
Assume there is a minimizer z (we shall prove its existence in a little while).
Then we must have ∇f (z) = 0. Now pick any unit vector u; then

(d²/dt²)|_{t=0} f (z + tu) = uᵀ Hess f (z) u ≥ η > 0.

We want to estimate ∇f (z + h) for vectors h small enough. Writing

u = h/‖h‖,

and noting that (d/dt) f (z + tu)|_{t=0} = ∇f (z) · u = 0, from the fundamental
theorem of calculus we have that

(d/dt) f (z + tu)|_{t=‖h‖} = ∫₀^{‖h‖} (d²/dτ²) f (z + τ u) dτ,

and hence we can estimate

(d/dt) f (z + tu)|_{t=‖h‖} = ∫₀^{‖h‖} uᵀ Hess f (z + τ u) u dτ ≥ η ‖h‖ .

Since

(d/dt) f (z + tu)|_{t=‖h‖} = ∇f (z + h) · u,

we arrive at the estimate

‖∇f (z + h)‖ ≥ η ‖h‖ .

The conclusion we can draw is the following. Let ε > 0 and assume that for all
n ≥ N we have that ‖∇f (xₙ)‖ ≤ ε; then the distance from xₙ to z must satisfy

‖xₙ − z‖ ≤ ε/η,

and the sequence provided by the gradient method will converge towards z.

3.1.1 Existence of a minimum.


Under certain conditions we can guarantee that a function has a global mini-
mum. One of these conditions is called strong convexity.

Definition 2. A function f : Rⁿ −→ R is said to be strongly convex if there is
an m > 0 such that the following inequality holds for any x, y ∈ Rⁿ:

f (y) ≥ f (x) + ∇f (x)ᵀ (y − x) + (m/2) ‖y − x‖² .

So we have:

Theorem 6. A strongly convex function has a unique global minimum, say at
x₀. Moreover, x₀ is the unique point where ∇f (x) = 0 (which means that x₀
is also the only critical point).

To prove this theorem we shall first show that there is a global minimum.
Then we will argue that if at a point z we have that ∇f (z) = 0 then it must
be a local minimum. Finally, we will show that there can only be one local
minimum. To prove this last statement we shall employ the following.

Lemma 3. A strongly convex function satisfies the following inequality for any
pair of distinct points x, y ∈ Rⁿ and any 0 < t < 1:

f (tx + (1 − t) y) < tf (x) + (1 − t) f (y) .

Proof. We let x, y be arbitrary distinct points. We let t ∈ (0, 1), and define
z = tx + (1 − t) y, so that y − z = t (y − x) and x − z = (1 − t) (x − y). Using
strong convexity (the inequalities are strict since m > 0 and x ≠ y), we have

f (y) > f (z) + ∇f (z) · [t (y − x)] ,

and

f (x) > f (z) + ∇f (z) · [(1 − t) (x − y)] .

Now multiply the first by 1 − t and the second by t and add them to
conclude that

f (z) < tf (x) + (1 − t) f (y) .

The lemma above implies that a strongly convex function can only have one
local minimum. In fact, assume there are two local minima, say x₁ and x₂,
and assume, without loss of generality, that f (x₁) ≤ f (x₂). If there is strict
inequality, then x₂ cannot be a local minimum. Indeed, by the previous lemma
we would have that for any 0 < t < 1

f (tx₁ + (1 − t) x₂) < tf (x₁) + (1 − t) f (x₂) < f (x₂) ,

so by taking t very close to 0, we see that x₂ does not satisfy the definition of
local minimum. Hence we must have f (x₁) = f (x₂). But then the previous
lemma gives, for any 0 < t < 1,

f (tx₁ + (1 − t) x₂) < tf (x₁) + (1 − t) f (x₂) = f (x₂) ,

and taking t close to 0 again produces points arbitrarily close to x₂ where f is
strictly smaller, contradicting that x₂ is a local minimum.
We proceed to show that f has a global minimum. From the definition of
strong convexity (taking x = 0 and using the Cauchy-Schwarz inequality), we
have that

f (x) ≥ f (0) − ‖∇f (0)‖ ‖x‖ + (m/2) ‖x‖² ,

but then notice that for R > 0 large enough, if ‖x‖ > R,

− ‖∇f (0)‖ ‖x‖ + (m/2) ‖x‖² > 0,

and hence for R > 0 large enough, if ‖x‖ > R we have that

f (x) > f (0) .

From this we immediately conclude that the minimum of f in the closed ball
centered at 0 of radius R, which exists by compactness, is a global minimum
of f . This global minimum is of course a local minimum, and by the previous
considerations we can conclude that it is unique. Also, it is clear that at the
point where the minimum is reached, ∇f = 0.
Finally, for a strongly convex function we can guarantee that the steepest
descent method with small enough step size converges towards its unique global
minimum.

Exercise. Given f a strongly convex function, show that for a conveniently
chosen stepsize the steepest descent algorithm converges towards the minimizer
z of f .

3.1.2 An Example: Finding √2 again.

To do this define the function

g (x) = (x² − 2)² .

In this case, the algorithm is given by the formula

xₙ₊₁ = xₙ − 4tₙ xₙ (xₙ² − 2) .

We shall prove that if we select x₀ > √2 and t ≤ 2/(12x₀² − 8), the algorithm
converges towards √2, though quite slowly. Indeed, the fact that g (xₙ₊₁) <
g (xₙ) implies that the sequence must remain in a bounded set. This implies that
there is a convergent subsequence xₙₖ. This subsequence converges towards
a point z for which g′ (z) = 0. But close enough to a minimum, the function
g is monotonic. This then implies that the original sequence xₙ is eventually
monotonic, and being bounded, it is convergent.
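A minimal Python sketch of this iteration, using the stepsize bound above (the default starting point and iteration count are our own choices):

```python
def sqrt2_gradient_descent(x0=2.0, steps=1000):
    """Gradient descent on g(x) = (x**2 - 2)**2, whose derivative is
    g'(x) = 4*x*(x**2 - 2), with constant stepsize t = 2/(12*x0**2 - 8)."""
    t = 2.0 / (12.0 * x0**2 - 8.0)
    x = x0
    for _ in range(steps):
        x -= 4.0 * t * x * (x**2 - 2.0)
    return x

root = sqrt2_gradient_descent()  # approaches sqrt(2) = 1.41421...
```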

3.1.3 Solving Linear Systems


Assume we want to solve the equation Ax = b where A is a symmetric positive
definite matrix. Define the function

f (x) = (1/2) xᵀAx − xᵀb.

Assuming A positive definite implies that f is bounded from below, as a
consequence of

f (x) ≥ (1/2) λ_min ‖x‖² − ‖b‖ ‖x‖ ,

where λ_min is the minimum eigenvalue of A, which is positive by assumption.
A simple computation, where we use that A is symmetric, yields

∇f (x) = Ax − b.

On the other hand, Hess f (x) = A, and f is strongly convex, and we can
conclude that the gradient method (steepest descent) will provide the minimizer
of f , say x₀, which in turn satisfies

Ax₀ = b.

Of course we must be careful with the stepsize. In this case, it must be less
than 2/Λ, where Λ is the largest absolute value of an eigenvalue of A.
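As an illustration, here is a small Python sketch of steepest descent for Ax = b. The 2 × 2 test matrix, the stepsize t = 0.2, and the iteration count are our own choices; the stepsize respects the bound t < 2/Λ discussed above:

```python
def solve_spd_gradient(A, b, t, steps=500):
    """Steepest descent for Ax = b, A symmetric positive definite:
    the gradient of f(x) = (1/2) x^T A x - x^T b is Ax - b."""
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        grad = [sum(A[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        x = [x[i] - t * grad[i] for i in range(n)]
    return x

# The eigenvalues of this A lie in [2, 5], so t = 0.2 < 2/5 <= 2/Lambda works.
x = solve_spd_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], t=0.2)
```

The exact solution of this system is (1/11, 7/11), which the iteration approaches geometrically.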

How to check that a matrix is positive definite and how to bound the
absolute values of its eigenvalues

This is an important matter, in view of the previous analysis.



3.1.4 Least Squares Method.


Given a set of data (xᵢ, yᵢ) we want to find parameters a and b so that the line
y = ax + b is the best fit for the data. This requires minimizing

J (a, b) = (1/2) Σ_{i=1}^m (yᵢ − axᵢ − b)² .

The gradient method is an obvious candidate to solve this problem. We compute
the partial derivatives of J:

∂J/∂a = − Σ_{i=1}^m (yᵢ − axᵢ − b) xᵢ ,

∂J/∂b = − Σ_{i=1}^m (yᵢ − axᵢ − b) ,

∂²J/∂a² = Σ_{i=1}^m xᵢ² = ‖x‖² ,

∂²J/∂b² = Σ_{i=1}^m 1 = m ,

∂²J/∂a∂b = Σ_{i=1}^m xᵢ .

What can we then say about Hess (J); is it positive definite? Notice that the
upper left corner J_aa > 0, and the determinant is strictly positive (unless all
the xᵢ's are equal). This is justified by the Cauchy-Schwarz inequality

|∂²J/∂a∂b| = |Σ_{i=1}^m xᵢ| ≤ √m ‖x‖ ,

with equality if and only if all the xᵢ's are equal. The determinant of the Hessian
is then

|Hess (J)| = J_aa J_bb − (J_ab)² = m ‖x‖² − (Σ_{i=1}^m xᵢ)² > 0,

strictly, unless all the xᵢ's are equal. The error function J is then strongly
convex. The existence of a minimum is guaranteed, as is the convergence of the
gradient method.
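The resulting algorithm fits in a few lines of Python. The data, the stepsize α = 0.01 and the iteration count below are our own illustrative choices; as discussed earlier, α must stay below 2 divided by the largest eigenvalue of Hess (J):

```python
def fit_line_gd(xs, ys, alpha=0.01, steps=20000):
    """Minimize J(a, b) = (1/2) * sum_i (y_i - a*x_i - b)**2 by gradient descent."""
    a = b = 0.0
    for _ in range(steps):
        da = -sum((y - a * x - b) * x for x, y in zip(xs, ys))  # dJ/da
        db = -sum(y - a * x - b for x, y in zip(xs, ys))        # dJ/db
        a -= alpha * da
        b -= alpha * db
    return a, b

# Data lying exactly on y = 2x + 1 should be recovered up to numerical error.
a, b = fit_line_gd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```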

3.2 Neural Networks


3.2.1 How does a neuron work? A Neurophysiology primer

Neurons are the building blocks of the nervous system. These cells come in differ-
ent shapes; however, a typical neuron can be divided into three parts: dendrites,
axon and soma. The soma is the body of the cell, and there we can find the
nucleus. The dendrites are the places where the neuron receives "the informa-
tion" from other neurons, and the axon works as a cable along which the action
potential runs.
Let us take a closer look at how neurons transmit information. So we look at
this process midway. Neurons communicate through synapses, which are small
gaps between neurons. Into this gap, a given neuron releases, from the terminal parts
of its axon, chemical substances called neurotransmitters, which bind to
receptors in the neuron across the synapse. The releasing neuron is called
presynaptic; the receiving neuron is called postsynaptic. These neurotransmitters
generate a reaction in the receiving neuron, but in order for this reaction to be strong
enough so that the receiving neuron in turn generates a pulse and releases its
own neurotransmitters, it might be required that many neurons release their
neurotransmitters that bind to the receptors of the postsynaptic neuron, and
hence a threshold must be surpassed before a neuron really reacts. This fact
will be important when modelling a neuron in a neural network.
Once the neuron reacts, a pulse is created which will run along its axon.
This is the action potential. Let us explain how this works. The fluid across
the membrane of the axon is polarized. The interior of the neuron is at a lower
electric potential with respect to the exterior of the neuron. Once the neuron
reacts and "decides" to create its action potential, a series of voltage-activated
gates open and let a flow of positive ions enter the neuron, and this causes
a change in polarization at the membrane, a change that goes running along
the whole axon: once the process of polarization reaches a point in the axon, a
voltage-operated gate opens allowing the positive ions to flow in, and the process
of polarization to keep going. Once the potential reaches the end of the axon,
some bubbles, called vesicles, containing the neurotransmitters come into contact
with the cell membrane and "open up", releasing the neurotransmitters into the
synapse, across which they travel to the next target cell.

3.2.2 Hebbian Learning


This is just the dictum: neurons that fire together wire together. The idea is as
follows. When two neurons fire together, the presynaptic neuron creates a stronger
bond with the postsynaptic neuron.

3.2.3 What is a neural network?


To simulate what happens in a neural system, we shall use a directed graph. But
it will be a special type of graph. In this case, just to simplify, we will assume
that we have three sets of nodes, which are called input nodes, output
nodes and hidden nodes. The input nodes are connected to the hidden nodes
and the hidden nodes are connected to the output nodes. The input nodes
are the nodes where the information received enters; these react in some way,
communicate their reaction to the neurons in the hidden layer, which in turn
react and communicate their reaction to the neurons in the output layer, which
finally give the output. The strength of the bond between neurons is represented
by a weight. A weight is any real number: it could be positive (representing an
excitatory bond), it could be 0 (there is no actual bond between the neurons),
or it could be negative (representing an inhibitory bond).

3.2.4 The case of one neuron


What can just one neuron do? Well, it can classify data.

The perceptron algorithm.


Assume that you have two sets of data (say points in the plane), and you want to
classify them, and by classification we understand the following. There are two
sets A− and A+ such that there is a hyperplane (a line) that separates the two
sets. This can be written as follows. We assume that both sets are subsets of
Rⁿ; the separability means that there exists a vector w ∈ Rⁿ⁺¹, called a weight
vector, such that the following holds:
• if a ∈ A− then w₀ + w₁a₁ + · · · + wₙaₙ < 0,
• if a ∈ A+ then w₀ + w₁a₁ + · · · + wₙaₙ > 0.
The task is to find such a vector w which solves the problem. We will
work under the assumption that a solution to the problem exists, and we will
give an algorithm that finds one. This algorithm is called the perceptron
algorithm.
We first augment each point a with a leading 1, that is, we replace a by
(1, a₁, . . . , aₙ) ∈ Rⁿ⁺¹, so that the conditions above read wᵀa < 0 and wᵀa > 0.
We will then define X = A+ ∪ (−A−), where

−A− = {−a : a ∈ A−} ,

and give it an order, say

X = {p (1) , p (2) , . . . , p (m)} .

We start with the weight vector w (0) = 0. Then we pick the k-th element of X,
and we denote it by p (k). If w (k)ᵀ p (k) ≤ 0, we change the weight as follows:

w (k + 1) = w (k) + p (k) .

If we finish with the training set, we start over again, following the same order.
Interestingly enough, if there is a solution to the problem of classification (in
this case, if there exists a w₀ such that for any x ∈ X we have w₀ᵀx > 0), the
algorithm described above finds a solution in a finite number of iterations. Let
us show this. In order to proceed, we define two parameters:

α = min_{j=1,...,m} p (j)ᵀ w₀ ,

and

β² = max_{j=1,...,m} ‖p (j)‖² .

Counting only the steps at which an update is made, it is not difficult to see
from the recurrence that

w (k + 1) = Σ_{n=0}^k p (n) .

From this formula, taking the scalar product with w₀ and using the definition
of α, we obtain

w (k + 1)ᵀ w₀ = Σ_{n=0}^k p (n)ᵀ w₀ ≥ kα.

From Cauchy-Schwarz, we obtain

‖w (k + 1)‖ ‖w₀‖ ≥ w (k + 1)ᵀ w₀ ≥ kα,
which can be rewritten as

‖w (k + 1)‖ ≥ k (α/‖w₀‖) .

On the other hand, assuming that w (k)ᵀ p (k) ≤ 0, that is, assuming that the
weight vector calculated thus far does not classify the k-th example well, yields

‖w (k + 1)‖² = ‖w (k)‖² + 2w (k)ᵀ p (k) + ‖p (k)‖² ≤ ‖w (k)‖² + ‖p (k)‖² ,

and thus

‖w (k + 1)‖² ≤ Σ_{j=0}^k ‖p (j)‖² ≤ β² (k + 1),

or

‖w (k + 1)‖ ≤ β √(k + 1).

These two inequalities can be read as follows: on the one hand, the norm of the
weight vector grows at least linearly in the number of training examples it classifies
wrongly, but at most as the square root of this same number of examples times a
constant. Clearly, this situation is untenable (indeed, together the two bounds force
k to stay, up to a constant, below (β ‖w₀‖ /α)²), unless at some point the weight
vector classifies correctly all the examples.
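A Python sketch of the perceptron algorithm just analyzed. The training points below are our own toy example, already augmented with a leading 1 (and negated for the points of A−) as described above; the sweep limit is also our own choice:

```python
def perceptron(X, max_sweeps=100):
    """Perceptron algorithm on X = A+ union (-A-): repeatedly sweep the data,
    adding any p with w . p <= 0 to w, until everything is classified."""
    w = [0.0] * len(X[0])
    for _ in range(max_sweeps):
        updated = False
        for p in X:
            if sum(wi * pi for wi, pi in zip(w, p)) <= 0:  # misclassified
                w = [wi + pi for wi, pi in zip(w, p)]
                updated = True
        if not updated:  # a full sweep with no mistakes: w solves the problem
            break
    return w

# A+ = {(2, 2)} and A- = {(-1, -1)}: after augmenting with a leading 1
# and negating the point of A-, X is the separable set below.
w = perceptron([[1.0, 2.0, 2.0], [-1.0, 1.0, 1.0]])
```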

3.2.5 How does a neural network learn?

Deep Learning: The backpropagation algorithm

We shall employ the following notation. Layer k will be denoted by lₖ, and it is
by definition a set of neurons

lₖ = {N_1^k , . . . , N_{m_k}^k} ;

the N_j^k are the nodes or neurons. The last layer we will denote by l_d, the d a
reminder of depth. Neurons from layer lₖ₋₁ are connected with neurons in layer
lₖ. The output of the i-th neuron in the k-th layer will be denoted by o_i^k. The
total input into neuron j in the k-th layer is given by

x_j^k = Σᵢ w_j^{k,i} o_i^{k−1} ,

where the w_j^{k,i} are called weights and they represent the strength of the con-
nection between neuron i in layer k − 1 and neuron j in layer k, that is, the
weight of the connection between N_i^{k−1} and N_j^k. This is the input received by
the j-th neuron in the k-th layer, but as the neuron's output depends on an
activation function, we will have by definition that

o_j^k = S (x_j^k) ,

where S is the sigmoid function, which is given by

S (x) = 1/(1 + e^{−x}).

For the network to learn, it will be trained on a set of data (xₐ, yₐ), where
xₐ is the given input and yₐ the corresponding expected output. In a training
scenario, there is a total error

J = (1/2) Σₐ (o_a^d − yₐ)² ,

and the objective of the learning algorithm is to find weights w_j^{k,i} that minimize
J. Of course, the reader must notice that J depends on the family of weights
employed, and this in principle could be seen if we expanded the outputs o_a^d in
all their glory. The art here is then not having to proceed to do this expansion.
The first weights that can be updated at time t + 1, knowing those weights
at time t, are the weights incoming into the last layer.
To minimize (or at least try to do so) we employ a steepest descent
algorithm, so we adjust the weights as follows:

w_j^{d,i} (t + 1) = w_j^{d,i} (t) − α ∂J/∂w_j^{d,i} ,   α > 0.

We must then compute the partial derivative in the previous expression. The
following fact about the sigmoid function will be useful:

S′ (x) = S (x) (1 − S (x)) .

By the chain rule,

∂J/∂w_j^{d,i} = (∂J/∂o_j^d) (∂o_j^d/∂x_j^d) (∂x_j^d/∂w_j^{d,i}) = (o_j^d − yⱼ) S (x_j^d) (1 − S (x_j^d)) o_i^{d−1} .

The function S (x) (1 − S (x)) will appear repeatedly, so we shall name it B (x),
and thus we can rewrite the previous identity as

∂J/∂w_j^{d,i} = (o_j^d − yⱼ) B (x_j^d) o_i^{d−1} .

Finally, it is customary to write δ_j^d = o_j^d − yⱼ, so we have

w_j^{d,i} (t + 1) = w_j^{d,i} (t) − α δ_j^d B (x_j^d) o_i^{d−1} .




An aside must be made here. We will have to keep

∂J/∂x_j^d = (∂J/∂o_j^d) (∂o_j^d/∂x_j^d) = (o_j^d − yⱼ) (∂o_j^d/∂x_j^d)

for the next calculation, and in a more general fashion, once we have updated
the weights coming into layer k + 1, we will need to keep in memory

∂J/∂x_j^{k+1} .

Next, we occupy ourselves with finding how to update the weights on the inner
layers. Assume we have updated the weights of the links coming into the (k + 1)-
th layer; our task is to update the weights coming into the k-th layer, that is,
the w_j^{k,i}. As before, we use a steepest descent method,

w_j^{k,i} (t + 1) = w_j^{k,i} (t) − α ∂J/∂w_j^{k,i} .

We must compute ∂J/∂w_j^{k,i}, and to do this we resort to the chain rule:

∂J/∂w_j^{k,i} = (∂J/∂x_j^k) (∂x_j^k/∂w_j^{k,i}) ,

and we need to compute ∂J/∂x_j^k, which is given by

∂J/∂x_j^k = Σₗ (∂J/∂x_l^{k+1}) (∂x_l^{k+1}/∂x_j^k) ,

and then

∂x_l^{k+1}/∂x_j^k = (∂x_l^{k+1}/∂o_j^k) (∂o_j^k/∂x_j^k) = w_l^{k+1,j} S (x_j^k) (1 − S (x_j^k)) .

As the reader should notice, this way we can proceed until we reach the first
layer. This is the backpropagation algorithm.
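To make the formulas concrete, here is a minimal Python sketch of backpropagation for a network with one hidden layer and one output neuron. The bias handling, the toy OR data set, the learning rate and the epoch count are all our own illustrative choices; only the update rules come from the derivation above:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_net(samples, hidden=3, alpha=0.5, epochs=10000, seed=1):
    """One-hidden-layer, one-output network trained by backpropagation,
    with B(x) = S(x)(1 - S(x)) as in the text.  Returns a prediction function."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(inputs):
        x = list(inputs) + [1.0]  # input plus a bias unit
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1] + [1.0]
        return x, h, sigmoid(sum(w * hi for w, hi in zip(W2, h)))

    for _ in range(epochs):
        for inputs, y in samples:
            x, h, o = forward(inputs)
            d_o = (o - y) * o * (1.0 - o)  # delta at the output, times B
            # hidden-layer deltas, computed before W2 is changed
            d_h = [d_o * W2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for i in range(hidden + 1):    # update the last-layer weights first
                W2[i] -= alpha * d_o * h[i]
            for j in range(hidden):        # then backpropagate to the inner layer
                for i in range(n_in + 1):
                    W1[j][i] -= alpha * d_h[j] * x[i]
    return lambda inputs: forward(inputs)[2]

# Toy data: the OR function on {0,1}^2.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
predict = train_net(data)
```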
Chapter 4

In this part of the notes, we shall find numerical methods for solving linear
systems. This is important not only per se; we should also look at it with an eye
toward applications to solving differential equations numerically. One of our
main tools will be the Fixed Point Theorem. In order to apply this important
result, we must learn how to measure matrix norms.

4.1 Matrix Norms


Let T : Rⁿ −→ Rᵐ, and assume both Rⁿ and Rᵐ are endowed with norms, say
‖·‖_ν and ‖·‖_µ. We define the norm ‖T‖_{ν,µ} as

‖T‖_{ν,µ} := sup_{x≠0} ‖Tx‖_µ / ‖x‖_ν = sup_{‖x‖_ν = 1} ‖Tx‖_µ .

Exercise. Show that the second equality in the previous definition holds.
Exercise. Show that ‖T‖_{ν,µ} defines a norm.
An aside on notation: whenever n = m and µ = ν, we shall denote the norm
simply as ‖T‖_ν.
Let us show a few examples where we compute different norms for a given
operator. We shall consider T : R² −→ R² represented by the matrix

T = ( 2  −5
      1   3 ).

We shall consider R² endowed first with the following norm:

‖x‖₁ = |x₁| + |x₂| .

Let us compute ‖T‖₁. In order to do this, let x be such that ‖x‖₁ = 1. On one
hand we have that

Tx = (2x₁ − 5x₂, x₁ + 3x₂),

and therefore

‖Tx‖₁ = |2x₁ − 5x₂| + |x₁ + 3x₂| ≤ (2 |x₁| + 5 |x₂|) + (|x₁| + 3 |x₂|) .

Notice now that the quantity inside each parenthesis is just the expected
value of a random variable: the one in the first parenthesis takes the value 2
with probability |x₁| and the value 5 with probability |x₂| = 1 − |x₁|, whereas
the one in the second parenthesis takes on the value 1 with probability |x₁| and
the value 3 with probability |x₂| = 1 − |x₁|. Hence, it is not difficult to obtain
the estimate

‖Tx‖₁ ≤ 5 + 3 = 8.

On the other hand, if we choose x with x₁ = 0 and x₂ = 1, it is not difficult to
see that ‖Tx‖₁ = 8. We conclude then that ‖T‖₁ = 8.
Next we consider R² endowed with the norm

‖x‖_∞ = max {|x₁| , |x₂|} .

Our goal now is to compute ‖T‖_∞. We have

‖Tx‖_∞ = max {|2x₁ − 5x₂| , |x₁ + 3x₂|}
       ≤ max {2 |x₁| + 5 |x₂| , |x₁| + 3 |x₂|}
       ≤ 2 |x₁| + 5 |x₂|
       ≤ (5 + 2) max {|x₁| , |x₂|} ,

and if ‖x‖_∞ = 1, then we have the estimate ‖T‖_∞ ≤ 7. On the other hand,
notice that if we take x such that x₁ = 1 and x₂ = −1 (= the sign of −5), then
‖Tx‖_∞ = |2 · 1 − 5 · (−1)| = 7. Therefore ‖T‖_∞ = 7.
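These computations are easy to check numerically. The two helper functions below are our own; they compute the maximum absolute column sum (which, as the example shows, gives the 1-norm) and the maximum absolute row sum (which gives the ∞-norm):

```python
def norm_1(T):
    """Induced 1-norm of a matrix: the maximum absolute column sum."""
    return max(sum(abs(T[j][k]) for j in range(len(T))) for k in range(len(T[0])))

def norm_inf(T):
    """Induced infinity-norm of a matrix: the maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in T)

T = [[2, -5], [1, 3]]
n1, ninf = norm_1(T), norm_inf(T)  # 8 and 7, as computed above
```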
The previous computations are particular cases of the following theorem.
Theorem 7. Given T : Rⁿ −→ Rⁿ, we have that

‖T‖₁ = max_{k=1,2,...,n} Σ_{j=1}^n |Tⱼₖ| ,        (4.1)

and

‖T‖_∞ = max_{j=1,2,...,n} Σ_{k=1}^n |Tⱼₖ| .        (4.2)

That is, ‖T‖₁ is the maximum absolute column sum and ‖T‖_∞ the maximum
absolute row sum, in agreement with the computations above.

Proof. We shall show (4.1). Pick any vector x = (x₁, . . . , xₙ) with ‖x‖₁ = 1.
Then, we have

‖Tx‖₁ = Σⱼ |(Tx)ⱼ| ≤ Σᵢ Σⱼ |Tⱼᵢ| |xᵢ| .

Now notice that Σᵢ |xᵢ| = 1, so, if we define Lᵢ = Σⱼ |Tⱼᵢ| (the i-th column sum),
then the expression

Σᵢ Lᵢ |xᵢ|

can be interpreted as an expectation over the set of numbers Lᵢ, i = 1, . . . , n.
These numbers being nonnegative, we have

Σᵢ Lᵢ |xᵢ| ≤ maxᵢ Lᵢ = maxᵢ Σⱼ |Tⱼᵢ| .

To show equality, if the maximum is attained when i = m, just choose as x the
vector which has a 1 in the m-th coordinate and 0 everywhere else.

Exercise. Prove the identity (4.2).

There is an important concept related to the norms of matrices: the spectral
radius.

Definition 3. Let A be a square n by n matrix of complex entries. Let λ₁, . . . , λₙ
be its eigenvalues. Then the spectral radius of A, ρ (A), is given by

ρ (A) = max {|λ₁| , . . . , |λₙ|} .

To get some intuition about how the spectral radius is related to the norms of
matrices we prove the following (here ‖A‖₂ denotes the operator norm induced by
the Euclidean norm, not the entrywise norm used in Chapter 3).

Theorem 8. Let A be a symmetric matrix. Then ‖A‖₂ = ρ (A), where ρ (A)
is the spectral radius of A.

Proof. We shall show that the following optimization problem is equivalent to
finding the eigenvalues of A: find the extrema of

J (x) = ⟨Ax, x⟩ ,

restricted to ‖x‖² = 1. First notice that

∂J/∂xᵢ = ∂/∂xᵢ Σ_{k,j} A_{kj} xₖ xⱼ
       = Σ_{k,j} (A_{kj} δ_{ki} xⱼ + A_{kj} δ_{ji} xₖ)
       = Σⱼ A_{ij} xⱼ + Σₖ A_{ki} xₖ ;

since k and j are dummy summation indices, using that A is symmetric gives

∂J/∂xᵢ = 2 Σₖ A_{ik} xₖ ,

which shows that

∇J = 2Ax.

Using Lagrange multipliers, we arrive at the equation

Ax = λx,

which is what we first wanted to show.


By the spectral theorem, A is diagonalizable. This means that we can find
a basis of Rⁿ composed of eigenvectors of A. Let v₁, . . . , vₙ be such a basis,
with associated eigenvalues λ₁, . . . , λₙ. Clearly, AᵗA, being symmetric, is also
diagonalizable. The important claim here is that v₁, . . . , vₙ are also eigenvectors
of AᵗA and its associated eigenvalues are λ₁², . . . , λₙ². This last assertion can be
verified by direct computation. But then, this implies that

ρ (AᵗA) = ρ (A)² .

Hence, if the minimum and maximum eigenvalues of A are λ_min and λ_max,
then if ‖x‖ = 1,

⟨Ax, Ax⟩ = ⟨AᵗAx, x⟩ ≤ max {λ_min², λ_max²} ,

and hence

‖A‖₂ ≤ max {|λ_min| , |λ_max|} ,

from which we deduce that ‖A‖₂ ≤ ρ (A). The other inequality is left as an
exercise.

In order to prove convergence, the following lemma will be very useful. We
leave the proof to the reader.

Theorem 9. Let A be a matrix whose spectral radius is ρ (A), and let ε > 0.
There exists a matrix norm ‖·‖ such that

‖A‖ ≤ ρ (A) + ε.

4.2 The Method of Jacobi


Given a linear transformation A and a vector c ∈ Rⁿ, we can consider the affine
linear map

Tx = Ax + c.

To check that T is a contraction once we have endowed Rⁿ with a norm ‖·‖_ν,
all we have to check is the following condition:

‖A‖_ν < 1.

Exercise. Show the validity of the previous criterion.

The previous condition is the basis for all the iterative methods we will
describe in this chapter, the first of which is Jacobi's method. Our purpose is
to solve the equation

Ax = b.

With this in mind, we will rewrite the previous equation as

x = −D⁻¹ (A − D) x + D⁻¹b,

where D is the diagonal matrix whose diagonal coincides with that of A.
The previous equation is written so the Fixed Point Method can be readily
applied, and in this case the iteration scheme is given by

xₙ₊₁ = −D⁻¹ (A − D) xₙ + D⁻¹b.

The map Tx = −D⁻¹ (A − D) x + D⁻¹b will have a fixed point as long as, in a
given norm, α_µ = ‖D⁻¹ (A − D)‖_µ < 1; notice also that this condition implies

that the fixed point scheme will converge in the ‖·‖_µ norm towards the fixed
point. Besides, we have the estimate

‖xₙ − x*‖_µ ≤ (α_µⁿ / (1 − α_µ)) ‖x₁ − x₀‖_µ .

Intuitively, Jacobi’s method will work when the elements in the diagonal
are much larger than the other elements in the matrix A. We have then the
following result which guarantees in some particular cases when one of the norms
described above are strictly less than 1 for a matrix.

Theorem 10. If A = (aᵢⱼ)_{i,j=1,...,n} satisfies

max_i (1/|aᵢᵢ|) Σ_{j≠i} |aᵢⱼ| < 1,

then Jacobi's method converges.

Proof. Just notice that the expression above corresponds to the norm ‖·‖_∞ of
the transformation

D⁻¹ (A − D) .

4.2.1 Example.

We will solve iteratively the equation Ax = b, with

A = ( 10   1        b = (  1
      −2  12 ),          −1 ).

In this case

D = ( 10   0
       0  12 ).

The iteration scheme is

xₙ₊₁ = U xₙ + c,

where

U = ( 0     −1/10        c = (  1/10
      1/6    0    ),          −1/12 ).

We take as seed x₀ = 0, and hence

x₁ = (1/10, −1/12),   x₂ = (13/120, −1/15), . . .
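A Python sketch of the iteration, in exact rational arithmetic (the use of `fractions` is our own choice, so the iterates can be compared exactly):

```python
from fractions import Fraction

def jacobi_step(A, b, x):
    """One Jacobi iteration x -> -D^{-1}(A - D)x + D^{-1}b."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

A = [[Fraction(10), Fraction(1)], [Fraction(-2), Fraction(12)]]
b = [Fraction(1), Fraction(-1)]
x1 = jacobi_step(A, b, [Fraction(0), Fraction(0)])  # [1/10, -1/12]
x2 = jacobi_step(A, b, x1)                          # [13/120, -1/15]
```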

4.3 Gauss-Seidel

In this case we use the following decomposition (A = A_L + D + A_R, where A_L
is the strictly lower triangular part of A and A_R the strictly upper triangular
part):

(D + A_L) x = −A_R x + b,

and use the iteration scheme

xₙ₊₁ = − (D + A_L)⁻¹ A_R xₙ + (D + A_L)⁻¹ b.

However, instead of inverting the matrix D + A_L, it is better to solve for each
of the variables, starting from x₁, and then solving for x₂ and so on. In other
words, the following iteration scheme is considered:

(D + A_L) xₙ₊₁ = −A_R xₙ + b,

where xₙ₊₁ is solved from the above equation (xₙ is known), starting from its
topmost entry to its lowest by replacement.
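One Gauss-Seidel sweep therefore updates the entries of x in place, from top to bottom. A Python sketch (the test system is the one from the Jacobi example; the sweep count is our own choice):

```python
def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep: solve (D + A_L) x_new = -A_R x + b by forward
    substitution, i.e. overwrite x[0], x[1], ... using the freshest values."""
    n = len(b)
    x = list(x)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x

A, b = [[10.0, 1.0], [-2.0, 12.0]], [1.0, -1.0]
x = [0.0, 0.0]
for _ in range(25):
    x = gauss_seidel_step(A, b, x)
# x approaches the exact solution (13/122, -4/61).
```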

4.3.1 An instance of convergence for Gauss-Seidel

Gauss-Seidel converges whenever the matrix is strictly diagonally dominant. By
this we mean that, in each row, the absolute value of the diagonal entry is strictly
larger than the sum of the absolute values of the other entries in the row. That
is, when considering the row i we have the inequality

|aᵢᵢ| > Σ_{j≠i} |aᵢⱼ| .

We leave the proof of this fact to the reader.

4.4 Relaxation Methods

Notice that we can write

xₙ₊₁ = xₙ + (xₙ₊₁ − xₙ) ,

and we can interpret xₙ₊₁ − xₙ as a correction term. Perhaps this correction is
not good enough, and we could approximate the correct solution better by scaling
the correction via a new parameter ω. That is, we could write

xₙ₊₁ = xₙ + ω (xₙ₊₁ − xₙ) ,

and we must determine ω to make the approximation as good as possible. To
obtain the iteration scheme, we recall that the Gauss-Seidel method can
be written as

Dxₙ₊₁ = b − A_L xₙ₊₁ − A_R xₙ ,

and hence we can take as correction

xₙ₊₁ − xₙ = D⁻¹ (b − A_L xₙ₊₁ − A_R xₙ) − xₙ ,

to obtain, after solving for xₙ₊₁:

xₙ₊₁ = (ωA_L + D)⁻¹ [((1 − ω) D − ωA_R) xₙ + ωb] .

We shall employ the notation

B_ω = (ωA_L + D)⁻¹ ((1 − ω) D − ωA_R) .

As for the convergence of the method, we have the following result.

Theorem 11. Let A be a symmetric positive definite matrix. If 0 < ω < 2 then
ρ (B_ω) < 1. Notice that ω = 1 corresponds to the Gauss-Seidel method, and in
particular the Gauss-Seidel method converges if the matrix A is symmetric
positive definite.
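A Python sketch of one relaxation (SOR) sweep, written entrywise as Gauss-Seidel plus a scaled correction. The positive definite test matrix and the choice ω = 1.2 are our own:

```python
def sor_step(A, b, x, omega):
    """One SOR sweep: compute the Gauss-Seidel value for each entry and move
    only a fraction omega of the way towards it."""
    n = len(b)
    x = list(x)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        gs = (b[i] - s) / A[i][i]      # the Gauss-Seidel value
        x[i] += omega * (gs - x[i])    # the relaxed update
    return x

A, b = [[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0]
x = [0.0, 0.0]
for _ in range(100):
    x = sor_step(A, b, x, omega=1.2)
# x approaches the exact solution (1/11, 7/11); omega = 1 is Gauss-Seidel.
```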

4.5 The Conjugate Gradient Method


We shall consider a positive definite matrix A. We have then a definition
Definition 4. We say that u and v are A-conjugate if

hu, viA := ut Av = 0.

Our goal is to solve the system Ax = b. Assume then that we have a set
P = {p1 , . . . , pn } of A-conjugate vectors. It is not difficult to prove that P
forms a basis of Rn , and it is left as an exercise for the reader. A solution to
the system is then given by
x = αi pi ,
where it is not difficult to compute the coefficients αi :
hx, pi iA
αi = . (4.3)
hpi , pi iA

As sometimes an approximate solution will do, and computing a basis of A-


conjugate vectors can be costly, we want to produce an iterative method to
solve Ax = b.
To do this we will use the following iteration.

r0 = b, x(1) = b, p1 = b, p0 = 0.

Then we define

rj = b − Ax(j) , pj+1 = rj + αj pj , x(j+1) = x(j) + βj pj+1 .

where the choices of αj and βj are as follows: αj is such that p1 , . . . , pj , pj+1 are A-conjugate, and βj satisfies the equation

d/dt f (x^{(j)} + t pj+1) |_{t=βj} = 0,   where f (x) = (1/2) x^t A x − x^t b.

Of course, it is not obvious that αj exists; this must be proven. That is the
purpose of the next lemma.
Lemma 4. With the iterations defined as above,

⟨r0 , . . . , rk−1⟩ = ⟨b, . . . , A^{k−1} b⟩,

⟨p1 , . . . , pk⟩ = ⟨b, . . . , A^{k−1} b⟩,

x^{(k)} ∈ ⟨b, . . . , A^{k−1} b⟩.

Lemma 5. Assume that p1 , . . . , pk are A-conjugate vectors. Then, with the above notation, rk ⊥ pl for l ≤ k, and as a consequence rk ⊥ rl for l ≤ k − 1.
Let us see how Lemma 5 implies that we can find αj . In order to produce a pj+1 which is A-conjugate to p1 , . . . , pj , we write

pj+1 = rj + αj pj + Σ_{l=1}^{j−1} αl pl .

Pick m ≤ j − 1, and compute

⟨pj+1 , pm⟩_A = ⟨rj , pm⟩_A + αj ⟨pj , pm⟩_A + Σ_{l=1}^{j−1} αl ⟨pl , pm⟩_A .

By assumption, ⟨pl , pm⟩_A = 0 if l ≠ m. So we have that

⟨pj+1 , pm⟩_A = ⟨rj , pm⟩_A + αm ⟨pm , pm⟩_A .

What we want to show now is that we may take αm = 0 for m ≤ j − 1 and still have ⟨pj+1 , pm⟩_A = 0. By the identity above, it is enough to show that

⟨rj , pm⟩_A = 0.

First, notice the following:

⟨rj , pm⟩_A = ⟨rj , Apm⟩ .

But then, from Lemma 4,

Apm ∈ ⟨b, . . . , A^m b⟩ ⊂ ⟨b, . . . , A^{j−1} b⟩ = ⟨r0 , . . . , rj−1⟩,
and since rj ⊥ rk for k ≤ j − 1, we must have rj ⊥ Apm , or in other words,

⟨rj , pm⟩_A = 0.
Therefore, choosing

αj = − ⟨rj , pj⟩_A / ⟨pj , pj⟩_A ,

we get that pj+1 is A-conjugate to p1 , . . . , pj .
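The iteration just described can be sketched in code as follows. As a minor variant (an assumption on our part), the first iterate is obtained by an exact line search along p1 = b instead of taking x^{(1)} = b, so that every iterate minimizes f along its search direction:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-12):
    """Conjugate gradient sketch in the notation of this section (A symmetric positive definite).

    p_{j+1} = r_j + alpha_j p_j with alpha_j = -<r_j, p_j>_A / <p_j, p_j>_A,
    and x is advanced by minimizing f(x) = x^t A x / 2 - x^t b along p_{j+1}.
    """
    p = b.astype(float).copy()                 # p_1 = b
    x = ((b @ p) / (p @ (A @ p))) * p          # line search from 0 along p_1
    for _ in range(len(b)):
        r = b - A @ x                          # r_j = b - A x^(j)
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = -(r @ Ap) / (p @ Ap)           # keeps the directions A-conjugate
        p = r + alpha * p                      # p_{j+1}
        beta = (p @ r) / (p @ (A @ p))         # beta_j from the line search
        x = x + beta * p                       # x^{(j+1)}
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

For a symmetric positive definite n × n matrix this terminates (in exact arithmetic) after at most n iterations, since the directions exhaust a basis of A-conjugate vectors.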

4.5.1 Proof of Lemma 4


The proof is by induction. For k = 1 the statement of the lemma clearly holds. Assuming it holds at stage k, we will show that it must hold at stage k + 1. In fact, all we must show is that rk+1 ∈ ⟨b, . . . , A^{k+1} b⟩. But as

rk+1 = b − Ax^{(k+1)} ,

all we must show is that x^{(k+1)} ∈ ⟨b, . . . , A^k b⟩. Since

x(k+1) = x(k) + βk pk+1 ,

and
pk+1 = rk + αk pk ,


and both rk and pk belong to ⟨b, . . . , A^k b⟩, we can conclude that so does pk+1 ,
and hence x(k+1) . The lemma follows.

4.5.2 Proof of Lemma 5


We proceed by induction. The case k = 0 holds obviously (recall that p0 = 0).
The induction hypothesis is that rj−1 is orthogonal to pl with l ≤ j − 1; we will
show that this statement holds for k = j.
First notice that

d/dt f (x^{(j−1)} + t pj) |_{t=βj−1} = ∇f (x^{(j)}) · pj = 0,

and ∇f (x^{(j)}) = Ax^{(j)} − b = −rj . Thus rj ⊥ pj .
On the other hand,

rj = b − Ax^{(j)} = b − A (x^{(j−1)} + βj−1 pj),

and hence

rj = b − Ax^{(j−1)} − βj−1 Apj = rj−1 − βj−1 Apj .

By assumption, rj−1 is orthogonal to pj−1 , and also ⟨pj−1 , pj⟩_A = 0 or, in other words, ⟨pj−1 , Apj⟩ = 0. This allows us to conclude that rj ⊥ pj−1 . We can continue in this way to finish the proof of the first part of the lemma. The second part, i.e., that rk ⊥ rl for l ≤ k − 1, follows from the first part of the lemma and Lemma 4.
Chapter 5

Differential Equations:
Iterative Methods

5.1 Euler’s Method


Given a differential equation of first order
dy/dt = f (t, y) ,   y (0) = y0 ,        (5.1)
which we want to solve on an interval [0, L]. Euler's method can be set up as follows:

tn+1 = tn + h,   h = L/N,
yn+1 = yn + h f (tn , yn) ,
y0 = y (0) ,   t0 = 0,

N a fixed positive integer.
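The scheme transcribes directly; the test equation y′ = y below is our own illustrative choice (exact solution e^t):

```python
import numpy as np

def euler(f, y0, L, N):
    """Euler's method for y' = f(t, y), y(0) = y0, on [0, L] with N steps."""
    h = L / N
    t = np.linspace(0.0, L, N + 1)
    y = np.empty(N + 1)
    y[0] = y0
    for n in range(N):
        y[n + 1] = y[n] + h * f(t[n], y[n])
    return t, y

# Illustrative run: y' = y, y(0) = 1 on [0, 1]; y(1) should approximate e.
t, y = euler(lambda t, y: y, 1.0, 1.0, 1000)
```

The error analysis that follows explains why halving h roughly halves the error of this method.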


As usual, it is important to estimate how far the approximation is from the
exact answer. For this we shall use a technique called error propagation. The
spirit of the method is to assume that we have a bound on |y (tn ) − yn | and to
extend it to a bound on |y (tn+1 ) − yn+1 |: This will give us a recurrence which
can be solved.
We proceed as follows. We expand using Taylor’s theorem
y (tn + h) = y (tn) + y′(tn) h + (y″(ξn)/2) h² .

We can compute y′(tn) = f (tn , y (tn)). So we have that

y (tn+1) − yn+1 = y (tn) − yn + h (f (tn , y (tn)) − f (tn , yn)) + (y″(ξn)/2) h² .

The mean value theorem gives

f (tn , y (tn)) − f (tn , yn) = (∂f/∂y)(tn , ζn) (y (tn) − yn) .
Let us write
En = |y (tn ) − yn | ,


so we get

En+1 ≤ En + AhEn + (B/2) h² ,

where A is a bound on |(∂f/∂y)(t, y)|, and B is a bound on y″. To obtain this last bound, we work out y″ in terms of f and its derivatives:

y″ = ∂f/∂t + f (∂f/∂y) .

Assume then that the bound A also works for f and for ∂f/∂t. Then we can bound, for all t,

|y″(t)| ≤ A + A² .

Hence we have

En+1 ≤ En (1 + Ah) + ((A + A²)/2) h² .

Therefore, if we solve the recurrence

En+1 = θEn + γh² ,   θ = 1 + Ah,   γ = (A + A²)/2,

with E0 = 0, Ej is an upper bound for the quantity |y (tj ) − yj |.


Solving this recurrence we obtain

Ej = γh² Σ_{k=0}^{j−1} θ^k = γh² (θ^j − 1)/(θ − 1) = (γ/A)(θ^j − 1) h.

Just for the sake of simplicity, we want to drop the dependence on j in the previous formula. Thus we have (as 0 ≤ j ≤ N)

θ^j − 1 = (1 + Ah)^j − 1 ≤ (1 + AL/N)^N − 1 ≤ e^{AL} − 1.

This yields the estimate

|y (tn) − yn| ≤ (γ/A) (e^{AL} − 1) h.

We summarize our findings in the following theorem.

Theorem 12. Given the differential equation (5.1) and the approximate solution by Euler's method, we have that

|y (tn) − yn| ≤ ((1 + A)/2) (e^{AL} − 1) h,

where A is a bound on |f |, |∂f/∂t|, and |∂f/∂y|.
∂t ∂y

Exercise. In this problem, we suggest a small change in Euler's method; namely, we now use the approximation

yn+1 = yn + hf (tn , yn) + (h²/2) [ (∂f/∂t)(tn , yn) + f (tn , yn) (∂f/∂y)(tn , yn) ] .

Find an estimate for |y (tn ) − yn |. Implement this method for the differential
equation
dy/dt = 1/(1 + t² y²),   y (0) = 1,   t ∈ [0, 1] ,
and compare its performance with Euler’s method.

5.2 Runge-Kutta
Euler’s method is what we would call an O (h) or first order method, as the
error between the exact and the approximate solution is of order h.
A simple technique, which we shall illustrate below, can give a method where
the error is or order h2 . Let us begin. The idea is to introduce some extra
parameters that will allow us some freedom of choice. So we set up the method
as follows  L

 tn+1 = tn + h, h = N ,
 k1 = hf (tn , yn ) ,


k2 = hf (tn + αh, yn + βk1 ) ,
y = yn + ak1 + bk2 ,

 n+1



y0 = y (0) , t0 = 0.
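The scheme can be transcribed directly. The default parameters below are the choice a = b = α = β = 1/2 made later in this section (any other admissible choice can be passed in), and the test equation y′ = y is our own illustrative assumption:

```python
def runge_kutta_2(f, y0, L, N, a=0.5, b=0.5, alpha=0.5, beta=0.5):
    """Two-stage scheme of this section; parameters default to the text's later choice."""
    h = L / N
    t, y = 0.0, y0
    for _ in range(N):
        k1 = h * f(t, y)
        k2 = h * f(t + alpha * h, y + beta * k1)
        y = y + a * k1 + b * k2
        t = t + h
    return y

# Illustrative run: y' = y, y(0) = 1 on [0, 1]; the result should be close to e.
y_end = runge_kutta_2(lambda t, y: y, 1.0, 1.0, 1000)
```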
As before, our aim is to obtain a recurrence to estimate |y (tn ) − yn |.
First we expand yn+1 using Taylor's theorem:

yn+1 = yn + ak1 + bh [ f (tn , yn) + αh (∂f/∂t)(tn , yn) + βk1 (∂f/∂y)(tn , yn) ] + Rn ,

where

Rn = (bh/2) un^t Hess f (θn) un ,

and here θn is a point in the interior of the segment that joins the point (tn , yn) to the point (tn + αh, yn + βk1), and

un^t = [αh  βk1] .

The previous expression for yn+1 can be further expanded to give

yn+1 = yn + ahf (tn , yn) + bh [ f (tn , yn) + αh (∂f/∂t)(tn , yn) + βhf (tn , yn) (∂f/∂y)(tn , yn) ] + Rn ,

and we choose α, β, a and b such that a + b = 1 and αa + βb = 1/2 (the reason for the second choice will be made apparent below), so that

yn+1 = yn + hf (tn , yn) + (h²/2) [ ∂f/∂t + f (∂f/∂y) ] (tn , yn) + Rn ,

where we are using the notation

[ ∂f/∂t + f (∂f/∂y) ] (tn , yn) := (∂f/∂t)(tn , yn) + f (tn , yn) (∂f/∂y)(tn , yn) .

The attentive reader might have noticed that

y″(t) = [ ∂f/∂t + f (∂f/∂y) ] (t, y (t)) .

On the other hand, using Taylor's theorem and (5.1) we obtain

y (tn+1) = y (tn) + hf (tn , y (tn)) + (h²/2) [ ∂f/∂t + f (∂f/∂y) ] (tn , y (tn)) + R̃n ,

where

R̃n = (1/6) y‴(ηn) h³ .

As an aside, which will be useful later, we can compute y‴ using (5.1):

y‴(t) = [ ∂²f/∂t² + f ∂²f/∂y∂t + ( ∂f/∂t + f (∂f/∂y) ) (∂f/∂y) + f ( ∂²f/∂t∂y + f ∂²f/∂y² ) ] (t, y (t)) .

We compute Dn+1 = y (tn+1) − yn+1 , and using the mean value theorem conveniently, we obtain

Dn+1 = Dn + h (∂f/∂y)(tn , θn) Dn + (h²/2) [ ∂²f/∂y∂t + (∂f/∂y)² + f ∂²f/∂y² ] (tn , ξn) Dn + R̃n − Rn .

Assuming that there is an A which gives a common bound for |f |, |∂f/∂t|, |∂f/∂y|, |∂²f/∂t²|, |∂²f/∂t∂y|, |∂²f/∂y²|, and writing En = |y (tn) − yn|, we can estimate

En+1 ≤ En + hAEn + (h²/2)(A + 2A²) En + |R̃n| + |Rn| .

Next we show that Rn and R̃n are both O (h³). Using the expression for R̃n given above, and the assumed bounds on f and its derivatives, it is not difficult to get the estimate

|R̃n| ≤ (1/6)(A + 3A² + 2A³) h³ .
For Rn , we use the Cauchy-Schwarz inequality and the definition of the norm ‖·‖₂ to obtain

|Rn| ≤ (bh/2) ‖Hess f (θn)‖₂ ‖un‖₂² .

Since the Hessian is symmetric, we can bound its ‖·‖₂ norm by its Frobenius norm:

‖Hess f (θn)‖₂ ≤ √( (∂²f/∂t²)² + 2 (∂²f/∂t∂y)² + (∂²f/∂y²)² ) ,

and the assumed bounds yield

‖Hess f (θn)‖₂ ≤ 2A.

All we are left to estimate is ‖un‖₂² . This is not difficult. Indeed,

‖un‖₂² = α²h² + β²k1² ≤ h² (α² + β²A²) .

Thus,

|Rn| ≤ bA (α² + β²A²) h³ .

To finish our illustration of the Runge-Kutta technique, let us make a choice for a, b, α and β. We choose

a = b = α = β = 1/2.
This allows us to conclude that a solution to the recurrence

En+1 = ρEn + ωh³ ,   E0 = 0,

will give a bound from above for the error |y (tn) − yn|. Here

ρ = 1 + Ah + (h²/2)(A + 2A²),   ω = (1/6)(A + 3A² + 2A³) + (1/16) A (1 + A²) .

Solving this recurrence gives

Ej ≤ ωh³ (ρ^j − 1)/(ρ − 1) = [ ω / (A + (h/2)(A + 2A²)) ] h² (ρ^j − 1) .

If we also assume that h ≤ 1/A, this expression can be simplified a bit further, to give

Ej ≤ (ω/A) h² (e^{2A+1/2} − 1) .
A
Thus we have developed a method of solving (5.1) where the error goes as O (h²): this is a Runge-Kutta method of order 2. We summarize what we have done in this section in the following theorem.

Theorem 13. Given the differential equation (5.1) and the approximate solution by the Runge-Kutta method as above with a = b = α = β = 1/2, we have that

|y (tn) − yn| ≤ ( 11/48 + (1/2) A + (19/48) A² ) (e^{2A+1/2} − 1) h² ,

where A is a bound on |f |, |∂f/∂t|, |∂f/∂y|, |∂²f/∂t²|, |∂²f/∂t∂y|, and |∂²f/∂y²|.

Exercise. Develop a Runge-Kutta method of order 3 (i.e., the error is of order O (h³)).

Exercise. Compare the Runge-Kutta method developed in this section with the second order Euler method proposed in the previous section, for the equation

dy/dt = 1/(1 + t² y²),   y (0) = 1,

on [0, 1].
Chapter 6

Differential Equations:
Finite Difference Method

In this chapter we study some discretization methods.

6.1 Discretization of derivatives


Using Taylor’s theorem, we can write
h2 00
y (ti+1 ) = y (ti ) + hy 0 (ti ) + y (ti ) + O h3 ,

(6.1)
2
and in the same way we can write
h2 00
y (ti−1 ) = y (ti ) − hy 0 (ti ) + y (ti ) + O h3 .

(6.2)
2
Subtract both identities to obtain

y (ti+1 ) − y (ti−1 ) = 2hy 0 (ti ) + O h3 ,



(6.3)

which gives for the first derivative


y (ti+1 ) − y (ti−1 )
y 0 (ti ) = + O h2

(6.4)
2h
To discretize the second derivative, we follow a similar procedure, we just go a
little further in the Taylor expansion, to obtain
y (ti+1 ) + y (ti−1 ) − 2y (ti )
y 00 (ti ) = + O h2 .

2
(6.5)
h
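A quick numerical check of the O (h²) rates in (6.4) and (6.5); the function y = sin and the point t = 1 are our own illustrative choices:

```python
import numpy as np

t = 1.0
errors = []
for h in (0.1, 0.05):
    first = (np.sin(t + h) - np.sin(t - h)) / (2 * h)                 # approximates cos(t)
    second = (np.sin(t + h) + np.sin(t - h) - 2 * np.sin(t)) / h**2   # approximates -sin(t)
    errors.append((abs(first - np.cos(t)), abs(second + np.sin(t))))
# Halving h should divide each error by roughly 4, confirming second order.
```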

6.2 The Theory by the way of an example


We will show how to numerically solve the following boundary value problem,
including an estimation of the error.
y″ − y = 0 in [0, 1] ,
y (0) = 1,   y (1) = 2.        (6.6)


We use the equations in the previous section to obtain the following expression for the differential equation:

[y (ti+1) + y (ti−1) − 2y (ti)]/h² + O (h²) − y (ti) = 0,

where

O (h²) = (1/24) [ y⁽⁴⁾(ξi+1) + y⁽⁴⁾(ξi−1) ] h² ,

where ξi+1 is an intermediate point between ti and ti+1 and ξi−1 is analogously defined. So we propose the following scheme for a numerical solution to the problem above:

(yi+1 + yi−1 − 2yi)/h² − yi = 0,   y0 = 1,   yN = 2.        (6.7)
So if we call y the column vector whose i-th component is yi , we call b the vector such that b0 = 1, bN = 2 and bj = 0 for j ≠ 0, N , and we form the matrix A with components

a00 = 1 = aN N ,   ai,i−1 = 1/h² = ai,i+1 ,   ajj = −(2 + h²)/h² ,
then we can rewrite the scheme as the following linear system.

Ay = b.

Observe that A is invertible, and then our approximation is just

y = A−1 b.
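A sketch of the whole computation; the mesh size N = 100 is an arbitrary choice, and the exact solution y (t) = (2 sinh t + sinh (1 − t))/sinh 1 of (6.6) is used only to check the answer:

```python
import numpy as np

def solve_bvp(N):
    """Assemble and solve scheme (6.7) for y'' - y = 0, y(0) = 1, y(1) = 2."""
    h = 1.0 / N
    A = np.zeros((N + 1, N + 1))
    b = np.zeros(N + 1)
    A[0, 0] = A[N, N] = 1.0        # boundary rows enforce y_0 = 1, y_N = 2
    b[0], b[N] = 1.0, 2.0
    for i in range(1, N):
        A[i, i - 1] = A[i, i + 1] = 1.0 / h**2
        A[i, i] = -(2.0 + h**2) / h**2
    return np.linalg.solve(A, b)

y = solve_bvp(100)
exact_mid = (2 * np.sinh(0.5) + np.sinh(0.5)) / np.sinh(1.0)   # exact y(1/2)
```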

Now we need to find out how good the approximation is. We shall use the
following notation
Ej ≡ yj − y (tj ) .
So if we subtract from the scheme the differential equation (using the approximations), we obtain

(Ei+1 + Ei−1 − 2Ei)/h² − Ei = γi h² ,

where

γi = (1/24) [ y⁽⁴⁾(ξi+1) + y⁽⁴⁾(ξi−1) ] .

Hence, letting εh be the vector with components εh,0 = 0 = εh,N and εh,j = γj h⁴ , we have that the vector of errors E (with components Ej) satisfies

E = h⁻² A⁻¹ εh .

Let B = h⁻² A⁻¹ , and assume |γi| ≤ M . Then we have the estimate

‖E‖₂ ≤ M ‖B‖₂ h⁴ .

The question now is, how can we estimate M ?



6.2.1 The Maximum Principle


To estimate M in the previous section, we must find a way to estimate the fourth
derivative of y. Notice that the equation itself is telling us that its solution must
be smooth, and that actually
y (4) = y.
So we must find a way to estimate y a priori, that is, without having to solve the equation. For this we shall use the following theorem, which is a one dimensional version of a more general fact called the Maximum Principle.

Theorem 14. If u satisfies the differential inequality in (a, b)

(L + h) [u] ≡ u00 + g (x) u0 + h (x) u ≥ 0,

with g and h ≤ 0 continuous in [a, b], and if u assumes a nonnegative maximum


in (a, b), then u must be constant.

We delay the proof of this beautiful result to show how to use it. In the
example we are studying, g (x) = 0 and h (x) = −1 ≤ 0, hence the hypotheses
of the theorem hold. Therefore, y reaches its maximum at the boundary (indeed,
if the maximum were reached in the interior of the interval, it would be at least 2, hence nonnegative, and thus y would be constant, which it is not!). We can conclude then that

y ≤ 2, so y⁽⁴⁾ ≤ 2.

Now let us show that y is nonnegative. To do so, assume the opposite, that is, that there is a c ∈ (a, b) such that y (c) < 0. Notice that if we define u = −y, then u satisfies u″ − u = 0 and u (c) > 0; since u is negative at the boundary, u would attain a positive interior maximum, and hence would be constant, which is impossible. In consequence we must have y ≥ 0. Summarizing,
we have shown that

0 ≤ y ≤ 2, and so, 0 ≤ y (4) ≤ 2.

We can thus take M = 2.

6.2.2 Bounding other derivatives.


An interesting question that could be posed is the following: How can we bound
the first derivative of a solution to (6.6)? This question is not only interesting
on its own, but also the method used could be useful in other problems; besides,
the method is simple and elegant.
By the Fundamental Theorem of Calculus, for any c ∈ [0, 1] we have that

y′(t) = y′(c) + ∫_c^t y″(τ) dτ.

We choose c wisely: by the mean value theorem, there is a c ∈ (0, 1) where

y′(c) = (y (1) − y (0))/(1 − 0) = 1.

Hence, we can bound

|y′(t)| ≤ |y′(c)| + ∫₀¹ |y″(τ)| dτ ≤ 1 + 2 = 3.

Having a bound on the first and second derivatives then allows us to bound any
other derivative of y that we wish or need.

6.2.3 Bounding derivatives 2


The purpose of this section is to develop a method to bound the first derivative
of solutions to equations of the form

y 00 + p (t) y 0 + q (t) y = 0.

Assume of course that we know a bound on the derivative of y at some point,


say c ∈ [a, b]. Then we can rewrite the previous equation as
y′(t) = y′(c) + ∫_c^t (−p (τ) y′(τ) − q (τ) y (τ)) dτ.

We let P = max_{x∈[a,b]} |p (x)|, Q = max_{x∈[a,b]} |q (x)|, and assume that we have a bound on |y|, say M . Then we can establish the following inequality:
|y′(t)| ≤ |y′(c)| + P ∫_c^t |y′(τ)| dτ + ∫_c^t QM dτ.

Now let η > 0 be any number such that 0 < P η < 1. We are going to bound y′ on [c, c + η] (to bound it on [c − η, c] we proceed in a similar way). Writing B1 = max_{t∈[c,c+η]} |y′(t)|, we obtain

B1 ≤ |y′(c)| + P ηB1 + QM η.

Let us also write B0 for a bound on |y′(c)|. Hence we have

B1 ≤ B0 /(1 − P η) + QM η/(1 − P η).

In general, assuming we already have a bound on y′ over [c + kη, c + (k + 1) η], bound that we shall call Bk , we obtain

|y′(t)| ≤ |y′(c + (k + 1) η)| + P ∫_{c+(k+1)η}^t |y′(τ)| dτ + ∫_{c+(k+1)η}^t QM dτ,

and thus, we obtain the following estimate for a bound Bk+1 on |y′| on [c + (k + 1) η, c + (k + 2) η]:

Bk+1 ≤ Bk + P ηBk+1 + QM η,

and hence

Bk+1 ≤ Bk /(1 − P η) + A (η) ,

where A (η) = QM η/(1 − P η). If we have that

η = (b − c)/N,

for N large, then

BN = θ^N B0 + A (η) (θ^N − 1)/(θ − 1),   θ = 1/(1 − P η),

will give a global bound on y′ on [c, b]. It is not difficult to prove that

lim_{N→∞} θ^N = e^{P (b−c)} ,

and

lim_{N→∞} A (η) (θ^N − 1)/(θ − 1) = (QM/P) (e^{P (b−c)} − 1),
and hence

max_{t∈[c,b]} |y′(t)| ≤ e^{P (b−c)} |y′(c)| + (QM/P) (e^{P (b−c)} − 1)
≤ e^{P (b−a)} |y′(c)| + (QM/P) (e^{P (b−a)} − 1).
In the same way we get

max_{t∈[a,c]} |y′(t)| ≤ e^{P (b−a)} |y′(c)| + (QM/P) (e^{P (b−a)} − 1),

which gives as a bound:

max_{t∈[a,b]} |y′(t)| ≤ e^{P (b−a)} |y′(c)| + (QM/P) (e^{P (b−a)} − 1).
If we wisely choose c so that y′(c) = (y (b) − y (a))/(b − a), we finally obtain

max_{t∈[a,b]} |y′(t)| ≤ e^{P (b−a)} |(y (b) − y (a))/(b − a)| + (QM/P) (e^{P (b−a)} − 1).
We state the previous estimate in the following theorem.

Theorem 15. Let y be a smooth solution to the boundary value problem

y″ + p (t) y′ + q (t) y = 0,
y (a) = ya ,   y (b) = yb .

Let M, P and Q be such that

max_{t∈[a,b]} |y (t)| ≤ M,   max_{t∈[a,b]} |p (t)| ≤ P,   and   max_{t∈[a,b]} |q (t)| ≤ Q.

Then

max_{t∈[a,b]} |y′(t)| ≤ e^{P (b−a)} |(y (b) − y (a))/(b − a)| + (QM/P) (e^{P (b−a)} − 1).

When P = 0, the estimate becomes

max_{t∈[a,b]} |y′(t)| ≤ |(y (b) − y (a))/(b − a)| + QM (b − a).

Exercise. Extend the previous theorem to the following case:

y″ + p (t) y′ + q (t) y = f (t),
y (a) = ya ,   y (b) = yb .

Exercise. Grönwall’s inequality says the following, If u ≥ 0 satisfies the integral


inequality
Z t
u (t) ≤ α + β u (s) ds,
0

α, β ≥ 0, then
u (t) ≤ αeβt .

Give a proof of Grönwall’s inequality and use it to give a proof of Theorem 15


(if you need to modify the statement of the theorem, do it).

6.2.4 An extended Maximum Principle


To treat more general equations, the Maximum Principle as presented in the previous section would not be enough. For instance, if in the differential operator h (x) ≥ 0, the technique employed before would not work. So assume now that h (x) ≥ 0, and that there is a function w that satisfies

• w > 0,

• (L + h) [w] ≤ 0.

Then we have the following theorem.

Theorem 16. Assume that u defined in [a, b] satisfies the differential inequality (L + h) [u] ≥ 0. Then the function v = u/w satisfies the differential inequality

v″ + ( 2 (w′/w) + g ) v′ + (1/w) (L + h) [w] v ≥ 0.

Hence u cannot attain a nonnegative maximum in (a, b) unless it is constant.

Exercise. We are going to solve the equation

y″ + y = 0 in [0, 1] ,
y (0) = 1,   y (1) = 2.        (6.8)

Notice that h = 1 ≥ 0, and hence we cannot work out this problem as we did in the previous section. However, use the extended Maximum Principle to bound the error when using a discretization scheme with N = 4, that is, h = 0.25. I suggest using w = cos t. A question arises: you will not find such a w for every interval [0, a]; for which values of a can you guarantee the existence of a w?

6.3 Problem
For this problem you will have to do a little bit of research. Find out about the Maximum Principle for the heat equation, and about compatibility conditions for solutions to the heat equation (or parabolic equations).
In this section you are asked to work out by yourself the problem of solving
numerically the Boundary Value Problem

∂u/∂t = ∂²u/∂x² in (0, 1) × (0, ∞),
u (0, t) = 0 = u (1, t),
u (x, 0) = sin (2πx) .

You will approximate u (0.3, 0.2). To do so, choose stepsizes ∆t = 0.05 and ∆x = 0.1, and to discretize the time derivative at (xj , ti) use

[u (xj , ti+1) − u (xj , ti)]/∆t.
Estimate the error of the method you use. Remember that you will have to bound ∂⁴u/∂x⁴, and to do this you will need to use the Maximum Principle for parabolic equations; but to apply it you need to check that u is in C⁴, and hence a compatibility condition.
Chapter 7

The Finite Element Method

7.1 Introduction
To introduce the method we shall work out an example (a very simple one!).
First we shall work out the approximation and then we will perform the error
analysis.
−y″ + y = 1 in (0, 1),
y (0) = 0 = y (1).        (7.1)
Multiply (7.1) by a differentiable function ϕ vanishing at the endpoints and integrate by parts to obtain

∫₀¹ y′(τ) ϕ′(τ) + y (τ) ϕ (τ) dτ = ∫₀¹ ϕ (τ) dτ.

This is called the weak form of (7.1).


Take a partition of the interval [0, 1] into N subintervals of equal length. We
denote these intervals as

[tj , tj+1 ] , j = 0, . . . , N − 1,

and in what follows we shall denote by h the common length of each of these
subintervals.
We define the piecewise linear functions ej,N , j = 1, . . . , N − 1 as

 N (t − tj−1 ) if tj−1 ≤ t < tj ,
ej,N (t) = N (tj+1 − t) if tj ≤ t < tj+1 , (7.2)
0 otherwise.

The N in front of each of the linear pieces in the previous definition is there so
that
1−0
N (tj − tj−1 ) = N h = N × = 1,
N
so it has to be appropriately modified accordingly with the interval where the
boundary value problem is posed. In general, if the interval we are working with
is [a, b], the coefficient multiplying each of the linear parts should be N/ (b − a).
Define the space VN as the finite dimensional vector space generated by the
ej,N . The Finite Element Method consists in finding an approximation for the


solution to (7.1) in VN . To do so we write the approximate solution

yN = Σ_{j=1}^{N−1} cj ej,N (t) ,

and then look for the coefficients ck , k = 1, . . . , N − 1, by requiring that the approximate solution satisfies the weak equations

∫₀¹ yN′(τ) e′k,N (τ) + yN (τ) ek,N (τ) dτ = ∫₀¹ ek,N (τ) dτ.

Working out the proposed example, with N = 4 we obtain the following system:

[  49/6   −95/24    0    ] [ c1 ]   [ 1/4 ]
[ −95/24   49/6   −95/24 ] [ c2 ] = [ 1/4 ]
[   0     −95/24   49/6  ] [ c3 ]   [ 1/4 ]

Solving this system we obtain

(c1 , c2 , c3) = (873/10183, 1158/10183, 873/10183).
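The assembly generalizes to any N. From definition (7.2) one computes a (ej , ej) = 2N + 2h/3, a (ej , ej±1) = −N + h/6, and ∫₀¹ ek,N = h; the sketch below rebuilds and solves the N = 4 system above:

```python
import numpy as np

def fem_solve(N):
    """Assemble and solve the weak equations for -y'' + y = 1, y(0) = y(1) = 0."""
    h = 1.0 / N
    diag = 2 * N + 2 * h / 3          # a(e_j, e_j): for N = 4 this is 49/6
    off = -N + h / 6                  # a(e_j, e_{j+1}): for N = 4 this is -95/24
    A = (np.diag([diag] * (N - 1))
         + np.diag([off] * (N - 2), 1)
         + np.diag([off] * (N - 2), -1))
    rhs = np.full(N - 1, h)           # integral of each hat function
    return np.linalg.solve(A, rhs)

c = fem_solve(4)
# c should reproduce (873/10183, 1158/10183, 873/10183)
```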

7.2 Error Analysis


The following lemma, due to Céa, is the cornerstone of the error analysis. First, let us introduce some terminology. Given a differential operator on an interval (c, d),

− (p (t) y′)′ + q (t) y,

it defines a bilinear form

a (u, v) = ∫_c^d p (τ) u′(τ) v′(τ) + q (τ) u (τ) v (τ) dτ.

Assume this bilinear form is positive definite (it is not always so); it then defines a norm

‖u‖a = √(a (u, u)) .
Then we have the following.
Lemma 6. The approximation yN given by the finite element method is the
best fit approximation to y in the space VN given by the norm induced by the
bilinear form a. In other words, for any w ∈ VN

‖y − yN‖a ≤ ‖y − w‖a .

The proof is an application of the Cauchy-Schwarz inequality.

Let us apply it to the example of the previous section. To do so, we introduce the interpolant of the solution, which is given by

IN y = Σ_{j=1}^{N−1} y (tj) ej,N .

We will first estimate

‖y − IN y‖a

in terms of

(y − IN y)″ = y″ .

Using Fourier analysis we have the following estimate.

Lemma 7. Let ζ be a function defined on [c, d] with ζ (c) = 0 = ζ (d). Then the following estimates hold:

∫_c^d (ζ′(τ))² dτ ≤ ((d − c)²/π²) ∫_c^d (ζ″(τ))² dτ,

∫_c^d (ζ (τ))² dτ ≤ ((d − c)⁴/π⁴) ∫_c^d (ζ″(τ))² dτ.

Proof. Let us prove the first inequality; the second one is left as an exercise. The fact that ζ (c) = 0 = ζ (d) implies that we can write ζ as

ζ (t) = Σ_{k=1}^∞ ak sin (kπ (t − c)/L) ,   L = d − c.

Differentiating, we obtain

ζ′(t) = Σ_{k=1}^∞ ak (kπ/L) cos (kπ (t − c)/L) ,

and

ζ″(t) = − Σ_{k=1}^∞ ak (kπ/L)² sin (kπ (t − c)/L) .

But then we have

∫_c^d (ζ′(τ))² dτ = (L/2) Σ_{k=1}^∞ (kπ/L)² |ak|² ,

and

∫_c^d (ζ″(τ))² dτ = (L/2) Σ_{k=1}^∞ (kπ/L)⁴ |ak|² ,

and by comparing these two expressions the first inequality follows.

Using the previous lemma, we can argue as follows. First, we compute ‖y − IN y‖a on [tj , tj+1]:

‖y − IN y‖²_{a,j} = ∫_{tj}^{tj+1} ((y − IN y)′(τ))² + ((y − IN y)(τ))² dτ,

and noticing that (y − IN y)(tk) = 0, applying the previous lemma we can estimate on [tj , tj+1], calling h = tj+1 − tj :

‖y − IN y‖²_{a,j} ≤ (h/π)² (1 + (h/π)²) ∫_{tj}^{tj+1} (y″(τ))² dτ,

and hence by adding over the subintervals (over j) we get

‖y − IN y‖²_a ≤ (h/π)² (1 + (h/π)²) ∫₀¹ (y″(τ))² dτ.

We must estimate ∫₀¹ (y″(τ))² dτ. In order to do so, from (7.1) we obtain

∫₀¹ (y″(τ))² dτ = ∫₀¹ (1 − y)² dτ.

Using the inequality 2ab ≤ a² + b², we can estimate the right hand side of the previous identity as

∫₀¹ (1 − y)² dτ ≤ ∫₀¹ 1 − 2y + y² dτ ≤ ∫₀¹ 3 + y² dτ,

and all that is left to do is to estimate ∫₀¹ y² dτ. But then notice that from

∫₀¹ (y′(τ))² + (y (τ))² dτ = ∫₀¹ y (τ) dτ,

which is obtained from the weak form of the equation by taking ϕ = y, we obtain that

∫₀¹ y² dτ ≤ ∫₀¹ y (τ) dτ ≤ ∫₀¹ (1/2)(y (τ))² + 1/2 dτ,

from which we get

∫₀¹ y² dτ ≤ 1.

Therefore

∫₀¹ (y″(τ))² dτ ≤ 4.

By Céa's lemma,

‖y − yN‖²_a ≤ ‖y − IN y‖²_a ≤ 4 (h/π)² (1 + (h/π)²) .

To obtain an estimate on the norm ‖y − yN‖∞ = max_{t∈[0,1]} |y (t) − yN (t)|, we proceed as follows. Notice that (y − yN)(0) = 0. By the Fundamental Theorem of Calculus, we have

y (t) − yN (t) = ∫₀^t (y − yN)′(τ) dτ,

but then, by the Cauchy-Schwarz inequality,

|y (t) − yN (t)| ≤ √t ( ∫₀^t ((y − yN)′(τ))² dτ )^{1/2} ≤ √t ‖y − yN‖a .

So we have an estimate for the absolute error:

|y (t) − yN (t)| ≤ 2 √t (h/π) √(1 + (h/π)²) .

7.2.1 Problem
Perform the error analysis for the finite element method when applied to
− ((2t² + 2) y′)′ + t y = t in (0, 1),
y (0) = 0 = y (1).        (7.3)

Before you begin your analysis show that the bilinear form induced by the
equation is positive definite!
Chapter 8

Monte Carlo Methods

8.1 Monte Carlo for ODE’s


We start by showing via an example how this method works. So our purpose is
to solve the following boundary value problem.
y″ + y′ = 0,
y (0) = 1,   y (1) = 2.

We will use the finite difference method, so we discretize the second derivative as

(yi+1 + yi−1 − 2yi)/h² ,

and the first derivative as

(yi+1 − yi−1)/(2h).

Then we obtain the system

(yi+1 + yi−1 − 2yi)/h² + (yi+1 − yi−1)/(2h) = 0,

which can be rewritten as

(1/2 + h/4) yi+1 + (1/2 − h/4) yi−1 − yi = 0.

For pedagogical reasons, let us only divide the interval into three subintervals.
Then, the system we must solve becomes

Ay = b,

where
 
1 0 0 0
 0.5 − 0.25h −1 0.5 + 0.25h 0 
A= ,
 0 0.5 − 0.25h −1 0.5 + 0.25h 
0 0 0 1

b = (1, 0, 0, 2)^t ,   y = (1, y1 , y2 , 2)^t .
We can rewrite the previous system as a fixed point problem

y = P y,
where P = (pij )i,j=0,...,3 is the matrix
 
1 0 0 0
 0.5 − 0.25h 0 0.5 + 0.25h 0 
 .
 0 0.5 − 0.25h 0 0.5 + 0.25h 
0 0 0 1
This problem can be solved by successive approximations, and we get that the
solution would be given by
y = P∞ y,
where P∞ = limn→∞ P^n , provided that this limit exists. We shall argue, using probabilistic arguments, that P∞ does exist, and that it has a natural interpretation.
Notice then that the previous matrix can be interpreted as the transition matrix of a random walk (a Markov chain), as follows: the entry pij is the probability of going from yi to yj . The fact that the 00 and 33 entries of the matrix are both 1 reflects the fact that there is absorption at the boundary, i.e., once either 0 or 1, the boundary points of the interval, is reached, you remain there.
Hence, the ij element of the matrix P n gives the probability of starting at
yi and ending up at yj after n steps. To compute the limit of P n as n goes to
∞ we will need the following
Lemma 8. Let A > 0 be fixed. Then, for a random walk starting at 0, of stepsize 1, the probability that after n steps it is at distance less than A from 0 goes to 0 as n → ∞.
Before we begin with a proof of this lemma, we shall need the following elementary inequality for the central binomial coefficient:

(2k choose k) ≤ 4^k / √(2k + 1).
We proceed by induction. Assuming the bound at k = n, at k = n + 1 we have

(2n+2 choose n+1) = [(2n + 2)(2n + 1)/(n + 1)²] (2n choose n) ≤ 2 · 4^n √(2n + 1)/(n + 1),

and hence, in order to have

(2n+2 choose n+1) ≤ 4^{n+1}/√(2n + 3),

all we need to check is that
all we need to check is that
√(2n + 1) √(2n + 3) ≤ 2 (n + 1) ,

or equivalently
4n2 + 8n + 3 ≤ 4n2 + 8n + 4,
which is obviously true. The stated bound follows then by induction.
Proof of Lemma 8. Assume that we go to the right with probability p and to the left with probability 1 − p. Then the probability that the walk remains within distance A from the starting point is given by

Σ_{(n−A)/2 ≤ m ≤ (n+A)/2} (n choose m) p^m (1 − p)^{n−m} .

First of all we have the inequalities

(n choose m) ≤ 2^n /√n,

and

p^m (1 − p)^{n−m} ≤ (m/n)^m (1 − m/n)^{n−m} ;

the first one follows from the inequality proved above, and the second from elementary calculus.
To estimate how large the expression on the right hand side of the previous inequality is, we analyze the behavior of the function

f (x) = [ x^x (1 − x)^{1−x} ]^n

near x = 1/2. By Taylor's theorem we have that

f (x) = f (1/2) + f′(ξ) (x − 1/2) = 2^{−n} + f′(ξ) (x − 1/2),

and we are interested in x ∈ (1/2 − ε, 1/2 + ε), with ε = A/(2n). Our work now is to find an estimate for f′(ξ), ξ ∈ (1/2 − ε, 1/2 + ε).
A straightforward computation gives

f′(x) = n [ x^x (1 − x)^{1−x} ]^n log (x/(1 − x)) .

First observe that

x^x (1 − x)^{1−x} ≤ ( (1/2 + ε)^{1/2+ε} )² = (1/2 + ε)^{1+2ε} .
Also, as h (x) = log (x/(1 − x)) is increasing on (1/2 − ε, 1/2 + ε), and

h (1/2 − ε) = −h (1/2 + ε),

we find that

|log (x/(1 − x))| ≤ log ( (1/2 + ε)/(1/2 − ε) ) ≤ 4ε,

if n is large enough.
From the previous estimates we find that

|f′(x)| ≤ 4nε (1/2 + ε)^{n+2nε} = 2A (1/2)^{n+A} (1 + A/n)^{n+A} .

Thus, we obtain that

|f (x)| ≤ 1/2^n + (A²/(n 2^n)) (e/2)^A .

All this allows us to conclude that the required probability is less than

(A/√n) ( 1 + (A²/n) (e/2)^A ) ,

which clearly goes to 0 as n → ∞.

From the proof of the previous lemma, we can extract the following estimate

Corollary 1. Consider a random walk starting at the origin of the real line, with stepsize h. Then the probability that after n steps the walk is within distance A from the origin is at most

(A/(h√n)) ( 1 + (A²/(nh²)) (e/2)^{A/h} ) .

The previous lemma shows that eventually any random walk hits the boundary of the interval. This shows that, as n goes to ∞, if pn,ij is the ij entry of P^n and j ≠ 0, N , then pn,ij goes to zero. In other words, all the columns of the matrix P∞ = limn→∞ P^n are zero except the first and the last, and the i-th components of the 0-th and N-th columns give the probabilities of arriving at the left endpoint and at the right endpoint of the interval, respectively, when starting from the i-th point of the mesh. We shall denote by pi,a the probability of reaching the point a when starting from ti , and by pi,b the probability of reaching the point b when starting also from ti . Our discussion shows that

pi,a + pi,b = 1.

Hence, if we define a random variable Wi which takes the value ya with probability pi,a and yb with probability pi,b , then

yi = E [Wi] ,

where E denotes the expectation.
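This expectation can be estimated by direct simulation. The sketch below runs the absorbing walk for the example y″ + y′ = 0, y (0) = 1, y (1) = 2, until absorption; the mesh size, trial count, and random seed are our own illustrative choices:

```python
import random

def mc_estimate(i, N, ya=1.0, yb=2.0, trials=20000, seed=0):
    """Estimate y_i = E[W_i] by simulating walks absorbed at the endpoints.

    From the matrix P above, the walk steps right with probability 0.5 + 0.25 h
    and left with probability 0.5 - 0.25 h, where h = 1/N.
    """
    rng = random.Random(seed)
    h = 1.0 / N
    p_right = 0.5 + 0.25 * h
    total = 0.0
    for _ in range(trials):
        pos = i
        while 0 < pos < N:
            pos += 1 if rng.random() < p_right else -1
        total += ya if pos == 0 else yb
    return total / trials

# The exact solution of the continuous problem gives y(0.5) close to 1.62.
estimate = mc_estimate(5, 10)
```

The truncated random variable Wi^M introduced next makes this simulation rigorous by capping the number of steps at M.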


We can now use this information to write a Monte Carlo algorithm to compute yi . In order to do this, we shall introduce a new random variable, which we will denote by Wi^M , and which represents a random walk of at most M steps. We must first define the sample space. We let

S = {−h, h} ,

and then

SM = ∪_{i=1}^M S^i .

Then, the sample space is

S*M = { (h1 , . . . , hj) ∈ SM : a < ti + Σ_{k=1}^l hk < b for 1 ≤ l < j, and, if j < M , either ti + Σ_{k=1}^j hk = a or ti + Σ_{k=1}^j hk = b } .

We equip this set with a probability measure. Given an element (h1 , . . . , hj) of S*M , its probability is computed as

p1 · p2 · · · · · pj ,

where, for 1 ≤ i ≤ j, and in the case of the example we are working with, pi = 0.5 − 0.25h if hi = −h and pi = 0.5 + 0.25h if hi = h. We thus define the random variable

Wi^M : S*M −→ {ya , yb , 0} ,

where

Wi^M (h1 , . . . , hj) = ya if ti + Σ_{k=1}^j hk = a,

Wi^M (h1 , . . . , hj) = yb if ti + Σ_{k=1}^j hk = b,

and 0 otherwise. This random variable takes on the value ya with probability pi,a;M , the value yb with probability pi,b;M , and 0 with probability p0 . Lemma 8 implies that
pi,a;M → pi,a and pi,b;M → pi,b .
One actually has that
pi,a;M ≤ pi,a and pi,b;M ≤ pi,b ,
and using this, together with the estimate in Lemma 8, which implies that
\[
p_0 \le \frac{b-a}{h\sqrt{M}}\,\frac{(b-a)^{2}}{M h^{2}}\left(1+\frac{e}{2}\right)^{(b-a)/h},
\]

we can estimate that
\[
p_{i,a} - p_{i,a;M} \le \frac{b-a}{h\sqrt{M}}\,\frac{(b-a)^{2}}{M h^{2}}\left(1+\frac{e}{2}\right)^{(b-a)/h},
\]

and a similar estimate holds for p_{i,b} − p_{i,b;M}. Using the definition of
h, we obtain
\[
p_{i,a} - p_{i,a;M} \le \frac{N}{\sqrt{M}}\,\frac{N^{2}}{M}\left(1+\frac{e}{2}\right)^{N}.
\]
 
This last estimate is quite helpful when estimating the difference
E[W_i] − E[W_i^M].

Lemma 9. We have the following estimate:
\[
\bigl| E[W_i] - E\bigl[W_i^M\bigr] \bigr| \le \max\{|y_a|, |y_b|\}\,
\frac{N}{\sqrt{M}}\,\frac{N^{2}}{M}\left(1+\frac{e}{2}\right)^{N}.
\]

To get an estimate of y_i we now use the random variable W_i^M. We take J
independent copies of W_i^M, which we shall denote by W_{i,k}^M,
k = 1, 2, \dots, J, and define the random variable
\[
Z : (S_M^{*})^{J} \longrightarrow \mathbb{R}
\]
as
\[
Z(\omega_1, \dots, \omega_J) = \frac{1}{J} \sum_{k=1}^{J} W_{i,k}^M(\omega_k).
\]

It is easy to compute the mean and the variance of Z. In fact, we have that
\[
\mu_Z = E\bigl[W_i^M\bigr], \qquad \sigma_Z^{2} = \frac{\sigma^{2}}{J},
\]
where σ² is the variance of W_i^M. We can easily bound σ², as
\[
\sigma^{2} \le \max\bigl\{y_a^{2}, y_b^{2}\bigr\}.
\]




Chebyshev's inequality can now be applied as follows. Let ε > 0. Then, by
Chebyshev,
\[
P\bigl[|Z - \mu_Z| \ge \epsilon\bigr] \le \frac{\sigma^{2}}{\epsilon^{2} J}.
\]
So if we want an approximation within ε > 0 with a confidence of γ, we choose
N so that the approximation given by the finite difference method differs from
the exact solution by less than ε/3. Then we choose M so that
\[
\bigl| E[W_i] - E\bigl[W_i^M\bigr] \bigr| < \frac{\epsilon}{3},
\]
and finally J so that
\[
P\Bigl[ |Z - \mu_Z| < \frac{\epsilon}{3} \Bigr] \ge \gamma;
\]
to obtain this it suffices to take J large enough that
\[
1 - \frac{9\sigma^{2}}{\epsilon^{2} J} \ge \gamma, \quad \text{that is,} \quad
J \ge \frac{9\sigma^{2}}{\epsilon^{2}(1-\gamma)}.
\]
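The three-way error split just described can be turned into a small helper for choosing the number of samples J. The sketch below is our own illustration (the function name and the use of the crude variance bound σ² ≤ max{y_a², y_b²} are assumptions); it solves the Chebyshev requirement for J with ε/3 as the statistical error budget.

```python
import math

def samples_needed(eps, gamma, ya, yb):
    """Smallest J guaranteeing, via Chebyshev's inequality, that
    P(|Z - mu_Z| < eps/3) >= gamma, using the crude variance bound
    sigma^2 <= max(ya^2, yb^2)."""
    sigma2 = max(ya * ya, yb * yb)  # bound on Var(W_i^M)
    # Require 1 - 9*sigma2 / (eps^2 * J) >= gamma and solve for J.
    return math.ceil(9.0 * sigma2 / (eps * eps * (1.0 - gamma)))
```

For instance, with ε = 0.5, γ = 0.75 and |y_a| = |y_b| = 1, this gives J = 144.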

8.2 Elliptic equations


We shall work out the Dirichlet problem for the Laplacian first. To be more
precise, we shall consider the boundary value problem

∆u = 0 in Ω = [a, b] × [c, d] ,
u = ϕ (x, y) on ∂Ω.

To proceed we make partitions of the intervals [a, b] and [c, d]:

a = x0 < x1 < · · · < xN −1 < xN = b,

c = y0 < y1 < · · · < yM −1 < yM = d.


We discretize as usual
∂2u ui+1,j + ui−1,j − 2ui,j
∼ ,
∂x2 h2
8.2. ELLIPTIC EQUATIONS 75

∂2u ui,j+1 + ui,j−1 − 2ui,j


∼ ,
∂y 2 k2
where yj − yj−1 = k, xi − xi−1 = h and ui,j is the approximation for u (xi , yj ).
This yields the approximate system
ui+1,j + ui−1,j − 2ui,j ui,j+1 + ui,j−1 − 2ui,j
+ = 0,
h2 k2
which when solving for ui,j gives

k2 k2 h2 h2
ui,j = ui+1,j + u i−1,j + ui,j+1 + ui,j−1 .
2 (h2 + k 2 ) 2 (h2 + k 2 ) 2 (h2 + k 2 ) 2 (h2 + k 2 )

Let us rewrite this as
\[
u_{i,j} = p_r u_{i+1,j} + p_l u_{i-1,j} + p_u u_{i,j+1} + p_d u_{i,j-1},
\]
and observe that
\[
p_r + p_l + p_u + p_d = 1.
\]
This analysis leads to the following Monte Carlo algorithm. Starting from a
given point, with probability p_r we go right, with probability p_l we go
left, with probability p_u we go up, and with probability p_d we go down. We
continue until we hit the boundary, and we record the value of ϕ at the point
where the random walk hits the boundary. We repeat this process as many times
as necessary (starting from the same point), and then take the average
(dividing by the number of repetitions). This gives an approximate value for u
at the starting point of all the random walks.
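The algorithm just described can be sketched as follows (an illustrative Python implementation; the function name, the uniform grid indexing, and the fixed seed are our own choices). The transition probabilities are exactly the stencil weights p_r = p_l = k²/(2(h²+k²)) and p_u = p_d = h²/(2(h²+k²)).

```python
import random

def laplace_mc(i0, j0, a, b, c, d, N, M, phi, walks=4000, seed=1):
    """Approximate u(x_{i0}, y_{j0}) for the Dirichlet problem
    Delta u = 0 on [a,b] x [c,d], u = phi on the boundary, by
    averaging phi over the exit points of random walks on the grid."""
    h, k = (b - a) / N, (d - c) / M
    p = k * k / (2.0 * (h * h + k * k))   # probability of right (= left)
    q = h * h / (2.0 * (h * h + k * k))   # probability of up (= down)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        i, j = i0, j0
        # Walk on the grid indices until we reach the boundary.
        while 0 < i < N and 0 < j < M:
            r = rng.random()
            if r < p:
                i += 1
            elif r < 2.0 * p:
                i -= 1
            elif r < 2.0 * p + q:
                j += 1
            else:
                j -= 1
        # Tabulate the boundary value at the exit point.
        total += phi(a + i * h, c + j * k)
    return total / walks
```

As a sanity check, u(x, y) = x is harmonic, so running this with ϕ(x, y) = x on the unit square returns a value close to the x-coordinate of the starting point.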
