theorem: If f is a function that has k + 1 continuous derivatives over the
interval between a and x, then
R_{k+1} = [(x − a)^{k+1} / (k + 1)!] f^{(k+1)}(ζ)

for some ζ between a and x.
R_{k+1} represents the error in approximating f(x) by the first k + 1 terms in the right-hand side. We typically don't know the value of R_{k+1} exactly, since the exact value of ζ isn't easily determined. But if the difference x − a is small enough, its (k + 1)th power will also be small, and we might therefore find that the remainder term is small enough to be negligible.
Taylor’s theorem can also be written in a series form:
f(x) = Σ_{i=0}^{k} [(x − a)^i / i!] f^{(i)}(a) + R_{k+1}.

Setting x = a + h, this becomes

f(a + h) = f(a) + h f'(a) + (h²/2) f''(a) + . . . + (h^k/k!) f^{(k)}(a) + R_{k+1}
with
R_{k+1} = [h^{k+1} / (k + 1)!] f^{(k+1)}(ζ)

for some ζ between a and a + h, or more compactly

f(a + h) = Σ_{i=0}^{k} (h^i / i!) f^{(i)}(a) + R_{k+1}
or as an infinite series
f(a + h) = Σ_{i=0}^{∞} (h^i / i!) f^{(i)}(a).
Taylor’s theorem can be applied to find series expansions for functions in
terms of polynomials. Frequently seen ones include:
e^x = 1 + x + x²/2 + x³/6 + x⁴/24 + . . .

sin(x) = x − x³/6 + x⁵/120 − . . .

cos(x) = 1 − x²/2 + x⁴/24 − . . .

(these series converge for all x)

1/(1 − x) = 1 + x + x² + x³ + . . .

(converges for |x| < 1)

log(x + 1) = x − x²/2 + x³/3 − x⁴/4 + . . .

(converges for |x| < 1)
(In this class, trigonometric functions are always for the argument x
in radians, and log means the natural logarithm [base e].)
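A quick way to see these expansions in action is to compare truncated sums against the library functions. This is a minimal sketch; the helper names are illustrative, not part of the notes:

    import math

    def taylor_exp(x, n_terms=15):
        # Partial sum of the series e^x = 1 + x + x^2/2 + x^3/6 + ...
        return sum(x**i / math.factorial(i) for i in range(n_terms))

    def taylor_sin(x, n_terms=10):
        # Partial sum of the series sin(x) = x - x^3/6 + x^5/120 - ...
        return sum((-1)**i * x**(2*i + 1) / math.factorial(2*i + 1) for i in range(n_terms))

    for x in (0.5, 2.0):
        # errors shrink rapidly as more terms are kept
        print(x, taylor_exp(x) - math.exp(x), taylor_sin(x) - math.sin(x))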
Several important numerical methods can be directly derived from Tay-
lor’s theorem, including Newton’s method, finite differences, and Euler’s
method:
Newton’s method for estimating x where some function f is equal to 0
can be derived from Taylor’s theorem:
0 = f(x) = f(a) + (x − a) f'(a) + R_2  →  x = a − f(a)/f'(a) − R_2/f'(a) ≈ a − f(a)/f'(a),

with the approximation good if R_2 is small (small h [starting guess a is close to the solution x] or small second derivative f'')
Example: Apply Newton's method to estimate the square root of 26 iteratively.
Set f(x) = x² − 26 = 0 and start with a_0 = 5. Then a_1 = a_0 − f(a_0)/f'(a_0) = 5.1, a_2 = a_1 − f(a_1)/f'(a_1) = 5.0990196078 . . . , a_3 = a_2 − f(a_2)/f'(a_2) = 5.09901951359 . . . , giving a series of increasingly accurate numerical estimates of √26.
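A short sketch of this iteration (the function and variable names are illustrative):

    def newton(f, fprime, a, n_iter=5):
        # Newton's method: repeatedly replace the guess a by a - f(a)/f'(a)
        for _ in range(n_iter):
            a = a - f(a) / fprime(a)
        return a

    # Square root of 26: f(x) = x^2 - 26, f'(x) = 2x, starting guess a0 = 5
    print(newton(lambda x: x**2 - 26, lambda x: 2*x, 5.0))  # ~5.0990195135927845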
Centered finite difference for estimating the derivative of some function f
at x :
f(x + h) = f(x) + h f'(x) + (h²/2) f''(x) + R_{3,+}
f(x − h) = f(x) − h f'(x) + (h²/2) f''(x) − R_{3,−}

f(x + h) − f(x − h) = 2h f'(x) + R_{3,+} + R_{3,−}

f'(x) = [f(x + h) − f(x − h)]/(2h) − (R_{3,+} + R_{3,−})/(2h) ≈ [f(x + h) − f(x − h)]/(2h)

where

R_{3,+} = (h³/6) f^{(3)}(ζ_+),   R_{3,−} = (h³/6) f^{(3)}(ζ_−)

for some ζ_+ between x and x + h and ζ_− between x − h and x.
Example: To estimate the derivative of f(x) = e^x at x = 1, we can use the approximation derived here with h = 0.1: f'(1) ≈ [f(1.1) − f(0.9)]/(2 · 0.1) = 2.72281456 . . .
In this case, the true answer can be found analytically to be e¹ ≈ 2.7183.
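A minimal sketch of this estimate in code (centered_diff is an illustrative name):

    import math

    def centered_diff(f, x, h):
        # Centered finite-difference estimate of f'(x)
        return (f(x + h) - f(x - h)) / (2 * h)

    print(centered_diff(math.exp, 1.0, 0.1))   # ~2.7228, versus the exact e = 2.71828...
    print(centered_diff(math.exp, 1.0, 0.01))  # closer to e; making h much smaller eventually runs into roundoff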
Euler's method for numerically approximating the value of y(x), given the differential equation y'(x) = g(x, y(x)) and the initial value y(a) = y_0: starting from y(a) = y_0, repeatedly step forward by a small increment h using y_{i+1} = y_i + h g(x_i, y_i).
These examples illustrate tradeoffs that are found for many numerical
methods. We often only get approximate answers, but can make them
more accurate by doing more iterations (Newton method) or more steps
(Euler method), either of which requires more computation. For finite
difference, we can make our answer more accurate without doing more
computations by making h smaller, but this then requires our computa-
tions to be increasingly accurate to avoid roundoff error.
2 Error sources and control
To solve important problems reliably, need to be able to identify, control,
and estimate sources of error
Size of error: If the true answer is x* and the numerical answer is x,

Absolute error: |x − x*|

Fractional error: |x − x*| / |x*|
Fractional difference between two numerical estimates x_1 and x_2 (useful when the true answer is unknown): |x_1 − x_2| / |x_1|
Error types in solving math problems: gross error, roundoff error, trunca-
tion error
Gross error : Entering the wrong number, using the wrong commands or
syntax, incorrect unit conversion . . .
This does happen in engineering practice (but isn’t looked on kindly).
Example: Mars Climate Orbiter spacecraft lost in space in 1999 be-
cause Lockheed Martin calculated rocket thrust in lb while NASA thought
the results were in N
Detection methods: Know what answer to expect and check if what
you got makes sense; try to find the answer for a few simple cases where you
know what it should be (programs to be used to solve important problems
should have a test suite to do this); compare answers with others who
worked independently
Roundoff error in floating-point computation: results from the fact that computers only carry a finite number of significant digits – about 16 decimal digits for Python's default IEEE double-precision format (machine epsilon [numpy.finfo(float).eps] or unit roundoff ε ≈ 10⁻¹⁶)

Example: 0.3/0.1 - 3 is nonzero in binary floating-point computation, but is of order ε
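This is quick to check directly:

    import numpy

    print(numpy.finfo(float).eps)  # machine epsilon, about 2.2e-16
    print(0.3 / 0.1 - 3)           # nonzero, and roughly the size of machine epsilon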
The double-precision format uses 64 bits (binary digits) to represent
each number – 1 bit for the sign, 11 bits for the base-2 exponent (which
can range between −1022 and 1023), and 52 bits for the significand or
mantissa, which is interpreted as binary 1.bbbb. . . . It can represent number magnitudes between about 10⁻³⁰⁸ and 10³⁰⁸ (beyond that, numbers overflow to infinity or underflow to zero)
Multiplication and division under roundoff are subject to maximum fractional error of ε

Addition of two similar numbers also has maximum roundoff error similar to ε, but subtraction of two numbers that are almost the same can have roundoff error much bigger than ε (cancellation of significant digits)
A non-obvious example of subtractive cancellation: computing e^x using a finite number of leading terms in the Taylor series expansion e^x = Σ_{i=0}^{∞} x^i/i!
(in Python, sum((x ** numpy.arange(imax+1)) / scipy.special.factorial(numpy.arange(imax+1))))
when x is a large negative number, say −20
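A sketch of this experiment (taylor_exp is an illustrative name): the partial sums are dominated by cancellation error for x = −20, while summing for +20 and taking the reciprocal avoids the problem.

    import numpy
    import scipy.special

    def taylor_exp(x, imax=100):
        # Sum the first imax+1 terms of the Taylor series for e^x
        i = numpy.arange(imax + 1)
        return sum((x ** i) / scipy.special.factorial(i))

    x = -20.0
    print(taylor_exp(x))        # badly wrong: huge alternating terms cancel, leaving mostly roundoff
    print(numpy.exp(x))         # ~2.061e-09
    print(1 / taylor_exp(-x))   # accurate: no cancellation occurs when summing e^(+20)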
Another example: Approximating derivatives with finite difference
formulas with small increments h
In general, math operations which incur large roundoff error (like
subtracting similar numbers) tend to be those that are ill-conditioned,
meaning that small fractional changes in the numerical values used can
change the output by a large fractional amount
Examples of ill-conditioned problems: solving linear systems when the coefficient matrix has a large condition number; trigonometric operations with a large argument (say, sin(10¹⁰⁰)); the quadratic formula when b is much larger than a and c; finding a polynomial that interpolates a large number of given points
Mitigation: Use extended precision; reformulate problems to avoid
subtracting numbers that are very close together
Truncation error : Running an iterative numerical algorithm for only a
few steps, whereas convergence to the exact answer requires theoretically
an infinite number of steps
Often can be thought of as only considering the first few terms in the
Taylor series
“Steps” can be terms in the Taylor series, iterations of Newton’s
method or bisection for root finding, number of subdivisions in the com-
posite trapezoid rule or Simpson rule for numerical integration, etc.
Detection: Estimate truncation error (and roundoff error) by com-
paring the results of different numerical methods, or the same method run
for different numbers of steps on the same problem
Mitigation: Run for more steps (at the cost of more computations);
use a more accurate numerical method, if available
3 Linear systems
3.1 Properties
Linear systems arise directly from engineering problems (stresses, circuits,
pipes, traffic networks . . . ) as well as indirectly via numerical methods, for
example finite difference and finite element methods for solving differential
equations
Any system of m linear equations in n unknowns x1 , x2 , x3 , . . . xn (where
each equation looks like a1 x1 + a2 x2 + a3 x3 + . . . + an xn = b, with different
a and b coefficients) can be written in a standard matrix form Ax = b,
where
A is the m × n matrix with each row containing the (known) coeffi-
cients of the unknowns in one linear equation,
x is the n × 1 vector of unknowns,
b is the m × 1 vector of (known) constant terms in the equations.
Can also write the system as an m × (n + 1) “augmented matrix” A|b
(with x implied)
Does a solution exist? For square systems (m = n), there is a unique
solution equal to A−1 b if A has an inverse.
If A has no inverse (is singular ), then there will be either no solution
or infinitely many solutions.
A has no inverse when the equations of a linear system with A as the
coefficient matrix are not linearly independent of each other. E.g.: the 2
given equations in 2 unknowns are x1 + 2x2 = 3, 2x1 + 4x2 = 6 so that the
coefficient matrix is

[ 1  2 ]
[ 2  4 ]
Solution accuracy measures
Error size: ||x−x∗ ||, where x∗ is the true solution and x is computed;
Residual size: ||Ax − b|| (usually easier to calculate than the error)
Note: norms, denoted by ||x||, measure the size of a vector or matrix
(analogous to absolute value, |x|, for a scalar)
2-norm of a vector v:

||v||_2 = √( Σ_i v_i² )
If the coefficient matrix A of a square linear system is upper triangular, then it can generally be solved for the unknowns x_i by back substitution:

x_n = b_n / A_{n,n}
for i = n − 1, n − 2, . . . , 1:
    x_i = ( b_i − Σ_{j=i+1}^{n} A_{i,j} x_j ) / A_{i,i}
Similarly, a lower triangular matrix has only zero entries for columns beyond the row number, i.e. A_{i,j} = 0 whenever j > i.

If the coefficient matrix A of a square linear system is lower triangular, then it can generally be solved for the unknowns x_i by forward substitution:

x_1 = b_1 / A_{1,1}
for i = 2, 3, . . . , n:
    x_i = ( b_i − Σ_{j=1}^{i−1} A_{i,j} x_j ) / A_{i,i}
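A sketch of these two loops with numpy (the triangular system used here reappears in the elimination example below):

    import numpy as np

    def back_substitution(A, b):
        # Solve Ax = b when A is upper triangular
        n = len(b)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (b[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
        return x

    def forward_substitution(A, b):
        # Solve Ax = b when A is lower triangular
        n = len(b)
        x = np.zeros(n)
        for i in range(n):
            x[i] = (b[i] - A[i, :i] @ x[:i]) / A[i, i]
        return x

    U = np.array([[-1.0, 2, 2], [0, 3, 3], [0, 0, -1]])
    print(back_substitution(U, np.array([8.0, 9, -3])))  # [-2., 0., 3.]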
Example: to solve

[ −1  2  2 ]       [ 8 ]
[  1  1  1 ] x  =  [ 1 ]
[  1  3  2 ]       [ 4 ]

for x:

[ −1  2  2 | 8 ]                        [ −1  2  2 |  8 ]     [ −1  2  2 |  8 ]
[  1  1  1 | 1 ]  (augmented matrix) →  [  0  3  3 |  9 ]  →  [  0  3  3 |  9 ]
[  1  3  2 | 4 ]                        [  0  5  4 | 12 ]     [  0  0 −1 | −3 ]
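Back substitution on the final triangular system gives x = [−2, 0, 3]ᵀ, which can be checked against a library solver:

    import numpy as np

    A = np.array([[-1.0, 2, 2], [1, 1, 1], [1, 3, 2]])
    b = np.array([8.0, 1, 4])
    print(np.linalg.solve(A, b))  # [-2., 0., 3.]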
For each column j = 1, 2, . . . , n − 1, with pivot p = A_{j,j}:
for i = j + 1, j + 2, . . . , n:
    M = A_{i,j} / p   (multiplier)
    A_{i,:} = A_{i,:} − M × A_{j,:}
    b_i = b_i − M × b_j
Example:

[ 0  2  2 | −2 ]     [ 4 −2  4 |  4 ]     [ 4   −2    4 |  4 ]     [ 4  −2     4 |    4 ]
[ 1 −2 −1 |  2 ]  →  [ 1 −2 −1 |  2 ]  →  [ 0    2    2 | −2 ]  →  [ 0   2     2 |   −2 ]
[ 4 −2  4 |  4 ]     [ 0  2  2 | −2 ]     [ 0 −3/2   −2 |  1 ]     [ 0   0  −1/2 | −1/2 ]

Giving

    [ −1 ]
x = [ −2 ]
    [  1 ]
Example:

[ −1  2  2 ]     [  −1    2   2 ]     [  −1     2      2 ]
[  1  1  1 ]  →  [ (−1)|  3   3 ]  →  [ (−1)|   3      3 ]
[  1  3  2 ]     [ (−1)|  5   4 ]     [ (−1)  5/3 |   −1 ]

This gives us the factors in the form (L\U), which can be expanded to

     [  1    0    0 ]         [ −1  2   2 ]
L =  [ −1    1    0 ] ,  U =  [  0  3   3 ]
     [ −1  5/3    1 ]         [  0  0  −1 ]
You can then check that LU is in fact equal to the original matrix.
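A minimal Doolittle-style LU factorization without pivoting (a sketch; lu_nopivot is an illustrative name) reproduces these factors:

    import numpy as np

    def lu_nopivot(A):
        # LU factorization without pivoting: returns unit lower triangular L and upper triangular U
        U = A.astype(float).copy()
        n = U.shape[0]
        L = np.eye(n)
        for j in range(n - 1):
            for i in range(j + 1, n):
                L[i, j] = U[i, j] / U[j, j]        # multiplier
                U[i, :] = U[i, :] - L[i, j] * U[j, :]
        return L, U

    A = np.array([[-1.0, 2, 2], [1, 1, 1], [1, 3, 2]])
    L, U = lu_nopivot(A)
    print(L)          # [[1, 0, 0], [-1, 1, 0], [-1, 5/3, 1]]
    print(U)          # [[-1, 2, 2], [0, 3, 3], [0, 0, -1]]
    print(L @ U - A)  # zeros: LU equals the original matrix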
For each column j = 1, 2, . . . , n − 1, with pivot p = A_{j,j}:
for i = j + 1, j + 2, . . . , n:
    M = A_{i,j} / p   (multiplier)
    Save M as L_{i,j}
    A_{i,:} = A_{i,:} − M × A_{j,:}

Then U is the transformed A (which is upper triangular) and L is the matrix built up from the multipliers M, with ones added along the main diagonal.
Example (with row interchanges; the parenthesized numbers track the original row order):

[ 0  2  2 ] (1)     [ 4 −2  4 ] (3)     [   4    −2    4 ] (3)
[ 1 −2 −1 ] (2)  →  [ 1 −2 −1 ] (2)  →  [ 1/4 | −3/2  −2 ] (2)  →
[ 4 −2  4 ] (3)     [ 0  2  2 ] (1)     [   0 |   2    2 ] (1)

[   4    −2    4 ] (3)      [   4    −2       4 ] (3)
[   0 |   2    2 ] (1)  →   [   0 |   2       2 ] (1)
[ 1/4 | −3/2  −2 ] (2)      [ 1/4  −3/4 |  −1/2 ] (2)
The Cholesky decomposition of a symmetric positive definite matrix takes about half as many arithmetic operations as LU decomposition, and can be used to help solve linear systems with this coefficient matrix just like the LU decomposition can.
Given a symmetric positive definite matrix A, its lower triangular
Cholesky factor L (A = LLT ) can be computed as:
for j = 1, 2, . . . , n:

    L_{j,j} = √( A_{j,j} − Σ_{k=1}^{j−1} L_{j,k}² )

    for i = j + 1, j + 2, . . . , n:

        L_{i,j} = ( A_{i,j} − Σ_{k=1}^{j−1} L_{i,k} L_{j,k} ) / L_{j,j}
Example:

        [ 1  2   3 ]              [ 1  0  0 ]
If A =  [ 2  5   7 ] ,  then L =  [ 2  1  0 ]
        [ 3  7  14 ]              [ 3  1  2 ]
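numpy computes the same factor:

    import numpy as np

    A = np.array([[1.0, 2, 3], [2, 5, 7], [3, 7, 14]])
    L = np.linalg.cholesky(A)  # lower triangular Cholesky factor
    print(L)                   # [[1, 0, 0], [2, 1, 0], [3, 1, 2]]
    print(L @ L.T - A)         # zeros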
Newton’s second law for a system with forces linear in the displacements
(as in ideal springs connecting n different masses, or a discretized ap-
proximation to a linear beam or to a multistory building) can be written
as
M x'' = −K x   (where x'' stands for the acceleration, d²x/dt²) – or

x'' = −A x,   where A ≡ M⁻¹K.

For this system, if Av = λv, then x(t) = v sin(√λ t) and x(t) = v cos(√λ t) are solutions.
Since this is a system of linear differential equations, any linear com-
bination of solutions is also a solution. (There are normally n eigenvalue-
eigenvector pairs λi , vi of A, and we need 2n initial conditions [i.e. the
values of x(0) and x0 (0)] to find a unique solution x(t).)
The general solution is therefore Σ_{i=1}^{n} [ c_i v_i sin(√λ_i t) + d_i v_i cos(√λ_i t) ], or equivalently Σ_{i=1}^{n} [ a_i v_i e^{i√λ_i t} + b_i v_i e^{−i√λ_i t} ], where the c_i, d_i or a_i, b_i can be determined from the initial conditions
The eigenvectors are modes of oscillation for the system, and the
eigenvalues are the squared frequencies for each mode.
In analyzing vibrational systems, the first few modes (the ones with
the lowest frequencies/eigenvalues) are usually the most important be-
cause they are the most likely to be excited and the slowest to damp. The
fundamental mode is the one with lowest frequency.
Modes of e.g. a beam or structure can be found experimentally by
measuring the responses induced by vibrations with different frequencies
(modal analysis).
For a multistory building with stories indexed 1 to n (the shear build-
ing model), the differential equations are
Example:
If

A = [ −1  2 ]
    [  1  1 ] ,

eigenvalues λ and eigenvectors v must be solutions to

Av = λv, or (A − λI)v = 0.

Assuming that v isn't a zero vector, this implies that (A − λI) is not invertible, so its determinant must be zero. But the determinant of (A − λI) is (−1 − λ)(1 − λ) − 2, so we have the characteristic polynomial

λ² − 3 = 0  →  λ = ±√3.

To find the eigenvectors corresponding to each λ, we solve the linear system to find v. We have Av = ±√3 v, or

[ −1 ∓ √3      2     ]
[    1       1 ∓ √3  ] v = 0,

where the second row is a multiple of the first, so there is not a unique solution. We have

v = [      1      ]
    [ (1 ± √3)/2  ]

or any multiples thereof.
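The same eigenvalues and eigenvectors can be checked numerically (the ordering and scaling of the output may differ):

    import numpy as np

    A = np.array([[-1.0, 2], [1, 1]])
    eigvals, eigvecs = np.linalg.eig(A)
    print(eigvals)   # approximately +1.732 and -1.732, i.e. ±sqrt(3)
    print(eigvecs)   # columns are the corresponding eigenvectors, scaled to unit length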
In general, an n×n matrix has n (complex) eigenvalues, which are the roots
of a characteristic polynomial of order n. We could find the eigenvalues by
writing and solving for this characteristic polynomial, but more efficient
numerical methods exist, for example based on finding a QR factorization
of the matrix (which we won’t cover in this class).
A conceptually simple numerical method for finding the largest (in abso-
lute value) eigenvalue of any given square matrix is the power method. It
involves the iteration:
Start with an n × 1 vector v (that isn't all zeros)
Do until convergence:
    v ← Av
    v ← v/||v||   (this step scales v to have a norm of 1)
The corresponding eigenvalue can be estimated as λ = ||Av|| / ||v||
Example: If

A = [ 1  2 ] ,   v_0 = [ 1 ] ,
    [ 1  1 ]           [ 1 ]

then (using the vector ∞-norm ||v||_∞ ≡ max(|v_i|)) successive iterations produce for v

[1, 2/3]ᵀ , [1, 5/7]ᵀ , [1, 12/17]ᵀ , [1, 29/41]ᵀ , · · ·

converging toward the eigenvector [1, √2/2]ᵀ, which has the eigenvalue √2 + 1.
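A sketch of the iteration for this example (power_method is an illustrative name):

    import numpy as np

    def power_method(A, v, n_iter=50):
        # Power method: repeatedly apply A and rescale to estimate the largest-magnitude eigenvalue
        for _ in range(n_iter):
            v = A @ v
            v = v / np.linalg.norm(v, np.inf)  # scale so the largest entry has magnitude 1
        lam = np.linalg.norm(A @ v, np.inf) / np.linalg.norm(v, np.inf)
        return lam, v

    lam, v = power_method(np.array([[1.0, 2], [1, 1]]), np.array([1.0, 1]))
    print(lam, v)  # ~2.4142 (= sqrt(2) + 1) and ~[1, 0.7071]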
A symmetric real matrix will have only real eigenvalues. Otherwise, eigen-
values of a real matrix, being the roots of a polynomial, may also come in
complex conjugate pairs.
A symmetric matrix is positive definite if (and only if) all its eigen-
values are positive.
For a triangular matrix (upper or lower), the eigenvalues are equal to the
diagonal elements.
For any square matrix,
The product of the eigenvalues is equal to the determinant
Hence, if any eigenvalue is zero, the matrix is singular (has no inverse).
If no eigenvalues are zero, the matrix has an inverse.
The sum of the eigenvalues is equal to the sum of the elements on the
matrix main diagonal (called the trace)
A matrix has the same eigenvalues as its transpose
The eigenvalues of the inverse of a matrix (if the inverse exists) are the
reciprocals of the eigenvalues of the matrix, while the eigenvectors of both
are the same.
For any matrix A, the square root of the ratio of the largest to smallest
eigenvalue of AAT is equal to the (2-norm) condition number of A
A linear system with damping, Mx'' + Cx' + Kx = 0 where C is a matrix of damping coefficients, has the general solution x(t) = Σ_{i=1}^{2n} c_i v_i e^{λ_i t}, where λ_i, v_i are generalized eigenvalue-eigenvector pairs that solve the quadratic eigenvalue problem (λ²M + λC + K)v = 0 and the coefficients c_i can be set to match the initial conditions x(0), x'(0).
5 Differentiation
Finite difference (centered) to approximate f'(x_0):

f'(x_0) ≈ [ f(x_0 + ∆x) − f(x_0 − ∆x) ] / (2∆x)
This approximation is ‘second-order accurate,’ meaning that truncation error is proportional to (∆x)² (and to f'''(x)) (derived previously from Taylor's theorem)
However, can’t make ∆x very small because roundoff error will in-
crease
Richardson extrapolation
Starting with some fairly large ∆_0, define D_i^0 to be the centered finite-difference estimate obtained with ∆x = ∆_0/2^i. Successively more accurate estimates can then be built by combining these, e.g. D_i^1 ≡ (4/3) D_{i+1}^0 − (1/3) D_i^0 (the same extrapolation idea reappears in Romberg integration below).

A centered finite-difference approximation for the third derivative:

f'''(x) ≈ [ f(x + 2∆x) − 2f(x + ∆x) + 2f(x − ∆x) − f(x − 2∆x) ] / ( 2(∆x)³ )
6 Integration
6.1 Introduction
Some applications of integrals
Average function value between a and b:   [1/(b − a)] ∫_a^b f(x) dx

Center of mass (in 1-D):   [ ∫_a^b x ρ(x) dx ] / [ ∫_a^b ρ(x) dx ]   (ρ = density)

Moment of inertia about x = x_0 (in 1-D):   ∫_a^b (x − x_0)² ρ(x) dx

Net force produced by a distributed loading:   ∫_a^b w(x) dx   (w = force per unit length)

Net moment about x = x_0 produced by a distributed loading:   ∫_a^b (x − x_0) w(x) dx
Typical situations where we need to approximate an integral I = ∫_a^b f(x) dx numerically:
The function f doesn’t have an analytic integral
No mathematical expression for function is available – we can only
measure values or get them from a computation.
in (a, b). Although we often don’t know the maximum value of derivatives
of f , these bounds are nevertheless useful for estimating how the error will
decrease as a result of increasing the number of intervals n.
To integrate functions whose values are only available at certain points
(which may be unequally spaced), there are a few options:
Composite trapezoid rule works even for unequally spaced intervals
Can interpolate the points with an easily integrable function (such as
a polynomial or cubic spline) and integrate the interpolating function.
Because the error from the composite trapezoid rule decreases as the number of subintervals squared, the error of R_{i+1}^0 is expected to be about 1/4 that of R_i^0, and in the same direction.

We exploit this by coming up with a generally more accurate estimate R_i^1 ≡ (4/3) R_{i+1}^0 − (1/3) R_i^0.

Can continue, with R_i^j ≡ [4^j/(4^j − 1)] R_{i+1}^{j−1} − [1/(4^j − 1)] R_i^{j−1} for any j ≥ 1, to obtain generally even more accurate estimates.
Difference between two estimates can give an estimate of uncertainty,
which may be used as a criterion for convergence. For smooth functions,
i often doesn’t need to be large to get a very accurate estimate.
Algorithm:
for j = 0, 1, . . . , jmax:
    Evaluate the function at 2^j + 1 equally spaced points, including the endpoints a and b (giving 2^j equal-width subintervals), and obtain R_j^0
    Find R_{j−i}^i , i = 1, . . . , j, using the formula R_i^j ≡ [4^j/(4^j − 1)] R_{i+1}^{j−1} − [1/(4^j − 1)] R_i^{j−1}
As an example, consider f(x) = e^x − 4x, a = 0, b = 1, jmax = 3:
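The Romberg table for this example can be generated with a short sketch (composite_trapezoid and romberg are illustrative names); the exact value is e − 3 ≈ −0.2817182:

    import numpy as np

    def composite_trapezoid(f, a, b, n):
        # Composite trapezoid rule with n equal subintervals
        x = np.linspace(a, b, n + 1)
        y = f(x)
        return (b - a) / n * (y[0] / 2 + y[1:-1].sum() + y[-1] / 2)

    def romberg(f, a, b, jmax):
        # Row j starts with the trapezoid estimate using 2^j subintervals;
        # each further entry extrapolates the previous column
        R = [[composite_trapezoid(f, a, b, 2**j)] for j in range(jmax + 1)]
        for j in range(1, jmax + 1):
            for i in range(1, j + 1):
                R[j].append((4**i * R[j][i-1] - R[j-1][i-1]) / (4**i - 1))
        return R

    f = lambda x: np.exp(x) - 4 * x
    for row in romberg(f, 0.0, 1.0, 3):
        print(row)  # the last entry of the last row is the most extrapolated estimate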
Gauss quadrature
Estimate an integral based on the function value at specific non-
equally spaced points within the interval (more points closer to the edges)
Select the sample points and weights based on approximating the
function as a polynomial of degree 2n − 1, where n is the number of points
In practice, tabulated values of the sample points xi and weights wi
for the standard integration interval [−1, 1] are available
To approximate I = ∫_{−1}^{1} f(x) dx, use G_n = Σ_{i=1}^{n} w_i f(x_i)

To approximate I = ∫_a^b f(x) dx, use

G_n = [(b − a)/2] Σ_{i=1}^{n} w_i · f( a + [(b − a)/2](x_i + 1) )
Can give very accurate numerical integral estimates with few function
evaluations (small n).
With given n, could divide the integration interval into parts and
apply Gauss quadrature to each one in order to get increased accuracy
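numpy provides the tabulated nodes and weights, so the change of interval above is easy to apply (a sketch; gauss_quad is an illustrative name):

    import numpy as np

    def gauss_quad(f, a, b, n):
        # n-point Gauss-Legendre quadrature over [a, b]
        xi, wi = np.polynomial.legendre.leggauss(n)  # nodes and weights for [-1, 1]
        return (b - a) / 2 * np.sum(wi * f(a + (b - a) / 2 * (xi + 1)))

    f = lambda x: np.exp(x) - 4 * x
    print(gauss_quad(f, 0.0, 1.0, 3))  # very close to the exact value e - 3 = -0.28172...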
Both Romberg integration and Gauss quadrature are only applicable if
we can find the function value at the desired points. Also, they may
not be more accurate than simpler methods if the function is not smooth
(e.g. has discontinuities). Essentially this is because they both rely on
approximating the function by the first terms in its Taylor series.
Euler method (explicit): y_{i+1} = y_i + h f(t_i, y_i), where h is the step size t_{i+1} − t_i (iterate this step N times to get from t_0 = a to t_N = b)
Approximates f¯ with the value of f at the beginning of the interval
[ti , ti + h]
Modified (second-order accurate) Euler method is Heun’s method (also
known as RK2 because it’s a 2nd-order ‘Runge-Kutta’ method); it’s anal-
ogous to trapezoid rule:
Two stages at each timestep, based on estimating the derivative at
the end as well as the beginning of each subinterval:
y_{i+1} = y_i + (K_1 + K_2)/2,

with K_1 = h f(t_i, y_i) (as in Euler method) and K_2 = h f(t_i + h, y_i + K_1)
The classic Runge-Kutta method of order 4 (RK4) is one that is often
used in practice for solving ODEs. Each step involves four stages, and can
be written as:
y(t_{i+1}) = y(t_i) + (1/6)(K_1 + 2K_2 + 2K_3 + K_4),

with K_1 = h f(t_i, y(t_i)) (as in Euler method), K_2 = h f(t_i + h/2, y(t_i) + K_1/2), K_3 = h f(t_i + h/2, y(t_i) + K_2/2), K_4 = h f(t_i + h, y(t_i) + K_3)

(Notice that the 1-2-2-1 ratio of the weights is similar to the 1-4-1 of the Simpson rule)
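A sketch of one RK4 step and its use on a simple test problem (the test ODE y' = −2ty with exact solution e^(−t²) is an illustration, not from the notes):

    def rk4_step(f, t, y, h):
        # One step of the classic fourth-order Runge-Kutta method
        K1 = h * f(t, y)
        K2 = h * f(t + h / 2, y + K1 / 2)
        K3 = h * f(t + h / 2, y + K2 / 2)
        K4 = h * f(t + h, y + K3)
        return y + (K1 + 2 * K2 + 2 * K3 + K4) / 6

    t, y, h = 0.0, 1.0, 0.1
    for _ in range(10):  # step from t = 0 to t = 1
        y = rk4_step(lambda t, y: -2 * t * y, t, y, h)
        t += h
    print(y)  # ~0.3678794, close to the exact exp(-1) = 0.36787944...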
Implicit Euler method: yi+1 = yi + hf (ti+1 , yi+1 ) – implicit because the
unknown yi+1 appears on both sides of the equation.
Solving for this yi+1 may require a numerical nonlinear equation solv-
ing (root-finding) method, depending on how f depends on y.
Crank-Nicolson method (also implicit; uses the average of the slopes from the original (explicit) and implicit Euler methods to estimate f̄, and so should be more accurate than either one – analogous to the trapezoid rule):

y_{i+1} = y_i + h [ f(t_i, y_i) + f(t_{i+1}, y_{i+1}) ] / 2
Again, an implicit method, which makes each step substantially more
complicated than in the original (explicit) Euler method (unless, for ex-
ample, f is a linear function of y)
Local truncation error of a numerical method: error of each step of length
h
For Euler's method, the local error is bounded by (h²/2) K_2, where K_2 is the maximum of |y''| between t and t + h
Global truncation error: error for far end of interval, after n = |b − a|/h
steps
The global error may depend in a complicated way on the local errors
at each step, but usually can be roughly approximated as the number of
steps times the local truncation error for each step
For Euler's method, the estimated global error using this approximation is |b − a| · (h/2) · y''(c) for some c between a and b. Thus, the estimated global error is proportional to h (first-order accuracy).
The RK2 local truncation error is bounded by (h³/12) K_3, where K_3 is the maximum of |y^{(3)}| between t and t + h (same as for the trapezoid rule in integration). Thus, the estimated global error is proportional to h² (second order).
The RK4 local truncation error is bounded by (h⁵/2880) K_5, where K_5 is the maximum of |y^{(5)}| between t and t + h (same as for Simpson's rule in integration). Thus, the estimated global error is proportional to h⁴ (fourth order). Usually, it will be much more accurate than first- or second-order numerical methods (such as Euler or RK2) for the same step size h.
The implicit Euler method has first-order accuracy (similar to the
usual, explicit Euler method), and the Crank-Nicolson method has second-
order accuracy
A general form for the different Runge-Kutta methods (which include all
the methods mentioned above, both explicit and implicit) is
y_{i+1} = y_i + Σ_{j=1}^{s} b_j K_j,

where s is the number of stages in each step of the method (e.g. 1 for the Euler method, 2 for RK2, 4 for RK4), and each K_j is given by the formula

K_j = h · f( t_i + c_j h, y_i + Σ_{l=1}^{s} a_{jl} K_l )
The coefficients a, b, c are different for each method, and can be collected into an (s + 1) by (s + 1) table of the form

 c | A
---+----
   | bᵀ

For the (explicit) Euler method, RK2 (Heun's method), and RK4, respectively, these tables are

 0 | 0
---+---
   | 1

 0 |  0    0
 1 |  1    0
---+----------
   | 1/2  1/2

  0  |  0    0    0    0
 1/2 | 1/2   0    0    0
 1/2 |  0   1/2   0    0
  1  |  0    0    1    0
-----+--------------------
     | 1/6  1/3  1/3  1/6
In this notation, implicit RK methods are those for which A has nonzero elements on or above the main diagonal. Thus, for the implicit Euler method we have

 1 | 1
---+---
   | 1

and for the Crank-Nicolson method

 0 |  0    0
 1 | 1/2  1/2
---+----------
   | 1/2  1/2
Initial-value problem for a first-order ODE system: Given dy_i/dt = f_i(t, y_1, y_2, . . . , y_n) and y_i(a) = y_{i,0} for i = 1, 2, . . . , n, find all y_i(b)

In vector notation: Given dy/dt = f(t, y) and y(a) = y_0, where y(t) is an n × 1 vector and f is a function that returns an n × 1 vector, find y(b)

In this notation, Euler's method for a system can be written compactly as y_{i+1} = y_i + h f(t_i, y_i)
Any ODE system (even one with higher derivatives) can be converted
to this first-order form by setting the derivatives of lower order than the
highest one that appears as additional variables in the system
The general procedure to convert an ODE with derivatives up to order
n into n first-order ODEs is:
Write the ODE in the standard form

d^n y/dt^n = f( t, y, dy/dt, d²y/dt², . . . , d^{n−1}y/dt^{n−1} )

Define new variables for the lower-order derivatives: z_1 = y, z_2 = dy/dt, . . . , z_n = d^{n−1}y/dt^{n−1}, so that dz_i/dt = z_{i+1} for i < n and dz_n/dt = f(t, z_1, z_2, . . . , z_n)

Write the initial conditions as z_i(t_0) = y_0^{(i−1)}, i = 1, 2, . . . , n
Example: pendulum motion (with friction and a driving force),

d²θ/dt² + c dθ/dt + (g/L) sin(θ) = a sin(Ωt)

(second-order equation: highest-order derivative of θ is 2)

Can be written as

dy_1/dt = y_2
dy_2/dt = −c y_2 − (g/L) sin(y_1) + a sin(Ωt)

where y_1 = θ, y_2 = dθ/dt.
The Euler method as well as the other explicit and implicit methods can be
extended readily to systems of ODEs – just run through all the equations
in the system at each timestep to go from y(t) to y(t + h) (but for the
implicit methods, will need to solve a system of equations at each timestep)
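A sketch of this conversion in code, handing the first-order system to scipy's general-purpose solver (the parameter values are arbitrary choices for illustration):

    import numpy as np
    from scipy.integrate import solve_ivp

    c, g, L, a, Omega = 0.1, 9.81, 1.0, 0.5, 2.0   # arbitrary illustrative values

    def pendulum(t, y):
        # y[0] = theta, y[1] = dtheta/dt
        return [y[1], -c * y[1] - (g / L) * np.sin(y[0]) + a * np.sin(Omega * t)]

    sol = solve_ivp(pendulum, (0.0, 10.0), [0.2, 0.0], max_step=0.01)
    print(sol.y[0, -1])  # theta at t = 10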
For example, to use RK2 (Heun's method) with step size h = 1/2 to estimate y(2) given

d²y/dθ² = −y,   y(θ = 0) = 0,   y'(θ = 0) = 1:

Writing the problem as a first-order system for v = [y, dy/dθ]ᵀ, the general formula for each step of RK2 (in vector form) is

v_i = v_{i−1} + (1/2) K_1 + (1/2) K_2,

where K_1 = h f(θ_{i−1}, v_{i−1}) and K_2 = h f(θ_{i−1} + h, v_{i−1} + K_1). The steps work out to

step |        K_1          |          K_2            |          v
  0  |                     |                         | [0, 1]
  1  | [1/2, 0]            | [1/2, −1/4]             | [1/2, 7/8]
  2  | [7/16, −1/4]        | [5/16, −15/32]          | [7/8, 33/64]
  3  | [33/128, −7/16]     | [5/128, −145/256]       | [131/128, 7/512]
  4  | [7/1024, −131/256]  | [−255/1024, −1055/2048] | [231/256, −2047/4096]
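A few lines of exact rational arithmetic reproduce this table (a sketch; f returns the right-hand side of the first-order system):

    from fractions import Fraction

    def f(v):
        # v = [y, y']; the system is y' = v[1], (y')' = -v[0]
        return [v[1], -v[0]]

    h = Fraction(1, 2)
    v = [Fraction(0), Fraction(1)]
    for step in range(1, 5):
        K1 = [h * s for s in f(v)]
        K2 = [h * s for s in f([v[k] + K1[k] for k in range(2)])]
        v = [v[k] + (K1[k] + K2[k]) / 2 for k in range(2)]
        print(step, K1, K2, v)  # the final v is [231/256, -2047/4096]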
size is decreased so that the answer is accurate enough.

One example boundary-value problem (referred to below as the beam example) is

y'' / (1 + (y')²)^{3/2} − [T/(EI)] y = w x(L − x) / (2EI).

Another is a groundwater-flow problem where y(x) is the groundwater head and K the hydraulic conductivity, with upstream and downstream boundary conditions y(x_u) = y_u, y(x_d) = y_d.
Finite-difference is one method of numerically solving boundary value
problems. The idea is to find y on an equally spaced grid, in the beam
example case between x = 0 and x = L, where the points on the grid are
designated 0 = x0 , x1 , x2 , . . . xn = L and the corresponding y values are
y0 , y1 , y2 , . . . yn . We get one equation at each xi . For the beam example,
this is
y''(x_i) / (1 + (y'(x_i))²)^{3/2} − [T/(EI)] y_i = w x_i (L − x_i) / (2EI).
To continue we need expressions for the derivatives of y(x) at each xi in
terms of the xi and yi . We approximate these by finite-difference formulas,
for example (all these formulas are centered and second-order-accurate):
y'(x_i) ≈ (y_{i+1} − y_{i−1}) / (2h)
Example: Use finite differences to approximate y(x) given

y'' + y' = x,   y(0) = 1,   y'(15) = 2:
Start by establishing a grid of points to solve at that spans the domain
over which there are boundary conditions: say n = 5 equally spaced
intervals, with endpoints indexed 0, 1, . . . 5 (h = 3)
Next, write the differential equation for each interior grid point, plus
the boundary conditions, with each derivative replaced by a finite-
difference approximation (here, ones with second-order accuracy are
used):
y_0 = 1

(1/h²) y_0 − (2/h²) y_1 + (1/h²) y_2 − [1/(2h)] y_0 + [1/(2h)] y_2 = 3

(1/h²) y_1 − (2/h²) y_2 + (1/h²) y_3 − [1/(2h)] y_1 + [1/(2h)] y_3 = 6

(1/h²) y_2 − (2/h²) y_3 + (1/h²) y_4 − [1/(2h)] y_2 + [1/(2h)] y_4 = 9

(1/h²) y_3 − (2/h²) y_4 + (1/h²) y_5 − [1/(2h)] y_3 + [1/(2h)] y_5 = 12

[1/(2h)] y_3 − (2/h) y_4 + [3/(2h)] y_5 = 2   [backward finite-difference approximation for the first derivative at the upper boundary]
The resulting system of algebraic equations for the approximate y values at the grid points is

[ 1      0      0      0      0      0    ] [ y_0 ]   [  1 ]
[ −1/18  −2/9   5/18   0      0      0    ] [ y_1 ]   [  3 ]
[ 0      −1/18  −2/9   5/18   0      0    ] [ y_2 ]   [  6 ]
[ 0      0      −1/18  −2/9   5/18   0    ] [ y_3 ] = [  9 ]
[ 0      0      0      −1/18  −2/9   5/18 ] [ y_4 ]   [ 12 ]
[ 0      0      0      1/6    −2/3   1/2  ] [ y_5 ]   [  2 ]

which gives, to 5 decimal places,

y = [ 1, −5622.5, −4487.0, −4692.5, −4619.0, −4590.5 ]ᵀ.
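A sketch of this setup in numpy reproduces the values above:

    import numpy as np

    # Finite differences for y'' + y' = x, y(0) = 1, y'(15) = 2, with n = 5 intervals (h = 3)
    n, h = 5, 3.0
    x = np.linspace(0.0, 15.0, n + 1)
    A = np.zeros((n + 1, n + 1))
    rhs = np.zeros(n + 1)

    A[0, 0], rhs[0] = 1.0, 1.0                         # boundary condition y(0) = 1
    for i in range(1, n):                              # interior points: centered differences
        A[i, i - 1] = 1 / h**2 - 1 / (2 * h)
        A[i, i] = -2 / h**2
        A[i, i + 1] = 1 / h**2 + 1 / (2 * h)
        rhs[i] = x[i]
    A[n, n - 2:] = [1 / (2 * h), -2 / h, 3 / (2 * h)]  # backward difference for y'(15) = 2
    rhs[n] = 2.0

    print(np.linalg.solve(A, rhs))  # [1, -5622.5, -4487, -4692.5, -4619, -4590.5]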
Example: Use finite differences to approximate y(x) given

y''''(x) = 1 − 0.1x,   y(0) = 0,   y'(0) = 0,   y''(10) = 0,   y'''(10) = 0:
(This is the classical beam equation [also known as the Euler-Bernoulli equation]
for a cantilever beam with a triangle-shaped loading.)
Start by establishing a grid of points to solve at that spans the domain
over which there are boundary conditions: say n = 5 equally spaced
intervals, with endpoints indexed 0, 2, . . . 10 (h = 2).
Next, write the differential equation for each interior grid point, plus
the boundary conditions, with each derivative replaced by a finite-
difference approximation (here, ones with second-order accuracy are
used):
y_0 = 0

−[3/(2h)] y_0 + [4/(2h)] y_1 − [1/(2h)] y_2 = 0   [forward finite-difference approximation for the first derivative at the left boundary]

(2/h⁴) y_0 − (9/h⁴) y_1 + (16/h⁴) y_2 − (14/h⁴) y_3 + (6/h⁴) y_4 − (1/h⁴) y_5 = 0.8   [non-centered approximation for the fourth derivative close to the left edge]

(1/h⁴) y_0 − (4/h⁴) y_1 + (6/h⁴) y_2 − (4/h⁴) y_3 + (1/h⁴) y_4 = 0.6

(1/h⁴) y_1 − (4/h⁴) y_2 + (6/h⁴) y_3 − (4/h⁴) y_4 + (1/h⁴) y_5 = 0.4

−(1/h⁴) y_0 + (6/h⁴) y_1 − (14/h⁴) y_2 + (16/h⁴) y_3 − (9/h⁴) y_4 + (2/h⁴) y_5 = 0.2   [non-centered approximation for the fourth derivative close to the right edge]
Polynomial interpolation
For any set of n points (xi , yi ) with distinct xi , there’s a unique
polynomial p of degree n − 1 such that p(xi ) = yi .
For n > 5 or so, polynomial interpolation tends in most cases to give oscillations around the given points, so the interpolating polynomial often looks unrealistic.
Polynomial interpolation is quite nonlocal and ill-conditioned, espe-
cially for larger n: if you change slightly one of the points to interpolate,
the whole curve will often change substantially.
Finding the interpolating polynomial through given points:
Lagrange form:

p(x) = Σ_{i=1}^{n} y_i Π_{j=1, j≠i}^{n} (x − x_j)/(x_i − x_j)
for i = 2, . . . n
Example: if x = [0 1 -2 2 -1]' and y = [-3 -2 1 -4 1]', the Lagrange form of the interpolating polynomial follows directly from this formula (see the sketch below).
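The interpolating polynomial for these points can be constructed with scipy (a sketch):

    import numpy as np
    from scipy.interpolate import lagrange

    x = np.array([0, 1, -2, 2, -1])
    y = np.array([-3, -2, 1, -4, 1])
    p = lagrange(x, y)  # degree-4 interpolating polynomial as a numpy poly1d object
    print(p)            # its expanded coefficients
    print(p(x) - y)     # essentially zero at the given points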
Spline interpolation
Idea: interpolating function S(x) is a piecewise polynomial of some
low degree d; the derivative of order d is discontinuous at nodes or knots,
which are the boundaries between the polynomial pieces (for our purposes,
these are set to be at the data points).
Linear spline: d = 1 – connects points by straight lines (used for
plot) – first derivative isn’t continuous.
Quadratic spline: d = 2 – connects points by segments of parabolas – second derivative isn't continuous (but first derivative is).
Cubic spline: d = 3 – commonly used – minimizes curvature out of
all the possible interpolating functions with continuous second derivatives
(third derivative is discontinuous)
Uses n−1 cubic functions to interpolate n points (4n−4 coefficients
total)
Need 2 additional conditions to specify coefficients uniquely. Com-
monly, we use natural boundary conditions, where the second derivative is
zero at the endpoints. Another possibility is not-a-knot boundary condi-
tions, where the first two cubic functions are the same and last two cubic
functions are the same.
The spline coefficients that interpolate a given set of points can be
found by solving a linear system. For a cubic spline:
Let the n + 1 given points be (xi , yi ), for i = 0, 1, . . . n
Each piecewise cubic polynomial (which interpolates over the in-
terval [xi−1 , xi ]) can be written as Si (x) = ai (x − xi−1 )3 + bi (x − xi−1 )2 +
ci (x − xi−1 ) + di , where i = 1, 2, . . . n
The 4n − 2 conditions for the piecewise polynomials to form a cubic
spline that interpolates the given points are
Si (xi−1 ) = yi−1 , i = 1, 2, . . . n
Si (xi ) = yi , i = 1, 2, . . . n
S'_i(x_i) = S'_{i+1}(x_i),   i = 1, 2, . . . , n − 1
S''_i(x_i) = S''_{i+1}(x_i),   i = 1, 2, . . . , n − 1
We can find the b_i by solving a linear system that includes the following n − 2 equations, plus two more from the boundary conditions:

h_{i−1} b_{i−1} + 2(h_{i−1} + h_i) b_i + h_i b_{i+1} = 3(∆_i − ∆_{i−1}),   for i = 2, 3, . . . , n − 1

where h_i ≡ x_i − x_{i−1} and ∆_i ≡ (y_i − y_{i−1})/h_i
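scipy's CubicSpline sets up and solves this kind of linear system internally; a sketch with arbitrary illustrative data:

    import numpy as np
    from scipy.interpolate import CubicSpline

    x = np.array([0.0, 1.0, 2.0, 4.0, 5.0])   # arbitrary illustrative data
    y = np.array([1.0, 3.0, 2.0, 4.0, 0.0])
    s_natural = CubicSpline(x, y, bc_type='natural')      # zero second derivative at the endpoints
    s_notaknot = CubicSpline(x, y, bc_type='not-a-knot')  # not-a-knot boundary conditions
    print(s_natural(2.5), s_notaknot(2.5))                # the two splines differ between the data points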
10 Regression
Least squares fitting is a kind of regression
Regression: fit a function of a given type, say f , approximately through
some given points (xi , yi ) so that for the given points, f (xi ) ≈ yi .
If the points are given as n × 1 vectors x, y, the residual vector of the
fitted function is r = y − f (x) (i.e. ri = yi − f (xi ))
The least squares criterion: Out of all the functions in the given type, minimize the residual sum of squares, RSS = rᵀr = Σ_{i=1}^{n} r_i²
Suppose the function type is such that we can write f (x) = Aβ, where A
is a known n × m design matrix for the given x while β is an unknown
m×1 vector of ‘parameters’. That is, f (xi ) = Ai,1 β1 +Ai,2 β2 +. . . Ai,m βm .
Example: the function type is a straight line, f (x) = ax + b. Then
row i of A consists of (xi , 1), and β is [a, b]T (m = 2).
Example: the function type is a quadratic with zero intercept, f (x) =
cx2 . Then row i of A consists of (x2i ), and β is [c] (m = 1).
In that case, the residual sum of squares (RSS) rT r = ||r||22 is equal to
(Aβ − y)T (Aβ − y). Under least squares, we want to choose β so that
this quantity is as small as possible.
To find a minimum of the residual sum of squares, we take its derivative with respect to β, which works out to be 2AᵀAβ − 2Aᵀy.

If we set this equal to zero, we then get for β the m × m linear system

AᵀAβ = Aᵀy
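A sketch of a straight-line fit via the normal equations (the data are arbitrary illustrations):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])      # arbitrary illustrative data
    y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

    A = np.column_stack([x, np.ones_like(x)])    # design matrix: row i is (x_i, 1)
    beta = np.linalg.solve(A.T @ A, A.T @ y)     # normal equations A^T A beta = A^T y
    print(beta)                                  # fitted [a, b] for f(x) = a x + b
    print(np.linalg.lstsq(A, y, rcond=None)[0])  # numpy's least-squares solver gives the same answer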
Out-of-sample validation: Divide the available data into “training”
and “validation” subsets. Fit the parameters for each model using only
the training data. Choose the model with lowest RSS for the validation
data.
Numerical criteria:
Adjusted R²:

R_a² = 1 − [(n − 1)/(n − m)] · (RSS/TSS),

where n is the number of data points, m is the number of unknown parameters in each model, and TSS is the total sum of squares Σ_i (y_i − ȳ)². Each model is fitted to the data and R_a² is computed, and the model with highest R_a² is chosen as likely to be best at predicting the values of new points.
Another commonly used rule is the Akaike information criterion
(AIC), which can be given as follows for linear least squares:
(log designates natural logarithm). Each model is fitted to the data and
AIC is computed, and the model with lowest AIC is chosen as likely to be
best at predicting the values of new points.
In nonlinear least squares fitting, the function form is such that finding the
least squares parameter values for it requires solving a system of nonlinear
equations. In general this is a more difficult numerical problem.
Like many numerical methods, Newton’s method (stopped after finitely
many iterations) will usually only give an approximate answer, but is eas-
ily implementable on a computer.
One possible “stopping criterion” would be to check when the (absolute or
relative) difference between xn and xn−1 is small enough, as an estimate
of the error relative to the unknown true value x∗ .
Another possible stopping criterion would be for |f(x_n)| to be small enough, i.e. close to f(x*) = 0. This quantity is called the "residual" and is another measure of the error in our estimate of x* after n iterations.
Newton's method can be generalized for a function of a vector, f(x), where the "Jacobian" matrix of partial derivatives J(x) has the elements J_{ij}(x) = ∂f_i/∂x_j:

x_{i+1} = x_i − J⁻¹(x_i) f(x_i)
Disadvantages of Newton’s method are that it requires the derivative of f
and may not converge if the function is too nonlinear (so that it has ‘flat’
regions where the derivative is much closer to zero than elsewhere).
Bisection is another iterative (approximate) numerical root-finding method.
It is slower to converge (more iterations typically required for high accu-
racy) but very reliable.
Algorithm:
Start by finding two values a and b on either side of the root, such
that f (a) < 0 and f (b) > 0.
For each iteration:
Find f (c), where c is halfway between a and b (c ← (a + b)/2)
If f (c) = 0, return c as the root
If f (c) > 0, set b ← c
If f (c) < 0, set a ← c
Assuming that the exact root is not found, each iteration makes the
interval [a, b] half as wide as before
Possible stopping criteria (for some specified error tolerances tol, ε):

Number of iterations

Uncertainty in the root: |b − a|/2 < tol (absolute uncertainty) or |b − a|/|b + a| < tol (fractional uncertainty)

Small residual, |f(c)| < ε
Example for f (x) = x2 − 2:
Initialize: a ← 1, b ← 2
Iteration 1: c ← 1.5, f (c) > 0, b ← c
Iteration 2: c ← 1.25, f (c) < 0, a ← c
Iteration 3: c ← 1.375, f (c) < 0, a ← c
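A sketch of the loop:

    def bisection(f, a, b, tol=1e-10):
        # Bisection: repeatedly halve the bracketing interval, assuming f(a) < 0 < f(b)
        while (b - a) / 2 > tol:
            c = (a + b) / 2
            if f(c) == 0:
                return c
            elif f(c) > 0:
                b = c
            else:
                a = c
        return (a + b) / 2

    print(bisection(lambda x: x**2 - 2, 1.0, 2.0))  # ~1.41421356, i.e. sqrt(2)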
The secant method is similar to Newton’s method in that it’s based on
locally approximating the function as a straight line, but it doesn’t require
us to be able to compute the derivative, instead estimating it from the
difference in function value between two points.
Like Newton’s method, this method converges fast (few iterations)
if the function f is locally pretty close to a straight line, but may not
converge at all if the function is very nonlinear.
Algorithm:
Start with two initial points a and b close to the root (can be the
same as the starting points for bisection)
For each iteration:
Find f (a) and f (b)
Estimate the function's slope (first derivative) as s = [f(b) − f(a)] / (b − a)
Compute c ← b − f (b)/s
Set a ← b, b ← c.
Possible stopping criteria:

Number of iterations

Estimated uncertainty in the root, |f(b)/s| < tol

Small residual, |f(c)| < ε
Example for f (x) = x2 − 2:
Initialize: a ← 1, b ← 2
Iteration 1: s = 3, c ← 1.333, a ← b, b ← c
Iteration 2: s = 3.333, c ← 1.4, a ← b, b ← c
Iteration 3: s = 2.733, c ← 1.4146, a ← b, b ← c
The false position method chooses c in each iteration the same way as in
the secant method, but then finds a new bracketing interval for the next
iteration as in bisection. This is more reliable than the secant method in
that it should converge even for functions with ‘flat’ parts.
Algorithm:
Start by finding two points a and b on either side of the root, such
that f (a) < 0 and f (b) > 0 (as in bisection).
For each iteration:
Find f (a) and f (b)
Estimate the function's slope (first derivative) as s = [f(b) − f(a)] / (b − a)
Compute c ← a − f (a)/s
Find f (c)
If f (c) = 0, return c as the root
If f (c) > 0, set b ← c
If f (c) < 0, set a ← c
Possible stopping criteria: As in bisection, or if c changes by less than
some tolerance between iterations.
Example for f (x) = x2 − 2:
Initialize: a ← 1, b ← 2
Iteration 1: s = 3, c ← 1.4167, f (c) > 0, b ← c
Iteration 2: s = 2.417, c ← 1.4138, f (c) < 0, a ← c
Iteration 3: s = 2.831, c ← 1.4142, f (c) < 0, a ← c
12 Optimization
Choose x to maximize or minimize a function f (x); examples: minimize
f (x) = −x2 + 2x; maximize f (θ) = 4 sin(θ)(1 + cos(θ))
max f (x) is equivalent to min −f (x)
Local vs. global optimums (maximums or minimums)
First step should always be to graph the function and assess where the
optimum might be
Example: Find minimum of f (x) = − sin(x)(1 + cos(x)) for x in [−π π] :
# a b c d f (c) f (d)
1 −π π 0.74162942 −0.74162942 −1.17357581 1.17357581
2 π −0.74162942 0.74162942 1.65833381 −1.17357581 −0.90908007
3 −0.74162942 1.65833381 0.74162942 0.17507496 −1.17357581 −0.34570127
4 1.65833381 0.17507496 0.74162942 1.09177934 −1.17357581 −1.29647964
5 1.65833381 0.74162942 1.09177934 1.30818389 −1.29647964 −1.21641886
6 0.74162942 1.30818389 1.09177934 0.95803397 −1.29647964 −1.28855421
7 1.30818389 0.95803397 1.09177934 −1.29647964
So after 6 iterations, the location of the minimum is narrowed down to [0.95 1.31]
and the minimum value is under −1.296
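The bracketing pattern in this table matches a golden-section search (that name is an inference; it is not stated in the surviving text). A sketch of the standard version of that method:

    import math

    def golden_section_min(f, a, b, n_iter=30):
        # Golden-section search for a minimum of f on [a, b]
        r = (math.sqrt(5) - 1) / 2        # inverse golden ratio, ~0.618
        c = b - r * (b - a)               # lower interior point
        d = a + r * (b - a)               # upper interior point
        for _ in range(n_iter):
            if f(c) < f(d):               # minimum lies in [a, d]
                b, d = d, c
                c = b - r * (b - a)
            else:                         # minimum lies in [c, b]
                a, c = c, d
                d = a + r * (b - a)
        return (a + b) / 2

    f = lambda x: -math.sin(x) * (1 + math.cos(x))
    print(golden_section_min(f, -math.pi, math.pi))  # ~1.0472 (= pi/3), where f is ~-1.2990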