Numerical Methods and Modelling for Engineering
Douglas Wilhelm Harder
Richard Khoury
Lakehead University
Thunder Bay, ON, Canada
ISBN 978-3-319-21175-6
ISBN 978-3-319-21176-3 (eBook)
DOI 10.1007/978-3-319-21176-3
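Conventions
Throughout this textbook, the following conventions are used for functions and
variables:

Object: Scalar number (real or complex)
Format: Lowercase italics
Example: $x = 5$

Object: Vector
Format: Lowercase bold
Example: $\mathbf{v} = [x_0, x_1, \ldots, x_i, \ldots, x_{N-1}]$

Object: Matrix
Format: Uppercase bold
Example:
\[
\mathbf{M} =
\begin{bmatrix}
x_{0,0} & x_{0,1} & \cdots & x_{0,j} & \cdots & x_{0,N-1} \\
x_{1,0} & x_{1,1} & \cdots & x_{1,j} & \cdots & x_{1,N-1} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{i,0} & x_{i,1} & \cdots & x_{i,j} & \cdots & x_{i,N-1} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{M-1,0} & x_{M-1,1} & \cdots & x_{M-1,j} & \cdots & x_{M-1,N-1}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{v}_0 \\ \mathbf{v}_1 \\ \vdots \\ \mathbf{v}_i \\ \vdots \\ \mathbf{v}_{M-1}
\end{bmatrix}
\]

Object: Scalar-valued function of scalar
Format: Lowercase italics with lowercase italics
Example: $f(x) = 5x + 2$

Object: Scalar-valued function of vector
Format: Lowercase italics with lowercase bold
Example: $f(\mathbf{v}) = f(x_0, x_1, \ldots, x_i, \ldots, x_{N-1}) = 5x_0 + 2x_1 + \cdots + 7x_i + \cdots + 4x_{N-1} + 3$

Object: Vector-valued function of scalar
Format: Lowercase bold with lowercase italics
Example: $\mathbf{f}(x) = [f_0(x), f_1(x), \ldots, f_i(x), \ldots, f_{N-1}(x)]$

Object: Vector-valued function of vector
Format: Lowercase bold with lowercase bold
Example: $\mathbf{f}(\mathbf{v}) = [f_0(\mathbf{v}), f_1(\mathbf{v}), \ldots, f_i(\mathbf{v}), \ldots, f_{N-1}(\mathbf{v})]$

Object: Matrix-valued function of scalar
Format: Uppercase bold with lowercase italics
Example:
\[
\mathbf{M}(x) =
\begin{bmatrix}
f_{0,0}(x) & f_{0,1}(x) & \cdots & f_{0,j}(x) & \cdots & f_{0,N-1}(x) \\
f_{1,0}(x) & f_{1,1}(x) & \cdots & f_{1,j}(x) & \cdots & f_{1,N-1}(x) \\
\vdots & \vdots & & \vdots & & \vdots \\
f_{i,0}(x) & f_{i,1}(x) & \cdots & f_{i,j}(x) & \cdots & f_{i,N-1}(x) \\
\vdots & \vdots & & \vdots & & \vdots \\
f_{M-1,0}(x) & f_{M-1,1}(x) & \cdots & f_{M-1,j}(x) & \cdots & f_{M-1,N-1}(x)
\end{bmatrix}
\]

Object: Matrix-valued function of vector
Format: Uppercase bold with lowercase bold
Example:
\[
\mathbf{M}(\mathbf{v}) =
\begin{bmatrix}
f_{0,0}(\mathbf{v}) & f_{0,1}(\mathbf{v}) & \cdots & f_{0,j}(\mathbf{v}) & \cdots & f_{0,N-1}(\mathbf{v}) \\
f_{1,0}(\mathbf{v}) & f_{1,1}(\mathbf{v}) & \cdots & f_{1,j}(\mathbf{v}) & \cdots & f_{1,N-1}(\mathbf{v}) \\
\vdots & \vdots & & \vdots & & \vdots \\
f_{i,0}(\mathbf{v}) & f_{i,1}(\mathbf{v}) & \cdots & f_{i,j}(\mathbf{v}) & \cdots & f_{i,N-1}(\mathbf{v}) \\
\vdots & \vdots & & \vdots & & \vdots \\
f_{M-1,0}(\mathbf{v}) & f_{M-1,1}(\mathbf{v}) & \cdots & f_{M-1,j}(\mathbf{v}) & \cdots & f_{M-1,N-1}(\mathbf{v})
\end{bmatrix}
\]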
Acknowledgements
Khadijeh Bayat
Dan Busuioc
Tim Kuo
Abbas Attarwala
Prashant Khanduri
Matthew Chan
Christopher Olekas
Jaroslaw Kuszczak
Chen He
Hans Johannes Petrus Vanleeuwen
David Smith
Jeff Teng
Roman Kogan
Mohamed Oussama Damen
Rudko Volodymyr
Vladimir Rutko
George Rizkalla
Alexandre James
Scott Klassen
Brad Murray
Brendan Boese
Aaron MacLennan
Contents
2 Numerical Representation
  2.1 Introduction
  2.2 Decimal and Binary Numbers
    2.2.1 Decimal Numbers
    2.2.2 Binary Numbers
    2.2.3 Base Conversions
  2.3 Number Representation
    2.3.1 Fixed-Point Representation
    2.3.2 Floating-Point Representation
    2.3.3 Double-Precision Floating-Point Representation
  2.4 Limitations of Modern Computers
    2.4.1 Underflow and Overflow
    2.4.2 Subtractive Cancellation
    2.4.3 Non-associativity of Addition
  2.5 Summary
  2.6 Exercises
3 Iteration
  3.1 Introduction
  3.2 Iteration and Convergence
  3.3 Halting Conditions
  3.4 Summary
  3.5 Exercises
4 Linear Algebra
  4.1 Introduction
  4.2 PLU Decomposition
  4.3 Cholesky Decomposition
  4.4 Jacobi Method
  4.5 Gauss-Seidel Method
  4.6 Error Analysis
    4.6.1 Reciprocal Matrix
    4.6.2 Maximum Eigenvalue
  4.7 Summary
  4.8 Exercises
5 Taylor Series
  5.1 Introduction
  5.2 Taylor Series and nth-Order Approximation
  5.3 Error Analysis
  5.4 Modelling with the Taylor Series
  5.5 Summary
  5.6 Exercises
7 Bracketing
  7.1 Introduction
  7.2 Binary Search Algorithm
  7.3 Advantages and Limitations
  7.4 Summary of the Five Tools
8 Root-Finding
  8.1 Introduction
  8.2 Bisection Method
  8.3 False Position Method
    8.3.1 Error Analysis
    8.3.2 Nonlinear Functions
  8.4 Closed and Open Methods
  8.5 Simple Fixed-Point Iteration
  8.6 Newton's Method
    8.6.1 One-Dimensional Newton's Method
    8.6.2 Multidimensional Newton's Method
  8.7 Secant Method
  8.8 Muller's Method
  8.9 Engineering Applications
  8.10 Summary
  8.11 Exercises
9 Optimization
  9.1 Introduction
  9.2 Golden-Mean Search
  9.3 Newton's Method
  9.4 Quadratic Optimization
  9.5 Gradient Descent
  9.6 Stochastic Optimization
  9.7 Random Brute-Force Optimization
  9.8 Simulated Annealing
  9.9 Engineering Applications
  9.10 Summary
  9.11 Exercises
10 Differentiation
  10.1 Introduction
  10.2 Centered Divided-Difference Formulae
  10.3 Forward and Backward Divided-Difference Formulae
  10.4 Richardson Extrapolation
  10.5 Second Derivatives
  10.6 Unevenly Spaced Measurements
  10.7 Inaccurate Measurements
  10.8 Engineering Applications
  10.9 Summary
  10.10 Exercises
11 Integration
  11.1 Introduction
  11.2 Trapezoid Rule
    11.2.1 Single Segment
    11.2.2 Composite Trapezoid Rule
  11.3 Romberg Integration Rule
  11.4 Simpson's Rules
    11.4.1 Simpson's 1/3 Rules
    11.4.2 Simpson's 3/8 Rule
  11.5 Gaussian Quadrature
  11.6 Engineering Applications
  11.7 Summary
  11.8 Exercises
List of Figures
Fig. 12.1 A sample RC circuit
Fig. 12.2 Pseudocode of Euler's method
Fig. 12.3 Euler's method underestimating a convex function's values
Fig. 12.4 Euler's method overestimating a convex function's values
Fig. 12.5 Heun's method averaging the Euler's method approximations
Fig. 12.6 Pseudocode of Heun's method
Fig. 12.7 Top-left: K0, aka Euler's method, used to compute K1. Top-right: K1 used to compute K2. Middle-left: K2 used to compute K3. Middle-right: K0 to K3 used to compute the next point in the fourth-order Runge-Kutta method. Bottom-left: Euler's method used to compute the next point. Bottom-right: Heun's method used to compute the next point
Fig. 12.8 Pseudocode of the fourth-order Runge-Kutta method
Fig. 12.9 Pseudocode of the backward Euler's method
Fig. 12.10 Pseudocode of Euler's method for a system of IVP
Fig. 13.1 A sample circuit
Fig. 13.2 Pseudocode of the shooting method
Fig. 13.3 Pseudocode of the finite difference method
Fig. 13.4 A visualization of a two-dimensional BVP
Chapter 1
1.1 Introduction
1.2
Before we can meaningfully talk about solving a model and measuring errors, we
must understand the modelling process and the sources of error.
Engineers and scientists study and work in the physical world. However, exactly
measuring and tracking every value of every variable in the natural world, and its
complete effect on nature, is an impossible task. Consequently, all
engineers and scientists work on different models of the physical world, which
track every variable and natural phenomenon we need to be aware of for our given
tasks to a level of accuracy we require for our work. This implies that different
professionals will work using different models; for instance, while an astrophysicist
studying the movement of galaxies and a quantum physicist studying the collision
of subatomic particles are studying the same physical world, they use completely
different models of it in their work.
Selecting the proper model for a project is the first step of the modelling cycle
shown in Fig. 1.1. All models stem in some way from the real physical world we
live in and are meant to represent some aspect of it. Once a proper model has been
selected, an implementation of it has to be made. Today, this is synonymous with
writing a software version of the model to run on a computer, but in the past a
simpler implementation approach was used, which consisted of writing down all
necessary equations on paper or on a blackboard. Whichever implementation
method is used, it will include variables that are placeholders for real-world values.
Consequently, the next step is to look to the physical world again and take
measurements of these values to fill in the variables. Finally, we are ready for the
final step, simulation. At this step, we execute the implementation of the model with
the measured values to get an output result. Whether this execution is done by a
Fig. 1.1 The modelling loop of reality to engineering approximation
sources of interference accounted for in their model, they found that there was still a
low steady noise being detected by their receiver, and that it was 100 times more
intense than what the model had predicted they would find. This noise, it was later
found, was the cosmic microwave background radiation of the universe left over
from the Big Bang, which their model did not include and which, as a result, threw
off their predictions entirely. But to be fair, the Big Bang was still a recently proposed
hypothesis at that time, and one that a large portion of the scientific community did
not yet accept, and so we cannot fault Penzias and Wilson for selecting a model that
did not include it. In fact, their accidental discovery earned them a Nobel Prize in
1978. Most engineering errors do not have such positive outcomes, however.
Implementation errors occur when the software code representing the model in a
computer is poorly built. This can be the result of algorithmic errors, of passing
values in the wrong order through a software interface, of legacy code being used in
a way it was never meant for, and of many more potential issues. These errors are
usually detected and corrected through proper software quality assurance (SQA)
methods within the project, and consequently SQA is an important component of
good software engineering practice. Implementation errors have notably plagued
space exploration agencies worldwide and are blamed for some of the most famous
space disasters. The explosion of the European Ariane 5 rocket shortly after takeoff
in 1996 was due to a 64-bit value in a new inertial reference system being passed
into a 16-bit value in a legacy control system. When the value exceeded the 16-bit
limit, the control system failed and the rocket shot off course and destroyed itself
and its cargo, a loss of some $500 million. Likewise, NASA lost its $328 million
Mars Climate Orbiter in 1999 because of poorly documented software components.
The Orbiter's instruments took measurements in imperial units and relayed them to
the control software, which was designed to handle metric units. Proper documentation of these two modules to explicitly name the units used by each, along with
proper code review, would have caught this problem easily; instead, it went
unnoticed until the Orbiter crashed into Mars.
Once an appropriate model has been selected and correctly implemented, it is
necessary to fill in unknown variables with values measured from the real world to
represent the current problem. Measurement errors occur at this stage, when the
measurements are inaccurate. In a sense, measurement errors will always occur;
while one can avoid model errors through proper research and implementation
errors through a proper SQA process, measuring tools will always have limited
precision and consequently the measurements themselves will always have some
errors. However, care can still be taken in a number of ways: by including error
bounds on the measures rather than treating them as exact values, by running the
computations on worst-case scenarios in addition to more likely average scenarios,
by designing in safety buffers, and of course by making sure the measurements are
taken properly in the first place. This was not the case in 1979 for a Finnish team
tasked with building a lighthouse on Market Island. This island is on the border between
Sweden and Finland and had been neatly divided by a treaty between the
two nations 170 years before. The team was to build the new lighthouse on the Finnish side of the island, but because of improper geographical
measurements, they accidentally built it on the Swedish side. Rectifying the situation
after construction required reopening the century-old treaty between the two
nations to negotiate new borders that remained fair for territory size, coast lines,
fishery claims, and more. And while the two nations resolved the issue peacefully,
to their credit, accidentally causing an international incident is not a line any team
leader would want to include on their resume.
Even once a proper model has been selected, correctly implemented and populated with accurate measurements, errors can still occur. These final errors are
simulation errors that are due to the accumulation of inaccuracies over the execution of the simulation. To understand the origin of these errors, one must remember
that a simulation on a computer tries to represent reality, a continuous-valued and
infinite world, with a discrete set of finite values, and then predicts what will happen
next in this world using approximation algorithms. Errors are inherent and unavoidable in this process. Moreover, while the error on an individual value or algorithm
may seem so small as to be negligible, these errors accumulate with each other. An
individual value may have a small error, but then is used in an algorithm with its
own small error and the result has the error of both. When a proper simulation uses
dozens of values and runs algorithms hundreds of times, the errors can accumulate
to very significant values. For example, in 1991, the Sleipner A oil platform under
construction in Norway collapsed because of simulation errors. The problem could
be traced back to the approximation in a finite element function in the model; while
small, this error then accumulated throughout the simulation so that by the end the
stress predicted on the structure by the model was 47 % less than reality. Consequently, the concrete frame of the oil platform was designed much too weak, sprung
a leak after it was submerged, and caused the entire platform to sink to
the bottom of a fjord. The shock of the platform hitting the bottom of the fjord
caused a seismic event of 3.0 on the Richter scale and about $700 million in damages.
This book focuses on simulation errors. Throughout the work, it will present not
only algorithms to build simulations and model reality but also their error values, in order
to account for simulation errors in engineering work.
1.3 Error Analysis
1.3.1 Precision and Accuracy
Before talking about errors, it is necessary to lay down some formal vocabulary.
The first are the notions of precision and accuracy, two words that are often used
interchangeably by laypeople. In engineering, these words have different, if related,
meanings. Precision refers to the number of digits an approximation uses to
represent a real value, while accuracy refers to how close to the real value the
approximation is.
An example can help clarify these notions. Imagine a car with two speedometers,
an analogue one and a digital one. The digital one indicates the car's speed at every
0.1 km/h, while the analogue one only indicates it at every 1 km/h. When running
an experiment and driving the car at a constant 100 km/h, it is observed that the
digital speedometer fluctuates from 96.5 to 104.4 km/h, while the analogue one
only fluctuates from 99 to 101 km/h. In this example, the digital speedometer is
more precise, as it indicates the speed with one more digit than the analogue one,
but the analogue speedometer is more accurate, as it is closer to the real value of
100 km/h than the digital one.
While precision and accuracy measure two different and independent aspects of
our values, in practice it makes sense to use precision to reflect accuracy. Adding
digits of precision that cannot be accurately measured simply adds
noise to our values. This was the case in the previous example, with the
digital speedometer showing a precision of 0.1 km/h when it could not accurately
measure the speed to better than 3 or 4 km/h. On the other hand, if a value can be
accurately measured to a great precision, then these digits should be included. If the
car's speed is accurately measured to 102.44 km/h, then reporting it to a lesser
precision at 102 km/h not only discards useful information, it actually reduces
accuracy by rounding known figures.
Consequently, the accuracy of a measure is usually a function of the last digit of
precision. When a speedometer indicates the car's speed to 0.1 km/h, it implies that
it can accurately measure its speed to that precision. In fact, given no other
information except a value, it is implied that the accuracy is half the last digit of
precision. For example, a car measured as going at 102.3 km/h is implied to have
been accurately measured to ±0.05 km/h to get that precision. This accuracy is
called the implied precision of the measure. In our example, this means that the real
speed of the car is somewhere in the range from 102.25 to 102.35 km/h and cannot
be obtained any more accurately than that.
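The half-of-the-last-digit rule can be sketched in a few lines of Python. This is only an illustration: the helper name `implied_interval` is hypothetical, and the reading is assumed to be supplied as text so that the number of written digits is preserved.

```python
from decimal import Decimal


def implied_interval(reading: str) -> tuple[Decimal, Decimal]:
    """Return the (low, high) range implied by a written decimal reading.

    Hypothetical helper: the reading is passed as text so that trailing
    digits, such as the 0 in "102.30", are not lost.
    """
    value = Decimal(reading)
    # Exponent of the last written digit, e.g. -1 for "102.3".
    last_digit_exponent = value.as_tuple().exponent
    # Half a unit in that last digit, e.g. 0.05 for "102.3".
    half_unit = Decimal(5).scaleb(last_digit_exponent - 1)
    return value - half_unit, value + half_unit


low, high = implied_interval("102.3")
print(f"real speed lies between {low} and {high} km/h")
```

Decimal arithmetic is used here instead of binary floating point so that the bounds come out as exact decimal values.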
1.3.2
The next important term to introduce is that of error. The error is the value of the
inaccuracy on a measure. If it is given with the same units as the measure itself, then
it is an absolute error. More formally, given a real measure and an approximation,
the absolute error is the difference between the approximation and the real value:
\[ E_{\text{abs}} = |\text{approximation} - \text{value}| \tag{1.1} \]
It can be realized at this point that the implied precision introduced in the previous
subsection is also a measure of absolute error. Absolute error has the benefit of
being immediately clear and related to the measure being evaluated. However, it is
also inherently vague when it comes to determining if that measure is accurate or
not. Given a distance with an absolute error of 3 m, one can get an immediate sense
of the precision that was used to measure it and of how far apart the two objects
might be, but is this accurate enough? The answer is that it depends on the
magnitude of the distance being measured. An absolute error of 3 m is incredibly
accurate when measuring the thousands of metres of distance between two cities,
but incredibly inaccurate when measuring the fraction of a metre distance between
your thumb and index finger. The notion of relative error, or absolute error as a
ratio of the value being measured, introduces this difference:
\[ E_{\text{rel}} = \frac{|\text{approximation} - \text{value}|}{|\text{value}|} \tag{1.2} \]
Unlike absolute error, which is given in the same units as the value being measured,
relative error is given as a percentage of the measured value.
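As a minimal sketch of these two definitions, the following Python functions compute both errors. The function names and the sample distances are assumptions chosen for illustration only.

```python
def absolute_error(approximation: float, value: float) -> float:
    """Absolute error: same units as the measured value."""
    return abs(approximation - value)


def relative_error(approximation: float, value: float) -> float:
    """Relative error: absolute error as a fraction of the real value."""
    if value == 0:
        raise ValueError("relative error is undefined for a real value of 0")
    return abs(approximation - value) / abs(value)


# The same 3 m absolute error, judged against two very different distances
# (both distances are made-up illustrative numbers).
for approx, real in [(5003.0, 5000.0), (3.2, 0.2)]:
    print(
        f"absolute = {absolute_error(approx, real):.1f} m, "
        f"relative = {relative_error(approx, real):.2%}"
    )
```

The first case is off by a small fraction of a percent, while the second is off by many times the real value, even though the absolute error is identical.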
Example 1.1
What is the maximum and minimum resistance of a resistor labelled brown,
grey, brown, red?
Solution
Given the colour code, the resistor is 180 Ω with a tolerance of 2 %. In other
words, the resistance value is approximated as 180 Ω and the relative error on
this approximation is 2 %. Putting these values in the relative error formula
(1.2) to solve for the real value r:
\[ E_{\text{rel}} = \frac{|180 - r|}{|r|} = 0.02 \;\Rightarrow\; r = \begin{cases} 176.5\ \Omega \\ 183.7\ \Omega \end{cases} \]
The resistor's minimum and maximum resistance values are 176.5 and
183.7 Ω, respectively, and the real resistance value is somewhere in that
range. It can be noted that the absolute error on the resistance value is about 3.6 Ω,
which is indeed 2 % of 180 Ω.
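The same bounds can be checked numerically. The sketch below assumes a positive real resistance r and a tolerance strictly between 0 and 1, in which case solving |nominal − r| / r = tolerance gives the two closed-form roots used in the code. The function name `resistance_bounds` is hypothetical.

```python
def resistance_bounds(nominal: float, tolerance: float) -> tuple[float, float]:
    """Solve |nominal - r| / |r| = tolerance for the two positive roots r.

    Assumes nominal > 0 and 0 <= tolerance < 1.
    """
    if not 0 <= tolerance < 1:
        raise ValueError("tolerance must satisfy 0 <= tolerance < 1")
    lowest = nominal / (1 + tolerance)   # case r < nominal
    highest = nominal / (1 - tolerance)  # case r > nominal
    return lowest, highest


low, high = resistance_bounds(180.0, 0.02)
print(f"{low:.1f} ohm to {high:.1f} ohm")
```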
Example 1.2
A car's speedometer indicates a current speed of 102 km/h. What is the
relative error on that measure?
Solution
The implied precision on the measure is half the last digit, or ±0.5 km/h.
The real speed is in the interval from 101.5 to 102.5 km/h. The relative error is
computed from these two bounds:
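One way to carry out that computation, assuming Eq. (1.2) is applied with each bound taken in turn as the real value, is sketched below. The function name `relative_error_at_bounds` is hypothetical.

```python
def relative_error_at_bounds(reading: float, half_unit: float) -> tuple[float, float]:
    """Relative error of a reading against each end of its implied interval."""
    lower_bound = reading - half_unit
    upper_bound = reading + half_unit
    error_if_low = abs(reading - lower_bound) / abs(lower_bound)
    error_if_high = abs(reading - upper_bound) / abs(upper_bound)
    return error_if_low, error_if_high


errors = relative_error_at_bounds(102.0, 0.5)
print(f"relative error is at most {max(errors):.3%}")
```

Under this assumption both bounds give a relative error slightly below 0.5 %, with the larger of the two coming from the lower bound of 101.5 km/h.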
1.3.3 Significant Digits
\[ E_{\text{rel}} \le 0.5 \times 10^{-n} \tag{1.3} \]
This integer n is the actual number of significant digits that we are looking for.
Given multiple approximations for a value, the most accurate one is the one with the
highest value of n. Moreover, very bad approximations that yield a positive power
of 10 in Eq. (1.3) and therefore negative values of n are said to have no significant
digits.
Example 1.3
Given two approximations 2.9999 and 1.9999 for the real value 2.0000,
which has the greatest number of significant digits?
Solution
First, compute the relative error of each approximation:
\[ \frac{|2.9999 - 2.0000|}{|2.0000|} = 0.49995 \]
\[ \frac{|1.9999 - 2.0000|}{|2.0000|} = 0.00005 \]
Next, find the maximum exponent n for the inequalities on the order of
magnitude in Eq. (1.3):
\[ 0.49995 \le 0.5 \times 10^{0} \]
\[ 0.00005 \le 0.5 \times 10^{-4} \]
This tells us that the approximation of 1.9999 has four significant digits, while
the approximation of 2.9999 has none. This is despite the fact that the value of
2.9999 has one digit in common with the real value of 2.0000 while 1.9999
has none. However, this result is in line with mathematical sense: 1.9999 is
only 0.0001 off from the correct value, while 2.9999 is off by 0.9999.
1.3.4 Big O Notation
greater big O value. It is important to note again that this is a general rule and does
not account for special cases, such as specific input values for which an algorithm
with a greater big O value might outperform one with a smaller big O value.
The generalization power of big O in that case comes from the fact that, given a
mathematical sequence, it only keeps the term with the greatest growth rate,
discarding all other terms and the coefficient multiplying that term. Table 1.1
gives some examples of functions with their big O growth rates in the second
column. In all these functions, the term with the greatest growth rate is the one
with the greatest exponent. The first and second functions have the same big O
value despite the fact they would give very different results mathematically,
because they both have the same highest-order term $x^4$, and both the constant
multiplying that term and all other terms are abstracted away. The third function
will clearly give a greater result than either of the first two for a large range of lower
values of x, but that is merely a special case due to the coefficient multiplying the $x^3$
term of that equation. Beyond that range in the general case, values of $x^4$ will be
greater than $x^3$ multiplied by a constant, and so the third function's $O(x^3)$ is considered lesser than $O(x^4)$. The fourth function is a constant; it returns the same value
regardless of the input value of x. Likewise, its big O value is a constant O(1). When
the goal is to select the function with the least growth rate, the one with the lowest
big O value is preferred.
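The crossover described above can be seen numerically. The two functions below are hypothetical stand-ins chosen for illustration, not the entries of Table 1.1: one is O(x^4) with a small coefficient, the other is O(x^3) with a large coefficient.

```python
def quartic(x: float) -> float:
    """Hypothetical O(x^4) function: x^4 + 2x."""
    return x**4 + 2 * x


def scaled_cubic(x: float) -> float:
    """Hypothetical O(x^3) function with a large coefficient: 1000 x^3."""
    return 1000 * x**3


# For small x the large coefficient dominates; for large x the exponent does.
for x in (10, 100, 1_000, 10_000):
    larger = "quartic" if quartic(x) > scaled_cubic(x) else "scaled cubic"
    print(f"x = {x:>6}: larger function is the {larger}")
```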
The mathematical formulae and algorithms used for modelling are also evaluated
against a variable to obtain their big O values. However, unlike their software
engineering counterparts, they are not measured against variable input sizes; their
inputs will always be the measured values of the model. Rather, the variable will be
the size of the simulation step meant to approximate the continuous nature of the
natural world. Whether the model simulates discrete steps in time, in space, in
frequency, in pressure, or in some other attribute of the physical world, the smaller
the step, the more natural the simulation will be. In this context, big O notation is
thus measuring a decline rate instead of a growth rate, and the value of x becomes
smaller and tends towards zero. In that case, the terms of the equation with greater
exponents will decline more quickly than those with lesser exponents. Big O
notation will thus estimate the worst-case decline value by keeping the lowest
exponent term, discarding all other terms and constants multiplying that term.
This yields the third column of Table 1.1, and the equation with the greatest big
O exponent, rather than the lowest one, will be preferred. That equation is the one
that will allow the error of the formula to decrease the fastest as the step size is
reduced.
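The effect of the lowest-exponent term can be illustrated with a short experiment. The error formula below is a hypothetical example, 3h^2 + 5h^4, not one taken from the text; the observed order is estimated by halving the step size h and measuring how much the error shrinks.

```python
import math
from typing import Callable


def error_model(h: float) -> float:
    """Hypothetical error formula 3h^2 + 5h^4, which is O(h^2) as h -> 0."""
    return 3.0 * h**2 + 5.0 * h**4


def observed_order(error: Callable[[float], float], h: float) -> float:
    """Estimate p in error ~ C * h**p from the errors at h and h / 2."""
    return math.log(error(h) / error(h / 2)) / math.log(2)


# As h shrinks, the h^4 term fades and the observed order approaches 2.
for h in (1.0, 0.1, 0.01):
    print(f"h = {h:<5} observed order = {observed_order(error_model, h):.3f}")
```

For large steps the higher-exponent term still contributes noticeably, but as the step size is reduced the estimate settles on the lowest exponent, which is the one big O notation keeps in this context.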
Big O notation will be used in this book to measure both the convergence rate of
algorithms and their error rate. In fact, these two notions are interchangeable in this
context: an algorithm converges on a solution by reducing the error on its approximation of this solution, and the rate at which it converges is the same as the rate at
which it reduces the approximation error.
1.4 Summary
The main focus of this chapter has been to introduce and formally define several
notions related to error measurement. The chapter began by introducing the four
steps of the modelling cycle, namely, model selection, implementation, measurement, and simulation, along with the errors that can be introduced at each step. It
then defined the vocabulary of error measurement: precision, accuracy, and implied
precision. Finally, it presented formal measures of error, namely, relative and
absolute error, significant digits, and big O notation.
1.5 Exercises
1. Your partner uses a ruler to measure the length of a pencil and states that the
length is 20.35232403 cm. What is your response to the given precision?
2. Given two approximations of the constant π, 3.1417 and 3.1392838, which
has the greatest precision? Which has the greatest accuracy?
3. Which number has more precision and which has more accuracy as an approximation of e, 2.7182820135423 or 2.718281828?
4. The distance between two cities is given as approximately 332 mi. As
1 mi = 1.609344 km exactly, it follows that the distance is approximately
534.302208 km. Discuss this conversion with respect to precision and
accuracy.
5. What is approximately the absolute and relative error of 3.14 as an approximation of the constant π?
6. What are the absolute and relative errors of the approximation 22/7 of π? How
many significant digits does it have?
7. What are the absolute and relative errors of the approximation 355/113 of π?
How many significant digits does it have?
8. A resistor labelled as 240 Ω is actually measured at 243.32753 Ω. What are the
absolute and relative errors of the labelled value?
9. The voltage in a high-voltage transmission line is stated to be 2.4 MV while the
actual voltage may range from 2.1 to 2.7 MV. What is the maximum absolute
and relative error of voltage?
Chapter 2
Numerical Representation
2.1 Introduction
The numerical system used in the Western World today is a place-value base-10
system inherited from India through the intermediary of Arabic trade; this is why
the numbers are often called Arabic numerals or more correctly Indo-Arabic
numerals. However, this is not the only numerical system possible. For centuries,
the Western World used the Roman system instead, which is a base-10 additive-value system (digits of a number are summed and subtracted from each other to get
the value represented), and that system is still in use today, notably in names and
titles. Other civilizations experimented with other bases: some precolonial
Australian cultures used a base-5 system, while base-20 systems arose independently in Africa and in pre-Columbian America, and the ancient Babylonians used a
base-60 counting system. Even today, despite the prevalence of the base-10 system,
systems in other bases continue to be used every day: degrees, minutes, and seconds
are counted in the base-60 system inherited from Babylonian astrologers, and base-12 is used to count hours in the day and months (or zodiacs) in the year.
When it comes to working with computers, it is easiest to handle a base-2 system
with only two digits, 0 and 1. The main advantage is that this two-value system can
be efficiently represented by an open or closed electrical circuit that measures 0 or
5 V, or in computer memory by an empty or charged capacitor, or in secondary
storage by an unmagnetized or magnetized area of a metal disc or an absorptive or
refractive portion of an optical disc. This base-2 system is called binary, and a
single binary digit is called a bit.
It should come as no surprise, however, that trying to model our infinite and
continuous real world using a computer which has finite digital storage, memory,
and processing capabilities will lead to the introduction of errors in our modelling.
These errors will be part of all computer results; no amount of technological
advancement or upgrading to the latest hardware will allow us to overcome them.
Nonetheless, no one would advocate for engineers to give up computers altogether!
Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_2
It is only necessary to be aware of the errors that arise from using computers and to
account for them.
This chapter will look in detail at how binary mathematics works and how
computers represent and handle numbers. It will then present the weaknesses that
result from this representation and that, if ignored, can compromise the quality of
engineering work.
2.2
This section introduces binary numbers and arithmetic. Since we assume the reader
to be intimately familiar with decimal (base-10) numbers and arithmetic, we will
use that system as a bridge to binary.
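2.2.1 Decimal Numbers
$$d_n d_{n-1} d_{n-2} \ldots d_1 d_0 . d_{-1} d_{-2} \ldots \qquad (2.1)$$
$$\sum_{i=-\infty}^{n} d_i \times 10^i \qquad (2.2)$$
The digits $d_0$ and $d_{-1}$, the first digit multiplied by 10 to a negative power, are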
separated by a point called the decimal point. The digit dn, which has the greatest
value in the total number, is called the most significant digit, while the digit di with
the lowest value of i and therefore the lowest contribution in the total number is
called the least significant digit.
It is often inconvenient to write numbers in the form of Eq. (2.1), especially when
modelling very large or very small quantities. For example, the distance from the
Earth to the Sun is 150,000,000,000 m, and the radius of an electron is
0.0000000000000028 m. For this reason, numbers are often represented in scientific
notation, where the non-zero part is kept, normally with one digit left of the decimal
point and a maximum of m digits on the right, and the long string of zeroes is
simplified using a multiplication by a power of 10. The number can then be written as
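$$d_0 . d_{-1} d_{-2} \ldots d_{-m} \times 10^n \qquad (2.3)$$
$$d_0 . d_{-1} d_{-2} \ldots d_{-m}\,\mathrm{e}\,n \qquad (2.4)$$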
The m + 1 digits that are kept are called the mantissa, the value n is called the
exponent, and the letter e in Eq. (2.4) stands for the word exponent;
normally m < n. Using scientific notation, the distance from the Earth to the Sun
is $1.5 \times 10^{11}$ m and the radius of the electron is $2.8 \times 10^{-15}$ m.
We can now define our two basic arithmetic operations. The rule to perform the
addition of two decimal numbers written in the form of Eq. (2.1) is to line up the
decimal points and add the digits at corresponding positions. If a digit is missing
from a position in one of the numbers, it is assumed to be zero. If two digits sum to
more than 9, the least significant digit is kept in that position and the most
significant digit carries and is added to the digits on the left. An addition of two
numbers in scientific notations is done first by writing the two numbers at the same
exponent value, then adding the two mantissas in the same way as before. To
multiply two decimal numbers written in the form of Eq. (2.1), multiply the first
number by each digit $d_i$ of the second number and multiply that partial result by $10^i$,
then sum the partial results together to get the total. Given two numbers in scientific
notation, multiply the two mantissas together using the same method, and add the
two exponents together.
2.2.2 Binary Numbers
A binary system uses only two ordered digits (or bits, short for binary digits),
0 and 1, to represent any number as a sequence:
$$b_n b_{n-1} b_{n-2} \ldots b_1 b_0 . b_{-1} b_{-2} \ldots \qquad (2.5)$$
$$\sum_{i=-\infty}^{n} b_i \times 2^i \qquad (2.6)$$
The digits $b_0$ and $b_{-1}$, the first digit multiplied by 2 to a negative power, are
separated by a point; however, it would be wrong to call it a decimal point now
since this is not a decimal system. In binary it is called the binary point, and a more
general term for it independent of base is the radix point. We can define a binary
scientific notation as well, as
$$b_0 . b_{-1} b_{-2} \ldots b_{-m} \times 2^n \qquad (2.7)$$
$$b_0 . b_{-1} b_{-2} \ldots b_{-m}\,\mathrm{e}\,n \qquad (2.8)$$
The readers can thus see clear parallels with Eqs. (2.1)–(2.4) which define
our decimal system. Likewise, the rules for addition and multiplication in
binary are the same as in decimal, except that digits carry whenever two 1s are
summed.
Since binary and decimal use the same digits 1 and 0, it can lead to ambiguity as
to whether a given number is written in base 2 or base 10. When this distinction is
not clear from the context, it is habitual to suffix the numbers with a subscript of
their base. For example, the number 110 is ambiguous, but $110_{10}$ is one hundred and
ten in decimal, while $110_2$ is a binary number representing the number 6 in decimal.
It is not necessary to write that last value as $6_{10}$ since $6_2$ is nonsensical and no
ambiguity can exist.
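The role of the base subscript, and the binary carry rule, can be checked with a few lines of Python. This is only an illustrative sketch: the operands 1011 and 110 are arbitrary values chosen here, and `int` and `bin` are Python built-ins rather than anything specific to this book.

```python
# int(text, base) reads a digit string in the stated base, playing the role
# of the subscript that removes the ambiguity of "110".
as_decimal = int("110", 10)   # one hundred and ten
as_binary = int("110", 2)     # six

# Binary addition and multiplication follow the decimal rules, except that
# a carry happens whenever two 1s are summed.
a = int("1011", 2)            # 11 in decimal
b = int("110", 2)             # 6 in decimal
total = bin(a + b)            # '0b10001', which is 17 in decimal
product = bin(a * b)          # '0b1000010', which is 66 in decimal
print(as_decimal, as_binary, total, product)
```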
Example 2.1
Compute the addition and multiplication of 3.25 and 18.7 in decimal and of
1011.1 and 1.1101 in binary.
Solution
Following the first addition rule, line up the two numbers and sum the digits
as follows:
   3.25
+ 18.7
-------
  21.95
The second addition rule requires writing the numbers in scientific notation
with the same exponent. These two numbers in scientific notation are
$3.25 \times 10^0$ and $1.87 \times 10^1$, respectively. Writing them in the same exponent
would change the first one to $0.325 \times 10^1$. Then the sum of the mantissas gives
  0.325
+ 1.87
-------
  2.195
for a final total of $2.195 \times 10^1$, the same result as before.
(continued)
2.2.3 Base Conversions
The conversion from binary to decimal can be done simply by computing the
summation from Eq. (2.6).
The conversion from decimal to binary is much more tedious. For a decimal
number N, it is necessary to find the largest power k of 2 such that $2^k \le N$. Add that
power of 2 to the binary number and subtract it from the decimal number, and
continue the process until the decimal number has been reduced to 0. This will yield
the binary number as a summation of the form of Eq. (2.6).
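The decimal-to-binary procedure described above can be sketched as follows. The function names and the `max_fraction_bits` cut-off are hypothetical additions for this illustration; the cut-off is only there so that fractions with no finite binary expansion still terminate.

```python
import math

def largest_power_of_two(value):
    """Largest integer k such that 2**k <= value, for value > 0."""
    _, exponent = math.frexp(value)   # value = m * 2**exponent with 0.5 <= m < 1
    return exponent - 1

def decimal_to_binary(value, max_fraction_bits=20):
    """Greedy conversion: repeatedly remove the largest power of 2 that fits.

    Returns the exponents k that were used, largest first.
    """
    exponents = []
    while value > 0:
        k = largest_power_of_two(value)
        if k < -max_fraction_bits:    # assumed cut-off for endless fractions
            break
        exponents.append(k)
        value -= 2.0 ** k
    return exponents

def binary_string(exponents):
    """Write the retained exponents as a digit string with a binary point."""
    if not exponents:
        return "0"
    high = max(exponents[0], 0)
    low = min(exponents[-1], 0)
    digits = ""
    for k in range(high, low - 1, -1):
        digits += "1" if k in exponents else "0"
        if k == 0 and low < 0:
            digits += "."
    return digits

print(binary_string(decimal_to_binary(5.625)))
```

Applied to 5.625, the sketch keeps the exponents 2, 0, -1, and -3.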
Example 2.2
Convert the binary number 101.101 to decimal, and then convert the result
back to binary.
Solution
Convert the number to decimal by writing it in the summation form of
Eq. (2.6) and computing the total:
$$101.101_2 = 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3}$$
$$= 4 + 0 + 1 + 0.5 + 0 + 0.125$$
$$= 5.625$$
Converting 5.625 back to binary requires going through the algorithm steps:
Step   N       k     $2^k \le N$   $N - 2^k$
1      5.625   2     4             1.625
2      1.625   0     1             0.625
3      0.625   −1    0.5           0.125
4      0.125   −3    0.125         0
The retained powers $2^2$, $2^0$, $2^{-1}$, and $2^{-3}$ sum to $101.101_2$, the original number.
2.3 Number Representation
2.3.1 Fixed-Point Representation
Perhaps the easiest method of storing a real number is by storing a fixed number of
digits before and after the radix point, along with its sign (0 or 1 to represent a
positive or negative number respectively). For the sake of example, we will
assume three digits before the point and three after, thus storing a decimal
2.3.2 Floating-Point Representation
0500300, 0510030, or 0520003, and only by subtracting the bias and shifting
the mantissa appropriately does it become evident that all four values are the
same. The requirement that the first digit of the mantissa be non-zero ensures
that only the first of these four representations is legal.
The requirement that the first digit of the mantissa must be non-zero introduces a
surprising new problem: representing the real value 0 in floating-point representation is a rule-breaking special case. Moreover, given that each floating-point value
has a sign, there are two such special values at 0000000 and 1000000. Floating-point representation uses this to its advantage by actually defining two values of
zero, a positive and a negative one. A positive zero represents a positive number
smaller than the smallest positive number in the range, and a negative zero
represents a negative number greater than the greatest negative number in the range.
It is also possible to include an additional exception to the rule that the first digit
of the mantissa must be non-zero, in the special case where a number is so small that
it cannot be represented while respecting the rule. In the six-digit example, this
would be the case, for example, for the number $1.23 \times 10^{-50}$, which could be
represented as 0000123 but only with a zero as the first digit of the mantissa.
Allowing this type of exception is very attractive; it would increase the range of
values that can be represented by several orders of magnitude at no cost in memory
space and without making relative comparisons more expensive. But this is no free
lunch: the cost is that the mantissa will have fewer digits, and thus the relative error
on the values in this range will be increased. Nonetheless, this trade-off can
sometimes be worthwhile. A floating-point representation that allows this exception
is called denormalized.
Example 2.3
Represent 10! in the six-digit floating-point format.
Solution
First, compute that 10! = 3628800, or $3.6288 \times 10^6$ in scientific notation. The
exponent is thus 55 to take into account the bias of 49, the mantissa rounded
to four digits is 3.629, and the positive sign is a 0, giving the representation
0553629.
Example 2.4
What number is represented, using the six-digit floating-point format, by
1234567?
Solution
The leading 1 indicates that it is a negative number, the exponent is 23 and
the mantissa is 4.567. This represents the number $-4.567 \times 10^{23-49} = -4.567 \times 10^{-26}$.
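Examples 2.3 and 2.4 can be reproduced with a short sketch of the six-digit format: one sign digit, two exponent digits with a bias of 49, and four mantissa digits. The function names `encode` and `decode` are hypothetical, the rounding mode is an assumption, and zero and denormalized values are deliberately left out.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

BIAS = 49  # exponent bias of the six-digit format described in the text

def encode(value):
    """Encode a non-zero number as sign digit + 2 exponent digits + 4 mantissa digits."""
    number = Decimal(str(value))
    sign = "1" if number < 0 else "0"
    # Round to four significant digits (rounding half up is an assumption).
    rounded = Context(prec=4, rounding=ROUND_HALF_UP).plus(abs(number))
    exponent = rounded.adjusted()              # power of 10 of the leading digit
    digits = "".join(str(d) for d in rounded.as_tuple().digits).ljust(4, "0")
    biased = exponent + BIAS
    if not 0 <= biased <= 99:
        raise ValueError("exponent outside the representable range")
    return f"{sign}{biased:02d}{digits}"

def decode(code):
    """Decode a seven-character string back into a Decimal value."""
    sign = -1 if code[0] == "1" else 1
    exponent = int(code[1:3]) - BIAS
    mantissa = Decimal(code[3] + "." + code[4:7])
    return sign * mantissa.scaleb(exponent)

print(encode(3628800))      # 10! from Example 2.3
print(decode("1234567"))    # the value from Example 2.4
```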
2.3.3
The representation most commonly used in computers today is double, short for
double-precision floating-point format, and formally defined in the IEEE
754 standard. Numbers are stored in binary (as one would expect in a computer)
over a fixed amount of memory of 64 bits (8 bytes). The name comes from the fact
this format uses double the amount of memory that was allocated to the original
floating-point format (float) numbers, a decision that was made when it was found
that 4 bytes was not enough to allow for the precision needed for most scientific and
engineering calculations.
The 64 bits of a double number comprise, in order, 1 bit for the sign (0 for
positive numbers, 1 for negative numbers), 11 bits for the exponent, and 52 bits for
the mantissa. The maximum exponent value that can be represented with 11 bits is
2047, so the bias is 1023 ($01111111111_2$), allowing the representation of numbers
in the range from $2^{-1022}$ to $2^{1023}$. And the requirement defined previously, that the
first digit of the mantissa cannot be 0, still holds. However, since the digits are now
binary, this means that the first digit of the mantissa must always be 1; it is
consequently not stored at all, and all 52 bits of the mantissa represent digits after
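the radix point following an implied leading 1. Taken alone, this rule would leave no room
for denormalized numbers; the IEEE 754 standard accommodates them by reserving the all-zero exponent, which marks zero and denormalized values stored without the implied leading 1.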
For humans, reading and writing 64-bit-long binary numbers can be tedious
and very error prone. Consequently, for convenience, the 64 bits are usually grouped
into 16 sets of 4 bits, and the value of each set of 4 bits (which will be between 0 and
15) is written using a single hexadecimal (base-16) digit. The following Table 2.1
gives the equivalences between binary, hexadecimal, and decimal.
Binary   Hexadecimal   Decimal
0000     0             0
0001     1             1
0010     2             2
0011     3             3
0100     4             4
0101     5             5
0110     6             6
0111     7             7
1000     8             8
1001     9             9
1010     a             10
1011     b             11
1100     c             12
1101     d             13
1110     e             14
1111     f             15
1100 0000 0110 0110 1111 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
Example 2.6
Find the double representation of the integer 289.
Solution
First, note that the number is positive, so the sign bit is 0.
Next, convert the number to binary: $289 = 256 + 32 + 1 = 2^8 + 2^5 + 2^0 = 100100001_2$. In scientific notation, this becomes $1.00100001_2 \times 2^8$ (the
radix point must move eight places to the left). The exponent for the number
(continued)
Example 2.7
Find the double-precision floating-point format of −324/33 given that its
binary representation is:
−1001.11010001011101000101110100010111010001011101000101110100010111010001…
Solution
The number is negative, so the sign bit is 1.
The radix point must be moved three spots to the left to produce a
scientific-format number, so the exponent is $3_{10} = 11_2$. Adding the bias
gives 01111111111 + 11 = 10000000010.
Finally, rounding the infinite number to 53 bits and removing the leading
1 yield the 52 bits of the mantissa, 0011101000101110100010111010001011101000101110100011.
Putting it all together, the double representation is:
1 10000000010 0011101000101110100010111010001011101000101110100011
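The bit patterns worked out by hand in these examples can be cross-checked with Python's standard `struct` module, which exposes the 8 bytes of a double. This is a sketch for verification only; the helper names `double_bits` and `double_hex` are hypothetical.

```python
import struct

def double_bits(value):
    """Return the sign, exponent, and mantissa bit strings of a double."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", value))
    bits = f"{as_int:064b}"
    return bits[0], bits[1:12], bits[12:]

def double_hex(value):
    """Return the 16 hexadecimal digits of a double."""
    return struct.pack(">d", value).hex()

sign, exponent, mantissa = double_bits(-324 / 33)
print(sign, exponent, mantissa)   # the three fields of Example 2.7
print(double_hex(289.0))          # the integer of Example 2.6
```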
2.4
Since computers try to represent the infinite range of real numbers using the finite
set of floating-point numbers, it is unavoidable that some problems will arise. Three
of the most common ones are explored in this section.
2.4.1
double MaxVal = 1.8E307;
double MinVal = 5E-324;
The name overflow comes from a figurative image of the problem. Picture
the largest number that can be represented in 8-bit binary, $11111111_2$. Adding 1 to
that value causes a chain of carry-overs: the least significant bit flips to 0 and a
1 carries over to the next position and causes that 1 to flip and carry a 1 over to the
next position, and so on. In the end the most significant bit flips to zero and a 1 is
carried over, except it has no place to carry to since the value is bounded to 8 bits;
the number is said to overflow. As a result, instead of $100000000_2$, the final result
of the addition is only $00000000_2$. This problem will be very familiar to the older
generation of gamers: in many 8-bit RPG games, players who spent too much time
levelling up their characters might see a level-255 ($11111111_2$) character gain a
level and fall back to its starting level. This problem was also responsible for the
famous kill screen in the original Pac-Man game, where after passing level
255, players found themselves in a half-formed level 00.
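The 8-bit carry chain can be imitated in Python, whose integers do not overflow on their own, by masking the result to its lowest 8 bits. The function name `add_8bit` is a hypothetical helper for this sketch.

```python
def add_8bit(a, b):
    """Add two unsigned integers and keep only the lowest 8 bits of the result."""
    return (a + b) & 0xFF        # the mask drops the carry out of the eighth bit

level = 0b11111111               # 255, the largest 8-bit value
level = add_8bit(level, 1)       # the final carry has nowhere to go
print(f"{level:08b}")            # prints 00000000
```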
Overflow and underflow are well-known problems, and they have a solution defined
in the IEEE 754 standard. That solution is to define four special values: a positive
infinity as 7ff0000000000000, a negative infinity as fff0000000000000, a positive
zero as 0000000000000000, and a negative zero as 8000000000000000 (both of
which are different from each other and from an actual value of zero). Converting
these to binary will show the positive and negative infinity values to be the appropriate
sign bit with all-1 exponent bits and all-zero mantissa bits, while the positive and
negative zero values are again the appropriate sign bit with all-zero exponent and
mantissa bits. Whenever a computation gives a result that falls beyond one of the four
edges of the double range, it is replaced by the appropriate special value. The sample
code in the next Fig. 2.1 is an example that will generate all four special values.
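A minimal Python sketch along these lines (not the book's own listing) pushes a value past each of the four edges of the double range and prints the resulting bit patterns. The helper `double_hex` and the variable names are assumptions of this example.

```python
import struct

def double_hex(value):
    """Hexadecimal form of the 64 bits of a double."""
    return struct.pack(">d", value).hex()

largest = 1.7976931348623157e308    # largest finite double
smallest = 5e-324                   # smallest positive double

positive_infinity = largest * 10    # overflow past the positive edge
negative_infinity = -largest * 10   # overflow past the negative edge
positive_zero = smallest / 10       # underflow towards zero from above
negative_zero = -smallest / 10      # underflow towards zero from below

for value in (positive_infinity, negative_infinity, positive_zero, negative_zero):
    print(double_hex(value), value)
```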
2.4.2 Subtractive Cancellation
Consider the following difference: 3.537 − 3.523 = 0.014. Using the six-digit floating-point system introduced previously, the numbers 3.523, 3.537, and 0.014 are represented by 0493523,
0493537, and 0471400, respectively. All three numbers appear to have the same
precision, with four decimal digits in the mantissa. However, 3.523 is really a
rounded
representation of any number in the range [3.5225, 3.5235], as any
number in that five-digit range will round to the four-digit representation 3.523.
Likewise, the second number 3.537 represents the entire range [3.5365, 3.5375].
The maximum relative error on any of these approximations is 0.00014, so they are
not a problem. However, when considering the ranges, the result of the subtraction
is not 0.014 but actually could be any value in the range [0.013, 0.015]. The result
0.014 has no significant digits. Worse, as an approximation of the range of results,
0.014 has a relative error of 0.071, more than 500 times greater than the error of the
initial values.
This phenomenon where the subtraction of similar numbers results in a significant reduction in precision is called subtractive cancellation. It will occur any time
there is a subtraction of two numbers which are almost equal, and the result will
always have no significant digits and much less precision than either initial
number.
Unlike overflow and underflow, double format does not substitute the result of
such operations with a special value. The result of 0.014 in the initial example will
be stored and used in subsequent computations as if it were a precise value rather
than a very inaccurate approximation. It is up to the engineers designing the
mathematical software and models to check for such situations in the algorithms
and take steps to avoid them.
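The loss of precision in the difference of 3.537 and 3.523 can be made explicit by carrying the implied ranges through the subtraction. The sketch below uses exact fractions so that no further rounding clouds the picture; `implied_range` is a hypothetical helper, and half a unit in the last decimal place is the assumed width of each range.

```python
from fractions import Fraction

def implied_range(text):
    """Range of values that round to the given decimal string."""
    value = Fraction(text)
    decimals = len(text.split(".")[1]) if "." in text else 0
    half_unit = Fraction(1, 2 * 10**decimals)   # half a unit in the last place
    return value - half_unit, value + half_unit

a_low, a_high = implied_range("3.537")
b_low, b_high = implied_range("3.523")

# Worst cases of a - b: smallest a minus largest b, largest a minus smallest b.
diff_low, diff_high = a_low - b_high, a_high - b_low
nominal = Fraction("3.537") - Fraction("3.523")

input_error = float((a_high - Fraction("3.537")) / Fraction("3.537"))
output_error = float((diff_high - nominal) / nominal)
print(float(diff_low), float(diff_high))
print(input_error, output_error, output_error / input_error)
```

The last printed ratio shows how many times larger the relative error of the difference is than that of the operands.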
Example 2.8
Consider two approximations of π using the six-digit floating-point representation: 3.142 and 3.14. Subtract the second from the first. Then, compute the
relative error on both initial values and on the subtraction result.
Solution
$$3.142 - 3.14 = 0.002$$
However, in six-digit floating-point representation, 3.142 (0493142) represents any value in [3.1415, 3.1425] and 3.14 (0493140) represents any value
in the range [3.1395, 3.1405]. Their difference is any number in the range
[0.001, 0.003]. The result of 0.002 has no significant digits.
Compared to π = 3.141592654…, the value 3.142 has a relative error of
0.00013 and the value 3.14 has a relative error of 0.00051. The correct result of
the subtraction is π − 3.14 = 0.001592654…, and compared to that result,
0.002 has a relative error of 0.2558, about 500 times greater than the relative error of
3.14.
2.4.3 Non-associativity of Addition
2:9
$$5592 + (0.3923 + 0.3923) = 5592 + 0.7846 = 5592.7846 \qquad (2.10)$$
then there is no problem in storing the partial result 0.7846, and only the final result
needs to be rounded to 5593. However, if the summation is computed as
$$(5592 + 0.3923) + 0.3923 = 5592.3923 + 0.3923 = 5592.7846 \qquad (2.11)$$
then there is a problem, as the partial result 5592.3923 gets rounded to 5592, and the
second part of the summation then becomes 5592 + 0.3923 again, the result of
which again gets rounded to 5592. The final result of the summation has changed
because of the order in which the partial summations were computed, in clear
violation of the associativity property.
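The two orders of summation can be replayed with Python's `decimal` module, which can be told to keep only four significant digits after every operation. This is an illustrative sketch; the variable names are chosen here, and the module's default rounding mode is assumed to be acceptable for the example.

```python
from decimal import Decimal, localcontext

with localcontext() as context:
    context.prec = 4                      # keep four significant digits after every operation
    big, small = Decimal("5592"), Decimal("0.3923")
    small_first = big + (small + small)   # 0.7846 is stored exactly, then 5592.7846 -> 5593
    big_first = (big + small) + small     # 5592.3923 -> 5592, twice

print(small_first, big_first)             # prints 5593 5592
```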
Example 2.9
Using three decimal digits of precision, add the powers of 2 from 0 to 17 in
the order from 0 to 17 and then in reverse order from 17 to 0. Compute the
relative error of the final result of each of the two summations.
Solution
Recall that with any such system, numbers must be rounded before and after
any operation is performed. For example, $2^{10} = 1024 \approx 1020$ after rounding
to three significant digits. Thus, the partial sums in increasing order are
1 3 7 15 31 63 127 255 511 1020 2040 4090 8190 16400 32800 65600
131000 262000
while in decreasing order, they are
131000 196000 229000 245000 253000 257000 259000 260000 261000
261000 261000 261000 261000 261000 261000 261000 261000 261000
The correct value of this sum is $2^{18} - 1 = 262{,}143$. The relative error of the
first sum is 0.00055, while the relative error of the second sum is 0.0044. So
not only are the results different given the order of the sum, but the sum in
increasing order gives a result an order of magnitude more accurate than the
second one.
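The two sums of this example can be checked with the `decimal` module set to three significant digits. In this sketch, `rounded_sum` is a hypothetical helper, and the module's default round-half-even mode is assumed; it rounds every operand before the addition and every partial sum after it.

```python
from decimal import Decimal, localcontext

def rounded_sum(terms):
    """Sum the terms, rounding every operand and partial sum to three digits."""
    with localcontext() as context:
        context.prec = 3
        total = Decimal(0)
        for term in terms:
            total = total + (+Decimal(term))   # unary plus rounds the operand first
    return int(total)

powers = [2**k for k in range(18)]             # 2^0 up to 2^17
exact = sum(powers)                            # 2^18 - 1
increasing = rounded_sum(powers)
decreasing = rounded_sum(reversed(powers))

for name, value in (("increasing", increasing), ("decreasing", decreasing)):
    print(name, value, f"relative error {abs(exact - value) / exact:.5f}")
```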
Like with subtractive cancellation, there is no special value in the double format
to substitute in for non-associative results, and it is the responsibility of software
engineers to detect and correct such cases when they happen in an algorithm. In that
respect, it is important to note that non-associativity can be subtly disguised in the
code. It can occur, for example, in a loop that sums a small value into a total at each
increment of a long process, such as in the case illustrated in Fig. 2.2. As the partial
total grows, the small incremental addition will become rounded off in later
iterations, and inaccuracies will not only occur, they will become worse and
worse as the loop goes on.
The solution to this problem is illustrated in Example 2.9. It consists in sorting
the terms of summations in order of increasing value. Two values of the same
magnitude summed together will not lose precision, as the digits of neither number
will be rounded off. By summing together smaller values first, the precision of the
partial total is maintained, and moreover the partial total grows in magnitude and
can then be safely added to larger values. This ordering ensures that the cumulative
sum of many small terms is still present in the final total.
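A loop of the kind described above, and the effect of sorting the terms first, can be sketched with ordinary doubles. The values 1.0 and 1e-16 and the count of 100,000 terms are assumptions picked so that each small term is less than half the spacing between doubles near 1.0.

```python
small_terms = [1e-16] * 100_000
large_term = 1.0

# Large value first: each tiny increment is rounded off as soon as it is added.
running_total = large_term
for term in small_terms:
    running_total += term

# Increasing order: the small terms accumulate before meeting the large one.
sorted_total = 0.0
for term in sorted(small_terms + [large_term]):
    sorted_total += term

print(running_total)   # the small terms have vanished
print(sorted_total)    # their combined contribution of about 1e-11 survives
```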
2.5 Summary
2.6 Exercises
1. Represent the decimal number 523.2345 in scientific notation.
2. What is the problem if we don't use scientific notation to represent 1.23e10?
3. Add the two binary integers $100111_2$ and $1000110_2$.
4. Add the two binary integers $1100011_2$ and $10100101101_2$.
5. Add the two binary numbers $100.111_2$ and $10.00110_2$.
6. Add the two binary numbers $110.0011_2$ and $10100.101101_2$.
7. Multiply the two binary integers $100111_2$ and $1010_2$.
8. Multiply the two binary integers $1100011_2$ and $10011_2$.
9. Multiply the two binary numbers $10.1101_2$ by $1.011_2$.
10. Multiply the two binary numbers $100.111_2$ and $10.10_2$.
11. Multiply the two binary numbers $1100.011_2$ and $10.011_2$.
12. Convert the binary number $10010.0011_2$ to decimal.
13. Convert the binary number $111.111111_2$ to decimal.
14. What decimal numbers do the following represent using the six-digit floating-point
format?
a. 479323
b. 499323
c. 509323
d. 549323
15. Given their implied precision, what range of numbers do the following represent using the six-digit floating-point format?
a. 521234
b. 522345
16. Represent the following numbers in six-digit floating-point format:
a. Square root of two (1.414213562)
b. One million (1000000)
c. $e^{-10} \approx 0.00004539992976$
17. Convert the decimal number 1/8 to binary double format.
18. Convert the hexadecimal double format number c01d600000000000 to binary
and to decimal.
19. Convert the following binary double format numbers to decimal:
a. 0100000001100011001011111000000000000000000000000000000000000000
b. 0011111111101000100000000000000000000000000000000000000000000000
20. Add the following two hexadecimal double format numbers: 3fe8000000000000
and 4011000000000000.
21. Using the six-digit floating-point format:
a. What is the largest value which can be added to 3.523 which will result in a
sum of 3.523 and why?
b. What is the largest float which may be added to 722.4 which will result in a
sum of 722.4 and why?
c. What is the largest float which may be added to 722.3 which will result in a
sum of 722.3 and why?
22. How would you calculate the sum of n2 for n 1, 2, . . ., 100,000 and why?
Chapter 3
Iteration
3.1 Introduction
This chapter and the following four chapters introduce five basic mathematical
modelling tools: iteration, linear algebra, Taylor series, interpolation, and
bracketing. While they can be used as simple modelling tools on their own, their
main function is to provide the basic building blocks from which numerical
methods and more complex models will be built.
One technique that will be used in almost every numerical method in this book
consists in applying an algorithm to some initial value to compute an approximation
of a modelled value, then to apply the algorithm to that approximation to compute an
even better approximation and repeat this step until the approximation improves to a
desired level. This process is called iteration, and it can be as simple as applying a
mathematical formula over and over or complex enough to require conditional flow-control statements. It also doubles as a modelling tool for movement, both for
physical movement in space and for the passage of time. In those cases, one iteration
can represent a step along a path in space or the increment of a clock.
This chapter will introduce some basic notions and terminology related to the
tool of iteration, including most notably the different halting conditions that can
come into play in an iterating function.
3.2
Given a function f(x) and an initial value x0, it is possible to calculate the result of
applying the function to the value as such:
$$x_1 = f(x_0) \qquad (3.1)$$
This would be our first iteration. The second iteration would compute x2 by
applying the function to x1, the third iteration would compute x3 by applying the
function to x2, and so on. More generally, the ith iteration is given in Eq. (3.2):
$$\begin{aligned} x_1 &= f(x_0) \\ x_2 &= f(x_1) \\ x_3 &= f(x_2) \\ &\;\;\vdots \\ x_i &= f(x_{i-1}) \end{aligned} \qquad (3.2)$$
Each value of x will be different from the previous. However, for certain
functions, each successive value of x will become more and more similar to the
previous one, until they stop changing altogether. At that point, the function is said
to have converged to the value x. The steps of a function starting at x0 and
converging to xi are given in Eq. (3.3):
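$$\begin{aligned} x_1 &= f(x_0) \\ x_2 &= f(x_1) \\ x_3 &= f(x_2) \\ &\;\;\vdots \\ x_{i-1} &= f(x_{i-2}) \\ x_i &= f(x_{i-1}) \\ x_i &= f(x_i) \end{aligned} \qquad (3.3)$$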
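The sequence of Eq. (3.3) can be sketched as a small loop that stops once applying the function no longer changes the value. The function name `iterate`, the `max_iterations` safeguard, and the choice of the cosine function with a starting value of 0.5 are assumptions of this example; the cap is there because, in floating-point arithmetic, some functions settle into a tiny oscillation instead of an exact equality.

```python
import math

def iterate(f, x0, max_iterations=100):
    """Apply x_i = f(x_(i-1)) until the value stops changing, as in Eq. (3.3)."""
    x = x0
    for i in range(1, max_iterations + 1):
        x_next = f(x)
        if x_next == x:            # x_i = f(x_i): the sequence has converged
            return x, i
        x = x_next
    return x, max_iterations       # cap reached (a safeguard added in this sketch)

value, iterations = iterate(math.cos, 0.5)
print(value, iterations)
```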
decimal to converge to 0.73. This notion of the speed with which a function
converges is called convergence rate, and in the numerical methods that will be
introduced later on, it will be found that it is directly linked to the big O error rate of
the functions themselves: a function with a lesser big O error rate will have less
error remaining in the value computed at each successive iteration and will thus
converge to the error-free value in fewer iterations than a function with a greater big
O error rate.
Finally, it is easy to see that not all functions will converge. Using the $x^2$ function
of the calculator and starting at any value greater than 1 will yield results that are
both greater at each successive iteration and that increase more between each
successive iteration, until the maximum value or the precision of the calculator is
exceeded. Functions with that behaviour are said to diverge. It is worth noting that
some functions display both behaviours: they can converge when iterating from
certain initial values, but diverge when iterating from others. The $x^2$ function is in
fact an example of such a function: it diverges for any initial value greater than 1 or
less than −1, but converges to 0 using any initial value between −1 and 1.
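Both behaviours of the squaring function can be observed with a few iterations in double precision. In this sketch the helper name and the count of 20 iterations are arbitrary choices; the product `x * x` is used because it overflows quietly to infinity in Python.

```python
def square_repeatedly(x0, iterations=20):
    """Iterate f(x) = x^2 a fixed number of times starting from x0."""
    x = x0
    for _ in range(iterations):
        x = x * x      # x * x overflows quietly to inf, so the loop never raises
    return x

print(square_repeatedly(0.5))    # converges: underflows to 0.0
print(square_repeatedly(-0.9))   # also converges to 0.0
print(square_repeatedly(2.0))    # diverges: overflows to inf
print(square_repeatedly(-3.0))   # also diverges to inf
```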
Example 3.1
Starting from any positive or negative non-zero value, compute ten iterations
of the function:
$$f(x) = \frac{x}{2} + \frac{1}{x}$$
Solution
This function has been known since antiquity to converge to $\sqrt{2}$ from a positive initial value; from a negative initial value it converges to $-\sqrt{2}$. The exact
convergence sequence will depend on the initial value, but positive and
negative examples are given in the table below.
Iteration   Positive sequence   Negative sequence
x0          44.0                −88.0
x1          22.0227272727       −44.0113636364
x2          11.0567712731       −22.0284032228
x3          5.6188279524        −11.0595975482
x4          2.9873870354        −5.6202179774
x5          1.8284342085        −2.9880380306
x6          1.4611331460        −1.8286867771
x7          1.4149668980        −1.4611838931
x8          1.4142137629        −1.4149685022
x9          1.4142135623        −1.4142137637
x10         1.4142135623        −1.4142135623
(continued)
[Figure: plot with horizontal axis "Iteration number"]
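The iterations of Example 3.1 can be regenerated with a short loop. The sketch assumes the function is f(x) = x/2 + 1/x and that the starting values are 44.0 and −88.0, as suggested by the tabulated values; the function names are hypothetical.

```python
def babylonian_step(x):
    """One iteration of f(x) = x/2 + 1/x."""
    return x / 2 + 1 / x

def sequence(x0, iterations=10):
    """Return [x0, x1, ..., x_iterations]."""
    values = [x0]
    for _ in range(iterations):
        values.append(babylonian_step(values[-1]))
    return values

positive = sequence(44.0)
negative = sequence(-88.0)
for i, (p, n) in enumerate(zip(positive, negative)):
    print(f"x{i:<3} {p:16.10f} {n:17.10f}")
```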
3.3 Halting Conditions
In that definition, when the values computed in two successive iterations have a
relative error Ei less than a preset threshold, they are close enough to be considered
equal for all intents and purposes, and the function has successfully converged.
If the algorithm iterates to discover the value of a vector instead of a scalar, a
better test for convergence is to compute the Euclidean distance between two
successive vectors. This distance is the square root of the sum of squared differences between each pair of values in the two vectors. More formally, given two
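successive $n \times 1$ vectors $\mathbf{x}_i = [x_{i,0}, \ldots, x_{i,n-1}]^T$ and $\mathbf{x}_{i-1} = [x_{i-1,0}, \ldots, x_{i-1,n-1}]^T$, the
Euclidean distance is defined as
$$E_i = \|\mathbf{x}_{i-1} - \mathbf{x}_i\| = \sqrt{(x_{i-1,0} - x_{i,0})^2 + \cdots + (x_{i-1,n-1} - x_{i,n-1})^2} \qquad (3.6)$$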
Once again, when two successive vectors have a distance Ei less than a preset
threshold, they are close enough to be considered equal for all intents and purposes,
and the function has successfully converged.
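The two success tests can be sketched as small helper functions. The scalar relative error is assumed here to be |x_i − x_(i−1)| / |x_i|, the vector test is the Euclidean distance between successive vectors, and the function names and the default threshold of 1e-5 are hypothetical choices for the illustration.

```python
import math

def relative_error(previous, current):
    """Scalar test: |x_i - x_(i-1)| / |x_i| (the form assumed in this sketch)."""
    return abs(current - previous) / abs(current)

def euclidean_distance(previous, current):
    """Vector test: square root of the sum of squared differences."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(previous, current)))

def has_converged(previous, current, threshold=1e-5):
    """Success condition for either a pair of scalars or a pair of vectors."""
    if isinstance(current, (int, float)):
        return relative_error(previous, current) < threshold
    return euclidean_distance(previous, current) < threshold

print(has_converged(0.523596, 0.523599))        # scalars that are close enough
print(has_converged([1.0, 2.0], [1.0, 2.1]))    # vectors that are still too far apart
```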
An iterative function fails to converge if it does not reach a success condition in a
reasonable amount of time. A simple catch-all failure condition would be to set a
Example 3.2
Iterate $f(x) = x - \frac{\sin(x) - 0.5}{\cos(x)}$ in radians using $x_0 = -0.5$, until the relative error is
$10^{-5}$ or up to a maximum of ten iterations.
Solution
Since the target relative error is $10^{-5}$, it is necessary to keep six digits of
precision in the values. Any less would make it impossible to compute the
error, while more digits would be unnecessary.
The results of the iterations, presented in the table below, show that the
threshold relative error has been surpassed after four iterations. The function
has converged and reached the success condition. It is unnecessary to compute the remaining six iterations.
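The iteration of this example, with both its success and failure halting conditions, can be sketched in Python. Two assumptions are made and should be read as such: the function is taken to be f(x) = x − (sin(x) − 0.5)/cos(x), and the starting value is taken to be −0.5, which is the value consistent with the first relative error of 1.811624 listed in the table. The name `iterate_until` is hypothetical.

```python
import math

def f(x):
    """f(x) = x - (sin(x) - 0.5) / cos(x), with x in radians (assumed form)."""
    return x - (math.sin(x) - 0.5) / math.cos(x)

def iterate_until(f, x0, threshold=1e-5, max_iterations=10):
    """Return (value, iterations, converged) using the relative-error halting test."""
    x = x0
    for i in range(1, max_iterations + 1):
        x_next = f(x)
        error = abs(x_next - x) / abs(x_next)
        print(f"x{i}  {x_next:.6f}  {error:.6f}")
        x = x_next
        if error < threshold:
            return x, i, True          # success condition reached
    return x, max_iterations, False    # failure condition: iteration cap reached

value, iterations, converged = iterate_until(f, -0.5)
```

The printed values can be compared with the table that follows; the converged value is an approximation of π/6.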
Iteration   Value      Relative error
x0          −0.5       N/A
x1          0.616049   1.811624
x2          0.520707   0.183101
x3          0.523596   0.005518
x4          0.523599   0.000006
3.4 Summary
This chapter has introduced the concept of iteration, the first of the five mathematical tools that will underlie the numerical methods and modelling algorithms of
later chapters. While iteration is a simple tool, it is also a very versatile technique
and will be used in most algorithms in coming chapters. This chapter has introduced
notions related to iterations, such as the notion of convergence, divergence, and
divergence rate and the notions of success and failure halting conditions.
3.5
Exercises
1. What value does the function f x 2:4xx 1 converge to? How many
iterations do you need to get the equality condition of Eq. (3.3)?
2. Starting with $x_0 = 0.5$, compute iterations of $f(x) = \sin(x)$ and $f(x) = \cos(x)$.
Which converges faster? Is the difference significant?
3. Consider $f(x) = x^3$. For which range of values will the function converge or
diverge?
4. Consider the function $f(x) = x + \sin(x)$, where the sin function is in radians
(Beeler et al. 1972). Starting from $x_0 = 0.5$, compute the value of $x_i$ and its
relative error as an approximation of $\pi$ over five iterations.
5. Consider the function $f(x) = (3x^4 + 10x^3 - 20x^2 - 24)/(4x^3 + 15x^2 - 40x)$. Starting
from $x_0 = 5$, compute the value of $x_i$ and its relative error as an approximation of
2 over five iterations.
6. Consider the following functions. How many values can each one converge to?
What are they?
(a)
(b)
(c)
(d)
(e)
Chapter 4
Linear Algebra
4.1
Introduction
The first of the five mathematical modelling tools, introduced in the previous
chapter, is iteration. The second is solving systems of linear algebraic equations
and is the topic of this chapter. A system of linear algebraic equations is any set of
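n equations with n unknown variables $x_0, \ldots, x_{n-1}$:
$$\begin{aligned}
m_{0,0}x_0 + m_{0,1}x_1 + \cdots + m_{0,n-1}x_{n-1} &= b_0\\
m_{1,0}x_0 + m_{1,1}x_1 + \cdots + m_{1,n-1}x_{n-1} &= b_1\\
&\;\;\vdots\\
m_{n-1,0}x_0 + m_{n-1,1}x_1 + \cdots + m_{n-1,n-1}x_{n-1} &= b_{n-1}
\end{aligned} \tag{4.1}$$
where the $n \times n$ values $m_{0,0}, \ldots, m_{n-1,n-1}$ are known coefficient values that multiply
the variables, and the n values $b_0, \ldots, b_{n-1}$ are the known result of each equation. A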
system of that form can arise in many ways in engineering practice. For example, it
would be the result of taking measurements of a dynamic system at n different
times. It also results from taking measurements of a static system at n different
internal points. Consider, for example, the simple electrical circuit in Fig. 4.1. Four
internal nodes have been identified in it. If one wants to model this circuit, for
example, to be able to predict the voltage between two nodes, the
corresponding energy losses, or other of its properties, it is first necessary to
model the voltages at each node using Kirchhoff's current law. Removing units
and with appropriate scaling, this gives the following set of four equations and four
unknown variables:
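$$\begin{aligned}
\text{Node 1:}\quad & \frac{v_1 - 0}{120} + \frac{v_1 - v_2}{240} = 0.01\\
\text{Node 2:}\quad & \frac{v_2 - 0}{320} + \frac{v_2 - v_1}{240} + \frac{v_2 - v_3}{180} + \frac{v_2 - v_4}{200} = 0\\
\text{Node 3:}\quad & \frac{v_3 - 0}{160} + \frac{v_3 - v_2}{180} + \frac{v_3 - v_4}{360} = 0\\
\text{Node 4:}\quad & \frac{v_4 - v_2}{200} + \frac{v_4 - v_3}{360} = 0.01
\end{aligned} \tag{4.2}$$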
which is a system of four linear algebraic equations of the same form as Eq. (4.1).
The linear system of Eq. (4.1) can be written as a classic matrix-vector problem:
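$$\mathbf{M}\mathbf{x} = \mathbf{b} \tag{4.3}$$
For the circuit of Eq. (4.2), this gives
$$\begin{bmatrix}
\frac{1}{120} + \frac{1}{240} & -\frac{1}{240} & 0 & 0\\
-\frac{1}{240} & \frac{1}{320} + \frac{1}{240} + \frac{1}{180} + \frac{1}{200} & -\frac{1}{180} & -\frac{1}{200}\\
0 & -\frac{1}{180} & \frac{1}{160} + \frac{1}{180} + \frac{1}{360} & -\frac{1}{360}\\
0 & -\frac{1}{200} & -\frac{1}{360} & \frac{1}{200} + \frac{1}{360}
\end{bmatrix}
\begin{bmatrix} v_1\\ v_2\\ v_3\\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0.01\\ 0\\ 0\\ 0.01 \end{bmatrix} \tag{4.4}$$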
and each of the node equations can be recovered by multiplying the corresponding
row of the matrix by the vector of unknowns and keeping the matching result value.
Writing the system in matrix-vector form makes it easier to solve and discover the
value of each unknown variable.
Some readers may have learned how to solve a system of linear equations using
Gaussian elimination together with backward substitution, possibly in a previous
course on linear algebra. However, there are two problems with Gaussian elimination. The first is its lack of efficiency. Even an optimal implementation of a
Gaussian elimination and backward substitution algorithm for solving a system of
n linear equations will require $n^3/3 - n/3$ multiplications and additions and
$n^2/2 - n/2$ divisions and negations for the Gaussian elimination step in addition
to $n^2/2 - n/2$ multiplications and subtractions and n divisions for the backward
substitution step. In other words, it is an algorithm with $O(n^3)$ time complexity,
which is very inefficient: the computation time required to solve a linear system will
grow proportionally to the cube of the size of the system! Doubling the size of a
system will require eight times more computations to solve, and trying to solve a
system with 10 times more equations and unknowns will take 1000 times as long.
The second problem with Gaussian elimination is related to the step that requires
adding a multiple of one row to the others. In any situation where the coefficients of
one row are a lot bigger in magnitude than those of another, and given a finite
number of digits of precision, the algorithm will suffer from the problem of
non-associativity of addition explained in Chap. 2. When that happens, the values
computed for the variables by Gaussian elimination will end up having a very high
relative error compared to the real values that should have been obtained.
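The effect can be seen directly in double-precision arithmetic. The following sketch uses hypothetical values (they are not taken from the text) chosen so that one operand is much larger than the other:

```python
# Adding a small value to a much larger one can lose the small value entirely,
# so the grouping of the additions changes the result.
big = 1.0e16
small = 1.0

left = (big + small) + small    # each small addition is rounded away
right = big + (small + small)   # the small values are combined first

print(left == right)
print(right - left)
```

In Gaussian elimination the same loss occurs when a multiple of a row with large coefficients is added to a row with small ones, which is the motivation for the row permutations introduced in the next section.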
This chapter will introduce better methods for solving the $\mathbf{Mx} = \mathbf{b}$ system, both
for general matrices and for some special cases. The ability to solve linear systems
will be important for other mathematical tools and for numerical methods, as it will
make it possible to easily, efficiently, and accurately solve complex systems.
4.2
PLU Decomposition
The PLU decomposition, also called the LUP decomposition or the PLU factorization, is an improvement on the Gaussian elimination technique. It addresses the two
problems highlighted in the introduction: by decomposing the matrix M into a
lower triangular matrix L and an upper triangular matrix U, it can solve the system
in $O(n^2)$ instead of $O(n^3)$ once the decomposition is available. And by doing a permutation of the rows so that the
element with the maximum absolute value is always in the diagonal and keeping
track of these changes in a permutation matrix P, it can avoid the non-associativity
problem.
The PLU decomposition technique thus works in two steps: first, decomposing
the matrix M into three equivalent matrices:
$$\mathbf{M} = \mathbf{P}\mathbf{L}\mathbf{U} \tag{4.5}$$
then solving the $\mathbf{PLUx} = \mathbf{b}$ system using simple forward and backward
substitutions.
There is a simple step-by-step algorithm to decompose the matrix M into L, U,
and $\mathbf{P}^T$. Note that the algorithm decomposes into the transpose of the permutation
matrix P; later the forward and backward substitution operations will need this
transposed matrix, so this actually saves a bit of time in the overall process. A
pseudo-code version of this algorithm is presented in Fig. 4.2.
Fig. 4.2 Pseudo-code of the PLU decomposition:
    M ← Input n×n matrix
    P ← n×n identity matrix
    L ← n×n zero matrix
    U ← M
    ColumnIndex ← 0
    WHILE (ColumnIndex < n-1)
        ColumnVector ← column ColumnIndex, rows ColumnIndex to n-1 of U
        IndexOfMaximum ← row of U holding the maximum absolute value in ColumnVector
        P ← swap row ColumnIndex and row IndexOfMaximum in P
        L ← swap row ColumnIndex and row IndexOfMaximum in L
        U ← swap row ColumnIndex and row IndexOfMaximum in U
        RowIndex ← ColumnIndex + 1
        WHILE (RowIndex < n)
            s ← -1 × (element at row RowIndex, column ColumnIndex of U) /
                (element at row ColumnIndex, column ColumnIndex of U)
            row RowIndex of U ← (row RowIndex of U) + s × (row ColumnIndex of U)
            element at row RowIndex, column ColumnIndex of L ← -1 × s
            RowIndex ← RowIndex + 1
        END WHILE
        ColumnIndex ← ColumnIndex + 1
    END WHILE
    L ← L + (n×n identity matrix)
    RETURN P, L, U
Step 1: Initialization. Each of the three decomposition matrices will have initial
values. The matrix U is initially equal to M, the matrix L is initially an
$n \times n$ zero matrix, and the matrix $\mathbf{P}^T$ is initially an $n \times n$ identity matrix.
Step 2: Decomposition. The algorithm considers each column in turn, working left
to right from column 0 to the penultimate column $n - 2$. For the current
column i, find the row j in the matrix U such that the element $u_{j,i}$ has the
greatest absolute value in the column and $j \geq i$, meaning that the element is
on or below the diagonal element of the column. If that element is zero,
then the matrix is singular and cannot be decomposed with this method, and
the algorithm terminates. Next, swap rows i and j in all three matrices U, L,
and $\mathbf{P}^T$. This will bring the greatest-valued element found on the diagonal
of matrix U as element $u_{i,i}$. Finally, for every row k below the diagonal,
calculate a scalar value $s = -u_{k,i}/u_{i,i}$. Save the value of $-s$ at element $l_{k,i}$ in
matrix L in order to fill the entries below the diagonal in that matrix, and
add s times row i to row k in matrix U. This addition will cause every
value in column i under the diagonal of U to become 0.
Step 3: Finalization. Once the decomposition step has been done for every column
except the last right-most one, the algorithm ends by adding an $n \times n$
identity matrix to the lower-diagonal matrix L.
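The three steps above can be sketched in plain Python with lists of rows. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name `plu_decompose` and the sample matrix are hypothetical, and like the steps above it only checks the pivots of the first $n - 1$ columns.

```python
def plu_decompose(M):
    """Return (Pt, L, U) with Pt*M = L*U, following Steps 1 to 3."""
    n = len(M)
    # Step 1: initialization.
    U = [row[:] for row in M]
    L = [[0.0] * n for _ in range(n)]
    Pt = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    # Step 2: decomposition, one column at a time.
    for i in range(n - 1):
        # Row on or below the diagonal with the largest absolute value.
        j = max(range(i, n), key=lambda r: abs(U[r][i]))
        if U[j][i] == 0.0:
            raise ValueError("matrix is singular")
        for A in (U, L, Pt):
            A[i], A[j] = A[j], A[i]
        for k in range(i + 1, n):
            s = -U[k][i] / U[i][i]
            L[k][i] = -s
            U[k] = [uk + s * ui for uk, ui in zip(U[k], U[i])]
            U[k][i] = 0.0  # remove any rounding residue below the pivot
    # Step 3: finalization.
    for i in range(n):
        L[i][i] += 1.0
    return Pt, L, U

# Hypothetical 3x3 matrix used only to exercise the function.
M_demo = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
Pt_demo, L_demo, U_demo = plu_decompose(M_demo)
for name, A in (("Pt", Pt_demo), ("L", L_demo), ("U", U_demo)):
    print(name, A)
```

Swapping the rows of L together with those of U keeps the multipliers already stored in L attached to the rows they were computed for.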
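Once the matrix M has been decomposed, the $\mathbf{PLUx} = \mathbf{b}$ system can be solved in
two steps. The matrix-vector Ux is replaced with a vector y in the equation, which
can be computed with a forward substitution step:
$$\mathbf{L}\mathbf{y} = \mathbf{P}^T\mathbf{b} \tag{4.6}$$
followed by a backward substitution step to compute x:
$$\mathbf{U}\mathbf{x} = \mathbf{y} \tag{4.7}$$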
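The two substitution steps can be sketched as follows. The function names and the small triangular factors are assumptions made for illustration; the right-hand side passed to the forward step stands for the already permuted vector $\mathbf{P}^T\mathbf{b}$.

```python
def forward_substitution(L, c):
    """Solve L*y = c for a lower triangular L, from the first row down."""
    n = len(c)
    y = [0.0] * n
    for i in range(n):
        known = sum(L[i][k] * y[k] for k in range(i))
        y[i] = (c[i] - known) / L[i][i]
    return y

def backward_substitution(U, y):
    """Solve U*x = y for an upper triangular U, from the last row up."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(U[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - known) / U[i][i]
    return x

# Hypothetical factors and permuted right-hand side.
L_small = [[1.0, 0.0], [0.5, 1.0]]
U_small = [[4.0, 2.0], [0.0, 3.0]]
y_small = forward_substitution(L_small, [8.0, 10.0])
x_small = backward_substitution(U_small, y_small)
print(y_small, x_small)
```

Each step visits every entry of one triangular matrix once, which is where the $O(n^2)$ cost of solving an already decomposed system comes from.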
Example 4.1
Use PLU decomposition to solve the following matrix-vector problem. Keep
two decimals of precision:
2
0:7
6 1:4
6
4 7
1:4
4
8
0
0:8
7:4
2:8
1
2:75
32 3 2 3
4:3
x0
7
6 x1 7 6 4 7
0:6 7
76 7 6 7
2 54 x2 5 4 8 5
6:25
3
x3
Solution
Initialize the values of the matrices PT, L, and U:
2
1
60
T
P 6
40
0
0
1
0
0
0
0
1
0
3
2
0
0
60
07
7 L6
40
05
1
0
0
0
0
0
0
0
0
0
3
3
2
0
0:7 4 7:4 4:3
7
6
07
7 U 6 1:4 8 2:8 0:6 7
5
4
0
7
0
1
2 5
0
1:4 0:8 2:75 6:25
Find the element on or under the diagonal in column 0 that has the largest
absolute value. In this case, it is element u2,0, so rows 0 and 2 need to be
swapped in all three matrices.
0
0
0
0
0
0
0
0
3
3
2
0
7
0
1
2
7
6
07
7 U 6 1:4 8 2:8 0:6 7
5
4
0
0:7 4 7:4 4:3 5
0
1:4 0:8 2:75 6:25
Then, for each row under the diagonal, compute the scalar value s.
For row 1, the value is s 1.4/7 0.2. That value will take position l1,0
in matrix L. Multiplying s times row 0 gives [1.4, 0, 0.2, 0.4], and
adding that to row 1 gives [0, 8, 3, 1]. For row 2 the scalar value
is s 0.7/7 0.1, which is saved as l2,0; multiplying s by row 0 gives
[0.7, 0, 0.1, 0.2], and adding that to row 2 gives [0, 4, 7.5, 4.5]. Finally,
for row 3, s 1.4/7 0.2, which is saved as l3,0; multiplying s by row
0 gives [1.4, 0, 0.2, 0.4], and adding that to row 3 gives [0, 0.8, 2.55, 6.65].
The resulting matrices are
2
0
6
0
PT 6
41
0
0
1
0
0
1
0
0
0
3
2
0
0
6 0:2
07
7 L6
4 0:1
05
1
0:2
0
0
0
0
0
0
0
0
3
3
2
0
7 0
1
2
6
07
3
1 7
7 U 60 8
7
4 0 4 7:5 4:5 5
05
0
0 0:8 2:55 6:65
0
60
6
PT 6
41
0
1
1
0
2
3
0
0
6
7
07
6 0:2
7 L6
4 0:1
05
0 0
0 0 0 1
3
2
7 0
1
2
60 8
3
1 7
7
6
U6
7
4 0 0 9
5 5
0
2:25
0:2
0
0
0
0
0:5 0
0:1 0
3
0
07
7
7
05
0
6:75
2
0 0
60 1
T
6
P 4
1 0
0 0
3
2
1 0
0
0
0
6
0 07
0
7 L 6 0:2 0
4 0:1 0:5
0 05
0
0 1
0:2 0:1 0:25
3
2
0
7
6
07
7 U 60
40
05
0
0
3
0 1 2
8 3 1 7
7
0 9 5 5
0 0 8
Since this operation is not done on the final row, the decomposition step is
over. The final step is to add a $4 \times 4$ identity matrix to L, to get the final
matrices:
2
0
6
0
PT 6
41
0
0
1
0
0
1
0
0
0
3
2
1
0
0
0
6 0:2 1
0
07
7 L6
4 0:1 0:5
1
05
0:2 0:1 0:25
1
3
2
7
0
60
07
7 U6
40
05
0
1
3
0 1 2
8 3 1 7
7
0 9 5 5
0 0 8
6
6 0:2
6
6
6 0:1
4
0:2
0:5
0:1
Ly PT b
3 2
y0
0 0
0
76 7 6
6 7 6
07
76 y1 7 6 0 1
76 7 6
6 7 6
07
54 y2 5 4 1 0
0:25 1
32
y3
0 0
32 3
7
76 7
6 7
07
76 4 7
76 7
6 7
07
54 8 5
1
0
8
4
7
3
3
32 3 2
8:0
2
x0
6 7 6
7
1 7
76 x1 7 6 5:6 7
5
4
4
5
3:4 5
x2
5
1:7
8
x3
8x3 1:7
9x2 5x3 3:4
8x1 3x2 x3 5:6
7x0 x2 2x3 8:0
2 3 2
3
x0
1:24
6 x1 7 6 0:82 7
6 76
7
4 x2 5 4 0:26 5
0:21
x3
4.3
Cholesky Decomposition
$$\begin{bmatrix}
m_{0,0} & m_{1,0} & m_{2,0} & m_{3,0}\\
m_{1,0} & m_{1,1} & m_{2,1} & m_{3,1}\\
m_{2,0} & m_{2,1} & m_{2,2} & m_{3,2}\\
m_{3,0} & m_{3,1} & m_{3,2} & m_{3,3}
\end{bmatrix}
=
\begin{bmatrix}
l_{0,0} & 0 & 0 & 0\\
l_{1,0} & l_{1,1} & 0 & 0\\
l_{2,0} & l_{2,1} & l_{2,2} & 0\\
l_{3,0} & l_{3,1} & l_{3,2} & l_{3,3}
\end{bmatrix}
\begin{bmatrix}
l_{0,0} & l_{1,0} & l_{2,0} & l_{3,0}\\
0 & l_{1,1} & l_{2,1} & l_{3,1}\\
0 & 0 & l_{2,2} & l_{3,2}\\
0 & 0 & 0 & l_{3,3}
\end{bmatrix}$$
$$\begin{bmatrix}
m_{0,0} & m_{1,0} & m_{2,0} & m_{3,0}\\
m_{1,0} & m_{1,1} & m_{2,1} & m_{3,1}\\
m_{2,0} & m_{2,1} & m_{2,2} & m_{3,2}\\
m_{3,0} & m_{3,1} & m_{3,2} & m_{3,3}
\end{bmatrix}
=
\begin{bmatrix}
l_{0,0}^2 & l_{0,0}l_{1,0} & l_{0,0}l_{2,0} & l_{0,0}l_{3,0}\\
l_{0,0}l_{1,0} & l_{1,0}^2 + l_{1,1}^2 & l_{1,0}l_{2,0} + l_{1,1}l_{2,1} & l_{1,0}l_{3,0} + l_{1,1}l_{3,1}\\
l_{0,0}l_{2,0} & l_{1,0}l_{2,0} + l_{1,1}l_{2,1} & l_{2,0}^2 + l_{2,1}^2 + l_{2,2}^2 & l_{2,0}l_{3,0} + l_{2,1}l_{3,1} + l_{2,2}l_{3,2}\\
l_{0,0}l_{3,0} & l_{1,0}l_{3,0} + l_{1,1}l_{3,1} & l_{2,0}l_{3,0} + l_{2,1}l_{3,1} + l_{2,2}l_{3,2} & l_{3,0}^2 + l_{3,1}^2 + l_{3,2}^2 + l_{3,3}^2
\end{bmatrix} \tag{4.8}$$
Furthermore, looking at the matrix LLT column by column (or row by row, since
it is symmetric), it can be seen that each element li,j can be discovered by forward
substitution in order, starting from element l0,0. Column 0 gives the set of equations:
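$$l_{0,0} = \sqrt{m_{0,0}} \tag{4.9}$$
$$l_{1,0} = \frac{m_{1,0}}{l_{0,0}} \tag{4.10}$$
$$l_{2,0} = \frac{m_{2,0}}{l_{0,0}} \tag{4.11}$$
$$l_{3,0} = \frac{m_{3,0}}{l_{0,0}} \tag{4.12}$$
Column 1 then gives:
$$l_{1,1} = \sqrt{m_{1,1} - l_{1,0}^2} \tag{4.13}$$
$$l_{2,1} = \frac{m_{2,1} - l_{1,0}l_{2,0}}{l_{1,1}} \tag{4.14}$$
$$l_{3,1} = \frac{m_{3,1} - l_{1,0}l_{3,0}}{l_{1,1}} \tag{4.15}$$
and columns 2 and 3 give:
$$l_{2,2} = \sqrt{m_{2,2} - l_{2,0}^2 - l_{2,1}^2} \tag{4.16}$$
$$l_{3,2} = \frac{m_{3,2} - l_{2,0}l_{3,0} - l_{2,1}l_{3,1}}{l_{2,2}} \tag{4.17}$$
$$l_{3,3} = \sqrt{m_{3,3} - l_{3,0}^2 - l_{3,1}^2 - l_{3,2}^2} \tag{4.18}$$
In general:
$$l_{i,j} = \begin{cases}
\sqrt{m_{i,j} - \sum\limits_{k=0}^{j-1} l_{i,k}^2} & \text{if } i = j\\[2ex]
\dfrac{m_{i,j} - \sum\limits_{k=0}^{j-1} l_{j,k}\,l_{i,k}}{l_{j,j}} & \text{if } i \neq j
\end{cases} \tag{4.19}$$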
END WHILE
Index Index + 1
END WHILE
RETURN L
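$$\mathbf{L}\mathbf{L}^T\mathbf{x} = \mathbf{b} \tag{4.20}$$
which can then be solved by replacing the matrix-vector $\mathbf{L}^T\mathbf{x}$ with a vector y and
performing a forward substitution step:
$$\mathbf{L}\mathbf{y} = \mathbf{b} \tag{4.21}$$
followed by a backward substitution step to compute x:
$$\mathbf{L}^T\mathbf{x} = \mathbf{y} \tag{4.22}$$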
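The column-by-column computation of L can be sketched as below. This is an illustrative sketch, assuming a symmetric positive-definite input; the function name `cholesky` and the sample matrix are hypothetical, and the sample was chosen so that every entry of L comes out as a whole number.

```python
import math

def cholesky(M):
    """Return the lower triangular L with M = L * L^T, built column by column."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, n):
            # Subtract the contribution of the columns already computed.
            acc = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive-definite")
                L[i][j] = math.sqrt(acc)
            else:
                L[i][j] = acc / L[j][j]
    return L

# Hypothetical symmetric positive-definite matrix.
M_spd = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
L_spd = cholesky(M_spd)
print(L_spd)
```

With L available, the system is solved by one forward substitution with L followed by one backward substitution with its transpose.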
Example 4.2
Use Cholesky decomposition to solve the system of Eq. (4.4). Keep four
decimals of precision.
Solution
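Begin by writing the matrix M with four digits of precision, to make it easier
to work with:
$$\mathbf{M} = \begin{bmatrix}
0.0125 & -0.0042 & 0 & 0\\
-0.0042 & 0.0178 & -0.0056 & -0.0050\\
0 & -0.0056 & 0.0146 & -0.0028\\
0 & -0.0050 & -0.0028 & 0.0078
\end{bmatrix}$$
$$l_{0,0} = \sqrt{0.0125} = 0.1118$$
$$l_{1,0} = \frac{-0.0042}{0.1118} = -0.0373$$
$$l_{2,0} = \frac{0}{0.1118} = 0$$
$$l_{3,0} = \frac{0}{0.1118} = 0$$
$$l_{1,1} = \sqrt{0.0178 - (-0.0373)^2} = 0.1283$$
$$l_{2,1} = \frac{-0.0056 - (-0.0373)(0)}{0.1283} = -0.0433$$
$$l_{3,1} = \frac{-0.0050 - (-0.0373)(0)}{0.1283} = -0.0390$$
$$l_{2,2} = \sqrt{0.0146 - 0^2 - (-0.0433)^2} = 0.1127$$
$$l_{3,2} = \frac{-0.0028 - (0)(0) - (-0.0433)(-0.0390)}{0.1127} = -0.0396$$
$$l_{3,3} = \sqrt{0.0078 - 0^2 - (-0.0390)^2 - (-0.0396)^2} = 0.0685$$
The forward substitution system of Eq. (4.21) is then:
$$\begin{bmatrix}
0.1118 & 0 & 0 & 0\\
-0.0373 & 0.1283 & 0 & 0\\
0 & -0.0433 & 0.1127 & 0\\
0 & -0.0390 & -0.0396 & 0.0685
\end{bmatrix}
\begin{bmatrix} y_0\\ y_1\\ y_2\\ y_3 \end{bmatrix}
=
\begin{bmatrix} 0.01\\ 0\\ 0\\ 0.01 \end{bmatrix}$$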
4.4
Jacobi Method
It is important to note though that the Jacobi method can be used even without prior
knowledge of x, by using a random vector or a zero vector. It will take longer to
converge to a solution in that case, but may still be more efficient than the PLU
decomposition, especially for very large systems, and it is not restricted to positive-definite matrices like the Cholesky decomposition. The only requirement to use the
Jacobi method is that the matrix M must have non-zero diagonal entries.
The Jacobi method begins by decomposing the matrix M into the sum of two
matrices D and E, where D contains the diagonal entries of M and zeros everywhere
else, and E contains the off-diagonal entries of M and zeros on the diagonal. For a
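simple $3 \times 3$ example:
$$\mathbf{M} = \mathbf{D} + \mathbf{E}$$
$$\begin{bmatrix} a & b & c\\ d & e & f\\ g & h & i \end{bmatrix}
=
\begin{bmatrix} a & 0 & 0\\ 0 & e & 0\\ 0 & 0 & i \end{bmatrix}
+
\begin{bmatrix} 0 & b & c\\ d & 0 & f\\ g & h & 0 \end{bmatrix} \tag{4.23}$$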
4:24
4:25
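Moreover, in the special case where D is already a diagonal matrix, the reciprocal $\mathbf{D}^{-1}$
is simply a diagonal matrix containing the inverse of each diagonal scalar value:
$$\begin{bmatrix} a & 0 & 0\\ 0 & e & 0\\ 0 & 0 & i \end{bmatrix}
\begin{bmatrix} \frac{1}{a} & 0 & 0\\ 0 & \frac{1}{e} & 0\\ 0 & 0 & \frac{1}{i} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix} \tag{4.26}$$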
The Jacobi method implements an iterative algorithm using Eq. (4.24) to refine the
estimate of x at every iteration k as follows:
$$\mathbf{x}_{k+1} = \mathbf{D}^{-1}(\mathbf{b} - \mathbf{E}\mathbf{x}_k) \tag{4.27}$$
The iterations continue until one of two halting conditions is reached: either the
Euclidean distance (see Chap. 3) between two successive iterations of xk is less than
a target error, in which case the algorithm has converged to a good estimate of x, or
k increments to a preset maximum number of iterations, in which case the method
has failed to converge. The pseudo-code for this algorithm is presented in Fig. 4.4.
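The iteration and its two halting conditions can be sketched in Python as follows. This is a minimal sketch, not the pseudo-code of Fig. 4.4: the function name `jacobi`, its parameters, and the small test system (built so that its solution is [1, 2]) are assumptions made for illustration.

```python
import math

def jacobi(M, b, x0, tol, max_iter):
    """Jacobi iteration: every new element uses only values from the previous iterate.
    Returns (x, iterations) on success, raises RuntimeError on failure."""
    n = len(b)
    x = list(x0)
    for k in range(1, max_iter + 1):
        # Dividing by M[i][i] applies the diagonal inverse; the sum is the
        # off-diagonal product with the previous iterate.
        x_new = [
            (b[i] - sum(M[i][j] * x[j] for j in range(n) if j != i)) / M[i][i]
            for i in range(n)
        ]
        dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(x_new, x)))
        x = x_new
        if dist < tol:
            return x, k
    raise RuntimeError("Jacobi method failed to converge")

# Hypothetical system with solution [1, 2].
M_jac = [[4.0, 1.0], [2.0, 5.0]]
b_jac = [6.0, 12.0]
x_jac, k_jac = jacobi(M_jac, b_jac, [0.0, 0.0], 1e-8, 100)
print(x_jac, k_jac)
```

The division by `M[i][i]` is why the method requires non-zero diagonal entries.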
Example 4.3
Use the Jacobi method to solve the following system to an accuracy of 0.1,
keeping two decimals of precision and starting with a zero vector.
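$$\begin{bmatrix} 5 & 2 & -1\\ 3 & 7 & 3\\ 1 & -4 & 6 \end{bmatrix}
\begin{bmatrix} x_0\\ x_1\\ x_2 \end{bmatrix}
=
\begin{bmatrix} 2\\ -1\\ 1 \end{bmatrix}$$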
Solution
Begin by decomposing the matrix into diagonal and off-diagonal matrices:
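$$\mathbf{D} = \begin{bmatrix} 5 & 0 & 0\\ 0 & 7 & 0\\ 0 & 0 & 6 \end{bmatrix}
\qquad
\mathbf{E} = \begin{bmatrix} 0 & 2 & -1\\ 3 & 0 & 3\\ 1 & -4 & 0 \end{bmatrix}$$
The iterative equation of Eq. (4.27) is then:
$$\begin{bmatrix} x_{k+1,0}\\ x_{k+1,1}\\ x_{k+1,2} \end{bmatrix}
=
\begin{bmatrix} \frac{1}{5} & 0 & 0\\ 0 & \frac{1}{7} & 0\\ 0 & 0 & \frac{1}{6} \end{bmatrix}
\left(
\begin{bmatrix} 2\\ -1\\ 1 \end{bmatrix}
-
\begin{bmatrix} 0 & 2 & -1\\ 3 & 0 & 3\\ 1 & -4 & 0 \end{bmatrix}
\begin{bmatrix} x_{k,0}\\ x_{k,1}\\ x_{k,2} \end{bmatrix}
\right)$$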
It can be seen that the zeros off-diagonal in D1 will simplify the equations a
lot. In fact, the three values of xk+1 will be computed by these simple
equations:
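$$x_{k+1,0} = \frac{1}{5}\left(2 - 2x_{k,1} + 1x_{k,2}\right)$$
$$x_{k+1,1} = \frac{1}{7}\left(-1 - 3x_{k,0} - 3x_{k,2}\right)$$
$$x_{k+1,2} = \frac{1}{6}\left(1 - 1x_{k,0} + 4x_{k,1}\right)$$
Starting with $\mathbf{x}_0 = [0\ \ 0\ \ 0]^T$, the first iteration will give $\mathbf{x}_1 = [0.40\ \ {-0.14}\ \ 0.17]^T$.
The Euclidean distance between these two iterations is
$$E_1 = \|\mathbf{x}_1 - \mathbf{x}_0\| = \sqrt{(0.40 - 0)^2 + (-0.14 - 0)^2 + (0.17 - 0)^2} = 0.46$$
The next iterations give:
$$\mathbf{x}_2 = [0.49\ \ {-0.39}\ \ 0.01]^T \qquad \|\mathbf{x}_2 - \mathbf{x}_1\| = 0.31$$
$$\mathbf{x}_3 = [0.56\ \ {-0.36}\ \ {-0.17}]^T \qquad \|\mathbf{x}_3 - \mathbf{x}_2\| = 0.19$$
$$\mathbf{x}_4 = [0.51\ \ {-0.31}\ \ {-0.16}]^T \qquad \|\mathbf{x}_4 - \mathbf{x}_3\| = 0.07$$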
The target accuracy has been reached after the fourth iteration. For reference,
the correct answer to this system is $\mathbf{x} = [0.50\ \ {-0.31}\ \ {-0.12}]^T$, which the
method approximated well. It could be noted that the value $x_{i,2}$ actually started
off in the wrong direction, starting at 0 and increasing to 0.17, before
dropping towards its correct negative value in subsequent iterations.
4.5
Gauss-Seidel Method
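$$\mathbf{x}_{k+1} = \mathbf{D}^{-1}(\mathbf{b} - \mathbf{E}\mathbf{x}_{k+1}) \tag{4.28}$$
and adds the step to begin each iteration by setting $\mathbf{x}_{k+1} = \mathbf{x}_k$. Since the value of $\mathbf{x}_k$ is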
converging, this simple improvement of using the updated values earlier in the
iterations actually allows it to converge faster. This change to the pseudo-code of
the Jacobi method is shown in Fig. 4.5. Notice that the single line computing the
new value of the vector x in the code of Fig. 4.4 has been replaced by a loop that
computes each new element of vector x one at a time and uses all new values from
previous iterations of the loop to compute subsequent ones.
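The element-by-element update can be sketched in Python as follows. As before, this is only an illustrative sketch: the function name `gauss_seidel` and the small test system (built so that its solution is [1, 2]) are hypothetical.

```python
import math

def gauss_seidel(M, b, x0, tol, max_iter):
    """Gauss-Seidel iteration: each element is updated in place, so later
    elements already use the values computed earlier in the same iteration."""
    n = len(b)
    x = list(x0)
    for k in range(1, max_iter + 1):
        x_old = list(x)
        for i in range(n):
            others = sum(M[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - others) / M[i][i]
        dist = math.sqrt(sum((a - c) ** 2 for a, c in zip(x, x_old)))
        if dist < tol:
            return x, k
    raise RuntimeError("Gauss-Seidel method failed to converge")

# Hypothetical system with solution [1, 2].
M_gs = [[4.0, 1.0], [2.0, 5.0]]
b_gs = [6.0, 12.0]
x_gs, k_gs = gauss_seidel(M_gs, b_gs, [0.0, 0.0], 1e-8, 100)
print(x_gs, k_gs)
```

The only structural difference from a Jacobi loop is that `x` is overwritten inside the inner loop instead of being replaced after it.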
Example 4.4
Use the Gauss-Seidel method to solve the following system to an accuracy of
0.1, keeping two decimals of precision and starting with a zero vector.
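$$\begin{bmatrix} 5 & 2 & -1\\ 3 & 7 & 3\\ 1 & -4 & 6 \end{bmatrix}
\begin{bmatrix} x_0\\ x_1\\ x_2 \end{bmatrix}
=
\begin{bmatrix} 2\\ -1\\ 1 \end{bmatrix}$$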
Solution
This is the same system as in Example 4.3, and using Eq. (4.28) will build an
almost identical iterative equation, with the only difference that it uses xk+1
instead of xk in its computations:
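$$\begin{bmatrix} x_{k+1,0}\\ x_{k+1,1}\\ x_{k+1,2} \end{bmatrix}
=
\begin{bmatrix} \frac{1}{5} & 0 & 0\\ 0 & \frac{1}{7} & 0\\ 0 & 0 & \frac{1}{6} \end{bmatrix}
\left(
\begin{bmatrix} 2\\ -1\\ 1 \end{bmatrix}
-
\begin{bmatrix} 0 & 2 & -1\\ 3 & 0 & 3\\ 1 & -4 & 0 \end{bmatrix}
\begin{bmatrix} x_{k+1,0}\\ x_{k+1,1}\\ x_{k+1,2} \end{bmatrix}
\right)$$
$$\|\mathbf{x}_2 - \mathbf{x}_1\| = 0.11$$
$$\|\mathbf{x}_3 - \mathbf{x}_2\| = 0.01$$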
The target accuracy has been reached after the third iteration, one sooner than
the Jacobi method achieved in Example 4.3. Comparing the results of the first
iteration obtained with the two methods demonstrates how beneficial using
the updated values is. While the first element computed will always be the same
with both methods, the second element computed here was $x_{1,1} = -0.31$
(or more precisely $x_{1,1} = -0.314$), almost exactly correct compared to the
correct answer of $-0.307$ and a much better estimate than the $x_{1,1} = -0.14$
computed by the Jacobi method. And the third element computed here was
$x_{1,2} = -0.11$, a very good step towards the correct answer of $-0.12$, and a
definite improvement compared to the Jacobi method which had started off
with a step in the wrong direction entirely, at $x_{1,2} = 0.17$.
4.6
Error Analysis
When solving an $\mathbf{Mx} = \mathbf{b}$ system, one must be mindful to account for the error both
in the matrix M and the vector b, as well as for the propagation of that error to
the solution vector x. The error on the values in M and b will be dependent on the
method used to obtain those values; it could be, for example, the error on the
instruments used to measure these values or the error of the mathematical models
used to estimate the values. Since both M and b are used to compute x, it should be
no surprise that the error on x will be dependent on those of M and b. Unfortunately,
the error propagation is not linear: a 10 % relative error in either M or b does not
translate to a 10 % relative error in x, but could in fact be a lot more.
The error propagation involves a property of the matrix M called the condition
number of the matrix, which is defined as
$$\mathrm{cond}(\mathbf{M}) = \|\mathbf{M}\|\,\|\mathbf{M}^{-1}\| \tag{4.29}$$
where the double bar represents the Euclidean norm (or 2-norm) of the matrix
defined as the maximum that the matrix stretches the length of any vector, that is,
the largest value of $\|\mathbf{Mv}\| / \|\mathbf{v}\|$ for any vector v. This value may be calculated by finding
the square root of the maximum eigenvalue of the matrix multiplied by its
transpose:
$$\|\mathbf{M}\| = \sqrt{\lambda_{\max}(\mathbf{M}\mathbf{M}^T)} \tag{4.30}$$
and the eigenvalue, in turn, is any scalar value $\lambda$ that is a solution to the matrix-vector problem:
$$\mathbf{M}\mathbf{v} = \lambda\mathbf{v} \tag{4.31}$$
The relative error on b, given an error vector $\mathbf{e}_b$, is
$$E_b = \frac{\|\mathbf{e}_b\|}{\|\mathbf{b}\|} \tag{4.32}$$
where the double-bar operator is the Euclidean norm of the vector or the square
root of sum of squares of the elements of b. More formally, if b is an $n \times 1$ vector
$[b_0, \ldots, b_{n-1}]^T$:
$$\|\mathbf{b}\| = \sqrt{\sum_{i=0}^{n-1} b_i^2} \tag{4.33}$$
Finally, the relative error on the solution of the system x will be bounded by the
relative error on b and the condition number of M:
$$E_x \leq E_b\,\mathrm{cond}(\mathbf{M}) \tag{4.34}$$
Example 4.5
Consider the following $\mathbf{Mx} = \mathbf{b}$ system:
5
2
1
10
x0
x1
0:005
0:901
0
0:090
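$$E_b = \frac{\|\mathbf{e}_b\|}{\|\mathbf{b}\|} = \frac{\sqrt{0^2 + 0.090^2}}{\sqrt{0.005^2 + 0.901^2}} = 0.100$$
$$E_x = \frac{\|\mathbf{e}_x\|}{\|\mathbf{x}\|} = \frac{\sqrt{0.015^2 + 0.003^2}}{\sqrt{0.016^2 + 0.087^2}} = 0.173$$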
Thus, a perturbation causing a relative error of 10 % on b has caused a
corresponding relative error of over 17 % on the solution x.
Example 4.6
An $\mathbf{Mx} = \mathbf{b}$ system has the following matrix of coefficients:
$$\mathbf{M} = \begin{bmatrix} 3 & 1 & 1\\ 2 & 7 & 3\\ 4 & 2 & 9 \end{bmatrix}$$
and the value of b is being measured by instruments that have a relative error
of 5 %. What relative error can be expected on the value of x?
$$\mathbf{M}\mathbf{M}^T = \begin{bmatrix} 11 & 16 & 23\\ 16 & 62 & 49\\ 23 & 49 & 101 \end{bmatrix}$$
The maximum eigenvalue of that matrix is 140.31 (see Example 4.8), and the
square root of it is 11.85. Next, compute the inverse matrix (see Example 4.7):
$$\mathbf{M}^{-1} = \begin{bmatrix} 0.40 & -0.05 & -0.03\\ -0.04 & 0.16 & -0.05\\ -0.17 & -0.01 & 0.13 \end{bmatrix}$$
multiply it by its transpose and find the largest eigenvalue to be 0.20, the
square root of which is 0.45. Finally, we can use these values in Eq. (4.29):
$$\mathrm{cond}(\mathbf{M}) = 11.85 \times 0.45 = 5.33$$
The relative error on the vector b is given to be 5 %, or 0.05. Using
Eq. (4.34), the relative error on x will be
$$E_x \leq 0.05 \times 5.33 = 0.267$$
which means the relative error on x will be bounded to 26.7 % at a maximum.
4.6.1
Reciprocal Matrix
The inverse matrix or reciprocal matrix $\mathbf{M}^{-1}$ of a matrix M is a matrix such that
$$\mathbf{M}\mathbf{M}^{-1} = \mathbf{I} \tag{4.35}$$
and is computed as the adjoint matrix (the transpose of the cofactor matrix) of
M divided by the determinant of M.
The cofactor of a matrix given an element mi,j is the determinant of the
submatrix obtained by deleting row i and column j. The cofactor matrix of an
$n \times n$ matrix M is the $n \times n$ matrix obtained by computing the cofactor given each
element $m_{i,j}$ in the corresponding position of M and alternating + and − signs, with
the initial cofactor given $m_{0,0}$ having a positive sign.
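For example, if M is a $2 \times 2$ matrix,
$$\mathbf{M} = \begin{bmatrix} a & b\\ c & d \end{bmatrix} \tag{4.36}$$
its cofactor matrix will be
$$\begin{bmatrix} |d| & -|c|\\ -|b| & |a| \end{bmatrix} = \begin{bmatrix} d & -c\\ -b & a \end{bmatrix} \tag{4.37}$$
its adjoint will be
$$\begin{bmatrix} d & -c\\ -b & a \end{bmatrix}^T = \begin{bmatrix} d & -b\\ -c & a \end{bmatrix} \tag{4.38}$$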
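and finally its reciprocal will be the adjoint divided by the determinant:
$$\mathbf{M}^{-1} = \begin{bmatrix} a & b\\ c & d \end{bmatrix}^{-1}
= \frac{1}{ad - bc}\begin{bmatrix} d & -c\\ -b & a \end{bmatrix}^T
= \frac{1}{ad - bc}\begin{bmatrix} d & -b\\ -c & a \end{bmatrix} \tag{4.39}$$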
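Likewise, if M is a $3 \times 3$ matrix,
$$\mathbf{M} = \begin{bmatrix} a & b & c\\ d & e & f\\ g & h & i \end{bmatrix} \tag{4.40}$$
its reciprocal will be
$$\mathbf{M}^{-1} = \frac{1}{a(ei - hf) - b(di - gf) + c(dh - ge)}
\begin{bmatrix}
\begin{vmatrix} e & f\\ h & i \end{vmatrix} & -\begin{vmatrix} d & f\\ g & i \end{vmatrix} & \begin{vmatrix} d & e\\ g & h \end{vmatrix}\\[2ex]
-\begin{vmatrix} b & c\\ h & i \end{vmatrix} & \begin{vmatrix} a & c\\ g & i \end{vmatrix} & -\begin{vmatrix} a & b\\ g & h \end{vmatrix}\\[2ex]
\begin{vmatrix} b & c\\ e & f \end{vmatrix} & -\begin{vmatrix} a & c\\ d & f \end{vmatrix} & \begin{vmatrix} a & b\\ d & e \end{vmatrix}
\end{bmatrix}^T$$
$$= \frac{1}{a(ei - hf) - b(di - gf) + c(dh - ge)}
\begin{bmatrix}
\begin{vmatrix} e & f\\ h & i \end{vmatrix} & -\begin{vmatrix} b & c\\ h & i \end{vmatrix} & \begin{vmatrix} b & c\\ e & f \end{vmatrix}\\[2ex]
-\begin{vmatrix} d & f\\ g & i \end{vmatrix} & \begin{vmatrix} a & c\\ g & i \end{vmatrix} & -\begin{vmatrix} a & c\\ d & f \end{vmatrix}\\[2ex]
\begin{vmatrix} d & e\\ g & h \end{vmatrix} & -\begin{vmatrix} a & b\\ g & h \end{vmatrix} & \begin{vmatrix} a & b\\ d & e \end{vmatrix}
\end{bmatrix} \tag{4.41}$$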
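The cofactor construction generalizes to any size, as the following sketch shows. The function names `minor`, `determinant`, and `inverse` are assumptions made for this illustration, and the recursive determinant is only practical for small matrices; the sample matrix is the coefficient matrix of Example 4.6.

```python
def minor(M, i, j):
    """Submatrix of M without row i and column j."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(M) if r != i]

def determinant(M):
    """Determinant by cofactor expansion along the first row."""
    if len(M) == 0:
        return 1
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * determinant(minor(M, 0, j))
               for j in range(len(M)))

def inverse(M):
    """Reciprocal matrix: the adjoint (transposed cofactor matrix) over the determinant."""
    n = len(M)
    det = determinant(M)
    if det == 0:
        raise ValueError("matrix is singular")
    cof = [[(-1) ** (i + j) * determinant(minor(M, i, j)) for j in range(n)]
           for i in range(n)]
    # Transposing while dividing turns the cofactor matrix into the adjoint.
    return [[cof[j][i] / det for j in range(n)] for i in range(n)]

M_ex = [[3, 1, 1], [2, 7, 3], [4, 2, 9]]
M_ex_inv = inverse(M_ex)
for row in M_ex_inv:
    print([round(v, 2) for v in row])
```

The sign factor `(-1) ** (i + j)` reproduces the alternating + and − pattern that starts with a positive sign at position (0,0).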
Example 4.7
Compute the inverse of this matrix, using two decimals of precision:
$$\mathbf{M} = \begin{bmatrix} 3 & 1 & 1\\ 2 & 7 & 3\\ 4 & 2 & 9 \end{bmatrix}$$
Solution
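The cofactor matrix is obtained by computing, at each position (i,j), the
determinant of the submatrix without row i and column j. So, for example, at
position (0,0),
$$\begin{vmatrix} 7 & 3\\ 2 & 9 \end{vmatrix} = 7 \times 9 - 2 \times 3 = 57$$
and at position (1,2),
$$\begin{vmatrix} 3 & 1\\ 4 & 2 \end{vmatrix} = 3 \times 2 - 4 \times 1 = 2$$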
The cofactors are assembled into the matrix with alternating signs, to create
the cofactor matrix:
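$$\begin{bmatrix} 57 & -6 & -24\\ -7 & 23 & -2\\ -4 & -7 & 19 \end{bmatrix}$$
Transposing it and dividing by the determinant, 141, gives the inverse:
$$\mathbf{M}^{-1} = \frac{1}{141}\begin{bmatrix} 57 & -7 & -4\\ -6 & 23 & -7\\ -24 & -2 & 19 \end{bmatrix}
= \begin{bmatrix} 0.40 & -0.05 & -0.03\\ -0.04 & 0.16 & -0.05\\ -0.17 & -0.01 & 0.13 \end{bmatrix}$$
4.6.2
Maximum Eigenvalue
$$\mathbf{x}_{k+1} = \frac{\mathbf{M}\mathbf{x}_k}{\|\mathbf{x}_k\|} \tag{4.42}$$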
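The iteration of Eq. (4.42) can be sketched in Python as follows, reading the norm of the current vector as the eigenvalue estimate. This is an illustrative sketch: the function name `max_eigenvalue`, the choice of relative change between successive norms as the halting test, and the starting vector are assumptions; the matrix is the product $\mathbf{MM}^T$ computed in Example 4.6.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in v))

def max_eigenvalue(M, x0, tol, max_iter):
    """Iterate x <- M x / ||x|| and use ||x|| as the estimate of the
    maximum eigenvalue. Returns (estimate, iterations)."""
    x = list(x0)
    previous = norm(x)
    for k in range(1, max_iter + 1):
        scale = norm(x)
        x = [sum(m * c for m, c in zip(row, x)) / scale for row in M]
        estimate = norm(x)
        if abs(estimate - previous) / estimate < tol:
            return estimate, k
        previous = estimate
    raise RuntimeError("eigenvalue iteration failed to converge")

MMt = [[11.0, 16.0, 23.0], [16.0, 62.0, 49.0], [23.0, 49.0, 101.0]]
lam, steps = max_eigenvalue(MMt, [0.1, 0.2, 0.3], 0.01, 50)
print(round(lam, 2), steps)
```

The square root of the returned value is the matrix norm used in the condition number of Eq. (4.29).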
Example 4.8
Compute the maximum eigenvalue of this matrix with two decimals to a
relative error of less than 0.01.
$$\mathbf{M} = \begin{bmatrix} 11 & 16 & 23\\ 16 & 62 & 49\\ 23 & 49 & 101 \end{bmatrix}$$
Solution
Start with a vector of random values, such as x0 [0.1, 0.2, 0.3]T. Then
compute the iterations, keeping track of the error at each one.
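$$\mathbf{x}_1 = [29.93\ \ 76.70\ \ 113.32]^T \qquad \|\mathbf{x}_1\| = 140.07$$
$$\mathbf{x}_2 = [29.72\ \ 77.01\ \ 113.46]^T \qquad e_2 = 0.002$$
The maximum eigenvalue is the norm of the final vector:
$$\|\mathbf{x}_2\| = \sqrt{29.72^2 + 77.01^2 + 113.46^2} = 140.31$$
4.7
Summary
4.8
Exercises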
Chapter 5
Taylor Series
5.1
Introduction
It is known that, zooming-in close enough to a curve, it will start to look like a
straight line. This can be tested easily by using any graphic software to draw a
curve, and then zooming into a smaller and smaller region of it. It is also the reason
why the Earth appears flat to us; it is of course spherical, but humans on its surface
see a small portion up close so that it appears like a plane. This leads to the intuition
for the third mathematical and modelling tool in this book: it is possible to represent
a high-order polynomial (such as a curve or a sphere) with a lower-order polynomial (such as a line or a plane), at least over a small region. The mathematical tool
that allows this is called the Taylor series. And, since the straight line mentioned in
the first intuitive example is actually the tangent (or first derivative) of the curve, it
should come as no surprise that this Taylor series will make heavy use of derivatives of the functions being modelled.
5.2
Assume a function f(x) which has infinite continuous derivatives (which can be zero
after a point). Assume furthermore that the function has a known value at a point $x_i$,
and that an approximation of the function's value is needed at another point
x (which will generally be near $x_i$). That approximation can be obtained by
expanding the derivatives of the function around $x_i$ in this manner:
$$f(x) = f(x_i) + f^{(1)}(x_i)(x - x_i) + \frac{f^{(2)}(x_i)}{2!}(x - x_i)^2 + \frac{f^{(3)}(x_i)}{3!}(x - x_i)^3 + \cdots + \frac{f^{(n)}(x_i)}{n!}(x - x_i)^n + \cdots \tag{5.1}$$
Equation (5.1) can be written in an equivalent but more compact form as:
$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_i)}{k!}(x - x_i)^k \tag{5.2}$$
This expansion is called the Taylor series. Note that Eq. (5.2) makes explicit the
divisions by 0! and 1!, which were not written in Eq. (5.1) because both those
factorials are equal to 1, as well as a multiplication by $(x - x_i)^0$ also excluded from
Eq. (5.1) for the same reason. In the special case where $x_i = 0$, the series simplifies
into Eq. (5.3), which is also called the Maclaurin series.
$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!}x^k \tag{5.3}$$
Writing the step between two successive points as $h = x_{i+1} - x_i$:
$$f(x_{i+1}) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_i)}{k!}(x_{i+1} - x_i)^k$$
$$f(x_i + h) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_i)}{k!}h^k \tag{5.4}$$
This form of the Taylor series given in Eq. (5.4) is the one that will be most useful in
this textbook. One problem with it, however, is that it requires computing an infinite
summation, something that is never practical to do in engineering! Consequently, it
is usually truncated at some value n as such:
$$f(x_i + h) \approx \sum_{k=0}^{n} \frac{f^{(k)}(x_i)}{k!}h^k \tag{5.5}$$
Example 5.1
Approximate the value of the following function f(x) at a step $h = 1$ after
$x_0 = 0$, using a 0th, 1st, 2nd, and 3rd-order Taylor series approximation.
$$f(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots$$
Solution
To begin, note that the summation is actually the expansion of $f(x) = e^x$. This
means that the value f(1) that is being approximated is the constant
e = 2.7183... Note as well that the summation is infinitely differentiable and
that each derivative is equal to the original summation, as is the case for
$f(x) = e^x$:
$$f(x) = f^{(1)}(x) = f^{(2)}(x) = f^{(3)}(x) = \cdots$$
The Taylor series expansion of the function at any step h after $x_0 = 0$ is:
$$f(0 + h) = f(h) = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!}h^k = f(0) + f^{(1)}(0)h + \frac{f^{(2)}(0)}{2!}h^2 + \frac{f^{(3)}(0)}{3!}h^3 + \frac{f^{(4)}(0)}{4!}h^4 + \cdots$$
and the 0th, 1st, 2nd, and 3rd-order approximations truncate the series after
the first, second, third, and fourth terms respectively. In other words, the
0th-order approximation is:
$$f(h) \approx f(0) \qquad f(1) \approx 1$$
The 1st-order approximation is:
$$f(h) \approx f(0) + f^{(1)}(0)h = 1 + h \qquad f(1) \approx 2$$
The 2nd-order approximation is:
$$f(h) \approx f(0) + f^{(1)}(0)h + \frac{f^{(2)}(0)}{2!}h^2 = 1 + h + \frac{h^2}{2!} \qquad f(1) \approx 2.5$$
And the 3rd-order approximation is:
$$f(h) \approx f(0) + f^{(1)}(0)h + \frac{f^{(2)}(0)}{2!}h^2 + \frac{f^{(3)}(0)}{3!}h^3 = 1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} \qquad f(1) \approx 2.67$$
Alternatively, it can be seen that all five functions go through the same
point f(0). However, the 0th approximation in red then diverges immediately
from the correct function in black, while the 1st approximation in blue
matches the correct function over a short step to about 0.1, the 2nd approximation in green follows the function in black over a longer step to approximately 0.3, and the 3rd approximation in orange has the longest overlap
with the function in black, to a step of almost 0.7. This highlights another
understanding of the Taylor series approximation: the greater the approximation order, the greater the step for which it will give an accurate
approximation.
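The four approximations of this example can be reproduced with a short sketch. The function name `taylor_exp` is hypothetical, and the sketch relies on the fact stated in the solution that every derivative of $e^x$ at 0 equals 1, so that each term reduces to $h^k/k!$.

```python
import math

def taylor_exp(h, order):
    """n-th order Taylor approximation of e^h around 0: sum of h^k / k! for k = 0..n."""
    return sum(h ** k / math.factorial(k) for k in range(order + 1))

for n in range(4):
    approx = taylor_exp(1.0, n)
    print(n, round(approx, 2), round(abs(math.e - approx), 4))
```

Printing the absolute error next to each value shows the gap to e shrinking as the order increases, which is the behaviour described above.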
5.3
Error Analysis
When expanded to infinity, the Taylor series of Eq. (5.4) is exactly equivalent to the
original function. That is to say, the error in that case is null. Problems arise
however when the series is truncated into the nth-order approximation of
Eq. (5.5). Clearly, the truncated series is not equivalent to the complete series nor
to the original equation, and there is an approximation error to account for.
Comparing Eqs. (5.4) and (5.5), it is clear to see that the error will be exactly
equal to the truncated portion of the series:
$$f(x_i + h) = \sum_{k=0}^{n} \frac{f^{(k)}(x_i)}{k!}h^k + \sum_{k=n+1}^{\infty} \frac{f^{(k)}(x_i)}{k!}h^k \tag{5.6}$$
Unfortunately, this brings back the summation to infinity that the nth-order approximation was meant to eliminate. Fortunately, there is a way out of this, by noting
that the terms of the Taylor series are ordered in decreased absolute value. That is to
say, each term contributes less than the previous but more than the next to the total
summation. This phenomenon could be observed in Example 5.1: note that the 0th
and 1st-order approximation add a value of 1 to the total, the 2nd-order approximation adds a value of 0.5, and the 3rd-order approximation a value of 0.17.
Likewise, graphically in that example, it can be seen that while each step brings
the approximation closer to the real value, it also leaves much less room for further
improvement with the infinite number of remaining terms. This observation can be
formalized by writing:
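$$\sum_{k=n+1}^{\infty} \frac{f^{(k)}(x_i)}{k!}h^k \approx \frac{f^{(n+1)}(x_i)}{(n+1)!}h^{n+1} \tag{5.7}$$
so that the nth-order approximation can be written together with its error term:
$$f(x_i + h) = \sum_{k=0}^{n} \frac{f^{(k)}(x_i)}{k!}h^k + O(h^{n+1}) \tag{5.8}$$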
Special care should be taken with Eq. (5.7) when dealing with series that
alternate zero and non-zero terms (such as trigonometric functions). If the (n + 1)
th term happens to be one of the zero terms of the series, it should not be mistaken
for the approximation having no error! Rather, in that case, the (n + 1)th term and all
subsequent zero terms should be skipped, and the error will be proportional to the
next non-zero term.
Example 5.2
What is the error of the 1st-order Taylor series approximation of the following
function at a step $h = 1$ after $x_0 = 0$?
f x 1 0:2x 0:6x2 0:3x3 0:5x4 0:1x5
Solution
The 1st-order approximation is:
$$f(x_0 + h) = f(x_0) + f^{(1)}(x_0)h + E(x_0)$$
where the error term E(x) is the 2nd-order term:
$$E(x_0) = \frac{f^{(2)}(x_0)}{2!}h^2$$
Example 5.3
What is the error of the 2nd-order Taylor series approximation of cos(x) in
radians at a step $h = 0.01$ after $x_0 = 0$?
Solution
The derivatives of the cos and sin functions are:
$$E(x_0) = \frac{f^{(3)}(x_0)}{3!}h^3 = \frac{\sin(x_0)}{3!}h^3$$
However, since sin(0) = 0, this term will be 0. That is clearly wrong, since
0.99995 is not a perfect approximation of the value of cos(0.01)! In this case,
the error is the next non-zero term, which is the 4th-order term:
$$E(x_0) = \frac{f^{(4)}(x_0)}{4!}h^4 = \frac{\cos(x_0)}{4!}h^4 = \frac{1}{24}(0.01)^4 = 4.16 \times 10^{-10}$$
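The estimate can be checked numerically with a few lines of Python. This is only a sanity-check sketch with hypothetical variable names; it compares the 4th-order error term to the actual difference between cos(0.01) and the 2nd-order approximation.

```python
import math

h = 0.01
# 2nd-order Taylor approximation of cos(h) around 0: 1 - h^2 / 2!
approximation = 1.0 - h ** 2 / math.factorial(2)
# Error predicted by the next non-zero term, cos(0) * h^4 / 4!
estimated_error = math.cos(0.0) * h ** 4 / math.factorial(4)
# Error actually observed in double precision
actual_error = abs(math.cos(h) - approximation)

print(approximation)
print(estimated_error)
print(actual_error)
```

The two error values agree closely because the terms beyond the 4th order are several orders of magnitude smaller for such a small step.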
Now that it is possible to measure the error of a Taylor series approximation, the
natural next question is, how can this information be used to create better approximations of real systems? Given both the discussion in this section and Eq. (5.8) specifically, it can be seen that there are two ways to reduce the error term $O(h^{n+1})$: by using
smaller values of h or greater values of n. Using smaller values of h means taking
smaller steps, or evaluating the approximation nearer to the known point. Indeed, it has
been established and clearly illustrated in Example 5.1 that the approximation
diverges more from the real function the further it gets from the evaluated point;
conversely, even a low-order approximation is accurate for a small step around the
point. It makes sense, then, that reducing the step size h will lead to a smaller
approximation error. The second option is to increase the approximation order n,
which means adding more terms to the series. This will make the approximation more
complete and similar to the original function, and therefore reduce the error.
5.4
Taylor Series
5.5 Summary
5.6 Exercises
f 2 x0
f 3 x0
x x0 2
x x 0 3
2!
3!
5. Compute the 0th- to 2nd-order Taylor series approximations of the following functions
for x_0 = 1 and h = 0.5. For each one, use the Taylor series to estimate the
approximation error and compute the absolute error to the real value.
(a) f x x2 4x 3
(b) f x 3x3 x2 4x 3
(c) f x 2x5 3x3 x2 4x 3
Chapter 6
6.1 Introduction
Fig. 6.1 Left: A set of exact measurement points in 2D and the interpolated mathematical function
(solid line) and extrapolated function (dashed line). Right: The same points as inexact measurements in 2D and the regressed mathematical function (solid line) and extrapolated function
(dashed line)
6.2 Interpolation
(6.1)
A polynomial with n - 1 roots must be of degree n - 1 or more. The roots are the
values for which the polynomial evaluates to 0. Each root is a value for which
one of the terms of the polynomial becomes exactly equal but of opposite sign to
all the other terms combined, which is why a polynomial with n - 1 roots must
have at least n - 1 terms and be at least of degree n - 1. Graphically, when
plotted, a polynomial of degree n - 1 will have up to n - 2 local optima where the
curve changes directions. In order for the curve to intersect the zero axis n - 1
times by crossing it, changing directions, and crossing it again, it will need to
perform at least n - 2 changes of directions, and thus be at least of degree n - 1.
The one exception to this rule is the zero polynomial f(x) = 0, which has more
roots (an infinite number of them) than its degree of 0.
Now assume there exist two polynomials of degree n - 1, p_1(x) and p_2(x), that
both interpolate a set of n points. Define a third polynomial as the difference of
these two:
\[ r(x) = p_1(x) - p_2(x) \tag{6.2} \]
By the first of the two observations above, it is clear that r(x) will be a polynomial of
degree at most n - 1. Moreover, the polynomials interpolate the same set of n points, which
means they will both have the same value at those points and their difference will be
zero. These points will therefore be the roots of r(x). And there will be n of them,
one more than the degree of r(x). By the second of the two observations above, the
only polynomial r(x) could be is the zero polynomial, which means p_1(x) and p_2(x)
were the same polynomial in the first place. Q.E.D.
6.3 Vandermonde Method
6.3.1 Univariate Polynomials
\[ (x_0, y_0), (x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_{n-1}, y_{n-1}) \]
6:3
6:4
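\[
\begin{bmatrix}
1 & x_0 & \cdots & x_0^i & \cdots & x_0^{n-1} \\
1 & x_1 & \cdots & x_1^i & \cdots & x_1^{n-1} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_i & \cdots & x_i^i & \cdots & x_i^{n-1} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{n-1} & \cdots & x_{n-1}^i & \cdots & x_{n-1}^{n-1}
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_i \\ \vdots \\ c_{n-1} \end{bmatrix}
=
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_i \\ \vdots \\ y_{n-1} \end{bmatrix}
\tag{6.5}
\]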
Here, the matrix containing the values of the variables of the polynomial is called
the Vandermonde Matrix and is written V, the vector of unknown coefficients is
written c, and the vector of solutions of the polynomial, or the evaluations of the
points, is y. This gives a Vc = y system. This system can then be solved using any of
the techniques learnt in Chap. 4, or any other decomposition technique, to discover
the values of the coefficients and thus the polynomial interpolating the points.
Example 6.1
Four measurements of an electrical system were taken. At time 0 s the output
is 1 V, at time 1 s it is 2 V, at time 2 s it is 9 V, and at time 3 s it is 28 V. Find a
mathematical model for this system.
Solution
There are four 2D points: (0, 1), (1, 2), (2, 9), and (3, 28). Four points can
interpolate a polynomial of degree 3 of the form:
\[ y = f(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 \]
Writing this into a Vandermonde system of the form of Eq. (6.5) gives:
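\[
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \\ 9 \\ 28 \end{bmatrix}
\]
This system can then be solved using any technique to find the solution:
\[ c = \begin{bmatrix} 1 & 0 & 0 & 1 \end{bmatrix}^T \]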
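The Vandermonde procedure used in Example 6.1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the book's algorithm: the helper names `solve` and `vandermonde_fit` are hypothetical, and a plain Gauss-Jordan elimination on exact fractions stands in for whichever technique from Chap. 4 would be used in practice.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with row swaps (exact arithmetic)."""
    n = len(b)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # First row at or below the diagonal with a non-zero entry in this column
        # (raises StopIteration if the matrix is singular).
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

def vandermonde_fit(points):
    """Coefficients c_0..c_{n-1} of the polynomial through n points."""
    n = len(points)
    v = [[x**j for j in range(n)] for x, _ in points]   # row i is 1, x_i, x_i^2, ...
    y = [yv for _, yv in points]
    return solve(v, y)

coeffs = vandermonde_fit([(0, 1), (1, 2), (2, 9), (3, 28)])
print([int(c) for c in coeffs])  # prints [1, 0, 0, 1]
```

Exact fractions are used here only so that the small example returns integers; a floating-point solver would serve equally well for measured data.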
[Figure: the four measurements and the interpolated polynomial, Voltage versus Time]
6.3.2
One of the advantages of the Vandermonde method is its flexibility: with little
modification, it can be used to interpolate any function, not just a polynomial. If it is
known that a non-polynomial function of x, such as a trigonometric function for
example, is part of the underlying model, the mathematical development can be
adapted to include it. This is accomplished by rewriting Eq. (6.1) in a more general
form, with each term having one of the desired functions of x multiplied by an
unknown coefficient:
\[ y = f(x) = c_0 + c_1 f_1(x) + c_2 f_2(x) + c_3 f_3(x) + \cdots + c_{n-1} f_{n-1}(x) \tag{6.6} \]
It can be seen that the original polynomial equation was simply a special case of this
equation with f_i(x) = x^i. In the more general case, any function of x can be used, its
result evaluated for each given sample of x and stored in the matrix V, and then used
to solve for the vector of coefficients.
Example 6.2
Four measurements of an electrical system were taken. At time 0 s the output
is 1 V, at time 1 s it is 2 V, at time 2 s it is 9 V, and at time 3 s it is 28 V. Find a
mathematical model for this system, knowing that the system handles a sine
and a cosine wave. Work in radians.
Solution
There are four 2D points: (0,1), (1,2), (2,9), and (3,28). Given the information
in the question, they interpolate a trigonometric function of the form:
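\[ y = f(x) = c_0 + c_1 \sin(x) + c_2 \cos(x) + c_3 \sin(x)\cos(x) \]
Writing this into a Vandermonde system gives:
\[
\begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0.84 & 0.54 & 0.45 \\ 1 & 0.91 & -0.42 & -0.38 \\ 1 & 0.14 & -0.99 & -0.14 \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \\ 9 \\ 28 \end{bmatrix}
\]
This system can then be solved using any technique to find the solution:
\[ c = \begin{bmatrix} 15.91 & -11.19 & -14.91 & 7.87 \end{bmatrix}^T \]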
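The generalization of Eq. (6.6) only changes how the matrix V is filled in. The sketch below illustrates this for the basis used in Example 6.2; the names `solve`, `basis` and `model` are hypothetical, and Gaussian elimination with partial pivoting is assumed as the solver. Because the sines and cosines are evaluated at full precision rather than rounded to two decimals, the coefficients may differ slightly from the printed ones.

```python
import math

def solve(a, b):
    """Gaussian elimination with partial pivoting, followed by back substitution."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

# One basis function per unknown coefficient: 1, sin x, cos x, sin x cos x.
basis = [lambda x: 1.0, math.sin, math.cos, lambda x: math.sin(x) * math.cos(x)]
points = [(0, 1), (1, 2), (2, 9), (3, 28)]

v = [[f(x) for f in basis] for x, _ in points]      # generalized Vandermonde matrix
c = solve(v, [y for _, y in points])

def model(x):
    """Interpolating function built from the solved coefficients."""
    return sum(ck * f(x) for ck, f in zip(c, basis))

print(c)
print([model(x) for x, _ in points])  # should reproduce the measured outputs
```

Swapping in a different list of basis functions is the only change needed to interpolate with another family of functions.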
[Figure: the four measurements and the interpolated trigonometric function, Voltage versus Time]
6.3.3 Multidimensional Polynomial
The same basic technique can be used to generalize the Vandermonde method to
interpolate multivariate or multidimensional functions. In this case, the single-variable
functions f_i(x) in Eq. (6.6) become multivariate functions of k variables
f_i(x_0, ..., x_{k-1}). Each function f_i(x_0, ..., x_{k-1}) is a different product of the
k variables, and the entire polynomial would exhaustively list all such products,
starting with the constant term (all variables at exponent 0), then the single-variable
products (one variable at exponent 1 multiplying all others at exponent 0), then the
two-variable products (two variables at exponent 1 multiplying all others at exponent 0),
and so on. After the last degree-1 term (multiplying all variables together),
the exhaustive list continues with one variable at exponent 2.
The most common multidimensional case in engineering is the three-dimensional case,
with measurement points being of the form (x, y, z = f(x, y)). In
this case, the polynomials exhaustively listing all terms at degree 1 and at degree
2 are given in Eqs. (6.7) and (6.8) respectively.
\[ z = f(x, y) = c_0 + c_1 x + c_2 y + c_3 xy \tag{6.7} \]
\[ z = f(x, y) = c_0 + c_1 x + c_2 y + c_3 xy + c_4 x^2 + c_5 y^2 + c_6 x^2 y + c_7 x y^2 + c_8 x^2 y^2 \tag{6.8} \]
Given the number of coefficients to compute, Eq. (6.7) requires four points to
interpolate, and Eq. (6.8) requires nine points to interpolate. The choice of how
many terms to include in the polynomial will be guided by how many measurements
of the system are available to use. With fewer points available, it is possible
to interpolate a partial form of one of those polynomials.
Example 6.3
Four measurements of the height of a structure were taken. At position (3 km,
3 km) the height is 5 km, at position (3 km, 4 km) it is 6 km, at position (4 km,
3 km) it is 7 km, and at position (4 km, 4 km) it is 9 km in height. Find a
mathematical model for this structure.
Solution
There are four 3D points: (3, 3, 5), (3, 4, 6), (4, 3, 7), (4, 4, 9). Given the
information in the question, they interpolate a 3D function of the form:
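\[ z = f(x, y) = c_0 + c_1 x + c_2 y + c_3 xy \]
Writing this into a Vandermonde system gives:
\[
\begin{bmatrix} 1 & 3 & 3 & 9 \\ 1 & 3 & 4 & 12 \\ 1 & 4 & 3 & 12 \\ 1 & 4 & 4 & 16 \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}
=
\begin{bmatrix} 5 \\ 6 \\ 7 \\ 9 \end{bmatrix}
\]
This system can then be solved using any technique to find the solution:
\[ c = \begin{bmatrix} 5 & -1 & -2 & 1 \end{bmatrix}^T \]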
6.4 Lagrange Polynomials
n separate polynomials, with each one being equal to 1 for one of the n points and
equal to 0 for all others. These polynomials are actually quite simple to define; they
will each have the form:
\[ L_i(x) = \frac{(x - x_0) \cdots (x - x_{i-1})(x - x_{i+1}) \cdots (x - x_{n-1})}{(x_i - x_0) \cdots (x_i - x_{i-1})(x_i - x_{i+1}) \cdots (x_i - x_{n-1})} \tag{6.9} \]
Notice that this polynomial, developed for point x_i, skips over the (x - x_i) term in
the numerator and the (x_i - x_i) term in the denominator, but has n - 1 terms for the
other n - 1 points in the set. When x = x_i, the denominator and numerator will be
equal and the polynomial will evaluate to 1. At any of the other points, one of the
subtractions in the numerator will give 0, as will the entire polynomial. The second
point then multiplies each polynomial L_i(x) with the value y_i of the measurement at
that point, and sums them all together.
\[ y = f(x) = \sum_{i=0}^{n-1} y_i L_i(x) \tag{6.10} \]
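\[ L_0(x) = \frac{(x - 1)(x - 2)(x - 3)}{(0 - 1)(0 - 2)(0 - 3)} = \frac{x^3 - 6x^2 + 11x - 6}{-6} \]
\[ L_1(x) = \frac{(x - 0)(x - 2)(x - 3)}{(1 - 0)(1 - 2)(1 - 3)} = \frac{x^3 - 5x^2 + 6x}{2} \]
\[ L_2(x) = \frac{(x - 0)(x - 1)(x - 3)}{(2 - 0)(2 - 1)(2 - 3)} = \frac{x^3 - 4x^2 + 3x}{-2} \]
\[ L_3(x) = \frac{(x - 0)(x - 1)(x - 2)}{(3 - 0)(3 - 1)(3 - 2)} = \frac{x^3 - 3x^2 + 2x}{6} \]
Next, each polynomial is multiplied by its matching value, and they are all
summed up together to get:
\[
\begin{aligned}
y = f(x) &= y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x) + y_3 L_3(x) \\
&= 1 \cdot \frac{x^3 - 6x^2 + 11x - 6}{-6} + 2 \cdot \frac{x^3 - 5x^2 + 6x}{2} + 9 \cdot \frac{x^3 - 4x^2 + 3x}{-2} + 28 \cdot \frac{x^3 - 3x^2 + 2x}{6} \\
&= x^3 \left( -\frac{1}{6} + \frac{6}{6} - \frac{27}{6} + \frac{28}{6} \right) + x^2 \left( \frac{6}{6} - \frac{30}{6} + \frac{108}{6} - \frac{84}{6} \right) + x \left( -\frac{11}{6} + \frac{36}{6} - \frac{81}{6} + \frac{56}{6} \right) + \left( \frac{6}{6} + \frac{0}{6} + \frac{0}{6} + \frac{0}{6} \right) \\
&= x^3 + 1
\end{aligned}
\]
which is the same polynomial that was found in Example 6.1.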
While the Lagrange polynomials method is the easiest interpolation method
for humans to understand and use, it is also the most complicated one to
implement in software, as can be seen from the pseudocode in Fig. 6.2. It also
suffers from the problem that interpolating a polynomial for a set of n points
with this method gives no information whatsoever on the polynomial that could
be interpolated with n + 1 points including the same n points. In practical terms,
this means that if a polynomial has been interpolated for a set of n points and
new measurements of the system are made subsequently, the computations
have to be done over from scratch. To be sure, that was also the case with
the Vandermonde method. However, since with Lagrange polynomials the
computations are also made by hand, this can become a major limitation of
this method.
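For illustration, Eqs. (6.9) and (6.10) can be evaluated directly at a given x without expanding the polynomials symbolically. The sketch below is not the pseudocode of Fig. 6.2; the function name `lagrange_eval` is hypothetical, and it returns the value of the interpolating polynomial at one point rather than its coefficients.

```python
def lagrange_eval(points, x):
    """Evaluate the Lagrange interpolating polynomial of `points` at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        basis = 1.0                       # builds L_i(x) of Eq. (6.9)
        for j, (xj, _) in enumerate(points):
            if j != i:                    # skip the (x - x_i) and (x_i - x_i) terms
                basis *= (x - xj) / (xi - xj)
        total += yi * basis               # sum of y_i L_i(x), Eq. (6.10)
    return total

points = [(0, 1), (1, 2), (2, 9), (3, 28)]
print(lagrange_eval(points, 1.5))  # x**3 + 1 evaluated at 1.5, i.e. 4.375
```

At each measurement point the function returns the measured value itself, since every basis polynomial except one evaluates to 0 there.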
6.5 Newton Polynomials
The Newton polynomials method discovers the polynomial that interpolates a set
of n points under the form of a sum of polynomials going from degree 0 to degree
n - 1, in the form given in Eq. (6.11). That equation may look long, but it is actually
quite straightforward: each individual term i is composed of a coefficient c_i multiplied
by the product of the differences between x and every measurement point from x_0 to x_{i-1}.
\[ y = f(x) = c_0 + c_1 (x - x_0) + c_2 (x - x_0)(x - x_1) + \cdots + c_{n-1} (x - x_0)(x - x_1) \cdots (x - x_{n-2}) \tag{6.11} \]
Unlike the Vandermonde method and Lagrange polynomials, the Newton polynomials method
can be used to incrementally add points to the interpolation set.
A new measurement point (x_n, y_n) will simply add the term c_n(x - x_0)...(x - x_{n-1}) to
the sum of Eq. (6.11). This new term will be a polynomial of degree n, as will the
entire polynomial (as it should be since it now interpolates a set of n + 1 points).
Moreover, it can be seen that this new term will not have any effect on the terms
computed previously: since it is multiplied by (x - x_0)...(x - x_{n-1}), it is 0 at all
previously interpolated points. The polynomial of Eq. (6.11) was correct for n points,
and the newly added (n + 1)th point makes it possible to compute a refinement to that
equation without recomputing the entire interpolation.
The biggest challenge in Newton polynomials is to compute the set of coefficients. There is actually a simple method for computing them, but to understand
where the equations come from, it is best to learn the underlying logic by computing
the first few coefficients.
Much like Eq. (6.11) makes it possible to incrementally add new points into the
interpolation, the coefficients are computed by incrementally adding new points
into the set considered. The first coefficient, c_0, will be computed using only the first
point (x_0, y_0). Evaluating Eq. (6.11) at that first point reduces it to the straight-line
polynomial y = f(x) = c_0 since, when the polynomial is evaluated at x_0, all other
terms are multiplied by (x_0 - x_0) and become 0. The value of the coefficient is thus
clear:
\[ y_0 = f(x_0) = c_0 \tag{6.12} \]
Taking the second point into consideration and evaluating Eq. (6.11) at that
coordinate while including the result of Eq. (6.12) gives a polynomial of degree 1:
\[ y_1 = f(x_1) = y_0 + c_1 (x_1 - x_0) \tag{6.13} \]
The value of the coefficient c1 in the newly added term of the equation is the only
unknown in that equation, and can be discovered simply by isolating it in that
equation:
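\[ c_1 = \frac{y_1 - y_0}{x_1 - x_0} = \frac{f(x_1) - f(x_0)}{x_1 - x_0} \tag{6.14} \]
The right-hand side of Eq. (6.14) can be written in a more general form:
\[ f(x_i, x_{i+1}) = \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i} \tag{6.15} \]
\[ c_1 = f(x_0, x_1) \tag{6.16} \]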
Next, a third measurement point (x2, y2) is observed. Evaluating Eq. (6.11) with
that new point gives:
\[ y_2 = f(x_2) = f(x_0) + f(x_0, x_1)(x_2 - x_0) + c_2 (x_2 - x_0)(x_2 - x_1) \tag{6.17} \]
Once again, the value of the coefficient in the newly added term of the equation is
the only unknown in that equation, and its value can be discovered simply by
isolating it in that equation:
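\[ c_2 = \frac{f(x_2) - f(x_0) - f(x_0, x_1)(x_2 - x_0)}{(x_2 - x_0)(x_2 - x_1)} = \frac{f(x_1, x_2) - f(x_0, x_1)}{x_2 - x_0} \tag{6.18} \]
And once again that result can be written in a more compact function form:
\[ f(x_i, x_{i+1}, x_{i+2}) = \frac{f(x_{i+1}, x_{i+2}) - f(x_i, x_{i+1})}{x_{i+2} - x_i} \tag{6.19} \]
\[ c_2 = f(x_0, x_1, x_2) \tag{6.20} \]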
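A general rule should be apparent from these examples. For any new point
(x_k, y_k) added to the interpolation set, a new function can be written as:
\[ f(x_i, x_{i+1}, \ldots, x_{i+k}) = \frac{f(x_{i+1}, \ldots, x_{i+k}) - f(x_i, \ldots, x_{i+k-1})}{x_{i+k} - x_i} \tag{6.21} \]
and the coefficient of the new term added to the polynomial is the evaluation of that
new function from x_0 to x_k:
\[ c_k = f(x_0, x_1, \ldots, x_k) \tag{6.22} \]
Substituting these coefficients into Eq. (6.11) gives:
\[ y = f(x) = f(x_0) + f(x_0, x_1)(x - x_0) + f(x_0, x_1, x_2)(x - x_0)(x - x_1) + \cdots + f(x_0, \ldots, x_{n-1})(x - x_0)(x - x_1) \cdots (x - x_{n-2}) \tag{6.23} \]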
One thing that should be evident from the examples and from the general
formula of Eq. (6.21) is that calculating one level of the function f(x_i, ..., x_{i+k})
requires knowledge of the previous level of the function f(x_i, ..., x_{i+k-1}) and,
recursively, knowledge of all previous levels of the function down to f(x_i). There is in
fact a simple method of systematically computing all these values, by building what
is called a table of divided differences. One such table combining the information of
the sample computations from Eqs. (6.12) to (6.20) is given in Table 6.1. Each
column of this table is filled in by computing one level of the function f. The first
column simply contains the measurement values x_i, and the second column the
corresponding values f(x_i). The third column then has the values f(x_i, x_{i+1}), which are
computed from the first two columns. Moreover, following Eq. (6.15), each individual
value is computed by subtracting the two immediately adjacent values in the
previous column, divided by the difference between the last and first values of x_i
that the new entry covers. That column will also have one less value than the previous
one, since there are fewer combinations possible at that level. The fourth column has
the values of f(x_i, x_{i+1}, x_{i+2}), which are computed from the third and first columns.
Once again, each individual value of the new column is computed by subtracting the two
immediately adjacent values in the previous column, divided by the difference between
the last and first values of x_i covered, as per Eq. (6.19). And once again, there will be
one less value in the new column than there was in the previous one. This process goes
on until the last column has only one value. The coefficients of the polynomial in
Eq. (6.23) are immediately available in the final table, as the first value of each column.

Table 6.1
x_i    f(x_i)    f(x_i, x_{i+1})    f(x_i, x_{i+1}, x_{i+2})
x_0    f(x_0)    f(x_0, x_1)        f(x_0, x_1, x_2)
x_1    f(x_1)    f(x_1, x_2)
x_2    f(x_2)
The pseudocode for an algorithm to compute the table of divided differences is
presented in Fig. 6.3. This algorithm will return the coefficients needed to build a
polynomial of the form of Eq. (6.23). An additional step would be needed to recover
coefficients for a simpler but equivalent polynomial of the form of Eq. (6.1); this
step would be to multiply the Newton coefficients with subsets of x coordinates of
the interpolation points and adding all products of the same degree together. This
additional step is not included here.
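As a rough illustration of the table-building procedure, the sketch below computes the Newton coefficients column by column. It is not a transcription of Fig. 6.3; the function names `newton_coefficients` and `newton_eval` are hypothetical, and only the current column of the table is kept in memory.

```python
def newton_coefficients(xs, ys):
    """Return [f(x0), f(x0,x1), ..., f(x0,...,x_{n-1})], the first entry of each column."""
    n = len(xs)
    column = list(ys)                 # second column of the table: f(x_i)
    coeffs = [column[0]]
    for level in range(1, n):
        # Eq. (6.21): adjacent values of the previous column, divided by the
        # difference between the last and first x covered by the new entry.
        column = [
            (column[i + 1] - column[i]) / (xs[i + level] - xs[i])
            for i in range(n - level)
        ]
        coeffs.append(column[0])
    return coeffs

def newton_eval(xs, coeffs, x):
    """Evaluate the polynomial of Eq. (6.23) by nested multiplication."""
    result = coeffs[-1]
    for k in range(len(coeffs) - 2, -1, -1):
        result = result * (x - xs[k]) + coeffs[k]
    return result

xs, ys = [0, 1, 2, 3], [1, 2, 9, 28]
coeffs = newton_coefficients(xs, ys)
print(coeffs, newton_eval(xs, coeffs, 1.5))
```

Evaluating the Newton form directly, as `newton_eval` does, avoids the extra step of converting the coefficients to the form of Eq. (6.1).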
Example 6.5
Four measurements of an electrical system were taken. At time 0 s the output
is 1 V, at time 1 s it is 2 V, at time 2 s it is 9 V, and at time 3 s it is 28 V. Find a
mathematical model for this system.
Solution
There are four 2D points: (0,1), (1,2), (2,9), and (3,28). Build the table of
divided differences. The first two columns are immediately available.
x_i    f(x_i)    f(x_i, x_{i+1})    f(x_i, x_{i+1}, x_{i+2})
0      1
1      2
2      9
3      28
Values in the third column are computed using Eq. (6.15), combining
values from the previous two columns. Then, values in the fourth column will
be computed using Eq. (6.19) and the values of the third column and the first
column.
x_i    f(x_i)    f(x_i, x_{i+1})    f(x_i, x_{i+1}, x_{i+2})
0      1         1                  3
1      2         7                  6
2      9         19
3      28
The last column has only one value to compute, and the values needed to compute it
are the two values in column four and the largest and smallest values of x_i. This
completes the table:
x_i    f(x_i)    f(x_i, x_{i+1})    f(x_i, x_{i+1}, x_{i+2})    f(x_i, ..., x_{i+3})
0      1         1                  3                           1
1      2         7                  6
2      9         19
3      28
Finally, the polynomial of Eq. (6.23) can be constructed by using the first
entry of each column as the coefficients.
\[ y = f(x) = 1 + 1(x - 0) + 3(x - 0)(x - 1) + 1(x - 0)(x - 1)(x - 2) = x^3 + 1 \]
Again, the final simplified polynomial is the same one that was computed in
Examples 6.1 and 6.4.
As explained previously, a major advantage of Newton polynomials is that it is
possible to add points into the interpolation set without recomputing the entire
interpolation, but simply by adding higher-order terms to the existing polynomial.
In practice, this is done by appending the new points to the existing table of divided
differences and adding columns as needed to generate more coefficients.
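One way to append a point without rebuilding the table is to keep only its bottom diagonal, since every new entry depends on the entry to its left and the entry above that one. The sketch below illustrates the idea under that assumption; the function name `add_point` and the variable `bottom` are hypothetical and not taken from the text.

```python
def add_point(xs, bottom, x_new, y_new):
    """Append one point and return the new bottom diagonal of the table.

    bottom[k] is the divided difference of order k that ends at the last
    point already in xs, i.e. f(x_{n-1-k}, ..., x_{n-1}).
    """
    new_bottom = [y_new]
    for k in range(1, len(xs) + 1):
        # Eq. (6.21) applied along the new bottom row of the table.
        diff = (new_bottom[k - 1] - bottom[k - 1]) / (x_new - xs[-k])
        new_bottom.append(diff)
    xs.append(x_new)
    return new_bottom

xs, bottom, coeffs = [], [], []
for x, y in [(0, 1), (1, 2), (2, 9), (3, 28), (5, 54)]:
    bottom = add_point(xs, bottom, x, y)
    coeffs.append(bottom[-1])         # newest Newton coefficient c_n

print(coeffs)
```

Each call costs a number of operations proportional to the number of points already present, and the coefficients computed earlier are left untouched.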
Example 6.6
A fifth measurement of the electrical system of Example 6.5 has been taken.
At time 5 s, the measurement is 54 V. Update the mathematical model for this
system.
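x_i    f(x_i)    f(x_i, x_{i+1})    f(x_i, x_{i+1}, x_{i+2})    f(x_i, ..., x_{i+3})    f(x_i, ..., x_{i+4})
0      1         1                  3                           1                       -3/5
1      2         7                  6                           -2
2      9         19                 -2
3      28        13
5      54
[Figure: the five measurements and the updated model, Voltage versus Time]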
6.6
\[ p(x) = f(x) + E(x) \tag{6.24} \]
\[ c_i = \frac{p^{(i)}(x)}{i!} \tag{6.25} \]
Consequently, the error term will be the nth-order term of the series evaluated at a
point x in the interval [x_0, x_{n-1}]:
\[ E(x) = \frac{p^{(n)}(x)}{n!} (x - x_0)(x - x_1) \cdots (x - x_{n-1}) \tag{6.26} \]
Unfortunately, Eq. (6.26) cannot be used to compute the error term, for the same
reason Eq. (6.25) could not be used to compute the coefficients: the polynomial p(x)
is unknown. It is, after all, the very polynomial that is being modelled by interpolation. However, an alternative is immediately available from Eq. (6.25): using the
coefficient cn, which can be computed from Eqs. (6.21) and (6.22). The error term
then becomes:
\[ E(x) = f(x_0, x_1, \ldots, x_{n-1}, \hat{x}) (x - x_0)(x - x_1) \cdots (x - x_{n-1}) \tag{6.27} \]
The coefficient of the error term can thus be computed using Newton polynomials
and the table of divided differences learnt in the previous section, provided an
additional point \hat{x} not used in the interpolation is available.
It is worth noting that, while the development of the error term above uses
explicitly Newton polynomials, the error term will be the same for any interpolation
method, including Vandermonde and Lagrange polynomials. It is also worth
remembering again that this error term is only valid within the interpolation
interval.
Example 6.7
Given the interpolated model of the electrical system from Example 6.5,
estimate the modelling error on a point computed at time 2.5 s. Use the
additional measurement of 4 V at 1.5 s to compute the coefficient.
x_i    f(x_i)    f(x_i, x_{i+1})    f(x_i, x_{i+1}, x_{i+2})    f(x_i, ..., x_{i+3})    f(x_i, ..., x_{i+4})
0      1         1                  3                           1                       -2/3
1      2         7                  6                           0
2      9         19                 6
3      28        16
1.5    4
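The computation described by Eq. (6.27) can be sketched as follows for the data of Example 6.7. This is an illustration under stated assumptions rather than the book's worked solution: the function name `divided_difference` is hypothetical, and the error is taken as the highest-order divided difference multiplied by the product of (x - x_i) over the four interpolation points.

```python
from math import prod

def divided_difference(xs, ys):
    """Highest-order divided difference f(x_0, ..., x_{m-1}) of the given points."""
    column = list(ys)
    for level in range(1, len(xs)):
        column = [
            (column[i + 1] - column[i]) / (xs[i + level] - xs[i])
            for i in range(len(xs) - level)
        ]
    return column[0]

xs, ys = [0, 1, 2, 3], [1, 2, 9, 28]      # points used in the interpolation
extra_x, extra_y = 1.5, 4                 # additional measurement
coefficient = divided_difference(xs + [extra_x], ys + [extra_y])

x_eval = 2.5                              # where the error is estimated
error = coefficient * prod(x_eval - xi for xi in xs)
print(coefficient, error)
```

The additional point only enters through the coefficient; the product runs over the points that the interpolated model actually passes through.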
6.7 Linear Regression
\[ SSE = \sum_{i=0}^{n-1} \left( y_i - f(x_i) \right)^2 \tag{6.28} \]
The best polynomial that can be regressed is the one that minimizes the value of
the SSE.
6.8
The method of least squares computes the polynomial that minimizes the SSE
through a formal mathematical development. To understand the development,
consider the easiest case of a simple linear regression. In that case, Eq. (6.28)
becomes:
\[ SSE = \sum_{i=0}^{n-1} \left( y_i - c_0 - c_1 x_i \right)^2 \tag{6.29} \]
The method is looking for the polynomial, or the values of c0 and c1, that minimize
the SSE. The minimum for each coefficient is found by computing the partial
derivative of the equation with respect to that coefficient and setting it equal to 0:
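\[ \frac{\partial SSE}{\partial c_0} = -2 \sum_{i=0}^{n-1} \left( y_i - c_0 - c_1 x_i \right) = 0 \tag{6.30} \]
\[ \frac{\partial SSE}{\partial c_1} = -2 \sum_{i=0}^{n-1} \left( y_i - c_0 - c_1 x_i \right) x_i = 0 \tag{6.31} \]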
The problem is now reduced to a system of two equations with two unknown
variables to solve together, which is trivial to do. It can be done by isolating c0 and
c1 in Eqs. (6.30) and (6.31) (note that the coefficients multiply the summation), or
by writing the equations into an Mx = b form and solving the system using one of
the decomposition techniques from Chap. 4:
\[
\begin{bmatrix} n & \sum_{i=0}^{n-1} x_i \\ \sum_{i=0}^{n-1} x_i & \sum_{i=0}^{n-1} x_i^2 \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \end{bmatrix}
=
\begin{bmatrix} \sum_{i=0}^{n-1} y_i \\ \sum_{i=0}^{n-1} y_i x_i \end{bmatrix}
\tag{6.32}
\]
The method of least squares can be applied to other cases of linear regression, to
discover higher-order polynomials for the model. The only downside is that each
additional term and coefficient in the polynomial to regress requires computing one
more derivative and handling one more equation.
Example 6.8
An electrical system is measured at every second, starting at time 1 s, using
noisy equipment. At time 1 s the initial output is 0.5 V, and the following
measurements are 1.7 V at time 2 s, then 1.4, 2.8, 2.3, 3.6, 2.7, 4.1, 3.0 V, and
finally 4.9 V at time 10 s. Find a linear model for this system.
Solution
Compute a simple linear regression by filling in the values into the matrix
vector system of Eq. (6.32):
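\[
\begin{bmatrix} 10 & 55 \\ 55 & 385 \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \end{bmatrix}
=
\begin{bmatrix} 27.0 \\ 180.1 \end{bmatrix}
\]
Then solve the system to find c_0 = 0.59 and c_1 = 0.38. This means the model is:
\[ y = f(x) = 0.59 + 0.38x \]
And the model's SSE, computed using Eq. (6.29), is 3.30.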
Graphically, the measurements and the regressed line are presented below.
It can be seen that the data points are lined up in two uneven linear sets. While
the model does not actually go through any of the points, it is nonetheless the
best approximation, as it goes roughly in-between the two sets, a bit closer to
the larger one. Since the errors are squared in the SSE, attempting to reduce
the error by moving the line closer to one of the sets of points would cause a
much larger increase from the error to the other set.
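The numbers of Example 6.8 can be reproduced with a short script. This is a minimal sketch, with hypothetical function names `simple_linear_regression` and `sse`, that solves the two-by-two system of Eq. (6.32) with Cramer's rule instead of a decomposition technique.

```python
def simple_linear_regression(xs, ys):
    """Solve the 2x2 system of Eq. (6.32) for c0 and c1 using Cramer's rule."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    c0 = (sy * sxx - sx * sxy) / det
    c1 = (n * sxy - sx * sy) / det
    return c0, c1

def sse(xs, ys, c0, c1):
    """Sum of squared errors of Eq. (6.29)."""
    return sum((y - c0 - c1 * x) ** 2 for x, y in zip(xs, ys))

xs = list(range(1, 11))
ys = [0.5, 1.7, 1.4, 2.8, 2.3, 3.6, 2.7, 4.1, 3.0, 4.9]
c0, c1 = simple_linear_regression(xs, ys)
print(round(c0, 2), round(c1, 2), round(sse(xs, ys, c0, c1), 2))  # 0.59 0.38 3.3
```

Nudging either coefficient away from the computed value and calling `sse` again gives a larger result, which is a quick way to see that the solution is the minimum.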
6.9 Vandermonde Method
The Vandermonde method learned for interpolation in Sect. 6.3 can be used for
linear regression as well. Much like before, this is done first by writing out the
polynomial of the model, then filling in a Vc = y system using the values of the
measurements, and finally solving for the coefficient vector. The main difference
with the previous version of the method is that there are a lot more points than
coefficients, so V is no longer square and the matrix-vector system cannot be solved
directly. This is resolved by multiplying both sides of the system by the transpose
of V. The correct system for regression is thus:
\[ V^T V c = V^T y \tag{6.33} \]
It is worth noting that the Vandermonde method is equivalent to the method of least
squares. The multiplications V^T V and V^T y yield the values computed by deriving
and expanding the SSE equations. The main advantage of the Vandermonde
method is its simplicity. The matrix V and vector y are straightforward to build
from the observations without having to derive equations or remember sets of
summations, then the system can be built from two multiplications only.
An important benefit of having a fast and simple way to compute regressions is
to make it possible to easily compute multiple regressions of a set of measurements.
This is a benefit when the degree of the polynomial required to model a system is
unknown, and must be discovered through trial-and-error. Generating multiple
regressions at different degrees and finding which one gives the best trade-off
between low SSE and simplicity is one modelling approach that can work when
there is no other information available.
Example 6.9
The following measurements of an electrical system were taken with noisy
equipment. At time 1 s the output is -0.3 V, at time 2 s it is -0.2 V, at time 3 s
it is 0.5 V, at time 4 s it is 2.0 V, at time 5 s it is 4.0 V, at time 6 s it is 6.0 V, at
time 7 s it is 9.0 V, at time 8 s it is 13.0 V, at time 9 s it is 17.0 V, and at time
10 s it is 22.0 V. Find a model for this system.
Solution
Since the degree of the model is unknown, use a trial-and-error approach to
find the correct one. Begin by computing three regressions for polynomials of
degree 1, 2, and 3, to see if one of those can approximate the data well
enough. If none of them are appropriate, higher-degree regressions might be
required. The three polynomials are:
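\[ f_1(x) = c_0 + c_1 x \]
\[ f_2(x) = c_0 + c_1 x + c_2 x^2 \]
\[ f_3(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 \]
The corresponding Vandermonde systems are:
\[ \mathbf{V}_1^T \mathbf{V}_1 \mathbf{c}_1 = \mathbf{V}_1^T \mathbf{y} \]
\[ \mathbf{V}_2^T \mathbf{V}_2 \mathbf{c}_2 = \mathbf{V}_2^T \mathbf{y} \]
\[ \mathbf{V}_3^T \mathbf{V}_3 \mathbf{c}_3 = \mathbf{V}_3^T \mathbf{y} \]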
where the matrices Vi and the vectors of coefficients ci will have two, three, or
four columns or rows, respectively, depending on the polynomial being
computed. Expanded, the seven vectors and matrices used in the above
equations are:
\[
\mathbf{V}_1 = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \\ 1 & 7 \\ 1 & 8 \\ 1 & 9 \\ 1 & 10 \end{bmatrix}
\quad
\mathbf{V}_2 = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \\ 1 & 6 & 36 \\ 1 & 7 & 49 \\ 1 & 8 & 64 \\ 1 & 9 & 81 \\ 1 & 10 & 100 \end{bmatrix}
\quad
\mathbf{V}_3 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \\ 1 & 5 & 25 & 125 \\ 1 & 6 & 36 & 216 \\ 1 & 7 & 49 & 343 \\ 1 & 8 & 64 & 512 \\ 1 & 9 & 81 & 729 \\ 1 & 10 & 100 & 1000 \end{bmatrix}
\quad
\mathbf{y} = \begin{bmatrix} -0.3 \\ -0.2 \\ 0.5 \\ 2 \\ 4 \\ 6 \\ 9 \\ 13 \\ 17 \\ 22 \end{bmatrix}
\]
\[
\mathbf{c}_1 = \begin{bmatrix} c_0 \\ c_1 \end{bmatrix}
\quad
\mathbf{c}_2 = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}
\quad
\mathbf{c}_3 = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}
\]
Solving each of the three matrix-vector systems finds the coefficients of the
corresponding polynomial. Those polynomials are:
\[ f_1(x) = -6.25 + 2.46x \]
\[ f_2(x) = 0.19 - 0.76x + 0.29x^2 \]
\[ f_3(x) = -0.10 - 0.50x + 0.24x^2 + 0.003x^3 \]
The final challenge is to decide which of these three polynomials, if any, is
the best approximation of the system, to use in a model. To make this
decision, consider the SSE values. For f1(x) it is 45.70, for f2(x) it is 0.23,
and for f3(x) it is 0.19. These error values clearly indicate that a polynomial of
degree 1 is a very wrong approximation of the data, while a polynomial of
degree 3 gives very little improvement compared to the one of degree 2. The
polynomial of degree 2 is the best model in this situation.
Alternatively, looking at the situation graphically can help shed some light
on it. The data points are presented in the following figure, along with the
approximations f_1(x) in solid red, f_2(x) in dashed red, and f_3(x) in dashed
brown. A visual inspection makes it clear that the measurements are following
a parabolic curve and that the straight-line regression is a very poor approximation.
Meanwhile, the degree-3 approximation overlaps very much with
the degree-2 approximation and does not offer a better approximation.
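The trial-and-error procedure of Example 6.9 lends itself to a short loop over candidate degrees. The sketch below is illustrative only: the helper names `solve`, `polyfit_normal` and `sse` are hypothetical, the normal equations of Eq. (6.33) are solved by Gaussian elimination with partial pivoting, and the first two voltages are assumed to be negative (-0.3 V and -0.2 V), which is the reading consistent with the fitted coefficients.

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting, followed by back substitution."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def polyfit_normal(xs, ys, degree):
    """Least-squares polynomial coefficients from V^T V c = V^T y."""
    v = [[x**j for j in range(degree + 1)] for x in xs]
    cols = range(degree + 1)
    vtv = [[sum(row[i] * row[j] for row in v) for j in cols] for i in cols]
    vty = [sum(row[i] * y for row, y in zip(v, ys)) for i in cols]
    return solve(vtv, vty)

def sse(xs, ys, c):
    """Sum of squared errors of the polynomial with coefficients c."""
    return sum((y - sum(ck * x**k for k, ck in enumerate(c))) ** 2
               for x, y in zip(xs, ys))

xs = list(range(1, 11))
ys = [-0.3, -0.2, 0.5, 2.0, 4.0, 6.0, 9.0, 13.0, 17.0, 22.0]  # signs assumed, see above
fits = {d: polyfit_normal(xs, ys, d) for d in (1, 2, 3)}
for d, c in fits.items():
    print(d, [round(ck, 3) for ck in c], round(sse(xs, ys, c), 2))
```

Comparing the printed SSE values shows a large drop from degree 1 to degree 2 and only a marginal one from degree 2 to degree 3, which is the trade-off the example uses to select the model.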
6.9.1
In interpolation, one of the major advantages of the Vandermonde method was that
it made it possible to model multivariate cases and multidimensional problems
easily. This is also true when the method is used for linear regression. Moreover, it
is done in the same way, by computing the values of the matrix V using terms of a
multivariate polynomial and solving the system to get the coefficients.
Example 6.10
The shape of a ski slope needs to be modelled. The elevation of various points
on the slope has been measured, along with their GPS coordinates. Defining
the low end of the ski lift at GPS coordinates (1,3) as elevation 0 m, the points
measured are: (0,0) 5 m, (2,1) 10 m, (2.5,2) 9 m, (4,6) 3 m, and (7,2) 27 m.
Knowing that the ski slope can be approximated as a plane, find the best
model for it.
Solution
A plane is simply a linear polynomial in 3D, and its equation is:
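\[ z = f(x, y) = c_0 + c_1 x + c_2 y \]
The system to solve is V^T V c = V^T z, where the matrix V will contain the
values multiplying each coefficient, namely 1, x, and y respectively. The
values of the three variables in the system are:
\[
V = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 2 & 1 \\ 1 & 2.5 & 2 \\ 1 & 1 & 3 \\ 1 & 4 & 6 \\ 1 & 7 & 2 \end{bmatrix}
\quad
c = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}
\quad
z = \begin{bmatrix} 5 \\ 10 \\ 9 \\ 0 \\ 3 \\ 27 \end{bmatrix}
\]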
6.10 Transformations
The two regression methods seen so far are used specifically for linear regressions.
However, many systems in engineering practice are not linear, but instead are
logarithmic or exponential. Such systems cannot be modelled accurately by a linear
polynomial, regardless of the order of the polynomial used. This is the case, for
example, of models of population growth, and of capacitor charges in
resistor-capacitor (RC) circuits.
In such a situation, the solution is to compute a transformation of the nonlinear
equation into a linear one, compute the linear regression in that form to find the best
approximation, then to reverse the transformation to find the real model. The
transformation is whatever operation is needed to turn the polynomial into a linear
function. For example, if the function is logarithmic, the transformation is to take its
exponential, and the reverse transformation is to take the logarithm of the model.
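As an illustration of the transform, regress, and reverse sequence, the sketch below fits an exponential model y = c0 exp(c1 x) by regressing ln(y) against x. The function name `fit_exponential` and the noise-free synthetic data are assumptions made for this sketch, and the approach requires every measurement y to be strictly positive.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = c0 * exp(c1 * x) through a simple linear regression on (x, ln y)."""
    logs = [math.log(y) for y in ys]          # transformation: ln(y) = ln(c0) + c1 x
    n = len(xs)
    sx, sl = sum(xs), sum(logs)
    sxx = sum(x * x for x in xs)
    sxl = sum(x * l for x, l in zip(xs, logs))
    det = n * sxx - sx * sx
    intercept = (sl * sxx - sx * sxl) / det   # estimate of ln(c0)
    slope = (n * sxl - sx * sl) / det         # estimate of c1
    return math.exp(intercept), slope         # reverse transformation on c0

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]   # synthetic data, not measurements
c0, c1 = fit_exponential(xs, ys)
print(c0, c1)
```

Note that the regression minimizes the SSE of ln(y) rather than of y itself, so with noisy data the result is the best fit in the transformed space.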
Example 6.11
The following measures are taken of a discharging capacitor in an RC circuit:
at time 0.25 s it registers 0.54 V, at 0.75 s it registers 0.25 V, at 1.25 s it
registers 0.11 V, at 1.75 s it registers 0.06 V, and at 2.25 s it registers 0.04 V.
Find the best model to approximate this capacitor.
Solution
Plotting the measurements graphically shows clearly that they follow an
exponential relationship of the form:
\[ y = f(x) = c_0 e^{c_1 x} \]
Such a function cannot be modelled using the linear regression tools seen so
far. However, transforming by taking the natural log of each side of the
equation yields a simple linear function:
\[ \ln(y) = \ln(c_0 e^{c_1 x}) = \ln(c_0) + \ln(e^{c_1 x}) = c_{0,\mathrm{transform}} + c_1 x \]
This simple linear regression problem can easily be solved using the method
of least squares or the Vandermonde method to find the coefficients. The
linear equation is:
\[ \ln(y) = -0.25 - 1.66x \]
Finally, reverse the transformation by taking the exponential of each side of
the equation:
\[ y = e^{-0.25 - 1.66x} = e^{-0.25} e^{-1.66x} = 0.78 e^{-1.66x} \]
That equation models the measured data with an SSE of 0.002. The data and
the modelling exponential are illustrated below.
6.11
Linear regression error is different from the interpolation error computed previously in some major respects. Interpolation methods compute a polynomial that fits
the measured data exactly, and consequently constrain the error that can occur
in-between those measures, since the error must always drop back to zero at the
next interpolated measurement. Linear regression does not impose such a requirement; it computes a polynomial that approximates the measured data, and that
polynomial might not actually fit any of the measurements with zero error. Consequently, the error in-between the measures is not constrained. It is instead probabilistic: the values in-between the approximated measurements are probably near
the polynomial values (since it is the approximation with minimal SSE), but some
of them might be far away from it. In fact, the same holds true for the measurements
themselves. The situation could be understood visually by adding a third probability
dimension on top of the two dimensions of the data, as in Fig. 6.7. That figure shows
a polynomial y f(x) regressed from a set of points, and the probability of the
position of measurements in the XY plane is illustrated in the third dimension. The
error of the measurement is thus a normal distribution with the mean at the
polynomial, which is the position with least error, and some standard deviation
of unknown value.
Fig. 6.7 Linear regression on the x- and y-axes, with the probability of the measurements on top

Table 6.2
Range of s around f(x)    Confidence interval of observations
±1.00s                    0.6826895
±1.28s                    0.8000000
±1.64s                    0.9000000
±1.96s                    0.9500000
±2.00s                    0.9544997
±2.58s                    0.9900000
±2.81s                    0.9950000
±3.00s                    0.9973002
±3.29s                    0.9990000
±4.00s                    0.9999366
±5.00s                    0.9999994
The standard deviation may not be known, but given the set of measurement
points it can be approximated as the sample standard deviation s:
$$ s = \sqrt{\frac{1}{n-1} \sum_{i=0}^{n-1} \left( y_i - f(x_i) \right)^2} \tag{6.34} $$
This in turn makes it possible to compute the confidence interval (CI) of the
approximation, or the area around the regressed polynomial within which the measurements
are likely to be found with a given probability. For a normal distribution, these
intervals are well known: 68.3 % of the observed measurements y will be within 1
sample standard deviation of the regressed f(x), 95.4 % of the measurements will
be within 2s, and 99.7 % of the measurements will be within 3s of f(x). These
intervals are also called the 0.683 CI, the 0.954 CI, and the 0.997 CI. Table 6.2 lists
other common relationships between s and CI.
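As a minimal sketch of Eq. (6.34) and Table 6.2, the Python code below fits a straight line by simple linear regression, computes the sample standard deviation s of the residuals, and returns the half-width of the band around f(x) for a chosen CI. The function names and the small dictionary of multipliers (copied from a few rows of Table 6.2) are illustrative assumptions, not part of the original text.

```python
import math

# Multipliers of s for a few confidence intervals, copied from Table 6.2.
S_MULTIPLIER = {0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.99: 2.58}

def simple_linear_regression(xs, ys):
    """Return (c0, c1) of the least-squares line f(x) = c0 + c1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1

def sample_std(xs, ys, f):
    """Sample standard deviation of the residuals, Eq. (6.34)."""
    n = len(xs)
    sse = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 1))

def confidence_half_width(xs, ys, ci):
    """Half-width of the band around the regressed line for a given CI."""
    c0, c1 = simple_linear_regression(xs, ys)
    s = sample_std(xs, ys, lambda x: c0 + c1 * x)
    return S_MULTIPLIER[ci] * s
```

Calling `confidence_half_width(xs, ys, 0.80)` on a list of measurements returns the distance 1.28s on either side of the regressed line inside which roughly 80 % of the measurements are expected to fall.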
Example 6.12
The following measurements of an electrical system were taken with noisy
equipment. At time 1 s the output is 2.6228 V, at time 2 s it is 2.9125 V, at
time 3 s it is 3.1390 V, at time 4 s it is 4.2952 V, at time 5 s it is 4.9918 V,
at time 6 s it is 4.6468 V, at time 7 s it is 5.4008 V, at time 8 s it is 6.3853 V, at
time 9 s it is 6.7494 V, and at time 10 s it is 7.3864 V. Perform a simple linear
regression to find a model of the system, and compute the 0.8 CI.
6.12 Extrapolation
The two techniques seen so far, interpolation and regression, have in common that
they take in a set of measurements from x0 to xn−1 and compute a model to represent
the system within the interval covered by those n measurements in order to predict
the value of new measurements with a predictable error. The model in question is
however not valid outside of that interval, and if used beyond those bounds it could
lead to a massive misrepresentation of reality. The problem is illustrated graphically in Fig. 6.8, in the case of the interpolation of three points. The degree-2
polynomial interpolated (the solid red parabola in the figure) fits the measurements
perfectly and is a good, low-error approximation of the real system (the blue line) in
that interval. However, the real system is a degree-4 polynomial, and as a result,
outside the interpolation region of the three measurements, the polynomial quickly
becomes an inaccurate and high-error approximation of the system (the dashed red
line), especially after the inflection points of the system that are not part of the model.
Nonetheless, being able to model and predict the values of a system beyond the
confines of a measured interval is a common problem in engineering. It must be
done, for example, in order to predict the future behavior of a natural system, in order
to design structures that can withstand the likely natural conditions and variations
they will be subjected to. It is also necessary to reconstruct historical data that has
been lost or was never measured, for example to analyze the failure of a system after
the fact and understand the conditions that caused it to go wrong. This challenge, of
modelling a system beyond the limits of the measurements, is called extrapolation.
Performing an accurate extrapolation requires more information than interpolation and linear regression. Most notably, it requires knowledge of the nature of the
system being modelled, and of the degree of the polynomial that can represent it.
With that additional information, it becomes possible to compute a model that will
have the correct number of inflection points and will avoid the error illustrated in
Fig. 6.8. Then, by performing a linear regression over the set of measurements, it is
possible to find the best polynomial of the required degree to approximate the data.
That polynomial will also give the best extrapolation values.
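The procedure described above can be sketched in Python as follows: fit a polynomial of the known degree by least squares, then evaluate it outside the measured interval. This is only an illustration under stated assumptions; the helper names `fit_polynomial` and `evaluate` are hypothetical, and the fit solves the normal equations with a plain Gaussian elimination rather than any particular method from the text.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [c0, c1, ..., c_degree] via the normal equations."""
    m = degree + 1
    # Augmented normal-equation system (V^T V | V^T y) built from power sums.
    a = [[sum(x ** (r + c) for x in xs) for c in range(m)] +
         [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):
        known = sum(a[r][c] * coeffs[c] for c in range(r + 1, m))
        coeffs[r] = (a[r][m] - known) / a[r][r]
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the polynomial at x with Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result
```

Given measurements of a system known to be quadratic, `evaluate(fit_polynomial(xs, ys, 2), x)` with an x beyond the last measurement gives the extrapolated value. Passing a degree that does not match the real system reproduces the kind of failure shown in Fig. 6.8.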
Example 6.13
The following input/output measurements of a system were recorded:
(0.73507, 0.17716), (0.58236, 0.13734), (0.22868, 0.00741),
(0.24253, 0.00397), (0.27129, 0.01410), (0.31244, 0.08215),
(0.51378, 0.04926), (0.59861, 0.14643), (0.63754, 0.08751)
6.13 Summary
6.14 Exercises
1. Using the Vandermonde Method, find the polynomial which interpolates the
following set of measurements:
(a) (2,3), (5,7).
(b) (0,2), (1,6), (2,12).
(c) (2,21), (0,1), (1,0), (3, 74).
(d) (1,5), (2,7), (4,11), (6,15).
(e) (3.2,4.5), (1.5,0.5), (0.3,0.6), (0.7,1.2), (2.5,3.5).
(f) (1.3,0.51), (0.57,0.98), (0.33,1.2), (1.2,14), (2.1, 0.35), (0.36,0.52).
2. Must the x values be ordered from smallest to largest exponent for the
Vandermonde method to work?
3. Using the Vandermonde Method, find the polynomial of the form
f(x) = c1 sin(x) + c2 cos(x) which interpolates the following set of measurements: (0.3,0.7),
(1.9, 0.2).
4. Using the Vandermonde Method, find the polynomial of the form
f(x) = c0 + c1 sin(x) + c2 cos(x) which interpolates the following set of measurements:
(4,0.3), (5, 0.9), (6, 0.2).
5. Using Lagrange polynomials, find the polynomial which interpolates the following set of measurements:
(a)
(b)
(c)
(d)
(e)
(f)
(g)
6. Using Newton polynomials, find the polynomial which interpolates the following set of measurements:
(a) (2,3), (5,7).
(b) (2,2), (3,1), (5,2).
(c)
(d)
(e)
(f)
7. Must the x values be ordered from smallest to largest for the method to find
Newton polynomials to work?
8. Suppose you have computed the polynomial which interpolates the set of
measurements (1, 4), (3, 2), (4, 10), (5, 16) using the following table of
divided differences:
xi
f(xi)
f(xi,xi+1)
f(xi,xi+1,xi+2)
f(xi,xi+1,xi+2,xi+3)
3
3
10
5
12
2
3
6
5
16
Use this result to compute the polynomial which interpolates the set of
measurements (3,2), (4,10), (5,16), (7,34).
9. Using the Vandermonde Method, find the polynomial which interpolates the
following set of measurements:
(a)
(b)
(c)
(d)
(e)
10. Compute a simple linear regression using the following set of measurements:
(a) (1,0), (2,1), (3,1), (4,2).
(b) (0.282,0.685), (0.555,0.563), (0.089,0.733), (0.157,0.722), (0.357,0.662),
(0.572,0.588), (0.222,0.693), (0.800,0.530), (0.266,0.650), (0.056,0.713).
(c) (1, 2.6228), (2, 2.9125), (3, 3.1390), (4,4.2952), (5, 4.9918), (6, 4.6468),
(7,5.4008), (8, 6.3853), (9, 6.7494), (10, 7.3864).
(d) (0.350,2.909), (0.406,2.987), (0.597,3.259), (1.022,3.645), (1.357,4.212),
(1.507,4.295), (2.228,5.277),(2.475,5.574), (2.974,6.293), (2.975,6.259).
11. Consider the following set of measurements submitted for simple linear
regression:
(1, 2.6228), (2, 2.9125), (3, 3.1390), (4, 4.2952), (5, 4.9918),
(6, 4.6468), (7, 5.4008), (8, 63.853), (9, 6.7494), (10, 7.3864)
What would you consider to be problematic about it, and what would you
consider a reasonable solution?
12. Compute a linear regression for a quadratic polynomial using the following set
of measurements:
(a) (2,3), (1,1), (0,0), (1,1), (2,5).
(b) (1,0.5), (2,1.7), (3,3.4), (4,5.7), (5,8.4).
(c) (0,2.1), (1,7.7), (2,13.6), (3,27.2), (4,40.9), (5,61.1).
13. Compute a linear regression for an exponential polynomial using the following
set of measurements:
(a) (0.029,2.313), (0.098, 2.235), (0.213,2.094), (0.352,1.949), (0.376,1.924),
(0.393,1.907), (0.473,1.828), (0.639,1.674), (0.855,1.493), (0.909,1.451).
(b) (0.228,0.239), (0.266,0.196), (0.268,0.218), (0.345,0.173), (0.351,0.188),
(0.543,0.090), (0.667,0.057), (0.942,0.022), (0.959,0.026), (0.991,0.019).
(c) (0,0.71666), (1,0.42591), (2,0.25426), (3,0.15122), (4,0.08980), (5,0.05336),
(6,0.03179), (7,0.01889), (8,0.01123), (9,0.00666), (10,0.00396).
14. Using the following set of measurements:
(0,2.29), (1,1.89), (2,1.09), (3,0.23), (4, 0.80), (5, 1.56), (6, 2.18),
(7, 2.45), (8, 2.29), (9, 1.75), (10, 1.01)
compute a linear regression for a polynomial of the following form:
(a) f(x) = c1 sin(0.4x) + c2 cos(0.4x).
(b) f(x) = c0 + c1 sin(0.4x) + c2 cos(0.4x).
(c) Comparing both polynomials, what conclusion can you reach about the
constant term c0?
15. Compute the requested value given the following set of measurements,
knowing that the polynomial is linear:
(a) (1, 7), (0, 3), (1, 0), (2, 3), looking for x = 3.
(b) (0.3,0.80), (0.7,1.3), (1.2,2.0), (1.8,2.7), looking for x = 2.3.
(c) (0.01559,0.73138), (0.30748,0.91397), (0.31205,0.83918), (0.90105,1.05687),
(1.21687,1.18567), (1.47891,1.23277), (1.52135,1.25152), (3.25427,1.79252),
(3.42342, 1.85110), (3.84589,1.98475), looking for x = 4.5.
16. Compute the value at x = 3 given the following set of measurements, knowing
that the polynomial is of the form y(x) = c1 x^2: (2, 5), (1, 1), (0, 0), (1, 2),
(2, 4).
17. The following measurements come from the exponential decrease of a
discharging capacitor:
(1.5,1.11), (0.9,0.92), (0.7,0.85), (0.7,0.57), (1.2,0.49), (1.4,0.45)
At what time will the charge be half the value of the charge at time 0 s?
Chapter 7
Bracketing
7.1 Introduction
7.2
Possibly the simplest and most popular bracketing algorithm is the binary search
algorithm. Assume that a solution to a problem is needed; its value is unknown but
it is known to be somewhere between a lower bound value xL and an upper bound
value xU. The algorithm works by iteratively dividing the interval in half and
keeping the half that contains the solution. So, in the first iteration, the middle point
would be:
$$ x_i = \frac{x_U - x_L}{2} + x_L \tag{7.1} $$
and, assuming the solution is not exactly that point xi (which it will rarely if ever be
for any real-world problem except very simple ones), then one of the two intervals,
either [xL, xi] or [xi, xU], will contain the solution. In either case, the search interval
is now half the size it was before. In the next iteration, the remaining interval is
again divided in half, then again and again. As with any iterative algorithm in
Chap. 3, halting conditions must be defined, namely a maximum number of
iterations and a target relative error. The error can be defined as before, between
two successive middle points (using for example xi+1, the middle point of the
interval [xL, xi]):
$$ E_{rel} = \left| \frac{x_i - x_{i+1}}{x_{i+1}} \right| \tag{7.2} $$
The final result returned by the algorithm is not the solution to the problem, but an
interval that the solution is found in. The solution can be approximated as the
central point of that interval, with half the interval as absolute error:
$$ x_{i+1} \pm \frac{|x_i - x_L|}{2} \tag{7.3} $$
Note that this definition of the absolute error could be substituted into Eq. (7.2) as
well to compute the relative error:
$$ E_{rel} = \left| \frac{(x_i - x_L)/2}{x_{i+1}} \right| \tag{7.4} $$
The pseudocode for the binary search algorithm is presented in Fig. 7.1. This code
will serve as a foundation for the more sophisticated bracketing algorithms that will
be presented in later chapters. Note that it requires a call to a function
SolutionInInterval(XL,XU) which serves to determine if the solution is
between the lower and upper bounds given as parameters. This function cannot be
defined in pseudocode; it will necessarily be problem-specific. For example, in the
dictionary lookup problem of the previous section it will consist of comparing the
spelling of the target word to that of the bounds, while in more mathematical
problems it can require evaluating a function and comparing the result to that of
the bounds. Note as well that the first check in the IF uses the new point x as both
bounds; it will return true if that point is the exact solution.
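The algorithm described above can be sketched in Python as follows. This is not the pseudocode of Fig. 7.1 but a minimal illustration of the same steps; the callable `solution_in_interval` plays the role of SolutionInInterval(XL,XU), and the square-root problem used to exercise it is a hypothetical example chosen for simplicity.

```python
def binary_search(solution_in_interval, x_lower, x_upper,
                  max_iterations=100, target_rel_error=1e-6):
    """Return (approximation, absolute_error) for a solution in [x_lower, x_upper].

    solution_in_interval(a, b) is the problem-specific test: it must return
    True when the solution lies in [a, b].
    """
    for _ in range(max_iterations):
        x_mid = x_lower + (x_upper - x_lower) / 2          # middle point
        if solution_in_interval(x_mid, x_mid):             # exact solution found
            return x_mid, 0.0
        if solution_in_interval(x_lower, x_mid):
            x_upper = x_mid                                # keep the lower half
        else:
            x_lower = x_mid                                # keep the upper half
        half_width = (x_upper - x_lower) / 2               # absolute error
        centre = x_lower + half_width                      # next middle point
        if centre != 0 and abs(half_width / centre) < target_rel_error:
            return centre, half_width
    raise RuntimeError("maximum number of iterations reached")

# Illustrative problem: the square root of 2, bracketed between 0 and 2.
def sqrt2_in_interval(a, b):
    return a * a <= 2.0 <= b * b

root, abs_error = binary_search(sqrt2_in_interval, 0.0, 2.0)
```

The function returns the central point of the final interval together with half the interval width, which is the approximation and absolute error described in the text.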
7.3
Bracketing is by far the simplest and most intuitive of the mathematical tools
covered in this book. It is nonetheless quite powerful, since it makes it possible to
zoom in on the solution to a problem given no information other than a way to
evaluate bounds. The initial bounds for the algorithm can easily be selected by rule-of-thumb, rough estimation (or guesstimation), or visual inspection of the
problem.
However, this tool does have important limitations as well. It is less efficient
than the other four mathematical tools studied, both in terms of computational time
and in terms of accuracy. As a result, numerical methods that make use of
bracketing, despite being simpler, will also be the ones that converge to the solution
in the longest time and to the least degree of accuracy. Bracketing methods also
work best for one-dimensional y = f(x) problems and scale up very poorly into
n dimensions. Indeed, a system of n equations and n unknowns will require 2n
bounds to be properly bracketed.
7.4
Chapter 8
Root-Finding
8.1 Introduction
8:1
The voltage of the source $V_S$ is known to be 0.5 V, while Ohm's law says that the
voltage across the resistor is:
$$ V_R = RI \tag{8.2} $$
where R is also given in the circuit to be 10 Ω. The Shockley ideal diode equation
relates the current to the voltage across the diode, $V_D$, as:
$$ I = I_S \left( e^{\frac{V_D}{n V_T}} - 1 \right) \tag{8.3} $$
where $I_S$ is the saturation current, $V_T$ is the thermal voltage, and n is the ideality
factor. Assume these values have been measured to be $I_S = 8.3 \times 10^{-10}$ A,
$V_T = 0.7$ V and n = 2.
Moving $V_S$ to the left-hand side of Eq. (8.1) along with the other two voltage
values makes the model into a root-finding problem. Moreover, incorporating the
equations of $V_R$ and $V_D$ from Eqs. (8.2) and (8.3) respectively into Eq. (8.1) along
with the measured values given yields the following equation for the system:
$$ -0.5 + 1.4 \ln\left( 1.20482 \times 10^{9} I + 1 \right) + 10 I = 0 \tag{8.4} $$
The value of the current running through the circuit loop is thus the root of the
circuit's model. In the case of Eq. (8.4), that value is $3.562690102 \times 10^{-10}$ A.
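As a quick numerical check of this model, the Python sketch below rebuilds the loop equation from Eqs. (8.2) and (8.3) with the quoted component values and evaluates it at the stated root. The constant and function names are illustrative assumptions; the diode voltage is obtained by solving Eq. (8.3) for $V_D$.

```python
import math

V_S = 0.5       # source voltage (V)
R = 10.0        # resistance (ohm)
I_S = 8.3e-10   # saturation current (A)
V_T = 0.7       # thermal voltage (V), value quoted in the text
N = 2.0         # ideality factor

def circuit_model(current):
    """Sum of the voltages around the loop; it is zero at the operating current."""
    v_diode = N * V_T * math.log(current / I_S + 1.0)   # Eq. (8.3) solved for V_D
    v_resistor = R * current                             # Eq. (8.2)
    return v_diode + v_resistor - V_S

residual = circuit_model(3.562690102e-10)
```

The residual is very close to zero, which is consistent with that current being the root of the circuit's model.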
8.2 Bisection Method
8:5
Step 2: Once points on either side of the zero-crossing have been selected, the
bisection method iteratively tightens them by picking the point exactly in-between
the two bounds:
FUNCTION F(x)
RETURN evaluation of the target function at point x
END FUNCTION
$$ x_i = \frac{x_L + x_U}{2} \tag{8.6} $$
The function is then evaluated at that point. In some rare cases, the function will
evaluate to zero exactly, in which case the point xi is the root and the algorithm can
terminate. More generally however, the function will evaluate to a positive or
negative value, and the middle point will be on one or the other side of the root.
The middle point replaces the bound on the same side of the root as itself and
becomes the new bound on that side of the root. The interval is thus reduced by half
at each iteration, and the root remains bracketed between a lower and upper bound:
$$ [x_L, x_U] \rightarrow \begin{cases} [x_i, x_U] & \text{if } f(x_i) f(x_L) > 0 \\ [x_L, x_i] & \text{if } f(x_i) f(x_U) > 0 \end{cases} \tag{8.7} $$
Step 3: The iterative process continues until a halting condition is reached. One
halting condition mentioned in the previous step, albeit an unlikely one, is that a
middle point xi is found to be exactly the root. More usually, one of the two halting
conditions presented in Chap. 7 and Chap. 3 will apply: the algorithm will reach a
preset maximum number of iterations (failure condition) or some error metric, such
as the interval between the two brackets or the relative error on the point xi, will
become lower than some preset error value (success condition).
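The three steps above can be sketched in Python as follows. This is a minimal illustration rather than the pseudocode of Fig. 8.2; the function and parameter names are assumptions, and the test function is the modified circuit model of Example 8.1, written with the sign convention that makes it negative at 0 A and positive at 1 A.

```python
import math

def bisection(f, x_lower, x_upper, max_iterations=50, tolerance=1e-6):
    """Return (lower, upper, midpoints) after bracketing a root of f."""
    if f(x_lower) * f(x_upper) > 0:
        raise ValueError("the bounds must lie on either side of the zero-crossing")
    midpoints = []
    for _ in range(max_iterations):
        x_mid = (x_lower + x_upper) / 2            # Eq. (8.6)
        midpoints.append(x_mid)
        f_mid = f(x_mid)
        if f_mid == 0:                             # exact root (rare)
            return x_mid, x_mid, midpoints
        if f_mid * f(x_lower) > 0:                 # same side as the lower bound
            x_lower = x_mid
        else:                                      # same side as the upper bound
            x_upper = x_mid
        if x_upper - x_lower < tolerance:          # success condition
            break
    return x_lower, x_upper, midpoints

def circuit(current):
    return -0.5 + 1.4 * math.log(current + 1.0) + 0.1 * current

lower, upper, points = bisection(circuit, 0.0, 1.0, max_iterations=6)
```

With `max_iterations=6` the loop stops on the failure condition, and the returned bounds are the bracket that remains after six halvings of the initial interval.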
The absolute error on the approximation of the root xi is at most the width of the
current interval [xi−2, xi−1]. Moreover, since this interval is reduced at each
iteration, it follows that the error is also reduced at each iteration. More formally,
define h0 as the width of the initial interval, and the initial absolute error value:
$$ h_0 = |x_L - x_U| \tag{8.8} $$
At each iteration, the interval, and therefore the error, is reduced by a factor of
2 compared to its previous iteration value. If we assume a total number of
n iterations performed, the final interval and error value is:
$$ h_n = \frac{h_{n-1}}{2} = \frac{h_0}{2^n} \tag{8.9} $$
Equation (8.9) is the convergence rate of the algorithm to the solution, or the rate
the error decreases over the iterations: it is a linear algorithm with O(h). But the
equation also makes it possible to predict the number of iterations that will
be needed to converge to an acceptable solution. For example, if an initial interval
on a root is [0.7, 1.5] and a solution is required with an error of no more than $10^{-5}$,
then the algorithm will need to perform $\lceil \log_2(0.8/10^{-5}) \rceil = 17$ iterations.
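The iteration count above follows directly from Eq. (8.9) and can be checked with a few lines of Python. The helper name `iterations_needed` is an assumption made for this sketch.

```python
import math

def iterations_needed(initial_width, max_error):
    """Smallest n such that initial_width / 2**n <= max_error, from Eq. (8.9)."""
    return math.ceil(math.log2(initial_width / max_error))

n = iterations_needed(1.5 - 0.7, 1e-5)
```

For the interval [0.7, 1.5] and an error of no more than $10^{-5}$, the function returns 17, matching the value in the text.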
Example 8.1
Suppose a circuit represented by a modified version of Eq. (8.4) as follows:
$$ -0.5 + 1.4 \ln(I + 1) + 0.1 I = 0 $$
Perform six iterations of the bisection method to determine the value of the
current running through this circuit, knowing initially that it is somewhere
between 0 and 1 A.
Solution
First, evaluate the function at the given bounds. At I = 0 A the total voltage in
the system is −0.5 V, and at I = 1 A it is 0.5704 V.
The first middle point between the initial bounds is x1 = 0.5 A. Evaluating the
function at that value gives 0.1177 V. This is a positive evaluation, just like at
1 A; the middle value thus replaces this bound, and the new interval is [0, 0.5].
At the end of six iterations, the root is known to be in the interval [0.375,
0.3906]. It could be approximated as 0.3828 ± 0.0078 A. For reference, the
real root of the equation is 0.389977 A, so the approximation has a relative
error of only 1.84 %. To further illustrate this method, the function is plotted
in red in the figure below, with the five intervals of the root marked in
increasingly light color on the horizontal axis. It can be seen that the method
tries to keep the root as close as possible to the middle of each shrinking
interval.
[Figure: Voltage versus Current]
8.3
The false position method is an improvement of the bisection method. Like the
bisection method, it is a bracketing algorithm. However, it does not simply use the
middle point between the two bounds as a new bound, but instead interpolates a
degree-1 polynomial between the two bounds and uses the root of that straight line
as the new bound. This usually gives a better approximation of the root than blindly
using the middle point, especially when the function being modeled can accurately
be represented by a straight line, which will be the case as the interval around the
root gets smaller and smaller, as discussed in Chap. 5. To understand how this
works, consider the function plotted in blue in Fig. 8.3. Its brackets go from 6 to
8, and its root is at 6.3. The bisection method picks the middle point at each
iteration: the middle point in this iteration is at 7, which reduces the interval to [6,
7] and yields 6.5 ± 0.5 as an approximation of the root. On the other hand, the false
position method interpolates a straight line between the two bounds, shown in red in
Fig. 8.3, and uses the root of that interpolated polynomial at 6.45 to reduce the
interval to [6, 6.45]. This interval is already a lot better than the one obtained after
one iteration (or even two iterations) of the bisection method. After just one step,
the root is approximated to be 6.225 ± 0.225, a much better approximation than
the one obtained by the bisection method. Moreover, as can be seen in Fig. 8.3, the
function in that new interval is practically linear, which means that the root of the
polynomial interpolated in-between those two bounds will be practically the same
as the root of the real function.
The algorithm for the false position method is a three-step iterative process that
is very similar to the bisection method. In particular, the first step to set up the initial
bounds is exactly the same as before, to get two points on either side of the zero-crossing. In the second step, the method iteratively tightens the bounds by using the
root of the straight-line polynomial interpolated in-between the two bounds. Interpolating a straight line between two known points and then finding the zero of that
line is trivially easy, and in fact both operations can be done at once using
Eq. (8.10).
$$ x_i = x_U - \frac{f(x_U)(x_L - x_U)}{f(x_L) - f(x_U)} \tag{8.10} $$
The method then evaluates the function at the new point, f(xi), and substitutes xi for the bound on the same side of the zero-crossing. One distinctive feature
of the false position method is that it will usually focus on updating only one bound
of the interval. For a concave-down function such as the one in Fig. 8.3, the root of
the interpolated polynomial will usually be on the right-hand side of the real root
and only the right-hand bound will be updated. Conversely, in the
case of a concave-up function, the interpolated root will usually be on the left-hand
side of the real root and only the left-hand bound will be updated.
Finally, the iterative process terminates when one of three termination conditions is reached. Two of these conditions are exactly the same as for the bisection
method: the algorithm might generate a point xi that is exactly the root of the
function (success condition), or reach a preset maximum number of iterations
(failure condition). The third condition is that the root of the function is approximated to an acceptable preset error rate (success condition). However, this acceptable approximation is defined differently in this algorithm than it was in the
bisection algorithm or in most bracketing algorithms. Since usually only one
bound is updated, the relative error between two successive update values can be
used to measure the error, using the usual relative error value introduced in Chap. 1:
$$ E_i = \left| \frac{x_{i-1} - x_i}{x_i} \right| \tag{8.11} $$
The pseudocode for this method, in Fig. 8.4, will clearly be very similar to the
one for the bisection method, which was presented in Fig. 8.2, with only a different
formula to evaluate the new point.
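A minimal Python sketch of the method follows. It is not the pseudocode of Fig. 8.4; the names are assumptions, and the test function is the modified circuit model already used in Example 8.1, written so that it is negative at 0 A and positive at 1 A.

```python
import math

def false_position(f, x_lower, x_upper, max_iterations=50, target_rel_error=1e-6):
    """Return the list of interpolated roots generated by the false position method."""
    f_lower, f_upper = f(x_lower), f(x_upper)
    if f_lower * f_upper > 0:
        raise ValueError("the bounds must lie on either side of the zero-crossing")
    points = []
    for _ in range(max_iterations):
        # Root of the straight line through the two bounds, Eq. (8.10).
        x_new = x_upper - f_upper * (x_lower - x_upper) / (f_lower - f_upper)
        f_new = f(x_new)
        points.append(x_new)
        if f_new == 0:                               # exact root
            break
        if f_new * f_lower > 0:                      # same side as the lower bound
            x_lower, f_lower = x_new, f_new
        else:                                        # same side as the upper bound
            x_upper, f_upper = x_new, f_new
        if len(points) > 1 and x_new != 0:
            rel_error = abs((points[-2] - x_new) / x_new)   # Eq. (8.11)
            if rel_error < target_rel_error:
                break
    return points

def circuit(current):
    return -0.5 + 1.4 * math.log(current + 1.0) + 0.1 * current

points = false_position(circuit, 0.0, 1.0, max_iterations=6)
```

Compared with a bisection loop, only the line computing `x_new` changes, and the stopping test compares successive points instead of the bracket width.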
Example 8.2
Suppose a circuit represented by a modified version of Eq. (8.4) as follows:
$$ -0.5 + 1.4 \ln(I + 1) + 0.1 I = 0 $$
Perform six iterations of the false position method to determine the value of the
current running through this circuit, knowing initially that it is somewhere
between 0 and 1 A.
Solution
First, evaluate the function at the given bounds. At I = 0 A the total voltage in
the system is −0.5 V, and at I = 1 A, it is 0.5704 V.
Using Eq. (8.10), the first interpolated root is found to be at:
$$ x_1 = 1 - \frac{0.5704 (0 - 1)}{-0.5 - 0.5704} = 0.4671 \text{ A} $$
and evaluating the function using that current value gives a voltage of
0.0833 V. This is a positive evaluation, just like at 1 A; the interpolated value
thus replaces this bound, and the new interval is [0, 0.4671].
Evaluation (V): 0.0016, 0.0002, 0.000029, 0.000004
[Figure: the function plotted against Current]
8.3.1 Error Analysis
To simplify the error analysis, assume that one of the bounds is fixed. As explained
previously, that assumption will be correct for all except occasionally the first few
iterations of the false position method.
Let xr be the root, bracketed between a lower bound a0 and an upper bound b,
and assume that the bound b is fixed. Then, the change in the moving bound will
be proportional to the difference between the slope from (xr, 0) to (b, f(b)) and the
derivative $f^{(1)}(x_r)$. To visualize this, define the error between the root and
the moving bound $h_0 = |a_0 - x_r|$ and assume that the bound is sufficiently close to
the root that the first-order Taylor series approximation $f(a_0) \approx f^{(1)}(x_r) h_0$ holds.
In that case, the slope of the interpolated polynomial from (a0, f(a0)) to (b, f(b)) is
approximately equal to $f(b)/(b - x_r)$. This is shown in Fig. 8.5.
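After one iteration, the bound will move to a1, the root of the interpolated line
(in red), and the error will be reduced by the distance between a0 and a1, which can
be computed by simple trigonometry as indicated in Fig. 8.5. The resulting error is:
$$ h_1 = h_0 - \frac{f^{(1)}(x_r) h_0}{\frac{f(b)}{b - x_r}} = h_0 \left( 1 - \frac{f^{(1)}(x_r)(b - x_r)}{f(b)} \right) \tag{8.12} $$
Repeating the same reduction over n iterations gives:
$$ h_n = h_0 \left( 1 - \frac{f^{(1)}(x_r)(b - x_r)}{f(b)} \right)^n \tag{8.13} $$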
The error rate thus decreases linearly as a function of h, or O(h), just like for the
bisection method. This result seems to clash with the observation earlier, both in the
explanations and in Example 8.2, that the error of the false position method
decreases faster than that of the bisection method. In fact, Eq. (8.13) shows that
this method will not always converge faster than the bisection method, but only in
the specific case where the factor multiplying h0 is smaller than the factor of 1/2 of the bisection method, that is to say:
$$ \frac{1}{2} < \frac{f^{(1)}(x_r)(b - x_r)}{f(b)} \tag{8.14} $$
Since the root and the derivative are beyond one's ability to change, the only ways
to improve the convergence rate of the false position algorithm are to increase the
difference between b and xr, or to decrease the value of f(b). It can be seen from
Fig. 8.5 that the effect of these actions will be, respectively, to move the point
b further right on the x-axis or down nearer to the x-axis. Either option will cause the
root of the interpolated line between a0 and b to be nearer to the root.
Example 8.3
Compare and contrast the error of the bisection method and the false position
method from Examples 8.1 and 8.2.
Solution
The points computed in each of the six iterations are listed in the table
below, along with the relative error of each one compared to the real root at
0.389977 A.
Iteration    Bisection point (A)    False position point (A)
1            x1 = 0.5               x1 = 0.4671
2            x2 = 0.25              x2 = 0.4004
3            x3 = 0.375             x3 = 0.3914
4            x4 = 0.4375            x4 = 0.3902
5            x5 = 0.40625           x5 = 0.3900
6            x6 = 0.390625          x6 = 0.38998
It is clear that the false position method reduces the error a lot more
quickly than the bisection method: after three iterations the error achieved by
the false position method is comparable to that from six iterations of the bisection
method, and after the fifth iteration, the false position method has surpassed
by orders of magnitude the best error achieved by the bisection method.
But it is even more interesting to observe the progression of the error. The
relative error of points generated by the false position method always
decreases after each iteration, and does so almost perfectly linearly. By
contrast, the relative error of the points computed by the bisection method
zigzags: it decreases but then increases again between iterations 3 and 4, and
between iterations 5 and 6. This zigzag is another consequence of blindly
selecting the middle point of the interval at each iteration. When the real root
is near the center of the interval, the middle point selected by the bisection
method will have a low relative error, but when the root is nearer the edge of
the interval the middle point will have a high relative error. By contrast, the
false position method selects points intelligently by interpolating an approximation of the function and using the approximation's root, and thus is not
subject to these fluctuations. Moreover, as the interval becomes smaller after
each iteration, the approximation becomes more accurate and the interpolated
root is guaranteed to become closer to the real one.
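The almost perfectly linear decrease noted in this example can be compared with the factor predicted by Eq. (8.13). The Python sketch below is a hedged check, under the assumption that the analysis of Sect. 8.3.1 applies symmetrically when the fixed bound is the lower one (here 0 A, since only the upper bound moves in Example 8.2). The derivative function and all names are written for this illustration.

```python
import math

def circuit(current):
    return -0.5 + 1.4 * math.log(current + 1.0) + 0.1 * current

def circuit_derivative(current):
    return 1.4 / (current + 1.0) + 0.1

def false_position_factor(f, df, fixed_bound, root):
    """Factor multiplying the error at each iteration, from Eq. (8.13)."""
    return 1.0 - df(root) * (fixed_bound - root) / f(fixed_bound)

ROOT = 0.389977          # real root quoted in Example 8.1
factor = false_position_factor(circuit, circuit_derivative, 0.0, ROOT)

# Observed ratio of successive errors for the first two false position points.
observed = (0.4004 - ROOT) / (0.4671 - ROOT)
```

Both the predicted factor and the observed ratio come out well below the factor of 1/2 of the bisection method, which agrees with the faster convergence seen in the table.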
8.3.2 Nonlinear Functions
The false position method introduced an additional assumption that was not present
in the bisection method, namely that the function can be approximated within the
interval by a linear polynomial. This is of course not always the case, and it is
important to be mindful of it: when this assumption does not hold, the function
cannot be approximated well by a straight line, and consequently the root of the
straight line is a very poor approximation of the root and cannot be used to
effectively tighten the bounds.
To illustrate, an example of a highly nonlinear function is given in Fig. 8.6:
within the interval [0,1], the function looks like a straight horizontal line with a
sudden sharp vertical turn near the root at 0.96. The function is concave-up, and as
explained before the root of the interpolated polynomial falls on the left side of the
real root, and the left-hand bound is the one updated. And indeed, it can be seen in
that figure that a straight line interpolated from 0 to 1 will have a root at a point
before 0.96, and therefore the left bound will be updated. However, the interpolated
line's root will actually be at 0.04, very far from the real root at 0.96! This is a result
of the fact that the straight line from 0 to 1 is not at all a good approximation of the
concave function it is supposed to represent. Worse, the bounds will be updated to
[0.04,1] and the function in that interval will still be nonlinear, so in the next
iteration the false position method will again interpolate a poor approximation and
poorly update the bounds. In fact, it will take over 20 iterations for the false position
method to generate an approximation of the root of this function within the interval [0.9,1].
By contrast, the bisection method, by blindly cutting out half the interval at each
iteration, gets within that interval in four iterations.
Referring back to Eq. (8.14), it can be seen that the inequality necessary for the
false position method to outperform the bisection method does not hold in this
example. The difference between the root and the fixed bound is only 0.04, while
the first derivative at the root is almost 1 because of the discontinuity and the value
of f(b) is exactly 1, so the right-hand side of Eq. (8.14) evaluates to a result much smaller than 0.5.
Clearly, it is important to determine whether the function can be approximated
by a straight line interpolated between the bounds before starting the false position
method, otherwise many iterations will be wasted computing poor approximations
of the root. The solution to that problem is also hinted at by the end of the example:
switch to the bisection method, which works the same regardless of whether the
function is linear or nonlinear within the interval, for a few iterations, until the
interval has been reduced to a region where the function is closer to linear.
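The behaviour described in this section can be reproduced with a stand-in function. The Python sketch below does not use the function of Fig. 8.6, which is not given in the text; `steep` is a hypothetical function chosen so that it is nearly flat on [0, 1], has its root near 0.96, and gives a first interpolated root at 0.04. The code counts how many iterations each method needs before its lower bound enters [0.9, 1].

```python
def steep(x):
    """Hypothetical function: nearly flat on [0, 1] with a sharp rise near 0.96."""
    c = 1.0 / 24.0
    return (1.0 + c) * x ** 79 - c

def bisection_step(f, a, b):
    return (a + b) / 2

def false_position_step(f, a, b):
    return b - f(b) * (a - b) / (f(a) - f(b))

def iterations_until(lower_target, step, f, x_lower, x_upper, limit=1000):
    """Count iterations of a bracketing rule until the lower bound reaches lower_target."""
    for count in range(1, limit + 1):
        x_new = step(f, x_lower, x_upper)
        if f(x_new) * f(x_lower) > 0:     # new point is on the lower side
            x_lower = x_new
        else:                             # new point is on the upper side
            x_upper = x_new
        if x_lower >= lower_target:
            return count
    return limit

bisection_count = iterations_until(0.9, bisection_step, steep, 0.0, 1.0)
false_position_count = iterations_until(0.9, false_position_step, steep, 0.0, 1.0)
```

For this stand-in, the bisection rule needs far fewer iterations than the false position rule, which is the situation in which switching to bisection for the first few iterations pays off.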
8.4
The bisection and false position methods are both bracketing root-finding methods.
They are called closed methods, because they enclose the root between bounds.
These bounds constitute both an advantage and a limitation. On the one hand, they
guarantee that the methods will converge, that they will succeed in finding the root.
Indeed, these methods cannot possibly fail, since they begin by locking in the root
between bounds and never lose sight of it. However, iteratively updating bounds is a
slow process, and the two methods seen so far only have O(h) convergence rates.
The alternatives to closed methods are open methods. As the name implies, these
methods do not enclose the root between bounds. They do use initial points, but
these points could all be on the same side of the root. The methods then use some
mathematical formula to iteratively refine the value of the root. Since these algorithms can update their estimates without worrying about maintaining bounds, they
typically converge a lot more efficiently than closed methods. However, for the
same reason that they do not keep the root bracketed, they can sometimes diverge
and fail to find the root altogether if they use a bad point or combination of points in
their computations.
8.5
The simple fixed-point iteration (SFPI) method is the simplest open root-finding
method available. As will be seen, it is also the open method with the worst
convergence rate in general and it diverges in many common situations, so it is
far from the best. However, it will be useful to use as the first open numerical
method in this book, to introduce fundamental notions that will be applied to all
other methods.
As mentioned in the previous section, an open method is one that iteratively
improves an (unbounded) estimate of the solution point xi. The first necessary step
to any open method is thus to write the system being studied in an iterative form of
xi+1 = f(xi). In the case of root-finding methods, however, this is a special problem,
since the solution point is the one where f(xi) = 0. It is necessary to modify the
equation of the system's model somehow. Each open root-finding method that will
be presented in the next sections will be based on a different intuition to rewrite the
equation into iterative form. For SFPI, it is done simply by isolating a single
instance of x in the equation f(x) = 0 to get g(x) = x. In other words:
$$ f(x) = g(x) - x = 0 \Rightarrow g(x) = x \tag{8.15} $$
5x3 4x2 2
x
3
8:16
8:17
until the equation converges; and given Eq. (8.15), the value of x it converges to is
the root of f(x). As with any iterative method, the standard halting conditions apply.
The method will be said to have converged and succeeded if the relative error
between two successive iterations is less than a predefined threshold $\varepsilon$:
$$ E_i = \left| \frac{x_{i+1} - x_i}{x_{i+1}} \right| < \varepsilon \tag{8.18} $$
And it will be said to have failed if the number of iterations reaches a preset
maximum. The pseudocode for this method is presented in Fig. 8.7. Notice that,
contrary to the pseudocode of the bisection and false position methods, this one
does not maintain two bounds, only one current value x. Consequently, the iterative
update is a lot simpler; while before it was necessary to check the evaluation of the
current value against that of each bound to determine which one to replace, now the
value is updated unconditionally.
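The iteration and its two halting conditions can be sketched in a few lines of Python. This is a minimal illustration rather than the book's pseudocode: the function name `sfpi`, its parameters, and the fallback to an absolute error when the new estimate is exactly zero are assumptions made here. The sample function `g_signal` assumes the rearrangement g(t) = t + e^(-t) cos(t) for the signal model used in this chapter's examples.

```python
import math


def sfpi(g, x0, eps=0.005, max_iter=100):
    """Simple fixed-point iteration x_{i+1} = g(x_i).

    Returns (estimate, iterations used) or raises RuntimeError when the
    iteration limit is reached (the failure halting condition).
    """
    x = x0
    for iteration in range(1, max_iter + 1):
        x_next = g(x)  # unconditional update: no bounds to maintain
        # Relative error between successive estimates; fall back to the
        # absolute difference if the new estimate is exactly zero
        # (an assumption of this sketch, to avoid dividing by zero).
        scale = abs(x_next) if x_next != 0 else 1.0
        error = abs(x_next - x) / scale
        x = x_next
        if error < eps:
            return x, iteration
    raise RuntimeError("SFPI did not converge within max_iter iterations")


def g_signal(t):
    """Assumed rearrangement g(t) = t + P(t) with P(t) = exp(-t) * cos(t)."""
    return t + math.exp(-t) * math.cos(t)


root, iterations = sfpi(g_signal, 0.0, eps=0.005)
print(f"t = {root:.4f} s after {iterations} iterations")
```

Because the update is unconditional, the loop body is shorter than that of a bracketing method, but nothing prevents the estimates from drifting away from the root when g is badly chosen.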
Example 8.4
The power of a signal being measured by a tower is decreasing over time
according to this model:
$$P(t) = e^{-t} \cos(t)$$
Starting at $t_0 = 0$ s, find the time that the signal will have lost all power to a
relative error of less than 0.5 %.
t_i (s)    E_i (%)
0          -
1          100.00
1.20       16.58
1.31       8.38
1.38       5.09
1.43       3.38
1.46       2.36
1.49       1.71
1.51       1.26
1.52       0.95
1.53       0.72
1.54       0.56
1.55       0.43

[Figure: Power (vertical axis) versus Time (horizontal axis)]
FUNCTION G(x)
RETURN evaluation of the transformed version of the function at point x
END FUNCTION
pointing to $g(t_i) = t_{i+1} = 1.20$, and so on. It can be seen that the method converges
towards the root.
But, as explained in Sect. 8.4, convergence is not guaranteed for open methods
like the SFPI. In fact, it can be shown that the SFPI method only converges if the
slope of $g(x)$ at the root point $x_r$ is less than the slope of $x$ at that point, or in other
words if the absolute value of the derivative of $g(x)$ at $x_r$ is less than 1. That was not a
problem for the function of Example 8.4; its derivative is
$g^{(1)}(x) = -e^{-x}(\cos(x) + \sin(x)) + 1$, and evaluated at the root it gives 0.792120424,
below the threshold value of 1. However, consider for example the function
$f(x) = x^2 - x$, which has two roots at $x_{r0} = 0$ and $x_{r1} = 1$. In the SFPI method, this
equation becomes:
$$x_{i+1} = g(x_i) = x_i^2 \tag{8.19}$$
The absolute value of the derivative is $|g^{(1)}(x_i)| = |2x_i|$, and evaluated at the roots it
gives $|g^{(1)}(x_{r0})| = 0$ and $|g^{(1)}(x_{r1})| = 2$. This means the SFPI will converge on the first
root, but cannot converge on the second root. The practical impact of this problem
is illustrated in Fig. 8.9. As this figure shows, picking an initial value $x_0$ whose
magnitude is less than 1 will lead to the SFPI converging on the root at 0. However,
picking an initial value that is greater than 1 doesn't allow the SFPI to converge on
the root at 1, but instead it causes the method to diverge quickly away from both
roots and towards infinity. The root at 1 is simply unreachable by the SFPI, unless it
is picked as the initial value (in which case the method converges on it without
iterating). This discussion also illustrates two other problems with the SFPI method.
First, one cannot predict whether the method will converge or diverge ahead of time
unless one already knows the value of the root in order to evaluate the derivative at
that point, which is of course not a piece of information available ahead of time in a
root-finding problem. And second, the form of $g(x)$ that the equation $f(x)$ is
rewritten into actually matters. Multiple different forms of $g(x)$ are possible for
one equation $f(x)$, and not all of them will converge on the same values, or at all.
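The behaviour described for g(x) = x^2 is easy to reproduce numerically. The short sketch below (the helper name `iterate` and the three starting values are choices made here for illustration) applies ten unconditional updates from a point just below 1, a point just above 1, and the root at 1 itself.

```python
def iterate(g, x0, steps):
    """Apply x_{i+1} = g(x_i) a fixed number of times and keep every estimate."""
    values = [x0]
    for _ in range(steps):
        values.append(g(values[-1]))
    return values


def g(x):
    """SFPI form of f(x) = x^2 - x, i.e. g(x) = x^2."""
    return x * x


inside = iterate(g, 0.9, 10)    # |x0| < 1: collapses towards the root at 0
outside = iterate(g, 1.1, 10)   # x0 > 1: runs away from both roots
at_root = iterate(g, 1.0, 10)   # x0 = 1: stays on the root without moving

print(inside[-1], outside[-1], at_root[-1])
```

Only the starting point changes between the three runs; the slope of g at the root 1 is 2, so any small offset from that root is doubled at every step instead of being damped.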
The convergence rate for the SFPI method is derived from a Taylor series
expansion of the function $g(x)$. Recall from Chap. 5 that this means that the error
rate will be proportional to the order of the first non-null term in the series. In other
words, much like the convergence test, the convergence rate will depend on
evaluations of the derivative at the root. If $g^{(1)}(x_r) = 0$, then the error rate will be
$O(h^2)$, if in addition $g^{(2)}(x_r) = 0$ then the error rate will be $O(h^3)$, and if in addition to
those two $g^{(3)}(x_r) = 0$ then the error rate will be $O(h^4)$, and so on. In the general case
though, the assumption is that $g^{(1)}(x_r) \neq 0$ and the error rate is $O(h)$.
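The dependence on the derivatives at the root can be written out explicitly. The following is a sketch of the expansion, where $h_i = x_i - x_r$ denotes the error at iteration $i$ (a symbol introduced here for illustration) and $g(x_r) = x_r$ follows from Eq. (8.15):

```latex
\begin{aligned}
x_{i+1} = g(x_i) &= g(x_r) + g^{(1)}(x_r)\,h_i
    + \frac{g^{(2)}(x_r)}{2}\,h_i^2
    + \frac{g^{(3)}(x_r)}{6}\,h_i^3 + \cdots \\
h_{i+1} = x_{i+1} - x_r &= g^{(1)}(x_r)\,h_i
    + \frac{g^{(2)}(x_r)}{2}\,h_i^2
    + \frac{g^{(3)}(x_r)}{6}\,h_i^3 + \cdots
\end{aligned}
```

The first term that does not vanish dominates for small $h_i$. When it is the linear term, the error shrinks only if $|g^{(1)}(x_r)| < 1$, which is the convergence condition stated earlier.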
8.6 Newton's Method
8.6.1 One-Dimensional Newton's Method
Newton's method, also called the Newton-Raphson method, is possibly the most
popular root-finding method available. It has a number of advantages: it converges
very efficiently (in fact it has the highest convergence rate of any root-finding
method covered in this book), it is simple to implement, and it only requires
maintaining one past estimate of the root, like the SFPI but unlike any of the other
root-finding methods available. Its main downside is that it requires knowing or
estimating the derivative of the function being modelled.
The basic assumption behind Newton's method is that, for a small enough
neighborhood around a point, a function can be approximated by its tangent, the
straight line defined by its first derivative at that point. Since this tangent is a
straight line, its root is straightforward to find. The tangent's root is used as an
approximation of the original function's root, and as a new point at which to
evaluate the derivative to iteratively improve the approximation. As the
approximation point gets closer to the root and the neighborhood approximated by
the tangent gets smaller, the tangent becomes a more accurate approximation of the
function and its root becomes a more accurate approximation of the function's root.
To illustrate, a single iteration of Newton's method is represented graphically in
Fig. 8.10.
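The method starts from the first-order Taylor series approximation of the function
around the current point $x_i$:
$$f(x_{i+1}) = f(x_i) + f^{(1)}(x_i)(x_{i+1} - x_i) \tag{8.20}$$
Since the function is converging to a root, then the value $f(x_{i+1}) = 0$. With that value
set, Eq. (8.20) can be rewritten as an iterative formula of $x$:
$$x_{i+1} = x_i - \frac{f(x_i)}{f^{(1)}(x_i)} \tag{8.21}$$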
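The convergence rate can be obtained from the second-order Taylor series
approximation evaluated at the root $x_r$, where $f(x_r) = 0$:
$$f(x_r) = f(x_i) + f^{(1)}(x_i)(x_r - x_i) + \frac{f^{(2)}(x_i)}{2}(x_r - x_i)^2 \tag{8.22}$$
Following the same steps used to get from Eq. (8.20) to (8.21) but putting the entire
Newton's method equation on one side of the equation gives:
$$x_r - \left( x_i - \frac{f(x_i)}{f^{(1)}(x_i)} \right) = -\frac{f^{(2)}(x_i)}{2 f^{(1)}(x_i)} (x_r - x_i)^2 \tag{8.23}$$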
FUNCTION F(x)
RETURN evaluation of the target function at point x
END FUNCTION
FUNCTION Derive(F(x))
RETURN evaluation of the derivative of the function at point x
END FUNCTION
Note that Newton's method formula for $x_{i+1}$ is on the left of the equation. Both sides
of the equation then have a subtraction of an approximation from $x_r$, which is the error
$h$ at that point. The equation thus simplifies to:
$$h_{i+1} = -\frac{f^{(2)}(x_i)}{2 f^{(1)}(x_i)} h_i^2 \tag{8.24}$$
which is indeed an $O(h^2)$ error rate.
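The update of Eq. (8.21), together with the usual halting conditions, can be sketched as follows. The function name `newton`, its parameters, and the `tiny` threshold used to detect a near-zero derivative are assumptions of this illustration, not the book's pseudocode. The sample model is the signal P(t) = e^(-t) cos(t) used in this chapter's examples, with its derivative worked out by hand.

```python
import math


def newton(f, df, x0, eps=0.005, max_iter=100, tiny=1e-12):
    """Newton's method: x_{i+1} = x_i - f(x_i) / f'(x_i).

    Fails if the derivative is (nearly) zero or the iteration limit is hit.
    """
    x = x0
    for iteration in range(1, max_iter + 1):
        slope = df(x)
        if abs(slope) < tiny:
            raise ZeroDivisionError("derivative is (nearly) zero")
        x_next = x - f(x) / slope
        # Relative error between successive estimates (absolute if x_next is 0).
        scale = abs(x_next) if x_next != 0 else 1.0
        error = abs(x_next - x) / scale
        x = x_next
        if error < eps:
            return x, iteration
    raise RuntimeError("Newton's method did not converge within max_iter")


def power(t):
    """Signal model P(t) = exp(-t) * cos(t)."""
    return math.exp(-t) * math.cos(t)


def power_derivative(t):
    """P'(t) = -exp(-t) * (sin(t) + cos(t))."""
    return -math.exp(-t) * (math.sin(t) + math.cos(t))


root, iterations = newton(power, power_derivative, 0.0, eps=0.005)
print(f"t = {root:.4f} s after {iterations} iterations")
```

Only one estimate is carried from one iteration to the next, which is why the loop keeps a single variable `x` in addition to the candidate `x_next`.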
Example 8.5
The power of a signal being measured by a tower is decreasing over time
according to this model:
$$P(t) = e^{-t} \cos(t)$$
Starting at $t_0 = 0$ s, find the time that the signal will have lost all power to a
relative error of less than 0.5 %.
Solution
The derivative of this function is:
$$P'(t) = -e^{-t}(\sin(t) + \cos(t))$$
And the iterative formula for Newton's method, from Eq. (8.21), is thus:
$$t_{i+1} = t_i - \frac{e^{-t_i} \cos(t_i)}{-e^{-t_i}(\sin(t_i) + \cos(t_i))}$$
t_i (s)    E_i (%)
0          -
1          100
1.39       28.1
1.54       10.0
1.570      1.6
1.571      0.04

[Figure: Power (vertical axis) versus Time (horizontal axis)]
somewhat in the vicinity of the root, and not any random value anywhere on the
x-axis. The second condition that can cause the method to diverge is if the first
derivative at the current point is near zero. This has already been included as a
failure halting condition in the algorithm. Conceptually, this means the point $x_i$ is at
an optimum of the function, and the tangent is horizontal. In such a case, the
next point computed by Eq. (8.21) will shoot out very far from the current point and
the root. This situation is illustrated in Fig. 8.12. The third and final condition that
can cause the method to diverge is if the second derivative is very large. Again, it is
clear from Eq. (8.24) that this will cause the error to increase between
iterations. Conceptually, a high second derivative means that the current point is
near a saddle of the function. In that case, Eq. (8.21) will generate points that
oscillate around this saddle and get progressively further and further away. This
situation is illustrated in Fig. 8.13.
8.6.2 Multidimensional Newton's Method
The examples used so far in this chapter have all been one-dimensional $y = f(x)$
root-finding problems. However, one of the main advantages of Newton's method,
especially when contrasted to the bracketing methods, is that it can easily be
adapted to more complex multidimensional problems. In such a problem, an
engineer must deal with $n$ independent variables constrained by $n$ separate model
equations. The root of the system is the simultaneous root of all $n$ equations.
Begin by defining $\mathbf{x} = [x_0, x_1, \ldots, x_{n-1}]^T$, the vector of $n$ independent variables of
the system, and $\mathbf{f}(\mathbf{x}) = [f_0(\mathbf{x}), f_1(\mathbf{x}), \ldots, f_{n-1}(\mathbf{x})]^T$, the vector of $n$ $n$-dimensional
functions that model the system. Since this is now a vector problem, Newton's
method equation (8.21) needs to be rewritten to eliminate the division as:
$$\mathbf{f}^{(1)}(\mathbf{x}_i)(\mathbf{x}_{i+1} - \mathbf{x}_i) = -\mathbf{f}(\mathbf{x}_i) \tag{8.25}$$
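where the step between two successive estimates is:
$$\Delta\mathbf{x}_i = \mathbf{x}_{i+1} - \mathbf{x}_i \tag{8.26}$$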
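The derivative of the vector of functions is the Jacobian matrix $\mathbf{J}_f(\mathbf{x})$, or the $n \times n$
matrix of partial derivatives of each of the $n$ functions with respect to each of the
$n$ variables, arranged as shown in Eq. (8.27):
$$\mathbf{J}_f(\mathbf{x}) = \begin{bmatrix}
\dfrac{\partial f_0(\mathbf{x})}{\partial x_0} & \dfrac{\partial f_0(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_0(\mathbf{x})}{\partial x_{n-1}} \\
\dfrac{\partial f_1(\mathbf{x})}{\partial x_0} & \dfrac{\partial f_1(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_1(\mathbf{x})}{\partial x_{n-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_{n-1}(\mathbf{x})}{\partial x_0} & \dfrac{\partial f_{n-1}(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_{n-1}(\mathbf{x})}{\partial x_{n-1}}
\end{bmatrix} \tag{8.27}$$
Using the Jacobian as the derivative in Eq. (8.25) gives:
$$\mathbf{J}_f(\mathbf{x}_i)\,\Delta\mathbf{x}_i = -\mathbf{f}(\mathbf{x}_i) \tag{8.28}$$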
At each iteration, the Jacobian and the function can be evaluated, and only the step
size $\Delta\mathbf{x}_i$ is unknown. The problem has thus become an $\mathbf{M}\mathbf{x} = \mathbf{b}$ linear algebra
equation to solve, which can be done using any of the methods learned in Chap. 4.
Finally, the next vector is obtained simply with:
$$\mathbf{x}_{i+1} = \mathbf{x}_i + \Delta\mathbf{x}_i \tag{8.29}$$
The halting conditions for the iterative algorithm are the same as for the
one-dimensional Newton's method, but adapted to matrices and vectors. The
success condition is that the relative error between two successive approximations
of the root is less than a preset threshold value, defined now as the Euclidean
distance between the two vectors $\mathbf{x}_i$ and $\mathbf{x}_{i+1}$ introduced in Chap. 3. There is a failure
condition if the derivative is zero, as there was with the one-dimensional Newton's
method. This is defined here as the case where the determinant of the Jacobian
matrix is zero. Finally, as always, the algorithm fails if it reaches a preset maximum
number of iterations. The pseudocode for Newton's method has been updated from
Fig. 8.11 to handle multidimensional problems and is presented in Fig. 8.14.
Since the multidimensional Newton's method equation of (8.28) is derived from
the first-order Taylor series approximation, just like the one-dimensional case, it
will also have an $O(h^2)$ convergence rate.
x ← Input initial approximation of the root as vector of length n
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    Delta ← solution of system [CALL Jacobian(F(x))] × Delta = -1 × [CALL F(x)]
    x ← x + Delta
    IF ( CALL F(x) = 0 )
        RETURN Success, x
    ELSE IF ( Determinant of [CALL Jacobian(F(x))] = 0 )
        RETURN Failure
    END IF
    CurrentError ← Euclidean distance between x and PreviousValue
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN vector of length n of evaluations of the n target functions at point x
END FUNCTION

FUNCTION Jacobian(F(x))
    RETURN n×n matrix of evaluation of the partial derivatives of the n
           functions with respect to the n variables at point x
END FUNCTION
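The same loop can be sketched in plain Python. Everything below is illustrative: the names `solve_linear` and `newton_system`, the use of Gaussian elimination with partial pivoting for the linear solve, and the two-equation test system (a circle of radius 2 intersected with the line x = y) are assumptions chosen for this example, not taken from the book. A vanishing pivot plays the role of the zero-determinant failure condition.

```python
import math


def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's data is left untouched.
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-14:
            raise ZeroDivisionError("Jacobian is singular (zero determinant)")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        partial = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - partial) / a[r][r]
    return x


def newton_system(f, jacobian, x0, eps=1e-3, max_iter=50):
    """Multidimensional Newton: solve J(x_i) * delta = -f(x_i), then x += delta."""
    x = list(x0)
    for iteration in range(1, max_iter + 1):
        delta = solve_linear(jacobian(x), [-value for value in f(x)])
        x = [xi + di for xi, di in zip(x, delta)]
        # Euclidean distance between successive estimates.
        error = math.sqrt(sum(d * d for d in delta))
        if error <= eps:
            return x, iteration
    raise RuntimeError("Newton's method did not converge within max_iter")


def f(v):
    """Hypothetical test system: x^2 + y^2 - 4 = 0 and x - y = 0."""
    x, y = v
    return [x * x + y * y - 4.0, x - y]


def jacobian(v):
    """Analytic Jacobian of the hypothetical test system."""
    x, y = v
    return [[2.0 * x, 2.0 * y], [1.0, -1.0]]


root, iterations = newton_system(f, jacobian, [1.0, 2.0])
print(root, iterations)
```

Solving the linear system directly, rather than inverting the Jacobian, mirrors the way Eq. (8.25) removes the division from the one-dimensional formula.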
Example 8.6
The shape of the hull of a sunken ship is modelled by this equation:
$$z = f_0(x, y) = x^2 + 2y^2 - xy + x - 1$$
where z is the height of the sunken hull above the seabed. An automated
submarine is scanning the hull to find the damaged point where the ship has
hit the seabed. It has been programmed to explore it in a 2D grid pattern,
starting at coordinates (0,0) and following this program:
$$z = f_1(x, y) = 3x^2 + 2y^2 + xy + 3y - 2$$
Determine if the probe will find the damage it is looking for with a relative
error of 0.001.
Solution
This underwater exploration can be modelled by the following system of
equations:
$$\mathbf{f}(\mathbf{x}) = \begin{bmatrix} x^2 + 2y^2 - xy + x - 1 \\ 3x^2 + 2y^2 + xy + 3y - 2 \end{bmatrix}, \qquad \mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}$$
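The point where the hull has hit the sea bed is at $z = 0$, and that is the
damaged point the probe is looking for. It is therefore a root-finding problem.
To use Newton's method, first compute the Jacobian following Eq. (8.27):
$$\mathbf{J}_f(\mathbf{x}) = \begin{bmatrix} 2x - y + 1 & 4y - x \\ 6x + y & 4y + x + 3 \end{bmatrix}$$
$$\mathbf{J}_f(\mathbf{x}_0)\,\Delta\mathbf{x}_0 = -\mathbf{f}(\mathbf{x}_0)$$
$$\begin{bmatrix} 1 & 0 \\ 0 & 3 \end{bmatrix} \Delta\mathbf{x}_0 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$
$$\Delta\mathbf{x}_0 = \begin{bmatrix} 1.0000 \\ 0.6667 \end{bmatrix}$$
$$\mathbf{x}_1 = \mathbf{x}_0 + \Delta\mathbf{x}_0 = \begin{bmatrix} 1.0000 \\ 0.6667 \end{bmatrix}$$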
x_i                       Δx_i                              E_i
[0, 0]^T                  [1.0000, 0.66667]^T               1.2019
[1.0000, 0.66667]^T       [-0.12483, -0.55851]^T            0.5723
[0.87517, 0.10816]^T      [-0.20229, 0.079792]^T            0.2175
[0.67288, 0.18795]^T      [-0.032496, 0.0040448]^T          0.0328
[0.64038, 0.19199]^T      [-0.00056451, -0.00016391]^T      0.0009
[0.63982, 0.19183]^T
8.7 Secant Method
method cannot be used. One alternative is to approximate the derivative using the
secant line of the curve, a line passing through (or interpolating) two points on the
function. As these two points iteratively become closer to the root and to each other,
the secant line will become an approximation of the tangent near the root and this
secant method will approximate Newton's method.
From the first-order Taylor series approximation, the approximation of the first
derivative at a point $x_i$ computed near a previous point $x_{i-1}$ is given as:
$$f^{(1)}(x_i) \approx \frac{f(x_{i-1}) - f(x_i)}{x_{i-1} - x_i} \tag{8.30}$$
This immediately adds a new requirement into the method: instead of keeping only
one current point with Newton's method, it is necessary to keep two points at each
iteration. This is one of the costs of eliminating the derivative from the method.
Next, the derivative approximation formula is used to replace the actual derivative
in Eq. (8.21):
$$x_{i+1} = x_i - \frac{f(x_i)}{\dfrac{f(x_{i-1}) - f(x_i)}{x_{i-1} - x_i}} = x_i - \frac{f(x_i)(x_{i-1} - x_i)}{f(x_{i-1}) - f(x_i)} \tag{8.31}$$
And Newton's method is now the secant method. The halting conditions for the
iterations are the same as for Newton's method: the method will fail if it reaches a
preset maximum number of iterations or if the denominator becomes zero, which
will be the case if two points are generated too close to each other (this situation will
also introduce the risk of subtractive cancellation explained in Chap. 2), and it will
succeed if the relative error between two iterations is less than a preset threshold.
The pseudocode for Newton's method in Fig. 8.11 can be updated for the secant
method, and is presented in Fig. 8.15.
Since the secant method approximates Newton's method and replaces the
derivative with an approximation of the derivative, it should be no surprise that
its convergence rate is not as good as Newton's method. In fact, while the proof is
outside the scope of this book, the convergence rate of the secant method is
$O(h^{1.618})$, less than the quadratic rate Newton's method boasted but better than the
linear rate of the other methods presented so far in this chapter.
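A short Python sketch of the secant update of Eq. (8.31) follows. The function name `secant` and its parameters are assumptions of this illustration. The two most recent points are kept in `x_prev` and `x`, and the older one is discarded at each step. The sample model is again P(t) = e^(-t) cos(t).

```python
import math


def secant(f, x_prev, x, eps=0.005, max_iter=100):
    """Secant method: Newton's update with the derivative replaced by the
    slope of the line through the two most recent points."""
    for iteration in range(1, max_iter + 1):
        denominator = f(x_prev) - f(x)
        if denominator == 0:
            raise ZeroDivisionError("secant line is horizontal")
        x_next = x - f(x) * (x_prev - x) / denominator
        # Relative error between successive estimates (absolute if x_next is 0).
        scale = abs(x_next) if x_next != 0 else 1.0
        error = abs(x_next - x) / scale
        # Keep the two newest points; the oldest one is dropped.
        x_prev, x = x, x_next
        if error < eps:
            return x, iteration
    raise RuntimeError("secant method did not converge within max_iter")


def power(t):
    """Signal model P(t) = exp(-t) * cos(t)."""
    return math.exp(-t) * math.cos(t)


root, iterations = secant(power, 0.0, 1.0, eps=0.005)
print(f"t = {root:.4f} s after {iterations} iterations")
```

No derivative function is passed in, which is the whole point of the method; the price is the second stored point and the risk of a zero denominator when the two points nearly coincide.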
Equation (8.31) should be immediately recognizable: it is the same as the false
position method's equation (8.10). In fact, both methods work in the same way:
they both estimate the root by modelling the function with a straight line
interpolated from two function points, and use the root of that line as an
approximation of the root. The difference between the two methods is in the update
process once a new approximation of the root is available. As explained back in
Sect. 8.3, the false position method will update the one boundary point on the same
side of the zero-crossing as the new point. Moreover, the method will usually
generate points only on one side of the zero-crossing, which means that only one of
the two bounds is updated, while the other keeps its original value in most of the
computations. This will ensure that the root stays within the brackets, and guarantee
that the method will converge, albeit slowly. By contrast, the secant method will
update the points in the order they are generated: the new approximation and the
previous one are kept and used to compute the next one, and the approximation
from two iterations back is discarded. This is done without checking whether the
points are on the same side of the zero-crossing or on opposite sides. This allows
faster convergence, since the two newest and best estimates of the root are always
used in the computations. However, it also introduces the risk that the method will
diverge, which was impossible for the false position method.

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION
To understand the problem of divergence with the secant method, consider the
example in Fig. 8.16. On the top side, a secant line (in blue) is interpolated between
two points $x_{i-1}$ and $x_i$ of the function (in red) and an approximation of the root $x_{i+1}$
is obtained. This approximation is then used along with $x_i$ to interpolate a new
secant line, which is a very good approximation of the function. It can clearly be
seen that the next approximation $x_{i+2}$ will be very close to the real root of the
function. But what if the exact same points had been considered in the opposite
order? The result is shown on the bottom side of Fig. 8.16. Initially the same secant
line is interpolated and the same approximation $x_{i+1}$ is obtained. However, now the
next secant line interpolated between $x_i$ and $x_{i+1}$ diverges and the next point $x_{i+2}$ will
be very distant from the root. The problem is that, in this new situation, the points $x_i$
and $x_{i+1}$ are interpolating a section of the function that is very dissimilar to the
section that includes the root. As a result, while the interpolation is a good
approximation of that section of the function, it is not at all useful for the purpose
of root-finding. Meanwhile, because the false position method only updates the
point on the same side of the zero-crossing, it can only generate the situation on
the top side of Fig. 8.16 regardless of the order the points are fed into the algorithm,
and can never diverge in the way shown on the bottom side. Note however that this
constraint is not necessary to avoid divergence: it is only necessary for the secant
method to use points that interpolate a section of the function similar to the section
that has the zero-crossing. For example, using two points both on the negative side
of the function would allow the secant method to generate a very good
approximation of the root.
Example 8.7
The power of a signal being measured by a tower is decreasing over time
according to this model:
$$P(t) = e^{-t} \cos(t)$$
Starting at $t_{-1} = 0$ s and $t_0 = 1$ s, find the time that the signal will have lost all
power to a relative error of less than 0.5 %.
Solution
The secant method equation, using Eq. (8.31), is:
$$t_{i+1} = t_i - \frac{e^{-t_i} \cos(t_i)\,(t_{i-1} - t_i)}{e^{-t_{i-1}} \cos(t_{i-1}) - e^{-t_i} \cos(t_i)}$$
Given the two initial points, the iterations computed are:
Iteration    t_i (s)    E_i (%)
-1           0          -
0            1          100
1            1.25       19.9
2            1.46       14.4
3            1.54       5.51
4            1.568      1.61
5            1.571      0.2

[Figure: Power (vertical axis) versus Time (horizontal axis)]
8.8 Muller's Method
The secant method and the false position method both approximate the function
using a straight line interpolated from two points, and use the root of that line as the
approximation of the root of the function. The weakness of these methods, as
illustrated in Figs. 8.6 and 8.16, is that a straight line is not always a good
approximation of a function. To address that problem, a simple solution is
available: to use more points and compute a higher-degree interpolation, which
would be a better approximation of the function and would make it possible to get a
closer approximation of the root. Of course, there is a limit to this approach; the
higher the degree of the interpolating polynomial, the more roots it will have and the
more difficult it will be to find them all easily and efficiently. Muller's method offers
a good compromise. It uses three points to compute a degree-2 interpolation
(a parabola) of the function to model. The degree-2 polynomial offers a better
approximation of the function than a straight line, as illustrated in Fig. 8.17, while
still being easy enough to handle to find the roots.

Fig. 8.17 Approximating the root using a degree 1 (left) and degree 2 (right) interpolation of a
function
The first step of Muller's method is thus to approximate the function $f(x)$ using a
parabola interpolated from three points $x_{i-2}$, $x_{i-1}$, and $x_i$. The equation for a
parabola is well known to be $p(x) = ax^2 + bx + c$. Given three points on the function
to model, it can be computed using the Vandermonde method from Chap. 4 as the
solution to the system:
$$\begin{bmatrix} x_{i-2}^2 & x_{i-2} & 1 \\ x_{i-1}^2 & x_{i-1} & 1 \\ x_i^2 & x_i & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} f(x_{i-2}) \\ f(x_{i-1}) \\ f(x_i) \end{bmatrix} \tag{8.32}$$
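In order to use these equations in an iterative formula of the form $x_{i+1} = x_i + \Delta x_i$,
substitute $x - x_i$ for $x$. This changes the parabola equation to
$p(x) = a(x - x_i)^2 + b(x - x_i) + c$ and the Vandermonde system to:
$$\begin{bmatrix} (x_{i-2} - x_i)^2 & x_{i-2} - x_i & 1 \\ (x_{i-1} - x_i)^2 & x_{i-1} - x_i & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} f(x_{i-2}) \\ f(x_{i-1}) \\ f(x_i) \end{bmatrix} \tag{8.33}$$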
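Written in that form, the Vandermonde system is trivial to solve. In fact, a solution
can be obtained immediately as:
$$a = \frac{\dfrac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} - \dfrac{f(x_{i-1}) - f(x_{i-2})}{x_{i-1} - x_{i-2}}}{x_i - x_{i-2}}$$
$$b = a(x_i - x_{i-1}) + \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} \tag{8.34}$$
$$c = f(x_i)$$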
Note that in both the parabola equation and the Vandermonde system, the solution
remains unchanged. This is because the subtraction represents only a horizontal
shift of the function. All values of the function are moved along the x-axis by an
offset of $x_i$, but they remain unchanged along the y-axis. This is akin to a
time-shifting operation in signal processing, and is illustrated in Fig. 8.18 for clarity.
Once the coefficients $a$, $b$, and $c$ for the parabola equation are known, the next
step is to find the roots of the parabola, which will serve as approximations of the
root of $f(x)$. The standard quadratic formula to find the roots of a degree-2
polynomial is:
$$r_0, r_1 = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \tag{8.35}$$
This equation will yield both roots of the polynomial. The one that is useful for the
iterative system is the one obtained by setting the $\pm$ sign to the same sign as $b$. Note
however that this will introduce the risk that the problem of subtractive cancellation
described in Chap. 2 will occur in cases where $b^2 \gg 4ac$. To avoid this, an
alternative form of Eq. (8.35) exists:
$$r_0, r_1 = \frac{-2c}{b \pm \sqrt{b^2 - 4ac}} \tag{8.36}$$
The iterative formula of Muller's method is thus:
$$x_{i+1} = x_i - \frac{2c}{b \pm \sqrt{b^2 - 4ac}} \tag{8.37}$$
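The effect of subtractive cancellation on the two forms of the quadratic formula can be checked numerically. The sketch below uses hypothetical coefficients a = 1, b = 10^8, c = 1 (chosen here so that b^2 is far larger than 4ac) and compares the small-magnitude root given by each form; the function names are illustrative only.

```python
import math


def small_root_standard(a, b, c):
    """Root from (-b +/- sqrt(b^2 - 4ac)) / (2a), with the sign chosen so the
    two terms of the numerator nearly cancel (the form prone to cancellation)."""
    root_term = math.sqrt(b * b - 4.0 * a * c)
    if b >= 0:
        return (-b + root_term) / (2.0 * a)
    return (-b - root_term) / (2.0 * a)


def small_root_alternative(a, b, c):
    """Same root from -2c / (b +/- sqrt(b^2 - 4ac)), with the sign set to the
    sign of b so the denominator adds two terms instead of subtracting them."""
    root_term = math.sqrt(b * b - 4.0 * a * c)
    if b >= 0:
        return -2.0 * c / (b + root_term)
    return -2.0 * c / (b - root_term)


a, b, c = 1.0, 1.0e8, 1.0          # hypothetical coefficients with b^2 >> 4ac
naive = small_root_standard(a, b, c)
stable = small_root_alternative(a, b, c)
print(naive, stable)               # the exact root is very close to -1e-8
```

With these coefficients the exact small root is about -1e-8; the alternative form reproduces it to full precision, while the standard form loses most of its significant digits.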
where the $\pm$ is set to the same sign as $b$ and the values $a$, $b$, and $c$ are computed by
solving the $\mathbf{M}\mathbf{x} = \mathbf{b}$ system of Eq. (8.33). The algorithm has only two halting
conditions: a success condition if the relative error between two successive values
$x_i$ and $x_{i+1}$ is less than a preset threshold, and a failure condition if a preset maximum
number of iterations is reached. The pseudocode for this algorithm, using the
solution of Eq. (8.34) and a simple test to set the $\pm$ sign, is presented in Fig. 8.19.

Fig. 8.18 Horizontal shift of the parabola f(x) to f(x + 3)
The convergence rate of Muller's method is $O(h^{1.839})$, slower than Newton's
method but better than the secant method. Intuitively, the fact that Muller's method
performs better than the secant method should not be a surprise, since it follows the
same idea of interpolating a model of the function and using the model's root as an
approximation, but does so with more information (one more point) to get a better
model. On the other hand, Newton's method uses information from the function
itself, namely its derivative, instead of an approximation, so it should naturally
perform better than any approximation-based method.
PreviousValue2, PreviousValue, x ← Input three initial approximations of the root
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
IterationCounter ← 0
WHILE (TRUE)
A
FUNCTION F(x)
RETURN evaluation of the target function at point x
END FUNCTION
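A compact Python sketch of the method is given below, using the closed-form coefficients of Eq. (8.34) and the update of Eq. (8.37). The function name `muller`, its parameters, and the decision to stop with an error when the discriminant is negative are assumptions of this illustration (a fuller implementation would typically switch to complex arithmetic in that case). The sample model is P(t) = e^(-t) cos(t).

```python
import math


def muller(f, x_prev2, x_prev, x, eps=0.005, max_iter=100):
    """Muller's method restricted to real arithmetic (assumption of this sketch)."""
    for iteration in range(1, max_iter + 1):
        # Divided differences give the parabola a(x - x_i)^2 + b(x - x_i) + c.
        slope_new = (f(x) - f(x_prev)) / (x - x_prev)
        slope_old = (f(x_prev) - f(x_prev2)) / (x_prev - x_prev2)
        a = (slope_new - slope_old) / (x - x_prev2)
        b = a * (x - x_prev) + slope_new
        c = f(x)
        discriminant = b * b - 4.0 * a * c
        if discriminant < 0:
            raise ValueError("parabola has complex roots; not handled here")
        root_term = math.sqrt(discriminant)
        # Give the square root the same sign as b to avoid cancellation.
        denominator = b + root_term if b >= 0 else b - root_term
        if denominator == 0:
            raise ZeroDivisionError("degenerate parabola")
        x_next = x - 2.0 * c / denominator
        scale = abs(x_next) if x_next != 0 else 1.0
        error = abs(x_next - x) / scale
        # Shift the window of three points forward by one.
        x_prev2, x_prev, x = x_prev, x, x_next
        if error < eps:
            return x, iteration
    raise RuntimeError("Muller's method did not converge within max_iter")


def power(t):
    """Signal model P(t) = exp(-t) * cos(t)."""
    return math.exp(-t) * math.cos(t)


root, iterations = muller(power, 0.0, 0.1, 0.2, eps=0.005)
print(f"t = {root:.4f} s after {iterations} iterations")
```

Three points are carried between iterations, one more than the secant method, and each step discards the oldest of them.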
Example 8.8
The power of a signal being measured by a tower is decreasing over time
according to this model:
$$P(t) = e^{-t} \cos(t)$$
Starting at $t_{-2} = 0$ s, $t_{-1} = 0.1$ s, $t_0 = 0.2$ s, find the time that the signal will
have lost all power to a relative error of less than 0.5 %.
Solution
Note to begin that this example starts with some initial bad values. Indeed,
$P(0) = 1$, $P(0.1) = 0.900$, and $P(0.2) = 0.802$. Nonetheless, use these values to
compute the first iteration of Muller's method. Applying Eq. (8.34) finds the
coefficients of the interpolated parabola to be:
$$a = \frac{\dfrac{0.802 - 0.900}{0.2 - 0.1} - \dfrac{0.900 - 1.000}{0.1 - 0}}{0.2 - 0} = 0.089$$
$$b = 0.089(0.2 - 0.1) + \frac{0.802 - 0.900}{0.2 - 0.1} = -0.970$$
$$c = 0.802$$
Using these values in Eq. (8.37) finds the relevant root of the interpolated
parabola, and the first approximation of the root:
$$t_1 = 0.2 - \frac{2 \times 0.802}{-0.970 - \sqrt{(-0.970)^2 - 4 \times 0.089 \times 0.802}} = 1.101$$
t_i (s)    E_i (%)
0          -
0.1        -
0.2        -
1.101      81.8
1.481      25.6
1.583      6.5
1.5706     0.82
1.5708     0.01

[Figure: Power (vertical axis) versus Time (horizontal axis)]
8.9 Engineering Applications
Root-finding problems arise often in engineering design, namely when a system has
been modelled by an equation or set of equations with known parameter and
property values, and the value of a dependent (controllable) variable of the system
must be discovered. If the model used only a linear equation, it would be a simple
matter to isolate the variable to be controlled and to compute its optimal value.
Unfortunately, most real engineering situations are modelled by more complex
equations where the variable is part of exponentials, logarithms, or trigonometric
terms, and cannot be isolated. This was the case of the value of the current $I$ in
Eq. (8.4); it is simply impossible to isolate it in that equation. Many other such
situations can also occur in engineering practice. Some examples are listed here.
The van der Waals equation relates the pressure $p$, volume $V$, number of moles
$N$, and temperature $T$ of a fluid or gas in a container. The equation is:
$$\left( p + a \frac{N^2}{V^2} \right) \left( \frac{V}{N} - b \right) = RT \tag{8.38}$$
$$\sigma_{\max} = \frac{P}{A} \left[ 1 + \frac{ec}{r^2} \sec\left( \frac{L}{2r} \sqrt{\frac{P}{EA}} \right) \right] \tag{8.39}$$
where $P$ is the axial load, $A$ is the cross-section area of the column, $ec/r^2$ is the
eccentricity ratio, $L/r$ is the slenderness ratio, and $E$ is Young's modulus for the
material of the column. Structural design often requires using this equation to
determine the area of a column that will support a given load.
It is well-known that an object thrown will follow a parabolic trajectory. More
specifically, the $(x, y)$ coordinates of the object following this trajectory after
being thrown at an angle $\theta$ with initial velocity $v$ will obey the equation:
$$y = x \tan(\theta) - \frac{g x^2}{2 v^2 \cos^2(\theta)} \tag{8.40}$$
where $g$ is the Earth's gravitational acceleration and the object's starting position
is assumed to be the origin (0, 0). While the initial speed will often be determined
by the nature of the object's propulsion mechanism, a common challenge
is to determine the initial angle to use in order to reach a specific target
destination or to intercept a point in mid-trajectory.
In all these situations, the exact value of a specific variable must be known to
properly design the system, but that variable cannot be isolated from the system's
equation in order for the value to be computed. However, simply by subtracting one
side of the equation from the other, the equation becomes equal to zero and the
needed value becomes the root of the equation, and can thus be approximated to a
known error rate by any of the methods seen in this chapter.
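As an illustration of this recipe, the projectile equation can be turned into a root-finding problem for the launch angle by moving the target height to the same side as the trajectory. The sketch below solves it with a secant iteration. The target position (50 m, 10 m), the launch speed of 30 m/s, the two starting angles, and all function names are hypothetical values chosen for this example.

```python
import math

G = 9.81  # Earth's gravitational acceleration in m/s^2


def trajectory_residual(theta, target_x, target_y, speed):
    """Trajectory height at target_x minus the target height; zero at a hit."""
    return (target_x * math.tan(theta)
            - G * target_x ** 2 / (2.0 * speed ** 2 * math.cos(theta) ** 2)
            - target_y)


def launch_angle(target_x, target_y, speed,
                 guess_a=0.3, guess_b=0.5, eps=1e-10, max_iter=50):
    """Secant iteration on the residual, starting from two guessed angles (rad)."""
    previous, current = guess_a, guess_b
    for _ in range(max_iter):
        f_previous = trajectory_residual(previous, target_x, target_y, speed)
        f_current = trajectory_residual(current, target_x, target_y, speed)
        if f_current == 0.0:
            return current
        denominator = f_previous - f_current
        if denominator == 0.0:
            raise ZeroDivisionError("secant line is horizontal")
        following = current - f_current * (previous - current) / denominator
        if abs(following - current) <= eps * abs(following):
            return following
        previous, current = current, following
    raise RuntimeError("no launch angle found within max_iter iterations")


theta = launch_angle(50.0, 10.0, 30.0)
print(f"launch angle: {math.degrees(theta):.2f} degrees")
```

The equation has a second, steeper solution for the same target; which one the iteration finds depends on the starting guesses, in line with the sensitivity of open methods to their initial points.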
8.10 Summary
Many engineering situations can be modelled and solved by finding the value of
some parameters of the system for which the system balances out to zero. These are
root-finding problems, and this chapter introduced several numerical methods to
solve them. The two closed methods, the bisection and false position methods,
set up bounds around the root and either blindly pick the middle point between these
bounds or interpolate a line through the function to get closer to the root. Because
they bracket the root, these two methods are guaranteed to converge on the root
eventually, albeit slowly. Next, three open methods were introduced, namely
Newton's method, the secant method, and Muller's method. These methods all
work by approximating the function, either using its derivative at one point, a
straight line interpolated through two points, or a parabola interpolated through
three points. Since none of them are burdened by maintaining possibly inaccurate
brackets, they all perform faster than the closed methods. However, they all have a
risk of diverging and failing to find the root in certain conditions. Of these three
open methods, Newton's method was the most efficient and the most versatile since
it could easily be expanded to multidimensional and multivariate problems.
Table 8.1 summarizes the methods covered in this chapter.

Table 8.1
Method                         Requires                Error
Bisection                      2 bounds                O(h)
False position                 2 bounds                O(h)
Simple fixed-point iteration   1 point                 O(h)
Newton's method                1 point + derivative    O(h^2)
Secant method                  2 points                O(h^1.618)
Muller's method                3 points                O(h^1.839)
8.11 Exercises
6. Perform three steps of Newton's method for the function $f(x) = x^2 - 2$ starting
with $x_0 = 1$.
7. Perform three iterations of Newton's method to approximate a root of the
following multivariate systems given their starting points:
x2 y2 3
(a) f x
, x0 [1, 1]T.
2x2 0:5y2 2
2
x xy y2 3
(b) f x
, x0 [1.5, 0.5]T.
x y xy
8. Perform three steps of the secant method for the function $f(x) = x^2 - 2$ starting
with $x_{-1} = 0$ and $x_0 = 1$.
9. Perform four steps of the secant method for the function f(x) cos(x) + 2 sin(x) + x2
starting with x1 0.0 and x0 0.1.
10. Use the secant method to find a root of the function f(x) x2 7x + 3 starting
with
x1 1 and x0 0 and with an accuracy of 0.1.
11. Perform six iterations of Mullers method on the function f(x) x7 + 3x6 + 7x5
+ x4 + 5x3 + 2x2 + 5x + 5 starting with the three initial values x2 0,
x1 0.1, and x0 0.2.
Chapter 9
Optimization
9.1 Introduction
One major challenge in engineering practice is often the need to design systems that
must perform as well as possible given certain constraints. Working without
constraints would be easy: when a system can be designed with no restrictions on
cost, size, or components used, imagination is the only limit on what can be built.
But when constraints are in place, as they always will be in practice, then not only
must engineering designs respect them, but the difference between a good and a bad
design will be which one can get the most done within the stated constraints.
Take for example the design of a fuel tank. If the only design requirement is
"hold a certain amount of fuel", then there are no constraints and the tank could be
of any shape at all, provided the shape's volume is greater than the amount of fuel it
must contain. However, when the cost of the materials the tank is made up of is
taken into account, the design requirement becomes "hold at least a certain amount
of fuel at the least cost possible", and this new constraint means the problem
becomes about designing a fuel tank while minimizing its surface area, a very
different one from before. A clever engineer would design the fuel tank to be a
sphere, the shape with the lowest surface to volume ratio, in order to achieve the
optimal result within the constraints. This design will be superior to the one using,
say, a cube-shaped fuel tank, that would have a higher surface area and higher cost
to hold the same volume of fuel.
To make the example more interesting, suppose the shape of the fuel tank is also
constrained by the design of the entire system: it must necessarily be a cylinder
closed at the top and made of a metal that costs 300 $/m², while the bottom of the
tank is attached to a nozzle shaped as a cone with height equal to its radius and made
of a plastic that costs 500 $/m². The entire assembly must hold at least 2000 m³
of fuel. How can the optimal dimensions of the tank and the connected nozzle be
determined? First, model the components. For a given radius r and height h of the
cylinder tank, the surface of the side and top of the cylinder will be:
Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_9
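$$A_1 = 2\pi r h + \pi r^2 \tag{9.1}$$
and the surface of the cone-shaped nozzle will be:
$$A_2 = \pi r^2 \left( 1 + \sqrt{2} \right) \tag{9.2}$$
Likewise, the volume of the cylinder of radius r and height h will be:
$$V_1 = \pi r^2 h \tag{9.3}$$
and the volume of the cone will be:
$$V_2 = \frac{\pi r^3}{3} \tag{9.4}$$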
By looking at the cost (area) and volume of the entire assembly, this model becomes
two equations with two unknown parameters that can be controlled, r and h:
r 3
2000m3
3
p
2rh r 2 300 r 2 1 2 500 ?$
r 2 h
9:5
Normally a system of two equations and two unknowns would be easy to solve. The
problem in the system of (9.5) is that one of the equations does not have a known
result. The area, and therefore the cost, of the tank is not specified in the problem;
the only requirement is that they must be as low as possible.
The problem could be further simplified by writing the parameter h as a function
of r in the volume equation, and inserting that function of r into the price equation,
to get:
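$$\begin{aligned}
h &= \frac{2000}{\pi r^2} - \frac{r}{3} \\
\left(\frac{4000}{r} - \frac{2\pi r^2}{3} + \pi r^2\right) 300 + \pi r^2 \left(1 + \sqrt{2}\right) 500 &= ?\ \$
\end{aligned} \tag{9.6}$$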
Now the price in Eq. (9.6) is only dependent on the radius; the height will be
automatically adjusted to generate a container of 2000 m³ of fuel. The cost of a fuel
tank with a radius from 1 to 10 m can be computed, and will give the values
illustrated on the graph of Fig. 9.1. The ideal tank with minimal cost can also be
found to have a radius of 5.27 m, a height of 21.19 m, and a cost of $341,750.
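The cost curve and its minimum can be reproduced numerically. The following is a minimal sketch, assuming the cost model of Eq. (9.6) with a nozzle surface of πr²(1 + √2); the constant and function names and the 1 mm scanning step are choices made here for illustration, not part of the original text.

```python
import math

METAL_COST = 300.0    # $/m^2, side and top of the cylinder
PLASTIC_COST = 500.0  # $/m^2, cone-shaped nozzle
VOLUME = 2000.0       # m^3 of fuel to hold


def height(r):
    """Cylinder height (m) for which cylinder plus cone hold VOLUME."""
    return VOLUME / (math.pi * r**2) - r / 3.0


def cost(r):
    """Total cost ($) of the assembly as a function of the radius r (m)."""
    h = height(r)
    metal_area = 2.0 * math.pi * r * h + math.pi * r**2
    plastic_area = math.pi * r**2 * (1.0 + math.sqrt(2.0))
    return METAL_COST * metal_area + PLASTIC_COST * plastic_area


# Scan radii from 1 m to 10 m in 1 mm steps (the step is an arbitrary choice).
radii = [1.0 + 0.001 * k for k in range(9001)]
best_r = min(radii, key=cost)
print(f"r = {best_r:.2f} m, h = {height(best_r):.2f} m, cost = ${cost(best_r):,.0f}")
```

A brute-force scan like this is only practical because the problem has a single variable; the methods of the following sections find the same minimum with far fewer function evaluations.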
This type of problem is called optimization, since it is seeking the optimal value
of a function. This optimum can be the minimum value of the function, as it was for
the cost function in the preceding example, or its maximum, for example if one was
trying to design a fuel tank that can hold the greatest volume given a fixed budget.
In the former case the problem can be called minimization, and in the latter
maximization. It is important to note that these are not different types of problems
though, but simply sign differences on the same optimization problem.
9.2 Golden-Mean Search
must be in and which one can be safely discarded. Consider once again the example
of the function with a minimum between the bounds of x = 1 and x = 3 and which
evaluates to (1, 4) and (3, 3). Points evaluated at 1.66 and 2.33 divide the function
neatly into three equal intervals. Suppose the function evaluates to (1.66, 0.95) and
(2.33, 1.6). The fact that the function has a lower value at the one-third point than at
the two-third point means that a minimum must have been reached somewhere
within the first two intervals, to allow the function to turn around and increase again.
In fact, two cases are possible: either the minimum is in the interval [1, 1.66] and the
function is on an upward slope through 1.66 and 2.33 to 3, or the function is
decreasing from 1 through 1.66 to reach a minimum in the [1.66, 2.33] interval,
and is increasing again through 2.33 to 3. The only impossible case is for the
minimum to be in the [2.33, 3] interval, as that would require the function to have
two minima, one in the [1, 2.33] interval to allow the decrease from 1 to 1.66 and
increase from 1.66 to 2.33, and the second one in the [2.33, 3] interval, and it has
already been stated that the function has only one minimum within the bounds.
Consequently, the interval [2.33, 3] can safely be discarded, and the new bounds
can be reduced to [1, 2.33]. This situation is illustrated in Fig. 9.3.
To formalize the bound update rule demonstrated above, assume that, at iteration i,
the algorithm has a lower bound $x_{iL}$ and an upper bound $x_{iU}$ bracketing an optimum of
the function f(x). Two points are generated for the iteration within the bounds, $x_{i0}$ and
$x_{i1}$ where $x_{i0} < x_{i1}$, and they are evaluated. Then, in the case of a minimization
problem, the bounds are updated according to the following rule:
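$$[x_{iL}, x_{iU}] \rightarrow \begin{cases} [x_{iL}, x_{i1}] & \text{if } f(x_{i0}) < f(x_{i1}) \\ [x_{i0}, x_{iU}] & \text{if } f(x_{i0}) > f(x_{i1}) \end{cases} \tag{9.7}$$
and in the case of a maximization problem, according to the rule with the conditions reversed:
$$[x_{iL}, x_{iU}] \rightarrow \begin{cases} [x_{iL}, x_{i1}] & \text{if } f(x_{i0}) > f(x_{i1}) \\ [x_{i0}, x_{iU}] & \text{if } f(x_{i0}) < f(x_{i1}) \end{cases} \tag{9.8}$$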
Note that these rules are independent of the step between $x_{iL}$, $x_{i0}$, $x_{i1}$, and $x_{iU}$. The
decision to use the one-third and two-third points in the previous example was made
only for the sake of simplicity. More generally, the step can be represented by a
value λ, and the two points are computed as:
$$\begin{aligned}
x_{i0} &= \lambda x_{iL} + (1 - \lambda) x_{iU} \\
x_{i1} &= (1 - \lambda) x_{iL} + \lambda x_{iU}
\end{aligned} \tag{9.9}$$
Table 9.1 (step value of 2/3)
Iteration   x_iL   x_i0      x_i1      x_iU
0           0      0.33333   0.66666   1
1           0      0.22222   0.44444   0.66666
2           0      0.14815   0.29629   0.44444
3           0      0.09876   0.19753   0.29629
Table 9.2 (step value of 0.6180)
Iteration   x_iL   x_i0     x_i1     x_iU
0           0      0.3820   0.6180   1
1           0      0.2361   0.3820   0.6180
2           0      0.1459   0.2361   0.3820
3           0      0.0902   0.1459   0.2361
This time, when one inner value becomes the new bound, the interval is reduced
in such a way that the other inner value becomes the new opposite inner value. In
Table 9.2, whenever $x_{i1}$ becomes the new bound, $x_{i0}$ becomes $x_{i1}$. This is a natural
result of using the golden ratio: the ratio of the distance between $x_{iL}$ and $x_{iU}$ to the
distance between $x_{iL}$ and $x_{i1}$ is the same as the ratio of the distance between $x_{iL}$ and
$x_{i1}$ to the distance between $x_{i1}$ and $x_{iU}$, and the same as the ratio of the distance
between $x_{iL}$ and $x_{i1}$ to the distance between $x_{iL}$ and $x_{i0}$. Consequently, when the
interval between $x_{i1}$ and $x_{iU}$ is taken out and the new complete interval is $x_{iL}$ to $x_{i1}$,
$x_{i0}$ is at the correct distance from $x_{iL}$ to become the new inner point $x_{i1}$. Moreover,
with the step value λ = 0.6180, the interval is reduced at each iteration to 0.6180 of its
previous size, which is smaller than the reduction to 0.6666 of its previous size when
λ = 2/3. In other words, using λ = 0.6180 leads to an algorithm that both requires only
half the function evaluations in each iteration and converges faster. There are no
downsides.
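Why 0.6180 is the step value that makes an interior point reusable can be checked with a short derivation. This is a sketch that assumes the step value is written λ and that the interior points sit at fractions 1 − λ and λ of the interval; the interval length d is a symbol introduced here for convenience.

```latex
\begin{aligned}
d &= x_{iU} - x_{iL}, \qquad
x_{i0} = x_{iL} + (1 - \lambda)\, d, \qquad
x_{i1} = x_{iL} + \lambda\, d \\
\text{new interval } [x_{iL}, x_{i1}] \text{ of length } \lambda d &:\quad
x_{(i+1)1} = x_{iL} + \lambda (\lambda d) = x_{iL} + \lambda^2 d \\
x_{(i+1)1} = x_{i0} &\iff \lambda^2 = 1 - \lambda
\iff \lambda = \frac{\sqrt{5} - 1}{2} \approx 0.6180
\end{aligned}
```

With a step value of 2/3, by contrast, λ² ≈ 0.4444 differs from 1 − λ ≈ 0.3333, so neither interior point can be carried over and both must be evaluated again at every iteration.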
As with any iterative algorithm, it is important to define termination conditions.
For the golden-mean search, the two usual conditions apply. If the absolute error
between the bounds after the update is less than a predefined threshold, then an
accurate enough approximation of the optimum has been found and the algorithm
terminates in success. If however the algorithm first reaches a predefined maximum
number of iterations, it ends in failure. The
pseudocode of the complete golden-mean search method is given in Fig. 9.4.
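The bound update rule, the reuse of one interior point, and the two termination conditions can be put together in a few lines. The following Python function is a minimal sketch for a minimization problem, not a transcription of the figure's pseudocode; the function name, the default tolerance and iteration limit, and the choice to return the midpoint of the final bounds are assumptions made here.

```python
import math

GOLDEN = (math.sqrt(5.0) - 1.0) / 2.0  # step value, approximately 0.6180


def golden_mean_search(f, lower, upper, tol=1e-6, max_iter=200):
    """Minimize a function with a single minimum in [lower, upper].

    Returns (approximate minimizer, number of iterations used).
    """
    x0 = GOLDEN * lower + (1.0 - GOLDEN) * upper
    x1 = (1.0 - GOLDEN) * lower + GOLDEN * upper
    f0, f1 = f(x0), f(x1)
    for i in range(1, max_iter + 1):
        if f0 < f1:
            # Minimum is in [lower, x1]; the old x0 becomes the new x1.
            upper, x1, f1 = x1, x0, f0
            x0 = GOLDEN * lower + (1.0 - GOLDEN) * upper
            f0 = f(x0)
        else:
            # Minimum is in [x0, upper]; the old x1 becomes the new x0.
            lower, x0, f0 = x0, x1, f1
            x1 = (1.0 - GOLDEN) * lower + GOLDEN * upper
            f1 = f(x1)
        if abs(upper - lower) < tol:
            return (lower + upper) / 2.0, i
    raise RuntimeError("maximum number of iterations reached")


# Hypothetical test function with its minimum at x = 2.
x_min, n_iter = golden_mean_search(lambda x: (x - 2.0) ** 2, 0.0, 5.0)
```

Only one new function evaluation is needed per iteration, since the other interior point and its value are carried over. A maximization problem can be handled by passing the negated function.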
The convergence rate of this algorithm has already been hinted at previously,
when it was mentioned that each iteration reduces the interval by a factor of λ.
When the value of λ is set to the golden ratio and the initial interval between the
bounds is $h_0 = |x_{0L} - x_{0U}|$, then after the first iteration it will be $h_1 = \lambda h_0$, and after
the second iteration it will be:
$$h_2 = \lambda h_1 = \lambda^2 h_0 \tag{9.10}$$
and, more generally, after n iterations it will be:
$$h_n = \lambda^n h_0 \tag{9.11}$$
This is clearly a linear O(h) convergence rate. One advantage of Eq. (9.11) is that it
makes it possible to predict an upper bound on the number of iterations the
golden-mean algorithm will need to reach the desired error threshold. For example, if
the initial search interval was $h_0 = 1$ and an absolute error of 0.0001 is required, the
algorithm will need to perform at most $\log_{0.6180}(0.0001) \approx 19.1$ iterations, rounded
up to 20.
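The bound comes from solving the interval-reduction relation for the number of iterations. The short derivation below is a sketch; it assumes the interval after n iterations is $h_n = \lambda^n h_0$ with λ = 0.6180, and the symbol ε for the error threshold is introduced here for illustration.

```latex
\lambda^{n} h_0 \le \varepsilon
\quad\Longrightarrow\quad
n \ge \frac{\ln(\varepsilon / h_0)}{\ln \lambda}
= \frac{\ln(0.0001 / 1)}{\ln 0.6180}
\approx \frac{-9.2103}{-0.4813}
\approx 19.1
```

The inequality changes direction because ln λ is negative. The first whole number of iterations that satisfies it is 20.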
[Fig. 9.4: Pseudocode of the golden-mean search method]
Example 9.1
A solar panel is connected to a house that is also connected to the city's power grid.
When the house consumes more power than can be generated by the solar
panel it draws from the city, and when it consumes less the extra power is fed
into the city's power grid. The power consumption of the house over time has
been modelled as P(t) = t(t − 1), where a positive value is extra power
generated by the house and a negative value is power drain from the city.
Find the maximum amount of power the house will need from the city over
the time interval [0, 2] to an absolute error of less than 0.01 kW.
Solution
Begin by noting that Eq. (9.11) gives:
$$0.01 = 2 \times 0.6180^n \quad\Rightarrow\quad n \approx 11.01$$
In other words, the solution should be found by the twelfth iteration of the
golden-mean method.
The first two middle points computed from Eq. (9.9) are:
$$x_{00} = 0.6180 \times 0 + 0.3820 \times 2 = 0.76393$$
$$x_{01} = 0.3820 \times 0 + 0.6180 \times 2 = 1.2361$$
The power consumption can then be evaluated from the model at those two
points:
$$P(x_{00}) = 0.76393 \times (0.76393 - 1) = -0.18034$$
$$P(x_{01}) = 1.2361 \times (1.2361 - 1) = 0.29180$$
The largest draw from the city is the most negative value of P(t), so this is a
minimization problem. The rule of Eq. (9.7) applies, and the
upper bound is replaced by $x_{01}$. The absolute error after this first iteration is
|0 − 1.2361| = 1.2361. At the second iteration, the new middle points are:
$$x_{10} = 0.6180 \times 0 + 0.3820 \times 1.2361 = 0.47214$$
$$x_{11} = 0.3820 \times 0 + 0.6180 \times 1.2361 = 0.76393$$
Notice that $x_{11}$ is exactly the same as $x_{00}$; this was expected from the earlier
explanations, and as a result that middle point does not need to be
re-evaluated: its value can simply be carried over from the previous iteration.
The other middle point does need to be evaluated:
$$P(x_{10}) = 0.47214 \times (0.47214 - 1) = -0.24922$$
x_iL      x_iU      x_i0      x_i1      P(x_i0)    P(x_i1)    E_i
0         2         0.76393   1.2361    -0.18034    0.29180   1.2361
0         1.2361    0.47214   0.76393   -0.24922   -0.18034   0.76393
0         0.76393   0.29180   0.47214   -0.20665   -0.24922   0.47213
0.29180   0.76393   0.47214   0.58359   -0.24924   -0.24301   0.29179
0.29180   0.58359   0.40325   0.47214   -0.24064   -0.24922   0.18034
0.40325   0.58359   0.47214   0.51471   -0.24922   -0.24978   0.11145
0.47214   0.58359   0.51471   0.54102   -0.24978   -0.24832   0.06888
[Figure: Power versus Time]
9.3 Newton's Method
It is well-known that the optimum of a function f(x) is a turning point where its
derivative $f^{(1)}(x) = 0$. This means that an optimization method for f(x) is the same as
a root-finding method for $f^{(1)}(x)$, and any of the root-finding methods learned in
Chap. 8 could be used. Most interestingly, if the first and second derivatives of the
function are known, it is possible to use Newton's method, the most efficient of the
root-finding methods learned. Recall from Sect. 8.6 that the equation for Newton's
method to find a root of f(x) is:
$$x_{i+1} = x_i - \frac{f(x_i)}{f^{(1)}(x_i)} \tag{9.12}$$
Applying the same equation to find a root of the derivative $f^{(1)}(x)$ instead gives:
$$x_{i+1} = x_i - \frac{f^{(1)}(x_i)}{f^{(2)}(x_i)} \tag{9.13}$$
As was proven in Chap. 8 using Taylor series, this method will iteratively converge
towards the nearest root of $f^{(1)}(x)$ at a quadratic rate $O(h^2)$.
One issue is that there is no indication in Eq. (9.13) as to whether this root will
be a maximum or a minimum of f(x); the root of the derivative only indicates that it
is an optimum. Real-world functions will usually have both maxima and minima,
and a problem will require finding a specific one of the two, not just the nearest
optimum regardless of whether it is a maximum or a minimum. One way of
checking if the function is converging on a maximum or a minimum is of course
to evaluate $f(x_i)$ and see if the values are increasing or decreasing. However, this
will require additional function evaluations, since evaluating f(x) is not needed for
Newton's method in Eq. (9.13), as well as a memory of one past value $f(x_{i-1})$ to
compare $f(x_i)$ to. To avoid these added costs in the algorithm, another way of
checking using only information available in Eq. (9.13) is to consider the sign of
the second derivative at the final value $x_i$. If $f^{(2)}(x_i) < 0$ the optimum is a maximum,
and if $f^{(2)}(x_i) > 0$ then the optimum is a minimum. If the method is found to have
converged to the wrong type of optimum, then the only solution is to start over from
another, more carefully chosen initial point.
The same three halting conditions seen for Newton's method in Chap. 8 still
apply. To review, if the relative error between two successive approximations is
less than a predefined threshold, then the iterative algorithm has converged
successfully. If however a preset maximum number of iterations is reached first,
then the method has failed to converge. Likewise, if the second derivative evaluates
to $f^{(2)}(x_i) = 0$, then Eq. (9.13) requires a division by zero and the
method cannot continue. The pseudocode for Newton's optimization method is
given in Fig. 9.5; it can be seen that it is only a minor modification of the code of
Newton's root-finding method presented in the previous chapter, to replace the
FUNCTION F(x)
RETURN evaluation of the target function at point x
END FUNCTION
FUNCTION Derive(F(x))
RETURN evaluation of the derivative of the function at point x
END FUNCTION
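Newton's optimization method, with its three halting conditions and the second-derivative test for the type of optimum, can be sketched in Python as follows. This is an illustrative implementation and not the pseudocode of the figure: the function name, the default tolerance and iteration limit, and the cubic test function are assumptions made here, and the derivatives are supplied by the caller as ordinary functions.

```python
def newton_optimize(d1, d2, x0, tol=1e-6, max_iter=50):
    """Find an optimum of f from its first (d1) and second (d2) derivatives.

    Returns (x, kind), where kind is "maximum", "minimum" or "undetermined".
    """
    x = x0
    for _ in range(max_iter):
        curvature = d2(x)
        if curvature == 0.0:
            # Failure: the update would divide by zero.
            raise ZeroDivisionError("second derivative evaluates to zero")
        x_new = x - d1(x) / curvature
        # Success: relative error between successive approximations.
        if abs(x_new - x) <= tol * abs(x_new):
            final = d2(x_new)
            if final < 0.0:
                return x_new, "maximum"
            if final > 0.0:
                return x_new, "minimum"
            return x_new, "undetermined"
        x = x_new
    # Failure: no convergence within the allowed number of iterations.
    raise RuntimeError("maximum number of iterations reached")


# Hypothetical example: f(x) = x^3 - 3x has a minimum at 1 and a maximum at -1.
first = lambda x: 3.0 * x**2 - 3.0
second = lambda x: 6.0 * x
x_right, kind_right = newton_optimize(first, second, 2.0)
x_left, kind_left = newton_optimize(first, second, -2.0)
```

The two calls show the point made in the text: the method converges on whichever optimum is nearest to the initial value, and only the sign of the second derivative tells which type was found.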
For an initial value, since none are provided but the spike is said to be
one-second-long, the search could start in the middle of the interval, at
$t_0 = 0.5$. In that case, Eq. (9.13), using the model's derivatives, gives:
$$t_1 = 0.5 - \frac{0.565}{-2.979} = 0.6900\ \text{s}$$
$$E_1 = \left| \frac{0.5 - 0.6900}{0.6900} \right| = 27.501\,\%$$
The next iterations are given in the table below, and illustrated in the
accompanying figure along with the power function:
i   t_i (s)   P(1)(t_i)     P(2)(t_i)   t_i+1 (s)   E_i (%)
0   0.5        0.565        -2.979      0.6900      27.501
1   0.690     -0.360        -7.197      0.6397       7.813
2   0.640     -0.035        -5.832      0.6337       0.945
3   0.634     -0.0005       -5.682      0.6336       0.013
4   0.634     -8 × 10^-8    -5.680      0.6336       0.000
[Figure: Power versus Time]
9.4 Quadratic Optimization
It has been observed several times already that an optimum in a function f(x) is a
turning point where the function turns around. Locally, the region around the turning
point could be approximated as a degree-2 polynomial, a parabola p(x). As was learned in
Chap. 5, all that is required for this is to be able to evaluate three points on the
function to interpolate the polynomial from. The situation is illustrated in Fig. 9.6.
The equation for a degree-2 polynomial is:
$$p(x) = c_0 + c_1 x + c_2 x^2 \tag{9.14}$$
where p(x) = f(x) at three points $x = x_{i-2}$, $x_{i-1}$, and $x_i$. Given this information,
Chap. 5 has covered several methods to discover the values of the coefficients,
such as solving the matrix-vector system of the Vandermonde method:
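$$\begin{bmatrix} 1 & x_{i-2} & x_{i-2}^2 \\ 1 & x_{i-1} & x_{i-1}^2 \\ 1 & x_i & x_i^2 \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix} =
\begin{bmatrix} f(x_{i-2}) \\ f(x_{i-1}) \\ f(x_i) \end{bmatrix} \tag{9.15}$$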
oldest of the three points used in the interpolation. The region covered by the
interpolation becomes iteratively a smaller neighborhood of the turning point, the
interpolated polynomial thus becomes a better approximation of the function in
that region, and the optimum of the parabola becomes closer to the function's
optimum.
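For the iterative version of this approach, given three past approximations of the
optimum $x_{i-2}$, $x_{i-1}$, and $x_i$, it is possible to interpolate this iteration's polynomial
$p_i(x)$. Then, the new approximation of the optimum is computed as:
$$x_{i+1} = -\frac{c_{1,i}}{2 c_{2,i}} \tag{9.17}$$
which, written directly in terms of the three points, is:
$$x_{i+1} = \frac{f(x_{i-2})\left(x_{i-1}^2 - x_i^2\right) + f(x_{i-1})\left(x_i^2 - x_{i-2}^2\right) + f(x_i)\left(x_{i-2}^2 - x_{i-1}^2\right)}{2 f(x_{i-2})\left(x_{i-1} - x_i\right) + 2 f(x_{i-1})\left(x_i - x_{i-2}\right) + 2 f(x_i)\left(x_{i-2} - x_{i-1}\right)} \tag{9.18}$$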
This new point replaces $x_{i-2}$ to compute $p_{i+1}(x)$ in the next iteration. There are three
halting conditions to the iterative algorithm. If the relative error between two
successive approximations of the optimum is less than a preset threshold, then
the algorithm has successfully converged. On the other hand, if the algorithm first
reaches a preset maximum number of iterations, it has failed. There is a second
failure condition to watch out for: if the interpolated polynomial becomes a
degree-1 polynomial, a straight line, then the algorithm has diverged and is no longer
in the region of the turning point at all. From Eq. (9.17), it can be seen that in that
case the equation would have a division by zero, a sure sign of divergence.
The pseudocode of a version of the quadratic optimization method using the
matrix-vector system of Eq. (9.15) and including the additional failure condition
check is presented in Fig. 9.7.
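The iteration and its three halting conditions can also be sketched directly in Python. Note that this sketch uses the closed-form expression of Eq. (9.18) rather than the matrix-vector system of Eq. (9.15) used by the figure's pseudocode; the function name, default tolerance, iteration limit and test function are assumptions made here for illustration.

```python
def quadratic_optimize(f, x_a, x_b, x_c, tol=1e-6, max_iter=100):
    """Iterate the three-point parabola formula from three starting points."""
    xs = [x_a, x_b, x_c]          # x_{i-2}, x_{i-1}, x_i
    fs = [f(x) for x in xs]
    for _ in range(max_iter):
        (xp, xq, xr), (fp, fq, fr) = xs, fs
        numerator = (fp * (xq**2 - xr**2) + fq * (xr**2 - xp**2)
                     + fr * (xp**2 - xq**2))
        denominator = 2.0 * (fp * (xq - xr) + fq * (xr - xp) + fr * (xp - xq))
        if denominator == 0.0:
            # Failure: the interpolated polynomial is a straight line.
            raise ZeroDivisionError("interpolated polynomial has no optimum")
        x_new = numerator / denominator
        # Success: relative error between successive approximations.
        if abs(x_new - xr) <= tol * abs(x_new):
            return x_new
        # The new point replaces the oldest of the three.
        xs = [xq, xr, x_new]
        fs = [fq, fr, f(x_new)]
    # Failure: no convergence within the allowed number of iterations.
    raise RuntimeError("maximum number of iterations reached")


# Hypothetical test function: a parabola with its minimum at x = 1.
x_opt = quadratic_optimize(lambda x: x**2 - 2.0 * x, 0.0, 0.5, 2.0)
```

When the function is itself a parabola, as in this test, the first interpolation is already exact and the second iteration only confirms it. For other functions the approximation improves gradually as the three points close in on the optimum.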
The convergence rate for this method is $O(h^{1.497})$, although the proof is outside
the scope of this book. This method thus converges more efficiently than the
golden-mean method, which is normal when comparing an open method like this
one to a closed method that must maintain brackets. On the other hand it converges
more slowly than Newton's method. Again, this was to be expected: Newton's
method uses actual features of the function, namely its first and second derivatives,
to find the optimum, while this method uses an interpolated approximation of the
function to do it, and therefore cannot get as close at each iteration.
Example 9.3
A sudden electrical surge is known to cause a one-second-long power spike in
an electrical system. The behavior of the system during the spike has been
studied, and during that event the power (in kW) is modelled as:
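$$t_{-2} = 0\ \text{s} \qquad p(t_{-2}) = 0\ \text{kW}$$
$$t_{-1} = 0.5\ \text{s} \qquad p(t_{-1}) = 0.448\ \text{kW}$$
$$t_0 = 1\ \text{s} \qquad p(t_0) = -0.159\ \text{kW}$$
$$\begin{bmatrix} 1 & 0 & 0 \\ 1 & 0.5 & 0.25 \\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} c_{0,0} \\ c_{1,0} \\ c_{2,0} \end{bmatrix} =
\begin{bmatrix} 0 \\ 0.448 \\ -0.159 \end{bmatrix}$$
$$p_0(x) = 0 + 1.951x - 2.110x^2$$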
The optimum of this parabola, the first approximation of the optimum computed by
the method, is then obtained from the derivative of the parabola, as
given in Eq. (9.17):
$$t_1 = \frac{-1.951}{2 \times (-2.110)} = 0.4624\ \text{s}$$
$$E_1 = \left| \frac{1 - 0.462}{0.462} \right| = 116.249\,\%$$
t_i-2 (s)   t_i-1 (s)   t_i (s)   t_i+1 (s)   E_i (%)
0.0000      0.5000      1.0000    0.4624      116.249
0.5000      1.0000      0.4624    0.5719       19.136
1.0000      0.4624      0.5719    0.5849        2.237
0.4624      0.5719      0.5849    0.6538       10.527
0.5719      0.5849      0.6538    0.6342        3.088
0.5849      0.6538      0.6342    0.6329        0.207
0.6538      0.6342      0.6329    0.6336        0.117
0.6342      0.6329      0.6336    0.6336        0.000
[Figure: Power versus Time]
The method converges in eight iterations on t = 0.6336 s, the same optimum that
Newton's method found in five iterations in Example 9.2. Having to
carry three bad initial guesses for several iterations slows down the initial
convergence; in iteration 3, the first one computed using only approximations
computed in previous iterations, the result shows a large jump in accuracy,
from a relative error of 7.8 % compared to the real optimum to 3.1 %.
9.5 Gradient Descent
The gradient descent optimization method is also known by its more figurative
name of hill climbing. It has been described as what you would do if you needed to
find the top of Mount Everest with amnesia in a fog. What would this unfortunate
climber, unable to remember where they've been or to see where they are going,
do? Simply feel around the ground in one step in every direction to find the one that
goes up the fastest, and proceed along that way step by step. Once the climber has
reached a point where the ground only goes down in all directions, they can assume
they have reached the top of the mountain. In mathematical terms, the direction of
the step that gives the greatest change (be it increase or decrease) in the value of a
function is called its gradient, and the basic idea of the gradient descent method is
simply to take step after step along the gradient until a point is reached where no
step can be taken to improve the value.
The gradient descent is different from the other methods covered in this chapter
so far by the fact that it is a multidimensional optimization method, instead of a
one-dimensional one. In fact, as will be shown, it makes it possible to reduce the
multidimensional optimization problem into a one-dimensional problem of optimizing the step size along the gradient that optimizes the multidimensional
function.
FUNCTION F(x)
RETURN evaluation of the target function at point x
END FUNCTION
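$$\nabla f(\mathbf{x}) = \begin{bmatrix}
\dfrac{\partial f(\mathbf{x})}{\partial x_0} \\
\dfrac{\partial f(\mathbf{x})}{\partial x_1} \\
\vdots \\
\dfrac{\partial f(\mathbf{x})}{\partial x_{n-1}}
\end{bmatrix} \tag{9.19}$$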
Knowing the direction of maximum change limits the options for the orientation of
the step from 360° around the current point to only two directions, either positively
or negatively along the line of the gradient. In fact this is the distinction between
maximization and minimization in this method: for a maximization problem the
algorithm should positively follow the gradient to increase as quickly as possible,
and for a minimization problem the algorithm should go in the negative direction to
decrease as quickly as possible. Thus, the next point $\mathbf{x}_{i+1}$ will be the current
iteration's $\mathbf{x}_i$ plus or minus one step along the gradient at that point, as so:
$$\mathbf{x}_{i+1} = \mathbf{x}_i \pm h \nabla f(\mathbf{x}_i) \tag{9.20}$$
Example 9.4
Compute the gradient of this function:
f x x20 3x0 x1 x31 x2 4x23 x1 x2 x3
Solution
Following Eq. (9.19), the gradient is:
3
f x
6 x 7
0 7
6
3
2
6 f x 7
2x0 3x1
7
6
6 x 7 6 3x1 3x2 x2 x3 7
1
1 7
7
6
f x 6
6 f x 7 4
5
1 x2 x3
7
6
6 x 7
8x
x
x
3
1 2
2 7
6
4 f x 5
2
x3
The second question is how to pick the step size. After all, a value of h too large
will give a poor approximation, as the method will step over the optimum. On the
other hand, a small step will make the algorithm converge slowly. One solution is to
use an iteratively decreasing step size hi, that begins with a large value to take large
steps towards the optimum and then decreases it in order to pinpoint the optimum.
A better solution though would be to actually compute the length of the step $h_i$ to
take at each iteration to get as close as possible to the optimum; in other words, to
optimize the step size at that iteration! The optimal step size is of course the one that
will give the value of $\mathbf{x}_{i+1}$, as computed in Eq. (9.20), which will in turn allow $f(\mathbf{x}_{i+1})$
to evaluate to its optimal value. This leads to a simple but important realization:
the only unknown value in Eq. (9.20) missing to compute the value of $\mathbf{x}_{i+1}$ is the
value of $h_i$, and as a result the value of the function evaluation $f(\mathbf{x}_{i+1})$ will only vary
based on $h_i$:
$$f(\mathbf{x}_{i+1}) = f(\mathbf{x}_i \pm h_i \nabla f(\mathbf{x}_i)) = g(h_i) \tag{9.21}$$
In other words, optimizing the value of the multidimensional function $f(\mathbf{x}_{i+1})$ is the
same as optimizing the single-variable function $g(h_i)$ obtained by the simple
variable substitution of Eq. (9.20). And a single-variable, single-dimension
optimization problem is one that can easily be done using the golden-mean method,
Newton's method, or the quadratic optimization method. The optimal value of $h_i$
that is found is the optimal step size to use in iteration i to compute $\mathbf{x}_{i+1}$;
it corresponds to the step to take to get to the local optimum along the gradient line
$\nabla f(\mathbf{x}_i)$.
To summarize, at each iteration i, the gradient descent algorithm will perform
these steps:
1. Evaluate the gradient at the current point, $\nabla f(\mathbf{x}_i)$, using Eq. (9.19).
2. Rewrite the function $f(\mathbf{x}_{i+1})$ with the variable substitution of Eq. (9.21) as
$f(\mathbf{x}_i \pm h_i \nabla f(\mathbf{x}_i))$ to get a function of $h_i$.
3. Use the method of your choice to find the value of $h_i$ that is the optimum of
$f(\mathbf{x}_i \pm h_i \nabla f(\mathbf{x}_i))$.
4. Compute the next value of $\mathbf{x}_{i+1}$ using Eq. (9.20) as $\mathbf{x}_i \pm h_i \nabla f(\mathbf{x}_i)$.
5. Evaluate termination conditions, either to terminate or continue the iterations.
The pseudocode of an algorithm implementing all these steps is presented in
Fig. 9.8. The success termination condition for this algorithm is that the Euclidean
distance between $\mathbf{x}_i$ and $\mathbf{x}_{i+1}$ becomes less than a preset error threshold, at which
point the method has converged on the optimum of the function with sufficient
accuracy. There are two failure termination conditions. The first is, as always, if the
algorithm reaches a preset maximum number of iterations without converging. The
second condition is if the gradient $\nabla f(\mathbf{x}_i)$ evaluates to a vector of zeros. From
Eq. (9.20), it can be seen that in that case, the method is stuck in place and $\mathbf{x}_{i+1} = \mathbf{x}_i$.
Mathematically, a point where the gradient is null is a plateau in the function, a
point where there is no change to the evaluation of the function in any direction.
The method cannot continue anymore since all orientations from that point are
equivalent and none improve the function at all. Note however that, since $\mathbf{x}_{i+1} = \mathbf{x}_i$
in that situation, the Euclidean distance between $\mathbf{x}_i$ and $\mathbf{x}_{i+1}$ will be zero, which
corresponds to the success condition despite it being actually a failure of the
method. It is thus important to evaluate this condition first, before the Euclidean
distance of the success condition, to avoid potentially disastrous mistakes in the
interpretation of the results.
FUNCTION F(x)
RETURN evaluation of the n-dimensional target function at point x
END FUNCTION
FUNCTION Gradient(F(x))
RETURN vector of length n of evaluation of the partial derivatives of
the function with respect to the n variables at point x
END FUNCTION
Example 9.5
The strength of the magnetic field of a sheet of magnetic metal has been
modelled by the 2D function:
$$H(\mathbf{x}) = x^2 - 4x - 2xy + 2y^2 - 2y + 14$$
which is in amperes per meter. The origin (0,0) corresponds to the center of
the sheet of metal. Determine the point with the weakest magnetic field, to the
nearest centimeter. Use the corner of the sheet at (4, 4) meters as a starting
point.
Solution
The first thing to do is to compute the gradient of the function using Eq.
(9.19). This gives the vector of functions:
$$\nabla H(\mathbf{x}) = \begin{bmatrix} 2x - 2y - 4 \\ -2x + 4y - 2 \end{bmatrix}$$
For the first iteration, the value of $\mathbf{x}_0$ is given as $[4, 4]^T$. Since this is a
minimization problem, Eq. (9.20) becomes:
$$\mathbf{x}_{i+1} = \mathbf{x}_i - h_i \nabla H(\mathbf{x}_i)$$
And for the first iteration, it evaluates to:
$$\mathbf{x}_1 = \begin{bmatrix} 4 \\ 4 \end{bmatrix} - h_0 \begin{bmatrix} 2 \times 4 - 2 \times 4 - 4 \\ -2 \times 4 + 4 \times 4 - 2 \end{bmatrix}
= \begin{bmatrix} 4 \\ 4 \end{bmatrix} - h_0 \begin{bmatrix} -4 \\ 6 \end{bmatrix}$$
The next step of the iteration is the variable substitution of $\mathbf{x}_1$ to make $H(\mathbf{x}_1)$
into a function of $h_0$ that can be easily optimized. To do this, write down the
original H(x) replacing x with $(4 + 4h_0)$ and y with $(4 - 6h_0)$. The new
equation, which will be labelled $g(h_0)$ for convenience, is:
$$\begin{aligned}
g(h_0) &= (4 + 4h_0)^2 - 4(4 + 4h_0) - 2(4 + 4h_0)(4 - 6h_0) + 2(4 - 6h_0)^2 - 2(4 - 6h_0) + 14 \\
&= 136h_0^2 - 52h_0 + 6
\end{aligned}$$
The step $h_0$ is the optimum of $g(h_0)$, which can be easily found using any
single-variable optimization method, or by setting the derivative of $g(h_0)$ to
zero (which is the most straightforward way to get the optimum of a degree-2
polynomial). The value is $h_0 = 0.191$. Using this step value in Eq. (9.20) gives
the next point $\mathbf{x}_1 = [4.764, 2.854]^T$. The magnetic field model and the first
x_i                ∇H(x_i)              h_i     x_i+1              E_i
[4, 4]             [-4, 6]              0.191   [4.764, 2.854]     1.378
[4.764, 2.854]     [-0.176, -0.112]     1.300   [4.994, 3.006]     0.276
[4.994, 3.006]     [-0.024, 0.035]      0.191   [4.999, 2.999]     0.008
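The iterations of Example 9.5 can be reproduced with a short program that combines the five steps of the gradient descent algorithm with a one-dimensional search for the step size. The following is a sketch under stated assumptions: the function names, the golden-mean line search over the step interval [0, 2], and the tolerances are choices made here for illustration, and the model H is written out with the signs that are consistent with the gradient and the iterates of the example.

```python
import math

GOLDEN = (math.sqrt(5.0) - 1.0) / 2.0


def golden_min(g, lo, hi, tol=1e-8):
    """One-dimensional golden-mean minimization of g on [lo, hi]."""
    a = GOLDEN * lo + (1.0 - GOLDEN) * hi
    b = (1.0 - GOLDEN) * lo + GOLDEN * hi
    ga, gb = g(a), g(b)
    while hi - lo > tol:
        if ga < gb:
            hi, b, gb = b, a, ga
            a = GOLDEN * lo + (1.0 - GOLDEN) * hi
            ga = g(a)
        else:
            lo, a, ga = a, b, gb
            b = (1.0 - GOLDEN) * lo + GOLDEN * hi
            gb = g(b)
    return (lo + hi) / 2.0


def gradient_descent(f, grad, start, tol=0.01, max_iter=100, h_max=2.0):
    """Minimize f by stepping against its gradient with an optimized step."""
    x = list(start)
    for _ in range(max_iter):
        g = grad(x)
        # Check the zero-gradient failure first, before the success test.
        if all(c == 0.0 for c in g):
            raise RuntimeError("gradient is zero: the method is stuck")
        # Substitute x - h * gradient into f to get a function of h alone.
        line = lambda h: f([xi - h * gi for xi, gi in zip(x, g)])
        h = golden_min(line, 0.0, h_max)   # h_max is an assumed search bound
        x_new = [xi - h * gi for xi, gi in zip(x, g)]
        if math.dist(x, x_new) < tol:      # Euclidean distance (success)
            return x_new
        x = x_new
    raise RuntimeError("maximum number of iterations reached")


def H(p):
    x, y = p
    return x**2 - 4*x - 2*x*y + 2*y**2 - 2*y + 14


def grad_H(p):
    x, y = p
    return [2*x - 2*y - 4, -2*x + 4*y - 2]


weakest = gradient_descent(H, grad_H, [4.0, 4.0])
```

Starting from (4, 4) with a threshold of 0.01 m, the sketch stops after three steps close to the point (5, 3), in line with the table of the example.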
9.6 Stochastic Optimization
to behave in only one way, to always take the best step they can find, the one that
makes them converge on the nearest optimum in the fastest way possible. They are
deterministic methods: their behavior is entirely known and predictable. Given the
same function and initial values, they will always compute the same steps and
converge on the same optimum.
The alternative to a deterministic method is a stochastic method: an algorithm
that includes an element of randomness. This random element is what will allow the
method to have two different behaviors, by giving it a chance to escape local optima
but not the global optimum. This can be implemented in practice in a number of
ways, such as for example by including a random variable in an equation or by
having a decision step that includes an element of chance. It should be noted that a
random element is not necessarily one whose value is a result of complete and
unbiased chance, such as a lottery draw. Rather, it is simply a term whose result is
not known for sure in advance. It is completely acceptable to skew the probabilities
towards a preferred outcome, or to change the probabilities as the algorithm progresses to reduce the impact of randomness over the iterations. A flip of a coin with
a 99 % probability of landing on heads is still a stochastic event: even though one
outcome is more likely than the other, it is still a result of chance.
Using randomness in optimization methods eliminates some of the certainty
offered by deterministic algorithms. As mentioned already, a stochastic method is
no longer guaranteed to converge on the nearest local optimum, which can be a
desirable feature. However, this should not be mistaken for a certainty to converge
on the global optimum; stochastic methods can make no such guarantee. Given the
use of randomness in their algorithms and decision-making, no outcome can be
certain or guaranteed. Another important point to keep in mind is that running the
same stochastic method twice on the same function with the same initial values can
lead to two very different final results, unlike with deterministic methods, which are
guaranteed to give the exact same result both times. The reason for this is of course the
inclusion of a random element in the algorithm, which can take very different
values in successive runs.
Stochastic optimization algorithms are an intense area of ongoing research.
Dozens of algorithms already exist, and new algorithms, variations, and
enhancements are being proposed every year. Popular algorithms include genetic
algorithms, ant colony algorithms, particle swarm algorithms, and many others.
A complete review of these methods is beyond the scope of this book. The next two
sections will introduce two stochastic optimization methods, to give an overview of
this class of optimization methods.
9.7
The random brute-force search is the simplest stochastic search method available.
However, despite its inefficiency, it remains a functional and useful tool. Moreover,
its simplicity makes it a good method to study as an initiation to stochastic
optimization.
A brute-force search refers to any search algorithm that systematically tries
possible solutions one after the other until it finds one that is acceptable or until a
preset maximum number of attempts. A brute-force optimization algorithm would
thus simply evaluate value after value for a given time, and return the value with the
optimal result as its solution at the end. And a random brute-force search is one that
selects the values to evaluate stochastically.
While the random brute-force search may seem unsophisticated, it does have the
advantage of being able to search any function, even one that has a complex and
irregular behavior, multiple local optima, and even discontinuities. By trying points
at random and always keeping the optimal one, it is likely to get close to the global
optimum and certain not to get stuck in a local optimum. The random brute-force
search can be useful to deal with black box problems, when no information is
available on the behavior of the function being optimized. The method makes no
assumptions on the function and does not require a starting point, an interpolation or
a derivative; it only needs an interval to search in. It can therefore perform its search
and generate a good result in a situation of complete ignorance.
However, when this method does get a point close to the global optimum, it does
not improve on it except by possibly randomly generating an even closer point.
In other words, while the random brute-force approach is likely to find a point close
to the global optimum, it is very unlikely to actually find the global optimum itself.
For that reason, the algorithm is often followed by a few iterations of a deterministic
algorithm such as Newton's method, which can easily and quickly converge to the
global optimum from the starting point found by the brute-force search.
It should be instinctively clear that testing more points increases the algorithm's
odds of getting closer to the optimum. However, even that rule of thumb is not a
certainty given the stochastic nature of the algorithm. It could easily be the case that
in one short run, the algorithm fortuitously generates a point very close to the
optimum while in another much longer run, the algorithm is a lot unluckier and does
not get as close. This is one of the risks of working with stochastic algorithms.
The iterative algorithm of the random brute-force search is straightforward: at
each iteration, generate a random value within the search interval and evaluate it.
Compare the result to the best value discovered so far in previous iterations.
If the new result is better, keep it as the new best value; otherwise, discard
it. This continues until the one and only termination condition, that a maximum
number of iterations is reached.
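The whole algorithm fits in a few lines of Python. The following is a minimal sketch; the function name, the use of a uniform distribution over a rectangular search interval, the optional seed, and the test function are assumptions made here for illustration.

```python
import random


def random_search(f, bounds, n_points, maximize=True, seed=None):
    """Evaluate n_points random points inside bounds and keep the best one.

    bounds is a list of (low, high) pairs, one per dimension.
    """
    rng = random.Random(seed)
    best_x, best_val = None, None
    for _ in range(n_points):
        # Generate a random point within the search interval and evaluate it.
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        val = f(x)
        # Keep the new point only if it is better than the best so far.
        if best_val is None or (val > best_val if maximize else val < best_val):
            best_x, best_val = x, val
    return best_x, best_val


# Hypothetical test function with its maximum of 0 at (1, -2).
peak = lambda p: -(p[0] - 1.0) ** 2 - (p[1] + 2.0) ** 2
point, value = random_search(peak, [(-5.0, 5.0), (-5.0, 5.0)], 5000, seed=0)
```

Fixing the seed makes a run repeatable, which is convenient for testing; leaving it at None gives the different results from run to run described in the text.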
Example 9.6
A two-dimensional periodic signal s(x,y) is generated by the combination of
two sinusoids, modelled by this equation:
sx; y ey cos 3x ex cos 3y
2
Optimum found       Optimum value
(0.0284, 0.1005)    1.6886
(0.0219, 0.0730)    1.9453
(0.0090, 0.0459)    1.9581
(0.0317, 0.0011)    1.9410
(0.0067, 0.0161)    1.8775
(0.0137, 0.0010)    1.9999
(0.0064, 0.0005)    1.9922
(0.0098, 0.0149)    1.9909
(0.0108, 0.0065)    1.9816
(0.0245, 0.0159)    1.9981
The table also shows that, on average, checking more points leads to a
better result: the five runs of 6000 points or more all returned better maxima
than the five runs with 5000 points or less. However, this relationship is not
perfect. The maximum found in the run with 6000 points is the single best one
of all ten runs, better even than the maximum found in the run with 10,000
points, while the one found in the run at 5000 points is second-worst, better
only than the one found in 1000 points. This illustrates nicely one of the
important differences between stochastic and deterministic optimization
mentioned earlier. In a deterministic algorithm, more iterations will always
lead to a better result (unless the method diverges), while in a stochastic
search, more iterations will on average, but not necessarily, improve the result.
9.8 Simulated Annealing
The parallel between real-world annealing and simulated annealing is straightforward: in one case an atom moves towards an optimal position in the crystal while
avoiding getting stuck in attractive but suboptimal positions, and in the other steps
are taken on a function to find the global optimum while avoiding getting trapped
in a local optimum. But the real insight comes by studying how to escape the local
optimum. In annealing, this is done by heating the metal to give it energy. When
the metal is hot and the atoms are energized, they are more likely to move out of
the local optimum (a high-energy movement), and as the metal cools down over
time the atoms are less energized and more likely to simply converge on
the nearest (hopefully global) optimum. This can be simulated by using a temperature parameter that starts off at a high value and decreases iteratively.
This temperature is directly related to the probability of a bad step (one that causes
the value of the function to become less optimal) being accepted. At the higher
initial value, bad moves are accepted more often and steps are taken away from the
local optimum, and at the lower temperature of later iterations, bad steps are
unlikely to be accepted and the method converges.
An iteration of the simulated annealing algorithm is thus:
1. Select a step $h_i$ in a random orientation around $x_i$ to generate $x_{i+1}$. Compute $\Delta f_i$, the
difference in function evaluation between $f(x_i)$ and $f(x_{i+1})$, defined as:
$$\Delta f_i = f(x_i) - f(x_{i+1}) \tag{9.22}$$
for maximization problems, or as:
$$\Delta f_i = f(x_{i+1}) - f(x_i) \tag{9.23}$$
for minimization problems. Either way, the value of $\Delta f_i$ will be negative if the
step brings the method closer to an optimum and positive if it moves away from
it, and the magnitude of the value will be proportional to the significance of the
change.
2. If the value of $\Delta f_i$ is negative, accept the step to $x_{i+1}$.
3. If the value of $\Delta f_i$ is positive, compute a probability of accepting the step based
on the current temperature parameter value $T_i$:
$$P = e^{-\Delta f_i / T_i} \tag{9.24}$$
4. Reduce the temperature and step size for the next iteration.
5. Terminate the search if the termination condition is reached, which is that the temperature $T_{i+1}$ has dropped to 0.
These steps are implemented in the pseudocode of Fig. 9.10.
The stochastic behavior of the method thus comes from step 3, where a step that
worsens the value of the function might or might not be accepted based on a
probability P. Equation (9.24) shows that P depends on two variables: the change
in value of the step $\Delta f_i$, so that steps with a weak negative impact are more likely to
be accepted than steps that massively worsen results, and the temperature $T_i$, so that
a bad move is more likely to be accepted at the beginning of the method when the
temperature is high than at the end when the temperature is low.
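The five steps of an iteration can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pseudocode of Fig. 9.10: the function is restricted to one dimension and to maximization, the temperature and step size are assumed to shrink linearly (the text does not prescribe a cooling schedule), and the names `simulated_annealing` and `bumpy` are hypothetical.

```python
import math
import random


def simulated_annealing(f, x0, step0, temp0, n_iterations, seed=None):
    """Maximize a one-dimensional function f, starting the search at x0."""
    rng = random.Random(seed)
    x, value = x0, f(x0)
    best_x, best_value = x, value
    for i in range(n_iterations):
        # Step 4 (assumed schedule): temperature and step size shrink linearly
        # and would reach 0 after n_iterations, which ends the search (step 5).
        fraction_left = 1.0 - i / n_iterations
        temperature = temp0 * fraction_left
        step = step0 * fraction_left
        # Step 1: random orientation; in one dimension that is left or right.
        candidate = x + rng.choice((-1.0, 1.0)) * step
        candidate_value = f(candidate)
        delta = value - candidate_value  # difference for a maximization problem
        # Step 2: accept improvements. Step 3: accept a bad step with
        # probability P = exp(-delta / temperature).
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            x, value = candidate, candidate_value
            if value > best_value:
                best_x, best_value = x, value
    return best_x, best_value


def bumpy(x):
    # Hypothetical test function: global maximum of 1 at x = 0,
    # with lower local maxima near x = -2.1 and x = 2.1.
    return math.cos(3.0 * x) - 0.1 * x * x


best_x, best_value = simulated_annealing(bumpy, x0=2.0, step0=0.5, temp0=1.0,
                                         n_iterations=5000, seed=0)
print(best_x, best_value)
```

The sketch also remembers the best point visited, a common practical addition, since the final position of a stochastic walk is not guaranteed to be the best one it passed through.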
Simulated annealing has the advantage of being able to explore a complex
solution space and to escape local optima. The method used to explore the solution
space in step 1 is simple and only requires that it be possible to numerically evaluate
and compare two possible solutions. For that reason, simulated annealing is also
very good at optimizing complex problems, including problems where the optimum
depends on multiple interdependent variables and where the optimum is found by
maximizing certain variables while minimizing others.
Example 9.7
A two-dimensional periodic signal s(x,y) is generated by the combination of
two sinusoids, modelled by this equation:
sx; y ey cos 3x ex cos 3y
2
It can be seen that the method moves randomly through the space, and
visits six of the local optima and three of the plateaux, before finally finding
the global optimum. For each of the local optima, the method eventually steps
away, quickly earlier in the iterations or after a longer exploration later in the
iterations. Being able to explore and eventually leave the plateaux is another
advantage of simulated annealing over other methods; Newton's method,
quadratic optimization, and the gradient descent would all fail and terminate
if they reached a constant region of the function they were optimizing.
9.9
Engineering Applications
Engineering design is constrained by reality, both physical (the laws of nature that
dictate the performance limits of their systems) and economic (the need to keep
costs and resource consumption down). In that sense, engineers are constantly
confronted by optimization problems, to get the most out of systems within the
limits of their constraints. The fuel tank design problem of Sect. 9.1 was a
simple illustration of that common problem: the design had to minimize the
tank's surface area and cost while respecting the system's requirements in terms
of shape and volume. It is nonetheless representative of real-world problems; a
similar optimization challenge led to the selection of the cylindrical 330 mL soft-drink can as the most cost-efficient design. Optimization problems are also encountered elsewhere in engineering practice, whenever conflicting requirements and
constraints arise.
The design of many electrical components can be reduced to finding optimal
points in equations. Indeed, the equations representing individual resistors,
capacitors, inductances, and voltage sources are well-known, as are the formulae
to combine them in parallel and serial connections. An entire circuit can thus be
modelled in that manner, and once this model is available, the values of specific
components can be optimized. For example, the impedance value Z for a resistor
is known to be $R$, for an inductance it is $\omega L$, and for a capacitor it is $(\omega C)^{-1}$,
where $\omega$ is the frequency of the power supply. In turn, the impedance of a serial
RLC circuit is given by:
$$Z = \sqrt{R^2 + \left(\omega L - \frac{1}{\omega C}\right)^2} \tag{9.25}$$
9:26
where a, b, and c are constants dependent on the specific type of yeast studied.
From this equation, it is possible to determine the temperature that will maximize growth.
Scheduling problems are among the most popular optimization problems
encountered in practice. Suppose for example a production line that can manufacture two different products, each with an associated unit production cost Ck
and unit profit Pk. The aim is to manufacture a number of units of each product
Nk in order to maximize profits P; however, the production line must operate
within its allocated budget B. In other words:
$$\begin{aligned}
N_0 C_0 + N_1 C_1 &= B \\
N_0 P_0 + N_1 P_1 &= P \\
N_0 P_0 + \frac{B - N_0 C_0}{C_1} P_1 &= P
\end{aligned} \tag{9.27}$$
9.10
Summary
One of the most common challenges in engineering is to try to determine the value
of some parameter of a system being designed to either maximize or minimize
its output value. This value of the parameter is the optimum, and this challenge is
optimization. This chapter has introduced several methods designed to solve
an optimization model. The golden-mean method is a closed method, which sets
up bounds around the optimum and uses the golden ratio property to get closer to
the value. As with the closed methods seen for root-finding, this closed method is
the least efficient one available but also the only one guaranteed not to diverge,
because of its requirement to keep the optimum bracketed between the bounds. Two
more open methods were examined, Newton's method and the quadratic optimization method. Both are more efficient than the golden-mean method, but both require
more information, namely the first and second derivative for Newton's method and
three points to interpolate a parabola with for the quadratic optimization, and both
have a risk of diverging and failing in certain conditions. All three of these methods
are also designed for two-dimensional problems; the next method learned was
the gradient method, and it is a more general method designed to deal with
multidimensional optimization problems. Finally, the topic of stochastic optimization was discussed. This topic is huge, worthy of an entire textbook to itself, and
highly active in the scientific literature, so the discussion in this chapter is meant as
nothing more than an introduction. Nonetheless, two stochastic methods were
presented, the random brute-force search and simulated annealing. Table 9.3
summarizes the methods covered in this chapter.
Table 9.3 Summary of optimization methods
Method                      Requires                                  Error
Golden-mean search          2 bounds                                  O(h)
Newton's method             1 point + first and second derivatives    O(h^2)
Quadratic optimization      3 points                                  O(h^1.497)
Gradient descent            1 point + derivatives                     O(h^2)
Random brute-force search   1 point + thousands of tries              Unbounded
Simulated annealing         1 point                                   Unbounded
9.11
Exercises
Chapter 10
Differentiation
10.1
Introduction
t0
t>0
10:1
Acceleration 5 m=s2
10:2
Jerk
5 m=s3
0 m=s3
Fig. 10.1 Jerk (top-right), acceleration (top-left), speed (bottom-left), and position (bottom-right)
with respect to time, for a robot at constant acceleration of 5 m/s²
And the speed and position both increase with respect to time:
$$\text{Speed} = 5t\ \text{m/s} \tag{10.3}$$
$$\text{Position} = 2.5t^2\ \text{m} \tag{10.4}$$
Looking at the relationship between the graphs of Fig. 10.1 in clockwise order from the
bottom-right shows the differentiation operation, while looking at their relationship
in counter-clockwise order from the top-right shows the integration operation.
Indeed, each graph in the clockwise sequence shows the rate of change of the
previous curve. The position curve is increasing quadratically, as Eq. (10.4) shows,
and the speed curve is thus one whose value is constantly increasing. But since it is
increasing at a constant rate, the acceleration curve is a constant line, aside from the
initial jump from 0 to 5 m/s² when movement started. And since a constant line is
not changing, the jerk curve is zero, save for the impulse of 5 m/s³ at time 0 s when
the acceleration changes from 0 to 5 m/s². On the other hand, integration, which
will be covered in Chap. 11, is the area under the curve of the previous graph. The
jerk is a single impulse of 5 m/s³ at time 0 s, which has an area of 5 m/s², followed
by a constant zero line with null area. Consequently, the acceleration value jumps to
5 m/s² at time 0 s and remains constant there since no additional area is added. The
area under this acceleration graph increases at a constant rate, by 5 m/s every second.
Consequently, the speed value that reflects its area increases constantly at that rate.
And the area under the linearly increasing speed graph is increasing
quadratically: it covers 2.5 m from 0 to 1 s, 10 m from 0 to 2 s, 22.5 m from 0 to
3 s, and so on, and as a result the position value is a quadratically increasing curve
over time.
Being able to measure the derivative and integral of systems is thus critical
if the system being modelled is changing, that is, if the system is not in a steady state or if
the model is not meant to capture a snapshot of the system at a specific moment in
time. If the equation of the system is known or can be determined, as is the case with
Eq. (10.2) in the previous example, then its derivatives and integrals can be computed
exactly using notions learned in calculus courses. This chapter and the next one,
however, will deal with the case where the equation of the system is unknown and
10.2
$$f(x_i + h) = f(x_i) + f^{(1)}(x_i)h \tag{10.5}$$
$$f(x_i - h) = f(x_i) - f^{(1)}(x_i)h \tag{10.6}$$
Now assume that the measurements of the function are known at all three points,
but the derivative is unknown. It is immediately clear that either (10.5) or (10.6)
could be solved to find the value of the derivative, since each is an equation with
only one unknown. But, for reasons that will become clear soon, it is possible and
preferable to do even better than this, by taking the difference of both equations:
$$f(x_i + h) - f(x_i - h) = f(x_i) + f^{(1)}(x_i)h - f(x_i) + f^{(1)}(x_i)h \tag{10.7}$$
$$f^{(1)}(x_i) = \frac{f(x_i + h) - f(x_i - h)}{2h} \tag{10.8}$$
$$\begin{aligned}
f(x_i + h) &= f(x_i) + f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} + \frac{f^{(3)}(x_i)h^3}{3!} \\
f(x_i - h) &= f(x_i) - f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} - \frac{f^{(3)}(x_i)h^3}{3!} \\
f(x_i + h) - f(x_i - h) &= 2f^{(1)}(x_i)h + \frac{2f^{(3)}(x_i)h^3}{3!} \\
f^{(1)}(x_i) &= \frac{f(x_i + h) - f(x_i - h)}{2h} - \frac{f^{(3)}(x_i)h^2}{3!} \\
&= \frac{f(x_i + h) - f(x_i - h)}{2h} + O(h^2)
\end{aligned} \tag{10.9}$$
Since the error is quadratic, or second-order, this gives the formula its name.
Moreover, the development of Eq. (10.9) demonstrates why it was preferable to
take the difference between Eqs. (10.5) and (10.6) to approximate the derivative,
rather than simply solving either one of these equations. With only one equation,
the second-order term of the Taylor series would have nothing to cancel out with,
and the final formula would be O(h), a less-accurate first-order formula.
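The difference between the two error orders can be observed numerically. The sketch below is an illustration only: it assumes the test function sin(x) at x = 1, whose exact derivative cos(1) is known, and the function names are hypothetical. Halving h should roughly halve the error of the one-sided formula but divide the error of the centered formula by about four.

```python
import math


def forward_difference(f, x, h):
    """First-order formula obtained by solving a single Taylor series."""
    return (f(x + h) - f(x)) / h


def centered_difference(f, x, h):
    """Second-order centered divided-difference formula."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x = 1.0
exact = math.cos(x)  # exact derivative of sin(x)
for h in (0.1, 0.05, 0.025):
    error_forward = abs(forward_difference(math.sin, x, h) - exact)
    error_centered = abs(centered_difference(math.sin, x, h) - exact)
    print(f"h = {h:<6} forward error = {error_forward:.2e}  "
          f"centered error = {error_centered:.2e}")
```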
Example 10.1
A 1-L reservoir is getting filled with water. It was initially empty, but reached
a quarter-full after 1.16 s, half-full after 2.39 s, three-quarter-full after 3.45 s,
and completely full after 4 s. Estimate the rate it was getting filled up by the
time it was half-full using the centered divided-difference formula and steps
of 0.5 and 0.25 L.
Solution
The fill-up rate is the volume filled per unit of time. The information given in
the problem statement is instead the time needed to fill certain units of
volume. The derivative of these values will be the time per volume, and the
inverse will be the rate asked for. The derivative can be obtained by a
straightforward application of Eq. (10.8):
$$f^{(1)}(0.5) \approx \frac{f(1) - f(0)}{2 \times 0.5} = \frac{4 - 0}{1} = 4.00\ \text{s/L}$$
From Eq. (10.9), it has been demonstrated that the error on the approximation
is proportional to the square of the step size. Reducing the step size by
half, from h to h/2, should thus reduce the error by a factor of 4, from O(h2)
The previous example used two pairs of two points to compute two second-order divided-difference approximations of the derivative. But if four points are
available, an intuitive decision would be rather to use all four of them to compute
a single approximation with a higher accuracy. Such a formula can be derived
from the Taylor series approximations, as before. Begin by writing out the four
Taylor series approximations, to a sufficiently high order to model the error of
the approximation as well. To determine which order to go up to, note that
with the divided-difference formula with two points in Eq. (10.9), the even
second-order term cancelled out and the error was due to the third-order term.
Consequently, it can be intuitively expected that, with the more accurate formula
with four points, the next even-order term will also cancel out and the error
will be due to the fifth-order term. The four fifth-order Taylor series approximations are:
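$$\begin{aligned}
f(x_i + 2h) &= f(x_i) + 2f^{(1)}(x_i)h + \frac{4f^{(2)}(x_i)h^2}{2!} + \frac{8f^{(3)}(x_i)h^3}{3!} + \frac{16f^{(4)}(x_i)h^4}{4!} + \frac{32f^{(5)}(x_i)h^5}{5!} \\
f(x_i + h) &= f(x_i) + f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} + \frac{f^{(3)}(x_i)h^3}{3!} + \frac{f^{(4)}(x_i)h^4}{4!} + \frac{f^{(5)}(x_i)h^5}{5!} \\
f(x_i - h) &= f(x_i) - f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} - \frac{f^{(3)}(x_i)h^3}{3!} + \frac{f^{(4)}(x_i)h^4}{4!} - \frac{f^{(5)}(x_i)h^5}{5!} \\
f(x_i - 2h) &= f(x_i) - 2f^{(1)}(x_i)h + \frac{4f^{(2)}(x_i)h^2}{2!} - \frac{8f^{(3)}(x_i)h^3}{3!} + \frac{16f^{(4)}(x_i)h^4}{4!} - \frac{32f^{(5)}(x_i)h^5}{5!}
\end{aligned} \tag{10.10}$$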
Taking the difference of the series one step before and after, as before, cancels out
the second- and fourth-order terms, but leaves the third-order term:
$$f(x_i + h) - f(x_i - h) = 2f^{(1)}(x_i)h + \frac{2f^{(3)}(x_i)h^3}{3!} + \frac{2f^{(5)}(x_i)h^5}{5!} \tag{10.11}$$
That's a problem; the third-order term must get cancelled out as well, otherwise it
will be the dominant term for the error. Fortunately, there are two more series to
incorporate into the formula. Looking back at the set of equations (10.10), it can be
seen that the third-order terms in the series $f(x_i \pm h)$ are eight times smaller than the
third-order terms in the series $f(x_i \pm 2h)$. To get them to cancel out, the series
$f(x_i \pm h)$ should thus be multiplied by 8 in Eq. (10.11) and the series $f(x_i \pm 2h)$
should be of opposite signs from their counterparts one step before or after. And by
having the series $f(x_i \pm 2h)$ be of opposite signs from each other, the second- and
fourth-order terms will cancel with each other as they did in Eq. (10.11).
The resulting formula is:
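$$\begin{aligned}
-f(x_i + 2h) + 8f(x_i + h) - 8f(x_i - h) + f(x_i - 2h) &= 12f^{(1)}(x_i)h - \frac{48f^{(5)}(x_i)h^5}{5!} \\
f^{(1)}(x_i) &= \frac{-f(x_i + 2h) + 8f(x_i + h) - 8f(x_i - h) + f(x_i - 2h)}{12h} + O(h^4)
\end{aligned} \tag{10.12}$$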
The third-order terms now cancel out, leaving only the fifth-order terms and a
division by h for an error of O(h4). As expected, using two points before and after
rather than just one has greatly increased the accuracy of the formula. This is now
the fourth-order centered divided-difference formula.
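As a quick check of the fourth-order formula, the sketch below applies it to the reservoir measurements used in the examples of this chapter (time in seconds as a function of volume in litres). The function name and the dictionary holding the data are assumptions made for the illustration.

```python
def centered_difference_4(f, x, h):
    """Fourth-order centered divided-difference formula for the first derivative."""
    return (-f(x + 2 * h) + 8 * f(x + h) - 8 * f(x - h) + f(x - 2 * h)) / (12 * h)


# Time (s) needed to reach each fill level (L) of the 1-L reservoir.
fill_time = {0.0: 0.0, 0.25: 1.16, 0.5: 2.39, 0.75: 3.45, 1.0: 4.0}

# Derivative of time with respect to volume at the half-full point, in s/L.
seconds_per_litre = centered_difference_4(fill_time.__getitem__, 0.5, 0.25)
print(round(seconds_per_litre, 2))      # 4.77
print(round(1 / seconds_per_litre, 2))  # 0.21, the fill-up rate in L/s
```

The dictionary lookup works here only because 0.5 ± 0.25 and 0.5 ± 0.5 are represented exactly in floating point; for arbitrary spacings the measurements would be better stored in a list indexed by sample number.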
Example 10.2
A 1-L reservoir is getting filled with water. It was initially empty, but reached
a quarter-full after 1.16 s, half-full after 2.39 s, three-quarter-full after 3.45 s,
and completely full after 4 s. Estimate the rate it was getting filled up by the
time it was half-full using the fourth-order centered divided-difference
formula.
Solution
The fill-up rate is the volume filled per unit of time. The information given in
the problem statement is instead the time needed to fill certain units of
volume. The derivative of these values will be the time per volume, and the
inverse will be the rate asked for. The derivative can be obtained by a
straightforward application of Eq. (10.12):
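$$\begin{aligned}
f^{(1)}(0.5) &= \frac{-f(1) + 8f(0.75) - 8f(0.25) + f(0)}{12h} \\
&= \frac{-4 + 8 \times 3.45 - 8 \times 1.16 + 0}{12 \times 0.25} = 4.77\ \text{s/L} \Rightarrow 0.21\ \text{L/s}
\end{aligned}$$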
Compared to the real rate of 4.69 s/L (fill-up rate of 0.21 L/s), the relative
error of this approximation is:
$$E_{h=0.25} = \left|\frac{4.77 - 4.69}{4.69}\right| = 1.7\,\%$$
This result is clearly more accurate than either of the ones obtained with the
same data using the second-order centered divided-difference formula in
Example 10.1.
10.3
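$$\begin{aligned}
f(x_i - h) &= f(x_i) - f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} - \frac{f^{(3)}(x_i)h^3}{3!} \\
f(x_i - 2h) &= f(x_i) - 2f^{(1)}(x_i)h + \frac{4f^{(2)}(x_i)h^2}{2!} - \frac{8f^{(3)}(x_i)h^3}{3!}
\end{aligned} \tag{10.13}$$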
The second-order term of the Taylor series approximations will need to cancel out
in order to keep the third-order term as the dominant error term. It can be seen from
Eq. (10.13) that this second-order term is four times greater in $f(x_i - 2h)$ than it is in
$f(x_i - h)$, and of the same sign. Consequently, the series $f(x_i - h)$ will need to be
multiplied by 4 and subtracted for the second-order terms to cancel out:
$$\begin{aligned}
f(x_i - 2h) - 4f(x_i - h) &= -3f(x_i) + 2f^{(1)}(x_i)h - \frac{4f^{(3)}(x_i)h^3}{3!} \\
f^{(1)}(x_i) &= \frac{f(x_i - 2h) - 4f(x_i - h) + 3f(x_i)}{2h} + O(h^2)
\end{aligned} \tag{10.14}$$
The resulting equation is indeed O(h2). However, it can be noted that this accuracy
required three measurements, at the current point and at one and two steps before,
while the second-order centered divided-difference formula achieved it with only
two measurements. As indicated before, this is because more measurements are
needed to even out the loss of information that comes from using measurements all
on one side of the target point, rather than centered around the target point.
The second-order forward divided-difference formula is computed using the
same development as its backward counterpart. The final equation and error term
are:
$$f^{(1)}(x_i) = \frac{-f(x_i + 2h) + 4f(x_i + h) - 3f(x_i)}{2h} + O(h^2) \tag{10.15}$$
Example 10.3
A robot is observed moving in a straight line. It starts off at the 8 m mark, and
is measured every second:
Time t (s)         0    1    2    3    4
Position f(t) (m)  8   16   34   62  100
$$f^{(1)}(4) = \frac{f(2) - 4f(3) + 3f(4)}{2 \times 1} = \frac{34 - 4 \times 62 + 3 \times 100}{2} = 43\ \text{m/s}$$
To verify this result, note that the equation modelling the robot's position is:
$$f(t) = 5t^2 + 3t + 8$$
The derivative of this equation is trivial to compute, and evaluated at t = 4
it does give 43 m/s.
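The backward and forward second-order formulae are easy to apply to a list of equally spaced samples. The sketch below is an illustration using the robot positions of this example; the function names and the list-based interface are assumptions.

```python
def backward_difference_2(values, i, h):
    """Second-order backward formula: uses samples i, i-1 and i-2."""
    return (values[i - 2] - 4 * values[i - 1] + 3 * values[i]) / (2 * h)


def forward_difference_2(values, i, h):
    """Second-order forward formula: uses samples i, i+1 and i+2."""
    return (-values[i + 2] + 4 * values[i + 1] - 3 * values[i]) / (2 * h)


positions = [8, 16, 34, 62, 100]  # metres, measured at t = 0, 1, 2, 3, 4 s

print(backward_difference_2(positions, 4, 1.0))  # 43.0 m/s at t = 4 s
print(forward_difference_2(positions, 0, 1.0))   # 3.0 m/s at t = 0 s
```

Both results are exact here because the position is a quadratic polynomial, so the third derivative in the error term is zero.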
10.4
Richardson Extrapolation
It has been demonstrated in the previous sections that the error on the approximation of the derivative is a function of h, the step size between the measurements. In
fact, Example 10.1 even demonstrated practically that a smaller step size leads to a
better approximation. However, this also leads to a major problem with the divided-difference formulae: the risk of subtractive cancellation, introduced back in
Chap. 2. Consider for example the second-order centered divided-difference formula of Eq. (10.9). As the value of h is reduced with the expectation of increasing
accuracy, there will come a point where $f(x_i - h) \approx f(x_i + h)$ and the effects of
subtractive cancellation will be felt. At that point, it would be wrong, and potentially dangerous, to continue to use smaller and smaller values of h and to advertise
the results as more accurate. And this issue of subtractive cancellation at smaller
values of h will occur with every divided-difference formula available, as they are
all based on taking the difference between measurements at regular intervals.
Yet the fundamental problem remains; smaller values of h are the only way to
improve the accuracy of the approximation of the derivative. Richardson extrapolation offers a solution to this problem. Instead of decreasing the value of h and
computing the divided-difference formula, this method makes it possible to compute the divided-difference formula with a large value of h then iteratively decrease
it to increase accuracy.
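The effect of subtractive cancellation on the centered formula can be seen with a short experiment. This is only an illustrative sketch: the test function sin(x) at x = 1 and the three step sizes are assumptions chosen to show the trend in double-precision arithmetic.

```python
import math


def centered_difference(f, x, h):
    """Second-order centered divided-difference formula."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x = 1.0
exact = math.cos(x)  # exact derivative of sin(x)
errors = {}
for h in (1e-2, 1e-5, 1e-13):
    errors[h] = abs(centered_difference(math.sin, x, h) - exact)
    print(f"h = {h:.0e}  error = {errors[h]:.2e}")
```

Going from h = 1e-2 to h = 1e-5 reduces the error as the O(h²) analysis predicts, but at h = 1e-13 the two function values agree in almost all of their digits and the error is larger than it was with the largest step.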
To begin, note that in every divided-difference formula, the even-order terms
from the Taylor series approximation cancel out, leaving the odd-order terms with
even-valued exponents of h once the final division by h is performed. The first
non-zero term after the formula becomes the error term, and all other terms are
ignored. But if the first non-zero term is cancelled out, as it was when going from
the second-order centered divided-difference formula to the fourth-order one, then
the error drops by a factor of $h^2$ for that reason.
Now consider again the centered divided-difference formula of Eq. (10.9). If the
equation is expanded to include all terms from the Taylor series, and since the even-order terms cancel out, the formula becomes:
$$f^{(1)}(x_i) = \frac{f(x_i + h) - f(x_i - h)}{2h} - \frac{f^{(3)}(x_i)h^2}{3!} - \frac{f^{(5)}(x_i)h^4}{5!} - \frac{f^{(7)}(x_i)h^6}{7!} - \cdots \tag{10.16}$$
Note again that since every other term cancels out, every term k actually appearing
in the series represents an error of $O(h^{2k})$. The error is of course dominated by the
largest term, which in this case is $O(h^2)$. But keeping that first error term written out
explicitly, Eq. (10.16) can be rewritten equivalently as:
$$D_{\text{exact}} = D_1(h) + K_1 h^2 + O(h^4) \tag{10.17}$$
where $D_{\text{exact}}$ represents the real exact value of the derivative and:
$$D_1(h) = \frac{f(x_i + h) - f(x_i - h)}{2h} \tag{10.18}$$
$$K_1 = -\frac{f^{(3)}(x_i)}{3!} \tag{10.19}$$
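$$\begin{aligned}
D_{\text{exact}} &= D_1\left(\frac{h}{2}\right) + K_1\left(\frac{h}{2}\right)^2 + O\left(\left(\frac{h}{2}\right)^4\right) \\
&= D_1\left(\frac{h}{2}\right) + K_1\frac{h^2}{4} + O(h^4)
\end{aligned} \tag{10.20}$$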
Note that the division inside the big O term has disappeared; as explained in
Chap. 1, big O notation is indifferent to constant values and is only a function of the
variable, in this case h. As for the derivative approximation, nothing special has
happened. The formula is the same with half the step size, and the risk of subtractive
cancellation is still present if h becomes too small. But notice that Eqs. (10.17) and
(10.20) each have a different parameter for $D_1$ but the same term $h^2$, with the only
difference being that one is four times larger than the other. This should give an idea
for cancelling out the $h^2$ term: taking four times Eq. (10.20) and subtracting
Eq. (10.17):
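$$\begin{aligned}
4D_{\text{exact}} - D_{\text{exact}} &= 4D_1\left(\frac{h}{2}\right) + 4K_1\frac{h^2}{4} + 4O(h^4) - D_1(h) - K_1 h^2 - O(h^4) \\
3D_{\text{exact}} &= 4D_1\left(\frac{h}{2}\right) - D_1(h) + O(h^4) \\
D_{\text{exact}} &= \frac{4D_1\left(\frac{h}{2}\right) - D_1(h)}{3} + O(h^4)
\end{aligned} \tag{10.21}$$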
The error is now $O(h^4)$, an important improvement, and more importantly this was
done without risking subtractive cancellation! And this is after only one iteration;
Richardson extrapolation is an iterative process, so it can be done again. Begin by
rewriting Eq. (10.21) in the same form as Eq. (10.17):
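$$D_{\text{exact}} = D_2\left(\frac{h}{2}\right) + K_2 h^4 + O(h^6) \tag{10.22}$$
where:
$$D_2\left(\frac{h}{2}\right) = \frac{4D_1\left(\frac{h}{2}\right) - D_1(h)}{3} \tag{10.23}$$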
$$D_{\text{exact}} = D_2\left(\frac{h}{4}\right) + K_2\frac{h^4}{16} + O(h^6) \tag{10.24}$$
This time, comparing Eqs. (10.22) and (10.24), the $h^4$ term is 16 times smaller than before.
Therefore the second equation will need to be multiplied by 16 to cancel out this
next error term.
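$$\begin{aligned}
16D_{\text{exact}} - D_{\text{exact}} &= 16D_2\left(\frac{h}{4}\right) + 16K_2\frac{h^4}{16} + 16O(h^6) - D_2\left(\frac{h}{2}\right) - K_2 h^4 - O(h^6) \\
15D_{\text{exact}} &= 16D_2\left(\frac{h}{4}\right) - D_2\left(\frac{h}{2}\right) + O(h^6) \\
D_{\text{exact}} &= \frac{16D_2\left(\frac{h}{4}\right) - D_2\left(\frac{h}{2}\right)}{15} + O(h^6)
\end{aligned} \tag{10.25}$$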
This process can go on iteratively forever, or until one of the usual termination
conditions applies: a threshold relative error between the approximation of the
derivative of two iterations is achieved (success condition), or a preset maximum
number of iterations is reached (failure condition). Richardson extrapolation can be
summarized as follows:
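$$D_{\text{exact}} = D_k\left(\frac{h}{2^{k-1}}\right) + O\left(h^{2k}\right) \quad \text{for } k \ge 1 \tag{10.26}$$
where:
$$D_k\left(\frac{h}{2^j}\right) = \frac{4^{k-1} D_{k-1}\left(\frac{h}{2^j}\right) - D_{k-1}\left(\frac{h}{2^{j-1}}\right)}{4^{k-1} - 1} \quad \text{if } k > 1 \tag{10.27}$$
$$D_k\left(\frac{h}{2^j}\right) = \frac{f\left(x_i + \frac{h}{2^j}\right) - f\left(x_i - \frac{h}{2^j}\right)}{2\,\frac{h}{2^j}} \quad \text{if } k = 1 \tag{10.28}$$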
Then, for each value of j starting from 0 and going up to the termination condition,
compute Eq. (10.27) for all values of k from 1 to j + 1. The first iteration will thus
only compute one instance of Eq. (10.28), and each subsequent iteration will add
one more instance of Eq. (10.27). Moreover, each new instance of Eq. (10.27) will
be computed from a lower-k-valued instance of it, down to Eq. (10.28). The value of
Eq. (10.27) with the highest values of j and k at the final iteration will be the
approximation of the derivative in Eq. (10.26).
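The iteration over j and k described above can be sketched in Python as a table that grows by one row per iteration. This is an illustration under assumptions, not the book's own pseudocode: the function name, the default tolerance, and the choice to raise an exception on failure are all hypothetical. Row j of the table stores the values of D_k(h / 2^j) for k = 1, ..., j + 1.

```python
import math


def richardson_derivative(f, x, h, max_iterations=10, tolerance=1e-10):
    """Approximate the first derivative of f at x by Richardson extrapolation."""
    table = []  # table[j][k - 1] holds D_k(h / 2**j)
    for j in range(max_iterations):
        step = h / 2 ** j
        # k = 1: second-order centered divided-difference formula.
        row = [(f(x + step) - f(x - step)) / (2.0 * step)]
        # k > 1: combine the entry to the left with the one above-left.
        for k in range(2, j + 2):
            factor = 4 ** (k - 1)
            row.append((factor * row[k - 2] - table[j - 1][k - 2]) / (factor - 1))
        table.append(row)
        # Success condition: relative change between two iterations.
        if j > 0 and abs(row[-1] - table[j - 1][-1]) <= tolerance * abs(row[-1]):
            return row[-1]
    # Failure condition: maximum number of iterations reached.
    raise RuntimeError("Richardson extrapolation did not converge")


estimate = richardson_derivative(math.sin, 1.0, 0.5)
print(estimate, math.cos(1.0))
```

Starting from the large step h = 0.5, the extrapolated value agrees with cos(1) to many digits after only a few halvings, without ever using a step small enough to trigger subtractive cancellation.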
Algorithmically, the Richardson extrapolation method can be implemented by
filling up a table left to right using values computed and stored in the previous
Example 10.4
A 1-L reservoir is getting filled with water. It was initially empty, but reached
a quarter-full after 1.16 s, half-full after 2.39 s, three-quarter-full after 3.45 s,
and completely full after 4 s. Estimate the rate it was getting filled up by the
time it was half-full using the centered divided-difference formula and
Richardson extrapolation.
Solution
The fill-up rate is the volume filled per unit of time. The information given in
the problem statement is instead the time needed to fill certain units of
volume. The derivative of these values will be the time per volume, and the
inverse will be the rate asked for.
Richardson extrapolation starts with j = 0. The only possible value of
k from 1 to j + 1 is thus k = 1, and the only equation to compute is (10.28).
Putting in the values naturally gives the second-order centered divided-difference formula as it was computed in Example 10.1:
$$D_1\left(\frac{h}{2^0}\right) = \frac{f\left(x + \frac{h}{2^0}\right) - f\left(x - \frac{h}{2^0}\right)}{2\,\frac{h}{2^0}} = \frac{f(1) - f(0)}{2h} = \frac{4 - 0}{2 \times 0.5} = 4.00\ \text{s/L}$$
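At the next iteration, j = 1 and k ∈ {1, 2}. There are now two equations to
compute, again one of which was already computed in Example 10.1:
$$D_1\left(\frac{h}{2^1}\right) = \frac{f\left(x + \frac{h}{2^1}\right) - f\left(x - \frac{h}{2^1}\right)}{2\,\frac{h}{2^1}} = \frac{f(0.75) - f(0.25)}{2 \times 0.25} = \frac{3.45 - 1.16}{2 \times 0.25} = 4.58\ \text{s/L}$$
$$D_2\left(\frac{h}{2^1}\right) = \frac{4^1 D_1\left(\frac{h}{2^1}\right) - D_1\left(\frac{h}{2^0}\right)}{4^1 - 1} = \frac{4 \times 4.58 - 4.00}{3} = 4.77\ \text{s/L}$$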
10.5
Second Derivatives
So far, this chapter has focused on approximating the first derivative of a system
being modelled. This is reasonable, as the rate of change of the parameters of a
system over time is often critically important to include in a complete model.
However, these rates of change are themselves often not constant, and modelling
them as such will lead the model to diverge from reality over time. To remedy
that, it is important to include their rate of change over time as well; in other words,
to compute higher derivatives. In engineering practice, the second derivative of
the system (modelling the rate of change of the rate of change of parameters) is the
one most often included, and the one this chapter will focus on, but the same
technique described here could be used to develop equations for third derivatives
and higher.
The technique for finding the nth derivative is the same as that for finding the
first derivative. Given a set of measurements of the system at equally spaced
intervals, expand the Taylor series approximations for each measurement, then
combine them with multiplications and subtractions to eliminate all non-zeroth-
order terms except the one of the same order as the desired derivative and the
highest-possible-order one for the error term.
Consider the case with one measurement before and after the target point. The
two third-order Taylor series approximations were expanded in Eq. (10.9), and they
were subtracted from each other to derive the second-order centered divided-difference formula to approximate the first derivative. But if the goal is to keep
the second derivative and cancel out the first, then the two series should be summed
together rather than subtracted. The result is:
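$$\begin{aligned}
f(x_i + h) + f(x_i - h) &= \left(f(x_i) + f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} + \frac{f^{(3)}(x_i)h^3}{3!} + \frac{f^{(4)}(x_i)h^4}{4!}\right) \\
&\quad + \left(f(x_i) - f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} - \frac{f^{(3)}(x_i)h^3}{3!} + \frac{f^{(4)}(x_i)h^4}{4!}\right) \\
&= 2f(x_i) + \frac{2f^{(2)}(x_i)h^2}{2!} + \frac{2f^{(4)}(x_i)h^4}{4!} \\
f^{(2)}(x_i) &= \frac{f(x_i + h) + f(x_i - h) - 2f(x_i)}{h^2} + O(h^2)
\end{aligned} \tag{10.29}$$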
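$$\begin{aligned}
f(x_i + h) + f(x_i - h) &= 2f(x_i) + \frac{2f^{(2)}(x_i)h^2}{2!} + \frac{2f^{(4)}(x_i)h^4}{4!} + \frac{2f^{(6)}(x_i)h^6}{6!} \\
f(x_i + 2h) + f(x_i - 2h) &= 2f(x_i) + \frac{8f^{(2)}(x_i)h^2}{2!} + \frac{32f^{(4)}(x_i)h^4}{4!} + \frac{128f^{(6)}(x_i)h^6}{6!}
\end{aligned} \tag{10.30}$$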
This leaves the zeroth-order term, which is the measurement at the target point
and therefore available, the second-order term, which has the second derivative, the
sixth-order term, which will be the error term, and the fourth-order term, which
must be eliminated in order for the error to be the sixth-order term. The problem is
that the fourth-order term is positive in all four series, so it cannot be cancelled out
by adding them together, and it is 16 times greater in the two series two steps away.
The solution is to multiply the two series one step away by 16 to make their fourth-order term of the correct magnitude, and the two series two steps away by −1 to
ensure that the terms cancel out with the series one step away without affecting the
other terms eliminated by summation. The final result is:
$$f^{(2)}(x_i) = \frac{-f(x_i + 2h) + 16f(x_i + h) + 16f(x_i - h) - f(x_i - 2h) - 30f(x_i)}{12h^2} + O(h^4) \tag{10.31}$$
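The two centered formulae for the second derivative can be compared numerically. The sketch below is an illustration only; it assumes the test function sin(x) at x = 1, whose exact second derivative is −sin(1), and the function names are hypothetical.

```python
import math


def second_derivative_2(f, x, h):
    """Second-order centered formula for the second derivative."""
    return (f(x + h) + f(x - h) - 2.0 * f(x)) / h ** 2


def second_derivative_4(f, x, h):
    """Fourth-order centered formula for the second derivative."""
    return (-f(x + 2 * h) + 16 * f(x + h) + 16 * f(x - h)
            - f(x - 2 * h) - 30 * f(x)) / (12 * h ** 2)


x, h = 1.0, 0.1
exact = -math.sin(x)
error_2 = abs(second_derivative_2(math.sin, x, h) - exact)
error_4 = abs(second_derivative_4(math.sin, x, h) - exact)
print(f"second-order error = {error_2:.2e}, fourth-order error = {error_4:.2e}")
```

With the same step size, the five-point formula is markedly more accurate, at the cost of two extra measurements.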
The same process can also be used to devise forward and backward formulae.
Recall that the second-order backward divided-difference formula for the first
derivative was estimated from two past measurements of the system. The discussion about the centered divided-difference formulae has already shown that an
additional point is needed to estimate the second derivative to the same error
value, as well as an additional order term in the Taylor series approximation to
account for the division by h2. Consequently, three past measurements will be
needed for a second-derivative second-order backward divided-difference formula,
and the corresponding Taylor series approximations will need to be expanded to the
fourth order, as such:
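$$\begin{aligned}
f(x_i - h) &= f(x_i) - f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} - \frac{f^{(3)}(x_i)h^3}{3!} + \frac{f^{(4)}(x_i)h^4}{4!} \\
f(x_i - 2h) &= f(x_i) - 2f^{(1)}(x_i)h + \frac{4f^{(2)}(x_i)h^2}{2!} - \frac{8f^{(3)}(x_i)h^3}{3!} + \frac{16f^{(4)}(x_i)h^4}{4!} \\
f(x_i - 3h) &= f(x_i) - 3f^{(1)}(x_i)h + \frac{9f^{(2)}(x_i)h^2}{2!} - \frac{27f^{(3)}(x_i)h^3}{3!} + \frac{81f^{(4)}(x_i)h^4}{4!}
\end{aligned} \tag{10.32}$$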
Clearly, cancelling out the first and third-order terms will require more than a
simple addition as was the case with the centered divided-difference formulae.
However, this can be done in a simple methodical way, by starting with $f(x_i - 3h)$,
the formula with the largest coefficients multiplying terms, and figuring out the
multiple of $f(x_i - 2h)$ needed to eliminate the largest coefficients. It will not be
exact, but it should be rounded up, and then $f(x_i - h)$ can be used to cancel out the
remainders. In the case of Eq. (10.32), the third-order term of $f(x_i - 3h)$ is 3.4 times
larger than that of $f(x_i - 2h)$, so rounding up, the latter series will be multiplied by
4 and subtracted:
$$f(x_i - 3h) - 4f(x_i - 2h) = -3f(x_i) + 5f^{(1)}(x_i)h - \frac{7f^{(2)}(x_i)h^2}{2!} + \frac{5f^{(3)}(x_i)h^3}{3!} + \frac{17f^{(4)}(x_i)h^4}{4!} \tag{10.33}$$
This leaves five times the first-order term and five times the third-order term. The
series $f(x_i - h)$ will thus need to be multiplied by 5, and added to the other two to
cancel out these terms:
$$f(x_i - 3h) - 4f(x_i - 2h) + 5f(x_i - h) = 2f(x_i) - \frac{2f^{(2)}(x_i)h^2}{2!} + \frac{22f^{(4)}(x_i)h^4}{4!} \tag{10.34}$$
$$f^{(2)}(x_i) = \frac{-f(x_i - 3h) + 4f(x_i - 2h) - 5f(x_i - h) + 2f(x_i)}{h^2} + O(h^2) \tag{10.35}$$
The second-order forward divided-difference formula is the same but with the sign
of the steps reversed, and can be derived using the same process:
$$f^{(2)}(x_i) = \frac{-f(x_i + 3h) + 4f(x_i + 2h) - 5f(x_i + h) + 2f(x_i)}{h^2} + O(h^2) \tag{10.36}$$
Example 10.5
A robot is observed moving in a straight line. It starts off at the 8 m mark, and
is measured every second:
Time t (s)         0    1    2    3    4
Position f(t) (m)  8   16   34   62  100
$$f^{(2)}(4) = \frac{-f(1) + 4f(2) - 5f(3) + 2f(4)}{1^2} = -16 + 4 \times 34 - 5 \times 62 + 2 \times 100 = 10\ \text{m/s}^2$$
To verify this result, note that the equation modelling the robot's position is:
$$f(t) = 5t^2 + 3t + 8$$
The second derivative of this equation is trivial to compute, and evaluated at
t = 4 it does give 10 m/s².
The speed of the robot at t = 4 has already been found to be 43 m/s in
Example 10.3. Not modelling acceleration, at t = 5 the position would be
assumed to be 143 m and the speed still 43 m/s. But now that the model does
include acceleration, the speed and position at t = 5 will be found to be 53 m/s
and 148 m respectively. This shows the importance of including not only the
rate of change of parameters, but also the second derivative, the rate of change
of the rate of change, in engineering models.
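The numbers in this example can be reproduced with a few lines of Python. The sketch is an illustration only; the function name and the list-based interface are assumptions, and the prediction at t = 5 uses the usual constant-acceleration update over a 1 s step.

```python
def backward_second_derivative(values, i, h):
    """Second-order backward formula for the second derivative (samples i-3 to i)."""
    return (-values[i - 3] + 4 * values[i - 2]
            - 5 * values[i - 1] + 2 * values[i]) / h ** 2


positions = [8, 16, 34, 62, 100]  # metres, measured at t = 0, 1, 2, 3, 4 s
speed = 43.0                      # m/s at t = 4 s, from Example 10.3
acceleration = backward_second_derivative(positions, 4, 1.0)
print(acceleration)  # 10.0 m/s^2

dt = 1.0
# Prediction at t = 5 s without acceleration in the model.
print(positions[4] + speed * dt)                             # 143.0 m
# Prediction at t = 5 s with acceleration in the model.
print(speed + acceleration * dt)                             # 53.0 m/s
print(positions[4] + speed * dt + 0.5 * acceleration * dt ** 2)  # 148.0 m
```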
10.6
All the formulae seen so far have one thing in common: they require measurements
taken at equal intervals before or after the target point at which the derivative is
required. Unfortunately, such measurements may not always be available. They
might have been recorded irregularly because of equipment failure, or lost to a data
storage failure, bad record-keeping, or simple human negligence. Another approach
will be needed to deal with such cases.
Given measurements at irregular intervals, one simple option is to interpolate a
polynomial that fits these measurements using any of the techniques learned in
Chap. 6, and then simply compute the derivative of that polynomial. In fact, there is
an even better option, namely to include the derivative in the interpolation formula
and thus to interpolate the derived equation directly. This can be done easily
starting from the Lagrange polynomial formula:
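$$
f(x) = \sum_{i=0}^{n-1} f(x_i)\,\frac{(x - x_0)\cdots(x - x_{i-1})(x - x_{i+1})\cdots(x - x_{n-1})}{(x_i - x_0)\cdots(x_i - x_{i-1})(x_i - x_{i+1})\cdots(x_i - x_{n-1})}
\tag{10.37}
$$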
To interpolate the derivative instead, take the derivative of the Lagrange formula
with respect to x. This is in fact easier than it looks, since x only appears in the
numerator:
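$$
f^{(1)}(x) = \sum_{i=0}^{n-1} f(x_i)\,\frac{\frac{d}{dx}\left[(x - x_0)\cdots(x - x_{i-1})(x - x_{i+1})\cdots(x - x_{n-1})\right]}{(x_i - x_0)\cdots(x_i - x_{i-1})(x_i - x_{i+1})\cdots(x_i - x_{n-1})}
\tag{10.38}
$$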
This is for the first derivative, but higher derivatives can be obtained in the same
way. The interpolation method then works as it did back in Chap. 6: for each of the
n measurements available, compute the polynomial that results from the multiplications in the numerator, differentiate it, and sum it with the polynomials from the
other measurements to get the derivative equation. That equation can then be
evaluated at any point of interest within the interpolation interval.
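The procedure just described can be sketched as follows. This is a minimal illustration, not the book's own code: the function name `lagrange_derivative` is an assumption, and instead of expanding each numerator polynomial it applies the product rule directly, which gives the same value as Eq. (10.38) at the point of evaluation. The sample data are robot positions at unevenly spaced times taken from the model f(t) = 5t^2 + 3t + 8 used earlier.

```python
def lagrange_derivative(xs, ys, x):
    """Evaluate the derivative of the Lagrange interpolating polynomial at x."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Denominator of the i-th term: a constant product.
        denom = 1.0
        for k in range(n):
            if k != i:
                denom *= xs[i] - xs[k]
        # Product rule on the numerator: differentiate one factor
        # (x - x_j) at a time and keep the remaining factors.
        numer_deriv = 0.0
        for j in range(n):
            if j == i:
                continue
            term = 1.0
            for k in range(n):
                if k != i and k != j:
                    term *= x - xs[k]
            numer_deriv += term
        total += ys[i] * numer_deriv / denom
    return total


# Unevenly spaced measurements of the robot's position.
times = [1.0, 3.0, 4.0]
positions = [16.0, 62.0, 100.0]

print(lagrange_derivative(times, positions, 4.0))
print(lagrange_derivative(times, positions, 2.0))
```

Because three points determine the quadratic model exactly, the estimates here coincide with the true speed 10t + 3; with real data the result is only as good as the interpolation.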
Example 10.6
A robot is observed moving in a straight line. It has been measured at the
following positions:
Time t (s)          1     3     4
Position f(t) (m)   16    62    100

$$
f^{(1)}(t) = 16\,\frac{2t - 7}{6} + 62\,\frac{2t - 5}{-2} + 100\,\frac{2t - 4}{3} = 10t + 3
$$
10.7
Inaccurate Measurements
The divided-difference formulae studied in this chapter all estimate the derivative
from measurements of a system. So far, these measurements have been assumed to
be accurate, and have been used as such. But empirical measurements taken in
practice will often have measurement errors, due to inaccurate instrumentation and
handling errors. Worse, differentiation can be very unstable in the presence of this
noise: the errors get added together and amplified.
Consider the robot tracking data of Example 10.3. Given exact data, the derivative can be computed at any of the five times using the backward, centered, or
forward divided-difference formulae, as in Table 10.1:
However, small errors in measurements can have a drastic impact. Table 10.2
runs through the example again, this time introducing errors of 1 to 4 m on the
position measurements. Notice how this error is amplified dramatically in the
derivatives:
To further illustrate, Fig. 10.3 compares the real and noisy position measurements, and the real and noisy derivative estimations. A visual inspection of that
figure confirms how even a small error in measurements can cause errors in the
derivative estimation that are not only much larger in amplitude, but also fluctuate
wildly.
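The amplification can be reproduced numerically. The sketch below is illustrative only: the helper names are assumptions, and so is the choice of formula at each time (second-order forward at t = 0, second-order centered at t = 1 and t = 3, fourth-order centered at t = 2, second-order backward at t = 4). That choice happens to reproduce the speeds listed in Tables 10.1 and 10.2.

```python
def forward_diff(f, i, h):
    # Second-order forward formula: uses f[i], f[i+1], f[i+2].
    return (-f[i + 2] + 4 * f[i + 1] - 3 * f[i]) / (2 * h)


def backward_diff(f, i, h):
    # Second-order backward formula: uses f[i], f[i-1], f[i-2].
    return (3 * f[i] - 4 * f[i - 1] + f[i - 2]) / (2 * h)


def centered_diff2(f, i, h):
    # Second-order centered formula: uses f[i-1] and f[i+1].
    return (f[i + 1] - f[i - 1]) / (2 * h)


def centered_diff4(f, i, h):
    # Fourth-order centered formula: uses two points on each side.
    return (-f[i + 2] + 8 * f[i + 1] - 8 * f[i - 1] + f[i - 2]) / (12 * h)


def speeds(f, h=1.0):
    """Estimate the speed at each of five equally spaced measurement times."""
    return [
        forward_diff(f, 0, h),
        centered_diff2(f, 1, h),
        centered_diff4(f, 2, h),
        centered_diff2(f, 3, h),
        backward_diff(f, 4, h),
    ]


exact_positions = [8, 16, 34, 62, 100]
noisy_positions = [7, 17, 36, 60, 104]

exact_speeds = speeds(exact_positions)
noisy_speeds = speeds(noisy_positions)
print(exact_speeds)
print(noisy_speeds)
```

With exact data every formula returns the true speed, since the model is quadratic; with the noisy data the estimates scatter well beyond the size of the position errors.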
Clearly, the divided-difference formulae should be avoided in cases such as this
one. An alternative solution is to compute a linear regression on the data, as was
learned in Chap. 6, to obtain the best-fitting polynomial that goes through the data.
That polynomial can then be differentiated and used to estimate the derivative value at
any point within its interval.

Table 10.1
Time t (s)   Position f(t) (m)   Speed f^(1)(t) (m/s)
0            8                   3
1            16                  13
2            34                  23
3            62                  33
4            100                 43

Table 10.2
Time t (s)   Position f(t) (m)   Position error (%)   Speed f^(1)(t) (m/s)   Speed error (%)
0            7                   12.5                 5.5                    83.3
1            17                  6.3                  14.5                   11.5
2            36                  5.9                  20.6                   10.4
3            60                  3.2                  34                     3.0
4            104                 4.0                  54                     25.6

Fig. 10.3 Position (left) and speed (right) using exact values (blue) and noisy values (red)
Example 10.7
A robot is observed moving in a straight line. It has been measured, with
noise, at the following positions:
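Time t (s)          0     1     2     3     4
Position f(t) (m)   7     17    36    60    104

$$
\mathbf{V} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \end{bmatrix}
\quad\text{and}\quad
\mathbf{y} = \begin{bmatrix} 7 \\ 17 \\ 36 \\ 60 \\ 104 \end{bmatrix}
$$

$$
\mathbf{V}^T\mathbf{V}\mathbf{c} = \mathbf{V}^T\mathbf{y}
$$

$$
\begin{bmatrix} 5 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}
=
\begin{bmatrix} 224 \\ 685 \\ 2365 \end{bmatrix}
$$

$$
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}
=
\begin{bmatrix} 7.83 \\ 2.84 \\ 5.21 \end{bmatrix}
$$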
This gives the regressed polynomial for the position of the robot:

$$
f(t) = 5.21t^2 + 2.84t + 7.83
$$

which is trivial to differentiate to obtain the equation for the speed. Note that this
regressed polynomial is very close to the real polynomial that generated the
correct values of the example, which was:

$$
f(t) = 5t^2 + 3t + 8
$$

Using the derivative of the regressed equation makes it possible to compute
the speed at all five measurement times:

Time t (s)   Speed f^(1)(t) (m/s)   Speed error (%)
0            2.8                    6.7
1            13.3                   2.3
2            23.7                   3.0
3            34.1                   3.3
4            44.5                   3.5
These results are clearly much more accurate than those obtained using the
divided-difference formula in Table 10.2. To further illustrate the difference,
Fig. 10.3 is taken again, this time to include the regressed estimate of the
speed (purple dashed line) in addition to the actual value (blue line) and
divided-difference estimate (red dashed line). It can be seen that, while the
divided-difference estimate fluctuates wildly, the regressed estimate remains
linear like the actual derivative, and very close to it in value, even overlapping
with it for half a second.
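The regression of this example can be reproduced with a small sketch in plain Python. The function names `solve_linear` and `fit_quadratic` are assumptions made for illustration, and the Gaussian elimination routine is a generic stand-in for whichever linear solver is preferred; the normal equations themselves are the ones written above.

```python
def solve_linear(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_quadratic(ts, ys):
    """Least-squares coefficients (c0, c1, c2) of c0 + c1*t + c2*t^2."""
    v = [[1.0, t, t * t] for t in ts]  # Vandermonde matrix V
    vtv = [[sum(row[i] * row[j] for row in v) for j in range(3)]
           for i in range(3)]
    vty = [sum(row[i] * y for row, y in zip(v, ys)) for i in range(3)]
    return solve_linear(vtv, vty)


times = [0, 1, 2, 3, 4]
noisy_positions = [7, 17, 36, 60, 104]

c0, c1, c2 = fit_quadratic(times, noisy_positions)
# Speed from the derivative of the regressed polynomial: c1 + 2*c2*t.
regressed_speeds = [c1 + 2 * c2 * t for t in times]
print(c0, c1, c2)
print(regressed_speeds)
```

The coefficients agree with the rounded values 7.83, 2.84 and 5.21 quoted in the example; small differences in the last digit of the tabulated speeds come from that rounding.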
10.8
Engineering Applications
$$
q_x = -k\frac{dT}{dx}
\tag{10.39}
$$

where q_x is the heat flux in orientation x, k is the material's conductivity, and dT/dx
is the first derivative of the temperature over orientation x.
Fick's laws of diffusion model the movement of particles of a substance from a
region of higher concentration to a region of lower concentration. Fick's first law
models the diffusion flux in orientation x, J_x, as:

$$
J_x = -D\frac{d\phi}{dx}
\tag{10.40}
$$

where D is the diffusion coefficient of the medium, and dφ/dx is the first derivative
of the concentration φ over orientation x. Fick's second law models the rate of change
of the concentration over time, dφ/dt, in relationship to the second derivative of the
concentration over orientation x:

$$
\frac{d\phi}{dt} = D\frac{d^2\phi}{dx^2}
\tag{10.41}
$$
dB
dt
10:42
The current-voltage relationships of electrical components are also defined in terms of derivatives over time. For a capacitor with capacitance C, that
relationship is:

$$
I = C\frac{dV}{dt}
\tag{10.43}
$$

and for an inductor with inductance L, it is:

$$
V = L\frac{dI}{dt}
\tag{10.44}
$$
This means that the current going through a capacitor is proportional to the rate
of change of its voltage over time, while the voltage across an inductor is proportional to the rate of change of the current going through it over time.
In all these examples, as in many others, a value of the system is defined and
modelled in relationship to the rate of change of another related parameter. If this
parameter can be observed and measured, then the methods seen in this chapter can
be used to approximate its rate of change.
10.9
Summary
Engineering models of systems that are not in a steady-state are incomplete if they
only include a current snapshot of the values of system parameters. To be complete
and accurate, it is necessary to include information about the rate of change of these
parameters. With this addition, models are not static pictures but they change,
move, or grow, in ways that reflect the changes of the real systems they represent.
If a mathematical model of the system is already available, then it is straightforward
to compute its derivative and include it in the model. This chapter has focused on
the case where such a mathematical model is not available, and presented methods
to estimate the derivative using only observed measurements of the system.
Requires
1 Point before and 1 point after, equally spaced
Error
O(h2)
O(h2)
O(h2)
O(h4)
O(h2n)
Interpolation method
Regression method
n Points, noisy
See
Chap. 6
See
Chap. 6
If a set of error-free and equally spaced measurements are available, then one of
the divided-difference formulae can be used. The backward, forward, or centered
formulae can be used in the case that measurements are available before, after, or
around the target point at which the derivative is needed, and more measurements
can be used in the formulae to improve the error rate. This chapter presented in
detail how new divided-difference formulae can be developed from Taylor series
approximations, so whichever set of points are available, it will always be possible
to create a custom divided-difference formula to fit them and to know its error rate.
And in addition to the divided-difference formulae, Richardson extrapolation was
presented as a means to improve the error rate of the derivative estimate.
If measurements are available but they are noisy or unevenly spaced, then the
divided-difference formulae cannot be used. Two alternatives were presented to
deal with these cases. If the measurements are error-free but unevenly spaced, then
an interpolation method can be used to model the derivative of the system. And if
the measurements are noisy, whether they are evenly or unevenly spaced, then a
regression method should be used to find the best-fitting mathematical model of the
system, and the derivative of that model can then be computed. Table 10.3 summarizes all the methods learned in this chapter.
10.10
Exercises
1. The charge of a capacitor is measured every 0.1 s. At the following five measurement times: {7.2, 7.3, 7.4, 7.5, 7.6 s}, the charge is measured at {0.00242759F,
0.00241500F, 0.00240247F, 0.00239001F, 0.00237761F} respectively. Find the
rate of change of the charge at 7.4 s using the second-order centered divided-difference formula.
2. The rotation of a satellite is measured at times {3.2, 3.3, 3.4, 3.5, 3.6 s}, and the
measured angles are {1.05837, 1.15775, 1.25554, 1.35078, 1.44252 rad}
respectively. Approximate the rate of change of the angle at time 3.4 using
both the second-order and fourth-order centered divided-difference formulae.
3. Repeat exercise 2 using the second-order backward divided-difference
formula.
4. Use h = 0.5 to approximate the derivative of f(x) = tan(x) at x = 1 to a relative
error of 0.00001 using the centered divided-difference formula.
5. Repeat Question 4 but for the function f(x) = sin(x)/x.
6. Perform three iterations of Richardson extrapolation to estimate the derivative
of f(x) = e^x at x = 0 starting with a step of h = 1, using the centered divided-difference formula.
7. Perform three iterations of Richardson extrapolation to estimate the derivative
of f(x) = sin^2(x)/x at x = 5 rad starting with a step of h = 2, using (a) the second-order centered divided-difference formula; (b) the forward divided-difference
formula; (c) the fourth-order centered divided-difference formula.
8. Repeat exercise 7 using the function f(x) = cos^-1(x) at x = 2 rad starting with a
step of h = 0.5. Perform 4 iterations.
9. A runner starts a 40-m sprint at time 0. He passes the 10-m mark after 2 s, the
20-m mark after 3 s, the 30-m mark after 4 s and reaches the finish line after
4.5 s. Estimate his speed at the middle of his run, after 2.25 s.
10. A 3 L container is getting filled. It is one-third filled after an hour, two-thirds
filled after 3 h, and full after 6 h. Determine the initial filling rate.
Chapter 11
Integration
11.1
Introduction
Chapter 10 has already introduced the need for differentiation and integration to
quantify change in engineering systems. Differentiation measures the rate of
change of a parameter, and integration conversely measures the changing value
of a given parameter. Chapter 10 demonstrated how important modelling change
was to ensure that engineering models accurately reflected reality.
Integration does have uses beyond measuring change in parameters. Integration
is mathematically the measure of an area under a curve. It can thus be used to model
and approximate forces, areas, volumes, and other quantities bounded geometrically. Consider, for example, an engineer who needs to model a river: possibly an
environmental engineer who needs to model water flow, or a civil engineer who is
doing preliminary work to design a dam or a bridge. In all cases, a complete model
will need to include the area of a cross-section of the river. So a boat is sent out with
a sonar, and it takes depth measurements at regular intervals, to generate a depth
map such as the one shown in Fig. 11.1. From these discrete measurements, it is
then possible to compute the cross-sectional area of the river. The process to do this
computation is an integral: the depth measurements can be seen as points on the
curve of a function, the straight horizontal surface of the water is the axis of the
graph, and the cross-sectional area is the area under the curve.
As with differentiation, computing an exact integral would be a simple calculus
problem if the equation of the system were known. This chapter, like Chap. 10, will
deal with the case where the equation is not known, and the only information
available is discrete measurements of the system. The formulae presented in this
chapter are all part of the set of Newton-Cotes rules for integration, the general
name for the family of formulae that approximate an integral value from a set of
equally spaced points, by interpolating a polynomial through these points and
computing its area. Most of this chapter will focus on closed Newton-Cotes rules,
which are closed in the sense that the first and last of the equally spaced points are
also the integration boundaries. However, the last method presented will be an open
Newton-Cotes rule, where the integration boundaries lie beyond the first and last of
the equally spaced points.
11.2
Trapezoid Rule
$$
\int_{x_0}^{x_1} f(x)\,dx \approx (x_1 - x_0)\frac{f(x_0) + f(x_1)}{2}
\tag{11.1}
$$
While the trapezoid rule has an undeniable advantage in simplicity, its downside
is potentially a very high error. Indeed, it works by approximating the function
being modelled f(x) as a straight line between the measurements x0 and x1, and can
therefore be very wrong when that is not the case. The value of the integral will be
wrong by the area between the straight line and the real curve of the function, as in
Fig. 11.3. This graphical representation is a good way to visualize the error, but
unfortunately it does not help to compute it.

Fig. 11.2 Two measurements at x0 and x1 (left). A fictional trapezoid, the area of which
approximates the integral of the function from x0 to x1 (right)
An alternative way to understand the error of this formula is to recall that the
function f(x) is being modelled as a polynomial p(x) interpolated from two points,
and then the trapezoid method takes the integral of that interpolation. Consequently,
the integration error will be the integral of the interpolation error; and the interpolation error E(x) is one that was already learnt, back in Chap. 6. For an interpolation
from two points, the error is:
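$$
E(x) = \frac{f^{(2)}(\xi)}{2}(x - x_0)(x - x_1)
\tag{11.2}
$$

for a point ξ in the interval [x0, x1]. Then the integration error will be obtained by
taking the integral of the formula:

$$
\int_{x_0}^{x_1} f(x)\,dx = \int_{x_0}^{x_1} \left(p(x) + E(x)\right)dx = \int_{x_0}^{x_1} p(x)\,dx + \int_{x_0}^{x_1} E(x)\,dx
\tag{11.3}
$$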
The integral of p(x) is obtained from the trapezoid rule of Eq. (11.1). For E(x),
Eq. (11.2) can be substituted in to compute the integral:
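$$
\begin{aligned}
\int_{x_0}^{x_1} f(x)\,dx
&= (x_1 - x_0)\frac{f(x_0) + f(x_1)}{2} + \int_{x_0}^{x_1} \frac{f^{(2)}(\xi)}{2}(x - x_0)(x - x_1)\,dx \\
&= (x_1 - x_0)\frac{f(x_0) + f(x_1)}{2} + \frac{f^{(2)}(\xi)}{2}\int_{x_0}^{x_1} \left(x^2 - x(x_0 + x_1) + x_0 x_1\right)dx \\
&= (x_1 - x_0)\frac{f(x_0) + f(x_1)}{2} + \frac{f^{(2)}(\xi)}{2}\left[\frac{x^3}{3} - \frac{x^2}{2}(x_0 + x_1) + x\,x_0 x_1\right]_{x_0}^{x_1} \\
&= (x_1 - x_0)\frac{f(x_0) + f(x_1)}{2} + \frac{f^{(2)}(\xi)}{2}\left[-\frac{(x_1 - x_0)^3}{6}\right] \\
&= (x_1 - x_0)\frac{f(x_0) + f(x_1)}{2} - \frac{f^{(2)}(\xi)(x_1 - x_0)^3}{12}
\end{aligned}
\tag{11.4}
$$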
Equation (11.4) gives a formula for the error of the trapezoid method, but it does
require an extra point ξ within the interpolation interval in order to compute it. If
such a point is not available, the formula can still be used by replacing the exact
value of the second derivative at ξ with the average value of the second derivative
within the integration interval:

$$
\int_{x_0}^{x_1} f(x)\,dx = (x_1 - x_0)\frac{f(x_0) + f(x_1)}{2} - \frac{\overline{f^{(2)}}\,(x_1 - x_0)^3}{12}
\tag{11.5}
$$

And if this average second derivative is also not available, it can be estimated from
the first derivative of the function:

$$
\overline{f^{(2)}} = \frac{\int_{x_0}^{x_1} f^{(2)}(x)\,dx}{x_1 - x_0} = \frac{f^{(1)}(x_1) - f^{(1)}(x_0)}{x_1 - x_0}
\tag{11.6}
$$
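$$
\begin{aligned}
\int_{x_0}^{x_{n-1}} f(x)\,dx
&\approx (x_1 - x_0)\frac{f(x_0) + f(x_1)}{2} + (x_2 - x_1)\frac{f(x_1) + f(x_2)}{2} + \cdots + (x_{n-1} - x_{n-2})\frac{f(x_{n-2}) + f(x_{n-1})}{2} \\
&= \sum_{i=0}^{n-2} (x_{i+1} - x_i)\frac{f(x_i) + f(x_{i+1})}{2}
\end{aligned}
\tag{11.7}
$$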
Fig. 11.4 Trapezoid approximation of the integral of Fig. 11.2 with two points and one segment
(left), with three points and two segments (center), and with four points and three segments (right)
Moreover, if the measurements are equally spaced, then the length of all the
subsegments will be the same fraction of the length of the entire integration
interval:
$$
h = x_{i+1} - x_i = \frac{x_{n-1} - x_0}{m}
\tag{11.8}
$$
And looking closely at Eq. (11.7), it can be seen that all measurements f(x_i) will be
summed twice, except for the measurements at the two bounds, f(x_0) and f(x_{n-1}).
Putting this observation and Eq. (11.8) into the formula of Eq. (11.7) gives the
composite trapezoid rule:

$$
\int_{x_0}^{x_{n-1}} f(x)\,dx \approx \frac{h}{2}\left(f(x_0) + 2\sum_{i=1}^{n-2} f(x_i) + f(x_{n-1})\right)
\tag{11.9}
$$
Comparing to the equation for the trapezoid rule with one segment in the previous
section, it can be seen that Eq. (11.1) is only a simplification of Eq. (11.9) for the
special case where only two measurements at the integration bounds are available.
The pseudocode for the composite trapezoid rule is presented in Fig. 11.5. Like
Eq. (11.9), this code can also simplify for the one-segment rule, by setting the value
of the appropriate input variable.
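As a complement to the pseudocode, the composite rule of Eq. (11.9) can be sketched in Python. The function name `composite_trapezoid` and its argument layout are assumptions made for this illustration; the samples are taken to be equally spaced between the bounds a and b, and the test data are hypothetical samples of f(x) = x^2.

```python
def composite_trapezoid(values, a, b):
    """Approximate the integral over [a, b] from equally spaced samples.

    values[0] and values[-1] are the measurements at the two bounds.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    m = n - 1            # number of segments
    h = (b - a) / m      # common segment width, Eq. (11.8)
    # Interior measurements are counted twice, the bounds once.
    interior = sum(values[1:-1])
    return h / 2 * (values[0] + 2 * interior + values[-1])


# Samples of f(x) = x^2 on [0, 1]; the exact integral is 1/3.
one_segment = composite_trapezoid([0.0, 1.0], 0.0, 1.0)
two_segments = composite_trapezoid([0.0, 0.25, 1.0], 0.0, 1.0)
print(one_segment, two_segments)
```

Passing only the two bound measurements reduces the function to the single-segment rule of Eq. (11.1), since the interior sum is then empty.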
The error on the composite rule is the sum of the error on each two-point
subsegment, and the error of each subsegment can be computed using Eq. (11.4).
This means the error of the entire formula will be:

$$
\int_{x_0}^{x_{n-1}} f(x)\,dx = \frac{h}{2}\left(f(x_0) + 2\sum_{i=1}^{n-2} f(x_i) + f(x_{n-1})\right) - \sum_{i=0}^{n-2} \frac{f^{(2)}(\xi_i)(x_{i+1} - x_i)^3}{12}
\tag{11.10}
$$
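where ξ_i is a point in the interval [x_i, x_{i+1}]. Substituting in Eq. (11.8) further
simplifies the equation to:

$$
\int_{x_0}^{x_{n-1}} f(x)\,dx = \frac{h}{2}\left(f(x_0) + 2\sum_{i=1}^{n-2} f(x_i) + f(x_{n-1})\right) - \frac{(x_{n-1} - x_0)^3}{12m^3}\sum_{i=0}^{n-2} f^{(2)}(\xi_i)
\tag{11.11}
$$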
This leaves n - 1 instances of the second derivative f^(2)(ξ_i) to evaluate, one for
each of the m segments. But recall that, when computing the error for the two-point
trapezoid, one approximation used was that the second derivative at any point in the
integration interval could be replaced by the average value of the second
derivative in the integration interval. Using the same assumption here makes it
possible to replace every instance of f^(2)(ξ_i) with the average:

$$
\sum_{i=0}^{n-2} f^{(2)}(\xi_i) \approx m\,\overline{f^{(2)}}
\tag{11.12}
$$

$$
\int_{x_0}^{x_{n-1}} f(x)\,dx = \frac{h}{2}\left(f(x_0) + 2\sum_{i=1}^{n-2} f(x_i) + f(x_{n-1})\right) - \frac{\overline{f^{(2)}}\,(x_{n-1} - x_0)^3}{12m^2}
\tag{11.13}
$$
Notice that the error term is almost the same as it was for the two-point trapezoid
rule in Eq. (11.5), since x_{n-1} is the upper integration bound as x_1 was back in
Eq. (11.5). The difference is that the error term is divided by m^2, where m is the number of
subsegments within the integration interval; that value was m = 1 in the case of
Eq. (11.5) when the entire integration interval was only one segment. It is however
also important to keep in mind that Eqs. (11.5) and (11.13) give approximations of
the absolute error, not exact values; if the value by which the integral approximation was wrong could be computed exactly, then it would simply be added to the approximation
to get the exact integral value! An error approximation is useful rather to design and
build safety margins into engineering systems. Equation (11.13) also demonstrates
that the error is quadratic in terms of the number of segments. And since the number
of segments is directly related to the interval width h in Eq. (11.8), this means the
trapezoid formulae have a big O error rate of O(h^2).
Example 11.1
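The relationship between the voltage V(t) and current I(t) that goes through a
capacitor over a period of time from t_0 to t_{n-1} can be described by the following
integral, where C is the capacitance value and V(t_0) is the initial voltage
across the capacitor:

$$
V(t) = \frac{1}{C}\int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0)
$$

Time t (s)         0     0.5    1
Current I(t) (A)   0     16.2   1

$$
(1 - 0)\frac{0 + 1}{2} = 0.5\ \mathrm{V}
$$

$$
\frac{(1 - 0)}{2 \cdot 2}\left(0 + 2(16.2) + 1\right) \approx 8.4\ \mathrm{V}
$$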
To compute the error, it is necessary to know the average value of the second
derivative over this one-second interval. That information is not given
directly; however, the first derivative can be computed from the measurements using the methods learned in Chap. 10, and then Eq. (11.6) can be used
to compute the average second derivative. With three measurements available, the forward and backward divided-difference formulae can be applied:
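$$
I^{(1)}(0) = \frac{-I(1) + 4I(0.5) - 3I(0)}{2(0.5)} = \frac{-1 + 4(16.2) - 3(0)}{2(0.5)} = 63.8\ \mathrm{A/s}
$$

$$
I^{(1)}(1) = \frac{I(0) - 4I(0.5) + 3I(1)}{2(0.5)} = \frac{0 - 4(16.2) + 3(1)}{2(0.5)} = -61.8\ \mathrm{A/s}
$$

Then Eq. (11.6) can be used to compute the average second derivative:

$$
\overline{I^{(2)}} = \frac{I^{(1)}(1) - I^{(1)}(0)}{1 - 0} = -125.6\ \mathrm{A/s^2}
$$

$$
-\frac{\overline{I^{(2)}}\,(1 - 0)^3}{12} = 10.6\ \mathrm{V}
$$

for the single-segment trapezoid, and

$$
-\frac{\overline{I^{(2)}}\,(1 - 0)^3}{12 \cdot 2^2} = 2.6\ \mathrm{V}
$$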
for the two-segment trapezoid. This is consistent with a quadratic error rate;
doubling the number of segments has roughly quartered the error on the
approximation.
There is a very large difference between the approximated integral value
with one and two segments. The reason for this difference is that the initial
and final measurements in the interval give a very poor picture of the current
going through the supercapacitor over that time. The current over that period
is illustrated in the figure below: it can be seen that, starting from zero, it rises
to a peak of almost 40 A before dropping again by the time the final
measurement is taken. The single-segment trapezoid, by using only the initial
and final measurements, ignores everything that happened in-between those
bounds. This corresponds to the straight-line interpolation and the purple area
in the figure below; it is clearly a poor representation of the current. The
two-segment trapezoid uses an additional measurement in the middle of the
time interval, and the resulting two interpolations, in red in the figure below
(and including the area in purple), while still inaccurate, nonetheless give a
much better approximation of the current over that time.
For reference, the actual integral value is 16.5 V. This means that the
single-segment trapezoid gave an approximation with an absolute error of
16 V; the error estimate of 10.6 V was in the correct range. Meanwhile, the
two-segment approximation had an absolute error of 8.1 V, three times higher
than the error estimate of 2.6 V, but still in the correct order of magnitude.
11.3
Back in Chap. 10, the Richardson extrapolation method was introduced as a means
to iteratively reduce the error rate of the derivative approximation without the risk
of subtractive cancellation that would come from taking the difference of two
points that are nearer and nearer together. To be sure, the trapezoid rule to
approximate the integral does not perform such a difference, and is therefore not
susceptible to subtractive cancellation. Nonetheless, an iterative method to improve
its approximation accuracy could be very beneficial. The Romberg integration rule
provides such a method.
Suppose two approximations of an integral I are available, both obtained using the composite
trapezoid rule as written out in Eq. (11.13) but with different numbers of segments.
The trapezoid approximation obtained using m_0 segments will be noted I_{0,0}, and the
other obtained using m_1 segments will be noted I_{1,0}.

$$
\begin{aligned}
I &= I_{0,0} + E_0 \\
I &= I_{1,0} + E_1
\end{aligned}
\tag{11.14}
$$
Moreover, from the discussion in the previous section, it has been noted that
doubling the number of segments quarters the error. So if m_1 = 2m_0, then

$$
E_1 = \frac{E_0}{4}
\quad\Rightarrow\quad
I_{1,0} - I_{0,0} = E_0 - E_1 = \frac{3E_0}{4}
\tag{11.15}
$$

Next, solving Eq. (11.15) for E_0 and substituting it back into the I_{0,0} line of Eq. (11.14)
gives:

$$
I = I_{0,0} + \frac{4I_{1,0} - 4I_{0,0}}{3} = \frac{4I_{1,0} - I_{0,0}}{3} = I_{1,1}
\tag{11.16}
$$
This integral approximation is labelled I1,1; the first subscript 1 is because the best
approximation it used from the previous iteration was computed from m1 segments,
and the second subscript 1 is because it is the first iteration (iteration 0 being the
trapezoid rule iteration). Moreover, while the approximations of iteration 0 had an
error rate of O(h2), the approximation at iteration 1 has an error rate of O(h4). This
can be shown from the Taylor series, in a proof similar to that of the Richardson
extrapolation.
This first iteration can be generalized as follows: given two trapezoid approximations I_{j,0} and I_{j-1,0} computed from 2^j and 2^(j-1) segments respectively using
Eq. (11.13) with O(h^2) error rate, then:

$$
I_{j,1} = \frac{4I_{j,0} - I_{j-1,0}}{3} + O(h^4)
\tag{11.17}
$$

This process can then be repeated iteratively. For iteration k, the general version of
the Romberg integration rule is:

$$
I_{j,k} = \frac{4^k I_{j,k-1} - I_{j-1,k-1}}{4^k - 1} + O(h^{2k+2}), \qquad k > 0
\tag{11.18}
$$
and fill it left to right and top to bottom, where each new column added represents
an increment of k in Eq. (11.18) and will contain one less element than the previous
column, and each additional row represents a power of 2 of the number of segments,
or a value of j in Eq. (11.18).
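The table-filling procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the book's own algorithm listing: the function name `romberg_table` is hypothetical, the samples are assumed equally spaced with a number of segments that is a power of two, and row j of the returned list holds I_{j,0} through I_{j,j}. The test data are the three current measurements (0, 16.2 and 1 A) of Example 11.1.

```python
def romberg_table(values, a, b):
    """Build the Romberg table from 2**J + 1 equally spaced samples on [a, b]."""
    n_segments = len(values) - 1
    if n_segments < 1 or n_segments & (n_segments - 1):
        raise ValueError("number of segments must be a power of two")
    levels = n_segments.bit_length()   # rows use 1, 2, 4, ... segments
    table = []
    for j in range(levels):
        m = 2 ** j
        step = n_segments // m         # stride through the samples
        pts = values[::step]
        h = (b - a) / m
        # Column 0: composite trapezoid rule with m segments.
        row = [h / 2 * (pts[0] + 2 * sum(pts[1:-1]) + pts[-1])]
        # Columns k > 0: Romberg rule of Eq. (11.18).
        for k in range(1, j + 1):
            prev = table[j - 1][k - 1]
            row.append((4 ** k * row[k - 1] - prev) / (4 ** k - 1))
        table.append(row)
    return table


currents = [0.0, 16.2, 1.0]            # measured at t = 0, 0.5 and 1 s
table = romberg_table(currents, 0.0, 1.0)
for row in table:
    print(row)
```

With these three measurements the table has two rows: the one- and two-segment trapezoid values in the first column, and a single extrapolated value in the second.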
Example 11.2
The relationship between the voltage V(t) and current I(t) that goes through a
capacitor over a period of time from t_0 to t_{n-1} can be described by the following
integral, where C is the capacitance value and V(t_0) is the initial voltage
across the capacitor:

$$
V(t) = \frac{1}{C}\int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0)
$$

Time t (s)         0     0.5    1
Current I(t) (A)   0     16.2   1
Compute the voltage going across this supercapacitor using the best application of the Romberg integration rule possible.
Solution
To begin, note that, since the first measurement is at the moment the computer
boots up, the initial voltage V(t0) will be null. With the capacitance value of
1 F, the voltage will be only the result of the integral of the current.
At iteration 0, two approximations are possible. I_{0,0} is the 2^0 segment
trapezoid rule, which has been computed in Example 11.1 as:

$$
I_{0,0} = (1 - 0)\frac{0 + 1}{2} = 0.5\ \mathrm{V}
$$

$$
I_{1,0} = \frac{(1 - 0)}{2 \cdot 2}\left(0 + 2(16.2) + 1\right) \approx 8.4\ \mathrm{V}
$$

$$
I_{1,1} = \frac{4^1 I_{1,0} - I_{0,0}}{4^1 - 1} = \frac{4(8.4) - 0.5}{3} = 11.0\ \mathrm{V}
$$

For reference, the actual integral value is 16.5 V. As expected, the higher-iteration Romberg rule result generates a better approximation than either
of the trapezoid rule approximations it is computed from. In fact, I_{0,0} has
a relative error of 97 % and I_{1,0} has a relative error of 49 %, but I_{1,1} has a
relative error of only 34 %.
11.4
Simpson's Rules
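$$
\begin{aligned}
\int_{x_0}^{x_2} f(x)\,dx = {} & \int_{x_0}^{x_2} \left( f(x_1) + f^{(1)}(x_1)(x - x_1) + \frac{f^{(2)}(x_1)}{2!}(x - x_1)^2 + \frac{f^{(3)}(x_1)}{3!}(x - x_1)^3 \right) dx \\
& + \int_{x_0}^{x_2} \frac{f^{(4)}(x_1)}{4!}(x - x_1)^4\,dx
\end{aligned}
\tag{11.19}
$$

$$
\begin{aligned}
\int_{x_0}^{x_2} f(x)\,dx = {} & f(x_1)(x_2 - x_0) + \left[ \frac{f^{(1)}(x_1)}{2}(x - x_1)^2 + \frac{f^{(2)}(x_1)}{3!}(x - x_1)^3 + \frac{f^{(3)}(x_1)}{4!}(x - x_1)^4 \right]_{x_0}^{x_2} \\
& + \left[ \frac{f^{(4)}(x_1)}{5!}(x - x_1)^5 \right]_{x_0}^{x_2}
\end{aligned}
\tag{11.20}
$$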
Next, recall that the three points x_0, x_1 and x_2 are equally spaced, and the distance
between two successive steps is defined as h in Eq. (11.8). As a result, all the even-exponent subtractions in Eq. (11.20) will cancel out, and all the odd-exponent ones
will be added together:

$$
\begin{aligned}
(x_2 - x_1)^2 - (x_0 - x_1)^2 &= 0 \\
(x_2 - x_1)^3 - (x_0 - x_1)^3 &= 2h^3 \\
(x_2 - x_1)^4 - (x_0 - x_1)^4 &= 0 \\
(x_2 - x_1)^5 - (x_0 - x_1)^5 &= 2h^5
\end{aligned}
\tag{11.21}
$$

This result simplifies Eq. (11.20) considerably:

$$
\int_{x_0}^{x_2} f(x)\,dx = f(x_1)\,2h + \frac{f^{(2)}(x_1)}{3}h^3 + \frac{f^{(4)}(x_1)}{60}h^5
\tag{11.22}
$$
The fourth-order term of the series, which has been kept somewhat separate so far
in the equations, will become the error term of the method. This however leaves the
second derivative to deal with in the second-order term; after all, the derivative of
the function is not known, and only three measurements are available. Fortunately,
Chap. 10 has explained how to approximate the derivative of a function from
measurements. Specifically, the centered divided-difference formula for the second
derivative can be substituted into Eq. (11.22), and subsequently simplified to get the
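formula for Simpson's 1/3 rule:

$$
\begin{aligned}
\int_{x_0}^{x_2} f(x)\,dx
&= f(x_1)\,2h + \frac{h^3}{3}\left( \frac{f(x_2) + f(x_0) - 2f(x_1)}{h^2} + O(h^2) \right) + \frac{f^{(4)}(x_1)}{60}h^5 \\
&= \frac{6h}{3}f(x_1) + \frac{h}{3}\left( f(x_2) + f(x_0) - 2f(x_1) \right) + O(h^5) + \frac{f^{(4)}(x_1)}{60}h^5 \\
&= \frac{h}{3}\left( f(x_0) + 4f(x_1) + f(x_2) \right) + O(h^5) \\
&= (x_2 - x_0)\frac{f(x_0) + 4f(x_1) + f(x_2)}{6} + O(h^5)
\end{aligned}
\tag{11.23}
$$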
If more than three measurements are available, then the same idea as for the
composite trapezoid applies: group them into triplets of successive points and
interpolate multiple smaller and more accurate nonoverlapping parabolas, and sum
the resulting areas to get a higher-accuracy approximation of the integral. A general
form of the equation can be obtained to do this:

$$
\int_{x_0}^{x_{n-1}} f(x)\,dx = (x_{n-1} - x_0)\,\frac{f(x_0) + 4\sum\limits_{i=1,3,5,\ldots}^{n-2} f(x_i) + 2\sum\limits_{j=2,4,6,\ldots}^{n-3} f(x_j) + f(x_{n-1})}{3(n - 1)} + O\!\left(\frac{h^5}{n}\right)
\tag{11.24}
$$
Do be careful with the two separate summations that must be computed in the
composite equation: each adds every other measurement, but they are multiplied by
different constants. Note also that, just like with the composite trapezoid equation,
the measurements at the two bounds of the integration interval are only
summed once.
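The two summations of Eq. (11.24) map naturally onto strided slices. The sketch below is illustrative: the function name `composite_simpson` is an assumption, the samples are assumed equally spaced on [a, b], and an odd number of points is required so that they can be grouped into triplets. The test data are hypothetical samples of f(x) = x^3, for which Simpson's 1/3 rule is exact.

```python
def composite_simpson(values, a, b):
    """Composite Simpson's 1/3 rule from an odd number of equally spaced samples."""
    n = len(values)
    if n < 3 or n % 2 == 0:
        raise ValueError("Simpson's 1/3 rule needs an odd number of points (at least 3)")
    odd_sum = sum(values[1:-1:2])    # f(x1), f(x3), ... multiplied by 4
    even_sum = sum(values[2:-1:2])   # f(x2), f(x4), ... multiplied by 2
    total = values[0] + 4 * odd_sum + 2 * even_sum + values[-1]
    return (b - a) * total / (3 * (n - 1))


# Samples of f(x) = x^3: three points on [0, 2] and five points on [0, 4].
three_points = composite_simpson([0.0, 1.0, 8.0], 0.0, 2.0)
five_points = composite_simpson([0.0, 1.0, 8.0, 27.0, 64.0], 0.0, 4.0)
print(three_points, five_points)
```

Both results match the exact integrals (4 and 64), which illustrates that the rule integrates cubic polynomials without error.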
Example 11.3
The relationship between the voltage V(t) and current I(t) that goes through a
capacitor over a period of time from t_0 to t_{n-1} can be described by the following
integral, where C is the capacitance value and V(t_0) is the initial voltage
across the capacitor:

$$
V(t) = \frac{1}{C}\int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0)
$$

Time t (s)         0     0.5    1
Current I(t) (A)   0     16.2   1

$$
\frac{0 + 4(16.2) + 1}{6} = 11.0\ \mathrm{V}
$$
Example 11.4
Two sets of current measurements with different intervals have been taken for
the capacitor of Example 11.3. They are:
Time t (s)         0     0.25   0.5    0.75   1
Current I(t) (A)   0     4.4    16.2   35.5   1

and:

Time t (s)         0     0.2    0.4    0.6    0.8    1
Current I(t) (A)   0     3.1    10.1   24.1   37.1   1

Compute the voltage going across this supercapacitor using Simpson's 1/3
rule on each set of points, and compare the results.
Solution
Using Eq. (11.24) on the first set of five measurements gives:

$$
\frac{0 + 4(4.4 + 35.5) + 2(16.2) + 1}{3 \cdot 4} + O\!\left(\frac{h^5}{5}\right) = 16.1\ \mathrm{V}
$$
The formula for Simpson's 3/8 rule is given below. It can be seen that it has an
error rate of O(h^5), just like Simpson's 1/3 rule with three points. Thus, both
formulae can be used together without loss of accuracy.

$$
\int_{x_0}^{x_3} f(x)\,dx = (x_3 - x_0)\frac{f(x_0) + 3f(x_1) + 3f(x_2) + f(x_3)}{8} + O(h^5)
\tag{11.25}
$$
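A brief sketch of how the two rules can be combined on six equally spaced measurements is given below. The function names are assumptions, and applying the 3/8 rule to the first four points followed by the 1/3 rule to the last three is only one of the possible pairings. The test data are hypothetical samples of f(x) = x^3 on [0, 5].

```python
def simpson_38(values, a, b):
    """Simpson's 3/8 rule, Eq. (11.25), on four equally spaced samples."""
    if len(values) != 4:
        raise ValueError("the 3/8 rule needs exactly four points")
    f0, f1, f2, f3 = values
    return (b - a) * (f0 + 3 * f1 + 3 * f2 + f3) / 8


def simpson_13(values, a, b):
    """Simpson's 1/3 rule on three equally spaced samples."""
    if len(values) != 3:
        raise ValueError("the 1/3 rule needs exactly three points")
    f0, f1, f2 = values
    return (b - a) * (f0 + 4 * f1 + f2) / 6


def simpson_six_points(values, a, b):
    """3/8 rule on the first four points, then 1/3 rule on the last three."""
    if len(values) != 6:
        raise ValueError("this sketch expects exactly six points")
    h = (b - a) / 5
    split = a + 3 * h                 # the fourth point is shared by both rules
    return (simpson_38(values[:4], a, split)
            + simpson_13(values[3:], split, b))


# Samples of f(x) = x^3 at x = 0, 1, ..., 5; the exact integral is 156.25.
cubic = [0.0, 1.0, 8.0, 27.0, 64.0, 125.0]
print(simpson_six_points(cubic, 0.0, 5.0))
```

Both rules integrate cubics exactly, so the combined result equals the true integral here; for general data the two pairings will usually give slightly different values.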
Example 11.5
The relationship between the voltage V(t) and current I(t) that goes through a
capacitor over a period of time from t_0 to t_{n-1} can be described by the following
integral, where C is the capacitance value and V(t_0) is the initial voltage
across the capacitor:

$$
V(t) = \frac{1}{C}\int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0)
$$

Time t (s)         0     0.2    0.4    0.6    0.8    1
Current I(t) (A)   0     3.1    10.1   24.1   37.1   1

Compute the voltage going across this supercapacitor using Simpson's rules.
Solution
To begin, note that, since the first measurement is at the moment the computer
boots up, the initial voltage V(t_0) will be null. With the capacitance value of
1 F, the voltage will be only the result of the integral of the current.
With six measurements, two options are available to use Simpson's rules:
either apply Simpson's 3/8 rule on the first four and Simpson's 1/3 rule on the
last three, or the other way around, apply Simpson's 1/3 rule on the first three
measurements and Simpson's 3/8 rule on the last four. There is, a priori, no
way to prefer one option over the other. So using the first one, Eq. (11.25)
over the first four measurements gives:
I 3=8 0:6 0
11.5
Gaussian Quadrature
So far this chapter has introduced several formulae to approximate the integral of a
function, using different number of measurements and different number of iterations, and with different error rates. By far, the simplest one was the single-segment
trapezoid rule of Eq. (11.1). Unfortunately it was also the method with the worst
error rate. As was demonstrated in Example 11.1, the error stems from the selection
of the two points; they must be the two integration bounds, and these two bounds
may not be representative of the entire function. Since the trapezoid rule is a closed
method, those are the two points that must be used.
Wouldn't it be better if there were an open-method equivalent of the trapezoid
method, one that kept the simplicity of using only two points but made it possible to
choose points within the integration interval that are more representative of the
function? Ideally, these points would make it possible to interpolate a straight line such that the
missing area above the interpolated line and the extra area below the interpolated
line cancel each other out. Such a method could pick the two points that yield a
trapezoid with an area as close as possible to that of the real function. For example, instead
of the two bounds x_0 and x_1 of Fig. 11.2, the two points a and b inside the interval
could be selected as in Fig. 11.9 to get a better approximation of the integral value.
Such a method does exist. It is called the Gaussian quadrature method, or
alternatively the Gauss-Legendre rule. To understand where it comes from, it is
best to go back to the two-point trapezoid and learn a different way to discover that
method.
The single-segment trapezoid of Eq. (11.1) estimates the integral of the function
as a combination of the measurement value of that function at two points, x0 and x1,
the two known integration bounds. The formula can thus be written as:
$$I = \int_{x_0}^{x_1} f(x)\,dx \approx w_0 f(x_0) + w_1 f(x_1) \tag{11.26}$$
The challenge is then discovering what the weights $w_0$ and $w_1$ are. To solve for two
unknown values, two equations are needed; that means two equations where the
function evaluations $f(x_0)$ and $f(x_1)$ and the total integral value $I$ are known. But
these equations don't need to be complicated. They could be, for example, the
integral of the straight-line polynomial $p_0(x) = 1$ and the integral of the diagonal
line $p_1(x) = x$. Moreover, to further simplify, the polynomials can be centered on the
origin, meaning that $-x_0 = x_1 \neq 0$. In that case, the two integrals become:
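$$
\begin{aligned}
\int_{x_0}^{x_1} p_0(x)\,dx &= w_0 p_0(x_0) + w_1 p_0(x_1) \\
\int_{x_0}^{x_1} 1\,dx &= 1 w_0 + 1 w_1 \\
x_1 - x_0 &= w_0 + w_1
\end{aligned}
\tag{11.27}
$$
$$
\begin{aligned}
\int_{x_0}^{x_1} p_1(x)\,dx &= w_0 p_1(x_0) + w_1 p_1(x_1) \\
\int_{x_0}^{x_1} x\,dx &= w_0 x_0 + w_1 x_1 \\
0 &= w_0 x_0 + w_1 x_1
\end{aligned}
\tag{11.28}
$$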
From these two equations, it is simple to solve for the unknown weight values and
find that $w_0 = w_1 = (x_1 - x_0)/2$, the same values as in Eq. (11.1).
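The elimination behind that result can be written out in a few lines. This is only a sketch of the algebra, relying on the centering assumption $-x_0 = x_1 \neq 0$ made above and on nothing else:

```latex
\begin{aligned}
0 &= w_0 x_0 + w_1 x_1 = (w_1 - w_0)\,x_1 && \text{since } x_0 = -x_1 \\
\Rightarrow\quad w_0 &= w_1 && \text{since } x_1 \neq 0 \\
x_1 - x_0 &= w_0 + w_1 = 2 w_0 \\
\Rightarrow\quad w_0 &= w_1 = \frac{x_1 - x_0}{2}
\end{aligned}
```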
The Gaussian quadrature formula starts from the same single-segment formula,
except the two measurements are taken not at the known bounds but at two
unknown points a and b inside the integration interval. The equation thus becomes:
$$\int_{x_0}^{x_1} f(x)\,dx \approx w_0 f(a) + w_1 f(b) \tag{11.29}$$
And there are now four unknown values to discover, namely the two points in
addition to the two weights. Four equations will be needed in that case; so in
addition to the integrals of the straight line and diagonal line polynomials from
earlier, add the polynomials $p_2(x) = x^2$ and $p_3(x) = x^3$. To simplify the problem,
assume that the original function $f(x)$ has been transformed to an equivalent
function $g(y)$ of the same degree and area but over the integration interval
$y_0 = -1$ to $y_1 = 1$. In that case, the four equations become:
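$$
\begin{aligned}
\int_{-1}^{1} p_0(y)\,dy &= w_0 p_0(a) + w_1 p_0(b) \\
\int_{-1}^{1} 1\,dy &= 1 w_0 + 1 w_1 \\
2 &= w_0 + w_1
\end{aligned}
\tag{11.30}
$$
$$
\begin{aligned}
\int_{-1}^{1} p_1(y)\,dy &= w_0 p_1(a) + w_1 p_1(b) \\
\int_{-1}^{1} y\,dy &= w_0 a + w_1 b \\
0 &= w_0 a + w_1 b
\end{aligned}
\tag{11.31}
$$
$$
\begin{aligned}
\int_{-1}^{1} p_2(y)\,dy &= w_0 p_2(a) + w_1 p_2(b) \\
\int_{-1}^{1} y^2\,dy &= w_0 a^2 + w_1 b^2 \\
\frac{2}{3} &= w_0 a^2 + w_1 b^2
\end{aligned}
\tag{11.32}
$$
$$
\begin{aligned}
\int_{-1}^{1} p_3(y)\,dy &= w_0 p_3(a) + w_1 p_3(b) \\
\int_{-1}^{1} y^3\,dy &= w_0 a^3 + w_1 b^3 \\
0 &= w_0 a^3 + w_1 b^3
\end{aligned}
\tag{11.33}
$$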
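With four equations and four unknowns, it is easy to solve to find that $w_0 = w_1 = 1$
and $-a = b = 1/\sqrt{3}$. Equation (11.29) thus becomes:
$$I = \int_{x_0}^{x_1} f(x)\,dx = \int_{-1}^{1} g(y)\,dy \approx g\left(-\frac{1}{\sqrt{3}}\right) + g\left(\frac{1}{\sqrt{3}}\right) \tag{11.34}$$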
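The four conditions of Eqs. (11.30) to (11.33) can be checked numerically. The following is a minimal sketch, not taken from the book: the names `gauss2_unit` and `exact_monomial` are illustrative, and the check simply applies the two-point rule to the monomials $y^k$ on the interval from -1 to 1.

```python
import math

# Two-point rule on [-1, 1]: both weights are 1, points are -1/sqrt(3) and +1/sqrt(3).
A = -1.0 / math.sqrt(3.0)
B = 1.0 / math.sqrt(3.0)

def gauss2_unit(g):
    """Approximate the integral of g over [-1, 1] with the two-point rule."""
    return g(A) + g(B)

def exact_monomial(k):
    """Exact integral of y**k over [-1, 1]: 2/(k+1) for even k, 0 for odd k."""
    return 2.0 / (k + 1) if k % 2 == 0 else 0.0

# Absolute error of the rule on y**0, y**1, ..., y**4.
errors = [abs(gauss2_unit(lambda y, k=k: y**k) - exact_monomial(k))
          for k in range(5)]
```

The first four errors vanish up to rounding, which is exactly what the four equations demand; the fifth does not, since a rule with four free parameters cannot also match $y^4$.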
The only problem that remains is to convert f(x) into its equivalent form g( y),
and Eq. (11.34) will make it easy to compute an approximation of its integral. This
transformation will be done with a linear mapping of the form:
$$
\begin{aligned}
x &= c_0 + c_1 y \\
dx &= c_1\,dy
\end{aligned}
\tag{11.35}
$$
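This transformation introduces two more unknown coefficients, $c_0$ and $c_1$, and thus
two equations will be needed to discover their values. Fortunately, two equations
are already available thanks to the known mappings of the two bounds from $x_0$ and
$x_1$ to $-1$ and $1$ respectively:
$$
\begin{aligned}
x_0 &= c_0 + c_1(-1) \\
x_1 &= c_0 + c_1(1)
\end{aligned}
\tag{11.36}
$$
Solving these two equations gives $c_0 = (x_1 + x_0)/2$ and $c_1 = (x_1 - x_0)/2$, so the
mapping of Eq. (11.35) becomes:
$$x = \frac{(x_1 + x_0) + (x_1 - x_0)\,y}{2} \tag{11.37}$$
$$dx = \frac{x_1 - x_0}{2}\,dy \tag{11.38}$$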
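Put together, Eqs. (11.34), (11.37) and (11.38) fit in a few lines of code. The sketch below is one possible implementation under those equations; the function name `gauss2` and the two test integrands are assumptions made for illustration.

```python
import math

def gauss2(f, x0, x1):
    """Two-point Gaussian quadrature of f over [x0, x1]."""
    c0 = (x1 + x0) / 2.0          # midpoint, c0 of Eq. (11.35)
    c1 = (x1 - x0) / 2.0          # half-width, c1 of Eq. (11.35)
    y = 1.0 / math.sqrt(3.0)      # evaluation points are -y and +y
    # g(y) = f(c0 + c1*y) * c1, and both weights are 1.
    return c1 * (f(c0 - c1 * y) + f(c0 + c1 * y))

cubic = gauss2(lambda x: x**3, 1.0, 2.0)   # true integral is 3.75
sine = gauss2(math.sin, 0.0, math.pi)      # true integral is 2
```

The cubic is integrated exactly up to rounding, while the sine is only approximated, since it is not a polynomial of degree three or less.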
Example 11.6
By taking samples of the current going through a supercapacitor at every
second during seven seconds and computing an interpolation, the following
mathematical model of the current has been developed:
$$I(t) = 8t + 42t^2 - 45t^3 + 62t^4 + 286t^5 - 352t^6$$
Determine the voltage across the supercapacitor over the first second
after the system's boot-up using the Gaussian quadrature method, knowing
that its capacitance value is 1 F.
Solution
The relationship between the voltage V(t) and current I(t) is the following
integral:
$$V(t) = \frac{1}{C} \int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0)$$
Since the integration interval starts at the moment the computer boots up, the
initial voltage $V(t_0)$ will be zero, and with the capacitance value of 1 F the
voltage will be only the result of the integral of the current.
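The first step of applying the Gaussian quadrature method is to convert the
integral using the two Eqs. (11.37) and (11.38). With the integration bounds
$t_0 = 0$ and $t_1 = 1$, the equations become:
$$t = \frac{1 + y}{2}$$
$$dt = \frac{dy}{2}$$
$$
\begin{aligned}
V(t) &= \int_{0}^{1} \left( 8t + 42t^2 - 45t^3 + 62t^4 + 286t^5 - 352t^6 \right) dt \\
&= \int_{-1}^{1} \left( 8\left(\frac{1+y}{2}\right) + 42\left(\frac{1+y}{2}\right)^2 - 45\left(\frac{1+y}{2}\right)^3 + 62\left(\frac{1+y}{2}\right)^4 + 286\left(\frac{1+y}{2}\right)^5 - 352\left(\frac{1+y}{2}\right)^6 \right) \frac{dy}{2}
\end{aligned}
$$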
The next step is to approximate the value of the integral using Eq. (11.34):
$$V(t) = \int_{-1}^{1} g(y)\,dy \approx g\left(-\frac{1}{\sqrt{3}}\right) + g\left(\frac{1}{\sqrt{3}}\right) = 1.7 + 18.5 = 20.2\ \text{V}$$
Compared to the real value of the integral of 16.5 V, this approximation has a
relative error of 22.2 %. This is a massive improvement compared to the
trapezoid rule approximation of Example 11.1, which had a relative error of
97.0 % using the same number of measurements. This approximation is also
an improvement compared to the three-measurement approximations
obtained by the composite trapezoid rule and Simpson's 1/3 rule, which had
relative errors of 49.5 % and 33.7 % respectively, despite being computed
using one more measurement than this approximation.
To further illustrate how this equation works, apply Eq. (11.37) again to
find that the two points at $y = \pm 1/\sqrt{3}$ correspond to times t = 0.21 s and
t = 0.79 s. This means the integral approximation is computed from the
colored area of the trapezoid under the red line in the figure below. Compared
to the single-segment and two-segment trapezoids of Example 11.1, included
as the light and dark purple lines respectively in this figure, it is clear to see
how the Gaussian quadrature gives a superior result. Because it is an open
method, it can forgo the unrepresentative points at the integration bounds
that the two trapezoid rules are forced to use in their computations. The
straight-line interpolation resulting from the Gaussian quadrature points is
clearly a better linear approximation of the function over a large part of the
integration interval than either of the trapezoid interpolations. And even the
error, the large section included under the interpolation line beyond t = 0.83 s
when the function begins decreasing quickly, is cancelled out in part by the
negative area under the interpolation line from t = 0 to t = 0.15 s.
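The arithmetic of this example can be reproduced with a short script. This is a sketch under one stated assumption: the coefficient signs in `COEFFS` are the ones that reproduce the values quoted in the example (contributions of about 1.7 and 18.5, and a true integral of about 16.5 V). The names `COEFFS`, `current` and `g` are illustrative.

```python
import math

# Coefficient of t**k stored at index k (signs are an assumption, see above).
COEFFS = [0.0, 8.0, 42.0, -45.0, 62.0, 286.0, -352.0]

def current(t):
    """Polynomial model of the current I(t)."""
    return sum(c * t**k for k, c in enumerate(COEFFS))

def g(y):
    """Integrand mapped to [-1, 1]: t = (1 + y)/2 and dt = dy/2."""
    return current((1.0 + y) / 2.0) / 2.0

y = 1.0 / math.sqrt(3.0)
low, high = g(-y), g(y)          # the two terms of Eq. (11.34)
approx = low + high              # Gaussian quadrature estimate, in volts
# Exact integral on [0, 1]: each term c*t**k integrates to c/(k + 1).
exact = sum(c / (k + 1) for k, c in enumerate(COEFFS))
rel_error = abs(approx - exact) / abs(exact)
```

Carrying full precision rather than one decimal gives a slightly smaller estimate and relative error than the rounded figures printed in the example.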
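The Gaussian quadrature method is not limited to two points. More generally, the
integral can be approximated from n points $y_k$ with weights $w_k$:
$$\int_{x_0}^{x_1} f(x)\,dx = \int_{-1}^{1} g(y)\,dy \approx \sum_{k=0}^{n-1} w_k\, g(y_k) \tag{11.39}$$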
In the case where n = 2, the weights $w_k$ are always 1 and the evaluated points $y_k$ are
$\pm 1/\sqrt{3}$, and Eq. (11.39) reduces to Eq. (11.34). Weights and points for the first four
values of n are presented in Table 11.1. Notice from this table that the weights and
points are different at every value of n; this means that the complete summation will
have to be recomputed from scratch every time the number of points n is increased.
The Gaussian quadrature method thus cannot be implemented in an iterative
algorithm that increments the number of points to gradually improve the quality
of the approximation, in the way the Romberg integration rule did.
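Table 11.1 Evaluated points and weights for the first four values of n

| Number of points n | Evaluated points $y_k$ | Weights $w_k$ |
|---|---|---|
| 1 | $0$ | $2$ |
| 2 | $-\sqrt{1/3}$ | $1$ |
|   | $+\sqrt{1/3}$ | $1$ |
| 3 | $-\sqrt{3/5}$ | $5/9$ |
|   | $0$ | $8/9$ |
|   | $+\sqrt{3/5}$ | $5/9$ |
| 4 | $-\sqrt{\frac{3}{7} + \frac{2}{7}\sqrt{\frac{6}{5}}}$ | $\frac{18 - \sqrt{30}}{36}$ |
|   | $-\sqrt{\frac{3}{7} - \frac{2}{7}\sqrt{\frac{6}{5}}}$ | $\frac{18 + \sqrt{30}}{36}$ |
|   | $+\sqrt{\frac{3}{7} - \frac{2}{7}\sqrt{\frac{6}{5}}}$ | $\frac{18 + \sqrt{30}}{36}$ |
|   | $+\sqrt{\frac{3}{7} + \frac{2}{7}\sqrt{\frac{6}{5}}}$ | $\frac{18 - \sqrt{30}}{36}$ |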
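The table translates directly into a lookup for Eq. (11.39). The sketch below hard-codes the standard Gauss-Legendre points and weights for n = 1 to 4; the names `GAUSS_TABLE` and `gauss_legendre` are illustrative and not from the book.

```python
import math

_S = math.sqrt(6.0 / 5.0)
_R = math.sqrt(30.0)

# (point, weight) pairs on [-1, 1] for each number of points n.
GAUSS_TABLE = {
    1: [(0.0, 2.0)],
    2: [(-math.sqrt(1 / 3), 1.0), (math.sqrt(1 / 3), 1.0)],
    3: [(-math.sqrt(3 / 5), 5 / 9), (0.0, 8 / 9), (math.sqrt(3 / 5), 5 / 9)],
    4: [(-math.sqrt(3 / 7 + 2 / 7 * _S), (18 - _R) / 36),
        (-math.sqrt(3 / 7 - 2 / 7 * _S), (18 + _R) / 36),
        (math.sqrt(3 / 7 - 2 / 7 * _S), (18 + _R) / 36),
        (math.sqrt(3 / 7 + 2 / 7 * _S), (18 - _R) / 36)],
}

def gauss_legendre(f, x0, x1, n):
    """n-point Gaussian quadrature of f over [x0, x1], for n in 1..4."""
    c0 = (x1 + x0) / 2.0
    c1 = (x1 - x0) / 2.0
    # Eq. (11.39), with g(y) = f(c0 + c1*y) * c1.
    return c1 * sum(w * f(c0 + c1 * y) for y, w in GAUSS_TABLE[n])

# Each n-point rule integrates x**(2n - 1) over [0, 2]; the true value is 2**(2n) / (2n).
results = {n: gauss_legendre(lambda x, n=n: x**(2 * n - 1), 0.0, 2.0, n)
           for n in GAUSS_TABLE}
```

Because the points and weights change completely with n, the lookup stores each row separately; nothing computed for one value of n is reused for the next.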
The development of the error for Eq. (11.39) falls outside the scope of this book,
but the final result is:
$$E(x) = \frac{2^{2n+1} (n!)^4}{(2n+1) \left( (2n)! \right)^3}\, g^{(2n)}(x) = O\left(h^{2n}\right) \tag{11.40}$$
This shows that the Gaussian quadrature method will compute the exact integral
value with no error for a polynomial f(x) of degree up to 2n - 1, in which case the 2nth
derivative of g(x) will be 0. In the special case of n = 2 that has been the topic of
this section, the method will have an error rate of $O(h^4)$, a considerable improvement compared to the $O(h^2)$ error rate of the trapezoid rule with the same number
of points.
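As a worked instance, the coefficient of the error term can be evaluated for n = 2. This sketch assumes the standard form of the Gauss-Legendre error coefficient, with the cube of (2n)! in the denominator:

```latex
E = \frac{2^{5}\,(2!)^{4}}{5\,(4!)^{3}}\; g^{(4)}(x)
  = \frac{32 \cdot 16}{5 \cdot 13824}\; g^{(4)}(x)
  = \frac{g^{(4)}(x)}{135}
```

The fourth derivative of any polynomial of degree three or less is zero, which is why the two-point rule integrates cubics exactly.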
11.6
Engineering Applications
$$\int_{x_0}^{x_1} F(x)\,dx \tag{11.41}$$
The Fourier transform of a continuous signal over time s(t) into a continuous
frequency-domain signal $S(\omega)$ is done using the equation:
$$S(\omega) = \int_{-\infty}^{\infty} s(t)\, e^{-j\omega t}\,dt \tag{11.42}$$
where e and j are Euler's number and the imaginary unit, respectively.
According to Ohm's law, the voltage between two points $x_0$ and $x_1$ along a path
is given by:
$$V = \int_{x_0}^{x_1} E\,dx = \int_{x_0}^{x_1} \rho J\,dx \tag{11.43}$$
where E is the electric field, J is the current density, and $\rho$ is the resistivity along
the path.
Given a spring of stiffness k that was initially at rest and was gradually stretched
or compressed by a length L, the total elastic potential energy transferred into the
spring is computed as:
$$U = \int_{0}^{L} kx\,dx \tag{11.44}$$
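11.7
Summary

| Method | Requires | Error |
|---|---|---|
| Trapezoid rule | Measurements at the two integration bounds | $O(h^2)$ |
| Composite trapezoid rule | n measurements of the function, n > 2 | $O(h^2)$ |
| Romberg integration rule | n measurements of the function and k iterations, $n = 2^k$ | $O(h^{2k+2})$ |
| Simpson's rules | n measurements of the function, n ≥ 3 | $O(h^5)$ |
| Gaussian quadrature | Two points selected within boundaries | $O(h^4)$ |

11.8
Exercises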
1. Approximate the integral of the function $f(x) = e^{-x}$ over the interval [0, 10]
using:
(a) A single-segment trapezoid rule.
(b) A 20-segment composite trapezoid rule.
(c) Romberg's integration rule, starting with one interval and continuing until
the absolute error between two approximations is less than 0.000001.
(d) Simpson's rule with three points.
(e) Simpson's rule with four points.
2. Using a single-segment trapezoid rule, approximate the integral of the following functions over the specified intervals.
(a) $f(x) = x^3$ over the interval [1, 2].
(b) $f(x) = e^{-0.1x}$ over the interval [2, 5].
3. Using a single-segment trapezoid rule, approximate the integral of the following functions over the specified intervals. Then, evaluate their approximate
error and their real error.
(a) $f(x) = x^2$ over the interval [0, 2].
(b) $f(x) = x^4$ over the interval [0, 2].
(c) $f(x) = \cos(x)$ over the interval [0.2, 0.4].
4. Approximate the integral of $f(x) = x^3$ over the interval [1, 2] using a four-segment composite trapezoid rule.
5. Approximate the integral of $f(x) = xe^{-x}$ over the interval [0, 4] using a
10-segment composite trapezoid rule.
6. Using four- and eight-segment composite trapezoid rules, approximate the
integral of the following functions over the specified intervals. Then, evaluate
their approximate error and their real error when using eight segments.
(a) $f(x) = x^2$ over the interval [-2, 2].
(b) $f(x) = x^4$ over the interval [-2, 2].
7. Use Romberg integration to approximate the integral of $f(x) = \cos(x)$ over the
interval [0, 3], starting with one interval and computing ten iterations.
8. Use Romberg integration to approximate the integral of $f(x) = x^5$ on the interval
[0, 4], starting with one interval and until the error on two successive steps is 0.
9. Use Romberg integration to approximate the integral of $f(x) = \sin(x)$ on the
interval $[0, \pi]$, starting with one interval and until the error on two successive
steps is less than $10^{-5}$.
10. Using a three-point Simpson's rule, approximate the integral of the following
functions over the specified intervals.
(a) $f(x) = x^3$ over the interval [1, 2].
(b) $f(x) = e^{-0.1x}$ over the interval [2, 5].
11. Using a three-point Simpson's rule and a four-point Simpson's rule, approximate the integral of the following functions over the specified intervals.
(a) $f(x) = x^2$ over the interval [0, 2].
(b) $f(x) = x^4$ over the interval [0, 2].
Chapter 12
12.1
Introduction
Consider a simple RC circuit such as the one shown in Fig. 12.1. Kirchhoff's law
states that this circuit can be modelled by the following equation:
$$\frac{dV}{dt} = -\frac{V(t)}{RC} \tag{12.1}$$
This model would be easy to use if the voltage and the values of the resistor and
capacitor were known. But what if the voltage is not known or measurable over time,
and only the initial conditions of the system are known? That is to say, only the
initial value of the voltage and of its derivative, along with the resistor and capacitor
values, are known.
This type of problem is an initial value problem (IVP), a situation in which a
parameter's change (derivative) can be modelled mathematically by an equation,
initial condition measurements are available, and future values of the parameter
need to be estimated. Naturally, if the initial value of a parameter and the equation
modelling its change over time are both available, it can be expected that it is
possible to predict the value at any time in the system's operation. Different
numerical methods to accomplish this, with different levels of complexity and of
accuracy, will be presented in this chapter.
To formalize the discussion, an equation such as (12.1), or more generally any
equation of the form
$$y^{(1)}(t) = f(t, y(t)) = c_0 + c_1 y(t) + c_2 t \tag{12.2}$$
is called a first-order ordinary differential equation (ODE). For an IVP, the initial
value $y(t_0) = y_0$ is known, as are the values of the coefficients $c_0$, $c_1$, and $c_2$, and the
goal is to determine the value at a future time $y(t_{n-1})$. However, the challenge is that
Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_12
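the function y(t) itself is unknown over the continuous time interval:
$$t \in [t_0, t_{n-1}] \tag{12.3}$$
Rather than work over this continuous interval,
the methods in this chapter will instead consider a set of n discrete time mesh
points: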
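$$t \in \{t_0, \ldots, t_i, t_{i+1}, \ldots, t_{n-1}\} \tag{12.4}$$
These mesh points are equally spaced, with a step size h between two successive points:
$$h = \frac{t_{n-1} - t_0}{n - 1} = t_{i+1} - t_i \tag{12.5}$$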
From these equations, any mesh point within a problem's time interval can be
written as:
$$t_i = t_0 + ih \tag{12.6}$$
While these definitions may seem simple, and indeed they are, they will also be
fundamental to the numerical methods presented in this chapter. Indeed, they make
the IVP simpler: instead of trying to model the behavior of the unknown
function y(t) over the entire time interval of Eq. (12.3), it is only necessary to
approximate it over the finite set of mesh points of Eq. (12.4).
12.2
Euler's Method
Euler's method proceeds as follows: starting at the initial known mesh point $t_0$, evaluate the derivative at
each mesh point and follow the straight line to approximate the function to the next
mesh point, and repeat this process until the target point $t_{n-1}$ is reached. Stated
more formally, this method follows the equation:
$$
\begin{aligned}
y(t_{i+1}) &= y(t_i) + h\,y^{(1)}(t_i) \\
&= y(t_i) + h f(t_i, y(t_i))
\end{aligned}
\tag{12.7}
$$
Starting from the known initial conditions of $y(t_0) = y_0$ at time $t_0$, it is possible to
evaluate the ODE to obtain the value of the derivative $y^{(1)}(t_0)$ and to use it to
approximate the value of $y(t_1)$. This process is then repeated iteratively at each
mesh point until the requested value of $y(t_{n-1})$ is obtained. The pseudocode of an
algorithm to do this is presented in Fig. 12.2.
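The iteration of Eq. (12.7) over the mesh of Eqs. (12.5) and (12.6) can also be sketched in Python. This is an illustrative implementation rather than the book's algorithm: the function name `euler`, its argument order, and the test problem $y^{(1)}(t) = y(t)$ are assumptions.

```python
def euler(f, t0, y0, t_end, n):
    """Approximate y on n equally spaced mesh points from t0 to t_end.

    f(t, y) is the derivative of the ODE and y0 the initial value y(t0).
    Returns the list of approximations at every mesh point.
    """
    h = (t_end - t0) / (n - 1)        # step size, Eq. (12.5)
    y = y0
    values = [y0]
    for i in range(n - 1):
        t = t0 + i * h                # current mesh point, Eq. (12.6)
        y = y + h * f(t, y)           # one Euler step, Eq. (12.7)
        values.append(y)
    return values

# Test problem: y' = y with y(0) = 1, whose exact solution is e**t.
approx = euler(lambda t, y: y, 0.0, 1.0, 1.0, 11)
```

Recomputing each mesh point from $t_0 + ih$ instead of repeatedly adding h keeps rounding errors from building up in the time variable.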
Equation (12.7) should be immediately recognizable as a first-order Taylor
series approximation of the function y(t) evaluated at $t_{i+1}$ from the point $t_i$ (and
indeed it could have been obtained from the Taylor series instead of the reasoning
presented above). This means that the error on this method is proportional to the
second-order term of the series:
$$\frac{y^{(2)}(t_i)}{2!} h^2 = O\left(h^2\right) \tag{12.8}$$
Euler's method thus has a quadratic error rate on each step, and for example halving the step size
h will quarter the approximation error of a step. It should be easy to understand why reducing
the step size improves the approximation: as was seen in Chap. 5, the underlying
assumption that a function can be approximated by its straight-line first derivative is
only valid for a short interval around any given point and becomes more erroneous
the farther away from that point the approximation goes.
Example 12.1
Using Kirchhoff's law, the voltage in a circuit has been modelled by the
following equation:
$$\frac{dV}{dt} = V(t) - t + 1$$
Given that the initial voltage was 0.5 V, determine the voltage after 1 s
using six steps of Euler's method.
Solution
Using n = 6 gives a sample every 0.20 s, following Eq. (12.5). Putting the
ODE of this problem into Euler's method, Eq. (12.7), gives the
formula to compute these samples:
$$
\begin{aligned}
V(t_{i+1}) &= V(t_i) + h V^{(1)}(t_i) \\
&= V(t_i) + h\left(V(t_i) - t_i + 1\right)
\end{aligned}
$$
And this formula can then be used to compute the value at each step of the
method:
$$
\begin{aligned}
V(0) &= 0.5\ \text{V} \\
V(0.20) &= 0.5 + 0.20(0.5 - 0 + 1) = 0.80\ \text{V} \\
V(0.40) &= 0.80 + 0.20(0.80 - 0.20 + 1) = 1.12\ \text{V} \\
V(0.60) &= 1.12 + 0.20(1.12 - 0.40 + 1) = 1.46\ \text{V} \\
V(0.80) &= 1.46 + 0.20(1.46 - 0.60 + 1) = 1.83\ \text{V} \\
V(1) &= 1.83 + 0.20(1.83 - 0.80 + 1) = 2.24\ \text{V}
\end{aligned}
$$
To compare, note that the equation for the voltage used in this example was:
$$V(t) = t + \frac{e^t}{2}$$
The actual voltage values computed by this equation are presented in the table
below, alongside the values computed by Euler's method and their relative
error. It can be seen that the error is small, thanks to the small step size used
in this example. It can also be noted that the error increases in each successive
step. This is a consequence of the process implemented by Euler's method, as
explained in this section: the new point estimated at each step is computed by
following an approximation of the function starting from an approximation of
the previous point, and thus errors accumulate step after step.
[Table: actual voltage, Euler's method approximation, and relative error at V(0.20), V(0.40), V(0.60), V(0.80), and V(1.00)]
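The steps of Example 12.1 and the comparison with the actual voltage can be recomputed with a few lines of Python. This sketch assumes the ODE $dV/dt = V(t) - t + 1$ and the solution $V(t) = t + e^t/2$, the signs that are consistent with the step values printed in the example; the variable names are illustrative.

```python
import math

def f(t, v):
    """Derivative of the voltage: dV/dt = V(t) - t + 1."""
    return v - t + 1.0

h = 0.2
v = 0.5                       # initial voltage V(0)
euler_values = []
for i in range(5):            # five steps of 0.20 s reach t = 1 s
    t = i * h
    v = v + h * f(t, v)       # Eq. (12.7)
    euler_values.append(v)

# Actual voltage V(t) = t + e**t / 2 at the same mesh points.
exact_values = [k * h + 0.5 * math.exp(k * h) for k in range(1, 6)]
rel_errors = [abs(a - e) / e for a, e in zip(euler_values, exact_values)]
```

The relative errors grow from one mesh point to the next, which is the accumulation of error described in the example.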
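Example 12.2
Given the following IVP, approximate the value of y(1) and y(0.5) using one
step of Euler's method for each:
$$y^{(1)}(t) = 1 - t\,y(t)$$
$$y(0) = 1$$
Solution
Using Eq. (12.7), the result can be computed immediately:
$$y(1) \approx y(0) + 1 \times f(0, y(0)) = 1 + 1 \times (1 - 0 \times 1) = 2$$
$$y(0.5) \approx y(0) + 0.5 \times f(0, y(0)) = 1 + 0.5 \times (1 - 0 \times 1) = 1.5$$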
12.3
Heun's Method
It was explained in the previous section that Euler's method approximates the
behavior of a function by following the derivative at the current point $y(t_i)$ for
one step. But since the function being approximated will normally not be linear, the
approximated behavior will diverge from the real function and the estimated next
point $y(t_{i+1})$ will be somewhat off. From that point, the function will again be
approximated by a straight line, and the following point $y(t_{i+2})$ will be more off
compared to the real function's value. These errors will continue to accumulate,
step after step. In the case of a convex function such as the one in Fig. 12.3, for
example, it will lead to a consistent and increasing underestimation of the values of
the function.
The reason for this accumulation of error is that the derivative at $y(t_i)$ is a good
approximation of the function's behavior at that point, but not at the next point $y(t_{i+1})$.
Fig. 12.3 Euler's method underestimating a convex function's values
But what if the derivative at $y(t_{i+1})$ was somehow available to be used in Euler's
method instead of the derivative at $y(t_i)$? It would give a good approximation of the
behavior of the function at $y(t_{i+1})$... but not at $y(t_i)$. The net result would be an
accumulation of errors in the opposite direction from before. For the convex function
of Fig. 12.3, it would lead to a consistent and increasing overestimation instead of an
underestimation, resulting in Fig. 12.4.
Considering the previous discussion, and comparing Figs. 12.3 and 12.4, a
solution becomes apparent: to average out the two estimates. Since the derivative
at $y(t_i)$ is a good approximation of the behavior of the function at $y(t_i)$ but leads to
errors at $y(t_{i+1})$, and vice-versa, an average of the two derivatives should lead to a
good representation of the function's behavior on average over the interval from $t_i$
to $t_{i+1}$. Or, looking at the figures, taking the average of the underestimation of
Fig. 12.3 and the overestimation of Fig. 12.4 should give much more accurate
estimates, as shown in Fig. 12.5. And a better approximation of the behavior of the
function will, in turn, lead to a better approximation of $y(t_{i+1})$.
That is the intuition that underlies Heun's method. Mathematically, it simply
consists in rewriting the Euler's method equation of (12.7) to use the average of the
two derivatives instead of using only the derivative at the current point:
$$
\begin{aligned}
y(t_{i+1}) &= y(t_i) + h\,\frac{y^{(1)}(t_i) + y^{(1)}(t_{i+1})}{2} \\
&= y(t_i) + h\,\frac{f(t_i, y(t_i)) + f(t_{i+1}, y(t_{i+1}))}{2}
\end{aligned}
\tag{12.9}
$$
There is one problem with Eq. (12.9): it requires the use of the value of the next
point $y(t_{i+1})$ in order to compute the derivative at the next point, $f(t_{i+1}, y(t_{i+1}))$, and
that next point is exactly what the method is meant to estimate! This circular
requirement can be solved easily though, by using Euler's method to get an initial
estimate of the value of $y(t_{i+1})$. That initial estimate will be of lesser quality than the
one computed by Heun's method, but it makes the computation of Heun's method
possible. Integrating Euler's method into Heun's method alters Eq. (12.9) into:
$$y(t_{i+1}) = y(t_i) + h\,\frac{f(t_i, y(t_i)) + f\left(t_{i+1},\, y(t_i) + h f(t_i, y(t_i))\right)}{2} \tag{12.10}$$
The pseudocode for Heun's method is only a small modification of the one
presented earlier for Euler's method, as shown in Fig. 12.6.

    y ← Input initial value
    t ← Input initial mesh point
    tU ← Input target mesh point
    h ← Input step size
    WHILE (t < tU)
        Euler ← y + h × F(t,y)
        y ← y + h × [ F(t,y) + F(t+h,Euler) ] / 2
        t ← t + h
    END WHILE
    RETURN y

    FUNCTION F(t,y)
        RETURN evaluation of the derivative of the target function at mesh
               point t and at function point y
    END FUNCTION

Fig. 12.6 Pseudocode of Heun's method
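For readers who prefer running code, Eq. (12.10) can be sketched in Python as follows. The function name `heun` and its signature are assumptions made for illustration; the usage line applies it to the ODE $dV/dt = V(t) - t + 1$ with $V(0) = 0.5$, the problem used in the examples of this chapter, assuming those signs.

```python
def heun(f, t0, y0, t_end, n):
    """Approximate y on n equally spaced mesh points using Heun's method."""
    h = (t_end - t0) / (n - 1)                       # step size, Eq. (12.5)
    y = y0
    values = [y0]
    for i in range(n - 1):
        t = t0 + i * h
        slope_here = f(t, y)                         # derivative at the current point
        euler_guess = y + h * slope_here             # Euler estimate of the next point
        slope_next = f(t + h, euler_guess)           # derivative at that estimate
        y = y + h * (slope_here + slope_next) / 2.0  # Eq. (12.10)
        values.append(y)
    return values

# dV/dt = V - t + 1 with V(0) = 0.5, on six mesh points from 0 s to 1 s.
values = heun(lambda t, v: v - t + 1.0, 0.0, 0.5, 1.0, 6)
```

Each step costs two evaluations of the derivative instead of one, which is the price paid for the better error rate.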
Just like for Euler's method, the error for Heun's method can be obtained from its
Taylor series approximation. Since it was already shown that Euler's method can be
obtained from the first-order Taylor series approximation and it was stated that
Heun's method is more accurate, then it can be expected that Heun's method could
be obtained from a second-order Taylor series approximation, and thus that its error
will be the next term. So begin from a third-order Taylor series approximation:
$$y(t_{i+1}) = y(t_i) + h\,y^{(1)}(t_i) + \frac{y^{(2)}(t_i)}{2!} h^2 + \frac{y^{(3)}(t_i)}{3!} h^3 \tag{12.11}$$
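The second derivative in this equation can be replaced using the forward divided-difference
formula, which expresses a derivative from values at two successive mesh points. Written
for the first derivative and then for the second derivative, it gives:
$$
\begin{aligned}
y^{(1)}(t_i) &= \frac{y(t_{i+1}) - y(t_i)}{h} - \frac{y^{(2)}(t_i)}{2!} h \\
y^{(2)}(t_i) &= \frac{y^{(1)}(t_{i+1}) - y^{(1)}(t_i)}{h} - \frac{y^{(3)}(t_i)}{2!} h
\end{aligned}
\tag{12.12}
$$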
Substituting this second derivative into Eq. (12.11) gives the formula of (12.13),
which is only a simplification away from Eq. (12.9) and shows the error term to be
$O(h^3)$.
$$y(t_{i+1}) = y(t_i) + h\,y^{(1)}(t_i) + \frac{y^{(1)}(t_{i+1}) - y^{(1)}(t_i)}{2} h - \frac{y^{(3)}(t_i)}{4} h^3 + \frac{y^{(3)}(t_i)}{3!} h^3 \tag{12.13}$$
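Example 12.3
Using Kirchhoff's law, the voltage in a circuit has been modelled by the
following equation:
$$\frac{dV}{dt} = V(t) - t + 1$$
Given that the initial voltage was 0.5 V, determine the voltage after 1 s
using six steps of Heun's method.
Solution
Using n = 6 gives a sample every 0.20 s, following Eq. (12.5). Computing the
derivative at the initial value, at t = 0, gives:
$$V^{(1)}(0) = V(0) - 0 + 1 = 1.5\ \text{V/s}$$
Following this derivative for one step gives the Euler's method estimate of
0.5 + 0.20 × 1.5 = 0.80 V for V(0.2), where the derivative evaluates to
0.80 - 0.20 + 1 = 1.6 V/s. Averaging the two derivatives as in Eq. (12.10) gives:
$$V(0.2) = V(0) + h\,\frac{V^{(1)}(0) + V^{(1)}(0.2)}{2} = 0.5 + 0.20 \times \frac{1.5 + 1.6}{2} = 0.81\ \text{V}$$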
The values for all five mesh points to compute, along with the real value
computed by the actual voltage equation of $V(t) = t + 0.5e^t$ and the relative
error of Heun's approximation, are given in the table below. Note that the
relative error in this table was computed using nine decimals of precision
instead of the two shown in the table, for added details.
[Table: actual voltage, Heun's method approximation, and relative error at V(0.20), V(0.40), V(0.60), V(0.80), and V(1.00)]
As with Euler's method, it can be seen that the relative error increases at
every step. However, the improved $O(h^3)$ error rate pays off, and even in the final step
the relative error is one quarter that of the first step using Euler's method. To
further illustrate the improvement, the function is plotted in blue in the figure
below, along with Euler's approximation in red and Heun's approximation in
green. It can be seen visually that Euler's method diverges from the real
function quite quickly, while Heun's method continues to match the real
function quite closely over the entire interval.
[Figure: the real function (blue), Euler's approximation (red), and Heun's approximation (green)]
12.4
Fourth-Order Runge-Kutta Method
To summarize the IVP methods learnt so far: following the derivative at $y(t_i)$ for one
step generates a poor estimation of the next point $y(t_{i+1})$, while predicting the
derivative at $y(t_{i+1})$ and following that for one step generates a poor estimation
with the opposite error. Following the first derivative is the idea behind Euler's
method, while Heun's method takes the average of both derivatives and cancels out
a lot of the errors, thus leading to a much better estimate of $y(t_{i+1})$. It is also known
that the error is proportional to h, the step size between two successive mesh points.
Taking these ideas together leads to an intuition for a new IVP method: perhaps
taking the average derivative at more than two points could lead to a more accurate
approximation of the behavior of the function and thus a more accurate computation of $y(t_{i+1})$. And since smaller step sizes help, perhaps this average should
include the derivative estimated halfway between $t_i$ and $t_{i+1}$. These are the intuitions
that underlie the fourth-order Runge-Kutta method.
To lay down foundations for this method, begin by defining a point half a step
between two mesh points:
$$t_{i+0.5} = t_i + \frac{h}{2} \tag{12.14}$$
The Runge-Kutta method begins, like Euler's method and Heun's method, by
computing the derivative at the current point. That result will be labelled $K_0$:
$$K_0 = f(t_i, y(t_i)) \tag{12.15}$$
Following the entire step $hK_0$ from $y(t_i)$ will lead to the Euler's method estimate of
$y(t_{i+1}) = y(t_i) + hf(t_i, y(t_i))$. However, following only half a step will reach a middle
point between $t_i$ and $t_{i+1}$: $y(t_{i+0.5}) = y(t_i) + 0.5hK_0$. It is at this middle point that the
next derivative and step are estimated:
$$K_1 = f\left(t_{i+0.5},\, y(t_i) + \frac{hK_0}{2}\right) \tag{12.16}$$
Since the step $K_1$ is measured using the derivative in the middle of the interval, it
should normally be a better representation of the behavior of the function over
the interval between $t_i$ and $t_{i+1}$. But instead of using it to approximate the value of
$y(t_{i+1})$, it will be used to compute the half-step again and generate an even better
approximation of $y(t_{i+0.5}) = y(t_i) + 0.5hK_1$ from which an improved value of the
derivative and the step can be computed:
$$K_2 = f\left(t_{i+0.5},\, y(t_i) + \frac{hK_1}{2}\right) \tag{12.17}$$
This improved approximation of the derivative is the one that will be used to
compute the value of the step along the derivative at $y(t_{i+1}) = y(t_i) + hK_2$:
$$K_3 = f\left(t_{i+1},\, y(t_i) + hK_2\right) \tag{12.18}$$
Finally, much like with Heun's method, the approximated value of $y(t_{i+1})$ is
computed by taking one step along the average value of all the available derivatives. However, in this case it will be a weighted average, with more weight given to
$K_1$ and $K_2$, the derivatives estimated in the middle of the interval. The reason for
this preference is that, as explained before, the derivative at $y(t_i)$ is a poor estimate
of the behavior of the function at $t_{i+1}$ and the derivative at $y(t_{i+1})$ is a poor estimate
of the behavior of the function at $t_i$, but the derivatives at $t_{i+0.5}$ offer a good
compromise between those two points. The final equation for the fourth-order
Runge-Kutta method is thus:
$$y(t_{i+1}) = y(t_i) + h\,\frac{K_0 + 2K_1 + 2K_2 + K_3}{6} \tag{12.19}$$
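Equations (12.14) to (12.19) map one-to-one onto code. The following Python sketch is illustrative only; the names `rk4_step` and `rk4` are assumptions, and the test problem $y^{(1)}(t) = y(t)$ with $y(0) = 1$ is chosen because its exact solution $e^t$ is known.

```python
import math

def rk4_step(f, t, y, h):
    """One step of the fourth-order Runge-Kutta method from (t, y)."""
    k0 = f(t, y)                                  # Eq. (12.15)
    k1 = f(t + h / 2.0, y + h * k0 / 2.0)         # Eq. (12.16)
    k2 = f(t + h / 2.0, y + h * k1 / 2.0)         # Eq. (12.17)
    k3 = f(t + h, y + h * k2)                     # Eq. (12.18)
    return y + h * (k0 + 2.0 * k1 + 2.0 * k2 + k3) / 6.0   # Eq. (12.19)

def rk4(f, t0, y0, t_end, n):
    """Approximate y on n equally spaced mesh points from t0 to t_end."""
    h = (t_end - t0) / (n - 1)
    y = y0
    values = [y0]
    for i in range(n - 1):
        y = rk4_step(f, t0 + i * h, y, h)
        values.append(y)
    return values

# y' = y with y(0) = 1 on six mesh points; the exact value at t = 1 is e.
values = rk4(lambda t, y: y, 0.0, 1.0, 1.0, 6)
error = abs(values[-1] - math.e)
```

Separating the single step from the loop over mesh points makes it easy to swap in Euler's or Heun's step for comparison.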
A visual representation can help understand this method. Figure 12.7 shows
multiple measurements of the derivative y(1)(t) 0.5y(t)t + 1 in the interval [0,
1.5] [0, 2], with the black line representing the graph of the function y(t) given the
initial condition $y(0) = 0.5$. The top-left figure shows the derivative approximation
$K_0$, the same one that would be used for one full step in Euler's method to generate
the next point y(1). But in the top-left figure, only half a step has been followed
along that derivative, and the derivative at the middle point is computed and
marked. This gives the value of $K_1$. Next, in the top-right figure, half a step is taken
following $K_1$ and the derivative at that point gives the value of $K_2$. It is visually clear
in these two graphs that the derivatives measured at the centre of the interval are
better approximations of the function than the derivative at the beginning of the
interval. In the middle-left figure, an entire step along $K_2$ is taken in order to
approximate the derivative at the far end of the interval and measure $K_3$. The
final approximation of y(1) of the fourth-order Runge-Kutta method is obtained
by taking the weighted average of these four steps, and is presented in the middle-right graph. By contrast, Euler's method relies only on the initial step $K_0$ while
Heun's method uses only the average of step $K_0$ and of one step along the derivative
approximated at the point at the end of $K_0$, which is actually an approximation of
lesser quality than $K_3$. As a result, both these methods give approximations of
y(1) of poorer quality, as shown in the bottom-left and bottom-right graphs. To
further help illustrate this process, the pseudocode of the Runge-Kutta method is
given in Fig. 12.8.
Fig. 12.7 Top-left: $K_0$, aka Euler's method, used to compute $K_1$. Top-right: $K_1$ used to compute
$K_2$. Middle-left: $K_2$ used to compute $K_3$. Middle-right: $K_0$ to $K_3$ used to compute the next point in
the fourth-order Runge-Kutta method. Bottom-left: Euler's method used to compute the next
point. Bottom-right: Heun's method used to compute the next point
The computation of the error of the fourth-order Runge-Kutta method is beyond
the scope of this book, but note that it is $O(h^5)$, making it considerably more
accurate than Euler's method or Heun's method.
Example 12.4
Using Kirchhoff's law, the voltage in a circuit has been modelled by the
following equation:
$$\frac{dV}{dt} = V(t) - t + 1$$
Given that the initial voltage was 0.5 V, determine the voltage after 1 s
using six steps of the fourth-order Runge-Kutta method.
The values for all five mesh points to compute, along with the real value
computed by the actual voltage equation of $V(t) = t + 0.5e^t$ and the relative
error of the Runge-Kutta approximation, are given in the table below.
[Table: actual voltage, Runge-Kutta approximation, and relative error at V(0.20), V(0.40), V(0.60), V(0.80), and V(1.00)]
As with Euler's method and Heun's method, it can be seen that the relative
error increases at every step. However, the error value is three orders of
magnitude smaller than with Heun's method! The improved accuracy
resulting from the inclusion of measurements at half-step intervals in the
weighted average is clear to see. A close-up look at the last approximated
point V(1.00), presented below, gives a visual comparison of the accumulated
error of the three methods seen so far. It can be seen that Runge-Kutta's
approximation (in orange) fits almost perfectly the real function (in blue),
while Heun's approximation (in green) has accumulated a small error, and
266
12
2.4
2.2
2
0.9
12.5
There is a class of differential equations on which the IVP methods seen so far will
fail. These are called stiff ordinary differential equations and they arise very often
in nature and thus in engineering practice. A stiff differential equation is an
equation of the form of (12.2) where the coefficient $c_1$ which multiplies the term
y(t) is much larger, in absolute value, than either $c_0$ or $c_2$. For such equations, the
value of the derivative at points above and below the function, even at points near
the function, will be very large and in opposite directions from each other. This will
cause the steps of the IVP method used to oscillate above and below the actual value
of the function with greater and greater amplitudes.
One solution to this problem is to use one of the three IVP methods seen earlier
with very small step values h. This will make it possible for the approximation
at each step to hug the actual function value and use its derivative value, and not
diverge. However, this solution is not satisfactory: in addition to not giving any
information on how to compute an appropriate step size, it would also cause the IVP
method to require a massive number of steps and computations to reach its target
value. Moreover, it is of no help if the step interval cannot be controlled or cannot
be reduced to an appropriately small size. And the risk remains that even the
slightest misstep will still cause the IVP method to diverge.
An alternative solution is a modification of Euler's method, called the backward
Euler's method.
The backward Euler's method starts from the same formula as Euler's method in
Eq. (12.7), but uses the derivative at the next point $y(t_{i+1})$ instead of the current
point $y(t_i)$:
$$
\begin{aligned}
y(t_{i+1}) &= y(t_i) + h\,y^{(1)}(t_{i+1}) \\
&= y(t_i) + h f(t_{i+1}, y(t_{i+1}))
\end{aligned}
\tag{12.20}
$$
This is reminiscent of Heun's method and the Runge-Kutta method, both of which
made use of the next point's derivative. However, these methods actually estimate
the value of $y(t_{i+1})$ using Euler's method in order to compute that derivative. When
dealing with stiff ODEs, however, using Euler's method to estimate the value of the
next point is exactly what must be avoided, because of the risk of divergence. In the
backward Euler's method, the value of $y(t_{i+1})$ is left as an unknown in Eq. (12.20).
When the problem-specific ODE is used in Eq. (12.20), this unknown value $y(t_{i+1})$
will be present on both sides of the equation, but it will be the only unknown value
in the equation. It will thus be possible to solve the equation, using algebra or one of
the root-finding methods of Chap. 8, to discover the value of $y(t_{i+1})$. For the
algebraic solution, if the derivative formula is of the form of Eq. (12.2), then
substituting that equation into Eq. (12.20) and isolating $y(t_{i+1})$ gives:
$$
\begin{aligned}
y(t_{i+1}) &= y(t_i) + h\left(c_0 + c_1 y(t_{i+1}) + c_2 t_{i+1}\right) \\
y(t_{i+1}) &= \frac{y(t_i) + h c_0 + h c_2 t_{i+1}}{1 - h c_1}
\end{aligned}
\tag{12.21}
$$
As explained, the reason Euler's method fails in this case is that it starts from the
current point and follows the derivative forward one step to find the next point;
however, for a stiff ODE, if the current point is even a little bit wrong then the
derivative will be wildly different from the function and the next point will diverge.
The backward Euler's method, on the other hand, looks for the next point that, when
following the derivative backward one step, will lead back to the current point. This
makes it capable of handling these difficult cases.
Notwithstanding the error of the method used to solve Eq. (12.20) for $y(t_{i+1})$, the
backward Euler's method is a simple variation of Euler's method, and thus has the
same error rate of $O(h^2)$.
Example 12.5
Using Kirchhoff's law, the voltage in a circuit has been modelled by the
following equation:
\[
\frac{dV}{dt} = -21V(t) + e^{-t}
\]
Given that the initial voltage was 0 V, determine the voltage after 2 s using
steps of 0.1 s, using Euler's method and the backward Euler's method.
Solution
To begin, note the large value multiplying V(t); this is a telltale sign that the
function is a stiff ODE. Note as well that the correct voltage function that
leads to this ODE, and which can be obtained by calculus, is:
\[
V(t) = \frac{e^{-t} - e^{-21t}}{20}
\]
Using Eq. (12.7), it is easy to compute the first six steps of Euler's method:
\[
\begin{aligned}
V(t_{i+1}) &= V(t_i) + h\left(-21V(t_i) + e^{-t_i}\right) \\
V(0.1) &= 0 + 0.1\left(-21 \times 0 + e^{0}\right) = 0.100 \\
V(0.2) &= 0.10 + 0.1\left(-21 \times 0.10 + e^{-0.1}\right) = -0.020 \\
V(0.3) &= -0.02 + 0.1\left(-21 \times (-0.02) + e^{-0.2}\right) = 0.104 \\
V(0.4) &= 0.104 + 0.1\left(-21 \times 0.104 + e^{-0.3}\right) = -0.040 \\
V(0.5) &= -0.040 + 0.1\left(-21 \times (-0.040) + e^{-0.4}\right) = 0.111 \\
V(0.6) &= 0.111 + 0.1\left(-21 \times 0.111 + e^{-0.5}\right) = -0.061
\end{aligned}
\]
As predicted, the values are oscillating from positive to negative with greater
and greater amplitude. The entire run of 20 steps is presented in the figure
below, along with the correct function, to confirm visually that Euler's
method is diverging in this case.
As explained, the problem stems from the fact that the derivatives at points
above and below the function have large amplitudes in opposite orientations.
To visualize, the derivatives at an array of points around the function have
been plotted in the graph below. It can be seen from that graph that, save in
the immediate vicinity of the function, the derivatives do not represent the
function's behavior at all. As a result, any approximated measurement that is
even a little bit inaccurate will generate an erroneous derivative and start the
divergence.
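Next, the backward Euler's method formula can be obtained by putting the
problem's ODE into Eq. (12.20):
\[
\begin{aligned}
V(t_{i+1}) &= V(t_i) + h\left(-21V(t_{i+1}) + e^{-t_{i+1}}\right) \\
V(t_{i+1}) &= \frac{V(t_i) + 0.1\,e^{-t_{i+1}}}{3.1}
\end{aligned}
\]
And from that equation, it is easy to compute the approximations:
\[
\begin{aligned}
V(0.1) &= \frac{0 + 0.1e^{-0.1}}{3.1} = 0.029 \\
V(0.2) &= \frac{0.029 + 0.1e^{-0.2}}{3.1} = 0.036 \\
V(0.3) &= \frac{0.036 + 0.1e^{-0.3}}{3.1} = 0.036 \\
V(0.4) &= \frac{0.036 + 0.1e^{-0.4}}{3.1} = 0.033 \\
V(0.5) &= \frac{0.033 + 0.1e^{-0.5}}{3.1} = 0.030 \\
V(0.6) &= \frac{0.030 + 0.1e^{-0.6}}{3.1} = 0.027
\end{aligned}
\]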
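The contrast between the two methods in Example 12.5 can be checked numerically. The following is a minimal Python sketch, not the book's pseudocode; the function names (euler_step, backward_euler_step, exact, run) are illustrative assumptions. It assumes the ODE dV/dt = -21V + e^{-t} with V(0) = 0 and h = 0.1, and it solves the backward step algebraically, which is only possible here because the ODE is linear in V.

```python
import math

def euler_step(v, t, h):
    # Euler's method: follow the derivative at the current point (t, v).
    return v + h * (-21.0 * v + math.exp(-t))

def backward_euler_step(v, t, h):
    # Backward Euler's method, solved algebraically for the next value:
    # v_next = v + h * (-21 * v_next + exp(-(t + h)))
    return (v + h * math.exp(-(t + h))) / (1.0 + 21.0 * h)

def exact(t):
    # Analytical solution quoted in Example 12.5.
    return (math.exp(-t) - math.exp(-21.0 * t)) / 20.0

def run(step, h=0.1, n_steps=20):
    # Apply one of the step functions n_steps times from V(0) = 0.
    v = 0.0
    values = [v]
    for i in range(n_steps):
        v = step(v, i * h, h)
        values.append(v)
    return values

forward = run(euler_step)
backward = run(backward_euler_step)
for i in range(0, 21, 5):
    print(f"t={i * 0.1:.1f}  Euler={forward[i]: .3f}  "
          f"backward={backward[i]: .3f}  exact={exact(i * 0.1): .3f}")
```

With these assumptions, the Euler values oscillate with growing amplitude while the backward Euler values stay close to the analytical curve over the whole interval.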
12.6 Systems of IVPs
This chapter has focused so far on handling problems that can be modelled using a
single ODE and starting condition. However, many real-life engineering situations
are more complex than that, and require the manipulation of multiple ODEs
simultaneously, each one with its own initial value. Fortunately, as this section
will demonstrate, such problems can be handled using a simple variable substitution
and any of the IVP methods seen so far.
A system of IVPs will arise naturally from any multidimensional system, or any
system that has multiple interacting variables related in a single model equation.
Each of these variables can be represented by a function of a single independent
variable (normally time). Assume a system with n such dimensions, {y_0(t), y_1(t),
..., y_{n-1}(t)}. The equation modelling each one is unknown; however, since this is
an IVP, an ODE describing the behavior of each one over time in the form of
Eq. (12.2) is available:
\[
\begin{aligned}
\frac{dy_0}{dt} &= f_0(t, y_0(t), y_1(t), \ldots, y_{n-1}(t)) \\
\frac{dy_1}{dt} &= f_1(t, y_0(t), y_1(t), \ldots, y_{n-1}(t)) \\
&\;\;\vdots \\
\frac{dy_{n-1}}{dt} &= f_{n-1}(t, y_0(t), y_1(t), \ldots, y_{n-1}(t))
\end{aligned}
\tag{12.22}
\]
Likewise, as always for an IVP, initial values of each variable are available: {y_0(t_0),
y_1(t_0), ..., y_{n-1}(t_0)}. Note that, since the parameters are interacting in the system,
the derivatives of Eq. (12.22) can include terms with any, or all, of the variables of
the system. The presence of these multiple variables in the ODEs is what makes this
problem difficult to handle; otherwise, it could be dealt with simply by applying an
IVP method on each ODE independently of the others.
This situation can be simplified greatly by changing the representation of the
system. Instead of a multidimensional problem composed of a set of independent
variables, define a single multidimensional variable whose values will be the
variables of the system. In other words, a vector of the variables of the system:
\[
\mathbf{u}(t) = \begin{bmatrix} y_0(t) \\ y_1(t) \\ \vdots \\ y_{n-1}(t) \end{bmatrix}
\tag{12.23}
\]
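The derivative of this vector is the vector of the ODEs of Eq. (12.22):
\[
\frac{d\mathbf{u}}{dt} =
\begin{bmatrix} \dfrac{dy_0}{dt} \\ \dfrac{dy_1}{dt} \\ \vdots \\ \dfrac{dy_{n-1}}{dt} \end{bmatrix}
=
\begin{bmatrix}
f_0(t, y_0(t), y_1(t), \ldots, y_{n-1}(t)) \\
f_1(t, y_0(t), y_1(t), \ldots, y_{n-1}(t)) \\
\vdots \\
f_{n-1}(t, y_0(t), y_1(t), \ldots, y_{n-1}(t))
\end{bmatrix}
= \mathbf{f}(t, \mathbf{u}(t))
\tag{12.24}
\]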
And the initial condition of u(t) is the vector of initial values of its dimensions,
which are all known:
\[
\mathbf{u}(t_0) = \begin{bmatrix} y_0(t_0) \\ y_1(t_0) \\ \vdots \\ y_{n-1}(t_0) \end{bmatrix}
\tag{12.25}
\]
This simple variable substitution has transformed the multidimensional IVP into
a single-variable IVP, albeit a variable in multiple dimensions. But this nonetheless
remains an IVP that can be solved using any of the IVP methods seen in
this chapter, provided the formulae are modified to use a vector variable instead of a
scalar. Namely, Euler's method becomes:
\[
\mathbf{u}(t_{i+1}) = \mathbf{u}(t_i) + h\,\mathbf{f}(t_i, \mathbf{u}(t_i))
\tag{12.26}
\]
The other methods of this chapter (Heun's method, the fourth-order Runge-Kutta
method, and the backward Euler's method) are rewritten with vectors in the same way,
giving Eqs. (12.27), (12.28), and (12.29).
The error of each method remains unchanged from its scalar version. Likewise, the
pseudocode to implement these methods follows the same logical steps as the
pseudocode for the scalar versions, except with vectors. For example, the code for
Euler's method presented in Fig. 12.2 becomes the vectorial version of Fig. 12.10.
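As an illustration of Eq. (12.26), the vector form of Euler's method can be sketched in Python with NumPy. This is a minimal sketch and not a transcription of Fig. 12.10; the function name euler_system and the two-variable test system (y_0' = y_1, y_1' = -y_0) are assumptions chosen only to show the mechanics.

```python
import numpy as np

def euler_system(f, t0, u0, h, n_steps):
    """Vector Euler's method: u(t_{i+1}) = u(t_i) + h * f(t_i, u(t_i))."""
    t = t0
    u = np.asarray(u0, dtype=float)
    history = [u.copy()]
    for _ in range(n_steps):
        # Same update as the scalar method, applied to every entry at once.
        u = u + h * np.asarray(f(t, u), dtype=float)
        t = t + h
        history.append(u.copy())
    return np.array(history)

def rotation(t, u):
    # Hypothetical test system: y0' = y1, y1' = -y0.
    return np.array([u[1], -u[0]])

result = euler_system(rotation, 0.0, [1.0, 0.0], 0.1, 3)
print(result)
```

The only change from a scalar implementation is that the state and the derivative are arrays, so one update line advances all the variables of the system together.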
Example 12.6
The Lorenz equations are a set of three ODEs that describe the chaotic
behavior of certain natural systems. These equations arise in many engineering
models of real-world systems. The ODEs are:
\[
\begin{aligned}
\frac{dx}{dt} &= \sigma(y - x) \\
\frac{dy}{dt} &= x(\rho - z) - y \\
\frac{dz}{dt} &= xy - \beta z
\end{aligned}
\]
Using the system parameters σ = 10, ρ = 28, and β = 2.7, and the initial
values at t = 0 of x = 1, y = 1, and z = 1, use Euler's method with a step
size of 0.01 to draw the solution of the Lorenz equations.
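Solution
This three-dimensional problem can be written using a single three-dimensional
vector by applying Eqs. (12.23), (12.24), and (12.25):
\[
\mathbf{u}(t) = \begin{bmatrix} x(t) \\ y(t) \\ z(t) \end{bmatrix}
\qquad
\frac{d\mathbf{u}}{dt} = \mathbf{f}(t, \mathbf{u}(t)) =
\begin{bmatrix} 10(y - x) \\ x(28 - z) - y \\ xy - 2.7z \end{bmatrix}
\qquad
\mathbf{u}(0) = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
\]
The first two steps of Euler's method from Eq. (12.26) are then:
\[
\begin{aligned}
\mathbf{u}(0.01) &= \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
+ 0.01 \begin{bmatrix} 10(1 - 1) \\ 1(28 - 1) - 1 \\ 1 \times 1 - 2.7 \times 1 \end{bmatrix}
= \begin{bmatrix} 1.00 \\ 1.26 \\ 0.98 \end{bmatrix} \\
\mathbf{u}(0.02) &= \begin{bmatrix} 1.00 \\ 1.26 \\ 0.98 \end{bmatrix}
+ 0.01 \begin{bmatrix} 10(1.26 - 1.00) \\ 1.00(28 - 0.98) - 1.26 \\ 1.00 \times 1.26 - 2.7 \times 0.98 \end{bmatrix}
= \begin{bmatrix} 1.03 \\ 1.52 \\ 0.97 \end{bmatrix}
\end{aligned}
\]
If Euler's method is applied for 1000 steps, the resulting scatter of points will
draw the figure below: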
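The 1000 steps of Example 12.6 are tedious by hand but short in code. The following Python sketch is one possible way to generate the points, assuming NumPy is available; the names lorenz and euler_path are illustrative, and the plotting step is left out.

```python
import numpy as np

SIGMA, RHO, BETA = 10.0, 28.0, 2.7  # parameters given in Example 12.6

def lorenz(t, u):
    # Right-hand side f(t, u) of the Lorenz system, with u = [x, y, z].
    x, y, z = u
    return np.array([SIGMA * (y - x), x * (RHO - z) - y, x * y - BETA * z])

def euler_path(f, u0, h, n_steps):
    # Store every point so that the trajectory can be plotted afterwards.
    path = np.empty((n_steps + 1, len(u0)))
    path[0] = u0
    for i in range(n_steps):
        path[i + 1] = path[i] + h * f(i * h, path[i])
    return path

path = euler_path(lorenz, [1.0, 1.0, 1.0], 0.01, 1000)
print(path[:3])  # u(0), u(0.01), u(0.02)
```

The first rows of path reproduce the hand-computed vectors u(0.01) and u(0.02) once rounded to two decimals; the full array can be passed to any 3D scatter plot routine to draw the trajectory.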
12.7 Higher-Order ODEs
So far, this chapter has only dealt with IVPs that had a first-order ODE. It is of course
possible, however, for a model to include a higher-order derivative of a parameter.
Most commonly, engineering models will include the second derivative (acceleration
given position) or third derivative (jerk given position), but it is possible to
model an IVP with any degree of derivative.
An IVP with a first-order derivative starts with the ODE of Eq. (12.2) that is a
function of time (or some other independent variable the formula is derived with
respect to) and of the function (a.k.a. the zeroth-order derivative). Likewise, an IVP
with an nth-order ODE starts with an equation that is a function of time and of all
lower-order derivatives:
\[
y^{(n)}(t) = f\left(t, y(t), y^{(1)}(t), \ldots, y^{(n-1)}(t)\right)
\tag{12.30}
\]
Initial values are available for all n lower-order derivatives in that equation: {y(t_0),
y^{(1)}(t_0), ..., y^{(n-1)}(t_0)}.
The method to solve this system is very similar to the method used to solve the
system of IVPs in the previous section: by using a variable substitution to reduce the
problem to a first-order IVP. Instead of dealing with a problem with multiple
derivatives of a variable, define a single multidimensional variable whose values
will be the n lower-order derivatives:
\[
\mathbf{u}(t) = \begin{bmatrix} y(t) \\ y^{(1)}(t) \\ \vdots \\ y^{(n-1)}(t) \end{bmatrix}
\tag{12.31}
\]
A single derivative of this variable will introduce the nth-order ODE of the problem
into this vector, while all other dimensions are simply shifted up by one position
compared to (12.31):
\[
\frac{d\mathbf{u}}{dt} =
\begin{bmatrix}
y^{(1)}(t) \\ y^{(2)}(t) \\ \vdots \\ y^{(n)}(t) = f\left(t, y(t), y^{(1)}(t), \ldots, y^{(n-1)}(t)\right)
\end{bmatrix}
= \mathbf{f}(t, \mathbf{u}(t))
\tag{12.32}
\]
And the initial condition of u(t) is the vector of initial values of its dimensions,
which are all known:
\[
\mathbf{u}(t_0) = \begin{bmatrix} y(t_0) \\ y^{(1)}(t_0) \\ \vdots \\ y^{(n-1)}(t_0) \end{bmatrix}
\tag{12.33}
\]
This simple variable substitution has transformed the nth-order IVP into a first-order
IVP, albeit one with a vectorial variable instead of a scalar. But this nonetheless
remains an IVP that can be solved using the vectorial versions of
the IVP methods presented in the previous section in Eqs. (12.26), (12.27), (12.28),
and (12.29). The error of each method remains unchanged from its scalar version.
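To make the substitution concrete, the following Python sketch applies a vector version of Heun's method to a second-order ODE. It assumes the circuit ODE I'' = cos(t) - 4I - 10I' of Example 12.7 with I(0) = 0, I'(0) = 0, and h = 0.1; the names f and heun_step are illustrative, and the sketch is not the book's pseudocode.

```python
import numpy as np

def f(t, u):
    # u = [I, I']. The first entry of du/dt is the shifted derivative I';
    # the last entry is the second-order ODE itself.
    current, d_current = u
    return np.array([d_current, np.cos(t) - 4.0 * current - 10.0 * d_current])

def heun_step(f, t, u, h):
    k0 = f(t, u)               # derivative at the current point
    k1 = f(t + h, u + h * k0)  # derivative at the Euler estimate of the next point
    return u + h * (k0 + k1) / 2.0

h = 0.1
u = np.array([0.0, 0.0])       # system initially at rest
states = [u]
for i in range(2):
    u = heun_step(f, i * h, u, h)
    states.append(u)
print(states)
```

Only the function f knows that the problem came from a second-order ODE; the stepping routine is the ordinary vector method and would work unchanged for any order.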
Example 12.7
Consider a circuit with a single loop with an inductor of 1 H, a resistor of 10 Ω,
and a capacitor of 0.25 F. The system is initially at rest and, at time t = 0, a
voltage force of V(t) = sin(t) is applied. Approximate the current I(t) moving
through the loop for t > 0. This circuit is shown below.
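Solution
The circuit is modelled by the second-order ODE I^{(2)}(t) + 10I^{(1)}(t) + 4I(t) = cos(t).
Since the system is initially at rest, both I(0) and I^{(1)}(0) are 0. Applying
Eqs. (12.31), (12.32), and (12.33) gives:
\[
\mathbf{u}(t) = \begin{bmatrix} I(t) \\ I^{(1)}(t) \end{bmatrix}
\qquad
\frac{d\mathbf{u}}{dt} = \begin{bmatrix} I^{(1)}(t) \\ \cos(t) - 4I(t) - 10I^{(1)}(t) \end{bmatrix}
= \mathbf{f}(t, \mathbf{u}(t))
\qquad
\mathbf{u}(0) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\]
The first step of Heun's method with h = 0.1 computes the derivative at the initial
point, the Euler's method estimate of the next point, the derivative at that estimate,
and finally the step using the average of the two derivatives:
\[
\begin{aligned}
\mathbf{f}(0, \mathbf{u}(0)) &= \begin{bmatrix} 0 \\ \cos(0) - 4 \times 0 - 10 \times 0 \end{bmatrix}
= \begin{bmatrix} 0 \\ 1 \end{bmatrix} \\
\mathbf{u}(0.1) &= \begin{bmatrix} 0 \\ 0 \end{bmatrix} + 0.1\,\mathbf{f}(0, \mathbf{u}(0))
= \begin{bmatrix} 0 \\ 0.1 \end{bmatrix} \quad \text{(Euler estimate)} \\
\mathbf{f}(0.1, \mathbf{u}(0.1)) &= \begin{bmatrix} 0.1 \\ \cos(0.1) - 4 \times 0 - 10 \times 0.1 \end{bmatrix}
= \begin{bmatrix} 0.1 \\ -0.005 \end{bmatrix} \\
\mathbf{u}(0.1) &= \begin{bmatrix} 0 \\ 0 \end{bmatrix}
+ 0.1\,\frac{\mathbf{f}(0, \mathbf{u}(0)) + \mathbf{f}(0.1, \mathbf{u}(0.1))}{2}
= \begin{bmatrix} 0.005 \\ 0.050 \end{bmatrix}
\end{aligned}
\]
So the current after the first 0.1 s is I(0.1) = 0.005 A. A second iteration, starting
from the derivative at this new point, gives:
\[
\begin{aligned}
\mathbf{f}(0.1, \mathbf{u}(0.1)) &= \begin{bmatrix} 0.050 \\ \cos(0.1) - 4 \times 0.005 - 10 \times 0.050 \end{bmatrix} \\
\mathbf{u}(0.2) &= \begin{bmatrix} 0.012 \\ 0.072 \end{bmatrix}
\end{aligned}
\]
And so the current has increased to I(0.2) = 0.012 A. This method can be
applied over and over until an appropriate model of the current is obtained. If
200 steps are done to approximate the current until t = 20 s, the results give
the following graph: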
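12.8 Engineering Applications
Newton's law of cooling models the decreasing temperature T over time of a
hot object in a cool environment as:
\[
\frac{dT}{dt} = -\frac{hA}{C}\left(T - T_{env}\right)
\tag{12.35}
\]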
where h is the heat transfer coefficient between the object and the environment,
A is the heat transfer surface area between the object and its environment, C is the
total heat capacity of the system, and T_env is the temperature of the environment.
The Lotka-Volterra equations describe the dynamic relationship between the
populations of two species living in the same environment. One is the prey
species, with population size x, and the other is the predator species, with
population size y, that hunts the prey. The rate of change of each population is
modelled as:
\[
\begin{aligned}
\frac{dx}{dt} &= \alpha x - \beta xy \\
\frac{dy}{dt} &= \delta xy - \gamma y
\end{aligned}
\tag{12.36}
\]
where the four constants describe the interaction of the two species: α is the
growth rate of the prey, β is the rate at which predators kill off prey, δ is the
growth rate of predators given their consumption of prey, and γ is the death rate
of predators.
The displacement over time of a damped spring in orientation x is described by:
\[
\frac{d^2x}{dt^2} + 2\zeta\omega\frac{dx}{dt} + \omega^2 x = 0
\tag{12.37}
\]
where ζ is the damping ratio and ω is the angular frequency of the oscillations.
In all these examples, the value of an important parameter of the system can be
approximated at any instant from the model, provided only that its initial value is
known.
12.9 Summary
The backward Euler's method has the same error rate as Euler's method, which makes
it less accurate than Heun's method and the fourth-order Runge-Kutta method. However,
it is useful for dealing with a special class of problems called stiff ODEs, where the
other three methods fail and diverge. Table 12.1 summarizes these techniques.

Table 12.1
Method | Requires | Error
Euler's method | Initial point and ODE | O(h^2)
Heun's method | Initial point and ODE, computes one additional point and derivative | O(h^3)
Fourth-order Runge-Kutta method | Initial point and ODE, computes three additional points and derivatives | O(h^5)
Backward Euler's method | Initial point and ODE, solves for the next point | O(h^2)
These methods were developed to evaluate a two-dimensional scalar point using
a first-order ODE. Engineering practice, however, is not always so simple, and it is
possible one has to work on multidimensional problems where multiple parameters
of a system interact in the ODE, or with two-dimensional problems where the ODE
uses a higher-order derivative. This chapter has presented a simple technique to
deal with either case using a variable substitution. By writing the multiple
dimensions or derivatives of the system as entries in a vectorial variable and changing the
equations to use a vectorial variable instead of a scalar, any of the IVP methods seen
in this chapter can be used.
12.10 Exercises
1. Given the following IVP, approximate the requested values using one step of
Euler's method:
a. y(1.5).
b. y(1).
c. y(0.75)
y1 t 1 tyt
y0:5 2:5
2. Using the following IVP and a step h = 0.5, approximate y(0.5), y(1), and y(1.5)
using Euler's method.
y1 t 1 0:25yt 0:2t
y 0 1
3. Given the following IVP and a step h = 0.5, approximate y(1.5) and y(2) using
Euler's method.
y1 t 1 0:25yt 0:2t
y 1 2
4. Given an IVP with an initial condition y(0) = y_0, if the second derivative is
bounded by -8 < y^{(2)}(t) < 8, on how large an interval can we estimate y(t)
using Euler's method if we want to ensure that the error is less than 0.0001?
5. Solve Example 12.2 using Heun's method. Compare your results to the error
from Euler's method given in that example.
6. Given the following IVP, approximate y(1) and y(1.5) using one step of Heun's
method.
y1 t 1 tyt
y0:5 2:5
7. Using the following IVP and a step of h = 0.5, approximate y(0.5), y(1), and
y(1.5) using Heun's method.
y1 t 1 0:25yt 0:2t
y 0 1
8. Approximate y(1.5) and y(2) with Heun's method, using the IVP below and a
step of h = 0.5.
y1 t 1 0:25yt 0:2t
y 1 2
9. Given an IVP with an initial condition y(0) = y_0, if the second derivative is
bounded by -8 < y^{(2)}(t) < 8, on how large an interval can we estimate y(t)
using Heun's method if we want to ensure that the error is less than 0.0001?
Compare this with the range for Question 4.
10. Solve Example 12.2 using the fourth-order Runge-Kutta method. Compare
your results to the error from Euler's method given in that example.
11. Given the following IVP, approximate y(1) and y(1.5) using one step of the
fourth-order Runge-Kutta method.
y1 t 1 tyt
y0:5 2:5
12. Given the following IVP, approximate y(0.5), y(1), and y(1.5) using the
fourth-order Runge-Kutta method and a step of h = 0.5.
y1 t 1 0:25yt 0:2t
y 0 1
13. Given the following IVP, approximate y(1.5) and y(2) using the fourth-order
Runge-Kutta method and a step of h = 0.5.
y1 t 1 0:25yt 0:2t
y 1 2
14. Approximate y(1) for the following IVP using four steps of Euler's method,
Heun's method, and the fourth-order Runge-Kutta method, given the initial
condition that y(0) = 1.
y1 t 1 0:25yt 0:2t
y 0 1
15. Approximate y(1) for the following IVP using four steps of Euler's method,
Heun's method, and the fourth-order Runge-Kutta method.
y1 t tyt t 1
y 0 1
16. Solve Example 12.2 using the backward Euler's method. Compare your results
to the error from Euler's method given in that example.
17. Given the following IVP, approximate y(1) and y(1.5) using the backward
Euler's method.
y1 t 1 tyt
y0:5 2:5
18. Approximate y(0.5), y(1), and y(1.5) using the backward Euler's method for the
following IVP.
y1 t 1 0:25yt 0:2t
y 0 1
19. Given the following IVP, approximate y(1.5) and y(2) using the backward
Euler's method.
y1 t 1 0:25yt 0:2t
y 1 2
20. Given the following system of IVPs, compute the first two steps using h = 0.1.
x1 t 0:6xt 1:8yt
y1 t 2:6xt 0:3yt
x0 1
y0 1
21. Find the current moving through the circuit of Example 12.7 after 0.1 s, 0.2 s,
and 0.3 s:
a. When the initial current is 1 A.
b. When the voltage function is v(t) et for t 0 s.
22. Consider the second-order IVP below. Perform three steps of Euler's method
using h = 0.1.
y2 t 4 sin t yt 2y1 t
y 0 1
y1 0 2
23. Consider the third-order IVP below. Perform two steps of Heun's method using
h = 0.1.
y3 t yt ty1 t 4y2 t
y 2 1
1
y 2 2
y2 2 3
24. Van der Pol's second-order ODE, given below, is an example of a higher-order
stiff ODE. Using μ = 0.3 and h = 0.1, and with the initial conditions y(0) = 0.7
and y^{(1)}(0) = 1.2, compute four steps of this IVP.
\[
y^{(2)}(t) - \mu\left(1 - y(t)^2\right)y^{(1)}(t) + y(t) = 0
\]
25. Given the following IVP, approximate y(1) by using
a. One step of Euler's method.
b. Two steps of Euler's method.
c. Four steps of Euler's method.
y3 t y2 t y1 t yt 3 0
y 0 3
y1 0 2
y2 0 1
Chapter 13
13.1 Introduction
13:1
This model would be easy to use if the initial values of the current and its derivative
were known. But what if the rate of change of the current is not known, and the only
information available is the current in the circuit the moment the system was turned
on and when it was shut down? That is to say, only the initial and final values of the
current are known, while its derivatives are unknown.
This type of problem is a boundary value problem (BVP), a situation in which a
parameter's value is known at the two boundaries of an interval, called the boundary
conditions of the system, and its value within that interval must be estimated.
The boundary conditions typically represent initial and terminal conditions of the
system. And the interval might be over time, as in the previous example using the
current at the moment the system was turned on and off, or over space, for example
if one was measuring the voltage at the starting and terminal positions along a long
transmission line.
One can immediately see the similarities and differences with the higher-order
initial value problem (IVP) situation presented in the previous chapter. Both setups
are very similar: both start with an nth-order ODE that is a function of an
independent variable (we will use time, for simplicity) and of all lower-order
derivatives:
\[
y^{(n)}(t) = f\left(t, y(t), y^{(1)}(t), \ldots, y^{(n-1)}(t)\right)
\tag{13.2}
\]
\[
h = \frac{t_{n-1} - t_0}{n - 1} = t_{i+1} - t_i
\tag{13.3}
\]
13.2 Shooting Method
Given the similarities between higher-order IVPs and BVPs, it would be nice if it
were possible to use the higher-order IVP method learnt in the previous chapter in
order to solve BVPs. Recall from Chap. 12 that this would require rewriting the
system of Eq. (13.2) as a vector of lower-order derivative values and taking that
vector's derivative:
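\[
\mathbf{u}(t) = \begin{bmatrix} y(t) \\ y^{(1)}(t) \\ \vdots \\ y^{(n-1)}(t) \end{bmatrix}
\tag{13.4}
\]
\[
\frac{d\mathbf{u}}{dt} =
\begin{bmatrix}
y^{(1)}(t) \\ y^{(2)}(t) \\ \vdots \\ y^{(n)}(t) = f\left(t, y(t), y^{(1)}(t), \ldots, y^{(n-1)}(t)\right)
\end{bmatrix}
= \mathbf{f}(t, \mathbf{u}(t))
\tag{13.5}
\]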
Then, given a vector of initial values u(t0), vectorial versions of the IVP methods of
Chap. 12 can be used to solve the problem. The only real issue is that these initial
values of the derivatives required to complete vector u(t0) and to compute the initial
step of the IVP methods are unknown in the BVP case. Without them, the IVP
methods cannot start.
However, with the initial and final values of that parameter known and a model
of the derivative equation available, a simple solution presents itself: take a shot at
guessing the initial value of the derivatives. Using some random value for the initial
value of the derivatives in u(t0), it is possible to apply any of the IVP methods and
compute what the approximated final value will be. Then, compare this approximation to the real target final value to see how far off the mark the shot was, and
refine the guess to take another shot. Actually getting the correct initial derivative
value in this manner is unlikely; but after N guesses, a set of N n-dimensional points
will have been generated, each combining initial derivative values and a final
parameter value. It now becomes possible to use these points to compute an
equation modelling the relationship between the initial derivative values and the
final parameter value using any of the interpolation and regression methods of
Chap. 6, and to use this equation to compute the correct initial derivative
corresponding to the real final value of the parameter. Finally, once the correct
initial derivative value is known, the chosen IVP method can be used one last time
to compute the correct approximations of the parameter within the interval.
This method is called the shooting method. Its biggest advantage is that it uses
only interpolation and IVP methods that were learnt in previous chapters. Its error
rate will likewise be directly related to the interpolation and IVP methods chosen.
Using a greater number N of shots will lead to a better interpolation or regression
and to a better approximation of the initial derivative, which can be used in turn in a
more accurate IVP method. Moreover, since the shooting method is a bounded
method, with the boundary conditions constraining the initial and final value of the
IVP method, the absolute error of the IVP approximation will increase the farther a
point is from the two boundaries, and will be at its maximum near the middle of the
interval. The biggest disadvantage of the shooting method is its runtime: for
N random shots, the method will need to apply the IVP method N + 1 times, or
once for every shot plus one final time using the final estimate of the initial
derivative. This in turn limits its applicability to problems in more than two
dimensions, which will require a large number of shots in order to generate enough
points to estimate the multidimensional function and the correct derivative values.
Figure 13.2 presents the pseudocode for a version of the shooting method for a
second-order BVP of the form of Eq. (13.1). There are two loops in this version of
the algorithm; the first generates guesses of the first derivative until one shoots
below the final value of the parameter, and the second generates guesses of the first
derivative until one shoots above that final value. Next, the two 2D points generated
are used to interpolate a function and to get the correct value of the first derivative,
which is then used in a final application of the IVP method to get the steps that solve
the BVP.
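The shooting method can be sketched in a few lines of Python. The sketch below is a simplification of the algorithm described above: instead of looping until one guess falls below and one above the target, it assumes two fixed guesses and interpolates linearly between them. It uses the circuit of Example 13.1 (I'' = cos(t) - 2I - 3I', I(0) = 1.28, I(1) = 0.991, Euler's method with h = 0.2); the names f, euler_run, and shoot are illustrative assumptions.

```python
import numpy as np

def f(t, u):
    # u = [I, I'] for the ODE I'' + 3 I' + 2 I = cos(t).
    current, d_current = u
    return np.array([d_current, np.cos(t) - 2.0 * current - 3.0 * d_current])

def euler_run(d_current0, i0=1.28, h=0.2, n_steps=5):
    # One "shot": Euler's method from a guessed initial derivative.
    u = np.array([i0, d_current0])
    path = [u]
    for i in range(n_steps):
        u = u + h * f(i * h, u)
        path.append(u)
    return np.array(path)

def shoot(target, guess_a, guess_b):
    # Final values reached by the two shots.
    end_a = euler_run(guess_a)[-1, 0]
    end_b = euler_run(guess_b)[-1, 0]
    # Linear interpolation of the initial derivative that hits the target.
    slope = (end_b - end_a) / (guess_b - guess_a)
    d0 = guess_a + (target - end_a) / slope
    return d0, euler_run(d0)

d0, path = shoot(0.991, 0.0, 0.5)
print(d0)
print(path[:, 0])  # approximated current at t = 0, 0.2, ..., 1.0
```

Because this ODE is linear, the final value is a linear function of the guessed derivative, so two shots and one linear interpolation are enough here; a nonlinear ODE would generally need more shots or an iterative root-finding step.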
Example 13.1
Consider a circuit with a single loop with an inductor of 1 H, a resistor of 3 Ω,
and a capacitor of 0.5 F, as shown in Fig. 13.1. Initially, a current of 1.28 A is
going through the loop. At time t = 0 s, a voltage force of V(t) = sin(t) is
applied. After 1 s, the current is measured to be 0.991 A. Approximate the
current I(t) moving through the loop at every 0.2 s using Euler's method. This
situation can be modelled by the following ODE:
\[
I^{(2)}(t) + 3I^{(1)}(t) + 2I(t) = \cos(t)
\]
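Solution
Using Eqs. (13.4) and (13.5), the ODE can be rewritten as a vector system in which
the initial derivative I^{(1)}(0) is unknown:
\[
\mathbf{u}(t) = \begin{bmatrix} I(t) \\ I^{(1)}(t) \end{bmatrix}
\qquad
\frac{d\mathbf{u}}{dt} = \begin{bmatrix} I^{(1)}(t) \\ \cos(t) - 2I(t) - 3I^{(1)}(t) \end{bmatrix}
= \mathbf{f}(t, \mathbf{u}(t))
\qquad
\mathbf{u}(0) = \begin{bmatrix} 1.28 \\ I^{(1)}(0) \end{bmatrix}
\]
A first guess for the value of I^{(1)}(0) could be simply 0. This would give the
following steps using Euler's method:
\[
\begin{aligned}
\mathbf{u}(0.2) &= \mathbf{u}(0) + h\,\mathbf{f}(0, \mathbf{u}(0))
= \begin{bmatrix} 1.28 \\ 0 \end{bmatrix} + 0.2 \begin{bmatrix} 0 \\ -1.56 \end{bmatrix}
= \begin{bmatrix} 1.28 \\ -0.312 \end{bmatrix} \\
\mathbf{u}(0.4) &= \begin{bmatrix} 1.28 \\ -0.312 \end{bmatrix} + 0.2 \begin{bmatrix} -0.312 \\ -0.644 \end{bmatrix}
= \begin{bmatrix} 1.218 \\ -0.441 \end{bmatrix} \\
\mathbf{u}(0.6) &= \begin{bmatrix} 1.218 \\ -0.441 \end{bmatrix} + 0.2 \begin{bmatrix} -0.441 \\ -0.192 \end{bmatrix}
= \begin{bmatrix} 1.129 \\ -0.479 \end{bmatrix} \\
\mathbf{u}(0.8) &= \begin{bmatrix} 1.129 \\ -0.479 \end{bmatrix} + 0.2 \begin{bmatrix} -0.479 \\ 0.004 \end{bmatrix}
= \begin{bmatrix} 1.034 \\ -0.478 \end{bmatrix} \\
\mathbf{u}(1.0) &= \begin{bmatrix} 1.034 \\ -0.478 \end{bmatrix} + 0.2 \begin{bmatrix} -0.478 \\ 0.065 \end{bmatrix}
= \begin{bmatrix} 0.938 \\ -0.465 \end{bmatrix}
\end{aligned}
\]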
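The estimated current at t = 1 s with this shot is 0.938 A, less than the actual
value measured in the system. The derivative can be increased for a second
shot, for example to 0.5. Interpolating between the shots gives the initial
derivative for a final run of Euler's method, whose steps reach the measured
value of 0.991 A at t = 1 s:
\[
\mathbf{u}(0.2) = \begin{bmatrix} 1.322 \\ -0.227 \end{bmatrix}
\quad
\mathbf{u}(0.4) = \begin{bmatrix} 1.277 \\ -0.424 \end{bmatrix}
\quad
\mathbf{u}(0.6) = \begin{bmatrix} 1.192 \\ -0.496 \end{bmatrix}
\quad
\mathbf{u}(0.8) = \begin{bmatrix} 1.093 \\ -0.510 \end{bmatrix}
\quad
\mathbf{u}(1.0) = \begin{bmatrix} 0.991 \\ -0.502 \end{bmatrix}
\]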
13.3 Finite Difference Method
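The centered divided-difference formulae for the first and second derivatives are:
\[
y^{(1)}(t_i) = \frac{y(t_{i+1}) - y(t_{i-1})}{2h} + O(h^2)
\tag{13.6}
\]
\[
y^{(2)}(t_i) = \frac{y(t_{i+1}) - 2y(t_i) + y(t_{i-1})}{h^2} + O(h^2)
\tag{13.7}
\]
Substituting these into a second-order ODE of the form
c_2 y^{(2)}(t) + c_1 y^{(1)}(t) + c_0 y(t) = f(t) and multiplying by 2h^2 gives, for each
internal mesh point:
\[
(2c_2 - hc_1)\,y(t_{i-1}) + (2h^2c_0 - 4c_2)\,y(t_i) + (hc_1 + 2c_2)\,y(t_{i+1}) = 2h^2 f(t_i)
\tag{13.8}
\]
Writing this equation for every internal mesh point, and moving the known boundary
values y(t_0) and y(t_{n-1}) to the right-hand side, gives the matrix-vector system:
\[
\begin{bmatrix}
2h^2c_0 - 4c_2 & hc_1 + 2c_2 & 0 & \cdots & 0 & 0 \\
2c_2 - hc_1 & 2h^2c_0 - 4c_2 & hc_1 + 2c_2 & \cdots & 0 & 0 \\
0 & 2c_2 - hc_1 & 2h^2c_0 - 4c_2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 2h^2c_0 - 4c_2 & hc_1 + 2c_2 \\
0 & 0 & 0 & \cdots & 2c_2 - hc_1 & 2h^2c_0 - 4c_2
\end{bmatrix}
\begin{bmatrix}
y(t_1) \\ y(t_2) \\ y(t_3) \\ \vdots \\ y(t_{n-3}) \\ y(t_{n-2})
\end{bmatrix}
=
\begin{bmatrix}
2h^2 f(t_1) - (2c_2 - hc_1)\,y(t_0) \\
2h^2 f(t_2) \\
2h^2 f(t_3) \\
\vdots \\
2h^2 f(t_{n-3}) \\
2h^2 f(t_{n-2}) - (hc_1 + 2c_2)\,y(t_{n-1})
\end{bmatrix}
\tag{13.9}
\]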
The pseudocode for an algorithm to solve a second-order ODE by building the
matrix-vector system of Eq. (13.9) is presented in Fig. 13.3. Note that, given an
equation of the form c_2 y^{(2)}(t) + c_1 y^{(1)}(t) + c_0 y(t) = f(t), the value of each individual
coefficient can be obtained by the algorithm (rather than input manually by the
user) by setting in turn either y(t) or one of its derivatives to 1 and the other two
values to 0.
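As a sketch of this algorithm, the following Python function builds and solves the tridiagonal system for a second-order ODE with constant coefficients, using the centered divided-difference formulae of Eqs. (13.6) and (13.7). It is an illustration under stated assumptions rather than a transcription of Fig. 13.3: the function name finite_difference_bvp and its parameters are hypothetical, the coefficients are passed in directly instead of being probed, and NumPy's dense solver stands in for the methods of Chap. 4. The usage line applies it to the circuit I'' + 3I' + 2I = cos(t) with I(0) = 1.28, I(1) = 0.991, and h = 0.2.

```python
import numpy as np

def finite_difference_bvp(c2, c1, c0, f, t0, t_end, y0, y_end, n):
    """Solve c2*y'' + c1*y' + c0*y = f(t) on n mesh points (boundaries included)."""
    t = np.linspace(t0, t_end, n)
    h = t[1] - t[0]
    lower = 2.0 * c2 - h * c1           # coefficient of y(t_{i-1})
    diag = 2.0 * h**2 * c0 - 4.0 * c2   # coefficient of y(t_i)
    upper = h * c1 + 2.0 * c2           # coefficient of y(t_{i+1})
    m = n - 2                           # number of internal mesh points
    M = np.zeros((m, m))
    b = np.array([2.0 * h**2 * f(ti) for ti in t[1:-1]])
    for k in range(m):
        M[k, k] = diag
        if k > 0:
            M[k, k - 1] = lower
        if k < m - 1:
            M[k, k + 1] = upper
    # Known boundary values move to the right-hand side.
    b[0] -= lower * y0
    b[-1] -= upper * y_end
    y = np.empty(n)
    y[0], y[-1] = y0, y_end
    y[1:-1] = np.linalg.solve(M, b)
    return t, y, M, b

t, y, M, b = finite_difference_bvp(1.0, 3.0, 2.0, np.cos, 0.0, 1.0, 1.28, 0.991, 6)
print(np.round(M, 2))
print(np.round(b, 3))
print(y)
```

For large meshes a dedicated tridiagonal solver would avoid storing the zeros of M; the dense matrix is kept here only so that the system can be printed and compared with the hand-built one.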
As with the shooting method, the main advantage of this method is that it relies
exclusively on methods learnt in previous chapters, namely Chaps. 4 and 10, and
requires no new mathematical tools. Moreover, it again gives the designer control
on the error rate of the method; a better error rate can be achieved by approximating
the ODE using higher-order divided-difference formulae, albeit at the cost of
having more terms to handle in the new version of Eq. (13.8) and the new matrix of
Eq. (13.9). The main disadvantage is its cost overhead: before Eq. (13.8) can be
developed, it is necessary to write out the divided-difference formula at the required
error rate for every derivative order in the system's ODE by using the Taylor series
technique presented in Chap. 10.
Example 13.2
Consider a circuit with a single loop with an inductor of 1 H, a resistor of 3 Ω,
and a capacitor of 0.5 F, as shown in Fig. 13.1. Initially, a current of 1.28 A is
going through the loop. At time t = 0, a voltage force of V(t) = sin(t) is
applied. After 1 s, the current is measured to be 0.991 A. Approximate the
current I(t) moving through the loop at every 0.2 s using the finite difference
method. This situation can be modelled by the following ODE:
\[
I^{(2)}(t) + 3I^{(1)}(t) + 2I(t) = \cos(t)
\]
Solution
Using h = 0.2, c_0 = 2, c_1 = 3, and c_2 = 1, the ODE can be rewritten in the form
of Eq. (13.8):
\[
1.40\,I(t_{i-1}) - 3.84\,I(t_i) + 2.60\,I(t_{i+1}) = 0.08\cos(t_i)
\]
This new equation can then be duplicated for each mesh point and used to
write a matrix-vector system of the form of Eq. (13.9):
\[
\begin{bmatrix}
-3.84 & 2.60 & 0 & 0 \\
1.40 & -3.84 & 2.60 & 0 \\
0 & 1.40 & -3.84 & 2.60 \\
0 & 0 & 1.40 & -3.84
\end{bmatrix}
\begin{bmatrix} I(0.2) \\ I(0.4) \\ I(0.6) \\ I(0.8) \end{bmatrix}
=
\begin{bmatrix} -1.714 \\ 0.074 \\ 0.066 \\ -2.521 \end{bmatrix}
\]
And finally, this system can be solved using any of the methods seen in
Chap. 4, or by straightforward backward elimination, to get the values of the
four internal mesh points. These values are presented in the table below, with
their correct equivalents and the relative errors. It can be seen that the errors
are a lot smaller than they were in Example 13.1, despite the fact that both
examples use O(h^2) methods, namely Euler's IVP method in the previous
example and the second-order divided-difference formula in this one.
The current function over time and the finite difference approximation can
be plotted together, along with the best approximation from the shooting
method of Example 13.1. This figure confirms visually that the finite difference
method (in purple) gives a much closer approximation of the real
function (in blue) than the shooting method (in red).
[Figure: I(t) plotted over time, from t = 0 to 1 s]
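\[
\begin{aligned}
& c_9 \frac{\partial^n y}{\partial x_0^n} + \cdots + c_8 \frac{\partial^2 y}{\partial x_0^2} + c_7 \frac{\partial y}{\partial x_0}
+ c_6 \frac{\partial^n y}{\partial x_1^n} + \cdots + c_5 \frac{\partial^2 y}{\partial x_1^2} + c_4 \frac{\partial y}{\partial x_1} + \cdots \\
& + c_3 \frac{\partial^n y}{\partial x_{m-1}^n} + \cdots + c_2 \frac{\partial^2 y}{\partial x_{m-1}^2} + c_1 \frac{\partial y}{\partial x_{m-1}}
+ c_0\,y(x_0, x_1, \ldots, x_{m-1}) = f(x_0, x_1, \ldots, x_{m-1})
\end{aligned}
\tag{13.10}
\]
where y stands for y(x_0, x_1, ..., x_{m-1}) in every partial derivative.
\[
\frac{\partial y(x_{0,i}, x_{1,j}, \ldots, x_{m-1,k})}{\partial x_0} =
\frac{y(x_{0,i+1}, x_{1,j}, \ldots, x_{m-1,k}) - y(x_{0,i-1}, x_{1,j}, \ldots, x_{m-1,k})}{2h} + O(h^2)
\tag{13.11}
\]
\[
\frac{\partial^2 y(x_{0,i}, x_{1,j}, \ldots, x_{m-1,k})}{\partial x_0^2} =
\frac{y(x_{0,i+1}, x_{1,j}, \ldots, x_{m-1,k}) - 2y(x_{0,i}, x_{1,j}, \ldots, x_{m-1,k}) + y(x_{0,i-1}, x_{1,j}, \ldots, x_{m-1,k})}{h^2} + O(h^2)
\tag{13.12}
\]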
Comparing these to Eqs. (13.6) and (13.7), it can be seen that the variable x_0 that the
derivative is taken with respect to changes as before, while the other independent
variables x_1 to x_{m-1} remain constant.
Once the derivatives in the system Eq. (13.10) are approximated using
divided-difference formulae, the resulting equation can be written out for every internal
mesh point that needs to be approximated. As before, this will generate an equal
number of equations and unknown mesh point values, which can be written into an
Mx = b system and solved.
The previous explanation was for the general case with m independent variables
and an equation using up to the nth derivative. So it might feel a bit abstract. It is
worthwhile to expound the method by considering the simplest but common case of
working in three dimensions, with the function y(x_0, x_1) of two independent
variables to model. The variable x_0 takes values in the interval [x_{0,0}, x_{0,m-1}] and the
variable x_1 takes values in the interval [x_{1,0}, x_{1,n-1}], both at regular steps of h, so
that:
\[
h = \frac{x_{0,m-1} - x_{0,0}}{m - 1} = \frac{x_{1,n-1} - x_{1,0}}{n - 1}
\tag{13.13}
\]
The boundary conditions for this problem are the set of values the function takes
at every point where either x_0 or x_1 is at its minimum or maximum value: y(x_{0,0}, x_1)
and y(x_{0,m-1}, x_1) for all values of x_1, and y(x_0, x_{1,0}) and y(x_0, x_{1,n-1}) for all values
of x_0. The problem is to determine the values for all mesh points of the function
that are not on the boundary. This situation is illustrated in Fig. 13.4. The value of
y(x_0, x_1) is known at the blue points on the boundary in that figure, and must be
estimated for the red points inside the interval.
Finally, a formula is known to model the system using its partial derivatives;
again for simplicity, assume it only uses up to the second-order derivatives:
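\[
c_4 \frac{\partial^2 y(x_0, x_1)}{\partial x_0^2}
+ c_3 \frac{\partial^2 y(x_0, x_1)}{\partial x_1^2}
+ c_2 \frac{\partial y(x_0, x_1)}{\partial x_0}
+ c_1 \frac{\partial y(x_0, x_1)}{\partial x_1}
+ c_0\,y(x_0, x_1) = f(x_0, x_1)
\tag{13.14}
\]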
Since the partial derivative takes one parameter to be constant, each can be
taken to be a single-variable derivative. The centered divided-difference
formulae of Eqs. (13.6) and (13.7) can then be modified to use a function of two
independent variables where one of the variables is constant. The resulting
equations are:
\[
\frac{\partial y(x_{0,i}, x_{1,j})}{\partial x_0} =
\frac{y(x_{0,i+1}, x_{1,j}) - y(x_{0,i-1}, x_{1,j})}{2h} + O(h^2)
\tag{13.15}
\]
\[
\frac{\partial^2 y(x_{0,i}, x_{1,j})}{\partial x_0^2} =
\frac{y(x_{0,i+1}, x_{1,j}) - 2y(x_{0,i}, x_{1,j}) + y(x_{0,i-1}, x_{1,j})}{h^2} + O(h^2)
\tag{13.16}
\]
\[
\frac{\partial y(x_{0,i}, x_{1,j})}{\partial x_1} =
\frac{y(x_{0,i}, x_{1,j+1}) - y(x_{0,i}, x_{1,j-1})}{2h} + O(h^2)
\tag{13.17}
\]
\[
\frac{\partial^2 y(x_{0,i}, x_{1,j})}{\partial x_1^2} =
\frac{y(x_{0,i}, x_{1,j+1}) - 2y(x_{0,i}, x_{1,j}) + y(x_{0,i}, x_{1,j-1})}{h^2} + O(h^2)
\tag{13.18}
\]
These can be substituted back into the model equation to write it as only a function
of data points and eliminate the derivatives:
\[
\begin{aligned}
& (2c_4 + hc_2)\,y(x_{0,i+1}, x_{1,j}) + (2c_3 + hc_1)\,y(x_{0,i}, x_{1,j+1}) \\
& - (hc_2 - 2c_4)\,y(x_{0,i-1}, x_{1,j}) - (hc_1 - 2c_3)\,y(x_{0,i}, x_{1,j-1}) \\
& - (4c_4 + 4c_3 - 2h^2c_0)\,y(x_{0,i}, x_{1,j}) = 2h^2 f(x_{0,i}, x_{1,j})
\end{aligned}
\tag{13.19}
\]
This model equation can then be written out for every internal mesh point, to
create a set of (m - 2)(n - 2) equations with as many unknowns. This set can then
be arranged in an Mx = b system to be solved:
\[
\mathbf{M}\mathbf{x} = \mathbf{b}
\qquad
\mathbf{x} = \begin{bmatrix}
y(x_{0,1}, x_{1,1}) \\ y(x_{0,2}, x_{1,1}) \\ y(x_{0,3}, x_{1,1}) \\ \vdots \\ y(x_{0,1}, x_{1,2}) \\ y(x_{0,2}, x_{1,2}) \\ \vdots \\ y(x_{0,m-2}, x_{1,n-2})
\end{bmatrix}
\qquad
\mathbf{b} = \begin{bmatrix}
2h^2 f(x_{0,1}, x_{1,1}) + (hc_2 - 2c_4)\,y(x_{0,0}, x_{1,1}) + (hc_1 - 2c_3)\,y(x_{0,1}, x_{1,0}) \\
2h^2 f(x_{0,2}, x_{1,1}) + (hc_1 - 2c_3)\,y(x_{0,2}, x_{1,0}) \\
2h^2 f(x_{0,3}, x_{1,1}) + (hc_1 - 2c_3)\,y(x_{0,3}, x_{1,0}) \\
\vdots \\
2h^2 f(x_{0,m-2}, x_{1,n-2}) - (2c_4 + hc_2)\,y(x_{0,m-1}, x_{1,n-2}) - (2c_3 + hc_1)\,y(x_{0,m-2}, x_{1,n-1})
\end{bmatrix}
\tag{13.20}
\]
Each row of M holds the coefficients of Eq. (13.19) for one internal mesh point:
-(4c_4 + 4c_3 - 2h^2c_0) on the diagonal, (2c_4 + hc_2) and -(hc_2 - 2c_4) in the
columns of the next and previous points along x_0, (2c_3 + hc_1) and -(hc_1 - 2c_3) in
the columns of the next and previous points along x_1, and 0 everywhere else. Any
neighbor that lies on the boundary is known, and its term is moved into b.
The matrix-vector system is a lot larger than it was in the two-dimensional case
of the finite difference method presented earlier, but the methodology used to obtain
it is the same. Solving the three-dimensional BVP is no more difficult than solving
the two-dimensional BVP, only longer. Likewise, the pseudocode for this version
of the finite difference method will not be substantially different from that of the
two-dimensional case presented in Fig. 13.3.
Example 13.3
Consider Laplace's equation of a function of two independent variables,
z = f(x, y):
\[
\frac{\partial^2 f(x, y)}{\partial x^2} + \frac{\partial^2 f(x, y)}{\partial y^2} = 0
\]
This can be modelled as a BVP. Use the domains x ∈ [0, 1] and y ∈ [0, 1]
with a step of h = 0.25. The boundary conditions will be:
\[
f(x, y) = \begin{cases}
0 & x = 0 \\
0 & y = 0 \\
1 & x = 1,\ y \neq 0 \\
1 & y = 1,\ x \neq 0
\end{cases}
\]
Compute the value for all internal mesh points.
Solution
Applying the centered divided-difference formulae, Laplace's equation
becomes:
\[
f(x_{i+1}, y_j) + f(x_{i-1}, y_j) + f(x_i, y_{j+1}) + f(x_i, y_{j-1}) - 4f(x_i, y_j) = 0
\]
Now this equation can be written out for each internal point. In this example,
this is a simple matter, since there are only nine internal mesh points, namely
(0.25, 0.25), (0.25, 0.50), (0.25, 0.75), (0.50, 0.25), (0.50, 0.50), (0.50, 0.75),
(0.75, 0.25), (0.75, 0.50), and (0.75, 0.75). The values of the 16 boundary
points are already known. This setup is represented in the figure below:
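Writing the equation for each internal point and moving the known boundary values
to the right-hand side gives the system:
\[
\begin{bmatrix}
-4 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & -4 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & -4 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & -4 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & -4 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & -4 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & -4 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & -4 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & -4
\end{bmatrix}
\begin{bmatrix}
f(0.25, 0.25) \\ f(0.25, 0.50) \\ f(0.25, 0.75) \\ f(0.50, 0.25) \\ f(0.50, 0.50) \\ f(0.50, 0.75) \\ f(0.75, 0.25) \\ f(0.75, 0.50) \\ f(0.75, 0.75)
\end{bmatrix}
=
\begin{bmatrix}
0 \\ 0 \\ -1 \\ 0 \\ 0 \\ -1 \\ -1 \\ -1 \\ -2
\end{bmatrix}
\]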
This system can be solved using any of the methods learnt in Chap. 4 to
find the solution vector:
$$
\begin{bmatrix}
f(0.25, 0.25) \\ f(0.25, 0.50) \\ f(0.25, 0.75) \\ f(0.50, 0.25) \\ f(0.50, 0.50) \\ f(0.50, 0.75) \\ f(0.75, 0.25) \\ f(0.75, 0.50) \\ f(0.75, 0.75)
\end{bmatrix}
=
\begin{bmatrix}
0.14 \\ 0.29 \\ 0.50 \\ 0.29 \\ 0.50 \\ 0.71 \\ 0.50 \\ 0.71 \\ 0.86
\end{bmatrix}
$$
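As a cross-check on Example 13.3, the sketch below applies the centered-difference form of Laplace's equation directly on the mesh and sweeps it Gauss-Seidel style until the values stop changing. It is only one possible way to solve the system; the function name `solve_laplace_example`, the tolerance, and the iteration cap are assumptions made for this illustration and are not taken from the text.

```python
def solve_laplace_example(h=0.25, tol=1e-10, max_iter=10_000):
    """Approximate f on the unit square; f[i][j] stands for f(i*h, j*h)."""
    n = round(1 / h)  # number of intervals along each axis
    f = [[0.0] * (n + 1) for _ in range(n + 1)]  # x = 0 and y = 0 edges stay at 0
    for k in range(1, n + 1):
        f[n][k] = 1.0  # boundary x = 1, y != 0
        f[k][n] = 1.0  # boundary y = 1, x != 0
    for _ in range(max_iter):
        biggest_change = 0.0
        for i in range(1, n):
            for j in range(1, n):
                # Centered-difference Laplace equation solved for the middle point
                new = (f[i + 1][j] + f[i - 1][j] + f[i][j + 1] + f[i][j - 1]) / 4
                biggest_change = max(biggest_change, abs(new - f[i][j]))
                f[i][j] = new
        if biggest_change < tol:
            break
    return f


if __name__ == "__main__":
    grid = solve_laplace_example()
    for i in range(1, 4):
        print([round(grid[i][j], 2) for j in range(1, 4)])
```

With h = 0.25 the nine internal values should agree, after rounding to two decimals, with the solution vector of the example.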
13.4 Engineering Applications
The BVPs of this chapter are closely related to the IVPs of Chap. 12, with the
difference that in this case there are measurements at the start and end of the system
that the solution must respect. For example, in Sect. 13.1, the current has been
measured in the circuit both when it was turned on and when it was turned off, and
the solution, the current at intermediate moments, must be approximated in a way
that is consistent with both these measurements. Other examples include:
Newton's law of cooling models the decreasing temperature T over distance of a
hot object in a cool environment as:
$$
\frac{dT}{dx} = -\frac{hA}{C}\left(T - T_{env}\right) \tag{13.21}
$$
where h is the heat transfer coefficient between the object and the environment,
A is the heat transfer surface area between the object and its environment, C is
the total heat capacity of the system, and T_env is the temperature of the
environment. Provided the object is heated to a different temperature at each end, this
model makes it possible to compute the intermediate temperatures along the
length of the object.
The Lotka-Volterra equations describe the dynamic relationship between the
populations of two species living in the same environment. One is the prey
species, with population size x, and the other is the predator species, with
population size y, that hunts the prey. The rate of change of each population is
modelled as:
$$
\frac{dx}{dt} = \alpha x - \beta xy
\qquad
\frac{dy}{dt} = \delta xy - \gamma y \tag{13.22}
$$
where the four constants describe the interaction of the two species: α is the
growth rate of the prey, β is the rate at which predators kill off prey, δ is the
growth rate of predators given their consumption of prey, and γ is the death rate
of predators. Provided the predator and prey population sizes have been
observed at two different times, it becomes possible to use this model to estimate
the population history.
The displacement over time of a damped spring in orientation x is described by:
$$
\frac{d^2x}{dt^2} + 2\zeta\omega\frac{dx}{dt} + \omega^2 x = 0 \tag{13.23}
$$
where ζ is the damping ratio and ω is the angular frequency of the oscillations.
Knowing the position at two different instants makes it possible to approximate
the complete motion of the spring.
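To make the last application concrete, the following sketch applies the finite difference method to the damped-spring equation x'' + 2ζω x' + ω² x = 0 when the position is known at two instants. Replacing the derivatives with centered divided differences and multiplying by h² gives (1 − ζωh) x_{i−1} + (−2 + ω²h²) x_i + (1 + ζωh) x_{i+1} = 0 at each interior point. The function name, its parameters, and the choice of a tridiagonal (Thomas) elimination are assumptions made for this illustration; any linear algebra method could be used instead, and the sketch assumes at least two steps and a step small enough that no pivot becomes zero.

```python
def damped_spring_bvp(zeta, omega, t_start, t_end, x_start, x_end, n_steps):
    """Return the positions at the n_steps + 1 mesh points, boundaries included."""
    h = (t_end - t_start) / n_steps
    lower = 1.0 - zeta * omega * h   # coefficient of x[i-1]
    diag = -2.0 + (omega * h) ** 2   # coefficient of x[i]
    upper = 1.0 + zeta * omega * h   # coefficient of x[i+1]
    n = n_steps - 1                  # number of interior (unknown) points
    rhs = [0.0] * n
    rhs[0] -= lower * x_start        # known boundary values move to the right side
    rhs[-1] -= upper * x_end
    # Forward elimination of the tridiagonal system (Thomas algorithm)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper / diag
    d_prime[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag - lower * c_prime[i - 1]
        c_prime[i] = upper / denom
        d_prime[i] = (rhs[i] - lower * d_prime[i - 1]) / denom
    # Back substitution
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return [x_start] + x + [x_end]


if __name__ == "__main__":
    # Illustrative values only: zeta = 0.1, omega = 2, positions known at t = 0 and t = 1
    print(damped_spring_bvp(0.1, 2.0, 0.0, 1.0, 1.0, -0.3, 10))
```

Halving the step size and comparing the two results gives a practical feel for the O(h²) behaviour of the method.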
13.5 Summary
Method | Requires | Error
Shooting method | Boundary conditions, multiple shots IVP method and interpolation or regression method | $O(h^2)$ or better
Finite difference method | Boundary conditions, divided-difference formula and linear algebra method | $O(h^2)$ or better
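The ingredients the shooting method requires (boundary conditions, two shots solved with an IVP method, and an interpolation step) can be sketched in a few lines. The example below is an illustration under stated assumptions: it uses Euler's method as the IVP solver and linear interpolation between the two shots, and the helper names `euler_trajectory` and `shoot_linear` are hypothetical. The test problem y'' + 3y' + 8y = 0 with y(0) = 1, y(1) = 2, h = 0.5 and shots of 10 and 20 was chosen only because the arithmetic is easy to follow by hand.

```python
def euler_trajectory(f, t0, y0, dy0, h, n_steps):
    """Integrate y'' = f(t, y, y') with Euler's method; return the y values."""
    t, y, dy = t0, y0, dy0
    ys = [y]
    for _ in range(n_steps):
        # Both updates use the values from the start of the step
        y, dy = y + h * dy, dy + h * f(t, y, dy)
        t += h
        ys.append(y)
    return ys


def shoot_linear(f, t0, t1, ya, yb, shot_a, shot_b, h):
    """Combine two shots by linear interpolation to hit y(t1) = yb."""
    n_steps = round((t1 - t0) / h)
    end_a = euler_trajectory(f, t0, ya, shot_a, h, n_steps)[-1]
    end_b = euler_trajectory(f, t0, ya, shot_b, h, n_steps)[-1]
    # Initial slope whose interpolated end value equals the target yb
    slope = shot_a + (yb - end_a) * (shot_b - shot_a) / (end_b - end_a)
    return slope, euler_trajectory(f, t0, ya, slope, h, n_steps)


if __name__ == "__main__":
    ode = lambda t, y, dy: -3 * dy - 8 * y   # y'' = -3y' - 8y
    slope, ys = shoot_linear(ode, 0.0, 1.0, 1.0, 2.0, 10.0, 20.0, 0.5)
    print(slope, ys)   # slope 12.0, trajectory [1.0, 7.0, 2.0]
```

Because the ODE is linear, the end value depends linearly on the initial slope, so a single interpolation between the two shots lands exactly on the second boundary condition; a nonlinear ODE would generally need further shots.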
13.6 Exercises
1. Solve the following BVP using h = 0.5 and the method of your choice.
$y^{(2)}(t) + 2y(t) = 1$
$y(0) = 0$
$y(2) = 1$
2. Using the shooting method and the shots $y^{(1)}(0) = 10$ and $y^{(1)}(0) = 20$, solve the
following BVP using h = 0.5.
$y^{(2)}(t) + 3y^{(1)}(t) + 8y(t) = 0$
$y(0) = 1$
$y(1) = 2$
3. Approximate nine interior mesh points using the finite difference method for
the following BVP.
$y^{(2)}(t) + 3y^{(1)}(t) + 8y(t) = 0$
$y(0) = 1$
$y(1) = 2$
4. Solve the following BVP using h = 1 and applying the shooting method. For the
shots, use $y^{(1)}(0) = 0$ and $y^{(1)}(0) = 1$. For the IVP method, use the fourth-order
Runge-Kutta method.
$2y^{(2)}(t) + y(t) = 0$
$y(0) = 5$
$y(5) = -5$
5. Repeat Exercise 4 using the finite difference method.
6. Solve the following BVP using h = 0.2 and applying the shooting method. For
the shots, use $y^{(1)}(0) = 1$ and $y^{(1)}(0) = 2$. For the IVP method, use the
fourth-order Runge-Kutta method.
$y^{(2)}(t) - 2y^{(1)}(t) + 3y(t) = 4$
$y(0) = 2$
$y(1) = -2$
7. Repeat Exercise 6 using the finite difference method and using h = 0.1.
8. Solve the following BVP using h = 1 and applying the shooting method. For the
shots, use $y^{(1)}(0) = 0$ and $y^{(1)}(0) = 2$. For the IVP method, use Euler's method.
$4y^{(2)}(t) - 8y^{(1)}(t) + 2y(t) = 7$
$y(0) = 0$
$y(5) = -10$
9. Repeat Exercise 8 using the fourth-order Runge-Kutta method.
10. Repeat Exercise 8 using the finite difference method.
11. Repeat Example 13.3 with the following boundary conditions:
2 f x; y 2 f x; y
cos x y
x2
y2
A.1 Introduction
The focus of this textbook is not just on teaching how to perform the computations
of the mathematical tools and numerical methods that will be presented, but also on
demonstrating how to implement these tools and methods as computer software. It
is indeed necessary for a modern engineer to be able not just to understand the
theory behind numerical methods or to use a calculator in which they are
preprogrammed, but to be able to write the software to compute a method when it
is not available or to verify this software when it is given. Programming has become
a fundamental part of a successful engineering career.
Most people today write software in a language of the C programming language
family. This is a very large and diverse family that includes C++, C#, Objective-C,
Java, Matlab, Python, and countless other languages. Writing out each algorithm in
every language one is likely to use would be an endless task! Instead, this book
presents the algorithms in pseudocode. Pseudocode is a middle ground between
English and programming that makes it possible to plan out a program's structure
and logical steps in a way that is understandable to humans and easily translatable
to a programming language without being tied to one specific language over all
others. In fact, writing out complex algorithms in pseudocode is considered an
important step in software development projects and an integral part of a software
system's documentation.
For example, consider a software program that takes in the length of the side of a
square and computes and displays the area and perimeter of that square. The
pseudocode of that program could be the one presented in Fig. A.1.
Note the use of an arrow to assign values to the variables Side, Perimeter,
and Area. This is done to avoid confusion with the equal sign, which could be
interpreted as either an assignment or an equality test. Note as well the use of
human terms for commands, such as Input and Display. These commands are
used to abstract away the technical details of specific languages. This pseudocode
will never run on a computer, nor is it meant to. But it is simple to translate into a
variety of programming languages:
Fig. A.2 Code of the square program in C++ (top), Matlab (middle), and Python (bottom)
© Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3
All the functions in Fig. A.2 will run in their respective programming environments.
Notice how the pseudocode human commands Input and Display were
replaced by the language-specific commands cin and cout in C++, input and
disp in Matlab, and input and print in Python, and how unique language-specific
actions needed to be added to each program, such as the explicit declaration
of the variable type int in C++, the square-bracketed array display in Matlab, or the
int command to convert the user input into an integer in Python. These language-specific
technical details are simplified away using pseudocode, in order to keep the
reader's focus on the big picture, the language-independent functionalities of the
algorithm.
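For readers who want to see the translation step spelled out, here is a minimal Python sketch of the square program described above. It is not a reproduction of Fig. A.2: the function names `square_metrics` and `main` and the prompt and message wording are assumptions made for this illustration. Only the use of `input`, `int`, and `print` follows the description in the text.

```python
def square_metrics(side):
    """Return the perimeter and area of a square with the given side length."""
    perimeter = 4 * side
    area = side * side
    return perimeter, area


def main():
    # int() converts the text typed by the user into an integer
    side = int(input("Enter the side of the square: "))
    perimeter, area = square_metrics(side)
    print("Perimeter:", perimeter)
    print("Area:", area)
```

Calling `main()` runs the program interactively. Keeping the computation in its own function separates the language-independent logic from the language-specific input and display commands.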
A.2 Control Statements
A.2.1 IF Control Statements
(or correct) the corresponding lines of code are executed, then the program jumps to
the END IF line without checking the other conditions. That is why it is not
necessary to re-check that previous conditions are false in later conditions; if the
later conditions are being checked at all, then all previous conditions must have
been false. For example, in the line ELSE IF (Value < 10), it is not necessary
to check that the value is greater than zero before displaying that the value is
positive, since there has already been the line IF (Value < 0) which must have
been evaluated as false. A negative value would have evaluated to true and led to
the execution of the code corresponding to that condition and consequently would
have never reached the less-than-10 evaluation. The only way a program will reach
the less-than-10 line is if the value is not less than zero and not equal to 0 and not
equal to 1.
All C family programming languages will have the IF and ELSE control
statements, and the ELSE IF control can be written in two words (as in C++ and
Java), one word (ELSEIF, as in Matlab) or an abbreviation (ELIF, as in Python).
The condition may be required to be between parentheses (C, C++, C#) or not
(Matlab, Python). And the code to be executed might be required to be between
curly brackets (C++, unless the code is exactly one line) or tabulated (Python) or
require no special markers at all. The END IF termination of the block can be
marked by closing the curly brackets (C++, Java) or de-tabulating the lines
(Python), or by an explicit END command (Matlab). These variations are illustrated
in Fig. A.4, which gives three functional implementations of the pseudocode of
Fig. A.3. Finally, some languages offer alternative controls as well, namely the
SWITCH-CASE control which is useful when the value of the same variable is
evaluated in all conditions, and the ? : operator for cases where only two outcomes
are possible. These additional controls provide fundamentally the same functionalities as the IF command, but are made available to ease code writing in some
common special cases.
Fig. A.4 IF control statement implemented in C++ (top), Matlab (middle), and Python (bottom)
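The chain of conditions discussed in this section (less than zero, equal to 0, equal to 1, less than 10) can be sketched in Python as follows. The function name `describe` and the returned messages are assumptions made for this illustration and are not taken from the figures. The point is that each `elif` is only reached when every earlier condition was false.

```python
def describe(value):
    """Classify a number using one IF / ELSE IF / ELSE chain."""
    if value < 0:
        return "negative"
    elif value == 0:
        return "zero"
    elif value == 1:
        return "one"
    elif value < 10:
        # No need to re-check value > 1 here: the earlier branches
        # already ruled out negative values, 0 and 1.
        return "positive and less than 10"
    else:
        return "10 or more"
```

For instance, `describe(-3)` stops at the first branch and never evaluates the less-than-10 condition, even though -3 is also less than 10.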
A.2.2 WHILE Control Statements
The WHILE control statement is used to create loops in the program by executing a
block of code repeatedly. Just like the IF command, it will
evaluate a condition and execute a block of code if that condition is true. But unlike
the IF command, once the block of code is completed, the condition will be
evaluated again and, if it is still true, the block of code will run again. This will
go on until the condition evaluates as false. Note that if the condition is initially
false, then the block of code will not be executed even once.
Consider the example pseudocode in Fig. A.5, which is meant to display
sequences of numbers. The user inputs a value, and the program will display all
numbers from 1 until that value. This is done by using a WHILE control statement
that evaluates whether the value to display is less than the user-specified target. If it
is, the code inside the WHILE is executed: the value is displayed and incremented
by 1. Once the code inside the WHILE has been completely executed, the program
returns to the beginning of the loop and evaluates the condition again. This repeats
until the condition evaluates to false (meaning that the incremented value has
become equal to or greater than the user-specified maximum), at which point the
loop ends, the code inside the WHILE is skipped, and the program continues on to
the goodbye message. As well, if the user inputs a value that is less than 1 (such as
0), the WHILE control statement will initially evaluate to false and will be skipped,
and nothing will be displayed.
All C family programming languages will have the WHILE control statements
and the FOR control statement. Both of them allow programmers to create loops,
and simply offer different syntaxes. As well, many languages will have a
DO-WHILE control statement, which works as a WHILE with the difference that
the evaluation of the condition comes after the block of code is executed instead of
before (meaning that one execution of the block of code is guaranteed unconditionally). The condition may be required to be between parentheses (C, C++, C#) or
not (Matlab, Python). And the code to be executed might be required to be between
curly brackets (C++, unless the code is exactly one line) or tabulated (Python) or
require no special markers at all. The END WHILE termination of the block can be
marked by closing the curly brackets (C++, Java) or de-tabulating the lines
(Python), or by an explicit END command (Matlab). These differences are illustrated in Fig. A.6.
Fig. A.6 WHILE control statement implemented in C++ (top), Matlab (middle), and Python
(bottom)
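A minimal Python sketch of the counting loop described in this section is given below. The function name `count_up` is hypothetical, and the values are collected in a list instead of being displayed so that the behaviour is easy to inspect. The condition follows the text literally (the value must be less than the target), which means the target itself is not included; using `<=` instead would include it.

```python
def count_up(target):
    """Collect the numbers from 1 up to, but not including, target."""
    shown = []
    value = 1
    while value < target:      # evaluated before every pass through the block
        shown.append(value)    # stands in for the Display command
        value += 1             # increment by 1
    return shown
```

For example, `count_up(4)` gives `[1, 2, 3]`, while `count_up(0)` gives an empty list because the condition is false the first time it is evaluated.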
A.2.3 CONTINUE and BREAK Control Statements
Two more control statements are worth noting: they are the CONTINUE control and
the BREAK control. They are both used inside WHILE blocks and in conjunction
with IF controls. The CONTINUE control is used to skip the rest of the WHILE
block and jump to the next evaluation of the loop. The BREAK control is used to
skip the rest of the WHILE block and exit the WHILE loop regardless of the value of
the condition, to continue the program unconditionally. These two control statements
are not necessary (the same behaviors could be created using finely crafted
IF and WHILE conditions), but they are very useful to create simple and clear
control paths in the program. The code in Fig. A.7 uses both control statements in
the code for a simple two-player "guess the number" game. The game runs an
infinite loop (the WHILE (TRUE) control statement, which will always evaluate to
true and execute the code), and in each loop Player 2 is asked to input a guess as to
the number Player 1 selected. There are two IF blocks; each one evaluates whether
Player 2's guess is too low or too high and displays an appropriate message, and
then encounters a CONTINUE control statement that immediately stops executing
the block of code and jumps back to the WHILE command to evaluate the condition
(which is true) and start again. If neither of these IF conditions
evaluates to true (meaning Player 2's guess is neither too low nor too high), then the
success message is displayed and a BREAK command statement is reached. At that
point, the execution of the block of code terminates immediately and the execution
leaves the WHILE loop entirely (even though the WHILE condition still evaluates to
true), and the program continues from the END WHILE line to display the thank-you
message. The hidden display line in the WHILE block of code after the BREAK
cannot possibly be reached by the execution, and will never be displayed to the user.
In fact, some compilers will even display a warning if such a line is present in the
code.
Fig. A.7 Pseudocode using the CONTINUE and BREAK control statements
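The control flow of the guessing game can be sketched in Python as follows. This is an illustration and not a transcription of Fig. A.7: the function name `play`, the message texts, and the idea of passing Player 2's guesses as a list (instead of reading them from the keyboard) are assumptions made so the example can be run without interaction. The sketch also assumes the list eventually contains the secret number, and it leaves out the unreachable display line mentioned above.

```python
def play(secret, guesses):
    """Run the guessing game over a prepared list of guesses; return the messages."""
    messages = []
    remaining = iter(guesses)
    while True:                      # infinite loop, left only through break
        guess = next(remaining)      # stands in for the Input command
        if guess < secret:
            messages.append("Too low")
            continue                 # skip the rest of the block, loop again
        if guess > secret:
            messages.append("Too high")
            continue
        messages.append("Correct")
        break                        # leave the loop although the condition is still true
    messages.append("Thank you for playing")
    return messages
```

Calling `play(7, [3, 9, 7])` passes through the first `continue`, then the second, and finally reaches the `break` on the third guess.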
A.3 Functions
line. Figure A.8 gives an example of a function call. The function Fibonacci
computes a Fibonacci sequence to a value specified in parameter, and returns the
final number of that sequence. That is all the information that a developer who uses
that function needs to know about it. The fact that it creates four additional local
variables (F0, F1, User, and Counter) is transparent to the calling function,
since these variables are never returned. These variables will be destroyed after the
function Fibonacci returns. Likewise, the user's name is a local variable of the
main program that is not passed to the Fibonacci function, and is thus invisible
to that function. The fact that there are two variables with the same name User, one
local to the main program and one local to the Fibonacci function, is not a
problem at all. These remain two different and unconnected variables, and the value
of User displayed at the end of the main program will be the one input by the user
at the beginning of that program, not the one created in the Fibonacci function.
Like the other control statements, having functions is standard in all languages of
the C family, but the exact syntax and keywords used to define them vary greatly. In
C and C++, there is no keyword to define a function, but the type of the variable
returned must be specified ahead of it (so the function in Fig. A.8 would be int
Fibonacci for example). Other languages in the family do use special keywords
to declare that a function definition begins, such as def in Python and function
in Matlab. These differences are illustrated in Fig. A.9.
Fig. A.9 Functions implemented in C++ (top), Matlab (middle), and Python (bottom)
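The scoping behaviour described in this section can be tried out with the short Python sketch below. It is written for illustration only and does not reproduce Fig. A.8 or Fig. A.9: the names `fibonacci` and `main_program` are hypothetical, and the convention that `fibonacci(n)` returns the n-th number of the sequence starting 0, 1, 1, 2, ... is an assumption, since the text does not pin down exactly which number is returned.

```python
def fibonacci(n):
    """Return the n-th Fibonacci number, counting from F0 = 0 and F1 = 1."""
    f0, f1 = 0, 1
    user = "local to fibonacci"   # same name as the caller's variable, but unrelated
    counter = 0
    while counter < n:
        f0, f1 = f1, f0 + f1
        counter += 1
    return f0                     # f0, f1, user and counter cease to exist after this


def main_program(user, n):
    """Call fibonacci and show that the caller's own 'user' is untouched."""
    result = fibonacci(n)
    return user, result
```

Here `main_program("Ada", 10)` returns `("Ada", 55)`: the value of `user` inside `fibonacci` never leaks into the calling code.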
Chapter 1
1. This level of precision is impossible with a ruler marked at only every 0.1 cm.
Decimals lesser than this are noise.
2. The second number has higher precision, but the first is more accurate.
3. First number is more precise, second number is more accurate.
4. The lower precision on the distance nullifies the higher precision of the
conversion. The decimals should not have been kept.
5. Absolute error 0.001593. Relative error 0.05 %.
6. Absolute error 0.0013. Relative error 0.04 %. Three significant digits.
7. Absolute error $2.7 \times 10^{-7}$. Relative error $8.5 \times 10^{-6}$ %. Six significant
digits.
8. Absolute error 3.3 . Relative error 1.4 %.
9. Absolute error 0.3 MV. Relative error 14 %.
10. Absolute error 8.2 mF. Relative error 7.6 %. Zero significant digits.
11. 3.1415 has four significant digits. 3.1416 has five significant digits.
12. One significant digit.
13. Two significant digits.
Chapter 2
1. $5.232345 \times 10^{2}$ or 5.232345e2.
2. The value 12300000000 suggests an implied precision of 0.5, which has a
maximum relative error of $4.1 \times 10^{-11}$, while the original scientific notation
suggests an implied precision of 50000000, which has a maximum relative
error of 0.0041.
3. $1101101_2$
4. $10110010000_2$
5. $111.0001_2$
6. $11010.111001_2$
7. $110000110_2$
8. $11101011001_2$
9. $11.1101111_2$
10. $1100.0011_2$
11. $11101.011001_2$
12. 18.1875
13. 7.984375
14. (a) 0.09323
(b) 9.323
(c) 93.23
(d) 932300
0 0
1. (a) PT 4 1 0
0 1
(c)
2. (a)
(b)
(c)
(d)
(e)
2
3
3
2:0
2:0 1:0
8:0 3:0 5 x 4 1:0 5
0
5:0
2:0
3
3
2
3
2
0 0 1 0
1 0
0 0
6:0 2:0 1:0 0
60 0 0 17
6 0:2 1
6 0 5:0 0 3:0 7
0 07
7
7
6
7
6
PT 6
4 0 1 0 0 5L 4 0 0:4 1 0 5 U 4 0
0 7:0 2:0 5
1 0 0 0
0:5 0 0:3 1
0
0
0 8:0
3
2
1:3
6 0:4 7
7
x6
4 3:4 5
0:9
3
2
3
2
3
2
1 0 0 0
1
0
0 0
10:0 3:0 2:0 3:0
60 0 1 07
6 0:2 1
7
6
0 07
7
7 U 6 0 8:0 1:0 2:0 7
6
PT 6
4 0 0 0 1 5 L 4 0:1 0
5
4
1 0
0
0 12:0 3:0 5
0 1 0 0
0:2 0:5 0:1 1
0
0
0
8:0
3
2
1:0
6 0:0 7
7
x6
4 1:0 5
2:0
2 3
2
3
9 0 0
10
L 42 8 05 x 4 5 5
1
1 2 6
2
3
2
3
7 0 0
2
L 4 2 9 0 5 x 4 1 5
4 6 11
3
3
2
3
2
3:00
0
0
0
0:30
6 0:20
7
6
4:00
0
0 7
7 x 6 0:00 7
L6
4 0:10 0:30
4 0:20 5
2:00
0 5
0:50 0:40 0:20 5:00
0:10
3
2
3
2
7
0
0 0
1
6 2
6 2 7
5
0 07
7
7
6
L6
4 1 2 6 0 5 x 4 0 5
1
0 3 5
1
3
2
3
2
2:00
0
0
0
0:00
6 0:20
7
6
1:00
0
0 7
7 x 6 1:00 7
L6
5
4 0:40 0:20 3:00
4
0
1:00 5
0:10 0:30 0:50 2:00
3:00
2
(b)
2
2
3
3
6:0
1
0 0
1
0 5L 4 0:2 1 0 5U 4 0:0
0
0:1 0:3 1
0
0:42
0:16
2
3
0:51
(b) x 4 0:31 5
0:13
3. (a) x
4.
5.
6.
7.
Chapter 5
1.
2.
3.
4.
f x0 h f x0 f 1 x0 h f 2!x0 h2 f 3!x0 h3
0.8955012154 and 0.0048
0.89129
(a) [0.125, 0.125 e0.5] [0.125, 0.20609]
(b) [0.125 e0.5, 0.125 e] [0.20609, 0.33979]
2. No, they only need to be consistent from row to row. They can compute the
coefficients of the polynomial in any column (exponent) order.
3. f(x) 0.03525sin(x) + 0.72182cos(x)
4. f(x) 0.96 1.86sin(x) + 0.24cos(x)
5. (a) f(x) 1(x 5) 2(x 2)
(b) f(x) 1(x 3) + 2(x 1)
(c) f(x) 3(x 3)(x 5) + 7(x 2)(x 5) 4(x 2)(x 3)
(d) f(x) 0.33(x 1)(x 3) + 0 + 0.66x(x 1)
f(x) 3 + 1.66(x 2)
f(x) 2 (x 2) + 0.5(x 2)(x 3)
f(x) 39 + 21(x + 2) 6(x + 2)x + 2(x + 2)x(x 1)
f(x) 21 10(x + 2) + 3(x + 2)x 3(x + 2)x(x 1).
f(x) 5 + 2(x 1)
f(x) 0.51 0.6438(x 1.3) 0.2450(x 1.3)(x 0.57)
3.3677(x 1.3)(x 0.57)(x + 0.33) + 1.0159(x 1.3)(x 0.57)(x + 0.33)(x + 1.2)
+ 0.8223(x 1.3)(x 0.57)(x + 0.33)(x + 1.2)(x 0.36)
11. The measurement at x 8 is very different from the others, and likely wrong.
Reasonable solutions include taking a new measurement at x 8 or discarding
the measurement.
12. (a) f(x) 0 + 0.4x + x2
(b) f(x) 0.20 + 0.44x + 0.26x2
(c) f(x) 2.5 + 2.4x + 1.9x2
13. (a) f(x) 2.3492e0.53033x
(b) f(x) 0.52118 e3.27260x
(c) f(x) 0.71798 e0.51986x
14. (a) f(x) 2.321cos(0.4x) 0.6921 sin(0.4x)
(b) f(x) 0.006986 + 2.318cos(0.4x) 0.6860sin(0.4x)
(c) Being of much smaller magnitude than the other coefficients, it is most
likely unnecessary.
15. (a) y 6.5
(b) y 3.36
(c) y 2.1951
16. y 10.323
17. Time 2.2785 s
Chapter 7
No exercises.
Chapter 8
1. (a) 1.4375
(b) 3.15625
(c) 3.2812
2. The interval is [40.84070158, 40.84070742] after 24 iterations. Note, however,
that sin(x) has 31 roots on the interval [1, 99]; the bisection method
neither suggests that more roots exist nor gives any indication as to where they
may be.
3. (a) 1.4267
(b) 3.16
(c) 3.3010
4. x 1.57079632679490 after five iterations.
5. x 0.4585 after two iterations.
6. x1 3/2, x2 17/12, x3 577/408
7. (a) x1 [0.6666667,1.833333]T,x2 [0.5833333,1.643939]T,x3 [0.5773810,1.633030]T
(b) x1[1.375, 0.575]T, x2 [1.36921, 0.577912]T, x3 [1.36921,
0.577918]T
8. x1 2, x2 4/3, x3 7/5
9. x1 0.5136, x2 0.6100, x3 0.6514, x4 0.6582
10. x 0.4585 after three iterations.
11. x1 1.14864, x2 0.56812, x3 0.66963, x4 0.70285, x5 0.70686, x6 0.70683.
Chapter 9
1. (a)
(b)
(c)
(d)
(e)
(f)
2. dlog 1(/h)e
3. (a) x1 0
(b) x1 0
(c) x2 0.5744
(d) x3 4.7124
(e) x3 1.3333
(f) x3 3.926990816
4. (a) x1 0
(b) x1 0.5969, x2 0.5019, x3 0.4117, x4 0.3310
(c)
(d)
(e)
(f)
x2 0.5735
x5 4.712388984477041
x2 1.3316
x7 3.9269
0.00012493 F/s
0.96519 rad/s and 0.96679 rad/s
0.970285 rad/s
3.425518831
0.3011686798
D3(0.25) 1.000003
(a) D3(0.5) 0.1441
(b) D3(0.5) 0.1911
(c) D3(0.5) 0.1585
5.00022699964881
1.02070069942442
0.999954724240937 after 13 iterations
1.711661979876841
1.388606601719423
2. (a) 4.5
(b) 2.137892120
3. (a) Integral 4; estimated error 4/3; real error 4/3
(b) Integral 16; estimated error 32/3; real error 48/5
(c) Integral 0.1901127572; estimated error 0.0006358300384;
real error 0.0006362543
4. 3.76171875
5. 0.8944624935
6. (a) Four segments 6; 8 segments 5.5; estimated error 1/6; real error 1/6
(b) Four segments 18; 8 segments 14.125; estimated error 4/3; real
error 1.325
7. 0.141120007827708
8. 682.666666666667
16.
17.
18.
19.
20.
21.
(c) u(0.25) [3.5, 2.25, 1.25]T; u(0.5) [4.0625, 2.5625, 1.375]T; u(0.75)
[4.703125, 2.90625, 1.40625]T; u(1) [5.4296875, 3.2578125,
1.35546875]T
26. (a) u(1) [5, 3, 0]T
(b) u(0.5) [4, 2.5, 0.5]T; u(1) [5.25, 2.75, 0.375]T
(c) u(0.25) [3.5, 2.25, 0.75]T; u(0.5) [4.0625, 2.4375, 0.578125]T;
u(0.75) [4.671875, 2.58203125, 0.474609375]T; u(1) [5.3173828125,
2.70068359375, 0.42565917975]T
Chapter 13
1. (0.5, 0.16667), (1.0, 0.5), (1.5, 0.83333).
2. y(1)(0) = 12; y(0.5) = 7
3. [y(0.1), y(0.2), y(0.3), y(0.4), y(0.5), y(0.6), y(0.7), y(0.8), y(0.9)]^T = [3.87, 5.72, 6.70, 6.95, 6.65, 5.97, 5.05, 4.02, 2.98]^T
4. y(1)(0) = 0.73, [y(1), y(2), y(3), y(4)]^T = [4.47, 1.81, −1.71, −4.41]^T
5. [y(1), y(2), y(3), y(4)]^T = [4.55, 1.82, −1.82, −4.55]^T
6. y(1)(0) = −1.24, [y(0.2), y(0.4), y(0.6), y(0.8)]^T = [1.66, 1.11, 0.31, −0.74]^T
7. [y(0.1), y(0.2), y(0.3), y(0.4), y(0.5), y(0.6), y(0.7), y(0.8), y(0.9)]^T = [1.85, 1.66, 1.41, 1.10, 0.73, 0.30, −0.20, −0.75, −1.35]^T
8. y(1)(0) = −1.03, [y(1), y(2), y(3), y(4)]^T = [−1.03, −2.38, −4.17, −6.59]^T
9. y(1)(0) = −1.02, [y(1), y(2), y(3), y(4)]^T = [−1.19, −2.78, −4.87, −7.49]^T
10. [y(1), y(2), y(3), y(4)]^T = [−1.17, −2.72, −4.80, −7.56]^T
11. [f(0.25, 0.25), f(0.25, 0.50), f(0.25, 0.75), f(0.50, 0.25), f(0.50, 0.50), f(0.50, 0.75), f(0.75, 0.25), f(0.75, 0.50), f(0.75, 0.75)]^T = [0.35, 0, 0.35, 0, 0, 0, 0.35, 0, 0.35]^T
References
Beeler, M., Gosper, R.W., Schroeppel, R.: HAKMEM. MIT AI Memo 239, 1972. Item 140
Bradie, B.: A Friendly Introduction to Numerical Analysis. Pearson Prentice Hall, Upper Saddle
River (2006)
Chapra, S.C.: Numerical Methods for Engineers, 4th edn. McGraw Hill, New York (2002)
Ferziger, J.H.: Numerical Methods for Engineering Applications, 2nd edn. Wiley, New York
(1998)
Goldstine, H.H.: A History of Numerical Analysis. Springer, New York (1977)
Griffiths, D.V., Smith, I.M.: Numerical Methods for Engineers, 2nd edn. Chapman & Hall/CRC,
New York (2006)
Hämmerlin, G., Hoffmann, K.-H.: Numerical Mathematics. Springer, New York (1991)
James, G.: Modern Engineering Mathematics, 3rd edn. Pearson Prentice Hall, Englewood Cliffs
(2004)
Mathews, J.H., Fink, K.D.: Numerical Methods Using Matlab, 4th edn. Pearson Prentice Hall,
Upper Saddle River (2004)
Stoer, J., Bulirsch, R.: Introduction to Numerical Analysis. Springer, New York (1993)
Weisstein, E.W.: MathWorld. Wolfram Web Resource. http://mathworld.wolfram.com/
Index
A
Accuracy, 5
B
Big O notation, 9
Binary, 13
binary point, 15
bit, 13, 15
Binary search algorithm, 115
Boundary conditions, 285
Bracketing, 115
C
Cholesky decomposition, 46
Closed method, 131
Confidence interval (CI), 107
Convergence, 32
convergence rate, 33
D
Digit
least-significant digit, 14
most-significant digit, 14
Divergence, 33
Divided differences, table of, 89
E
Error, 6
absolute error, 6
implementation error, 4
measurement error, 4
model error, 3
Mx = b systems, 56
relative error, 7
simulation error, 5
sum of square errors, 97
Exponent, 15
Extrapolation, 77, 109
G
Gaussian elimination, 41
Gauss-Seidel method, 54
Gradient, 172
I
Interpolation, 77, 78
Iteration, 31
halting conditions, 34
J
Jacobi method, 50
L
Linear regression, 77, 97
simple linear regression, 97
Lorenz equations, 273
LUP decomposition. See PLU Decomposition
M
Maclaurin series, 68
Mantissa, 15
Matrix
2-norm, 57
cofactor, 59
condition number, 57
determinant, 60
eigenvalue, 57
eigenvalue (maximum), 62
eigenvector, 57
euclidean norm, 57
expansion by cofactors, 60
inverse, 59
Jacobian, 141
reciprocal, 51, 59
Maximization. See Optimization
Mesh Point, 252
Minimization. See Optimization
Model, 2
modelling cycle, 2
N
Newton-Cotes rules, 219
Number representation
binary, 15
decimal, 14
double, 22
fixed-point, 20
float, 22
floating-point, 20
problems with, 24
O
Open method, 131
Optimization, 158
Ordinary differential equation, 251
stiff ODE, 266
P
PLU decomposition, 41
Precision, 5
implied precision, 6
R
Radix point, 16
Root, 119
root finding, 119
S
Scientific notation, 14
in binary, 16
Significant digit, 8
T
Taylor series, 68
nth-order Taylor series approximation, 68
Transformation
for linear regression, 105
V
Variable
global variable, 315
local variable, 315–316
parameter, 315
return value, 315
Vector
Euclidean distance, 35
Euclidean norm, 57