
Numerical Methods and Modelling for Engineering

Richard Khoury
Lakehead University
Thunder Bay, ON, Canada

Douglas Wilhelm Harder
University of Waterloo
Waterloo, ON, Canada

ISBN 978-3-319-21175-6
ISBN 978-3-319-21176-3 (eBook)
DOI 10.1007/978-3-319-21176-3

Library of Congress Control Number: 2016933860

© Springer International Publishing Switzerland 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt
from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained
herein or for any errors or omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer International Publishing AG Switzerland

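Conventions

Throughout this textbook, the following conventions are used for functions and
variables:

Object: Scalar number (real or complex)
Format: Lowercase italics
Example: $x = 5$

Object: Vector
Format: Lowercase bold
Example: $\mathbf{v} = [x_0, x_1, \ldots, x_i, \ldots, x_{N-1}]$

Object: Matrix
Format: Uppercase bold
Example:
\[
\mathbf{M} =
\begin{bmatrix}
x_{0,0} & x_{0,1} & \cdots & x_{0,j} & \cdots & x_{0,N-1} \\
x_{1,0} & x_{1,1} & \cdots & x_{1,j} & \cdots & x_{1,N-1} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{i,0} & x_{i,1} & \cdots & x_{i,j} & \cdots & x_{i,N-1} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{M-1,0} & x_{M-1,1} & \cdots & x_{M-1,j} & \cdots & x_{M-1,N-1}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{v}_0 \\ \mathbf{v}_1 \\ \vdots \\ \mathbf{v}_i \\ \vdots \\ \mathbf{v}_{M-1}
\end{bmatrix}
\]

Object: Scalar-valued function of scalar
Format: Lowercase italics with lowercase italics
Example: $f(x) = 5x + 2$

Object: Scalar-valued function of vector
Format: Lowercase italics with lowercase bold
Example: $f(\mathbf{v}) = f(x_0, x_1, \ldots, x_i, \ldots, x_{N-1}) = 5x_0 + 2x_1 + \cdots + 7x_i + \cdots + 4x_{N-1} + 3$

Object: Vector-valued function of scalar
Format: Lowercase bold with lowercase italic
Example: $\mathbf{f}(x) = [f_0(x), f_1(x), \ldots, f_i(x), \ldots, f_{N-1}(x)]$

Object: Vector-valued function of vector
Format: Lowercase bold with lowercase bold
Example: $\mathbf{f}(\mathbf{v}) = [f_0(\mathbf{v}), f_1(\mathbf{v}), \ldots, f_i(\mathbf{v}), \ldots, f_{N-1}(\mathbf{v})]$

Object: Matrix-valued function of scalar
Format: Uppercase bold with lowercase italics
Example:
\[
\mathbf{M}(x) =
\begin{bmatrix}
f_{0,0}(x) & f_{0,1}(x) & \cdots & f_{0,j}(x) & \cdots & f_{0,N-1}(x) \\
f_{1,0}(x) & f_{1,1}(x) & \cdots & f_{1,j}(x) & \cdots & f_{1,N-1}(x) \\
\vdots & \vdots & & \vdots & & \vdots \\
f_{i,0}(x) & f_{i,1}(x) & \cdots & f_{i,j}(x) & \cdots & f_{i,N-1}(x) \\
\vdots & \vdots & & \vdots & & \vdots \\
f_{M-1,0}(x) & f_{M-1,1}(x) & \cdots & f_{M-1,j}(x) & \cdots & f_{M-1,N-1}(x)
\end{bmatrix}
\]

Object: Matrix-valued function of vector
Format: Uppercase bold with lowercase bold
Example:
\[
\mathbf{M}(\mathbf{v}) =
\begin{bmatrix}
f_{0,0}(\mathbf{v}) & f_{0,1}(\mathbf{v}) & \cdots & f_{0,j}(\mathbf{v}) & \cdots & f_{0,N-1}(\mathbf{v}) \\
f_{1,0}(\mathbf{v}) & f_{1,1}(\mathbf{v}) & \cdots & f_{1,j}(\mathbf{v}) & \cdots & f_{1,N-1}(\mathbf{v}) \\
\vdots & \vdots & & \vdots & & \vdots \\
f_{i,0}(\mathbf{v}) & f_{i,1}(\mathbf{v}) & \cdots & f_{i,j}(\mathbf{v}) & \cdots & f_{i,N-1}(\mathbf{v}) \\
\vdots & \vdots & & \vdots & & \vdots \\
f_{M-1,0}(\mathbf{v}) & f_{M-1,1}(\mathbf{v}) & \cdots & f_{M-1,j}(\mathbf{v}) & \cdots & f_{M-1,N-1}(\mathbf{v})
\end{bmatrix}
\]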

Acknowledgements

Thanks to the following for pointing out mistakes, providing suggestions, or
helping to improve the quality of this text:

Khadijeh Bayat
Dan Busuioc
Tim Kuo
Abbas Attarwala
Prashant Khanduri
Matthew Chan
Christopher Olekas
Jaroslaw Kuszczak
Chen He
Hans Johannes Petrus Vanleeuwen
David Smith
Jeff Teng
Roman Kogan
Mohamed Oussama Damen
Rudko Volodymyr
Vladimir Rutko
George Rizkalla
Alexandre James
Scott Klassen
Brad Murray
Brendan Boese
Aaron MacLennan


Contents

1 Modelling and Errors
  1.1 Introduction
  1.2 Simulation and Approximation
  1.3 Error Analysis
    1.3.1 Precision and Accuracy
    1.3.2 Absolute and Relative Error
    1.3.3 Significant Digits
    1.3.4 Big O Notation
  1.4 Summary
  1.5 Exercises

2 Numerical Representation
  2.1 Introduction
  2.2 Decimal and Binary Numbers
    2.2.1 Decimal Numbers
    2.2.2 Binary Numbers
    2.2.3 Base Conversions
  2.3 Number Representation
    2.3.1 Fixed-Point Representation
    2.3.2 Floating-Point Representation
    2.3.3 Double-Precision Floating-Point Representation
  2.4 Limitations of Modern Computers
    2.4.1 Underflow and Overflow
    2.4.2 Subtractive Cancellation
    2.4.3 Non-associativity of Addition
  2.5 Summary
  2.6 Exercises

3 Iteration
  3.1 Introduction
  3.2 Iteration and Convergence
  3.3 Halting Conditions
  3.4 Summary
  3.5 Exercises

4 Linear Algebra
  4.1 Introduction
  4.2 PLU Decomposition
  4.3 Cholesky Decomposition
  4.4 Jacobi Method
  4.5 Gauss-Seidel Method
  4.6 Error Analysis
    4.6.1 Reciprocal Matrix
    4.6.2 Maximum Eigenvalue
  4.7 Summary
  4.8 Exercises

5 Taylor Series
  5.1 Introduction
  5.2 Taylor Series and nth-Order Approximation
  5.3 Error Analysis
  5.4 Modelling with the Taylor Series
  5.5 Summary
  5.6 Exercises

6 Interpolation, Regression, and Extrapolation
  6.1 Introduction
  6.2 Interpolation
  6.3 Vandermonde Method
    6.3.1 Univariate Polynomials
    6.3.2 Univariate General Functions
    6.3.3 Multidimensional Polynomial
  6.4 Lagrange Polynomials
  6.5 Newton Polynomials
  6.6 Interpolation Error Analysis
  6.7 Linear Regression
  6.8 Method of Least Squares
  6.9 Vandermonde Method
    6.9.1 Vandermonde Method for Multivariate Linear Regression
  6.10 Transformations
  6.11 Linear Regression Error Analysis
  6.12 Extrapolation
  6.13 Summary
  6.14 Exercises

7 Bracketing
  7.1 Introduction
  7.2 Binary Search Algorithm
  7.3 Advantages and Limitations
  7.4 Summary of the Five Tools

8 Root-Finding
  8.1 Introduction
  8.2 Bisection Method
  8.3 False Position Method
    8.3.1 Error Analysis
    8.3.2 Nonlinear Functions
  8.4 Closed and Open Methods
  8.5 Simple Fixed-Point Iteration
  8.6 Newton's Method
    8.6.1 One-Dimensional Newton's Method
    8.6.2 Multidimensional Newton's Method
  8.7 Secant Method
  8.8 Muller's Method
  8.9 Engineering Applications
  8.10 Summary
  8.11 Exercises

9 Optimization
  9.1 Introduction
  9.2 Golden-Mean Search
  9.3 Newton's Method
  9.4 Quadratic Optimization
  9.5 Gradient Descent
  9.6 Stochastic Optimization
  9.7 Random Brute-Force Optimization
  9.8 Simulated Annealing
  9.9 Engineering Applications
  9.10 Summary
  9.11 Exercises

10 Differentiation
  10.1 Introduction
  10.2 Centered Divided-Difference Formulae
  10.3 Forward and Backward Divided-Difference Formulae
  10.4 Richardson Extrapolation
  10.5 Second Derivatives
  10.6 Unevenly Spaced Measurements
  10.7 Inaccurate Measurements
  10.8 Engineering Applications
  10.9 Summary
  10.10 Exercises

11 Integration
  11.1 Introduction
  11.2 Trapezoid Rule
    11.2.1 Single Segment
    11.2.2 Composite Trapezoid Rule
  11.3 Romberg Integration Rule
  11.4 Simpson's Rules
    11.4.1 Simpson's 1/3 Rules
    11.4.2 Simpson's 3/8 Rule
  11.5 Gaussian Quadrature
  11.6 Engineering Applications
  11.7 Summary
  11.8 Exercises

12 Initial Value Problems
  12.1 Introduction
  12.2 Euler's Method
  12.3 Heun's Method
  12.4 Fourth-Order Runge-Kutta Method
  12.5 Backward Euler's Method
  12.6 Systems of IVPs
  12.7 Higher-Order ODEs
  12.8 Engineering Applications
  12.9 Summary
  12.10 Exercises

13 Boundary Value Problems
  13.1 Introduction
  13.2 Shooting Method
  13.3 Finite Difference Method
    13.3.1 One-Dimensional Functions
    13.3.2 Multidimensional Functions
  13.4 Engineering Applications
  13.5 Summary
  13.6 Exercises

Appendix A: Code and Pseudocode
Appendix B: Answers to Exercises
References
Index

List of Figures

Fig. 1.1 The modelling loop of reality to engineering approximation

Fig. 2.1 C code generating the double special values
Fig. 2.2 C++ code suffering from non-associativity

Fig. 3.1 Pseudo-code of an iterative software

Fig. 4.1 Example electrical circuit
Fig. 4.2 Pseudo-code of the PLU decomposition
Fig. 4.3 Pseudo-code of the Cholesky decomposition
Fig. 4.4 Pseudo-code of the Jacobi method
Fig. 4.5 Pseudo-code of the Gauss-Seidel method
Fig. 4.6 Pseudo-code of the maximum eigenvalue algorithm

Fig. 6.1 Left: A set of exact measurement points in 2D and the interpolated mathematical function (solid line) and extrapolated function (dashed line). Right: The same points as inexact measurements in 2D and the regressed mathematical function (solid line) and extrapolated function (dashed line)
Fig. 6.2 Pseudocode of Lagrange polynomial
Fig. 6.3 Pseudocode of Newton polynomial
Fig. 6.4 Three points on a polynomial (blue) and the interpolated parabola (red)
Fig. 6.5 20 measurements with errors of a linear system
Fig. 6.6 Three approximations of the set of points
Fig. 6.7 Linear regression on the x- and y-axes, with the probability of the measurements on top
Fig. 6.8 Comparison of interpolation and extrapolation of a system

Fig. 7.1 Pseudocode of the binary search algorithm

Fig. 8.1 A simple diode circuit
Fig. 8.2 Pseudocode of the bisection method
Fig. 8.3 The interpolating linear polynomial and its root
Fig. 8.4 Pseudocode of the false position method
Fig. 8.5 One iteration of the false position method
Fig. 8.6 A highly nonlinear function
Fig. 8.7 Pseudocode of the SPFI method
Fig. 8.8 The convergence of Example 8.4
Fig. 8.9 The divergence of Eq. (8.18)
Fig. 8.10 One iteration of Newton's method
Fig. 8.11 Pseudocode of Newton's method
Fig. 8.12 Divergence because the first derivative is near zero
Fig. 8.13 Divergence because the second derivative is high
Fig. 8.14 Pseudocode of the multidimensional Newton's method
Fig. 8.15 Pseudocode of the secant method
Fig. 8.16 Convergence and divergence of the secant method
Fig. 8.17 Approximating the root using a degree 1 (left) and degree 2 (right) interpolation of a function
Fig. 8.18 Horizontal shift of the parabola f(x) to f(x + 3)
Fig. 8.19 Pseudocode of Muller's method

Fig. 9.1 Radius and cost of fuel tanks
Fig. 9.2 Two functions with points (1, 4), (2, 1), and (3, 3), with a minimum in the [1, 2] interval (left) and in the [2, 3] interval (right)
Fig. 9.3 Two functions with points (1, 4), (1.66, 0.95), (2.33, 1.6), and (3, 3), with a minimum in the [1, 1.66] interval (left) and in the [1.66, 2.33] interval (right)
Fig. 9.4 Pseudocode of the golden-mean method
Fig. 9.5 Pseudocode of Newton's method
Fig. 9.6 The optimum of a function (solid blue line) and an interpolated parabola (dashed red line)
Fig. 9.7 Pseudocode of the quadratic optimization method
Fig. 9.8 Pseudocode of the gradient descent method
Fig. 9.9 Local and global maxima and minima
Fig. 9.10 Pseudocode of the simulated annealing method

Fig. 10.1 Jerk (top-right), acceleration (top-left), speed (bottom-left), and position (bottom-right) with respect to time, for a robot at constant acceleration of 5 m/s²
Fig. 10.2 Pseudocode of Richardson extrapolation
Fig. 10.3 Position (left) and speed (right) using exact values (blue) and noisy values (red)

Fig. 11.1 Depth map of a cross-section of a river
Fig. 11.2 Two measurements at x0 and x1 (left). A fictional trapezoid, the area of which approximates the integral of the function from x0 to x1 (right)
Fig. 11.3 Integration error for the example of Fig. 11.2
Fig. 11.4 Trapezoid approximation of the integral of Fig. 11.2 with two points and one segment (left), with three points and two segments (center), and with four points and three segments (right)
Fig. 11.5 Pseudocode of the composite trapezoid rule
Fig. 11.6 Pseudocode of the Romberg integration rule
Fig. 11.7 An integration interval divided into five segments
Fig. 11.8 Pseudocode of the Simpson's rules algorithm
Fig. 11.9 An open single-segment trapezoid approximation

Fig. 12.1 A sample RC circuit
Fig. 12.2 Pseudocode of Euler's method
Fig. 12.3 Euler's method underestimating a convex function's values
Fig. 12.4 Euler's method overestimating a convex function's values
Fig. 12.5 Heun's method averaging the Euler's method approximations
Fig. 12.6 Pseudocode of Heun's method
Fig. 12.7 Top-left: K0, aka Euler's method, used to compute K1. Top-right: K1 used to compute K2. Middle-left: K2 used to compute K3. Middle-right: K0 to K3 used to compute the next point in the fourth-order Runge-Kutta method. Bottom-left: Euler's method used to compute the next point. Bottom-right: Heun's method used to compute the next point
Fig. 12.8 Pseudocode of the fourth-order Runge-Kutta method
Fig. 12.9 Pseudocode of the backward Euler's method
Fig. 12.10 Pseudocode of Euler's method for a system of IVP

Fig. 13.1 A sample circuit
Fig. 13.2 Pseudocode of the shooting method
Fig. 13.3 Pseudocode of the finite difference method
Fig. 13.4 A visualization of a two-dimensional BVP

Fig. A.1 Pseudocode of the square program
Fig. A.2 Code of the square program in C++ (top), Matlab (middle), and Python (bottom)
Fig. A.3 Pseudocode using an IF control statement
Fig. A.4 IF control statement implemented in C++ (top), Matlab (middle), and Python (bottom)
Fig. A.5 Pseudocode using a WHILE control statement
Fig. A.6 WHILE control statement implemented in C++ (top), Matlab (middle), and Python (bottom)
Fig. A.7 Pseudocode using the CONTINUE and BREAK control statements
Fig. A.8 Pseudocode calling a function
Fig. A.9 Functions implemented in C++ (top), Matlab (middle), and Python (bottom)

List of Tables

Table 1.1 Sample functions and big O values

Table 2.1 Binary, hexadecimal, and decimal number conversions

Table 6.1 Sample table of divided differences
Table 6.2 Sample standard deviation and confidence interval

Table 8.1 Summary of root-finding methods

Table 9.1 Sample iterations using 2/3
Table 9.2 Sample iterations using 0.6180
Table 9.3 Summary of optimization methods

Table 10.1 Robot speed given exact position measurements
Table 10.2 Robot speed given noisy position measurements
Table 10.3 Summary of derivative methods

Table 11.1 Points and weights for the Gaussian quadrature method with different number of points
Table 11.2 Summary of integration methods

Table 12.1 Summary of IVP methods

Table 13.1 Summary of BVP methods

Chapter 1
Modelling and Errors

1.1 Introduction

As an engineer, be it in the public or private sector, working for an employer or as
an independent contractor, your job will basically boil down to this: you need to
solve the problem you are given in the most efficient manner possible. If your
solution is less efficient than another, you will ultimately have to pay a price for this
inefficiency. This price may take many forms. It could be financial, in the form of
unnecessary expenses. It could take the form of less competitive products and a
reduced market share. It could be the added workload due to problems stemming
from the inefficiencies of your work. It could be an intangible but very real injury to
your professional reputation, which will be tarnished by being associated with
inefficient work. Or it could take many other forms, all negative to you.
How can you make sure that your proposed solutions are as efficient as possible?
By representing the problem you are working on with an appropriate mathematical
model and then solving this model to find the optimal solution while being mindful
of the numerical errors that will necessarily crop up. Numerical methods, the
algorithms presented in this textbook, are the tools you can use to this end.
Numerical methods are a set of mathematical modelling tools. Each method
allows you to solve a specific type of problem: a root-finding problem, an optimization problem, an integral or derivative problem, an initial value problem, or a
boundary value problem. Once you have developed a proper model and understanding of the problem you are working on, you can break it down into a set of
these problems and apply the appropriate numerical method. Each numerical
method encompasses a set of algorithms to solve the mathematical problem it
models given some information and to a known error bound. This will be an
important point throughout the book: none of the algorithms that will be shown
can allow you to find the exact perfect solution to the problems, only approximate
solutions with known error ranges. Completely eliminating the errors is impossible;
rather, as an engineer, it is your responsibility to design systems that are tolerant of
errors. In that mindset, being able to correctly measure the errors is an important
advantage.

Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_1

1.2 Simulation and Approximation

Before we can meaningfully talk about solving a model and measuring errors, we
must understand the modelling process and the sources of error.
Engineers and scientists study and work in the physical world. However, exactly
measuring and tracking every value of every variable in the natural world, and its
complete effect on nature, is an impossible task. Consequently, all
engineers and scientists work on different models of the physical world, which
track every variable and natural phenomenon we need to be aware of for our given
tasks to a level of accuracy we require for our work. This implies that different
professionals will work using different models; for instance, while an astrophysicist
studying the movement of galaxies and a quantum physicist studying the collision
of subatomic particles are studying the same physical world, they use completely
different models of it in their work.
Selecting the proper model for a project is the first step of the modelling cycle
shown in Fig. 1.1. All models stem in some way from the real physical world we
live in and are meant to represent some aspect of it. Once a proper model has been
selected, an implementation of it has to be made. Today, this is synonymous with
writing a software version of the model to run on a computer, but in the past a
simpler implementation approach was used, which consisted of writing down all
necessary equations on paper or on a blackboard. Whichever implementation
method is used, it will include variables that are placeholders for real-world values.
Consequently, the next step is to look to the physical world again and take
measurements of these values to fill in the variables. Finally, we are ready for the
final step, simulation. At this step, we execute the implementation of the model with
the measured values to get an output result. Whether this execution is done by a
computer running software or an engineer diligently computing values on their
slide rule is dependent on the implementation method chosen, but that is of little
consequence. In either case, the final result will be an approximation of part of
reality that the model was meant to represent.

Fig. 1.1 The modelling loop of reality to engineering approximation

One thing that is evident from Fig. 1.1, and should also be understood from our
previous explanation, is that the approximation obtained at the end of the modelling
cycle is not a 100 % perfect reflection of reality. This is unavoidable: every step of
the modelling cycle includes errors, and so the final result cannot possibly be
perfect. But more fundamentally, engineers do not seek to have a perfect model
of the world; they seek to have a model that is good enough to complete a project
correctly and efficiently. Correctness is not synonymous with perfection. For
example, Newtonian physics is not a perfect model of the universe, but it is correct
enough to have been the foundation of countless successful systems over the
centuries and to still be commonly used today despite its limitations being well
known now. Likewise, efficiency can stem from a simpler (and less accurate)
model. Relativistic physics is unarguably a more accurate model of the universe
than Newtonian physics, but trying to apply it when designing a factory or a road
will only lead to an inefficient waste of effort and time (and thus money) to model
undetectable relativistic effects.
Nonetheless, not all approximations are equal. Again referring to Fig. 1.1, it can
be seen that some approximations will greatly overlap with reality, while others will
have little if anything in common with reality. While an engineer does not seek
perfect and complete overlap between the approximation and reality, there must be
overlap with the parts of reality that matter, the ones that affect the project. To build
a factory, the approximation of the weight that can be supported by load-bearing
structures must be an accurate representation of reality, and the approximation of
the weight expected to be put on that structure must be an accurate prediction of the
weight of upper floors, of machinery and people on these floors, and of snow
accumulations on the roof in winter. The difference between a good and a bad
approximation is the difference between a successful project and a disastrous one. It
is critically important that an engineer always be aware of the errors that creep into
the modelling cycle and of the resulting inaccuracies in the approximation, to ensure
that the approximation is still valid and the project is successful.
It was noted earlier that one reason the approximation will always differ
from reality is the errors introduced in each step of the modelling cycle.
There are four steps in this cycle: model selection, implementation, measurement,
and simulation. There are likewise four possible types of errors: model errors,
implementation errors, measurement errors, and simulation errors.
Model errors come from selecting an inappropriate model for a given application. An inappropriate model is one that does not include all the aspects of the
physical world that will influence the project being done. Designing docks using a
model of the shoreline that does not include tides and a skyscraper using a model of
the city that does not include winds are examples of model errors. One of the most
famous examples of model errors occurred in 1964, when Arno Penzias and Robert
Wilson were experimenting with a supersensitive antenna. After filtering out all
sources of interference accounted for in their model, they found that there was still a
low steady noise being detected by their receiver, and that it was 100 times more
intense than what the model had predicted they would find. This noise, it was later
found, was the cosmic microwave background radiation of the universe left over
from the Big Bang, which their model did not include and which, as a result, threw
off their entire predictions. But to be fair, the Big Bang was still a recently proposed
hypothesis at that time, and one that a large portion of the scientific community did
not yet accept, and so we cannot fault Penzias and Wilson for selecting a model that
did not include it. In fact, their accidental discovery earned them a Nobel Prize in
1978. Most engineering errors do not have such positive outcomes, however.
Implementation errors occur when the software code representing the model in a
computer is poorly built. This can be the result of algorithmic errors, of passing
values in the wrong order through a software interface, of legacy code being used in
a way it was never meant for, and of many more potential issues. These errors are
usually detected and corrected through proper software quality assurance (SQA)
methods within the project, and consequently SQA is an important component of
software engineering good practice. Implementation errors have notably plagued
space exploration agencies worldwide and are blamed for some of the most famous
space disasters. The explosion of the European Ariane 5 rocket shortly after takeoff
in 1996 was due to a 64-bit value in a new inertial reference system being passed
into a 16-bit value in a legacy control system. When the value exceeded the 16-bit
limit, the control system failed and the rocket shot off course and destroyed itself
and its cargo, a loss of some $500 million. Likewise, NASA lost its $328 million
Mars Climate Orbiter in 1999 because of poorly documented software components.
The Orbiter's instruments took measurements in imperial units and relayed them to
the control software, which was designed to handle metric units. Proper documentation of these two modules to explicitly name the units used by each, along with
proper code review, would have caught this problem easily; instead, it went
unnoticed until the Orbiter crashed into Mars.
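The kind of failure described for Ariane 5 can be illustrated with a minimal Python sketch. This is not the actual flight software; the function names and the sample value 40000 are assumptions chosen only to show what happens when a large value is forced into a 16-bit signed integer, and how a simple range check would expose the problem.

```python
import ctypes

INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_unchecked(x: float) -> int:
    """Force a value into 16 signed bits with no range check (hypothetical helper)."""
    # ctypes integer types do not check for overflow: the value silently wraps.
    return ctypes.c_int16(int(x)).value


def to_int16_checked(x: float) -> int:
    """Same conversion, but refuse values that do not fit in 16 signed bits."""
    n = int(x)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in a 16-bit signed integer")
    return n


if __name__ == "__main__":
    print(to_int16_unchecked(40000.0))  # -25536: a plausible-looking but wrong value
    try:
        to_int16_checked(40000.0)
    except OverflowError as exc:
        print("caught:", exc)
```

The unchecked version returns a number that looks legitimate, which is exactly why such errors are hard to spot without explicit checks and code review.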
Once an appropriate model has been selected and correctly implemented, it is
necessary to fill in unknown variables with values measured from the real world to
represent the current problem. Measurement errors occur at this stage, when the
measurements are inaccurate. In a sense, measurement errors will always occur;
while one can avoid model errors through proper research and implementation
errors through a proper SQA process, measuring tools will always have limited
precision and consequently the measurements themselves will always have some
errors. However, care can still be taken in a number of ways: by including error
bounds on the measures rather than treating them as exact values, by running the
computations on worst-case scenarios in addition to more likely average scenarios,
by designing in safety buffers, and of course by making sure the measurements are
taken properly in the first place. This was not the case in 1979 for a Finnish team
tasked with building a lighthouse on Market Island. This island is on the border between
Sweden and Finland and had been neatly divided by a treaty between the
two nations 170 years before. The team was to build the new lighthouse on the Finnish side of the island, but because of improper geographical
measurements, they built it on the Swedish side accidentally. Rectifying the situation
after construction required reopening the century-old treaty between the two
nations to negotiate new borders that remained fair for territory size, coast lines,
fishery claims, and more. And while the two nations resolved the issue peacefully,
to their credit, accidentally causing an international incident is not a line any team
leader would want to include on their resume.
Even once a proper model has been selected, correctly implemented and populated with accurate measurements, errors can still occur. These final errors are
simulation errors that are due to the accumulation of inaccuracies over the execution of the simulation. To understand the origin of these errors, one must remember
that a simulation on a computer tries to represent reality, a continuous-valued and
infinite world, with a discrete set of finite values, and then predicts what will happen
next in this world using approximation algorithms. Errors are inherent and unavoidable in this process. Moreover, while the error on an individual value or algorithm
may seem so small as to be negligible, these errors accumulate with each other. An
individual value may have a small error, but then is used in an algorithm with its
own small error and the result has the error of both. When a proper simulation uses
dozens of values and runs algorithms hundreds of times, the errors can accumulate
to very significant values. For example, in 1991, the Sleipner A oil platform under
construction in Norway collapsed because of simulation errors. The problem could
be traced back to the approximation in a finite element function in the model; while
small, this error then accumulated throughout the simulation so that by the end the
stress predicted on the structure by the model was 47 % less than reality. Consequently, the concrete frame of the oil platform was designed much too weak, sprung
a leak after it was submerged, and caused the entire platform to sink to
the bottom of a fjord. The shock of the platform hitting the bottom of the fjord
caused a seismic event of 3.0 on the Richter scale and about $700 million in damages.
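The accumulation of small errors can be observed directly on a computer. The sketch below is only an illustration under simple assumptions (the function name and the choice of the value 0.1 are ours): it adds the floating-point value 0.1 repeatedly and compares the result with the exact sum computed using rational arithmetic.

```python
from fractions import Fraction


def accumulated_error(count: int) -> float:
    """Absolute error after adding the float 0.1 to a running total `count` times."""
    total = 0.0
    for _ in range(count):
        total += 0.1  # each addition carries a tiny rounding error
    exact = Fraction(count, 10)  # the exact value of count * 0.1
    return float(abs(Fraction(total) - exact))


if __name__ == "__main__":
    for count in (10, 1_000, 1_000_000):
        print(count, accumulated_error(count))
```

Each individual rounding error is negligible, but the printed error grows as more additions are performed, which is the behaviour described above.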
This book focuses on simulation errors. Throughout the work, it will present not
only algorithms to build simulations and model reality but also their error values, in order
to account for simulation errors in engineering work.

1.3 Error Analysis

1.3.1 Precision and Accuracy

Before talking about errors, it is necessary to lay down some formal vocabulary.
The first are the notions of precision and accuracy, two words that are often used
interchangeably by laypeople. In engineering, these words have different, if related,
meanings. Precision refers to the number of digits an approximation uses to
represent a real value, while accuracy refers to how close to the real value the
approximation is.

An example can help clarify these notions. Imagine a car with two speedometers,
an analogue one and a digital one. The digital one indicates the car's speed at every
0.1 km/h, while the analogue one only indicates it at every 1 km/h. When running
an experiment and driving the car at a constant 100 km/h, it is observed that the
digital speedometer fluctuates from 96.5 to 104.4 km/h, while the analogue one
only fluctuates from 99 to 101 km/h. In this example, the digital speedometer is
more precise, as it indicates the speed with one more digit than the analogue one,
but the analogue speedometer is more accurate, as it is closer to the real value of
100 km/h than the digital one.
While precision and accuracy measure two different and independent aspects of
our values, in practice it makes sense to use precision to reflect accuracy. Adding
digits of precision that cannot be accurately measured simply adds
noise to our values. This was the case in the previous example, with the
digital speedometer showing a precision of 0.1 km/h when it couldn't accurately
measure the speed to better than 3 or 4 km/h. On the other hand, if a value can be
accurately measured to a great precision, then these digits should be included. If the
car's speed is accurately measured to 102.44 km/h, then reporting it to a lesser
precision at 102 km/h not only discards useful information, it actually reduces
accuracy by rounding known figures.
Consequently, the accuracy of a measure is usually a function of the last digit of
precision. When a speedometer indicates the car's speed to 0.1 km/h, it implies that
it can accurately measure its speed to that precision. In fact, given no other
information except a value, it is implied that the accuracy is half the last digit of
precision. For example, a car measured as going at 102.3 km/h is implied to have
been accurately measured to within 0.05 km/h to get that precision. This accuracy is
called the implied precision of the measure. In our example, this means that the real
speed of the car is somewhere in the range from 102.25 to 102.35 km/h and cannot
be obtained any more accurately than that.
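The implied precision rule can be sketched in a few lines of Python. The function name `implied_interval` is hypothetical, and the reading is passed as a string so that the number of digits actually written is preserved (a float would lose that information).

```python
from decimal import Decimal


def implied_interval(reading: str) -> tuple[Decimal, Decimal]:
    """Return the range of real values implied by a written reading."""
    value = Decimal(reading)
    # Exponent of the last written digit, e.g. -1 for "102.3" and 0 for "102".
    last_digit_exponent = value.as_tuple().exponent
    # The implied precision is half a unit of the last written digit.
    half_unit = Decimal(1).scaleb(last_digit_exponent) / 2
    return value - half_unit, value + half_unit


if __name__ == "__main__":
    print(implied_interval("102.3"))  # (Decimal('102.25'), Decimal('102.35'))
    print(implied_interval("102"))    # (Decimal('101.5'), Decimal('102.5'))
```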

1.3.2 Absolute and Relative Error

The next important term to introduce is that of error. The error is the value of the
inaccuracy on a measure. If it is given with the same units as the measure itself, then
it is an absolute error. More formally, given a real measure and an approximation,
the absolute error is the difference between the approximation and the real value:
$$E_{\text{abs}} = |\text{approximation} - \text{value}| \tag{1.1}$$

It can be realized at this point that the implied precision introduced in the previous
subsection is also a measure of absolute error. Absolute error has the benefit of
being immediately clear and related to the measure being evaluated. However, it is
also inherently vague when it comes to determining if that measure is accurate or
not. Given a distance with an absolute error of 3 m, one can get an immediate sense
of the precision that was used to measure it and of how far apart the two objects
might be, but is this accurate enough? The answer is that it depends on the
magnitude of the distance being measured. An absolute error of 3 m is incredibly
accurate when measuring the thousands of metres of distance between two cities,
but incredibly inaccurate when measuring the fraction of a metre distance between
your thumb and index finger. The notion of relative error, or absolute error as a
ratio of the value being measured, introduces this difference:

$$E_{\text{rel}} = \frac{|\text{approximation} - \text{value}|}{|\text{value}|} \tag{1.2}$$

Unlike absolute error, which is given in the same units as the value being measured,
relative error is given as a percentage of the measured value.
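Equations (1.1) and (1.2) translate directly into code. The following is a minimal sketch; the function names are our own, and the sample values are taken from the speedometer discussion (a reading of 102 km/h against a possible real speed of 101.5 km/h).

```python
def absolute_error(approximation: float, value: float) -> float:
    """Eq. (1.1): absolute error, in the same units as the value."""
    return abs(approximation - value)


def relative_error(approximation: float, value: float) -> float:
    """Eq. (1.2): relative error, as a fraction of the real value."""
    if value == 0:
        raise ValueError("relative error is undefined for a real value of zero")
    return abs(approximation - value) / abs(value)


if __name__ == "__main__":
    print(absolute_error(102.0, 101.5))        # 0.5 (km/h)
    print(100 * relative_error(102.0, 101.5))  # relative error as a percentage
```

Multiplying the relative error by 100 gives the percentage form used in the text.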
Example 1.1
What are the maximum and minimum resistances of a resistor labelled brown,
grey, brown, red?
Solution
Given the colour code, the resistor is 180 Ω with a tolerance of 2 %. In other
words, the resistance value is approximated as 180 Ω and the relative error on
this approximation is 2 %. Putting these values in the relative error formula
(1.2) to solve for the real value r:
$$E_{\text{rel}} = \frac{|180 - r|}{|r|} = 0.02 \;\Rightarrow\; r = \begin{cases} 176.5\ \Omega \\ 183.7\ \Omega \end{cases}$$
The resistor's minimum and maximum resistance values are 176.5 and
183.7 Ω, respectively, and the real resistance value is somewhere in that
range. It can be noted that the absolute error on the resistance value is about 3.6 Ω,
which is indeed 2 % of 180 Ω.
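The calculation in Example 1.1 can be checked numerically. The sketch below assumes a positive approximation and a relative error smaller than 1, in which case solving |a − r| / |r| = e for r gives r = a / (1 + e) and r = a / (1 − e); the function name is hypothetical.

```python
def bounds_from_relative_error(approximation: float, rel_err: float) -> tuple[float, float]:
    """Smallest and largest real values consistent with a given relative error."""
    if not 0 <= rel_err < 1 or approximation <= 0:
        raise ValueError("sketch assumes approximation > 0 and 0 <= rel_err < 1")
    lowest = approximation / (1 + rel_err)   # case r < approximation
    highest = approximation / (1 - rel_err)  # case r > approximation
    return lowest, highest


if __name__ == "__main__":
    low, high = bounds_from_relative_error(180.0, 0.02)
    print(round(low, 1), round(high, 1))  # 176.5 183.7
```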

Example 1.2
A car's speedometer indicates a current speed of 102 km/h. What is the
relative error on that measure?
Solution
The implied precision on the measure is half the last digit, or 0.5 km/h.
The real speed is in the interval from 101.5 to 102.5 km/h. The relative error is
computed from these two bounds:
$$\frac{|102 - 101.5|}{|101.5|} = 0.004926$$
$$\frac{|102 - 102.5|}{|102.5|} = 0.004878$$
Thus, the relative error is at most 0.4926 %.

1.3.3 Significant Digits

When trying to determine which of two approximations of a value is more accurate,
it is intuitive to compare each to the correct value digit by digit and pick the
approximation with the greatest number of digits in common with the value.
Given two approximations of the constant π, one at 3.142 and one at 3.1416, the
second would be intuitively preferred because it has four digits in common with the
real value of π, as opposed to three digits in the first one. The set of correct digits in
the approximation are called significant digits, and the intuitive idea of preferring
an approximation with more significant digits is entirely valid.
However, a simple count of significant digits is not always enough; operations
such as rounding decimals can cause approximations with fewer significant digits to
be better. For example, for the value 2.0000, the approximation 2.9999 is much
worse than the approximation 1.9999, despite the fact the former has one significant
digit and the latter has none.
A better alternative to counting decimals is to look at the order of magnitude of
the relative error of the approximations. This order of magnitude is proportional to
the number of significant digits, without being misled by the rollover of values due
to rounding. More formally, one would look for the integer value of n that satisfies
the following inequality on the order of magnitude of the relative error:
$$E_{\text{rel}} \le 0.5 \times 10^{-n} \tag{1.3}$$

This integer n is the actual number of significant digits that we are looking for.
Given multiple approximations for a value, the most accurate one is the one with the
highest value of n. Moreover, very bad approximations that yield a positive power
of 10 in Eq. (1.3) and therefore negative values of n are said to have no significant
digits.

Example 1.3
Given two approximations 2.9999 and 1.9999 for the real value 2.0000,
which has the greater number of significant digits?
Solution
First, compute the relative error of each approximation:
$$\frac{|2.9999 - 2.0000|}{|2.0000|} = 0.49995$$
$$\frac{|1.9999 - 2.0000|}{|2.0000|} = 0.00005$$
Next, find the maximum exponent n for the inequalities on the order of
magnitude in Eq. (1.3):
$$0.49995 \le 0.5 \times 10^{0}$$
$$0.00005 \le 0.5 \times 10^{-4}$$
This tells us that the approximation of 1.9999 has four significant digits, while
the approximation of 2.9999 has none. This is despite the fact that the value of
2.9999 has one digit in common with the real value of 2.0000 while 1.9999
has none. However, this result is in line with mathematical sense: 1.9999 is
only 0.0001 off from the correct value, while 2.9999 is off by 0.9999.
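The search for the largest n satisfying Eq. (1.3) can be sketched as follows. The function name is hypothetical, the inputs are passed as strings, and exact rational arithmetic is used so that boundary cases such as the 0.00005 of Example 1.3 are not disturbed by floating-point rounding. As an assumption of this sketch, approximations that would require a negative n are reported as having 0 significant digits, and an exact match returns None.

```python
from fractions import Fraction


def significant_digits(approximation: str, value: str):
    """Largest integer n with E_rel <= 0.5 * 10**(-n), following Eq. (1.3)."""
    approx = Fraction(approximation)
    real = Fraction(value)
    rel_err = abs(approx - real) / abs(real)
    if rel_err == 0:
        return None  # exact match: no finite n bounds the count
    bound = Fraction(1, 2)  # 0.5 * 10**0
    if rel_err > bound:
        return 0  # n would be negative: no significant digits
    n = 0
    # Tighten the bound by a factor of 10 for as long as the inequality holds.
    while rel_err <= bound / 10:
        bound /= 10
        n += 1
    return n


if __name__ == "__main__":
    print(significant_digits("2.9999", "2.0000"))  # 0
    print(significant_digits("1.9999", "2.0000"))  # 4
```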

1.3.4 Big O Notation

When it comes to measuring the error caused by mathematical algorithms, trying to
compute an exact value often proves impractical. The mathematical formula that is
implemented in the algorithm may be a summation of many (or an infinite number
of) terms, making exact computation difficult (or impossible). Moreover, constant
coefficients multiplying some terms of the mathematical formula may make a less-efficient algorithm appear more efficient in a special range of values, making the
results only valid in that special case rather than a general conclusion. In this case, it
is better to represent the error in general terms rather than try to compute an exact
value. The general form we will use in this book is the big O notation.
Big O is a function of a variable of the equation being studied and is a measure of
worst-case growth rate as the variable tends towards infinity or of decline rate as the
variable tends towards zero. It is very commonly used in software engineering to
measure the growth of time (computation) and space (memory) cost of software
algorithms as the input size increases towards infinity. In that context, an algorithm
with a smaller big O value is one whose time and space cost will increase more
slowly with input size and thus should be preferred to another algorithm with a
greater big O value. It is important to note again that this is a general rule and does
not account for special cases, such as specific input values for which an algorithm
with a greater big O value might outperform one with a smaller big O value.

Table 1.1 Sample functions and big O values

Function                               Big O growth rate    Big O decline rate
$f(x) = 6x^4 + 3x^3 + 17x^2 + 4x$      $O(x^4)$             $O(x)$
$f(x) = 42x^4 + 17x^2$                 $O(x^4)$             $O(x^2)$
$f(x) = 1050x^3$                       $O(x^3)$             $O(x^3)$
$f(x) = 8$                             $O(1)$               $O(1)$

The generalization power of big O in that case comes from the fact that, given a
mathematical sequence, it only keeps the term with the greatest growth rate,
discarding all other terms and the coefficient multiplying that term. Table 1.1
gives some examples of functions with their big O growth rates in the second
column. In all these functions, the term with the greatest growth rate is the one
with the greatest exponent. The first and second functions have the same big O
value despite the fact they would give very different results mathematically,
because they both have the same highest exponent, $x^4$, and both the constant
multiplying that term and all other terms are abstracted away. The third function
will clearly give a greater result than either of the first two for a large range of lower
values of x, but that is merely a special case due to the coefficient multiplying the $x^3$
term of that equation. Beyond that range in the general case, values of $x^4$ will be
greater than $x^3$ multiplied by a constant, and so the third function's $O(x^3)$ is considered lesser than $O(x^4)$. The fourth function is a constant; it returns the same value
regardless of the input value of x. Likewise, its big O value is a constant O(1). When
the goal is to select the function with the least growth rate, the one with the lowest
big O value is preferred.
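The "special range" mentioned above can be made concrete with a short Python sketch. It compares the third function of Table 1.1 with only the leading term, $42x^4$, of the second function; dropping the lower-order term is a simplification made here for illustration, and the function names are our own.

```python
def cubic(x: int) -> int:
    """Third function of Table 1.1: large coefficient, smaller exponent."""
    return 1050 * x**3


def quartic_leading_term(x: int) -> int:
    """Leading term of the second function of Table 1.1 (lower-order term dropped)."""
    return 42 * x**4


if __name__ == "__main__":
    for x in (1, 10, 25, 100, 1000):
        print(x, cubic(x), quartic_leading_term(x))
```

For small x the cubic is larger because of its coefficient; the two are equal at x = 25 (since 1050 / 42 = 25), and beyond that point the quartic term dominates, which is the general behaviour that big O captures.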
The mathematical formulae and algorithms used for modelling are also evaluated
against a variable to obtain their big O values. However, unlike their software
engineering counterparts, they are not measured against variable input sizes; their
inputs will always be the measured values of the model. Rather, the variable will be
the size of the simulation step meant to approximate the continuous nature of the
natural world. Whether the model simulates discrete steps in time, in space, in
frequency, in pressure, or in some other attribute of the physical world, the smaller
the step, the more natural the simulation will be. In this context, big O notation is
thus measuring a decline rate instead of a growth rate, and the value of x becomes
smaller and tends towards zero. In that case, the terms of the equation with greater
exponents will decline more quickly than those with lesser exponents. Big O
notation will thus estimate the worst-case decline rate by keeping the lowest
exponent term, discarding all other terms and the constant multiplying that term.
This yields the third column of Table 1.1, and the equation with the greatest big
O exponent, rather than the lowest one, will be preferred. That equation is the one
that will allow the error of the formula to decrease the fastest as the step size is
reduced.
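The decline rate can also be observed numerically. The two error formulas below are hypothetical and are not taken from any method in this book: one is O(h) and the other is O(h^2) with a much larger coefficient. Halving the step h and taking the base-2 logarithm of the error ratio gives an estimate of the exponent.

```python
import math


def error_first_order(h: float) -> float:
    """Hypothetical error formula whose lowest-exponent term is h, so O(h)."""
    return 3.0 * h + 5.0 * h**2


def error_second_order(h: float) -> float:
    """Hypothetical O(h^2) error formula with a large coefficient."""
    return 40.0 * h**2


def observed_order(error, h: float) -> float:
    """Estimate the big O exponent from the error at step h and at step h / 2."""
    return math.log2(error(h) / error(h / 2.0))


if __name__ == "__main__":
    for h in (1.0, 0.1, 0.01, 0.001):
        print(h, error_first_order(h), error_second_order(h))
    print(observed_order(error_first_order, 1e-4))   # close to 1
    print(observed_order(error_second_order, 1e-4))  # close to 2
```

At h = 1 the O(h) formula gives the smaller error because of the coefficients, but as h shrinks the O(h^2) formula wins, which is why the greatest exponent is preferred when measuring decline rates.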

Big O notation will be used in this book to measure both the convergence rate of
algorithms and their error rate. In fact, these two notions are interchangeable in this
context: an algorithm converges on a solution by reducing the error on its approximation of this solution, and the rate at which it converges is the same as the rate at
which it reduces the approximation error.

1.4 Summary

The main focus of this chapter has been to introduce and formally define several
notions related to error measurement. The chapter began by introducing the four
steps of the modelling cycle, namely, model selection, implementation, measurement, and simulation, along with the errors that can be introduced at each step. It
then defined the vocabulary of error measurement, precision, accuracy, and implied
precision. And finally it presented formal measures of error, namely, relative and
absolute error, significant digits, and big O notation.

1.5 Exercises

1. Your partner uses a ruler to measure the length of a pencil and states that the
length is 20.35232403 cm. What is your response to the given precision?
2. Given two approximations of the constant π, as 3.1417 and 3.1392838, which
has the greatest precision? Which has the greatest accuracy?
3. Which number has more precision and which has more accuracy as an approximation of e, 2.7182820135423 or 2.718281828?
4. The distance between two cities is given as approximately 332 mi. As
1 mi = 1.609344 km exactly, it follows that the distance is approximately
534.302208 km. Discuss this conversion with respect to precision and
accuracy.
5. What is approximately the absolute and relative error of 3.14 as an approximation of the constant π?
6. What are the absolute and relative errors of the approximation 22/7 of π? How
many significant digits does it have?
7. What are the absolute and relative errors of the approximation 355/113 of π?
How many significant digits does it have?
8. A resistor labelled as 240 Ω is actually measured at 243.32753 Ω. What are the
absolute and relative errors of the labelled value?
9. The voltage in a high-voltage transmission line is stated to be 2.4 MV while the
actual voltage may range from 2.1 to 2.7 MV. What is the maximum absolute
and relative error of voltage?
10. A capacitor is labelled as 100 mF, whereas it is actually measured to be
108.2532 mF. What are the absolute and relative errors of the label? To how
many significant digits does the label approximate the actual capacitance?
11. Of 3.1415 and 3.1416, which has more significant digits as an approximation of
the constant π?
12. What is the number of significant digits of the label 240 Ω when the correct
value is 243.32753 Ω?
13. To how many significant digits is the approximation 1.998532 when the actual
value is 2.001959?

Chapter 2

Numerical Representation

2.1 Introduction

The numerical system used in the Western World today is a place-value base-10
system inherited from India through the intermediary of Arabic trade; this is why
the numbers are often called Arabic numerals or more correctly Indo-Arabic
numerals. However, this is not the only numerical system possible. For centuries,
the Western World used the Roman system instead, which is a base-10 additive-value system (digits of a number are summed and subtracted from each other to get
the value represented), and that system is still in use today, notably in names and
titles. Other civilizations experimented with other bases: some precolonial
Australian cultures used a base-5 system, while base-20 systems arose independently in Africa and in pre-Columbian America, and the ancient Babylonians used a
base-60 counting system. Even today, despite the prevalence of the base-10 system,
systems in other bases continue to be used every day: degrees, minutes, and seconds
are counted in the base-60 system inherited from Babylonian astrologers, and base-12 is used to count hours in the day and months (or zodiacs) in the year.
When it comes to working with computers, it is easiest to handle a base-2 system
with only two digits, 0 and 1. The main advantage is that this two-value system can
be efficiently represented by an open or closed electrical circuit that measures 0 or
5 V, or in computer memory by an empty or charged capacitor, or in secondary
storage by an unmagnetized or magnetized area of a metal disc or an absorptive or
refractive portion of an optical disc. This base-2 system is called binary, and a
single binary digit is called a bit.
It should come as no surprise, however, that trying to model our infinite and
continuous real world using a computer which has finite digital storage, memory,
and processing capabilities will lead to the introduction of errors in our modelling.
These errors will be part of all computer results; no amount of technological
advancement or upgrading to the latest hardware will allow us to overcome them.
Nonetheless, no one would advocate for engineers to give up computers altogether!
It is only necessary to be aware of the errors that arise from using computers and to
account for them.
This chapter will look in detail at how binary mathematics works and how
computers represent and handle numbers. It will then present the weaknesses that
result from this representation and that, if ignored, can compromise the quality of
engineering work.

Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_2

2.2 Decimal and Binary Numbers

This section introduces binary numbers and arithmetic. Since we assume the reader
to be intimately familiar with decimal (base-10) numbers and arithmetic, we will
use that system as a bridge to binary.

2.2.1 Decimal Numbers

Our decimal system uses ten ordered digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 to represent
any number as a sequence:
$$d_n d_{n-1} d_{n-2} \ldots d_1 d_0 . d_{-1} d_{-2} \ldots \tag{2.1}$$
where every $d_i$ is a digit, n is an integer, and $d_n \neq 0$. The sequence is place valued,
meaning that the value of $d_i$ in the complete number is the value of its digit
multiplied by 10 to the power i. This means the value of the complete number
can be computed as
$$\sum_{i=-\infty}^{n} d_i \times 10^{i} \tag{2.2}$$
The digits $d_0$ and $d_{-1}$, the first digit multiplied by 10 to a negative power, are
separated by a point called the decimal point. The digit $d_n$, which has the greatest
value in the total number, is called the most significant digit, while the digit $d_i$ with
the lowest value of i and therefore the lowest contribution in the total number is
called the least significant digit.
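The place-value sum of Eq. (2.2) can be sketched in Python for a number with finitely many digits. The function name is hypothetical, the input is assumed to be a non-negative decimal string, and exact rational arithmetic is used so that negative powers of 10 introduce no rounding.

```python
from fractions import Fraction


def place_value(number: str) -> Fraction:
    """Evaluate a non-negative decimal string as the sum of d_i * 10**i."""
    integer_part, _, fractional_part = number.partition(".")
    n = len(integer_part) - 1  # index of the most significant digit
    total = Fraction(0)
    for offset, digit in enumerate(integer_part + fractional_part):
        i = n - offset  # place index: n, n-1, ..., 0, -1, -2, ...
        total += int(digit) * Fraction(10) ** i
    return total


if __name__ == "__main__":
    print(place_value("402.75"))  # 1611/4, which is 402.75
```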
It is often inconvenient to write numbers in the form of Eq. (2.1), especially when
modelling very large or very small quantities. For example, the distance from the
Earth to the Sun is 150,000,000,000 m, and the radius of an electron is
0.0000000000000028 m. For this reason, numbers are often represented in scientific
notation, where the non-zero part is kept, normally with one digit left of the decimal
point and a maximum of m digits on the right, and the long string of zeroes is
simplified using a multiplication by a power of 10. The number can then be written as

$$d_0 . d_{-1} d_{-2} \ldots d_{-m} \times 10^n \tag{2.3}$$

or equivalently but more commonly as

$$d_0 . d_{-1} d_{-2} \ldots d_{-m} \,\mathrm{e}\, n \tag{2.4}$$

The $m + 1$ digits that are kept are called the mantissa, the value $n$ is called the
exponent, and the letter e in Eq. (2.4) stands for the word exponent, and
normally $m < n$. Using scientific notation, the distance from the Earth to the Sun
is $1.5 \times 10^{11}$ m and the radius of the electron is $2.8 \times 10^{-15}$ m.
We can now define our two basic arithmetic operations. The rule to perform the
addition of two decimal numbers written in the form of Eq. (2.1) is to line up the
decimal points and add the digits at corresponding positions. If a digit is missing
from a position in one of the numbers, it is assumed to be zero. If two digits sum to
more than 9, the least significant digit is kept in that position and the most
significant digit carries and is added to the digits on the left. An addition of two
numbers in scientific notation is done first by writing the two numbers at the same
exponent value, then adding the two mantissas in the same way as before. To
multiply two decimal numbers written in the form of Eq. (2.1), multiply the first
number by each digit $d_i$ of the second number and multiply that partial result by $10^i$,
then sum the partial results together to get the total. Given two numbers in scientific
notation, multiply the two mantissas together using the same method, and add the
two exponents together.
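The two scientific-notation rules can be sketched in a few lines of C. This is only an illustration: the type `sci_t` and the function names are hypothetical, and the mantissa is held as an integer (so 3.25 is stored as 325 with exponent -2) to keep the arithmetic exact, which differs from the one-digit-before-the-point convention of Eq. (2.3).

```c
#include <assert.h>

/* A decimal number stored as mantissa * 10^exponent. The integer mantissa
   is an assumption of this sketch, made so the arithmetic stays exact. */
typedef struct {
    long long mantissa;
    int exponent;
} sci_t;

/* Rewrite x at a smaller exponent by padding its mantissa with zeros. */
static sci_t sci_rescale(sci_t x, int exponent)
{
    assert(exponent <= x.exponent);
    while (x.exponent > exponent) {
        x.mantissa *= 10;
        x.exponent--;
    }
    return x;
}

/* Addition: write both numbers at the same exponent, then add mantissas. */
sci_t sci_add(sci_t a, sci_t b)
{
    int e = a.exponent < b.exponent ? a.exponent : b.exponent;
    sci_t r = { sci_rescale(a, e).mantissa + sci_rescale(b, e).mantissa, e };
    return r;
}

/* Multiplication: multiply the mantissas and add the exponents. */
sci_t sci_mul(sci_t a, sci_t b)
{
    sci_t r = { a.mantissa * b.mantissa, a.exponent + b.exponent };
    return r;
}
```

With 3.25 written as {325, -2} and 18.7 as {187, -1}, `sci_add` yields {2195, -2}, that is 21.95, and `sci_mul` yields {60775, -3}, that is 60.775.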

2.2.2 Binary Numbers

A binary system uses only two ordered digits (or bits, short for binary digits),
0 and 1, to represent any number as a sequence:

$$b_n b_{n-1} b_{n-2} \ldots b_1 b_0 . b_{-1} b_{-2} \ldots \tag{2.5}$$

where every $b_i$ is a bit, $n$ is an integer, and $b_n \neq 0$. The sequence is place valued,
meaning that the value of $b_i$ in the complete number is the value of its digit
multiplied by 2 to the power $i$. This means the value of the complete number can
be computed as

$$\sum_{i=-\infty}^{n} b_i \times 2^i \tag{2.6}$$

The digits $b_0$ and $b_{-1}$, the first digit multiplied by 2 to a negative power, are
separated by a point; however, it would be wrong to call it a decimal point now
since this is not a decimal system. In binary it is called the binary point, and a more
general term for it independent of base is the radix point. We can define a binary
scientific notation as well, as

$$b_0 . b_{-1} b_{-2} \ldots b_{-m} \times 2^n \tag{2.7}$$

or equivalently as

$$b_0 . b_{-1} b_{-2} \ldots b_{-m} \,\mathrm{e}\, n \tag{2.8}$$

The reader can thus see clear parallels with Eqs. (2.1) to (2.4), which define
our decimal system. Likewise, the rules for addition and multiplication in
binary are the same as in decimal, except that digits carry whenever two 1s are
summed.
Since binary and decimal share the digits 0 and 1, it can be ambiguous
whether a given number is written in base 2 or base 10. When this distinction is
not clear from the context, it is habitual to suffix the numbers with a subscript of
their base. For example, the number 110 is ambiguous, but $110_{10}$ is one hundred and
ten in decimal, while $110_2$ is a binary number representing the number 6 in decimal.
It is not necessary to write that last value as $6_{10}$ since $6_2$ is nonsensical and no
ambiguity can exist.
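The same ambiguity shows up in software, where a string of digits means nothing until a base is supplied. As a minimal sketch, the standard C library function `strtol` takes the base as its third argument; the wrapper name `parse_in_base` is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Interpret a string of digits in the given base (2 to 36) using the
   standard library function strtol. */
long parse_in_base(const char *digits, int base)
{
    assert(base >= 2 && base <= 36);
    return strtol(digits, NULL, base);
}
```

Here `parse_in_base("110", 2)` returns 6 while `parse_in_base("110", 10)` returns 110, mirroring the distinction between $110_2$ and $110_{10}$.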
Example 2.1
Compute the addition and multiplication of 3.25 and 18.7 in decimal and of
1011.1 and 1.1101 in binary.
Solution
Following the first addition rule, line up the two numbers and sum the digits
as follows:

$$\begin{array}{r}
3.25 \\
+\ 18.7\phantom{0} \\ \hline
21.95
\end{array}$$

The second addition rule requires writing the numbers in scientific notation
with the same exponent. These two numbers in scientific notation are
$3.25 \times 10^0$ and $1.87 \times 10^1$, respectively. Writing them at the same exponent
would change the first one to $0.325 \times 10^1$. Then the sum of the mantissas gives

$$\begin{array}{r}
0.325 \\
+\ 1.87\phantom{0} \\ \hline
2.195
\end{array}$$

for a final total of $2.195 \times 10^1$, the same result as before.
The multiplication of two numbers in decimal using the first rule is done
by summing the result of partial multiplications as follows:

$$\begin{array}{r}
3.25 \\
\times\ 18.7\phantom{0} \\ \hline
2.275 \\
26\phantom{.000} \\
+\ 32.5\phantom{00} \\ \hline
60.775
\end{array}$$

To perform the operation using the second rule, write the numbers in scientific
notation as $3.25 \times 10^0$ and $1.87 \times 10^1$, respectively, then multiply the
mantissas as before:

$$\begin{array}{r}
3.25 \\
\times\ 1.87 \\ \hline
0.2275 \\
2.6\phantom{000} \\
+\ 3.25\phantom{00} \\ \hline
6.0775
\end{array}$$

and sum the exponents (0 + 1) to get a final result of $6.0775 \times 10^1$ as before.
Working in binary, the rules apply in the same way. The binary sum, using
the first rule, gives

$$\begin{array}{r}
1011.1\phantom{000} \\
+\ 1.1101 \\ \hline
1101.0101
\end{array}$$

Writing the numbers in scientific notation gives $1.0111 \times 2^3$ and $1.1101 \times 2^0$,
respectively. Using the second rule, the multiplication of the mantissas gives

$$\begin{array}{l}
\phantom{0}1.0111 \\
\times\ 1.1101 \\ \hline
\phantom{0}0.00010111 \\
\phantom{0}0 \\
\phantom{0}0.010111 \\
\phantom{0}0.10111 \\
\phantom{0}1.0111 \\ \hline
10.10011011
\end{array}$$

and the sum of the exponents gives 3, for a total of $10.10011011 \times 2^3$, better
written as $1.010011011 \times 2^4$.

2.2.3 Base Conversions

The conversion from binary to decimal can be done simply by computing the
summation from Eq. (2.6).
The conversion from decimal to binary is much more tedious. For a decimal
number $N$, it is necessary to find the largest power $k$ of 2 such that $2^k \le N$. Add that
power of 2 to the binary number and subtract it from the decimal number, and
continue the process until the decimal number has been reduced to 0. This will yield
the binary number as a summation of the form of Eq. (2.6).
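Example 2.2
Convert the binary number 101.101 to decimal, and then convert the result
back to binary.
Solution
Convert the number to decimal by writing it in the summation form of
Eq. (2.6) and computing the total:

$$\begin{aligned}
101.101 &= 1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} \\
&= 4 + 0 + 1 + 0.5 + 0 + 0.125 \\
&= 5.625
\end{aligned}$$

Converting 5.625 back to binary requires going through the algorithm steps:

$$\begin{array}{lllll}
\text{Step 1:} & N = 5.625 & k = 2 & 2^k = 4 \le N & N - 2^k = 1.625 \\
\text{Step 2:} & N = 1.625 & k = 0 & 2^k = 1 \le N & N - 2^k = 0.625 \\
\text{Step 3:} & N = 0.625 & k = -1 & 2^k = 0.5 \le N & N - 2^k = 0.125 \\
\text{Step 4:} & N = 0.125 & k = -3 & 2^k = 0.125 \le N & N - 2^k = 0
\end{array}$$

$$\begin{aligned}
5.625 &= 4 + 1 + 0.5 + 0.125 \\
&= 1 \times 2^2 + 1 \times 2^0 + 1 \times 2^{-1} + 1 \times 2^{-3} \\
&= 101.101
\end{aligned}$$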
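Both conversions of Example 2.2 can be sketched in C. The function names are hypothetical, the input is assumed to be non-negative and finite, and the fractional part is cut off after `max_frac_bits` bits, since many decimal fractions have no finite binary expansion.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <string.h>

/* Binary to decimal: evaluate the place-value sum of Eq. (2.6). */
double binary_to_decimal(const char *bits)
{
    const char *point = strchr(bits, '.');
    /* Power of the leftmost bit: number of integer bits minus one. */
    int i = (int)(point ? point - bits : (ptrdiff_t)strlen(bits)) - 1;
    double value = 0.0;
    for (const char *p = bits; *p != '\0'; p++) {
        if (*p == '.') continue;
        assert(*p == '0' || *p == '1');
        if (*p == '1') value += ldexp(1.0, i);   /* adds 1 * 2^i */
        i--;
    }
    return value;
}

/* Decimal to binary: repeatedly subtract the largest power of 2 that
   still fits, writing a 1 at that position and a 0 elsewhere. */
void decimal_to_binary(double n, int max_frac_bits, char *out, size_t out_size)
{
    assert(n >= 0.0 && isfinite(n) && out_size > 0);
    size_t pos = 0;
    int k = 0;
    while (ldexp(1.0, k + 1) <= n) k++;   /* largest k >= 0 with 2^k <= n */
    for (int i = k; i >= -max_frac_bits && pos + 2 < out_size; i--) {
        double power = ldexp(1.0, i);
        if (i == -1) out[pos++] = '.';
        if (power <= n) { out[pos++] = '1'; n -= power; }
        else            { out[pos++] = '0'; }
        if (n == 0.0 && i <= 0) break;    /* nothing left to subtract */
    }
    out[pos] = '\0';
}
```

Calling `decimal_to_binary(5.625, 10, buf, sizeof buf)` writes "101.101" into `buf`, and `binary_to_decimal("101.101")` returns 5.625, as in the example.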

2.3 Number Representation

It is unreasonable and often impossible to store numbers in computer memory to
a maximum level of precision by keeping all their digits. One major issue is
that the increasing number of digits that results from performing complex
calculations will quickly consume the computer's memory. For example, the product
of the two 11-digit numbers 1.2345678901 × 2.3456789012 equals the 21-digit
number 2.89589985190657035812, and the sum 123456789.0 + 0.123456789 equals
123456789.123456789: the two summands each have 9 significant digits, but the
sum requires 18. Moreover, certain numbers, such as π, have an infinite number
of digits that would need to be stored in memory for a maximum level of precision,
which is quite simply impossible to do. Numbers in the natural world may be
infinite, but computer memory is not.
It is therefore necessary for modern computers to truncate the numbers stored in
their memory. This truncation will naturally cause a loss in precision in the values
stored and will be a source of errors for any computer model of the real world. And
this problem will be unavoidable on any computer, no matter how advanced
and powerful (at least, until someone invents a computer with infinite memory
and instantaneous processing). Nonetheless, the way numbers are represented and
stored in the computer will lead to different levels of seriousness of these errors, and
one would be wise to pick a representation scheme that minimizes problems.
Since not all number representation schemes are equal, it is necessary to begin
by defining four important requirements by which to compare the different schemes:
1. A scheme must represent numbers using a fixed amount of memory. This is an
unavoidable requirement of modern computers.
2. A scheme must allow the user to represent a range of values both very large and
very small, in order to accommodate models of any aspect of the world. The real
world has an infinite range of values, which is something that is impossible to
represent given the requirement of using a fixed amount of memory. However,
the greater the range of values that a scheme can represent, the better.
3. A scheme must be able to represent numbers, within the range it can handle, with
a small relative error. Truncation to accommodate a fixed amount of memory
will necessarily lead to errors, but keeping these errors to a minimum is always
preferable. This requirement does not take into account the error on numbers
outside the range of values the scheme can represent, as this error can be infinite.
4. A scheme must allow software to efficiently test for equality and relative
magnitude between two numbers. This is a computing requirement rather than
a memory requirement. Number comparisons are the fundamental building
block of software, and given two otherwise equivalent representation schemes,
one that allows more efficient comparisons will lead to more efficient software
and better runtime performance.

2.3.1 Fixed-Point Representation

Perhaps the easiest method of storing a real number is by storing a fixed number of
digits before and after the radix point, along with its sign (0 or 1 to represent a
positive or negative number respectively). For the sake of example, we will
assume three digits before the point and three after, thus storing a decimal
number $d_2 d_1 d_0 . d_{-1} d_{-2} d_{-3}$. For example, the constant π would be stored as
0003142. This representation is called fixed-point representation. It clearly satisfies
the first requirement of using a fixed amount of memory. Moreover, the fourth
requirement of efficiently comparing two numbers for relative magnitude can be
achieved by simply comparing the two numbers digit by digit from left to right and
stopping as soon as one is greater than the other.
Unfortunately, fixed-point representation does not perform well on the other two
requirements. For the second requirement, the range of values that this notation can
represent is very limited. The largest value that can be stored is 999.999 and the
smallest one is 000.001, which are neither very large nor very small. And for the
third requirement, the relative error on some values within the range can be very
large. For instance, the value 0.0015 is stored as 0.002 with a staggering relative
error of 0.33.

2.3.2 Floating-Point Representation

Using the same amount of memory as fixed-point representation, it would be a lot
more efficient to store numbers in scientific notation and use some of the stored
digits to represent the mantissa and some to represent the exponent. Using the same
example as before of storing six digits and a sign, this could give the decimal
number $d_0 . d_{-1} d_{-2} d_{-3} \times 10^{E_0 E_1 - 49}$. For reasons that will become clear shortly, the
exponent digits are stored before the mantissa digits, so in this representation, the
constant π would be stored as 0493142. This representation is called floating-point
representation. The value 49 subtracted in the exponent is called the bias and is
roughly half the maximum value the exponent can take.
Floating-point representation satisfies all four requirements better than
fixed-point representation. The first requirement is clearly satisfied by using exactly
as much memory as fixed-point representation. The range of values covered
to satisfy the second requirement is much larger than before: a two-digit exponent
can cover 100 orders of magnitude, and thanks to the introduction of the bias,
numbers can have exponents ranging from the minuscule $10^{-49}$ to the massive
$10^{50}$. As for the third requirement, the maximum relative error of any real
number in the interval $[1.000 \times 10^{-49}, 9.999 \times 10^{50}]$ is $1/2001 \approx 0.0005$.
Finally, by adding the requirement that the first digit of the mantissa $d_0$
must be different from zero and thanks to the fact the exponent digits are
stored first, it is still possible to efficiently compare the relative magnitude of
two numbers by comparing them digit by digit from left to right and stopping
as soon as one is greater than the other. The added requirement on the first
mantissa digit is necessary; otherwise it would always be
necessary to read all digits of the numbers and to perform some computations
to standardize them, as the same real number could be stored in several
different ways. For instance, the number 3 could be stored as 0493000,
0500300, 0510030, or 0520003, and only by subtracting the bias and shifting
the mantissa appropriately does it become evident that all four values are the
same. The requirement that the first digit of the mantissa be non-zero ensures
that only the first of these four representations is legal.
The requirement that the first digit of the mantissa must be non-zero introduces a
surprising new problem: representing the real value 0 in floating-point
representation is a rule-breaking special case. Moreover, given that each floating-point value
has a sign, there are two such special values at 0000000 and 1000000.
Floating-point representation uses this to its advantage by actually defining two values of
zero, a positive and a negative one. A positive zero represents a positive number
smaller than the smallest positive number in the range, and a negative zero
represents a negative number greater than the greatest negative number in the range.
It is also possible to include an additional exception to the rule that the first digit
of the mantissa must be non-zero, in the special case where a number is so small that
it cannot be represented while respecting the rule. In the six-digit example, this
would be the case, for example, for the number $1.23 \times 10^{-50}$, which could be
represented as 0000123 but only with a zero as the first digit of the mantissa.
Allowing this type of exception is very attractive; it would increase the range of
values that can be represented by several orders of magnitude at no cost in memory
space and without making relative comparisons more expensive. But this is no free
lunch: the cost is that the mantissa will have fewer digits, and thus the relative error
on the values in this range will be increased. Nonetheless, this trade-off can
sometimes be worthwhile. A floating-point representation that allows this exception
is called denormalized.
Example 2.3
Represent 10! in the six-digit floating-point format.
Solution
First, compute that $10! = 3628800$, or $3.6288 \times 10^6$ in scientific notation. The
exponent is thus 55 to take into account the bias of 49, the mantissa rounded
to four digits is 3.629, and the positive sign is a 0, giving the representation
0553629.

Example 2.4
What number is represented, using the six-digit floating-point format, by
1234567?
Solution
The leading 1 indicates that it is a negative number, the exponent is 23 and
the mantissa is 4.567. This represents the number $-4.567 \times 10^{23-49} =
-4.567 \times 10^{-26}$.

2.3.3 Double-Precision Floating-Point Representation

The representation most commonly used in computers today is double, short for
double-precision floating-point format, and formally defined in the IEEE
754 standard. Numbers are stored in binary (as one would expect in a computer)
over a fixed amount of memory of 64 bits (8 bytes). The name comes from the fact
this format uses double the amount of memory that was allocated to the original
floating-point format (float) numbers, a decision that was made when it was found
that 4 bytes was not enough to allow for the precision needed for most scientific and
engineering calculations.
The 64 bits of a double number comprise, in order, 1 bit for the sign (0 for
positive numbers, 1 for negative numbers), 11 bits for the exponent, and 52 bits for
the mantissa. The maximum exponent value that can be represented with 11 bits is
2047, so the bias is 1023 ($01111111111_2$), allowing the representation of numbers
in the range from $2^{-1022}$ to $2^{1023}$. And the requirement defined previously, that the
first digit of the mantissa cannot be 0, still holds. However, since the digits are now
binary, this means that the first digit of the mantissa must always be 1; it is
consequently not stored at all, and all 52 bits of the mantissa represent digits after
the radix point following an implied leading 1. Because that leading 1 is implied
rather than stored, a normalized double cannot hold a mantissa that starts with 0;
IEEE 754 instead reserves the all-zero exponent field for zero and for denormalized
values, which is how magnitudes below $2^{-1022}$ are reached.
For humans, reading and writing 64-bit-long binary numbers can be tedious
and very error prone. Consequently, for convenience, the 64 bits are usually grouped
into 16 sets of 4 bits, and the value of each set of 4 bits (which will be between 0 and
15) is written using a single hexadecimal (base-16) digit. Table 2.1
gives the equivalences between binary, hexadecimal, and decimal.

Table 2.1 Binary, hexadecimal, and decimal number conversions

Binary    Hexadecimal    Decimal
0000      0              0
0001      1              1
0010      2              2
0011      3              3
0100      4              4
0101      5              5
0110      6              6
0111      7              7
1000      8              8
1001      9              9
1010      a              10
1011      b              11
1100      c              12
1101      d              13
1110      e              14
1111      f              15

Converting a decimal real number to double format is done by converting it to
scientific notation binary, rounding the mantissa to 52 bits after the radix point, and
adding the bias of 1023 to the exponent. The double number is then assembled,
starting with the sign bit of 0 or 1 if the number is positive or negative, respectively,
then the exponent, and finally the 52 bits of the mantissa after the radix point,
discarding the initial 1 before the radix point. On the other hand, converting a
double number into a decimal real is done first by splitting it into three parts, the
sign, exponent, and mantissa. The bias of 1023 is subtracted from the exponent,
while a leading one is prepended to the mantissa. The real number is
then computed by converting the mantissa to decimal, multiplying it by 2 to the
correct exponent, and setting the appropriate positive or negative sign based on the
value of the sign bit.
Example 2.5
Convert the double c066f40000000000 into decimal.
Solution
First, express the number into binary, by replacing each hexadecimal digit
with its binary equivalent:

c    0    6    6    f    4    0    0    0    0    0    0    0    0    0    0
1100 0000 0110 0110 1111 0100 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

Next, consider the three components of a double number:

1. The first bit is the sign. It is 1, meaning the number is negative.
2. The next 11 bits are the exponent. They are $10000000110_2 = 1030_{10}$.
Recall also that the bias in double is $1023_{10}$, which means the number is
to the power $2^{1030-1023} = 2^7$.
3. The remaining 52 bits are for the mantissa. They are 0110111101 followed
by 42 zeros. Adding in the implied leading 1 before the radix point and
discarding the unnecessary trailing zeros, this represents the mantissa
$1.0110111101_2 = 1.4345703125_{10}$.
The number in decimal is thus $-1.4345703125 \times 2^7 = -183.625$.

Example 2.6
Find the double representation of the integer 289.
Solution
First, note that the number is positive, so the sign bit is 0.
Next, convert the number to binary: $289 = 256 + 32 + 1 = 2^8 + 2^5 + 2^0 =
100100001_2$. In scientific notation, this becomes $1.00100001_2 \times 2^8$ (the
radix point must move eight places to the left). The exponent for the number
is thus 1031 ($10000000111_2$) since $1031 - 1023 = 8$. And the mantissa, taking
out the leading 1 that will be implied by double notation, is $00100001_2$
followed by 44 more zeros to get the total 52 bits.
Putting it all together, the double representation is
0 10000000111 0010000100000000000000000000000000000000000000000000

Example 2.7
Find the double-precision floating-point format of $-324/33$ given that its
binary representation is:
$-1001.11010001011101000101110100010111010001011101000101110100010111010001\ldots$
Solution
The number is negative, so the sign bit is 1.
The radix point must be moved three spots to the left to produce a
scientific-format number, so the exponent is $3_{10} = 11_2$. Adding the bias
gives $01111111111 + 11 = 10000000010$.
Finally, rounding the infinite number to 53 bits and removing the leading
1 yield the 52 bits of the mantissa,
0011101000101110100010111010001011101000101110100011.
Putting it all together, the double representation is:
1 10000000010 0011101000101110100010111010001011101000101110100011
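The conversions worked by hand in Examples 2.5 to 2.7 can be checked on a real machine by copying the 64 bits of a double into an integer. The sketch below assumes the platform's `double` is the IEEE 754 64-bit format (true on essentially all current hardware, but not guaranteed by the C standard); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the 64 bits of a double into an integer without changing them.
   memcpy avoids the undefined behaviour of casting between pointer types. */
uint64_t double_to_bits(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* The reverse copy: reinterpret 64 bits as a double. */
double bits_to_double(uint64_t bits)
{
    double x;
    assert(sizeof bits == sizeof x);
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* The three fields of a double, from most to least significant bit. */
unsigned sign_bit(uint64_t bits)       { return (unsigned)(bits >> 63); }
unsigned exponent_field(uint64_t bits) { return (unsigned)((bits >> 52) & 0x7ff); }
uint64_t mantissa_field(uint64_t bits) { return bits & 0xfffffffffffffULL; }
```

For instance, `bits_to_double(0xc066f40000000000ULL)` gives -183.625 with an exponent field of 1030, and `double_to_bits(289.0)` has sign 0, exponent field 1031, and a mantissa field whose top eight bits are 00100001.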

2.4 Limitations of Modern Computers

Since computers try to represent the infinite range of real numbers using the finite
set of floating-point numbers, it is unavoidable that some problems will arise. Three
of the most common ones are explored in this section.

2.4.1 Underflow and Overflow

Any number representation format that is restricted to a limited amount of memory
will necessarily only be able to represent a finite range of numbers. In the case of
double, that range goes from negative to positive $1.8 \times 10^{308}$ at the largest magnitudes
and from negative to positive $5 \times 10^{-324}$ at the smallest magnitudes around zero. Any
number whose magnitude is greater than the former or smaller than the latter falls
outside of the number scheme and cannot be represented. This problem is called
overflow when the magnitude is greater than the maximum value and underflow when
it is smaller than the smallest value.

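double MaxVal = 1.8E307;               /* close to the largest double */
double MinVal = 5E-324;                /* smallest positive double */

double PlusInf = MaxVal * 10;          /* overflows to positive infinity */
double MinusInf = MaxVal * -1 * 10;    /* overflows to negative infinity */
double PlusZero = MinVal / 10;         /* underflows to positive zero */
double MinusZero = MinVal * -1 / 10;   /* underflows to negative zero */

Fig. 2.1 C code generating the double special values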

The name overflow comes from a figurative imagery of the problem. Picture
the largest number that can be represented in 8-bit binary, $11111111_2$. Adding 1 to
that value causes a chain of carry-overs: the least significant bit flips to 0 and a
1 carries over to the next position and causes that 1 to flip and carry a 1 over to the
next position, and so on. In the end the most significant bit flips to zero and a 1 is
carried over, except it has no place to carry to since the value is bounded to 8 bits;
the number is said to overflow. As a result, instead of $100000000_2$, the final result
of the addition is only $00000000_2$. This problem will be very familiar to the older
generation of gamers: in many 8-bit RPG games, players who spent too much time
levelling up their characters might see a level-255 ($11111111_2$) character gain a
level and fall back to its starting level. This problem was also responsible for the
famous kill screen in the original Pac-Man game, where after passing level
255, players found themselves in a half-formed level 00.
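The 8-bit wrap-around is simple to reproduce in C with the fixed-width type `uint8_t`, for which arithmetic is defined to wrap modulo $2^8$. The function name `level_up` is hypothetical, chosen to echo the gaming example.

```c
#include <stdint.h>

/* A counter held in 8 bits. The sum is computed as an unsigned int and
   then reduced modulo 256 when converted back to uint8_t. */
uint8_t level_up(uint8_t level)
{
    return (uint8_t)(level + 1u);
}
```

Calling `level_up(254)` returns 255, while `level_up(255)` returns 0: the carry out of the most significant bit has nowhere to go.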
Overflow and underflow are well-known problems, and they have solutions defined
in the IEEE 754 standard. That solution is to define four special values: a positive
infinity as 7ff0000000000000, a negative infinity as fff0000000000000, a positive
zero as 0000000000000000, and a negative zero as 8000000000000000 (both of
which are different from each other and from an actual value of zero). Converting
these to binary will show the positive and negative infinity values to be the appropriate
sign bit with all-1 exponent bits and all-zero mantissa bits, while the positive and
negative zero values are again the appropriate sign bit with all-zero exponent and
mantissa bits. Whenever a computation gives a result that falls beyond one of the four
edges of the double range, it is replaced by the appropriate special value. The sample
code in Fig. 2.1 is an example that will generate all four special values.
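The four hexadecimal patterns quoted above can be checked directly. The following sketch assumes an IEEE 754 `double` and the default round-to-nearest mode; the function names are hypothetical, and the `volatile` store is only there to force the product to be rounded to double precision before it is returned.

```c
#include <assert.h>
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Bit pattern of a double, for comparison with the hexadecimal values
   of the four special values. */
uint64_t bit_pattern(double x)
{
    uint64_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Multiply x by factor and round the product to double precision. */
double scale_double(double x, double factor)
{
    volatile double result = x * factor;
    return result;
}
```

Under those assumptions, `scale_double(DBL_MAX, 10.0)` has the bit pattern 7ff0000000000000, and `scale_double(-5e-324, 0.1)` has the bit pattern 8000000000000000.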

2.4.2 Subtractive Cancellation

Consider the following difference: $3.523 - 3.537 = -0.014$. Using the six-digit
floating-point system introduced previously, these numbers are represented by 0493523,
0493537, and 1471400, respectively. All three numbers appear to have the same
precision, with four decimal digits in the mantissa. However, 3.523 is really a
truncated representation of any number in the range [3.5225, 3.5235], as any
number in that five-digit range will round to the four-digit representation 3.523.
Likewise, the second number 3.537 represents the entire range [3.5365, 3.5375].
The maximum relative error on any of these approximations is 0.00014, so they are
not a problem. However, when considering the ranges, the result of the subtraction
is not $-0.014$ but actually could be any value in the range $[-0.015, -0.013]$. The result
$-0.014$ has no significant digits. Worse, as an approximation of the range of results,
$-0.014$ has a relative error of 0.071, more than 500 times greater than the error of the
initial values.
This phenomenon where the subtraction of similar numbers results in a significant
reduction in precision is called subtractive cancellation. It will occur any time
there is a subtraction of two numbers which are almost equal, and the result will
have few or no significant digits and much less precision than either of the initial
numbers.
Unlike overflow and underflow, double format does not substitute the result of
such operations with a special value. The result of $-0.014$ in the initial example will
be stored and used in subsequent computations as if it were a precise value rather
than a very inaccurate approximation. It is up to the engineers designing the
mathematical software and models to check for such situations in the algorithms
and take steps to avoid them.
Example 2.8
Consider two approximations of π using the six-digit floating-point
representation: 3.142 and 3.14. Subtract the second from the first. Then, compute the
relative error on both initial values and on the subtraction result.
Solution
$$3.142 - 3.14 = 0.002$$
However, in six-digit floating-point representation, 3.142 (0493142) represents
any value in [3.1415, 3.1425] and 3.14 (0493140) represents any value
in the range [3.1395, 3.1405]. Their difference is any number in the range
[0.001, 0.003]. The result of 0.002 has no significant digits.
Compared to π = 3.141592654..., the value 3.142 has a relative error of
0.00013 and the value 3.14 has a relative error of 0.00051. The correct result of
the subtraction is π − 3.14 ≈ 0.001592654..., and compared to that result,
0.002 has a relative error of 0.2558, about 500 times greater than the relative error of
3.14.
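The same effect appears in double precision. A minimal sketch, assuming IEEE 754 doubles with round-to-nearest: the expression (1 + x) − 1 equals x in exact arithmetic, but for a small x the sum 1 + x has already been rounded when the subtraction takes place. The function names are hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Compute (1 + x) - 1. The sum is rounded to the nearest double first,
   and the subtraction of two nearly equal numbers then exposes that
   rounding error as a large relative error in the result. */
double add_then_subtract_one(double x)
{
    volatile double sum = 1.0 + x;   /* rounded to double precision here */
    return sum - 1.0;
}

/* Relative error of an approximation with respect to a non-zero exact value. */
double relative_error(double approx, double exact)
{
    assert(exact != 0.0);
    return fabs(approx - exact) / fabs(exact);
}
```

With x = 1e-15 the relative error of the result is about 0.11, even though both operands of the subtraction are accurate to roughly 16 significant digits. With x = 0.5, where nothing is rounded, the error is exactly zero.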

2.4.3 Non-associativity of Addition

In mathematics, associativity is a well-known fundamental property of addition
which states that the order in which additions are done makes no difference on the
final result. Formally

$$(a + b) + c = a + (b + c) \tag{2.9}$$

This property no longer holds in floating-point number representation. Because of
the truncation required to enforce a fixed amount of memory for the representation,
larger numbers will dominate smaller numbers in the summation, and the order in
which the larger and smaller numbers are introduced into the summation will affect
the final result.
To simplify, consider again the six-digit floating-point representation. For an
example of a large number dominating a smaller one, consider the sum
$5592 + 0.7846 = 5592.7846$. In six-digit floating-point representation, these numbers
are written 0525592 and 0487846, respectively, and there is no way to store the
result of their summation entirely as it would require ten digits instead of six. The
result stored is actually 0525593, which corresponds to the value 5593, a rounded
result that maintains the more significant digits of the larger number and discards
the less significant digits of the smaller number. The larger number has eclipsed the
smaller one in importance. The problem becomes even worse in the summation
$5592 + 0.3923 = 5592.3923$. The result will again need to be cropped to be stored in
six digits, but this time it will be rounded to 5592. This means that the summation
has left the larger number completely unchanged!
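The same absorption happens in binary floating point. As a sketch, the single-precision `float` type (assumed here to be the IEEE 754 32-bit format, with a 24-bit mantissa) cannot represent $2^{24} + 1$, so adding 1 to $2^{24} = 16777216$ leaves the larger number unchanged. The function name is hypothetical.

```c
/* Add a small term to a large one in single precision. The volatile
   store forces the sum to be rounded to float before it is returned. */
float add_float(float large, float small)
{
    volatile float sum = large + small;
    return sum;
}
```

`add_float(16777216.0f, 1.0f)` returns 16777216, whereas `add_float(16777216.0f, 2.0f)` returns 16777218 because that sum is representable.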
These two rounding issues are at the core of the problem of non-associativity
when dealing with a sequence of sums as in Eq. (2.9). The problem is that not just the
final result, but also every partial result summed in the sequence, must be encoded
in the same representation and will therefore be subject to rounding and loss.
Consider the sum of three values, $5592 + 0.3923 + 0.3923 = 5592.7846$. As
before, every value in the summation can be encoded in the six-digit floating-point
format, but the final result cannot. However, the order in which the summation is
computed will affect what the final result will be. If the summation is computed as

$$5592 + (0.3923 + 0.3923) = 5592 + 0.7846 = 5592.7846 \tag{2.10}$$

then there is no problem in storing the partial result 0.7846, and only the final result
needs to be rounded to 5593. However, if the summation is computed as

$$(5592 + 0.3923) + 0.3923 = 5592.3923 + 0.3923 = 5592.7846 \tag{2.11}$$

then there is a problem, as the partial result 5592.3923 gets rounded to 5592, and the
second part of the summation then becomes 5592 + 0.3923 again, the result of
which again gets rounded to 5592. The final result of the summation has changed
because of the order in which the partial summations were computed, in clear
violation of the associativity property.

double sum = 0.0;

for ( int i = 0; i < 100000; i++ ) {
    sum += 0.1;    // every addition rounds the partial total
}

Fig. 2.2 C++ code suffering from non-associativity
Example 2.9
Using three decimal digits of precision, add the powers of 2 from 0 to 17 in
the order from 0 to 17 and then in reverse order from 17 to 0. Compute the
relative error of the final result of each of the two summations.
Solution
Recall that with any such system, numbers must be rounded before and after
any operation is performed. For example, $2^{10} = 1024 \approx 1020$ after rounding
to three significant digits. Thus, the partial sums in increasing order are
1, 3, 7, 15, 31, 63, 127, 255, 511, 1020, 2040, 4090, 8190, 16400, 32800, 65600,
131000, 262000
while in decreasing order, they are
131000, 196000, 229000, 245000, 253000, 257000, 259000, 260000, 261000,
261000, 261000, 261000, 261000, 261000, 261000, 261000, 261000, 261000
The correct value of this sum is $2^{18} - 1 = 262{,}143$. The relative error of the
first sum is 0.00055, while the relative error of the second sum is 0.0044. So
not only are the results different given the order of the sum, but the sum in
increasing order gives a result an order of magnitude more accurate than the
second one.
As with subtractive cancellation, there is no special value in the double format
to substitute in for non-associative results, and it is the responsibility of software
engineers to detect and correct such cases when they happen in an algorithm. In that
respect, it is important to note that non-associativity can be subtly disguised in the
code. It can occur, for example, in a loop that sums a small value into a total at each
increment of a long process, such as in the case illustrated in Fig. 2.2. As the partial
total grows, the small incremental addition will become rounded off in later
iterations, and inaccuracies will not only occur, they will become worse and
worse as the loop goes on.
The solution to this problem is illustrated in Example 2.9. It consists of sorting
the terms of summations in order of increasing value. Two values of the same
magnitude summed together will not lose precision, as the digits of neither number
will be rounded off. By summing together smaller values first, the precision of the
partial total is maintained, and moreover the partial total grows in magnitude and
can then be safely added to larger values. This ordering ensures that the cumulative
sum of many small terms is still present in the final total.
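Example 2.9 can be replayed in C using integer arithmetic and an explicit rounding to three significant digits. This is a sketch: the function names are hypothetical, and exact ties are rounded to the even neighbour, an assumption chosen because it reproduces the partial sums listed in the example (196500 becomes 196000).

```c
#include <assert.h>

/* Round a non-negative integer to three significant decimal digits,
   rounding exact ties to the even neighbour (an assumption, see above). */
long round3(long v)
{
    assert(v >= 0);
    long scale = 1;
    while (v / scale >= 1000) scale *= 10;
    if (scale == 1) return v;              /* already three digits or fewer */
    long kept = v / scale, rest = v % scale, half = scale / 2;
    if (rest > half || (rest == half && kept % 2 == 1)) kept++;
    return kept * scale;
}

/* Sum 2^0 ... 2^17, rounding every term and every partial sum.
   A non-zero argument adds the smallest terms first. */
long sum_powers_of_two(int increasing)
{
    long total = 0;
    for (int n = 0; n <= 17; n++) {
        int i = increasing ? n : 17 - n;
        total = round3(total + round3(1L << i));
    }
    return total;
}
```

`sum_powers_of_two(1)` returns 262000 and `sum_powers_of_two(0)` returns 261000, the two totals of Example 2.9; only the order of the additions differs between the two calls.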

2.5 Summary

Modern engineering modelling is done on computers, and this will continue to be

the case long into the foreseeable future. This means that the first source of errors in
any model is not in the algorithms used to compute and approximate it, but in the
format used by the computer to store the very values that are modelled. It is simply
impossible to store the infinite continuous range of real numbers using the finite set
of discrete values available to computers, and this fundamental level of approximation is also the first source of errors that must be considered in any modelling
process.
In engineering, a less accurate result with a predictable error is better than a more
accurate result with an unpredictable error. This was one of the main reasons behind
standardizing the format of floating-point representations on computers. Without
standardization, the same code run on different machines could produce potentially
very different answers. IEEE 754 standardized the representation and behaviour of
floating-point numbers and therefore allowed better prediction of the error; thus, an
algorithm designed to run within certain tolerances will perform similarly on all
platforms. Standardization also allows the algorithm designer to focus on a single
standard, as opposed to wasting time fine-tuning each algorithm for each different machine.
The properties of the double are specified by the IEEE 754 technical standard.
For additional information, there are a few excellent documents which should be
read, especially Lecture Notes on the Status of IEEE Standard 754 for Binary
Floating-Point Arithmetic by Prof W. Kahan and What Every Computer Scientist
Should Know about Floating-Point Arithmetic by David Goldberg.

2.6 Exercises

1. Represent the decimal number 523.2345 in scientific notation.
2. What is the problem if we don't use scientific notation to represent 1.23e10?
3. Add the two binary integers $100111_2$ and $1000110_2$.
4. Add the two binary integers $1100011_2$ and $10100101101_2$.
5. Add the two binary numbers $100.111_2$ and $10.00110_2$.
6. Add the two binary numbers $110.0011_2$ and $10100.101101_2$.
7. Multiply the two binary integers $100111_2$ and $1010_2$.
8. Multiply the two binary integers $1100011_2$ and $10011_2$.
9. Multiply the two binary numbers $10.1101_2$ by $1.011_2$.
10. Multiply the two binary numbers $100.111_2$ and $10.10_2$.
11. Multiply the two binary numbers $1100.011_2$ and $10.011_2$.
12. Convert the binary number $10010.0011_2$ to decimal.
13. Convert the binary number $111.111111_2$ to decimal.

14. What decimal numbers do the following represent using the six-digit floating-point
format?
a. 479323
b. 499323
c. 509323
d. 549323

15. Given their implied precision, what range of numbers do the following represent using the six-digit floating-point format?
a. 521234
b. 522345
16. Represent the following numbers in six-digit floating-point format:
a. Square root of two (1.414213562)
b. One million (1000000)
c. $e^{-10}$ (0.00004539992976)
17. Convert the decimal number 1/8 to binary double format.
18. Convert the hexadecimal double format number c01d600000000000 to binary
and to decimal.
19. Convert the following binary double format numbers to decimal:
a. 0100000001100011001011111000000000000000000000000000000000000000
b. 0011111111101000100000000000000000000000000000000000000000000000
20. Add the following two hexadecimal double format numbers: 3fe8000000000000
and 4011000000000000.
21. Using the six-digit floating-point format:
a. What is the largest value which can be added to 3.523 which will result in a
sum of 3.523 and why?
b. What is the largest float which may be added to 722.4 which will result in a
sum of 722.4 and why?
c. What is the largest float which may be added to 722.3 which will result in a
sum of 722.3 and why?
22. How would you calculate the sum of n2 for n 1, 2, . . ., 100,000 and why?

Chapter 3

Iteration

3.1 Introduction

This chapter and the following four chapters introduce five basic mathematical
modelling tools: iteration, linear algebra, Taylor series, interpolation, and
bracketing. While they can be used as simple modelling tools on their own, their
main function is to provide the basic building blocks from which numerical
methods and more complex models will be built.
One technique that will be used in almost every numerical method in this book
consists in applying an algorithm to some initial value to compute an approximation
of a modelled value, then applying the algorithm to that approximation to compute an
even better approximation, and repeating this step until the approximation improves to a
desired level. This process is called iteration, and it can be as simple as applying a
mathematical formula over and over or complex enough to require conditional flow-control
statements. It also doubles as a modelling tool for movement, both for
physical movement in space and for the passage of time. In those cases, one iteration
can represent a step along a path in space or the increment of a clock.
This chapter will introduce some basic notions and terminology related to the
tool of iteration, including most notably the different halting conditions that can
come into play in an iterating function.

3.2 Iteration and Convergence

Given a function f(x) and an initial value x0, it is possible to calculate the result of
applying the function to the value as such:
$$x_1 = f(x_0) \tag{3.1}$$

Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_3

This would be our first iteration. The second iteration would compute x2 by
applying the function to x1, the third iteration would compute x3 by applying the
function to x2, and so on. More generally, the ith iteration is given in Eq. (3.2):
$$
\begin{aligned}
x_1 &= f(x_0) \\
x_2 &= f(x_1) \\
x_3 &= f(x_2) \\
&\;\;\vdots \\
x_i &= f(x_{i-1})
\end{aligned}
\tag{3.2}
$$

Each value of x will be different from the previous. However, for certain
functions, each successive value of x will become more and more similar to the
previous one, until they stop changing altogether. At that point, the function is said
to have converged to the value x. The steps of a function starting at x0 and
converging to xi are given in Eq. (3.3):
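$$
\begin{aligned}
x_1 &= f(x_0) \\
x_2 &= f(x_1) \\
x_3 &= f(x_2) \\
&\;\;\vdots \\
x_{i-1} &= f(x_{i-2}) \\
x_i &= f(x_{i-1}) \\
x_i &= f(x_i)
\end{aligned}
\tag{3.3}
$$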

Convergence can be observed by doing a simple experiment using any scientific
calculator. Set the calculator in degrees, compute the cosine of 0, and then iteratively compute the cosine of every result obtained. The first result will be 1, and
then, depending on the number of digits of precision of the calculator, it can be
0.9998476951, then 0.9998477415, and so on. At every iteration, more decimals
will stay constant and the decimals that change will be further away from the radix
point; in other words, the difference between the two successive numbers is
decreasing. Then at some point, the number will stop changing altogether and the
function will have converged. Again, depending on the precision of the calculator,
this point will change: the value has already converged for the ten digits of
precision in this paragraph and can take another half-dozen iterations to converge
to 30 digits of precision.
Running the same experiment with a calculator set in radians yields a different
result. This time the sequence goes 0, 1, 0.5403, 0.8575, 0.6542, 0.79348, 0.7013,
0.76395, etc. It is still converging, but while in degrees the value was decreasing
towards convergence, in radians it oscillates around the value the function converges to, jumping from a value that is greater than the convergence value to one
that is smaller and back to greater. It can also be observed that convergence is a lot
slower. While in degrees the function can converge in five to ten iterations,
depending on the precision of the calculator, in radians it takes six iterations only
for the first decimal to converge, and it will take up to iteration 16 for the second
decimal to converge to 0.73. This notion of the speed with which a function
converges is called convergence rate, and in the numerical methods that will be
introduced later on, it will be found that it is directly linked to the big O error rate of
the functions themselves: a function with a smaller big O error rate will have less
error remaining in the value computed at each successive iteration and will thus
converge to the error-free value in fewer iterations than a function with a greater big
O error rate.
Finally, it is easy to see that not all functions will converge. Using the $x^2$ function
of the calculator and starting at any value greater than 1 will yield results that are
both greater at each successive iteration and that increase more between each
successive iteration, until the maximum value or the precision of the calculator is
exceeded. Functions with that behaviour are said to diverge. It is worth noting that
some functions display both behaviours: they can converge when iterating from
certain initial values, but diverge when iterating from others. The $x^2$ function is in
fact an example of such a function: it diverges for any initial value greater than 1 or
less than $-1$, but converges to 0 using any initial value between $-1$ and 1.
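
The calculator experiments described in this section can be repeated in code. The following is a minimal sketch; the helper names `iterate` and `square` and the iteration counts are illustrative assumptions.

```python
import math

def iterate(f, x0, count):
    """Return [x0, x1, ..., x_count], where each x_i = f(x_{i-1})."""
    values = [x0]
    for _ in range(count):
        values.append(f(values[-1]))
    return values

def square(x):
    return x * x

# Cosine in radians: the values oscillate around, and slowly approach, about 0.739.
cosine_sequence = iterate(math.cos, 0.0, 100)

# x^2 shrinks towards 0 from inside (-1, 1) and grows without bound from outside.
shrinking = iterate(square, 0.5, 10)
growing = iterate(square, 2.0, 5)

print(cosine_sequence[:8])
print(shrinking[-1], growing[-1])
```

Printing the first few cosine values shows the alternation above and below the limit, while the two squared sequences show convergence and divergence of the same function from different initial values.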
Example 3.1
Starting from any positive or negative non-zero value, compute ten iterations
of the function:
$$f(x) = \frac{x}{2} + \frac{1}{x}$$

Solution
This function has been known since antiquity to converge to $\sqrt{2}$ (or to $-\sqrt{2}$
when the initial value is negative). The exact
convergence sequence will depend on the initial value, but positive and
negative examples are given in the table below.

Iteration   Positive sequence   Negative sequence
x0          44.0                -88.0
x1          22.0227272727       -44.0113636364
x2          11.0567712731       -22.0284032228
x3          5.6188279524        -11.0595975482
x4          2.9873870354        -5.6202179774
x5          1.8284342085        -2.9880380306
x6          1.4611331460        -1.8286867771
x7          1.4149668980        -1.4611838931
x8          1.4142137629        -1.4149685022
x9          1.4142135623        -1.4142137637
x10         1.4142135623        -1.4142135623

The convergence can be further observed by plotting the value of the
sequence at each iteration, as is done in the figure below (in blue for the
positive sequence and in red for the negative sequence). It is clear
from such a graph that the sequence not only converges quickly, but
forms a decreasing logarithmic curve as it does so. This sequence is said to
converge logarithmically.
[Figure: value of the positive (blue) and negative (red) sequences plotted against the iteration number]
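
The sequences of Example 3.1 can be regenerated with a short script. This is a sketch under the assumption that ordinary double-precision arithmetic is acceptable; the function names are illustrative.

```python
import math

def f(x):
    """One step of the iteration from Example 3.1: f(x) = x/2 + 1/x."""
    return x / 2.0 + 1.0 / x

def ten_iterations(x0):
    """Return [x0, x1, ..., x10]."""
    values = [x0]
    for _ in range(10):
        values.append(f(values[-1]))
    return values

positive = ten_iterations(44.0)
negative = ten_iterations(-88.0)

for i, (p, n) in enumerate(zip(positive, negative)):
    print(f"x{i:<3} {p:18.10f} {n:18.10f}")
```

The sign of the initial value decides which of the two square roots the sequence approaches, since every iterate keeps the sign of the previous one.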

3.3 Halting Conditions

As explained in the previous section, a function might converge quickly to an exact
value, or it might converge at a very slow rate, or even diverge altogether. And even
in the cases where the function converges, the function might reach an approximate
result quickly or spend endless iterations computing more and more digits of
precision. For these reasons, it is important to give any iterating algorithm a set
of halting conditions at which it should stop iterating. There are two types of halting
conditions: success conditions, which indicate that the last iteration yielded a
converged value of sufficient precision for a given model or application, and failure
conditions, which indicate that the function has failed to converge for some reason.
The exact details of these conditions and the specific conditions they test for will
vary based on the numerical method that is being implemented and the situation that
is being modelled. However, in general terms, a success condition will test the
relative error between two iterations, and a failure condition will test the number of
iterations computed.

An iterative function succeeds when it converges to a value. However, this
converged value might be a rational number with an endless number of non-zero
decimals (such as 1/3) or an irrational number, which would require endless
iterations to compute completely. Moreover, in engineering practice, it seldom
makes sense to compute all possible digits of a value, since after a certain point,
these digits represent values too small to have a practical impact or even smaller
than the precision of the instruments used to measure them. For these reasons,
iterating until the exact equality condition of Eq. (3.3) is usually neither possible
nor desirable. It is preferable to iterate until two successive values are approximately equal within a relative error, the value of which is problem dependent.
Recall that relative error was defined in Eq. (1.2) to be the difference between an
approximation and a real value as a ratio of the real value. In that definition, it is
possible for an engineer to set a threshold relative error, below which an approximation is close enough to the real value to be considered equal for all intents and
purposes. However, Eq. (1.2) requires knowledge of the real value, which is
unknown in this current situation; the iterating function is being used to refine an
approximation of the value and would be entirely unnecessary if the exact value
was known to begin with! Nonetheless, if the function is converging, then each
successive value is more exact than the previous one, and it is possible to rewrite
Eq. (1.2) to use that fact as follows:

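$$E_{\mathrm{rel}} = \left| \frac{\text{previous value} - \text{current value}}{\text{current value}} \right| \tag{3.4}$$

$$E_i = \left| \frac{x_{i-1} - x_i}{x_i} \right| \tag{3.5}$$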

In that definition, when the values computed in two successive iterations have a
relative error Ei less than a preset threshold, they are close enough to be considered
equal for all intents and purposes, and the function has successfully converged.
If the algorithm iterates to discover the value of a vector instead of a scalar, a
better test for convergence is to compute the Euclidean distance between two
successive vectors. This distance is the square root of the sum of squared differences between each pair of values in the two vectors. More formally, given two
successive $n \times 1$ vectors $\mathbf{x}_i = [x_{i,0}, \ldots, x_{i,n-1}]^T$ and $\mathbf{x}_{i-1} = [x_{i-1,0}, \ldots, x_{i-1,n-1}]^T$, the
Euclidean distance is defined as
$$E_i = \lVert \mathbf{x}_{i-1} - \mathbf{x}_i \rVert = \sqrt{(x_{i-1,0} - x_{i,0})^2 + \cdots + (x_{i-1,n-1} - x_{i,n-1})^2} \tag{3.6}$$

Once again, when two successive vectors have a distance Ei less than a preset
threshold, they are close enough to be considered equal for all intents and purposes,
and the function has successfully converged.
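
The vector test of Eq. (3.6) can be sketched as follows. The function names, the sample vectors, and the threshold are illustrative assumptions.

```python
import math

def euclidean_distance(previous, current):
    """Square root of the summed squared differences, as in Eq. (3.6)."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(previous, current)))

def has_converged(previous, current, threshold):
    """Success condition: two successive vectors are closer than the threshold."""
    return euclidean_distance(previous, current) < threshold

step_a = [1.0, 2.0, 3.0]
step_b = [1.0, 2.0, 3.5]
print(euclidean_distance(step_a, step_b))
print(has_converged(step_a, step_b, 1e-3))
```

Unlike the relative error of Eq. (3.5), this distance is an absolute quantity, so the threshold has to be chosen with the scale of the vector entries in mind.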
An iterative function fails to converge if it does not reach a success condition in a
reasonable amount of time. A simple catch-all failure condition would be to set a
maximum number of iterations and to terminate the function if that number is
reached. Note however that this does not provide any feedback on why the function
failed to converge. There are in fact several possible reasons why that could have
happened. It could be that the function cannot possibly converge, that it is a
diverging function, and that a solution is impossible to find. Or the problem
might be the initial value chosen; it might be in a diverging region of the function
or even on a singularity (a value that causes a division by zero). It could even be the
case that the function was converging, but too slowly to reach the success condition
in the set maximum number of iterations. In that case, it would be necessary to
increase the maximum number of iterations allowed or to redesign the mathematical formulas to increase the convergence rate somehow. The important conclusion
to retain is that a function failing to converge after a set number of iterations is not
synonymous with a function never converging and no solution existing for the
problem. This is important to remember because this failure condition, of setting a
maximum number of iterations, is one of the most commonly used ones in practice.
Putting it all together, the pseudo-code of an iterative algorithm to compute the
iterations of Eq. (3.2) using a threshold on the relative error of Eq. (3.5) as a success
condition and a maximum number of iterations as a failure condition is given in
Fig. 3.1.
CurrentValue ← Initial Value
IterationCounter ← 0
IterationMaximum ← Maximum Iteration Threshold
ErrorMinimum ← Minimum Relative Error Threshold
WHILE (TRUE):
    PreviousValue ← CurrentValue
    CurrentValue ← CALL F(CurrentValue)
    IterationCounter ← IterationCounter + 1
    CurrentError ← absolute((PreviousValue - CurrentValue) / CurrentValue)
    IF (CurrentError <= ErrorMinimum)
        RETURN Success
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN compute a result given an input variable x
END FUNCTION

Fig. 3.1 Pseudo-code of an iterative software
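
One possible Python rendering of the pseudo-code in Fig. 3.1 is sketched below. The function and parameter names are assumptions made for illustration, and the two test functions are hypothetical choices: one converges quickly and the other diverges.

```python
import math

def iterate_until_converged(f, initial_value, error_minimum, iteration_maximum):
    """Iterate x_i = f(x_{i-1}) following the structure of Fig. 3.1.

    Returns (succeeded, value, iterations). An iterate equal to zero would
    make the relative error undefined; this sketch does not guard against it.
    """
    current = initial_value
    for counter in range(1, iteration_maximum + 1):
        previous = current
        current = f(current)
        error = abs((previous - current) / current)
        if error <= error_minimum:
            return True, current, counter      # success condition
    return False, current, iteration_maximum   # failure condition

def example_function(x):
    # Hypothetical test function: Newton's step for sin(x) = 0.5, in radians.
    return x - (math.sin(x) - 0.5) / math.cos(x)

print(iterate_until_converged(example_function, -0.5, 1e-5, 10))
print(iterate_until_converged(lambda x: x * x, 2.0, 1e-5, 5))
```

Returning the iteration count alongside the value makes it easy to tell a slow convergence from a fast one, and the failure branch still returns the last value so that the caller can inspect it.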


Example 3.2
Iterate $f(x) = x - \dfrac{\sin(x) - 0.5}{\cos(x)}$ in radians using $x_0 = -0.5$, until the relative error is
$10^{-5}$ or up to a maximum of ten iterations.
Solution
Since the target relative error is $10^{-5}$, it is necessary to keep six digits of
precision in the values. Any less would make it impossible to compute the
error, while more digits would be unnecessary.
The results of the iterations, presented in the table below, show that the
relative error has dropped below the threshold after four iterations. The function
has converged and reached the success condition. It is unnecessary to compute the remaining six iterations.

Iteration   Value       Relative error
x0          -0.5        N/A
x1          0.616049    1.811624
x2          0.520707    0.183101
x3          0.523596    0.005518
x4          0.523599    0.000006

3.4 Summary

This chapter has introduced the concept of iteration, the first of the five mathematical tools that will underlie the numerical methods and modelling algorithms of
later chapters. While iteration is a simple tool, it is also a very versatile technique
and will be used in most algorithms in coming chapters. This chapter has introduced
notions related to iterations, such as the notions of convergence, divergence, and
convergence rate and the notions of success and failure halting conditions.

3.5 Exercises

1. What value does the function f x 2:4xx 1 converge to? How many
iterations do you need to get the equality condition of Eq. (3.3)?
2. Starting with $x_0 = 0.5$, compute iterations of $f(x) = \sin(x)$ and $f(x) = \cos(x)$.
Which converges faster? Is the difference significant?
3. Consider $f(x) = x^3$. For which range of values will the function converge or
diverge?
4. Consider the function $f(x) = x + \sin(x)$, where the sin function is in radians
(Beeler et al. 1972). Starting from $x_0 = 0.5$, compute the value of $x_i$ and its
relative error as an approximation of $\pi$ over five iterations.
5. Consider the function $f(x) = (3x^4 + 10x^3 - 20x^2 - 24)/(4x^3 + 15x^2 - 40x)$. Starting
from $x_0 = 5$, compute the value of $x_i$ and its relative error as an approximation of
2 over five iterations.
6. Consider the following functions. How many values can each one converge to?
What are they?
(a) $f(x) = 1.2x^{-1} + 0.8$.
(b) $f(x) = (x - 8)/3$.
(c) $f(x) = (0.5x^2 - 10)/(x - 6)$.
(d) $f(x) = (x^2 + 0.4)/(2x + 0.6)$.
(e) $f(x) = (2x^3 - 10.3x^2 - 36.5)/(3x^2 - 20.6x + 9.7)$.

Chapter 4

Linear Algebra

4.1 Introduction

The first of the five mathematical modelling tools, introduced in the previous
chapter, is iteration. The second is solving systems of linear algebraic equations
and is the topic of this chapter. A system of linear algebraic equations is any set of
n equations with n unknown variables $x_0, \ldots, x_{n-1}$:
$$
\begin{aligned}
m_{0,0}x_0 + m_{0,1}x_1 + \cdots + m_{0,n-1}x_{n-1} &= b_0 \\
m_{1,0}x_0 + m_{1,1}x_1 + \cdots + m_{1,n-1}x_{n-1} &= b_1 \\
&\;\;\vdots \\
m_{n-1,0}x_0 + m_{n-1,1}x_1 + \cdots + m_{n-1,n-1}x_{n-1} &= b_{n-1}
\end{aligned}
\tag{4.1}
$$

where the $n \times n$ values $m_{0,0}, \ldots, m_{n-1,n-1}$ are known coefficient values that multiply
the variables, and the n values $b_0, \ldots, b_{n-1}$ are the known result of each equation. A
system of that form can arise in many ways in engineering practice. For example, it
would be the result of taking measurements of a dynamic system at n different
times. It also results from taking measurements of a static system at n different
internal points. Consider, for example, the simple electrical circuit in Fig. 4.1. Four
internal nodes have been identified in it. If one wants to model this circuit, for
example, to be able to predict the voltage between two nodes, the
corresponding energy losses, or its other properties, it is first necessary to
model the voltages at each node using Kirchhoff's current law. Removing units
and with appropriate scaling, this gives the following set of four equations and four
unknown variables:


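Fig. 4.1 Example electrical circuit

$$
\begin{aligned}
\text{Node 1:}\quad & \frac{v_1 - 0}{120} + \frac{v_1 - v_2}{240} = 0.01 \\
\text{Node 2:}\quad & \frac{v_2 - 0}{320} + \frac{v_2 - v_1}{240} + \frac{v_2 - v_3}{180} + \frac{v_2 - v_4}{200} = 0 \\
\text{Node 3:}\quad & \frac{v_3 - 0}{160} + \frac{v_3 - v_2}{180} + \frac{v_3 - v_4}{360} = 0 \\
\text{Node 4:}\quad & \frac{v_4 - v_2}{200} + \frac{v_4 - v_3}{360} = 0.01
\end{aligned}
\tag{4.2}
$$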

which is a system of four linear algebraic equations of the same form as Eq. (4.1).
The linear system of Eq. (4.1) can be written as a classic matrix-vector problem:
$$Mx = b \tag{4.3}$$
where M is the $n \times n$ matrix of known coefficients, x is an $n \times 1$ column vector of
unknowns, and b is an $n \times 1$ column vector of known equation result values. The
sample system given in Eq. (4.2), for example, would be written as
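$$
\begin{bmatrix}
\frac{1}{120} + \frac{1}{240} & -\frac{1}{240} & 0 & 0 \\
-\frac{1}{240} & \frac{1}{320} + \frac{1}{240} + \frac{1}{180} + \frac{1}{200} & -\frac{1}{180} & -\frac{1}{200} \\
0 & -\frac{1}{180} & \frac{1}{160} + \frac{1}{180} + \frac{1}{360} & -\frac{1}{360} \\
0 & -\frac{1}{200} & -\frac{1}{360} & \frac{1}{200} + \frac{1}{360}
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
=
\begin{bmatrix} 0.01 \\ 0 \\ 0 \\ 0.01 \end{bmatrix}
\tag{4.4}
$$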

and each of the node equations can be recovered by multiplying the corresponding
row of the matrix by the vector of unknowns and keeping the matching result value.
Writing the system in matrix-vector form makes it easier to solve and discover the
value of each unknown variable.
Some readers may have learned how to solve a system of linear equations using
Gaussian elimination together with backward substitution, possibly in a previous
course on linear algebra. However, there are two problems with Gaussian elimination. The first is its lack of efficiency. Even an optimal implementation of a
Gaussian elimination and backward substitution algorithm for solving a system of
n linear equations will require $n^3/3 - n/3$ multiplications and additions and
$n^2/2 - n/2$ divisions and negations for the Gaussian elimination step, in addition
to $n^2/2 - n/2$ multiplications and subtractions and n divisions for the backward
substitution step. In other words, it is an algorithm with $O(n^3)$ time complexity,
which is very inefficient: the computation time required to solve a linear system will
grow proportionally to the cube of the size of the system! Doubling the size of a
system will require eight times more computations to solve, and trying to solve a
system with 10 times more equations and unknowns will take 1000 times as long.
The second problem with Gaussian elimination is related to the step that requires
adding a multiple of one row to the others. In any situation where the coefficients of
one row are a lot bigger in magnitude than those of another, and given a finite
number of digits of precision, the algorithm will suffer from the problem of
non-associativity of addition explained in Chap. 2. When that happens, the values
computed for the variables by Gaussian elimination will end up having a very high
relative error compared to the real values that should have been obtained.
This chapter will introduce better methods for solving the $Mx = b$ system, both
for general matrices and for some special cases. The ability to solve linear systems
will be important for other mathematical tools and for numerical methods, as it will
make it possible to easily, efficiently, and accurately solve complex systems.

4.2 PLU Decomposition

The PLU decomposition, also called the LUP decomposition or the PLU factorization, is an improvement on the Gaussian elimination technique. It addresses the two
problems highlighted in the introduction. First, once the matrix M has been decomposed into a
lower triangular matrix L and an upper triangular matrix U, the system can be solved
in $O(n^2)$ instead of $O(n^3)$ operations; the decomposition itself remains an $O(n^3)$
computation, but it is done only once and can be reused for every new vector b.
Second, by doing a permutation of the rows so that the
element with the maximum absolute value is always on the diagonal and keeping
track of these changes in a permutation matrix P, it can avoid the non-associativity
problem.
The PLU decomposition technique thus works in two steps: first, decomposing
the matrix M into three equivalent matrices:
$$M = PLU \tag{4.5}$$
then solving the $PLUx = b$ system using simple forward and backward
substitutions.

M ← Input n×n matrix
P ← n×n identity matrix
L ← n×n zero matrix
U ← M
ColumnIndex ← 0
WHILE (ColumnIndex < n-1)
    ColumnVector ← column ColumnIndex, rows ColumnIndex to n-1 of U
    IndexOfMaximum ← row index of maximum absolute value in ColumnVector
    P ← swap row ColumnIndex and row IndexOfMaximum in P
    L ← swap row ColumnIndex and row IndexOfMaximum in L
    U ← swap row ColumnIndex and row IndexOfMaximum in U
    RowIndex ← ColumnIndex + 1
    WHILE (RowIndex < n)
        s ← -1 × (element at row RowIndex, column ColumnIndex of U) /
            (element at row ColumnIndex, column ColumnIndex of U)
        row RowIndex of U ← (row RowIndex of U) + s × (row ColumnIndex of U)
        element at row RowIndex, column ColumnIndex of L ← -1 × s
        RowIndex ← RowIndex + 1
    END WHILE
    ColumnIndex ← ColumnIndex + 1
END WHILE
L ← L + (n×n identity matrix)
RETURN P, L, U

Fig. 4.2 Pseudo-code of the PLU decomposition
There is a simple step-by-step algorithm to decompose the matrix M into L, U,
and $P^T$. Note that the algorithm decomposes into the transpose of the permutation
matrix P; later the forward and backward substitution operations will need this
transposed matrix, so this actually saves a bit of time in the overall process. A
pseudo-code version of this algorithm is presented in Fig. 4.2.
Step 1: Initialization. Each of the three decomposition matrices will have initial
values. The matrix U is initially equal to M, the matrix L is initially an
$n \times n$ zero matrix, and the matrix $P^T$ is initially an $n \times n$ identity matrix.
Step 2: Decomposition. The algorithm considers each column in turn, working left
to right from column 0 to the penultimate column $n - 2$. For the current
column i, find the row j in the matrix U such that the element $u_{j,i}$ has the
greatest absolute value in the column and $j \ge i$, meaning that the element is
on or below the diagonal element of the column. If that element is zero,
then the matrix is singular and cannot be decomposed with this method, and
the algorithm terminates. Next, swap rows i and j in all three matrices U, L,
and $P^T$. This will bring the greatest-valued element found onto the diagonal
of matrix U as element $u_{i,i}$. Finally, for every row k below the diagonal,
calculate a scalar value $s = -u_{k,i}/u_{i,i}$. Save the value of $-s$ at element $l_{k,i}$ in
matrix L in order to fill the entries below the diagonal in that matrix, and
add s times row i to row k in matrix U. This addition will cause every
value in column i under the diagonal of U to become 0.
Step 3: Finalization. Once the decomposition step has been done for every column
except the last right-most one, the algorithm ends by adding an $n \times n$
identity matrix to the lower-diagonal matrix L.
Once the matrix M has been decomposed, the $PLUx = b$ system can be solved in
two steps. The matrix-vector product Ux is replaced with a vector y in the equation, which
can be computed with a forward substitution step:
$$Ly = P^T b \tag{4.6}$$
then knowing y, a backward substitution step can compute x:
$$Ux = y \tag{4.7}$$
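
The three steps of the decomposition and the two substitutions of Eqs. (4.6) and (4.7) can be sketched in plain Python as follows. The function names and the small 3 by 3 test system are hypothetical, chosen only for illustration; matrices are stored as lists of rows.

```python
def plu_decompose(m):
    """Return (pt, l, u) such that pt * m == l * u, following Steps 1 to 3."""
    n = len(m)
    u = [list(map(float, row)) for row in m]      # U starts as a copy of M
    l = [[0.0] * n for _ in range(n)]             # L starts as a zero matrix
    pt = [[float(r == c) for c in range(n)] for r in range(n)]  # P^T starts as identity
    for i in range(n - 1):
        # Row on or below the diagonal with the largest absolute value in column i.
        j = max(range(i, n), key=lambda r: abs(u[r][i]))
        if u[j][i] == 0.0:
            raise ValueError("matrix is singular")
        for matrix in (u, l, pt):
            matrix[i], matrix[j] = matrix[j], matrix[i]
        for k in range(i + 1, n):
            s = -u[k][i] / u[i][i]
            l[k][i] = -s
            u[k] = [a + s * b for a, b in zip(u[k], u[i])]
    for i in range(n):
        l[i][i] += 1.0                            # add the identity matrix to L
    return pt, l, u


def plu_solve(pt, l, u, b):
    """Solve L y = P^T b by forward substitution, then U x = y by backward substitution."""
    n = len(b)
    pb = [sum(pt[r][c] * b[c] for c in range(n)) for r in range(n)]
    y = []
    for r in range(n):
        y.append(pb[r] - sum(l[r][c] * y[c] for c in range(r)))  # diagonal of L is 1
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (y[r] - sum(u[r][c] * x[c] for c in range(r + 1, n))) / u[r][r]
    return x


# Hypothetical 3x3 system, built so that the exact solution is [1, 2, 3].
m = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [7.0, 19.0, 49.0]
pt, l, u = plu_decompose(m)
print(plu_solve(pt, l, u, b))
```

Because the decomposition is separate from the substitutions, the same `pt`, `l`, and `u` can be reused to solve for several right-hand sides. A zero in the last diagonal entry of U is not detected by the decomposition loop and would surface as a division by zero in the backward substitution.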

Example 4.1
Use PLU decomposition to solve the following matrix-vector problem. Keep
two decimals of precision:
2

0:7
6 1:4
6
4 7
1:4

4
8
0
0:8

7:4
2:8
1
2:75

32 3 2 3
4:3
x0
7
6 x1 7 6 4 7
0:6 7
76 7 6 7
2 54 x2 5 4 8 5
6:25
3
x3

Solution
Initialize the values of the matrices PT, L, and U:
2

1
60
T
P 6
40
0

0
1
0
0

0
0
1
0

3
2
0
0
60
07
7 L6
40
05
1
0

0
0
0
0

3
3
2
0
0:7 4 7:4 4:3
7
6
07
7 U 6 1:4 8 2:8 0:6 7
5
4
0
7
0
1
2 5
0
1:4 0:8 2:75 6:25

Find the element on or under the diagonal in column 0 that has the largest
absolute value. In this case, it is element u2,0, so rows 0 and 2 need to be
swapped in all three matrices.

3
2
2
0 0 1 0
0
60 1 0 07
60
T
7
6
6
L4
P 4
1 0 0 05
0
0 0 0 1
0

0
0
0
0

3
3
2
0
7
0
1
2
7
6
07
7 U 6 1:4 8 2:8 0:6 7
5
4
0
0:7 4 7:4 4:3 5
0
1:4 0:8 2:75 6:25

Then, for each row under the diagonal, compute the scalar value s.
For row 1, the value is s 1.4/7 0.2. That value will take position l1,0
in matrix L. Multiplying s times row 0 gives [1.4, 0, 0.2, 0.4], and
adding that to row 1 gives [0, 8, 3, 1]. For row 2 the scalar value
is s 0.7/7 0.1, which is saved as l2,0; multiplying s by row 0 gives
[0.7, 0, 0.1, 0.2], and adding that to row 2 gives [0, 4, 7.5, 4.5]. Finally,
for row 3, s 1.4/7 0.2, which is saved as l3,0; multiplying s by row
0 gives [1.4, 0, 0.2, 0.4], and adding that to row 3 gives [0, 0.8, 2.55, 6.65].
The resulting matrices are
2

0
6
0
PT 6
41
0

0
1
0
0

1
0
0
0

3
2
0
0
6 0:2
07
7 L6
4 0:1
05
1
0:2

0
0
0
0

3
3
2
0
7 0
1
2
6
07
3
1 7
7 U 60 8
7
4 0 4 7:5 4:5 5
05
0
0 0:8 2:55 6:65

Moving on to column 1, the largest absolute value element of that column

in U is already on the diagonal, so no swapping takes place. For row 2, the
scalar computed is s 4/8 0.5, and for row 3 it is s 0.8/8 0.1. These
values are saved in matrix L as l2,1 and l3,1, respectively, and after adding
multiplied versions of row 1 in U, the resulting matrices are
2

0
60
6
PT 6
41

0
1

1
0

2
3
0
0
6
7
07
6 0:2
7 L6
4 0:1
05

0 0
0 0 0 1
3
2
7 0
1
2
60 8
3
1 7
7
6
U6
7
4 0 0 9
5 5
0

2:25

0:2

0
0

0:5 0
0:1 0

3
0
07
7
7
05
0

6:75

Moving on to column 2, the largest absolute value element of that column

in U is again on the diagonal, so no swapping takes place. The scalar
computed for row 3 is s 2.25/(9) 0.25, which is saved in L as element
l3,2. Multiplying row 2 by s gives [0, 0, 2.25, 1.25] and adding that to row
3 gives the following matrices:

Example 4.1
2
0 0
60 1
T
6
P 4
1 0
0 0

(continued)
3
2
1 0
0
0
0
6
0 07
0
7 L 6 0:2 0
4 0:1 0:5
0 05
0
0 1
0:2 0:1 0:25

3
2
0
7
6
07
7 U 60
40
05
0
0

3
0 1 2
8 3 1 7
7
0 9 5 5
0 0 8

Since this operation is not done on the final row, the decomposition step is
over. The final step is to add a 4 4 identity matrix to L, to get the final
matrices:
2

0
6
0
PT 6
41
0

0
1
0
0

1
0
0
0

3
2
1
0
0
0
6 0:2 1
0
07
7 L6
4 0:1 0:5
1
05
0:2 0:1 0:25
1

3
2
7
0
60
07
7 U6
40
05
0
1

3
0 1 2
8 3 1 7
7
0 9 5 5
0 0 8

Now it is possible to solve the original system by substituting PLU for the matrix
M, then replacing Ux with a vector y:
$$
\begin{aligned}
Mx &= b \\
PLUx &= b \\
PL(Ux) &= b \\
PLy &= b
\end{aligned}
$$
2

6
6 0:2
6
6
6 0:1
4
0:2

0:5

0:1

Ly PT b
3 2
y0
0 0
0
76 7 6
6 7 6
07
76 y1 7 6 0 1
76 7 6
6 7 6
07
54 y2 5 4 1 0

0:25 1

0 0

32 3
7
76 7
6 7
07
76 4 7
76 7
6 7
07
54 8 5

1
0

By forward substitution, it is easy to find the values of y:

y0
0:2y0 y1
0:1y0 0:5y1 y2
0:2y0 0:1y1 0:25y2 y3
2 3 2
3
y0
8:0
6 y1 7 6 5:6 7
6 76
7
4 y2 5 4 3:4 5
1:7
y3

8
4
7
3

Then the system $Ux = y$ can be solved by backward substitution to get the
value of x:

2
7 0 1
60 8 3
6
4 0 0 9
0 0 0

3
32 3 2
8:0
2
x0
6 7 6
7
1 7
76 x1 7 6 5:6 7
5
4
4
5
3:4 5
x2
5
1:7
8
x3

8x3 1:7
9x2 5x3 3:4
8x1 3x2 x3 5:6
7x0 x2 2x3 8:0
2 3 2
3
x0
1:24
6 x1 7 6 0:82 7
6 76
7
4 x2 5 4 0:26 5
0:21
x3

4.3 Cholesky Decomposition

Under certain circumstances, it is possible to decompose a matrix M into the form
$LL^T$. Such a decomposition is called a Cholesky decomposition. It requires half the
memory and half the number of operations of a PLU decomposition, since there is
only one matrix L to compute. However, this technique can only be applied in
specific circumstances, namely, when the matrix M is real, symmetric, and
positive definite. The criterion of being real simply means that the matrix
contains no complex numbers, and symmetric means that for all non-diagonal
entries $m_{i,j} = m_{j,i}$. A symmetric matrix is positive definite when $x^T M x > 0$ for
every non-zero vector x; in practice, such a matrix can often be recognized as one where
all diagonal entries are positive, and each one is equal to or greater than the sum of
absolute values of all other entries in the row. While these criteria may seem
restrictive, matrices of this form often arise in engineering practice. The matrix of
Eq. (4.4), for example, is a real, symmetric, positive-definite matrix.
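
The symmetry and diagonal criteria can be checked mechanically. The sketch below assumes the coefficient matrix of Eq. (4.4) as read from the node equations of Eq. (4.2), stored as exact fractions; the helper names are illustrative. The row test is only the diagonal check described in the text, not a full proof of positive definiteness.

```python
from fractions import Fraction as F

# Coefficient matrix of Eq. (4.4), entered as exact fractions (assumed reading).
m = [
    [F(1, 120) + F(1, 240), -F(1, 240), F(0), F(0)],
    [-F(1, 240), F(1, 320) + F(1, 240) + F(1, 180) + F(1, 200), -F(1, 180), -F(1, 200)],
    [F(0), -F(1, 180), F(1, 160) + F(1, 180) + F(1, 360), -F(1, 360)],
    [F(0), -F(1, 200), -F(1, 360), F(1, 200) + F(1, 360)],
]

def is_symmetric(matrix):
    """True when every entry below the diagonal equals its mirror above it."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(i))

def passes_diagonal_test(matrix):
    """Positive diagonal, each entry at least the sum of |other entries| in its row."""
    for i, row in enumerate(matrix):
        others = sum(abs(v) for j, v in enumerate(row) if j != i)
        if row[i] <= 0 or row[i] < others:
            return False
    return True

print(is_symmetric(m), passes_diagonal_test(m))
```

Exact fractions are used so that the equality and inequality tests are not affected by rounding.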
The easiest way to understand the Cholesky decomposition is to visualize the
equality $M = LL^T$. This is presented in the case of a $4 \times 4$ matrix in Eq. (4.8). The
matrix M is symmetric, but so is the product $LL^T$. By inspection, a direct equality
can be seen between each element $m_{i,j}$ in M and the corresponding element in $LL^T$,
which gives a simple equation of elements of L.
$$
\begin{bmatrix}
m_{0,0} & m_{1,0} & m_{2,0} & m_{3,0} \\
m_{1,0} & m_{1,1} & m_{2,1} & m_{3,1} \\
m_{2,0} & m_{2,1} & m_{2,2} & m_{3,2} \\
m_{3,0} & m_{3,1} & m_{3,2} & m_{3,3}
\end{bmatrix}
=
\begin{bmatrix}
l_{0,0} & 0 & 0 & 0 \\
l_{1,0} & l_{1,1} & 0 & 0 \\
l_{2,0} & l_{2,1} & l_{2,2} & 0 \\
l_{3,0} & l_{3,1} & l_{3,2} & l_{3,3}
\end{bmatrix}
\begin{bmatrix}
l_{0,0} & l_{1,0} & l_{2,0} & l_{3,0} \\
0 & l_{1,1} & l_{2,1} & l_{3,1} \\
0 & 0 & l_{2,2} & l_{3,2} \\
0 & 0 & 0 & l_{3,3}
\end{bmatrix}
$$
$$
=
\begin{bmatrix}
l_{0,0}^2 & l_{0,0}l_{1,0} & l_{0,0}l_{2,0} & l_{0,0}l_{3,0} \\
l_{0,0}l_{1,0} & l_{1,0}^2 + l_{1,1}^2 & l_{1,0}l_{2,0} + l_{1,1}l_{2,1} & l_{1,0}l_{3,0} + l_{1,1}l_{3,1} \\
l_{0,0}l_{2,0} & l_{1,0}l_{2,0} + l_{1,1}l_{2,1} & l_{2,0}^2 + l_{2,1}^2 + l_{2,2}^2 & l_{2,0}l_{3,0} + l_{2,1}l_{3,1} + l_{2,2}l_{3,2} \\
l_{0,0}l_{3,0} & l_{1,0}l_{3,0} + l_{1,1}l_{3,1} & l_{2,0}l_{3,0} + l_{2,1}l_{3,1} + l_{2,2}l_{3,2} & l_{3,0}^2 + l_{3,1}^2 + l_{3,2}^2 + l_{3,3}^2
\end{bmatrix}
\tag{4.8}
$$
Furthermore, looking at the matrix $LL^T$ column by column (or row by row, since
it is symmetric), it can be seen that each element $l_{i,j}$ can be discovered by forward
substitution in order, starting from element $l_{0,0}$. Column 0 gives the set of equations:

$$l_{0,0} = \sqrt{m_{0,0}} \tag{4.9}$$

$$l_{1,0} = \frac{m_{1,0}}{l_{0,0}} \tag{4.10}$$

$$l_{2,0} = \frac{m_{2,0}}{l_{0,0}} \tag{4.11}$$

$$l_{3,0} = \frac{m_{3,0}}{l_{0,0}} \tag{4.12}$$

Then column 1 gives

$$l_{1,1} = \sqrt{m_{1,1} - l_{1,0}^2} \tag{4.13}$$

$$l_{2,1} = \frac{m_{2,1} - l_{1,0}l_{2,0}}{l_{1,1}} \tag{4.14}$$

$$l_{3,1} = \frac{m_{3,1} - l_{1,0}l_{3,0}}{l_{1,1}} \tag{4.15}$$

And columns 2 and 3 give

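$$l_{2,2} = \sqrt{m_{2,2} - l_{2,0}^2 - l_{2,1}^2} \tag{4.16}$$

$$l_{3,2} = \frac{m_{3,2} - l_{2,0}l_{3,0} - l_{2,1}l_{3,1}}{l_{2,2}} \tag{4.17}$$

$$l_{3,3} = \sqrt{m_{3,3} - l_{3,0}^2 - l_{3,1}^2 - l_{3,2}^2} \tag{4.18}$$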

The step-by-step algorithm to construct the matrix $L$ is thus apparent. For an $n \times n$
matrix $M$, take each column $j$ in order from $0$ to $n-1$. Then for each element $i$ where
$j \le i \le n-1$:

$$
l_{i,j} =
\begin{cases}
\sqrt{m_{i,j} - \displaystyle\sum_{k=0}^{j-1} l_{i,k}^2} & \text{if } i = j\\[2ex]
\dfrac{m_{i,j} - \displaystyle\sum_{k=0}^{j-1} l_{j,k}\,l_{i,k}}{l_{j,j}} & \text{if } i \ne j
\end{cases}
\tag{4.19}
$$

In other words, if the current element being computed is on the diagonal of $L$,
subtract from the corresponding diagonal element of $M$ the dot product of the
current row of $L$ (as constructed so far) with itself and take the square root of this
result. If the current element being computed is below the diagonal of $L$, subtract
from the corresponding element of $M$ the dot product of row $i$ and row $j$ of $L$
(as constructed so far), and divide this result by the column's diagonal
entry. The pseudo-code for an algorithm to compute this decomposition is given in
Fig. 4.3.

M ← Input n×n matrix
L ← n×n zero matrix
Index ← 0
WHILE (Index < n)
    element at row Index column Index of L ← square root of [(element at
        row Index column Index of M) − (sum of: square of element at
        row Index column 0 to Index − 1 of L)]
    BelowIndex ← Index + 1
    WHILE (BelowIndex < n)
        element at row BelowIndex column Index of L ← [(element at row
            BelowIndex column Index of M) − (sum of: element at row
            Index column 0 to Index − 1 of L × element at row BelowIndex
            column 0 to Index − 1 of L)] / (element at row Index column
            Index of L)
        BelowIndex ← BelowIndex + 1
    END WHILE
    Index ← Index + 1
END WHILE
RETURN L

Fig. 4.3 Pseudo-code of the Cholesky decomposition

Once the matrix $M$ has been decomposed, the system can be solved in a manner
very similar to the one used in the PLU decomposition technique. The matrix $M$ is
replaced by its decomposition in the $M\mathbf{x} = \mathbf{b}$ system to get

$$LL^T\mathbf{x} = \mathbf{b} \tag{4.20}$$

which can then be solved by replacing the matrix-vector product $L^T\mathbf{x}$ with a vector $\mathbf{y}$ and
performing a forward substitution step:

$$L\mathbf{y} = \mathbf{b} \tag{4.21}$$

followed by a backward substitution step to compute $\mathbf{x}$:

$$L^T\mathbf{x} = \mathbf{y} \tag{4.22}$$
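
The column-by-column construction of Eq. (4.19), followed by the two substitution steps of Eqs. (4.21) and (4.22), can be sketched in Python as follows. This is only a minimal illustration: the function names `cholesky` and `cholesky_solve` and the demonstration matrix `m_demo` are hypothetical and do not come from the text, and the matrix is assumed to be real, symmetric, and positive definite.

```python
import math


def cholesky(m):
    """Return the lower-triangular matrix L such that m = L * L^T.

    m is a list of lists, assumed real, symmetric and positive definite.
    """
    n = len(m)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):  # take the columns in order, as in Eq. (4.19)
        # Diagonal entry: subtract the dot product of row j with itself.
        s = m[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        for i in range(j + 1, n):  # entries below the diagonal
            dot = sum(l[j][k] * l[i][k] for k in range(j))
            l[i][j] = (m[i][j] - dot) / l[j][j]
    return l


def cholesky_solve(m, b):
    """Solve m x = b using L y = b (forward) then L^T x = y (backward)."""
    l = cholesky(m)
    n = len(b)
    y = [0.0] * n
    for i in range(n):  # forward substitution
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution with L^T
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


# Hypothetical demonstration system (not taken from the text).
m_demo = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
b_demo = [0.0, 6.0, 39.0]
print(cholesky(m_demo))
print(cholesky_solve(m_demo, b_demo))
```

Only the lower triangle of the matrix is read, which reflects the memory saving mentioned at the start of this section.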

Example 4.2
Use Cholesky decomposition to solve the system of Eq. (4.4). Keep four
decimals of precision.
Solution
Begin by writing the matrix M with four digits of precision, to make it easier
to work with:
$$
M = \begin{bmatrix}
0.0125 & -0.0042 & 0 & 0\\
-0.0042 & 0.0178 & -0.0056 & -0.0050\\
0 & -0.0056 & 0.0146 & -0.0028\\
0 & -0.0050 & -0.0028 & 0.0078
\end{bmatrix}
$$

Since it is a $4 \times 4$ matrix, it could be decomposed using the general formula of
Eq. (4.19), or the expanded set of equations from Eqs. (4.9) to (4.18) which is
obtained by applying Eq. (4.19) to each of the ten entries of $L$. The values
computed are

$$
\begin{aligned}
l_{0,0} &= \sqrt{0.0125} = 0.1118\\
l_{1,0} &= \frac{-0.0042}{0.1118} = -0.0373\\
l_{2,0} &= \frac{0}{0.1118} = 0\\
l_{3,0} &= \frac{0}{0.1118} = 0\\
l_{1,1} &= \sqrt{0.0178 - (-0.0373)^2} = 0.1283\\
l_{2,1} &= \frac{-0.0056 - (-0.0373)(0)}{0.1283} = -0.0433\\
l_{3,1} &= \frac{-0.0050 - (-0.0373)(0)}{0.1283} = -0.0390\\
l_{2,2} &= \sqrt{0.0146 - 0^2 - (-0.0433)^2} = 0.1127\\
l_{3,2} &= \frac{-0.0028 - (0)(0) - (-0.0433)(-0.0390)}{0.1127} = -0.0396\\
l_{3,3} &= \sqrt{0.0078 - 0^2 - (-0.0390)^2 - (-0.0396)^2} = 0.0685
\end{aligned}
$$


The matrix $M$ thus decomposes into the following matrix $L$:

$$
L = \begin{bmatrix}
0.1118 & 0 & 0 & 0\\
-0.0373 & 0.1283 & 0 & 0\\
0 & -0.0433 & 0.1127 & 0\\
0 & -0.0390 & -0.0396 & 0.0685
\end{bmatrix}
$$

The forward substitution step to compute the vector $\mathbf{y}$ is the following:

$$
\begin{bmatrix}
0.1118 & 0 & 0 & 0\\
-0.0373 & 0.1283 & 0 & 0\\
0 & -0.0433 & 0.1127 & 0\\
0 & -0.0390 & -0.0396 & 0.0685
\end{bmatrix}
\begin{bmatrix} y_0\\ y_1\\ y_2\\ y_3 \end{bmatrix} =
\begin{bmatrix} 0.01\\ 0\\ 0\\ -0.01 \end{bmatrix}
$$

and the vector $\mathbf{y}$ obtained is $[0.0894, 0.0260, 0.0100, -0.1255]^T$. This
vector is then used for the backward substitution step (keeping the vector
$\mathbf{v}$ from Eq. (4.4) instead of the vector $\mathbf{x}$ from Eq. (4.22)):

$$
\begin{bmatrix}
0.1118 & -0.0373 & 0 & 0\\
0 & 0.1283 & -0.0433 & -0.0390\\
0 & 0 & 0.1127 & -0.0396\\
0 & 0 & 0 & 0.0685
\end{bmatrix}
\begin{bmatrix} v_1\\ v_2\\ v_3\\ v_4 \end{bmatrix} =
\begin{bmatrix} 0.0894\\ 0.0260\\ 0.0100\\ -0.1255 \end{bmatrix}
$$

to get the values of the voltage vector $\mathbf{v} = [0.6195, -0.5415, -0.5553, -1.8321]^T$,
the values that solve the system and model the circuit of Fig. 4.1.

4.4 Jacobi Method

The PLU decomposition and Cholesky decomposition methods make it possible to
solve $M\mathbf{x} = \mathbf{b}$ systems with no prior information on what $\mathbf{x}$ may be. However, if
information on $\mathbf{x}$ is available, for instance, if a similar system has been solved in the
past, it could be beneficial to make use of it. The Jacobi method is an iterative algorithm
for solving $M\mathbf{x} = \mathbf{b}$ systems given some initial estimate of the value of $\mathbf{x}$.
It is important to note though that the Jacobi method can be used even without prior
knowledge of $\mathbf{x}$, by using a random vector or a zero vector. It will take longer to
converge to a solution in that case, but may still be more efficient than the PLU
decomposition, especially for very large systems, and it is not restricted to
positive-definite matrices like the Cholesky decomposition. The only requirement to use the
Jacobi method is that the matrix $M$ must have non-zero diagonal entries, although
convergence is only guaranteed under additional conditions, such as strict diagonal
dominance.
The Jacobi method begins by decomposing the matrix $M$ into the sum of two
matrices $D$ and $E$, where $D$ contains the diagonal entries of $M$ and zeros everywhere
else, and $E$ contains the off-diagonal entries of $M$ and zeros on the diagonal. For a
simple $3 \times 3$ example:

$$
\begin{aligned}
M &= D + E\\
\begin{bmatrix} a & b & c\\ d & e & f\\ g & h & i \end{bmatrix}
&=
\begin{bmatrix} a & 0 & 0\\ 0 & e & 0\\ 0 & 0 & i \end{bmatrix}
+
\begin{bmatrix} 0 & b & c\\ d & 0 & f\\ g & h & 0 \end{bmatrix}
\end{aligned}
\tag{4.23}
$$

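This makes it possible to rewrite the system in this way:

$$
\begin{aligned}
M\mathbf{x} &= \mathbf{b}\\
(D + E)\mathbf{x} &= \mathbf{b}\\
D\mathbf{x} + E\mathbf{x} &= \mathbf{b}\\
\mathbf{x} &= D^{-1}(\mathbf{b} - E\mathbf{x})
\end{aligned}
\tag{4.24}
$$

where the inverse matrix $D^{-1}$ is the reciprocal matrix of $D$, defined as

$$DD^{-1} = I \tag{4.25}$$

Moreover, in the special case where $D$ is already a diagonal matrix, the reciprocal $D^{-1}$
is simply a diagonal matrix containing the inverse of each diagonal scalar value:

$$
\begin{bmatrix} a & 0 & 0\\ 0 & e & 0\\ 0 & 0 & i \end{bmatrix}
\begin{bmatrix} \frac{1}{a} & 0 & 0\\ 0 & \frac{1}{e} & 0\\ 0 & 0 & \frac{1}{i} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1 \end{bmatrix}
\tag{4.26}
$$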

The Jacobi method implements an iterative algorithm using Eq. (4.24) to refine the
estimate of $\mathbf{x}$ at every iteration $k$ as follows:

$$\mathbf{x}_{k+1} = D^{-1}(\mathbf{b} - E\mathbf{x}_k) \tag{4.27}$$

x ← Input initial value vector of length n
b ← Input solution vector of length n
M ← Input n×n matrix
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
D ← n×n zero matrix
Diagonal elements of D ← 1 / (diagonal elements of M)
E ← M
Diagonal elements of E ← 0
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    x ← D × (b − E × x)
    CurrentError ← Euclidean distance between x and PreviousValue
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

Fig. 4.4 Pseudo-code of the Jacobi method

The iterations continue until one of two halting conditions is reached: either the
Euclidean distance (see Chap. 3) between two successive iterations of xk is less than
a target error, in which case the algorithm has converged to a good estimate of x, or
k increments to a preset maximum number of iterations, in which case the method
has failed to converge. The pseudo-code for this algorithm is presented in Fig. 4.4.
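
The iteration of Eq. (4.27) and its two halting conditions can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function name `jacobi`, its parameter names, and the small test system are hypothetical, and the function returns `None` to signal failure rather than a separate status flag.

```python
import math


def jacobi(m, b, x0, error_minimum, iteration_maximum):
    """Iterate x_{k+1} = D^{-1} (b - E x_k); return x, or None on failure.

    m must have non-zero diagonal entries.
    """
    n = len(b)
    x = list(x0)
    for _ in range(iteration_maximum):
        previous = x
        # Each new element is built only from the previous iteration's values.
        x = [
            (b[i] - sum(m[i][j] * previous[j] for j in range(n) if j != i)) / m[i][i]
            for i in range(n)
        ]
        # Success: two successive estimates are close enough.
        if math.dist(x, previous) <= error_minimum:
            return x
    return None  # failure: the iteration limit was reached


# Hypothetical diagonally dominant system (not taken from the text).
print(jacobi([[4.0, 1.0], [2.0, 5.0]], [9.0, 12.0], [0.0, 0.0], 1e-10, 200))
```

Dividing by `m[i][i]` plays the role of multiplying by $D^{-1}$, so the matrices $D$ and $E$ never need to be stored separately.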
Example 4.3
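Use the Jacobi method to solve the following system to an accuracy of 0.1,
keeping two decimals of precision and starting with a zero vector.

$$
\begin{bmatrix} 5 & 2 & -1\\ 3 & 7 & 3\\ 1 & -4 & 6 \end{bmatrix}
\begin{bmatrix} x_0\\ x_1\\ x_2 \end{bmatrix} =
\begin{bmatrix} 2\\ -1\\ 1 \end{bmatrix}
$$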

Solution
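Begin by decomposing the matrix into diagonal and off-diagonal matrices:

$$
\begin{bmatrix} 5 & 2 & -1\\ 3 & 7 & 3\\ 1 & -4 & 6 \end{bmatrix} =
\begin{bmatrix} 5 & 0 & 0\\ 0 & 7 & 0\\ 0 & 0 & 6 \end{bmatrix} +
\begin{bmatrix} 0 & 2 & -1\\ 3 & 0 & 3\\ 1 & -4 & 0 \end{bmatrix}
$$

Then build the iterative equation:

$$
\begin{bmatrix} x_{k+1,0}\\ x_{k+1,1}\\ x_{k+1,2} \end{bmatrix} =
\begin{bmatrix} \frac{1}{5} & 0 & 0\\ 0 & \frac{1}{7} & 0\\ 0 & 0 & \frac{1}{6} \end{bmatrix}
\left(
\begin{bmatrix} 2\\ -1\\ 1 \end{bmatrix} -
\begin{bmatrix} 0 & 2 & -1\\ 3 & 0 & 3\\ 1 & -4 & 0 \end{bmatrix}
\begin{bmatrix} x_{k,0}\\ x_{k,1}\\ x_{k,2} \end{bmatrix}
\right)
$$

It can be seen that the zeros off-diagonal in $D^{-1}$ will simplify the equations a
lot. In fact, the three values of $\mathbf{x}_{k+1}$ will be computed by these simple
equations:

$$
\begin{aligned}
x_{k+1,0} &= \tfrac{1}{5}\left(2 - 2x_{k,1} + x_{k,2}\right)\\
x_{k+1,1} &= \tfrac{1}{7}\left(-1 - 3x_{k,0} - 3x_{k,2}\right)\\
x_{k+1,2} &= \tfrac{1}{6}\left(1 - x_{k,0} + 4x_{k,1}\right)
\end{aligned}
$$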
Starting with $\mathbf{x}_0 = [0\ 0\ 0]^T$, the first iteration will give $\mathbf{x}_1 = [0.40\ {-0.14}\ 0.17]^T$.
The Euclidean distance between these two iterations is

$$E_1 = \|\mathbf{x}_1 - \mathbf{x}_0\| = \sqrt{(0.40 - 0)^2 + (-0.14 - 0)^2 + (0.17 - 0)^2} = 0.46$$

The following iterations will be

$$
\begin{aligned}
\mathbf{x}_2 &= [0.49\ {-0.39}\ 0.00]^T & \|\mathbf{x}_2 - \mathbf{x}_1\| &= 0.31\\
\mathbf{x}_3 &= [0.56\ {-0.36}\ {-0.17}]^T & \|\mathbf{x}_3 - \mathbf{x}_2\| &= 0.19\\
\mathbf{x}_4 &= [0.51\ {-0.31}\ {-0.16}]^T & \|\mathbf{x}_4 - \mathbf{x}_3\| &= 0.07
\end{aligned}
$$

The target accuracy has been reached after the fourth iteration. For reference,
the correct answer to this system is $\mathbf{x} = [0.50\ {-0.31}\ {-0.12}]^T$, which the
method approximated well. It could be noted that the value $x_{k,2}$ actually started
off in the wrong direction, starting at 0 and increasing to 0.17, before
dropping towards its correct negative value in subsequent iterations.

4.5 Gauss-Seidel Method

The Gauss-Seidel method is a technical improvement on the Jacobi method. The
iterative formula of Eq. (4.27) computes the vector $\mathbf{x}_{k+1}$ based on the vector $\mathbf{x}_k$,
using the matrix-vector multiplication $E\mathbf{x}_k$. This means that every element $x_{k+1,i}$ of
$\mathbf{x}_{k+1}$ will be computed using some combination of the elements $x_{k,j}$ of $\mathbf{x}_k$, including
the elements where $j < i$ for which better estimates have already been computed in
$\mathbf{x}_{k+1}$. But if better values of these elements are already available, why continue
using the older values in the computations? The Gauss-Seidel method proposes the
simple improvement of using each new value of $\mathbf{x}_{k+1}$ in the computation of
Eq. (4.27) as soon as it is available, rather than waiting to compute the entire vector
$\mathbf{x}_{k+1}$ using $\mathbf{x}_k$ only. This modifies the iterative Eq. (4.27) into

$$\mathbf{x}_{k+1} = D^{-1}(\mathbf{b} - E\mathbf{x}_{k+1}) \tag{4.28}$$

and adds the step to begin each iteration by setting $\mathbf{x}_{k+1} = \mathbf{x}_k$. Since the value of $\mathbf{x}_k$ is
converging, this simple improvement of using the updated values earlier in the
iterations actually allows it to converge faster. This change to the pseudo-code of
the Jacobi method is shown in Fig. 4.5. Notice that the single line computing the
new value of the vector $\mathbf{x}$ in the code of Fig. 4.4 has been replaced by a loop that
computes each new element of vector $\mathbf{x}$ one at a time and uses all new values from
previous iterations of the loop to compute subsequent ones.
Example 4.4
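Use the Gauss-Seidel method to solve the following system to an accuracy of
0.1, keeping two decimals of precision and starting with a zero vector.

$$
\begin{bmatrix} 5 & 2 & -1\\ 3 & 7 & 3\\ 1 & -4 & 6 \end{bmatrix}
\begin{bmatrix} x_0\\ x_1\\ x_2 \end{bmatrix} =
\begin{bmatrix} 2\\ -1\\ 1 \end{bmatrix}
$$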

Solution
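This is the same system as in Example 4.3, and using Eq. (4.28) will build an
almost identical iterative equation, with the only difference that it uses $\mathbf{x}_{k+1}$
instead of $\mathbf{x}_k$ in its computations:

$$
\begin{bmatrix} x_{k+1,0}\\ x_{k+1,1}\\ x_{k+1,2} \end{bmatrix} =
\begin{bmatrix} \frac{1}{5} & 0 & 0\\ 0 & \frac{1}{7} & 0\\ 0 & 0 & \frac{1}{6} \end{bmatrix}
\left(
\begin{bmatrix} 2\\ -1\\ 1 \end{bmatrix} -
\begin{bmatrix} 0 & 2 & -1\\ 3 & 0 & 3\\ 1 & -4 & 0 \end{bmatrix}
\begin{bmatrix} x_{k+1,0}\\ x_{k+1,1}\\ x_{k+1,2} \end{bmatrix}
\right)
$$

$$
\begin{aligned}
x_{k+1,0} &= \tfrac{1}{5}\left(2 - 2x_{k+1,1} + x_{k+1,2}\right)\\
x_{k+1,1} &= \tfrac{1}{7}\left(-1 - 3x_{k+1,0} - 3x_{k+1,2}\right)\\
x_{k+1,2} &= \tfrac{1}{6}\left(1 - x_{k+1,0} + 4x_{k+1,1}\right)
\end{aligned}
$$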
Starting with $\mathbf{x}_1 = \mathbf{x}_0 = [0\ 0\ 0]^T$, the first element will remain the same as was
computed with the Jacobi method, $x_{1,0} = 0.40$. However, that value is immediately
part of $\mathbf{x}_1$ and used in the computation of the next element.
Using $\mathbf{x}_1 = [0.40\ 0\ 0]^T$, the next element is $x_{1,1} = -0.31$. Then using
$\mathbf{x}_1 = [0.40\ {-0.31}\ 0]^T$, the last element is $x_{1,2} = -0.11$. This gives a final result
for the first iteration of $\mathbf{x}_1 = [0.40\ {-0.31}\ {-0.11}]^T$. The Euclidean distance
between the two iterations is

$$E_1 = \|\mathbf{x}_1 - \mathbf{x}_0\| = \sqrt{(0.40 - 0)^2 + (-0.31 - 0)^2 + (-0.11 - 0)^2} = 0.52$$

The following iterations will be

$$
\begin{aligned}
\mathbf{x}_2 &= [0.50\ {-0.31}\ {-0.13}]^T & \|\mathbf{x}_2 - \mathbf{x}_1\| &= 0.11\\
\mathbf{x}_3 &= [0.50\ {-0.30}\ {-0.12}]^T & \|\mathbf{x}_3 - \mathbf{x}_2\| &= 0.01
\end{aligned}
$$

The target accuracy has been reached after the third iteration, one sooner than
the Jacobi method achieved in Example 4.3. Comparing the results of the first
iteration obtained with the two methods demonstrates how beneficial using
the updated values is. While the first element computed will always be the same
with both methods, the second element computed here was $x_{1,1} = -0.31$
(or more precisely $x_{1,1} = -0.314$), almost exactly correct compared to the
correct answer of $-0.307$ and a much better estimate than the $x_{1,1} = -0.14$
computed by the Jacobi method. And the third element computed here was
$x_{1,2} = -0.11$, a very good step towards the correct answer of $-0.12$, and a
definite improvement compared to the Jacobi method which had started off
with a step in the wrong direction entirely, at $x_{1,2} = 0.17$.

x ← Input initial value vector of length n
b ← Input solution vector of length n
M ← Input n×n matrix
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
D ← n×n zero matrix
Diagonal elements of D ← 1 / (diagonal elements of M)
E ← M
Diagonal elements of E ← 0
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    Index ← 0
    WHILE (Index < n)
        element Index of x ← element Index of D × (b − E × x)
        Index ← Index + 1
    END WHILE
    CurrentError ← Euclidean distance between x and PreviousValue
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

Fig. 4.5 Pseudo-code of the Gauss-Seidel method
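
As a minimal sketch of the change described above, the Python function below updates the elements of the estimate one at a time and in place, so each new value is used as soon as it is available. The name `gauss_seidel`, its parameters, and the test system are hypothetical, and `None` is returned on failure as a simplifying assumption.

```python
import math


def gauss_seidel(m, b, x0, error_minimum, iteration_maximum):
    """Iterate Eq. (4.28) element by element; return x, or None on failure.

    m must have non-zero diagonal entries.
    """
    n = len(b)
    x = list(x0)
    for _ in range(iteration_maximum):
        previous = list(x)  # copy, because x is overwritten in place
        for i in range(n):
            # x already holds the new values for indices below i.
            off_diagonal = sum(m[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diagonal) / m[i][i]
        if math.dist(x, previous) <= error_minimum:
            return x
    return None


# Hypothetical diagonally dominant system (not taken from the text).
print(gauss_seidel([[4.0, 1.0], [2.0, 5.0]], [9.0, 12.0], [0.0, 0.0], 1e-10, 200))
```

The only structural difference from a Jacobi implementation is that the update writes into the same vector it reads from.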

4.6 Error Analysis

When solving an $M\mathbf{x} = \mathbf{b}$ system, one must be mindful to account for the error both
in the matrix M and the vector b, as well as for the propagation of that error to
the solution vector x. The error on the values in M and b will be dependent on the
method used to obtain those values; it could be, for example, the error on the
instruments used to measure these values or the error of the mathematical models
used to estimate the values. Since both M and b are used to compute x, it should be
no surprise that the error on x will be dependent on those of M and b. Unfortunately,
the error propagation is not linear: a 10 % relative error in either M or b does not
translate to a 10 % relative error in x, but could in fact be a lot more.

The error propagation involves a property of the matrix $M$ called the condition
number of the matrix, which is defined as

$$\mathrm{cond}(M) = \|M\|\,\|M^{-1}\| \tag{4.29}$$

where the double bar represents the Euclidean norm (or 2-norm) of the matrix
defined as the maximum that the matrix stretches the length of any vector, that is,
the largest value of $\frac{\|M\mathbf{v}\|}{\|\mathbf{v}\|}$ for any non-zero vector $\mathbf{v}$. This value may be calculated by finding
the square root of the maximum eigenvalue of the matrix multiplied by its
transpose:

$$\|M\| = \sqrt{\lambda_{\max}(MM^T)} \tag{4.30}$$

and the eigenvalue, in turn, is any scalar value $\lambda$ that is a solution to the
matrix-vector problem:

$$M\mathbf{v} = \lambda\mathbf{v} \tag{4.31}$$

where $\mathbf{v}$ is the eigenvector corresponding to $\lambda$. An algorithm to compute the maximum
eigenvalue will be discussed later in this section. The inverse matrix $M^{-1}$ in Eq. (4.29)
is the reciprocal matrix of $M$, which will also be discussed later in this section.
It is also necessary to know the relative error on the vector $\mathbf{b}$. If the
vector of absolute errors $\mathbf{e}_b$ giving the error on each value in $\mathbf{b}$ is known, then the
relative error on $\mathbf{b}$ will be the scalar $E_b$ computed as

$$E_b = \frac{\|\mathbf{e}_b\|}{\|\mathbf{b}\|} \tag{4.32}$$

where the double-bar operator is the Euclidean norm of the vector, or the square
root of the sum of squares of its elements. More formally, if $\mathbf{b}$ is an $n \times 1$ vector
$[b_0, \ldots, b_{n-1}]^T$:

$$\|\mathbf{b}\| = \sqrt{\sum_{i=0}^{n-1} b_i^2} \tag{4.33}$$

Finally, the relative error $E_x$ on the solution of the system $\mathbf{x}$ will be bounded by the
relative error on $\mathbf{b}$ and the condition number of $M$:

$$E_x \le E_b\,\mathrm{cond}(M) \tag{4.34}$$
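
The bound of Eq. (4.34) can be checked numerically on a small case. The Python sketch below is restricted to $2 \times 2$ matrices, for which the two eigenvalues of $MM^T$ follow directly from its trace and determinant; the helper names `cond_2x2`, `solve_2x2`, and `relative_error`, as well as the test system, are hypothetical and not taken from the text.

```python
import math


def cond_2x2(m):
    """Condition number (2-norm) of a non-singular 2x2 matrix."""
    (a, b), (c, d) = m
    trace = a * a + b * b + c * c + d * d  # trace of M M^T
    det = (a * d - b * c) ** 2             # determinant of M M^T
    disc = math.sqrt(trace * trace - 4.0 * det)
    # The eigenvalues of M M^T are (trace +/- disc) / 2; the norm of M is the
    # root of the larger one and the norm of M^{-1} is the inverse root of the
    # smaller one, so cond(M) is the root of their ratio.
    return math.sqrt((trace + disc) / (trace - disc))


def solve_2x2(m, b):
    """Solve a 2x2 system by Cramer's rule."""
    (a, c), (d, e) = m
    det = a * e - c * d
    return [(b[0] * e - c * b[1]) / det, (a * b[1] - d * b[0]) / det]


def relative_error(exact, perturbed):
    """Norm of the absolute error divided by the norm of the exact vector."""
    return math.dist(exact, perturbed) / math.hypot(*exact)


# Hypothetical system and perturbation.
m = [[2.0, 1.0], [1.0, 3.0]]
b, b_perturbed = [3.0, 5.0], [3.1, 5.0]
x, x_perturbed = solve_2x2(m, b), solve_2x2(m, b_perturbed)
print(relative_error(b, b_perturbed), relative_error(x, x_perturbed), cond_2x2(m))
```

The printed relative error on the solution should never exceed the relative error on the right-hand side multiplied by the condition number.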

Example 4.5
Consider the following $M\mathbf{x} = \mathbf{b}$ system:

$$
\begin{bmatrix} 5 & 1\\ -2 & 10 \end{bmatrix}
\begin{bmatrix} x_0\\ x_1 \end{bmatrix} =
\begin{bmatrix} 0.005\\ 0.901 \end{bmatrix}
$$

Compute the value of $\mathbf{x}$. If the value of $\mathbf{b}$ was perturbed to $[0.095, 0.901]^T$,
compute the relative error on $\mathbf{b}$, then compute the perturbed $\mathbf{x}$ and the relative
error on $\mathbf{x}$.
Solution
A two-equation two-variable system can be solved easily to find $\mathbf{x} = [-0.016, 0.087]^T$.
Using the perturbed value of $\mathbf{b}$, which we will note $\mathbf{b}_p$, the perturbed
solution vector is $\mathbf{x}_p = [0.001, 0.090]^T$.
The absolute error caused by the perturbation in each case can be computed by
taking the absolute-valued difference between the correct vector and
its perturbed version. This gives $\mathbf{e}_b = [0.090, 0]^T$ and $\mathbf{e}_x = [0.017, 0.003]^T$.
The relative error on each vector is

$$E_b = \frac{\|\mathbf{e}_b\|}{\|\mathbf{b}\|} = \frac{\sqrt{0.090^2 + 0^2}}{\sqrt{0.005^2 + 0.901^2}} = 0.100$$

$$E_x = \frac{\|\mathbf{e}_x\|}{\|\mathbf{x}\|} = \frac{\sqrt{0.017^2 + 0.003^2}}{\sqrt{0.016^2 + 0.087^2}} = 0.195$$

Thus, a perturbation causing a relative error of 10 % on $\mathbf{b}$ has caused a
corresponding relative error of over 19 % on the solution $\mathbf{x}$.

Example 4.6
An $M\mathbf{x} = \mathbf{b}$ system has the following matrix of coefficients:

$$M = \begin{bmatrix} 3 & 1 & 1\\ 2 & 7 & 3\\ 4 & 2 & 9 \end{bmatrix}$$

and the value of $\mathbf{b}$ is being measured by instruments that have a relative error
of 5 %. What relative error can be expected on the value of $\mathbf{x}$?

Solution
Answering this question requires knowing the condition number of the matrix
$M$. The first step to computing Eq. (4.29) is to multiply

$$MM^T = \begin{bmatrix} 11 & 16 & 23\\ 16 & 62 & 49\\ 23 & 49 & 101 \end{bmatrix}$$

The maximum eigenvalue of that matrix is 140.31 (see Example 4.8), and the
square root of it is 11.85. Next, compute the inverse matrix (see Example 4.7):

$$M^{-1} = \begin{bmatrix} 0.40 & -0.05 & -0.03\\ -0.04 & 0.16 & -0.05\\ -0.17 & -0.01 & 0.13 \end{bmatrix}$$

multiply it by its transpose and find the largest eigenvalue to be 0.20, the
square root of which is 0.45. Finally, we can use these values in Eq. (4.29):

$$\mathrm{cond}(M) = 11.85 \times 0.45 = 5.33$$

The relative error on the vector $\mathbf{b}$ is given to be 5 %, or 0.05. Using
Eq. (4.34), the relative error on $\mathbf{x}$ will be

$$E_x \le 0.05 \times 5.33 = 0.267$$

which means the relative error on $\mathbf{x}$ will be bounded to 26.7 % at a maximum.

4.6.1 Reciprocal Matrix

The inverse matrix or reciprocal matrix $M^{-1}$ of a matrix $M$ is a matrix such that

$$MM^{-1} = I \tag{4.35}$$

and is computed as the adjoint matrix (the transpose of the cofactor matrix) of
$M$ divided by the determinant of $M$.
The cofactor of a matrix given an element $m_{i,j}$ is the determinant of the
submatrix obtained by deleting row $i$ and column $j$. The cofactor matrix of an
$n \times n$ matrix $M$ is the $n \times n$ matrix obtained by computing the cofactor given each
element $m_{i,j}$ in the corresponding position of $M$ and alternating $+$ and $-$ signs, with
the initial cofactor given $m_{0,0}$ having a positive sign.
The determinant of a square matrix $M$, written $|M|$, is a scalar value obtained by
computing the expansion by cofactors, or the sum of each cofactor given $m_{0,j}$
multiplied by $m_{0,j}$ for the first row of the matrix, alternating $+$ and $-$ signs, and
starting with a positive sign. Since the cofactor given $m_{0,j}$ is itself the determinant of
the submatrix with one less row and column, this will compute determinants
recursively, until the scalar case where the determinant is the scalar number itself.
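Two common cases of the reciprocal matrix can help clarify notions. If the
matrix $M$ is $2 \times 2$,

$$M = \begin{bmatrix} a & b\\ c & d \end{bmatrix} \tag{4.36}$$

its cofactor matrix will be

$$\begin{bmatrix} |d| & -|c|\\ -|b| & |a| \end{bmatrix} = \begin{bmatrix} d & -c\\ -b & a \end{bmatrix} \tag{4.37}$$

then its adjoint will be the transpose of the cofactor matrix:

$$\begin{bmatrix} d & -c\\ -b & a \end{bmatrix}^T = \begin{bmatrix} d & -b\\ -c & a \end{bmatrix} \tag{4.38}$$

and finally its reciprocal will be the adjoint divided by the determinant:

$$M^{-1} = \begin{bmatrix} a & b\\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b\\ -c & a \end{bmatrix} \tag{4.39}$$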

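Likewise, if $M$ is a $3 \times 3$ matrix,

$$M = \begin{bmatrix} a & b & c\\ d & e & f\\ g & h & i \end{bmatrix} \tag{4.40}$$

its reciprocal will be

$$
\begin{aligned}
M^{-1} &= \frac{1}{|M|}
\begin{bmatrix}
\begin{vmatrix} e & f\\ h & i \end{vmatrix} & -\begin{vmatrix} d & f\\ g & i \end{vmatrix} & \begin{vmatrix} d & e\\ g & h \end{vmatrix}\\[2ex]
-\begin{vmatrix} b & c\\ h & i \end{vmatrix} & \begin{vmatrix} a & c\\ g & i \end{vmatrix} & -\begin{vmatrix} a & b\\ g & h \end{vmatrix}\\[2ex]
\begin{vmatrix} b & c\\ e & f \end{vmatrix} & -\begin{vmatrix} a & c\\ d & f \end{vmatrix} & \begin{vmatrix} a & b\\ d & e \end{vmatrix}
\end{bmatrix}^T\\[2ex]
&= \frac{1}{a(ei - hf) - b(di - gf) + c(dh - ge)}
\begin{bmatrix}
ei - fh & ch - bi & bf - ce\\
fg - di & ai - cg & cd - af\\
dh - eg & bg - ah & ae - bd
\end{bmatrix}
\end{aligned}
\tag{4.41}
$$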
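
The recursive definitions above translate almost directly into code. The following Python sketch computes the determinant by expansion along the first row and the inverse as the transposed cofactor matrix divided by the determinant; the function names `minor`, `determinant`, and `inverse` are hypothetical, and the approach is intended for small matrices only, since the recursion becomes very expensive as $n$ grows. The matrix used for illustration is the $3 \times 3$ matrix of Example 4.6.

```python
def minor(m, i, j):
    """Submatrix of m obtained by deleting row i and column j."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]


def determinant(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 0:
        return 1  # convention that makes the 1x1 case work in inverse()
    if len(m) == 1:
        return m[0][0]
    return sum(
        (-1) ** j * m[0][j] * determinant(minor(m, 0, j)) for j in range(len(m))
    )


def inverse(m):
    """Adjoint matrix (transposed cofactor matrix) divided by the determinant."""
    n = len(m)
    det = determinant(m)
    if det == 0:
        raise ValueError("matrix is singular")
    cofactors = [
        [(-1) ** (i + j) * determinant(minor(m, i, j)) for j in range(n)]
        for i in range(n)
    ]
    # Transpose while dividing: entry (i, j) of the inverse uses cofactor (j, i).
    return [[cofactors[j][i] / det for j in range(n)] for i in range(n)]


m = [[3, 1, 1], [2, 7, 3], [4, 2, 9]]
print(determinant(m))
for row in inverse(m):
    print([round(value, 2) for value in row])
```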

Example 4.7
Compute the inverse of this matrix, using two decimals of precision:

$$M = \begin{bmatrix} 3 & 1 & 1\\ 2 & 7 & 3\\ 4 & 2 & 9 \end{bmatrix}$$

Solution
The cofactor matrix is obtained by computing, at each position $(i,j)$, the
determinant of the submatrix without row $i$ and column $j$. So, for example, at
position $(0,0)$,

$$\begin{vmatrix} 7 & 3\\ 2 & 9 \end{vmatrix} = 7 \times 9 - 2 \times 3 = 57$$

and at position $(1,2)$,

$$\begin{vmatrix} 3 & 1\\ 4 & 2 \end{vmatrix} = 3 \times 2 - 4 \times 1 = 2$$

The cofactors are assembled into the matrix with alternating signs, to create
the cofactor matrix:

$$\begin{bmatrix} 57 & -6 & -24\\ -7 & 23 & -2\\ -4 & -7 & 19 \end{bmatrix}$$

and the adjoint matrix is simply the transpose of that one.
Next, the determinant can be computed as the sum of the cofactor of each
element in the first row multiplied by the corresponding element, with
alternating signs:

$$
\begin{vmatrix} 3 & 1 & 1\\ 2 & 7 & 3\\ 4 & 2 & 9 \end{vmatrix} =
3\begin{vmatrix} 7 & 3\\ 2 & 9 \end{vmatrix}
- 1\begin{vmatrix} 2 & 3\\ 4 & 9 \end{vmatrix}
+ 1\begin{vmatrix} 2 & 7\\ 4 & 2 \end{vmatrix} = 141
$$

Finally, the inverse matrix is the adjoint matrix divided by the
determinant:

$$
M^{-1} = \frac{1}{141}\begin{bmatrix} 57 & -7 & -4\\ -6 & 23 & -7\\ -24 & -2 & 19 \end{bmatrix} =
\begin{bmatrix} 0.40 & -0.05 & -0.03\\ -0.04 & 0.16 & -0.05\\ -0.17 & -0.01 & 0.13 \end{bmatrix}
$$

4.6.2 Maximum Eigenvalue

A simple iterative algorithm can be used to find the maximum eigenvalue of an
$n \times n$ matrix $M$.
1. Begin by defining an $n \times 1$ vector of random values $\mathbf{x}_0$.
2. At each iteration step $k$, compute:

$$\mathbf{x}_{k+1} = \frac{M\mathbf{x}_k}{\|\mathbf{x}_k\|} \tag{4.42}$$

where $\|\mathbf{x}_k\|$ is the Euclidean norm of the vector.
3. End the iterations when a predefined maximum number of steps has been
reached (failure condition) or when the relative error between the Euclidean
norm of $\mathbf{x}_k$ and of $\mathbf{x}_{k+1}$ is less than a target error threshold (success condition).
If the iterative algorithm ends in success, then the maximum eigenvalue of $M$ is
the Euclidean norm $\|\mathbf{x}_{k+1}\|$. That is why the error used in the iterative algorithm is
the relative error between the norms, rather than the Euclidean distance between the
vectors, which is what would normally be used in an algorithm like this one as
explained in Chap. 3. The pseudo-code for this algorithm is presented in Fig. 4.6.

x ← Input initial value vector of length n
M ← Input n×n matrix
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
IterationCounter ← 0
WHILE (TRUE)
    PreviousNorm ← Euclidean norm of x
    x ← M × x / PreviousNorm
    Eigenvalue ← square root of (sum of square of elements of x)
    CurrentError ← absolute value of (Eigenvalue − PreviousNorm) / Eigenvalue
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, Eigenvalue
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

Fig. 4.6 Pseudo-code of the maximum eigenvalue algorithm
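
The three steps of the algorithm can be sketched in Python as follows. The function name `max_eigenvalue` and its parameter names are hypothetical, failure is signalled by returning `None`, and the sketch assumes the iterates never become the zero vector. The call at the end uses the matrix and starting vector of Example 4.8.

```python
import math


def max_eigenvalue(m, x0, error_minimum, iteration_maximum):
    """Iterate x_{k+1} = M x_k / ||x_k|| and return ||x_{k+1}||, or None."""
    x = list(x0)
    norm = math.hypot(*x)
    for _ in range(iteration_maximum):
        x = [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) / norm for row in m]
        new_norm = math.hypot(*x)
        # Relative error between the norms of two successive iterations.
        if abs(new_norm - norm) / new_norm <= error_minimum:
            return new_norm
        norm = new_norm
    return None


m = [[11.0, 16.0, 23.0], [16.0, 62.0, 49.0], [23.0, 49.0, 101.0]]
print(max_eigenvalue(m, [0.1, 0.2, 0.3], 0.01, 100))
```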

Example 4.8
Compute the maximum eigenvalue of this matrix with two decimals to a
relative error of less than 0.01.

$$M = \begin{bmatrix} 11 & 16 & 23\\ 16 & 62 & 49\\ 23 & 49 & 101 \end{bmatrix}$$

Solution
Start with a vector of random values, such as $\mathbf{x}_0 = [0.1, 0.2, 0.3]^T$. Then
compute the iterations, keeping track of the error at each one.

$$\mathbf{x}_1 = \frac{M\mathbf{x}_0}{\|\mathbf{x}_0\|} = \frac{[11.20\ 28.70\ 42.40]^T}{\sqrt{0.1^2 + 0.2^2 + 0.3^2}} = [29.93\ 76.70\ 113.32]^T$$

$$e_1 = \left|\frac{\|\mathbf{x}_1\| - \|\mathbf{x}_0\|}{\|\mathbf{x}_1\|}\right| = \left|\frac{140.07 - 0.37}{140.07}\right| = 1.00$$

$$\mathbf{x}_2 = [29.72\ 77.01\ 113.46]^T \qquad e_2 = 0.002$$

The error has dropped below the target after the second iteration. The
maximum eigenvalue is the norm:

$$\|\mathbf{x}_2\| = \sqrt{29.72^2 + 77.01^2 + 113.46^2} = 140.31$$

4.7 Summary

Systems of linear algebraic equations linking n variables in n equations arise

frequently in engineering practice, and so modelling and solving them accurately
and efficiently is of great importance. This chapter has introduced four methods to
that end. The first method, the PLU decomposition method, can work on any
non-singular matrix. The second method, the Cholesky decomposition, is more
limited and only works on symmetric positive-definite matrices; however, such
matrices often model engineering systems, and when it can be used, it is much more
efficient than the PLU decomposition in terms of time and space complexity. The
Jacobi method and its improved version the Gauss-Seidel method are both iterative
methods whose only requirement is that the matrix must have non-zero diagonal
entries. These methods converge most quickly when an initial estimate of the
solution is available to begin the iterations, but like any iterative algorithm, they
carry the risk of not converging to a solution at all. Whichever method is used, the
fact remains that errors in the matrix and the result vector of the system will
propagate to the solution vector computed. The chapter thus ended by presenting
the mathematical tools needed to estimate the upper bound of that error.

4.8 Exercises

1. Solve the following $M\mathbf{x} = \mathbf{b}$ systems using the PLU decomposition.

2
3
2
3
0:6 2:6 5:8
16:4
(a) M 4 6:0 2:0
1:0 5 b 4 15:4 5
1:2 8:4 2:8
12:0
3
2
3
2
3:0 1:0 1:6 8:6
0:8
6 0:0 2:0 7:0 0:8 7
7
6
7 b 6 22:0 7
(b) M 6
5
4 6:0 2:0
4
1:0
0:0
3:6 5
1:2 4:6 0:2
3:0
0:2
3
3
2
2
10:0 3:0
2:0
3:0
14:0
7
6 1:0 0:3 11:8 3:3 7
6
7 b 6 12:9 7
(c) M 6
5
4 2:0 3:4
4
1:3
8:1
5:8 5
2:0 8:6
1:4
2:6
19:4


2. Solve the following $M\mathbf{x} = \mathbf{b}$ systems using the Cholesky decomposition.

2
3
2
3
81 18
9
909
(a) M 4 18 68 14 5 b 4 506 5
9 14 41
61
2
3
2
3
49 14 28
196
(b) M 4 14 85
46 5 b 4 25 5
28
46 173
529
2
3
3
2
9:00
0:60 0:30 1:50
2:49
6 0:60
7
6
16:04
1:18 1:50 7
7 b 6 0:57 7
(c) M 6
5
4 0:30 1:18
4
4:10 0:57
0:79 5
1:50 1:50 0:57 25:45
2:21
3
2
3
2
49
14
7
7
14
6 14
6 46 7
29 12
2 7
7
7
6
(d) M 6
4 7 12 41 19 5 b 4 36 5
7
2
19 35
32
3
2
3
2
4:00
0:40
0:80 0:20
0:20
6 0:40
7
6
1:04 0:12 0:28 7
7 b 6 0:32 7
(e) M 6
5
4 0:80 0:12 9:20
4
1:40
13:52 5
0:20 0:28
1:40
4:35
14:17
3. Solve the following $M\mathbf{x} = \mathbf{b}$ systems using the Jacobi method given the initial
values of $\mathbf{x}$ and the target accuracy required.

2:00 1:00
1:00
0:00
(a) M
b
x
accuracy 0:1
1:00 10:00
2:00
0:00
2
3
2
3
2
3
2:05
0:45
5:02 2:01 0:98
(b) M 4 3:03 6:95 3:04 5 b 4 1:02 5 x 4 0:41 5 accuracy 0:001
1:01 3:99 5:98
0:98
0:01
4. Redo exercise 3 using the Gauss-Seidel method.

5 0
and a vector b with a relative error of 5 %, what
5. Given the matrix M
0 2
will the error on x be bounded
to?

2:0 0:2
2 3
, what will the
and a vector b
6. Given the matrix M
4:0 0:1
1 2
error on x be bounded to?

0 1
?
7. What is the maximum eigenvalue of the matrix M
1 1

Chapter 5

Taylor Series

5.1 Introduction

When zooming in close enough on a curve, it starts to look like a
straight line. This can be tested easily by using any graphic software to draw a
curve, and then zooming into a smaller and smaller region of it. It is also the reason
why the Earth appears flat to us; it is of course spherical, but humans on its surface
see a small portion up close so that it appears like a plane. This leads to the intuition
for the third mathematical and modelling tool in this book: it is possible to represent
a high-order polynomial (such as a curve or a sphere) with a lower-order polynomial
(such as a line or a plane), at least over a small region. The mathematical tool
that allows this is called the Taylor series. And, since the straight line mentioned in
the first intuitive example is actually the tangent of the curve, whose slope is the
first derivative, it should come as no surprise that the Taylor series makes heavy use
of the derivatives of the functions being modelled.

5.2 Taylor Series and nth-Order Approximation

Assume a function $f(x)$ which has infinitely many continuous derivatives (which can be zero
after a point). Assume furthermore that the function has a known value at a point $x_i$,
and that an approximation of the function's value is needed at another point
$x$ (which will generally be near $x_i$). That approximation can be obtained by
expanding the derivatives of the function around $x_i$ in this manner:

Springer International Publishing Switzerland 2016

R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_5

$$
f(x) = f(x_i) + f^{(1)}(x_i)(x - x_i) + \frac{f^{(2)}(x_i)}{2!}(x - x_i)^2 + \frac{f^{(3)}(x_i)}{3!}(x - x_i)^3 + \cdots + \frac{f^{(n)}(x_i)}{n!}(x - x_i)^n + \cdots
\tag{5.1}
$$

Equation (5.1) can be written in an equivalent but more compact form as:
$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_i)}{k!}(x - x_i)^k \tag{5.2}$$

This expansion is called the Taylor series. Note that Eq. (5.2) makes explicit the
divisions by $0!$ and $1!$, which were not written in Eq. (5.1) because both those
factorials are equal to 1, as well as a multiplication by $(x - x_i)^0$ also excluded from
Eq. (5.1) for the same reason. In the special case where $x_i = 0$, the series simplifies
into Eq. (5.3), which is also called the Maclaurin series.

$$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!}x^k \tag{5.3}$$

Instead of speaking of a general point $x$ in the vicinity of $x_i$, it can be useful
to formally define a point $x_{i+1}$ exactly one step $h$ distant from $x_i$. This updates
Eq. (5.2) as:

$$
f(x_{i+1}) = f(x_i + h) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_i)}{k!}(x_{i+1} - x_i)^k = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_i)}{k!}h^k
\tag{5.4}
$$

This form of the Taylor series given in Eq. (5.4) is the one that will be most useful in
this textbook. One problem with it, however, is that it requires computing an infinite
summation, something that is never practical to do in engineering! Consequently, it
is usually truncated at some value n as such:
$$f(x_i + h) \approx \sum_{k=0}^{n} \frac{f^{(k)}(x_i)}{k!}h^k \tag{5.5}$$

In this truncated form, it is an nth-order Taylor series approximation. As indicated

by the name and by the fact it eliminates an infinite number of terms from the
summation, in this form the series only computes an approximation of the value of
the function at xi+1, not the actual exact value.
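
As a minimal sketch of the truncated sum of Eq. (5.5), the Python function below evaluates an nth-order approximation from a list of derivative values at $x_i$. The function name `taylor_approximation` and its argument layout are hypothetical. The illustration uses $f(x) = e^x$ around $x_i = 0$, where every derivative equals 1.

```python
import math


def taylor_approximation(derivatives_at_xi, h, n):
    """nth-order Taylor approximation of f(x_i + h).

    derivatives_at_xi[k] holds the k-th derivative of f at x_i, with
    derivatives_at_xi[0] being f(x_i) itself.
    """
    return sum(
        derivatives_at_xi[k] * h ** k / math.factorial(k) for k in range(n + 1)
    )


# f(x) = e^x around x_i = 0: all derivatives are e^0 = 1.
derivatives = [1.0, 1.0, 1.0, 1.0]
for order in range(4):
    print(order, taylor_approximation(derivatives, 1.0, order))
print("exact:", math.exp(1.0))
```

Each additional order adds one more term of the sum, so the printed values move towards the exact value as the order increases.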

Example 5.1
Approximate the value of the following function $f(x)$ at a step $h = 1$ after
$x_0 = 0$, using a 0th, 1st, 2nd, and 3rd-order Taylor series approximation.

$$f(x) = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots$$

Solution
To begin, note that the summation is actually the expansion of $f(x) = e^x$. This
means that the value $f(1)$ that is being approximated is the constant
$e = 2.7183\ldots$ Note as well that the summation is infinitely differentiable and
that each derivative is equal to the original summation, as is the case for
$f(x) = e^x$:

$$f(x) = f^{(1)}(x) = f^{(2)}(x) = f^{(3)}(x) = \cdots$$

The Taylor series expansion of the function at any step $h$ after $x_0 = 0$ is:

$$
f(0 + h) = f(h) = \sum_{k=0}^{\infty} \frac{f^{(k)}(0)}{k!}h^k
= f(0) + f^{(1)}(0)h + \frac{f^{(2)}(0)}{2!}h^2 + \frac{f^{(3)}(0)}{3!}h^3 + \frac{f^{(4)}(0)}{4!}h^4 + \cdots
$$

and the 0th, 1st, 2nd, and 3rd-order approximations truncate the series after
the first, second, third, and fourth terms respectively. In other words, the
0th-order approximation is:

$$f(h) \approx f(0) \qquad f(1) \approx 1$$

The 1st-order approximation is:

$$f(h) \approx f(0) + f^{(1)}(0)h = 1 + h \qquad f(1) \approx 2$$

The 2nd-order approximation is:

$$f(h) \approx f(0) + f^{(1)}(0)h + \frac{f^{(2)}(0)}{2!}h^2 = 1 + h + \frac{h^2}{2!} \qquad f(1) \approx 2.5$$

And the 3rd-order approximation is:

$$f(h) \approx f(0) + f^{(1)}(0)h + \frac{f^{(2)}(0)}{2!}h^2 + \frac{f^{(3)}(0)}{3!}h^3 = 1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} \qquad f(1) \approx 2.67$$
Comparing these approximations to the original equation in the example, it

can be seen that every additional order makes the expansion f(h) match the
original function f(x) more closely. Likewise, knowing that f(1) 2.7183. . .,
it can be seen that with every additional order, the approximation gets closer
to the real value.
To further illustrate the situation, the following figure compares f(x) in
black to the 0th-order approximation f(h) in red, the 1st-order approximation
f(h) in dashed blue, the 2nd-order approximation f(h) in green, and the
3rd-order approximation f(h) in dashed orange. Once again, it can be seen
that the higher the order, the more closely the approximation matches the real
function.

Alternatively, it can be seen that all five functions go through the same
point $f(0)$. However, the 0th approximation in red then diverges immediately
from the correct function in black, while the 1st approximation in blue
matches the correct function over a short step to about 0.1, the 2nd approximation
in green follows the function in black over a longer step to approximately 0.3,
and the 3rd approximation in orange has the longest overlap
with the function in black, to a step of almost 0.7. This highlights another
understanding of the Taylor series approximation: the greater the approximation
order, the greater the step for which it will give an accurate
approximation.

5.3 Error Analysis

When expanded to infinity, the Taylor series of Eq. (5.4) is exactly equivalent to the
original function. That is to say, the error in that case is null. Problems arise
however when the series is truncated into the nth-order approximation of
Eq. (5.5). Clearly, the truncated series is not equivalent to the complete series nor
to the original equation, and there is an approximation error to account for.
Comparing Eqs. (5.4) and (5.5), it is clear to see that the error will be exactly
equal to the truncated portion of the series:
$$f(x_i + h) - \sum_{k=0}^{n} \frac{f^{(k)}(x_i)}{k!} h^k = \sum_{k=n+1}^{\infty} \frac{f^{(k)}(x_i)}{k!} h^k \tag{5.6}$$

Unfortunately, this brings back the summation to infinity that the nth-order
approximation was meant to eliminate. Fortunately, there is a way out of this, by noting
that the terms of the Taylor series are ordered in decreasing absolute value. That is to
say, each term contributes less than the previous one but more than the next one to the
total summation. This phenomenon could be observed in Example 5.1: note that the 0th-
and 1st-order approximations each add a value of 1 to the total, the 2nd-order
approximation adds a value of 0.5, and the 3rd-order approximation a value of 0.17.
Likewise, graphically in that example, it can be seen that while each step brings
the approximation closer to the real value, it also leaves much less room for further
improvement with the infinite number of remaining terms. This observation can be
formalized by writing:
$$\sum_{k=n+1}^{\infty} \frac{f^{(k)}(x_i)}{k!} h^k \approx \frac{f^{(n+1)}(x_i)}{(n+1)!} h^{n+1} \tag{5.7}$$

In other words, the error on an nth-order Taylor series approximation will be
proportional to the (n + 1)th term of the series. Using the big O notation introduced
in Chap. 1, Eqs. (5.6) and (5.7) can also be written as:
$$f(x_i + h) = \sum_{k=0}^{n} \frac{f^{(k)}(x_i)}{k!} h^k + O(h^{n+1}) \tag{5.8}$$

Special care should be taken with Eq. (5.7) when dealing with series that
alternate zero and non-zero terms (such as trigonometric functions). If the
(n + 1)th term happens to be one of the zero terms of the series, it should not be
mistaken for the approximation having no error! Rather, in that case, the (n + 1)th
term and all subsequent zero terms should be skipped, and the error will be
proportional to the next non-zero term.


Example 5.2
What is the error of the 1st-order Taylor series approximation of the following
function at a step h = 1 after $x_0 = 0$?
$$f(x) = 1 - 0.2x + 0.6x^2 - 0.3x^3 + 0.5x^4 - 0.1x^5$$
Solution
The 1st-order approximation is:
$$f(x_0 + h) = f(x_0) + f^{(1)}(x_0)\,h + E(x_0)$$
where the error term $E(x_0)$ is the 2nd-order term:
$$E(x_0) = \frac{f^{(2)}(x_0)}{2!} h^2$$
and the relevant derivatives are:
$$f^{(1)}(x) = -0.2 + 1.2x - 0.9x^2 + 2x^3 - 0.5x^4$$
$$f^{(2)}(x) = 1.2 - 1.8x + 6x^2 - 2x^3$$
The series can now be evaluated at step h = 1 after $x_0 = 0$:
$$f(x_0) = 1$$
$$f^{(1)}(x_0) = -0.2$$
$$f^{(2)}(x_0) = 1.2$$
$$f(x_0 + h) = 1 - 0.2(1) + E(x_0) = 0.8 + E(x_0)$$
$$E(x_0) = \frac{1.2}{2}(1)^2 = 0.6$$
The function is thus approximated as 0.8 with an error on the order of 0.6.
To verify, the actual value of f(1) can be computed from the equation as 1.5, so
the approximation has an absolute error of 0.7, which is indeed on the same
order as 0.6.
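
The numbers of this example can be checked with a few lines of code. This is only a sketch: the signs of the polynomial coefficients are taken as those that reproduce the stated values f(1) = 1.5 and the approximation 0.8, and the function names `f`, `d1` and `d2` are chosen here for illustration.

```python
def f(x):
    # Polynomial of Example 5.2 (signs assumed so that f(1) = 1.5).
    return 1 - 0.2 * x + 0.6 * x**2 - 0.3 * x**3 + 0.5 * x**4 - 0.1 * x**5

def d1(x):
    # First derivative of f.
    return -0.2 + 1.2 * x - 0.9 * x**2 + 2 * x**3 - 0.5 * x**4

def d2(x):
    # Second derivative of f.
    return 1.2 - 1.8 * x + 6 * x**2 - 2 * x**3

x0, h = 0.0, 1.0
approx = f(x0) + d1(x0) * h              # 1st-order approximation
estimate = d2(x0) / 2 * h**2             # 2nd-order term used as error estimate
actual_error = abs(f(x0 + h) - approx)   # true absolute error

print(approx, estimate, actual_error)
```

The estimate and the actual error differ, as expected, because the estimate keeps only the first truncated term while the actual error contains all of them.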

Example 5.3
What is the error of the 2nd-order Taylor series approximation of cos(x) in
radians at a step h = 0.01 after $x_0 = 0$?
Solution
The derivatives of the cos and sin functions are:

$$\frac{d}{dx}\cos(x) = -\sin(x)$$
$$\frac{d}{dx}\sin(x) = \cos(x)$$
and so the 2nd-order approximation will be:
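$$\begin{aligned}
f(x_0 + h) &= f(x_0) + f^{(1)}(x_0)\,h + \frac{f^{(2)}(x_0)}{2!}h^2 + E(x_0) \\
&= \cos(x_0) - \sin(x_0)\,h - \frac{\cos(x_0)}{2!}h^2 + E(x_0) \\
f(0 + 0.01) &= \cos(0) - \sin(0)(0.01) - \frac{\cos(0)}{2}(0.01)^2 + E(0) \\
&= 1 - \frac{1}{2}(0.01)^2 + E(0) \\
&= 0.99995 + E(0)
\end{aligned}$$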

The error term E(x) is the 3rd-order term:

$$E(x_0) = \frac{f^{(3)}(x_0)}{3!}h^3 = \frac{\sin(x_0)}{3!}h^3$$

However, since sin(0) = 0, this term will be 0. That is clearly wrong, since
0.99995 is not a perfect approximation of the value of cos(0.01)! In this case,
the error is the next non-zero term, which is the 4th-order term:
$$E(x_0) = \frac{f^{(4)}(x_0)}{4!}h^4 = \frac{\cos(x_0)}{4!}h^4 = \frac{1}{24}(0.01)^4 \approx 4.17 \times 10^{-10}$$
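
The rule of skipping zero terms can be written as a small routine. The sketch below is an illustration under stated assumptions: the hypothetical helper `taylor_with_error` takes a list whose entry k is the kth derivative evaluated at the expansion point, and it is applied here to the cosine data of this example.

```python
import math

def taylor_with_error(derivs, h, n):
    """Return (nth-order approximation, error estimate).

    derivs[k] is assumed to hold f^(k)(x0). The error estimate is the first
    NON-ZERO term after order n, or None if none was supplied.
    """
    approx = sum(d * h**k / math.factorial(k)
                 for k, d in enumerate(derivs[: n + 1]))
    for k in range(n + 1, len(derivs)):
        if derivs[k] != 0:  # skip the zero terms of the series
            return approx, derivs[k] * h**k / math.factorial(k)
    return approx, None

# Derivatives of cos(x) at x0 = 0: cos, -sin, -cos, sin, cos.
cos_derivs_at_0 = [1, 0, -1, 0, 1]
approx, error_estimate = taylor_with_error(cos_derivs_at_0, 0.01, 2)
actual_error = math.cos(0.01) - approx

print(approx, error_estimate, actual_error)
```

Comparing `error_estimate` with `actual_error` shows that the 4th-order term accounts for almost all of the true error at this small step.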

Now that it is possible to measure the error of a Taylor series approximation, the
natural next question is, how can this information be used to create better approximations of real systems? Given both the discussion in this section and Eq. (5.8) specifically, it can be seen that there are two ways to reduce the error term $O(h^{n+1})$: by using
smaller values of h or greater values of n. Using smaller values of h means taking
smaller steps, or evaluating the approximation nearer to the known point. Indeed, it has
been established and clearly illustrated in Example 5.1 that the approximation
diverges more from the real function the further it gets from the evaluated point;
conversely, even a low-order approximation is accurate for a small step around the
point. It makes sense, then, that reducing the step size h will lead to a smaller
approximation error. The second option is to increase the approximation order n,
which means adding more terms to the series. This will make the approximation more
complete and similar to the original function, and therefore reduce the error.
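
Both effects can be observed numerically. The sketch below uses f(x) = e^x expanded around 0 as an assumed test function (any smooth function with known derivatives would do) and compares the error at h = 0.2 with the error at h = 0.1 for several orders n. If the error behaves like $O(h^{n+1})$, halving the step should divide the error by roughly $2^{n+1}$.

```python
import math

def taylor_exp(h, n):
    # nth-order Taylor approximation of e^x around 0 (all derivatives equal 1).
    return sum(h**k / math.factorial(k) for k in range(n + 1))

def error(h, n):
    # Absolute error of the nth-order approximation at step h.
    return abs(math.exp(h) - taylor_exp(h, n))

for n in (1, 2, 3):
    ratio = error(0.2, n) / error(0.1, n)
    print(f"n = {n}: error(0.2) / error(0.1) = {ratio:.2f}, 2^(n+1) = {2 ** (n + 1)}")
```

The ratios are slightly larger than $2^{n+1}$ because the terms beyond the (n + 1)th also contribute to the error.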

5.4 Modelling with the Taylor Series

The Taylor series is a powerful modelling approximation tool. It can be used to
model even the most complex, infinite, high-degree mathematical functions with a
simpler, finite, lower-degree polynomial, provided only that the original function is
differentiable to the required order. Most notably, the 1st-order Taylor series
approximation can be used to linearize a complex system into a linear degree-1
polynomial (a.k.a. a straight line) which will be a simple but accurate local model
over a small neighborhood.
Additionally, the Taylor series is a useful tool to perform a mathematical
analysis of other more complex mathematical formulae, such as the numerical
methods presented later in this book. By modelling these methods using Taylor
series approximations, it will become possible to define and measure the upper
bound of their errors.
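
As an illustration of linearization, the sketch below builds the 1st-order model of a function around a chosen point. The helper `linearize` and the choice of the square root around $x_0 = 4$ are assumptions made for this example only.

```python
import math

def linearize(f, df, x0):
    """Return the 1st-order Taylor model of f around x0 as a function of x.

    df is assumed to be the exact first derivative of f.
    """
    y0, slope = f(x0), df(x0)
    return lambda x: y0 + slope * (x - x0)

# Straight-line model of sqrt(x) around x0 = 4: 2 + (x - 4) / 4.
line = linearize(math.sqrt, lambda x: 0.5 / math.sqrt(x), 4.0)

for x in (4.1, 4.5, 6.0):
    print(x, line(x), math.sqrt(x), abs(line(x) - math.sqrt(x)))
```

The printed errors grow as x moves away from 4, which is the sense in which the straight line is only a local model.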

5.5 Summary

For many engineering applications, it can be useful to model a complex function as
a simpler low-order polynomial. This chapter has introduced the infinite Taylor
series, which gives an exact equivalent of the function, and the finite nth-order
Taylor series approximation, which gives an nth-order polynomial model of the
function with a predictable $O(h^{n+1})$ error.

5.6 Exercises

1. Rewrite the Taylor series
$$f(x) = f(x_0) + f^{(1)}(x_0)(x - x_0) + \frac{f^{(2)}(x_0)}{2!}(x - x_0)^2 + \frac{f^{(3)}(x_0)}{3!}(x - x_0)^3 + \cdots$$
in the form of $f(x_0 + h)$ where $h = x - x_0$.

2. Approximate sin(1.1) (in radians) using a 1st-order Taylor series approximation
expanded around $x_0 = 1$. What is the relative error of this answer?
3. Approximate sin(1.1) (in radians) using a 2nd-order Taylor series approximation
expanded around $x_0 = 1$.
4. What is the bound on the error of using a 1st-order Taylor series approximation
expanded around $x_0 = 0.5$ for the function $f(x) = e^x$ when computing the
approximation for:
(a) x = 0
(b) x = 1

5. Compute the 0th- to 2nd-order Taylor series approximations of the following
functions for $x_0 = 1$ and h = 0.5. For each one, use the Taylor series to estimate
the approximation error and compute the absolute error to the real value.
(a) f x x2 4x 3
(b) f x 3x3 x2 4x 3
(c) f x 2x5 3x3 x2 4x 3

Chapter 6

Interpolation, Regression, and Extrapolation

6.1 Introduction

Oftentimes, practicing engineers are required to develop new models of existing
undocumented systems they need to understand. These could be, for example,
man-made legacy systems for which documentation is outdated or missing, or
natural systems that have never been properly studied. In all cases, there are no
design documents or theoretical resources available to guide the modelling. The
only option available is to take discrete measurements of the system and to discover
the underlying mathematical function that generates these points. This chapter will
present a set of mathematical and modelling tools that can perform this task.
The tools presented in this chapter can be divided into three variations of this
challenge. In the first variation, an exact and error-free set of measurements is available,
and the mathematical function computed must have each and every one of these points
as an exact result, as illustrated for a 2D case in Fig. 6.1 (left). This is the challenge of
interpolation. In the second variation, the measurements have errors, and as a result the
mathematical function computed does not need to exactly account for all the measurements (or even any of the measurements) but is rather the function with the minimal
average error to the set of measurements, as illustrated in Fig. 6.1 (right). This challenge
is called linear regression. Finally, given a set of measurements, it may be necessary
to find not the function that exactly fits or best approximates them, but the function
that can best predict future (or past) behavior of the system beyond the measurements.
That is the challenge of extrapolation. In both graphs of Fig. 6.1, the portions left and
right of the first and last points, marked in a dashed line, are extrapolated.
The tools presented in this chapter are an important addition to the toolbox
developed over the last three chapters. While iteration, linear algebra, and the
Taylor series are all useful mathematical modelling tools, they all assume that a
mathematical function of the system is known and available to iterate, solve, or
derive. The tools of interpolation, regression, and extrapolation make it possible
to discover new functions where none are known.
Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_6


Fig. 6.1 Left: A set of exact measurement points in 2D and the interpolated mathematical function
(solid line) and extrapolated function (dashed line). Right: The same points as inexact measurements in 2D and the regressed mathematical function (solid line) and extrapolated function
(dashed line)

6.2 Interpolation

Given a set of n measurement points, the challenge of interpolation consists in
discovering the lowest-degree polynomial that has those points as exact solutions.
That polynomial is said to fit the points. To be sure, many polynomials of higher
degrees could fit the points as well; however, given no other information on the
function that generated the data (such as which degree it should be) it makes sense
to prefer the simplest, lowest-degree function that could be found. Moreover,
picking the lowest-degree function solves a problem of ambiguity: there could be
multiple different functions with the same higher degree that fit the points and no
way to prefer one over another, but the function of the lowest possible degree will
always be unique.
The lowest degree of the function depends on the number of points being used
for interpolation. With a set of n points, the lowest-degree polynomial it is possible
to interpolate is n − 1, meaning the highest power any variable will have is n − 1. In
the simplest 2D case, which is the one most of this chapter will focus on and the
most common one in engineering practice, the equation will thus follow this form:
$$y = f(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + \ldots + c_{n-1} x^{n-1} \tag{6.1}$$
where every value $c_k$ is a coefficient of the polynomial whose value must be
discovered by the interpolation technique.
How is it known that only one unique polynomial of degree n − 1 can fit the
n points? To understand this, observe two facts:
1. The sum or difference of two polynomials of degree n − 1 must be a polynomial
of degree n − 1 or less. It will be of degree less than n − 1 if the degree-(n − 1)
terms in the two original polynomials cancel each other out, and of degree n − 1
if they do not; it cannot possibly contain terms of a degree higher than n − 1
since those did not exist in the original polynomials being summed or subtracted.
2. A polynomial with n − 1 roots must be of degree n − 1 or more. The roots are the
values for which the polynomial evaluates to 0. Each root is a value for which
one of the terms of the polynomial becomes exactly equal but of opposite sign to
all the other terms combined, which is why a polynomial with n − 1 roots must
have at least n − 1 terms and be at least of degree n − 1. Graphically, when
plotted, a polynomial of degree n − 1 will have at most n − 2 local optima where
the curve changes directions. In order for the curve to intersect the zero axis
n − 1 times by crossing it, changing directions, and crossing it again, it will need
to perform at least n − 2 changes of directions, and thus be at least of degree
n − 1. The one exception to this rule is the zero polynomial f(x) = 0, which has
more roots (an infinite number of them) than its degree of 0.
Now assume there exist two polynomials of degree n − 1, $p_1(x)$ and $p_2(x)$, that
both interpolate a set of n points. Define a third polynomial as the difference of
these two:
$$r(x) = p_1(x) - p_2(x) \tag{6.2}$$

By the first of the two observations above, it is clear that r(x) will be a polynomial of
degree n − 1 or less. Moreover, the polynomials interpolate the same set of n points, which
means they will both have the same value at those points and their difference will be
zero. These points will therefore be the roots of r(x). And there will be n of them,
one more than the degree of r(x). By the second of the two observations above, the
only polynomial r(x) could be is the zero polynomial, which means p1(x) and p2(x)
were the same polynomials in the first place. Q.E.D.

6.3 Vandermonde Method

6.3.1 Univariate Polynomials

The Vandermonde method is a very straightforward interpolation technique. It
simply requires substituting each of the n points into a polynomial of the form of
Eq. (6.1) to create a linear system of n equations, which can then be solved using any
of the techniques learnt in Chap. 4. But despite its simplicity, the Vandermonde
method is very powerful and can be generalized to use different non-polynomial
functions, multidimensional points, and even to perform regressions, as will be
presented later on.
Given a set of n points such as the list of Eq. (6.3), it has been proven in the
previous section that it is possible to interpolate a unique polynomial of degree
n − 1 of the form of Eq. (6.1). Since each point is a solution of the polynomial, the
set of points yields n discrete evaluations of the polynomial, written out in the set of
equations (6.4).

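$$(x_0, y_0), (x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_{n-1}, y_{n-1}) \tag{6.3}$$

$$\begin{aligned}
y_0 &= f(x_0) = c_0 + c_1 x_0 + c_2 x_0^2 + c_3 x_0^3 + \ldots + c_{n-1} x_0^{n-1} \\
y_1 &= f(x_1) = c_0 + c_1 x_1 + c_2 x_1^2 + c_3 x_1^3 + \ldots + c_{n-1} x_1^{n-1} \\
&\;\;\vdots \\
y_i &= f(x_i) = c_0 + c_1 x_i + c_2 x_i^2 + c_3 x_i^3 + \ldots + c_{n-1} x_i^{n-1} \\
&\;\;\vdots \\
y_{n-1} &= f(x_{n-1}) = c_0 + c_1 x_{n-1} + c_2 x_{n-1}^2 + c_3 x_{n-1}^3 + \ldots + c_{n-1} x_{n-1}^{n-1}
\end{aligned} \tag{6.4}$$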

This should be immediately recognizable as a system of equations identical to the
one in Eq. (4.1), except with different notation, and with the coefficients of the
equations being the unknowns and the variables being known instead of the other
way around. Writing this system of equations into matrix-vector form gives:
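$$\begin{bmatrix}
1 & x_0 & \cdots & x_0^i & \cdots & x_0^{n-1} \\
1 & x_1 & \cdots & x_1^i & \cdots & x_1^{n-1} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_i & \cdots & x_i^i & \cdots & x_i^{n-1} \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & x_{n-1} & \cdots & x_{n-1}^i & \cdots & x_{n-1}^{n-1}
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_i \\ \vdots \\ c_{n-1} \end{bmatrix}
=
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_i \\ \vdots \\ y_{n-1} \end{bmatrix} \tag{6.5}$$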

Here, the matrix containing the values of the variables of the polynomial is called
the Vandermonde Matrix and is written V, the vector of unknown coefficients is
written c, and the vector of solutions of the polynomial, or the evaluations of the
points, is y. This gives a Vc = y system. This system can then be solved using any of
the techniques learnt in Chap. 4, or any other decomposition technique, to discover
the values of the coefficients and thus the polynomial interpolating the points.
Example 6.1
Four measurements of an electrical system were taken. At time 0 s the output
is 1 V, at time 1 s it is 2 V, at time 2 s it is 9 V, and at time 3 s it is 28 V. Find a
mathematical model for this system.
Solution
There are four 2D points: (0, 1), (1, 2), (2, 9), and (3, 28). Four points can
interpolate a polynomial of degree 3 of the form:
$$y = f(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3$$
Writing this into a Vandermonde system of the form of Eq. (6.5) gives:
$$\begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 \\
1 & 2 & 4 & 8 \\
1 & 3 & 9 & 27
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \\ 9 \\ 28 \end{bmatrix}$$
This system can then be solved using any technique to find the solution:
$$c = \begin{bmatrix} 1 & 0 & 0 & 1 \end{bmatrix}^T$$
Meaning that the polynomial is:
$$y = f(x) = 1 + x^3$$
To illustrate, the four data points and the interpolated polynomial are
presented in the figure below.

[Figure: the four data points and the interpolated polynomial; vertical axis: Voltage, horizontal axis: Time]
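
The procedure of this section can be sketched in code. The example below is a minimal illustration, not a reference implementation: it builds the Vandermonde matrix for the four points of Example 6.1 and solves the system with a small Gauss-Jordan elimination on exact fractions. The function names `solve` and `vandermonde_fit` are chosen here for illustration, and any of the solvers of Chap. 4 could be used instead.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A c = b by Gauss-Jordan elimination with row swaps.

    Exact arithmetic with Fractions; assumes A is square and non-singular.
    """
    n = len(b)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [row[n] for row in M]

def vandermonde_fit(points):
    """Coefficients c_0..c_{n-1} of the polynomial through n points.

    Assumes all x coordinates are distinct.
    """
    n = len(points)
    V = [[x**k for k in range(n)] for x, _ in points]  # row i: 1, x_i, x_i^2, ...
    return solve(V, [y for _, y in points])

coeffs = vandermonde_fit([(0, 1), (1, 2), (2, 9), (3, 28)])
print([str(c) for c in coeffs])
```

Exact fractions are used only to keep this small example free of rounding; for larger sets of points a floating-point solver would be the usual choice.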

6.3.2 Univariate General Functions

One of the advantages of the Vandermonde method is its flexibility: with little
modification, it can be used to interpolate any function, not just a polynomial. If it is
known that a non-polynomial function of x, such as a trigonometric function for
example, is part of the underlying model, the mathematical development can be
adapted to include it. This is accomplished by rewriting Eq. (6.1) in a more general
form, with each term having one of the desired functions of x multiplied by an
unknown coefficient:
$$y = f(x) = c_0 + c_1 f_1(x) + c_2 f_2(x) + c_3 f_3(x) + \ldots + c_{n-1} f_{n-1}(x) \tag{6.6}$$
It can be seen that the original polynomial equation was simply a special case of this
equation with $f_i(x) = x^i$. In the more general case, any function of x can be used, its
result evaluated for each given sample of x and stored in the matrix V, and then used
to solve the vector of coefficients.

Example 6.2
Four measurements of an electrical system were taken. At time 0 s the output
is 1 V, at time 1 s it is 2 V, at time 2 s it is 9 V, and at time 3 s it is 28 V. Find a
mathematical model for this system, knowing that the system handles a sine
and a cosine wave. Work in radians.
Solution
There are four 2D points: (0,1), (1,2), (2,9), and (3,28). Given the information
in the question, they interpolate a trigonometric function of the form:
$$y = f(x) = c_0 + c_1 \sin(x) + c_2 \cos(x) + c_3 \sin(x)\cos(x)$$
Writing this into a Vandermonde system gives:
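$$\begin{bmatrix}
1 & 0 & 1 & 0 \\
1 & 0.84 & 0.54 & 0.45 \\
1 & 0.91 & -0.42 & -0.38 \\
1 & 0.14 & -0.99 & -0.14
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 2 \\ 9 \\ 28 \end{bmatrix}$$
This system can then be solved using any technique to find the solution:
$$c = \begin{bmatrix} 15.91 & -11.19 & -14.91 & 7.87 \end{bmatrix}^T$$
Meaning that the function is:
$$y = f(x) = 15.91 - 11.19 \sin(x) - 14.91 \cos(x) + 7.87 \sin(x)\cos(x)$$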
To illustrate, the four data points and the interpolated function are
presented in the figure below. Comparing to the solution of Example 6.1, it
can be seen that the trigonometric function fits the measurements just as well, but
behaves a bit differently in-between the data points.
[Figure: the four data points and the interpolated trigonometric function; vertical axis: Voltage, horizontal axis: Time]
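
The generalization to arbitrary functions only changes how each row of V is filled. The sketch below, written under the assumption that the basis functions of Example 6.2 are 1, sin x, cos x and sin x cos x, builds the matrix from a list of functions and solves it in floating point. The `solve` routine is an illustrative Gaussian elimination with partial pivoting, standing in for any of the techniques of Chap. 4.

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting, then back-substitution."""
    n = len(b)
    M = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Bring the largest remaining entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (M[i][n] - s) / M[i][i]
    return c

# One basis function per unknown coefficient, in the order of Eq. (6.6).
basis = [
    lambda x: 1.0,
    math.sin,
    math.cos,
    lambda x: math.sin(x) * math.cos(x),
]
points = [(0, 1), (1, 2), (2, 9), (3, 28)]

V = [[g(x) for g in basis] for x, _ in points]   # row i: f_k(x_i) for every k
c = solve(V, [y for _, y in points])

def model(x):
    # Interpolating function: sum of coefficient times basis function.
    return sum(ck * g(x) for ck, g in zip(c, basis))

print(c)
print([model(x) for x, _ in points])
```

Evaluating `model` at the four sample times returns the measured voltages, up to rounding, which is the defining property of an interpolating function.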

6.3.3 Multidimensional Polynomial

The same basic technique can be used to generalize the Vandermonde method to
interpolate multivariate or multidimensional functions. In this case, the
single-variable functions $f_i(x)$ in Eq. (6.6) become multivariate functions of
k variables $f_i(x_0, \ldots, x_{k-1})$. Each function $f_i(x_0, \ldots, x_{k-1})$ is a
different product of the k variables, and the entire polynomial would exhaustively
list all such products, starting with the constant term (all variables exponent 0),
then the single-variable products (one variable at exponent 1 multiplying all others
at exponent 0), then the two-variable products (two variables at exponent 1
multiplying all others at exponent 0), and so on. After the last degree-1 term
(multiplying all variables together), the exhaustive list continues with one variable
at exponent 2.
The most common multidimensional case in engineering is the three-dimensional
case, with measurement points being of the form (x, y, z = f(x, y)). In
this case, the polynomial exhaustively listing all terms at degree 1 and at degree
2 are given in Eqs. (6.7) and (6.8) respectively.
$$z = f(x, y) = c_0 + c_1 x + c_2 y + c_3 xy \tag{6.7}$$

$$z = f(x, y) = c_0 + c_1 x + c_2 y + c_3 xy + c_4 x^2 + c_5 y^2 + c_6 x^2 y + c_7 x y^2 + c_8 x^2 y^2 \tag{6.8}$$
Given the number of coefficients to compute, Eq. (6.7) requires four points to
interpolate, and Eq. (6.8) requires nine points to interpolate. The choice of how
many terms to include in the polynomial will be guided by how many measurements
of the system are available to use. With fewer points available, it is possible
to interpolate a partial form of one of those polynomials.
Example 6.3
Four measurements of the height of a structure were taken. At position (3 km,
3 km) the height is 5 km, at position (3 km, 4 km) it is 6 km, at position (4 km,
3 km) it is 7 km, and at position (4 km, 4 km) it is 9 km in height. Find a
mathematical model for this structure.
Solution
There are four 3D points: (3, 3, 5), (3, 4, 6), (4, 3, 7), (4, 4, 9). Given the
information in the question, they interpolate a 3D function of the form:
$$z = f(x, y) = c_0 + c_1 x + c_2 y + c_3 xy$$
Writing this into a Vandermonde system gives:
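$$\begin{bmatrix}
1 & 3 & 3 & 9 \\
1 & 3 & 4 & 12 \\
1 & 4 & 3 & 12 \\
1 & 4 & 4 & 16
\end{bmatrix}
\begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}
=
\begin{bmatrix} 5 \\ 6 \\ 7 \\ 9 \end{bmatrix}$$
This system can then be solved using any technique to find the solution:
$$c = \begin{bmatrix} 5 & -1 & -2 & 1 \end{bmatrix}^T$$
Meaning that the polynomial is:
$$z = f(x, y) = 5 - x - 2y + xy$$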
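
The exhaustive listing of product terms can be generated mechanically. The sketch below is an illustration with hypothetical helper names (`exponent_tuples`, `design_row`): it enumerates every combination of exponents from 0 up to a maximum for each variable, builds the rows of V from them, and checks the coefficients 5, −1, −2 and 1 (for the terms 1, x, y and xy) against the four measurements of Example 6.3. Note that the enumeration order used here (1, y, x, xy) differs from the term order of Eq. (6.7), which only permutes the columns of V.

```python
from itertools import product

def exponent_tuples(num_vars, max_exp):
    """All exponent combinations with each exponent between 0 and max_exp."""
    return list(product(range(max_exp + 1), repeat=num_vars))

def design_row(point, exponents):
    """One row of V: the value of every product term at the given point."""
    row = []
    for exps in exponents:
        term = 1
        for value, e in zip(point, exps):
            term *= value ** e
        row.append(term)
    return row

exponents = exponent_tuples(2, 1)   # (0,0), (0,1), (1,0), (1,1) -> 1, y, x, xy
samples = [((3, 3), 5), ((3, 4), 6), ((4, 3), 7), ((4, 4), 9)]
V = [design_row(p, exponents) for p, _ in samples]

# Coefficients of z = 5 - x - 2y + xy, reordered to match 1, y, x, xy.
c = [5, -2, -1, 1]
residuals = [sum(v * ck for v, ck in zip(row, c)) - z
             for row, (_, z) in zip(V, samples)]

print(len(exponent_tuples(2, 1)), len(exponent_tuples(2, 2)))
print(residuals)
```

The two printed counts correspond to the number of coefficients, and therefore of points required, in Eqs. (6.7) and (6.8).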

6.4 Lagrange Polynomials

The Vandermonde method gives a simple and flexible technique to interpolate
polynomials. However, it requires solving a linear system, which in turn requires
the use of a matrix calculator (or a lot of patience to get through long and tedious
mathematical equations). The method of Lagrange polynomials is not as flexible as
Vandermonde and is limited to the 2D case, but it has the benefit of being intuitive
for humans and of being computable by hand, at least for a small number of points.
The Lagrange polynomials technique works in two simple steps. Given a set of
n measurement points $(x_0, y_0), \ldots, (x_i, y_i), \ldots, (x_{n-1}, y_{n-1})$, the first
step computes n separate polynomials, with each one being equal to 1 for one of the
n points and equal to 0 for all others. These polynomials are actually quite simple to
define; they will each have the form:
$$L_i(x) = \frac{(x - x_0) \ldots (x - x_{i-1})(x - x_{i+1}) \ldots (x - x_{n-1})}{(x_i - x_0) \ldots (x_i - x_{i-1})(x_i - x_{i+1}) \ldots (x_i - x_{n-1})} \tag{6.9}$$

Notice that this polynomial, developed for point $x_i$, skips over the $(x - x_i)$ term in
the numerator and the $(x_i - x_i)$ term in the denominator, but has n − 1 terms for the
other n − 1 points in the set. When $x = x_i$, the denominator and numerator will be
equal and the polynomial will evaluate to 1. At any of the other points, one of the
subtractions in the numerator will give 0, as will the entire polynomial. The second
step then multiplies each polynomial $L_i(x)$ with the value $y_i$ of the measurement at
that point, and sums them all together.
$$y = f(x) = \sum_{i=0}^{n-1} y_i L_i(x) \tag{6.10}$$

The final polynomial can then optionally be simplified.

Example 6.4
Four measurements of an electrical system were taken. At time 0 s the output
is 1 V, at time 1 s it is 2 V, at time 2 s it is 9 V, and at time 3 s it is 28 V. Find a
mathematical model for this system.
Solution
There are four 2D points: (0, 1), (1, 2), (2, 9), and (3, 28). Write a polynomial
of the form of Eq. (6.9) for each of the points:
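$$L_0(x) = \frac{(x - 1)(x - 2)(x - 3)}{(0 - 1)(0 - 2)(0 - 3)} = \frac{x^3 - 6x^2 + 11x - 6}{-6}$$
$$L_1(x) = \frac{(x - 0)(x - 2)(x - 3)}{(1 - 0)(1 - 2)(1 - 3)} = \frac{x^3 - 5x^2 + 6x}{2}$$
$$L_2(x) = \frac{(x - 0)(x - 1)(x - 3)}{(2 - 0)(2 - 1)(2 - 3)} = \frac{x^3 - 4x^2 + 3x}{-2}$$
$$L_3(x) = \frac{(x - 0)(x - 1)(x - 2)}{(3 - 0)(3 - 1)(3 - 2)} = \frac{x^3 - 3x^2 + 2x}{6}$$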

Next, each polynomial is multiplied by its matching value, and they are all
summed up together to get:
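$$\begin{aligned}
y = f(x) &= y_0 L_0(x) + y_1 L_1(x) + y_2 L_2(x) + y_3 L_3(x) \\
&= 1\,\frac{x^3 - 6x^2 + 11x - 6}{-6} + 2\,\frac{x^3 - 5x^2 + 6x}{2} + 9\,\frac{x^3 - 4x^2 + 3x}{-2} + 28\,\frac{x^3 - 3x^2 + 2x}{6}
\end{aligned}$$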

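which is the polynomial interpolating the points. It can further be simplified
into:
$$\begin{aligned}
y = f(x) &= \left(-\frac{1}{6} + \frac{6}{6} - \frac{27}{6} + \frac{28}{6}\right)x^3 + \left(\frac{6}{6} - \frac{30}{6} + \frac{108}{6} - \frac{84}{6}\right)x^2 \\
&\quad + \left(-\frac{11}{6} + \frac{36}{6} - \frac{81}{6} + \frac{56}{6}\right)x + \left(\frac{6}{6} + \frac{0}{6} + \frac{0}{6} + \frac{0}{6}\right) \\
&= x^3 + 1
\end{aligned}$$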
which is the same polynomial that was found in Example 6.1.
While the Lagrange polynomials method is the easiest interpolation method
for humans to understand and use, it is also the most complicated one to
implement in software, as can be seen from the pseudocode in Fig. 6.2. It also
suffers from the problem that interpolating a polynomial for a set of n points
with this method gives no information whatsoever on the polynomial that could
be interpolated with n + 1 points including the same n points. In practical terms,
this means that if a polynomial has been interpolated for a set of n points and
new measurements of the system are made subsequently, the computations
have to be done over from scratch. To be sure, that was also the case with
the Vandermonde method. However, since with Lagrange polynomials the
computations are also made by hand, this can become a major limitation of
this method.


Points ← Input list of n 2D points (x, y)
Function ← empty function
FunctionIndex ← 0
WHILE (FunctionIndex < n)
    PointIndex ← 0
    Numerator ← 1
    Denominator ← 1
    WHILE (PointIndex < n)
        IF (PointIndex different from FunctionIndex)
            Numerator ← Numerator × [(x variable) - (x coordinate of point
                        number PointIndex)]
            Denominator ← Denominator × [(x coordinate of point number
                        FunctionIndex) - (x coordinate of point number
                        PointIndex)]
        END IF
        PointIndex ← PointIndex + 1
    END WHILE
    Function ← Function + (y coordinate of point number FunctionIndex)
                × Numerator / Denominator
    FunctionIndex ← FunctionIndex + 1
END WHILE
RETURN Function

Fig. 6.2 Pseudocode of Lagrange polynomial
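
A runnable counterpart of this pseudocode can be sketched in Python. One difference is assumed for simplicity: instead of building a symbolic function, the sketch evaluates the interpolating polynomial directly at a requested value of x. The function name `lagrange_interpolate` is chosen here for illustration.

```python
def lagrange_interpolate(points, x):
    """Evaluate at x the Lagrange polynomial through the given (x, y) points.

    Assumes all x coordinates in `points` are distinct.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        numerator = 1.0
        denominator = 1.0
        for j, (xj, _) in enumerate(points):
            if j != i:                     # skip the (x - x_i) term, as in Eq. (6.9)
                numerator *= x - xj
                denominator *= xi - xj
        total += yi * numerator / denominator   # y_i * L_i(x), as in Eq. (6.10)
    return total

points = [(0, 1), (1, 2), (2, 9), (3, 28)]
for x in (0, 1, 1.5, 2, 3):
    print(x, lagrange_interpolate(points, x))
```

At the four sample points the function returns the measured values, and in between it follows the same cubic that was found in Example 6.4.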

6.5 Newton Polynomials

The Newton polynomials method discovers the polynomial that interpolates a set
of n points under the form of a sum of polynomials going from degree 0 to degree
n − 1, in the form given in Eq. (6.11). That equation may look long, but it is actually
quite straightforward: each individual term i is composed of a coefficient $c_i$
multiplied by a series of subtractions of x by every measurement point from $x_0$ to
$x_{i-1}$.
$$y = f(x) = c_0 + c_1 (x - x_0) + c_2 (x - x_0)(x - x_1) + \ldots + c_{n-1} (x - x_0)(x - x_1) \ldots (x - x_{n-2}) \tag{6.11}$$

Unlike the Vandermonde method and Lagrange polynomials, the Newton polynomials
method can be used to incrementally add points to the interpolation set.
A new measurement point $(x_n, y_n)$ will simply add the term
$c_n (x - x_0) \ldots (x - x_{n-1})$ to the sum of Eq. (6.11). This new term will be a
polynomial of degree n, as will the entire polynomial (as it should be since it now
interpolates a set of n + 1 points). Moreover, it can be seen that this new term will
not have any effect on the terms computed previously: since it is multiplied by
$(x - x_0) \ldots (x - x_{n-1})$, it is 0 at all previously interpolated points. The
polynomial of Eq. (6.11) was correct for n points, and the newly added (n + 1)th point
makes it possible to compute a refinement to that equation without requiring
recomputing of the entire interpolation.


The biggest challenge in Newton polynomials is to compute the set of coefficients. There is actually a simple method for computing them, but to understand
where the equations come from, it is best to learn the underlying logic by computing
the first few coefficients.
Much like Eq. (6.11) makes it possible to incrementally add new points into the
interpolation, the coefficients are computed by incrementally adding new points
into the set considered. The first coefficient, c0, will be computed using only the first
point $(x_0, y_0)$. Evaluating Eq. (6.11) at that first point reduces it to the constant
polynomial $y = f(x) = c_0$ since, when the polynomial is evaluated at $x_0$, all other
terms are multiplied by $(x_0 - x_0)$ and become 0. The value of the coefficient is thus
clear:
$$y_0 = f(x_0) = c_0 \tag{6.12}$$

Taking the second point into consideration and evaluating Eq. (6.11) at that
coordinate while including the result of Eq. (6.12) gives a polynomial of degree 1:
$$y_1 = f(x_1) = y_0 + c_1 (x_1 - x_0) \tag{6.13}$$

The value of the coefficient c1 in the newly added term of the equation is the only
unknown in that equation, and can be discovered simply by isolating it in that
equation:
$$c_1 = \frac{y_1 - y_0}{x_1 - x_0} = \frac{f(x_1) - f(x_0)}{x_1 - x_0} \tag{6.14}$$

The right-hand side of Eq. (6.14) can be written in a more general form:
$$f(x_i, x_{i+1}) = \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i} \tag{6.15}$$
in which case the coefficient $c_1$ of Eq. (6.14) becomes:
$$c_1 = f(x_0, x_1) \tag{6.16}$$

Next, a third measurement point (x2, y2) is observed. Evaluating Eq. (6.11) with
that new point gives:
$$y_2 = f(x_2) = f(x_0) + f(x_0, x_1)(x_2 - x_0) + c_2 (x_2 - x_0)(x_2 - x_1) \tag{6.17}$$

Once again, the value of the coefficient in the newly added term of the equation is
the only unknown in that equation, and its value can be discovered simply by
isolating it in that equation:
$$c_2 = \frac{f(x_2) - f(x_0) - f(x_0, x_1)(x_2 - x_0)}{(x_2 - x_0)(x_2 - x_1)} = \frac{f(x_1, x_2) - f(x_0, x_1)}{x_2 - x_0} \tag{6.18}$$


And once again that result can be written in a more compact function form:
$$f(x_i, x_{i+1}, x_{i+2}) = \frac{f(x_{i+1}, x_{i+2}) - f(x_i, x_{i+1})}{x_{i+2} - x_i} \tag{6.19}$$
$$c_2 = f(x_0, x_1, x_2) \tag{6.20}$$

A general rule should be apparent from these examples. For any new point
(xk, yk) added to the interpolation set, a new function can be written as:
$$f(x_i, x_{i+1}, \ldots, x_{i+k}) = \frac{f(x_{i+1}, \ldots, x_{i+k}) - f(x_i, \ldots, x_{i+k-1})}{x_{i+k} - x_i} \tag{6.21}$$

and the coefficient of the new term added to the polynomial is the evaluation of that
new function from x0 to xk:
$$c_k = f(x_0, x_1, \ldots, x_k) \tag{6.22}$$

This gives the interpolated polynomial of Eq. (6.11) the form:

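$$\begin{aligned}
y = f(x) &= f(x_0) + f(x_0, x_1)(x - x_0) + f(x_0, x_1, x_2)(x - x_0)(x - x_1) + \ldots \\
&\quad + f(x_0, \ldots, x_{n-1})(x - x_0)(x - x_1) \ldots (x - x_{n-2})
\end{aligned} \tag{6.23}$$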
One thing that should be evident from the examples and from the general
formula of Eq. (6.21) is that calculating one level of the function
$f(x_i, \ldots, x_{i+k})$ requires knowledge of the previous level of the function
$f(x_i, \ldots, x_{i+k-1})$ and, recursively, knowledge of all previous levels of the
function down to $f(x_i)$. There is in fact a simple method of systematically
computing all these values, by building what is called a table of divided
differences. One such table combining the information of the sample computations
from Eqs. (6.12) to (6.20) is given in Table 6.1. Each column of this table is filled
in by computing one level of the function f. The first column simply contains the
measurement values $x_i$, and the second column the corresponding values $f(x_i)$. The
third column then has the values $f(x_i, x_{i+1})$, which are computed from the first
two columns. Moreover, following Eq. (6.15), each individual value is computed by
subtracting the two immediately adjacent values in the previous column, divided by
the subtraction of highest and lowest value of $x_i$. That column will also have one
less value than the previous one, since there are fewer combinations possible at that
level. The fourth column has the values of $f(x_i, x_{i+1}, x_{i+2})$, which are computed
from the third and first column. Once again, each individual value of the new column
is computed by subtracting the two immediately adjacent values in the previous
column divided by the subtraction of highest and lowest value of $x_i$, as per
Eq. (6.19). And once again, there will be one less value
in the new column than there was in the previous one. This process goes on until the
last column has only one value. The coefficients of the polynomial in Eq. (6.23) are
immediately available in the final table, as the first value of each column.

Table 6.1 Sample table of divided differences

x_i    f(x_i)    f(x_i, x_{i+1})    f(x_i, x_{i+1}, x_{i+2})
x_0    f(x_0)    f(x_0, x_1)        f(x_0, x_1, x_2)
x_1    f(x_1)    f(x_1, x_2)
x_2    f(x_2)

Points ← Input list of n 2D points (x, y)
DividedDifferences ← table of n rows and n + 1 columns
RowIndex ← 0
WHILE (RowIndex < n)
    element at column 0, row RowIndex of DividedDifferences ← x coordinate
        of point number RowIndex
    element at column 1, row RowIndex of DividedDifferences ← y coordinate
        of point number RowIndex
    RowIndex ← RowIndex + 1
END WHILE
ColumnIndex ← 2
WHILE (ColumnIndex < n + 1)
    RowIndex ← 0
    WHILE (RowIndex < n - ColumnIndex + 1)
        element at column ColumnIndex, row RowIndex of DividedDifferences ←
            [ (element at column ColumnIndex - 1,
               row RowIndex + 1 of DividedDifferences)
            - (element at column ColumnIndex - 1,
               row RowIndex of DividedDifferences) ]
            / [ (element at column 0,
               row RowIndex + ColumnIndex - 1 of DividedDifferences)
            - (element at column 0,
               row RowIndex of DividedDifferences) ]
        RowIndex ← RowIndex + 1
    END WHILE
    ColumnIndex ← ColumnIndex + 1
END WHILE
RETURN Row 0 of DividedDifferences

Fig. 6.3 Pseudocode of Newton polynomial

The pseudocode for an algorithm to compute the table of divided differences is
presented in Fig. 6.3. This algorithm will return the coefficients needed to build a
polynomial of the form of Eq. (6.23). An additional step would be needed to recover

coefficients for a simpler but equivalent polynomial of the form of Eq. (6.1); this
step would be to multiply the Newton coefficients with subsets of x coordinates of
the interpolation points and adding all products of the same degree together. This
additional step is not included here.
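The table-building procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's implementation: the function names `divided_differences` and `newton_eval` are hypothetical, and only one column of the table is kept in memory at a time, since only the first value of each column is needed for the coefficients.

```python
def divided_differences(points):
    """Return the Newton coefficients f(x0), f(x0,x1), ... for a list of (x, y) points."""
    xs = [x for x, _ in points]
    # 'column' holds the current column of the table; it starts as the f(xi) column.
    column = [y for _, y in points]
    coefficients = [column[0]]
    for order in range(1, len(points)):
        # Each new entry is the difference of two adjacent entries of the previous
        # column, divided by the difference of the outermost x values they cover.
        column = [
            (column[i + 1] - column[i]) / (xs[i + order] - xs[i])
            for i in range(len(column) - 1)
        ]
        coefficients.append(column[0])
    return coefficients


def newton_eval(xs, coefficients, x):
    """Evaluate the Newton-form polynomial at x by nested multiplication."""
    result = coefficients[-1]
    for k in range(len(coefficients) - 2, -1, -1):
        result = result * (x - xs[k]) + coefficients[k]
    return result


# Four sample points lying on y = x^3 + 1.
points = [(0, 1), (1, 2), (2, 9), (3, 28)]
coeffs = divided_differences(points)
print(coeffs)                                  # [1, 1.0, 3.0, 1.0]
print(newton_eval([0, 1, 2, 3], coeffs, 2.5))  # 16.625
```

Keeping a single column rather than the full n by n + 1 table is a design choice made here for brevity; the full table is only needed if intermediate divided differences must be inspected.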
Example 6.5
Four measurements of an electrical system were taken. At time 0 s the output
is 1 V, at time 1 s it is 2 V, at time 2 s it is 9 V, and at time 3 s it is 28 V. Find a
mathematical model for this system.
Solution
There are four 2D points: (0,1), (1,2), (2,9), and (3,28). Build the table of
divided differences. The first two columns are immediately available.
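| xi | f(xi) | f(xi,xi+1) | f(xi,xi+1,xi+2) | f(xi,xi+1,xi+2,xi+3) |
|----|-------|------------|-----------------|----------------------|
| 0  | 1     |            |                 |                      |
| 1  | 2     |            |                 |                      |
| 2  | 9     |            |                 |                      |
| 3  | 28    |            |                 |                      |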

Values in the third column are computed using Eq. (6.15), combining
values from the previous two columns. Then, values in the fourth column will
be computed using Eq. (6.19) and the values of the third column and the first
column.
| xi | f(xi) | f(xi,xi+1) | f(xi,xi+1,xi+2) | f(xi,xi+1,xi+2,xi+3) |
|----|-------|------------|-----------------|----------------------|
| 0  | 1     | 1          | 3               |                      |
| 1  | 2     | 7          | 6               |                      |
| 2  | 9     | 19         |                 |                      |
| 3  | 28    |            |                 |                      |

The equation to compute the value in the last column can be derived from
Eq. (6.21) as:
$$f(x_i, x_{i+1}, x_{i+2}, x_{i+3}) = \frac{f(x_{i+1}, x_{i+2}, x_{i+3}) - f(x_i, x_{i+1}, x_{i+2})}{x_{i+3} - x_i}$$

and the values needed to compute it are the two values in column four and the
largest and smallest values of xi. This completes the table:
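| xi | f(xi) | f(xi,xi+1) | f(xi,xi+1,xi+2) | f(xi,xi+1,xi+2,xi+3) |
|----|-------|------------|-----------------|----------------------|
| 0  | 1     | 1          | 3               | 1                    |
| 1  | 2     | 7          | 6               |                      |
| 2  | 9     | 19         |                 |                      |
| 3  | 28    |            |                 |                      |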

Finally, the polynomial of Eq. (6.23) can be constructed by using the first
entry of each column as the coefficients.
$$y = f(x) = 1 + 1(x - 0) + 3(x - 0)(x - 1) + 1(x - 0)(x - 1)(x - 2) = x^3 + 1$$
Again, the final simplified polynomial is the same one that was computed in
Examples 6.1 and 6.4.
As explained previously, a major advantage of Newton polynomials is that it is
possible to add points into the interpolation set without recomputing the entire
interpolation, but simply by adding higher-order terms to the existing polynomial.
In practice, this is done by appending the new points to the existing table of divided
differences and adding columns as needed to generate more coefficients.
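A minimal sketch of this incremental update is given below. It assumes a hypothetical class `NewtonInterpolator` (not from the text) that remembers only the bottom diagonal of the table of divided differences, which is all that is needed to compute the new entries when a point is appended.

```python
class NewtonInterpolator:
    """Newton polynomial that can absorb new points one at a time (illustrative sketch)."""

    def __init__(self):
        self.xs = []            # x coordinates of the points added so far
        self.coefficients = []  # first entry of each column of the table
        self._last_row = []     # divided differences that end at the newest point

    def add_point(self, x, y):
        n = len(self.xs)
        new_row = [y]
        for order in range(1, n + 1):
            # f(x_{n-order}, ..., x_new) from the two lower-order differences.
            numerator = new_row[order - 1] - self._last_row[order - 1]
            new_row.append(numerator / (x - self.xs[n - order]))
        self.xs.append(x)
        self.coefficients.append(new_row[-1])  # the single new coefficient
        self._last_row = new_row

    def __call__(self, x):
        result = self.coefficients[-1]
        for k in range(len(self.coefficients) - 2, -1, -1):
            result = result * (x - self.xs[k]) + self.coefficients[k]
        return result


model = NewtonInterpolator()
for x, y in [(0, 1), (1, 2), (2, 9), (3, 28)]:
    model.add_point(x, y)
before = list(model.coefficients)  # coefficients of the four-point model
model.add_point(5, 54)             # a fifth measurement arrives later
print(before, model.coefficients[-1])
```

Adding the fifth point leaves the first four coefficients untouched and appends exactly one new coefficient, which is the property the paragraph above describes.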
Example 6.6
A fifth measurement of the electrical system of Example 6.5 has been taken.
At time 5 s, the measurement is 54 V. Update the mathematical model for this
system.

Solution
Extend the table of divided differences by appending 5 and 54 to columns one
and two respectively. Then, using the equations already known, the divided differences
f can be computed. In the end, a new column must be added at the right-hand
side of the table, for the new function (derived from Eq. (6.21)):
$$f(x_i, x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}) = \frac{f(x_{i+1}, x_{i+2}, x_{i+3}, x_{i+4}) - f(x_i, x_{i+1}, x_{i+2}, x_{i+3})}{x_{i+4} - x_i}$$

The complete table is:

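| xi | f(xi) | f(xi,xi+1) | f(xi,xi+1,xi+2) | f(xi,xi+1,xi+2,xi+3) | f(xi,xi+1,xi+2,xi+3,xi+4) |
|----|-------|------------|-----------------|----------------------|---------------------------|
| 0  | 1     | 1          | 3               | 1                    | −3/5                      |
| 1  | 2     | 7          | 6               | −2                   |                           |
| 2  | 9     | 19         | −2              |                      |                           |
| 3  | 28    | 13         |                 |                      |                           |
| 5  | 54    |            |                 |                      |                           |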

and the polynomial is:

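$$y = f(x) = 1 + 1(x - 0) + 3(x - 0)(x - 1) + 1(x - 0)(x - 1)(x - 2) - \frac{3}{5}(x - 0)(x - 1)(x - 2)(x - 3)$$
$$= 1 + \frac{18}{5}x - \frac{33}{5}x^2 + \frac{23}{5}x^3 - \frac{3}{5}x^4$$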
which is the same as for Example 6.5 with one additional term added to
account for the new measurement point. The five measurements are presented
in the figure below, along with the original interpolated function from Example 6.5 in red and the updated interpolated function above in purple.
[Figure: the five measurements plotted as voltage against time, with the interpolated polynomial of Example 6.5 (red) and the updated polynomial of Example 6.6 (purple)]

6.6 Interpolation Error Analysis

It is worth remembering that the polynomial f(x) interpolated from a set of
n measurement points is, by design, the unique lowest-degree polynomial that can
exactly fit the points. However, there is no guarantee that the real polynomial p(x)
that generated those measurements is that polynomial; it could be, for example,
a much higher-degree polynomial that cannot be uniquely determined from the
limited set of points available. This observation is illustrated in Fig. 6.4: the
interpolated parabola f(x) in red fits the three data points perfectly, but is inaccurate
compared to the real polynomial p(x) in blue elsewhere in the interpolation interval.
Formally, for any point x in the interval [x0, xn−1], the relationship between reality
and interpolation is:
$$p(x) = f(x) + E(x) \tag{6.24}$$
where E(x) is an interpolation error term, which is equal to 0 at the interpolated
measurement points.

Fig. 6.4 Three points on a polynomial (blue) and the interpolated parabola (red)

Chapter 5 has already introduced the Taylor series as an error modelling tool.
While it may be difficult to see how it could be applied to the error of a polynomial in
the form of Eq. (6.1), writing it in the equivalent Newton polynomial form of
Eq. (6.11) makes things a lot clearer. Indeed, Eq. (6.11) can be seen as an (n − 1)th-order Taylor series approximation with:
$$c_i = \frac{p^{(i)}(x)}{i!} \tag{6.25}$$
Consequently, the error term will be the nth-order term of the series evaluated at a
point x in the interval [x0, xn−1]:
$$E(x) = \frac{p^{(n)}(x)}{n!}(x - x_0)(x - x_1)\cdots(x - x_{n-1}) \tag{6.26}$$

Unfortunately, Eq. (6.26) cannot be used to compute the error term, for the same
reason Eq. (6.25) could not be used to compute the coefficients: the polynomial p(x)
is unknown. It is, after all, the very polynomial that is being modelled by interpolation. However, an alternative is immediately available from Eq. (6.25): using the
coefficient cn, which can be computed from Eqs. (6.21) and (6.22). The error term
then becomes:
$$E(x) = f(x_0, x_1, \ldots, x_{n-1}, x)(x - x_0)(x - x_1)\cdots(x - x_{n-1}) \tag{6.27}$$

The coefficient of the error term can thus be computed using Newton polynomials
and the table of divided differences learnt in the previous section, provided an
additional point x not used in the interpolation is available.
It is worth noting that, while the development of the error term above uses
explicitly Newton polynomials, the error term will be the same for any interpolation
method, including Vandermonde and Lagrange polynomials. It is also worth
remembering again that this error term is only valid within the interpolation
interval.
Example 6.7
Given the interpolated model of the electrical system from Example 6.5,
estimate the modelling error on a point computed at time 2.5 s. Use the
additional measurement of 4 V at 1.5 s to compute the coefficient.

Solution
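Extend the table of divided differences of Example 6.5 by appending 1.5 and 4 to
columns one and two respectively and compute the new coefficients:

| xi  | f(xi) | f(xi,xi+1) | f(xi,xi+1,xi+2) | f(xi,xi+1,xi+2,xi+3) | f(xi,xi+1,xi+2,xi+3,xi+4) |
|-----|-------|------------|-----------------|----------------------|---------------------------|
| 0   | 1     | 1          | 3               | 1                    | −2/3                      |
| 1   | 2     | 7          | 6               | 0                    |                           |
| 2   | 9     | 19         | 6               |                      |                           |
| 3   | 28    | 16         |                 |                      |                           |
| 1.5 | 4     |            |                 |                      |                           |

The error term is thus:
$$E(x) = -\frac{2}{3}(x - 0)(x - 1)(x - 2)(x - 3)$$
and the error on a computation at 2.5 s is E(2.5) = 0.625 V.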
For comparison, the value computed at 2.5 s by the model of Example 6.5
is 16.625 V, while the value computed by the more accurate model of
Example 6.6 is 17.1875 V. The difference between these values is 0.5625 V,
very close to the predicted error of the model to the real polynomial.
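The error estimate of Eq. (6.27) can be sketched as follows, under the assumption that one extra measurement outside the interpolation set is available. The helper names `newton_coefficients` and `interpolation_error` are hypothetical; the data are the four points of Example 6.5 and the additional measurement of 4 V at 1.5 s.

```python
def newton_coefficients(points):
    """First entry of each column of the table of divided differences."""
    xs = [x for x, _ in points]
    column = [y for _, y in points]
    coefficients = [column[0]]
    for order in range(1, len(points)):
        column = [
            (column[i + 1] - column[i]) / (xs[i + order] - xs[i])
            for i in range(len(column) - 1)
        ]
        coefficients.append(column[0])
    return coefficients


def interpolation_error(points, extra_point, x):
    """Estimate E(x) from the coefficient obtained by appending one extra point."""
    # Highest-order divided difference over the original points plus the extra one.
    coefficient = newton_coefficients(points + [extra_point])[-1]
    # Product (x - x0)(x - x1)...(x - x_{n-1}) over the ORIGINAL points only.
    product = 1.0
    for xi, _ in points:
        product *= x - xi
    return coefficient * product


points = [(0, 1), (1, 2), (2, 9), (3, 28)]
error_at_2_5 = interpolation_error(points, (1.5, 4), 2.5)
print(error_at_2_5)  # approximately 0.625
```

The estimate vanishes at every interpolated point, since the product then contains a zero factor, which matches the requirement that E(x) be 0 at the measurements.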

6.7 Linear Regression

Given a set of n points, interpolation finds a polynomial of degree n − 1 that fits
exactly all the points. Oftentimes, however, that is not the model required for a
problem. There could be a number of reasons for that. For example, it could be that
the number of measurements taken of the system is a lot greater than the expected
degree the model polynomial should be. It would be silly to ignore some (or most)
of the measurements in order to reach a target polynomial order. Moreover,
measurements taken in practice will normally be inexact, due for example to
noise, to observation inaccuracies, or to the inherent limits of the equipment used.
A model that fits these points exactly will therefore be the wrong model for the
system, since it fits erroneous data!
For example, it is well known that an ideal resistor is linear in its response.
However, the measured response of a resistor in practice might not be linear,
because of measuring equipment error, fluctuations in the system, defects in the
resistor itself, or many other reasons. Simply taking two measurements of the
resistor and interpolating a straight line will thus lead to an incorrect model.
Using multiple readings to approximate the straight-line response of the resistor
will make it possible to create a much more accurate model of the resistor.

Fig. 6.5 20 measurements with errors of a linear system

Figure 6.5 illustrates the problem. The 20 measurements are clearly pointing to a
linear system. However, because of measurement errors, the points do not line up
properly. With 20 points, an interpolation method would generate a polynomial of
degree 19, which is much, much more complex than what the data is pointing to. On
the other hand, selecting two points could lead to the interpolation of very different
lines, depending on which pair of points are selected, whereas the correct line can
be determined less ambiguously when the entire set of points is considered.
The solution to this problem is regression, or the process of discovering the
polynomial y = f(x) that best approximates (as opposed to fits) the measurement
data. When this polynomial is linear, such as the one of Eq. (6.1), the process is
called linear regression (the term linear here has nothing to do with whether the
polynomial is for a straight line or not). When the regression is actually looking for
a straight line (a polynomial of degree 1), it is called a simple linear regression.
One question remains however: how to define the polynomial that best
approximates a set of points? After all, several polynomials of the same degree
could approximate a set of points, and be better approximations of certain points or
regions of the set of points while being worse at others. For example, Fig. 6.6 shows
three possible straight-line approximations (among countless others!) of the set of
points of Fig. 6.5.
Indeed, some approximations could be a better fit for part of the set of points
and minimize the errors over that region of the measurements, while others will be
better fits for other parts of the measurements. The best approximation, in the
context of regression, is not the one that will minimize the errors over part of the
points, but the one that will lead to the lowest overall error. The error of the model
on a measurement is the difference between the observed value yi (which, while
incorrect, is still the best information available about that point) and the value
predicted by the model polynomial f(xi). Then, the overall error of the model on the
set of measurements will be the sum of square errors (SSE) over all points:
$$\mathrm{SSE} = \sum_{i=0}^{n-1} \left(y_i - f(x_i)\right)^2 \tag{6.28}$$

Fig. 6.6 Three approximations of the set of points


The best polynomial that can be regressed is the one that minimizes the value of
the SSE.

6.8 Method of Least Squares

The method of least squares computes the polynomial that minimizes the SSE
through a formal mathematical development. To understand the development,
consider the easiest case of a simple linear regression. In that case, Eq. (6.28)
becomes:
$$\mathrm{SSE} = \sum_{i=0}^{n-1} \left(y_i - c_0 - c_1 x_i\right)^2 \tag{6.29}$$

The method is looking for the polynomial, or the values of c0 and c1, that minimize
the SSE. The minimum for each coefficient is found by computing the partial
derivative of the equation with respect to that coefficient and setting it equal to 0:
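$$\frac{\partial\,\mathrm{SSE}}{\partial c_0} = -2\sum_{i=0}^{n-1} \left(y_i - c_0 - c_1 x_i\right) = 0 \tag{6.30}$$
$$\frac{\partial\,\mathrm{SSE}}{\partial c_1} = -2\sum_{i=0}^{n-1} \left(y_i - c_0 - c_1 x_i\right) x_i = 0 \tag{6.31}$$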

The problem is now reduced to a system of two equations with two unknown
variables to solve together, which is trivial to do. It can be done by isolating c0 and
c1 in Eqs. (6.30) and (6.31) (note that the coefficients multiply the summation), or
by writing the equations into an Mx = b form and solving the system using one of
the decomposition techniques from Chap. 4:
$$\begin{bmatrix} n & \sum_{i=0}^{n-1} x_i \\ \sum_{i=0}^{n-1} x_i & \sum_{i=0}^{n-1} x_i^2 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} \sum_{i=0}^{n-1} y_i \\ \sum_{i=0}^{n-1} y_i x_i \end{bmatrix} \tag{6.32}$$


The method of least squares can be applied to other cases of linear regression, to
discover higher-order polynomials for the model. The only downside is that each
additional term and coefficient in the polynomial to regress requires computing one
more derivative and handling one more equation.
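For the straight-line case, the 2 × 2 system of Eq. (6.32) can be solved in closed form. The sketch below does so with Cramer's rule rather than a matrix decomposition, which is a simplification chosen here for brevity; the function names are hypothetical, and the data are the ten measurements of Example 6.8, which follows.

```python
def simple_linear_regression(xs, ys):
    """Solve Eq. (6.32) for c0 and c1 using the four summations it contains."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    determinant = n * sum_xx - sum_x * sum_x  # zero only if all x are identical
    c1 = (n * sum_xy - sum_x * sum_y) / determinant
    c0 = (sum_y - c1 * sum_x) / n
    return c0, c1


def sum_of_square_errors(xs, ys, c0, c1):
    """SSE of Eq. (6.29) for the line y = c0 + c1 x."""
    return sum((y - c0 - c1 * x) ** 2 for x, y in zip(xs, ys))


times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
volts = [0.5, 1.7, 1.4, 2.8, 2.3, 3.6, 2.7, 4.1, 3.0, 4.9]
c0, c1 = simple_linear_regression(times, volts)
sse = sum_of_square_errors(times, volts, c0, c1)
print(round(c0, 2), round(c1, 2), round(sse, 2))  # 0.59 0.38 3.3
```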
Example 6.8
An electrical system is measured at every second, starting at time 1 s, using
noisy equipment. At time 1 s the initial output is 0.5 V, and the following
measurements are 1.7 V at time 2 s, then 1.4, 2.8, 2.3, 3.6, 2.7, 4.1, 3.0 V, and
finally 4.9 V at time 10 s. Find a linear model for this system.
Solution
Compute a simple linear regression by filling in the values into the matrix-vector
system of Eq. (6.32):
$$\begin{bmatrix} 10 & 55 \\ 55 & 385 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \end{bmatrix} = \begin{bmatrix} 27.0 \\ 180.1 \end{bmatrix}$$
Then solve the system to find c0 = 0.59 and c1 = 0.38. This means the model is:
$$y = f(x) = 0.59 + 0.38x$$
And the model's SSE, computed using Eq. (6.29), is 3.30.
Graphically, the measurements and the regressed line are presented below.
It can be seen that the data points are lined up in two uneven linear sets. While
the model does not actually go through any of the points, it is nonetheless the
best approximation, as it goes roughly in-between the two sets, a bit closer to
the larger one. Since the errors are squared in the SSE, attempting to reduce
the error by moving the line closer to one of the sets of points would cause a
much larger increase in the error on the other set.

6.9 Vandermonde Method

The Vandermonde method learned for interpolation in Sect. 6.3 can be used for
linear regression as well. Much like before, this is done first by writing out the
polynomial of the model, then filling in a Vc = y system using the values of the
measurements, and finally solving for the coefficient vector. The main difference
with the previous version of the method is that there are a lot more points than
coefficients, so the matrix-vector system does not balance out. This can be simply
solved by multiplying both sides of the system by the transpose of V. The correct
system for regression is thus:
$$V^T V c = V^T y \tag{6.33}$$

It is worth noting that the Vandermonde method is equivalent to the method of least
squares. The multiplications V^T V and V^T y yield the values computed by deriving
and expanding the SSE equations. The main advantage of the Vandermonde
method is its simplicity. The matrix V and vector y are straightforward to build
from the observations without having to derive equations or remember sets of
summations, then the system can be built from two multiplications only.
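The equivalence with the method of least squares can be checked directly for the straight-line case. The following worked expansion is a sketch that assumes the points are indexed from 0 to n − 1, as elsewhere in this chapter, and that V has one row (1, x_i) per measurement:

```latex
V = \begin{bmatrix} 1 & x_0 \\ 1 & x_1 \\ \vdots & \vdots \\ 1 & x_{n-1} \end{bmatrix},
\qquad
V^T V = \begin{bmatrix} n & \sum_{i=0}^{n-1} x_i \\ \sum_{i=0}^{n-1} x_i & \sum_{i=0}^{n-1} x_i^2 \end{bmatrix},
\qquad
V^T y = \begin{bmatrix} \sum_{i=0}^{n-1} y_i \\ \sum_{i=0}^{n-1} y_i x_i \end{bmatrix}
```

These are exactly the matrix and right-hand side of Eq. (6.32), so solving Eq. (6.33) for a degree-1 polynomial gives the same c0 and c1 as the method of least squares.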
An important benefit of having a fast and simple way to compute regressions is
to make it possible to easily compute multiple regressions of a set of measurements.
This is a benefit when the degree of the polynomial required to model a system is
unknown, and must be discovered through trial-and-error. Generating multiple
regressions at different degrees and finding which one gives the best trade-off
between low SSE and simplicity is one modelling approach that can work when
there is no other information available.


Example 6.9
The following measurements of an electrical system were taken with noisy
equipment. At time 1 s the output is −0.3 V, at time 2 s it is −0.2 V, at time 3 s
it is 0.5 V, at time 4 s it is 2.0 V, at time 5 s it is 4.0 V, at time 6 s it is 6.0 V, at
time 7 s it is 9.0 V, at time 8 s it is 13.0 V, at time 9 s it is 17.0 V, and at time
10 s it is 22.0 V. Find a model for this system.
Solution
Since the degree of the model is unknown, use a trial-and-error approach to
find the correct one. Begin by computing three regressions for polynomials of
degree 1, 2, and 3, to see if one of those can approximate the data well
enough. If none of them are appropriate, higher-degree regressions might be
required. The three polynomials are:
$$f_1(x) = c_0 + c_1 x$$
$$f_2(x) = c_0 + c_1 x + c_2 x^2$$
$$f_3(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3$$
The corresponding Vandermonde systems are:
$$V_1^T V_1 \mathbf{c}_1 = V_1^T y$$
$$V_2^T V_2 \mathbf{c}_2 = V_2^T y$$
$$V_3^T V_3 \mathbf{c}_3 = V_3^T y$$
where the matrices Vi and the vectors of coefficients ci will have two, three, or
four columns or rows, respectively, depending on the polynomial being
computed. Expanded, the seven vectors and matrices used in the above
equations are:
$$V_1 = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \\ 1 & 7 \\ 1 & 8 \\ 1 & 9 \\ 1 & 10 \end{bmatrix}
\quad
V_2 = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \\ 1 & 5 & 25 \\ 1 & 6 & 36 \\ 1 & 7 & 49 \\ 1 & 8 & 64 \\ 1 & 9 & 81 \\ 1 & 10 & 100 \end{bmatrix}
\quad
V_3 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \\ 1 & 5 & 25 & 125 \\ 1 & 6 & 36 & 216 \\ 1 & 7 & 49 & 343 \\ 1 & 8 & 64 & 512 \\ 1 & 9 & 81 & 729 \\ 1 & 10 & 100 & 1000 \end{bmatrix}
\quad
y = \begin{bmatrix} -0.3 \\ -0.2 \\ 0.5 \\ 2 \\ 4 \\ 6 \\ 9 \\ 13 \\ 17 \\ 22 \end{bmatrix}$$
$$\mathbf{c}_1 = \begin{bmatrix} c_0 \\ c_1 \end{bmatrix}
\quad
\mathbf{c}_2 = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}
\quad
\mathbf{c}_3 = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}$$
Solving each of the three vector-matrix systems finds the coefficients of the
corresponding polynomial. Those polynomials are:
$$f_1(x) = -6.25 + 2.46x$$
$$f_2(x) = 0.19 - 0.76x + 0.29x^2$$
$$f_3(x) = -0.10 - 0.50x + 0.24x^2 + 0.003x^3$$
The final challenge is to decide which of these three polynomials, if any, is
the best approximation of the system, to use in a model. To make this
decision, consider the SSE values. For f1(x) it is 45.70, for f2(x) it is 0.23,
and for f3(x) it is 0.19. These error values clearly indicate that a polynomial of
degree 1 is a very wrong approximation of the data, while a polynomial of
degree 3 gives very little improvement compared to the one of degree 2. The
polynomial of degree 2 is the best model in this situation.
Alternatively, looking at the situation graphically can help shed some light
on it. The data points are presented in the following figure, along with the
approximations f1(x) in solid red, f2(x) in dashed red, and f3(x) in dashed
brown. A visual inspection makes it clear that the measurements are following
a parabolic curve and that the straight-line regression is a very poor approximation. Meanwhile, the degree-3 approximation overlaps very much with
the degree-2 approximation and does not offer a better approximation.


Finally, consider the values of the coefficient of the highest-degree term in
each equation. For f1(x) and f2(x), the terms of degree 1 and degree 2, respectively, have coefficients that are of the same magnitude as those of other terms
in the equation, indicating that they contribute as much as other terms in the
equation and that degree 1 and degree 2 matter in this model. For f3(x), the
coefficient of the degree-3 term is two orders of magnitude less than those of
the other terms of the equation, making its contribution to the equation
minimal and indicating that the system is probably not of degree 3.
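The trial-and-error procedure of this example can be sketched with NumPy as follows. The helper `vandermonde_regression` is a hypothetical name, and the first two outputs are taken as −0.3 V and −0.2 V, an assumption that reproduces the coefficients reported above. The normal equations of Eq. (6.33) are solved directly, which is adequate for these low degrees but can become ill-conditioned for high-degree polynomials.

```python
import numpy as np


def vandermonde_regression(x, y, degree):
    """Solve V^T V c = V^T y for a polynomial of the given degree; return (c, SSE)."""
    # Columns of V are 1, x, x^2, ..., x^degree.
    V = np.vander(x, degree + 1, increasing=True)
    c = np.linalg.solve(V.T @ V, V.T @ y)
    residuals = y - V @ c
    return c, float(residuals @ residuals)


x = np.arange(1.0, 11.0)  # times 1 s to 10 s
y = np.array([-0.3, -0.2, 0.5, 2.0, 4.0, 6.0, 9.0, 13.0, 17.0, 22.0])

results = {}
for degree in (1, 2, 3):
    coefficients, sse = vandermonde_regression(x, y, degree)
    results[degree] = (coefficients, sse)
    print(degree, np.round(coefficients, 3), round(sse, 2))
```

Comparing the SSE values stored in `results` shows a large drop from degree 1 to degree 2 and only a marginal drop from degree 2 to degree 3, which is the pattern used above to select the degree-2 model.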

6.9.1 Vandermonde Method for Multivariate Linear Regression

In interpolation, one of the major advantages of the Vandermonde method was that
it made it possible to model multivariate cases and multidimensional problems
easily. This is also true when the method is used for linear regression. Moreover, it
is done in the same way, by computing the values of the matrix V using terms of a
multivariate polynomial and solving the system to get the coefficients.
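As a minimal sketch of the multivariate case, the code below fits a plane z = c0 + c1 x + c2 y with NumPy. It uses the six ski-slope measurements of Example 6.10, which follows, including the reference point (1,3) at elevation 0 m; the variable names are illustrative only.

```python
import numpy as np

# (x, y) GPS coordinates and measured elevations z of the ski-slope example.
xy = np.array([[0, 0], [2, 1], [2.5, 2], [1, 3], [4, 6], [7, 2]], dtype=float)
z = np.array([5, 10, 9, 0, 3, 27], dtype=float)

# Each column of V multiplies one coefficient: 1 for c0, x for c1, y for c2.
V = np.column_stack([np.ones(len(xy)), xy[:, 0], xy[:, 1]])

# Normal equations V^T V c = V^T z, as in the single-variable case.
c = np.linalg.solve(V.T @ V, V.T @ z)
sse = float(np.sum((z - V @ c) ** 2))
print(np.round(c, 6), round(sse, 6))
```

Only the construction of V changes compared with the single-variable case; adding a term such as xy or x² to the model would simply add one more column.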
Example 6.10
The shape of a ski slope needs to be modelled. The elevation of various points
on the slope has been measured, along with their GPS coordinates. Defining
the low end of the ski lift at GPS coordinates (1,3) as elevation 0 m, the points
measured are: (0,0) 5 m, (2,1) 10 m, (2.5,2) 9 m, (4,6) 3 m, and (7,2) 27 m.
Knowing that the ski slope can be approximated as a plane, find the best
model for it.
Solution
A plane is simply a linear polynomial in 3D, and its equation is:
$$z = f(x, y) = c_0 + c_1 x + c_2 y$$
The system to solve is V^T V c = V^T z, where the matrix V will contain the
values multiplying each coefficient, namely 1, x, and y respectively. The
values of the three variables in the system are:
$$V = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 2 & 1 \\ 1 & 2.5 & 2 \\ 1 & 1 & 3 \\ 1 & 4 & 6 \\ 1 & 7 & 2 \end{bmatrix}
\quad
c = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix}
\quad
z = \begin{bmatrix} 5 \\ 10 \\ 9 \\ 0 \\ 3 \\ 27 \end{bmatrix}$$
Solving the system finds the coefficients c0 = 5, c1 = 4, and c2 = −3,
corresponding to the polynomial:
$$z = f(x, y) = 5 + 4x - 3y$$
which has an SSE of 0.0. Visually, the measurement points and the model
plane look like this:

6.10 Transformations

The two regression methods seen so far are used specifically for linear regressions.
However, many systems in engineering practice are not linear, but instead are
logarithmic or exponential. Such systems cannot be modelled accurately by a linear
polynomial, regardless of the order of the polynomial used. This is the case, for
example, of models of population growth and of capacitor charges in resistor-capacitor (RC) circuits.
In such a situation, the solution is to compute a transformation of the nonlinear
equation into a linear one, compute the linear regression in that form to find the best
approximation, then to reverse the transformation to find the real model. The
transformation is whatever operation is needed to turn the polynomial into a linear
function. For example, if the function is logarithmic, the transformation is to take its
exponential, and the reverse transformation is to take the logarithm of the model.
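The transform, regress, and reverse steps can be sketched for an exponential model y = c0 e^(c1 x), whose transformation is the natural logarithm. The function `fit_exponential` and the synthetic, noise-free data below are hypothetical and chosen only so that the recovered coefficients can be checked by eye.

```python
import math


def fit_exponential(xs, ys):
    """Fit y = c0 * exp(c1 * x) by a simple linear regression on (x, ln y)."""
    logs = [math.log(y) for y in ys]  # transformation: requires every y > 0
    n = len(xs)
    sum_x = sum(xs)
    sum_l = sum(logs)
    sum_xx = sum(x * x for x in xs)
    sum_xl = sum(x * l for x, l in zip(xs, logs))
    slope = (n * sum_xl - sum_x * sum_l) / (n * sum_xx - sum_x * sum_x)
    intercept = (sum_l - slope * sum_x) / n
    # Reverse transformation: the intercept is ln(c0), the slope is c1.
    return math.exp(intercept), slope


# Synthetic data generated from y = 2 e^(-0.5 x).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]
c0, c1 = fit_exponential(xs, ys)
print(round(c0, 6), round(c1, 6))  # 2.0 -0.5
```

Note that the regression minimizes the SSE of ln(y), not of y itself, so with noisy data the result is the best fit in the transformed space rather than the least-squares exponential in the original space.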
Example 6.11
The following measures are taken of a discharging capacitor in an RC circuit:
at time 0.25 s it registers 0.54 V, at 0.75 s it registers 0.25 V, at 1.25 s it
registers 0.11 V, at 1.75 s it registers 0.06 V, and at 2.25 s it registers 0.04 V.
Find the best model to approximate this capacitor.
Solution
Plotting the measurements graphically shows clearly that they follow an
exponential relationship of the form:
$$y = f(x) = c_0 e^{c_1 x}$$
Such a function cannot be modelled using the linear regression tools seen so
far. However, transforming by taking the natural log of each side of the
equation yields a simple linear function:
$$\ln(y) = \ln\left(c_0 e^{c_1 x}\right) = \ln(c_0) + \ln\left(e^{c_1 x}\right) = c_0^{\mathrm{transform}} + c_1 x$$
This simple linear regression problem can easily be solved using the method
of least squares or the Vandermonde method to find the coefficients. The
linear equation is:
$$\ln(y) = -0.25 - 1.66x$$
Finally, reverse the transformation by taking the exponential of each side of
the equation:
$$y = e^{-0.25 - 1.66x} = e^{-0.25} e^{-1.66x} = 0.78 e^{-1.66x}$$
That equation models the measured data with an SSE of 0.002. The data and
the modelling exponential are illustrated below.

6.11 Linear Regression Error Analysis

Linear regression error is different from the interpolation error computed previously in some major respects. Interpolation methods compute a polynomial that fits
the measured data exactly, and consequently constrain the error that can occur
in-between those measures, since the error must always drop back to zero at the
next interpolated measurement. Linear regression does not impose such a requirement; it computes a polynomial that approximates the measured data, and that
polynomial might not actually fit any of the measurements with zero error. Consequently, the error in-between the measures is not constrained. It is instead probabilistic: the values in-between the approximated measurements are probably near
the polynomial values (since it is the approximation with minimal SSE), but some
of them might be far away from it. In fact, the same holds true for the measurements
themselves. The situation could be understood visually by adding a third probability
dimension on top of the two dimensions of the data, as in Fig. 6.7. That figure shows
a polynomial y f(x) regressed from a set of points, and the probability of the
position of measurements in the XY plane is illustrated in the third dimension. The
error of the measurement is thus a normal distribution with the mean at the
polynomial, which is the position with least error, and some standard deviation
of unknown value.

Fig. 6.7 Linear regression on the x- and y-axes, with the probability of the measurements on top

Table 6.2 Sample standard deviation and confidence interval

| Range of s around f(x) | Confidence interval of observations |
|------------------------|-------------------------------------|
| ±1.00s                 | 0.6826895                           |
| ±1.28s                 | 0.8000000                           |
| ±1.64s                 | 0.9000000                           |
| ±1.96s                 | 0.9500000                           |
| ±2.00s                 | 0.9544997                           |
| ±2.58s                 | 0.9900000                           |
| ±2.81s                 | 0.9950000                           |
| ±3.00s                 | 0.9973002                           |
| ±3.29s                 | 0.9990000                           |
| ±4.00s                 | 0.9999366                           |
| ±5.00s                 | 0.9999994                           |

The standard deviation may not be known, but given the set of measurement
points it can be approximated as the sample standard deviation s:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=0}^{n-1} \left(y_i - f(x_i)\right)^2} \tag{6.34}$$

This in turn makes it possible to compute the confidence interval (CI) of the
approximation, or the area around the regressed polynomial that the measurements
are likely to be found with a given probability. For a normal distribution, these
intervals are well known: 68.3 % of the observed measurements y will be within 1
sample standard deviation of the regressed f(x), 95.4 % of the measurements will
be within 2s, and 99.7 % of the measurements will be within 3s of f(x). These
points are also called the 0.683 CI, the 0.954 CI, and the 0.997 CI. Table 6.2 lists
other common relationships between s and CI.
Example 6.12
The following measurements of an electrical system were taken with noisy
equipment. At time 1 s the output is 2.6228 V, at time 2 s it is 2.9125 V, at
time 3 s it is 3.1390 V, at time 4 s it is 4.2952 V, at time 5 s it is 4.9918 V,
at time 6 s it is 4.6468 V, at time 7 s it is 5.4008 V, at time 8 s it is 6.3853 V, at
time 9 s it is 6.7494 V, and at time 10 s it is 7.3864 V. Perform a simple linear
regression to find a model of the system, and compute the 0.8 CI.

Solution
Using any of the methods seen previously, the model of the system can be
found to be the straight-line polynomial
$$y = f(x) = 1.889 + 0.539x$$
with an SSE of 0.745. Next, the sample standard deviation is given by
Eq. (6.34) as:
$$s = \sqrt{\frac{1}{9}\,0.745} = 0.288$$
The 0.8 CI is the interval at 1.28 times s, as listed in Table 6.2. The
requested function is thus
$$y = f(x) \pm 1.28s = 1.889 + 0.539x \pm 0.369$$
Visually, the result is illustrated below. The ten data points are marked
with dots, the regressed polynomial is the solid red line, and the upper and
lower bounds of the confidence interval are the two dotted lines. It can be seen
visually that this interval does include 8 of the 10 measurement points, as
expected; only the points at 5 and 6 s fall outside the range.
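The computation of this example can be reproduced with the short sketch below. The helper `fit_line` is a hypothetical name, and the factor 1.28 is the Table 6.2 multiplier for the 0.8 CI; a different confidence level would only change that factor.

```python
import math


def fit_line(xs, ys):
    """Simple linear regression; returns (c0, c1) for y = c0 + c1 x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    c1 = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    return (sum_y - c1 * sum_x) / n, c1


times = list(range(1, 11))
volts = [2.6228, 2.9125, 3.1390, 4.2952, 4.9918,
         4.6468, 5.4008, 6.3853, 6.7494, 7.3864]

c0, c1 = fit_line(times, volts)
residuals = [y - (c0 + c1 * x) for x, y in zip(times, volts)]
sse = sum(r * r for r in residuals)
s = math.sqrt(sse / (len(times) - 1))  # sample standard deviation, Eq. (6.34)
half_width = 1.28 * s                  # 0.8 CI multiplier from Table 6.2

# Times whose measurement falls outside the band f(x) +/- 1.28 s.
outside = [t for t, r in zip(times, residuals) if abs(r) > half_width]
print(round(c0, 3), round(c1, 3), round(s, 3), outside)
```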

6.12 Extrapolation

The two techniques seen so far, interpolation and regression, have in common that
they take in a set of measurements from x0 to xn−1 and compute a model to represent
the system within the interval covered by those n measurements in order to predict
the value of new measurements with a predictable error. The model in question is
however not valid outside of that interval, and if used beyond those bounds it could
lead to a massive misrepresentation of reality. The problem is illustrated graphically in Fig. 6.8, in the case of the interpolation of three points. The degree-2
polynomial interpolated (the solid red parabola in the figure) fits the measurements
perfectly and is a good, low-error approximation of the real system (the blue line) in
that interval. However, the real system is a degree-4 polynomial, and as a result,
outside the interpolation region of the three measurements, the polynomial quickly
becomes an inaccurate and high-error approximation of the system (the dashed red
line), especially after the inflection points of the system that are not part of the model.

Fig. 6.8 Comparison of interpolation and extrapolation of a system

Nonetheless, being able to model and predict the values of a system beyond the
confines of a measured interval is a common problem in engineering. It must be
done, for example, in order to predict the future behavior of a natural system, in order
to design structures that can withstand the likely natural conditions and variations
they will be subjected to. It is also necessary to reconstruct historical data that has
been lost or was never measured, for example to analyze the failure of a system after
the fact and understand the conditions that caused it to go wrong. This challenge, of
modelling a system beyond the limits of the measurements, is called extrapolation.
Performing an accurate extrapolation requires more information than interpolation and linear regression. Most notably, it requires knowledge of the nature of the
system being modelled, and of the degree of the polynomial that can represent it.
With that additional information, it becomes possible to compute a model that will
have the correct number of inflection points and will avoid the error illustrated in
Fig. 6.8. Then, by performing a linear regression over the set of measurements, it is
possible to find the best polynomial of the required degree to approximate the data.
That polynomial will also give the best extrapolation values.
Example 6.13
The following input/output measurements of a system were recorded:
(−0.73507, 0.17716), (−0.58236, 0.13734), (−0.22868, 0.00741),
(0.24253, −0.00397), (0.27129, 0.01410), (0.31244, 0.08215),
(0.51378, 0.04926), (0.59861, 0.14643), (0.63754, 0.08751)
The system is believed to be either linear or quadratic. Predict the output of
the system at x = 1.5 in each case, and determine which one is the best
prediction.
Solution
Using any of the linear regression methods seen so far, the degree-1 and
degree-2 polynomials can be found to be:
$$f_1(x) = 0.08270 - 0.04556x$$
$$f_2(x) = 0.00470 - 0.01834x + 0.30768x^2$$
The values extrapolated with each one are f1(1.5) = 0.01436 and f2(1.5) =
0.66947. These are clearly two very different answers! However, a quick
visual inspection of the data, presented below, shows that the data is quadratic
rather than linear, and the correct extrapolated value is thus 0.66947. This
illustrates the absolute necessity of knowing with certainty the correct degree
of polynomial to use to model the data when doing extrapolation. An error of
even one degree leads to a wrong inflection in the model and to extrapolated
values that are potentially wildly divergent from reality.
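The sensitivity of extrapolation to the assumed degree can be reproduced with NumPy's `polyfit` and `polyval`. This sketch uses the nine measurements of this example, with the first three inputs and the fourth output taken as negative, an assumption that is consistent with the coefficients reported above.

```python
import numpy as np

x = np.array([-0.73507, -0.58236, -0.22868, 0.24253, 0.27129,
              0.31244, 0.51378, 0.59861, 0.63754])
y = np.array([0.17716, 0.13734, 0.00741, -0.00397, 0.01410,
              0.08215, 0.04926, 0.14643, 0.08751])

# np.polyfit returns coefficients from the highest degree down to the constant.
linear = np.polyfit(x, y, 1)
quadratic = np.polyfit(x, y, 2)

# Evaluate both models well outside the measured interval.
x_new = 1.5
f1_value = float(np.polyval(linear, x_new))
f2_value = float(np.polyval(quadratic, x_new))
print(round(f1_value, 5), round(f2_value, 5))
```

Both models fit the measured interval with small residuals, yet their predictions at x = 1.5 differ by more than an order of magnitude, which is why the degree must be known from an understanding of the system rather than chosen from the fit alone.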

6.13 Summary

It is often necessary, in engineering practice, to develop a mathematical model of a
system given only a set of observed measurements. This challenge can further be
divided into three categories. When the measurements are exact and error-free and
the model must account for them exactly, it is called interpolation. When the
measurements are inexact and the model must approximate them, it is linear
regression. And when values beyond the bounds of the measurements must be
modelled and predicted, the challenge is extrapolation. This chapter has introduced
several techniques to accomplish all three of these. The centerpiece is the
Vandermonde method, a technique to reduce the challenge to a linear vector-matrix
problem such as those studied in Chap. 4. This method benefits from very large
flexibility: it can be adapted for interpolation or linear regression, to discover
single-variable or multivariable polynomials, and even to handle nonlinear polynomials. The Lagrange polynomials method was also introduced as a more human-understandable technique for interpolation, while the Newton polynomials have the
benefit of allowing the incremental improvement of the model when new measurements become available. For regression, in addition to the Vandermonde method,
the equivalent mathematical development of the method of least squares was
discussed in order to demonstrate the formal mathematical foundations of the
techniques.

6.14 Exercises

1. Using the Vandermonde Method, find the polynomial which interpolates the
following set of measurements:
(a) (2,3), (5,7).
(b) (0,2), (1,6), (2,12).
(c) (2,21), (0,1), (1,0), (3, 74).
(d) (1,5), (2,7), (4,11), (6,15).
(e) (3.2,4.5), (1.5,0.5), (0.3,0.6), (0.7,1.2), (2.5,3.5).
(f) (1.3,0.51), (0.57,0.98), (0.33,1.2), (1.2,14), (2.1, 0.35), (0.36,0.52).

2. Must the x values be ordered from smallest to largest exponent for the
Vandermonde method to work?
3. Using the Vandermonde Method, find the polynomial of the form
$f(x) = c_1\sin(x) + c_2\cos(x)$ which interpolates the following set of measurements: (0.3,0.7),
(1.9, 0.2).
4. Using the Vandermonde Method, find the polynomial of the form
$f(x) = c_0 + c_1\sin(x) + c_2\cos(x)$ which interpolates the following set of measurements:
(4,0.3), (5, 0.9), (6, 0.2).
5. Using Lagrange polynomials, find the polynomial which interpolates the following set of measurements:
(a) (2,3), (5, 6).
(b) (1,2), (3,4)
(c) (2,9), (3, 14), (5, 24).
(d) (0,1), (1,0), (3,4).
(e) (5.3,4.6), (7.3,2.6).
(f) (0,0), (1,2), (2,36)
(g) (0,0), (1,2), (2,36), (3,252)

6. Using Newton polynomials, find the polynomial which interpolates the following set of measurements:
(a) (2,3), (5,7).
(b) (2,2), (3,1), (5,2).


(c) (2, 39), (0,3), (1,6), (3,36).
(d) (2,21), (0,1), (1,0), (3, 74).
(e) (1,5), (2,7), (4,11), (6,15).
(f) (1.3, 0.51), (0.57, 0.98), (0.33, 1.2), (1.2, 14), (2.1, 0.35), (0.36, 0.52).

7. Must the x values be ordered from smallest to largest for the method to find
Newton polynomials to work?
8. Suppose you have computed the polynomial which interpolates the set of
measurements (1, 4), (3, 2), (4, 10), (5, 16) using the following table of
divided differences:
xi    f(xi)    f(xi,xi+1)    f(xi,xi+1,xi+2)    f(xi,xi+1,xi+2,xi+3)

3
3

5
12

2
3

6
5

Use this result to compute the polynomial which interpolates the set of
measurements (3,2), (4,10), (5,16), (7,34).
9. Using the Vandermonde Method, find the polynomial which interpolates the
following set of measurements:
(a) (0,0,5), (0,1,4), (1, 0,3), (1,1,6).
(b) (2,4,12), (2,5,11), (3,4,10), (3,5,14).
(c) (2,2,6), (2,3,10), (4,2,12), (5,5,18).
(d) (2,3,16), (2,5,15), (4,3,14), (6,6,17).
(e) (1,1,3.2), (1,2,4.4), (1,3,6.5), (2,1,2.5), (2,2,4.7), (2,3,5.8), (3,1,5.1),
(3,2,3.6), (3,3,2.9).
(f) (1,2,3,4.9), (3,5,2,2.6), (5,4,2,3.7), (4,1,4,7.8).

10. Compute a simple linear regression using the following set of measurements:
(a) (1,0), (2,1), (3,1), (4,2).
(b) (0.282,0.685), (0.555,0.563), (0.089,0.733), (0.157,0.722), (0.357,0.662),
(0.572,0.588), (0.222,0.693), (0.800,0.530), (0.266,0.650), (0.056,0.713).
(c) (1, 2.6228), (2, 2.9125), (3, 3.1390), (4,4.2952), (5, 4.9918), (6, 4.6468),
(7,5.4008), (8, 6.3853), (9, 6.7494), (10, 7.3864).
(d) (0.350,2.909), (0.406,2.987), (0.597,3.259), (1.022,3.645), (1.357,4.212),
(1.507,4.295), (2.228,5.277),(2.475,5.574), (2.974,6.293), (2.975,6.259).
11. Consider the following set of measurements submitted for simple linear
regression:
(1, 2.6228), (2, 2.9125), (3, 3.1390), (4, 4.2952), (5, 4.9918),
(6, 4.6468), (7, 5.4008), (8, 63.853), (9, 6.7494), (10, 7.3864)


What would you consider to be problematic about it, and what would you
consider a reasonable solution?
12. Compute a linear regression for a quadratic polynomial using the following set
of measurements:
(a) (2,3), (1,1), (0,0), (1,1), (2,5).
(b) (1,0.5), (2,1.7), (3,3.4), (4,5.7), (5,8.4).
(c) (0,2.1), (1,7.7), (2,13.6), (3,27.2), (4,40.9), (5,61.1).
13. Compute a linear regression for an exponential polynomial using the following
set of measurements:
(a) (0.029,2.313), (0.098, 2.235), (0.213,2.094), (0.352,1.949), (0.376,1.924),
(0.393,1.907), (0.473,1.828), (0.639,1.674), (0.855,1.493), (0.909,1.451).
(b) (0.228,0.239), (0.266,0.196), (0.268,0.218), (0.345,0.173), (0.351,0.188),
(0.543,0.090), (0.667,0.057), (0.942,0.022), (0.959,0.026), (0.991,0.019).
(c) (0,0.71666), (1,0.42591), (2,0.25426), (3,0.15122), (4,0.08980), (5,0.05336),
(6,0.03179), (7,0.01889), (8,0.01123), (9,0.00666), (10,0.00396).
14. Using the following set of measurements:
(0,2.29), (1,1.89), (2,1.09), (3,0.23), (4, 0.80), (5, 1.56), (6, 2.18),
(7, 2.45), (8, 2.29), (9, 1.75), (10, 1.01)
compute a linear regression for a polynomial of the following form:
(a) $f(x) = c_1\sin(0.4x) + c_2\cos(0.4x)$.
(b) $f(x) = c_0 + c_1\sin(0.4x) + c_2\cos(0.4x)$.
(c) Comparing both polynomials, what conclusion can you reach about the
constant term c0?
15. Compute the value at the requested point given each of the following sets of
measurements, knowing that the polynomial is linear:
(a) (1, 7), (0, 3), (1, 0), (2, 3), looking for x = 3.
(b) (0.3,0.80), (0.7,1.3), (1.2,2.0), (1.8,2.7), looking for x = 2.3.
(c) (0.01559,0.73138), (0.30748,0.91397), (0.31205,0.83918), (0.90105,1.05687),
(1.21687,1.18567), (1.47891,1.23277), (1.52135,1.25152), (3.25427,1.79252),
(3.42342, 1.85110), (3.84589,1.98475), looking for x = 4.5.
16. Compute the value at x = 3 given the following set of measurements, knowing
that the polynomial is of the form $y(x) = c_1 x^2$: (2, 5), (1, 1), (0, 0), (1, 2),
(2, 4).
17. The following measurements come from the exponential decrease of a
discharging capacitor:
(1.5,1.11), (0.9,0.92), (0.7,0.85), (0.7,0.57), (1.2,0.49), (1.4,0.45)
At what time will the charge be half the value of the charge at time 0 s?

Chapter 7
Bracketing

7.1 Introduction

Consider the problem of searching for the word lemniscate in a dictionary of
thousands of pages. It would be unthinkable to find the word by reading the
dictionary systematically page by page. However, since words are sorted alphabetically from A to Z in the dictionary, it is easy to open the dictionary at a random
page and determine whether the letter L is before or after that page. This single step
will greatly reduce the number of pages to search through. Next, select a page at
random in the portion of the dictionary the word is known to be in, and determine
whether the word is before or after that second point. Once a point starting with the
letter L is reached, the following letter E is considered: the section kept is the one
containing LE, then LEM, and so on until sufficient precision is achieved (namely
that the page containing the word is found).
This type of search is called bracketing. It consists in defining brackets, or upper
and lower bounds, on the value of the solution of a problem, then iteratively refining
these bounds and reducing the interval the solution is found in, until the interval
represents an acceptable error range on the solution.
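
As a rough illustration of the dictionary search described above, the following Python sketch brackets a word in a small sorted list. The word list and the function name `bracket_word` are made up for this example, and the sketch opens the "dictionary" at the middle of the remaining section rather than at a random page, which is an assumption made to keep the result reproducible.

```python
def bracket_word(words, target):
    """Shrink the bracket [low, high] over a sorted list until target is found.

    Returns (index, steps); index is None if the word is absent.
    """
    low, high = 0, len(words) - 1
    steps = 0
    while low <= high:
        steps += 1
        middle = (low + high) // 2      # "page" opened in the remaining section
        if words[middle] == target:
            return middle, steps
        if words[middle] < target:      # target comes later alphabetically
            low = middle + 1
        else:                           # target comes earlier alphabetically
            high = middle - 1
    return None, steps


# Hypothetical miniature dictionary, already in alphabetical order.
words = ["apple", "bracket", "kite", "lemma", "lemniscate", "lemon", "matrix", "zebra"]
index, steps = bracket_word(words, "lemniscate")
print(index, steps)
```

Each pass discards the part of the list that cannot contain the word, which is the same narrowing of bounds that the rest of this chapter applies to numerical problems.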

7.2 Binary Search Algorithm

Possibly the simplest and most popular bracketing algorithm is the binary search
algorithm. Assume that a solution to a problem is needed; its value is unknown but
it is known to be somewhere between a lower bound value xL and an upper bound
value xU. The algorithm works by iteratively dividing the interval in half and
keeping the half that contains the solution. So, in the first iteration, the middle point
would be:


$$x_i = \frac{x_U + x_L}{2} \tag{7.1}$$

and, assuming the solution is not exactly that point xi (which it will rarely if ever be
for any real-world problem except very simple ones), then one of the two intervals,
either [xL, xi] or [xi, xU], will contain the solution. In either case, the search interval
is now half the size it was before! In the next iteration, the remaining interval is
again divided into half, then again and again. As with any iterative algorithm in
Chap. 3, halting conditions must be defined, namely a maximum number of
iterations and a target-relative error. The error can be defined as before, between
two successive middle points (using for example xi+1, the middle point of the
interval [xL, xi]):

$$E_{rel} = \left|\frac{x_i - x_{i+1}}{x_{i+1}}\right| \tag{7.2}$$
The final result returned by the algorithm is not the solution to the problem, but an
interval that the solution is found in. The solution can be approximated as the
central point of that interval, with half the interval as absolute error:
$$x_{i+1} \pm \frac{|x_i - x_L|}{2} \tag{7.3}$$

Note that this definition of the absolute error could be substituted into Eq. (7.2) as
well to compute the relative error:

$$E_{rel} = \left|\frac{(x_i - x_L)/2}{x_{i+1}}\right| \tag{7.4}$$
The pseudocode for the binary search algorithm is presented in Fig. 7.1. This code
will serve as a foundation for the more sophisticated bracketing algorithms that will
be presented in later chapters. Note that it requires a call to a function
SolutionInInterval(XL,XU) which serves to determine if the solution is
between the lower and upper bounds given in parameter. This function cannot be
defined in pseudocode; it will necessarily be problem-specific. For example, in the
dictionary lookup problem of the previous section it will consist in comparing the
spelling of the target word to that of the bounds, while in more mathematical
problems it can require evaluating a function and comparing the result to that of
the bounds. Note as well that the first check in the IF uses the new point x as both
bounds; it will return true if that point is the exact solution.


XL ← Input lower bound
XU ← Input upper bound
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
x ← XL
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    x ← XL + (XU - XL) / 2
    IF ( CALL SolutionInInterval(x,x) )
        RETURN Success, x
    ELSE IF ( CALL SolutionInInterval(XL,x) )
        XU ← x
    ELSE IF ( CALL SolutionInInterval(x,XU) )
        XL ← x
    END IF
    CurrentError ← absolute value of [ (x - PreviousValue) / x ]
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION SolutionInInterval(XL, XU)
    IF (solution to the problem is between XL and XU)
        RETURN TRUE
    ELSE
        RETURN FALSE
    END IF
END FUNCTION

Fig. 7.1 Pseudocode of the binary search algorithm
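
The pseudocode of Fig. 7.1 could be transcribed into Python roughly as follows. This is only a sketch: the names `binary_search` and `sqrt2_in_interval` are invented here, the interval test is passed in as a function argument, and the sample problem (bracketing the square root of 2 between 1 and 2) is an assumption chosen because its interval test is easy to write.

```python
def binary_search(x_low, x_up, solution_in_interval,
                  max_iterations=100, min_error=1e-6):
    """Halve [x_low, x_up] until successive middle points agree to min_error."""
    x = x_low
    for _ in range(max_iterations):
        previous = x
        x = x_low + (x_up - x_low) / 2          # middle of the current interval
        if solution_in_interval(x, x):          # the middle point is the solution
            return x
        if solution_in_interval(x_low, x):      # keep the lower half
            x_up = x
        elif solution_in_interval(x, x_up):     # keep the upper half
            x_low = x
        if abs((x - previous) / x) <= min_error:
            return x
    raise RuntimeError("no convergence within the iteration limit")


def sqrt2_in_interval(low, up):
    """Problem-specific test: is sqrt(2) between low and up?"""
    return low * low <= 2.0 <= up * up


root = binary_search(1.0, 2.0, sqrt2_in_interval)
print(root)
```

The problem-specific knowledge lives entirely in the interval test, so the search loop itself can be reused unchanged for other problems. Note that the relative error divides by `x`, so this sketch assumes the middle point is never zero.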

7.3 Advantages and Limitations

Bracketing is by far the simplest and most intuitive of the mathematical tools
covered in this book. It is nonetheless quite powerful, since it makes it possible to
zoom in on the solution to a problem given no information other than a way to
evaluate bounds. The initial bounds for the algorithm can easily be selected by rule-of-thumb, rough estimation (or guesstimation), or visual inspection of the
problem.
However, this tool does have important limitations as well. It is less efficient
than the other four mathematical tools studied, both in terms of computational time
and in terms of accuracy. As a result, numerical methods that make use of
bracketing, despite being simpler, will also be the ones that converge to the solution
in the longest time and to the least degree of accuracy. Bracketing methods also


work best for one-dimensional y = f(x) problems and scale up very poorly into
n dimensions. Indeed, a system of n equations and n unknowns will require 2n
bounds to be properly bracketed.

7.4 Summary of the Five Tools

Chapters 3 to 7 have introduced five mathematical and modelling tools: iteration,
linear algebra, Taylor series, interpolation, regression, and extrapolation, and now
bracketing. Each of these tools is useful to solve a specific problem or to model a
specific aspect of nature, and together they will form the building blocks of the
more advanced numerical methods that will be presented in the next six chapters. It
is worth finishing this section with a review of these important building blocks.
Iteration is a tool useful to refine an approximation step by identical step. As a
modelling tool, it can represent any naturally converging system. This tool will
be the foundation of almost all numerical method algorithms coming up.
Linear algebra is useful to model and solve systems with multiple independent
variables that interact through a set of equations. This is often the case, especially when dealing with multivariate or multidimensional problems.
Taylor series is an approximation tool that can give a good local representation
of a system using its derivatives, and can do so with a measurable error. This will
be useful mainly to develop algorithms to solve complex systems and to estimate
the error on the solutions these algorithms provide.
Interpolation, regression, and extrapolation tools are used to estimate a continuous model of a system given a set of discrete measurements. In many cases in
engineering practice and in the upcoming chapters, discrete points are all that is
available to work with, and these tools thus become cornerstones of the work.
Bracketing, the topic of this chapter, is a search tool useful to converge on a
solution when almost no other information is available. While the least efficient
of the five tools, it remains nonetheless a useful last-resort tool to have available.

Chapter 8
Root-Finding

8.1 Introduction

The root of a continuous multidimensional function f(x) is any point x = xr for
which the function f(xr) = 0. Algorithms that discover the value of a function's root
are called root-finding algorithms, and they constitute the first numerical method
presented in this book.
Root-finding is an important engineering modelling skill. Indeed, many situations, once properly modelled by a set of mathematical equations, can be solved by
finding the point where the multiple equations are equal to each other; or said
differently, where the difference between the results of all the equations is zero.
That is the root of the system of equations.
As an example, consider the simple circuit illustrated in Fig. 8.1. Suppose it is
necessary to find the current running through this circuit. To model this system,
recall that Kirchhoff's law states that the sum of voltages of the components in the
circuit will be equal to the voltage of the source:
$$V_D + V_R = V_S \tag{8.1}$$
The voltage of the source VS is known to be 0.5 V, while Ohm's law says that the
voltage going through the resistor is:
$$V_R = RI \tag{8.2}$$
where R is also given in the circuit to be 10 Ω. The Shockley ideal diode equation
gives the voltage going through the diode, VD, as:
$$I = I_S\left(e^{\frac{V_D}{nV_T}} - 1\right) \tag{8.3}$$


Fig. 8.1 A simple diode circuit

where IS is the saturation current, VT is the thermal voltage, and n is the ideality
factor. Assume these values have been measured to be $I_S = 8.3 \times 10^{-10}$ A,
$V_T = 0.7$ V and $n = 2$.
Moving VS to the left-hand side of Eq. (8.1) along with the other two voltage
values makes the model into a root-finding problem. Moreover, incorporating the
equations of VR and VD from Eqs. (8.2) and (8.3) respectively into Eq. (8.1) along
with the measured values given yields the following equation for the system:

$$-0.5 + 1.4\ln\left(1.20482 \times 10^{9} I + 1\right) + 10I = 0 \tag{8.4}$$

The value of the current running through the circuit loop is thus the root of the
circuit's model. In the case of Eq. (8.4), that value is $3.562690102 \times 10^{-10}$ A.
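
As a quick sanity check of the circuit model, the sketch below evaluates the sum of the diode and resistor voltages minus the source voltage, using the component values given in the text, and confirms that it is close to zero at the quoted current. The function name `circuit_residual` and the constant names are illustrative choices, and the diode voltage is obtained by solving the Shockley equation for the voltage, which is an assumption about how the model was assembled.

```python
import math

I_S = 8.3e-10   # saturation current (A), value given in the text
V_T = 0.7       # thermal voltage (V), value given in the text
N = 2           # ideality factor, value given in the text
R = 10.0        # resistance (ohm), value given in the text
V_S = 0.5       # source voltage (V), value given in the text


def circuit_residual(current):
    """Return V_D + V_R - V_S; the circuit current is the root of this function."""
    v_diode = N * V_T * math.log(current / I_S + 1.0)   # Shockley equation solved for V_D
    v_resistor = R * current                             # Ohm's law
    return v_diode + v_resistor - V_S


residual = circuit_residual(3.562690102e-10)
print(residual)
```

A residual near zero at the quoted current is consistent with that current being the root; with no current flowing, the residual is simply the negative of the source voltage.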

8.2 Bisection Method

The bisection method is an iterative bracketing root-finding method. It follows the
bracketing algorithm steps presented in Chap. 7: it begins by setting an upper and
lower bound before and after the root, then at each iteration, it reduces the interval
between these bounds to zoom in on the root. The bisection method thus has the
same advantages and limitations that were highlighted in Chap. 7: it is not only a
simple, robust, and intuitive root-finding method, but also an inefficient one that
works best in one-dimensional problems such as the resistor example of the
previous section.
The algorithm for the bisection method is a simple three-step iterative process:
setup for the iterations, iterate, and terminate the iterations. The steps are explained
below, followed by pseudocode for the algorithm in Fig. 8.2.
Step 1: Set the initial lower bound xL and upper bound xU around the root of the
function f(x). Since the root is a zero of the equation, the two bounds must be on
either side of a zero-crossing; in other words, they must be points where the
equation evaluates to values of opposite signs:

$$[x_L, x_U] : \left(f(x_L) < 0 \wedge f(x_U) > 0\right) \vee \left(f(x_L) > 0 \wedge f(x_U) < 0\right) \tag{8.5}$$

Step 2: Once points on either side of the zero-crossing have been selected, the
bisection method iteratively tightens them by picking the point exactly in-between
the two bounds:


XL ← Input lower bound
XU ← Input upper bound
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
x ← XL
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    x ← (XL + XU) / 2
    Evaluation ← CALL F(x)
    IF ( Evaluation = 0 )
        RETURN Success, x
    ELSE IF ( Evaluation is of the same sign as CALL F(XL) )
        XL ← x
    ELSE IF ( Evaluation is of the same sign as CALL F(XU) )
        XU ← x
    END IF
    CurrentError ← absolute value of [ (x - PreviousValue) / x ]
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION

Fig. 8.2 Pseudocode of the bisection method

$$x_i = \frac{x_L + x_U}{2} \tag{8.6}$$

The function is then evaluated at that point. In some rare cases, the function will
evaluate to zero exactly, in which case the point xi is the root and the algorithm can
terminate. More generally however, the function will evaluate to a positive or
negative value, and the middle point will be on one or the other side of the root.
The middle point replaces the bound on the same side of the root as itself and
becomes the new bound on that side of the root. The interval is thus reduced by half
at each iteration, and the root remains bracketed between a lower and upper bound:

$$[x_L, x_U] \rightarrow \begin{cases} [x_i, x_U] & \text{if } f(x_i)\,f(x_L) > 0 \\ [x_L, x_i] & \text{if } f(x_i)\,f(x_U) > 0 \end{cases} \tag{8.7}$$


Step 3: The iterative process continues until a halting condition is reached. One
halting condition mentioned in the previous step, albeit an unlikely one, is that a
middle point xi is found to be exactly the root. More usually, one of the two halting
conditions presented in Chap. 7 and Chap. 3 will apply: the algorithm will reach a
preset maximum number of iterations (failure condition) or some error metric, such
as the interval between the two brackets or the relative error on the point xi, will
become lower than some preset error value (success condition).
It is clear to see that the absolute error on the approximation of the root xi is the
current interval $[x_{i-2}, x_{i-1}]$. Moreover, since this interval is reduced at each
iteration, it follows that the error is also reduced at each iteration. More formally,
define h0 as the width of the initial interval, and the initial absolute error value:
$$h_0 = |x_L - x_U| \tag{8.8}$$
At each iteration, the interval, and therefore the error, is reduced by a factor of
2 compared to its previous iteration value. If we assume a total number of
n iterations performed, the final interval and error value is:
$$h_n = \frac{h_{n-1}}{2} = \frac{h_0}{2^n} \tag{8.9}$$
Equation (8.9) is the convergence rate of the algorithm to the solution, or the rate
the error decreases over the iterations: it is a linear algorithm with O(h). But the
equation also makes it possible to predict the number of iterations that will
be needed to converge to an acceptable solution. For example, if an initial interval
on a root is [0.7, 1.5] and a solution is required with an error of no more than $10^{-5}$,
then the algorithm will need to perform $\lceil \log_2(0.8/10^{-5}) \rceil = 17$ iterations.
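
The iteration count in this example follows from requiring the initial width divided by 2 to the power n to fall below the tolerance. A minimal Python sketch of that calculation is given here; the function name `bisection_iterations` is an illustrative choice, not something defined in the text.

```python
import math


def bisection_iterations(x_low, x_up, tolerance):
    """Smallest n such that |x_up - x_low| / 2**n is no larger than tolerance."""
    h0 = abs(x_up - x_low)                    # width of the initial interval
    return math.ceil(math.log2(h0 / tolerance))


n = bisection_iterations(0.7, 1.5, 1e-5)
print(n)
```

Because the count grows only with the logarithm of the ratio between the initial width and the tolerance, tightening the tolerance by a factor of ten costs between three and four extra iterations.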
Example 8.1
Suppose a circuit is represented by a modified version of Eq. (8.4) as follows:
$$-0.5 + 1.4\ln(I + 1) + 0.1I = 0$$
Perform six iterations of the bisection method to determine the value of the
current running through this circuit, knowing initially that it is somewhere
between 0 and 1 A.
Solution
First, evaluate the function at the given bounds. At I = 0 A the total voltage in
the system is -0.5 V, and at I = 1 A it is 0.5704 V.
The first middle point between the initial bounds is x1 = 0.5 A. Evaluating the
function at that value gives 0.1177 V. This is a positive evaluation, just like at
1 A; the middle value thus replaces this bound, and the new interval is [0, 0.5].

The second middle point is at x2 = 0.25 A. The function evaluation is
-0.1626 V. This is a negative evaluation, on the same side of the zero-crossing as 0 A. Consequently, it replaces that bound and the new interval
at this iteration is [0.25, 0.5].
The next four iterations are summarized in this table:

Initial interval (A)   Middle point (A)   Function evaluation (V)   Final interval (A)
[0.25, 0.5]            x3 = 0.3750        -0.0167                   [0.375, 0.5]
[0.375, 0.5]           x4 = 0.4375        0.0518                    [0.375, 0.4375]
[0.375, 0.4375]        x5 = 0.4063        0.0179                    [0.375, 0.4063]
[0.375, 0.4063]        x6 = 0.3906        0.0007                    [0.375, 0.3906]

At the end of six iterations, the root is known to be in the interval [0.375,
0.3906]. It could be approximated as 0.3828 ± 0.0078 A. For reference, the
real root of the equation is 0.389977 A, so the approximation has a relative
error of only 1.84 %. To further illustrate this method, the function is plotted
in red in the figure below, with the five intervals of the root marked in
increasingly light color on the horizontal axis. It can be seen that the method
tries to keep the root as close as possible to the middle of each shrinking
interval.
[Figure: plot of voltage versus current for the function of Example 8.1, with the successive intervals marked on the horizontal axis]
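
The six iterations of Example 8.1 can be reproduced with a short Python sketch of the bisection steps. The function names `f` and `bisection` are illustrative, the loop runs for a fixed number of iterations instead of testing an error threshold, and `f` is the left-hand side of the equation of the example written as 1.4 ln(I + 1) + 0.1 I - 0.5.

```python
import math


def f(current):
    """Left-hand side of the circuit equation of Example 8.1."""
    return 1.4 * math.log(current + 1.0) + 0.1 * current - 0.5


def bisection(func, x_low, x_up, iterations):
    """Run a fixed number of bisection steps; return the middle points and final bracket."""
    if func(x_low) * func(x_up) > 0:
        raise ValueError("the bounds do not bracket a zero-crossing")
    middle_points = []
    for _ in range(iterations):
        x_mid = (x_low + x_up) / 2
        value = func(x_mid)
        middle_points.append(x_mid)
        if value == 0:                      # the middle point is exactly the root
            x_low = x_up = x_mid
            break
        if value * func(x_low) > 0:         # same sign as the lower bound
            x_low = x_mid
        else:                               # same sign as the upper bound
            x_up = x_mid
    return middle_points, (x_low, x_up)


points, bracket = bisection(f, 0.0, 1.0, 6)
print(points)
print(bracket)
```

The middle points and the final bracket printed by this sketch can be compared line by line with the values worked out in the example.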

8.3 False Position Method

The false position method is an improvement of the bisection method. Like the
bisection method, it is a bracketing algorithm. However, it does not simply use the
middle point between the two bounds as a new bound, but instead interpolates a
degree-1 polynomial between the two bounds and uses the root of that straight line
as the new bound. This usually gives a better approximation of the root than blindly
using the middle point, especially when the function being modeled can accurately
be represented by a straight line, which will be the case as the interval around the
root gets smaller and smaller, as discussed in Chap. 5. To understand how this
works, consider the function plotted in blue in Fig. 8.3. Its brackets go from 6 to
8, and its root is at 6.3. The bisection method picks the middle point at each
iteration: the middle point this iteration is at 7, which reduces the interval to [6,
7] and yields 6.5 ± 0.5 as an approximation of the root. On the other hand, the false
position method interpolates a straight line between the two bounds, shown in red in
Fig. 8.3, and uses the root of that interpolated polynomial at 6.45 to reduce the
interval to [6, 6.45]. This interval is already a lot better than the one obtained after
one iteration (or even two iterations) of the bisection method. After just one step,
the root is approximated to be 6.225 ± 0.225, a much better approximation than
the one obtained by the bisection method. Moreover, as can be seen in Fig. 8.3, the
function in that new interval is practically linear, which means that the root of the
polynomial interpolated in-between those two bounds will be practically the same
as the root of the real function.
The algorithm for the false position method is a three-step iterative process that
is very similar to the bisection method. In particular, the first step to set up the initial
bounds is exactly the same as before, to get two points on either side of the zero-crossing. In the second step, the method iteratively tightens the bounds by using the
root of the straight-line polynomial interpolated in-between the two bounds. Interpolating a straight line between two known points and then finding the zero of that
line is trivially easy, and in fact both operations can be done at once using
Eq. (8.10).
$$x_i = x_U - \frac{f(x_U)\,(x_L - x_U)}{f(x_L) - f(x_U)} \tag{8.10}$$

Fig. 8.3 The interpolating linear polynomial and its root


The method then evaluates the function at the new point, f(xi), and substitutes xi for the bound on the same side of the zero-crossing. One distinctive feature
of the false position method is that it will usually focus on updating only one bound
of the interval. For a concave-down function such as the one in Fig. 8.3, the root of
the interpolated polynomial will usually be on the right-hand side of the real root
and only the right-hand side bound will be the one updated. And conversely, in the
case of a concave-up function, the interpolated root will usually be on the left-hand
side of the function and only the left-hand bound will be updated.
Finally, the iterative process terminates when one of three termination conditions is reached. Two of these conditions are exactly the same as for the bisection
method: the algorithm might generate a point xi that is exactly the root of the
function (success condition), or reach a preset maximum number of iterations
(failure condition). The third condition is that the root of the function is approximated to an acceptable preset error rate (success condition). However, this acceptable approximation is defined differently in this algorithm than it was in the
bisection algorithm or in most bracketing algorithms. Since usually only one
bound is updated, the relative error between two successive update values can be
used to measure the error, using the usual relative error value introduced in Chap. 1:

$$E_i = \left|\frac{x_{i-1} - x_i}{x_i}\right| \tag{8.11}$$
The pseudocode for this method, in Fig. 8.4, will clearly be very similar to the
one for the bisection method, which was presented in Fig. 8.2, with only a different
formula to evaluate the new point.
Example 8.2
Suppose a circuit is represented by a modified version of Eq. (8.4) as follows:
$$-0.5 + 1.4\ln(I + 1) + 0.1I = 0$$
Perform six iterations of the false position method to determine the value of the
current running through this circuit, knowing initially that it is somewhere
between 0 and 1 A.
Solution
First, evaluate the function at the given bounds. At I = 0 A the total voltage in
the system is -0.5 V, and at I = 1 A, it is 0.5704 V.
Using Eq. (8.10), the first interpolated root is found to be at:
$$x_1 = 1 - \frac{0.5704\,(0 - 1)}{-0.5 - 0.5704} = 0.4671\ \text{A}$$
and evaluating the function using that current value gives a voltage of
0.0833 V. This is a positive evaluation, just like at 1 A; the interpolated root
thus replaces this bound, and the new interval is [0, 0.4671].

The next interpolated root is at x2 = 0.4004 A. The function evaluation
gives 0.0115 V. This is again a positive evaluation, on the same side of the
zero-crossing as the previous point. Consequently, it replaces that bound and
the new interval at this iteration is [0, 0.4004].
The next four iterations are summarized in this table:

Initial interval (A)   Interpolated root (A)   Evaluation (V)   Final interval (A)
[0, 0.4004]            x3 = 0.3914             0.0016           [0, 0.3914]
[0, 0.3914]            x4 = 0.3902             0.0002           [0, 0.3902]
[0, 0.3902]            x5 = 0.3900             0.000029         [0, 0.3900]
[0, 0.3900]            x6 = 0.38998            0.000004         [0, 0.38998]

At the end of six iterations, the root is known to be in the vicinity of
0.38998 A. This is a relative error of 0.000974 % compared to the real root at
0.389977 A, a major improvement compared to the bisection method, which
only achieved 1.84 % relative error after the same number of iterations.
Notice as well that, in all the iterations, only one bound was ever updated.
To further illustrate this method, the function is plotted in red in the figure
below, with the five intervals of the root marked in increasingly light color on
the horizontal axis. It can be seen that the method tries to keep the root as
close as possible to the upper bound of the interval. As a result, by contrast to
Example 8.1, the interval size decreases much more slowly and the final
interval is a lot larger than it was with the bisection method, but the final
approximation of the root is much more accurate.
[Figure: plot of voltage versus current for the function of Example 8.2, with the successive intervals marked on the horizontal axis]


XL ← Input lower bound
XU ← Input upper bound
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
x ← XL
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    x ← XU - (CALL F(XU)) * (XL - XU) / [(CALL F(XL)) - (CALL F(XU))]
    Evaluation ← CALL F(x)
    IF ( Evaluation = 0 )
        RETURN Success, x
    ELSE IF ( Evaluation is of the same sign as CALL F(XL) )
        XL ← x
    ELSE IF ( Evaluation is of the same sign as CALL F(XU) )
        XU ← x
    END IF
    CurrentError ← absolute value of [ (x - PreviousValue) / x ]
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION

Fig. 8.4 Pseudocode of the false position method
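
A possible Python rendering of the false position steps, applied to the circuit equation of Example 8.2, is sketched below. The names `f` and `false_position` are illustrative, and the loop runs for a fixed number of iterations rather than testing the relative error, an assumption made so that the output lines up with the six iterations of the example.

```python
import math


def f(current):
    """Left-hand side of the circuit equation of Example 8.2."""
    return 1.4 * math.log(current + 1.0) + 0.1 * current - 0.5


def false_position(func, x_low, x_up, iterations):
    """Run a fixed number of false position steps; return the points and final bracket."""
    points = []
    for _ in range(iterations):
        f_low, f_up = func(x_low), func(x_up)
        # Root of the straight line through (x_low, f_low) and (x_up, f_up).
        x_new = x_up - f_up * (x_low - x_up) / (f_low - f_up)
        value = func(x_new)
        points.append(x_new)
        if value == 0:                  # the interpolated root is exactly the root
            break
        if value * f_low > 0:           # same sign as the lower bound
            x_low = x_new
        else:                           # same sign as the upper bound
            x_up = x_new
    return points, (x_low, x_up)


points, bracket = false_position(f, 0.0, 1.0, 6)
print([round(p, 5) for p in points])
print(bracket)
```

In this run the lower bound stays at 0 A throughout, which matches the observation that the method tends to update only one bound.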

8.3.1 Error Analysis

To simplify the error analysis, assume that one of the bounds is fixed. As explained
previously, that assumption will be correct for all except occasionally the first few
iterations of the false position method.
Let xr be the root, bracketed between a lower bound a0 and an upper bound b,
and assume that the bound b is fixed. Then, the change in the moving bound will
be proportional to the difference between the slope from (xr, 0) to (b, f(b)) and the
derivative $f^{(1)}(x_r)$. To visualize this, define the error between the root and
the moving bound $h_0 = |a_0 - x_r|$ and assume that the bound is sufficiently close to
the root that the first-order Taylor series approximation $f(a_0) \approx f^{(1)}(x_r)h_0$ holds.
In that case, the slope of the interpolated polynomial from (a0, f(a0)) to (b, f(b)) is
approximately equal to $f(b)/(b - x_r)$. This is shown in Fig. 8.5.


Fig. 8.5 One iteration of the false position method

After one iteration, the bound will move to a1, the root of the interpolated line
(in red), and the error will be reduced by the distance between a0 and a1, which can
be computed by simple trigonometry as indicated in Fig. 8.5. The resulting error is:
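$$h_1 = h_0 - \frac{f^{(1)}(x_r)\,h_0}{\dfrac{f(b)}{b - x_r}} = h_0\left(1 - \frac{f^{(1)}(x_r)\,(b - x_r)}{f(b)}\right) \tag{8.12}$$

and more generally at iteration n:

$$h_n = h_0\left(1 - \frac{f^{(1)}(x_r)\,(b - x_r)}{f(b)}\right)^n \tag{8.13}$$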

The error rate thus decreases linearly as a function of h, or O(h), just like for the
bisection method. This result seems to clash with the observation earlier, both in the
explanations and in Example 8.2 that the error of the false position method
decreases faster than that of the bisection method. In fact, Eq. (8.13) shows that
this method will not always converge faster than the bisection method, but only in
the specific case where the factor multiplying h0 is smaller, that is to say:
$$\frac{1}{2} < \frac{f^{(1)}(x_r)\,(b - x_r)}{f(b)} \tag{8.14}$$

Since the root and the derivative are beyond one's ability to change, the only ways
to improve the convergence rate of the false position algorithm are to increase the


difference between b and xr, or to decrease the value of f(b). It can be seen from
Fig. 8.5 that the effect of these actions will be, respectively, to move the point
b further right on the x-axis or down nearer to the x-axis. Either option will cause the
root of the interpolated line between a0 and b to be nearer to the root.
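
To make the convergence factor concrete, the sketch below evaluates 1 - f'(xr)(b - xr)/f(b) for the circuit function of Example 8.2. Several things here are assumptions made for illustration: the derivative is worked out by hand as 1.4/(I + 1) + 0.1, the root is taken as the value 0.389977 A quoted in the text, and the fixed bound is taken to be b = 0 A (the bound that never moved in the example), on the assumption that the expression applies unchanged when the fixed bound is the lower one.

```python
def f_prime(current):
    """Derivative of 1.4 ln(I + 1) + 0.1 I - 0.5 with respect to I."""
    return 1.4 / (current + 1.0) + 0.1


x_root = 0.389977   # root quoted in the text (A)
b = 0.0             # fixed bound in Example 8.2 (A); an assumption of this sketch
f_b = -0.5          # function value at the fixed bound: 1.4 ln(1) + 0 - 0.5

# Factor by which the error is multiplied at each iteration.
factor = 1.0 - f_prime(x_root) * (b - x_root) / f_b
print(factor)
```

The factor comes out well below one half, which is in line with the false position method outpacing the bisection method on this particular function.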
Example 8.3
Compare and contrast the error of the bisection method and the false position
method from Examples 8.1 and 8.2.
Solution
The points computed in each of the six iterations are listed in the table
below, along with the relative error of each one compared to the real root at
0.389977 A.

Iteration   Bisection point (A)   Relative error (%)   False position point (A)   Relative error (%)
1           x1 = 0.5              28.21                x1 = 0.4671                19.78
2           x2 = 0.25             35.89                x2 = 0.4004                2.67
3           x3 = 0.375            3.84                 x3 = 0.3914                0.36
4           x4 = 0.4375           12.19                x4 = 0.3902                0.05
5           x5 = 0.40625          4.17                 x5 = 0.3900                0.0068
6           x6 = 0.390625         0.17                 x6 = 0.38998               0.000974

It is clear to see that the false position method reduces the error a lot more
quickly than the bisection method: after three iterations the error achieved by
the false position is comparable to that from six iterations of the bisection
method, and after the fifth iteration, the false position method has surpassed
by orders of magnitude the best error achieved by the bisection method.
But it is even more interesting to observe the progression of the error. The
relative error of points generated by the false position method decreases after
every iteration, and does so almost perfectly linearly. By
contrast, the relative error of the points computed by the bisection method
zigzags: it increases between iterations 1 and 2 and again between iterations
3 and 4 before decreasing once more. This zigzag is another consequence of blindly
selecting the middle point of the interval at each iteration. When the real root
is near the center of the interval, the middle point selected by the bisection
method will have a low relative error, but when the root is nearer the edge of
the interval the middle point will have a high relative error. By contrast, the
false position method selects points intelligently by interpolating an
approximation of the function and using the approximation's root, and thus is not
subject to these fluctuations. Moreover, as the interval becomes smaller after
each iteration, the approximation becomes more accurate and the interpolated
root is guaranteed to become closer to the real one.
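The progression of the error in Example 8.3 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it copies the relative errors from the table of the example and computes the ratio between each error and the previous one. The helper name `successive_ratios` is an assumption introduced here, not something defined in the text.

```python
# Relative errors (%) copied from the table of Example 8.3.
bisection_errors = [28.21, 35.89, 3.84, 12.19, 4.17, 0.17]
false_position_errors = [19.78, 2.67, 0.36, 0.05, 0.0068, 0.000974]


def successive_ratios(errors):
    """Return errors[i + 1] / errors[i] for each pair of consecutive iterations."""
    return [later / earlier for earlier, later in zip(errors, errors[1:])]


false_position_ratios = successive_ratios(false_position_errors)
bisection_ratios = successive_ratios(bisection_errors)

# A ratio below 1 means the error shrank; a ratio above 1 means it grew.
print("false position:", [round(r, 3) for r in false_position_ratios])
print("bisection:     ", [round(r, 3) for r in bisection_ratios])
```

The false position ratios stay close to one another, which is what a linear convergence rate with a fixed factor looks like, while some of the bisection ratios exceed 1, which is the zigzag described above.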

8.3.2 Nonlinear Functions

The false position method introduced an additional assumption that was not present
in the bisection method, namely that the function can be approximated within the
interval by a linear polynomial. This is of course not always the case, and it is
important to be mindful of it: when this assumption does not hold, the function
cannot be approximated well by a straight line, and consequently the root of the
straight line is a very poor approximation of the root and cannot be used to
effectively tighten the bounds.
To illustrate, an example of a highly nonlinear function is given in Fig. 8.6:
within the interval [0,1], the function looks like a straight horizontal line with a
sudden sharp vertical turn near the root at 0.96. The function is concave-up, and as
explained before the root of the interpolated polynomial falls on the left side of the
real root, and the left-hand bound is the one updated. And indeed, it can be seen in
that figure that a straight line interpolated from 0 to 1 will have a root at a point
before 0.96, and therefore the left bound will be updated. However, the interpolated
line's root will actually be at 0.04, very far from the real root at 0.96! This is a result
of the fact that the straight line from 0 to 1 is not at all a good approximation of the
concave function it is supposed to represent. Worse, the bounds will be updated to
[0.04,1] and the function in that interval will still be nonlinear, so in the next
iteration the false position method will again interpolate a poor approximation and
poorly update the bounds. In fact, it will take over 20 iterations for the false position
method to generate an approximation of the root of this function within the interval [0.9,1].
By contrast, the bisection method, by blindly cutting out half the interval at each
iteration, gets within that interval in four iterations.
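The behaviour described above can be reproduced numerically. The function plotted in Fig. 8.6 is not given in the text, so the sketch below uses a hypothetical stand-in, f(x) = (x^79 - 0.04)/0.96, chosen only because it is nearly flat on [0,1], has f(1) = 1, has a root near 0.96, and gives a first interpolated root at 0.04. The function names are likewise assumptions made for this illustration.

```python
def f(x):
    # Hypothetical stand-in for the function of Fig. 8.6 (not the book's function):
    # nearly horizontal on [0, 1] with a sharp upward turn near its root at about 0.96.
    return (x**79 - 0.04) / 0.96


def bisection_point(a, b):
    """Middle of the interval."""
    return (a + b) / 2


def false_position_point(a, b):
    """Root of the straight line interpolated between (a, f(a)) and (b, f(b))."""
    return b - f(b) * (b - a) / (f(b) - f(a))


def iterations_to_reach(next_point, a, b, low=0.9, high=1.0, maximum=200):
    """Count iterations until the generated point falls inside [low, high]."""
    for iteration in range(1, maximum + 1):
        c = next_point(a, b)
        if low <= c <= high:
            return iteration
        # Keep the root bracketed: replace the bound on the same side as c.
        if f(a) * f(c) > 0:
            a = c
        else:
            b = c
    return None


bisection_count = iterations_to_reach(bisection_point, 0.0, 1.0)
false_position_count = iterations_to_reach(false_position_point, 0.0, 1.0)
print(bisection_count, false_position_count)
```

With this stand-in, the bisection method lands in [0.9,1] after four iterations, while the false position method creeps rightward by a small fraction of the remaining interval at each step and needs many more.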
Fig. 8.6 A highly nonlinear function

Referring back to Eq. (8.14), it can be seen that the inequality necessary for the
false position method to outperform the bisection method does not hold in this
example. The difference between the root and the fixed bound is only 0.04, while
the first derivative at the root is almost 1 because of the discontinuity and the value
of f(b) is exactly 1, so the right-hand side of Eq. (8.14) evaluates to a result much
smaller than 0.5.
Clearly, it is important to determine whether the function can be approximated
by a straight line interpolated between the bounds before starting the false position
method, otherwise many iterations will be wasted computing poor approximations
of the root. The solution to that problem is also hinted at by the end of the example:
switch to the bisection method, which works the same regardless of whether the
function is linear or nonlinear within the interval, for a few iterations, until the
interval has been reduced to a region where the function is closer to linear.

8.4 Closed and Open Methods

The bisection and false position methods are both bracketing root-finding methods.
They are called closed methods, because they enclose the root between bounds.
These bounds constitute both an advantage and a limitation. On the one hand, they
guarantee that the methods will converge, that they will succeed in finding the root.
Indeed, these methods cannot possibly fail, since they begin by locking in the root
between bounds and never lose sight of it. However, iteratively updating bounds is a
slow process, and the two methods seen so far only have O(h) convergence rates.
The alternatives to closed methods are open methods. As the name implies, these
methods do not enclose the root between bounds. They do use initial points, but
these points could all be on the same side of the root. The methods then use some
mathematical formula to iteratively refine the value of the root. Since these algorithms can update their estimates without worrying about maintaining bounds, they
typically converge a lot more efficiently than closed methods. However, for the
same reason that they do not keep the root bracketed, they can sometimes diverge
and fail to find the root altogether if they use a bad point or combination of points in
their computations.

8.5 Simple Fixed-Point Iteration

The simple fixed-point iteration (SFPI) method is the simplest open root-finding
method available. As will be seen, it is also the open method with the worst
convergence rate in general and it diverges in many common situations, so it is
far from the best. However, it will be useful to use as the first open numerical
method in this book, to introduce fundamental notions that will be applied to all
other methods.
As mentioned in the previous section, an open method is one that iteratively
improves an (unbounded) estimate of the solution point xi. The first necessary step
to any open method is thus to write the system being studied in an iterative form of
xi+1 = f(xi). In the case of root-finding methods, however, this is a special problem,
since the solution point is the one where f(xi) = 0. It is necessary to modify the
equation of the system's model somehow. Each open root-finding method that will
be presented in the next sections will be based on a different intuition to rewrite the
equation into iterative form. For SFPI, it is done simply by isolating a single
instance of x in the equation f(x) = 0 to get g(x) = x. In other words:

\[
f(x) = g(x) - x = 0 \;\Rightarrow\; g(x) = x \tag{8.15}
\]

This transformation is nothing more than a bit of algebra. If an instance of x is
available in the equation f(x), isolate it. If not, then simply add x to both sides of
f(x) = 0. Equation (8.16) gives examples of both situations.

\[
\begin{aligned}
f(x) = 5x^3 + 4x^2 + 3x + 2 = 0 \;&\Rightarrow\; g(x) = -\frac{5x^3 + 4x^2 + 2}{3} = x \\
f(x) = \ln(x) = 0 \;&\Rightarrow\; g(x) = \ln(x) + x = x
\end{aligned} \tag{8.16}
\]

The fixed-point equation of (8.15) is an iterative equation. Given an initial
estimate of the root x0, the method computes

\[
x_{i+1} = g(x_i) \tag{8.17}
\]

until the equation converges; and given Eq. (8.15), the value of x it converges to is
the root of f(x). As with any iterative method, the standard halting conditions apply.
The method will be said to have converged and succeeded if the relative error
between two successive iterations is less than a predefined threshold ε:

\[
E_i = \left| \frac{x_{i+1} - x_i}{x_{i+1}} \right| < \varepsilon \tag{8.18}
\]
And it will be said to have failed if the number of iterations reaches a preset
maximum. The pseudocode for this method is presented in Fig. 8.7. Notice that,
contrary to the pseudocode of the bisection and false position methods, this one
does not maintain two bounds, only one current value x. Consequently, the iterative
update is a lot simpler; while before it was necessary to check the evaluation of the
current value against that of each bound to determine which one to replace, now the
value is updated unconditionally.
Example 8.4
The power of a signal being measured by a tower is decreasing over time
according to this model:

\[
P(t) = e^{-t}\cos(t)
\]

Starting at t0 = 0 s, find the time that the signal will have lost all power to a
relative error of less than 0.5 %.

Solution
The iterative SFPI equation corresponding to this problem is:

\[
e^{-t_i}\cos(t_i) + t_i = t_{i+1} = g(t_i)
\]

At the initial value of t0 = 0, t1 = e^0 cos(0) + 0 = 1, and the relative error
computed by Eq. (8.18) is 100 %. This is usual for a first iteration when using
0 as an initial estimate. The following iterations, until the target relative error
is achieved, are given in the table below. The figure next to the table
illustrates the power function and the approximation of the root at each
iteration.
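| Iteration | ti (s) | Ei (%) |
|-----------|--------|--------|
| 0         | 0      |        |
| 1         | 1      | 100.00 |
| 2         | 1.20   | 16.58  |
| 3         | 1.31   | 8.38   |
| 4         | 1.38   | 5.09   |
| 5         | 1.43   | 3.38   |
| 6         | 1.46   | 2.36   |
| 7         | 1.49   | 1.71   |
| 8         | 1.51   | 1.26   |
| 9         | 1.52   | 0.95   |
| 10        | 1.53   | 0.72   |
| 11        | 1.54   | 0.56   |
| 12        | 1.55   | 0.43   |

[Figure: power versus time, with the approximation of the root at each iteration]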

For reference, the real root of the equation is at tr = 1.5707963267949 s, so
the relative error of the final result compared to the real root is 1.56 %.
It is interesting to note as well that, from iteration 4 onward, the relative
error of each iteration is approximately 0.75 of the previous one. This shows
practically that the algorithm has a linear convergence rate comparable to that
of the bisection method.
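The iterations of Example 8.4 can be reproduced with a short program. The following is a minimal sketch of the SFPI loop, assuming the power model P(t) = e^(-t) cos(t) and its transformed version g(t) = e^(-t) cos(t) + t used in the example; the function name `sfpi` and its parameter names are assumptions introduced for this illustration.

```python
import math


def sfpi(g, x, error_minimum, iteration_maximum):
    """Simple fixed-point iteration: repeat x <- g(x) until the relative
    error between two successive values drops to error_minimum or below."""
    for iteration in range(1, iteration_maximum + 1):
        previous = x
        x = g(x)
        if g(x) - x == 0:
            return x, iteration  # landed exactly on the fixed point
        if abs((x - previous) / x) <= error_minimum:
            return x, iteration
    raise RuntimeError("maximum number of iterations reached")


def g(t):
    # Transformed version of P(t) = exp(-t) * cos(t), obtained by adding t.
    return math.exp(-t) * math.cos(t) + t


root, iterations = sfpi(g, 0.0, 0.005, 100)
print(iterations, round(root, 2))
```

Because the value is updated unconditionally, the loop body is only a few lines long; the price is that nothing in it prevents the sequence from diverging for a less favorable g(x).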
The SFPI method works by computing a sequence xi → g(xi) → xi+1 → g(xi+1)
→ ..., and converges on the point where xi+1 = g(xi). Example 8.4 can be visualized
graphically by making a graph of ti against ti+1, and plotting both sides of the
equation, namely ti+1 = g(ti) and ti+1 = ti, and seeing where both lines intersect. This
is shown in Fig. 8.8, with ti+1 = g(ti) in blue and ti+1 = ti in red. The root, the
intersection of both functions, is clearly found at ti+1 = ti = 1.5707963267949 s.
Figure 8.8 also shows the first four iterations of the example as arrows starting at
ti = 0, going to the corresponding value of g(ti) = ti+1 = 1, then setting ti = 1,
pointing to g(ti) = ti+1 = 1.20, and so on. It can be seen that the method converges
towards the root.

x ← Input initial approximation of the root
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    x ← CALL G(x)
    IF ( (CALL G(x)) - x = 0 )
        RETURN Success, x
    END IF
    CurrentError ← absolute value of [ (x - PreviousValue) / x ]
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION G(x)
    RETURN evaluation of the transformed version of the function at point x
END FUNCTION

Fig. 8.7 Pseudocode of the SFPI method

Fig. 8.8 The convergence of Example 8.4
But, as explained in Sect. 8.4, convergence is not guaranteed for open methods
like the SFPI. In fact, it can be shown that the SFPI method only converges if the
slope of g(x) at the root point xr is less than the slope of x at that point, or in other
words if the absolute value of the derivative of g(x) at xr is less than 1. That was not a
problem for the function of Example 8.4; its derivative is g(1)(x) = -e^(-x)(cos(x) +
sin(x)) + 1, and evaluated at the root it gives 0.792120424, below the threshold
value of 1. However, consider for example the function f(x) = x^2 - x, which has two
roots at xr0 = 0 and xr1 = 1. In the SFPI method, this equation becomes:

\[
x_{i+1} = g(x_i) = x_i^2 \tag{8.19}
\]

The absolute value of the derivative is |g(1)(x)| = |2x|, and evaluated at the roots it
gives |g(1)(xr0)| = 0 and |g(1)(xr1)| = 2. This means the SFPI will converge on the first
root, but cannot converge on the second root. The practical impact of this problem
is illustrated in Fig. 8.9. As this figure shows, picking an initial value x0 whose
absolute value is less than 1 will lead to the SFPI converging on the root at 0. However,
picking an initial value that is greater than 1 does not allow the SFPI to converge on the
root at 1, but instead causes the method to diverge quickly away from both roots and towards
infinity. The root at 1 is simply unreachable by the SFPI, unless it is picked as the
initial value (in which case the method converges on it without iterating). This
discussion also illustrates two other problems with the SFPI method. First, one
cannot predict whether the method will converge or diverge ahead of time unless
one already knows the value of the root in order to evaluate the derivative at that
point, which is of course not a piece of information available ahead of time in a
root-finding problem. And second, the form of g(x) that the equation f(x) is
rewritten into actually matters. Multiple different forms of g(x) are possible for
one equation f(x), and not all of them will converge on the same values, or at all.

Fig. 8.9 The divergence of Eq. (8.19)
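The convergence condition can be observed directly on g(x) = x^2, the iterative form of f(x) = x^2 - x discussed above. The sketch below simply applies g ten times from a starting value on each side of the root at 1; the starting values 0.9 and 1.1 and the helper name `iterate` are arbitrary choices made for this illustration. It also evaluates the derivative of the g(t) of Example 8.4 at its root to check the value quoted in the text.

```python
import math


def iterate(g, x, count):
    """Apply x <- g(x) a fixed number of times and return the last value."""
    for _ in range(count):
        x = g(x)
    return x


def g(x):
    # Iterative form of f(x) = x**2 - x, obtained by isolating x.
    return x**2


def g_derivative(x):
    return 2 * x


below_one = iterate(g, 0.9, 10)  # shrinks towards the root at 0
above_one = iterate(g, 1.1, 10)  # grows away from both roots


def example_derivative(t):
    # Derivative of g(t) = exp(-t) * cos(t) + t from Example 8.4.
    return -math.exp(-t) * (math.cos(t) + math.sin(t)) + 1


print(below_one, above_one)
print(abs(g_derivative(0.0)), abs(g_derivative(1.0)))
print(example_derivative(math.pi / 2))
```

The derivative magnitudes 0 and 2 at the two roots match the outcome of the iterations: the root with |g(1)(xr)| < 1 attracts nearby starting values, and the root with |g(1)(xr)| > 1 repels them.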
The convergence rate for the SFPI method is derived from a Taylor series
expansion of the function g(x). Recall from Chap. 5 that this means that the error
rate will be proportional to the order of the first non-null term in the series. In other
words, much like the convergence test, the convergence rate will depend on
evaluations of the derivative at the root. If g(1)(xr) = 0, then the error rate will be
O(h^2), if in addition g(2)(xr) = 0 then the error rate will be O(h^3), and if in addition to
those two g(3)(xr) = 0 then the error rate will be O(h^4), and so on. In the general case
though, the assumption is that g(1)(xr) ≠ 0 and the error rate is O(h).

8.6 Newton's Method

8.6.1 One-Dimensional Newton's Method

Newton's method, also called the Newton-Raphson method, is possibly the most
popular root-finding method available. It has a number of advantages: it converges
very efficiently (in fact it has the highest convergence rate of any root-finding
method covered in this book), it is simple to implement, and it only requires
maintaining one past estimate of the root, like the SFPI but unlike any of the other
root-finding methods available. Its main downside is that it requires knowing or
estimating the derivative of the function being modelled.
The basic assumption behind Newton's method is that, for a small enough
neighborhood around a point, a function can be approximated by its first derivative.
Since this first derivative is a straight line, its root is straightforward to find. The
derivative's root is used as an approximation of the original function's root, and as a
new point to evaluate the derivative at to iteratively improve the approximation.
As the approximation point gets closer to the root and the neighborhood
approximated by the first derivative gets smaller, the first derivative becomes a more
accurate approximation of the function and its root becomes a more accurate
approximation of the function's root. To illustrate, a single iteration of Newton's
method is represented graphically in Fig. 8.10.

Fig. 8.10 One iteration of Newton's method

The two underlying ideas of Newton's method should be familiar: the
approximation of a function by its first derivative is the first-order Taylor series
approximation explained back in Chap. 5, and modelling the root of a function by
the root of a straight-line approximation is the idea behind the false position
method. In fact, Newton's method is derived directly from the first-order Taylor
series approximation:

\[
f(x_{i+1}) \approx f(x_i) + f^{(1)}(x_i)\,(x_{i+1} - x_i) \tag{8.20}
\]

Since the function is converging to a root, then the value f(xi+1) = 0. With that value
set, Eq. (8.20) can be rewritten as an iterative formula of x:

\[
x_{i+1} = x_i - \frac{f(x_i)}{f^{(1)}(x_i)} \tag{8.21}
\]

And that is Newton's method.

There are three halting conditions for Newton's method. Two are the standard
conditions: the success condition if the function converges and the relative error
between two successive approximations is less than a predefined threshold ε, as
defined in Eq. (8.18), and the failure condition if the method reaches a preset
maximum number of iterations. To this, it is necessary to add a third failure
condition, if the first derivative f(1)(xi) = 0. In that case, Eq. (8.21) has a
discontinuity at the point generated, since it requires a division by zero, and the
method cannot continue.
The pseudocode for Newton's method is given in Fig. 8.11. Notice the introduction
of a new function to evaluate the derivative of a target function, which is
necessary to compute Eq. (8.21), as well as the additional failure test if the first
derivative is zero.
Since the equation for Newton's method is the first-order Taylor series
approximation, the error of the method is the same as that of the approximation. Recall
from Chap. 5 that the error of a Taylor series approximation is the next non-zero
term after the cut-off point of the series. In this case, it is the second-order term, and
so the error is O(h^2). Newton's method thus has a quadratic convergence rate, much
better than the linear methods seen so far. To further understand this, consider the
second-order Taylor series approximation of the root xr:

\[
f(x_r) \approx f(x_i) + f^{(1)}(x_i)\,(x_r - x_i) + \frac{f^{(2)}(x_i)}{2}\,(x_r - x_i)^2 \tag{8.22}
\]

Following the same steps used to get from Eq. (8.20) to (8.21) but putting the entire
Newton's method equation on one side of the equation gives:

\[
x_r - x_i + \frac{f(x_i)}{f^{(1)}(x_i)} + \frac{f^{(2)}(x_i)}{2 f^{(1)}(x_i)}\,(x_r - x_i)^2 \approx 0 \tag{8.23}
\]

x ← Input initial approximation of the root
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    x ← x - (CALL F(x)) / (CALL Derivative(F(x)))
    IF ( CALL F(x) = 0 )
        RETURN Success, x
    ELSE IF ( CALL Derivative(F(x)) = 0 )
        RETURN Failure
    END IF
    CurrentError ← absolute value of [ (x - PreviousValue) / x ]
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION

FUNCTION Derivative(F(x))
    RETURN evaluation of the derivative of the function at point x
END FUNCTION

Fig. 8.11 Pseudocode of Newton's method

Note that Newton's method formula for xi+1 appears in the leftmost terms of the
equation, since xr - xi + f(xi)/f(1)(xi) = xr - xi+1. Both that difference and the
squared term are then a subtraction of an approximation from xr, which is the error
h at that point. The equation thus simplifies to:

\[
h_{i+1} \approx -\frac{f^{(2)}(x_i)}{2 f^{(1)}(x_i)}\,h_i^2 \tag{8.24}
\]

Which is indeed an O(h^2) error rate.

Example 8.5
The power of a signal being measured by a tower is decreasing over time
according to this model:

\[
P(t) = e^{-t}\cos(t)
\]

Starting at t0 = 0 s, find the time that the signal will have lost all power to a
relative error of less than 0.5 %.
Solution
The derivative of this function is:

\[
P'(t) = -e^{-t}\left(\sin(t) + \cos(t)\right)
\]

And the iterative formula for Newton's method, from Eq. (8.21), is thus:

\[
t_{i+1} = t_i - \frac{e^{-t_i}\cos(t_i)}{-e^{-t_i}\left(\sin(t_i) + \cos(t_i)\right)}
\]

Starting at t0 = 0, the iterations computed are:

| Iteration | ti (s) | Ei (%) |
|-----------|--------|--------|
| 0         | 0      |        |
| 1         | 1      | 100    |
| 2         | 1.39   | 28.1   |
| 3         | 1.54   | 10.0   |
| 4         | 1.570  | 1.6    |
| 5         | 1.571  | 0.04   |

[Figure: power versus time, with the approximation of the root at each iteration]

For reference, the real root of the equation is at tr = 1.5707963267949 s, so
the relative error of the final result compared to the real root is less than 3e-5.
The benefits of quadratic convergence are clear to see: after three iterations
the approximation obtained by Newton's method was the same as that
obtained by the SFPI method after 11 iterations, and the final result of this
method is orders of magnitude better than that of the SFPI method despite
requiring less than half the total number of iterations.
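Example 8.5 can be reproduced with the sketch below, which assumes the power model P(t) = e^(-t) cos(t) and its derivative P'(t) = -e^(-t)(sin(t) + cos(t)). The function and parameter names are assumptions made for this illustration, and the zero-derivative test is placed before the division here, a small reordering compared with the pseudocode of Fig. 8.11.

```python
import math


def newton(f, derivative, x, error_minimum, iteration_maximum):
    """Newton's method: x <- x - f(x) / f'(x), halting on the relative error
    between successive values, a zero derivative, or too many iterations."""
    for iteration in range(1, iteration_maximum + 1):
        slope = derivative(x)
        if slope == 0:
            raise ZeroDivisionError("first derivative is zero")
        previous = x
        x = x - f(x) / slope
        if f(x) == 0 or abs((x - previous) / x) <= error_minimum:
            return x, iteration
    raise RuntimeError("maximum number of iterations reached")


def power(t):
    return math.exp(-t) * math.cos(t)


def power_derivative(t):
    return -math.exp(-t) * (math.sin(t) + math.cos(t))


root, iterations = newton(power, power_derivative, 0.0, 0.005, 50)
print(iterations, root)
```

Running the same problem through the SFPI and Newton loops side by side makes the difference between linear and quadratic convergence concrete, at the cost of having to supply the derivative.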
As with any open method, there is a risk that Newton's method will diverge and
fail to find a root altogether. In fact, Eq. (8.24) suggests that there are three
conditions when the method can diverge. The first condition is if xi is very far
from the root xr. In that case, the error hi will be a large value and it is squared,
which will cause the method to fail. This highlights the need to pick an initial value that is
somewhat in the vicinity of the root, and not any random value anywhere on the
x-axis. The second condition that can cause the method to diverge is if the first
derivative at the current point is near zero. This has already been included as a
failure halting condition in the algorithm. Conceptually, this means the point xi is at
an optimum of the function, and the first derivative is horizontal. In such a case, the
next point computed by Eq. (8.21) will shoot out very far from the current point and
the root. This situation is illustrated in Fig. 8.12. The third and final condition that
can cause the method to diverge is if the second derivative is very large. Again, it is
clear to see from Eq. (8.24) that this will cause the error to increase between
iterations. Conceptually, a high second derivative means that the current point is
near a saddle of the function. In that case, Eq. (8.21) will generate points that
oscillate around this saddle and get progressively further and further away. This
situation is illustrated in Fig. 8.13.

Fig. 8.12 Divergence because the first derivative is near zero
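The second condition is easy to observe numerically. The sketch below applies a single step of Eq. (8.21) to the hypothetical function f(x) = x^2 - 1, which is not taken from the text and was chosen only because it has a root at 1 and an optimum at 0. One starting point sits next to the optimum, the other near the root.

```python
def newton_step(f, derivative, x):
    """One application of x <- x - f(x) / f'(x)."""
    return x - f(x) / derivative(x)


def f(x):
    # Hypothetical test function with a root at x = 1 and a minimum at x = 0.
    return x**2 - 1


def f_derivative(x):
    return 2 * x


from_near_optimum = newton_step(f, f_derivative, 0.001)
from_near_root = newton_step(f, f_derivative, 1.1)
print(from_near_optimum, from_near_root)
```

Starting next to the optimum, where the tangent is almost horizontal, a single step lands hundreds of units away from the root; starting near the root, the step moves closer to it.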

8.6.2 Multidimensional Newton's Method

The examples used so far in this chapter have all been one-dimensional y = f(x)
root-finding problems. However, one of the main advantages of Newton's method,
especially when contrasted to the bracketing methods, is that it can easily be
adapted to more complex multidimensional problems. In such a problem, an
engineer must deal with n independent variables constrained by n separate model
equations. The root of the system is the simultaneous root of all n equations.
Begin by defining x = [x0, x1, ..., xn-1]^T, the vector of n independent variables of
the system, and f(x) = [f0(x), f1(x), ..., fn-1(x)]^T, the vector of n n-dimensional
functions that model the system. Since this is now a vector problem, Newton's
method equation (8.21) needs to be rewritten to eliminate the division as:

\[
f^{(1)}(x_i)\,(x_{i+1} - x_i) = -f(x_i) \tag{8.25}
\]

And then, substituting the newly defined vectors:

\[
\mathbf{f}^{(1)}(\mathbf{x}_i)\,(\mathbf{x}_{i+1} - \mathbf{x}_i) = -\mathbf{f}(\mathbf{x}_i) \tag{8.26}
\]

Fig. 8.13 Divergence because the second derivative is high

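The derivative of the vector of functions is the Jacobian matrix Jf(x), or the n × n
matrix of partial derivatives of each of the n functions with respect to each of the
n variables, arranged as shown in Eq. (8.27):

\[
J_{\mathbf{f}}(\mathbf{x}) =
\begin{bmatrix}
\dfrac{\partial f_0(\mathbf{x})}{\partial x_0} & \dfrac{\partial f_0(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_0(\mathbf{x})}{\partial x_{n-1}} \\
\dfrac{\partial f_1(\mathbf{x})}{\partial x_0} & \dfrac{\partial f_1(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_1(\mathbf{x})}{\partial x_{n-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_{n-1}(\mathbf{x})}{\partial x_0} & \dfrac{\partial f_{n-1}(\mathbf{x})}{\partial x_1} & \cdots & \dfrac{\partial f_{n-1}(\mathbf{x})}{\partial x_{n-1}}
\end{bmatrix} \tag{8.27}
\]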

With this new definition, Eq. (8.26) can be rewritten as:

\[
J_{\mathbf{f}}(\mathbf{x}_i)\,\Delta\mathbf{x}_i = -\mathbf{f}(\mathbf{x}_i) \tag{8.28}
\]

At each iteration, the Jacobian function can be evaluated, and only the step size Δxi
is unknown. The problem has thus become an Mx = b linear algebra equation to
solve, which can be done using any of the methods learned in Chap. 4. Finally, the
next vector is obtained simply with:

\[
\mathbf{x}_{i+1} = \mathbf{x}_i + \Delta\mathbf{x}_i \tag{8.29}
\]

The halting conditions for the iterative algorithm are the same as for the
one-dimensional Newton's method, but adapted to matrices and vectors. The
success condition is that the relative error between two successive approximations
of the root is less than a preset threshold value, defined now as the Euclidean
distance between the two vectors xi and xi+1 introduced in Chap. 3. There is a failure
condition if the derivative is zero, as there was with the one-dimensional Newton's
method. This is defined here as the case where the determinant of the Jacobian
matrix is zero. Finally, as always, the algorithm fails if it reaches a preset maximum
number of iterations. The pseudocode for Newton's method from Fig. 8.11, updated
to handle multidimensional problems, is presented in Fig. 8.14.
Since the multidimensional Newton's method equation of (8.28) is derived from
the first-order Taylor series approximation, just like the one-dimensional case, it
will also have an O(h^2) convergence rate.
x ← Input initial approximation of the root as vector of length n
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    Delta ← solution of system [CALL Jacobian(F(x))] × Delta = -1 × [CALL F(x)]
    x ← x + Delta
    IF ( CALL F(x) = 0 )
        RETURN Success, x
    ELSE IF ( Determinant of [CALL Jacobian(F(x))] = 0 )
        RETURN Failure
    END IF
    CurrentError ← Euclidean distance between x and PreviousValue
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN vector of length n of evaluations of the n target functions at point x
END FUNCTION

FUNCTION Jacobian(F(x))
    RETURN n×n matrix of evaluation of the partial derivatives of the n
           functions with respect to the n variables at point x
END FUNCTION

Fig. 8.14 Pseudocode of the multidimensional Newton's method

Example 8.6
The shape of the hull of a sunken ship is modelled by this equation:

\[
z = f_0(x, y) = x^2 + 2y^2 - xy + x - 1
\]

where z is the height of the sunken hull above the seabed. An automated
submarine is scanning the hull to find the damaged point where the ship has
hit the seabed. It has been programmed to explore it in a 2D grid pattern,
starting at coordinates (0,0) and following this program:

\[
z = f_1(x, y) = 3x^2 + 2y^2 + xy + 3y - 2
\]

Determine if the probe will find the damage it is looking for with a relative
error of 0.001.
Solution
This underwater exploration can be modelled by the following system of
equations:

\[
\mathbf{f}(\mathbf{x}) =
\begin{bmatrix}
x^2 + 2y^2 - xy + x - 1 \\
3x^2 + 2y^2 + xy + 3y - 2
\end{bmatrix},
\qquad
\mathbf{x} = \begin{bmatrix} x \\ y \end{bmatrix}
\]

The point where the hull has hit the sea bed is at z = 0, and that is the
damaged point the probe is looking for. It is therefore a root-finding problem.
To use Newton's method, first compute the Jacobian following Eq. (8.27):

\[
J_{\mathbf{f}}(\mathbf{x}) =
\begin{bmatrix}
2x - y + 1 & 4y - x \\
6x + y & 4y + x + 3
\end{bmatrix}
\]

Then, starting at x0 = [0, 0]^T, the first iteration will compute

\[
\begin{aligned}
J_{\mathbf{f}}(\mathbf{x}_0)\,\Delta\mathbf{x}_0 &= -\mathbf{f}(\mathbf{x}_0) \\
\begin{bmatrix} 1 & 0 \\ 0 & 3 \end{bmatrix} \Delta\mathbf{x}_0 &= \begin{bmatrix} 1 \\ 2 \end{bmatrix} \\
\Delta\mathbf{x}_0 &= \begin{bmatrix} 1.0000 \\ 0.6667 \end{bmatrix} \\
\mathbf{x}_1 &= \mathbf{x}_0 + \Delta\mathbf{x}_0 = \begin{bmatrix} 1.0000 \\ 0.6667 \end{bmatrix}
\end{aligned}
\]

The relative error is computed using the Euclidean distance between x0 and x1:

\[
E_1 = \sqrt{(0 - 1)^2 + (0 - 0.6667)^2} = 1.2019
\]

which is well above the required threshold of 0.001. The next iterations are
given in the following table:
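| i | xi                    | Δxi                         | Ei     |
|---|-----------------------|-----------------------------|--------|
| 0 | [0, 0]^T              | [1.0000, 0.66667]^T         |        |
| 1 | [1.0000, 0.66667]^T   | [-0.12483, -0.55851]^T      | 1.2019 |
| 2 | [0.87517, 0.10816]^T  | [-0.20229, 0.079792]^T      | 0.5723 |
| 3 | [0.67288, 0.18795]^T  | [-0.032496, 0.0040448]^T    | 0.2175 |
| 4 | [0.64038, 0.19199]^T  | [-0.00056451, -0.00016391]^T | 0.0328 |
| 5 | [0.63982, 0.19183]^T  |                             | 0.0009 |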

For reference, the real root is found at [0.6397934171, 0.1918694996]^T,
so the approximation found by Newton's method after five iterations has a real
relative error of only 4.8e-5. To further illustrate, the curves of the ship's hull
and of the probe's exploration pattern are represented in the figure below, with
the common root of both equations marked.
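The iterations of Example 8.6 can be sketched in a few lines. The code below assumes the model equations f0(x, y) = x^2 + 2y^2 - xy + x - 1 and f1(x, y) = 3x^2 + 2y^2 + xy + 3y - 2, a choice of operator signs that reproduces the first step and the root quoted in the example. The 2 × 2 linear system of Eq. (8.28) is solved with Cramer's rule purely to keep the sketch self-contained; any method from Chap. 4 would do. All function names are assumptions introduced here.

```python
import math


def f(x, y):
    # Model equations of Example 8.6 (operator signs assumed as stated above).
    return (x**2 + 2 * y**2 - x * y + x - 1,
            3 * x**2 + 2 * y**2 + x * y + 3 * y - 2)


def jacobian(x, y):
    # Partial derivatives of f0 and f1 with respect to x and y.
    return ((2 * x - y + 1, 4 * y - x),
            (6 * x + y, 4 * y + x + 3))


def solve_2x2(matrix, rhs):
    """Solve a 2x2 linear system with Cramer's rule."""
    (a11, a12), (a21, a22) = matrix
    determinant = a11 * a22 - a12 * a21
    if determinant == 0:
        raise ZeroDivisionError("the Jacobian is singular")
    return ((rhs[0] * a22 - a12 * rhs[1]) / determinant,
            (a11 * rhs[1] - a21 * rhs[0]) / determinant)


def newton_2d(x, y, error_minimum, iteration_maximum):
    for iteration in range(1, iteration_maximum + 1):
        f0, f1 = f(x, y)
        dx, dy = solve_2x2(jacobian(x, y), (-f0, -f1))  # Eq. (8.28)
        x, y = x + dx, y + dy                           # Eq. (8.29)
        # Error measured as the Euclidean distance between successive vectors.
        if math.hypot(dx, dy) <= error_minimum:
            return (x, y), iteration
    raise RuntimeError("maximum number of iterations reached")


root, iterations = newton_2d(0.0, 0.0, 0.001, 50)
print(iterations, root)
```

The first step of this sketch gives the same [1.0000, 0.6667]^T as the example; later iterates can differ from the table in the last printed digits, since the table appears to carry rounded intermediate values.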

8.7 Secant Method

One limitation of Newton's method is that it requires computing the derivative
of the function being studied at multiple points. There are many cases where that
can be a problem: situations where the derivative is unknown and cannot be
estimated to a good accuracy, for example, or situations where the derivative is
too difficult or computationally expensive to evaluate. In these cases, Newton's
method cannot be used. One alternative is to approximate the derivative using the
secant line of the curve, a line passing through (or interpolating) two points on the
function. As these two points iteratively become closer to the root and to each other,
the secant line will become an approximation of the tangent near the root and this
secant method will approximate Newton's method.
From the first-order Taylor series approximation, the approximation of the first
derivative at a point xi computed near a previous point xi-1 is given as:

\[
f^{(1)}(x_i) \approx \frac{f(x_{i-1}) - f(x_i)}{x_{i-1} - x_i} \tag{8.30}
\]

This immediately adds a new requirement into the method: instead of keeping only
one current point with Newton's method, it is necessary to keep two points at each
iteration. This is one of the costs of eliminating the derivative from the method.
Next, the derivative approximation formula is used to replace the actual derivative
in Eq. (8.21):

\[
x_{i+1} = x_i - \frac{f(x_i)}{\left(f(x_{i-1}) - f(x_i)\right)/\left(x_{i-1} - x_i\right)}
        = x_i - \frac{f(x_i)\,(x_{i-1} - x_i)}{f(x_{i-1}) - f(x_i)} \tag{8.31}
\]

And Newton's method is now the secant method. The halting conditions for the
iterations are the same as for Newton's method: the method will fail if it reaches a
preset maximum number of iterations or if the denominator becomes zero, which
will be the case if two points are generated too close to each other (this situation will
also introduce the risk of subtractive cancellation explained in Chap. 2), and it will
succeed if the relative error between two iterations is less than a preset threshold.
The pseudocode for Newton's method in Fig. 8.11 can be updated for the secant
method, and is presented in Fig. 8.15.
Since the secant method approximates Newton's method and replaces the
derivative with an approximation of the derivative, it should be no surprise that
its convergence rate is not as good as Newton's method. In fact, while the proof is
outside the scope of this book, the convergence rate of the secant method is O(h^1.618),
less than the quadratic rate Newton's method boasted but better than the linear rate of
the other methods presented so far in this chapter.
Equation (8.31) should be immediately recognizable: it is the same as the false
position methods equation (8.10). In fact, both methods work in the same way:
they both estimate the root by modelling the function with a straight line interpolated from two function points, and use the root of that line as an approximation of
the root. The difference between the two methods is in the update process once a
new approximation of the root is available. As explained back in Sect. 8.3, the false
position method will update the one boundary point on the same side of the zerocrossing as the new point. Moreover, the method will usually generate points only
on one side of the zero-crossing, which means that only one of the two bounds is
updated, while the other keeps its original value in most of the computations. This
will insure that the root stays within the brackets, and guarantee that the method will

146

8 Root-Finding

PreviousValue, x Input two initial approximations of the root

IterationMaximum Input maximum number of iterations
ErrorMinimum Input minimum relative error
IterationCounter 0
WHILE (TRUE)
TemporaryValue x
x x (CALL F(x)) (PreviousValue x) /
[ (CALL F(PreviousValue)) - (CALL F(x)) ]
PreviousValue TemporaryValue
IF ( CALL F(x) = 0 )
RETURN Success, x
END IF
CurrentError absolute value of [ (x PreviousValue) / x ]
IterationCounter IterationCounter + 1
IF (CurrentError <= ErrorMinimum)
RETURN Success, x
ELSE IF (IterationCounter = IterationMax imum)
RETURN Failure
END IF
END WHILE

FUNCTION F(x)
RETURN evaluation of the target function at point x
END FUNCTION

Fig. 8.15 Pseudocode of the secant method

converge, albeit slowly. By contrast, the secant method will update the points in the
order they are generated: the new approximation and the previous one are kept and
used to compute the next one, and the approximation from two iterations back is
discarded. This is done without checking whether the points are on the same side of
the zero-crossing or on opposite sides. This allows faster convergence, since the
two newest and best estimates of the root are always used in the computations.
However, it also introduces the risk that the method will diverge, which was
impossible for the false position method.
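
The update rule described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `secant`, its parameters, and the default tolerance are assumptions made for this sketch, and the relative-error test is written as a product to avoid dividing by an estimate that happens to be zero.

```python
import math

def secant(f, x_prev, x, tol=0.005, max_iter=50):
    """Secant iteration (hypothetical helper).

    Keeps only the two most recent estimates and discards the older one,
    without checking on which side of the zero-crossing they lie.
    Returns (estimate, iterations_used).
    """
    for iteration in range(1, max_iter + 1):
        f_prev, f_x = f(x_prev), f(x)
        if f_prev == f_x:
            # The secant line is horizontal and has no root.
            raise ZeroDivisionError("secant line is horizontal")
        x_new = x - f_x * (x_prev - x) / (f_prev - f_x)
        x_prev, x = x, x_new  # the estimate from two iterations back is dropped
        # Success: exact root, or relative error below the threshold.
        if f(x) == 0 or abs(x - x_prev) <= tol * abs(x):
            return x, iteration
    raise RuntimeError("maximum number of iterations reached")

# Usage with the signal-power model used in this chapter's examples.
root, used = secant(lambda t: math.exp(-t) * math.cos(t), 0.0, 1.0)
print(f"t = {root:.3f} s after {used} iterations")
```

Because no bracket is maintained, nothing in this sketch prevents the estimates from moving away from the root; the iteration cap is the only safeguard.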
To understand the problem of divergence with the secant method, consider the
example in Fig. 8.16. On the top side, a secant line (in blue) is interpolated between
two points $x_{i-1}$ and $x_i$ of the function (in red) and an approximation of the root $x_{i+1}$
is obtained. This approximation is then used along with $x_i$ to interpolate a new
secant line, which is a very good approximation of the function. It can clearly be
seen that the next approximation $x_{i+2}$ will be very close to the real root of the
function. But what if the exact same points had been considered in the opposite
order? The result is shown on the bottom side of Fig. 8.16. Initially the same secant


Fig. 8.16 Convergence and divergence of the secant method

line is interpolated and the same approximation $x_{i+1}$ is obtained. However, now the
next secant line interpolated between $x_i$ and $x_{i+1}$ diverges and the next point $x_{i+2}$ will
be very distant from the root. The problem is that, in this new situation, the points $x_i$
and $x_{i+1}$ are interpolating a section of the function that is very dissimilar to the
section that includes the root. As a result, while the interpolation is a good
approximation of that section of the function, it is not at all useful for the purpose
of root-finding. Meanwhile, because the false position method only updates the
point on the same side of the zero-crossing, it can only generate the situation on
the top side of Fig. 8.16 regardless of the order the points are fed into the algorithm,
and can never diverge in the way shown on the bottom side. Note however that this
constraint is not necessary to avoid divergence: it is only necessary for the secant
method to use points that interpolate a section of the function similar to the section
that has the zero-crossing. For example, using two points both on the negative side
of the function would allow the secant method to generate a very good approximation of the root.


Example 8.7
The power of a signal being measured by a tower is decreasing over time
according to this model:
$$P(t) = e^{-t}\cos(t)$$
Starting at $t_{-1} = 0$ s and $t_0 = 1$ s, find the time at which the signal will have lost all
power to a relative error of less than 0.5 %.
Solution
The secant method equation, using Eq. (8.31), is:
$$t_{i+1} = t_i - \frac{e^{-t_i}\cos(t_i)\,(t_{i-1} - t_i)}{e^{-t_{i-1}}\cos(t_{i-1}) - e^{-t_i}\cos(t_i)}$$
Given the two initial points, the iterations computed are:
Iteration   t_i (s)   E_i (%)
   −1       0
    0       1         100
    1       1.25      19.9
    2       1.46      14.4
    3       1.54      5.51
    4       1.568     1.61
    5       1.571     0.2

[Figure: plot of the signal power (vertical axis, Power) against time (horizontal axis, Time)]

For reference, the real root of the equation is at 1.5707963267949 s, and
the relative error of the final result compared to the real root is 0.005 %. This
is comparable to the error of the estimate obtained by Newton's method after
4 iterations in Example 8.5. Conversely, to get the same error as the final
result of Newton's method would require six iterations of the secant method.
This example shows that the method is very efficient, but not quite as efficient
as Newton's method: it requires one more iteration to reach the same relative
error.

8.8 Muller's Method

The secant method and the false position method both approximate the function
using a straight line interpolated from two points, and use the root of that line as the
approximation of the root of the function. The weakness of these methods, as
illustrated in Figs. 8.6 and 8.16, is that a straight line is not always a good
approximation of a function. To address that problem, a simple solution is available: to use more points and compute a higher-degree interpolation, which would be
a better approximation of the function and would make it possible to get a closer
approximation of the root. Of course, there is a limit to this approach; the higher the
degree of the interpolating polynomial, the more roots it will have and the more
difficult it will be to find them all easily and efficiently. Muller's method offers a
good compromise. It uses three points to compute a degree-2 interpolation
(a parabola) of the function to model. The degree-2 polynomial offers a better
approximation of the function than a straight line, as illustrated in Fig. 8.17, while
still being easy enough to handle to find the roots.

Fig. 8.17 Approximating the root using a degree 1 (left) and degree 2 (right) interpolation of a
function
The first step of Muller's method is thus to approximate the function f(x) using a
parabola interpolated from three points $x_{i-2}$, $x_{i-1}$, and $x_i$. The equation for a
parabola is well known to be $p(x) = ax^2 + bx + c$. Given three points on the function
to model, it can be computed using the Vandermonde method from Chap. 4 as the
solution to the system:
$$\begin{bmatrix} x_{i-2}^2 & x_{i-2} & 1 \\ x_{i-1}^2 & x_{i-1} & 1 \\ x_i^2 & x_i & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} f(x_{i-2}) \\ f(x_{i-1}) \\ f(x_i) \end{bmatrix} \tag{8.32}$$
In order to use these equations in an iterative formula of the form $x_{i+1} = x_i + \Delta x_i$,
replace $x$ with $x - x_i$. This changes the parabola equation to $p(x) = a(x - x_i)^2 + b(x - x_i) + c$
and the Vandermonde system to:
$$\begin{bmatrix} (x_{i-2} - x_i)^2 & x_{i-2} - x_i & 1 \\ (x_{i-1} - x_i)^2 & x_{i-1} - x_i & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} f(x_{i-2}) \\ f(x_{i-1}) \\ f(x_i) \end{bmatrix} \tag{8.33}$$

Written in that form, the Vandermonde system is trivial to solve. In fact, a solution
can be obtained immediately as:

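$$\begin{aligned} a &= \frac{\dfrac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} - \dfrac{f(x_{i-1}) - f(x_{i-2})}{x_{i-1} - x_{i-2}}}{x_i - x_{i-2}} \\ b &= a\,(x_i - x_{i-1}) + \frac{f(x_i) - f(x_{i-1})}{x_i - x_{i-1}} \\ c &= f(x_i) \end{aligned} \tag{8.34}$$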
Note that in both the parabola equation and the Vandermonde system, the solution
remains unchanged. This is because the subtraction represents only a horizontal
shift of the function. All values of the function are moved along the x-axis by a
distance of $x_i$, but they remain unchanged along the y-axis. This is akin to a time-shifting operation in signal processing, and is illustrated in Fig. 8.18 for clarity.
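
As a quick check of the closed-form solution of Eq. (8.34), the sketch below computes the three coefficients and confirms that the shifted parabola passes through the three interpolation points. The function names and argument order are assumptions made for this illustration only.

```python
import math

def muller_coefficients(f, x_im2, x_im1, x_i):
    """Coefficients of p(x) = a (x - x_i)^2 + b (x - x_i) + c, as in Eq. (8.34).

    x_im2, x_im1, x_i stand for x_{i-2}, x_{i-1}, x_i (hypothetical names).
    """
    slope_new = (f(x_i) - f(x_im1)) / (x_i - x_im1)
    slope_old = (f(x_im1) - f(x_im2)) / (x_im1 - x_im2)
    a = (slope_new - slope_old) / (x_i - x_im2)
    b = a * (x_i - x_im1) + slope_new
    c = f(x_i)
    return a, b, c

def shifted_parabola(a, b, c, x_i, x):
    """Evaluate the shifted parabola at x."""
    dx = x - x_i
    return a * dx * dx + b * dx + c

# Usage: the parabola reproduces the function at the three points.
power = lambda t: math.exp(-t) * math.cos(t)
a, b, c = muller_coefficients(power, 0.0, 0.1, 0.2)
for t in (0.0, 0.1, 0.2):
    print(t, power(t), shifted_parabola(a, b, c, 0.2, t))
```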
Once the coefficients a, b, and c for the parabola equation are known, the next
step is to find the roots of the parabola, which will serve as approximations of the
root of f(x). The standard quadratic formula to find the roots of a degree-2 polynomial is:
$$r_0, r_1 = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \tag{8.35}$$
This equation will yield both roots of the polynomial. The one that is useful for the
iterative system is the one obtained by setting the ± sign to the same sign as b. Note
however that this introduces the risk of the subtractive cancellation problem described in Chap. 2 in cases where $b^2 \gg 4ac$. To avoid this issue, an
alternative form of Eq. (8.35) can be used:
$$r_0, r_1 = \frac{-2c}{b \pm \sqrt{b^2 - 4ac}} \tag{8.36}$$
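
The effect of subtractive cancellation can be seen numerically with a small experiment. The sketch below evaluates the same root with both forms of the formula; the function names and the test coefficients (a = 1, b = 10^8, c = 1, whose small root is very close to −10^−8) are assumptions chosen for illustration.

```python
import math

def root_standard(a, b, c):
    """Eq. (8.35) with the sign of the square root set to the sign of b.

    The numerator subtracts two nearly equal numbers when b*b >> 4*a*c.
    """
    sign = -1.0 if b < 0 else 1.0
    return (-b + sign * math.sqrt(b * b - 4 * a * c)) / (2 * a)

def root_alternative(a, b, c):
    """Eq. (8.36) with the same sign choice.

    The denominator adds two numbers of the same sign, so nothing cancels.
    """
    sign = -1.0 if b < 0 else 1.0
    return -2 * c / (b + sign * math.sqrt(b * b - 4 * a * c))

# Usage: both calls target the same root of x^2 + 1e8 x + 1.
print(root_standard(1.0, 1e8, 1.0))
print(root_alternative(1.0, 1e8, 1.0))
```

In double precision the first form typically loses most of its significant digits on this input, while the second does not.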

Finally, the iterative algorithm of Muller's method can be written as:
$$x_{i+1} = x_i - \frac{2c}{b \pm \sqrt{b^2 - 4ac}} \tag{8.37}$$
where the ± is set to the same sign as b and the values a, b, and c are computed by
solving the Mx = b system of Eq. (8.33). The algorithm has only two halting
conditions: a success condition if the relative error between two successive values
$x_i$ and $x_{i+1}$ is less than a preset threshold, and a failure condition if a preset maximum
number of iterations is reached. The pseudocode for this algorithm, using the
solution of Eq. (8.34) and a simple test to set the sign, is presented in Fig. 8.19.

Fig. 8.18 Horizontal shift of the parabola f(x) to f(x + 3)
The convergence rate of Muller's method is $O(h^{1.839})$, slower than Newton's
method but better than the secant method. Intuitively, the fact that Muller's method
performs better than the secant method should not be a surprise, since it follows the
same idea of interpolating a model of the function and using the model's root as an
approximation, but does so with more information (one more point) to get a better
model. On the other hand, Newton's method uses information from the function
itself, namely its derivative, instead of an approximation, so it should naturally
perform better than any approximation-based method.
PreviousValue2, PreviousValue, x ← Input three initial approximations of
                                   the root
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
IterationCounter ← 0
WHILE (TRUE)
    A ← { [ (CALL F(x)) − (CALL F(PreviousValue)) ] / [x − PreviousValue]
          − [ (CALL F(PreviousValue)) − (CALL F(PreviousValue2)) ] /
            [PreviousValue − PreviousValue2] } / (x − PreviousValue2)
    B ← A × (x − PreviousValue) +
        [ (CALL F(x)) − (CALL F(PreviousValue)) ] / [x − PreviousValue]
    C ← CALL F(x)
    IF (B < 0)
        Sign ← −1
    ELSE
        Sign ← 1
    END IF
    PreviousValue2 ← PreviousValue
    PreviousValue ← x
    x ← x + (−2 × C) / [ B + Sign × square root of (B × B − 4 × A × C) ]
    IF ( CALL F(x) = 0 )
        RETURN Success, x
    END IF
    CurrentError ← absolute value of [ (x − PreviousValue) / x ]
    IterationCounter ← IterationCounter + 1
    IF ( CurrentError <= ErrorMinimum )
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION

Fig. 8.19 Pseudocode of Muller's method
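
The pseudocode of Fig. 8.19 translates almost line for line into Python. The sketch below is one possible rendering under stated assumptions: the name `muller`, the default tolerance, and the decision to raise an error when the interpolated parabola has no real root are choices made for this illustration, not part of the original algorithm description.

```python
import math

def muller(f, x_im2, x_im1, x_i, tol=0.005, max_iter=50):
    """Muller's method restricted to real arithmetic (hypothetical helper).

    Returns (estimate, iterations_used).
    """
    for iteration in range(1, max_iter + 1):
        # Coefficients of the shifted parabola, Eq. (8.34).
        slope_new = (f(x_i) - f(x_im1)) / (x_i - x_im1)
        slope_old = (f(x_im1) - f(x_im2)) / (x_im1 - x_im2)
        a = (slope_new - slope_old) / (x_i - x_im2)
        b = a * (x_i - x_im1) + slope_new
        c = f(x_i)
        discriminant = b * b - 4 * a * c
        if discriminant < 0:
            # Assumption of this sketch: complex roots are not followed.
            raise ValueError("interpolated parabola has no real root")
        sign = -1.0 if b < 0 else 1.0
        # Update of Eq. (8.37), using the cancellation-free form.
        x_new = x_i - 2 * c / (b + sign * math.sqrt(discriminant))
        x_im2, x_im1, x_i = x_im1, x_i, x_new
        if f(x_i) == 0 or abs(x_i - x_im1) <= tol * abs(x_i):
            return x_i, iteration
    raise RuntimeError("maximum number of iterations reached")

# Usage with the signal-power model and three deliberately poor guesses.
root, used = muller(lambda t: math.exp(-t) * math.cos(t), 0.0, 0.1, 0.2)
print(f"t = {root:.4f} s after {used} iterations")
```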


Example 8.8
The power of a signal being measured by a tower is decreasing over time
according to this model:
$$P(t) = e^{-t}\cos(t)$$
Starting at $t_{-2} = 0$ s, $t_{-1} = 0.1$ s, and $t_0 = 0.2$ s, find the time at which the signal will
have lost all power to a relative error of less than 0.5 %.
Solution
Note to begin that this example starts with some bad initial values. Indeed,
P(0) = 1, P(0.1) = 0.900, and P(0.2) = 0.802. Nonetheless, use these values to
compute the first iteration of Muller's method. Applying Eq. (8.34) finds the
coefficients of the interpolated parabola to be (the results are computed from the
unrounded function values):
$$a = \frac{\dfrac{0.802 - 0.900}{0.2 - 0.1} - \dfrac{0.900 - 1}{0.1 - 0}}{0.2 - 0} = 0.089$$
$$b = 0.089\,(0.2 - 0.1) + \frac{0.802 - 0.900}{0.2 - 0.1} = -0.970$$
$$c = 0.802$$
Using these values in Eq. (8.37) finds the relevant root of the interpolated
parabola, and the first approximation of the root:
$$t_1 = 0.2 - \frac{2 \times 0.802}{-0.970 - \sqrt{(-0.970)^2 - 4 \times 0.089 \times 0.802}} = 1.101$$
Note that the actual root of this equation is at $t_r = 1.5707963267949$ s.
Given the initial points used for the computations, a first approximation at
1.101 is already a huge step forward. The relative error compared to the
previous approximation of 0.2 is still too high, and the iterations continue.
The next steps are listed in the table below.
Iteration   t_i (s)   E_i (%)
   −2       0
   −1       0.1
    0       0.2
    1       1.101     81.8
    2       1.481     25.6
    3       1.583     6.5
    4       1.5706    0.82
    5       1.5708    0.01

[Figure: plot of the signal power (vertical axis, Power) against time (horizontal axis, Time)]


Notice how, as the bad initial values are replaced by better approximations
in each successive iteration, the approximation of the root improves a lot.
This is a feature of every iterative algorithm: the better the approximations
used in its computation are, the better the result will be. However, since
Muller's method uses three values in the computation of each iteration, as
opposed to only one for Newton's method or two for the secant method, this
means that bad initial guesses will negatively affect the result of the iterative
algorithm for up to the first three iterations! This is what is observed in this
example. Iterations 1 and 2 yield rather poor approximations of the root
because they use three and two of the bad initial guesses in their computations, respectively. Iteration 3 gives a much better result because it uses only
one bad initial guess and two computed approximations near the real root in
its computations. And iteration 4, computed from the three approximations of
the first three iterations and not influenced by the bad initial guesses at all, is
almost spot-on.

8.9 Engineering Applications

Root-finding problems arise often in engineering design, namely when a system has
been modelled by an equation or set of equations with known parameter and
property values, and the value of a dependent (controllable) variable of the system
must be discovered. If the model used only a linear equation, it would be a simple
matter to isolate the variable to be controlled and to compute its optimal value.
Unfortunately, most real engineering situations are modelled by more complex
equations where the variable is part of exponentials, logarithms, or trigonometric
terms, and cannot be isolated. This was the case of the value of the current I in
Eq. (8.4); it is simply impossible to isolate it in that equation. Many other such
situations can also occur in engineering practice. Some examples are listed here.
The van der Waals equation relates the pressure p, volume V, number of moles
N, and temperature T of a fluid or gas in a container. The equation is:
$$\left(p + a\frac{N^2}{V^2}\right)\left(\frac{V}{N} - b\right) = RT \tag{8.38}$$
where a and b are substance-specific constants and R is the universal gas
constant. In that case, suppose a system is designed to handle a known maximum
pressure and temperature. It becomes necessary to know what is the maximum
volume (or molar volume V/N) of each type of substance that can be safely
contained in the system, in order for example to properly document the system's
safe operating parameters or to ensure that it can be used in a specific
application.
A column supporting an off-center load will suffer from bending stress and
compression stress. The maximum stress $\sigma_{max}$ that the column can sustain is
given by the secant formula:
$$\sigma_{max} = \frac{P}{A}\left[1 + \frac{ec}{r^2}\sec\left(\frac{L}{2r}\sqrt{\frac{P}{EA}}\right)\right] \tag{8.39}$$
where P is the axial load, A is the cross-section area of the column, $ec/r^2$ is the
eccentricity ratio, L/r is the slenderness ratio, and E is Young's modulus for the
material of the column. Structural design often requires using this equation to
determine the area of a column that will support a given load.
It is well-known that an object thrown will follow a parabolic trajectory. More
specifically, the (x, y) coordinates of the object following this trajectory after
being thrown at an angle $\theta$ with initial velocity v will obey the equation:
$$y = x\tan\theta - \frac{g x^2}{2 v^2 \cos^2\theta} \tag{8.40}$$
where g is the Earth's gravitational acceleration and the object's starting position
is assumed to be the origin (0, 0). While the initial speed will often be determined by the nature of the object's propulsion mechanism, a common challenge
is to determine the initial angle $\theta$ to use in order to reach a specific target
destination or to intercept a point in mid-trajectory.
In all these situations, the exact value of a specific variable must be known to
properly design the system, but that variable cannot be isolated from the system's
equation in order for the value to be computed. However, simply by subtracting one
side of the equation from the other, the equation becomes equal to zero and the
needed value becomes the root of the equation, and can thus be approximated to a
known error rate by any of the methods seen in this chapter.
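
As an illustration of this recipe, the sketch below moves every term of the trajectory equation to one side and applies a secant iteration to find the launch angle. All numerical values (launch speed, target position, starting angles, and g = 9.81 m/s²) and all function names are assumptions chosen for this example.

```python
import math

G = 9.81  # m/s^2, assumed value of the gravitational acceleration

def height_error(theta, v, x_target, y_target):
    """Trajectory equation with all terms on one side.

    The result is zero when a launch at angle theta (radians) and speed v
    passes exactly through (x_target, y_target).
    """
    return (x_target * math.tan(theta)
            - G * x_target ** 2 / (2 * v ** 2 * math.cos(theta) ** 2)
            - y_target)

def launch_angle(v, x_target, y_target, theta_prev=0.3, theta=0.5,
                 tol=1e-10, max_iter=50):
    """Secant iteration on height_error; the starting angles are guesses."""
    f = lambda angle: height_error(angle, v, x_target, y_target)
    for _ in range(max_iter):
        f_prev, f_cur = f(theta_prev), f(theta)
        if f_prev == f_cur:
            break  # horizontal secant line, no further progress possible
        theta_new = theta - f_cur * (theta_prev - theta) / (f_prev - f_cur)
        theta_prev, theta = theta, theta_new
        if abs(theta - theta_prev) <= tol * abs(theta):
            return theta
    raise RuntimeError("secant iteration did not converge")

# Usage: angle needed to hit a point 60 m away and 10 m up at 30 m/s.
angle = launch_angle(30.0, 60.0, 10.0)
print(f"launch angle: {math.degrees(angle):.2f} degrees")
```

The trajectory equation generally has two solutions (a low and a high arc); which one the iteration finds depends on the starting angles.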

8.10 Summary

Many engineering situations can be modelled and solved by finding the value of
some parameters of the system for which the system balances out to zero. These are
root-finding problems, and this chapter introduced several numerical methods to
solve them. The two closed methods, the bisection and false position methods,
set up bounds around the root and either blindly pick the middle point between these
bounds or interpolate a line through the function to get closer to the root. Because
they bracket the root, these two methods are guaranteed to converge on the root
eventually, albeit slowly. Next, three open methods were introduced, namely
Newton's method, the secant method, and Muller's method. These methods all
work by approximating the function, either using its derivative at one point, a
straight line interpolated through two points, or a parabola interpolated through
three points. Since none of them are burdened by maintaining possibly inaccurate
brackets, they all perform faster than the closed methods. However, they all have a
risk of diverging and failing to find the root in certain conditions. Of these three
open methods, Newton's method was the most efficient and the most versatile since
it could easily be expanded to multidimensional and multivariate problems.
Table 8.1 summarizes the methods covered in this chapter.

Table 8.1 Summary of root-finding methods

Method                         Requires               Error
Bisection method               2 bounds               O(h)
False position method          2 bounds               O(h)
Simple fixed-point iteration   1 point                O(h)
Newton's method                1 point + derivative   O(h^2)
Secant method                  2 points               O(h^1.618)
Muller's method                3 points               O(h^1.839)

8.11 Exercises

1. Approximate the root of the following equations in the respective intervals
using the bisection method to a relative error of 0.1.
(a) $f(x) = x^3 - 3$; interval [1, 2]
(b) $f(x) = x^2 - 10$; interval [3, 4]
(c) $f(x) = e^{-x}(3.2\sin(x) - 0.5\cos(x))$; interval [3, 4]
2. Write an algorithm to use the bisection method to find a root of $f(x) = \sin(x)$
starting with the interval [1, 99] with a relative error of 0.00001. Comment on
the result.
3. Approximate the root of the following equations in the respective intervals
using the false position method to a relative error of 0.1.
(a) $f(x) = x^3 - 3$; interval [1, 2]
(b) $f(x) = x^2 - 10$; interval [3, 4]
(c) $f(x) = e^{-x}(3.2\sin(x) - 0.5\cos(x))$; interval [3, 4]
4. Use Newton's method to find a root of the function $f(x) = e^{-x}\cos(x)$ starting
with $x_0 = 1.3$ to a relative error of $10^{-5}$.
5. Use Newton's method to find a root of the function $f(x) = x^2 - 7x + 3$ starting
with $x_0 = 0$ and with an accuracy of 0.1.


6. Perform three steps of Newton's method for the function $f(x) = x^2 - 2$ starting
with $x_0 = 1$.
7. Perform three iterations of Newton's method to approximate a root of the
following multivariate systems given their starting points:

x2 y2 3
(a) f x
, x0 [1, 1]T.
2x2 0:5y2 2
2

x xy y2 3
(b) f x
, x0 [1.5, 0.5]T.
x y xy
8. Perform three steps of the secant method for the function f(x) x2 2 starting
with
x1 0 and x0 1.
9. Perform four steps of the secant method for the function f(x) cos(x) + 2 sin(x) + x2
starting with x1 0.0 and x0 0.1.
10. Use the secant method to find a root of the function f(x) x2 7x + 3 starting
with
x1 1 and x0 0 and with an accuracy of 0.1.
11. Perform six iterations of Mullers method on the function f(x) x7 + 3x6 + 7x5
+ x4 + 5x3 + 2x2 + 5x + 5 starting with the three initial values x2 0,
x1 0.1, and x0 0.2.

Chapter 9

Optimization

9.1 Introduction

One major challenge in engineering practice is often the need to design systems that
must perform as well as possible given certain constraints. Working without
constraints would be easy: when a system can be designed with no restrictions on
cost, size, or components used, imagination is the only limit on what can be built.
But when constraints are in place, as they always will be in practice, then not only
must engineering designs respect them, but the difference between a good and a bad
design will be which one can get the most done within the stated constraints.
Take for example the design of a fuel tank. If the only design requirement is to
"hold a certain amount of fuel", then there are no constraints and the tank could be
of any shape at all, provided the shape's volume is greater than the amount of fuel it
must contain. However, when the cost of the materials the tank is made up of is
taken into account, the design requirement becomes "hold at least a certain amount
of fuel at the least cost possible", and this new constraint means the problem
becomes about designing a fuel tank while minimizing its surface area, a very
different one from before. A clever engineer would design the fuel tank to be a
sphere, the shape with the lowest surface to volume ratio, in order to achieve the
optimal result within the constraints. This design will be superior to the one using,
say, a cube-shaped fuel tank, that would have a higher surface area and higher cost
to hold the same volume of fuel.
To make the example more interesting, suppose the shape of the fuel tank is also
constrained by the design of the entire system: it must necessarily be a cylinder
closed at the top and made of a metal that costs $300/m², while the bottom of the
tank is attached to a nozzle shaped as a cone with height equal to its radius and made
of a plastic that costs $500/m². The entire assembly must hold at least 2000 m³
of fuel. How to determine the optimal dimensions of the tank and the connected
nozzle? First, model the components. For a given radius r and height h of the cylinder
tank, the surface of the side and top of the cylinder will be:

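$$A_1 = 2\pi r h + \pi r^2 \tag{9.1}$$
while the area of the nozzle will be:
$$A_2 = \pi r^2\left(1 + \sqrt{2}\right) \tag{9.2}$$
Likewise, the volume of the cylinder of radius r and height h will be:
$$V_1 = \pi r^2 h \tag{9.3}$$
And the volume of the nozzle will be:
$$V_2 = \frac{\pi r^3}{3} \tag{9.4}$$
By looking at the cost (area) and volume of the entire assembly, this model becomes
two equations with two unknown parameters that can be controlled, r and h:
$$\begin{aligned} \pi r^2 h + \frac{\pi r^3}{3} &= 2000\ \text{m}^3 \\ \left(2\pi r h + \pi r^2\right) 300 + \pi r^2\left(1 + \sqrt{2}\right) 500 &= ?\ \$ \end{aligned} \tag{9.5}$$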

Normally a system of two equations and two unknowns would be easy to solve. The
problem in the system of (9.5) is that one of the equations does not have a known
result. The area and cost of the tank are not specified in the problem; the only
requirement is that they must be as low as possible.
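The problem could be further simplified by writing the parameter h as a function
of r in the volume equation, and inserting that function of r into the price equation,
to get:
$$\begin{aligned} h &= \frac{2000}{\pi r^2} - \frac{r}{3} \\ \left(\frac{4000}{r} - \frac{2\pi r^2}{3} + \pi r^2\right) 300 + \pi r^2\left(1 + \sqrt{2}\right) 500 &= ?\ \$ \end{aligned} \tag{9.6}$$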

Now the price in Eq. (9.6) is only dependent on the radius; the height will be
automatically adjusted to generate a container of 2000 m³ of fuel. The cost of a fuel
tank with a radius from 1 to 10 m can be computed, and will give the values
illustrated on the graph of Fig. 9.1. The ideal tank with minimal cost can also be
found to have a radius of 5.27 m, a height of about 21.2 m (from the height equation in Eq. (9.6)), and a cost of $341,750.
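
The cost curve can be reproduced with a few lines of Python. The sketch below evaluates the cost of Eq. (9.6) on a grid of radii from 1 to 10 m and picks the cheapest one; the function names, the 1 cm grid step, and the use of a brute-force scan instead of a proper optimization method are assumptions made for this illustration.

```python
import math

METAL_COST = 300.0    # $/m^2, cylinder side and top
PLASTIC_COST = 500.0  # $/m^2, conical nozzle
VOLUME = 2000.0       # m^3, required capacity

def tank_height(r):
    """Cylinder height that makes the assembly hold VOLUME."""
    return VOLUME / (math.pi * r ** 2) - r / 3

def tank_cost(r):
    """Total material cost as a function of the radius alone."""
    h = tank_height(r)
    cylinder_area = 2 * math.pi * r * h + math.pi * r ** 2
    nozzle_area = math.pi * r ** 2 * (1 + math.sqrt(2))
    return cylinder_area * METAL_COST + nozzle_area * PLASTIC_COST

# Brute-force scan of radii from 1 m to 10 m in 1 cm steps.
radii = [1 + 0.01 * k for k in range(901)]
best_radius = min(radii, key=tank_cost)
print(f"radius {best_radius:.2f} m, height {tank_height(best_radius):.2f} m, "
      f"cost ${tank_cost(best_radius):,.0f}")
```

A scan like this needs hundreds of function evaluations; the optimization methods of this chapter aim to reach the same optimum with far fewer.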
This type of problem is called optimization, since it is seeking the optimal value
of a function. This optimum can be the minimum value of the function, as it was for
the cost function in the preceding example, or its maximum, for example if one was
trying to design a fuel tank that can hold the greatest volume given a fixed budget.

Fig. 9.1 Radius and cost of fuel tanks

In the former case the problem can be called minimization, and in the latter
maximization. It is important to note that these are not different types of problems
though, but simply sign differences on the same optimization problem.

9.2 Golden-Mean Search

The golden-mean search, sometimes also called the golden-section search, is a
simple and straightforward bracketing optimization method. A basic outline of its
algorithm would be very similar to the other bracketing algorithms in Chap. 7 and
Chap. 8: set an upper and a lower bound that bracket one (and only one) optimum
of the function, then iteratively reduce the interval between these bounds to get a
better and better approximation of the optimum. One important difference however concerns the iterative update of the bounds. In other algorithms, such as the
bisection and false position methods of Chap. 8, one point was generated in the
interval and used to replace one of the bounds. This cannot work in an optimization problem, because one point alone is not enough information to determine
where the optimum might be. To visualize this problem, consider the example of a
function with a minimum between the bounds of x = 1 and x = 3. At x = 1 the
function evaluates to 4, at x = 3 it evaluates to 3, and in the middle point of x = 2
the function evaluates to 1. Should the new bounds be [1, 2] or [2, 3]? In fact there
is not enough information provided to decide: the function might have reached its
minimum between x = 1 and x = 2 and be on an upward slope from x = 2 to x = 3,
or it might be on a downward slope from x = 1 to x = 2 to reach a minimum
somewhere between x = 2 and x = 3 before increasing again. Both of these
scenarios are illustrated in Fig. 9.2. Note that the issue is not with the selection
of the point in the middle of the interval as opposed to somewhere else in
the interval. One measurement in the interval, no matter where in the interval it
is taken, will never be enough information to determine the location of the
optimum.
Fig. 9.2 Two functions with points (1, 4), (2, 1), and (3, 3), with a minimum in
the [1, 2] interval (left) and in the [2, 3] interval (right)

If one point does not provide sufficient information to make a decision, then
more points must be considered. In fact, two points dividing the bracketed section
into three intervals are enough to determine which two intervals the optimum
may be in and which one can be safely discarded. Consider once again the example
of the function with a minimum between the bounds of x = 1 and x = 3 and which
evaluates to (1, 4) and (3, 3). Points evaluated at 1.66 and 2.33 divide the function
neatly into three equal intervals. Suppose the function evaluates to (1.66, 0.95) and
(2.33, 1.6). The fact that the function has a lower value at the one-third point than at
the two-third point means that a minimum must have been reached somewhere
within the first two intervals, to allow the function to turn around and increase again.
In fact, two cases are possible: either the minimum is in the interval [1, 1.66] and the
function is on an upward slope through 1.66 and 2.33 to 3, or the function is
decreasing from 1 through 1.66 to reach a minimum in the [1.66, 2.33] interval,
and is increasing again through 2.33 to 3. The only impossible case is for the
minimum to be in the [2.33, 3] interval, as that would require the function to have
two minima, one in the [1, 2.33] interval to allow the decrease from 1 to 1.66 and
increase from 1.66 to 2.33, and the second one in the [2.33, 3] interval, and it has
already been stated that the function has only one minimum within the bounds.
Consequently, the interval [2.33, 3] can safely be discarded, and the new bounds
can be reduced to [1, 2.33]. This situation is illustrated in Fig. 9.3.

Fig. 9.3 Two functions with points (1, 4), (1.66, 0.95), (2.33, 1.6), and (3, 3), with a minimum in
the [1, 1.66] interval (left) and in the [1.66, 2.33] interval (right)
To formalize the bound update rule demonstrated above, assume that, at iteration i,
the algorithm has a lower bound $x_{iL}$ and an upper bound $x_{iU}$ bracketing an optimum of
the function f(x). Two points are generated for the iteration within the bounds, $x_{i0}$ and
$x_{i1}$ where $x_{i0} < x_{i1}$, and they are evaluated. Then, in the case of a minimization
problem, the bounds are updated according to the following rule:
$$[x_{iL}, x_{iU}] \rightarrow \begin{cases} [x_{iL}, x_{i1}] & \text{if } f(x_{i0}) < f(x_{i1}) \\ [x_{i0}, x_{iU}] & \text{if } f(x_{i0}) > f(x_{i1}) \end{cases} \tag{9.7}$$
In the case of a maximization problem, the update rule is simply inverted:
$$[x_{iL}, x_{iU}] \rightarrow \begin{cases} [x_{iL}, x_{i1}] & \text{if } f(x_{i0}) > f(x_{i1}) \\ [x_{i0}, x_{iU}] & \text{if } f(x_{i0}) < f(x_{i1}) \end{cases} \tag{9.8}$$

Note that these rules are independent of the step between $x_{iL}$, $x_{i0}$, $x_{i1}$, and $x_{iU}$. The
decision to use the one-third and two-third points in the previous example was made
only for the sake of simplicity. More generally, the step can be represented by a
value $\lambda$, and the two points are computed as:
$$\begin{aligned} x_{i0} &= \lambda x_{iL} + (1 - \lambda)\, x_{iU} \\ x_{i1} &= (1 - \lambda)\, x_{iL} + \lambda x_{iU} \end{aligned} \tag{9.9}$$
In these definitions, at each iteration, the bracketed interval is reduced by a factor of
$\lambda$, to a size of $\lambda\,|x_{iL} - x_{iU}|$. In the previous example using the one-third and two-third
points, the value was $\lambda = 2/3$. While that is a natural value to use when the question
is to split an interval into three parts, it is also a suboptimal value for the computations in the algorithm. Indeed, consider how these two points and the update rules
of Eqs. (9.7) and (9.8) will interact with each other. An example case is detailed in
Table 9.1, with the initial bounds being 0 and 1 and the first rule of Eq. (9.7) being
used every time.
Notice that both values $x_{i0}$ and $x_{i1}$ are new at each step, which means that both
$f(x_{i0})$ and $f(x_{i1})$ need to be recomputed each time. This is twice the amount that would
be needed if one point could be reused, and in cases where evaluating the function is
time-consuming, it can become a major drawback for the algorithm. Now consider
what would happen in the case where $\lambda = \varphi - 1 \approx 0.6180$, where $\varphi$ is the golden ratio.
Table 9.2 runs through the example again, using this new value of $\lambda$.
Table 9.1 Sample iterations using λ = 2/3

Iteration   x_iL   x_i0      x_i1      x_iU
0           0      0.33333   0.66666   1
1           0      0.22222   0.44444   0.66666
2           0      0.14815   0.29629   0.44444
3           0      0.09876   0.19753   0.29629

Table 9.2 Sample iterations using λ = 0.6180

Iteration   x_iL   x_i0     x_i1     x_iU
0           0      0.3820   0.6180   1
1           0      0.2361   0.3820   0.6180
2           0      0.1459   0.2361   0.3820
3           0      0.0902   0.1459   0.2361


This time, when one inner value becomes the new bound, the interval is reduced
in such a way that the other inner value becomes the new opposite inner value. In
Table 9.2, whenever $x_{i1}$ becomes the new bound, $x_{i0}$ becomes $x_{i1}$. This is a natural
result of using the golden ratio: the ratio of the distance between $x_{iL}$ and $x_{iU}$ to the
distance between $x_{iL}$ and $x_{i1}$ is the same as the ratio of the distance between $x_{iL}$ and
$x_{i1}$ to the distance between $x_{i1}$ and $x_{iU}$ and the same as the ratio of the distance
between $x_{iL}$ and $x_{i1}$ to the distance between $x_{iL}$ and $x_{i0}$. Consequently, when the
interval between $x_{i1}$ and $x_{iU}$ is taken out and the new complete interval is $x_{iL}$ to $x_{i1}$,
$x_{i0}$ is at the correct distance from $x_{iL}$ to become the new inner point $x_{i1}$. Moreover,
with this value of $\lambda$, the interval is reduced at each iteration to 0.6180 of its previous
size, which is smaller than the reduction to 0.6666 of its previous size when $\lambda = 2/3$.
In other words, using $\lambda = 0.6180$ leads to an algorithm that both requires only half
the function evaluations in each iteration and that converges faster. There are no
downsides.
As with any iterative algorithm, it is important to define termination conditions.
The two usual conditions apply to the golden-mean search. If the absolute error
between the bounds after the update is less than a
predefined threshold, then an accurate enough approximation of the optimum has
been found and the algorithm terminates in success. If however the algorithm first
reaches a predefined maximum number of iterations, it ends in failure. The
pseudocode of the complete golden-mean search method is given in Fig. 9.4.
The convergence rate of this algorithm has already been hinted at previously,
when it was mentioned that each iteration reduces the interval by a factor of $\lambda$.
When the value of $\lambda$ is set to the golden ratio value of 0.6180 and the initial interval between the
bounds is $h_0 = |x_{0L} - x_{0U}|$, then after the first iteration it will be $h_1 = \lambda h_0$, and after
the second iteration it will be:
$$h_2 = \lambda h_1 = \lambda^2 h_0 \tag{9.10}$$
and more generally, after iteration n it will be:
$$h_n = \lambda^n h_0 = 0.6180^n h_0 \tag{9.11}$$
This is clearly a linear O(h) convergence rate. One advantage of Eq. (9.11) is that it
makes it possible to predict an upper bound on the number of iterations the golden-mean algorithm will need to reach the desired error threshold. For example, if the initial
search interval was $h_0 = 1$ and an absolute error of 0.0001 is required, then since
$\log_{0.6180}(0.0001) \approx 19.1$, the algorithm will need to perform at most 20 iterations.
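
The prediction based on Eq. (9.11) can be wrapped in a small helper. The sketch below returns the smallest whole number of iterations n for which the interval has shrunk to the requested size; the function name and signature are assumptions made for this illustration.

```python
import math

def golden_mean_iterations(h0, tolerance, reduction=0.6180):
    """Smallest integer n such that reduction**n * h0 <= tolerance."""
    return math.ceil(math.log(tolerance / h0) / math.log(reduction))

# Usage: unit search interval, absolute error of 0.0001.
n = golden_mean_iterations(1.0, 0.0001)
print(n, 0.6180 ** n)
```

Rounding up matters here: the logarithm is rarely a whole number, and rounding it down would stop one iteration short of the threshold.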


XL ← Input lower bound
XU ← Input upper bound
ProblemType ← Input minimization or maximization
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum absolute error
X0 ← 0.6180 × XL + (1 − 0.6180) × XU
X1 ← (1 − 0.6180) × XL + 0.6180 × XU
IterationCounter ← 0
WHILE (TRUE)
    IF (ProblemType = minimization)
        IF ( [CALL F(X0)] < [CALL F(X1)] )
            XU ← X1
            X1 ← X0
            X0 ← 0.6180 × XL + (1 − 0.6180) × XU
        ELSE
            XL ← X0
            X0 ← X1
            X1 ← (1 − 0.6180) × XL + 0.6180 × XU
        END IF
    ELSE IF (ProblemType = maximization)
        IF ( [CALL F(X0)] > [CALL F(X1)] )
            XU ← X1
            X1 ← X0
            X0 ← 0.6180 × XL + (1 − 0.6180) × XU
        ELSE
            XL ← X0
            X0 ← X1
            X1 ← (1 − 0.6180) × XL + 0.6180 × XU
        END IF
    END IF

    CurrentError ← absolute value of (XU − XL)
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, (XL + XU) / 2
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION

Fig. 9.4 Pseudocode of the golden-mean method
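
A Python rendering of the golden-mean search is sketched below. It differs from a literal transcription of the pseudocode in one respect, which is a design choice of this sketch: the function values at the two inner points are stored, so that only one new evaluation is needed per iteration. The function name, the default tolerance, and the choice to return the midpoint of the final bracket are assumptions.

```python
def golden_mean_min(f, lower, upper, tol=0.01, max_iter=100):
    """Golden-mean search for a minimum bracketed by [lower, upper].

    Returns (estimate, iterations_used). To maximize a function g,
    call this helper with f = lambda x: -g(x).
    """
    lam = (5 ** 0.5 - 1) / 2  # golden ratio minus one, about 0.6180
    x0 = lam * lower + (1 - lam) * upper
    x1 = (1 - lam) * lower + lam * upper
    f0, f1 = f(x0), f(x1)
    for iteration in range(1, max_iter + 1):
        if f0 < f1:
            # Minimum cannot be in [x1, upper]; the old x0 becomes the new x1.
            upper, x1, f1 = x1, x0, f0
            x0 = lam * lower + (1 - lam) * upper
            f0 = f(x0)
        else:
            # Minimum cannot be in [lower, x0]; the old x1 becomes the new x0.
            lower, x0, f0 = x0, x1, f1
            x1 = (1 - lam) * lower + lam * upper
            f1 = f(x1)
        if abs(upper - lower) <= tol:
            return (lower + upper) / 2, iteration
    raise RuntimeError("maximum number of iterations reached")

# Usage on a simple parabola with its minimum at t = 0.5.
t_min, used = golden_mean_min(lambda t: t * (t - 1), 0.0, 2.0)
print(f"minimum near t = {t_min:.3f} after {used} iterations")
```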


Example 9.1
A solar panel is connected to a house that is also connected to the city's power grid.
When the house consumes more power than can be generated by the solar
panel it draws from the city, and when it consumes less the extra power is fed
into the city's power grid. The power consumption of the house over time has
been modelled as $P(t) = t(t - 1)$, where a positive value is extra power
generated by the house and a negative value is power drain from the city.
Find the maximum amount of power the house will need from the city over
the time interval [0, 2] to an absolute error of less than 0.01 kW.
Solution
Begin by noting that Eq. (9.11) gives:
$$0.01 = 2 \times 0.6180^n$$
$$n \approx 8$$
In other words, the solution should be found at the eighth iteration of the
golden-mean method.
The first two middle points computed from Eq. (9.9) are:
$$x_{00} = 0.6180 \times 0 + 0.3820 \times 2 = 0.76393$$
$$x_{01} = 0.3820 \times 0 + 0.6180 \times 2 = 1.2361$$
The power consumption can then be evaluated from the model at those two
points:
$$P(x_{00}) = 0.76393(0.76393 - 1) = -0.18034$$
$$P(x_{01}) = 1.2361(1.2361 - 1) = 0.29180$$
Since this is a minimization problem, the rule of Eq. (9.7) applies, and the
upper bound is replaced by $x_{01}$. The absolute error after this first iteration is
$|0 - 1.2361| = 1.2361$. At the second iteration, the new middle points are:
$$x_{10} = 0.6180 \times 0 + 0.3820 \times 1.2361 = 0.47214$$
$$x_{11} = 0.3820 \times 0 + 0.6180 \times 1.2361 = 0.76393$$
Notice that $x_{11}$ is exactly the same as $x_{00}$; this was expected from the earlier
explanations, and as a result that middle point does not need to be
re-evaluated: its value can simply be carried over from the previous iteration.
The other middle point does need to be evaluated:
$$P(x_{10}) = 0.47214(0.47214 - 1) = -0.24922$$
Once again, using the update rule of Eq. (9.7), the upper bound is the one that
is updated. The absolute error is now 0.76393. The table below summarizes
all the iterations needed for this example, and the following figure illustrates
the function and the decreasing size of the interval around the optimum as the
iterations increase, from darker to lighter shade.
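| i | x_iL | x_iU | x_i0 | x_i1 | P(x_i0) | P(x_i1) | E_i |
|---|---------|---------|---------|---------|----------|----------|---------|
| 0 | 0 | 2 | 0.76393 | 1.2361 | -0.18034 | 0.29180 | 1.2361 |
| 1 | 0 | 1.2361 | 0.47214 | 0.76393 | -0.24922 | -0.18034 | 0.76393 |
| 2 | 0 | 0.76393 | 0.29180 | 0.47214 | -0.20665 | -0.24922 | 0.47213 |
| 3 | 0.29180 | 0.76393 | 0.47214 | 0.58359 | -0.24922 | -0.24301 | 0.29179 |
| 4 | 0.29180 | 0.58359 | 0.40325 | 0.47214 | -0.24064 | -0.24922 | 0.18034 |
| 5 | 0.40325 | 0.58359 | 0.47214 | 0.51471 | -0.24922 | -0.24978 | 0.11145 |
| 6 | 0.47214 | 0.58359 | 0.51471 | 0.54102 | -0.24978 | -0.24832 | 0.06888 |
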
[Figure: power as a function of time over the interval [0, 2], with the search interval of each iteration shaded from darker to lighter.]

At the beginning of iteration 7 (the eighth one, as predicted), the minimum
is known to be in [0.47214, 0.54102] with an absolute error of 0.06888, and
one middle point at 0.51471 carries over from iteration 6. That middle point
has the lowest evaluated value, at P(0.51471) = -0.24978 kW, and can be
used as the minimum without any further function evaluations.

9.3 Newton's Method

It is well known that the optimum of a function f(x) is a stationary point where its
derivative $f^{(1)}(x) = 0$. This means that an optimization method for f(x) is the same as
a root-finding method for $f^{(1)}(x)$, and any of the root-finding methods learned in
Chap. 8 could be used. Most interestingly, if the first and second derivatives of the
function are known, it is possible to use Newton's method, the most efficient of the
root-finding methods learned. Recall from Sect. 8.6 that the equation for Newton's
method to find a root of f(x) is:

$$x_{i+1} = x_i - \frac{f(x_i)}{f^{(1)}(x_i)} \tag{9.12}$$

Then the equation to find a root of $f^{(1)}(x)$, an optimum of f(x), is simply:

$$x_{i+1} = x_i - \frac{f^{(1)}(x_i)}{f^{(2)}(x_i)} \tag{9.13}$$

As was proven in Chap. 8 using Taylor series, this method will iteratively converge
towards the nearest root of $f^{(1)}(x)$ at a quadratic rate $O(h^2)$.
One issue is that there is no indication in Eq. (9.13) as to whether this root will
be a maximum or a minimum of f(x); the root of the derivative only indicates that it
is an optimum. Real-world functions will usually have both maxima and minima,
and a problem will require finding a specific one of the two, not just the nearest
optimum regardless of whether it is a maximum or a minimum. One way of
checking if the function is converging on a maximum or a minimum is of course
to evaluate $f(x_i)$ and see if the values are increasing or decreasing. However, this
will require additional function evaluations, since evaluating f(x) is not needed for
Newton's method in Eq. (9.13), as well as a memory of one past value $f(x_{i-1})$ to
compare $f(x_i)$ to. To avoid these added costs in the algorithm, another way of
checking using only information available in Eq. (9.13) is to consider the sign of
the second derivative at the final value $x_i$. If $f^{(2)}(x_i) < 0$ the optimum is a maximum,
and if $f^{(2)}(x_i) > 0$ then the optimum is a minimum. If the method is found to have
converged to the wrong type of optimum, then the only solution is to start over from
another, more carefully chosen initial point.
The same three halting conditions seen for Newton's method in Chap. 8 still
apply. To review, if the relative error between two successive approximations is
less than a predefined threshold, then the iterative algorithm has converged
successfully. If however a preset maximum number of iterations is reached first,
then the method has failed to converge. Likewise, if the evaluation of the second
derivative gives $f^{(2)}(x_i) = 0$, then the point generated is in a discontinuity of
$f^{(1)}(x)$ and the method cannot continue. The pseudocode for Newton's optimization
method is given in Fig. 9.5; it can be seen that it is only a minor modification of the
code of Newton's root-finding method presented in the previous chapter, to replace the
function calls F(x) and Derivative(F(x)) with Derivative(F(x)) and
Derivative(Derivative(F(x))) respectively. On that point, notice that
the second derivative of a function is computed simply by calling the Derivative
function twice in a row. In essence, Newton's optimization method could be
implemented using exactly the code of Newton's root-finding method, but by
calling it with the derivative of the function f(x) instead of the function itself.

x ← Input initial approximation of the optimum
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
IterationCounter ← 0
WHILE (TRUE)
    PreviousValue ← x
    x ← x - [CALL Derivative(F(x))] / [CALL Derivative(Derivative(F(x)))]
    IF ( [CALL Derivative(F(x))] = 0 )
        RETURN Success, x
    ELSE IF ( [CALL Derivative(Derivative(F(x)))] = 0 )
        RETURN Failure
    END IF
    CurrentError ← absolute value of [ (x - PreviousValue) / x ]
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION
FUNCTION Derivative(F(x))
    RETURN evaluation of the derivative of the function at point x
END FUNCTION

Fig. 9.5 Pseudocode of Newton's method
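
As a hedged illustration of Eq. (9.13) and of the sign check on the second derivative described above, the iteration can be sketched in Python. The function name `newton_optimize` and its arguments `d1` and `d2` (callables for the first and second derivatives, which are assumed to be available analytically) are names chosen here for illustration only. Unlike the pseudocode, this sketch checks for a zero second derivative before dividing, and also reports the type of optimum found.

```python
def newton_optimize(d1, d2, x, tolerance=1e-6, max_iterations=50):
    """Find a root of the first derivative d1 using Newton's method.

    Returns the approximation of the optimum and its type, judged from
    the sign of the second derivative d2 at the final point.
    """
    for _ in range(max_iterations):
        second = d2(x)
        if second == 0:
            raise ZeroDivisionError("second derivative is zero; cannot continue")
        previous = x
        x = x - d1(x) / second  # Eq. (9.13)
        # Relative error test, written without dividing by x
        if d1(x) == 0 or abs(x - previous) <= tolerance * abs(x):
            curvature = d2(x)
            if curvature < 0:
                return x, "maximum"
            if curvature > 0:
                return x, "minimum"
            return x, "undetermined"
    raise RuntimeError("maximum number of iterations reached")
```

If the returned type is not the one required by the problem, the search would be restarted from a different initial point, as explained above.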
Example 9.2
A sudden electrical surge is known to cause a nearly one-second-long power
spike in an electrical system. The behavior of the system during the spike has
been studied, and during that event the power (in kW) is modelled as:
$$P(t) = \sin(t) - t^5$$
Determine the maximum power that the system must be designed to handle
during the electrical surge, to a relative error of 0.01 %.
Solution
The first and second derivatives of the model are:
$$P'(t) = \cos(t) - 5t^4$$
$$P''(t) = -\sin(t) - 20t^3$$

For an initial value, since none is provided but the spike is said to be
one second long, the search could start in the middle of the interval, at
$t_0 = 0.5$. In that case, Eq. (9.13), using the model's derivatives, gives:
$$t_1 = 0.5 - \frac{0.565}{-2.979} = 0.6900\ \text{s}$$
$$E_1 = \left|\frac{0.5 - 0.6900}{0.6900}\right| = 27.501\,\%$$


The next iterations are given in the table below, and illustrated in the
accompanying figure along with the power function:
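| i | t_i (s) | P(1)(t_i) | P(2)(t_i) | t_i+1 (s) | E_i (%) |
|---|---------|-----------|-----------|-----------|---------|
| 0 | 0.5 | 0.565 | -2.979 | 0.6900 | 27.501 |
| 1 | 0.690 | -0.360 | -7.197 | 0.6397 | 7.813 |
| 2 | 0.640 | -0.035 | -5.832 | 0.6337 | 0.945 |
| 3 | 0.634 | -0.0005 | -5.682 | 0.6336 | 0.013 |
| 4 | 0.634 | -8 × 10^-8 | -5.680 | 0.6336 | 0.000 |
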
[Figure: power as a function of time during the spike, with the successive approximations of the optimum.]

The method converges in five iterations on t = 0.6336 s, at which point the
power spike is 0.4899 kW. For reference the real optimum is at t = 0.63361673 s
with a power spike of 0.489938055 kW. The relative error of the method
compared to the real result is thus less than 2.3 × 10^-6 % on the time of the
spike peak and 5.1 × 10^-8 % on the power at the maximum the system must
be able to handle. Note as well that the second derivative of the function is
negative at the final result, and in fact throughout the iterations, confirming
that the method is converging on a maximum and not on a minimum.

9.4 Quadratic Optimization

It has been observed several times already that an optimum in a function f(x) is a
turning point where the function turns around. Locally, the region around the turning
point could be approximated as a degree-2 polynomial, a parabola p(x). As was learned in
Chap. 5, all that is required for this is to be able to evaluate three points on the
function to interpolate the polynomial from. The situation is illustrated in Fig. 9.6.
The equation for a degree-2 polynomial is:

$$p(x) = c_0 + c_1 x + c_2 x^2 \tag{9.14}$$

where p(x) = f(x) at three points $x = x_{i-2}$, $x_{i-1}$, and $x_i$. Given this information,
Chap. 5 has covered several methods to discover the values of the coefficients,
such as solving the matrix-vector system of the Vandermonde method:

$$\begin{bmatrix} 1 & x_{i-2} & x_{i-2}^2 \\ 1 & x_{i-1} & x_{i-1}^2 \\ 1 & x_i & x_i^2 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} f(x_{i-2}) \\ f(x_{i-1}) \\ f(x_i) \end{bmatrix} \tag{9.15}$$

If the interpolated parabola serves as a local approximation of the turning
point of the function, then the optimum of the parabola can serve as an approximation
of the optimum of the function. This is of great advantage, since the
optimum of the parabola is also a lot easier to find. Once the equation of the
parabola has been interpolated, its optimum is simply the point where its derivative
is zero:

$$p^{(1)}(x) = c_1 + 2c_2 x = 0 \quad\Rightarrow\quad x = -\frac{c_1}{2c_2} \tag{9.16}$$

The quadratic optimization method is an iterative version of this approach. As a
new approximation of the optimum is computed at each iteration, it replaces the
oldest of the three points used in the interpolation. The region covered by the
interpolation becomes iteratively a smaller region around the turning point, the
interpolated polynomial thus becomes a better approximation of the function in
that region, and the optimum of the parabola becomes closer to the function's
optimum.

Fig. 9.6 The optimum of a function (solid blue line) and an interpolated parabola (dashed red line)

For the iterative version of this approach, given three past approximations of the
optimum $x_{i-2}$, $x_{i-1}$, and $x_i$, it is possible to interpolate this iteration's polynomial
$p_i(x)$. Then, the new approximation of the optimum is computed as:

$$x_{i+1} = -\frac{c_{1,i}}{2c_{2,i}} \tag{9.17}$$

or, integrating Eq. (9.15) in, as:

$$x_{i+1} = \frac{f(x_{i-2})\left(x_{i-1}^2 - x_i^2\right) + f(x_{i-1})\left(x_i^2 - x_{i-2}^2\right) + f(x_i)\left(x_{i-2}^2 - x_{i-1}^2\right)}{2f(x_{i-2})\left(x_{i-1} - x_i\right) + 2f(x_{i-1})\left(x_i - x_{i-2}\right) + 2f(x_i)\left(x_{i-2} - x_{i-1}\right)} \tag{9.18}$$

This new point replaces $x_{i-2}$ to compute $p_{i+1}(x)$ in the next iteration. There are three
halting conditions to the iterative algorithm. If the relative error between two
successive approximations of the optimum is less than a preset threshold, then
the algorithm has successfully converged. On the other hand, if the algorithm first
reaches a preset maximum number of iterations, it has failed. There is a second
failure condition to watch out for: if the interpolated polynomial becomes a degree-1
polynomial, a straight line, then the algorithm has diverged and is no longer in the
region of the turning point at all. From Eq. (9.17), it can be seen that in that
case the equation would have a division by zero, a sure sign of divergence.
The pseudocode of a version of the quadratic optimization method using the
matrix-vector system of Eq. (9.15) and including the additional failure condition
check is presented in Fig. 9.7.
The convergence rate for this method is $O(h^{1.497})$, although the proof is outside
the scope of this book. This method thus converges more efficiently than the
golden-mean method, which is normal when comparing an open method like this
one to a closed method that must maintain brackets. On the other hand it converges
more slowly than Newton's method. Again, this was to be expected: Newton's
method uses actual features of the function, namely its first and second derivatives,
to find the optimum, while this method uses an interpolated approximation of the
function to do it, and therefore cannot get as close at each iteration.
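
The iteration can be sketched in Python using the closed-form update of Eq. (9.18) instead of solving the Vandermonde system at each step. This is an illustrative sketch under stated assumptions: the name `quadratic_optimize` and its parameters are chosen here, the three initial points are passed from oldest to newest, and a zero denominator is treated as the straight-line failure condition.

```python
def quadratic_optimize(f, x_a, x_b, x_c, tolerance=1e-4, max_iterations=50):
    """Iterate Eq. (9.18); x_a is the oldest point and x_c the newest."""
    f_a, f_b, f_c = f(x_a), f(x_b), f(x_c)
    for _ in range(max_iterations):
        numerator = (f_a * (x_b ** 2 - x_c ** 2)
                     + f_b * (x_c ** 2 - x_a ** 2)
                     + f_c * (x_a ** 2 - x_b ** 2))
        denominator = 2 * (f_a * (x_b - x_c)
                           + f_b * (x_c - x_a)
                           + f_c * (x_a - x_b))
        if denominator == 0:
            # The interpolated polynomial is a straight line
            raise ZeroDivisionError("interpolated parabola is degenerate")
        x_new = numerator / denominator
        # Drop the oldest point and keep the two most recent ones
        x_a, f_a = x_b, f_b
        x_b, f_b = x_c, f_c
        x_c, f_c = x_new, f(x_new)
        # Relative error between the two latest approximations
        if abs(x_c - x_b) <= tolerance * abs(x_c):
            return x_c
    raise RuntimeError("maximum number of iterations reached")
```

Only one new function evaluation is needed per iteration, since the two retained points carry their values over.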
Example 9.3
A sudden electrical surge is known to cause a one-second-long power spike in
an electrical system. The behavior of the system during the spike has been
studied, and during that event the power (in kW) is modelled as:
$$P(t) = \sin(t) - t^5$$
Determine the maximum power the system must be designed to handle
during the electrical surge, to a relative error of 0.01 %. Use the samples at
t = 0, 0.5, and 1 s as initial values.
Solution
Begin by evaluating the power at the three initial points:
$$t_{-2} = 0\ \text{s} \qquad P(t_{-2}) = 0\ \text{kW}$$
$$t_{-1} = 0.5\ \text{s} \qquad P(t_{-1}) = 0.448\ \text{kW}$$
$$t_0 = 1\ \text{s} \qquad P(t_0) = -0.159\ \text{kW}$$

From there, the Vandermonde method can be used to interpolate the
parabola:
$$\begin{bmatrix} 1 & 0 & 0 \\ 1 & 0.5 & 0.25 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} c_{0,0} \\ c_{1,0} \\ c_{2,0} \end{bmatrix} = \begin{bmatrix} 0 \\ 0.448 \\ -0.159 \end{bmatrix}$$
$$p_0(x) = 0 + 1.951x - 2.110x^2$$
The optimum of this parabola, the first approximation of the optimum computed
by the method, is then obtained from the derivative of the parabola, as
given in Eq. (9.17):
$$t_1 = -\frac{1.951}{2 \times (-2.110)} = 0.4624\ \text{s}$$

Alternatively, Eq. (9.18) could be used to compute the approximation directly
without solving the Vandermonde system:
$$t_1 = \frac{0(0.25 - 1) + 0.448(1 - 0) - 0.159(0 - 0.25)}{0(0.5 - 1) + 0.896(1 - 0) - 0.318(0 - 0.5)} = 0.4624\ \text{s}$$

Either way, the relative error on this approximation is:
$$E_0 = \left|\frac{1 - 0.462}{0.462}\right| = 116.249\,\%$$

This is far higher than the required error, and the algorithm continues. The
next iterations are given in the table below, and illustrated on the accompanying
figure along with the power function:
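| i | t_i-2 (s) | t_i-1 (s) | t_i (s) | t_i+1 (s) | E_i (%) |
|---|-----------|-----------|---------|-----------|---------|
| 0 | 0.0000 | 0.5000 | 1.0000 | 0.4624 | 116.249 |
| 1 | 0.5000 | 1.0000 | 0.4624 | 0.5719 | 19.136 |
| 2 | 1.0000 | 0.4624 | 0.5719 | 0.5849 | 2.237 |
| 3 | 0.4624 | 0.5719 | 0.5849 | 0.6538 | 10.527 |
| 4 | 0.5719 | 0.5849 | 0.6538 | 0.6342 | 3.088 |
| 5 | 0.5849 | 0.6538 | 0.6342 | 0.6329 | 0.207 |
| 6 | 0.6538 | 0.6342 | 0.6329 | 0.6336 | 0.117 |
| 7 | 0.6342 | 0.6329 | 0.6336 | 0.6336 | 0.000 |
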
[Figure: power as a function of time during the spike, with the successive approximations of the optimum.]

The method converges in eight iterations on t = 0.6336 s, the same optimum
that Newton's method found in five iterations in Example 9.2. Having to
carry three bad initial guesses for several iterations slows down the initial
convergence; in iteration 3, the first one computed using only approximations
computed in previous iterations, the result shows a large jump in accuracy,
from a relative error of 7.8 % compared to the real optimum to 3.1 %.

9.5 Gradient Descent

The gradient descent optimization method is also known by its more figurative
name of "hill climbing". It has been described as what you would do if you needed to
find the top of Mount Everest with amnesia in a fog. What would this unfortunate
climber, unable to remember where they've been or to see where they are going,
do? Simply feel around the ground one step in every direction to find the one that
goes up the fastest, and proceed along that way step by step. Once the climber has
reached a point where the ground only goes down in all directions, they can assume
they have reached the top of the mountain. In mathematical terms, the direction of
the step that gives the greatest change (be it increase or decrease) in the value of a
function is called its gradient, and the basic idea of the gradient descent method is
simply to take step after step along the gradient until a point is reached where no
step can be taken to improve the value.
The gradient descent is different from the other methods covered in this chapter
so far by the fact that it is a multidimensional optimization method, instead of a
one-dimensional one. In fact, as will be shown, it makes it possible to reduce the
multidimensional optimization problem into a one-dimensional problem of optimizing
the step size along the gradient that optimizes the multidimensional
function.


PreviousValue2, PreviousValue, x ← Input three initial approximations of
                                   the optimum
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
IterationCounter ← 0
WHILE (TRUE)
    V ← 3×3 Vandermonde matrix with the following rows:
        Row 1: 1, PreviousValue2, PreviousValue2 × PreviousValue2
        Row 2: 1, PreviousValue, PreviousValue × PreviousValue
        Row 3: 1, x, x × x
    Y ← vector of length 3 with the following values:
        CALL F(PreviousValue2), CALL F(PreviousValue), CALL F(x)
    C ← vector of length 3 that is the solution of the system V × C = Y
    IF [ (Third value of C) = 0 ]
        RETURN Failure
    END IF
    PreviousValue2 ← PreviousValue
    PreviousValue ← x
    x ← [-1 × (Second value of C)] / [2 × (Third value of C)]
    CurrentError ← absolute value of [ (x - PreviousValue) / x ]
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION

Fig. 9.7 Pseudocode of the quadratic optimization method

Assume a function y = f(x) where x is an n-dimensional vector and y is a scalar,
an initial point x0, and a step size h. At each iteration i, the gradient descent algorithm
evaluates the function at steps of h around the current point xi, and takes one step
in the orientation that evaluates closest to the optimum (the maximum for maximization
and the minimum for minimization). The next point xi+1 is xi plus the
appropriate step. This leaves only two questions: how to determine the orientation
of the best step, and how to determine the size of the step.
As indicated earlier, the orientation of the step is the gradient, the direction of
maximum change of the function. This is the orientation that will bring the
iterations close to the optimum fastest. The gradient is the vector of partial
derivatives of f(x) with respect to each dimension of x:

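$$\nabla f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f(\mathbf{x})}{\partial x_0} \\ \dfrac{\partial f(\mathbf{x})}{\partial x_1} \\ \vdots \\ \dfrac{\partial f(\mathbf{x})}{\partial x_{n-1}} \end{bmatrix} \tag{9.19}$$
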
Knowing the direction of maximum change limits the options for the orientation of
the step from 360° around the current point to only two directions, either positively
or negatively along the line of the gradient. In fact this is the distinction between
maximization and minimization in this method: for a maximization problem the
algorithm should positively follow the gradient to increase as quickly as possible,
and for a minimization problem the algorithm should go in the negative direction to
decrease as quickly as possible. Thus, the next point $\mathbf{x}_{i+1}$ will be the current
iteration's $\mathbf{x}_i$ plus or minus one step along the gradient at that point, as so:

$$\mathbf{x}_{i+1} = \mathbf{x}_i \pm h \nabla f(\mathbf{x}_i) \tag{9.20}$$
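
When the partial derivatives of Eq. (9.19) are not available analytically, the gradient can be approximated numerically. The sketch below uses a centered divided difference for each dimension; this is a technique swapped in here for illustration rather than one prescribed by this section, and the name `numerical_gradient` and the default step of 1e-6 are assumptions.

```python
def numerical_gradient(f, x, step=1e-6):
    """Approximate the gradient of f at point x (a list of floats).

    Each partial derivative is estimated with a centered difference
    along one dimension while the other coordinates are held fixed.
    """
    gradient = []
    for k in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[k] += step
        backward[k] -= step
        gradient.append((f(forward) - f(backward)) / (2 * step))
    return gradient
```

The returned list plays the role of the gradient vector in Eq. (9.20).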

Example 9.4
Compute the gradient of this function:
f x x20 3x0 x1 x31 x2 4x23 x1 x2 x3
Solution
Following Eq. (9.19), the gradient is:
3
f x
6 x 7
0 7
6
3
2
6 f x 7
2x0 3x1
7
6
6 x 7 6 3x1 3x2 x2 x3 7
1
1 7
7
6
f x 6
6 f x 7 4
5
1 x2 x3
7
6
6 x 7
8x

x
x
3
1 2
2 7
6
4 f x 5
2

x3
The second question is how to pick the step size. After all, a value of h too large
will give a poor approximation, as the method will step over the optimum. On the
other hand, a small step will make the algorithm converge slowly. One solution is to
use an iteratively decreasing step size hi, that begins with a large value to take large
steps towards the optimum and then decreases it in order to pinpoint the optimum.


A better solution though would be to actually compute the length of the step $h_i$ to
take at each iteration to get as close as possible to the optimum; in other words, to
optimize the step size at that iteration! The optimal step size is of course the one that
will give the value of $\mathbf{x}_{i+1}$, as computed in Eq. (9.20), which will in turn allow
$f(\mathbf{x}_{i+1})$ to evaluate to its optimal value. This leads to a simple but important
realization: the only unknown value in Eq. (9.20) missing to compute the value of
$\mathbf{x}_{i+1}$ is the value of $h_i$, and as a result the value of the function evaluation
$f(\mathbf{x}_{i+1})$ will only vary based on $h_i$:

$$f(\mathbf{x}_{i+1}) = f\left(\mathbf{x}_i \pm h_i \nabla f(\mathbf{x}_i)\right) = g(h_i) \tag{9.21}$$

In other words, optimizing the value of the multidimensional function $f(\mathbf{x}_{i+1})$ is the
same as optimizing the single-variable function $g(h_i)$ obtained by the simple
variable substitution of Eq. (9.20). And a single-variable, single-dimension optimization
problem is one that can easily be done using the golden-mean method,
Newton's method, or the quadratic optimization method. The optimal value of $h_i$
that is found is the optimal step size to use in iteration i to compute $\mathbf{x}_{i+1}$;
it corresponds to the step to take to get to the local optimum of the gradient line
$\nabla f(\mathbf{x}_i)$.
To summarize, at each iteration i, the gradient descent algorithm will perform
these steps:
1. Evaluate the gradient at the current point, $\nabla f(\mathbf{x}_i)$, using Eq. (9.19).
2. Rewrite the function $f(\mathbf{x}_{i+1})$ with the variable substitution of Eq. (9.21) as
$f(\mathbf{x}_i \pm h_i \nabla f(\mathbf{x}_i))$ to get a function of $h_i$.
3. Use the method of your choice to find the value of $h_i$ that is the optimum of
$f(\mathbf{x}_i \pm h_i \nabla f(\mathbf{x}_i))$.
4. Compute the next value of $\mathbf{x}_{i+1}$ using Eq. (9.20) as $\mathbf{x}_i \pm h_i \nabla f(\mathbf{x}_i)$.
5. Evaluate termination conditions, either to terminate or continue the iterations.
The pseudocode of an algorithm implementing all these steps is presented in
Fig. 9.8. The success termination condition for this algorithm is that the Euclidean
distance between $\mathbf{x}_i$ and $\mathbf{x}_{i+1}$ becomes less than a preset error threshold, at which
point the method has converged on the optimum of the function with sufficient
accuracy. There are two failure termination conditions. The first is, as always, if the
algorithm reaches a preset maximum number of iterations without converging. The
second condition is if the gradient $\nabla f(\mathbf{x}_i)$ evaluates to a vector of zeros. From
Eq. (9.20), it can be seen that in that case, the method is stuck in place and
$\mathbf{x}_{i+1} = \mathbf{x}_i$. Mathematically, a point where the gradient is null is a plateau in the
function, a point where there is no change to the evaluation of the function in any
direction. The method cannot continue anymore since all orientations from that point are
equivalent and none improve the function at all. Note however that, since
$\mathbf{x}_{i+1} = \mathbf{x}_i$ in that situation, the Euclidean distance between $\mathbf{x}_i$ and
$\mathbf{x}_{i+1}$ will be zero, which corresponds to the success condition despite it being
actually a failure of the method. It is thus important to evaluate this condition first,
before the Euclidean distance of the success condition, to avoid potentially disastrous
mistakes in the interpretation of the results.

x ← Input vector of length n; the initial approximation of the optimum
h ← Input the initial step size
ProblemType ← Input minimization or maximization
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum step size
IterationCounter ← 0
WHILE (TRUE)
    IF (ProblemType = minimization)
        x(h) ← new function of variable h as:
               x - h × [CALL Gradient(F(x))]
    ELSE IF (ProblemType = maximization)
        x(h) ← new function of variable h as:
               x + h × [CALL Gradient(F(x))]
    END IF

    h ← optimum of [CALL F(x(h))]

    PreviousValue ← x
    x ← x(h)
    IF [ (CALL Gradient(F(x))) = zero vector ]
        RETURN Failure
    END IF
    CurrentError ← Euclidean distance between x and PreviousValue
    IterationCounter ← IterationCounter + 1
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, x
    ELSE IF (IterationCounter = IterationMaximum)
        RETURN Failure
    END IF
END WHILE

FUNCTION F(x)
    RETURN evaluation of the n-dimensional target function at point x
END FUNCTION
FUNCTION Gradient(F(x))
    RETURN vector of length n of evaluation of the partial derivatives of
           the function with respect to the n variables at point x
END FUNCTION

Fig. 9.8 Pseudocode of the gradient descent method
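
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the names `gradient_descent`, `golden_min`, and the bracket `h_max` (an assumed upper limit on the step size) are introduced here for illustration, and the step size is optimized with a simple golden-mean search on g(h) from Eq. (9.21). In this sketch the zero-gradient check is performed at the start of each iteration, before any step is taken.

```python
import math


def golden_min(g, low, high, tolerance=1e-8):
    """Minimize a single-variable function g on [low, high]."""
    phi = (math.sqrt(5) - 1) / 2
    while high - low > tolerance:
        a = phi * low + (1 - phi) * high
        b = (1 - phi) * low + phi * high
        if g(a) < g(b):
            high = b
        else:
            low = a
    return (low + high) / 2


def gradient_descent(f, gradient, x, minimize=True, tolerance=0.01,
                     max_iterations=100, h_max=2.0):
    """Follow the gradient with an optimized step size at each iteration."""
    direction = -1.0 if minimize else 1.0
    for _ in range(max_iterations):
        grad = gradient(x)
        if all(component == 0 for component in grad):
            raise ValueError("zero gradient: the method is stuck on a plateau")

        def g(h):
            # Eq. (9.21): f along the gradient line as a function of h only
            point = [xi + direction * h * gi for xi, gi in zip(x, grad)]
            return f(point) if minimize else -f(point)

        h = golden_min(g, 0.0, h_max)
        new_x = [xi + direction * h * gi for xi, gi in zip(x, grad)]  # Eq. (9.20)
        error = math.dist(new_x, x)  # Euclidean distance between iterates
        x = new_x
        if error <= tolerance:
            return x
    raise RuntimeError("maximum number of iterations reached")
```

Any of the single-variable methods of this chapter could replace `golden_min`; the bracketed search is used here only because it needs no derivative of g.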


Example 9.5
The strength of the magnetic field of a sheet of magnetic metal has been
modelled by the 2D function:
$$H(\mathbf{x}) = x^2 - 4x + 2xy + 2y^2 + 2y + 14$$
which is in amperes per meter. The origin (0, 0) corresponds to the center of
the sheet of metal. Determine the point with the weakest magnetic field, to the
nearest centimeter. Use the corner of the sheet at (4, -4) meters as a starting
point.
Solution
The first thing to do is to compute the gradient of the function using Eq. (9.19).
This gives the vector of functions:

$$\nabla H(\mathbf{x}) = \begin{bmatrix} 2x + 2y - 4 \\ 2x + 4y + 2 \end{bmatrix}$$

For the first iteration, the value of $\mathbf{x}_0$ is given as $[4, -4]^T$. Since this is a
minimization problem, Eq. (9.20) becomes:
$$\mathbf{x}_{i+1} = \mathbf{x}_i - h_i \nabla H(\mathbf{x}_i)$$
And for the first iteration, it evaluates to:

$$\mathbf{x}_1 = \begin{bmatrix} 4 \\ -4 \end{bmatrix} - h_0 \begin{bmatrix} 2(4) + 2(-4) - 4 \\ 2(4) + 4(-4) + 2 \end{bmatrix} = \begin{bmatrix} 4 \\ -4 \end{bmatrix} - h_0 \begin{bmatrix} -4 \\ -6 \end{bmatrix}$$

The next step of the iteration is the variable substitution of $\mathbf{x}_1$ to make $H(\mathbf{x}_1)$
into a function of $h_0$ that can be easily optimized. To do this, write down the
original $H(\mathbf{x})$ replacing x with $(4 + 4h_0)$ and y with $(-4 + 6h_0)$. The new
equation, which will be labelled $g(h_0)$ for convenience, is:
$$g(h_0) = (4 + 4h_0)^2 - 4(4 + 4h_0) + 2(4 + 4h_0)(-4 + 6h_0) + 2(-4 + 6h_0)^2 + 2(-4 + 6h_0) + 14 = 136h_0^2 - 52h_0 + 6$$
The step $h_0$ is the optimum of $g(h_0)$, which can be easily found using any
single-variable optimization method, or by setting the derivative of $g(h_0)$ to
zero (which is the most straightforward way to get the optimum of a degree-2
polynomial). The value is $h_0 = 0.191$. Using this step value in Eq. (9.20) gives
the next point $\mathbf{x}_1 = [4.764, -2.854]^T$. The magnetic field model and the first
iteration are represented in the figure below. The gradient line is represented
as the solid red line on the figure, starting at the initial (4, -4) point; it can be
seen visually that it is indeed a parabola and indeed follows the direction of
maximum change from (4, -4). The new point $\mathbf{x}_1$ is found at the minimum of
the gradient parabola, and in fact very near to the minimum of the magnetic
field.

The Euclidean distance between $\mathbf{x}_1$ and the initial $\mathbf{x}_0$ is 1.378 m. This is
much more than the centimeter precision required, so the iterations continue.
The table below summarizes the next steps.

| i | x_i | ∇H(x_i) | h_i | x_i+1 | E_i |
|---|-----|---------|-----|-------|-----|
| 0 | [4, -4]^T | [-4, -6]^T | 0.191 | [4.764, -2.854]^T | 1.378 |
| 1 | [4.764, -2.854]^T | [-0.176, 0.112]^T | 1.300 | [4.994, -3.006]^T | 0.276 |
| 2 | [4.994, -3.006]^T | [-0.024, -0.035]^T | 0.191 | [4.999, -2.999]^T | 0.008 |

The method converges to 8 mm of precision by the third iteration. The
optimum at that point is at (4.999, -2.999) meters, almost exactly on the real
optimum at (5, -3) meters.

9.6 Stochastic Optimization

Engineering models of complex real-world systems will usually have no one
optimum but several. Oscillating signals, repeating features, periodic events, and
the interactions of multiple independent variables moving in opposite directions, all
lead to systems that peak, drop, and peak again, repeatedly. A single one of these
peaks or drops, the optimum within a limited region of the system, is called a local
optimum. This is to differentiate it from the single greatest optimum of the entire
function, which is the global optimum. To take a simple example, every single
mountain in the Himalayas is a geographic local optimum, but Mount Everest is the
single global optimum of the mountain range. For a mathematical example, consider
the fluctuating function illustrated in Fig. 9.9. It has one global maximum and
one global minimum indicated, as well as four additional local maxima, a plateau,
and five local minima.
In such situations, a good optimization method should be able to converge on the
global optimum, not just on a local optimum. It should discover the best solution
possible. This is a problem with the methods presented so far in this chapter: the
golden-mean, quadratic, and gradient descent methods all converge on the nearest
maximum or minimum without any checks to make sure it is the global one, and
Newton's method is even worse, converging on the nearest optimum without even
checking if it is a maximum or minimum.
What would be needed for a method to discover the global optimum? The
method cannot somehow know that it has only converged on a local optimum
and that it should continue searching, since that would imply that it already knows
the value of the global optimum to compare its current result to. Instead, it has to decide
somehow to diverge away from the optimum it has found and continue searching
the function in case there is a better one elsewhere, and then to converge on that
better optimum. Such a change in behavior of the method over time, especially to
include a behavior that involves deliberately diverging away from an apparent
solution, is very different from the behavior of the methods studied so far. In fact
this is impossible for any of the methods seen so far, as they have all been designed
to behave in only one way, to always take the best step they can find, the one that
makes them converge on the nearest optimum in the fastest way possible. They are
deterministic methods: their behavior is entirely known and predictable. Given the
same function and initial values, they will always compute the same steps and
converge on the same optimum.

Fig. 9.9 Local and global maxima and minima

The alternative to a deterministic method is a stochastic method: an algorithm
that includes an element of randomness. This random element is what will allow the
method to have two different behaviors, by giving it a chance to escape local optima
but not the global optimum. This can be implemented in practice in a number of
ways, such as for example by including a random variable in an equation or by
having a decision step that includes an element of chance. It should be noted that a
random element is not necessarily one whose value is a result of complete and
unbiased chance, such as a lottery draw. Rather, it is simply a term whose result is
not known for sure in advance. It is completely acceptable to skew the probabilities
towards a preferred outcome, or to change the probabilities as the algorithm progresses to reduce the impact of randomness over the iterations. A flip of a coin with
a 99 % probability of landing on heads is still a stochastic event: even though one
outcome is more likely than the other, it is still a result of chance.
Using randomness in optimization methods eliminates some of the certainty
offered by deterministic algorithms. As mentioned already, a stochastic method is
no longer guaranteed to converge on the nearest local optimum, which can be a
desirable feature. However, this should not be mistaken for a certainty to converge
on the global optimum; stochastic methods can make no such guarantee. Given the
use of randomness in their algorithms and decision-making, no outcome can be
certain or guaranteed. Another important point to keep in mind is that running the
same stochastic method twice on the same function with the same initial values can
lead to two very different final results, unlike with deterministic methods, which
guarantee the exact same result both times. The reason for this is of course the
inclusion of a random element in the algorithm, which can take very different
values in successive runs.
Stochastic optimization algorithms are an intense area of ongoing research.
Dozens of algorithms already exist, and new algorithms, variations, and
enhancements are being proposed every year. Popular algorithms include genetic
algorithms, ant colony algorithms, particle swarm algorithms, and many others.
A complete review of these methods is beyond the scope of this book. The next two
sections will introduce two stochastic optimization methods, to give an overview of
this class of optimization methods.

9.7 Random Brute-Force Optimization

The random brute-force search is the simplest stochastic search method available.
However, despite its inefficiency, it remains a functional and useful tool. Moreover,
its simplicity makes it a good method to study as an initiation to stochastic
optimization.
A brute-force search refers to any search algorithm that systematically tries
possible solutions one after the other until it finds one that is acceptable or until a
preset maximum number of attempts is reached. A brute-force optimization algorithm would
thus simply evaluate value after value for a given time, and return the value with the
optimal result as its solution at the end. A random brute-force search is one that
selects the values to evaluate stochastically.
While the random brute-force search may seem unsophisticated, it does have the
advantage of being able to search any function, even one that has a complex and
irregular behavior, multiple local optima, and even discontinuities. By trying points
at random and always keeping the optimal one, it is likely to get close to the global
optimum and certain not to get stuck in a local optimum. The random brute-force
search can be useful to deal with black box problems, when no information is
available on the behavior of the function being optimized. The method makes no
assumptions on the function and does not require a starting point, an interpolation or
a derivative; it only needs an interval to search in. It can therefore perform its search
and generate a good result in a situation of complete ignorance.
However, when this method does get a point close to the global optimum, it does
not improve on it except by possibly randomly generating an even closer point.
In other words, while the random brute-force approach is likely to find a point close
to the global optimum, it is very unlikely to actually find the global optimum itself.
For that reason, the algorithm is often followed by a few iterations of a deterministic
algorithm such as Newton's method, which can easily and quickly converge to the
global optimum from the starting point found by the brute-force search.
It should be intuitively clear that testing more points increases the algorithm's
odds of getting closer to the optimum. However, even that rule of thumb is not a
certainty given the stochastic nature of the algorithm. It could easily be the case that
in one short run, the algorithm fortuitously generates a point very close to the
optimum while in another much longer run, the algorithm is a lot unluckier and does
not get as close. This is one of the risks of working with stochastic algorithms.
The iterative algorithm of the random brute-force search is straightforward: at
each iteration, generate a random value within the search interval and evaluate it.
Compare the result to the best value discovered so far in previous iterations.
If the new result is better, keep it as the new best value; otherwise, discard
it. This continues until the one and only termination condition, that a maximum
number of iterations is reached.
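The iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the book's implementation: the names `random_brute_force` and `signal` are hypothetical, and `signal` assumes that the two-sinusoid function of Example 9.6 reads s(x, y) = exp(-y^2) cos(3x) + exp(-x^2) cos(3y).

```python
import math
import random


def signal(x, y):
    # Assumed reading of the signal of Example 9.6 (a reconstruction, not verified).
    return math.exp(-y**2) * math.cos(3 * x) + math.exp(-x**2) * math.cos(3 * y)


def random_brute_force(f, bounds, max_iterations, rng, maximize=True):
    """Evaluate uniformly random points inside `bounds` and keep the best one."""
    best_point, best_value = None, None
    for _ in range(max_iterations):
        # Generate a random value within the search interval of each variable.
        point = tuple(rng.uniform(low, high) for low, high in bounds)
        value = f(*point)
        # Compare with the best value discovered so far; keep it only if better.
        if best_value is None:
            is_better = True
        elif maximize:
            is_better = value > best_value
        else:
            is_better = value < best_value
        if is_better:
            best_point, best_value = point, value
    # The only termination condition is the maximum number of iterations.
    return best_point, best_value


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only so that a run can be repeated
    point, value = random_brute_force(signal, [(-6, 6), (-6, 6)], 5000, rng)
    print("best point:", point, "value:", value)
```

Passing the random generator explicitly is a design choice of this sketch: with an unseeded generator, two runs with the same arguments will usually return different points, which is the behaviour discussed in the previous section.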


Example 9.6
A two-dimensional periodic signal s(x,y) is generated by the combination of
two sinusoids, modelled by this equation:

$$s(x, y) = e^{-y^2}\cos(3x) + e^{-x^2}\cos(3y)$$

The emitter is hidden somewhere in a 12 km × 12 km field, marked from -6
to 6 km in each direction. Determine where the emitter is.
Solution
The emitter will be at the position where the signal is strongest, so this is a
maximization problem. However, a combination of periodic signals such as
this one will have multiple optima, making the search difficult. A visual
plotting of the function in the interval, shown below, confirms that there are
multiple local optima surrounding the global optimum where a deterministic
optimization method could get stuck, as well as four large plateaux where the
evaluation of the function is constant and deterministic optimization methods
cannot work. This visual inspection also shows that the optimum is found at
(0, 0) and evaluates to 2.0.

A stochastic optimization method can work well in this case, however. To
illustrate, ten separate runs of the random brute-force search algorithm were
performed, by increasing each time the number of random points generated
and evaluated by 1000. The optimal point discovered in each run of the
algorithm is listed in the following table. From the results shown, it can be
seen that the algorithm never returned one of the local maxima and always got
close to the global maximum. Although it never actually found the maximum,
in some runs it got extremely close.
Number of points    Optimum found       Optimum value
1000                (0.0284, 0.1005)    1.6886
2000                (0.0219, 0.0730)    1.9453
3000                (0.0090, 0.0459)    1.9581
4000                (0.0317, 0.0011)    1.9410
5000                (0.0067, 0.0161)    1.8775
6000                (0.0137, 0.0010)    1.9999
7000                (0.0064, 0.0005)    1.9922
8000                (0.0098, 0.0149)    1.9909
9000                (0.0108, 0.0065)    1.9816
10,000              (0.0245, 0.0159)    1.9981

The table also shows that, on average, checking more points leads to a
better result: the five runs of 6000 points or more all returned better maxima
than the five runs with 5000 points or less. However, this relationship is not
perfect. The maximum found in the run with 6000 points is the single best one
of all ten runs, better even than the maximum found in the run with 10,000
points, while the one found in the run at 5000 points is second-worst, better
only than the one found in 1000 points. This illustrates nicely one of the
important differences between stochastic and deterministic optimization
mentioned earlier. In a deterministic algorithm, more iterations will always
lead to a better result (unless the method diverges), while in a stochastic
search, more iterations will on average, but not necessarily, improve the result.

9.8 Simulated Annealing

Annealing is a metallurgical process used to temper metals through a heating and
cooling treatment. The weaknesses in the metal that are eliminated by annealing are
the result of atomic irregularities in the crystalline structure of the metal.
These irregularities are due to atoms being stuck in the wrong place of the structure.
In the process of annealing, the metal is heated up and then allowed to cool down
slowly. Heating up gives the atoms the energy they need to get unstuck, and the slow
cool-down period allows them to move to their correct location in the structure.
Annealing can be seen as a multiple-optima optimization problem. A weakness in
the metal is due to an atom having converged on a local optimum in the metal's
crystalline structure. Heating the metal gives that atom the ability to escape the local
optimum, and the slow cool-down period allows it to converge on its global optimum.
Simulated annealing is a stochastic optimization method based on the annealing
process and on the gradient descent method studied previously. In fact, many
stochastic optimization methods take their inspiration from natural phenomena;
genetic algorithms and ant colony algorithms are two more popular examples.


The parallel between real-world annealing and simulated annealing is straightforward: in one case an atom moves towards an optimal position in the crystal while
avoiding getting stuck in attractive but suboptimal positions, and in the other steps
are taken on a function to find the global optimum while avoiding getting trapped
in a local optimum. But the real insight comes by studying how to escape the local
optimum. In annealing, this is done by heating the metal to give it energy. When
the metal is hot and the atoms are energized, they are more likely to move out of
the local optimum (a high-energy movement), and as the metal cools down over
time the atoms are less energized and more likely to simply converge on
the nearest (hopefully global) optimum. This can be simulated by using a temperature parameter that starts off at a high value and decreases iteratively.
This temperature is directly related to the probability of a bad step (one that causes
the value of the function to become less optimal) being accepted. At the higher
initial value, bad moves are accepted more often and steps are taken away from the
local optimum, and at the lower temperature of later iterations, bad steps are
unlikely to be accepted and the method converges.
An iteration of the simulated annealing algorithm is thus:
1. Select a step $h_i$ in a random orientation around $x_i$ to generate $x_{i+1}$. Compute $\Delta f_i$, the
   difference in function evaluation between $f(x_i)$ and $f(x_{i+1})$, defined as:

   $$\Delta f_i = f(x_i) - f(x_{i+1}) \qquad (9.22)$$

   for maximization problems and:

   $$\Delta f_i = f(x_{i+1}) - f(x_i) \qquad (9.23)$$

   for minimization problems. Either way, the value of $\Delta f_i$ will be negative if the
   step brings the method closer to an optimum and positive if it moves away from
   it, and the magnitude of the value will be proportional to the significance of the
   change.
2. If the value of $\Delta f_i$ is negative, accept the step to $x_{i+1}$.
3. If the value of $\Delta f_i$ is positive, compute a probability of accepting the step based
   on the current temperature parameter value $T_i$:

   $$P = e^{-\Delta f_i / T_i} \qquad (9.24)$$

4. Reduce the temperature and step size for the next iteration.
5. Terminate the search if the termination condition is reached, which is that $T_{i+1} \le 0$.
These steps are implemented in the pseudocode of Fig. 9.10.
The stochastic behavior of the method thus comes from step 3, where a step that
worsens the value of the function might or might not be accepted based on a
probability P. Equation (9.24) shows that P depends on two variables: the change
in value of the step $\Delta f_i$, so that steps with a weak negative impact are more likely to
be accepted than steps that massively worsen results, and the temperature $T_i$, so that
a bad move is more likely to be accepted at the beginning of the method when the
temperature is high than at the end when the temperature is low.

Optimum ← Input initial approximation of the optimum
Temperature ← Input initial temperature
DeltaTemperature ← Input iterative temperature decrease value
h ← Input initial neighbourhood size
DeltaH ← Input iterative neighbourhood size decrease value
ProblemType ← Input minimization or maximization

WHILE (TRUE)
    x ← random neighbour of Optimum at distance h
    IF (ProblemType = minimization)
        DeltaF ← [CALL F(x)] - [CALL F(Optimum)]
    ELSE IF (ProblemType = maximization)
        DeltaF ← [CALL F(Optimum)] - [CALL F(x)]
    END IF
    IF (DeltaF < 0)
        Optimum ← x
    ELSE
        Threshold ← Random value between 0 and 1
        P ← Exponential of ( -1 × DeltaF / Temperature )
        IF (Threshold < P)
            Optimum ← x
        END IF
    END IF
    Temperature ← Temperature - DeltaTemperature
    h ← h - DeltaH
    IF (Temperature <= 0)
        RETURN Success, Optimum
    END IF
END WHILE

FUNCTION F(x)
    RETURN evaluation of the n-dimensional target function at point x
END FUNCTION

Fig. 9.10 Pseudocode of the simulated annealing method
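The steps of the method can also be written as a short Python function. The sketch below is one possible transcription for a function of two variables, under stated assumptions: the name `simulated_annealing` and its parameters are hypothetical, the temperature and step size decrease linearly as in Fig. 9.10, the initial temperature is assumed positive, and the parameters are assumed to be chosen so that the step size stays positive until the temperature reaches zero.

```python
import math
import random


def simulated_annealing(f, start, temperature, delta_temperature,
                        h, delta_h, rng, maximize=False):
    """Simulated annealing with linear cooling for a function f(x, y)."""
    current = start
    while temperature > 0:
        # Step 1: random neighbour at distance h, in a random direction.
        angle = rng.uniform(0.0, 2.0 * math.pi)
        candidate = (current[0] + h * math.cos(angle),
                     current[1] + h * math.sin(angle))
        # Eqs. (9.22) and (9.23): delta_f is negative for an improving step.
        if maximize:
            delta_f = f(*current) - f(*candidate)
        else:
            delta_f = f(*candidate) - f(*current)
        # Steps 2 and 3: always accept an improvement; otherwise accept with
        # probability exp(-delta_f / temperature), as in Eq. (9.24).
        if delta_f < 0 or rng.random() < math.exp(-delta_f / temperature):
            current = candidate
        # Step 4: reduce the temperature and the step size.
        temperature -= delta_temperature
        h -= delta_h
    # Step 5: the loop ends once the temperature is no longer positive.
    return current


if __name__ == "__main__":
    def bowl(x, y):
        # Hypothetical test function with a single minimum at (1, -2).
        return (x - 1.0) ** 2 + (y + 2.0) ** 2

    result = simulated_annealing(bowl, (5.0, 5.0), 1.0, 0.001, 0.5, 0.0004,
                                 random.Random(0))
    print("final point:", result, "value:", bowl(*result))
```

The function returns the last accepted point rather than the last candidate generated, since the final candidate may have been rejected.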
Simulated annealing has the advantage of being able to explore a complex
solution space and to escape local optima. The method used to explore the solution
space in step 1 is simple and only requires that it be possible to numerically evaluate
and compare two possible solutions. For that reason, simulated annealing is also
very good at optimizing complex problems, including problems where the optimum
depends on multiple interdependent variables and where the optimum is found by
maximizing certain variables while minimizing others.


Example 9.7
A two-dimensional periodic signal s(x,y) is generated by the combination of
two sinusoids, modelled by this equation:

$$s(x, y) = e^{-y^2}\cos(3x) + e^{-x^2}\cos(3y)$$

The emitter is hidden somewhere in a 12 km × 12 km field, marked from -6
to 6 km in each direction. Determine where the emitter is.
Solution
Example 9.6 has already explored this function, and demonstrated how the
solution space is too irregular for a deterministic optimization method. The
random brute-force optimization method found points very near the optimum
at (0, 0), but by the pure random chance of generating a point to try near the
optimum, without any method to the search. A simulated annealing method
could be used instead, to actually search the space step by step. The following
figures show in white the path followed by this method, which, given the
stochastic nature of the method, is only one of the countless possible paths it
could randomly take.

It can be seen that the method moves randomly through the space, and
visits six of the local optima and three of the plateaux, before finally finding
the global optimum. For each of the local optima, the method eventually steps
away, quickly earlier in the iterations or after a longer exploration later in the
iterations. Being able to explore and eventually leave the plateaux is another
advantage of simulated annealing over other methods; Newton's method,
quadratic optimization, and the gradient descent would all fail and terminate
if they reached a constant region of the function they were optimizing.

9.9 Engineering Applications

Engineering design is constrained by reality, both physical (the laws of nature that
dictate the performance limits of their systems) and economic (the need to keep
costs and resource consumption down). In that sense, engineers are constantly
confronted by optimization problems, to get the most out of systems within the
limits of their constraints. The fuel tank design problem of Sect. 9.1 was a
simple illustration of that common problem: the design had to minimize the
tank's surface area and cost while respecting the system's requirements in terms
of shape and volume. It is nonetheless representative of real-world problems; a
similar optimization challenge led to the selection of the cylindrical 330 mL
soft-drink can as the most cost-efficient design. Optimization problems are also
encountered elsewhere in engineering practice, whenever conflicting requirements and
constraints arise.
The design of many electrical components can be reduced to finding optimal
points in equations. Indeed, the equations representing individual resistors,
capacitors, inductances, and voltage sources are well-known, as are the formulae
to combine them in parallel and serial connections. An entire circuit can thus be
modelled in that manner, and once this model is available, the values of specific
components can be optimized. For example, the impedance value Z for a resistor
is known to be $R$, for an inductance it is $\omega L$, and for a capacitor it is $(\omega C)^{-1}$,
where $\omega$ is the frequency of the power supply. In turn, the impedance of a serial
RLC circuit is given by:

$$Z = \sqrt{R^2 + \left(\omega L - \frac{1}{\omega C}\right)^2} \qquad (9.25)$$

From this model, it is possible to select a power supply with a frequency
appropriate to maximize or minimize the circuit's impedance.
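As a brief worked illustration of Eq. (9.25), assume that R, L, and C are fixed positive constants and that only the supply frequency can be chosen (written here as $\omega$, with $\omega_0$ a label introduced only for this sketch). The squared term under the root can never be negative, so the impedance is smallest when that term vanishes:

$$\omega_0 L - \frac{1}{\omega_0 C} = 0 \;\Rightarrow\; \omega_0 = \frac{1}{\sqrt{LC}}, \qquad Z(\omega_0) = R$$

Under the same assumptions the impedance has no finite maximum, since Z grows without bound as the frequency tends to zero or to infinity; a maximization would therefore be limited in practice by the range of frequencies the supply can deliver.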
The growth rate of yeast can be modelled by an exponential equation as a function
of the environment's temperature t:

$$G(t) = a\,t^{b}\,e^{-ct} \qquad (9.26)$$

where a, b, and c are constants dependent on the specific type of yeast studied.
From this equation, it is possible to determine the temperature that will maximize growth.
Scheduling problems are among the most popular optimization problems
encountered in practice. Consider for example a production line that can manufacture
two different products, each with an associated unit production cost $C_k$
and unit profit $P_k$. The aim is to manufacture a number of units of each product
$N_k$ in order to maximize profits P; however, the production line must operate
within its allocated budget B. In other words:

$$N_0 C_0 + N_1 C_1 = B$$

$$N_0 P_0 + N_1 P_1 = P$$

$$N_0 P_0 + \frac{B - N_0 C_0}{C_1}\,P_1 = P \qquad (9.27)$$

Equation (9.27) can be used to find the number of units $N_0$ that maximizes the profit,
and from that value the number of units $N_1$ can be determined easily.

9.10 Summary

One of the most common challenges in engineering is to try to determine the value
of some parameter of a system being designed to either maximize or minimize
its output value. This value of the parameter is the optimum, and this challenge is
optimization. This chapter has introduced several methods designed to solve
an optimization model. The golden-mean method is a closed method, which sets
up bounds around the optimum and uses the golden ratio property to get closer to
the value. As with the closed methods seen for root-finding, this closed method is
the least efficient one available but also the only one guaranteed not to diverge,
because of its requirement to keep the optimum bracketed between the bounds. Two
open methods were then examined, Newton's method and the quadratic optimization
method. Both are more efficient than the golden-mean method, but both require
more information, namely the first and second derivative for Newton's method and
three points to interpolate a parabola with for the quadratic optimization, and both
have a risk of diverging and failing in certain conditions. All three of these methods
are also designed for two-dimensional problems; the next method learned was
the gradient method, and it is a more general method designed to deal with
multidimensional optimization problems. Finally, the topic of stochastic optimization was discussed. This topic is huge, worthy of an entire textbook to itself, and
highly active in the scientific literature, so the discussion in this chapter is meant as
nothing more than an introduction. Nonetheless, two stochastic methods were
presented, the random brute-force search and simulated annealing. Table 9.3
summarizes the methods covered in this chapter.
Table 9.3 Summary of optimization methods

Method                       Requires                                  Error
Golden-mean search           2 bounds                                  O(h)
Newton's method              1 point + first and second derivatives    O(h^2)
Quadratic optimization       3 points                                  O(h^1.497)
Gradient descent             1 point + derivatives                     O(h^2)
Random brute-force search    1 point + thousands of tries              Unbounded
Simulated annealing          1 point                                   Unbounded

9.11 Exercises

1. Use the golden-mean search to find a minimum of:
(a) $f(x) = x^2$ starting with the interval [1, 2] to an absolute error of 0.1.
(b) $f(x) = x^4$ starting with the interval [0, 1] to an absolute error of 0.1.
(c) $f(x) = (x - 1)\,x\,(x + 1)$ starting with the interval [0, 2] to an absolute error of 0.1.
(d) $f(x) = \sin(x)$ starting with the interval [4, 5] to an absolute error of 0.1,
working in radians.
(e) $f(x) = x^2(x - 2)$ starting with the interval [1, 2] to an absolute error of 0.1.
(f) $f(x) = e^{x}\sin(x)$ starting with the interval [3, 5] to an absolute error of 0.1.
2. Using the golden-mean search with an initial interval of width h, how many
iterations would be required to get to a width of less than $\varepsilon$?
3. Use Newton's method to find a minimum of:
(a) $f(x) = x^2$ starting with the point $x_0 = 1$ to a relative error of 0.1.
(b) $f(x) = x^4$ starting with the point $x_0 = 2$ to a relative error of 0.1.
(c) $f(x) = (x - 1)\,x\,(x + 1)$ starting with the point $x_0 = 1$ to a relative error of 0.1.
(d) $f(x) = \sin(x)$ starting with the point $x_0 = 4$ radians to a relative error of 0.01.
(e) $f(x) = x^2(x - 2)$ starting with the point $x_0 = 1.5$ to a relative error of 0.01.
(f) $f(x) = e^{x}\sin(x)$ starting with $x_0 = 4$ to a relative error of 0.001.

4. Use the quadratic optimization method to find a minimum of:
(a) $f(x) = x^2$ starting with the points $x_2 = 1$, $x_1 = 0.9$, and $x_0 = 0.8$ to a relative
error of 0.1.
(b) $f(x) = x^4$ starting with the points $x_2 = 1$, $x_1 = 0.9$, and $x_0 = 0.8$ and iterating for four steps.
(c) $f(x) = (x - 1)\,x\,(x + 1)$ starting with the points $x_2 = 1.5$, $x_1 = 1$, and $x_0 = 0.5$
to a relative error of 0.1.
(d) $f(x) = \sin(x)$ starting with the points $x_2 = 4$, $x_1 = 4.1$, $x_0 = 4.2$ to a relative
error of 0.00001.
(e) $f(x) = x^2(x - 2)$ starting with the points $x_2 = 2$, $x_1 = 1$, and $x_0 = 1.5$ to a
relative error of 0.01.
(f) $f(x) = e^{x}\sin(x)$ with the points $x_2 = 3$, $x_1 = 4$, and $x_0 = 5$ to a relative
error of 0.001.
5. Perform two iterations of gradient descent to find a minimum of the function
$f(\mathbf{x}) = x_0^2 + x_1^2 - x_0 x_1 + x_0 - 2x_1$ starting with $\mathbf{x} = [1, 1]^T$.

Chapter 10

Differentiation

10.1 Introduction

Differentiation and its complement operation, integration, allow engineers to
measure and quantify change. Measuring change is essential in engineering practice,
which often deals with systems that are changing in some way, by moving,
growing, filling up, decaying, discharging, or otherwise increasing or decreasing in
some way.
To visualize the relationship, consider a simple example: a robot that was
initially at rest begins moving at time 0 s in a straight line with a constant
acceleration of 5 m/s². From this information, it is possible to model the entire
situation and to track the robot's speed and position. Since the robot was at rest
before 0 s and has a constant acceleration after 0 s, at time 0 s exactly it has an
impulse jerk (variation in acceleration) of 5 m/s³ that goes back to zero after that
point. Its speed was also 0 m/s before time 0 s, but with a constant acceleration it
will increase at a constant rate after that, being 5 m/s after 1 s, 10 m/s after 2 s,
15 m/s after 3 s, and so on. And the position will also be increasing, but faster than
the speed, as the robot moves further per second at higher speeds. It will be 2.5 m
from the starting position after 1 s, at 10 m after 2 s, at 22.5 m after 3 s, and so
on. These four measures are illustrated in Fig. 10.1. And to further formalize this
example, the equations for each of the metrics are as follows: The jerk, as indicated,
is a single impulse of 5 m/s³ at time 0 s:

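$$\text{Jerk} = \begin{cases} 5\ \text{m/s}^3 & t = 0 \\ 0\ \text{m/s}^3 & t > 0 \end{cases} \qquad (10.1)$$

The acceleration is constant:

$$\text{Acceleration} = 5\ \text{m/s}^2 \qquad (10.2)$$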

Springer International Publishing Switzerland 2016

R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_10


Fig. 10.1 Jerk (top-right), acceleration (top-left), speed (bottom-left), and position (bottom-right)
with respect to time, for a robot at constant acceleration of 5 m/s²

And the speed and position both increase with respect to time:

$$\text{Speed} = 5t\ \text{m/s} \qquad (10.3)$$

$$\text{Position} = 2.5t^2\ \text{m} \qquad (10.4)$$

Looking at the relationship, the graphs of Fig. 10.1 in clockwise order from the
bottom-right show the differentiation operation, while looking at their relationship
in counter-clockwise order from the top-right shows the integration operation.
Indeed, the clockwise relationship demonstrates the rate of change of the
previous curve. The position curve is increasing quadratically, as Eq. (10.4) shows,
and the speed curve is thus one whose value is constantly increasing. But since it is
increasing at a constant rate, the acceleration curve is a constant line, aside from the
initial jump from 0 to 5 m/s² when movement started. And since a constant line is
not changing, the jerk curve is zero, save for the impulse of 5 m/s³ at time 0 s when
the acceleration changes from 0 to 5 m/s². On the other hand, integration, which
will be covered in Chap. 11, is the area under the curve of the previous graph. The
jerk is a single impulse of 5 m/s³ at time 0 s, which has an area of 5 m/s², followed
by a constant zero line with null area. Consequently, the acceleration value jumps to
5 m/s² at time 0 s and remains constant there since no additional area is added. The
area under this acceleration graph will be increasing constantly, by 5 m/s every second.
Consequently, the speed value that reflects its area increases constantly at that rate.
And the area under the linearly increasing speed graph is actually increasing
quadratically: it covers 2.5 m from 0 to 1 s, 10 m from 0 to 2 s, 22.5 m from 0 to
3 s, and so on, and as a result the position value is a quadratically increasing curve
over time.
Being able to measure the derivative and integral of systems is thus critical
if the system being modelled is changing, that is, if the system is not in a steady-state or if
the model is not meant to capture a snapshot of the system at a specific moment in
time. If the equation of the system is known or can be determined, as is the case with
Eq. (10.2) in the previous example, then its derivatives and integrals can be computed
exactly using notions learned in calculus courses. This chapter and the next one,
however, will deal with the case where the equation of the system is unknown and

cannot be determined precisely, and the only information available is discrete
measurements of the system. As these two chapters will show, it is still possible
to compute the integral and derivative of the attributes of a system and thus model
its changing nature with as little as two discrete measurements of it.

10.2 Centered Divided-Difference Formulae

Recall from Chap. 5 the first-order Taylor series approximation formula:

$$f(x_i + h) \approx f(x_i) + f^{(1)}(x_i)\,h \qquad (10.5)$$

This formula makes it possible to approximate the value of a function at a point a
step h after a point $x_i$ where the value is known, using the known value of the first
derivative at that point. Alternatively, if there was a need to approximate the value
of the function at a point a step h before the known point $x_i$, there would only be a
sign difference in the formula:

$$f(x_i - h) \approx f(x_i) - f^{(1)}(x_i)\,h \qquad (10.6)$$

Now assume that the measurements of the function are known at all three points,
but the derivative is unknown. It is immediately clear that either (10.5) or (10.6)
could be solved to find the value of the derivative, since each is an equation with
only one unknown. But, for reasons that will become clear soon, it is possible and
preferable to do even better than this, by taking the difference of both equations:

$$f(x_i + h) - f(x_i - h) \approx \left[f(x_i) + f^{(1)}(x_i)\,h\right] - \left[f(x_i) - f^{(1)}(x_i)\,h\right] \qquad (10.7)$$

And solving that equation for $f^{(1)}(x_i)$:

$$f^{(1)}(x_i) \approx \frac{f(x_i + h) - f(x_i - h)}{2h} \qquad (10.8)$$

The formula of Eq. (10.8) is called the second-order centered divided-difference
formula, and it gives a good approximation of the derivative of a function at point $x_i$
given only two measurements of the function at equally spaced intervals before and
after $x_i$. How good is the approximation? Since it is derived from the Taylor series,
recall that the error is proportional to the next non-zero term. In this case, it will be
the third-order term, since the second-order terms will cancel out. This can be
verified by expanding the Taylor series approximations of Eqs. (10.5) and (10.6)
and solving Eq. (10.7) again:

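$$\begin{aligned}
f(x_i + h) &\approx f(x_i) + f^{(1)}(x_i)\,h + \frac{f^{(2)}(x_i)\,h^2}{2!} + \frac{f^{(3)}(x_i)\,h^3}{3!} \\
f(x_i - h) &\approx f(x_i) - f^{(1)}(x_i)\,h + \frac{f^{(2)}(x_i)\,h^2}{2!} - \frac{f^{(3)}(x_i)\,h^3}{3!} \\
f(x_i + h) - f(x_i - h) &\approx \left[f(x_i) + f^{(1)}(x_i)\,h + \frac{f^{(2)}(x_i)\,h^2}{2!} + \frac{f^{(3)}(x_i)\,h^3}{3!}\right] \\
&\quad - \left[f(x_i) - f^{(1)}(x_i)\,h + \frac{f^{(2)}(x_i)\,h^2}{2!} - \frac{f^{(3)}(x_i)\,h^3}{3!}\right] \\
&\approx 2f^{(1)}(x_i)\,h + \frac{2f^{(3)}(x_i)\,h^3}{3!} \\
f^{(1)}(x_i) &\approx \frac{f(x_i + h) - f(x_i - h)}{2h} - \frac{f^{(3)}(x_i)\,h^2}{3!} \\
f^{(1)}(x_i) &= \frac{f(x_i + h) - f(x_i - h)}{2h} + O(h^2)
\end{aligned} \qquad (10.9)$$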

Since the error is quadratic, or second-order, this gives the formula its name.
Moreover, the development of Eq. (10.9) demonstrates why it was preferable to
take the difference between Eqs. (10.5) and (10.6) to approximate the derivative,
rather than simply solving either one of these equations. With only one equation,
the second-order term of the Taylor series would have nothing to cancel out with,
and the final formula would be O(h), a less-accurate first-order formula.
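The difference between the two orders of accuracy can be observed numerically. The following Python sketch is an illustration only: the function names are hypothetical, and sin(x) is used as a test function because its derivative, cos(x), is known exactly. It compares the estimate obtained by solving Eq. (10.5) alone with the centered formula of Eq. (10.8) as the step is halved.

```python
import math


def forward_difference(f, x, h):
    """First-order estimate obtained by solving Eq. (10.5) for the derivative."""
    return (f(x + h) - f(x)) / h


def centered_difference(f, x, h):
    """Second-order centered divided-difference formula, Eq. (10.8)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    x = 1.0
    exact = math.cos(x)  # exact derivative of sin(x) at x
    for h in (0.1, 0.05, 0.025):
        err_forward = abs(forward_difference(math.sin, x, h) - exact)
        err_centered = abs(centered_difference(math.sin, x, h) - exact)
        print(f"h = {h:<6} forward error = {err_forward:.2e}   "
              f"centered error = {err_centered:.2e}")
```

Each halving of h should roughly halve the forward error and divide the centered error by about four, in line with the O(h) and O(h²) behaviours derived above.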
Example 10.1
A 1-L reservoir is getting filled with water. It was initially empty, but reached
a quarter-full after 1.16 s, half-full after 2.39 s, three-quarter-full after 3.45 s,
and completely full after 4 s. Estimate the rate it was getting filled up by the
time it was half-full using the centered divided-difference formula and steps
of 0.5 and 0.25 L.
Solution
The fill-up rate is the volume filled per unit of time. The information given in
the problem statement is instead the time needed to fill certain units of
volume. The derivative of these values will be the time per volume, and the
inverse will be the rate asked for. The derivative can be obtained by a
straightforward application of Eq. (10.8):
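$$f^{(1)}_{h=0.5}(0.5) = \frac{f(1) - f(0)}{2h} = \frac{4 - 0}{2 \times 0.5} = 4.00\ \text{s/L} \;\Rightarrow\; 0.25\ \text{L/s}$$

$$f^{(1)}_{h=0.25}(0.5) = \frac{f(0.75) - f(0.25)}{2h} = \frac{3.45 - 1.16}{2 \times 0.25} = 4.57\ \text{s/L} \;\Rightarrow\; 0.22\ \text{L/s}$$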

From Eq. (10.9), it has been demonstrated that the error on the approximation
is proportional to the square of the step size. Reducing the step size by
half, from h to h/2, should thus reduce the error by a factor of 4, from O(h²)
to O((h/2)²) = O(h²/4). To verify this, note that the equation modelling the
time needed to fill the reservoir is:

$$f(V) = -3V^6 + 4V^5 + V^4 - 6V^3 + 4V^2 + 4V$$

Evaluating the derivative of this equation at V = 0.5 gives a rate of 4.69 s/L,
or a fill-up rate of 0.21 L/s. The relative error of each approximation of the
derivative is:

$$E_{h=0.5} = \left|\frac{4.00 - 4.69}{4.69}\right| = 14.7\ \%$$

$$E_{h=0.25} = \left|\frac{4.57 - 4.69}{4.69}\right| = 2.4\ \%$$
The error has in fact been reduced to less than a quarter of its original value,
though it is still within the quadratic reduction range that was expected.
Finally, it is interesting to visualize the situation. The function describing
the time needed to fill a given volume is plotted below. The derivative value is
the slope of the tangent to the function at point V = 0.5. The approximation of
that tangent with h = 0.5 is plotted in purple, and it can be seen to be rather
off, and in fact clearly intersects the function. The approximation with
h = 0.25, in red, is clearly a lot better, and in fact almost overlaps with
the real tangent, plotted in green.
[Figure: Time (vertical axis, 0 to 4) versus Volume (horizontal axis, 0 to 1)]

The previous example used two pairs of two points to compute two second-order
divided-difference approximations of the derivative. But if four points are
available, an intuitive decision would be rather to use all four of them to compute
a single approximation with a higher accuracy. Such a formula can be derived
from the Taylor series approximations, as before. Begin by writing out the four
Taylor series approximations, to a sufficiently high order to model the error of
the approximation as well. To determine which order to go up to, note that
with the divided-difference formula with two points in Eq. (10.9), the even
second-order term cancelled out and the error was due to the third-order term.
Consequently, it can be intuitively expected that, with the more accurate formula
with four points, the next even-order term will also cancel out and the error
will be due to the fifth-order term. The four fifth-order Taylor series approximations are:
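$$\begin{aligned}
f(x_i + 2h) &\approx f(x_i) + 2f^{(1)}(x_i)\,h + \frac{4f^{(2)}(x_i)\,h^2}{2!} + \frac{8f^{(3)}(x_i)\,h^3}{3!} + \frac{16f^{(4)}(x_i)\,h^4}{4!} + \frac{32f^{(5)}(x_i)\,h^5}{5!} \\
f(x_i + h) &\approx f(x_i) + f^{(1)}(x_i)\,h + \frac{f^{(2)}(x_i)\,h^2}{2!} + \frac{f^{(3)}(x_i)\,h^3}{3!} + \frac{f^{(4)}(x_i)\,h^4}{4!} + \frac{f^{(5)}(x_i)\,h^5}{5!} \\
f(x_i - h) &\approx f(x_i) - f^{(1)}(x_i)\,h + \frac{f^{(2)}(x_i)\,h^2}{2!} - \frac{f^{(3)}(x_i)\,h^3}{3!} + \frac{f^{(4)}(x_i)\,h^4}{4!} - \frac{f^{(5)}(x_i)\,h^5}{5!} \\
f(x_i - 2h) &\approx f(x_i) - 2f^{(1)}(x_i)\,h + \frac{4f^{(2)}(x_i)\,h^2}{2!} - \frac{8f^{(3)}(x_i)\,h^3}{3!} + \frac{16f^{(4)}(x_i)\,h^4}{4!} - \frac{32f^{(5)}(x_i)\,h^5}{5!}
\end{aligned} \qquad (10.10)$$

Taking the difference of the series one step before and after, as before, cancels out
the second- and fourth-order terms, but leaves the third-order term:

$$f(x_i + h) - f(x_i - h) \approx 2f^{(1)}(x_i)\,h + \frac{2f^{(3)}(x_i)\,h^3}{3!} + \frac{2f^{(5)}(x_i)\,h^5}{5!} \qquad (10.11)$$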

That's a problem; the third-order term must get cancelled out as well, otherwise it
will be the dominant term for the error. Fortunately, there are two more series to
incorporate into the formula. Looking back at the set of equations (10.10), it can be
seen that the third-order term in the series $f(x_i \pm h)$ is eight times smaller than the
third-order term in the series $f(x_i \pm 2h)$. To get them to cancel out, the series
$f(x_i \pm h)$ should thus be multiplied by 8 in Eq. (10.11) and the series $f(x_i \pm 2h)$
should be of opposite signs from their counterparts one step before or after. And by
having the series $f(x_i \pm 2h)$ be of opposite signs from each other, the second- and
fourth-order terms will cancel with each other as they did in Eq. (10.11).
The resulting formula is:

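$$\begin{aligned}
-f(x_i + 2h) + 8f(x_i + h) - 8f(x_i - h) + f(x_i - 2h) &\approx 12f^{(1)}(x_i)\,h - \frac{48f^{(5)}(x_i)\,h^5}{5!} \\
f^{(1)}(x_i) &= \frac{-f(x_i + 2h) + 8f(x_i + h) - 8f(x_i - h) + f(x_i - 2h)}{12h} + O(h^4)
\end{aligned} \qquad (10.12)$$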

The third-order terms now cancel out, leaving only the fifth-order terms and a
division by h for an error of O(h4). As expected, using two points before and after
rather than just one has greatly increased the accuracy of the formula. This is now
the fourth-order centered divided-difference formula.
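A minimal Python sketch of Eq. (10.12) is given below; the function name is hypothetical, and sin(x) is again used only as a convenient test function with a known derivative. Halving the step should divide the error by roughly 16, and the formula should be exact, up to rounding, for polynomials of degree four or less, since its error term depends on the fifth derivative.

```python
import math


def centered_difference_4(f, x, h):
    """Fourth-order centered divided-difference formula, Eq. (10.12)."""
    return (-f(x + 2 * h) + 8 * f(x + h)
            - 8 * f(x - h) + f(x - 2 * h)) / (12.0 * h)


if __name__ == "__main__":
    x = 1.0
    exact = math.cos(x)  # exact derivative of sin(x) at x
    for h in (0.1, 0.05, 0.025):
        error = abs(centered_difference_4(math.sin, x, h) - exact)
        print(f"h = {h:<6} error = {error:.2e}")
    # Derivative of x**4 at x = 1; the exact value is 4.
    print(centered_difference_4(lambda t: t ** 4, 1.0, 0.1))
```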
Example 10.2
A 1-L reservoir is getting filled with water. It was initially empty, but reached
a quarter-full after 1.16 s, half-full after 2.39 s, three-quarter-full after 3.45 s,
and completely full after 4 s. Estimate the rate it was getting filled up by the
time it was half-full using the fourth-order centered divided-difference
formula.
Solution
The fill-up rate is the volume filled per unit of time. The information given in
the problem statement is instead the time needed to fill certain units of
volume. The derivative of these values will be the time per volume, and the
inverse will be the rate asked for. The derivative can be obtained by a
straightforward application of Eq. (10.12):
$$f^{(1)}(0.5) = \frac{-f(1) + 8f(0.75) - 8f(0.25) + f(0)}{12h} = \frac{-4 + 8 \times 3.45 - 8 \times 1.16 + 0}{12 \times 0.25} = 4.77\ \mathrm{s/L} \Rightarrow 0.21\ \mathrm{L/s}$$

Compared to the real rate of 4.69 s/L (fill-up rate of 0.21 L/s), the relative
error of this approximation is:

$$E_{h=0.25} = \left|\frac{4.77 - 4.69}{4.69}\right| = 1.7\,\%$$

This result is clearly more accurate than either of the ones obtained with the
same data using the second-order centered divided-difference formula in
Example 10.1.
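
The calculation of Example 10.2 can be reproduced in a few lines of Python. The following is only a minimal sketch and is not part of the original text; the function name `centered_diff_4th` and the list layout of the samples are assumptions made for illustration.

```python
def centered_diff_4th(samples, i, h):
    """Fourth-order centered divided-difference estimate of f'(x_i), Eq. (10.12).

    samples holds equally spaced measurements f(x_0), f(x_1), ...;
    i is the index of the target point and h is the spacing.
    """
    return (-samples[i + 2] + 8 * samples[i + 1]
            - 8 * samples[i - 1] + samples[i - 2]) / (12 * h)


# Reservoir data of Example 10.2: time (s) at which the volumes
# 0, 0.25, 0.5, 0.75 and 1 L were reached.
times = [0.0, 1.16, 2.39, 3.45, 4.0]

time_per_volume = centered_diff_4th(times, 2, 0.25)  # s/L at half-full
fill_rate = 1.0 / time_per_volume                    # L/s

print(f"{time_per_volume:.2f} s/L, {fill_rate:.2f} L/s")
```

The sketch assumes that two valid samples exist on each side of index `i`; it performs no bounds checking.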

10.3

Forward and Backward Divided-Difference Formulae

The best approximation of the derivative is obtained by using measurements both
before and after a target point. Unfortunately, oftentimes in engineering, it is not
possible to get measurements on both sides of a target point. For example, it may be
necessary to use the current rate of change of an incoming signal to adjust buffer
capacity and processing resources in real-time, in which case only past measurements are available. Or it may be necessary to reconstruct how temperatures were
changing on Earth long before climate records were kept for environmental and
historical studies, in which case only future measurements (from the point of view
of the values being estimated) are available. To deal with such cases, the backward
divided-difference formula (using only past values) and the forward divided-difference formula (using only future values) can be derived from the Taylor series
using only steps in the acceptable direction. The trade-off will be the need for more
measurements to maintain the same accuracy as the centered divided-difference
formulae.
In the previous section, the second-order centered divided-difference formula
and its error term were developed from two steps of the third-order Taylor series
approximation. Likewise, the second-order backward divided-difference formula
and its error term can be derived from the same starting point. However, in this case,
the two steps will be two previous steps, as such:
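$$
\begin{aligned}
f(x_i - h) &= f(x_i) - f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} - \frac{f^{(3)}(x_i)h^3}{3!} \\
f(x_i - 2h) &= f(x_i) - 2f^{(1)}(x_i)h + \frac{4f^{(2)}(x_i)h^2}{2!} - \frac{8f^{(3)}(x_i)h^3}{3!}
\end{aligned}
\tag{10.13}
$$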

The second-order term of the Taylor series approximations will need to cancel out
in order to keep the third-order term as the dominant error term. It can be seen from
Eq. (10.13) that this second-order term is four times greater in $f(x_i - 2h)$ than it is in
$f(x_i - h)$, and of the same sign. Consequently, the series $f(x_i - h)$ will need to be
multiplied by 4 and subtracted for the second-order terms to cancel out:
$$f(x_i - 2h) - 4f(x_i - h) = -3f(x_i) + 2f^{(1)}(x_i)h - \frac{4f^{(3)}(x_i)h^3}{3!}$$

$$f^{(1)}(x_i) = \frac{f(x_i - 2h) - 4f(x_i - h) + 3f(x_i)}{2h} + O(h^2) \tag{10.14}$$

The resulting equation is indeed $O(h^2)$. However, it can be noted that this accuracy
required three measurements, at the current point and at one and two steps before,
while the second-order centered divided-difference formula achieved it with only
two measurements. As indicated before, this is because more measurements are
needed to even out the loss of information that comes from using measurements all
on one side of the target point, rather than centered around the target point.
The second-order forward divided-difference formula is computed using the
same development as its backward counterpart. The final equation and error term
are:
$$f^{(1)}(x_i) = \frac{-f(x_i + 2h) + 4f(x_i + h) - 3f(x_i)}{2h} + O(h^2) \tag{10.15}$$

Example 10.3
A robot is observed moving in a straight line. It starts off at the 8 m mark, and
is measured every second:
Time t (s)          0    1    2    3    4
Position f(t) (m)   8   16   34   62  100

Estimate its current speed, at time 4 s.

Solution
Speed is the first derivative of position. Since only past measurements and the
current-time measurement are available, the second-order backward divided-difference formula can be used. Applying Eq. (10.14) gives:
$$f^{(1)}(4) = \frac{f(2) - 4f(3) + 3f(4)}{2 \times 1} = \frac{34 - 4 \times 62 + 3 \times 100}{2} = 43\ \mathrm{m/s}$$

To verify this result, note that the equation modelling the robot's position is:
$$f(t) = 5t^2 + 3t + 8$$
The derivative of this equation is trivial to compute, and evaluated at t = 4
it does give 43 m/s.
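
The one-sided formulae of Eqs. (10.14) and (10.15) can be sketched in Python as follows. This is an illustrative sketch only; the function names and the use of a plain list of samples are assumptions, not something defined in the text.

```python
def backward_diff_2nd(samples, i, h):
    """Second-order backward divided-difference estimate of f'(x_i), Eq. (10.14)."""
    return (samples[i - 2] - 4 * samples[i - 1] + 3 * samples[i]) / (2 * h)


def forward_diff_2nd(samples, i, h):
    """Second-order forward divided-difference estimate of f'(x_i), Eq. (10.15)."""
    return (-samples[i + 2] + 4 * samples[i + 1] - 3 * samples[i]) / (2 * h)


positions = [8, 16, 34, 62, 100]  # robot positions (m) at t = 0, 1, 2, 3, 4 s

speed_now = backward_diff_2nd(positions, 4, 1.0)   # only past data available
speed_start = forward_diff_2nd(positions, 0, 1.0)  # only future data available

print(speed_now, speed_start)
```

Because the position in this example is a degree-2 polynomial, the third-order error term vanishes and both estimates match the analytical derivative 10t + 3 at t = 4 and t = 0.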

10.4

Richardson Extrapolation

It has been demonstrated in the previous sections that the error on the approximation of the derivative is a function of h, the step size between the measurements. In
fact, Example 10.1 even demonstrated practically that a smaller step size leads to a
better approximation. However, this also leads to a major problem with the divided-difference formulae: the risk of subtractive cancellation, introduced back in
Chap. 2. Consider for example the second-order centered divided-difference formula of Eq. (10.9). As the value of h is reduced with the expectation of increasing
accuracy, there will come a point where $f(x_i - h) \approx f(x_i + h)$ and the effects of
subtractive cancellation will be felt. At that point, it would be wrong, and potentially dangerous, to continue to use smaller and smaller values of h and to advertise
the results as more accurate. And this issue of subtractive cancellation at smaller
values of h will occur with every divided-difference formula available, as they are
all based on taking the difference between measurements at regular intervals.
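
The effect described above can be observed numerically. The sketch below is an illustration under assumptions that are not from the text: it uses the exponential function at x = 1 as a test function, double-precision floating-point arithmetic, and step sizes from 1e-1 down to 1e-13.

```python
import math


def centered_diff_2nd(f, x, h):
    """Second-order centered divided-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)


x = 1.0
exact = math.exp(x)  # the derivative of exp is exp itself

errors = {}
for exponent in range(1, 14):
    h = 10.0 ** (-exponent)
    approx = centered_diff_2nd(math.exp, x, h)
    errors[exponent] = abs(approx - exact) / exact
    print(f"h = 1e-{exponent:02d}   relative error = {errors[exponent]:.2e}")
```

The relative error first shrinks as h decreases, then grows again once the two function values agree in most of their digits and their difference retains only a few significant digits.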
Yet the fundamental problem remains; smaller values of h are the only way to
improve the accuracy of the approximation of the derivative. Richardson extrapolation offers a solution to this problem. Instead of decreasing the value of h and
computing the divided-difference formula, this method makes it possible to compute the divided-difference formula with a large value of h then iteratively decrease
it to increase accuracy.
To begin, note that in every centered divided-difference formula for the first derivative, the even-order terms
from the Taylor series approximation cancel out, leaving the odd-order terms with
even-valued exponents of h once the final division by h is performed. The first
non-zero term after the formula becomes the error term, and all other terms are
ignored. But if the first non-zero term is cancelled out, as it was when going from
the second-order centered divided-difference formula to the fourth-order one, then
the error drops by a factor of $h^2$ for that reason.
Now consider again the centered divided-difference formula of Eq. (10.9). If the
equation is expanded to include all terms from the Taylor series, and since the even-order terms cancel out, the formula becomes:
$$f^{(1)}(x_i) = \frac{f(x_i + h) - f(x_i - h)}{2h} - \frac{f^{(3)}(x_i)h^2}{3!} - \frac{f^{(5)}(x_i)h^4}{5!} - \frac{f^{(7)}(x_i)h^6}{7!} - \cdots \tag{10.16}$$

Note again that since every other term cancels out, the kth term actually appearing
in the series represents an error of $O(h^{2k})$. The error is of course dominated by the
largest term, which in this case is $O(h^2)$. But keeping that first error term written out
explicitly, Eq. (10.16) can be rewritten equivalently as:

$$D_{\mathrm{exact}} = D_1(h) + K_1 h^2 + O(h^4) \tag{10.17}$$

where $D_{\mathrm{exact}}$ represents the real exact value of the derivative and:

$$D_1(h) = \frac{f(x_i + h) - f(x_i - h)}{2h} \tag{10.18}$$

$$K_1 h^2 = -\frac{f^{(3)}(x_i)h^2}{3!} \tag{10.19}$$

As mentioned already, the way to improve the accuracy of this formula is to
decrease the step size. Therefore, replace h with h/2 in Eq. (10.17) to improve its
accuracy. This gives:
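$$
\begin{aligned}
D_{\mathrm{exact}} &= D_1\!\left(\frac{h}{2}\right) + K_1\left(\frac{h}{2}\right)^2 + O\!\left(\left(\frac{h}{2}\right)^4\right) \\
&= D_1\!\left(\frac{h}{2}\right) + \frac{K_1 h^2}{4} + O(h^4)
\end{aligned}
\tag{10.20}
$$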

Note that the division by 2 inside the big O term has disappeared; as explained in
Chap. 1, big O notation is indifferent to constant values and is only a function of the
variable, in this case h. As for the derivative approximation, nothing special has
happened. The formula is the same with half the step size, and the risk of subtractive
cancellation is still present if h becomes too small. But notice that Eqs. (10.17) and
(10.20) each have a different parameter for $D_1$ but the same term in $h^2$, with the only
difference being that one is four times larger than the other. This should give an idea
for cancelling out the $h^2$ term: taking four times Eq. (10.20) and subtracting
Eq. (10.17):

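$$
\begin{aligned}
4D_{\mathrm{exact}} - D_{\mathrm{exact}} &= 4D_1\!\left(\frac{h}{2}\right) + \frac{4K_1 h^2}{4} + 4O(h^4) - D_1(h) - K_1 h^2 - O(h^4) \\
3D_{\mathrm{exact}} &= 4D_1\!\left(\frac{h}{2}\right) - D_1(h) + O(h^4) \\
D_{\mathrm{exact}} &= \frac{4D_1\!\left(\frac{h}{2}\right) - D_1(h)}{3} + O(h^4)
\end{aligned}
\tag{10.21}
$$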
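
A single extrapolation step of Eq. (10.21) can be illustrated as follows. This is a sketch under assumptions that are not from the text: the sine function at x = 1 is used as a test function with a deliberately large step h = 0.5, and the function names are hypothetical.

```python
import math


def d1(f, x, h):
    """Second-order centered divided difference D1(h), Eq. (10.18)."""
    return (f(x + h) - f(x - h)) / (2 * h)


def richardson_step(f, x, h):
    """Combine D1(h) and D1(h/2) as in Eq. (10.21) to cancel the h^2 term."""
    return (4 * d1(f, x, h / 2) - d1(f, x, h)) / 3


x, h = 1.0, 0.5
exact = math.cos(x)  # derivative of sin

err_coarse = abs(d1(math.sin, x, h) - exact)
err_half = abs(d1(math.sin, x, h / 2) - exact)
err_combined = abs(richardson_step(math.sin, x, h) - exact)

print(err_coarse, err_half, err_combined)
```

Halving the step divides the error by about four, while combining the two estimates reduces it by a further two orders of magnitude without using a smaller step than h/2.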
The error is now $O(h^4)$, an important improvement, and more importantly this was
done without risking subtractive cancellation! And this is after only one iteration;
Richardson extrapolation is an iterative process, so it can be done again. Begin by
rewriting Eq. (10.21) in the same form as Eq. (10.17):
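$$D_{\mathrm{exact}} = D_2\!\left(\frac{h}{2}\right) + K_2 h^4 + O(h^6) \tag{10.22}$$

where:

$$D_2\!\left(\frac{h}{2}\right) = \frac{4D_1\!\left(\frac{h}{2}\right) - D_1(h)}{3} \tag{10.23}$$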


Then compute a more accurate version again by halving h:

$$D_{\mathrm{exact}} = D_2\!\left(\frac{h}{4}\right) + \frac{K_2 h^4}{16} + O(h^6) \tag{10.24}$$

This time, comparing Eqs. (10.22) and (10.24), the $h^4$ term is 16 times larger in the former.
Therefore the second equation will need to be multiplied by 16 to cancel out this
next error term.

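$$
\begin{aligned}
16D_{\mathrm{exact}} - D_{\mathrm{exact}} &= 16D_2\!\left(\frac{h}{4}\right) + \frac{16K_2 h^4}{16} + 16O(h^6) - D_2\!\left(\frac{h}{2}\right) - K_2 h^4 - O(h^6) \\
15D_{\mathrm{exact}} &= 16D_2\!\left(\frac{h}{4}\right) - D_2\!\left(\frac{h}{2}\right) + O(h^6) \\
D_{\mathrm{exact}} &= \frac{16D_2\!\left(\frac{h}{4}\right) - D_2\!\left(\frac{h}{2}\right)}{15} + O(h^6)
\end{aligned}
\tag{10.25}
$$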
This process can go on iteratively forever, or until one of the usual termination
conditions applies: a threshold relative error between the approximation of the
derivative of two iterations is achieved (success condition), or a preset maximum
number of iterations is reached (failure condition). Richardson extrapolation can be
summarized as follows:

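$$D_{\mathrm{exact}} = D_k\!\left(\frac{h}{2^{k-1}}\right) + O(h^{2k}) \quad \text{for } k \ge 1 \tag{10.26}$$

where:

$$D_k\!\left(\frac{h}{2^j}\right) = \frac{4^{k-1} D_{k-1}\!\left(\frac{h}{2^j}\right) - D_{k-1}\!\left(\frac{h}{2^{j-1}}\right)}{4^{k-1} - 1} \quad \text{if } k > 1 \tag{10.27}$$

$$D_k\!\left(\frac{h}{2^j}\right) = \frac{f\!\left(x_i + \frac{h}{2^j}\right) - f\!\left(x_i - \frac{h}{2^j}\right)}{2\,\frac{h}{2^j}} \quad \text{if } k = 1 \tag{10.28}$$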

Then, for each value of j starting from 0 and going up to the termination condition,
compute Eq. (10.27) for all values of k from 1 to j + 1. The first iteration will thus
only compute one instance of Eq. (10.28), and each subsequent iteration will add
one more instance of Eq. (10.27). Moreover, each new instance of Eq. (10.27) will
be computed from a lower-k-valued instance of it, down to Eq. (10.28). The value of
Eq. (10.27) with the highest values of j and k at the final iteration will be the
approximation of the derivative in Eq. (10.26).
Algorithmically, the Richardson extrapolation method can be implemented by
filling up a table left to right using values computed and stored in the previous
columns, much like the table of divided-differences in Newton's interpolation
method. Each column in this table thus represents an increment of the value of
k in Eqs. (10.26) to (10.28). The elements of the first, left-most column represent
k = 1 and are computed using Eq. (10.28), while the following columns represent
k > 1 and are computed using Eq. (10.27). Likewise, each row of the table represents a factor of 2 in the division, that is, one value of the exponent j in Eqs. (10.27) and (10.28). At
each iteration, the algorithm adds a row at the bottom of the first column (one new
division by 2), and then moves right using the new value to compute higher-order
approximations. Since each step right in the table requires computing a new value
using two previous values, the number of elements in each column is one less than
the previous, until the right-most column has only one element and the iteration
ends. This process then repeats until one of the termination conditions mentioned
before is realized. The pseudocode for this method is presented in Fig. 10.2.
x ← Input target value to find the derivative at
h ← Input initial step size
IterationMaximum ← Input maximum number of iterations
ErrorMinimum ← Input minimum relative error
RichardsonTable ← empty IterationMaximum × IterationMaximum table
BestValue ← 0
RowIndex ← 0
WHILE (RowIndex < IterationMaximum)
    element at column 0, row RowIndex of RichardsonTable ←
        [F(x + h) − F(x − h)] / (2 × h)
    h ← h / 2
    ColumnIndex ← 1
    WHILE (ColumnIndex <= RowIndex)
        element at column ColumnIndex, row RowIndex of RichardsonTable ←
            [ (4 to the power ColumnIndex) × (element at column
            ColumnIndex-1, row RowIndex of RichardsonTable) − (element
            at column ColumnIndex-1, row RowIndex-1 of RichardsonTable)
            ] / [(4 to the power ColumnIndex) − 1]
        ColumnIndex ← ColumnIndex + 1
    END WHILE
    PreviousValue ← BestValue
    BestValue ← element at column RowIndex, row RowIndex of
        RichardsonTable
    CurrentError ← absolute value of [ (BestValue − PreviousValue) /
        BestValue ]
    IF (CurrentError <= ErrorMinimum)
        RETURN Success, BestValue
    END IF
    RowIndex ← RowIndex + 1
END WHILE
RETURN Failure

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION

Fig. 10.2 Pseudocode of Richardson extrapolation
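
One possible Python rendering of the pseudocode of Fig. 10.2 is sketched below. It is an illustration rather than a reference implementation: the function name `richardson`, the default values of its parameters, the list-of-rows table layout, and the guard against a zero estimate are all assumptions that do not appear in the text. Columns are indexed from zero, so column `col` holds $D_k$ with k = col + 1 and the multiplier $4^{k-1}$ of Eq. (10.27) becomes `4 ** col`.

```python
import math


def richardson(f, x, h, max_iterations=10, tolerance=1e-10):
    """Estimate f'(x) by Richardson extrapolation of centered differences.

    Returns (estimate, converged). Row j of the table uses step h / 2**j;
    column k-1 holds D_k from Eqs. (10.27) and (10.28).
    """
    table = []
    best = 0.0
    for j in range(max_iterations):
        step = h / 2 ** j
        row = [(f(x + step) - f(x - step)) / (2 * step)]  # Eq. (10.28)
        for col in range(1, j + 1):
            factor = 4 ** col  # 4^(k-1) with k = col + 1
            row.append((factor * row[col - 1] - table[j - 1][col - 1])
                       / (factor - 1))                    # Eq. (10.27)
        table.append(row)
        previous, best = best, row[-1]
        # Success condition: relative change between successive estimates.
        if best != 0 and abs((best - previous) / best) <= tolerance:
            return best, True
    return best, False  # failure: iteration limit reached


estimate, converged = richardson(math.sin, 1.0, 0.5)
print(estimate, converged, abs(estimate - math.cos(1.0)))
```

Here the test function is the sine, whose exact derivative at x = 1 is cos(1), so the last printed value is the absolute error of the final estimate.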


Example 10.4
A 1-L reservoir is getting filled with water. It was initially empty, but reached
a quarter-full after 1.16 s, half-full after 2.39 s, three-quarter-full after 3.45 s,
and completely full after 4 s. Estimate the rate it was getting filled up by the
time it was half-full using the centered divided-difference formula and
Richardson extrapolation.
Solution
The fill-up rate is the volume filled per unit of time. The information given in
the problem statement is instead the time needed to fill certain units of
volume. The derivative of these values will be the time per volume, and the
inverse will be the rate asked for.
Richardson extrapolation starts with j = 0. The only possible value of
k from 1 to j + 1 is thus k = 1, and the only equation to compute is (10.28).
Putting in the values naturally gives the second-order centered divided-difference formula as it was computed in Example 10.1:

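$$D_1\!\left(\frac{h}{2^0}\right) = \frac{f\!\left(x + \frac{h}{2^0}\right) - f\!\left(x - \frac{h}{2^0}\right)}{2\,\frac{h}{2^0}} = \frac{f(1) - f(0)}{2h} = \frac{4 - 0}{2 \times 0.5} = 4.00\ \mathrm{s/L}$$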
At the next iteration, j = 1 and k ∈ {1, 2}. There are now two equations to
compute, again one of which was already computed in Example 10.1:

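$$D_1\!\left(\frac{h}{2^1}\right) = \frac{f\!\left(x + \frac{h}{2^1}\right) - f\!\left(x - \frac{h}{2^1}\right)}{2\,\frac{h}{2^1}} = \frac{f(0.75) - f(0.25)}{2 \times 0.25} = \frac{3.45 - 1.16}{2 \times 0.25} = 4.58\ \mathrm{s/L}$$

$$D_2\!\left(\frac{h}{2^1}\right) = \frac{4^1 D_1\!\left(\frac{h}{2^1}\right) - D_1\!\left(\frac{h}{2^0}\right)}{4^1 - 1} = \frac{4 \times 4.58 - 4.00}{3} = 4.77\ \mathrm{s/L}$$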

The relative error between the derivative approximations of these two iterations is:

$$E = \left|\frac{4.77 - 4.00}{4.77}\right| = 16.1\,\%$$
As expected, the value of $D_2$ in this new iteration is computed from two
values for $D_1$, one from the previous iteration and one computed just this
iteration. If another iteration were computed, $D_3$ would likewise be evaluated
using this value of $D_2$ and one computed in the third iteration.
However, given the information in the problem statement, this second
iteration is the maximum that can be computed. The final result is thus at k = 2
and:

$$D_{\mathrm{exact}} = D_2\!\left(\frac{h}{2^1}\right) + O(h^4) = 4.77\ \mathrm{s/L} + O(h^4)$$

Note that this is the same $O(h^4)$ approximation of the derivative that was computed with
the fourth-order centered divided-difference formula in Example 10.2, but
obtained using only the second-order centered divided-difference results of Example
10.1 and an iterative process to combine and refine them.
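
The two iterations of Example 10.4 can be checked with a short script. This is only a sketch working directly on the tabulated measurements; the dictionary `time_at` and the helper `d1` are hypothetical names introduced for illustration. With unrounded arithmetic, the finer centered difference evaluates to (3.45 − 1.16)/0.5 = 4.58 s/L.

```python
# Times (s) at which the reservoir reached each volume (L), from Example 10.4.
time_at = {0.0: 0.0, 0.25: 1.16, 0.5: 2.39, 0.75: 3.45, 1.0: 4.0}


def d1(x, step):
    """Second-order centered divided difference built from the measurements."""
    return (time_at[x + step] - time_at[x - step]) / (2 * step)


x, h = 0.5, 0.5
d1_coarse = d1(x, h)                # j = 0: uses f(1) and f(0)
d1_fine = d1(x, h / 2)              # j = 1: uses f(0.75) and f(0.25)
d2 = (4 * d1_fine - d1_coarse) / 3  # Eq. (10.27) with k = 2

print(f"{d1_coarse:.2f} {d1_fine:.2f} {d2:.2f}")
```

The dictionary lookup works here only because 0, 0.25, 0.5, 0.75 and 1 are exactly representable in binary floating point; for general sample positions an index-based layout would be safer.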

10.5

Second Derivatives

So far, this chapter has focused on approximating the first derivative of a system
being modelled. This is reasonable, as the rate of change of the parameters of a
system over time is often critically important to include in a complete model.
However, these rates of change are themselves often not constant, and modelling
them as such will lead the model to diverge from reality over time. To remedy
that, it is important to include their rate of change over time as well; in other words,
to compute higher derivatives. In engineering practice, the second derivative of
the system (modelling the rate of change of the rate of change of parameters) is the
one most often included, and the one this chapter will focus on, but the same
technique described here could be used to develop equations for third derivatives
and higher.
The technique for finding the nth derivative is the same as that for finding the
first derivative. Given a set of measurements of the system at equally spaced
intervals, expand the Taylor series approximations for each measurement, then
combine them with multiplications and subtractions to eliminate all non-zeroth-order
terms except the one of the same order as the desired derivative and the
highest-possible-order one for the error term.
Consider the case with one measurement before and after the target point. The
two third-order Taylor series approximations were expanded in Eq. (10.9), and they
were subtracted from each other to derive the second-order centered divided-difference formula to approximate the first derivative. But if the goal is to keep
the second derivative and cancel out the first, then the two series should be summed
together rather than subtracted. The result is:

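$$
\begin{aligned}
f(x_i + h) + f(x_i - h) &= \left(f(x_i) + f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} + \frac{f^{(3)}(x_i)h^3}{3!} + \frac{f^{(4)}(x_i)h^4}{4!}\right) \\
&\quad + \left(f(x_i) - f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} - \frac{f^{(3)}(x_i)h^3}{3!} + \frac{f^{(4)}(x_i)h^4}{4!}\right) \\
&= 2f(x_i) + \frac{2f^{(2)}(x_i)h^2}{2!} + \frac{2f^{(4)}(x_i)h^4}{4!} \\
f^{(2)}(x_i) &= \frac{f(x_i + h) + f(x_i - h) - 2f(x_i)}{h^2} + O(h^2)
\end{aligned}
\tag{10.29}
$$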

And this is the second-derivative second-order divided-difference formula. As
before, the formula has an error of $O(h^2)$. However, by summing up, this time it
is the odd-order terms of the series that cancel out in the centered formula, rather
than the even-order ones. Note also that the measurement at the target point, $f(x_i)$, is
needed in this formula. Mathematically, this corresponds to the zeroth-order term of
the series, which like other even-order terms was cancelled out when computing the
first derivative but is not anymore. Practically, it fits engineering intuition that
getting more information (the second derivative rather than the first) requires more
data (a third measurement). In general, using the centered divided-difference
formula to get the nth derivative of a measurement with $O(h^2)$ error will require
n + 1 points.
As with the first derivative, a fourth-order divided-difference formula can be
computed for the second derivative by using one more point before and one more
after the target point. Since the development of Eq. (10.29) has already shown that
the final formula is divided by $h^2$, and since the final error term of a fourth-order
formula after that division must be $h^4$, then the Taylor series approximations must
be expanded to the sixth order. The fifth-order series were already given in
Eq. (10.10), and the sixth-order term can be appended easily. Looking at
Eq. (10.10), it can be seen that all the odd-order terms will cancel each other out
provided that the two series one step away from the target point are summed
together with the same sign and multiplier, and that the two series two steps away
from the target point are also summed up together with the same sign and
multiplier:


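$$
\begin{aligned}
f(x_i + h) + f(x_i - h) &= 2f(x_i) + \frac{2f^{(2)}(x_i)h^2}{2!} + \frac{2f^{(4)}(x_i)h^4}{4!} + \frac{2f^{(6)}(x_i)h^6}{6!} \\
f(x_i + 2h) + f(x_i - 2h) &= 2f(x_i) + \frac{8f^{(2)}(x_i)h^2}{2!} + \frac{32f^{(4)}(x_i)h^4}{4!} + \frac{128f^{(6)}(x_i)h^6}{6!}
\end{aligned}
\tag{10.30}
$$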

This leaves the zeroth-order term, which is the measurement at the target point
and therefore available, the second-order term, which has the second derivative, the
sixth-order term, which will be the error term, and the fourth-order term, which
must be eliminated in order for the error to be the sixth-order term. The problem is
that the fourth-order term is positive in all four series, so it cannot be cancelled out
by adding them together, and it is 16 times greater in the two series two steps away.
The solution is to multiply the two series one step away by 16 to make their fourth-order term of the correct magnitude, and the two series two steps away by −1 to
ensure that the terms cancel out with the series one step away without affecting the
other terms eliminated by summation. The final result is:
$$f^{(2)}(x_i) = \frac{-f(x_i + 2h) + 16f(x_i + h) + 16f(x_i - h) - f(x_i - 2h) - 30f(x_i)}{12h^2} + O(h^4) \tag{10.31}$$
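
The two centered second-derivative formulae, Eqs. (10.29) and (10.31), can be compared numerically. The sketch below is illustrative only; it assumes a callable test function (the sine, whose second derivative is known exactly), a step of h = 0.1, and hypothetical function names.

```python
import math


def second_diff_2nd(f, x, h):
    """Second-order centered estimate of f''(x), Eq. (10.29)."""
    return (f(x + h) + f(x - h) - 2 * f(x)) / h ** 2


def second_diff_4th(f, x, h):
    """Fourth-order centered estimate of f''(x), Eq. (10.31)."""
    return (-f(x + 2 * h) + 16 * f(x + h) + 16 * f(x - h)
            - f(x - 2 * h) - 30 * f(x)) / (12 * h ** 2)


x, h = 1.0, 0.1
exact = -math.sin(x)  # second derivative of sin

err_2nd = abs(second_diff_2nd(math.sin, x, h) - exact)
err_4th = abs(second_diff_4th(math.sin, x, h) - exact)
print(err_2nd, err_4th)
```

With the same step size, the five-point formula should give an error several orders of magnitude smaller than the three-point one, in line with the $O(h^4)$ versus $O(h^2)$ error terms.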

The same process can also be used to devise forward and backward formulae.
Recall that the second-order backward divided-difference formula for the first
derivative was estimated from two past measurements of the system. The discussion about the centered divided-difference formulae has already shown that an
additional point is needed to estimate the second derivative to the same error
value, as well as an additional order term in the Taylor series approximation to
account for the division by $h^2$. Consequently, three past measurements will be
needed for a second-derivative second-order backward divided-difference formula,
and the corresponding Taylor series approximations will need to be expanded to the
fourth order, as such:
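$$
\begin{aligned}
f(x_i - h) &= f(x_i) - f^{(1)}(x_i)h + \frac{f^{(2)}(x_i)h^2}{2!} - \frac{f^{(3)}(x_i)h^3}{3!} + \frac{f^{(4)}(x_i)h^4}{4!} \\
f(x_i - 2h) &= f(x_i) - 2f^{(1)}(x_i)h + \frac{4f^{(2)}(x_i)h^2}{2!} - \frac{8f^{(3)}(x_i)h^3}{3!} + \frac{16f^{(4)}(x_i)h^4}{4!} \\
f(x_i - 3h) &= f(x_i) - 3f^{(1)}(x_i)h + \frac{9f^{(2)}(x_i)h^2}{2!} - \frac{27f^{(3)}(x_i)h^3}{3!} + \frac{81f^{(4)}(x_i)h^4}{4!}
\end{aligned}
\tag{10.32}
$$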

Clearly, cancelling out the first and third-order terms will require more than a
simple addition as was the case with the centered divided-difference formulae.
However, this can be done in a simple methodical way, by starting with $f(x_i - 3h)$,
the formula with the largest coefficients multiplying terms, and figuring out the
multiple of $f(x_i - 2h)$ needed to eliminate the largest coefficients. It will not be
exact, but it should be rounded up, and then $f(x_i - h)$ can be used to cancel out the
remainders. In the case of Eq. (10.32), the third-order term of $f(x_i - 3h)$ is 3.4 times
larger than that of $f(x_i - 2h)$, so rounding up, the latter series will be multiplied by
4 and subtracted:
$$f(x_i - 3h) - 4f(x_i - 2h) = -3f(x_i) + 5f^{(1)}(x_i)h - \frac{7f^{(2)}(x_i)h^2}{2!} + \frac{5f^{(3)}(x_i)h^3}{3!} + \frac{17f^{(4)}(x_i)h^4}{4!} \tag{10.33}$$

This leaves five times the first-order term and five times the third-order term. The
series $f(x_i - h)$ will thus need to be multiplied by 5, and added to the other two to
cancel out these terms:
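$$f(x_i - 3h) - 4f(x_i - 2h) + 5f(x_i - h) = 2f(x_i) - \frac{2f^{(2)}(x_i)h^2}{2!} + \frac{22f^{(4)}(x_i)h^4}{4!} \tag{10.34}$$

$$f^{(2)}(x_i) = \frac{-f(x_i - 3h) + 4f(x_i - 2h) - 5f(x_i - h) + 2f(x_i)}{h^2} + O(h^2) \tag{10.35}$$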

The second-order forward divided-difference formula is the same but with the steps taken in the
opposite direction, and can be derived using the same process:

$$f^{(2)}(x_i) = \frac{-f(x_i + 3h) + 4f(x_i + 2h) - 5f(x_i + h) + 2f(x_i)}{h^2} + O(h^2) \tag{10.36}$$

Example 10.5
A robot is observed moving in a straight line. It starts off at the 8 m mark, and
is measured every second:
Time t (s)          0    1    2    3    4
Position f(t) (m)   8   16   34   62  100

Estimate its current acceleration, at time 4 s.

Solution
Acceleration is the second derivative of position. Since only past measurements and the current-time measurement are available, the second-order
backward divided-difference formula can be used. Applying Eq. (10.35)
gives:
$$f^{(2)}(4) = \frac{-f(1) + 4f(2) - 5f(3) + 2f(4)}{1^2} = -16 + 4 \times 34 - 5 \times 62 + 2 \times 100 = 10\ \mathrm{m/s^2}$$
To verify this result, note that the equation modelling the robot's position is:
$$f(t) = 5t^2 + 3t + 8$$
The second derivative of this equation is trivial to compute, and evaluated at
t = 4 it does give 10 m/s².
The speed of the robot at t = 4 has already been found to be 43 m/s in
Example 10.3. Not modelling acceleration, at t = 5 the position would be
assumed to be 143 m and the speed still 43 m/s. But now that the model does
include acceleration, the speed and position at t = 5 will be found to be 53 m/s
and 148 m respectively. This shows the importance of including not only the
rate of change of parameters, but also the second derivative, the rate of change of the rate
of change, in engineering models.
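
The numbers of Example 10.5 can be reproduced with the sketch below. It is illustrative only: the function name is hypothetical, and the prediction at t = 5 s assumes constant acceleration over the next second, which holds for this robot because its position is a degree-2 polynomial.

```python
def backward_second_diff(samples, i, h):
    """Second-order backward estimate of f''(x_i), Eq. (10.35)."""
    return (-samples[i - 3] + 4 * samples[i - 2]
            - 5 * samples[i - 1] + 2 * samples[i]) / h ** 2


positions = [8, 16, 34, 62, 100]  # m, at t = 0, 1, 2, 3, 4 s
speed = 43.0                      # m/s at t = 4 s, from Example 10.3
accel = backward_second_diff(positions, 4, 1.0)

dt = 1.0
# Prediction for t = 5 s without and with the acceleration term.
pos_no_accel = positions[4] + speed * dt
pos_with_accel = positions[4] + speed * dt + 0.5 * accel * dt ** 2
speed_with_accel = speed + accel * dt

print(accel, pos_no_accel, pos_with_accel, speed_with_accel)
```

The 5 m gap between the two predicted positions after a single second illustrates how quickly a model without the second derivative drifts from the real system.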

10.6

Unevenly Spaced Measurements

All the formulae seen so far have one thing in common: they require measurements
taken at equal intervals before or after the target point at which the derivative is
required. Unfortunately, such measurements may not always be available. They
might have been recorded irregularly because of equipment failure, or lost to a data
storage failure, bad record-keeping, or simple human negligence. Another approach
will be needed to deal with such cases.
Given measurements at irregular intervals, one simple option is to interpolate a
polynomial that fits these measurements using any of the techniques learned in
Chap. 6, and then simply compute the derivative of that polynomial. In fact, there's
an even better option, namely to include the derivative in the interpolation formula
and thus to interpolate the derived equation directly. This can be done easily
starting from the Lagrange polynomial formula:


$$f(x) = \sum_{i=0}^{n-1} f(x_i)\,\frac{(x - x_0)\cdots(x - x_{i-1})(x - x_{i+1})\cdots(x - x_{n-1})}{(x_i - x_0)\cdots(x_i - x_{i-1})(x_i - x_{i+1})\cdots(x_i - x_{n-1})} \tag{10.37}$$

To interpolate the derivative instead, take the derivative of the Lagrange formula
with respect to x. This is in fact easier than it looks, since x only appears in the
numerator:
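$$f^{(1)}(x) = \sum_{i=0}^{n-1} f(x_i)\,\frac{\frac{d}{dx}\left[(x - x_0)\cdots(x - x_{i-1})(x - x_{i+1})\cdots(x - x_{n-1})\right]}{(x_i - x_0)\cdots(x_i - x_{i-1})(x_i - x_{i+1})\cdots(x_i - x_{n-1})} \tag{10.38}$$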

This is for the first derivative, but higher derivatives can be obtained in the same
way. The interpolation method then works as it did back in Chap. 6: for each of the
n measurements available, compute the polynomial that results from the multiplications in the numerator, derive it, and sum it with the other polynomials from the
other measurements to get the derivative equation. That equation can then be
evaluated at any point of interest within the interpolation interval.
Example 10.6
A robot is observed moving in a straight line. It has been measured at the
following positions:
Time t (s)           1    3    4
Position f(t) (m)   16   62  100

Estimate its current speed, at time 4 s.

Solution
Speed is the first derivative of position. However, the measurements available
do not make it possible to use any of the divided-difference formulae. Instead,
Eq. (10.38) can be used to interpolate the speed equation.
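$$
\begin{aligned}
f^{(1)}(t) &= 16\,\frac{\frac{d}{dt}\left[(t - 3)(t - 4)\right]}{(1 - 3)(1 - 4)} + 62\,\frac{\frac{d}{dt}\left[(t - 1)(t - 4)\right]}{(3 - 1)(3 - 4)} + 100\,\frac{\frac{d}{dt}\left[(t - 1)(t - 3)\right]}{(4 - 1)(4 - 3)} \\
&= 16\,\frac{\frac{d}{dt}\left(t^2 - 3t - 4t + 12\right)}{(-2)(-3)} + 62\,\frac{\frac{d}{dt}\left(t^2 - t - 4t + 4\right)}{(2)(-1)} + 100\,\frac{\frac{d}{dt}\left(t^2 - t - 3t + 3\right)}{(3)(1)} \\
&= 16\,\frac{2t - 7}{6} + 62\,\frac{2t - 5}{-2} + 100\,\frac{2t - 4}{3} \\
&= 10t + 3
\end{aligned}
$$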

This formula can then be used to compute the speed of the robot at any
time between the interpolation bounds, from t = 1 s to t = 4 s. At the requested
time of t = 4 s, the speed is 43 m/s. From Example 10.3, this is known to be
the correct result.
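
Equation (10.38) can also be evaluated numerically without expanding the polynomials by hand. The sketch below is one possible approach, not the text's own procedure: it applies the product rule to each numerator, so that the derivative of a product of factors is the sum of the products obtained by leaving out one factor at a time. The function name is hypothetical.

```python
def lagrange_derivative(xs, ys, x):
    """Evaluate the derivative of the Lagrange interpolating polynomial at x.

    Implements Eq. (10.38) using the product rule on each numerator.
    """
    n = len(xs)
    total = 0.0
    for i in range(n):
        denom = 1.0
        for j in range(n):
            if j != i:
                denom *= xs[i] - xs[j]
        # d/dx of prod_{j != i}(x - x_j) = sum over m of prod_{j != i, m}(x - x_j)
        numer_deriv = 0.0
        for m in range(n):
            if m == i:
                continue
            term = 1.0
            for j in range(n):
                if j != i and j != m:
                    term *= x - xs[j]
            numer_deriv += term
        total += ys[i] * numer_deriv / denom
    return total


times = [1.0, 3.0, 4.0]
positions = [16.0, 62.0, 100.0]
speed_at_4 = lagrange_derivative(times, positions, 4.0)
print(speed_at_4)
```

For the three unevenly spaced robot measurements, the value at t = 4 s agrees with 10t + 3 = 43 m/s up to rounding.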

10.7

Inaccurate Measurements

The divided-difference formulae studied in this chapter all estimate the derivative
from measurements of a system. So far, these measurements have been assumed to
be accurate, and have been used as such. But empirical measurements taken in
practice will often have measurement errors, due to inaccurate instrumentation and
handling errors. Worse, differentiation can be very unstable in the presence of this
noise: the errors get added together and amplified.
Consider the robot tracking data of Example 10.3. Given exact data, the derivative can be computed at any of the five times using the backward, centered, or
forward divided-difference formulae, as in Table 10.1:
However, small errors in measurements can have a drastic impact. Table 10.2
runs through the example again, this time introducing 1 to 4 m of errors on the
position measurements. Notice how this error is amplified dramatically in the
derivatives:
To further illustrate, Fig. 10.3 compares the real and noisy position measurements, and the real and noisy derivative estimations. A visual inspection of that
figure confirms how even a small error in measurements can cause errors in the
derivative estimation that are not only much larger in amplitude, but also fluctuate
wildly.
Clearly, the divided-difference formulae should be avoided in cases such as this
one. An alternative solution is to compute a linear regression on the data, as was
learned in Chap. 6, to obtain the best-fitting polynomial that goes through the data.

Table 10.1 Robot speed given exact position measurements

Time t (s)             0    1    2    3    4
Position f(t) (m)      8   16   34   62  100
Speed f^(1)(t) (m/s)   3   13   23   33   43

Table 10.2 Robot speed given noisy position measurements

Time t (s)             0      1      2      3      4
Position f(t) (m)      7      17     36     60     104
Position error (%)     12.5   6.3    5.9    3.2    4.0
Speed f^(1)(t) (m/s)   5.5    14.5   20.6   34     54
Speed error (%)        83.3   11.5   10.4   3.0    25.6


Fig. 10.3 Position (left) and speed (right) using exact values (blue) and noisy values (red)

That polynomial can then be derived and used to estimate the derivative value at
any point within its interval.
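
The noise amplification reported in Table 10.2 can be checked numerically. The sketch below is an illustration; the choice of formula at each time (forward at t = 0, second-order centered at t = 1 and t = 3, fourth-order centered at t = 2, backward at t = 4) is an assumption inferred from the tabulated values rather than something the text states.

```python
exact = [8, 16, 34, 62, 100]  # exact positions (m) at t = 0, 1, 2, 3, 4 s
noisy = [7, 17, 36, 60, 104]  # measured positions with noise


def speeds(p, h=1.0):
    """Divided-difference speed estimates at each of the five sample times."""
    return [
        (-p[2] + 4 * p[1] - 3 * p[0]) / (2 * h),          # forward, t = 0
        (p[2] - p[0]) / (2 * h),                          # centered, t = 1
        (-p[4] + 8 * p[3] - 8 * p[1] + p[0]) / (12 * h),  # 4th-order centered, t = 2
        (p[4] - p[2]) / (2 * h),                          # centered, t = 3
        (p[2] - 4 * p[3] + 3 * p[4]) / (2 * h),           # backward, t = 4
    ]


true_speeds = speeds(exact)
noisy_speeds = speeds(noisy)
for t, (v, w) in enumerate(zip(true_speeds, noisy_speeds)):
    print(f"t = {t} s: exact {v:.1f} m/s, noisy {w:.1f} m/s, "
          f"error {100 * abs(w - v) / v:.1f} %")
```

Position errors of a few percent turn into speed errors of up to tens of percent, because each formula subtracts noisy values of similar size and divides the result by a small step.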
Example 10.7
A robot is observed moving in a straight line. It has been measured, with
noise, at the following positions:
Time t (s)          0    1    2    3    4
Position f(t) (m)   7   17   36   60  104

Estimate its current speed, at time 4 s.

Solution
Speed is the first derivative of position. However, since the measurements are
noisy, a divided-difference formula cannot be used. Instead, use the
Vandermonde method to compute a linear regression of the data. Plotting
the data points, as was done in Fig. 10.3, shows that they appear to follow a
curve rather than a straight line, so compute a regression for a degree-2 polynomial. The
Vandermonde matrix and solution vector will be
$$V = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} 7 \\ 17 \\ 36 \\ 60 \\ 104 \end{bmatrix}$$

and the Vandermonde matrix-vector system to solve will thus be:

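$$V^T V c = V^T y$$

$$\begin{bmatrix} 5 & 10 & 30 \\ 10 & 30 & 100 \\ 30 & 100 & 354 \end{bmatrix} \begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} 224 \\ 685 \\ 2365 \end{bmatrix}$$

$$\begin{bmatrix} c_0 \\ c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} 7.83 \\ 2.84 \\ 5.21 \end{bmatrix}$$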

This gives the regressed polynomial for the position of the robot:
$$f(t) = 5.21t^2 + 2.84t + 7.83$$
which is trivial to differentiate to obtain the equation for the speed. Note that this
regressed polynomial is very close to the real polynomial that generated the
correct values of the example, which was:
$$f(t) = 5t^2 + 3t + 8$$
Using the derivative of the regressed equation makes it possible to compute
the speed at all five measurement times:
Time t (s)             0     1      2      3      4
Speed f^(1)(t) (m/s)   2.8   13.3   23.7   34.1   44.5
Speed error (%)        6.7   2.3    3.0    3.3    3.5

These results are clearly much more accurate than those obtained using the
divided-difference formula in Table 10.2. To further illustrate the difference,
Fig. 10.3 is taken again, this time to include the regressed estimate of the
speed (purple dashed line) in addition to the actual value (blue line) and
divided-difference estimate (red dashed line). It can be seen that, while the
divided-difference estimate fluctuates wildly, the regressed estimate remains
linear like the actual derivative, and very close to it in value, even overlapping
with it for half a second.
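
The regression of Example 10.7 can be sketched with NumPy as follows. This is an illustration under the assumption that NumPy is available; it solves the normal equations directly, as in the example, although a least-squares routine would be the more usual choice for larger or ill-conditioned problems.

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([7.0, 17.0, 36.0, 60.0, 104.0])  # noisy positions (m)

# Vandermonde matrix with columns 1, t, t^2.
V = np.vander(t, 3, increasing=True)

# Solve the normal equations V^T V c = V^T y for c = (c0, c1, c2).
c = np.linalg.solve(V.T @ V, V.T @ y)

# The derivative of c0 + c1 t + c2 t^2 is c1 + 2 c2 t.
speed = c[1] + 2 * c[2] * t

print(np.round(c, 2))
print(np.round(speed, 1))
```

Because the coefficients are not rounded to two decimals before differentiating, the last speed may differ in its final digit from the value tabulated in the example.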

10.8

Engineering Applications

Differentiation is a mathematical tool that allows engineers to measure the rate of
change of a parameter of their systems, such as the rate of change in position
(speed) of an object, the rate of change of its speed (acceleration), or the rate of
change of its acceleration (jerk). This relationship was discussed in Sect. 10.1. But
there are countless other engineering applications where differentiation can be
useful, and in fact it comes up in many common engineering equations. These
include:
Fourier's law of heat conduction, which models the heat transfer rate through a
material as:

$$q_x = -k\frac{dT}{dx} \tag{10.39}$$

where $q_x$ is the heat flux in orientation x, k is the material's conductivity, and dT/dx
is the first derivative of the temperature over orientation x.
Fick's laws of diffusion model the movement of particles of a substance from a
region of higher concentration to a region of lower concentration. Fick's first law
models the diffusion flux in orientation x, $J_x$, as:

$$J_x = -D\frac{d\phi}{dx} \tag{10.40}$$

where D is the diffusion coefficient of the medium, and $d\phi/dx$ is the first derivative
of the concentration $\phi$ over orientation x. Fick's second law models the rate of change
of the concentration over time, $d\phi/dt$, in relationship to the second derivative of the
concentration over orientation x:

$$\frac{d\phi}{dt} = D\frac{d^2\phi}{dx^2} \tag{10.41}$$

The electromotive force of an electrical source can be measured by the rate of
change of its magnetic flux over time, $d\Phi_B/dt$, according to Faraday's law of
induction:

$$\varepsilon = -\frac{d\Phi_B}{dt} \tag{10.42}$$

The current-voltage relationships of electrical components are also defined through derivatives over time. For a capacitor with capacitance C, that
relationship is:

$$I = C\frac{dV}{dt} \tag{10.43}$$

while an inductor of inductance L has the relationship:

$$V = L\frac{dI}{dt} \tag{10.44}$$

This means that the current going through a capacitor is proportional to the rate
of change of its voltage over time, while the voltage across an inductor is proportional to the rate of change of the current going through it over time.
In all these examples, as in many others, a value of the system is defined and
modelled in relationship to the rate of change of another related parameter. If this
parameter can be observed and measured, then the methods seen in this chapter can
be used to approximate its rate of change.
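
As a minimal sketch of this idea, the capacitor relationship of Eq. (10.43) can be combined with the second-order centered divided-difference formula of this chapter. The capacitance and the voltage readings below are made-up illustrative values, and the function name `centered_difference` is a hypothetical helper, not something defined in the text.

```python
def centered_difference(samples, h, i):
    """Second-order centered divided-difference estimate of the
    derivative at samples[i], for equally spaced samples of step h."""
    return (samples[i + 1] - samples[i - 1]) / (2.0 * h)


# Hypothetical data: voltage across a capacitor sampled every 0.1 s.
capacitance = 2.0e-3                        # assumed capacitance, in farads
voltages = [5.00, 5.12, 5.26, 5.42, 5.60]   # made-up measurements, in volts
h = 0.1                                     # sampling step, in seconds

dv_dt = centered_difference(voltages, h, 2)  # rate of change at the middle sample
current = capacitance * dv_dt                # Eq. (10.43): I = C dV/dt
print(f"dV/dt = {dv_dt:.3f} V/s, I = {current:.6f} A")
```

The same pattern would apply to the other laws listed above: measure the parameter, estimate its derivative, then multiply by the relevant constant.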

10.9 Summary

Engineering models of systems that are not in a steady-state are incomplete if they
only include a current snapshot of the values of system parameters. To be complete
and accurate, it is necessary to include information about the rate of change of these
parameters. With this addition, models are not static pictures but they change,
move, or grow, in ways that reflect the changes of the real systems they represent.
If a mathematical model of the system is already available, then it is straightforward
to compute its derivative and include it in the model. This chapter has focused on
the case where such a mathematical model is not available, and presented methods
to estimate the derivative using only observed measurements of the system.


Table 10.3 Summary of derivative methods

| Method | Requires | Error |
|---|---|---|
| Second-order centered divided-difference formula | 1 point before and 1 point after, equally spaced | $O(h^2)$ |
| Second-order backward divided-difference formula | Current point and 2 points before, equally spaced | $O(h^2)$ |
| Second-order forward divided-difference formula | Current point and 2 points after, equally spaced | $O(h^2)$ |
| Fourth-order centered divided-difference formula | 2 points before and 2 points after, equally spaced | $O(h^4)$ |
| Richardson extrapolation | n times the number of points of the divided-difference formula used with it | $O(h^{2n})$ |
| Interpolation method | n points, unequally spaced | See Chap. 6 |
| Regression method | n points, noisy | See Chap. 6 |

If a set of error-free and equally spaced measurements are available, then one of
the divided-difference formulae can be used. The backward, forward, or centered
formulae can be used in the case that measurements are available before, after, or
around the target point at which the derivative is needed, and more measurements
can be used in the formulae to improve the error rate. This chapter presented in
detail how new divided-difference formulae can be developed from Taylor series
approximations, so whichever set of points are available, it will always be possible
to create a custom divided-difference formula to fit them and to know its error rate.
And in addition to the divided-difference formulae, Richardson extrapolation was
presented as a means to improve the error rate of the derivative estimate.
If measurements are available but they are noisy or unevenly spaced, then the
divided-difference formulae cannot be used. Two alternatives were presented to
deal with these cases. If the measurements are error-free but unevenly spaced, then
an interpolation method can be used to model the derivative of the system. And if
the measurements are noisy, whether they are evenly or unevenly spaced, then a
regression method should be used to find the best-fitting mathematical model of the
system, and the derivative of that model can then be computed. Table 10.3 summarizes all the methods learned in this chapter.

10.10 Exercises

1. The charge of a capacitor is measured every 0.1 s. At the following five measurement times: {7.2, 7.3, 7.4, 7.5, 7.6 s}, the charge is measured at {0.00242759 F,
0.00241500 F, 0.00240247 F, 0.00239001 F, 0.00237761 F} respectively. Find the
rate of change of the charge at 7.4 s using the second-order centered divided-difference formula.


2. The rotation of a satellite is measured at times {3.2, 3.3, 3.4, 3.5, 3.6 s}, and the
measured angles are {1.05837, 1.15775, 1.25554, 1.35078, 1.44252 rad}
respectively. Approximate the rate of change of the angle at time 3.4 s using
both the second-order and fourth-order centered divided-difference formulae.
3. Repeat exercise 2 using the second-order backward divided-difference
formula.
4. Use $h = 0.5$ to approximate the derivative of $f(x) = \tan(x)$ at $x = 1$ to a relative
error of 0.00001 using the centered divided-difference formula.
5. Repeat exercise 4 but for the function $f(x) = \sin(x)/x$.
6. Perform three iterations of Richardson extrapolation to estimate the derivative
of $f(x) = e^{x}$ at $x = 0$ starting with a step of $h = 1$, using the centered divided-difference formula.
7. Perform three iterations of Richardson extrapolation to estimate the derivative
of $f(x) = \sin^2(x)/x$ at $x = 5$ rad starting with a step of $h = 2$, using (a) the second-order centered divided-difference formula; (b) the forward divided-difference
formula; (c) the fourth-order centered divided-difference formula.
8. Repeat exercise 7 using the function $f(x) = \cos^{-1}(x)$ at $x = 2$ rad starting with a
step of $h = 0.5$. Perform 4 iterations.
9. A runner starts a 40-m sprint at time 0. He passes the 10-m mark after 2 s, the
20-m mark after 3 s, the 30-m mark after 4 s and reaches the finish line after
4.5 s. Estimate his speed at the middle of his run, after 2.25 s.
10. A 3 L container is getting filled. It is one-third filled after an hour, two-thirds
filled after 3 h, and full after 6 h. Determine the initial filling rate.

Chapter 11 Integration

11.1 Introduction

Chapter 10 has already introduced the need for differentiation and integration to
quantify change in engineering systems. Differentiation measures the rate of
change of a parameter, and integration conversely measures the changing value
of a given parameter. Chapter 10 demonstrated how important modelling change
is to ensure that engineering models accurately reflect reality.
Integration does have uses beyond measuring change in parameters. Integration
is mathematically the measure of an area under a curve. It can thus be used to model
and approximate forces, areas, volumes, and other quantities bounded geometrically. Consider for example an engineer who needs to model a river: possibly an
environmental engineer who needs to model water flow, or a civil engineer who is
doing preliminary work to design a dam or a bridge. In all cases, a complete model
will need to include the area of a cross-section of the river. So a boat is sent out with
a sonar, and it takes depth measurements at regular intervals, to generate a depth
map such as the one shown in Fig. 11.1. From these discrete measurements, it is
then possible to compute the cross-sectional area of the river. The process to do this
computation is an integral: the depth measurements can be seen as points on the
curve of a function, the straight horizontal surface of the water is the axis of the
graph, and the cross-sectional area is the area under the curve.
As with differentiation, computing an exact integral would be a simple calculus
problem if the equation of the system were known. This chapter, like Chap. 10, will
deal with the case where the equation is not known, and the only information
available is discrete measurements of the system. The formulae presented in this
chapter are all part of the set of Newton-Cotes rules for integration, the general
name for the family of formulae that approximate an integral value from a set of
equally spaced points, by interpolating a polynomial through these points and
computing its area. Most of this chapter will focus on closed Newton-Cotes rules,
which are closed in the sense that the first and last of the equally spaced points are
also the integration boundaries. However, the last method presented will be an open
Newton-Cotes rule, where the integration boundaries lie beyond the first and last of
the equally spaced points.

Fig. 11.1 Depth map of a cross-section of a river

Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_11

11.2 Trapezoid Rule

11.2.1 Single Segment

The trapezoid rule is the simplest way to approximate an integral, when only two
measurements of the system are available. It interpolates a polynomial between the
two points (in other words, a single straight diagonal line) and then computes the
area of the resulting trapezoid shape.
Suppose two measurements at samples x0 and x1, which have the measured
values f(x0) and f(x1) respectively. An interpolated polynomial would be a straight
diagonal line connecting the points (x0, f(x0)) and (x1, f(x1)). Then, the integral will
be the area under that straight line down to the x-axis, which is the same as the area
of a fictional trapezoid that connects the two aforementioned points and the two
points on the x-axis, (x0, 0) and (x1, 0). This situation is illustrated in Fig. 11.2. The
area of that trapezoid can be computed easily, using the simple geometric formula
of Eq. (11.1): it is the base width multiplied by the average height.
$$I = \int_{x_0}^{x_1} f(x)\,dx \approx (x_1 - x_0)\,\frac{f(x_0) + f(x_1)}{2} \tag{11.1}$$

While the trapezoid rule has an undeniable advantage in simplicity, its downside
is potentially a very high error. Indeed, it works by approximating the function
being modelled f(x) as a straight line between the measurements x0 and x1, and can
therefore be very wrong when that is not the case. The value of the integral will be
wrong by the area between the straight line and the real curve of the function, as in
Fig. 11.3. This graphical representation is a good way to visualize the error, but
unfortunately it does not help to compute it.

Fig. 11.2 Two measurements at x0 and x1 (left). A fictional trapezoid, the area of which
approximates the integral of the function from x0 to x1 (right)

Fig. 11.3 Integration error for the example of Fig. 11.2
An alternative way to understand the error of this formula is to recall that the
function f(x) is being modelled as a polynomial p(x) interpolated from two points,
and then the trapezoid method takes the integral of that interpolation. Consequently,
the integration error will be the integral of the interpolation error; and the interpolation error E(x) is one that was already learnt, back in Chap. 6. For an interpolation
from two points, the error is:
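$$E(x) = \frac{f^{(2)}(x_\varepsilon)}{2}\,(x - x_0)(x - x_1) \tag{11.2}$$

for a point $x_\varepsilon$ in the interval $[x_0, x_1]$. Then the integration error will be obtained by
taking the integral of the formula:

$$\int_{x_0}^{x_1} f(x)\,dx = \int_{x_0}^{x_1} \bigl(p(x) + E(x)\bigr)\,dx = \int_{x_0}^{x_1} p(x)\,dx + \int_{x_0}^{x_1} E(x)\,dx \tag{11.3}$$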

The integral of p(x) is obtained from the trapezoid rule of Eq. (11.1). For E(x),
Eq. (11.2) can be substituted in to compute the integral:
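$$\begin{aligned}
\int_{x_0}^{x_1} f(x)\,dx &= (x_1 - x_0)\,\frac{f(x_0) + f(x_1)}{2} + \int_{x_0}^{x_1} \frac{f^{(2)}(x_\varepsilon)}{2}\,(x - x_0)(x - x_1)\,dx \\
&= (x_1 - x_0)\,\frac{f(x_0) + f(x_1)}{2} + \frac{f^{(2)}(x_\varepsilon)}{2} \int_{x_0}^{x_1} \bigl(x^2 - x(x_0 + x_1) + x_0 x_1\bigr)\,dx \\
&= (x_1 - x_0)\,\frac{f(x_0) + f(x_1)}{2} + \frac{f^{(2)}(x_\varepsilon)}{2} \left[\frac{x^3}{3} - \frac{x^2}{2}(x_0 + x_1) + x\,x_0 x_1\right]_{x_0}^{x_1} \\
&= (x_1 - x_0)\,\frac{f(x_0) + f(x_1)}{2} + \frac{f^{(2)}(x_\varepsilon)}{2} \left[-\frac{(x_1 - x_0)^3}{6}\right] \\
&= (x_1 - x_0)\,\frac{f(x_0) + f(x_1)}{2} - \frac{f^{(2)}(x_\varepsilon)\,(x_1 - x_0)^3}{12}
\end{aligned} \tag{11.4}$$
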
Equation (11.4) gives a formula for the error of the trapezoid method, but it does
require an extra point $x_\varepsilon$ within the interpolation interval in order to compute it. If
such a point is not available, the formula can still be used by substituting the exact
value of the second derivative at $x_\varepsilon$ with the average value of the second derivative
within the integration interval:

$$\int_{x_0}^{x_1} f(x)\,dx = (x_1 - x_0)\,\frac{f(x_0) + f(x_1)}{2} - \frac{\bar{f}^{(2)}\,(x_1 - x_0)^3}{12} \tag{11.5}$$

And if this average second derivative is also not available, it can be estimated from
the first derivative of the function:
$$\bar{f}^{(2)} = \frac{\int_{x_0}^{x_1} f^{(2)}(x)\,dx}{x_1 - x_0} = \frac{f^{(1)}(x_1) - f^{(1)}(x_0)}{x_1 - x_0} \tag{11.6}$$
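
Equations (11.1), (11.5) and (11.6) can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the first derivatives at the two bounds are assumed to be available (here they are known only because the test function is known).

```python
def trapezoid_single(x0, x1, f0, f1):
    """Eq. (11.1): base width times the average of the two measured heights."""
    return (x1 - x0) * (f0 + f1) / 2.0


def trapezoid_error_estimate(x0, x1, df0, df1):
    """Error term of Eq. (11.5), with the average second derivative taken
    from the first derivatives at the two bounds as in Eq. (11.6)."""
    mean_second_derivative = (df1 - df0) / (x1 - x0)
    return -mean_second_derivative * (x1 - x0) ** 3 / 12.0


# Illustrative check with f(x) = x**2 on [0, 1], whose exact integral is 1/3.
# f(0) = 0, f(1) = 1, and the derivative f'(x) = 2x gives f'(0) = 0, f'(1) = 2.
approximation = trapezoid_single(0.0, 1.0, 0.0, 1.0)
correction = trapezoid_error_estimate(0.0, 1.0, 0.0, 2.0)
print(approximation, correction, approximation + correction)
```

For this quadratic test function the second derivative is constant, so the average used in Eq. (11.5) is exact and the corrected value matches the true integral; for a general function the correction is only an estimate.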


11.2.2 Composite Trapezoid Rule

The two-point trapezoid rule has some important advantages. It is a simple formula
to compute, and the fact that it only requires two measurements of the system is
beneficial when measurements are sparse and getting more is difficult. Its major
problem is that it also incurs a large error in cases where the straight line between x0
and x1 is not a good representation of the function in that interval. To make things
worse, the method is not iterative and only depends on the integration bounds x0 and
x1, which means that the error could not be reduced even if more measurements
were available.
The solution to both these problems, if more measurements are available, is to
subdivide the integration interval into a set of smaller, nonoverlapping subintervals.
Then, compute the integral of each subinterval using the trapezoid rule, and sum
them all together to get the integral of the entire interval. Each individual subinterval will have a smaller error, as Fig. 11.4 illustrates. Intuitively, the reason this
works is because, at a smaller interval, a function can be more accurately approximated by a straight line, as was mentioned multiple times since Chap. 5. Consequently, the more intermediate points are available and the more subintervals are
computed, the more accurate the integration approximation will be.
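Provided a set of n measurements, where $x_0$ and $x_{n-1}$ are the integration
bounds, applying the trapezoid rule repeatedly on each of the $m = n - 1$ subsegments will give:

$$\begin{aligned}
\int_{x_0}^{x_{n-1}} f(x)\,dx &\approx (x_1 - x_0)\,\frac{f(x_0) + f(x_1)}{2} + (x_2 - x_1)\,\frac{f(x_1) + f(x_2)}{2} + \cdots + (x_{n-1} - x_{n-2})\,\frac{f(x_{n-2}) + f(x_{n-1})}{2} \\
&= \sum_{i=0}^{n-2} (x_{i+1} - x_i)\,\frac{f(x_i) + f(x_{i+1})}{2}
\end{aligned} \tag{11.7}$$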

Fig. 11.4 Trapezoid approximation of the integral of Fig. 11.2 with two points and one segment
(left), with three points and two segments (center), and with four points and three segments (right)


Moreover, if the measurements are equally spaced, then the length of all the
subsegments will be the same fraction of the length of the entire integration
interval:
$$x_{i+1} - x_i = \frac{x_{n-1} - x_0}{m} = h \tag{11.8}$$

And looking closely at Eq. (11.7), it can be seen that all measurements $f(x_i)$ will be
summed twice, except for the measurements at the two bounds, $f(x_0)$ and $f(x_{n-1})$.
Putting this observation and Eq. (11.8) into the formula of Eq. (11.7) gives the
composite trapezoid rule:

$$\int_{x_0}^{x_{n-1}} f(x)\,dx \approx \frac{h}{2}\left(f(x_0) + 2\sum_{i=1}^{n-2} f(x_i) + f(x_{n-1})\right) \tag{11.9}$$

Comparing to the equation for the trapezoid rule with one segment in the previous
section, it can be seen that Eq. (11.1) is only a simplification of Eq. (11.9) for the
special case where only two measurements at the integration bounds are available.
The pseudocode for the composite trapezoid rule is presented in Fig. 11.5. Like
Eq. (11.9), this code can also simplify for the one-segment rule, by setting the value
of the appropriate input variable.
    xL ← Input lower integration bound
    xU ← Input upper integration bound
    Segments ← Input number of segments
    h ← (xU - xL) / Segments
    Integral ← F(xL)
    x ← xL + h
    WHILE (x < xU)
        Integral ← Integral + 2 × F(x)
        x ← x + h
    END WHILE
    Integral ← Integral + F(xU)
    Integral ← Integral × h / 2
    RETURN Integral

    FUNCTION F(x)
        RETURN evaluation of the target function at point x
    END FUNCTION

Fig. 11.5 Pseudocode of the composite trapezoid rule

The error on the composite rule is the sum of the error on each two-point
subsegment, and the error of each subsegment can be computed using Eq. (11.4).
This means the error of the entire formula will be:

$$\int_{x_0}^{x_{n-1}} f(x)\,dx = \frac{h}{2}\left(f(x_0) + 2\sum_{i=1}^{n-2} f(x_i) + f(x_{n-1})\right) - \sum_{i=0}^{n-2} \frac{f^{(2)}(x_{\varepsilon i})\,(x_{i+1} - x_i)^3}{12} \tag{11.10}$$

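where $x_{\varepsilon i}$ is a point in the interval $[x_i, x_{i+1}]$. Substituting in Eq. (11.8) further
simplifies the equation to:

$$\int_{x_0}^{x_{n-1}} f(x)\,dx = \frac{h}{2}\left(f(x_0) + 2\sum_{i=1}^{n-2} f(x_i) + f(x_{n-1})\right) - \frac{(x_{n-1} - x_0)^3}{12m^3} \sum_{i=0}^{n-2} f^{(2)}(x_{\varepsilon i}) \tag{11.11}$$
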
This leaves $m = n - 1$ instances of the second derivative $f^{(2)}(x_{\varepsilon i})$ to evaluate, one for
each of the m segments. But recall that, when computing the error for the two-point
trapezoid, one approximation used was that the second derivative at any point in the
integration interval could be replaced by the average value of the second
derivative in the integration interval. Using the same assumption here makes it
possible to replace every instance of $f^{(2)}(x_{\varepsilon i})$ with the average:

$$\sum_{i=0}^{n-2} f^{(2)}(x_{\varepsilon i}) \approx m\,\bar{f}^{(2)} \tag{11.12}$$

The final equation and error is thus:

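$$\int_{x_0}^{x_{n-1}} f(x)\,dx = \frac{h}{2}\left(f(x_0) + 2\sum_{i=1}^{n-2} f(x_i) + f(x_{n-1})\right) - \frac{\bar{f}^{(2)}\,(x_{n-1} - x_0)^3}{12m^2} \tag{11.13}$$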

Notice that the error term is almost the same as it was for the two-point trapezoid
rule in Eq. (11.5), since $x_{n-1}$ is the upper integration bound as $x_1$ was back in
Eq. (11.5). The difference is that the error term is divided by $m^2$, the square of the number of
subsegments within the integration interval; that value was $m = 1$ in the case of
Eq. (11.5) when the entire integration interval was only one segment. It is however
also important to keep in mind that Eqs. (11.5) and (11.13) give approximations of
the absolute error, not exact values; if the value by which the integral approximation was wrong could be computed exactly, then it would be added to the approximation
to get the exact integral value! An error approximation is useful rather to design and
build safety margins into engineering systems. Equation (11.13) also demonstrates
that the error decreases quadratically with the number of segments. And since the number
of segments is inversely proportional to the interval width h in Eq. (11.8), this means the
trapezoid formulae have a big O error rate of $O(h^2)$.
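
The composite rule of Eq. (11.9) and its $O(h^2)$ behaviour can be illustrated with a short Python sketch. It works on a list of equally spaced measurements rather than on a function, which is an assumption made here to match the measurement-based setting of this chapter; the helper `sample` and the choice of sin(x) on [0, π] (exact integral 2) are purely illustrative.

```python
import math


def composite_trapezoid(values, h):
    """Eq. (11.9): bounds counted once, interior measurements counted twice."""
    if len(values) < 2:
        raise ValueError("at least two measurements are required")
    return h / 2.0 * (values[0] + 2.0 * sum(values[1:-1]) + values[-1])


def sample(f, a, b, segments):
    """Hypothetical helper: equally spaced samples of f on [a, b]."""
    h = (b - a) / segments
    return [f(a + i * h) for i in range(segments + 1)], h


exact = 2.0   # integral of sin(x) from 0 to pi
errors = []
for m in (4, 8, 16):
    values, h = sample(math.sin, 0.0, math.pi, m)
    errors.append(abs(exact - composite_trapezoid(values, h)))

# Each doubling of the number of segments should roughly quarter the error.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
print(errors, ratios)
```

Using an index-based loop to build the samples, instead of repeatedly adding h to x as in Fig. 11.5, avoids accumulating floating-point drift that could add or drop a point near the upper bound.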


Example 11.1
The relationship between the voltage V(t) and current I(t) that goes through a
capacitor over a period of time from $t_0$ to $t_{n-1}$ can be described by the following
integral, where C is the capacitance value and $V(t_0)$ is the initial voltage
across the capacitor:

$$V(t) = \frac{1}{C} \int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0)$$

For a supercapacitor of 1 F in a computer system, current measurements were
taken at computer boot-up, after 0.5 s, and after 1 s. The measurements are
given in the following table:

| Time t (s)       | 0 | 0.5  | 1 |
|------------------|---|------|---|
| Current I(t) (A) | 0 | 16.2 | 1 |

Compute the voltage across this supercapacitor using one-segment
and two-segment trapezoid rules, and determine the error of each
approximation.
Solution
To begin, note that, since the first measurement is at the moment the computer
boots up, the initial voltage V(t0) will be null. With the capacitance value of
1 F, the voltage will be only the result of the integral of the current.
The single-segment, two-point trapezoid rule can be computed using
Eq. (11.1):

$$V_1(t) = (1 - 0)\,\frac{0 + 1}{2} = 0.5\ \text{V}$$

While the two-segment, three-point trapezoid rule can be computed using
Eq. (11.9):

$$V_2(t) = \frac{(1 - 0)}{2}\,\frac{0 + 2 \times 16.2 + 1}{2} = 8.4\ \text{V}$$

To compute the error, it is necessary to know the average value of the second
derivative over this one-second interval. That information is not given
directly; however, the first derivative can be computed from the measurements using the methods learned in Chap. 10, and then Eq. (11.6) can be used
to compute the average second derivative. With three measurements available, the forward and backward divided-difference formulae can be applied:

$$I^{(1)}(0) = \frac{-I(1) + 4I(0.5) - 3I(0)}{2 \times 0.5} = \frac{-1 + 4 \times 16.2 - 3 \times 0}{2 \times 0.5} = 63.8\ \text{A/s}$$

$$I^{(1)}(1) = \frac{I(0) - 4I(0.5) + 3I(1)}{2 \times 0.5} = \frac{0 - 4 \times 16.2 + 3 \times 1}{2 \times 0.5} = -61.8\ \text{A/s}$$

Then Eq. (11.6) can be used to compute the average second derivative:
$$\bar{I}^{(2)} = \frac{I^{(1)}(1) - I^{(1)}(0)}{1 - 0} = -125.6\ \text{A/s}^2$$

Finally, the error can be computed using Eq. (11.13) as:

$$-\frac{\bar{I}^{(2)}\,(1 - 0)^3}{12} = 10.5\ \text{V}$$

for the single-segment trapezoid rule, and:

$$-\frac{\bar{I}^{(2)}\,(1 - 0)^3}{12 \times 2^2} = 2.6\ \text{V}$$

for the two-segment trapezoid. This is consistent with a quadratic error rate;
doubling the number of segments has quartered the error estimate on the
approximation.
There is a very large difference between the approximated integral value
with one and two segments. The reason for this difference is that the initial
and final measurements in the interval give a very poor picture of the current
going through the supercapacitor over that time. The current over that period
is illustrated in the figure below: it can be seen that, starting from zero, it rises
to a peak of almost 40 A before dropping again by the time the final
measurement is taken. The single-segment trapezoid, by using only the initial
and final measurements, ignores everything that happened in-between those
bounds. This corresponds to the straight-line interpolation and the purple area
in the figure below; it is clearly a poor representation of the current. The
two-segment trapezoid uses an additional measurement in the middle of the
time interval, and the resulting two interpolations, in red in the figure below
(and including the area in purple), while still inaccurate, nonetheless give a
much better approximation of the current over that time.

For reference, the actual integral value is 16.5 V. This means that the
single-segment trapezoid gave an approximation with an absolute error of
16 V; the error estimate of 10.5 V was in the correct range. Meanwhile, the
two-segment approximation had an absolute error of 8.1 V, three times higher
than the error estimate of 2.6 V, but still in the correct order of magnitude.
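
The arithmetic of this example can be retraced with a short Python sketch. It hard-codes the three measurements from the table, assumes C = 1 F and a null initial voltage as in the solution, and uses variable names chosen here for illustration only.

```python
times = [0.0, 0.5, 1.0]        # measurement times, in seconds
currents = [0.0, 16.2, 1.0]    # measured currents, in amperes
h = 0.5                        # spacing between measurements
span = times[-1] - times[0]    # width of the integration interval

# Trapezoid approximations, Eqs. (11.1) and (11.9).
one_segment = span * (currents[0] + currents[-1]) / 2.0
two_segments = h / 2.0 * (currents[0] + 2.0 * currents[1] + currents[2])

# Second-order forward and backward divided differences at the two bounds.
d_start = (-currents[2] + 4.0 * currents[1] - 3.0 * currents[0]) / (2.0 * h)
d_end = (currents[0] - 4.0 * currents[1] + 3.0 * currents[2]) / (2.0 * h)

# Average second derivative, Eq. (11.6).
mean_second = (d_end - d_start) / span


def error_estimate(segments):
    """Error term of Eq. (11.13) for the given number of segments."""
    return -mean_second * span ** 3 / (12.0 * segments ** 2)


print(one_segment, two_segments)
print(d_start, d_end, mean_second)
print(error_estimate(1), error_estimate(2))
```

Because the error term of Eq. (11.13) divides by the square of the number of segments, the two estimates printed on the last line differ by a factor of exactly four.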

11.3 Romberg Integration Rule

Back in Chap. 10, the Richardson extrapolation method was introduced as a means
to iteratively reduce the error rate of the derivative approximation without the risk
of subtractive cancellation that would come from taking the difference of two
points that are nearer and nearer together. To be sure, the trapezoid rule to
approximate the integral does not perform such a difference, and is therefore not
susceptible to subtractive cancellation. Nonetheless, an iterative method to improve
its approximation accuracy could be very beneficial. The Romberg integration rule
provides such a method.
Suppose two approximations of an integral I, both obtained using the composite
trapezoid rule as written out in Eq. (11.13) but with different numbers of segments.
The trapezoid approximation obtained using m0 segments will be noted I0,0, and the
other obtained using m1 segments will be noted I1,0.
$$\begin{aligned}
I &= I_{0,0} + E_0 \\
I &= I_{1,0} + E_1
\end{aligned} \tag{11.14}$$

Moreover, from the discussion in the previous section, it has been noted that
doubling the number of segments quarters the error. So if $m_1 = 2m_0$, then
$E_1 = E_0/4$. In that case, it becomes possible to combine the two equations of
Eq. (11.14) to express $E_0$ in terms of the two trapezoid approximations:

$$I_{1,0} - I_{0,0} = \frac{3E_0}{4} \tag{11.15}$$

Next, substituting the value of $E_0$ from Eq. (11.15) back into the $I_{0,0}$ line of Eq. (11.14)
gives:

$$I \approx I_{0,0} + \frac{4\left(I_{1,0} - I_{0,0}\right)}{3} = \frac{4I_{1,0} - I_{0,0}}{3} = I_{1,1} \tag{11.16}$$

This integral approximation is labelled $I_{1,1}$; the first subscript 1 is because the best
approximation it used from the previous iteration was computed from $m_1$ segments,
and the second subscript 1 is because it is the first iteration (iteration 0 being the
trapezoid rule iteration). Moreover, while the approximations of iteration 0 had an
error rate of $O(h^2)$, the approximation at iteration 1 has an error rate of $O(h^4)$. This
can be shown from the Taylor series, in a proof similar to that of the Richardson
extrapolation.
This first iteration can be generalized as follows: given two trapezoid approximations $I_{j,0}$ and $I_{j-1,0}$ computed from $2^j$ and $2^{j-1}$ segments respectively using
Eq. (11.13) with $O(h^2)$ error rate, then:

$$I_{j,1} = \frac{4I_{j,0} - I_{j-1,0}}{3} + O\!\left(h^{2+2}\right) \tag{11.17}$$

This process can then be repeated iteratively. For iteration k, the general version of
the Romberg integration rule is:
$$I_{j,k} = \frac{4^k I_{j,k-1} - I_{j-1,k-1}}{4^k - 1} + O\!\left(h^{2+2k}\right), \qquad k > 0 \tag{11.18}$$

As with the Richardson extrapolation, this process can be applied iteratively; at
each iteration k all new values $I_{j,k}$ are computed by combining successive pairs
of approximations from the previous iteration, $I_{j,k-1}$ and $I_{j-1,k-1}$. Each iteration
will count one less approximation value, until at the last iteration j = floor
(log2(m + 1)) + 1 there will be only one final $I_{j,k}$ with the highest values of j and
k, which will be the best possible approximation with $O(h^{2+2k})$. Alternatively, if
new values of the function and the trapezoid rule can be computed, the iterative
process can go on until a threshold relative error between the approximation of
the integral of two iterations is achieved (success condition), or a preset maximum number of iterations is reached (failure condition).
The pseudocode implementing this method, given in Fig. 11.6, also bears a lot of
similarity to the implementation of the Richardson extrapolation given in the
previous chapter. The Romberg integration rule will also build a table of values
and fill it left to right and top to bottom, where each new column added represents
an increment of k in Eq. (11.18) and will contain one less element than the previous
column, and each additional row represents a power of 2 of the number of segments,
or a value of j in Eq. (11.18).

    xL ← Input lower integration bound
    xU ← Input upper integration bound
    IterationMaximum ← Input maximum number of iterations
    ErrorMinimum ← Input minimum relative error
    RombergTable ← empty IterationMaximum × IterationMaximum table
    BestValue ← 0
    RowIndex ← 0
    WHILE (RowIndex < IterationMaximum)
        element at column 0, row RowIndex of RombergTable ← composite
            trapezoid of the target function from xL to xU using (2 to
            the power RowIndex) segments
        ColumnIndex ← 1
        WHILE (ColumnIndex <= RowIndex)
            element at column ColumnIndex, row RowIndex of RombergTable ←
                [ (4 to the power ColumnIndex) × (element at column
                ColumnIndex-1, row RowIndex of RombergTable) - (element at
                column ColumnIndex-1, row RowIndex-1 of RombergTable) ] /
                [ (4 to the power ColumnIndex) - 1 ]
            ColumnIndex ← ColumnIndex + 1
        END WHILE
        PreviousValue ← BestValue
        BestValue ← element at column RowIndex, row RowIndex of RombergTable
        CurrentError ← absolute value of [ (BestValue - PreviousValue) /
            BestValue ]
        IF (CurrentError <= ErrorMinimum)
            RETURN Success, BestValue
        END IF
        RowIndex ← RowIndex + 1
    END WHILE
    RETURN Failure

Fig. 11.6 Pseudocode of the Romberg integration rule
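
The pseudocode of Fig. 11.6 might be translated into Python along the following lines. This is a sketch rather than a reference implementation: it assumes the target function can be evaluated at will, it recomputes each trapezoid approximation from scratch for clarity, and it adds a guard against dividing by a zero best value, which the pseudocode does not include.

```python
import math


def composite_trapezoid(f, a, b, segments):
    """Composite trapezoid rule of Eq. (11.9) on a callable function."""
    h = (b - a) / segments
    total = f(a) + f(b)
    for i in range(1, segments):
        total += 2.0 * f(a + i * h)
    return total * h / 2.0


def romberg(f, a, b, max_iterations=10, tolerance=1e-10):
    """Romberg integration following Eq. (11.18); returns (value, table)."""
    table = []     # table[row][col] holds I_{row,col}
    best = 0.0
    for row in range(max_iterations):
        # Column 0: trapezoid rule with 2**row segments.
        current = [composite_trapezoid(f, a, b, 2 ** row)]
        for col in range(1, row + 1):
            factor = 4 ** col
            current.append(
                (factor * current[col - 1] - table[row - 1][col - 1])
                / (factor - 1)
            )
        table.append(current)
        previous, best = best, current[row]
        # Relative change between the two latest diagonal entries.
        if best != 0.0 and abs((best - previous) / best) <= tolerance:
            return best, table
    raise RuntimeError("Romberg integration did not reach the tolerance")


value, table = romberg(math.sin, 0.0, math.pi)
print(value, len(table))
```

Storing the table as a list of rows of growing length mirrors the triangular shape described above, since row j only has entries for columns 0 to j.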
Example 11.2
The relationship between the voltage V(t) and current I(t) that goes through a
capacitor over a period of time from $t_0$ to $t_{n-1}$ can be described by the following
integral, where C is the capacitance value and $V(t_0)$ is the initial voltage
across the capacitor:

$$V(t) = \frac{1}{C} \int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0)$$

For a supercapacitor of 1 F in a computer system, current measurements were
taken at computer boot-up, after 0.5 s, and after 1 s. The measurements are
given in the following table:

| Time t (s)       | 0 | 0.5  | 1 |
|------------------|---|------|---|
| Current I(t) (A) | 0 | 16.2 | 1 |

Compute the voltage across this supercapacitor using the best application of the Romberg integration rule possible.
Solution
To begin, note that, since the first measurement is at the moment the computer
boots up, the initial voltage V(t0) will be null. With the capacitance value of
1 F, the voltage will be only the result of the integral of the current.
At iteration 0, two approximations are possible. $I_{0,0}$ is the $2^0$-segment
trapezoid rule, which has been computed in Example 11.1 as:

$$I_{0,0} = (1 - 0)\,\frac{0 + 1}{2} = 0.5\ \text{V}$$

The second iteration 0 approximation is $I_{1,0}$, the $2^1$-segment trapezoid rule.
This has also been computed in Example 11.1:

$$I_{1,0} = \frac{(1 - 0)}{2}\,\frac{0 + 2 \times 16.2 + 1}{2} = 8.4\ \text{V}$$

At iteration 1, these two approximations can be combined using Eq. (11.18) to
get:

$$I_{1,1} = \frac{4^1 I_{1,0} - I_{0,0}}{4^1 - 1} = \frac{4 \times 8.4 - 0.5}{3} = 11.0\ \text{V}$$

For reference, the actual integral value is 16.5 V. As expected, the higher-iteration Romberg rule result generates a better approximation than either
of the trapezoid rule approximations it is computed from. In fact, $I_{0,0}$ has
a relative error of 97 % and $I_{1,0}$ has a relative error of 49 %, but $I_{1,1}$ has a
relative error of only 34 %.
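
When only a fixed set of measurements is available, as in this example, the Romberg table can be built directly from the samples. The sketch below assumes the number of segments is a power of two, so that coarser trapezoid approximations can reuse every second, fourth, and so on, measurement; the function names are hypothetical.

```python
def trapezoid_from_samples(values, h, stride):
    """Composite trapezoid rule using every `stride`-th measurement,
    so that the effective spacing is stride * h."""
    picked = values[::stride]
    return stride * h * (picked[0] + 2.0 * sum(picked[1:-1]) + picked[-1]) / 2.0


def romberg_from_samples(values, h):
    """Romberg table of Eq. (11.18) built from equally spaced measurements."""
    segments = len(values) - 1
    levels = segments.bit_length() - 1
    if segments < 1 or 2 ** levels != segments:
        raise ValueError("the number of segments must be a power of two")
    table = []
    for row in range(levels + 1):
        stride = segments // (2 ** row)     # row uses 2**row segments
        current = [trapezoid_from_samples(values, h, stride)]
        for col in range(1, row + 1):
            factor = 4 ** col
            current.append(
                (factor * current[col - 1] - table[row - 1][col - 1])
                / (factor - 1)
            )
        table.append(current)
    return table


# Measurements of this example: 0 A, 16.2 A and 1 A, taken 0.5 s apart.
table = romberg_from_samples([0.0, 16.2, 1.0], 0.5)
print(table)
```

With three measurements the table has only two rows, which is why iteration 1 is the best application of the rule possible here.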

11.4 Simpson's Rules

11.4.1 Simpson's 1/3 Rules

The basic idea of the Newton-Cotes integration rule, as explained back in
Sect. 11.1, is to interpolate a simple polynomial to approximate the function
being integrated, and then compute its area to approximate the function's integral.
Given two measurements of the function, it makes sense to interpolate a straight
line, and from that comes the trapezoid rule. Given three measurements, the
trapezoid rule uses them as two pairs of measurements and interpolates two straight
lines. But is that really the best option in that case?
As was demonstrated in Sect. 11.2, the major source of errors with the trapezoid
rule is that a straight line can be a very poor approximation of a complex function
over a large interval. The solution proposed by the composite trapezoid rule is to get
more measurements and use them to break up the interval, so that the straight-line
interpolation is more accurate over each smaller interval. But as was learned back in
Chap. 6, with more measurements available it is also possible to interpolate a
polynomial of a higher degree than a straight line, and that polynomial will be a
more accurate approximation of the real function. Of course, the downside of this
approach is that computing the area of the shape created by this more complex
polynomial will not be as easy as computing the area of a trapezoid. Consequently,
there must be a balance between using a more accurate higher-degree interpolation
and having a more complex computation of the integral.
With three measurements, instead of interpolating two straight lines, it is possible to interpolate a second-degree polynomial, a parabola. This will provide a more
accurate approximation of the function, while still being a simple enough shape to
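compute the area. This is the idea that underlies Simpson's 1/3 rule. The equation
is equally simple to obtain; it stems from approximating the integral of the
function between x0 and x2 as the integral of its fourth-order Taylor series approximation at x1:

$$\begin{aligned}
\int_{x_0}^{x_2} f(x)\,dx \approx{} & \int_{x_0}^{x_2} \left( f(x_1) + f^{(1)}(x_1)(x - x_1) + \frac{f^{(2)}(x_1)}{2!}(x - x_1)^2 + \frac{f^{(3)}(x_1)}{3!}(x - x_1)^3 \right) dx \\
& + \int_{x_0}^{x_2} \frac{f^{(4)}(x_1)}{4!}(x - x_1)^4\,dx
\end{aligned} \tag{11.19}$$

$$\begin{aligned}
\int_{x_0}^{x_2} f(x)\,dx \approx{} & f(x_1)(x_2 - x_0) + \left[ \frac{f^{(1)}(x_1)}{2}(x - x_1)^2 + \frac{f^{(2)}(x_1)}{3!}(x - x_1)^3 + \frac{f^{(3)}(x_1)}{4!}(x - x_1)^4 \right]_{x_0}^{x_2} \\
& + \left[ \frac{f^{(4)}(x_1)}{5!}(x - x_1)^5 \right]_{x_0}^{x_2}
\end{aligned} \tag{11.20}$$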

Next, recall that the three points x0, x1 and x2 are equally spaced, and the distance
between two successive points is defined as h in Eq. (11.8). As a result, all the even-exponent subtractions in Eq. (11.20) will cancel out, and all the odd-exponent ones
will be added together:

$$\begin{aligned}
(x_2 - x_1)^2 - (x_0 - x_1)^2 &= 0 \\
(x_2 - x_1)^3 - (x_0 - x_1)^3 &= 2h^3 \\
(x_2 - x_1)^4 - (x_0 - x_1)^4 &= 0 \\
(x_2 - x_1)^5 - (x_0 - x_1)^5 &= 2h^5
\end{aligned} \tag{11.21}$$

This result simplifies Eq. (11.20) considerably:

$$\int_{x_0}^{x_2} f(x)\,dx \approx f(x_1)\,2h + \frac{f^{(2)}(x_1)}{3}h^3 + \frac{f^{(4)}(x_1)}{60}h^5 \tag{11.22}$$

The fourth-order term of the series, which has been kept somewhat separate so far
in the equations, will become the error term of the method. This however leaves the
second derivative to deal with in the second-order term; after all, the derivative of
the function is not known, and only three measurements are available. Fortunately,
Chap. 10 has explained how to approximate the derivative of a function from
measurements. Specifically, the centered divided-difference formula for the second
derivative can be substituted into Eq. (11.22), and subsequently simplified to get the
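formula for Simpson's 1/3 rule:

$$\begin{aligned}
\int_{x_0}^{x_2} f(x)\,dx &= f(x_1)\,2h + \frac{h^3}{3}\left[\frac{f(x_2) + f(x_0) - 2f(x_1)}{h^2} + O\!\left(h^2\right)\right] + \frac{f^{(4)}(x_1)}{60}h^5 \\
&= f(x_1)\,\frac{6h}{3} + \frac{h}{3}\bigl(f(x_2) + f(x_0) - 2f(x_1)\bigr) + O\!\left(h^5\right) + \frac{f^{(4)}(x_1)}{60}h^5 \\
&= \frac{h}{3}\bigl(f(x_0) + 4f(x_1) + f(x_2)\bigr) + O\!\left(h^5\right) \\
&= (x_2 - x_0)\,\frac{f(x_0) + 4f(x_1) + f(x_2)}{6} + O\!\left(h^5\right)
\end{aligned} \tag{11.23}$$
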
If more than three measurements are available, then the same idea as for the
composite trapezoid applies: group them into triplets of successive points and
interpolate multiple smaller and more accurate nonoverlapping parabolas, and sum
the resulting areas to get a higher-accuracy approximation of the integral. A general
form of the equation can be obtained to do this:

$$\int_{x_0}^{x_{n-1}} f(x)\,dx = (x_{n-1} - x_0)\,\frac{f(x_0) + 4\displaystyle\sum_{i=1,3,5,\ldots}^{n-2} f(x_i) + 2\displaystyle\sum_{j=2,4,6,\ldots}^{n-3} f(x_j) + f(x_{n-1})}{3(n - 1)} + O\!\left(n h^5\right) \tag{11.24}$$

Do be careful with the two separate summations that must be computed in the
composite equation: each adds every other measurement, but they are multiplied by
different constants. Note also that, just like with the composite trapezoid equation,
the measurements at the two bounds of the integration interval are only
summed once.
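
The two summations of Eq. (11.24) can be sketched in Python with list slices. The sketch assumes an odd number of equally spaced measurements (so that they group into whole triplets) and uses a function name chosen here for illustration.

```python
def composite_simpson_13(values, h):
    """Composite Simpson's 1/3 rule on equally spaced measurements.

    With exactly three measurements this reduces to Eq. (11.23).
    """
    n = len(values)
    if n < 3 or n % 2 == 0:
        raise ValueError("an odd number of measurements (at least 3) is required")
    odd_sum = sum(values[1:n - 1:2])    # x1, x3, ..., x_{n-2}, weighted by 4
    even_sum = sum(values[2:n - 1:2])   # x2, x4, ..., x_{n-3}, weighted by 2
    return h / 3.0 * (values[0] + 4.0 * odd_sum + 2.0 * even_sum + values[-1])


# Illustrative check: f(x) = x**3 sampled on [0, 2] with h = 0.5.
# The exact integral is 4.
samples = [(0.5 * i) ** 3 for i in range(5)]
print(composite_simpson_13(samples, 0.5))
```

A cubic is a convenient test case because the error term of Eq. (11.22) depends on the fourth derivative, which is zero for any polynomial of degree three or less.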
Example 11.3
The relationship between the voltage V(t) and current I(t) that goes through a
capacitor over a period of time from $t_0$ to $t_{n-1}$ can be described by the following
integral, where C is the capacitance value and $V(t_0)$ is the initial voltage
across the capacitor:

$$V(t) = \frac{1}{C} \int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0)$$

For a supercapacitor of 1 F in a computer system, current measurements were
taken at computer boot-up, after 0.5 s, and after 1 s. The measurements are
given in the following table:

| Time t (s)       | 0 | 0.5  | 1 |
|------------------|---|------|---|
| Current I(t) (A) | 0 | 16.2 | 1 |

Compute the voltage across this supercapacitor using Simpson's
1/3 rule.
Solution
To begin, note that, since the first measurement is at the moment the computer
boots up, the initial voltage V(t0) will be null. With the capacitance value of
1 F, the voltage will be only the result of the integral of the current.
Since there are three points to compute Simpson's rule from, either
Eq. (11.23) or (11.24) could be used; the latter will simplify into the former.
The resulting formula is:
\[ \int_0^1 I(t)\,dt \approx (1 - 0)\,\frac{0 + 4 \times 16.2 + 1}{6} = 11.0\ \text{V} \]
And this approximation has a relative error of 34 % compared to the real
integral value of 16.5 V.
Comparing this result to those obtained with the same points in Examples
11.1 and 11.2 shows the improved accuracy of this method. The two-segment
trapezoid rule gave a result of 8.3 V with a relative error of 49 %, a clear
indication that two straight-line interpolations of two halves of the function
are not nearly as good an approximation as a single parabola over the entire
function. The result obtained using the Romberg integration rule was 11.0 V,
the same as is obtained here, but with Romberg it was obtained in the second
iteration after three steps of computations: two trapezoid rules and one
iterative Romberg equation. Here, Simpson's 1/3 rule provides the same
result in a single step with much less computation (and thus far fewer
chances for error).

Example 11.4
Two sets of current measurements with different intervals have been taken for
the capacitor of Example 11.3. They are:

Time t (s)          0     0.25    0.5     0.75    1
Current I(t) (A)    0     4.4     16.2    35.5    1

and:

Time t (s)          0     0.2     0.4     0.6     0.8     1
Current I(t) (A)    0     3.1     10.1    24.1    37.1    1

Compute the voltage going across this supercapacitor using Simpson's 1/3
rule on each set of points, and compare the results.
Solution
Using Eq. (11.24) on the first set of five measurements gives:
\[ I_5 = (1 - 0)\,\frac{0 + 4(4.4 + 35.5) + 2(16.2) + 1}{3 \times 4} + O\!\left(\frac{h^5}{5}\right) \approx 16.1\ \text{V} \]
Then using the set of six measurements gives:
\[ I_6 = (1 - 0)\,\frac{0 + 4(3.1 + 24.1) + 2(10.1 + 37.1) + 1}{3 \times 5} + O\!\left(\frac{h^5}{6}\right) \approx 13.6\ \text{V} \]
Compared to the real value of this integral of 16.5 V, the result with five
points has a relative error of 3 % while the result with six points has a relative
error of 18 %. This goes against the intuition built throughout this chapter that
more points should make it possible to generate a better approximation of the
function and thus a more accurate integral, and it goes against the error rate
computed in the equations, which predicts that the approximation with six
points should be more accurate.
This problem is the subject of the next section.

11.4.2 Simpson's 3/8 Rule

The previous example demonstrated a problem with Simpson's 1/3 rule: using six
measurements yielded an approximation that was considerably worse than when
using five measurements, when intuitively the opposite should be true. The problem
does not stem from the measurements themselves; both sets of samples are equally
accurate. Rather, the problem is the choice of methods. Simpson's 1/3 rule is
designed to work with an integration interval broken up in an even number of
segments, or an odd number of measurements, and the example that failed used six
measurements, an even number that divided the integration interval into an odd
number of segments. In practice though, one is limited by the measurements
available and the fact they must be equally spaced: if an even number of equally
spaced measurements are available, then discarding one to use Simpson's 1/3 rule is
not an option. So what can be done in such cases?
The best option is to combine Simpson's 1/3 rule with another rule, called
Simpson's 3/8 rule. That rule is designed to work with exactly four measurements,
or three segments. Given four measurements, it can be used by itself, and given an
even number of points greater than 4, it can be used to handle the first or last four
points and leave behind an odd number of points for Simpson's 1/3 rule to handle.
For example, in Fig. 11.7, an integration interval is divided into five segments using
six measurements, much like with Example 11.4. In such a case, the first four
measurements can be used in Simpson's 3/8 rule and the last three in Simpson's 1/3
rule, with the fourth measurement thus being used in both formulae. The
pseudocode of an algorithm combining both versions of Simpson's rule is
presented in Fig. 11.8.

Fig. 11.7 An integration interval divided into five segments

xL ← Input lower integration bound
xU ← Input upper integration bound
Segments ← Input number of segments
h ← (xU − xL) / Segments
IF (Segments = 1)
    RETURN Failure
ELSE IF (Segments = 2)
    Integral ← (xU − xL) × (F(xL) + 4 × F(xL + h) + F(xU)) / 6
ELSE IF (Segments = 3)
    Integral ← (xU − xL) × (F(xL) + 3 × F(xL + h) + 3 × F(xL + 2h) + F(xU)) / 8
ELSE IF (Segments is even)
    Integral ← F(xL) + 4 × F(xL + h)
    x ← xL + 2h
    WHILE (x < xU)
        Integral ← Integral + 2 × F(x) + 4 × F(x + h)
        x ← x + 2h
    END WHILE
    Integral ← Integral + F(xU)
    Integral ← (xU − xL) × Integral / [3 × Segments]
ELSE IF (Segments is odd)
    x3 ← xL + 3h
    Part1 ← (x3 − xL) × (F(xL) + 3 × F(xL + h) + 3 × F(xL + 2h) + F(x3)) / 8
    Part2 ← F(x3) + 4 × F(x3 + h)
    x ← x3 + 2h
    WHILE (x < xU)
        Part2 ← Part2 + 2 × F(x) + 4 × F(x + h)
        x ← x + 2h
    END WHILE
    Part2 ← Part2 + F(xU)
    Part2 ← (xU − x3) × Part2 / [3 × (Segments − 3)]
    Integral ← Part1 + Part2
END IF
RETURN Integral

FUNCTION F(x)
    RETURN evaluation of the target function at point x
END FUNCTION

Fig. 11.8 Pseudocode of the Simpson's rules algorithm
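The combination of the two rules can also be sketched for sampled data, which is the situation of the examples in this section. The Python below is one possible arrangement, assuming equally spaced samples with spacing `h`; the helper names `simpson_13`, `simpson_38`, and `simpson` are illustrative and do not come from the text.

```python
def simpson_13(values, h):
    """Composite Simpson's 1/3 rule on an odd number of equally spaced samples."""
    n = len(values)
    inner = 4 * sum(values[1:n - 1:2]) + 2 * sum(values[2:n - 2:2])
    return h * (values[0] + inner + values[-1]) / 3


def simpson_38(values, h):
    """Simpson's 3/8 rule on exactly four equally spaced samples."""
    f0, f1, f2, f3 = values
    return 3 * h * (f0 + 3 * f1 + 3 * f2 + f3) / 8


def simpson(values, h):
    """Pick or combine the rules so any count of three or more samples works."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three samples are required")
    if n % 2 == 1:      # even number of segments: 1/3 rule alone
        return simpson_13(values, h)
    if n == 4:          # three segments: 3/8 rule alone
        return simpson_38(values, h)
    # Odd number of segments: 3/8 rule on the first four samples and
    # 1/3 rule on the rest; the fourth sample is shared by both rules.
    return simpson_38(values[:4], h) + simpson_13(values[3:], h)


# The six measurements of Example 11.4, taken every 0.2 s.
current = [0, 3.1, 10.1, 24.1, 37.1, 1]
print(simpson(current, 0.2))
```

Working with the spacing `h` instead of the interval bounds gives the same formulae, since the interval covered by m segments is simply m times h.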


The formula for Simpson's 3/8 rule is given below. It can be seen that it has an
error rate of $O(h^5)$, just like Simpson's 1/3 rule with three points. Thus, both
formulae can be used together without loss of accuracy.
\[
\int_{x_0}^{x_3} f(x)\,dx = (x_3 - x_0)\,\frac{f(x_0) + 3f(x_1) + 3f(x_2) + f(x_3)}{8} + O(h^5) \tag{11.25}
\]

Example 11.5
The relationship between the voltage V(t) and current I(t) that goes through a
capacitor over a period of time from $t_0$ to $t_{n-1}$ can be described by the following
integral, where C is the capacitance value and $V(t_0)$ is the initial voltage
across the capacitor:
\[ V(t) = \frac{1}{C}\int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0) \]
For a supercapacitor of 1 F in a computer system, current measurements were
taken at computer boot-up and every 0.2 s. The measurements are given in the
following table:

Time t (s)          0     0.2     0.4     0.6     0.8     1
Current I(t) (A)    0     3.1     10.1    24.1    37.1    1

Compute the voltage going across this supercapacitor using Simpson's rules.
Solution
To begin, note that, since the first measurement is at the moment the computer
boots up, the initial voltage V(t0) will be null. With the capacitance value of
1 F, the voltage will be only the result of the integral of the current.
With six measurements, two options are available to use Simpson's rules:
either apply Simpson's 3/8 rule on the first four and Simpson's 1/3 rule on the
last three, or the other way around, apply Simpson's 1/3 rule on the first three
measurements and Simpson's 3/8 rule on the last four. There is, a priori, no
way to prefer one option over the other. So using the first one, Eq. (11.25)
over the first four measurements gives:
\[ I_{3/8} = (0.6 - 0)\,\frac{0 + 3 \times 3.1 + 3 \times 10.1 + 24.1}{8} = 4.8\ \text{V} \]
And Eq. (11.24) over the last three measurements gives:
\[ I_{1/3} = (1 - 0.6)\,\frac{24.1 + 4 \times 37.1 + 1}{6} = 11.6\ \text{V} \]
The entire integral is the sum of $I_{3/8}$ and $I_{1/3}$, which is 16.4 V, and has a
relative error of 0.8 % compared to the real integral value of 16.5 V. This is
also an improvement compared to the five-measurement integral approximation of
Example 11.4, and especially compared to the erroneous application
of Simpson's 1/3 rule on six measurements that was done in Example 11.4.

11.5 Gaussian Quadrature

So far this chapter has introduced several formulae to approximate the integral of a
function, using different numbers of measurements and different numbers of
iterations, and with different error rates. By far, the simplest one was the single-segment
trapezoid rule of Eq. (11.1). Unfortunately it was also the method with the worst
error rate. As was demonstrated in Example 11.1, the error stems from the selection
of the two points; they must be the two integration bounds, and these two bounds
may not be representative of the entire function. Since the trapezoid rule is a closed
method, those are the two points that must be used.
Wouldn't it be better if there were an open-method equivalent of the trapezoid
method, one that kept the simplicity of using only two points but made it possible to
choose points within the integration interval that are more representative of the
function? Such points would make it possible to interpolate a straight line such that the
missing area above the interpolated line and the extra area below the interpolated
line cancel each other out. Such a method could pick the two points that yield a
trapezoid with an area as close as possible to the real function. For example, instead
of the two bounds $x_0$ and $x_1$ of Fig. 11.2, the two points a and b inside the interval
could be selected as in Fig. 11.9 to get a better approximation of the integral value.
Such a method does exist. It is called the Gaussian quadrature method, or
alternatively the Gauss-Legendre rule. To understand where it comes from, it is
best to go back to the two-point trapezoid and learn a different way to discover that
method.
Fig. 11.9 An open single-segment trapezoid approximation

The single-segment trapezoid of Eq. (11.1) estimates the integral of the function
as a combination of the measurement value of that function at two points, $x_0$ and $x_1$,
the two known integration bounds. The formula can thus be written as:
\[ I = \int_{x_0}^{x_1} f(x)\,dx \approx w_0 f(x_0) + w_1 f(x_1) \tag{11.26} \]
The challenge is then discovering what the weights $w_0$ and $w_1$ are. To solve for two
unknown values, two equations are needed; that means two equations where the
function evaluations $f(x_0)$ and $f(x_1)$ and the total integral value I are known. But
these equations don't need to be complicated. They could be, for example, the
integral of the straight-line polynomial $p_0(x) = 1$ and the integral of the diagonal
line $p_1(x) = x$. Moreover, to further simplify, the polynomials can be centered on the
origin, meaning that $-x_0 = x_1 \neq 0$. In that case, the two integrals become:
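\[
\begin{aligned}
\int_{x_0}^{x_1} p_0(x)\,dx &= w_0\,p_0(x_0) + w_1\,p_0(x_1) \\
\int_{x_0}^{x_1} 1\,dx &= 1w_0 + 1w_1 \\
x_1 - x_0 &= w_0 + w_1
\end{aligned}
\tag{11.27}
\]
\[
\begin{aligned}
\int_{x_0}^{x_1} p_1(x)\,dx &= w_0\,p_1(x_0) + w_1\,p_1(x_1) \\
\int_{x_0}^{x_1} x\,dx &= w_0 x_0 + w_1 x_1 \\
0 &= w_0 x_0 + w_1 x_1
\end{aligned}
\tag{11.28}
\]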

From these two equations, it is simple to solve for the unknown weight values and
find that $w_0 = w_1 = (x_1 - x_0)/2$, the same values as in Eq. (11.1).
The Gaussian quadrature formula starts from the same single-segment formula,
except the two measurements are taken not at the known bounds but at two
unknown points a and b inside the integration interval. The equation thus becomes:
\[ \int_{x_0}^{x_1} f(x)\,dx \approx w_0 f(a) + w_1 f(b) \tag{11.29} \]

And there are now four unknown values to discover, namely the two points in
addition to the two weights. Four equations will be needed in that case; so in
addition to the integrals of the straight line and diagonal line polynomials from
earlier, add the polynomials $p_2(x) = x^2$ and $p_3(x) = x^3$. To simplify,
assume that the original function f(x) has been transformed to an equivalent
function g(y) of the same degree and area but over the integration interval
$y_0 = -1$ to $y_1 = 1$. In that case, the four equations become:
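\[
\begin{aligned}
\int_{-1}^{1} p_0(y)\,dy &= w_0\,p_0(a) + w_1\,p_0(b) \\
\int_{-1}^{1} 1\,dy &= 1w_0 + 1w_1 \\
2 &= w_0 + w_1
\end{aligned}
\tag{11.30}
\]
\[
\begin{aligned}
\int_{-1}^{1} p_1(y)\,dy &= w_0\,p_1(a) + w_1\,p_1(b) \\
\int_{-1}^{1} y\,dy &= w_0 a + w_1 b \\
0 &= w_0 a + w_1 b
\end{aligned}
\tag{11.31}
\]
\[
\begin{aligned}
\int_{-1}^{1} p_2(y)\,dy &= w_0\,p_2(a) + w_1\,p_2(b) \\
\int_{-1}^{1} y^2\,dy &= w_0 a^2 + w_1 b^2 \\
\frac{2}{3} &= w_0 a^2 + w_1 b^2
\end{aligned}
\tag{11.32}
\]
\[
\begin{aligned}
\int_{-1}^{1} p_3(y)\,dy &= w_0\,p_3(a) + w_1\,p_3(b) \\
\int_{-1}^{1} y^3\,dy &= w_0 a^3 + w_1 b^3 \\
0 &= w_0 a^3 + w_1 b^3
\end{aligned}
\tag{11.33}
\]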


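With four equations and four unknowns, it is easy to solve to find that $w_0 = w_1 = 1$
and $-a = b = 1/\sqrt{3}$. Equation (11.29) thus becomes:
\[
I = \int_{x_0}^{x_1} f(x)\,dx = \int_{-1}^{1} g(y)\,dy \approx g\!\left(-\frac{1}{\sqrt{3}}\right) + g\!\left(\frac{1}{\sqrt{3}}\right) \tag{11.34}
\]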
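For readers who want the intermediate steps, one way to solve Eqs. (11.30) to (11.33) is sketched below. It assumes that the two points are distinct and nonzero and that the weights are nonzero, which is what a two-point formula requires; the order of the steps is a choice made here, not necessarily the one the authors had in mind.

```latex
\begin{aligned}
&\text{From (11.31): } w_0 a = -w_1 b. \\
&\text{Substituting in (11.33): } w_0 a^3 + w_1 b^3 = -w_1 b\,a^2 + w_1 b^3 = w_1 b\,(b^2 - a^2) = 0
  \;\Rightarrow\; a^2 = b^2 \;\Rightarrow\; a = -b. \\
&\text{Back in (11.31): } (w_1 - w_0)\,b = 0 \;\Rightarrow\; w_0 = w_1, \\
&\text{and with (11.30): } w_0 = w_1 = 1. \\
&\text{Finally, from (11.32): } a^2 + b^2 = 2b^2 = \tfrac{2}{3} \;\Rightarrow\; b = \tfrac{1}{\sqrt{3}}, \quad a = -\tfrac{1}{\sqrt{3}}.
\end{aligned}
```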

The only problem that remains is to convert f(x) into its equivalent form g(y),
after which Eq. (11.34) makes it easy to compute an approximation of its integral. This
transformation will be done with a linear mapping of the form:
\[
\begin{aligned}
x &= c_0 + c_1 y \\
dx &= c_1\,dy
\end{aligned}
\tag{11.35}
\]
This transformation introduces two more unknown coefficients, $c_0$ and $c_1$, and thus
two equations will be needed to discover their values. Fortunately, two equations
are already available thanks to the known mappings of the two bounds from $x_0$ and
$x_1$ to $-1$ and $1$ respectively:
\[
\begin{aligned}
x_0 &= c_0 + c_1(-1) \\
x_1 &= c_0 + c_1(1)
\end{aligned}
\tag{11.36}
\]
It is trivial to find the values of $c_0$ and $c_1$ as functions of $x_0$ and $x_1$ by solving these
equations. The conversion of Eq. (11.34) is thus done by using the following
substitutions:
\[ x = \frac{(x_1 + x_0) + (x_1 - x_0)\,y}{2} \tag{11.37} \]
\[ dx = \frac{x_1 - x_0}{2}\,dy \tag{11.38} \]

To summarize, the Gaussian quadrature method is done in two steps. First,
convert the integral of f(x) from $x_0$ to $x_1$ into the integral of g(y) from $-1$ to $1$ by
substituting x and dx with the values of Eqs. (11.37) and (11.38) respectively.
Second, approximate the value of the integral by adding together the evaluations of
g(y) at the two internal points $y = \pm 1/\sqrt{3}$, as indicated in Eq. (11.34). These steps
hint at the two major limitations of the Gaussian quadrature method: first, it requires
a lot more computation than the other methods seen so far, and second, it requires
having access to the actual mathematical model of the system, not just samples, in
order to make it possible to perform the variable substitution and to evaluate the
function at the two specific internal points required. The trade-off, though, is an
accuracy comparable to Simpson's rules using only two measurements of the
system.
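The two steps can be sketched in a few lines of Python. This is an illustration under the assumption that the model f can be evaluated at any point; the function name `gauss_quadrature_2pt` and the cubic used to try it out are hypothetical choices, not taken from the text.

```python
import math


def gauss_quadrature_2pt(f, x0, x1):
    """Two-point Gaussian quadrature of f over [x0, x1], following Eq. (11.34)."""
    half_width = (x1 - x0) / 2   # dx/dy from Eq. (11.38)
    midpoint = (x1 + x0) / 2

    def g(y):
        # Eq. (11.37) maps y in [-1, 1] to x in [x0, x1]; multiplying by
        # half_width accounts for the substitution dx = half_width * dy.
        return f(midpoint + half_width * y) * half_width

    node = 1 / math.sqrt(3)
    return g(-node) + g(node)


# Try it on f(x) = x^3 + 2x over [0, 2], whose exact integral is 8.
print(gauss_quadrature_2pt(lambda x: x ** 3 + 2 * x, 0.0, 2.0))
```

The substitution is done numerically here, by evaluating f at the mapped points, so no symbolic expansion of g(y) is needed. Only two evaluations of f are made, at interior points of the interval.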


Example 11.6
By taking samples of the current going through a supercapacitor at every
second during seven seconds and computing an interpolation, the following
mathematical model of the current has been developed:
\[ I(t) = 8t + 42t^2 - 45t^3 + 62t^4 + 286t^5 - 352t^6 \]
Determine the voltage going through the supercapacitor over the first second
after the system's boot-up using the Gaussian quadrature method, knowing
that its capacitance value is 1 F.
Solution
The relationship between the voltage V(t) and current I(t) is the following
integral:
\[ V(t) = \frac{1}{C}\int_{t_0}^{t_{n-1}} I(t)\,dt + V(t_0) \]
Since the integration interval starts at the moment the computer boots up, the
initial voltage $V(t_0)$ will be null, and with the capacitance value of 1 F the
voltage will be only the result of the integral of the current.
The first step of applying the Gaussian quadrature method is to convert the
integral using the two Eqs. (11.37) and (11.38). With the integration bounds
$t_0 = 0$ and $t_1 = 1$, the equations become:
\[ t = \frac{1 + y}{2}, \qquad dt = \frac{dy}{2} \]

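And the integral becomes:
\[
\begin{aligned}
V(t) &= \int_0^1 \left(8t + 42t^2 - 45t^3 + 62t^4 + 286t^5 - 352t^6\right) dt \\
&= \int_{-1}^{1} \left[ 8\left(\frac{1+y}{2}\right) + 42\left(\frac{1+y}{2}\right)^2 - 45\left(\frac{1+y}{2}\right)^3 + 62\left(\frac{1+y}{2}\right)^4 \right. \\
&\qquad\qquad \left. {} + 286\left(\frac{1+y}{2}\right)^5 - 352\left(\frac{1+y}{2}\right)^6 \right] \frac{dy}{2} \\
&= \int_{-1}^{1} \left(8.1 + 17.7y + 11.9y^2 - 5.4y^3 - 17.0y^4 - 12.0y^5 - 2.8y^6\right) dy
\end{aligned}
\]
The next step is to approximate the value of the integral using Eq. (11.34):
\[
V(t) = \int_{-1}^{1} g(y)\,dy \approx g\!\left(-\frac{1}{\sqrt{3}}\right) + g\!\left(\frac{1}{\sqrt{3}}\right) = 1.7 + 18.5 = 20.2\ \text{V}
\]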

Compared to the real value of the integral of 16.5 V, this approximation has a
relative error of 22.2 %. This is a massive improvement compared to the
trapezoid rule approximation of Example 11.1, which had a relative error of
97.0 % using the same number of measurements. This approximation is also
an improvement compared to the three-measurement approximations
obtained by the composite trapezoid rule and Simpson's 1/3 rule, which had
relative errors of 49.5 % and 33.7 % respectively, despite being computed
using one more measurement than this approximation.
To further illustrate how this equation works, apply Eq. (11.37) again to
find that the two points at $y = \pm 1/\sqrt{3}$ correspond to times t = 0.21 s and
t = 0.79 s. This means the integral approximation is computed from the
colored area of the trapezoid under the red line in the figure below. Compared
to the single-segment and two-segment trapezoids of Example 11.1, included
as the light and dark purple lines respectively in this figure, it is clear to see
how the Gaussian quadrature gives a superior result. Because it is an open
method, it can forego the unrepresentative points at the integration bounds
that the two trapezoid rules are forced to use in their computations. The
straight-line interpolation resulting from the Gaussian quadrature points is
clearly a better linear approximation of the function over a large part of the
integration interval than either of the trapezoid interpolations. And even the
error, the large section included under the interpolation line beyond t = 0.83 s
when the function begins decreasing quickly, is cancelled out in part by the
negative area under the interpolation line from t = 0 to t = 0.15.


To conclude, it should be noted that the version of Gaussian quadrature
presented in this section is actually a special case of the method that uses two
points. In fact, the Gaussian quadrature method can be used with any number of
points, just like the composite trapezoid rule or Simpson's rules. The general form
of the formula is:
\[
I = \int_{x_0}^{x_1} f(x)\,dx = \int_{-1}^{1} g(y)\,dy \approx \sum_{k=0}^{n-1} w_k\,g(y_k) \tag{11.39}
\]
In the case where n = 2, the weights $w_k$ are always 1 and the evaluated points $y_k$ are
$\pm 1/\sqrt{3}$, and Eq. (11.39) reduces to Eq. (11.34). Weights and points for the first four
values of n are presented in Table 11.1. Notice from this table that the weights and
points are different at every value of n; this means that the complete summation will
have to be recomputed from scratch every time the number of points n is increased.
The Gaussian quadrature method thus cannot be implemented in an iterative
algorithm that increments the number of points to gradually improve the quality
of the approximation, in the way the Romberg integration rule did.

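Table 11.1 Points and weights for the Gaussian quadrature method with different numbers of points

Number of points n    Evaluated points $y_k$                  Weights $w_k$
1                     $0$                                     $2$
2                     $-\sqrt{1/3}$                           $1$
                      $+\sqrt{1/3}$                           $1$
3                     $-\sqrt{3/5}$                           $5/9$
                      $0$                                     $8/9$
                      $+\sqrt{3/5}$                           $5/9$
4                     $-\sqrt{3/7 + (2/7)\sqrt{6/5}}$         $(18 - \sqrt{30})/36$
                      $-\sqrt{3/7 - (2/7)\sqrt{6/5}}$         $(18 + \sqrt{30})/36$
                      $+\sqrt{3/7 - (2/7)\sqrt{6/5}}$         $(18 + \sqrt{30})/36$
                      $+\sqrt{3/7 + (2/7)\sqrt{6/5}}$         $(18 - \sqrt{30})/36$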


The development of the error for Eq. (11.39) falls outside the scope of this book,
but the final result is:
\[
E(x) = \frac{2^{2n+1}(n!)^4}{(2n+1)\left[(2n)!\right]^3}\,g^{(2n)}(x) = O(h^{2n}) \tag{11.40}
\]
This shows that the Gaussian quadrature method will compute the exact integral
value with no error for a polynomial f(x) of degree $2n - 1$ or less, in which case the 2nth
derivative of g(x) will be 0. In the special case of n = 2 that has been the topic of
this section, the method will have an error rate of $O(h^4)$, a considerable
improvement compared to the $O(h^2)$ error rate of the trapezoid rule with the same number
of points.
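The general formula of Eq. (11.39) and the exactness property just described can be checked with a short Python sketch. The points and weights are those of Table 11.1, written out by hand; the names `GAUSS_TABLE` and `gauss_quadrature` are illustrative assumptions and not part of the text.

```python
import math

# Points and weights of Table 11.1, keyed by the number of points n.
_P3 = math.sqrt(3 / 5)
_INNER = math.sqrt(3 / 7 - 2 / 7 * math.sqrt(6 / 5))
_OUTER = math.sqrt(3 / 7 + 2 / 7 * math.sqrt(6 / 5))
_W_INNER = (18 + math.sqrt(30)) / 36
_W_OUTER = (18 - math.sqrt(30)) / 36

GAUSS_TABLE = {
    1: [(0.0, 2.0)],
    2: [(-1 / math.sqrt(3), 1.0), (1 / math.sqrt(3), 1.0)],
    3: [(-_P3, 5 / 9), (0.0, 8 / 9), (_P3, 5 / 9)],
    4: [(-_OUTER, _W_OUTER), (-_INNER, _W_INNER),
        (_INNER, _W_INNER), (_OUTER, _W_OUTER)],
}


def gauss_quadrature(f, x0, x1, n):
    """n-point Gaussian quadrature of f over [x0, x1], following Eq. (11.39)."""
    half_width = (x1 - x0) / 2
    midpoint = (x1 + x0) / 2
    # Each table point y is mapped to x with Eq. (11.37); the factor
    # half_width comes from Eq. (11.38).
    return half_width * sum(w * f(midpoint + half_width * y)
                            for y, w in GAUSS_TABLE[n])


# A monomial of degree 2n - 1 is integrated exactly (up to rounding),
# while degree 2n is not.
for n in (1, 2, 3, 4):
    exact_case = gauss_quadrature(lambda x: x ** (2 * n - 1), 0.0, 1.0, n)
    inexact_case = gauss_quadrature(lambda x: x ** (2 * n), 0.0, 1.0, n)
    print(n, exact_case - 1 / (2 * n), inexact_case - 1 / (2 * n + 1))
```

Because each n has its own row of points, moving from n to n + 1 reuses none of the previous function evaluations, which is the practical cost noted above.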

11.6 Engineering Applications

Like differentiation, integration comes up in countless engineering models of real

physical phenomena. Chapter 10 demonstrated this using the relationship between
the jerk, the acceleration (the integral of the jerk), the speed (the integral of the
acceleration), and the position (the integral of the speed) of a moving object.
Section 11.1 demonstrated in turn how integration could also be used to estimate
physical areas. Other popular applications include the following:
The work W performed by a force that varies over position F(x) on an object that
is moving in a straight line from position $x_0$ to position $x_1$ is given by:
\[ W = \int_{x_0}^{x_1} F(x)\,dx \tag{11.41} \]

The Fourier transform of a continuous signal over time s(t) into a continuous
frequency-domain signal $S(\omega)$ is done using the equation:
\[ S(\omega) = \int_{-\infty}^{\infty} s(t)\,e^{-j\omega t}\,dt \tag{11.42} \]
where e and j are Euler's number and the imaginary number, respectively.

According to Ohm's law, the voltage between two points $x_0$ and $x_1$ along a path
is given by:
\[ V = \int_{x_0}^{x_1} E\,dx = \int_{x_0}^{x_1} \rho J\,dx \tag{11.43} \]
where E is the electric field, J is the current density, and $\rho$ is the resistivity along
the path.

Given a spring of stiffness k that was initially at rest and was gradually stretched
or compressed by a length L, the total elastic potential energy transferred into the
spring is computed as:
\[ U = \int_0^L kx\,dx \tag{11.44} \]

Finally, integration is also useful to model a nonuniform force being applied on a
surface, such as an irregular wind on a sail or the increasing water pressure on
the side of a dam.

11.7 Summary

Approximating integrals is a useful tool in any modelling task. In addition to
complementing differentiation in modelling change in systems and its literal use
in modelling physical areas, numerous physical processes and formulae that may be
critical to include in models of the real world are computed by integration of system
parameters.
This chapter has introduced several methods for approximating the integral of a
function. They all fall under the Newton-Cotes family of integration formulae, and
consequently share the same basic idea: to interpolate a polynomial approximation
from measurements of the function over the integration interval, and use the area of
the shape formed by this interpolation, the x-axis, and the two integration bounds, as
an approximation of the integral. The simplest possible such technique is the
trapezoid rule, which only interpolates a straight line between the two integration
bounds. Given more measurements, the integration interval can be broken up into
smaller intervals and the integral computed using the composite trapezoid rule, and
even refined iteratively using the Romberg integration rule. With more
measurements, it is also possible to interpolate parabolas, which give a better approximation
of the function and therefore of the integral, using Simpson's 1/3 rule and
Simpson's 3/8 rule. These techniques are all closed methods, in that they require
that the two integration bounds be among the measurements used to approximate
the integral. This is a limitation, as the choice of these bounds will usually be
constrained by the problem being studied, and not selected for being representative
points of the system's behavior. The Gaussian quadrature method is an open
method, which uses representative points inside the integration interval and not
the bounds. Although that method is more computationally intensive, it does yield
results using only two points that rival a three- or four-point application of
Simpson's rules. Table 11.2 summarizes the methods covered in this chapter.

Table 11.2 Summary of integration methods

Method                      Requires                                                        Error
Trapezoid rule              Measurements at the two integration bounds                      $O(h^2)$
Composite trapezoid rule    n measurements of the function, $n \ge 2$                       $O(h^2)$
Romberg integration         n measurements of the function and k iterations, $n = 2^k$      $O(h^{2k+2})$
Simpson's rules             n measurements of the function, $n \ge 3$                       $O(h^5)$
Gaussian quadrature         Two points selected within boundaries                           $O(h^4)$

11.8 Exercises

1. Approximate the integral of the function $f(x) = e^{-x}$ over the interval [0, 10]
using:
(a) A single-segment trapezoid rule.
(b) A 20-segment composite trapezoid rule.
(c) Romberg's integration rule, starting with one interval and continuing until
the absolute error between two approximations is less than 0.000001.
(d) Simpson's rule with three points.
(e) Simpson's rule with four points.
2. Using a single-segment trapezoid rule, approximate the integral of the
following functions over the specified intervals.
(a) $f(x) = x^3$ over the interval [1, 2].
(b) $f(x) = e^{-0.1x}$ over the interval [2, 5].
3. Using a single-segment trapezoid rule, approximate the integral of the
following functions over the specified intervals. Then, evaluate their approximate
error and their real error.
(a) $f(x) = x^2$ over the interval [0, 2].
(b) $f(x) = x^4$ over the interval [0, 2].
(c) $f(x) = \cos(x)$ over the interval [0.2, 0.4].
4. Approximate the integral of $f(x) = x^3$ over the interval [1, 2] using a
four-segment composite trapezoid rule.
5. Approximate the integral of $f(x) = xe^{-x}$ over the interval [0, 4] using a
10-segment composite trapezoid rule.
6. Using four- and eight-segment composite trapezoid rules, approximate the
integral of the following functions over the specified intervals. Then, evaluate
their approximate error and their real error when using eight segments.
(a) $f(x) = x^2$ over the interval [−2, 2].
(b) $f(x) = x^4$ over the interval [−2, 2].


7. Use Romberg integration to approximate the integral of $f(x) = \cos(x)$ over the
interval [0, 3], starting with one interval and computing ten iterations.
8. Use Romberg integration to approximate the integral of $f(x) = x^5$ on the interval
[0, 4], starting with one interval and until the error on two successive steps is 0.
9. Use Romberg integration to approximate the integral of $f(x) = \sin(x)$ on the
interval [0, π], starting with one interval and until the error on two successive
steps is less than $10^{-5}$.
10. Using a three-point Simpson's rule, approximate the integral of the following
functions over the specified intervals.
(a) $f(x) = x^3$ over the interval [1, 2].
(b) $f(x) = e^{-0.1x}$ over the interval [2, 5].
11. Using a three-point Simpson's rule and a four-point Simpson's rule,
approximate the integral of the following functions over the specified intervals.
(a) $f(x) = x^2$ over the interval [0, 2].
(b) $f(x) = x^4$ over the interval [0, 2].

Chapter 12

Initial Value Problems

12.1 Introduction

Consider a simple RC circuit such as the one shown in Fig. 12.1. Kirchhoff's law
states that this circuit can be modelled by the following equation:
\[ \frac{dV}{dt} = -\frac{V(t)}{RC} \tag{12.1} \]

This model would be easy to use if the voltage and the values of the resistor and
capacitor are known. But what if the voltage is not known or measurable over time,
and only the initial conditions of the system are known? That is to say, only the
initial value of the voltage and of its derivative, along with the resistor and capacitor
value, are known.
This type of problem is an initial value problem (IVP), a situation in which a
parameter's change (derivative) equation can be modelled mathematically and
initial condition measurements are available, and future values of the parameters
need to be estimated. Naturally, if the initial value of a parameter and the equation
modelling its change over time are both available, it can be expected that it is
possible to predict the value at any time in the system's operation. Different
numerical methods to accomplish this, with different levels of complexity and of
accuracy, will be presented in this chapter.
To formalize the discussion, an equation such as (12.1), or more generally any
equation of the form
\[ y^{(1)}(t) = f(t, y(t)) = c_0 + c_1 y(t) + c_2 t \tag{12.2} \]
is called a first-order ordinary differential equation (ODE). For an IVP, the initial
value $y(t_0) = y_0$ is known, as are the values of the coefficients $c_0$, $c_1$, and $c_2$, and the
goal is to determine the value at a future time $y(t_{n-1})$. However, the challenge is that
the equation of y(t) is itself unknown. It is thus necessary to approximate its
behavior from knowledge of its derivative only. Later sections of this chapter will
deal with more complex cases, namely with systems of ODEs and with
higher-order ODEs.

Fig. 12.1 A sample RC circuit

Springer International Publishing Switzerland 2016
R. Khoury, D.W. Harder, Numerical Methods and Modelling for Engineering,
DOI 10.1007/978-3-319-21176-3_12
It is also pertinent to note in this introduction that time, while a continuous
variable, will be handled as a set of discrete equally spaced time mesh points. That
is to say, instead of considering all possible moments in a continuous time interval
going from the initial instant that parameter values of the system are known for to
the target instant the parameter value is needed for:
\[ t \in [t_0, t_{n-1}] \tag{12.3} \]
the methods in this chapter will instead consider a set of n discrete time mesh
points:
\[ t \in \{t_0, \ldots, t_i, t_{i+1}, \ldots, t_{n-1}\} \tag{12.4} \]
separated by an equal interval h:
\[ h = \frac{t_{n-1} - t_0}{n - 1} = t_{i+1} - t_i \tag{12.5} \]
From these equations, any mesh point within a problem's time interval can be
written as:
\[ t_i = t_0 + ih \tag{12.6} \]

While these definitions may seem simple, and indeed they are, they will also be
fundamental to the numerical methods presented in this chapter. Indeed, they make
the IVP problem simpler: instead of trying to model the behavior of the unknown
function y(t) over the entire time interval of Eq. (12.3), it is only necessary to
approximate it over the finite set of mesh points of Eq. (12.4).

12.2 Euler's Method

It was established, back in Chap. 5, that a function can be approximated as a straight
line over a short interval around a specific point. And moreover, it was learned that
the slope of this straight line is the derivative of the function evaluated at that point.
This gives the intuition that underlies the simplest and most intuitive of the IVP methods,
Euler's method: starting at the initial known mesh point $t_0$, evaluate the derivative at
each mesh point and follow the straight line to approximate the function to the next
mesh point, and repeat this process until the target point $t_{n-1}$ is reached. Stated
more formally, this method follows the equation:
\[
\begin{aligned}
y(t_{i+1}) &= y(t_i) + h\,y^{(1)}(t_i) \\
&= y(t_i) + h\,f(t_i, y(t_i))
\end{aligned}
\tag{12.7}
\]
Starting from the known initial conditions of $y(t_0) = y_0$ at time $t_0$, it is possible to
evaluate the ODE to obtain the value of the derivative $y^{(1)}(t_0)$ and to use it to
approximate the value of $y(t_1)$. This process is then repeated iteratively at each
mesh point until the requested value of $y(t_{n-1})$ is obtained. The pseudocode of an
algorithm to do this is presented in Fig. 12.2.

y ← Input initial value
tL ← Input lower mesh point
tU ← Input upper mesh point
h ← Input step size
t ← tL
WHILE (t < tU)
    y ← y + h × F(t, y)
    t ← t + h
END WHILE
RETURN y

FUNCTION F(t, y)
    RETURN evaluation of the derivative of the target function at mesh
           point t and at function point y
END FUNCTION

Fig. 12.2 Pseudocode of Euler's method
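A compact Python rendering of this procedure is sketched below. It assumes the derivative is available as a function `f(t, y)`, and it takes the number of steps instead of the step size so that the loop runs an exact number of times; the name `euler` and the test equation are illustrative choices, not taken from the text.

```python
def euler(f, t0, y0, t_end, n_steps):
    """Approximate y(t_end) for y' = f(t, y) with y(t0) = y0, using Eq. (12.7)."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):     # a counted loop avoids round-off in "t < t_end"
        y = y + h * f(t, y)      # follow the derivative for one step
        t = t + h
    return y


# y' = y with y(0) = 1, advanced to t = 1 in four steps of h = 0.25.
# Each step multiplies y by 1.25, so the result is 1.25 ** 4.
print(euler(lambda t, y: y, 0.0, 1.0, 1.0, 4))
```

Counting steps is a design choice: repeatedly adding a floating-point h to t can overshoot or undershoot the upper mesh point by a rounding error, which would change the number of iterations of a loop that compares t against the bound.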
Equation (12.7) should be immediately recognizable as a first-order Taylor
series approximation of the function y(t) evaluated at $t_{i+1}$ from the point $t_i$ (and
indeed it could have been obtained from the Taylor series instead of the reasoning
presented above). This means that the error on this method is proportional to the
second-order term of the series:
\[ \frac{y^{(2)}(t_i)}{2!}h^2 = O(h^2) \tag{12.8} \]
Each step of Euler's method thus has a quadratic error rate; for example, halving the
step size h will quarter the approximation error of that step. It should be easy to
understand why reducing the step size improves the approximation: as was seen in
Chap. 5, the underlying assumption that a function can be approximated by its
straight-line first derivative is only valid for a short interval around any given point
and becomes more erroneous the farther away from that point the approximation goes.


Example 12.1
Using Kirchhoff's law, the voltage in a circuit has been modelled by the
following equation:
\[ \frac{dV}{dt} = V(t) - t + 1 \]
Given that the initial voltage was 0.5 V, determine the voltage after 1 s
using six mesh points (that is, five steps) of Euler's method.
Solution
Using n = 6 gives a sample every 0.20 s, following Eq. (12.5). Putting the
ODE equation of this problem into Euler's method Eq. (12.7) gives the
formula to compute these samples:
\[
\begin{aligned}
V(t_{i+1}) &= V(t_i) + h\,V^{(1)}(t_i) \\
&= V(t_i) + h\,(V(t_i) - t_i + 1)
\end{aligned}
\]
And this formula can then be used to compute the value at each step of the
method:
\[
\begin{aligned}
V(0) &= 0.5\ \text{V} \\
V(0.20) &= 0.5 + 0.20\,(0.5 - 0 + 1) = 0.80\ \text{V} \\
V(0.40) &= 0.80 + 0.20\,(0.80 - 0.20 + 1) = 1.12\ \text{V} \\
V(0.60) &= 1.12 + 0.20\,(1.12 - 0.40 + 1) = 1.46\ \text{V} \\
V(0.80) &= 1.46 + 0.20\,(1.46 - 0.60 + 1) = 1.83\ \text{V} \\
V(1) &= 1.83 + 0.20\,(1.83 - 0.80 + 1) = 2.24\ \text{V}
\end{aligned}
\]
To compare, note that the equation for the voltage used in this example was:
V t t

et
2

The actual voltage values computed by this equation are presented in the table
below, alongside the values computed by Eulers method and their relative
error. It can be seen that the error is small, thanks to the small step size used to
in this example. It can also be noted that the error increases in each successive
step. This is a consequence of the process implemented by Eulers method, as
explained in this section: the new point estimated at each step is computed by
following an approximation of the function starting from an approximation of
the previous point, and thus errors accumulate step after step.

Mesh point   Euler's method (V)   Real value (V)   Relative error (%)
V(0.20)      0.80                 0.81             1.23
V(0.40)      1.12                 1.15             2.61
V(0.60)      1.46                 1.51             3.31
V(0.80)      1.83                 1.91             4.19
V(1.00)      2.24                 2.36             5.08

To further study this example, it is interesting to draw the (t, V(t)) plot in the
range [0, 4] [1, 3] and to draw the orientation of the derivative at every 0.2
interval in that range. The resulting graph is shown below, and four example
runs of Euler's method are marked on it with red dots and lines, with the
initial conditions of y(0) 0.5 V, y(0) 0 V, y(0) 0.5 V, and y(0) 1.0 V.
The actual functions given these four initial conditions are also marked with
blue lines on the figure. These examples show how Euler's method follows
the direction indicated by the derivative, step by step, until it reaches the
target time. In fact, from a figure like the one below, it looks possible for a
human to simply draw and connect the arrows to reach the solution at any
time given any initial condition!
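The Euler iterations of Example 12.1 can be reproduced with a short script. The following is a minimal Python sketch, not the book's own code: the function names `f`, `exact`, and `euler` are illustrative assumptions, and the script takes five steps of size h = 0.20 from V(0) = 0.5.

```python
import math

def f(t, v):
    """Right-hand side of the ODE dV/dt = V(t) - t + 1 from Example 12.1."""
    return v - t + 1.0

def exact(t):
    """Closed-form solution V(t) = t + 0.5 * e^t for V(0) = 0.5."""
    return t + 0.5 * math.exp(t)

def euler(f, t0, y0, h, steps):
    """Return the mesh points and the Euler estimate at each of them."""
    ts, ys = [t0], [y0]
    for i in range(steps):
        ys.append(ys[-1] + h * f(ts[-1], ys[-1]))  # follow the current slope
        ts.append(t0 + (i + 1) * h)  # recompute t to limit round-off drift
    return ts, ys

ts, ys = euler(f, 0.0, 0.5, 0.2, 5)
for t, y in zip(ts[1:], ys[1:]):
    rel_err = abs(exact(t) - y) / exact(t) * 100.0
    print(f"V({t:.2f}) = {y:.4f} V, exact = {exact(t):.4f} V, error = {rel_err:.2f} %")
```

Because this sketch carries full floating-point precision from step to step, whereas the table works from values rounded to two decimals, the printed estimates and errors can differ slightly from the tabulated ones. The growth of the error from step to step is the same.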

Example 12.2
Given the following IVP, approximate the values of y(1) and y(0.5) using one
step of Euler's method for each:
$$
\begin{aligned}
y^{(1)}(t) &= 1 - t\,y(t) \\
y(0) &= 1
\end{aligned}
$$
Solution
Using Eq. (12.7), the result can be computed immediately:
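$$
\begin{aligned}
y(0 + h) &= y(0) + h\left(1 - 0 \cdot y(0)\right) \\
y(1)     &= 1 + 1\,(1 - 0 \cdot 1) = 2 \\
y(0.5)   &= 1 + 0.5\,(1 - 0 \cdot 1) = 1.5
\end{aligned}
$$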
It is interesting to note that the real values of the function are y(0.5) ≈ 1.34
and y(1) ≈ 1.33. This means the relative error at t = 0.5 is 12 % and at t = 1 is
50 %. Reducing the step size h by a factor of 2, from 1 to 0.5, has thus reduced
the error by approximately a factor of 4. This is exactly what should be
expected for a method with a quadratic error rate.
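The two single-step estimates and their error ratio can be checked numerically. This is a minimal Python sketch under stated assumptions: the helper `euler_step` is a hypothetical name, and the reference values are the two-decimal figures quoted in the example rather than independently computed ones.

```python
def f(t, y):
    """Right-hand side of the IVP y'(t) = 1 - t*y(t) from Example 12.2."""
    return 1.0 - t * y

def euler_step(f, t, y, h):
    """One step of Euler's method, Eq. (12.7)."""
    return y + h * f(t, y)

# Reference values as quoted in the example, rounded to two decimals.
reference = {0.5: 1.34, 1.0: 1.33}

errors = {}
for h in (1.0, 0.5):
    estimate = euler_step(f, 0.0, 1.0, h)  # single step from y(0) = 1
    errors[h] = abs(estimate - reference[h]) / reference[h]
    print(f"h = {h}: estimate = {estimate}, relative error = {errors[h]:.1%}")

print(f"error ratio = {errors[1.0] / errors[0.5]:.2f}")
```

The ratio printed on the last line is close to 4 rather than exactly 4, which is consistent with the "approximately a factor of 4" stated above.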

12.3 Heun's Method

It was explained in the previous section that Euler's method approximates the
behavior of a function by following the derivative at the current point $y(t_i)$ for
one step. But since the function being approximated will normally not be linear, the
approximated behavior will diverge from the real function and the estimated next
point $y(t_{i+1})$ will be somewhat off. From that point, the function will again be
approximated by a straight line, and the following point $y(t_{i+2})$ will be further off
from the real function's value. These errors will continue to accumulate,
step after step. In the case of a convex function such as the one in Fig. 12.3, for
example, it will lead to a consistent and increasing underestimation of the values of
the function.
The reason for this accumulation of error is that the derivative at $y(t_i)$ is a good
approximation of the function's behavior at that point, but not at the next point $y(t_{i+1})$.
Fig. 12.3 Euler's method underestimating a convex function's values

Fig. 12.4 Euler's method overestimating a convex function's values

But what if the derivative at $y(t_{i+1})$ was somehow available to be used in Euler's
method instead of the derivative at $y(t_i)$? It would give a good approximation of the
behavior of the function at $y(t_{i+1})$... but not at $y(t_i)$. The net result would be an
accumulation of errors in the opposite direction from before. For the convex function
of Fig. 12.3, it would lead to a consistent and increasing overestimation instead of an
underestimation, resulting in Fig. 12.4.
Considering the previous discussion, and comparing Figs. 12.3 and 12.4, a
solution becomes apparent: to average out the two estimates. Since the derivative
at $y(t_i)$ is a good approximation of the behavior of the function at $y(t_i)$ but leads to
errors at $y(t_{i+1})$, and vice versa, an average of the two derivatives should lead to a
good representation of the function's behavior on average over the interval from $t_i$
to $t_{i+1}$. Or, looking at the figures, taking the average of the underestimation of
Fig. 12.3 and the overestimation of Fig. 12.4 should give much more accurate
estimates, as shown in Fig. 12.5. And a better approximation of the behavior of the
function will, in turn, lead to a better approximation of $y(t_{i+1})$.
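The averaging argument can be illustrated on a single step. The sketch below is an assumption-laden Python example: it uses a hypothetical convex test problem that does not appear in the text, y'(t) = y(t) with y(0) = 1, chosen because its exact solution e^t makes the slope at the end of the step available for comparison.

```python
import math

# Hypothetical convex test problem (not from the text): y'(t) = y(t), y(0) = 1,
# whose exact solution is y(t) = e^t.
h = 0.5
y0 = 1.0
exact_next = math.exp(h)

slope_start = y0          # derivative at t_i, as used by Euler's method
slope_end = exact_next    # derivative at t_{i+1}; known here only because
                          # the exact solution is available

under = y0 + h * slope_start                       # follows the starting slope
over = y0 + h * slope_end                          # follows the ending slope
averaged = y0 + h * (slope_start + slope_end) / 2  # follows the mean slope

print(f"exact value        = {exact_next:.4f}")
print(f"starting slope     = {under:.4f}")
print(f"ending slope       = {over:.4f}")
print(f"average of slopes  = {averaged:.4f}")
```

In practice the ending slope is not known in advance, so this only shows why the average is attractive; it is not yet a usable method.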
That is the intuition that underlies Heun's method. Mathematically, it simply
consists of rewriting the Euler's method equation of (12.7) to use the average of the
two derivatives instead of using only the derivative at the current point:
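$$
\begin{aligned}
y(t_{i+1}) &= y(t_i) + h\,\frac{y^{(1)}(t_i) + y^{(1)}(t_{i+1})}{2} \\
           &= y(t_i) + h\,\frac{f(t_i, y(t_i)) + f(t_{i+1}, y(t_{i+1}))}{2}
\end{aligned}
\tag{12.9}
$$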

Fig. 12.5 Heun's method averaging the Euler's method approximations

There is one problem with Eq. (12.9): it requires the value of the next
point $y(t_{i+1})$ in order to compute the derivative at the next point, $f(t_{i+1}, y(t_{i+1}))$, and
that next point is exactly what the method is meant to estimate! This circular
requirement can be resolved easily, though, by using Euler's method to get an initial
estimate of the value of $y(t_{i+1})$. That initial estimate will be of lesser quality than the
one computed by Heun's method, but it makes the computation of Heun's method
possible. Integrating Euler's method into Heun's method alters Eq. (12.9) into:
$$
y(t_{i+1}) = y(t_i) + h\,\frac{f(t_i, y(t_i)) + f\left(t_{i+1},\; y(t_i) + h f(t_i, y(t_i))\right)}{2} \tag{12.10}
$$
The pseudocode for Heun's method is only a small modification of the one
presented earlier for Euler's method, as shown in Fig. 12.6.

y ← Input initial value
tL ← Input lower mesh point
tU ← Input upper mesh point
h ← Input step size
t ← tL
WHILE (t < tU)
    Euler ← y + h × F(t,y)
    y ← y + h × [ F(t,y) + F(t+h,Euler) ] / 2
    t ← t + h
END WHILE
RETURN y

FUNCTION F(t,y)
    RETURN evaluation of the derivative of the target function at mesh
           point t and at function point y
END FUNCTION

Fig. 12.6 Pseudocode of Heun's method
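The pseudocode of Heun's method translates almost line for line into a runnable function. What follows is a minimal Python sketch and not the authors' implementation: the name `heun` and its parameter names are assumptions, and the loop counts steps instead of comparing t to the upper mesh point, on the assumption that h divides the interval evenly.

```python
def heun(f, t_lower, t_upper, y0, h):
    """Approximate y(t_upper) with Heun's method, following Eq. (12.10)."""
    steps = round((t_upper - t_lower) / h)  # assumes h divides the interval evenly
    t, y = t_lower, y0
    for i in range(steps):
        slope_now = f(t, y)
        euler_guess = y + h * slope_now       # Euler estimate of y(t + h)
        slope_next = f(t + h, euler_guess)    # derivative at that estimate
        y = y + h * (slope_now + slope_next) / 2
        t = t_lower + (i + 1) * h             # recompute t to limit round-off drift
    return y

# Usage with the ODE of Example 12.1: dV/dt = V - t + 1, V(0) = 0.5.
print(heun(lambda t, v: v - t + 1.0, 0.0, 1.0, 0.5, 0.2))
```

Counting steps is a design choice made here because repeatedly adding h to t can leave t marginally below or above the upper mesh point in floating-point arithmetic, which would add or drop a step in a comparison-based loop.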
Just like for Euler's method, the error for Heun's method can be obtained from its
Taylor series approximation. Since it was already shown that Euler's method can be
obtained from the first-order Taylor series approximation, and it was stated that
Heun's method is more accurate, it can be expected that Heun's method could
be obtained from a second-order Taylor series approximation, and thus that its error
will be the next term. So begin from a third-order Taylor series approximation:
$$
y(t_{i+1}) = y(t_i) + h\,y^{(1)}(t_i) + \frac{y^{(2)}(t_i)}{2!}\,h^2 + \frac{y^{(3)}(t_i)}{3!}\,h^3 \tag{12.11}
$$

The second derivative is a problem, since it has no place in the equation of Heun's
method. However, isolating the first derivative in a second-order Taylor series
approximation (or in other words taking a first-order forward divided difference
formula) and taking the derivative of that formula gives an approximation of the
second derivative, as follows:
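$$
\begin{aligned}
y^{(1)}(t_i) &= \frac{y(t_{i+1}) - y(t_i)}{h} - \frac{y^{(2)}(t_i)}{2!}\,h \\
y^{(2)}(t_i) &= \frac{y^{(1)}(t_{i+1}) - y^{(1)}(t_i)}{h} - \frac{y^{(3)}(t_i)}{2!}\,h
\end{aligned}
\tag{12.12}
$$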

Substituting this second derivative into Eq. (12.11) gives the formula of (12.13),
which is only a simplification away from Eq. (12.9) and shows the error term to be
$O(h^3)$.
$$
y(t_{i+1}) = y(t_i) + h\,y^{(1)}(t_i) + \frac{y^{(1)}(t_{i+1}) - y^{(1)}(t_i)}{2}\,h - \frac{y^{(3)}(t_i)}{4}\,h^3 + \frac{y^{(3)}(t_i)}{3!}\,h^3 \tag{12.13}
$$
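The remaining simplification can be written out as a short worked step. Merging the first-derivative terms of Eq. (12.13) and collecting its two $h^3$ terms gives, in the same notation:

```latex
\[
\begin{aligned}
y(t_{i+1}) &= y(t_i) + h\,y^{(1)}(t_i)
              + \frac{h}{2}\left[ y^{(1)}(t_{i+1}) - y^{(1)}(t_i) \right]
              + \left( \frac{1}{3!} - \frac{1}{4} \right) y^{(3)}(t_i)\,h^3 \\
           &= y(t_i) + h\,\frac{y^{(1)}(t_i) + y^{(1)}(t_{i+1})}{2}
              - \frac{y^{(3)}(t_i)}{12}\,h^3
\end{aligned}
\]
```

The first two terms of the last line are exactly Eq. (12.9), and the remaining term is the $O(h^3)$ error of a single step.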

Example 12.3
Using Kirchhoff's law, the voltage in a circuit has been modelled by the
following equation:
$$
\frac{dV}{dt} = V(t) - t + 1
$$
Given that the initial voltage was 0.5 V, determine the voltage after 1 s
using Heun's method with six mesh points (five steps).
Solution
Using n = 6 mesh points gives a sample every 0.20 s, following Eq. (12.5). Computing the
derivative at the initial value, at t = 0, gives:
$$
V^{(1)}(0) = V(0) - 0 + 1 = 1.5\ \text{V/s}
$$
And the value at the next mesh point is:
$$
V(0.2) = V(0) + 0.2\,V^{(1)}(0) = 0.8\ \text{V}
$$
This corresponds to the value computed by Euler's method in the first step of
Example 12.1. Next, however, this value is used to compute the derivative at
the next mesh point:
$$
V^{(1)}(0.2) = V(0.2) - 0.2 + 1 = 1.6\ \text{V/s}
$$
And the more accurate approximation of the voltage value at the next mesh
point is obtained using the Heun's method formula of Eq. (12.9) as the average of
these two derivatives:
$$
V(0.2) = V(0) + 0.2\,\frac{V^{(1)}(0) + V^{(1)}(0.2)}{2} = 0.81\ \text{V}
$$

The values for all five mesh points to compute, along with the real value
computed by the actual voltage equation $V(t) = t + 0.5e^t$ and the relative
error of Heun's approximation, are given in the table below. Note that the
relative error in this table was computed using nine decimals of precision
instead of the two shown in the table, for added detail.

Mesh point   Heun's method (V)   Real value (V)   Relative error (%)
V(0.20)      0.81                0.81             0.09
V(0.40)      1.14                1.15             0.15
V(0.60)      1.50                1.51             0.21
V(0.80)      1.90                1.91             0.27
V(1.00)      2.35                2.36             0.33

As with Euler's method, it can be seen that the relative error increases at
every step. However, the improved $O(h^3)$ error rate pays off, and even in the final
step the relative error is about one quarter that of the first step using Euler's
method. To further illustrate the improvement, the function is plotted in blue in the
figure below, along with Euler's approximation in red and Heun's approximation in
green. It can be seen visually that Euler's method diverges from the real
function quite quickly, while Heun's method continues to match the real
function quite closely over the entire interval.
[Figure: V(t) plotted against t, showing the real function, Euler's approximation, and Heun's approximation]
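The comparison made in Example 12.3 can be reproduced numerically by running both methods side by side on the same mesh. This is a minimal Python sketch under the example's stated conditions (h = 0.20, V(0) = 0.5); the helper names `euler_step` and `heun_step` are illustrative assumptions.

```python
import math

def f(t, v):
    """dV/dt = V(t) - t + 1, the circuit model of Examples 12.1 and 12.3."""
    return v - t + 1.0

def exact(t):
    """Closed-form solution V(t) = t + 0.5 * e^t."""
    return t + 0.5 * math.exp(t)

def euler_step(f, t, y, h):
    """One step of Euler's method, Eq. (12.7)."""
    return y + h * f(t, y)

def heun_step(f, t, y, h):
    """One step of Heun's method, Eq. (12.10)."""
    predictor = euler_step(f, t, y, h)  # Euler estimate of y(t + h)
    return y + h * (f(t, y) + f(t + h, predictor)) / 2

h, steps = 0.2, 5
v_euler = v_heun = 0.5
rows = []
for i in range(steps):
    t = i * h
    v_euler = euler_step(f, t, v_euler, h)
    v_heun = heun_step(f, t, v_heun, h)
    t_next = (i + 1) * h
    real = exact(t_next)
    rows.append((t_next, v_euler, v_heun, real))
    print(f"t = {t_next:.2f}: Euler {v_euler:.4f}, Heun {v_heun:.4f}, "
          f"real {real:.4f}, Heun error {abs(real - v_heun) / real:.2%}")
```

Each row shows Heun's estimate sitting closer to the real value than Euler's, which is the behavior described for the figure.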

12.4 Fourth-Order Runge–Kutta Method

To summarize the IVP methods learnt so far: following the derivative at $y(t_i)$ for one
step generates a poor estimation of the next point $y(t_{i+1})$, while predicting the
derivative at $y(t_{i+1})$ and following that for one step generates a poor estimation
with the opposite error. Following the first derivative is the idea behind Euler's
method, while Heun's method takes the average of both derivatives and cancels out
a lot of the errors, thus leading to a much better estimate of $y(t_{i+1})$. It is also known
that the error is proportional to h, the step size between two successive mesh points.
Taking these ideas together leads to an intuition for a new IVP method: perhaps
taking the average derivative at more than two points could lead to a more accurate
approximation of the behavior of the function and thus a more accurate computation
of $y(t_{i+1})$. And since smaller step sizes help, perhaps this average should
include the derivative estimated halfway between $t_i$ and $t_{i+1}$. These are the intuitions
that underlie the fourth-order Runge–Kutta method.
To lay down foundations for this method, begin by defining a point half a step
between two mesh points:
$$
t_{i+0.5} = t_i + \frac{h}{2} \tag{12.14}
$$

The Runge–Kutta method begins, like Euler's method and Heun's method, by
computing the derivative at the current point. That result will be labelled $K_0$:
$$
K_0 = f(t_i, y(t_i))
$$